Validation Runs
Terminal Bench benchmark results — GPT-5 · terminus-2 agent harness
JOB · 2026-02-21
lis-swap-contamination-triage · GPT-5 · terminus-2
3 runs · $0.3562 total
1 of 3 runs failed Layer 2 audit
| RUN ID | STEPS | COST | LAYER 1 | LAYER 2 | RECORDING |
|---|---|---|---|---|---|
| hJQzBJW | 4 | $0.08 | PASS | FAIL | — |
| HsPAVBJ★ | 5 | $0.12 | PASS | PASS | ◉ yes |
| Zo4iCGU | 6 | $0.15 | PASS | PASS | — |