Validation Runs

Terminal Bench benchmark results — GPT-5 · terminus-2 agent harness

JOB 2026-02-21

lis-swap-contamination-triage · GPT-5 · terminus-2

3 runs · $0.3562 total

1 of 3 runs failed Layer 2 audit

RUN ID   STEPS  COST   LAYER 1  LAYER 2  RECORDING
hJQzBJW  4      $0.08  PASS     FAIL
HsPAVBJ  5      $0.12  PASS     PASS     yes
Zo4iCGU  6      $0.15  PASS     PASS