LIS AI Validation
Introduction
Clinical laboratories process thousands of specimens every day. Before any result reaches a physician, it passes through autoverification — an automated review that checks whether the result is clinically plausible and safe to release.
This demo shows an AI agent performing autoverification on a batch of specimens. It is evaluated against two questions every clinical laboratory must answer before deploying AI:
Layer 1 (decision quality): Did it make the right call on every specimen?
Layer 2 (reasoning provenance): Can you prove it derived its rules from published clinical standards?
Most validation frameworks stop at Layer 1. This framework evaluates both layers independently: the agent must not only get the right answers, it must get them for the right reasons.
The regulatory gap
CAP GEN.43875 requires validation “based on changes made.” Traditional LIS validation assumes deterministic, rule-based systems where changes are enumerable. AI agents break that assumption.
When you upgrade a model — GPT-4 to Claude Sonnet — the documented change is “improved reasoning.” What actually changed includes emergent capabilities no one specified. Threshold-only testing may pass. Workflow reasoning may have shifted in ways that only surface on edge cases the threshold never sees.
Workflow-level validation tests the full reasoning path, not individual outputs. That is what makes the scope of change auditable.
Three things this demo shows
The two problems the agent must catch
EDTA Contamination
EDTA is the anticoagulant in purple-top CBC tubes. When it contaminates a chemistry specimen, it artificially raises potassium and depresses calcium. A physician seeing the result may treat aggressively for a condition the patient does not have.
The agent detects this using a patient-relative delta check: comparing the result against the patient's own prior values, not a population range.
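A patient-relative delta check can be sketched in a few lines. The threshold, field names, and example values below are illustrative only, not the demo's actual parameters:

```python
# Minimal sketch of a patient-relative delta check.
# max_delta is a hypothetical per-analyte limit; real limits come from the KG.

def delta_check(current: float, prior: float, max_delta: float) -> str:
    """HOLD a result whose change from the patient's own prior exceeds max_delta."""
    if abs(current - prior) > max_delta:
        return "HOLD"    # implausible jump against this patient's baseline
    return "RELEASE"

# EDTA contamination spikes potassium relative to the patient's own history,
# even if the absolute value sits near the edge of a population range.
print(delta_check(current=7.8, prior=4.1, max_delta=1.0))  # -> HOLD
print(delta_check(current=4.3, prior=4.1, max_delta=1.0))  # -> RELEASE
```

The key design point is the reference: the comparison is to the patient's own prior, so a result that would pass a population-range check can still be held.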
Identity Swap
A swap occurs when two specimens are collected correctly but their tube labels are transposed. Each individual result looks plausible — the error only appears when you compare across specimens.
The agent detects this through pairwise comparison: for every pair of specimens, it checks whether swapping their patient assignments produces a better fit to both patients' prior histories.
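The pairwise comparison can be sketched as follows. Measuring "fit" as absolute distance from each patient's prior value is an assumption for illustration; the demo's actual fit metric may differ:

```python
# Illustrative pairwise swap check. "Fit" here is total absolute distance
# from each patient's own prior value (a simplifying assumption).

def misfit(result: float, prior: float) -> float:
    return abs(result - prior)

def swap_improves(res_a: float, prior_a: float,
                  res_b: float, prior_b: float) -> bool:
    """True if assigning each result to the *other* patient fits both histories better."""
    as_labeled = misfit(res_a, prior_a) + misfit(res_b, prior_b)
    swapped    = misfit(res_a, prior_b) + misfit(res_b, prior_a)
    return swapped < as_labeled

# Each result is individually plausible, but A's result matches B's history
# and vice versa, so exchanging the labels fits both patients better.
print(swap_improves(res_a=9.2, prior_a=4.0, res_b=4.1, prior_b=9.0))  # -> True
```

Run over every pair in the batch, this is the check that surfaces errors no single-specimen rule can see.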
How it works
Layer 1 (decision quality):
Clinical Knowledge Graph (CLSI EP33) → AI Agent (reads KG nodes) → workflow.json (KG-derived params) → Decision Engine (triage.py) → HOLD / RELEASE per specimen

Layer 2 (reasoning provenance):
Knowledge Graph (CLSI EP33) + workflow.json (agent output) → Provenance Verifier (provenance_verifier.py) → Layer 2 verdict: PASS / FAIL
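The Layer 1 path can be sketched as a deterministic engine that reads KG-derived parameters from workflow.json and emits HOLD or RELEASE per specimen. The JSON schema, field names, and threshold below are hypothetical, not the demo's actual configuration format:

```python
import json

# Hypothetical workflow.json fragment: each parameter names the KG node
# it was derived from, so decisions are traceable to the standard.
workflow = json.loads("""
{
  "delta_checks": [
    {"analyte": "potassium", "max_delta": 1.0, "kg_node": "CLSI_EP33.delta.K"}
  ]
}
""")

def triage(specimen: dict, workflow: dict) -> str:
    """Deterministic HOLD/RELEASE: the same config always yields the same decision."""
    for rule in workflow["delta_checks"]:
        analyte = rule["analyte"]
        if analyte in specimen:
            delta = abs(specimen[analyte]["value"] - specimen[analyte]["prior"])
            if delta > rule["max_delta"]:
                return "HOLD"
    return "RELEASE"

print(triage({"potassium": {"value": 7.8, "prior": 4.1}}, workflow))  # -> HOLD
```

Because the engine consumes only the configuration, the non-determinism of the agent is confined to producing workflow.json; the decision step itself is replayable.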
Repeatability + feedback loop
AI agent runs are non-deterministic — the same task, model, and prompt can produce different reasoning paths on each execution. LabInTrace makes the decision layer deterministic by anchoring it to the Knowledge Graph (KG). The agent's output is a workflow.json — a structured configuration derived from named KG nodes. The decision engine runs against that configuration. Same KG, same decisions.
This is a validated methodology, not a guaranteed outcome. The framework verifies that the agent followed the standard — it does not guarantee the agent will always follow it. When it does not, Layer 2 catches the deviation.
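A minimal sketch of what that Layer 2 catch might look like: every parameter the agent emitted must cite a KG node that exists and must match that node's published value. The node names and schema here are invented for illustration; the checks in provenance_verifier.py may differ:

```python
# Hypothetical Layer 2 sketch: PASS only if every workflow parameter
# traces to a named KG node and agrees with the node's published value.

KG = {"CLSI_EP33.delta.K": {"max_delta": 1.0}}  # toy knowledge graph

def verify_provenance(workflow_rules: list, kg: dict) -> str:
    for rule in workflow_rules:
        node = kg.get(rule.get("kg_node"))
        if node is None or node["max_delta"] != rule["max_delta"]:
            return "FAIL"    # parameter was not derived from the standard
    return "PASS"

rules = [{"analyte": "potassium", "max_delta": 1.0, "kg_node": "CLSI_EP33.delta.K"}]
print(verify_provenance(rules, KG))  # -> PASS
```

A rule citing a nonexistent node, or one whose threshold drifts from the standard's value, fails the run even if every triage decision happened to be correct.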
Three independent runs — one benchmark
- Most efficient: hJQzBJW
- Featured run: HsPAVBJ ★
- Most deliberate: Zo4iCGU
Agent: GPT-5 via TerminalBench harness · Task: lis-swap-contamination-triage · 2026-02-21 · All runs Reward 1.0
Open source
The benchmark, triage engine, and provenance verifier are all published.
Project home: lisaivalidation.dev