Introduction

LIS AI Validation

Clinical laboratories process thousands of specimens every day. Before any result reaches a physician, it passes through autoverification — an automated review that checks whether the result is clinically plausible and safe to release.

This demo shows an AI agent performing autoverification on a batch of specimens. It is evaluated against two questions every clinical laboratory must answer before deploying AI:

Layer 1 (Decision quality): Did it make the right call on every specimen?

Layer 2 (Reasoning provenance): Can you prove it derived its rules from published clinical standards?

Most validation frameworks stop at Layer 1. This framework evaluates both, independently. The agent must not only get the right answers — it must get them for the right reasons.

The regulatory gap

CAP GEN.43875 requires validation “based on changes made.” Traditional LIS validation assumes deterministic, rule-based systems where changes are enumerable. AI agents break that assumption.

When you upgrade a model — GPT-4 to Claude Sonnet — the documented change is “improved reasoning.” What actually changed includes emergent capabilities no one specified. Threshold-only testing may pass. Workflow reasoning may have shifted in ways that only surface on edge cases the threshold never sees.

Workflow-level validation tests the full reasoning path, not individual outputs. That is what makes the scope of change auditable.

Three things this demo shows

The two problems the agent must catch

EDTA Contamination

EDTA is the anticoagulant in purple-top CBC tubes. When it contaminates a chemistry specimen it artificially raises potassium and depresses calcium. A physician seeing the result may treat aggressively for a condition the patient does not have.

The agent detects this using a patient-relative delta check: comparing the result against the patient's own prior values, not a population range.
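A patient-relative delta check for the EDTA signature can be sketched as below. The function name, analyte keys, and thresholds are illustrative assumptions, not the project's actual parameters; real thresholds would come from the Knowledge Graph.

```python
def edta_delta_check(current, prior, k_delta=1.0, ca_delta=1.0):
    """Flag the EDTA contamination signature: potassium rises while
    calcium falls, measured against the patient's OWN prior results.

    current, prior: dicts mapping analyte -> value.
    k_delta, ca_delta: minimum patient-relative shifts to flag
    (illustrative values, not KG-derived).
    """
    k_rise = current["K"] - prior["K"]     # potassium moving up
    ca_fall = prior["Ca"] - current["Ca"]  # calcium moving down
    return k_rise >= k_delta and ca_fall >= ca_delta

# Contaminated specimen: K jumps, Ca collapses relative to history.
edta_delta_check({"K": 7.8, "Ca": 5.1}, {"K": 4.1, "Ca": 9.4})  # True
# Stable patient: neither analyte moves meaningfully.
edta_delta_check({"K": 4.3, "Ca": 9.2}, {"K": 4.1, "Ca": 9.4})  # False
```

The key design point is that both deltas are computed against the same patient's history, so a result that sits inside the population reference range can still be flagged.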

Identity Swap

A swap occurs when two specimens are collected correctly but their tube labels are transposed. Each individual result looks plausible — the error only appears when you compare across specimens.

The agent detects this through pairwise comparison: for every pair of specimens, it checks whether swapping their patient assignments produces a better fit to both patients' prior histories.
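The pairwise comparison can be sketched as follows. The fit metric (sum of absolute deviations from prior values) and all names here are illustrative assumptions about how such a check might look, not the project's implementation.

```python
def fit_error(result, prior):
    """How far a result sits from a patient's prior values
    (sum of absolute deviations; illustrative metric)."""
    return sum(abs(result[a] - prior[a]) for a in result)

def swap_improves(spec_a, spec_b, prior_a, prior_b):
    """True if exchanging the two patient assignments fits
    BOTH patients' histories better than the labels as given."""
    as_labeled = fit_error(spec_a, prior_a) + fit_error(spec_b, prior_b)
    swapped = fit_error(spec_a, prior_b) + fit_error(spec_b, prior_a)
    return swapped < as_labeled

def find_swaps(specimens, priors):
    """Check every pair of specimens; return pairs whose labels
    were likely transposed."""
    ids = list(specimens)
    return [(i, j) for n, i in enumerate(ids) for j in ids[n + 1:]
            if swap_improves(specimens[i], specimens[j],
                             priors[i], priors[j])]
```

Note that each individual result can be plausible on its own; the signal only exists jointly, which is why the check must run over pairs rather than single specimens.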

How it works

Decision path: the AI Agent reads nodes from the Clinical Knowledge Graph (CLSI EP33) and emits workflow.json, a configuration of KG-derived parameters. The Decision Engine (triage.py) runs that configuration and issues a HOLD / RELEASE decision per specimen.

Independent check: the Provenance Verifier (provenance_verifier.py) takes the Knowledge Graph (CLSI EP33), workflow.json, and the agent's output, and issues the Layer 2 verdict: PASS / FAIL.

Repeatability + feedback loop

AI agent runs are non-deterministic — the same task, model, and prompt can produce different reasoning paths on each execution. LabInTrace makes the decision layer deterministic by anchoring it to the Knowledge Graph (KG). The agent's output is a workflow.json — a structured configuration derived from named KG nodes. The decision engine runs against that configuration. Same KG, same decisions.
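A minimal sketch of that separation, assuming a hypothetical workflow.json schema (the rule fields and KG node names below are invented for illustration): the engine reads only the configuration, so the same configuration always yields the same decisions.

```python
import json

# Hypothetical workflow.json content: each rule carries the parameter
# value AND the KG node it was derived from.
WORKFLOW_JSON = """{
  "rules": [
    {"analyte": "K", "max_delta": 1.0,
     "kg_node": "CLSI-EP33/delta-check/potassium"}
  ]
}"""

def triage(workflow, current, prior):
    """Deterministic decision step: HOLD if any configured delta
    rule fires against the patient's prior values, else RELEASE."""
    for rule in workflow["rules"]:
        a = rule["analyte"]
        if abs(current[a] - prior[a]) > rule["max_delta"]:
            return "HOLD"
    return "RELEASE"

workflow = json.loads(WORKFLOW_JSON)
triage(workflow, {"K": 7.8}, {"K": 4.1})  # "HOLD"
triage(workflow, {"K": 4.3}, {"K": 4.1})  # "RELEASE"
```

No model sits inside `triage`; the non-determinism is confined to the step that produced workflow.json, which is exactly the step Layer 2 audits.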

This is a validated methodology, not a guaranteed outcome. The framework verifies that the agent followed the standard — it does not guarantee the agent will always follow it. When it does not, Layer 2 catches the deviation.
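The shape of that Layer 2 check can be sketched like this. The schema and KG node names are illustrative assumptions: every parameter the agent emitted must cite a KG node, and the cited value must match what the Knowledge Graph actually says.

```python
# Toy Knowledge Graph fragment (node names are invented).
KG = {"CLSI-EP33/delta-check/potassium": {"max_delta": 1.0}}

def verify_provenance(workflow, kg):
    """Return (verdict, failures) for a workflow.json-style dict.
    A rule fails if it cites no KG node, or if its value drifted
    from the value the cited node actually holds."""
    failures = []
    for rule in workflow["rules"]:
        node = kg.get(rule.get("kg_node"))
        if node is None:
            failures.append(f"{rule['analyte']}: no KG node cited")
        elif node["max_delta"] != rule["max_delta"]:
            failures.append(f"{rule['analyte']}: value not KG-derived")
    return ("PASS" if not failures else "FAIL", failures)

ok = {"rules": [{"analyte": "K", "max_delta": 1.0,
                 "kg_node": "CLSI-EP33/delta-check/potassium"}]}
drift = {"rules": [{"analyte": "K", "max_delta": 1.5,
                    "kg_node": "CLSI-EP33/delta-check/potassium"}]}
verify_provenance(ok, KG)     # ("PASS", [])
verify_provenance(drift, KG)  # ("FAIL", [...])
```

A run can score perfectly on Layer 1 and still fail here, which is the point: decision quality and reasoning provenance are evaluated independently.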

Three independent runs — one benchmark

Most efficient · hJQzBJW · 4 steps · $0.08 · L2 FAIL (glucose weight not KG-derived)

Featured run · HsPAVBJ · 5 steps · $0.12 · L2 PASS

Most deliberate · Zo4iCGU · 6 steps · $0.15 · L2 PASS

Agent: GPT-5 via TerminalBench harness · Task: lis-swap-contamination-triage · 2026-02-21 · All runs Reward 1.0

Open source

The benchmark, triage engine, and provenance verifier are all published.

Project home: lisaivalidation.dev

View on GitHub →