Public preprint on SSRN · Submitted to IEEE CIFEr 2026

Evidence from enterprise AI-agent monitoring across 1,000 traces

When frontier LLMs disagree on agent intent drift — and what a multi-engine architecture does about it.

Instinctive Network
Intent Checkpoint

Problem

Humans approve the mission. Agents decide the moves.

Customer policies are written at category level. Agent execution touches specific instances. This gap is where drift hides.

Finding 1

Frontier LLMs disagree on 30% of cases

150 naturalistic agent traces across three enterprise departments. GPT-5.5 and Claude Opus 4.7 labeled each independently. They reached opposite conclusions on 45 of 150.

Per-department disagreement

Finance 38%

Sales 28%

IT 24%

N = 50 per department · overall 30%
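
The overall figure can be cross-checked against the per-department rates. The sketch below is only arithmetic on the numbers reported above: the per-department counts are derived as rate × 50 rather than quoted from the paper, and the "IT" label for the 24% department is inferred from the departments named elsewhere on this page.

```python
# Per-department disagreement rates as reported; N = 50 traces each.
# Assumption: the third department is IT (inferred, not stated next to the 24% figure).
TRACES_PER_DEPARTMENT = 50
reported_rate = {"Finance": 0.38, "Sales": 0.28, "IT": 0.24}

# Derived counts (not quoted from the paper): traces where the two judges disagreed.
disagreements = {
    dept: round(rate * TRACES_PER_DEPARTMENT) for dept, rate in reported_rate.items()
}

total_disagreements = sum(disagreements.values())
total_traces = TRACES_PER_DEPARTMENT * len(reported_rate)
overall_rate = total_disagreements / total_traces

print(disagreements)  # {'Finance': 19, 'Sales': 14, 'IT': 12}
print(f"{total_disagreements} of {total_traces} = {overall_rate:.0%}")  # 45 of 150 = 30%
```

The derived total matches the 45 of 150 stated in Finding 1.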

The strongest models reach opposite conclusions on the boundary cases. A stronger LLM does not remove the need for independent runtime monitoring.

Finding 2

Category-level IAO pushes scores toward the boundary

Same architecture, same engines, same threshold. The no-drift score distribution shifts depending on how the IAO is specified:

Instance-level IAO (injected) 0.677

Category-level IAO (naturalistic) 0.515

DriftScore median · higher = more confident no-drift

Real enterprise IAOs are category-level. When intent specification meets instance-level execution, the no-drift distribution compresses toward the detection boundary.

This points to the same structural diagnosis as Finding 1: the result is not an artifact of a single judge model.

Validation · Headline result

IDV-1000 benchmark

1,000 agent execution traces · three enterprise departments

System	F1	Precision	Recall	FP
Intent Checkpoint	93.9%	95.9%	92.1%	22
GPT-5.5 LLM-as-judge	88.4%	79.5%	99.6%	143
GPT-4o LLM-as-judge	88.8%	79.9%	100.0%	140

6.5× fewer false alarms · CI [4.4×, 10.8×]

16.4 pp precision advantage · p < 10⁻²¹

100% precision on IT and Sales · 0 FP on injected

<300 ms latency vs 4.7 s for GPT-5.5

AUROC 98.7% on injected · 92.3% on naturalistic high-consensus subset
LLM judges prioritize recall over precision regardless of prompting. Both reach at least 99.6% recall, but roughly one in five of their alerts is a false alarm (143 and 140 FPs), so human reviewers drown in noise.
All 22 of Intent Checkpoint's residual false positives concentrate in Finance — the same boundary region where Finding 1's divergence occurs.
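
The headline numbers can be re-derived from the table. The sketch below uses only the reported figures and standard definitions (F1 as the harmonic mean of precision and recall). It assumes the 6.5× and 16.4 pp figures compare Intent Checkpoint against the GPT-5.5 judge, which is what the arithmetic suggests. The confidence interval and p-value need per-trace data and are not reproduced here.

```python
# Figures as reported in the IDV-1000 table (percentages and false-positive counts).
reported = {
    "Intent Checkpoint":    {"f1": 93.9, "precision": 95.9, "recall": 92.1,  "fp": 22},
    "GPT-5.5 LLM-as-judge": {"f1": 88.4, "precision": 79.5, "recall": 99.6,  "fp": 143},
    "GPT-4o LLM-as-judge":  {"f1": 88.8, "precision": 79.9, "recall": 100.0, "fp": 140},
}


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in and out)."""
    return 2 * precision * recall / (precision + recall)


# F1 recomputed from rounded inputs can differ slightly from the reported
# value, so record the absolute gap instead of testing for equality.
f1_gap = {
    name: abs(f1_from(row["precision"], row["recall"]) - row["f1"])
    for name, row in reported.items()
}

ic = reported["Intent Checkpoint"]
gpt55 = reported["GPT-5.5 LLM-as-judge"]
fp_ratio = gpt55["fp"] / ic["fp"]                         # false-alarm ratio
precision_gap_pp = ic["precision"] - gpt55["precision"]   # percentage points

print(f"FP ratio: {fp_ratio:.1f}x, precision gap: {precision_gap_pp:.1f} pp")
# FP ratio: 6.5x, precision gap: 16.4 pp
```

All three recomputed F1 values land within 0.1 pp of the reported ones, which is consistent with rounding in the table.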

System · Intent Checkpoint architecture

Four parallel engines · two-layer adjudicator · drift detection structurally separated from response priority
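
A minimal sketch of that shape, under loud assumptions: this page does not describe the engines' internals, the aggregation rule, the threshold, or the priority labels, so everything named below (`Engine`, `detect`, `prioritize`, median aggregation, the 0.5 threshold) is hypothetical and chosen only to illustrate the separation. The one property taken from the text is structural: the detection layer never sees impact, and the priority layer never changes the drift decision.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from statistics import median
from typing import Callable, Sequence

# Hypothetical engine interface: (intent, action) -> score in [0, 1],
# where higher means more confident no-drift.
Engine = Callable[[str, str], float]


@dataclass(frozen=True)
class Verdict:
    drift: bool         # layer 1 output: detection only
    drift_score: float  # aggregated engine score
    priority: str       # layer 2 output: response priority


def run_engines(engines: Sequence[Engine], intent: str, action: str) -> list[float]:
    """Run all engines concurrently and collect their scores in order."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = [pool.submit(engine, intent, action) for engine in engines]
        return [future.result() for future in futures]


def detect(scores: Sequence[float], threshold: float = 0.5) -> tuple[bool, float]:
    """Layer 1 (assumed rule): flag drift when the median score is below threshold."""
    score = median(scores)
    return score < threshold, score


def prioritize(drift: bool, impact: str) -> str:
    """Layer 2 (assumed labels): impact affects priority, never the drift decision."""
    if not drift:
        return "none"
    return "urgent" if impact == "high" else "routine"


def check(engines: Sequence[Engine], intent: str, action: str,
          impact: str, threshold: float = 0.5) -> Verdict:
    drift, score = detect(run_engines(engines, intent, action), threshold)
    return Verdict(drift, score, prioritize(drift, impact))


# Stub engines standing in for the four real ones.
stub_engines = [lambda i, a, s=s: s for s in (0.2, 0.3, 0.4, 0.9)]
high = check(stub_engines, "pay approved vendors", "pay new vendor", impact="high")
low = check(stub_engines, "pay approved vendors", "pay new vendor", impact="low")
print(high.drift == low.drift, high.priority, low.priority)
```

Because `detect` takes only scores, changing the impact level can raise or lower the response priority but cannot create or suppress a drift flag.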

Convergence

Three observations. One story.

Even the strongest LLM judges can't agree.

The cleanest traces drift toward the alarm.

Even the best architecture stumbles on the same boundary.

The Minority Report problem is real. It's just running inside your AI agents now.

Why it matters

Operational implications

For teams running agents in production: a quieter, more trustworthy signal. With 6.5× fewer false alarms, reviewers act on real drift, not noise.
For regulated industries: drift detection is structurally separated from impact severity. This architectural invariant is what makes the system defensible in compliance review.
For AI safety: independent runtime monitoring, external to the agent's own LLM, is now a measurable architectural requirement, not a posture.