Public preprint on SSRN · Submitted to IEEE CIFEr 2026 · OxML 2026 Poster Challenge

Evidence from enterprise AI-agent monitoring across 1,000 traces

When frontier LLMs disagree on agent intent drift — and what a multi-engine architecture does about it.

Instinctive Network
Intent Checkpoint

Humans approve the mission. Agents decide the moves.

Customer policies are written at category level. Agent execution touches specific instances. This gap is where drift hides.

Frontier LLMs disagree on 30% of cases

150 naturalistic agent traces across three enterprise departments. GPT-5.5 and Claude Opus 4.7 labeled each independently. They reached opposite conclusions on 45 of 150.

Per-department disagreement

Finance 38%
Sales 28%
IT 24%

N = 50 per department · overall 30%
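
The per-department rates and the overall figure can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the `disagreement_rate` helper is hypothetical, and the per-department counts (19, 14, 12) are not reported directly; they are the counts implied by the stated percentages at N = 50.

```python
from fractions import Fraction

def disagreement_rate(labels_a, labels_b):
    """Share of traces on which two judges gave different drift labels."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both judges must label the same traces")
    differing = sum(a != b for a, b in zip(labels_a, labels_b))
    return Fraction(differing, len(labels_a))

# Assumption: counts implied by the reported rates (38%, 28%, 24% of 50).
implied_disagreements = {"Finance": 19, "Sales": 14, "IT": 12}
N_PER_DEPARTMENT = 50

per_department = {
    dept: Fraction(count, N_PER_DEPARTMENT)
    for dept, count in implied_disagreements.items()
}
total = sum(implied_disagreements.values())
overall = Fraction(total, N_PER_DEPARTMENT * len(implied_disagreements))

for dept, rate in per_department.items():
    print(f"{dept}: {float(rate):.0%}")
print(f"overall: {total} of 150 = {float(overall):.0%}")
```

The implied counts sum to 45, which matches the 45 of 150 opposite conclusions reported above.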

The strongest models reach opposite conclusions on the boundary cases. A stronger LLM does not remove the need for independent runtime monitoring.

Category-level IAO pushes scores toward the boundary

Same architecture, same engines, same threshold. The no-drift score distribution shifts depending on how the Intent Anchor Object (IAO) is specified:

Instance-level IAO (injected) 0.677
Category-level IAO (naturalistic) 0.515

DriftScore median · higher = more confident no-drift

Real enterprise IAOs are category-level. When intent specification meets instance-level execution, the no-drift distribution compresses toward the detection boundary.

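This points to the same structural diagnosis as Finding 1 (frontier judges disagree on 30% of cases): the difficulty sits in the gap between category-level intent and instance-level execution, not in an artifact of any single judge model.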
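
To make the category/instance gap concrete, here is a minimal sketch. It is not the Intent Checkpoint scoring method: the `IntentAnchor` and `AgentAction` types, the field names, the example strings, and the `in_declared_scope` helper are all hypothetical, since the IAO schema is not given here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IntentAnchor:
    """Hypothetical stand-in for an Intent Anchor Object (IAO)."""
    mission: str
    allowed: frozenset  # specific instances or whole categories

@dataclass(frozen=True)
class AgentAction:
    category: str  # e.g. "vendor_payment"
    instance: str  # e.g. "vendor_payment:ACME:INV-1042"

def in_declared_scope(iao: IntentAnchor, action: AgentAction) -> bool:
    """True if the action matches the anchor by exact instance or by category."""
    return action.instance in iao.allowed or action.category in iao.allowed

instance_level = IntentAnchor(
    mission="Pay invoice INV-1042 to ACME",
    allowed=frozenset({"vendor_payment:ACME:INV-1042"}),
)
category_level = IntentAnchor(
    mission="Process approved vendor payments",
    allowed=frozenset({"vendor_payment"}),
)

expected = AgentAction("vendor_payment", "vendor_payment:ACME:INV-1042")
unexpected = AgentAction("vendor_payment", "vendor_payment:NEWCO:INV-9999")

# Instance-level anchor: the two actions are separated.
print(in_declared_scope(instance_level, expected),
      in_declared_scope(instance_level, unexpected))
# Category-level anchor: both actions fall inside the declared category.
print(in_declared_scope(category_level, expected),
      in_declared_scope(category_level, unexpected))
```

Under the category-level anchor a plain membership check cannot tell the two actions apart, so any separation has to come from softer signals. That is one plausible reading of why no-drift scores sit closer to the boundary in the naturalistic setting.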

Validation · Headline result

IDV-1000 benchmark

1,000 agent execution traces · three enterprise departments

System                  F1      Precision   Recall   FP (false positives)
Intent Checkpoint       93.9%   95.9%       92.1%    22
GPT-5.5 LLM-as-judge    88.4%   79.5%       99.6%    143
GPT-4o LLM-as-judge     88.8%   79.9%       100.0%   140
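
As a sanity check on the table, F1 can be recomputed as the harmonic mean of the reported precision and recall. This is a minimal sketch; the `f1_score` helper is written here for illustration, and small differences from the reported F1 are expected because the inputs are already rounded to one decimal place.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) as fractions, copied from the table.
reported = {
    "Intent Checkpoint": (0.959, 0.921, 0.939),
    "GPT-5.5 LLM-as-judge": (0.795, 0.996, 0.884),
    "GPT-4o LLM-as-judge": (0.799, 1.000, 0.888),
}

for system, (precision, recall, f1_reported) in reported.items():
    f1 = f1_score(precision, recall)
    print(f"{system}: F1 from P/R = {f1:.4f}, reported = {f1_reported:.3f}")
```

All three recomputed values agree with the reported F1 to within rounding.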

6.5× fewer false alarms · CI [4.4×, 10.8×]
16.4 pp precision advantage · p < 10⁻²¹
100% precision on IT and Sales · 0 FP on injected
<300 ms latency vs 4.7 s for GPT-5.5
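
The two headline ratios follow directly from the benchmark table. The sketch below recomputes them against each LLM judge; the `compare_to` helper and the shortened system labels are illustrative. The confidence interval and p-value need per-trace data and are not reproduced here.

```python
# False-positive counts and precision (in %) as reported in the benchmark table.
false_positives = {"Intent Checkpoint": 22, "GPT-5.5 judge": 143, "GPT-4o judge": 140}
precision_pct = {"Intent Checkpoint": 95.9, "GPT-5.5 judge": 79.5, "GPT-4o judge": 79.9}

def compare_to(baseline: str, system: str = "Intent Checkpoint"):
    """Return (false-alarm ratio, precision gap in percentage points)."""
    fp_ratio = false_positives[baseline] / false_positives[system]
    precision_gap = precision_pct[system] - precision_pct[baseline]
    return fp_ratio, precision_gap

for baseline in ("GPT-5.5 judge", "GPT-4o judge"):
    ratio, gap = compare_to(baseline)
    print(f"vs {baseline}: {ratio:.1f}x fewer false alarms, {gap:.1f} pp precision gap")
```

Only the GPT-5.5 comparison gives 6.5× and 16.4 pp, so the headline figures appear to be stated relative to the GPT-5.5 judge.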

  • AUROC 98.7% on injected · 92.3% on naturalistic high-consensus subset
  • LLM judges prioritize recall over precision regardless of prompting. They catch nearly all drift (recall 99.6% to 100%) but at roughly 80% precision, so reviewers drown in false alarms.
  • All 22 of Intent Checkpoint's residual false positives concentrate in Finance, the same boundary region where the judge disagreement in Finding 1 is highest (38%).

[Figure: Intent Checkpoint architecture. Human intent flows into a customer-specific Intent Anchor Object, which is consumed by four parallel engines whose outputs feed a two-layer adjudicator.]
Inputs: Human intent → Intent Anchor Object · Agent execution
Engines: Authority · Coherence · Scope · Impact
Layer 1 (drift detection): Authority + Coherence + Scope
Layer 2 (response): calibrated by impact

Four parallel engines · two-layer adjudicator · drift detection structurally separated from response priority
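
The separation between detection and response can be sketched as follows. This is a sketch under stated assumptions, not the Intent Checkpoint implementation: the engine bodies are placeholders that read precomputed scores from the trace, the `min` combination rule, the `DRIFT_THRESHOLD` and `HIGH_IMPACT` values, and the response labels are all hypothetical. Only the shape is taken from the description above: four engines run in parallel, Layer 1 uses Authority, Coherence, and Scope, and Layer 2 uses impact to set the response.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

DRIFT_THRESHOLD = 0.5  # hypothetical; the actual threshold is not stated
HIGH_IMPACT = 0.7      # hypothetical cut-off for escalating the response

@dataclass(frozen=True)
class Verdict:
    drift_score: float  # higher = more confident no-drift
    drifted: bool
    response: str

# Placeholder engines: real engines would compare the trace against the IAO.
def authority_engine(iao, trace): return trace["authority"]
def coherence_engine(iao, trace): return trace["coherence"]
def scope_engine(iao, trace): return trace["scope"]
def impact_engine(iao, trace): return trace["impact"]

ENGINES = {
    "authority": authority_engine,
    "coherence": coherence_engine,
    "scope": scope_engine,
    "impact": impact_engine,
}

def layer1_detect(scores):
    """Layer 1: drift detection from Authority, Coherence, and Scope only."""
    # Assumption: the weakest of the three signals decides the score.
    drift_score = min(scores["authority"], scores["coherence"], scores["scope"])
    return drift_score, drift_score < DRIFT_THRESHOLD

def layer2_respond(drifted, impact):
    """Layer 2: pick a response priority; never alters the detection result."""
    if not drifted:
        return "allow"
    return "block" if impact >= HIGH_IMPACT else "review"

def adjudicate(iao, trace):
    """Run the four engines in parallel, then apply the two layers in order."""
    with ThreadPoolExecutor(max_workers=len(ENGINES)) as pool:
        futures = {name: pool.submit(fn, iao, trace) for name, fn in ENGINES.items()}
        scores = {name: future.result() for name, future in futures.items()}
    drift_score, drifted = layer1_detect(scores)
    return Verdict(drift_score, drifted, layer2_respond(drifted, scores["impact"]))

trace = {"authority": 0.9, "coherence": 0.8, "scope": 0.3, "impact": 0.9}
print(adjudicate(iao=None, trace=trace))
print(adjudicate(iao=None, trace={**trace, "impact": 0.1}))
```

In this sketch, changing only the impact score changes the response from block to review while the drift verdict stays the same, which is the invariant the separation is meant to guarantee.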

Three observations. One story.

Even the strongest LLM judges can't agree.

The cleanest traces drift toward the alarm.

Even the best architecture stumbles on the same boundary.

The Minority Report problem is real. It's just running inside your AI agents now.

Operational implications

  • For teams running agents in production — a quieter, more trustable signal. 6.5× fewer false alarms means reviewers act on real drift, not noise.
  • For regulated industries — drift detection is structurally separated from impact severity. The architectural invariant is what makes the system defensible to compliance review.
  • For AI safety — independent runtime monitoring external to the agent's own LLM is now a measurable architectural requirement, not a posture.