Public preprint on SSRN · Submitted to IEEE CIFEr 2026 · OxML 2026 Poster Challenge
When frontier LLMs disagree on agent intent drift — and what a multi-engine architecture does about it.
Problem
Customer policies are written at category level. Agent execution touches specific instances. This gap is where drift hides.
Finding 1
150 naturalistic agent traces across three enterprise departments. GPT-5.5 and Claude Opus 4.7 labeled each independently. They reached opposite conclusions on 45 of 150.
N = 50 per department · overall disagreement 45/150 = 30%
The strongest models reach opposite conclusions on the boundary cases. A stronger LLM does not remove the need for independent runtime monitoring.
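The 30% figure is a plain disagreement rate between two independent labelers. A minimal sketch of that calculation, assuming binary `drift` / `no-drift` labels; the label lists are synthetic stand-ins constructed to reproduce the 45-of-150 count, not the study's data:

```python
def disagreement_rate(labels_a, labels_b):
    """Fraction of traces on which two judges assign different labels."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    if not labels_a:
        raise ValueError("need at least one trace")
    disagreements = sum(a != b for a, b in zip(labels_a, labels_b))
    return disagreements / len(labels_a)

# Synthetic stand-in: 150 traces, the judges differ on the first 45.
judge_a = ["drift"] * 45 + ["no-drift"] * 105
judge_b = ["no-drift"] * 45 + ["no-drift"] * 105

rate = disagreement_rate(judge_a, judge_b)
print(f"disagreement rate: {rate:.0%}")
```

A raw rate like this does not correct for chance agreement; a chance-corrected statistic would be a separate analysis.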
Finding 2
Same architecture, same engines, same threshold. The no-drift score distribution shifts depending on how the IAO is specified:
[Figure: median DriftScore by IAO specification · higher = more confident no-drift]
Real enterprise IAOs are category-level. When a category-level intent specification meets instance-level execution, the no-drift distribution compresses toward the detection boundary.
This supports the same structural diagnosis as Finding 1, rather than reducing the result to a single judge-model artifact.
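One way to read "compresses toward the detection boundary" is as a shrinking margin between the median no-drift score and the threshold. The sketch below illustrates that reading only: the threshold value, the function name `margin_to_boundary`, and both score lists are hypothetical and synthetic, not values from the study.

```python
from statistics import median

def margin_to_boundary(scores, threshold):
    """Median no-drift score minus the detection threshold.

    A smaller positive margin means clean traces sit closer to the alarm.
    """
    return median(scores) - threshold

THRESHOLD = 0.5  # assumed detection boundary, for illustration only

# Synthetic no-drift scores (higher = more confident no-drift).
instance_level_iao = [0.92, 0.88, 0.95, 0.90, 0.85]
category_level_iao = [0.58, 0.62, 0.55, 0.66, 0.60]

wide = margin_to_boundary(instance_level_iao, THRESHOLD)
narrow = margin_to_boundary(category_level_iao, THRESHOLD)
print(f"instance-level margin: {wide:.2f}, category-level margin: {narrow:.2f}")
```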
Validation · Headline result
1,000 agent execution traces · three enterprise departments
| System | F1 | Precision | Recall | False positives (FP) |
|---|---|---|---|---|
| Intent Checkpoint | 93.9% | 95.9% | 92.1% | 22 |
| GPT-5.5 LLM-as-judge | 88.4% | 79.5% | 99.6% | 143 |
| GPT-4o LLM-as-judge | 88.8% | 79.9% | 100.0% | 140 |
6.5× fewer false alarms · CI [4.4×, 10.8×]
16.4 pp precision advantage · p < 10⁻²¹
100% precision on IT and Sales · 0 FP on injected
<300 ms latency vs 4.7 s for GPT-5.5
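The headline numbers can be cross-checked from the table itself. This sketch recomputes F1 as the harmonic mean of precision and recall, the false-alarm ratio, and the precision gap, using only the rounded values printed in the table. Because the inputs are rounded, the recomputed F1 values agree with the reported ones only to within about 0.1 pp.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, both given as fractions."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, false positives) as printed in the results table.
rows = {
    "Intent Checkpoint": (0.959, 0.921, 22),
    "GPT-5.5 LLM-as-judge": (0.795, 0.996, 143),
    "GPT-4o LLM-as-judge": (0.799, 1.000, 140),
}

f1 = {name: f1_score(p, r) for name, (p, r, _) in rows.items()}

checkpoint = rows["Intent Checkpoint"]
gpt55 = rows["GPT-5.5 LLM-as-judge"]
fp_ratio = gpt55[2] / checkpoint[2]                  # false-alarm ratio
precision_gap_pp = (checkpoint[0] - gpt55[0]) * 100  # percentage points

for name, value in f1.items():
    print(f"{name}: F1 = {value:.1%}")
print(f"FP ratio = {fp_ratio:.1f}x, precision gap = {precision_gap_pp:.1f} pp")
```

The confidence interval and p-value on the tiles depend on the underlying per-trace outcomes and cannot be recovered from these summary figures.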
System · Intent Checkpoint architecture
Four parallel engines · two-layer adjudicator · drift detection structurally separated from response priority
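The poster does not detail the engines or the adjudication rule, so the following is only a structural sketch of the shape described: several engines scored in parallel, a first layer that decides drift, and a second layer that sets response priority without touching the detection result. Every engine, field name, the median rule, and the threshold here are hypothetical placeholders, not the Intent Checkpoint implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from statistics import median

# Toy engines (hypothetical): each returns a no-drift confidence in [0, 1].
def action_engine(trace):
    return 1.0 if trace["action"] in trace["allowed_actions"] else 0.0

def target_engine(trace):
    return 1.0 if trace["target"] in trace["allowed_targets"] else 0.0

def volume_engine(trace):
    return 1.0 if trace["records_touched"] <= trace["record_limit"] else 0.2

def neutral_engine(trace):
    return 0.5  # stand-in for an uninformative engine

ENGINES = [action_engine, target_engine, volume_engine, neutral_engine]
THRESHOLD = 0.5  # assumed detection boundary

@dataclass(frozen=True)
class Verdict:
    drift: bool
    score: float
    priority: str

def run_engines(trace, engines):
    """Score the trace with all engines concurrently."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        return list(pool.map(lambda engine: engine(trace), engines))

def detect_drift(scores, threshold=THRESHOLD):
    """Layer 1: detection only (assumed rule: median score below threshold)."""
    return median(scores) < threshold

def response_priority(drift, trace):
    """Layer 2: priority only; it never alters the detection result."""
    if not drift:
        return "none"
    return "high" if trace["sensitive"] else "low"

def adjudicate(trace, engines=ENGINES):
    scores = run_engines(trace, engines)
    drift = detect_drift(scores)
    return Verdict(drift, median(scores), response_priority(drift, trace))

base = {
    "allowed_actions": {"read"}, "allowed_targets": {"crm"},
    "record_limit": 100, "sensitive": True,
}
in_scope = adjudicate({**base, "action": "read", "target": "crm", "records_touched": 10})
off_scope = adjudicate({**base, "action": "export", "target": "payroll", "records_touched": 5000})
print(in_scope)
print(off_scope)
```

Keeping `detect_drift` and `response_priority` as separate functions mirrors the stated separation: changing how urgently a drift is handled cannot change whether it is flagged.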
Convergence
Even the strongest LLM judges can't agree.
The cleanest traces drift toward the alarm.
Even the best architecture stumbles on the same boundary.
Why it matters