Frontier AI Models Fail Basic Enterprise IT Tasks: ITBench-AA Benchmark Shows 47% Peak Score in 2026
Key Findings
- Claude Opus 4.7 leads at only 47%: it is the best-performing frontier model on ITBench-AA’s Site Reliability Engineering tasks, followed by GPT-5.5 at 46% and Qwen3.7 Max at 42%. No frontier model breaks 50%.
- More investigation doesn’t equal better diagnosis: turn counts vary 3x across models, but longer trajectories correlate with lower accuracy. GPT-5.5 averages 31 turns at 46%, while Gemini 3.1 Pro Preview uses 83 turns for only 30% accuracy. Over-investigation surfaces false positives.
- Open-source models compete on the cost frontier: GLM-5.1 (Reasoning) matches Gemini 3.5 Flash at 40% while costing less per task ($1.23 vs $1.70). Gemma 4 31B scores 37% at just $0.14 per task, outperforming the more expensive Gemini 3.1 Pro Preview.
- This benchmark isn’t saturated: ITBench-AA SRE represents one of the hardest agentic evaluations currently available, with performance far below other agent benchmarks where frontier models routinely exceed 70%.
- Precision matters more than recall: models must identify the exact minimal set of root-cause entities. Submitting additional contributing factors or upstream mechanisms counts as false positives, even when the true root cause is included.
Why It Matters
The gap between AI hype and enterprise reality just got quantifiable. While frontier models dominate leaderboards on academic benchmarks and developer tools, ITBench-AA reveals they struggle with the kind of work enterprises actually need: diagnosing production incidents in complex distributed systems.
This matters because enterprise IT operations represent a massive automation opportunity. Site reliability engineers spend significant time on incident response, analyzing logs, tracing dependencies, and identifying root causes across microservices architectures. The promise of AI agents handling tier-1 diagnostics has been held out for years. ITBench-AA shows we’re not there yet: even the best models score below 50%.
The benchmark also exposes a critical weakness in current model evaluation. Most agentic benchmarks test coding ability, web navigation, or synthetic tasks. ITBench-AA evaluates models on enterprise-specific knowledge domains with structured outputs and high precision requirements. The 47% ceiling suggests that reasoning capabilities don’t automatically transfer to specialized operational domains that require both systems thinking and precise entity identification.
For enterprises evaluating AI agent deployments, these results set realistic expectations. A 47% score on incident diagnosis means human verification remains mandatory. The cost-performance tradeoffs also matter: Claude Opus 4.7 costs $5.38 per task for its 47% score, while Gemma 4 31B delivers 37% at $0.14 per task. For high-volume operations, that 10-point accuracy gain may not justify a 38x cost increase.
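The 38x figure and the 10-point gap can be checked with a few lines of arithmetic. The sketch below uses only the scores and per-task costs quoted above. The "extra dollars per extra point" metric is an illustrative assumption, not something the benchmark reports.

```python
# Figures quoted in the article: (score in percent, cost per task in USD).
models = {
    "Claude Opus 4.7": (47, 5.38),
    "Gemma 4 31B": (37, 0.14),
}

def compare(expensive, cheap):
    """Return (cost ratio, score gap in points, extra dollars per extra point)."""
    score_hi, cost_hi = models[expensive]
    score_lo, cost_lo = models[cheap]
    cost_ratio = cost_hi / cost_lo
    score_gap = score_hi - score_lo
    # Illustrative metric (an assumption): added spend per added score point.
    marginal_cost = (cost_hi - cost_lo) / score_gap
    return cost_ratio, score_gap, marginal_cost

ratio, gap, marginal = compare("Claude Opus 4.7", "Gemma 4 31B")
print(f"{ratio:.1f}x cost for +{gap} points (${marginal:.2f} per extra point, per task)")
```

Whether roughly half a dollar per point per task is worth paying depends on incident volume and on what a wrong diagnosis costs, neither of which the benchmark measures.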
How It Works (Simplified)
ITBench-AA simulates the real work of a site reliability engineer responding to a Kubernetes incident. Each of the 59 tasks provides a frozen snapshot of a failing system: alerts firing, error logs accumulating, distributed traces showing request failures, and a topology map of interconnected services.
The model gets shell access to this incident snapshot — like an SRE sshing into a jump box to investigate. It can grep logs, query metrics, inspect Kubernetes manifests, and trace request paths through the service mesh. The harness (Stirrup, an open-source framework) gives the model 100 turns maximum to explore the environment and reach a diagnosis.
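The turn budget described above can be pictured as a simple loop. This is a minimal sketch and not Stirrup's actual API: `run_episode`, `policy`, `run_shell`, the scripted commands, and the entity name are all hypothetical stand-ins chosen for illustration.

```python
def run_episode(policy, run_shell, max_turns=100):
    """Let a policy investigate until it submits a diagnosis or runs out of turns.

    policy(history) returns ("shell", command) or ("submit", entities).
    Returns (set of submitted entities or None, turns used).
    """
    history = []
    for turn in range(1, max_turns + 1):
        action, payload = policy(history)
        if action == "submit":
            return set(payload), turn
        # Any other action is treated as a shell command against the snapshot.
        history.append((payload, run_shell(payload)))
    return None, max_turns  # budget exhausted without a diagnosis

# Hypothetical stand-ins for the snapshot shell and the model.
def fake_shell(command):
    return f"(output of {command!r})"

def scripted_policy(history):
    commands = ["grep -r ERROR logs/", "cat manifests/networkpolicy.yaml"]
    if len(history) < len(commands):
        return "shell", commands[len(history)]
    return "submit", ["NetworkPolicy/example"]

diagnosis, turns_used = run_episode(scripted_policy, fake_shell)
print(diagnosis, turns_used)
```

In this framing a model that never submits simply exhausts its 100 turns, which is one way long trajectories can end without a usable diagnosis.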
The catch: the model must submit the minimal set of root-cause entities responsible for the failure. Not symptoms. Not contributing factors. The specific Kubernetes resources — Deployments, Services, NetworkPolicies, ConfigMaps — that caused the incident.
Scoring uses recall-gated precision, which sounds academic but reflects operational reality. If you miss any root cause, you score zero for that task — partial credit doesn’t help users whose services are down. If you identify all root causes, your score equals your precision: true positives divided by all entities you submitted. Submit the two correct root causes plus three false positives? You score 40% (2/5) for that task.
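The rule is short enough to write down directly. Here is a minimal sketch of recall-gated precision as described above. The function name and the entity identifiers are hypothetical, and the benchmark's real scorer may differ in details such as how entities are matched.

```python
def recall_gated_precision(submitted, ground_truth):
    """Score one task: zero unless every root cause was found, else precision."""
    submitted, truth = set(submitted), set(ground_truth)
    if not truth:
        raise ValueError("ground truth must name at least one root cause")
    if not truth <= submitted:
        return 0.0  # recall gate: any missed root cause zeroes the task
    # All root causes were found, so true positives == len(truth).
    return len(truth) / len(submitted)

# Hypothetical entity names for illustration.
truth = {"NetworkPolicy/payments-deny", "ConfigMap/gateway"}

exact = recall_gated_precision(truth, truth)
padded = recall_gated_precision(
    truth | {"Deployment/checkout", "Service/cart", "Pod/chaos-injector"}, truth
)
missed = recall_gated_precision({"NetworkPolicy/payments-deny"}, truth)
print(exact, padded, missed)
```

The padded submission reproduces the article's example: two correct root causes plus three extras scores 2/5, while the submission that misses one root cause scores zero no matter how precise it is.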
This scoring mechanism punishes over-investigation. A model that traces the failure all the way back to the chaos-engineering tool that injected the fault gets penalized. That tool isn’t the root cause the SRE needs to fix — the misconfigured network policy is. This explains why Gemini 3.1 Pro Preview’s 83-turn investigations underperform Gemma 4’s 58-turn analyses. More exploration surfaces more candidate entities, many of which are correlated but not causal.
The faults span realistic failure modes: resource quota exhaustion, rollout failures with bad manifests, connection pool starvation, network partitions from firewall misconfigurations. IBM designed these scenarios based on patterns observed in real enterprise operations, which gives the benchmark external validity that synthetic tasks lack.
Limitations
ITBench-AA evaluates diagnostic skill, not remediation. Identifying the broken NetworkPolicy is only half the SRE workflow — crafting and applying the fix matters too. The benchmark stops at diagnosis because that’s the hardest part to automate and the most generalizable across different infrastructure configurations. But enterprises need end-to-end incident resolution, which this benchmark doesn’t measure.
The 59-task dataset is small by ML standards, and 19 of those tasks are held-out to prevent contamination. This limited size means model rankings could shift with different task samples. IBM and Artificial Analysis plan to expand to Financial Operations and CISO tasks, which will test whether these results generalize across IT domains or reflect SRE-specific gaps.
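A rough calculation shows how much room for reshuffling the sample size leaves. The sketch below assumes that the headline score is a mean over all 59 tasks and that tasks are independent. The article states neither, so treat the result as an order-of-magnitude bound. It relies on the fact that any per-task score in [0, 1] with mean p has variance of at most p(1 - p).

```python
import math

def max_standard_error(mean_score, n_tasks):
    """Upper bound on the standard error of a mean of n scores bounded in [0, 1]."""
    return math.sqrt(mean_score * (1.0 - mean_score) / n_tasks)

# Assumption: the 47% headline score averages all 59 tasks.
se = max_standard_error(0.47, 59)
print(f"standard error is at most {se * 100:.1f} points")
```

Under these assumptions the bound is about 6.5 points, so the one-point gap between 47% and 46% sits well inside the noise.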
The benchmark also assumes deterministic root causes with ground-truth answers provided by IBM. Real incidents can have multiple valid interpretations, and experienced SREs might legitimately disagree on whether to classify something as a root cause or contributing factor. The precision-based scoring doesn’t capture this ambiguity: there’s one correct answer set, and a submission that misses any part of it scores zero.
Finally, the evaluation uses a standardized harness (Stirrup) that may not play to every model’s strengths. Some models might perform better with different tool interfaces, custom prompting strategies, or retrieval-augmented workflows. The apples-to-apples comparison enables fair ranking but potentially understates what’s achievable with model-specific optimization.
Real-World Impact
Don’t expect AI to take the on-call pager in 2026. These results suggest that autonomous incident response remains 2-3 model generations away, assuming linear progress — which is optimistic given how domain-specific this task is.
The more realistic near-term application is AI-assisted diagnostics. Models at 40-47% accuracy can narrow the search space for human SREs, suggesting candidate root causes to investigate first. This augmentation model can deliver value without autonomy: a 30% cut in mean time to resolution, for example, would still matter even if the AI never resolves an incident on its own.
The cost-performance data also clarifies deployment economics. For enterprises processing hundreds of incidents monthly, spending $5.38 per incident on Claude Opus 4.7 may be justified if it prevents 30 minutes of senior SRE time. For smaller teams or lower-severity alerts, Gemma 4 at $0.14 per task becomes attractive even at 37% accuracy — it’s cheap enough to run on every alert as a first-pass filter.
Expect tooling to emerge around these models that compensates for their precision gaps. Multi-model voting (running three models and taking consensus), human-in-the-loop verification interfaces, and confidence scoring to escalate low-certainty diagnoses all make sense given current performance levels.
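Multi-model voting is straightforward to prototype. The following is a minimal sketch, assuming each model returns a set of candidate entities. The function name, the two-vote threshold, and the entity names are illustrative assumptions, not part of the benchmark or of any shipping tool.

```python
from collections import Counter

def consensus_diagnosis(submissions, min_votes=2):
    """Keep entities named by at least `min_votes` of the model submissions."""
    votes = Counter()
    for entities in submissions:
        votes.update(set(entities))  # one vote per model per entity
    agreed = {entity for entity, count in votes.items() if count >= min_votes}
    # With no agreement at all, flag the incident for human review.
    return agreed, len(agreed) == 0

# Hypothetical outputs from three models on the same incident.
submissions = [
    {"NetworkPolicy/payments-deny", "Deployment/checkout"},
    {"NetworkPolicy/payments-deny"},
    {"NetworkPolicy/payments-deny", "ConfigMap/gateway"},
]
agreed, escalate = consensus_diagnosis(submissions)
print(agreed, escalate)
```

Voting trims entities that only one model proposes, which targets the false-positive problem directly. It cannot recover a root cause that most models miss, so under recall-gated scoring it helps precision more than recall.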
The expansion to FinOps and CISO tasks will reveal whether this sub-50% ceiling is SRE-specific or a broader pattern. If models struggle equally with cost anomaly detection and security incident analysis, that suggests fundamental limitations in how current architectures handle complex enterprise reasoning. If performance jumps in other domains, it indicates that SRE’s combination of systems thinking and precise entity identification is an especially hard problem for current models.