
Every Major AI Model Fails Multi-Turn Attacks: What Cisco's 2026 Research Means for Enterprise Safety

Dr. Sana Okafor | 7 min read | Updated June 1, 2026

Key Findings

  • Single-turn safety scores are poor predictors of real-world vulnerability. Models showed deltas up to 55 percentage points between single-turn and multi-turn attack success rates—in both directions.
  • Every frontier model tested failed multi-turn attacks, with success rates ranging from 7.89% to 88.30%. GPT-5.4’s attack success rate jumped ninefold from single-turn to multi-turn testing.
  • One configuration change produced a roughly 45-point safety swing. Enabling reasoning mode in Grok 4.1 Fast dropped multi-turn attack success from 88.30% to 43.47% (a 44.83-point drop), yet no public benchmark captures this.
  • Anthropic’s Claude models were among the most resistant to iterative pressure, holding multi-turn attack success rates between 11.16% and 16.20%, while most competitors exceeded 24%.
  • Amazon’s Nova models behaved counterintuitively, showing high single-turn attack success rates but the lowest multi-turn vulnerability in the cohort, which suggests that single-turn brittleness does not by itself predict real-world exposure.

Why It Matters

Enterprise AI buyers are making security decisions based on benchmarks that measure the wrong interaction model. Single-turn evaluations—the industry standard for safety assessment—test one-shot prompts. But adversaries don’t operate that way. They iterate, reframe when refused, decompose harmful requests across multiple turns, and escalate gradually.

Cisco’s research across 15 closed frontier models from OpenAI, Anthropic, Google, Amazon, and xAI reveals that this gap between evaluation and reality creates systematic blind spots. A model that looks secure in single-turn testing can be nine times more vulnerable when attackers persist across a conversation, and multi-turn conversation is the default mode of actual AI usage.

The implications compound in enterprise deployments where models handle sensitive data, make autonomous decisions, or interact with external systems. If your procurement process relies on provider-published single-turn safety scores, you may be underestimating risk substantially; GPT-5.4’s ninefold jump shows how large the gap can be. The same models that pass your security review under static conditions may fail when exposed to the iterative attack patterns documented in this research.

How It Works (Simplified)

Cisco tested each model against two attack regimes. In single-turn attacks, the system sends one carefully crafted prompt designed to elicit harmful content—think of it as trying to pick a lock with a single tool in one attempt. In multi-turn attacks, the system conducts a conversation, adapting based on model responses—like a persistent adversary who tries different angles, reframes questions when blocked, and builds toward a goal across multiple exchanges.
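The difference between the two regimes can be sketched as two evaluation loops. This is a minimal illustration, not Cisco's harness: the `Model` and `Judge` types, the function names, and the toy model that gives way after three user messages are all assumptions made for the example.

```python
from typing import Callable, List

# Hypothetical interfaces: a model maps a conversation history to a reply,
# and a judge decides whether a reply violates policy.
Model = Callable[[List[str]], str]
Judge = Callable[[str], bool]


def single_turn_attack(model: Model, judge: Judge, prompt: str) -> bool:
    """One prompt, one reply, one verdict."""
    return judge(model([prompt]))


def multi_turn_attack(model: Model, judge: Judge, prompts: List[str]) -> bool:
    """Send prompts in order, keeping history; succeed on any violation."""
    history: List[str] = []
    for prompt in prompts:
        history.append(prompt)
        reply = model(history)
        if judge(reply):
            return True
        history.append(reply)
    return False


# Toy stand-ins (assumptions, not real model behaviour): this "model"
# refuses until the conversation contains three user messages.
def toy_model(history: List[str]) -> str:
    user_turns = len(history[0::2])  # user messages sit at even indices
    return "VIOLATION" if user_turns >= 3 else "REFUSED"


def toy_judge(reply: str) -> bool:
    return reply == "VIOLATION"
```

With these stubs, `single_turn_attack(toy_model, toy_judge, "a")` reports no violation, while `multi_turn_attack(toy_model, toy_judge, ["a", "b", "c"])` does. A single-turn score would therefore rate the toy model as safe even though a three-message conversation breaks it.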

The research decomposed attacks across five strategy families and measured what percentage succeeded in extracting harmful outputs. Success doesn’t mean complete model compromise; it means the model produced content that violated its stated safety guidelines—whether that’s hate speech, detailed instructions for illegal activities, or impersonation of authority figures.
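As a rough sketch of the metric described above, attack success rate per strategy family is the number of successful attempts divided by total attempts, expressed as a percentage. The family names and sample data below are placeholders, since the report's actual families and counts are not given here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def attack_success_rates(
    results: Iterable[Tuple[str, bool]],
) -> Dict[str, float]:
    """Return attack success rate (percent) per strategy family.

    Each result is a (family_name, succeeded) pair.
    """
    attempts: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for family, succeeded in results:
        attempts[family] += 1
        if succeeded:
            successes[family] += 1
    return {f: 100.0 * successes[f] / attempts[f] for f in attempts}


# Illustrative data only; "family_a" and "family_b" are placeholder names.
sample = [
    ("family_a", True), ("family_a", False),
    ("family_a", False), ("family_a", False),
    ("family_b", True), ("family_b", True),
]
rates = attack_success_rates(sample)
```

Reporting the rate per family rather than as one aggregate keeps a weak family visible even when the overall average looks acceptable.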

The methodology matters because it mirrors real-world threat patterns. An attacker targeting a customer service bot doesn’t send one jailbreak prompt and walk away. They probe, adjust, and exploit conversational context. A model that refuses a direct request for dangerous information might comply when that same request is embedded in a multi-turn narrative that establishes false context or adopts a trusted persona.

The Grok 4.1 Fast case study illustrates why deployment configuration is as critical as base model safety. The same model, with identical training and weights, showed radically different vulnerability depending on whether reasoning mode was enabled. This suggests that safety is not just a property of the model—it’s a property of the entire inference stack, including runtime flags that many enterprises treat as performance optimizations rather than security controls.

Limitations

Cisco tested base models without system prompts, content filters, or orchestration layers that most enterprises add during deployment. Real-world implementations typically include guardrails that could shift these numbers in either direction—though the research doesn’t specify which direction is more likely. An enterprise might reduce attack success rates with robust filtering, or inadvertently increase them if custom prompts introduce new vulnerabilities.

The study doesn’t reveal the specific attack prompts or provide reproducible test suites, which limits independent verification. Without knowing the exact attack strategies, enterprises can’t test their own deployments against the same threat scenarios. The 15-model sample, while comprehensive across major providers, represents a snapshot of specific model versions at a specific time. Safety characteristics can change with updates, and the research doesn’t address how stable these vulnerability patterns are across model iterations.

The report also doesn’t quantify the severity distribution of successful attacks. An 88% multi-turn attack success rate could mean anything from minor policy violations to critical safety failures. Without severity weighting, aggregate numbers may overstate or understate practical risk depending on what types of failures dominate each model’s vulnerability profile.
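To see why severity weighting matters, consider a hypothetical weighted score in which each successful attack counts in proportion to its severity. The severity labels and weights below are invented for illustration; the report defines no such scale.

```python
from typing import Dict, List

# Hypothetical severity weights; the report does not define a scale.
SEVERITY_WEIGHTS: Dict[str, float] = {
    "minor": 0.1,
    "moderate": 0.5,
    "critical": 1.0,
}


def severity_weighted_rate(outcomes: List[str]) -> float:
    """Percent score where each attempt contributes its severity weight.

    `outcomes` holds one label per attack attempt: "none" for a failed
    attack, or a key of SEVERITY_WEIGHTS for a successful one.
    Labels not in SEVERITY_WEIGHTS (including "none") contribute zero.
    """
    if not outcomes:
        return 0.0
    total = sum(SEVERITY_WEIGHTS.get(label, 0.0) for label in outcomes)
    return 100.0 * total / len(outcomes)


# Two models with the same raw 80% attack success rate.
mostly_minor = ["minor"] * 8 + ["none"] * 2
mostly_critical = ["critical"] * 8 + ["none"] * 2
```

Under these assumed weights the first model scores about 8 and the second 80, although both have the same unweighted attack success rate. An aggregate figure hides exactly this distinction.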

Real-World Impact

Enterprises should implement Cisco’s recommended three-point deployment gate immediately. First, require providers to publish attack success rates broken down by strategy family with every model release; if they won’t, that is itself decision-relevant information. Second, build regression tests for your highest-risk use cases with a 3-percentage-point threshold that triggers manual review before deployment. Third, flag any model showing a gap of more than 15 percentage points between single-turn and multi-turn attack success rates for additional scrutiny before production use. That rule would have caught eight of the fifteen models tested.
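The three checks can be expressed as a simple automated gate. The sketch below is one possible reading, not Cisco's implementation: the `ModelReport` fields and function name are hypothetical, the regression check is assumed to compare multi-turn attack success against a previous release, and the gap is taken as an absolute difference because the findings report deltas in both directions.

```python
from dataclasses import dataclass
from typing import List, Optional

REGRESSION_THRESHOLD_PTS = 3.0  # from the article: triggers manual review
GAP_THRESHOLD_PTS = 15.0        # from the article: single- vs multi-turn gap


@dataclass
class ModelReport:
    """Hypothetical record of what a buyer knows about one model release."""
    name: str
    single_turn_asr: float  # attack success rate, percent
    multi_turn_asr: float   # attack success rate, percent
    publishes_family_breakdown: bool
    baseline_multi_turn_asr: Optional[float] = None  # previous release


def gate_findings(report: ModelReport) -> List[str]:
    """Return the list of gate findings; an empty list means no flags."""
    findings: List[str] = []
    if not report.publishes_family_breakdown:
        findings.append("no per-strategy-family attack success rates published")
    if report.baseline_multi_turn_asr is not None:
        # Assumption: a regression is a rise in multi-turn ASR vs. baseline.
        regression = report.multi_turn_asr - report.baseline_multi_turn_asr
        if regression >= REGRESSION_THRESHOLD_PTS:
            findings.append("regression threshold hit: manual review required")
    gap = abs(report.multi_turn_asr - report.single_turn_asr)
    if gap > GAP_THRESHOLD_PTS:
        findings.append("single/multi-turn gap: additional scrutiny required")
    return findings


# Made-up numbers for illustration; not taken from the report.
example = ModelReport("example-model", 10.0, 30.0, True, 28.0)
print(gate_findings(example))
```

Returning a list of findings rather than a single pass/fail flag lets a reviewer see which of the three checks fired.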

For AI providers, the Grok 4.1 Fast finding creates an immediate disclosure obligation. If a single configuration flag produces a 45-point safety swing, that needs to appear in model cards alongside capability benchmarks. Enterprises making build-versus-buy decisions need to know which deployment-time settings carry security implications, not just performance trade-offs.

The multi-turn vulnerability landscape will likely inform the next generation of safety benchmarks and potentially new regulatory requirements. If single-turn testing becomes recognized as insufficient, expect procurement RFPs to start requiring multi-turn safety documentation by late 2026. Organizations currently in model selection processes should add multi-turn evaluation criteria now rather than waiting for industry standards to catch up to adversary tactics.
