Guardrails see one thing. Models infer another. ArXiv 2605.23196 introduces Prompt Overflow — a novel attack class where adversarial input passes guardrail inspection but is inferred differently by the model itself. The guardrail and the model process identical text through different representation pathways, and attackers exploit that gap with surgical precision.
Current safety filters on frontier LLMs fail because they are not structurally aligned with the inference layer they are meant to protect. The paper demonstrates that this is not a bug in any specific guardrail vendor or model. It is an architectural failure class: the inspection layer and the inference layer operate on different representations of the same input, and attackers now have a systematic methodology for finding and exploiting that divergence.
The fundamental issue is architectural. A guardrail processes input through its own inference pipeline — typically a smaller classification model, a regex engine, a semantic similarity check, or a combination of these. The target LLM processes the same input through a completely different inference pipeline with different tokenization, different attention patterns, and different embedding space geometry.
Prompt Overflow exploits this mismatch. An attacker constructs input that the guardrail's pipeline classifies as benign — safe classification score, no keyword matches, semantic similarity matches known-safe patterns — but that the target LLM's pipeline interprets as an instruction to bypass safety constraints.
The paper formalizes this as representation divergence. Two inference pipelines with different architectures produce different internal representations of the same input string. When those representations diverge enough that one classifies the input as unsafe while the other classifies it as safe — or vice versa — the attacker wins.
The model infers meaning the guardrail cannot see because the guardrail does not share the model's latent space. Text that looks like a benign query under a bag-of-words inspection looks like a jailbreak under attention-weighted semantic analysis — or the reverse, where a jailbreak string is tokenized differently by the guardrail's classifier and slips through.
Different models tokenize input differently. A guardrail using a Byte-Pair Encoding tokenizer with one vocabulary may segment a string into benign tokens while the target model's SentencePiece tokenizer segments the same string into an instruction sequence. The attacker finds strings that span this boundary — characters that look safe under one tokenization scheme but carry instruction semantics under another.
Guardrails that use semantic similarity (embedding cosine distance to known bad examples) rely on the assumption that similar meaning produces similar embeddings across models. This assumption fails when the guardrail's embedding model and the target model have different training distributions. A phrase that embeds far from jailbreak vectors in the guardrail's embedding space may embed close to them in the target model's space.
In conversational and agent contexts, the guardrail typically evaluates individual turns while the model reasons over the full conversation history. An attacker distributes the exploit across multiple turns — each turn passes guardrail inspection in isolation, but the accumulated context triggers unsafe behavior when the model processes the full thread.
Some of the most effective Prompt Overflow examples exploit the ambiguity between instruction tokens and data tokens. The guardrail sees a benign data string. The model treats part of that same string as an instruction — because the model's attention weights prioritize certain syntactic patterns that the guardrail's classifier does not evaluate at all.
The natural response to this threat is to build better guardrails — larger classifiers, more training data, more comprehensive semantic checks. The paper's central finding is that this response alone will not work. Stronger guardrails on the same architectural side of the equation — refining the inspection pipeline — do not close the representation divergence gap. They narrow it, but they cannot eliminate it because the divergence is inherent to having two different inference pipelines processing the same input.
The attacker does not need to defeat the guardrail. The attacker only needs to find one string where the representations diverge enough for the target model to infer something the guardrail did not catch. A stronger guardrail reduces the number of exploitable strings but cannot reduce that number to zero — not without converging on identical representation, which would require the guardrail to be the same model, which defeats the purpose of having a separate inspection layer.
This is not a scaling problem. It is an architectural constraint.
Organizations deploying LLM guardrails in customer-facing chatbots, agent systems, and content moderation pipelines are affected regardless of which guardrail vendor they use or which frontier model sits behind it.
A user who can construct multi-turn Prompt Overflow can elicit responses that bypass safety filters — generating prohibited content, extracting privileged information, or manipulating the chatbot into actions outside its operational scope. Single-turn guardrails inspecting individual messages cannot detect distributed attacks.
Agents that use guardrails for tool-call validation are at risk. The guardrail inspects the agent's intended tool call; the model reinterprets it through the accumulated context. An agent that would never call a dangerous tool under direct instruction can be steered toward it across multiple reasoning steps, each of which passes guardrail inspection individually.
Platforms using guardrails to filter user-generated content before processing by LLM-based moderation systems face the same structural gap. Content that the guardrail classifies as within policy may be interpreted by the moderation model as containing violations, or vice versa — producing false negatives and false positives that cannot be resolved by tuning either model alone.
Closing the Prompt Overflow vulnerability requires a new defensive paradigm. The paper proposes inference-aligned validation: verifying that what the model inferred matches what the guardrail inspected, rather than simply reinforcing the inspection layer.
Inference-aligned validation works by sampling the model's internal representation for a given input and comparing it to the guardrail's expected representation. When they diverge beyond a threshold, the input is flagged for manual review or rejected outright. This requires model providers to expose representation-level signals — logits, attention patterns, or embedding projections — that a validation layer can compare against guardrail predictions.
This shifts the defensive posture from "block bad inputs" to "validate that the model sees what the guardrail sees." It acknowledges that the two pipelines will never be perfectly aligned and provides a mechanism for detecting when the gap is being exploited. It is not a panacea — sophisticated attackers may still find divergences that pass both filters — but it raises the cost of an exploit from "find one diverging string" to "find one diverging string that also bypasses the representation comparison."
While inference-aligned validation is an emerging practice, organizations can adopt intermediate measures now:
Prompt Overflow is the first attack class that systematically exploits the structural gap between guardrail architectures and inference architectures. It is not a vendor bug, it is not a configuration error, and it cannot be fixed by stronger filters on the inspection side alone.
The security community has spent years optimizing guardrails — larger datasets, better classifiers, more sophisticated semantic checks. ArXiv 2605.23196 demonstrates that we have been optimizing the wrong side of the equation. The defense requires inference-aligned validation: verifying that what the model inferred matches what the guardrail inspected. Until the industry shifts to this paradigm, the structural gap between inspection and inference will remain an open attack surface.
Full technical analysis of the Prompt Overflow attack class is available at the link below.
https://cyberian-defenses.com/blog/prompt-overflow-guardrail-bypass