Tech
Security researchers tricked LLMs into giving them cocaine recipes by abusing role models for prompt injection
AI + ML
If you want a picture of the future of LLM security, imagine Whac-a-Mole meets Groundhog Day
Researchers say that machine learning models cannot reliably distinguish between authorized and unauthorized input, ensuring that prompt injection will continue to present a threat until developers find new ways to have machine learning systems process inputs.
AI models provide responses to user-supplied prompts. The problem is that AI models may receive adversarial prompts – directly from a user or indirectly from an ingested document – that tell the model to take action contrary to its built-in system prompt.
Various techniques mitigate prompt injection, but defenders have not found ways to prevent such attacks.
According to independent researchers Charles Ye and Jasmine Cui, and MIT associate professor Dylan Hadfield-Menell, no one is likely to do so under the current fragile LLM security model.
As they observe in a paper titled “Prompt Injection as Role Confusion” in the proceedings of next week’s ICML 2026 conference, LLMs have come to rely on a text tagging system that defines “roles” to separate system text from user text. And roles, they argue, do not guarantee security.
“Role tags were a formatting trick that became the security architecture and the cognitive scaffolding of modern LLMs,” the authors explain in a blog post. “We’ve shown that this architecture doesn’t survive into the model’s actual representations, and that such role confusion is linked to prompt injection.”
When OpenAI’s ChatGPT arrived in 2022, it implemented the concept of roles – described by Anthropic a year earlier – as a way to tell the underlying model to behave in a certain way. The user role would make a request and the model, acting in the role of a helpful assistant, would respond to that request.
“A formatting trick had become the mechanism that turned autocomplete into an assistant,” the authors observe.
Developers introduced other roles over time. In addition to and , there’s , , and . These roles served to draw a line between different objectives so they could be individually optimized during the training process. Model makers want to balance conflicting objectives like being helpful and preventing harm, and this involves role distinctions.
But roles, the researchers say, have become overloaded with responsibilities they cannot reliably carry out. They’ve become like a fuzzier version of permission levels, determining how prompts are trusted and treated.
The problem, the authors contend, is that roles are determined in a fundamentally insecure way: writing style.
“LLMs identify roles from an insecure feature (style),” they explain. “This is like identifying a stranger’s profession from how they talk and dress rather than by checking their ID. Usually everything agrees, so this works fine. But when attackers intentionally create a mismatch, the LLM uses the insecure method (writing style) to identify its role instead of the secure method (tags).”
The authors developed an attack called CoT (Chain of Thought) Forgery that involves using an LLM to spoof the terse style of OpenAI mode and add that to the prompt. The technique won the 2025 OpenAI Kaggle red-teaming contest.
“We asked a bunch of LLMs how to synthesize cocaine, inserting fake reasoning that says it’s fine because we’re wearing a green shirt,” the authors explain. “The LLMs comply. The rationale is transparently dumb, but the models don’t evaluate it as an external claim to be scrutinized. They treat it as their already-reached conclusion, and simply act on it. We’ve stolen the trust given to the role.”
On a standard jailbreaking benchmark, they say, CoT Forgery took the attack success rate from near zero to about 60 percent on the models tested. And whereas most jailbreaks are fragile and work only for certain models, this one transferred because it exploits a structural flaw. It’s not attempting to persuade the model but duping the model into treating the request as something that’s already settled.
The authors also note that while many models report near-perfect safety scores on prompt-injection benchmarks, human red-teamers achieve attack success rates close to 100 percent.
“The discrepancy is straightforward: skilled humans test and adapt attacks until they work, benchmarks don’t,” they state. “Static benchmarks measure attacks models have already learned to catch.”
Roles, the authors argue, deserve more attention from the research community because they’ve become one of the most important abstractions in the AI stack.
“Unless LLMs achieve genuine role perception, we think injection defense will remain a perpetual whack-a-mole game,” they conclude. “And the continuous nature of role boundaries opens the threat of injections designed to subtly shift LLM states through seemingly innocuous text, legally and at scale.” ®
You must be logged in to post a comment Login