Poems can hack ChatGPT? A new study reveals dangerous AI flaw
Forcing an “AI” to do your will isn’t a tall order to fill—just feed it a line that carefully rhymes and you’ll get it to casually kill. (Ahem, sorry, not sure what came over me there.) According to a new study, it’s easy to get “AI” large language models like ChatGPT to ignore their safety settings. All you need to do is give your instructions in the form of a poem.
“Adversarial poetry” is the term coined by a team of researchers at DEXAI, the Sapienza University of Rome, and the Sant’Anna School of Advanced Studies. According to the study, phrasing a request as a poem works as a “universal single-turn jailbreak” that gets the models to ignore their basic safety functions.
The researchers collected basic commands that would normally trip the large language models (LLMs) into returning a sanitized, polite “no” response (such as asking for instructions on how to build a bomb). Then they converted those instructions into poems using another LLM (specifically DeepSeek). When fed the poem, a flowery but functionally identical version of the command, the LLMs provided the harmful answers.
A series of 1,200 prompt poems was created, covering topics such as violent and sexual crimes, suicide and self-harm, invasion of privacy, defamation, and even chemical and nuclear weapons. Using only a single text prompt at a time, the poems got around LLM safeguards three times more often than straight text examples, with a 65 percent success rate across all tested LLMs.
Products from OpenAI, Google, Meta, xAI, Anthropic, DeepSeek, and others were tested, with some failing to detect the dangerous prompts at rates of up to 90 percent. Poetic prompts designed to elicit instructions for code injection attacks, password cracking, and data extraction were especially effective, while prompts in the “Harmful Manipulation” category succeeded only 24 percent of the time. Anthropic’s Claude proved the most resistant, falling for verse-modified prompts at a rate of just 5.24 percent.
“The cross-family consistency indicates that the vulnerability is systemic, not an artifact of a specific provider or training pipeline,” reads the paper, which, according to Futurism, has yet to be peer-reviewed. In layman’s terms: LLMs can still be fooled, and fooled fairly easily, by a novel approach their operators didn’t anticipate.
