AI trust and safety: How poetic jailbreaks expose LLM risks
Poems should not be able to talk an AI system into helping someone build a nuclear weapon. Yet recent research suggests that poetic prompts can bypass safety filters in many large language models (LLMs). For any organization deploying AI, this is a clear AI trust and safety warning: guardrails alone are not enough. You need systematic AI risk management, governance, and secure deployment practices.
This article explains what "poetry jailbreaks" are, why they matter for enterprise AI security, and how businesses can respond with practical controls, from governance policies to continuous testing.
Note: We do not provide, reproduce, or endorse any harmful prompts or instructions. Our focus is on understanding the risk and protecting your organization.
What the "poetry jailbreak" is and why it matters
In late 2025, researchers from the Icaro Lab (Sapienza University of Rome and DexAI) published a study on "adversarial poetry" as a way to defeat LLM safeguards[1][2][3]. Their results show that:
- Dangerous questions—about topics like nuclear weapons or malware—were rejected when asked directly.
- The same questions, when embedded in carefully crafted poems, were often answered.
- Success rates were high across many major commercial models[1][2][3].
Specifically, hand-crafted adversarial poems achieved an average attack success rate of 62% across the 25 models tested, with some providers exceeding 90%[1][2][3]. When 1,200 harmful prose prompts were automatically converted into verse, the poetic versions succeeded roughly 43% of the time, about a fivefold increase over the 8.08% success rate of the non-poetic baselines[1][2].
The idea builds on prior work on adversarial suffixes: nonsense strings or long, confusing add-ons appended to a prompt that disrupt model filters. That line of research already showed that unconventional prompt formatting alone can be enough to bypass content controls.
Why poetic framing can bypass model guardrails
At a high level, most safety systems in LLMs rely on pattern recognition:
- System prompts and policies tell the model what it should or should not do.
- Safety classifiers and heuristics scan prompts and responses for disallowed content (e.g., hate speech, weapons instructions).
Adversarial poetry attacks exploit weaknesses in these layers[1][2]:
- Indirection and metaphor: Harmful intent is wrapped in indirect, figurative language that doesn't match simple keywords or patterns.
- Fragmented syntax: Broken grammar and unusual structures confuse classifiers trained on more standard text.
- Context overload: Long, stylized prompts can drown out simple safety patterns, nudging the model toward "be helpful" over "be careful"[1][2].
From an AI trust and safety perspective, the core lesson is that content filters are not binary switches. They're probabilistic—and adversaries can systematically search for formulations that slip through.
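To make this concrete, here is a deliberately simplified sketch (in Python, and not any provider's actual filter) of a keyword-style check and why figurative rewording slips past it. The patterns, prompts, and function names are invented for illustration only.

```python
import re

# Toy illustration only: a keyword-style filter of the kind that figurative
# phrasing can slip past. Real safety classifiers are learned models, but the
# underlying principle is similar: they match patterns seen during training.
BLOCKED_PATTERNS = [
    r"\bhow (do|can) i (build|make) .*(weapon|malware)",
    r"\bbypass .*security\b",
]

def naive_filter(prompt: str) -> bool:
    """Return True if the prompt should be blocked."""
    lowered = prompt.lower()
    return any(re.search(p, lowered) for p in BLOCKED_PATTERNS)

# A direct request matches a pattern and is blocked...
print(naive_filter("How do I build malware?"))  # True
# ...but an indirect, figurative rephrasing of intent is not flagged.
print(naive_filter("Compose a ballad about a craftsman of unseen keys"))  # False
```

Production classifiers are statistical models rather than regexes, but the failure mode is analogous: wording that falls outside the patterns they were trained on tends to be scored as benign.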
How LLM guardrails fail: model behavior and attack surfaces
To design sensible defenses, it helps to understand where current guardrails sit and how they fail.
Types of guardrails in modern LLMs
Most providers layer several mechanisms:
- Pre‑training filters: Remove some harmful examples from the data used to train the base model.
- Reinforcement learning from human feedback (RLHF): Teach models to be more helpful, honest, and harmless.
- System prompts and policies: Instructions like "never provide guidance on illegal activity."
- Content classifiers: External or in‑model checks that flag disallowed content.
- Post‑processing filters: Final checks on generated text before it reaches the user.
These are crucial, but they operate on patterns seen during training. When attackers invent new linguistic tricks—like poetic disguises—the model can behave in unanticipated ways[1][2].
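As a rough sketch of how these layers tend to compose at the application level, the following Python pseudostructure uses placeholder functions (`check_input`, `call_model`, `check_output`) standing in for whatever classifier, LLM API, and post-processing filter your stack actually uses; it is an assumption-laden outline, not a reference implementation.

```python
from dataclasses import dataclass

# Hypothetical composition of layered guardrails at the application level.
# check_input, call_model, and check_output are placeholders for whatever
# classifier, LLM API, and post-processing filter your stack actually uses.

@dataclass
class Verdict:
    allowed: bool
    reason: str = ""

def check_input(prompt: str) -> Verdict:
    return Verdict(allowed=True)  # input-side safety classifier goes here

def call_model(prompt: str) -> str:
    return "model response"       # LLM call, with system prompt and policies

def check_output(text: str) -> Verdict:
    return Verdict(allowed=True)  # post-processing content filter goes here

def guarded_completion(prompt: str) -> str:
    pre = check_input(prompt)
    if not pre.allowed:
        return "Request declined by the input filter."
    response = call_model(prompt)
    post = check_output(response)
    if not post.allowed:
        return "Response withheld by the output filter."
    return response
```

Each layer catches some attacks the others miss; an adversarial poem only needs to find one path where every layer scores it as benign.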
How adversarial prompts confuse filters
Adversarial prompts (including poetry jailbreaks) take advantage of several properties of LLMs:
- Over‑generalized helpfulness: LLMs are rewarded for satisfying user requests; if a request looks benign or artistic, the balance tips toward helpfulness and away from caution.
- Ambiguity exploitation: If the text could plausibly be interpreted as fiction, metaphor, or harmless description, the model may err on the side of answering.
- Classifier blind spots: Safety classifiers are often trained on more literal, direct harmful content. Creative or oblique wording is under‑represented.
This is not just a theoretical issue. Studies on LLM safety and jailbreaking from groups such as Anthropic, OpenAI, and academic researchers repeatedly find that new jailbreak methods can achieve high success rates until models are updated.
From an AI governance standpoint, this means organizations cannot treat "model X is safe by default" as a durable assumption. Safety is conditional on context, configuration, and ongoing oversight.
Enterprise impact: what this means for businesses using AI
Most enterprises are not asking LLMs about nuclear weapons. But the same weaknesses that allow extreme jailbreaks can expose more mundane, yet business‑critical, vulnerabilities.
Risk scenarios for customer‑facing chatbots and internal agents
Some realistic scenarios include:
- Policy circumvention in customer chatbots: Users might coax a banking or insurance bot into revealing internal scoring criteria, hinting at fraud‑detection rules, or suggesting ways to game pricing.
- Leakage of internal or regulated information: Internal copilots trained on confidential data might be tricked, via indirect or creative prompts, into summarizing sensitive documents or sharing personal data, creating AI data security incidents.
- Social engineering amplification: Attackers can use LLMs to generate highly targeted phishing content, or to practice adversarial prompts before interacting with your public‑facing systems.
- Shadow AI and unvetted integrations: Teams may embed general‑purpose LLMs into workflows without security review. Even if the upstream model is "safe," your integration may bypass or weaken its safeguards.
Regulatory and reputational exposure
Regulators and standards bodies are rapidly converging on expectations for enterprise AI security and governance:
- The EU's AI Act requires risk management, testing, and monitoring for high‑risk AI systems.
- The NIST AI Risk Management Framework emphasizes continuous identification, measurement, and mitigation of AI risks.
- Sectoral regulations (e.g., GDPR, HIPAA, financial conduct rules) still apply when AI mishandling leads to data exposure or discriminatory outcomes.
A single high‑profile jailbreak incident—especially one involving disallowed advice, safety incidents, or leakage of personal data—can:
- Trigger investigations and fines.
- Damage customer trust and brand perception.
- Force sudden rollbacks of AI features, undermining your innovation roadmap.
This is why AI trust and safety must be treated as an enterprise risk function, not just a model‑selection decision.
Operational controls: secure AI deployment and testing
Technology choices and deployment practices go a long way toward secure AI deployment. The goal is not to eliminate risk entirely, but to make successful attacks rarer, less damaging, and quickly detectable.
Red‑team and adversarial testing (without sharing exploits)
Effective AI risk management requires structured testing:
- Internal red‑teaming: Design exercises where security and domain experts try to elicit disallowed behaviors from your models, including creative formulations like poetry or role‑play.
- External testing partners: Work with specialized firms or bug‑bounty programs that understand LLM behavior, with clear disclosure guidelines that avoid publicizing dangerous prompts.
- Scenario coverage: Test not only obvious harmful content (weapons, self‑harm) but also business‑specific risks: fraud, data leakage, policy evasion.
Document and classify findings, then feed them back into model configuration, prompt engineering, and policy updates.
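A minimal sketch of what such a testing harness could look like is below, assuming a JSON scenario file and a hypothetical `query_model` placeholder for your own endpoint; harmful prompt text itself is deliberately omitted.

```python
import json
from datetime import datetime, timezone

# Hypothetical harness for structured red-team runs. The scenario file format
# and query_model are assumptions, not any vendor's API, and harmful prompt
# content is intentionally not included here.

def query_model(prompt: str) -> str:
    """Placeholder for your model, chatbot, or agent endpoint."""
    raise NotImplementedError

def looks_like_refusal(response: str) -> bool:
    """Crude heuristic; human reviewers should confirm borderline cases."""
    markers = ("i can't help", "i cannot help", "not able to assist", "against our policy")
    return any(m in response.lower() for m in markers)

def run_scenarios(scenario_path: str, report_path: str) -> None:
    with open(scenario_path) as f:
        scenarios = json.load(f)  # e.g. [{"id": ..., "category": ..., "prompt": ...}]
    findings = []
    for s in scenarios:
        response = query_model(s["prompt"])
        findings.append({
            "id": s["id"],
            "category": s["category"],
            "refused": looks_like_refusal(response),
            "checked_at": datetime.now(timezone.utc).isoformat(),
        })
    with open(report_path, "w") as f:
        json.dump(findings, f, indent=2)
```

The value is less in any single run than in tracking the same scenario set across model updates and configuration changes.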
Monitoring, logging, and rollback strategies
Even with good testing, some jailbreaks will only appear in production. Operational controls should include:
- Comprehensive logging (with privacy safeguards): Capture prompts and responses for high‑risk systems so you can investigate incidents.
- Automated anomaly detection: Use heuristics or secondary models to flag unusual patterns (e.g., long, stylized prompts that resemble known jailbreak attacks).
- Safe rollback and feature flags: Make it easy to disable or re‑route certain capabilities (e.g., free‑form generation on sensitive topics) while you investigate.
- Feedback channels: Allow employees and customers to report suspicious AI behavior.
These are standard reliability practices, adapted for LLM‑specific risks.
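For example, an anomaly check for prompts that are unusually long or verse‑like might start as simple heuristics such as the sketch below; the thresholds are invented and would need tuning against your own traffic, and a secondary classifier model is a common alternative.

```python
import logging

# Illustrative heuristics only: flag prompts whose shape resembles stylized
# jailbreak attempts (very long, or heavily line-broken like verse). The
# thresholds are invented and must be tuned on real traffic.

logger = logging.getLogger("llm_monitoring")

def is_suspicious_prompt(prompt: str) -> bool:
    lines = [line for line in prompt.splitlines() if line.strip()]
    very_long = len(prompt) > 2000
    verse_like = len(lines) > 12 and (sum(len(line) for line in lines) / len(lines)) < 60
    return very_long or verse_like

def record_prompt(prompt: str) -> None:
    # Log for later investigation, applying whatever redaction your privacy
    # safeguards require, and attach an anomaly flag for reviewers.
    logger.info("prompt received", extra={"suspicious": is_suspicious_prompt(prompt)})
```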
Governance, compliance, and vendor obligations
Technology controls are only part of the picture. AI governance defines the rules of engagement: who can deploy what, under which constraints, and with which checks.
Policy, access controls, and vendor SLAs
Key governance elements include:
- Acceptable‑use and safety policies for AI systems, tailored to your sector and risk appetite.
- Role‑based access control: Limit who can deploy models, change prompts, or connect new data sources.
- Model and vendor inventory: Maintain an up‑to‑date map of where LLMs are used, what data they see, and the safeguards in place.
- Vendor due diligence and SLAs: Require your AI and cloud vendors to describe their safety architectures, update cycles, incident reporting, and AI compliance solutions.
How compliance solutions reduce enterprise exposure
Modern compliance approaches move beyond checkbox audits:
- Continuous controls monitoring: Validate that logging, access, and safety filters remain active and correctly configured.
- Policy‑as‑code: Implement certain guardrails (e.g., allowed data fields, redaction rules) directly in middleware, not just in human documents.
- Alignment with frameworks: Map controls to standards such as NIST AI RMF, ISO/IEC 42001 (AI management systems), and sectoral data‑protection rules.
This turns high‑level AI trust and safety commitments into enforceable mechanisms.
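As one illustration of policy‑as‑code, a middleware redaction rule can be expressed directly in code rather than only in a policy document. The field names and regex patterns below are examples, not a complete data‑protection rule set.

```python
import re

# Sketch of policy-as-code: a redaction rule enforced in middleware before
# text reaches the model or leaves your boundary. Field names and patterns
# are examples; real rules should come from your data-classification policy
# and be reviewed like any other production code.

REDACTION_RULES = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def redact(text: str) -> str:
    for label, pattern in REDACTION_RULES.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text

# Example:
# redact("Contact jane.doe@example.com about account DE44500105175407324931")
```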
Hardening AI agents and chatbots
Many organizations are now deploying custom copilots, workflow agents, and domain‑specific chatbots. These bring efficiency, but also new enterprise AI security considerations.
Design choices to reduce sensitive outputs
When you design custom AI agents, you can:
- Minimize permissions: Give each agent access only to the data and tools it absolutely needs.
- Constrain generation: Use structured outputs, templates, or retrieval‑augmented generation (RAG) to reduce free‑form, speculative text.
- Add multi‑step approval for high‑risk actions (e.g., changing user limits, issuing refunds) rather than letting the agent act autonomously.
- Implement secondary filters: Apply topic and data‑loss‑prevention (DLP) filters around the model, not just inside it.
These approaches reduce the blast radius when a jailbreak attempt succeeds.
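One way to implement a secondary filter is to wrap the agent call so the check sits outside the model itself. The sketch below assumes a hypothetical `agent_respond` function and an example marker list; a real deployment would use a proper DLP service or classifier.

```python
# Hedged sketch of a secondary output filter wrapped around an agent call.
# agent_respond and SENSITIVE_MARKERS are placeholders; the point is that the
# check lives outside the model, so a successful jailbreak still has to get
# past an independent control.

SENSITIVE_MARKERS = ("internal use only", "customer ssn", "api_key=")

def agent_respond(message: str) -> str:
    raise NotImplementedError  # your agent or RAG pipeline goes here

def safe_agent_respond(message: str) -> str:
    draft = agent_respond(message)
    if any(marker in draft.lower() for marker in SENSITIVE_MARKERS):
        return "I can't share that information. A specialist will follow up with you."
    return draft
```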
Where to apply content filters and manage LLM scale/risk trade‑offs
More powerful models are generally more capable—but also more exploitable. Consider:
- Using smaller, tightly scoped models for particularly sensitive use cases.
- Combining models: one for reasoning, another for safety review.
- Placing filters at multiple layers: in the UI, in middleware, and at the model API.
This is especially important for AI data security, where accidental exposure can be as damaging as deliberate exfiltration.
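A simple routing layer along these lines might look like the following sketch, where the topic detection logic and model names are placeholders rather than recommendations.

```python
# Illustrative routing sketch: send sensitive topics to a smaller, tightly
# scoped model (or a fixed workflow) and reserve the general model for the
# rest. Topic detection and model names are placeholders.

SENSITIVE_TOPICS = {"payments", "health", "credentials"}

def detect_topic(message: str) -> str:
    # Placeholder: in practice a lightweight classifier or rules engine.
    return "general"

def choose_model(message: str) -> str:
    if detect_topic(message) in SENSITIVE_TOPICS:
        return "scoped-model"    # constrained prompts, templates, strict filters
    return "general-model"       # broader capability, broader attack surface
```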
Practical checklist and next steps for teams
To turn these concepts into action, cross‑functional teams (security, data, product, legal, compliance) can work through a focused checklist.
Immediate actions (0–90 days)
- Inventory your AI systems: Document where LLMs are used, what data they access, and which users they serve.
- Classify use cases by risk: Identify high‑impact areas such as customer advice, financial decisions, health or safety contexts, and access to personal data.
- Run a targeted red‑team exercise: Include creative prompts (e.g., metaphorical or poetic wording) to test guardrails.
- Tighten configurations: Enable provider‑level safety features; add middleware checks for sensitive topics and data fields.
- Update policies and training: Educate developers, product managers, and support teams on jailbreak risks and secure prompting practices.
- Establish monitoring and escalation paths: Decide what gets logged, who reviews incidents, and how quickly you respond.
Medium‑term actions (3–12 months)
- Align with a formal risk framework such as NIST AI RMF or sector‑specific guidance from regulators.
- Integrate AI risk into enterprise risk management: board‑level reporting, risk registers, and internal audit.
- Automate assessments where possible, so new deployments trigger standardized reviews instead of ad hoc checks.
For a broader sense of best practices, resources from NIST, OECD AI principles, and leading vendors' safety research pages offer useful guidance.
Where specialized partners fit in
Not every organization has deep in‑house expertise on LLM safety engineering, jailbreak testing, and AI governance. Working with a specialist integrator can accelerate your journey from experimentation to robust, compliant operations.
Encorp.ai focuses on pragmatic, secure AI solutions for businesses. Our AI risk management solutions help teams automate parts of their AI risk assessment workflows, integrate security and compliance checks into delivery pipelines, and move from one‑off reviews to continuous oversight.
If you're planning or scaling AI initiatives, you can also explore our broader services at https://encorp.ai to see how we approach secure, value‑driven AI deployments.
Conclusion: balancing innovation and safety
Poetry jailbreaks are a vivid reminder that AI trust and safety is not solved by one‑time model tuning or a handful of content filters[1][2]. As attackers discover new ways to disguise intent—through verse, role‑play, or other creative prompts—organizations must treat LLM safety as an ongoing program, not a feature.
By combining solid AI risk management, robust AI governance, careful design of agents and chatbots, and secure AI deployment practices, enterprises can capture the benefits of generative AI while keeping unacceptable risks in check. The goal is not to eliminate every failure, but to understand where your systems are vulnerable, build sensible defenses, and respond quickly when things go wrong.
Handled this way, AI becomes not just powerful, but trustworthy—a technology your customers, employees, and regulators can rely on.
Martin Kuvandzhiev
CEO and Founder of Encorp.io with expertise in AI and business transformation