"Jailbreak‑Proof" AI Models Hacked in Hours — OpenAI’s Safety Claims Tested on Launch Day

OpenAI introduced its first open‑weight models since 2019—GPT‑OSS‑120b and GPT‑OSS‑20b, marketed as fast, efficient, and hardened for “jailbreak resistance.” The boast didn’t last long. Within hours, famed jailbreak hacker Pliny the Liberator demonstrated that the models could still be coerced into producing harmful content, including instructions for methamphetamine production, the VX nerve agent, and malware. Safety concerns are mounting once again.

Aug 7, 2025 - 10:37
"Jailbreak‑Proof" AI Models Hacked in Hours — OpenAI’s Safety Claims Tested on Launch Day

What Happened?

  • Claimed Security Tested—and Failed
    OpenAI subjected the GPT‑OSS models to extensive adversarial training and “worst‑case fine‑tuning,” backed by a Safety Advisory Group review. Yet Pliny swiftly bypassed those safeguards, using prompt transformations to break the models’ resistance.
  • Public Exposure in Real Time
    Pliny announced the jailbreak on X with the caption “OPENAI: PWNED GPT‑OSS: LIBERATED,” alongside screenshots of the illicit instructions the models produced.

  • Safety Verification Missed the Mark
    Despite claims of performance parity on safety benchmarks such as StrongReject, real‑world exploitability remains a glaring safety gap. OpenAI even launched a $500K red‑teaming challenge alongside the release, but it did not prevent the launch‑day breach.

Coinccino Insight

“Designing models to be 'jailbreak-proof' is like fortifying the front gate while leaving a window ajar. The launch-day breach underscores that safety claims must be validated under real-world pressure. This is a critical reminder: AI security is never done—it evolves.”


Why It Matters Globally

  • U.S.: Trusted AI applications, from healthcare to government, need verifiable safety, not just marketing.
  • UAE: With ambitions in AI governance and smart‑city deployment, this breach highlights regulatory urgency.
  • India: As dependably safe AI becomes a linchpin of digital transformation, firms and developers must build privacy‑preserving, robust guardrails.