Claude Mythos Preview beats Opus 4.6 by double digits on every coding benchmark Anthropic released. Then they shelved it. Here's what the numbers actually show, and why the shipping decision matters more than the launch.
Last updated April 7, 2026. All benchmark data sourced from Anthropic's Project Glasswing announcement and the Claude Mythos Preview system card. See the model profile on BenchLM.
A model autonomously found a remote crash bug in OpenBSD, one of the most security-hardened operating systems on earth, in code that had survived 27 years of human review.
That same model found a 16-year-old vulnerability in FFmpeg, in a single line of code that automated fuzzers had hit five million times without ever flagging. Then it located several Linux kernel vulnerabilities and chained them together to escalate from ordinary user access to full machine control. No human steering. No prompting tricks. The model just did it.
Anthropic built that model, watched it do all of this, and decided not to release it. They are calling it Claude Mythos Preview, and it is the most important thing Anthropic has announced this year — not because of what it can do, but because of what they chose not to do with it.
That's the part of the announcement worth paying attention to.
Claude Mythos Preview is an unreleased frontier model from Anthropic, announced April 2026 as part of Project Glasswing. Glasswing is a coordinated industry effort built around Mythos that includes twelve launch partners: Amazon Web Services, Anthropic, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorgan Chase, the Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks. Forty additional organizations that build or maintain critical software infrastructure also have access. Anthropic put $100M of model usage credits behind it, and donated another $4M directly to open-source security organizations.
Mythos is general purpose. It is not a cybersecurity-specialized variant, not a fine-tune, not a distilled cousin of a larger model. The cyber capabilities fall out of the underlying coding and reasoning skill, which is exactly why the announcement matters far beyond cybersecurity. Anything a model this capable does in one domain, it can probably do in every adjacent one.
Pricing for Glasswing participants is $25 per million input tokens and $125 per million output tokens. That puts it at roughly 1.7× the cost of Claude Opus 4.6, which is a useful tell about the compute weight behind it. Anthropic does not generally charge a 70% premium for incremental improvements.
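The premium is easy to sanity-check. The Opus 4.6 rates below ($15 input / $75 output per million tokens) are an assumption implied by the stated 1.7× figure, not numbers from the announcement:

```python
# Assumed Opus 4.6 list rates ($/M tokens); Mythos rates are from the
# Glasswing announcement. The ~1.7x premium falls out on both sides.
OPUS = {"input": 15.0, "output": 75.0}      # assumption
MYTHOS = {"input": 25.0, "output": 125.0}   # stated

for side in ("input", "output"):
    print(side, round(MYTHOS[side] / OPUS[side], 2))  # → 1.67 for both
```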
Then there is the line that should make any LLM observer stop and re-read it. From the announcement: Anthropic does not plan to make Mythos generally available. The safeguards required to deploy it safely are not ready. Their stated plan is to ship those safeguards alongside a future Opus release, refine them on a model that does not pose the same level of risk, and revisit Mythos-class deployment after that work is done.
Frontier labs do not usually do this. The last time a major lab publicly held back a model in a comparable way was OpenAI's staged GPT-2 release in 2019, and that turned out to be partly performative. This does not read like performance. It reads like a real call.
So how much better is Mythos actually, and what does "the safeguards aren't ready" mean in practice? The benchmarks answer the first question. The second question is where the post earns its keep.
Anthropic released ten head-to-head comparisons against Claude Opus 4.6, their previous best model and, until last week, the model topping or near-topping most public coding leaderboards. The benchmark gaps are larger than any single-generation jump Anthropic has shown publicly. The full table:
| Benchmark | Mythos Preview | Opus 4.6 | Gap |
|---|---|---|---|
| SWE-bench Verified | 93.9% | 80.8% | +13.1 |
| SWE-bench Pro | 77.8% | 53.4% | +24.4 |
| SWE-bench Multilingual | 87.3% | 77.8% | +9.5 |
| SWE-bench Multimodal | 59.0% | 27.1% | +31.9 |
| Terminal-Bench 2.0 | 82.0% | 65.4% | +16.6 |
| GPQA Diamond | 94.6% | 91.3% | +3.3 |
| HLE (no tools) | 56.8% | 40.0% | +16.8 |
| HLE (with tools) | 64.7% | 53.1% | +11.6 |
| BrowseComp | 86.9% | 83.7% | +3.2 |
| OSWorld-Verified | 79.6% | 72.7% | +6.9 |
Raw numbers without context are how AI marketing happens, so let's actually read these.
SWE-bench Verified is approaching saturation at the top of the leaderboard. A 13-point gap there is bigger than the number suggests, because the easy problems are already solved — what remains is the hard tail of real GitHub issues that defeated previous models. Going from 80.8% to 93.9% is not 13 percent more capability. It is closer to "the model now handles a category of problems the previous generation could not touch."
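One way to make that saturation point concrete is to compare failure rates rather than pass rates, using the scores from the table above:

```python
# On a near-saturated benchmark, the unsolved share is the better lens.
opus_unsolved = 100 - 80.8    # 19.2% of Verified problems unsolved
mythos_unsolved = 100 - 93.9  #  6.1% unsolved
cut = 1 - mythos_unsolved / opus_unsolved
print(f"Mythos clears {cut:.0%} of the hard tail Opus 4.6 left behind")
```

By that lens the 13-point gap is roughly a two-thirds reduction in failures, which is the more honest way to read movement near a ceiling.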
Verified is not where the most interesting signal lives, though. SWE-bench Pro is. The Pro split is harder, less saturated, less likely to appear in training data, and more representative of the kind of multi-step engineering work that actual production agents struggle with. Mythos posts +24.4 points on Pro. That is the largest single-generation jump on that benchmark since it was introduced. Anyone who has been benchmarking agent stacks for the last two years knows how hard Pro has been to move.
SWE-bench Multimodal is where the gap goes from notable to genuinely strange. Multimodal asks the model to read screenshots, parse UI state, and work across visual and textual context simultaneously. It is the benchmark closest to what coding agents actually do in production, because most real coding work is no longer pure text-in, text-out — it is "look at this terminal error, look at this stack trace, look at this design mockup, fix the thing." Mythos goes from Opus 4.6's 27.1% to 59.0%. That is more than double. For anyone shipping coding agents into real environments, this is the gap that matters.
Terminal-Bench 2.0 tells the same story from a slightly different angle. Terminal-Bench measures whether a model can inspect environments, run commands, debug failures, and recover from errors across multi-step workflows. It is a coding agent quality benchmark, not a code generation benchmark. Mythos lands at 82.0%, with a footnote disclosing 92.1% under longer-timeout conditions. Either number puts it well above any released model.
Worth flagging honestly: Anthropic ran memorization screens on the SWE-bench evaluations and reported that the gap holds even after excluding flagged problems. They expected that question and answered it preemptively. It is the kind of thing a lab does when it knows the numbers are going to be scrutinized.
GPQA Diamond comes in at 94.6 vs 91.3. Both models sit inside the noise range at the top of the benchmark. GPQA is saturated for the frontier tier. There is nothing useful to read out of this comparison and you should distrust anyone who tries.
Humanity's Last Exam is the more interesting result. HLE is the least saturated frontier knowledge benchmark currently in circulation — most current frontier models score somewhere between 10% and 46%. Mythos hits 56.8% without tools and 64.7% with tools, against Opus 4.6's 40.0% and 53.1%. A 17-point jump on HLE without tools is the kind of move that does not show up often, and would normally be the headline number from any other lab.
Then there is the caveat Anthropic disclosed themselves: Mythos performs well on HLE even at low effort, "which could indicate some level of memorization." They wrote that. We are passing it on. Read the HLE result as a strong upper bound, not a guaranteed real-world capability. The fact that Anthropic flagged it themselves is itself a small signal of seriousness — a less safety-minded lab would have buried that footnote or omitted it entirely.
BrowseComp shows Mythos at 86.9% and Opus 4.6 at 83.7%. A three-point gap. Nothing dramatic on the surface. Then you read the footnote.
Mythos uses 4.9× fewer tokens to reach that score.
Same accuracy, one-fifth the tokens. That is not a smarter model in the conventional sense. That is a model that has stopped wandering. Token efficiency at parity capability is a metric most coverage will skip because it does not fit cleanly on a leaderboard, but for anyone running agents at scale it is the most economically meaningful number in the entire announcement. It changes the unit economics of every multi-step agentic workload. A workflow that costs $0.40 in Opus 4.6 tokens potentially costs around $0.14 in Mythos tokens, even after the higher per-token price. The math gets uncomfortable for anyone who priced their agent product against last year's token assumptions.
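A back-of-envelope version of that unit-economics claim, assuming Opus 4.6 rates of $15/$75 per million tokens (implied by the stated 1.7× premium, not confirmed in the announcement) and a hypothetical workflow footprint:

```python
def workflow_cost(in_tok, out_tok, in_rate, out_rate):
    """Dollar cost of one workflow; rates are $/M tokens."""
    return (in_tok * in_rate + out_tok * out_rate) / 1_000_000

# Hypothetical multi-step workflow on Opus 4.6: 20k input, 4k output tokens.
opus = workflow_cost(20_000, 4_000, 15, 75)
# Same task with the BrowseComp footnote's 4.9x token reduction,
# billed at Mythos rates.
mythos = workflow_cost(20_000 / 4.9, 4_000 / 4.9, 25, 125)
print(f"${opus:.2f} -> ${mythos:.2f} ({mythos / opus:.0%} of the cost)")
```

The exact ratio depends on the input/output mix, but with a uniform 1.7× premium it simplifies to roughly 1.7 / 4.9 ≈ 0.35 on any workload where the token reduction holds.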
OSWorld-Verified goes from 72.7% to 79.6%. OSWorld measures whether the model can actually operate a software interface — click the right thing in a real GUI, not just describe what it would click. The jump from "describes the UI" to "operates the UI" is the agent moat, and it is where most current frontier models still break down. Seven points on OSWorld is closer to a step change than the number sounds.
The single most revealing entry is not a benchmark number at all. It is the BrowseComp footnote. Same accuracy, one-fifth the tokens. Every other entry on the table reads as "this model is better." That one line reads as "this model is doing something different." Whatever that difference is — better tool-use discipline, fewer redundant search calls, sharper stopping conditions, some change in how the model decides when it has enough information — it is the architectural signal in the release. The benchmark gaps are the headline. The token efficiency is the story underneath.
Anthropic's announcement compares Mythos only against Opus 4.6, which is convenient framing for a launch. The harder question is where it lands on the broader frontier, against the rest of the current SOTA cohort.
Pulling current top SWE-bench Verified scores from the BenchLM coding leaderboard: GPT-5.3 Codex sits around 85, GPT-5.4 around 81, Opus 4.6 at 80, Gemini 3.1 Pro in the same band. Mythos at 93.9 is not an incremental step on the leaderboard. It would be the largest single-model lead any frontier lab has held since the original GPT-4 launch in 2023, and arguably a wider gap than that one was relative to the field at the time.
On Terminal-Bench 2.0, the comparison cohort is the agentic coding stack — GPT-5 Codex variants, Gemini 3.1 Pro, recent Claude. Mythos at 82% (or 92.1% with the extended-timeout footnote) sits clearly above where any released model has scored. Terminal-Bench is the benchmark most directly relevant to coding agent quality in real environments, and the gap is wide enough to be the difference between "agent works most of the time" and "agent works almost always."
On HLE, current frontier sits mostly under 50%. Mythos at 64.7% with tools is genuinely an outlier, even after the memorization caveat. If even half of that gap is real and not a memorization artifact, it represents the largest jump on HLE since the benchmark launched.
The honest framing: this is a private model running under conditions Anthropic chose. Take the numbers seriously, but treat them as upper bounds. Cherry-picked conditions on internal evaluations are how every frontier lab presents its work. Anthropic is no exception, and being the most safety-vocal lab does not exempt them from the standard pinch of salt. That said, the consistency of the gap across ten benchmarks spanning six families is harder to fake than a single eye-popping result on one benchmark would be.
The practical question is what changes if a model like this exists in production. Two answers, both large. Coding agents that no longer need babysitting on multi-step tasks. Security teams that can audit codebases at a speed no human team can match. Either alone would justify the cost. Both together is why Anthropic is moving so cautiously.
This is the section most coverage will skip. We are not skipping it.
Quoting Anthropic's own framing, paraphrased for length: Mythos Preview will not be made generally available, the safeguards required to deploy it safely are not yet built, and the plan is to ship those safeguards alongside an upcoming Opus release. The Opus model becomes the testing ground for the safeguards. Once the safeguards have been refined on a model that does not pose the same level of risk, Anthropic will revisit Mythos-class deployment.
That sequence of decisions is remarkable for a frontier lab. A model was built, evaluated, and the answer was not yet. The 90-day public reporting commitment that comes with Glasswing makes the decision auditable in a way that the staged GPT-2 release in 2019 was not. There will be a public report. The vulnerabilities patched will be enumerated. The lessons learned will be shared with industry partners. None of this is the shape of a marketing exercise. It is the shape of a lab trying to build a public record of how it handled a model it considered too capable to ship without more work.
Whether that judgment turns out to be correct is a separate question. The fact that the judgment was made at all is the signal.
The capabilities that let Mythos find a 27-year-old OpenBSD bug for defenders are exactly the same capabilities that let it find a 27-year-old OpenBSD bug for attackers. There is no version of Mythos that is good at finding vulnerabilities only for the good guys. This is the central tension of every dual-use capability, and it is why Project Glasswing exists in the form it does.
Glasswing is a structured bet that getting the model into the hands of defenders first, in coordination, lets the patch wave outrun the exploit wave. The bet is real and not obviously winnable. It might not work.
The window between vulnerability discovery and active exploitation has been collapsing for years. CrowdStrike's CTO put a number on it in the announcement: what used to take months now happens in minutes. If exploit generation gets cheap before patching gets cheap, defenders lose ground even with the model in their hands. Patching is hard in ways that exploiting is not — patches need to be tested, distributed, and applied across millions of endpoints, and the slowest endpoint in the chain determines the actual security of the system. Exploits only need to find one unpatched target. The race condition is structural, and Glasswing is an explicit attempt to win it by months rather than letting it run on default timing.
From the Anthropic announcement: "it will not be long before such capabilities proliferate, potentially beyond actors committed to deploying them safely."
Read that sentence carefully. The framing is not "if." It is "when." The implicit timeline of Project Glasswing is months, not years.
For everyone outside the twelve launch partners, that means the patching window for legacy systems just got measurably shorter. Open-source maintainers without dedicated security teams are the most exposed surface, which is exactly why Anthropic donated $2.5M to Alpha-Omega and OpenSSF through the Linux Foundation, and another $1.5M to the Apache Software Foundation. Most coverage will treat those donations as a footnote. They are the loudest part of the announcement if you are paying attention. They tell you which threat model Anthropic is actually pricing into the rollout.
The implicit message: if you maintain critical open-source infrastructure and you are not currently equipped to defend against AI-augmented vulnerability research, the next twelve months are going to be uncomfortable, and Anthropic would like to help fund the equipment to defend against it.
Anthropic's line in the announcement: "frontier AI capabilities are likely to advance substantially over just the next few months."
If Mythos is +13 points on SWE-bench Verified over Opus 4.6, the next jump is not going to be smaller. SWE-bench Pro at +24 is the leading indicator, because Pro is where the headroom is. The next model's gap will show up there first, before it shows up on the saturated benchmarks where the easy wins are already taken.
The honest read for anyone building on top of frontier models: the agent quality you are testing against today is not the agent quality you will be shipping against in six months. Re-baseline your evaluations now, while you have time to do it cleanly. The multi-step task your agent fails 40% of the time on today is the task the next model will fail 15% of the time on. The thing your moat depends on is probably less defensible than it looked last quarter.
This is the part of the conversation that most product teams have not internalized yet. Capability overhang means the floor moves faster than the ceiling, and the floor is what most products are actually competing on.
A general-purpose model that finds zero-days in every major operating system is also a general-purpose model that does everything else at the same level.
The cybersecurity story is the most marketable framing for Mythos because it has clear villains, clear heroes, and a clean narrative arc. It also conveniently focuses attention on a specific risk that has well-understood mitigations: coordinated disclosure, patching, defender access, donations to OSS security. These are real, important things. They are also the easiest part of the conversation to have.
The harder conversation is what a model this capable means for software engineering as a profession. For research science. For every domain that runs on highly skilled humans doing specialized reading-and-reasoning work. The Anthropic announcement does not pretend to answer that, and frankly, no current announcement from any lab does. But the same model that finds a 27-year-old OpenBSD bug is the same model that writes the next ten thousand pull requests on the Linux kernel, and the same model that does the literature review for the next biology paper, and the same model that audits the next financial system for compliance gaps.
Anthropic is not going to lead with that pitch. No one is. But it is the same model.
Apply to Glasswing partner programs now. The $100M usage credit pool is real and the access is real. If you maintain critical open-source infrastructure, the Claude for Open Source program is your access path. Don't wait for general availability — that is not coming on the timeline you are hoping for, and even when it does come, it will come with safeguards designed to limit the offensive capabilities you actually need access to.
Between now and the 90-day Glasswing public report, the most valuable thing your team can do is build the patching infrastructure to absorb a flood of newly disclosed vulnerabilities. The bottleneck for the next year is not finding bugs. It is patching them fast enough.
Stop benchmarking your agent stack against last year's models. Re-baseline every multi-step coding task you have against Opus 4.6 first, because that is the model you can actually run today. Then mentally apply the Mythos delta to project where the next public release will land — and apply it to SWE-bench Pro, not Verified, because Pro is where the headroom shows up.
The tasks that fail today are about to mostly succeed. Your eval harness needs to be ready before the model is. If you have not moved your evals off of HumanEval and onto something agentic in the last year, that is the project that should be on top of your stack this week.
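A minimal sketch of what re-baselining looks like in practice: run a fixed multi-step task suite against every model you care about and record pass rates, so the next release's delta is a measured number instead of a feeling. `run_task` here is a hypothetical hook into your own agent harness:

```python
def rebaseline(models, tasks, run_task):
    """Pass rate per model over a fixed task suite.

    run_task(model, task) -> bool is whatever harness you already use
    to drive an agent end to end (hypothetical stand-in here).
    """
    return {
        m: sum(bool(run_task(m, t)) for t in tasks) / len(tasks)
        for m in models
    }

# Stubbed demo results for five tasks; replace with real agent runs.
demo_results = {"opus-4.6": [1, 0, 1, 1, 0], "next-model": [1, 1, 1, 1, 0]}
rates = rebaseline(demo_results, range(5), lambda m, t: demo_results[m][t])
print(rates)  # {'opus-4.6': 0.6, 'next-model': 0.8}
```

The point of the fixed suite is that when the next model lands, the only variable that changed is the model.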
Don't price Mythos-class capability on the rate card. Mythos is $25/$125 per million tokens for Glasswing — premium pricing on its face. But the BrowseComp footnote (4.9× fewer tokens at the same accuracy) means real per-completed-task cost may be flat or even lower than Opus 4.6 on the workloads where Mythos pulls ahead. Price your evaluations on tokens-per-completed-task, not tokens-per-million.
If your finance team is looking at the rate card and balking at the 1.7× input premium, they are reading the wrong number. The right number is total tokens consumed to complete the actual workflow you care about, and on a non-trivial fraction of workflows that number is going down, not up.
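Tokens-per-completed-task can be made precise by normalizing for success rate, since failed attempts still burn tokens. Every number below is an illustrative assumption, not a figure from the announcement:

```python
def cost_per_completed_task(tokens_per_attempt, rate_per_m, success_rate):
    """Expected $ per successful task. Failed attempts get retried,
    so expected attempts per success is 1 / success_rate."""
    return tokens_per_attempt * (1 / success_rate) * rate_per_m / 1e6

# Illustrative: 50k blended tokens/attempt at a $75/M blended rate with
# 60% task success, vs. 4.9x fewer tokens at $125/M with 80% success.
baseline = cost_per_completed_task(50_000, 75, 0.60)
mythos_like = cost_per_completed_task(50_000 / 4.9, 125, 0.80)
print(f"${baseline:.2f} vs ${mythos_like:.2f} per completed task")
```

On those assumptions, the 1.7× rate-card premium inverts into roughly a 4× cost advantage per completed task — which is the number a finance review should actually be looking at.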
The capability gap between frontier closed models and open-weight alternatives just widened again. Anyone whose moat depended on prompt-engineering against last-generation capability is in a worse position today than yesterday. The moats that hold are distribution, proprietary data, workflow lock-in, and integrations. Same as always, but the timeline for everything else just compressed.
The specific question to ask yourself this week: what part of your product becomes commoditized when the underlying model gets better at multi-step agentic work? If the answer is "most of it," your roadmap needs to change.
Three things, in order of importance.
Mythos Preview is a real jump, not a marketing one. The benchmark gaps over Opus 4.6 are larger than any single-generation gap Anthropic has shown publicly, and the gaps are largest exactly where headroom remains: agentic coding, multimodal context, the un-saturated reasoning benchmarks. The token efficiency footnote on BrowseComp is the most architecturally significant detail in the release, because it suggests the underlying model is not just bigger but operating differently — fewer wandering tool calls, sharper stopping conditions, better discipline under uncertainty. If that pattern holds across other workloads, the cost story for Mythos-class models is going to look very different from what the rate card suggests.
Anthropic shelving the model is the most important signal in the entire announcement. Frontier labs do not generally do this. The fact that they did it here, publicly, with a 90-day reporting commitment and $100M in usage credits routed to defenders first, tells you they think the risk profile of this generation is qualitatively different from the last one. Take that seriously even if you usually file AI safety messaging under marketing. The shape of this decision — donations to OSS maintainers, coordinated industry partnership, explicit refusal to ship — is not the shape a lab takes when it is performing caution. It is the shape a lab takes when it has internally concluded that the model is too capable for the current safeguard stack.
The next ninety days matter more than the announcement itself. Project Glasswing is a structured bet that defenders can use these capabilities to harden critical infrastructure faster than attackers can get equivalent capabilities. That bet is winnable but not won. Watch the 90-day public report. Watch the patch volume across major operating systems and browsers in the next quarter. Watch which open-source projects opt into the Claude for Open Source program. Watch whether other frontier labs follow Anthropic's lead on staged rollouts, or whether they ship competing capabilities the conventional way and break the coordination. The story is not the launch. It is the race that started the day the launch went live, and the shape of that race over the next quarter will tell us whether this kind of safeguard-first rollout is something the industry can actually sustain or whether it was a one-time thing.
The first frontier model Anthropic decided not to ship is also probably the first frontier model that genuinely was not ready to ship. Whether the next one will be depends on what happens in the next ninety days.
What is Claude Mythos Preview? Claude Mythos Preview is an unreleased frontier AI model from Anthropic, announced April 2026 as the centerpiece of Project Glasswing. It is a general-purpose model with substantially improved coding, reasoning, and agentic capabilities compared to Anthropic's previous best model, Claude Opus 4.6. Anthropic has stated it will not make Mythos generally available until additional safeguards are developed.
How much better is Mythos Preview than Claude Opus 4.6? Across the ten head-to-head benchmarks Anthropic released, Mythos beats Opus 4.6 by between 3.2 and 31.9 points. The largest gaps are on agentic coding tasks: +24.4 points on SWE-bench Pro and +31.9 on SWE-bench Multimodal. The smallest gaps are on already-saturated benchmarks like GPQA Diamond. On BrowseComp, Mythos matches Opus 4.6's accuracy while using 4.9× fewer tokens, which is the most economically significant result in the release.
Why is Anthropic not releasing Mythos Preview publicly? Anthropic has stated that the safeguards required to deploy a model with Mythos-level capabilities safely are not yet built. The plan is to ship those safeguards with an upcoming Opus release first, refine them on a model that does not pose the same level of risk, then revisit Mythos-class deployment. Glasswing partners get controlled access in the meantime for defensive cybersecurity work.
What is Project Glasswing? Project Glasswing is a coordinated industry effort built around Mythos Preview, with twelve launch partners including AWS, Apple, Google, Microsoft, Cisco, CrowdStrike, JPMorgan Chase, and the Linux Foundation. Forty additional organizations also have access. Anthropic committed $100M in model usage credits to Glasswing and $4M in direct donations to open-source security organizations. The goal is to give defenders coordinated early access to Mythos's vulnerability-finding capabilities before equivalent capabilities proliferate to less safety-focused actors.
Can I access Claude Mythos Preview? For most developers and businesses, no. Mythos is restricted to Glasswing partners and approved organizations doing defensive cybersecurity work. Open-source maintainers can apply through the Claude for Open Source program. Anthropic has not announced a timeline for general availability, and has stated that any future Mythos-class release will depend on safeguard development that is currently happening on a separate, unreleased Opus model.
How does Mythos Preview compare to GPT-5 and Gemini 3.1 Pro? On SWE-bench Verified, Mythos at 93.9% sits clearly above the current frontier cohort, which includes GPT-5.3 Codex at around 85%, GPT-5.4 at around 81%, and Gemini 3.1 Pro in the same band as Opus 4.6. If the benchmark numbers hold up under independent testing, Mythos represents the largest single-model lead any frontier lab has held since GPT-4 launched in 2023. Caveat: these are Anthropic's internal numbers under conditions Anthropic chose, and should be treated as upper bounds until independent evaluations are available.
Benchmark data sourced from Anthropic's Project Glasswing announcement and the Claude Mythos Preview system card. SOTA comparison data from the BenchLM leaderboards. For weekly updates on frontier model releases and benchmark changes, subscribe to the BenchLM newsletter.