Anthropic's Claude Fable 5 brings Mythos-class capability to public users, while Claude Mythos 5 remains trusted-access. The benchmark story is strong, but the real shift is capability-gated deployment.
Last updated June 9, 2026. Benchmark rows below use BenchLM's current mapped data for Claude Fable 5, Claude Mythos 5, and the surrounding frontier cohort. Product, safety, and availability details are sourced from Anthropic's launch announcement, the Claude Fable 5 model page, the Claude Mythos 5 model page, the Claude model documentation, and Anthropic's Claude Fable 5 and Claude Mythos 5 system card. Early-use observations are informed by Ethan Mollick's One Useful Thing post on working with Mythos-class Fable.
Two months ago, Anthropic's Mythos story was simple: the model class was too capable to release broadly.
Now Anthropic has released it anyway, but not as one model. It has split Mythos-class capability into two products. Claude Fable 5 is the generally available version, with safeguards around cyber, biology, chemistry, and distillation. Claude Mythos 5 is the restricted-access version, with some of those restrictions lifted for vetted partners.
That distinction is the whole story.
The benchmark headline is easy. Fable 5 and Mythos 5 are at the top of BenchLM's current Anthropic rows, ahead of Claude Opus 4.8 and the broader GPT/Gemini frontier cluster on the mapped launch data. The more important signal is not that Anthropic shipped a stronger Claude. It is that Anthropic shipped the public version of a model class it had previously treated as too risky for normal access.
The future of frontier AI is not one universal endpoint. It is capability routing.
Anthropic announced two new Claude 5 models on June 9, 2026.
Claude Fable 5 is the public model. Anthropic describes it as a Mythos-class model made safe for general use, with particular strength in software engineering, knowledge work, vision, scientific research, and long-running tasks. It is available as claude-fable-5 through the Claude API and major cloud marketplaces. Pricing is $10 per million input tokens and $50 per million output tokens, with a 90% input-token discount for prompt caching and 1.1x pricing for US-only inference.
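The rate card above reduces to simple arithmetic. What follows is a minimal sketch, not Anthropic's billing logic: the function name `fable_request_cost` and its parameters are illustrative, and it assumes the 90% discount applies only to input tokens read from the cache and that the 1.1x US-only multiplier scales the whole bill. Any separate charge for writing to the cache is ignored because the article does not mention one.

```python
# Rates quoted in the article for claude-fable-5, in USD per million tokens.
INPUT_PER_MTOK = 10.00
OUTPUT_PER_MTOK = 50.00
CACHE_DISCOUNT = 0.90       # 90% off input tokens served from the prompt cache
US_ONLY_MULTIPLIER = 1.1    # multiplier for US-only inference


def fable_request_cost(input_tokens, output_tokens, cached_input_tokens=0, us_only=False):
    """Estimate the USD cost of one request.

    Assumptions (not confirmed by the article): the cache discount applies
    only to the cached share of the input, and the US-only multiplier
    scales the entire bill.
    """
    if cached_input_tokens > input_tokens:
        raise ValueError("cached_input_tokens cannot exceed input_tokens")
    fresh_tokens = input_tokens - cached_input_tokens
    fresh_cost = fresh_tokens * INPUT_PER_MTOK / 1e6
    cached_cost = cached_input_tokens * INPUT_PER_MTOK * (1 - CACHE_DISCOUNT) / 1e6
    output_cost = output_tokens * OUTPUT_PER_MTOK / 1e6
    total = fresh_cost + cached_cost + output_cost
    return total * US_ONLY_MULTIPLIER if us_only else total
```

Under these assumptions, a request with one million uncached input tokens and one million output tokens comes to $60, and caching the full input brings the input share down from $10 to about $1.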
Claude Mythos 5 is the restricted model. Anthropic says it is the same underlying model, but with safeguards lifted in some areas for vetted users. Initial access is limited to Project Glasswing partners, government collaborators, and future trusted-access programs.
Fable's safety layer matters in practice. Anthropic says some cyber, biology, chemistry, and model-distillation requests are routed away from Fable 5 to Claude Opus 4.8 in some Claude clients. In the Messages API, the default behavior is stricter: high-risk requests are blocked unless the developer implements or opts into fallback. Anthropic also says routing should affect fewer than 5% of sessions on average, meaning more than 95% of Fable sessions should stay on Fable.
That is a new kind of frontier release. The public product is not just "the model." It is the model plus a classifier, a routing policy, a fallback model, and an access regime.
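For API developers, "implementing fallback" amounts to a small wrapper around the model call. The sketch below shows the shape of that wrapper only. The article does not describe how the Messages API signals a blocked request, so `HighRiskBlocked` and the caller-supplied `send` function are hypothetical stand-ins, and the fallback model ID is a placeholder rather than a confirmed identifier.

```python
from typing import Callable

PRIMARY_MODEL = "claude-fable-5"      # model ID given in the article
FALLBACK_MODEL = "claude-opus-4-8"    # placeholder ID, for illustration only


class HighRiskBlocked(Exception):
    """Hypothetical signal that the safeguard layer blocked a request."""


def complete_with_fallback(prompt: str, send: Callable[[str, str], str]) -> tuple[str, str]:
    """Try the primary model; if the request is blocked, retry once on the fallback.

    `send(model, prompt)` is supplied by the caller. It returns the reply
    text or raises HighRiskBlocked. Returns (model_used, reply).
    """
    try:
        return PRIMARY_MODEL, send(PRIMARY_MODEL, prompt)
    except HighRiskBlocked:
        # A block on the fallback model propagates to the caller unchanged.
        return FALLBACK_MODEL, send(FALLBACK_MODEL, prompt)
```

Returning the model that actually answered lets a team log its own fallback rate and compare it with the under-5% figure Anthropic cites.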
BenchLM now maps both the launch-table rows and the system-card rows that fit the public schema. The system card matters because it does not treat Fable and Mythos as interchangeable on every benchmark. Anthropic describes them as two configurations of a new model: Mythos 5 is the less restricted trusted-access configuration, while Fable 5 is the public production configuration with safeguards and fallback/block behavior.
| Model | Overall | Context | SWE-bench Verified | SWE-bench Pro | Terminal-Bench 2.1 | OSWorld-Verified | HLE with tools | FrontierCode Diamond | Price in/out |
|---|---|---|---|---|---|---|---|---|---|
| Claude Mythos 5 | 99 | 1M+ | 95.5 | 80.3 | 88.0 | 85.0 | 64.5 | - | restricted |
| Claude Fable 5 | 96 | 1M+ | 95.0 | 80.0 | 84.3 | 85.0 | - | 29.3 | $10/$50 |
| Claude Opus 4.8 | 93 | 1M | 88.6 | 69.2 | 74.6 | 83.4 | 57.9 | 13.4 | $5/$25 |
| Gemini 3.1 Pro | 90 | 1M | 75 | 72 | 77 | 76.2 | 40 | - | $2/$12 |
| GPT-5.5 | 89 | 1M | - | 58.6 | 82 | 78.7 | 52.2 | 5.7 | $5/$30 |
| GPT-5.4 | 89 | 1.05M | 84 | 57.7 | 75.1 | - | 52.1 | - | $2.50/$15 |
| Claude Opus 4.6 | 87 | 1M | 80.8 | 53.4 | 65.4 | 80 | 53 | - | $5/$25 |
The honest read is not "Fable wins everything, conversation over." The honest read is that the Mythos-class model is now visibly ahead in the places frontier systems still have headroom: long-horizon coding, agentic terminal work, hard knowledge, computer use, and grounded multimodal tasks.
SWE-bench Verified at 95.5 for Mythos 5 and 95.0 for Fable 5 is near the ceiling of the current public coding benchmark stack. SWE-bench Pro is more interesting because it is harder and less saturated: Mythos 5 reaches 80.3, while Fable 5 reaches 80.0. Terminal-Bench 2.1 shows the effect of production safeguards more clearly: Mythos 5 is at 88.0, while Fable 5 is lower at 84.3. OSWorld-Verified is tied at 85.0. HLE with tools is reported for Mythos 5 at 64.5; the system card does not report a Fable-specific HLE row in its summary table.
Raw numbers without context are how AI marketing happens, so the caveat belongs right next to the table: these are still mostly Anthropic-published rows. BenchLM can compare them to the current dataset, but independent third-party coverage is thin on day one. Treat the numbers as serious, not final.
In April, the Mythos story was about restraint. Anthropic had a model that outperformed Opus 4.6 by wide margins on coding and agentic benchmarks, then chose not to make it generally available. The reason was not that Mythos only worked for cybersecurity. It was a general-purpose model whose cybersecurity capability fell out of stronger coding, reasoning, and tool use.
That made the model awkward to deploy. The same capability that helps a security team find a decades-old vulnerability can help an attacker find it too. There is no clean technical line between "defensive vulnerability research" and "offensive vulnerability research" once the model is good enough.
Fable 5 is Anthropic's answer to that problem.
Instead of waiting until every safeguard was perfect, Anthropic split the deployment. Fable gets the public release, with fallback or blocking for some high-risk requests depending on the interface. Mythos keeps the more sensitive access path, initially through Glasswing and trusted programs. The risk did not disappear. It was moved into the product architecture.
That is the important shift. Anthropic did not just make a better model. It made a more complicated boundary around the model.
The clean mental model is this:
| Product | Underlying capability | Access | Safety posture |
|---|---|---|---|
| Claude Fable 5 | Mythos-class | General availability | Safeguards plus fallback or blocking |
| Claude Mythos 5 | Mythos-class | Vetted trusted access | Some restrictions lifted for sensitive work |
This is not a small naming distinction. It is a preview of how frontier labs may ship the next few generations of models.
For years, model releases were mostly understood as a single endpoint: GPT-4, Claude 3 Opus, Gemini 1.5 Pro, GPT-5, Claude Opus 4. The model name implied a roughly uniform capability surface. Everyone got the same basic intelligence, modulo rate limits, context windows, and enterprise controls.
Fable/Mythos is different. The same underlying intelligence can now appear as multiple products with different policies. Public users get the guarded version. Trusted users get a less restricted version. Some requests stay on the frontier model. Some get routed to a safer fallback. The product is no longer just weights and context window. It is access control.
That is probably where the field is going.
The most important Fable 5 claims are not about chat. They are about autonomy.
Anthropic's Fable page positions the model around long-running projects, multi-day coding sessions, testing its own work, checking visual outputs, reading documents, and operating across large contexts. The launch announcement uses a Stripe example: Fable reportedly helped migrate a 50-million-line Ruby codebase in roughly a day, compared with an internal manual estimate of about two months.
Treat vendor anecdotes carefully. They are selected for maximum impact. But the direction is consistent with the benchmark profile. Fable's strongest public signals are not old academic exams. They are the benchmarks closest to agent behavior: SWE-bench Pro, Terminal-Bench 2.1, OSWorld-Verified, BrowseComp, and multimodal coding.
That matters because the product category is changing. A model that answers questions well is a chatbot. A model that can work for hours, inspect its own failures, run tests, read screenshots, and keep moving is closer to an asynchronous worker.
The human workflow changes with that. The user stops supervising every step and starts reviewing completed work. The evaluation target changes too. It is less important whether the model sounds smart in a single turn and more important whether it finishes a real task without wandering, looping, or silently breaking something.
If Fable-class models become normal, agent benchmarks become the main battlefield.
Benchmarks tell you whether the model can finish a task. They do not tell you what the work feels like when the model is actually doing it.
That is why Ethan Mollick's early-access writeup for One Useful Thing is useful. Mollick tested Fable outside the cybersecurity domain, partly because Fable's public guardrails make serious cyber work difficult. His conclusion was not just that Fable is better than prior public models. It was that the relationship between user and model changes when the model can run long, delegate subtasks, and return finished artifacts.
That matches the benchmark profile. SWE-bench Pro, Terminal-Bench, OSWorld, BrowseComp, and multimodal coding are all proxies for the same thing: can the model keep a messy project moving when success requires many small choices? Mollick's examples show what that looks like in practice.
The most accessible examples were games. He asked Fable, through Claude Code, to build playable projects from vague prompts, then gave only minor feedback. The interesting part is not that the games existed. Toy game generation has been possible for a while. The interesting part is that Fable had to build art and 3D assets without an image generator, using code and math rather than external media. That is the multimodal-grounded coding story in miniature: the model is not simply writing functions, it is making aesthetic and implementation choices across a whole experience.
The more serious example was an isochrone map. Those maps show how far someone can travel within a given amount of time. Building a credible one requires research, transportation assumptions, routing approximations, visual design, and a lot of boring judgment. Mollick reports that Fable used sub-agents to gather travel data, including thousands of specific flights, major rail schedules, and country-level road-speed information, while continuing to code the project. When he pushed it to improve remote-location coverage, it spun up more agents to research and cross-check edge cases.
That is exactly the behavior the benchmark table cannot show. A normal model answers. A stronger coding model writes a component. A Fable-class system starts to behave like a small project team: researcher, engineer, tester, visual designer, and documentation writer bundled into one run.
The final example matters more for work. Mollick asked Fable to build software for calibrating human and AI judgments across messy research datasets. Fable first produced a long design document, then ran for nine and a half hours. The result, called Concord, was not perfect. Mollick found errors and gaps. But the scope of the delivered artifact was larger than what he had seen from earlier models, and he framed the remaining work as something a software engineer could clean up rather than something that invalidated the project.
That is the future-of-work signal. Fable does not remove the need for expertise. It changes where the expertise is applied. The expert is less often writing every line or checking every intermediate decision. The expert is specifying the target, reviewing the artifact, identifying subtle errors, and deciding whether the result is good enough to use.
This is not automatically better. It is a different control surface.
Mollick's strongest concern is also the one buyers should pay attention to: the more autonomous the run, the less visible the process becomes. Fable makes hundreds of small choices while the user is not watching. Some are research choices. Some are implementation choices. Some are taste choices. Some are assumptions buried in code or generated data. You can ask for edits at the end, but that is not the same as steering the work as it happens.
That is why "agentic" should not be treated as a pure positive. Agentic capability gives you leverage. It also creates review debt. The model may do more of the work, but the human still owns the consequences of the work. If the task is a game prototype, that is fine. If the task is financial analysis, security triage, medical research, or compliance work, the review layer becomes the product.
This is the real meaning of long-horizon AI. The model is not just better at reasoning. It is better at turning a short brief into an artifact large enough that the user cannot casually audit the whole process.
Mythos 5 is most politically important in the domains most teams will never touch directly: cybersecurity, biology, and healthcare.
Anthropic says Mythos Preview helped find more than 10,000 high- and critical-severity vulnerabilities across important software. The Mythos 5 page frames the new model as state of the art for cybersecurity, biology research, and healthcare, but access remains limited to vetted organizations. Project Glasswing has expanded beyond the original launch group, and Anthropic is turning trusted access into a formal program.
That is not just safety branding. It is an admission that the strongest models are now dual-use by default.
For security teams, a model that finds vulnerabilities at scale is useful. For attackers, the same model is useful. For biology researchers, a model that accelerates hypothesis generation and experimental design is useful. For bad actors, the same underlying capabilities can be dangerous. The model does not know which side of the institutional boundary the user is on. Anthropic has to enforce that boundary outside the model.
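That means the future of frontier AI safety is not only refusals. It is also the machinery around the model: classifiers, routing policies, fallback models, and vetted access programs.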
This is much less elegant than a single "safe model" story. It is also much closer to how powerful technology is usually deployed.
Fable 5 is not cheap, and Mythos 5 has no public rate card at all. At $10 per million input tokens and $50 per million output tokens, Fable costs twice as much as Opus 4.8 ($5/$25) on both sides of the meter. It is also above GPT-5.5 ($5/$30) and far above Gemini 3.1 Pro ($2/$12).
That makes the rate card look simple: use Fable only when you need the best model.
The workflow economics are less simple.
For long agentic tasks, token price is not the right unit. The right unit is cost per completed task. A model that costs twice as much per token can still be cheaper if it needs fewer retries, fewer tool calls, less human repair, and less re-prompting. Anthropic's own positioning leans into this: the model is meant to handle harder, longer work where failure is expensive.
Mollick's testing adds a second cost caveat: Fable appears capable of burning through tokens very quickly on ambitious projects. That is not surprising. Long-running agents are not one model call. They are search, planning, coding, testing, sub-agent delegation, progress notes, and repeated verification. The cost profile looks less like "chat completion" and more like "compute job."
That changes how teams should budget. If a model runs for nine hours and delegates research to cheaper models, the headline Fable rate does not tell you the total cost. You need to know how often it calls sub-agents, how much context it keeps alive, how much intermediate work it writes, how many verification passes it performs, and how many fallback events happen when a guardrail triggers.
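One way to make that accounting concrete is a per-run ledger that prices every model call and groups the spend by role. This is a rough sketch under stated assumptions: the per-token prices come from the comparison table earlier in this report, the dictionary keys are plain labels rather than API model IDs, and the token counts are invented for illustration.

```python
from collections import defaultdict

# USD per million tokens (input, output), from the comparison table in this report.
# Keys are labels for this sketch, not confirmed API model IDs.
PRICES = {
    "claude-fable-5": (10.00, 50.00),
    "claude-opus-4.8": (5.00, 25.00),
    "gemini-3.1-pro": (2.00, 12.00),
}


def run_cost(calls):
    """Sum the cost of one agent run.

    `calls` is an iterable of (model, role, input_tokens, output_tokens).
    Returns (total_usd, usd_by_role).
    """
    by_role = defaultdict(float)
    for model, role, tokens_in, tokens_out in calls:
        price_in, price_out = PRICES[model]
        by_role[role] += (tokens_in * price_in + tokens_out * price_out) / 1e6
    return sum(by_role.values()), dict(by_role)


# Invented token counts for a single long run, for illustration only.
calls = [
    ("claude-fable-5", "planner", 400_000, 60_000),
    ("gemini-3.1-pro", "sub-agent", 2_000_000, 150_000),
    ("claude-fable-5", "verification", 300_000, 20_000),
    ("claude-opus-4.8", "fallback", 50_000, 5_000),
]
total, by_role = run_cost(calls)
```

With these made-up counts, the cheaper sub-agent work costs nearly as much as the Fable planner simply because of volume, which the headline rate alone would hide.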
The cheapest version of a Fable workflow may not be "use Fable for everything." It may be a router that uses Fable as the planner and final reviewer, cheaper Claude or Gemini rows for narrow subtasks, and a stricter verification model for high-risk output. That is also what Anthropic's product architecture implies. Fable itself is already a routed system. The user-facing API may become a higher-level agent runtime where the named model is only the visible coordinator.
This is where most model-selection spreadsheets break. They multiply input tokens by price, output tokens by price, and stop there. That works for summarization. It does not work for multi-hour coding agents. If a cheaper model fails three times and the expensive model succeeds once, the cheaper model was not cheaper.
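For Fable 5, the practical evaluation should be cost per completed task: count retries, tool calls, human repair, review time, and fallback events alongside the token bill.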
That is how frontier-model buying will increasingly work. The model with the highest token price may still win the budget conversation if it reduces operational drag.
The inverse is also true. The most capable model can lose the budget conversation if it creates too much review debt. If Fable produces a sophisticated artifact that takes an expert three hours to inspect, that review time is part of the cost. If it creates code that works but hides fragile assumptions in generated data, the audit cost is part of the cost. If guardrails route a benign request to a weaker model too often, that friction is part of the cost.
The right metric is not dollars per million tokens. It is dollars per reviewed, accepted artifact.
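That metric can be written down directly. The sketch below is one possible formulation, not a standard one: it assumes failed runs are detected cheaply and simply retried (so the expected number of runs per completed artifact is 1 / success_rate), that every completed artifact gets a human review, and that rejected artifacts are redone from scratch. All dollar figures are illustrative.

```python
def cost_per_accepted_artifact(run_cost_usd, success_rate, review_hours,
                               reviewer_rate_usd, acceptance_rate=1.0):
    """Expected USD per artifact that is completed, reviewed, and accepted.

    Assumptions: independent retries until a run completes, one human
    review per completed artifact, and rejected artifacts restart the
    whole cycle.
    """
    if not 0 < success_rate <= 1 or not 0 < acceptance_rate <= 1:
        raise ValueError("rates must be in (0, 1]")
    compute_cost = run_cost_usd / success_rate        # includes failed attempts
    review_cost = review_hours * reviewer_rate_usd    # paid once per completed artifact
    return (compute_cost + review_cost) / acceptance_rate


# A cheaper model that succeeds one run in four, with a quick review.
cheap = cost_per_accepted_artifact(20.0, 0.25, review_hours=1, reviewer_rate_usd=100.0)    # 180.0
# A pricier model that succeeds first time but needs a three-hour expert review.
premium = cost_per_accepted_artifact(40.0, 1.0, review_hours=3, reviewer_rate_usd=100.0)   # 340.0
```

In this invented scenario the retries quadruple the cheap model's compute bill, yet the premium model still loses because review time dominates. Shrink the review to one hour and the premium model wins at $140.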
The easy prediction is that models will keep getting smarter. That is true and not very useful.
The more useful prediction is that frontier AI will become more conditional. You will not just ask "which model is best?" You will ask which version of the model you are allowed to use, which safeguards apply, whether your request triggers fallback, whether your organization qualifies for a trusted tier, and whether your use case is sensitive enough to require a special access path.
Fable 5 and Mythos 5 point to five changes.
First, frontier models become gated systems. The public endpoint is no longer the full story. Labs will increasingly expose different capability surfaces to different users.
Second, routing becomes part of model quality. A model can be excellent and still frustrating if benign requests are routed away too often. Safety classifiers and fallback behavior will become user-experience issues, not just policy details.
Third, agents become the real benchmark surface. The models that matter will be the ones that can finish messy, long-running tasks. SWE-bench Pro, Terminal-Bench, OSWorld, BrowseComp, and domain-specific agent evals will matter more than legacy one-shot exams.
Fourth, enterprise AI shifts from copilots to work queues. The productivity jump is not just "autocomplete got better." It is "assign the model a migration, an audit, a research brief, or a refactor and review the result later."
Fifth, independent evaluation gets more important. If model behavior depends on access tier and fallback routing, provider benchmark claims get harder to interpret. A benchmark run on Mythos 5 may not describe what a normal user sees on Fable 5 for sensitive tasks.
That last point matters for BenchLM too. The leaderboard has to track not just model capability, but deployment reality. A restricted model and a public guarded model may share benchmark rows, but they are not the same product for users.
There is a sixth change that belongs next to the first five: interfaces become the bottleneck.
Mollick's post makes this hard to ignore. When a model can work for hours, spawn helpers, and make hundreds of micro-decisions, the normal chat interface is too thin. A transcript is not enough. A progress log is not enough. Users need ways to inspect assumptions, pause branches, compare alternatives, set budgets, approve external data sources, and audit intermediate artifacts without reading every token.
This is where the next product fight probably happens. The model labs will keep racing on capability. The application layer will race on controllability. Whoever gives users the best way to supervise long-running model work may capture more value than whoever has the best raw chat answer.
For coding agents, that means diff-aware review, test evidence, reproducible command logs, dependency explanations, and rollback plans. For research agents, it means source provenance, assumption ledgers, confidence intervals, and methods that can be re-run. For business workflows, it means approval gates, cost caps, audit trails, and policy-aware routing. None of those are benchmark scores, but they decide whether the model can be trusted in production.
This is why Fable/Mythos is bigger than a leaderboard update. It points to a future where the core product is not "ask a smarter model." The product is "operate a smarter model without losing control of the work."
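The approval gates, cost caps, and audit trails mentioned for business workflows can be sketched in a few lines. This is a toy illustration of the control pattern, not any vendor's feature: `RunBudget`, `BudgetExceeded`, and the thresholds are hypothetical names and values chosen for the example.

```python
class BudgetExceeded(Exception):
    """Raised when a step would push the run past its hard cost cap."""


class RunBudget:
    """Track spend for one agent run, gate large steps, and keep an audit trail."""

    def __init__(self, cap_usd, approval_threshold_usd):
        self.cap_usd = cap_usd
        self.approval_threshold_usd = approval_threshold_usd
        self.spent_usd = 0.0
        self.audit_log = []  # (step, cost_usd, outcome) tuples

    def charge(self, step, cost_usd, approved=False):
        """Record a step's cost. Returns False if the step needs human approval."""
        if cost_usd >= self.approval_threshold_usd and not approved:
            self.audit_log.append((step, cost_usd, "needs_approval"))
            return False
        if self.spent_usd + cost_usd > self.cap_usd:
            self.audit_log.append((step, cost_usd, "blocked_by_cap"))
            raise BudgetExceeded(f"{step} would exceed the ${self.cap_usd:.2f} cap")
        self.spent_usd += cost_usd
        self.audit_log.append((step, cost_usd, "charged"))
        return True
```

An agent runtime would call `charge` before each expensive step, pause when it returns `False`, and stop when it raises. The log gives a reviewer a record of what ran, what was held for approval, and what was blocked.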
Developers should test Fable 5 on the tasks where current agents fail: multi-file refactors, long migrations, frontend work that requires visual checking, terminal-heavy debugging, and codebase-wide changes. Do not waste the first evaluation on simple snippets. That is not where the model is supposed to matter.
For developer teams, Mollick's examples suggest a useful eval pattern: ask for a whole artifact, not a snippet. Have the model build the thing, document its assumptions, run tests, and produce a reviewable handoff. Then score the handoff, not only the code. Does it tell you what it did? Can another engineer pick it up? Are the data sources listed? Are the hidden assumptions visible? Does it leave a clean path for human correction?
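The five handoff questions above translate naturally into a checklist that reviewers fill in per artifact. The following is a minimal sketch: the class and field names are invented for illustration, and equal weighting of the five checks is an assumption rather than a recommendation from Mollick or Anthropic.

```python
from dataclasses import dataclass


@dataclass
class HandoffReview:
    """One reviewer's answers to the five handoff questions."""
    explains_what_it_did: bool
    another_engineer_can_continue: bool
    data_sources_listed: bool
    assumptions_visible: bool
    clean_path_for_correction: bool

    def score(self) -> float:
        """Fraction of checks passed, with every check weighted equally."""
        checks = list(vars(self).values())
        return sum(checks) / len(checks)


review = HandoffReview(
    explains_what_it_did=True,
    another_engineer_can_continue=True,
    data_sources_listed=False,
    assumptions_visible=True,
    clean_path_for_correction=False,
)
handoff_score = review.score()  # 0.6
```

Tracking this score alongside test pass rates separates artifacts that merely run from artifacts a colleague can actually take over.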
Security teams should watch the trusted-access path. If Anthropic's vulnerability-finding claims hold up, the bottleneck shifts from discovery to triage, patching, and rollout. Finding more bugs is not useful if your organization cannot fix them fast enough.
AI product builders should assume the floor just moved. If your product mostly wraps last-generation coding or research prompts, Fable-class agents compress your moat. Distribution, workflow integration, proprietary data, and evaluation infrastructure matter more than prompt surface area.
Technical buyers should stop comparing models only by input/output price. Run workflow evals. If Fable 5 completes hard work with fewer human interventions, it can justify a premium. If it does not, Opus 4.8, Gemini 3.1 Pro, or GPT-5.5 may still be better buys.
Policy teams should pay attention to the access pattern. Fable/Mythos is a live example of a frontier lab treating model capability as something that needs differentiated distribution, not just a model card and a refusal policy.
Researchers should pay attention to the Concord example. There is a category of internal research tooling that has historically been too niche to build: custom coding schemes, calibration tools, literature-review dashboards, structured interview analyzers, synthetic data checkers, methods validators. A model that can turn one expert's messy need into usable software changes the economics of all of that. It does not make the output automatically correct. It makes previously uneconomic software worth attempting.
Managers should notice the control problem before the productivity story. If employees can commission large artifacts from AI systems, organizations need review norms. Who checks generated code? Who signs off on generated analysis? What counts as sufficient source provenance? Which workflows can run autonomously, and which require approval gates? The productivity upside is real, but it lands inside process, compliance, and accountability systems that most companies have not built yet.
Claude Fable 5 is the public headline. Claude Mythos 5 is the capability boundary.
The benchmark story is strong: Fable 5 enters BenchLM's top tier with a 96 overall score, a 95.0 on SWE-bench Verified, an 80.0 on SWE-bench Pro, an 84.3 on Terminal-Bench 2.1, an 85.0 on OSWorld-Verified, and a 29.3 on FrontierCode Diamond. Mythos 5 pushes some of the same rows higher, including 95.5 on SWE-bench Verified, 88.0 on Terminal-Bench 2.1, and 64.5 on HLE with tools. That is enough to matter on its own.
But the bigger story is deployment and use. Anthropic took a model class it previously held back, split it into a public safeguarded product and a restricted trusted-access product, and made routing part of the release. Early user reports suggest the model also changes the human role: less step-by-step operation, more commissioning, review, and correction.
That is what the future of AI probably looks like: not one smarter chatbot, but a stack of capability tiers, safety routers, trusted-access programs, agent benchmarks, and task-level economics wrapped around the frontier model.
The model got better. The boundary around the model got more important. The interface around the model may matter most of all.
What is Claude Fable 5? Claude Fable 5 is Anthropic's generally available Mythos-class model, launched June 9, 2026. It uses the same underlying model class as Claude Mythos 5, but adds safeguards for high-risk cybersecurity, biology, chemistry, and distillation requests. Anthropic prices it at $10 per million input tokens and $50 per million output tokens.
What is Claude Mythos 5? Claude Mythos 5 is Anthropic's restricted-access version of the same underlying Mythos-class model. It is aimed at vetted cybersecurity, biology, healthcare, and government partners through Project Glasswing and trusted-access programs.
How are Claude Fable 5 and Claude Mythos 5 different? Anthropic says Fable 5 and Mythos 5 use the same underlying model. The difference is deployment. Fable 5 is the public version with safeguards, fallback in some Claude clients, and API blocking unless fallback is configured. Mythos 5 lifts some restrictions for vetted users working in sensitive domains.
Is Claude Fable 5 better than Claude Opus 4.8? On BenchLM's current mapped benchmark data, Claude Fable 5 scores 96 overall versus Claude Opus 4.8 at 93. The biggest visible gaps are in agentic work, coding, hard knowledge, and multimodal grounded tasks. The caveat is that early Fable 5 rows are heavily sourced from Anthropic's launch and system-card materials and need more independent coverage.
How much does Claude Fable 5 cost? Claude Fable 5 costs $10 per million input tokens and $50 per million output tokens. Anthropic also lists a 90% input-token discount for prompt caching and a 1.1x multiplier for US-only inference.
Can anyone use Claude Mythos 5? No. Claude Mythos 5 is restricted to Project Glasswing partners, government collaborators, and future trusted-access programs. Anthropic has said broader trusted access is planned, but Mythos 5 is not a normal public API model.
What do Claude Fable 5 and Mythos 5 mean for the future of AI? They suggest that frontier AI is moving toward capability-gated deployment. The same underlying model can be offered as a public safeguarded product, a restricted trusted-access product, and a routed system that falls back to safer models for some requests.
What does Claude Fable 5 feel like to use? Ethan Mollick's early-access testing describes Fable 5 less like a chatbot and more like a system that can be commissioned to complete large projects. His examples included games, a researched isochrone map, and a long-running research software project. The practical shift is that the human spends less time doing each step and more time specifying goals, reviewing outputs, and correcting the final result.