# BenchLM AI

> BenchLM AI is a comprehensive AI benchmarking platform that evaluates and compares 121 large language models across 32 benchmarks in 8 categories: Agentic, Coding, Multimodal & Grounded, Reasoning, Knowledge, Instruction Following, Multilingual, and Mathematics. The platform provides real-time leaderboard data, detailed model profiles, and educational content about LLM evaluation methodologies.

## Main Pages

- [Homepage](https://benchlm.ai/): AI model leaderboard with all benchmark scores, filtering, and sorting
- [Knowledge Benchmarks](https://benchlm.ai/knowledge): MMLU, GPQA, SuperGPQA, OpenBookQA evaluations
- [Coding Benchmarks](https://benchlm.ai/coding): HumanEval, SWE-bench Pro, SWE-bench Verified, LiveCodeBench evaluations
- [Math Benchmarks](https://benchlm.ai/math): AIME 2023-2025, HMMT 2023-2025, BRUMO 2025, MATH-500 evaluations
- [Reasoning Benchmarks](https://benchlm.ai/reasoning): SimpleQA, MuSR, BBH, LongBench v2, MRCRv2 evaluations
- [Agentic Benchmarks](https://benchlm.ai/agentic): Terminal-Bench 2.0, BrowseComp, OSWorld-Verified evaluations
- [Multimodal & Grounded](https://benchlm.ai/multimodal-grounded): MMMU-Pro and OfficeQA Pro evaluations
- [Instruction Following](https://benchlm.ai/instruction-following): IFEval benchmark scores
- [Multilingual Benchmarks](https://benchlm.ai/multilingual): MGSM, MMLU-ProX evaluations
- [Models Directory](https://benchlm.ai/models): Browse all 121 AI models with benchmark scores
- [Blog](https://benchlm.ai/blog): Articles on LLM benchmarking methodology and analysis

## Top Models Snapshot

Current top models by overall score in BenchLM.ai's March 2026 data.

- **GPT-5.4 Pro** — OpenAI. Score: 91. Context: 1.05M tokens. Reasoning. Proprietary. Price: $30 / $180 per million input/output tokens.
- **GPT-5.4** — OpenAI. Score: 90. Context: 1.05M tokens. Reasoning. Proprietary. Price: $2.50 / $15 per million input/output tokens.
- **GPT-5.2 Pro** — OpenAI. Score: 90. Context: 400K tokens. Reasoning. Proprietary. Price: $25 / $150 per million input/output tokens.
- **GPT-5.3 Codex** — OpenAI. Score: 89. Context: 400K tokens. Reasoning. Proprietary. Price: $2.50 / $10 per million input/output tokens.
- **GPT-5.2** — OpenAI. Score: 88. Context: 400K tokens. Reasoning. Proprietary. Price: $2 / $8 per million input/output tokens.
- **GPT-5.3 Instant** — OpenAI. Score: 87. Context: 128K tokens. Reasoning. Proprietary. Price: $1.75 / $14 per million input/output tokens.
- **GPT-5.3-Codex-Spark** — OpenAI. Score: 87. Context: 256K tokens. Reasoning. Proprietary. Price: $2 / $8 per million input/output tokens.
- **Claude Opus 4.6** — Anthropic. Score: 85. Context: 1M tokens. Non-Reasoning. Proprietary. Price: $15 / $75 per million input/output tokens.
- **GPT-5.2 Instant** — OpenAI. Score: 85. Context: 128K tokens. Reasoning. Proprietary. Price: $1.50 / $6 per million input/output tokens.
- **GPT-5.2-Codex** — OpenAI. Score: 85. Context: 400K tokens. Reasoning. Proprietary. Price: $2 / $8 per million input/output tokens.

See the full current leaderboard at https://benchlm.ai/best/overall

## Benchmark Definitions

Detailed definitions for all 32 benchmarks tracked by BenchLM.ai.

### Agentic Benchmarks

**Terminal-Bench 2.0** — Tests whether AI models can complete coding and systems tasks in a terminal environment: writing and debugging scripts, managing files, running builds, using command-line tools. Evaluates multi-step agentic execution rather than single-shot code generation. A high score predicts reliable performance in coding agent loops and DevOps automation workflows.
URL: https://benchlm.ai/benchmarks/terminalBench2

**BrowseComp** — Measures web research ability: whether a model can navigate multiple web sources, synthesize evidence, and answer questions that require gathering information across pages rather than from a single source. Tests real browsing and research workflows. High scores indicate the model can function as a reliable research agent.
URL: https://benchlm.ai/benchmarks/browseComp

**OSWorld-Verified** — Evaluates computer-use reliability: whether a model can operate real software interfaces, understand screen state, take sequential actions, maintain state across many steps, and complete workflows without destructive errors. Tests multi-app automation, document tasks, QA testing, and operations workflows. One of the most informative agentic benchmarks in 2026.
URL: https://benchlm.ai/benchmarks/osWorldVerified

### Coding Benchmarks

**HumanEval** — 164-problem Python function generation benchmark from OpenAI (2021). Models are given a function signature and docstring and must generate the function body. Scored by test execution (pass@1). Saturated in 2026 — six frontier models score 91%+ and it no longer differentiates them. Still shown for reference but excluded from BenchLM.ai's scoring formula.
URL: https://benchlm.ai/benchmarks/humaneval

**SWE-bench Verified** — Real-world GitHub bug-fixing benchmark. Models receive a repository and an issue description, and must produce a patch that makes failing tests pass. The "Verified" subset has been manually checked for correctness. Scores range 70-85% among frontier models — still discriminative.
URL: https://benchlm.ai/benchmarks/sweVerified

**SWE-bench Pro** — Harder version of SWE-bench using more complex, longer-horizon software engineering tasks. The primary coding signal in BenchLM.ai's scoring formula for 2026. Larger spread between models than SWE-bench Verified.
URL: https://benchlm.ai/benchmarks/swePro

**LiveCodeBench** — Competitive programming benchmark that continuously sources new problems to prevent data contamination. Problems are pulled from Codeforces, LeetCode, and AtCoder after model training cutoffs, making memorization ineffective. The spread is 55-85% — wide enough to clearly differentiate models. Considered the most trustworthy coding signal in 2026.
URL: https://benchlm.ai/benchmarks/liveCodeBench

### Knowledge Benchmarks

**MMLU** — Massive Multitask Language Understanding. 14,042 multiple-choice questions across 57 subjects (STEM, humanities, social science, professional domains) at undergraduate level. Saturated in 2026 — frontier models score 97-99%. Still tracked for mid-tier and open-weight model comparison but excluded from BenchLM.ai's scoring formula.
URL: https://benchlm.ai/benchmarks/mmlu

**MMLU-Pro** — Enhanced version of MMLU with harder, more reasoning-intensive questions and 10 answer choices instead of 4. Less saturated than MMLU — useful for distinguishing frontier models. Scores range 87-92% among top models.
URL: https://benchlm.ai/benchmarks/mmluPro

**GPQA Diamond** — Graduate-Level Google-Proof Q&A. 198 PhD-level multiple-choice science questions in biology, chemistry, and physics, written by domain experts to be resistant to Google search. Very hard — frontier models score 87-97%. Scores below 70% indicate a model struggling with expert-level science.
URL: https://benchlm.ai/benchmarks/gpqa

**SuperGPQA** — Broader version of GPQA covering 285 research-level science domains. Harder and wider than GPQA. Scores range 55-95% — meaningful spread at the frontier.
URL: https://benchlm.ai/benchmarks/superGpqa

**OpenBookQA** — Science QA benchmark testing elementary science knowledge and commonsense reasoning. Nearly saturated — displayed for reference but excluded from scoring.
URL: https://benchlm.ai/benchmarks/openBookQa

**HLE (Humanity's Last Exam)** — 2,500 expert-level questions across 100+ academic domains, written by subject matter experts specifically to stump frontier AI models. Scores range 10-50% — the largest spread of any knowledge benchmark in 2026. GPT-5.4 Pro currently leads at 50%, while many mid-tier models remain in the 20-30% range. The most informative frontier knowledge benchmark available.
URL: https://benchlm.ai/benchmarks/hle

**FrontierScience** — Research-level science benchmark testing whether models can answer questions that require understanding of recent scientific literature and methods beyond what's in textbooks.
URL: https://benchlm.ai/benchmarks/frontierScience

### Reasoning Benchmarks

**SimpleQA** — Short-form factual accuracy benchmark. Models answer brief factual questions; answers are judged for correctness without partial credit. Tests precision of factual recall rather than fluency. High scores indicate a model that doesn't hallucinate on simple factual queries.
URL: https://benchlm.ai/benchmarks/simpleQa

**MuSR** — Multistep Soft Reasoning. Tests multi-step reasoning over long paragraphs of context, requiring models to chain multiple inferences across a document before arriving at an answer. Reasoning models significantly outperform standard models here.
URL: https://benchlm.ai/benchmarks/musr

**BBH (BIG-Bench Hard)** — 23 especially challenging tasks selected from the 204-task BIG-Bench suite. Covers algorithmic reasoning, causal reasoning, and formal logic. Historical baseline — still included for reference, but frontier models score 93-96%.
URL: https://benchlm.ai/benchmarks/bbh

**LongBench v2** — Long-context understanding benchmark testing whether models can accurately answer questions that require reading and reasoning over documents of 10K-100K+ tokens. Tests whether the advertised context window is actually usable.
URL: https://benchlm.ai/benchmarks/longBenchV2

**MRCRv2** — Multi-hop Reasoning and Context Retrieval benchmark v2. Tests whether models can retrieve and combine information from multiple locations within a long context to answer questions that require cross-referencing.
URL: https://benchlm.ai/benchmarks/mrcrv2

### Math Benchmarks

**AIME 2025** — American Invitational Mathematics Examination 2025. Competition-level math problems requiring creative problem-solving and formal mathematics. Frontier models now score 97-99% — saturated at the top.
URL: https://benchlm.ai/benchmarks/aime2025

**HMMT 2025** — Harvard-MIT Mathematics Tournament 2025. Competition math at a similar difficulty to AIME. Frontier models score 95-98%. Saturated among top models.
URL: https://benchlm.ai/benchmarks/hmmt2025

**BRUMO 2025** — Brown University Math Olympiad 2025. Slightly harder than AIME/HMMT for AI models, providing more spread among frontier models.
URL: https://benchlm.ai/benchmarks/brumo2025

**MATH-500** — 500 problems from the MATH dataset covering 5 difficulty levels and 7 math subjects. Broader spread than AIME — useful for comparing mid-tier models. Frontier models score 97-99%.
URL: https://benchlm.ai/benchmarks/math500

### Instruction Following Benchmarks

**IFEval** — Instruction Following Evaluation. Tests whether models precisely follow verifiable formatting and content constraints: word count limits, specific keywords required or forbidden, casing rules, output length constraints, JSON formatting requirements. Scored at both prompt level and instruction level. Scores range 70-95% — meaningful spread.
URL: https://benchlm.ai/benchmarks/ifeval

### Multilingual Benchmarks

**MGSM** — Multilingual Grade School Math. Tests mathematical reasoning across 10 languages including Chinese, German, French, Japanese, Spanish, Russian, and others. Reveals how well models' math capabilities transfer to non-English languages.
URL: https://benchlm.ai/benchmarks/mgsm

**MMLU-ProX** — Multilingual version of MMLU-Pro. Professional-level knowledge assessment across multiple non-English languages. Tests whether models have internalized expert knowledge across languages, not just English.
URL: https://benchlm.ai/benchmarks/mmluProX

### Multimodal & Grounded Benchmarks

**MMMU-Pro** — Massive Multidiscipline Multimodal Understanding Pro. Tests whether models can answer questions that require reasoning over images, charts, diagrams, and text together.
Expert-level visual reasoning across 30+ disciplines.
URL: https://benchlm.ai/benchmarks/mmmuPro

**OfficeQA Pro** — Grounded benchmark for enterprise document tasks: reading spreadsheets, interpreting PDFs, extracting data from office documents, and answering questions about visual business artifacts. Tests models for enterprise copilot use cases.
URL: https://benchlm.ai/benchmarks/officeQaPro

## Scoring Methodology

BenchLM.ai's overall score is a weighted average across 8 benchmark categories:

| Category | Weight | Primary Benchmarks |
|---|---|---|
| Agentic | 22% | Terminal-Bench 2.0, BrowseComp, OSWorld-Verified |
| Coding | 20% | SWE-bench Pro, LiveCodeBench, SWE-bench Verified |
| Reasoning | 17% | SimpleQA, MuSR, LongBench v2, MRCRv2, BBH |
| Knowledge | 12% | GPQA, SuperGPQA, MMLU-Pro, HLE, FrontierScience |
| Multimodal & Grounded | 12% | MMMU-Pro, OfficeQA Pro |
| Instruction Following | 5% | IFEval |
| Multilingual | 7% | MGSM, MMLU-ProX |
| Math | 5% | AIME 2025, HMMT 2025, BRUMO 2025, MATH-500 |

**Saturation policy:** Benchmarks where frontier models cluster at 95-99% (MMLU, HumanEval, AIME 2023/2024, HMMT 2023/2024) are excluded from the scoring formula because score differences are within noise range. They are still displayed in model profiles for reference.

**Within-category weighting:** More discriminative, harder, and less contaminated benchmarks carry more weight within each category. For coding, SWE-bench Pro and LiveCodeBench outweigh HumanEval. For knowledge, HLE and FrontierScience outweigh MMLU.

**Normalization:** All scores are on a 0-100 scale. Benchmark scores that are reported differently by original authors are converted to this scale.

**Data source:** Benchmark scores are collected from official model announcements, academic papers, and the OpenBench open-source evaluation infrastructure.
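The weighted average described above can be sketched in a few lines. The category weights are the published BenchLM.ai figures; the example category scores are illustrative placeholders, not real leaderboard data:

```python
# Overall score = weighted average of the 8 category scores (0-100 scale),
# using the published BenchLM.ai category weights.

WEIGHTS = {
    "agentic": 0.22, "coding": 0.20, "reasoning": 0.17, "knowledge": 0.12,
    "multimodal_grounded": 0.12, "multilingual": 0.07,
    "instruction_following": 0.05, "math": 0.05,
}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights cover 100%

def overall_score(category_scores: dict[str, float]) -> float:
    """Weighted average over the 8 categories, rounded to one decimal."""
    return round(sum(WEIGHTS[c] * category_scores[c] for c in WEIGHTS), 1)

# Illustrative only: a model scoring 90 in every category scores 90 overall.
example = {c: 90.0 for c in WEIGHTS}
print(overall_score(example))  # 90.0
```

Because the weights sum to 1.0, no renormalization step is needed; saturated benchmarks are simply absent from each category's input score.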
Methodology details: https://benchlm.ai/#methodology

## Model Profile Pages

Individual benchmark analysis pages for each of the 121 tracked AI models:

- [GPT-5.4 Pro](https://benchlm.ai/models/gpt-5-4-pro): OpenAI, Score: 91, Proprietary, 1.05M context
- [GPT-5.4](https://benchlm.ai/models/gpt-5-4): OpenAI, Score: 90, Proprietary, 1.05M context
- [GPT-5.2 Pro](https://benchlm.ai/models/gpt-5-2-pro): OpenAI, Score: 90, Proprietary, 400K context
- [GPT-5.3 Codex](https://benchlm.ai/models/gpt-5-3-codex): OpenAI, Score: 89, Proprietary, 400K context
- [GPT-5.2](https://benchlm.ai/models/gpt-5-2): OpenAI, Score: 88, Proprietary, 400K context
- [GPT-5.3 Instant](https://benchlm.ai/models/gpt-5-3-instant): OpenAI, Score: 87, Proprietary, 128K context
- [GPT-5.3-Codex-Spark](https://benchlm.ai/models/gpt-5-3-codex-spark): OpenAI, Score: 87, Proprietary, 256K context
- [Claude Opus 4.6](https://benchlm.ai/models/claude-opus-4-6): Anthropic, Score: 85, Proprietary, 1M context
- [GPT-5.2 Instant](https://benchlm.ai/models/gpt-5-2-instant): OpenAI, Score: 85, Proprietary, 128K context
- [GPT-5.2-Codex](https://benchlm.ai/models/gpt-5-2-codex): OpenAI, Score: 85, Proprietary, 400K context
- [Full models directory](https://benchlm.ai/models): All 121 models with scores and rankings

## Comparison Pages

- [Model vs Model comparisons](https://benchlm.ai/compare): Side-by-side benchmark comparison for any two models
- Example: [GPT-5.4 Pro vs GPT-5.4](https://benchlm.ai/compare/gpt-5-4-vs-gpt-5-4-pro)
- 7,260 total comparison pages available

## Benchmark Detail Pages

- [MMLU](https://benchlm.ai/benchmarks/mmlu): Massive Multitask Language Understanding (saturated, reference only)
- [MMLU-Pro](https://benchlm.ai/benchmarks/mmluPro): Enhanced MMLU with harder questions (10 choices)
- [GPQA](https://benchlm.ai/benchmarks/gpqa): Graduate-Level Google-Proof Q&A (198 PhD-level questions)
- [SuperGPQA](https://benchlm.ai/benchmarks/superGpqa): Research-level science across 285 domains
- [OpenBookQA](https://benchlm.ai/benchmarks/openBookQa): Elementary science knowledge benchmark (reference only)
- [HLE](https://benchlm.ai/benchmarks/hle): Humanity's Last Exam — 2,500 expert-level questions
- [FrontierScience](https://benchlm.ai/benchmarks/frontierScience): Research-level science benchmark
- [HumanEval](https://benchlm.ai/benchmarks/humaneval): Python function generation (saturated, reference only)
- [SWE-bench Verified](https://benchlm.ai/benchmarks/sweVerified): Real GitHub bug fixing benchmark
- [SWE-bench Pro](https://benchlm.ai/benchmarks/swePro): Harder, longer-horizon software engineering tasks
- [LiveCodeBench](https://benchlm.ai/benchmarks/liveCodeBench): Contamination-resistant competitive programming
- [AIME 2023-2025](https://benchlm.ai/benchmarks/aime2025): American Invitational Mathematics Examination
- [HMMT 2023-2025](https://benchlm.ai/benchmarks/hmmt2025): Harvard-MIT Mathematics Tournament
- [BRUMO 2025](https://benchlm.ai/benchmarks/brumo2025): Brown University Math Olympiad
- [MATH-500](https://benchlm.ai/benchmarks/math500): 500-problem math benchmark across difficulty levels
- [SimpleQA](https://benchlm.ai/benchmarks/simpleQa): Short-form factual accuracy
- [MuSR](https://benchlm.ai/benchmarks/musr): Multistep soft reasoning over long context
- [BBH](https://benchlm.ai/benchmarks/bbh): BIG-Bench Hard — 23 challenging reasoning tasks
- [LongBench v2](https://benchlm.ai/benchmarks/longBenchV2): Long-context reasoning benchmark
- [MRCRv2](https://benchlm.ai/benchmarks/mrcrv2): Multi-hop long-context retrieval benchmark
- [IFEval](https://benchlm.ai/benchmarks/ifeval): Instruction Following Evaluation (verifiable constraints)
- [MGSM](https://benchlm.ai/benchmarks/mgsm): Multilingual math reasoning across 10 languages
- [MMLU-ProX](https://benchlm.ai/benchmarks/mmluProX): Multilingual professional knowledge benchmark
- [Terminal-Bench 2.0](https://benchlm.ai/benchmarks/terminalBench2): Terminal-based agentic evaluation
- [BrowseComp](https://benchlm.ai/benchmarks/browseComp): Web research and evidence gathering
- [OSWorld-Verified](https://benchlm.ai/benchmarks/osWorldVerified): Computer-use workflow benchmark
- [MMMU-Pro](https://benchlm.ai/benchmarks/mmmuPro): Multimodal reasoning across images and charts
- [OfficeQA Pro](https://benchlm.ai/benchmarks/officeQaPro): Enterprise document and spreadsheet reasoning

## Best LLM Rankings

- [Best AI Models Overall](https://benchlm.ai/best/overall): Weighted score across all 8 categories
- [Best LLMs for Coding](https://benchlm.ai/best/coding): SWE-bench Pro, LiveCodeBench leaders
- [Best LLMs for Math](https://benchlm.ai/best/math): Competition math benchmark rankings
- [Best LLMs for Knowledge](https://benchlm.ai/best/knowledge): HLE, GPQA, MMLU-Pro leaders
- [Best LLMs for Reasoning](https://benchlm.ai/best/reasoning): SimpleQA, MuSR, LongBench leaders
- [Best Agentic AI Models](https://benchlm.ai/best/agentic): Terminal-Bench, BrowseComp, OSWorld leaders
- [Best Multimodal & Grounded AI Models](https://benchlm.ai/best/multimodal-grounded): MMMU-Pro leaders
- [Best LLMs for Instruction Following](https://benchlm.ai/best/instruction-following): IFEval leaders
- [Best Multilingual LLMs](https://benchlm.ai/best/multilingual): MGSM, MMLU-ProX leaders
- [Best Open Source LLMs](https://benchlm.ai/best/open-source): Top open-weight models
- [Best Proprietary LLMs](https://benchlm.ai/best/proprietary): Top closed-source models
- [Best Reasoning Models](https://benchlm.ai/best/reasoning-models): Chain-of-thought models only
- [Best Non-Reasoning LLMs](https://benchlm.ai/best/non-reasoning-models): Standard (no CoT) models
- [Best Large Context Window LLMs](https://benchlm.ai/best/large-context-window): 200K+ token models
- [Best Chinese AI Models](https://benchlm.ai/best/chinese-models): DeepSeek, Qwen, GLM, Kimi leaders
- [Best OpenAI Models](https://benchlm.ai/best/openai-models)
- [Best Anthropic Models](https://benchlm.ai/best/anthropic-models)
- [Best Google Models](https://benchlm.ai/best/google-models)
- [Best Meta Models](https://benchlm.ai/best/meta-models)
- [Best DeepSeek Models](https://benchlm.ai/best/deepseek-models)
- [Best Mistral Models](https://benchlm.ai/best/mistral-models)
- [Best xAI Grok Models](https://benchlm.ai/best/xai-models)
- [Best Alibaba Qwen Models](https://benchlm.ai/best/alibaba-models)

## Tools & Resources

- [LLM Pricing Comparison](https://benchlm.ai/pricing): Compare API pricing for every major LLM — input/output token costs, price-performance ratios
- [AI Cost Calculator](https://benchlm.ai/tools/ai-cost-calculator): Estimate cost per blog post, web page, documentation article, PRD, or shipped feature
- [LLM Selector Quiz](https://benchlm.ai/tools/llm-selector): Answer 5 questions, get a personalized model recommendation based on benchmark data
- [Cost Calculator](https://benchlm.ai/tools/cost-calculator): Estimate monthly AI API spending based on usage patterns

## Blog Posts

- [What Do LLM Benchmarks Actually Measure?](https://benchlm.ai/blog/posts/what-benchmarks-measure): Pillar guide — what benchmarks test, what they miss, and how to use them
- [Complete Guide to LLM Benchmarking](https://benchlm.ai/blog/posts/complete-guide-llm-benchmarking): Full methodology overview, benchmark taxonomy, category weights
- [How to Interpret LLM Benchmark Results](https://benchlm.ai/blog/posts/interpreting-llm-benchmark-results): Signal vs noise, saturation, statistical significance
- [Building Custom LLM Benchmarks](https://benchlm.ai/blog/posts/building-custom-llm-benchmark): How to evaluate LLMs on your specific tasks
- [What Is HumanEval?](https://benchlm.ai/blog/posts/what-is-humaneval-coding-benchmark): The coding benchmark explained — and why it's saturated
- [SWE-bench Explained](https://benchlm.ai/blog/posts/swe-bench-explained): How we measure real-world coding ability
- [MMLU vs MMLU-Pro](https://benchlm.ai/blog/posts/mmlu-vs-mmlu-pro): What changed and why it matters
- [GPQA Diamond Explained](https://benchlm.ai/blog/posts/gpqa-diamond-science-benchmark): The PhD-level science benchmark
- [LiveCodeBench Explained](https://benchlm.ai/blog/posts/livecodebench-contamination-free): Why static coding benchmarks aren't enough
- [AIME & HMMT: Can AI Do Competition Math?](https://benchlm.ai/blog/posts/aime-hmmt-competition-math): Competition math is effectively solved — what now?
- [HLE: Humanity's Last Exam](https://benchlm.ai/blog/posts/hle-humanitys-last-exam): The hardest AI benchmark and what it reveals
- [Chatbot Arena Elo Explained](https://benchlm.ai/blog/posts/chatbot-arena-elo-explained): How human preference differs from benchmark accuracy
- [Best LLM for Coding in 2026](https://benchlm.ai/blog/posts/best-llm-for-coding): What the benchmarks actually show
- [Claude Opus 4.6 vs GPT-5.4](https://benchlm.ai/blog/posts/claude-opus-4-6-vs-gpt-5-4): Head-to-head across 22 benchmarks
- [BrowseComp Explained](https://benchlm.ai/blog/posts/browsecomp-browsing-benchmark): How we measure web research ability
- [OSWorld-Verified Explained](https://benchlm.ai/blog/posts/osworld-verified-computer-use-benchmark): Computer-use benchmark for AI agents
- [Terminal-Bench 2.0 Explained](https://benchlm.ai/blog/posts/terminal-bench-2-agentic-benchmark): The agentic terminal task benchmark

## Frequently Asked Questions

Q: What is the best AI model overall?
A: As of March 2026, GPT-5.4 Pro leads with an overall score of 91, followed by GPT-5.4 (90), GPT-5.2 Pro (90), GPT-5.3 Codex (89), and GPT-5.2 (88). The top of the table is tight, with 3 points separating the top 5. See the full ranking at https://benchlm.ai/best/overall

Q: What is the best LLM for coding?
A: GPT-5.4 Pro currently leads the weighted coding table at 87.4, with GPT-5.3 Codex essentially tied at 87.3 and still the strongest coding-specific value model. Among general-purpose options, GPT-5.4 is clearly ahead of Claude Opus 4.6 on the current coding mix. See https://benchlm.ai/best/coding

Q: What is the best open source LLM?
A: The top open-weight models are GLM-5 (Reasoning), Kimi K2.5 (Reasoning), and Qwen3.5 397B (Reasoning). Open-weight models now score within 10-15 points of the top proprietary models on most benchmarks. The gap is largest on agentic benchmarks. See https://benchlm.ai/best/open-source

Q: How is the overall score calculated?
A: Each model's overall score is a weighted average across 8 categories: Agentic (22%), Coding (20%), Reasoning (17%), Knowledge (12%), Multimodal & Grounded (12%), Multilingual (7%), Instruction Following (5%), and Math (5%). Saturated benchmarks (MMLU, HumanEval, older competition math exams) are excluded from the formula. See https://benchlm.ai/#methodology

Q: What benchmarks does BenchLM track?
A: 32 benchmarks across 8 categories. Agentic: Terminal-Bench 2.0, BrowseComp, OSWorld-Verified. Coding: HumanEval, SWE-bench Verified, SWE-bench Pro, LiveCodeBench. Multimodal: MMMU-Pro, OfficeQA Pro. Reasoning: SimpleQA, MuSR, BBH, LongBench v2, MRCRv2. Knowledge: MMLU, MMLU-Pro, GPQA, SuperGPQA, OpenBookQA, HLE, FrontierScience. Instruction Following: IFEval. Multilingual: MGSM, MMLU-ProX. Math: AIME 2023-2025, HMMT 2023-2025, BRUMO 2025, MATH-500.

Q: Claude vs GPT — which is better?
A: GPT-5.4 scores 90 versus Claude Opus 4.6 at 85 overall in the current BenchLM.ai data. GPT-5.4 leads on HLE (48 vs 38), coding, and agentic benchmarks while also costing much less ($2.50/$15 vs $15/$75 per million tokens). Claude Opus 4.6 still offers a faster non-reasoning profile and may be preferable for teams that prioritize writing feel or Anthropic-native workflows. See https://benchlm.ai/compare/claude-opus-4-6-vs-gpt-5-4

Q: What is the best AI model for math?
A: Competition math benchmarks (AIME, HMMT) are saturated at the frontier — top models all score 95-99% and differences are noise. GPT-5.4 Pro leads narrowly. For math, the choice between top models rarely matters — focus on coding, reasoning, or agentic performance instead. See https://benchlm.ai/best/math

Q: What is the best Chinese AI model?
A: DeepSeek, Alibaba Qwen, Zhipu GLM, Moonshot Kimi, and ByteDance Seed all compete at the frontier level. GLM-5 leads open-weight coding. Chinese labs are especially strong in math, reasoning, and open-weight model performance. See https://benchlm.ai/best/chinese-models

Q: What is the difference between a reasoning model and a non-reasoning model?
A: Reasoning models (like GPT-5.4, GPT-5.3 Codex, and DeepSeek-V4) use chain-of-thought inference — they think through the problem before answering, adding latency but improving accuracy on hard math, logic, and reasoning tasks. Non-reasoning models (like Claude Opus 4.6, Gemini 3.1 Pro, Grok 4.1) respond directly without a visible reasoning step, making them faster and more predictable. For latency-sensitive applications, non-reasoning models are often preferred even at a small benchmark cost.

Q: What makes a benchmark saturated?
A: A benchmark is saturated when frontier models score 95-99% and differences between top models are 1-2 points — within statistical noise. MMLU, HumanEval, and older AIME exams are saturated. BenchLM.ai excludes saturated benchmarks from its scoring formula and highlights non-saturated alternatives (HLE, SWE-bench Pro, LiveCodeBench) that have meaningful spread between models.

Q: How should I pick an LLM for my use case?
A: Start with the category most relevant to your task: coding, reasoning, knowledge, agentic, etc. Filter to non-saturated benchmarks in that category — they show real model differences. Focus on 5+ point gaps rather than 1-2 point differences. Test the top 2-3 candidates on a sample of your actual tasks before committing.
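The selection heuristic above (ignore sub-5-point gaps, shortlist the tied leaders for hands-on testing) can be sketched as a small filter over leaderboard rows. The model names and scores below are illustrative placeholders, not BenchLM.ai data:

```python
# Shortlist models from one relevant, non-saturated benchmark:
# treat gaps smaller than 5 points as a tie with the leader, then
# keep at most the top 2-3 candidates for testing on real tasks.
# Scores are illustrative placeholders, not BenchLM.ai data.

def shortlist(scores: dict[str, float], tie_margin: float = 5.0,
              max_candidates: int = 3) -> list[str]:
    """Return up to `max_candidates` models within `tie_margin` of the leader."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best = ranked[0][1]
    tied = [name for name, score in ranked if best - score < tie_margin]
    return tied[:max_candidates]

example = {"model-a": 84.0, "model-b": 81.5, "model-c": 74.0, "model-d": 69.9}
print(shortlist(example))  # ['model-a', 'model-b']
```

Here model-b (2.5 points behind) survives the filter as a statistical tie, while model-c (10 points behind) is dropped — the same reasoning the FAQ applies to leaderboard gaps.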
Use the LLM Selector at https://benchlm.ai/tools/llm-selector for a guided recommendation.

## Data & Methodology

- Benchmark data sourced from the OpenBench open-source evaluation infrastructure, official model announcements, and academic papers
- 121 models from 23 creators (OpenAI, Anthropic, Google, Meta, DeepSeek, xAI, Alibaba, Mistral, ByteDance, StepFun, NVIDIA, Inception, LiquidAI, Zhipu AI, Moonshot AI, and others)
- Models categorized by: source type (Proprietary/Open Weight), reasoning capability (Reasoning/Non-Reasoning), context window size, creator
- Data last updated: March 12, 2026
- Scores normalized to a 0-100 scale across all benchmarks
- Overall score: weighted average across 8 categories (Agentic 22%, Coding 20%, Reasoning 17%, Knowledge 12%, Multimodal 12%, Multilingual 7%, Instruction Following 5%, Math 5%)
- About page: https://benchlm.ai/about
- Methodology: https://benchlm.ai/#methodology

## Markdown Versions (for LLM crawlers)

- [Full content file](https://benchlm.ai/llms-full.txt): All site content in a single file
- [Homepage (md)](https://benchlm.ai/md/index.md): Leaderboard table in markdown
- [Models directory (md)](https://benchlm.ai/md/models.md): All models grouped by creator
- [Benchmarks reference (md)](https://benchlm.ai/md/benchmarks.md): All benchmark descriptions
- [Knowledge (md)](https://benchlm.ai/md/knowledge.md): Knowledge benchmark rankings
- [Coding (md)](https://benchlm.ai/md/coding.md): Coding benchmark rankings
- [Math (md)](https://benchlm.ai/md/math.md): Math benchmark rankings
- [Reasoning (md)](https://benchlm.ai/md/reasoning.md): Reasoning benchmark rankings
- [Pricing (md)](https://benchlm.ai/md/pricing.md): LLM API pricing comparison table
- [AI Cost Calculator (md)](https://benchlm.ai/md/tools/ai-cost-calculator.md): Task-based AI budgeting by deliverable
- [LLM Selector (md)](https://benchlm.ai/md/tools/llm-selector.md): Model recommendation tool
- [Cost Calculator (md)](https://benchlm.ai/md/tools/cost-calculator.md): Monthly API cost estimates
- Individual model pages available at: `https://benchlm.ai/md/models/[slug].md`
- Blog posts available at: `https://benchlm.ai/md/blog/[slug].md`

## Technical Details

- Built with Next.js 14 and React 18, statically generated
- Site: https://benchlm.ai
- Sitemap: https://benchlm.ai/sitemap.xml
- RSS Feed: https://benchlm.ai/rss.xml
- LLMs Full: https://benchlm.ai/llms-full.txt
- Author: [@glevd](https://x.com/glevd)
- License: MIT