# BenchLM AI

> BenchLM AI compares 149 tracked AI models across 61 benchmarks in 8 categories: Agentic, Coding, Multimodal & Grounded, Reasoning, Knowledge, Instruction Following, Multilingual, and Mathematics. Leaderboards exclude generated benchmark rows, so the public rankings stay conservative and source-aware.

## Main Pages

- [Homepage](https://benchlm.ai/): Overall leaderboard and benchmark explorer
- [Models Directory](https://benchlm.ai/models): Canonical model families and sibling SKUs
- [Compare](https://benchlm.ai/compare): Head-to-head model comparisons
- [Benchmarks](https://benchlm.ai/benchmarks): Benchmark directory and explainer pages
- [Pricing](https://benchlm.ai/pricing): Token pricing comparison for major models
- [Alternatives Directory](https://benchlm.ai/alternatives): SEO landing pages for ChatGPT, Claude, Gemini, and OpenAI API alternatives
- [Korean AI Hub](https://benchlm.ai/leaderboards/korean-llm): Best Korean LLM leaderboard
- [Korean Benchmarks](https://benchlm.ai/leaderboards/korean-benchmarks): Global models evaluated on Korean metrics
- [KMMLU Guide](https://benchlm.ai/guides/kmmlu-explained): The KMMLU benchmark explained
- [Blog](https://benchlm.ai/blog): Benchmark explainers and model analysis

## Top Model Profiles

Each entry lists rank, developer, overall score (out of 100), license, and context window (tokens).

- [GPT-5.4 Pro](https://benchlm.ai/models/gpt-5-4-pro): #1, OpenAI, 87/100, Proprietary, 1.05M
- [GPT-5.4](https://benchlm.ai/models/gpt-5-4): #2, OpenAI, 84/100, Proprietary, 1.05M
- [Gemini 3.1 Pro](https://benchlm.ai/models/gemini-3-1-pro): #3, Google, 83/100, Proprietary, 1M
- [Claude Opus 4.6](https://benchlm.ai/models/claude-opus-4-6): #4, Anthropic, 80/100, Proprietary, 1M
- [GPT-5.3 Codex](https://benchlm.ai/models/gpt-5-3-codex): #5, OpenAI, 80/100, Proprietary, 400K
- [Gemini 3 Pro Deep Think](https://benchlm.ai/models/gemini-3-pro-deep-think): #6, Google, 79/100, Proprietary, 2M
- [GPT-5.2](https://benchlm.ai/models/gpt-5-2): #7, OpenAI, 77/100, Proprietary, 400K
- [Claude Sonnet 4.6](https://benchlm.ai/models/claude-sonnet-4-6): #8, Anthropic, 76/100, Proprietary, 200K
- [Qwen3.5 397B (Reasoning)](https://benchlm.ai/models/qwen3-5-397b-reasoning): #9, Alibaba, 72/100, Open Weight, 128K
- [Kimi K2.5 (Reasoning)](https://benchlm.ai/models/kimi-k2-5-reasoning): #10, Moonshot AI, 71/100, Proprietary, 128K

## Best-Of Rankings

- [Best LLMs for Coding](https://benchlm.ai/coding)
- [Best LLMs for Math](https://benchlm.ai/math)
- [Best LLMs for Knowledge](https://benchlm.ai/knowledge)
- [Best LLMs for Reasoning](https://benchlm.ai/reasoning)
- [Best Agentic AI Models](https://benchlm.ai/agentic)
- [Best Multimodal & Grounded AI Models](https://benchlm.ai/multimodal-grounded)
- [Best LLMs for Instruction Following](https://benchlm.ai/instruction-following)
- [Best Multilingual LLMs](https://benchlm.ai/multilingual)
- [Best Open Source LLMs](https://benchlm.ai/best/open-source)
- [Best Proprietary LLMs](https://benchlm.ai/best/proprietary)
- [Best Reasoning AI Models](https://benchlm.ai/best/reasoning-models)
- [Best OpenAI Models](https://benchlm.ai/best/openai-models)
- [Best Anthropic Models](https://benchlm.ai/best/anthropic-models)
- [Best Google AI Models](https://benchlm.ai/best/google-models)
- [Best Meta AI Models](https://benchlm.ai/best/meta-models)
- [Best DeepSeek Models](https://benchlm.ai/best/deepseek-models)
- [Best AI Models Overall](https://benchlm.ai/best/overall)
- [Best Large Context Window LLMs](https://benchlm.ai/best/large-context-window)
- [Best Chinese AI Models](https://benchlm.ai/best/chinese-models)
- [Best Non-Reasoning LLMs](https://benchlm.ai/best/non-reasoning-models)
- [Best Mistral Models](https://benchlm.ai/best/mistral-models)
- [Best xAI Grok Models](https://benchlm.ai/best/xai-models)
- [Best Alibaba Qwen Models](https://benchlm.ai/best/alibaba-models)
## Tools & Resources

- [Alternative Finder](https://benchlm.ai/tools/alternative-finder): Replace ChatGPT, Claude, Google Gemini, or the OpenAI API using benchmark fit, pricing, context, and open-weight filters
- [LLM Selector Quiz](https://benchlm.ai/tools/llm-selector): Personalized model recommendations
- [AI Cost Calculator](https://benchlm.ai/tools/ai-cost-calculator): Budgeting by deliverable rather than raw token counts
- [Cost Calculator](https://benchlm.ai/tools/cost-calculator): Monthly API spend estimates from token usage
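Neither calculator publishes its internals here, but the Cost Calculator's core estimate is plain token arithmetic. A minimal sketch under the usual per-million-token pricing convention; `monthly_spend` is an illustrative helper and the prices in the example are placeholders, not BenchLM pricing data:

```python
# Minimal sketch of a monthly API spend estimate, the kind of figure the
# Cost Calculator above reports. monthly_spend() is a hypothetical helper;
# prices are placeholders in USD per million tokens, not BenchLM data.
def monthly_spend(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Estimated monthly cost in USD from token volume and per-M-token prices."""
    return (input_tokens / 1_000_000 * input_price_per_m
            + output_tokens / 1_000_000 * output_price_per_m)

# Example: 40M input and 8M output tokens per month at $2.00/$8.00 per million.
print(f"${monthly_spend(40_000_000, 8_000_000, 2.00, 8.00):,.2f}")  # -> $144.00
```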
## Alternative Landing Pages

- [Best ChatGPT Alternatives in 2026](https://benchlm.ai/alternatives/chatgpt): Target queries: chatgpt alternatives, best chatgpt alternatives, best alternative to chatgpt
- [Best Claude Alternatives in 2026](https://benchlm.ai/alternatives/claude): Target queries: claude alternative, best claude alternative, cheaper alternative to claude
- [Best Google Gemini Alternatives in 2026](https://benchlm.ai/alternatives/google-gemini): Target queries: google gemini alternative, best google gemini alternative, gemini alternative
- [Best OpenAI API Alternatives in 2026](https://benchlm.ai/alternatives/openai-api): Target queries: openai api alternative, best openai api alternatives, cheaper openai api alternative
- [Best Free ChatGPT Alternatives in 2026](https://benchlm.ai/alternatives/chatgpt/free): Target queries: free chatgpt alternative, best free chatgpt alternative, chatgpt alternative free
- [Best Open Source ChatGPT Alternatives in 2026](https://benchlm.ai/alternatives/chatgpt/open-source): Target queries: open source chatgpt alternative, best open source chatgpt alternative, open-weight chatgpt alternative
- [Best Claude Alternatives for Coding in 2026](https://benchlm.ai/alternatives/claude/coding): Target queries: claude alternative for coding, best claude alternative for coding, claude code alternative

## Blog Posts

- [React Native Evals: The Mobile App Coding Benchmark Explained](https://benchlm.ai/blog/posts/react-native-evals-mobile-benchmark): React Native Evals measures whether AI coding models can complete real React Native implementation tasks across navigation, animation, and async state. Here's what it tests, why it matters, and how it differs from SWE-bench and LiveCodeBench.
- [Best Chinese LLMs in 2026: Kimi K2.5, DeepSeek V3.2, Qwen, GLM-5, and Every Model Ranked](https://benchlm.ai/blog/posts/best-chinese-llm): Which Chinese LLM is best in 2026? We rank Kimi K2.5, DeepSeek V3.2, Qwen3.5, GLM-5, MiMo, MiniMax M2.7, and more by benchmarks — coding, math, reasoning, and agentic tasks.
- [State of LLM Benchmarks 2026: Rankings, Trends, and What Actually Changed](https://benchlm.ai/blog/posts/state-of-llm-benchmarks-2026): State of LLM benchmarks in 2026: top AI model rankings, category leaders, benchmark trends, open vs closed performance, pricing context, and methodology from BenchLM.
- [Are AI Benchmarks Reliable? The Data Contamination Problem](https://benchlm.ai/blog/posts/benchmark-reliability): AI benchmarks are useful but flawed. Data contamination inflates scores when models train on test questions. Here's how it works, which benchmarks resist it, and how BenchLM accounts for reliability.
- [Best Budget LLMs in 2026: GPT-5.4 Mini, Nano, MiniMax M2.7, and Every Cheap Model Ranked](https://benchlm.ai/blog/posts/best-budget-llms-2026): Which budget LLM should you use in 2026? We rank GPT-5.4 mini, GPT-5.4 nano, MiniMax M2.7, Claude Haiku 4.5, Gemini Flash, DeepSeek, and more by benchmarks and price.
- [Best LLM for Coding in 2026: Ranked by SWE-bench, LCB, and Real-World Performance](https://benchlm.ai/blog/posts/best-llm-coding): Which AI model is best for coding in 2026? We rank major LLMs by SWE-bench Pro and LiveCodeBench, with SWE-bench Verified shown as a historical baseline and React Native Evals tracked as a display benchmark for mobile app work.
- [BrowseComp Explained: How We Measure Web Research Agents](https://benchlm.ai/blog/posts/browsecomp-browsing-benchmark): BrowseComp evaluates whether AI models can search the web, gather evidence, and answer research questions instead of relying only on latent knowledge.
- [Claude Opus 4.6 vs GPT-5.4: Full Benchmark Breakdown (2026)](https://benchlm.ai/blog/posts/claude-opus-vs-gpt-5): Claude Opus 4.6 vs GPT-5.4 head-to-head: benchmark scores, pricing, and when to use each. GPT-5.4 still leads overall at lower cost, but Claude remains strong on HLE, coding, multilingual, and long-form work.
- [LLM API Pricing Comparison 2026: Every Major Model, Ranked by Cost](https://benchlm.ai/blog/posts/llm-pricing-2026): Full LLM API pricing comparison for 2026 — input/output token costs for GPT-5, Claude, Gemini, DeepSeek, Grok, and more. Find the cheapest model for your use case.
- [OSWorld-Verified Explained: How We Measure Computer-Use Models](https://benchlm.ai/blog/posts/osworld-verified-computer-use-benchmark): OSWorld-Verified measures whether AI models can operate software interfaces and complete multi-step computer tasks reliably.
- [Terminal-Bench 2.0 Explained: How We Measure Agentic Coding](https://benchlm.ai/blog/posts/terminal-bench-2-agentic-benchmark): Terminal-Bench 2.0 measures whether AI models can work through real terminal-based coding and ops workflows instead of just answering in chat.
- [What Do LLM Benchmarks Actually Measure?](https://benchlm.ai/blog/posts/what-benchmarks-measure): LLM benchmarks don't measure intelligence. They measure specific, narrow abilities under controlled conditions. Here's what each benchmark type actually tests — and what it misses.
- [AIME & HMMT: Can AI Models Do Competition Math?](https://benchlm.ai/blog/posts/aime-hmmt-competition-math): AIME and HMMT are high school math competitions now used to benchmark AI. Frontier models score 95-99% — competition math is effectively solved. Here's what that means.
- [Best LLM for Coding in 2026: What the Benchmarks Actually Show](https://benchlm.ai/blog/posts/best-llm-for-coding): We ranked every major LLM by BenchLM's current coding formula — SWE-Rebench, SWE-bench Pro, LiveCodeBench, and SWE-bench Verified. Here's which models actually come out on top and why.
- [What Is Chatbot Arena Elo? How Human Preference Drives Rankings](https://benchlm.ai/blog/posts/chatbot-arena-elo-explained): Chatbot Arena ranks AI models using Elo ratings from blind human preference votes. Here's how the system works, what the scores mean, and how Elo compares to standardized benchmarks.
- [Claude Opus 4.6 vs GPT-5.4: Where Each Model Actually Wins](https://benchlm.ai/blog/posts/claude-opus-4-6-vs-gpt-5-4): A direct benchmark comparison of Anthropic's Claude Opus 4.6 and OpenAI's GPT-5.4 across current BenchLM.ai data. GPT-5.4 now has the stronger overall profile, but Claude still has specific workflow advantages.
- [GPQA Diamond: The PhD-Level Science Benchmark](https://benchlm.ai/blog/posts/gpqa-diamond-science-benchmark): GPQA tests AI models with graduate-level questions in biology, physics, and chemistry that are 'Google-proof' — even skilled non-experts with internet access can't answer them. Here's how it works.
- [HLE (Humanity's Last Exam): The Hardest Benchmark](https://benchlm.ai/blog/posts/hle-humanitys-last-exam): Humanity's Last Exam is crowdsourced from thousands of domain experts and designed to probe the absolute frontier of AI. Top models score only 10-46%. Here's why HLE matters.
- [LiveCodeBench: Why Static Coding Benchmarks Aren't Enough](https://benchlm.ai/blog/posts/livecodebench-contamination-free): LiveCodeBench uses fresh competitive programming problems from LeetCode, Codeforces, and AtCoder to prevent data contamination. Here's why it matters and which models lead.
- [MMLU vs MMLU-Pro: What Changed and Why It Matters](https://benchlm.ai/blog/posts/mmlu-vs-mmlu-pro): MMLU and MMLU-Pro are the most cited knowledge benchmarks in AI. Here's what each measures, why MMLU is saturated, and why MMLU-Pro is the better discriminator in 2026.

## Markdown Mirrors

- [Homepage (md)](https://benchlm.ai/md/index.md)
- [Models directory (md)](https://benchlm.ai/md/models/index.md)
- [Benchmarks directory (md)](https://benchlm.ai/md/benchmarks/index.md)
- [Compare index (md)](https://benchlm.ai/md/compare/index.md)
- [Pricing (md)](https://benchlm.ai/md/pricing.md)
- [Alternatives directory (md)](https://benchlm.ai/md/alternatives/index.md)
- [Alternative Finder (md)](https://benchlm.ai/md/tools/alternative-finder.md)
- [AI Cost Calculator (md)](https://benchlm.ai/md/tools/ai-cost-calculator.md)
- [LLM Selector (md)](https://benchlm.ai/md/tools/llm-selector.md)
- [Cost Calculator (md)](https://benchlm.ai/md/tools/cost-calculator.md)
- [Korean LLM Leaderboard (md)](https://benchlm.ai/md/leaderboards/korean-llm.md)
- [Korean Benchmarks (md)](https://benchlm.ai/md/leaderboards/korean-benchmarks.md)
- [KMMLU Guide (md)](https://benchlm.ai/md/guides/kmmlu-explained.md)
- Individual alternative pages available at: `https://benchlm.ai/md/alternatives/[slug].md`
- Individual model pages available at: `https://benchlm.ai/md/models/[slug].md`
- Benchmark pages available at: `https://benchlm.ai/md/benchmarks/[slug].md`
- Best-of ranking pages available at: `https://benchlm.ai/md/best/[slug].md`
- Comparison pages available at: `https://benchlm.ai/md/compare/[slug].md`
- Blog posts available at: `https://benchlm.ai/md/blog/[slug].md`
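The slug patterns above compose mechanically from a section name and a slug. A minimal sketch, assuming the slug matches the final path segment of the corresponding HTML page (e.g. `/models/gpt-5-4-pro` mirrors to `/md/models/gpt-5-4-pro.md`); `mirror_url` is an illustrative helper, not a BenchLM API:

```python
# Build a markdown-mirror URL from the [slug].md patterns listed above.
# mirror_url() is an illustrative helper, not a BenchLM API; it assumes the
# slug equals the final path segment of the HTML page (e.g. "gpt-5-4-pro").
BASE = "https://benchlm.ai/md"
SECTIONS = {"alternatives", "models", "benchmarks", "best", "compare", "blog"}

def mirror_url(section: str, slug: str) -> str:
    if section not in SECTIONS:
        raise ValueError(f"no [slug].md pattern listed for section: {section}")
    return f"{BASE}/{section}/{slug}.md"

print(mirror_url("models", "gpt-5-4-pro"))
# -> https://benchlm.ai/md/models/gpt-5-4-pro.md
```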
## Data & Technical Notes

- Data last updated: March 18, 2026
- Canonical model families tracked: 43
- Total pairwise comparisons available: 11,026
- Built with Next.js and deployed on Cloudflare Workers via OpenNext
- Sitemap: https://benchlm.ai/sitemap.xml
- Full crawler bundle: https://benchlm.ai/llms-full.txt
- Author: [@glevd](https://x.com/glevd)
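The comparison count follows directly from the model count: 11,026 is C(149, 2) = 149 · 148 / 2, the number of unordered pairs of the 149 tracked models.

```python
import math

# Sanity check: 11,026 pairwise comparisons is exactly the number of
# unordered pairs drawn from the 149 tracked models.
assert math.comb(149, 2) == 149 * 148 // 2 == 11026
print(math.comb(149, 2))  # -> 11026
```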