# BenchLM AI

> BenchLM AI compares 132 tracked AI models across 43 benchmarks in 8 categories: Agentic, Coding, Multimodal & Grounded, Reasoning, Knowledge, Instruction Following, Multilingual, and Mathematics. Leaderboards exclude generated benchmark rows, so the public rankings stay conservative and source-aware.

## Main Pages

- [Homepage](https://benchlm.ai/): Overall leaderboard and benchmark explorer
- [Models Directory](https://benchlm.ai/models): Canonical model families and sibling SKUs
- [Compare](https://benchlm.ai/compare): Head-to-head model comparisons
- [Benchmarks](https://benchlm.ai/benchmarks): Benchmark directory and explainer pages
- [Pricing](https://benchlm.ai/pricing): Token pricing comparison for major models
- [Alternatives Directory](https://benchlm.ai/alternatives): SEO landing pages for ChatGPT, Claude, Gemini, and OpenAI API alternatives
- [Blog](https://benchlm.ai/blog): Benchmark explainers and model analysis

## Top Model Profiles

Each entry lists overall rank, lab, BenchLM score out of 100, license, and context window.

- [GPT-5.4](https://benchlm.ai/models/gpt-5-4): #1, OpenAI, 85/100, Proprietary, 1.05M
- [Gemini 3.1 Pro](https://benchlm.ai/models/gemini-3-1-pro): #2, Google, 84/100, Proprietary, 1M
- [Claude Sonnet 4.6](https://benchlm.ai/models/claude-sonnet-4-6): #3, Anthropic, 77/100, Proprietary, 200K
- [DeepSeek V3.2 (Thinking)](https://benchlm.ai/models/deepseek-v3-2-thinking): #4, DeepSeek, 69/100, Open Weight, 128K
- [o3](https://benchlm.ai/models/o3): #5, OpenAI, 68/100, Proprietary, 200K
- [Gemini 3 Pro Deep Think](https://benchlm.ai/models/gemini-3-pro-deep-think): #6, Google, 68/100, Proprietary, 2M
- [GLM-4.7](https://benchlm.ai/models/glm-4-7): #7, Zhipu AI, 67/100, Open Weight, 200K
- [DeepSeek Coder 2.0](https://benchlm.ai/models/deepseek-coder-2-0): #8, DeepSeek, 66/100, Open Weight, 128K
- [Qwen2.5-1M](https://benchlm.ai/models/qwen2-5-1m): #9, Alibaba, 66/100, Open Weight, 1M
- [Grok 4](https://benchlm.ai/models/grok-4): #10, xAI, 65/100, Proprietary, 128K

## Best-Of Rankings

- [Best LLMs for Coding](https://benchlm.ai/best/coding)
- [Best LLMs for Math](https://benchlm.ai/best/math)
- [Best LLMs for Knowledge](https://benchlm.ai/best/knowledge)
- [Best LLMs for Reasoning](https://benchlm.ai/best/reasoning)
- [Best Agentic AI Models](https://benchlm.ai/best/agentic)
- [Best Multimodal & Grounded AI Models](https://benchlm.ai/best/multimodal-grounded)
- [Best LLMs for Instruction Following](https://benchlm.ai/best/instruction-following)
- [Best Multilingual LLMs](https://benchlm.ai/best/multilingual)
- [Best Open Source LLMs](https://benchlm.ai/best/open-source)
- [Best Proprietary LLMs](https://benchlm.ai/best/proprietary)
- [Best Reasoning AI Models](https://benchlm.ai/best/reasoning-models)
- [Best OpenAI Models](https://benchlm.ai/best/openai-models)
- [Best Anthropic Models](https://benchlm.ai/best/anthropic-models)
- [Best Google AI Models](https://benchlm.ai/best/google-models)
- [Best Meta AI Models](https://benchlm.ai/best/meta-models)
- [Best DeepSeek Models](https://benchlm.ai/best/deepseek-models)
- [Best AI Models Overall](https://benchlm.ai/best/overall)
- [Best Large Context Window LLMs](https://benchlm.ai/best/large-context-window)
- [Best Chinese AI Models](https://benchlm.ai/best/chinese-models)
- [Best Non-Reasoning LLMs](https://benchlm.ai/best/non-reasoning-models)
- [Best Mistral Models](https://benchlm.ai/best/mistral-models)
- [Best xAI Grok Models](https://benchlm.ai/best/xai-models)
- [Best Alibaba Qwen Models](https://benchlm.ai/best/alibaba-models)

## Tools & Resources

- [Alternative Finder](https://benchlm.ai/tools/alternative-finder): Replace ChatGPT, Claude, Google Gemini, or the OpenAI API using benchmark fit, pricing, context, and open-weight filters
- [LLM Selector Quiz](https://benchlm.ai/tools/llm-selector): Personalized model recommendations
- [AI Cost Calculator](https://benchlm.ai/tools/ai-cost-calculator): Budgeting by deliverable rather than raw token counts
- [Cost Calculator](https://benchlm.ai/tools/cost-calculator): Monthly API spend estimates from token usage (a per-token cost sketch follows this list)
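Both calculators above reduce to the same per-token arithmetic. Here is a minimal sketch of that calculation, assuming hypothetical prices and usage figures; real per-token rates live on the [Pricing](https://benchlm.ai/pricing) page, and none of the numbers below are BenchLM data:

```python
# Estimate monthly API spend from token usage. All rates and volumes here
# are hypothetical placeholders, not BenchLM pricing data.
def monthly_cost(
    requests_per_month: int,
    input_tokens: int,          # average prompt tokens per request
    output_tokens: int,         # average completion tokens per request
    input_price_per_m: float,   # USD per 1M input tokens
    output_price_per_m: float,  # USD per 1M output tokens
) -> float:
    per_request = (
        input_tokens * input_price_per_m / 1_000_000
        + output_tokens * output_price_per_m / 1_000_000
    )
    return requests_per_month * per_request

# Example: 50k requests/month, 1,200 prompt + 400 completion tokens each,
# at made-up rates of $1.25 in / $10.00 out per 1M tokens -> $275.00/month.
print(f"${monthly_cost(50_000, 1_200, 400, 1.25, 10.00):,.2f}/month")
```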
## Alternative Landing Pages

- [Best ChatGPT Alternatives in 2026](https://benchlm.ai/alternatives/chatgpt): Target keywords: chatgpt alternatives, best chatgpt alternatives, best alternative to chatgpt
- [Best Claude Alternatives in 2026](https://benchlm.ai/alternatives/claude): Target keywords: claude alternative, best claude alternative, cheaper alternative to claude
- [Best Google Gemini Alternatives in 2026](https://benchlm.ai/alternatives/google-gemini): Target keywords: google gemini alternative, best google gemini alternative, gemini alternative
- [Best OpenAI API Alternatives in 2026](https://benchlm.ai/alternatives/openai-api): Target keywords: openai api alternative, best openai api alternatives, cheaper openai api alternative
- [Best Free ChatGPT Alternatives in 2026](https://benchlm.ai/alternatives/chatgpt/free): Target keywords: free chatgpt alternative, best free chatgpt alternative, chatgpt alternative free
- [Best Open Source ChatGPT Alternatives in 2026](https://benchlm.ai/alternatives/chatgpt/open-source): Target keywords: open source chatgpt alternative, best open source chatgpt alternative, open-weight chatgpt alternative
- [Best Claude Alternatives for Coding in 2026](https://benchlm.ai/alternatives/claude/coding): Target keywords: claude alternative for coding, best claude alternative for coding, claude code alternative

## Blog Posts

- [Best LLM for Coding in 2026: Ranked by SWE-bench, LCB, and Real-World Performance](https://benchlm.ai/blog/posts/best-llm-coding): Which AI model is best for coding in 2026? We rank major LLMs by SWE-bench Pro and LiveCodeBench, with SWE-bench Verified shown as a historical baseline, plus pricing and use-case guidance.
- [BrowseComp Explained: How We Measure Web Research Agents](https://benchlm.ai/blog/posts/browsecomp-browsing-benchmark): BrowseComp evaluates whether AI models can search the web, gather evidence, and answer research questions instead of relying only on latent knowledge.
- [Claude Opus 4.6 vs GPT-5.4: Full Benchmark Breakdown (2026)](https://benchlm.ai/blog/posts/claude-opus-vs-gpt-5): Claude Opus 4.6 vs GPT-5.4 head-to-head: benchmark scores, pricing, and when to use each. GPT-5.4 leads on 16 of 20 benchmarks at 6x lower cost, but Claude holds real advantages in some areas.
- [LLM API Pricing Comparison 2026: Every Major Model, Ranked by Cost](https://benchlm.ai/blog/posts/llm-pricing-2026): Full LLM API pricing comparison for 2026: input/output token costs for GPT-5, Claude, Gemini, DeepSeek, Grok, and more. Find the cheapest model for your use case.
- [OSWorld-Verified Explained: How We Measure Computer-Use Models](https://benchlm.ai/blog/posts/osworld-verified-computer-use-benchmark): OSWorld-Verified measures whether AI models can operate software interfaces and complete multi-step computer tasks reliably.
- [Terminal-Bench 2.0 Explained: How We Measure Agentic Coding](https://benchlm.ai/blog/posts/terminal-bench-2-agentic-benchmark): Terminal-Bench 2.0 measures whether AI models can work through real terminal-based coding and ops workflows instead of just answering in chat.
- [What Do LLM Benchmarks Actually Measure?](https://benchlm.ai/blog/posts/what-benchmarks-measure): LLM benchmarks don't measure intelligence. They measure specific, narrow abilities under controlled conditions. Here's what each benchmark type actually tests, and what it misses.
- [AIME & HMMT: Can AI Models Do Competition Math?](https://benchlm.ai/blog/posts/aime-hmmt-competition-math): AIME and HMMT are high-school math competitions now used to benchmark AI. Frontier models score 95-99%; competition math is effectively solved. Here's what that means.
- [Best LLM for Coding in 2026: What the Benchmarks Actually Show](https://benchlm.ai/blog/posts/best-llm-for-coding): We ranked every major LLM by coding benchmarks: HumanEval, SWE-bench Verified, and LiveCodeBench. Here's which model actually comes out on top, and why the answer depends on what you're building.
- [What Is Chatbot Arena Elo? How Human Preference Drives Rankings](https://benchlm.ai/blog/posts/chatbot-arena-elo-explained): Chatbot Arena ranks AI models using Elo ratings from blind human preference votes. Here's how the system works, what the scores mean, and how Elo compares to standardized benchmarks (a minimal update sketch follows this list).
- [Claude Opus 4.6 vs GPT-5.4: Where Each Model Actually Wins](https://benchlm.ai/blog/posts/claude-opus-4-6-vs-gpt-5-4): A direct benchmark comparison of Anthropic's Claude Opus 4.6 and OpenAI's GPT-5.4 across current BenchLM.ai data. GPT-5.4 now has the stronger overall profile, but Claude still has specific workflow advantages.
- [GPQA Diamond: The PhD-Level Science Benchmark](https://benchlm.ai/blog/posts/gpqa-diamond-science-benchmark): GPQA tests AI models with graduate-level questions in biology, physics, and chemistry that are 'Google-proof': even skilled non-experts with internet access can't answer them. Here's how it works.
- [HLE (Humanity's Last Exam): The Hardest Benchmark](https://benchlm.ai/blog/posts/hle-humanitys-last-exam): Humanity's Last Exam is crowdsourced from thousands of domain experts and designed to probe the absolute frontier of AI. Top models score only 10-46%. Here's why HLE matters.
- [LiveCodeBench: Why Static Coding Benchmarks Aren't Enough](https://benchlm.ai/blog/posts/livecodebench-contamination-free): LiveCodeBench uses fresh competitive programming problems from LeetCode, Codeforces, and AtCoder to prevent data contamination. Here's why it matters and which models lead.
- [MMLU vs MMLU-Pro: What Changed and Why It Matters](https://benchlm.ai/blog/posts/mmlu-vs-mmlu-pro): MMLU and MMLU-Pro are the most cited knowledge benchmarks in AI. Here's what each measures, why MMLU is saturated, and why MMLU-Pro is the better discriminator in 2026.
- [SWE-bench Explained: How We Measure Real-World Coding](https://benchlm.ai/blog/posts/swe-bench-explained): SWE-bench Verified tests AI models on resolving real GitHub issues from Django, Flask, and scikit-learn. Here's how it works, why it matters, and which models score highest.
- [What Is HumanEval? The Coding Benchmark Explained](https://benchlm.ai/blog/posts/what-is-humaneval-coding-benchmark): HumanEval tests whether AI models can generate correct Python functions from docstrings. Here's what it measures, why it's nearly saturated, and which benchmarks matter more in 2026.
- [Building Your Own LLM Benchmark: A Practical Guide](https://benchlm.ai/blog/posts/building-custom-llm-benchmark): How to create a custom LLM benchmark for your specific use case, from defining tasks and building datasets to scoring models and avoiding common pitfalls.
- [The Complete Guide to LLM Benchmarking: Everything You Need to Know](https://benchlm.ai/blog/posts/complete-guide-llm-benchmarking): Everything you need to know about LLM benchmarking: what benchmarks measure, how to choose the right ones, common pitfalls, and how to interpret results for real-world model selection.
- [How to Interpret LLM Benchmark Results: A Practical Guide](https://benchlm.ai/blog/posts/interpreting-llm-benchmark-results): How to read LLM benchmark scores correctly: what differences are meaningful, what to ignore, common misinterpretations, and how to translate benchmark data into model selection decisions.

## Markdown Mirrors

- [Homepage (md)](https://benchlm.ai/md/index.md)
- [Models directory (md)](https://benchlm.ai/md/models/index.md)
- [Benchmarks directory (md)](https://benchlm.ai/md/benchmarks/index.md)
- [Compare index (md)](https://benchlm.ai/md/compare/index.md)
- [Pricing (md)](https://benchlm.ai/md/pricing.md)
- [Alternatives directory (md)](https://benchlm.ai/md/alternatives/index.md)
- [Alternative Finder (md)](https://benchlm.ai/md/tools/alternative-finder.md)
- [AI Cost Calculator (md)](https://benchlm.ai/md/tools/ai-cost-calculator.md)
- [LLM Selector (md)](https://benchlm.ai/md/tools/llm-selector.md)
- [Cost Calculator (md)](https://benchlm.ai/md/tools/cost-calculator.md)
- Individual alternative pages available at: `https://benchlm.ai/md/alternatives/[slug].md`
- Individual model pages available at: `https://benchlm.ai/md/models/[slug].md`
- Benchmark pages available at: `https://benchlm.ai/md/benchmarks/[slug].md`
- Best-of ranking pages available at: `https://benchlm.ai/md/best/[slug].md`
- Comparison pages available at: `https://benchlm.ai/md/compare/[slug].md`
- Blog posts available at: `https://benchlm.ai/md/blog/[slug].md` (a fetch sketch for these patterns follows the notes below)

## Data & Technical Notes

- Data last updated: March 17, 2026
- Canonical model families tracked: 54
- Total pairwise comparisons available: 8646 (see the arithmetic check below)
- Built with Next.js and deployed on Cloudflare Workers via OpenNext
- Sitemap: https://benchlm.ai/sitemap.xml
- Full crawler bundle: https://benchlm.ai/llms-full.txt
- Author: [@glevd](https://x.com/glevd)
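For the Chatbot Arena post above, here is a minimal sketch of the standard Elo update it describes. The K-factor and 400-point logistic scale are the conventional textbook defaults, not Arena's documented production settings, which may differ:

```python
# Standard Elo update for one blind head-to-head preference vote.
# K = 32 and the 400-point scale are conventional defaults; they are
# assumptions here, not Chatbot Arena's published parameters.
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    # Expected score of A under the logistic model.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    # Winner gains what the loser gives up; total rating is preserved.
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * (expected_a - score_a)
    return new_a, new_b

# Upset example: a 1200-rated model beats a 1300-rated one,
# so both ratings move by about 20 points.
print(elo_update(1200.0, 1300.0, a_won=True))  # (~1220.5, ~1279.5)
```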
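All of the `[slug].md` mirror patterns above share one URL shape, so a few lines of code cover the whole mirror. A minimal fetch sketch: the `fetch_md` helper is illustrative, not an official client, and the slugs in the example are taken from model and best-of URLs elsewhere on this page:

```python
# Fetch a markdown mirror by section and slug, following the
# https://benchlm.ai/md/[section]/[slug].md patterns listed above.
import urllib.request

def fetch_md(section: str, slug: str) -> str:
    """section is one of: alternatives, models, benchmarks, best, compare, blog."""
    url = f"https://benchlm.ai/md/{section}/{slug}.md"
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

# Slugs taken from URLs on this page.
print(fetch_md("models", "gpt-5-4")[:300])   # model profile mirror
print(fetch_md("best", "coding")[:300])      # best-of ranking mirror
```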
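And the arithmetic check promised in the notes: 8646 matches the number of unordered pairs of the 132 tracked models exactly, which suggests one comparison page per model pair. That reading is an inference from the matching figures, not a documented guarantee:

```python
# Check the stated comparison count against the tracked model count:
# C(132, 2) unordered pairs of 132 models.
from math import comb

print(comb(132, 2))  # 8646 = 132 * 131 / 2, matching the figure above
```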