Qwen3.6 Plus Benchmark Scores & Performance

Benchmark analysis of Qwen3.6 Plus by Alibaba across 60 sourced tests on BenchLM.

According to BenchLM.ai, Qwen3.6 Plus ranks #27 out of 98 models with an overall score of 69/100. This places it in the upper-middle of the leaderboard, with its best category ranks in instruction following (#5) and multimodal and grounded tasks (#21).
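For a rough sense of what the overall rank means, it can be converted into the share of tracked models that sit below it. This is a minimal sketch using only the two figures quoted above; it assumes the ranking has no ties, which the page does not confirm, and the variable names are illustrative.

```python
# Overall leaderboard position quoted on this page.
rank = 27
total_models = 98

# Assumption: no tied ranks, so every model after #27 scores lower.
models_below = total_models - rank
share_below = models_below / total_models

print(f"Ranked above {models_below} of {total_models} models ({share_below:.1%})")
```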

Qwen3.6 Plus is a proprietary model with a 1M-token context window. It uses explicit chain-of-thought reasoning, which typically improves performance on math and complex reasoning tasks at the cost of higher latency and token usage.

This profile currently has 60 of 125 tracked benchmarks. BenchLM only exposes verified benchmark rows publicly, so missing categories stay blank until a sourced evaluation is available.
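As a quick arithmetic check on the coverage figure, the visible rows can be tallied per category and compared with the 125 tracked benchmarks. The per-category counts below are read off the benchmark sections of this page; the names used are illustrative and not part of any BenchLM API.

```python
# Rows visible on this profile, counted per benchmark section of the page.
rows_per_category = {
    "Knowledge": 6,
    "Coding": 5,
    "Mathematics": 5,
    "Reasoning": 2,
    "Agentic": 12,
    "Multimodal & Grounded": 22,
    "Instruction Following": 2,
    "Multilingual": 6,
}

TRACKED_BENCHMARKS = 125  # total tracked by BenchLM, as stated on this page

covered = sum(rows_per_category.values())
coverage = covered / TRACKED_BENCHMARKS

print(f"{covered} of {TRACKED_BENCHMARKS} benchmarks covered ({coverage:.0%})")
```

The tally comes to 60 rows, matching the stated count, so a little under half of the tracked benchmarks have a sourced score for this model.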

Among the categories BenchLM ranks, its strongest is Instruction Following (#5) and its weakest is Agentic (#32), with Knowledge (#31) close behind. Mathematics and Reasoning have visible scores but no category rank yet.

Provider

Alibaba

Source Type

Proprietary

Reasoning

Reasoning

Context Window

1M

Model Status

Current

Release Date

Apr 2, 2026

Overall Score

69 · #27 of 98

Pricing

$0.00 / $0.00

Input / output per 1M tokens

Runtime

N/A

Latency unavailable

Knowledge Benchmarks

GPQA · Refreshing
90.4%

GPQA Diamond · Static refresh · updated April 2, 2026

SuperGPQA · Current
71.6%

SuperGPQA 2025 · Quarterly refresh · updated April 2, 2026

MMLU-Pro · Refreshing
88.5%

MMLU-Pro · Static refresh · updated April 2, 2026

MMLU-Redux · Current · Display only
94.5%

MMLU-Redux 2026 · Quarterly refresh · updated April 2, 2026

C-Eval · Stale · Display only
93.3%

C-Eval 2023 · Static refresh · updated April 2, 2026

HLE · Current
28.8%

Humanity's Last Exam · Static refresh · updated April 2, 2026

Coding Benchmarks

SWE-bench Verified · Refreshing
78.8%

SWE-bench Verified 2024 · Annual refresh · updated April 2, 2026

SWE-bench Pro · Current
56.6%

SWE-bench Pro 2026 · Quarterly refresh · updated April 2, 2026

SWE Multilingual · Current · Display only
73.8%

SWE Multilingual 2026 · Quarterly refresh · updated April 2, 2026

LiveCodeBench v6 · Current · Display only
87.1%

LiveCodeBench v6 2026 · Quarterly refresh · updated April 2, 2026

NL2Repo · Current · Display only
37.9%

NL2Repo 2026 · Quarterly refresh · updated April 2, 2026

Mathematics Benchmarks

AIME26 · Current · Display only
95.3%

AIME26 2026 · Quarterly refresh · updated April 2, 2026

HMMT Feb 2025 · Current · Display only
96.7%

HMMT Feb 2025 2025 · Quarterly refresh · updated April 2, 2026

HMMT Nov 2025 · Current · Display only
94.6%

HMMT Nov 2025 2025 · Quarterly refresh · updated April 2, 2026

HMMT Feb 2026 · Current · Display only
87.8%

HMMT Feb 2026 2026 · Quarterly refresh · updated April 2, 2026

MMAnswerBench · Current · Display only
83.8%

MMAnswerBench 2026 · Quarterly refresh · updated April 2, 2026

Reasoning Benchmarks

AI-Needle · Current · Display only
68.3%

AI-Needle 2026 · Quarterly refresh · updated April 2, 2026

LongBench v2 · Current
62%

LongBench v2 2025 · Quarterly refresh · updated April 2, 2026

Agentic Benchmarks

Terminal-Bench 2.0 · Current
61.6%

Terminal-Bench 2 · Quarterly refresh · updated April 2, 2026

Claw-Eval · Current · Display only
58.7%

Claw-Eval 2026 · Quarterly refresh · updated April 2, 2026

QwenClawBench · Current · Display only
57.2%

QwenClawBench 2026 · Quarterly refresh · updated April 2, 2026

QwenWebBench · Current · Display only
1502

QwenWebBench 2026 · Quarterly refresh · updated April 2, 2026

TAU3-Bench · Current · Display only
70.7%

TAU3-Bench 2026 · Quarterly refresh · updated April 2, 2026

VITA-Bench · Current · Display only
44.3%

VITA-Bench 2025 · Quarterly refresh · updated April 2, 2026

DeepPlanning · Current · Display only
41.5%

DeepPlanning 2026 · Quarterly refresh · updated April 2, 2026

Toolathlon · Current · Display only
39.8%

Toolathlon 2026 · Quarterly refresh · updated April 2, 2026

MCP Atlas · Current · Display only
48.2%

MCP Atlas 2026 · Quarterly refresh · updated April 2, 2026

MCP-Tasks · Current · Display only
74.1%

MCP-Tasks 2026 · Quarterly refresh · updated April 2, 2026

WideResearch · Current · Display only
74.3%

WideResearch 2026 · Quarterly refresh · updated April 2, 2026

OSWorld-Verified · Current
62.5%

OSWorld Verified · Quarterly refresh · updated April 2, 2026

Multimodal & Grounded Benchmarks

MMMU · Refreshing · Display only
86.0%

MMMU 2024 · Annual refresh · updated April 2, 2026

MMMU-Pro · Refreshing
78.8%

MMMU-Pro 2024 · Annual refresh · updated April 2, 2026

RealWorldQA · Current · Display only
85.4%

RealWorldQA 2026 · Quarterly refresh · updated April 2, 2026

OmniDocBench 1.5 · Current · Display only
91.2%

OmniDocBench 1.5 2026 · Quarterly refresh · updated April 2, 2026

Video-MME (with subtitle) · Current · Display only
87.8%

Video-MME (with subtitle) 2026 · Quarterly refresh · updated April 2, 2026

Video-MME (w/o subtitle) · Current · Display only
84.2%

Video-MME (w/o subtitle) 2026 · Quarterly refresh · updated April 2, 2026

MathVision · Current · Display only
88.0%

MathVision 2026 · Quarterly refresh · updated April 2, 2026

We-Math · Current · Display only
89.0%

We-Math 2026 · Quarterly refresh · updated April 2, 2026

DynaMath · Current · Display only
88.0%

DynaMath 2026 · Quarterly refresh · updated April 2, 2026

MStar · Current · Display only
83.3%

MStar 2026 · Quarterly refresh · updated April 2, 2026

SimpleVQA · Current · Display only
67.3%

SimpleVQA 2026 · Quarterly refresh · updated April 2, 2026

ChatCVQA · Current · Display only
81.5%

ChatCVQA 2026 · Quarterly refresh · updated April 2, 2026

MMLongBench-Doc · Current · Display only
62.0%

MMLongBench-Doc 2026 · Quarterly refresh · updated April 2, 2026

CC-OCR · Current · Display only
83.4%

CC-OCR 2026 · Quarterly refresh · updated April 2, 2026

AI2D_TEST · Current · Display only
94.4%

AI2D_TEST 2026 · Quarterly refresh · updated April 2, 2026

CountBench · Current · Display only
97.6%

CountBench 2026 · Quarterly refresh · updated April 2, 2026

RefCOCO (avg) · Current · Display only
93.5%

RefCOCO (avg) 2026 · Quarterly refresh · updated April 2, 2026

ODINW13 · Current · Display only
51.8%

ODINW13 2026 · Quarterly refresh · updated April 2, 2026

ERQA · Current · Display only
65.7%

ERQA 2026 · Quarterly refresh · updated April 2, 2026

VideoMMMU · Current · Display only
84.0%

VideoMMMU 2026 · Quarterly refresh · updated April 2, 2026

MLVU (M-Avg) · Current · Display only
86.7%

MLVU (M-Avg) 2026 · Quarterly refresh · updated April 2, 2026

ScreenSpot Pro · Current · Display only
68.2%

ScreenSpot Pro 2025 · Quarterly refresh · updated April 2, 2026

Instruction Following Benchmarks

IFEval · Stale
94.3%

IFEval 2023 · Static refresh · updated April 2, 2026

IFBench · Current · Display only
74.2%

IFBench 2026 · Quarterly refresh · updated April 2, 2026

Multilingual Benchmarks

MMLU-ProX · Current
84.7%

MMLU-ProX 2025 · Static refresh · updated April 2, 2026

NOVA-63 · Current · Display only
57.9%

NOVA-63 2026 · Quarterly refresh · updated April 2, 2026

INCLUDE · Current · Display only
85.1%

INCLUDE 2026 · Quarterly refresh · updated April 2, 2026

PolyMath · Current · Display only
77.4%

PolyMath 2026 · Quarterly refresh · updated April 2, 2026

VWT2k-lite · Current · Display only
84.3%

VWT2k-lite 2026 · Quarterly refresh · updated April 2, 2026

MAXIFE · Current · Display only
88.2%

MAXIFE 2026 · Quarterly refresh · updated April 2, 2026

Frequently Asked Questions

How does Qwen3.6 Plus perform overall in AI benchmarks?

Qwen3.6 Plus ranks #27 out of 98 models with an overall score of 69. It was developed by Alibaba and has a 1M-token context window.

Is Qwen3.6 Plus good for knowledge and understanding?

Qwen3.6 Plus ranks #31 out of 98 models in knowledge and understanding benchmarks with an average score of 66. There are stronger options in this category.

Is Qwen3.6 Plus good for coding and programming?

Qwen3.6 Plus ranks #25 out of 98 models in coding and programming benchmarks with an average score of 64.9. There are stronger options in this category.

Is Qwen3.6 Plus good for mathematics?

Qwen3.6 Plus has visible benchmark coverage in mathematics, but BenchLM does not currently assign it a global category rank there.

Is Qwen3.6 Plus good for reasoning and logic?

Qwen3.6 Plus has visible benchmark coverage in reasoning and logic, but BenchLM does not currently assign it a global category rank there.

Is Qwen3.6 Plus good for agentic tool use and computer tasks?

Qwen3.6 Plus ranks #32 out of 98 models on benchmarks for agentic tool use and computer tasks, with an average score of 62. There are stronger options in this category.

Is Qwen3.6 Plus good for multimodal and grounded tasks?

Qwen3.6 Plus ranks #21 out of 98 models on multimodal and grounded benchmarks, with an average score of 78.8. There are stronger options in this category.

Is Qwen3.6 Plus good for instruction following?

Qwen3.6 Plus ranks #5 out of 98 models in instruction following benchmarks with an average score of 94.3. It is among the top performers in this category.

Is Qwen3.6 Plus good for multilingual tasks?

Qwen3.6 Plus ranks #22 out of 98 models on multilingual benchmarks, with an average score of 84.7. There are stronger options in this category.
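One pattern in the category averages can be checked against the benchmark rows on this page. In the three categories where exactly one row is not marked "Display only", the quoted average equals that row's score. The sketch below tests this with figures taken from this page. That display-only rows are left out of category averages is an inference from these numbers, not something the page states, and the helper name `plain_mean` is illustrative. Knowledge and Coding are included as counterexamples: their quoted averages do not equal a plain mean of the non-display rows, so some other weighting is presumably applied there.

```python
def plain_mean(scores):
    """Unweighted mean of a list of percentage scores."""
    return sum(scores) / len(scores)

# (category, scores of rows NOT marked "Display only", average quoted in the FAQ)
checks = [
    ("Multimodal & Grounded", [78.8], 78.8),                # MMMU-Pro
    ("Instruction Following", [94.3], 94.3),                # IFEval
    ("Multilingual", [84.7], 84.7),                         # MMLU-ProX
    ("Knowledge", [90.4, 71.6, 88.5, 28.8], 66.0),          # GPQA, SuperGPQA, MMLU-Pro, HLE
    ("Coding", [78.8, 56.6], 64.9),                         # SWE-bench Verified, SWE-bench Pro
]

# Assumption: a difference under 0.05 counts as a match, since scores
# on the page are rounded to one decimal place.
matches = {}
for category, scores, quoted in checks:
    mean = plain_mean(scores)
    matches[category] = abs(mean - quoted) < 0.05
    print(f"{category}: plain mean {mean:.1f}, quoted {quoted}, match={matches[category]}")
```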

Does Qwen3.6 Plus have full benchmark coverage on BenchLM?

Not yet. Qwen3.6 Plus currently has 60 verified benchmark scores out of the 125 benchmarks BenchLM tracks. BenchLM only exposes verified public benchmark rows, so missing categories stay blank until a sourced evaluation is available.

What is the context window size of Qwen3.6 Plus?

Qwen3.6 Plus has a context window of 1M tokens, which determines how much text it can process in a single interaction.

Last updated: April 2, 2026 · Runtime metrics stay blank until BenchLM has a sourced snapshot.
