Model profile

Qwen3.7 Plus

Name: Qwen3.7 Plus
Author: Alibaba

Status: Current
Released: Jun 3, 2026

Data verified July 23, 2026

Overall Score: 67.22 (Public #21 of 200, Verified #19 of 99)
Arena Elo: 1461
Eligible category ranks: 6 of 8
Price (1M tokens): Not listed
Speed: Not listed

Context: 1M tokens

Evidence coverage

69 of 323 tracked benchmarks are published. 52 are verified and 17 provisional. 8 of 8 categories are measured.

Updated July 23, 2026

Published / tracked: 69 / 323
Verified: 52
Provisional: 17
Categories with evidence: 8 / 8
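The coverage counts above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are hypothetical, and the figures are copied from this profile rather than fetched from BenchLM.

```python
# Counts copied from the Evidence coverage section of this profile.
PUBLISHED = 69
TRACKED = 323
VERIFIED = 52
PROVISIONAL = 17


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100 * part / whole, 1)


# Verified and provisional rows should account for every published row.
counts_consistent = VERIFIED + PROVISIONAL == PUBLISHED

coverage_pct = share(PUBLISHED, TRACKED)    # published share of tracked benchmarks
verified_pct = share(VERIFIED, PUBLISHED)   # verified share of published benchmarks

print(f"consistent={counts_consistent} coverage={coverage_pct}% verified={verified_pct}%")
```

On these figures, roughly a fifth of the tracked benchmarks are published, and about three quarters of the published rows are verified.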

Evidence by category
Category	Benchmarks	Evidence
Agentic	16	Mixed evidence
Coding	9	Mixed evidence
Reasoning	3	Mixed evidence
Knowledge	13	Mixed evidence
Math	3	Verified
Multilingual	5	Verified
Multimodal	17	Mixed evidence
Inst. Following	3	Mixed evidence

Proprietary · Reasoning

Confidence: Very high

base

Qwen3.7 Plus ranks #21 out of 200 models on the public leaderboard with an overall score of 67.22/100. It also ranks #19 out of 99 on the verified leaderboard. This places it in roughly the top 11% of publicly ranked models, with its strongest results in multilingual and instruction-following benchmarks.

Qwen3.7 Plus is a proprietary model with a 1M token context window. It uses explicit chain-of-thought reasoning, which typically improves performance on math and complex reasoning tasks at the cost of higher latency and token usage.

BenchLM links it directly to Qwen3.6 Plus as the earlier related model in that lineage. This profile currently has published results for 69 of 323 tracked benchmarks. BenchLM only exposes non-generated benchmark rows publicly, so benchmarks without a sourced evaluation stay blank until one is available.

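Its strongest category by rank is Multilingual (#3 of 13), while its weakest is Agentic (#81 of 119). The profile is uneven rather than well-rounded: it sits at or above the 68th percentile in Multilingual, Instruction Following, Multimodal, and Coding, but only at the 33rd and 32nd percentiles in Knowledge and Agentic.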

Peer position

Exact provisional scores and ranks for the closest listed peers. A score can appear before a model clears the evidence threshold for a rank, so equal scores can have different rank states.

Range 66.79–67.73

Rank	Model	Author	Score
#19	Gemini 3 Pro	Google	67.73
#20	Inkling	Thinking Machines Lab	67.54
#21	Qwen3.7 Plus (current model)	Alibaba	67.22
#22	GPT-5.6 Luna	OpenAI	67.17
#23	GPT-5.2 Pro	OpenAI	67.01
#24	GLM-5-Turbo	Z.AI	66.89
#25	GPT-5.4 nano	OpenAI	66.79

Category percentile

Relative position among models eligible for each sourced category. A higher percentile means a stronger position within that category's ranked cohort; 100 is highest.

Category	Percentile	Eligible cohort rank	Category score
Multilingual	83%	#3 of 13	78.9
Inst. Following	87%	#5 of 31	91.4
Multimodal	68%	#10 of 29	71.5
Coding	79%	#27 of 122	58.8
Knowledge	33%	#35 of 52	65.1
Agentic	32%	#81 of 119	43.9
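The page does not state how the percentile is derived from the cohort rank. One formula that reproduces all six published values is a linear mapping in which rank 1 scores 100 and the last rank scores 0. The sketch below checks that assumption against the rows above; the function name is hypothetical and the formula is inferred from the data, not taken from BenchLM's methodology.

```python
def cohort_percentile(rank: int, cohort_size: int) -> int:
    """Map a rank to a 0-100 percentile, with rank 1 -> 100 and last rank -> 0.

    Assumed formula (inferred from the published rows, not documented):
        100 * (cohort_size - rank) / (cohort_size - 1), rounded to an integer.
    """
    if cohort_size < 2:
        raise ValueError("cohort_size must be at least 2")
    if not 1 <= rank <= cohort_size:
        raise ValueError("rank must lie within the cohort")
    return round(100 * (cohort_size - rank) / (cohort_size - 1))


# category -> (rank, cohort size, published percentile), copied from this page
PUBLISHED_ROWS = {
    "Multilingual": (3, 13, 83),
    "Inst. Following": (5, 31, 87),
    "Multimodal": (10, 29, 68),
    "Coding": (27, 122, 79),
    "Knowledge": (35, 52, 33),
    "Agentic": (81, 119, 32),
}

mismatches = [
    name
    for name, (rank, size, published) in PUBLISHED_ROWS.items()
    if cohort_percentile(rank, size) != published
]
print("mismatches:", mismatches)
```

Because the mapping depends on cohort size, a rank of #27 in a cohort of 122 (Coding) lands at a higher percentile than a rank of #10 in a cohort of 29 (Multimodal).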

Category evidence

Scores and ranks appear only where this model has published benchmark evidence. Categories without displayable source records remain not measured.

Category scores, ranks, weighting, benchmark coverage, and evidence status
Category	Score	Rank	Percentile	Weight	Benchmarks	Evidence
Agentic	43.9	#81 of 119	32nd	22%	16 benchmarks	Mixed sources
Coding	58.8	#27 of 122	79th	20%	9 benchmarks	Mixed sources
Reasoning	87.1	Not ranked	Not available	17%	3 benchmarks	Mixed sources
Knowledge	65.1	#35 of 52	33rd	12%	13 benchmarks	Mixed sources
Math	79.6	Not ranked	Not available	5%	3 benchmarks	Verified
Multilingual	78.9	#3 of 13	83rd	7%	5 benchmarks	Verified
Multimodal	71.5	#10 of 29	68th	12%	17 benchmarks	Mixed sources
Inst. Following	91.4	#5 of 31	87th	5%	3 benchmarks	Mixed sources
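The category weights in the table sum to 100%, which invites a quick sanity check against the overall score. The sketch below computes a plain weighted mean of the eight category scores. This is an assumption about how the scores might combine, not BenchLM's documented method, and the names are hypothetical. The result, about 66.69, is close to but not equal to the published 67.22, so the published overall score evidently involves steps that this page does not describe.

```python
# category -> (category score, weight in percent), copied from the table above
CATEGORIES = {
    "Agentic": (43.9, 22),
    "Coding": (58.8, 20),
    "Reasoning": (87.1, 17),
    "Knowledge": (65.1, 12),
    "Math": (79.6, 5),
    "Multilingual": (78.9, 7),
    "Multimodal": (71.5, 12),
    "Inst. Following": (91.4, 5),
}

PUBLISHED_OVERALL = 67.22


def weighted_mean(rows: dict) -> float:
    """Weighted mean of (score, weight) pairs; weights need not sum to 100."""
    total_weight = sum(weight for _, weight in rows.values())
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return sum(score * weight for score, weight in rows.values()) / total_weight


total_weight = sum(weight for _, weight in CATEGORIES.values())
naive_overall = round(weighted_mean(CATEGORIES), 2)
gap = round(PUBLISHED_OVERALL - naive_overall, 2)

print(f"weights sum to {total_weight}%")
print(f"naive weighted mean {naive_overall}, published {PUBLISHED_OVERALL}, gap {gap}")
```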

Chatbot Arena performance

Chatbot Arena Elo, confidence interval, and vote count by evaluation view
View	Elo	Confidence interval	Votes
Text Overall	1461	±5.6	21,513
Coding	1510	±8.4	6,298
Math	1474	±18.7	996
Instruction Following	1450	±7.8	7,337
Creative Writing	1439	±10.4	3,636
Multi-turn	1473	±10.4	3,612
Hard Prompts	1479	±6.5	14,310
Hard Prompts (English)	1482	±8.2	6,601
Longer Query	1469	±7.5	9,441
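To give the Elo figures some intuition, the sketch below turns a rating gap into an expected head-to-head score using the conventional Elo logistic formula with a 400-point scale, and shows how the ± value shifts that estimate. It assumes the ratings on this page follow that conventional scale and that the ± value is a symmetric half-width. The opponent rating is made up for illustration and does not refer to any listed model.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Conventional Elo expected score of A against B (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def interval(rating: float, half_width: float) -> tuple:
    """Lower and upper bounds implied by a symmetric +/- half-width."""
    return (rating - half_width, rating + half_width)


TEXT_OVERALL = (1461, 5.6)   # Elo and +/- value from the Text Overall row
HYPOTHETICAL_PEER = 1411     # made-up opponent rating, 50 points lower

low, high = interval(*TEXT_OVERALL)
p_mid = expected_score(TEXT_OVERALL[0], HYPOTHETICAL_PEER)
p_low = expected_score(low, HYPOTHETICAL_PEER)
p_high = expected_score(high, HYPOTHETICAL_PEER)

print(f"interval: {low:.1f} to {high:.1f}")
print(f"expected score vs peer: {p_mid:.3f} (range {p_low:.3f} to {p_high:.3f})")
```

Ratings are only comparable within a single evaluation view, so the same calculation should not be used to compare, for example, the Coding row against the Text Overall row.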

Benchmark Details

Rows below have a displayable published verification record. Source-unverified manual rows and generated rows stay hidden.
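Each row in this section is marked either "Weighted N%" or "Display only". The sketch below sums the weighted percentages of the published rows per category, using values copied from this section; the names are hypothetical. Where a sum falls below 100%, the remainder presumably belongs to benchmarks that have no published row for this model, although the page does not say so.

```python
from collections import defaultdict

# (category, benchmark, weight in percent) for every row marked "Weighted"
WEIGHTED_ROWS = [
    ("Agentic", "Terminal-Bench 2.0", 38),
    ("Agentic", "OSWorld-Verified", 34),
    ("Coding", "LiveCodeBench", 38),
    ("Coding", "SWE-bench Verified", 16),
    ("Coding", "SciCode", 16),
    ("Coding", "SWE-bench Pro", 10),
    ("Reasoning", "MRCRv2", 31),
    ("Knowledge", "HLE", 45),
    ("Knowledge", "MMLU-Pro", 30),
    ("Knowledge", "GPQA", 7),
    ("Knowledge", "SuperGPQA", 7),
    ("Math", "HMMT Feb 2026", 25),
    ("Multilingual", "MMLU-ProX", 100),
    ("Multimodal", "MMMU-Pro", 45),
    ("Multimodal", "CharXiv", 25),
    ("Inst. Following", "IFBench", 65),
    ("Inst. Following", "IFEval", 35),
]


def weighted_share_by_category(rows) -> dict:
    """Sum the published weighted percentages for each category."""
    totals = defaultdict(int)
    for category, _benchmark, weight in rows:
        totals[category] += weight
    return dict(totals)


shares = weighted_share_by_category(WEIGHTED_ROWS)
for category, total in sorted(shares.items(), key=lambda item: item[1]):
    print(f"{category}: {total}%")
```

On these rows, Math (25%) and Reasoning (31%) have the smallest weighted share, and they are also the two categories shown as Not ranked. The page does not state its evidence threshold, so this is a consistent observation rather than a confirmed rule.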

Agentic (16 benchmarks)

Benchmark	Status	Score	Weighting	Source
Terminal-Bench 2.0	Provider exact	70.3%	Weighted 38%	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
OSWorld-Verified	Provider exact	73.3%	Weighted 34%	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
QwenClawBench	Provider exact	61.8%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
QwenWebBench	Provider exact	1536	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
Claw-Eval	Provider exact	62.7%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
BFCL v4 (Berkeley Function Calling Leaderboard v4)	Provider exact	72.9%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
MCP Atlas	Provider exact	73.2%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
VITA-Bench	Provider exact	45.6%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
DeepPlanning	Provider exact	62.3%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
AndroidWorld	Provider exact	81.0%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
AA Agentic Index (Artificial Analysis Agentic Index)	Reported	20.8%	Display only	Artificial Analysis model benchmarks
APEX-Agents-AA	Reported	22.4%	Display only	Artificial Analysis model benchmarks
τ²-bench results (τ²-Bench Tool-Agent-User Evaluation)	Reported	93%	Display only	Artificial Analysis model benchmarks
GDPval-AA (GDPval-AA normalized)	Reported	21.8%	Display only	Artificial Analysis model benchmarks
GDPval-AA	Reported	936	Display only	Artificial Analysis model benchmarks
OSWorld 2.0	Benchmark exact	2.8%	Display only	OSWorld 2.0 paper

Provenance notes
Terminal-Bench 2.0: Alibaba Cloud reports Terminal-Bench 2.0-Terminus at 70.3.
QwenWebBench: Alibaba Cloud reports QwenWebDev Elo at 1536. BenchLM stores it on the existing Qwen web agent benchmark key.
OSWorld 2.0: OSWorld 2.0 reports Qwen 3.7-Plus single action on its 500-step main table. BenchLM stores the binary completion score and notes the corresponding partial score was 21.5%.
Artificial Analysis rows: Display-only rows synced from the current Artificial Analysis model payload. They are excluded from BenchLM weighted scoring.
All other rows: Provider exact.

Coding (9 benchmarks)

Benchmark	Status	Score	Weighting	Source
LiveCodeBench (LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code)	Provider exact	89.6%	Weighted 38%	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
SWE-bench Verified (Software Engineering Benchmark Verified)	Provider exact	77.7%	Weighted 16%	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
SciCode (Scientific Code Benchmark)	Provider exact	51.3%	Weighted 16%	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
SWE-bench Pro	Provider exact	57.6%	Weighted 10%	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
Terminal-Bench 2.0	Provider exact	70.3%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
SWE Multilingual	Provider exact	75.8%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
NL2Repo	Provider exact	41.1%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
AA Coding Index (Artificial Analysis Coding Index)	Reported	55.9%	Display only	Artificial Analysis model benchmarks
AA-SciCode (Artificial Analysis SciCode)	Reported	45.5%	Display only	Artificial Analysis model benchmarks

Provenance notes
Terminal-Bench 2.0: Alibaba Cloud reports Terminal-Bench 2.0-Terminus at 70.3. BenchLM stores it in both coding and agentic views.
Artificial Analysis rows: Display-only rows synced from the current Artificial Analysis model payload. They are excluded from BenchLM weighted scoring.
All other rows: Provider exact.

Reasoning (3 benchmarks)

Benchmark	Status	Score	Weighting	Source
MRCRv2	Provider exact	91.7%	Weighted 31%	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
CritPt (Critical Physics Tasks)	Provider exact	9.1%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
AA-LCR (Artificial Analysis Long Context Reasoning)	Reported	65.0%	Display only	Artificial Analysis model benchmarks

Provenance notes
MRCRv2: Alibaba Cloud reports MRCR-v2 128k at 91.7. BenchLM stores it on the existing MRCR-v2 key.
CritPt: Provider exact.
AA-LCR: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.

Knowledge (13 benchmarks)

Benchmark	Status	Score	Weighting	Source
HLE (Humanity's Last Exam)	Provider exact	34.7%	Weighted 45%	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
MMLU-Pro (Massive Multitask Language Understanding Professional)	Provider exact	88.5%	Weighted 30%	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
GPQA (Graduate-Level Google-Proof Q&A)	Provider exact	90.3%	Weighted 7%	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
SuperGPQA (SuperGPQA: Scaling LLM Evaluation Across 285 Graduate Disciplines)	Provider exact	71.4%	Weighted 7%	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
GPQA-D (GPQA Diamond)	Provider exact	90.3%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
MMLU-Redux	Provider exact	94.5%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
MMMLU	Provider exact	89.0%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
Artificial Analysis Intelligence Index	Reported	39.0%	Display only	Artificial Analysis model benchmarks
AA-GPQA Diamond (Artificial Analysis GPQA Diamond)	Reported	90.0%	Display only	Artificial Analysis model benchmarks
AA-HLE (Artificial Analysis Humanity's Last Exam)	Reported	33.4%	Display only	Artificial Analysis model benchmarks
AA-Omniscience Index (Artificial Analysis Omniscience Index)	Reported	2.4%	Display only	Artificial Analysis model benchmarks
AA-Omniscience Accuracy (Artificial Analysis Omniscience Accuracy)	Reported	22.2%	Display only	Artificial Analysis model benchmarks
AA-Omniscience Hallucination Rate (Artificial Analysis Omniscience Hallucination Rate)	Reported	25.5%	Display only	Artificial Analysis model benchmarks

Provenance notes
GPQA: Alibaba Cloud reports GPQA Diamond at 90.3. BenchLM stores that exact value on the weighted GPQA lane and the display GPQA-Diamond lane.
GPQA-D: Alibaba Cloud reports GPQA Diamond at 90.3.
Artificial Analysis rows: Display-only rows synced from the current Artificial Analysis model payload. They are excluded from BenchLM weighted scoring.
All other rows: Provider exact.

Math (3 benchmarks)

Benchmark	Status	Score	Weighting	Source
HMMT Feb 2026 (Harvard-MIT Mathematics Tournament February 2026)	Provider exact	92.9%	Weighted 25%	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
IMOAnswerBench	Provider exact	86.0%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
Apex	Provider exact	22.7%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks

Provenance: Provider exact for all rows.

Multilingual (5 benchmarks)

Benchmark	Status	Score	Weighting	Source
MMLU-ProX	Provider exact	85.4%	Weighted 100%	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
NOVA-63	Provider exact	58.8%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
INCLUDE	Provider exact	83.0%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
MAXIFE	Provider exact	88.8%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
PolyMath	Provider exact	84.0%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks

Provenance: Provider exact for all rows.

Multimodal (17 benchmarks)

Benchmark	Status	Score	Weighting	Source
MMMU-Pro (Massive Multi-discipline Multimodal Understanding Pro)	Provider exact	79%	Weighted 45%	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
CharXiv (CharXiv Reasoning)	Provider exact	85.9%	Weighted 25%	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
MathVision	Provider exact	90.3%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
ERQA	Provider exact	69.8%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
MedXpertQA (MM) (MedXpertQA Multimodal)	Provider exact	71.0%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
ScreenSpot Pro	Provider exact	79.0%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
SimpleVQA	Provider exact	81.7%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
MMSearch-Plus	Provider exact	41.4%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
RealWorldQA	Provider exact	86.9%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
OmniDocBench 1.5	Provider exact	91.4%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
OCRBench V2	Provider exact	70.7%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
ODINW13	Provider exact	51.1%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
Video-MME (with subtitle)	Provider exact	88.0%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
VideoMMMU	Provider exact	85.4%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
MLVU (M-Avg) (MLVU mean average)	Provider exact	87.4%	Display only	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
AA-MMMU-Pro (Artificial Analysis MMMU-Pro)	Reported	80.5%	Display only	Artificial Analysis model benchmarks
Design Arena Website (Design Arena Website Elo)	Reported	1288	Display only	OpenRouter model benchmarks

Provenance notes
CharXiv: Alibaba Cloud reports CharXiv(RQ) as 85.9 with CI / 84.4 without CI. BenchLM stores the with-CI value on the existing CharXiv key.
OCRBench V2: Alibaba Cloud reports OCR-Bench-V2 (EN) at 70.7. BenchLM stores the English row on the existing OCR-Bench-V2 key.
AA-MMMU-Pro: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.
Design Arena Website: Display-only Design Arena Website Elo synced from OpenRouter model benchmark metadata. It is excluded from BenchLM weighted scoring.
All other rows: Provider exact.

Inst. Following (3 benchmarks)

Benchmark	Status	Score	Weighting	Source
IFBench (Instruction Following Benchmark)	Provider exact	79.1%	Weighted 65%	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
IFEval (Instruction-Following Eval)	Provider exact	94.6%	Weighted 35%	Alibaba Cloud: Qwen3.7-Plus launch benchmarks
AA-IFBench (Artificial Analysis IFBench)	Reported	78.0%	Display only	Artificial Analysis model benchmarks

Provenance notes
IFBench, IFEval: Provider exact.
AA-IFBench: Display-only row synced from the current Artificial Analysis model payload. It is excluded from BenchLM weighted scoring.
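Five benchmarks in Benchmark Details appear twice: once as a provider-reported figure from Alibaba Cloud and once as an Artificial Analysis (AA) figure. The sketch below lines up those pairs and computes the gap in percentage points. It assumes the same-named rows measure comparable things, which this page does not confirm; evaluation setups may differ. The names in the code are hypothetical.

```python
# benchmark -> (provider-reported score, Artificial Analysis score), in percent,
# copied from the Benchmark Details rows on this page
PAIRED_SCORES = {
    "SciCode": (51.3, 45.5),
    "GPQA Diamond": (90.3, 90.0),
    "HLE": (34.7, 33.4),
    "MMMU-Pro": (79.0, 80.5),
    "IFBench": (79.1, 78.0),
}


def score_gaps(pairs: dict) -> dict:
    """Provider score minus AA score, in percentage points (one decimal place)."""
    return {name: round(provider - aa, 1) for name, (provider, aa) in pairs.items()}


gaps = score_gaps(PAIRED_SCORES)
largest = max(gaps, key=lambda name: abs(gaps[name]))

for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f} points")
print("largest absolute gap:", largest)
```

A positive gap means the provider-reported figure is higher. On these rows, four of the five provider figures are higher, MMMU-Pro is the exception, and SciCode shows the widest gap.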

Qwen3.7 Plus Family

Base entry

Related Earlier Model

Qwen3.6 Plus

Frequently Asked Questions

How does Qwen3.7 Plus perform overall in AI benchmarks?

Qwen3.7 Plus currently ranks #21 out of 200 models on BenchLM's provisional leaderboard with an overall score of 67.22. It also ranks #19 out of 99 on the verified leaderboard. It was created by Alibaba and has a published context window of 1M tokens.

Is Qwen3.7 Plus good for knowledge and understanding?

Qwen3.7 Plus ranks #35 out of 52 models in knowledge and understanding benchmarks with an average score of 65.1. There are stronger options in this category.

Is Qwen3.7 Plus good for coding and programming?

Qwen3.7 Plus ranks #27 out of 122 models in coding and programming benchmarks with an average score of 58.8. That is the 79th percentile of its cohort, although 26 models rank higher.

Is Qwen3.7 Plus good for mathematics?

Qwen3.7 Plus has visible benchmark coverage in mathematics, but BenchLM does not currently assign it a global category rank there.

Is Qwen3.7 Plus good for reasoning and logic?

Qwen3.7 Plus has visible benchmark coverage in reasoning and logic, but BenchLM does not currently assign it a global category rank there.

Is Qwen3.7 Plus good for agentic tool use and computer tasks?

Qwen3.7 Plus ranks #81 out of 119 models in agentic tool use and computer task benchmarks with an average score of 43.9. This is its weakest ranked category, and there are stronger options.

Is Qwen3.7 Plus good for multimodal and grounded tasks?

Qwen3.7 Plus ranks #10 out of 29 models in multimodal and grounded task benchmarks with an average score of 71.5, which places it at the 68th percentile of its cohort.

Is Qwen3.7 Plus good for instruction following?

Qwen3.7 Plus ranks #5 out of 31 models in instruction following benchmarks with an average score of 91.4. It is among the top performers in this category.

Is Qwen3.7 Plus good for multilingual tasks?

Qwen3.7 Plus ranks #3 out of 13 models in multilingual tasks benchmarks with an average score of 78.9. It is among the top performers in this category.

Does Qwen3.7 Plus have full benchmark coverage on BenchLM?

Not yet. Qwen3.7 Plus currently has 69 published benchmark scores out of the 323 benchmarks BenchLM tracks. BenchLM only exposes non-generated public benchmark rows, so benchmarks without a sourced evaluation stay blank until one is available.

What is the context window size of Qwen3.7 Plus?

Qwen3.7 Plus has a published context window of 1M tokens, which determines how much text it can process in a single interaction.

Last updated: July 23, 2026 · Runtime metrics stay blank until BenchLM has a sourced snapshot.
