Benchmark profile

Discrete Reasoning Over Paragraphs (DROP)

A reading-comprehension benchmark requiring discrete reasoning over paragraphs, reported in DeepSeek-V4 base-model evaluations.

Data verified July 23, 2026

Benchmark score on DROP — July 23, 2026

BenchLM mirrors the published score view for DROP. DeepSeek V4 Pro Base leads the public snapshot at 88.7%, followed by DeepSeek V4 Flash Base (88.6%) and Soofi S 30B-A3B (66.5%). BenchLM does not use these results to rank models overall.

1. DeepSeek V4 Pro Base (DeepSeek, open weight) · deepseek-v4-pro-base · 88.7% · Overall: n/a · Context: 1M

2. DeepSeek V4 Flash Base (DeepSeek, open weight) · deepseek-v4-flash-base · 88.6% · Overall: n/a · Context: 1M

3. Soofi S 30B-A3B (Soofi Project, open weight) · soofi-s-30b-a3b · 66.5% · Overall: n/a · Context: 1M

3 models · Reasoning · Current · Display only · Updated July 23, 2026

Benchmark score table (3 models)

Model | Provider | Weights | Score
DeepSeek V4 Pro Base | DeepSeek | Open weight | 88.7%
DeepSeek V4 Flash Base | DeepSeek | Open weight | 88.6%
Soofi S 30B-A3B | Soofi Project | Open weight | 66.5%

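The published DROP snapshot places DeepSeek V4 Pro Base first at 88.7%. DeepSeek V4 Flash Base is 0.1 points behind, and Soofi S 30B-A3B, in third place, is 22.2 points behind. With only three published results, the full range is also 22.2 points: the top two systems are nearly tied, and the table mainly separates them from the third.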
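The 22.2-point figure follows directly from the three published scores. A minimal sketch of that arithmetic in Python, using only the values shown on this page (the variable and function names are illustrative and not part of BenchLM):

```python
# Published DROP scores from this page, in percent.
scores = {
    "DeepSeek V4 Pro Base": 88.7,
    "DeepSeek V4 Flash Base": 88.6,
    "Soofi S 30B-A3B": 66.5,
}

def gaps_to_leader(scores):
    """Return each model's deficit to the top score, in points, rounded to 0.1."""
    top = max(scores.values())
    return {name: round(top - value, 1) for name, value in scores.items()}

gaps = gaps_to_leader(scores)

# Range between the best and worst published score.
spread = round(max(scores.values()) - min(scores.values()), 1)

for name, gap in gaps.items():
    print(f"{name}: {gap:.1f} points behind the leader")
print(f"Range across published models: {spread:.1f} points")
```

Rounding to one decimal place matches the precision of the published scores and avoids floating-point noise in the differences.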

Three models have been evaluated on DROP. The benchmark falls in the Reasoning category, which carries a 17% weight in BenchLM.ai's overall scoring system. DROP is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
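This page does not give BenchLM's scoring formula, so the following is only a sketch of what "display only" could mean in practice. It assumes, hypothetically, that a category score is the mean of the benchmarks that count toward scoring and that each category contributes its weight times that mean. Only the 17% Reasoning weight and the DROP score come from this page; the `Benchmark` class, the `display_only` flag, and the placeholder benchmark with its 70.0 value are invented for illustration.

```python
from dataclasses import dataclass

REASONING_WEIGHT = 0.17  # from this page; the formula below is an assumption


@dataclass
class Benchmark:
    name: str
    score: float  # percent
    display_only: bool = False  # hypothetical flag, not a documented BenchLM field


def category_score(benchmarks):
    """Mean of the benchmarks that count toward scoring, or None if none do."""
    scored = [b.score for b in benchmarks if not b.display_only]
    return sum(scored) / len(scored) if scored else None


reasoning = [
    Benchmark("DROP", 88.7, display_only=True),        # shown, but skipped below
    Benchmark("hypothetical-scored-benchmark", 70.0),  # placeholder value
]

cat = category_score(reasoning)        # DROP does not enter the mean
contribution = REASONING_WEIGHT * cat  # assumed contribution to an overall score
print(f"Reasoning category score: {cat:.1f}, weighted contribution: {contribution:.2f}")
```

Under these assumptions, changing the DROP score leaves `cat` and `contribution` unchanged, which is the sense in which a display-only benchmark does not affect overall rankings.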

About DROP

Year

2026

Tasks

Paragraph reasoning questions

Format

Difficulty

Reading and numerical reasoning

BenchLM stores DROP as a display-only provider-table row when exact values are published in DeepSeek-V4 evaluations.

DeepSeek-V4 Technical Report

BenchLM freshness & provenance

Version

DROP 2026

Refresh cadence

Quarterly

Staleness state

Current

Question availability

Public benchmark set

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

FAQ

What does DROP measure?

DROP is a reading-comprehension benchmark requiring discrete reasoning over paragraphs, reported in DeepSeek-V4 base-model evaluations.

Which model scores highest on DROP?

DeepSeek V4 Pro Base by DeepSeek currently leads with a score of 88.7% on DROP.

How many models are evaluated on DROP?

Three AI models have been evaluated on DROP on BenchLM.

Compare Top Models on DROP

DeepSeek V4 Pro Base vs DeepSeek V4 Flash Base

DeepSeek V4 Flash Base vs Soofi S 30B-A3B

Last updated: July 23, 2026 · BenchLM version DROP 2026
