Vals CaseLaw v2 (CaseLaw v2)

Vals AI private question-answer benchmark over Canadian court cases.

How BenchLM shows CaseLaw v2

BenchLM mirrors the public Vals AI CaseLaw v2 leaderboard captured from https://www.vals.ai/benchmarks/case_law_v2 and updated by Vals on May 4, 2026. The snapshot preserves overall scores, uncertainty, latency, cost-per-test metadata, and task-level scores where Vals publishes them.

CaseLaw v2 is display only on BenchLM. Vals proprietary or Vals-hosted aggregate views are useful context, but BenchLM does not use them as weighted ranking inputs or as a replacement for benchmark-native source records.

54 Vals rows · 1 task view · Private dataset · Tasks: Overall · Display only

CaseLaw v2 scores as of May 4, 2026

BenchLM mirrors the published CaseLaw v2 score view. Grok 4.3 leads the public snapshot at 79.31%, followed by GPT-5.1 (73.42%) and GPT-4.1 (69.88%). BenchLM does not use these results to rank models overall.

54 models · External benchmark mirrors · Current · Display only · Updated May 4, 2026

The published CaseLaw v2 snapshot has a clear leader: Grok 4.3 sits at 79.31%, 5.89 points ahead of GPT-5.1 and 9.43 points ahead of the third row. The broader top-10 spread is 13.61 points, and ranks 4 through 10 fall within 2.79 points of each other, so the benchmark separates the leaders clearly while the rest of the top 10 clusters.
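
The 9.43 and 13.61 point figures can be reproduced from the score table further down the page. The snippet below is a minimal sketch, not BenchLM's own code: the helper name `gap` is hypothetical, and the scores are copied by hand from the top 10 rows of the mirrored snapshot. It uses `Decimal` so the differences come out exact to two places.

```python
from decimal import Decimal

# Top-10 scores (percent), copied in rank order from the CaseLaw v2 score table.
top10 = [Decimal(s) for s in (
    "79.31", "73.42", "69.88", "68.49", "68.38",
    "66.45", "66.24", "66.02", "65.81", "65.70",
)]


def gap(scores, i, j):
    """Point difference between the rows at ranks i and j (1-based)."""
    return scores[i - 1] - scores[j - 1]


lead_over_second = gap(top10, 1, 2)   # leader vs. rank 2
lead_over_third = gap(top10, 1, 3)    # leader vs. rank 3
top10_spread = gap(top10, 1, 10)      # leader vs. rank 10

print(lead_over_second, lead_over_third, top10_spread)
```

Run as written, this prints `5.89 9.43 13.61`.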

54 models have been evaluated on CaseLaw v2. The benchmark falls in the External benchmark mirrors category. BenchLM tracks this category separately from its weighted global scoring system, so these results are best compared on the dedicated External benchmark mirrors views. CaseLaw v2 is currently displayed for reference but excluded from the scoring formula, so it does not directly affect overall rankings.
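
To make "excluded from the scoring formula" concrete, here is a small illustrative sketch. It is not BenchLM's actual formula, which this page does not publish: the `BenchmarkResult` type, the `weighted_overall` function, the weights, and the two placeholder benchmarks are all assumptions chosen for illustration. The only point is that a row flagged display-only never enters the weighted average, whatever its score.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    name: str
    score: float         # percent, 0-100
    weight: float        # hypothetical weight, not a BenchLM value
    display_only: bool   # True means shown on the page, never scored


def weighted_overall(results):
    """Weighted mean over scored benchmarks; display-only rows are skipped."""
    scored = [r for r in results if not r.display_only]
    total_weight = sum(r.weight for r in scored)
    if total_weight == 0:
        return None  # nothing eligible to score
    return sum(r.score * r.weight for r in scored) / total_weight


# Placeholder inputs: two made-up scored benchmarks plus the mirrored row.
results = [
    BenchmarkResult("hypothetical-benchmark-a", 80.0, 2.0, False),
    BenchmarkResult("hypothetical-benchmark-b", 60.0, 1.0, False),
    BenchmarkResult("CaseLaw v2", 79.31, 1.0, True),
]

overall = weighted_overall(results)
print(round(overall, 2))
```

Changing the CaseLaw v2 score in this sketch leaves `overall` untouched, which is the behavior the paragraph above describes.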

About CaseLaw v2

Year: 2026
Tasks: Canadian case-law question answering
Format: Accuracy score
Difficulty: Professional legal retrieval and reasoning

Vals marks CaseLaw v2 as archived. BenchLM mirrors the public leaderboard as display-only historical legal-domain context.

BenchLM freshness & provenance

Version: CaseLaw v2 2026
Refresh cadence: Quarterly
Staleness state: Current

Question availability: Private dataset

Current · Display only

BenchLM uses freshness metadata to decide whether a benchmark should still be treated as a strong differentiator, a benchmark to watch, or a display-only reference. For the full scoring policy, see the BenchLM methodology page.

CaseLaw v2 score table (54 models)

Rank · Model · Model ID · Score
1 · Grok 4.3 · grok/grok-4.3 · 79.31%
2 · GPT-5.1 · openai/gpt-5.1-2025-11-13 · 73.42%
3 · GPT-4.1 · openai/gpt-4.1-2025-04-14 · 69.88%
4 · GPT-5 Mini · openai/gpt-5-mini-2025-08-07 · 68.49%
5 · Claude Opus 4.7 · anthropic/claude-opus-4-7 · 68.38%
6 · GPT-5 · openai/gpt-5-2025-08-07 · 66.45%
7 · GPT-5.5 · openai/gpt-5.5 · 66.24%
8 · GPT-5.2 · openai/gpt-5.2-2025-12-11 · 66.02%
9 · Grok 4 0709 · grok/grok-4-0709 · 65.81%
10 · Grok 4 Fast Reasoning · grok/grok-4-fast-reasoning · 65.70%
11 · Kimi K2 Thinking · kimi/kimi-k2-thinking · 65.70%
12 · Gemini 3.1 Pro Preview · google/gemini-3.1-pro-preview · 64.84%
13 · Command A 03 2025 · cohere/command-a-03-2025 · 64.52%
14 · Claude Sonnet 4.6 · anthropic/claude-sonnet-4-6 · 63.99%
15 · Gemini 2.5 Pro · google/gemini-2.5-pro · 63.88%
16 · GPT-5.4 · openai/gpt-5.4-2026-03-05 · 63.77%
17 · Muse Spark · meta/muse_spark · 63.13%
18 · Claude Opus 4.5 20251101 Thinking · anthropic/claude-opus-4-5-20251101-thinking · 62.59%
19 · Claude Sonnet 4.5 20250929 Thinking · anthropic/claude-sonnet-4-5-20250929-thinking · 62.16%
20 · Claude Opus 4.6 Thinking · anthropic/claude-opus-4-6-thinking · 62.06%
21 · Mistral Large 2512 · mistralai/mistral-large-2512 · 61.41%
22 · Kimi K2.6 Thinking · kimi/kimi-k2.6-thinking · 61.20%
23 · MiniMax M2.7 · minimax/MiniMax-M2.7 · 60.88%
24 · Grok 4.1 Fast Reasoning · grok/grok-4-1-fast-reasoning · 60.45%
25 · GPT-4o · openai/gpt-4o-2024-11-20 · 59.70%
26 · Qwen3.5 Plus Thinking · alibaba/qwen3.5-plus-thinking · 59.70%
27 · DeepSeek V4 Pro · deepseek/deepseek-v4-pro · 59.38%
28 · Kimi K2.5 Thinking · kimi/kimi-k2.5-thinking · 58.73%
29 · Trinity Large Thinking · arcee-ai/trinity-large-thinking · 57.88%
30 · Claude Haiku 4.5 20251001 Thinking · anthropic/claude-haiku-4-5-20251001-thinking · 56.48%
31 · Qwen3.5 Flash · alibaba/qwen3.5-flash · 55.95%
32 · MiniMax M2.1 · minimax/MiniMax-M2.1 · 55.84%
33 · Gemini 3 Flash Preview · google/gemini-3-flash-preview · 55.84%
34 · DeepSeek V3p2 Thinking · fireworks/deepseek-v3p2-thinking · 55.41%
35 · Gemini 3.1 Flash Lite Preview · google/gemini-3.1-flash-lite-preview · 54.98%
36 · Qwen3 Max · alibaba/qwen3-max-2026-01-23 · 54.98%
37 · GLM 4.7 · zai/glm-4.7 · 54.88%
38 · Grok 4.20 0309 Reasoning · grok/grok-4.20-0309-reasoning · 54.45%
39 · DeepSeek V3p1 · fireworks/deepseek-v3p1 · 53.91%
40 · MiniMax M2.5 Lightning · minimax/MiniMax-M2.5-Lightning · 53.48%
41 · Qwen3.6 27b · alibaba/qwen3.6-27b · 53.16%
42 · Gemini 3 Pro Preview · google/gemini-3-pro-preview · 53.05%
43 · Gemma 4 31b It · google/gemma-4-31b-it · 52.63%
44 · GPT-5 Nano · openai/gpt-5-nano-2025-08-07 · 52.63%
45 · GLM 5 Thinking · zai/glm-5-thinking · 52.52%
46 · GPT-5.4 Nano · openai/gpt-5.4-nano-2026-03-17 · 51.88%
47 · GPT-5.4 Mini · openai/gpt-5.4-mini-2026-03-17 · 51.66%
48 · GLM 5.1 Thinking · zai/glm-5.1-thinking · 51.55%
49 · Qwen3.6 Plus · alibaba/qwen3.6-plus · 51.45%
50 · GPT Oss 120b · fireworks/gpt-oss-120b · 48.77%
51 · Qwen3.6 Max Preview · alibaba/qwen3.6-max-preview · 47.91%
52 · Qwen3 Max · alibaba/qwen3-max · 47.48%
53 · Mistral Medium 3.5 · mistralai/mistral-medium-3.5 · 44.16%
54 · GPT Oss 20b · fireworks/gpt-oss-20b · 43.84%

FAQ

What does CaseLaw v2 measure?

CaseLaw v2 is Vals AI's private question-answer benchmark over Canadian court cases. It scores models on accuracy in professional-level legal retrieval and reasoning.

Which model leads the published CaseLaw v2 snapshot?

Grok 4.3 currently leads the published CaseLaw v2 snapshot with a score of 79.31%. BenchLM shows this benchmark for display only and does not use it in overall rankings.

How many models are evaluated on CaseLaw v2?

54 AI models are included in BenchLM's mirrored CaseLaw v2 snapshot, based on the public leaderboard captured on May 4, 2026.

Last updated: May 4, 2026 · mirrored from the public benchmark leaderboard
