OpenBookQA: A New Dataset for Open Book Question Answering

A question-answering dataset modeled after open book exams for assessing human understanding of a subject. Answering its questions requires combining facts from a knowledge base with broad common sense reasoning.

About OpenBookQA

Year: 2018

Tasks: Open book questions

Format: Multiple choice questions

Difficulty: Elementary science level

OpenBookQA tests the ability to combine explicit knowledge with implicit common sense reasoning. Each question requires understanding scientific facts and applying them to novel situations, mimicking real open-book examinations.
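To make the format concrete, here is a minimal sketch of how a multiple choice item of this kind might be represented and scored. The `Item` fields, the two sample questions, and the `accuracy` helper are all hypothetical and written for illustration; they are not taken from the dataset or from BenchLM. The sketch also assumes that a leaderboard score is plain accuracy expressed as a percentage, which this page does not state.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Item:
    """One multiple choice question (hypothetical schema, for illustration)."""
    question: str
    choices: Dict[str, str]  # answer label -> answer text
    fact: str                # the explicit "open book" fact the question builds on
    answer: str              # label of the correct choice


def accuracy(items: List[Item], predictions: List[str]) -> float:
    """Return the percentage of items whose predicted label matches the answer."""
    if not items or len(items) != len(predictions):
        raise ValueError("need one prediction per item, and at least one item")
    correct = sum(1 for item, pred in zip(items, predictions) if pred == item.answer)
    return 100.0 * correct / len(items)


# Made-up items in the spirit of the benchmark: each pairs a stated fact with a
# situation that needs common sense to connect the fact to the right choice.
items = [
    Item(
        question="Which object is most likely to conduct electricity?",
        choices={"A": "a wooden spoon", "B": "a suit of armor",
                 "C": "a rubber glove", "D": "a glass jar"},
        fact="Metals conduct electricity.",
        answer="B",
    ),
    Item(
        question="Which of these will melt fastest on a sunny sidewalk?",
        choices={"A": "a brick", "B": "a coin",
                 "C": "an ice cube", "D": "a pencil"},
        fact="Heat causes ice to melt.",
        answer="C",
    ),
]

print(accuracy(items, ["B", "A"]))  # one of two correct
```

The first item shows the two ingredients the benchmark combines: the fact alone ("metals conduct electricity") does not mention armor, so a model also has to know that a suit of armor is made of metal.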

Paper: Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

Leaderboard (88 models)

Rank Model                  Score
#1   GPT-5.4                93
#2   Gemini 3.1 Pro         93
#3   Claude Opus 4.6        93
#4   GPT-5.3 Codex          93
#5   Grok 4.1               93
#6   GPT-5.2                93
#7   GPT-5.2-Codex          93
#9   Claude Sonnet 4.6      93
#10  Claude Opus 4.5        93
#11  Gemini 3 Pro           93
#13  GPT-5.1                91
#14  GLM-5 (Reasoning)      90
#15  Claude Sonnet 4.5      89
#17  GPT-5 (high)           87
#18  o1-preview             86
#19  Kimi K2.5 (Reasoning)  86
#20  GPT-5 (medium)         85
#21  o3-pro                 85
#23  o3                     83
#24  GPT-5 mini             82
#25  Grok 4                 82
#26  GLM-5                  82
#28  GLM-4.7                80
#29  Qwen2.5-1M             79
#30  Gemini 2.5 Pro         79
#31  DeepSeek V3.2          79
#32  Qwen2.5-72B            78
#33  o4-mini (high)         78
#34  Qwen3.5 397B           78
#35  DeepSeek Coder 2.0     75
#36  DeepSeekMath V2        75
#37  DeepSeek LLM 2.0       74
#38  MiMo-V2-Flash          74
#39  Claude 4.1 Opus        72
#40  Kimi K2.5              72
#41  Mistral Large 3        71
#42  Claude 4 Sonnet        69
#44  MiniMax M2.5           68
#46  Gemini 3 Flash         65
#47  Mistral Large 2        64
#48  Claude Haiku 4.5       63
#49  GPT-4o                 62
#50  Claude 3.5 Sonnet      61
#51  GLM-4.7-Flash          61
#52  Mistral 8x7B           60
#53  Gemini 1.5 Pro         60
#56  Gemini 1.0 Pro         58
#58  Claude 3 Opus          57
#59  GPT-4 Turbo            56
#60  Llama 3 70B            54
#61  Claude 3 Haiku         52
#63  Nemotron-4 15B         49
#64  Moonshot v1            48
#65  Z-1                    47
#66  GPT-OSS 120B           46
#67  Gemini 2.5 Flash       45
#70  Llama 4 Scout          42
#72  Gemma 3 27B            40
#73  DeepSeek-R1            39
#74  Qwen2.5-VL-32B         38
#76  Nova Pro               36
#78  Qwen3 235B 2507        34
#80  GLM-4.5                32
#81  MiniMax M1 80k         31
#82  GLM-4.5-Air            30
#84  DeepSeek V3.1          28
#85  Kimi K2                27
#86  GPT-OSS 20B            26
#87  Mistral 7B v0.3        25
#88  Mistral 8x7B v0.2      24

FAQ

What does OpenBookQA measure?

OpenBookQA measures how well a model can combine facts from a knowledge base with broad common sense reasoning. The dataset is modeled after open book exams for assessing human understanding of a subject.

Which model scores highest on OpenBookQA?

GPT-5.4 by OpenAI is currently listed first with a score of 93 on OpenBookQA. Nine other models on the leaderboard share that score.

How many models are evaluated on OpenBookQA?

88 AI models have been evaluated on OpenBookQA on BenchLM.