SimpleQA is a benchmark that evaluates how accurately language models answer short, fact-seeking questions. It focuses on factual correctness rather than reasoning complexity.
Year: 2024
Tasks: Factual questions
Format: Short-form Q&A
Difficulty: Factual accuracy focused
SimpleQA prioritizes two properties: each question should have a short, factual answer that is easy to verify, and the question set as a whole should be diverse and challenging. It tests a model's factual knowledge and accuracy.
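To illustrate why short answers are easy to verify, here is a minimal sketch of scoring short-form answers by normalized exact match. This is an assumption made for illustration, not SimpleQA's or BenchLM's actual grading procedure; the `normalize` and `accuracy` helpers and the sample questions are hypothetical.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Percentage of predictions that match their reference after normalization.

    Hypothetical scoring rule: a real grader may be more lenient or stricter.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# Made-up model answers and reference answers, for illustration only.
model_answers = ["Paris.", "the Nile", "1970"]
gold_answers = ["paris", "Nile", "1969"]
score = accuracy(model_answers, gold_answers)
print(f"{score:.1f}")
```

Exact match works here only because the answers are short and unambiguous; longer or free-form answers would need a more forgiving comparison.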
Measuring short-form factuality in large language models
GPT-5.4 by OpenAI currently leads with a score of 95 on SimpleQA.
88 AI models have been evaluated on SimpleQA on BenchLM.