A curated, human-verified subset of SWE-bench that tests models on resolving real GitHub issues from popular open-source Python repositories like Django, Flask, and scikit-learn.
Year
2024
Tasks
500 verified issues
Format
Code patch generation
Difficulty
Professional software engineering
SWE-bench Verified is the gold standard for evaluating AI coding agents on real-world software engineering tasks. Each task requires understanding codebases, writing patches, and passing test suites.
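The "code patch generation" format above means a model's answer is a unified-diff patch that the harness applies to the repository before re-running its test suite. A minimal sketch of that idea, using an illustrative task record (the field names and file paths here are hypothetical, not the dataset's exact schema):

```python
import difflib

# Hypothetical SWE-bench-style task record (illustrative fields only).
task = {
    "instance_id": "example__repo-1234",
    "problem_statement": "Function returns the wrong sign for negative inputs.",
    "base_code": "def absolute(x):\n    return x\n",
    "fixed_code": "def absolute(x):\n    return -x if x < 0 else x\n",
}

# A resolution is expressed as a unified-diff patch against the repo;
# the evaluation harness applies it, then re-runs the repo's tests.
patch = "".join(
    difflib.unified_diff(
        task["base_code"].splitlines(keepends=True),
        task["fixed_code"].splitlines(keepends=True),
        fromfile="a/utils.py",
        tofile="b/utils.py",
    )
)
print(patch)
```

A task counts as resolved only if the patched repository passes the issue's previously failing tests without breaking the rest of the suite.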
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
GPT-5.3 Codex by OpenAI currently leads SWE-bench Verified with a score of 85.
88 AI models have been evaluated on SWE-bench Verified on BenchLM.