Challenge the newest AI models with your hardest PhD-level exercises

— and learn how to use AI in your math research —

Project Benchmarks

This benchmark is based on 100 submissions that stump at least one active model. 37 research mathematicians have contributed, mostly in the areas of algebra and combinatorics. The front page shows several sample prompts. We ran every submission once with every model, rerunning any failed attempts; we plan to increase this to 8 or even 16 runs in future iterations. DeepSeek-V3.1 failed to produce an answer for 15 submissions and DeepSeek-R1 failed for 18 submissions; these cases are all counted as stumps. All models were queried via the API, using the strongest available version of each model.

Contact: contact@science-bench.ai

Press Release (PDF)
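To make the scoring protocol above concrete, here is a minimal Python sketch of the described procedure: one run per model and submission, failed attempts rerun, missing answers counted as stumps, and the three leaderboards obtained by filtering for submissions that stump at least k active models. The helper callables query_model and grade_answer, the submission tuple format, and the retry count are illustrative assumptions, not the benchmark's actual code.

# Minimal sketch of the scoring protocol described above (assumptions noted in the lead-in).
def evaluate(submissions, models, query_model, grade_answer, max_attempts=2):
    """Query each model on each submission, rerunning failed attempts.

    A model is stumped by a submission if no attempt yields a correct
    answer; producing no answer at all also counts as a stump.
    Returns a dict mapping submission id -> set of stumped models.
    """
    stumps = {sub_id: set() for sub_id, _, _ in submissions}
    for sub_id, prompt, answer_key in submissions:
        for model in models:
            solved = False
            for _ in range(max_attempts):
                answer = query_model(model, prompt)   # one API call per attempt
                if answer is not None and grade_answer(answer, answer_key):
                    solved = True
                    break
            if not solved:
                stumps[sub_id].add(model)
    return stumps


def stumping_at_least(stumps, active_models, k):
    """Submissions that stump at least k active models (k = 1, 2, 3 in the tables below)."""
    active = set(active_models)
    return [sub_id for sub_id, stumped in stumps.items()
            if len(stumped & active) >= k]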
Based on 100 submissions that stump at least 1 active model.
Model               Status   Correct answers
GPT-5               active   43%
DeepSeek-V3.1       active   34%
Grok-4              active   34%
o3                  active   32%
Gemini 2.5 Pro      active   29%
DeepSeek-R1                  27%
o3-mini                      22%
Gemini 2.5 Flash             18%
Claude Opus 4.1     active   15%
Claude Sonnet 4               9%
Based on 80 submissions that stump at least 2 active models.
Model               Status   Correct answers
GPT-5               active   35%
DeepSeek-V3.1       active   26%
Grok-4              active   26%
o3                  active   23%
DeepSeek-R1                  22%
Gemini 2.5 Pro      active   21%
o3-mini                      17%
Gemini 2.5 Flash             11%
Claude Opus 4.1     active    8%
Claude Sonnet 4               6%
Based on 70 submissions that stump at least 3 active models.
Model               Status   Correct answers
GPT-5               active   27%
DeepSeek-V3.1       active   17%
DeepSeek-R1                  15%
Grok-4              active   15%
Gemini 2.5 Pro      active   14%
o3                  active   14%
o3-mini                      13%
Gemini 2.5 Flash              6%
Claude Opus 4.1     active    4%
Claude Sonnet 4               4%