Challenge the newest AI models with your hardest PhD-level exercises
— and learn how to use AI in your math research —
Project Benchmarks
This benchmark is based on 100 submissions that stump at least one active model. 37 research mathematicians have contributed, mostly in algebra and combinatorics. The front page shows several sample prompts.
We ran every submission once with every model, rerunning attempts that failed; we plan to increase this to 8 or even 16 runs in future iterations. DeepSeek-V3.1 failed to produce an answer for 15 submissions and DeepSeek-R1 for 18; these failures are all counted as stumps.
All models were queried via the API, using the strongest available version of each.
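The leaderboard filtering described above can be sketched in a few lines. This is a hypothetical reconstruction, not the project's actual scoring code: the `results` mapping, the model sets, and the function names are all assumptions. A submission "stumps" an active model when that model answers it incorrectly, and each leaderboard restricts to submissions stumping at least a given number of active models.

```python
# Hypothetical sketch of the leaderboard computation (not the project's code).
# `results` maps each submission id to the set of models that answered correctly.

ACTIVE = {"GPT-5", "DeepSeek-V3.1", "Grok-4", "o3",
          "Gemini 2.5 Pro", "Claude Opus 4.1"}
ALL_MODELS = ACTIVE | {"DeepSeek-R1", "o3-mini",
                       "Gemini 2.5 Flash", "Claude Sonnet 4"}

def stump_count(correct_models):
    # Number of active models this submission stumps (i.e., that got it wrong).
    return len(ACTIVE - correct_models)

def leaderboard(results, min_stumps):
    # Keep submissions that stump at least `min_stumps` active models,
    # then report each model's fraction of correct answers on that subset.
    subset = [c for c in results.values() if stump_count(c) >= min_stumps]
    return {m: sum(m in c for c in subset) / len(subset) for m in ALL_MODELS}
```

For example, `leaderboard(results, 2)` would reproduce the second table below under these assumptions, restricting to submissions that stump at least two active models.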
Contact: contact@science-bench.ai
Press Release (PDF).
Based on 100 submissions that stump at least 1 active model.
Model Name | Active | Correct Answer |
---|---|---|
GPT-5 | active | 43% |
DeepSeek-V3.1 | active | 34% |
Grok-4 | active | 34% |
o3 | active | 32% |
Gemini 2.5 Pro | active | 29% |
DeepSeek-R1 | | 27% |
o3-mini | | 22% |
Gemini 2.5 Flash | | 18% |
Claude Opus 4.1 | active | 15% |
Claude Sonnet 4 | | 9% |
Based on 80 submissions that stump at least 2 active models.
Model Name | Active | Correct Answer |
---|---|---|
GPT-5 | active | 35% |
DeepSeek-V3.1 | active | 26% |
Grok-4 | active | 26% |
o3 | active | 23% |
DeepSeek-R1 | | 22% |
Gemini 2.5 Pro | active | 21% |
o3-mini | | 17% |
Gemini 2.5 Flash | | 11% |
Claude Opus 4.1 | active | 8% |
Claude Sonnet 4 | | 6% |
Based on 70 submissions that stump at least 3 active models.
Model Name | Active | Correct Answer |
---|---|---|
GPT-5 | active | 27% |
DeepSeek-V3.1 | active | 17% |
DeepSeek-R1 | | 15% |
Grok-4 | active | 15% |
Gemini 2.5 Pro | active | 14% |
o3 | active | 14% |
o3-mini | | 13% |
Gemini 2.5 Flash | | 6% |
Claude Opus 4.1 | active | 4% |
Claude Sonnet 4 | | 4% |