Challenge the newest AI models with your hardest PhD-level exercises
— and learn how to use AI in your math research —
Project Benchmarks
This benchmark is based on 140 submissions, each of which stumps at least 1 active model; see the Samples Page for several sample submissions. 56 research mathematicians have contributed to this benchmark.
It includes the new models Gemini 3 Pro, GPT-5.1, and Claude Opus 4.5, each performing significantly better than its predecessor on the full set of 140 submissions.
Every model attempted every submission once, with failing attempts rerun. All models were queried via the API, using the strongest available version.
Contact: contact@science-bench.ai
Based on 140 submissions that stump at least 1 active model.
| Model Name | Active | Correct Answer |
|---|---|---|
| Gemini 3 Pro | active | 46% |
| GPT-5.1 | active | 41% |
| GPT-5 | | 35% |
| Grok-4 | active | 23% |
| o3 | active | 23% |
| Claude Opus 4.5 | active | 20% |
| DeepSeek-V3.1 | active | 20% |
| Gemini 2.5 Pro | | 20% |
| o3-mini | | 19% |
| DeepSeek R1 | | 18% |
| Claude Opus 4.1 | | 11% |
| Gemini 2.5 Flash | | 10% |
| Claude Sonnet 4 | | 8% |
Based on 130 submissions that stump at least 2 active models.
| Model Name | Active | Correct Answer |
|---|---|---|
| Gemini 3 Pro | active | 40% |
| GPT-5.1 | active | 33% |
| GPT-5 | | 29% |
| o3 | active | 17% |
| Grok-4 | active | 16% |
| DeepSeek R1 | | 15% |
| Gemini 2.5 Pro | | 15% |
| o3-mini | | 15% |
| DeepSeek-V3.1 | active | 14% |
| Claude Opus 4.5 | active | 13% |
| Claude Opus 4.1 | | 7% |
| Claude Sonnet 4 | | 7% |
| Gemini 2.5 Flash | | 6% |
Based on 110 submissions that stump at least 3 active models.
| Model Name | Active | Correct Answer |
|---|---|---|
| Gemini 3 Pro | active | 34% |
| GPT-5.1 | active | 28% |
| GPT-5 | | 22% |
| Grok-4 | active | 12% |
| o3-mini | | 12% |
| Gemini 2.5 Pro | | 11% |
| o3 | active | 11% |
| DeepSeek R1 | | 10% |
| DeepSeek-V3.1 | active | 9% |
| Claude Opus 4.5 | active | 6% |
| Claude Sonnet 4 | | 6% |
| Claude Opus 4.1 | | 5% |
| Gemini 2.5 Flash | | 5% |
Based on 90 submissions that stump at least 4 active models.
| Model Name | Active | Correct Answer |
|---|---|---|
| Gemini 3 Pro | active | 24% |
| GPT-5 | | 17% |
| GPT-5.1 | active | 15% |
| o3-mini | | 10% |
| Claude Sonnet 4 | | 8% |
| DeepSeek R1 | | 7% |
| Gemini 2.5 Pro | | 7% |
| Grok-4 | active | 7% |
| Claude Opus 4.1 | | 4% |
| o3 | active | 4% |
| Claude Opus 4.5 | active | 3% |
| DeepSeek-V3.1 | active | 3% |
| Gemini 2.5 Flash | | 3% |
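
Each table above restricts the benchmark to submissions that stump at least k active models and then reports every model's share of correct answers on that subset. The following is a minimal sketch of how such a filter could be computed, not the project's actual pipeline. The data layout, the function names `stumping_subset` and `accuracy`, the toy models `A`, `B`, `C`, and the choice to count a missing result as incorrect are all assumptions made for illustration.

```python
from typing import Dict, List, Set

# Hypothetical layout: results[model][submission_id] is True when the model
# answered that submission correctly. All names and values are illustrative.
Results = Dict[str, Dict[str, bool]]


def stumping_subset(results: Results, active: Set[str], k: int) -> List[str]:
    """Return the submissions that at least k active models got wrong."""
    submissions = sorted({s for per_model in results.values() for s in per_model})
    return [
        s
        for s in submissions
        # Assumption: a missing result counts as an incorrect answer.
        if sum(not results[m].get(s, False) for m in active) >= k
    ]


def accuracy(results: Results, model: str, subset: List[str]) -> float:
    """Fraction of the subset that the given model answered correctly."""
    if not subset:
        return 0.0
    return sum(results[model].get(s, False) for s in subset) / len(subset)


# Toy data: models A and B are active, C is inactive.
toy_results: Results = {
    "A": {"q1": True, "q2": False, "q3": False},
    "B": {"q1": False, "q2": False, "q3": True},
    "C": {"q1": True, "q2": True, "q3": False},
}
toy_active = {"A", "B"}

subset_k1 = stumping_subset(toy_results, toy_active, 1)
subset_k2 = stumping_subset(toy_results, toy_active, 2)
```

In this sketch only active models decide which submissions are kept, while inactive models such as `C` are still scored on the resulting subset. Raising k shrinks the subset to harder submissions, which is consistent with the scores falling from table to table.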