Challenge the newest AI models with your hardest PhD-level exercises

and learn how to use AI in your math research

Nov 26, 2025: We are releasing our newest public benchmark

It includes 140 research-level mathematics problems and covers Gemini 3 Pro, GPT-5.1, and Claude Opus 4.5.

View the public benchmark, or browse several sample prompts together with their model answers.

Let the best models solve your exercises

Study and solve the exercises of others

Show that your exercises go far beyond the capabilities of LLMs

Challenge the following models with your prompts:

Claude Opus 4.5, Gemini 3 Pro, GPT-5.1, DeepSeek-V3.1, Grok-4, o3

Want to benchmark your models against PhD-level math problems from professional researchers?

contact@science-bench.ai