OpenAI launches GeneBench-Pro for AI biology research

GeneBench-Pro uses synthetically constructed datasets so that correct answers to research problems require genuine reasoning ability.

OpenAI's GeneBench-Pro tests whether AI can make the judgment calls real scientific research demands.

OpenAI has introduced GeneBench-Pro, a research benchmark designed to assess whether AI agents can perform the complex, judgment-intensive analysis required in real-world computational biology.

Unlike conventional benchmarks that focus on factual recall or routine workflows, GeneBench-Pro is designed to measure what OpenAI calls ‘research taste’: the sequence of judgment calls involved in scientific analysis, from interpreting ambiguous data and revising assumptions to deciding whether findings are robust enough to inform downstream research.

The benchmark comprises 129 problems spanning ten domains within computational biology, including statistical genetics, cancer genomics, clinical diagnostics, and pharmacogenomics. Each problem presents an AI agent with a realistic and deliberately messy dataset, brief experimental context, and a target to estimate.

To answer correctly, the model must explore the data iteratively, select an appropriate analytical approach, and supply a final answer. Success should not depend on exploiting shortcuts or matching arbitrary author preferences, so every problem uses synthetically generated data whose underlying causal structure is fully known, allowing performance to be measured against a controlled ground truth.
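As a rough illustration of why synthetic data with a known causal structure allows objective scoring, the sketch below simulates a small dataset in which the true effect is set by the author of the data, then checks two analyses against it. Everything here is an assumption made for illustration: the function names, the single confounder, the coefficients, and the pass tolerance are hypothetical and are not taken from OpenAI's benchmark, whose generators and grading rules are not described in the announcement.

```python
import random


def make_dataset(true_effect, n=2000, seed=0):
    """Simulate (exposure, confounder, outcome) rows with a known causal effect.

    Hypothetical generator: the confounder drives both exposure and outcome,
    so an analysis that ignores it will be biased.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        confounder = rng.gauss(0.0, 1.0)
        exposure = 0.8 * confounder + rng.gauss(0.0, 1.0)
        outcome = true_effect * exposure + 1.5 * confounder + rng.gauss(0.0, 1.0)
        rows.append((exposure, confounder, outcome))
    return rows


def _centered_sum(a, b):
    """Sum of products of deviations from the mean (unnormalised covariance)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))


def naive_slope(rows):
    """Regress outcome on exposure alone, ignoring the confounder."""
    x, _, y = zip(*rows)
    return _centered_sum(x, y) / _centered_sum(x, x)


def adjusted_slope(rows):
    """Regress outcome on exposure and confounder; return the exposure coefficient."""
    x, c, y = zip(*rows)
    sxx, scc, sxc = _centered_sum(x, x), _centered_sum(c, c), _centered_sum(x, c)
    sxy, scy = _centered_sum(x, y), _centered_sum(c, y)
    # Closed-form solution of the two-predictor normal equations.
    return (sxy * scc - sxc * scy) / (sxx * scc - sxc ** 2)


def passes(estimate, truth, tol=0.1):
    """Hypothetical grading rule: the estimate must land within tol of the truth."""
    return abs(estimate - truth) <= tol


if __name__ == "__main__":
    truth = 0.5
    rows = make_dataset(true_effect=truth)
    print("naive passes:   ", passes(naive_slope(rows), truth))
    print("adjusted passes:", passes(adjusted_slope(rows), truth))
```

Because the data generator fixes the true effect in advance, the grader does not need to agree with any particular analyst's preferences: an analysis either recovers the planted value within tolerance or it does not. In this toy setup the shortcut analysis that ignores the confounder is biased by construction, while the adjusted analysis is not.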

OpenAI said its flagship model, GPT-5.6 Sol, achieved a pass rate of 28.7% at the highest reasoning setting, increasing to 31.5% in Pro mode. By comparison, the strongest model available when the original GeneBench was introduced scored below 5%.

External reviewers estimated that completing a typical GeneBench-Pro task would require 20 to 40 hours of expert work and cost thousands of dollars, whereas AI inference currently costs only a few dollars per run. OpenAI argues this suggests substantial economic potential even before models achieve expert-level performance.
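The economic argument can be made concrete with simple arithmetic: if only a fraction of runs succeed, the inference cost per correctly solved task is the cost per run divided by the pass rate. The sketch below uses the reported 28.7% pass rate, but the 5-dollar cost per run is a placeholder assumption standing in for "a few dollars", and the function name is hypothetical.

```python
def cost_per_solved_task(cost_per_run, pass_rate):
    """Average inference spend per correctly solved task across a benchmark.

    Total spend is cost_per_run * N and the number solved is pass_rate * N,
    so N cancels out.
    """
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_run / pass_rate


if __name__ == "__main__":
    assumed_cost_per_run = 5.0   # assumption: placeholder for "a few dollars"
    reported_pass_rate = 0.287   # reported pass rate at the highest reasoning setting
    spend = cost_per_solved_task(assumed_cost_per_run, reported_pass_rate)
    print(f"Inference spend per solved task: ${spend:.2f}")
```

Under these assumed figures the spend per solved task stays in the tens of dollars, well below the thousands of dollars estimated for expert work. The calculation leaves out the cost of checking which answers are actually correct, which in practice would still require expert time.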

OpenAI acknowledged that frontier models still solve fewer than one-third of the benchmark problems, often making partial progress but failing to complete the full chain of scientific reasoning expected from experienced researchers. To encourage independent evaluation, the company is open-sourcing ten representative tasks on Hugging Face and providing a 50-question subset to Artificial Analysis for third-party benchmarking.

Why does it matter?

GeneBench-Pro reflects a broader shift in AI evaluation from testing factual knowledge and coding ability to assessing whether models can support complex scientific reasoning. As computational biology increasingly becomes limited by data interpretation rather than data generation, reliable AI assistance in analytical workflows could accelerate research in areas such as genomics, drug discovery and precision medicine.

The benchmark also highlights the importance of rigorous evaluation methods for frontier AI. By using controlled synthetic datasets with known ground truth, GeneBench-Pro seeks to measure not only whether models reach the correct answer but also how well they make the sequence of judgment calls required in real-world scientific research.
