
AI Nears PhD-Level Proficiency as New Benchmark Records Rapid Gains in Expert Reasoning

Summarized by NextFin AI
  • A new benchmark called 'Humanity’s Last Exam' (HLE) shows AI closing the gap on PhD-level reasoning, with Google’s Gemini 3.1 Pro achieving a 45.9% success rate, roughly tripling frontier-model performance in under a year.
  • Calvin Zhang of Scale Labs argues that the pace of progress amounts to a countdown to Artificial General Intelligence (AGI), a view met with skepticism in academic circles.
  • Despite high scores, AI models like Gemini 3.1 Pro exhibit a 50% calibration error, indicating they are overconfident and lack self-awareness, posing risks in critical applications.
  • The shift is already reshaping labor dynamics: as AI handles tasks that previously required PhDs, the value of human expertise is moving toward depth of reasoning and cross-disciplinary synthesis.

NextFin News - A new benchmark designed to represent the absolute ceiling of human academic expertise has revealed that artificial intelligence is closing the gap on PhD-level reasoning at a pace that has caught even some of its creators by surprise. The "Humanity’s Last Exam" (HLE) metric, a collaborative project between Scale AI and the Center for AI Safety, consists of 2,500 questions so specialized that they require years of doctoral study to answer. While OpenAI’s GPT-4o struggled to surpass a 3% accuracy rate in 2024, the latest data released this quarter shows Google’s Gemini 3.1 Pro Preview has surged to a 45.9% success rate, effectively tripling the performance of frontier models in less than a year.

The rapid ascent in HLE scores has sparked a debate over the proximity of Artificial General Intelligence (AGI). Calvin Zhang, a researcher at Scale Labs and a vocal proponent of the "scaling laws" theory (the belief that more data and compute inevitably lead to higher intelligence), noted that the progress over the past few months has been "insane." Zhang’s position reflects a segment of the Silicon Valley engineering elite that views the surpassing of human scientific intelligence not as a matter of "if" but as a countdown of months. In academic circles, however, his perspective is often viewed as optimistic, and the distinction between "pattern matching" and "genuine discovery" remains a point of contention.

The HLE benchmark is not a standard multiple-choice test. It includes highly obscure queries ranging from the nuances of Biblical Hebrew pronunciation to the precise count of tendons in a hummingbird’s wing. To ensure the integrity of the results and prevent "data contamination" (where an AI simply memorizes answers it has seen on the internet), a significant portion of the question set is kept secret. The jump from GPT-5 Pro’s 31.6% to Gemini’s 45.9% suggests that the industry is moving beyond general conversation toward specialized, expert-level reasoning that could soon rival human scientists in technical fields.
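In practice, keeping a split secret means the gold answers never appear online: the model is queried at evaluation time and its answers are graded offline. The sketch below illustrates that flow under stated assumptions only; the names (Question, grade, evaluate) are hypothetical, the actual HLE grading pipeline is not public, and real graders use stricter rubrics than exact match.

```python
from dataclasses import dataclass

@dataclass
class Question:
    prompt: str
    gold: str  # reference answer; never released for the private split

def grade(model_answer: str, gold: str) -> bool:
    # Naive exact-match grading; production benchmarks typically rely
    # on rubric-based or model-assisted judging for free-form answers.
    return model_answer.strip().lower() == gold.strip().lower()

def evaluate(model, private_split: list[Question]) -> float:
    # Because these prompts and answers never circulated online, a
    # correct response cannot come from memorized training data.
    hits = sum(grade(model(q.prompt), q.gold) for q in private_split)
    return hits / len(private_split)
```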

Despite the impressive numbers, the data reveals a critical flaw: "calibration error." According to the Scale AI leaderboard, while models are getting more answers right, they remain systematically overconfident, often asserting incorrect answers with the same certainty as correct ones. Even with its high score, Gemini 3.1 Pro carries a calibration error of 50%, suggesting that the AI does not yet "know what it doesn't know." This lack of self-awareness is a significant hurdle for applications in high-stakes environments like drug discovery or structural engineering, where a confident but wrong answer can have catastrophic consequences.
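Scale AI does not publish the exact formula behind its calibration figure, but a standard way to quantify the gap between a model’s stated confidence and its actual accuracy is expected calibration error (ECE): answers are grouped into confidence bins, and the difference between average confidence and average accuracy is weighted by each bin’s size. A minimal sketch in Python, with synthetic numbers chosen only to mirror the figures above:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: the average gap between what a model claims (confidence)
    and what it delivers (accuracy), weighted by bin occupancy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        avg_conf = confidences[mask].mean()  # what the model claimed
        accuracy = correct[mask].mean()      # what it actually got right
        ece += mask.mean() * abs(avg_conf - accuracy)
    return ece

# A model that answers ~46% of 2,500 questions correctly while
# reporting near-total certainty on every answer is badly miscalibrated.
rng = np.random.default_rng(0)
correct = rng.random(2500) < 0.459       # ~45.9% accuracy
confidences = np.full(2500, 0.96)        # overconfident on everything
print(f"ECE = {expected_calibration_error(confidences, correct):.2f}")
# Prints ~0.50: confidence exceeds accuracy by about 50 points.
```

A well-calibrated model with the same 46% accuracy would report roughly 46% confidence on average, driving the ECE toward zero.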

Skeptics within the research community, including Kate Olszewska of Google DeepMind, suggest that while passing a written exam is a milestone, it does not equate to the holistic intelligence required for scientific breakthroughs. Olszewska has previously argued that the "frontier" is not just answering existing questions but "novel problem discovery": the ability to identify what needs to be researched in the first place. This nuanced view suggests that even if an AI scores 100% on HLE, it may still lack the creative spark or physical intuition required for laboratory work or complex surgery.

The economic implications of this shift are already being felt in the labor market for high-end research. As AI models begin to handle the "breadth of knowledge" tasks that previously required teams of junior PhDs, the value of human expertise is shifting toward "depth of reasoning" and cross-disciplinary synthesis. The HLE results indicate that the window for humans to claim superiority in pure information retrieval and specialized academic logic is narrowing. The next phase of competition will likely focus on whether these models can move from being "universal experts" on paper to active participants in the scientific method.

Explore more exclusive insights at nextfin.ai.

