NextFin News - Large language models are failing the most basic test of scientific rigor: consistency. A new study led by Mesut Cicek, an associate professor at Washington State University, reveals that ChatGPT frequently provides incorrect and contradictory answers when tasked with evaluating scientific hypotheses. The research, published this week in the Rutgers Business Review, suggests that the "fluency trap"—the ability of AI to generate smooth, authoritative-sounding prose—is masking a profound lack of conceptual understanding that could compromise the integrity of academic and industrial research.
Cicek and his team subjected ChatGPT versions 3.5 and 5 mini to a rigorous stress test, feeding the models 719 scientific hypotheses drawn from business journals published since 2021. To measure reliability, the researchers asked the same question ten times for each hypothesis. While the AI appeared competent on the surface, reaching an 80% accuracy rate in some instances, a deeper statistical dive revealed a much grimmer reality. Once adjusted for random guessing, the AI's true reliability fell to a level only 60% better than chance, a performance the researchers equated to a "low D grade" in a classroom setting.
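The "adjusted for random guessing" figure can be read as a standard chance correction, sketched below in Python. This is purely illustrative: the study's exact formula is not reproduced in the article, and the function name is ours. On a binary true/false verdict, a coin flip is already right half the time, so a raw 80% accuracy lands only 60% of the way between chance and a perfect score.

```python
# Minimal sketch of a standard correction-for-guessing calculation; the
# paper's exact adjustment is not given in the article, so treat this as
# an illustration only.

def chance_corrected_accuracy(observed: float, chance: float = 0.5) -> float:
    """Rescale raw accuracy so that random guessing scores 0 and perfection scores 1."""
    return (observed - chance) / (1.0 - chance)

if __name__ == "__main__":
    # On a true/false task, guessing already yields 50% accuracy, so an 80%
    # raw score sits only 60% of the way above chance.
    raw_accuracy = 0.80
    print(f"Chance-corrected score: {chance_corrected_accuracy(raw_accuracy):.2f}")  # 0.60
```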
The most alarming failure occurred when the models were presented with false statements. ChatGPT correctly identified false hypotheses only 16.4% of the time, showing a systematic bias toward affirming whatever premise it is given. This "yes-man" tendency poses a significant risk to researchers who might use these tools to vet new ideas or summarize existing literature. If a model is predisposed to agree with whatever it is told, it becomes an engine for misinformation rather than a tool for discovery.
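Why can a model that agrees with nearly everything still look accurate overall? Per-class recall makes the asymmetry visible: accuracy on true statements stays high while accuracy on false ones collapses. The sketch below uses invented data and a hypothetical helper, purely to illustrate the kind of arithmetic behind a figure like 16.4%.

```python
# Hypothetical illustration of how a "yes-man" model looks in per-class terms;
# the labels and predictions below are invented, not the study's data.

def class_recall(predictions: list[str], labels: list[str], target: str) -> float:
    """Share of items whose true label is `target` that the model labeled correctly."""
    relevant = [p for p, l in zip(predictions, labels) if l == target]
    return relevant.count(target) / len(relevant) if relevant else 0.0

if __name__ == "__main__":
    labels      = ["true", "true", "true", "false", "false", "false"]
    predictions = ["true", "true", "true", "true",  "true",  "false"]  # agrees with almost everything
    print(class_recall(predictions, labels, "true"))   # 1.00 -> looks impressive
    print(class_recall(predictions, labels, "false"))  # 0.33 -> the failure the study flags
```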
Beyond simple errors, the study highlights a fundamental "inconsistency problem." Unlike a human expert or a traditional database, the AI often changed its mind when asked the same question multiple times. This volatility suggests that the models are not "reasoning" through the scientific logic of a hypothesis but are instead navigating a probabilistic map of word associations. Cicek noted that the AI lacks a conceptual "brain," relying instead on memorized patterns that can shift with the slight statistical variation introduced each time a response is generated.
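The article describes the protocol (the same question posed ten times per hypothesis) without detailing how consistency was scored. One simple, purely illustrative way to quantify the "changed its mind" behavior is the share of repeats that agree with the majority verdict; the helper and data below are hypothetical, not the study's metric.

```python
from collections import Counter

def majority_agreement(verdicts: list[str]) -> float:
    """Fraction of repeated answers matching the most common answer (1.0 = fully consistent)."""
    counts = Counter(verdicts)
    return counts.most_common(1)[0][1] / len(verdicts)

if __name__ == "__main__":
    # Same hypothesis judged ten times, mirroring the study's ten-query protocol.
    answers = ["true", "true", "false", "true", "false",
               "true", "true", "false", "true", "true"]
    print(f"{majority_agreement(answers):.0%} agreement with the majority verdict")  # 70%
```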
This creates what the researchers call an "illusion of understanding." Because the AI writes with the confidence of a tenured professor, users fall prey to a cognitive bias, accepting its output as factual simply because it is well articulated. This "fluency trap" is particularly dangerous in the current political and economic climate. As U.S. President Trump's administration pushes for accelerated AI integration across federal agencies and the private sector, the WSU study serves as a timely warning that speed must not come at the expense of verification.
The implications extend far beyond the ivory tower. In industries ranging from pharmaceuticals to engineering, where scientific accuracy is a matter of safety and multibillion-dollar investments, reliance on inconsistent AI could lead to catastrophic failures. The study suggests that while AI can be a powerful assistant for drafting or brainstorming, it remains an unreliable narrator for the "truth" of scientific data. The best approach, according to Cicek, is disciplined skepticism. The era of blind trust in algorithmic output is hitting a wall of scientific reality, and the burden of proof remains, for now, firmly in human hands.
Explore more exclusive insights at nextfin.ai.

