NextFin

‘AI Godfather’ Hinton Calls for Stronger Data Curation When Training LLMs

Summarized by NextFin AI
  • Geoffrey Hinton, Nobel Prize winner and Turing Award recipient, emphasized the importance of carefully selecting data to ensure the safety of large language models (LLMs).
  • During a discussion at the 2025 T-EDGE event, Hinton criticized the practice of training LLMs on all available data, including inappropriate content.
  • He argued for stronger data curation, suggesting that while it may reduce the amount of data, it could make AI systems less dangerous.
  • Hinton compared this to teaching children, stating that they should not be exposed to harmful content until they develop a strong moral understanding.

Geoffrey Hinton, winner of the Nobel Prize in Physics and recipient of the Turing Award, said on Monday that carefully selecting training data is essential for ensuring the safety of large language models (LLMs).

Hinton made the comments in a conversation with Jany Hejuan Zhao, the founder and CEO of NextFin.AI, during the 2025 T-EDGE conference, which kicked off on Monday, December 8, and runs through December 21.

“At present, the big language models tend to be trained on all the data you can get your hands on and that will include things like the diaries of serial killers,” said Hinton.  “That seems like a bad idea to me. If I were teaching my child to read, I wouldn't teach them to read on the diaries of serial killers. I wouldn't let them read that until they had already developed a strong moral sense and realized it was wrong.”

“So I think we do need a lot more curation although it'll mean there's less data. But I believe we need much stronger curation of the data. So I think you can make AI less dangerous and less likely to do bad things by curating the data,” said Hinton.

Explore more exclusive insights at nextfin.ai.

