NextFin News - On March 22, 2023, at NVIDIA's GTC Digital Spring 2023 conference, NVIDIA CEO Jensen Huang sat down with OpenAI co‑founder and chief scientist Ilya Sutskever for a fireside chat titled "AI Today and Vision of the Future." The session, which aired as part of GTC's program, covered the arc from early deep‑learning research through the development of the GPT family and the capabilities and limitations of GPT‑4.
The conversation brought together Sutskever's recollections of technical choices and conceptual pivots with concrete descriptions of how large language models are trained and refined. Below we present Sutskever's core statements organized by topic, quoted and paraphrased from the discussion.
Early intuition about deep learning
Sutskever described his initial attraction to AI as driven by curiosity about consciousness and the human experience, and by the belief that "learning" was the capability computers conspicuously lacked in the early 2000s. He explained why neural networks stood out: they allow you to "automatically program parallel computers" from data, and they are "similar enough to the brain" to offer a promising long‑term path. As he put it, "of all the things that existed, that seemed like it had by far the greatest long‑term promise."
AlexNet, ImageNet and the importance of scale
Sutskever recalled the context that led to AlexNet: the realization that supervised learning at large scale would gain traction, combined with the argument that "if your neural network is deep and large, then it could be configured to solve a hard task." He emphasized that before AlexNet, a million parameters was considered large, and researchers often ran unoptimized CPU code. On ImageNet and GPUs, he said the dataset was "unbelievably difficult," but that if a large convolutional network could be trained on it, success would follow. He recounted how Alex Krizhevsky's GPU programming and convolutional kernels enabled training that "shocked the world" and created a clear discontinuity for computer vision.
GPUs and why they mattered
On the role of GPUs, Sutskever explained that Geoffrey Hinton's lab had experimented with GPUs, and that the convolutional architecture was an excellent fit for GPU acceleration. He noted the practical moment when GPUs allowed previously infeasible training regimes to run "unbelievably fast," enabling networks of unprecedented size and producing the breakthrough results on ImageNet.
Founding OpenAI: two foundational ideas
Sutskever described two early, enduring ideas at OpenAI. The first was "unsupervised learning through compression": the intuition that a model which compresses data well must have extracted its hidden structure. He cited earlier work such as the sentiment neuron experiments (predicting the next character in Amazon reviews) as evidence that next‑token prediction can discover meaningful latent structure. The second idea was the importance of reinforcement learning; he pointed to large projects such as learning to play the real‑time strategy game Dota 2, a line of work whose techniques later influenced reinforcement learning from human feedback (RLHF).
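The compression intuition can be made concrete with Shannon's source‑coding identity: a predictor that assigns probability p to each next token can, with an ideal entropy coder, encode the stream in about −log₂ p bits per token, so better prediction literally is better compression. A minimal sketch with invented probabilities (neither the models nor the numbers come from the talk):

```python
import math

# Hypothetical per-token probabilities that two models assign to the same
# five-token text. The numbers are illustrative, not from any real model.
weak_model = [0.20, 0.10, 0.25, 0.15, 0.30]
strong_model = [0.60, 0.50, 0.70, 0.55, 0.80]

def code_length_bits(probs):
    """Ideal compressed size of the text: the sum of -log2(p) over tokens."""
    return sum(-math.log2(p) for p in probs)

weak_bits = code_length_bits(weak_model)      # ~12.1 bits
strong_bits = code_length_bits(strong_model)  # ~3.4 bits
# The stronger predictor needs far fewer bits: it compresses the text better.
```

The same quantity is the model's cross‑entropy loss, which is why lowering the training loss and extracting structure from data are two views of one objective.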
Pretraining as learning a world model
Sutskever framed pretraining a large language model as learning a "world model" by predicting the next token in diverse internet text. He argued that predicting the next word is not mere statistical correlation but, when done accurately, produces "a compressed abstract usable representation" of people, motivations and the world. He explained that the more accurate the next‑token prediction, the higher the fidelity of that internal representation.
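Next‑token prediction is easy to sketch at toy scale. The bigram counter below is a deliberately tiny stand‑in for a large language model: it estimates the most likely next word from raw text alone (the corpus and helper names are invented for illustration; real pretraining uses neural networks and vastly more data):

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; real pretraining consumes trillions of tokens.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count bigram transitions: how often each word follows each context word.
transitions = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev][nxt] += 1

def predict_next(word):
    """Return the most likely next token and its estimated probability."""
    counts = transitions[word]
    total = sum(counts.values())
    token, n = counts.most_common(1)[0]
    return token, n / total

token, prob = predict_next("the")  # one of the four words seen after "the"
```

A neural network replaces the count table with a learned function of the whole preceding context, which is what allows the "compressed abstract usable representation" Sutskever describes to emerge.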
Fine‑tuning, RLHF and communicating desired behavior
On the gap between pretraining and deployable assistants, Sutskever emphasized that pretraining does not specify desired behavior. The second stage—fine‑tuning and RLHF—is where humans (and human+AI teams) "communicate to it what it is that we want it to be," including guardrails and behavioral constraints. He said this stage is "extremely important" and that improving its fidelity raises usefulness and reliability.
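The talk does not detail the training objective, but a common ingredient of RLHF pipelines is a reward model fit to human preference pairs with a Bradley–Terry style loss. The sketch below, with hypothetical scalar rewards, shows how the loss is small exactly when the reward model ranks the human‑preferred answer higher:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected)).
    Low when the human-preferred answer gets the higher reward."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical rewards for two candidate answers to one prompt.
good_ranking = preference_loss(reward_chosen=2.0, reward_rejected=-1.0)
bad_ranking = preference_loss(reward_chosen=-1.0, reward_rejected=2.0)
# good_ranking is near zero; bad_ranking is large, pushing the model to flip.
```

The trained reward model then steers the language model toward behavior humans preferred, which is one concrete mechanism for "communicating what we want it to be."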
ChatGPT, system architecture and the role of surrounding systems
Sutskever clarified that ChatGPT is more than a single model: it is a system with surrounding components. He described the neural net's role as the foundation built by pretraining and the surrounding systems (fine‑tuning, RLHF, prompting and interfaces) as the means to keep the application on rails and convey user intent.
GPT‑4: what changed
Asked about the differences between ChatGPT (the GPT‑3.5 era) and GPT‑4, Sutskever said the most important one is that GPT‑4's base model predicts the next word with greater accuracy, which translates into greater understanding and capability. He noted that GPT‑4 had finished training months before the conversation, and described it as "a pretty substantial improvement" across many dimensions, including better performance on standardized tests and practical tasks.
Reasoning: capabilities and limits
Sutskever addressed whether next‑token models can reason. He said reasoning is not sharply defined but that predicting the next word at high fidelity requires the model to capture complex dependencies and, in effect, some forms of reasoning. He noted techniques such as "asking the network to think out loud" (chain‑of‑thought prompting) as effective, and he emphasized that while reasoning capabilities have grown, reliability remains the central limitation.
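"Asking the network to think out loud" amounts to a change in prompting. The helpers below are a hypothetical sketch (the function names and question are invented; any chat‑completion API could consume the resulting strings):

```python
# The same question, phrased with and without a chain-of-thought cue.
QUESTION = "A train leaves at 3:40 pm and the trip takes 85 minutes. When does it arrive?"

def direct_prompt(question):
    """Ask for an immediate answer."""
    return f"Q: {question}\nA:"

def chain_of_thought_prompt(question):
    """Add a cue that elicits intermediate reasoning steps before the answer."""
    return f"Q: {question}\nA: Let's think step by step."

prompt = chain_of_thought_prompt(QUESTION)
```

The extra instruction costs nothing at training time; it simply gives the model room to emit the intermediate dependencies that accurate next‑token prediction has already forced it to capture.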
Reliability and hallucination as the current bottleneck
Repeatedly, Sutskever returned to reliability. He argued that occasional hallucinations or unexpected mistakes are what prevent models from being fully useful. The next big frontier, he said, is systems that reliably admit uncertainty, ask clarifying questions, and say "I don't know" when appropriate. Improvements in these areas, he predicted, will drive the largest gains in real‑world usefulness.
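One simple way to operationalize "say I don't know" is to abstain whenever the model's own confidence in its top answer is low. The sketch below is a hypothetical illustration, not a description of how ChatGPT works; the threshold and probabilities are invented:

```python
def answer_or_abstain(candidates, threshold=0.75):
    """Return the top answer only when the model's probability for it clears
    a threshold; otherwise admit uncertainty.
    `candidates` maps answer strings to hypothetical model probabilities."""
    best, prob = max(candidates.items(), key=lambda kv: kv[1])
    return best if prob >= threshold else "I don't know"

confident = answer_or_abstain({"Paris": 0.92, "Lyon": 0.08})
unsure = answer_or_abstain({"Paris": 0.40, "Lyon": 0.35, "Nice": 0.25})
```

In practice the hard part is calibration, i.e. making the model's stated probabilities track how often it is actually right, which is exactly the reliability frontier Sutskever describes.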
Retrieval, multimodality and vision
About retrieval and external knowledge, Sutskever noted that the GPT‑4 model released at the time was primarily a next‑word predictor (though it could consume images) and did not ship with built‑in retrieval, while acknowledging that retrieval would improve factuality. On multimodality he gave two reasons for its importance: vision is practically useful because humans are visual, and learning from images adds information about the world that text alone may not convey as efficiently. He gave examples, such as colors and diagrams, where vision materially improves understanding, and said audio could also be useful as an additional signal.
Data availability and synthetic data
When asked whether the world will run out of training tokens, Sutskever counseled not to underestimate existing data and said there is probably more data than commonly assumed. He accepted that synthetic data generation by AI is a possibility for future training regimes, but left open how central that will become.
Short‑term outlook: two years ahead
On the near future, Sutskever predicted continued capability advances and emphasized that the greatest impact will come from improving trust and reliability: models that know when they do not understand, ask clarifying questions, and reliably follow intent. He suggested that progress in these areas will make AI systems far more useful across many domains.
Surprising skills and reflections
Sutskever highlighted surprising strengths observed in GPT‑4: increased reliability, much stronger math and proof‑style reasoning in many cases, and impressive vision capabilities such as explaining memes and diagrams. Taking a step back he said the most surprising thing to him over the past two decades is that the same fundamental neural‑network ideas, scaled and trained appropriately, actually worked so effectively.
References and session links:
NVIDIA press release: NVIDIA GTC 2023 to Feature Latest Advances in AI Computing Systems
Condensed transcript / highlights (third‑party transcript page)
Explore more exclusive insights at nextfin.ai.

