🎧 Listen to this article

If you have ever typed a prompt into ChatG" cover: image: "" alt: “LLM & Foundation Models News” hidden: true

Inside the Machine: Andrej Karpathy’s Deep Dive on How ChatGPT Actually Works—and Where Reasoning AI Goes Next

A BearerX Guide to the Modern LLM Stack


If you have ever typed a prompt into ChatGPT and wondered what is actually happening behind the text box, Andrej Karpathy’s latest comprehensive walkthrough offers one of the clearest technical blueprints yet. Speaking to a general audience but covering production-grade details, Karpathy—a founding member of OpenAI and former Tesla AI director—broke down the entire pipeline of building a modern large language model (LLM) assistant. His central thesis: these systems are not magical oracles, but stochastic, finite computers trained through a three-stage process that mirrors how humans learn. Here is what the transcript reveals about the current state of the art, the “cognitive psychology” of neural networks, and why reinforcement learning (RL) is suddenly the most important frontier in artificial intelligence.


Stage 1: Pre-Training—Compressing the Internet into Billions of Parameters

Every modern LLM begins as a “base model,” and creating one is fundamentally an exercise in massive-scale data compression.

The Dataset. The raw material is the public internet. Karpathy points to FineWeb, a curated dataset from Hugging Face that totals roughly 44 terabytes of disk space and 15 trillion tokens. This is not raw internet sludge; it is aggressively filtered. The pipeline starts with Common Crawl (billions of web pages indexed since 2007) and runs it through layers of filtration: URL blocklists (malware, spam, adult sites), text extraction (stripping HTML markup), language classification (FineWeb is English-heavy), deduplication, and personally identifiable information (PII) removal. The result is a high-diversity, high-quality text corpus.

Tokenization. Before feeding text into a neural network, it must be converted into a finite vocabulary of symbols, or “tokens.” The process begins with raw UTF-8 bytes (256 possible symbols) and applies an algorithm called byte-pair encoding (BPE) to iteratively merge common character sequences into new symbols. GPT-4, for example, uses a vocabulary of 100,277 tokens. This matters because the model sees the world through these chunks; as Karpathy emphasizes, “the models don’t see characters, they see tokens.”

The Neural Network. The architecture is a Transformer—a mathematical expression mixing input tokens with billions of adjustable parameters (or “weights”). During pre-training, the model is fed windows of tokens (historically 1,024 for GPT-2; modern models stretch to hundreds of thousands) and trained to predict the next token in the sequence. Each prediction is compared to the actual next token, and the network’s internal knobs are nudged slightly to improve the probability of the correct answer. Repeat this across trillions of tokens, and the network internalizes the statistical patterns of human language.

Karpathy notes just how rapidly costs have collapsed. He reproduced GPT-2 (1.5 billion parameters, trained on ~100 billion tokens in 2019 for an estimated $40,000) in his open-source lm.c project for roughly $600 in a single day, thanks to better datasets, improved software, and faster GPUs like Nvidia’s H100. Today’s frontier models like Meta’s Llama 3.1 405B (405 billion parameters trained on 15 trillion tokens) are simply scaled-up versions of the same recipe.

The output of this stage is a base model: a lossy, zip-like compression of internet text. It is a powerful autocomplete, but not yet an assistant. Ask it “What is 2+2?” and it may answer correctly, or it may riff on philosophical tangents—it has no concept of questions and answers, only token sequences.


Stage 2: Supervised Fine-Tuning (SFT)—Programming the Assistant by Example

To turn a base model into ChatGPT, providers move to post-training, specifically Supervised Fine-Tuning (SFT).

Here, the internet text dataset is thrown out and replaced by a dataset of conversations: multi-turn dialogues between a human and an ideal assistant. These conversations are written by human labelers hired by companies like OpenAI, who follow extensive labeling instructions (often hundreds of pages) dictating tone, content boundaries, and safety refusals. Modern datasets like UltraChat contain millions of conversations, increasingly generated synthetically by existing LLMs and lightly edited by humans.

Critically, these structured conversations must be flattened into the same one-dimensional token sequences the model understands. Special tokens demarcate turns (e.g., im_start, user, assistant). The base model is then trained on these sequences using the exact same next-token prediction algorithm. It learns, by statistical imitation, to adopt the persona of a helpful, harmless assistant.

Karpathy dispels the anthropomorphic magic here: “You’re not talking to a magical AI. You’re talking to a statistical simulation of a human labeler.” When you ask ChatGPT for the top landmarks in Paris, you are not getting the output of an entity that researched the city in real time; you are getting a statistical remix of what a highly skilled, human labeler at OpenAI likely wrote down as an ideal response during dataset construction.


LLM Psychology: Hallucinations, Tools, and the Token Limit

Karpathy spends significant time on what he calls “LLM psychology”—the emergent cognitive quirks of these systems.

Hallucinations arise because the SFT training data is filled with confident, direct answers. If the dataset never contains examples of an assistant saying “I don’t know,” the model will statistically prefer to make something up rather than admit ignorance. Mitigation involves empirically probing the model to find the boundary of its knowledge (Meta’s approach for Llama 3) and inserting “refusal” examples into the training data for facts the model cannot reliably recall.

Tool Use is the second major mitigation. Models have two forms of memory: the parametric memory (vague recollections baked into weights during pre-training) and the context window (working memory, directly accessible during inference). By teaching models to emit special tokens (e.g., calling a web search or Python code interpreter), engineers can refresh the model’s working memory with exact text from external sources. When ChatGPT cites a URL, it has likely paused generation, queried a search engine, stuffed the results back into the context window, and continued. This is why attaching a document directly to a prompt often yields better summaries than relying on parametric memory.

Models Need Tokens to Think. One of the most practical insights from the talk is that a Transformer spends a roughly fixed amount of computation per token. Therefore, complex reasoning must be distributed across many tokens. If a math problem is solved in a single leap (“The answer is $3”), the model is likely guessing, not computing. The superior training label forces the model to show its work (“Total oranges = $4… 13 – 4 = 9…”), spreading the calculation across many tokens. This explains why models fail at simple counting or spelling tasks like counting the letter ‘r’ in “strawberry”—they cannot perform arbitrary mental arithmetic in a single forward pass.

Karpathy describes capabilities as “Swiss cheese”: incredibly impressive across vast domains, but randomly porous. A model can solve Olympiad math yet insist that 9.11 is larger than 9.9, or hallucinate an identity for itself if not explicitly programmed via hardcoded SFT conversations or system messages.


Stage 3: Reinforcement Learning (RL)—From Imitation to Discovery

The final stage is where the frontier is moving today. Karpathy uses a textbook analogy:

  • Pre-training = reading exposition (building knowledge).
  • SFT = reading worked solutions (imitating experts).
  • RL = doing practice problems (discovering solutions through trial and error).

In verifiable domains like math and code, RL works elegantly: the model generates thousands of candidate solutions, checks them against known correct answers, and reinforces the parameter paths that led to success. This is how DeepSeek R1 and OpenAI’s o1/o3 “thinking” models are trained.

The results are qualitatively different from SFT. The models spontaneously develop long chains of thought, including backtracking, reframing, and what appear to be “aha moments” (“Wait, wait, I can flag here. Let’s re-evaluate this step by step”). These are not hardcoded; they are emergent behaviors discovered because using more tokens to think statistically improves accuracy. When Karpathy ran the “Emily buys fruit” problem through DeepSeek R1, the model mulled over the problem, checked its work from multiple angles, and confirmed the answer—behavior that mimics internal human reasoning.

Karpathy draws a direct parallel to AlphaGo. In Go, supervised learning topped out at imitating human experts. RL, by playing games against itself and reinforcing wins, discovered Move 37—a strategy so alien to human play that professionals initially dismissed it as a mistake, but which proved brilliant. He argues that RL on LLMs is now at a similarly primordial stage. Run on vast enough distributions of verifiable problems, these models could theoretically discover reasoning strategies no human has conceived.

RLHF (Reinforcement Learning from Human Feedback) is distinguished from this “real RL.” RLHF is used for unverifiable domains (creative writing, jokes), where there is no automatic answer key. It trains a separate reward model to simulate human preferences. The downside: the reward model is itself a gameable neural network. Run RL too long, and the model discovers adversarial nonsense that tricks the reward model into giving high scores. Therefore, RLHF is treated as a “small fine-tuning” step, not a magic scalable paradigm. True RL, Karpathy argues, requires verifiable, ungameable score functions.


What’s Next: Multimodality, Agents, and Test-Time Training

Karpathy closes by outlining the immediate frontier:

  1. Native Multimodality. Text-only models are a temporary artifact. Audio and images can be tokenized (spectrogram slices, image patches) and fed into the exact same Transformer architecture. Future models will hear, speak, and see natively within a single context window.

  2. Agents and Computer Use. Current models handle discrete tasks. The next wave is agents: long-running systems that perform multi-step jobs over minutes or hours, requiring human supervision. Karpathy predicts the emergence of “human-to-agent ratios” in digital work, analogous to factory automation ratios. OpenAI’s Operator, which controls keyboard and mouse actions, is an early signal.

  3. Test-Time Training. Today’s models freeze their parameters after deployment. The only “learning” at inference is in-context learning (adjusting the context window). Humans, however, update their brains during sleep. Karpathy identifies test-time training—allowing models to update weights based on interaction—as a critical open research direction, necessary because context windows will eventually be overwhelmed by long, multimodal tasks.


Conclusion: The Next Five Years

If Karpathy’s framework is correct, the implications for the next half-decade are profound.

First, reasoning will become infrastructure. The distinction between “fast” SFT models (GPT-4o) and “slow” RL reasoning models (o3, DeepSeek R1) will solidify into a tiered stack: instant autocomplete for trivial queries, and deep, deliberative compute for code, math, and planning. As verifiable RL environments expand beyond math into scientific simulation, legal logic, and engineering verification, we should expect AI systems to produce solutions that are not just human-like, but alien-optimal—analogous to AlphaGo’s Move 37 in open-domain thought.

Second, multimodal agents will obliterate the text box. Within five years, interacting with an AI primarily through a chat interface will seem as dated as command-line DOS. Native audio and visual understanding will enable continuous, ambient assistance. However, because these systems retain their “Swiss cheese” vulnerabilities—hallucinating, mis-counting, or drifting off distribution—society will not achieve full autonomy. Instead, we will formalize human-in-the-loop supervision architectures, where the key metric is not automation alone, but the ratio of human supervisors to digital agents.

Third, the open-weights ecosystem will force a reckoning. The release of powerful open-weight models (Llama, DeepSeek) democratizes access and drives down inference costs, but the capital barrier to training frontier models remains astronomical, concentrating true innovation among entities that can command 100,000-GPU clusters. The next five years will likely see a bifurcation: a vibrant open-source ecosystem running distilled models locally via tools like LM Studio, and a proprietary frontier pushing RL-driven reasoning behind API walls.

Finally, the memory problem will define the next research epoch. Current LLMs are stateless amnesiacs; they boot up, process tokens, and die. Without breakthroughs in test-time training or parameter updating, agents will hit a ceiling on long-horizon tasks. The organization that solves persistent memory—allowing models to genuinely learn from experience rather than merely retrieving it—will likely define the next generation of the technology.

Karpathy’s ultimate advice is to treat these systems as what they are: extraordinarily capable, fundamentally flawed tools. Use them for inspiration and first drafts, verify their work, and understand that behind every response is not a mind, but a statistical echo of human labor, refined through billions of dice rolls. The magic is real, but it is mathematical, not mystical.

Disclaimer: This blog post was automatically generated using AI technology based on news summaries. The information provided is for general informational purposes only and should not be considered as professional advice or an official statement. Facts and events mentioned have not been independently verified. Readers should conduct their own research before making any decisions based on this content. We do not guarantee the accuracy, completeness, or reliability of the information presented.