
Stanford Online on YouTube


Episode Map

  • Introduction — Speaker introduces LLMs, key components for training (architecture, loss/algorithm, data, evaluation, systems), and lecture focus
  • Lecture Overview — Skips architecture/Transformer details; emphasizes data, evaluation, and systems over academia's architecture focus; outlines pre-training vs post-training
  • Language Modeling Basics — Defines LLMs as probability distributions over token sequences, examples with sentences, generative nature
  • Autoregressive Models — Chain rule decomposition, prediction task, training vs inference process
  • Model Mechanics — Embedding, neural network processing, linear layer, softmax, cross-entropy loss as equivalent to log-likelihood maximization (a code sketch follows this map)
  • Tokenizers — Importance, generality over words (typos, non-Latin languages), vs character-level (sequence length issue)
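
The Model Mechanics step compresses several ideas, so here is a minimal sketch of the loss-equivalence point, assuming PyTorch and a toy stand-in network (the lecture's actual model is a Transformer, which this deliberately skips):

```python
# Minimal sketch, assuming PyTorch, of the per-step pipeline summarized
# above: embed tokens, run a network, project to vocabulary-size logits,
# softmax, cross-entropy. The "network" is a toy stand-in, not the
# Transformer the lecture describes.
import torch
import torch.nn.functional as F

vocab_size, d_model = 100, 16
embedding = torch.nn.Embedding(vocab_size, d_model)  # token id -> vector
net = torch.nn.Linear(d_model, d_model)              # stand-in for the real model
to_logits = torch.nn.Linear(d_model, vocab_size)     # final linear layer

context = torch.tensor([5, 42, 7])       # token ids seen so far
target = torch.tensor([13])              # the true next token

h = net(embedding(context)).mean(dim=0)  # toy pooling over the context
logits = to_logits(h)

# Cross-entropy on the logits...
loss = F.cross_entropy(logits.unsqueeze(0), target)
# ...equals the negative log-probability of the true next token:
nll = -F.log_softmax(logits, dim=-1)[target[0]]
assert torch.allclose(loss, nll)  # minimizing loss == maximizing log-likelihood
```

The assert is the point: the cross-entropy loss computed on the logits is exactly the negative log-probability the model assigns to the true next token, so minimizing one maximizes the other.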

Key Insights

Key Components for LLM Training

Five main components matter: architecture, training loss/algorithm, data, evaluation, and systems for running on hardware. Systems have become crucial due to model size. Academia focuses on architecture/losses, but industry prioritizes data, evaluation, and systems.

Of those five components, what matters in practice is mostly the three other topics, so data, evaluation, and systems, which is what most of industry actually focuses on.

Pre-training vs Post-training

Pre-training is classical language modeling on internet-scale data (e.g., GPT-3). Post-training turns models into AI assistants (e.g., ChatGPT). The lecture covers both, starting with pre-training tasks and losses.

Pre-training, you've probably heard that word, is kind of the classical language modeling paradigm, where you basically train your language model to essentially model all of the internet. Post-training, which is a more recent paradigm, is taking these large language models and making them essentially AI assistants.

Autoregressive Language Models

Models predict the next token via the chain rule of probability. Training uses cross-entropy loss to maximize likelihood; inference samples in a loop, which is slow for long sequences.

The key idea of autoregressive language models is that you take this distribution over words and basically decompose it into the distribution of the first word, multiplied by the likelihood of the second word given the first word. Minimizing the loss is the same thing as maximizing the likelihood of your text.
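
Below is a hedged sketch of both halves of this idea, with a hypothetical next_token_probs standing in for a trained model (all names here are illustrative, not from the lecture):

```python
# Sketch of the chain-rule decomposition and the inference-time loop.
# next_token_probs is a hypothetical stand-in for a trained model that
# returns P(token | prefix); a real implementation runs the network here.
import math
import random

def next_token_probs(prefix):
    # Placeholder distribution; in practice this is a full forward pass.
    return {"the": 0.5, "cat": 0.3, "<eos>": 0.2}

def log_likelihood(tokens):
    # Chain rule: log p(x_1..x_T) = sum_t log p(x_t | x_1..x_{t-1}).
    # Training minimizes the negative of this sum (cross-entropy).
    return sum(math.log(next_token_probs(tokens[:t])[tokens[t]])
               for t in range(len(tokens)))

def sample(max_len=20):
    # Sequential generation: sample a token, condition on it, repeat.
    tokens = []
    for _ in range(max_len):
        probs = next_token_probs(tokens)
        tok = random.choices(list(probs), weights=list(probs.values()))[0]
        if tok == "<eos>":
            break
        tokens.append(tok)
    return tokens

print(sample(), log_likelihood(["the", "cat"]))
```

The for loop in sample is the slowness noted above: each new token costs one more model call conditioned on everything generated so far.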

Tokenizer Importance

Tokenizers are more general than words, handling typos and non-space-separated languages like Thai. Character-level works but leads to overly long sequences.

Tokenizers are extremely important. Tokens are much more general than words. [With character-level tokenization] your sequence becomes super long.
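
A toy illustration of the contrast, using hypothetical vocabularies (this is not the lecture's tokenizer code):

```python
# Toy comparison of word-level vs character-level tokenization.
word_vocab = {"the": 0, "cat": 1, "sat": 2}

def word_tokenize(text):
    # Breaks on anything out of vocabulary, e.g. the typo "teh",
    # and on languages like Thai that do not separate words with spaces.
    return [word_vocab[w] for w in text.split()]  # KeyError: 'teh'

def char_tokenize(text):
    # Never fails, but one token per character makes sequences very long.
    return [ord(c) for c in text]

print(len(char_tokenize("the cat sat")))  # 11 tokens for 3 words
```

Subword tokenizers such as BPE sit between these extremes, covering typos and non-space-separated languages while keeping sequences short; the lecture defers those details to later.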

Patterns and Themes

  • Emphasis on practical industry priorities (data, eval, systems) over academic architecture focus
  • Autoregressive prediction as core task, with training/inference distinctions repeated
  • High-level explanations avoiding deep architecture dives, recapping basics

Clips and Quotes

In reality, honestly, what matters in practice is mostly the three other topics, so data, evaluation, and systems, which is what most of industry actually focuses on.

TikTok clip contrasting academia vs industry LLM priorities

Pre-training, you've probably heard that word, is kind of the classical language modeling paradigm, where you basically train your language model to essentially model all of the internet.

Hook for intro explaining pre-training simply

One downside of autoregressive language models is that when you actually sample from this autoregressive language model, you basically have a for loop which generates the next word, then conditions on that next word, and then generates another word. So if you have a longer sentence that you want to generate, it takes more time.

Short explainer on autoregressive generation limits

If there's a typo in your word, then you might not have any token associated with this word with a typo, and then you don't know how to actually pass this word into a large language model.

Highlight tokenizer necessity for real-world text

Actions, Risks, and Follow-Ups

Open Questions:

  • How do you deal with adding more tokens to the corpus (vocabulary size)?
  • Details on tokenization methods for handling new tokens (mentioned to be covered later)

Risks / Concerns:

  • Autoregressive sampling is slow for long sequences due to sequential for-loop generation
  • Word-level tokenization fails on typos or non-Latin languages without spaces

Follow-Up Directions:

  • Deeper dive into tokenization strategies as promised later in lecture
  • Post-training techniques for turning pre-trained models into AI assistants

Verdict: Worth watching for a clear, practical overview of LLM training beyond hype, emphasizing underrepresented topics like data and systems. The one thing to remember: industry succeeds on data, evaluation, and systems, not just fancy architectures.

Tone: educational | For: ML students, engineers new to LLMs, or industry pros wanting a high-level refresh on training components and why systems matter now.


Disclaimer: This blog post was automatically generated using AI technology based on news summaries. The information provided is for general informational purposes only and should not be considered as professional advice or an official statement. Facts and events mentioned have not been independently verified. Readers should conduct their own research before making any decisions based on this content. We do not guarantee the accuracy, completeness, or reliability of the information presented.