notes

LLMs

A language model is a type of artificial intelligence system that is trained to understand and generate human-like text. It learns the structure, grammar, and semantics of a language by processing vast amounts of textual data. The primary goal of a language model is to predict the next word or token in a sequence based on the context provided by previous words.

Transformer Architecture

The transformer is the foundational architecture behind modern LLMs like GPT, Claude, and BERT. Think of it as a sophisticated pattern-matching and prediction engine that processes text in parallel rather than sequentially.

High-Level Architecture

Input: "The cat sat on the"
         ↓
    [Tokenization]
         ↓
    [Embeddings] ← Convert tokens to vectors
         ↓
    [Position Encoding] ← Add position information
         ↓
    ╔════════════════════════════════════════╗
    ║        TRANSFORMER BLOCK               ║
    ║  ┌──────────────────────────────────┐  ║
    ║  │   Multi-Head Attention           │  ║
    ║  │   (What tokens relate to what?)  │  ║
    ║  └──────────────────────────────────┘  ║
    ║              ↓                         ║
    ║  ┌──────────────────────────────────┐  ║
    ║  │   Add & Normalize                │  ║
    ║  └──────────────────────────────────┘  ║
    ║              ↓                         ║
    ║  ┌──────────────────────────────────┐  ║
    ║  │   Feed Forward Network           │  ║
    ║  │   (Transform representations)    │  ║
    ║  └──────────────────────────────────┘  ║
    ║              ↓                         ║
    ║  ┌──────────────────────────────────┐  ║
    ║  │   Add & Normalize                │  ║
    ║  └──────────────────────────────────┘  ║
    ╚════════════════════════════════════════╝
         ↓
    [Repeat N times] ← Stack 12-96+ blocks
         ↓
    [Output Layer] ← Probability distribution
         ↓
    Prediction: "mat" (highest probability)

Core Components Explained

1. Tokenization & Embeddings

Think of it as: Converting strings to numerical vectors that capture semantic meaning.

"cat" → [0.2, -0.5, 0.8, 0.1, ...] (512-4096 dimensions)
"dog" → [0.3, -0.4, 0.7, 0.2, ...] (similar vector, close in space)
"car" → [-0.1, 0.9, -0.3, 0.6, ...] (different vector, far from "cat")

The embedding layer is essentially a giant lookup table: token_id → vector. These vectors are learned during training so that semantically similar words end up close together in vector space.
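
A minimal NumPy sketch of that lookup-table idea; the vocabulary size, dimension, and token ids are toy values (the ids 42/156/234 reuse the made-up ids from the example later in these notes), not output of any real tokenizer.

import numpy as np

vocab_size, d_model = 50_000, 768           # toy sizes; real models vary
rng = np.random.default_rng(0)

# The embedding "table": one learned row per token id.
embedding_table = rng.normal(scale=0.02, size=(vocab_size, d_model))

token_ids = np.array([42, 156, 234])        # e.g. ids for "The", "cat", "sat"
embeddings = embedding_table[token_ids]     # plain row lookup -> shape (3, 768)
print(embeddings.shape)                     # (3, 768)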

2. Position Encoding

Problem: Unlike Recurrent Neural Networks (RNNs) that process sequences step-by-step, transformers process all tokens in parallel. But word order matters!

Solution: Add positional information to each token’s embedding using sine/cosine functions:

position_encoding[pos][2i]   = sin(pos / 10000^(2i/d))
position_encoding[pos][2i+1] = cos(pos / 10000^(2i/d))

This creates a unique “fingerprint” for each position that the model can learn to use.
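
A small NumPy sketch of the sine/cosine formula above; seq_len and d_model here are arbitrary toy values.

import numpy as np

def positional_encoding(seq_len, d_model):
    # pos: token position, i: index of the (sin, cos) dimension pair
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions: cos
    return pe

pe = positional_encoding(seq_len=6, d_model=768)
# Added (not concatenated) to the token embeddings:
# x = embeddings + positional_encoding(len(token_ids), d_model)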

3. Multi-Head Attention (The Core Innovation)

Think of it as: A sophisticated dictionary lookup where each word asks “which other words should I pay attention to?”

Detailed View:

For input token "sat":
┌─────────────────────────────────────────────────────┐
│ Query (Q): "What am I looking for?"                 │
│ Key (K):   "What do I contain?" (for all tokens)   │
│ Value (V): "What information do I carry?"           │
└─────────────────────────────────────────────────────┘

Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V

Example: Analyzing "The cat sat on the mat"

Token "sat" queries all other tokens:
  "The" ─────┬──→ Attention Score: 0.05
  "cat" ─────┼──→ Attention Score: 0.40  (high!)
  "sat" ─────┼──→ Attention Score: 0.10
  "on"  ─────┼──→ Attention Score: 0.15
  "the" ─────┼──→ Attention Score: 0.05
  "mat" ─────┴──→ Attention Score: 0.25  (high!)

Result: "sat" pays most attention to "cat" and "mat"

Multi-Head: Run the attention mechanism several times in parallel (8-16 “heads”), each head learning a different kind of relationship (one might track subject-verb links, another nearby modifiers); the heads’ outputs are concatenated and projected back to the model dimension.
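
A sketch of a single attention head in NumPy, following the softmax(Q·K^T / √d_k)·V formula above. The projection matrices are random stand-ins for learned weights, and causal masking is omitted; multi-head attention just runs several copies of this with different projections and concatenates the results.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq, seq) relevance scores
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V, weights                # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 768, 64             # "The cat sat on the mat"
x = rng.normal(size=(seq_len, d_model))        # token representations

# Learned in a real model; random stand-ins here.
W_q, W_k, W_v = (rng.normal(scale=0.02, size=(d_model, d_k)) for _ in range(3))

out, weights = attention(x @ W_q, x @ W_k, x @ W_v)
print(weights[2].round(2))   # how much "sat" (position 2) attends to each token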

4. Feed Forward Network

After attention, each token representation is passed through a simple 2-layer neural network:

FFN(x) = ReLU(x · W1 + b1) · W2 + b2

Purpose: Transform and enrich the representations. If attention is “gather information from other tokens,” FFN is “process that information.”

Software Engineer Analogy:

Think of the FFN as a data transformation pipeline that processes each token independently (a small code sketch follows this list):

  1. Input: A vector of 768 numbers representing a token
  2. Layer 1 (Expansion): Multiply by matrix W1 → expands to 3072 dimensions
    • Like mapping your data through a lookup table or hash function
    • Each output dimension is a weighted sum of inputs (similar to a dot product)
  3. ReLU Activation: max(0, x) - zeros out negative values
    • Acts like a filter: keeps positive signals, discards negative ones
    • Introduces non-linearity (without this, the whole network would be just one big matrix multiplication)
  4. Layer 2 (Compression): Multiply by matrix W2 → back to 768 dimensions
    • Projects the expanded representation back to original size
  5. Add bias terms (b1, b2): Like adding constants to adjust the output
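
A minimal NumPy sketch of that pipeline, using the 768 → 3072 → 768 sizes from the list above; the weight matrices are random stand-ins for parameters that would be learned during training.

import numpy as np

d_model, d_ff = 768, 3072                      # expand 4x, then project back
rng = np.random.default_rng(0)

W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

def ffn(x):
    # Applied to each token vector independently (no mixing across positions).
    hidden = np.maximum(0, x @ W1 + b1)        # ReLU(x·W1 + b1)
    return hidden @ W2 + b2                    # compress back to d_model

tokens = rng.normal(size=(3, d_model))         # 3 token representations
print(ffn(tokens).shape)                       # (3, 768)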

Why expand then compress? The wider hidden layer gives the non-linearity more room to work: the network can build richer intermediate features in the expanded space before projecting back down to the size the rest of the stack expects.

Key difference from attention: attention mixes information across tokens, while the FFN transforms each token's vector independently, with no interaction between positions.

5. Add & Normalize (Residual Connections)

Residual Connection:

output = LayerNorm(input + Sublayer(input))
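
A sketch of this wrapper in NumPy; sublayer stands for either the attention step or the FFN step, and a real layer norm also learns a per-dimension scale and shift, omitted here.

import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean / unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    return layer_norm(x + sublayer(x))         # residual: input + sublayer output

# Usage inside one transformer block:
# x = add_and_norm(x, self_attention)
# x = add_and_norm(x, ffn)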

Why? The residual path lets each block learn a small adjustment on top of what it received rather than replacing it, which keeps gradients flowing through very deep stacks; the normalization keeps activations in a stable range so training stays well-behaved.

How It All Works Together

Step-by-step processing of “The cat sat” (a toy end-to-end sketch follows this list):

  1. Input: Tokens [The, cat, sat] → IDs [42, 156, 234]

  2. Embedding + Position:
    [42] → [0.2, 0.5, ...] + [sin/cos positions for pos 0]
    [156] → [0.3, 0.4, ...] + [sin/cos positions for pos 1]
    [234] → [-0.1, 0.7, ...] + [sin/cos positions for pos 2]
    
  3. Attention (Layer 1):
    • “cat” looks at “The” and learns it’s the determiner
    • “sat” looks at “cat” and learns it’s the subject
    • Each token’s representation is updated with context
  4. FFN (Layer 1):
    • Transform representations non-linearly
    • Each token processed independently
  5. Repeat steps 3-4 for N layers (12, 24, 96, etc.)
    • Early layers: Learn syntax, grammar, word relationships
    • Middle layers: Learn semantic meaning, context
    • Late layers: Learn task-specific patterns, reasoning
  6. Output:
    • Final representation → probability distribution over vocabulary
    • Pick token with highest probability
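
A toy end-to-end sketch of these steps in NumPy: one block, a single attention head, no masking or dropout, and random weights standing in for everything a real model learns. It only shows the data flow and the final “pick the highest-probability token” step; the output projection reuses the embedding table, which is one common choice (many models learn a separate output matrix).

import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, d_ff, d_k = 1000, 64, 256, 64   # tiny toy sizes

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

# Random stand-ins for learned parameters.
E   = rng.normal(scale=0.02, size=(vocab_size, d_model))   # embedding table
W_q = rng.normal(scale=0.02, size=(d_model, d_k))
W_k = rng.normal(scale=0.02, size=(d_model, d_k))
W_v = rng.normal(scale=0.02, size=(d_model, d_k))
W_o = rng.normal(scale=0.02, size=(d_k, d_model))
W1  = rng.normal(scale=0.02, size=(d_model, d_ff))
W2  = rng.normal(scale=0.02, size=(d_ff, d_model))

def positions(seq_len):
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)
    return pe

def block(x):
    # Steps 3-4: attention mixes information across tokens,
    # then the FFN transforms each token independently.
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    attn = softmax(Q @ K.T / np.sqrt(d_k)) @ V
    x = layer_norm(x + attn @ W_o)
    x = layer_norm(x + np.maximum(0, x @ W1) @ W2)
    return x

token_ids = np.array([42, 156, 234])           # "The cat sat"
x = E[token_ids] + positions(len(token_ids))   # steps 1-2: embed + position
x = block(x)                                   # step 5: repeated N times in a real model
logits = x[-1] @ E.T                           # step 6: project last token onto vocab
probs = softmax(logits)
print(int(probs.argmax()))                     # id of the predicted next token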

Key Insights

Parallel Processing: all tokens in a sequence are processed at once, which maps well onto GPUs, unlike RNNs that must step through the sequence one token at a time.

Memory & Context: the model has no memory beyond the current context window; anything it should “remember” has to be present in the input tokens.

Training vs Inference: training predicts every next token across huge corpora in parallel; inference generates one token at a time, feeding each prediction back in as input.

Scale is Critical: capability grows with more parameters, data, and compute (scaling laws), while the architecture itself changes little between model sizes.

Encoder-Only vs Decoder-Only vs Encoder-Decoder

Encoder-Only (BERT): bidirectional attention over the whole input; suited to understanding tasks like classification and retrieval rather than free-form generation.

Decoder-Only (GPT, Claude): causal (left-to-right) attention, so each token only sees what came before it; generates text autoregressively, one token at a time.

Encoder-Decoder (T5): an encoder reads the full input and a decoder generates the output while attending to it; a natural fit for sequence-to-sequence tasks like translation and summarization.

Why Transformers Won

Before Transformers (RNNs/LSTMs): sequential processing meant training could not be parallelized across the sequence, and long-range dependencies had to survive a recurrent bottleneck.

Transformers: every token can attend directly to every other token in a single step, training parallelizes across the sequence, and performance scales cleanly with more data and compute.

Practical Implications

Context Window: self-attention compares every token with every other token, so cost grows quadratically with sequence length; this is the main constraint on how much text fits in the window.

Emergence: capabilities such as in-context learning and multi-step reasoning appear as models scale up, without being explicitly trained for.

Inference Cost: generation is sequential (one token at a time) and each new token attends over everything before it, so long prompts and long outputs are expensive; caching the key/value vectors of past tokens helps but still costs memory.