Large Language Models (LLMs)
From GPT to Claude — Advanced Level (For Researchers and Practitioners)
A Large Language Model (LLM) is a language model that scales an autoregressive Transformer to billions or even trillions of parameters, pre-trained on massive text corpora. After pre-training, the model is fine-tuned using alignment techniques such as RLHF to generate responses that align with human intent. This article assumes familiarity with attention mechanisms and the Transformer architecture (readers unfamiliar with these topics are encouraged to review them first), and provides a systematic overview of the technologies specific to LLMs.
1. Tokenization
Rather than processing text directly, LLMs first convert it into a sequence of substrings called tokens. Since vocabulary size and the handling of unknown words directly affect model performance, subword segmentation is the standard approach.
BPE (Byte-Pair Encoding) Algorithm
1. Initialize the vocabulary at the character (or byte) level
2. Find the most frequent adjacent token pair in the corpus
3. Merge that pair into a new token and add it to the vocabulary
4. Repeat steps 2–3 until the target vocabulary size is reached
For example, if "l o w" and "l o w e r" appear frequently, the algorithm first merges "lo" → "low", yielding efficient substring representations.
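The merge loop can be sketched in a few lines of Python. This is a toy word-level version for illustration; real tokenizers operate at the byte level and use heavily optimized implementations:

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merges from a list of words (toy sketch, not byte-level)."""
    # represent each word as a tuple of symbols, with frequency counts
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # step 2: count adjacent symbol pairs across the corpus
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # step 3: merge the best pair everywhere it occurs
        merged = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

# "low" x5 and "lower" x2: first merge is l+o -> lo, then lo+w -> low
merges = bpe_train(["low"] * 5 + ["lower"] * 2, num_merges=3)
```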
2. LLM Architecture
Modern LLMs are based on the Transformer decoder, incorporating several architectural improvements.
RMSNorm
A lightweight normalization method that removes the mean subtraction from LayerNorm and normalizes using only the root mean square (RMS).
$$\text{RMSNorm}(\boldsymbol{x}) = \frac{\boldsymbol{x}}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \odot \boldsymbol{\gamma}$$

RoPE (Rotary Position Embedding)
Relative position information is injected by multiplying the Query and Key vectors with rotation matrices. The $d$-dimensional vector is split into 2-dimensional blocks, and each block is rotated according to position $m$:
$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}, \quad \theta_i = 10000^{-2i/d}$$

Due to this rotation, the inner product of the Query at position $m$ and the Key at position $n$ depends only on the position difference $m - n$. Unlike absolute position encoding, relative position information is naturally injected without adding any learnable parameters.
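The rotation and its relative-position property can be checked directly with a minimal numpy sketch (pairing even/odd components as the 2-D blocks):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to a vector x at position `pos`."""
    d = x.shape[-1]
    # one rotation angle per 2-D block: theta_i = base^(-2i/d)
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    angle = pos * theta
    x1, x2 = x[0::2], x[1::2]  # the two components of each 2-D block
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(angle) - x2 * np.sin(angle)
    out[1::2] = x1 * np.sin(angle) + x2 * np.cos(angle)
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
# the q-k inner product depends only on the offset m - n (here 2 in both cases)
a = rope(q, 5) @ rope(k, 3)
b = rope(q, 12) @ rope(k, 10)
```

The final assertion of the relative-position property is exactly why RoPE extrapolation tricks (discussed later) rescale the angles rather than the embeddings.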
Grouped-Query Attention (GQA)
In MHA (Multi-Head Attention), each head has independent Q, K, and V projections. In GQA, multiple Query heads share Key-Value heads, thereby reducing KV cache memory usage while maintaining performance close to MHA.
MHA vs MQA vs GQA
- MHA (Multi-Head): Independent K, V per head → high memory usage
- MQA (Multi-Query): All heads share K, V → minimum memory, slightly lower accuracy
- GQA (Grouped-Query): K, V shared within G groups → intermediate between MHA and MQA
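The KV-head sharing in GQA amounts to broadcasting each cached K/V head across its group of query heads. A minimal numpy sketch (causal mask omitted for brevity; head counts and dimensions are illustrative):

```python
import numpy as np

def gqa_attention(q, k, v):
    """Attention where groups of query heads share one KV head."""
    n_q_heads, seq, d_k = q.shape
    n_kv_heads = k.shape[0]
    group = n_q_heads // n_kv_heads      # query heads per shared KV head
    k = np.repeat(k, group, axis=0)      # broadcast each KV head to its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # softmax over key positions
    return w @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))  # 8 query heads
k = rng.standard_normal((2, 4, 16))  # only 2 KV heads need to be cached
v = rng.standard_normal((2, 4, 16))
out = gqa_attention(q, k, v)
```

Only the 2 KV heads are stored in the cache, so the cache shrinks by 4x relative to MHA while the output keeps the full 8-head shape.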
SwiGLU
An activation function that replaces ReLU with Swish (SiLU) and adds a gating mechanism.
$$\text{SwiGLU}(\boldsymbol{x}) = \bigl((\boldsymbol{x} W_1) \odot \text{Swish}(\boldsymbol{x} W_\text{gate})\bigr) W_2$$

where $\text{Swish}(x) = x \cdot \sigma(x)$, and $\sigma$ denotes the sigmoid function.
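A minimal numpy sketch of the SwiGLU feed-forward block (the dimensions are illustrative; in practice $d_{\text{ff}}$ is typically about $\tfrac{8}{3} d_{\text{model}}$ to match the parameter count of a ReLU FFN):

```python
import numpy as np

def swiglu(x, w1, w_gate, w2):
    """SwiGLU block: elementwise product of a linear path and a Swish gate."""
    swish = lambda z: z / (1.0 + np.exp(-z))  # Swish/SiLU: z * sigmoid(z)
    return ((x @ w1) * swish(x @ w_gate)) @ w2

rng = np.random.default_rng(0)
d_model, d_ff = 16, 43  # 43 ~ (8/3) * 16
x = rng.standard_normal((2, d_model))
y = swiglu(x,
           rng.standard_normal((d_model, d_ff)),
           rng.standard_normal((d_model, d_ff)),
           rng.standard_normal((d_ff, d_model)))
```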
3. Scaling Laws
One of the most important discoveries in LLM research is the scaling laws that quantitatively describe the relationship between model performance and scale.
Kaplan's Scaling Laws
The test loss $L$ follows a power law with respect to the number of parameters $N$, data size $D$, and compute $C$:
$$L(N) \propto N^{-0.076}, \quad L(D) \propto D^{-0.095}, \quad L(C) \propto C^{-0.050}$$

This means that each 10-fold increase in model size reduces the loss by the same constant factor, and another 10-fold increase yields the same proportional improvement again—a predictable relationship. However, these exponents hold only when the other two variables are sufficiently large (i.e., not a bottleneck); in practice, insufficient data or compute causes the loss to plateau short of the power-law prediction.
Chinchilla Scaling Laws
Hoffmann et al. (2022) derived the optimal relationship between parameter count and data size given a fixed compute budget.
Chinchilla Optimality
For a given compute budget $C$, the optimal parameter count $N^*$ and training token count $D^*$ should both scale proportionally with $C$:
$$N^* \propto C^{0.50}, \quad D^* \propto C^{0.50}$$

In other words, if the parameter count is doubled, the training data should also be doubled. This finding suggested that many models of the time (such as GPT-3) were under-trained.
- GPT-3 (175B parameters): Trained on 300B tokens → Chinchilla-optimal would be approximately 3.5T tokens
- Chinchilla (70B parameters): Trained on 1.4T tokens → Smaller than GPT-3 but achieved comparable or better performance
- LLaMA 2 (70B parameters): Trained on 2T tokens → Exceeds the Chinchilla optimum
- LLaMA 3 (8B parameters): Trained on 15T tokens → Approximately 100 times the Chinchilla optimum
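The figures above follow from the commonly used rule of thumb that the Chinchilla optimum is roughly 20 training tokens per parameter (an approximation of the Hoffmann et al. result, not an exact constant):

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Rule-of-thumb Chinchilla optimum: ~20 training tokens per parameter."""
    return n_params * tokens_per_param

# GPT-3 (175B): optimal would be ~3.5T tokens, vs. the 300B actually used
gpt3_optimal = chinchilla_optimal_tokens(175e9)

# LLaMA 3 (8B) trained on 15T tokens: roughly 90-100x the optimum
overtrain_factor = 15e12 / chinchilla_optimal_tokens(8e9)
```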
Over-training Strategy
The Chinchilla law optimizes for minimizing training cost, not inference cost. Training a smaller model far beyond the Chinchilla optimum, as with LLaMA 3, increases training cost but dramatically reduces inference cost compared to a larger model of equivalent performance. In production deployments where the number of inference calls vastly exceeds training runs, this "over-training" strategy is economically rational and has become the prevailing approach.
4. Pre-training
LLM pre-training is next-token prediction (Causal Language Modeling) on a large-scale corpus.
Pre-training Objective
For a token sequence $x_1, x_2, \ldots, x_T$, the autoregressive log-likelihood is maximized:
$$\mathcal{L}_{\text{pretrain}} = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, \ldots, x_{t-1})$$

Training Data
Typically, a corpus of trillions of tokens is used, comprising web crawls (CommonCrawl, etc.), books, academic papers, and code. Data quality (deduplication, filtering) has a significant impact on performance.
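The next-token objective above can be sketched numerically; this toy example computes the mean negative log-likelihood from raw logits (random values stand in for model outputs):

```python
import numpy as np

def causal_lm_loss(logits, targets):
    """Mean negative log-likelihood of each next token under the model's logits."""
    # logits: (T, vocab) predictions for positions 1..T; targets: (T,) token ids
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
loss = causal_lm_loss(rng.standard_normal((5, 100)), np.array([3, 1, 4, 1, 5]))
```

A useful sanity check: with uniform logits the loss equals $\log |V|$, the entropy of guessing uniformly over the vocabulary.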
Training Stability
- Learning rate scheduler: Warm-up + cosine decay is the standard approach
- Gradient clipping: Constrains gradient norms to prevent training collapse
- Mixed-precision training: Computes in BF16/FP16 while maintaining master weights in FP32
- Distributed training: Combines data parallelism, tensor parallelism, and pipeline parallelism
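The standard warm-up + cosine decay schedule from the list above can be sketched as follows (the hyperparameter values in the test are illustrative):

```python
import math

def lr_schedule(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warm-up to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps  # linear ramp
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```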
5. RLHF (Reinforcement Learning from Human Feedback)
Pre-training alone can result in an LLM that generates harmful content or fails to follow instructions. RLHF (Reinforcement Learning from Human Feedback) is a technique that adjusts the model to align with human preferences, and it was a key technology behind the success of ChatGPT.
Three Steps of RLHF
1. Supervised Fine-Tuning (SFT): The pre-trained model is fine-tuned on high-quality instruction-response pairs written by humans.
2. Reward Model (RM) Training: Humans rank multiple responses to the same prompt, and a model is trained to output a scalar reward that reproduces this ranking. The loss function is based on the Bradley-Terry model: $$\mathcal{L}_{\text{RM}} = -\log \sigma\bigl(r_\theta(x, y_w) - r_\theta(x, y_l)\bigr)$$ where $y_w$ is the preferred response and $y_l$ is the less preferred one.
3. Reinforcement Learning with PPO (Proximal Policy Optimization): Reinforcement learning is performed using the reward model's score as the reward and the SFT model as the policy: $$\max_\pi \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot|x)}\bigl[r_\theta(x, y)\bigr] - \beta \cdot D_{\text{KL}}\bigl(\pi \| \pi_{\text{SFT}}\bigr)$$ The KL penalty prevents excessive divergence from the SFT model.
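The Bradley-Terry reward-model loss reduces to a single stable expression; `r_chosen`/`r_rejected` here stand for reward-model scores of the preferred and non-preferred responses:

```python
import numpy as np

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log sigma(r(x, y_w) - r(x, y_l))."""
    margin = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    # -log sigmoid(z) == log(1 + exp(-z)), computed stably via logaddexp
    return np.logaddexp(0.0, -margin).mean()
```

With a zero margin the loss is $\log 2$ (the model is indifferent), and it decreases monotonically as the chosen response's reward pulls ahead.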
DPO (Direct Preference Optimization)
Proposed by Rafailov et al. (2023), this method bypasses explicit reward model training and directly optimizes the policy from human ranking data. The key idea is to directly learn the probability ratio between preferred and non-preferred responses, without using a reinforcement learning loop like PPO.
$$\mathcal{L}_{\text{DPO}} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)$$

Since it eliminates the need for an unstable reinforcement learning loop like PPO and can be implemented within the same framework as supervised learning, DPO is widely adopted in practice.
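Given the summed log-probabilities of each response under the policy and the frozen reference model, the DPO loss is a few lines (the argument names are illustrative; in a real trainer these come from per-token log-probs summed over the response):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from summed log-probs of chosen (w) and rejected (l) responses."""
    # implicit reward of each response: beta * log(pi_theta / pi_ref)
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return np.logaddexp(0.0, -margin)  # -log sigmoid(margin), computed stably
```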
Constitutional AI (RLAIF)
Instead of human labeling, the AI itself evaluates and improves responses based on a "constitution" (a list of principles). This approach is employed in Anthropic's Claude.
Other Alignment Methods
As extensions of DPO, increasingly simpler and more efficient alignment methods have been proposed, such as ORPO (Odds Ratio Preference Optimization), which integrates SFT and preference optimization into a single step, and SimPO, which eliminates the need for a reference model.
6. Inference Optimization
Inference optimization is essential for running LLMs with billions of parameters at practical speeds.
KV Cache
In autoregressive generation, each new token requires recomputing attention over all previous tokens. The KV cache stores the Key and Value vectors of past tokens, avoiding redundant computation.
KV Cache Memory
For $L$ layers, $n_{\text{kv}}$ KV heads, head dimension $d_k$, sequence length $S$, and batch size $B$, the FP16 memory for the KV cache is:
$$\text{Memory}_{\text{KV}} = 2 \times 2 \times B \times L \times S \times n_{\text{kv}} \times d_k \;\text{bytes}$$

In MHA, $n_{\text{kv}} = n_{\text{heads}}$, so $n_{\text{kv}} \times d_k = d_{\text{model}}$. For example, with MHA configuration where $L=80,\; d_{\text{model}}=8192,\; B=1,\; S=4096$ → approximately 10 GB. In contrast, LLaMA 2-70B uses GQA with $n_{\text{kv}} = 8$ (64 Query heads), reducing the KV cache to $8/64 = 1/8$, approximately 1.3 GB.
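The formula is easy to turn into a quick estimator, reproducing both of the numbers above (8192 = 64 heads x head dimension 128):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per_el=2):
    """KV cache size: 2 (K and V) x bytes/element x B x L x S x n_kv x d_k."""
    return 2 * bytes_per_el * batch * n_layers * seq_len * n_kv_heads * d_head

# MHA: L=80, 64 heads x d_k=128 (d_model=8192), S=4096 -> ~10 GiB
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, d_head=128, seq_len=4096)

# GQA with 8 KV heads (LLaMA 2-70B style) -> exactly 1/8 of that, ~1.3 GB
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, d_head=128, seq_len=4096)
```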
Quantization
Weights and activations are represented in low-bit formats (INT8, INT4, NF4, etc.) to reduce memory usage and computation.
- GPTQ: Weight-only Post-Training Quantization
- AWQ: Activation-aware weight quantization
- GGUF: Quantization format for llama.cpp (optimized for CPU inference)
- bitsandbytes: NF4 quantization used in QLoRA
Continuous Batching and PagedAttention
With traditional static batching, all requests in a batch must wait for the longest sequence, resulting in low GPU utilization. Continuous Batching immediately removes completed requests from the batch and inserts new ones, significantly improving throughput.
PagedAttention (used in vLLM) manages the KV cache in page units, inspired by OS virtual memory. This eliminates the need to allocate contiguous memory for each sequence, resolving memory fragmentation and enabling larger batch sizes.
Speculative Decoding
A small "draft model" generates multiple tokens ahead, which are then verified in bulk by a larger "target model." The output distribution of the target model remains unchanged while generation speed is improved by a factor of 2–3.
7. Major Model Lineage
| Model | Year | Parameters | Key Features |
|---|---|---|---|
| GPT-3 | 2020 | 175B | Demonstrated few-shot learning |
| PaLM | 2022 | 540B | Pathways system, Chain-of-Thought |
| ChatGPT | 2022 | — | Specialized for dialogue via RLHF |
| LLaMA 2 | 2023 | 7B–70B | Open weights, GQA |
| GPT-4 | 2023 | Est. 1T+ | Multimodal (text + image) |
| GPT-4o | 2024 | — | Unified multimodal: text, image, and audio |
| Claude 3.5 | 2024 | — | Constitutional AI, 200K context length |
| LLaMA 3 | 2024 | 8B–405B | 15T+ tokens, over-training |
| Gemma 2 | 2024 | 2B–27B | Small and efficient, knowledge distillation |
| Qwen 2.5 | 2024 | 0.5B–72B | Multilingual, MoE variant available |
| DeepSeek-V3 | 2024 | 671B (MoE) | Mixture-of-Experts, high efficiency |
8. Emerging Topics
Mixture of Experts (MoE)
The FFN layer is divided into multiple "experts," and a router activates only the top $k$ experts for each token. While the total parameter count is large, the computational cost during inference remains small.
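This top-$k$ routing can be sketched as follows; the expert count, dimensions, and the renormalization of gates over the selected experts are illustrative choices (implementations differ on whether and where to renormalize):

```python
import numpy as np

def moe_forward(x, w_router, experts, k=2):
    """Route token x to its top-k experts; combine with renormalized softmax gates."""
    logits = w_router @ x
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                   # softmax over all experts
    top = np.argsort(probs)[-k:]           # indices of the k largest gates
    gates = probs[top] / probs[top].sum()  # renormalize over selected experts
    # only k expert FFNs run, regardless of the total expert count
    return sum(g * experts[i](x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
# 8 toy "experts", each a single linear map standing in for an FFN
experts = [lambda x, W=rng.standard_normal((4, 4)): W @ x for _ in range(8)]
y = moe_forward(rng.standard_normal(4), rng.standard_normal((8, 4)), experts)
```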
$$y = \sum_{i=1}^{k} g_i \cdot E_i(x), \quad g_i = \text{TopK}\bigl(\text{softmax}(W_r \cdot x)\bigr)_i$$

Long-Context Support
- RoPE extrapolation: YaRN, NTK-aware scaling to handle contexts beyond the training length
- Ring Attention: Distributes sequences across GPUs to process contexts with millions of tokens
Multimodal LLMs
These models process not only text but also images, audio, and video in an integrated manner. Images are converted to patch embeddings using ViT and concatenated with text tokens in the same sequence. GPT-4V, Claude 3, and Gemini are representative examples.
Tool Use and Agents
LLMs can call external tools (search engines, calculators, APIs) to supplement their knowledge limitations and computational capabilities. RAG is also a form of this paradigm.
Summary
- Tokenization: Subword segmentation via BPE / SentencePiece
- Architecture: RMSNorm + RoPE + GQA + SwiGLU is the modern standard
- Scaling Laws: Loss follows a power law with parameters and data size
- Pre-training: Next-token prediction over trillions of tokens
- RLHF / DPO: Optimizing the policy to align with human preferences
- Inference Optimization: KV cache, quantization, speculative decoding
- Emerging Technologies: MoE, long context, multimodal, agents
References
- Vaswani et al., "Attention Is All You Need" (2017)
- Kaplan et al., "Scaling Laws for Neural Language Models" (2020)
- Hoffmann et al., "Training Compute-Optimal Large Language Models" (Chinchilla, 2022)
- Ouyang et al., "Training language models to follow instructions with human feedback" (InstructGPT, 2022)
- Rafailov et al., "Direct Preference Optimization" (DPO, 2023)
- Touvron et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models" (2023)
- Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (RoPE, 2021)
- Large language model - Wikipedia
Frequently Asked Questions
Q: Can I run an LLM on a local machine?
A: With quantization techniques (GGUF, AWQ, etc.), models with 7B to 13B parameters can run on a GPU with approximately 16 GB of VRAM. Using llama.cpp, inference on CPU alone is also possible, though speed is significantly reduced. Models with 70B or more parameters require 48 GB or more of VRAM (A6000, H100, etc.).
Q: Should I use RLHF or DPO?
A: DPO eliminates the need for reward model training and the PPO reinforcement learning loop, making it easier to implement and more stable during training. Consequently, DPO is widely adopted in practice. However, RLHF is more suitable when explicit control over the reward model is desired or when online learning is required.
Q: How much GPU is needed for LLM inference?
A: In FP16, the minimum VRAM capacity is approximately the number of parameters times 2 bytes (about 140 GB for a 70B model). With INT4 quantization, this can be reduced to roughly one quarter, allowing a 70B model to run on approximately 40 GB of VRAM. Additional memory for the KV cache (which grows with longer contexts) must also be considered.