Large Language Models (LLMs)
From GPT to Claude — Advanced Level (For Researchers and Practitioners)
A Large Language Model (LLM) is a language model that scales an autoregressive Transformer to billions or even trillions of parameters, pre-trained on massive text corpora. After pre-training, the model is fine-tuned using alignment techniques such as RLHF to generate responses that align with human intent. This article assumes familiarity with attention mechanisms and the Transformer architecture (readers unfamiliar with these topics are encouraged to review them first), and provides a systematic overview of the technologies specific to LLMs.
1. Tokenization
Rather than processing text directly, LLMs first convert it into a sequence of substrings called tokens. Since vocabulary size and the handling of unknown words directly affect model performance, subword segmentation is the standard approach.
BPE (Byte-Pair Encoding) Algorithm
1. Initialize the vocabulary at the character (or byte) level
2. Find the most frequent adjacent token pair in the corpus
3. Merge that pair into a new token and add it to the vocabulary
4. Repeat steps 2–3 until the target vocabulary size is reached
For example, if "l o w" and "l o w e r" appear frequently, the algorithm first merges "lo" → "low", yielding efficient substring representations.
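The merge loop can be sketched in a few lines of Python. This is a toy word-level version for illustration; real tokenizers operate at the byte level and use heavily optimized implementations:

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merges from a list of words (toy sketch, not byte-level)."""
    # represent each word as a tuple of symbols, with frequency counts
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # step 2: count adjacent symbol pairs across the corpus
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # step 3: merge the best pair everywhere it occurs
        merged = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

# "low" x5 and "lower" x2: first merge is l+o -> lo, then lo+w -> low
merges = bpe_train(["low"] * 5 + ["lower"] * 2, num_merges=3)
```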
2. LLM Architecture
Modern LLMs are based on the Transformer decoder, incorporating several architectural improvements.
RMSNorm
A lightweight normalization method that removes the mean subtraction from LayerNorm and normalizes using only the root mean square (RMS).
$$\text{RMSNorm}(\boldsymbol{x}) = \frac{\boldsymbol{x}}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \odot \boldsymbol{\gamma}$$

RoPE (Rotary Position Embedding)
Relative position information is injected by multiplying the Query and Key vectors with rotation matrices. The $d$-dimensional vector is split into 2-dimensional blocks, and each block is rotated according to position $m$:
$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}, \quad \theta_i = 10000^{-2i/d}$$

Due to this rotation, the inner product of the Query at position $m$ and the Key at position $n$ depends only on the position difference $m - n$. Unlike absolute position encoding, relative position information is naturally injected without adding any learnable parameters.
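The rotation and its relative-position property can be checked directly with a minimal numpy sketch (pairing even/odd components as the 2-D blocks):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to a vector x at position `pos`."""
    d = x.shape[-1]
    # one rotation angle per 2-D block: theta_i = base^(-2i/d)
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    angle = pos * theta
    x1, x2 = x[0::2], x[1::2]  # the two components of each 2-D block
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(angle) - x2 * np.sin(angle)
    out[1::2] = x1 * np.sin(angle) + x2 * np.cos(angle)
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
# the q-k inner product depends only on the offset m - n (here 2 in both cases)
a = rope(q, 5) @ rope(k, 3)
b = rope(q, 12) @ rope(k, 10)
```

The final assertion of the relative-position property is exactly why RoPE extrapolation tricks (discussed later) rescale the angles rather than the embeddings.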
Grouped-Query Attention (GQA)
In MHA (Multi-Head Attention), each head has independent Q, K, and V projections. In GQA, multiple Query heads share Key-Value heads, thereby reducing KV cache memory usage while maintaining performance close to MHA.
MHA vs MQA vs GQA
- MHA (Multi-Head): Independent K, V per head → high memory usage
- MQA (Multi-Query): All heads share K, V → minimum memory, slightly lower accuracy
- GQA (Grouped-Query): K, V shared within G groups → intermediate between MHA and MQA
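The KV-head sharing in GQA amounts to broadcasting each cached K/V head across its group of query heads. A minimal numpy sketch (causal mask omitted for brevity; head counts and dimensions are illustrative):

```python
import numpy as np

def gqa_attention(q, k, v):
    """Attention where groups of query heads share one KV head."""
    n_q_heads, seq, d_k = q.shape
    n_kv_heads = k.shape[0]
    group = n_q_heads // n_kv_heads      # query heads per shared KV head
    k = np.repeat(k, group, axis=0)      # broadcast each KV head to its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # softmax over key positions
    return w @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))  # 8 query heads
k = rng.standard_normal((2, 4, 16))  # only 2 KV heads need to be cached
v = rng.standard_normal((2, 4, 16))
out = gqa_attention(q, k, v)
```

Only the 2 KV heads are stored in the cache, so the cache shrinks by 4x relative to MHA while the output keeps the full 8-head shape.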
SwiGLU
An activation function that replaces ReLU with Swish (SiLU) and adds a gating mechanism.
$$\text{SwiGLU}(\boldsymbol{x}) = \bigl((\boldsymbol{x} W_1) \odot \text{Swish}(\boldsymbol{x} W_\text{gate})\bigr) W_2$$

where $\text{Swish}(x) = x \cdot \sigma(x)$, and $\sigma$ denotes the sigmoid function.
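A minimal numpy sketch of the SwiGLU feed-forward block (the dimensions are illustrative; in practice $d_{\text{ff}}$ is typically about $\tfrac{8}{3} d_{\text{model}}$ to match the parameter count of a ReLU FFN):

```python
import numpy as np

def swiglu(x, w1, w_gate, w2):
    """SwiGLU block: elementwise product of a linear path and a Swish gate."""
    swish = lambda z: z / (1.0 + np.exp(-z))  # Swish/SiLU: z * sigmoid(z)
    return ((x @ w1) * swish(x @ w_gate)) @ w2

rng = np.random.default_rng(0)
d_model, d_ff = 16, 43  # 43 ~ (8/3) * 16
x = rng.standard_normal((2, d_model))
y = swiglu(x,
           rng.standard_normal((d_model, d_ff)),
           rng.standard_normal((d_model, d_ff)),
           rng.standard_normal((d_ff, d_model)))
```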
3. Scaling Laws
One of the most important discoveries in LLM research is the scaling laws that quantitatively describe the relationship between model performance and scale.
Kaplan's Scaling Laws
The test loss $L$ follows a power law with respect to the number of parameters $N$, data size $D$, and compute $C$:
$$L(N) \propto N^{-0.076}, \quad L(D) \propto D^{-0.095}, \quad L(C) \propto C^{-0.050}$$

This means that each 10-fold increase in model size reduces the loss by the same constant factor, and another 10-fold increase yields the same proportional improvement again—a predictable relationship. However, these exponents hold only when the other two variables are sufficiently large (i.e., not a bottleneck); in practice, insufficient data or compute causes the loss to plateau short of the power-law prediction.
Chinchilla Scaling Laws
Hoffmann et al. (2022) derived the optimal relationship between parameter count and data size given a fixed compute budget.
Chinchilla Optimality
For a given compute budget $C$, the optimal parameter count $N^*$ and training token count $D^*$ should both scale proportionally with $C$:
$$N^* \propto C^{0.50}, \quad D^* \propto C^{0.50}$$

In other words, if the parameter count is doubled, the training data should also be doubled. This finding suggested that many models of the time (such as GPT-3) were under-trained.
- GPT-3 (175B parameters): Trained on 300B tokens → Chinchilla-optimal would be approximately 3.5T tokens
- Chinchilla (70B parameters): Trained on 1.4T tokens → Smaller than GPT-3 but achieved comparable or better performance
- LLaMA 2 (70B parameters): Trained on 2T tokens → Exceeds the Chinchilla optimum
- LLaMA 3 (8B parameters): Trained on 15T tokens → Approximately 100 times the Chinchilla optimum
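The figures above follow from the commonly used rule of thumb that the Chinchilla optimum is roughly 20 training tokens per parameter (an approximation of the Hoffmann et al. result, not an exact constant):

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Rule-of-thumb Chinchilla optimum: ~20 training tokens per parameter."""
    return n_params * tokens_per_param

# GPT-3 (175B): optimal would be ~3.5T tokens, vs. the 300B actually used
gpt3_optimal = chinchilla_optimal_tokens(175e9)

# LLaMA 3 (8B) trained on 15T tokens: roughly 90-100x the optimum
overtrain_factor = 15e12 / chinchilla_optimal_tokens(8e9)
```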
Over-training Strategy
The Chinchilla law optimizes for minimizing training cost, not inference cost. Training a smaller model far beyond the Chinchilla optimum, as with LLaMA 3, increases training cost but dramatically reduces inference cost compared to a larger model of equivalent performance. In production deployments where the number of inference calls vastly exceeds training runs, this "over-training" strategy is economically rational and has become the prevailing approach.
4. Pre-training
LLM pre-training is next-token prediction (Causal Language Modeling) on a large-scale corpus.
Pre-training Objective
For a token sequence $x_1, x_2, \ldots, x_T$, the autoregressive log-likelihood is maximized:
$$\mathcal{L}_{\text{pretrain}} = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, \ldots, x_{t-1})$$

Training Data
Typically, a corpus of trillions of tokens is used, comprising web crawls (CommonCrawl, etc.), books, academic papers, and code. Data quality (deduplication, filtering) has a significant impact on performance.
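The next-token objective above can be sketched numerically; this toy example computes the mean negative log-likelihood from raw logits (random values stand in for model outputs):

```python
import numpy as np

def causal_lm_loss(logits, targets):
    """Mean negative log-likelihood of each next token under the model's logits."""
    # logits: (T, vocab) predictions for positions 1..T; targets: (T,) token ids
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
loss = causal_lm_loss(rng.standard_normal((5, 100)), np.array([3, 1, 4, 1, 5]))
```

A useful sanity check: with uniform logits the loss equals $\log |V|$, the entropy of guessing uniformly over the vocabulary.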
Training Stability
- Learning rate scheduler: Warm-up + cosine decay is the standard approach
- Gradient clipping: Constrains gradient norms to prevent training collapse
- Mixed-precision training: Computes in BF16/FP16 while maintaining master weights in FP32
- Distributed training: Combines data parallelism, tensor parallelism, and pipeline parallelism
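The standard warm-up + cosine decay schedule from the list above can be sketched as follows (the hyperparameter values in the test are illustrative):

```python
import math

def lr_schedule(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warm-up to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps  # linear ramp
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```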
5. RLHF (Reinforcement Learning from Human Feedback)
Pre-training alone can result in an LLM that generates harmful content or fails to follow instructions. RLHF (Reinforcement Learning from Human Feedback) is a technique that adjusts the model to align with human preferences, and it was a key technology behind the success of ChatGPT.
Three Steps of RLHF
1. Supervised Fine-Tuning (SFT): The pre-trained model is fine-tuned on high-quality instruction-response pairs written by humans.
2. Reward Model (RM) Training: Humans rank multiple responses to the same prompt, and a model is trained to output a scalar reward that reproduces this ranking. The loss function is based on the Bradley-Terry model: $$\mathcal{L}_{\text{RM}} = -\log \sigma\bigl(r_\theta(x, y_w) - r_\theta(x, y_l)\bigr)$$ where $y_w$ is the preferred response and $y_l$ is the less preferred one.
3. Reinforcement Learning with PPO (Proximal Policy Optimization): Reinforcement learning is performed using the reward model's score as the reward and the SFT model as the policy: $$\max_\pi \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot|x)}\bigl[r_\theta(x, y)\bigr] - \beta \cdot D_{\text{KL}}\bigl(\pi \| \pi_{\text{SFT}}\bigr)$$ The KL penalty prevents excessive divergence from the SFT model.
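The Bradley-Terry reward-model loss reduces to a single stable expression; `r_chosen`/`r_rejected` here stand for reward-model scores of the preferred and non-preferred responses:

```python
import numpy as np

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log sigma(r(x, y_w) - r(x, y_l))."""
    margin = np.asarray(r_chosen, dtype=float) - np.asarray(r_rejected, dtype=float)
    # -log sigmoid(z) == log(1 + exp(-z)), computed stably via logaddexp
    return np.logaddexp(0.0, -margin).mean()
```

With a zero margin the loss is $\log 2$ (the model is indifferent), and it decreases monotonically as the chosen response's reward pulls ahead.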
DPO (Direct Preference Optimization)
Proposed by Rafailov et al. (2023), this method bypasses explicit reward model training and directly optimizes the policy from human ranking data. The key idea is to directly learn the probability ratio between preferred and non-preferred responses, without using a reinforcement learning loop like PPO.
$$\mathcal{L}_{\text{DPO}} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)$$

Since it eliminates the need for an unstable reinforcement learning loop like PPO and can be implemented within the same framework as supervised learning, DPO is widely adopted in practice.
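Given the summed log-probabilities of each response under the policy and the frozen reference model, the DPO loss is a few lines (the argument names are illustrative; in a real trainer these come from per-token log-probs summed over the response):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from summed log-probs of chosen (w) and rejected (l) responses."""
    # implicit reward of each response: beta * log(pi_theta / pi_ref)
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return np.logaddexp(0.0, -margin)  # -log sigmoid(margin), computed stably
```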
Constitutional AI (RLAIF)
Instead of human labeling, the AI itself evaluates and improves responses based on a "constitution" (a list of principles). This approach is employed in Anthropic's Claude.
Other Alignment Methods
As extensions of DPO, increasingly simpler and more efficient alignment methods have been proposed, such as ORPO (Odds Ratio Preference Optimization), which integrates SFT and preference optimization into a single step, and SimPO, which eliminates the need for a reference model.
6. Inference Optimization
Inference optimization is essential for running LLMs with billions of parameters at practical speeds.
KV Cache
In autoregressive generation, each new token requires recomputing attention over all previous tokens. The KV cache stores the Key and Value vectors of past tokens, avoiding redundant computation.
KV Cache Memory
For $L$ layers, $n_{\text{kv}}$ KV heads, head dimension $d_k$, sequence length $S$, and batch size $B$, the FP16 memory for the KV cache is:
$$\text{Memory}_{\text{KV}} = 2 \times 2 \times B \times L \times S \times n_{\text{kv}} \times d_k \;\text{bytes}$$

In MHA, $n_{\text{kv}} = n_{\text{heads}}$, so $n_{\text{kv}} \times d_k = d_{\text{model}}$. For example, with MHA configuration where $L=80,\; d_{\text{model}}=8192,\; B=1,\; S=4096$ → approximately 10 GB. In contrast, LLaMA 2-70B uses GQA with $n_{\text{kv}} = 8$ (64 Query heads), reducing the KV cache to $8/64 = 1/8$, approximately 1.3 GB.
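The formula is easy to turn into a quick estimator, reproducing both of the numbers above (8192 = 64 heads x head dimension 128):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per_el=2):
    """KV cache size: 2 (K and V) x bytes/element x B x L x S x n_kv x d_k."""
    return 2 * bytes_per_el * batch * n_layers * seq_len * n_kv_heads * d_head

# MHA: L=80, 64 heads x d_k=128 (d_model=8192), S=4096 -> ~10 GiB
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, d_head=128, seq_len=4096)

# GQA with 8 KV heads (LLaMA 2-70B style) -> exactly 1/8 of that, ~1.3 GB
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, d_head=128, seq_len=4096)
```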
Quantization
Weights and activations are represented in low-bit formats (INT8, INT4, NF4, etc.) to reduce memory usage and computation.
- GPTQ: Weight-only Post-Training Quantization
- AWQ: Activation-aware weight quantization
- GGUF: Quantization format for llama.cpp (optimized for CPU inference)
- bitsandbytes: NF4 quantization used in QLoRA
Continuous Batching and PagedAttention
With traditional static batching, all requests in a batch must wait for the longest sequence, resulting in low GPU utilization. Continuous Batching immediately removes completed requests from the batch and inserts new ones, significantly improving throughput.
PagedAttention (used in vLLM) manages the KV cache in page units, inspired by OS virtual memory. This eliminates the need to allocate contiguous memory for each sequence, resolving memory fragmentation and enabling larger batch sizes.
Speculative Decoding
A small "draft model" generates multiple tokens ahead, which are then verified in bulk by a larger "target model." The output distribution of the target model remains unchanged while generation speed is improved by a factor of 2–3.
7. Major Model Lineage
| Model | Year | Parameters | Key Features |
|---|---|---|---|
| GPT-3 | 2020 | 175B | Demonstrated few-shot learning |
| PaLM | 2022 | 540B | Pathways system, Chain-of-Thought |
| ChatGPT | 2022 | — | Specialized for dialogue via RLHF |
| LLaMA 2 | 2023 | 7B–70B | Open weights, GQA |
| GPT-4 | 2023 | Est. 1T+ | Multimodal (text + image) |
| GPT-4o | 2024 | — | Unified multimodal: text, image, and audio |
| Claude 3.5 | 2024 | — | Constitutional AI, 200K context length |
| LLaMA 3 | 2024 | 8B–405B | 15T+ tokens, over-training |
| Gemma 2 | 2024 | 2B–27B | Small and efficient, knowledge distillation |
| Qwen 2.5 | 2024 | 0.5B–72B | Multilingual, MoE variant available |
| DeepSeek-V3 | 2024 | 671B (MoE) | Mixture-of-Experts, high efficiency |
8. Emerging Topics
Mixture of Experts (MoE)
The FFN layer is divided into multiple "experts," and a router activates only the top $k$ experts for each token. While the total parameter count is large, the computational cost during inference remains small.
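This top-$k$ routing can be sketched as follows; the expert count, dimensions, and the renormalization of gates over the selected experts are illustrative choices (implementations differ on whether and where to renormalize):

```python
import numpy as np

def moe_forward(x, w_router, experts, k=2):
    """Route token x to its top-k experts; combine with renormalized softmax gates."""
    logits = w_router @ x
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                   # softmax over all experts
    top = np.argsort(probs)[-k:]           # indices of the k largest gates
    gates = probs[top] / probs[top].sum()  # renormalize over selected experts
    # only k expert FFNs run, regardless of the total expert count
    return sum(g * experts[i](x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
# 8 toy "experts", each a single linear map standing in for an FFN
experts = [lambda x, W=rng.standard_normal((4, 4)): W @ x for _ in range(8)]
y = moe_forward(rng.standard_normal(4), rng.standard_normal((8, 4)), experts)
```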
$$y = \sum_{i=1}^{k} g_i \cdot E_i(x), \quad g_i = \text{TopK}\bigl(\text{softmax}(W_r \cdot x)\bigr)_i$$

Long-Context Support
- RoPE extrapolation: YaRN, NTK-aware scaling to handle contexts beyond the training length
- Ring Attention: Distributes sequences across GPUs to process contexts with millions of tokens
Multimodal LLMs
These models process not only text but also images, audio, and video in an integrated manner. Images are converted to patch embeddings using ViT and concatenated with text tokens in the same sequence. GPT-4V, Claude 3, and Gemini are representative examples.
Tool Use and Agents
LLMs can call external tools (search engines, calculators, APIs) to supplement their knowledge limitations and computational capabilities. RAG is also a form of this paradigm.
Summary
- Tokenization: Subword segmentation via BPE / SentencePiece
- Architecture: RMSNorm + RoPE + GQA + SwiGLU is the modern standard
- Scaling Laws: Loss follows a power law with parameters and data size
- Pre-training: Next-token prediction over trillions of tokens
- RLHF / DPO: Optimizing the policy to align with human preferences
- Inference Optimization: KV cache, quantization, speculative decoding
- Emerging Technologies: MoE, long context, multimodal, agents
References
- Vaswani et al., "Attention Is All You Need" (2017)
- Kaplan et al., "Scaling Laws for Neural Language Models" (2020)
- Hoffmann et al., "Training Compute-Optimal Large Language Models" (Chinchilla, 2022)
- Ouyang et al., "Training language models to follow instructions with human feedback" (InstructGPT, 2022)
- Rafailov et al., "Direct Preference Optimization" (DPO, 2023)
- Touvron et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models" (2023)
- Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (RoPE, 2021)
- Large language model - Wikipedia
Frequently Asked Questions
Q: Can I run an LLM on a local machine?
A: With quantization techniques (GGUF, AWQ, etc.), models with 7B to 13B parameters can run on a GPU with approximately 16 GB of VRAM. Using llama.cpp, inference on CPU alone is also possible, though speed is significantly reduced. Models with 70B or more parameters require 48 GB or more of VRAM (A6000, H100, etc.).
Q: Should I use RLHF or DPO?
A: DPO eliminates the need for reward model training and the PPO reinforcement learning loop, making it easier to implement and more stable during training. Consequently, DPO is widely adopted in practice. However, RLHF is more suitable when explicit control over the reward model is desired or when online learning is required.
Q: How much GPU is needed for LLM inference?
A: In FP16, the minimum VRAM capacity is approximately the number of parameters times 2 bytes (about 140 GB for a 70B model). With INT4 quantization, this can be reduced to roughly one quarter, allowing a 70B model to run on approximately 40 GB of VRAM. Additional memory for the KV cache (which grows with longer contexts) must also be considered.