Machine Learning Advanced

Modern Deep Learning and Theory — Advanced (For Researchers and Practitioners)

About This Chapter

In the advanced level, we study the frontiers of modern deep learning and their theoretical foundations. Topics include the Transformer architecture, generative models (VAE, GAN, diffusion models), and statistical learning theory. The goal is to develop the ability to read cutting-edge research papers and to design and implement novel methods.

Prerequisites

  • Intermediate-level content (NN, CNN, RNN)
  • Probability theory and statistics
  • Fundamentals of optimization theory
  • Basics of information theory (KL divergence)

Table of Contents

1. Attention Mechanisms

Foundations of the Transformer.

  • Self-Attention
  • Multi-Head Attention
  • Positional Encoding

2. Transformer

The foundation of modern NLP.

  • Encoder-Decoder Architecture
  • BERT, GPT
  • Large Language Models

3. Vision Transformer

Applying Transformers to images.

  • Patch Embedding
  • ViT Architecture
  • Comparison with CNN

4. Variational Autoencoder (VAE)

A probabilistic generative model.

  • Latent Variable Models
  • Evidence Lower Bound (ELBO)
  • Reparameterization Trick

5. Generative Adversarial Network (GAN)

Generation through a two-player game.

  • Generator and Discriminator
  • Training Instability
  • StyleGAN, BigGAN

6. Diffusion Models

State-of-the-art generative models.

  • Denoising Score Matching
  • DDPM
  • Conditional Generation

7. Self-Supervised Learning

Leveraging unlabeled data.

  • Contrastive Learning (SimCLR, MoCo)
  • Masked Language Models
  • Pre-training and Fine-tuning

8. Statistical Learning Theory

Theory of generalization.

  • PAC Learning
  • VC Dimension
  • Rademacher Complexity

9. Theory of Deep Learning

Why does deep learning work?

  • Over-parameterization and Implicit Regularization
  • Neural Tangent Kernel
  • Loss Landscape

10. Deep Reinforcement Learning

From DQN to AlphaGo.

  • Deep Q-Network
  • Policy Gradient
  • Actor-Critic

11. Latest Topics

The frontiers of research.

  • Multimodal Learning
  • Prompt Learning
  • AI Safety and Alignment

12. Large Language Models (LLM)

From GPT to Claude.

  • Tokenization (BPE, SentencePiece)
  • Scaling Laws (Kaplan, Chinchilla)
  • Alignment via RLHF / DPO
  • Inference Optimization (KV Cache, Quantization)

13. Fine-Tuning and LoRA

Parameter-efficient adaptation.

  • LoRA / QLoRA (Low-Rank Adaptation)
  • Adapters, Prefix Tuning
  • Comparison and Practice of PEFT Methods

14. RAG (Retrieval-Augmented Generation)

Enhancing LLMs with external knowledge.

  • Vector Search and Embedding Models
  • Chunking Strategies (Fixed-length / Semantic / Parent-Child)
  • Re-ranking and Hybrid Search
  • Self-RAG, GraphRAG

15. Information Theory and Machine Learning

Understanding learning from an information-theoretic perspective.

  • KL Divergence, Mutual Information
  • ELBO and Variational Inference
  • Information Bottleneck Theory

16. Graph Neural Networks

Learning on graph-structured data.

  • GCN, GAT, GraphSAGE
  • Message Passing
  • Molecular Design, Recommender Systems

17. Causal Inference

From correlation to causation.

  • Counterfactuals and do-calculus
  • Causal Graphs (DAG)
  • Treatment Effect Estimation, Causal Forests

18. Mathematical Principles of Stable Diffusion

SDE, score functions, and latent diffusion.

  • Score Functions and Score Matching
  • Stochastic Differential Equations (SDE/ODE)
  • VAE Latent Space and Latent Diffusion
  • Derivation of Classifier-Free Guidance

Key Concepts and Methods

Self-Attention

Given Query $\boldsymbol{Q}$, Key $\boldsymbol{K}$, Value $\boldsymbol{V}$: $$\text{Attention}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) = \text{softmax}\left(\frac{\boldsymbol{Q}\boldsymbol{K}^\top}{\sqrt{d_k}}\right)\boldsymbol{V}$$ Directly models dependencies between arbitrary positions within a sequence.
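The formula above can be sketched directly in NumPy. This is a minimal, unbatched illustration of scaled dot-product attention, not an optimized implementation; the function names and toy shapes are chosen here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys) similarity logits
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))    # 4 query positions, d_k = 8
K = rng.normal(size=(6, 8))    # 6 key positions
V = rng.normal(size=(6, 16))   # value dimension 16
out, w = attention(Q, K, V)    # out has shape (4, 16)
```

The $\sqrt{d_k}$ scaling keeps the variance of the logits roughly constant as $d_k$ grows, which prevents the softmax from saturating.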

VAE Objective Function (ELBO)

$$\mathcal{L} = \mathbb{E}_{q(\boldsymbol{z}|\boldsymbol{x})}[\log p(\boldsymbol{x}|\boldsymbol{z})] - D_{\text{KL}}(q(\boldsymbol{z}|\boldsymbol{x}) \| p(\boldsymbol{z}))$$ Maximizing the ELBO trades off reconstruction quality (the expected log-likelihood term) against KL regularization of the approximate posterior toward the prior.
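For a diagonal Gaussian posterior and a standard normal prior, both ELBO terms have simple forms. The sketch below, a toy NumPy illustration with hypothetical helper names, shows the closed-form KL term and the reparameterization trick; a real VAE would compute these inside an autodiff framework.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """Closed-form D_KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var)

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I).
    Sampling noise is external, so z stays differentiable in (mu, log_var)."""
    eps = rng.normal(size=np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps

def elbo(x, x_recon, mu, log_var):
    """ELBO up to additive constants, assuming a unit-variance Gaussian decoder:
    log p(x|z) reduces to a negative squared error."""
    recon = -0.5 * np.sum((x - x_recon) ** 2)
    return recon - gaussian_kl(mu, log_var)
```

When the posterior equals the prior (mu = 0, log_var = 0), the KL term vanishes, which is a useful sanity check.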

GAN Objective Function

$$\min_G \max_D \mathbb{E}_{\boldsymbol{x} \sim p_{\text{data}}}[\log D(\boldsymbol{x})] + \mathbb{E}_{\boldsymbol{z} \sim p(\boldsymbol{z})}[\log(1 - D(G(\boldsymbol{z})))]$$ A minimax game between the Generator and the Discriminator.

Diffusion Models

A fixed forward process gradually adds Gaussian noise to the data; the model learns the reverse process that removes it, parameterized by a neural network as $p_\theta(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)$.
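The forward (noising) process admits a closed form: $\boldsymbol{x}_t = \sqrt{\bar\alpha_t}\,\boldsymbol{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}$ with $\bar\alpha_t = \prod_{s\le t}(1-\beta_s)$. A minimal sketch, assuming the linear $\beta$ schedule from the DDPM paper and toy 1-D data:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule (as in DDPM)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # cumulative signal-retention factor

def q_sample(x0, t, rng):
    """Sample x_t from q(x_t | x_0) in one step:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.normal(size=np.shape(x0))
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps   # the network is trained to predict eps from (xt, t)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8,))
xt, eps = q_sample(x0, T - 1, rng)   # near t = T, x_t is almost pure noise
```

Because $\bar\alpha_t$ decays toward zero, late-time samples are essentially standard Gaussian noise, which is what makes generation by reversing from pure noise possible.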

PAC Learning

PAC learning gives a sufficient sample size for the generalization error to be at most $\varepsilon$ with probability at least $1-\delta$: $n = O\left(\frac{1}{\varepsilon}\left(d \log\frac{1}{\varepsilon} + \log\frac{1}{\delta}\right)\right)$, where $d$ is the VC dimension of the hypothesis class.
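The bound above hides an unspecified constant, so absolute numbers from it are not meaningful, but its growth behavior is. A small sketch (with a hypothetical constant `c = 1`) that makes the scaling tangible:

```python
import math

def pac_sample_size(eps, delta, d, c=1.0):
    """Sufficient sample size up to an unknown constant c:
    n = c/eps * ( d * log(1/eps) + log(1/delta) ).
    Only relative comparisons across (eps, delta, d) are informative."""
    return math.ceil(c / eps * (d * math.log(1.0 / eps) + math.log(1.0 / delta)))

# Halving the tolerated error, or growing the VC dimension,
# both increase the required sample size.
n_loose = pac_sample_size(eps=0.1, delta=0.05, d=10)
n_tight = pac_sample_size(eps=0.01, delta=0.05, d=10)
n_big_d = pac_sample_size(eps=0.1, delta=0.05, d=100)
```

Note the asymmetry: the dependence on $\delta$ is only logarithmic, so demanding higher confidence is far cheaper than demanding lower error.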

Applications at This Level

Large Language Models

Language models such as GPT and Claude, adapted via prompt engineering, fine-tuning, and RLHF.

Image Generation

Stable Diffusion and DALL-E generate images from text, a practical application of diffusion models.

Protein Structure Prediction

AlphaFold fuses Transformer architectures with structure prediction, applying deep learning to scientific research.

Game AI

AlphaGo, AlphaZero. Mastering games through deep reinforcement learning.

Connection to the Research Frontier

The advanced-level content is directly linked to cutting-edge research presented at top conferences such as NeurIPS, ICML, and ICLR. The goal at this level is to develop the ability to read papers, produce reproducible implementations, and validate new ideas.

Study Tips

  • Read papers: Develop a habit of following the latest papers on arXiv
  • Reproduce implementations: Implement models from papers yourself
  • Theory and practice: Be able to explain why methods work
  • Critical thinking: Understand the limitations and assumptions of each method

References

  • Goodfellow, Bengio & Courville, Deep Learning
  • Vaswani et al., "Attention Is All You Need" (2017)
  • Kingma & Welling, "Auto-Encoding Variational Bayes" (2014)
  • Ho et al., "Denoising Diffusion Probabilistic Models" (2020)
  • Shalev-Shwartz & Ben-David, Understanding Machine Learning
