Machine Learning Advanced
Modern Deep Learning and Theory — Advanced (For Researchers and Practitioners)
About This Chapter
At the advanced level, we study the frontiers of modern deep learning and its theoretical foundations. Topics include the Transformer architecture, generative models (VAE, GAN, diffusion models), and statistical learning theory. The goal is to develop the ability to read cutting-edge research papers and to design and implement novel methods.
Prerequisites
- Intermediate-level content (NN, CNN, RNN)
- Probability theory and statistics
- Fundamentals of optimization theory
- Basics of information theory (KL divergence)
Table of Contents
1. Attention Mechanisms
Foundations of the Transformer.
- Self-Attention
- Multi-Head Attention
- Positional Encoding
2. Transformer
The foundation of modern NLP.
- Encoder-Decoder Architecture
- BERT, GPT
- Large Language Models
3. Vision Transformer
Applying Transformers to images.
- Patch Embedding
- ViT Architecture
- Comparison with CNN
4. Variational Autoencoder (VAE)
Probabilistic generative model.
- Latent Variable Models
- Evidence Lower Bound (ELBO)
- Reparameterization Trick
5. Generative Adversarial Network (GAN)
Generation through a two-player game.
- Generator and Discriminator
- Training Instability
- StyleGAN, BigGAN
6. Diffusion Models
State-of-the-art generative models.
- Denoising Score Matching
- DDPM
- Conditional Generation
7. Self-Supervised Learning
Leveraging unlabeled data.
- Contrastive Learning (SimCLR, MoCo)
- Masked Language Models
- Pre-training and Fine-tuning
8. Statistical Learning Theory
Theory of generalization.
- PAC Learning
- VC Dimension
- Rademacher Complexity
9. Theory of Deep Learning
Why does deep learning work?
- Over-parameterization and Implicit Regularization
- Neural Tangent Kernel
- Loss Landscape
10. Deep Reinforcement Learning
From DQN to AlphaGo.
- Deep Q-Network
- Policy Gradient
- Actor-Critic
11. Latest Topics
The frontiers of research.
- Multimodal Learning
- Prompt Learning
- AI Safety and Alignment
12. Large Language Models (LLM)
From GPT to Claude.
- Tokenization (BPE, SentencePiece)
- Scaling Laws (Kaplan, Chinchilla)
- Alignment via RLHF / DPO
- Inference Optimization (KV Cache, Quantization)
13. Fine-Tuning and LoRA
Parameter-efficient adaptation.
- LoRA / QLoRA (Low-Rank Adaptation)
- Adapters, Prefix Tuning
- Comparison and Practice of PEFT Methods
14. RAG (Retrieval-Augmented Generation)
Enhancing LLMs with external knowledge.
- Vector Search and Embedding Models
- Chunking Strategies (Fixed-length / Semantic / Parent-Child)
- Re-ranking and Hybrid Search
- Self-RAG, GraphRAG
15. Information Theory and Machine Learning
Understanding learning from an information-theoretic perspective.
- KL Divergence, Mutual Information
- ELBO and Variational Inference
- Information Bottleneck Theory
16. Graph Neural Networks
Learning on graph-structured data.
- GCN, GAT, GraphSAGE
- Message Passing
- Molecular Design, Recommender Systems
17. Causal Inference
From correlation to causation.
- Counterfactuals and do-calculus
- Causal Graphs (DAG)
- Treatment Effect Estimation, Causal Forests
18. Mathematical Principles of Stable Diffusion
SDE, score functions, and latent diffusion.
- Score Functions and Score Matching
- Stochastic Differential Equations (SDE/ODE)
- VAE Latent Space and Latent Diffusion
- Derivation of Classifier-Free Guidance
Key Concepts and Methods
Self-Attention
Given Query $\boldsymbol{Q}$, Key $\boldsymbol{K}$, Value $\boldsymbol{V}$: $$\text{Attention}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) = \text{softmax}\left(\frac{\boldsymbol{Q}\boldsymbol{K}^\top}{\sqrt{d_k}}\right)\boldsymbol{V}$$ Directly models dependencies between arbitrary positions within a sequence.
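The formula above can be sketched directly in NumPy; this is a minimal illustrative implementation (function names and shapes are my own choices, not from any particular library):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) pairwise similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))    # 4 query positions, d_k = 8
K = rng.normal(size=(6, 8))    # 6 key positions
V = rng.normal(size=(6, 16))   # values with d_v = 16
out = attention(Q, K, V)
print(out.shape)  # (4, 16): one output vector per query position
```

The $\sqrt{d_k}$ scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into a near-one-hot regime with vanishing gradients.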
VAE Objective Function (ELBO)
$$\mathcal{L} = \mathbb{E}_{q(\boldsymbol{z}|\boldsymbol{x})}[\log p(\boldsymbol{x}|\boldsymbol{z})] - D_{\text{KL}}(q(\boldsymbol{z}|\boldsymbol{x}) \| p(\boldsymbol{z}))$$ A trade-off between reconstruction error and regularization toward the prior distribution.
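A one-sample Monte Carlo estimate of this objective can be sketched for a Gaussian encoder $q(\boldsymbol{z}|\boldsymbol{x}) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$ and standard normal prior, using the reparameterization trick $\boldsymbol{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\varepsilon}$. The decoder likelihood below is a toy stand-in, not a real model:

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(mu, log_var, log_px_given_z):
    """One-sample ELBO: E_q[log p(x|z)] - KL(q(z|x) || N(0, I))."""
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    z = mu + sigma * eps  # reparameterized sample: gradients flow through mu, sigma
    # Closed-form KL between N(mu, sigma^2) and N(0, 1), summed over dimensions.
    kl = 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)
    return log_px_given_z(z) - kl

# Toy decoder: unit-variance Gaussian likelihood around a fixed "observation" x.
x = np.array([0.5, -1.0])
log_px = lambda z: -0.5 * np.sum((x - z) ** 2)  # log-likelihood up to a constant
mu, log_var = np.zeros(2), np.zeros(2)
print(elbo_estimate(mu, log_var, log_px))
```

With $\boldsymbol{\mu} = \boldsymbol{0}$ and $\log \boldsymbol{\sigma}^2 = \boldsymbol{0}$ the KL term vanishes, so the estimate reduces to the reconstruction term alone; in a real VAE both terms are backpropagated through the encoder and decoder networks.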
GAN Objective Function
$$\min_G \max_D \mathbb{E}_{\boldsymbol{x} \sim p_{\text{data}}}[\log D(\boldsymbol{x})] + \mathbb{E}_{\boldsymbol{z} \sim p(\boldsymbol{z})}[\log(1 - D(G(\boldsymbol{z})))]$$ A minimax game between the Generator and the Discriminator.
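The two sides of this minimax game translate into two loss functions; the sketch below evaluates them on toy discriminator outputs (scores and shapes are illustrative assumptions). It also shows the non-saturating generator loss commonly used in practice instead of minimizing $\log(1 - D(G(\boldsymbol{z})))$:

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """D maximizes E[log D(x)] + E[log(1 - D(G(z)))]; we minimize the negative."""
    return -(np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps)))

def generator_loss(d_fake, eps=1e-12):
    """Non-saturating variant: G maximizes E[log D(G(z))] for stronger gradients."""
    return -np.mean(np.log(d_fake + eps))

d_real = np.array([0.9, 0.8, 0.95])  # discriminator scores on real samples
d_fake = np.array([0.1, 0.2, 0.05])  # discriminator scores on generated samples
print(discriminator_loss(d_real, d_fake), generator_loss(d_fake))
```

When the discriminator confidently rejects fakes (as here), the generator loss is large, which is exactly the signal that drives $G$ to improve; the small `eps` guards against `log(0)`.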
Diffusion Models
The model learns a forward process that gradually adds noise to data, and a reverse process that removes the noise. The reverse process is parameterized by a neural network as $p_\theta(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)$.
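The forward process admits a closed form, $q(\boldsymbol{x}_t | \boldsymbol{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\, \boldsymbol{x}_0, (1 - \bar{\alpha}_t) \boldsymbol{I})$, which can be sampled in one step. A minimal sketch with a linear $\beta$ schedule (the schedule values are illustrative, not tuned):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # \bar{alpha}_t = prod_{s <= t} alpha_s

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form, without simulating t steps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8,))
x_early = q_sample(x0, t=10, rng=rng)     # still close to the data
x_late = q_sample(x0, t=T - 1, rng=rng)   # nearly pure Gaussian noise
print(alpha_bars[10], alpha_bars[-1])
```

Training amounts to sampling a random $t$, producing $\boldsymbol{x}_t$ this way, and regressing the network's noise prediction against the $\boldsymbol{\varepsilon}$ that was injected; sampling then runs the learned reverse process $p_\theta(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)$ from pure noise back to data.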
PAC Learning
For a hypothesis class of VC dimension $d$, PAC learning gives a sufficient sample size $n$ such that, with probability at least $1-\delta$, the generalization error is at most $\varepsilon$: $n = O\left(\frac{1}{\varepsilon}\left(d \log\frac{1}{\varepsilon} + \log\frac{1}{\delta}\right)\right)$.
Applications at This Level
Large Language Models
Language models such as GPT and Claude. Prompt engineering, fine-tuning, and RLHF.
Image Generation
Stable Diffusion, DALL-E. Generating images from text. Practical applications of diffusion models.
Protein Structure Prediction
AlphaFold. Fusion of Transformers and structure prediction. Application of deep learning to scientific research.
Game AI
AlphaGo, AlphaZero. Mastering games through deep reinforcement learning.
Connection to the Research Frontier
The advanced-level content is directly linked to cutting-edge research presented at top conferences such as NeurIPS, ICML, and ICLR. The goal at this level is to develop the ability to read papers, produce reproducible implementations, and validate new ideas.
Study Tips
- Read papers: Develop a habit of following the latest papers on arXiv
- Reproduce implementations: Implement models from papers yourself
- Theory and practice: Be able to explain why methods work
- Critical thinking: Understand the limitations and assumptions of each method
References
- Goodfellow, Bengio & Courville, Deep Learning
- Vaswani et al., "Attention Is All You Need" (2017)
- Kingma & Welling, "Auto-Encoding Variational Bayes" (2014)
- Ho et al., "Denoising Diffusion Probabilistic Models" (2020)
- Shalev-Shwartz & Ben-David, Understanding Machine Learning
Related Series
- Generative Models - Details on VAE, GAN, and Diffusion Models