CS229br Foundations of Deep Learning (aka Topics in the Foundations of Machine Learning)

Spring 2023, Thursdays 3:45pm-6:30pm SEC 1.402 Classroom (First lecture Jan 26)

Instructor: Boaz Barak

Teaching Fellows: Gustaf Ahdritz, Gal Kaplun

Links (enrolled students only): Canvas, Perusall, Gradescope

See also Spring 2021 version (the field is moving rapidly, so the courses will not be the same, but it gives some sense of the material; also, the Spring 2021 course was held over Zoom, whereas the Spring 2023 course will be much more “hands-on”, so you can expect greater depth but also more work).

TL;DR: The goal of this course is to prepare students for research in the foundations of deep learning. By the end of the course you should be able to read most cutting-edge papers in this field, and to reproduce at least some experimental results (those that do not require an inordinate amount of computational and human resources). Ideally, you should be on your way to working on original research in the field. To achieve this, the course will require a large amount of independence from students, including both self-study and peer study.

See also these two blog posts of Boaz:

Formal description: A graduate level course on recent advances and open questions in the foundations of machine learning and specifically deep learning. We will review both classical results as well as recent papers in areas including classifiers and generalization gaps, representation learning, generative models, adversarial robustness, out of distribution performance, and more.

This is a fast-moving area and it will be a fast-moving course. We will aim to cover both state-of-the-art results and the intellectual foundations for them, and to have a substantive discussion of both the “big picture” and the technical details of the papers. In addition to the theoretical lectures, the course will involve a programming component aiming to get students to the point where they can both reproduce results from papers and work on their own research. This component will be largely self-directed, and we expect students to be proficient in Python and in picking up technologies and libraries such as PyTorch, NumPy, etc. on their own (aka “Stack Overflow oriented programming”).

Prerequisites: We require mathematical maturity, and proficiency with proofs, probability, and information theory, as well as the basics of machine learning, at the level of an undergraduate ML course such as Harvard CS 181 or MIT 6.036. You should be familiar with topics such as empirical and population loss, gradient descent, neural networks, linear regression, principal component analysis, etc. On the applied side, you should be comfortable with Python programming, and be able to train a basic neural network. (Or achieve this via self study before the beginning of the course; see homework zero).

Apply to this course: The course will be capped and students will need to apply. Before applying, please make sure to complete homework zero which you should submit as part of the application. Applications are due by January 17, 2023 11:59pm. Note: If you have any questions about homework zero then feel free to email Boaz+Gustaf+Gal.


Lecture 1: Thursday, January 26, 2023

slides (powerpoint) slides (pdf)

Introduction to the course, a quick review of classical ML: representation (i.e., approximation theorems), optimization (convexity, stochastic gradient descent), generalization (bias/variance tradeoff).  Differences between that and modern paradigms.

Transformer architecture. How it works, why it is well-suited for GPUs, auto-regressive language models.  The next-token prediction task. Some questions: are transformers useful for their inductive bias, or for their highly efficient GPU implementation?  Differences between fine tuning, prompt tuning, linear readouts.
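As a rough illustration of the attention mechanism at the core of the transformer and of the causal masking used by auto-regressive language models, here is a minimal single-head sketch in plain Python. The function names and the list-of-lists representation are choices made for this example only; real implementations batch these operations as matrix multiplications, which is part of why they map well onto GPUs.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of T vectors (lists of floats).
    Position t attends only to positions 0..t, which is what lets the
    model be trained on next-token prediction.
    """
    d = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Scaled dot products between query t and keys 0..t.
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # Output t is the attention-weighted average of values 0..t.
        out = [
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs
```

Because of the mask, the output at the first position equals the first value vector, and changing later tokens cannot affect earlier outputs.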

Options (not sure how much we will cover): Vision transformers, MLP mixer,  attention in linear time


Model: original paper and annotated version (colab version)

Other transformer tutorials:

Vision: vision transformer, MLP mixer

Efficiency: compute/energy consumption of models, GPUs and linear algebra

Inductive bias: learning convolutions from scratch (Behnam Neyshabur)

Linear time attention reading: Efficient attention, Nyströmformer (blog, paper), Linformer, MEGA (sub-quadratic), Attention-free transformer.

Pretraining without attention (SSM)

Lecture 2: Thursday, February 2, 2023

slides (powerpoint) slides (pdf) Handwritten notes for board (pdf)

Generative models: Variational principle, VAEs, normalizing flows. 

Reading: Chapter 2 (VAE) of the Kingma and Welling survey on VAEs. Chapter 3 (exponential families; you can skim the concrete examples in 3.3) of Wainwright and Jordan. Lilian Weng's blog on normalizing flows. Survey by Kobyzev, Prince, and Brubaker (see also the CVPR 21 tutorial).
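For orientation, the variational principle behind VAEs can be summarized by the evidence lower bound (ELBO). The notation below (encoder $q_\phi(z \mid x)$, decoder $p_\theta(x \mid z)$, prior $p(z)$) is a common convention assumed here, not necessarily the one used in the lecture.

```latex
\begin{align*}
\log p_\theta(x)
  &= \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]
     + \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right) \\
  &\ge \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
     - \mathrm{KL}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right)
\end{align*}
```

The inequality holds because the KL divergence to the true posterior is non-negative. Training a VAE maximizes the right-hand side jointly over $\theta$ and $\phi$.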

Lecture 3: Thursday, February 9, 2023

slides (powerpoint) slides (pdf)

Diffusion models

Reading: On Perusall - Weng blog, Karras et al unifying design space, McAllester math of diffusion.

Additional resources: Latent diffusion (Rombach et al), classifier-free guidance (Ho and Salimans). Blog posts of Song and Das. Vahdat tutorial (video, 2 hours).

Lecture 4: Thursday, February 16, 2023

slides (powerpoint) slides (pdf) Handwritten notes for board (pdf)

Privacy in machine learning

2014 manuscript on Differential Privacy by Dwork and Roth. For issues of computational complexity, see the survey of Vadhan.

DP-SGD paper; see also lecture notes by Smith and Ullman, notes by Kamath, and slides by Bellet. This video of Kamath can also be useful.
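To make the DP-SGD recipe concrete, here is a minimal sketch of a single update step in plain Python: clip each per-example gradient to a fixed L2 norm, sum, add Gaussian noise scaled to the clipping norm, and average. The function and parameter names are illustrative assumptions, and the sketch deliberately omits privacy accounting (tracking the resulting epsilon and delta), which is the harder part and is covered in the readings above.

```python
import math
import random


def clip(grad, max_norm):
    """Rescale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One noisy gradient step (illustrative sketch, no privacy accounting).

    per_example_grads: one gradient vector per example in the batch.
    noise_multiplier:  noise standard deviation as a multiple of max_norm.
    rng:               a random.Random instance, passed in for reproducibility.
    """
    batch = len(per_example_grads)
    # Clipping bounds the influence of any single example on the sum.
    clipped = [clip(g, max_norm) for g in per_example_grads]
    summed = [sum(col) for col in zip(*clipped)]
    # Gaussian noise calibrated to that bound (the sensitivity of the sum).
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    return [p - lr * n / batch for p, n in zip(params, noisy)]
```

With `noise_multiplier=0` this reduces to SGD with per-example gradient clipping, which is a convenient sanity check.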



Attacks on non-private models: Membership inference. Extracting training data from GPT2 and Diffusion models.

Failure of heuristics, e.g. Attack on InstaHide.

Exposed! A survey of attacks on private data.

Issues with DP for deep learning: Tramer-Boneh: DP needs better features. Bagdasaryan-Shmatikov: DP impacts subgroups differently.

Machine unlearning: see this

Relaxations of DP: label DP, privacy-preserving predictions.  DP fine tuning of large models (see also this).

Separate issue: Protecting model weights from inference server via homomorphic encryption or other cryptographic tools, see cryptonets (2016),  this recent paper and references within.

Lecture 5: Thursday, February 23, 2023

slides (powerpoint) slides (pdf)

Protein Folding: AlphaFold - guest lecture by Gustaf Ahdritz.

Reading: AlphaFold1 paper, AlphaFold2 paper. Blog: Mohammed AlQuraishi blog1, blog2

Lecture 6: Thursday, March 2, 2023

slides (powerpoint) slides (pdf) Handwritten notes for board (pdf)

Training Dynamics: Differences between back-propagation and perturbative methods, natural gradient, edge of stability, deep bootstrap, the effect of issues such as batch norm, residual connections, SGD vs Adam.

Reading: lecture notes of Roger Grosse, Deep Bootstrap paper, Edge of stability paper, SGD complexity paper. Francis Bach’s blog on depth-2 networks dynamics (guest post by Lénaïc Chizat).   Chinchilla paper on scaling laws.
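As background for the edge-of-stability reading: the threshold of 2 divided by the learning rate comes from the classical analysis of gradient descent on a quadratic. On f(x) = (a/2) x^2 each step multiplies x by (1 - lr * a), so the iterates shrink only when lr < 2/a, where a plays the role of the sharpness. The toy function below is an illustration written for this page, not code from any of the papers.

```python
def gradient_descent_on_quadratic(sharpness, lr, x0=1.0, steps=50):
    """Run gradient descent on f(x) = (sharpness / 2) * x**2 and return the final x."""
    x = x0
    for _ in range(steps):
        grad = sharpness * x  # f'(x)
        x -= lr * grad        # equivalently x *= (1 - lr * sharpness)
    return x


# With sharpness 10 the stability threshold is lr = 2 / 10 = 0.2.
just_below = gradient_descent_on_quadratic(sharpness=10.0, lr=0.19)
just_above = gradient_descent_on_quadratic(sharpness=10.0, lr=0.21)
```

Just below the threshold the iterate oscillates in sign but decays toward zero, and just above it the oscillation grows. The readings study what happens in neural networks, where the sharpness itself changes during training.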

Lecture 7: Thursday, March 9, 2023

slides (powerpoint) slides (pdf)

Training dynamics continued.

We will look at Deep Bootstrap, Edge of Stability, and scaling laws (particularly Chinchilla and to what extent they are challenged by LLaMA). Some other reading: mathematical models that demonstrate the above phenomena: deep bootstrap in kernels, understanding edge-of-stability via minimalist example, edge-of-stability in 2-layer nets, explaining neural scaling laws, power laws in kernels (see also this, this, and nearest-neighbor rates).

(No lecture on Thursday, March 16, 2023)

Lecture 8: Thursday, March 23, 2023

slides (pdf)

Reinforcement learning - guest lecture by Sham Kakade


Lecture 9: Thursday, March 30, 2023

slides (powerpoint) slides (pdf)

Test-time computation: test-time augmentation, beam search, retrieval-based models, differentiable vs non-differentiable memory and tools.
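Beam search is perhaps the simplest example of spending extra computation at test time: instead of greedily committing to the most likely next token, keep the best few partial sequences at each step. The sketch below is a minimal version in plain Python; the toy model `toy_next_log_probs` and its probabilities are made up for illustration, and the sketch has no end-of-sequence handling or length normalization.

```python
import math


def beam_search(next_log_probs, start, beam_width, max_len):
    """Return (log_prob, sequence) for the best sequence found.

    next_log_probs(seq) maps a partial sequence to a dict
    {token: log-probability of that token coming next}.
    """
    beams = [(0.0, [start])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for token, logp in next_log_probs(seq).items():
                candidates.append((score + logp, seq + [token]))
        # Keep only the beam_width highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


# Hypothetical toy model: the next-token distribution depends only on the last token.
_TABLE = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"A": 0.55, "B": 0.45},
    "B": {"A": 0.9, "B": 0.1},
}


def toy_next_log_probs(seq):
    return {tok: math.log(p) for tok, p in _TABLE[seq[-1]].items()}


greedy = beam_search(toy_next_log_probs, "<s>", beam_width=1, max_len=2)
wider = beam_search(toy_next_log_probs, "<s>", beam_width=2, max_len=2)
```

In this toy model greedy decoding (beam width 1) commits to "A" first and ends with probability 0.6 × 0.55 = 0.33, while beam width 2 recovers the sequence starting with "B", which has probability 0.4 × 0.9 = 0.36.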


Lecture 10: Thursday, April 6, 2023

slides (powerpoint) slides (pdf)

Boaz’s post-lecture blog post on safety

AI Safety, Fairness, Accountability, Transparency, Alignment.

Fair ML textbook. Hendrycks safety course.

Algorithmic Auditing - Vecchione et al. Against predictive optimization - Wang et al. Meta study on bias papers in NLP. Feature highlighting explanations in model interpretability (Barocas et al). The mythos of model interpretability - Lipton. Gender Shades - Buolamwini and Gebru.

Impact of Russian disinformation campaign - Eady et al

Natural selection favors AIs over humans - Hendrycks. (see also Carlsmith)

Unsolved problems in AI safety, Hendrycks et al (see also X risk analysis, Hendrycks and Mazeika). Reward misspecification - Pan et al. Christiano blog post. Alignment problem from DL perspective (Ngo et al).

Verification / Critique: Red teaming LMs with LMs (DeepMind), Self-critiquing models (OpenAI)

Beyond normal accident theory Marais et al

AI will change world but not take over via 3d chess / Barak and Edelman

We might not talk a lot about adversarial robustness but some sources include

RobustBench and the links there

Uncertainty under distribution shift  - Ovadia et al


Lecture 11: Thursday, April 13, 2023

Guest lecture on efficient training of deep nets, by Horace He from the PyTorch team.

Horace’s slides


The Bitter Lesson / Sutton

Is Moore’s law ending or not? / Herz

Stephen Jones video - GPU programming - especially 15m30 to 22m20

Horace He: Making Deep Learning Go Brrrr From First Principles

Overview of Parallelism Strategies / Lilian Weng

Matt Pharr: the story of ispc

Lecture 12: Thursday, April 20, 2023

slides (powerpoint) slides (pdf)

Course summary, looking back into the early days of computers in general and AI in particular, as well as trying to make predictions about the future.

Reading: Some historical notes about the development of AI:

Future predictions: