Spring 2023, Thursdays 3:45pm-6:30pm SEC 1.402 Classroom (First lecture Jan 26)
Instructor: Boaz Barak
Teaching Fellows: Gustaf Ahdritz, Gal Kaplun
Links (enrolled students only): Canvas | Perusall | Gradescope
See also Spring 2021 version (the field is moving rapidly, so the courses will not be the same, but it gives some sense; also, the Spring 2021 course was held over Zoom. The Spring 2023 course will be much more “hands-on”, so you can expect to go into greater depth but also to do more work.)
TL;DR: The goal of this course is to prepare students for research in the foundations of deep learning. By the end of the course you should be able to read most cutting-edge papers in this field, as well as be capable of reproducing at least some experimental results (those that do not require an inordinate amount of computational and human resources). Ideally, you should be on your way to doing original research in the field. To achieve this, the course will require a large amount of independence from students, including both self-study and peer study.
See also these two blog posts of Boaz:
Formal description: A graduate level course on recent advances and open questions in the foundations of machine learning and specifically deep learning. We will review both classical results as well as recent papers in areas including classifiers and generalization gaps, representation learning, generative models, adversarial robustness, out of distribution performance, and more.
This is a fast-moving area and it will be a fast-moving course. We will aim to cover both state-of-the-art results and the intellectual foundations for them, and to have a substantive discussion of both the “big picture” and the technical details of the papers. In addition to the theoretical lectures, the course will involve a programming component aiming to get students to the point where they can both reproduce results from papers and work on their own research. This component will be largely self-directed, and we expect students to be proficient in Python and in picking up technologies and libraries such as PyTorch/NumPy/etc. on their own (aka “Stack Overflow oriented programming”).
Prerequisites: We require mathematical maturity, and proficiency with proofs, probability, and information theory, as well as the basics of machine learning, at the level of an undergraduate ML course such as Harvard CS 181 or MIT 6.036. You should be familiar with topics such as empirical and population loss, gradient descent, neural networks, linear regression, principal component analysis, etc. On the applied side, you should be comfortable with Python programming, and be able to train a basic neural network. (Or achieve this via self study before the beginning of the course; see homework zero).
Apply to this course: The course will be capped and students will need to apply. Before applying, please make sure to complete homework zero which you should submit as part of the application. Applications are due by January 17, 2023 11:59pm. Note: If you have any questions about homework zero then feel free to email Boaz+Gustaf+Gal.
slides (powerpoint) slides (pdf)
Introduction to the course, a quick review of classical ML: representation (i.e., approximation theorems), optimization (convexity, stochastic gradient descent), generalization (bias/variance tradeoff). Differences between that and modern paradigms.
Transformer architecture. How it works, why it is well-suited for GPUs, auto-regressive language models. The next-token prediction task. Some questions: are transformers useful for their inductive bias, or for their highly efficient GPU implementation? Differences between fine tuning, prompt tuning, linear readouts.
Options (not sure how much we will cover): Vision transformers, MLP mixer, attention in linear time
Reading:
Model: original paper and annotated version (colab version)
Other transformer tutorials:
Vision: vision transformer, MLP mixer
Efficiency: compute/energy consumption of models, GPUs and linear algebra
Inductive bias: learning convolutions from scratch (Benham)
Linear time attention reading: Efficient attention, Nyströmformer (blog, paper), Linformer, MEGA (sub-quadratic), Attention-free transformer, Pretraining without attention (SSM)
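To make the attention mechanism and the next-token prediction setup in this lecture concrete, here is a minimal sketch of single-head scaled dot-product attention with a causal mask. It is an illustration under simplifying assumptions, not the implementation from any of the readings: it uses plain Python lists, omits the learned query/key/value projections and multiple heads, and the function names are made up for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    queries and keys are lists of T vectors of dimension d; values is a
    list of T vectors. Position t attends only to positions 0..t, which
    is what makes auto-regressive (next-token) training possible.
    """
    d = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Scores against positions 0..t only (the causal mask).
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible values.
        outputs.append([
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ])
    return outputs
```

The double loop over positions is the quadratic cost in sequence length that the linear-time attention readings try to avoid; GPU implementations compute the same quantity as a few large matrix multiplications.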
slides (powerpoint) slides (pdf) Handwritten notes for board (pdf)
Generative models: Variational principle, VAEs, normalizing flows.
Reading: Chapter 2 (VAE) of the Kingma and Welling survey on VAEs. Chapter 3 (exponential families; can skim the concrete examples in 3.3) of Wainwright and Jordan. Lilian Weng’s blog on normalizing flows. Survey by Kobyzev, Prince, and Brubaker (see also CVPR 21 tutorial).
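As a small companion to the VAE reading, the sketch below shows two ingredients that appear in the standard Gaussian VAE objective: the reparameterization trick for sampling the latent, and the closed-form KL divergence between a diagonal Gaussian and a standard normal prior. It assumes a diagonal Gaussian encoder parameterized by a mean and a log-variance; the function names are illustrative and not taken from the readings.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    Writing the sample this way keeps z a deterministic function of
    (mu, log_var) given eps, so gradients can flow through the sample.
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )
```

In a training loop, the negative ELBO for one example would be a reconstruction loss evaluated at the sampled z plus this KL term.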
slides (powerpoint) slides (pdf)
Diffusion models
Reading: On Perusall - Weng blog, Karras et al unifying design space, McAllester math of diffusion.
Additional resources: Latent diffusion (Rombach et al), classifier-free guidance (Ho and Salimans). Blog posts of Song and Das. Vahdat tutorial (video, 2 hours).
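For orientation before the readings, here is a minimal sketch of the forward (noising) process in one common discrete-time, variance-preserving parameterization, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. The linear beta schedule and its default endpoints are assumptions chosen for illustration; the readings (Karras et al in particular) discuss other parameterizations, and the function names here are hypothetical.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed defaults)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """abar_t = prod_{s <= t} (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I).

    Returns (x_t, eps); eps is the noise a denoising network would be
    trained to predict from (x_t, t) in the noise-prediction setup.
    """
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [
        math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)
    ]
    return x_t, eps
```

As t grows, abar_t shrinks toward zero, so x_t approaches pure Gaussian noise; sampling then amounts to learning to reverse this process.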
slides (powerpoint) slides (pdf) Handwritten notes for board (pdf)
Privacy in machine learning
2014 manuscript on Differential Privacy by Dwork and Roth. For issues of computational complexity, see the survey of Vadhan.
DP-SGD paper; see also lecture notes by Smith and Ullman, notes by Kamath, and slides by Bellet. This video of Kamath can also be useful.
https://differentialprivacy.org/
Attacks on non-private models: membership inference; extracting training data from GPT-2 and diffusion models.
Failure of heuristics, e.g. the attack on InstaHide.
Exposed! A survey of attacks on private data.
Issues with DP for deep learning: Tramer-Boneh: DP needs better features; Bagdasaryan-Shmatikov: DP impacts subgroups differently.
Machine unlearning: see this
Relaxations of DP: label DP, privacy-preserving predictions. DP fine tuning of large models (see also this).
Separate issue: Protecting model weights from the inference server via homomorphic encryption or other cryptographic tools; see CryptoNets (2016), this recent paper, and references within.
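The DP-SGD reading listed above modifies ordinary SGD in two ways: each example's gradient is clipped to a fixed norm, and Gaussian noise scaled to that norm is added to the summed gradient. The sketch below illustrates one such update on plain Python lists. It is a simplified illustration, not the paper's code: the function and parameter names are assumptions, and it does no privacy accounting, so it says nothing about the (epsilon, delta) actually achieved.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale grad down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One noisy, clipped gradient step.

    per_example_grads holds one gradient vector per example in the batch.
    Noise with standard deviation noise_multiplier * max_norm is added to
    the SUM of clipped gradients, which is then averaged over the batch.
    """
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    batch_size = len(clipped)
    summed = [sum(coords) for coords in zip(*clipped)]
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    return [p - lr * g / batch_size for p, g in zip(params, noisy)]
```

Clipping bounds how much any single example can move the parameters, which is what lets the added noise mask that example's contribution.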
slides (powerpoint) slides (pdf)
Protein Folding: AlphaFold - guest lecture by Gustaf Ahdritz.
Reading: AlphaFold1 paper, AlphaFold2 paper. Blog: Mohammed AlQuraishi blog1, blog2
slides (powerpoint) slides (pdf) Handwritten notes for board (pdf)
Training Dynamics: Differences between back-propagation and perturbative methods, natural gradient, edge of stability, deep bootstrap, the effect of issues such as batch norm, residual connections, SGD vs Adam.
Reading: lecture notes of Roger Grosse, Deep Bootstrap paper, Edge of stability paper, SGD complexity paper. Francis Bach’s blog on depth-2 networks dynamics (guest post by Lénaïc Chizat). Chinchilla paper on scaling laws.
slides (powerpoint) slides (pdf)
Training dynamics continued.
We will look at Deep Bootstrap, Edge of Stability, and scaling laws (particularly Chinchilla and to what extent they are challenged by LLaMA). Some other reading: mathematical models that demonstrate the above phenomena: deep bootstrap in kernels, understanding edge-of-stability via a minimalist example, edge-of-stability in 2-layer nets, explaining neural scaling laws, power laws in kernels (see also this, this, and nearest-neighbor rates).
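The edge-of-stability readings relate the sharpness of the loss (the top Hessian eigenvalue) to the step size, with 2 divided by the step size as the critical value. As a minimal sketch of where that threshold comes from, consider gradient descent on the one-dimensional quadratic f(x) = 0.5 * sharpness * x^2: each step multiplies x by (1 - lr * sharpness), so the iterates shrink when lr * sharpness < 2 and blow up when it exceeds 2. The function name and defaults below are illustrative only, and real networks are of course not quadratics.

```python
def gradient_descent_on_quadratic(sharpness, lr, x0=1.0, steps=50):
    """Run GD on f(x) = 0.5 * sharpness * x**2 and return all iterates."""
    xs = [x0]
    for _ in range(steps):
        grad = sharpness * xs[-1]  # f'(x) = sharpness * x
        xs.append(xs[-1] - lr * grad)
    return xs


# With sharpness 10 the threshold step size is 2 / 10 = 0.2.
stable = gradient_descent_on_quadratic(sharpness=10.0, lr=0.19)
unstable = gradient_descent_on_quadratic(sharpness=10.0, lr=0.21)
```

Here `stable` oscillates in sign while decaying toward zero, and `unstable` oscillates with growing magnitude.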
(No lecture on Thursday, March 16, 2023)
Reinforcement learning - guest lecture by Sham Kakade
Readings:
slides (powerpoint) slides (pdf)
Test-time computation: test-time augmentation, beam search, retrieval-based models, differentiable vs non-differentiable memory and tools.
Reading:
Survey on augmented language models.
Best of n outputs: WebGPT paper; plurality voting: Wang et al, Minerva paper
In-context learning, and is it really “learning” or “context conditioning”: Min et al - in-context examples are more useful for the data distributions than for the labels; Wei et al - LLMs can adapt to the label distribution also
Chain of thought: Wei et al, zero shot CoT Kojima et al (“step by step”)
Differentiable memory: RETRO (DeepMind), Memorizing transformers, Recurrent memory (Bulatov et al)
Non-differentiable memory, natural language as universal API: Toolformer (Schick et al); see also “Bing inner monologue” (e.g. here, here; unsure to what extent these are confirmed), langchain, Taskmatrix.ai
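Two of the test-time strategies in the reading list above, best-of-n selection and plurality voting, are simple to state in code. The sketch below assumes that n candidate outputs have already been sampled from a model and, for voting, that a final answer has been extracted from each. The `score` argument is a hypothetical stand-in for whatever ranking signal is available (for example a learned reward model); nothing here reproduces a specific paper's pipeline.

```python
from collections import Counter


def plurality_vote(answers):
    """Return the most frequent final answer among sampled generations.

    Ties are broken in favor of the answer that appeared first.
    """
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates, score):
    """Return the candidate with the highest value of score(candidate)."""
    return max(candidates, key=score)
```

For instance, `plurality_vote(["18", "20", "18", "17"])` picks `"18"`. Both strategies spend extra inference compute (n samples instead of one) without changing the model's weights.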
slides (powerpoint) slides (pdf)
Boaz’s post-lecture blog post on safety
AI Safety, Fairness, Accountability, Transparency, Alignment.
Fair ML textbook. Hendrycks safety course.
Algorithmic Auditing - Vecchione et al. Against predictive optimization - Wang et al. Meta study on bias papers in NLP. Feature highlighting explanations in model interpretability (Barocas et al). The mythos of model interpretability - Lipton. Gender Shades - Buolamwini and Gebru.
Impact of Russian disinformation campaign - Eady et al
Natural selection favors AIs over humans - Hendrycks. (see also Carlsmith)
Unsolved problems in AI safety, Hendrycks et al (see also X risk analysis, Hendrycks and Mazeika). Reward misspecification - Pan et al. Christiano blog post. Alignment problem from DL perspective (Ngo et al)
Verification/Critique: Red teaming LMs with LMs (DeepMind), Self-critiquing models (OpenAI)
Beyond normal accident theory Marais et al
AI will change world but not take over via 3d chess / Barak and Edelman
We might not talk a lot about adversarial robustness, but some sources include:
RobustBench and the links there
Uncertainty under distribution shift - Ovadia et al
Guest lecture on efficient training of deep nets, by Horace He from the Pytorch team.
Reading:
Is Moore’s law ending or not? / Herz
Stephen Jones video - GPU programming - especially 15m30 to 22m20
Horace He: Making Deep Learning Go Brrrr From First Principles
Overview of Parallelism Strategies / Lilian Weng
slides (powerpoint) slides (pdf)
Course summary, looking back into the early days of computers in general and AI in particular, as well as trying to make predictions about the future.
Reading: Some historical notes about the development of AI:
John von Neumann The Computer and the Brain
A New Yorker profile on Marvin Minsky from 1981. This is not just for reading about Minsky’s achievements, but also to get a sense of the people involved, and how AI research was perceived in the early 1980s. (Even if the author is too reverential towards Minsk
Original 1943 paper of McCulloch and Pitts
Alan Turing’s 1950 Computing Machinery and Intelligence where he presented his famous “Turing test”.
Rosenblatt’s 1961 book on perceptrons
The proposal for the 1956 Dartmouth workshop see also Wikipedia article
Sir James Lighthill’s 1972 report on the state of AI. This depressing report summarizes the perception of an “AI winter” and apparently also caused the UK AI winter.
A sociological history of the neural network controversy - Olazaran, 1993.
Talking Nets: Oral history - Anderson and Rosenfeld 1998. A sequence of interviews taken in the 1990s with Michael Arbib, Gail Carpenter, Leon Cooper, Jack Cowan, Walter Freeman, Stephen Grossberg, Robert Hecht-Nielsen, Geoffrey Hinton, Teuvo Kohonen, Bart Kosko, Jerome Lettvin, Carver Mead, David Rumelhart, Terry Sejnowski, Paul Werbos, and Bernard Widrow.
Early-ish discussions on “singularity”: I.J. Good 1966, Vinge 1993
Future predictions: