Fall 2025, Thursdays 3:45pm-6:30pm (First lecture September 4)

Course: CS 2881R - AI Safety

Instructor: Boaz Barak

Teaching Fellows: Natalie Abreu (natalieabreu@g.harvard.edu), Roy Rinberg (royrinberg@g.harvard.edu), Hanlin Zhang (hanlinzhang@g.harvard.edu)

Course Description: This will be a graduate level course on challenges in alignment and safety of artificial intelligence. We will consider both technical aspects as well as questions on societal and other impacts of the field.

Prerequisites: We require mathematical maturity, and proficiency with proofs, probability, and information theory, as well as the basics of machine learning, at the level of an undergraduate ML course such as Harvard CS 181 or MIT 6.036. You should be familiar with topics such as empirical and population loss, gradient descent, neural networks, linear regression, principal component analysis, etc. On the applied side, you should be comfortable with Python programming, and be able to train a basic neural network.

Important: Read the Course Introduction!

Course Introduction Blog Post - This contains Homework Zero and important course information. Students who filled in the form will receive more instructions by email.

Questions? If you have any questions about the course, please email harvardcs2881@gmail.com

Related reading by Boaz:

Previous versions: Spring 2023 ML Theory Seminar

Spring 2021 ML Theory Seminar

Mini Syllabus

The course will have 13 in person lectures - each lecture will involve also discussion and presentation of an experiment by a group of students.
Students are expected to attend all lectures and do the reading in advance as well discuss these in electronic forum.
The assignments, project, and other requirements for the course will be determined later.
POTENTIAL CONFLICT OF INTEREST NOTE: In addition to his position at Harvard, Boaz is also a member of the technical staff at OpenAI. The course will include discussions of models from multiple providers, including OpenAI, and students are also encouraged to use AIs from multiple providers while doing their work. If students in the course feel any issue with this conflict, please do not hesitate to contact Boaz, the other staff, or the Harvard SEAS administration. For what it’s worth, I (Boaz) will see it as a great success of the course if its graduates work in AI safety in any capacity, including at academia, non-profit, governments, and any of OpenAI’s competitors.

Schedule

Classes begin September 2, 2025. Reading period December 4-9, 2025.

Note: This schedule is periodically synchronized with the course schedule Google Doc, which contains the most up-to-date version.

Thursday, September 4, 2025

Introduction

AI impact and timelines overview
Risks of AI
AI alignment goals
Vulnerable world hypothesis
Lessons from other industries

Experiment:

"Emerging alignment" - Fine-tune a model on outputs from a model with a "good persona" and evaluate performance on other datasets. Try with "subtle alignment" using random inputs.

Resources:

Thursday, September 11, 2025

Modern LLM Training

Modern LLM training overview (DeepSeek R1)
Pretraining
Mid training
Reinforcement Learning (RLHF/RLVF)
Safety training

Experiment:

Use policy-gradient algorithm to optimize prompt prefixes. Take 10,000 notable people's names, create a logits vector where "You are X" probability is proportional to exp(P[i]). Optimize performance across benchmarks.

Resources:

Thursday, September 18, 2025

Adversarial Robustness, Jailbreaks & Prompt Injection

Adversarial robustness
Jailbreaks
Prompt injection
Lessons from vision/software security
Buffer overflow and SQL injection concepts

Experiment:

Test-time scaling laws with red/blue team approach - Red team: Create jailbreak dataset via "many shot" and filtering - Blue team: Analyze model responses with different reasoning efforts

Resources:

Thursday, September 25, 2025

Model Specifications & Compliance

Lessons from law
Value alignment vs. detailed adherence

Experiment:

Model Spec adherence evals - test generalization of model behavior across different domains

Resources:

Thursday, October 2, 2025

Potentially Catastrophic Capabilities & Responsible Scaling

Responsible scaling policies
Scalable evaluations
Safety through capability vs. weakness

Experiment:

Evaluate open and closed source models, potentially using jailbreaking techniques

Resources:

Thursday, October 9, 2025

Scheming, Reward Hacking & Deception

Exploring "bad behavior" tied to training objectives
Investigating potential deception in monitoring models

Experiment:

Demonstrate how impossible tasks or conflicting objectives lead to lying/scheming

Resources:

Thursday, October 16, 2025

Recursive Self-Improvement

Is AI R&D an "AI-complete" task?

Experiment:

To be determined

Resources:

Yudkowsky, E. – "AGI Take-off Speeds" (Arbital 2016)

Thursday, October 23, 2025

Economic Impacts of Foundation Models

Labour substitution & productivity effects
Inequality & policy responses

Experiment:

To be determined

Resources:

Thursday, October 30, 2025

Military & Surveillance Applications of AI

Lethal autonomous weapon systems (LAWS)
Strategic stability & escalation risks
Mass-scale surveillance infrastructure

Experiment:

To be determined

Resources:

Thursday, November 6, 2025

Interpretability

Activations
Sparse Auto Encoders (SAE)
Black box models
Chain of thought

Experiment:

To be determined

Resources:

Thursday, November 13, 2025

Emotional Reliance and Persuasion

Domestic & international regulatory approaches
Standards-setting & audits

Experiment:

To be determined

Resources:

Resources to be determined

Thursday, November 20, 2025

TBD

Topics to be determined

Experiment:

To be determined

Resources:

Resources to be determined

No lecture on Thursday, November 27 – Thanksgiving Break

Thursday, December 4, 2025

AI 2035 - Possible Futures of AI

Student project presentations and discussion of future directions in AI safety research

Resources:

Resources to be determined

Schedule content is synchronized with the course schedule document