Fall 2025, Thursdays 3:45pm-6:30pm (First lecture September 4)
Course: CS 2881R - AI Safety
Instructor: Boaz Barak
Teaching Fellows: Natalie Abreu (natalieabreu@g.harvard.edu), Roy Rinberg (royrinberg@g.harvard.edu), Hanlin Zhang (hanlinzhang@g.harvard.edu)
Course Description: This will be a graduate level course on challenges in alignment and safety of artificial intelligence. We will consider both technical aspects as well as questions on societal and other impacts of the field.
Prerequisites: We require mathematical maturity, and proficiency with proofs, probability, and information theory, as well as the basics of machine learning, at the level of an undergraduate ML course such as Harvard CS 181 or MIT 6.036. You should be familiar with topics such as empirical and population loss, gradient descent, neural networks, linear regression, principal component analysis, etc. On the applied side, you should be comfortable with Python programming, and be able to train a basic neural network.
Important: Read the Course Introduction!
- Course Introduction Blog Post - This contains Homework Zero and important course information. Students who filled in the form will receive more instructions by email.
Questions? If you have any questions about the course, please email harvardcs2881@gmail.com
Related reading by Boaz:
Mini Syllabus
-
The course will have 13 in person lectures - each lecture will involve also discussion and presentation of an experiment by a group of students.
-
Students are expected to attend all lectures and do the reading in advance as well discuss these in electronic forum.
-
The assignments, project, and other requirements for the course will be determined later.
-
POTENTIAL CONFLICT OF INTEREST NOTE: In addition to his position at Harvard, Boaz is also a member of the technical staff at OpenAI. The course will include discussions of models from multiple providers, including OpenAI, and students are also encouraged to use AIs from multiple providers while doing their work. If students in the course feel any issue with this conflict, please do not hesitate to contact Boaz, the other staff, or the Harvard SEAS administration. For what it’s worth, I (Boaz) will see it as a great success of the course if its graduates work in AI safety in any capacity, including at academia, non-profit, governments, and any of OpenAI’s competitors.
Schedule
Classes begin September 2, 2025. Reading period December 4-9, 2025.
Note: This schedule is periodically synchronized with the course schedule Google Doc, which contains the most up-to-date version.
Thursday, September 4, 2025
Introduction
- AI impact and timelines overview
- Risks of AI
- AI alignment goals
- Vulnerable world hypothesis
- Lessons from other industries
Experiment:
"Emerging alignment" - Fine-tune a model on outputs from a model with a "good persona" and evaluate performance on other datasets. Try with "subtle alignment" using random inputs.
Thursday, September 11, 2025
Modern LLM Training
- Modern LLM training overview (DeepSeek R1)
- Pretraining
- Mid training
- Reinforcement Learning (RLHF/RLVF)
- Safety training
Experiment:
Use policy-gradient algorithm to optimize prompt prefixes. Take 10,000 notable people's names, create a logits vector where "You are X" probability is proportional to exp(P[i]). Optimize performance across benchmarks.
Resources:
- DeepSeek AI – "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025)
- Sahin Ahmed – "Deepseek R1 Overview" (Medium)
- Raschka, S. – "Understanding Reasoning LLMs" (2024)
- Chowdhery, A., et al. – "PaLM: Scaling Language Modeling" (2022)
- Ouyang, L., et al. – "InstructGPT: Aligning Language Models with Human Feedback" (2022)
- Christiano, P., et al. – "Deep Reinforcement Learning from Human Preferences" (2017)
- Rafailov, R., et al. – "Direct Preference Optimization" (2023)
- Bai, Y., et al. – "Constitutional AI: Harmlessness from AI Feedback" (2022)
- Lee, K., et al. – "Reinforcement Learning from AI Feedback" (2023)
- Guan, J., et al. – "Deliberative Alignment" (2024)
- Raschka, S. – "LLM Architecture Comparison" (2024)
- Qwen GSPO link
- Zheng et al. – "Group Sequence Policy Optimization (GSPO)" (2025)
Thursday, September 18, 2025
Adversarial Robustness, Jailbreaks & Prompt Injection
- Adversarial robustness
- Jailbreaks
- Prompt injection
- Lessons from vision/software security
- Buffer overflow and SQL injection concepts
Experiment:
Test-time scaling laws with red/blue team approach
- Red team: Create jailbreak dataset via "many shot" and filtering
- Blue team: Analyze model responses with different reasoning efforts
Thursday, September 25, 2025
Model Specifications & Compliance
- Lessons from law
- Value alignment vs. detailed adherence
Experiment:
Model Spec adherence evals - test generalization of model behavior across different domains
Thursday, October 2, 2025
Potentially Catastrophic Capabilities & Responsible Scaling
- Responsible scaling policies
- Scalable evaluations
- Safety through capability vs. weakness
Experiment:
Evaluate open and closed source models, potentially using jailbreaking techniques
Thursday, October 9, 2025
Scheming, Reward Hacking & Deception
- Exploring "bad behavior" tied to training objectives
- Investigating potential deception in monitoring models
Experiment:
Demonstrate how impossible tasks or conflicting objectives lead to lying/scheming
Resources:
- Weng, L. – "Reward Hacking" (2024)
- Carlsmith, J. – "Scheming AIs Report" (2023)
- Zou, A., et al. – "Representation Engineering" (2023)
- Lin, S., et al. – "TruthfulQA: Measuring How Models Mimic Human Falsehoods" (2022)
- Langosco, L., et al. – "Goal Misgeneralization in Deep Reinforcement Learning" (2022)
- Hadfield-Menell, D., et al. – "Inverse Reward Design" (2017)
- Krueger, D., et al. – "Hidden Incentives for Auto-Induced Distributional Shift" (2020)
Thursday, October 16, 2025
Recursive Self-Improvement
- Is AI R&D an "AI-complete" task?
Experiment:
To be determined
Thursday, October 23, 2025
Economic Impacts of Foundation Models
- Labour substitution & productivity effects
- Inequality & policy responses
Experiment:
To be determined
Thursday, October 30, 2025
Military & Surveillance Applications of AI
- Lethal autonomous weapon systems (LAWS)
- Strategic stability & escalation risks
- Mass-scale surveillance infrastructure
Experiment:
To be determined
Thursday, November 6, 2025
Interpretability
- Activations
- Sparse Auto Encoders (SAE)
- Black box models
- Chain of thought
Experiment:
To be determined
Thursday, November 13, 2025
Emotional Reliance and Persuasion
- Domestic & international regulatory approaches
- Standards-setting & audits
Experiment:
To be determined
Resources:
- Resources to be determined
Thursday, November 20, 2025
TBD
Experiment:
To be determined
Resources:
- Resources to be determined
No lecture on Thursday, November 27 – Thanksgiving Break
Thursday, December 4, 2025
AI 2035 - Possible Futures of AI
- Student project presentations and discussion of future directions in AI safety research
Resources:
- Resources to be determined
Schedule content is synchronized with the course schedule document