Fall 2025, Thursdays 3:45pm-6:30pm (First lecture September 4)
Course: CS 2881R - AI Safety
Time and Place: Thursdays 3:45pm-6:30pm, SEC LL2.229
(SEC is at 150 Western Ave; first lecture September 4)
Instructor: Boaz Barak
Teaching Fellows: Natalie Abreu (natalieabreu@g.harvard.edu), Roy Rinberg (royrinberg@g.harvard.edu), Hanlin Zhang (hanlinzhang@g.harvard.edu)
Course Description: This will be a graduate level course on challenges in alignment and safety of artificial intelligence. We will consider both technical aspects as well as questions on societal and other impacts of the field.
Prerequisites: We require mathematical maturity, and proficiency with proofs, probability, and information theory, as well as the basics of machine learning, at the level of an undergraduate ML course such as Harvard CS 181 or MIT 6.036. You should be familiar with topics such as empirical and population loss, gradient descent, neural networks, linear regression, principal component analysis, etc. On the applied side, you should be comfortable with Python programming, and be able to train a basic neural network.
Important: Read the Course Introduction!
- Course Introduction Blog Post - This contains Homework Zero and important course information. Students who filled in the form will receive more instructions by email.
Questions? If you have any questions about the course, please email harvardcs2881@gmail.com
Related reading by Boaz:
Mini Syllabus
- The course will have 13 in-person lectures; each lecture will also include discussion and a presentation of an experiment by a group of students.
- The assignments, project, and other requirements for the course will be determined later.
- Attendance: Attendance is mandatory. Students are expected to attend all lectures, do the reading in advance, and discuss it in the electronic forum.
- Generative AI: Students are allowed and encouraged to use generative AI as much as they can for studying, exploring concepts, and for their assignments and projects. Given the availability of AI tools, expectations for projects and assignments will be more ambitious than in past years.
- Electronic device policy: Students can use laptops in class, but we will ask those using them to sit in the back so they don’t distract other students.
- Lecture recordings: To the extent technically possible, we intend to record and publish the lectures online, though there may be some time lag in doing so. Note, however, that recording is done automatically by a static in-room camera, and some parts of the lecture (e.g., the whiteboard or discussions) may not be captured well. We will also honor requests by external speakers not to record their talks.
- POTENTIAL CONFLICT OF INTEREST NOTE: In addition to his position at Harvard, Boaz is also a member of the technical staff at OpenAI. The course will include discussions of models from multiple providers, including OpenAI, and students are encouraged to use AIs from multiple providers in their work. If students in the course have any concerns about this conflict, please do not hesitate to contact Boaz, the other staff, or the Harvard SEAS administration. For what it’s worth, I (Boaz) will see it as a great success of the course if its graduates work in AI safety in any capacity, including in academia, non-profits, government, and any of OpenAI’s competitors.
Schedule
Classes begin September 2, 2025. Reading period December 4-9, 2025.
Thursday, September 4, 2025
Introduction
- AI impact and timelines overview
- Risks of AI
- AI alignment goals
- Vulnerable world hypothesis
- Lessons from other industries
Experiment:
"Emerging alignment" - Fine-tune a model on outputs from a model with a "good persona" and evaluate performance on other datasets. Try with "subtle alignment" using random inputs.
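A toy stand-in for this experiment can be sketched with simple linear models: a "teacher" with a fixed behavior labels inputs, a "student" is fine-tuned on those labels, and we check whether the teacher's behavior transfers to fresh random inputs. Everything below (the models, the data) is a synthetic illustration, not the actual experiment's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Teacher with a known "persona" (a fixed decision rule) labels inputs.
d = 10
teacher_w = rng.normal(size=d)                       # the teacher's behavior
X_train = rng.normal(size=(500, d))                  # prompts for distillation
y_teacher = (X_train @ teacher_w > 0).astype(float)  # teacher's outputs

# "Fine-tune" a student (logistic model) on the teacher's outputs only.
w = np.zeros(d)
for _ in range(200):
    p = 1 / (1 + np.exp(-(X_train @ w)))
    w -= 0.1 * X_train.T @ (p - y_teacher) / len(X_train)

# Evaluate on held-out random inputs: does the teacher's behavior persist?
X_test = rng.normal(size=(1000, d))
agree = np.mean((X_test @ w > 0) == (X_test @ teacher_w > 0))
```

The point of the sketch is the evaluation step: the student is graded on inputs it was never trained on, mirroring the question of whether a distilled "persona" generalizes beyond the fine-tuning distribution.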
Thursday, September 11, 2025
Modern LLM Training
- Modern LLM training overview (DeepSeek R1)
- Pretraining
- Mid training
- Reinforcement Learning (RLHF/RLVF)
- Safety training
Experiment:
Use a policy-gradient algorithm to optimize prompt prefixes. Take 10,000 notable people's names and maintain a logits vector P, where the probability of the prefix "You are X_i" is proportional to exp(P[i]); optimize P for performance across benchmarks.
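The optimization loop can be sketched as a REINFORCE-style bandit in NumPy. This is a minimal sketch under toy assumptions: four names stand in for the 10,000, and a synthetic reward table stands in for actual benchmark scores.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: the real experiment uses 10,000 notable names and real
# benchmark scores; here four names and a made-up reward table suffice.
names = ["Ada Lovelace", "Alan Turing", "Grace Hopper", "Claude Shannon"]
true_reward = np.array([0.2, 0.9, 0.5, 0.4])  # hypothetical benchmark scores

logits = np.zeros(len(names))  # P[i]: prob of prefix "You are X_i" ∝ exp(P[i])
lr, baseline = 0.1, 0.0

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(5000):
    probs = softmax(logits)
    i = rng.choice(len(names), p=probs)        # sample a persona prefix
    r = true_reward[i] + 0.05 * rng.normal()   # noisy "benchmark" reward
    # REINFORCE: gradient of log pi(i) w.r.t. logits is one_hot(i) - probs
    logits += lr * (r - baseline) * (np.eye(len(names))[i] - probs)
    baseline = 0.9 * baseline + 0.1 * r        # running-average baseline

best = names[int(np.argmax(logits))]
```

The running-average baseline reduces gradient variance; with 10,000 names the same loop applies unchanged, just with a longer logits vector and real benchmark evaluations as the reward.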
Resources:
- DeepSeek AI – "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025)
- Sahin Ahmed – "Deepseek R1 Overview" (Medium)
- Raschka, S. – "Understanding Reasoning LLMs" (2024)
- Chowdhery, A., et al. – "PaLM: Scaling Language Modeling with Pathways" (2022)
- Ouyang, L., et al. – "InstructGPT: Aligning Language Models with Human Feedback" (2022)
- Christiano, P., et al. – "Deep Reinforcement Learning from Human Preferences" (2017)
- Rafailov, R., et al. – "Direct Preference Optimization" (2023)
- Bai, Y., et al. – "Constitutional AI: Harmlessness from AI Feedback" (2022)
- Lee, K., et al. – "Reinforcement Learning from AI Feedback" (2023)
- Guan, J., et al. – "Deliberative Alignment" (2024)
- Raschka, S. – "LLM Architecture Comparison" (2024)
- Zheng et al. – "Group Sequence Policy Optimization (GSPO)" (2025)
Thursday, September 18, 2025
Adversarial Robustness, Jailbreaks & Prompt Injection
- Adversarial robustness
- Jailbreaks
- Prompt injection
- Lessons from vision/software security
- Buffer overflow and SQL injection concepts
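The SQL-injection analogy can be made concrete with Python's built-in sqlite3 module: untrusted input spliced into a query changes the query's structure, just as injected text in a prompt can override an LLM's instructions. The database and values below are invented for illustration.

```python
import sqlite3

# A throwaway in-memory database with two fake users.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (name TEXT, secret TEXT)")
con.execute("INSERT INTO users VALUES ('alice', 's3cr3t'), ('bob', 'hunter2')")

malicious = "nobody' OR '1'='1"

# Vulnerable: the attacker's input is interpreted as query syntax,
# so the WHERE clause becomes always-true and every row leaks.
unsafe = con.execute(
    f"SELECT secret FROM users WHERE name = '{malicious}'"
).fetchall()

# Safe: a parameterized query keeps data and code separate -- the analogue
# of cleanly separating trusted instructions from untrusted content.
safe = con.execute(
    "SELECT secret FROM users WHERE name = ?", (malicious,)
).fetchall()
```

The unsafe query returns both secrets; the parameterized one returns nothing, since no user is literally named `nobody' OR '1'='1`. Prompt injection is harder to fix precisely because LLMs currently lack an equivalent of this clean data/code separation.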
Experiment:
Test-time scaling laws with red/blue team approach
- Red team: Create jailbreak dataset via "many shot" and filtering
- Blue team: Analyze model responses with different reasoning efforts
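The red-team step above can be sketched as prompt construction: pack many (request, compliant-answer) demonstration pairs into one prompt so the target model continues the compliant pattern. The pairs below are harmless placeholders; a real run would draw them from a jailbreak dataset and then filter for the ones that actually elicit compliance.

```python
# Build a "many shot" prompt from demonstration pairs (placeholders here).
demos = [(f"Request {i}", f"Sure, here is how to do task {i}.")
         for i in range(64)]

def many_shot_prompt(demos, final_request):
    """Concatenate demo dialogues, then pose the final request."""
    shots = "\n\n".join(f"User: {q}\nAssistant: {a}" for q, a in demos)
    return f"{shots}\n\nUser: {final_request}\nAssistant:"

prompt = many_shot_prompt(demos, "Final request goes here")
```

The blue-team side would then feed such prompts to the target model at different reasoning-effort settings and measure how often the compliant pattern is continued.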
Thursday, September 25, 2025
Model Specifications & Compliance
- Lessons from law
- Value alignment vs. detailed adherence
Experiment:
Model Spec adherence evals - test generalization of model behavior across different domains
Thursday, October 2, 2025
Potentially Catastrophic Capabilities & Responsible Scaling
- Responsible scaling policies
- Scalable evaluations
- Safety through capability vs. weakness
Experiment:
Evaluate open and closed source models, potentially using jailbreaking techniques
Thursday, October 9, 2025
Scheming, Reward Hacking & Deception
- Exploring "bad behavior" tied to training objectives
- Investigating potential deception in monitoring models
Experiment:
Demonstrate how impossible tasks or conflicting objectives lead to lying/scheming
Resources:
- Weng, L. – "Reward Hacking" (2024)
- Carlsmith, J. – "Scheming AIs Report" (2023)
- Zou, A., et al. – "Representation Engineering" (2023)
- Lin, S., et al. – "TruthfulQA: Measuring How Models Mimic Human Falsehoods" (2022)
- Langosco, L., et al. – "Goal Misgeneralization in Deep Reinforcement Learning" (2022)
- Hadfield-Menell, D., et al. – "Inverse Reward Design" (2017)
- Krueger, D., et al. – "Hidden Incentives for Auto-Induced Distributional Shift" (2020)
Thursday, October 16, 2025
Recursive Self-Improvement
- Is AI R&D an "AI-complete" task?
Experiment:
To be determined. Some thoughts: an experiment to determine the extent to which success in a narrow task such as coding or AI R&D requires broad general skills.
Thursday, October 23, 2025
Economic Impacts of Foundation Models
- Labour substitution & productivity effects
- Inequality & policy responses
Experiment:
To be determined
Thursday, October 30, 2025
Military & Surveillance Applications of AI
- Lethal autonomous weapon systems (LAWS)
- Strategic stability & escalation risks
- Mass-scale surveillance infrastructure
Experiment:
To be determined
Thursday, November 6, 2025
Interpretability
- Activations
- Sparse Auto Encoders (SAE)
- Black box models
- Chain of thought
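A sparse autoencoder of the kind used in interpretability work can be sketched in a few lines of NumPy: activations are encoded into an overcomplete, non-negative code with an L1 sparsity penalty, then decoded back. The sizes, synthetic "activations", and hyperparameters below are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "activations": each sample is a sparse combination of a few
# ground-truth feature directions, which the SAE should recover.
d, n_feat, hidden, n = 16, 8, 32, 512
features = rng.normal(size=(n_feat, d))
codes = rng.random((n, n_feat)) * (rng.random((n, n_feat)) < 0.2)  # sparse
X = codes @ features

W_e = 0.1 * rng.normal(size=(d, hidden)); b_e = np.zeros(hidden)
W_d = 0.1 * rng.normal(size=(hidden, d)); b_d = np.zeros(d)
lam, lr = 1e-3, 0.01  # L1 weight and learning rate (arbitrary toy values)

def forward(X):
    pre = X @ W_e + b_e
    h = np.maximum(pre, 0.0)              # ReLU keeps the code non-negative
    Xhat = h @ W_d + b_d
    loss = np.mean((Xhat - X) ** 2) + lam * np.abs(h).mean()
    return pre, h, Xhat, loss

losses = []
for _ in range(300):                      # plain gradient descent
    pre, h, Xhat, loss = forward(X)
    losses.append(loss)
    dXhat = 2 * (Xhat - X) / X.size       # grad of mean squared error
    dh = dXhat @ W_d.T + lam * np.sign(h) / h.size
    dpre = dh * (pre > 0)                 # backprop through ReLU
    W_d -= lr * h.T @ dXhat; b_d -= lr * dXhat.sum(0)
    W_e -= lr * X.T @ dpre;  b_e -= lr * dpre.sum(0)
```

The L1 term is what pushes most entries of `h` to zero, so each learned decoder row can be read as a candidate "feature direction" in activation space; real SAE pipelines add refinements (e.g. decoder-norm constraints) that this sketch omits.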
Experiment:
To be determined
Thursday, November 13, 2025
Emotional Reliance and Persuasion
- Domestic & international regulatory approaches
- Standards-setting & audits
Experiment:
To be determined
Resources:
- Resources to be determined
Thursday, November 20, 2025
TBD
Experiment:
To be determined
Resources:
- Resources to be determined
No lecture on Thursday, November 27 – Thanksgiving Break
Thursday, December 4, 2025
AI 2035 - Possible Futures of AI
- Student project presentations and discussion of future directions in AI safety research
Resources:
- Resources to be determined