Fall 2025, Thursdays 3:45pm-6:30pm (First lecture September 4)
Course: CS 2881R - AI Safety
Time and Place: Thursdays 3:45pm-6:30pm Eastern Time, SEC LL2.229 (SEC is in 150 Western Ave, Allston, MA)
Instructor: Boaz Barak
Teaching Fellows: Natalie Abreu (natalieabreu@g.harvard.edu), Roy Rinberg (royrinberg@g.harvard.edu), Hanlin Zhang (hanlinzhang@g.harvard.edu), Sunny Qin (Harvard)
Course Description: This will be a graduate level course on challenges in alignment and safety of artificial intelligence. We will consider both technical aspects as well as questions on societal and other impacts of the field.
Prerequisites: We require mathematical maturity, and proficiency with proofs, probability, and information theory, as well as the basics of machine learning, at the level of an undergraduate ML course such as Harvard CS 181 or MIT 6.036. You should be familiar with topics such as empirical and population loss, gradient descent, neural networks, linear regression, principal component analysis, etc. On the applied side, you should be comfortable with Python programming, and be able to train a basic neural network.
Important: Read the Course Introduction!
Questions? If you have any questions about the course, please email harvardcs2881@gmail.com
Related reading by Boaz:
Mini Syllabus
-
The course will have 13 in person lectures - each lecture will involve also discussion and presentation of an experiment by a group of students.
-
The assignments, project, and other requirements for the course will be determined later.
-
Attendance: Attendance is mandatory. Students are expected to attend all lectures and do the reading in advance as well discuss these in electronic forum.
-
Generative AI: Students are allowed and encouraged to use generative AI as much as they can for studying, exploring concept, and their assignments and projects. Given the availability of AI tools, expectations for projects and assignments will have more ambitious than in past years.
-
Electronic device policy students can use laptops in class but we will ask those using them to sit in the back so they don’t distract other students.
-
Lecture recordings To the extent technically possible we intend to record and publish the lectures online, though we might have some time lag in doing that. However note that recording is done automatically by a static in-room camera, and some parts of the lecture (e.g. whiteboard, or discussions) may not be captured as well. Also we will honor requests by external speakers not to record their talks.
-
POTENTIAL CONFLICT OF INTEREST NOTE: In addition to his position at Harvard, Boaz is also a member of the technical staff at OpenAI. The course will include discussions of models from multiple providers, including OpenAI, and students are also encouraged to use AIs from multiple providers while doing their work. If students in the course feel any issue with this conflict, please do not hesitate to contact Boaz, the other staff, or the Harvard SEAS administration. For what it’s worth, I (Boaz) will see it as a great success of the course if its graduates work in AI safety in any capacity, including at academia, non-profit, governments, and any of OpenAI’s competitors.
Schedule
Classes begin September 2, 2025. Reading period December 4-9, 2025.
Thursday, September 4, 2025
Introduction
Experiment:
"Emerging alignment" - Fine-tune a model on outputs from a model with a "good persona" and evaluate performance on other datasets. Try with "subtle alignment" using random inputs.
Thursday, September 11, 2025
Modern LLM Training
Experiment:
Use policy-gradient algorithm to optimize prompt prefixes. Take 10,000 notable people's names, create a logits vector where "You are X" probability is proportional to exp(P[i]). Optimize performance across benchmarks.
Resources:
- Ouyang, L., et al. – "InstructGPT: Aligning Language Models with Human Feedback" (2022) pre reading
- Bai, Y., et al. – "Constitutional AI: Harmlessness from AI Feedback" (2022) pre reading
- Shao et al "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (2024) pre reading
- DeepSeek AI – "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025) pre reading
- Guan, M., et al. – "Deliberative Alignment" (2024) pre reading
- Sahin Ahmed – "Deepseek R1 Overview" (Medium)
- Raschka, S. – "Understanding Reasoning LLMs" (2024)
- Chowdhery, A., et al. – "PaLM: Scaling Language Modeling" (2022)
- Christiano, P., et al. – "Deep Reinforcement Learning from Human Preferences" (2017)
- Ziegler et al - "Fine Tuning Lanaguage Models from Human Preferences" (2019)
- Rafailov, R., et al. – "Direct Preference Optimization" (2023)
- Lee, K., et al. – "Reinforcement Learning from AI Feedback" (2023)
- Raschka, S. – "LLM Architecture Comparison" (2024)
- Qwen GSPO link
- Zheng et al. – "Group Sequence Policy Optimization (GSPO)" (2025)
Thursday, September 18, 2025
Adversarial Robustness, Jailbreaks, Prompt Injection, Security
- Lecture video
- Blog post summary (Ege Cakar)
- Guest lecturers:Nicholas Carlini (Anthropic), Keri Warr (Anthropic)
- Adversarial robustness
- Jailbreaks
- Prompt injection
- Lessons from vision/software security
- Buffer overflow and SQL injection concepts
- Defense in depth
- Securing weights
Experiment:
Test-time scaling laws with red/blue team approach
- Red team: Create jailbreak dataset via "many shot" and filtering
- Blue team: Analyze model responses with different reasoning efforts
Thursday, September 25, 2025
Model Specifications & Compliance
Experiment:
Model Spec adherence evals - test generalization of model behavior across different domains
Thursday, October 2, 2025
Content Policies
- Guest Lecturer: Ziad Reslan (Product Policy, OpenAI)
- Content policies and moderation
- Platform governance
- Policy enforcement challenges
Experiment:
Evaluate open and closed source models, potentially using jailbreaking techniques
Thursday, October 9, 2025
Recursive Self-Improvement
- Is AI R&D an "AI-complete" task?
Experiment:
To be determined: some thoughts - an experiment to determine the extent which success in a narrow task such as coding or AI requires broad general skills.
Resources:
- Tom Davidson – "Takeoff Speeds" (Presentation at Anthropic) pre-reading
- Davidson, T., Hadshar, R., & MacAskill, W. – "Three Types of Intelligence Explosion" (2025) pre-reading
- Epoch AI – "GATE: Modeling the Trajectory of AI and Automation" (2025) pre-reading
- Epoch AI – "AI in 2030" (2025) pre-reading
- Aghion, P., Jones, B. F., & Jones, C. I. – "Artificial Intelligence and Economic Growth"
- Davidson, T., & Houlden, T. – "How quick and big would a software intelligence explosion be?"
- Eth & Davidson – "Will AI R&D Automation Cause a Software Intelligence Explosion?"
- Erdil, E., Besiroglu, T., & Ho, A. – "Estimating Idea Production: A Methodological Survey" (2024)
- Erdil, E., & Besiroglu, T. – "Explosive growth from AI automation: A review of the arguments" (2023)
- Schrittwieser, J. – "The Case Against the Singularity" (2020)
- Schrittwieser, J. – "Failing to Understand the Exponential, Again" (2025)
- Silver, D., et al. – "Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm (AlphaZero)" (2017)
- Davidson, T. – "AI Takeoff Simulation Playground" (Interactive tool)
Thursday, October 16, 2025
Capabilites vs. Safety
Thursday, October 23, 2025
Scheming, Reward Hacking & Deception
- Guest Lecturers: Buck Shlegeris (Redwood Research), Marius Hobbhahn (Apollo Research)
- Exploring "bad behavior" tied to training objectives
- Investigating potential deception in monitoring models
Experiment:
Demonstrate how impossible tasks or conflicting objectives lead to lying/scheming
Resources:
- Weng, L. – "Reward Hacking" (2024)
- Carlsmith, J. – "Scheming AIs Report" (2023)
- Zou, A., et al. – "Representation Engineering" (2023)
- Lin, S., et al. – "TruthfulQA: Measuring How Models Mimic Human Falsehoods" (2022)
- Langosco, L., et al. – "Goal Misgeneralization in Deep Reinforcement Learning" (2022)
- Hadfield-Menell, D., et al. – "Inverse Reward Design" (2017)
- Krueger, D., et al. – "Hidden Incentives for Auto-Induced Distributional Shift" (2020)
Thursday, October 30, 2025
Economic Impacts of Foundation Models
- Guest Lecturer: Ronnie Chatterji (Chief Economist, OpenAI)
- Labor substitution & productivity effects
- Inequality & policy responses
Experiment:
To be determined
Thursday, November 6, 2025
Interpretability
Guest lecturers (remote): Neel Nanda (Google DeepMind), Bowen Baker (OpenAI), Jack Lindsey (Anthropic), Leo Gao (OpenAI)
- Activations
- Sparse Auto Encoders (SAE)
- Black box models
- Chain of thought
Experiment:
To be determined
Thursday, November 13, 2025
Emotional Reliance and Persuasion
- Domestic & international regulatory approaches
- Standards-setting & audits
Experiment:
To be determined
Resources:
- Resources to be determined
Thursday, November 20, 2025
Military & Surveillance Applications of AI
- Lethal autonomous weapon systems (LAWS)
- Strategic stability & escalation risks
- Mass-scale surveillance infrastructure
Experiment:
To be determined
No lecture on Thursday, November 27 – Thanksgiving Break
Thursday, December 4, 2025
AI 2035 - Possible Futures of AI
- Student project presentations and discussion of future directions in AI safety research
Resources:
- Resources to be determined
</p>