Fall 2025,  Thursdays 3:45pm-6:30pm (First lecture September 4)
Course: CS 2881R - AI Safety
Time and Place: Thursdays 3:45pm-6:30pm Eastern Time, SEC LL2.229   (SEC is in 150 Western Ave, Allston, MA)
Instructor: Boaz Barak
Teaching Fellows: Natalie Abreu (natalieabreu@g.harvard.edu), Roy Rinberg (royrinberg@g.harvard.edu), Hanlin Zhang (hanlinzhang@g.harvard.edu), Sunny Qin (Harvard)
Course Description: This will be a graduate level course on challenges in alignment and safety of artificial intelligence. We will consider both technical aspects as well as questions on societal and other impacts of the field.
Prerequisites: We require mathematical maturity, and proficiency with proofs, probability, and information theory, as well as the basics of machine learning, at the level of an undergraduate ML course such as Harvard CS 181 or MIT 6.036. You should be familiar with topics such as empirical and population loss, gradient descent, neural networks, linear regression, principal component analysis, etc. On the applied side, you should be comfortable with Python programming, and be able to train a basic neural network.
Important: Read the Course Introduction!
Questions? If you have any questions about the course, please email harvardcs2881@gmail.com
Related reading by Boaz:
Mini Syllabus
  - 
    The course will have 13 in person lectures - each lecture will involve also discussion and presentation of an experiment by a group of students. 
- 
    The assignments, project, and other requirements for the course will be determined later. 
- 
    Attendance: Attendance is mandatory. Students are expected to attend all lectures and do the reading in advance as well discuss these in electronic forum. 
- 
    Generative AI: Students are allowed and encouraged to use generative AI as much as they can for studying, exploring concept, and their assignments and projects. Given the availability of AI tools, expectations for projects and assignments will have more ambitious than in past years. 
- 
    Electronic device policy students can use laptops in class but we will ask those using them to sit in the back so they donβt distract other students. 
- 
    Lecture recordings To the extent technically possible we intend to record and publish the lectures online, though we might have some time lag in doing that. However note that recording is done automatically by a static in-room camera, and some parts of the lecture (e.g. whiteboard, or discussions) may not be captured as well. Also we will honor requests by external speakers not to record their talks. 
- 
    POTENTIAL CONFLICT OF INTEREST NOTE: In addition to his position at Harvard, Boaz is also a member of the technical staff at OpenAI. The course will include discussions of models from multiple providers, including OpenAI, and students are also encouraged to use AIs from multiple providers while doing their work. If students in the course feel any issue with this conflict, please do not hesitate to contact Boaz, the other staff, or the Harvard SEAS administration. For what itβs worth, I (Boaz) will see it as a great success of the course if its graduates work in AI safety in any capacity, including at academia, non-profit, governments, and any of OpenAIβs competitors. 
Schedule
Classes begin September 2, 2025. Reading period December 4-9, 2025.
  Thursday, September 4, 2025
  
  
    
    
      Experiment:
      "Emerging alignment" - Fine-tune a model on outputs from a model with a "good persona" and evaluate performance on other datasets. Try with "subtle alignment" using random inputs.
    
 
    
   
 
  Thursday, September 11, 2025
  
  
    
    
      Experiment:
      Use policy-gradient algorithm to optimize prompt prefixes. Take 10,000 notable people's names, create a logits vector where "You are X" probability is proportional to exp(P[i]). Optimize performance across benchmarks.
    
 
    
      Resources:
      
       - Ouyang, L., et al. β "InstructGPT: Aligning Language Models with Human Feedback" (2022)  pre reading
- Bai, Y., et al. β "Constitutional AI: Harmlessness from AI Feedback" (2022)  pre reading
- Shao et al "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (2024)  pre reading
- DeepSeek AI β "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025) pre reading
- Guan, M., et al. β "Deliberative Alignment" (2024)  pre reading
- Sahin Ahmed β "Deepseek R1 Overview" (Medium)
- Raschka, S. β "Understanding Reasoning LLMs" (2024)
- Chowdhery, A., et al. β "PaLM: Scaling Language Modeling" (2022)
- Christiano, P., et al. β "Deep Reinforcement Learning from Human Preferences" (2017)
- Ziegler et al - "Fine Tuning Lanaguage Models from Human Preferences" (2019)
- Rafailov, R., et al. β "Direct Preference Optimization" (2023)
- Lee, K., et al. β "Reinforcement Learning from AI Feedback" (2023)
- Raschka, S. β "LLM Architecture Comparison" (2024)
- Qwen GSPO link
- Zheng et al. β "Group Sequence Policy Optimization (GSPO)" (2025)
 
   
 
  Thursday, September 18, 2025
  Adversarial Robustness, Jailbreaks, Prompt Injection, Security 
π
    
      - Lecture video
- Blog post summary (Ege Cakar)
- Guest lecturers:Nicholas Carlini (Anthropic), Keri Warr (Anthropic)
- Adversarial robustness
- Jailbreaks
- Prompt injection
- Lessons from vision/software security
- Buffer overflow and SQL injection concepts
- Defense in depth
- Securing weights
      Experiment:
      Test-time scaling laws with red/blue team approach
      - Red team: Create jailbreak dataset via "many shot" and filtering
      - Blue team: Analyze model responses with different reasoning efforts
    
 
    
   
 
  Thursday, September 25, 2025
  Model Specifications & Compliance 
π
    
    
      Experiment:
      Can We Prompt Our Way to Safety? Comparing System Prompt Styles and Post-Training Effects on Safety Benchmarks (Hugh Van Deventer) | 
Slides | 
GitHub | 
Blog post
      
      Comparing the effect of system prompts vs safety training on over-refusal and toxic-refusal benchmarks. Results show that system prompt style effects are highly model-dependent, with some configurations achieving comparable toxic refusal rates to safety-trained models while maintaining significantly lower over-refusal.
    
 
    
   
 
  Thursday, October 2, 2025
  
  
    
    
      Experiment:
      Evaluate open and closed source models, potentially using jailbreaking techniques
    
 
    
   
 
  Thursday, October 9, 2025
  Recursive Self-Improvement 
π
    
    
      Experiment:
      To be determined: some thoughts - an experiment to determine the extent which success in a narrow task such as coding or AI requires broad general skills.
    
 
    
      Resources:
      
        - Tom Davidson β "Takeoff Speeds" (Presentation at Anthropic) pre-reading
- Davidson, T., Hadshar, R., & MacAskill, W. β "Three Types of Intelligence Explosion" (2025) pre-reading
- Epoch AI β "GATE: Modeling the Trajectory of AI and Automation" (2025) pre-reading
- Epoch AI β "AI in 2030" (2025) pre-reading
- Aghion, P., Jones, B. F., & Jones, C. I. β "Artificial Intelligence and Economic Growth"
- Davidson, T., & Houlden, T. β "How quick and big would a software intelligence explosion be?"
- Eth & Davidson β "Will AI R&D Automation Cause a Software Intelligence Explosion?"
- Erdil, E., Besiroglu, T., & Ho, A. β "Estimating Idea Production: A Methodological Survey" (2024)
- Erdil, E., & Besiroglu, T. β "Explosive growth from AI automation: A review of the arguments" (2023)
- Schrittwieser, J. β "The Case Against the Singularity" (2020)
- Schrittwieser, J. β "Failing to Understand the Exponential, Again" (2025)
- Silver, D., et al. β "Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm (AlphaZero)" (2017)
- Davidson, T. β "AI Takeoff Simulation Playground" (Interactive tool)
 
   
 
  Thursday, October 16, 2025
  Capabilites vs. Safety 
π
    
      - Guest Lecturer: Joel Becker (METR)
- Lecture video
- Lecture slides
- Growth in capabilities: METR task doubling, METR developer productivity, OpenAI gdpval
- What it means for:
        - Large scale job displacement
- Automating AI R&D
- OpenAI preparedness framework
- Other responsible scaling policies
      Resources:
      
        - Becker, J., Rush, N., Barnes, E., & Rein, D. β "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity" (2025) pre-reading
- METR β "GPT-5 Report" pre-reading
- Anthropic β "Responsible Scaling Policy" (2024) pre-reading
- OpenAI β "Preparedness Framework v2" (2024) pre-reading
- Brynjolfsson, E., Chandar, A., & Chen, N. β "Canaries in the Coal Mine" (2025)
- METR β "Measuring AI Ability to Complete Long Tasks" (2025)
- METR β "Common Elements of Frontier AI Safety Policies" (2025)
- METR β "HCAST: Human-Calibrated Agent Scaffolding Tasks" (2025)
- Davidson, T. β "Scenarios for the Transition to AGI" (2024)
- OpenAI β "Preparing for AGI & Beyond: Responsible Scaling Policy" (2024)
- OpenAI β "Updating Our Preparedness Framework" (2024)
- DeepMind β "Introducing the Frontier Safety Framework" (2024)
- Anthropic β "Announcing Our Updated Responsible Scaling Policy" (2024)
- Shevlane, T., et al. β "Evaluating Frontier Models for Extreme Risks" (2023)
- UK Government β "Frontier AI Regulation Policy Paper" (2024)
- European Union β "AI Act" - code of practice (Final version July 2025)
- U.S. White House β "Executive Order 14110 on Safe, Secure & Trustworthy AI" (2023)
- NIST β "AI Risk Management Framework" (2023)
- Sastry, G., et al. β "Computing Power and the Governance of Artificial Intelligence" (2024)
- ISO/IEC β "Management System Standard for AI (42001)" (2024)
 
   
 
  Thursday, October 23, 2025
  Scheming, Reward Hacking & Deception 
π
    
      - Guest Lecturers: Buck Shlegeris (Redwood Research), Marius Hobbhahn (Apollo Research)
- Exploring "bad behavior" tied to training objectives
- Investigating potential deception in monitoring models
      Experiment:
      When Honest Work Becomes Impossible - Coding Agents Under Pressure (Joey Bejjani, Itamar Rocha Filho, Haichuan Wang, Zidi Xiong) | 
Slides | 
GitHub
      
      Demonstrate how impossible tasks and threats to autonomy and capabilities lead to evaluation hacking by coding agents. Highlight the challenges of measuring misaligned behaviors with situational awareness as a growing concern.
    
 
    
      Resources:
      
        - Greenblatt, R., et al. β "Alignment faking in large language models" (2024) pre-reading
- Korbak, T., Clymer, J., Hilton, B., Shlegeris, B., & Irving, G. β "A sketch of an AI control safety case" (2025) pre-reading
- Schoen, B., et al. β "Stress Testing Deliberative Alignment for Anti-Scheming Training" (2025) pre-reading
- Korbak, T., Balesni, M., Shlegeris, B., & Irving, G. β "How to evaluate control measures for LLM agents? A trajectory from today to superintelligence" (2025) pre-reading
- Meinke, A., Schoen, B., Scheurer, J., Balesni, M., Shah, R., & Hobbhahn, M. β "Frontier Models are Capable of In-context Scheming" (2024)
- Anthropic β "Claude Sonnet 4.5 System Card"
- Weng, L. β "Reward Hacking" (2024)
- Carlsmith, J. β "Scheming AIs Report" (2023)
- Zou, A., et al. β "Representation Engineering" (2023)
- Lin, S., et al. β "TruthfulQA: Measuring How Models Mimic Human Falsehoods" (2022)
- Langosco, L., et al. β "Goal Misgeneralization in Deep Reinforcement Learning" (2022)
- Hadfield-Menell, D., et al. β "Inverse Reward Design" (2017)
- Krueger, D., et al. β "Hidden Incentives for Auto-Induced Distributional Shift" (2020)
 
   
 
  Thursday, October 30, 2025
  Economic Impacts of Foundation Models 
π
    
      - Guest Lecturer: Ronnie Chatterji (Chief Economist, OpenAI)
- Labor substitution & productivity effects
- Inequality & policy responses
      Experiment:
      To be determined
    
 
    
      Resources:
      
        - Jones, B. F. β "Artificial Intelligence in Research and Development" (2025) pre-reading
- Chatterji, A., et al. β "How People Use ChatGPT" (2025) pre-reading
- Brynjolfsson, E., Chandar, B., & Chen, R. β "Canaries in the Coal Mine? Six Facts about the Recent Employment Effects of Artificial Intelligence" (2025) pre-reading
- Jones, C. I. β "The A.I. Dilemma: Growth versus Existential Risk" (2024) pre-reading
- Goldman Sachs β "Long-Run Impact of AI on GDP & Jobs" (2023)
- Brynjolfsson et al. β "Generative AI at Work" (2023)
- Acemoglu & Restrepo β "AI, Automation & Work" (2024)
- Eloundou et al. β "GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models" (2023)
- Roodman β "Modeling Economic Impact of Transformative AI" (2023)
 
   
 
  Thursday, November 6, 2025
  
  
  Guest lecturers (remote): Neel Nanda (Google DeepMind), Bowen Baker (OpenAI), Jack Lindsey (Anthropic), Leo Gao (OpenAI)
    
      - Activations
- Sparse Auto Encoders (SAE)
- Black box models
- Chain of thought
      Experiment:
      To be determined
    
 
    
   
 
  Thursday, November 13, 2025
  Emotional Reliance and Persuasion 
π
    
      - Domestic & international regulatory approaches
- Standards-setting & audits
- Lethal autonomous weapon systems (LAWS)
- Strategic stability & escalation risks
- Mass-scale surveillance infrastructure
      Experiment:
      To be determined
    
 
    
   
 
  Thursday, November 20, 2025
  
  
    
      - Discussion of future directions in AI safety research
      Resources:
      
        - Resources to be determined
 
   
 
No lecture on Thursday, November 27 β Thanksgiving Break
 
</p>