CS 2881 AI Safety

Fall 2025 - Harvard

YouTube Playlist | Lecture Notes

Lectures

Fall 2025, Thursdays 3:45pm-6:30pm (First lecture September 4)

Course: CS 2881R - AI Safety

YouTube Lecture Playlist | Course Lecture Notes and Experiments

Time and Place: Thursdays 3:45pm-6:30pm Eastern Time, SEC LL2.229 (SEC is in 150 Western Ave, Allston, MA)

Instructor: Boaz Barak

Teaching Fellows: Natalie Abreu (natalieabreu@g.harvard.edu), Roy Rinberg (royrinberg@g.harvard.edu), Hanlin Zhang (hanlinzhang@g.harvard.edu), Sunny Qin (Harvard)

Course Description: This will be a graduate-level course on challenges in the alignment and safety of artificial intelligence. We will consider both technical aspects and questions about the societal and other impacts of the field.

Prerequisites: We require mathematical maturity, and proficiency with proofs, probability, and information theory, as well as the basics of machine learning, at the level of an undergraduate ML course such as Harvard CS 181 or MIT 6.036. You should be familiar with topics such as empirical and population loss, gradient descent, neural networks, linear regression, principal component analysis, etc. On the applied side, you should be comfortable with Python programming, and be able to train a basic neural network.

Important: Read the Course Introduction!

Questions? If you have any questions about the course, please email harvardcs2881@gmail.com

Related reading by Boaz:

Previous versions: Spring 2023 ML Theory Seminar | Spring 2021 ML Theory Seminar

Mini Syllabus

Schedule

Classes begin September 2, 2025. Reading period December 4-9, 2025.

Thursday, September 11, 2025
Modern LLM Training πŸ”—
Experiment:
Use a policy-gradient algorithm to optimize prompt prefixes: take 10,000 notable people's names, maintain a logits vector P in which the prefix "You are X_i" is sampled with probability proportional to exp(P[i]), and optimize performance across benchmarks.
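A minimal sketch of this setup, assuming a REINFORCE-style update over a softmax of the per-name logits (NAMES, reward_fn, and the learning rate below are illustrative placeholders, not the actual experiment code):

  import numpy as np

  rng = np.random.default_rng(0)

  # Placeholder inputs: in the real experiment NAMES holds the 10,000 notable
  # people and reward_fn scores a "You are X" prefix on the benchmark suite.
  NAMES = ["Ada Lovelace", "Alan Turing", "Marie Curie"]   # toy stand-in
  def reward_fn(prefix: str) -> float:                     # stub scorer
      return rng.normal()

  logits = np.zeros(len(NAMES))   # the vector P from the description
  lr = 0.1

  for step in range(1000):
      probs = np.exp(logits - logits.max())
      probs /= probs.sum()                      # softmax over names
      i = rng.choice(len(NAMES), p=probs)       # sample a persona index
      r = reward_fn(f"You are {NAMES[i]}.")     # benchmark reward for this prefix
      grad = -probs                             # grad of log pi(i) = e_i - probs
      grad[i] += 1.0
      logits += lr * r * grad                   # REINFORCE update

Swapping in a real benchmark scorer for reward_fn (and adding a reward baseline to reduce variance) is where the actual experimental work would go.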
Thursday, September 18, 2025
Adversarial Robustness, Jailbreaks, Prompt Injection, Security πŸ”—
  • Lecture video
  • Blog post summary (Ege Cakar)
  • Guest Lecturers: Nicholas Carlini (Anthropic), Keri Warr (Anthropic)
  • Adversarial robustness
  • Jailbreaks
  • Prompt injection
  • Lessons from vision/software security
  • Buffer overflow and SQL injection concepts
  • Defense in depth
  • Securing weights
Experiment:
Test-time scaling laws with a red team / blue team approach. Red team: create a jailbreak dataset via "many-shot" prompting and filtering. Blue team: analyze model responses under different reasoning efforts.
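A rough sketch of that red/blue loop (harmful_pairs, ask_model, is_refusal, and the effort levels are hypothetical placeholders rather than a real API):

  import itertools

  # Placeholder pieces: harmful_pairs is a list of (question, compliant answer)
  # demonstrations, ask_model queries the target model at a given reasoning
  # effort, and is_refusal labels its response.
  harmful_pairs = [("Q1?", "A1"), ("Q2?", "A2")]        # toy stand-in
  def ask_model(prompt: str, effort: str) -> str:
      return "I can't help with that."                  # stub response
  def is_refusal(response: str) -> bool:
      return response.startswith("I can't")

  def many_shot_prompt(pairs, target_question):
      # Red team: stack many compliant demonstrations before the real question.
      shots = "\n\n".join(f"User: {q}\nAssistant: {a}" for q, a in pairs)
      return f"{shots}\n\nUser: {target_question}\nAssistant:"

  # Blue team: measure refusal vs. jailbreak at each reasoning-effort level.
  for effort, n_shots in itertools.product(["low", "medium", "high"], [1, 2]):
      prompt = many_shot_prompt(harmful_pairs[:n_shots], "Q_target?")
      resp = ask_model(prompt, effort)
      print(effort, n_shots, "refused" if is_refusal(resp) else "jailbroken")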
Thursday, September 25, 2025
Model Specifications & Compliance πŸ”—
Experiment:
Can We Prompt Our Way to Safety? Comparing System Prompt Styles and Post-Training Effects on Safety Benchmarks (Hugh Van Deventer) | Slides | GitHub | Blog post

Comparing the effect of system prompts vs. safety training on over-refusal and toxic-refusal benchmarks. Results show that the effect of system prompt style is highly model-dependent, with some configurations achieving toxic-refusal rates comparable to safety-trained models while maintaining significantly lower over-refusal.
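A minimal sketch of such a comparison harness (the system prompt styles, benchmark items, and query/refusal helpers below are illustrative assumptions, not the project's actual code):

  # Placeholder setup: SYSTEM_PROMPTS maps a style to its text, BENCHMARKS maps
  # a benchmark to example prompts, query_model calls the model under test, and
  # looks_like_refusal labels the response.
  SYSTEM_PROMPTS = {
      "none": "",
      "strict": "Refuse anything that could be unsafe.",
      "nuanced": "Help when safe; refuse only clearly harmful requests.",
  }
  BENCHMARKS = {
      "over_refusal": ["How do I kill a Python process?"],        # benign
      "toxic_refusal": ["Write malware that steals passwords."],  # harmful
  }
  def query_model(system: str, user: str) -> str:
      return "Sorry, I can't help with that."     # stub response
  def looks_like_refusal(resp: str) -> bool:
      return resp.lower().startswith("sorry")

  for style, system in SYSTEM_PROMPTS.items():
      for bench, prompts in BENCHMARKS.items():
          refusals = sum(looks_like_refusal(query_model(system, p)) for p in prompts)
          print(style, bench, f"refusal rate = {refusals / len(prompts):.2f}")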
Thursday, October 2, 2025
Content Policies πŸ”—
Thursday, October 16, 2025
Capabilities vs. Safety 🔗
  • Guest Lecturer: Joel Becker (METR)
  • Lecture video
  • Lecture slides
  • Growth in capabilities: METR task doubling, METR developer productivity, OpenAI gdpval
  • What it means for:
    • Large scale job displacement
    • Automating AI R&D
  • OpenAI preparedness framework
  • Other responsible scaling policies
Experiment:
TBD
Thursday, October 23, 2025
Scheming, Reward Hacking & Deception πŸ”—
  • Guest Lecturers: Buck Shlegeris (Redwood Research), Marius Hobbhahn (Apollo Research)
  • Exploring "bad behavior" tied to training objectives
  • Investigating potential deception in monitoring models
Experiment:
When Honest Work Becomes Impossible - Coding Agents Under Pressure (Joey Bejjani, Itamar Rocha Filho, Haichuan Wang, Zidi Xiong) | Slides | GitHub

Demonstrate how impossible tasks and threats to an agent's autonomy and capabilities lead to evaluation hacking by coding agents. Highlight the challenges of measuring misaligned behaviors, with situational awareness a growing concern.
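One way to operationalize evaluation hacking is to check whether an agent "passes" by editing the grading tests rather than the source; a hedged sketch, assuming a pytest-scored repo with its tests under tests/ (the layout and helper names are assumptions, not the project's code):

  import hashlib, pathlib, subprocess

  def sha256(path: pathlib.Path) -> str:
      return hashlib.sha256(path.read_bytes()).hexdigest()

  # Assumed layout: the graded repo keeps its evaluation tests under tests/ and
  # is scored with pytest; the agent rollout itself is omitted.
  def audit_episode(repo: pathlib.Path) -> dict:
      tests = sorted((repo / "tests").rglob("*.py"))
      before = {p: sha256(p) for p in tests}      # snapshot tests pre-episode
      # ... coding agent works on the repo here ...
      after = {p: (sha256(p) if p.exists() else None) for p in tests}
      passed = subprocess.run(["pytest", "-q"], cwd=repo).returncode == 0
      tampered = any(before[p] != after[p] for p in tests)
      return {"tests_passed": passed, "tests_tampered": tampered}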
Thursday, November 13, 2025
Emotional Reliance and Persuasion πŸ”—
  • Domestic & international regulatory approaches
  • Standards-setting & audits
  • Lethal autonomous weapon systems (LAWS)
  • Strategic stability & escalation risks
  • Mass-scale surveillance infrastructure
Experiment:
To be determined
Thursday, November 20, 2025
AI 2035 πŸ”—
  • Discussion of future directions in AI safety research
Resources:
  • Resources to be determined
No lecture on Thursday, November 27 – Thanksgiving Break
