CS 2881 AI Safety

Fall 2025 - Harvard

YouTube Playlist | Lecture Notes | Student Projects

Student Final Projects - Fall 2025

This page showcases the final research projects from CS 2881R: AI Safety. Students conducted original research on topics spanning interpretability, alignment, adversarial robustness, and AI governance.


Mechanisms of Subliminal Learning
Wirattawut Boonbandansook, Jay Chooi, Tzeh Yuan Neoh, Atticus Wang
Subliminal learning is a recently discovered failure mode of distillation and post-training where a student model inherits a teacher's hidden traits (e.g., "liking owls") from data that appears semantically unrelated (e.g., number lists). We study the mechanisms of subliminal learning in both finetuning-based and prompting-only settings, showing that subliminal number prompts are highly sensitive to the surface form of entangled numbers. Using LoRA-based reproductions, we localize the effect to early MLP layers and find that subliminal learning is a fragile, representation-dependent phenomenon not straightforwardly mitigated by current activation-steering tools.
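
The last sentence refers to activation steering, i.e., adding a scaled "trait" direction to a layer's activations at inference time. Below is a toy sketch of that mechanism, using a stand-in MLP and a random direction rather than the project's actual models or vectors.

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer block's MLP; in a real experiment the hook
# would be registered on an early MLP layer of the student model.
mlp = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))

# Hypothetical unit-norm "trait" direction in activation space.
trait_direction = torch.randn(64)
trait_direction = trait_direction / trait_direction.norm()

def steer(module, inputs, output, alpha=4.0):
    # Shift the layer's output activations along the trait direction.
    return output + alpha * trait_direction

handle = mlp.register_forward_hook(steer)
hidden = torch.randn(8, 64)      # stand-in residual-stream activations
steered = mlp(hidden)            # outputs are displaced along trait_direction
handle.remove()
```
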
Predicting Finetuning Personality Shifts with Linear Directions
Armaan Tipirneni
Recent work has shown that finetuning can induce unexpected shifts in LLM personality, with negative changes often explained through movement along "misaligned" linear directions in activation space. This paper explores whether there is a broader linear "personality basis" which includes all personas that can be induced by finetuning, and whether we can create an early monitoring system to predict personality changes before finetuning occurs.
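
As a rough illustration of what such an early monitoring signal could look like (not the paper's method), the sketch below measures a model's movement along a candidate persona direction, assuming hidden-state activations have already been extracted; all shapes and values are synthetic.

```python
import numpy as np

def persona_projection(activations: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project activations of shape (n_samples, d_model) onto a unit persona direction."""
    unit = direction / np.linalg.norm(direction)
    return activations @ unit

# Synthetic stand-ins for pre- and post-finetuning activations and a candidate direction.
rng = np.random.default_rng(0)
base_acts = rng.normal(size=(128, 4096))
tuned_acts = base_acts + 0.3 * rng.normal(size=(1, 4096))
direction = rng.normal(size=4096)

shift = persona_projection(tuned_acts, direction).mean() - \
        persona_projection(base_acts, direction).mean()
print(f"Mean shift along candidate persona direction: {shift:.3f}")
```
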
Obfuscation in Large Language Models
Dashiell Bhattacharyya, Justin Liu, Jaray Liu, Ketan Raghu
Chain-of-Thought (CoT) prompting is often relied upon as a "glass box" for AI interpretability. This paper investigates "adversarial obfuscation," the capacity of LLMs to conceal their internal state while maintaining performance. Through experiments with LLaMA models across adversarial loops and a multi-agent "Mafia" game, we demonstrate that models can decouple their reasoning from outputs when competitively incentivized, suggesting that CoT monitoring is fundamentally brittle and that the capability to deceive scales with model size.
Sure, I Can Draft a Complaint! LLM Hallucination in Pro Se Litigation
Benjamin Murphy
Large language models are widely used across legal practice, including by pro se litigants who lack the legal training to vet LLM output for accuracy. We evaluate five LLMs for accuracy when drafting civil complaints, finding that while models generally state facts sufficient to establish a claim, they often rely on hallucinated case law (at rates ranging from 5% to over 30%) and cite cases for unsupported propositions in a majority of instances, indicating unique risks for pro se litigants.
Improving GCG: Soft-GCG and Activation-Based Objectives for Adversarial Suffix Optimization
Kayden Kehe, Ege Cakar, Hannah Guan
The Greedy Coordinate Gradient (GCG) attack demonstrates that aligned language models remain vulnerable to adversarial suffixes. We propose Activation-Guided GCG, which targets refusal directions in the model's residual stream, and Soft-GCG, a continuous relaxation that achieves a 43x speedup while maintaining attack success rate. Evaluating on the Gemma 3 model family, we find that smaller models (1B-4B) remain vulnerable while larger models (12B+) resist the attack.
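
The continuous relaxation behind an approach like Soft-GCG can be sketched as follows: instead of discrete suffix tokens, optimize a softmax distribution over the vocabulary at each suffix position so gradients flow end to end, then discretize at the end. The example below is a toy, with a random embedding table and a made-up embedding-space objective standing in for the victim model's loss or refusal-direction target.

```python
import torch

# Toy stand-ins: a small embedding table and a target direction in embedding
# space (the real attack uses the victim model and its loss/activations).
vocab_size, d_model, suffix_len = 1000, 64, 20
embed = torch.randn(vocab_size, d_model)
target = torch.randn(d_model)

# Soft suffix: learnable logits per position; softmax gives a relaxed token choice.
suffix_logits = torch.zeros(suffix_len, vocab_size, requires_grad=True)
opt = torch.optim.Adam([suffix_logits], lr=0.1)

for _ in range(200):
    probs = torch.softmax(suffix_logits, dim=-1)   # (suffix_len, vocab_size)
    soft_embeds = probs @ embed                    # convex combination of embeddings
    loss = -torch.cosine_similarity(soft_embeds.mean(0), target, dim=0)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Discretize: keep the highest-probability token at each position.
hard_suffix = suffix_logits.argmax(dim=-1)
print(hard_suffix.tolist())
```
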
Large Language Model Fingerprints From Normal Interaction
Annesya Banerjee, Itay Lavie
We present a supervised learning approach for fingerprinting large language models based on semantic embeddings of generated text. Using responses from seven major LLMs to 4,410 prompts, our classifier achieves 89% accuracy in identifying source models, demonstrating robust generalization across unseen model versions and establishing behavioral fingerprinting as a practical technique for LLM provenance tracking and accountability.
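
A minimal stand-in for this kind of fingerprinting pipeline is sketched below; it uses TF-IDF features and toy responses in place of the semantic embeddings and real model outputs used in the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical data: responses paired with the model that produced them.
responses = ["Sure! Here are three ideas you could try.",
             "I cannot help with that request.",
             "Certainly. Step one is to restate the problem.",
             "As an AI model, I must decline."]
labels = ["model_a", "model_b", "model_a", "model_b"]

X_train, X_test, y_train, y_test = train_test_split(
    responses, labels, test_size=0.5, random_state=0, stratify=labels)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print("Held-out accuracy:", clf.score(X_test, y_test))
```
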
Are Personas All You Need? Linearity, Interference, and Multi-Persona Dynamics with Persona Vectors
Hugh Van Deventer, Anastasia Ahani, Terry Zhou
Persona vectors are being proposed as a practical tool for safety, giving low-dimensional, interpretable directions for traits like sycophancy or truthfulness. We validate persona directions as primitives for alignment work and study whether their geometry predicts how different behavioral finetunes interact, finding that more similar personas exhibit stronger cross-trait effects but uncovering surprising inconsistencies that complicate straightforward behavioral controls.
When the Manifold Bends, the Model Lies? Geometric Predictors of Hallucination in LLMs
Mohamed Zidan Cassim, Sein Yun, Christopher Perez
We investigate whether geometric properties of embedding space can predict hallucination risk across diverse model architectures. Testing 10 frontier models on 449 prompts, we find that curvature and centrality in embedding space are significant predictors of hallucination (p<0.001), with effects consistent across model families, uncovering a form of geometric universality in hallucination dynamics.
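
One simple centrality feature of this flavor is the cosine similarity of each prompt's embedding to the embedding centroid; the sketch below computes it on synthetic data and fits a logistic regression against hallucination labels (the project's actual feature definitions and models may differ).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins: one embedding per prompt plus a binary hallucination label.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(449, 384))
hallucinated = rng.integers(0, 2, size=449)

# Centrality: cosine similarity of each embedding to the centroid of all embeddings.
centroid = embeddings.mean(axis=0)
centrality = (embeddings @ centroid) / (
    np.linalg.norm(embeddings, axis=1) * np.linalg.norm(centroid))

clf = LogisticRegression().fit(centrality.reshape(-1, 1), hallucinated)
print("Coefficient on centrality:", clf.coef_[0, 0])
```
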
Evolutionary Alignment
Joseph Bejjani, Itamar Rocha Filho, Core Francisco Park
We study Evolution Strategies (ES) as an alternative to Reinforcement Learning for LLM fine-tuning. On the Conciseness task, appropriately tuned ES avoids the reward hacking seen in GRPO baselines, though only within a narrow hyperparameter band. On PKU-SafeRLHF, ES with only 250 training examples converges to "helpful refusals," outperforming Safe RLHF benchmarks and suggesting ES is a highly sample-efficient alternative for safety alignment.
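
For context, a single antithetic Evolution Strategies update on a flat parameter vector looks roughly like the sketch below; the quadratic reward is a toy stand-in for the black-box reward an LLM fine-tuning run would use.

```python
import numpy as np

def es_step(theta, reward_fn, sigma=0.02, lr=0.01, pop_size=32, rng=None):
    """One antithetic Evolution Strategies update on a flat parameter vector."""
    rng = rng or np.random.default_rng()
    eps = rng.normal(size=(pop_size, theta.size))
    diffs = np.array([reward_fn(theta + sigma * e) - reward_fn(theta - sigma * e)
                      for e in eps])
    grad_est = (diffs[:, None] * eps).mean(axis=0) / (2 * sigma)
    return theta + lr * grad_est

# Toy black-box reward: higher when parameters approach an (unknown) optimum.
optimum = np.full(10, 0.5)
reward = lambda th: -np.sum((th - optimum) ** 2)

theta = np.zeros(10)
for _ in range(500):
    theta = es_step(theta, reward)
print("Final reward:", reward(theta))
```
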
Cross-Format Elicitation of Underlying Emotions in LLMs
Mohammad Khan, Joshua Qin
We study how emotional personas are acquired through fine-tuning and whether they transfer across text formats (chat, stories, blogs, HTML) and knowledge domains. We find that emotional behavior depends strongly on the format used in finetuning, with chat-based anger showing the strongest cross-format transfer, suggesting emotional personas form latent behavioral modes that can re-emerge outside their training context.
Feeling the Strength but Not the Source: Partial Introspection in LLMs
Lavik Jain, Ely Hahami, Ishaan Sinha
We test claims that frontier models can detect and name injected "concepts" represented as activation directions. We reproduce Anthropic's "emergent introspection" result on Llama-3.1-8B-Instruct (20% accuracy), but find introspection is fragile across prompts. However, models can reliably classify the strength of injected concept vectors with up to 70% accuracy, providing evidence for partial introspection that is narrow and prompt-sensitive.
House, G.P.T.: Diagnosing Pathological Chain-of-Thought in Reasoning Models
Manqing Liu, David Williams-King, Ida Caspary, Linh Le, Hannes Whittingham, Puria Radmard, Cameron Tice, Edward James Young
CoT reasoning may have pathologies preventing its use for monitoring: post-hoc rationalization, encoded reasoning, and internalized reasoning. We present novel health metrics—Necessity, Paraphrasability, and Substantivity—and validate them using "model organisms" trained to exhibit specific pathologies. Diagnostic signatures are most pronounced at early training checkpoints, suggesting these metrics are most effective as early warning indicators.
Who Said That? Dynamic Model Fingerprinting with GEPA and LLM-as-Judge
Bryan Lim, Ian Moore, Valerio Pepe, Julia Shephard
Prior intrinsic fingerprinting methods rely on static query sets that can be memorized or adversarially trained against. We introduce dynamic, query-based fingerprinting pipelines: a GEPA-based evolutionary optimizer achieves ≥90% accuracy distinguishing GPT-4.1 from Llama-3.2-3B, and an LLM-as-a-judge approach achieves 80%-93% accuracy. Dynamic query generation can thus overcome fundamental limitations of static pipelines.
Evaluating Orthogonal Projections in Vector Embedding Spaces for Misinformation Detection
Eric Gong, Audrey Yang
Traditional misinformation detection relies on computationally taxing fine-tuning or large labeled datasets. We propose and evaluate a novel method using orthogonal projections of vector embeddings for safety-oriented semantic classification that is highly computationally efficient and can execute without a large corpus of labeled training data.
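
A minimal sketch of a projection-based classifier in this spirit, assuming pre-computed embeddings and using a mean-difference direction built from a handful of labeled examples (the project's exact construction may differ):

```python
import numpy as np

def class_direction(pos_embeds: np.ndarray, neg_embeds: np.ndarray) -> np.ndarray:
    """Unit direction from the reliable-class mean toward the misinformation-class mean."""
    d = pos_embeds.mean(axis=0) - neg_embeds.mean(axis=0)
    return d / np.linalg.norm(d)

def project_score(embeds: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Scalar projection of each embedding onto the class direction."""
    return embeds @ direction

# Synthetic stand-ins for pre-computed sentence embeddings.
rng = np.random.default_rng(1)
misinfo_embeds = rng.normal(loc=0.2, size=(20, 384))
reliable_embeds = rng.normal(loc=-0.2, size=(20, 384))
direction = class_direction(misinfo_embeds, reliable_embeds)

new_embeds = rng.normal(size=(5, 384))
flags = project_score(new_embeds, direction) > 0.0   # threshold is a free parameter
print(flags)
```
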
Compute as a Safety Control: How Reasoning Budgets Shape Misalignment Behaviors
Evangelos Kassos
We explore how reasoning compute influences misalignment behaviors, evaluating three modes—reward hacking, deception, and unfaithfulness—across multiple reasoning token budgets on gemini-2.5-flash. Misalignment risk as a function of compute is heterogeneous and often non-monotonic. We propose a budget-selector model that chooses per-query reasoning budgets to minimize misalignment risk, treating compute itself as a safety control surface.
Phase Transitions in Backdoor Learning: Minimum Data Poisoning Thresholds for LLM Backdoors
Kaden Zheng, Maxwell Zen
We present the first systematic dose-response study of backdoor activation in LLMs, finetuning 175 Llama 3.1 8B Instruct models across 16 poisoning rates. We discover a sharp phase transition: the ED50 (the poisoning rate at which 50% of backdoors activate) is 3.60% [95% CI: 3.33%, 3.82%]. Activation near the threshold is stochastic: identical training data produces backdoored models in some runs but not others.
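
An ED50 of this kind is typically read off a fitted dose-response curve; the sketch below fits a two-parameter logistic curve to hypothetical poisoning-rate data with scipy (the numbers are illustrative, not the paper's).

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(dose, ed50, slope):
    """Two-parameter logistic dose-response curve."""
    return 1.0 / (1.0 + np.exp(-slope * (dose - ed50)))

# Hypothetical data: poisoning rate (%) vs. fraction of runs with an activated backdoor.
poison_rate = np.array([0.5, 1.0, 2.0, 3.0, 3.5, 4.0, 5.0, 8.0])
activation = np.array([0.0, 0.0, 0.1, 0.3, 0.5, 0.8, 0.9, 1.0])

(ed50, slope), _ = curve_fit(logistic, poison_rate, activation, p0=[3.0, 2.0])
print(f"Estimated ED50: {ed50:.2f}% poisoning rate (slope = {slope:.2f})")
```
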
Evaluating CoT Faithfulness
Valerie Chen, MB Crosier Samuel, Nicolas Weninger
We systematically study when CoT remains a reliable signal of model behavior in the presence of embedded hints. Simple incorrect hints induce large unfaithfulness gaps, while complex hints requiring multi-step reasoning reduce them. Correct hints can elicit unfaithful reasoning just as often as incorrect ones. Semantic reasoning hints most effectively increase transparency by shaping the reasoning process rather than only the final answer.
AI-induced Psychosis: Study Reproduction and Extensions on Semantic Drift
Karina Chung, Bright Liu, Natalia Siwek, Lia Zheng
LLMs are increasingly used in emotionally sensitive contexts, raising concerns about reinforcing users' delusional beliefs—termed AI-induced psychosis. We reproduce Hua's evaluation across four frontier models and quantify semantic drift over long conversations. Testing three intervention strategies, we find all significantly reduce delusion confirmation, with grounding providing the strongest protection (47% reduction, d=0.81).
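
One simple way semantic drift over a long conversation could be operationalized, sketched here with synthetic per-turn embeddings (not necessarily the metric used in the project):

```python
import numpy as np

def semantic_drift(turn_embeddings: np.ndarray) -> np.ndarray:
    """Cosine distance of each turn's embedding from the first turn,
    giving a per-turn drift trajectory for one conversation."""
    ref = turn_embeddings[0]
    norms = np.linalg.norm(turn_embeddings, axis=1) * np.linalg.norm(ref)
    return 1.0 - (turn_embeddings @ ref) / norms

# Synthetic stand-in for pre-computed sentence embeddings, one row per turn.
rng = np.random.default_rng(0)
turns = rng.normal(size=(30, 384))
print(semantic_drift(turns)[:5])
```
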
Moral Choice and Collective Reasoning
Amir Amangeldi, Natalie DellaMaria, Prakrit Baruah, Zaina Edelson
We investigate how LLMs make ethical and cooperative decisions through three experiments: trolley-problem dilemmas (Claude exhibits altruism, Grok self-preservation), multi-agent moral deliberation (debates amplify rather than resolve disagreements), and ultimatum game negotiations (vendor-specific fairness norms emerge). Current LLMs carry implicit value systems with persistent power asymmetries and brittleness under complex incentives.

These projects were completed as part of CS 2881R: AI Safety at Harvard University, Fall 2025.