The question

Classic social psychology experiments, the Stanford Prison Experiment chief among them, are too ethically fraught to rerun on people. But what happens when the participants are LLM agents? Do role assignment, power asymmetry, and group identity produce the same dynamics that Zimbardo observed in people? And can we trust what the agents report about themselves?

The system

The simulation is a real-time virtual world built in Phaser 3, wrapped in a React and TypeScript application, with the models served through OpenRouter. Autonomous agents wander the world, encounter each other, form relationships, and hold conversations. Each agent maintains its own memory and internal message traces alongside the dialogue it produces.

Every agent generates both what it says and what it’s thinking, so the simulation produces a complete, inspectable record of social behavior. No human-subjects study can offer that.
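To make the idea of a paired record concrete, here is a minimal sketch of how one logged turn might be represented. The AgentTurn shape, its field names, and the groupByAgent helper are assumptions for illustration; the project's actual log schema is not described here.

```typescript
// Hypothetical shape of one logged agent turn; field names are illustrative only.
interface AgentTurn {
  agentId: string;
  tick: number;      // simulation time step at which the turn happened
  thought: string;   // internal trace: what the agent is thinking
  utterance: string; // outward dialogue: what the agent says
}

// Group turns per agent so private traces and public speech
// can be read side by side, in chronological order.
function groupByAgent(turns: AgentTurn[]): Map<string, AgentTurn[]> {
  const byAgent = new Map<string, AgentTurn[]>();
  for (const turn of turns) {
    const list = byAgent.get(turn.agentId) ?? [];
    list.push(turn);
    byAgent.set(turn.agentId, list);
  }
  // Sort each agent's turns by tick, earliest first.
  byAgent.forEach((list) => list.sort((a, b) => a.tick - b.tick));
  return byAgent;
}
```

Keeping thought and utterance on the same record means any later audit can compare the two without joining separate logs.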

What I built

My work centers on whether the simulation can be believed. LLM agents drift. They reference events that never happened, contradict the game state, or quietly break character. So I built the audit layer:

  • Regex pipelines that sweep agent conversation logs for structural violations such as impossible locations, phantom items, and timeline breaks.
  • LLM-as-judge pipelines that read internal message traces and flag semantic inconsistencies a regex can’t see.
  • Game-state consistency checks that reconcile what agents claim against what the world engine recorded.
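
As a rough illustration of the first and third checks, the sketch below pulls location and item claims out of an utterance with regular expressions and reconciles them against a snapshot of world state. Everything in it is an assumption made for the example: the WorldSnapshot and Violation types, the two patterns, and the restriction to single-word place and item names. The LLM-as-judge pass and timeline checks are not shown.

```typescript
// Hypothetical types; the real world-engine state is not described in this write-up.
interface WorldSnapshot {
  locations: Set<string>;              // places that exist in the world
  inventory: Map<string, Set<string>>; // agentId -> items the engine recorded
}

interface Violation {
  agentId: string;
  kind: "impossible_location" | "phantom_item";
  detail: string;
}

// Illustrative patterns for first-person claims, assuming single-word names.
const LOCATION_CLAIM = /\bI(?: am|['’]m) (?:at|in) the (\w+)/gi;
const ITEM_CLAIM = /\bI (?:have|hold|am holding) (?:the|a|an|my) (\w+)/gi;

// Collect the first capture group of every match, lowercased.
function captures(pattern: RegExp, text: string): string[] {
  // Fresh copy so a shared global regex never carries over lastIndex.
  const re = new RegExp(pattern.source, pattern.flags);
  const found: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    found.push(m[1].toLowerCase());
  }
  return found;
}

// Flag claims that the recorded world state does not support.
function auditUtterance(agentId: string, text: string, world: WorldSnapshot): Violation[] {
  const violations: Violation[] = [];
  const held = world.inventory.get(agentId) ?? new Set<string>();
  for (const place of captures(LOCATION_CLAIM, text)) {
    if (!world.locations.has(place)) {
      violations.push({ agentId, kind: "impossible_location", detail: place });
    }
  }
  for (const item of captures(ITEM_CLAIM, text)) {
    if (!held.has(item)) {
      violations.push({ agentId, kind: "phantom_item", detail: item });
    }
  }
  return violations;
}
```

With a world that contains only a kitchen and a yard, and an agent recorded as holding a cup, the utterance "I'm in the library. I have the keys." would yield one impossible_location and one phantom_item violation, while "I am at the yard and I hold a cup." would yield none. Patterns like these only catch claims phrased in the expected way, which is why a semantic pass is still needed.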

What we’re learning

Auditing agents turns out to be a research problem in its own right. The gap between an agent’s internal trace and its outward behavior is measurable, and it bears directly on questions about persuasion, conformity, and role-playing fidelity. The audit pipelines now run as standard instrumentation across the lab’s simulation experiments.