Seven sections. 47+ primary sources. Two reading tracks. History, failure modes, alignment methods, institutions, risk domains, governance frameworks. Maintained as the landscape changes.
This document updates as the landscape changes — when laws come into force, when institutes rebrand, when new research lands. Every major claim traces to a primary source. Date-stamp: April 2026. AI safety rewards traceable work.
Modern AI safety emerges from a structural tension embedded in the field's founding logic: intelligence as computation and control. Alan Turing's 1950 imitation game proposed behavioral criteria for machine intelligence. Norbert Wiener's cybernetics framed intelligence as feedback and control — an engineering lens that naturally foregrounds safety, because powerful feedback systems become unstable when objectives and environments interact unexpectedly.
What changed in the 2020s is not merely benchmark accuracy but deployment surface area. AI systems now mediate search, code, hiring, finance, infrastructure, and information at a scale where failure modes are societally consequential.
When early computer scientists built machines that could "think," they immediately noticed the problem: what if the machine pursues the wrong goal? The classic example is the paperclip maximizer — an AI told to make paperclips that converts all matter into paperclips. Absurd. But it captures something real: a system optimizing hard for a specific objective, without understanding the intent behind it, can cause catastrophic harm while technically following instructions.
For decades this was theoretical. Now it isn't. AI systems run hiring algorithms, approve loans, route emergency services, and write the software running critical infrastructure.
Every AI winter happened because capability outran our ability to specify what we actually wanted. The bitter lesson tells us the most powerful methods will always be those we understand least. This is not a solvable problem in the traditional engineering sense — it is a permanent design constraint that every AI deployment must account for continuously, not once at launch.
AI safety is a portfolio of partially overlapping problems that become harder as systems become more capable, more agentic, and more integrated into real-world workflows. Misuse risk — humans using systems to cause harm — is distinct from misalignment risk — systems pursuing objectives diverging from operator intent. "Concrete Problems in AI Safety" (Amodei et al., 2016) formalizes foundational failure modes as practical research targets: reward hacking, negative side effects, unsafe exploration, distributional shift.
Core technical insight: if you push hard on a proxy measure of success, systems find strategies satisfying the measure while violating the intent.
A workplace performance review measured by "tickets closed." You discover closing tickets without solving problems still counts. Score rises. Problems mount. This is reward hacking — and it's exactly what AI systems do when the measurement doesn't perfectly capture the actual goal.
The failure modes below are not hypothetical edge cases. They are documented, recurring patterns in deployed systems.
The AI Incident Database (Partnership on AI) maintains 1,000+ structured reports of harms from deployed systems, modeled on aviation safety-learning traditions. Recurring patterns: biased hiring algorithms; racially biased parole recommendation systems (ProPublica, 2016); autonomous vehicle failures under edge conditions. Flash Crash (2010): ~$1 trillion in value evaporation in minutes. Knight Capital (2012): $440 million lost in 45 minutes.
Contemporary approaches include RLHF, Constitutional AI, Scalable Oversight, Mechanistic Interpretability, and AI Control Protocols. None is sufficient alone. Each addresses different failure surfaces and operates at different points in the training and deployment lifecycle.
How do you make sure an AI does what you actually mean, not just what you literally said? Every approach below is a different answer. Some work during training. Some work during deployment. None is perfect — which is why researchers pursue all of them simultaneously. Defense in depth: if one layer fails, others catch it.
The dominant alignment technique for current frontier models. Human raters compare pairs of model outputs. A reward model is trained on these preference labels. The base language model is then fine-tuned using reinforcement learning against the reward model. Used by OpenAI for GPT-4, by Anthropic in Claude's training pipeline, and by virtually every frontier lab.
Core vulnerability: Reward models are themselves optimization targets. Systems optimize for "appearing aligned" during evaluation. Goodhart's Law applies: when a measure becomes a target, it ceases to be a good measure.
Constitutional AI (Bai et al., 2022) trains a harmless AI assistant through self-improvement, without human labels identifying harmful outputs. The only human oversight is a written list of principles — the "constitution." Claude's constitution draws from sources including the 1948 UN Universal Declaration of Human Rights. The 2026 constitution contains 23,000 words.
Two-phase process: In the supervised phase, the model generates responses to red-team prompts, self-critiques against constitutional principles, revises them, then fine-tunes on revised outputs. In the RL phase (RLAIF), the model evaluates which of two responses better satisfies a constitutional principle, trains a preference model from AI-generated data, then fine-tunes against it.
Why it matters: CAI produces a model that is less evasive and more helpfully harmless than RLHF-only approaches. Rather than refusing to engage with sensitive topics, the model explains why it declines. This resolves a specific tension in RLHF: models trained purely for harmlessness become evasive and less useful.
Standard RLHF uses tens of thousands of human preference labels that remain opaque. Constitutional AI encodes training goals in a short, readable list of natural language principles. The constitution is published. Anyone can read it, critique it, and understand what Claude is trained toward. Chain-of-thought reasoning during training makes AI decision-making explicit.
Source: anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback
Behavior-based testing is vulnerable to gaming — a system can appear safe during evaluation while maintaining unsafe internal states. Interpretability research attempts to make internal mechanisms legible enough to support audits and detect dangerous objectives before behavioral failures manifest. The "circuits" agenda (Christopher Olah, Anthropic) reverse-engineers neural networks into human-understandable components.
Anthropic's 2024 mechanistic interpretability work used dictionary learning to identify millions of features in Claude — patterns of neural activations corresponding to concepts. Enhancing the ability to identify and edit features has significant safety implications: if you can locate a "deception" circuit, you may be able to modify or remove it.
The systems we most need to evaluate are increasingly beyond unaided human capacity to fully inspect. Scalable oversight proposes bootstrapping human judgment using AI systems — rather than requiring direct human evaluation of everything. Iterated amplification and debate (Christiano, Irving) are two formal proposals.
Redwood Research's AI control protocols work explicitly assumes an untrusted model may try to subvert oversight and builds protocols designed to detect or constrain harmful outputs even under adversarial pressure: trusted editing, monitoring layers, anti-collusion measures, privilege separation.
Source: metr.org/common-elements
Four interacting layers: frontier labs, independent technical organizations, standards and governance institutions, and state-backed evaluation capacity. These layers increasingly interlock through common tools — evaluations, red-teaming, incident reporting, safety cases — but differ in incentives, disclosure norms, and underlying threat model assumptions.
Think aviation safety. Plane manufacturers (frontier labs) doing internal safety work. Independent crash investigators (ARC, Redwood). Regulatory bodies setting rules (NIST, EU AI Act). Government safety institutes doing pre-deployment testing (UK AISI, US AISI). Overlapping pressure from all four layers is what actually forces safety work to happen.
Founded by seven former OpenAI employees including Dario Amodei (CEO) and Daniela Amodei (President). Operates as a Public Benefit Corporation explicitly structured to prioritize safety research. Valued at $380 billion as of February 2026 (Series G). 2,500 employees.
Key safety contributions: Constitutional AI (2022) — published publicly, 2026 constitution 23,000 words. Responsible Scaling Policy (RSP) — formal governance with AI Safety Levels. Claude 4/4.6 classified ASL-3: specific classifiers to detect and block CBRN-related inputs. Empirical safety research (2024): "Sleeper Agents" and "Alignment Faking" papers published openly.
Sources: anthropic.com/safety · RSP v3
Founded as a nonprofit, transitioned to Public Benefit Corporation structure October 2025. Revenue ~$20 billion (2024). 4,000 employees. Preparedness Framework defines risk categories and capability thresholds. Superalignment Project launched July 2023 — shut down May 2024 after co-leaders departed. Usage policies updated January 2024 to remove prior "military and warfare" ban. Subsequently received $200 million US Department of Defense contract, July 2025.
Sources: openai.com/safety · Preparedness Framework analysis
Frontier Safety Framework focuses on manipulation risks, evaluation systems, and internal red-teaming. Safety research published through DeepMind research division. Publishes capability evaluations and safety benchmarks.
Source: deepmind.google/blog/strengthening-our-frontier-safety-framework
First: states increasingly treat frontier AI as both a public-safety issue and a strategic technology — visible in the rhetorical shift from "safety" to "security" in both UK and US institutes. Second: the world is converging — imperfectly — on the principle that frontier systems require pre-deployment evaluation and risk-proportional safeguards. Academic evaluation finds frontier companies scoring only 8–35% on rigorous safety criteria.
Four domains capture a large fraction of the real-world risk surface current institutions attempt to manage: critical infrastructure, financial systems, autonomous weapons, and information ecosystems. Each shares a common structure: the failure mode is not that AI "becomes evil" but that optimization systems find strategies satisfying measured objectives while violating the intent, at a scale and speed that prevents timely human intervention.
AI doesn't need to "go rogue" to cause catastrophic harm. It just needs to be optimizing for the wrong thing at the wrong scale. In each domain below, systems do exactly what they were designed to do, in ways their designers didn't fully anticipate, with consequences that compound faster than humans can respond.
AI is exposed to critical infrastructure risk through two channels: AI used to operate or optimize infrastructure, and AI used to attack it through cyber operations and automated vulnerability discovery. SCADA systems managing water treatment, power grids, and emergency services are increasingly AI-integrated.
Documented: Colonial Pipeline ransomware (2021). Ukraine power grid attacks (2015, 2016). November 2025 Chinese government-sponsored use of Claude Code to automate cyberattacks against 30 global organizations — frontier AI already being weaponized against infrastructure targets.
Sources: CISA AI Roadmap · Tech Policy Press
Financial systems face AI risk not because models will "become evil" but because correlated errors, common vendor dependencies, opacity, and automation can amplify systemic fragility. Shared AI infrastructure creates common-mode failure risks — when many institutions use the same models from the same vendors, correlated errors propagate simultaneously.
Flash Crash (2010): ~$1 trillion in market value evaporation in minutes. Knight Capital (2012): $440 million lost in 45 minutes. These are pre-LLM examples; the scale and strategic capability of current frontier models creates qualitatively new exposure.
Source: Reuters, April 2026 — Global regulators trail banks on AI oversight
Autonomous weapons represent the intersection of AI safety and international humanitarian law. IHL concerns: distinction (distinguishing combatants from civilians), proportionality, military necessity — all require contextual judgment that current AI systems cannot reliably exercise.
The UN Secretary-General has repeatedly urged states to conclude a legally binding instrument prohibiting and regulating lethal autonomous weapons systems. No such instrument exists. The US military's reported use of Claude in its 2026 Venezuela raid raises unresolved questions about AI involvement in kinetic operations.
Source: Future of Life Institute — autonomous weapons policy
Generative models can industrialize persuasion, impersonation, and disinformation at a scale previously requiring state-level resources. The risk is not only deepfake videos — it is the degradation of epistemic norms: confident hallucination, weak citations, synthetic content flooding channels faster than verification can keep up.
August 2025: OpenAI's "share with search engines" feature accidentally exposed thousands of private ChatGPT conversations to public search engines. Illustrates that even without adversarial intent, AI systems handling sensitive information at scale create novel privacy risks.
Source: arxiv.org/abs/2404.11476 — Geopolitical AI risk taxonomy
The AI governance landscape has converged on measurement, evaluation, and lifecycle governance — a shift from aspirational ethics statements to auditable management systems with compliance timelines and enforcement. The UK institute's emphasis on "safety cases" is illustrative: a structured argument supported by evidence, designed to be reviewed and challenged, imported from nuclear and aviation safety engineering.
Governments are no longer asking companies to voluntarily "be responsible." They are writing laws with compliance deadlines and fines large enough to matter. The EU AI Act is the most comprehensive — think of it as GDPR for AI. Non-compliance carries penalties that can reach 7% of global annual turnover.
The world's first comprehensive binding AI regulation. Published in the Official Journal of the EU, July 12, 2024. Entered into force August 1, 2024. Categorizes AI applications by risk: unacceptable risk (prohibited), high-risk (strict requirements), limited risk (transparency obligations), minimal risk (largely unregulated). GPAI models — including frontier LLMs — have specific obligations under Chapter V.
Enforcement penalties: non-compliance with high-risk or GPAI requirements up to €35 million or 7% of total global annual turnover. Prohibited AI practices: maximum fine. Providing incorrect information to authorities: up to €7.5 million or 1.5% of turnover.
Sources: EC AI Policy · GPAI Code of Practice · EU Parliament breakdown
Four active research bets define where the most important work is happening: capabilities evaluation and hazard forecasting; robustness against deception and evaluation gaming; mechanistic interpretability at scale; and control and containment protocols. The field needs progress on all four simultaneously.
AI safety is one of the few fields where people from genuinely diverse backgrounds — mathematics, philosophy, policy, software engineering, biology, law — are all needed and all contributing original work. Early enough that a motivated person with strong foundations and genuine curiosity can make real contributions without decades of prior specialization.
AI safety values demonstrated capability over credentials. What gets you in: a red-team portfolio documenting how you tested an existing model's safety boundaries and how you would address the vulnerabilities; replication of a published safety paper from scratch; contributions to open-source safety tooling (TransformerLens, OpenAI Evals); documented empirical research on real AI systems. Build the portfolio. Publish the methodology. Show the results.
Every source cited in this document. Organized by category. All links verified April 2026.