Introduction
If you have ever read any of my reviews, you'll have noticed that while I am very interested in math, that interest doesn't seem to percolate into the adjacent areas of computer science and machine learning. This hasn't fundamentally changed, but due to my increasing intellectual interactions with the Effective Altruism and Rationalist communities, I feel I need to get a firm grasp on what is arguably the greatest nerd catnip for these groups: artificial intelligence alignment. While there's a ton of scattered literature on the subject (blog posts, mostly, but also articles and some videos), I decided to approach it in my traditional way, i.e., through a good and relatively thick book that would map out the basics of the field for me. That attempt is what led me to Brian Christian's The Alignment Problem.
Before delving into its specific contents, it's worth at least roughly explaining the problem that gives the book its title. Alignment in general refers to creating states of agreement and coordination such that different elements (people, organizations, societies) can work harmoniously toward a common purpose and/or remain consistent with one another. Humans do this all the time: you could say that from birth, humans are trained in various ways to 'align' with the sort of behavior that is accepted (and expected) in the different, overlapping circles each individual belongs to. And you will know that the process generally works, but not perfectly: all societies have unaligned individuals who usually end up in prison or worse. Aligning humans to a set of agreed-upon practices and values is hard; some would dispute whether it is even possible (or desirable) at a macro level (which would raise questions about which particular system of values and practices we must all agree on, how fixed or mutable it is, etc.).
The alignment problem extends this challenge to the very powerful machine learning systems we have created in the early 21st century, and to the more powerful and autonomous ones we might create any time soon, such as AGI (artificial general intelligence, which would arguably match or surpass humans across all domains of cognitive work). If you believe it likely that AGI will appear, and soon, and that it will end up squarely defeating us in intelligence by many orders of magnitude, it makes sense to worry about how we can ensure such systems do what humans actually want them to do, even when our goals are hard to specify, are misunderstood, or change over time (which of course happens, because humans are a stupid mess of evolved and contradictory drives). And you can picture what the consequences of misalignment might be: machines more powerful than we are, able to force us to obey them, could easily kill us all or create dystopian scenarios of disempowerment for the human race. This can come about even if we do manage to encode and irreversibly fix their goals (think of The Sorcerer's Apprentice and what can happen with a literal order that is less than optimally formulated and essentially impossible to change or countermand).
Christian's book explores all the main challenges of aligning machines with human values, challenges that are already arising even with the relatively limited technologies we currently possess. In explaining them, he also weaves in a history of AI and alignment research from the 1950s to the present day, mostly through the theories and practical breakthroughs of the computer scientists he interviewed while writing the book.
The Contents
The Alignment Problem is divided into three sections (Prophecy, Agency, Normativity), each broken up into three chapters. The following is a (relatively) brief summary of each of them, which you can take a look at, or skip entirely and jump to the conclusion if you aren't into that much detail.
Chapter 1: Representation
This chapter traces the early history of machine learning and its foundational concerns with how models represent the world. Christian begins with Frank Rosenblatt’s perceptron and ties it to philosophical concerns about learning and representation. The perceptron demonstrated how machines could learn from labeled data by updating weights—a forerunner to modern supervised learning.
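To give a flavor of how simple that learning rule is, here is a minimal perceptron sketch of my own (the OR-gate data, learning rate, and epoch count are illustrative choices, not Rosenblatt's historical setup):

```python
import numpy as np

# A minimal perceptron in the spirit of Rosenblatt's original: learn weights
# for a linearly separable problem (here, the logical OR function) by nudging
# them whenever the prediction is wrong. The data and learning rate are a toy.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])           # OR of the two inputs

w = np.zeros(2)
b = 0.0
lr = 0.1

for epoch in range(20):
    for xi, target in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0
        # Rosenblatt's update: move the weights toward the missed target.
        w += lr * (target - pred) * xi
        b += lr * (target - pred)

print([1 if xi @ w + b > 0 else 0 for xi in X])   # [0, 1, 1, 1]
```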
Christian then explores how representation bias arises from the data chosen to train models. ImageNet and word embeddings are examined as major examples: ImageNet labeled images with crowd-sourced tags that reflect social consensus, not objective truth, while word embeddings like word2vec expose historical biases (e.g., "man is to computer programmer as woman is to homemaker"). The chapter stresses how model behavior depends on training data—and how even accurate representations may encode harmful stereotypes or structural inequalities.
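For the curious, the "man is to computer programmer as woman is to homemaker" result comes from doing simple arithmetic on word vectors. Here is a toy sketch of mine with hand-made three-dimensional vectors (real word2vec embeddings are learned from huge corpora and have hundreds of dimensions):

```python
import numpy as np

# Toy, hand-made 3-d "embeddings", purely for illustration.
vectors = {
    "man":        np.array([ 1.0,  0.1, 0.0]),
    "woman":      np.array([-1.0,  0.1, 0.0]),
    "programmer": np.array([ 0.9,  0.8, 0.3]),
    "homemaker":  np.array([-0.9,  0.8, 0.3]),
    "doctor":     np.array([ 0.2,  0.9, 0.5]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The classic analogy query: programmer - man + woman ≈ ?
query = vectors["programmer"] - vectors["man"] + vectors["woman"]

# Nearest neighbour among the remaining words answers the analogy.
candidates = {w: v for w, v in vectors.items() if w not in ("programmer", "man", "woman")}
best = max(candidates, key=lambda w: cosine(query, candidates[w]))
print(best)  # with these toy vectors: "homemaker"
```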
Chapter 2: Fairness
Christian dives into the challenge of defining and achieving fairness in algorithmic systems. He focuses on criminal justice tools like COMPAS, which predict recidivism risk but are trained on flawed proxies (like rearrest rates, which are racially biased). He shows how these systems perpetuate historical inequities and can create feedback loops that increase incarceration unjustly.
Different mathematical definitions of fairness—such as calibration, equalized odds, and demographic parity—are examined. Crucially, these cannot all be satisfied simultaneously, leading to a “no free lunch” theorem in fairness. Christian introduces Moritz Hardt's proposal to train models directly on human judgments of fairness, though this itself raises alignment challenges. The chapter argues that fairness is not just a technical challenge but a profoundly social one.
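To make two of those definitions concrete, here is a toy sketch (my own invented numbers, not data from COMPAS or the book) of how demographic parity and equalized odds are actually measured:

```python
import numpy as np

# Tiny synthetic example showing how two common fairness criteria are checked.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])   # actual outcomes
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 0, 0])   # model's decisions
group  = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])

def rates(g):
    mask = group == g
    yt, yp = y_true[mask], y_pred[mask]
    selection_rate = yp.mean()           # demographic parity compares this across groups
    tpr = yp[yt == 1].mean()             # equalized odds compares true-positive rate...
    fpr = yp[yt == 0].mean()             # ...and false-positive rate across groups
    return selection_rate, tpr, fpr

for g in ("A", "B"):
    sel, tpr, fpr = rates(g)
    print(f"group {g}: selection rate={sel:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")

# Demographic parity asks for equal selection rates; equalized odds asks for
# equal TPR and FPR. When base rates differ, these and calibration cannot all
# hold at once, which is the impossibility result the chapter describes.
```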
Chapter 3: Transparency
This chapter explores interpretability in machine learning and the tension between performance and comprehensibility. Christian tells the story of Rich Caruana, who trained a neural net to predict pneumonia mortality. Though it outperformed simpler models, it learned a dangerous and opaque rule: that patients with asthma had lower mortality (because they were hospitalized quickly). The model was ultimately rejected in favor of a simpler, interpretable one.
Christian surveys tools for making models more transparent: decision trees, decision sets, saliency maps, and concept activation vectors. He warns about adversarial explanations—systems that manipulate their “reasons” for a decision without changing behavior—and about the human tendency to overtrust explanations, even when misleading. The chapter closes by suggesting that real transparency involves surfacing the human values, goals, and processes embedded in systems, not just their inner workings.
Chapter 4: Reinforcement
Here, Christian turns to reinforcement learning (RL), which models how agents learn from reward signals rather than labeled data. He traces the lineage of RL from Edward Thorndike’s Law of Effect and B.F. Skinner’s experiments with rats and pigeons, through to computational RL frameworks like those of Sutton and Barto.
The chapter outlines the credit assignment problem (how to figure out which actions led to a reward), the exploration-exploitation tradeoff, and the structure of reward functions. Christian notes how easily agents can pursue unintended behaviors if reward functions are misspecified—citing OpenAI’s simulated boat-race agent that learned to spin in circles to get more points.
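As a taste of the exploration-exploitation tradeoff in its simplest form, here is a toy epsilon-greedy bandit of my own (the arm probabilities and epsilon are arbitrary illustrative choices):

```python
import random

# A minimal epsilon-greedy bandit: mostly exploit the best-looking arm,
# but explore a random one a fraction epsilon of the time.
true_win_prob = [0.2, 0.5, 0.8]      # unknown to the agent
estimates = [0.0, 0.0, 0.0]
counts = [0, 0, 0]
epsilon = 0.1

random.seed(0)
for step in range(5000):
    if random.random() < epsilon:
        arm = random.randrange(3)                        # explore
    else:
        arm = max(range(3), key=lambda a: estimates[a])  # exploit
    reward = 1.0 if random.random() < true_win_prob[arm] else 0.0
    counts[arm] += 1
    # Incremental average keeps a running estimate of each arm's value.
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(estimates)  # roughly approaches [0.2, 0.5, 0.8], with most pulls on arm 2
```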
The key insight is that rewards are powerful but dangerous: unless designed carefully, they can push agents to behave in counterproductive or even catastrophic ways.
Chapter 5: Shaping
This chapter explores shaping—Skinner’s method of training animals to perform complex behaviors by reinforcing successive approximations. Christian recounts how Skinner and the Brelands (his students) used shaping to train pigeons and later ran a large animal-training company. Shaping became a central concept in behaviorism and psychology.
He then transitions to how shaping principles are used in modern RL and robotics. Key to this is the idea that rather than specifying exact goals, one can scaffold learning with intermediate rewards. However, problems arise when the shaping rewards themselves become the object of optimization, leading to unintended behavior. Christian emphasizes that even reward functions used temporarily can derail long-term goals.
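One standard formulation worth knowing here is potential-based shaping (Ng, Harada and Russell, 1999), which I sketch below as my own illustration rather than something the book works through: the extra reward is defined so that it provably leaves the optimal policy unchanged, whereas ad-hoc bonuses need not.

```python
# Potential-based reward shaping: extra reward F(s, s') = gamma * phi(s') - phi(s).
GAMMA = 0.99

def phi(state):
    # Heuristic "progress" estimate; here, negative distance to a goal at 10.
    # This potential function is an invented example.
    return -abs(10 - state)

def shaped_reward(state, next_state, env_reward):
    return env_reward + GAMMA * phi(next_state) - phi(state)

# Moving toward the goal earns a small bonus even before the real reward arrives:
print(shaped_reward(3, 4, env_reward=0.0))   # positive: progress toward 10
print(shaped_reward(4, 3, env_reward=0.0))   # negative: moving away
```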
The chapter also discusses how evolution and culture can be seen as shaping forces on human values, and how designing AI may require similar patience and care—not merely in what reward we give, but in how we guide learning over time.
Chapter 6: Curiosity
This chapter examines intrinsic motivation in both humans and AI. Christian traces curiosity research from early psychology (e.g., Harlow's monkeys solving puzzles without external rewards) to modern reinforcement learning. He explains that in many environments (like Atari’s Montezuma’s Revenge), agents fail to progress without exploration bonuses—essentially curiosity incentives—because rewards are too sparse.
Curiosity-driven approaches reward agents for novelty and surprise. Algorithms like count-based exploration and pseudo-counts help agents explore more effectively. The author draws analogies to child psychology, noting that children are drawn not just to novelty but to ambiguity and surprise—things that challenge their expectations.
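Here is roughly what a count-based exploration bonus looks like in code, a sketch of my own using an arbitrary coefficient and the common one-over-square-root-of-N form:

```python
import math
from collections import defaultdict

# Count-based exploration bonus: the less often a state has been visited,
# the larger the intrinsic reward for visiting it.
visit_counts = defaultdict(int)
BETA = 0.5   # illustrative coefficient

def intrinsic_bonus(state):
    visit_counts[state] += 1
    return BETA / math.sqrt(visit_counts[state])

print(intrinsic_bonus("room_1"))   # 0.5    (first visit)
print(intrinsic_bonus("room_1"))   # ~0.354 (second visit)
print(intrinsic_bonus("room_99"))  # 0.5    (never seen before)

# The agent's total reward would be env_reward + intrinsic_bonus(state),
# nudging it toward unvisited parts of the environment even when the
# environment's own rewards are sparse.
```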
The core idea is that curiosity is more than random exploration; it’s a goal-directed drive toward understanding, and machines may need similar mechanisms to learn robustly in complex, real-world environments.
Chapter 7: Imitation
This chapter covers imitation learning, where AI systems learn from human demonstrations rather than rewards. It begins with examples from child psychology (how even neonates imitate facial expressions) and how imitation underpins early cognitive development.
In AI, behavioral cloning (supervised learning from expert behavior) can be brittle due to cascading errors: small mistakes compound because the agent trains on ideal trajectories, not on its own, often messier, experiences. Techniques like DAgger address this by allowing agents to learn from corrections to their own behavior, not just expert demonstrations.
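For reference, the DAgger loop itself is very short. The sketch below is schematic: `train`, `run_policy`, and `expert_action` are placeholders for whatever learner, environment rollout, and expert you actually have, not a real library API.

```python
# Schematic of the DAgger loop (Ross, Gordon & Bagnell, 2011).
def dagger(expert_demos, expert_action, run_policy, train, n_iterations=10):
    # Start from plain behavioural cloning on the expert's own trajectories.
    dataset = list(expert_demos)              # (state, expert_action) pairs
    policy = train(dataset)
    for _ in range(n_iterations):
        # Roll out the *current* policy, so the data covers the states the
        # learner itself reaches (including its mistakes)...
        states = run_policy(policy)
        # ...and have the expert label those states, correcting the learner.
        dataset.extend((s, expert_action(s)) for s in states)
        # Retrain on the aggregated dataset (ordinary supervised learning).
        policy = train(dataset)
    return policy
```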
Christian emphasizes the risks of assuming symmetry between expert and imitator: differences in body, perspective, or environment can undermine learning. He also distinguishes imitation from emulation and discusses when copying behavior works—and when understanding intentions matters more.
Chapter 8: Inference
This chapter focuses on inverse reinforcement learning (IRL)—learning what someone values by observing what they do. Christian describes this as crucial to alignment: rather than explicitly telling an agent what to optimize, we infer goals from behavior.
He explains the difficulty of IRL: behavior doesn’t always transparently reflect values. People may act irrationally, inconsistently, or under constraints, and different value systems may generate the same behavior. Yet progress is being made—models now try to infer latent preferences by accounting for limitations, habits, and even biases.
Christian covers cooperative inverse RL (CIRL), where humans and agents jointly learn and infer goals, and reward modeling, where systems learn from both behavior and evaluative feedback. These approaches aim to bridge the gap between observed action and true intent.
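The basic inferential move can be shown in a toy Bayesian sketch of my own (the candidate reward functions, the softmax choice model, and the rationality parameter are all illustrative assumptions): observe a choice, then update beliefs about which reward function best explains it.

```python
import math

# Toy inverse reinforcement learning: infer which candidate reward function a
# demonstrator is most likely following, assuming noisily rational choices.
actions = ["tidy_room", "watch_tv"]
candidate_rewards = {
    "values_tidiness": {"tidy_room": 1.0, "watch_tv": 0.0},
    "values_leisure":  {"tidy_room": 0.0, "watch_tv": 1.0},
}
BETA = 2.0                               # assumed degree of rationality
prior = {h: 0.5 for h in candidate_rewards}

def likelihood(action, reward):
    # Softmax ("Boltzmann-rational") choice model.
    z = sum(math.exp(BETA * reward[a]) for a in actions)
    return math.exp(BETA * reward[action]) / z

observed = "tidy_room"
unnorm = {h: prior[h] * likelihood(observed, r) for h, r in candidate_rewards.items()}
total = sum(unnorm.values())
posterior = {h: p / total for h, p in unnorm.items()}
print(posterior)   # puts most of the probability on "values_tidiness"
```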
Chapter 9: Uncertainty
Christian closes with the theme of uncertainty, arguing that safety in AI requires systems to recognize what they don’t know. He contrasts confident but brittle models with those that are calibrated, robust, and cautious in novel scenarios.
He draws on the story of Stanislav Petrov, who averted nuclear disaster by correctly doubting a faulty alert system, and uses it as a parable for AI systems needing epistemic humility. Topics include adversarial examples, out-of-distribution detection, and open category learning—recognizing that not everything has been seen before.
He introduces inverse reward design (IRD), where agents treat given rewards as imperfect clues about human intent, not absolute truths. The chapter suggests that alignment depends on agents modeling themselves as fallible and interpreting commands not literally but thoughtfully.
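A toy sketch of mine in the spirit of IRD: rather than trusting one stated reward function, the agent keeps several plausible interpretations of what the designer meant and prefers actions that are acceptable under all of them (the lawnmower scenario and the numbers are invented).

```python
# Acting under reward uncertainty: evaluate each action under every plausible
# interpretation of the designer's intent and pick the best worst case.
interpretations = {
    "literal":  {"mow_lawn": 1.0, "mow_lawn_over_flowerbed": 1.2},
    "intended": {"mow_lawn": 1.0, "mow_lawn_over_flowerbed": -5.0},
}

def worst_case_value(action):
    return min(rewards[action] for rewards in interpretations.values())

actions = ["mow_lawn", "mow_lawn_over_flowerbed"]
best = max(actions, key=worst_case_value)
print(best)   # "mow_lawn": slightly lower under the literal reading, but safe under both
```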
Conclusion
This is an excellent book that completely fulfills its intended purpose: making a lay reader aware of the issues that arise when trying to align machine learning systems with human values. It is extremely well written too: the author knows the tools of the trade, introduces plenty of interesting anecdotes, and filters the technical information through its protagonists, all while avoiding the kind of mathematical, programming, and engineering detail that could leave readers confused and lost. This does come with some brushing under the carpet that raises intriguing questions for the lay reader (I am still slightly flabbergasted by how it is possible to create utility functions and rewards (?!) that machines would find worth following, for example), but that isn't really a fault of the book. I highly recommend you read it: you will learn a lot, and you will enjoy it too.