
‘The Alignment Problem: Machine Learning and Human Values’ — book review

Can AI be ethical? Who gets to decide what that means?

August 13, 2020

In case you think fears about artificial intelligence (AI) are purely a product of the 21st century, author Brian Christian highlights a quote from the famous 1960 Science article by MIT’s Norbert Wiener, “Some Moral and Technical Consequences of Automation”:

If we use, to achieve our purposes, a mechanical agency with whose operation we cannot efficiently interfere once we have started it . . . then we had better be quite sure that the purpose put into the machine is the purpose which we really desire and not merely a colorful imitation of it.

AI continues to advance rapidly across a wide range of domains, and there’s growing concern that we’re falling short in our ability to align those systems with our own goals and desires. To convey the scope and severity of this concern, Christian has distilled an impressive number of discussions with experts from across the cognitive sciences and machine learning into The Alignment Problem: Machine Learning and Human Values.

“This is a book about machine learning and human values: about systems that learn from data without being explicitly programmed, and about how—and what exactly—we are trying to teach them,” Christian writes. The problem is both philosophical and technological, as we have to ask two distinct questions:

What are the ethics/values we want to teach the AI?
How do we teach the AI to fully understand those ethics/values?

As Christian notes, until relatively recently the alignment problem has largely been seen as a less serious project than making AI better, but that appears to be shifting, as more and more experts openly acknowledge that the alignment problem is inseparable from making better AI. From racist algorithms to planet-destroying superintelligences, if we don’t get a handle on the problem ASAP, the results could range from worsening dystopia to full-on existential crisis.


Christian does substantial work trying to address both pieces of the alignment problem, with differing degrees of success. That is not through any failing on his part; it may simply be that the philosophical side of the problem is a heavier lift than the engineering side. Giving a broadly acceptable and sufficiently rich account of what we actually value, and what we ought to value, is extremely difficult. Compared to the deep questions of ethics, the engineering problem of developing machine learning advanced enough to handle material as complex as ethics seems relatively tractable.

Advances on the engineering and training side have trended toward solutions that might allow us to teach ethics to AIs without actually knowing what ethics is first. Strange as that sounds, indirect normativity, where you essentially make the AI guess what ethics is rather than just tell it, may actually be far easier than solving all the philosophical problems of ethics and then formalizing the answers into a language the AI can understand. Classic examples of direct normativity like Isaac Asimov’s laws of robotics have proven far too fragile and inflexible to cope with an immensely complex moral world.

So how do we teach the AI ethics without having figured it out for ourselves first? In Part Three of The Alignment Problem, “Normativity,” Christian does a great job laying out the cutting edge of techniques in AI training. He describes the evolution of each stage and their respective difficulties using accessible examples like Mario Kart.

First, there were the traditional models in which the AI learned from watching an expert play, but those suffered from fatal errors because the AI didn’t know how to correct itself when it got in trouble in ways the expert didn’t anticipate. Then a collaborative approach was developed, where the AI and the human driver would both simultaneously try to drive the kart, but the program would randomly decide whose signal to act on. The AI would sometimes screw up while in control, forcing the human driver to correct the mistake, and thus the AI learned to correct itself much more effectively.
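The shared-control idea can be sketched in a few lines. This is only a toy illustration under stated assumptions, not the researchers' actual system: the one-dimensional "track," both policies, and the `ai_share` parameter are all hypothetical. The point it shows is that the human's response is recorded at every step, even in states that only arose because the AI was steering.

```python
import random

def human_policy(offset):
    # Hypothetical expert: steer back toward the track center (offset 0).
    if offset > 0:
        return -1
    if offset < 0:
        return 1
    return 0

def naive_ai_policy(offset):
    # Hypothetical untrained learner: always drifts to the right.
    return 1

def collect(steps, ai_share, seed=0):
    """Drive for `steps` steps, acting on the AI's signal with probability ai_share.

    Returns (offset, human_action) pairs for the learner to train on.
    """
    rng = random.Random(seed)
    offset = 0
    dataset = []
    for _ in range(steps):
        human_action = human_policy(offset)
        ai_action = naive_ai_policy(offset)
        # The expert's choice is always recorded as the training label...
        dataset.append((offset, human_action))
        # ...but the kart acts on a randomly chosen driver's signal.
        acting = ai_action if rng.random() < ai_share else human_action
        offset += acting
    return dataset

expert_only = collect(5, ai_share=0.0)  # the human never leaves the center
ai_driving = collect(5, ai_share=1.0)   # the AI drifts; the human shows the fix
```

With the expert alone in control, the dataset never contains an off-center state, so the learner never sees a recovery. Once the AI is allowed to drift, the recorded labels include the human's corrections.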

“Hal, let me use the blue turtle shell. Just open the box, Hal!”

From there, Christian unpacks the technique of Cooperative Inverse Reinforcement Learning (CIRL). Inverse reinforcement learning occurs when an AI observes our behavior and attempts to infer what our goals are, as a way to develop its own goal structure, rather than us trying to make it explicit — perfect for a challenge like indirect normativity. The “cooperative” part refers to moving beyond the “watch and learn” model to one in which the humans and AI work together to solve the problem in real time, and in doing so train the AI to better understand their shared goal.

Researchers have found this cooperative method helps avoid problems where the AI watches, infers the goal, and then finds a way to achieve that goal through unacceptable means. Christian gives a nonmoral example of an AI that watched remote helicopter pilots attempt the most difficult trick they knew, called “chaos,” and then inferred the trick they were trying to accomplish. Based on his descriptions, it seems to me that CIRL makes substantial advances in addressing some of the serious problems on the applied side of the alignment problem.

CIRL is not without its own challenges, of course, some of which come down to the philosophical side of the alignment problem. CIRL assumes the behavior an AI is learning from is coming from an expert, an assumption that becomes much more problematic when we shift from model helicopters to human ethics.

There’s also reason to remain doubtful that ethics is sufficiently like model helicopters or Mario Kart, just with more complexity. In games and hobbies, there is nothing wrong with humans arbitrarily setting the rules and goals in ways we find maximize personal enjoyment of the activity. If we did the same thing in ethics, the results would likely be disastrous.

Christian mentions several unresolved debates in ethics that could have massive ramifications for AI, like the one between possibilism and actualism. Possibilism is the view that one ought to do what will produce the best possible results. Actualism is the opposing view that one ought to do what will produce the best actual results. The classic example is the procrastinating professor who’s asked by a student to review a paper. The professor is the top expert on the topic, so agreeing to review the paper would produce the best possible results, assuming the professor does review the paper. However, if the professor can confidently predict they will never get around to it, saying no to the student and directing them to a more reliable alternative with slightly less expertise will produce the best actual results.
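The disagreement can be made concrete with a toy calculation. The numbers below are invented purely for illustration (they are not from the book), and the two functions are a simplified reading of each view: the possibilist ranks options by the best outcome each makes possible, while the actualist ranks them by what is predicted to actually happen.

```python
# Hypothetical values for the procrastinating professor's two options.
# "best_case" assumes the professor follows through; "predicted" assumes
# the professor behaves as they confidently expect to (never reviewing).
options = {
    "accept": {"best_case": 10, "predicted": 0},
    "decline": {"best_case": 7, "predicted": 7},
}

def possibilist_choice(opts):
    # Choose the option whose best achievable outcome is highest.
    return max(opts, key=lambda name: opts[name]["best_case"])

def actualist_choice(opts):
    # Choose the option whose predicted actual outcome is highest.
    return max(opts, key=lambda name: opts[name]["predicted"])

possibilist_pick = possibilist_choice(options)
actualist_pick = actualist_choice(options)
```

Under these made-up values the possibilist accepts the review and the actualist declines it, which is exactly the split the example is meant to expose.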

So the question is, should we and our AIs be possibilists, actualists, or some hybrid of the two? Actualists will say that possibilism is hopelessly naive and paves the way to hellish outcomes. Possibilists can plausibly respond that actualism is an excuse to give in to the worst parts of ourselves rather than working to improve them. A hybrid model could bring with it the worst of both options without successfully addressing the major problems of either. Even if we don’t have to work out a perfect answer on the front end, thanks to CIRL and indirect normativity, we’ll still need to decide on the back end if we’re happy with the results.

It’s possible there may be no answer that everyone will find satisfying, at which point the question becomes how much agreement do we need, and from whom? Saying “ethics experts” isn’t a sufficient solution, since they’ll give you even more competing answers than non-ethicists.

Honestly, you don’t even need multiple ethics experts. One is more than enough.

There’s also a major social expectation problem looming, because people want a kind of perfection from AI that may be neither possible nor desirable. Christian quotes machine learning and robotics researcher Stéphane Ross on the expectations for self-driving cars:

The level of reliability that we need is, like, orders and orders of magnitude more than anything we do in academia. So that’s where the real challenge is. Like, how do you make sure that your model works all the time — not just 95 or 99% of the time; that’s not even good enough.

The problem here is not just engineering. What would it really mean to say that the AI’s ethical model works “all the time”? Ethical growth in humans often comes from making ethical mistakes. Will the AI not be given similar leeway to learn? Would perfection require that it always gives an answer we agree with, or an answer we can at least understand? What if the AI decides that human ethics is insufficient, and attempts to move beyond it, producing ethical answers that involve concepts we don’t share?

It’s not enough to say we’ll simply prevent the AI from developing that far, because preventing it from engaging in autonomous ethical growth could then prevent it from becoming sufficiently ethical at all. I think we should want an AI that is ethically advanced enough to point out when humans are acting unethically, but I wonder if we’ll be happy with that if we get it.

Despite these concerns, I do think The Alignment Problem presents some important avenues for further research. The idea that we should instill in AIs a kind of moral parliament that produces a range of ethical views, which the AI then acts on in an aggregated way, could create a flexibility that can effectively incorporate a wide range of ethical concerns.

I also particularly enjoyed the discussion around training “domesticity” into the AI. Domesticity, a method of limiting how much the AI allows itself to impact the world, is desirable but difficult to quantify. Christian suggests building on work that researchers have done to teach a gaming AI to leave the world with as many options still open as possible, as a way to accommodate the uncertainty around future needs. The example given is a simple puzzle game of moving boxes around, where some moves can’t be undone. The successful result was an AI that went out of its way to make sure as many moves as possible were still available, even as it completed the puzzle.
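A rough sketch of the "keep options open" idea follows. It is a simplified stand-in rather than the method the researchers used: the puzzle is reduced to a hypothetical three-state graph, and the agent simply prefers whichever solved state leaves the most states still reachable.

```python
from collections import deque

def reachable(graph, start):
    """Return every state reachable from `start` by following moves."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def pick_end_state(graph, start, solved_states):
    """Among reachable solved states, prefer the one that keeps most options open."""
    from_start = reachable(graph, start)
    candidates = [s for s in solved_states if s in from_start]
    return max(candidates, key=lambda s: len(reachable(graph, s)))

# Hypothetical puzzle: pushing the box into the corner solves it but cannot
# be undone; solving it in the open still allows moving back.
puzzle = {
    "start": ["solved_corner", "solved_open"],
    "solved_corner": [],
    "solved_open": ["start"],
}

choice = pick_end_state(puzzle, "start", ["solved_corner", "solved_open"])
```

Both end states complete the puzzle, but the agent picks `solved_open` because three states remain reachable from it, versus one from the corner.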


My one difficulty with the style of The Alignment Problem is that I often found myself wanting shorter allegories and anecdotes that got to the point a bit faster. The stories, while often interesting, sometimes felt unnecessary and may make the book a bit daunting for some readers. Style is a matter of personal taste, though, and others may enjoy the lengthier narratives.

The Alignment Problem describes a present-tense issue, not something in the far-flung future. Your social media algorithms are optimized toward profit, not human wellbeing, and the negative impacts have become increasingly apparent. “We must hope that we can correct our folly, rather than our impotence, first,” Christian concludes, but if we can’t even summon the social will to address these ongoing and relatively simple misalignments, I’m skeptical we’re going to be adequately prepared for the increasingly complex challenges that lie ahead. If we get there, though, it’s going to be through work like The Alignment Problem.

The Alignment Problem: Machine Learning and Human Values is out in October.

Aaron Rabinowitz currently serves as the Philosopher in Residence for the Rutgers Honors College, and teaches ethics through the Rutgers philosophy department.

AIPT Science is co-presented by AIPT and the New York City Skeptics.
