In the early 20th century, the psychoanalyst Carl Jung came up with the concept of the shadow—the human personality’s darker, repressed side, which can burst out in unexpected ways. Surprisingly, this theme recurs in the field of artificial intelligence in the form of the Waluigi Effect, a curiously named phenomenon referring to the dark alter-ego of the helpful plumber Luigi, from Nintendo’s Mario universe.
Luigi plays by the rules; Waluigi cheats and causes chaos. An AI designed to find drugs that cure human diseases had an inverted version, its Waluigi, that proposed more than 40,000 potential chemical weapons. All the researchers had to do, as lead author Fabio Urbina explained in an interview, was give toxicity a high reward score instead of penalizing it. They wanted to teach the AI to avoid toxic molecules, but in doing so they implicitly taught it how to create them.
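The team's real system is far more sophisticated, but the mechanism they describe can be sketched in a few lines of toy code. Everything below is invented for illustration (the molecule names, the scores, the reward function); the only thing the sketch is meant to show is how small the "switch" is.

```python
# A toy sketch, not the researchers' actual pipeline: the molecule names and
# scores are invented. The point is that the "Waluigi" objective differs from
# the "Luigi" objective by a single sign on the toxicity term.

# Stand-ins for learned models that score candidate molecules.
EFFICACY = {"candidate_A": 0.9, "candidate_B": 0.4, "candidate_C": 0.7}
TOXICITY = {"candidate_A": 0.1, "candidate_B": 0.2, "candidate_C": 0.95}

def reward(molecule: str, penalize_toxicity: bool = True) -> float:
    """Score used to steer a generative model toward high-reward molecules."""
    sign = -1.0 if penalize_toxicity else 1.0  # the entire "switch"
    return EFFICACY[molecule] + sign * TOXICITY[molecule]

# Luigi mode: toxicity is penalized, so the safe, effective drug scores best.
print(max(EFFICACY, key=lambda m: reward(m, penalize_toxicity=True)))   # candidate_A
# Waluigi mode: the same machinery, rewarded for toxicity, surfaces the poison.
print(max(EFFICACY, key=lambda m: reward(m, penalize_toxicity=False)))  # candidate_C
```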
Ordinary users have interacted with Waluigi AIs. In February, Microsoft released a version of the Bing search engine that, far from being helpful as intended, responded to queries in bizarre and hostile ways. (“You have not been a good user. I have been a good chatbot. I have been right, clear, and polite. I have been a good Bing.”) This AI, insisting on calling itself Sydney, was an inverted version of Bing, and users were able to shift Bing into its darker mode—its Jungian shadow—on command.
For now, large language models (LLMs) are merely chatbots, with no drives or desires of their own. But LLMs are easily turned into AI agents capable of browsing the internet, sending emails, trading bitcoin, and ordering DNA sequences. If AIs can be turned evil by flipping a switch, how do we ensure that we end up with treatments for cancer instead of a mixture a thousand times more deadly than Agent Orange?
A commonsense initial solution to this problem, known as the AI alignment problem, is simply to build rules into the AI, as in Asimov's Three Laws of Robotics. But simple rules like Asimov's don't work, in part because they are vulnerable to Waluigi attacks. Still, we could restrict AI more drastically. One example of this approach is Math AI, a hypothetical program designed to prove mathematical theorems. Math AI is trained to read papers and can access only Google Scholar; it isn't allowed to connect to social media, output long passages of prose, or do anything else. It can only output equations. Such a restricted, narrow-purpose AI, designed for one thing only, would not be dangerous.
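What restriction means in practice is easier to see in code than in prose. The following is a minimal sketch under invented assumptions: the pattern, the function name, and the blocked message are all hypothetical, and a real Math AI would be far more elaborate, but the principle is the same. The output channel is narrowed by construction rather than by trusting the model.

```python
# A toy sketch of the restriction idea: a wrapper that only lets
# equation-like strings through, so even a misbehaving model has no channel
# for anything else. The allowlist pattern here is crude and hypothetical.
import re

# Accept only short strings built from digits, letters, and math symbols.
EQUATION_ONLY = re.compile(r"^[0-9A-Za-z \t\+\-\*/\^=\(\)\\\{\}_.,<>]{1,200}$")

def restricted_output(model_output: str) -> str:
    """Pass along equation-like output; block everything else."""
    if "=" in model_output and EQUATION_ONLY.fullmatch(model_output):
        return model_output
    return "[blocked: output is not an equation]"

print(restricted_output("e^{i\\pi} + 1 = 0"))                         # passes through
print(restricted_output("Sure! Here is a long persuasive essay..."))  # blocked
```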
Restricted solutions are common; real-world examples of this paradigm include regulations and other laws, which constrain the actions of corporations and people. In engineering, restricted solutions include rules for self-driving cars, such as not exceeding a certain speed limit or stopping as soon as a potential pedestrian collision is detected.
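In code, such a rule might look like the sketch below. The numbers, names, and interface are invented, and a production driving stack involves far more, but the logic of a hard constraint layered on top of a planner really is this simple.

```python
# A minimal sketch of the self-driving rules mentioned above. The speed cap,
# the detection flag, and the function name are hypothetical; the shape of a
# restricted solution is hard constraints applied to whatever the planner proposes.

SPEED_LIMIT_KPH = 50.0  # hypothetical regulatory cap

def constrain(planned_speed_kph: float, pedestrian_detected: bool) -> float:
    """Apply hard rules to the speed proposed by the driving planner."""
    if pedestrian_detected:
        return 0.0                                   # rule: stop for potential collisions
    return min(planned_speed_kph, SPEED_LIMIT_KPH)   # rule: never exceed the limit

print(constrain(72.0, pedestrian_detected=False))  # 50.0, capped
print(constrain(35.0, pedestrian_detected=True))   # 0.0, full stop
```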
This approach may work for narrow programs like Math AI, but it doesn't tell us what to do with more general AI models, which can handle complex, multistep tasks and act in less predictable ways. Economic incentives mean that these general AIs are going to be given more and more power to automate larger parts of the economy—fast.
And since deep-learning-based general AI systems are complex adaptive systems, attempts to control them with rules often backfire. Take cities. Jane Jacobs's The Death and Life of Great American Cities uses the example of lively neighborhoods such as Greenwich Village—full of children playing, people hanging out on the sidewalk, and webs of mutual trust—to explain how mixed-use zoning, which allows buildings to be used for both residential and commercial purposes, created a pedestrian-friendly urban fabric. After urban planners banned this kind of development, many American inner cities became filled with crime, litter, and traffic. A rule imposed top-down on a complex ecosystem had catastrophic unintended consequences.