AI Snake Oil: What Artificial Intelligence Can Do, What It Can’t, and How to Tell the Difference

Arvind Narayanan & Sayash Kapoor

Non-Fiction

Read in 2024

Highlights

Here are three questions about how a computer system performs a task that may help us determine whether the label AI is appropriate. Each of these questions captures something about what we mean by AI, but none is a complete definition.

First, does the task require creative effort or training for a human to perform? If yes, and the computer can perform it, it might be AI. This would explain why image generation, for example, qualifies as AI. To produce an image, humans need a certain amount of skill and practice, perhaps in the creative arts or in graphic design. But even recognizing what’s in an image, say a cat or a teapot—a task that is trivial and automatic for humans—proved daunting to automate until the 2010s, yet object recognition has generally been labeled AI. Clearly, comparison to human intelligence is not the only relevant criterion.

Second, we can ask: Was the behavior of the system directly specified in code by the developer, or did it indirectly emerge, say by learning from examples or searching through a database? If the system’s behavior emerged indirectly, it might qualify as AI. Learning from examples is called machine learning, which is a form of AI. This criterion helps explain why an insurance pricing formula, for example, might be considered AI if it was developed by having the computer analyze past claims data, but not if it was a direct result of an expert’s knowledge, even if the actual rule was identical in both cases. Still, many manually programmed systems are nonetheless considered AI, such as some robot vacuum cleaners that avoid obstacles and walls.

A third criterion is whether the system makes decisions more or less autonomously and possesses some degree of flexibility and adaptability to the environment. If the answer is yes, the system might be considered AI. Autonomous driving is a good example—it is considered AI. But like the previous criteria, this criterion alone can’t be considered a complete definition—we wouldn’t call a traditional thermostat AI, one that contains no electronics. Its behavior rather arises from the simple principle of a metal expanding or contracting in response to changes in temperature and turning the flow of current on or off.

In the end, whether an application gets labeled AI is heavily influenced by historical usage, marketing, and other factors.

There’s a humorous AI definition that’s worth mentioning, because it reveals an important point: “AI is whatever hasn’t been done yet.” In other words, once an application starts working reliably, it fades into the background and people take it for granted, so it’s no longer thought of as AI.

The second best way to understand a topic in a university is to take a course on it. The best way is to teach a course on it.

The more buzzy the research topic, the worse the quality seems to be. There are thousands of studies claiming to detect COVID-19 from chest x-rays and other imaging data. One systematic review looked at over four hundred papers, and concluded that none of them were of any clinical use because of flawed methods. In over a dozen cases, the researchers used a training dataset where all the images of people with COVID-19 were from adults, and all the images of people without COVID-19 were from children. As a result, the AI they developed had merely learned to distinguish between adults and children, but the researchers mistakenly concluded that they had developed a COVID-19 detector.

In many cases, AI works to some extent but is accompanied by exaggerated claims by the companies selling it. That hype leads to overreliance, such as using AI as a replacement for human expertise instead of as a way to augment it.

Why is predictive logic so pervasive in our world? We think a major reason is our deep discomfort with randomness. Many experiments in psychology show that we see patterns where none exist, and we even think we have control over things that are, in fact, random.

Increased computational power, more data, and better equations for simulating the weather have led to weather forecasting accuracy increasing by roughly one day per decade. A five-day weather forecast a decade ago is about as accurate as a six-day weather forecast today.

Still, there are many qualitative criteria that can help us understand whether prediction tasks can be done well. Weather forecasting isn’t perfect, but it can be done well enough that many people look at the forecast in their city every morning to decide whether they need an umbrella. But we can’t predict if any one person will be involved in a traffic accident on their way to work, so people don’t consult an accident forecast every morning.

This comparison highlights another important quality of predictions: we only care about how good a prediction is in relation to what can be done using that prediction.

So when we say life outcomes are hard to predict, we are using a combination of these three criteria: real-world utility, moral legitimacy, and irreducible error (that is, error that won’t go away with more data and better computational methods).

Perhaps collecting enough data to make accurate social predictions about people is not just impractical—it’s impossible. Matt Salganik calls this the eight billion problem: What if we can’t make accurate predictions because there aren’t enough people on Earth to learn all the patterns that exist?

But why should there be blockbusters at all? Why does the success of books and movies vary by orders of magnitude? Are some products really thousands of times “better” than others? Of course not. A big chunk of the content that is produced is good enough that the majority of people would enjoy consuming it. The reason we don’t have a more equitable distribution of consumption becomes obvious when we think about what such a world would look like. Each book would have only a few readers, and each song only a few listeners. We wouldn’t be able to talk about books or movies or music with our friends, because any two people would have hardly anything in common in terms of what they’ve read or watched. Cultural products wouldn’t contribute to culture in this hypothetical scenario, as culture relies on shared experiences. No one wants to live in that world.

This is just another way to say that the market for cultural products has rich-get-richer dynamics built into it, also called “cumulative advantage.” Regardless of what we may tell ourselves, most of us are strongly influenced by what others around us are reading or watching, so success breeds success.

Research on X (formerly Twitter) backs this up; researchers have found it essentially impossible to predict a tweet’s popularity by analyzing its content using machine learning.

As early as 1985, renowned natural language processing researcher Frederick Jelinek said, “Every time I fire a linguist the performance of the speech recognizer goes up,” the idea being that the presence of experts hindered rather than helped the effort to develop an accurate model.

To generate a single token—part of a word—ChatGPT has to perform roughly a trillion arithmetic operations.
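
A rough back-of-the-envelope check on that figure (the parameter count below is an illustrative assumption, since OpenAI has not disclosed ChatGPT’s size): a dense transformer performs roughly two arithmetic operations per parameter, one multiply and one add, for each token it generates.

```python
# Back-of-the-envelope sketch, not official figures: a dense transformer
# does roughly 2 arithmetic operations (one multiply + one add) per
# parameter for every token it generates.

def ops_per_token(num_parameters: float) -> float:
    """Approximate arithmetic operations needed to generate one token."""
    return 2 * num_parameters

# Assuming a hypothetical model with ~500 billion parameters:
params = 500e9
print(f"~{ops_per_token(params):.1e} operations per token")  # ~1.0e+12, about a trillion
```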

Fine-tuning merely changes the model’s behavior; it “unlocks” specific capabilities. In other words, fine-tuning is an elaborate way of telling the model what the user wants it to do. But pretraining, rather than fine-tuning, is what gives it the capability to function in that way. This explains the P in ChatGPT, which stands for “pretrained.”

Philosopher Harry Frankfurt defined bullshit as speech that is intended to persuade without regard for the truth. In this sense, chatbots are bullshitters. They are trained to produce plausible text, not true statements. ChatGPT is shockingly good at sounding convincing on any conceivable topic.

AI “agents” are bots that perform complex tasks by breaking them down into subtasks—and those subtasks into yet more subtasks, as many times as necessary—and farming out the subtasks to copies of themselves.
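
A minimal sketch of that recursive pattern, assuming hypothetical decompose and execute helpers that stand in for language-model calls (this is an illustration, not code from the book):

```python
from typing import Callable, List

# Minimal sketch of recursive task decomposition (illustrative only).
# `decompose` and `execute` are hypothetical stand-ins for calls to a language model.

def run_agent(
    task: str,
    decompose: Callable[[str], List[str]],
    execute: Callable[[str], str],
    depth: int = 0,
    max_depth: int = 3,
) -> str:
    # Stop decomposing at max_depth to avoid unbounded recursion.
    subtasks = decompose(task) if depth < max_depth else []
    if not subtasks:
        # Base case: the task is simple enough to execute directly.
        return execute(task)
    # Recursive case: farm each subtask out to a fresh copy of this procedure,
    # then stitch the partial results back together.
    parts = [run_agent(sub, decompose, execute, depth + 1, max_depth) for sub in subtasks]
    return "\n".join(parts)

# Toy usage with hard-coded stand-ins, just to show the control flow:
plan = {"write a report": ["draft an outline", "write the sections", "edit the draft"]}
print(run_agent(
    "write a report",
    decompose=lambda t: plan.get(t, []),  # a real agent would ask a model to decompose
    execute=lambda t: f"[done: {t}]",     # a real agent would ask a model to execute
))
```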

What we’ve seen in the history of AI research is that once one aspect gets automated, other aspects that weren’t recognized earlier tend to reveal themselves as bottlenecks. For example, once we could write complex pieces of code, we found that there was only so far we could go by making codebases more complex—further progress in AI depended on collecting large datasets. The fact that datasets were the bottleneck wasn’t even recognized for a long time.

There is one fundamental difference between AI and crypto. Despite being touted as the future of the internet, crypto and Web3 lack socially beneficial uses.

There can also be more subtle ways of misinforming readers. For instance, accuracy numbers can appear inflated if one of the outcomes is much more prevalent than the other. In civil war prediction, peace observations are much more likely than observations of war. So, a model can have 99 percent accuracy just by predicting there will be peace all the time.
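
A toy numerical illustration of that point (mine, not the authors’): with 99 percent peace observations, a model that always predicts peace scores about 99 percent accuracy while detecting zero wars.

```python
import random

# Toy illustration of accuracy under class imbalance (synthetic data, not real conflict data).
random.seed(0)
labels = ["war" if random.random() < 0.01 else "peace" for _ in range(10_000)]
predictions = ["peace"] * len(labels)  # the "always predict peace" model

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
wars_detected = sum(p == y == "war" for p, y in zip(predictions, labels))

print(f"accuracy: {accuracy:.1%}")        # roughly 99%
print(f"wars detected: {wars_detected}")  # 0 -- useless for the purpose it was built for
```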

Yet another myth about regulation is that it always lags behind the development of technology. This perception is partly fueled by the complex nature of technology, which can be intimidating to those who aren’t well versed in it. But the law isn’t just about technical details; it’s also about principles. The First Amendment of the United States Constitution, which guarantees freedom of speech, was drafted centuries before the invention of the internet. Yet, it’s still used as a guiding principle when dealing with issues like online censorship and hate speech. The details of how to apply these principles to new technology may change, but the principles themselves remain relatively stable over time.

Automation often decreases the number of people working in a job or sector without eliminating it, as has happened gradually with farming.

In other areas, automation has lowered the cost of goods or services, leading to more demand for them. This is what happened with the introduction of ATMs in banks. The machines reduced the cost of running bank branches, which in turn led to an increase in the number of branches, and therefore bank tellers, overall. This is known as the automation paradox.

Finally, perhaps the most common type of impact from automation is a change in the nature of job duties. An office assistant in 1980 may have spent a lot of time organizing filing cabinets and typing dictated notes. Those tasks are obsolete, but today that same assistant might help make PowerPoint presentations and troubleshoot digital devices.