Smart Software: Does ML Have Fragile Hips and Painted Lips?

November 25, 2020

It looks like machine learning must head back to the drawing board. MIT Technology Review discusses recent findings from Google in, “The Way We Train AI Is Fundamentally Flawed.” This is not about the well-known problem of data shift, where data used to train algorithms is not close enough to real-world examples. This is something else entirely.

The paper, which represents the efforts of 40 researchers across seven Googley teams, brings what one engineer describes as a “wrecking ball” to the field. The paper calls the issue “underspecification,” a term from statistics that describes an observed effect with many possible causes. Lead researcher Alex D’Amour, with his background in causal reasoning, found the term applies quite well to the machine-learning problem he set out to investigate. Writer Will Douglas Heaven explains:

“Roughly put, building a machine-learning model involves training it on a large number of examples and then testing it on a bunch of similar examples that it has not yet seen. When the model passes the test, you’re done. What the Google researchers point out is that this bar is too low. The training process can produce many different models that all pass the test but—and this is the crucial part—these models will differ in small, arbitrary ways, depending on things like the random values given to the nodes in a neural network before training starts, the way training data is selected or represented, the number of training runs, and so on. These small, often random, differences are typically overlooked if they don’t affect how a model does on the test. But it turns out they can lead to huge variation in performance in the real world. In other words, the process used to build most machine-learning models today cannot tell which models will work in the real world and which ones won’t.”
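The effect described above can be illustrated with a toy sketch (our own illustration, not code from the paper): several tiny perceptrons are trained on the same data, differing only in their random initial weights. All of them pass the held-out test, yet they need not agree on a point that sits off the training distribution.

```python
import random

def train_perceptron(data, seed, epochs=200, lr=0.1):
    """Train a 2-input perceptron; the seed changes only the initial weights."""
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    b = rng.uniform(-1, 1)
    for _ in range(epochs):
        for x, y in data:
            pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            err = y - pred
            w[0] += lr * err * x[0]
            w[1] += lr * err * x[1]
            b += lr * err
    return w, b

def predict(model, x):
    w, b = model
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Tiny made-up dataset: the label is determined by x[0],
# but x[1] happens to correlate with it (a spurious cue).
train_data = [((0.0, 0.1), 0), ((0.2, 0.0), 0),
              ((0.8, 0.9), 1), ((1.0, 1.0), 1)]
test_data = [((0.1, 0.05), 0), ((0.9, 0.95), 1)]

models = [train_perceptron(train_data, seed) for seed in range(10)]

# Every seed passes the in-distribution test ...
assert all(predict(m, x) == y for m in models for x, y in test_data)

# ... but on an off-distribution point (x[0] low, x[1] high),
# seeds that leaned on the spurious cue may answer differently.
odd_point = (0.1, 1.0)
answers = {predict(m, odd_point) for m in models}
print(answers)
```

All ten models clear the same bar on `test_data`, so the standard training pipeline cannot distinguish among them; only probing off-distribution inputs reveals whether the seeds happened to converge on the same behavior.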

The researchers found this problem in all sorts of AI applications. Using identical training methods, they created several machine-learning models and ran stress tests designed to surface performance differences. These models covered the areas of image recognition, natural language processing, and medical AI; see the article for specifics.

The main conclusion is that much more testing is needed before machine-learning systems are put into practice—a process that is not always possible due to a lack of real-world data. D’Amour also advises that engineers would do well to greatly narrow the requirements for their models. Then there is the suggestion that designers produce many models, test those on real-world tasks, and pick the best performer. Not a simple method, but one that might be worth the time and effort for a large company like Google. Whatever the solutions, it is clear AI is not performing as promised. Not yet at least.
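The "produce many models, test them on real-world tasks, pick the best" suggestion amounts to a simple selection loop. Here is a minimal generic sketch; the candidate models, the stress set, and the threshold-based `predict` function are stand-in assumptions for illustration, not details from the paper:

```python
def select_model(candidates, stress_set, predict):
    """Score each trained candidate on a held-out real-world
    (stress) set and return the best performer."""
    def score(model):
        return sum(predict(model, x) == y for x, y in stress_set) / len(stress_set)
    return max(candidates, key=score)

# Toy stand-ins: each "model" is just a decision threshold on one feature.
candidates = [0.3, 0.5, 0.7]
stress_set = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
predict = lambda thresh, x: int(x > thresh)

best = select_model(candidates, stress_set, predict)
print(best)  # 0.5 -- the only threshold that classifies all four stress examples correctly
```

The expensive part in practice is not the loop but producing the candidates and gathering a stress set that actually reflects deployment conditions—hence the observation that this remedy suits a company with Google's resources.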

Cynthia Murrell, November 25, 2020
