Why Be Like Clearview AI? Google Fabs Data the Way TSMC Makes Chips
April 8, 2022
Machine learning requires data. Lots of data. Datasets can set AI trainers back millions of dollars, and even that outlay does not guarantee a collection free of problems like bias and privacy issues. Researchers at MIT have developed another way, at least when it comes to image identification. The World Economic Forum reports, “These AI Tools Are Teaching Themselves to Improve How They Classify Images.” Of course, one must start somewhere, so a generative model is first trained on some actual data. From there, it generates synthetic data that, we’re told, is almost indistinguishable from the real thing. Writer Adam Zewe quotes the paper’s lead author, Ali Jahanian, who emphasizes:
“But generative models are even more useful because they learn how to transform the underlying data on which they are trained, he says. If the model is trained on images of cars, it can ‘imagine’ how a car would look in different situations — situations it did not see during training — and then output images that show the car in unique poses, colors, or sizes. Having multiple views of the same image is important for a technique called contrastive learning, where a machine-learning model is shown many unlabeled images to learn which pairs are similar or different. The researchers connected a pretrained generative model to a contrastive learning model in a way that allowed the two models to work together automatically. The contrastive learner could tell the generative model to produce different views of an object, and then learn to identify that object from multiple angles, Jahanian explains. ‘This was like connecting two building blocks. Because the generative model can give us different views of the same thing, it can help the contrastive method to learn better representations,’ he says.”
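The pairing Jahanian describes is easy to sketch. Below is a minimal, hypothetical PyTorch illustration, not the researchers’ actual code: a frozen stand-in generator (a tiny MLP playing the role of a pretrained generative model) produces two “views” of the same content by slightly perturbing a latent code, and a contrastive encoder is trained with a standard NT-Xent loss to pull matching views together. Every name, dimension, and hyperparameter here is illustrative.

```python
# Minimal sketch, assuming a frozen pretrained generator; all names illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM, IMG_DIM, EMB_DIM = 64, 3 * 32 * 32, 128

# Stand-in "pretrained" generator: maps a latent code to a flattened image.
generator = nn.Sequential(nn.Linear(LATENT_DIM, 512), nn.ReLU(), nn.Linear(512, IMG_DIM))
for p in generator.parameters():
    p.requires_grad_(False)  # the generator stays frozen; only the encoder learns

# Contrastive encoder to be trained.
encoder = nn.Sequential(nn.Linear(IMG_DIM, 256), nn.ReLU(), nn.Linear(256, EMB_DIM))
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

def two_views(z, sigma=0.3):
    """Ask the generator for two views of the same content by perturbing the
    latent code slightly; this stands in for the 'unique poses, colors, sizes'
    the quoted passage mentions."""
    return (generator(z + sigma * torch.randn_like(z)),
            generator(z + sigma * torch.randn_like(z)))

def nt_xent(a, b, tau=0.1):
    """Standard NT-Xent contrastive loss: matching views attract, others repel."""
    a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
    reps = torch.cat([a, b])                   # (2N, D)
    sim = reps @ reps.T / tau                  # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))          # a sample is not its own positive
    n = a.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

for step in range(100):
    z = torch.randn(32, LATENT_DIM)            # fresh "content" codes each step
    v1, v2 = two_views(z)
    loss = nt_xent(encoder(v1), encoder(v2))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The design choice worth noticing is the division of labor: the generator supplies cheap, endlessly varied positive pairs, so the contrastive learner never needs labeled or even stored images.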
Ah, algorithmic teamwork. Another advantage of this method is the nearly infinite number of samples the model can generate, since more samples (usually) make for a better-trained AI. Jahanian also notes that once a generative model has created a repository of synthetic data, that resource can be posted online for others to use. The team also hopes to use its technique to generate corner cases, which often cannot be learned from real datasets and are especially troublesome in potentially dangerous applications like self-driving cars. If this hope is realized, it could be a huge boon.
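One way to picture the “nearly infinite samples” point: wrap a trained generator as a data stream that synthesizes fresh examples on demand instead of reading a fixed set of files. A hypothetical sketch, with the same kind of stand-in generator as above:

```python
# Minimal sketch, assuming any pretrained generative model; names illustrative.
import torch
import torch.nn as nn
from torch.utils.data import IterableDataset, DataLoader

class SyntheticImageStream(IterableDataset):
    """Yields generator outputs indefinitely; there is no fixed dataset size."""
    def __init__(self, generator, latent_dim=64):
        self.generator = generator
        self.latent_dim = latent_dim

    def __iter__(self):
        while True:
            with torch.no_grad():
                z = torch.randn(1, self.latent_dim)
                yield self.generator(z).squeeze(0)

# Stand-in generator; a real pipeline would load pretrained weights here.
generator = nn.Sequential(nn.Linear(64, 512), nn.ReLU(), nn.Linear(512, 3 * 32 * 32))
loader = DataLoader(SyntheticImageStream(generator), batch_size=32)
batch = next(iter(loader))  # a fresh batch of 32 synthetic samples, never stored on disk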
This all sounds great, but what if—just a minor if—the model is off base? And, once this tech moves out of the laboratory, how would we know? The researchers acknowledge a couple of other limitations. For one, their generative models can occasionally reveal source data, which negates the privacy advantage. Furthermore, any biases in the limited datasets used for the initial training will be amplified unless the model is “properly audited.” It seems transparency, which somehow remains elusive in commercial AI applications, would be crucial. Perhaps the researchers have an idea of how to solve that riddle.
Funding for the project was supplied, in part, by the MIT-IBM Watson AI Lab, the United States Air Force Research Laboratory, and the United States Air Force Artificial Intelligence Accelerator.
Cynthia Murrell, April 8, 2022