Smart Software and Cartels: Another View of the Question To Google or Not to Google?

December 7, 2021

I read “A Cartel of Influential Datasets Is Dominating Machine Learning Research New Study Suggests.” The “team” beavering away is an impressive one to the AI bigwigs: The University of California at Los Angeles and Google. The findings are interesting. Developers of smart software have relied on widely available datasets. And those datasets may pose a problem. The datasets are not ones the average computer user will know about, much less understand what is in them. But they are available and less expensive than building a collection of data, making sure it is sort of unbiased, benchmarking the dataset, and then deploying it in such a manner that errors or statistical eddies, currents, and drifts are noted and addressed. Wow, that’s a lot of work, and it is expensive. It is just more efficient to use what’s available and trust the “law of big numbers” or the magic of statistical procedures to fill in the potholes.

The problem is that the expensive alternative is a non-starter in today’s go-go, let’s-make-money-now world. This means that my interpretation of this allegedly objective, peer-reviewed, credential-bedecked study is different.

Here’s what I think is afoot. The research discredits what most of the companies building machine-learning-centric solutions are doing. The fix, in my opinion, is Google’s embrace of the principles and practices of the Stanford Artificial Intelligence Laboratory or SAIL. The idea is manifested in Dr. Christopher Ré’s research, the DeepDive system, and the Snorkel open source software and commercial variants.

The solution is to skip as much of the human involvement in training as possible. Let the downstream system work out the details and fix the pavement in the information superhighway. The Snorkel approach is going to be better in every possible way than using the widely available datasets and a whole lot cheaper than creating training data by hand and then paying quite a few subject matter experts to tune the system.
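For readers who want a concrete picture of what “skipping the human” can look like, here is a minimal sketch of programmatic labeling. It is not Snorkel’s API and not its actual method: the labeling functions, the spam/ham task, and the simple majority vote are all hypothetical stand-ins chosen for illustration. Snorkel, as I understand it, learns how much to trust each rule rather than counting votes equally.

```python
from collections import Counter

# Hypothetical marker meaning "this rule has no opinion on this example".
ABSTAIN = None

# Hypothetical labeling functions: cheap heuristics written once,
# instead of a person labeling every example by hand.
def lf_contains_free(text):
    return "spam" if "free" in text.lower() else ABSTAIN

def lf_contains_meeting(text):
    return "ham" if "meeting" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    return "spam" if text.count("!") >= 2 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_free, lf_contains_meeting, lf_many_exclamations]

def weak_label(text):
    """Combine noisy rule votes by simple majority (an assumed stand-in
    for a learned label model). Returns ABSTAIN on no votes or a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting rules, no clear winner
    return ranked[0][0]

if __name__ == "__main__":
    for message in ["FREE prize!!! Click now",
                    "Agenda for the meeting tomorrow",
                    "Free lunch at the meeting"]:
        print(repr(message), "->", weak_label(message))
```

The point of the sketch is the trade: a handful of rules replaces a room full of annotators, and whatever blind spots those rules have flow straight into the training labels.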

Net net: My hunch is that Google is lobbying for its approach and the opportunity to put in place a Googley solution. And what if those outputs are biased? Well, that’s just not possible, is it? One should ask Dr. Timnit Gebru and others who took umbrage at how the estimable Google responded to a bright person’s questions about the broader Google play.

PS. Check out the original research paper, the Snorkel method, and the push back from Xoogler Dr. Gebru. This is an important moment for smart software: To Google or not to Google? That is the question.

Stephen E Arnold, December 7, 2021
