Machine Learning: Cheating Is a Feature?

August 9, 2022

I read “MIT Boffins Make AI Chips 1 Million Times Faster Than the Synapses in the Human Brain. Plus: Why ML Research Is Difficult to Produce – and Army Lab Extends AI Contract with Palantir.” I dismissed the first item as more of the quantum supremacy stuff output by high school science club types. I ignored the Palantir Technologies item because the US Army has to make a distributed common ground system work and leave resolution to the next team rotation. Good or bad, Palantir has the ball. But the middle item in the club sandwich article contains a statement I found particularly interesting.

If you have followed our comments about smart software, you know we have taken a pragmatic view of getting “AI/ML” systems to work in the 80 to 95 percent confidence range in a consistent way even when new “content objects” are fed into the zeros and ones. To get off on the right foot, human subject matter experts assembled training data which reflected the content the system would be processing in the real world. The way smart software is expected to work is that it learns… on its own… sort of. It is very time consuming and very expensive to create hand-crafted training sets and then “update” the system with the affected module. What if the prior content had to be reprocessed? Well, not too many have the funds, time, resources, and patience for that.

Thus, today’s AI/ML forward leaning cost conscious wizards want to use synthetic data, minimize the human SMEs’ cost and time, and do everything auto-magically. Sounds good. Yes, and the ideas make great PowerPoint decks too.

The sentence in the article which caught my attention is this one:

Data leakage occurs when the data used to train an algorithm can leak into its testing; when its performance is assessed the model seems better than it actually is because it has already, in effect, seen the answers to the questions. Sometimes machine learning methods seem more effective than they are because they aren’t tested in more robust settings.

Here’s the link to “Leakage and the Reproducibility Crisis in ML-Based Science,” in which more details appear. Wowza if these experts are correct. Who goes swimming without a functioning snorkel? Maybe the Google?
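
For readers who want to see the leakage described in the quoted passage in action, here is a toy sketch in Python. It is not taken from the article or the paper; every name in it (make_rows, fit_memorizer, accuracy) is made up for illustration. The labels are coin flips, so there is nothing real to learn, and the "model" simply memorizes its training rows. Score it on a test set that contains some of those training rows and it looks smarter than it is.

```python
import random


def make_rows(n, seed):
    """Build n (features, label) rows; labels are coin flips, so nothing is learnable."""
    rng = random.Random(seed)
    return [
        (tuple(rng.random() for _ in range(3)), rng.randint(0, 1))
        for _ in range(n)
    ]


def fit_memorizer(train_rows):
    """Hypothetical 'model' that only memorizes the rows it was trained on."""
    table = dict(train_rows)  # feature tuple -> label

    def predict(features):
        # Rows never seen in training get a fixed guess of 0.
        return table.get(features, 0)

    return predict


def accuracy(predict, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(x) == y for x, y in rows) / len(rows)


train = make_rows(200, seed=1)
clean_test = make_rows(200, seed=2)         # no overlap with training rows
leaky_test = train[:100] + clean_test[:100]  # half the "test" rows were trained on

model = fit_memorizer(train)
clean_acc = accuracy(model, clean_test)
leaky_acc = accuracy(model, leaky_test)

print(f"accuracy on clean held-out data: {clean_acc:.2f}")
print(f"accuracy on leaky test data:     {leaky_acc:.2f}")
```

The clean number hovers around a coin toss, which is the honest answer for random labels. The leaky number is inflated only because the system has, as the quote puts it, already seen the answers to the questions.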

Stephen E Arnold, August 8, 2022

Written by Stephen E. Arnold · Filed Under AI, News


Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.