Datasets: An Analysis Which Tap Dances around Some Consequences

December 22, 2021

I read “3 Big Problems with Datasets in AI and Machine Learning.” The arguments presented support the SAIL, Snorkel, and Google type approach to building datasets. I have addressed some of my thoughts about configuring once and letting fancy math do the heavy lifting going forward. This is probably not the intended purpose of the Venture Beat write up. My hunch is that pointing out other people’s problems frames the SAIL, Snorkel, and Google type approaches. No one asks, “What happens if the SAIL, Snorkel, and Google type approaches don’t work or have some interesting downstream consequences?” Why bother?

Here are the problems as presented by the cited article:

  1. The Training Dilemma. The write up says: “History is filled with examples of the consequences of deploying models trained using flawed datasets.” That’s correct. The challenge in creating and validating a training set for a discipline, topic, or “space” is that new content arrives using new lingo and even metaphors instead of words like “rock.” What informed people from the early days of Autonomy’s neuro-linguistic method know about building a dataset is that no one wants to spend money, time, and computing resources on endless Sisyphean work. That rock keeps rolling back down the hill. This is a deal breaker, so considerable effort has been expended figuring out how to cut corners, use good enough data, set loose shoes thresholds, and rely on normalization to smooth out the acne scars. Thus, we are in an era of using what’s available. Make it work or become a content creator on TikTok.
  2. Issues with Labeling. I don’t like it when the word “indexing” is replaced with words like labels, metatags, hashtags, and semantic sign posts. Give me a break. Automatic indexing is more consistent than human indexers, who get tired and fall back on a quiver of terms. Who wants to work too hard at what is, for many, a boring job? But the automatic systems are in the same “good enough” basket as smart training data set creation. The problem is words and humans. Software is clueless when it comes to snide remarks, cynicism, certain types of fake news and bogus research reports in peer reviewed journals, etc. Indexing using esoteric words means the Average Joe and Janet can’t find the content. Indexing with everyday words means that search results work great for pizza near me but not so well for beatles diet when I want food insects eat, not what kept George thin. The write up says: “Still other methods aim to replace real-world data with partially or entirely synthetic data — although the jury’s out on whether models trained on synthetic data can match the accuracy of their real-world-data counterparts.” Yep, let’s make up stuff.
  3. A Benchmarking Problem. The write up asserts: “SOTA benchmarking [also] does not encourage scientists to develop a nuanced understanding of the concrete challenges presented by their task in the real world, and instead can encourage tunnel vision on increasing scores. The requirement to achieve SOTA constrains the creation of novel algorithms or algorithms which can solve real-world problems.” Got that. My view is that validating data is a bridge too far for anyone except a graduate student working for a professor with grant money. But why benchmark when one can go snorkeling? The reality is that datasets are in most cases flawed but no one knows how flawed. Just use them and let the results light the path forward. Cheap and sounds good when couched in jargon.

What’s the fix? The fix is what I call the SAIL, Snorkel, and Google type solution. (Yep, Facebook digs in this sandbox too.)

My take is easily expressed, just not popular. Too bad.

  1. Do the work to create and validate a training set. Rely on subject matter experts to check outputs and, when the outputs drift, hit the brakes, recalibrate, and retrain.
  2. Admit that outputs are likely to be incomplete, misleading, or just plain wrong. Knock off the good enough approach to information.
  3. Return to methods which require thresholds to be validated by user feedback and output validity. Letting cheap and fast methods decide which secondary school teacher gets fired strikes me as not too helpful.
  4. Make sure analyses of solutions don’t function as advertisements for the world’s largest online ad outfit.
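
For anyone wondering what “hit the brakes” in point 1 might look like in practice, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of any vendor’s system: the class name `DriftMonitor`, the window size, and the 0.9 agreement floor are all made up for the example. The idea is simply that subject matter experts audit a sample of outputs, and when agreement over the most recent audits falls below a floor, the system is flagged for recalibration and retraining.

```python
from collections import deque


class DriftMonitor:
    """Track how often model outputs agree with subject matter expert labels.

    Hypothetical helper for illustration only; the window size and the
    agreement floor are assumptions, not values from the cited article.
    """

    def __init__(self, window=50, min_agreement=0.9):
        # Keep only the most recent `window` audited outputs.
        self.results = deque(maxlen=window)
        self.min_agreement = min_agreement

    def record(self, model_label, expert_label):
        """Store whether the model matched the expert on one audited item."""
        self.results.append(model_label == expert_label)

    def agreement(self):
        """Fraction of recent audited outputs the expert agreed with."""
        if not self.results:
            return None
        return sum(self.results) / len(self.results)

    def needs_retraining(self):
        """True once the window is full and agreement is below the floor."""
        if len(self.results) < self.results.maxlen:
            return False  # not enough audited items to judge yet
        return self.agreement() < self.min_agreement
```

The floor itself is the kind of threshold point 3 is about: it should be set and revisited using expert and user feedback, not left at whatever default shipped.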

Stephen E Arnold, December 22, 2021

