Why Some Outputs from Smart Software Are Wonky

July 26, 2021

Some models work like a champ. Utility rate models are reasonably reliable. When it is hot, use of electricity goes up. Rates are then “adjusted.” Perfect. Other models are less solid; for example, Bayesian systems which are not checked every hour or large neural nets which are “assumed” to be honking along like a well-ordered flight of geese. Why do I offer such Negative Ned observations? Experience, for one thing, and the nifty little concepts tossed out by Ben Kuhn, a Twitter persona. You can locate this string of observations at this link. Well, you could as of July 26, 2021, at 6:30 am US Eastern time. Here’s a selection of what are apparently the highlights of Mr. Kuhn’s conversation with “a former roommate.” That’s provenance enough for me.

Item One:

Most big number theory results are apparently 50-100 page papers where deeply understanding them is ~as hard as a semester-long course. Because of this, ~nobody has time to understand all the results they use—instead they “black-box” many of them without deeply understanding.

Could this be true? How could newly minted, be-an-expert-with-our-$40-online-course professionals who use models packaged in downloadable, easy-to-plug-in modules be unfamiliar with the inner workings of said bundles of brilliance? Impossible? Really?

Item Two:

A lot of number theory is figuring out how to stitch together many different such black boxes to get some new big result. Roommate described this as “flailing around” but also highly effective and endorsed my analogy to copy-pasting code from many different Stack Overflow answers.

Oh, come on. Flailing around. Do developers flail, or do they “trust” the outfits who pretend to know how some multi-layered systems work? Fiddling with assumptions, thresholds, and (close your ears) the data themselves is never, ever a way to work around a glitch.

Item Three:

Roommate told a story of using a technique to calculate a number and having a high-powered prof go “wow, I didn’t know you could actually do that”

No kidding? That’s impossible in general, and that expression would never be uttered at Amazon-, Facebook-, and Google-type operations, would it?

Will Mr. Kuhn be banned for heresy? [Keep in mind how Wikipedia defines this term: heresy “is any belief or theory that is strongly at variance with established beliefs or customs, in particular the accepted beliefs of a church or religious organization.”] In another era, just repeating such an idea once would have warranted a close encounter with an Iron Maiden or a pile of firewood. Probably not today. Someone might emit a slightly critical tweet, however.

Stephen E Arnold, July 26, 2021

Elasticsearch Versus RocksDB: The Old Real Time Razzle Dazzle

July 22, 2021

Something happens. The “event” is captured and written to a file. Even if you are watching the “something” happening, there is latency between the event and the sensor or the human perceiving the event. The calculus of real time is mostly avoiding too much talk about latency. But real time is hot because who wants to look at old data? Not TikTok fans, and not the money-fueled lovers of Robinhood.

“Rockset CEO on Mission to Bring Real-Time Analytics to the Stack” uses lots of buzzwords, sidesteps inherent latency, and avoids commentary on other allegedly real-time analytics systems. Rockset is built on RocksDB, an open source key-value store. Nevertheless, there is some interesting information about Elasticsearch; for example:

  • Unsupported factoids like: “Every enterprise is now generating more data than what Google had to index in [year] 2000.”
  • No definition or baseline for “simple”: “The combination of the converged index along with the distributed SQL engine is what allows Rockset to be fast, scalable, and quite simple to operate.”
  • Different from Elasticsearch and RocksDB: “So the biggest difference between Elastic and RocksDB comes from the fact that we support full-featured SQL including JOINs, GROUP BY, ORDER BY, window functions, and everything you might expect from a SQL database. Rockset can do this. Elasticsearch cannot.”
  • Similarities with Rockset: “So Lucene and Elasticsearch have a few things in common with Rockset, such as the idea to use indexes for efficient data retrieval.”
  • Jargon and unique selling proposition: “We use converged indexes, which deliver both what you might get from a database index and also what you might get from an inverted search index in the same data structure. Lucene gives you half of what a converged index would give you. A data warehouse or columnar database will give you the other half. Converged indexes are a very efficient way to build both.”

Amazon has rolled out its real time system, and there are a number of options available from vendors like Trendalyze.

Each of these vendors emphasizes real time. The problem, however, is that latency exists regardless of system. Each has use cases which make its system seem to be the solution to real time data analysis. That’s what makes horse races interesting. These unfold in real time if one is at the track. Fractional delays have big consequences for those betting their solution is the least latent.
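The “converged index” pitch quoted above is easier to evaluate with a concrete picture in mind. Below is a minimal, hypothetical Python sketch of the general idea of pairing an inverted index (for search-style filtering) with column-oriented storage (for analytics-style aggregation) over the same documents. It illustrates the concept only; it is not Rockset’s, Elasticsearch’s, or Lucene’s implementation, and every name in it is invented.

```python
from collections import defaultdict

# Toy "converged" structure: the same documents are reachable through an
# inverted index (term -> doc ids) for search-style filtering and a
# columnar store (field -> values) for analytics-style aggregation.
# Illustration of the general idea only, not any vendor's design.

class ToyConvergedIndex:
    def __init__(self):
        self.inverted = defaultdict(set)   # term -> set of doc ids
        self.columns = defaultdict(dict)   # field -> {doc_id: value}

    def add(self, doc_id, doc):
        for field, value in doc.items():
            self.columns[field][doc_id] = value
            if isinstance(value, str):
                for term in value.lower().split():
                    self.inverted[term].add(doc_id)

    def search(self, term):
        return self.inverted.get(term.lower(), set())

    def avg(self, field, doc_ids):
        values = [self.columns[field][d] for d in doc_ids if d in self.columns[field]]
        return sum(values) / len(values) if values else None


idx = ToyConvergedIndex()
idx.add(1, {"title": "real time dashboard", "latency_ms": 120})
idx.add(2, {"title": "batch report", "latency_ms": 900})
idx.add(3, {"title": "real time alerts", "latency_ms": 80})

hits = idx.search("real")            # inverted-index lookup
print(idx.avg("latency_ms", hits))   # columnar aggregation over the hits
```

The selling point, as the interview frames it, is that a keyword filter and a numeric aggregation land on the same structure instead of two separate systems.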

Stephen E Arnold, July 22, 2021

A Theory: No Room for Shortcuts in Healthcare Datasets

July 1, 2021

The value of any machine learning algorithm depends on the data it was trained on, we are reminded in the article, “Machine Learning Deserves Better Than This,” at the AAAS Science Magazine site. Writer Derek Lowe makes some good points that are, nevertheless, likely to make him unpopular among the rah-rah AI crowd. He is specifically concerned with the ways machine learning is currently being applied in healthcare. As an example, Lowe examines a paper on coronavirus pathology as revealed in lung X-ray data. He writes:

“Every single one of the studies falls into clear methodological errors that invalidate their conclusions. These range from failures to reveal key details about the training and experimental data sets, to not performing robustness or sensitivity analyses of their models, not performing any external validation work, not showing any confidence intervals around the final results (or not revealing the statistical methods used to compute any such), and many more. A very common problem was the (unacknowledged) risk of bias right up front. Many of these papers relied on public collections of radiological data, but these have not been checked to see if the scans marked as COVID-19 positive patients really were (or if the ones marked negative were as well). It also needs to be noted that many of these collections are very light on actual COVID scans compared to the whole database, which is not a good foundation to work from, either, even if everything actually is labeled correctly by some miracle. Some papers used the entire dataset in such cases, while others excluded images using criteria that were not revealed, which is naturally a further source of unexamined bias.”

As our regular readers are aware, any AI is only as good as the data it is trained upon. However, data scientists can be so eager to develop tools (or, to be less charitable, to get published) that they take shortcuts. Some, for example, accept all data from public databases without any verification. Others misapply data, like the collection of lung x-rays from patients under the age of five that was included in the all-ages pneumonia dataset. Then there are the datasets and algorithms that simply do not have enough documentation to be trusted. How was the imaging data pre-processed? How was the model trained? How was it selected and validated? Crickets.
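Those questions are not rhetorical flourishes; the missing pieces Lowe lists, external validation and confidence intervals among them, are cheap to report. Here is a hedged, purely synthetic Python sketch of the discipline being asked for: evaluate on an internal test set, then on an “external” cohort drawn from a shifted distribution, and attach a bootstrap confidence interval to each number. The model, cohorts, and features are all fabricated for illustration.

```python
import random

random.seed(0)

# Hypothetical illustration only: an internal test set plus a separate
# "external" set from a different (synthetic) site, with a bootstrap
# confidence interval for accuracy. Real studies would use real scans and
# a real model; this sketch only shows the reporting discipline.

def toy_model(features):
    # Stand-in classifier: predicts positive when the first feature is high.
    return 1 if features[0] > 0.5 else 0

def make_cohort(n, shift=0.0):
    data = []
    for _ in range(n):
        label = random.randint(0, 1)
        x = random.random() * 0.5 + (0.5 if label else 0.0) + shift
        data.append(([x], label))
    return data

def accuracy(model, cohort):
    return sum(model(x) == y for x, y in cohort) / len(cohort)

def bootstrap_ci(model, cohort, n_boot=1000, alpha=0.05):
    scores = []
    for _ in range(n_boot):
        sample = [random.choice(cohort) for _ in cohort]
        scores.append(accuracy(model, sample))
    scores.sort()
    lo = scores[int(alpha / 2 * n_boot)]
    hi = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

internal = make_cohort(200)              # same-distribution test set
external = make_cohort(200, shift=0.15)  # different "site": distribution shift

for name, cohort in [("internal", internal), ("external", external)]:
    acc = accuracy(toy_model, cohort)
    lo, hi = bootstrap_ci(toy_model, cohort)
    print(f"{name}: accuracy={acc:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

On the shifted “external” cohort the toy classifier’s accuracy drops, which is exactly the kind of result the papers Lowe criticizes never get around to reporting.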

We understand why people are excited about the potential of machine learning in healthcare, a high-stakes field where solutions can be frustratingly elusive. However, it benefits no one to rely on conclusions drawn from flawed data. In fact, doing so can be downright dangerous. Let us take the time to get machine learning right first.

Cynthia Murrell, July 1, 2021

TikTok: What Is the Problem? None to Sillycon Valley Pundits.

June 18, 2021

I remember making a comment in a DarkCyber video about the lack of risk TikTok posed to its users. I think I heard a couple of Sillycon Valley pundits suggest that TikTok is no big deal. Chinese links? Hey, so what. These are short videos. Harmless.

Individuals like this are lost in clouds of unknowing with a dusting of gold and silver naive sparkles.

“TikTok Has Started Collecting Your ‘Faceprints’ and ‘Voiceprints.’ Here’s What It Could Do With Them” provides some color for parents whose children are probably tracked, mapped, and imaged:

Recently, TikTok made a change to its U.S. privacy policy, allowing the company to “automatically” collect new types of biometric data, including what it describes as “faceprints” and “voiceprints.” TikTok’s unclear intent, the permanence of the biometric data and potential future uses for it have caused concern.

Well, gee whiz. The write up is pretty good, but it leaves out a few uses of these types of data:

  • Cross correlate the images with other data about a minor, young adult, college student, or aging lurker
  • Feed the data into analytic systems so that predictions can be made about the “flexibility” of certain individuals
  • Cluster young people into egg cartons so fellow travelers and their weaknesses can be exploited for nefarious or really good purposes. (A toy clustering sketch follows this list.)
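The third item is not exotic machinery. Below is a hedged, toy Python sketch of the clustering step; it assumes scikit-learn is installed, and every feature name and number is fabricated. No real TikTok data or API is involved.

```python
# Hypothetical sketch of the "cluster young people into egg cartons" idea:
# made-up per-user feature vectors (minutes watched, posting cadence, a
# fabricated voiceprint similarity score) grouped with vanilla k-means.
from sklearn.cluster import KMeans
import numpy as np

# rows: users, columns: [minutes_per_day, posts_per_week, voiceprint_score]
users = np.array([
    [180, 14, 0.91],
    [165, 12, 0.88],
    [ 30,  1, 0.20],
    [ 25,  0, 0.15],
    [ 95,  6, 0.55],
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(users)
for row, label in zip(users, labels):
    print(f"user features={row} -> cluster {label}")
```

A few dozen lines like these, fed with real behavioral and biometric signals, is all the “egg carton” metaphor amounts to.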

Will the Sillycon Valley real journalists get the message? Maybe if I convert this to a TikTok video.

Stephen E Arnold, June 18, 2021

Google Encourages Competition: Our Way or No Way. Seems Fair

June 4, 2021

I get a kick out of the Google. First, there was the really embarrassing matter of the diversity director who, a few years ago, output some spicy comments about a country. Here’s a rundown of what makes the Timnit Gebru affair seem like just another synthetic pearl in a long string of management jewelry at a flea market.

I found this story even more revealing. The context is that numerous legal eagles are slapping Googzilla with a wide range of legal documents. Many of these are related to alleged monopolistic practices. I am no lawyer, but I get the feeling that some people are concerned about Google’s ability to absorb online advertising revenues, its control over what information people can find via the universal search thing, and its Amazon-like arrogance. (Yep, Amazon is the new Big Dog, but you knew that, right?)

Here’s the key statement:

Today I Learned you can not advertise on @GoogleAds if you use @googleanalytics competitors like @matomo_org

This seems reasonable. An “if then” statement for organizations that want to tap into Google’s billions of “users.”

An entity called @HashNuke added:

This is easily identifiable as anti-competitive practice. Wouldn’t this be illegal in many countries?

If these statements are accurate, isn’t being Googley just the best way to inspire individuals and organizations? Some of those legal eagles may find the information worth checking out.

Stephen E Arnold, June 4, 2021

Data Silos vs. Knowledge Graphs

May 26, 2021

Data scientist and blogger Dan McCreary has high hopes for his field’s future, describing what he sees as the upcoming shift “From Data Science to Knowledge Science.” He predicts:

“I believe that within five years there will be dramatic growth in a new field called Knowledge Science. Knowledge scientists will be ten times more productive than today’s data scientists because they will be able to make a new set of assumptions about the inputs to their models and they will be able to quickly store their insights in a knowledge graph for others to use. Knowledge scientists will be able to assume their input features:

  1. Have higher quality
  2. Are harmonized for consistency
  3. Are normalized to be within well-defined ranges
  4. Remain highly connected to other relevant data, such as provenance and lineage metadata”

Why will this evolution occur? Because professionals are motivated to develop their way past the current tedious state of affairs—we are told data scientists typically spend 50% to 80% of their time on data clean-up. This leaves little time to explore the nuggets of knowledge they eventually find among the weeds.

As McCreary sees it, however, the keys to a solution already exist. For example, machine learning can be used to feed high-quality, normalized data into accessible and evolving knowledge graphs. He describes how MarkLogic, where he used to work, developed and uses data quality scores. Such scores would be key to building knowledge graphs that analysts can trust. See the post for more details on how today’s tedious data science might evolve into this more efficient “knowledge science.” We hope his predictions are correct, but only time will tell. About five years, apparently.
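The post does not spell out how such a quality score is computed, so the following Python sketch is a hypothetical stand-in: a per-record score built from a completeness check, a range check, and a provenance check, with invented field names and weights. It is meant only to make the idea of scoring records before they reach a knowledge graph concrete; it is not MarkLogic’s method.

```python
# Hypothetical data quality score, loosely inspired by the idea of scoring
# records before they feed a knowledge graph. The fields, checks, and
# weights below are invented for illustration.

def quality_score(record):
    checks = {
        # completeness: required fields are present and non-empty
        "complete": all(record.get(f) not in (None, "") for f in ("id", "name", "value")),
        # range: the numeric value sits inside a well-defined range
        "in_range": isinstance(record.get("value"), (int, float)) and 0 <= record["value"] <= 100,
        # provenance: the record says where it came from
        "has_provenance": bool(record.get("source")),
    }
    weights = {"complete": 0.4, "in_range": 0.3, "has_provenance": 0.3}
    return sum(weights[name] for name, passed in checks.items() if passed), checks


good = {"id": 1, "name": "sensor-7", "value": 42, "source": "plant-A feed"}
bad = {"id": 2, "name": "", "value": 250}

for rec in (good, bad):
    score, detail = quality_score(rec)
    print(rec.get("id"), round(score, 2), detail)
```

A “knowledge scientist” would presumably filter or weight incoming records by a score like this rather than hand-scrubbing them one by one.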

Cynthia Murrell, May 26, 2021

Reengineering Bias: What an Interesting Idea

April 5, 2021

If this is true, AI may be in trouble now. VentureBeat reports, “Researchers Find that Debiasing Doesn’t Eliminate Racism from Hate Speech Detection Models.” It is known that AI systems meant to detect toxic language themselves have a problem with bias. Specifically, they tend to flag text by Black users more often than text by white users. Oh, the irony. The AI gets hung up on language markers often found in vernaculars like African-American English (AAE). See the article for a few examples. Researchers at the Allen Institute tried several techniques to reteach existing systems to be more even-handed. Reporter Kyle Wiggers writes:

“In the course of their work, the researchers looked at one debiasing method designed to tackle ‘predefined biases’ (e.g., lexical and dialectal). They also explored a process that filters ‘easy’ training examples with correlations that might mislead a hate speech detection model. According to the researchers, both approaches face challenges in mitigating biases from a model trained on a biased dataset for toxic language detection. In their experiments, while filtering reduced bias in the data, models trained on filtered datasets still picked up lexical and dialectal biases. Even ‘debiased’ models disproportionately flagged text in certain snippets as toxic. Perhaps more discouragingly, mitigating dialectal bias didn’t appear to change a model’s propensity to label text by Black authors as more toxic than white authors. In the interest of thoroughness, the researchers embarked on a proof-of-concept study involving relabeling examples of supposedly toxic text whose translations from AAE to ‘white-aligned English’ were deemed nontoxic. They used OpenAI’s GPT-3 to perform the translations and create a synthetic dataset — a dataset, they say, that resulted in a model less prone to dialectal and racial biases.”

The researchers acknowledge that re-writing Black users’ posts to sound more white is not a viable solution. The real fix would be to expose AI systems to a wider variety of dialects in the original training phase, but will developers take the trouble? As with many people, once hate-speech detection bots become prejudiced, it is nigh impossible to train them out of it.
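The filtering approach the researchers tested is described only at a high level in the article, so the sketch below is a hedged illustration of the general rebalancing idea, not the paper’s method: if one dialect group is over-represented among “toxic” labels in the training data, downsample until the label rate matches across groups before training. The rows, groups, and counts are fabricated.

```python
import random

random.seed(1)

# Hypothetical sketch of one debiasing idea: if "toxic" labels are
# over-represented for one dialect group in the training data, rebalance by
# downsampling so the label rate is equal across groups before training.
# Toy data only; not the Allen Institute paper's method.

def label_rate(rows, group):
    grp = [r for r in rows if r["dialect"] == group]
    return sum(r["toxic"] for r in grp) / len(grp)

# Fabricated training rows: dialect A is unfairly over-labeled as toxic.
rows = (
    [{"dialect": "A", "toxic": 1} for _ in range(60)]
    + [{"dialect": "A", "toxic": 0} for _ in range(40)]
    + [{"dialect": "B", "toxic": 1} for _ in range(20)]
    + [{"dialect": "B", "toxic": 0} for _ in range(80)]
)

target = label_rate(rows, "B")  # match group A's toxic rate to group B's
a_toxic = [r for r in rows if r["dialect"] == "A" and r["toxic"]]
a_clean = [r for r in rows if r["dialect"] == "A" and not r["toxic"]]

# keep enough toxic-A rows that toxic / (toxic + clean) == target
keep = int(target * len(a_clean) / (1 - target))
balanced = random.sample(a_toxic, keep) + a_clean + [r for r in rows if r["dialect"] == "B"]

print("before:", label_rate(rows, "A"), label_rate(rows, "B"))
print("after: ", label_rate(balanced, "A"), label_rate(balanced, "B"))
```

As the researchers found, evening out the labels in the data is the easy part; the models trained on the filtered data still picked up lexical and dialectal cues.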

Cynthia Murrell, April 5, 2021

Google and Cookies: Crafting Quite Tasty Bait

March 19, 2021

I read “Alphabet: Five Things We Know about Google’s Ad Changes after Cookies.” I approached the write up with some interest. Cookies have been around for a long time. The reason? They allowed a number of interesting functions, including tracking, cross correlation of user actions, and a covert existence.

Now, no more Google cookies.

The write up explains what Google wants keen observers, real journalists, and thumbtypers to know; to wit:

  1. Privacy is really, really important to Google—now. Therefore, the GOOG won’t support third party cookies. Oh, shucks, what about cross site tracking? Yeah, what about it?
  2. Individuals can be targeted. Those with a rifle shot orientation have to provide data to the Google and use the Google software system called “customer match.” Yeah, ad narrowcasting lives.
  3. Google will draw some boundaries about its data leveraging for advertisers. But what about “publishers”? Hey, Google has some special rules. Yeah, a permeable membrane for certain folks.
  4. FLoC (Federated Learning of Cohorts) makes non-personalized ad targeting possible. I want to write, “You’ve been FLoC’ed” but I shall not. Yeah, FLoC. But you can always try FLEDGE. So “You’ve been FLEDGED” is a possibility.

How’s this work? The write up does not shed any light. Here’s a question for a “real news” outfit to tackle:

How many data points does a disambiguation system require to identify a name, location, and other personal details of a single individual?

Give up? Better not. Oh, the bait: pivoted cookies. Great for catching prospects, I think.
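Nobody has to answer the question above precisely to see the mechanism. The hedged Python sketch below builds a purely synthetic population of 10,000 people and shows how quickly a handful of ordinary attributes narrows it to unique individuals; every attribute, value, and percentage is fabricated and depends entirely on the made-up distributions.

```python
import random
from collections import Counter

random.seed(2)

# Hypothetical illustration of the disambiguation question: how quickly do a
# few ordinary attributes narrow a synthetic population of 10,000 people down
# to unique individuals? Everything here is fabricated; real re-identification
# depends on what data an ad platform or broker actually holds.

population = [
    {
        "zip": f"402{random.randint(0, 99):02d}",
        "birth_year": random.randint(1960, 2005),
        "birth_day": random.randint(1, 365),
        "device": random.choice(["ios", "android"]),
    }
    for _ in range(10_000)
]

def unique_fraction(fields):
    """Share of people whose combination of the given fields is unique."""
    combos = Counter(tuple(person[f] for f in fields) for person in population)
    unique = sum(1 for person in population
                 if combos[tuple(person[f] for f in fields)] == 1)
    return unique / len(population)

for fields in (["zip"],
               ["zip", "birth_year"],
               ["zip", "birth_year", "birth_day"],
               ["zip", "birth_year", "birth_day", "device"]):
    print(fields, f"-> {unique_fraction(fields):.1%} uniquely identified")
```

Each added attribute shrinks the crowd a person can hide in, which is the whole business of disambiguation.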

Stephen E Arnold, March 19, 2021

Watching the Future of Talend

March 15, 2021

I read “Talend Sells to Private Equity Firm Thoma Bravo in $2.4 Billion Deal.” I find this interesting. Talend is a software company providing extract, transform, and load services and analytics. Data remain the problem for many thumbtypers fresh from Amazon or Google certification classes. The idea is to legally suck in data from different sources. These data are often in odd-ball formats or malformed because another digital mechanic missed a bolt or added a bit of finery. Some people love MarkLogic innovations in XML; others are not so enamored of the tweaks.
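For readers who have not lived the extract-transform-load life, here is a toy Python sketch of the chore, using only the standard library: two fabricated source formats (CSV and JSON lines) are normalized into one schema and loaded into SQLite. The field names, formats, and values are invented, and this is in no way a description of how Talend works internally.

```python
import csv
import io
import json
import sqlite3

# Toy extract-transform-load pass, for illustration only.

csv_source = io.StringIO("id,amount,currency\n1,19.99,usd\n2,5,EUR\n")
jsonl_source = io.StringIO('{"id": "3", "amt": "7.50", "cur": "usd"}\n')

def extract():
    # Pull records out of two differently shaped sources.
    for row in csv.DictReader(csv_source):
        yield {"id": row["id"], "amount": row["amount"], "currency": row["currency"]}
    for line in jsonl_source:
        rec = json.loads(line)
        yield {"id": rec["id"], "amount": rec["amt"], "currency": rec["cur"]}

def transform(record):
    # Normalize the odd-ball variations: types, casing, missing decimals.
    return (int(record["id"]), round(float(record["amount"]), 2),
            record["currency"].upper())

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER, amount REAL, currency TEXT)")
conn.executemany("INSERT INTO payments VALUES (?, ?, ?)",
                 (transform(r) for r in extract()))
print(conn.execute("SELECT * FROM payments ORDER BY id").fetchall())
```

Multiply the little normalization headaches in `transform` by hundreds of sources and you have the market Talend sells into.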

What does Thoma Bravo bring to the table for a publicly traded company with a number of competitors?

I can think of two benefits:

The first is MBA think. Thoma Bravo is skilled in the methods for making a company more efficient. It is a good idea to internalize the definition of “efficiency” as the word is used at McKinsey & Co.

The second is acquisition think. From my point of view, the idea is to identify interesting companies which provide additional functionality around the core Talend business. Then Thoma Bravo assists the Talend management to bring these companies into the mothership, train sales professionals, and close deals.

No problem exists with this game plan. One can identify some indicators to monitor; for example:

  • Executive turnover
  • Realigning expenditures; possibly taking money from security and allocating the funds to sales and marketing
  • Targeting specific market segments with special bundles of enhanced Talend software and business methods.

For more information about Talend as it exists in March 2021, navigate to this link.

Oh, one final comment. Thoma Bravo was involved in making SolarWinds the business success it became.

Stephen E Arnold, March 15, 2021

About TikTok and Privacy: Does $92 Million Catch Your Attention?

March 4, 2021

I have commented on the superficial understanding of data collection shared among some “real” and big-time journalists. What’s the big deal about TikTok? Who cares what kids are doing? A dismissive attitude flipped off these questions because “real” news knows what’s up, right?

“ByteDance Agrees to US$92 Million Privacy Settlement with US TikTok Users” suggests that the China-linked TikTok, so easy to ignore, may warrant some scrutiny. The story reports:

The lawsuits claimed the TikTok app “infiltrates its users’ devices and extracts a broad array of private data including biometric data and content that defendants use to track and profile TikTok users for the purpose of, among other things, ad targeting and profit.” The settlement was reached after “an expert-led inside look at TikTok’s source code” and extensive mediation efforts, according to the motion seeking approval of the settlement.

My view is that tracking a user via a range of methods can create a digital fingerprint of a TikTok user. That fingerprint can be matched or cross correlated with other data available to a specialist; for example, information obtained from Oracle. The result is that a user could be identified and tracked across time.
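To make “cross correlated” less abstract, here is a hedged Python sketch of the general mechanism: a few stable device attributes are hashed into a fingerprint, and that fingerprint becomes the join key between two nominally separate data sets. Every record, attribute, and value in the sketch is fabricated; no TikTok or Oracle data or API is involved.

```python
import hashlib

# Hypothetical sketch of cross-correlating two data sets via a device
# "fingerprint": hash a few stable attributes into a single key, then join.

def fingerprint(record):
    parts = (record["device_model"], record["os_version"],
             record["timezone"], record["language"])
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:16]

# What a video app might observe about a device and its owner's behavior.
app_events = [
    {"device_model": "PhoneX 12", "os_version": "17.2", "timezone": "EST",
     "language": "en-US", "behavior": "watches exam-prep clips nightly"},
]

# What a data broker might hold, keyed by the same kind of fingerprint.
broker_record = {"device_model": "PhoneX 12", "os_version": "17.2",
                 "timezone": "EST", "language": "en-US"}
ad_profiles = {fingerprint(broker_record): {"age_band": "13-17", "zip": "40205"}}

for event in app_events:
    profile = ad_profiles.get(fingerprint(event))
    if profile:
        print("matched:", event["behavior"], "->", profile)
```

Once the join succeeds, the behavioral trail and the demographic profile belong to the same person, across time.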

Yep, today’s young person is tomorrow’s thumbtyper in one of the outfits compromised by the SolarWinds misstep. What if the TikTok data make it possible to put pressure on a user? What if the user releases access information or other high value data?

TikTok, TikTok, the clock may be ticking quietly away.

Stephen E Arnold, March 4, 2021
