Shaping Data Is Indeed a Thing and Necessary

April 12, 2021

I gave a lecture at Microsoft Research many years ago. I brought up the topic of Kolmogorov’s complexity idea and making fast and slow smart software sort of work. (Remember that Microsoft bought Fast Search & Transfer which danced around making automated indexing really super wonderful like herring worked over by a big time cook.) My recollection of the Microsoft group’s reaction was, “What is this person talking about?” There you go.

If you are curious about the link between a Russian math person once dumb enough to hire one of my relatives to do some grunt work, check out the 2019 essay “Are Deep Neural Networks Dramatically Overfitted?” Spoiler: You betcha.

The essay explains that mathy tests signal when a dataset is just right. No more and no less data are needed. Thus, if the data are “just right,” the outputs will be on the money, accurate, and close enough for horse shoes.

The write up states:

The number of parameters is not correlated with model overfitting in the field of deep learning, suggesting that parameter counting cannot indicate the true complexity of deep neural networks.

Simplifying: “Oh, oh.”

Then there is a work around. The write up points out:

The lottery ticket hypothesis states that a randomly initialized, dense, feed-forward network contains a pool of subnetworks and among them only a subset are “winning tickets” which can achieve the optimal performance when trained in isolation. The idea is motivated by network pruning techniques — removing unnecessary weights (i.e. tiny weights that are almost negligible) without harming the model performance. Although the final network size can be reduced dramatically, it is hard to train such a pruned network architecture successfully from scratch.

Simplifying again: “Yep, close enough for most applications.”
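The pruning idea in the quoted passage (drop the tiny weights, keep the rest) can be sketched in a few lines. This is a toy illustration, not the essay's method: the function name `magnitude_prune`, the flat list of weights, and the 50 percent fraction are all assumptions made for the example.

```python
def magnitude_prune(weights, fraction):
    """Zero out the smallest-magnitude `fraction` of the weights.

    Hypothetical helper for illustration; real networks prune tensors,
    not flat Python lists.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    n_prune = int(len(weights) * fraction)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


weights = [0.8, -0.01, 0.3, 0.002, -0.6, 0.05]
pruned = magnitude_prune(weights, 0.5)
print(pruned)
```

Half of the weights vanish and the big ones survive. Whether the slimmed-down network still performs is the part the lottery ticket hypothesis is about, and this sketch does not test it.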

What’s the fix? Keep the data small.

Doesn’t that create other issues? Sure does. For example, what about real time streaming data which diverge from the data used to train smart software? You know the “change” thing when historical data no longer apply. Smart software is possible as long as the aperture is small and the data shaped.

There you go. Outputs are good enough but may be “blind” in some ways.

Stephen E Arnold, April 12, 2021

Fruit of Tainted Tree: An Interesting Metaphor and a Challenge for Data Removal Methods

March 22, 2021

I am not a legal eagle. In fact, legal eagles frighten me. I clutch my billfold, grab my sweater, and trundle away as fast as my 77 year old legs permit. I do read legal info which seems interesting. “FTC Says That One Cannot Retain the Fruit of the Tainted Tree.” That’s a flashy metaphor for lawyers, but the “tainted” thing is intriguing. If an apple is stolen and that apple is poisoned, what happens if someone makes apple sauce, serves it to the PTA, and a pride of parents die? Tainted, right?

The write up explains:

the FTC has found that the work product of ill-gotten data is no longer retainable by the developer.

Okay, let’s say a developer creates an application or service and uses information available on a public Web site. But those data were uploaded by a bad actor and made available as an act of spite. Then the intrepid developer recycles those data and the original owner of the data cries, “Foul.”

The developer now has to remove those data. But how does one remove what may be an individual datum from a data storage system and a dynamic, distributed, modern software component?

Deletions are not really removals. The deletion leaves the data, just makes it unfindable in the index. To remove an item of information, more computational work is required. Faced with many deletions, short cuts are needed. Explaining what deletions are and aren’t in a modern distributed system can be an interesting exercise.
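The difference between a deletion and a removal can be shown with a toy store. This is a minimal sketch under stated assumptions: the class `TinyStore` and its methods are hypothetical names, and a real distributed system adds replicas, caches, and back ups on top of this.

```python
class TinyStore:
    """Toy record store with a separate lookup index (hypothetical)."""

    def __init__(self):
        self._records = {}   # record id -> stored value
        self._index = set()  # ids that lookups are allowed to find

    def put(self, rec_id, value):
        self._records[rec_id] = value
        self._index.add(rec_id)

    def get(self, rec_id):
        # Normal lookups consult the index only.
        return self._records[rec_id] if rec_id in self._index else None

    def soft_delete(self, rec_id):
        # Cheap: drop the index entry and leave the stored value alone.
        self._index.discard(rec_id)

    def purge(self, rec_id):
        # More work: remove the index entry and the stored value.
        self._index.discard(rec_id)
        self._records.pop(rec_id, None)

    def raw_contains(self, rec_id):
        # What someone inspecting the storage directly would see.
        return rec_id in self._records


store = TinyStore()
store.put("apple-1", "tainted datum")

store.soft_delete("apple-1")
found_by_lookup = store.get("apple-1")          # lookup finds nothing
still_on_disk = store.raw_contains("apple-1")   # the value is still stored

store.purge("apple-1")
gone_after_purge = not store.raw_contains("apple-1")
```

After the soft delete, the lookup comes back empty while the datum still sits in storage. Only the purge takes it out, and in this sketch there is just one copy to chase.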

Now back to the tainted tree. If the ruling sticks, exactly what data will have to be removed? Is a single datum a fruit? Years ago, Dun & Bradstreet learned that some of its data, collected then by actual humans talking to contacts in financial institutions or in gyms, could not be the property of the outstanding data aggregation company. A phone number is or used to be a matter of fact. Facts were not something an outfit could own unless they were organized in a work and even then I never understood exactly what the rules were. When I worked in the commercial database business, we tried to enter into agreements with sources. Tedious, yes, but we had a deal and were not los banditos.

Some questions crossed my mind:

  1. How exactly will tainted fruit (apples, baskets of apples, or the aforementioned apple sauce) be removed? How long will a vendor have to remove data? (The Google right to be forgotten method seems sluggish, but that’s just my perception of time, not the GOOG’s or the EC regulators’.)
  2. How will one determine if data have been removed? There are back up tapes and sys admins who can examine data tables with a hex editor to locate certain items of information.
  3. What is the legal exposure of a person who uses tainted fruit which is identified as tainted after reuse? What if the delay is in lawyer time; for example, a year or more later?
  4. What happens when outfits use allegedly public domain images to train an AI and an image is not really public domain? Does the AI system have to be dumped? (I am thinking about Facebook’s push into image recognition.)

Worth watching if this write up is spot on and how the legal eagles circle this “opportunity” for litigation.

Stephen E Arnold, March 22, 2021

Cision: More Data from Online Monitoring

March 1, 2021

Cision calls online monitoring “listening.” That’s friendly. The objective: More particular data to cross correlate with the firm’s other data holdings. Toss in about one million journalists’ email addresses, and you have the ingredients for a nifty business. “Brandwatch Is Acquired by Cision for $450M, Creating a PR, Marketing and Social Listening Giant” says:

Abel Clark, CEO of Cision said: “The continued digital shift and widespread adoption of social media is rapidly and fundamentally changing how brands and organizations engage with their customers. This is driving the imperative that PR, marketing, social, and customer care teams fully incorporate the unique insights now available into consumer-led strategies. Together, Cision and Brandwatch will help our clients to more deeply understand, connect and engage with their customers at scale across every channel.”

Cision data may open some new markets for the PR outfit. Do you think, gentle reader, that law enforcement and intelligence professionals would be interested in these data? Do you think that Amazon might license the data to stir into its streaming data market place stew?

No answers yet. Worth “monitoring” or “listening.”

Stephen E Arnold, March 1, 2021

The Building Blocks of Smart Software: Combine Them to Meet Your Needs

January 25, 2021

I have a file of listicles. One called “Top 10 Algorithms in Data Mining” appeared in 2007. I spotted another list which is, not surprisingly, quite like the Xindong Wu et al write up. The most recent listing is “All Machine Learning Algorithms You Should Know in 2021.” And note the “all.” I included a short item about a book of business intelligence algorithms in the DarkCyber for January 26, 2021, at this link. That book had more than 600 pages, and I am reasonably confident that the authors did not use the word “all” to describe their effort.

What’s the line up of “all” you ask? In the table below, I present the list from 2008 in red and the list from 2021 in blue.

#    2008 Xindong Wu et al                  2021 “All” KDNuggets’ List
1    Decision Trees                         Linear regression
2    k-means                                Logistic regression
3    Support Vector Machines                k nearest neighbor
4    Apriori                                Naive Bayes
5    Expectation-Maximization (EM)          Support vector machines
6    PageRank (voting)                      Decision trees
7    AdaBoost                               Random forest
8    k nearest neighbor classification      AdaBoost
9    Naive Bayes                            Gradient boost
10   Classification and Regression trees    XGBoost

The KDNuggets’ opinion piece also includes LightGBM (a variation of XGBoost) and CatBoost (a more efficient gradient boost). Hence, I have focused on 10 algorithms. I performed a similar compression with Xindong Wu et al’s labored discussion of rules and cases grouped under “decision trees” in the table above.

Several observations are possible from these data:

  1. “All” is misleading in the KDNuggets’ title. Why not skip the intellectually shallow “all”?
  2. In the 14 years between the scholarly article and the enthusiastic “all” paper, the tools of the smart software crowd have not advanced if the data in these two write ups are close enough for horse shoes.
  3. Modern systems’ similarity in overall approaches is understandable because a limited set of tools is used by energetic “inventors” of smart software.
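The overlap between the two lists can be counted rather than eyeballed. The sketch below is a rough check, and the name matching is an assumption: “k nearest neighbor classification” is treated as “k nearest neighbor,” “Ada Boost” as “AdaBoost,” and “Classification and Regression trees” is kept separate from “Decision trees.”

```python
# Names are lower-cased and lightly normalized by hand (an assumption).
list_2008 = {
    "decision trees",
    "k-means",
    "support vector machines",
    "apriori",
    "expectation-maximization",
    "pagerank",
    "adaboost",
    "k nearest neighbor",
    "naive bayes",
    "classification and regression trees",
}

list_2021 = {
    "linear regression",
    "logistic regression",
    "k nearest neighbor",
    "naive bayes",
    "support vector machines",
    "decision trees",
    "random forest",
    "adaboost",
    "gradient boost",
    "xgboost",
}

# Algorithms that appear on both lists.
shared = sorted(list_2008 & list_2021)
print(len(shared), shared)
```

With this matching, half of the entries on the newer list were already on the older one. Counting tree variants and boosting variants as relatives would push the overlap higher.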

Net net: The mathematical recipes are evolving in terms of efficiency due to more machine horsepower and more data.

How about the progress in accuracy? Did IBM Watson uncover a drug to defeat Covid? How are those Google search results working for you? What about the smart cyber security software which appears to have missed entirely the SolarWinds’ misstep?

Why? Knowing algorithms is not the same as developing systems which mostly work. Marketers, however, can seize on these mathy names and work miracles. Too bad the systems built with them don’t.

Stephen E Arnold, January 25, 2021

Law Enforcement Content Acquisition Revealed

January 22, 2021

Everything you do with a computer, smartphone, wearable, smart speaker, or tablet is recorded. In order to catch bad actors, law enforcement issues warrants to technology companies often asking for users who searched for specific keywords or visited certain Web sites in a specific time frame. Wired explains how private user information is still collected despite big tech promising to protect their users in the article, “How Your Digital Trails Wind Up In The Police’s Hands.”

Big tech companies continue to host apps and sell technology that provides user data to law enforcement. Apple attempted to combat the unauthorized use of user information by requiring all developers to have a “nutritional label” on their apps. The label will disclose privacy policies. It is not, however, a blanket solution.

Big tech companies pledge their dedication to ending law enforcement using unlawful surveillance, but their actions are hypocritical. Amazon is committed to racial equity, but it saw an uptick in police requests for user information. Google promises the same equity commitment with Google Doodles and donations, but it provides police with data in response to geofence warrants.

Law makers and political activists argue that these actions violate people’s civil rights and the Fourth Amendment. While there are people who are rallying to protect the average user, the bigger problem rests with users’ lack of knowledge. How many users are aware of the breadcrumbs they are leaving around the Internet? How many users actually read privacy policies or terms of service agreements? Very few!

“The solution isn’t simply for people to stop buying IoT devices or for tech companies to stop sharing data with the government. But “equity” demands that users be aware of the digital bread crumbs they leave behind as they use electronic devices and how state agents capitalize on both obscure systems of data collection and our own ignorance.”

Perhaps organizations should concentrate on educating the public or require big tech companies to have more transparent privacy policies in shorter, readable English? With thumb typing and illiteracy prevalent in the US, ignorance pays data dividends.

Whitney Grace, January 22, 2021

The Many Ways Police Can Access User Data

January 14, 2021

We hope that by now, dear reader, you understand digital privacy is an illusion. For those curious about the relationship between big tech, personal data, and law enforcement, we suggest “How Your Digital Trails Wind Up in the Hands of the Police,” shared by Ars Technica. The article, originally published by Wired, begins by describing how police used a Google keyword warrant to track down one high-profile suspect. We’re reminded that data gathered for one ostensible purpose, like building an online profile, can be repurposed as evidence. From the smart speakers and wearable devices that record us to apps that track location and other data, users are increasingly signing away their privacy rights. Writer Sidney Fussell notes:

“The problem isn’t just any individual app, but an over-complicated, under-scrutinized system of data collection. In December, Apple began requiring developers to disclose key details about privacy policies in a ‘nutritional label’ for apps. Users ‘consent’ to most forms of data collection when they click ‘Agree’ after downloading an app, but privacy policies are notoriously incomprehensible, and people often don’t know what they’re agreeing to. An easy-to-read summary like Apple’s nutrition label is useful, but not even developers know where the data their apps collect will eventually end up.”

Amid protests over policing and racial profiling, several tech companies are reevaluating their cooperation with law enforcement. Amazon hit pause on sales of facial recognition tech to police even as it noted an increase in requests for user data by law enforcement. Google vowed to focus on better representation, education, and support for the Black community. Even so, it continues to supply police with data in response to geofence warrants. These requests are being made of Google and other firms more and more often. Fussell writes:

“As with keyword warrants, police get anonymized data on a large group of people for whom no tailored warrant has been filed. Between 2017 and 2018, Google reported a 1,500 percent increase in geofence requests. Apple, Uber, and Snapchat also have received similar requests for the data of a large group of anonymous users. … These warrants allow police to rapidly accelerate their ability to access our private information. In some cases, the way apps collect data on us turns them into surveillance tools that rival what police could collect even if they were bound to traditional warrants.”

Civil rights groups are pushing back on these practices. Meanwhile, users would do well to pause and consider before hitting “Agree.”

Cynthia Murrell, January 14, 2021

Traffic: Can a Supercomputer Make It Like Driving in 1930?

January 12, 2021

Advertisers work long and hard to find roads which are scenic and can be “managed” with the assistance of some government authorities to be perfect. The idea is that a zippy new vehicle zooms along a stretch of tidy highway (no litter or obscene slogans spray painted on billboards, please). Behind the wheel or the semi-autonomous driver seat is a happy person. Zoom, zoom, zoom. (I once knew a poet named Alex Kuo. He wrote poems about driving. I found this interesting, but I hate driving, flying, or moving anywhere outside of my underground office in rural Kentucky.)

I also read a book called Traffic: Why We Drive the Way We Do (and What It Says about Us). I recall the information about Los Angeles’ super duper traffic management computer. If my memory is working this morning, the super duper traffic computer made traffic worse. An individual with some numerical capability can figure out why. Let those chimpanzees throw darts at a list of publicly traded securities and match the furry entity with the sleek MBA. Who wins? Yeah.

I thought about the hapless people who have to deal with driving, riding trains, or whatever during the Time of Rona. Better than pre Rona, but not by much. Humans travel according to habit, the age old work when the sun shines adage, or because clumping is baked into our DNA.

The problem is going to be solved, at least that’s the impression I obtained from “Could a Supercomputer Help Fix L.A.’s Traffic Problems?” Now traffic in Chicago sucks, but the wizards at the Argonne National Laboratory are going to remediate LaLa Land. I learned:

The Department of Energy’s Argonne National Laboratory is leading a project to examine traffic data sets from across the Los Angeles region to develop new strategies to reduce traffic congestion.

And what will make the difference this time? A supercomputer. How is that supercomputer doing with the Covid problem? Yeah, right.

The write up adds:

Super computers at the Argonne Laboratory are able to take a year’s worth of traffic data gathered from some 11,160 sensors across southern California, as well as movement data from mobile devices, to build forecasting models. They can then be applied to simulation projects.

Who in LA has the ball?

Not the LA Department of Transportation. Any other ideas?

And how was driving in LA in 1930? Pretty awful according to comments made by my mother.

Stephen E Arnold, January 12, 2021

Soros: Just in Time 20-20 Hindsight

November 18, 2020

Here’s an interesting quote (if it is indeed accurate):

“SFM [a George Soros financial structure] made this investment [in Palantir Technologies] at a time when the negative social consequences of big data were less understood,” the firm said in a statement Tuesday. SFM would not make an investment in Palantir today.

The investment concerns Palantir Technologies and George Soros, who is 90 years young. The write up “Soros Regrets Early Investment in Peter Thiel’s Palantir” includes this statement:

Soros has sold all the shares it’s permitted to sell at this time and will keep selling, according to the statement. “SFM does not approve of Palantir’s business practices,” the firm said.

Hindsight is 20-20. Or is it?

Hindsight bias can cause memory distortion. Because the event happened like you thought it would, you go back and revise your memory of what you were thinking right before the event. You re-write history, so to speak, and revise the probability in hindsight. Going forward, you use that new, higher probability to make future decisions. When in fact, the probabilities haven’t changed at all. That leads to poor judgment. (From “Innovators: Beware the Hindsight Bias”)

Stephen E Arnold, November 18, 2020

Hard Data Predicts Why Songs Are Big Hits

August 26, 2020

Hollywood has a formula system to make blockbuster films and the music industry has something similar. It is harder to predict hit music than films, but Datanami believes someone finally has the answer: “Hooktheory Uses Data To Quantify What Makes Songs ‘Great’.”

Berkeley startup Hooktheory knows that many songs have similar melodies and lyrics. Hooktheory makes software and other learning materials for songwriters and musicians. With their technology, the startup wants to prove what makes music popular is quantifiable. Hooktheory started a crowdsourced database dubbed “Theorytabs” that analyses popular songs and the plan is to make it better with machine learning.

Theorytabs is a beloved project:

“The Hooktheory analysis database began as a “labor of love” by Hooktheory co-founders Dave Carlton, Chris Anderson and Ryan Miyakawa, based on the idea that “conventional tabs and sheet music are great for showing you how to play a song, but they’re not ideal for understanding how everything fits together.” Over time, the project snowballed into a community effort that compiled tens of thousands of Theorytabs, which Hooktheory describes as “similar to a guitar tab but powered by a simple yet powerful notation that stores the chord and melody information relative to the song’s key.”

Theorytabs users can view popular songs from idol singers to videogame themes. They can play around with key changes, tempos, mixers, and loops, along with listening to piano versions and syncing the songs up with YouTube music videos.

Hooktheory owns over 20,000 well-formatted tabs for popular music. The startup is working with Carnegie Mellon University and New York University to take Theorytabs to the next level. The music community has welcomed Theorytabs and people are eager to learn about the data behind great music.

Whitney Grace, August 27, 2020

Yes, Elegance in Language Explains Big Data in a More Satisfying Way for Some

July 14, 2020

I was surprised and then uncomfortable with the information in a tweet thread from Abebab. The tweet explained that “Big Dick Data” is a formal academic term. Apparently this evocative and polished turn of phrase emerged from a write up by “D’Ignazio and F. Klein”.

Here’s the definition:

a formal, academic term that D’Ignazio & F. Klein have coined to denote big data projects that are characterized by masculinist, totalizing fantasies of world domination as enacted through data capture and analysis.

To prove the veracity of the verbal innovation, an image from a publication is presented; herewith a copy:

image

When I came upon the tweet, the item accrued 119 likes.

Observations:

  • Is the phrase a contribution to the discussion of Big Data, or is the phrase a political statement?
  • Will someone undertake a PhD dissertation on the subject, using the phrase as the title or will a business publisher crank out an instant book?
  • What mid tier consulting firm will offer an analysis of this Big Data niche and rank the participants using appropriate categories to communicate each particular method?

Outstanding, tasteful, and one more — albeit quite small — attempt to make clear that discourse is being stretched.

Above all, classy or possibly a way to wrangle a job writing one liners for a comedian looking for Big Data chuckles.

Stephen E Arnold, July 14, 2020
