The Building Blocks of Smart Software: Combine Them to Meet Your Needs

January 25, 2021

I have a file of listicles. One called “Top 10 Algorithms in Data Mining” appeared in 2007. I spotted another list which is, not surprisingly, quite like the Xindong Wu et al. write up. The most recent listing is “All Machine Learning Algorithms You Should Know in 2021.” Note the “all.” I included a short item about a book of business intelligence algorithms in the DarkCyber for January 26, 2021, at this link. That book ran more than 600 pages, and I am reasonably confident its authors did not use the word “all” to describe their effort.

What’s the lineup of “all,” you ask? In the table below, the left column presents the list from 2008 and the right column the list from 2021.

     2008 Xindong Wu et al. list             2021 “All” KDnuggets list
 1   Decision trees                          Linear regression
 2   k-means                                 Logistic regression
 3   Support vector machines                 k nearest neighbor
 4   Apriori                                 Naive Bayes
 5   Expectation-Maximization (EM)           Support vector machines
 6   PageRank (voting)                       Decision trees
 7   AdaBoost                                Random forest
 8   k nearest neighbor classification       AdaBoost
 9   Naive Bayes                             Gradient boost
10   Classification and regression trees     XGBoost

The KDnuggets opinion piece also includes LightGBM (a gradient boosting library along the lines of XGBoost) and CatBoost (a more efficient gradient boosting implementation). Hence, I have focused on 10 algorithms. I performed a similar compression with Xindong Wu et al.’s labored discussion of rules and cases, grouped under “decision trees” in the table above.
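
The overlap is easy to demonstrate. Below is a minimal sketch, assuming scikit-learn and synthetic data (neither write up supplies code), of how most of the algorithms named above reduce to interchangeable one-liners in a modern toolkit:

    # A minimal sketch (assumes scikit-learn; data are synthetic): several of
    # the algorithms named in both lists, fit and scored the same way.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=10, random_state=42)

    models = {
        "decision tree": DecisionTreeClassifier(),
        "k nearest neighbor": KNeighborsClassifier(),
        "naive Bayes": GaussianNB(),
        "support vector machine": SVC(),
        "AdaBoost": AdaBoostClassifier(),
        "gradient boosting": GradientBoostingClassifier(),
    }

    for name, model in models.items():
        accuracy = model.fit(X, y).score(X, y)  # training accuracy; illustration only
        print(f"{name}: {accuracy:.2f}")

Swap any classifier for another and the rest of the script is unchanged, which is rather the point: the building blocks are commodities.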

Several observations are possible from these data:

  1. “All” is misleading in the KDnuggets title. Why not skip the intellectually shallow “all”?
  2. In the 14 years between the scholarly article and the enthusiastic “all” paper, the tools of the smart software crowd have not advanced, if the data in these two write ups are close enough for horseshoes.
  3. Modern systems’ similarity in overall approach is understandable because energetic “inventors” of smart software work from a limited set of tools.

Net net: The mathematical recipes are evolving in terms of efficiency due to more machine horsepower and more data.

How about progress in accuracy? Did IBM Watson uncover a drug to defeat Covid? How are those Google search results working for you? What about the smart cyber security software which appears to have missed the SolarWinds misstep entirely?

Why? Knowing algorithms is not the same as developing systems which mostly work. Marketers, however, can seize on these mathy names and work miracles. Too bad the systems built with them don’t.

Stephen E Arnold, January 25, 2021

Law Enforcement Content Acquisition Revealed

January 22, 2021

Everything you do with a computer, smartphone, wearable, smart speaker, or tablet is recorded. In order to catch bad actors, law enforcement issues warrants to technology companies, often asking for the users who searched for specific keywords or visited certain Web sites within a specific time frame. Wired explains how private user information is still collected, despite big tech’s promises to protect users, in the article “How Your Digital Trails Wind Up In The Police’s Hands.”

Big tech companies continue to host apps and sell technology that provides user data to law enforcement. Apple attempted to combat the unauthorized sharing of user information by requiring all developers to attach a “nutritional label” to their apps. The label discloses key details about privacy policies. It is not, however, a blanket solution.

Big tech companies pledge their dedication to ending unlawful surveillance by law enforcement, but their actions are hypocritical. Amazon is committed to racial equity, yet it saw an uptick in police requests for user information. Google promises the same equity commitment with Google Doodles and donations, but it provides police with user data in response to geofence warrants.

Lawmakers and political activists argue that these practices violate people’s civil rights and the Fourth Amendment. While some are rallying to protect the average user, the bigger problem rests with users’ lack of knowledge. How many users are aware of the breadcrumbs they leave around the Internet? How many actually read privacy policies or terms of service agreements? Very few!

“The solution isn’t simply for people to stop buying IoT devices or for tech companies to stop sharing data with the government. But “equity” demands that users be aware of the digital bread crumbs they leave behind as they use electronic devices and how state agents capitalize on both obscure systems of data collection and our own ignorance.”

Perhaps organizations should concentrate on educating the public, or should require big tech companies to write shorter, more transparent privacy policies in readable English. With thumb typing and illiteracy prevalent in the US, ignorance pays data dividends.

Whitney Grace, January 22, 2021

The Many Ways Police Can Access User Data

January 14, 2021

We hope that by now, dear reader, you understand digital privacy is an illusion. For those curious about the relationship between big tech, personal data, and law enforcement, we suggest “How Your Digital Trails Wind Up in the Hands of the Police,” shared by Ars Technica. The article, originally published by Wired, begins by describing how police used a Google keyword warrant to track down one high-profile suspect. We’re reminded that data gathered for one ostensible purpose, like building an online profile, can be repurposed as evidence. From the smart speakers and wearable devices that record us to apps that track location and other data, users are increasingly signing away their privacy rights. Writer Sidney Fussell notes:

“The problem isn’t just any individual app, but an over-complicated, under-scrutinized system of data collection. In December, Apple began requiring developers to disclose key details about privacy policies in a ‘nutritional label’ for apps. Users ‘consent’ to most forms of data collection when they click ‘Agree’ after downloading an app, but privacy policies are notoriously incomprehensible, and people often don’t know what they’re agreeing to. An easy-to-read summary like Apple’s nutrition label is useful, but not even developers know where the data their apps collect will eventually end up.”

Amid protests over policing and racial profiling, several tech companies are reevaluating their cooperation with law enforcement. Amazon hit pause on sales of facial recognition tech to police even as it noted an increase in requests for user data by law enforcement. Google vowed to focus on better representation, education, and support for the Black community. Even so, it continues to supply police with data in response to geofence warrants. These requests are being made of Google and other firms more and more often. Fussell writes:

“As with keyword warrants, police get anonymized data on a large group of people for whom no tailored warrant has been filed. Between 2017 and 2018, Google reported a 1,500 percent increase in geofence requests. Apple, Uber, and Snapchat also have received similar requests for the data of a large group of anonymous users. … These warrants allow police to rapidly accelerate their ability to access our private information. In some cases, the way apps collect data on us turns them into surveillance tools that rival what police could collect even if they were bound to traditional warrants.”

Civil rights groups are pushing back on these practices. Meanwhile, users would do well to pause and consider before hitting “Agree.”

Cynthia Murrell, January 14, 2021

Traffic: Can a Supercomputer Make It Like Driving in 1930?

January 12, 2021

Advertisers work long and hard to find roads which are scenic and can be “managed,” with the assistance of some government authorities, to be perfect. The idea is that a zippy new vehicle zooms along a stretch of tidy highway (no litter or obscene slogans spray painted on billboards, please). Behind the wheel or in the semi-autonomous driver’s seat is a happy person. Zoom, zoom, zoom. (I once knew a poet named Alex Kuo. He wrote poems about driving. I found this interesting, but I hate driving, flying, or moving anywhere outside of my underground office in rural Kentucky.)

I also read a book called Traffic: Why We Drive the Way We Do (and What It Says about Us). I recall the information about Los Angeles’ super duper traffic management computer. If my memory is working this morning, the super duper traffic computer made traffic worse. An individual with some numerical capability can figure out why. Let those chimpanzees throw darts at a list of publicly traded securities and match the furry entity against the sleek MBA. Who wins? Yeah.

I thought about the hapless people who have to deal with driving, riding trains, or whatever during the Time of Rona. Better than pre-Rona, but not by much. Humans travel according to habit, the age-old work-when-the-sun-shines adage, or because clumping is baked into our DNA.

The problem is going to be solved, at least that’s the impression I obtained from “Could a Supercomputer Help Fix L.A.’s Traffic Problems?” Now traffic in Chicago sucks, but the wizards at the Argonne National Laboratory are going to remediate LaLa Land. I learned:

The Department of Energy’s Argonne National Laboratory is leading a project to examine traffic data sets from across the Los Angeles region to develop new strategies to reduce traffic congestion.

And what will make the difference this time? A supercomputer. How is that supercomputer doing with the Covid problem? Yeah, right.

The write up adds:

Super computers at the Argonne Laboratory are able to take a year’s worth of traffic data gathered from some 11,160 sensors across southern California, as well as movement data from mobile devices, to build forecasting models. They can then be applied to simulation projects.
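
What might “forecasting models” built from sensor data look like? Here is a minimal sketch, assuming pandas and a hypothetical CSV of sensor readings; the file name, column names, and the hour-of-week averaging are illustrative assumptions, not Argonne’s method:

    # A minimal sketch (hypothetical data layout, not Argonne's method):
    # forecast traffic volume per sensor with a simple hour-of-week average.
    import pandas as pd

    # Assumed columns: timestamp, sensor_id, vehicle_count
    df = pd.read_csv("la_sensor_counts.csv", parse_dates=["timestamp"])
    df["hour_of_week"] = df["timestamp"].dt.dayofweek * 24 + df["timestamp"].dt.hour

    # A year of history becomes an average count per sensor per hour-of-week slot.
    baseline = (
        df.groupby(["sensor_id", "hour_of_week"])["vehicle_count"]
          .mean()
          .rename("forecast")
    )

    # "Forecast" Monday 8 a.m. (hour_of_week = 8) for one hypothetical sensor.
    print(baseline.loc[("sensor_0042", 8)])

Any supercomputer-scale simulation has to beat this sort of humble baseline to justify its electric bill.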

Who in LA has the ball?

Not the LA Department of Transportation. Any other ideas?

And how was driving in LA in 1930? Pretty awful according to comments made by my mother.

Stephen E Arnold, January 12, 2021

Soros: Just in Time 20-20 Hindsight

November 18, 2020

Here’s an interesting quote (if it is indeed accurate):

“SFM [a George Soros financial structure] made this investment [in Palantir Technologies] at a time when the negative social consequences of big data were less understood,” the firm said in a statement Tuesday. SFM would not make an investment in Palantir today.

The investment concerns Palantir Technologies. “Soros Regrets Early Investment in Peter Thiel’s Palantir,” which notes that George Soros is 90 years young, includes this statement:

Soros has sold all the shares it’s permitted to sell at this time and will keep selling, according to the statement. “SFM does not approve of Palantir’s business practices,” the firm said.

Hindsight is 20-20. Or is it?

Hindsight bias can cause memory distortion. Because the event happened like you thought it would, you go back and revise your memory of what you were thinking right before the event. You re-write history, so to speak, and revise the probability in hindsight. Going forward, you use that new, higher probability to make future decisions. When in fact, the probabilities haven’t changed at all. That leads to poor judgment.—“Innovators: Beware the Hindsight Bias”

Stephen E Arnold, November 18, 2020

Hard Data Predicts Why Songs Are Big Hits

August 26, 2020

Hollywood has a formula for making blockbuster films, and the music industry has something similar. It is harder to predict hit music than hit films, but Datanami believes someone finally has the answer: “Hooktheory Uses Data To Quantify What Makes Songs ‘Great’.”

Berkeley startup Hooktheory knows that many songs have similar melodies and lyrics. Hooktheory makes software and other learning materials for songwriters and musicians. With its technology, the startup wants to prove that what makes music popular is quantifiable. Hooktheory started a crowdsourced database dubbed “Theorytabs” that analyzes popular songs, and the plan is to make it better with machine learning.

Theorytabs is a beloved project:

“The Hooktheory analysis database began as a “labor of love” by Hooktheory co-founders Dave Carlton, Chris Anderson and Ryan Miyakawa, based on the idea that “conventional tabs and sheet music are great for showing you how to play a song, but they’re not ideal for understanding how everything fits together.” Over time, the project snowballed into a community effort that compiled tens of thousands of Theorytabs, which Hooktheory describes as “similar to a guitar tab but powered by a simple yet powerful notation that stores the chord and melody information relative to the song’s key.”
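
To make the idea concrete, here is a minimal sketch of what mining key-relative chord data can look like. The progression notation and the data in the example are hypothetical stand-ins, not Hooktheory’s actual format:

    # A minimal sketch (hypothetical data, not Hooktheory's format): with
    # chords stored as scale degrees relative to the key, recurring four-chord
    # progressions can be counted across thousands of tabs.
    from collections import Counter

    tabs = [
        ["I", "V", "vi", "IV", "I", "V", "vi", "IV"],  # the pop staple
        ["vi", "IV", "I", "V", "vi", "IV", "I", "V"],
        ["I", "V", "vi", "IV"],
    ]

    progressions = Counter()
    for chords in tabs:
        # Slide a four-chord window over each song.
        for i in range(len(chords) - 3):
            progressions[tuple(chords[i:i + 4])] += 1

    for progression, count in progressions.most_common(3):
        print("-".join(progression), count)

Storing everything relative to the song’s key is what makes tunes in different keys comparable in the first place.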

Theorytabs users can view popular songs, from idol singers to videogame themes. They can play with key changes, tempos, mixers, and loops, listen to piano versions, and sync the songs up with YouTube music videos.

Hooktheory owns over 20,000 well-formatted tabs for popular music. The startup is working with Carnegie Mellon University and New York University to take Theorytabs to the next level. The music community has welcomed Theorytabs, and people are eager to learn about the data behind great music.

Whitney Grace, August 26, 2020

Yes, Elegance in Language Explains Big Data in a More Satisfying Way for Some

July 14, 2020

I was surprised and then uncomfortable with the information in a tweet thread from Abebab. The tweet explained that “Big Dick Data” is a formal academic term. Apparently this evocative and polished turn of phrase emerged from a write up by “D’Ignazio and F. Klein”.

Here’s the definition:

a formal, academic term that D’Ignazio & F. Klein have coined to denote big data projects that are characterized by masculinist, totalizing fantasies of world domination as enacted through data capture and analysis.

To prove the veracity of the verbal innovation, an image from a publication is presented; herewith a copy:

[image: excerpt from the cited publication]

When I came upon the tweet, the item had accrued 119 likes.

Observations:

  • Is the phrase a contribution to the discussion of Big Data, or is the phrase a political statement?
  • Will someone undertake a PhD dissertation on the subject, using the phrase as the title or will a business publisher crank out an instant book?
  • What mid-tier consulting firm will offer an analysis of this Big Data niche and rank the participants using appropriate categories to communicate each particular method?

Outstanding, tasteful, and one more — albeit quite small — attempt to make clear that discourse is being stretched.

Above all, classy or possibly a way to wrangle a job writing one liners for a comedian looking for Big Data chuckles.

Stephen E Arnold, July 14, 2020

CFO Surprises: Making Smart Software Smarter

April 27, 2020

“The Cost of Training NLP Models” is a useful summary. However, the write up leaves out some significant costs.

The stated focus of the paper is to:

review the cost of training large-scale language models, and the drivers of these costs.

The cost factors discussed include:

  • The paradox that compute costs keep falling while the cost of processing data goes up, a lot. The reason: more data are needed, and more data can be crunched more quickly. Zoom go the costs. (A back-of-the-envelope sketch follows this list.)
  • The unknown unknowns associated with processing the appropriate amount of data to make the models work as well as they can.
  • The wide use of statistical models which have a voracious appetite for training data.
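
To see how these drivers interact, consider a back-of-the-envelope sketch. The 6 x parameters x tokens FLOPs estimate is a common rule of thumb, and the throughput, utilization, and price figures below are assumptions for illustration; none of the numbers come from the paper:

    # A back-of-the-envelope sketch of training compute cost. The 6 * N * D
    # FLOPs rule of thumb and all hardware and price numbers are assumptions,
    # not figures from the paper.
    def training_cost_usd(params, tokens,
                          flops_per_gpu_sec=1.0e14,  # assumed sustained throughput
                          utilization=0.4,           # assumed fraction of peak
                          usd_per_gpu_hour=3.0):     # assumed cloud price
        flops = 6 * params * tokens  # roughly 6 FLOPs per parameter per token
        gpu_seconds = flops / (flops_per_gpu_sec * utilization)
        return gpu_seconds / 3600 * usd_per_gpu_hour

    # Cheaper FLOPs do not help if the token count doubles and doubles again.
    for tokens in (1e10, 2e10, 4e10):
        print(f"{tokens:.0e} tokens -> ${training_cost_usd(1e9, tokens):,.0f}")

Note what the sketch omits: the labeling, retraining, and normalization costs listed below, which never show up on a GPU invoice.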

These are valid points. However, the costs of training include other factors, and these are significant as well; for example:

  1. The direct and indirect costs associated with creating training sets
  2. The personnel costs required to assess and define retraining and the information assembly required for that retraining
  3. The costs of normalizing training corpora.

More research into the costs of smart software training and tuning is required.

Stephen E Arnold, April 28, 2020


Homeland Security Wants to Make the Most of Its Data

April 24, 2020

The US Department of Homeland Security gathers terabytes of data relating to national security. One of the department’s biggest quandaries is figuring out how to share that information across all law enforcement agencies. FedTech explains how Homeland Security discovered a solution in the article, “DHS’ CDM Program Focuses On Shared Services Dashboard.”

The data sharing project comes from the Department of Homeland Security and is called the Continuous Diagnostics and Mitigation (CDM) program. The program’s dashboard gives IT leaders keener insight into cybersecurity vulnerabilities and into how their IT security compares to that of other agencies. From April 2020 to September 2020 (the end of the fiscal year), the Department of Homeland Security will pilot the dashboard. The dashboard uses Elasticsearch to power its enterprise search, metrics, and business analytics.
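
As an illustration of the kind of query an Elasticsearch-backed dashboard runs, here is a minimal sketch using the elasticsearch Python client. The index name, field names, and the per-agency vulnerability count are assumptions, not details of the CDM program’s actual schema:

    # A minimal sketch (assumes the elasticsearch Python client; index and
    # field names are hypothetical, not the CDM program's schema).
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Count open findings per agency, the kind of aggregation a cross-agency
    # cybersecurity dashboard is built on.
    response = es.search(
        index="cdm-findings",
        size=0,  # skip raw documents; return only the aggregation
        query={"term": {"status": "open"}},
        aggs={"by_agency": {"terms": {"field": "agency", "size": 50}}},
    )

    for bucket in response["aggregations"]["by_agency"]["buckets"]:
        print(bucket["key"], bucket["doc_count"])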

Kevin Cox is the manager of the Continuous Diagnostics and Mitigation program. Cox states that the program will be expanded beyond the usual law enforcement agencies:

“DHS is also focused on bringing in more agencies that were not originally participating in the CDM program, Cox tells Federal News Network. DHS needed to make sure they had asset management capabilities, awareness of the devices connected to their networks and identity and access management capabilities, according to Cox.

For 34 smaller, non-CFO Act agencies, DHS has provided them with a common shared service platform to serve as their CDM dashboard, although each small agency can see its own data individually as well, which is summarized in the larger federal dashboard.

Cox notes that this process has not been easy, and DHS benefits when it has flexibility to meet each individual agency’s cybersecurity data needs.”

One of the program’s goals is to see whether the tool meets the desired requirements. Cox wants the data to be recorded, used on the dashboard, mined for insights, and shared with agencies across the dashboard. It sounds like the Continuous Diagnostics and Mitigation program is a social media platform that specializes in cybersecurity threats.

Whitney Grace, April 24, 2020

Smart Software: What Is Wrong?

April 8, 2020

We have the Google not solving death. We have the IBM Watson thing losing its parking spot at a Houston cancer center. We have a Department of Justice study reporting issues with predictive analytics. And the supercomputers and their smart software have not delivered a solution to the coronavirus problem. Yep. What’s up?

“Data Science: Reality Doesn’t Meet Expectations” explains some of the reasons. DarkCyber recommends this write up. The article provides seven reasons why the marketing fluff generated by former art history majors for “bros” of different ilk is not delivering; to wit:

  1. People don’t know what “data science” does.
  2. Data science leadership is sorely lacking.
  3. Data science can’t always be built to specs.
  4. You’re likely the only “data person.”
  5. Your impact is tough to measure. Data doesn’t always translate to value.
  6. Data & infrastructure have serious quality problems.
  7. Data work can be profoundly unethical. Moral courage required.

DarkCyber has nothing to add.

Stephen E Arnold, April 8, 2020
