Hard Data Predicts Why Songs Are Big Hits

August 26, 2020

Hollywood has a formula for making blockbuster films, and the music industry has something similar. Hit music is harder to predict than hit films, but Datanami believes someone finally has the answer: “Hooktheory Uses Data To Quantify What Makes Songs ‘Great’.”

Berkeley startup Hooktheory knows that many songs share similar melodies and lyrics. Hooktheory makes software and other learning materials for songwriters and musicians. With its technology, the startup wants to prove that what makes music popular is quantifiable. Hooktheory started a crowdsourced database dubbed “Theorytabs” that analyzes popular songs, and the plan is to make it better with machine learning.

Theorytabs is a beloved project:

“The Hooktheory analysis database began as a “labor of love” by Hooktheory co-founders Dave Carlton, Chris Anderson and Ryan Miyakawa, based on the idea that “conventional tabs and sheet music are great for showing you how to play a song, but they’re not ideal for understanding how everything fits together.” Over time, the project snowballed into a community effort that compiled tens of thousands of Theorytabs, which Hooktheory describes as “similar to a guitar tab but powered by a simple yet powerful notation that stores the chord and melody information relative to the song’s key.”

Theorytabs users can view popular songs ranging from pop idols to videogame themes. They can play around with key changes, tempos, mixers, and loops, listen to piano versions, and sync the songs up with YouTube music videos.
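The key-relative storage idea behind a Theorytab can be sketched in a few lines. This is a hypothetical illustration of the general technique, not Hooktheory’s actual format: chords are recorded as scale degrees (Roman numerals) relative to the song’s key, so the same progression looks identical no matter which key it is played in.

```python
# Illustrative sketch of key-relative chord notation. The note table,
# function name, and output format are assumptions for this example.

NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MAJOR_SCALE = [0, 2, 4, 5, 7, 9, 11]          # semitone offsets of degrees 1-7
NUMERALS = ["I", "ii", "iii", "IV", "V", "vi", "vii"]

def to_relative(chord_root: str, key: str) -> str:
    """Express a chord root as a scale degree of the given major key."""
    offset = (NOTES.index(chord_root) - NOTES.index(key)) % 12
    if offset in MAJOR_SCALE:
        return NUMERALS[MAJOR_SCALE.index(offset)]
    return f"?{offset}"  # chromatic chord, outside the diatonic scale

# The classic I-V-vi-IV progression spelled in C major:
print([to_relative(c, "C") for c in ["C", "G", "A", "F"]])
# The same progression spelled in G major maps to the same degrees:
print([to_relative(c, "G") for c in ["G", "D", "E", "C"]])
```

Because both calls yield the same degree sequence, songs in different keys become directly comparable, which is what makes a database of such tabs useful for quantitative analysis.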

Hooktheory hosts over 20,000 well-formatted tabs for popular music. The startup is working with Carnegie Mellon University and New York University to take Theorytabs to the next level. The music community has welcomed Theorytabs, and people are eager to learn about the data behind great music.

Whitney Grace, August 27, 2020

Yes, Elegance in Language Explains Big Data in a More Satisfying Way for Some

July 14, 2020

I was surprised and then uncomfortable with the information in a tweet thread from Abebab. The tweet explained that “Big Dick Data” is a formal academic term. Apparently this evocative and polished turn of phrase emerged from a write up by “D’Ignazio and F. Klein”.

Here’s the definition:

a formal, academic term that D’Ignazio & F. Klein have coined to denote big data projects that are characterized by masculinist, totalizing fantasies of world domination as enacted through data capture and analysis.

To prove the veracity of the verbal innovation, an image from a publication is presented; herewith a copy:


When I came upon the tweet, it had accrued 119 likes. Several questions came to mind:


  • Is the phrase a contribution to the discussion of Big Data, or is the phrase a political statement?
  • Will someone undertake a PhD dissertation on the subject, using the phrase as the title, or will a business publisher crank out an instant book?
  • What mid tier consulting firm will offer an analysis of this Big Data niche and rank the participants using appropriate categories to communicate each particular method?

Outstanding, tasteful, and one more — albeit quite small — attempt to make clear that discourse is being stretched.

Above all, classy, or possibly a way to wrangle a job writing one-liners for a comedian looking for Big Data chuckles.

Stephen E Arnold, July 14, 2020

CFO Surprises: Making Smart Software Smarter

April 27, 2020

“The Cost of Training NLP Models” is a useful summary. However, the write up leaves out some significant costs.

The focus of the paper is to:

review the cost of training large-scale language models, and the drivers of these costs.

The cost factors discussed include:

  • The paradox of compute costs going down while the cost of processing data goes up, a lot. The reason is that more data are needed, and more data can be crunched more quickly. Zoom go the costs.
  • The unknown unknowns associated with processing the appropriate amount of data to make the models work as well as they can
  • The wide use of statistical models which have a voracious appetite for training data.

These are valid points. However, the costs of training include other factors, and these are significant as well; for example:

  1. The direct and indirect costs associated with creating training sets
  2. The personnel costs required to assess and define retraining and the information assembly required for that retraining
  3. The costs of normalizing training corpora.
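The interplay of these drivers can be put into a back-of-envelope cost model. All numbers and parameter names below are illustrative assumptions, not figures from the paper: per-token compute gets cheaper, but the corpus grows faster, and labeling and normalization overheads ride along.

```python
# Hypothetical training-cost model: compute spend plus human labeling,
# plus a corpus-normalization overhead modeled as a fraction of compute.

def training_cost(tokens, cost_per_million_tokens, labeling_hours,
                  hourly_rate, normalization_fraction=0.10):
    compute = tokens / 1e6 * cost_per_million_tokens
    labeling = labeling_hours * hourly_rate
    return compute + labeling + compute * normalization_fraction

# Compute gets 4x cheaper per token, but the corpus grows 10x and the
# labeling effort grows with it:
small = training_cost(tokens=1e9,  cost_per_million_tokens=0.40,
                      labeling_hours=200, hourly_rate=50)
large = training_cost(tokens=1e10, cost_per_million_tokens=0.10,
                      labeling_hours=2000, hourly_rate=50)
print(round(small), round(large))  # prints 10440 101100
```

Even with cheaper compute per token, the total bill rises roughly tenfold, which is the "zoom go the costs" paradox in miniature.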

More research into the costs of smart software training and tuning is required.

Stephen E Arnold, April 28, 2020


Homeland Security Wants To Make Most of Its Data

April 24, 2020

The US Department of Homeland Security gathers terabytes of data relating to national security. One of the department’s biggest quandaries is figuring out how to share that information across all law enforcement agencies. FedTech explains how Homeland Security discovered a solution in the article, “DHS’ CDM Program Focuses On Shared Services Dashboard.”

The data-sharing project comes from the Department of Homeland Security and is called the Continuous Diagnostics and Mitigation (CDM) program. CDM provides a dashboard that gives IT leaders keener insight into cybersecurity vulnerabilities and into how their IT security compares with that of other agencies. From April 2020 to September 2020 (the end of the fiscal year), the Department of Homeland Security will pilot the dashboard. The CDM program uses Elasticsearch to power its enterprise search, metrics, and business analytics.

Kevin Cox is the manager of the Continuous Diagnostics and Mitigation program. Cox states that the program will be expanded beyond regular law enforcement agencies:

“DHS is also focused on bringing in more agencies that were not originally participating in the CDM program, Cox tells Federal News Network. DHS needed to make sure they had asset management capabilities, awareness of the devices connected to their networks and identity and access management capabilities, according to Cox.

For 34 smaller, non-CFO Act agencies, DHS has provided them with a common shared service platform to serve as their CDM dashboard, although each small agency can see its own data individually as well, which is summarized in the larger federal dashboard.

Cox notes that this process has not been easy, and DHS benefits when it has flexibility to meet each individual agency’s cybersecurity data needs.”

One of the program’s goals is to see if the tool meets the desired requirements. Cox wants data to be recorded, used on the dashboard, mined for insights, and shared with agencies across the platform. It sounds like the Continuous Diagnostics and Mitigation program is a social media platform that specializes in cybersecurity threats.

Whitney Grace, April 24, 2020

Smart Software: What Is Wrong?

April 8, 2020

We have the Google not solving death. We have the IBM Watson thing losing its parking spot at a Houston cancer center. We have a Department of Justice study reporting issues with predictive analytics. And the supercomputers and their smart software have not delivered a solution to the coronavirus problem. Yep. What’s up?

“Data Science: Reality Doesn’t Meet Expectations” explains some of the reasons. DarkCyber recommends this write up. The article provides seven reasons why the marketing fluff generated by former art history majors for “bros” of different ilk is not delivering; to wit:

  1. People don’t know what “data science” does.
  2. Data science leadership is sorely lacking.
  3. Data science can’t always be built to specs.
  4. You’re likely the only “data person.”
  5. Your impact is tough to measure; data doesn’t always translate to value.
  6. Data and infrastructure have serious quality problems.
  7. Data work can be profoundly unethical. Moral courage required.

DarkCyber has nothing to add.

Stephen E Arnold, April 8, 2020

Big Data Gets a New Term: DarkCyber Had to Look This One Up

April 2, 2020

In our feed this morning (April 1, 2020) we skipped over the flood of news about Zoom (a Middle Kingdom inspired marvel), the virus stories (output by companies contributing their smart software to find a solution), and the trend of Amazon bashing (firing a worker who wanted to sanitize a facility as Amazon’s organizational skills wobble).

What stopped our scanning eyes was “Why Your Business May Be on a Data-Driven Coddiwomple.” DarkCyber admits that one of our team wrote a story for an old school publisher which used the word “cuculus” in its title: “Google in the Enterprise 2009: The Cuculus Strategy.” A “cuculus,” as you probably know, gentle reader, is a remarkable bird, sort of a thief.

But coddiwomple? The word means to travel in a purposeful manner toward a vague destination. Most YouTube train ride videos and the Kara and Nate trips qualify. Other examples include the aimless wandering of enterprise search vendors who travel to the lands of customer service, analytics, and business process engineering, only occasionally returning to their home base in the 50-year-old desert of proprietary enterprise search.

What’s the point of “Why Your Business May Be on a Data-Driven Coddiwomple”? DarkCyber believes the main point is valid:

In practical terms the lack of clarity on the starting point can involve a lack of vision into what the specific objectives of the team are, or what human resources and skills are already in house. Meanwhile, the diverse and siloed stakeholders in a “destination” for the data-driven endeavor may all have slightly different ideas on what the result should be, leading to a divergent and fuzzy path to follow.

In DarkCyber’s lingo, these data and analytics journeys are just hand waving and money spending.

Are businesses and other entities data driven?

Ho ho ho. Most organizations are not sure what the heck is going on. The data are easy to interpret, and no fancy, little understood analytics system is needed to figure out that an iceberg has nicked the good ship Silicon Lollipop.

There are interesting uses of data and clever applications of systems and methods that are quite old.

Like the cuculus, opportunism is important. The coddiwomple is a secondary effect. The cuculus gets into a company’s nest and raises money consumers. When the money suckers are bigger, each flies to another nest and the cycle repeats.

Data driven is a metaphor for doing something even though results are often difficult to explain: Higher costs, increased complexity, and an inability to adapt to the business environment.

I support the cuculus inspired consultants. The management of the nest can enjoy the coddiwomple as they seek a satisfying place to begin again.

Stephen E Arnold, April 2, 2020

The Problem of Too Much Info

March 17, 2020

The belief is that the more information one has, the better decisions one can make. Is this really true? The Eurasia Review shares how too much information might be a bad thing in the article, “More Information Doesn’t Necessarily Help People Make Better Decisions.”

According to the Stevens Institute of Technology, too much knowledge can cause people to make worse decisions. The finding points to a critical gap between receiving new information and assimilating it with past knowledge and beliefs. Samantha Kleinberg, Associate Professor of Computer Science at the Stevens Institute, is studying the phenomenon, using AI and machine learning to investigate how financial advisors and healthcare professionals communicate information to their clients. She discovered:

“ ‘Being accurate is not enough for information to be useful,’ said Kleinberg. ‘It’s assumed that AI and machine learning will uncover great information, we’ll give it to people and they’ll make good decisions. However, the basic point of the paper is that there is a step missing: we need to help people build upon what they already know and understand how they will use the new information.’

For example: when doctors communicate information to patients, such as recommending blood pressure medication or explaining risk factors for diabetes, people may be thinking about the cost of medication or alternative ways to reach the same goal. ‘So, if you don’t understand all these other beliefs, it’s really hard to treat them in an effective way,’ said Kleinberg, whose work appears in the Feb. 13 issue of Cognitive Research: Principles and Implications.”

Kleinberg and her team studied the decision-making processes of 4,000 participants, presenting scenarios ranging from familiar to unfamiliar. When confronted with an unusual problem, participants focused on the problem without any extra knowledge, but when asked to deal with a routine scenario such as healthcare or finances, their prior knowledge got in the way.

Information overload, and the inability to merge old information with the new, is a problem. How do you fix it? Your guess is as good as mine.

Whitney Grace, March 17, 2020

Google Trends Used to Reveal Misspelled Wirds or Is It Words?

November 25, 2019

We spotted a listing of the most misspelled words in each of the USA’s 50 states. Too bad Puerto Rico. Kentucky’s most misspelled word is “ninety.” Navigate to Considerable and learn what residents cannot spell. How often? Silly kweston.

The listing includes some bafflers and may reveal what can go wrong with data from an online ad sales data collection system; for example:

  • Washington, DC (which is not a state in DarkCyber’s book) cannot spell “enough”; for example, “enuf already with these televised hearings and talking heads”
  • Idaho residents cannot spell embarrassed, which as listeners to Kara Swisher know has two r’s and two s’s. Helpful that.
  • Montana residents cannot spell “comma.” Do those in Montana use commas?
  • And not surprisingly, those in Tennessee cannot spell “intelligent.” Imagine that!

What happens if one trains smart software on these data?

Sumthink mite go awf the railz.

Stephen E Arnold, November 25, 2019

Info Extraction: Improving?

November 21, 2019

Information extraction (IE) is key to machine learning and artificial intelligence (AI), especially for natural language processing (NLP). The problem with information extraction is that information pulled from datasets often lacks context, so systems fail to properly categorize and rationalize the data. Good Men Project shares some hopeful news for IE in the article, “Measuring Without Labels: A Different Approach To Information Extraction.”

Current IE relies on an AI programmed with a specific schema that states what information needs to be extracted. A retail Web site like Amazon probably uses an IE system programmed to extract product names, UPCs, and prices, while a travel Web site like Kayak uses one to find prices, airlines, dates, and hotel names. For law enforcement officials, it is particularly difficult to design schema for human trafficking, because labeled datasets on that subject do not exist. Traditional IE evaluation methods, such as crowdsourcing, also do not work due to the sensitivity of the data.

In order to evaluate IE output without a labeled human trafficking dataset, the researchers measure dependencies between extractions. A dependency works as follows:

“Consider the network illustrated in the figure above. In this kind of network, called attribute extraction network (AEN), we model each document as a node. An edge exists between two nodes if their underlying documents share an extraction (in this case, names). For example, documents D1 and D2 are connected by an edge because they share the extraction ‘Mayank.’ Note that constructing the AEN only requires the output of an IE, not a gold standard set of labels. Our primary hypothesis in the article was that, by measuring network-theoretic properties (like the degree distribution, connectivity etc.) of the AEN, correlations would emerge between these properties and IE performance metrics like precision and recall, which require a sufficiently large gold standard set of IE labels to compute. The intuition is that IE noise is not random noise, and that the non-random nature of IE noise will show up in the network metrics. Why is IE noise non-random? We believe that it is due to ambiguity in the real world over some terms, but not others.”
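The network construction described in the quote can be sketched in a few lines of plain Python. The documents and extracted names below are invented for illustration; the structure follows the quoted description: each document is a node, and an edge links two documents when their extractions share a value.

```python
# Minimal sketch of an attribute extraction network (AEN). Document ids
# and extraction values are hypothetical sample data.
from collections import defaultdict
from itertools import combinations

# Output of some IE system: document id -> set of extracted names
extractions = {
    "D1": {"Mayank", "Ana"},
    "D2": {"Mayank"},
    "D3": {"Ana", "Lee"},
    "D4": {"Zoe"},
}

# Invert the map (value -> documents containing it), then connect
# every pair of documents that co-occur under some value.
docs_by_value = defaultdict(set)
for doc, values in extractions.items():
    for v in values:
        docs_by_value[v].add(doc)

edges = set()
for docs in docs_by_value.values():
    for a, b in combinations(sorted(docs), 2):
        edges.add((a, b))

# One simple network-theoretic property: the degree of each node.
# Skew in this distribution is the kind of signal the authors
# correlate with IE precision and recall.
degree = {doc: 0 for doc in extractions}
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

print(sorted(edges))  # [('D1', 'D2'), ('D1', 'D3')]
print(degree)         # {'D1': 2, 'D2': 1, 'D3': 1, 'D4': 0}
```

Note that building the network needs only the IE output itself, never a gold standard set of labels, which is the point of the approach.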

Using the attributes of names, phone numbers, and locations, correlations were discovered. Modeling dependencies between extractions creates a new methodology for evaluating AI systems. Network science usually relies on concrete real-world interactions, but the AEN is an abstract network of IE interactions. The structure of these mistakes, in fact, may allow law enforcement to use IE systems to acquire the desired information without a gold standard practice dataset.

Whitney Grace, November 21, 2019

Tracking Trends in News Homepage Links with Google BigQuery

October 17, 2019

Some readers may be familiar with the term “culturomics,” a particular application of n-gram-based linguistic analysis to text. The practice arose after a 2010 project that applied such analysis to five million historical books across seven languages. The technique creates n-gram word frequency histograms from the source text. Now the technique has been applied to links found on news organizations’ home pages using Google’s BigQuery platform. Forbes reports, “Using the Cloud to Explore the Linguistic Patterns of Half a Trillion Words of News Homepage Hyperlinks.” Writer Kalev Leetaru explains:

“News media represents a real-time reflection of localized events, narratives, beliefs and emotions across the world, offering an unprecedented look into the lens through which we see the world around us. The open data GDELT Project has monitored the homepages of more than 50,000 news outlets worldwide every hour since March 2018 through its Global Frontpage Graph (GFG), cataloging their links in an effort to understand global journalistic editorial decision-making. In contrast to traditional print and broadcast mediums, online outlets have theoretically unlimited space, allowing them to publish a story without displacing another. Their homepages, however, remain precious fixed real estate, carefully curated by editors that must decide which stories are the most important at any moment. Analyzing these decisions can help researchers better understand which stories each news outlet believed to be the most important to its readership at any given moment in time and how those decisions changed hour by hour.”

The project has now collected more than 134 billion such links. The article describes how researchers have used BigQuery to analyze this dataset with a single SQL query, so navigate there for the technical details. Interestingly, one thing they are looking at is trends across the 110 languages represented by the samples. Leetaru emphasizes this endeavor demonstrates how much faster these computations can be achieved compared to the 2010 project. He concludes:

“Even large-scale analyses are moving so close to real-time that we are fast approaching the ability of almost any analysis to transition from ‘what if’ and ‘I wonder’ to final analysis in just minutes with a single query.”
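The n-gram frequency histogram at the heart of culturomics can be sketched in plain Python rather than BigQuery SQL. Real analyses run this over billions of homepage link titles; the function name and sample text here are invented for illustration.

```python
# Toy n-gram frequency counter, the core operation of a culturomics-style
# analysis: tokenize, slide a window of n words, and tally frequencies.
from collections import Counter

def ngram_counts(text: str, n: int = 2) -> Counter:
    """Count word n-grams in a text, lowercased and whitespace-tokenized."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

headlines = "markets rally as markets rally on trade news"
print(ngram_counts(headlines, 2).most_common(2))
# [(('markets', 'rally'), 2), ...]
```

At GDELT scale the same tally is expressed as a single SQL aggregation pushed down into BigQuery, which is why a half-trillion-word histogram can finish in minutes rather than months.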

Will faster analysis lead to wiser decisions? We shall see.

Cynthia Murrell, October 17, 2019
