Tracking Trends in News Homepage Links with Google BigQuery

October 17, 2019

Some readers may be familiar with the term “culturomics,” a particular application of n-gram-based linguistic analysis to text. The practice arose after a 2010 project that applied such analysis to five million historical books across seven languages. The technique creates n-gram word frequency histograms from the source text. Now the technique has been applied to links found on news organizations’ home pages using Google’s BigQuery platform. Forbes covers the work in “Using the Cloud to Explore the Linguistic Patterns of Half a Trillion Words of News Homepage Hyperlinks.” Writer Kalev Leetaru explains:

“News media represents a real-time reflection of localized events, narratives, beliefs and emotions across the world, offering an unprecedented look into the lens through which we see the world around us. The open data GDELT Project has monitored the homepages of more than 50,000 news outlets worldwide every hour since March 2018 through its Global Frontpage Graph (GFG), cataloging their links in an effort to understand global journalistic editorial decision-making. In contrast to traditional print and broadcast mediums, online outlets have theoretically unlimited space, allowing them to publish a story without displacing another. Their homepages, however, remain precious fixed real estate, carefully curated by editors that must decide which stories are the most important at any moment. Analyzing these decisions can help researchers better understand which stories each news outlet believed to be the most important to its readership at any given moment in time and how those decisions changed hour by hour.”

The project has now collected more than 134 billion such links. The article describes how researchers have used BigQuery to analyze this dataset with a single SQL query; navigate there for the technical details. Interestingly, one thing they are examining is trends across the 110 languages represented in the samples. Leetaru emphasizes that this endeavor demonstrates how much faster these computations can be completed compared to the 2010 project. He concludes:
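The n-gram histogram technique at the heart of culturomics is simple enough to sketch. The toy Python below counts bigram frequencies in a string; it illustrates the general method only, and is not GDELT's or Forbes's actual BigQuery pipeline:

```python
from collections import Counter

def ngram_counts(text, n=2):
    """Count n-gram frequencies in a text: a toy version of the
    word-frequency histogram technique described above."""
    tokens = text.lower().split()
    # Slide n parallel views over the token list to form n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

counts = ngram_counts("the news the news cycle moves fast", n=2)
print(counts.most_common(2))  # → [('the news', 2), ('news the', 1)]
```

At GDELT's scale the same histogram is built by a SQL aggregation over billions of rows rather than an in-memory counter, but the underlying computation is this one.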

“Even large-scale analyses are moving so close to real-time that we are fast approaching the ability of almost any analysis to transition from ‘what if’ and ‘I wonder’ to final analysis in just minutes with a single query.”

Will faster analysis lead to wiser decisions? We shall see.

Cynthia Murrell, October 17, 2019

A List of Eavesdroppers: Possibly Sort of Incomplete and Misleading?

August 22, 2019

DarkCyber noted “Here’s Every Major Service That Uses Humans to Eavesdrop on Your Voice Commands.” Notice the word “major.” Here’s the list from the write up:

  • Amazon
  • Apple
  • Facebook
  • Google
  • Microsoft

DarkCyber wonders if these vendors/systems should be considered for inclusion in the list of “every” eavesdropping service:

  • China Telecom
  • Huawei
  • Shoghi
  • Utimaco

DarkCyber is confused about “every” when five candidates are advanced. The four we have suggested for consideration are organizations plucked from our list of interesting companies which may be in the surveillance sector. We await more comprehensive lists from the “real news” outfit, the Daily Beast. Growl!

Stephen E Arnold, August 22, 2019

Scalability: Assumed to Be Infinite?

August 20, 2019

I hear and read about scalability—whether I want to or not. Within the last 24 hours, I learned that certain US government applications have to be smart (AI and ML) and have the ability to scale. Scale to what? In what amount of time? How?

The answers to these questions are usually Amazon, Google, IBM, Microsoft, or some other company’s cloud.

I thought about this implicit assumption about scaling when I read “Vitalik Buterin: Ethereum’s Scalability Issue Is Proving To Be A Formidable Impediment To Adoption By Institutions.” The “inventor” of Ethereum (a technology supported by Amazon AWS, by the way) allegedly said:

“Scalability is a big bottleneck because Ethereum blockchain is almost full. If you’re a bigger organization, the calculus is that if we join it will not only be full but we will be competing with everyone for transaction space. It’s already expensive and it will be even five times more expensive because of us. There is pressure keeping people from joining, but improvements in scalability can do a lot in improving that.”

There are fixes. Here’s one from the write up:

Notably, Vitalik is known to be a supporter of other crypto currencies besides Ethereum. In July, Buterin suggested using Bitcoin Cash (BCH) to solve the scalability barrier in the short-term as they figure out a more permanent solution. Additionally, early this month, he supported the idea of integrating Bitcoin Lightning Network into the Ethereum smart contracts asserting that the “future of crypto currencies is diverse and pluralist”.

Questions which may be germane:

  1. What’s the limit of scalability?
  2. How do today’s systems scale?
  3. What’s the time and resource demand when one scales to an unknown scope?

Please, don’t tell me, “Scaling is infinite.”

There are constraints and limits, two factors some people don’t want to think about. It is easier to say, “Scaling. No problem.”

Wrong. Scaling is a problem. Someone has to pay for the infrastructure, the know-how, the downstream consequences of latency, and the other “costs.”

Stephen E Arnold, August 20, 2019

Hadoop Fail: A Warning Signal in Big Data Fantasy Land?

August 11, 2019

DarkCyber notices when high profile companies talk about data federation, data lakes, and intelligent federation of real time data with historical data. Amazon and Anduril are two companies offering this type of data capability.

“What Happened to Hadoop and Where Do We Go from Here?” does not directly discuss the data management systems in Amazon and Anduril, but the points the author highlights may be germane to thinking about what is possible and what remains just out of reach when it comes to processing the rarely defined world of “Big Data.”

The write up focuses on Hadoop, the elephant logo thing. Three issues are identified:

  1. Data provenance was tough to maintain and therefore tough to determine. This is a variation on the GIGO theme (garbage in, garbage out).
  2. Creating a data lake is complicated. With talent shortages, the problem of complexity may hardwire failure.
  3. The big pool of data becomes the focus. That’s okay, but the application to solve the problem is often lost.
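The first issue can be made concrete. A minimal provenance record captures where a payload came from, when it arrived, and a hash of the exact bytes, which is the bookkeeping the write up suggests Hadoop-era data lakes often skipped. The Python sketch below is illustrative; the function and field names are not drawn from the article:

```python
import hashlib
import json
import time

def provenance_record(source_url, payload: bytes):
    """Build a minimal provenance record for incoming data: its
    source, arrival time, size, and a hash of the exact bytes."""
    return {
        "source": source_url,
        "ingested_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "bytes": len(payload),
    }

rec = provenance_record("https://example.com/feed.json", b'{"price": 42}')
print(json.dumps(rec, indent=2))
```

Storing a record like this alongside every file that enters the lake makes the later question “where did this number come from?” answerable instead of a GIGO guessing game.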

Why is a discussion of Hadoop relevant to Amazon and Anduril? The reason is that both companies are addressing the “Hadoop problem,” but in different ways.

These two firms, therefore, may be significant because of their approach and their different angles of attacks.

Amazon is providing a platform which, in the hands of a skilled Amazon technologist, can deliver a cohesive data environment. Furthermore, the digital craftsman can build a solution that works. It may be expensive and possibly flaky, but it mostly works.

Anduril, on the other hand, delivers federation in a box: hardware, smart software, and applications. License, deploy, and use.

Despite the different angles of attack, both companies are making headway in the data federation, data lake, and real time analytics sector.

The issue is not what will happen to Hadoop; the issue is how quickly competitors will respond to these different ways of dealing with Big Data.

Stephen E Arnold, August 11, 2019

15 Reasons You Need Business Intelligence Software

May 21, 2019

I read StrategyDriven’s “The Importance of Business Intelligence Software and Why It’s Integral for Business Success.” I found the laundry list interesting, but I asked myself, “If BI software is so important, why is it necessary to provide 15 reasons?”

I went through the list of items a couple of times. Some of the reasons struck me as a bit of a stretch. I had a teacher at the University of Illinois who loved the phrase “a bit of a stretch, right” when a graduate student proposed a wild and crazy hypothesis or drew a nutsy conclusion from data.

Let’s look at four of these reasons and see if there’s merit to my skepticism about delivering fish to a busy manager when the person wanted a fish sandwich.

Reason 1. Better business decisions. Really? If a BI system outputs data to a clueless person or uses flawed, incomplete, or stale data to present an output to a bright person, are better business decisions an outcome? In my experience, nope.

Reason 6. Accurate decision making. What the human does with the outputs is likely to result in a decision. That’s true. But accurate? Too many variables exist to create a one-to-one correlation between the assertion and what happens in a decider’s head or among a group of deciders who get together to figure out what to do. Example: Google has data. Google decided to pay a person accused of improper behavior millions of dollars. Accurate decision making? I suppose it depends on one’s point of view.

Reason 11. Reduced cost. I am confident when I say, “Most companies do not calculate or have the ability to assemble the information needed to produce fully loaded costs.” Consequently, the cost of a BI system is not the license fee. There are the associated directs and indirects. And when a decision from the BI system is wrong, there are some other costs as well. How are Facebook’s eDiscovery systems generating a payback today? Facebook has data, but the costs of its eDiscovery systems are not known, nor does anyone care as the legal hassles continue to flood the company’s executive suite.

Reason 13. High quality data. Whoa, hold your horses. The data cost is an issue in virtually every company with which I have experience. No one wants to invest to make certain that the information is complete, accurate, up to date, and maintained (indexed accurately and put in a consistent format). This is a pretty crazy assertion about BI when there is no guarantee that the data fed into the system is representative, comprehensive, accurate, and fresh.

Business intelligence is a tool. Use of a BI system does not generate guaranteed outcomes.

Stephen E Arnold, May 21, 2019

Comfort with Big Data or You May Not Be Hired

April 5, 2019

I read an interesting essay in Analytics India Magazine, a source I find useful in explaining how managers from that country think about certain issues.

Case in point: What makes a good employee, presumably of a company operating in Analytics India’s home territory or managed by a person who devours each issue in search of data nuggets.

The article which caught my attention? “Why Everyone In The Organization Has To Be Comfortable Dealing With Data.”

I noted this passage:

For a successful functioning of an organization, it is necessary that everyone in an organization is comfortable dealing with data.

I like the categorical affirmative: Everyone.

I like the notion of not being informed, good, or competent. Comfortable only.

Now the questions:

  1. Does the argument require HR (personnel) to define “comfort” and then measure that quality?
  2. What happens to those who perform certain services like greeting visitors, providing administrative support, or chauffeuring the owner to his or her private jet? Outsourcing perhaps? A special class of workers removed from the Big Data folks?
  3. What happens to employees in countries whose universities graduate individuals lacking the desired numerical skills? No jobs?

I enjoyed the recommendations for addressing this requirement. Educate and upskill (presented as two action items, but to innumerate me these are one thing). Then “every department has to realize the power of data.” I love the “every” and the sort of adulty phrase “has to realize.”

But the keeper is this statement: “Adopt methods for data cleaning.”

Yeah, clean data for Big Data. Who does that work? Obviously employees who are comfortable. Yep, comfort will deal with data issues like validity, consistency, etc. etc.
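For the curious, “methods for data cleaning” can be as plain as validity and consistency checks applied before records enter an analysis. The Python sketch below is a toy illustration with hypothetical field names (“id”, “age”), not anything from the article:

```python
def clean_rows(rows):
    """Split record dicts into (kept, rejected) using two basic checks:
    validity (plausible field values) and consistency (no duplicate ids)."""
    kept, rejected = [], []
    seen_ids = set()
    for row in rows:
        valid = (
            isinstance(row.get("id"), int)
            and row["id"] not in seen_ids       # consistency: no duplicate ids
            and isinstance(row.get("age"), int)
            and 0 <= row["age"] <= 120          # validity: plausible range
        )
        (kept if valid else rejected).append(row)
        if valid:
            seen_ids.add(row["id"])
    return kept, rejected

kept, rejected = clean_rows([
    {"id": 1, "age": 34},
    {"id": 1, "age": 34},   # duplicate id → rejected
    {"id": 2, "age": 700},  # implausible age → rejected
])
print(len(kept), len(rejected))  # → 1 2
```

The uncomfortable part, as the post notes, is not writing checks like these; it is deciding who owns the rejected pile.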

Stephen E Arnold, April 5, 2019

Data and Analytics: Do Good, Not Bad

April 1, 2019

Nope, not an April Fool’s spoof. “Using Data and Analytics for Good” is an attempt to make a case for monitoring and intercept technology to make the world a better place. No, the write up does not use China’s social credit score as an example of “doing good.”

I noted this statement from Cindi Howson, a Gartner fellow traveler, in the write up:

Howson said the mission is a personal one for her that started when she was a college student. She was working two jobs to pay her own way, and after she wrote that big tuition check she had only $2 left to buy hotdogs and a box of macaroni to last a week. She knew that financially there wasn’t much separating her from the homeless people she had passed on the streets of New York City every night.

Was this the plight of the students whose parents paid hundreds of thousands of dollars so that their progeny could enter “prestigious schools”?

Will Gartner convert data for good into revenue? Stakeholders may be crossing their fingers.

The “doing good” thing does not get much coverage in The Age of Surveillance Capitalism. That’s no April Fool’s joke.

Stephen E Arnold, April 1, 2019

Who Is Assisting China in Its Technology Push?

March 20, 2019

I read “U.S. Firms Are Helping Build China’s Orwellian State.” The write up is interesting because it identifies companies which allegedly provide technology to the Middle Kingdom. The article also uses an interesting phrase; that is, “tech partnerships.” Please, read the original article for the names of the US companies allegedly cooperating with China.

I want to tell a story.

Several years ago, my team was asked to prepare a report for a major US university. Our task was to answer what I thought was a simple question when I accepted the engagement: “Why isn’t this university’s computer science program ranked in the top ten in the US?”

The answer, my team and I learned, had zero to do with faculty, courses, or the intelligence of students. The primary reason was that the university’s graduates were returning to their “home countries.” These included China, Russia, and India, among others. In one advanced course, there was no US born, US educated student.

We documented that over a seven-year period, when the undergraduate, graduate, and post doctoral students completed their work, they had little incentive to start companies in proximity to the university, donate to the school’s fund raising, or provide the rah rah that happy graduates often do. To see the rah rah in action, may I suggest you visit a “get together” of graduates near Stanford, an eatery in Boston, or Las Vegas on an NCAA elimination weekend.

How could my client fix this problem? We were not able to offer a quick fix or even an easy fix. The university had institutionalized revenue from non-US students and was, when we did the research, dependent on non-US students. These students were very, very capable, and they came to the US to learn, form friendships, and sharpen their business and technical “soft” skills. These, I assume, were skills put to use to reach out to firms where a “soft” contact could be easily initiated and brought to fruition.

Follow the threads and the money.

China has been a country eager to learn in and from the US. The identification of some US firms which work with China should not be a surprise.

However, I would suggest that Foreign Policy or another investigative entity consider a slightly different approach to the topic of China’s technical capabilities. Let me offer one example. Consider this question:

What Israeli companies provide technology to China and other countries which may have some antipathy to the US?

This line of inquiry might lead to some interesting items of information; for example, a major US company which meets on a regular basis with a counterpart that has what I would characterize as “close links” to the Chinese government. One colloquial way to describe the situation: a conduit. Digging in this field of inquiry, one can learn how the Israeli company “flows” US intelligence-related technology from the US and elsewhere through an intermediary so that certain surveillance systems in China can benefit directly from what looks like technology developed in Israel.

Net net: If one wants to understand how US technology moves from the US, the subject must be examined in terms of academic programs, admissions, policies, and connections as well as from the point of view of US company investments in technologies which received funding from Chinese sources routed through entities based in Israel. Looking at a couple of firms does not do the topic justice and indeed suggests a small scale operation.

Uighur monitoring is one thread to follow. But just one.

Stephen E Arnold, March 20, 2019

US Government Slow In Adopting Big Data?

March 13, 2019

We are not sure if this is good news or bad news, but the United States may be slow in adopting new technology and policies. The IRS is one government agency that is leveraging big data with actual results. Mondaq describes the IRS’s data analysis efforts in “United States: States Follow The IRS In Joining The Big Data Revolution.”

The IRS has used data analysis since the 1960s to select tax returns to audit. As the technology advanced over the years, it has caught more errors and corrected them without any human involvement. The IRS created a new data analysis project dubbed the Nationally Coordinated Investigation Unit (NCIU). The NCIU will focus on using external data alongside IRS data to select cases for criminal investigation. The IRS also signed a $99 million deal with Palantir. With Palantir’s technology, the IRS will analyze and search terabytes of data from internal and external sources on a single platform. The IRS is not only data mining for criminal activities. Big data is also being used for civil audits and to predict outcomes on cases referred to the IRS Office of Appeals.

State governments have followed the IRS and implemented their own tax data analysis projects. Many have already caught fraudulent returns, and so far state governments have saved sizable chunks of cash. These data analysis implementations are great, but there are still limitations. We learned:

“Like the IRS, many state departments of revenue have faced significant budgetary pressure in recent years, as governments have tried to cut down the size and cost of government, and have turned to technology to fill the gap. As powerful as data analytics are, however, there is a limit to the extent they can replace human investigators. In 2016, for example, the Arizona Department of Revenue began to lay off dozens of auditors and tax collectors, citing budget cuts. The result was a catastrophe, as audit collections dropped nearly 47 percent—$82 million—in 2017. The IRS itself has taken a markedly different approach: IRS CI has recently announced a hiring blitz, in the course of which it will hire 250 special agents, a number of data scientists, and over 100 professional staff.”

Big data analysis will become a significant tool in the future for the IRS and local tax offices. Good or bad? Excellent question.

Whitney Grace, March 13, 2019

Good News about Big Data and AI: Not Likely

February 25, 2019

I read a write up which was a bit of a downer. The story appeared in Analytics India and was titled “10 Challenges That Data Science Industry Still Faces.” Oh, oh. Maybe not good news?

My first thought was, “Only 10?”

The write up explains that the number one challenge is humans. The idea was that smart software would solve these types of problems: sluggish workers at fast food restaurants, fascinating decisions made by entry level workers in some government bureaus, and the often remarkable statements offered by talking heads on US cable TV “real news” programs, among others.

Nope. The number one challenge is finding humans who can do data science work.

What’s number two after this somewhat thorny problem? The answer is finding the “right data” and then getting a chunk of data one can actually process.

So one and two are what I would call bedrock issues: Expertise and information.

What about the other eight challenges? Here are three of them. I urge you to read the original article for the other five issues.

  • Informing people why data science and its related operations are good for you. Is this similar to convincing a three-year-old that lima beans are just super?
  • Storytelling. I think this means, “These data mean…” One hopes the humans (who are in short supply) draw the correct inferences. One hopes.
  • Models. This is a shorthand way of saying, “What’s assembled will work.” Hopefully the answer is, “Sure, our models are great.”

Analytics India has taken a risk with its write up. None of the data science acolytes want to hear “bad news.”

Let’s federate and analyze that with great data we can select to generate a useful output. Maybe 80 percent “accuracy” on a good day?

Stephen E Arnold, February 25, 2019
