Google Trends Used to Reveal Misspelled Wirds or Is It Words?

November 25, 2019

We spotted a listing of the most misspelled words in each of the USA’s 50 states. Too bad Puerto Rico. Kentucky’s most misspelled word is “ninety.” Navigate to Considerable and learn what residents cannot spell. How often? Silly kweston.

The listing includes some bafflers and may reveal what can go wrong with data from an online ad sales data collection system; for example:

  • Washington, DC (which is not a state in DarkCyber’s book) cannot spell “enough”; for example, “enuf already with these televised hearings and talking heads”
  • Idaho residents cannot spell embarrassed, which as listeners to Kara Swisher know has two r’s and two s’s. Helpful that.
  • Montana residents cannot spell “comma.” Do those in Montana use commas?
  • And not surprisingly, those in Tennessee cannot spell “intelligent.” Imagine that!

What happens if one trains smart software on these data?

Sumthink mite go awf the railz.

Stephen E Arnold, November 25, 2019

Info Extraction: Improving?

November 21, 2019

Information extraction (IE) is key to machine learning and artificial intelligence (AI), especially for natural language processing (NLP). The problem with information extraction is that, while information is pulled from datasets, it often lacks context, so the system fails to properly categorize and rationalize the data. Good Men Project shares some hopeful news for IE in the article, “Measuring Without Labels: A Different Approach To Information Extraction.”

Current IE relies on an AI programmed with a specific schema that states what information needs to be extracted. A retail Web site like Amazon probably uses an IE AI programmed to extract product names, UPCs, and prices, while a travel Web site like Kayak uses an IE AI to find prices, airlines, dates, and hotel names. For law enforcement officials, it is particularly difficult to design a schema for human trafficking, because datasets on that subject do not exist. Also, traditional IE methods, such as crowdsourcing, do not work due to the sensitivity of the subject.

In order to create a reliable human trafficking dataset and prove its worth, the approach relies on the dependencies between IE extractions. A dependency works as follows:

“Consider the network illustrated in the figure above. In this kind of network, called attribute extraction network (AEN), we model each document as a node. An edge exists between two nodes if their underlying documents share an extraction (in this case, names). For example, documents D1 and D2 are connected by an edge because they share the extraction ‘Mayank.’ Note that constructing the AEN only requires the output of an IE, not a gold standard set of labels. Our primary hypothesis in the article was that, by measuring network-theoretic properties (like the degree distribution, connectivity etc.) of the AEN, correlations would emerge between these properties and IE performance metrics like precision and recall, which require a sufficiently large gold standard set of IE labels to compute. The intuition is that IE noise is not random noise, and that the non-random nature of IE noise will show up in the network metrics. Why is IE noise non-random? We believe that it is due to ambiguity in the real world over some terms, but not others.”
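The network described in the quoted passage can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' code: the function names (`build_aen`, `degree_distribution`, `connected_components`) and the toy extractions are hypothetical, and only the "Mayank" link between D1 and D2 comes from the quote. The sketch builds the graph from IE output alone, with no gold standard labels, and then computes two of the network properties the passage mentions.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_aen(extractions):
    """Build an attribute extraction network (AEN).

    extractions: dict mapping a document id to the set of values
    an IE system extracted from that document (names, in this toy case).
    Returns a dict mapping each document id to the set of its neighbours.
    """
    # Invert the mapping: extracted value -> documents that contain it.
    docs_by_value = defaultdict(set)
    for doc_id, values in extractions.items():
        for value in values:
            docs_by_value[value].add(doc_id)

    # Every document is a node, even if it ends up with no edges.
    graph = {doc_id: set() for doc_id in extractions}
    # Two documents share an edge if they share at least one extraction.
    for docs in docs_by_value.values():
        for a, b in combinations(sorted(docs), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def degree_distribution(graph):
    """Count how many nodes have each degree."""
    return Counter(len(neighbours) for neighbours in graph.values())


def connected_components(graph):
    """Return the connected components as a list of node sets."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical IE output. Only the 'Mayank' link between D1 and D2
# is taken from the quoted passage; the rest is invented for illustration.
toy_extractions = {
    "D1": {"Mayank"},
    "D2": {"Mayank", "Alice"},
    "D3": {"Alice"},
    "D4": {"Bob"},
}

aen = build_aen(toy_extractions)
print(degree_distribution(aen))
print(len(connected_components(aen)))
```

In this toy case D4 shares no extraction with any other document, so it sits alone in its own component. The hypothesis in the quote is that properties like these shift in non-random ways as extraction quality changes.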

Using the attributes names, phone numbers, and locations, the researchers discovered correlations. The dependencies among an AI system’s extractions create a new methodology to evaluate it. Network science relies on non-abstract interactions to test IE, but the AEN is an abstract network of IE interactions. The mistakes, in fact, allow law enforcement to use IE AI to acquire the desired information without having a practice dataset.

Whitney Grace, November 21, 2019

Tracking Trends in News Homepage Links with Google BigQuery

October 17, 2019

Some readers may be familiar with the term “culturomics,” a particular application of n-gram-based linguistic analysis to text. The practice arose after a 2010 project that applied such analysis to five million historical books across seven languages. The technique creates n-gram word frequency histograms from the source text. Now the technique has been applied to links found on news organizations’ home pages using Google’s BigQuery platform. Forbes reports, “Using the Cloud to Explore the Linguistic Patterns of Half a Trillion Words of News Homepage Hyperlinks.” Writer Kalev Leetaru explains:

“News media represents a real-time reflection of localized events, narratives, beliefs and emotions across the world, offering an unprecedented look into the lens through which we see the world around us. The open data GDELT Project has monitored the homepages of more than 50,000 news outlets worldwide every hour since March 2018 through its Global Frontpage Graph (GFG), cataloging their links in an effort to understand global journalistic editorial decision-making. In contrast to traditional print and broadcast mediums, online outlets have theoretically unlimited space, allowing them to publish a story without displacing another. Their homepages, however, remain precious fixed real estate, carefully curated by editors that must decide which stories are the most important at any moment. Analyzing these decisions can help researchers better understand which stories each news outlet believed to be the most important to its readership at any given moment in time and how those decisions changed hour by hour.”

The project has now collected more than 134 billion such links. The article describes how researchers have used BigQuery to analyze this dataset with a single SQL query, so navigate there for the technical details. Interestingly, one thing they are looking at is trends across the 110 languages represented by the samples. Leetaru emphasizes this endeavor demonstrates how much faster these computations can be achieved compared to the 2010 project. He concludes:

“Even large-scale analyses are moving so close to real-time that we are fast approaching the ability of almost any analysis to transition from ‘what if’ and ‘I wonder’ to final analysis in just minutes with a single query.”
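For readers who want a feel for the n-gram word frequency histograms mentioned above, here is a minimal Python sketch. It is a stand-in, not the BigQuery SQL query the Forbes article describes: the helper names (`ngrams`, `ngram_histogram`) and the sample link texts are invented for illustration, and the tokenizer is a deliberately crude assumption that handles only unaccented Latin text.

```python
from collections import Counter
import re


def ngrams(text, n):
    """Split text into lowercase word tokens and return its n-grams."""
    # Crude tokenizer (an assumption): runs of letters, digits, apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_histogram(texts, n):
    """Count how often each n-gram appears across a collection of texts."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text, n))
    return counts


# Invented homepage link texts, for illustration only.
link_texts = [
    "Markets rally as trade talks resume",
    "Trade talks stall again",
    "Markets slide after trade talks stall",
]

unigrams = ngram_histogram(link_texts, 1)
bigrams = ngram_histogram(link_texts, 2)

print(unigrams.most_common(3))
print(bigrams.most_common(2))
```

The same counting logic, pushed into a data warehouse and run over billions of links, is what makes the hour-by-hour comparisons in the article feasible.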

Will faster analysis lead to wiser decisions? We shall see.

Cynthia Murrell, October 17, 2019

A List of Eavesdroppers: Possibly Sort of Incomplete and Misleading?

August 22, 2019

DarkCyber noted “Here’s Every Major Service That Uses Humans to Eavesdrop on Your Voice Commands.” Notice the word “major.” Here’s the list from the write up:

  • Amazon
  • Apple
  • Facebook
  • Google
  • Microsoft

DarkCyber wonders if these vendors/systems should be considered for inclusion in the list of “every” eavesdropping service:

  • China Telecom
  • Huawei
  • Shoghi
  • Utimaco

DarkCyber is confused about “every” when five candidates are advanced. The candidates we have suggested for consideration are organizations plucked from our list of interesting companies which may be in the surveillance sector. We await more comprehensive lists from the “real news” outfit “Daily Beast.” Growl!

Stephen E Arnold, August 22, 2019

Scalability: Assumed to Be Infinite?

August 20, 2019

I hear and read about scalability—whether I want to or not. Within the last 24 hours, I learned that certain US government applications have to be smart (AI and ML) and have the ability to scale. Scale to what? In what amount of time? How?

The answers to these questions are usually Amazon, Google, IBM, Microsoft, or some other company’s cloud.

I thought about this implicit assumption about scaling when I read “Vitalik Buterin: Ethereum’s Scalability Issue Is Proving To Be A Formidable Impediment To Adoption By Institutions.” The “inventor” of Ethereum (a technology supported by Amazon AWS, by the way) allegedly said:

“Scalability is a big bottleneck because Ethereum blockchain is almost full. If you’re a bigger organization, the calculus is that if we join it will not only be full but we will be competing with everyone for transaction space. It’s already expensive and it will be even five times more expensive because of us. There is pressure keeping people from joining, but improvements in scalability can do a lot in improving that.”

There are fixes. Here’s one from the write up:

Notably, Vitalik is known to be a supporter of other crypto currencies besides Ethereum. In July, Buterin suggested using Bitcoin Cash (BCH) to solve the scalability barrier in the short-term as they figure out a more permanent solution. Additionally, early this month, he supported the idea of integrating Bitcoin Lightning Network into the Ethereum smart contracts asserting that the “future of crypto currencies is diverse and pluralist”.

Questions which may be germane:

  1. What’s the limit of scalability?
  2. How do today’s systems scale?
  3. What’s the time and resource demand when one scales to an unknown scope?

Please, don’t tell me, “Scaling is infinite.”


There are constraints and limits. Two factors some people don’t want to think about. Better to say, “Scaling. No problem.”

Wrong. Scaling is a problem. Someone has to pay for the infrastructure, the know how, downstream consequences of latency, and the other “costs.”

Stephen E Arnold, August 20, 2019

Hadoop Fail: A Warning Signal in Big Data Fantasy Land?

August 11, 2019

DarkCyber notices when high profile companies talk about data federation, data lakes, and intelligent federation of real time data with historical data. Examples include Amazon and Anduril to name two companies offering this type of data capability.

“What Happened to Hadoop and Where Do We Go from Here?” does not directly discuss the data management systems in Amazon and Anduril, but the points the author highlights may be germane to thinking about what is possible and what remains just out of reach when it comes to processing the rarely defined world of “Big Data.”

The write up focuses on Hadoop, the elephant logo thing. Three issues are identified:

  1. Data provenance was tough to maintain and therefore determine. This is a variation on the GIGO theme (garbage in, garbage out).
  2. Creating a data lake is complicated. With talent shortages, the problem of complexity may hardwire failure.
  3. The big pool of data becomes the focus. That’s okay, but the application to solve the problem is often lost.

Why is a discussion of Hadoop relevant to Amazon and Anduril? The reason is that despite the weaknesses of these systems, both companies are addressing the “Hadoop problem” but in different ways.

These two firms, therefore, may be significant because of their approach and their different angles of attack.

Amazon is providing a platform which, in the hands of a skilled Amazon technologist, can deliver a cohesive data environment. Furthermore, the digital craftsman can build a solution that works. It may be expensive and possibly flakey, but it mostly works.

Anduril, on the other hand, delivers the federation in a box. Anduril is a hardware product, smart software, and applications. License, deploy, and use.

Despite the different angles of attack, both companies are making headway in the data federation, data lake, and real time analytics sector.

The issue is not what will happen to Hadoop. The issue is how quickly competitors will respond to these different ways of dealing with Big Data.

Stephen E Arnold, August 11, 2019

15 Reasons You Need Business Intelligence Software

May 21, 2019

I read StrategyDriven’s “The Importance of Business Intelligence Software and Why It’s Integral for Business Success.” I found the laundry list interesting, but I asked myself, “If BI software is so important, why is it necessary to provide 15 reasons?”

I went through the list of items a couple of times. Some of the reasons struck me as a bit of a stretch. I had a teacher at the University of Illinois who loved the phrase “a bit of a stretch, right” when a graduate student proposed a wild and crazy hypothesis or drew a nutsy conclusion from data.

Let’s look at four of these reasons and see if there’s merit to my skepticism about delivering fish to a busy manager when the person wanted a fish sandwich.

Reason 1. Better business decisions. Really? If a BI system outputs data to a clueless person or uses flawed, incomplete, or stale data to present an output to a bright person, are better business decisions an outcome? In my experience, nope.

Reason 6. Accurate decision making. What the human does with the outputs is likely to result in a decision. That’s true. But accurate? Too many variables exist to create a one to one correlation with the assertion and what happens in a decider’s head or among a group of deciders who get together to figure out what to do. Example: Google has data. Google decided to pay a person accused of improper behavior millions of dollars. Accurate decision making? I suppose it depends on one’s point of view.

Reason 11. Reduced cost. I am confident when I say, “Most companies do not calculate or have the ability to assemble the information needed to produce fully loaded costs.” Consequently, the cost of a BI system is not the license fee. There are the associated directs and indirects. And when a decision from the BI system is wrong, there are some other costs as well. How are Facebook’s eDiscovery systems generating a payback today? Facebook has data, but the costs of its eDiscovery systems are not known, nor does anyone care as the legal hassles continue to flood the company’s executive suite.

Reason 13. High quality data. Whoa, hold your horses. The data cost is an issue in virtually every company with which I have experience. No one wants to invest to make certain that the information is complete, accurate, up to date, and maintained (indexed accurately and put in a consistent format). This is a pretty crazy assertion about BI when there is no guarantee that the data fed into the system is representative, comprehensive, accurate, and fresh.

Business intelligence is a tool. Use of a BI system does not generate guaranteed outcomes.

Stephen E Arnold, May 21, 2019

Comfort with Big Data or You May Not Be Hired

April 5, 2019

I read an interesting essay in Analytics India Magazine, a source I find useful in explaining how managers from that country think about certain issues.

Case in point: What makes a good employee, presumably of a company operating in Analytics India’s home territory or managed by a person who devours each issue in search of data nuggets.

The article which caught my attention? “Why Everyone In The Organization Has To Be Comfortable Dealing With Data.”

I noted this passage:

For a successful functioning of an organization, it is necessary that everyone in an organization is comfortable dealing with data.

I like the categorical affirmative: Everyone.

I like the notion of not being informed, good, or competent. Comfortable only.

Now the questions:

  1. Does the argument require HR (personnel) to define “comfort” and then measure that quality?
  2. What happens to those who perform certain services like greeting visitors, providing administrative support, or chauffeuring the owner to his or her private jet? Outsourcing perhaps? A special class of workers removed from the Big Data folks?
  3. What happens to employees in countries which graduate individuals from a university lacking desired numerical skills? No jobs?

I enjoyed the recommendations for addressing this requirement. Educate and upskill (presented as two action items, but to innumerate me these are one thing). Then “every department has to realize the power of data.” I love the “every” and the sort of adulty phrase “has to realize.”

But the keeper is this statement: “Adopt methods for data cleaning.”

Yeah, clean data for Big Data. Who does that work? Obviously employees who are comfortable. Yep, comfort will deal with data issues like validity, consistency, etc. etc.

Stephen E Arnold, April 5, 2019

Data and Analytics: Do Good, Not Bad

April 1, 2019

Nope, not an April Fool’s spoof. “Using Data and Analytics for Good” is an attempt to make a case for monitoring and intercept technology to make the world a better place. No, the write up does not use China’s social credit score as an example of “doing good.”

I noted this statement from Cindi Howson, a Gartner fellow traveler, in the write up:

Howson said the mission is a personal one for her that started when she was a college student. She was working two jobs to pay her own way, and after she wrote that big tuition check she had only $2 left to buy hotdogs and a box of macaroni to last a week. She knew that financially there wasn’t much separating her from the homeless people she had passed on the streets of New York City every night.

Was this the plight of the students whose parents paid hundreds of thousands of dollars so that their progeny could enter “prestigious schools”?

Will Gartner convert data for good into revenue? Stakeholders may be crossing their fingers.

The “doing good” thing does not get much coverage in The Age of Surveillance Capitalism. That’s no April Fool’s joke.

Stephen E Arnold, April 1, 2019

Who Is Assisting China in Its Technology Push?

March 20, 2019

I read “U.S. Firms Are Helping Build China’s Orwellian State.” The write up is interesting because it identifies companies which allegedly provide technology to the Middle Kingdom. The article also uses an interesting phrase; that is, “tech partnerships.” Please, read the original article for the names of the US companies allegedly cooperating with China.

I want to tell a story.

Several years ago, my team was asked to prepare a report for a major US university. Our task was to try and answer what I thought was a simple question when I accepted the engagement, “Why isn’t this university’s computer science program ranked in the top ten in the US?”

The answer, my team and I learned, had zero to do with faculty, courses, or the intelligence of students. The primary reason was that the university’s graduates were returning to their “home countries.” These included China, Russia, and India, among others. In one advanced course, there was no US born, US educated student.

We documented that over a seven year period, when the undergraduate, graduate, and post doctoral students completed their work, they had little incentive to start up companies in proximity to the university, donate to the school’s fund raising, or provide the rah rah that happy graduates often do. To see the rah rah in action, may I suggest you visit a “get together” of graduates near Stanford or an eatery in Boston or on NCAA elimination week end in Las Vegas.

How could my client fix this problem? We were not able to offer a quick fix or even an easy fix. The university had institutionalized revenue from non US students and was, when we did the research, dependent on non US students. These students were very, very capable and they came to the US to learn, form friendships, and sharpen their business and technical “soft” skills. These, I assume, were skills put to use to reach out to firms where a “soft” contact could be easily initiated and brought to fruition.

Follow the threads and the money.

China has been a country eager to learn in and from the US. The identification of some US firms which work with China should not be a surprise.

However, I would suggest that Foreign Policy or another investigative entity consider a slightly different approach to the topic of China’s technical capabilities. Let me offer one example. Consider this question:

What Israeli companies provide technology to China and other countries which may have some antipathy to the US?

This line of inquiry might lead to some interesting items of information; for example, a major US company which meets on a regular basis with a counterpart with what I would characterize as “close links” to the Chinese government. One colloquial way to describe the situation is like a conduit. Digging in this field of inquiry, one can learn how the Israeli company “flows” US intelligence-related technology from the US and elsewhere through an intermediary so that certain surveillance systems in China can benefit directly from what looks like technology developed in Israel.

Net net: If one wants to understand how US technology moves from the US, the subject must be examined in terms of academic programs, admissions, policies, and connections as well as from the point of view of US company investments in technologies which received funding from Chinese sources routed through entities based in Israel. Looking at a couple of firms does not do the topic justice and indeed suggests a small scale operation.

Uighur monitoring is one thread to follow. But just one.

Stephen E Arnold, March 20, 2019
