Big Data Gets a New Term: DarkCyber Had to Look This One Up

April 2, 2020

In our feed this morning (April 1, 2020) we skipped over the flood of news about Zoom (a Middle Kingdom inspired marvel), the virus stories (output by companies contributing their smart software to find a solution), and the trend of Amazon bashing (firing a worker who wanted to sanitize a facility while Amazon’s organizational skills wobble).

What stopped our scanning eyes was “Why Your Business May Be on a Data-Driven Coddiwomple.” DarkCyber admits that one of our team wrote a story for an old school publisher which used the word “cuculus” in its title: “Google in the Enterprise 2009: The Cuculus Strategy.” A “cuculus,” as you probably know, gentle reader, is a remarkable bird, sort of a thief.

But Coddiwomple? That word means to travel in a purposeful manner toward a vague destination. Most of the YouTube train ride videos and the Kara and Nate trips qualify. Other examples include the aimless wandering of enterprise search vendors who travel to the lands of customer service, analytics, and business process engineering, only occasionally returning to their home base: the 50 year old desert of proprietary enterprise search.

What’s the point of “Why Your Business May Be on a Data-Driven Coddiwomple”? DarkCyber believes the main point is valid:

In practical terms the lack of clarity on the starting point can involve a lack of vision into what the specific objectives of the team are, or what human resources and skills are already in house. Meanwhile, the diverse and siloed stakeholders in a “destination” for the data-driven endeavor may all have slightly different ideas on what the result should be, leading to a divergent and fuzzy path to follow.

In DarkCyber’s lingo, these data and analytics journeys are just hand waving and money spending.

Are businesses and other entities data driven?

Ho ho ho. Most organizations are not sure what the heck is going on. The data are easy to interpret, and no fancy, little-understood analytics system is needed to figure out that an iceberg has nicked the good ship Silicon Lollipop.

There are interesting uses of data and clever applications of systems and methods that are quite old.

Like the cuculus, opportunism is important. The coddiwomple is a secondary effect. The cuculus gets into a company’s nest and raises money consumers. When these money suckers are bigger, each flies to another nest, and the cycle repeats.

“Data driven” is a metaphor for doing something even though the results are often difficult to explain: higher costs, increased complexity, and an inability to adapt to the business environment.

I support the cuculus inspired consultants. The management of the nest can enjoy the coddiwomple as they seek a satisfying place to begin again.

Stephen E Arnold, April 2, 2020

The Problem of Too Much Info

March 17, 2020

The belief is that the more information one has, the better the decisions one can make. Is this really true? The Eurasia Review shares how too much information might be a bad thing in the article, “More Information Doesn’t Necessarily Help People Make Better Decisions.”

According to the Stevens Institute of Technology, too much knowledge causes people to make worse decisions. The finding points to a critical gap between assimilating new information and reconciling it with past knowledge and beliefs. Samantha Kleinberg, Associate Professor of Computer Science at the Stevens Institute, is studying the phenomenon, using AI and machine learning to investigate how financial advisors and healthcare professionals communicate information to their clients. She discovered:

“ ‘Being accurate is not enough for information to be useful,’ said Kleinberg. ‘It’s assumed that AI and machine learning will uncover great information, we’ll give it to people and they’ll make good decisions. However, the basic point of the paper is that there is a step missing: we need to help people build upon what they already know and understand how they will use the new information.’

For example: when doctors communicate information to patients, such as recommending blood pressure medication or explaining risk factors for diabetes, people may be thinking about the cost of medication or alternative ways to reach the same goal. ‘So, if you don’t understand all these other beliefs, it’s really hard to treat them in an effective way,’ said Kleinberg, whose work appears in the Feb. 13 issue of Cognitive Research: Principles and Implications.”

Kleinberg and her team studied the decision making processes of 4,000 participants, presenting scenarios ranging from ones they would be familiar with to ones they would not. When confronted with an unusual problem, participants focused on the problem without any extra knowledge, but if they were asked to deal with a familiar scenario such as healthcare or finances, their prior knowledge got in the way.

Information overload and the inability to merge old information with the new are problems. How do you fix them? Your guess is as good as mine.

Whitney Grace, March 17, 2020

Google Trends Used to Reveal Misspelled Wirds or Is It Words?

November 25, 2019

We spotted a listing of the most misspelled words in each of the USA’s 50 states. Too bad, Puerto Rico. Kentucky’s most misspelled word is “ninety.” Navigate to Considerable and learn what residents cannot spell. How often? Silly kweston.

The listing includes some bafflers and may reveal what can go wrong with data from an online ad sales data collection system; for example:

  • Washington, DC (which is not a state in DarkCyber’s book) cannot spell “enough”; for example, “enuf already with these televised hearings and talking heads”
  • Idaho residents cannot spell embarrassed, which as listeners to Kara Swisher know has two r’s and two s’s. Helpful that.
  • Montana residents cannot spell “comma.” Do those in Montana use commas?
  • And not surprisingly, those in Tennessee cannot spell “intelligent.” Imagine that!

What happens if one trains smart software on these data?

Sumthink mite go awf the railz.

Stephen E Arnold, November 25, 2019

Info Extraction: Improving?

November 21, 2019

Information extraction (IE) is key to machine learning and artificial intelligence (AI), especially for natural language processing (NLP). The problem with information extraction is that information pulled from datasets often lacks context; thus, systems fail to properly categorize and rationalize the data. Good Men Project shares some hopeful news for IE in the article, “Measuring Without Labels: A Different Approach To Information Extraction.”

Current IE relies on an AI programmed with a specific set of schema that states what information needs to be extracted. A retail Web site like Amazon probably uses an IE AI programmed to extract product names, UPCs, and price, while a travel Web site like Kayak uses an IE AI to find price, airlines, dates, and hotel names. For law enforcement officials, it is particularly difficult to design schema for human trafficking, because datasets on that subject do not exist. Also, traditional IE methods, such as crowdsourcing, do not work due to the sensitivity of the subject.

Without a reliable human trafficking dataset, how does one evaluate an IE system and prove its worth? The proposed answer is to measure the dependencies between extractions. A dependency works as follows:

“Consider the network illustrated in the figure above. In this kind of network, called attribute extraction network (AEN), we model each document as a node. An edge exists between two nodes if their underlying documents share an extraction (in this case, names). For example, documents D1 and D2 are connected by an edge because they share the extraction ‘Mayank.’ Note that constructing the AEN only requires the output of an IE, not a gold standard set of labels. Our primary hypothesis in the article was that, by measuring network-theoretic properties (like the degree distribution, connectivity etc.) of the AEN, correlations would emerge between these properties and IE performance metrics like precision and recall, which require a sufficiently large gold standard set of IE labels to compute. The intuition is that IE noise is not random noise, and that the non-random nature of IE noise will show up in the network metrics. Why is IE noise non-random? We believe that it is due to ambiguity in the real world over some terms, but not others.”
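The construction described in the quote is compact enough to sketch in code. Below is a minimal, hypothetical AEN built with Python and networkx; the documents and extracted values are toy data invented for illustration, not anything from the article’s dataset:

```python
import networkx as nx

# Toy IE output standing in for real extractions: document -> extracted values.
extractions = {
    "D1": {"Mayank", "555-0100"},
    "D2": {"Mayank"},
    "D3": {"Priya", "555-0100"},
    "D4": {"Priya"},
    "D5": {"Rahul"},
}

# Build the attribute extraction network (AEN): each document is a node;
# an edge connects two documents that share at least one extraction.
aen = nx.Graph()
aen.add_nodes_from(extractions)
docs = list(extractions)
for i, d1 in enumerate(docs):
    for d2 in docs[i + 1:]:
        if extractions[d1] & extractions[d2]:
            aen.add_edge(d1, d2)

# Network-theoretic properties which, per the article, correlate with IE
# precision and recall without needing a gold standard set of labels.
print(dict(aen.degree()))                   # degree distribution
print(nx.number_connected_components(aen))  # connectivity
```

In a real evaluation, these network metrics would be compared against known precision and recall figures to see whether the correlations the authors report hold.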

Using the attributes of names, phone numbers, and locations, correlations were discovered. Modeling the dependencies between an IE system’s extractions thus creates a new methodology for evaluating it. Network science usually relies on concrete real-world interactions, but the AEN is an abstract network of IE interactions. The non-random nature of IE mistakes, in fact, may allow law enforcement to use IE AI to acquire the desired information without a practice dataset.

Whitney Grace, November 21, 2019

Tracking Trends in News Homepage Links with Google BigQuery

October 17, 2019

Some readers may be familiar with the term “culturomics,” a particular application of n-gram-based linguistic analysis to text. The practice arose after a 2010 project that applied such analysis to five million historical books across seven languages. The technique creates n-gram word frequency histograms from the source text. Now the technique has been applied to links found on news organizations’ home pages using Google’s BigQuery platform. Forbes reports, “Using the Cloud to Explore the Linguistic Patterns of Half a Trillion Words of News Homepage Hyperlinks.” Writer Kalev Leetaru explains:

“News media represents a real-time reflection of localized events, narratives, beliefs and emotions across the world, offering an unprecedented look into the lens through which we see the world around us. The open data GDELT Project has monitored the homepages of more than 50,000 news outlets worldwide every hour since March 2018 through its Global Frontpage Graph (GFG), cataloging their links in an effort to understand global journalistic editorial decision-making. In contrast to traditional print and broadcast mediums, online outlets have theoretically unlimited space, allowing them to publish a story without displacing another. Their homepages, however, remain precious fixed real estate, carefully curated by editors that must decide which stories are the most important at any moment. Analyzing these decisions can help researchers better understand which stories each news outlet believed to be the most important to its readership at any given moment in time and how those decisions changed hour by hour.”
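As a refresher on the culturomics technique mentioned above, an n-gram word frequency histogram takes only a few lines of Python. The link texts here are invented for illustration:

```python
from collections import Counter

def ngram_histogram(text: str, n: int = 2) -> Counter:
    """Count n-gram frequencies in a text, culturomics-style."""
    tokens = text.lower().split()
    return Counter(" ".join(g) for g in zip(*(tokens[i:] for i in range(n))))

# Invented homepage link texts standing in for the GFG data.
links = [
    "markets fall as virus fears spread",
    "virus fears spread across world markets",
]
histogram = Counter()
for link in links:
    histogram.update(ngram_histogram(link, n=2))
print(histogram.most_common(3))  # e.g. [('virus fears', 2), ('fears spread', 2), ...]
```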

The project has now collected more than 134 billion such links. The article describes how researchers have used BigQuery to analyze this dataset with a single SQL query, so navigate there for the technical details. Interestingly, one thing they are looking at is trends across the 110 languages represented by the samples. Leetaru emphasizes this endeavor demonstrates how much faster these computations can be achieved compared to the 2010 project. He concludes:

“Even large-scale analyses are moving so close to real-time that we are fast approaching the ability of almost any analysis to transition from ‘what if’ and ‘I wonder’ to final analysis in just minutes with a single query.”
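For readers who want to experiment, a query of this general shape can be issued from Python with the google-cloud-bigquery client. The table and column names below are placeholders, not the actual Global Frontpage Graph schema, so check the GDELT Project documentation before running anything:

```python
from google.cloud import bigquery

# Requires Google Cloud credentials; see the google-cloud-bigquery docs.
client = bigquery.Client()

# Placeholder table and column names for illustration only; the real
# GDELT Global Frontpage Graph schema is documented by the GDELT Project.
query = """
    SELECT LinkText, COUNT(*) AS mentions
    FROM `gdelt-bq.gdeltv2.gfg`
    GROUP BY LinkText
    ORDER BY mentions DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.LinkText, row.mentions)
```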

Will faster analysis lead to wiser decisions? We shall see.

Cynthia Murrell, October 17, 2019

A List of Eavesdroppers: Possibly Sort of Incomplete and Misleading?

August 22, 2019

DarkCyber noted “Here’s Every Major Service That Uses Humans to Eavesdrop on Your Voice Commands.” Notice the word “major.” Here’s the list from the write up:

  • Amazon
  • Apple
  • Facebook
  • Google
  • Microsoft

DarkCyber wonders if these vendors/systems should be considered for inclusion in the list of “every” eavesdropping service:

  • China Telecom
  • Huawei
  • Shoghi
  • Utimaco

DarkCyber is confused about “every” when five candidates are advanced. The four we have suggested for consideration are organizations plucked from our list of interesting companies which may be in the surveillance sector. We await more comprehensive lists from the “real news” outfit Daily Beast. Growl!

Stephen E Arnold, August 22, 2019

Scalability: Assumed to Be Infinite?

August 20, 2019

I hear and read about scalability—whether I want to or not. Within the last 24 hours, I learned that certain US government applications have to be smart (AI and ML) and have the ability to scale. Scale to what? In what amount of time? How?

The answers to these questions are usually Amazon, Google, IBM, Microsoft, or some other company’s cloud.

I thought about this implicit assumption about scaling when I read “Vitalik Buterin: Ethereum’s Scalability Issue Is Proving To Be A Formidable Impediment To Adoption By Institutions.” The “inventor” of Ethereum (a technology supported by Amazon AWS, by the way) allegedly said:

Scalability is a big bottleneck because Ethereum blockchain is almost full. If you’re a bigger organization, the calculus is that if we join it will not only be full but we will be competing with everyone for transaction space. It’s already expensive and it will be even five times more expensive because of us. There is pressure keeping people from joining, but improvements in scalability can do a lot in improving that.

There are fixes. Here’s one from the write up:

Notably, Vitalik is known to be a supporter of other crypto currencies besides Ethereum. In July, Buterin suggested using Bitcoin Cash (BCH) to solve the scalability barrier in the short-term as they figure out a more permanent solution. Additionally, early this month, he supported the idea of integrating Bitcoin Lightning Network into the Ethereum smart contracts asserting that the “future of crypto currencies is diverse and pluralist”.

Questions which may be germane:

  1. What’s the limit of scalability?
  2. How do today’s systems scale?
  3. What’s the time and resource demand when one scales to an unknown scope?

Please, don’t tell me, “Scaling is infinite.”

Why?

There are constraints and limits. Two factors some people don’t want to think about. Better to say, “Scaling. No problem.”

Wrong. Scaling is a problem. Someone has to pay for the infrastructure, the know-how, the downstream consequences of latency, and the other “costs.”

Stephen E Arnold, August 20, 2019

Hadoop Fail: A Warning Signal in Big Data Fantasy Land?

August 11, 2019

DarkCyber notices when high profile companies talk about data federation, data lakes, and intelligent federation of real time data with historical data. Examples include Amazon and Anduril, to name two companies offering this type of data capability.

“What Happened to Hadoop and Where Do We Go from Here?” does not directly discuss the data management systems in Amazon and Anduril, but the points the author highlights may be germane to thinking about what is possible and what remains just out of reach when it comes to processing the rarely defined world of “Big Data.”

The write up focuses on Hadoop, the elephant logo thing. Three issues are identified:

  1. Data provenance was tough to maintain and therefore to determine. This is a variation on the GIGO theme (garbage in, garbage out).
  2. Creating a data lake is complicated. With talent shortages, the problem of complexity may hardwire failure.
  3. The big pool of data becomes the focus. That’s okay, but the application to solve the problem is often lost.

Why is a discussion of Hadoop relevant to Amazon and Anduril? The reason is that despite the weaknesses of these systems, both companies are addressing the “Hadoop problem” but in different ways.

These two firms, therefore, may be significant because of their approaches and their different angles of attack.

Amazon is providing a platform which, in the hands of a skilled Amazon technologist, can deliver a cohesive data environment. Furthermore, the digital craftsman can build a solution that works. It may be expensive and possibly flakey, but it mostly works.

Anduril, on the other hand, delivers the federation in a box: hardware, smart software, and applications. License, deploy, and use.

Despite the different angles of attack, both companies are making headway in the data federation, data lake, and real time analytics sector.

The issue is not what will happen to Hadoop; the issue is how quickly competitors will respond to these different ways of dealing with Big Data.

Stephen E Arnold, August 11, 2019

15 Reasons You Need Business Intelligence Software

May 21, 2019

I read StrategyDriven’s “The Importance of Business Intelligence Software and Why It’s Integral for Business Success.” I found the laundry list interesting, but I asked myself, “If BI software is so important, why is it necessary to provide 15 reasons?”

I went through the list of items a couple of times. Some of the reasons struck me as a bit of a stretch. I had a teacher at the University of Illinois who loved the phrase “a bit of a stretch, right” when a graduate student proposed a wild and crazy hypothesis or drew a nutsy conclusion from data.

Let’s look at four of these reasons and see if there’s merit to my skepticism about delivering fish to a busy manager when the person wanted a fish sandwich.

Reason 1: Better business decisions. Really? If a BI system outputs data to a clueless person or uses flawed, incomplete, or stale data to present an output to a bright person, are better business decisions an outcome? In my experience, nope.

Reason 6: Accurate decision making. What the human does with the outputs is likely to result in a decision. That’s true. But accurate? Too many variables exist to create a one-to-one correlation between the assertion and what happens in a decider’s head or among a group of deciders who get together to figure out what to do. Example: Google has data. Google decided to pay a person accused of improper behavior millions of dollars. Accurate decision making? I suppose it depends on one’s point of view.

Reason 11: Reduced cost. I am confident when I say, “Most companies do not calculate or have the ability to assemble the information needed to produce fully loaded costs.” Consequently, the cost of a BI system is not just the license fee. There are the associated direct and indirect costs. And when a decision from the BI system is wrong, there are other costs as well. How are Facebook’s eDiscovery systems generating a payback today? Facebook has data, but the costs of its eDiscovery systems are not known, nor does anyone care as the legal hassles continue to flood the company’s executive suite.

Reason 13: High quality data. Whoa, hold your horses. The data cost is an issue in virtually every company with which I have experience. No one wants to invest to make certain that the information is complete, accurate, up to date, and maintained (indexed accurately and put in a consistent format). This is a pretty crazy assertion about BI when there is no guarantee that the data fed into the system is representative, comprehensive, accurate, and fresh.

Business intelligence is a tool. Use of a BI system does not generate guaranteed outcomes.

Stephen E Arnold, May 21, 2019

Comfort with Big Data or You May Not Be Hired

April 5, 2019

I read an interesting essay in Analytics India Magazine, a source I find useful in explaining how managers from that country think about certain issues.

Case in point: What makes a good employee, presumably of a company operating in Analytics India’s home territory or managed by a person who devours each issue in search of data nuggets.

The article which caught my attention? “Why Everyone In The Organization Has To Be Comfortable Dealing With Data.”

I noted this passage:

For a successful functioning of an organization, it is necessary that everyone in an organization is comfortable dealing with data.

I like the categorical affirmative: Everyone.

I like the notion of not being informed, good, or competent. Comfortable only.

Now, the questions:

  1. Does the argument require HR (personnel) to define “comfort” and then measure that quality?
  2. What happens to those who perform certain services like greeting visitors, providing administrative support, or chauffeuring the owner to his or her private jet? Outsourcing perhaps? A special class of workers removed from the Big Data folks?
  3. What happens to employees in countries which graduate individuals from a university lacking desired numerical skills? No jobs?

I enjoyed the recommendations for addressing this requirement. Educate and upskill (presented as two action items, but to innumerate me these are one thing). Then “every department has to realize the power of data.” I love the “every” and the sort of adulty phrase “has to realize.”

But the keeper is this statement: “Adopt methods for data cleaning.”

Yeah, clean data for Big Data. Who does that work? Obviously employees who are comfortable. Yep, comfort will deal with data issues like validity, consistency, etc. etc.
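What does “adopt methods for data cleaning” look like in practice? Here is a minimal pandas sketch; the records, columns, and validity rules are hypothetical:

```python
import pandas as pd

# Hypothetical customer records with the usual Big Data hygiene problems.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, -5, -5, 51, None],
    "country": ["US", "us", "us", "IN ", "IN"],
})

# Consistency: normalize formats before comparing values.
df["country"] = df["country"].str.strip().str.upper()

# Validity: flag out-of-range values rather than silently trusting them.
df.loc[(df["age"] < 0) | (df["age"] > 120), "age"] = None

# Deduplication: exact duplicates add nothing but skewed counts.
df = df.drop_duplicates()

print(df)
```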

Stephen E Arnold, April 5, 2019
