Abandoned Books: Yep, Analytics to the Rescue

January 6, 2020

DarkCyber noted “The Most ‘Abandoned’ Books on GoodReads.” The idea is that by using available data, a list of books people could not finish reading can be generated. Disclosure: I will try free or $1.99 books on my Kindle and bail out if the content does not make me quiver with excitement.

The research, which is presented in academic finery, reports that the author of Harry Potter’s adventures churned out a book few people could finish. The title? The Casual Vacancy by J.K. Rowling. I was unaware of the book, but I will wager that the author is happy enough with the advance and any royalty checks which clear the bank. Success is not completion; success is money, I assume.

I want to direct your attention, gentle reader, to the explanation of the methodology used to award this singular honor to J.K. Rowling, who is probably pleased as punch with the bank interaction referenced in the preceding paragraph.

Several points merit brief, very brief comment:

  • Bayesian. A go-to method. Works reasonably well. Guessing has its benefits.
  • Data sets. Not exactly comprehensive. Amazon? What about the Kindle customer data, including time to abandonment, page of abandonment, etc.? Library of Congress? Any data to share? Top 20 library systems in the US? Got some numbers; for example, number of copies in circulation?
  • Communication. The write up is a good example of why some big time thinkers ignore the inputs of certain analysts.
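
The Bayesian bullet is easier to weigh with a toy example. What follows is a minimal sketch, not the GoodReads study's actual method, which this post does not spell out. It assumes each book has a count of readers and a count of readers who gave up, and it shrinks the raw abandonment rate toward a prior using a Beta-Binomial model. The function name, the prior parameters, and the counts are all invented for illustration.

```python
def posterior_abandon_rate(abandoned, readers, prior_a=2.0, prior_b=18.0):
    """Posterior mean abandonment rate under a Beta(prior_a, prior_b) prior.

    The default prior is hypothetical: it encodes a belief that about 10%
    of readers abandon a typical book, worth roughly 20 readers of evidence.
    """
    if readers < 0 or abandoned < 0 or abandoned > readers:
        raise ValueError("need 0 <= abandoned <= readers")
    return (abandoned + prior_a) / (readers + prior_a + prior_b)


# Invented counts: (abandoned, readers)
books = {
    "Obscure Novel": (3, 4),
    "Big Name Novel": (600, 2000),
}

for title, (abandoned, readers) in books.items():
    raw = abandoned / readers
    shrunk = posterior_abandon_rate(abandoned, readers)
    print(f"{title}: raw={raw:.2f} posterior={shrunk:.2f}")
```

With these made-up numbers, the four-reader book's 75 percent raw rate shrinks to about 21 percent, while the 2,000-reader book stays near 30 percent. That is the benefit of the "guessing": a handful of quitters cannot push a little-read title to the top of the list.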

To sum up, perhaps The Casual Vacancy may make a great gift when offered by Hamilton Books? A coffee table book perhaps?

Stephen E Arnold, January 6, 2020

Megaputer Spans Text Analysis Disciplines

January 6, 2020

What exactly do we mean by “text analysis”? That depends entirely on the context. Megaputer shares a useful list of the most popular types in its post, “What’s in a Text Analysis Tool?” The introduction explains:

“If you ask five different people, ‘What does a Text Analysis tool do?’, it is very likely you will get five different responses. The term Text Analysis is used to cover a broad range of tasks that include identifying important information in text: from a low, structural level to more complicated, high-level concepts. Included in this very broad category are also tools that convert audio to text and perform Optical Character Recognition (OCR); however, the focus of these tools is on the input, rather than the core tasks of text analysis. Text Analysis tools not only perform different tasks, but they are also targeted to different user bases. For example, the needs of a researcher studying the reactions of people on Twitter during election debates may require different Text Analysis tasks than those of a healthcare specialist creating a model for the prediction of sepsis in medical records. Additionally, some of these tools require the user to have knowledge of a programming language like Python or Java, whereas other platforms offer a Graphical User Interface.”

The list begins with two of the basics—Part-of-Speech (POS) Taggers and Syntactic Parsing. These tasks usually underpin more complex analysis. Concordance or Keyword tools create alphabetical lists of a text’s words and put them into context. Text Annotation Tools, either manual or automated, tag parts of a text according to a designated schema or categorization model, while Entity Recognition Tools often use knowledge graphs to identify people, organizations, and locations. Topic Identification and Modeling Tools derive emerging themes or high-level subjects using text-clustering methods. Sentiment Analysis Tools diagnose positive and negative sentiments, some with more refinement than others. Query Search Tools let users search text for a word or a phrase, while Summarization Tools pick out and present key points from lengthy texts (provided they are well organized). See the article for more on any of these categories.
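
To make one of these categories concrete, here is a minimal sketch of what a concordance or keyword tool does: build an alphabetical word list and show a keyword in its surrounding context. It uses only the Python standard library and is not how Megaputer or any product named in the article implements the task; the function names and the sample sentence are assumptions made for illustration.

```python
import re

def word_index(text):
    """Alphabetical list of the distinct words in the text, lowercased."""
    return sorted({w.lower() for w in re.findall(r"[A-Za-z']+", text)})

def concordance(text, keyword, width=3):
    """Keyword-in-context lines: `width` words on either side of each hit."""
    words = re.findall(r"[A-Za-z']+", text)
    hits = []
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            hits.append(f"{left} [{word}] {right}".strip())
    return hits

sample = "Text analysis tools differ. Some tools tag text; other tools summarize text."
print(word_index(sample))
for line in concordance(sample, "tools", width=2):
    print(line)
```

The regular expression is a crude tokenizer; a real tool would lean on the POS tagging and parsing steps the list opens with.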

The post concludes by noting that most text analysis platforms offer one or two of the above functions, but that users often require more than that. This is where the article shows its PR roots—Megaputer, as it happens, offers just such an all-in-one platform called PolyAnalyst. Still, the write-up is a handy rundown of some different text-analysis tasks.

Based in Bloomington, Indiana, Megaputer launched in 1997. The company grew out of AI research from the Moscow State University and Bauman Technical University. Just a few of their many prominent clients include HP, Johnson & Johnson, American Express, and several US government offices.

Cynthia Murrell, January 02, 2020

Visual Data Exploration via Natural Language

November 4, 2019

New York University announced a natural language interface for data visualization. You can read the rah rah from the university here. The main idea is that a person can use simple English to create complex machine learning based visualizations. Sounds like the answer to a Wall Street analyst’s prayers.

The university reported:

A team at the NYU Tandon School of Engineering’s Visualization and Data Analytics (VIDA) lab, led by Claudio Silva, professor in the department of computer science and engineering, developed a framework called VisFlow, by which those who may not be experts in machine learning can create highly flexible data visualizations from almost any data. Furthermore, the team made it easier and more intuitive to edit these models by developing an extension of VisFlow called FlowSense, which allows users to synthesize data exploration pipelines through a natural language interface.

You can download (as of November 3, 2019, but no promises the document will be online after this date) “FlowSense: A Natural Language Interface for Visual Data Exploration within a Dataflow System.”

DarkCyber wants to point out that talking to a computer to get information continues to be of interest to many researchers. Will this innovation put human analysts out of their jobs?

Maybe not tomorrow but in the future. Absolutely. And what will those newly-unemployed people do for money?

Interesting question and one some may find difficult to consider at this time.

Stephen E Arnold, November 4, 2019


Tools and Tips for Google Analytics Implementations

September 16, 2019

Here is a handy resource to bookmark for anyone with Google Analytics in their future. Hacking Analytics describes “The Complexity of Implementing Google Analytics.” Writer and solution architect/data manager Julien Kervizic explains:

“There is more than just placing a small snippet on a website to implement Google analytics. There are different integration patterns in order to capture the data into Google Analytics, and each integration is subject to a lot of pitfalls and potential regressions needed to guard against. There are also question as to whether or how to use the different APIs provided by GA.”

Kervizic begins by detailing three primary integration patterns: scraping a website, pushing events into a JavaScript data layer, and tapping into structured data. Next are several pitfalls one might run into and ways to counter each. See the write-up for those details.

Of course, your tracking setup is futile if it is not maintained. We learn about automated tests and monitoring tools to help with this step. Last but not least are Google Analytics APIs; Kervizic writes:

“Implementing Google analytics, sometimes requires integrating with Google Analytics APIs, be it for reporting purpose, to push some backend data, or to provide cost or product information. Google Analytics has 3 main APIs for these purposes.”

These are the three main APIs: the reporting API, augmented with the dimensions & metrics explorer for checking different field-naming; the measurement protocol with its hit builder tool for setting up requests; and the management API for automating data imports, managing audiences, and uploading cost info from third-party ad providers.
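
For readers who have not seen the measurement protocol, a hit is just a set of URL-encoded parameters posted to a collection endpoint. The sketch below assumes the Universal Analytics version of the protocol that was current when the article appeared; it builds a pageview payload but deliberately sends nothing. The helper name and the IDs are placeholders, not anything taken from Kervizic's post.

```python
from urllib.parse import urlencode

# Universal Analytics Measurement Protocol endpoint (as documented in 2019).
COLLECT_URL = "https://www.google-analytics.com/collect"

def build_pageview_hit(tracking_id, client_id, page_path):
    """Encode a pageview hit as a form body; nothing is sent over the network."""
    params = {
        "v": "1",            # protocol version
        "tid": tracking_id,  # property ID, e.g. UA-XXXXX-Y
        "cid": client_id,    # anonymous client identifier
        "t": "pageview",     # hit type
        "dp": page_path,     # document path
    }
    return urlencode(params)

# Placeholder IDs, not real accounts.
body = build_pageview_hit("UA-XXXXX-Y", "555", "/pricing")
print(COLLECT_URL)
print(body)
```

The hit builder tool mentioned above does roughly this job interactively: it assembles the parameters and checks them before anything reaches a live property.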

Cynthia Murrell, September 16, 2019

Graph Theory: Moving to the Mainstream

August 21, 2019

Physics helps engineers master their craft and binary is the start of all basic code, but graph theory is the key to understanding data science. Few people understand the power behind data science, but it powers Web sites they visit every day: eBay, Facebook, and the all-powerful Google. Graph theory is part of mathematics and allows data to be presented in a clear, concise manner. Analytics India shares a list of graph theory software that will make any data scientist’s job easier: “Top 10 Graph Theory Software.” The article explains that:

“Apart from knowing graph theory, it is necessary that one is not only able to create graphs but understand and analyze them. Graph theory software makes this job much easier. There are plenty of tools available to assist a detailed analysis. Here we list down the top 10 software for graph theory popular among the tech folks. They are presented in a random order and are available on major operating systems like Windows, MacOS and Linux.”

Among the recommended software are TikZ and PGF, used in scientific research to create vector style graphs. Gephi is free to download and is best used for network visualization and data exploration. NetworkX is a reliable Python library for graphs and networks. LaTeXDraw is for document preparation and typesetting with a graphics editor. It is built on Java. One popular open source tool for mathematics projects is Sage. It is used for outlining graphs and hypergraphs.

MATLAB requires a subscription, but it is an extremely powerful tool for creating graph theory visualizations and has a bioinformatics toolbox packed with more ways to explore graph theory functions. Graphic designers favor Inkscape for its ease of use and ability to create many different diagrams. GraphViz is famous for various graphical options for graph theory and also has customizable options. NodeXL is a Microsoft Excel template that is exclusively used for network graphs. One only has to enter a network edge list and then a graph is generated. Finally, MetaPost is used as a programming language and an interpreter program. It can use macros to make graph theory features.
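
The edge-list workflow is simple enough to sketch in a few lines of plain Python. This illustrates the idea only; it is not the implementation behind the Excel template or NetworkX, and the names and edges are invented.

```python
from collections import defaultdict

def graph_from_edges(edges):
    """Build an undirected adjacency map from (source, target) pairs."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency

# Invented edge list of the kind one might type into a worksheet.
edges = [("Ann", "Bob"), ("Ann", "Cy"), ("Bob", "Cy"), ("Cy", "Dee")]
graph = graph_from_edges(edges)

# Degree (number of neighbors) is the simplest network metric.
degrees = {node: len(neighbors) for node, neighbors in graph.items()}
for node in sorted(degrees, key=degrees.get, reverse=True):
    print(node, degrees[node], sorted(graph[node]))
```

The tools in the list add layout, drawing, and richer metrics on top of this basic structure.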

Most of these graph theory tools are available as free downloads, with upgraded subscription services on offer.

Whitney Grace, August 21, 2019

Hadoop Fail: A Warning Signal in Big Data Fantasy Land?

August 11, 2019

DarkCyber notices when high profile companies talk about data federation, data lakes, and intelligent federation of real time data with historical data. Examples include Amazon and Anduril to name two companies offering this type of data capability.

“What Happened to Hadoop and Where Do We Go from Here?” does not directly discuss the data management systems in Amazon and Anduril, but the points the author highlights may be germane to thinking about what is possible and what remains just out of reach when it comes to processing the rarely defined world of “Big Data.”

The write up focuses on Hadoop, the elephant logo thing. Three issues are identified:

  1. Data provenance was tough to maintain and therefore determine. This is a variation on the GIGO theme (garbage in, garbage out).
  2. Creating a data lake is complicated. With talent shortages, the problem of complexity may hardwire failure.
  3. The big pool of data becomes the focus. That’s okay, but the application to solve the problem is often lost.

Why is a discussion of Hadoop relevant to Amazon and Anduril? The reason is that despite the weaknesses of these systems, both companies are addressing the “Hadoop problem” but in different ways.

These two firms, therefore, may be significant because of their approach and their different angles of attack.

Amazon is providing a platform which, in the hands of a skilled Amazon technologist, can deliver a cohesive data environment. Furthermore, the digital craftsman can build a solution that works. It may be expensive and possibly flakey, but it mostly works.

Anduril, on the other hand, delivers the federation in a box. Anduril is a hardware product, smart software, and applications. License, deploy, and use.

Despite the different angles of attack, both companies are making headway in the data federation, data lake, and real time analytics sector.

The issue is not what will happen to Hadoop; the issue is how quickly competitors will respond to these different ways of dealing with Big Data.

Stephen E Arnold, August 11, 2019

Trovicor: A Slogan as an Equation

August 2, 2019

We spotted this slogan on the Trovicor Web site:

The Trovicor formula: Actionable Intelligence = f (data generation; fusion; analysis; visualization)

The function consists of four buzzwords used by vendors of policeware and intelware:

  • Data generation (which suggests metadata assigned to intercepted, scraped, or provided content objects)
  • Fusion (which means in DarkCyber’s world a single index to disparate data)
  • Analysis (numerical recipes to identify patterns or other interesting data)
  • Virtualization (use of technology to replace old school methods like 1950s’ style physical wire taps, software defined components, and software centric widgets).

The buzzwords make it easy to identify other companies providing somewhat similar services.

Trovicor maintains a low profile. But obtaining open source information about the company may be a helpful activity.

Stephen E Arnold, August 2, 2019

A Partial Look: Data Discovery Service for Anyone

July 18, 2019

F-Secure has made available a Data Discovery Portal. The idea is that a curious person (not anyone on the DarkCyber team but one of our contractors will be beavering away today) can “find out what information you have given to the tech giants over the years.” Pick a social media service — for example, Apple — and this is what you see:


A curious person plugs in the Apple ID information and F-Secure obtains and displays the “data.” If one works through the services for which F-Secure offers this data discovery service, the curious user will have provided some interesting data to F-Secure.

Sound like a good idea? You can try it yourself at this F-Secure link.

F-Secure operates from Finland and was founded in 1988.

Do you trust the Finnish anti virus wizards with your user names and passwords to your social media accounts?

Are the data displayed by F-Secure comprehensive? Filtered? Accurate?

Stephen E Arnold, July 18, 2019

Intel: Chips Like a Brain

July 18, 2019

We noted “Intel Unveils Neuromorphic Computing System That Mimics the Human Brain.” The main idea is that Intel is a chip leader. Forget the security issues with some Intel processors. Forget the fabrication challenges. Forget the supply problem for certain Intel silicon.

Think “neuromorphic computing.”

According to the marketing centric write up:

Intel said the Loihi chips can process information up to 1,000 times faster and 10,000 times more efficiently than traditional central processing units for specialized applications such as sparse coding, graph search and constraint-satisfaction problems.

Buzz, buzz, buzz. That’s the sound of marketing jargon zipping around.

How about this statement, offered without any charts, graphs, or benchmarks?

“With the Loihi chip we’ve been able to demonstrate 109 times lower power consumption running a real-time deep learning benchmark compared to a graphics processing unit, and five times lower power consumption compared to specialized IoT inference hardware,” said Chris Eliasmith, co-chief executive officer of Applied Brain Research Inc., which is one of Intel’s research partners. “Even better, as we scale the network up by 50-times, Loihi maintains real-time performance results and uses only 30% more power, whereas the IoT hardware uses 500% more power and is no longer in real-time.”

Excited? What about the security, fab, and supply chain facets of getting neuromorphic chips to disrupt other vendors eager to support the artificial intelligence revolution? Not in the Silicon Angle write up.

How quickly will an enterprise search vendor embrace “neuromorphic”? Probably more quickly than Intel can deliver seven nanometer nodes.

Stephen E Arnold, July 18, 2019

Need a Machine Learning Algorithm?

July 17, 2019


The R-Bloggers.com Web site published “101 Machine Learning Algorithms for Data Science with Cheat Sheets.” The write up recycles information from DataScienceDojo, and some of the information looks familiar. But lists of algorithms are not original. They are useful. What sets this list apart is the inclusion of “cheat sheets.”

What’s a cheat sheet?

In this particular collection, a cheat sheet looks like this:

[Image: r entry example]

You can see the entry for the algorithm: Bernoulli Naive Bayes with a definition. The “cheat sheet” is a link to a Python example. In this case, the example is a link to an explanation on the Chris Albon blog.
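
For a sense of what sits behind such an entry, here is a bare-bones Bernoulli Naive Bayes classifier in standard-library Python. It is a sketch written for this post, not the linked example from the Chris Albon blog and not any vendor's implementation; the function names, the Laplace smoothing constant, and the tiny spam/ham data set are all assumptions.

```python
import math

def train_bernoulli_nb(rows, labels, alpha=1.0):
    """Fit class priors and per-feature Bernoulli probabilities.

    rows: lists of 0/1 features; alpha: Laplace smoothing constant.
    """
    model = {}
    n_features = len(rows[0])
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        prior = len(members) / len(rows)
        # P(feature j = 1 | class), smoothed so no probability is 0 or 1
        probs = [
            (sum(r[j] for r in members) + alpha) / (len(members) + 2 * alpha)
            for j in range(n_features)
        ]
        model[cls] = (prior, probs)
    return model

def predict(model, row):
    """Return the class with the highest log posterior for one 0/1 row."""
    def log_posterior(cls):
        prior, probs = model[cls]
        score = math.log(prior)
        for x, p in zip(row, probs):
            # Absent features count too: that is the Bernoulli part.
            score += math.log(p if x else 1.0 - p)
        return score
    return max(model, key=log_posterior)

# Invented data: features are [contains "free", contains "meeting"]
rows = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1]]
labels = ["spam", "spam", "ham", "ham", "ham"]
model = train_bernoulli_nb(rows, labels)
print(predict(model, [1, 0]), predict(model, [0, 1]))
```

Even in this toy, the smoothing constant is a preconfigured threshold of exactly the sort that ends up hidden inside a product.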

What’s interesting is that the 101 algorithms are grouped under 18 categories. Of these 18, Bayes and derivative methods total five.

No big deal, but in my lectures about widely used algorithms I highlight 10, mostly because it is a nice round number. The point is that most of the analytics vendors use the same basic algorithms. Variations among products built on these algorithms are significant.

As analytics systems become more modular — that is, like Lego blocks — it seems that the trajectory of development will be to select, preconfigure thresholds, and streamline processes in a black box.

Is this good or bad?

It depends on whether one’s black box is a dominant solution or platform.

Will users know that this almost inevitable narrowing has upsides and downsides?


Stephen E Arnold, July 17, 2019
