Graph Theory: Moving to the Mainstream

August 21, 2019

Physics helps engineers master their craft and binary is the start of all basic code, but graph theory is the key to understanding data science. Few people understand the power behind data science, yet it powers Web sites they visit every day: eBay, Facebook, and the all-powerful Google. Graph theory is a branch of mathematics that allows data to be presented in a clear, concise manner. Analytics India shares a list of graph theory software that will make any data scientist’s job easier: “Top 10 Graph Theory Software.” The article explains that:

“Apart from knowing graph theory, it is necessary that one is not only able to create graphs but understand and analyze them. Graph theory software makes this job much easier. There are plenty of tools available to assist a detailed analysis. Here we list down the top 10 software for graph theory popular among the tech folks. They are presented in a random order and are available on major operating systems like Windows, MacOS and Linux.”

Among the recommended software are TikZ and PGF, used in scientific research to create vector-style graphics. Gephi is free to download and is best used for network visualization and data exploration. NetworkX is a reliable Python library for graphs and networks. LaTeXDraw, built on Java, is for document preparation and typesetting with a graphics editor. One popular open source tool for mathematics projects is Sage, used for outlining graphs and hypergraphs.
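The tools differ mainly in how much of this work they do for you. NetworkX, for instance, reduces a shortest-path query to a couple of calls (`nx.Graph`, `nx.shortest_path`). As a dependency-free sketch of what such a call does under the hood, here is breadth-first shortest path over a plain edge list (the edges are made-up sample data):

```python
from collections import deque

def shortest_path(edges, start, goal):
    """Breadth-first search over an undirected edge list --
    the operation NetworkX exposes as nx.shortest_path."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adj.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no route between start and goal

edges = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "c"), ("c", "e")]
print(shortest_path(edges, "a", "e"))  # a four-node route from "a" to "e"
```

The graph libraries listed above add layout, rendering, and analysis on top of exactly this kind of adjacency structure.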

MATLAB requires a subscription, but it is an extremely powerful tool for creating graph theory visualizations and has a bioinformatics toolbox packed with more ways to explore graph theory functions. Graphic designers favor Inkscape for its ease of use and ability to create many different diagrams. GraphViz is famous for its various graphical options for graph theory and also has customizable options. NodeXL is a Microsoft Excel template used exclusively for network graphs: one only has to enter a network edge list and a graph is generated. Finally, MetaPost is both a programming language and an interpreter program; it can use macros to build graph theory features.

Most of these graph theory tools are available as free downloads with upgraded subscription services.

Whitney Grace, August 21, 2019

Hadoop Fail: A Warning Signal in Big Data Fantasy Land?

August 11, 2019

DarkCyber notices when high profile companies talk about data federation, data lakes, and intelligent federation of real time data with historical data. Examples include Amazon and Anduril to name two companies offering this type of data capability.

“What Happened to Hadoop and Where Do We Go from Here?” does not directly discuss the data management systems in Amazon and Anduril, but the points the author highlights may be germane to thinking about what is possible and what remains just out of reach when it comes to processing the rarely defined world of “Big Data.”

The write up focuses on Hadoop, the elephant logo thing. Three issues are identified:

  1. Data provenance was tough to maintain and therefore determine. This is a variation on the GIGO theme (garbage in, garbage out).
  2. Creating a data lake is complicated. With talent shortages, the problem of complexity may hardwire failure.
  3. The big pool of data becomes the focus. That’s okay, but the application to solve the problem is often lost.

Why is a discussion of Hadoop relevant to Amazon and Anduril? The reason is that, despite the weaknesses of Hadoop-style systems, both companies are addressing the “Hadoop problem,” each in a different way.

These two firms, therefore, may be significant because of their approaches and their different angles of attack.

Amazon is providing a platform which, in the hands of a skilled Amazon technologist, can deliver a cohesive data environment. Furthermore, the digital craftsman can build a solution that works. It may be expensive and possibly flaky, but it mostly works.

Anduril, on the other hand, delivers federation in a box: a hardware product, smart software, and applications. License, deploy, and use.

Despite the different angles of attack, both companies are making headway in the data federation, data lake, and real time analytics sector.

The issue is not what will happen to Hadoop; it is how quickly competitors will respond to these different ways of dealing with Big Data.

Stephen E Arnold, August 11, 2019

Trovicor: A Slogan as an Equation

August 2, 2019

We spotted this slogan on the Trovicor Web site:

The Trovicor formula: Actionable Intelligence = f (data generation; fusion; analysis; visualization)

The function consists of four buzzwords used by vendors of policeware and intelware:

  • Data generation (which suggests metadata assigned to intercepted, scraped, or provided content objects)
  • Fusion (which means in DarkCyber’s world a single index to disparate data)
  • Analysis (numerical recipes to identify patterns or other interesting data)
  • Virtualization (use of technology to replace old school methods like 1950s’ style physical wire taps, software defined components, and software centric widgets).
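Read charitably, the slogan describes a pipeline in which each stage feeds the next. Here is a toy sketch of that composition; every stage below is a made-up stand-in, not Trovicor's actual method:

```python
def generate(sources):
    # data generation: attach metadata to each raw item (hypothetical stage)
    return [{"source": s, "text": t} for s, t in sources]

def fuse(items):
    # fusion: one index over disparate records, keyed by source
    index = {}
    for item in items:
        index.setdefault(item["source"], []).append(item["text"])
    return index

def analyze(index):
    # analysis: a numerical recipe; here, simply count items per source
    return {src: len(texts) for src, texts in index.items()}

def visualize(counts):
    # visualization: render the analysis as a crude text bar chart
    return "\n".join(f"{src}: {'#' * n}" for src, n in sorted(counts.items()))

def actionable_intelligence(sources):
    # the slogan's f(generation; fusion; analysis; visualization) as composition
    return visualize(analyze(fuse(generate(sources))))

report = actionable_intelligence(
    [("intercepts", "msg1"), ("intercepts", "msg2"), ("scraped", "page")]
)
print(report)
```

The point of the sketch is only that the slogan is a function composition; the value of any real system lies entirely in what each stage actually does.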

The buzzwords make it easy to identify other companies providing somewhat similar services.

Trovicor maintains a low profile. But obtaining open source information about the company may be a helpful activity.

Stephen E Arnold, August 2, 2019

A Partial Look: Data Discovery Service for Anyone

July 18, 2019

F-Secure has made available a Data Discovery Portal. The idea is that a curious person (not anyone on the DarkCyber team but one of our contractors will be beavering away today) can “find out what information you have given to the tech giants over the years.” Pick a social media service — for example, Apple — and this is what you see:

[screenshot: the F-Secure Data Discovery Portal]

A curious person plugs in the Apple ID information and F-Secure obtains and displays the “data.” If one works through the services for which F-Secure offers this data discovery service, the curious user will have provided some interesting data to F-Secure.

Sound like a good idea? You can try it yourself at this F-Secure link.

F-Secure operates from Finland and was founded in 1988.

Do you trust the Finnish antivirus wizards with your user names and passwords to your social media accounts?

Are the data displayed by F-Secure comprehensive? Filtered? Accurate?

Stephen E Arnold, July 18, 2019

Intel: Chips Like a Brain

July 18, 2019

We noted “Intel Unveils Neuromorphic Computing System That Mimics the Human Brain.” The main idea is that Intel is a chip leader. Forget the security issues with some Intel processors. Forget the fabrication challenges. Forget the supply problem for certain Intel silicon.

Think “neuromorphic computing.”

According to the marketing centric write up:

Intel said the Loihi chips can process information up to 1,000 times faster and 10,000 times more efficiently than traditional central processing units for specialized applications such as sparse coding, graph search and constraint-satisfaction problems.

Buzz, buzz, buzz. That’s the sound of marketing jargon zipping around.

How about this statement, offered without any charts, graphs, or benchmarks?

“With the Loihi chip we’ve been able to demonstrate 109 times lower power consumption running a real-time deep learning benchmark compared to a graphics processing unit, and five times lower power consumption compared to specialized IoT inference hardware,” said Chris Eliasmith, co-chief executive officer of Applied Brain Research Inc., which is one of Intel’s research partners. “Even better, as we scale the network up by 50-times, Loihi maintains real-time performance results and uses only 30% more power, whereas the IoT hardware uses 500% more power and is no longer in real-time.”

Excited? What about the security, fab, and supply chain facets of neuromorphic chips disrupting other vendors eager to support the artificial intelligence revolution? Not in the Silicon Angle write up.

How quickly will an enterprise search vendor embrace “neuromorphic”? Probably more quickly than Intel can deliver seven nanometer nodes.

Stephen E Arnold, July 18, 2019

Need a Machine Learning Algorithm?

July 17, 2019


The R-Bloggers.com Web site published “101 Machine Learning Algorithms for Data Science with Cheat Sheets.” The write up recycles information from DataScienceDojo, and some of the information looks familiar. But lists of algorithms are not original. They are useful. What sets this list apart is the inclusion of “cheat sheets.”

What’s a cheat sheet?

In this particular collection, a cheat sheet looks like this:

[image: a sample cheat sheet entry]

You can see the entry for the algorithm Bernoulli Naive Bayes with a definition. The “cheat sheet” is a link to a Python example; in this case, the example is a link to an explanation on the Chris Albon blog.
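For readers who prefer code to cheat sheets, the algorithm itself is small. The following is a from-scratch sketch of Bernoulli Naive Bayes with Laplace smoothing (scikit-learn ships a production version as `sklearn.naive_bayes.BernoulliNB`); the tiny spam/ham data set is invented for illustration:

```python
import math

class BernoulliNB:
    """Minimal Bernoulli Naive Bayes with Laplace smoothing."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        n_features = len(X[0])
        count = {c: 0 for c in self.classes}
        feat = {c: [0] * n_features for c in self.classes}
        for xi, yi in zip(X, y):
            count[yi] += 1
            for j, v in enumerate(xi):
                feat[yi][j] += v
        total = len(y)
        self.log_prior = {c: math.log(count[c] / total) for c in self.classes}
        # smoothed P(feature j = 1 | class c)
        self.p = {
            c: [(feat[c][j] + 1) / (count[c] + 2) for j in range(n_features)]
            for c in self.classes
        }
        return self

    def predict(self, x):
        def score(c):
            # log prior plus log likelihood of each binary feature
            s = self.log_prior[c]
            for j, v in enumerate(x):
                p = self.p[c][j]
                s += math.log(p if v else 1 - p)
            return s
        return max(self.classes, key=score)

# toy binary word-presence vectors: [contains "free", contains "meeting"]
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = ["spam", "spam", "ham", "ham"]
clf = BernoulliNB().fit(X, y)
print(clf.predict([1, 0]))  # spam
```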

What’s interesting is that the 101 algorithms are grouped under 18 categories. Of these 18, Bayes and derivative methods total five.

No big deal, but in my lectures about widely used algorithms I highlight 10, mostly because it is a nice round number. The point is that most of the analytics vendors use the same basic algorithms. Variations among products built on these algorithms are significant.

As analytics systems become more modular — that is, like Lego blocks — it seems that the trajectory of development will be to select, preconfigure thresholds, and streamline processes in a black box.

Is this good or bad?

It depends on whether one’s black box is a dominant solution or platform.

Will users know that this almost inevitable narrowing has upsides and downsides?

Nope.

Stephen E Arnold, July 17, 2019

Exclusive: DataWalk Explained by Chris Westphal

July 9, 2019

“An Interview with Chris Westphal” provides an in-depth review of a company now disrupting the analytic and investigative software landscape.

DataWalk is a company shaped by a patented method for making sense of different types of data. The technique is novel and makes it possible for analysts to extract high value insights from large flows of data in near real time with an unprecedented ease of use.

In late June 2019, DarkCyber interviewed Chris Westphal, the innovator who co-founded Visual Analytics. That company’s combination of analytics methods and visualizations was acquired by Raytheon in 2013. Now Westphal is applying his talents to a new venture, DataWalk.

Westphal, who monitors advanced analytics, learned about DataWalk and joined the firm in 2017 as the Chief Analytics Officer. The company has grown rapidly and now has client relationships with corporations, governments, and ministries throughout the world. Applications of the DataWalk technology include investigators focused on fraud, corruption, and serious crimes.

Unlike most investigative and analytics systems, DataWalk lets users obtain actionable outputs by pointing and clicking. The system captures these clicks on a ribbon. The actions on the ribbon can be modified, replayed, and shared.

In an exclusive interview with Mr. Westphal, DarkCyber learned:

The [DataWalk] system gets “smarter” by encoding the analytical workflows used to query the data; it stores the steps, values, and filters to produce results thereby delivering more consistency and reliability while minimizing the training time for new users. These workflows (aka “easy buttons”) represent domain or mission-specific knowledge acquired directly from the client’s operations and derived from their own data; a perfect trifecta!

One of the differentiating features of DataWalk’s platform is that it squarely addresses the shortage of trained analysts and investigators in many organizations. Westphal pointed out:

…The workflow idea is one of the ingredients in the DataWalk secret sauce. Not only do these workflows capture the domain expertise of the users and offer management insights and metrics into their operations such as utilization, performance, and throughput, they also form the basis for scoring any entity in the system. DataWalk allows users to create risk scores for any combination of workflows, each with a user-defined weight, to produce an overall, aggregated score for every entity. Want to find the most suspicious person? Easy, just select the person with the highest risk-score and review which workflows were activated. Simple. Adaptable. Efficient.
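The scoring scheme Westphal describes is, at its core, a weighted sum over activated workflows. A minimal sketch, with hypothetical workflow names and weights (DataWalk's actual scoring is certainly richer than this):

```python
def risk_score(activated, weights):
    # aggregate score: sum of user-defined weights for activated workflows
    return sum(weights[w] for w in activated if w in weights)

# hypothetical workflows and weights -- not DataWalk's actual configuration
weights = {"structuring": 5.0, "shared_address": 2.0, "watchlist_match": 8.0}
entities = {
    "entity_a": {"shared_address"},
    "entity_b": {"structuring", "watchlist_match"},
}
scores = {name: risk_score(hits, weights) for name, hits in entities.items()}
most_suspicious = max(scores, key=scores.get)
print(most_suspicious, scores[most_suspicious])  # entity_b 13.0
```

"Select the person with the highest risk-score" then reduces to a `max` over the aggregated scores, with the activated workflows explaining why the entity ranked where it did.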

Another problem some investigative and analytic system developers face is user criticism. According to Westphal, DataWalk takes a different approach:

We listen carefully to our end-user community. We actively solicit their feedback and we prioritize their inputs. We try to solve problems versus selling licenses… DataWalk is focused on interfacing to a wide range of data providers and other technology companies. We want to create a seamless user experience that maximizes the utility of the system in the context of our client’s operational environments.

For more information about DataWalk, navigate to www.datawalk.com. For the full text of the interview, click this link. You can view a short video summary of DataWalk in the July 2, 2019, DarkCyber Video available on Vimeo.

Stephen E Arnold, July 9, 2019

Data: Such a Flexible and Handy Marketing Tool

July 5, 2019

Love Big Data? Like New Age research? Enjoy studies funded by commercial enterprises? If you are nodding in agreement, head on over to “Evidence-Based Medicine Has Been Hijacked: A Confession from John Ioannidis.”

Here’s a statement to ponder:

Since clinical research that can generate useful clinical evidence has fallen off the radar screen of many/most public funders, it is largely left up to the industry to support it.  The sales and marketing departments in most companies are more powerful than their R&D departments. Hence, the design, conduct, reporting, and dissemination of this clinical evidence becomes an advertisement tool. As for “basic” research, as I explain in the paper, the current system favors PIs who make a primary focus of their career how to absorb more money. Success in obtaining (more) funding in a fiercely competitive world is what counts the most. Given that much “basic” research is justifiably unpredictable in terms of its yield, we are encouraging aggressive gamblers. Unfortunately, it is not gambling for getting major, high-risk discoveries (which would have been nice), it is gambling for simply getting more money.

Does this observation apply to the world of Big Data, online advertising, and the spreadsheet fever plaguing MBAs? Yep.

  1. People believe numbers and most do not ask, “Where did this number come from? What was the sample? How did you verify these data?”
  2. Outputs can be shaped. Check out your college class notes for Statistics 101; that is, I am assuming you kept your college notes. See anything about best practices? Validity tests?
  3. What about those thresholds? Many Bayesian methods are based upon guesses. Toss in some Monte Carlo? How representative are the outputs? What are the deltas between the current outputs and other available data?
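The Monte Carlo point is easy to demonstrate: the same method yields very different spreads depending on sample size, which is exactly the kind of knob that can shape an output. A small, self-contained example estimating pi (the seed and sample sizes are arbitrary choices):

```python
import random

def monte_carlo_pi(n, seed=0):
    # estimate pi from n random points in the unit square;
    # the spread of the estimate shrinks roughly as 1/sqrt(n)
    rng = random.Random(seed)
    inside = sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0 for _ in range(n))
    return 4 * inside / n

for n in (100, 10_000, 1_000_000):
    print(n, monte_carlo_pi(n))
```

At n = 100 the estimate wanders noticeably; at a million samples it settles near 3.14. A report that quotes only the small-sample run tells a different story from one that quotes the large one.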

Our next Factualities will appear in this blog on Wednesday. There are some special numbers in that round up.

A friend of mine who owns a successful online business said, “Nobody cares.”

Nobody cares?

Stephen E Arnold, July 5, 2019

Knowledge Graphs: Getting Hot

July 4, 2019

Artificial intelligence, semantics, and machine learning may lose their pride of place in the techno-jargon whiz bang marketing world. I read “A Common Sense View of Knowledge Graphs,” and noted this graph:

[graph: citation frequency of the phrase “knowledge graph” over time]

This is a good, old-fashioned, Eugene Garfield (remember him, gentle reader) citation analysis. The idea is that one can “see” how frequently an author or, in this case, a concept has been cited in the “literature.” Now publishers are dropping like flies and publishing bunk. Nevertheless, one can see that using the phrase knowledge graph is getting popular within the sample of “literature” parsed for this graph. (No, I don’t recommend trying to perform citation analysis in Bing, Facebook, or Google. The reasons will just depress me and you, gentle reader.)
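The mechanics behind such a graph are straightforward: count occurrences of the phrase per publication year across a corpus. A minimal sketch, using invented records rather than the article's data:

```python
from collections import Counter

def phrase_frequency(records, phrase):
    # tally occurrences of a phrase per publication year
    counts = Counter()
    for year, text in records:
        counts[year] += text.lower().count(phrase.lower())
    return dict(sorted(counts.items()))

# invented records, standing in for a parsed corpus
records = [
    (2017, "We propose a knowledge graph for entities."),
    (2018, "Knowledge graph embeddings help the knowledge graph grow."),
    (2019, "Surveys of knowledge graph methods abound."),
]
print(phrase_frequency(records, "knowledge graph"))  # {2017: 1, 2018: 2, 2019: 1}
```

Real citation analysis adds normalization (corpus size per year, deduplication, citation links rather than raw phrase hits), but the tally is the core.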

The section of the write up I found useful and worthy of my “research” file is the collection of references to documents defining “knowledge graph.” This is helpful research.

The write up also includes a diagram which may be one of the first representations of a graph centric triple. I thought this was something cooked up by Drs. Bray, Guha, and others in the tsunami of semantic excitement.

One final point: The list of endnotes is also useful. In short, good write up. The downside is that if the article gets wider distribution, a feeding frenzy among money desperate consultants, advisers, and analysts will be ignited like a Fourth of July fountain of flame.

Stephen E Arnold, July 4, 2019

15 Reasons You Need Business Intelligence Software

May 21, 2019

I read StrategyDriven’s “The Importance of Business Intelligence Software and Why It’s Integral for Business Success.” I found the laundry list interesting, but I asked myself, “If BI software is so important, why is it necessary to provide 15 reasons?”

I went through the list of items a couple of times. Some of the reasons struck me as a bit of a stretch. I had a teacher at the University of Illinois who loved the phrase “a bit of a stretch, right?” when a graduate student proposed a wild and crazy hypothesis or drew a nutsy conclusion from data.

Let’s look at four of these reasons and see if there’s merit to my skepticism about delivering fish to a busy manager when the person wanted a fish sandwich.

Reason 1: Better business decisions. Really? If a BI system outputs data to a clueless person or uses flawed, incomplete, or stale data to present an output to a bright person, are better business decisions an outcome? In my experience, nope.

Reason 6. Accurate decision making. What the human does with the outputs is likely to result in a decision. That’s true. But accurate? Too many variables exist to create a one to one correlation with the assertion and what happens in a decider’s head or among a group of deciders who get together to figure out what to do. Example: Google has data. Google decided to pay a person accused of improper behavior millions of dollars. Accurate decision making? I suppose it depends on one’s point of view.

Reason 11. Reduced cost. I am confident when I say, “Most companies do not calculate or have the ability to assemble the information needed to produce fully loaded costs.” Consequently, the cost of a BI system is not the license fee. There are the associated directs and indirects. And when a decision from the BI system is wrong, there are some other costs as well. How are Facebook’s eDiscovery systems generating a payback today? Facebook has data, but the costs of its eDiscovery systems are not known, nor does anyone care as the legal hassles continue to flood the company’s executive suite.

Reason 13. High quality data. Whoa, hold your horses. The data cost is an issue in virtually every company with which I have experience. No one wants to invest to make certain that the information is complete, accurate, up to date, and maintained (indexed accurately and put in a consistent format). This is a pretty crazy assertion about BI when there is no guarantee that the data fed into the system is representative, comprehensive, accurate, and fresh.

Business intelligence is a tool. Use of a BI system does not generate guaranteed outcomes.

Stephen E Arnold, May 21, 2019
