Now Big Data Has to Be Fast

May 15, 2016

I read “Big Data Is No Longer Enough: It’s Now All about Fast Data.” The write up is interesting because it shifts the focus from having lots of information to infrastructure which can process the data in a timely manner. Note that “timely” means different things in different contexts. For example, to a crazed MBA stock market maven, next week is not too useful. To a clueless marketing professional with a degree in art history, “next week” might be just speedy enough.

The write up points out:

Processing data at these breakneck speeds requires two technologies: a system that can handle developments as quickly as they appear and a data warehouse capable of working through each item once it arrives. These velocity-oriented databases can support real-time analytics and complex decision-making in real time, while processing a relentless incoming data feed.

The point omitted from the article is that speed comes at a cost: the humans required to figure out what’s needed to go fast, the engineers to build the system, and the time required to complete the task. The “cloud” is not a solution to the cost.

Another omission in the article is that the numerical recipes required to “make sense” of large volumes of data require specialist knowledge. A system which outputs nifty charts may be of zero utility when it comes to making a decision.

The write up ignores the information in “What Beats Big Data? Small Data.” Some organizations cannot afford the cost of fast data. Even outfits which have the money can find themselves tripping over their analyses. See, for example, “Amazon Isn’t Racist, It’s Just Been an Unfortunate Victim of Big Data.” Understanding the information is important. Smart software often lacks the ability to discern nuances or issues with data quality, poor algorithm selection, or knowing what to look for in the first place.

Will the write up cause marketers and baloney makers to alter their pitches about Big Data and smart software? Not a chance. Vendors’ end game is revenue; licensees have a different agenda. When the two do not meet, there may be some excitement.

Stephen E Arnold, May 15, 2016

Deep Learning: Old Wine, New Labels

May 13, 2016

I read “Deep Learning: Definition, Resources, and Comparison with Machine Learning.” The most useful segment of the article to me is the list of resources. I did highlight this statement and its links:

Many deep learning algorithms (clustering, pattern recognition, automated bidding, recommendation engine, and so on)  — even though they appear in new contexts such as IoT or machine to machine communication — still rely on relatively old-fashioned techniques such as logistic regression, SVM, decision trees, K-NN, naive Bayes, Bayesian modeling, ensembles, random forests, signal processing, filtering, graph theory, gaming theory, and many others. Click here and here for details about the top 10 algorithms.

The point is that folks are getting interested in established methods hooked together in interesting ways. Perhaps new methods will find their way into the high flying vehicles for smart software? But wait. Are computational barriers acting like a venturi in the innovation flow? What about that vacuum?

Stephen E Arnold, May 13, 2016

DARPA Seeks Keys to Peace with High-Tech Social Science Research

May 11, 2016

Strife has plagued the human race since the beginning, but the Pentagon’s research arm thinks it may be able to get to the root of the problem. Defense Systems informs us, “DARPA Looks to Tap Social Media, Big Data to Probe the Causes of Social Unrest.” Writer George Leopold explains:

“The Defense Advanced Research Projects Agency (DARPA) announced this week it is launching a social science research effort designed to probe what unifies individuals and what causes communities to break down into ‘a chaotic mix of disconnected individuals.’ The Next Generation Social Science (NGS2) program will seek to harness steadily advancing digital connections and emerging social and data science tools to identify ‘the primary drivers of social cooperation, instability and resilience.’

“Adam Russell, DARPA’s NGS2 program manager, said the effort also would address current research limitations such as the technical and logistical hurdles faced when studying large populations and ever-larger datasets. The project seeks to build on the ability to link thousands of diverse volunteers online in order to tackle social science problems with implications for U.S. national and economic security.”

The initiative aims to blend social science research with the hard sciences, including computer and data science. Virtual reality, Web-based gaming, and other large platforms will come into play. Researchers hope their findings will make it easier to study large and diverse populations. Funds from NGS2 will be used for the project, with emphases on predictive modeling, experimental structures, and boosting interpretation and reproducibility of results.

Will it be the Pentagon that finally finds the secret to world peace?


Cynthia Murrell, May 11, 2016

Sponsored by, publisher of the CyberOSINT monograph


Expert System: Inspired by Endeca

April 23, 2016

Years ago I listened to Endeca (now owned by Oracle) extol the virtues of its various tools. The idea was that the tools made it somewhat easier to get Endeca up and running. The original patents for Endeca reveal the computational blender which the Endeca method required. Endeca shifted from licensing software to bundling consulting with a software license. Setting up Endeca required MBAs, patience, and money. Endeca rose to generate more than $120 million in revenues before its sale to Oracle. Today Endeca is still available, and the Endeca patents—particularly 7035864—reveal how Endeca pulled off its facets. Today Endeca has lost a bit of its spit and polish, a process that began when Autonomy blasted past the firm in the early 2000s.

Endeca rolled out its “studio” a decade ago. I recall that Business Objects had a “studio.” The idea behind a studio was to simplify the complex task of creating something an end user could use without much training. But the studio was not aimed at an end user. The studio was a product for a developer, who found the tortuous, proprietary methods complex and difficult to learn. A studio would unleash the developers and, of course, propel the vendors with studios to new revenue heights.

Studio is back, at least if the information in “Expert System Releases Cogito Studio for Combining the Advantages of Semantic Technology with Deep Learning” is accurate. The spin is that semantic technology and deep learning, two buzzwords near and dear to the heart of those in search of the next big thing, will be a boon. Who is the intended user? Well, developers. These folks are learning that the marketing talk is a heck of a lot easier than the quite difficult work of designing, coding, debugging, stabilizing, and then generating useful outputs.

According to the Expert System announcement:

“The new release of Cogito Studio is the result of the hard work and dedication of our labs, which are focused on developing products that are both powerful and easy to use,” said Marco Varone, President and CTO, Expert System. “We believe that we can make significant contributions to the field of artificial intelligence. In our vision of AI, typical deep learning algorithms for automatic learning and knowledge extraction can be made more effective when combined with algorithms based on a comprehension of text and on knowledge structured in a manner similar to that of humans.”

Does this strike you as vague?

Expert System is an Italian, high tech outfit, which was founded in 1989. That’s almost a decade before the Endeca system poked its moist nose into the world of search. Fellow travelers from this era include Fulcrum Technologies and ISYS Search Software. Both of these companies’ technologies are still available today.

Thus, it makes sense that the idea of a “studio” becomes a way to chop away at the complexity of Expert System-type systems.

According to Google Finance, Expert System’s stock is trending upwards.

That’s a good sign. My hunch is that announcements about “studios” wrapped in lingo like semantics and Big Data are a good thing.

Stephen E Arnold, April 23, 2016

Data Intake: Still a Hassle

April 21, 2016

I read “Big Data’s Biggest Problem: It’s Too Hard to Get the Data In.” Here’s a quote I noted:

According to a study by data integration specialist Xplenty, a third of business intelligence professionals spend 50% to 90% of their time cleaning up raw data and preparing to input it into the company’s data platforms. That probably has a lot to do with why only 28% of companies think they are generating strategic value from their data.

My hunch is that with the exciting hyperbole about Big Data, the problem of normalizing, cleaning, and importing data is ignored. The challenge of taking file A in a particular file format and converting to another file type is indeed a hassle. A number of companies offer expensive filters to perform this task. The one I remember is Outside In, which sort of worked. I recall that when odd ball characters appeared in the file, there would be some issues. (Does anyone remember XyWrite?) Stellent purchased Outside In in order to move content into that firm’s content management system. Oracle purchased Stellent in 2006. Then Kapow “popped” on the scene. The firm promoted lots of functionality, but I remember it as a vendor who offered software which could take a file in one format and convert it into another format. Kofax (yep, the scanner oriented outfit) bought Kapow to move content from one format into one that Kofax systems could process. Then Lexmark bought Kofax and ended up with Kapow. With that deal, Palantir and other users of the Kapow technology probably had a nervous moment or are now having a nervous moment as Lexmark marches toward a new owner. Entropy, a French outfit, was a file conversion outfit. It sold out to Salesforce. Once again, converting files from Type A to another desired format seems to have been the motivating factor.

Let us not forget the wonderful file conversion tools baked into software. I can save a Word file as an RTF file. I can import a comma separated file into Excel. I can even fire up Framemaker and save a Dot fm file as RTF. In fact, many programs offer these import and export options. The idea is to lessen the pain of having a file in one format which another system cannot handle. Hey, for fun, try opening a macro filled XyWrite file in Framemaker or Indesign. Just change the file extension to one the system thinks it recognizes. This is indeed entertaining.

The write up is not interested in the companies which have sold for big bucks because their technology could make file conversion a walk in the Hounz Lane Park. (Watch out for the rats, gentle reader.) The write up points out three developments which will make the file intake issues go away:

  1. The software performing file conversion “gets better.” Okay, I have been waiting for decades for this happy time to arrive. No joy at the moment.
  2. “Data preparers become the paralegals of data science.” Now that’s a special idea. I am not clear on what a “data preparer” is, but it sounds like a task that will be outsourced pretty quickly to some country far from the home of NASCAR.
  3. “Artificial intelligence” will help cleanse data. Excuse me, but smart software has been operative in file conversion methods for quite a while. In my experience, the exception files keep on piling up.

What is the problem with file conversion? I don’t want to convert this free blog post into a lengthy explanation. I can highlight five issues which have plagued me and my work in file conversion for many years:

First, file types change over time. Some of the changes are not announced. Others like the Microsoft Word XML thing are the subject of months long marketing. The problem is that unless the outfit responsible for the file conversion system creates a fix, the exception files can overrun a system’s capacity to keep track of problems. If someone is asleep at the switch, data in the exception folder can have an adverse impact on some production systems. Loss of data is interesting but trashing the file structure is a carnival. Who does not pay attention? In my experience, vendors, licensees, third parties, and probably most of the people responsible for a routine file conversion task.

Second, the thrill of XML is that it is not particularly consistent. Somewhere along the line, creativity takes precedence over well formed. How does one deal with a couple hundred thousand XML files in an exception folder? What do you think about deleting them?
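For those who would rather not delete the exception folder, here is a minimal sketch, assuming Python and its standard library XML parser, of sorting a folder of XML files into well formed and broken piles. The function name triage_xml and the folder layout are hypothetical, and a real conversion system would need far more than this.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def triage_xml(folder):
    """Split the *.xml files in `folder` into well formed and broken lists.

    Returns (well_formed, broken); broken holds (filename, reason) pairs.
    The function name and return shape are illustrative assumptions.
    """
    well_formed, broken = [], []
    for path in sorted(Path(folder).glob("*.xml")):
        try:
            ET.parse(path)  # raises ParseError when the file is not well formed
            well_formed.append(path.name)
        except ET.ParseError as err:
            # Keep the parser's message so a human can see what went wrong.
            broken.append((path.name, str(err)))
    return well_formed, broken
```

The check covers well-formedness only. A file can parse cleanly and still be the wrong flavor of XML for the system downstream, so the broken list is a starting point for a human, not a verdict.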

Third, the file conversion software works as long as the person creating a document does not use Fancy Dan “inserts” in the source document. Problems arise from videos, certain links, macros, and odd ball formatting of the source document. Yep, some folks create text in Excel and wonder why the resulting text is a bit of a mess.

Fourth, workflows get screwed up. A file conversion system is semi smart. If a process creates a file with an unrecognized extension, the file conversion system fills the exception folder. But what if one valid extension is changed to a supported but incorrect extension? Yep, XML users be aware that there are proprietary XML formats. The files converted and made available to a system are “sort of right.” Unfortunately sort of right in mission critical applications can have some interesting consequences.
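One partial defense against the wrong extension problem is to look at the first few bytes of a file instead of trusting its name. What follows is a rough sketch in Python, not a description of how any vendor's product works. The SIGNATURES table is a hypothetical, deliberately tiny sample, and the XML entry assumes the file opens with an XML declaration and no byte order mark.

```python
from pathlib import Path

# Hypothetical signature table: leading bytes -> extensions we would accept.
SIGNATURES = {
    b"%PDF-": {".pdf"},
    b"PK\x03\x04": {".docx", ".xlsx", ".pptx", ".zip"},  # ZIP based containers
    b"{\\rtf": {".rtf"},
    b"<?xml": {".xml"},  # assumes an XML declaration and no byte order mark
}


def extension_matches_content(path):
    """Return True or False for a known signature, None when unrecognized."""
    path = Path(path)
    with path.open("rb") as handle:
        head = handle.read(8)  # the first few bytes are enough for these checks
    for magic, extensions in SIGNATURES.items():
        if head.startswith(magic):
            return path.suffix.lower() in extensions
    return None  # no known signature: leave the decision to a human
```

A check like this catches only gross mismatches, such as an RTF file renamed to .xml. It does nothing for a proprietary XML format wearing a perfectly valid .xml extension; that case needs a look at the root element or a schema, which is where the “sort of right” trouble starts.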

Fifth, attention to detail is often less popular than fiddling with one’s mobile phone or reading Facebook posts. Human inattention can make large scale data conversion fail. I have watched as a person of my acquaintance deleted the folder of exception files. Yo, it is time for lunch.

So what? Smart software makes certain assumptions. At this time, file intake is perceived as a problem which has been solved. My view is that file intake is a core function which needs a little bit more attention. I do not need to be told that smart software will make file intake pain go away.

Stephen E Arnold, April 21, 2016

Tips on How to Make the Most of Big Data (While Spending the Least)

April 13, 2016

The article titled “The 10 Commandments of Business Intelligence in Big Data” on Datanami offers wisdom written on USB sticks instead of stone tablets. In the Business Intelligence arena, apparently moral guidance can take a backseat to Big Data cost-savings. Suggestions include: Don’t move Big Data unless you must, try to leverage your existing security system, and engage in extensive data visualization sharing (think Github). The article explains the importance of avoiding certain price-gouging traps:

“When done right, [Big Data] can be extremely cost effective… That said…some BI applications charge users by the gigabyte… It’s totally common to have geometric, exponential, logarithmic growth in data and in adoption with big data. Our customers have seen deployments grow from tens of billions of entries to hundreds of billions in a matter of months. That’s another beauty of big data systems: Incremental scalability. Make sure you don’t get lowballed into a BI tool that penalizes your upside.”

The Fifth Commandment reminds us all that analyzing the data in its natural, messy form is far better than flattening it into tables due to the risk of losing key relationships. The Ninth and Tenth Commandments step back and look at the big picture of data analytics in 2016. What was only a buzzword to most people just five years ago is now a key aspect of strategy for any number of organizations. This article reminds us that thanks to data visualization, Big Data isn’t just for data scientists anymore. Employees across departments can make use of data to make decisions, but only if they are empowered to do so.


Chelsea Kerwin, April 13, 2016

IBM: Back to Its Roots with Zest, Actually Spark

April 6, 2016

I read “IBM Launches Mainframe Platform for Spark.” This is an announcement which makes sense to me. The Watson baloney annoys; the mainframe news thrills.

According to the write up:

IBM is expanding its embrace of Apache Spark with the release of a mainframe platform that would allow the emerging open-source analytics framework to run natively on the company’s mainframe operating system.

I noted this passage as well:

The IBM platform also seeks to leverage Spark’s in-memory processing approach to crunching data. Hence, the z Systems platform includes data abstraction and integration services so that z/OS analytics applications can leverage standard Spark APIs. That approach eliminates processing and security issues associated with ETL while allowing organizations to analyze data in-place.

Hopefully IBM will play to its strengths, not chase rainbows.

Stephen E Arnold, April 6, 2016

Big Data and Its Fry Cooks Who Clean the Grill

April 1, 2016

I read “Cleaning Big Data: Most Time Consuming, Least Enjoyable Data Science Task, Survey Says.” A survey?

According to the capitalist tool:

A new survey of data scientists found that they spend most of their time massaging rather than mining or modeling data.

The point is that few wizards want to come to grips with the problem of figuring out what’s wrong with data in a set or a stream and then getting the data into a form that can be used with reasonable confidence.

Those exception folders, annoying, aren’t they?

The write up points out that a data scientist spends 80 percent of his or her time doing housecleaning. Skip the job and the house becomes unpleasant indeed.

The survey also reveals that data scientists have to organize the data to be analyzed. Imagine that. The baloney about automatically sucking in a wide range of data does not match the reality of the survey sample.

Another grim bit of drudgery emerging from the sample, which we assume was conducted with the appropriate textbook procedures, was that the skills most in demand were for SQL. Yep, old school.
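For readers who have not had the pleasure, old school housecleaning in SQL looks something like the sketch below. It uses Python's built-in sqlite3 module. The raw_customers table and its rows are invented for illustration and do not come from the survey.

```python
import sqlite3

# In-memory database holding a hypothetical table of messy customer records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_customers (name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO raw_customers VALUES (?, ?)",
    [
        ("Ada Lovelace", " ADA@Example.com "),  # stray spaces, mixed case
        ("Ada Lovelace", "ada@example.com"),    # duplicate once normalized
        ("Grace Hopper", "grace@example.com"),
        ("No Email", ""),                       # blank value to discard
    ],
)

# Trim whitespace, lowercase emails, drop blank rows, and collapse duplicates.
clean_rows = conn.execute(
    """
    SELECT DISTINCT TRIM(name) AS name, LOWER(TRIM(email)) AS email
    FROM raw_customers
    WHERE TRIM(email) <> ''
    ORDER BY name
    """
).fetchall()
```

Four raw rows go in and two usable ones come out. Nothing here is smart software. It is TRIM, LOWER, and DISTINCT, which may explain why the vendors do not put it on the booth signage.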

Consider that most of the companies marketing next generation data mining and analytics systems never discuss grunt work and old fashioned data management.

Why the disconnect?

My hunch is that it is the sizzle, not the steak, which sells. Little wonder that some analytics outputs might be lab-made hamburger.

Stephen E Arnold, April 1, 2016

Confused about Hadoop, Spark, and MapReduce? Not Necessary Now

March 24, 2016

I read “MapReduce vs. Apache Spark vs. SQL: Your questions answered here and at #StrataHadoop.” The article strikes at the heart of the Big Data boomlet. The options one has are rich, varied, and infused with consequences.

According to the write up:

Forrester is predicting total market saturation for Hadoop in two years, and a growing number of users are leveraging Spark for its superior performance when compared to MapReduce.

Yikes! A mid tier consulting firm is predicting the future again. I almost stopped reading, but I was intrigued. Exactly what are the differences among these three systems, which appear to be really different? MapReduce is a bit of a golden oldie, and there is the pesky thought in my mind that Hadoop is a close relative of MapReduce. The Spark thing is an open source effort to create a system which runs quickly enough to make performance mesh with the idea that engineers have weekends.

The write up states:

As I mentioned in my previous post, we’re using this blog series to introduce some of the key technologies SAS will be highlighting at Strata Hadoop World. Each Q&A features the thought leaders you’ll be able to meet when you stop by the SAS booth #1022. Next up is Brian Kinnebrew who explains how new enhancements to SAS Data Loader for Hadoop can support Spark.

Yikes, yikes. The write up is a plea for booth traffic. In the booth a visitor can learn about the Hadoop, Spark, and MapReduce options.

The most interesting thing about the article is that it presents a series of questions and some SAS-skewed answers. The point is that SAS, the statistics company every graduate student in psychology learns to love, has a Data Loader Version 2.4 which is going to make life wonderful for the Big Data crowd.

I wondered, “Is this extract, transform, and load all over again?”

The answer is not to get tangled up in the substantive differences among Hadoop, Spark, and MapReduce as the title of the article implied. The point is that one can use NoSQL and regular SQL.

So what did I learn about the differences among Hadoop, Spark, and MapReduce?

Nothing. Just content marketing without much content in my view.

SAS, let me know if you want me to explain the differences to someone in your organization.

Stephen E Arnold, March 24, 2016

Hot Data Startups to Notice

March 22, 2016

An outfit called UBM, which looks a lot like the old IDC I knew and loved, published “9 Hot Big Data and Analytics Startups to Watch.” The article is a series of separate pages. Apparently the lust for clicks is greater than the MBAs’ interest in making information easy to access. Progress in online publishing is zipping right along the information highway it seems.

What are the companies the article and UBM are describing as “hot”? I interpret the word to mean “having a high degree of heat or a high temperature” or “(of food) containing or consisting of pungent spices or peppers that produce a burning sensation when tasted.” I have a hunch the use of the word in this write up is intended to suggest big revenue producers which you must license in order to get or keep a job. Just a guess, mind you.

The companies are:

AtScale, founded in 2013

Algorithmia, founded in 2013

Bedrock Data, founded in 2012

BlueTalon, founded in 2013

Cazena, founded in 2014

Confluent, founded in 2014, founded in 2011

RJMetrics, founded in 2008

Wavefront, founded in 2013

The list is US centric. I assume none of the Big Data and analytics outfits in other countries are “hot.” I think the reason is that the research process looked at Boston, Seattle, and the Sillycon Valley pool and thought, “Close enough for horseshoes.” Just a guess, mind you.

If you are looking for the next big thing founded within the last two to eight years, the list is just what you need to make your company or organization great again. Sorry, some catchphrases are tough to purge from my addled goose brain. Enjoy the listicle. On high latency systems, the slides don’t render. Again. Do MBAs worry about this stuff? A final comment: I like the name “BlueTalon.”

Stephen E Arnold, March 22, 2016