Expert System: Inspired by Endeca

April 23, 2016

Years ago I listened to Endeca (now owned by Oracle) extol the virtues of its various tools. The idea was that the tools made it somewhat easier to get Endeca up and running. The original patents for Endeca reveal the computational blender which the Endeca method required. Endeca shifted from licensing software to bundling consulting with a software license. Setting up Endeca required MBAs, patience, and money. Endeca rose to generate more than $120 million in revenues before its sale to Oracle. Today Endeca is still available, and the Endeca patents—particularly 7035864—reveal how Endeca pulled off its facets. Today Endeca has lost a bit of its spit and polish, a process that began when Autonomy blasted past the firm in the early 2000s.

Endeca rolled out its “studio” a decade ago. I recall that Business Objects had a “studio.” The idea behind a studio was to simplify the complex task of creating something an end user could use without much training. But the studio was not aimed at an end user. The studio was a product for a developer, who found the tortuous, proprietary methods complex and difficult to learn. A studio would unleash the developers and, of course, propel the vendors with studios to new revenue heights.

Studio is back, if the information in “Expert System Releases Cogito Studio for Combining the Advantages of Semantic Technology with Deep Learning” is accurate. The spin is that semantic technology and deep learning, two buzzwords near and dear to the heart of those in search of the next big thing, will be a boon. Who is the intended user? Well, developers. These folks are learning that the marketing talk is a heck of a lot easier than designing, coding, debugging, stabilizing, and then generating useful outputs, which is quite difficult work.

According to the Expert System announcement:

“The new release of Cogito Studio is the result of the hard work and dedication of our labs, which are focused on developing products that are both powerful and easy to use,” said Marco Varone, President and CTO, Expert System. “We believe that we can make significant contributions to the field of artificial intelligence. In our vision of AI, typical deep learning algorithms for automatic learning and knowledge extraction can be made more effective when combined with algorithms based on a comprehension of text and on knowledge structured in a manner similar to that of humans.”

Does this strike you as vague?

Expert System is an Italian high tech outfit, which was founded in 1989. That’s almost a decade before the Endeca system poked its moist nose into the world of search. Fellow travelers from this era include Fulcrum Technologies and ISYS Search Software. Both companies’ technologies are still available today.

Thus, it makes sense that the idea of a “studio” becomes a way to chop away at the complexity of Expert System-type systems.

According to Google Finance, Expert System’s stock is trending upwards.

That’s a good sign. My hunch is that announcements about “studios” wrapped in lingo like semantics and Big Data are a good thing.

Stephen E Arnold, April 23, 2016

Data Intake: Still a Hassle

April 21, 2016

I read “Big Data’s Biggest Problem: It’s Too Hard to Get the Data In.” Here’s a quote I noted:

According to a study by data integration specialist Xplenty, a third of business intelligence professionals spend 50% to 90% of their time cleaning up raw data and preparing to input it into the company’s data platforms. That probably has a lot to do with why only 28% of companies think they are generating strategic value from their data.

My hunch is that with the exciting hyperbole about Big Data, the problem of normalizing, cleaning, and importing data is ignored. The challenge of taking file A in a particular file format and converting to another file type is indeed a hassle. A number of companies offer expensive filters to perform this task. The one I remember is Outside In, which sort of worked. I recall that when odd ball characters appeared in the file, there would be some issues. (Does anyone remember XyWrite?) Stellent purchased Outside In in order to move content into that firm’s content management system. Oracle purchased Stellent in 2006. Then Kapow “popped” on the scene. The firm promoted lots of functionality, but I remember it as a vendor who offered software which could take a file in one format and convert it into another format. Kofax (yep, the scanner oriented outfit) bought Kapow to move content from one format into one that Kofax systems could process. Then Lexmark bought Kofax and ended up with Kapow. With that deal, Palantir and other users of the Kapow technology probably had a nervous moment or are now having a nervous moment as Lexmark marches toward a new owner. Entropy, a French outfit, was a file conversion outfit. It sold out to Salesforce. Once again, converting files from Type A to another desired format seems to have been the motivating factor.

Let us not forget the wonderful file conversion tools baked into software. I can save a Word file as an RTF file. I can import a comma separated file into Excel. I can even fire up Framemaker and save a Dot fm file as RTF. In fact, many programs offer these import and export options. The idea is to lessen the pain of having a file in one format which another system cannot handle. Hey, for fun, try opening a macro filled XyWrite file in Framemaker or Indesign. Just change the file extension to one the system thinks it recognizes. This is indeed entertaining.

The write up is not interested in the companies which have sold for big bucks because their technology could make file conversion a walk in the Hounz Lane Park. (Watch out for the rats, gentle reader.) The write up points out three developments which will make the file intake issues go away:

  1. The software performing file conversion “gets better.” Okay, I have been waiting for decades for this happy time to arrive. No joy at the moment.
  2. “Data preparers become the paralegals of data science.” Now that’s a special idea. I am not clear on what a “data preparer” is, but it sounds like a task that will be outsourced pretty quickly to some country far from the home of NASCAR.
  3. “Artificial intelligence” will help cleanse data. Excuse me, but smart software has been operative in file conversion methods for quite a while. In my experience, the exception files keep on piling up.

What is the problem with file conversion? I don’t want to convert this free blog post into a lengthy explanation. I can highlight five issues which have plagued me and my work in file conversion for many years:

First, file types change over time. Some of the changes are not announced. Others like the Microsoft Word XML thing are the subject of months long marketing. The problem is that unless the outfit responsible for the file conversion system creates a fix, the exception files can overrun a system’s capacity to keep track of problems. If someone is asleep at the switch, data in the exception folder can have an adverse impact on some production systems. Loss of data is interesting but trashing the file structure is a carnival. Who does not pay attention? In my experience, vendors, licensees, third parties, and probably most of the people responsible for a routine file conversion task.

Second, the thrill of XML is that it is not particularly consistent. Somewhere along the line, creativity takes precedence over well formed. How does one deal with a couple hundred thousand XML files in an exception folder? What do you think about deleting them?

Third, the file conversion software works as long as the person creating a document does not use Fancy Dan “inserts” in the source document. Problems arise from videos, certain links, macros, and odd ball formatting of the source document. Yep, some folks create text in Excel and wonder why the resulting text is a bit of a mess.

Fourth, workflows get screwed up. A file conversion system is semi smart. If a process creates a file with an unrecognized extension, the file conversion system fills the exception folder. But what if one valid extension is changed to a supported but incorrect extension? Yep, XML users be aware that there are proprietary XML formats. The files converted and made available to a system are “sort of right.” Unfortunately sort of right in mission critical applications can have some interesting consequences.

Fifth, attention to detail is often less popular than fiddling with one’s mobile phone or reading Facebook posts. Human inattention can make large scale data conversion fail. I have watched as a person of my acquaintance deleted the folder of exception files. Yo, it is time for lunch.
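The second and fourth issues above can be sketched in a few lines. This is only an illustration in Python, not how any vendor's filter works: the `SIGNATURES` table, the `ZIP_BASED` set, and the `sniff_extension` and `triage` names are all hypothetical, and a production intake system would need far more signatures than the three shown. The idea is to compare what a file claims to be (its extension) with what its leading bytes suggest, check XML for well formedness, and record a reason for every exception instead of letting the folder fill up silently.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical signature table: leading bytes -> extension the content suggests.
SIGNATURES = {
    b"%PDF-": ".pdf",
    b"PK\x03\x04": ".zip",  # ZIP container, also used by .docx and .xlsx
    b"{\\rtf": ".rtf",
}

# Extensions that are legitimately ZIP containers (assumed list, not exhaustive).
ZIP_BASED = {".zip", ".docx", ".xlsx", ".pptx"}


def sniff_extension(data: bytes):
    """Guess an extension from the leading bytes; return None when unknown."""
    for magic, ext in SIGNATURES.items():
        if data.startswith(magic):
            return ext
    # Skip a UTF-8 byte order mark and whitespace before looking for markup.
    if data.lstrip(b"\xef\xbb\xbf \t\r\n").startswith(b"<"):
        return ".xml"
    return None


def triage(name: str, data: bytes) -> str:
    """Return 'ok', or a reason explaining why the file is an exception."""
    claimed = Path(name).suffix.lower()
    sniffed = sniff_extension(data)
    if sniffed is None:
        return "unrecognized content"
    zip_container = sniffed == ".zip" and claimed in ZIP_BASED
    if not zip_container and sniffed != claimed:
        return f"extension {claimed} does not match content {sniffed}"
    if sniffed == ".xml":
        try:
            ET.fromstring(data)  # raises ParseError when not well formed
        except ET.ParseError as err:
            return f"XML not well formed: {err}"
    return "ok"


if __name__ == "__main__":
    samples = {
        "good.xml": b"<root><item>1</item></root>",
        "broken.xml": b"<root><item>1</root>",
        "renamed.xml": b"%PDF-1.4 pretend this is a PDF",
    }
    for name, data in samples.items():
        print(name, "->", triage(name, data))
```

A well formed file can still be “sort of right”: this check says nothing about whether the XML follows the proprietary schema the downstream system expects. What it does provide is a reason attached to each exception, which makes the folder something to work through rather than something to delete before lunch.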

So what? Smart software makes certain assumptions. At this time, file intake is perceived as a problem which has been solved. My view is that file intake is a core function which needs a little bit more attention. I do not need to be told that smart software will make file intake pain go away.

Stephen E Arnold, April 21, 2016

Tips on How to Make the Most of Big Data (While Spending the Least)

April 13, 2016

The article titled The 10 Commandments of Business Intelligence in Big Data on Datanami offers wisdom written on USB sticks instead of stone tablets. In the Business Intelligence arena, apparently moral guidance can take a backseat to Big Data cost-savings. Suggestions include: Don’t move Big Data unless you must, try to leverage your existing security system, and engage in extensive data visualization sharing (think Github). The article explains the importance of avoiding certain price-gouging traps,

“When done right, [Big Data] can be extremely cost effective… That said…some BI applications charge users by the gigabyte… It’s totally common to have geometric, exponential, logarithmic growth in data and in adoption with big data. Our customers have seen deployments grow from tens of billions of entries to hundreds of billions in a matter of months. That’s another beauty of big data systems: Incremental scalability. Make sure you don’t get lowballed into a BI tool that penalizes your upside.”

The Fifth Commandment reminds us all that analyzing the data in its natural, messy form is far better than flattening it into tables due to the risk of losing key relationships. The Ninth and Tenth Commandments step back and look at the big picture of data analytics in 2016. What was only a buzzword to most people just five years ago is now a key aspect of strategy for any number of organizations. This article reminds us that thanks to data visualization, Big Data isn’t just for data scientists anymore. Employees across departments can make use of data to make decisions, but only if they are empowered to do so.


Chelsea Kerwin, April 13, 2016

Sponsored by, publisher of the CyberOSINT monograph

IBM: Back to Its Roots with Zest, Actually Spark

April 6, 2016

I read “IBM Launches Mainframe Platform for Spark.” This is an announcement which makes sense to me. The Watson baloney annoys; the mainframe news thrills.

According to the write up:

IBM is expanding its embrace of Apache Spark with the release of a mainframe platform that would allow the emerging open-source analytics framework to run natively on the company’s mainframe operating system.

I noted this passage as well:

The IBM platform also seeks to leverage Spark’s in-memory processing approach to crunching data. Hence, the z Systems platform includes data abstraction and integration services so that z/OS analytics applications can leverage standard Spark APIs. That approach eliminates processing and security issues associated with ETL while allowing organizations to analyze data in-place.

Hopefully IBM will play to its strengths not chase rainbows.

Stephen E Arnold, April 6, 2016

Big Data and Its Fry Cooks Who Clean the Grill

April 1, 2016

I read “Cleaning Big Data: Most Time Consuming, Least Enjoyable Data Science Task, Survey Says.” A survey?

According to the capitalist tool:

A new survey of data scientists found that they spend most of their time massaging rather than mining or modeling data.

The point is that few wizards want to come to grips with the problem of figuring out what’s wrong with data in a set or a stream and then getting the data into a form that can be used with reasonable confidence.

Those exception folders, annoying, aren’t they?

The write up points out that a data scientist spends 80 percent of his or her time doing housecleaning. Skip the job and the house becomes unpleasant indeed.

The survey also reveals that data scientists have to organize the data to be analyzed. Imagine that. The baloney about automatically sucking in a wide range of data does not match the reality of the survey sample.

Another grim bit of drudgery emerges from the survey, which we assume was conducted with the appropriate textbook procedures: the skills most in demand were for SQL. Yep, old school.

Consider that most of the companies marketing next generation data mining and analytics systems never discuss grunt work and old fashioned data management.

Why the disconnect?

My hunch is that it is the sizzle, not the steak, which sells. Little wonder that some analytics outputs might be lab-made hamburger.

Stephen E Arnold, April 1, 2016

Confused about Hadoop, Spark, and MapReduce? Not Necessary Now

March 24, 2016

I read “MapReduce vs. Apache Spark vs. SQL: Your questions answered here and at #StrataHadoop.” The article strikes at the heart of the Big Data boomlet. The options one has are rich, varied, and infused with consequences.

According to the write up:

Forrester is predicting total market saturation for Hadoop in two years, and a growing number of users are leveraging Spark for its superior performance when compared to MapReduce.

Yikes! A mid tier consulting firm is predicting the future again. I almost stopped reading, but I was intrigued. Exactly what are the differences among these three systems, which appear to be really different? MapReduce is a bit of a golden oldie, and there is the pesky thought in my mind that Hadoop is a close relative of MapReduce. The Spark thing is an open source effort to create a system which runs quickly enough to make performance mesh with the idea that engineers have weekends.

The write up states:

As I mentioned in my previous post, we’re using this blog series to introduce some of the key technologies SAS will be highlighting at Strata Hadoop World. Each Q&A features the thought leaders you’ll be able to meet when you stop by the SAS booth #1022. Next up is Brian Kinnebrew who explains how new enhancements to SAS Data Loader for Hadoop can support Spark.

Yikes, yikes. The write up is a plea for booth traffic. In the booth a visitor can learn about the Hadoop, Spark, and MapReduce options.

The most interesting thing about the article is that it presents a series of questions and some SAS-skewed answers. The point is that SAS, the statistics company every graduate student in psychology learns to love, has a Data Loader Version 2.4 which is going to make life wonderful for the Big Data crowd.

I wondered, “Is this ‘extract, transform, and load’ all over again?”

The answer is not to get tangled up in the substantive differences among Hadoop, Spark and MapReduce like the title of the article implied. The point is that one can use NoSQL and regular SQL.

So what did I learn about the differences among Hadoop, Spark, and MapReduce?

Nothing. Just content marketing without much content in my view.

SAS, let me know if you want me to explain the differences to someone in your organization.

Stephen E Arnold, March 24, 2016

Hot Data Startups to Notice

March 22, 2016

An outfit called UBM, which looks a lot like the old IDC I knew and loved, published “9 Hot Big Data and Analytics Startups to Watch.” The article is a series of separate pages. Apparently the lust for clicks is greater than the MBAs’ interest in making information easy to access. Progress in online publishing is zipping right along the information highway it seems.

What are the companies the article and UBM are describing as “hot”? I interpret the word to mean “having a high degree of heat or a high temperature” or “(of food) containing or consisting of pungent spices or peppers that produce a burning sensation when tasted.” I have a hunch the use of the word in this write up is intended to suggest big revenue producers which you must license in order to get or keep a job. Just a guess, mind you.

The companies are:

AtScale, founded in 2013

Algorithmia, founded in 2013

Bedrock Data, founded in 2012

BlueTalon, founded in 2013

Cazena, founded in 2014

Confluent, founded in 2014, founded in 2011

RJMetrics, founded in 2008

Wavefront, founded in 2013

The list is US centric. I assume none of the Big Data and analytics outfits in other countries are “hot.” I think the reason is that the research process looked at Boston, Seattle, and the Sillycon Valley pool and thought, “Close enough for horseshoes.” Just a guess, mind you.

If you are looking for the next big thing founded within the last two to eight years, the list is just what you need to make your company or organization great again. Sorry, some catchphrases are tough to purge from my addled goose brain. Enjoy the listicle. On high latency systems, the slides don’t render. Again. Do MBAs worry about this stuff? A final comment: I like the name “BlueTalon.”

Stephen E Arnold, March 22, 2016

Change Is Hard, Especially in the User Interface

March 22, 2016

One of the most annoying things in life is when you go to the grocery store and notice they have rearranged the entire place since your last visit. I always ask myself the question, “Why, grocery store people, did you do this to me?” Part of the reason is to improve the shopping experience and product exposure, while the other half is to screw with customers (I cannot confirm the latter). On Fuzzy Notepad, the blog with the Pokémon Eevee mascot, the post titled “We Have Always Been At War With UI” explains that programmers and users have always been at war with each other when it comes to the user interface.

Face it, Web sites (and other areas of life) need to change to maintain their relevancy.  The biggest problem related to UI changes is the roll out of said changes.  The post points out that users get confused and spend hours trying to understand the change.  Sometimes the change is announced, other times it is only applied to a certain number of users.

The post lists several UI changes, describing how they were handled and the programming behind them. One constant thread running through the post is that users simply hate change, but the inevitable question of “Why?” pops up.

“Ah, but why? I think too many developers trot this line out as an excuse to ignore all criticism of a change, which is very unhealthy. Complaints will always taper off over time, but that doesn’t mean people are happy, just that they’ve gone hoarse. Or, worse, they’ve quietly left, and your graphs won’t tell you why. People aren’t like computers and may not react instantly to change; they may stew for a while and drift away, or they may join a mass exodus when a suitable replacement comes along.”

Big data can measure anything and everything, but the data can be interpreted for or against the changes. Even worse is that the analysts may not know what exactly they need to measure. What can be done to avoid total confusion about changes is to have a plan, let users know in advance, and even create a tutorial about how to use the changes. If worse comes to worst, the change can be rolled back and then we move on.


Whitney Grace, March 22, 2016

Infonomics and the Big Data Market Publishers Need to Consider

March 22, 2016

The article on Beyond the Book titled Data Not Content Is Now Publishers’ Product floats a new buzzword in its discussion of the future of information: infonomics, or the study of creation and consumption of information. The article compares information to petroleum as the resource that will cause quite a stir in this century. Grace Hong, Vice-President of Strategic Markets & Development for Wolters Kluwer’s Tax & Accounting, weighs in,

“When it comes to big data – and especially when we think about organizations like traditional publishing organizations – data in and of itself is not valuable.  It’s really about the insights and the problems that you’re able to solve,”  Hong tells CCC’s Chris Kenneally. “From a product standpoint and from a customer standpoint, it’s about asking the right questions and then really deeply understanding how this information can provide value to the customer, not only just mining the data that currently exists.”

Hong points out that the data itself is useless unless it has been produced correctly. That means asking the right questions and using the best technology available to find meaning in the massive collections of information it is now possible to gather. Hong suggests that it is time for publishers to seize on the market created by Big Data.


Chelsea Kerwin, March 22, 2016

How Many Types of Big Data Exist?

March 18, 2016

Navigate to “The Five Different Types of Big Data.” If you are a student of classification, you will find the categories set forth in this write up an absolute hoot. The author is an expert, I assume, in energy, transportation, food, and data. Oh, goodie. Food.

I have not thought too much about the types of Big Data. I usually think only when a client pays me to perform that function. An example is my analysis of the concept “real time” information. You can find that write up at this link. “Big” requires me to understand the concept of “relative to what.” I find this type of thinking uninteresting, but obviously the editors at Forbes find the idea just another capitalist tool.

When I learned that an expert had chased down the types of Big Data, I was and remain confused. “Big” describes something that is relative. “Data” is the plural of datum and refers to more than two facts or statistics, quantities, characters, symbols, etc.

I am not sure what Big Data is, and like many marketing buzzwords, the phrase has become a catchall for vendors of all manner of computer related products and services.

Here are the five types of Big Data.

  1. Big data. I like the Kurt Friedrich Gödel touch.
  2. Fast data. “Relative to what?” I ask.
  3. Dark data. “Darker than what? Is this secret versus un-secret or some other yardstick?” I wonder.
  4. Lost data. I pose to myself, “Lost as in unknown, known but unknown, or some other Rumsfeldesque state of understanding?”
  5. New data. I think, “I really don’t want to think about what ‘new’ means. Is this new as in never before seen or Madison Avenue ‘new’ like an improved Colgate Total toothpaste with whitener?”

I like the tag on the article “Recommended by Forbes.” Quite an endorsement from a fine example of capitalistic tool analysis.

Stephen E Arnold, March 18, 2016