
A Data Lake: Batch Job Dipping Only

February 11, 2016

I love the Hadoop data lake concept. I live in a mostly real time world. The “batch” approach reminds me of my first exposure to computing in 1962. Real time? Give me a break. Hadoop reminded me of those early days. Fun. Standing on line. Waiting and waiting.

I read “Data Lake: Save Me More Money vs. Make Me More Money.” The article strikes me as a conference presentation illustrated with a deck of PowerPoint goodies.

One of the visuals was a modern big data analytics environment. I have seen a number of representations of today’s big data yadda yadda set ups. Here’s the EMC take on the modernity:


Straight away, I note the “all” word. Yep, just put the categorical affirmative into a Hadoop data lake. Don’t forget the video, the wonky stuff in the graphics department, the engineering drawings, and the most recent version of the merger documents requested by a team of government investigators, attorneys, and a pesky solicitor from some small European Community committee. “All” means all, right?

Then there are two “environments”. Okay, a data lake can have ecosystems, so the word environment is okay for flora and fauna. I think the notion is to build two separate analytic subsystems. Interesting approach, but there are platforms which offer applications to handle most of the data slap about work. Why not license one of those; for example, Palantir, Recorded Future?

And that’s it?

Well, no. The write up states that the approach will “save me more money.” In fact, saving money is only the start:

The savings from these “Save me more money” activities can be nice with a Return on Investment (ROI) typically in the 10% to 20% range. But if organizations stop there, then they are leaving the 5x to 10x ROI projects on the table. Do I have your attention now?

My answer, “No, no, you do not.”

Stephen E Arnold, February 11, 2016

Big Data Blending Solution

January 20, 2016

I would have used Palantir or maybe our own tools. But an outfit named National Instruments found a different way to perform data blending. “How This Instrument Firm Tackled Big Data Blending” provides a case study and a rah rah for Alteryx. Here’s the paragraph I highlighted:

The software it [National Instruments] selected, from Alteryx, takes a somewhat unique approach in that it provides a visual representation of the data transformation process. Users can acquire, transform, and blend multiple data sources essentially by dragging and dropping icons on a screen. This GUI approach is beneficial to NI employees who aren’t proficient at manipulating data using something like SQL.

The graphical approach has been part of a number of tools. There are also some systems which just figure out where to put what.
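For readers curious what the SQL a drag-and-drop blend replaces actually looks like, here is a minimal sketch using Python’s built-in sqlite3 module; the tables, columns, and data are invented for illustration:

```python
import sqlite3

# Hypothetical tables: the join-and-aggregate step a visual blending
# tool assembles with drag-and-drop icons, written as plain SQL.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 120.0), (2, 75.5), (3, 210.0);
    INSERT INTO customers VALUES (1, 'East'), (2, 'West'), (3, 'East');
""")

# Blend the two sources on the shared key, then aggregate by region.
rows = con.execute("""
    SELECT c.region, SUM(o.amount) AS total
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region ORDER BY c.region
""").fetchall()
print(rows)  # [('East', 330.0), ('West', 75.5)]
```

The GUI approach wires up exactly this kind of join-and-summarize pipeline, just with icons instead of keywords.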

The issue for me is, “What happens to rich media like imagery and unstructured information like email?”

There are systems which handle these types of content.

Another challenge is the dependence on structured relational data tables. Certain types of operations are difficult in this environment.

The write up is interesting, but it reveals that a narrow view of available tools may produce a partial solution.

Stephen E Arnold, January 20, 2016

Machine Learning Hindsight

January 18, 2016

Have you ever found yourself saying, “If I only knew then, what I know now”?  It is a moment we all experience, but instead of stewing over our past mistakes it is better to share the lessons we’ve learned with others.  Data scientist Peadar Coyle learned some valuable lessons when he first started working with machine learning.  He discusses three main things he learned in the article, “Three Things I Wish I Knew Earlier About Machine Learning.”

Here are the three items he wishes he knew then about machine learning, but knows now:

  • “Getting models into production is a lot more than just micro services
  • Feature selection and feature extraction are really hard to learn from a book
  • The evaluation phase is really important”

Developing models is the easy step; putting them into production is the hard one. There are many major steps that need attention, and doing all of the little jobs is not feasible on huge projects, so Coyle recommends outsourcing when you can. Books and online references are good tools, but when their lessons cannot be applied to actual situations the knowledge is of little use; Coyle learned that nothing substitutes for real world experience. The same holds for evaluation. Life does not hand you perfect datasets, and testing a model against varied situations is the only way to find out whether it actually works.
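Coyle’s point about evaluation can be made concrete with a toy sketch: hold out part of the data, fit something simple (here a nearest-centroid rule), and score it on examples the model never saw. Everything below, data included, is synthetic:

```python
import random

# Toy illustration of the evaluation phase: split off a holdout set
# and measure accuracy on examples the model was not trained on.
random.seed(0)
data = [([random.gauss(0, 1), random.gauss(0, 1)], "A") for _ in range(50)]
data += [([random.gauss(3, 1), random.gauss(3, 1)], "B") for _ in range(50)]
random.shuffle(data)

train, test = data[:70], data[70:]  # simple holdout split

def centroid(points):
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

# "Training": compute one centroid per label from the training set.
centroids = {
    label: centroid([x for x, y in train if y == label])
    for label in ("A", "B")
}

def predict(point):
    def dist(c):
        return (point[0] - c[0]) ** 2 + (point[1] - c[1]) ** 2
    return min(centroids, key=lambda label: dist(centroids[label]))

# Evaluation: accuracy on the holdout set, not the training set.
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(f"holdout accuracy: {accuracy:.2f}")
```

Scoring on the training set instead of the holdout set is exactly the mistake the evaluation phase exists to catch.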

Peadar’s advice applies to machine learning, but much of it applies to life in general.


Whitney Grace, January 18, 2016
Sponsored by, publisher of the CyberOSINT monograph

Open Source Data Management: It Is Now Easy to Understand

January 10, 2016

I read “16 for 16: What You Must Know about Hadoop and Spark Right Now.” I like the “right now.” Urgency. I am not sure I feel too much urgency at the moment. I will leave that wonderful feeling to the executives who have sucked in venture money and have to find a way to generate revenue in the next 11 months.

The article runs down the basic generalizations associated with each of these open source data management components:

  • Spark
  • Hive
  • Kerberos
  • Ranger/Sentry
  • HBase/Phoenix
  • Impala
  • Hadoop Distributed File System (HDFS)
  • Kafka
  • Storm/Apex
  • Ambari/Cloudera Manager
  • Pig
  • Yarn/Mesos
  • Nifi/Kettle
  • Knox
  • Scala/Python
  • Zeppelin/Databricks

The list tells me two things. First, open source data tools are proliferating. Second, quite a few committed developers will be needed to keep these projects afloat.

The write up is not content with this shopping list. The intrepid reader will have an opportunity to learn a bit about:

  • Kylin
  • Atlas/Navigator

As the write up swoops to its end point, I learned about some open source projects which are a bit of a disappointment; for example, Oozie and Tez.

The key point of the article is that Google’s MapReduce, which is pretty long in the tooth, is now effectively marginalized.
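Long in the tooth or not, MapReduce’s core idea is simple enough to sketch in plain Python, even if the industrial machinery around it is not: map every record to key-value pairs, then reduce by key. A toy word count:

```python
from collections import Counter
from itertools import chain

# Toy illustration of the map/reduce idea MapReduce popularized.
docs = ["hadoop and spark", "spark streams data", "hadoop stores data"]

# Map phase: emit (word, 1) for every word in every document.
mapped = chain.from_iterable(((w, 1) for w in d.split()) for d in docs)

# Reduce phase: sum the counts for each key.
counts = Counter()
for word, n in mapped:
    counts[word] += n

print(counts)
```

Spark and its successors keep this model but hold intermediate results in memory instead of writing each phase to disk, which is much of why batch-only MapReduce lost its shine.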

The Balkanization of data management is evident. The challenge will be to use one or more of these technologies to make some substantial revenue flow.

What happens if a company jumps on the wrong bandwagon as it leaves the parade ground? I would suggest that it may be more like a Pig than an Atlas. The investors will change from Rangers looking for profits to Pythons ready to strike. A Spark can set fire to some hopes and dreams in the Hive. Poorly constructed walls of Databricks can come falling down. That will be an Oozie.

Dear old Oracle, DB2, and SQL Server will just watch.

Stephen E Arnold, January 10, 2016

Short Honk: Hadoop Ecosystem Made Clear

January 3, 2016

Love Hadoop. Love all things Hadoopy? You will want to navigate to “The Hadoop Ecosystem Table.” You have categories of Hadoopiness with examples of the Hadoop amoebae. You are able to see where Spark “fits” or Kudu. Need some document data model options? The table will deliver: ArangoDB and more. Useful stuff.

Stephen E Arnold, December 30, 2015

Data Managers as Data Librarians

December 31, 2015

The tools of a librarian may be the key to better data governance, according to an article at InFocus titled, “What Librarians Can Teach Us About Managing Big Data.” Writer Joseph Dossantos begins by outlining the plight data managers often find themselves in: executives can talk a big game about big data, but want to foist all the responsibility onto their overworked and outdated IT departments. The article asserts, though, that today’s emphasis on data analysis will force a shift in perspective and approach—data organization will come to resemble the Dewey Decimal System. Dossantos writes:

“Traditional Data Warehouses do not work unless there a common vocabulary and understanding of a problem, but consider how things work in academia.  Every day, tenured professors  and students pore over raw material looking for new insights into the past and new ways to explain culture, politics, and philosophy.  Their sources of choice:  archived photographs, primary documents found in a city hall, monastery or excavation site, scrolls from a long-abandoned cave, or voice recordings from the Oval office – in short, anything in any kind of format.  And who can help them find what they are looking for?  A skilled librarian who knows how to effectively search for not only books, but primary source material across the world, who can understand, create, and navigate a catalog to accelerate a researcher’s efforts.”

The article goes on to discuss the influence of the “Wikipedia mindset;” data accuracy and whether it matters; and devising structures to address different researchers’ needs. See the article for details on each of these (especially on meeting different needs.) The write-up concludes with a call for data-governance professionals to think of themselves as “data librarians.” Is this approach the key to more effective data search and analysis?

Cynthia Murrell, December 31, 2015

Sponsored by, publisher of the CyberOSINT monograph

Caution about NoSQL Databases

December 22, 2015

I read “Exasol and Birst Join In-Memory Database to ‘Networked’ BI to Aid Mutual Expansion.” Another day, another marketing tie up. But the article contained a very interesting statement, attributed to a Birst big dog:

NoSQL databases are great for atomic storage and retrieval, and for elastic scaling over a distributed [server] environment, but when it comes to doing aggregations with joins – and that’s what analytics is about – it is just not what they are built for.
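The quoted complaint can be made concrete. In a key-value or document store, an aggregation with a join often has to be stitched together in application code; here is a toy Python sketch, with the collections and field names invented for illustration:

```python
from collections import defaultdict

# With records split across two document collections (modeled here as
# lists of dicts), the aggregation-with-join that one SQL statement
# expresses must be hand rolled in application code.
page_views = [
    {"page_id": "a", "views": 40},
    {"page_id": "b", "views": 25},
    {"page_id": "a", "views": 10},
]
pages = [
    {"page_id": "a", "section": "news"},
    {"page_id": "b", "section": "sports"},
]

# Step 1: the "join" half, built by hand as a lookup table.
section_of = {p["page_id"]: p["section"] for p in pages}

# Step 2: the "GROUP BY" half, also by hand.
totals = defaultdict(int)
for v in page_views:
    totals[section_of[v["page_id"]]] += v["views"]

print(dict(totals))  # {'news': 50, 'sports': 25}
```

An analytics workload runs queries like this all day, which is the Birst executive’s point about what such stores are and are not built for.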

I wonder if that shot is aimed at outfits like MarkLogic. Worth watching this partnership.

Stephen E Arnold, December 22, 2015

Two AI Paths Pondered by Teradata

December 20, 2015

I read the content marketing write up by Karthik Guruswamy. I like the “guru” part of the expert’s name. I am stuck with the “old” part of my name.

The write up is called “Data Science: Machine Learning Vs. Rules Based Systems.” I know a little bit about both of these methods, and I know a teeny tiny bit about Teradata, an outstanding data warehouse solution chugging along with its stock in the high $20s per share. The Google finance chart suggests that the company has some challenges with net income and profit margin to my unlearned eye:


Looks like some content marketing oomph is needed to move that top line number.

I learned in the write up:

Rules based systems will work effectively if all the situations, under which decisions can be made, are known ahead of time.

Okay. Insight. Know everything ahead of time and one can write rules to cover the situation. Is this expensive? Is this a never ending job? Consultants sure hope so.
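What such hand written rules look like in practice can be sketched in a few lines; the rules and thresholds below are invented for illustration:

```python
# Toy rules-based classifier: every situation must be anticipated and
# written down ahead of time, which is the expensive, never ending part.
RULES = [
    (lambda t: t["amount"] > 10_000, "review"),
    (lambda t: t["country"] not in {"US", "CA"}, "review"),
    (lambda t: True, "approve"),  # fallback when no earlier rule fires
]

def decide(transaction):
    for condition, outcome in RULES:
        if condition(transaction):
            return outcome

print(decide({"amount": 50_000, "country": "US"}))  # review
print(decide({"amount": 20, "country": "FR"}))      # review
print(decide({"amount": 99, "country": "US"}))      # approve
```

Every new situation means another lambda in the list, maintained by a human. That is the billing opportunity.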

There is an alternative:

Enter Machine Learning or ML! If we classify the data into good vs. bad data sets or categorize them into different labels like A, B, C, D etc., the Machine Learning algorithms can help build rules on the fly. This step is called training which results in a model. During operationalization, this model is used by the prediction algorithm to classify the incoming data in the right way which in turn leads to sound decision making.

I recall that Autonomy used this approach for its system. Those familiar with Autonomy have some experience with retraining, Bayesian drift, and other exciting facets of machine learning based systems. Consultants love to build new training sets.

The write up asserts:

With Machine Learning, one can iteratively achieve good results by cleansing & prepping the data, changing or combining algorithms or merely tweaking the algorithm parameters. This is becoming much easier thanks to the increased awareness and the availability of different types of data science tools in the market today.

High five.

My view is that the write up left out some information. But there is one omission which warrants a special comment.

Neither of these systems works without human intervention.

Bummer. Reality is sort of a drag, but maybe that’s why Teradata is wrestling with revenue and net profit alligators. Consultants, on the other hand, can bill to enhance either approach.

What about the customer? Well, some customers of brand name data warehouse systems struggle to get data into and out of these whiz bang systems in my experience. Regardless of the craziness involved with Hadoop and Spark, these open source approaches may make more sense than pumping six or seven figures into a proprietary system.

Consultants can still bill, of course. That’s one upside of any approach one wishes to embrace.

Stephen E Arnold, December 20, 2015

XML Marches On

December 2, 2015

For fans of XML and automated indexing, there’s a new duo in town. The shoot out at the JSON corral is not scheduled, but you can get the pre-showdown information in “Smartlogic and MarkLogic Corporation Enhance Platform Integration between Semaphore and MarkLogic Database.” Rumors of closer ties between the outfits surfaced earlier this year. I pinged one of the automated indexing company’s wizards and learned, “Nope, nothing going on.” Gee, I almost believed this until a Virtual Strategy story turned up. Virtual no more.

According to the write up:

Smartlogic, the Content Intelligence Company, today announced tighter software integration with MarkLogic, the Enterprise NoSQL database platform provider, creating a seamless approach to semantic information management where organizations maximize information to drive change. Smartlogic’s Content Intelligence capabilities provide a robust set of semantic tools which create intelligent metadata, enhancing the ability of the enterprise-grade MarkLogic database to power smarter applications.

For fans of user friendliness, the tie up may mean more XQuery scripting and some Semaphore tweaks. And JSON? Not germane.
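For readers who have not met XQuery-style work, the flavor of pulling Semaphore-style metadata out of an XML record can be approximated with Python’s standard library; the document structure here is invented for illustration:

```python
import xml.etree.ElementTree as ET

# Invented document: the sort of XML record a MarkLogic-style store
# holds, with machine-added semantic metadata alongside the content.
doc = ET.fromstring("""
<article>
  <title>Semantic Tagging</title>
  <metadata>
    <concept>taxonomy</concept>
    <concept>indexing</concept>
  </metadata>
</article>
""")

# Pull the machine-added concept tags, XPath-style.
concepts = [c.text for c in doc.findall("./metadata/concept")]
print(concepts)  # ['taxonomy', 'indexing']
```

XQuery does this sort of navigation natively over whole databases of such records; the sketch above only shows the shape of the task.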

What is germane is that Smartlogic may covet some of MarkLogic’s publishing licensees. After slicing and dicing, some of these outfits are having trouble finding out what their machine assisted editors have crafted with refined quantities of editorial humans.

Stephen E Arnold, December 2, 2015

Medical Publisher Does Rah Rah for MarkLogic

November 20, 2015

Now MarkLogic is a unicorn. The company wants to generate revenues. Okay. No problem.

I found “200-Year-Old Publisher Finds Happiness with NoSQL Database” quite interesting. The write up explains that the New England Journal of Medicine uses MarkLogic’s XML data management system to — well — manage its text and other content.

The write up states:

With features like XQuery, a SQL-like query engine for XML data, MarkLogic promised to retrieve unstructured data at speeds no SQL database could approach.

What did I note? The big thing is that this deal went down when MarkLogic was a “fledgling company.” Hmm. Was this a Dave Kellogg-era deal? I also noted that the write up did not beat the drum for MarkLogic as a business and government intelligence, email management, and analytics Swiss Army knife able to cut into the revenues of Oracle and other Codd database outfits.

MarkLogic’s marketing may be making progress by emphasizing what MarkLogic’s technology was built to deliver: a data management system for publishers. The publication still uses SQL for financial records and dabbles with the open source quasi-doppelgänger MongoDB.

MarkLogic hit a wall at about $60 million. Today the fledgling is a unicorn. Will MarkLogic put wings on its unicorn? Stakeholders sure think it is going to happen. For me, I will observe. Will the proprietary MarkLogic prevail, or will open source alternatives nibble into this box of Kellogg’s revenue?

Stephen E Arnold, November 20, 2015
