
The Database Divide: SQL or NoSQL

April 13, 2016

I enjoy reading about technical issues which depend on use cases. When I read “Big Data And RDBMS: Can They Coexist?”, I thought about the premise, not the article. Information Week is one of those once high-flying dead tree outfits which have embraced digital. My hunch is that the juicy headline is designed less to speak to technical issues and more to the need to create some traffic.

In my case, it worked. I clicked. I read. I moved on, because specific methods obviously exist for the simple reason that there are different problems to solve.

Here’s what I read after the lusted-after click:

Peaceful coexistence is turning out to be the norm, as the two technologies prove to be complementary, not exclusive. As much as casual observers would like to see big data technologies win the future, RDBMS (the basis for SQL and database systems such as Microsoft SQL Server, IBM DB2, Oracle, and MySQL) is going to stick around for a bit longer.

So this is news? In an organization, some types of use cases are appropriate for the row and column approach. Think Excel. Others are better addressed with a whizzy system like Cassandra or a similar data management tool.

The write up reported that Codd-based systems are pretty useful for transactions. Yep, that is accurate for most transactional applications. But there are some situations better suited to different approaches. My hunch is that is why Palantir Technologies developed its data management middleware AtlasDB, but let’s not get caught up in a specific approach.
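The transactional point is easy to demonstrate. Here is a minimal sketch, using Python’s built-in sqlite3 module and an invented accounts table, of the atomicity a Codd-style system delivers out of the box:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])

try:
    with conn:  # one transaction: commit on success, roll back on any error
        conn.execute("UPDATE accounts SET balance = balance - 25 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 25 WHERE id = 2")
except sqlite3.Error:
    pass  # after a rollback, both balances are untouched

print(conn.execute("SELECT id, balance FROM accounts").fetchall())

Either both updates land or neither does. Getting the same guarantee from an eventually consistent NoSQL store is the hard part.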

The write up points out that governance is a good idea. The context for governance is the SQL world, but my experience is that figuring out what to analyze and how to ensure “good enough” data quality is important for the NoSQL crowd as well.
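What does “good enough” look like in practice? Here is a minimal sketch of a quality gate, with hypothetical field names and rules, of the sort both the SQL and NoSQL crowds need before analysis begins:

REQUIRED = {"id", "timestamp", "amount"}

def passes_quality_gate(record: dict) -> bool:
    if not REQUIRED.issubset(record):
        return False  # missing fields
    if record["amount"] is None or record["amount"] < 0:
        return False  # implausible value
    return True

records = [
    {"id": 1, "timestamp": "2016-04-13", "amount": 9.99},
    {"id": 2, "timestamp": "2016-04-13", "amount": None},
]
clean = [r for r in records if passes_quality_gate(r)]
print(len(clean), "of", len(records), "records pass")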

I noted this statement from the wizard “Brown” who authored Data Mining for Dummies:

“Users are not always clear [RDBMS and big data] are different products,” Brown said. “The sales reps are steering them to whatever product they want [the users] to buy.”

Yep, sales. Writing about data can educate, entertain, or market.

In this case, the notion that the two technologies contend for attention does little to help one determine what method to use and when. Marketing triumphs.

Stephen E Arnold, April 13, 2016

AnalyzeThe.US, the 2016 Version?

April 12, 2016

I read “With Government Data Unlocked, MIT Tries to Make It Easier to Sort Through.” I came away from the write up a bit confused. I recall that Palantir Technologies offered, for a short period of time, a site called AnalyzeThe.US. It disappeared. I also recalled seeing a job posting for a person with a top secret clearance who knew Tableau (Excel on steroids) and Palantir Gotham (augmented intelligence). I am getting old, but I thought that Michael Kim, once a Deloitte wizard, gave a lecture about how one can use Palantir for analytics.

Why is this important?

The write up points out that MIT worked with Deloitte which, I learned:

provided funding and expertise on how people use government data sets in business and for research.

The Gray Lady’s article does not see any DNA linking AnalyzeThe.US, Deloitte, and the “new” Data USA site. Palantir’s Stephanie Yu gave a talk at MIT. I wonder if those in that session perceive any connection between Palantir and MIT. Who knows? I wonder if the MIT site makes use of AngularJS.

With regard to US government information, www.data.gov is still online. In my experience, the information can be a challenge to wrangle for a person without Tableau and Palantir expertise. For those who don’t think Palantir is into sales, my view is that Palantir sells via intermediaries. The deal, in this type of MIT case, is to try to get some MIT students bitten by the Gotham and Metropolis fever. Thank goodness I am not a real journalist trying to figure out who provides what to whom and for what reason. Okay, back to contemplating the pond filled with Kentucky mine run off water.

Stephen E Arnold, April 12, 2016

MarkLogic: Not Much Information about DI2E on the MarkLogic Web Site

April 11, 2016

Short honk: I have been thinking about MarkLogic in the context of Palantir Technologies. The two companies are sort of pals. Both companies are playing the high stakes game for next generation augmented intelligence systems for the Department of Defense. Palantir’s approach has been to generate revenues from sales to the intelligence community. MarkLogic’s approach has been to ride on the Distributed Common Ground System, which is now referenced in some non-Hunter circles as DI2E.

You can get a sense of what MarkLogic makes available by navigating to www.marklogic.com and running a query for DI2E or DCGS.

The Plugfest documents provide a snapshot of the vendors involved in this project as of December 2015. Here’s a snippet from the unclassified set of slides “Plugfest Industry Day: Plugfest/Mashup 2016.”

[Image: Plugfest industry partner diagram featuring Palantir and MarkLogic]

What caught my attention is that Palantir, which has its roots in CIA-type thought processes, is in the same “industry partner” illustration as MarkLogic. I noticed that IBM (the DB2 folks) and Oracle (the one-time champion in database technology) are also “partners.”

The only hitch in this “plugfest” partnering deal is Palantir’s quite interesting AtlasDB innovation and the disclosure of data management systems and methods in US 2016/0085817, “System and Method for Investigating Large Amounts of Data,” an invention of the now not-so-secret Hobbits Geoffrey Stowe, Chris Fischer, Paul George, Eli Bingham, and Rosco Hill.

Palantir’s one-two punch is AtlasDB and its data management method. The reason I find this interesting is that MarkLogic is the NoSQL, XML, slice-and-dice advanced technology which some individuals find difficult to use. IBM and Oracle are decidedly old school.

MarkLogic may not publicize its involvement in DCGS/DI2E, but the revenue is important for MarkLogic and the other vendors in the “partnering” diagram. Palantir, however, has been diversifying with, from what I hear, considerable success.

MarkLogic is a Silicon Valley innovator which opened its doors in 2001. Yep, that’s 15 years ago. Palantir Technologies is the newer kid on the block. The company was set up in 2003; that’s 13 years ago. What I find interesting is that MarkLogic’s approach is looking a bit long in the tooth. Palantir’s approach is a bit more current, and its user experience is friendlier than wrestling with XQuery and its extensions.

What happens if Palantir becomes the plumbing for the DCGS/DI2E system? Perhaps IBM or Oracle will have to think about acquiring Palantir. With technology IPOs somewhat rare, Palantir stakeholders may find that thinking the unthinkable is attractive.

What happens if Palantir takes its commercial business into a separate company and then formulates a deal to sell only the high-vitamin augmented intelligence business? MarkLogic may be faced with some difficult choices. Simplifying its data management and query systems may be child’s play compared to figuring out what its future will be if either IBM or Oracle snaps up the quite interesting Palantir technologies, particularly the database and data management systems.

Watch for my for-fee report about Palantir Technologies. There will be a discounted price for law enforcement and intelligence professionals and another price for those not engaged in these two disciplines. Expect the report in early summer 2016. A small segment of the Palantir special report will appear in the forthcoming “Dark Web Notebook”, which I referenced in the Singularity 1 on 1 interview in mid-March 2016. To reserve copies of either of these two new monographs, write benkent2020 at Yahoo dot com.

Stephen E Arnold, April 11, 2016

The Forrester Wave Becomes Blobs

April 4, 2016

I want you to know that I read this statement attached to the illustration in “Master Data Management: Which MDM Tool Is Right For You?”; to wit:

Unauthorized reproduction, citation, or distribution prohibited.

Okay, none of that, gentle reader. The Forrester Wave has morphed. It used to be a knock-off of the Eisenhower grid, which the Boston Consulting Group reinvented. The new look is like this:

[Image: the new blob-style Forrester Wave diagram]

Remember Psych 101? What do you see? How do you feel about that? What do you mean it looks like a dog’s breakfast? Do you love your mother?

Each tinted region denotes a type of Master Data Management classification. The classifications, which the mid tier consulting firm generated from a rigorous statistical analysis of the data available to the wizards working on this report, are:

  • Integration model vendors
  • Logical model vendors
  • Contextual model vendors
  • Analytic model vendors.

I am not sure what the differences among the categories are because I am familiar with some of the outfits in the Master Data Management space, and it seems to me that outfits like IBM, Oracle, and others offer a range of Master Data Management services and capabilities. Hey, I don’t want to assemble the bits and pieces on offer from IBM into a functioning solution, but I suppose one can.

What companies deliver what function in this Rorschachian analysis?

Integration model, a pale blue horizontal elliptical blob:

[Image: pale blue horizontal elliptical blob]

  • Dell Boomi
  • Information Builders
  • Microsoft
  • Profisee (like prophecy I assume)
  • Software AG
  • Semarchy
  • Teradata
  • Tibco (the data bus folks)

Analytic model, a gray circle:

[Image: gray circle]

  • Novetta
  • Reltio

Logical model, a blue gray ellipse which looks like an egg standing on one end:

[Image: blue gray ellipse, an egg standing on one end]

  • SAS
  • SAP
  • IBM
  • Tibco (yep, in two places at once like an entangled particle)
  • Software AG (only the “ware AG” makes it into the logical egg thing)
  • Information Builders (the “builders” component is logical. Go figure.)
  • Teradata (yikes, just the “ata” is logical. Makes sense to the mid tier crowd I assume.)

Contextual model, which looks like a fried egg to me with a context of breakfast:

[Image: fried egg shaped region]

  • Informatica (another outfit which is like a satyr, half one thing and half another)
  • Liaison Technologies
  • Magnitude Software (the outfit is another entangled MDM provider because it is included in the logical model. Socrates, got that?)
  • Orchestra Software (also part of the logical category, like a Rap musician who fills in when the first violin at the London Philharmonic is on holiday)
  • Pitney Bowes (the postage meter outfit?)
  • Verato

I wish I could reproduce the diagram, but there is that legal threat. A legal threat is one way to make sure that constructive criticism of the blobs is constrained. I suppose my representations of the geometry of the analysis connects the dots for you, gentle reader. If not, the mid tier wizards will explain their “real” intent.

I love the fried egg group. How about some hot cakes with that analysis? Also, no half baked biscuits with that, please.

Stephen E Arnold, April 4, 2016

Big Data and Its Fry Cooks Who Clean the Grill

April 1, 2016

I read “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says.” A survey?

According to the capitalist tool:

A new survey of data scientists found that they spend most of their time massaging rather than mining or modeling data.

The point is that few wizards want to come to grips with the problem of figuring out what’s wrong with data in a set or a stream and then getting the data into a form that can be used with reasonable confidence.

Those exception folders, annoying, aren’t they?

The write up points out that a data scientist spends 80 percent of his or her time doing housecleaning. Skip the job and the house becomes unpleasant indeed.
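What does the housecleaning look like? A small sketch using pandas, with made-up column names: deduplicate, coerce types, and route the junk to the exception folder.

import pandas as pd

df = pd.DataFrame({
    "customer": ["Acme", "Acme", "Globex", None],
    "revenue":  ["1000", "1000", "n/a", "250"],
})

df = df.drop_duplicates()                                      # remove repeat rows
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")  # "n/a" becomes NaN

exceptions = df[df["customer"].isna() | df["revenue"].isna()]  # for human review
clean = df.dropna()                                            # what analysis gets

print(clean)
print(len(exceptions), "rows routed to the exception folder")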

The survey also reveals that data scientists have to organize the data to be analyzed. Imagine that. The baloney about automatically sucking in a wide range of data does not match the reality of the survey sample.

Another grim bit of drudgery emerges from the sample, which we assume was gathered with the appropriate textbook procedures: the skills most in demand were SQL skills. Yep, old school.

Consider that most of the companies marketing next generation data mining and analytics systems never discuss grunt work and old fashioned data management.

Why the disconnect?

My hunch is that it is the sizzle, not the steak, which sells. Little wonder that some analytics outputs might be lab-made hamburger.

Stephen E Arnold, April 1, 2016

Search as a Framework

March 26, 2016

A number of search and content processing vendors suggest their information access system can function as a framework. The idea is that search is more than a utility function.

If the information in the article “Abusing Elasticsearch as a Framework” is spot on, a non-search vendor may have taken an important step toward making an assertion into a reality.

The article states:

Crate is a distributed SQL database that leverages Elasticsearch and Lucene. In its infant days it parsed SQL statements and translated them into Elasticsearch queries. It was basically a layer on top of Elasticsearch.

The idea is that the framework uses Elasticsearch’s discovery, master election, replication, and related cluster services along with the Lucene search and indexing operations.

Crate, in short, treats the search engine as plumbing: Elasticsearch supplies the cluster machinery, Lucene supplies the indexing, and Crate supplies the SQL face.
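The “layer on top” idea can be sketched in a few lines of Python. This toy translator, which handles only one hypothetical statement shape, turns a SQL predicate into the Elasticsearch term query the engine actually runs:

import re

def sql_to_es(sql):
    # toy grammar: SELECT * FROM <index> WHERE <field> = '<value>'
    m = re.match(r"SELECT \* FROM (\w+) WHERE (\w+) = '([^']*)'", sql)
    if not m:
        raise ValueError("unsupported statement")
    index, field, value = m.groups()
    return index, {"query": {"term": {field: value}}}

index, body = sql_to_es("SELECT * FROM logs WHERE status = 'error'")
print(index, body)  # logs {'query': {'term': {'status': 'error'}}}

Crate’s real parser is, of course, far more capable; the sketch merely shows why “layer on top of Elasticsearch” is an apt description.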

Stephen E Arnold, March 26, 2016

A Data Lake: Batch Job Dipping Only

February 11, 2016

I love the Hadoop data lake concept. I live in a mostly real time world. The “batch” approach reminds me of my first exposure to computing in 1962. Real time? Give me a break. Hadoop reminded me of those early days. Fun. Standing on line. Waiting and waiting.

I read “Data Lake: Save Me More Money vs. Make Me More Money.” The article strikes me as a conference presentation illustrated with a deck of PowerPoint goodies.

One of the visuals was a modern big data analytics environment. I have seen a number of representations of today’s big data yadda yadda set ups. Here’s the EMC take on modernity:

[Image: EMC diagram of a modern big data analytics environment]

Straight away, I note the “all” word. Yep, just put the categorical affirmative into a Hadoop data lake. Don’t forget the video, the wonky stuff in the graphics department, the engineering drawings, and the most recent version of the merger documents requested by a team of government investigators, attorneys, and a pesky solicitor from some small European Community committee. “All” means all, right?

Then there are two “environments.” Okay, a data lake can have ecosystems, so the word environment is okay for flora and fauna. I think the notion is to build two separate analytic subsystems. Interesting approach, but there are platforms which offer applications to handle most of the data slap-about work. Why not license one of those, for example, Palantir or Recorded Future?

And that’s it?

Well, no. The write up states that the approach will “save me more money.” In fact, one need not read much more than this:

The savings from these “Save me more money” activities can be nice with a Return on Investment (ROI) typically in the 10% to 20% range. But if organizations stop there, then they are leaving the 5x to 10x ROI projects on the table. Do I have your attention now?

My answer, “No, no, you do not.”
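Still, the arithmetic behind the pitch is simple enough to check. A back-of-the-envelope sketch with invented dollar figures:

investment = 1000000  # hypothetical project cost

# mid-points of the ranges in the quoted passage
save_me = investment * 0.15  # "save me more money": 10% to 20% ROI
make_me = investment * 7.5   # "make me more money": 5x to 10x ROI

print(f"savings project return: ${save_me:,.0f}")
print(f"revenue project return: ${make_me:,.0f}")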

Stephen E Arnold, February 11, 2016

Big Data Blending Solution

January 20, 2016

I would have used Palantir or maybe our own tools. But an outfit named National Instruments found a different way to perform data blending. “How This Instrument Firm Tackled Big Data Blending” provides a case study and a rah rah for Alteryx. Here’s the paragraph I highlighted:

The software it [National Instruments] selected, from Alteryx, takes a somewhat unique approach in that it provides a visual representation of the data transformation process. Users can acquire, transform, and blend multiple data sources essentially by dragging and dropping icons on a screen. This GUI approach is beneficial to NI employees who aren’t proficient at manipulating data using something like SQL.

The graphical approach has been part of a number of tools. There are also some systems which just figure out where to put what.
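Under the hood, “blending” structured sources is a join, whether one drags icons or writes SQL. A minimal pandas sketch with made-up data:

import pandas as pd

# two hypothetical structured sources keyed on the same field
crm = pd.DataFrame({"account": ["Acme", "Globex"], "region": ["East", "West"]})
erp = pd.DataFrame({"account": ["Acme", "Globex"], "orders": [12, 7]})

blended = crm.merge(erp, on="account", how="left")  # one row per account
print(blended)

The drag-and-drop GUI earns its keep by hiding this step, not by eliminating it.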

The issue for me is, “What happens to rich media like imagery and unstructured information like email?”

There are systems which handle these types of content.

Another challenge is the dependence on structured relational data tables. Certain types of operations are difficult in this environment.

The write up is interesting, but it reveals that a narrow view of available tools may produce a partial solution.

Stephen E Arnold, January 20, 2016

Machine Learning Hindsight

January 18, 2016

Have you ever found yourself saying, “If I only knew then what I know now”? It is a moment we all experience, but instead of stewing over our past mistakes, it is better to share the lessons we have learned with others. Data scientist Peadar Coyle learned some valuable lessons when he first started working with machine learning. He discusses three main things he learned in the article “Three Things I Wish I Knew Earlier About Machine Learning.”

Here are the three items he wishes he knew then about machine learning but knows now:

  • “Getting models into production is a lot more than just micro services
  • Feature selection and feature extraction are really hard to learn from a book
  • The evaluation phase is really important”

Developing models is an easy step, but putting them into production is difficult. There are many major steps that need attending to, and doing all of the little jobs is not feasible on huge projects. Peadar recommends outsourcing when you can. Books and online information are good reference tools, but when they cannot be applied to actual situations, the knowledge is useless. Peadar learned that real-world experience has no substitute. Just as real-world experience is invaluable, so is the evaluation phase. Life does not hand you perfect datasets, and testing against varied situations will better evaluate the model.
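The evaluation point merits a concrete illustration. Here is a minimal scikit-learn sketch, using the bundled iris data as a stand-in for messier real-world sets, of why cross-validation beats trusting one lucky train/test split:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

data = load_iris()
X, y = data.data, data.target
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)  # five different train/test splits
print("accuracy per fold:", scores)
print("mean: %.3f, spread: %.3f" % (scores.mean(), scores.std()))

The spread across folds is the tell: a model that looks brilliant on one split and mediocre on another has not earned anyone’s confidence.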

Peadar’s advice applies to machine learning, but it applies more to life in general.


Whitney Grace, January 18, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Open Source Data Management: It Is Now Easy to Understand

January 10, 2016

I read “16 for 16: What You Must Know about Hadoop and Spark Right Now.” I like the “right now.” Urgency. I am not sure I feel too much urgency at the moment. I will leave that wonderful feeling to the executives who have sucked in venture money and have to find a way to generate revenue in the next 11 months.

The article runs down the basic generalizations associated with each of these open source data management components:

  • Spark
  • Hive
  • Kerberos
  • Ranger/Sentry
  • HBase/Phoenix
  • Impala
  • Hadoop Distributed File System (HDFS)
  • Kafka
  • Storm/Apex
  • Ambari/Cloudera Manager
  • Pig
  • Yarn/Mesos
  • Nifi/Kettle
  • Knox
  • Scala/Python
  • Zeppelin/Databricks

What the list tells me is two things. First, the proliferation of open source data tools is thriving. Second, there will have to be quite a few committed developers to keep these projects afloat.

The write up is not content with this shopping list. The intrepid reader will have an opportunity to learn a bit about:

  • Kylin
  • Atlas/Navigator

As the write up swoops to its end point, I learned about some open source projects which are a bit of a disappointment; for example, Oozie and Tez.

The key point of the article is that Google’s MapReduce, which is pretty long in the tooth, is now effectively marginalized.
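A minimal PySpark sketch shows why. The canonical MapReduce job, word count, collapses into a few lines (this assumes pyspark is installed and a local words.txt file exists):

from pyspark import SparkContext

sc = SparkContext("local", "wordcount")

counts = (sc.textFile("words.txt")
          .flatMap(lambda line: line.split())   # map: emit words
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))     # reduce: sum the ones

for word, n in counts.take(10):
    print(word, n)

sc.stop()

Same map and reduce idea, minus the ceremony, and the intermediate results stay in memory.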

The Balkanization of data management is evident. The challenge will be to use one or more of these technologies to make some substantial revenue flow.

What happens if a company jumps on the wrong bandwagon as it leaves the parade ground? I would suggest that it may be more like a Pig than an Atlas. The investors will change from Rangers looking for profits to Pythons ready to strike. A Spark can set fire to some hopes and dreams in the Hive. Poorly constructed walls of Databricks can come falling down. That will be an Oozie.

Dear old Oracle, DB2, and SQL Server will just watch.

Stephen E Arnold, January 10, 2016
