
MarkLogic Tells a Good Story

May 25, 2016

I lost track of MarkLogic when the company hit about $51 million in revenue and changed CEOs in 2006. Another CEO change took place in 2012, when Gary Bloom, a former Oracle executive, took over. According to “Gary Bloom Interview: Big Data Driving Sales Boom at MarkLogic,” the company is now “topping” $100 million in annual revenue.

MarkLogic is one of the outfits laboring in the DCGS / DI2E vineyard. The company may be butting heads with outfits like Palantir Technologies as the US Army’s plan to federate its systems and data moves forward.

MarkLogic opened for business in 2003 and has ingested, according to Crunchbase, $175 million in venture funding. With a timeline equivalent to Palantir Technologies’, there may be some value in comparing these two “startups” and their performance. That is an exercise better left to the feisty young MBAs who have to produce a return for the Sequoia and Wellington experts.

The interview contained two interesting statements which I found surprising:

The driver is Big Data: large corporations are convinced there is an El Dorado of untapped commercial opportunities — if only they can run their reports across all their data sources. But integrating all that data is too costly, and takes too long with relational databases. The future will be full of data in many forms, formats, and sources and how that data is used will be the differentiator in many competitive battles. If that data can’t be searched it can’t be used.

That is indeed the belief and the challenge. Based on what I have learned via open sources about the DCGS project, the reality is different from the “all” notions which fill the heads of some of the vendors delivering a comprehensive intelligence system to US government clients. In fact, the reality today seems to me to be similar to the hope for the Convera system when it was doing the “all” approach to some US government information. That, as you may recall, did not work out as some had hoped.

The second statement I highlighted is:

Although MarkLogic is tiny compared to Oracle, there are some interesting parallels. “MarkLogic is at about the same size as Oracle was when I began working there. It took a long time for Oracle to get security and other enterprise features right, but when it did, that was when the company really took off.”

The stakeholders hope that MarkLogic does “take off.” With more than 12 years of performance history under its belt, MarkLogic could be the next big thing. The only hitch in the git along is that normalization of information and data has to take place. Then there is the challenge of the query language. One cannot overlook the competitors which continue to bedevil those in the data management game.

With Oracle also involved in some US government work, there might be a bit of push back as the future of MarkLogic rolls forward. What happens if IBM’s data management systems group decides to acquire MarkLogic? Excitement? Perhaps.

Stephen E Arnold, May 25, 2016

DGraph Labs Startup Aims to Fill Gap in Graph Database Market

May 24, 2016

The article on GlobeNewsWire titled Ex-Googler Startup DGraph Labs Raises US$1.1 Million in Seed Funding Round to Build Industry’s First Open Source, Native and Distributed Graph Database names Bain Capital Ventures and Blackbird Ventures as the main investors in the startup. Manish Jain, founder and CEO of DGraph, worked on Google’s Knowledge Graph Infrastructure for six years. He explains the technology,

“Graph data structures store objects and the relationships between them. In these data structures, the relationship is as important as the object. Graph databases are, therefore, designed to store the relationships as first class citizens… Accessing those connections is an efficient, constant-time operation that allows you to traverse millions of objects quickly. Many companies including Google, Facebook, Twitter, eBay, LinkedIn and Dropbox use graph databases to power their smart search engines and newsfeeds.”
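Jain’s point about relationships being first-class, constant-time lookups can be sketched with a plain adjacency map. This is an illustration of the general idea, not DGraph’s implementation; the node and predicate names are made up.

```python
# Minimal sketch of a graph store in which edges (relationships) are
# first-class: fetching a node's neighbors is a dictionary access, so
# traversal cost scales with the edges visited, not with the total
# number of objects stored.
from collections import defaultdict

class TinyGraph:
    def __init__(self):
        # node -> list of (predicate, target node) pairs
        self.edges = defaultdict(list)

    def add(self, subject, predicate, obj):
        self.edges[subject].append((predicate, obj))

    def neighbors(self, node, predicate):
        return [o for p, o in self.edges[node] if p == predicate]

g = TinyGraph()
g.add("alice", "follows", "bob")
g.add("bob", "follows", "carol")

# Two-hop traversal: whom do the people Alice follows follow?
second_hop = [fof
              for friend in g.neighbors("alice", "follows")
              for fof in g.neighbors(friend, "follows")]
print(second_hop)  # ['carol']
```

The same two-hop query against a relational schema would require a self-join; here it is two dictionary lookups.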

Among the many applications of graph databases are the Internet of Things, behavior analysis, medical and DNA research, and AI. So what is DGraph going to do with its fresh funds? Jain wants to focus on forging a talented team of engineers and developing the company’s core technology. He notes in the article that this sort of work is hardly the typical obstacle faced by a startup, but rather the focus of major tech companies like Google or Facebook.


Chelsea Kerwin, May 24, 2016

Sponsored by, publisher of the CyberOSINT monograph

The Trials, Tribulations, and Party Anecdotes Of “Edge Case” Names

May 16, 2016

The article titled These Unlucky People Have Names That Break Computers on BBC Future delves into the strange world of “edge cases” or people with unexpected or problematic names that reveal glitches in the most commonplace systems that those of us named “Smith” or “Jones” take for granted. Consider Jennifer Null, the Virginia woman who can’t book a plane ticket or complete her taxes without extensive phone calls and headaches. The article says,

“But to any programmer, it’s painfully easy to see why “Null” could cause problems for a database. This is because the word “null” is often inserted into database fields to indicate that there is no data there. Now and again, system administrators have to try and fix the problem for people who are actually named “Null” – but the issue is rare and sometimes surprisingly difficult to solve.”

It may be tricky to find people with names like Null. Because of the nature of the controls related to names, issues generally arise for people like Null on systems where it actually does matter, like government forms. This is not an issue unique to the US, either. One Patrick McKenzie, an American programmer living in Japan, has run into regular difficulties because of the length of his last name. But that is nothing compared to Janice Keihanaikukauakahihulihe’ekahaunaele, a Hawaiian woman who championed more flexibility in name length restrictions for state ID cards.
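Both failure modes described above are easy to reproduce. The sketch below uses hypothetical rules (treating the literal string “NULL” as a missing value, capping a surname field at 20 characters) to stand in for the kinds of constraints legacy systems impose; it is not any real agency’s validation code.

```python
# Sketch of how naive name handling breaks for real people.

def load_surname(raw: str):
    # Legacy exports often write "NULL" for missing values, so a
    # careless loader erases Ms. Null's actual surname.
    if raw.strip().upper() == "NULL":
        return None
    return raw.strip()

def fits_id_card(surname: str, max_len: int = 20) -> bool:
    # A fixed-width field rejects long but perfectly valid names.
    return len(surname) <= max_len

print(load_surname("Smith"))  # 'Smith'
print(load_surname("Null"))   # None -- her name just vanished
print(fits_id_card("Keihanaikukauakahihulihe'ekahaunaele"))  # False
```

The robust fix is to keep “no value” as a distinct marker (a real NULL or None) rather than a magic string, and to size name fields generously.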


Chelsea Kerwin, May 16, 2016



The Database Divide: SQL or NoSQL

April 13, 2016

I enjoy reading about technical issues which depend on use cases. When I read “Big Data And RDBMS: Can They Coexist?”, I thought about the premise, not the article. Information Week is one of those once high-flying dead tree outfits which have embraced digital. My hunch is that the juicy headline is designed less to speak to technical issues and more to the need to create some traffic.

In my case, it worked. I clicked. I read. I ignored because obviously specific methods exist because there are different problems to solve.

Here’s what I read after the lusted-after click:

Peaceful coexistence is turning out to be the norm, as the two technologies prove to be complementary, not exclusive. As much as casual observers would like to see big data technologies win the future, RDBMS (the basis for SQL and database systems such as Microsoft SQL Server, IBM DB2, Oracle, and MySQL) is going to stick around for a bit longer.

So this is news? In an organization, some types of use cases are appropriate for the row and column approach. Think Excel. Others are better addressed with a whizzy system like Cassandra or a similar data management tool.

The write up reported that Codd-based systems are pretty useful for transactions. Yep, that is accurate for most transactional applications. But there are some situations better suited to different approaches. My hunch is that is why Palantir Technologies developed its data management middleware AtlasDB, but let’s not get caught in a specific approach.
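The transactional strength of the Codd camp is easy to demonstrate with Python’s built-in sqlite3 module: either every statement in a unit of work commits, or none does. A minimal sketch, with made-up account data:

```python
# ACID transactions are where relational engines earn their keep:
# a failure mid-transfer rolls back every change in the unit of work.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # the connection context manager wraps one transaction
        conn.execute(
            "UPDATE accounts SET balance = balance - 70 WHERE name = 'alice'")
        conn.execute(
            "UPDATE accounts SET balance = balance + 70 WHERE name = 'bob'")
        # Simulate a business-rule failure mid-transfer:
        raise ValueError("fraud check failed")
except ValueError:
    pass  # both updates were rolled back together

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50} -- nothing half-applied
```

Many NoSQL stores historically traded this all-or-nothing guarantee for scale, which is exactly why the two camps end up coexisting.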

The write up points out that governance is a good idea. The context for governance is the SQL world, but my experience is that figuring out what to analyze and how to ensure “good enough” data quality is important for the NoSQL crowd as well.

I noted this statement from the wizard “Brown” who authored Data Mining for Dummies:

“Users are not always clear [RDBMS and big data] are different products,” Brown said. “The sales reps are steering them to whatever product they want [the users] to buy.”

Yep, sales. Writing about data can educate, entertain, or market.

In this case, the notion that the two technologies themselves contend for attention does little to help one determine what method to use and when. Marketing triumphs.

Stephen E Arnold, April 13, 2016

AnalyzeThe.US, the 2016 Version?

April 12, 2016

I read “With Government Data Unlocked, MIT Tries to Make It Easier to Sort Through.” I came away from the write up a bit confused. I recall that Palantir Technologies offered for a short period of time a site called AnalyzeThe.US. It disappeared. I also recalled seeing a job posting for a person with a top secret clearance who knew Tableau (Excel on steroids) and Palantir Gotham (augmented intelligence). I am getting old, but I thought that Michael Kim, once a Deloitte wizard, gave a lecture about how one can use Palantir for analytics.

Why is this important?

The write up points out that MIT worked with Deloitte which, I learned:

provided funding and expertise on how people use government data sets in business and for research.

The Gray Lady’s article does not see any DNA linking AnalyzeThe.US, Deloitte, and the “new” Data USA site. Palantir’s Stephanie Yu gave a talk at MIT. I wonder if those in that session perceive any connection between Palantir and MIT. Who knows. I wonder if the MIT site makes use of AngularJS.

With regard to US government information, is still online. The information can be a challenge for a person without Tableau and Palantir expertise to wrangle, in my experience. For those who don’t think Palantir is into sales, my view is that Palantir sells via intermediaries. The deal, in this type of MIT case, is to get some MIT students bitten by the Gotham and Metropolis fever. Thank goodness I am not a real journalist trying to figure out who provides what to whom and for what reason. Okay, back to contemplating the pond filled with Kentucky mine run off water.

Stephen E Arnold, April 12, 2016

MarkLogic: Not Much Information about DI2E on the MarkLogic Web Site

April 11, 2016

Short honk: I have been thinking about MarkLogic in the context of Palantir Technologies. The two companies are sort of pals. Both companies are playing the high stakes game for next generation augmented intelligence systems for the Department of Defense. Palantir’s approach has been to generate revenues from sales to the intelligence community. MarkLogic’s approach has been to ride on the Distributed Common Ground System, which is now referenced in some non-Hunter circles as DI2E.

You can get a sense of what MarkLogic makes available by navigating to and running a query for DI2E or DCGS.

The Plugfest documents provide a snapshot of the vendors involved as of December 2015 in this project. Here’s a snippet from the unclassified set of slides “Plugfest Industry Day: Plugfest/Mashup 2016.”

palantir vs marklogic plugfest

What caught my attention is that Palantir, which has its roots in CIA-type thought processes, is in the same “industry partner” illustration as MarkLogic. I noticed that IBM (the DB2 folks) and Oracle (the one-time champion in database technology) are also “partners.”

The only hitch in this “plugfest” partnering deal is Palantir’s quite interesting AtlasDB innovation and the disclosure of data management systems and methods in US 2016/0085817, “System and Method for Investigating Large Amounts of Data,” an invention of the now not-so-secret Hobbits Geoffrey Stowe, Chris Fischer, Paul George, Eli Bingham, and Rosco Hill.

Palantir’s one-two punch is AtlasDB and its data management method. The reason I find this interesting is that MarkLogic is the NoSQL, XML, slice-and-dice advanced technology which some individuals find difficult to use. IBM and Oracle are decidedly old school.

MarkLogic may not publicize its involvement in DCGS/DI2E, but the revenue is important for MarkLogic and the other vendors in the “partnering” diagram. Palantir, however, has been diversifying with, from what I hear, considerable success.

MarkLogic is a Silicon Valley innovator which opened its doors in 2001. Yep, that’s 15 years ago. Palantir Technologies is the newer kid on the block. The company was set up in 2003; that’s 13 years ago. What I find interesting is that MarkLogic’s approach is looking a bit long in the tooth. Palantir’s approach is a bit more current, and its user experience is more friendly than wrestling with XQuery and its extensions.
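For a feel of the XML slice-and-dice involved, here is the flavor of a path-based lookup using Python’s standard library, which supports a small subset of XPath. This is a generic illustration of XML querying, not MarkLogic’s API, and the document content is invented; full XQuery layers FLWOR expressions, typing, and extension functions on top of this kind of path navigation.

```python
# A taste of path-based XML querying with the standard library's
# limited XPath support. Generic illustration only.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<reports>
  <report id="r1"><agency>Army</agency><topic>DCGS</topic></report>
  <report id="r2"><agency>Navy</agency><topic>logistics</topic></report>
</reports>
""")

# Find ids of reports whose <topic> element is DCGS.
ids = [r.get("id") for r in doc.findall("report")
       if r.findtext("topic") == "DCGS"]
print(ids)  # ['r1']
```

Even this trivial lookup requires knowing the document structure in advance, which hints at why some users find XML-first systems hard going.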

What happens if Palantir becomes the plumbing for the DCGS/DI2E system? Perhaps IBM or Oracle will have to think about acquiring Palantir. With technology IPOs somewhat rare, Palantir stakeholders may find that thinking the unthinkable is attractive.

What happens if Palantir takes its commercial business into a separate company and then formulates a deal to sell only the high-vitamin augmented intelligence business? MarkLogic may be faced with some difficult choices. Simplifying its data management and query systems may be child’s play compared to figuring out what its future will be if either IBM or Oracle snap up the quite interesting Palantir technologies, particularly the database and data management systems.

Watch for my for-fee report about Palantir Technologies. There will be a discounted price for law enforcement and intelligence professionals and another price for those not engaged in these two disciplines. Expect the report in early summer 2016. A small segment of the Palantir special report will appear in the forthcoming “Dark Web Notebook”, which I referenced in the Singularity 1 on 1 interview in mid-March 2016. To reserve copies of either of these two new monographs, write benkent2020 at Yahoo dot com.

Stephen E Arnold, April 11, 2016

The Forrester Wave Becomes Blobs

April 4, 2016

I want you to know that I read this statement attached to the illustration in “Master Data Management: Which MDM Tool Is Right For You?”; to wit:

Unauthorized reproduction, citation, or distribution prohibited.

Okay, none of that, gentle reader. The Forrester Wave has morphed from a knock-off of the Eisenhower grid, which was reinvented by Boston Consulting Group. The new look is like this:


Remember Psych 101? What do you see? How do you feel about that? What do you mean it looks like a dog’s breakfast? Do you love your mother?

Each tinted region denotes a type of Master Data Management classification. The classifications which the mid tier consulting firm generated from a rigorous statistical analysis of the data available to the wizards working on this report are:

  • Integration model vendors
  • Logical model vendors
  • Contextual model vendors
  • Analytic model vendors.

I am not sure what the differences in the categories are because I am familiar with some of the outfits in the Master Data Model space and it seems to me that outfits like IBM, Oracle, and others offer a range of Master Data Model services and capabilities. Hey, I don’t want to assemble the bits and pieces on offer from IBM into a functioning solution, but I suppose one can.

What companies deliver what function in this Rorschachian analysis?

Integration model, a pale blue horizontal elliptical blob:

elliptical blob

  • Dell Boomi
  • Information Builders
  • Microsoft
  • Profisee (like prophecy I assume)
  • Software AG
  • Semarchy
  • Teradata
  • Tibco (the data bus folks)

Analytic model, a gray circle:


  • Novetta
  • Reltio

Logical model, a blue gray ellipse which looks like an egg standing on one end:

balanced egg

  • SAS
  • SAP
  • IBM
  • Tibco (yep, in two places at once like an entangled particle)
  • Software AG (only the “ware AG” makes it into the logical egg thing)
  • Information Builders (the “builders” component is logical. Go figure.)
  • Teradata (yikes, just the “ata” is logical. Makes sense to the mid tier crowd I assume.)

Contextual model, which looks like a fried egg to me with a context of breakfast:


  • Informatica (another outfit which is like a satyr, half one thing and half another)
  • Liaison Technologies
  • Magnitude Software (the outfit is another entangled MDM provider because it is included in the logical model. Socrates, got that?)
  • Orchestra Software (also part of the logical category, like a Rap musician who fills in when the first violin at the London Philharmonic is on holiday).
  • Pitney Bowes (the postage meter outfit?)
  • Verato

I wish I could reproduce the diagram, but there is that legal threat. A legal threat is one way to make sure that constructive criticism of the blobs is constrained. I suppose my representations of the geometry of the analysis connects the dots for you, gentle reader. If not, the mid tier wizards will explain their “real” intent.

I love the fried egg group. How about some hot cakes with that analysis? Also, no half baked biscuits with that, please.

Stephen E Arnold, April 4, 2016

Big Data and Its Fry Cooks Who Clean the Grill

April 1, 2016

I read “Cleaning Big Data: Most Time Consuming, Least Enjoyable Data Science Task, Survey Says.” A survey?

According to the capitalist tool:

A new survey of data scientists found that they spend most of their time massaging rather than mining or modeling data.

The point is that few wizards want to come to grips with the problem of figuring out what’s wrong with data in a set or a stream and then getting the data into a form that can be used with reasonable confidence.

Those exception folders, annoying, aren’t they?

The write up points out that a data scientist spends 80 percent of his or her time doing housecleaning. Skip the job and the house becomes unpleasant indeed.

The survey also reveals that data scientists have to organize the data to be analyzed. Imagine that. The baloney about automatically sucking in a wide range of data does not match the reality of the survey sample.
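What that 80 percent of massaging looks like in practice can be sketched in a few lines: normalize dates, coerce numbers, and shunt anything malformed into an exception pile instead of silently ingesting it. The field names and formats below are invented for illustration.

```python
# Sketch of the unglamorous 'massaging' stage of data science:
# normalize two date formats, coerce locale-flavored numbers, and
# route hopeless rows to an exception list for a human to inspect.
from datetime import datetime

raw_rows = [
    {"date": "2016-04-01", "amount": "12.50"},
    {"date": "04/01/2016", "amount": "12,50"},   # different locale
    {"date": "yesterday",  "amount": "twelve"},  # hopeless
]

clean, exceptions = [], []
for row in raw_rows:
    try:
        for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
            try:
                date = datetime.strptime(row["date"], fmt)
                break
            except ValueError:
                continue
        else:
            raise ValueError("unparseable date")
        amount = float(row["amount"].replace(",", "."))
        clean.append({"date": date.date().isoformat(), "amount": amount})
    except ValueError:
        exceptions.append(row)

print(len(clean), len(exceptions))  # 2 1
```

Multiply this by dozens of sources and formats, and the survey’s 80 percent figure stops looking surprising.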

Another grim bit of drudgery emerges from the sample, which we assume was gathered with the appropriate textbook procedures: the skills most in demand were for SQL. Yep, old school.

Consider that most of the companies marketing next generation data mining and analytics systems never discuss grunt work and old fashioned data management.

Why the disconnect?

My hunch is that it is the sizzle, not the steak, which sells. Little wonder that some analytics outputs might be lab-made hamburger.

Stephen E Arnold, April 1, 2016

Search as a Framework

March 26, 2016

A number of search and content processing vendors suggest their information access system can function as a framework. The idea is that search is more than a utility function.

If the information in the article “Abusing Elasticsearch as a Framework” is spot on, a non search vendor may have taken an important step to making an assertion into a reality.

The article states:

Crate is a distributed SQL database that leverages Elasticsearch and Lucene. In its infant days it parsed SQL statements and translated them into Elasticsearch queries. It was basically a layer on top of Elasticsearch.

The idea is that the framework uses discovery, master election, replication, etc., along with the Lucene search and indexing operations.

Crate, the framework, is a distributed SQL database “that leverages Elasticsearch and Lucene.”
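The layering the article describes, parse SQL and emit Elasticsearch queries, can be caricatured in a few lines. Crate uses a proper SQL parser; this toy, written purely as an illustration, handles only a single `column = 'value'` predicate and emits the corresponding Elasticsearch `term` query body.

```python
# Toy version of the 'SQL layer over a search engine' idea: translate
# "SELECT * FROM t WHERE col = 'val'" into the JSON body that an
# Elasticsearch term query expects. Illustration only.
import re

def sql_to_es(sql: str) -> dict:
    m = re.match(
        r"SELECT \* FROM (\w+) WHERE (\w+) = '([^']*)'", sql, re.I)
    if not m:
        raise ValueError("unsupported statement")
    _table, column, value = m.groups()
    return {"query": {"term": {column: value}}}

print(sql_to_es("SELECT * FROM logs WHERE host = 'web01'"))
# {'query': {'term': {'host': 'web01'}}}
```

The interesting engineering in a real system is everything this sketch skips: joins, aggregations, type handling, and routing queries across the cluster.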

Stephen E Arnold, March 26, 2016

A Data Lake: Batch Job Dipping Only

February 11, 2016

I love the Hadoop data lake concept. I live in a mostly real time world. The “batch” approach reminds me of my first exposure to computing in 1962. Real time? Give me a break. Hadoop reminded me of those early days. Fun. Standing on line. Waiting and waiting.

I read “Data Lake: Save Me More Money vs. Make Me More Money.” The article strikes me as a conference presentation illustrated with a deck of PowerPoint goodies.

One of the visuals was a modern big data analytics environment. I have seen a number of representations of today’s big data yadda yadda set ups. Here’s the EMC take on the modernity:


Straight away, I note the “all” word. Yep, just put the categorical affirmative into a Hadoop data lake. Don’t forget the video, the wonky stuff in the graphics department, the engineering drawings, and the most recent version of the merger documents requested by a team of government investigators, attorneys, and a pesky solicitor from some small European Community committee. “All” means all, right?

Then there are two “environments”. Okay, a data lake can have ecosystems, so the word environment is okay for flora and fauna. I think the notion is to build two separate analytic subsystems. Interesting approach, but there are platforms which offer applications to handle most of the data slap about work. Why not license one of those; for example, Palantir, Recorded Future?

And that’s it?

Well, no. The write up states that the approach will “save me more money.” In fact, one does not need much more:

The savings from these “Save me more money” activities can be nice with a Return on Investment (ROI) typically in the 10% to 20% range. But if organizations stop there, then they are leaving the 5x to 10x ROI projects on the table. Do I have your attention now?

My answer, “No, no, you do not.”

Stephen E Arnold, February 11, 2016
