
Data: Lakes, Streams, Whatever

June 15, 2016

I read “Data Lakes vs Data Streams: Which Is Better?” The answer seems to me to be “both.” Streams are now. Lakes are “were.” Who wants to make decisions based on historical data? On the other hand, real time data may mislead the unwary data sailor. The write up states:

The availability of these new ways [lakes and streams] of storing and managing data has created a need for smarter, faster data storage and analytics tools to keep up with the scale and speed of the data. There is also a much broader set of users out there who want to be able to ask questions of their data themselves, perhaps to aid their decision making and drive their trading strategy in real-time rather than weekly or quarterly. And they don’t want to rely on or wait for someone else such as a dedicated business analyst or other limited resource to do the analysis for them. This increased ability and accessibility is creating whole new sets of users and completely new use cases, as well as transforming old ones.

Good news for self-appointed lake and stream experts. Bad news for a company trying to figure out how to generate new revenues.

The first step may be to answer some basic questions about what data are available, how reliable they are, and who actually “knows” data wrangling. Finding out whether the water is polluted is a good idea before worrying about lakes and streams and diving into the murky waters.
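A rough way to picture the lake and stream distinction, assuming nothing about any particular product: a lake is queried after the fact over stored records, while a stream updates an answer as each record arrives. The function names and the sample prices below are invented for illustration.

```python
from typing import Iterable, Iterator, List


def lake_average(stored_prices: List[float]) -> float:
    """Batch ("lake") view: all records are already stored, so query them at once."""
    return sum(stored_prices) / len(stored_prices)


def stream_averages(prices: Iterable[float]) -> Iterator[float]:
    """Streaming view: emit an updated running average after every new record."""
    total = 0.0
    count = 0
    for price in prices:
        total += price
        count += 1
        yield total / count


# Hypothetical sample data, not taken from the article.
prices = [10.0, 12.0, 11.0, 15.0]
print(lake_average(prices))            # one answer about the past
print(list(stream_averages(prices)))   # an answer that changes as data arrives
```

The last streaming value matches the batch answer, but the early ones do not, which is one way real time data may mislead the unwary data sailor.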

Stephen E Arnold, June 15, 2016

Websites Found to Be Blocking Tor Traffic

June 8, 2016

Discrimination or wise precaution? Perhaps both? MakeUseOf tells us, “This Is Why Tor Users Are Being Blocked by Major Websites.” A recent study (PDF) by the University of Cambridge; University of California, Berkeley; University College London; and International Computer Science Institute, Berkeley confirms that many sites are actively blocking users who approach through a known Tor exit node. Writer Philip Bates explains:

“Users are finding that they’re faced with a substandard service from some websites, CAPTCHAs and other such nuisances from others, and in further cases, are denied access completely. The researchers argue that this: ‘Degraded service [results in Tor users] effectively being relegated to the role of second-class citizens on the Internet.’ Two good examples of prejudice hosting and content delivery firms are CloudFlare and Akamai — the latter of which either blocks Tor users or, in the case of, infinitely redirects. CloudFlare, meanwhile, presents CAPTCHA to prove the user isn’t a malicious bot. It identifies large amounts of traffic from an exit node, then assigns a score to an IP address that determines whether the server has a good or bad reputation. This means that innocent users are treated the same way as those with negative intentions, just because they happen to use the same exit node.”

The article goes on to discuss legitimate reasons users might want the privacy Tor provides, as well as reasons companies feel they must protect their Websites from anonymous users. Bates notes that there is not much one can do about such measures. He does point to Tor’s own Don’t Block Me project, which is working to convince sites to stop blocking people just for using Tor. It is also developing a list of best practices that concerned sites can follow, instead. One site, GameFAQs, has reportedly lifted its block, and CloudFlare may be considering a similar move. Will the momentum build, or must those who protect their online privacy resign themselves to being treated with suspicion?
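The shared exit node problem Bates describes can be sketched with a toy model. This is not CloudFlare’s actual scoring method, which the article does not detail. The threshold, the function names, and the documentation-range IP addresses are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy model only: real reputation systems are far more elaborate.
# The threshold and the IP addresses are invented for illustration.
CAPTCHA_THRESHOLD = 3

abuse_reports = defaultdict(int)  # IP address -> number of abuse reports


def report_abuse(ip: str) -> None:
    """Record one abuse report against an IP address."""
    abuse_reports[ip] += 1


def needs_captcha(ip: str) -> bool:
    """Challenge any visitor whose IP has reached the report threshold."""
    return abuse_reports[ip] >= CAPTCHA_THRESHOLD


# Documentation-range address standing in for a Tor exit node.
exit_node_ip = "203.0.113.7"

# One abusive user behind the exit node generates several reports.
for _ in range(3):
    report_abuse(exit_node_ip)

# An innocent user leaving through the same exit node presents the same IP.
print(needs_captcha(exit_node_ip))     # True
print(needs_captcha("198.51.100.20"))  # False
```

Because the score is keyed to the IP address rather than to the person, every user of that exit node inherits the reputation earned by its worst user.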


Cynthia Murrell, June 8, 2016

Sponsored by, publisher of the CyberOSINT monograph

GAO DCGS Letter B-412746

June 1, 2016

A few days ago, I stumbled upon a copy of a letter from the GAO concerning Palantir Technologies dated May 18, 2016. The letter became available to me a few days after the 18th, and the US holiday probably limited circulation of the document. The letter is from the US Government Accountability Office and signed by Susan A. Poling, general counsel. There are eight recipients, some from Palantir, some from the US Army, and two in the GAO.


Has the US Army put Palantir in an untenable spot? Is there a deus ex machina about to resolve the apparent checkmate?

The letter tells Palantir Technologies that its protest of the DCGS Increment 2 award to another contractor is denied. I don’t want to revisit the history or the details as I understand them of the DCGS project. (DCGS, pronounced “dsigs”, is a US government information fusion project associated with the US Army but seemingly applicable to other Department of Defense entities like the Air Force and the Navy.)

The passage in the letter I found interesting was:

While the market research revealed that commercial items were available to meet some of the DCGS-A2 requirements, the agency concluded that there was no commercial solution that could meet all the requirements of DCGS-A2. As the agency explained in its report, the DCGS-A2 contractor will need to do a great deal of development and integration work, which will include importing capabilities from DCGS-A1 and designing mature interfaces for them. Because the agency concluded that significant portions of the anticipated DCGS-A2 scope of work were not available as a commercial product, the agency determined that the DCGS-A2 development effort could not be procured as a commercial product under FAR part 12 procedures. The protester has failed to show that the agency’s determination in this regard was unreasonable.

The “importing” point is a big deal. I find it difficult to imagine that IBM i2 engineers will be eager to permit the Palantir Gotham system to work like one happy family. The importation and manipulation of i2 data in a third party system is more difficult than opening an RTF file in Word in my experience. My recollection is that the unfortunate i2-Palantir legal matter was, in part, related to figuring out how to deal with ANB files. (ANB is i2 shorthand for the Analyst’s Notebook file format, a somewhat complex and closely-held construct.)

Net net: Palantir Technologies will not be the dog wagging the tail of IBM i2 and a number of other major US government integrators. The good news is that there will be quite a bit of work available for firms able to support the prime contractors and the vendors eligible and selected to provide for-fee products and services.

Was this a shoot-from-the-hip decision to deny Palantir’s objection to the award? No. I believe the FAR procurement guidelines and the content of the statement of work provided the framework for the decision. However, context is important as are past experiences and perceptions of vendors in the running for substantive US government programs.


The Google Knowledge Vault Claimed to Be the Future

May 31, 2016

Back in 2014, I heard rumors that the Google Knowledge Vault was supposed to be the next wave of search.  How many times do you hear a company or a product making the claim it is the next big thing?  After I rolled my eyes, I decided to research what became of the Knowledge Vault and I found an old article from Search Engine Land: “Google ‘Knowledge Vault’ To Power Future Of Search.” Google Knowledge Graph was used to supply more information to search results, what we now recognize as the summarized information at the top of Google search results.  The Knowledge Vault was supposedly the successor and would rely less on third party information providers.

“Sensationally characterized as ‘the largest store of knowledge in human history,’ Knowledge Vault is being assembled from content across the Internet without human editorial involvement. ‘Knowledge Vault autonomously gathers and merges information from across the web into a single base of facts about the world, and the people and objects in it,’ says New Scientist. Google has reportedly assembled 1.6 billion “facts” and scored them according to confidence in their accuracy. Roughly 16 percent of the information in the database qualifies as ‘confident facts.’”

Knowledge Vault was also supposed to give Google a one up in the mobile search market and even be the basis for artificial intelligence applications.  It was a lot of hoopla, but I did a bit more research and learned from Wikipedia that Knowledge Vault was nothing more than a research paper.

Since 2014, Google, Apple, Facebook, and other tech companies have concentrated their efforts and resources on developing artificial intelligence and integrating it within their products.  While Knowledge Vault was a red herring, the predictions about artificial intelligence were correct.


Whitney Grace, May 31, 2016

MarkLogic Tells a Good Story

May 25, 2016

I lost track of MarkLogic when the company hit about $51 million in revenue and changed CEOs in 2006. In 2012, another CEO change took place. Since Gary Bloom, a former Oracle executive, took over, the company, according to “Gary Bloom Interview: Big Data Driving Sales Boom at MarkLogic,” is now “topping” $100 million in annual revenue.

MarkLogic is one of the outfits laboring in the DCGS / DI2E vineyard. The company may be butting heads with outfits like Palantir Technologies as the US Army’s plan to federate its systems and data moves forward.

MarkLogic opened for business in 2003 and has ingested, according to Crunchbase, $175 million in venture funding. With a timeline equivalent to Palantir Technologies’, there may be some value in comparing these two “startups” and their performance. That is an exercise better left to the feisty young MBAs who have to produce a return for the Sequoia and Wellington experts.

The interview contained two interesting statements which I found surprising:

The driver is Big Data: large corporations are convinced there is an El Dorado of untapped commercial opportunities — if only they can run their reports across all their data sources. But integrating all that data is too costly, and takes too long with relational databases. The future will be full of data in many forms, formats, and sources and how that data is used will be the differentiator in many competitive battles. If that data can’t be searched it can’t be used.

That is indeed the belief and the challenge. Based on what I have learned via open sources about the DCGS project, the reality is different from the “all” notions which fill the heads of some of the vendors delivering a comprehensive intelligence system to US government clients. In fact, the reality today seems to me to be similar to the hope for the Convera system when it was doing the “all” approach to some US government information. That, as you may recall, did not work out as some had hoped.

The second statement I highlighted is:

Although MarkLogic is tiny compared to Oracle there are some interesting parallels. “MarkLogic is at about the same size as Oracle was when I began working there. It took a long time for Oracle to get security and other enterprise features right, but when it did, that was when company really took off.”

The stakeholders hope that MarkLogic does “take off.” With more than 12 years of performance history under its belt, MarkLogic could be the next big thing. The only hitch in the git along is that normalization of information and data has to take place. Then there is the challenge of the query language. One cannot overlook the competitors which continue to bedevil those in the data management game.

With Oracle also involved in some US government work, there might be a bit of push back as the future of MarkLogic rolls forward. What happens if IBM’s data management systems group decides to acquire MarkLogic? Excitement? Perhaps.

Stephen E Arnold, May 25, 2016

DGraph Labs Startup Aims to Fill Gap in Graph Database Market

May 24, 2016

The article on GlobeNewsWire titled Ex-Googler Startup DGraph Labs Raises US$1.1 Million in Seed Funding Round to Build Industry’s First Open Source, Native and Distributed Graph Database names Bain Capital Ventures and Blackbird Ventures as the main investors in the startup. Manish Jain, founder and CEO of DGraph, worked on Google’s Knowledge Graph Infrastructure for six years. He explains the technology,

“Graph data structures store objects and the relationships between them. In these data structures, the relationship is as important as the object. Graph databases are, therefore, designed to store the relationships as first class citizens… Accessing those connections is an efficient, constant-time operation that allows you to traverse millions of objects quickly. Many companies including Google, Facebook, Twitter, eBay, LinkedIn and Dropbox use graph databases to power their smart search engines and newsfeeds.”

The many applications of graph databases include the Internet of Things, behavior analysis, medical and DNA research, and AI. So what is DGraph going to do with its fresh funds? Jain wants to focus on forging a talented team of engineers and developing the company’s core technology. He notes in the article that this sort of work is hardly the typical obstacle faced by a startup, but rather the focus of major tech companies like Google or Facebook.
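Jain’s point about relationships as first class citizens can be sketched in a few lines. This is an in-memory toy, not DGraph’s design. The names and the sample edges are hypothetical. Each node maps directly to its outgoing relationships, so fetching a node’s connections is a dictionary lookup (constant time on average) rather than a scan or a join.

```python
from collections import defaultdict, deque

# Each node maps directly to its outgoing (relationship, target) edges.
edges = defaultdict(list)


def add_edge(source: str, relation: str, target: str) -> None:
    """Store the relationship itself alongside the objects it connects."""
    edges[source].append((relation, target))


def reachable(start: str, max_hops: int) -> set:
    """Breadth-first traversal following stored relationships up to max_hops away."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand nodes already at the hop limit
        for _relation, target in edges[node]:
            if target not in seen:
                seen.add(target)
                queue.append((target, hops + 1))
    seen.discard(start)
    return seen


# Hypothetical data for illustration.
add_edge("alice", "follows", "bob")
add_edge("bob", "follows", "carol")
add_edge("carol", "follows", "dave")

print(sorted(reachable("alice", 2)))
```

The traversal cost depends on how many edges are followed, not on how many objects the store holds overall, which is the property the quote is pointing at.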


Chelsea Kerwin, May 24, 2016


The Trials, Tribulations, and Party Anecdotes Of “Edge Case” Names

May 16, 2016

The article titled These Unlucky People Have Names That Break Computers on BBC Future delves into the strange world of “edge cases” or people with unexpected or problematic names that reveal glitches in the most commonplace systems that those of us named “Smith” or “Jones” take for granted. Consider Jennifer Null, the Virginia woman who can’t book a plane ticket or complete her taxes without extensive phone calls and headaches. The article says,

“But to any programmer, it’s painfully easy to see why “Null” could cause problems for a database. This is because the word “null” is often inserted into database fields to indicate that there is no data there. Now and again, system administrators have to try and fix the problem for people who are actually named “Null” – but the issue is rare and sometimes surprisingly difficult to solve.”

It may be tricky to find people with names like Null. Because of the nature of the controls related to names, issues generally arise for people like Null on systems where it actually does matter, like government forms. This is not an issue unique to the US, either. One Patrick McKenzie, an American programmer living in Japan, has run into regular difficulties because of the length of his last name. But that is nothing compared to Janice Keihanaikukauakahihulihe’ekahaunaele, a Hawaiian woman who campaigned for more flexibility in name length restrictions for state ID cards.
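The failure the BBC article describes is easy to reproduce. The sketch below is a guess at the general pattern, not the code of any real airline or tax system. Both function names are hypothetical. The naive version uses the text “null” as a stand-in for “no data,” so a real surname gets erased. The safer version keeps missing data and text apart.

```python
from typing import Optional


def naive_parse(field: str) -> Optional[str]:
    """Buggy: treats the text "null" (any case) as a missing value."""
    if field.strip().lower() in ("", "null"):
        return None
    return field


def safer_parse(field: Optional[str]) -> Optional[str]:
    """Missing data is represented by None itself, never by a magic string."""
    if field is None or field.strip() == "":
        return None
    return field


print(naive_parse("Null"))   # None: the surname has been erased
print(safer_parse("Null"))   # Null: the surname survives
```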


Chelsea Kerwin, May 16, 2016



The Database Divide: SQL or NoSQL

April 13, 2016

I enjoy reading about technical issues which depend on use cases. When I read “Big Data And RDBMS: Can They Coexist?”, I thought about the premise, not the article. Information Week is one of those once high-flying dead tree outfits which have embraced digital. My hunch is that the juicy headline is designed less to speak to technical issues and more to the need to create some traffic.

In my case, it worked. I clicked. I read. I ignored it because, obviously, specific methods exist to solve different problems.

Here’s what I read after the lusted-after click:

Peaceful coexistence is turning out to be the norm, as the two technologies prove to be complementary, not exclusive. As much as casual observers would like to see big data technologies win the future, RDBMS (the basis for SQL and database systems such as Microsoft SQL Server, IBM DB2, Oracle, and MySQL) is going to stick around for a bit longer.

So this is news? In an organization, some types of use cases are appropriate for the row and column approach. Think Excel. Others are better addressed with a whizzy system like Cassandra or a similar data management tool.

The write up reported that Codd-based systems are pretty useful for transactions. Yep, that is accurate for most transactional applications. But there are some situations better suited to different approaches. My hunch is that is why Palantir Technologies developed its data management middleware AtlasDB, but let’s not get caught in a specific approach.
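Why the row and column approach suits transactions can be shown with Python’s built-in sqlite3 module. The table, the account names, and the amounts are invented for illustration. The credit is deliberately issued before the debit so that a failed debit has to undo work already done.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(src: str, dst: str, amount: int) -> bool:
    """Move funds atomically; roll back both updates if either fails."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; the credit was undone too.
        return False


def balances() -> dict:
    """Return the current balance for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


print(transfer("alice", "bob", 30), balances())
print(transfer("alice", "bob", 500), balances())
```

The second transfer fails and leaves both balances exactly as the first transfer left them. That all-or-nothing behavior comes with the relational engine; many other data management tools trade it away for scale or flexibility.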

The write up points out that governance is a good idea. The context for governance is the SQL world, but my experience is that figuring out what to analyze and how to ensure “good enough” data quality is important for the NoSQL crowd as well.

I noted this statement from the wizard “Brown” who authored Data Mining for Dummies:

“Users are not always clear [RDBMS and big data] are different products,” Brown said. “The sales reps are steering them to whatever product they want [the users] to buy.”

Yep, sales. Writing about data can educate, entertain, or market.

In this case, the notion that two technologies themselves contend for attention does little to help one determine what method to use and when. Marketing triumphs.

Stephen E Arnold, April 13, 2016

AnalyzeThe.US, the 2016 Version?

April 12, 2016

I read “With Government Data Unlocked, MIT Tries to Make It Easier to Sift Through.” I came away from the write up a bit confused. I recall that Palantir Technologies offered for a short period of time a site called AnalyzeThe.US. It disappeared. I also recalled seeing a job posting for a person with a top secret clearance who knew Tableau (Excel on steroids) and Palantir Gotham (augmented intelligence). I am getting old but I thought that Michael Kim, once a Deloitte wizard, gave a lecture about how one can use Palantir for analytics.

Why is this important?

The write up points out that MIT worked with Deloitte which, I learned:

provided funding and expertise on how people use government data sets in business and for research.

The Gray Lady’s article does not see any DNA linking AnalyzeThe.US, Deloitte, and the “new” Data USA site. Palantir’s Stephanie Yu gave a talk at MIT. I wonder if those in that session perceive any connection between Palantir and MIT. Who knows. I wonder if the MIT site makes use of AngularJS.

With regard to US government information, is still online. The information can be a challenge for a person without Tableau and Palantir expertise to wrangle in my experience. For those who don’t think Palantir is into sales, my view is that Palantir sells via intermediaries. The deal, in this type of MIT case, is to try to get some MIT students to get bitten by the Gotham and Metropolitan fever. Thank goodness I am not a real journalist trying to figure out who provides what to whom and for what reason. Okay, back to contemplating the pond filled with Kentucky mine run off water.

Stephen E Arnold, April 12, 2016

MarkLogic: Not Much Information about DI2E on the MarkLogic Web Site

April 11, 2016

Short honk: I have been thinking about MarkLogic in the context of Palantir Technologies. The two companies are sort of pals. Both companies are playing the high stakes game for next generation augmented intelligence systems for the Department of Defense. Palantir’s approach has been to generate revenues from sales to the intelligence community. MarkLogic’s approach has been to ride on the Distributed Common Ground System which is now referenced in some non-Hunter circles as DI2E.

You can get a sense of what MarkLogic makes available by navigating to and running a query for DI2E or DCGS.

The Plugfest documents provide a snapshot of the vendors involved as of December 2015 in this project. Here’s a snippet from the unclassified set of slides “Plugfest Industry Day: Plugfest/Mashup 2016.”

[Image: palantir vs marklogic plugfest]

What caught my attention is that Palantir, which has its roots in CIA-type thought processes, is in the same “industry partner” illustration as MarkLogic. I noticed that IBM (the DB2 folks) and Oracle (the one-time champion in database technology) are also “partners.”

The only hitch in this “plugfest” partnering deal is Palantir’s quite interesting AtlasDB innovation and the disclosure of data management systems and methods in US 2016/0085817, “System and Method for Investigating Large Amounts of Data”, an invention of the now not-so-secret Hobbits Geoffrey Stowe, Chris Fischer, Paul George, Eli Bingham, and Rosco Hill.

Palantir’s one-two punch is AtlasDB and its data management method. The reason I find this interesting is that MarkLogic is the NoSQL, XML, slice-and-dice advanced technology which some individuals find difficult to use. IBM and Oracle are decidedly old school.

MarkLogic may not publicize its involvement in DCGS/DI2E, but the revenue is important for MarkLogic and the other vendors in the “partnering” diagram. Palantir, however, has been diversifying with, from what I hear, considerable success.

MarkLogic is a Silicon Valley innovator which opened its doors in 2001. Yep, that’s 15 years ago. Palantir Technologies is the newer kid on the block. The company was set up in 2003; that’s 13 years ago. What I find interesting is that MarkLogic’s approach is looking a bit long in the tooth. Palantir’s approach is a bit more current, and its user experience is more friendly than wrestling with XQuery and its extensions.

What happens if Palantir becomes the plumbing for the DCGS/DI2E system? Perhaps IBM or Oracle will have to think about acquiring Palantir. With technology IPOs somewhat rare, Palantir stakeholders may find that thinking the unthinkable is attractive.

What happens if Palantir takes its commercial business into a separate company and then formulates a deal to sell only the high-vitamin augmented intelligence business? MarkLogic may be faced with some difficult choices. Simplifying its data management and query systems may be child’s play compared to figuring out what its future will be if either IBM or Oracle snaps up the quite interesting Palantir technologies, particularly the database and data management systems.

Watch for my for-fee report about Palantir Technologies. There will be a discounted price for law enforcement and intelligence professionals and another price for those not engaged in these two disciplines. Expect the report in early summer 2016. A small segment of the Palantir special report will appear in the forthcoming “Dark Web Notebook”, which I referenced in the Singularity 1 on 1 interview in mid-March 2016. To reserve copies of either of these two new monographs, write benkent2020 at Yahoo dot com.

Stephen E Arnold, April 11, 2016
