
VirtualWorks Purchases Natural Language Processing Firm

July 8, 2016

Another day, another merger. PR Newswire released a story, VirtualWorks and Language Tools Announce Merger, which covers VirtualWorks’ purchase of Language Tools. With Language Tools, VirtualWorks will inherit computational linguistics and natural language processing technologies. VirtualWorks is an enterprise search firm. Erik Baklid, Chief Executive Officer of VirtualWorks, is quoted in the article,

“We are incredibly excited about what this combined merger means to the future of our business. The potential to analyze and make sense of the vast unstructured data that exists for enterprises, both internally and externally, cannot be understated. Our underlying technology offers a sophisticated solution to extract meaning from text in a systematic way without the shortcomings of machine learning. We are well positioned to bring to market applications that provide insight, never before possible, into the vast majority of data that is out there.”

This is another case of a company positioning itself as a leader in enterprise search. Is it anything special? Well, the news release mentions several core technologies that will be bolstered by the merger: text analytics, data management, and discovery techniques. We will have to wait and see what the future holds for the company in the enterprise search and business intelligence sectors it seeks to lead.

Megan Feil, July 8, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph


Voyager Search: New Release Available

July 1, 2016

Voyager Search is a vendor of search and retrieval based on Lucene. I was not familiar with the company until I read “Voyager Search Improves Search Capabilities and Overall Usability With More Than 150 Updates to Its Version 1.9.8.” According to the write up:

In the new version, Voyager makes it easier to configure content in Navigo, its modern web app, extends its spatial content search, and improves the usability of its Navigo processing tools. Managing content in Navigo can now be done through the new personalized ‘My Voyager’ customization page, which allows customers to share saved searches and update display configurations through a drag and drop interface.

One point in the write up I noted was this statement: “An improved spatial search interface now includes the ability to draw and buffer points, lines and polygons.” The idea is that geo-spatial operations appear to be supported by the system.
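For readers who have not bumped into the term, “buffering” a point means drawing a zone of a given radius around it and asking what falls inside. Here is a minimal sketch of that idea in Python. It is not Voyager’s code or API. The function names, the record layout, and the sample coordinates are my own assumptions, and the distance is a plain great-circle (haversine) calculation on a spherical Earth.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_buffer(center, radius_km, records):
    """Return the records whose location lies inside the buffer around center.

    `records` is a hypothetical list of dicts with "lat" and "lon" keys.
    """
    lat0, lon0 = center
    return [r for r in records
            if haversine_km(lat0, lon0, r["lat"], r["lon"]) <= radius_km]


# Illustrative data only: approximate city coordinates.
docs = [
    {"name": "Louisville", "lat": 38.25, "lon": -85.76},
    {"name": "Lexington", "lat": 38.04, "lon": -84.50},
    {"name": "Nashville", "lat": 36.16, "lon": -86.78},
]
nearby = within_buffer((38.25, -85.76), 150, docs)
```

A production system would presumably use a spatial index and handle lines and polygons as well; the linear scan above only shows what the buffer test itself amounts to.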

I also highlighted this comment:

Voyager Search is a leading global provider of geospatial, enterprise search tools that connect, find and deliver more than 1,800 different file formats.

In my experience, support for more than 1,000 file formats suggests a large number of conversion widgets.

The company bills itself as the “only install and go Solr/Lucene search engine.” Information about the company is available at this link. A demo is available here.

Stephen E Arnold, July 1, 2016

More Palantir Spotting

June 27, 2016

Trainspotting is a collection of short stories or a novel presented as a series of short stories by Irvine Welsh. The fun lovers in the fiction embrace avocations which seem to be addictive. The thrill is the thing. Now I think I have identified Palantir spotting.

Navigate to “Palantir Seeks to Muzzle Former Employees.” I am not too interested in the allegations in the write up. What is interesting is that the article is one of what appears to be a series of stories about Palantir Technologies enriched with non public documents.


The Thingverse muzzle might be just the ticket for employees who want to chatter about proprietary information. I assume the muzzle is sanitary and durable, comes in various sizes, and adapts to the jaw movement of the lucky dog wearing the gizmo.

Why use the phrase “Palantir spotting”? It seems to me that tracking an outfit which provides services and software to government entities is an unusual hobby. I, for example, lecture about the Dark Web, how to recognize recycled analytics algorithms and their assorted “foibles,” and how to find information in the new, super helpful Google Web search system.

Poking the innards of an outfit with interesting software and some wizards who might be a bit testy is okay if done with some Onion type or Colbert like humor. Doing what one of my old employers did in the 1970s to help ensure that company policies remain inside the company is old hat to me.

In the write up, I noted:

The Silicon Valley data-analysis company, which recently said it would buy up to $225 million of its own common stock from current and former staff, has attached some serious strings to the offer. It is requiring former employees who want to sell their shares to renew their non-disclosure agreements, agree not to poach Palantir employees for 12 months, and promise not to sue the company or its executives, a confidential contract reviewed by BuzzFeed News shows. The terms also dictate how former staff can talk to the press. If they get any inquiries about Palantir from reporters, the contract says, they must immediately notify Palantir and then email the company a copy of the inquiry within three business days. These provisions, which haven’t previously been reported, show one way Palantir stands to benefit from the stock purchase offer, known as a “liquidity event.”

Okay, manage information flow. In my experience, money often comes with some caveats. At one time I had lots and lots of @Home goodies which disappeared in a Sillycon Valley minute. The fine print for the deal covered the disappearance. Sigh. That’s life with techno-financial wizards. It seems life has not changed too much since the @Home affair decades ago.

I expect that there will be more Palantir centric stories. I will try to note these when they hit my steam powered radar detector in Harrod’s Creek. My thought is that like the protagonists in Trainspotting, Palantir spotting might have some after effects.

I keep asking myself this question:

How do company confidential documents escape the gravitational field of a comparatively secretive company?

The Palantir spotters are great data gatherers or those with access to the documents are making the material available. No answers yet. Just that question about “how”.

Stephen E Arnold, June 27, 2016

Palantir Technologies: Now Beer Pong and Human Augmented Intelligence?

June 23, 2016

I went months, nay years, without reading very much about Palantir Technologies. Now the unicorn seems to be prancing through my newsfeeds frequently. I read “Palantir’s Party Culture: Beer Pong, Office Pranks, and a Bad Case of the Hives.” The focus is less on how Gotham works and the nifty data management system the firm has engineered and more upon revelations about life inside a stealthy vendor of search and content processing systems.

The write up uses what appears to be company emails and letters from attorneys as sources of information. I thought that emails were the type of information not widely available. Lawyer letters? Hmm. Guess not. A former Hobbit (allegedly the Palantirians’ names for themselves in the Shire) has revealed information about a matter involving a terminated employee.

The Sillycon Valley company allegedly has or had employees who horsed around. I find this difficult to believe. Fun at work? Wow. The aggrieved individual alleges he was injured by a “drunk coworker” who was playing beer pong. And the individual with a beef allegedly had “snacks” taken from his work space. (I thought Palantir-type outfits provided food for the Hobbit-like individuals.)

The write up contains this statement:

The letter [from a legal eagle?] also makes the surprising allegation that Palantir engaged in improper business practices by using both Bloomberg data feeds and software from an IT firm called ANB without the appropriate licenses. Neither Palantir, Bloomberg, nor ANB responded to requests for comment. In the July 2010 letter, Cohen’s attorney states that his client was retaliated against for speaking out about these practices. From the letter:

Mr Cohen was retaliated against for…complaining about issues such as Palantir’s illegal use of third party copyrighted and trademarked icons and Bloomberg data feeds without adequate licenses. In addition, Mr. Cohen was retaliated against for complaining about the illegal use of open source code without crediting authors, and the illegal use of ANB software development kit without ANB’s authorization.

Yikes. From beer pong and missing snacks to the allegation of “improper business practices.” Who knew this was possible?

Please note that the statements in the write up about “ANB” probably refer to IBM i2’s proprietary file structures for the Analyst’s Notebook product. (I dug in that outfit’s garden for a while.) What other errors lurk within these write ups about disenchanted Hobbits?

Several questions occurred to me:

  1. Is Palantir’s email system insecure? Have there been other caches of company email let loose from the Shire?
  2. Are these emails publicly available? Will those with access to the emails gather them and post them on a pastesite?
  3. What is the relationship between the IBM i2 proprietary file format and the Gotham system? (Wasn’t there a legal dust up with regard to i2’s proprietary technology?)
  4. How do commercial database content feeds find their way into systems not licensed for such access?

I find it interesting how a company which purports to maintain a low profile captures the attention of “real” journalists who have access to emails and legal letters.

I noted a couple of factoids too:

Key factoid one: Beer pong can be dangerous.

Key factoid two: People working in high tech outfits may want to check out their internal governance methods. Emails don’t walk; emails get sent or copied before, during, or after beer pong.

Stephen E Arnold, June 23, 2016

GAO DCGS Letter B-412746

June 1, 2016

A few days ago, I stumbled upon a copy of a letter from the GAO concerning Palantir Technologies dated May 18, 2016. The letter became available to me a few days after the 18th, and the US holiday probably limited circulation of the document. The letter is from the US Government Accountability Office and signed by Susan A. Poling, general counsel. There are eight recipients, some from Palantir, some from the US Army, and two in the GAO.


Has the US Army put Palantir in an untenable spot? Is there a deus ex machina about to resolve the apparent checkmate?

The letter tells Palantir Technologies that its protest of the DCGS Increment 2 award to another contractor is denied. I don’t want to revisit the history or the details as I understand them of the DCGS project. (DCGS, pronounced “dsigs”, is a US government information fusion project associated with the US Army but seemingly applicable to other Department of Defense entities like the Air Force and the Navy.)

The passage in the letter I found interesting was:

While the market research revealed that commercial items were available to meet some of the DCGS-A2 requirements, the agency concluded that there was no commercial solution that could meet all the requirements of DCGS-A2. As the agency explained in its report, the DCGS-A2 contractor will need to do a great deal of development and integration work, which will include importing capabilities from DCGS-A1 and designing mature interfaces for them. Because the agency concluded that significant portions of the anticipated DCGS-A2 scope of work were not available as a commercial product, the agency determined that the DCGS-A2 development effort could not be procured as a commercial product under FAR part 12 procedures. The protester has failed to show that the agency’s determination in this regard was unreasonable.

The “importing” point is a big deal. I find it difficult to imagine that IBM i2 engineers will be eager to permit the Palantir Gotham system to work with i2 like one happy family. The importation and manipulation of i2 data in a third party system is more difficult than opening an RTF file in Word in my experience. My recollection is that the unfortunate i2-Palantir legal matter was, in part, related to figuring out how to deal with ANB files. (ANB is i2 shorthand for Analyst’s Notebook’s file format, a somewhat complex and closely-held construct.)

Net net: Palantir Technologies will not be the dog wagging the tail of IBM i2 and a number of other major US government integrators. The good news is that there will be quite a bit of work available for firms able to support the prime contractors and the vendors eligible and selected to provide for-fee products and services.

Was this a shoot-from-the-hip decision to deny Palantir’s objection to the award? No. I believe the FAR procurement guidelines and the content of the statement of work provided the framework for the decision. However, context is important as are past experiences and perceptions of vendors in the running for substantive US government programs.


Search Sink Hole Identified and Allegedly Paved and Converted to a Data Convenience Store

May 20, 2016

I try to avoid reading more than one write up a day about alleged revolutions in content processing and information analytics. My addled goose brain cannot cope with the endlessly recycled algorithms dressed up in Project Runway finery.

I read “Ryft: Bringing High Performance Analytics to Every Enterprise,” and I was pleased to see a couple of statements which resonated with my dim view of information access systems. There is an accompanying video in the write up. I, as you may know, gentle reader, am not into video. I prefer reading, which is the old fashioned way to suck up useful factoids.

Here’s the first passage I highlighted:

Any search tool can match an exact query to structured data—but only after all of the data is indexed. What happens when there are variations? What if the data is unstructured and there’s no time for indexing? [Emphasis added]

The answer to the question is increasing costs for sales and marketing. The early warnings for amped up baloney are the presentations given at conferences and pumped out via public relations firms. (No, Buffy, no, Trent, I am not interested in speaking with the visionary CEO who hired you.)

I also highlighted:

With the power to complete fuzzy search 600X faster at scale, Ryft has opened up tremendous new possibilities for data-driven advances in every industry.

I circled the 600X. Gentle reader, I struggle to comprehend a 600X increase in content processing. Dear Mother Google has invested to create a new chip to get around the limitations of our friend Von Neumann’s approach to executing instructions. I am not sure Mother Google has this nailed because Mother Google, like IBM, announces innovations without too much real world demonstration of the nifty “new” things.

I noted this statement too:

For the first time, you can conduct the most accurate fuzzy search and matching at the same speed as exact search without spending days or weeks indexing data.

Okay, this strikes me as a capability I would embrace if I could get over or around my skepticism. I was able to take a look at the “solution” which delivers the astounding performance and information access capability. Here’s an image from Ryft’s engineering professionals:


Notice that we have Spark and pre built components. I assume there are myriad other innovations at work.

The hitch in the git along is that in order to deal with certain real world information processing challenges, the inputs come from disparate systems, each generating substantial data flows in real time.

Here’s an example of a real world information access and understanding challenge, which, as far as I know, has not been solved in a cost effective, reliable, or usable manner.


Image source: Plugfest 2016 Unclassified.

This unclassified illustration makes clear that the little things in the sky pump out lots of data into operational theaters. Each stream of data must be normalized and then converted to actionable intelligence.

The assertion about 600X sounds tempting, but my hunch is that the latency in normalizing, transferring, and processing will not meet the need for real time, actionable, accurate outputs when someone is shooting at a person with a hardened laptop in a threat environment.

In short, perhaps the spark will ignite a fire of performance. But I have my doubts. Hey, that’s why I spend my time in rural Kentucky where reasonable people shoot squirrels with high power surplus military equipment.

Stephen E Arnold, May 20, 2016

Big Data and Value

May 19, 2016

I read “The Real Lesson for Data Science That is Demonstrated by Palantir’s Struggles · Simply Statistics.” I love write ups that plunk the word statistics near simple.

Here’s the passage I highlighted in money green:

… What is the value of data analysis?, and secondarily, how do you communicate that value?

I want to step away from the Palantir Technologies’ example and consider a broader spectrum of outfits tossing around the jargon “big data,” “analytics,” and synonyms for smart software. One doesn’t communicate value. One finds a person who needs a solution and crafts the message to close the deal.

When a company and its perceived technology catch the attention of allegedly informed buyers, a bandwagon effect kicks in. Talks inside an organization lead to mentions in internal meetings. The vendor whose products and services are the subject of these comments begins to hint at bigger and better things at conferences. Then a real journalist may catch a scent of “something happening” and write an article. Technical talks at niche conferences generate wonky articles usually without dates or footnotes which make sense to someone without access to commercial databases. If a social media breeze whips up the smoldering interest, then a fire breaks out.

A start up should be so clever, lucky, or tactically gifted to pull off this type of wildfire. But when it happens, big money chases the outfit. Once money flows, the company and its products and services become real.

The problem with companies processing a range of data is that there are some friction inducing processes that are tough to coat with Teflon. These include:

  1. Taking different types of data, normalizing it, indexing it in a meaningful manner, and creating metadata which is accurate and timely.
  2. Converting numerical recipes, many with built in threshold settings and chains of calculations, into marching band order able to produce recognizable outputs.
  3. Figuring out how to provide an infrastructure that can sort of keep pace with the flows of new data and the updates/corrections to the already processed data.
  4. Generating outputs that people in a hurry or in a hot zone can use to positive effect; for example, in a war zone, not get killed when the visualization is not spot on.

The write up focuses on a single company and its alleged problems. That’s okay, but it understates the problem. Most content processing companies run out of revenue steam. The reason is that the licensees or customers want the systems to work better, faster, and more cheaply than predecessor or incumbent systems.

The vast majority of search and content processing systems are flawed, expensive to set up and maintain, and really difficult to use in a way that produces high reliability outputs over time. I would suggest that the problem bedevils a number of companies.

Some of those struggling with these issues are big names. Others are much smaller firms. What’s interesting to me is that the trajectory content processing companies follow is a well worn path. One can read about Autonomy, Convera, Endeca, Fast Search & Transfer, Verity, and dozens of other outfits and discern what’s going to happen. Here’s a summary for those who don’t want to work through the case studies on my Xenky intel site:

Stage 1: Early struggles and wild and crazy efforts to get big name clients

Stage 2: Making promises that are difficult to implement but which are essential to capture customers looking actively for a silver bullet

Stage 3: Frantic building and deployment accompanied with heroic exertions to keep the customers happy

Stage 4: Closing as many deals as possible either for additional financing or for licensing/consulting deals

Stage 5: The early customers start grousing and the momentum slows

Stage 6: Sell off the company or shut down like Delphes, Entopia, Siderean Software and dozens of others.

The problem is not technology, math, or Big Data. The force which undermines these types of outfits is the difficulty of making sense out of words and numbers. In my experience, the task is a very difficult one for humans and for software. Humans want to golf, cruise Facebook, emulate Amazon Echo, or like water find the path of least resistance.

Making sense out of information when someone is lobbing mortars at one is a problem which technology can only solve in a haphazard manner. Hope springs eternal and managers are known to buy or license a solution in the hopes that my view of the content processing world is dead wrong.

So far I am on the beam. Content processing requires time, humans, and a range of flawed tools which must be used by a person with old fashioned human thought processes and procedures.

Value is in the eye of the beholder, not in zeros and ones.

Stephen E Arnold, May 19, 2016

Affinio and the Differences between Useful Data and Fanciful Data

May 17, 2016

I read “Understanding the Cultural Differences Between NASCAR and Formula One Fans [Analysis].” The write up is in a blog post from Affinio. The company describes itself in this way:

Marketing Intelligence that leverages the social graph to understand today’s customer.

The information in the write up presents clusters of interest for the fan bases of the two motor sports. F1 consists of clusters labeled this way:


To illustrate the differences, Affinio presents a visualization of the Nascar audience:


The labels strike me as unhelpful; for example, Cluster 14, Cluster 6, etc.

The top interests of the two audiences consist of a collage of small images. I am not sure what each image represents.


Equally unhelpful are the word clouds for each of the audiences; for example:


The map showing the geographic area where F1 is popular focuses on a global scale with a centroid in Western Europe. The absence of a hot spot in the Middle East was puzzling. Is Australia as large an F1 market as the UAE in terms of money spent on F1 activities?


The map for the Nascar market depicts only the US of A. My question, “Why not show a global map?”


Thinking about this analysis, I have several questions:

  1. A list of dot points would get the message across in a more efficient, possibly less confusing way, would it not?
  2. What is analyzed? It seems that the single actionable fact is that the F1 market is global and the Nascar market is local.
  3. What are the data sets used for the analysis?
  4. Why are terms like “Cluster 14” used instead of words?

The most important data from my uninformed vantage point is the money generated by the two types of motor racing.

My hunch is that the Affinio write up wanted to show off visualizations, not substantive and actionable data analysis. In short, is this marketing or is it substance? I will leave the answer to you, gentle reader.

Stephen E Arnold, May 17, 2016

The Most Dangerous Writing App Will Delete Your Work If You Stop Typing, for Free

May 2, 2016

The article on The Verge titled The Most Dangerous Writing App Lets You Delete All of Your Work For Free speculates on the difficulties and hubris of charging money for technology that someone can clone and offer for free. Manuel Ebert’s The Most Dangerous Writing App offers a self-detonating notebook that you trigger if you stop typing. The article explains,

“Ebert’s service appears to be a repackaging of Flowstate, a $15 Mac app released back in January that functions in a nearly identical way. He even calls it The Most Dangerous Writing App, which is a direct reference to the words displayed on Flowstate creator Overman’s website. The difference: Ebert’s app is free, which could help it take off among the admittedly niche community of writers looking for self-deleting online notebooks.”

One such community that comes to mind is that of the creative writers. Many writers, and poets in particular, rely on exercises akin to the philosophy of The Most Dangerous Writing App: don’t let your pen leave the page, even if you are just writing nonsense. Adding higher stakes to the process might be an interesting twist, especially for those writers who believe that just as the nonsense begins, truth and significance are unlocked.


Chelsea Kerwin, May 2, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Search without Indexing

April 27, 2016

I read “Outsmarting Google Search: Making Fuzzy Search Fast and Easy Without Indexing.”

Here’s a passage I highlighted:

It’s clear the “Google way” of indexing data to enable fuzzy search isn’t always the best way. It’s also clear that limiting the fuzzy search to an edit distance of two won’t give you the answers you need or the most comprehensive view of your data. To get real-time fuzzy searches that return all relevant results you must use a data analytics platform that is not constrained by the underlying sequential processing architectures that make up software parallelism. The key is hardware parallelism, not software parallelism, made possible by the hybrid FPGA/x86 compute engine at the heart of the Ryft ONE.

I also circled:

By combining massively parallel FPGA processing with an x86-powered Linux front-end, 48 TB of storage, a library of algorithmic components and open APIs in a small 1U device, Ryft has created the first easy-to-use appliance to accelerate fuzzy search to match exact search speeds without indexing.
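To make the “edit distance” and “without indexing” talk concrete, here is a minimal software sketch of index-free fuzzy search: a single pass over raw text that reports where a pattern matches within a set number of edits. This is the textbook dynamic programming scan (often credited to Sellers), not Ryft’s FPGA method, about which the article gives no implementation detail. The function name and the `max_edits` parameter are my own, chosen for illustration.

```python
def fuzzy_find(pattern, text, max_edits=2):
    """Scan text once and return the end offsets where pattern matches
    some substring of text with at most max_edits edits
    (insertions, deletions, or substitutions). No index is built."""
    m = len(pattern)
    # prev[i] = fewest edits to match pattern[:i] against a substring
    # ending at the previous text position (initially: empty text).
    prev = list(range(m + 1))
    hits = []
    for pos, ch in enumerate(text):
        # Row entry 0 stays 0 because a match may start at any position.
        cur = [0]
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur.append(min(prev[i - 1] + cost,  # match or substitution
                           prev[i] + 1,         # extra character in text
                           cur[i - 1] + 1))     # pattern character skipped
        if cur[m] <= max_edits:
            hits.append(pos + 1)  # offset just past the matching substring
        prev = cur
    return hits


# Illustrative call: a one-letter misspelling is found with max_edits=1.
offsets = fuzzy_find("palantir", "the palentir system", max_edits=1)
```

The scan costs roughly pattern length times text length, which is why index-free fuzzy search in plain software bogs down on large data and why vendors reach for hardware parallelism or, more commonly, an index.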

An outfit called InsideBigData published “Ryft Makes Real-time Fuzzy Search a Reality.” Alas, that link is now dead.

Perhaps a real time fuzzy search will reveal the quickly deleted content?

Sounds promising. How does one retrieve information within videos, audio streams, and images? How does one hook together or link a reference to an entity (discovered without controlled term lists) with a phone number?

My hunch is that the methods disclosed in the article have promise. The future of search seems to be lurching toward applications that solve real world, real time problems. Ryft may be heading in that direction in a search climate which presents formidable headwinds.

Stephen E Arnold, April 27, 2016
