CyberOSINT banner

Why Enterprise Search Fails

July 12, 2016

I participated in a telephone call before the US holiday break. The subject was the likelihood of a potential investment in an enterprise search technology would be a winner. I listened for most of the 60 minute call. I offered a brief example of the over promise and under deliver problems which plagued Convera and Fast Search & Transfer and several of the people on the call asked, “What’s a Convera?” I knew that today’s whiz kids are essentially reinventing the wheel.

I wanted to capture three ideas which I jotted down during that call. My thought is that at some future time, a person wanting to understand the incredible failures that enterprise search vendors have tallied will have three observations to consider.

No background is necessary. You don’t need to read about throwing rocks at the Google bus, search engine optimization, or any of the craziness about search making Big Data a little pussycat.

Enterprise Search: Does a Couple of Things Well When Users Expect Much More

Enterprise search systems ship with filters or widgets which convert source text into a format that the content processing module can index. The problem is that images, videos, audio files, content from wonky legacy systems, or proprietary file formats like IBM i2’s ANB files do not lend themselves to indexing by a standard enterprise search system.  The buyers or licensees of the enterprise search system do not understand this one trick pony nature of text retrieval. Therefore, when the system is deployed, consternation follows confusion when content is not “in” the enterprise search system and, therefore, cannot be found. There are systems which can deal with a wide range of content, but these systems are marketed in a different way, often cost millions of dollars a year to set up, maintain, and operate.


Net net: Vendors do not explain the limitations of text search. Licensees do not take the time or have the desire to understand what an enterprise search system can actually do. Marketers obfuscate in order to close the deal. Failure is a natural consequence.

Data Management Needed

The disconnect boils down to what digital information the licensee wants to search. Once the universe is defined, the system into which the data will be placed must be resolved. No data management, no enterprise search. The reason is that licensees and the users of an enterprise search system assume that “all” or “everything” – maps to web content, email to outputs from an AS/400 Ironside are available any time. Baloney. Few organizations have the expertise or the appetite to deal with figuring out what is where, how much, how frequently each type of data changes, and the formats used. I can hear you saying, “Hey, we know what we have and what we need. We don’t need a stupid, time consuming, expensive inventory.” There you go. Failure is a distinct possibility.


Net net: Hope springs eternal. When problems arise, few know what’s where, who’s on first, and why I don’t know is on third.

Read more

Google Search: Retrievers Lose. Smart Software Wins

June 28, 2016

I scanned a number of write ups about Google’s embrace of machine learning and smart software. I supplement my Google queries with the results of other systems. Some of these have their own index; for example, and Exalead. Others are metasearch engines will suck in results and do some post processing to help answer the users’ questions. Others are disappointing and I check them out when I have a client who is willing to pay for stone flipping; for example, DuckDuckGo, iSeek, or the estimable Qwant. (I love quirky spelling too.)

I read “RankBrain Third Most Important Factor Determining Google Search Results.” Here’s the quote I noted:

Google is characteristically fuzzy on exactly how it improves search (something to do with the long tail? Better interpretation of ambiguous requests?) but Jeff Dean [former AltaVista wizard] says that RankBrain is “involved in every query,” and affects the actual rankings “probably not in every query but in a lot of queries.” What’s more, it’s hugely effective. Of the hundreds of “signals” Google search uses when it calculates its rankings (a signal might be the user’s geographical location, or whether the headline on a page matches the text in the query), RankBrain is now rated as the third most useful. “It was significant to the company that we were successful in making search better with machine learning,” says John Giannandrea. “That caused a lot of people to pay attention.”Pedro Domingos, the University of Washington professor who wrote The Master Algorithm, puts it a different way: “There was always this battle between the retrievers and the machine learning people,” he says. “The machine learners have finally won the battle.”

I have noticed in the last year, that I am unable to locate certain documents when I use the words and phrases which had served me well before smart software became the cat’s pajamas.

One recent example was my need to locate a case example about a German policeman’s trials and tribulations with the Dark Web. When I first located this document, I was trying to verify an anecdote shared with me after one of my intelligence community lectures.

I had the document in my file and I pulled it up on my monitor. The document in question is the work of an outfit and person labeled “Lars Hilse.” The title of the write up is “Dark Web & Bitcoin: Global Terrorism “Threat Assessment. The document was published in April 2013 with an update issued in November 2013. (That document was the source or maybe confirmed the anecdote about the German policeman and his Dark Web research.)

For my amusement, I wondered if I could use the new and improved Google Web search to locate the document. I display section 4.8 on my screen. The heading of the section is “Extortion (of Law Enforcement Personnel).

I entered the phrase into Google without quotes. Here’s the first page of results:


None of the hits points to the document with the five word phrase.

Read more

Palantir Technologies Challenges US Government Procurement

June 22, 2016

I was a wee lad when I read Don Quixote. I know that students in Spain and some other countries study the text of the 17th century novel closely. I did not. I remember flipping through a Classics’ comic book, reading the chapter summaries in Cliff’s Notes, and looking at the pictures in the edition in my high school’s library. Close enough for horse shoes. (I got an A on the test. Heh heh heh.)

Here’s what I recall the Don and his sidekick. A cultured fellow read a lot of fantasy fiction, mixed it up the real world, and went off on adventures or sallies. The protagonist (see I remember something from Ms. Sperling’s literature class in 1960) rode a horse and charged into the countryside to kill windmills. I remember there were lots and lots of adventures, not too much sex – drugs – rock and roll, and many convoluted literary tropes.

I still like the windmills. A Google search showed me an image which is very similar to the one in the comic book I used as my definitive version of the great novel. Here it is:

Image result for don quixote windmills

What does a guy riding a horse with a lance toward a windmill have to do with search and content processing? Well, I read “Palantir Lambastes Army Over $206 Million Contract Bidding.” I assume the information in the write up is spot on.

Palantir Technologies, a unicorn which is the current fixation of a Buzzfeed journalist, is going to sue the US Army over a “to be” contract for work. The issue is an all source information system procurement known as DCGS or sometimes DI2E. The acronyms are irrelevant. What is important is that the US Army has been plugging away with a cadre of established government contractors for a decade. Depending on whom one asks, DCGS is the greatest thing since sliced bread or it is a flop.

However, Palantir believes that its augmented intelligence system is a better DCGS / DI2E. than the actual DCGS / DI2E.

The US Army may not agree and appears be on the path to awarding the contract for DCGS work to other vendors.

According to the write up:

Palantir claims the Army’s solicitation is “unlawful, irrational, arbitrary and capricious,” according to the letter of intent Palantir sent to the U.S. Army and the Department of Justice, which was obtained by Bloomberg. The letter is a legal courtesy, which states Palantir will file a formal protest in the U.S. Court of Federal Claims next week and requests the Army delay awarding the first phase of the contract until litigation is resolved. The contract is slated to be awarded by the end of 2016.

The contract is worth a couple of hundred million, but the follow on work is likely to hit nine figures. Palantir has some investors who want more growth. The best way to get it, if the write up is accurate, is on the backs of legal eagles.

I don’t know anything about the US Army and next to nothing about Palantir, but I have some experience watching vendors protest the US government’s procurement process. My thought is that when bidders sue the government:

  • Costs go up. Lawyers are very busy, often for a year or more. In lawyer land, billing is really good.
  • Delays occur. The government unit snagged in the contracting hassle have to juggle more balls; for example, tasks have to be completed. When the vendors are not able to begin work, delays occur. This may not be a problem in lawyer land, but in the real world, downstream dependencies can be a hitch in the git along.
  • Old scores may be hummed. Palantir settled a legal dust up with IBM which owns i2 Analysts Notebook. The Analysts Notebook is the very same software system whose file structure Palantir wanted to understand. i2 was not too keen on making its details available. (Note: I was a consultant to i2 for a number of years, and this was input number one to me from one of the founders). IBM has a pretty good institutional memory without consulting Watson.)

And Don Quixote? I wonder if the Palantirians, some of whom fancy themselves Hobbits, are going to be able to shape the real world to their vision. The trajectory of this legal dust up will be interesting to watch as it flames across the sky toward Spain and Don Quixote’s fictional library. Flame out or direct hit? The US Army and US government procurement policies are able to absorb charging horses and possibly a lance poke or two.

Stephen E Arnold, June 22, 2016

GAO DCGS Letter B-412746

June 1, 2016

A few days ago, I stumbled upon a copy of a letter from the GAO concerning Palantir Technologies dated May 18, 2016. The letter became available to me a few days after the 18th, and the US holiday probably limited circulation of the document. The letter is from the US Government Accountability Office and signed by Susan A. Poling, general counsel. There are eight recipients, some from Palantir, some from the US Army, and two in the GAO.

palantir checkmate

Has the US Army put Palantir in an untenable spot? Is there a deus ex machina about to resolve the apparent checkmate?

The letter tells Palantir Technologies that its protest of the DCGS Increment 2 award to another contractor is denied. I don’t want to revisit the history or the details as I understand them of the DCGS project. (DCGS, pronounced “dsigs”, is a US government information fusion project associated with the US Army but seemingly applicable to other Department of Defense entities like the Air Force and the Navy.)

The passage in the letter I found interesting was:

While the market research revealed that commercial items were available to meet some of the DCGS-A2 requirements, the agency concluded that there was no commercial solution that could  meet all the requirements of DCGS-A2. As the agency explained in its report, the DCGS-A2 contractor will need to do a great deal of development and integration work, which will include importing capabilities from DCGS-A1 and designing mature interfaces for them. Because  the agency concluded that significant portions of the anticipated DCSG-A2 scope of work were not available as a commercial product, the agency determined that the DCGS-A2 development effort could not be procured as a commercial product under FAR part 12 procedures. The protester has failed to show that the agency’s determination in this regard was unreasonable.

The “importing” point is a big deal. I find it difficult to imagine that IBM i2 engineers will be eager to permit the Palantir Gotham system to work like one happy family. The importation and manipulation of i2 data in a third party system is more difficult than opening an RTF file in Word in my experience. My recollection is that the unfortunate i2-Palantir legal matter was, in part, related to figuring out how to deal with ANB files. (ANB is i2 shorthand for Analysts Notebook’s file format, a somewhat complex and closely-held construct.)

Net net: Palantir Technologies will not be the dog wagging the tail of IBM i2 and a number of other major US government integrators. The good news is that there will be quite a bit of work available for firms able to support the prime contractors and the vendors eligible and selected to provide for-fee products and services.

Was this a shoot-from-the-hip decision to deny Palantir’s objection to the award? No. I believe the FAR procurement guidelines and the content of the statement of work provided the framework for the decision. However, context is important as are past experiences and perceptions of vendors in the running for substantive US government programs.

Read more

Interview with Stephen E Arnold, Reveals Insights about Content Processing

March 22, 2016

Nikola Danaylov of the Singularity Weblog interviewed technology and financial analyst Stephen E. Arnold on the latest episode of his podcast, Singularity 1 on 1. The interview, Stephen E. Arnold on Search Engines and Intelligence Gathering, offers thought-provoking ideas on important topics related to sectors — such as intelligence, enterprise search, and financial — which use indexing and content processing methods Arnold has worked with for over 50 years.

Arnold attributes the origins of his interest in technology to a programming challenge he sought and accepted from a computer science professor, outside of the realm of his college major of English. His focus on creating actionable software and his affinity for problem-solving of any nature led him to leave PhD work for a job with Halliburton Nuclear. His career includes employment at Booz, Allen & Hamilton, the Courier Journal & Louisville Times, and Ziff Communications, before starting strategic information services in 1991. He co-founded and sold a search system to Lycos, Inc., worked with numerous organizations including several intelligence and enforcement organizations such as US Senate Police and General Services Administration, and authored seven books and monographs on search related topics.

With a continued emphasis on search technologies, Arnold began his blog, Beyond Search, in 2008 aiming to provide an independent source of “information about what I think are problems or misstatements related to online search and content processing.” Speaking to the relevance of the blog to his current interest in the intelligence sector of search, he asserts:

“Finding information is the core of the intelligence process. It’s absolutely essential to understand answering questions on point and so someone can do the job and that’s been the theme of Beyond Search.”

As Danaylov notes, the concept of search encompasses several areas where information discovery is key for one audience or another, whether counter-terrorism, commercial, or other purposes. Arnold agrees,

“It’s exactly the same as what the professor wanted to do in 1962. He had a collection  of Latin sermons. The only way to find anything was to look at sermons on microfilm. Whether it is cell phone intercepts, geospatial data, processing YouTube videos uploaded from a specific IP address– exactly the same problem and process. The difficulty that exists is that today we need to process data in a range of file types and at much higher speeds than ever anticipated, but the processes remain the same.”

Arnold explains the iterative nature of his work:

“The proof of the value of the legacy is I don’t really do anything new, I just keep following these themes. The Dark Web Notebook is very logical. This is a new content domain. And if you’re an intelligence or information professional, you want to know, how do you make headway in that space.”

Describing his most recent book, Dark Web Notebook, Arnold calls it “a cookbook for an investigator to access information on the Dark Web.” This monograph includes profiles of little-known firms which perform high-value Dark Web indexing and follows a book he authored in 2015 called CYBEROSINT: Next Generation Information Access.

Read more

Real Time: Maybe, Maybe Not

March 1, 2016

Years ago an outfit in Europe wanted me to look at claims made by search and content processing vendors about real time functions.

The goslings and I rounded up the systems, pumped our test corpus through, and tried to figure out what was real time.

The general buzzy Teddy Bear notion of real time is that when new data are available to the system, the system processes the data and makes them available to other software processes and users.

The Teddy Bear view is:

  1. Zero latency
  2. Works reliably
  3. No big deal for modern infrastructure
  4. No engineering required
  5. Any user connected to the system has immediate access to reports including the new or changed data.

Well, guess what, Pilgrim?

We learned quickly that real time, like love and truth, is a darned slippery concept. Here’s one view of what we learned:


Types of Real Time Operations. © Stephen E Arnold, 2009

The main point of the chart is that there are six types of real time search and content processing. When someone says, “Real time,” there are a number of questions to ask. The major finding of the study was that for near real time processing for a financial trading outfit, the cost soars into seven figures and may keep on rising as the volume of data to be processed goes up. The other big finding was that every real time system introduces latency. Seconds, minutes, hours, days, and weeks may pass before the update actually becomes available to other subsystems or to users. If you think you are looking at real time info, you may want to shoot us an email. We can help you figure out which type of “real time” your real time system is delivering. Write benkent2020 @ yahoo dot com and put Real Time in the subject line, gentle reader.

I thought about this research project when I read “Why the Search Console Reporting Is not real time: Explains Google!” As you work through the write up, you will see that the latency in the system is essentially part of the woodwork. The data one accesses is stale. Figuring out how stale is a fairly big job. The Alphabet Google thing is dealing with budgets, infrastructure costs, and a new chief financial officer.

Real time. Not now and not unless something magic happens to eliminate latencies, marketing baloney, and user misunderstanding of real time.

Excitement in non real time.

Stephen E Arnold, March 1, 2016

Alphabet Google Justifies Its R&D Science Club Methods

January 23, 2016

In the midst of the snowmageddon craziness in rural Kentucky, I noted a couple of Alphabet Google write ups. Unlike the sale of shares, the article tackle the conceptual value of the Alphabet Google’s approach to research and development. I view most of Google’s post 2006 research as an advanced version of my high school science club projects.

Our tasks in 1960 included doing a moon measurement from central Illinois. Don’t laugh, Don and Bernard Jackson published their follow on to the science club musing in 1962. In Don’s first University of Illinois astronomy class, the paper was mentioned by the professor. The prof raised a question about the method. Don raised his hand and explained how the data were gathered. The prof was not impressed. Like many mavens, the notion that a college freshman and his brother wrote a paper, got it published, and then explained the method in front of a class of indifferent freshman was too much for the expert. I think the prof shifted to social science or economics, both less rigorous disciplines in my view.


Google’s research interests.

The point is that youth can get some things right. As folks age, the view of what’s right and what’s a little off the beam differ.

Let’s look at the first write up called “How Larry Page’s Obsessions Became Google’s Business.” Note that if the link is dead, you may have to subscribe to the newspaper or hit the library in search of a dead tree copy. The New York Times have an on again and off again approach to the Google. It’s not that the reporters don’t ask the right questions. I think that the “real” journalists get distracted with the free mouse pads and folks like Tony Bennett crooning in the cafeteria to think about what the Google was, is, and has become.

The article points out:

Mr. Page is hardly the first Silicon Valley chief with a case of intellectual wanderlust, but unlike most of his peers, he has invested far beyond his company’s core business and in many ways has made it a reflection of his personal fascinations.

I then learned:

Another question he likes to ask: “Why can’t this be bigger?”

The suggestion that bigger is better is interesting. Stakeholders assume the “bigger” means more revenue and profit. Let’s hope.

Then this insight:

When Mr. Page does talk in public, he tends to focus on optimistic pronouncements about the future and Google’s desire to help humanity.

Optimism is good.

I then worked through “Google Alphabet and Four times the Research Budget of Darpa and Larger Moonshot Ambitions than Darpa.”

The bigger, I thought, may not be revenue. The bigger may be the budget of the science club. If Don and Bernie Jackson could build on the moon data, Google can too. Right?

Read more

Proprietary Enterprise Search: False Hopes and Brutal Costs

December 21, 2015

At lunch the other day, the goslings and I engaged in what I thought was a routine discussion: The sad state of the enterprise search market.

I pointed out that the “Enterprise Search Daily” set up by Edwin Stauthamer was almost exclusively a compilation of Big Data articles. Enterprise search, although the title of the daily, was not the focal point of the content.


Enterprise search is a cost black hole. R&D, support, customization, and bug fixes gorge on money and engineers. Instead of adding value to an enterprise system, search becomes the reason the CFO has a migraine and why sales professionals struggle to close deals.

I said, “Enterprise search has disappeared.”

One of the goslings asked, “What’s happened to the proprietary search systems acquired by some big companies?”

We were off an running.

The goslings mentioned that Dassault Systèmes bought Exalead and the brand has disappeared from the US market. IBM bought Vivisimo, and the purchase was explained as a Big Data buy, but the company and its technology have disappeared into the Great Blue Hole, which is today’s IBM. Hummingbird bought Fulcrum, and then OpenText bought Hummingbird. Open Text owns Information Dimension’s BASIS, BRS Search, and its own home brew search system. Oracle snapped up Endeca, InQuira, and RightNow in a barrage of search binge shopping. Lexmark—formerly a unit of Big Blue—bought ISYS Search Software and Brainware. Then there was the famous purchase of Fast Search & Transfer by Microsoft and the subsequent police investigation and the charges filed against a former executive for fancy dancing with the revenue numbers. And who can forget the $11 billion purchase of Autonomy by IBM. There have been other deals, and the goslings enjoyed commenting on this.

I called a halt to the lunch time stand up comedy routine. The executives of these companies were trying to do what they thought was best for their [a] financial future and [b] for their stakeholders. Some of these stakeholders had suffered through revenue droughts and were looking for a way out of the sea of red ink enterprise search vendors generate with aplomb.

The point I raised was, “Does the purchase of a proprietary enterprise search system?” make a substantive contribution to the financial health of the purchasing company.

Read more

Reed Elsevier Lexis Nexis Embraces Legal Analytics: No, Not an Oxymoron

November 27, 2015

Lawyers and legal search and content processing systems do words. The analytics part of life, based on my limited experience of watching attorneys do mathy stuff, is not these folks’ core competency. Words. Oh, and billing. I can’t overlook billing.

I read “Now It’s Official: Lexis Nexis Acquires Lex Machina.” This is good news for the stakeholders of Lex Machina. Reed Elsevier certainly expects Lex Machina’s business processes to deliver an avalanche of high margin revenue. One can only raise prices so far before the old chestnut from Economics 101 kicks in: Price elasticity. Once something is too expensive, the customers kick the habit, find an alternative, or innovate in remarkable ways.

According to the write up:

LexisNexis today announced the acquisition of Silicon Valley-based Lex Machina, creators of the award-winning Legal Analytics platform that helps law firms and companies excel in the business and practice of law.

So what does legal analytics do? Here’s the official explanation, which is in, gentle reader, words:

  • A look into the near future. The integration of Lex Machina Legal Analytics with the deep collection of LexisNexis content and technology will unleash the creation of new, innovative solutions to help predict the results of legal strategies for all areas of the law.
  • Industry narrative. The acquisition is a prominent and fresh example of how a major player in legal technology and publishing is investing in analytics capabilities.

I don’t exactly know what Lex Machina delivers. The company’s Web page states:

We mine litigation data, revealing insights never before available about judges, lawyers, parties, and patents, culled from millions of pages of IP litigation information. We call these insights Legal Analytics, because analytics involves the discovery and communication of meaningful patterns in data. Our customers use to win in the highly competitive business and practice of law. Corporate counsel use Lex Machina to select and manage outside counsel, increase IP value and income, protect company assets, and compare performance with competitors. Law firm attorneys and their staff use Lex Machina to pitch and land new clients, win IP lawsuits, close transactions, and prosecute new patents.

I think I understand. Lex Machina applies the systems and methods used for decades by companies like BAE Systems (Detica/ NetReveal) and similar firms to provide tools which identify important items. (BAE was one of Autonomy’s early customers back in the late 1990s.) Algorithms, not humans reading documents in banker boxes, find the good stuff. Costs go down because software is less expensive than real legal eagles. Partners can review outputs and even visualizations. Revolutionary.

Read more

Big Data: Systems of Insight

October 6, 2015

I read “All Your Big Data Will Mean Nothing without Systems of Insight.” The title reminded me of the verbiage generated by mid tier consulting firms and adjuncts teaching MBA courses at some institutions of higher learning. Malarkey, parental advice, and Big Data—a Paula Dean-type recipe for low-calorie intellectual fare.


Can one live on the outputs of mid tier consulting firm lingo prepared to be fudgier?

The notion of a system of insight is not particularly interesting. The rhetorical trip of moving from a particular to a more general concept fools some beginning debaters. For a more experienced debater, the key is to keep the eye on the ball, which, in this case, is the tenuous connection between Big Data and strategic management methods. (I am not sure these exist even after reading every one of Peter Drucker’s books.)

But I like to deal with particulars.

Computerworld is a sister or first cousin unit of the IDC outfit which sold my research on Amazon without asking my permission. My valiant legal eagle was able to disappear the report. I was concerned with the connection of my name and the names of two of my researchers with the IDC outfit. I have presented some of the back story in previous blog posts. I included screenshots along with the details of not issuing a contract, using content in ways to which I would never agree, and engaging in letters with my attorney offering inducements to drop the matter. Wow. A big company is unable to get organized and then pays its law firm to find a solution to the self created problem.

The report in question was a limp wristed, eight pages in length and available to Amazon’s eager readers of romance novels for a mere $3,500. Hey, the good stuff in our research was chopped out, leaving a GrapeNut flakes experience for those able to read the document. I am a lousy writer, but I try to get my points across in a colorful way. Cereal bowl writing is not for me.

What does this have to do with Big Data and a system of insights?

Aren’t Amazon’s sales data big? Isn’t it possible to look at what sells on Amazon by scanning the company’s public information about books? Won’t a casual Google search reveal information about Amazon’s best selling eBooks? Best sellers’ lists rarely feature eight pages of watered down analysis of a search vendor with some soul bonding with the outstanding Fast Search & Transfer operation. How many folks visiting the digital WalMart buy $3,500 reports with my name on them?

Er, zero. So what’s the disconnect between basic data about what sells on Amazon, issuing appropriate contractual documents, and selling research with my name and two of my goslings on the $3,500, eight page document. That’s brilliant data analysis for sure.

The write up explains:

Businesses want to use data to understand customers, but they can’t do that without harnessing insights and consistently turning data into effective action.

That sort of makes sense except that the company which owns Computerworld, under the keen-eyed Dave Schubmehl, appeared to ignore this step when trying to sell a report with my name on it to the Amazon faithful. Do the folks at Computerworld and the company’s various knowledge properties connect data with their colleagues’ decisions?

Read more

Next Page »