One User Finds Some Flaws in Elasticsearch

August 18, 2014

We are jazzed about Elasticsearch. Our own search expert Stephen E. Arnold, who has been yearning for some real innovation in search for years now, recently declared, “I will be telling those who attend my lectures to go with Elasticsearch. That’s where the developers and the money are.” Personally, I’m inclined to go with the search expert here (though I admit I may be a bit biased.) This declaration is just to preface my reaction to a post at Sammaye’s Blog, “Things I Have Learnt in the First 5 Minutes of Using Elastic Search.” Apparently, how to spell the name correctly was not one of those things.

Still, it looks like programmer Sam Millman (aka Sammaye) may have some good points. For example, he describes the querying as “the most verbose in the universe,” balks at the requirement to define indexes client-side, and claims Lucene is a bad platform on which to base search in the first place. He also calls the documentation terrible, and bad documentation happens to be a pet peeve of mine. (I’ve written documentation. If you must supply it, you might as well make it comprehensive, organized, and well-written. It’s not that difficult.) Millman explains:

“Its documentation is great at explaining the API, no doubt about it but if you want to actually find out how something works and why something is then you have to constantly ask StackOverflow. It just describes what parameters to put in and then leaves the rest up to you thinking that you don’t want to bother yourself with those details. We do though, we are not bandwagoning your product, we want to know how sharding and replication works, how indexes work and how to manage the product and more. Even when looking at the API the documentation can sometimes be…unhelpful. Mainly due to its huge font-size, yet tiny middle centered layout, English language problems and disorganisation. Overall I came out less than impressed about Elastic Searches documentation. I actually Google search everything first so I don’t have to navigate that mess.”
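For readers who have not seen the query syntax Millman is grumbling about, here is a minimal sketch of a search body in the style of the Elasticsearch query DSL, built as a plain Python dictionary. The field names ("title", "published") and the helper functions are made up for illustration, and the bool/filter form shown is the one I know from later Elasticsearch releases, not necessarily what Millman was using. The point is only to show how many layers of nesting a simple "match this text, after this date" question takes.

```python
import json

# Hypothetical field names ("title", "published"). The clause names
# (bool, must, match, filter, range) follow the Elasticsearch query DSL.
def build_query(text, since):
    """Return a search body: full-text match on title, restricted by date."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title": text}}],
                "filter": [{"range": {"published": {"gte": since}}}],
            }
        }
    }

def depth(node):
    """Nesting depth of a JSON-like structure, a rough measure of verbosity."""
    if isinstance(node, dict):
        return 1 + max((depth(v) for v in node.values()), default=0)
    if isinstance(node, list):
        return 1 + max((depth(v) for v in node), default=0)
    return 0

body = build_query("enterprise search", "2014-01-01")
print(json.dumps(body, indent=2))
print("nesting depth:", depth(body))
```

Nothing here talks to a server. The dictionary is what a client would serialize and send, which is where the verbosity complaint comes from.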

So, perhaps Elasticsearch is not perfect. See the article for Millman’s full roster of complaints. However, if Arnold is correct and this is “where the developers and the money are” right now, vexing problems should be fixed in short order. It would be a mistake to not take Elasticsearch seriously. Formed in 2012, the company is based in Amsterdam with offices in the U.S., the U.K., France, Germany, and Switzerland. They are also hiring as of this writing, in case anyone here wants to help them iron out some wrinkles.

Cynthia Murrell, August 18, 2014

Sponsored by, developer of Augmentext

The Guardian Explores HP Autonomy

August 16, 2014

I read “Hewlett-Packard Allegations: Autonomy Founder Mike Lynch Tries to Clear Name.” The British “real” newspaper focuses on Mike Lynch, the founder of Autonomy. I am convinced that Autonomy pitched the value of its company to a number of firms. I know that Hewlett Packard bought Autonomy. I assume that spending $11 billion was not a K Mart blue light special impulse purchase. I know that HP has had what the MBAs call “governance challenges.” These range from allegations of getting frisky with folks to management churn. I know that for me, the HP of electronic devices yielded to the HP of the ink cartridges.

Here’s a point I highlighted in the Guardian’s write up:

Meanwhile, lawyers on all sides are using legal privilege to sling mud. Lynch says it is not only his name that has been stained, but that of the British technology industry. Autonomy’s accounting and marketing methods had attracted criticism before the HP acquisition, but Lynch was also a poster child for the achievements of Cambridge’s Silicon Fen. The Autonomy affair casts a shadow, and a conclusion from the SFO is overdue.

I have a slightly different view of the dust up. Folks want to believe that information retrieval will generate another Google. Because of those expectations, executives whose expertise in search extends to running a Google search on a mobile device assume they know about content processing.

When buyers get excited about a purchase, some people buy Bugatti Veyrons and spring for gold iPhones. Others snap up search companies and expect the money to roll in like the oohs and aahs at the golf club when the Veyron rolls up.

Wrong. The dust up between HP and Autonomy is an illustration of what happens when folks without too much understanding of content processing’s complexities covet a home run. The impact does affect Mike Lynch, a Cambridge PhD and real live inventor.

The collateral damage is on the buyers of search companies who toss millions at a sector without understanding how difficult it is to create a search company that is not selling ads or living exclusively on Department of Defense largesse.

HP bought a company with a strong brand, customers, and technology that when properly resourced works. HP did not buy a Google scale money stream, a Palantir clinging to the US government, or a break even metasearch system.

The impact on the reputation of Autonomy professionals is significant. What does this dispute do to other search and content processing companies? Search is tough enough without having a megaton dispute played out in the datasphere.

HP did not have to buy Autonomy. Microsoft passed. Oracle passed. HP bought. HP had time and resources to dig through Autonomy. If it did not, then HP created its own problem. If it did, HP created its own problem. Autonomy, with 15 years of history, was looking for a buyer. My hunch is that HP was looking for a Google and bought a different business because HP convinced itself it could generate more money than Autonomy could. HP found out that it could not match Autonomy’s revenues. Whom does any self respecting MBA or lawyer blame? The other guy.

This hassle says much about HP. Sadly it affects other search and content processing companies as well.

Stephen E Arnold, August 16, 2014

Venture Outcome: The Search and Content Processing Angle

August 14, 2014

I suggest you read “Venture Outcomes Are Even More Skewed Than You Think.” The write up contains several factoids. I highlighted one and added a couple of exclamation points. I suggest you print out the article, grab a writing instrument, and do your own filtering.

The main point of the write up is buried in the paragraph that begins “This really underscores the challenge of crating a venture portfolio that produces reasonable returns.” The factoid I honored with exclamation points is:

In my hypothetical $100M fund with 20 investments, the total number of financings producing a return above 5x was 0.8 – producing almost $100M of proceeds. My theoretical fund actually didn’t find their purple unicorn, they found 4/5ths of that company. If they had missed it, they would have failed to return capital after fees.  Even if we doubled the number of portfolio companies in the hypothetical portfolio, a full quarter of the fund’s return comes from the roughly ½ of a company they invested in that generated 10x or above. Had they missed it, they would have produced a return that roughly approximated investing in bonds – not the kind of risk adjusted return they or their investors were looking for.
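The arithmetic behind that factoid can be sketched in a few lines. The fund size, the number of investments, the 0.8 figure, and the "almost $100M" of proceeds come from the quoted passage. The equal check sizes and the fully invested, fee-free fund are my simplifying assumptions, not the original author's, so treat the implied multiple as a rough upper bound.

```python
# Figures taken from the quoted hypothetical.
FUND_SIZE_M = 100.0         # $100M fund
NUM_INVESTMENTS = 20
BIG_WINNERS = 0.8           # financings returning above 5x
WINNER_PROCEEDS_M = 100.0   # "almost $100M", treated here as an upper bound

# Assumption (not in the source): equal check sizes, fund fully invested, no fees.
check_size_m = FUND_SIZE_M / NUM_INVESTMENTS
capital_in_winners_m = BIG_WINNERS * check_size_m
implied_multiple = WINNER_PROCEEDS_M / capital_in_winners_m
share_of_capital = capital_in_winners_m / FUND_SIZE_M

print(f"check size per investment: ${check_size_m:.1f}M")
print(f"capital behind the big winner: ${capital_in_winners_m:.1f}M")
print(f"implied multiple on that capital: about {implied_multiple:.0f}x")
print(f"share of the fund's capital: {share_of_capital:.0%}")
```

Under those assumptions, a sliver of the fund's capital has to return a very large multiple to produce roughly the whole fund's worth of proceeds, which is the skew the write up is describing.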

I know this is a hypothetical. Assume that the analysis is off by plus or minus 10 percent. What do we get? Lousy returns; that is, returns comparable to dumping cash into bonds. I think about the banking and venture firm meetings in which I have participated. I cannot recall any of the smiling MBAs considering that their best ideas could perform on a par with bonds. My hunch is that the people who pushed money into venture funds and bank VP-inspired investments are not thinking bond-type yield.

If the number is accurate, I wonder if those folks who have pumped tens of millions of dollars into outfits promising a money ball from search and content processing will get their money back. Forget an upside. Break even may be tough. Search and content processing makes headlines like this one every day:


To get similar results, navigate to Google News and enter the query Autonomy HP or Autonomy CFO.

The second item I circled with my pink marker was a diagram:


The important part is the small number of “winners” graphically embodied in the minuscule 0.4% column. This is a broad swath of investments. For search and content processing, the payoffs have to be measured in what money flows via revenues or a sell off like Fast Search to Microsoft, Exalead to Dassault, or Autonomy to HP. The number of folks who made big bucks and are really happy may be modest. In fact, judging from the legal hassles with regard to Fast Search and the recent HP Autonomy headlines, even those who were MBA winners may have headaches. Information retrieval seems to deliver a number of headaches for stakeholders.

The third item is the factoid that makes clear the failure rate of start ups. Search and content processing poses similar challenges. There is a twist. Once a search and content processing company sells to a larger firm, how many have become major money pumps for the acquiring companies? The question is very difficult to answer. The absence of information tells me that there are not too many feel good stories to tell. The pleas on LinkedIn enterprise search discussion threads for positive case studies about search are easy to ignore. Good news with regard to search and content processing is not sloshing around the Big Data bucket in which we exist.

How long will companies that have been in business for many years promising a money ball from search be able to survive? How long will the old soft shoe about search and content processing open checkbooks? How many years will it take some information retrieval companies to replace red ink with the black ink of hefty after tax profits? How long will it take those seeking answers to information retrieval problems to wake up to the fact that consultant saucisson, Star Trek fantasies, and marketing hyperbole are unlikely to deliver a Disneyland-like “win”?

The data set for the Seth Levine write up is large enough to warrant a tentative answer, “Probably never.” Search and content processing are different. The algorithms and methods are decades old. Talk does not change what can be accomplished with affordable computational resources. Pumping money into search, therefore, may be painful when the actual financial data are reviewed by investors and stakeholders.

Why aren’t there abundant “good news” cases for search and content processing? There just aren’t that many. Think a power curve of implementation successes. There are more examples of search going off the rails than home runs. This is surprising when so many profess to be experts in search and so much money has been injected into information retrieval start ups. The business strategy of search and content processing companies may be raising money. Any other work may be of little interest.

Stephen E Arnold, August 14, 2014

A Case Study for Search from Opentext

August 13, 2014

The Customer Story about Distell on OpenText tells of the successful South African beverage company. The “article” might provide a search case study. OpenText is an information management software company that offers guidance in content management, archiving, web content management, and a myriad of other pursuits within the umbrella of “unleashing the power of information.” The article provides a list of bullet points about the company and an About section that states,

“Distell is Africa’s leading producer and marketer of spirits, fine wines, ciders, and ready-to-drinks (RTDs). It employs nearly 5,000 people and has an annual turnover in excess of R12,3 billion. When Distell was formed in 2000 it had 1,700 information workers but due to mainly organic growth and the acquisitions of Bisquit, a French cognac company, and Burn Stewart Distillers, a Scottish whisky producer, that has now grown to 3,000 users spread across over 80 offices, mainly in Southern Africa, but also in eight international locations.”

Otherwise it has a movie and lots of dot points. Substantive cost overrun info? Nope. Of course there is also a link to the full story, a three page PDF that provides detailed information about the company and its prospects. But the dot points are a lot more appealing.

Chelsea Kerwin, August 13, 2014


Flurry in Stock Market Listings Coincides with SLI Systems Downward Spiral

August 12, 2014

The article titled SLI Systems Plunges to Lowest Since Listing on TVNZ discusses the recent burst of listings. SLI Systems is a company that provides site search, navigation and “user-generated SEO.” SLI’s share price shows the pressure findability vendors are facing in today’s marketplace. The stock fell over seven percent and remains just above its initial public offer price of $1.15. The article states,

“The local stock market is experiencing a flurry of listings which is spoiling investors for choice after it got a shot in the arm from the government’s partial privatisation last year, and the recent listings of software developers Gentrack Group and Serko have only added to tech investments available. Next week, IkeGPS Group, which sells a range of portable measuring devices, plans to list while Vista Entertainment, the cinema software and data analytics company, is due in August…”

Paul Harrison of Salt Funds Management believes that the flood of listings is not the only culprit for falling prices. Instead, he suggests that certain stocks were simply priced too high and the current downward trend is a “hangover” following the initial “frenzy.” Other affected companies mentioned include Xero, the accounting software firm, the biotech company Pacific Edge which was unchanged, and Diligent, which also fell in price.

Chelsea Kerwin, August 12, 2014


OnlyBoth Launches “Niche Finding” Data Search

August 12, 2014

An article on the Library Journal Infodocket is titled Co-Founder of Vivisimo Launches “OnlyBoth” and It’s Super Cool! The article continues in this entirely unbiased vein. OnlyBoth, it explains, was created by Raul Valdes-Perez and Andre Lessa. It offers an automated process of finding data and delivering it to the user in perfect English. The article states,

“What does OnlyBoth do? Actions speak louder than words so go take a look but in a nutshell, OnlyBoth can mine a dataset, discover insights, and then write what it finds in grammatically correct sentences. The entire process is automated. At launch, OnlyBoth offers an application providing insights o 3,122 U.S. colleges and universities described by 190 attributes. Entries also include a list of similar and neighboring institutions. More applications are forthcoming.”

The article suggests that this technology will easily lend itself to more applications, but for now it is limited to presenting the facts about colleges and baseball in perfect English. The idea is called “niche finding,” which Valdes-Perez developed in the early 2000s and never finished. The technology focuses on factual data that requires some reasoning. For example, the OnlyBoth website suggests that the insight “If California were a country, it would be the tenth biggest in the world” is a more complicated piece of information than just a simple fact like the population of California. OnlyBoth promises that more applications are forthcoming.

Chelsea Kerwin, August 12, 2014


A Google Savior for US Government Web Sites

August 11, 2014

I know that Googlers and Xooglers are absolutely the best. I read “Ex-Google Engineer to Lead Fix-It Team for Government Websites.” I am confident that the Xoogler will bring high magic to the problematic Web sites from numerous Federal entities and quasi-government entities. In 2000, there were 36,000 of these puppies. I don’t recall how many were not working the way the developers intended.

I don’t know how many US government Web sites there are today because the nifty free tools I used in 2000 and 2001 no longer work the way they did a decade ago.

How long will it take to address the backend issues of or get the other sites with glitches working “just like Google”? I think might warrant a quick look too. I suppose one could check out the performance metrics for America Online or Yahoo, two outfits run by Xooglers. There may be some data that help in predicting the fix time.

Stephen E Arnold, August 11, 2014

Search Be Random

August 8, 2014

The early days of Internet search always yielded a myriad of search results. No two searches were ever alike and sponsored ads never made it to the top, because they were not around much. It was especially fun, because you got to see more personal, less corporate content. Now search results are more accurate but cluttered with paid links. Given that humans are also creatures of habit, we tend not to stray far from our safe surfing paths and shockingly the Internet can become a boring place. The article “Discover Interesting Content With Five Ways To Randomize The Internet” points out some neat ways to discover new information. It highlights basic ways: Random Wikipedia, random Google Street View, random YouTube, and random Reddit. Be prepared to get sucked into Internet linkage, videos, and photos for hours if you use any of these tools of randomness. Random Website takes users to any random Web site in its generator.

“How often do you find yourself on the Internet looking at the same boring pages? You know there is something out there but you don’t know where to look. Trust me, how bad could it be?”

What is fun is being taken to dark pages of Web 1.0 or a Web site that serves no purpose other than hosting a single word on a single page.

A lot of Internet content is weird, as seen by using these tools, but some of it can lead you to new thoughts and interests. If you need a metaphor, imagine the Internet is like an encyclopedia, except the entries never end and contain all the information about a topic instead of a short summary.

Whitney Grace, August 08, 2014

QuickAnswers Currently Limited but Possibly Promising

August 7, 2014

Sphere Engineering is looking to reinvent the way Web information is organized with QuickAnswers. This search engine returns succinct answers to questions instead of results lists. It is more a narrowed Wolfram|Alpha than a Google. At least that’s the idea. So far, though, it’s a great place to ask a question, as long as it’s a question to which the system knows the answer. I tried a few queries and got back almost as many “sorry, I don’t know”s or nonsense responses. For now, at least, the page admits that “the current state of this project only reflects a tiny fraction of what is possible.” Still, it may be worth checking back in as the system progresses.

The company’s blog post about the project lets us in on the vision of what QuickAnswers could become. Software engineer François Chollet writes:

“I recently completed a total rewrite of [QuickAnswers], based on a new algorithm. I call it ‘shallow QA’, as opposed to IBM Watson’s ‘deep QA’. IBM Watson keeps a large knowledge model available for queries and thus requires a supercomputer to run. At the other end of the spectrum, [QuickAnswers] generates partial knowledge models on the fly and can run on a micro-instance.

“[QuickAnswers] is a semantic question answering engine, capable of providing quick answer snippets to any question that can be answered with knowledge found on the web. It’s like a specialized, quicker version of a search engine. You can see a quick overview of the previous version here.”

The description then gets technical. Chollet uses several examples to illustrate the algorithm’s approach, the results, and some of the challenges he’s faced. He also explains his ambitious long-range vision:

“In the longer term, I’d like to read the entirety of the web and build a complete semantic Bayesian map matching a maximum of knowledge items. Also, it would be nice to have access to a visualization tool for the different answers available and their frequency across sectors of opinion, thus solving the problem of subjectivity.”

These are some good ideas, but of course implementation is the tough part. We should keep an eye on these folks to see whether those ideas make it to fruition. While pursuing such visionary projects, Sphere Engineering earns its dough by building custom machine-learning and data-mining solutions.

Cynthia Murrell, August 07, 2014


Free Intranet Search System

August 7, 2014

Anyone on the lookout for a free intranet search system? FreewareFiles offers Arch Search Engine 1.7, also known as CSIRO Arch. The software will eat up 22.28MB, and works on both 32-bit and 64-bit systems running Windows 2000 through Windows 7 or MacOS or MacOS X. Here’s part of the product description:

Arch is an open source extension of Apache Nutch (a popular, highly scalable general purpose search engine) for intranet search. Not happy with your corporate search engine? No surprise, very few people are. Arch (finally!) solves this problem. Don’t believe it? Try Arch, blind test evaluation tools are included.

In addition to excellent search quality, Arch has many features critical for corporate environments, such as document level security.


*Excellent search quality: Arch has solved the problem of providing good search results for corporate web sites and intranets!

*Up to date information: Arch is very efficient at updating indexes and this ensures that the search results are up to date and relevant. Unlike most search engines, no complete ‘recrawls’ are done. The indexes can be updated daily, with new pages discovered automatically.

*Multiple web sites: Arch supports easy dynamic inclusion or removal of websites.

They also say the system is easy to install and maintain; uses two indexes so there’s always a working one; and is customizable with either Java or PHP.
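The two-index idea is simple enough to sketch. What follows is a toy illustration of the general pattern (build a standby index, then swap it in so queries always hit a complete one). It is not Arch’s or Nutch’s actual code; the class name, the page data, and the whitespace tokenizing are all hypothetical.

```python
class TwoIndexSearch:
    """Toy search service that keeps a live index and a standby index."""

    def __init__(self):
        self._indexes = [{}, {}]  # two inverted indexes: term -> set of URLs
        self._live = 0            # position of the index currently serving queries

    def rebuild(self, pages):
        """Build the standby index from the given pages, then swap it in."""
        standby = 1 - self._live
        fresh = {}
        for url, text in pages.items():
            for term in set(text.lower().split()):
                fresh.setdefault(term, set()).add(url)
        self._indexes[standby] = fresh
        # One assignment flips the roles, so a query never sees a half-built index.
        self._live = standby

    def search(self, term):
        """Return the URLs containing the term, from the live index only."""
        return sorted(self._indexes[self._live].get(term.lower(), set()))


svc = TwoIndexSearch()
svc.rebuild({"/hr": "leave policy", "/it": "password policy"})
print(svc.search("policy"))

svc.rebuild({"/hr": "leave form"})
print(svc.search("form"))
```

While the second rebuild runs, searches are still answered from the first index. Only the final assignment changes which one is live.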

Cynthia Murrell, August 07, 2014

