Why Good Enough Is the New Norm in Search

September 29, 2014

Navigate to “Postgres Full Text Search Is Good Enough.” I first heard this argument at a German information technology conference a few years ago. The idea is surprisingly easy to understand. As long as a user can bang in a couple of key words, scan a result list, and locate information that the user finds helpful, the job is done. The search results may consist of flawed or manipulated information. The search results may be off point for the user’s query when evaluated by old fashioned methods such as precision and recall. The user may be dumb and rely on whatever the user believes is accurate.


This write up explains the good enough approach in terms of PostgreSQL, a useful open source Codd type data management system. Please, note. I am not uncomfortable with good enough search. I understand that when the herd stampedes, it is not particularly easy to stop the run. Prudence suggests that one take cover.

Here’s the guts of the write up:

What do I mean by ‘good enough’? I mean a search engine with the following features:

  • Stemming
  • Ranking / Boost
  • Support Multiple languages
  • Fuzzy search for misspelling
  • Accent support

Luckily PostgreSQL supports all these features.
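For readers who want to see what that feature list looks like in practice, here is a minimal sketch. It is not taken from the write up. It assumes a hypothetical table named documents with id, title, and body columns, and it assumes the optional unaccent and pg_trgm extensions are installed. The functions to_tsvector, plainto_tsquery, ts_rank, unaccent, and similarity are standard PostgreSQL; the Python helper names are invented for illustration. The code only builds queries and parameters. It does not connect to a database.

```python
# Sketch only: the "documents" table and its columns are hypothetical.
# The SQL assumes PostgreSQL with the "unaccent" and "pg_trgm" extensions.

# Stemming and language support come from the text search configuration
# ("english", "french", ...). Ranking comes from ts_rank. Accent handling
# comes from unaccent().
FULL_TEXT_SQL = """
SELECT id, title,
       ts_rank(to_tsvector(%(lang)s::regconfig, unaccent(body)),
               plainto_tsquery(%(lang)s::regconfig, unaccent(%(q)s))) AS rank
FROM documents
WHERE to_tsvector(%(lang)s::regconfig, unaccent(body))
      @@ plainto_tsquery(%(lang)s::regconfig, unaccent(%(q)s))
ORDER BY rank DESC
LIMIT %(limit)s
"""

# Fuzzy matching for misspellings uses trigram similarity from pg_trgm.
FUZZY_SQL = """
SELECT id, title, similarity(title, %(q)s) AS score
FROM documents
WHERE similarity(title, %(q)s) > %(threshold)s
ORDER BY score DESC
LIMIT %(limit)s
"""


def full_text_query(user_input, language="english", limit=10):
    """Return (sql, params) for a stemmed, ranked, accent-insensitive search."""
    params = {"q": user_input.strip(), "lang": language, "limit": limit}
    return FULL_TEXT_SQL, params


def fuzzy_query(user_input, threshold=0.3, limit=10):
    """Return (sql, params) for a misspelling-tolerant title lookup."""
    params = {"q": user_input.strip(), "threshold": threshold, "limit": limit}
    return FUZZY_SQL, params


if __name__ == "__main__":
    sql, params = full_text_query("  crème brûlée ", language="french")
    print(params)
```

The (sql, params) pairs use named placeholders in the style accepted by DB-API drivers such as psycopg2, so they could be handed to cursor.execute(sql, params). Keeping user input in the parameters, and out of the SQL string, avoids injection problems.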

The write up contains some useful code snippets to make use of search features. The discussion of full text search is coherent and addresses a vast swath of content. Note that proprietary vendors have tilled acres of marketing earth and fertilizer to convert search into a mind boggling range of functions.

This article includes code snippets to tackle full text within PostgreSQL.

Querying is included as well. Again, code snippets are included. (My teenage advisors said, “Very useful snippets.” Okay. Good.)

The write up concludes:

We have seen how to build a decent multi-language search engine based on a non-trivial document. This article is only an overview but it should give you enough background and examples to get you started with your own….Postgres is not as advanced as ElasticSearch and SOLR but these two are dedicated full-text search tools whereas full-text search is only a feature of PostgreSQL and a pretty good one

Reasonable observation. Worth reading.

If you are a vendor of proprietary search technology, there will be more individuals infused with the spirit of open source, not fewer. How many experts are there for proprietary systems? Fewer than the cadres of open source volk, I surmise.

Stephen E Arnold, September 29, 2014

Tibco: Will It Regain Its Momentum?

September 29, 2014

I read “Tibco Sells Out to Private Equity in $4.3bn Deal with Vista Equity Partners.” I found Tibco interesting when I saw the servers used to power Yahoo News a number of years ago. The company is now owned by accountants and MBAs. I learned in the write up:

Tibco was founded in 1997 by its current chairman and CEO Vivek Ranadive. It was a pioneer of message-oriented middleware, particularly for the financial sector, which enables information to be pushed to multiple recipients at precisely the same time. However, Tibco’s expensive high-end proprietary software is under attack from open source in the form of the Advanced Message Queuing Protocol (AMQP), which promises not just lower-cost message queuing software, but also inter-operability between different vendors’ implementations of the open-source standard.

My recollection is that Tibco’s “information bus” made some of the old line outfits uncomfortable. Perhaps IBM? If the write up is accurate, open source is claiming a proprietary vendor.

How long will proprietary enterprise search vendors be able to keep the open source predators away? If the financial market gets the willies, over hyped proprietary systems are likely to face high seas. Some swimmers drown in rough water even though the marketers insist the sun is shining.

Stephen E Arnold, September 29, 2014

Microsoft Azure Price Cuts? Maybe More Bad News for Search Vendors

September 26, 2014

The race for commodity pricing in cloud computing is underway. I read an article, which I assume is semi-accurate, called “Microsoft Azure Sees Big Price Reductions: Competition Is Good.” “Good” is often a relative term.

For those looking for low cost cloud computing that delivers Azure functions, lower prices mean that Amazon- and Google-type prices may be too high.

For a vendor trying to pitch an information retrieval system to a Microsoft centric outfit, the falling prices may mean that Azure Search is not just good enough. It is a deal. The only systems that can be less expensive are those one downloads from an open source repository or those a hard worker codes herself.

The write up states:

Microsoft has announced, in a blog post, that it will be slashing the cost of some of its Azure cloud services from October 1st….customers buying through Enterprise agreements will enjoy even lower prices. The rate card currently shows 63 services being reduced by up to about 40%.

For enterprise search vendors chasing SharePoint licensees with promises of better, faster, and cheaper—the move by Microsoft is likely to be of interest.

I anticipate that search vendors will scramble even harder than ever. Furthermore, I look forward to even more outrageous assertions about the value of content processing. As an example, check out this set of assertions about an open source based system that has been scrambling for purchase on the sales mountain for six or seven years.

Stephen E Arnold, September 26, 2014

Red Hat: The Cloud Is the Future

September 24, 2014

I read “Red Hat CEO Announces a Shift from Client-Server to Cloud Computing.” With Red Hat the poster child for the economic viability of an open source business model, this shift seems to mark a break with Red Hat’s past focus.

The article reports:

In case you haven’t gotten the point yet, Whitehurst [Red Hat big gun] states, “We want to be the undisputed leader in enterprise cloud.” In Red Hat’s future, Linux will be the means to a cloud, not an end unto itself.

No problem with this move. Most of the organizations with which I have contact bemoan the cost of on premises computing. The cloud, as I understand their MBA-tinged reasoning, is cheaper. Cut back on staff, eliminate the expensive weekend triage sessions with engineers who charge more than roving physicians in New Jersey, and get rid of the hassles of human resources professionals who complain about body shops, background checks, and turnover: these themes surface.

The move should be okay for Red Hat. The company is moving in a new direction. Existing customers will be okay for the foreseeable future.

On a related note, I was scanning one of the less and less heavily visited LinkedIn enterprise search bulletin boards. What did I see? A brave soul was looking for a hosted version of Solr, presumably for its facets and perceived zippy performance.

In one of the comments, an “expert” mentioned Lucid Works, which invokes from me the thought, “Really?” The expert said that the Lucid Works cloud offering was no longer available.

I suppose this is an example of contrarianism, but if the statement were true, maybe Lucid Works knows something that has eluded Red Hat? Interesting question. My hunch is that Red Hat knows what it is doing.

Stephen E Arnold, September 23, 2014

Open Source AdDetector Flags Native Advertising

September 24, 2014

Though many news sites allow ads to more or less (depending on the site) blend in with their real articles, this native advertising is usually easy enough to spot if you know what you’re looking for. Still, it can put a crimp in one’s skimming speed. Now, Google engineer Ian Webster offers the open source AdDetector, a browser plug-in that makes such “stories” more obvious. The plug-in is currently available for Chrome and Firefox. The description states:

“AdDetector reveals articles with corporate sponsors. This browser plugin puts a red banner above articles that may appear unbiased but are actually ads or press releases. Its goal is to improve transparency in media and on the web. Trusted by 14,000+ people, AdDetector spots ads in over 100 top newspapers and online publications. More sites are being added daily. If you’d like to see a site added, tweet, email, or use this form.”

The page includes screenshots of its banners in action. The software works by detecting sponsor markings on these pages, many of which are not visible to readers. There is no word on the plug-in’s error rate, but it seems bound to smooth the path for news speed-readers like me.

Cynthia Murrell, September 24, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

Open Sourcers Believe In Cassandra

August 27, 2014

In Greek mythology, Cassandra had the gift of prophecy, but she was also cursed so that no one believed her. The NoSQL database of the same name shared a similar problem when it first started, but unlike the tragic heroine it has since grown to be a popular and profitable bit of code. Wired discusses Cassandra’s history and current endeavors in “Out In the Open: The Abandoned Facebook Tech That Now Helps Power Apple.”

Cassandra began at Facebook, which used it to better scale information across machines and open sourced it in 2008. Jonathan Ellis, a leading figure in the open source project, built DataStax around it. Cassandra faded into the background for a while, but DataStax continued to gain traction with its proprietary software. Apple has since joined the Cassandra community and is its second largest contributor. DataStax, however, will not acknowledge that Apple is one of its clients.

The article points out that a single database product cannot reign supreme in 2014’s market. New ways to house and utilize data will continue to grow, much of it driven by open source. What does that mean for DataStax and Cassandra?

“Ellis says the strategy for Cassandra and DataStax will be ensuring that its technology can work with any new technology that can come along. For example, DataStax recently released a connector for Spark that will enable developers to easily use Spark to analyze data stored in Cassandra. ‘We’re trying to be the database that drives our application, not necessarily the analytics,’ he says. ‘There’s nothing that marries us to one of those platforms.’”

From reading this, it seems the big data push has quieted down somewhat, but companies based on open source software are trying to create products that allow people to use their data smarter and without the holdups of earlier big data pushes. One thing for sure is if DataStax truly does have Apple as a client, they can kiss success on the mouth.

Whitney Grace, August 27, 2014

Hackers Leverage Elasticsearch Flaw in the Cloud

August 25, 2014

Just as Elasticsearch is reveling in its recent successes, CloudPro informs us that “Hackers Target Elasticsearch to Set Up DDoS Botnet on AWS.” Writer Rene Millman reports that cloud providers besides Amazon Web Services could be affected by the attacks, which leverage a vulnerability in the older Elasticsearch 1.1 versions. Because of its ability to run on multiple nodes, Elasticsearch’s open source, Java-based full-text-search application is a popular choice for use with cloud environments. The article describes the vulnerability hackers are now exploiting:

“Researchers at Kaspersky Labs have found that cybercriminals have exploited a flaw in the software to install DDoS malware on various clouds. The flaw was found in Elasticsearch v. 1.1x and a scripting exploit. The software has default support for active scripting, but does not use authentication and also does not sandbox the script code. Criminals can use the flaw to hack into EC2 VMs and then use a new variant of Linux DDoS Trojan Mayday – Backdoor.Linux.Mayday.g – to launch their attack, according to Kaspersky Lab principal security researcher Kurt Baumgartner.”

Millman goes on to quote a blog post by Kurt Baumgartner, principal security researcher at Kaspersky Lab. Baumgartner states:

“The [Mayday variants] in use on compromised EC2 instances oddly enough were flooding sites with UDP traffic only. The flow is strong enough that the DDoS’d victims were forced to move from their normal hosting operations IP addresses to those of an anti-DDoS solution.

“The flow is also strong enough that Amazon is now notifying their customers, probably because of potential for unexpected accumulation of excessive resource charges for their customers. The situation is probably similar at other cloud providers.”

Unsurprisingly, the goal of these attacks seems to be financial. Baumgartner notes that among those affected by these attacks are a large regional U.S. bank, a large electronics maker, and a Japanese service provider. For its part, Amazon is urging users to upgrade ASAP to the latest version of Elasticsearch, which is free from this vulnerability.

Cynthia Murrell, August 25, 2014

Open Source Software Costs: The Wal-Mart View

August 24, 2014

Which is more economical? Proprietary software or open source software? Which approach delivers greater “value”? In Wal-Mart’s tussle with Amazon, will it deliver a better online experience for shopping, search, and logistics? I ask because the Wal-Mart closest to Harrod’s Creek has fewer products, dimmer lighting, and restocking challenges in my experience.

Some information that may help answer these questions appeared in “Wal-Mart’s Investment in Open Source Isn’t Cheap.” Note that this publication is owned by IDG / IDC, the mid tier consulting firm that sold my content on Amazon without my permission. Some details are at this link.

This write up explains that open source software is more than a price:

Wal-Mart has put in place a set of metrics to estimate the return on investment. Hammer explains “every five startups using Hapi translated to the value of one full-time developer, while every 10 large companies translated to one full-time senior developer.” In return for its extra work on open development, Wal-Mart gets high-quality programming at a cost far below that of recruiting and retaining extra staff. In turn, this demonstrable return allows the company to justify further development investment because “by paying developers to work on Hapi full time, we get back twice (or more) that much in engineering value.”
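The rule of thumb in that passage is simple arithmetic. Here is a minimal sketch, assuming the two ratios scale linearly, which the article does not claim. The function and constant names are hypothetical and are not Wal-Mart's.

```python
# Ratios quoted in the article; linear scaling is an assumption of this sketch.
STARTUPS_PER_DEVELOPER = 5            # five startups ~ one full-time developer
LARGE_COMPANIES_PER_SENIOR_DEV = 10   # ten large companies ~ one senior developer


def hapi_return_estimate(startups, large_companies):
    """Estimate developer-equivalents gained from outside adopters of Hapi.

    Returns (full_time_developers, senior_developers) as floats.
    """
    full_time = startups / STARTUPS_PER_DEVELOPER
    senior = large_companies / LARGE_COMPANIES_PER_SENIOR_DEV
    return full_time, senior


if __name__ == "__main__":
    # 15 startups and 20 large companies using the framework
    print(hapi_return_estimate(15, 20))  # prints (3.0, 2.0)
```

The point of the exercise is that the "return" depends entirely on how many outside organizations adopt the project, not on the license price.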

Wal-Mart, however, is a place that sells stuff at what looks like low prices. There are some legal arabesques related to Wal-Mart’s parsimonious streak.

Three questions:

  • Is Wal-Mart looking for ways to obtain maximum freedom from traditional vendors, not just value or cost savings? Freedom can translate to handling software the Wal-Mart way.
  • Will developers find themselves subject to the same cost parameters that Wal-Mart has honed to deliver its competitive prices?
  • How will Wal-Mart adapt when an open source project loses its community?

With Amazon looking more and more proprietary, Wal-Mart seems to be heading in the opposite direction. Will Wal-Mart out-Amazon Amazon, or will Wal-Mart become more like Amazon?

The search experience for both Amazon and Wal-Mart online is often frustrating. Perhaps in a few months one of these discounters will crack their information retrieval nuts.

For those looking for information about the cost of open source, the Wal-Mart approach is worth tucking into one’s card file.

Stephen E Arnold, August 24, 2014

Free Intranet Search System

August 7, 2014

Anyone on the lookout for a free intranet search system? FreewareFiles offers Arch Search Engine 1.7, also known as CSIRO Arch. The software will eat up 22.28MB, and works on both 32-bit and 64-bit systems running Windows 2000 through Windows 7 or MacOS or MacOS X. Here’s part of the product description:

Arch is an open source extension of Apache Nutch (a popular, highly scalable general purpose search engine) for intranet search. Not happy with your corporate search engine? No surprise, very few people are. Arch (finally!) solves this problem. Don’t believe it? Try Arch, blind test evaluation tools are included.

In addition to excellent search quality, Arch has many features critical for corporate environments, such as document level security.


  • Excellent search quality: Arch has solved the problem of providing good search results for corporate web sites and intranets!
  • Up to date information: Arch is very efficient at updating indexes and this ensures that the search results are up to date and relevant. Unlike most search engines, no complete ‘recrawls’ are done. The indexes can be updated daily, with new pages discovered automatically.
  • Multiple web sites: Arch supports easy dynamic inclusion or removal of websites.

They also say the system is easy to install and maintain; uses two indexes so there’s always a working one; and is customizable with either Java or PHP.

Cynthia Murrell, August 07, 2014

TextTeaser Goes Open Source

July 16, 2014

If you are looking for an auto-summarization tool, TechCrunch says “Auto-Summarization Tool TextTeaser Relaunches As Open Source Code.” Jolo Balbin is the creator of TextTeaser, and he added it to GitHub after experiencing scalability issues in the API. Balbin recoded the program and the process is now faster. Developers have two plan options: one is $12 for every 1,000 articles summarized, while the enterprise plan is $250/month and comes with a dedicated server to store the article source.

“ ‘In this TextTeaser, you can train your own summarizer,’ Balbin explains. ‘You can provide the category and source of the article that will be used to improve the quality of the summaries. In the future, users might also have the ability to provide what keyword is important and what is not.’ ”

TextTeaser is used in reader apps, such as Gist. Balbin hopes to optimize the program for medical, financial, and legal documents.

TextTeaser sounds like it makes reading faster. The code is a valuable tool. We will stay tuned to see how else it is used.

Whitney Grace, July 16, 2014
