March 22, 2015
If you are a user of PostgreSQL and want to implement fuzzy, relaxed, or “show ‘em something sort of close to the user’s query” search, you will want to read “Super Fuzzy Searching on PostgreSQL.” Fuzzy search makes it possible to show useful results to a user who is not quite sure how terms appear in an index. Fuzzy is not exactly like “close” in horseshoes. More algorithmic magic is at play in information retrieval systems.
The article explains PostgreSQL fuzzy capabilities and launches into the notion of trigrams. Keep in mind that Manning & Napier (creators of DR LINK) possess some n-gram patents. The old Brainware (which may have once been SER) also possesses some n-gram type patents. I recall hearing years ago that Brainware developed a trigram search system which worked reasonably well when looking for similar patent claims. Brainware is now part of a printer company, and I have lost track of the search technology. I suppose I could investigate the Brainware/Lexmark status, but I have other tasks beckoning my attention.
The write up explains how to implement trigrams for PostgreSQL. The code examples are useful and the tips for dealing with large datasets are quite helpful. The author does not mention the n-gram related patents. I assume that the author assumes that the patent holders assume no one is infringing. That is a triple assumption set. int ere sti ngt rig ram coi nci den ce_
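For readers who have not bumped into trigrams before, here is a minimal Python sketch of the idea. It is not the code from the article and not PostgreSQL's own implementation. It assumes the padding convention described in the pg_trgm documentation (each lowercased word gets two leading spaces and one trailing space) and scores two strings by shared trigrams divided by total distinct trigrams. The function names and the word-splitting rule are my own simplifications.

```python
import re


def trigrams(text):
    """Return the set of padded trigrams for every word in text.

    Assumption: words are runs of ASCII letters and digits, which is a
    simplification of what a real database extension would do.
    """
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

A misspelled query such as "postgers" still shares several trigrams with "postgres", so it scores well above zero, which is what makes the approach useful for "sort of close" matching.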
Stephen E Arnold, March 22, 2015
March 21, 2015
There are many ways for commercial enterprises to gain traction via open source. Some companies, like IBM, cheerlead for Eclipse and Lucene, among other open source projects. Other companies hold conferences to tout an open source solution and then pitch extra cost add ons like consulting and training so the unfamiliar can become familiar with the “free” software. A few firms slip open source hints into their commercial messages. One company which sells a government- and academic-based search system used “open source” on a Web page. When I pointed out that this commercial outfit was hinting that its for fee, proprietary product was open source, the reference disappeared after a frisky email exchange. It seems that some company presidents do not look at their own firm’s Web sites.
I read “2015 Open Source Donations.” The write up was straightforward, listing various donations from DuckDuckGo to worthy causes. One of these is the Amnesic Incognito Live System or Tails.
I am okay with this support for open source via cash. Many firms have followed the path. I find it interesting that DuckDuckGo, which I understand is essentially a metasearch engine, is following this route.
Other commercial outfits will become more open about their support of open source. After all, why use a commercial, proprietary product when you can use a perfectly good open source product? All one needs is know how. That, of course, is what the open source services firms sell.
DuckDuckGo wants to keep the communities in which it has an interest watered, fed, and loved. Good deal.
Stephen E Arnold, March 21, 2015
March 18, 2015
For anyone who sees setting up an instance of Hadoop as a huge challenge, Open Source Insider points to IBM’s efforts to help in, “Has IBM Made (Hard) Hadoop Easier?” Why do some folks consider Hadoop so difficult? Blogger Adrian Bridgwater elaborates:
“More specifically, it has been said that the Hadoop framework for distributed processing of large data sets across clusters of computers using simple programming models is tough to get to grips with because:
- Hadoop is not a database
- Hadoop is not an analytics environment
- Hadoop is not a visualisation tool
- Hadoop is not known for clusters that meet enterprise-grade security requirements
This is because Hadoop is a ‘foundational’ technology in many senses, so its route to ‘business usefulness’ is neither direct or clear cut in many cases.”
Hmm. So, perhaps one should understand what Hadoop is and what it does before trying to implement it. Still, the folks at IBM would prefer companies just pay them to handle it. The article cites a survey of “big-data developers” (commissioned by IBM) that shows about a quarter of the respondents use IBM’s Hadoop. Bridgwater also mentions:
“IBM also recently conducted an independently audited benchmark, which was reviewed by third-party Infosizing, of three popular SQL-on-Hadoop implementations, and the results showed that IBM’s Big SQL was the only Hadoop solution tested that was able to run all 99 Hadoop-DS queries…. Smith says that this new report and benchmark are proof that customers can ask more complex questions of IBM when it comes to Hadoop implementation.”
I’m not sure that’s what those factors prove, but it is clear that many companies do turn to the tech giant for help with Hadoop. But is their assistance worth the cost? Unfortunately, this article includes no word on IBM’s Hadoop pricing.
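For anyone wondering what the “simple programming models” in the quoted passage look like, the usual illustration is a MapReduce word count. The sketch below is a toy, in-memory Python version of the map, shuffle, and reduce steps. It does not use the Hadoop API, it runs on one machine, and the function names are hypothetical. It only shows the shape of the model that Hadoop spreads across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all counts for one word into a single total."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of strings."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

The point of the design is that the map and reduce functions know nothing about machines or disks, which is why the framework by itself is neither a database nor an analytics environment.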
Cynthia Murrell, March 18, 2015
Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com
March 17, 2015
Axonic’s enterprise-centric search products eliminate most, if not all, of the problems a Windows user encounters when trying to locate related information produced by different applications on a desktop computer. Email and other types of information are findable with a few keystrokes.
When I was in Germany in June 2014, I learned about Lookeen, a desktop search product that was built on Lucene. The idea was to tap the power of Lucene to put content on a user’s computer at one’s fingertips. Imagine working in Outlook, reading a message, and seeing a reference to a PowerPoint on the user’s external storage device. Lookeen allows access to the content from within Outlook. Now the company is releasing a commercial version of its desktop search product that promises to be a game changer on the desktop and in the enterprise. The company offers robust functionality at a very attractive price point.
The role of Lucene and other technical innovations in the high-performance software appears in an exclusive interview with Lookeen’s chief operating officer. You can find the interview at http://bit.ly/1LizbkQ.
The Lookeen interface is intuitive. No training is required to install the Lucene-based system nor to use it for simple or complex information retrieval tasks. Image used with the permission of Axonic GmbH.
Lookeen is a product developed by Axonic, a software and services firm located in Karlsruhe, Germany, in the Rhine Valley, a short distance from Stuttgart. Axonic is one of the leading software development and services firms for Outlook and Exchange Server search technologies in Europe. The company specializes in enterprise applications and has a core competency in Microsoft technologies.
I wanted more detail about Lookeen’s approach to desktop search. In an exclusive interview, Peter Oehler, COO, revealed its breakthrough approach to desktop search. The company’s Lookeen software gives Windows users the industry-leading search technology tuned for the Microsoft environment. Outlook email, PowerPoint decks, Word documents and other common file types are instantly findable.
Peter Oehler said:
We’ve utilized Lucene’s extensive query syntax to enable users to use familiar Google-like Boolean search, as well as wildcard, proximity, and keyword matching. The introduction of more search strings and filter features enable users to narrow down searches in an easy and intuitive way, and more proficient searchers can access the best of Lucene’s query syntax.
Lucene is a very good, widely used open source search system. Many of the innovations we’ve developed on top of the Lucene engine stem directly from our extensive experience with Outlook. For example, the Lookeen context menu allows a user to open, reply to, forward, move and summarize emails and topics, all from within Lookeen.
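To make the Boolean, wildcard, and proximity remarks concrete, here are a few query strings written in Lucene's classic query parser syntax, held in a small Python dictionary. This is a sketch only. The field name `subject` and the search terms are hypothetical, and I have not confirmed that Lookeen's search box accepts each form exactly as written.

```python
# Sample queries in Lucene's classic query parser syntax.
# Assumption: the field name "subject" is hypothetical and used for illustration.
EXAMPLE_QUERIES = {
    "boolean": "budget AND (report OR forecast) NOT draft",
    "wildcard": "present*",                 # matches presentation, presenter, ...
    "proximity": '"quarterly results"~5',   # both words within five positions
    "fielded": "subject:invoice",           # restrict the term to one field
}

for kind, query in EXAMPLE_QUERIES.items():
    print(f"{kind:10} {query}")
```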
What sets Lookeen apart from proprietary, freeware, and shareware is that Axonic has engineered its system to provide real-time access to information on the user’s computer. The system can handle terabytes of user content, returning results almost instantaneously.
Axonic has deep experience with Microsoft technology. Oehler told me:
Lucene is a beast within the Microsoft environment. Microsoft doesn’t make it easy to work with Outlook without causing problems or affecting performance. Outlook is the lifeblood of most professionals – the most important tool. If it stops working, you stop working. The art of our product is how we tackle the complex code hiding under the surface of Outlook and combine it with Lucene to create a deceptively smooth and simple search solution.
Beyond Search ran tests on Lookeen and compared the results with outputs from a number of test systems. Lookeen’s response times were among the fastest. When indexing and searching email, including archived collections of emails, Lookeen was the top performer. Our test systems include Copernic, dtSearch, Effective File Search, Gaviri, ISYS Desktop Search, and X1.
Lookeen requires no special training or complex set up. Lookeen allows a user to search external shared content directly from the Lookeen app. The interface is clear and logical. A busy professional can access needed documents, view and interact with them without launching an external application.
A 14 day free trial is available. The license fee is $58 for a single user version. The company offers a business edition (at $83) which adds group policy functions and an enterprise edition, which begins at about $116 per user. Volume discounts are available.
To read the complete exclusive interview with Peter Oehler, navigate to the Search Wizards Speak service at this link on ArnoldIT. More information about the company is available at http://www.lookeen.com.
Stephen E Arnold, March 17, 2015
March 14, 2015
I know that my comments about the dead end nature of enterprise search have caught the attention of some vendors. Let’s face it. Search is a utility, a tool to be used when performing other work. Search is not what some failed middle school teachers and English majors dressed up in Merlin the Magician outfits promulgate.
Elasticsearch has shifted gears and rebranded itself as Elastic. The company provides some information about the shift at its new Web site www.elastic.co. The company says:
Elastic believes getting immediate, actionable insight from data matters. As the company behind the three open source projects — Elasticsearch, Logstash, and Kibana — designed to take data from any source and search, analyze, and visualize it in real time, Elastic is helping people make sense of data. From stock quotes to Twitter streams, Apache logs to WordPress blogs, our products are extending what’s possible with data, delivering on the promise that good things come from connecting the dots.
I think this repositioning is likely to put a tight elastic band around the throat of a number of competitors. I don’t think Elastic is sufficiently tight to kill these outfits. The positioning grip is definitely going to make their breathing more difficult.
Search is not dead at Elastic. The company is responding to the market’s need for a solution that delivers a tangible benefit, not a laundry list of jargon, buzzwords, and assertions that history has made clear are mostly baloney.
One question crossed my mind, “What will LucidWorks do to respond?” My thought is that LucidWorks is probably trying to craft a counter move. Millions are at stake, and I think the financial backers of the former Lucid Imagination will want more than ideas.
Stephen E Arnold, March 14, 2015
March 12, 2015
ElasticSearch is a popular open source search engine that has been downloaded over 10 million times since its release in 2010. Amazon recently announced it is planning on adding an ElasticSearch management service to EC2 to relieve workloads for developers. Rival Google announced on the Google Cloud Platform Blog that it will be adding ElasticSearch compatibility to its own cloud computing platform: “Deploy ElasticSearch On Google Compute Engine.”
The Google Compute Engine team is ecstatic that ElasticSearch will be deployed on the platform and is actively encouraging end users to download it. The team even made a list of reasons why people need to start using ElasticSearch:
1 “Based on Lucene: Elasticsearch is an open source document-oriented search server based on Lucene. Lucene is a time tested open source library that is capable of reading everything from HTML to PDFs.
2 Designed for cloud: Elasticsearch was designed first for the cloud with its capabilities around simple cluster configuration and discovery and high-availability by default. This means you can expand your Elasticsearch deployment simply by adding new nodes. This expansion of your cluster — or in the case of a hardware failure, reduction — results in automatic reconfiguration of your document indices across the cluster.
3 Native use of JSON over HTTP: Extending the platform is simple for developers. The schema doesn’t need to be defined up front and your cluster can be extended with a variety of libraries in your languages of choice, even using the command line.”
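The “JSON over HTTP” item in the list above can be sketched with nothing but the Python standard library. The functions below build, but do not send, one request that indexes a document without any schema defined first and one that runs a simple match query. The assumptions: a cluster at `http://localhost:9200`, a hypothetical index named by the caller, and the `_doc` path used by recent Elasticsearch releases. Versions current in 2015 put a type name in that position instead.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a local single-node cluster
JSON_HEADERS = {"Content-Type": "application/json"}


def build_index_request(index, doc_id, document):
    """Build a PUT request that stores one JSON document under the given id."""
    url = f"{BASE_URL}/{index}/_doc/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(url, data=body, method="PUT", headers=JSON_HEADERS)


def build_search_request(index, field, text):
    """Build a POST request that runs a full-text match query on one field."""
    url = f"{BASE_URL}/{index}/_search"
    query = {"query": {"match": {field: text}}}
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST", headers=JSON_HEADERS)


def send(request):
    """Send a prepared request and decode the JSON reply (needs a running cluster)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a cluster running, something like `send(build_index_request("posts", "1", {"title": "Fuzzy search"}))` followed by `send(build_search_request("posts", "title", "fuzzy"))` would exercise both calls. Building the requests separately from sending them keeps the sketch inspectable without a server.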
ElasticSearch can be deployed with a few easy clicks, and once it is working you can immediately use it for log processing and analysis with Logstash, keyword text search, and data visualization with Kibana.
Deployment on the Google Compute Engine means ElasticSearch will reach an entirely new customer line. Other open source search engines will be pressured to up the ante with new features and services that ElasticSearch does not have. LucidWorks and other open source based search companies are feeling the pressure.
March 10, 2015
A slide deck providing a round up of the new features of Elasticsearch as of June 2014 is available as of March 9, 2015 via Speakerdeck. Snag the deck. Some Elasticsearch presentations disappear themselves quickly.
Stephen E Arnold, March 9, 2015
February 6, 2015
An oddball TechWars graphic suggests that Lucene is making life difficult for vendors of proprietary search systems. In the site’s head-to-head “dtSearch vs Lucene” comparison, the open source solution seems to handily trounce dtSearch. Of course, for us, Lucene means Elasticsearch. For those unfamiliar with TechWars, here’s the site’s description of what it does:
Data-driven: TechWars shows objective data gathered from the web to help you make the right decision when choosing technology for your projects.
Up-to-date: TechWars scans the web to catch the latest trends, so you can sit back and relax while we keep you updated.
Professional: TechWars is built for professionals, by professionals. Let’s build the best tech comparison tool together!
Community: TechWars serves the developer community by opening case studies for discussion. We are always open to requests and feedback via Facebook and Twitter.
The graphic compares dtSearch and Lucene in several areas. We’re told that 196 of TechWars users use Lucene, versus just 15 who use dtSearch. Under the “which companies use it?” heading, sixteen companies (several high-profile) are listed for Lucene, but “no companies found” for dtSearch. Um, it seems like a pretty shallow dataset they’re tapping into there. The site does use Google data for one comparison—a graph that shows how very many more folks have searched for information on Lucene than on dtSearch. At a glance, Lucene would seem to be coming out ahead.
Cynthia Murrell, February 06, 2015
February 6, 2015
One of Vivisimo’s founders, Jerome Pesenti, seems to be the voice of IBM Watson. Vivisimo was a metasearch system with hit clustering. The company went through several management arabesques and was sold to IBM in 2012. Vivisimo pitched its system as a federated search engine. The configuration method, as I recall, required Jerome level input. In one installation, I learned that the Vivisimo system hit a wall when 250,000 documents were processed. There were work arounds, but these too required humans who knew the ins and outs of Vivisimo.
I recall that prior to the sale of Vivisimo to IBM, Vivisimo shifted to a government consulting services focus. Many search vendors in the heyday of the buy outs followed this path. License fees were not generating the cash the spreadsheet jockeys funding outfits like Endeca, Exalead, and Vivisimo envisioned. No problem. Some organizations wanted proprietary content processing systems and figured that it was time to sell out. The Big Dog of sell outs was Hewlett Packard’s $11 billion purchase of Autonomy. Vivisimo fetched about $20 million or one year’s projected revenue, according to a stockholder familiar with the deal.
Fast forward two or three years and Vivisimo is now Watson. Oh, Vivisimo is also a Big Data solution, not a metasearch engine. I assume the index limits have been addressed. I am thinking about IBM Watson for two reasons:
- IBM is going through a staff reduction. I assume this action was determined by querying the super smart Watson system
- I read “Five New Services Expand IBM Watson Capabilities to Images, Speech, and More,” an IBM in house marketing article.
To my surprise there was a significant shift in Watson marketing; to wit, there are now links to demos of IBM’s text to speech service, image recognition service, relationship analysis service, and something called tradeoff analytics. Now demos are helpful. So is the Watson “great video” about concept insights.
I ran the suggested query for “quantum physics.” Remember I used to work at Halliburton Nuclear Services. Here’s what I saw:
I noticed that each of the experts in the human resources database uses the word “quantum” to describe his or her background.
I then ran a query for “tamarind,” one of the ingredients in a barbeque sauce created by Watson during its recipe phase. Here’s what I saw:
There is no recipe, nor is there an IBM person listing the barbeque recipe as his or her work. I was surprised. No tamarind wizard in the data set.
I asked myself, “Can’t I do this with Elasticsearch?” The answer my mind generated was, “No. No. No. You silly oaf. Watson uses Lucene but it is much, much more.”
How confident are the Watson workers who have dodged IBM layoffs?
What happens if Watson with Vivisimo, iPhrase, WebFountain, and assorted Almaden semantic goodies are aced by Hewlett Packard Autonomy or—heaven forbid—Amazon?
Will Dr. Pesenti be able to build a business that is orders of magnitude larger than Vivisimo’s revenue?
Interesting stuff. Not CyberOSINT level work, but interesting. I wonder why the i2 and related technologies are not pushed more aggressively. i2 works. (Note: I was a consultant to i2 prior to IBM’s purchase of the company.)
Stephen E Arnold, February 6, 2015
February 4, 2015
For months I have been commenting about the increasingly weird marketing pitches for IBM Watson. This is the Lucene and home grown script system positioned as the next big thing in information retrieval. The financial goals for this system were crazy. My recollection is that IBM wanted to generate a billion in revenue from open source search and bits and pieces of the IBM technology lumber.
Impossible. Having a system ingest bounded content and then answer “questions” about that content is neither new, remarkable, nor particularly interesting to me. When the system is presented as a way to solve the problem of cancer and generate barbeque sauce with tamarind, the silliness points to desperation.
IBM marketers were trying everything to make open source search into a billion dollar baby and pull off the stunt quickly. Keep in mind that Autonomy required 15 years and a number of pretty savvy acquisitions to nose into the $700 million range.
IBM, in its confused state, believed that it could do the trick in a fraction of the time. IBM apparently was unaware of the erratic thinking at Hewlett Packard that spent $11 billion for Autonomy and wanted to generate billions from that system at the same time IBM was going to collect a billion or more from the same market.
Both of these companies, dazed by a long term struggle with spreadsheet fever, were ignoring or simply did not understand the doldrums of the enterprise information access market. Big companies were quite happy to give open source solutions a try. Vendors of proprietary systems were pitching their keyword systems as everything from customer support “solutions” to business intelligence systems that would “predict” what the company should know.
I read with some sadness the posts at Alliance@IBM. The viewpoint is not that of IBM management, which is now firing or “resource allocating” its people. I am not sure how many folks are going to be terminated, but the posts in this series of IBM employee comments suggest that the staff are unhappy. Some may not go gentle into that good night.
The point is that the underlying problems at IBM were evident in the silly Watson marketing. An organization that can with a straight face suggest that a next generation information access system can discover a new recipe provides a glimpse into an organization’s disconnect at a fundamental level.
Too bad. The stock buybacks, the sale of manufacturing assets, and the assertions that a mainframe is a mobile platform tell me that IBM stockholders may want to reevaluate those holdings.
If IBM asked Watson, I question the outputs.
Stephen E Arnold, February 4, 2015