Merging of Lucene and Solr Reported

December 17, 2010

A reader sent me a link to “Lucene and Solr Development Merged.” We are working to track down the details, but I wanted to capture the news item. In addition to the development merger, the write up references Riak Search. Here is the passage that caught my attention:

With merged dev, there is now a single set of committers across both projects. Everyone in both communities can now drive releases – so when Solr releases, Lucene will also release – easing concerns about releasing Solr on a development version of Lucene. So now, Solr will always be on the latest trunk version of Lucene and code can be easily shared between projects – Lucene will likely benefit from Analyzers and QueryParsers that were only available to Solr users in the past. Lucene will also benefit from greater test coverage, as now you can make a single change in Lucene and run tests for both projects – getting immediate feedback on the change by testing an application that extensively uses the Lucene libraries. Both projects will also gain from a wider development community, as this change will foster more cross pollination between Lucene and Solr devs (now just Lucene/Solr devs).

Riak Search is described in “Riak 0.13, Featuring Riak Search” and “Riak Search and Riak Full Text Indexing”.

The primary information appears on the Riak Web site in a Web page titled “Riak Search.”

Riak Search uses Lucene and features “a Solr like API on top.” According to the Basho blog’s article “Riak 0.13 Released”:

At a very high level, Search works like this: when a bucket in Riak has been enabled for Search integration (by installing the Search pre-commit hook), any objects stored in that bucket are also indexed seamlessly in Riak Search. You can then find and retrieve your Riak objects using the objects’ values. The Riak Client API can then be used to perform Search queries that return a list of bucket/key pairs matching the query. Alternatively, the query results can be used as the input to a Riak MapReduce operation. Currently the PHP, Python, Ruby, and Erlang APIs support integration with Riak Search.
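For readers who want a feel for that “Solr like API,” here is a minimal sketch in Python that queries a search enabled bucket over HTTP. The host, port, endpoint path, index name, and query fields are illustrative assumptions on my part, not details from the Basho write up.

# Minimal sketch: query Riak Search through its Solr-like HTTP interface.
# The host, port, /solr/<index>/select path, index name, and response
# structure are assumptions for illustration, not taken from the announcement.
import json
import urllib.parse
import urllib.request

RIAK_HOST = "http://localhost:8098"   # assumed default Riak HTTP port
INDEX = "articles"                     # hypothetical search-enabled bucket

def search(query, rows=10):
    """Run a Solr-style query and return the matching documents."""
    params = urllib.parse.urlencode({"q": query, "wt": "json", "rows": rows})
    url = f"{RIAK_HOST}/solr/{INDEX}/select?{params}"
    with urllib.request.urlopen(url) as resp:
        # assumes a Solr-style JSON response body
        return json.load(resp)["response"]["docs"]

if __name__ == "__main__":
    for doc in search("title:lucene AND body:merge"):
        print(doc)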

The story “Riak 0.13 Released” provides additional information, including explicit links to download Riak 0.13 and Riak Search for a variety of platforms.

At first glance, Riak Search makes search and retrieval available to NoSQL data stores like Basho’s open source, scalable Riak.

A number of questions require some further data collection and consideration:

  1. Will other NoSQL implementations “bundle” or “snap in” a search component?
  2. What are the technical considerations of this approach to search in NoSQL data stores?
  3. Are there any performance or scaling issues to consider?

The blending of the Lucene Solr merging story with the Riak Search information caught us by surprise. Time to flip through the Rolodex to see whom we can call for more information. If a reader has additional insight on these two items, please, use the comments section of the blog to make the information available to the other two readers of Beyond Search.

We did a bit of sleuthing and wanted to pass along that Riak may be using some of the Lucene/Solr analyzers. One view is that the indexing and search code may not be Lucene-based. The implication is that scaling and performance may be an issue. Faceting and grouping may also be issues. Without digging too deeply into the innards of Riak Search, we suggest you do some testing on a suitable data set or corpus.

We located some information about Solr as NoSQL. You can find that information on the Lucid Imagination Web site at this link.

Stephen E Arnold, December 17, 2010

Freebie

OCLC-SkyRiver Dust Up

December 16, 2010

In the excitement of the i2 Ltd. legal action against Palantir, I put the OCLC-SkyRiver legal hassle aside. I was reminded of the library wrestling match when I read “SkyRiver Challenges OCLC as Newest LC Authority Records Node.” I don’t do too much in libraries at this time. But OCLC is a familiar name to me; SkyRiver not so much. The original article about the legal issue appeared in Library Journal on July 29, 2010, “SkyRiver and Innovative Interfaces File Major Antitrust Lawsuit against OCLC.” Libraries are mostly about information access. Search would not have become the core function it is today if it had not been for libraries’ early adoption of online services and their making online access available to patrons. In the days before the wild and wooly Web, libraries were harbingers of the revolution in research.

Legal battles are not unknown in the staid world of research, library services, and traditional indexing and content processing activities. But a fight between a household name like OCLC and a company with which I had only modest familiarity is news.


Here’s the key passage from the Library Journal write up:

Bibliographic services company SkyRiver Technology Solutions recently announced that it had become an official node of the Name Authority Cooperative Program (NACO), part of the Library of Congress’s (LC) Program for Cooperative Cataloging. It’s the first private company to provide this service, which was already provided by the nonprofit OCLC—SkyRiver’s much larger competitor in the bibliographic services field—and the British Library. Previously, many institutions have submitted their name authority records via OCLC. But SkyRiver’s new status as a NACO node allows it to provide the service, once exclusive to OCLC in the United States, to its users directly.

For me, this is a poke in the eye for OCLC, an outfit that used me on a couple of projects when General K. Wayne Smith was running a very tight operation. I don’t know how management works at OCLC, but I think any action by the Library of Congress is going to trigger some meetings.

SkyRiver sees OCLC as acting in an anti-competitive way. Now the Library of Congress has blown a kiss at SkyRiver. Looks like the library landscape, already ravaged by budget bulldozers, may be undergoing another change. I think the outline of the mountain range where the work is underway appears to spell out the word “Monopoly.” Nah, probably my imagination.

Stephen E Arnold, December 16, 2010

Freebie

Repositioning 2011: The Mad Scramble

December 15, 2010

Yep, the new year fast approaches. Time to turn one’s thoughts to vendors of search, content processing, data fusion, text mining, and—who could forget?—knowledge management. In the last two weeks, I have done several live-and-in-person briefings about ArnoldIT.com’s views on enterprise search and related disciplines.

Today enterprise search has become what I call an elastic concept. It is stretched over a baker’s dozen of quite divergent information retrieval concepts. Examples range from the old bugaboo of many companies, customer support, to the effervescence of knowledge management. In between sit the hard realities of the costs of supporting actual customers and the frothy topping of “knowledge.”

Several trends are pushing through the fractured landscape of information retrieval. Like earthquakes, the effects can vary significantly depending on one’s position at the time of the event.


Source: http://www.sportsnet.ca/gallery/2009/12/30/scramble_gal_640.jpg

Search can be looked at in different ways. One can focus on a particular problem; for example, content management system repositories. The challenge is to find information in these systems. One would think that after years of making Web pages, the problem would be solved. Apparently not. CMS with embedded search stubs trigger some grousing in most of the organizations with which I am familiar. Search works, just not exactly as the users expect. A vendor of search technology can position the search solution as one that makes it easy for users to locate information in a CMS. This is, of course, the pitch of numerous Microsoft Certified Gold resellers of various types of search solutions, utilities, and workarounds. This is an example of a search market defined by the type of enterprise system that creates a retrieval problem.

Other problems for search crop up when specific rules and regulations mandate a particular type of information processing. One example is the eDiscovery market. Anyone can be sued, and eDiscovery systems have to make content findable, but the users of an eDiscovery system have quite particular needs. One example is bookkeeping so that the time and search process can be documented and provided upon request under certain conditions.

Social media has created a new type of problem. One can take a specific industry sector such as the Madison Avenue crowd and apply information technology to the social media problem. The idea is for a search system to “harvest” data from social content sources like Facebook or Twitter, process the text which can be ambiguous, and generate information about how the people creating Facebook messages or tweets perceive a product, person, ad, or some other activity for the advertising team. The idea is that search unlocks hidden information. The Mad Ave crowd thinks in terms of nuggets of information that will allow the ad team to upsell the advertiser. Search is doing search work but the object of the exercise is to make sense out of content streams that are too voluminous for a single person to read. This type of search market—which may not be classic search and retrieval at all—is closer to what various intelligence agencies want software to do to transcribed phone calls, email, and general information from a range of sources.

Let’s stop with the examples of information access problems already. There are more information access problems than at any other time, and I want to move on to the impact of these quite diverse problems upon vendors in 2011.

Now let’s take a vendor that has a search system that can index Word documents, email, and content found in most office environments. Nothing tricky like product specifications, chemical structures, or the data in the R&D department’s lab notebooks. For mainstream search, here is the problem:

Commoditization

Right now (no pun on the vendor of customer support solutions, by the way) anyone can download an open source search solution. It helps if the person downloading Lucene, Solr, or one of the other open source solutions has a technical bent. If not, a local university’s computer science department can provide a student to do the installation and get the system up and running. If the part-time contracting approach won’t work, you can hire a company specializing in open source to do the work. There are dozens of these outfits bouncing around.
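To make the commoditization point concrete, here is a minimal sketch of adding a document to a stock Solr install and querying it back over HTTP, using nothing but the Python standard library. The URL assumes Solr’s bundled example server on its default port, and the field names depend on your schema, so treat every name here as a placeholder.

# Minimal sketch: post one document to a local Solr instance and query it
# back. The port assumes Solr's bundled example server; the "id" and "text"
# field names are placeholders that depend on the schema in use.
import json
import urllib.parse
import urllib.request
from xml.sax.saxutils import escape

SOLR = "http://localhost:8983/solr"

def add_document(doc_id, text):
    """Post one document in Solr's XML update format, then commit."""
    body = (
        "<add><doc>"
        f"<field name='id'>{escape(doc_id)}</field>"
        f"<field name='text'>{escape(text)}</field>"
        "</doc></add>"
    )
    for payload in (body, "<commit/>"):
        req = urllib.request.Request(
            f"{SOLR}/update",
            data=payload.encode("utf-8"),
            headers={"Content-Type": "text/xml"},
        )
        urllib.request.urlopen(req).read()

def query(q):
    """Run a query against the default select handler and return the hits."""
    params = urllib.parse.urlencode({"q": q, "wt": "json"})
    with urllib.request.urlopen(f"{SOLR}/select?{params}") as resp:
        return json.load(resp)["response"]["docs"]

if __name__ == "__main__":
    add_document("memo-1", "Quarterly customer support review for the sales team")
    print(query("text:customer"))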

Read more

CopperEye: Speedy Stuff

December 10, 2010

I came across CopperEye several years ago. I was looking for a solution that would cope with large volumes of data, mainframe and client server hardware, and specific performance requirements. CopperEye met the specs. In London last week, I engaged in a conversation and learned that CopperEye was not widely known in the more traditional search and retrieval field. The purpose of this write up is to provide some basic information about the company. In a nutshell, the firm offers a system that can discover, parse and index data in a relational database or flat file output. The method can handle “big data”. (A video demo is available on YouTube.)

In 2007, In-Q-Tel, the investment arm of the Central Intelligence Agency, signed a deal for a strategic investment in CopperEye. In that 2007 announcement, an In-Q-Tel spokesperson said:

“We selected CopperEye because it offers superior technology in the area of the retention and retrieval of structured, historical data,” said Troy M. Pearsall, Executive Vice President of Technology Transfer at In-Q-Tel. “Given the volume of information gathered by organizations within the public and private sectors, it made perfect sense to invest in an innovative data access technology that will potentially meet the critical needs of the U.S. Intelligence Community. We look forward to working with CopperEye in the coming months and years.”

Based on the information in my Overflight system, CopperEye is privately held. Now about 10 years old, the company provides enterprise class archiving solutions, including compliance archiving. The firm’s search product is called CopperEye Search. The Greenwich product uses standard SQL to retrieve records from log files. The Secure Data Retrieval Server is an appliance that complies with data retention regulations. The CopperEye Indexing function is optimized for high speed.

The current version of the Retrieval Server includes improved compression. The compression introduces no latency while yielding more efficient storage and fewer disc accesses. The system has been engineered for high availability. When deployed as a distributed system, queries operate as though the data set were a single environment. One interesting feature is that the system can be configured to process queries using parallel, failover, or round robin methods.
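To make those three query modes concrete, here is a generic sketch in Python of parallel, failover, and round robin routing. This is not CopperEye code; the node and query interfaces are invented purely for illustration.

# Generic sketch of the three query routing strategies named above.
# The Node class and its query method are hypothetical stand-ins.
import itertools
from concurrent.futures import ThreadPoolExecutor

class Node:
    """Hypothetical stand-in for one node in a distributed index."""
    def __init__(self, name):
        self.name = name
    def query(self, q):
        return f"{self.name} results for {q!r}"

NODES = [Node("node-a"), Node("node-b"), Node("node-c")]
_rr = itertools.cycle(NODES)

def query_parallel(q):
    """Fan the query out to every node and merge whatever comes back."""
    with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        return list(pool.map(lambda n: n.query(q), NODES))

def query_failover(q):
    """Try nodes in order, falling through to the next on failure."""
    for node in NODES:
        try:
            return node.query(q)
        except Exception:
            continue
    raise RuntimeError("all nodes failed")

def query_round_robin(q):
    """Spread successive queries across nodes to balance load."""
    return next(_rr).query(q)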

The CEO of the firm is Carmen Carey. The founders are Paul McCafferty (COO) and Duncan Pauly. You can get more information about the company at www.coppereye.com.

Stephen E Arnold, December 10, 2010

Freebie

Real Time Conversation with a Mid Tier Wizard

December 9, 2010

I am not making this conversation up. I gave a talk to 43 twenty-somethings at Skinker’s, a delightful place near the London Bridge tube stop. No, I did not buy a Skinker’s T shirt, but it did look smart. My topic was real time search. More accurately, I was explaining the engineering considerations in delivering low latency indexing and querying, which most vendors and second string consultants happily tell you is “real time search”.

The most interesting part of my evening was a short conversation I had with a mid tier consultant, what I call an azure chip consultant or, more generally, one of the azurini. To be a blue chip consultant is easy. Just get hired by one of the two, three, or four top management consulting firms, do some notable work, and not die of a heart attack from the pressure. Thousands of Type A’s who crave constant stroking take a toll, believe you me. The mid tier lad introduced himself. He reminded me that I had met him before. In the dim light of Skinker’s I would not have been able to recognize Tess, my deaf white boxer. No matter. A big grin and warm handshake were what the azure chip lad thought would jog my memory.


The basic idea is that real time is not achievable. There are gating factors at three main points in any content processing system, which I illustrated with a simple color coded diagram. The first is the green box, which is the catch all for the service providers, ISPs, and others in the network chain. The pink boxes represent the vendors providing services to the client who wants low latency service. The yellow boxes represent the different “friction points” behind the firewall or within the organization’s hybrid infrastructure. Resolving these points of “friction” boils down to brains and money. If an organization lacks either, the latency of the system will be high and increase over time. Users, of course, don’t know this. The problems latency produces range from financial losses to field operations personnel being killed due to stale intelligence.

It didn’t.
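Back to the latency argument. A back of the envelope sum makes the point: end to end latency is the total of the delays at each friction zone, so one slow segment drags the whole pipeline away from anything resembling real time. The stage names and millisecond figures in this sketch are invented for illustration, not measurements.

# Back-of-the-envelope sketch: end-to-end latency is the sum of the delays
# at each friction point. All numbers below are made up for illustration.
STAGES_MS = {
    "network / ISP hops (green)": 250,
    "vendor ingestion and enrichment (pink)": 1_500,
    "in-house indexing and query layer (yellow)": 4_000,
}

def end_to_end_latency(stages):
    """Total pipeline delay is simply the sum of the per-stage delays."""
    return sum(stages.values())

if __name__ == "__main__":
    total = end_to_end_latency(STAGES_MS)
    for stage, ms in STAGES_MS.items():
        print(f"{stage:45s} {ms:>6d} ms ({ms / total:5.1%} of total)")
    print(f"{'end-to-end':45s} {total:>6d} ms")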

Anyway, three observations.

Read more

Microsoft, the US Treasury, and Search

December 9, 2010

The new Microsoft-based Treasury.gov Web site works pretty well. Pictures flash, the links work, and the layout is reasonably clear. There is the normal challenge of government jargon. So “Help, I am going to lose my home” becomes “Homeowner’s HOPE Hotline”.

I am interested in search and retrieval. I wanted to run through my preliminary impressions of the search interface, system responsiveness, and the relevance of results for my queries. I look at public facing search services from a different angle of attack than most people do. Spare me direct complaints via email. Just put your criticisms, cautions, and comments in the form provided at the foot of this Web page.

Search Interface

The basic search box is in the top right hand corner of the splash page. No problem, and when I navigate to other pages in the Web site, the search box stays put. However, when I click on some links I am whisked outside of the Treasury.gov site and the shift is problematic. No search box on some pages. Here’s an example: http://www.makinghomeaffordable.gov/index.html. Remember my example from the HOPE Hotline reference? Well, that query did not surface content gold on Treasury.gov. I went somewhere else, and I was confused. This probably is a problem peculiar to me, but I found it disconcerting.

Other Queries

I ran a query for “Treasury Hunt,” a service that allows me to determine if a former Arnold left money or “issues” for me. Here’s the result screen for the query “Treasury Hunt”:

treasury hunt results

The first hit in the result list points to this page:

treasury hunt result 1

The problem is that the hot link from this page points to this Web site, which I could not locate in the results list.

treasury direct explicit link

Several observations:

First, the response time for the system was sluggish, probably two seconds, which was longer than Google’s response time. No big deal, just saying “slower.”

Second, the results list did not return the expected hit. For most people, this makes zero difference. For me, I found the lack of matching hits to explicit links interesting. In fact, I assumed that the results list would have the TreasuryDirect hit at the top of the results list. Not wrong, just not what I expected.

Read more

Which Is Better? Abstract or Full Text Search?

November 26, 2010

Please bear with us while we present a short lesson in the obvious: “Users searching full text are more likely to find relevant articles than searching only abstracts.”  A recent BMC Bioinformatics research article written by Jimmy Lin titled “Is Searching Full Text More Effective than Searching Abstracts?” explores exactly that.

So maybe we opened with the conclusion, but here is some background information.  Since it is no longer an anomaly to view a full-text article online, the author set out to determine if it would be more effective to search full-text versus only the short but direct text of an abstract.  The results:

“Experiments show that treating an entire article as an indexing unit does not consistently yield higher effectiveness compared to abstract-only search. However, retrieval based on spans, or paragraphs-sized segments of full-text articles, consistently outperforms abstract-only search. Results suggest that highest overall effectiveness may be achieved by combining evidence from spans and full articles.”

Yep, at the end of the day, searching across a bank of more words will in fact increase your likelihood of a hit.  The extension here is that the future must bring with it some solutions.  Due to the longer length of full-text articles and the growing digital archive waiting to be tamed, Lin predicts that multiple machines in a cluster as well as distributed text retrieval algorithms will be necessary to handle the search requirements effectively.  Wonder who will be first in line to provide these services…
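For the curious, here is a minimal sketch of the “span” idea the paper describes: split the full text into paragraph sized segments and treat each segment as its own retrieval unit. The term overlap scoring below is naive and purely illustrative; the paper’s actual retrieval models are more sophisticated.

# Minimal sketch of span-based retrieval: index paragraph-sized segments of
# the full text rather than only the abstract. Scoring is naive term overlap
# for illustration only.
def make_spans(full_text, max_words=150):
    """Split an article into paragraph-sized spans of roughly max_words."""
    spans, current = [], []
    for paragraph in full_text.split("\n\n"):
        current.extend(paragraph.split())
        if len(current) >= max_words:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

def score(span, query_terms):
    """Count how many query terms appear in the span."""
    words = set(span.lower().split())
    return sum(term in words for term in query_terms)

def search_spans(article, query, top_n=3):
    """Rank the article's spans against the query and return the best few."""
    terms = query.lower().split()
    ranked = sorted(make_spans(article), key=lambda s: score(s, terms), reverse=True)
    return ranked[:top_n]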

Sarah Rogers, November 26, 2010

Freebie

Reflections on Ask.com

November 13, 2010

Ask.com used to be the premier search engine for the Internet. According to the article “IAC’s Barry Diller Surrenders to Google, Ends Ask.com’s Search Effort,” the site does not even break the top five anymore. Because of this backslide, Diller’s corporation will be laying off 130 engineers and letting the competition take most of its brute force Web search business.

In the era before Yahoo and Google you could type in any question and your trusty guide, Jeeves, would take you anywhere you needed to go. Not anymore. It seems that Ask.com can no longer keep up with the Joneses or, in this case, the Google. The write up asserted:

It’s become this huge juggernaut of a company that we really thought we could compete against by innovating. We did a great job of holding our market share but it wasn’t enough to grow the way IAC had hoped we would grow when it bought us.

Google has grown to be the world’s top search engine, and it seems to control 65 percent of the searches performed in the United States.

Some observations:

  • How long will Google be able to sustain brute force indexing? The more interesting services use human input to deliver content.
  • Who will be the next Google? Maybe it will be Facebook?
  • With the rise of “training wheels” on search systems, will most users fiddle with keywords? Won’t “get it fast, get it good enough” become the competitive advantage?

Google is now the old man of search. I see the company moving clumsily. There were the “don’t go to Facebook” payoffs earlier this week. There is the Facebook game, and Google is watching from the cheap seats.

Changes afoot. I fondly recall the third tier consultant who told me that Ask.com was a winner. I assume that young person is now advising the movers and shakers about search and content processing. Maybe Google needs an advisor to help the firm move from the cheap seats to the starting line up?

Stephen E Arnold and Leslie Radcliff, November 13, 2010

Freebie

Brainware Jumps to Version 5.2

November 4, 2010

Short honk: My in box overflowed with a news release about Brainware’s Version 5.2 of its enterprise search system. The news release provides some publicity for a trade show at which Brainware has an exhibit. In addition to helping out the trade show outfit, Brainware called my attention to new features in Version 5.2. These include:

  • More flexible security for processed documents
  • Enhanced indexing of content in relational databases
  • More control over what’s displayed in response to a query.

Brainware’s approach to content processing relies on trigrams for which the firm has a patent. For more information about Brainware, navigate to the firm’s Web site at www.brainware.com. No licensing fee details are available to me at this time. I did see a demo of the new system and I think the firm will give you a peek as well. I had been watching to see if Oracle would acquire Brainware. The database giant seems happy with Brainware’s content acquisition components. Oracle, however, moved in a different direction. I will keep my ear to the shoreline here at the goose pond.
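For readers unfamiliar with the term, here is a generic illustration of character trigrams in Python. It shows the general idea of trigram matching only; it is not Brainware’s patented method, and the similarity score is a crude example of my own devising.

# Generic illustration of character trigrams and a crude overlap score.
# Not Brainware's implementation; for explanation only.
from collections import Counter

def trigrams(text):
    """Return the multiset of 3-character substrings of a normalized string."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))

def similarity(a, b):
    """Shared trigrams divided by total trigrams: tolerant of typos."""
    ta, tb = trigrams(a), trigrams(b)
    shared = sum((ta & tb).values())
    total = sum((ta | tb).values())
    return shared / total if total else 0.0

if __name__ == "__main__":
    print(similarity("enterprise search", "enterprize serch"))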

Stephen E Arnold, November 4, 2010

Freebie

Content Analyst Partners with TCDI

November 3, 2010

Lawyers need tools to respond to the demands of their clients. Content Analyst Company, a leader in advanced document analytics tools, helps Technology Concepts & Design, Inc. (TCDI) reduce the time required to analyze information generated by the discovery process.

“TCDI and Content Analyst Company Announce Strategic Partnership, Expanding Analytics Capabilities in eDiscovery” reported:

[The companies] will incorporate Content Analyst Analytics Technology (CAAT) into its proprietary eDiscovery Application Suites: Discovery WorkFlow® and ClarVergence®. This partnership offers TCDI’s clients improvements in Document Review efficiencies and increased visibility into their document collections. The enhanced analytics will also reduce the time and cost associated with Document Review.

The tie up will yield improved document review and increased visibility into clients’ document collections. Content Analyst Company develops advanced document analytics tools based on patented Latent Semantic Indexing (LSI) technology. Content Analyst Analytical Technology (CAAT) sharply reduces the time needed to discern relevant information from large volumes of unstructured text. For more information, navigate to www.contentanalyst.com.
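For readers curious about what latent semantic indexing involves, here is a toy sketch: build a small term document matrix, take a truncated SVD with NumPy, and compare documents in the reduced concept space. This is a textbook illustration, not Content Analyst’s CAAT implementation; the sample documents and the choice of two dimensions are invented for the example.

# Toy sketch of latent semantic indexing: term-document matrix, truncated
# SVD, then cosine similarity in the reduced concept space. Textbook
# illustration only.
import numpy as np

DOCS = [
    "contract breach damages litigation",
    "damages claim litigation discovery",
    "patent filing trademark registration",
]

def lsi_doc_vectors(docs, k=2):
    """Project each document into a k-dimensional latent concept space."""
    terms = sorted({t for d in docs for t in d.split()})
    # term-document matrix of raw counts
    A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # columns of S_k @ Vt_k are the documents in k concept dimensions
    return (np.diag(s[:k]) @ Vt[:k]).T

def cosine(u, v):
    """Cosine similarity between two document vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

if __name__ == "__main__":
    vecs = lsi_doc_vectors(DOCS)
    print(cosine(vecs[0], vecs[1]))  # the two litigation documents score closer
    print(cosine(vecs[0], vecs[2]))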

Harleena Singh, November 3, 2010

Freebie

