Exclusive Interview: Erik Arnold, Adhere Solutions

August 9, 2010

How does one consultant interview another? Cautiously. How does a father interview a son? Buy the Diet Coke and provide the questions before flipping on the digital recorder. I spoke with Erik Arnold, managing director of Adhere Solutions in a charming eatery in Chicago with a buzzing neon sign advertising “Free Refills” on Sunday. Today is Monday. Now some readers wonder if I write about my son and get paid for that work. Anyone who has a successful son knows that fathers get to pay. What’s my compensation? When you have a gosling flying circles around your goose pond, you will figure it out.


Erik Arnold, managing director of Adhere Solutions, will be giving a talk about the use of open source search technology for the White House’s USA.gov Web site.

Erik Arnold has over 15 years of experience in the search industry, divided uniquely between both Web and enterprise search. Adhere Solutions is a consulting firm that advises companies on improving their search systems. Prior to Adhere, Erik served as a subject matter expert for a government consulting company where he primarily worked with the House of Representatives and the USA.gov web portal. He started his career at Lycos, one of the first Internet search engines, where he was a product marketing manager. Erik then moved to NBCi search engine (Snap.com) where he served as business development manager.

He will be giving a talk about the impact of open source search on certain US government initiatives at the October 2010 Lucene Revolution Conference.

The full text of the interview appears below:

Here we are again talking about search technology.

That’s right.

For readers who may not know about your company, what’s an Adhere Solutions?

Adhere Solutions offers products and services that help organizations with their search systems. We focus on Google and open source technologies. Adhere Solutions has been a trusted Google Enterprise Partner since 2007, with a client roster of Wal-Mart, Lexis-Nexis, and the Federal Trade Commission among others.

You have worked on the USA.gov and related Federal projects. When did you get into this type of work?

A decade ago. I think I did my first Federal consulting job in 2000 for the Clinton Administration.

Read more

Scaling with Solr, Python and Django

August 5, 2010

Scaling is a tough problem. Gmail has had its share of hiccups. Reddit has recently made a switch in its search system to deal with latency. Twitter is embarking on an infrastructure project to cope with getting bigger. Toby White’s scaling tips are useful in my opinion. His Timetric Blog included a useful write up called “Scaling Search to a Million Pages with Solr, Python, and Django.” The article references a slide deck, which contains code snippets and explanatory details. You can locate an instance of the file at http://dl.dropbox.com/u/1942316/SolrMillionsOfDocs.pdf. In my opinion, one of the key points in the write up in the Timetric Blog is:

On the large scale, each installation will have its own problems, but three things you’ll almost certainly need to pay attention to are:

  • Decoupling reading from and writing to the index. They have very different performance characteristics (and writing presents special problems if you’re updating documents as well as adding brand new documents).
  • Working out the right balance of adding/committing/optimizing data. This will be driven by the frequency with which you add data, and how soon you need to be able to serve results from newly-added data. Must it be immediate, or can you wait seconds/minutes/hours?
  • Fine-tuning your tokenizers/analyzers. Although small and fiddly, this is an issue which will bite you more and more as a corpus of data grows. You’ll need to tweak your indexing algorithms away from the defaults; extracting relevant results from a pile of a million documents is much harder than from a few thousand.
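The second point, balancing adds against commits, is easier to see in code. What follows is only a minimal sketch and not anything taken from the Timetric write up or slide deck: the class name `BatchingIndexer` and the `send_batch` and `commit` callables are hypothetical stand-ins for whatever client actually talks to the search server, and the thresholds are made-up defaults. The idea is to buffer new documents and commit only when the batch is big enough or the oldest waiting document has waited long enough.

```python
import time


class BatchingIndexer:
    """Buffer documents and commit on a size or age threshold.

    `send_batch` and `commit` are hypothetical callables supplied by the
    caller; they are placeholders, not a real Solr or sunburnt API.
    """

    def __init__(self, send_batch, commit, max_docs=1000,
                 max_age_seconds=60.0, clock=time.monotonic):
        self.send_batch = send_batch
        self.commit = commit
        self.max_docs = max_docs
        self.max_age_seconds = max_age_seconds
        self.clock = clock
        self.pending = []
        self.oldest = None  # arrival time of the oldest buffered document

    def add(self, doc):
        """Queue one document; flush if a threshold has been reached."""
        if not self.pending:
            self.oldest = self.clock()
        self.pending.append(doc)
        if self._due():
            self.flush()

    def _due(self):
        too_many = len(self.pending) >= self.max_docs
        too_old = self.clock() - self.oldest >= self.max_age_seconds
        return too_many or too_old

    def flush(self):
        """Send everything buffered, then commit once for the whole batch."""
        if not self.pending:
            return
        self.send_batch(list(self.pending))
        self.commit()
        self.pending.clear()
        self.oldest = None
```

Note that in this sketch the age check runs only when a new document arrives, so a real deployment would also want a timer or a final `flush()` call. Raising `max_docs` or `max_age_seconds` trades freshness of results for fewer, cheaper commits, which is the "immediate, or seconds/minutes/hours" question in the quoted passage.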

You may want to check out Toby White’s Python/Solr library sunburnt. Worth a look.

Stephen E Arnold, August 5, 2010

4 August Ultrasaurus on Lucene/Solr

August 4, 2010

I quite like the image “ultrasaurus” evokes. A goose, in comparison, lacks oomph. Nevertheless, you will want to navigate to “Lucene/Solr Meet Up, July 28, 2010.” There are some interesting factoids in the thorough summary of the presentations and remarks.

Let me highlight four that struck me as interesting, and you can work your way through the original post to get the rest of this meetup’s flavor.

First, Salesforce.com seems to be sporting a Lucene/Solr T shirt under the firm’s business casual garb. Bill Press, according to Ultrasaurus, offered some metrics about the scale of the firm’s operation; for example, eight terabytes of searchable information. The incremental indexing zips along with 70 percent of new content and deltas crunched in less than one minute.

Second, Lucid Imagination’s Grant Ingersoll provided some case examples. One sequence jumped out at me; that is: his suggested links for more information:

Lucid Imagination is the go-to outfit for Lucene/Solr engineering and professional services.

Finally, Jon Gifford from Loggly said:

Solr is awesome at what it does, but not so good for data mining. [So] plan to plug in Hadoop for large-volume analytics.


Possible logo for open source search solutions? Image source: http://wargames.spyz.org/convSALAMANDER.html

Will Lucene/Solr abandon their present logotypes and go for something along the line of a Spinosaurus? With Lucene/Solr adoptions moving upwards, a Spinosaurus might have easy pickings from clients of somewhat marginalized commercial search systems in Austria, Denmark, Germany, and other European Commission member states. Snack time may be approaching. SharePoint nibbles, anyone?

Stephen E Arnold, August 4, 2010

Taxodiary: At Last a Taxonomy News Service

August 3, 2010

I have tried to write about taxonomies, ontologies, and controlled term lists. I will be the first to admit that my approach has been to comment on the faux pundits, the so-called experts, and the azurini (self appointed experts in metatagging and indexing). The problem with the existing content flowing through the datasphere is that it is uninformed.

What makes commentary about tagging informed? Three attributes. First, I expect those who write about taxonomies to have built commercially-successful systems to manage term lists, and I expect those term lists to be in wide use and to conform to standards from ISO, ANSI, and similar outfits. Second, I expect those running the company to have broad experience in tagging for serious subjects, not the baloney that smacks of search engine optimization and snookering humans and algorithms with their alleged cleverness. Third, I expect the systems used to build taxonomies and to manage classification schemes and term lists to work; that is, a user can figure out how to get information out of a system relevant to his / her query.


Splash page for the Taxodiary news and information service.

How rare are these attributes?

Darned rare. When I worked on ABI/INFORM, Business Dateline, and the other database products, I relied on two people to guide my team and me. The first person is Betty Eddison, one of the leaders in indexing. May she rest in indexing heaven where SEO is confined to Hell. Betty was one of the founders of InMagic, a company on whose board I served for several years. Top notch. Care to argue? Get ready for a rumble, gentle reader.

The second person was Margie Hlava. Now Ms. Hlava, like Ms. Eddison, is one of the top guns in indexing. In fact, I would assert that she is on my yardstick either at the top or holds the top spot in this discipline. Please, keep in mind that her company Access Innovations and her partner Dr. Jay ven Eman are included in my reference to Ms. Hlava. How good is Ms. Hlava? Very good saith the goose.

Read more

Comparison Highlights Lucene

August 3, 2010

Vik Singh has posted a thorough and impartial comparative analysis of selected search engines. Singh used his own testing code, and kept the playing field level by not changing any numerical tuning parameters. He summarizes by saying:

Based on these preliminary results and anecdotal information I’ve collected from the web and people in the field (with more emphasis on the latter), I would probably recommend Lucene (which is an IR library – use a wrapper platform like Solr w/ Nutch if you need all the search dressings like snippets, crawlers, servlets) for many vertical search indexing applications – especially if you need something that runs decently well out of the box (as that’s what I’m mainly evaluating here) and community support.

Lucene earned a perfect 5/5 for support–highest of all tested platforms. (You can download Lucene/Solr at Lucid Imagination.)

As an IT professional, you are always on the lookout for ways to cut costs, and you also know that software licenses aren’t getting any cheaper, particularly for popular pro-sumer products such as Photoshop and Dreamweaver. http://www.osalt.com hosts a treasure trove of free, high-quality open source alternatives designed to save you time and money and still deliver a first-rate final product. By choosing an open source product, the user obtains a number of advantages compared to commercial products. Besides the fact that open source is always available for free, it is a transparent application, in that you are invited exclusively behind the scenes to view all source code and thereby to suggest improvements to the product. Furthermore, every product is covered by a large dedicated network, or community, who is more than willing to answer any questions you may have. http://www.osalt.com is definitely worth bookmarking.

Brett Quinn, August 3, 2010

Webnocular

July 27, 2010

I looked at this metasearch system a couple of weeks ago. I revisited it because a reader sent me a link to it, asking for my opinion. You can locate the site at http://www.webnocular.com/. Metasearch and mobile search are popular. The reason is that the cost of brute force Web indexing has made it impossible for smaller firms to compete. Exalead, now a unit of the French superstar services firm Dassault, has built an index of about eight billion Web pages. I use it first and then Google for my research. Google returns too many irrelevant results to keep this goose happy. Exalead’s method, on the other hand, does a much better job for the types of queries I routinely run. I also use Exalead to index Google’s own Web logs. I find that Google’s consumerist approach makes it tough to pinpoint some of Google’s own blog content. You can try the Exalead Google blog index at http://overflight.labs.exalead.com/.

Now what about Webnocular?


The system takes a query, performs some normal metasearch tricks, fires off the request, gets the results back, and performs some special magic. The idea is that metasearch systems do not have to brute force index the Web like Exalead, Google, and Microsoft do. Heck, it is expensive and more complicated than it looks to the home economics majors who end up working at the azurini (second and third tier consulting companies).

A query for “enterprise search” returned some results after some chugging. The results were okay, but not as useful to me as a query for the phrase on Exalead, Ixquick, or Red Tram, which is becoming one of my favorite current information indexing services.

I did not download the add in toolbar. I find these invasive. I don’t tweet and I don’t post to Facebook. Who cares what an addled goose likes. If you are into tool bars and social media, you may want to give Webnocular a test drive. The company offers code “extenders” such as an Instant Messenger service which is “a full-featured chat program.” The company says:

[Webnocular Messenger] includes features such as Moderated chat, high load support, font/color/ customization, emoticons, private messaging, private chat room, profanity filtering, ignoring users, file Transfer, and many more!

Our take on the service is that it implements some good ideas, and it could catch fire among some user segments. According to Most Popular Websites, Webnocular is in the top million most popular Web sites.

Stephen E Arnold, July 27, 2010

Freebie

Summer Search Rumor Round Up

July 26, 2010

The addled goose has been preoccupied with some new projects. In the course of running around and honking, he has heard some rumors. The goose wants to be clear. He is not sure if these rumors are 100 percent rock solid. He does want to capture them before the mushy information slips away:


Source: http://oneyearbibleimages.com/rumors.gif

First, the goose heard that there will be some turnover at Microsoft Fast. The author of some of the posts in the Microsoft Enterprise Search Blog may be leaving for greener pastures. You can check out the blog at this link. What does this tell the goose? More flip flopping at Microsoft? Not sure. Any outfit that pays $1.2 billion for software that comes with its own police investigation is probably an outfit that would scare the addled goose to death. The blog is updated irregularly with such write ups as “Crawling Case Sensitive Repositories Using SharePoint Server 2010” and “SharePoint 2010 Search ‘Dogfood’ Part 3 – Query Performance Optimization.” Ah, the new problem of upper and lower case and the ever present dog food regarding performance. I thought Windows’ most recent software ran as fast as a jack rabbit. Guess not.

Second, a number of traditional search vendors are poking around for semantic technology. The notion that key words don’t work particularly well seems to be gaining traction. The problem is that some of the high profile outfits have been snapped up. For example, Powerset fell into the Microsoft maw and Radar Networks was gobbled by Paul Allen’s love child, Evri. Now the stampede is on. The problem is that the pickings seem to be slim, a bit like the t shirts after a sale at the Wal-Mart up the road from the goose pond here in Harrod’s Creek. For some lucky semantic startups, Christmas could come early this year. Anyone hear a sound like “hack, hack”? Oh, that must be short for Hakia. You never know.

Third, performance may have forced a change at HMV.co.uk in merrie olde England. Dieselpoint was the incumbent. I heard that Dieselpoint is on the look out for partners and investors. The addled goose tried to interview the founder of the company but a clever PR person sidelined the goose and shunted him to the drainage ditch that runs through Blue Island, Illinois. Will Dieselpoint land the big bucks as Palantir did?

Fourth, the goose heard that a trio of Microsoft certified partners with snap in SharePoint search components were looking for greener pastures. What seems to be happening is that the easy sales have dried up since Microsoft started its current round of partner cheerleading. The words are there, but the sales are not. Microsoft seems to want the money to flow to itself and not its partners. Who is affected? The goose cannot name names without invoking the wrath of Redmond and a pride of PR people who insist that their clients are knocking the socks off the competition. However, does the enterprise need a half dozen companies pitching metatagging to SharePoint licensees? I think not. If sales don’t pick up, the search engine death watch list will pick up a few new entries before the leaves fall. Vendors in the US, Denmark, Germany, Austria, and Canada are likely to be watching Beyond Search’s death watch list. Remember Convera? It spawned Search Technologies. Remember the pre Microsoft Fast? It spawned Comperio. When a search engine goes away, the azurini flower.

Fifth, what’s happened to the Oracle killers? I lost track of Speed of Mind years ago. There was a start up with a whiz bang method of indexing databases. I haven’t heard much about killing Oracle lately. In fact, stodgy old Oracle is once again poking around for search and content processing technology according to one highly unreliable source. With SES11g now available to Oracle database administrators, perhaps the time is right to put some wood behind a 21st century search solution.

If you want to complain about one of these rumors, use the comments section of this blog. Alternatively, contact one of the azurini outfits and get “real” verification. Some of their consultants use this blog as training material for the consultants whom you compensate. No rumor this. Fact.

Stephen E Arnold, July 26, 2010

Freebie

Index Engines Polishes Platform

July 23, 2010

Index Engines recently announced enhancements to its 3.2 platform that allow indexing of multiple streams of data from backup tapes. With these changes, a significantly larger amount of tape data can be processed in tight time frames.

Up to six streams of data can be processed now at a speed of one terabyte per hour. The process can save a company millions in storage costs and the stockpiling of these tapes can be a liability according to a company spokesman. We did not test the new system, so you may want to run some benchmarks on your own before whipping out your American Express Platinum card.

This is, according to Index Engines, the only product of its type on the market that directly indexes stored data. Index Engines is involved with enterprise discovery solutions. The company was founded in 2003 and their mission is to organize enterprise data assets, making them immediately accessible, searchable and easy to manage.

Rob Starr, July 23, 2010

Vivisimo Chases Call Center Sales

July 22, 2010

One of the most frustrating things for a call center agent is not having the information that a customer needs right at their fingertips. Any business knows that they can lose customers when they have agents fumbling around through applications looking for answers, and no one really has the resources to be constantly updating this kind of information.

Sometimes the solutions come from unlikely sources. Vivisimo started by supplying applications for the military and academia but is now tackling the more practical problems that call centers face with Velocity. Here’s a real company on the move and they swear by this new information platform which they say optimizes fragmented information with an easy to use interface.

Vivisimo’s history begins with an on-the-fly clustering function, veers into Web indexing, jumps to enterprise search, embraces integration, and now flirts with call center search. Agility or chasing revenue? The goslings and I are not sure.

Now is most definitely the time for some of the world’s best companies to apply their knowledge to practical economic solutions.

Vivisimo may have to show some Autonomy-style innovation to make a quantum leap in revenue in my opinion.

Stephen E Arnold, July 22, 2010

A Factoid from Dell Computer

July 13, 2010

“Dell: 90% of Data Is Never Read Again” appeared on PC Pro, a UK Web site. The article presented data from Dell Computer that asserted “90% of company data is written once and never read again.” The write up contains some azure chip stuff; for example:

It’s an odd statistic. How is that data measured? 90% of all documents? 90% of stored bytes? When they said “ever again” did they mean explicitly retrieved by name, or should we include free text searches in that statistic? How long an interval needs to pass before some piece of data is clearly identified as belonging to the 90%, so that steps can be taken to reflect its reduced importance?

Anyone hear about offline storage, near line storage, and online storage? Certainly not at Dell, an outfit trying to boost its storage revenues and its knowledge of what companies do with their data.

One of the challenges of enterprise search is to index information and deliver relevant results. Popularity based systems—like the method used in the original Google Search Appliance—don’t work in organizations. Google figured this out and adapted its system. Specialized vendors, including Index Engines, built their business around the fact that once data are archived no one knows what’s there or how to find it.

Modern search and content processing systems are tough to configure for many reasons. One of them is the fact that most information tucked on an organization’s computers is lost. Only a handful of systems deliver what an employee needs to make a business decision. That information is usually relatively recent data. The write up descends into the weeds of which storage systems are going to ring the journalists’ and consultants’ chimes.

The topic I wanted to see addressed was ignored: search, indexing cycles, relevance, and other trivial questions. Buying hardware is more important I suppose.

Stephen E Arnold, July 13, 2010

Freebie
