March 31, 2008

Search and content processing company ZyLAB opened an office in New York City in Rockefeller Center. ZyLAB describes its suite of technology as an “Information Access Platform,” a positioning shift that other vendors are emulating.

Dr. Johannes Scholtes, president of ZyLAB North America LLC, told Beyond Search, “We understand that as our client base continues to grow that we need to expand our corporate presence to ensure our customer service remains first-class.”

Beyond Search, a new study published by the Gilbane Group, profiles ZyLAB’s innovative technology. The company blends scanning, rich content processing, and search to allow licensees to manipulate a range of structured and unstructured data whether in hard copy or electronic form.

More information about the company is located at www.zylab.com.

Stephen Arnold, March 31, 2008

Key Word Search Vendors: Panting Laggards

March 31, 2008

In September 2003, I gave an invited lecture at LANL, an acronym for Los Alamos National Laboratory for those of you who don’t keep up with some of the US government’s most interesting research nomenclature. I poked around my digital warehouse today when I saw an announcement that a major search-and-retrieval vendor was now officially in the “information access business”. I used to work for Ziff Communications Co., and we owned an outfit called Information Access Co. That was a great company name, but the whole shooting match was sold to the giant Thomson Corporation and the name Information Access fell into disuse, or so I thought.

I marvel at the “back from the dead” quality certain terminology demonstrates. IAC, as Information Access was known for more than 15 years, allowed a person to search for electronic information. The idea was a good one, and IAC had revenues of more than $100 million at the time of the sale. The idea was simple. We used bibliographic records or what today would be called “structured metadata”, full text of articles or what today would be called content, and proprietary scripts to generate reports or what today would be called business intelligence. The user of our General Business File product in 1990 would pick from a menu of options; for example, look for a job. Then the user would pick from one of the major cities whose employment opportunities we indexed (now tagged) and the system would display job openings. A mouse click sent the report to the printer, and we had happy users. We sold more than 1,000 of these systems in less than nine months in 1990. Considering each system was in the $20,000 plus range, the General Business File would be a success in our Googley world.

The LANL group wanted to know about the future of search and “The Information Implications of Social Software”. Now in 2003, there wasn’t the popular awareness of social software because MySpace.com, Facebook.com, the Web 2.0 “revolution”, and AJAX were dreams or oddities known to a handful of code bangers.

One of the key points in my presentation was that “information access” was an umbrella term for a bundle of activities and functions. These separate entities were now able to interact to form new, often quite surprising products and services. Social software–which I defined as the use of network technology for communication, collaboration, and combination–was a terrible term, but we were stuck with it. (To learn more about my annoyance with information terminology, Searcher Magazine is running a feature story that updates my 1999 article and my year 2000 article about technology convergence. Sorry. I don’t have a publication date yet, but the editor, Barbara Quint, is working on my lousy prose now.)

Take a look at one diagram from my lecture. Keep in mind that I prepared this five years ago, but for our purpose it is, I hope, useful to you.


Someone complained that I was copyrighting my work on this Web log. Okay, I won’t put the copyright symbol on this graphic. If you want to recycle my work, please, send me an email and get permission. I get annoyed when certain individuals borrow with neither attribution nor permission. Right, Mr. Hermans?

Let’s take a quick tour of this diagram, and then I will close with some observations about the “panting laggard” that is behind-the-firewall search.

Yellow Spheres

Notice the “yellow spheres”. You may have to click on the small image in order to read the notations on this diagram. The heading is “Enabling”. The idea is that each of the “yellow spheres” represents a category of technology that makes online information more useful. For example, “Converting Creating Content” refers to content authoring and content transformation. Behind-the-firewall systems have to take different file types and homogenize them so the system can manipulate them. If a search or content processing system can’t “read” a file, the system won’t process it. The idea, then, is to get the content regardless of its form and format into the search and content processing system. The bottom “yellow ball” is labeled “Spidering, Indexing, and Searching”. You recognize these ideas because 90 percent of a search vendor’s sales pitch talks about this “yellow ball”. In terms of this diagram, it’s easy to see that these three operations–spidering, indexing, and search–are just a cog in a much larger system. Vendors who pitch you about these three features are “panting laggards”. These vendors are almost out of the race and almost certainly won’t win in the long run in my opinion.

Purple Spheres

The “purple spheres” are identified as “Analysis”. Each of these four spaces is now mainstream. Vendors offer these services because each is easier for a manager to assess in terms of a payoff. Few people in an organization want to see laundry lists of information. Filtering eliminates information based on rules, methods, or user-defined specifications that say, “I don’t want information about enterprise search. I want information about predictive analytics.” Clustering is a catch-all term. In it reside classification, grouping, categorization, and anything to do with today’s idées du jour–taxonomies and ontologies. The idea is that the system groups similar documents in a meaningful way. If you don’t know what you really want to review, you scan the category labels and browse the results. The third “purple sphere” is data mining. Companies like SPSS and SAS Institute are familiar to you if you took advanced statistics in college. These companies are now in the business of text processing and offering a burgeoning array of features and functions designed to whip unstructured content into shape. SAS Institute bought Teragram, and their PR team told me that SAS will become an “enterprise search company”. I detest this term, but the move is a good one. SAS wants to chop up text, pull out the juicy bits, count them, crunch them, and generate reports for users. The final “purple sphere” is labeled “static / video imaging”. Most organizations are awash in digital information, but most of that is text. Not for long will it be text. “Going forward”, I said in 2003, “behind-the-firewall search systems will have to come to grips with the information-charged binary files–chemical structures, engineering drawings, audio recordings, and video.” Now five years later, only Autonomy has a reasonable solution to video. The other data types remain “outside” the behind-the-firewall system vendors’ capabilities.

Gray Bar

The “gray bar” was intended to be a spectrum. My lousy Photoshop skills produced this blah “gray bar”. The idea is that “Enabling” and “Analysis” are two distinct types of pressure on search and content processing opportunities. As the “yellow spheres” get bigger, they will exert pressure on the folks in the “gray bar”. Similarly, as the “purple spheres” exert their influence on users, a catalytic reaction occurs in the “gray bar”. In 2003, I identified three significant changes in the way employees will interact with digital information.

First, instead of a search box, people looking for information want some sort of information finder “landing page”. For want of a better term, I used the word portal for the notion of gaining access to information in a search and content processing system.

Second, I identified the shift from getting laundry lists of “hits” to a type of collaborative work. Vendors often forget that documents are created by people, unless you are lucky enough to live inside some hyper-advanced culture like Google’s. But the GOOG is an anomaly, so think about your company. You want to accomplish a work task. Many work tasks require working with one or more colleagues. So, the world of search and retrieval becomes an enabler of collaborative interaction.

Third, the search system is a means of keeping track of what’s been done and how information has changed. In my new study, Beyond Search, published by the Gilbane Group, I talk about one of Google’s most interesting data management acquisitions in 2006. (A discussion of this company and its technology appears in Beyond Search.) This company was working in this type of hyper-search space, and if Google does more than launch betas, the technology could revolutionize its enterprise applications division. The point is that search is simply one facet of a much more significant set of processes coming about as the “yellow spheres” and the “purple spheres” expand and change the “pressure” for next-generation applications.

Going Nuclear at LANL

To wrap up, I was making explicit that key word search was a dead end. The action was in the “yellow spheres” and the “purple spheres”. As these various functional and technical areas grew more robust and fell in price, the notion of key words became irrelevant to the real opportunities in the “gray bar”.

In my discussion of the prescient Sagemaker technology here, I make it clear that the flabby key word search had shortcomings that were well known a decade ago. Now many leaders in search and retrieval are repositioning themselves–actually distancing themselves–from key word search. Not only is it a commodity, the financial difficulties of some of the highest profile vendors make it clear that generating revenue is not easy to do. You can snag Lucene (discussed here) or Flax (discussed here) and save yourself some money.

The LANL folks were not thrilled with my talk. I thought some in the audience would explode. Webmasters and government marketers had just completed a redesign of the LANL Web site. Key word search was offered, but it was slow as molasses. I think it’s been improved now. None of the functions I identified as important in the “gray bar” were available on LANL’s public-facing or employee-only Web sites.

These wizards invited a guy from rural Kentucky, and I did the intellectual equivalent of tracking mud on their white carpet. Competition for clicks among the national labs is fierce. LANL, long the number one research facility, had suffered some security disappointments and the wily wizards at Oak Ridge National Lab had rolled out a niftier Web site. Believe it or not, a high-traffic Web site makes a difference at budget time on Capitol Hill. Here I was making a mess of the new white carpet. I turned in my fancy badge and high-tailed it back to Kentucky.

Most vendors of search and content processing systems have been slow to provide the functionality shown on my amateurish diagram. These vendors are now charging forward with new positioning, new buzzwords, and new ways to explain the benefits of their systems. Like the out-of-shape athlete, some of these folks are coming into our offices looking much the worse for wear. Most are “panting laggards”–not fit for serious information access duty and several years too late.

Stephen Arnold, April 1, 2008

Brainware’s Growth Hits 900 Percent

March 31, 2008

James Zubok, chief financial officer of Brainware, a search and content processing company in northern Virginia, revealed Brainware’s rapid growth in the last calendar year. In an exclusive interview, Mr. Zubok said, “In less than two years we’ve experienced remarkable growth. Our sales have grown by more than 900 percent and we’ve doubled our sales force. We’re in these larger Ashburn offices because we ran out of space in our previous facility.”

You can read the full interview at ArnoldIT.com’s Search Wizards Speak service: http://www.arnoldit.com/search-wizards-speak/brainware.html. Other “search wizards” participating in this series include executives from Endeca, ISYS Search Software, and Vivisimo, among others.

Stephen Arnold, March 31, 2008

Brainware’s James Zubok Interviewed

March 31, 2008

Privately-held Brainware, once a unit of the German high-tech content management vendor SER Systems AG, is expanding rapidly, the company told Stephen Arnold, managing partner of ArnoldIT.com. The company uses a patented system and method anchored in numerical processes.

James Zubok, an attorney and the company’s chief financial officer, said in an interview on March 30, 2008: “In less than two years we’ve experienced remarkable growth. Our sales have grown by more than 900 percent and we’ve doubled our sales force.”

The complete interview appears as part of the Search Wizards Speak series available on the ArnoldIT.com Web site.

Brainware has a patented method for processing text, in sharp contrast to the dozens of vendors who index by key word and then try to discover metadata. The technique involves trigrams, or three-letter sequences. Mr. Zubok described the system in this way:

When we index the word “BRAINWARE” we store a representation of the following trigrams: “BRA”; “RAI”; “AIN”; “INW”; etc. We create a similar trigram representation of all of the text in a search query. During a search, instead of trying to match up entire words, we match the trigrams, which allows our application to be incredibly fault tolerant. Even if some of the trigrams are not a match, our search yields relevant results without relying on any dictionaries or other pre-defined rules.
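To make the idea concrete, here is a minimal sketch of trigram matching in Python. It is not Brainware's patented method, whose details the interview does not disclose. The function names, the overlap score, and the 0.5 threshold are assumptions chosen for illustration only.

```python
def trigrams(text):
    """Return the set of overlapping three-letter sequences in text."""
    s = text.upper()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def similarity(query, candidate):
    """Share of the query's trigrams that also occur in the candidate (0.0 to 1.0).

    This scoring rule is an illustrative assumption, not the vendor's formula.
    """
    q = trigrams(query)
    if not q:
        return 0.0
    return len(q & trigrams(candidate)) / len(q)


def search(query, documents, threshold=0.5):
    """Return documents whose trigram overlap with the query meets the threshold,
    best match first. The default threshold is a hypothetical value."""
    scored = [(similarity(query, doc), doc) for doc in documents]
    return [doc for score, doc in sorted(scored, reverse=True) if score >= threshold]


if __name__ == "__main__":
    # A misspelled query still finds the intended term: four of its seven
    # trigrams (BRA, RAI, AIN, INW) also appear in "BRAINWARE".
    print(search("BRAINWEAR", ["SOFTWARE", "BRAINWARE"]))
```

Because the comparison works on three-letter fragments rather than whole words, a typo damages only the trigrams that touch it. That is the fault tolerance the quotation describes, and no dictionary is consulted.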

The system lends itself to some high-value applications; for example, patent application and patent analysis, email discovery, and competitive intelligence activities.

One interesting aspect of the Brainware approach to content processing is its work flow functions. Mr. Zubok said:

We have workflow solutions for our intelligent data capture offerings (they have embedded search capabilities). We have two workflow applications: WF-distiller, which is our principal workflow component that is used for creating and managing workflows of all types of complexities; and A/P-WebDesk, a specialized workflow module built using WF-distiller but used specifically for Accounts Payable management. A/P-WebDesk (which includes A/P-WebDesk for SAP, a version built specifically for seamless integration with SAP) provides an easy-to-use interface to manage the entire invoice processing lifecycle.

The company’s system can be “tuned” using additional word lists and knowledge bases. You can read the complete interview with James Zubok here. More information about Brainware is available on the company’s Web site. You can download a trial version of the desktop build of Brainware’s search and content processing system from the Brainware.com Web site.

Stephen Arnold, March 30, 2008

A TV First for Google

March 30, 2008

At about 6 pm Eastern time, a Davidson student held up a sign that enjoined basketball fans: “Davidson. Just Google it.” With US television ad rates chewing through some companies’ budgets, Google scored today. Google and basketball–an eyeball slam dunk.

When online companies run adverts, serious money changes hands. Google has reached something of a cult status, at least among students at Davidson College, a small, elite institution not far from Charlotte, North Carolina.

Microsoft, Yahoo, Autonomy, and Fast Search & Transfer will consider getting signs into the hands of basketball fans during next week’s collegiate basketball finals.

Stephen Arnold, March 30, 2008

Search: The Wheel Keeps on a Turnin’

March 30, 2008

In the late 1990s, I learned about a news aggregator. The company was Retrieval Technologies. The company’s founder had a great idea–aggregate news and make it available in real time. The product was News Machine. Among its features in 1995 was on-the-fly classification. In retrospect, News Machine was a proprietary version of today’s RSS (really simple syndication).

That company was acquired by an outfit called Sagemaker in 1999. Sagemaker was one of the first companies providing a dashboard, vertical business intelligence, and the News Machine’s real-time updates–on a Microsoft Windows platform.

The idea was that the Intranet was “a management tool”. Instead of search, Sagemaker provided users with personalization tools. The idea was that a “one size fits all” approach to search and retrieval was not what companies wanted. The Sagemaker system federated information from behind-the-firewall sources and external sources. The public Internet could be harvested. The system could also ingest analyst reports and make those available to Sagemaker users. Sagemaker called these types of for-fee, third-party materials “branded content”. On the back end, Sagemaker included a usage tracking system. At the time, I thought it was quite robust, and it offered the type of granularity that online Web search systems now have in place.

A Forward-Looking Approach to Search

In my files I located this overview of the Sagemaker architecture. The acronym EIP stands for Enterprise Integration Platform. The idea is that functions–what Sagemaker called “card slots”–were plugged into the EIP. XML was the lingua franca of the system. Java was used for the messaging service and the server was based on Java. Sagemaker, therefore, was a pioneer in merging Java servers with Windows. More intriguing was that parts of the Sagemaker service were hosted; that is, the functions ran from the cloud. Other functions–the graphical interface and the code that was installed on the licensee’s premises–were Windows.


I find that this approach was unable to generate sufficient traction to sweep the enterprise market. Sagemaker competed with Plumtree (now part of BEA, which is now part of Oracle) and Documentum, which is now part of EMC, the storage company turned tech conglomerate. Read more

Search Hoops: Exercising Technology to Meet User Needs

March 29, 2008

A “hoop” is a circular band that binds a barrel’s staves together. “Hoops” has a more informal meaning; the word is a synonym for basketball. In Kentucky, you say, “The Louisville Cardinals shoot serious hoops”. This sentence won’t make much sense in Santiago, Chile, but it does at the local gas station.

Search “hoops” are different. These are technical spaces that make it possible for a person to look for information. The figure below shows a series of search hoops. I want to take a few minutes to talk briefly about each of these with particular emphasis on their relationship to behind-the-firewall search. As you know, I think the term enterprise search is essentially valueless. It’s become an audible pause mouthed by vendors of many shapes and sizes. When I hear it, I’m baffled. Truth be told, most of the vendors who use the term enterprise search don’t know what it means. The job of explaining its meaning is left to the pundits and mavens who earn a living blowing smoke to explain fuzziness. Visibility and comprehension hit the two to four inch range.

This is a diagram from a report I wrote for a company silly enough to pay me for an analysis of the online search-and-retrieval trends in the period 1975 to 2003. I have an updated version, but that’s something I sell to buy my beloved boxer dog Tyson Kibbles and Bits.


© Stephen E. Arnold, 2002-2008

Please, click on the image so you can read the textual annotations to each of the rings. I’m not going to repeat the information in the diagram’s annotations. I will relate these “hoops” to the challenge of behind-the-firewall search.

Read more

Exalead Adds Content Connectors

March 29, 2008

Exalead–a provider of search and content processing systems–said that it has added software connectors for Alfresco, FileNet P8, Hummingbird DM, Interwoven TeamSite, Microsoft SharePoint, and IBM Lotus QuickPlace. These newly-supported enterprise applications perform content and data management operations. Exalead’s system can now seamlessly access information in these systems’ content repositories.

These connectors supplement Exalead’s existing connectors for Microsoft Exchange, Lotus Notes, and common file types such as Word, PowerPoint, and Excel. Exalead also provides application programming interfaces that can be used to integrate the Exalead content processing system with enterprise applications, among other custom operations.

According to Exalead, the connectors are provided by EntropySoft, a firm focused on the integration of unstructured data. Exalead said, “Organizations today rely on a variety of data sources. Partnering with EntropySoft will allow us to build upon the enterprise connectors we have already developed.”

The deal allows Exalead to integrate EntropySoft bidirectional connectors with exalead one:search. More information is available here.

Stephen Arnold, March 28, 2008

A 12-Step Program for Behind-the-Firewall Search

March 28, 2008

In 2006, one of the young engineers working on a search system at a large company said to me, “I’m in a 12-step program for this !%$&^ search system–two six packs of beer.”

This clever and stressed young engineer was the “owner” of her employer’s blue-chip, high-profile, it-slices-it-dices search system. The young wizard was learning that high marks in computer science do not a smooth behind-the-firewall search system make.

I kept this “12-step” tag in my mind. In late 2006, I used this graphic to illustrate one way to deploy a behind-the-firewall search system with few hassles and certainly no recourse to alcohol.

12 steps

Let me run through the 12 steps and conclude with a reminder that short cuts can lead to some interesting challenges.

Step 1. You will need a team to assist you with your behind-the-firewall search project. Search has quite a few moving parts. Working alone is not a good idea.

Step 2. You need to know a great deal about the content you plan to index. You want to know how much content you must index; how much change occurs in the content; how much new content becomes available every day, week, month, and year; access constraints; file types; and special issues such as chemical structures that must be indexed, among other points.

Step 3. You need to know what problem your behind-the-firewall search system is to solve. Is it key word search relevancy, or are you deploying a business intelligence system?

Step 4. You need to have a clear idea about who can access what information. If your organization has a security officer who handles these details, bond with this person. If not, you will need to take steps to manage access to information processed by the system. Allowing colleagues to see health and salary data without authorization creates new challenges.

Step 5. You need to have a clear statement of system requirements. Keep in mind that you want to focus on the must-have features. The “nice to have” requirements should be winnowed from the “must have” requirements. Focus on the “must haves”. Read more

Northern Light: A New Business Information Search Service

March 27, 2008

Northern Light has made available a free business information search service. You can try it yourself at www.nlsearch.com. Search and browse are free, but you will have to pay to access certain content. A day pass is priced at about $5.00 and enterprise licenses are available.

Northern Light, in the mid-1990s, offered a somewhat similar service. The company received an infusion of capital from Reuters in 1999. By 2002, the company had become part of the now-defunct divine Interventures. Northern Light is once again a self-standing company. David Seuss, the former consultant who founded the firm, is once again running Northern Light.

Northern Light was one of the first search systems to enhance its results list with folders grouping similar results. More information is available from the Northern Light Web site. Information Today’s Paula Hane’s story has additional details about the service here.

Stephen Arnold, March 27, 2008

