Inside the Information Tokamak, Part 1: The Blue Spheres of Messaging
April 2, 2008
I’ve also enjoyed the tokamak, a machine that produces a toroidal magnetic field for confining a plasma. (A plasma is, for those who cut physics class to enjoy a spring day, an ionized gas containing an approximately equal number of positive ions and electrons.) Zap this puppy, and you get interesting phenomena. Here’s one example on a slightly larger scale than your local university’s physics lab.
Source: http://ocw.mit.edu/NR/rdonlyres/Global/7/77E722FA-4A00-476D-9D4A-3F86C9BDA2B3/0/chp_sun_plasma.jpg
So what does nuclear physics have to do with behind-the-firewall search? Actually, quite a lot if you have a poetic side to your curious self.
I am living in a digital tokamak. Instead of ions and electrons, I am bombarded by the information particles shown in the diagram below:
This is a diagram prepared in 2003. I am using it “as is” despite its flaws. If you want to recycle the diagram, please coordinate with me.
If you read my earlier post about the “gray bar”, you know that the “yellow spheres” and the “purple spheres” exert pressure on an organization’s information environment. The three new sets of spheres in blue, red, and green are what’s inside the “gray bar” in this diagram.
Key Word Search Vendors: Panting Laggards
March 31, 2008
In September 2003, I gave an invited lecture at LANL, an acronym for Los Alamos National Laboratory for those of you who don’t keep up with some of the US government’s most interesting research nomenclature. I poked around my digital warehouse today when I saw an announcement that a major search-and-retrieval vendor was now officially in the “information access business”. I used to work for Ziff Communications Co., and we owned an outfit called Information Access Co. That was a great company name, but the whole shooting match was sold to the giant Thomson Corporation and the name Information Access fell into disuse, or so I thought.
I marvel at the “back from the dead” quality certain terminology demonstrates. IAC, as Information Access was known for more than 15 years, allowed a person to search for electronic information. The idea was a good one, and IAC had revenues of more than $100 million at the time of the sale. The idea was simple. We used bibliographic records or what today would be called “structured metadata”, full text of articles or what today would be called content, and proprietary scripts to generate reports or what today would be called business intelligence. The user of our General Business File product in 1990 would pick from a menu of options; for example, look for a job. Then the user would pick from one of the major cities whose employment opportunities we indexed (now tagged) and the system would display job openings. A mouse click sent the report to the printer, and we had happy users. We sold more than 1,000 of these systems in less than nine months in 1990. Considering each system was in the $20,000 plus range, the General Business File would be a success in our Googley world.
The LANL group wanted to know about the future of search and “The Information Implications of Social Software”. Now in 2003, there wasn’t the popular awareness of social software because MySpace.com, Facebook.com, the Web 2.0 “revolution”, and AJAX were dreams or oddities known to a handful of code bangers.
One of the key points in my presentation was that “information access” was an umbrella term for a bundle of activities and functions. These separate entities were now able to interact to form new, often quite surprising products and services. Social software–which I defined as the use of network technology for communication, collaboration, and combination–was a terrible term, but we were stuck with it. (To learn more about my annoyance with information terminology, Searcher Magazine is running a feature story that updates my 1999 article and my year 2000 article about technology convergence. Sorry. I don’t have a publication date yet, but the editor, Barbara Quint, is working on my lousy prose now.)
Take a look at one diagram from my lecture. Keep in mind that I prepared this five years ago, but for our purpose it is, I hope, useful to you.
Someone complained that I was copyrighting my work on this Web log. Okay, I won’t put the copyright symbol on this graphic. If you want to recycle my work, please, send me an email and get permission. I get annoyed when certain individuals borrow with neither attribution nor permission. Right, Mr. Hermans?
Let’s take a quick tour of this diagram, and then I will close with some observations about the “panting laggard” that is behind-the-firewall search.
Yellow Spheres
Notice the “yellow spheres”. You may have to click on the small image in order to read the notations on this diagram. The heading is “Enabling”. The idea is that each of the “yellow spheres” represents a category of technology that makes online information more useful. For example, “Converting Creating Content” refers to content authoring and content transformation. Behind-the-firewall systems have to take different file types and homogenize them so the system can manipulate them. If a search or content processing system can’t “read” a file, the system won’t process it. The idea, then, is to get the content regardless of its form and format into the search and content processing system. The bottom “yellow ball” is labeled “Spidering, Indexing, and Searching”. You recognize these ideas because 90 percent of a search vendor’s sales pitch talks about this “yellow ball”. In terms of this diagram, it’s easy to see that these three operations–spidering, indexing, and search–are just a cog in a much larger system. Vendors who pitch you about these three features are “panting laggards”. These vendors are almost out of the race and almost certainly won’t win in the long run in my opinion.
Purple Spheres
The “purple spheres” are identified as “Analysis”. Each of these four spaces is now mainstream. Vendors offer these services because each is easier for a manager to assess in terms of a payoff. Few people in an organization want to see laundry lists of information. Filtering eliminates information that rules, methods, or user-defined specifications exclude; for example, “I don’t want information about enterprise search. I want information about predictive analytics.” Clustering is a catch-all term. In it reside classification, grouping, categorization, and anything to do with today’s idées du jour–taxonomies and ontologies. The idea is that the system groups similar documents in a meaningful way. If you don’t know what you really want to review, you scan the category labels and browse the results. The third “purple sphere” is data mining. Companies like SPSS and SAS Institute are familiar to you if you took advanced statistics in college. These companies are now in the business of text processing and offering a burgeoning array of features and functions designed to whip unstructured content into shape. SAS Institute bought Teragram, and their PR team told me that SAS will become an “enterprise search company”. I detest this term, but the move is a good one. SAS wants to chop up text, pull out the juicy bits, count them, crunch them, and generate reports for users. The final “purple sphere” is labeled “static / video imaging”. Most organizations are awash in digital information, but most of that is text. Not for long will it be text. “Going forward”, I said in 2003, “behind-the-firewall search systems will have to come to grips with the information-charged binary files–chemical structures, engineering drawings, audio recordings, and video.” Now five years later, only Autonomy has a reasonable solution to video. The other data types remain “outside” the behind-the-firewall system vendors’ capabilities.
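For readers who want to see what filtering and grouping look like at their most basic, here is a rough Python sketch. It is not how any vendor named above does it; the sample documents, the function names, and the phrase-matching rules are all invented for illustration, and real systems use far more sophisticated methods than substring checks.

```python
from collections import defaultdict

# Hypothetical sample documents; the text is invented for illustration.
DOCS = [
    {"id": 1, "text": "Enterprise search vendors reposition their products"},
    {"id": 2, "text": "Predictive analytics improves sales forecasts"},
    {"id": 3, "text": "Predictive analytics and data mining for retail"},
]


def filter_docs(docs, want, block):
    """Keep documents that mention a wanted phrase and no blocked phrase."""
    kept = []
    for doc in docs:
        text = doc["text"].lower()
        if any(phrase in text for phrase in block):
            continue  # a user-defined rule says "I don't want this"
        if any(phrase in text for phrase in want):
            kept.append(doc)
    return kept


def group_by_label(docs, labels):
    """Naive grouping: file each document under every label it mentions."""
    groups = defaultdict(list)
    for doc in docs:
        text = doc["text"].lower()
        for label in labels:
            if label in text:
                groups[label].append(doc["id"])
    return dict(groups)


kept = filter_docs(
    DOCS,
    want=["predictive analytics"],
    block=["enterprise search"],
)
groups = group_by_label(kept, labels=["data mining", "sales", "retail"])
print([doc["id"] for doc in kept])
print(groups)
```

The user scans the group labels and browses what sits under each one, which is the “category labels” idea in miniature.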
Gray Bar
The “gray bar” was intended to be a spectrum. My lousy Photoshop skills produced this blah “gray bar”. The idea is that “Enabling” and “Analysis” are two distinct types of pressure on search and content processing opportunities. As the “yellow spheres” get bigger, they will exert pressure on the folks in the “gray bar”. Similarly, as the “purple spheres” exert their influence on users, a catalytic reaction occurs in the “gray bar”. In 2003, I identified three significant changes in the way employees will interact with digital information.
First, instead of a search box, people looking for information want some sort of information finder “landing page”. For want of a better term, I used the word portal for the notion of gaining access to information in a search and content processing system.
Second, I identified the shift from getting laundry lists of “hits” to a type of collaborative work. Vendors often forget that documents are created by people, unless you are lucky enough to live inside some hyper-advanced culture like Google’s. But the GOOG is an anomaly, so think about your company. You want to accomplish a work task. Many work tasks require working with one or more colleagues. So, the world of search and retrieval becomes an enabler of collaborative interaction.
Third, the search system is a means of keeping track of what’s been done and how information has changed. In my new study, Beyond Search, published by the Gilbane Group, I talk about one of Google’s most interesting data management acquisitions in 2006. (A discussion of this company and its technology appears in Beyond Search.) This company was working in this type of hyper-search space, and if Google does more than launch betas, the technology could revolutionize its enterprise applications division. The point is that search is simply one facet of a much more significant set of processes coming about as the “yellow spheres” and the “purple spheres” expand and change the “pressure” for next-generation applications.
Going Nuclear at LANL
To wrap up, I was making explicit that key word search was a dead end. The action was in the “yellow spheres” and the “purple spheres”. As these various functional and technical areas grew more robust and fell in price, the notion of key words became irrelevant to the real opportunities in the “gray bar”.
In my discussion of the prescient Sagemaker technology here, I make it clear that the flabby key word search had shortcomings that were well known a decade ago. Now many leaders in search and retrieval are repositioning themselves–actually distancing themselves–from key word search. Not only is it a commodity, but the financial difficulties of some of the highest profile vendors make it clear that generating revenue is not easy to do. You can snag Lucene (discussed here) or Flax (discussed here) and save yourself some money.
The LANL folks were not thrilled with my talk. I thought some in the audience would explode. Webmasters and government marketers had just completed a redesign of the LANL Web site. Key word search was offered, but it was slow as molasses. I think it’s been improved now. None of the functions I identified as important in the “gray bar” were available on LANL’s public-facing or employee-only Web sites.
These wizards invited a guy from rural Kentucky, and I did the intellectual equivalent of tracking mud on their white carpet. Competition for clicks among the national labs is fierce. LANL, long the number one research facility, had suffered some security disappointments and the wily wizards at Oak Ridge National Lab had rolled out a niftier Web site. Believe it or not, a high-traffic Web site makes a difference at budget time on Capitol Hill. Here I was making a mess of the new white carpet. I turned in my fancy badge and high-tailed it back to Kentucky.
Most vendors of search and content processing systems have been slow to provide the functionality shown on my amateurish diagram. These vendors are now charging forward with new positioning, new buzzwords, and new ways to explain the benefits of their systems. Like the out-of-shape athlete, some of these folks are coming into our offices looking much the worse for wear. Most are “panting laggards”–not fit for serious information access duty and several years too late.
Stephen Arnold, April 1, 2008
Search: The Wheel Keeps on a Turnin’
March 30, 2008
In the late 1990s, I learned about a news aggregator. The company was Retrieval Technologies. The company’s founder had a great idea–aggregate news and make it available in real time. The product was News Machine. Among its features in 1995 was on-the-fly classification. In retrospect, News Machine was a proprietary version of today’s RSS (really simple syndication).
That company was acquired by an outfit called Sagemaker in 1999. Sagemaker was one of the first companies providing a dashboard, vertical business intelligence, and the News Machine’s real-time updates–on a Microsoft Windows platform.
The idea was that the Intranet was “a management tool”. Instead of search, Sagemaker provided users with personalization tools. The idea was that a “one size fits all” approach to search and retrieval was not what companies wanted. The Sagemaker system federated information from behind-the-firewall sources and external sources. The public Internet could be harvested. The system could also ingest analyst reports and make those available to Sagemaker users. Sagemaker called these types of for-fee, third-party materials “branded content”. On the back end, Sagemaker included a usage tracking system. At the time, I thought it was quite robust, and it offered the type of granularity that online Web search systems now have in place.
A Forward-Looking Approach to Search
In my files I located this overview of the Sagemaker architecture. The acronym EIP stands for Enterprise Integration Platform. The idea is that functions–what Sagemaker called “card slots”–were plugged into the EIP. XML was the lingua franca of the system. Java was used for the messaging service and the server was based on Java. Sagemaker, therefore, was a pioneer in merging Java servers with Windows. More intriguing was that parts of the Sagemaker service were hosted; that is, the functions ran from the cloud. Other functions–the graphical interface and the code that was installed on the licensee’s premises–were Windows.
I find that this approach was unable to generate sufficient traction to sweep the enterprise market. Sagemaker competed with Plumtree (now part of BEA, which is now part of Oracle) and Documentum, which is now part of EMC, the storage company turned tech conglomerate.
Search Hoops: Exercising Technology to Meet User Needs
March 29, 2008
A “hoop” is a circular band that binds a barrel’s staves together. “Hoops” has a more informal meaning; the word is a synonym for basketball. In Kentucky, you say, “The Louisville Cardinals shoot serious hoops”. This sentence won’t make much sense in Santiago, Chile, but it does at the local gas station.
Search “hoops” are different. These are technical spaces that make it possible for a person to look for information. The figure below shows a series of search hoops. I want to take a few minutes to talk briefly about each of these with particular emphasis on their relationship to behind-the-firewall search. As you know, I think the term enterprise search is essentially valueless. It’s become an audible pause mouthed by vendors of many shapes and sizes. When I hear it, I’m baffled. Truth be told, most of the vendors who use the term enterprise search don’t know what it means. The job of explaining its meaning is left to the pundits and mavens who earn a living blowing smoke to explain fuzziness. Visibility and comprehension hit the two to four inch range.
This is a diagram from a report I wrote for a company silly enough to pay me for an analysis of the online search-and-retrieval trends in the period 1975 to 2003. I have an updated version, but that’s something I sell to buy my beloved boxer dog Tyson Kibbles and Bits.
© Stephen E. Arnold, 2002-2008
Please, click on the image so you can read the textual annotations to each of the rings. I’m not going to repeat the information in the diagram’s annotations. I will relate these “hoops” to the challenge of behind-the-firewall search.
A 12-Step Program for Behind-the-Firewall Search
March 28, 2008
In 2006, one of the young engineers working on a search system at a large company said to me, “I’m in a 12-step program for this !%$&^ search system–two six packs of beer.”
This clever and stressed young engineer was the “owner” of her employer’s blue-chip, high-profile, it-slices-it-dices search system. The young wizard was learning that high marks in computer science do not a smooth behind-the-firewall search system make.
I kept this “12-step” tag in my mind. In late 2006, I used this graphic to illustrate one way to deploy a behind-the-firewall search system with few hassles and certainly no recourse to alcohol.
Let me run through the 12 steps and conclude with a reminder that short cuts can lead to some interesting challenges.
Step 1. You will need a team to assist you with your behind-the-firewall search project. Search has quite a few moving parts. Working alone is not a good idea.
Step 2. You need to know a great deal about the content you plan to index. You want to know how much content you must index; how much change occurs in the content; how much new content becomes available every day, week, month, and year; access constraints; file types; and special issues such as chemical structures that must be indexed, among other points.
Step 3. You need to know what problem your behind-the-firewall search system is to solve. Is it key word search relevancy, or are you deploying a business intelligence system?
Step 4. You need to have a clear idea about who can access what information. If your organization has a security officer who handles these details, bond with this person. If not, you will need to take steps to manage access to information processed by the system. Allowing colleagues to see health and salary data without authorization creates new challenges.
Step 5. You need to have a clear statement of system requirements. Keep in mind that you want to focus on the must-have features. The “nice to have” requirements should be winnowed from the “must have” requirements. Focus on the “must haves”.
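Step 2 is the one most people skip, so here is a rough Python sketch of a first-pass content inventory. It assumes the content sits on a file share you can walk; the function name `inventory` and the seven-day “recent change” window are my own inventions for illustration. It counts files by type, totals the bytes, and counts what changed recently. Access constraints and oddities like chemical structures still need a human.

```python
import os
import time
from collections import Counter


def inventory(root, recent_days=7, now=None):
    """Summarize files under root: counts by extension, total bytes, recent changes."""
    now = time.time() if now is None else now
    cutoff = now - recent_days * 86400  # seconds in a day
    by_ext = Counter()
    total_bytes = 0
    recently_changed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # unreadable file: skipped here, a real audit should log it
            ext = os.path.splitext(name)[1].lower() or "(none)"
            by_ext[ext] += 1
            total_bytes += info.st_size
            if info.st_mtime >= cutoff:
                recently_changed += 1
    return {
        "by_extension": dict(by_ext),
        "total_bytes": total_bytes,
        "recently_changed": recently_changed,
    }


if __name__ == "__main__":
    # Hypothetical usage: point it at the top of the share you plan to index.
    print(inventory("."))
```

Run it once a week for a month and you have a crude answer to “how much change occurs in the content” before a vendor asks you.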
Search: The Three Curves of Despair
March 27, 2008
For my 2005 seminar series “Search: How to Deliver Useful Results within Budget”, I created a series of three line charts. One of the well-kept secrets about behind-the-firewall search is that costs are difficult, if not impossible, to control. That presentation is not available on my Web site archive, and I’m not sure I have a copy of the PowerPoint deck at hand. I did locate the Excel sheet for the chart which appears below. I thought it might be useful to discuss the data briefly and admittedly in an incomplete way. (I sell information for a living, so I instinctively hold some back to keep the wolves from my log cabin’s door here in rural Kentucky.)
Let me be direct: Well-dressed MBAs and sallow financial mavens simply don’t believe my search cost data.
At my age, I’m used to this type of uninformed skepticism or derisory denial. The information technology professionals attending my lectures usually smirk the way I once did as a callow nerd. Their reaction is understandable. And I support myself by my wits. When these superstars lose their jobs, my flabby self is unscathed. My children are grown. The domicile is safe from creditors. I’m offering information, not re-jigging inflated egos.
Now scan these three curves.
© Stephen E. Arnold, 2002-2008.
You see a gray line. That is the precision / recall curve. This refers to a specific method of determining if a query returns results germane to the user’s query and another method for figuring out how much germane information the search system missed. Search and a categorical affirmative such as “all” do not make happy bedfellows. Most folks don’t know what a search system does not include. Could that be one reason why the “curves of despair” evoke snickers of disbelief?
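If the two measures are fuzzy, a small Python sketch may help. Precision is the share of returned results that are germane; recall is the share of all germane documents that the system returned. The document IDs below are made up, and the function name is mine, not part of any product.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document IDs the search system returned.
    relevant:  document IDs a human judged germane to the query.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four results returned, five germane documents exist.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7", "d8", "d9"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Notice that recall cannot be computed unless someone already knows the full set of germane documents, including the ones the system never showed. That is the catch: most folks have no list of what the system missed.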
Search Waves: Are We Living through Periodicity?
March 26, 2008
I’m fascinated with cyclical phenomena. When working on my graduate degree, I accepted a grant from Duquesne University in 1967. Located in Pittsburgh, Pennsylvania, this Jesuit university was little-c catholic. All faiths were acceptable. One of my professors and friends, Dr. Richard Oehling told me, “Where else could an orthodox Jew teach the Protestant reformation to a group of Jesuits?” Such was Duquesne.
In one course, I confronted phenomenological existentialism, then a hot concept in philosophy. Although I was busy indexing sermons in Latin using ancient mainframes, the fuzzy-wuzzy world of existential philosophy caught my attention. I had zero clue about epistemology, heuristics, and other concepts that whipped serious students of different beliefs and backgrounds into a frenzy. This philosophical banter was better at stirring up emotions than a breakdown on the Squirrel Hill bus.
So, what’s this got to do with search?
I’m not sure, but there’s a thesis-antithesis-synthesis dialectic rippling the fabric of acquisitions, start ups, and “old wine in new bottles” innovations I read about in news releases. Just this morning, my Google Alert service informed me of “A New Wave of Enterprise Search”. The essay appeared in the CMSwatch Trendwatch blog. The key sentence to me was, “There’s a growing movement afoot to de-throne the old guard; talk of replacing FAST and Autonomy seemed to be uttered by every vendor that wasn’t a household name.”
Search: A Kitchen Sink and the Carcassonne Problem
March 25, 2008
As I worked on my keynote for the upcoming Buying and Selling eContent Conference in April 2008, I flipped through PowerPoint decks in search of examples. I came across a presentation I delivered in the summer of 2006. In that talk, I described behind-the-firewall search as following an interesting trajectory. Humans have a tendency to elaborate, embroider, and complicate.
Let me give you an example. My mother and father recently moved from their home to a condominium-style dwelling. The “space” was a blank canvas. After a year, I noticed that the white space was filled in. Some of the objects were family mementos like the hand-carved ebony elephant that has been in the Arnold family for a century. But other acquisitions were plaques identifying my mother as a “red hat lady”. My father had taped instructions for replacing the cartridge in his printer next to his flat panel monitor. In short, the white space was being filled in.
I noticed a similar “stuffing” when I was in Carcassonne, the walled city in Aude. Every square inch inside the city walls had been put to use.
Search: Appearances Are Deceiving
March 22, 2008
In Toronto, Ontario, several years ago, I attended a lecture in which the speaker (whose name I have forgotten) asked the audience, “What do you see?” When I saw this illustration, I saved it. My source was the University of Toronto. What do you see?
My myopic eyes see wheels that rotate. When I focus my attention on a single “wheel”, nothing moves. When I shift my vision, some wheels turn.
Search and retrieval is to some people similar to this illusion. I wish I could assure you that “search” will settle down, allow us to examine it carefully, and remain fixed if we shift our attention to another problem. I can’t. Search is a blob of digital mercury, and we are — at least for the foreseeable future — going to find that it’s elusive. Perception of the viewer “defines” search.
Why is this important?
On March 21, 2008, I spoke to a journalist who asked me, “What’s the difference between Intranet search and a company’s Web site search system?” The distinction is important because information behind-the-firewall is usually viewed as “for employees only”. There are exceptions such as a consultant or attorney who needs to examine information residing on an organization’s servers. The idea is that a user name, password, and even other types of authentication may be required to tap into invoices, customer information, marketing and sales materials, and other organization information and data.
Vivisimo’s Founders Interviewed: Raul Valdes-Perez and Jerome Pesenti
March 21, 2008
In mid-March, Vivisimo received an infusion of $4 million from North Atlantic Capital. Vivisimo has emerged as a full-scale “behind the firewall” search provider. The company landed the high-profile search-and-retrieval deal with the US Federal government for USA.gov, the public-facing portal for government information. Then, the company inked a deal with Interwoven, the content management company, to provide a search and content processing system for the Interwoven CMS system.
Some pundits see Vivisimo as a specialist vendor. That view of the company is incorrect. My sources tell me that Vivisimo is finding itself invited to bid on a range of commercial, government, and association projects. Executives at some well-known, high-profile search firms have asked me about Vivisimo. In my experience, this means Vivisimo is doing something right.