Google Version 2.0 Patents Available without Charge
January 24, 2008
Google Version 2.0 (Infonortics, Ltd., Tetbury, Glou., 2007) references more than 60 Google patent applications and patents. I’ve received a number of requests for this collection of US documents. I’m delighted to make them available to anyone at ArnoldIT.com. The patent applications and patents in this collection represent the Google inventions that provide significant insight into the company’s remarkable technology. You can learn about a game designer’s next-generation ad system. You can marvel at the solutions that former Digital Equipment AltaVista.com engineers devised for bottlenecks in traditional parallel processing systems. Some of these inventions require considerable effort to digest; for example, Ramanathan Guha’s inventions regarding the Semantic Web. Others are a blend of youthful brilliance and Google’s computational infrastructure; specifically, the transportation routing system Google uses to move employees around the San Francisco area. Enjoy.
Stephen E. Arnold, January 24, 2008
Reducing the “Pain” in Behind-the-Firewall Search
January 23, 2008
I received several interesting emails in the last 48 hours. I would like to share the details with you, but the threat of legal action dissuades me. The emails caused me to think about the “pain” that accompanies some behind-the-firewall search implementations. You probably have experienced some of these pains.
Item: fee-creep pain. What happens is that the vendor sends a bill that is greater than the anticipated amount. Meetings ensue, and in most cases, the licensees pay the bill. Cost overruns, in my experience, occur with depressing frequency. There are always reasonable explanations.
Item: complexity pain. Some systems get more convoluted with what some of my clients have told me is “depressing quickness.” With behind-the-firewall search nudging into middle age, is it necessary for systems to become so complicated that it is very difficult, if not impossible, for the licensee’s technical staff to make changes? One licensee of a well-known search system told me, “If we push a button here, it breaks something over there. We don’t know how the system’s components interconnect.”
Item: relevancy pain. Here’s the scenario. You are sitting in your office and a user sends an email that says, “We can’t find a document that we know is in the system. We type in the name of the document, and we can’t find it.” In effect, users are baffled about the connection between their query and what the system returns as a relevant result. Now the hard part comes. The licensee’s engineer tries to tweak relevancy or hard wire a certain hit to appear at the top of the results list (see the short sketch after the last pain point below). Some systems don’t allow fiddling with the relevancy settings. Others offer dozens of knobs and dials, three score in some systems. Few or no controls, or too many controls — the happy medium is nowhere to be found.
Item: performance pain. The behind-the-firewall system churns through the training data. It indexes the identified servers. The queries come back in less than one second, blindingly fast for a behind-the-firewall network. Then performance degrades, not all at once. No, the system gets slower over time. How does one fix performance? The preferred fix, our research suggests, is more hardware. The budget is left gasping, and performance degrades again anyway.
Item: impelling. Some vendors install a system. Before the licensee knows it, the vendor’s sales professional is touting an upgrade. One rarely discussed issue is that certain vendors’ upgrades — how shall I phrase it — often introduce problems. The zippy new feature is not worth the time, cost, and hassle of stabilizing a system or getting it back online. Search is becoming a consumer product with “new” and “improved” bandied freely among the vendor, PR professionals, tech journalists, and the licensees. Impelling for some vendors is where the profit is. So upgrades are less about the system and more about generating revenue.
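To make the “hard wiring” idea concrete, here is a minimal sketch of how a pinned-results table might sit in front of an engine’s own ranking. The function names and document identifiers are my own inventions for illustration, not any vendor’s API, and a production system would handle this far less crudely.

# Hypothetical sketch: pin a known document above whatever the relevancy model returns.
PINNED_RESULTS = {
    "annual report": ["doc-4512"],   # query string -> document ids forced to the top
}

def apply_pins(query, ranked_doc_ids):
    """Return the engine's ranked list with any pinned documents moved to the front."""
    pins = PINNED_RESULTS.get(query.lower().strip(), [])
    rest = [doc_id for doc_id in ranked_doc_ids if doc_id not in pins]
    return pins + rest

# The engine ranks the wanted document fourth; the pin promotes it to first place.
print(apply_pins("Annual Report", ["doc-9001", "doc-112", "doc-77", "doc-4512"]))
# ['doc-4512', 'doc-9001', 'doc-112', 'doc-77']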
The causes for each of these pressure points are often complicated. Many times the licensees are at fault. The customer is not always right when he or she opines, “Our existing hardware can handle the load.” Or, “We have plenty of bandwidth and storage.” Vendors can shade facts in order to make the sale with the hope of getting lucrative training, consulting, and support work. One vendor hires an engineer at $60,000 per year and bills that person’s time at a 5X multiple, counting on 100 percent billability to pump more than $200,000 into the search firm’s pockets after paying salaries, insurance, and overhead. Other vendors are marketing operations, and their executives exercise judgment when it comes to explaining what the system can and can’t do under certain conditions.
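For readers who like to see the arithmetic, here is a rough back-of-the-envelope version of that billing model in Python. The billable hours and overhead figures are my assumptions for illustration only; the salary and the 5X multiple come from the example above.

# Back-of-the-envelope sketch of the 5X billing multiple described above.
salary = 60_000            # engineer's annual salary
billable_hours = 2_000     # assumed hours per year at 100 percent billability
hourly_cost = salary / billable_hours            # 30 dollars per hour
billing_rate = 5 * hourly_cost                   # 5X multiple: 150 dollars per hour
gross_billings = billing_rate * billable_hours   # 300,000 dollars

overhead = 40_000          # assumed insurance, benefits, and overhead
net_to_vendor = gross_billings - salary - overhead
print(f"Gross: ${gross_billings:,.0f}  Net to vendor: ${net_to_vendor:,.0f}")
# Gross: $300,000  Net to vendor: $200,000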
What can be done about these pain points? The answer is likely to surprise some readers. You expect a checklist of six things that will convert search lemons into search lemonade. I am going to disappoint you. If you license a search system and install it on your organization’s hardware, you will experience pain, probably sooner rather than later. The reason is that most licensees underestimate the complexity, hardware requirements, and manual grunt work needed to get a behind-the-firewall system to deliver superior relevancy and precision. My advice is to budget so the search vendor does the heavy lifting. Also consider hosted or managed services. Appliances can reduce some of the aches as well. But none of these solutions delivers a trouble-free search solution.
There are organizations with search systems that work and customers who are happy. You can talk to these proud owners at the various conferences featuring case studies. Fast Search & Transfer hosts its own search conference so attendees can learn about successful implementations. You will learn some useful facts at these trade shows. But the best approach is to have search implementation notches on your belt. Installing and maintaining a search system is the best way to learn what works and what doesn’t. With each installation, you get more street smarts, and you know what you want to do to avoid an outright disaster. User groups have fallen from favor in the last decade. Customers can “wander off the reservation”, creating a PR flap in some situations. Due to the high level of dissatisfaction among users of behind-the-firewall search systems, it’s difficult to get detailed, timely information from a search vendor’s customers. Ignorance may keep some people blissfully happy.
Okay, users are unhappy. Vendors make it a bit of work to get the facts about their systems. Your IT team has false confidence in its abilities. You need to fall back on the basics. You know these as well as I do: research, plan, formulate reasonable requirements, budget, run a competitive bid process, manage, verify, assess, modify, etc. The problem is that going through these tasks is difficult and tedious work. In most organizations, people are scheduled to the max, or there are too few people to do the work due to staff rationalizations. Nevertheless, a reliable behind-the-firewall search implementation takes work, a great deal of work. Shortcuts — on the licensee’s side of the fence or the vendor’s patch of turf — increase the likelihood of a problem.
Another practical approach is to outsource the search function. A number of vendors offer hosted or managed search solutions. You may have to hunt for vendors who offer these services, sometimes called subscription search. Take a look at Blossom Software.
Also, consider one of the up-and-coming vendors. I’ve been impressed with ISYS Search Software and Siderean Software. You may also want to take another look at the Thunderstone – EPI appliance or the Google Search Appliance. I think both of these systems can deliver a reliable search solution. Both vendors’ appliances can be customized and extended.
But even with these pragmatic approaches, you run a good chance of turning your ankle or falling on your face if you stray too far from the basics. Pain may not be avoidable, but you can pick your way through various search obstacles if you proceed in a methodical, prudent way.
Stephen Arnold, January 23, 2008
Search, Content Processing, and the Great Database Controversy
January 22, 2008
“The Typical Programmer” posted the article “Why Programmers Don’t Like Relational Databases,” and ignited a mini-bonfire on September 25, 2007. I missed the essay when it first appeared, but a kind soul forwarded it to me as part of an email criticizing my remarks about Google’s MapReduce.
I agreed with most of the statements in the article, and I enjoyed the comments by readers. When I worked in the Keystone Steel Mill’s machine shop in college, I learned two things: [a] don’t get killed by doing something stupid and [b] use the right tool for every job.
If you have experience with behind-the-firewall search systems and content processing systems, you know that there is no right way to handle data management tasks in these systems. If you poke around the innards of some of the best-selling systems, you will find a wide range of data management and storage techniques. In my new study “Beyond Search,” I don’t focus on this particular issue because most licensees don’t think about data management until their systems run aground.
Let me highlight a handful of systems (without taking sides or mentioning names) and the range of data management techniques employed. I will conclude by making a few observations about one of the many crises that bedevil some behind-the-firewall search solutions available today.
The Repository Approach. IBM acquired iPhrase in 2005. The iPhrase approach to data management was similar to that used by Teratext. The history of Teratext is interesting, and the technology seems to have been folded back into the giant technical services firm SAIC. Both of these systems ingest text, store the source content in a transformed state, and create proprietary indexes that support query processing. When a document is required, the document is pulled from the repository. When I asked both companies about the data management techniques used in their systems for the first edition of The Enterprise Search Report (2003-2004), I got very little information. What I recall from my research is that both systems used a combination of technologies integrated into a system. The licensee was insulated from the mechanics under the hood. The key point is that two very large systems able to handle large amounts of data relied on data warehousing and proprietary indexes. I heard when IBM bought iPhrase that one reason for IBM’s interest was that iPhrase customers were buying hardware from IBM in prodigious amounts. The reason Teratext is unknown in most organizations is that it is one of the specialized tools purpose-built to handle CIA- and NSA-grade information chores.
The Proprietary Data Management Approach. One of the “Big Three” of enterprise search has created its own database technology, its own data management solution, and its own data platform. The reason is that this company was among the first to generate a significant amount of metadata from “intelligent” software. In order to reduce latency and cope with the large temporary files iterative processing generated, the company looked for an off-the-shelf solution. Not finding what it needed, the company’s engineers solved the problem themselves, and even today the company “bakes in” its own data management, database, and data manipulation components. When this system is licensed on an OEM (original equipment manufacturer) basis, the company’s own “database” lives within the software built by the licensee. Few are aware of this doubling up of technology, but it works reasonably well. When a corporate customer of a content management system wants to upgrade the search system included in the CMS, the upgrade is a complete version of the search system. There is no easy way to get around the need to implement a complete, parallel solution.
The Fruit Salad Approach. A number of search and content processing companies deliver a fruit salad of data solutions in a single product. (I want to be vague because some readers will want to know who is delivering these hybrid systems, and I won’t reveal the information in a public forum. Period.) Poke around and you will find open source database components. MySQL is popular, but there are other RDBMS offerings available, and depending on the vendor’s requirements, the best open source solution will be selected. Next, the vendor’s engineers will have designed a proprietary index. In many cases, the structure and details of the index are closely-guarded secrets. The reason is that the speed of query processing is often related to the cleverness of the index design. What I have found is that companies that start at the same time usually take similar approaches. I think this is because when the engineers were in university, the courses taught the received wisdom. The students then went on to their careers and tweaked what was learned in college. Despite the assertions of uniqueness, I find interesting coincidences based on this education factor. Finally, successful behind-the-firewall search and content processing companies license technology, buy it, or inherit it courtesy of a helpful venture capital firm. The company ends up with different chunks of code, and in many cases, it is easier to use whatever is there than to figure out how to make the solution work with the pears, apricots, and apples in use elsewhere in the company.
The Leap Frog. One company has designed a next-generation data management system. I talk about this technology in my column for one of Information Today’s tabloids, so I won’t repeat myself. This approach says, in effect: Today’s solutions are not quite right for the petabyte scale of some behind-the-firewall indexing tasks. The fix is to create something new, jumping over Dr. Codd, RDBMS, the costs of scaling, etc. When this data management technology becomes commercially available, there will be considerable pressure placed upon IBM, Microsoft, Oracle; open source database and data management solutions; and companies asserting “a unique solution” while putting old wine in new bottles.
Let me hazard several observations:
First, today’s solutions must be matched to the particular search and content processing problem. The technology, while important, is secondary to your getting what you want done within the time and budget parameters you have. Worrying about plumbing when the vendors won’t or can’t tell you what’s under the hood is not going to get your system up and running.
Second, regardless of the database, data management, or data transformation techniques used by a vendor, the key is reliability, stability, and ease of use from the point of view of the technical professional who has to keep the system up and running. You might want to have a homogeneous system, but you will be better off getting one that keeps your users engaged. When the data plumbing is flawed, look first at the resources available to the system. Any of today’s approaches work when properly resourced. Once you have vetted your organization, then turn your attention to the vendor.
Third, the leap frog solution is coming. I don’t know when, but there are researchers at universities in the U.S. and in other countries working on the problems of “databases” in our post-petabyte world. I appreciate the arguments from programmers, database administrators, vendors, and consultants. They are all generally correct. The problem, however, is that none of today’s solutions were designed to handle the types or volumes of information sloshing through networks today.
In closing, as the volume of information increases, today’s solutions — CICS, RDBMS, OODB and other approaches — are not the right tool for tomorrow’s job. As a pragmatist, I use what works for each engagement. I have given up trying to wrangle the “facts” from vendors. I don’t try to take sides in the technical religion wars. I do look forward to the solution to the endemic problems of big data. If you don’t believe me, try and find a specific version of a document. None of the approaches identified above can do this very well. No wonder users of behind-the-firewall search systems are generally annoyed most of the time. Today’s solutions are like the adult returning to college, finding a weird new world, and getting average marks with remarkable consistency.
Stephen Arnold, January 22, 2008
Search Vendors and Source Code
January 21, 2008
A reader of this Web log wrote and asked the question, “Why is software source code (e.g. programs, JCL, Shell scripts, etc.) not included with the “enterprise search” [system]?”
In my own work, I keep the source code because: [a] it’s a miracle (sometimes) that the system really works, and I don’t want youngsters to realize my weaknesses, [b] I don’t want to lose control of my intellectual property such as it is, [c] I am not certain what might happen; for example, a client might intentionally or unintentionally use my work for a purpose with which I am not comfortable, or [d] I might earn more money if I am asked to customize the system.
No search engine vendor with whom I have worked has provided source code to the licensee unless specific contractual requirements were met. In some U.S. Federal procurements, the vendor may be asked to place a copy of a specific version of the software in escrow. The purpose of placing source code in escrow is to provide an insurance policy and peace of mind. If the vendor goes out of business — so the reasoning goes — then the government agency or consultants acting on the agency’s behalf can keep the system running.
Most of the search systems involved in certain types of government work do place their systems’ source code in escrow. Some commercial agreements with which I am familiar have required that the source code be placed in escrow. In my experience, the requirement is discussed thoroughly and considerable attention is given to the language regarding this provision.
I can’t speak for the hundreds of vendors who develop search and content processing systems, but I can speculate that the senior management of these firms have similar reasons to [a], [b], [c], and [d] above.
Based on my conversations with vendors and developers, there may be other factors operating as well. Let me highlight these but remember, your mileage may vary:
First, some vendors don’t develop their own search systems and, therefore, don’t have source code or at least complete source code. For example, when search and content processing companies come into being, the “system” may be a mixture of original code, open source, and licensed components. At start up, the “system” may be positioned in terms of features, not the underlying technology. As a result, no one gives much thought to the source code other than keeping it close to the vest for competitive, legal, or contractual reasons. This is a “repackaging” situation where the marketing paints one picture, and the technical reality is behind the overlay.
Second, some vendors have very complicated deals for their systems’ technology. One example is vendors who enjoy a significant market share. Some companies are early adopters of certain technology. In some cases, the expertise may be highly specialized. In the development of commercial products some firms find themselves in interesting licensing arrangements; for example, an entrepreneur may rely on a professor or classmate for some technology. Sometimes, over time, these antecedents are merged with other technology. As a result, these companies do not make their source code available. One result is that some engineers, in the search vendor’s company and at its customer locations, may have to research the solution (which can take time) or perform workarounds to meet their customers’ needs (which can increase the fees for customer service).
Third, some search vendors find themselves with orphaned technology. The search vendor licensed a component from another person or company. That person or company quit business, and the source code disappeared or is mired in complex legal proceedings. As a result, the search vendor doesn’t have the source code itself. Few licensees are willing to foot the bill for Easter egg hunts or resolving legal issues. In my experience, this situation does occur, though not often.
Keep in mind that search and content processing research funded by U.S. government money may be publicly available. The process required to get access to this research work and possibly source code is tricky. Some people don’t realize that the patent for PageRank (US6285999) is held by the Stanford University Board of Trustees, not Google. Federal funding and the Federal “strings” may be partly responsible. My inquiries to Google on this matter have proven ineffectual.
Several companies, including IBM, use Lucene or pieces of Lucene as a search engine. The Lucene engine is available from Apache. You can download code, documentation, and widgets developed by the open source community. One company, Tesuji in Hungary, licenses a version of Lucene plus Lucene support services. So, if you have a Lucene-based search system, you can use the Apache version of the program to understand how the system works.
To summarize, there are many motives for keeping search system source code out of circulation. Whether it’s fear of the competition or a legal consideration, I don’t think search and content processing vendors will change their policies any time soon. When my team has had access to source code during due diligence for a client, I recall my engineers recoiling in horror or laughing in an unflattering manner. The reasons are part programmer snobbishness and part the numerous short cuts that some search system vendors have taken. I chastise my engineers, but I know only too well how time and resource constraints exact harsh penalties. I myself have embraced the policy of “starting with something” instead of “starting from scratch.” That’s why I live in rural Kentucky, burning wood for heat, and eating squirrels for dinner. I am at the opposite end of the intellectual spectrum from the wizards at Google and Microsoft, among other illustrious firms.
Bottom line: some vendors adopt the policy of keeping the source code to themselves. The approach allows the vendors to focus on making the customer happy and has the advantage of keeping the provenance of some technology in the background. You can always ask a vendor to provide source code. Who knows, you may get lucky.
Stephen Arnold, January 21, 2008
Sentiment Analysis: Bubbling Up as the Economy Tanks
January 20, 2008
Sentiment analysis is a sub-discipline of text mining. Text mining, as most of you know, refers to processing unstructured information and text blocks in a database to wheedle useful information from sentences, paragraphs, and entire documents. Text mining looks for entities, linguistic clues, and statistically significant high points.
The processing approach varies from vendor to vendor. Some vendors use statistics; others use semantic techniques. More and more vendors mix and match procedures to get the best of each approach. The idea is that software “reads” or “understands” text. None of the more than 100 vendors offering text mining systems and utilities does this as well as a human, but the systems are improving. When properly configured, some systems outperform a human indexer. (Most people think humans are the best indexers, but for some applications, software can do a better job.) Humans are needed to resolve “exceptions” when automated systems stumble. Unlike software, however, a human indexer often memorizes a number of terms and uses these without seeking a more appropriate term from the controlled vocabulary. Human indexers also get tired, and fatigue affects indexing performance. Software indexing is the only way to deal with the large volumes of information in digital form today.
Sentiment analysis “reads” and “understands” text in order to find out if the document is positive or negative. About eight years ago, my team did a sentiment analysis for a major investment fund’s start up. The start up’s engineers were heads down on another technical matter, and the sentiment analysis job came to ArnoldIT.com.
We took some short cuts because time was limited. After looking at various open source tools and the code snippets in ArnoldIT’s repository, we generated a list of words and phrases that were generally positive and generally negative. We had several collections of text, mostly from customer support projects. We used these and applied some ArnoldIT “magic”. We were able to process unstructured information and assign a positive or negative score to documents based on our ArnoldIT “magic” and the dictionary. We assigned a red icon for results that our system identified as negative. Without much originality, we used a green icon to flag positive comments. The investment bank moved on, and I don’t know what the fate of our early sentiment analysis system was. I do recall that it was useful in pinpointing negative emails about products and services.
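For readers curious about the shape of such a dictionary-based approach, here is a minimal sketch in Python. The word lists and the red/green mapping are placeholders of my own, not the ArnoldIT dictionary, and a serious system would handle negation, phrases, and weighting with far more care.

import re

# Placeholder word lists; a real dictionary would be much larger and domain specific.
POSITIVE = {"excellent", "fast", "helpful", "reliable", "love"}
NEGATIVE = {"broken", "slow", "useless", "refund", "hate"}

def score(document):
    """Positive score suggests a generally positive document, negative the reverse."""
    words = re.findall(r"[a-z']+", document.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def flag(document):
    """Map the score to the green/red icon idea described above."""
    s = score(document)
    return "green" if s > 0 else "red" if s < 0 else "neutral"

print(flag("The support team was helpful and the fix was fast."))   # green
print(flag("The product is broken and the service is useless."))    # red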
A number of companies offer sentiment analysis as a text mining function. Vendors include Autonomy, Corpora Software, and Fast Search & Transfer, among others. A number of companies offer sentiment analysis as a hosted service with the work more sharply focused on marketing and brands. Buzzmetrics (a unit of AC Nielsen), Summize, and Andiamo Systems compete in the consumer segment. ClearForest, before it was subsumed into Reuters (which was then bought by the Thomson Corporation), had tools that performed a range of sentiment functions.
The news that triggered my thinking about sentiment was statistics and business intelligence giant SPSS’s announcement that it had enhanced the sentiment analysis functions of its Clementine content processing system. According to ITWire, Clementine has added “automated modeling to identify the best analytic models, as well as combining multiple predictions for the most accurate results.” You can read more about SPSS’s Clementine technology here. SPSS acquired LexiQuest, an early player in rich content processing, in 2002. SPSS has integrated its own text mining technology with the LexiQuest technology. SAS followed suit but licensed Inxight Software technology and combined that with SAS’s home-grown content processing tools.
There’s growing interest in analyzing call center, customer support, and Web log content for sentiment about people, places, and things. I will be watching for more announcements from other vendors. In the behind-the-firewall search and content processing sectors, there’s a strong tendency to do “me too” announcements. The challenge is to figure out which system does what. Figuring out the differences (often very modest) between and among different vendors’ solutions is a tough job.
Will 2008 be the year for sentiment analysis? We’ll know in a few months if SPSS competitors jump on this band wagon.
Stephen E. Arnold, January 20, 2008.
Map Reduce: The Great Database Controversy
January 18, 2008
I read with interest the article “Map Reduce: A Major Step Backwards” by David DeWitt. The article appeared in “The Database Column” on January 17, 2008. I agree that Map Reduce is not a database, not a commercial alternative for products like IBM’s DB2 or any other relational database, and definitely not the greatest thing since sliced bread.
Map Reduce is one of the innovations that seems to have come from the top-notch engineers Google hired from AltaVista.com. Hewlett Packard orphaned an interesting search system because it was expensive to scale in the midst of the Compaq acquisition. Search, to Hewlett Packard’s senior management, was expensive, generated no revenue, and led to a commercial dead end. But in terms of Web search, AltaVista.com was quite important because it allowed its engineers to come face to face with the potential and challenges of multi-core processors, new issues in memory management, and programming challenges for distributed, parallel systems. Google surfed on AltaVista.com’s learnings. Hewlett Packard missed the wave.
So, Map Reduce was an early Google innovation, and my research suggests it was influenced by technology that was well known among database experts. In The Google Legacy: How Search Became the Next Application Platform (Infonortics, Ltd. 2005), I tried to explain in layman’s terms how Map Reduce bundled and optimized two Lisp functions. The engineering wizardry of Google was making these two functions operate at scale and quickly. The engineering tricks were clever, but not like Albert Einstein sitting in a patent office thinking about relativity. Google’s “invention” of Map Reduce was driven by necessity. Traditional ways to match queries with results were slow, not flawed, just turtle-like. Google needed really fast heavy lifting. The choke points that plague some query processing systems had to be removed in an economical, reliable way. Every engineering decision involves trade-offs. Google sacrificed some of the sacred cows protected by certain vendors in order to get speed and rock-bottom computational costs. (Note: I did not update my Map Reduce information in my newer Google study, Google Version 2.0 (Infonortics, Ltd. 2007). There have been a number of extensions to Map Reduce in the last three years. A search for the term MapReduce on Google will yield a treasure trove of information about this function, its libraries, its upside, and its downside.)
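The “two Lisp functions” point can be seen in miniature with an ordinary word count. The sketch below is a single-process illustration of the map, shuffle, and reduce steps only; Google’s contribution was making these steps run reliably across thousands of commodity machines, and none of that engineering appears here.

from collections import defaultdict
from functools import reduce

documents = ["the cat sat", "the dog sat", "the cat ran"]

# Map step: emit (word, 1) pairs from each document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle step: group the emitted pairs by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce step: fold each group's counts into a total.
totals = {word: reduce(lambda a, b: a + b, counts) for word, counts in grouped.items()}
print(totals)   # {'the': 3, 'cat': 2, 'sat': 2, 'dog': 1, 'ran': 1}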
I am constantly surprised at the amount of technical information Google makes available as Google Papers. Its public relations professionals and lawyers aren’t the best communicators. I have found Google’s engineers to be remarkably communicative in technical papers and at conferences. For example, Google engineers rely on MySQL and other tools (think Oracle) to perform some data processes. Obviously Map Reduce is only one cog in the larger Google “machine.” Those of you who have followed my work about Google’s technology know that I refer to the three dozen server farms, the software, and the infrastructure as The Googleplex. Google uses this term to refer to a building, but I think it is a useful way to describe the infrastructure Google has been constructing for the last decade. Keep in mind that Map Reduce–no matter how good, derivative, or clever–is a single component in this digital matryoshka.
My analyses of Map Reduce suggest that Google’s engineers obsess about scale, not breakthrough invention. I was surprised to learn that much of Google’s technology is available to anyone; for example, Hadoop. Some of Google’s technology comes from standard collections of algorithms like Numerical Recipes with Source Code CD-ROM 3rd Edition: The Art of Scientific Computing. Other bits and pieces are based on concepts that have been tested in various university computer science labs supported by U.S. government funds. And, there’s open source code kept intact but “wrapped” in a Google technical DNA for scale and speed. Remember that Google grinds through upwards of four score petabytes of data every 24 hours. What my work tells me is that Google takes well-known engineering procedures and makes them work at Google scale on Google’s infrastructure.
Google has told two of its “partners,” if my sources are correct, that the company does not have a commercial database now, nor does it plan to make a commercial database like IBM’s, Microsoft’s, or Oracle’s available. Google and most people involved in manipulating large-scale data know that traditional databases can handle almost unlimited amounts of information. But it’s hard, expensive, and tricky work. The problem is not the vendors. The problem is that Codd databases or relational database management systems (RDBMS) were not engineered to handle the data management and manipulation tasks at Google scale. Today, many Web sites and organizations face an information technology challenge because big data in some cases brings systems to their knees, exhausts engineers, and drains budgets in short order.
Google’s publicly-disclosed research and its acquisitions make it clear that Google wants to break free of the boundaries, costs, reliability, and performance issues of RDBMS. In my forthcoming study Beyond Search, I devote a chapter to one of Google’s most interesting engineering initiatives for the post-database era. For the data mavens among my readers, I include pointers to some of Google’s public disclosures about their approach to solving the problems of the RDBMS. Google’s work, based on the information I have been able to gather from open sources, is also not new. Like Map Reduce, the concepts have been kicking around in classes taught at the University of Illinois, the University of Wisconsin – Madison, University of California – Berkeley, and the University of Washington, among others, for about 15 years.
If Google is going to deal with its own big data challenges, it has to wrap Map Reduce and other Google innovations in a different data management framework. Map Reduce will remain for the foreseeable future one piece of a very large technology mosaic. When archeologists unearth a Roman mosaic, considerable effort is needed to reveal the entire image. Looking at a single part of the large mosaic tells us little about the overall creation. Google is like that Roman mosaic. Focusing on a single Google innovation such as Chubby, Sawzall, containers (not the XML variety), the Programmable Search Engine, the “I’m feeling doubly lucky” invention, or any one of hundreds of Google’s publicly disclosed innovations yields a distorted view.
In summary, Map Reduce is not a database. It is no more of a database than Amazon’s SimpleDB service is. It can perform some database-like functions, but it is not a database. Many in the database elite know that the “next big thing” in databases may burst upon the scene with little fanfare. In the last seven years, Map Reduce has matured and become a much more versatile animal. Map Reduce can perform tasks its original designers did not envision. What I find delightful about Google’s technology is that it often does one thing well like Map Reduce. But when mixed with other Google innovations, unanticipated functionality comes to light. I believe Google often solves one problem and then another Googler figures out another use for that engineering process.
Google, of course, refuses to comment on my analyses. I have no affiliation with the company. But I find its approach to solving some of the well-known problems associated with big data interesting. Some Google watchers may find it more useful to ask the question, “What is Google doing to resolve the data management challenges associated with crunching petabytes of information quickly?” That’s the question I want to try and answer.
Stephen E. Arnold, January 18, 2008
Autonomy: Marketing Chess
January 18, 2008
The Microsoft – Fast deal may not have the impact of the Oracle – BEA Systems deal, but dollar for dollar, search marketers will be working overtime. The specter of what Microsoft might do begs for prompt action. The “good offense is the best defense” approach seems to be at work at Autonomy plc, arguably one of the world’s leading vendors of behind-the-firewall search and various applications that leverage Autonomy’s IDOL (integrated data operating layer) and its rocket-science mathematics.
The Microsoft – Fast Search & Transfer deal has flipped Autonomy’s ignition switch. On January 14, 2008, CBROnline reported that Autonomy’s Integrated Data Operating Layer gets a Power Pack for Microsoft Vista. IDOL can now natively process more than 1000 file formats. Autonomy also added support for additional third-party content feeds.
The goal of the enhancements is to make it easier for a Microsoft-centric organization to make use of the entity extraction, classification, categorization and conceptual search capabilities. Autonomy’s tailoring of IDOL to Microsoft Windows began more than 18 months ago, possibly earlier. Microsoft SharePoint installations now have more than 65 million users. Despite some grousing about security and sluggish performance, Microsoft’s enterprise initiatives are generating revenue. The Dot Net framework keeps getting better. Companies that offer a “platform alternative” face a hard fact — Microsoft is a platform powered by a company with serious marketing and sales nitromethane. No senior manager worth his salary and bonus can ignore a market of such magnitude. Now, with the acquisition of Fast Search & Transfer, Autonomy is faced with the potential threat of direct and indirect activity by Microsoft to prevent third-party vendors like Autonomy from capturing customers. Microsoft wants the revenue, and it wants to keep other vendors’ noses out of its customers’ tents.
Autonomy has never shown reluctance for innovative, aggressive, and opportunistic marketing. (Remember the catchy “Portal in a Box” campaign?) It makes a great deal of business sense for Autonomy to inject steroids into its Vista product. I expect Autonomy to continue to enhance its support for Microsoft environments on a continuous basis. To do less would boost Microsoft’s confidence in its ability to alter a market with an acquisition. I call this “money rattling.” The noise of the action scares off the opposition.
Other search vendors will also keep a sharp eye on Microsoft and its SharePoint customers. Among the companies offering a snap-in search or content processing solution are Coveo, dtSearch, Exalead, and ISYS Search Software, among others. It’s difficult for me to predict with accuracy how these companies might respond to Autonomy’s sharp functional escalation of IDOL in particular and the Microsoft – Fast tie-up in general. I think that Microsoft will want to keep third-party vendors out of the SharePoint search business. Microsoft wants a homogeneous software environment, and, of course, more revenue from its customers. Let me think out loud, describing several hypothetical scenarios that Microsoft might explore:
- Microsoft reduces the license fee for Fast Search & Transfer’s SharePoint adaptor and Fast Search’s ESP (enterprise search platform). With Fast Search’s pre-sell-out license fees in the $200,000 range, a price shift would have significant impact upon Autonomy and other high-end search solutions. This is the price war option, and it could wreak havoc on the fragile finances of some behind-the-firewall search system vendors.
- Microsoft leaves the list price of Fast Search unchanged but begins bundling ESP with other Microsoft applications. The cost for an enterprise search solution is rolled into a larger sale for Microsoft’s customer relationship management system or a shift from either IBM DB2 or Oracle’s database to enterprise SQL Server. Microsoft makes high-end search a functional component of a larger, enterprise-wide, higher value solution. This is the bundled feature option, and it makes a great deal of sense to a chief financial officer because one fee delivers the functionality without the additional administrative and operational costs of another enterprise system.
- Microsoft makes changes to its framework(s), requiring Microsoft Certified Partners to modify their systems to keep their certification. Increasing the speed of incremental changes could place a greater technical and support burden on some Certified Partners developing and marketing replacements for Microsoft search solutions for SharePoint. I call this Microsoft’s fast-cycle technical whipsaw option. Some vendors won’t be able to tolerate the work needed to keep their search application certified, stable, and in step with the framework.
- Microsoft does nothing different, allowing Fast Search and its competitors to raise their stress levels and (maybe) make a misstep implementing an aggressive response to … maybe no substantive action by Microsoft. I think of this as the Zen of uncertainty option. Competitors don’t know what Microsoft will or will not do. Some competitors feel compelled to prepare for some Microsoft action. These companies burn resources to get some type of insurance against an unknown future Microsoft action.
Microsoft’s actions with regard to Fast Search will, over time, have an impact on the vendors serving the SharePoint market. I don’t think that the Microsoft – Fast deal will make a substantive change in search and content processing. The reason is that most vendors are competing with substantially similar technologies. Most solutions are similar to one another. And, in my opinion, some of Fast Search’s technology is starting to become weighted down with its heterogeneous mix of original, open source, and acquired technology.
I believe that when a leap-frogging, game-changing technology becomes available, most vendors — including Autonomy, IBM, Microsoft, Oracle, and SAP, among others — will be blindsided. In today’s market, it’s received wisdom to make modest incremental changes and toot the marketing tuba. For the last five or six years, minor innovations have been positioned as revolutionary in behind-the-firewall search. I think that much of the innovation in search has been to handle sales and marketing in a more professional way. The technology has been adequate in most cases. My work suggests that most users of today’s behind-the-firewall search systems are not happy with their information access tools — regardless of vendor. Furthermore, in terms of precision and recall, there’s not been much improvement in the last few years. Most systems deliver 75 to 80 percent precision and recall upon installation. After tuning, 85 percent scores are possible. Good, but not a home run, I assert.
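For readers who want to see what those percentages measure, here is a small illustration of precision and recall in Python. The document sets are made up for the example; they do not come from any vendor’s system.

# Illustrative precision and recall calculation with made-up document sets.
relevant = {"d1", "d2", "d3", "d4", "d5"}           # documents that actually answer the query
retrieved = {"d1", "d2", "d3", "d4", "d9", "d10"}   # documents the engine returned

true_positives = relevant & retrieved
precision = len(true_positives) / len(retrieved)    # 4 of 6 returned are relevant: 0.67
recall = len(true_positives) / len(relevant)        # 4 of 5 relevant were found: 0.80
print(f"precision={precision:.2f} recall={recall:.2f}")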
I applaud Autonomy for enhancing IDOL for Vista. I will watch Microsoft to see if the company adopts one or more of my hypothetical options. I am also on the lookout for a search breakthrough. When that comes along, I will be among the first to jettison the tools I now use for the next big thing. I wonder how many organizations will take a similar approach? I want to be on the crest of the wave, not swamped by quotidian tweaks, unable to embrace the “search” future when it arrives.
Stephen Arnold, January 18, 2008
Two Visions of the Future from the U.K.
January 17, 2008
Two different news items offered insights about the future of online. My focus is the limitations of keyword search. I downloaded both articles, I must admit, eager to see whether my research would be disproved or augmented.
Whitebread
The first report appeared on January 14, 2008, in the (London) Times online in a news story “White Bread for Young Minds, Says University Professor.” In the intervening 72 hours, numerous comments appeared. The catch phrase is the coinage of Tara Brabazon, professor of Media Studies at the University of Brighton. She allegedly prohibits her students from using Google for research. The metaphor connotes in a memorable way a statement attributed to her in the Times’s article: “Google is filling, but it does not necessarily offer nutritional content.”
The argument strikes a chord with me because [a] I am a dinosaur, preferring warm thoughts about “the way it was” as the snow of time accretes on my shoulders; [b] schools are perceived to be in decline because it seems that some young people don’t read, ignore newspapers except for the sporty pictures that enliven gray pages of newsprint, and can’t do mathematics reliably at take-away shops; and [c] I respond to the charm of a “sky is falling” argument.
Ms. Brabazon’s argument is solid. Libraries seem to be morphing into Starbuck’s with more free media on offer. Google–the icon of “I’m feeling lucky” research–allows almost anyone to locate information on a topic regardless of its obscurity or commonness. I find myself flipping my dinosaurian tail out of the way to get the telephone number of the local tire shop, check the weather instead of looking out my window, and converting worthless dollars into high-value pounds. Why remember? Google or Live.com or Yahoo are there to do the heavy lifting for me.
Educators are in the business of transmitting certain skills to students. When digital technology seeps into the process, the hegemony begins to erode, so the argument goes. Ms. Brabazon joins Neil Postman (Amusing Ourselves to Death: Public Discourse in the Age of Show Business, 1985) and more recently Andrew Keen (The Cult of the Amateur, 2007), among others, in documenting the emergence of what I call the “inattention economy.”
I don’t like the loss of what weird expertise I possessed that allowed me to get good grades the old-fashioned way, but it’s reality. The notion that Google is more than an online service is interesting. I have argued in my two Google studies that Google is indeed much more than a Web search system growing fat on advertisers’ money. My research reveals little about Google as a corrosive effect on a teacher’s ability to get students to do their work using a range of research tools. Who wouldn’t use an online service to locate a journal article or book? I remember how comfortable my little study nook was in the rat hole in which I lived as a student, then slogging through the Illinois winter, dealing with the Easter egg hunt in the library stuffed with physical books that were never shelved in sequence, and manually taking notes or feeding 10-cent coins into a foul-smelling photocopy machine that rarely produced a readable copy. Give me my laptop and a high-speed Internet connection. I’m a dinosaur, and I don’t want to go back to my research roots. I am confident that the professor who shaped my research style–Professor William Gillis, may he rest in peace–neither knew nor cared how I gathered my information, performed my analyses, and assembled the blather that whizzed me through university and graduate school.
If a dinosaur can figure out a better way, Tefloned along by Google, a savvy teen will too. Draw your own conclusions about the “whitebread” argument, but it does reinforce my research that suggests a powerful “pull” exists for search systems that work better, faster, and more intelligently than those today. Where there’s a market pull, there’s change. So, the notion of going back to the days of taking class notes on wax in wooden frames and wandering with a professor under the lemon trees is charming but irrelevant.
The Researcher of the Future
The British Library is a highly-regarded, venerable institution. Some of its managers have great confidence that their perception of online in general and Google in particular is informed, substantiated by facts, and well-considered. The Library’s Web site offers a summary of a new study called (and I’m not sure of the bibliographic niceties for this title): A Ciber [sic] Briefing Paper. Information Behaviour of the Researcher of the Future, 11 January 2008. My system’s spelling checker is flashing madly regarding the spelling of cyber as ciber, but I’m certainly not intellectually as sharp as the erudite folks at the British Library, living in rural Kentucky and working by the light of burning coal. You can download this 1.67 megabyte, 35-page document, Researcher of the Future.
The British Library’s Web site article identifies the key point of the study as “research-behaviour traits that are commonly associated with younger users — impatience in search and navigation, and zero tolerance for any delay in satisfying their information needs — are now becoming the norm for all age-groups, from younger pupils and undergraduates through to professors.” The British Library has learned that online is changing research habits. (As I noted in the first section of this essay, an old dinosaur like me figured out that doing research online faster, easier, and cheaper than playing “Find the Info” in my university’s library.)
My reading of this weirdly formatted document, which looks as if it was a PowerPoint presentation converted to a handout, identified several other important points. Let me share my reading of this unusual study’s findings with you:
- The study was a “virtual longitudinal study”. My take on this is that the researchers did the type of work identified as questionable in the “whitebread” argument summarized in the first section of this essay. If the British Library does “Googley research”, I posit that Ms. Brabazon and other defenders of the “right way” to do research have lost their battle. Score: 1 for Google-Live.com-Yahoo. Nil for Ms. Brabazon and the British Library.
- Libraries will be affected by the shift to online, virtualization, pervasive computing, and other impedimentia of the modern world for affluent people. Score 1 for Google-Live.com-Yahoo. Nil for Mr. Brabazon, nil for the British Library, nil for traditional libraries. I bet librarians reading this study will be really surprised to hear that traditional libraries have been affected by the online revolution.
- The Google generation is comprised of “expert searchers”. The reader learns that most people are lousy searchers. Companies developing new search systems are working overtime to create smarter search systems because most online users–forget about age, gentle reader–are really terrible searchers and researchers. The “fix” is computational intelligence in the search systems, not in the users. Score 1 more for Google-Live.com-Yahoo and any other search vendor. Nil for the British Library, nil for traditional education. Give Ms. Brabazon a bonus point because she reached her conclusion without spending money for the CIBER researchers to “validate” the change in learning behavior.
- The future is “a unified Web culture,” more digital content, eBooks, and the Semantic Web. The word unified stopped my ageing synapses. My research yielded data that suggest the emergence of monopolies in certain functions, and increasing fragmentation of information and markets. Unified is not a word I can apply to the online landscape. In my Bear Stearns report published in 2007 as Google’s Semantic Web: The Radical Change Coming to Search and the Profound Implications to Yahoo & Microsoft, I revealed that Google wants to become the Semantic Web.
Wrap Up
I look forward to heated debate about Google’s role in “whitebreading” youth. (Sounds similar to waterboarding, doesn’t it?) I also hunger for more reports from CIBER, the British Library, and folks a heck of lot smarter than I am. Nevertheless, my Beyond Search study will assert the following:
- Search has to get smarter. Most users aren’t progressing as rapidly as young information retrieval experts.
- The traditional ways of doing research, meeting people, even conversing are being altered as information flows course through thought and action.
- The future is going to be different from what big thinkers posit.
Traditional libraries will be buffeted by bits and bytes and Boards of Directors who favor quill pens and scratching on shards. Publishers want their old monopolies back. Universities want that darned trivium too. These are notions I support but recognize that the odds are indeed long.
Stephen E. Arnold, January 17, 2008
MSFT – FAST: Will It Make a Difference?
January 16, 2008
On January 15, I received a telephone call from one of the Seybold Group’s analysts. Little did I know that at the same time the call was taking place, Google’s Rodrigo Vaca, a Googler working in the Enterprise Division, posted “Make a Fast Switch to Google.”
The question posed to me by Seybold’s representative was: “Will Microsoft’s buying Fast Search & Transfer make a difference?” My answer, and I am summarizing, was: “No, certainly not in the short term. In fact, looking 12 to 18 months out, I don’t think the behind-the-firewall market will be significantly affected by this $1.2 billion buy out.”
After I made the statement, there was a longish pause as the caller thought about what I asserted. The follow up question was, “Why do you say that?” There are three reasons, and I want to highlight them because most of the coverage of the impending deal has been interesting but uninformed. Let me offer my analysis:
- The technology for behind-the-firewall search is stable. Most of the vendors offer systems that work reasonably well when properly configured, resourced, and maintained. In fact, if I were to demonstrate three different systems to you, gentle reader, you would be hard pressed to tell me which system was demonstrated and you would not be able to point out the strengths and weaknesses of properly deployed systems. Let me be clear. Very few vendors offer a search-and-retrieval solution significantly different from its competitors. Procurements get slowed because those on the procurement team have a difficult time differentiating among the sales pitches, the systems, and the deals offered by vendors. I’ve been doing search-related work for 35 years, and I get confused when I hear the latest briefing from a vendor.
- An organization with five or more enterprise search systems usually grandfathers an older system. Every system has its supporters, and it is a hassle to rip and replace an existing system and convert that system’s habitual users. Behind-the-firewall search, therefore, is often additive. An organization leaves well enough alone and uses its resources to deploy the new system. Ergo: large organizations have multiple search and retrieval systems. No wonder employees are dissatisfied with behind-the-firewall search. A person looking for information must search local machines, the content management system, the enterprise accounting system, and whatever search systems are running in departments, acquired companies, and in the information technology department. Think gradualism and accretion, not radical change in search and retrieval.
- The technical professionals at an organization have an investment of time in their incumbent systems. An Oracle data base administrator wants to work with Oracle products. The learning curve is reduced and the library of expertise in the DBA’s head is useful in troubleshooting Oracle-centric software and systems. The same holds true with SharePoint-certified engineers. An IT professional who has a stable Fast Search installation, a working DB2 data warehouse, an Autonomy search stub in a BEA Systems’ application server, a Google Search Appliance in marketing, and a Stratify eDiscovery system in the legal department doesn’t want to rock the boat. Therefore, the company’s own technical team stonewalls change.
I’m not sure how many people in the behind-the-firewall business thank their lucky stars for enterprise inertia. Radical change, particularly in search and retrieval, is an oxymoron. The Seybold interviewer was surprised that I was essentially saying, “Whoever sells the customer first has a leg up. An incumbent has the equivalent of a cereal brand with shelf space in the grocery store.”
Now, let’s shift to Mr. Vaca’s assertion that “confused and concerned customers” may want to license a Google Search Appliance in order to avoid the messiness (implied) with the purchase of Fast Search by Microsoft. The idea is one of those that seems very logical. A big company like Microsoft buys a company with 2,500 corporate customers. Microsoft is, well, Microsoft. Therefore, jump to Google. I don’t know Mr. Vaca, but I have an image of a good looking, earnest, and very smart graduate of a name brand university. (I graduated from a cow college on the prairie, so I am probably revealing my own inferiority with this conjured image of Mr. Vaca.)
The problem is that the local logic of Mr. Vaca is not the real-world logic of an organization with an investment in Fast Search & Transfer technology. Most information technology professionals want to live with something that is stable, good enough, and reasonably well understood. Google seems to have made a similar offer in November 2005 when the Autonomy purchase of Verity became known. Nothing changed in 2005, and nothing will change in 2008 in terms of defectors leaving Fast Search for the welcoming arms of Google.
To conclude: the market for behind-the-firewall search is not much different from the market for other enterprise software at this time. However, two things make the behind-the-firewall sector volatile. First, because of the similar performance of the systems now on offer, customers may well be willing to embrace a solution that is larger than information retrieval. A solution for information access and data management may be sufficiently different to allow an innovator to attack from above; that is, offer a meta-solution that today’s vendors can’t see coming and can’t duplicate. Google, for example, is capable of such a meta-attack. IBM is another firm able to leap frog a market.
Second, the pain of getting a behind-the-firewall search up and stable is significant. Remember: there is human effort, money, infrastructure, users, and the mind numbing costs of content transformation operating to prevent sudden changes of direction in the type of organization with which I am familiar.
Bottom line: the Microsoft – Fast deal is making headlines. The deal is fueling increased awareness of search, public relations, and investor frenzy. For the time being, the deal will not make a significant difference in the present landscape for behind-the-firewall search. Looking forward, the opportunity for an innovator is to break out of the search-and-retrieval circumvallation. Mergers can’t deliver the impact needed to change the market rules.
Stephen E. Arnold, January 16, 2008, 9 am
Google Responds to Jarg Allegation
January 15, 2008
Intranet Journal reported on January 14, 2008, that Google denies the Jarg allegation of patent infringement. I’m not an attorney, and claims about online processes are complex. You can read US5694593, “Distributed Computer Database System and Method” at the USPTO or Google Patents Service.
As I understand the issue, the Jarg patent covers technology that Jarg believes is used in Google’s “plumbing.” In The Google Legacy and in Google Version 2.0, I dig into some of the inner workings that allow Google to deliver the services that comprise what I call the Googleplex. Note: I borrowed this term from Google’s own jargon for its office complex in Mountain View, California.
If the Jarg allegation has merit, Google may be required to make adjustments, pay Jarg, or take some other action. I have read the Jarg patent, and I do see some parallels. In my reading of more than 250 Google patent applications and patents, the key theme is not the mechanics of operations. Most of Google’s inventions make use of technology long taught in college courses in computer science, software engineering, and mathematics.
What sets Google’s inventions apart are the engineering innovations that allow the company to operate at what I call “Google scale.” There are Google presentations, technical papers, and public comments that underscore the meaning of scale at Google. According to Googlers Jeff Dean and Sanjay Ghemawat, Google crunches upwards of 20 petabytes a day via 100,000 MapReduce jobs. A petabyte is 1,000 terabytes. What’s more interesting is that Google spawns hundreds of sub-processes for each query it receives. The millisecond response time is possible because Google has done a very good job of taking what has been available as standard procedures, blending in some ideas from the scientists doing research at universities, and applying advanced mathematics to make its application platform work.
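A quick back-of-the-envelope conversion of those publicly reported figures, using decimal units, gives a sense of the average job size; the per-job number is my own arithmetic, not a Google disclosure.

# Rough conversion of the figures cited above (decimal units).
petabytes_per_day = 20
jobs_per_day = 100_000

terabytes_per_day = petabytes_per_day * 1_000                     # 20,000 terabytes
gigabytes_per_job = terabytes_per_day * 1_000 / jobs_per_day      # about 200 gigabytes
print(f"Average of {gigabytes_per_job:.0f} GB of data per MapReduce job")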
Remember that search was an orphan when Google emerged from the Backrub test. Excite, Lycos, Microsoft, and Yahoo saw search as a no-brainer, a dead end of sorts when compared to the notion of a portal. University research chugged along with technology transfer programs at major institutions encouraging experimentation, commercialization, and patent applications.
What makes the Jarg allegation interesting is that most universities and their researchers tap U.S. government funds. Somewhere in the labs at Syracuse University, Stanford University, the University of California at Los Angeles, or the University of Illinois there’s government-funded activity underway. In my experience, when government money sprays over a technology, there is a possibility that the research must comply with government guidelines for any invention that evolves from these dollops of money.
When I read the original Google PageRank patent application US6285999, Method of Node Ranking in a Linked Database (September 4, 2001) I was surprised at one fact in this remarkable blend of engineering and plain old voting. That fact was that the assignee of the invention was not Mr. Page. The assignee was The Board of Trustees of the Leland Stanford Junior University. The provisional patent application was filed on January 10, 1997, and I — almost eight years later — just realized that the work was performed under a U.S. government grant.
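The “plain old voting” at the heart of the patent can be sketched with a few lines of textbook power iteration. This is a classroom simplification on a three-page toy graph, not Google’s production ranking system, and the damping value is the commonly cited 0.85.

# Textbook power-iteration sketch of the PageRank "voting" idea on a toy graph.
links = {          # page -> pages it links to
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
damping = 0.85
rank = {page: 1 / len(links) for page in links}

for _ in range(50):    # iterate until the scores settle
    new_rank = {}
    for page in links:
        inbound = sum(rank[p] / len(outs) for p, outs in links.items() if page in outs)
        new_rank[page] = (1 - damping) / len(links) + damping * inbound
    rank = new_rank

print({page: round(score, 3) for page, score in rank.items()})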
I will be interested in the trajectory of the Jarg allegation. I wonder if any of the work disclosed in the Jarg patent has an interesting family tree. I am also curious whether the various data management practices, generally well-known in the parallel computing niche, have been widely disseminated by professors teaching their students the basics and illuminating those lectures with real-life examples from the research work conducted in labs in colleges and universities in the U.S.
Litigation in my experience as an expert witness is a tedious, intellectually-demanding process. Engineering does not map point for point to the law. When the U.S. government began explicitly encouraging recipients of its funds to make an effort to commercialize their inventions, the technology transfer business got a jolt of adrenaline. Patent applications and patents on novel approaches from government-funded research contribute to the flood of patent work choking the desks of USPTO professionals. Figuring out what’s going on in complex inventions and then determining which function is sufficiently novel to withstand the scrutiny of cadres of lawyers and their experts is expensive, time-consuming, and often maddeningly uncertain.
Not surprisingly, most litigation is settled out of court. Sometimes one party runs out of cash or the willingness to pay the cost of operating the litigation process. Think millions of dollars. Measure the effort in person years.
As the Intranet Journal story says: “Google has responded to the patent-infringement lawsuit filed against it by semantic search vendor Jarg and Northeastern University, denying the parties’ claims of patent infringement. Google has also filed a counterclaim, asking the court to dismiss the patent in question as invalid.”
Will this be the computer scientists’ version of the OJ Simpson trial? Stay tuned.
Stephen E. Arnold, January 15, 2008, Noon eastern