Intel: Cloud Factoid

August 4, 2008

I tracked down an Intel presentation from 2006 that was also used in 2007. The link is to ZDNet here. The presentation offers some interesting insights into Intel’s data center problems or opportunities in mid 2006; namely:

  • Intel has 136 of these puppies with an average cost pegged in the $100 million to $200 million range
  • Average idle capacity was about 200 million CPU hours with capacity at 900 million CPU hours, give or take a few hundred thousand hours
  • In 2006, 62 percent of the 136 data centers were 10 years old or older.
  • Plans in 2006 were to move to eight strategic hub centers.

My initial reaction to this 2006 presentation was that Intel’s zippy new chips might find a place in Intel’s own data centers. It would be interesting to calculate the cost of power across the old data centers with the aging chips versus the newer “green” chips. I expect that the money flying out the air conditioning duct is trivial to a giant like Intel.

More on this issue appeared in Data Center Knowledge in 2007 here. In 2007, according to Data Center Knowledge, Google had about 93,000 servers in its data centers.

In April 2008, Travis Broughton, Intel, wrote here:

Our cost-cutting measures tend to be related to at least two of the three “R’s” – reducing what we consume, many times by reusing what we already have.

I’m not sure what this means in the context of the Cloud Two initiative, but I will keep poking around.

Stephen Arnold, August 4, 2008

Microsoft’s Browser Rank

July 26, 2008

I heard about Browser Rank a while ago. My take on the technology is a bit different from that of the experts, wizards, and pundits stressing the upside of the approach. To get the received “wisdom”, you will want to review these analyses of this technology:

  • Microsoft’s own summary of the technology here. The full paper is here. (Note: I have discovered that certain papers are no longer available from Microsoft.com; for example, the DNABlueprint document. Snag this document in a sprightly manner.)
  • Steve Shankland’s write up for CNet here. The diagram is a nice addition to the article.
  • Arnold Zafra’s description for Search Engine Journal here.

By the time you read this, there will be dozens of commentaries.

Here’s my take:

Microsoft has asserted that it has more than 20 billion pages in its index. However, indexing resources are tight, so Microsoft has been working to find ways to know exactly which pages to index and reindex without spidering the bulk of the Web pages each time. The answer is to let user behavior generate a short list of what must get indexed. The idea is to get maximum payoff from minimal indexing effort.

This is pretty standard practice. Most systems have a short list of “must index” links that are revisited frequently. There is a vast middle ground which gets pinged and updated on a cycle; for example, every 30 days. Then there are sites like the Railroad Retirement Board, which gets indexed on a relaxed schedule, which could be never.

Microsoft’s approach is to take a bunch of factors that can be snagged by monitoring user behavior and use these data to generate the index priority list. Dwell time is presented in the paper as radically new, but it isn’t. In fact, most of the features have been in use or tested by a number of search systems, including the now ancient system used by The Point (Top 5% of the Internet), which Chris Kitze, my son, and I crafted 15 years ago.

We too needed a way to know only the specific Web sites to index. Trying to index the entire Web was beyond our financial and technical resources. Our approach worked, and I think Microsoft’s approach worked. But keep in mind that “worked” means users looking for popular content will be well served. Users looking for more narrow content will be left to fend for themselves.
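The idea of turning user behavior into an index priority list can be shown in a few lines. This is a minimal sketch, not the method in Microsoft’s paper and not what we built for The Point. The `PageStats` record, the `index_priority` function, and the scoring formula (visits weighted by the logarithm of average dwell time) are my own assumptions for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class PageStats:
    url: str
    visits: int            # observed user visits (hypothetical signal)
    dwell_seconds: float   # total time users spent on the page


def behavior_score(page: PageStats) -> float:
    """Hypothetical score: visit count damped by average dwell time."""
    if page.visits == 0:
        return 0.0
    avg_dwell = page.dwell_seconds / page.visits
    # log1p keeps one very long visit from swamping the visit count.
    return page.visits * math.log1p(avg_dwell)


def index_priority(pages, top_n):
    """Return the top_n URLs to index first, best score first."""
    ranked = sorted(pages, key=behavior_score, reverse=True)
    return [page.url for page in ranked[:top_n]]
```

Pages nobody visits score zero and fall off the short list, which is exactly why users hunting for narrow content are left to fend for themselves.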

I applaud Microsoft’s team for bundling these factors to create a browser graph. The problem is that scale is going to make the difference in Web search, Web advertising, and Web content analytics. Big data returns more useful insights about who wants what under what situation. Context, therefore, not shortcuts to work around capacity limitations, is the next big thing.

Watch for the new IDC report authored by Sue Feldman and me on this topic. Keep in mind that this is my opinion. Let me know if you agree or disagree.

Stephen Arnold, July 26, 2008

MicroStrategy: TSA Swims through Data with PIMS

July 24, 2008

Government information technology makes me perspire. When a government news item renders in my news reader, I don’t pause. I want to make an exception. MicroStrategy is a very intriguing company. The fact that the firm has ramped its services to a law enforcement agency is interesting. MicroStrategy has been working with TSA since 2004. The deal signed in 2006 has saved TSA more than $100 million. The sentence that caught my attention was:

The TSA is a metrics-based organization… We [the TSA] use metrics every day to drive our decision making and quantify security effectiveness, operational efficiency and workforce management.

An example of this metrics focus is that since 2004, the TSA has used PIMS to run one million reports per year. TSA has about 12,000 users of the system. Each user prints about two reports a week. TSA is right in line with the Office of Management & Budget’s guidelines for managers to make decisions based on hard data, not hunches.

MicroStrategy, as you may know, popped in and out of the news in the 2000-2002 period. One of the items I recalled reading is here. Some former MicroStrategy professionals founded Clarabridge, a company focused on the overlap between business intelligence and content processing. You can find information about that company here.

I want to pay closer attention to MicroStrategy. Companies that can help Federal agencies save $100 million are the taxpayers’ best friends. I am interested in the MicroStrategy – Clarabridge alignment as well. Off to the library in the morning to find what I can find.

Stephen Arnold, July 24, 2008

Googzilla Swallows Telegraph Media Group

July 24, 2008

Traditional media has been my favorite whipping boy for a long time. The Telegraph Media Group may force me to rethink my critical view of companies who write stuff, print on dead trees, and employ folks at near starvation wages to get the messy artifacts to a declining readership. Silicon.com reported here that the publisher of The Daily Telegraph, Sunday Telegraph and Weekly Telegraph, as well as the telegraph.co.uk Web site will standardize on Google Apps–word processing, mail, collaboration. The whole shooting match.

My reading of the announcement suggests that TMG did the math and calculated that it could save a bundle. More significantly, TMG lets Google worry about software, presumably so the newspapers can worry about selling adverts. The most interesting statement in the Silicon.com write up is this remark attributed to one of TMG’s managers:

We see the levels of innovation happening in the consumer space…you can actually take advantage of within the enterprise space.

Microsoft, among other traditional software companies, is going to learn first hand how fissionable material goes critical. A few things happen, then a few more things, and then the game changes. Is Google Apps ready to go critical?

My view: yes.

Stephen Arnold, July 24, 2008

Microsoft’s Vietnam: Search

July 21, 2008

What an interesting idea. ZDNet posted a short item that caught my attention on this 95 degree Sunday in rural Kentucky. Larry Dignan’s “Microsoft’s Search Ambitions Are Its Vietnam” appeared on the ZDNet Web logs on July 18, 2008. I suggest you read the item here. Mr. Dignan has opened a new line of analysis about the Microsoft – Google face off.

The key point in the piece for me was:

The online services business lost $1.23 billion for the fiscal year ending June 30. I [Mr. Dignan] quipped that it’s no wonder that Microsoft is so hot for Yahoo. Something has to save this online business. And what’s startling about that figure is that Microsoft only lost $732 million in 2007. Microsoft’s online services business was actually profitable in 2005.

Mr. Dignan is spot on. One point warrants further comment, however. The cost of catching Google may be beyond Microsoft’s reach. Here’s why:

  • Google continues to press forward and Microsoft’s efforts to catch Google seem to be focused on where Google was in late 2006 or early 2007. A leap not a catch up is needed.
  • Time is working for Google and against Microsoft. Each month Google continues to increase its lead in Web search. At some point, Google will dominate the market, which means that the race may be over for Web queries and online advertising.
  • Google is “seeping” into the enterprise. Microsoft seems confident that its three big revenue producers will fund the online battle with Google. I think the complexity of products like SharePoint will open the door to newcomers, not necessarily Google, by the way. Any revenue loss increases Microsoft’s vulnerability.

Agree? Disagree? Let me know your thoughts.

Stephen Arnold, July 21, 2008

Microsoft: 1999 to 2008

July 14, 2008

I have written one short post and two longer posts about Microsoft.com’s architecture for its online services. You can read each of these essays by clicking on the titles of the stories:

I want to urge each of my two or three Web log readers to validate my assertions. Not only am I an addled goose, I am an old goose. I make errors as young wizards delight in reminding me. On Friday, July 11, 2008, two of my engineers filled some gaps in my knowledge about X++, one of Microsoft’s less well-known programming languages.

the perils of complexity

The diagram shows how complexity increases when systems are designed to support solutions that do not simplify the design. Source: http://www.epmbook.com/complexity.gif

Stepping Back

As I reflected upon the information I reviewed pertaining to Microsoft.com’s online architecture, several thoughts bubbled to the surface of my consciousness:

First, I believe Microsoft’s new data centers and online architecture share DNA with those 1999 data centers. Microsoft is not embracing the systems and methods in use at Amazon, Google, and even the hapless Yahoo. Microsoft is using its own “dog food”. While this is commendable, the bottlenecks have not been fully resolved. Microsoft uses scale up and scale out to make systems keep pace with user expectations of response time. One engineer who works at a company competing with Microsoft told me: “Run a query on Live.com. The response times in many cases are faster than ours. The reason is that Microsoft caches everything. It works, but it is expensive.”

Second, Microsoft lacks a cohesive code base and a new one. With each upgrade, legacy code and baked in features and functions are dragged along. A good example is SQL Server. Although rewritten from the good old days with Sybase, SQL Server is not the right tool for peta-scale data manipulation chores. Alternatives exist and Amazon and Yahoo are using them. Microsoft is sticking with its RDBMS engine, and it is very expensive to replicate, cluster, back up with stand by hardware, and keep in sync. The performance challenge remains even though user experience seems as good if not better than the competition’s. In my opinion, the reliance on this particular “dog food” is akin to building a wooden power boat with unseasoned wood.

Third, in each of the essays, Microsoft’s own engineers emphasize the cost of the engineering approaches. There is no emphasis on slashing costs. The emphasis is on spending money to get the job done. In my opinion, spending money to solve problems via the scale up and scale out approach is okay as long as there are barrels of cash to throw at the problem. The better approach, in my opinion, is to engineer solutions that make scaling and performance as economical as possible and direct investment at finding ways to leap frog over the well-known, long-standing problems with the Codd database model, inefficient and latency inducing message passing, and dedicated hardware for specific functions and applications, then replicating these clusters. And, finally, using more hardware that is, in effect, sitting like an idle railroad car until needed. What happens when the money for these expensive approaches becomes less available?

The Economics of Dealing with Complex Information

May 24, 2008

Microsoft announced via its Live Search blog that its Live Search Books and Live Search Academic are “taken down”. Google’s book digitization and journal project caused concern to the commercial database vendors. Google, with its generous cash flow and avowed goal of indexing “all the world’s information” seemed to sign the death warrants of such companies as Dialog, Ebsco, and ProQuest, among others. A flap of the wings to Techmeme for its related links.

The economics of doing anything significant with complex information are not taught in the ivory towers at Harvard, Stanford, and Yale. Google–indifferent to the brutal economics that hobble commercial database publishers–has the cash to figure out how to use software to do tasks usually done by humans. For example, Google has figured out how to scan a book, have software determine what should be converted to ASCII, and generate a reasonably clean, searchable text file. The page images are mostly linked to the corresponding text references. Not so for most database producers. These decisions still require humans, often working in exotic locations where labor is less expensive than in Ann Arbor, Boston, and Denver.

Google also has figured out how to take content, apply structure to it, create a variety of additional index terms (metadata), and convert the whole shebang into easily manipulated numerical representations. Not so with the mainstream commercial database publishers. Tagging, cross referencing, and content clean up still take expensive humans.

Manipulating the information in books and journals is for commercial database producers very expensive. Many costs are difficult to reduce. Google, on the other hand, has invested over the last decade to find software solutions to these intractable cost problems. Fortunately for the commercial database publishers, Google so far has been content to process books and journals. Google finds access to weighty tomes useful for a variety of purposes. I haven’t heard that these motive forces are related to revenue. Google appears to be casual about the cost of its books and journals project. If you aren’t familiar with Google Books, navigate to http://books.google.com. For Google Scholar, go to http://scholar.google.com.

Enter Microsoft. The company jumped to index books and journals. Now it is climbing out of the swamp of costs. Unlike Google, Microsoft faces–maybe for the first time in the company’s history–a need to focus its technical and financial resources. Google keeps on scanning and indexing documents about hyperbolic geometry. Microsoft can’t and no longer will.

For me the most telling statement in the announcement is:

Given the evolution of the Web and our strategy, we believe the next generation of search is about the development of an underlying, sustainable business model for the search engine, consumer, and content partner. For example, this past Wednesday we announced our strategy to focus on verticals with high commercial intent, such as travel, and offer users cash back on their purchases from our advertisers. With Live Search Books and Live Search Academic, we digitized 750,000 books and indexed 80 million journal articles. Based on our experience, we foresee that the best way for a search engine to make book content available will be by crawling content repositories created by book publishers and libraries. With our investments, the technology to create these repositories is now available at lower costs for those with the commercial interest or public mandate to digitize book content. We will continue to track the evolution of the industry and evaluate future opportunities.

Here’s how I read this. First, the reference to next-generation search is about making money with a business model. In short, next-generation search is not about moving beyond traditional metadata, pushing into data management, and creating new types of user experiences. Search at Microsoft means money.

Second, Microsoft wants to index what’s available. That’s certainly less costly than fiddling with the train schedules that Google has indexed at Oxford University. In my experience, indexing what is already available begs for applications that move beyond what I can do at my local library or with a search engine such as Exalead.com or metasearch system such as Vivisimo’s Clusty.com.

Third, the notion of tracking and looking for future opportunities does not convince me that Microsoft knows what it will do tomorrow. And whatever the company does, by definition, will be reactive.

Microsoft’s termination of this service means that the status quo in the commercial database world will be subject to pressure from Google. More troubling is that Google’s technical papers and its patent documents reveal that the company is moving beyond key word search at an increasing pace. I think that it is significant that Microsoft is husbanding its resources. Now I want to read in a Microsoft Web log about an innovation path that will permit the company to leap frog over Google. Send me a link to this information, and you will receive a gentle quack.

Stephen Arnold, May 24, 2008

Search: The Three Curves of Despair

March 27, 2008

For my 2005 seminar series “Search: How to Deliver Useful Results within Budget”, I created a series of three line charts. One of the well-kept secrets about behind-the-firewall search is that costs are difficult, if not impossible, to control. That presentation is not available on my Web site archive, and I’m not sure I have a copy of the PowerPoint deck at hand. I did locate the Excel sheet for the chart which appears below. I thought it might be useful to discuss the data briefly and admittedly in an incomplete way. (I sell information for a living, so I instinctively hold some back to keep the wolves from my log cabin’s door here in rural Kentucky.)

Let me be direct: Well-dressed MBAs and sallow financial mavens simply don’t believe my search cost data.

At my age, I’m used to this type of uninformed skepticism or derisory denial. The information technology professionals attending my lectures usually smirk the way I once did as a callow nerd. Their reaction is understandable. And I support myself by my wits. When these superstars lose their jobs, my flabby self is unscathed. My children are grown. The domicile is safe from creditors. I’m offering information, not re-jigging inflated egos.

Now scan these three curves.

[Chart: the three search curves]

© Stephen E. Arnold, 2002-2008.

You see a gray line. That is the precision / recall curve. This refers to a specific method of determining if a query returns results germane to the user’s query and another method for figuring out how much germane information the search system missed. Search and a categorical affirmative such as “all” do not make happy bedfellows. Most folks don’t know what a search system does not include. Could that be one reason why the “curves of despair” evoke snickers of disbelief?
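For readers who have not run the numbers, here is a minimal sketch of the two standard measures behind that gray line. Precision is the share of returned results that are germane; recall is the share of germane documents the system actually returned. The function name and the document identifiers are made up for illustration, and the sketch assumes someone has already judged which documents are relevant, which is the expensive part in practice.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document ids the search system returned
    relevant:  document ids a human judged germane to the query
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set

    # Guard against empty sets so the division is always defined.
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall
```

With four results returned, two of them germane, and five germane documents in the collection, precision is 0.5 and recall is 0.4. The three missed documents are the ones the user never learns about.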

Indexing Hot Spots

February 29, 2008

Introduction

This is the third in a series of cost hot spots in behind-the-firewall search. This essay does not duplicate the information in Beyond Search, my new study for the Gilbane Group. This document is designed to highlight several functions or operations in an indexing subsystem that can cause system slow downs or bottlenecks. No specific vendors’ systems are referenced in this essay. I see no value in finger pointing because no indexing subsystem is without potential for performance degradation in a real world installation. – Stephen Arnold, February 29, 2008

Indexing: Often a Mysterious Series of Multiple Operations

One of the most misunderstood parts of a behind-the-firewall search system is indexing. The term indexing itself is the problem. For most people, an index is the key word listing that appears at the back of a book. For those hip to the ways of online, indexing means metatagging, usually in the form of a series of words or phrases assigned to a Web page or an element in a document; for example, an image and its associated text. The actual index in your search system may not be one data table. The index may be multiple tables or numeric values that “float” mysteriously within the larger search system. The “index” may not even be in one system. Parts of the index are in different places, updated in a series of processes that cannot be easily recreated after a crash, software glitch, or other corruption. This CartoonStock.com image makes clear the impact of a search system crash.

Centuries ago, people lucky enough to have a “book” learned that some sort of system was needed to find a scroll, stored in a leather or clay tube, sometimes chained to the wall to keep the source document from wandering off. In the so called Dark Ages, information was not free, nor did it flow freely. Information was something special and of high value. Today, we talk about information as a flood, a tidal wave, a problem. It is ubiquitous, without provenance, and digital. Information wants to be free, fluid, moving around, and unstable, dynamic.

For indexing to work, you must have a specific object at a point in time to process; otherwise, the index is useless. Also, the index must be “fresh”. Fresh means that the most recent information is in the system and therefore available to users. With lots of new and changed information, you have to determine how fresh is fresh enough. Real time data also provides a challenge. If your system can index 100 megabytes a minute and has to keep up with larger volumes of new and changed data, something’s got to give. You may have to prioritize what you index. You handle high-priority documents first, then shift to lower priority documents until new higher-priority documents arrive. This triage affects the freshness in the index, or you can throw more hardware at your system, thus increasing capital investment and operational cost.

Index freshness is important. A person in a professional setting cannot do “work” unless the digital information can be located. Once located, the information must be the “right” information. Freshness is important, but there are issues of versions of documents. These are indexing challenges and can require considerable intellectual effort to resolve. You have to get freshness right for a search system to be useful to your colleagues.

In general, the more involved your indexing, the more important is the architecture and engineering of the “moving parts” in your search system’s indexing subsystem.

Why is indexing a cost hot spot? Let’s look at some hot spots I have encountered in the last nine months.
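The triage described above, in which high-priority documents go first and lower priority documents wait, maps onto a priority queue. What follows is a minimal sketch under my own assumptions, not any vendor’s scheduler. The `IndexQueue` class, the numeric priority scale (lower number means more urgent), and the batch size are all hypothetical.

```python
import heapq
import itertools


class IndexQueue:
    """Hand out documents for indexing, most urgent first."""

    def __init__(self):
        self._heap = []
        # The counter breaks ties so equal priorities stay first-in, first-out.
        self._counter = itertools.count()

    def add(self, doc_id, priority):
        """Queue a document; a lower priority number means more urgent."""
        heapq.heappush(self._heap, (priority, next(self._counter), doc_id))

    def next_batch(self, size):
        """Remove and return up to `size` of the most urgent documents."""
        batch = []
        while self._heap and len(batch) < size:
            _, _, doc_id = heapq.heappop(self._heap)
            batch.append(doc_id)
        return batch
```

If an urgent document arrives between batches, it jumps ahead of everything still waiting. The low priority material goes stale in the meantime, which is the freshness cost of triage.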

Remediating Indiscriminate Indexing

When you deploy your behind-the-firewall search or content processing system, you have to tell your system how to process the content. You can operate an advanced system in default mode, but you may want to select certain features, level of stringency, and make sure that you are familiar with the various controls available to you. Investing time prior to deployment in testing may be useful when troubleshooting. The first cost hot spot is encountering disc thrashing or long indexing times. You come in one morning, check the logs, and learn no content was processed. In Beyond Search I talk about some steps you can take to troubleshoot this condition. If you can’t remediate the situation by rebooting the indexing subsystem, then you will have to work through the vendor’s technical support group, restore the system to a known good state, or – in some cases – reinstall the system. When you reinstall, some systems cannot use the back up index files. If you find that your back ups won’t work or deliver erratic results on test queries, then you may have to rebuild the index. In a small two person business, the time and cost are trivial. In an organization with hundreds of servers, the process can consume significant resources.

Updating the Index or Indexes

Your search or content processing system allows you to specify how frequently the index updates. When your system has robust resources, you can specify indexing to occur as soon as content becomes available. Some vendors talk about their systems as “real time” indexing engines. If you find that your indexing engine starts to slow down, you may have encountered a “big document” problem. Indexing systems make short work of HTML pages, short PDFs, and emails. But when document size grows, the indexing subsystem needs more “time” to process long documents. I have encountered situations in which a Word document includes objects that are large. The Word document requires the indexing subsystem to grind away on this monster file. If you hit a patch characterized by a large number of big documents, the indexing subsystem will appear to be busy but indexing subsystem outputs fall sharply.

Let’s assume you build your roll out index based on a thorough document analysis. You have verified security and access controls so the “right” people see the information to which they have appropriate access. You know that the majority of the documents your system processes are in the 600 kilobyte range over the first three months of indexing subsystem operation. Suddenly the document size leaps to six megabytes and the number of big documents becomes more than 20 percent of the document throughput. You may learn that the set up of your indexing subsystem or the resources available are hot spots.

Another situation concerns different versions of documents. Some search and content processing systems identify duplicates using date and time stamps. Other systems include algorithms to identify duplicate content and remove it or tag it so the duplicates may or may not be displayed under certain conditions. A surge in duplicates may occur when an organization is preparing for a trade show. Emails with different versions of a PowerPoint may proliferate rapidly. Obviously indexing every six megabyte PowerPoint makes sense if each PowerPoint is different. How your indexing subsystem handles duplicates is important. A hot spot occurs when a surge in the number of files with the same name and different date and time stamps is fed into the indexing system. The hot spot may be remediated by identifying the problem files and deleting them manually or via your system’s administrative controls. Versions of documents can become an issue under certain circumstances such as a legal matter. Unexpected indexing subsystem behavior may be related to a duplicate file situation.

Depending on your system, you will have some fiddling to do in order to handle different versions of documents in a way that makes sense to your users. You also have to set up a de-duplication process in order to make it easy for your users to find the specific version of the document needed to perform a work task. These administrative interventions are not difficult when you know where to look for the problem. If you are not able to pinpoint a specific problem, the hunt for the hot spot can become time consuming.
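To make the difference between time stamp checks and content checks concrete, here is a minimal sketch of exact-duplicate detection by hashing file contents. It is an illustration only: the `group_duplicates` function and the file names are made up, and SHA-256 is my choice of hash. It catches byte-for-byte copies, not near duplicates such as a deck with one changed slide, which need more elaborate algorithms.

```python
import hashlib
from collections import defaultdict


def group_duplicates(files):
    """Group file names whose contents are byte-for-byte identical.

    files: iterable of (name, content_bytes) pairs
    Returns a list of name groups, one group per duplicated content.
    """
    by_digest = defaultdict(list)
    for name, content in files:
        # Identical bytes give identical digests, whatever the file name
        # or date and time stamp says.
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append(name)
    return [names for names in by_digest.values() if len(names) > 1]
```

Running a check like this before a suspect batch reaches the indexing subsystem is one way to spot a trade show surge, assuming you can read the files ahead of the indexer.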

Common Operations Become a Problem

Once an index has been constructed – a process often called indexation – incremental updates are generally trouble free. Notice that I said generally. Let’s look at some situations that can arise, albeit infrequently.

Index Rebuild

You have a crash. The restore operation fails. You have to reindex the content. Why is this expensive? You have to plan reindexing and then baby sit the update. For reindexing you will need the resources required when you performed the first indexation of your content. In addition, you have to work through the normal verifications for access, duplicates, and content processing each time you update. Whatever caused the index restore operation to fail must be remediated, a back up created when reindexing is completed, and then a test run to make sure the new back up restores correctly.

Indexing New or Changed Content

Let’s assume that you have a system, and you have been performing incremental indexes for six months with no observable problems and no red flags from users. Users with no prior history of complaining about the search system complain that certain new documents are not in the system. Depending on your search system’s configuration, you may have a hot spot in the incremental indexing update process. The cause may be related to volume, configuration, or an unexpected software glitch. You need to identify the problem and figure out a fix. Some systems maintain separate indexes based on a maximum index size. When the index grows beyond a certain size, the system creates or allows the system administrator to create a second index. Parallelization makes it possible to query index components with no appreciable increase in system response time. A hot spot can result when a configuration error causes an index to exceed its maximum size, halting the system or corrupting the index itself, although other symptoms may be observable. Again – the key to resolving this hot spot is often configuration and infrastructure.

Value-Added Content Processing

New search and content processing systems incorporate more sophisticated procedures, systems, and methods than systems did a few years ago. Fortunately faster processors, 64-bit chips, and plummeting prices for memory and storage devices allow indexing systems to pile on the operations and maintain good indexing throughput, easily several megabytes a minute to five gigabytes of content per hour or more.

If you experience slow downs in index updating, you face some stark choices when you saturate your machine capacity or storage. In my experience, these are:

  • Reduce the number of documents processed
  • Expand the indexing infrastructure; that is, throw hardware at the problem
  • Turn off certain resource intensive indexing operations; in effect, eliminating some of the processes that use statistical, linguistic, or syntactic functions.
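Before picking one of these choices, it helps to know how far behind the indexing subsystem is falling. The sketch below is simple arithmetic under my own assumptions: a constant arrival rate, a constant indexing capacity, and a function name I made up. Real workloads are burstier than this.

```python
def backlog_after(hours, arrival_mb_per_hour, capacity_mb_per_hour,
                  start_backlog_mb=0.0):
    """Estimate the unindexed backlog in megabytes after `hours` hours.

    Each hour new content arrives and the indexer works off as much
    as its capacity allows. The backlog never drops below zero.
    """
    backlog = start_backlog_mb
    for _ in range(hours):
        backlog = max(0.0, backlog + arrival_mb_per_hour - capacity_mb_per_hour)
    return backlog
```

A system that indexes 100 megabytes a minute has a capacity of 6,000 megabytes an hour. If content arrives at 7,500 megabytes an hour, the backlog grows by 1,500 megabytes every hour and reaches 12,000 megabytes after an eight hour day. That growing number is what the three choices above are meant to stop.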

One of the questions that comes up frequently is, “Why are value-added processing systems more prone to slow downs?” The answer is that when the number of documents processed goes up or the size of documents rises, the infrastructure cannot handle the load. Indexing subsystems require constant monitoring and routine hardware upgrades.

Iterative systems cycle through processes two or more times. Some iterative functions are dependent on other processes; for example, until the linguistic processes complete, another component – for example, entity extraction – cannot be completed. Many current indexing systems are parallelized. But situations can arise in which indexing slows to a crawl because a software glitch fails to keep the internal pipelines flowing smoothly. If process A slows down, the lack of available data to process means process B waits. Log analysis can be useful in resolving this hot spot.

Crashes: Still Occur

Many modern indexing systems can hiccup and corrupt an index. The way to fix a corrupt index is to have two systems. When one fails, the other system continues to function. But many organizations can’t afford tandem operation and hot failovers. When an index corruption occurs, some organizations restore the index to a prior state. A gap may exist between the points in the back up and the index state at the time of the failure. Most systems can determine which content must be processed to “catch up”. Checking the rebuilt indexes is a useful step to take when a crash has taken place and the index restored and rebuilt. Keep in mind that back ups are not fool proof. Test your system’s back up and restore procedures to make sure you can survive a crash and have the system again operational.
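The “catch up” step after a restore boils down to finding everything that changed after the back up was taken. Here is a minimal sketch, assuming you can get a last-modified time for each document. The function name and the in-memory mapping are hypothetical; a real system would pull this from its own logs or the content repository.

```python
from datetime import datetime


def documents_to_reindex(last_modified, backup_time):
    """Return ids of documents changed after the back up was taken.

    last_modified: mapping of document id -> datetime of last change
    backup_time:   datetime at which the restored back up was created
    """
    return sorted(
        doc_id
        for doc_id, modified in last_modified.items()
        if modified > backup_time
    )
```

Anything created or edited after the back up time lands on the reindex list. Documents deleted in the gap are a separate problem this sketch does not handle.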

Wrap Up

Let’s step back. The hot spots for indexing fall into three categories. First, you have to have adequate infrastructure. Ideally your infrastructure will be engineered to permit pipelined functions to operate rapidly and without latency. Second, you will want to have specific throughput targets so you can handle new and changed content whether your vendor requires one index or multiple indexes. Third, you will want to understand how to recover from a failure and have procedures in place to restore an index or “roll back” to a known good state and then process content to ensure no content is lost.

In general, the more value-added content processing you use, the greater your potential for hot spots. Search used to be simpler from an operational point of view. Key word indexing is very straightforward compared to some of the advanced content processing systems in use today. The performance of any system fluctuates to some extent. As sophisticated as today’s systems are, there is room for innovation in system design, architecture, and administration of indexing subsystems. Keep in mind that more specific information appears in Beyond Search, due out in April 2008.

Stephen Arnold, February 29, 2008

Document Processing: Transformation Hot Spots

February 23, 2008

Let’s pick up the thread of sluggish behind-the-firewall search systems. I want to look at one hot spot in the sub system responsible for document processing. Recall that the crawler sub system finds or accepts information. The notion of “find” relates to a crawler or spider able to identify new or changed information. The spider copies the information back to the content processing sub system. For the purposes of our discussion, we will simplify spidering to the find-and-send back approach. The other way to get content to the search system is to push it. The idea is that a small program wakes up when new or changed content is placed in a specific location on a server. The script “pushes” the content — that is, copies the information — to a specific storage area on the content processing sub system. So, we’re dealing with pushing or pulling content. The diagram to which these comments refer is here.
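The “push” approach described above can be sketched as a small script that runs on a schedule. This is an assumption-laden illustration, not a description of any vendor’s connector. The function name `push_changed_files` is hypothetical, and the throwaway folders stand in for the server location and the content processing storage area.

```python
import os
import shutil
import tempfile


def push_changed_files(watch_dir, intake_dir, last_run):
    """Copy files modified after `last_run` (epoch seconds) to intake.

    Returns the names of the files that were pushed.
    """
    pushed = []
    for name in sorted(os.listdir(watch_dir)):
        src = os.path.join(watch_dir, name)
        if os.path.isfile(src) and os.path.getmtime(src) > last_run:
            shutil.copy2(src, os.path.join(intake_dir, name))
            pushed.append(name)
    return pushed


# Demo with temporary folders standing in for the real locations.
with tempfile.TemporaryDirectory() as watch, tempfile.TemporaryDirectory() as intake:
    old_path = os.path.join(watch, "old.doc")
    new_path = os.path.join(watch, "new.doc")
    for path in (old_path, new_path):
        with open(path, "w") as handle:
            handle.write("content")
    os.utime(old_path, (1000, 1000))  # pretend it changed long ago
    os.utime(new_path, (5000, 5000))  # pretend it changed recently
    pushed = push_changed_files(watch, intake, last_run=2000)
    copied = sorted(os.listdir(intake))
print(pushed)
```

Only the file changed after the last run is copied. A crawler doing the “pull” makes the same new-or-changed decision, just from the other end of the wire.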

Now what happens?

There are many possible functions a vendor can place in the document processing subsystem. I want to focus on one key function — content transformation. Content transformation takes a file — let’s say a PowerPoint — and creates a version of this document in an XML structure “known” to the search system. The idea is that a number of different file types are found in an organization. These can range from garden variety Word 2003 files to the more exotic XyWrite files still in use at certain US government agencies. (Yes, I know that’s hard to believe because you may not know what XyWrite is.)
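To make the idea concrete, here is a minimal sketch of the last step of a transformation: wrapping text that has already been extracted from a source file in an XML structure. The element names (`document`, `meta`, `body`, `p`) and the function `to_search_xml` are hypothetical. Each search system defines its own structure, and the hard part, getting clean text out of the native file, is not shown.

```python
import xml.etree.ElementTree as ET


def to_search_xml(filename, file_type, paragraphs):
    """Wrap already-extracted text in a simple XML structure.

    The element names are invented for illustration; a real search
    system defines its own schema.
    """
    doc = ET.Element("document")
    meta = ET.SubElement(doc, "meta")
    ET.SubElement(meta, "source").text = filename
    ET.SubElement(meta, "type").text = file_type
    body = ET.SubElement(doc, "body")
    for text in paragraphs:
        ET.SubElement(body, "p").text = text
    return ET.tostring(doc, encoding="unicode")


xml_out = to_search_xml(
    "q1-review.ppt",
    "PowerPoint",
    ["Q1 revenue & costs", "Next steps"],
)
print(xml_out)
```

Notice that the ampersand in the first paragraph comes out as `&amp;`. Escaping of that kind is exactly the sort of detail a hand-rolled filter tends to get wrong.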

Most search system vendors say, “We support more than 200 different file types.” That’s true. Over the years, scripts that convert a source file of one type into an output file of another type have been written. Years ago, there were independent firms doing business as Data Junction and Outside In. These two companies, along with dozens of others, have been acquired. A vendor can license these tools from their owners. Also, there are a number of open source conversion and transformation tools available from Source Forge, shareware repositories, and from freeware distributors. However, a number of search system vendors will assert, “We wrote our own filters.” This is usually a way to differentiate their transformation tools from a competitor’s. The reality is that most vendors use a combination of licensed tools, open source tools, and home-grown tools. The key point is the answer to two questions:

  1. How well do these filters or transformation routines work on the specific content you want to have the search system make searchable?
  2. How fast do these systems operate on the specific production machines you will use for content transformation?

The only way to answer these two questions with accuracy is to test the transformation throughput on your content and on the exact machines you will use in production. Any other approach will create a general throughput rate value that your production system may or may not be able to deliver. Isn’t it better to know what you can transform before you start processing content for real?
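A test harness for these two questions does not have to be elaborate. The sketch below is one way to do it, under stated assumptions: `measure_throughput` and `toy_filter` are hypothetical names, and `toy_filter` merely stands in for whatever transformation routine your vendor supplies. Run the real thing on your own content and on the production machines.

```python
import time


def measure_throughput(samples, transform):
    """Time `transform` over sample documents.

    `samples` maps a file name to its raw bytes. `transform` stands in
    for the vendor's filter and raises ValueError on files it rejects.
    """
    converted, rejected, bytes_done = 0, [], 0
    start = time.perf_counter()
    for name, data in samples.items():
        try:
            transform(name, data)
        except ValueError as exc:
            rejected.append((name, str(exc)))
            continue
        converted += 1
        bytes_done += len(data)
    elapsed = time.perf_counter() - start
    return {
        "converted": converted,
        "rejected": rejected,
        "bytes_converted": bytes_done,
        "docs_per_second": converted / elapsed if elapsed > 0 else float("inf"),
    }


def toy_filter(name, data):
    """Hypothetical filter: handles .txt only and rejects the rest."""
    if not name.endswith(".txt"):
        raise ValueError("unsupported file type")
    return data.decode("utf-8", errors="replace")


samples = {
    "memo.txt": b"Quarterly memo",
    "legacy.xyw": b"\x00\x01 XyWrite bytes",
    "notes.txt": b"Meeting notes",
}
report = measure_throughput(samples, toy_filter)
print(report["converted"], "converted;", len(report["rejected"]), "rejected")
```

The report answers question one with the rejected list and question two with the rate. Three toy files prove nothing; a sample that mirrors your real mix of file types and sizes does.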

I’ve just identified the two reasons for unexpected bottlenecks and, hence, poor document processing performance. First, you have content that the vendor’s filters cannot handle. When a document processing sub system can’t figure out how to transform a file, it writes the file name, date, time, size, and maybe an error code in the document processing log. If you have too many rejected files, you have to intervene, figure out the problem with the files, and then take remedial action. Remedial action may mean re keying the file or going through some manual process of converting the file from its native format to a neutral format like ASCII, doing manual touch up like adding sub heads or tags, and then putting the touched up file into the document processing queue. Talk about a bottleneck. In most organizations, there is neither money nor people to do this work. Fixing the content transformation problems can take days or weeks, or never be done at all. Not surprisingly, a system that can’t process the content cannot make that content available to the system users. This glitch is a trivial problem when you are first deploying a system because you don’t have much knowledge of what will be transformed and what won’t. Imagine the magnitude of the problem when a transformation problem is discovered after the system is up and running. You may find log files over writing themselves. You may find “out of space” messages in the folder used by the system to write files that can’t be transformed. You may find intermittent errors cascading back through the content acquisition system due to transformation issues. Have you looked at your document processing log files today?
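Looking at the log does not have to mean reading it line by line. Here is a minimal sketch that tallies rejected files by file type and by error code. The pipe-delimited layout (file name, date, time, size, error code) and the error codes themselves are assumptions made for illustration. Every vendor writes its own log format, so the parsing line will need to change.

```python
from collections import Counter
import os

# Hypothetical log excerpt: name|date|time|size|error code
SAMPLE_LOG = """\
/share/hr/policy.xyw|2008-02-22|09:14:03|48211|E_UNKNOWN_FORMAT
/share/sales/deck.ppt|2008-02-22|09:14:09|1048576|E_TIMEOUT
/share/legal/brief.xyw|2008-02-22|09:15:41|90233|E_UNKNOWN_FORMAT
/share/eng/spec.pdf|2008-02-22|09:16:02|734003|
"""


def summarize_rejects(log_text):
    """Count rejected files by extension and by error code."""
    by_type, by_error = Counter(), Counter()
    for line in log_text.splitlines():
        if not line.strip():
            continue
        name, _date, _time, _size, error = line.split("|")
        ext = os.path.splitext(name)[1].lower() or "(none)"
        by_type[ext] += 1
        by_error[error or "(no code)"] += 1
    return by_type, by_error


by_type, by_error = summarize_rejects(SAMPLE_LOG)
print(by_type.most_common())
print(by_error.most_common())
```

A tally like this tells you whether you have one problem file type to fix or a dozen, which is the first thing to know before anyone starts re keying documents.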

The second problem has to do with document processing hardware. In my experience exactly zero of the organizations with which I am familiar have run pre-deal tests on the exact hardware that will be used in production document processing. The exceptions are the organizations licensing appliances. The appliance vendors deliver hardware with a known capacity. Appliances, however, comprise less than 15 percent of the installed base of behind-the-firewall search systems. Most organizations’ information technology departments think that vendor estimates are good enough. Furthermore, most information technology groups believe that existing hardware and infrastructure are adequate for a search application. What happens? The system goes into operation and runs along until the volume of content to be processed exceeds available resources. When that happens, the document processing sub system slows to a crawl or hangs.

Performance Erosion

Document processing is not a set-it and forget-it sub system. Let’s look at why you need to invest the time engineering, testing, monitoring, and upgrading the document processing sub system. I know before I summarize the information from my files that few, if any, readers of this Web log will take these actions. I must admit indifference to the document processing sub system generates significant revenue for consultants, but so many hassles can be avoided by taking some simple preventive actions. Sigh.

Let’s look at the causes of performance erosion:

  1. The volume of content is increasing. Most organizations whose digital content production volume I have analyzed double their digital content every 12 months. This means that if one employee has five megabytes of content on her computer when you turn on the system, then 12 months after you start the search system you will have the original five megabytes in the index plus five new megabytes, for a total of 10 megabytes of content. No big deal, right? Storage is cheap. It is a big deal when you are working in an organization with constraints on storage, an inability to remove duplicate content from the index, and an indiscriminate content acquisition process. Some organizations can’t “plug in” new storage the way you can on a PC or Mac. Storage must be ordered, installed, and certified. In the meantime, what happens? The document processing system falls behind. Can it catch up? Maybe. Maybe not.
  2. The content is not new. Employees recycle, save different drafts of documents, and merge pieces of boiler plate text to create new documents. Again, if you work on one PowerPoint, you can index any PowerPoint. But when you have many PowerPoints each with minor changes and the email messages like “Take a look at this and send me your changes”, you can index the same content again and again. A results list is not just filled with irrelevant hits; the basic function of search and retrieval is broken. Does your search system return a results list of what look like the same document with different date, time, and size values? How do you determine which version of the document is the “best and final” one? What are the risks of using the incorrect version of a document? How much does your organization spend on figuring out which version of a document is the “one” the CEO really needs?
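The doubling in item 1 is easy to turn into arithmetic. The sketch below assumes smooth growth with a 12-month doubling period, which is a simplification, and the 500 and 2,000 gigabyte figures are invented for illustration. The function names are hypothetical.

```python
import math


def projected_volume(initial, months, doubling_months=12):
    """Content volume after `months`, doubling every `doubling_months`."""
    return initial * 2 ** (months / doubling_months)


def months_until_full(initial, capacity, doubling_months=12):
    """Months before content volume reaches storage capacity."""
    if initial >= capacity:
        return 0.0
    return doubling_months * math.log2(capacity / initial)


print(projected_volume(5, 12))        # the five megabyte example above
print(projected_volume(500, 36))      # 500 GB after three years
print(months_until_full(500, 2000))   # months before 2,000 GB fills up
```

With these numbers, storage sized at four times today’s content lasts two years. If ordering, installing, and certifying new storage takes months, that planning has to start well before the disks fill.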

As you wrestle with these questions, recall that you are shoving more content through a system which unless constantly upgraded will slow to a crawl. You have set the stage for thrashing. The available resources are being consumed processing the same information again and again, not processing the meaningful documents one time and again only when a significant change is made. Ah, you don’t know what documents are meaningful? You are now like the snake eating its tail. Because you don’t have an editorial policy or content acquisition procedures in place, you have found the slow down in document processing is nothing more than a consequence of an earlier misstep. So, no matter what you do to “fix” document processing, you won’t be able to get your search system working the way users want it to. Pretty depressing? Furthermore, senior management doesn’t understand why throwing money at a problem in document processing doesn’t have any significant pay off to justify the expense.
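One inexpensive guard against processing the same information again and again is a content fingerprint computed before a document enters the queue. The sketch below is a minimal illustration, not a feature of any particular search system. The function names and file names are hypothetical, and the approach catches only copies that are identical once case and spacing are ignored.

```python
import hashlib


def fingerprint(text):
    """Hash of the text with case and whitespace differences removed."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def split_duplicates(docs):
    """Separate first-seen documents from repeats.

    `docs` is an iterable of (name, text) pairs in acquisition order.
    Returns the unique names and (duplicate, original) name pairs.
    """
    seen = {}
    unique, duplicates = [], []
    for name, text in docs:
        key = fingerprint(text)
        if key in seen:
            duplicates.append((name, seen[key]))
        else:
            seen[key] = name
            unique.append(name)
    return unique, duplicates


docs = [
    ("deck_v1.ppt", "Sales plan for 2008"),
    ("deck_copy.ppt", "Sales  plan for 2008\n"),
    ("deck_v2.ppt", "Sales plan for 2008, revised"),
]
unique, duplicates = split_duplicates(docs)
print(unique)
print(duplicates)
```

The revised deck still gets through, as it should. Deciding whether a minor revision counts as a “significant change” is an editorial policy question, and no hash will answer it for you.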

XML and Transformation

I’m not sure I can name a search vendor who does not support XML. XML is an incantation. Say it enough times, and I guess it becomes the magic fix to what ever ailments a content processing system has.

Let me give you my view of this XML baloney. First, XML or extensible mark up language is not a panacea. XML is, at its core, a programmatic approach to content. How many of you reading this column program anything in any language? Darn few. So the painful truth is, you don’t know how to “fix” or “create” a valid XML instance, but you sure sound great when you chatter about XML.

Second, XML is a simplified version of SGML, which grew out of IBM’s GML and was adopted by CALS (Computer-aided Acquisition and Logistic Support), the program spawned by our insightful colleagues in the US government to deal with procurement. Lurking behind a nice Word document in the “new” DOCX format is a schema, the modern cousin of the DTD or document type definition. But out of sight, out of mind, correct? Unfortunately, no.

Third, XML is like an ancient Roman wall in 25 BCE. The smooth surface conceals a heck of a lot of rubble between some rigid structures made of irregular brick or stone. This means that taking a “flavor” of XML and converting it to the XML that your search system understands is a programmatic process. This notion of converting a source file like a WordPerfect document into an XML version that the search system can use is pretty darn complicated. When it goes wacky, it’s just like debugging any other computer program. Who knows how easy or hard it will be to find and fix the error? Who knows how long it will take? Who knows how much it will cost? I sure don’t.

If we take these three comments and think about them, it’s evident that this document transformation can chew up some computer processing cycles. If a document can’t be transformed, the exception log can grow. Dealing with these exceptions is not something one does in a few spare minutes between meetings.

Nope.

XML is work, which when done properly, greatly increases the functionality of indexing sub systems. When done poorly, XML is just another search system nightmare.

Stepping Back

How can you head off these document processing / transformation challenges?

The first step is knowing about them. If your vendor has educated you, great. If you have learned from the school of hard knocks, that’s probably better. If you have researched search and ingested as much other information as you can, you go to the head of the class.

An increasing number of organizations are solving this throughput problem by: [a] ripping and replacing the incumbent search system. At best, this is a temporary fix; [b] shifting to an appliance model. This works pretty well, but you have to keep adding appliances to keep up with content growth and the procedure and policy issues will surface again unless addressed before the appliance is deployed; [c] shifting to a hosted solution. This is an up-and-coming fix because it outsources the problem and slithers away from the capital investment on-premises installations require.

Notice that I’m not suggesting slapping an adhesive bandage on your incumbent search system. A quick fix is not going to do much more than buy time. In Beyond Search, I go into some depth about vendors who can “wrap” your ailing search system with a life-support system. This approach is much better than a quick fix, but you will have to address the larger policy and procedural issues to make this hybrid solution work over the long term.

You are probably wondering how transforming a bunch of content can become such a headache. You have just learned something about the “hidden secrets” of behind-the-firewall search. You have to dig into a number of murky, complex areas before you make your search system “live.”

I think the following checklist has not been made available without charge before. You may find it useful, and if I have left something out, please, let me know via the comments function on this Web log.

  • How much information in what format must the search system acquire and transform on a monthly and annual basis?
  • What percent of the transformation is for new content? How much for changed content?
  • What percent of content that must be processed exists in what specific file types? Does our vendor’s transformation system handle this source material? What percent of documents cannot be handled?
  • What filters must be modified, tested, and integrated into the search systems?
  • What is the administrative procedure for dealing with [a] exceptions and [b] new file types such as an email with an unrecognized attachment?
  • What is the mechanism for determining what content is a valid version and which content is a duplication? What pre-indexing process must be created to minimize system cycles needed to identify duplicate content; that is, how can I get my colleagues to flag only content that should be indexed before the content is acquired by the document processing system?
  • What is the upgrade plan for the document processing sub system?
  • What content will not be processed if the document processing sub system slows? What is the procedure for processing excluded content when the document processing subsystem again has capacity?
  • What is the financial switch over point from on-premises search to an appliance or a hosted / managed service model?
  • What is the triage procedure when a document processing sub system degrades to an unacceptable level?
  • What’s the XML strategy for this search system? What does the vendor do to fix issues? What are my contingency plans and options when a problem becomes evident?

In another post, I want to look at hot spots in indexing. What’s intriguing is that so far we have brought or had content pushed to the search system storage devices. We have normalized content and written that content to the storage sub system in a form the indexing system can understand. Is anyone keeping track of how many instances of a document we have in the search system at any one time? We need that number. If we run out of storage, we’re dead in the water.

This behind-the-firewall search is a no-brainer. Believe it or not, a senior technologist at a 10,000-person organization told me in late 2007, “Search is not that complicated.” That’s a guy who really knows his information retrieval limits!

Stephen Arnold, February 23, 2008
