Headlong into an Abyss

July 17, 2008

Right before the 4th of July, my phone rang. A very enthusiastic person had to speak with me. I hate the phone. Since February 2007, my hearing has gone downhill.

The chipper caller explained that a major organization had a problem. In five minutes, I learned that this outfit had three content processing systems. Each system was a search system, a collaboration system, and a content processing system.

The problem was that no one could find anything. Right after the holiday, I opened my mail program and there sat several plump PDF files stuffed full of baloney about requirements, guidelines, billing, and other administrivia. The problem boiled down to a request for suggestions about making the three systems work happily together so employees could find information.

I thought about this situation and sent an email message telling Ms. Chipper, “No bid. Tx, Steve”. I do this a lot.

In this essay, I want to run down the four reasons I want to steer clear of outfits who are ready to do a header off a cliff into the search abyss.

Too Many Toys

The organization has money and buys search toys. No one plays with the search toys but there is a person who thinks that the organization should play with the search toys.

too many toys

Source: http://www.tgtbt.com/images/atozvictoriatoys50pc.JPG

Buy, Try, Buy, Try

Organizations unable to get one system working just buy another one. The reasons vary. A change in management, for example, means any organizational intelligence about search and content processing is lost. One big drug company got a new president, and he mandated a new system. Who really knows how to make a search system work? No one, so buy another one. Maybe it will work. I call this crazy procurement, and it is a sure sign of a dysfunctional organization.

Silo Wars

Multiple search systems can also be a consequence of units that refuse to cooperate. If unit A wants one system, well, unit B wants a different one. This baffles me, because neither system allows a user to access content in one query. When a person tells me, “We need federated search,” I know that this is a silo war situation. Somehow finding a way to take one query, send it to three or more separate search systems, and return concatenated results will save the day. Not likely. The silo barons will find a way to keep their information, thank you.

Read more

Vertical Search Resurgent

July 16, 2008

Several years ago, the mantra among some of my financial service clients was, “Vertical search.” What’s vertical search? It is two ideas rolled into one buzzword.

A Casual Definition

First, the content processed by the search system is about a particular topic. Different database producers define the scope of a database in idiosyncratic ways. In Compendex, an index of engineering information, you can find a wide range of engineering topics, covering many fields. You can find information about environmental engineering, some of which looks to me as if it belongs in a database about chemistry. But in general, the processed information fits into a topical basket. Chemical Abstracts is about chemistry, but the span of chemistry is wide. Nevertheless, the guts of a vertical search engine is bounded content that is brought together in a generally useful topic area. When you look for information about travel, you are using a vertical search engine. For example, Orbitz.com and BookIt.com are vertical search engines.

Second, the content has to be searchable. So, vertical content collections require a search engine. Vertical content is often structured. When you look for a flight from LGA to SFO, you fill in dates, times, a departure airport code, an arrival airport code, etc. A parametric query is a fancy way of saying, “Training wheels for a SQL query.” But vertical content collections can also be processed by the menagerie of text processing systems. When you query the Dr. Koop Web site, you are using the type of search system provided by Live.com and Yahoo.com.
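
To make the “training wheels” crack concrete, here is a minimal sketch of what a travel site’s parametric form amounts to under the hood. The table and field names are my own invention, not any airline system’s actual schema; the point is only that the filled-in form boxes become WHERE-clause parameters.

```python
import sqlite3

# Illustrative only: a toy "flights" table with the fields a travel site's form exposes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights (depart_code TEXT, arrive_code TEXT, depart_date TEXT, fare REAL)"
)
conn.execute("INSERT INTO flights VALUES ('LGA', 'SFO', '2008-08-01', 329.00)")

# The form fields become named parameters; the site writes the SQL for you.
form = {"depart_code": "LGA", "arrive_code": "SFO", "depart_date": "2008-08-01"}
rows = conn.execute(
    "SELECT depart_code, arrive_code, depart_date, fare FROM flights "
    "WHERE depart_code = :depart_code AND arrive_code = :arrive_code "
    "AND depart_date = :depart_date ORDER BY fare",
    form,
).fetchall()
print(rows)  # [('LGA', 'SFO', '2008-08-01', 329.0)]
```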

wheel

Source: http://www.sonirodban.com/images/wheel.jpg

Google is a horizontal search engine, but it is also a vertical search engine. If you navigate to Google’s advanced search page, which is accessed by fewer than three percent of Google’s users, you will find links to a number of vertical search engines; for example, the Microsoft collection and the US government collection. Note: Google’s universal search is a bit of marketing swizzle that means Google can take a query and pass it across indexes for discrete collections. The results are pulled together, deduplicated, and relevance ranked. This is a function available from Vivisimo since 2000. Universal search Google style displays maps and images, but it is far from cutting edge technology save for one Google factor–scale.
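
For readers who want the mechanics rather than the marketing swizzle, here is a toy sketch of the universal/federated pattern: one query fanned across separate collection indexes, the hits pulled together, deduplicated, and relevance ranked. The collection functions, URLs, and scores are stand-ins I made up; real ranking is far more involved than a single score per hit.

```python
# Hypothetical stand-ins for per-collection indexes; each returns (url, score) hits.
def search_web(query):
    return [("http://example.com/a", 0.9), ("http://example.com/b", 0.4)]

def search_images(query):
    return [("http://example.com/a", 0.7), ("http://example.com/images/1", 0.6)]

def search_gov(query):
    return [("http://example.gov/report", 0.5)]

def universal_search(query):
    """One query passed across discrete collection indexes, then merged."""
    merged = {}
    for collection in (search_web, search_images, search_gov):
        for url, score in collection(query):
            # Deduplicate: keep the best score seen for a given URL.
            merged[url] = max(score, merged.get(url, 0.0))
    # Relevance rank the pulled-together results.
    return sorted(merged.items(), key=lambda hit: hit[1], reverse=True)

print(universal_search("vertical search"))
```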

Why am I writing about vertical search when the topic for me came and went years ago? In fact, at the height of the vertical search frenzy I dismissed the hype. Innovators, unaware of the vertical nature of commercial databases 30 years ago, thought something quite new was at hand. Wrong. Google’s horizontal information dominance forced other companies to find niches where Google was not doing a good job or any job, for that matter.

Vertical search flashed on my radar today (July 15, 2008) when I flipped through the wonderful information in my tireless news reader.

Autonomy announced:

that Foundography, a subsidiary of Nexus Business Media Ltd, has selected Autonomy to power vertical search on its website (sic)  for IT professionals: foundographytech.com. The site enables business information users to access only the information they want and through Autonomy’s unique conceptual capabilities delivers an ‘already found’ set of results, providing pertinent information users may not have known existed. The site also presents a unique proposition for advertisers, providing conceptually targeted ad selling.

Read more

Digital Convergence: A Blast from the Past

July 15, 2008

In 1999, I wrote two articles for a professional journal called Searcher. The editor, Barbara Quint, former guru of information at RAND Corporation, asked me to update these two articles. I no longer had copies of them, but Ms. Quint emailed my fair copies, and I read my nine-year-old prose.

The 2008 version is “Digital Convergence: Building Blocks or Mud Bricks”. You can obtain a hard copy from the publisher, Information Today here. In a month or two, an electronic version of the article will appear in one of the online commercial databases.

My son, Erik, who contributed his column to Searcher this month as well, asked me, “What’s with the mud bricks?” I chose the title to suggest that the technologies I identified as potential winners in 1999 may lack staying power. One example is human assigned tags. This is indexing, and it has been around in one form or another since humans learned to write. Imagine trying to find a single scroll in a stack of scrolls. Indexing was a must. What’s not going to have staying power is my assigning tags. The concept of indexing is a keeper; the function is moving to smart software, which can arguably do a better job than a subject matter expert as long as we define “better” as meaning “faster and cheaper”. A “mud brick” is a technology that decomposes into a more basic element. Innovations are based on interesting assemblages of constituent components. Get the mix right and you have something with substance, the equivalent of the Lion’s Gate keystone.

lion-gate-mycenae-2a

Today’s information environment is composed of systems and methods that are durable. XML, for example, is not new. It traces its roots back 50 years. Today’s tools took decades of refinement. Good or bad, the notion of structuring content for meaning and separating the layout information from content is with us for the foreseeable future.
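
A tiny example of what “structure for meaning, layout elsewhere” buys you. The record below is my own illustrative markup, not any publisher’s DTD; because the tags carry meaning rather than presentation, any downstream system can pull out the fields it needs and apply whatever layout it likes.

```python
import xml.etree.ElementTree as ET

# Content carries meaning only; layout lives elsewhere (a stylesheet, a template).
record = ET.fromstring(
    "<article>"
    "  <title>Digital Convergence</title>"
    "  <author>Stephen Arnold</author>"
    "  <published>1999</published>"
    "</article>"
)

# Because the structure is explicit, extraction is trivial for any consumer.
print(record.findtext("title"), "-", record.findtext("published"))
```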

Three thoughts emerged from the review of the original essays whose titles I no longer recall.

First, most of today’s hottest technologies were around nine years ago. Computers were too expensive and storage too costly to permit widespread deployment of services based on the antecedents of today’s hottest applications, such as social search and mobile search, among others.

Second, even though I identified a dozen or so “hot” technologies in 1999, I had to wait for competition and market actions to identify the winners. Content processing, to pick one, is just now emerging as a method that most organizations can afford to deploy. In short, it’s easy to identify a group of interesting technologies; it’s hard for me to pick the technology that will generate the most money or have the greatest impact.

Read more

Microsoft: 1999 to 2008

July 14, 2008

I have written one short post and two longer posts about Microsoft.com’s architecture for its online services. You can read each of these essays by clicking on the titles of the stories:

I want to urge each of my two or three Web log readers to validate my assertions. Not only am I an addled goose, I am an old goose. I make errors as young wizards delight in reminding me. On Friday, July 11, 2008, two of my engineers filled some gaps in my knowledge about X++, one of Microsoft’s less well-known programming languages.

the perils of complexity

The diagram shows how complexity increases when systems are designed to support solutions that do not simplify the design. Source: http://www.epmbook.com/complexity.gif

Stepping Back

As I reflected upon the information I reviewed pertaining to Microsoft.com’s online architecture, several thoughts bubbled to the surface of my consciousness:

First, I believe Microsoft’s new data centers and online architecture share DNA with those 1999 data centers. Microsoft is not embracing the systems and methods in use at Amazon, Google, and even the hapless Yahoo. Microsoft is using its own “dog food”. While commendable, the bottlenecks have not been fully resolved. Microsoft uses scale up and scale out to make systems keep pace with user expectations of response time. One engineer who works at a company competing with Microsoft told me: “Run a query on Live.com. The response times in many cases are faster than ours. The reason is that Microsoft caches everything. It works, but it is expensive.”
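
The trade-off the engineer describes can be sketched in a few lines. This is only the generic cache-aside pattern under my own toy assumptions, not Microsoft’s implementation: memory and hardware are spent so that repeat queries skip the expensive index lookup.

```python
import time

results_cache = {}  # query -> (timestamp, results); memory traded for latency

def run_query_against_index(query):
    time.sleep(0.2)                      # stand-in for an expensive index lookup
    return [f"result for {query!r}"]

def cached_search(query, ttl_seconds=300):
    now = time.time()
    hit = results_cache.get(query)
    if hit and now - hit[0] < ttl_seconds:
        return hit[1]                    # served from cache: fast, but costs RAM
    results = run_query_against_index(query)
    results_cache[query] = (now, results)
    return results

cached_search("sql server 2008")   # slow: goes to the index
cached_search("sql server 2008")   # fast: comes from the cache
```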

Second, Microsoft has neither a cohesive code base nor a new one. With each upgrade, legacy code and baked in features and functions are dragged along. A good example is SQL Server. Although rewritten from the good old days with Sybase, SQL Server is not the right tool for peta-scale data manipulation chores. Alternatives exist, and Amazon and Yahoo are using them. Microsoft is sticking with its RDBMS engine, and it is very expensive to replicate, cluster, back up with standby hardware, and keep in sync. The performance challenge remains even though the user experience seems as good as, if not better than, the competition’s. In my opinion, the reliance on this particular “dog food” is akin to building a wooden power boat with unseasoned wood.

Third, in each of the essays, Microsoft’s own engineers emphasize the cost of the engineering approaches. There is no emphasis on slashing costs. The emphasis is on spending money to get the job done. In my opinion, spending money to solve problems via the scale up and scale out approach is okay as long as there are barrels of cash to throw at the problem. The better approach, in my opinion, is to engineer solutions that make scaling and performance as economical as possible and to direct investment at finding ways to leapfrog the well-known, long-standing problems: the Codd database model, inefficient and latency-inducing message passing, dedicated hardware for specific functions and applications that then gets replicated across clusters, and, finally, extra hardware that sits, in effect, like an idle railroad car until needed. What happens when the money for these expensive approaches becomes less available?

Read more

Microsoft.com in 2006

July 13, 2008

In late 2006, I had to prepare a report assessing a recommendation made to a large services firm by Microsoft Consulting. One of the questions I had to try and answer was, “How does Microsoft set up its online system?” I had the Jim Gray diagram which I referenced in this Web log essay “Microsoft.com in 1999”. To be forthright, I had not paid much attention to Microsoft because I was immersed in my Google research.

I poked around on various search systems, MSDN, and eventually found a diagram that purported to explain the layout of Microsoft’s online system. The information appeared in a PowerPoint presentation by Sunjeev Pandey, Senior Director Microsoft.com Operations and Paul Wright, Technology Architect Manager, Microsoft.com Operations. On July 13, 2008 the presentation was available here. The PowerPoint itself does not appear in the Live.com index. I cannot guarantee that this link will remain valid. Important documents about Microsoft’s own architecture are disappearing from MSDN and other Microsoft Web sites. I am reluctant to post the entire presentation even though it does not carry a Microsoft copyright.

I want to spell out the caveats. Some new readers of this Web log assume that I am writing news. I am not. The information in this essay is from June 2006, possibly a few months earlier. Furthermore, as I get new information, I reserve the right to change my mind. This means that I am not asserting absolutes. I am capturing my ideas as if I were Samuel Pepys writing in the 17th century. You want real news? Navigate elsewhere.

My notes suggest that Messrs Pandey and Wright prepared a PowerPoint deck for use in a Web cast about Microsoft’s own infrastructure. These Web casts are available, but my Verizon wireless service times out when I try to view them. You may have better luck.

Microsoft.com in 2006

Here is a diagram from the presentation “Microsoft.com: Design for Resilience. The Infrastructure of www.microsoft.com, Microsoft Update, and the Download Center.” The title is important because the focus is narrow compared to the bundle of services explained in Mr. Gray’s Three Talks PowerPoint deck and in Steven Levi and Galen Hunt’s “Challenges to Building Scalable Services.” In a future essay, I will comment on this shift. For now, let’s look at what Microsoft.com’s architecture may have been in mid-2006.

2006 architecture

Microsoft.com Mid-2006

This architecture represents a more robust approach. Between 1995 and 2006, the number of users rose from 30,000 per day to about 17 million per day. In 2001, the baseline operating system was Windows 2000. The shift to Microsoft’s 64-bit operating system took place in 2005, a year in which (if Messrs Pandey and Wright are correct) Microsoft.com experienced some interesting challenges. For example, international network service was disrupted in May and September of 2005. More tellingly, Microsoft was subject to Denial of Service attacks and experienced network failures in April and May of 2005. Presumably, the mid-2006 architecture was designed to address these challenges.

The block diagram makes it clear that Microsoft wanted to deploy an architecture in 2006 that provided excellent availability and better performance via caching. The drawbacks are those that were part of the DNA of the original 1999 design–higher costs due to the scale up and scale out model, the use of name brand, top quality hardware, and the complexity of the system. You can see four distinct tiers in the architecture.

Information has to move from the Microsoft Corp. network to the back end network tier. Then the information must move from the back end to the content delivery tier. Due to the “islands” approach that now includes distributed data centers, the information must propagate across data centers. Finally, the most accessed data or the highest priority information must be made available to the Akamai and Savvis “edge of network” systems. Microsoft, presumably to get engineering expertise and exercise better control of costs, purchased two adjoining data centers from Savvis in mid-2007 for about $200 million. (Note: for comparison purposes, keep in mind that Microsoft’s San Antonio data center cost about $600 to $650 million.)

Read more

Microsoft.com in 1999

July 12, 2008

In my previous essay about Jim Gray’s Three Talks in 1999, I mentioned that he and his team had done an excellent job of summarizing trends in data center design, online infrastructure options, and cost analysis of power and name brand hardware. If you have not read that essay, I invite you to take a look at it here. You may want to download the PowerPoint here. The document does not carry a copyright mark, but I am reluctant to post it for my readers. Please, keep in mind that Microsoft can remove this document at any time. One of the baseline papers referenced in this 1999 Three Talks document is no longer available, and I have a resource working on tracking it down now.

I invite you to look at this diagram. I apologize for the poor quality of the graphic, but I am using an image in Mr. Gray’s 1999 presentation which has been crunched by the WordPress program. I will make some high level observations, and you will be able to download the 1999 PowerPoint and examine the image in that document.

gray diagram 1998

I want to keep the engineering jargon to a minimum. Half of my two to four Web log regulars are MBAs, and I have been asked to clarify or expand on a number of technical concepts. I will not provide that “deep dive” in my public Web log. Information of that type appears in my for-fee studies. If this offends you, please, stop reading. I have to make a decision about what is placed on the Web log as general information and what goes in the studies that pay for the blood-sucking leeches who assist me in my research.

The Diagram: High-Level Observations

The set up of Microsoft.com in 1999–if Mr. Gray’s diagram is accurate–shows islands of two types. First, there are discrete data centers; for example, European Data Center, Japan Data Center, and Building 11. Each of these appears to be a microcosm of the larger set up used in North America. The European and Japan Data Centers are identical in the schematic. I took this to mean that Microsoft had a “cookie cutter” model. This is a good approach, and it is one used by many online services today. Instead of coming up with a new design for each data center, a standard plan is followed. Japan is connected to the Internet with a high speed OC3 line. The European Data Center connection is identified as Ethernet. When you print out Mr. Gray’s Three Talks presentation, you will see that details of the hardware and the cost of the hardware are provided. For example, in the Japan Data Center, the SQL Server cluster uses two servers with an average cost of $80,000. I know this number seems high, but Microsoft is using brand name equipment, a practice which the material I have reviewed suggests continues in 2008.

Second, there is a big island–a cluster of machines that provide database services. For example, there are “Live SQL Servers” with an average cost of $83,000, SQL Consolidators at a cost of $83,000, and a feeder local area network to hook these two SQL Server components together. I interpret this approach as a pragmatic way to reduce latency when hitting the SQL Server data stores for reading data and to reduce the bottlenecks that can occur when writing to SQL Server. Appreciate that in 1999, SQL Server lacked many of the features in the forthcoming SQL Server update. Database access is a continuing problem even today. In my opinion, relational databases or RDBMS are not well suited to handle the spikes that accompany online access. Furthermore, there is no provision I can see in this schematic for distributing database reads across data centers. We will return to the implications of this approach in a moment.

Third, notice that there are separate clusters of servers in an even bigger island, probably a big data center. Each performs a specific function. For example, there is a search cluster identified as “search.microsoft.com” and an ActiveX cluster identified as “activex.microsoft.com”. Presumably in a major data center, or possibly two data centers connected by a high speed line in North America, the servers are hard wired to perform specific functions. The connections among the servers in the data centers use a fiber ring, very sophisticated and expensive in 1999 dollars, or more precisely a Fiber Distributed Data Interface. (FDDI is a 100 Mbps fiber optic LAN. It is an ANSI standard. It accommodates redundancy.) Microsoft’s own definition here says:

[The acronym] stands for Fiber Distributed Data Interface, a high-speed (100 Mbps) networking technology based on fiber optic cable, token passing, and a ring topology.

To me, the set up is pragmatic, but it suggests putting everything in one, maybe two, places. In 1999, demand was lower than today, obviously. With servers under one roof, administration was simplified. In the absence of automated server management systems, technicians and engineers had to perform many tasks by walking up to a rack, pulling out the keyboard, and directly interacting with the servers.

Finally (there are many other points that can be explored, of course), note that one FDDI ring connects the primary node (not a good word, but the diagram shows the FDDI rings in this type of set up) to a secondary FDDI ring. Some services are mirrored, such as home.microsoft.com and support.microsoft.com. Others, such as premium.microsoft.com and “ftp://ftp.microsoft.com”, are not.

Read more

Yahoo Cost Estimate

July 11, 2008

I wanted to run through some of the cost data I have gathered over the years. The reason is this sentence in Miguel Helft’s “Yahoo Is Inviting Partners to Build on Its Search Power,” an essay that appeared in the Kentucky edition of the New York Times, July 10, 2008, page C5:

Yahoo estimates that it would cost $300 million to build a search service from scratch.

No link for this. Sorry. I have the dead tree version, and I refuse to deal with the New York Times’s Web site, and its weird reader thing.

The Yahoo BOSS initiative has been choking my news reader. I don’t want to be a link pig, but I will flag three posts that you may want to scan. First, the LA Times’s “Who’s the BOSS? Yahoo Searches for a Way to Unseat Google,” by Jessica Guynn. You can, as of 7:45 pm on July 10, 2008, read it here. I liked this write up because of this remark:

Yahoo has made myriad efforts over the years.

By golly, that nails it. Lots of effort, little progress. The rest of Ms. Guynn’s essay unrolls a well-worn red carpet decorated with platitudes.

Next, I suggest you scan Larry Dignan’s essay “Yahoo’s Desperate Search Times Call for Open Source.” I like most of the ZDNet essays. I would characterize the approach as gentle pragmatism. I liked this sentence:

Yahoo’s open strategy makes a lot of sense. But let’s not kid ourselves, Yahoo’s open strategy could be characterized as a Hail Mary pass too. It may work. BOSS may turn out to be brilliant. But let’s reserve judgment until we see some results–on the business and technology fronts.

Nailed. Enough said.

The last essay on this short list is John Letzing’s “In an Effort to Disrupt, Yahoo Further Opens Search” on MarketWatch. You can read this article here. (Warning: MarketWatch essays can be tough to track down. Very wacky url and a not-so-hot search engine make a killer combination.) The essay is good, and it takes a business angle on the story. For me, this was the key sentence:

Yahoo distributed a slide presentation to accompany news of the BOSS initiative that includes a pie chart showing a dramatic projected gain for “BOSS partners & developers,” at the expense of Google, Microsoft and Yahoo-branded services. Michels stressed that the pie chart isn’t based on actual calculated estimates, but rather reflects Yahoo’s directional goals.

Presentations based on assumptions–those will go a long way to restoring investor confidence in Yahoo.

Now back to the single sentence in the New York Times today:

Yahoo estimates that it would cost $300 million to build a search service from scratch.

This is Yahoo math.

yahoo math

My data suggest that Yahoo’s estimate is baloney. Over the years, Yahoo has accumulated search technologies; for example, Inktomi, AllTheWeb.com, Stata Labs, and AltaVista.com. Yahoo’s acquisitions arrived with search systems, often pretty weak; for example, Delicious.com’s and Flickr.com’s. Yahoo has licensed third-party search tools such as InQuira’s question answering system. To top it off, Yahoo’s engineers have cooked up Mindset, which has some nice features, and the more recent semantic search system here.

This $300 million number is low enough for a company of Yahoo’s size to have built a search system if it could be done. The wacky estimates and the track record of collecting search systems like the hopefuls on Antiques Roadshow are evidence that Yahoo could not build a search system.

Yahoo could spend time, money, and talent creating a collection of stuff that has zero chance of thwarting Google. The search vendors lining up to use Yahoo’s index and infrastructure, the open source voodoo, and the unsubstantiated cost estimate underscore how far from reality Yahoo has allowed itself to drift.

I am going to watch how the BOSS play unfolds. Yahoo is in a pretty unpleasant spot, and its executives’ willingness to do first year MBA student projections annoys me.

Let me end with a question. If search is a $300 million investment, for what is Google spending billions? Why is Microsoft spending more billions than Google AND buying search technology with a devil-may-care insouciance that I admire? It is as if Carly Fiorina were the buyout guru.

Yahoo’s ad revenue projections and its cost estimates are examples of spreadsheet fever. I hope the disease runs its course before the patient becomes incurable.

There’s a math cartoon floating around. The letter “i” (Descartes’ imaginary number) is talking to pi (the Greek symbol you recall from 7th grade math). The caption is, “Get real.” Good advice. Those writing about Yahoo may want to pepper their questions with “Get real.”

Stephen Arnold, July 11, 2008

SQL Server: Bringing the Plow Horse to the Race Track for the Derby

July 10, 2008

SQL Server has bought a lot of dog food in Harrod’s Creek. We got paid to figure out why SQL Server back up and replication crashed and burned. We got paid to make SQL Server go faster. We got paid to grunt through scripts to figure out why reports were off by one. Yep, we like that plow horse. It works like a champ for most business database needs. You can use Access as a front end. You can make some nice looking forms with Microsoft tools with some fiddling.

sql diagram

This is a Microsoft diagram. The release date is August, maybe September 2008. More information is here.

But, when the old plow horse has to amble through petabytes of data, SQL Server is not the right animal for the job. In order to search gigabytes of normalized tables, you need to find a way to shortcut the process. One of my colleagues figured out a way to intercept writes, eject them, and build a shadow index that could be searched using some nifty methods. Left to its own devices, SQL Server would stroll through processes, not gallop.
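
Here is a minimal sketch of the general idea, not my colleague’s actual code: let each write go to the database as usual, but copy the text into a shadow inverted index so queries hit posting lists instead of strolling through normalized tables. The writer function and row ids are illustrative stand-ins.

```python
from collections import defaultdict

shadow_index = defaultdict(set)   # term -> set of row ids

def intercept_write(row_id, text_columns, write_to_database):
    """Let the write proceed, but also feed a shadow inverted index."""
    write_to_database(row_id, text_columns)          # normal RDBMS path
    for value in text_columns.values():
        for term in value.lower().split():
            shadow_index[term].add(row_id)           # searchable without table scans

def shadow_search(terms):
    """Intersect posting lists instead of walking normalized tables."""
    postings = [shadow_index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# Usage with a stand-in writer
db = {}
intercept_write(1, {"title": "Plow horse pulls the Derby"}, lambda rid, cols: db.update({rid: cols}))
intercept_write(2, {"title": "Derby winners cash tickets"}, lambda rid, cols: db.update({rid: cols}))
print(shadow_search(["derby"]))   # {1, 2}
```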

I spoke with a skeptic today. Her comments caused me to think about SQL Server in a critical way. Are these points valid? Let’s follow the plow horse idea and see if there’s hay in the stall.

Selected Features

Like she said to me, “A different data management animal is needed, right?”

Will SQL Server 2008 be that beast? Here’s what she told me about the most recent version of this data work horse:

  • An easier to use report builder. I thought the existing report tools were pretty spiffy. Guess I was wrong.
  • Table compression. A good thing but the search still takes some time. Codd databases have their place, but the doctor did not plan for petabyte-scale tables, chubby XML tables, and the other goodies that modern day 20-somethings expect databases to do.
  • More security controls. Microsoft engineers are likely to spark some interest from Oracle, a company known for making security a key part of its database systems.
  • Streamlined administrative controls. Good for a person on a salary. Probably a mixed blessing for SQL Server consultants.
  • Plumbing enhancements. We like partitioned table parallelism because it’s another option for whipping the plow horse.

These are significant changes, but the plow horse is still there, she asserted. She said, “You can comb the mane and tail. You can put liquid shoe polish on the hooves. You can even use a commercial hair conditioner to give the coat a just groomed look. But it is still a plow horse, designed to handle certain tasks quite well.”

Microsoft’s official information page is here. You can find useful links on MSDN. I had somewhat better luck using Google’s special purpose Microsoft index. Pick your poison.

Observations

If you are a Microsoft Certified Professional, you probably wonder why I am quoting her plow horse analogy. I think SQL Server 2008 is a vastly improved relational database. It handles mission critical applications in organizations of all sizes 24×7 with excellent reliability when properly set up and resourced. Stop with the plow horse.

Let’s shift to a different beast. No more horse analogies. I have a sneaking suspicion that the animal to challenge is Googzilla. The Web search and advertising company uses MySQL for routine RDBMS operations. But for the heavy lifting, Googzilla has jumped up a level. Technically, Google has performed a meta maneuver; that is, Google has looked at the problems of data scale, data transformation (a function that can consume as much as 30 percent of an IT department’s budget), and the need to find a way to handle input/output and read/write chores without slowing operations to a tortoise-like pace.

So, Microsoft is doing database; Google is doing data management, of which database operations are a subset, handled by MySQL and the odd Oracle installation.

What’s the difference?

In my experience, when you have to deal with large amounts of data, Dr. Codd’s invention is the wrong tool for the job. The idea of big static databases that have to be updated in real time is an expensive proposition, not to mention difficult. Sure, there are work arounds with exotic hardware and brittle engineering techniques. But when you are shoving petas, you don’t have the luxury of time. You certainly don’t have the money to buy cutting edge gizmos that require a permanent MIT engineer to babysit the system. You want to rip through data as rapidly as possible yet have an “as needed” method for querying, slicing, dicing, and transforming.
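
The style I have in mind can be sketched simply. This is an illustration of the partition-and-scan pattern under my own toy assumptions, not Google’s code: split the data, let workers rip through the partitions in parallel, and fold the partial results together on the way out.

```python
from multiprocessing import Pool

# Data split into partitions so each worker can rip through its share independently.
partitions = [
    ["plow horse", "race track", "plow horse derby"],
    ["spreadsheet fever", "plow horse", "derby tickets"],
]

def scan_partition(records):
    """Map step: transform and slice each record as needed while scanning."""
    counts = {}
    for record in records:
        for term in record.split():
            counts[term] = counts.get(term, 0) + 1
    return counts

def merge(partial_counts):
    """Reduce step: fold the partial results into one answer."""
    totals = {}
    for counts in partial_counts:
        for term, n in counts.items():
            totals[term] = totals.get(term, 0) + n
    return totals

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        print(merge(pool.map(scan_partition, partitions)))
```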

That’s her concern, and I guess it is mine too, with regard to SQL Server 2008. The plow horse is going to be put in the Kentucky Derby, and it will probably finish the race, just too slow to win or keep the fans in their seats. The winners want to cash in their tickets and do other interesting things.

When it comes to next generation data manipulation systems, Googzilla may be the creature to set the pace for three reasons:

  1. Lower cost scaling
  2. Optimized for petabyte and larger data
  3. Distributed, massively parallel operation.

Agree? Disagree? Let me know. Just have some cost data so I can get back to my informant.

Stephen Arnold, July 10, 2008

Revisiting Jim Gray’s Three Talks 1999

July 9, 2008

Navigate to http://research.microsoft.com/~gray/Talks/Scaleability_Teminology.ppt and download a presentation by Jim Gray and his colleagues at Microsoft in 1999.

I worked through this deck again this evening (July 8, 2008) and came away with one question, “What went wrong?” This presentation identified the options for large scale systems, what we now call cloud systems. The presentation reviews the trends in storage, memory, and CPUs. The problems associated with traditional approaches to optimizing performance are clearly identified; for example, I/O bottlenecks and overheating, among others.

Why did this one question push others from the front of my mind?

At the time Mr. Gray and his colleagues were wrestling with large-scale issues, so was Google. Microsoft had more resources than Google by orders of magnitude. Microsoft had market share, and Google had only a quirky name. Microsoft had a user base, which in 1999 was probably 90 percent of the desktops in the world plus a growing share of the server market; Windows 2000 Server was big technical news. Google had almost zero traffic, no business model, and a rented garage.

Anyone looking at Mr. Gray’s presentation would have concluded that:

  1. Microsoft’s engineers understood the problems of scaling online services
  2. The technical options were clearly identified, understood, and quantified. Mr. Gray and his colleagues calculated the cost of storage and racks of servers, and provided enough data to estimate how many system engineers were needed per cluster
  3. Google’s early engineering approach had been researched and analyzed. In fact, the presentation provides a clearer description of what Google was doing in the first year of the company’s existence.

Yet if we look at Microsoft and Google today, roughly a decade after Mr. Gray’s presentation, we find:

  1. Microsoft makes home run acquisitions; for example, Fast Search & Transfer, Powerset, and most likely some type of deal with Yahoo. Google buys companies that are known only to an in-crowd in Silicon Valley; for example, Transformic.
  2. Microsoft is engaging in marketing practices that pay for traffic; Google is sucked forward by its online advertising system. Advertisers pay Google, and Google makes many of its products and services available without charge.
  3. Microsoft is now–almost a decade after Mr. Gray’s deck–building massive data centers; Google continues to open new data centers, but Google is not mounting a Sputnik program, just doing business as usual.
  4. Microsoft has not been able to capture a significant share of the Web search market. Google–except in China and Russia–is pushing towards market shares in the 65 percent and higher range.

What happened?

I don’t have my data organized, but tomorrow, I will start grinding through my digital and paper files for information about Microsoft’s decisions about its cloud architecture that obviously could not keep pace with Google’s. Microsoft hired Digital Equipment wizards; for example, Gordon Bell and David Cutler, among others. Google hired Jeff Dean and Sanjay Ghemawat. Both companies had access to equivalent technical information.

How could such disparity come about?

I have some ideas about what information I want to peruse; for example:

  1. What were the consequences of embracing Windows “dog food”; that is, Microsoft’s own products, not Linux with home-grown wrappers used by Google?
  2. What were the cost implications of Microsoft’s using brand name gear from Dell and Hewlett Packard, not the commodity gear Google used?
  3. What was the impact of Microsoft’s use of tiers or layers of servers, not Google’s “everything is the same and can be repurposed as needed” approach?
  4. Why did Microsoft stick with SQL Server and its known performance challenges? Google relied on MySQL for fiddling with smaller data sets, but Google pushed into data management to leapfrog certain issues, first in Web search and later in other applications running on Google servers.

I jotted down other points when I worked through a hard copy of the presentation this evening. I am tempted to map out my preliminary thoughts about how the Microsoft engine misfired at the critical point in time when Google was getting extra horsepower out of its smaller, unproven engine. I won’t because I learned this week that when I share my thoughts, my two or three readers use my Web log search engines to identify passages that show how my thinking evolves. So, no list of observations.

I don’t want to make Google the focal point of these two or three essays on this topic. I will have to reference Google, because that company poses the greatest single challenge Microsoft has faced since the days of Netscape. I won’t be able to reproduce the diagrams of Google’s architecture. These appear in my Google studies, and the publisher snarled at me today when I asked permission. Sorry.

I will make a few screen shots from the materials I locate. If a document is not identified with a copyright, I will try to have one of my researchers track down the author or at least the company from which the document came. I will be working with digital information that is about 10 years old. I know that some of the information and images I reference will be disinformation or just wrong. Please, use the comments function of this Web log to set me and my two or three readers straight.

Over the next week or so (maybe longer because I am working on a new for-fee study with my colleague in England), I want to post some ideas, some old diagrams, and some comments. Nothing gets fewer clicks than discussions of system architecture from the dark ages of online, but this is my Web log, and you don’t have to read my musings.

One reader asked me to post the documents I mention in my essays. I checked with my attorney, and I learned that I could be sued or forced to remove the documents. Some of the information in my paper files is no longer online. For example, there was an important paper on MSDN called Architectural Blueprint for Large Sites. I found two pages of my hard copy and my archived copy is corrupt. If anyone has a copy of this document–sometimes called the DNABlueprint–please, write me at seaky2000 at yahoo dot com.

Stephen Arnold, July 9, 2008

Learn Shortcuts, Not Content Excellence

July 8, 2008

Google controls the majority of Web search traffic in North America. The GOOG is good, but so far slippery fish like Yandex and Baidu remind venture capitalists and entrepreneurs that the GOOG is not flawless. But in the fat-bellied U S of A, Google reigns supreme. A Web site that is not in the Google index does not exist. A Web site that does not appear in the top five or six hits on the first page of a search results page takes a kick to the liver.

SEO or search engine optimization is a real live discipline. There are hundreds of companies offering a wide range of services. Some are focused on getting a Web site to comply with Google’s Web master guidelines. Others offer exotic techniques to take a plain Bill Web site and dress him up to look like George Clooney. The idea is that George Clooney Web sites get more eyeballs than just plain Bill Web sites.

Disclaimer: I avoid SEO like I avoid dark alleys, my grade school playground at 3 am, and a yard full of mistreated pit bulls.

If you want your just plain Bill Web site to appear higher in a Google results list, there is now a full day course “Optimizing for Universal Search” for you. You can read the full description of the training class here. The instructors are search engine optimization experts, Greg Jarboe of SEO-PR and Amanda Watlington of Searching for Profit.

My experience in SEO is non existent, but I did grind through some of the information on this arcane practice for a section in my 2005 The Google Legacy. I tracked down about 100 factors that appear to have had bearing on how a Web site gets to the top of a Google results list. I received more inquiries about this table from crazed people who had to get site traffic than about any other topic in the monograph. Like the angry goose which is my logo, I deleted any discussion of SEO in my subsequent Google writings and presentations. I am not interested in helping 20-somethings make a Web site “popular”. I learned enough in my review of SEO to formulate these views:

  • Most Web sites have little substantive content. I am delighted that these Web sites do not appear high in my queries’ search results.
  • Google spends quite a bit of energy trying to stay one step ahead of SEO wizards who fool PageRank. If I were smart enough to be a Google engineer and unlucky enough to be assigned to the team trying to deal with SEO spoofers’ tricks, I would probably honk loudly in annoyance. What a waste of mental work. Honk, honk.
  • SEO firms charge big bucks to help Bill become George Clooney, a former Maysville, Kentucky resident I must add. Some of the customers are having a hard time differentiating the services, figuring out if the fixes actually work, and trying to retrofit Web sites to deal with crazy urls pumped out by equally crazy content management systems.

What my research revealed is that content really helps a Web site get a high Google ranking. The theory is proven by my own tests. Put up content that people find interesting, and those people–fueled by their own motives–link to the good, interesting, or useful information. Over time, content wins. Metatag strategies, weird indexing, and child pages stuffed with recycled text lose out to substantive content.
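
The mechanism behind “content wins” is link analysis. Here is a deliberately simplified PageRank-style iteration on a made-up three-page graph, nothing like Google’s production ranking: the page other people choose to link to floats to the top, and the doorway page that no one links to does not.

```python
# A toy link graph: page -> pages it links to. Names are illustrative.
links = {
    "substantive-essay": ["related-study"],
    "related-study":     ["substantive-essay"],
    "seo-doorway-page":  ["substantive-essay"],   # links out, but nobody links in
}

def simple_pagerank(graph, damping=0.85, iterations=50):
    pages = list(graph)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in graph.items():
            share = rank[page] / max(len(outlinks), 1)
            for target in outlinks:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

print(simple_pagerank(links))  # the pages people link to float to the top
```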

results

Read more
