Revisiting Jim Gray’s Three Talks 1999

July 9, 2008

Navigate to http://research.microsoft.com/~gray/Talks/Scaleability_Teminology.ppt and download a presentation by Jim Gray and his colleagues at Microsoft in 1999.

I worked through this deck again this evening (July 8, 2008) and came away with one question, “What went wrong?” This presentation identified the options for large-scale systems, what we now call cloud systems. The presentation reviews the trends in storage, memory, and CPUs. The problems associated with traditional approaches to optimizing performance are clearly identified; for example, I/O bottlenecks and overheating, among others.

Why did this one question push others from the front of my mind?

At the time Mr. Gray and his colleagues were wrestling with large-scale issues, so was Google. Microsoft had more resources than Google by orders of magnitude. Microsoft had market share, and Google had only a quirky name. Microsoft had a user base, which in 1999 was probably 90 percent of the desktops in the world plus a growing share of the server market; Windows 2000 Server was big technical news. Google had almost zero traffic, no business model, and a rented garage.

Anyone looking at Mr. Gray’s presentation would have concluded that:

  1. Microsoft’s engineers understood the problems of scaling online services.
  2. The technical options were clearly identified, understood, and quantified. Mr. Gray and his colleagues calculated the cost of storage and racks of servers, and they provided enough data to estimate how many system engineers were needed per cluster.
  3. Google’s early engineering approach had been researched and analyzed. In fact, the presentation provides a clear description of what Google was doing in the first year of the company’s existence.

Yet if we look at Microsoft and Google today, roughly a decade after Mr. Gray’s presentation, we find:

  1. Microsoft is making home-run acquisitions; for example, Fast Search & Transfer, Powerset, and most likely some type of deal with Yahoo. Google buys companies that are known only to an in-crowd in Silicon Valley; for example, Transformic.
  2. Microsoft is engaging in marketing practices that pay for traffic; Google is sucked forward by its online advertising system. Advertisers pay Google, and Google makes many of its products and services available without charge.
  3. Microsoft, almost a decade after Mr. Gray’s deck, is building massive data centers; Google continues to open new data centers, but Google is not mounting a Sputnik program, just doing business as usual.
  4. Microsoft has not been able to capture a significant share of the Web search market. Google, except in China and Russia, is pushing towards market shares in the 65 percent and higher range.

What happened?

I don’t have my data organized, but tomorrow I will start grinding through my digital and paper files for information about Microsoft’s decisions about a cloud architecture that obviously could not keep pace with Google’s. Microsoft hired Digital Equipment wizards; for example, Gordon Bell and David Cutler, among others. Google hired Jeff Dean and Sanjay Ghemawat. Both companies had access to equivalent technical information.

How could such disparity come about?

I have some ideas about what information I want to peruse; for example:

  1. What were the consequences of embracing Windows “dog food”; that is, Microsoft’s own products, not the Linux with home-grown wrappers that Google used?
  2. What were the cost implications of Microsoft’s using brand name gear from Dell and Hewlett Packard, not the commodity gear Google used?
  3. What was the impact of Microsoft’s use of tiers or layers of servers, not Google’s “everything is the same and can be repurposed as needed” approach?
  4. Why did Microsoft stick with SQL Server and its known performance challenges? Google relied on MySQL for fiddling with smaller data sets, but Google pushed into its own data management technology to leapfrog certain issues, first in Web search and later in other applications running on Google servers.

I jotted down other points when I worked through a hard copy of the presentation this evening. I am tempted to map out my preliminary thoughts about how the Microsoft engine misfired at the critical point in time when Google was getting extra horsepower out of its smaller, unproven engine. I won’t because I learned this week that when I share my thoughts, my two or three readers use my Web log search engines to identify passages that show how my thinking evolves. So, no list of observations.

I don’t want to make Google the focal point of these two or three essays on this topic. I will have to reference Google, because that company poses the greatest single challenge Microsoft has faced since the days of Netscape. I won’t be able to reproduce the diagrams of Google’s architecture. These appear in my Google studies, and the publisher snarled at me today when I asked permission. Sorry.

I will make a few screen shots from the materials I locate. If a document is not identified with a copyright, I will try to have one of my researchers track down the author or at least the company from which the document came. I will be working with digital information that is about 10 years old. I know that some of the information and images I reference will be disinformation or just wrong. Please, use the comments function of this Web log to set me and my two or three readers straight.

Over the next week or so (maybe longer because I am working on a new for-fee study with my colleague in England), I want to post some ideas, some old diagrams, and some comments. Nothing gets fewer clicks than discussions of system architecture from the dark ages of online, but this is my Web log, and you don’t have to read my musings.

One reader asked me to post the documents I mention in my essays. I checked with my attorney, and I learned that I could be sued or forced to remove the documents. Some of the information in my paper files is no longer online. For example, there was an important paper on MSDN called Architectural Blueprint for Large Sites. I found two pages of my hard copy and my archived copy is corrupt. If anyone has a copy of this document–sometimes called the DNABlueprint–please, write me at seaky2000 at yahoo dot com.

Stephen Arnold, July 9, 2008

Concept Searching for SharePoint

July 9, 2008

My SharePoint posts continue to thrill the two or three readers of this Web log. So, here’s another joy booster. You can add taxonomy navigation, concept searching, and classification functions with a snap-in from Concept Searching.

The company has offices in the UK (headquarters), McLean, Virginia (spyland), and Cape Town, South Africa. The firm’s tag line is “Retrieval Just Got Smarter,” which sums up the company’s approach to content processing quite nicely, thank you. Founded in 2002 by John Challis (CEO and CTO), the company set out to develop statistical search and classification products with a difference. The idea was to provide a method that reduced the “drift” that afflicts some statistical methods. You can download a useful fact sheet here.

The SharePoint conceptClassifier, according to the “Microsoft Enterprise Search Blog”:

adds automatic document classification and taxonomy management to Microsoft SharePoint and works without the need to build another search index. It is installed as a set of Features that, when activated, cause new columns to be displayed in the document library listings and new menu options appear that allow authorized users to edit the automatically generated metadata, if required.

To see the system in action, navigate to http://moss.conceptsearching.com. When you get to the demo screen, click on concept searching in the left-hand panel. You will be able to explore a limited set of content. Some documents return 404 errors, but you will get the idea of the system’s functionality.

Among the features the system adds to SharePoint are:

  • Automatic Classification
  • Controlled Vocabulary
  • Multiple Taxonomies
  • Folksonomies
  • Auto Clue suggestion
  • AJAX Environment
  • Document Movement
  • SQL Based

This is an impressive lineup, and you will want to test the system to make sure it meets your needs. The company, like Interse in Copenhagen, recognizes the appetite SharePoint administrators have for features that make the system more useful to SharePoint users, who number somewhere between 65 and 100 million worldwide.
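To make the “Automatic Classification” and “Controlled Vocabulary” items concrete, here is a toy sketch of the general idea: score each document against weighted “clue” terms attached to taxonomy nodes. The taxonomy, weights, and threshold below are my inventions for illustration; Concept Searching’s actual statistical method is not disclosed in the materials I reviewed.

```python
# Toy illustration only: classify a document by summing the weights of
# controlled-vocabulary "clue" terms found in its text. The taxonomy and
# weights are hypothetical, not Concept Searching's real data.
TAXONOMY = {
    "Finance": {"invoice": 2.0, "audit": 1.5, "ledger": 1.5},
    "Legal": {"contract": 2.0, "clause": 1.5, "liability": 1.5},
}

def classify(text, threshold=2.0):
    words = text.lower().split()
    matches = []
    for node, clues in TAXONOMY.items():
        score = sum(weight for term, weight in clues.items() if term in words)
        if score >= threshold:
            matches.append(node)  # enough clue evidence to tag this node
    return matches

print(classify("The audit flagged an unpaid invoice"))  # ['Finance']
```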

Stephen Arnold, July 9, 2008

Yandex: Adding Video Services, Keeping the GOOG at Bay

July 9, 2008

Russian search engine group Yandex has launched the public beta of Yandex.Video. Service users can search and share video clips online, as well as view the most popular videos. See an article about it here (http://www.telecompaper.com/news/article.aspx?cid=626524).

Yandex.Video currently searches about twenty video hosting services including youtube.com, rutube.ru, video.mail.ru, smotri.com and myvi.ru. The service’s video search method is based on analysis of names, tags, descriptions and other video clip attributes. Search results are ranked according to user ratings.

Yandex.Video continuously updates the most popular videos shown on its front page, as it receives information about new comments and new videos posted in blogs from Yandex’s Blog Search service. Service users can upload an unlimited amount of video files and create their own favorite lists. The service currently indexes over 2 million videos. Yandex is a portal with a wide variety of services, including the ubiquitous text search.

Yandex search software offers a set of tools for full-text indexing and text search that “understand” Russian and English language morphology. Beyond that, Yandex mirrors search portals like Google and Yahoo! by including things like image search, latest news and weather, maps, free mail, free web hosting, Yandex.Money (like Paypal) and much, much more. There’s a much more complete list in this article at The Search Engine Journal. (http://www.searchenginejournal.com/yandex-russian-search-engine-spotlight-4/2157/)

That article also says, “Given Yandex’s vast offering of services along with WiFi, RSS Search, and a pay system; sounds like it’s a model for Google and Yahoo to follow in terms of network building.” It sounds to me like Yandex would also be a good buy for a company looking to catch up to Google. Yandex has a diverse collection of piece parts in its repertoire, content that would be costly to reproduce. Why reinvent the wheel? Just buy it in Russian.

Jessica Bratcher, July 9, 2008

App Engine: Can Google Still Scale?

July 9, 2008

Computerworld’s Juan Carlos Perez wrote “Google Under Pressure as App Engine Requests Rise.” You can read the full story, which appeared on July 7, 2008, here. Mr. Perez summarizes the demand for Google’s hosted application development environment. There are lots of developers, and many developers want more features. Google has rolled out a useful service, but there are some sleeping policemen on the development highway.

The most important point for me in Mr. Perez’s write up was:

App Engine is for applications of the sort Google develops: Web applications with mass appeal that don’t require long-running processes to, for example, crunch scientific data. App Engine is designed for database-backed Web applications like blogs, office productivity programs and social networking wares.
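For readers who have not tried the service, here is a minimal sketch of the kind of database-backed application Mr. Perez describes, written against the Python SDK’s webapp framework and datastore; the Entry model and its fields are my invention for illustration.

```python
from google.appengine.ext import webapp, db
import wsgiref.handlers

class Entry(db.Model):
    # A hypothetical blog entry stored in the App Engine datastore.
    title = db.StringProperty()
    body = db.TextProperty()
    posted = db.DateTimeProperty(auto_now_add=True)

class MainPage(webapp.RequestHandler):
    def get(self):
        # Short, request-scoped work only; no long-running processes.
        entries = Entry.all().order('-posted').fetch(10)
        for entry in entries:
            self.response.out.write('<p>%s</p>' % entry.title)

application = webapp.WSGIApplication([('/', MainPage)], debug=True)

def main():
    wsgiref.handlers.CGIHandler().run(application)

if __name__ == '__main__':
    main()
```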

My take on this is that Google is cultivating developers and learning about developer needs, system behaviors, and which developers are Googley. Google won’t reveal how many developers are in its program, but, like the early Gmail rollout, Google is managing demand.

The issue of scaling is moot. Google responds to demand. Because the GOOG has its hands on the knobs, the company can increase or decrease functions, limits, and file caps as it wishes. Technology is not the issue; building a developer base is.

Stephen Arnold, July 9, 2008

MSFT vs GOOG: Cloud Gas War on the Horizon?

July 8, 2008

Mary Jo Foley, author of an interesting book about the “new” Version 2.0 Microsoft, wrote “Microsoft to Sell Hosted Service Subscriptions for $3 a Month”. You can read her essay here in the All about Microsoft information service. This is one of the ZDNet Web logs, which I find pretty darn useful.

Her point is the price. But the most interesting passage for me was this:

Over the past couple of years, Microsoft has been attempting to persuade its partners, especially those who’ve built businesses around hosting Microsoft software for their customers, that Microsoft isn’t going to steamroll them with its new managed-service offerings. Microsoft execs have been warning partners to get out of the plain-old hosting business and to, instead, focus on more of the value-add they can provide on top of hosted services.

I am not a Microsoft partner. I have watched as Certified Gold Partners innovated in search and add-ons for SharePoint, to cite two examples. Then, with each new release of SharePoint, the features that partners had figured out how to make work, and had educated customers to appreciate, would migrate into the SharePoint platform.

My conclusion was, “Wow, that’s pretty risky for a partner investing big bucks in a value-add, enhancement, or snap-in gizmo for a Microsoft core product.” Well, this is another example of how large companies in their quest for revenue can take actions that put pressure on partners. When I was in South Africa, one guide told me, “When elephants fight, only the grass gets trampled.” Okay, this price war may not be real, but the price cut is.

The target is Google, Salesforce.com, and other vendors who like MacBooks and OS X.

The “grass” may be pretty sparse right now, but firms thinking about getting into commodity services via the cloud may want to sharpen their pencils and revisit their business plans.

Kudos to Ms. Foley for snagging this information. I want to add three observations, which is now standard operating procedure for this Web log with two or three readers:

  1. A dollar amount play is dangerous. The major competitor has a different business model, and I think Google will use it to further provoke Redmondians.
  2. Customers don’t know whether $3 is too high, too low, or just right. Penetration of for-fee hosted services remains modest. In the enterprise, I saw figures that were in the 10 percent penetration range. This price point may become the benchmark, which, if usage spikes, could be a big cost hit for Microsoft as it rushes to scale. Google is getting beat up because some of its services are not scaling fast enough. Google’s stuff is free, which muffles the snorts. I will have to wait to see how the service scales, if it even has to scale.
  3. At least one Certified Gold Partner is making plans for life without Microsoft. I spent time with one big Gold outfit, and I thought I heard words to the effect, “To heck with them.” If this anti-Microsoft flu spreads, the result might be quite interesting competitively.

You can get another interesting view of this from ReadWriteWeb here.

Agree? Disagree? Don’t be shy.

Stephen Arnold, July 8, 2008

Lemur FLAX: Clever Search Beastie Interview

July 8, 2008

No, I did not interview a real lemur. I tracked down Charlie Hull, one of the wizards driving Lemur Consulting forward. The company makes the open source Xapian information retrieval library available as the open source FLAX search engine. Lemur, like Tesuji and dozens of other companies, has tapped the power of open source search and content processing software and crafted a successful business.

FLAX, according to Mr. Hull, scales. In an exclusive interview for ArnoldIT.com’s Search Wizards Speak series, he said:

The core technology was originally built to search a collection of 500 million Web pages, and scales easily to over four billion items. We’ve implemented indexes of 30-100 million items on a single standard server. It’s also extremely fast to search a Flax database. We routinely see sub-second retrieval times.

You can see the search system in action at MyDeco.com, a UK-based ecommerce site here.
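Because FLAX builds on Xapian, a short sketch of Xapian’s Python bindings gives a feel for what the underlying engine is doing; the database name and sample text are invented for illustration.

```python
import xapian

# Index a couple of documents in a fresh database.
db = xapian.WritableDatabase("flax-demo.db", xapian.DB_CREATE_OR_OPEN)
indexer = xapian.TermGenerator()
indexer.set_stemmer(xapian.Stem("english"))

for text in ("open source enterprise search",
             "sub-second retrieval on standard servers"):
    doc = xapian.Document()
    doc.set_data(text)        # store the raw text for display
    indexer.set_document(doc)
    indexer.index_text(text)  # add stemmed terms to the document
    db.add_document(doc)
db.flush()                    # commit the index to disk

# Query the index; stemming lets "searching" match "search".
enquire = xapian.Enquire(db)
parser = xapian.QueryParser()
parser.set_stemmer(xapian.Stem("english"))
parser.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
enquire.set_query(parser.parse_query("searching"))
for match in enquire.get_mset(0, 10):
    print(match.docid, match.document.get_data())
```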

What I found interesting is that by making FLAX available as open source, the company has generated new customers for the firm’s technical consulting and engineering services. Mr. Hull said:

Our view is that any enterprise search system will necessitate some degree of installation, integration or customization – so a customer will always pay for services. However, with open-source you don’t have to pay any license fees on top. In today’s economic climate this cost saving is more and more important. We’ve seen year-on-year growth of the business, as well as a dawning realization that our open-source approach puts the control back in the hands of the customer – you don’t have to take our word for it that the ‘black box’ of enterprise search is working, you have complete visibility and control over the search system.

Mr. Hull’s secret sauce is technical expertise. The company adds a special ingredient that keeps it on the fast track: customer service. The firm prides itself on serving its customers’ needs.

In an era when “customer support” means “Don’t bother us,” Lemur is an animal with a clever way to snare clients. You can read the full interview here.

Stephen Arnold, July 8, 2008

Autonomy etalk Bags an Industry Award

July 8, 2008

When I give my lectures, I get dinged when I point out that the high-profile enterprise search vendors are no longer in the search and retrieval business. This morning I was interviewed by a nice young journalist, and I trotted out my “search is dead” and “big name vendors are morphing into other services” themes.

Let me call your attention to why I think “search” is a silly term to apply to what people must do to access information. A heavy hitter in customer support, Technology Marketing Corporation’s (TMC) Customer Interaction Solutions magazine (http://www.cismag.com), awarded Autonomy the 2008 IP Contact Center Technology Pioneer Award. You can read the full story here. (This is a news release, and the link will go dead in a heartbeat. Click quickly, gentle reader.)

Is this search? Well, it depends on how you look at each user of the service. Here’s the official description of Autonomy etalk’s Qfiniti solution:

[A] robust and reliable IP recording for enterprise contact centers and mission critical business environments. This solution offers full customer interaction recording for compliance, risk management, and quality. Qfiniti delivers IP recording through SIP and vendor specific protocols for leading vendor platforms such as Cisco, Avaya, Genesys, Nortel, and Alcatel-Lucent. Customers that use Qfiniti IP recording benefit from streamlined architecture, global scalability, centralized management, and flexible deployment through a single, unified platform.

If you are looking for something, this is a search system. If you are trying to manage a contact center, the Autonomy system is a godsend. It puts many geese into one fenced area, making it easy to manage the unruly beasts.

Kudos to Autonomy, of course. I do want to offer several observations, apparently one of the reasons I have two or three regular readers of the Beyond Search Web log:

  1. Search is no longer the Playboy or Playgirl “model of the month”. Anyone with a year or so of computer science can download Lucene or Lemur FLAX and deploy a perfectly usable enterprise search system. Sure, you will not be able to index some documents, but most enterprise search systems are pretty erratic when it comes to indexing content, and most users will adapt despite their grousing.
  2. The problems organizations have are where big money is at stake. Let’s face it. General document indexing is a secondary, maybe a tertiary concern. Call centers and customer support are money pits. Screw up customer support, and you spend money while revenues drop. Do customer support intelligently, and you reduce costs, slow the revenue bleeding, and maybe increase revenues. So it doesn’t take an MBA from Wharton to figure out that if an organization has $700,000 to spend, a vendor who solves the customer support type of problem will get more of the available money.
  3. The issue is information access as it relates to the work employees do. Search, at least keyword search, forces employees to spend time hunting for information-filled nuggets. Wrong. Employees want answers quickly. A vendor who can show useful information access in the context of the work managers want employees to do will win contracts.

So, I am quite confident that when Autonomy wins this type of award, I have another case example to support my contention that search vendors aren’t going to be in the traditional search business much longer, not if these companies want to keep growing.

Agree? Disagree? Help me learn.

Stephen Arnold, July 8, 2008

More Transformation Goodness from the Googleplex

July 8, 2008

In press is one of my for-fee write-ups that talks about the black art of data transformation. I will let you know when it is available and where you can buy it. The subject of this for-fee “note” is one of the least exciting aspects of search and content processing. (I’m not being coy. I am prohibited from revealing the publisher of this note, the blue-chip company issuing the note, and any specific details.) What I can do is give you a hint. You will want to read this Web log post at Google Code, the Web log about Google’s open source projects and programs, here. You can read other views of this on two other Google Web logs: the Official Google Web log here and Matt Cutts’s Web log here. You will also want to read the information on the Google project page as well.

The announcement by the Googley Kenton Varda, a member of the software engineering team, is “Protocol Buffers: Google’s Data Interchange Format”. Okay, I know you are yawning, but the DIF (an acronym for something that can chew up one-third of an information technology department’s budget) is reasonably important.

The purpose of a DIF is to take content (Object A in Format X) and via the magic of a method change that content into Format Y. Along the way, some interesting things can be included in the method. For example, nasty XML can be converted into little angel XML. The problem is that XML is a fat pig format and fixing it up is computationally intensive. Google, therefore:

developed Protocol Buffers. Protocol Buffers allow you to define simple data structures in a special definition language, then compile them to produce classes to represent those structures in the language of your choice. These classes come complete with heavily-optimized code to parse and serialize your message in an extremely compact format. Best of all, the classes are easy to use: each field has simple “get” and “set” methods, and once you’re ready, serializing the whole thing to – or parsing it from – a byte array or an I/O stream just takes a single method call.

The approach is sophisticated and subtle. Google’s approach shaves with Occam’s Razor, and the approach is now available to the open source community. Why? In my opinion, this is Google’s way of cementing its role as the giant information blender. If protocol buffers catch on, a developer can slice, dice, julienne, and chop without some of the ugly, expensive, hand-coded stuff the “other guys’” approach forces on developers.
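To make Mr. Varda’s description concrete, here is a minimal sketch of the workflow, assuming a hypothetical person.proto file and the Python module the protoc compiler would generate from it; the message and its fields are my invention.

```python
# Hypothetical definition file, person.proto, in the "special definition
# language" the announcement describes:
#
#   message Person {
#     required string name  = 1;
#     required int32 id     = 2;
#     optional string email = 3;
#   }
#
# Running "protoc --python_out=. person.proto" generates person_pb2.py.

import person_pb2  # generated module; the name assumes the sketch above

person = person_pb2.Person()
person.name = "Jim Gray"                 # the simple "set" the quote mentions
person.id = 1999

wire_bytes = person.SerializeToString()  # compact binary serialization

clone = person_pb2.Person()
clone.ParseFromString(wire_bytes)        # parsing is a single method call
assert clone.name == "Jim Gray"
```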

There will be more of this type of functionality “comin’ round the mountain, when she comes,” as the song says. When the transformation express roars into your town, you will want to ride it to the Googleplex. It will work; it will be economical; and it will leapfrog a number of pitfalls developers unwittingly overlook.

Stephen Arnold, July 8, 2008

Microsoft Powerset Could Unseat Google

July 8, 2008

You may find this essay stimulating. I did. Rebecca Sato’s essay “Microsoft Acquires Powerset: Why a Semantic Web Will Be Smarter, Faster & All-Around Better” is remarkable. Please, navigate to The Daily Galaxy and get the inside scoop on the future of the Web. For example, Ms. Sato writes:

Microsoft’s acquisition of Powerset signals a the building of a future when the entire world will likely have access to virtual “software agents” who will “roam” across the Web, making our travel arrangements, doctor’s appointments and basically taking care of all the day-to-day hassles for humankind. It’s a great vision, but it will never be achieved with today’s current Internet.

My take on Ms. Sato’s thesis is that today, users must struggle with text documents that require the user to figure out what’s important. The future is smarter software, richer indexing, and more dimensionality for the information. Ms. Sato acknowledges that Powerset-type functions are in their early stages. I agree.

Let me offer two observations:

  • Smart software can be resource intensive. As a result, semantic systems may have to start small and grow as the computing resources become available. To me, this means that semantic systems may be confined to modest roles, often as utilities or special purpose operations. If this happens, semantic systems may take years to deliver on their potential.
  • Semantic technology may find itself playing catch up to search systems that use smart shortcuts. For example, user tagging may provide acceptable payoffs without the complexity and cost of semantic systems. If this happens, the search revolution may be people power, not smart software.

Agree? Disagree? Let me know.

Stephen Arnold, July 8, 2008

Microsoft Architecture in 1998

July 8, 2008

At the time Google was hatching, Microsoft had made decisions about its architecture. I was prowling through my archive of online information, and I found a reference to a document called “Challenges to Building Scalable Services: A Survey of Microsoft’s Internet Services” by Steven Levi and Galen Hunt.

I went looking for this document on July 7, 2008, and I was able to find a copy here. In terms of computing, this decade-old write-up is the equivalent of poking around a dirt hill on Mykonos looking for a pottery shard. I scanned this document and found it remarkable because it revealed the difference between Google’s engineering decisions and Microsoft’s. I believe that these differences, so clear now, contribute to Google’s lead in Web search, advertising, and some Web services.

I cannot summarize a document that runs more than 8,000 words. What I can do is identify three points that I want to explore in the months ahead:

First, Microsoft’s approach identifies hot spots resulting from read-write disc accesses. Microsoft addressed the problem using “farm pairs”. Google adopted a massively parallel, distributed setup with the Google File System and a master that pushed messaging down to workers, thus reducing message traffic.

Second, Microsoft’s approach relied on human system administrators to handle certain routine tasks; for example, dealing with server failures in server “farms”. Google decided early on to let the Google File System and other proprietary techniques deal with failures, in effect reducing the need for large numbers of system administrators and technicians in data centers.

Third, the “farm” used clones and packs of partitions running on name-brand hardware. Within the architecture, database content lived in partitions, with a “temp state” running on a separate replicated “pack”. Google did the opposite. Commodity hardware could be dynamically allocated. The Google approach slashed operating costs and added flexibility.
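As a toy contrast of the two allocation models, consider the sketch below; it reflects my reading of the Levi and Hunt paper and public descriptions of Google’s approach, not either company’s actual code.

```python
# Tiered "farm" model: each machine is dedicated to one role for life, and
# database partitions pair with replicated "packs" on name-brand hardware.
farm = {
    "web_clones": ["web1", "web2"],
    "db_partitions": ["db1", "db2"],
    "temp_state_packs": ["pack1", "pack2"],
}

# Commodity-pool model: identical boxes, assigned and reassigned on demand.
pool = ["node1", "node2", "node3", "node4"]
assignments = {}

def repurpose(node, role):
    """Any node can take any role; a failed node is simply replaced."""
    assignments[node] = role

repurpose("node1", "index-serving")
repurpose("node2", "crawl")
repurpose("node1", "ad-serving")  # same box, new job; no dedicated tier idles
print(assignments)
```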

There are more differences, and I will comment on some of them in future discussions of the differences between Google and Microsoft a decade ago. If you have an interest in how Microsoft approached online systems at the moment when Google began its rise to 70 percent market share, the Levi and Hunt document is useful.

Stephen Arnold, July 8, 2008
