Microsoft Fast for Portals
August 17, 2009
Author’s Note: The images in this Web log post are the property of Microsoft Corp. I am capturing my opinion based on a client’s request to provide feedback about “going with Fast for SharePoint” versus a third party solution from a Microsoft Certified Partner. If you want happy thoughts about Microsoft, Fast ESP, and search in SharePoint environments, look elsewhere. If you want my opinions, read on. Your mileage may vary. If you have questions about how the addled goose approaches these write ups, check out the editorial policy here.
Introduction
Portals are back. The idea that a browser provides a “door” to information and applications is hot again. I think. You can view a video called “FAST: Building Search Driven Portals with Microsoft Office SharePoint Server 2007 and Microsoft Silverlight” to get the full story. I went back through my SharePoint search links. I focused on a presentation given in 2008 by two Microsoft Fast engineers, Jan Helge Sageflåt and Stein Danielsen.
After watching the presentation for a second time, I formed several impressions of what seems to be the general thrust of the Microsoft Fast ESP search system. I have heard reports that Microsoft is doing a full court press to get Microsoft-centric organizations to use Fast ESP as the industrial strength search system.
Let me make several observations about the presentation by the Microsoft Fast engineers and then conclude with a suggestion that caution and prudence may be fine dinner companions before one feasts on Fast ESP. Portals are not a substitute for making it easy for employees to locate the item of information needed to answer a run-of-the-mill business information need.
Observations about the 2008 Demo
First, the presentation focuses on building interfaces and making connections to content in SharePoint. Most organizations want to connect to the content scattered on servers, file systems, and enterprise application software data stores. That is job one, or it was until the financial meltdown. Now organizations want to acquire, merge, search, and tap into social content. Much of that information has a short shelf life. The 2008 presentation did not provide me with evidence that the Microsoft Fast ESP system could:
- Acquire large flows of non-SharePoint content
- Process that information without significant latency
- Identify the plumbing needed to handle flows of real time content from RSS feeds and the new / updated content from a SharePoint system. (A rough sketch of that kind of plumbing appears after this list.)
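As a concrete reference point, here is a minimal sketch, in Python, of the kind of feed-polling plumbing an industrial-strength system has to provide. This is my own illustration, not anything Microsoft or Fast ships; the feed list, the polling interval, and the index_document function are hypothetical placeholders.

```python
# A rough sketch of real-time content acquisition plumbing: poll RSS feeds,
# detect new items, and hand them to an indexing function. My illustration
# only; the feed URLs and index_document() are hypothetical placeholders.
import time
import urllib.request
import xml.etree.ElementTree as ET

FEEDS = ["http://example.com/news/rss.xml"]  # placeholder feed list
POLL_INTERVAL = 60  # seconds; real systems tune this per source
seen_ids = set()    # in production this state lives in a durable store

def index_document(doc):
    """Stand-in for pushing a document into the search engine's index."""
    print("Indexing:", doc["title"])

def poll_feed(url):
    """Fetch one RSS feed and yield items not seen before."""
    with urllib.request.urlopen(url, timeout=10) as response:
        tree = ET.parse(response)
    for item in tree.iter("item"):
        guid = (item.findtext("guid") or item.findtext("link") or "").strip()
        if guid and guid not in seen_ids:
            seen_ids.add(guid)
            yield {"id": guid,
                   "title": item.findtext("title", default=""),
                   "body": item.findtext("description", default="")}

if __name__ == "__main__":
    while True:
        for feed_url in FEEDS:
            try:
                for doc in poll_feed(feed_url):
                    index_document(doc)
            except Exception as err:  # latency and failures are the real issue
                print("Feed error:", feed_url, err)
        time.sleep(POLL_INTERVAL)
```

The hard parts the demo glossed over (latency, durable de-duplication, and back pressure when a SharePoint crawl dumps thousands of updated items at once) live in the state store and the error handling, not in the happy path shown here.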
Microsoft Embraces Scale
August 4, 2009
The year was 2002. A cash-rich, confused outfit paid me to write a report about Google’s database technology. In 2002, Google was a Web search company with some good buzz among the alleged wizards of Web search. Google did not have much to say when its executives gave talks. I recall an exchange between me and Larry Page at the Boston Search Engine Meeting in 1999. The topic? Truncation. Now that has real sizzle among the average Web surfer. I referenced an outfit called InQuire, which supported forward truncation. Mr. Page asserted that Google did not have to fool around with truncation. The arguments bored even those who were search experts at the Boston meeting.
I realized then that Google had some very specific methods, and those methods were not influenced by the received wisdom of search as practiced at Inktomi or Lycos, to name two big players in 2000. So I began my research by looking for differences between what Google engineers were revealing in their research papers and what the established Web search vendors were doing. I compiled a list of differences. I won’t reference my Google studies, because in today’s economic climate, few people are buying $400 studies of Google or much else for that matter.
I flipped through some of the archives I have on one of my backup devices. I did a search for the word “scale”, and I found that it was used frequently by Google engineers and also by Google managers. Scale was a big deal to Google from the days of BackRub, according to my notes. BackRub did not scale. Google, scion of BackRub, was engineered to scale.
The reason, evident to Messrs. Brin and Page in 1998, was that the operators of existing Web search systems ran out of money for the exotic hardware needed to keep pace with the two rapidly dividing cells of search: traffic and new / changed content. The stroke of genius, as I have documented in my Google studies, was that Google tackled the engineering bottlenecks. Other search companies such as Lycos lived with the input-output issues, the bottlenecks of hitting the disc for search results, and updating indexes by brute force methods. Not the Google.
Messrs. Brin and Page hired smart men and women whose job was “find a solution”. So engineers from AltaVista, Bell Labs, Sun Microsystems, and other places where bright folks get jobs worked to solve these inherent problems. Without solutions, there was zero chance that Google could avoid the fate of the Excites, the OpenText Web index, and dozens of other companies that could not grow without consuming the equivalent of the gross domestic product on hardware, disc space, bandwidth, chillers, and network devices.
Google’s brilliance (yes, brilliance) was to resolve in a cost-effective way the technical problems that were deal breakers for other search vendors. AltaVista was a pretty good search system, but it was too costly to operate. When the Alpha computers were online, you could melt iron ore, so the air conditioning bill was a killer.
Keep in mind that Google has been working on resolving bottlenecks and plumbing problems for more than 11 years.
I read “Microsoft’s Point Man on Search—Satya Nadella—Speaks: It’s a Game of Scale” and I shook my head in disbelief. Google operates at scale, but scale is a consequence of Google’s solutions to getting results without choking a system with unnecessary disc reads. Scale is a consequence of using dirt cheap hardware that is mostly controlled by smart software interacting with the operating system and the demands users and processes make on the system. Scale is a consequence of figuring out how to get heat out of a rack of servers and replacing conventional uninterruptible power supplies with on-motherboard batteries from Walgreen’s to reduce electrical demand, heat, and cost. Scale comes from creating certain proprietary bits of hardware AND software to squeeze efficiencies out of problems caused by the physics of computer operation.
If you navigate to Google and poke around, you will discover “Publications by Googlers”. I suggest that anyone interested in Google browse this list of publications. I have tried to read every Google paper, but as I age, I find I cannot keep up. The Googlers have increased their output of research into plumbing and other search arcana by a factor of 10 since I first began following Google’s technical innovations. Here’s one example to give you some context for my reaction to Mr. Nadella’s remarks, reported by All Things Digital; to wit: “Thwarting Virtual Bottlenecks in Multi-Bitrate Streaming Servers” by Bin Liu and Raju Rangaswami (academics) and Zoran Dimitrijevic (Googler). Yep, there it is in plain English—an innovation plus hard data that shows that Google’s system anticipates bottlenecks. Software makes decisions to avoid these “virtual bottlenecks.” Nice, right? The bottlenecks imposed by the way computers operate and by the laws of physics are identified BEFORE they occur. The Google system then changes its methods in order to eliminate the bottleneck. Think about that the next time you wait for Oracle to respond to a query across a terabyte set of data tables or you wait as SharePoint labors to generate a new index update. Google’s innovation is predictive analysis and automated intervention. This is one reason it is sometimes difficult to explain why a particular Web page declined in a Google set of relevance ranked results. The system, not humans, is adapting.
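To make “software makes decisions to avoid bottlenecks” a little less abstract, here is a toy sketch in Python. It is my own illustration, not the method in the “Thwarting” paper: a server samples its own queue depth, projects where the queue is heading, and switches to a cheaper response strategy before the queue saturates rather than after users start timing out.

```python
# Toy illustration of predictive, automated intervention (my sketch, not the
# algorithm in the "Thwarting" paper): watch the request queue's growth rate
# and degrade gracefully BEFORE the bottleneck arrives.
from collections import deque

class AdaptiveServer:
    def __init__(self, capacity=100):
        self.capacity = capacity
        self.history = deque(maxlen=5)   # recent queue-depth samples
        self.mode = "full"               # "full" results vs. "degraded" (cached/partial)

    def observe(self, queue_depth):
        """Record a sample and predict whether the queue will overflow soon."""
        self.history.append(queue_depth)
        if len(self.history) >= 2:
            growth = self.history[-1] - self.history[0]
            projected = queue_depth + growth          # naive linear projection
            self.mode = "degraded" if projected > 0.8 * self.capacity else "full"

    def handle(self, query):
        if self.mode == "degraded":
            return f"cached/partial answer for '{query}'"
        return f"fully ranked answer for '{query}'"

if __name__ == "__main__":
    server = AdaptiveServer()
    for depth in (10, 25, 45, 70):       # queue depth climbing fast
        server.observe(depth)
        print(depth, server.mode, server.handle("enterprise search"))
```

The point of the sketch is the ordering: the switch to degraded service happens while the queue is still well under capacity, which is the “before they occur” behavior described above.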
I understand the frustration that many Google pundits, haters, and apologists express to me. But if you take the time to read Google’s public statements about what it is doing and how it engineers its systems, the Google is quite forthcoming. The problem, as I see it, has two parts. First, Googlers write for those who understand the world as Google does. Notice the language of the “Thwarting” paper. Have you thought about multi-bitrate streaming servers in a YouTube.com type of environment? YouTube.com has lots of users and streams a lot of content. Google’s notion of clarity is on display in prose like that.
Second, very few people in the search business deal with the user loads that Google experiences. Looking up the location of one video and copying it from one computer to another is trivial. Delivering videos to a couple of million people at the same time is a different class of problem. So, why read the “Thwarting” paper? The situation described does not exist for most search companies or streaming media companies. The condition at Google is, by definition, an anomaly. Anomalies are not what make most information technology companies’ hearts go pitter-patter more quickly. Google has to solve these problems or it is not Google. A company that is not Google does not have these Google problems. Therefore, Google solves problems that are irrelevant to 99 percent of the companies in the content processing game.
Back to Mr. Nadella. This comment sums up what I call the Microsoft Yahoo search challenge:
Nadella does acknowledge in the video interview here that Microsoft has not been able to catch up with Google and talks about how that might now be possible.
I love the “might”. The thoughts that went through my mind when I worked through this multimedia article from All Things Digital were:
- Microsoft had access to similar thinking about scale in 1999. Microsoft hired former AltaVista engineers, but the Microsoft approach to data centers is a bit like the US Navy’s approach to aircraft carriers. More new stuff has been put on a design that has remained unchanged for a long time. I have written about Microsoft’s “as is” architecture in this Web log with snapshots of the approach at three points in time.
- Google has been unchallenged in search for 11 years. Google has an “as is” infrastructure capable of supporting more than 2,200 queries per second as well as handling the other modest tasks such as YouTube.com, advertising, maps, and enterprise applications. In 2002, Google had not figured out how to handle high load reads and writes because Google focused on eliminating disc reads and gating writes. Google solved that problem years ago.
- Microsoft has to integrate the Yahoo craziness into the Microsoft “as is”, aircraft carrier approach to data centers. The affection for Microsoft server products is strong, but adapting to Yahoo search innovations will require some expensive, time consuming engineering.
In short, I am delighted that Mr. Nadella has embraced scale. Google is becoming more like a tortoise, but I think there was a fable about the race between the tortoise and the hare. Google’s reflexes are slowing. The company has a truck load of legal problems. New competitors like Collecta.com are running circles around Googzilla. Nevertheless, Microsoft has to figure out the Google problem before the “going Google” campaign bleeds revenue and profits from Microsoft’s pivotal business segments.
My hunch is that Microsoft will run out of cash before dealing the GOOG a disabling blow.
Stephen Arnold, August 4, 2009
Rethinking the Microsoft Corp Search Cost Burden
July 31, 2009
I am waiting to give a talk. I have been flipping through hard copy newspapers and clicking around to see what is happening as I cool my heels. Three items caught my attention. Both the New York Times and the Wall Street Journal reported that the Yahoo deal is good for Yahoo. Maybe? What I think is that Microsoft assumed the cost burden of Yahoo’s search operation. Since my analysis of Yahoo’s search costs in 2006, I have gently reminded folks that Yahoo had a growing cost problem and its various management teams could not do much about these costs. So Yahoo focused on other matters, and few people brought the focus back to infrastructure, staff, and duplicative search systems.
Now Microsoft has assumed this burden.
I scanned John Gruber’s “Microsoft’s Long, Slow Decline”, and I noted this comment:
Microsoft remains a very profitable company. However, they have never before reported year-over-year declines like this, nor fallen so short of projected earnings. Something is awry.
Dead on. What is missing is thinking about the challenge Microsoft has in search. My thoughts are:
First, Microsoft has to make headway with its $1.2 billion investment in enterprise search. I think the uptake of SharePoint will produce some slam dunk sales. But competitors in the enterprise search sector know that SharePoint is big and fuzzy, and many Microsoft-centric companies have here-and-now problems. I think there is a real possibility for Microsoft to cut the price of Fast ESP or just give it away if the client buys enough CALs for other Microsoft products. What I wonder is, “How will Microsoft deal with the punishing engineering costs a complex content processing and search system imposes during its first six to 18 months in a licensee’s organization?” Microsoft partners may know SharePoint, but I don’t think many know Fast ESP. Then there is the R&D cost to keep pace with competitors in search, content processing, and the broader field of findability. Toss in business intelligence and you have a heck of a cost anchor.
Second, Bing.com may get access to Yahoo’s Ford F-150 filled with software. But integrating Yahoo technology with Microsoft technology is going to be expensive. There are other costs as well; for example, Microsoft bought Powerset and some legacy technology from Xerox PARC. Layer on the need for backward compatibility and you have another series of cost black holes.
Finally, there are the many different search technologies that Microsoft has deployed and must presumably rationalize. Fast ESP has a better SQL query method than SQL Server. Will customers get both SQL Server and Fast ESP, or will there be more product variants? Regardless of the path forward, there are increased costs.
Now back to Mr. Gruber’s point: a long, slow decline requires innovation and marketing of the highest order. I think the cost burden imposed by search will be difficult for Microsoft to control. Furthermore, I hypothesize:
- Google will become more aggressive in the enterprise sector. Inflicting several dozen wounds may be enough to addle Microsoft and erode its profitability in one or two core product lines.
- Google’s share of the Web search market may erode, but not overnight. The Googlers won’t stand still, and the Microsoft deal with Yahoo strikes me as chasing the Google of 2006, not the Google of 2009.
- Of the more than 200 competitors in enterprise search and content processing, I am confident that a half dozen or more will find ways to suck cash from Microsoft’s key accounts because increasingly people want solutions, not erector sets.
In short, Microsoft’s cost burdens come at a difficult time for the company. Microsoft and Yahoo managers have their work cut out for them.
Stephen Arnold, July 31, 2009
Bing, Ballmer, Bets, and Blodget
June 19, 2009
I have been quite forthright about my enjoyment of Henry Blodget’s analyses. An MBA (once high flying) wanted to introduce me to him, but the meeting got postponed, then there was a financial meltdown, and the rest you know. Mr. Blodget’s “Steve Ballmer Is Making a Bad $10 Billion Bet” is one of those Web log write ups that the Murdoch crowd and the financially challenged New York Times staff should tape to their cubicle panel. The beat-around-the-bush approach to Microsoft’s search challenge does no one any good. The excitement about early usage of Bing.com is equally unnerving because until there are several months of data, dipping into a clickstream provides snapshots, not feature-length movies.
Mr. Blodget runs down some of the history of Microsoft’s spending in the search sector. The historical estimates are hefty, but the going-forward numbers are big, even for a giant like Microsoft. Mr. Blodget wrote:
Steve has already been investing about 5%-10% of Microsoft’s operating income on the Internet for the past decade, and he has nothing to show for it.
Mr. Blodget inserts a chart with weird green bars instead of the bright red ones that the numbers warrant. Green or red, big bucks. Zero payoff. He continued:
In fact, maybe it would be more realistic (but not actually very realistic at all) to assume that Bing might make a lot less than $8 billion a year–say, $1-$2 billion a year, if it was very successful. Or that, more realistically, once Google saw that Bing was actually making some headway, it might decide to spend some or all of its own $8 billion of free cash flow a year to protect its franchise, given that Bing seemed intent on destroying it. And that, because Google already had 65% market share of the search market versus Bing’s 10% and had weathered all of Bing’s previous attacks, it might very well succeed in defending itself.
Several comments flapped through this addled goose brain of mine:
- Microsoft does not have one search problem. Microsoft has multiple search problems; for example, the desktop search, the enterprise search baked into the 100 million SharePoint installations, the SQL Server search, and the Fast Search & Transfer search system. Each of these costs time and resources. So, Mr. Blodget’s numbers probably understate the cash outflows. The police issue in Norway has a price tag too, if not in money, then in the credibility of the $1.2 billion paid for something that certainly seems dicey.
- Microsoft is constrained by its own technology. There’s lots of rah rah about Microsoft’s data centers and how sophisticated these are. The reality is that the Google has a cost advantage in this chunk of the business. My research suggests that when the Google spends $1.00, Microsoft has to spend as much as $4.00 or more to get similar performance. Another big cash outflow in my opinion.
- Google is in the leapfrog business. I have mentioned Programmable Search Engines, dataspaces, and other interesting Google technology. Even Yahoo with its problems has begun to respond to the Google leapfrog, but so far Microsoft has focused on incremental changes. While helpful, these incremental changes will end up costing more money down the line because the plumbing at Microsoft won’t scale to handle the next challenge Google creates in the online ocean.
Exciting times for Microsoft shareholders because the shares will open in about an hour at $23.50. IBM, which has been through the same terrain as Microsoft, opens at $106.33. What’s that say?
Stephen Arnold, June 19, 2009
MarkLogic: The Shift Beyond Search
June 5, 2009
Editor’s note: I gave a talk at a recent user group meeting. My actual remarks were extemporaneous, but I did prepare a narrative from which I derived my speech. I am reproducing my notes so I don’t lose track of the examples. I did not mention specific company names. The Successful Enterprise Search Management (SESM) reference is to the new study Martin White and I wrote for Galatea, a publishing company in the UK. MarkLogic paid me to show up and deliver a talk, and the addled goose wishes other companies would turn to Harrod’s Creek for similar enlightenment. MarkLogic is an interesting company because it goes “beyond search”. The firm addresses the thorny problem of information architecture. Once that issue is confronted, search, reports, repurposing, and other information transformations become much more useful to users. If you have comments or corrections to my opinions, use the comments feature for this Web log. The talk was given in early May 2009, and the Tyra Banks example is now a bit stale. Keep in mind this is my working draft, not my final talk.
Introduction
Thank you for inviting me to be at this conference. My topic is “Multi-Dimensional Content: Enabling Opportunities and Revenue.” A shorter title would be repurposing content to save and make money from information. That’s an important topic today. I want to make a reference to real time information, present two brief cases I researched, offer some observations, and then take questions.
Let me begin with a summary of an event that took place in Manhattan less than a month ago.
Real Time Information
America’s Next Top Model wanted to add some zest to its popular reality television program. The idea was to hold an audition for short models, not the lanky male and female prototypes with whom we are familiar.
The short models gathered in front of a hotel on Central Park South. In a matter of minutes, the crowd began to grow. A police cruiser stopped, and the two officers were watching a full-fledged mêlée in progress, complete with swinging shoulder bags, spike heels, and hair spray. Every combatant was five feet six inches tall or shorter.
The officers called for the SWAT team but the police were caught by surprise.
I learned in the course of the nine months of research for the new study written by Martin White (a UK-based information governance expert) and myself that a number of police and intelligence groups have embraced one of MarkLogic’s systems to prevent this type of surprise.
Real-time information flows from Twitter, Facebook, and other services are, at their core, publishing methods. The messages may be brief, less than 140 characters or about 12 to 14 words, but they pack a wallop.
MarkLogic’s slicing and dicing capabilities open new revenue opportunities.
Here’s a screenshot of the product about which we heard quite positive comments. This is MarkMail, and it makes it possible to take content from real-time systems such as mail and messaging, process it, and use that information to create opportunities.
Intelligence professionals use the slicing and dicing capabilities to generate intelligence that can save lives and reduce to some extent the type of reactive situation in which the NYPD found itself with the short models disturbance.
Financial services and consulting firms can use MarkMail to produce high value knowledge products for their clients. Publishing companies may have similar opportunities to produce high grade materials from high volume, low quality source material.
Microsoft and Search: Interface Makes Search Disappear
May 5, 2009
The Microsoft Enterprise Search Blog here published the second part of an NUI (natural user interface) essay. The article, when I reviewed it on May 4, had three comments. I found one comment as interesting as the main body of the write up. The author of the remark that caught my attention was Carl Lambrecht, Lexalytics, who commented:
The interface, and method of interaction, in searching for something which can be geographically represented could be quite different from searching for newspaper articles on a particular topic or looking up a phone number. As the user of a NUI, where is the starting point for your search? Should that differ depending on and be relevant to the ultimate object of your search? I think you make a very good point about not reverting to browser methods. That would be the easy way out and seem to defeat the point of having a fresh opportunity to consider a new user experience environment.
The Microsoft enterprise search Web log’s NUI series focuses on interface. The focus is Microsoft Surface, which allows a user to interact with information by touching and pointing. A keyboard is optional, I assume. The idea is that a person can walk up to a display and obtain information. A map of a shopping center is the example that came to my mind. I want to “see” where a store is, tap the screen, and get additional information.
This blog post referenced the Fast Forward 2009 conference and its themes. There’s a reference to EMC’s interest in the technology. The article wraps up with a statement that a different phrase may be needed to describe the NUI (natural user interface), which I mistakenly pronounced like the word ennui.
Microsoft Surface. Image Source: http://psyne.net/blog4/wp-content/uploads/2007/09/microsoftsurface.jpg
Several thoughts:
First, I think that interface is important, but the interface depends upon the underlying plumbing. A great interface sitting on top of lousy plumbing may not be able to deliver information quickly or, in some cases, present the information the user needs. I see this frequently when ad servers cannot deliver information. The user experience (UX) is degraded. I often give up and navigate elsewhere.
Content Management: Modern Mastodon in a Tar Pit, Part One
April 17, 2009
Editor’s Note: This is a discussion of the reasons why CMS continues to thrive despite the lousy financial climate. The spark for this essay was the report of strong CMS vendor revenues written by an azure chip consulting firm; that is, a high profile outfit a step or two below the Bains, McKinseys, and BCGs of this world.
Part 1: The Tar Pit and Mastodon Metaphor or You Are Stuck
PCWorld reported “Web Content Management Staying Strong in Recession” here. The author, Chris Kanaracus, wrote:
While IT managers are looking to cut costs during the recession, most aren’t looking for savings in Web content management, according to a recent Forrester Research study. Seventy-two percent of the survey’s 261 respondents said they planned to increase WCM deployments or usage this year, even as many also expressed dissatisfaction with how their projects have turned out. Nineteen percent said their implementations would remain the same, and just 3 percent planned to cut back.
When consulting firms generate data, I try to think about the data in the context of my experience. In general, weighing “statistically valid data from a consulting firm” against the wounds and bruises this addled goose gets in client work is an enjoyable exercise.
These data sort of make sense, but I think there are other factors that make CMS one of the alleged bright spots in the otherwise murky financial heavens.
La Brea, Tar, and Stuck Trapped Creatures
I remember the first time I visited the La Brea tar pits in Los Angeles. I was surprised. I had seen wellheads chugging away on the drive to a client meeting in Long Beach in the early 1970s, but I did not know there was a tar pit amidst the choked streets of the crown jewel in America’s golden west. It’s there, and I have an image of a big elephant (Mammut americanum for the detail-oriented reader) stuck in the tar. Good news for those who study the bones of extinct animals. Bad news for the elephant.
Is this a CMS vendor snagged in litigation or the hapless CMS licensee after the installation of a CMS system?
I had two separate conversations about CMS, the breezy acronym for content management systems. I can’t recall the first time I discovered that species of mastodon software, but I was familiar with the tar pits of content in organizations. Let’s set the stage, er, prep the tar pit.
Organizational Writing: An Oxymoron
Organizations produce quite a bit of information. The vast majority of this “stuff” (content objects for the detail-oriented reader) is in a constant state of churn. Think of the memos, letters, voice mails, etc., as molecules in a fast-flowing river in New Jersey. The environment is fraught with pollutants, regulators, professional garbage collection managers, and the other elements of modern civilization.
The authors of these information payloads are writing with a purpose; that is, instrumental writing. I have not encountered too many sonnets, poems, or novels in the organizational information I have had the pleasure of indexing since 1971. In the studies I worked on first at Halliburton Nuclear Utility Services and then at Booz, Allen & Hamilton, I learned that most organizational writing is not read by very many people. A big fat report on nuclear power plants had many contributors and reviewers, but most of these people focused on a particular technical aspect of a nuclear power generation system, not the big fat book. I edited the proceedings of a nuclear conference in 1972 and discovered that papers often had six or more authors. When I followed up with the “lead author” about a missing figure or an error in a wild and crazy equation, I learned that the “lead author” had zero clue about the information in the particular paragraph to which I referred.
Flash forward. Same situation today, just lots more digital content. Instrumental writing, not much accountability, and general cluelessness about the contents of a particular paragraph, figure, chart, whatever in a document.
Organizational writing is a hotchpotch of individuals with different capabilities and methods of expressing themselves. Consider an engineer or mathematician. Writing is not usually a core competency, but there are exceptions. In technical fields, there will be a large number of people who are terse to the point of being incomprehensible and a couple of folks who crank out reams of information. In an organization, volume may not correlate with “right” or “important”. A variation of this situation crops up in sales. A sales report often is structured, particularly if the company has licensed a product to force each salesperson to provide a name, address, phone number, and comments about a “contact”. The idea is that getting basic information is pretty helpful if the salesperson quits or simply refuses to fill in the blanks. Often the salesperson who won’t play ball is the guy or gal who nails a multi-million-dollar deal. The salesperson figures, “Someone will chase up the details.” The guy or gal is right. Distinct content challenges arise in the legal department. Customer support has its writing preferences, sometimes compressed to methods that make the customer quit calling.
Why CMS for Text?
The Web’s popularization as cheap marketing created a demand for software that would provide writing training wheels to those in an organization who had to contribute information to a Web site. The Web site has gained importance with each passing year since 1993 when hyperlinking poked its nose from the deep recesses of Standard Generalized Markup Language.
Customer relationship management systems really did not support authoring, editorial review, version control, and the other bits and pieces of content production. Enterprise resource planning systems manage back office and nitty gritty warehouse activities. Web content is not a core competency of these labyrinthine systems. Content systems mandated for regulatory compliance are designed to pinpoint which supplier delivered an Inconel pipe that cracked, what inspector looked at the installation, what quality assurance engineer checked the work, and what tech did the weld when the pipe was installed. Useful for compliance, but not what the Web marketing department ordered. Until recently, enterprise publishing systems were generally confined to the graphics department or the group that churned out proposals and specifications. The Web content was an aberrant content type.
Enter content management.
I recall that the first system I looked at closely was called NCompass. When I got a demo in late 1999, I recall vividly that it crashed in the brightly lit, very cheerful exhibition stand in San Jose. Reboot. Demo another function. Crash. Repeat. Microsoft acquired this puppy and integrated it into SharePoint. SharePoint has grown over time like a snowball. Here’s a diagram of the SharePoint system from www.JoiningDots.net:
SharePoint. Simplicity itself. Source: http://www.joiningdots.net/downloads/SharePoint_History.jpg
A Digital Oklahoma Land Rush
By 2001, CMS was a booming industry. In some ways, it reminded me of the case study I wrote for a client about the early days of the automobile industry. There were many small companies which over time would give way to a handful of major players. Today CMS has reached an interesting point. The auto-style aggregation has not worked out exactly like the auto industry case I researched. Before the collapse of the US auto industry in 2008, automobile manufacturing had fractured and globalized. There were holding companies making more vehicles than the US population would buy from American firms. There were vast interconnected networks of supplier subsystems and, below these, huge pipelines into more fundamental industrial sectors like chemicals, steel, and rubber.
Microsoft and Proprietary Chips
April 10, 2009
Stacey Higginbotham’s “Is Microsoft Turning Away from Commodity Server?” here reminded me of a client study I did five or six years ago. Sony was working on a proprietary chip for the PS3. IBM was involved, and I documented the graphics method, which built upon IBM technology. In short order, Microsoft and Nintendo signed up with IBM to use its generic chip design for their next-generation game devices. Sony ran into three problems. First, costs went through the roof. Sony did not have a core competency in chip design and fabrication, and that was evident even in the sketchy technical information my Overflight service dug out.
Second, the yield on chips is a tricky issue. Without getting into why a yield goes wrong, I focused on the two key factors: time and cost overruns. The costs were brutal, eventually forcing Sony to change its fabrication plans. The time is a matter of public record. Microsoft beat the PS3 to market, and Sony is starting to recover now. We’re talking years of lost revenue, not days or weeks or months.
Third, the developers were stuck in limbo. With new chips, new programming tools and procedures were needed. Without a flow of chips, developers were flying blind. The problem became critical: when the PS3 launched, developers’ grousing about the complexity of programming the new chip joined fanboys’ complaints that games were in short supply.
Compatibility, availability, and affordability joined the chorus.
Ms. Higginbotham’s article summarized what is known about Microsoft’s alleged interest in creating its own chips for its own servers. The motivator for Microsoft, if I read Ms. Higginbotham’s article correctly, is related to performance. One way to get performance is to get zippier hardware. With faster CPUs and maybe other custom chips, the performance of Microsoft software would improve more than it would by using Intel or AMD CPUs. (Google uses both.)
For me, the most interesting point in her write up was:
The issue of getting software performance to scale linearly with the addition of more cores has become a vexing problem. Plus, as data center operators look for better application performance without expending as many watts, they are experimenting with different kinds of processors that may be better-suited to a particular task, such as using graphics processors for Monte Carlo simulations.
She did not draw any parallels with the Sony chip play. I will:
- Sony’s Ken Kutaragi chip play provides a good lesson about the risks of rolling your own chips. Without a core competency across multiple disciplines, I think the chance for a misstep is high. Maybe Microsoft is just researching this topic? That’s prudent. A jump into a proprietary chip may come, but some ramp-up may be needed.
- Google does many proprietary things. The performance of Google’s system is not the result of a crash project. Time is of the essence because the GOOG is gaining momentum, not losing it. Therefore, the kind of “time problem” Sony had relative to the Xbox may, for Microsoft, translate into years of lost opportunity. Chip designs are running into fundamental laws of physics, so software solutions may reduce the development time.
- The performance problem will not be resolved by faster hardware. Multiple changes are needed across the computing system. There are programming slowdowns because tools have to generate zippy code for high-speed devices. Most of the slowdowns are not caused by calculations. Moving data is the problem. (A toy illustration appears after this list.) Inefficient designs and code combine with known bottlenecks to choke high-performance systems, including those at Google. As the volume of data increases, the plumbing has to be scalable, stable, and dirt cheap. Performance problems are complex and expensive to resolve. Fixes often don’t work, which makes the system slower. Nice, right? Need more data? Ask a SharePoint administrator about the cost and payoff of her last SharePoint scaling exercise.
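The “moving data is the problem” point is easy to demonstrate on any machine. The toy timing below is my own, machine-dependent illustration, not a benchmark of any vendor’s system: a single calculation pass over ten million values is compared with handing the same values between components ten times, where each hop copies the whole structure.

```python
# Toy, machine-dependent illustration: the arithmetic is one pass over the
# data, while the "plumbing" copies the same data on every hop between
# components. On commodity hardware the hops typically dominate.
import time

values = list(range(10_000_000))      # ten million integers standing in for index postings

start = time.perf_counter()
total = sum(values)                   # the calculation: one pass, data stays put
calc_seconds = time.perf_counter() - start

start = time.perf_counter()
hops = values
for _ in range(10):                   # the plumbing: pass the data between components
    hops = list(hops)                 # each hop copies the entire structure
move_seconds = time.perf_counter() - start

print(f"one calculation pass: {calc_seconds:.3f}s; ten data hops: {move_seconds:.3f}s")
```

The absolute numbers do not matter; the shape does. Add more hops, bigger payloads, or a network between the components, and the movement cost swamps the computation long before the CPU becomes the limit.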
My view is that one hire does not a chip fab make. Microsoft’s analysts have ample data to understand the costs of custom chip design and fabrication. Google requires immediate attention and rapid, purposeful progress on the part of Microsoft’s engineers. Time is the real enemy now. Without a meaningful competitor, Google seems to enjoy large degrees of freedom.
Stephen Arnold, April 10, 2009
Search Certification
April 1, 2009
A happy quack to the reader who told me about the new AIIM search certification program. Now that will be an interesting development. AIIM is a group anchored in the original micrographics business. The organization has morphed over the years, and it now straddles a number of different disciplines. The transition has been slow and in some cases directed by various interest groups from the content management sector and consulting world. CMS experts have produced some major problems for indexing subsystems, and the CMS vendors themselves seem to generate more problems for licensees than their systems resolve. (Click here for one example.)
This is not an April Fools’ joke.
The notion of search certification is interesting for five reasons:
First, there is no widely accepted definition of search in general or enterprise search in particular. I have documented the shift in terminology used by vendors of information retrieval and content processing systems. You can see here the lengths to which some organizations go to avoid using the word “search”, which has been devalued and overburdened in the last three or four years. The issue of definitions becomes quite important, but I suppose in the quest for revenue, providing certification in a discipline without boundaries fulfills some folks’ ambitions for revenue and influence.
Second, the basic idea of search–that is, find information–has shifted from the old command-line Boolean query to a more trophy-generation approach. Today’s systems are smart, presumably because the users are either too busy to formulate a Boolean query or view the task as irrelevant in a Twitter-choked real time search world. The notion of “showing” information to users means that a fundamental change has taken place, one which moves search to the margins of this business intelligence or faceted approach to information.
Third, the Google “I’m feeling doubly lucky” invention US2006/0230350, which I described last week at a conference in Houston, Texas, removes the need to point and click for information. The Google engineers responsible for “I’m feeling doubly lucky” free the user from doing much more than carrying a mobile device. The system monitors and predicts. The information is just there. (A toy sketch of this query-less idea appears after this list.) A certification program for this approach to search will be most interesting because at this time the knowledge to pull off “I’m feeling doubly lucky” resides at Google. If anyone certifies, I suppose it would be Google.
Fourth, search is getting ready to celebrate its 40th birthday if one uses Dr. Salton’s seminal papers as the “official” starting point for search. SQL queries, Codd style, preceded Dr. Salton’s work with text, however. But after 40 years, certification seems to be coming a bit late in the game. I can understand certification for a specific vendor’s search system–for example, SharePoint–but the notion of tackling a broader swath of this fluid, boundaryless space is logically uncomfortable for me. Others may feel more comfortable with this approach whose time apparently has come.
Finally, search is becoming a commodity, finding itself embedded and reshaped into other enterprise applications. Just as the “I’m feeling doubly lucky” approach shifts the burden of search from the user to the Google infrastructure, these embedded functions create a different problem in navigating and manipulating dataspace.
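The “I’m feeling doubly lucky” point above describes search with no query at all: the system watches context and delivers. Here is a toy sketch of that idea in the abstract. It is my own illustration, not the mechanism in the US2006/0230350 filing; the context signals, the predictor rules, and the delivery function are all hypothetical.

```python
# Toy sketch of query-less information delivery (my illustration, not the
# patent's method): read context signals, predict the likely information
# need, and push a result without the user typing anything.
import time

def get_context():
    """Pretend sensor read: a location label and the hour of day."""
    return {"location": "airport", "hour": time.localtime().tm_hour}

def predict_information_need(context):
    """Map context to a likely information need; the rules here are hypothetical."""
    if context["location"] == "airport":
        return "flight status and gate changes"
    if context["hour"] >= 18:
        return "traffic on the commute home"
    return "top news stories"

def deliver(need):
    """Stand-in for pushing results to the user's mobile device."""
    print("Delivering without a query:", need)

if __name__ == "__main__":
    deliver(predict_information_need(get_context()))
```

Certifying a person to operate a system like this is an odd exercise: the skill lives in the monitoring and prediction machinery, not in the person formulating queries.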
I applaud the association and its content management advisors for tackling search certification. My thought is that this may be an overly simplistic solution to a problem that has shifted away from the practical into the realm of the improbable.
There is a crisis in search. Certification won’t help too much in my opinion. Other skills are needed and these cannot be imparted in a boot camp or a single seminar. Martin White and I spent almost a year distilling our decades of information retrieval experience into our Successful Enterprise Search Management.
The longest journey begins with a single step. Looks like one step is about to be taken–four decades late. Just my opinion, of course. The question now becomes, “Why has no search certification process been successful in this time interval?” and “Why isn’t there a search professional association?” Any thoughts?
Stephen Arnold, March 31, 2009
Exclusive Interview with David Milward, CTO, Linguamatics
February 16, 2009
Stephen Arnold and Harry Collier interviewed David Milward, the chief technical officer of Linguamatics, on February 12, 2009. Mr. Milward will be one of the featured speakers at the April 2009 Boston Search Engine Meeting. You will find minimal search “fluff” at this important conference. The focus is upon search, information retrieval, and content processing. You will find no staffed trade show booths, no multi-track programs that distract, and no search engine optimization sessions. The Boston Search Engine Meeting is focused on substance from informed experts. More information about the premier search conference is here. Register now.
The full text of the interview with David Milward appears below:
Will you describe briefly your company and its search / content processing technology?
Linguamatics’ goal is to enable our customers to obtain intelligent answers from text – not just lists of documents. We’ve developed agile natural language processing (NLP)-based technology that supports meaning-based querying of very large datasets. Results are delivered as relevant, structured facts and relationships about entities, concepts and sentiment.
Linguamatics’ main focus is solving knowledge discovery problems faced by pharma/biotech organizations. Decision-makers need answers to a diverse range of questions from text, both published literature and in-house sources. Our I2E semantic knowledge discovery platform effectively treats that unstructured and semi-structured text as a structured, context-specific database they can query to enable decision support.
Linguamatics was founded in 2001, is headquartered in Cambridge, UK with US operations in Boston, MA. The company is privately owned, profitable and growing, with I2E deployed at most top-10 pharmaceutical companies.
What are the three major challenges you see in search / content processing in 2009?
The obvious challenges I see include:
- The ability to query across diverse high volume data sources, integrating external literature with in-house content. The latter content may be stored in collaborative environments such as SharePoint, and in a variety of formats including Word and PDF, as well as semi-structured XML.
- The need for easy and affordable access to comprehensive content such as scientific publications, and being able to plug content into a single interface.
- The demand by smaller companies for hosted solutions.
With search / content processing decades old, what have been the principal barriers to resolving these challenges in the past?
People have traditionally been able to do simple querying across multiple data sources, but there has been an integration challenge in combining different data formats, and typically the rich structure of the text or document has been lost when moving between formats.
Publishers have tended to develop their own tools to support access to their proprietary data. There is now much more recognition of the need for flexibility to apply best of breed text mining to all available content.
Potential users were reluctant to trust hosted services when queries are business-sensitive. However, hosting is becoming more common, and a considerable amount of external search is already happening using Google and, in the case of life science researchers, PubMed.
What is your approach to problem solving in search and content processing?
Our approach encompasses all of the above. We want to bring the power of NLP-based text mining to users across the enterprise – not just the information specialists. As such we’re bridging the divide between domain-specific, curated databases and search, by providing querying in context. You can query diverse unstructured and semi-structured content sources, and plug in terminologies and ontologies to give the context. The results of a query are not just documents, but structured relationships which can be used for further data mining and analysis.
Multi core processors provide significant performance boosts. But search / content processing often faces bottlenecks and latency in indexing and query processing. What’s your view on the performance of your system or systems with which you are familiar?
Our customers want scalability across the board – both in terms of the size of the document repositories that can be queried and also appropriate querying performance. The hardware does need to be compatible with the task. However, our software is designed to give valuable results even on relatively small machines.
People can have an insatiable demand for finding answers to questions – and we typically find that customers quickly want to scale to more documents, harder questions, and more users. So any text mining platform needs to be both flexible and scalable to support evolving discovery needs and maintain performance. In terms of performance, raw CPU speed is sometimes less of an issue than network bandwidth especially at peak times in global organizations.
Information governance is gaining importance. Search / content processing is becoming part of eDiscovery or internal audit procedures. What’s your view of the role of search / content processing technology in these specialized sectors?
Implementing a proactive e-Discovery capability rather than reacting to issues when they arise is becoming a strategy to minimize potential legal costs. The forensic abilities of text mining are highly applicable to this area and have an increasing role to play in both eDiscovery and auditing. In particular, the ability to search for meaning and to detect even weak signals connecting information from different sources, along with provenance, is key.
As you look forward, what are some new features / issues that you think will become more important in 2009? Where do you see a major break-through over the next 36 months?
Organizations are still challenged to maximize the value of what is already known – both in internal documents or in published literature, on blogs, and so on. Even in global companies, text mining is not yet seen as a standard capability, though search engines are ubiquitous. This is changing and I expect text mining to be increasingly regarded as best practice for a wide range of decision support tasks. We also see increasing requirements for text mining to become more embedded in employees’ workflows, including integration with collaboration tools.
Graphical interfaces and portals (now called composite applications) are making a comeback. Semantic technology can make point and click interfaces more useful. What other uses of semantic technology do you see gaining significance in 2009? What semantic considerations do you bring to your product and research activities?
Customers recognize the value of linking entities and concepts via semantic identifiers. There’s effectively a semantic engine at the heart of I2E and so semantic knowledge discovery is core to what we do. I2E is also often used for data-driven discovery of synonyms, and association of these with appropriate concept identifiers.
In the life science domain commonly used identifiers such as gene ids already exist. However, a more comprehensive identification of all types of entities and relationships via semantic web style URIs could still be very valuable.
Where can I find more information about your products, services, and research?
Please contact Susan LeBeau (susan.lebeau@linguamatics.com, tel: +1 774 571 1117) and visit www.linguamatics.com.
Stephen Arnold (ArnoldIT.com) and Harry Collier (Infonortics, Ltd.), February 16, 2009