Bing, Ballmer, Bets, and Blodget

June 19, 2009

I have been quite forthright about my enjoyment of Henry Blodget’s analyses. An MBA (once high flying) wanted to introduce me to him, but the meeting got postponed, then there was a financial meltdown, and the rest you know. Mr. Blodget’s “Steve Ballmer Is Making a Bad $10 Billion Bet” is one of those Web log write ups that the Murdoch crowd and the financially challenged New York Times’s staff should tape to their cubicle panels. The beat-around-the-bush approach to Microsoft’s search challenge does no one any good. The excitement about early usage of Bing.com is equally unnerving because until there are several months of data, dipping into a clickstream provides snapshots, not feature length movies.

Mr. Blodget runs down some of the history of Microsoft’s spending in the search sector. The historical estimates are hefty but the going forward numbers are big, even for a giant like Microsoft. Mr. Blodget wrote:

Steve has already been investing about 5%-10% of Microsoft’s operating income on the Internet for the past decade, and he has nothing to show for it.

Mr. Blodget inserts a chart with weird green bars instead of the bright red ones that the numbers warrant. Green or red, big bucks. Zero payoff. He continued:

In fact, maybe it would be more realistic (but not actually very realistic at all) to assume that Bing might make a lot less than $8 billion a year–say, $1-$2 billion a year, if it was very successful.  Or that, more realistically, once Google saw that Bing was actually making some headway, it might decide to spend some or all of its own $8 billion of free cash flow a year to protect its franchise, given that Bing seemed intent on destroying it.  And that, because Google already had 65% market share of the search market versus Bing’s 10% and had weathered all of Bing’s previous attacks, it might very well succeed in defending itself.

Several comments flapped through this addled goose brain of mine:

  • Microsoft does not have one search problem. Microsoft has multiple search problems; for example, the desktop search, the enterprise search baked into the 100 million SharePoint installations, the SQL Server search, and the Fast Search & Transfer search system. Each of these costs time and resources. So, Mr. Blodget’s numbers probably understate the cash outflows. The police investigation in Norway has a price tag as well, if not in money, then in the credibility of the $1.2 billion paid for something that certainly seems dicey.
  • Microsoft is constrained by its own technology. There’s lots of rah rah about Microsoft’s data centers and how sophisticated these are. The reality is that the Google has a cost advantage in this chunk of the business. My research suggests that when the Google spends $1.00, Microsoft has to spend as much as $4.00 or more to get similar performance. Another big cash outflow in my opinion.
  • Google is in the leapfrog business. I have mentioned Programmable Search Engines, dataspaces, and other interesting Google technology. Even Yahoo with its problems has begun to respond to the Google leapfrog, but so far Microsoft has focused on incremental changes. While helpful, these incremental changes will end up costing more money down the line because the plumbing at Microsoft won’t scale to handle the next challenge Google causes in the online ocean.

Exciting times for Microsoft shareholders because the shares will open in about an hour at $23.50. IBM, which has been through the same terrain as Microsoft, opens at $106.33. What does that say?

Stephen Arnold, June 19, 2009

MarkLogic: The Shift Beyond Search

June 5, 2009

Editor’s note: I gave a talk at a recent user group meeting. My actual remarks were extemporaneous, but I did prepare a narrative from which I derived my speech. I am reproducing my notes so I don’t lose track of the examples. I did not mention specific company names. The Successful Enterprise Search Management (SESM) reference is to the new study Martin White and I wrote for Galatea, a publishing company in the UK. MarkLogic paid me to show up and deliver a talk, and the addled goose wishes other companies would turn to Harrod’s Creek for similar enlightenment. MarkLogic is an interesting company because it goes “beyond search”. The firm addresses the thorny problem of information architecture. Once that issue is confronted, search, reports, repurposing, and other information transformations become much more useful to users. If you have comments or corrections to my opinions, use the comments feature for this Web log. The talk was given in early May 2009, and the Tyra Banks example is now a bit stale. Keep in mind this is my working draft, not my final talk.

Introduction

Thank you for inviting me to be at this conference. My topic is “Multi-Dimensional Content: Enabling Opportunities and Revenue.” A shorter title would be repurposing content to save and make money from information. That’s an important topic today. I want to make a reference to real time information, present two brief cases I researched, offer some observations, and then take questions.

Let me begin with a summary of an event that took place in Manhattan less than a month ago.

Real Time Information

America’s Next Top Model wanted to add some zest to its popular reality television program. The idea was to hold an audition for short models, not the lanky male and female prototypes with whom we are familiar.

The short models gathered in front of a hotel on Central Park South. In a matter of minutes, the crowd began to grow. A police cruiser stopped, and the two officers found themselves watching a full fledged mêlée in progress, complete with swinging shoulder bags, spike heels, and hair spray. Every combatant was five feet six inches tall or shorter.

The officers called for the SWAT team but the police were caught by surprise.

I learned in the course of the nine months of research for the new study written by Martin White (a UK based information governance expert) and me that a number of police and intelligence groups have embraced one of MarkLogic’s systems to prevent this type of surprise.

Real-time information flows from Twitter, Facebook, and other services are, at their core, publishing methods. The messages may be brief, less than 140 characters or about 12 to 14 words, but they pack a wallop.


MarkLogic’s slicing and dicing capabilities open new revenue opportunities.

Here’s a screenshot of the product about which we heard quite positive comments. This is MarkMail, and it makes it possible to take content from real-time systems such as mail and messaging, process it, and use that information to create opportunities.

Intelligence professionals use the slicing and dicing capabilities to generate intelligence that can save lives and reduce to some extent the type of reactive situation in which the NYPD found itself with the short models disturbance.

Financial services and consulting firms can use MarkMail to produce high value knowledge products for their clients. Publishing companies may have similar opportunities to produce high grade materials from high volume, low quality source material.


Microsoft and Search: Interface Makes Search Disappear

May 5, 2009

The Microsoft Enterprise Search Blog here published the second part of an NUI (natural user interface) essay. The article, when I reviewed it on May 4, had three comments. I found one comment as interesting as the main body of the write up. The author of the remark that caught my attention was Carl Lambrecht, Lexalytics, who commented:

The interface, and method of interaction, in searching for something which can be geographically represented could be quite different from searching for newspaper articles on a particular topic or looking up a phone number. As the user of a NUI, where is the starting point for your search? Should that differ depending on and be relevant to the ultimate object of your search? I think you make a very good point about not reverting to browser methods. That would be the easy way out and seem to defeat the point of having a fresh opportunity to consider a new user experience environment.

Microsoft enterprise search Web log’s NUI series focuses on interface. The focus is Microsoft Surface, which allows a user to interact with information by touching and pointing. A keyboard is optional, I assume. The idea is that a person can walk up to a display and obtain information. A map of a shopping center is the example that came to my mind. I want to “see” where a store is, tap the screen, and get additional information.

This blog post referenced the Fast Forward 2009 conference and its themes. There’s a reference to EMC’s interest in the technology. The article wraps up with a statement that a different phrase may be needed to describe the NUI (natural user interface), which I mistakenly pronounced like the word ennui.


Microsoft Surface. Image Source: http://psyne.net/blog4/wp-content/uploads/2007/09/microsoftsurface.jpg

Several thoughts:

First, I think that interface is important, but the interface depends upon the underlying plumbing. A great interface sitting on top of lousy plumbing may not be able to deliver information quickly or in some cases present the information the user needed. I see this frequently when ad servers cannot deliver information. The user experience (UX) is degraded. I often give up and navigate elsewhere.


Content Management: Modern Mastodon in a Tar Pit, Part One

April 17, 2009

Editor’s Note: This is a discussion of the reasons why CMS continues to thrive despite the lousy financial climate. The spark for this essay was the report of strong CMS vendor revenues written by an azure chip consulting firm; that is, a high profile outfit a step or two below the Bains, McKinseys, and BCGs of this world.

Part 1: The Tar Pit and Mastodon Metaphor or You Are Stuck

PCWorld reported “Web Content Management Staying Strong in Recession” here. The author, Chris Kanaracus, wrote:

While IT managers are looking to cut costs during the recession, most aren’t looking for savings in Web content management, according to a recent Forrester Research study. Seventy-two percent of the survey’s 261 respondents said they planned to increase WCM deployments or usage this year, even as many also expressed dissatisfaction with how their projects have turned out. Nineteen percent said their implementations would remain the same, and just 3 percent planned to cut back.

When consulting firms generate data, I try to think about the data in the context of my experience. In general, weighing “statistically valid data from a consulting firm” against the wounds and bruises this addled goose gets in client work is an enjoyable exercise.

These data sort of make sense, but I think there are other factors that make CMS one of the alleged bright spots in the otherwise murky financial heavens.

La Brea, Tar, and Stuck Trapped Creatures

I remember the first time I visited the La Brea tar pits in Los Angeles. I was surprised. I had seen wellheads chugging away on the drive to a client meeting in Long Beach in the early 1970s, but I did not know there was a tar pit amidst the choked streets of the crown jewel in America’s golden west. It’s there, and I have an image of a big elephant (Mammut americanum for the detail oriented reader) stuck in the tar. Good news for those who study the bones of extinct animals. Bad news for the elephant.


Is this a CMS vendor snagged in litigation or the hapless CMS licensee after the installation of a CMS system?

I had two separate conversations about CMS, the breezy acronym for content management systems. I can’t recall the first time I discovered that species of mastodon software, but I was familiar with the tar pits of content in organizations. Let’s set the stage, er, prep the tar pit.

Organizational Writing: An Oxymoron

Organizations produce quite a bit of information. The vast majority of this “stuff” (content objects for the detail oriented reader) is in a constant state of churn. Think of the memos, letters, voice mails, etc. like molecules in a fast-flowing river in New Jersey. The environment is fraught with pollutants, regulators, professional garbage collection managers, and the other elements of modern civilization.

The authors of these information payloads are writing with a purpose; that is, instrumental writing. I have not encountered too many sonnets, poems, or novels in the organizational information I have had the pleasure of indexing since 1971. In the studies I worked on first at Halliburton Nuclear Utility Services and then at Booz, Allen & Hamilton, I learned that most organizational writing is not read by very many people. A big fat report on nuclear power plants had many contributors and reviewers, but most of these people focused on a particular technical aspect of a nuclear power generation system, not the big fat book. I edited the proceedings of a nuclear conference in 1972, and discovered that papers often had six or more authors. When I followed up with the “lead author” about a missing figure or an error in a wild and crazy equation, I learned that the “lead author” had zero clue about the information in the particular paragraph to which I referred.

Flash forward. Same situation today, just lots more digital content. Instrumental writing, not much accountability, and general cluelessness about the contents of a particular paragraph, figure, chart, whatever in a document.

Organizational writing is a hotch potch of individuals with different capabilities and methods of expressing themselves. Consider an engineer or mathematician. Writing is not usually a core competency, but there are exceptions. In technical fields, there will be a large number of people who are terse to the point of being incomprehensible and a couple of folks who crank out reams of information. In an organization, volume may not correlate with “right” or “important”. A variation of this situation crops up in sales. A sales report often is structured, particularly if the company has licensed a product to force each salesperson to provide a name, address, phone number, and comments about a “contact”. The idea is that getting basic information is pretty helpful if the salesperson quits or simply refuses to fill in the blanks. Often the salesperson who won’t play ball is the guy or gal who nails a multi million dollar deal. The salesperson figures, “Someone will chase up the details.” The guy or gal is right. Distinct content challenges arise in the legal department. Customer support has its writing preferences, sometimes compressed to methods that make the customer quit calling.

Why CMS for Text?

The Web’s popularization as cheap marketing created a demand for software that would provide writing training wheels to those in an organization who had to contribute information to a Web site. The Web site has gained importance with each passing year since 1993 when hyperlinking poked its nose from the deep recesses of Standard Generalized Markup Language.

Customer relationship management systems really did not support authoring, editorial review, version control, and the other bits and pieces of content production. Enterprise resource planning systems manage back office and nitty gritty warehouse activities. Web content is not a core competency of these labyrinthine systems. Content systems mandated for regulatory compliance are designed to pinpoint which supplier delivered an Inconel pipe that cracked, what inspector looked at the installation, what quality assurance engineer checked the work, and what tech did the weld when the pipe was installed. Useful for compliance, but not what the Web marketing department ordered. Until recently, enterprise publishing systems were generally confined to the graphics department or the group that churned out proposals and specifications. The Web content was an aberrant content type.

Enter content management.

The first system I looked at closely was called NCompass. I vividly recall that when I got a demo in late 1999, it crashed in the brightly lit, very cheerful exhibition stand in San Jose. Reboot. Demo another function. Crash. Repeat. Microsoft acquired this puppy and integrated it into SharePoint. SharePoint has grown over time like a snowball. Here’s a diagram of the SharePoint system from www.JoiningDots.net:


SharePoint. Simplicity itself. Source: http://www.joiningdots.net/downloads/SharePoint_History.jpg

A Digital Oklahoma Land Rush

By 2001, CMS was a booming industry. In some ways, it reminded me of the case study I wrote for a client about the early days of the automobile industry. There were many small companies which over time would give way to a handful of major players. Today CMS has reached an interesting point. The auto style aggregation has not worked out exactly like the auto industry case I researched. Before the collapse of the US auto industry in 2008, automobile manufacturing had fractured and globalized. There were holding companies making more vehicles than the US population would buy from American firms. There were vast interconnected networks of supplier subsystems, and below these, huge pipelines into more fundamental industrial sectors like chemicals, steel, and rubber.


Microsoft and Proprietary Chips

April 10, 2009

Stacey Higginbotham’s “Is Microsoft Turning Away from Commodity Server?” here reminded me of a client study I did five or six years ago. Sony was working on a proprietary chip for the PS3. IBM was involved, and I documented the graphics method which built upon IBM technology. In short order, Microsoft and Nintendo signed up with IBM to use its generic chip design for their next generation game devices. Sony ran into three problems. First, costs went through the roof. Sony did not have a core competency in chip design and fabrication, and that was evident even in the sketchy technical information my Overflight service dug out.

Second, the yield on chips is a tricky issue. Without getting into why a yield goes wrong, I focused on the two key factors: time and cost overruns. The costs were brutal, eventually forcing Sony to change its fabrication plans. The time is a matter of public record. Microsoft beat the PS3 to market, and Sony is starting to recover now. We’re talking years of lost revenue, not days or weeks or months.

Third, the developers were stuck in limbo. With new chips, new programming tools and procedures were needed. Without a flow of chips, developers were flying blind. The problem became critical and when the PS3 launched, the grousing of developers about the complexity of programming the new chip joined with complaints from fanboys that games were in short supply.

Compatibility, availability, and affordability joined the chorus.

Ms. Higginbotham’s article summarized what is known about Microsoft’s alleged interest in creating its own chips for its own servers. The motivator for Microsoft, if I read Ms. Higginbotham’s article correctly, is related to performance. One way to get performance is to get zippier hardware. With faster CPUs and maybe other custom chips, the performance of Microsoft software would improve more than it would by using Intel or AMD CPUs. (Google uses both.)

For me, the most interesting point in her write up was:

The issue of getting software performance to scale linearly with the addition of more cores has become a vexing problem. Plus, as data center operators look for better application performance without expending as many watts, they are experimenting with different kinds of processors that may be better-suited to a particular task, such as using graphics processors for Monte Carlo simulations.

She did not draw any parallels with the Sony chip play. I will:

  1. The Sony Ken Kutaragi chip play provides a good lesson about the risks of rolling your own chips. Without a core competency across multiple disciplines, I think the chance for a misstep is high. Maybe Microsoft is just researching this topic? That’s prudent. Jumping into a proprietary chip may come, but some ramp up may be needed.
  2. Google does many proprietary things. The performance of Google’s system is not the result of a crash project. Time is of the essence because the GOOG is gaining momentum, not losing it. Therefore, the Sony “time problem” with regard to the Xbox may translate into years of lost opportunity. Chip designs are running into fundamental laws of physics, so software solutions may reduce the development time.
  3. The performance problem will not be resolved by faster hardware. Multiple changes are needed across the computing system. There are programming slow downs because tools have to generate zippy code for high speed devices. Most of the slow downs are not caused by calculations. Moving data is the problem. Inefficient designs and code combine with known bottlenecks to choke high performance systems, including those at Google. As the volume of data increases, the plumbing has to be scalable, stable, and dirt cheap. Performance problems are complex and expensive to resolve. Fixes often don’t work, which makes the system slower. Nice, right? Need more data? Ask a SharePoint administrator about the cost and payoff of her last SharePoint scaling exercise. (A rough sketch of the data movement point appears below.)
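To make the data movement point concrete, here is a minimal, back-of-the-envelope sketch. It is not a Microsoft, Google, or SharePoint benchmark; it simply times the same toy calculation once with the data already in memory and once after pushing the data through the file system, a small scale stand-in for the plumbing problem. The file name and the data volume are placeholders I made up.

```python
import os
import tempfile
import time

N = 2_000_000
values = list(range(N))

# Case 1: the data already sits where the computation happens.
t0 = time.perf_counter()
total_in_memory = sum(values)
compute_seconds = time.perf_counter() - t0

# Case 2: move the same data through storage first, then compute.
path = os.path.join(tempfile.gettempdir(), "data_movement_sketch.txt")
with open(path, "w") as handle:
    handle.write("\n".join(str(v) for v in values))

t0 = time.perf_counter()
with open(path) as handle:
    total_from_disk = sum(int(line) for line in handle)
move_and_compute_seconds = time.perf_counter() - t0

print(f"compute only:        {compute_seconds:.3f} seconds")
print(f"move data + compute: {move_and_compute_seconds:.3f} seconds")
```

On most machines the second number dwarfs the first, and the gap widens when the storage is a network hop or a congested distributed file system rather than a local disk.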

My view is that one hire does not a chip fab make. Microsoft’s analysts have ample data to understand the costs of custom chip design and fabrication. Google requires immediate attention and rapid, purposeful progress on the part of Microsoft’s engineers. Time is the real enemy now. Without a meaningful competitor, Google seems to enjoy large degrees of freedom.

Stephen Arnold, April 10, 2009

Search Certification

April 1, 2009

A happy quack to the reader who told me about the new AIIM search certification program. Now that will be an interesting development. AIIM is a group anchored in the original micrographics business. The organization has morphed over the years, and it now straddles a number of different disciplines. The transition has been slow and in some cases directed by various interest groups from the content management sector and consulting world. CMS experts have produced some major problems for indexing subsystems, and the CMS vendors themselves seem to generate more problems for licensees than their systems resolve. (Click here for one example.)

This is not an April Fools’ joke.

The notion of search certification is interesting for five reasons:

First, there is no widely accepted definition of search in general or enterprise search in particular. I have documented the shift in terminology used by vendors of information retrieval and content processing systems. You can see the lengths here to which some organizations go to avoid using the word “search”, which has been devalued and overburdened in the last three or four years. The issue of definitions becomes quite important, but I suppose in the quest for revenue, providing certification in a discipline without boundaries fulfills some folks’ ambitions for revenue and influence.

Second, the basic idea of search–that is, find information–has shifted from the old command line Boolean to a more trophy-generation approach. Today’s systems are smart, presumably because the users are either too busy to formulate a Boolean query or view the task as irrelevant in a Twitter-choked real time search world. The notion of “showing” information to users means that a fundamental change has taken place which moves search to the margins of this business intelligence or faceted approach to information.

Third, the Google “I’m feeling doubly lucky” invention US2006/0230350, which I described last week at a conference in Houston, Texas, removes the need to point and click for information. The Google engineers responsible for “I’m feeling doubly lucky” spare the user from doing much more than carrying a mobile device. The system monitors and predicts. The information is just there. A certification program for this approach to search will be most interesting because at this time the knowledge to pull off “I’m feeling doubly lucky” resides at Google. If anyone certifies, I suppose it would be Google.

Fourth, search is getting ready to celebrate its 40th birthday if one uses Dr. Salton’s seminal papers as the “official” starting point for search. SQL queries, Codd style, preceded Dr. Salton’s work with text, however. But after 40 years, certification seems to be coming a bit late in the game. I can understand certification for a specific vendor’s search system–for example, SharePoint–but the notion of tackling a broader swath of this fluid, boundaryless space is logically uncomfortable for me. Others may feel more comfortable with this approach whose time apparently has come.

Finally, search is becoming a commodity, finding itself embedded and reshaped into other enterprise applications. Just as the “I’m feeling doubly lucky” approach shifts the burden of search from the user to the Google infrastructure, these embedded functions create a different problem in navigating and manipulating dataspace.

I applaud the association and its content management advisors for tackling search certification. My thought is that this may be an overly simplistic solution to a problem that has shifted away from the practical into the realm of the improbable.

There is a crisis in search. Certification won’t help too much in my opinion. Other skills are needed and these cannot be imparted in a boot camp or a single seminar. Martin White and I spent almost a year distilling our decades of information retrieval experience into our Successful Enterprise Search Management.

The longest journey begins with a single step. Looks like one step is about to be taken–four decades late. Just my opinion, of course. The question now becomes, “Why has no search certification process been successful in this time interval?” and “Why isn’t there a search professional association?” Any thoughts?

Stephen Arnold, March 31, 2009

Exclusive Interview with David Milward, CTO, Linguamatics

February 16, 2009

Stephen Arnold and Harry Collier interviewed David Milward, the chief technical officer of Linguamatics, on February 12, 2009. Mr. Milward will be one of the featured speakers at the April 2009 Boston Search Engine Meeting. You will find minimal search “fluff” at this important conference. The focus is upon search, information retrieval, and content processing. You will find no staffed trade show booths, no multi-track programs that distract, and no search engine optimization sessions. The Boston Search Engine Meeting is focused on substance from informed experts. More information about the premier search conference is here. Register now.

The full text of the interview with David Milward appears below:

Will you describe briefly your company and its search / content processing technology?

Linguamatics’ goal is to enable our customers to obtain intelligent answers from text – not just lists of documents.  We’ve developed agile natural language processing (NLP)-based technology that supports meaning-based querying of very large datasets. Results are delivered as relevant, structured facts and relationships about entities, concepts and sentiment.
Linguamatics’ main focus is solving knowledge discovery problems faced by pharma/biotech organizations. Decision-makers need answers to a diverse range of questions from text, both published literature and in-house sources. Our I2E semantic knowledge discovery platform effectively treats that unstructured and semi-structured text as a structured, context-specific database they can query to enable decision support.

Linguamatics was founded in 2001, is headquartered in Cambridge, UK with US operations in Boston, MA. The company is privately owned, profitable and growing, with I2E deployed at most top-10 pharmaceutical companies.


What are the three major challenges you see in search / content processing in 2009?

The obvious challenges I see include:

  • The ability to query across diverse high volume data sources, integrating external literature with in-house content. The latter content may be stored in collaborative environments such as SharePoint, and in a variety of formats including Word and PDF, as well as semi-structured XML.
  • The need for easy and affordable access to comprehensive content such as scientific publications, and being able to plug content into a single interface.
  • The demand by smaller companies for hosted solutions.

With search / content processing decades old, what have been the principal barriers to resolving these challenges in the past?

People have traditionally been able to do simple querying across multiple data sources, but there has been an integration challenge in combining different data formats, and typically the rich structure of the text or document has been lost when moving between formats.

Publishers have tended to develop their own tools to support access to their proprietary data. There is now much more recognition of the need for flexibility to apply best of breed text mining to all available content.

Potential users were reluctant to trust hosted services when queries are business-sensitive. However, hosting is becoming more common, and a considerable amount of external search is already happening using Google and, in the case of life science researchers, PubMed.

What is your approach to problem solving in search and content processing?

Our approach encompasses all of the above. We want to bring the power of NLP-based text mining to users across the enterprise – not just the information specialists.  As such we’re bridging the divide between domain-specific, curated databases and search, by providing querying in context. You can query diverse unstructured and semi-structured content sources, and plug in terminologies and ontologies to give the context. The results of a query are not just documents, but structured relationships which can be used for further data mining and analysis.
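As a toy illustration of that idea, and emphatically not Linguamatics’ I2E implementation, the sketch below uses a tiny terminology list to spot entities in free text, emits structured (subject, relation, object, source) facts, and then queries the facts instead of returning documents. The drug and protein names, the single “inhibits” pattern, and the sample documents are all hypothetical.

```python
import re
from typing import NamedTuple

DRUGS = {"aspirin", "ibuprofen"}
PROTEINS = {"cox-1", "cox-2"}

class Fact(NamedTuple):
    subject: str
    relation: str
    object: str
    source: str

def extract_facts(doc_id, text):
    """Emit structured facts for the naive pattern '<drug> inhibits <protein>'."""
    facts = []
    for m in re.finditer(r"(\w[\w-]*)\s+inhibits\s+(\w[\w-]*)", text, re.I):
        subject, obj = m.group(1).lower(), m.group(2).lower()
        if subject in DRUGS and obj in PROTEINS:
            facts.append(Fact(subject, "inhibits", obj, doc_id))
    return facts

docs = {
    "doc1": "Aspirin inhibits COX-1 in platelets.",
    "doc2": "Ibuprofen inhibits COX-2 according to this study.",
}

facts = [fact for doc_id, text in docs.items() for fact in extract_facts(doc_id, text)]

# Query the structured relationships, not the documents.
print([fact for fact in facts if fact.object == "cox-2"])
```

A real system would bring full linguistic processing, curated ontologies, and provenance to bear; the sketch only shows the shape of the output, a queryable table of relationships rather than a ranked list of documents.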

Multi core processors provide significant performance boosts. But search / content processing often faces bottlenecks and latency in indexing and query processing. What’s your view on the performance of your system or systems with which you are familiar?

Our customers want scalability across the board – both in terms of the size of the document repositories that can be queried and also appropriate querying performance.  The hardware does need to be compatible with the task.  However, our software is designed to give valuable results even on relatively small machines.

People can have an insatiable demand for finding answers to questions – and we typically find that customers quickly want to scale to more documents, harder questions, and more users. So any text mining platform needs to be both flexible and scalable to support evolving discovery needs and maintain performance.  In terms of performance, raw CPU speed is sometimes less of an issue than network bandwidth especially at peak times in global organizations.

Information governance is gaining importance. Search / content processing is becoming part of eDiscovery or internal audit procedures. What’s your view of the role of search / content processing technology in these specialized sectors?

Implementing a proactive e-Discovery capability rather than reacting to issues when they arrive is becoming a strategy to minimize potential legal costs. The forensic abilities of text mining are highly applicable to this area and have an increasing role to play in both eDiscovery and auditing. In particular, the ability to search for meaning and to detect even weak signals connecting information from different sources, along with provenance, is key.

As you look forward, what are some new features / issues that you think will become more important in 2009? Where do you see a major break-through over the next 36 months?

Organizations are still challenged to maximize the value of what is already known – both in internal documents or in published literature, on blogs, and so on.  Even in global companies, text mining is not yet seen as a standard capability, though search engines are ubiquitous. This is changing and I expect text mining to be increasingly regarded as best practice for a wide range of decision support tasks. We also see increasing requirements for text mining to become more embedded in employees’ workflows, including integration with collaboration tools.

Graphical interfaces and portals (now called composite applications) are making a comeback. Semantic technology can make point and click interfaces more useful. What other uses of semantic technology do you see gaining significance in 2009? What semantic considerations do you bring to your product and research activities?

Customers recognize the value of linking entities and concepts via semantic identifiers. There’s effectively a semantic engine at the heart of I2E and so semantic knowledge discovery is core to what we do.  I2E is also often used for data-driven discovery of synonyms, and association of these with appropriate concept identifiers.

In the life science domain commonly used identifiers such as gene ids already exist.  However, a more comprehensive identification of all types of entities and relationships via semantic web style URIs could still be very valuable.

Where can I find more information about your products, services, and research?

Please contact Susan LeBeau (susan.lebeau@linguamatics.com, tel: +1 774 571 1117) and visit www.linguamatics.com.

Stephen Arnold (ArnoldIT.com) and Harry Collier (Infonortics, Ltd.), February 16, 2009

Weird Math: Open Source Cost Estimates

February 11, 2009

IT Business Edge ran a story by Ann All called “Want More Openness in Enterprise Search? Open Source May Fill Bill?” If you are an IT person named Bill and you don’t know much about open source search, open source may turn “fill bill” into “kill Bill.” On the surface, open source offers quite a few advantages. First, there are lots of volunteers who maintain the code. The reality is that a few people carry the load and others cheerlead. For Lucene, SOLR, and other open source search systems, that works pretty well. (More about this point in a later paragraph.) Second, the “cost” of open source looks like a deal. Ms. All quotes various experts from the azure chip consulting firms and the trophy generation to buttress her arguments. I am not sure the facts in some enterprise environments line up with the assertions, but that’s the nature of folks who disguise a lack of deep understanding with buzzword cosmetics. Third, some search systems like the Google Search Appliance cost $30,000. I almost want to insert exclamation points. How outrageous. Open source costs less, specifically $18,000. Like some of the Yahoo math, this number is conceptually aligned with Jello. The license fee is not the fully burdened cost of an enterprise search system. (Keep in mind that this type of search is more appropriately called “behind the firewall search”.)

What’s the Beyond Search view of open source?

In my opinion, open source is fine when certain conditions are met; namely:

  1. The client is comfortable with scripts and familiar with the conventions of open source. Even the consulting firms supporting open source can be a trifle technical. A call for help yields an engineer who may prefer repeating Unix commands in a monotone. Good if you are on that wavelength. Not so good if you are a corporate IT manager who delegates tech stuff to contractors.
  2. The security and regulatory net thrown over an organization permits open source. Ah, you may think. Open source code is no big deal. Sorry. Open source is a big deal because some organizations have to guarantee that code used for certain projects cannot have backdoors or a murky provenance. Not us, you may think. My suggestion is that you may want to check with your lawyer who presumably has read your contracts with government agencies or the regulations governing certain businesses.
  3. The top brass understand that some functionality may not be possible until a volunteer codes up what’s needed or until your local computer contractor writes scripts. Then, you need to scurry back to your lawyer to make sure that the code and scripts are really yours. There are some strings attached to open source.

Does open source code work? Absolutely. I have few reservations tapping my pal Otto for SOLR, Charles Hull at Lemur Consulting for FLAX, or Anna Tothfalusi at Tesuji.eu for Lucene. Notice that these folks are not all in the good old US of A, which may be a consideration for some organizations. There are some open source search outfits like Lucid Imagination and specialists at various companies who can make open source search sit up and roll over.


It is just a matter of money.

Now, let’s think about the $18,000 versus the Google Search Appliance. The cost of implementing a search system breaks into some categories. License fees are in one category along with maintenance. You have to do your homework to understand that most of the big gun systems, including Google’s and others, have variable pricing in place. Indexing 500,000 documents is one type of system. Boosting that system to handle 300 million documents is another type of system.
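Here is a back-of-the-envelope sketch of what “fully burdened” can mean. Every figure below is a hypothetical placeholder, not a quote from Google, an open source support shop, or anyone else; the point is only that the license fee is one line item among several, and the totals move again when the document count jumps from 500,000 to 300 million.

```python
def burdened_cost(license_fee, integration, infrastructure,
                  annual_admin_cost, years=3, annual_maintenance_rate=0.2):
    """Rough multi-year cost of ownership for a search system (illustrative only)."""
    maintenance = license_fee * annual_maintenance_rate * years
    operations = annual_admin_cost * years
    return license_fee + maintenance + integration + infrastructure + operations

# Hypothetical comparison: a $30,000 appliance versus an $18,000 open source
# support contract that needs more integration work, hardware, and admin time.
appliance = burdened_cost(30_000, integration=15_000,
                          infrastructure=0, annual_admin_cost=10_000)
open_source = burdened_cost(18_000, integration=40_000,
                            infrastructure=12_000, annual_admin_cost=25_000)

print(f"Appliance, three years:   ${appliance:,.0f}")
print(f"Open source, three years: ${open_source:,.0f}")
```

Swap in your own numbers; the exercise matters more than my made-up inputs.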


Daniel Tunkelang: Co-Founder of Endeca Interviewed

February 9, 2009

As other search conferences gasp for the fresh air of energizing speakers, Harry Collier’s Boston Search Engine Conference (more information is here) has landed another thought-leader speaker. Daniel Tunkelang is one of the founders of Endeca. After the implosion of Convera and the buy outs of Fast Search and Verity, Endeca is one of the two flagship vendors of search, content processing, and information management systems recognized by most information technology professionals. Dr. Tunkelang writes an informative Web log, The Noisy Channel, here.


Dr. Daniel Tunkelang. Source: http://www.cs.cmu.edu/~quixote/dt.jpg

You can get a sense of Dr. Tunkelang’s views in this exclusive interview conducted by Stephen Arnold with the assistance of Harry Collier, Managing Director, Infonortics Ltd. If you want to hear and meet Dr. Tunkelang, attend the Boston Search Engine Meeting, which is focused on search and information retrieval. The Boston Search Engine Meeting is the show you may want to consider attending. All beef, no filler.


The speakers, like Dr. Tunkelang, will challenge you to think about the nature of information and the ways to deal with substantive issues, not antimacassars slapped on a problem. We interviewed Dr. Tunkelang on February 5, 2009. The full text of this interview appears below.

Tell us a bit about yourself and about Endeca.

I’m the Chief Scientist and a co-founder of Endeca, a leading enterprise search vendor. We are the largest organically grown company in our space (no preservatives or acquisitions!), and we have been recognized by industry analysts as a market and technology leader. Our hundreds of clients include household names in retail (Wal*Mart, Home Depot); manufacturing and distribution (Boeing, IBM); media and publishing (LexisNexis, World Book); financial services (ABN AMRO, Bank of America); and government (Defense Intelligence Agency, National Cancer Institute).

My own background: I was an undergraduate at MIT, double majoring in math and computer science, and I completed a PhD at CMU, where I worked on information visualization. Before joining Endeca’s founding team, I worked at the IBM T. J. Watson Research Center and AT&T Bell Labs.

What differentiates Endeca from the field of search and content processing vendors?

In web search, we type a query in a search box and expect to find the information we need in the top handful of results. In enterprise search, this approach too often breaks down. There are a variety of reasons for this breakdown, but the main one is that enterprise information needs are less amenable to the “wisdom of crowds” approach at the heart of PageRank and related approaches used for web search. As a consequence, we must get away from treating the search engine as a mind reader, and instead promote bi-directional communication so that users can effectively articulate their information needs and the system can satisfy them. The approach is known in the academic literature as human computer information retrieval (HCIR).

Endeca implements an HCIR approach by combining a set-oriented retrieval with user interaction to create an interactive dialogue, offering next steps or refinements to help guide users to the results most relevant for their unique needs. An Endeca-powered application responds to a query with not just relevant results, but with an overview of the user’s current context and an organized set of options for incremental exploration.
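To make the set-oriented, guided refinement idea concrete, here is a minimal sketch in the spirit of the HCIR approach described above. It is my illustration, not Endeca’s engine or API; the tiny catalog, the field names, and the search function are hypothetical.

```python
from collections import Counter

CATALOG = [
    {"title": "Trail running shoe", "brand": "Acme", "color": "red",  "price": 90},
    {"title": "Road running shoe",  "brand": "Acme", "color": "blue", "price": 120},
    {"title": "Hiking boot",        "brand": "Peak", "color": "red",  "price": 150},
    {"title": "Running sock",       "brand": "Peak", "color": "blue", "price": 12},
]

def search(query, selections=None):
    """Return the matching set plus facet counts that guide the next refinement."""
    selections = selections or {}
    results = [item for item in CATALOG
               if query.lower() in item["title"].lower()
               and all(item[field] == value for field, value in selections.items())]
    facets = {field: Counter(item[field] for item in results)
              for field in ("brand", "color")}
    return results, facets

# First query: a result set plus an overview of the options for narrowing it.
results, facets = search("running")
print(len(results), "results;", dict(facets["brand"]), dict(facets["color"]))

# The user picks a facet value; the set narrows and new options are computed.
results, facets = search("running", selections={"brand": "Acme"})
print([item["title"] for item in results])
```

The shape of the interaction is the point: every response carries both the matching set and an organized menu of next steps computed from that set, rather than a bare ranked list.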

What do you see as the three major challenges facing search and content processing in 2009 and beyond?

There are so many challenges! But let me pick my top three:

Social Search. While the word “social” is overused as a buzzword, it is true that content is becoming increasingly social in nature, both on the consumer web and in the enterprise. In particular, there is much appeal in the idea that people will tag content within the enterprise and benefit from each other’s tagging. The reality of social search, however, has not lived up to the vision. In order for social search to succeed, enterprise workers need to supply their proprietary knowledge in a process that is not only as painless as possible, but demonstrates the return on investment. We believe that our work at Endeca, on bootstrapping knowledge bases, can help bring about effective social search in the enterprise.

Federation.  As much as an enterprise may value its internal content, much of the content that its workers need resides outside the enterprise. An effective enterprise search tool needs to facilitate users’ access to all of these content sources while preserving the value and context of each. But federation raises its own challenges, since every repository offers different levels of access to its contents. For federation to succeed, information repositories will need to offer more meaningful access than returning the top few results for a search query.

Search is not a zero-sum game. Web search engines in general–and Google in particular–have promoted a view of search that is heavily adversarial, thus encouraging a multi-billion dollar industry of companies and consultants trying to manipulate result ranking. This arms race between search engines and SEO consultants is an incredible waste of energy for both sides, and distracts us from building better technology to help people find information.

With the rapid change in the business climate, how will the increasing financial pressure on information technology affect search and content processing?

There’s no question that information technology purchase decisions will face stricter scrutiny. But, to quote Rahm Emanuel, “Never let a serious crisis go to waste…it’s an opportunity to do things you couldn’t do before.” Stricter scrutiny is a good thing; it means that search technology will be held accountable for the value it delivers to the enterprise. There will, no doubt, be an increasing pressure to cut costs, from price pressure on vendors to substituting automated techniques for human labor. But that is how it should be: vendors have to justify their value proposition. The difference in today’s climate is that the spotlight shines more intensely on this process.

Search / content processing systems have been integrated into such diverse functions as business intelligence and customer support. Do you see search / content processing becoming increasingly integrated into enterprise applications? If yes, how will this shift affect the companies providing stand alone search / content processing solutions? If no, what do you see the role of standalone search / content processing applications becoming?

Better search is a requirement for many enterprise applications–not just BI and Call Centers, but also e-commerce, product lifecycle management, CRM, and content management.  The level of search in these applications is only going to increase, and at some point it just isn’t possible for workers to productively use information without access to effective search tools.

For stand-alone vendors like Endeca, interoperability is key. At Endeca, we are continually expanding our connectivity to enterprise systems: more connectors, leveraging data services, etc.  We are also innovating in the area of building configurable applications, which let businesses quickly deploy the right set of features for their users.  Our diverse customer base has driven us to support the diversity of their information needs, e.g., customer support representatives have very different requirements from those of online shoppers. Most importantly, everyone benefits from tools that offer an opportunity to meaningfully interact with information, rather than being subjected to a big list of results that they can only page through.

Microsoft acquired Fast Search & Transfer. SAS acquired Teragram. Autonomy acquired Interwoven and Zantaz. In your opinion, will this consolidation create opportunities or shut doors. What options are available to vendors / researchers in this merger-filled environment?

Yes!  Each acquisition changes the dynamics in the market, both creating opportunities and shutting doors at the same time.  For SharePoint customers who want to keep the number of vendors they work with to a minimum, the acquisition of FAST gives them a better starting point over Microsoft Search Server.  For FAST customers who aren’t using SharePoint, I can only speculate as to what is in store for them.

For other vendors in the marketplace, the options are:

  • Get aligned with (or acquired by) one of the big vendors and get more tightly tied into a platform stack like FAST;
  • Carve out a position in a specific segment, like we’re seeing with Autonomy and e-Discovery, or
  • Be agnostic, and serve a number of different platforms and users like Endeca or Google do.  In this group, you’ll see some cases where functionality is king, and some cases where pricing is more important, but there will be plenty of opportunities here to thrive.

Multi core processors provide significant performance boosts. But search / content processing often faces bottlenecks and latency in indexing and query processing. What’s your view on the performance of your system or systems with which you are familiar? Is performance a non issue?

Performance is absolutely a consideration, even for systems that make efficient use of hardware resources. And it’s not just about using CPU for run-time query processing: the increasing size of data collections has pushed on memory requirements; data enrichment increases the expectations and resource requirements for indexing; and richer capabilities for query refinement and data visualization present their own performance demands.

Multicore computing is the new shape of Moore’s Law: this is a fundamental consequence of the need to manage power consumption on today’s processors, which contain billions of transistors. Hence, older search systems that were not designed to exploit data parallelism during query evaluation will not scale up as hardware advances.

While tasks like content extraction, enrichment, and indexing lend themselves well to today’s distributed computing approaches, the query side of the problem is more difficult–especially in modern interfaces that incorporate faceted search, group-bys, joins, numeric aggregations, et cetera. Much of the research literature on query parallelism from the database community addresses structured, relational data, and most parallel database work has targeted distributed memory models, so existing techniques must be adapted to handle the problems of search.
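As a small illustration of data parallelism on the query side, the sketch below fans a facet-count query out to document shards and merges the partial counts. It is my toy example, not any vendor’s implementation; the shard contents and field names are made up, and a production engine would partition index segments across cores or machines rather than loop over Python lists.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

SHARDS = [
    [{"text": "merger news item", "source": "wire"},
     {"text": "earnings call transcript", "source": "filing"}],
    [{"text": "merger rumor on a blog", "source": "blog"},
     {"text": "merger filing details", "source": "filing"}],
]

def shard_facets(shard, query):
    """Each shard counts facet values over its own matching documents."""
    return Counter(doc["source"] for doc in shard if query in doc["text"])

def parallel_facets(query):
    """Fan the query out to all shards in parallel, then merge the partial counts."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(shard_facets, SHARDS, [query] * len(SHARDS))
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return merged

print(parallel_facets("merger"))  # Counter({'wire': 1, 'blog': 1, 'filing': 1})
```

The merge step is trivial for counts; group-bys, joins, and numeric aggregations need more careful combination logic, which is exactly where the adaptation of parallel database techniques comes in.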

Google has disrupted certain enterprise search markets with its appliance solution. The Google brand creates the idea in the minds of some procurement teams and purchasing agents that Google is the only or preferred search solution. What can a vendor do to adapt to this Google effect? Is Google a significant player in enterprise search, or is Google a minor player?

I think it is a mistake for the higher-end search vendors to dismiss Google as a minor player in the enterprise. Google’s appliance solution may be functionally deficient, but Google’s brand is formidable, as is its positioning of the appliance as a simple, low-cost solution. Moreover, if buyers do not understand the differences among vendor offerings, they may well be inclined to decide based on the price tag–particularly in a cost-conscious economy. It is thus more incumbent than ever on vendors to be open about what their technology can do, as well as to build a credible case for buyers to compare total cost of ownership.

Mobile search is emerging as an important branch of search / content processing. Mobile search, however, imposes some limitations on presentation and query submission. What are your views of mobile search’s impact on more traditional enterprise search / content processing?

A number of folks have noted that the design constraints of the iPhone (and of mobile devices in general) lead to an improved user experience, since site designers do a better job of focusing on the information that users will find relevant. I’m delighted to see designers striving to improve the signal-to-noise ratio in information seeking applications.

Still, I think we can take the idea much further. More efficient or ergonomic use of real estate boils down to stripping extraneous content–a good idea, but hardly novel, and making sites vertically oriented (i.e., no horizontal scrolling) is still a cosmetic change. The more interesting question is how to determine what information is best to present in the limited space–that is the key to optimizing interaction. Indeed, many of the questions raised by small screens also apply to other interfaces, such as voice. Ultimately, we need to reconsider the extreme inefficiency of ranked lists, compared to summarization-oriented approaches. Certainly the mobile space opens great opportunities for someone to get this right on the web.

Semantic technology can make point and click interfaces more useful. What other uses of semantic technology do you see gaining significance in 2009? What semantic considerations do you bring to your product and research activities?

Semantic search means different things to different people, but broadly falls into two categories: using linguistic and statistical approaches to derive meaning from unstructured text, and using semantic web approaches to represent meaning in content and query structure. Endeca embraces both of these aspects of semantic search.

From early on, we have developed an extensible framework for enriching content through linguistic and statistical information extraction. We have developed some groundbreaking tools ourselves, but have achieved even better results by combining other vendors’ document analysis tools with our unique ability to improve their results through corpus analysis.

The growing prevalence of structured data (e.g., RDF) with well-formed ontologies (e.g., OWL) is very valuable to Endeca, since our flexible data model is ideal for incorporating heterogeneous, semi-structured content. We have done this in major applications for the financial industry, media/publishing, and the federal government.

It is also important that semantic search is not just about the data. In the popular conception of semantic search, the computer is wholly responsible for deriving meaning from the unstructured input. Endeca’s philosophy, as per the HCIR vision, is that humans determine meaning, and that our job is to give them clues using all of the structure we can provide.

Where can I find more information about your products, services, and research?

Endeca’s web site is http://endeca.com/. I also encourage you to read my blog, The Noisy Channel (http://thenoisychannel.com/), where I share my ideas (as do a number of other people!) on improving the way that people interact with information.

Stephen Arnold, February 9, 2009

Enterprise Search: The Batista Madoff Syndrome

January 4, 2009

Two examples flapped around my aging mind this chilly and dark Sunday, January 4, 2009. I am not sure why I woke up with the names Batista and Madoff juxtaposed. I walked my dogs, Tess (my SharePoint expert) and Tyson (my Google Search Appliance dude). I asked, “How can experts be so wrong?” Both looked at me. Here’s a picture of their inquiring minds directing their attention toward me.


Forget Batista and Madoff. We want breakfast.

On our walk in the pre-dawn gloaming, I thought about Felix Batista. In mid-December 2008, Mr. Batista (a security consultant and anti-kidnapping expert) was kidnapped. The event was tragic, but I wondered how a kidnapping expert, in Mexico to give a talk about thwarting kidnapping, could get himself snatched that day. I was reminded of search experts recommending a system that did not work. I have been in some interesting situations where kidnapping and mortar attacks were on the morning’s agenda. I am no kidnapping or mortar blast expert. But I figured out how to avoid trouble, and I just used commonsense. I am not as well known as Felix Batista, of course, but the risk of trouble was high. I did not encounter a direct threat even though I was in a high risk situation. I wondered, “What was this expert doing in the wrong place at the wrong time anyway?” (Please, read this brief and gentle account of Mr. Batista’s travails here.)

Now Bernard Madoff, the fellow who took a Ponzi scheme to new heights. I am not concerned about Mr. Madoff. What I thought about was the headline on the dead tree version of the Wall Street Journal: “Me, Madoff and the Mind: How a Gullibility Expert Was Scammed.” Another expert, another smarter-than-me person proven to be somewhat dull. I suppose that the notions of trust, ethical behavior, and honesty get mixed into the colors of expertise and knowledge. Mr. Madoff is colored a most disturbing shade of brown.

Common Themes

What do these two unrelated incidents have in common? That was the question I pondered on my early morning walk. Let me capture my thoughts before they flap away:

First, the cult of the expert has been a big part of my work at Nuclear Utility Services (a unit of Halliburton) and Booz, Allen & Hamilton (the pre-break up and messy divorce version, thank you). Experts are easy to find in nuclear energy. A mistake can be reasonably exciting. As a result, most of the people involved in the nuclear industry (classified and unclassified versions) are careful. When errors occur, really bad things happen. The quality assurance fad did not sweep the nuclear industry. Nuclear-related work had to be correct. Get it wrong and you have Chernobyl. Nuclear is not a zero defect operation. Nothing done by humans can be. If a nuclear expert were alive, that was one easy and imperfect way to determine that the expert knew something. When nuclear experts are wrong, you get pretty spectacular problems.


Visualization of the Chernobyl radiation. Source: http://www.gearthblog.com/blog/archives/2006/04/chernobyl_radia.html

At Booz, Allen & Hamilton in the late 1970s and early 1980s, the meaning of the word “expert” was a bit softer than at Halliburton NUS. BAH (as it was then known) had individuals with what the firm called “deep industry experience”. I learned that a recent MBA qualified as an expert for some engagements. The clients were gullible or wanted to believe that Mr. Booz’s 1917 approach could work its magic for International Harvester or the Department of the Navy. Some BAH professionals had quite a bit of post graduate training in a discipline generally related to the person’s area of expertise. I am still not clear what a Ph.D. in business means. Perhaps I can ask one of Mr. Madoff’s investors this question? The problem was that a BAH expert was not like a Halliburton NUS expert. My boss–Dr. William P. Sommers–told me that Halliburton NUS was a C+ outfit. BAH, he asserted, was an A+ shop. I nodded eagerly because I knew what was required to remain a BAH professional. I did not agree then nor do I agree now. Some of the consultants from the 1970s, like consultants today, have awarded themselves the title of expert. I can point to a recent study of enterprise search as evidence that this self-propagation is practiced today as it was in 1970.

