The Microsoft Yahoo Fiasco: Impact on SharePoint and Web Search

May 5, 2008

You can’t look at a Web log without finding dozens of postings about Microsoft’s adventure with Yahoo. You can grind through the received wisdom on Techmeme River, a wonderful as-it-happened service. In this Web log posting, I want to recap some of my views about this remarkable digital charge at a windmill. On this cheery Monday in rural Kentucky, I can see a modern Don Quixote, who looks quite a bit like Steve Ballmer, thundering down another digital hollow.

What’s the impact on SharePoint search?

Zip. Nada. None. SharePoint search is not one thing. Read my essay about MOSS and MSS. They add up to a MESS. I’m still waiting for the well-dressed but enraged Fast Search PR wizard to shake a pointed lance at me for that opinion. Fast Search is sufficiently complex, and SharePoint sufficiently Microsoftian in its design, to make quick movement in the digital swamp all but impossible.

A T-ball player can swing at the ball until he or she gets a hit, ideally (for the parents) a home run. Microsoft, like the T-ball player in the illustration, will keep swinging for an online hit until the ball soars from the park, earning a home run and the adulation of the team.

Will Fast Search & Transfer get more attention?

Nope. Fast Search is what it is. I have commented on the long slog this acquisition represents elsewhere. An early January 2008 post provides a glimpse of the complexity that is ESP (that’s enterprise search platform, not extrasensory perception). A more recent discussion talks about the “perfect storm” of Norwegian management expertise, Microsoft’s famed product manager institution, and various technical currents, which I posted on April 26, 2008. These posts caused Fast Search’s ever-infallible PR gurus to try to cook the Beyond Search goose. The goose, a nasty bird indeed, side-stepped the charging wunderkind and his hatchet.

Will Microsoft use the Fast Search Web indexing system for Live.com search?

Now that’s a good question. But it misses the point of the “perfect storm” analysis. To rip and replace the Live.com search requires some political horse trading within Microsoft and across the research and product units. Fast Search is arguably a better Web indexing system, but it was not invented at Microsoft, and I think that may present a modest hurdle for the Norwegian management wizards.

Read more

Poking around Google Scholar Service

May 3, 2008

In May 2005, I gave a short talk at Alan Brody’s iBreakfast program. An irrepressible New Yorker, Mr. Brody invites individuals to address a hand-picked audience of movers and shakers who work in Manhattan. I reported to the venue, zoomed through a look at Google’s then-novel index of scholarly information, and sat down.

Although I was asked to address the group again in 2006 and 2007, I was a flop. The movers and shakers were hungry for information related to search engine optimization. SEO, as the practice is called, specializes in tips and tricks to spoof Google into putting a Web site on the first page of Google results for a query, and ideally in the top spot. Research and much experimentation have revealed that if a Web site isn’t on the first page of a Google results list, that Web site is a loser, at least in terms of generating traffic and, one hopes, sales.

I want to invest a few minutes taking a look at the information I discussed in 2005. If you are looking for SEO information, stop reading now; I want to explore Google Scholar. With most Americans losing interest in books and scholarly journals, you’ll be wasting your time with this essay anyway.

Google Scholar: The Unofficial View of This Google Service

Google wants to index the world’s information. Scholarly publications are a small, yet intellectually significant, portion of the world’s information. Scholarly journals are expensive and getting more costly with each passing day. Furthermore, some university libraries don’t have the budgets to keep pace with the significant books and journals that continue to flow from publishers, university presses, and some specialized not-for-profit outfits like the American Chemical Society. Google decided that indexing scholarly literature was a good idea. Google explains the service in this way:

Google Scholar provides a simple way to broadly search for scholarly literature. From one place, you can search across many disciplines and sources: peer-reviewed papers, theses, books, abstracts and articles, from academic publishers, professional societies, preprint repositories, universities and other scholarly organizations. Google Scholar helps you identify the most relevant research across the world of scholarly research.

Google offers libraries a way to make their resources available. You can read about this feature here. Publishers, with whom Google has maintained a wide range of relationships, can read about Google’s policies for this service here. My view of Google’s efforts to work with publishers is quite positive. Google is better at math than it is at donning a suit and tie and kowtowing to the mavens in Manhattan, however. Not surprisingly, Google and some publishers find one another difficult to understand. Google prefers an equation and a lemma; some publishers prefer a big vocabulary and a scotch.

What a Query Generates

Some teenager at one of the sophisticated research firms in Manhattan determined that Google users are more upscale than Yahoo users. I’m assuming that you have a college education and have endured the pain of writing a research paper for an accredited university. A mail order BA, MS, or PhD does not count; if that describes yours, stop reading this essay now.

The idea is that you select a topic from a short list of those provided by your teacher (often a graduate student or a professor with an expertise in Babylonian wheat yield or its equivalent). You trundle off to the dorm or library, and you run a query on the library’s research system. If your institution’s library has the funds, you may get access to Thomson Reuters’ databases branded as Dialog or the equivalent offerings from outfits such as LexisNexis (a unit of Reed Elsevier) or Ebsco Electronic Publishing (a unit of the privately held Elton B. Stephens Company).

Google works with these organizations, but the details of the arrangements are closely guarded secrets. No one at the giant commercial content aggregators will say what its particular relationship with Google covers. Google–per its standard Googley policy–doesn’t say much of anything, but its non-messages are delivered with great good cheer by its chipper employees.

So, let’s run a query. The ones that work quite well are those concerned with math, physics, and genetics. Babylonian wheat yields, I wish to note, are not a core interest area of the Googlers running this service.

Here’s my query today, May 3, 2008: kolmogorov theorem. If you don’t know what this canny math whiz figured out, don’t fret. For my purpose, I want to draw your attention to the results shown in the screen shot below:

[Image: Kolmogorov results]

Navigate to http://scholar.google.com and enter the bound phrase Kolmogorov Theorem.

As I write this, I am sitting with a person, Mr. Collier, who worked for Gene Garfield, the inventor of citation analysis. He was quite impressed that Google generates a hot link to other scholarly articles in the Google system that have cited a particular paper. You can access these by clicking the link. The screen shot below shows the result screen displayed by clicking on “Representation Properties of Networks”, the first item in the result list above. You can locate the citation link by looking for a phrase after the snippet that begins “Cited by…” Mr. Collier’s recollection was that Dr. Garfield, a former Bronx cab driver with two PhDs, believed that probability played a major role in determining the significance of journal articles. If a particular article were cited by reputable organizations and sources, there was a strong probability that the article was important. To sum up, citations that point to an article are votes. Dr. Garfield came up with the idea, and Messrs. Brin and Page were attentive to this insight. Mr. Page acknowledged Dr. Garfield’s idea in the PageRank patent document.
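The citation-as-vote idea can be sketched in a few lines of code. What follows is an illustrative toy, not Google’s actual algorithm: the papers and citation links are hypothetical placeholders, and the damped iteration merely echoes the spirit of the PageRank calculation.

```python
# Toy citation-vote scoring in the spirit of Garfield's insight.
# The papers (A-D) and their citation links are hypothetical examples.
citations = {
    "A": ["B", "C"],   # paper A cites B and C
    "B": ["C"],
    "C": [],
    "D": ["C", "B"],
}

def score(graph, damping=0.85, iterations=50):
    """Iteratively propagate 'votes' along citation links."""
    papers = list(graph)
    n = len(papers)
    ranks = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        new = {}
        for p in papers:
            # Each citing paper q splits its own score among the papers it cites.
            incoming = sum(
                ranks[q] / len(graph[q]) for q in papers if p in graph[q]
            )
            new[p] = (1 - damping) / n + damping * incoming
        ranks = new
    return ranks

ranks = score(citations)
top = max(ranks, key=ranks.get)  # C is cited by A, B, and D, so it scores highest
```

A citation from a highly cited paper counts for more than one from an ignored paper, which is exactly the probabilistic notion of significance attributed to Dr. Garfield above.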

Read more

FAQ: The Google Legacy and Google Version 2.0

May 2, 2008

Editor’s Note: In the last few months, we have received a number of inquiries about Infonortics’ two Google studies, both written by Stephen E. Arnold, a well-known consultant working in online search, commercial databases and related disciplines. More information about his background is on his Web site and on his Web log. This FAQ contains answers to questions we receive about The Google Legacy, published in mid-2005 and Google Version 2.0, published in the autumn of 2007.

Do I need both The Google Legacy and Google Version 2.0?

The Google Legacy provides a still-valid description of Google’s infrastructure, explanations of its file system (GFS), its Bigtable data management system (now partly accessible via Google App Engine), and other core technical features of what Mr Arnold calls “the Googleplex”; that is, Google’s server, operating system, and software environment.

Google Version 2.0 focuses on more than 18 important Google patent applications and Google patents. Mr Arnold’s approach in Google Version 2.0 is to explain specific features and functions that the Googleplex described in The Google Legacy supports. There is perhaps 5-10 percent overlap across the two volumes and the more than 400 pages of text in the two studies. More significantly, Google Version 2.0 extracts from Google’s investment in intellectual property manifested in patent documents more operational details about specific Google enabling subsystems. For example, in The Google Legacy, you learn about Bigtable. In Google Version 2.0 you learn how the programmable search engine uses Bigtable to house and manipulate context metadata about users, information, and machine processes.

You can read one book and gain useful insights into Google and its functioning as an application engine. If you read both, you will have a more fine-grained understanding of what Google’s infrastructure makes possible.

What is the focus of Google Version 2.0?

After Google’s initial public offering, the company’s flow of patent applications increased. Since Google became a publicly-traded company, the flow of patent documents has risen each year. Mr Arnold had been collecting open source documents about Google. After completing The Google Legacy, he began analysing these open source documents using different software tools. The results of these analytic passes generated data about what Google was “inventing”. When he looked at Google’s flow of beta products and the firm’s research and development investments, he was able to correlate the flow of patent documents and their subjects with Google betas, acquisitions and investments. The results of those analyses are the foundation upon which Google Version 2.0 rests. He broke new ground in Google Version 2.0 in two ways: [a] text mining provides information about Google’s technical activities and [b] he was able to identify “keystone” inventions that make it possible for Google to expand its advertising revenue and enter new markets.

Read more

Searchenstein: Pensée d’escalier

May 1, 2008

At the Boston Search Engine Meeting, I spoke with a certified search wizard-ette. As you know, my legal eagle discourages me from proper noun extraction in my Web log essays. This means I can’t name the person, nor can I provide you with the name of her employer. You will have to conjure a faceless wizard-ette from your imagination. But she’s real, very real.

Set up: the wizard-ette wanted to ask me about Lucene as an enterprise search system. But that was a nerd gambit. The real question was, “Will I be able to graft a semantic processing or text mining add-on onto Lucene and make the hybrid work?”

The answer is, “Yes, but.” Most search and content processing systems are monsters. Some are tame; others are fierce. Only a handful of enterprise search systems have been engineered to be homogeneous.

I knew this wizard-ette wasn’t enthralled with a “yes but”. She wanted a definitive, simple answer. I stumbled and fumbled. Off she drifted. This short essay, then, contains my belated pensée d’escalier.

What Is a Searchenstein?

A searchenstein is a content processing or information access system that contains a great many separate pieces. These different systems, functions, and subsystems are held together with scripts; that is, digital glue or what the code jockeys call middleware. The word middleware sounds more patrician than scripts. (In my experience, a big part of the search and retrieval business reduces to wordsmithing.)

Searchenstein is a search and content processing system cobbled together from different parts. There are several degrees of searchensteinism. There’s a core system built to a strict engineering plan and then swaddled in bastard code. Instead of working to the original engineering plan, the MBAs running the company take the easier, cheaper, and faster path. Systems from the Big Three of enterprise search are made up of different parts, often from sources that have little knowledge of or interest in the system onto which the extras will be bolted. Other vendors have an engineering plan, and the third-party components are more tastefully integrated. This is the difference between a car customization by a cash-strapped teen and the work of Los Angeles aftermarket specialists who build specialized automobiles for the super rich.

[Image: searchenstein]

This illustration shows the body parts of a searchenstein. In this type of system, it’s easy to get lost in the finger pointing when a problem occurs. Not only are the dependencies tough to figure out, but it’s also almost impossible to get one’s hands on the single throat to choke.

Another variant is to use many different components from the moment the company gets in the search and content processing business. The complexities of the system are carefully hidden, often in a “black box” or beneath a zippy interface. You can’t fiddle with the innards of the “black box.” The reason, according to the vendor, may be to protect intellectual property. Another reason is that the “black box” is easily destabilized by tinkering.

Read more

SharePoint Search: The Answers May Be Here and the Check Is in the Mail

May 1, 2008

A Microsoft wizard named Dan Blood, a senior tester in the product group responsible for search within MOSS and MSS, says that he will use the Microsoft Enterprise Search Blog “to provide details on the lessons that we [his Microsoft unit] have learned.” The topics Mr. Blood plans to cover include (and I paraphrase):

  • His actions to optimize MOSS and MSS
  • Optimizing index refreshes; that is, making sure the 28 million documents in his test set are “freshly indexed”
  • Configuring the SQL Server machine that underpins MOSS and MSS
  • Monitoring to make sure the search system is healthy.

MOSS and MSS

My hunch is that you may not know what MOSS and MSS mean. I’m no expert on things Microsoft, but let me provide my take on these search systems. MSS is an acronym for Microsoft Search Server. MOSS is an acronym for Microsoft Office SharePoint Server. MSS originated as the search subsystem from within the more comprehensive MOSS system, given a smattering of improvements, then packaged as a separate service. Microsoft plans to eventually roll these improvements back into the MOSS line.

[Image: SharePoint search architecture]

This image comes from http://sharepointsearch.com/images/searcharchitecture.gif. You can read another take on this product here.

Read more

Boston Search Engine Meeting, Day Two

April 30, 2008

The most important news on Day Two of Infonortics’ Boston Search Engine Meeting was the announcement of the “best paper awards” for 2008. The Evvie, named in honor of Ev Brenner, one of the leaders in online information systems, was established after Mr. Brenner’s death in 2006. Mr. Brenner served on the program committee for the Boston Search Engine Meeting since its inception almost 20 years ago. He had two characteristics that made his participation a signature feature of each year’s program. He was willing to tell a speaker or paper author to “add more content,” and after a presentation he would ask one or more penetrating questions that helped make a complex subject clearer.

Sponsored by ArnoldIT.com, the Evvie is an attempt to keep Mr. Brenner’s push for excellence squarely in the minds of the speakers and the conference attendees.

This year’s winners are:

  • Best paper: Charles Clarke, University of Waterloo. His paper “XML Retrieval: Problems and Potential” explained that XML (Extensible Markup Language) is no panacea. Properly used, XML systems create new ways to make search more useful to users. He received a cash prize and an engraved Evvie award.
  • Runner up: Richard Brath, Oculus, for his paper “Search, Sense-Making and Visual User Interfaces”. In this paper, Mr. Brath demonstrated that user interface becomes as important as the underlying content processing functions for search. He received an engraved Evvie award.

[Image: Evvie 2008 award presentation]

Left: Richard Brath (Oculus), center: Stephen E. Arnold (ArnoldIT.com), right: Charles Clarke (University of Waterloo).

This year’s judges were Dr. Liz Liddy, Syracuse University, Dr. David Evans, Just Systems (Tokyo), and Sue Feldman, IDC Content Technologies Group. Dr. Liddy heads the Center for Natural Language Processing. Dr. Evans, founder of Clairvoyance, is one of the foremost authorities on search. Ms. Feldman is one of the leading analysts in the search, content processing, and information access market sector. Congratulations to this year’s Evvie winners.

Read more

Boston Search Engine Meeting, Day One

April 29, 2008

The Infonortics’ meeting attracts technologists and senior managers involved in search, content processing and information access. For the full program and an overview of the topics, navigate to http://www.infonortics.com.

Summaries of the talks and versions of the PowerPoints will be available on the Infonortics’ Web site on or before May 2, 2008. I will post a news item when I have the specific link.

Background

This conference draws more PhDs per square foot than a Harvard coffee shop. Most of the presentations were delightful if you enjoy equations with your latte. In the last two years, talks about key word search have yielded to discussions about advanced text manipulation methods. What’s unique about this program is that the invited presenters talk with the same enthusiasm an undergraduate in math feels when she has been accepted into MIT’s PhD physics program.

The talks are often spiced with real world descriptions of products that anyone can use. A highlight was the ISYS Search Software presentation, which combined useful tips with a demonstration of a system that worked, no PhD required.

Several other observations are warranted:

  • Key word search and long lists of results are no longer enough. To be useful, a system has to provide suggestions, names of people, categories, and relevance thermometers
  • Users have an increasing appetite for answers combined with a discovery function
  • Systems must be usable by the people who need them to perform a task or answer a question.

Chatter at the Breaks

Chatter at the breaks was enthusiastic. In the conversations to which I was party on Monday, three topics seemed to attract some attention.

First, the acquisition of Fast Search by Microsoft was the subject of considerable speculation. Comments about the reorganization of Microsoft search under the guidance of John Lervik, one of Fast Search’s founders, sparked this comment from one attendee: “Organizing search at Microsoft is going to be a very tough job.” One person in this informal group said, “I think some if not all of the coordination may be done from Fast Search’s offices in Massachusetts and Norway.” The rejoinder offered by one individual was, “That’s going to be really difficult.”

Second, the search leader Autonomy’s share price concerned one group of attendees. The question was related to the decline in Autonomy’s share price on the heels of a strong quarterly report. No one had any specific information, but I was asked about the distribution of Autonomy’s revenue; that is, how much from core search and how much from Autonomy’s high profile units. My analysis–based on a quick reading of the quarterly report press announcements–suggests that Autonomy has some strong growth from the Zantaz unit and in other sectors such as rich media. Autonomy search plays a supporting role in these fast-growth sectors. On that basis, Autonomy may be entering a phase where the bulk of its revenue comes from system sales where search is an inclusion, not the supercharger.

Finally, there was much discussion about the need to move beyond key word search. Whether the adjustment is more sophistication “under the hood” with the user seeing suggestions or an interface solution with a range of graphic elements to provide a view of the information space, the people talking about interfaces underscored the need to [a] keep the interface simple and [b] make the information accessible. One attendee asked at the noon break, “Does anyone know if visualization can be converted to a return on investment?” No one had a case at hand, although there was some anecdotal evidence about the payoffs from visualization.

Wrap Up

The second day’s speakers are now on the stage. Stay tuned for an update.

Stephen Arnold, April 29, 2008

“Black Holes” in Enterprise Information

April 27, 2008

Yesterday–trapped once again in the nausea-inducing SEATAC Airport–I talked on the telephone with a caller concerned about problem areas in enterprise information. The issue, as I understood her comments, had to do with launching a search and retrieval system’s crawler or content acquisition “bot” and then running queries to see what was on publicly-accessible folders and servers within the organization.

My comment to her was, “You may want to perform a content inventory, do some testing of access controls, and do some very narrowly focused tests.”

Her response was one I hear frequently from 30-somethings, children of the approval culture: “Why?” These wonderful people have grown up with gold stars on lousy book reports, received “You’re a Champ” T-shirts for miserable under-10 soccer efforts, and kisses upon graduating from university with a gentle person’s “C”.

I did what I could to flash the yellow caution signal, but this call, like so many others I get, was a “tell me what I want to hear” inquiry, not a “real world” information request. The caller wanted me to say, “What a great idea!” Sorry. I’m the wrong guy for that cheerleading.

A Partial List of Black Holes

Here is my preliminary list of enterprise information “black holes”. A black hole is not well understood. My weak intellect thinks that a black hole is a giant whirlpool with radiation, crushing gravity, and the destruction of chubby old atoms such as the ones that make me the doddering fool I am. To wit:

  • School, religious, bake sale, and Girl Scout information in email and any other file formats, including Excel, image files, and applications that send email blasts
  • MP3 and other rich media files that are copyrighted, pornographic, or in some way offensive to management, colleagues, or attorneys. This includes vacation photos of overweight relatives and spouses wearing funny hats.
  • Information in email or other formats pertaining to employee compensation, health, job performance, or behavior. Think discovery. Think deposition. Think trial.
  • Any data that is at variance with other information vetted and filed at a regulatory body; for example, marked up copies of departmental budgets, laboratory reports, clinical trial data, agreements between a vendor and a manager, and similar “working draft” information. Think how you and your colleagues would look on the six o’clock news in orange jump suits.
  • Software installed or copied to a hard drive that is hacked, borrowed, or acquired from an online source not known to be free from spyware, backdoors, keyloggers, and trojans. Think big, big fine.
  • Information about defeating firewall filters or other security workarounds used to reach Web sites, information, or services that are not permitted by the firm’s security officer or by agreements between the firm and a law enforcement or intelligence entity. Think losing lucrative pork barrel goodies.
  • Information germane to a legal action that has not been provided to the firm’s legal counsel, regardless of the information holder’s role in the company or that person’s knowledge of the legal matter that makes the information pertinent. Think of hours with attorneys. Ugh. This makes me queasy just typing the words.
  • Email threads discussing behaviors of employees and their dealings with vendors, co-workers, business partners, and consultants in which non-work related topics are discussed. Think Tyco, Enron, and other business school case studies about corporate ethics.

Do you have examples of other “black holes”?

In the run up to the release of the index of the US Federal government’s public facing Web sites, I recall sitting in a meeting to discuss the test queries we were running in the summer of year 2000. My own queries surfaced some interesting information. I stumbled upon a document that when opened in an editor carried a clear statement that the document was not to be made public. The document was removed from the crawl and its index pointer deleted. My recollection is hazy, but the test queries surfaced a great deal of information that I certainly did not expect to be sitting on a publicly-accessible server.

To greater and lesser degrees, I’ve learned that test crawls that suck information into a search system almost always yield some excitement. The young, hip, enthusiastic search engine managers don’t realize the potential downside of indiscriminate “test indexing”.

Tips on How to Avoid a Black Hole

Here are my suggestions for avoiding self-destruction in an information “black hole”:

  1. Do a thorough content inventory, define a narrow test crawl, and expand the crawl on a schedule that allows time to run test queries and remove or change the access flag on problematic information
  2. Coordinate with your firm’s security and legal professionals. If you don’t have these types of employees sitting in their offices eager to help you, hire a consultant to work with you
  3. Run exhaustive test queries *before* you make the search system available to users. An alpha test followed by a slightly more expansive beta test is a useful pre-release tactic
  4. Inform your co-workers about the indexing process so they have time to expunge the grade school’s bake sale promotional literature, budget, and email list from the folders the spider will visit
  5. To avoid surprises, inform management that if problematic information turns up, the search system may be offline while the problem is rectified.

I will let you know if she calls me back.

Stephen Arnold, April 27, 2008

Microsoft Chomps and Swallows Fast

April 26, 2008

It’s official. On April 24, 2008, Fast Search & Transfer became part of the Microsoft operation. You can read the details at Digital Trends here, the InfoWorld version here, or Examiner.com’s take here.

John Lervik, the Fast Search CEO, will become a corporate vice president at Microsoft. He will report to Jeff Teper, the corporate vice president for the Office Business Platform at Microsoft. The idea–based on my understanding of the set up–is that Dr. Lervik will develop a comprehensive group of search products and services. The offerings will involve Microsoft Search Server 2008 Express, search for the Microsoft Office SharePoint Server 2007, and the Fast Enterprise Search Platform. Despite my age, I think the idea is to create a single enterprise search platform. Lucky licensees of Fast Search’s technology prior to the buyout will not be orphaned. Good news indeed, assuming the transition verbiage sets like hydrated lime, pozzolana, and aggregate. Some Roman concrete has been solid for two thousand years.

[Image: Roman concrete]

This is an example of Roman concrete. The idea of “set in stone” means that change is difficult. Microsoft has some management procedures that resist change.

A Big Job

The job is going to be a complicated one for Microsoft’s and Fast Search’s wizards.

First, Microsoft has encouraged partners to develop search solutions for its operating system, servers, and applications. The effort has been wildly successful. For example, if you are one of the more than 80 million SharePoint users, you can use search solutions from specialists like Interse in Denmark to add zip to the metadata functions of SharePoint, dtSearch to deliver lightning-fast performance with a natural language processing option, or Coveo for clustering and seamless integration. You can dial into SurfRay’s snap-in replacement for the native SharePoint search. You can turn to the ISYS Search System, which delivers fast performance, entity extraction, and other “beyond search” features. In short, there are dozens of companies that have developed solutions to address some of the native search weaknesses in SharePoint. So, one job will be handling the increased competition as the Fast Search team digs in while keeping “certified gold partners” reasonably happy.

[Image: Immortals]

This is a ceramic rendering of two of the “10,000 Immortals”. The idea is that when one Immortal is killed, another one takes his place. Microsoft’s certified gold partners–if shut out of the lucrative SharePoint aftermarket for search–may fight to keep their customers like the “10,000 Immortals”. The competitors will just keep coming until Microsoft emerges victorious.

Read more

Federation: Big Need, Still a Challenge

April 25, 2008

In May 2001, I gave a talk at one of the first Web Search Universities. The audience was baffled by my talk, which I called “Vertical Search Engines: System-Initiated Information Retrieval”. I recall that no one knew what I was talking about. Sigh. Story of my life.

Organizational Reality

Here’s the core diagram from this talk:

[Image: silo]

This is a clip art silo and it is a basic feature of the enterprise. This silo does not hold corn; it is a metaphor for the information technology department. IT operates in its own world or space. The engineers and computer wizards stick to themselves, use their own jargon, and occasionally snort at the antics of a 20-something in the marketing department.

Here’s another diagram from my 2001 lecture. This diagram shows a company as a collection of silos. I know that people in organizations are part of one big family, everyone is on the same team, and everyone is in the same foxhole. This all-too-common setup of a company appears below:

[Image: a company of silos]

Each of these silos has its own information. Even in organizations with an effective IT infrastructure, there are nooks and crannies stuffed with digital information. It may be a laptop that a manager carries back and forth, a USB drive, or a Google Search Appliance tucked in a corner of the marketing department where “competitive intelligence” is kept for the use of the marketing mavens.

Read more
