Exclusive Interview, Martin Baumgartel, From Library Automation to Search

February 23, 2009

For many years, Martin Baumgartel worked for a unit of T-Mobile. His experience spans traditional information retrieval and next-generation search. Stephen Arnold and Harry Collier interviewed Mr. Baumgartel on February 20, 2009. Mr. Baumgartel is one of the featured speakers at the premier search conference this spring, where you will be able to hear his lecture and meet with him during the networking and post-presentation breaks. The Boston Search Engine Meeting attracts the world’s brightest minds and most influential companies to an “all content” program. You can learn more about the conference, the tutorials, and the speakers at the Infonortics Ltd. Web site. Unlike other conferences, the Boston Search Engine Meeting limits attendance in order to facilitate conversations and networking. Register early for this year’s conference.

What’s your background in search?

When I entered the search arena in the 1990s, I came from library automation. Back then, it was all about indexing algorithms and relevance ranking, and I did research to develop a search engine. During eight years at T-Systems, we analyzed the situation in large enterprises in order to provide the right search solution. This increasingly included the integration of semantic technologies. Given the present hype about semantic technologies, a focus of current projects has been to determine which approach or product can deliver in specific search scenarios. A related problem is to identify the underlying principles of user-interface innovations in order to know what’s going to work (and what’s not).

What are the three major challenges you see in search / content processing in 2009?

Let me come at this in a non-technical way. There are plenty of challenges awaiting algorithmic solutions, but I see more important challenges here:

  1. Identifying the real objectives and fighting myths. Implementing internal search hasn’t become any easier for an organization today. There are numerous internal stakeholders, paired with very high user expectations (users want the same quality as Internet search, only better, more tailored to their work situation, and without advertising…). Keeping the analysis sharp becomes difficult in an orchestra of opinions, in particular when familiar brand names get involved (“Let’s just take Google internally, that will do.”)
  2. Avoiding false simplicity. Although many CIOs claim they have “cleaned up” their intranets, enterprise search remains complex, both technologically and in terms of successful management. Tackling the problem with a self-proclaimed simple solution (plug in, ready, go) will provide search, but perhaps not the search solution needed, and with hidden costs, especially in the long run. At the other extreme, a design that is too complex – with the purchase of dozens of connectors – is likely to burst your budget.
  3. Attention. Recently, I heard a lot about how the financial crisis will affect search. In my view, the effects only reinforce the challenge of drawing enough management attention to search to make sure it is treated like other core assets. Some customers might slow down the purchase of some SAP add-on modules or postpone a migration to the next version of backup software. But the status of those solutions among CIOs will remain high and unquestioned.

With search / content processing decades old, what have been the principal barriers to resolving these challenges in the past?

There’s no unique definition of the “Enterprise Search Problem” as if it were a math theorem. Therefore, you find somewhat amorphous definitions of what is to be solved. Take the scope of content to be searched: everything internal? And nothing external? Another obstacle is the widespread belief in shortcuts. A popular example: let’s just index the content in our internal content management system; the other content sources are irrelevant. That way, the concept of completeness in the search result set is sacrificed. But search can be as grueling as a marathon: you need endurance, and there are no shortcuts. If you take a shortcut, you’ve failed.

What is your approach to problem solving in search and content processing?

Smarter software, definitely, because the challenges in search (and there are more than three) are attracting programmers and innovators to come up with new solutions. But, in general, my approach is “keep your cool”: assess the situation, analyze tools and environment, design the solution, and explain it clearly. In the process, interfaces sometimes have to be improved in order to trim them down to fit the corporate intranet design.

With the rapid change in the business climate, how will the increasing financial pressure on information technology affect search / content processing?

We’ll see how far a consolidation process will go. Perhaps we’ll see discontinued search products where we initially didn’t expect it. The relationship raised in the next question might also be affected: software companies are unlikely to cut back on core features of their products, but integrated search functions are perhaps candidates for the scalpel.

Search / content processing systems have been integrated into such diverse functions as business intelligence and customer support. Do you see search / content processing becoming increasingly integrated into enterprise applications?

I’ve seen it the other way around: Customer Support Managers told me (the Search person) that the built-in search tool is okay but that they would like to look up additional information from other internal applications. I don’t believe that built-in search will replace stand-alone search. The term “built-in” tells you that the main purpose of the application is something else. No surprise, then, that the user interface was designed for this main purpose – and will, consequently, not address the typical needs of search.

Google has disrupted certain enterprise search markets with its appliance solution. What can a vendor do to adapt to this Google effect?

To address this Google effect, a vendor should point out where it differs from Google and why.

But I see Google as a significant player in enterprise search, if only for the mindset of procurement teams you describe in your question.

As you look forward, what are some new features / issues that you think will become more important in 2009?

The issue of cloudsourcing will gain traction. As a consequence, not only small and medium-sized enterprises will discover that they need not invest in in-house content management and collaboration applications, but can use a hosted service instead. This is when you need more than a “behind the firewall” search, because content will be scattered across multiple clouds (a CRM cloud, an Office cloud). I’m not sure whether we will see a breakthrough there within 36 months, but the sooner the better.

Where can I find more information about your services and research?

http://www.linkedin.com/in/mbaumgartel

Stephen E. Arnold, www.arnoldit.com/sitemap.html and Harry Collier, www.infonortics.com

NSA Oral Histories Available

February 23, 2009

If you are looking for a test corpus against which to benchmark a search system, take a look at the National Security Agency’s declassified oral interviews. A happy quack to the reader who alerted me to beSpacific’s write up “Declassified Oral History Interviews Posted by National Security Agency” here. Grab ’em quick. The NSA, according to the write up, has reworked its Web site. I enjoyed the “Doing Business with the NSA.” Interesting if not exactly in line with how the world in Beltway Bandit land often works. For more NSA content, run this query on Uncle Sam.

Stephen Arnold, February 23, 2009

Number 13 in the Biggest Technology Goof List

February 22, 2009

ComputerWorld published “The 25 Greatest Blunders in Tech History” here on February 22, 2008. I find these lists amusing. I paddled right by the first 12 and the last 12. I focused on blunder number 13:

Search portals. Where are they now? At the height of the dot-com boom, web surfers had a plethora of search engines to choose from: AltaVista, Excite, InfoSeek, Lycos, and many more. Today, the major players of the past are mostly dead. A few have soldiered on, such as Ask.com, but only after repeated redesigns. Chalk it up to old-fashioned hubris. Instead of concentrating on their search offerings, the first-generation search engines fell victim to the portal arms race. They built up dashboards full of sports scores, stock quotes, news headlines, horoscopes, the weather, email, instant messaging, games, and sponsored content – until finding what you wanted was like playing Where’s Waldo. Neither fish nor fowl, they became awkward combinations of search portals and general-interest portals. The world went to Yahoo for the latter. And when an upstart called Google appeared with a clean UI and high-quality search, users told the other engines to get lost.

The consequence of the portal mania: our pal Googzilla. The failure of portals opened the door to my favorite example of received wisdom (portals are the future) creating the space for a hyperconstruct to reshape online, search, and a number of other businesses. I would have moved this goof into the top 10. But 13 remains an unlucky number for the companies that jumped on the portal bandwagon a decade ago.

Stephen Arnold, February 22, 2009.

Google: Suddenly Too Big

February 22, 2009

Today Google is too big. Yesterday and the day before Google was not too big. Sudden change at Google or a growing sense that Google is not the quirky Web search and advertising company everyone assumed Googzilla was?

The New York Times’s article by professor Randall Stross, available temporarily here, points out that some perceive Google as “too big.” Mr. Stross quotes various pundits and wizards and adds a tasty factoid: Google allowed him to talk to a legal eagle. Read the story now so you can keep your finger on the pulse of the past. Note the words “the past.” (You can get Business Week’s take on this same “Google too powerful” theme here.)

The fact is that Google has been big for years. In fact, Google was big before its initial public offering. Mr. Stross’s essay makes it clear that some people are starting to piece together what dear Googzilla has been doing for the past decade. Keep in mind the time span: a decade, 10 years, 120 months. Also note that in that time interval Google has faced zero significant competition in Web search, automated ad mechanisms, and smart software. Google is essentially unregulated.

Let me give you an example from 2006 so you can get a sense of the disconnect between what people perceive about Google and what Google has achieved amidst the cloud of unknowing that pervades analysis of the firm.

Location: Copenhagen. Situation: Log files of referred traffic. Organization: Financial services firm. I asked the two Web pros responsible for the financial services firm’s Web site one question, “How much traffic comes to you from Google?” The answer was, “About 30 percent?” I said, “May we look at the logs for the past month?” One Webmaster called up the logs and in 2006 in Denmark, Google delivered 80 percent of the traffic to the Web site.

The perception was that Google was a 30 percent factor. The reality in 2006 was that Google delivered 80 percent of the traffic. That’s big. Sampled referral-traffic data are baloney: if the Danish data were accurate to plus or minus five percent, Google has a larger global footprint than most Webmasters and trophy generation pundits grasp. Why? Sampling services get their market share data in ways that understate Google’s paw prints. Methodology, sampling, and reverse engineering of traffic lead to the weird data that research firms generate. The truth is in log files, and most outfits cannot process large log files, so “estimates”, not hard counts, become the “way” to truth. (Google has the computational and system moxie to count and perform longitudinal analyses of its log file data. Whizzy research firms don’t. Hence the market share data that show Google in the 65 to 75 percent share range with Yahoo 40 to 50 points behind. Microsoft is even further behind, and Microsoft has been trying to close the gap with Google for years.)
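The log-file arithmetic behind that Copenhagen number is straightforward. Here is a minimal sketch (the log lines and the `referrer_share` helper are hypothetical, not the Danish firm’s actual data or tooling): count the visits that carry a referrer, then take the fraction whose referrer host contains the engine’s name.

```python
import re
from urllib.parse import urlparse

# Hypothetical access-log lines in Combined Log Format; the referrer
# is the second-to-last quoted field, the user agent the last.
LOG_LINES = [
    '1.2.3.4 - - [01/Feb/2006:10:00:00 +0100] "GET / HTTP/1.1" 200 512 "http://www.google.dk/search?q=bank" "Mozilla/5.0"',
    '1.2.3.5 - - [01/Feb/2006:10:01:00 +0100] "GET / HTTP/1.1" 200 512 "http://www.google.com/search?q=loan" "Mozilla/5.0"',
    '1.2.3.6 - - [01/Feb/2006:10:02:00 +0100] "GET / HTTP/1.1" 200 512 "http://www.yahoo.com/" "Mozilla/5.0"',
    '1.2.3.7 - - [01/Feb/2006:10:03:00 +0100] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

# Matches the two trailing quoted fields: referrer, then user agent.
REFERRER = re.compile(r'"([^"]*)" "[^"]*"$')

def referrer_share(lines, needle="google"):
    """Fraction of referred visits whose referrer host contains `needle`."""
    referred, matched = 0, 0
    for line in lines:
        m = REFERRER.search(line)
        if not m or m.group(1) in ("-", ""):
            continue  # direct traffic: no referrer at all
        referred += 1
        host = urlparse(m.group(1)).netloc.lower()
        if needle in host:
            matched += 1
    return matched / referred if referred else 0.0

print(f"Google share of referred traffic: {referrer_share(LOG_LINES):.0%}")
```

On these four sample lines, three visits carry a referrer and two come from Google hosts, so the share is two thirds; the same counting applied to a month of real logs is what exposed the 80 percent figure.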

So now it’s official, because the New York Times runs an essay that says, “Google is big.”

To me, old news.

In my addled goose monographs, I touched on data my research unearthed about some of Google’s “bigness”. Three items will suffice:

  • Google’s programming tools allow a Google programmer to be up to twice as productive as a programmer using commercial programming tools. How’s this possible? The answer is the engineering of tools and methods that relieve programmers of some of the drudgery associated with developing code for parallelized systems. Since my last study — Google Version 2.0 — Google has made advances in automatically generating user-facing code. If the Google has 10,000 code writers and you double their productivity, that’s the equivalent of 20,000 programmers’ output. That’s big to me. Who knows? Not too many pundits, in my experience.
  • Google’s index contains pointers to structured and unstructured data. The company has been beavering away so that it no longer counts Web pages in billions. The GOOG is in trillions territory. That’s big. Who knows? In my experience, not too many of Google’s Web indexing competitors have these metrics in mind. Why? Google’s plumbing operates at petascale. Competitors struggle to deal with the Google as it was in the 2004 period.
  • The computations processed by Google’s fancy maths are orders of magnitude greater than the number of queries Google processes per second. For each query there are computations for ads, personalization, log updates, and other bits of data effluvia. How big is this? Google does not appear on the list of supercomputers, but it should. And Google’s construct may well crack the top five on that list. Here’s a link to the Google Map of the top 100 systems. (I like the fact that the list folks use the Google for its map of supercomputers.)

The real question is, “What makes it difficult for people to perceive the size, mass, and momentum of Googzilla?” I recall from a philosophy class in 1963 something about Plato and looking at life as a reflection in a mirror or dream. Most of the analysis of Google with which I am familiar treats fragments, not Die Gestalt.

Google is a hyperconstruct and, as such, a different type of organization from those much loved by MBAs who work in competitive and strategic analysis.

The company feeds on raw talent and evolves its systems with Darwinian inefficiency (yes, inefficiency). Some things work; some things fail. But in chunks of time, Google evolves in a weird, non-directive manner. Also, Google’s dominance in Web search and advertising presages what may take place in other market sectors as well. What’s interesting to me is that Google lets users pull the company forward.

The process is a weird cyber-organic blend quite different from the strategies in use at Microsoft and Yahoo. Of its competitors, Amazon seems somewhat similar, but Amazon is deeply imitative. Google is deeply unpredictable because the GOOG reacts and follows users’ clicks, data about information objects, and inputs about the infrastructure’s machine processes. Three data feeds “inform” the Google.

Many of the quants, pundits, consultants, and MBAs tracking the GOOG are essentially data archeologists. The analyses report what Google was or what Google wanted people to perceive at a point in time.

I assert that it is more interesting to look at the GOOG as it is now.

Because I am semi-retired and an addled goose to boot, I spend my time looking at what Google’s open source technology announcements seem to suggest the company will be doing tomorrow or next week. I collect factoids such as the “I’m feeling doubly lucky” invention, the “programmable search engines” invention, the “dataspaces” research effort, and new patent documents for a Google “content delivery demonstration”, among others (many others, I wish to add).

My forthcoming Google: The Digital Gutenberg explains what Google has created. I hypothesize about what the “digital Gutenberg” could enable. Knowing where Google came from and what it did is indeed helpful. But that information will not be enough to assist the businesses increasingly disrupted by Google. By the time business sectors figure out what’s going on, I fear it may be too late for these folks. Their Baedekers don’t provide much actionable information about Googleland. A failure to understand Googleland will accelerate the competitive dislocation. Analysts who fall into the trap brilliantly articulated in John Ralston Saul’s Voltaire’s Bastards will continue to confuse the real Google with the imaginary Google. The right information is nine tenths of any battle. Apply this maxim to the GOOG is my thought.

Stephen Arnold, February 22, 2009

TurboWire: Search for the Children of Some Publishing Executives

February 22, 2009

A bit of irony: at a recent dinner party, a publishing executive explained that his kids had wireless, Macbooks, and mobile phones. He opined that his kids knew the rules for downloading. I was standing behind the chair in which his son was texting and downloading a torrent. The publishing executive stood facing his son and talking to me about his ability to manage digital information. I asked the son what he was downloading. He said, “Mall Cop”. “From Netflix?” I asked. He said, “Nope, a torrent like always.”

If you want to take a look at some of the functionality for search and retrieval of copyrighted materials, check out TurboWire. You can download a copy here. Click here for the publisher’s Web site. The features include search (obviously) and:

  • Auto-connect, browse host, multiple search.
  • Connection quality control.
  • Library management and efficient filtering.
  • Upload throttling.
  • Direct connection to known IP addresses.
  • Full-page connection monitor.
  • Built-in media player.

Oh, talking about piracy is different from preventing one’s progeny from ripping and shipping in my opinion. And, no, I did not tell my host that he was clueless. I just smiled and emitted a gentle honk.

Stephen Arnold, February 22, 2009

Medpedia: Using Web 2.0 to Advance Medicine

February 22, 2009

Editor’s Note: The health information sector is showing some zip. Beyond Search asked Constance Ard, the Answer Maven, to comment on the new service, Medpedia.

Medpedia has a stated purpose of “applying a new collaborative model to the sharing, collection and advancement of medical knowledge.”

This new project has the support of gold star partners Harvard, Stanford, and University of Michigan Medical Schools as well as UC Berkeley’s School of Public Health. This technology platform is open to the public but has special appeal to users in the medical, health services, academic, and research communities.

The project began in 2008 with Charter Members and Advisors offering support for this collaborative model of medical knowledge sharing.

The Privacy Policy provides support for third-party advertisers to collect and use site user information. Using the site does not require registration for readers. Editors and Members must register. An industry disclosure practice has also been adopted by Medpedia that requires editors to “disclose in their public profiles their corporate and academic affiliations and they must disclose if they receive, or expect to receive, any form of compensation for the content they contribute to Medpedia, or any compensation related to medicine, medical information, or products and services related to the body.”

The Terms of Use outlines very clearly that the site does not provide medical advice and the content is not Peer Reviewed. Contributors must register to use the site. Contributors should review the terms carefully.

Medpedia has kept the user audience in mind for this project. It provides plain-English pages for the average Jane Q. User and Clinical pages for medical professionals. This flexibility, along with other key features such as interdisciplinary contributions, allows Medpedia to reach beyond the consumer and/or researcher to meet the needs of both types of user.

Contributions may be made by anyone. Editors are screened and carefully selected, but once a member becomes a recognized editor, their profile will track their contributions on Medpedia. Medpedia does plan to expand to languages other than English. Contributors have very specific levels of access for content creation and editing on the site. The FAQs lay out the types and responsibilities associated with the various levels.

Using the site is easy. The index of current articles lists terms that link to full encyclopedia articles. The ruling organizational scheme is alphanumeric.

For the layperson, reviewing the results of a search of the articles on “infectious diseases” at first glance does not hold much hope. However, as you review the results, the articles are most definitely indexed appropriately. If you are a keyword user, don’t expect highlighted search terms in the results list. The one-line search blurb is literally the first line of the article, whatever the format of the full text.
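The gripe here is about snippet generation: a keyword-in-context blurb tells the searcher why an article matched, whereas echoing the article’s first line does not. A rough sketch of the alternative (the `kwic_snippet` function and the sample article text are illustrative, not Medpedia’s actual code):

```python
import re

def kwic_snippet(text, query, width=40):
    """Return a blurb centered on the first query hit, with the matched
    term wrapped in **bold** markers. Fall back to the opening of the
    text when the term is absent (the first-line behavior observed)."""
    m = re.search(re.escape(query), text, re.IGNORECASE)
    if not m:
        return text[: 2 * width] + "..."
    start = max(0, m.start() - width)
    end = min(len(text), m.end() + width)
    hit = text[m.start():m.end()]
    return (("..." if start else "")
            + text[start:m.start()] + f"**{hit}**" + text[m.end():end]
            + ("..." if end < len(text) else ""))

ARTICLE = ("This encyclopedia entry covers pathogens. Infectious diseases "
           "are disorders caused by organisms such as bacteria or viruses.")
print(kwic_snippet(ARTICLE, "infectious diseases"))
```

Even this crude version puts the matched phrase in front of the searcher; a production system would also handle stemming and multiple terms.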

The seed content does have highly reliable information that can be used by any level researcher for accurate content. Medpedia warns that as the general public contributes to the site this content will require verification. This need for verification is why the Editor and Committee structure will be so important for the development of this collaborative model. The editors will provide the touchstone for accuracy and currency as site content grows.

Finding articles by the contributing organization or by community is easy. Community within the Medpedia environment refers to a particular group of articles, editors and contributors on a specific topic i.e. Adult ADD/ADHD. There is alphanumeric index of the communities and an alpha index of professionals who have provided a profile that provides education and experience.

The collaborative nature of this model is encouraging. The site seems to be well governed to ensure that quality, reliable, and verifiable information is accessible. The search feature seems effective, but the results display has room for improvement, at least from a layperson’s viewpoint. In my opinion, in the days of keyword searching, the blurb in the result needs to be more reflective of the content than the first line of text from the article.

As this site grows it will be important to investigate the effectiveness of the editorial process to ensure that the collaborative model does not fail due to an overwhelming influx of inaccurate, outdated information. As it stands, the seed content makes this a useful and reliable source for medical information. The indexes and structure applied to the content are good, and the search tool seems accurate despite the disappointing results display. If you are seeking reliable medical content, Medpedia is a good place to start whether you are a professional or Jane Q. User.

Constance Ard, Answer Maven, February 22, 2009

Google: A Scoffing Violator for Sure

February 22, 2009

If Microsoft can release Internet Explorer 8 and put itself on a list of non-compliant Web sites, Google can violate its own Webmaster guidelines. SearchNewz doesn’t agree. You can read Dave Davies’ view of the scoffing violator Google here. Mr. Davies includes a link to Google’s explanation of the situation. For me, the most important comment in the write up was:

As it turns out, old Google Japan has been buying links in the form of blog posts to help increase their rankings. Of course, it wasn’t actually Google – it was a third party (of course) and Google Japan’s PageRank has been dropped to a 5 from the 9 it was at. So a black eye for Google. Of course, they have a good explanation but then – who doesn’t. 🙂 All the same, the one person who came out of this looking great – Matt Cutts who once more represents Google well and you just want to trust him to do no evil.

My research suggests that Google takes other liberties with its guidelines as well. But if Google makes its rules, just like a shopping mall owner, Google can break its rules. Google is a bit more influential than a shopping mall, however. I don’t mind pointing out Googzilla’s flaws, but I do try to follow its rules. I even put up with silliness from the now famous Cyrus and his dearth of knowledge about Google’s own open source information stream. Mr. Davies makes a good point, but it won’t amount to a hill of dead Google power supplies.

Stephen Arnold, February 22, 2009

WebFetch: Metasearch UK Style

February 22, 2009

InfoSpace was on my radar several years ago. Since that matter was resolved, I haven’t given the company much thought. I did a quick search of my notes and files about the company and came across a reminder to myself about WebFetch. The WebFetch.com site was an InfoSpace property when I first came across it. A quick visit to the site on February 21, 2009, revealed that the service is tagged as an InfoSpace property. I had this snippet of information in my InfoSpace folder:

Catering to English-language Internet users in Europe and using innovative metasearch technology, WebFetch® offers queries that draw results from many leading search engines all at once. In one click, users receive both free listings and paid-for results. All paid-for results are labeled as “sponsored.”

WebFetch is a comparison metasearch system. Your query is passed against Google’s, Microsoft’s, Yahoo’s, and Ask’s Web indexes. You can review results in a single, relevance-ranked, deduplicated list. Alternatively, you can look at the most relevant hits from each of the four search engines. I learned about the system several years ago. I noted a redesign in 2006 that included some graphical representations of search results. An FAQ about the service is here. With a click, one can narrow the search to UK or international content. My tests revealed no significant difference in the results. I have a note to myself that says, “InfoSpace acquired WebFetch.com.” But I cannot verify that item of information in the files loaded on this system.
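WebFetch does not disclose how it builds its single deduplicated list, but the merge-and-rank step any metasearch engine performs can be sketched with a standard stand-in such as reciprocal rank fusion (the per-engine result lists below are invented, and the fusion method is my assumption, not WebFetch’s):

```python
from collections import defaultdict

# Hypothetical top results (URLs only) from the four engines queried.
RESULTS = {
    "Google":    ["a.com", "b.com", "c.com"],
    "Microsoft": ["b.com", "a.com", "d.com"],
    "Yahoo":     ["a.com", "d.com", "e.com"],
    "Ask":       ["c.com", "b.com", "a.com"],
}

def fuse(results, k=60):
    """Merge per-engine lists into one deduplicated, relevance-ranked list
    using reciprocal rank fusion: each appearance at rank r adds 1/(k+r)."""
    scores = defaultdict(float)
    for ranking in results.values():
        for rank, url in enumerate(ranking, start=1):
            scores[url] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

print(fuse(RESULTS))
```

A URL that appears high in several lists accumulates the most score, so a.com, which all four engines return, lands first; duplicates collapse automatically because the score dictionary is keyed by URL.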

InfoSpace has been selling its mobile assets. The company seems to be in flux. What struck me when I visited WebFetch.com on February 21, 2009, was:

  • There was no advertising on the pages displayed to me
  • The site was clean but the information about the service took a bit of sleuthing to uncover
  • The flashier features, such as the visualization I noted in my 2006 notes to myself, were no longer available.

InfoSpace has a long and somewhat interesting history. WebFetch.com seems to be marginalized, but I don’t think too much about other InfoSpace Web search properties either. These include the service named Dogpile.com, which continues to strike me as somewhat off center. Other search properties include MetaCrawler.com and WebCrawler.com. After reviewing each service, I concluded that Dogpile.com was the site that seemed the most well rounded.

What’s the future of metasearch? I think the term is being pushed aside by the notion of federated search. And, federated search itself is being displaced by systems that aggregate, parse, and assemble content. An example of this trend is the Fetch Technologies’ approach. This outfit snagged a Googler in late 2008. My conclusion: bet on Fetch, not WebFetch.com.

Stephen Arnold, February 22, 2009

Google Plumbing Stat

February 21, 2009

Amit Agarwal, a professional blogger and personal technology columnist for a national newspaper, wrote “Single Google Query Uses 1000 Machines in 0.2 Seconds” here. The data came from Googler Jeff Dean, a former Digital Equipment wizard who joined Googzilla 20 patent documents ago. Key points for me were:

  • One query uses 1,000 machines
  • The Google index is in memory
  • Latency now 200 milliseconds, down from 1000 milliseconds
  • Power consumption… a lot.
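The 1,000-machine figure also explains why shaving latency is hard: a scatter-gather query is gated by its slowest shard, so rare slow machines dominate the tail. A toy simulation with invented numbers (not Google’s actual shard counts or timings):

```python
import random

random.seed(1)  # deterministic toy run

def query_latency(num_shards=1000, p_slow=0.01, fast_ms=50, slow_ms=400):
    """A fan-out query must wait for the slowest of its shards."""
    return max(slow_ms if random.random() < p_slow else fast_ms
               for _ in range(num_shards))

# Each shard is slow only 1% of the time, yet with 1,000 shards the
# chance that NO shard is slow is 0.99**1000, roughly 0.004 percent.
slow_queries = sum(query_latency() == 400 for _ in range(100))
print(f"{slow_queries}/100 queries were gated by a slow shard")
```

This is the arithmetic behind techniques like hedged requests and replicated shards: cutting the per-shard slow probability matters far more than cutting the fast-path time.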

Hopefully a video of Dr. Dean’s talk will turn up on the Google Channel.

Stephen Arnold, February 21, 2009

Nielsen: Time per User

February 21, 2009

I like the tables and data that ZDNet makes available. It delivers the old Predicasts File 16 punch without the online connect and type charges of bygone days. The table “Top Web Brands in December 2008” here ruffled my thinning pin feathers. Let me highlight three companies’ “time” and capture the thoughts that flapped through my addled goose mind. Here are the values that puzzled me:

Yahoo, according to Nielsen, attracted 117 million visitors and each visitor spent 3 minutes and 12 seconds per visit. The barking dog AOL Media Network attracted 86 million visitors and each visitor spent 3 minutes and 41 seconds per visit. YouTube.com (one of the top five sites in terms of traffic according to some stats cats) attracted 81 million visitors and each visitor spent 54 seconds on the site. The site able to attract visitors and make them go away fastest was Amazon with 61 million visitors and each visitor spent 34 seconds on the site.

Now these data strike me as evoking more questions than they answer. For example:

  1. Yahoo gets me to stick around because the system is so slow. Email is not usable from some countries. Yahoo’s gratuitous “Do you want to cache your email?” is nuts. If I am in Estonia on Monday and Warsaw on Tuesday, what do you think? These “sticky” values are indicative of other factors, which the ZDNet presentation does not address. I think Yahoo gets a high score because of the amount of time required to perform basic email operations. I fondly note the inadequate “ying” server because I have to sit and wait for the darn thing to deliver data to me.
  2. The Amazon number is just odd. I buy books and a few on-sale odds and ends. The Amazon system also demonstrates sluggishness. There’s the need to turn on “one click”. That takes time because I cannot easily spot the verbiage that allows me to turn on one click and have the system remember that as my preference. Then there is the sluggish rendering of items deep in an Amazon results list. I find the search system terrible, and I waste a lot of time looking for current titles that *are* available for the Kindle. The long Amazon pages take time to browse. In short, how can a visitor get in and out of Amazon in an average time of 34 seconds? Something’s fishy.
  3. The AOL numbers are similar to Yahoo. Maybe system latency is the way to improve dwell time.
  4. The YouTube.com number makes no sense at all. YouTube.com offers short videos and now longer fare. YouTube.com demographics are skewed to the trophy generation. How can a YouTube.com visitor wade through the clutter on the various YouTube.com Web pages, wait for the video to buffer, and then get out of Dodge City in 54 seconds? Something’s off track here.

I am confident that Nielsen’s analysts have well crafted answers. I wonder, however, if Phil Nielsen would accept those answers. I know I would not unless I could look at the method of data collection, the math behind the calculation, and the method for cranking out the tidy time values. I sure hope no former Wall Street quants were involved in these data because I would be really suspicious.

My hunch is that the simple reason the numbers strike me as weird is that these data are flawed, maybe in several different ways. In today’s economic climate, numbers are like Jello. I never liked Jello.

Stephen Arnold, February 21, 2009
