How Smart Is Google’s Software?

September 17, 2008

When you read this, I will have completed my “Meet the Guru” session in Utrecht for Eric Hartmann. More information is here. My “guru” talk is not worthy of its name. What I want to discuss is the relationship between two components of Google’s online infrastructure. This venue will mark the first public reference to a topic I have been tracking and researching for several years–computational intelligence. Some background information appears in the Ignorance Is Futile Web log here.

I am going to reference my analysis of Google’s innovation method, which I described in my 2007 study The Google Legacy, and I want to mention one Google patent document; specifically, US20070198481, which covers fact extraction. I chose this particular document because it references research that began a couple of years before the filing and the 2007 publication of the application. It is important, in my opinion, because it reveals some information about Google’s intelligent agents, which Google calls “janitors” in the patent application. Another reason I want to highlight it is that it includes a representation of a Google results list as a report or dossier.

Each time I show a screen shot of the dossier, any Googlers in the audience tell me that I have Photoshopped the Google image, revealing their ignorance of Google’s public patent documents and the lousy graphical representations that Google routinely places in its patent filings. The quality of the images and the cute language like “janitors” are intended to make it difficult to figure out what Google engineers are doing in the Google cubicles. Any Googlers curious about this image (reproduced below) should look at Google’s own public documents before accusing me of spoofing Googzilla. This now happens frequently enough to annoy me, so, Googlers, prove you are the world’s smartest people by reading your own patent documents. That’s what I do to find revealing glimpses such as this one, a display for a search of the bound phrase “Michael Jackson”:

[Image: diagram from US20070198481 showing a structured dossier generated for the query “Michael Jackson”]

The highlight boxes and callouts are mine. The diagram shows a fielded (structured) report, or dossier, about Michael Jackson. The red vertical box identifies the field names of the data, and the blue rectangle draws your attention to the various names by which Michael Jackson is known; for example, Wacko Jacko.
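
The patent describes software “janitors” that collect candidate facts, normalize them, and merge duplicates into a single dossier. Here is a minimal sketch of that idea in Python; the snippets, regex patterns, and field names are my inventions for illustration, not Google’s actual method:

    import re
    from collections import defaultdict

    # Toy fact extraction: regex "janitors" pull attribute-value pairs
    # from text snippets and merge duplicates into one dossier record.
    # Patterns, field names, and snippets are invented for illustration.

    SNIPPETS = [
        "Michael Jackson, born August 29, 1958, is an American singer.",
        "Michael Jackson, also known as Wacko Jacko, released Thriller.",
        "Jackson, a.k.a. the King of Pop, was born in Gary, Indiana.",
    ]

    PATTERNS = {
        "date of birth": re.compile(r"born ([A-Z][a-z]+ \d{1,2}, \d{4})"),
        "also known as": re.compile(r"(?:also known as|a\.k\.a\.) ([^,.]+)"),
    }

    def build_dossier(snippets):
        dossier = defaultdict(set)  # field name -> deduplicated values
        for text in snippets:
            for field, pattern in PATTERNS.items():
                for match in pattern.finditer(text):
                    dossier[field].add(match.group(1).strip())
        return dossier

    for field, values in sorted(build_dossier(SNIPPETS).items()):
        print(f"{field}: {', '.join(sorted(values))}")

Running this prints a tiny fielded report, with the “also known as” field holding both Wacko Jacko and the King of Pop, which is the flavor of output the patent diagram depicts.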

Now this is a result that most people have never seen. Googlers react to it with shock and disbelief because only a handful of Google’s more than 19,000 employees have substantive knowledge of what the firm’s top scientists are doing. I’ve learned that 18,500 Googlers “run the game plan”, a Google phrase that means “Do what MOMA tells you”. Google’s patent documents are important because Google has hundreds of US patent applications and patents, not thousands like IBM and Microsoft. Consequently, there is intent behind funding the research, paying the attorneys, and dealing with the chaotic baloney that is the specialty of the USPTO.


Attensity and BzzAgent: What’s the Angle?

September 14, 2008

Attensity made a splash in the US intelligence community after 2001. A quick review of Attensity’s news releases suggests that the company began shifting its marketing emphasis from In-Q-Tel related entities to the enterprise in 2004-2005. By 2006, the company was sharpening its focus on customer support. Now Attensity offers a wider range of technologies to organizations that want to engage their customers.

In August 2008, the company announced that it had teamed up with the oddly named BzzAgent, a specialist in word of mouth media, to provide insights into consumer conversations. You can learn more about WOM–that is, word of mouth marketing–at the company’s Web site here.

The Attensity technology makes it possible for BzzAgent to squeeze meaning out of email or any other text. With the outputs of the Attensity system, BzzAgent can figure out whether a product is getting marketing lift or down draft. Other functionality provides beefier metrics to buttress BzzAgent’s technology.

The purpose of this post is to ask a broader question about content processing and text analytics. To close, I want to offer a comment about the need to find places to sell rocket science information technology.

Why Chase Customer Support?

The big question is, “Why chase customer support?” Call centers, self service Web sites, and online bulletin board systems have replaced people in many organizations. In an effort to slash the cost of support, organizations have outsourced help to countries with lower wages than those in the organization’s home country. In an interesting twist of fate, Indian software outsourcing firms are now sending some programming and technical work back to the US. Atlanta has been a beneficiary of this reverse outsourcing, according to my source in the Peach State.

Attensity’s technology performs what the company once described as “deep extraction.” The idea is to iterate through source documents. The process outputs metadata, entities, and a wide range of data that one can slice, dice, chart, and analyze. Attensity’s technology is quite advanced, and it can be tricky to optimize for the best performance on a particular domain of content.
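
As a rough sketch of that iterate-extract-aggregate pattern (and emphatically not Attensity’s deep extraction, which involves real linguistic analysis), consider this toy Python example; the word lists and the entity regex are assumptions of mine for illustration:

    import re
    from collections import Counter

    # Iterate through documents, emit entities plus crude sentiment
    # counts, then aggregate into "lift or down draft". Word lists and
    # the entity pattern are invented for illustration.

    POSITIVE = {"love", "great", "recommend"}
    NEGATIVE = {"hate", "broken", "refund"}
    ENTITY = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")  # naive multi-word proper nouns

    def extract(text):
        words = [w.strip(".,!?").lower() for w in text.split()]
        return {
            "entities": ENTITY.findall(text),
            "positive": sum(w in POSITIVE for w in words),
            "negative": sum(w in NEGATIVE for w in words),
        }

    emails = [
        "I love the new Acme Blender. Great product, I recommend it!",
        "Sadly, my Acme Blender arrived broken. I want a refund.",
    ]

    rows = [extract(text) for text in emails]
    mentions = Counter(e for r in rows for e in r["entities"])
    lift = sum(r["positive"] - r["negative"] for r in rows)
    print(mentions)                          # Counter({'Acme Blender': 2})
    print("lift" if lift > 0 else "down draft")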

Customer support appears to be a niche that functions like a hamburger to a hungry fly buzzing around tailgaters at the college football game. Customer support, despite vendors’ efforts to reduce costs and keep customers happy, has embraced every conceivable technology. There are the “live chat” telepresence services. These work fine until the company realizes that customers may be in time zones where the company is not open for business. There are the smart systems like the one Yahoo deployed using InQuira’s technology. To see how this works, navigate to Yahoo help central, type the question “How do I cancel premium email?”, and check out the answers. There are even more sophisticated systems deployed using tools from such companies as RightNow, whose offering includes workflow tools and consulting to improve customer support services and operations.

The reason is simple–customer support remains a problem, or as the marketers say, “An opportunity.” I know that I avoid customer support whenever possible. Here’s a typical example. Verizon sent me a flier that told me I could reduce my monthly wireless broadband bill from $80 to $60. It took a Web site visit and six telephone calls to find out that the lower price came with a five gigabyte bandwidth cap. Not only was I stressed by the bum customer support experience, I was annoyed at what I perceived, rightly or wrongly, as the duplicity of the promotion. Software vendors would jump at the chance to license Verizon a better mousetrap. So far, costs may have come down for Verizon, but this mouse remains far away from the trap.

The new spin on customer support rotates around one idea: find out stuff *before* the customer calls, visits the Web site, or fires up a telepresence session.

That’s where Attensity narrows its beam. Attensity’s rocket science technology can support zippy new angles on customer support; for example, BzzAgent’s early warning system.

What’s This Mean for Search and Content Processing?

For me that is the $64 question. Here’s what I think:

  1. Companies like Attensity are working hard to find niches where their text analytics tools can make a difference. By signing licensing deals with third parties like BzzAgent, Attensity gets some revenue and shifts the cost of sales to BzzAgent’s team.
  2. Attensity’s embedding of its technology into BzzAgent’s systems deemphasizes or possibly eliminates the brand “Attensity” from the customers’ radar. Licensing deals deliver revenue with a concomitant loss of identity. Either way, text analytics moves from center stage to a supporting role.
  3. The key to success in Attensity’s marketing shift is getting to the new customers first. A stampede is building as other search and content processing vendors follow a very similar strategy. Saturation will lower prices, making the customer support sector less attractive to text processing companies than it is now. ClearForest was an early entrant, but now the herd is arriving.

The net net for me is that Attensity has been nimble. What will the arrival of other competitors in the customer support and call center space mean for this niche? My hunch is that search and content processing is quickly becoming a commodity. Companies just discovering the customer support market will have to displace established vendors such as InQuira and Attensity.

Search and content processing certainly appear to be headed rapidly toward commoditization unless the vendor can come up with a magnetic value add.

Stephen Arnold, September 14, 2008

Search: A Failure to Communicate

September 12, 2008

At lunch today, the ArnoldIT.com team embraced a law librarian. For Mongolian beef, this information professional agreed to talk about indexing. The conversation turned to the grousing that lawyers do when looking for information. I remembered seeing a cartoon that captured the problem we shelled, boiled, and deviled during our Chinese meal.

[Image: “failure to communicate” cartoon]

Source: http://www.i-heart-god.com/images/failure%20to%20communicate.jpg

Our lunch analysis identified three constituencies in a professional services organization. We agreed that narrowing our focus to consultants, lawyers, financial mavens, and accountants was an easy way to put our egg rolls in one basket.

First, we have the people who understand information. Think indexing, consistent tagging for XML documents, consistent bibliographic data, the credibility of the source, and other nuances that escape my 86-year-old father when he searches for “Chicago Cubs”.

Second, we have the information technology people. The “information” in their title is a bit of misdirection that leads to a stir fry of trouble. IT pros understand databases and file types. Once data are structured and normalized, the job is complete. Algorithms can handle the indexing and the metadata. When a system needs to go faster, the fix is to buy hardware. If it breaks, the IT pros tinker a bit and then call in an authorized service provider.

Third, we have the professionals. These are the ladies and gentlemen who have trained to master a specific professional skill; for example, legal eagle or bean counter. These folks are trapped within their training. Their notions of information are shaped by their deadlines, crazed clients, and crushing billability.

Here’s where the search system or content processing system begins its rapid slide to the greasy bottom of the organization’s wok.

  1. No one listens to or understands the other players’ definitions of “information”.
  2. The three players, unable to get their points across, clam up and work to implement their own visions of information.
  3. The vendors, hungry for the licensing deal, steer clear of this internal collision of ignorant, often supremely confident souls.
  4. The system is a clunker, doing nothing particularly well.

Enter the senior manager or the CFO. Users are unhappy. Maybe the system is broken and a big deal is lost or a legal matter goes against the organization. The senior manager wants a fix. The problem is that unless the three constituents go back to the definition of information and carry that common understanding through requirements, to procurement, to deployment, not much will change.

Like the old joke says, “Get me some new numbers or I will get a new numbers guy.” So, heads may roll. The problem remains the same. The search and content processing system annoys a majority of its users. Now, a question for you two or three readers: “How do we fix this problem in professional services organizations?”

Stephen Arnold, September 12, 2008

eDiscovery: Speed Bumps Annoy Billing Attorneys

September 12, 2008

A happy quack to my Australian reader who called “eDiscovery Performance Still a Worry” to my attention. The article by Greg McNevin appeared on the IDM.net.au Web site on September 10, 2008. The main point of the write up is that 60 percent of those polled about their organization’s eDiscovery litigation support system said, “Dog slow.” The more felicitous wording chosen by Mr. McNevin was:

The survey also found that despite 80 percent of organisations claiming to have made an investment in IT to address discovery challenges, 60 percent of respondents think their IT department is not always able to deliver information quickly enough for them to do their legal job efficiently.

The survey was conducted by Dynamic Markets, which polled 300 in house legal eagles in the UK, Germany, and the Netherlands. My hunch is that the 60 percent figure may well apply in North America as well. My own research unearthed the fact that two thirds of the users of enterprise search systems were dissatisfied with those systems. The 60 percent score matches up well.

In my view, the larger implication of this CommVault study is that when it comes to text and content processing, more than half the users go away annoyed or use the system whilst grumbling and complaining.

What are vendors doing? There’s quite a bit of activity in the eDiscovery arena. More gladiators arrive to take the place of those who fall on their swords, get bought as trophies, or die at the hands of another gladiator. Sadly, the activity does not address the issue of speed. In this context, “speed” is not three millisecond response time. “Speed” means transforming content, updating indexes, and generating the reports needed to figure out what information is where in the discovered information.

Many vendors are counting on Intel to solve the “speed” problem. I don’t think faster chips will do much, however. The “speed” problem is that eDiscovery relies on a great many processes. Lawyers, in general, care only about what’s required to meet a deadline. There’s little reason for them to trouble their keen legal minds with such details as content throughput, malformed XML, flawed metatagging, and trashed indexes after an index update.
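
A back-of-the-envelope calculation shows why chip speed alone does not rescue a multi-stage pipeline. The stage names and per-document costs below are hypothetical numbers, not measurements of any vendor’s system:

    # eDiscovery "speed" is pipeline speed: total time is the sum of
    # every stage over every document, and one slow stage dominates.
    # Stage names and per-document costs are hypothetical.

    STAGES = [
        ("convert formats",      0.020),  # seconds per document
        ("repair malformed XML", 0.005),
        ("extract metadata",     0.010),
        ("update index",         0.050),  # the bottleneck in this example
        ("generate reports",     0.015),
    ]

    DOCS = 100_000

    total_hours = sum(cost for _, cost in STAGES) * DOCS / 3600
    name, cost = max(STAGES, key=lambda s: s[1])
    print(f"total: {total_hours:.1f} hours")                        # 2.8 hours
    print(f"bottleneck: {name} at {cost * DOCS / 3600:.1f} hours")  # 1.4 hours

    # Doubling CPU speed halves only the CPU-bound stages; if the index
    # update is disk bound, the end-to-end job barely improves.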

eDiscovery’s dissatisfaction score mirrors the larger problems with search and content processing. There’s no fix coming that will convert a grim black and white image to a Kodachrome version of reality.

Stephen Arnold, September 12, 2008

First Search Mini-Profile: Stratify

September 9, 2008

Beyond Search has started its search and content processing mini-profile series.

The first profile is about Stratify, and you can read it here.

The goal is to publish each week a brief snapshot of selected search and content processing vendors. The format of each profile will be a short essay covering the background of the system, its principal features, strengths, weaknesses, and an observation. The idea inspiring each profile is to create a basic summary. Each vendor is invited to post additional information, links, and updates. On a schedule yet to be determined, each mini-profile will be updated and the comments providing new information deleted. The system allows a reasonable trade off between editorial control and vendor supplements. We will try to adhere to the weekly schedule. Our “Search Wizards Speak” series has been well received, and we will add interviews, but the interest in profiles has been good.

Remember: you don’t need to write me “off the record” or, even worse, call me to provide insights, updates, and emendations. Please use the comments section for each profile. I have other work to do, and although I enjoy meeting new people via email and the phone, the volume of messages to me is rising rapidly. Enjoy the Stratify post. You will find the profiles under the “Profile” tab on the splash page for the Web log. I will post a short news item when a new profile becomes available. Each profile will be indexed with the key word “profile”.

Stephen Arnold, September 9, 2008

Oracle Teams with ekiwi

September 8, 2008

ekiwi, based in Provo, Utah, has formed a relationship with Oracle. Founded in 2002, the company focuses on Web based data extraction. The firm’s Screen-Scraper technology is, the news release asserts, “platform-independent and designed to integrate with virtually any existing information technology system.”

The company describes Screen Scraper this way here:

It consists of a proxy server that allows the contents of HTTP and HTTPS requests to be viewed, and an engine that can be configured to extract information from Web sites using special patterns and regular expressions. It handles authentication, redirects, and cookies, and contains an embedded scripting engine that allows extracted data to be manipulated, written out to a file, or inserted into a database. It can be used with PHP, .NET, ColdFusion, Java, or any COM-friendly language such as Visual Basic or Active Server Pages.
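
To make the description concrete, here is a bare-bones sketch of regular-expression screen scraping in Python. It is a toy under my own assumptions (the URL and the markup pattern are placeholders), not ekiwi’s Screen-Scraper, which adds the proxy, authentication, cookie handling, and scripting described above:

    import re
    import urllib.request

    # Fetch a page and pull out model/price pairs with a regular
    # expression -- the core move in screen scraping. The URL and the
    # markup pattern below are hypothetical placeholders.

    PATTERN = re.compile(
        r'<td class="model">([^<]+)</td>\s*<td class="price">\$([\d.]+)</td>'
    )

    def scrape_prices(url):
        html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        return [(model, float(price)) for model, price in PATTERN.findall(html)]

    # Hypothetical usage against a page listing cell phones:
    # for model, price in scrape_prices("http://example.com/phones"):
    #     print(f"{model}: ${price:.2f}")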

Oracle’s revenues are in the $18 to $20 billion range. ekiwi’s revenues may be more modest. Oracle, however, has turned to ekiwi for screen scraping technology to enhance the content acquisition capabilities of Oracle’s flagship enterprise search system, Secure Enterprise Search 10g, or SES10g. In May 2008, one of Oracle’s senior executives told me that SES10g was a key player in the enterprise search arena and that SES10g sold because it was secure. Security, I recall being told, was the key differentiator.

This deal suggests that Oracle has to turn to up-and-coming screen scraping vendors to expand the capabilities of SES10g. I’m still puzzling over this deal, but that’s clearly my inability to understand the sophisticated management thinking that propels SES10g to its lofty position among the search and content processing vendors.

The news release makes it clear that ekiwi can access content from the “deep Web”. This buzzword means, to me, dynamic, database-driven sites. Google has its own “deep Web” technologies, which may be described in part in its five Programmable Search Engine patent applications, published by the USPTO in February 2007.

ekiwi, which offers a very useful Web log here, is:

…a member of the Oracle PartnerNetwork, has worked with Oracle to develop an adaptor that integrates ekiwi’s Screen Scraper with Oracle Secure Enterprise Search to help significantly expand the amount of enterprise content that can be searched while maintaining existing information access and authorization policies. The Oracle Secure Enterprise Search product provides a secure, easy-to-use enterprise search platform that connects to a broad range of enterprise applications and data sources.

The release continues:

The two technologies have already been coupled in a number of cases that demonstrate their ability to work together. In one instance cell phones from many of the major providers were crawled by Screen-Scraper and indexed by Oracle Secure Enterprise Search. A user shopping for cell phones is then able to search, filter, and browse from a single location the various cell phone models by attributes such as price, form factor, and manufacturer. In yet another case, Screen-Scraper was used to extract forum postings from various photography aficionado web sites. This information was then made available through Oracle Secure Enterprise Search, which made it easy to conduct internal marketing analysis on recently released cameras.
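
The cell phone case amounts to faceted filtering over scraped records. Here is a tiny sketch of that idea; the records and attribute names are made up for illustration:

    # Facet-style filtering over scraped records: each record carries
    # attributes, and the user narrows the set one attribute at a time.
    # Records and attribute values are invented for illustration.

    phones = [
        {"model": "Nokia N95", "price": 299, "form": "slider", "maker": "Nokia"},
        {"model": "BlackBerry Bold", "price": 349, "form": "candybar", "maker": "RIM"},
        {"model": "Motorola RAZR2", "price": 199, "form": "flip", "maker": "Motorola"},
    ]

    def filter_by(records, **criteria):
        return [r for r in records
                if all(r.get(k) == v for k, v in criteria.items())]

    def facet_counts(records, attribute):
        counts = {}
        for r in records:
            counts[r[attribute]] = counts.get(r[attribute], 0) + 1
        return counts

    print(facet_counts(phones, "maker"))   # {'Nokia': 1, 'RIM': 1, 'Motorola': 1}
    print(filter_by(phones, form="flip"))  # the RAZR2 record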

I did some poking around and came up short after a quick look at my files and a couple of Web searches. Information is located, according to the news story about the deal, here. The URL given is http//:www.screen-scraper.com/ss4ses/. The link redirected for me to http://www.w3.org/Protocols/. The company’s Web site is at http://www.screen-scraper.com, and it looked like this on September 7, 2008, at 8 pm Eastern:

[Image: screen-scraper.com splash page]

I am delighted that SES10g can acquire Web-based content from dynamic systems. I remain confused about the functions included with SES10g. My understanding was that SES10g was easily extensible and compatible with Oracle Applications, Fusion, and other Oracle technologies. If that were true, pulling content from databased services should be trivial for the firm’s engineering team. I was hoping for an upgrade to SES10g, but that seems not to be in the cards at this time. Scraping Web pages seems to be a higher priority than getting a new release out the door. What’s your understanding of Oracle’s enterprise search strategy? I’m confused. Help me out, please.

Stephen Arnold, September 8, 2008

New Beyond Search White Paper: Coveo G2B for Mobile Email Search

September 8, 2008

The Beyond Search research team prepared a white paper about Coveo’s new G2B for Email product. You can download a copy from us here or from Coveo here. Coveo’s system works across different mobile devices, requires no third-party viewers, delivers low-latency access when searching, evidenced no rendering issues, and provided access to contacts and attachments as well as the text in an email. When compared to email search solutions from Google, Microsoft, and Yahoo, Coveo’s new service proved more robust and functional. Beyond Search identified 13 features that set G2B apart. These include a graphical administrative interface, comprehensive usage reports, and real time indexing of email. The Beyond Search research team–Stephen Arnold, Stuart Schram, Jessica Bratcher, and Anthony Safina–concluded that Coveo established a new benchmark for mobile email search. For more information about Coveo, navigate to www.coveo.com. Pricing information is available from Coveo.

Stephen Arnold, September 5, 2008

Text Processing: Why Servers Choke

September 6, 2008

Resource Shelf posted a link to a Hewlett Packard Labs paper. Great find. You can download the HP write up here (verified at 7 pm Eastern on September 5, 2008). The paper argues that an HP innovation can process text at the rate of 100 megabytes per second per processor core. That’s quite fast. The value of the paper for me was that the authors of “Extremely Fast Text Feature Extraction for Classification and Indexing” have done a thorough job of providing data about the performance of certain text processing systems. If you’ve been wondering how slow Lucene is, this paper gives you some metrics. The data seem to suggest that Lucene is a very slow horse in a slow race.

Another highlight of George Forman’s and Evan Kirshenbaum’s write up was this statement:

Multiple disks or a 100 gigabit Ethernet feed from many client computers may certainly increase the input rate, but ultimately (multi-core) processing technology is getting faster faster than I/O bandwidth is getting faster. One potential avenue for future work is to push the general-purpose text feature extraction algorithm closer to the disk hardware. That is, for each file or block read, the disk controller itself could distill the bag-of-words representation and then transfer only this small amount of data to the general-purpose processor. This could enable much higher indexing or classification scanning rates than is currently feasible. Another potential avenue is to investigate varying the hash function to improve classification performance, e.g. to avoid a particularly unfortunate collision between an important, predictive feature and a more frequent word that masks it.

When I read this, two thoughts came to mind:

  1. Search vendors counting on new multi core CPUs to solve performance problems won’t get the speed ups needed to make some systems process content more quickly. Bad news for one vendor whose system I just analyzed for a company convinced that performance is a strategic advantage. In short, slow loses.
  2. As more content is processed and short cuts are taken, hash collisions can reduce the usefulness of the value-added processing. A query returns unexpected results. Much of the HP speed up is a series of short cuts. The problem is that short cuts can undermine what matters most to the user–getting the information needed to meet a need. A toy illustration of the collision problem appears after this list.
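
Here is a small sketch of the collision problem, assuming a deliberately tiny hash table to force the effect; real systems use far larger tables, so collisions are rarer but never gone:

    import zlib
    from collections import defaultdict

    # Hashed bag-of-words: map each word to one of a fixed number of
    # buckets instead of storing a full vocabulary. With more words than
    # buckets, the pigeonhole principle guarantees at least one collision,
    # and the colliding words become indistinguishable downstream.

    BUCKETS = 16  # deliberately tiny to force collisions

    def bucket(word):
        return zlib.crc32(word.encode()) % BUCKETS

    def hashed_bag_of_words(text):
        counts = [0] * BUCKETS
        for word in text.lower().split():
            counts[bucket(word)] += 1
        return counts

    words = ("the of and a to in is it that for "
             "on with as was he by at are this they").split()
    groups = defaultdict(list)
    for w in words:
        groups[bucket(w)].append(w)
    for b, ws in sorted(groups.items()):
        if len(ws) > 1:
            print(f"bucket {b:2d}: {ws} now look identical to the classifier")

If a rare but predictive term lands in the same bucket as a stopword, its signal is drowned out, which is exactly the “unfortunate collision” the HP authors propose to engineer around by varying the hash function.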

I urge you to read this paper. Quite a good piece of work. If you have other thoughts about this paper, please, share them.

Stephen Arnold, September 6, 2008

Autonomy: Not Idle

September 5, 2008

On September 4, 2008, news about another Autonomy search and content processing win circulated in between and around the Google Chrome chum. HBO, a unit of Time Warner, is a premium programming company. In Newton Minow’s world, HBO would be a provider of waste for the “vast wasteland”. Autonomy nailed this account under the noses of the likes of Endeca, Google, Microsoft Fast ESP, and dozens of other companies salivating for HBO money and a chance to meet those involved with “Rome,” “The Sopranos,” and the funero-lark “Six Feet Under.” Too bad for the US vendors. HBO visited the River Cam and found search goodness. Brief stories are appearing at ProactiveInvestors.com here and MoneyAM.com here. When I checked Autonomy’s Web site, the company’s news release had not been posted, but it will appear shortly. Chatter about Autonomy has picked up in the last few weeks. Sources throwing bread crumbs to the addled goose suggest that Autonomy has another mega deal to announce in the next week or two. On top of that, Autonomy itself is making some moves to bolster its technology. When the addled goose gets some kernels of information, he will indeed pass them on.

In response to the Autonomy “summer of sales”, its competitors are cranking up their marketing machines. Vivisimo is hosting a Webinar, which you can read about here. Other vendors are polishing new white papers. One vendor is ramping up a telemarketing campaign. Google, as everyone knows, is cranking the volume on its marketing and PR machine. The fact of the matter is that Autonomy has demonstrated an almost uncanny ability to find opportunities and close deals while other vendors talk about making sales. Will an outfit step forward and buy Autonomy? SAP hints that it has an appetite for larger acquisitions. Will Oracle take steps to address its search needs? Will a group of investors conclude that Autonomy might be worth more split into a search company, a fraud detection company, and an eDiscovery company? Autonomy is giving me quite a bit to consider. What’s your take on the rumors? Send the addled goose a handful of corn via the Comments function on this Web log.

Stephen Arnold, September 5, 2008

Intel and Search

September 5, 2008

True, this is a Web log posting, but I am interested in search thoughts from Intel or its employees. I found the post “Why I Will Never Own an Electronic Book” interesting. I can’t decide whether the post is suggestive or naive. You can read the post by Clay Breshears here. On the surface, Mr. Breshears is pointing out that ebook readers’ search systems can locate only key words. He wants these generally lousy devices to sport NLP, or natural language processing. The portion of the post that caught my attention was:

We need better natural language processing and recognition in our search technology.  Better algorithms along with parallel processing is going to be the key.  Larger memory space will also be needed in these devices to hold thesaurus entries that can find the link between “unemployed” and “jobless” when the search is asked to find the former but only sees the latter.  Maybe, just maybe, when we get to something like that level of sophistication in e-book devices, then I might be interested in getting one.
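
Here is a minimal sketch of the thesaurus trick Mr. Breshears describes, with a toy two-entry thesaurus standing in for the large one he says these devices would need:

    # Expand a query term with synonyms so a search for "unemployed"
    # matches a page that only says "jobless". The thesaurus is a toy
    # stand-in; a real one is what eats the device memory he mentions.

    THESAURUS = {
        "unemployed": {"jobless", "out-of-work"},
        "car": {"automobile", "vehicle"},
    }

    def expand(term):
        return {term} | THESAURUS.get(term, set())

    def search(query, pages):
        terms = expand(query.lower())
        return [title for title, text in pages.items()
                if terms & set(text.lower().split())]

    pages = {"labor report": "jobless claims rose sharply this quarter"}
    print(search("unemployed", pages))  # -> ['labor report']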

Intel invested some money in Endeca. Endeca gets cash, and it seems likely that Intel may provide Endeca with some guidance with regard to Intel’s next generation multi core processors. In 2000, Intel showed interest in getting into the search business with its exciting deal with Convera. I have heard references to Intel’s interest in content processing. The references touch upon the new CPUs’ computational capability. Most of this horsepower goes unused, and the grape vine suggests that putting some content pre-processing functions in an appliance, in firmware, or on the CPU die itself might make sense.

This Web log post may be a one-off comment. On the other hand, it might hint at other, more substantive conversations about search and content processing within Intel. There’s probably nothing to these rumors, but $10 million signals a modicum of interest from my vantage point in rural Kentucky.

Stephen Arnold, September 5, 2008
