Clearwell: Another eDiscovery Platform

June 9, 2008

The giant Thomson Reuters owns an outfit called Thomson Litigation Consulting. Thomson Litigation Consulting, in turn, recommends systems to its law firm customers. The consulting unit of Thomson Reuters earned some praise for its recommendation to DLA Piper, a firm that needed fast-cycle eDiscovery. You can read the effusive write-up as reported on Law.com here:

Clearwell processed all 570,000 e-mail messages and attachments within our deadline of five days, providing enough time for analysis, review and production of the data. Clearwell’s incremental processing capabilities enabled TLC to start the analysis process for initial custodians within 25 minutes. The platform’s communication flow analysis enabled the legal team to quickly find all e-mails sent to specific individuals and to specific organizations (domains) within a confined date range. Clearwell’s organizational discovery automatically identified all variations of a custodian’s e-mail address, ensuring that no data for a custodian was missed.

A happy quack to Thomson Litigation Consulting and to the happy, happy client. With as many as two-thirds of search and content processing system users dissatisfied, it is gratifying to know that there are success stories. The question is, "What's a Clearwell?" The purpose of this short article is to provide some basic information about this system and to make several observations about the niche strategy in search and content processing.

[Screen shot: Clearwell email thread interface]

This is a screen shot of the Clearwell interface used to view a thread, or chain, of related emails. The attorney can use the system to move forward and backward in the email chain. A new query can be launched. A point-and-click interface allows the attorney to filter the processed content by project, name, and other criteria. The system automatically saves an attorney's queries.

What’s a Clearwell?

The metaphor implied by the company's name is seeing to the bottom of a deep, dark well. The idea is that technology can illuminate what's hidden.

The company is backed by Sequoia Capital, Redpoint Ventures, DAG Ventures, and Northgate Capital. In short, the firm has "smart money". "Smart money" opens doors, presumably even to secretive outfits like the Thomson Corporation. Clearwell conducted a Webinar with Google, which illustrates the company's ability to hook up with the heavy hitters in the online world to educate companies about eDiscovery.

As one of the investors describes the company, Clearwell

delivers a new level of analysis of information contained in corporate document and email systems. As the first e-discovery 2.0 solution, Clearwell is poised to capitalize on this emerging market, which we expect to become a multi-billion dollar industry within the next few years.

In a nutshell, the company bundles content processing, analytics, and work flow into a product tailored to the needs of eDiscovery. "eDiscovery" is the term applied to figuring out what's in the gigabytes of digital email, Word files, and depositions generated in the course of a legal matter. eDiscovery means that a researcher tries to learn what is in the discovered information so the lawyers know what they don't know.

The company, unlike a generalized enterprise search platform, focuses its technology on specific markets unified by each market’s need to perform eDiscovery. These markets are:

  • Corporate security. Think email analysis.
  • Law firms. Grinding through information obtained in the discovery process.
  • Service providers. Data centers, ISPs, and telcos processing content for compliance.
  • Government. Generally I associate the government with surveillance and intelligence operations.

Technology

There are more than 300 companies in the text processing business. I track about 12 firms focusing on the eDiscovery angle. I published a short list of some vendors as a general reference to readers of this Web log here.

The key differentiator for Clearwell is that it is a platform; that is, the customer does not have to assemble a random collection of Lego blocks into a system. Clearwell arrives, installs its system, and provides any needed technical assistance. For law firms in a time crunch, the Clearwell appliance is packaged as a solution that is:

  • Transparent, which means another attorney can figure out what produced a particular result
  • Easy to use, which means attorneys don't have to be technical wizards
  • Able to handle different types of documents and languages, including misspellings
  • Capable of not missing a key document, which is a disaster when the opposing attorney did not miss it.

How does this work?

Clearwell ships an appliance that can be up and running in less than a half hour, maybe longer if the law firm doesn’t have a full-time system administrator. A graphical administration utility allows the collection or corpus to be identified to the system. Clearwell then processes the content and makes it available to authorized users.

The appliance implements the Electronic Discovery Reference Model which is a methodology supported by about 100 firms. The idea is that EDRM standardizes the eDiscovery process so an opposing attorney has a shot at figuring out where “something” comes from.

As part of the content processing, Clearwell generates entities, metadata, and indexes. One key feature of the system is that Clearwell automatically links emails into threads. An attorney can locate an email of interest and then follow the Clearwell thread through the emails processed by the system. Before Clearwell, a human had to make notes about related emails. Other systems provide similar functionality. Brainware, for example, offers similar features, and it is possible to use Recommind and Stratify in this way. The idea is that Clearwell is an "eDiscovery toaster". Lawyers understand toasters; lawyers don't understand complex search and content processing systems.
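Clearwell does not disclose how its threading works, but the generic technique is well understood: chain messages together using the RFC 2822 In-Reply-To and References headers. Here is a minimal sketch (the fallback to normalized subject lines that production systems use is omitted):

```python
from collections import defaultdict
from email import message_from_string

def build_threads(raw_messages):
    """Group raw RFC 2822 messages into threads by chaining the
    In-Reply-To and References headers. A minimal sketch; a real
    system also falls back on normalized Subject lines and dates."""
    parent_of = {}   # message-id -> parent message-id (or None)
    messages = {}    # message-id -> parsed message
    for raw in raw_messages:
        msg = message_from_string(raw)
        msg_id = msg.get("Message-ID")
        if not msg_id:
            continue
        messages[msg_id] = msg
        # Prefer In-Reply-To; otherwise use the last References entry.
        parent = msg.get("In-Reply-To")
        if not parent and msg.get("References"):
            parent = msg["References"].split()[-1]
        parent_of[msg_id] = parent

    def root_of(msg_id):
        seen = {msg_id}
        while parent_of.get(msg_id) in messages:
            msg_id = parent_of[msg_id]
            if msg_id in seen:        # guard against malformed cycles
                break
            seen.add(msg_id)
        return msg_id

    threads = defaultdict(list)      # root message-id -> thread members
    for msg_id in messages:
        threads[root_of(msg_id)].append(messages[msg_id])
    return threads
```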

The technical components of the Clearwell system include:

  • Deduplication
  • Support for multiple languages
  • Entity extraction
  • On-the-fly classification
  • Canned analytics to count the number of references to entities
  • Basic and advanced search.
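Clearwell doesn't document its deduplication method. As a rough illustration of the standard approach, exact-match deduplication reduces to hashing normalized text and keeping one copy per digest; the sketch below assumes that much and nothing more:

```python
import hashlib

def dedupe(documents):
    """Exact-match deduplication: hash normalized text, keep the
    first document per digest. 'documents' is (doc_id, text) pairs.
    Commercial systems layer near-duplicate detection on top."""
    seen = {}
    for doc_id, text in documents:
        # Normalize case and whitespace so trivial variants collapse.
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        seen.setdefault(digest, doc_id)
    return list(seen.values())

# Example: the forwarded copy collapses into the original.
print(dedupe([(1, "Meet at noon."), (2, "meet  at noon.")]))  # [1]
```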

The system can be configured to allow an authorized user to add a tag or a flag so a particular document can be reviewed by another person. This function is generally described as a “social search” operation. It is little more than an interface to permit user-assigned index terms.

One of the most common requests made of enterprise search systems is a case function; that is, the ability to keep track of information related to a particular matter. Case operations are quite complex, and the major search platforms make it possible for the licensee to code these functions themselves. In effect, mainstream search systems don’t do case management operations out of the box.

Clearwell does. My review of the system identified this function as one of the most useful operations baked into the appliance. Case management means keeping track of who looked at what and when. In addition, the case management system bundles information about content and operations in one tidy package.

The Clearwell case function includes these features:

  • Analytics, which can be used for time calculations; for example, verifying that a person who was supposed to review a document did in fact open the document
  • Ability to handle multiple legal matters
  • Function to permit tags and categories to be set for different legal matters
  • User management tools
  • Audit trails.
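Clearwell's internals aren't public, but the audit-trail idea, keeping track of who looked at what and when for each matter, boils down to an append-only log. A hypothetical sketch:

```python
import time

class AuditTrail:
    """Hypothetical append-only audit log for one legal matter:
    each document view is recorded with user, document, and time."""

    def __init__(self, matter_id):
        self.matter_id = matter_id
        self.entries = []

    def record_view(self, user, document_id):
        self.entries.append({
            "matter": self.matter_id,
            "user": user,
            "document": document_id,
            "timestamp": time.time(),
        })

    def documents_viewed_by(self, user):
        # Verify that an assigned reviewer actually opened documents.
        return [e["document"] for e in self.entries if e["user"] == user]
```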

Attempting to implement these features on an enterprise search platform is virtually a six-month job, not one that can be accomplished in a day or less.

Observations

Clearwell is an example of how a start up can look at a crowded field like enterprise search and content processing, identify points of pain, and build a business providing a product that makes the pain bearable. Clearwell's technology, like most search vendors', is not unique; that is, other companies provide similar functions. What sets the company apart is the packaging of the technology for the target market. Clearwell's technical acumen is evident in the case management functions and the useful exposure of threaded emails.

Other points that impressed me are:

  • An appliance. I like appliances because I don't have to build anything. Search is such a basic need in organizations, why should I build a search system? I don't build toasters.
  • Bundled software. Clearwell–unlike Exegy, Google, and Thunderstone–delivers a usable application out of the box. Index Engines comes close with its backup-search solution. But Clearwell is the leader in the appliance-that-works niche in search at this time.
  • Smart money. When investors with a track record bet on a company, I think it’s worth paying attention.

I don't have confirmation on the cost of the appliance. My hunch is that it will be competitive with one-year fees from Autonomy, Endeca, and Fast Search (Microsoft), which is to say a six-figure number. If you have solid prices for Clearwell, use the comments section of the Web log to share that information. Please check out the company at ClearwellSystems.com.

Stephen Arnold, June 9, 2008

Deep Web Tech’s Abe Lederman Interviewed

June 9, 2008

Abe Lederman, one of the founders of Verity, created Deep Web Technologies to provide “one-stop access to multiple research resources.” By 1999, Deep Web Technologies offered a system that performed “federated search.” Mr. Lederman defines “federated search” as a system that “allows users to search multiple information sources in parallel.” He added in his interview with ArnoldIT.com:

Results are retrieved, aggregated, ranked and deduped. This doesn’t seem too difficult, but trust me it’s much harder than one might think. Deep Web started out building federated search solutions for the Federal government. We run some highly visible public sites such as Science.gov, WorldWideScience.org and Scitopia.org. We have expanded our market in the last few years and sell to corporate libraries as well as academic libraries.
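Mr. Lederman's description, query many sources in parallel, then aggregate, rank, and dedupe, maps to a simple pattern. The sketch below is my illustration, not Deep Web's code; the connector callables are hypothetical per-source adapters:

```python
from concurrent.futures import ThreadPoolExecutor

def federated_search(query, connectors, max_results=50):
    """Send one query to several sources in parallel, then merge,
    dedupe, and rank. 'connectors' is a list of hypothetical
    callables, each returning a list of (url, title, score) tuples."""
    with ThreadPoolExecutor(max_workers=max(len(connectors), 1)) as pool:
        result_lists = list(pool.map(lambda c: c(query), connectors))

    merged, seen_urls = [], set()
    for results in result_lists:
        for url, title, score in results:
            if url not in seen_urls:      # crude dedupe on URL
                seen_urls.add(url)
                merged.append((url, title, score))

    # Re-rank across sources; a real system must first normalize the
    # incompatible scoring scales the individual sources report.
    merged.sort(key=lambda r: r[2], reverse=True)
    return merged[:max_results]
```

As Mr. Lederman says, the hard part is hidden in those two comments: per-source adapters and cross-source score normalization are where federated search earns its keep.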

Mr. Lederman believes that Google's "forms" technology to index the content of dynamic Web sites is flawed.

Mr. Lederman said:

Deep Web goes out and in real-time sends out search requests to information sources. Each such request is equivalent to a user going to the search form of an information source and filling the form out. Google is attempting to do something different. Using automated tools Google is filling out forms that when executed will retrieve search results which can then be downloaded and indexed by Google. This effort has a number of flaws, including automated tools that fill out forms with search terms and retrieve results will only work on a small subset of forms. Google will not be able to download every document in a database as it is only going to be issuing random or semi-random queries.
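Mr. Lederman's coverage objection is easy to demonstrate with a toy simulation: issuing queries from a guessed vocabulary against a search form retrieves only the records those guesses happen to match. The data and numbers below are invented for illustration:

```python
import random

random.seed(42)
# Invented toy database: each record is findable under terms
# drawn from a small vocabulary.
database = {i: {random.choice("abcdefgh"), random.choice("abcdefgh")}
            for i in range(1000)}

def probe(guessed_terms):
    """Simulate form-crawling: issue each guessed term as a query
    and collect whatever records the form returns."""
    retrieved = set()
    for term in guessed_terms:
        retrieved |= {i for i, terms in database.items() if term in terms}
    return retrieved

# Guessing three of eight possible terms leaves much of the
# database unretrieved, which is Mr. Lederman's point.
covered = probe(["a", "b", "c"])
print(f"retrieved {len(covered)} of {len(database)} records")
```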

In the exclusive interview, Mr. Lederman reveals a new feature. He calls it "smart clustering." Search results within a cluster are displayed in rank order.

You can read the full text of the interview on the ArnoldIT.com Web site in its Search Wizards Speak series. The interview with Mr. Lederman is the 17th interview with individuals who have had an impact on search and content processing. Search Wizards Speak provides an oral history in transcript form of the origin, functions, and positioning of commercial search and text processing systems.

The interview with Mr. Lederman is here. The index of previous interviews is here.

Stephen Arnold, June 9, 2008

Adaptive Search

June 9, 2008

Technology Review, a publication affiliated with the Massachusetts Institute of Technology, has an important essay by Erica Naone about adaptive computing. Her story here, "Adapting Websites [sic] to Users", provides a useful rundown of high-profile sites that change what's displayed for a particular user based on what actions the user takes on a Web page. I found the screen shots of a prototype British Telecom service particularly useful. When a large, fuzzy telecommunications company embraces autonomous computing on a Web site, I know a technology has arrived. Telcos enforce rigorous testing of even trivial technology to make certain an errant chunk of code won't kill the core system.

For me, the most interesting point in the article is a quotation Ms. Naone attributes to John Hauser, a professor at MIT’s business school; to wit:

Suddenly, you’re finding the website [sic] is easy to navigate, more comfortable, and it gives you the information you need. The user, he says, shouldn’t even realize that the website [sic] is personalized.

User Confusion?

I recall my consternation when one version of Microsoft software displayed reduced menus based on my behaviors. The first time I encountered this change in appearance, I was confused. Then I rooted around in the guts of the system to turn off the adaptive function. I have an eidetic visual memory that allows me to recall locations, procedures, and methods. Once I see something and then it changes, it throws off a wide range of automatic mental processes. In college, I recall seeing an updated version of an economics book; I could pinpoint which charts had been changed, and I found one with an error almost 20 years after taking the course.

[Schematic: simplified autonomous computing process]

This is a schematic I prepared of a simplified autonomous computing process. Note that the core system represented by the circle receives inputs from external and internal processes and sources. The functions in the circular area are, therefore, able to adapt to information about different environmental factors.

Adaptive displays, for me, are a problem. If you want to sell products or shape information for those without this eidetic flaw, adaptive Web pages are for you.

As I thought about the implications of this on-the-fly personalization, I opened a white paper sent to me by a person whom I met via the comments section of my Web log “Beyond Search.”

Microsoft Active in the Field Too

The essay is "What Is Autonomous Search?", and it is a product of Microsoft's research unit. The authors are Youssef Hamadi, Eric Monfroy, and Frédéric Saubion. Each author has an academic affiliation, and I will let you download the paper and sort out its provenance. You can locate the paper here.

In a nutshell, the paper makes it clear that Microsoft wants to use autonomous techniques to make certain types of search smarter. The idea is a deeper application of algorithms and methods that morph a Web page to suit a user’s behaviors. Applied to search, autonomous functions monitor information, machine processes, and user behaviors via log files. When something significant changes, the system modifies a threshold or setting in order to respond to a change.

The system automatically makes inferences. A simple example might be a surge in information and clicks on a soccer player; for example, Gomez. The system would detect this name and automatically note that Gomez was associated with the German Euro 2008 team. Relevance is automatically adjusted. Other uses of the system range from determining what to cache to what relationships can be inferred about users in a geographic region.
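Neither the Microsoft paper nor my summary dictates an implementation, but the feedback loop, watch the logs, detect a surge, nudge a relevance weight, can be sketched as follows. The threshold and boost values are invented:

```python
from collections import Counter

class AdaptiveRanker:
    """Sketch of an autonomous adjustment: when clicks on an entity
    surge past a threshold, boost its relevance weight. Threshold
    and boost values are invented for illustration."""

    def __init__(self, surge_threshold=100, boost=1.5):
        self.clicks = Counter()
        self.weights = {}
        self.surge_threshold = surge_threshold
        self.boost = boost

    def record_click(self, entity):
        self.clicks[entity] += 1
        # e.g. a surge of queries about "Gomez" during Euro 2008
        if self.clicks[entity] > self.surge_threshold:
            self.weights[entity] = self.boost

    def score(self, entity, base_score):
        return base_score * self.weights.get(entity, 1.0)
```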

Google: Automatic with a Human Wrangler Riding Herd

Not surprisingly, Google has a keen interest in autonomous functions. What is interesting is that in the short essay I wrote about Peter Norvig’s conversation with Anand Rajaraman here, Dr. Norvig–now on a Google leave of absence–emphasized Google’s view of automated functions. As I understand what Mr. Rajaraman wrote, Google wants to use autonomous techniques, but Google wants to keep some of its engineers’ hands on the controls. Some autonomous systems can run off the tracks and produce garbage.

I can’t name the enterprise search systems with this flaw, but those search systems that emphasize automated processes that run after ingesting training sets are prone to this problem. The reason is that the thresholds determined by processing the training sets don’t apply to new information entering the system. A simple thought experiment reveals why this happens.

Assume you have a system designed to process information about skin cancer. You assemble a training set of skin cancer information. The search and retrieval system generates good results on test queries; for example, precision and recall scores in the 85 percent range. You turn the system loose on content that is now obtained from Web sites, professional publishers, and authors within an organization. The terminology differs from author to author. The system–anchored in a training set–cannot handle the diffusion of terms or even properly resolve new terms; for example, a new treatment methodology from a different research theater. Over time, the system works less and less well. Training autonomous systems is a tricky business, and it can be expensive.
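One way to see the drift in this thought experiment is to measure how much of each incoming document's vocabulary the training set ever saw; as that overlap falls, thresholds tuned on the training set stop fitting. A minimal sketch:

```python
def vocabulary_overlap(training_docs, new_doc):
    """Fraction of a new document's terms that appeared anywhere in
    the training set. A falling overlap warns that thresholds tuned
    on the training set no longer fit the incoming content."""
    training_vocab = {term for doc in training_docs
                      for term in doc.lower().split()}
    new_terms = set(new_doc.lower().split())
    if not new_terms:
        return 0.0
    return len(new_terms & training_vocab) / len(new_terms)

# A paper using a new treatment vocabulary scores low, flagging
# the need to retrain before results quietly degrade.
```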

Google's approach, therefore, bakes in an expensive human process to keep the "smart" algorithms from becoming dumber over time. The absent-mindedness of an Albert Einstein is a quirk. A search system that becomes stupid is a major annoyance.

You can read more about Google's approach to intelligent algorithms by sifting through the papers on the subject here. If you enjoy patent applications and view their turgid, opaque prose as a way to peek under Google's kimono, I recommend that you download US2008/0022267. This invention by H. Bruce Johnson, Jr. and Joel Webber discloses how a smart system can handle certain programming chores at Google. The idea is that busy, bright Googlers shouldn't have to do certain coding manually. An autonomous system can handle the job. The method involves the types of external "looks" and internal "inputs" that appear in the Microsoft paper by Hamadi, Monfroy, and Saubion.

Observations

I anticipate more public discussion of autonomous computing systems and methods in the near future. Because the technology is out of sight, it is out of mind. It does have some interesting implications for broader social computing issues as well as enterprise search; for example:

  1. Control. Some users–specifically, me–want to control what I see. If there are automatic functions, I want to see the settings and have the ability to fiddle the dials. Denied that, I will spend considerable time and energy trying to get control of the system. If I can’t, then I will take steps to work around the automated decisions.
  2. Unexpected costs. Fully automated systems can go off the rails. In the enterprise search arena, a licensee must be prepared to retrain an automatic system or assign an expensive human to ride herd on the automated functions. Most search vendors provide administrative interfaces to allow a subject matter expert to override or input a correction. Even Google in its new site search and revamped Google Mini allows a licensee to weight certain values such as time.
  3. Suspicion of nefarious intent. When a system operates autonomously, how is a user to know that a particular adjustment has been made to "help" the user? Could the adjustment be made to exploit a psychological weakness of the user? Digital used-car sales professionals could become popular citizens in the Internet community.
  4. Ineffective regulation. Government officials may have a difficult time understanding autonomous systems and methods. As a result, the wizards of autonomous computing operate without any effective oversight.

The concern I have is that “big data” makes autonomous computing work reasonably well. It follows that the company with the “biggest data” and the ability to crunch those data will dominate. In effect, autonomous computing may set the stage for an enterprise that takes the best pieces of the US Steel, the Standard Oil, and J.P. Morgan models to build a new type of monopoly. Agree? Disagree? Use the comments section to let me know your thoughts.

Stephen Arnold, June 9, 2008

Chicago Tribune Online: Why Old Print Subscribers Will Hate the Online Edition

June 8, 2008

I don't spend much time writing about user interface or usability. My 86-year-old father, however, forced me to confront the interface for the Chicago Tribune Online. This essay has a search angle, but the majority of my comments apply to the interface for the Chicago Tribune Online. If you search Google for "Chicago Tribune Online", the first hit is the Chicago Tribune's main Web site. There is no direct link to the electronic edition for subscribers. You can find this service, which requires a user name and password, here. An 86-year-old person doesn't file email like his 64-year-old son or the 12-year-old who lives in the neighborhood.

My father prints out important email. This makes it tricky for him to type in the url and enter his user name and password (a helpful eight letters and digits, all in upper case, so it's impossible for him to discern whether a character is the number zero or the letter "oh").

Why does this matter?

Yesterday (June 6, 2008) I set up an icon that contained sufficient pixie dust to send him to the electronic edition and log him in automatically. This morning he called to tell me that he had nuked his icon. I dutifully explained in an email, which he would print out, how to navigate to the page, enter the user name, enter the eight-character password (remember, there are two possibilities for the zero), click the "save user name and password" option, and access the Sunday newspaper.

Essentially these steps are beyond his computing ability, visual acuity, and keyboarding skills.

Does the Chicago Tribune care? My view is that whoever designed the access Web page gave little thought to the needs of my father. Why should these 20-somethings care? Their world is one of twitching icons and subtle interfaces with designer colors; users like my father are irrelevant.

There's one other weirdness about the log in page for the electronic edition of the Chicago Tribune. My father has a big flat screen, and I set it for 800 by 600 pixels so he can read the text. The problem with this size is that most Web pages, including the ones for this Beyond Search Web log, are designed for larger displays. I use three displays–two for the Windows machine and one big one for the Mac. Linux machines get cast-off monitors, which we often unplug once the machine is running because no one "uses" the monitors perched in front of those boxes.

Not my father; he gets up close and personal. The failure to design for my father is understandable. Life would be easier if people were perpetually 21. Here's the full text of the help tips in the email the Chicago Tribune sent my father:

Getting started with your Chicago Tribune electronic subscription:

  1. To view a story, photo, or advertisement click the item on the full-page image (left side of your screen). It will enlarge on the right side of your screen for easier reading.
  2. Use the pull-down lists located in the top center to navigate through which section and page you would like to view.
  3. Use "Advanced Search" on the top center area of the window to find a specific article.
  4. Use the buttons on the right to email or print each page. Use the buttons on the left to set up email alerts through e-notify and download articles or the entire paper as a PDF.
  5. For more help on all the features, just click on the "Help" button found near the top left under the Chicago Tribune logo.

So, here’s what my father sees when he clicks on the electronic edition link on the 800 x 600 display in his browser:

[Screen shot: Chicago Tribune electronic edition at 800 x 600]

I had trouble figuring out which button and which option was described in the "help" in the registration email. Know why? The log in information requires my father to scroll to the left and then down. There is no visible clue about the log in.


Funnelback 8: New Version Now Available

June 8, 2008

Funnelback, a search and content processing system, has released Version 8 with a number of new features and enhancements. Formerly Panoptic, the system now supports Microsoft SharePoint, Lotus Notes, and mainstream content management systems such as EMC Documentum and Interwoven. (For search history buffs, you can see a demo of the original Panoptic system here.)

You can now generate point-and-click interfaces. Like Vivisimo, Funnelback makes it possible for a user to add a tag to a document. The system can process structured data and index data behind a Web form. The system has added support for Chinese, Japanese, Korean, and Thai. The system can be installed on premises, or it can be deployed in a software as a service (SaaS) model.

You can get more information at the company’s Web site. I profiled the Panoptic / Funnelback system in the third edition of the Enterprise Search Report. I can’t recall if that profile was retained for the current edition of Enterprise Search Report. The company has a number of customers in Canada and the UK, but its profile in the United States was modest. You can access a client list here.

You can see the system in action at the Australian job search site CareerOne here. You can enter a free text concept like “Web developer” and narrow your focus via point-and-click drop down boxes. Funnelback has implemented a browse feature, which some vendors call guided navigation or assisted navigation. Whatever the concept’s buzz word, users like this feature.

There's an implementation of the system's capabilities on the Australian Securities Exchange site. You can use the text search method, or interact via point-and-click, ticker symbols, or role-based views. You may recall that role-based views are a feature of Microsoft's next-generation Dynamics systems. Funnelback seems to be ahead of Microsoft in this approach to complex information retrieval. You can see the Funnelback Financial Planner view of Australian Securities Exchange data here.

The company has roots in academia (Australian National University, I believe) like many other search and content processing systems. My take on the original Panoptic system and the newer Funnelback system was that it represented a good value. The drawback is one that many non-US companies face when trying to make a sale in the American market: procurement teams like to have a local presence for a product that has brand recognition among senior managers. I've heard rumors that Funnelback will open a US office, but I have no confirmation that this is true. I will keep you posted. In the meantime, check out the system.

Stephen Arnold, June 7, 2008

The Semantic Chimera

June 8, 2008

GigaOM has a very good essay about semantic search. What I liked was the inclusion of screen shots of results of natural language queries–that is, queries without Boolean operators. Two systems indexing Wikipedia are available in semantic garb: Cognition here and Powerset here. (Note: there is another advanced text processing company called Cognition Technologies whose url is www.cognitiontech.com. Don’t confuse these two firms’ technologies.) GigaOM does a good job of making posts findable, but I recommend navigating to the Web log immediately.

Nitin Karandikar reviews both Cognition’s and Powerset’s approach, so I don’t need to rehash that material. For me the most important statement in the essay is this one:

There are still queries (especially when semantic parsing is not involved) in which Google results are much better than [sic] either Powerset or Cognition.

Let me offer several observations about semantic technology applied to constrained domains of content like the Wikipedia:

  1. Semantic technology is extremely important in text processing. By itself, it is not a silver bullet. A search engine vendor can say, “We use semantic technology”. The payoff, as the GigaOM essay makes clear, may not be immediately evident. Hence, the “Google is better” type statement.
  2. Semantic technology is in many search systems, just not given center stage. Like Bayesian maths, semantic technology is part of the search engine vendors' toolkits. Semantic technology delivers very real benefits in functions from disambiguation to entity extraction. As this statement implies, there are many different types of semantics in the semantic technology spectrum. Picking the proper chunk of semantic technology for a particular process is complicated stuff, and most search engine vendors don't provide much information about what they do, where they get the technology, or how the engineers determined which semantic widget to use in the first place. In my experience, the engineers arrive at their jobs with academic and work experience. Those factors often play a more important part than rigorous testing.
  3. Google has semantic technology in its gun sights. In February 2007, information became available about Google's programmable search engine, which has semantics in its plumbing. These patent applications state that Google can discern context from various semantic operations. Google–despite its sudden willingness to talk in fora about its universal search and openness–doesn't say much about semantics, and for good reason. It's plumbing, not a service. Google has pretty good plumbing, and its results are relevant to many users. Google doesn't dwell on the nitty gritty of its system. It's a secret ingredient, and no user really cares. Users want answers or relevant information, not a lab demo of a single text processing discipline.
  4. Most users don’t want to type more than 2.2 words in a query. Forget typing well formed queries in natural language. Users expect the system to understand what is needed and the situation into which the information fits. Semantic technology, therefore, is an essential component of figuring out meaning and intention. Properly functioning semantic processes produce an answer. The GigaOM essay makes it clear that when the answers are not comprehensive, on point, or what the user wanted, semantic technology is just another buzz word. Semantic technology is incredibly important, just not as an explicit function for the user to access.

I talk about semantic technology, linguistic technologies, and statistical technologies in this Web log and in my new study for the Gilbane Group. The bottom line is that search doesn’t pivot on one approach. Marketers have a tough time explaining how their systems work, and these folks often fall back on simplifications that blur quite different things. Mash ups are good in some contexts, but in understanding how a Powerset integrates a licensed technology from Xerox PARC and how that differs from Cognition’s approach, simplifications are of modest value.

In my experience, a company which starts out as statistics-only quickly expands its system to handle semantics and linguistics. The reason: there's no magic formula that makes search work better. Search systems are dynamic, and the engineers bolt new functions on in the hope of finding something that will convert a demo into a Google killer. That has not happened yet, but it will. When a better Google emerges, describing it as a semantic search system will not tell the entire story. Plumbing that runs compute-intensive processes to crunch log data, and smart software, are important too.

A demo is not a scalable commercial system. By definition a service like Google's incorporates many systems and methods. Search requires more than one buzz word. You may also find the New York Times's Web log post by Miguel Helft about Powerset helpful. It is here.

Stephen Arnold, June 8, 2008

Mobile Projection: Truly Stunning

June 7, 2008

Information Week reporter K.C. Jones reported an iSuppli estimate that stunned me. The title of the story is "Wireless Social Networking To Generate $2.5 Trillion By 2020." You can read it here, but hurry; news has a peculiar way of becoming hard to find a day or two after the story appears on a Web site.

iSuppli–a company in the business of providing applied market intelligence–projected that wireless social networking products, services, applications, components, and advertising will generate more than $2.5 trillion in revenue by 2020. I think that's 12 zeros.

I'm not sure I know what wireless social networking is, but if iSuppli is correct–consultants and research firms are rarely off base–it's a great opportunity for entrepreneurs who catch the wave. Wow, $2.5 trillion in 12 short years. I thought I had seen some robust estimates from Forrester, Gartner, 451, and ComScore, but the iSuppli projection is a keeper.

Stephen Arnold, June 7, 2008

Lexalytics: Stepping Up Its Marketing

June 7, 2008

Lexalytics is a finalist in the annual MIXT (Massachusetts Innovation & Technology Exchange) awards. Lexalytics has also revamped its Web site. The company now makes it easy to download a trial of its text analytics software. The trial is limited to 50 documents, but you can generate a list of entities and summaries of the processed documents. The most interesting function is the trial's ability to display a sentiment score for a document. In effect, you can tell if opinion is running for or against a product.

The company’s system performs three functions on collections of content. The content can be standard office files such as Word or PowerPoint documents. The system can ingest Web log content and RSS streams as well. Once installed, the system outputs:

  • The sentiment and tone from a text source
  • The names of the people, companies, places or other entities in processed content 
  • Any hot themes in a text source.
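Lexalytics doesn't disclose its scoring model. A bare-bones lexicon counter shows the general shape of a document-level sentiment score; the word lists here are invented, and a commercial engine adds weighting, phrase handling, and negation:

```python
# Invented mini-lexicons; a commercial engine uses far larger,
# weighted lists plus phrase and negation handling.
POSITIVE = {"good", "great", "love", "excellent", "reliable"}
NEGATIVE = {"bad", "poor", "hate", "broken", "slow"}

def sentiment_score(text):
    """Return a score in [-1, 1]: positive means opinion runs for
    the product, negative means against it."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

print(sentiment_score("great product, reliable"))   # 1.0
print(sentiment_score("slow and broken"))           # -1.0
```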

Lexalytics has provided technology to other search and content processing companies; for example, Northern Light and Fast Search & Transfer, to name two. A happy quack to the Lexalytics team for the MIXT recognition. You can learn more about the company here.

Stephen Arnold, June 7, 2008

Inside the Microsoft Mind

June 6, 2008

The Washington Post's Peter Whoriskey did a bang-up job with his interview of Steve Ballmer, the lead dog for the Microsoft pack. Traditional media can make it tough to locate an electronic version of a story, so click here immediately and read the article.

I don’t want to spoil your fun, so I won’t recycle or paraphrase the statements Mr. Whoriskey captured. There was one comment that stuck in my mind:

I have no clue what [Google is] up to. It’s very hard for me to understand what they are up to. . . . I don’t know what Google’s angle is because it sometimes looks like Google wants to become a telecommunications company. And yet that may not be right. But that recent thing where they went in with Sprint and WiMax guys is very confusing to me. I think it’s very confusing to a number of telecommunications companies, as well.

This statement is particularly revealing to me. I have a modest bit of experience with Microsoft, both in the pre-Google days (before 1998) and the post-Google days (1999 to 2007). When I worked on a couple of tiny jobs as a sub-sub-contractor to the Redmond machine, the focus across the people whom I met was pretty clear. In fact, the people used the phrase "Microsoft agenda" to refer to Windows, Office, and servers. The "agenda" meant sell licenses, get organizations drinking the Microsoft-flavored Kool-Aid, and "put a computer on every desk."

The post-Google period can be summarized for me in one word: diffused. The original "agenda" expanded in a number of ways. These decisions have been documented in hundreds of books, articles, and Web log posts. Let me mention a few and then move on to my observations: MSN, Zune, Xbox, WebTV, and the UMPC. Promising businesses to be sure: "agenda" changers all.

Mr. Ballmer’s statement about his not understanding what “they are up to” is revelatory. The “they” is Google. I wonder if the “confusing” part of Google is a reflection of Microsoft itself.

What my research suggests is that Google is moving in a deliberate way and has been since its initial public offering. As the company has grown, there are more Google initiatives, but these are of almost zero incremental cost to Google. Most Google innovations are software that a code wizard loads on the Google supercomputer. If there are clicks, Google cares. If there are no clicks, there's no cost or revenue loss, just learning what doesn't work.

I have documented Google's approach in my two studies, The Google Legacy (plumbing) and Google Version 2.0 (mathematical methods). You can buy copies of these here. Others have followed in my footsteps and in many cases gone far beyond my individual, early research about Google.

I can sum up six years of research and hundreds of hours of conversations about Google in one word: disruption. Google disrupts and then looks for advantages. "Looks for" is a bit too proactive. What Google does is let the clicks guide it.

Microsoft is facing a disruptive strategy hooked to a different business model. Verizon feels the disruptive force. Traditional publishers sense that Google is “coming”. I look forward to more information from the mind of Microsoft as it wrestles with Google’s digging in and getting comfortable in some of Microsoft’s markets.

Stephen Arnold, June 6, 2008

Search Wizard Starts New Venture

June 6, 2008

Years ago I examined search technology developed by a teenage whiz named Judd Bowman. You can read about his background here. Mr. Bowman and an equally talented Taylor Brockman had devised a way around memory access bottlenecks that hobbled other search companies' performance. Mr. Bowman founded Pinpoint, which became Motricity. With a keen interest in search, Motricity provided a range of technology to a number of high-profile clients, including Motorola.

Mr. Bowman's new venture is PocketGear, formerly a unit of Motricity. There's not much information available at this time. Based on my knowledge of Mr. Bowman's interest in search, the company will offer mobile search and content services. The venture warrants close observation, particularly with regard to mobile on-device applications and cloud-based search and retrieval.

Note: the Charlotte News Observer article quotes me and cites my for-fee work for investors in Messrs. Bowman and Brockman’s first company, Pinpoint.

Stephen Arnold, June 6, 2008
