Google and Semantics: More Puzzle Pieces Revealed
May 6, 2008
On May 5, 2008, Search Engine Roundtable carried an interesting post, “Google Improves Semantic Search”. You can find the post here. The key point is that Google is using truncation “to stem complex plurals”. Search Engine Roundtable points to the relevant Google Groups thread as well. That link is here.
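If you want to see what truncation-style stemming looks like in practice, here is a minimal Python sketch. Google has not published its rules, so the irregular-plural table and suffix handling below are my own illustrative assumptions, not Google's code.

```python
# Minimal sketch of truncation-style stemming so singular and plural query
# terms match. Purely illustrative; the word list and rules are assumptions.

IRREGULAR_PLURALS = {
    "children": "child",
    "analyses": "analysis",
    "indices": "index",
    "criteria": "criterion",
}

def stem_term(term: str) -> str:
    """Reduce a query term to a crude stem so 'engine' and 'engines' match."""
    term = term.lower()
    if term in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[term]
    if term.endswith("ies") and len(term) > 4:   # 'queries' -> 'query'
        return term[:-3] + "y"
    if term.endswith("es") and len(term) > 3:    # 'searches' -> 'search'
        return term[:-2]
    if term.endswith("s") and len(term) > 3:     # 'engines' -> 'engine'
        return term[:-1]
    return term

if __name__ == "__main__":
    for word in ["engines", "queries", "children", "analyses"]:
        print(word, "->", stem_term(word))
```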
Google’s been active in semantics for a number of years. In 2007, I provided information to the Internet team at the late, great Bear Stearns. Based on my work, Bear Stearns issued a short note about Google’s semantic activity. That document may still be available from a Bear Stearns broker, if there is one still on the job.
An in-depth discussion of five Google semantic-centric inventions appears in Google Version 2.0. This analysis pivots on five patent applications published in February 2007. A sole inventor, Ramanathan Guha, describes a programmable search engine that performs semantic analysis and stores various metadata in a context server. The idea is that the context of a document, a user, or a process provides important insights into the meaning of a document. If you are a patent enthusiast, the five Guha inventions are:
- US2007 0038616, filed on April 10, 2005, and published on February 15, 2007, as “Programmable Search Engine”
- US2007 0038601, filed on August 10, 2005, and published on February 15, 2007, as “Aggregating Content Data for Programmable Search Engines”
- US2007 0038603, filed on August 10, 2005, and published on February 15, 2007, as “Sharing Context Data across Programmable Search Engines”
- US2007 0038600, filed on August 10, 2005, and published on February 15, 2007, as “Detecting Spam-Related and Biased Contents for Programmable Search Engines”
- US2007 0038614, filed on August 10, 2005, and published on February 15, 2007, as “Generating and Presenting Advertisements Based on Context Data from Programmable Search Engines”
These patent documents don’t set a timetable for Google’s push into semantics. It is interesting to me that an influential leader in the semantic standards effort invented the PSE, or programmable search engine. Dr. Guha, a brilliant innovator, demonstrates that he is capable of doing a massive amount of work in a short span of time. I recall that he joined Google in early 2005 and, in less than nine months, filed more than 130 pages describing semantic systems and methods. I grouped these inventions because filing five documents on the same day, each nudging Google’s semantic work forward from a slightly different angle, struck me as interesting.
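For readers who want a concrete picture of the context idea, here is a toy Python sketch. It is not drawn from the patents themselves; the class and field names are hypothetical, and the boost logic is simply my own illustration of how stored context about a user and a document could influence a result list.

```python
# Toy illustration (not from the Guha patents) of a "context server" that
# stores metadata about documents and users, which a search routine then
# consults to interpret and rank results. All names are hypothetical.

from collections import defaultdict

class ContextServer:
    """Stores context metadata keyed by entity id (document, user, process)."""
    def __init__(self):
        self._context = defaultdict(dict)

    def annotate(self, entity_id: str, key: str, value: str) -> None:
        self._context[entity_id][key] = value

    def context_for(self, entity_id: str) -> dict:
        return dict(self._context[entity_id])

def search(query: str, docs: dict, ctx: ContextServer, user_id: str) -> list:
    """Return matching documents, boosting those whose subject matches the
    user's stated interest in the context store."""
    interest = ctx.context_for(user_id).get("interest")
    results = []
    for doc_id, text in docs.items():
        if query.lower() in text.lower():
            score = 1.0
            if interest and interest == ctx.context_for(doc_id).get("subject"):
                score += 0.5   # context-based boost
            results.append((score, doc_id))
    return [doc_id for _, doc_id in sorted(results, reverse=True)]

# Example use
ctx = ContextServer()
ctx.annotate("user42", "interest", "semantics")
ctx.annotate("doc1", "subject", "semantics")
ctx.annotate("doc2", "subject", "hardware")
docs = {"doc1": "Google and semantic search", "doc2": "Google server hardware"}
print(search("google", docs, ctx, "user42"))   # doc1 ranks ahead of doc2
```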
Stephen Arnold, May 7, 2008
LTU Releases LTU-Finder 3.0
April 28, 2008
One of the leaders in image recognition and analysis is a decade-old company, LTU Technologies. The firm has released LTU-Finder v. 3.0, which it describes as “a breakthrough tool for image and video recognition in the field of computer forensics”. Beyond forensics, LTU’s system suits a wide range of enterprise image and video applications in e-discovery, copyright, and security.
Version 3.0 of LTU-Finder includes image and video content recognition technology that can increase the speed and scope of forensic and legal investigations as well as e-discovery. The new version includes enhanced image and video recognition capabilities and introduces text data identification tools that further automate large-scale file searches in the legal, e-discovery, and law enforcement fields. You can use LTU’s products to detect copyright infringement and to generate digital fingerprints of images.
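To make the “digital fingerprint” notion concrete, here is a small Python sketch of an average-hash, a common way to fingerprint images for near-duplicate detection. LTU has not published its algorithm, so treat this as a generic illustration, not the company’s method.

```python
# Generic image fingerprint (average-hash) for near-duplicate detection.
# Illustrative only; not LTU's algorithm. Requires the Pillow package.

from PIL import Image

def average_hash(path: str, size: int = 8) -> int:
    """Return a 64-bit perceptual fingerprint of an image."""
    img = Image.open(path).convert("L").resize((size, size))
    pixels = list(img.getdata())
    mean = sum(pixels) / len(pixels)
    bits = 0
    for i, p in enumerate(pixels):
        if p > mean:
            bits |= 1 << i
    return bits

def hamming_distance(a: int, b: int) -> int:
    """Count differing bits; small distances suggest near-duplicate images."""
    return bin(a ^ b).count("1")

# Usage: two images whose fingerprints differ by fewer than ~10 bits are
# likely near-duplicates, which is one way a copyright match can be flagged.
# print(hamming_distance(average_hash("original.jpg"), average_hash("copy.jpg")))
```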
LTU-Finder also incorporates automatic document identification tools that separate relevant scanned documents, like e-faxes, from other content such as personal photos or Web graphics. Automating this process eliminates the need for a subject matter expert to click through image files one by one. The system reduces the amount of data that needs to be processed and stored during the e-discovery process.
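As an illustration of this kind of triage, here is a crude Python heuristic that separates scanned text pages from photos by looking for a mostly white, low-saturation image. The thresholds are guesses for demonstration; LTU’s actual classifier is certainly more sophisticated.

```python
# Crude heuristic: scanned pages (e-faxes) tend to be mostly near-white and
# low in color saturation, while photos are not. Thresholds are illustrative
# guesses, not LTU's values. Requires the Pillow package.

from PIL import Image

def looks_like_scanned_document(path: str,
                                white_ratio_threshold: float = 0.6,
                                saturation_threshold: float = 0.15) -> bool:
    img = Image.open(path).convert("RGB").resize((64, 64))
    pixels = list(img.getdata())
    white = sum(1 for r, g, b in pixels if r > 200 and g > 200 and b > 200)
    # crude per-pixel "saturation": spread between channel max and min
    saturation = sum((max(p) - min(p)) / 255 for p in pixels) / len(pixels)
    return (white / len(pixels) > white_ratio_threshold
            and saturation < saturation_threshold)
```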
You can get more information about the company’s image search and recognition technologies at LTU’s Web site here.
Stephen Arnold, April 29, 2008
The Importance of Being First
April 11, 2008
Alex Moskalyuk’s Web log contained a posting on April 10, 2008, asserting that “68 percent of search engine users click on the first page of results.” The story appeared in his Web log on Ziff-Davis’ ZDNet.com site. These data can be tough to find after a few days, so please access the story and capture the data, which come from iProspect, a unit of the Aegis Group.
I am skeptical of usage data from Internet consultancies and search engine optimization companies. With that caveat in mind, the iProspect data reveal a significant trend in search system user behavior. Specifically, over time (if the data are accurate) a growing share of users click only on the first page of results. The chart below illustrates this trend:
The top line is climbing; it shows that close to half of Web search users click only on the first page of results. No real surprise, I suppose. The two other lines underscore the fact that fewer and fewer users work through laundry lists of results. If these data are accurate, information on any page other than the first is unlikely to be reviewed by a user.
What does this mean for enterprise search (sometimes called Intranet search or behind-the-firewall search)? Users won’t spend much time looking for information if it is not slapped in front of their faces. Keyword search in organizations generally delivers a push cart filled with items that may or may not be pertinent to the employee’s query. If consumer behavior carries over to enterprise searchers, any system that takes a query such as “Acme proposal” and generates long lists of results is going to be annoying.
Enterprise search system users need information to do their jobs, so working through a laundry list is almost certain to be more effort than hunting for the needed information in other ways.
The iProspect data have another hook for me. As more young people enter the work force, Web behaviors are going to color their expectations of online search in their employer’s organization. When Google and Microsoft personalize results, using probabilities to deliver a best guess about what a particular person needs, traditional enterprise search systems that serve up laundry lists are going to attract fewer and fewer enthusiastic users.
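As a rough illustration of what a probability-driven “best guess” can mean in code, here is a small Python sketch that re-ranks results using how often a particular user has clicked each content source in the past. It is not Google’s or Microsoft’s method; the data structures and weighting are assumptions for demonstration.

```python
# Illustrative personalization: re-rank results by the probability that this
# user favors a given content source, estimated from past clicks.

from collections import Counter

def personalized_rank(results, user_click_history):
    """results: list of (doc_id, source, base_score).
    user_click_history: list of sources the user has clicked before."""
    clicks = Counter(user_click_history)
    total = sum(clicks.values()) or 1

    def score(item):
        doc_id, source, base = item
        p_source = clicks[source] / total   # best guess at user preference
        return base * (1.0 + p_source)

    return sorted(results, key=score, reverse=True)

history = ["wiki", "wiki", "crm", "wiki"]
results = [("d1", "crm", 0.9), ("d2", "wiki", 0.85)]
print(personalized_rank(results, history))
# d2 ranks ahead of d1 because this user favors "wiki" sources
```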
As reports of deep-seated dissatisfaction with traditional enterprise search and content processing systems become more widely known, Mr. Moskalyuk’s Web log has provided another chunk of suggestive, interesting data. More detail about enterprise search behavior is needed, but in the search business, we have to take what the vendors provide. Like it or not.
Stephen Arnold, April 11, 2008
Sentiment Analysis: Bubbling Up as the Economy Tanks
January 20, 2008
Sentiment analysis is a sub-discipline of text mining. Text mining, as most of you know, refers to processing unstructured information and text blocks in a database to wheedle useful information from sentences, paragraphs, and entire documents. Text mining looks for entities, linguistic clues, and statistically significant high points.
The processing approach varies from vendor to vendor. Some vendors use statistics; others use semantic techniques. More and more vendors mix and match procedures to get the best of each approach. The idea is that software “reads” or “understands” text. None of the more than 100 vendors offering text mining systems and utilities does this as well as a human, but the systems are improving. When properly configured, some systems outperform a human indexer. (Most people think humans are the best indexers, but for some applications, software can do a better job.) Humans are needed to resolve “exceptions” when automated systems stumble. But unlike the human indexer, who often memorizes a handful of terms and applies them without checking the controlled vocabulary for a more appropriate one, software applies the vocabulary consistently. Also, human indexers get tired, and fatigue affects indexing performance. Software indexing is the only practical way to deal with the large volumes of information in digital form today.
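Here is a minimal Python sketch of the controlled-vocabulary point: software that maps synonyms to preferred terms applies the vocabulary the same way every time. The vocabulary below is a made-up example; real systems layer statistics and linguistics on top of this kind of lookup.

```python
# Minimal, hypothetical automated indexing against a controlled vocabulary:
# each preferred term has synonyms, and the preferred term is assigned
# whenever any synonym appears in the text.

CONTROLLED_VOCABULARY = {
    "text mining": ["text mining", "text analytics", "content mining"],
    "sentiment analysis": ["sentiment analysis", "opinion mining"],
    "enterprise search": ["enterprise search", "intranet search",
                          "behind-the-firewall search"],
}

def index_document(text: str) -> list:
    """Return the controlled-vocabulary terms to assign to a document."""
    text = text.lower()
    return [preferred for preferred, synonyms in CONTROLLED_VOCABULARY.items()
            if any(s in text for s in synonyms)]

print(index_document("Opinion mining of intranet search logs"))
# -> ['sentiment analysis', 'enterprise search']
```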
Sentiment analysis “reads” and “understands” text in order to determine whether a document is positive or negative. About eight years ago, my team did a sentiment analysis project for a major investment fund’s start-up. The start-up’s engineers were heads down on another technical matter, so the sentiment analysis job came to ArnoldIT.com.
We took some short cuts because time was limited. After looking at various open source tools and the code snippets in ArnoldIT’s repository, we generated a list of words and phrases that were generally positive and generally negative. We had several collections of text, mostly from customer support projects. We used these, applied some ArnoldIT “magic”, and were able to process unstructured information and assign a positive or negative score to each document based on that dictionary. We assigned a red icon to results that our system identified as negative. Without much originality, we used a green icon to flag positive comments. The investment bank moved on, and I don’t know what the fate of our early sentiment analysis system was. I do recall that it was useful in pinpointing negative emails about products and services.
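A minimal sketch of that dictionary approach appears below. The actual ArnoldIT word lists and scoring “magic” were never published, so the words, the thresholds, and the grey neutral flag here are illustrative assumptions only.

```python
# Minimal dictionary-based sentiment scoring: count positive and negative
# words, then flag the document. Word lists and thresholds are illustrative.

POSITIVE = {"excellent", "good", "happy", "love", "works", "fast"}
NEGATIVE = {"broken", "bad", "angry", "hate", "fails", "slow", "refund"}

def sentiment_score(text: str) -> int:
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def flag(text: str) -> str:
    """Green icon for positive documents, red for negative, grey for neutral."""
    score = sentiment_score(text)
    if score > 0:
        return "GREEN"
    if score < 0:
        return "RED"
    return "GREY"

print(flag("The product is broken and support is slow"))   # RED
print(flag("Support was fast and the fix works"))          # GREEN
```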
A number of companies offer sentiment analysis as a text mining function. Vendors include Autonomy, Corpora Software, and Fast Search & Transfer, among others. Other companies offer sentiment analysis as a hosted service, with the work more sharply focused on marketing and brands. Buzzmetrics (a unit of AC Nielsen), Summize, and Andiamo Systems compete in the consumer segment. ClearForest, before it was subsumed into Reuters (which was then bought by the Thomson Corporation), had tools that performed a range of sentiment functions.
The news that triggered my thinking about sentiment was statistics and business intelligence giant SPSS’s announcement that it had enhanced the sentiment analysis functions of its Clementine content processing system. According to ITWire, Clementine has added “automated modeling to identify the best analytic models, as well as combining multiple predictions for the most accurate results.” You can read more about SPSS’s Clementine technology here. SPSS acquired LexiQuest, an early player in rich content processing, in 2002, and has integrated its own text mining technology with the LexiQuest technology. SAS followed suit but licensed Inxight Software technology and combined that with SAS’s home-grown content processing tools.
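To show what “automated modeling” and “combining multiple predictions” can look like in practice, here is a short sketch using scikit-learn rather than Clementine, whose internals are not public. The dataset is synthetic and the candidate models are arbitrary choices for illustration.

```python
# Illustrative model selection and prediction combining with scikit-learn:
# score several candidate models by cross-validation, keep the best, and
# also combine them in a simple voting ensemble. Not Clementine's method.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(random_state=0),
    "nb": GaussianNB(),
}

# "Automated modeling": score each candidate and keep the best one.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print("best single model:", best, round(scores[best], 3))

# "Combining multiple predictions": a soft-voting ensemble of the candidates.
ensemble = VotingClassifier(list(candidates.items()), voting="soft")
print("ensemble score:", round(cross_val_score(ensemble, X, y, cv=5).mean(), 3))
```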
There’s growing interest in analyzing call center, customer support, and Web log content for sentiment about people, places, and things. I will be watching for more announcements from other vendors. In the behind-the-firewall search and content processing sectors, there’s a strong tendency toward “me too” announcements. The challenge is figuring out which system does what; sorting out the often very modest differences among vendors’ solutions is a tough job.
Will 2008 be the year for sentiment analysis? We’ll know in a few months if SPSS competitors jump on this bandwagon.
Stephen E. Arnold, January 20, 2008.