Storage Rages
March 5, 2009
ComputerWorld’s “Virtualization the Top Trend over the Next 5 Years” here underscores a potential opportunity that most traditional search and content processing vendors won’t be able to handle with their here-and-now solutions.
“Storage technology is similar to insurance in the financial services industry. In times of a recession, you have to manage your risk. Storage protects what you have and reduces risk,” said Steve Ingledew, managing director of Millward Brown Research’s Technology Practice.
What is interesting about this quote from the ComputerWorld article is that storage itself becomes a risk. Are most search and content processing systems up to the task of managing massive repositories of digital information? The answer, in my opinion, is, “Sort of.” Autonomy moved to buy Interwoven to bolster its enterprise information and eDiscovery footprint. Specialists such as Clearwell Systems and Stratify (Iron Mountain) are farther along than most search and content processing companies. But when the volume of data gets into the terabyte and petabyte range, the here-and-now systems may not be up to the task.
With storage booming, there are some major opportunities for companies such as Aster Data, InfoBright, and Perfect Search. Unfamiliar with these companies? One may become the next big thing in data management. Google was on my short list, but the company seems to have lost some zip in the last 12 months. Amazon? At its core it is still an ecommerce vendor and not set up to handle the rigors of spoliation. Storage rages forward.
Stephen Arnold, March 5, 2009
Attivio’s Sid Probstein: An Exclusive Interview
February 25, 2009
I caught up with Sid Probstein, Attivio’s engaging chief technologist, on February 23, 2009. Attivio is a new breed of information company. The company combines a number of technologies to allow its licensees to extract more value from structured and unstructured information. Mr. Probstein is one of the speakers at the Boston Search Engine Meeting, a show that is now recognized as one of the most important venues for those serious about search, information retrieval, and content processing. You can register to attend this year’s conference here. Too many conferences feature confusing multi-track programs, cavernous exhibit halls, and annoyed attendees who find that the substance of the program does not match the marketing hyperbole. When you attend the Boston Search Engine Meeting, you have opportunities to talk directly to influential experts like Mr. Probstein. The full text of the interview appears below.
Will you describe briefly your company and its search / content processing technology? If you are not a company, please, describe your research in search / content processing.
Attivio’s Active Intelligence Engine (AIE) is powering today’s critical business solutions with a completely new approach to unifying information access. AIE supports querying with the precision of SQL and the fuzziness of full-text search. Our patent-applied-for query-side JOIN() operator allows relational data to be manipulated as a database would, but in combination with full-text operations like fuzzy search, fielded search, Boolean search, etc. Finally our ability to save any query as an alert and thereafter have new data trigger a workflow that may notify a user or update another system, brings a sorely needed “active” component to information access.
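To make the idea concrete, here is a minimal sketch, in plain Python rather than Attivio’s own syntax, of what combining a relational join with fuzzy full-text matching looks like. The customer and ticket records, the fuzzy_match helper, and the cutoff value are all invented for illustration; AIE’s actual query language and JOIN() operator are not shown here.

```python
from difflib import get_close_matches

# Hypothetical structured table: customer records keyed by customer_id.
customers = {
    101: {"name": "Acme Corp", "region": "EMEA"},
    102: {"name": "Globex", "region": "APAC"},
}

# Hypothetical unstructured store: support tickets referencing customer_id.
tickets = [
    {"customer_id": 101, "text": "Repeated turbin failures after firmware update"},
    {"customer_id": 102, "text": "Billing question about annual renewal"},
]

def fuzzy_match(query_term, text, cutoff=0.8):
    """Return True if any token in the text approximately matches the term."""
    return bool(get_close_matches(query_term, text.lower().split(), n=1, cutoff=cutoff))

def search_join(query_term, region):
    """Join full-text hits back to structured rows, filtering on a structured field."""
    for ticket in tickets:
        customer = customers[ticket["customer_id"]]  # relational join on the key
        if customer["region"] == region and fuzzy_match(query_term, ticket["text"]):
            yield customer["name"], ticket["text"]

# The query term "turbine" still finds the misspelled "turbin" and joins it
# to the EMEA customer record.
print(list(search_join("turbine", "EMEA")))
```

The point of the sketch is only that the relationship (customer to ticket) survives the text query, which is the capability Mr. Probstein says most engines lose.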
By extending enterprise search capabilities across documents, data and media, AIE brings deeper insight to business applications and Web sites. AIE’s flexible design enables business and technology leaders to speed innovation through rapid prototyping and deployment, which dramatically lowers risk – an important consideration in today’s economy. Systems integrators, independent software vendors, corporations and government agencies partner with Attivio to automate information-driven processes and gain competitive advantage.
What are the three major challenges you see in search / content processing in 2009?
May I offer three plus a bonus challenge?
First, understanding structured and unstructured data; currently most search engines don’t deal with structured data as it exists; they remove or require removal of the relationships. Retaining these relationships is the key challenge and a core value of information access.
Second, switching from the “pull” model in which end-users consume information, to the “push” model in which end-users and information systems are fed a stream of relevant information and analysis.
Third, being able to easily and rapidly construct information access applications. The year-long implementation cycle simply won’t cut it in the current climate; after all, that was the status quo for the past five years – long, challenging implementations, as search was still nascent. In 2009 what took months should take weeks. Also, the model has to change. Instead of trying to determine exactly how to build your information access strategy – the classic “aim, fire” approach, which often misses! – the new model is to “fire” and then “aim, aim, aim”: correct your course and learn as you go so that you ultimately produce an application you are delighted with.
I also want to mention supporting complex analysis and enrichment of many different forms of content. For example: identifying important fields, from a search perspective; detecting relationships between pieces of content, or entire silos of content. This is key to breaking down silos – something leading analysts agree will be a major focus in enterprise IT starting in 2011.
With search / content processing decades old, what have been the principal barriers to resolving these challenges in the past?
There are several hurdles. First, the inverted index structure has not traditionally been able to deal with relationships; just terms and documents. Second, a lack of tools to move data around, as opposed to simply obtaining content, has been a barrier for enterprise search in particular. There has not been an analog to “ETL” in the unstructured world. (The “connector” standard is about getting data, not moving it.) Finally, the lack of a truly dynamic architecture has meant having to re-index when changing configuration or adding new types of data to the index; a lack of support for rapid updates has also led to a proliferation of paired search engines and databases.
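A minimal sketch of the classic inverted index Mr. Probstein describes may help: the structure records only which terms occur in which documents, so any relationship between rows or fields in the source data is flattened away. The sample documents below are invented for illustration.

```python
from collections import defaultdict

docs = {
    1: "invoice for acme turbine order",
    2: "acme support contract renewal",
}

# Classic inverted index: term -> set of document ids. Whatever relationships
# existed between fields or rows in the source system do not survive; only the
# term-to-document mapping does.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

print(index["acme"])  # {1, 2}
```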
With the rapid change in the business climate, how will the increasing financial pressure on information technology affect search / content processing?
Information access is critically important during a recession. Every interaction with the customer has the potential to cause churn. Reducing churn is less costly by far than acquiring new customers. Good service is one of the keys to retaining customers, and a typical cause of poor service is … poor information access. A real-life example: I recently rolled over my 401(k). I had 30 days to do it, and did so on the 28th day via phone. On the 29th day someone else from my financial services firm called back and asked me if I wanted to roll over my 401(k). This was quite surprising. When asked why the representative didn’t know I had done it the day before, they said, “I don’t have access to that information.” The cost of that information access problem was two phone calls: the second rollover call, and then another call back from me to verify that I had, in fact, rolled over my 401(k).
From the internal perspective of IT, demand to turn-around information access solutions will be higher than ever. The need to show progress quickly has never been higher, so selecting tools that support rapid development via iteration and prototyping is critically important.
Search / content processing systems have been integrated into such diverse functions as business intelligence and customer support. Do you see search / content processing becoming increasingly integrated into enterprise applications?
Search is an essential feature in almost every application used to create, manage or even analyze content. However, in this mode search is both a commodity and a de facto silo of data. Standalone search and content processing will still be important because it is the best way to build applications using data across these silos. A good example here is what we call the Agile Content Network (ACN). Every content management system (CMS) has at least minimal search facilities. But how can a content provider create new channels and micro-sites of content across many incompatible CMSs? Standalone information access that can cut across silos is the answer.
Google has disrupted certain enterprise search markets with its appliance solution. The Google brand creates the idea in the minds of some procurement teams and purchasing agents that Google is the only or preferred search solution. What can a vendor do to adapt to this Google effect?
It is certainly true that Google has a powerful brand. However, vendors must promote transparency and help educate buyers so that they realize, on their own, the fit or non-fit of the GSA. It is also important to explain how what your product does is different from what Google does and how those differences apply to the customers’ needs for accessing information. Buyers are smart, and the challenge for vendors is to be sure to communicate and educate about needs, goals and the most effective way to attain them.
A good example of the Google brand blinding customers to their own needs is detailed in the following blog entry: http://www.attivio.com/attivio/blog/317-report-from-gilbane-2008-our-take-on-open-source-search.html
As you look forward, what are some new features / issues that you think will become more important in 2009? Where do you see a major break-through over the next 36 months?
I think that there continue to be no real standards around information access. We believe that older standards like SQL need to be updated with full-text capabilities. Legacy enterprise search vendors have traditionally focused on proprietary interfaces or driving their own standards. This will not be the case for the next wave of information access companies. Google and others are showing how powerful language modeling can be. I believe machine translation and various multi-word applications will all become part of the landscape in the next 36 months.
Mobile search is emerging as an important branch of search / content processing. Mobile search, however, imposes some limitations on presentation and query submission. What are your views of mobile search’s impact on more traditional enterprise search / content processing?
Mobile information access is definitely emerging in the enterprise. In the short term, it needs to become the instrument by which some updates are delivered – as alerts – and in other cases it is simply a notification that a more complex update – perhaps requiring a laptop – is available. In time mobile devices will be able to enrich results on their own. The iPhone, for example, could filter results using GPS location. The iPhone also shows that complex presentations are increasingly possible.
Ultimately, mobile devices, like the desktop, the call center, the digital home, and the brick-and-mortar store kiosk, are all access and delivery channels. Getting the information flow for each to work consistently while taking advantage of the intimacy of the medium (e.g. GPS information for mobile) is the future.
Where can I find more information about your products, services, and research?
The best place is our Web site: www.attivio.com.
Stephen Arnold, February 25, 2009
Mysteries of Online 8: Duplicates
February 24, 2009
In print, duplicates are the province of scholars and obsessives. In the good old days, I would sit in a library with two books. I would then look at the data in one book and then hunt through the other book until I located the same or similar information. Then I would examine each entry to see if I could find differences. Once I located a major difference such as a number, a quotation, or an argument of some type, I would write down that information on a 5×8 note card. I had a forensics scholarship along with some other cash for guessing accurately on objective tests. To get the forensics grant, I had to participate in cross examination debate, extemporaneous speaking, and just about any other crazy Saturday time waster my “coaches” demanded.
Not surprisingly, mistakes or variances in books, journals, and scholarly publications were not of much concern to some of the students who attended the party school that accepted an addled goose with thick glasses. There were rewards for spending hours looking for information and then chasing down variances. I recall that our debate team, which was reasonably good if you liked goose arguments, was up against a team from Dartmouth College. I was listening when I heard a statement that did not match what I had located in a government reference document and in another source. The opponent from Dartmouth had erroneously presented the information. I gave a short rebuttal. I still remember the look of nausea that crossed our opponent’s face when she realized that I had presented what I found in my hours of manual checking and reminded the judges that distorting information suggests an issue with the argument. We won.
For most people, two individuals citing the same source is an example of duplicate information. Upon closer inspection, duplication does not mean identical in gross features. Duplication drills down to the details of the information: determining which item of information is at variance, figuring out why, and deciding which version of the duplicate is most likely correct.
That’s when the fun begins in traditional research. An addled goose can do this type of analysis. Brains are less important than persistence and a tolerance for some dull, tedious work. As a result, finding duplicative information and then figuring out variances was not something the typical college sophomore spent much time doing.
Enter computer systems.
Deep Web, Surface Sparkles Occlude Deeper Look
February 23, 2009
You can read pundits, mavens, and wizards comment on the New York Times’s “Exploring a Deep Web that Google Can’t Grasp.” The original is here for a short time. Analyses of varying degrees of usefulness appear in Search Engine Land and in the Marketing Pilgrim’s “Discovering the Rest of the Internet Iceberg” here.
There’s not much I can say to reverse the flow of misinformation about what Google is doing because Google doesn’t talk to me or to the pundits, mavens, and wizards who explain the company’s alleged weaknesses. In 2007, I wrote a monograph about Google’s programmable search engine disclosures. Published by Bear Stearns, this document is no longer available. I included the dataspace research in my Beyond Search study for The Gilbane Group in April 2008. In September, Sue Feldman and I wrote about Google’s dataspace technology. You can get a copy of the dataspace report directly from IDC here. Ask for document 213562. Both of these studies explicate Google’s activities in structured data and how those data mesh with Google’s unstructured information methods. I did a detailed explanation of the programmable search engine inventions in Google Version 2.0. That report is still available, but it costs money and I will be darned if I will restate information that is in a for-fee study. There are some brief references to these technologies available at ArnoldIT.com without charge and in the archive to this Web log. You can search the ArnoldIT.com archive at www.arnoldit.com/sitemap.html and this Web log from the search box on any blog page.
This sure looks like “deep Web” information to me. But I am not a maven, wizard, or pundit. Nor do I understand search with the depth of the New York Times, search engine optimization experts, and trophy generation analysts. I read patent documents, an activity that clearly disqualifies me from asserting that Google can’t perform a certain action based on what it has disclosed in open source documents. Life is easier when such disclosures are ignored or excluded from the research process.
So what? Two points:
- Google can and does handle structured data. Examples exist in the wild at base.google.com and by entering the query “lga sfo” from Google.com’s search box.
- Yip yap about the “deep Web” has been popular for a while, and it is an issue that requires more analysis than assertions based on relatively modest research into the subject.
In my opinion, before asserting that Google is baffled, off track, clueless, or slow on the trigger, look a bit deeper than the surface sheen on Googzilla’s scales. No wonder outfits are surprised by some of Google’s “effortless” initiatives. Dealing only with superficialities means the substance beneath the surface is never seen.
Pundits, mavens, wizards, please take a moment to look into Guha, Halevy, and the other Googlers who have thought about and who are working on structured, semistructured, and unstructured data in the Google data environment. That background will provide some context for Google’s apparent sluggishness in this “space”.
Stephen Arnold, February 23, 2009
Exclusive Interview, Martin Baumgartel, From Library Automation to Search
February 23, 2009
For many years, Martin Baumgartel worked for a unit of T-Mobile. His experience spans traditional information retrieval and next-generation search. Stephen Arnold and Harry Collier interviewed Mr. Baumgartel on February 20, 2009. As one of the featured speakers at the premier search conference this spring, you will be able to hear Mr. Baumgartel’s lecture and meet with him in the networking and post presentation breaks. The Boston Search Engine Meeting attracts the world’s brightest minds and most influential companies to an “all content” program. You can learn more about the conference, the tutorials, and the speakers at the Infonortics Ltd. Web site. Unlike other conferences, the Boston Search Engine Meeting limits attendance in order to facilitate conversations and networking. Register early for this year’s conference.
What’s your background in search?
When I entered the search arena in the 1990s, I came from library automation. Back then, it was all about indexing algorithms and relevance ranking, areas in which I did research to develop a search engine. During eight years at T-Systems, we analyzed the situation in large enterprises in order to provide the right search solution. This included, increasingly, the integration of semantic technologies. Given the present hype about semantic technologies, it has been a focus in current projects to determine which approach or product can deliver in specific search scenarios. A related problem is to identify the underlying principles of user interface innovations to know what’s going to work (and what’s not).
What are the three major challenges you see in search / content processing in 2009?
Let me come at this in a non-technical way. There are plenty of challenges awaiting algorithmic solutions, but I see more important challenges here:
- Identifying the real objectives and fighting myths. Implementing internal search has not become any easier for an organization. There are numerous internal stakeholders, paired with very high user expectations (they want the same quality as Internet search, only better, more tailored to their work situation, and without advertising…). Keeping the analysis sharp becomes difficult in an orchestra of opinions, in particular when familiar brand names get involved (“Let’s just take Google internally; that will do.”)
- Avoiding oversimplification. Although many CIOs claim they have “cleaned up” their intranets, enterprise search remains complex, both technologically and in terms of successful management. Tackling the problem with a self-proclaimed simple solution (plug in, ready, go) will provide search, but perhaps not the search solution needed, and with hidden costs, especially in the long run. At the other extreme, a design that is too complex – with the purchase of dozens of connectors – is likely to burst your budget.
- Attention. Recently, I have heard a lot about how the financial crisis will affect search. In my view, the effects only reinforce the challenge of drawing enough management attention to search to make sure it is treated like other core assets. Some customers might slow down the purchase of some SAP add-on modules or postpone a migration to the next version of backup software, but the status of those solutions among CIOs will remain high and unquestioned.
With search / content processing decades old, what have been the principal barriers to resolving these challenges in the past?
There’s no unique definition of the “enterprise search problem” as there is for a math theorem. Therefore, you find somewhat amorphous definitions of what is to be solved. Take the scope of content to be searched: everything internal? And nothing external? Another obstacle is the widespread belief in shortcuts. A popular example: let’s just index the content in our internal content management system; the other content sources are irrelevant. That way, the concept of completeness in the search result set is sacrificed. But search can be as grueling as a marathon: you need endurance, and there are no shortcuts. If you take a shortcut, you’ve failed.
What is your approach to problem solving in search and content processing?
Smarter software definitely, because the challenges in search (and there are more than three) are attracting programmers and innovators to come up with new solutions. But, in general, my approach is “keep your cool”. Assess the situation, analyze tools and environment, design the solution and explain it clearly. In the process, interfaces have to be improved sometimes in order to trim them down to fit with the corporate intranet design.
With the rapid change in the business climate, how will the increasing financial pressure on information technology affect search / content processing?
We’ll see how far a consolidation process will go. Perhaps we’ll see discontinued search products where we initially didn’t expect it. The relationship raised in the following question might also be affected: software companies are unlikely to cut back on core features of their products, but integrated search functions are perhaps candidates for the scalpel.
Search / content processing systems have been integrated into such diverse functions as business intelligence and customer support. Do you see search / content processing becoming increasingly integrated into enterprise applications?
I’ve seen it the other way around: customer support managers told me (the search person) that the built-in search tool is OK but that they would like to look up additional information from some other internal applications. I don’t believe that built-in search will replace stand-alone search. The term “built-in” tells you that the main purpose of the application is something else. No surprise, then, that the user interface was designed for this main purpose and will consequently not address typical search needs.
Google has disrupted certain enterprise search markets with its appliance solution. What can a vendor do to adapt to this Google effect?
To address this Google effect, a vendor should point out where it differs from Google and why those differences matter.
But I see Google as a significant player in enterprise search, if only for the mindset of procurement teams you describe in your question.
As you look forward, what are some new features / issues that you think will become more important in 2009?
The issue of cloud sourcing will gain traction. As a consequence, small and medium-sized enterprises, among others, will discover that they need not invest in in-house content management and collaboration applications but can use a hosted service instead. This is when you need more than a “behind the firewall” search, because content will be scattered across multiple clouds (a CRM cloud, an Office cloud). I’m not sure whether we will see a breakthrough there in 36 months, but the sooner the better.
Where can I find more information about your services and research?
http://www.linkedin.com/in/mbaumgartel
Stephen E. Arnold, www.arnoldit.com/sitemap.html and Harry Collier, www.infonortics.com
eZ Find: SOLR and More for Search
February 23, 2009
There’s another updated open source search product on the market. eZ, http://ez.no/, has tuned up eZ Find, http://ez.no/ezfind, a search extension for eZ Publish, http://ez.no/ezpublish, its enterprise Web content management system. The extension is free to download and install on eZ Publish sites at http://ez.no/ezfind/download. eZ Publish comes out of the box prepped to help you get your content online ASAP: publishing through browsers or word processors, translations, multiple file uploads, picture and video galleries, and search. eZ Find is just one piece of the puzzle, and with it you can now fine-tune relevance ratings, use faceted searches, take advantage of Boolean, fuzzy, and wildcard operators, and more. eZ Publish license information is at http://ez.no/software/proprietary_license_options, and the company appears to have plenty of happy customers; a list is at http://ez.no/customers. Stay tuned for more updates.
Jessica W. Bratcher, February 23, 2009
Exclusive Interview with Kathleen Dahlgren, Cognition Technologies
February 18, 2009
Cognition Technologies’ Kathleen Dahlgren spoke with Harry Collier about her firm’s search and content processing system. Cognition’s core technology, Cognition’s Semantic NLPTM, is the outgrowth of ideas and development work which began over 23 years ago at IBM where Cognition’s founder and CTO, Kathleen Dahlgren, Ph.D., led a research team to create the first prototype of a “natural language understanding system.” In 1990, Dr. Dahlgren left IBM and formed a new company called Intelligent Text Processing (ITP). ITP applied for and won an innovative research grant with the Small Business Administration. This funding enabled the company to develop a commercial prototype of what would become Cognition’s Semantic NLP. That work won a Small Business Innovation Research (SBIR) award for excellence in 1995. In 1998, ITP was awarded a patent on a component of the technology.
Dr. Dahlgren is one of the featured speakers at the Boston Search Engine Meeting. This conference is the world’s leading venue for substantive discussions about search, content processing, and semantic technology. Attendees have an opportunity to hear talks by recognized leaders in information retrieval and then speak with these individuals, ask questions, and engage in conversations with other attendees. You can get more information about the Boston Search Engine Meeting here.
The full text of Mr. Collier’s interview with Dr. Dahlgren, conducted on February 13, 2009, appears below:
Will you describe briefly your company and its search / content processing technology?
CognitionSearch uses linguistic science to analyze language and provide meaning-based search. Cognition has built the largest semantic map of English, with morphology (word stems such as catch-caught, baby-babies, communication, intercommunication), word senses (“strike” meaning hit, “strike” as a state in baseball, etc.), synonymy (“strike” meaning hit, “beat” meaning hit, etc.), hyponymy (“vehicle”-“motor vehicle”-“car”-“Ford”), meaning contexts (“strike” means a game state in the context of “baseball”), and phrases (“bok-choy”). The semantic map enables CognitionSearch to unravel the meaning of text and queries, with the result that search performs with over 90% precision and 90% recall.
What are the three major challenges you see in search / content processing in 2009?
That’s a good question. The three challenges in my opinion are:
- Too much irrelevant material retrieved – poor precision
- Too much relevant material missed – poor recall
- Getting users to adopt new ways of searching that are available with advanced search technologies. NLP semantic search offers users the opportunity to state longer queries in plain English and get results, but they are currently used to keywords, so there will be an adaptation required of them to take advantage of the new advanced technology.
With search / content processing decades old, what have been the principal barriers to resolving these challenges in the past?
Poor precision and poor recall are due to the use of pattern-matching and statistical search software. As long as meaning is not recovered, the current search engines will produce mostly irrelevant material. Statistics on popularity boost many of the relevant results to the top, but as a measure across all retrievals, precision is under 30%. Poor recall means that sometimes there are no relevant hits, even though there may be many hits. This is because the alternative ways of expressing the user’s intended meaning in the query are not understood by the search engine. If search engines add synonyms without first determining meaning, recall can improve, but at the expense of extremely poor precision. This is because all the synonyms of an ambiguous word in all of its meanings are used as search terms. Most of these are off target. While the ambiguous words in a language are relatively few, they are among the most frequent words. For example, the seventeen thousand most frequent words of English tend to be ambiguous.
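For readers who want the arithmetic behind the precision and recall figures Dr. Dahlgren cites, here is a minimal sketch of how the two measures are computed for a single query. The document identifiers are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query, given sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# 10 documents returned, 3 of them relevant, out of 12 relevant documents overall:
print(precision_recall(range(10), {2, 5, 9, *range(100, 109)}))  # (0.3, 0.25)
```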
What is your approach to problem solving in search and content processing?
Cognition focuses on improving search by improving the underlying software and making it mimic human linguistic reasoning in many respects. CognitionSearch first determines the meanings of words in context and then searches on the particular meanings of search terms, their synonyms (also disambiguated) and hyponyms (more specific word meanings in a concept hierarchy or ontology). For example, given a search for “mental disease in kids”, CognitionSearch first determines that “mental disease” is a phrase synonymous with an ontological node, that “kids” has the stem “kid”, and that it means “human child”, not a type of “goat”. It then finds documents with sentences containing “mental disease” or “OCD” or “obsessive compulsive disorder” or “schizophrenia”, etc., and “kid” (meaning human child) or “child” (meaning human child) or “young person” or “toddler”, etc.
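The expansion step Dr. Dahlgren describes can be illustrated with a toy sketch. The tiny semantic map below is invented for the example and assumes the word-sense disambiguation has already happened upstream; Cognition’s actual semantic map and matching logic are far richer.

```python
# Toy semantic map: a disambiguated word sense maps to synonyms and hyponyms.
SENSES = {
    "mental disease": {"mental disease", "mental illness", "OCD",
                       "obsessive compulsive disorder", "schizophrenia"},
    "kid(human child)": {"kid", "child", "toddler", "young person"},
}

def expand(sense_keys):
    """Expand disambiguated senses into the surface terms to search on."""
    return [SENSES[key] for key in sense_keys]

def matches(sentence, term_groups):
    """A sentence matches if it contains at least one term from every group."""
    text = sentence.lower()
    return all(any(term.lower() in text for term in group) for group in term_groups)

groups = expand(["mental disease", "kid(human child)"])
print(matches("Schizophrenia rates in toddlers are rising", groups))  # True
print(matches("The goat kid grazed near the barn", groups))           # False
```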
Multi core processors provide significant performance boosts. But search / content processing often faces bottlenecks and latency in indexing and query processing. What’s your view on the performance of your system or systems with which you are familiar?
Natural language processing systems have been notoriously challenged by scalability. Recent massive upgrades in computer power have now made NLP a possibility in Web search. CognitionSearch has sub-second response time and is fully distributed to as many processors as desired for both indexing and search. Distribution is one solution to scalability. Another, which CognitionSearch implements, is to compile all reasoning into the index, so that any delays caused by reasoning are not experienced by the end user.
Google has disrupted certain enterprise search markets with its appliance solution. The Google brand creates the idea in the minds of some procurement teams and purchasing agents that Google is the only or preferred search solution. What can a vendor do to adapt to this Google effect? Is Google a significant player in enterprise search, or is Google a minor player?
Google’s search appliance highlights the weakness of popularity-based searching. On the Web, with Google’s vast history of searches, popularity is effective in positioning the more desired sites at the top of the relevance ranking. Inside the enterprise, popularity is ineffective and Google performs as a plain pattern-matcher. Competitive vendors need to explain this to clients, and even show them with head-to-head comparisons of search with Google and search with their software on the same data. Google brand allegiance is a barrier to sales in enterprise search.
Information governance is gaining importance. Search / content processing is becoming part of eDiscovery or internal audit procedures. What’s your view of the role of search / content processing technology in these specialized sectors?
Intelligent search in eDiscovery can dig up the “smoking gun” of violations within an organization. For example, in the recent mortgage crisis, buyers were lent money without proper proof of income. Terms for this were “stated income only”, “liar loan”, “no-doc loan”, “low-documentation loan”. In eDiscovery, intelligent search such as CognitionSearch would find all mentions of that concept, regardless of the way it was expressed in documents and email. Full exhaustiveness in search empowers lawyers analyzing discovery documents to find absolutely everything that is relevant or responsive. Likewise, intelligent search empowers corporate oversight personnel, and corporate staff in general, to find the desired information without being inundated with irrelevant hits (retrievals). Dedicated systems for eDiscovery and corporate search need only house the indices, not the original documents. It should be possible to host a company-wide secure Web site for internal search at low cost.
As you look forward, what are some new features / issues that you think will become more important in 2009? Where do you see a major break-through over the next 36 months?
Semantics and the semantic web have attracted a great deal of interest lately. One type of semantic search involves tagging documents and Web sites and relating them to each other in a hierarchy expressed in the tags. This type of semantic search enables taggers to perfectly control reasoning with respect to the various documents or sites, but is labor-intensive. Another type of semantic search runs on free text, is fully automatic, and uses semantically based software to automatically characterize the meaning of documents and sites, as with CognitionSearch.
Mobile search is emerging as an important branch of search / content processing. Mobile search, however, imposes some limitations on presentation and query submission. What are your views of mobile search’s impact on more traditional enterprise search / content processing?
Mobile search heightens the need for improved precision, because the devices don’t have space to display millions of results, most of which are irrelevant.
Where can I find more information about your products, services, and research?
Harry Collier, Infonortics, Ltd., February 18, 2009
Amazon’s Implicit Metadata
February 18, 2009
Amazon is interesting to me because the company does what Google wants to do but more quickly. The other facet of Amazon that is somewhat mysterious is how the company can roll out cloud services with a smaller research and development budget than Google’s. I have not thought much about the A9 search engine since Udi Manber left to go to Google. The system is not too good from my point of view. It returns word matches but it does not offer the Endeca-style guided navigation that some eCommerce sites find useful for stimulating sales.
Intranet Insights disclosed here that Amazon uses “implicit metadata” to index Amazon content. I can’t repeat the full list assembled by Intranet Insights, and I urge you to visit that posting and read the article. I can highlight three examples of Amazon’s “implicit metadata” and offer a couple of observations.
Implicit metadata means automatic indexing. Authors or subject matter experts can manually assign index terms or interact with a semi-automated system such as that available from Access Innovations in Albuquerque, New Mexico. But humans get tired or fall into a habit of using a handful of common terms instead of consulting a controlled term list. Software does not get tired and can hit 90 percent accuracy once properly configured and resourced. Out of the box, automated systems hit 70 to 75 percent accuracy. I am not going to describe the methods for establishing these scores in this article.
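As a rough illustration of what automatic assignment against a controlled term list involves, here is a toy sketch. The vocabulary, trigger words, and matching rule are invented; neither Amazon’s indexer nor Access Innovations’ system works this simply, and real systems are the ones hitting the accuracy figures cited above.

```python
# Toy controlled vocabulary: preferred term -> trigger words an automatic
# indexer looks for in the text (a stand-in for a real rule base).
VOCAB = {
    "Electronic books": {"kindle", "ebook", "e-reader"},
    "Customer reviews": {"review", "reviewer", "rating"},
}

def auto_index(text):
    """Assign controlled terms whose trigger words appear in the text."""
    tokens = set(text.lower().split())
    return [term for term, triggers in VOCAB.items() if triggers & tokens]

print(auto_index("A new Kindle ebook with a five star rating"))
# ['Electronic books', 'Customer reviews']
```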
Amazon uses, according to Intranet Insights:
- Links to and links from, which is what I call the Kleinberg approach made famous with Google’s PageRank method
- Author’s context; that is, the “page” or “process” the author was on when the document or information object was created. Think of this as a variation of the landing page for an inbound link or the exit page when a visitor leaves a Web site or a process
- Automated indexing; that is, words and phrases.
The idea is that Amazon gathers these data and uses them as metadata. Intranet Insights hints that Amazon uses other information as well; for example, comments in reviews and traffic analysis.
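Amazon has not published how it scores links, so the following is only a minimal sketch of the PageRank-style link analysis the first bullet alludes to: each item’s score is distributed to the items it points at, and the process is repeated until the scores settle. The three-item link graph is invented for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Minimal power-iteration PageRank over a dict of page -> outgoing links."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outgoing in links.items():
            if not outgoing:  # dangling node: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outgoing:
                    new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

# Three items: A and C both point at B, so B accumulates the most score.
print(pagerank({"A": ["B"], "B": ["C"], "C": ["B"]}))
```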
The Amazon system puzzles me when I run certain queries. Let me give some examples of problems I encounter:
- How can one search lists of books created by users? These exist, but for me the unfamiliar list is often difficult to locate and I cannot figure out how to find a particular book title on lists in that Amazon function. Why would I want to do this? I have a title but I want to see other books on lists on which the title appears. If you know how to run this query, shoot me the search string so I can test it.
- How can I filter a results list to separate books that are not yet published from books that are available? This situation arises when looking for Kindle titles using the drop downs and search function for titles in the Kindle collection. The capability exists because the “recommendations” feature segments forthcoming titles from available titles, but the function eludes me in the Kindle subsite.
- How can I run a query and see only the reviews by specific reviewers? I know that a prolific reviewer is Robert Steele. So, I want to see the books he has reviewed and I want to see the names of other reviewers who have reviewed a specific title that Mr. Steele has reviewed.
Amazon’s search system, like the one Google provides for Apple.com, is a frustrating experience for me. Amazon has lost sight of some of the basic principles of search; namely, if you have metadata tags, a user should be able to use them to locate the information in the public index.
This is not a question of implicit or explicit metadata. Amazon, like Apple, is indifferent to user needs. The focus is on delivering a minimally acceptable search service to satisfy the needs of the average user for the express purpose of moving the sale along. The balance I believe is between system burden and generating revenue. Amazon search deserves more attention in my opinion.
Stephen Arnold, February 18, 2009
Semantic Engines’ Dmitri Soubbotin: An Exclusive Interview
February 10, 2009
Semantics are booming. Daily I get spam from the trophy generation touting the latest and greatest in semantic technology. A couple of eager folks are organizing a semantic publishing system and gearing up for a semantic conference. I think these efforts are admirable, but I think that the trophy crowd confuses public relations with programming on occasion. Not Dmitri Soubbotin, one of the senior managers at Semantic Engines. Harry Collier and I were able to get the low-profile wizard to sit down and talk with us. Mr. Soubbotin’s interview with Harry Collier (Infonortics Ltd.) and me appears below.
Please keep in mind that Dmitri Soubbotin is one of the world-class search, content processing, and semantic technology experts who will be speaking at the April 2009 Boston Search Engine Meeting. Unlike fan-club conferences or SEO programs designed for marketers, the Boston Search Engine Meeting tackles substantive subjects in an informed way. The opportunity to talk with Mr. Soubbotin or any other speaker at this event is a worthwhile experience. The interview makes clear the approach the conference committee takes for the Boston Search Engine Meeting: substance, not marketing hyperbole, is the focus of the two-day program. For more information and to register, click here.
Now the interview:
Will you describe briefly your company and its search / content processing technology?
Semantic Engines is mostly known for its search engine SenseBot (www.sensebot.net). The idea of it is to provide search results for a user’s query in the form of a multi-document summary of the most relevant Web sources, presented in a coherent order. Through text mining, the engine attempts to understand what the Web pages are about and extract key phrases to create a summary.
So instead of giving a collection of links to the user, we serve an answer in the form of a summary of multiple sources. For many informational queries, this obviates the need to drill down into individual sources and saves the user a lot of time. If the user still needs more detail, or likes a particular source, he may navigate to it right from the context of the summary.
Strictly speaking, this is going beyond information search and retrieval – to information synthesis. We believe that search engines can do a better service to the users by synthesizing informative answers, essays, reviews, etc., rather than just pointing to Web sites. This idea is part of our patent filing.
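SenseBot’s summarizer is proprietary, but the general extractive approach Mr. Soubbotin describes can be sketched in a few lines: score each sentence by its overlap with the query, keep the top scorers, and present them in their original order. The scoring rule and sample documents below are assumptions for illustration only.

```python
import re
from collections import Counter

def summarize(query, documents, max_sentences=3):
    """Score sentences by overlap with query terms and return the top few,
    in their original order, as a crude multi-document summary."""
    query_terms = set(query.lower().split())
    sentences = []
    for doc in documents:
        for sent in re.split(r"(?<=[.!?])\s+", doc.strip()):
            words = Counter(re.findall(r"[a-z']+", sent.lower()))
            score = sum(words[t] for t in query_terms)
            if score:
                sentences.append((score, len(sentences), sent))
    top = sorted(sentences, reverse=True)[:max_sentences]
    return " ".join(sent for _, _, sent in sorted(top, key=lambda x: x[1]))

docs = [
    "Solid state drives are fast. They have no moving parts.",
    "Price per gigabyte of solid state drives keeps falling.",
]
print(summarize("solid state drives", docs))
```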
Other things that we do are Web services for B2B that extract semantic concepts from texts, generate text summaries from unstructured content, etc. We also have a new product for bloggers and publishers called LinkSensor. It performs in-text content discovery to engage the user in exploring more of the content through suggested relevant links.
What are the three major challenges you see in search / content processing in 2009?
There are many challenges. Let me highlight three that I think are interesting:
First, Relevance: Users spend too much time searching and not always finding. The first page of results presumably contains the most relevant sources. But unless search engines really understand the query and the user intent, we cannot be sure that the user is satisfied. Matching words of the query to words on Web pages is far from an ideal solution.
Second, Volume: The number of results matching a user’s query may be well beyond human capacity to review them. Naturally, the majority of searchers never venture beyond the first page of results – exploring the next page is often seen as not worth the effort. That means that a truly relevant and useful piece of content that happens to be number 11 on the list may become effectively invisible to the user.
Third, Shallow content: Search engines use a formula to calculate page rank. SEO techniques allow a site to improve its ranking through the use of keywords, often propagating a rather shallow site up on the list. The user may not know if the site is really worth exploring until he clicks on its link.
With search / content processing decades old, what have been the principal barriers to resolving these challenges in the past?
Not understanding the intent of the user’s query and matching words syntactically rather than by their sense – these are the key barriers preventing search engines from serving more relevant results. NLP and text mining techniques can be employed to understand the query and the Web page content, and come up with an acceptable answer for the user. Analyzing Web page content on the fly can also help in distinguishing whether a page has value for the user or not.
Of course, the infrastructure requirements would be higher when semantic analysis is used, raising the cost of serving search results. This may have been another barrier to broader use of semantics by major search engines.
What is your approach to problem solving in search and content processing? Do you focus on smarter software, better content processing, improved interfaces, or some other specific area?
Smarter, more intelligent software. We use text mining to parse Web pages and pull out the most representative text extracts relevant to the query. We drop the sources that are shallow on content, no matter how highly they were ranked by other search engines. We then order the text extracts to create a summary that ideally serves as a useful answer to the user’s query. This type of result is a good fit for an informational query, where the user’s goal is to understand a concept or event, or to get an overview of a topic. The closer together the source documents are (e.g., in a vertical space), the higher the quality of the summary.
Search / content processing systems have been integrated into such diverse functions as business intelligence and customer support. Do you see search / content processing becoming increasingly integrated into enterprise applications?
More and more, people expect to have the same features and user interface when they search at work as they get from home. The underlying difference is that behind the firewall the repositories and taxonomies are controlled, as opposed to the outside world. On one hand, this makes things easier for a search application within the enterprise, as it narrows the focus and the accuracy of search can be higher. On the other hand, additional features and expertise are required compared to Web search. In general, I think the opportunities in the enterprise are growing for standalone search providers with unique value propositions.
As you look forward, what are some new features / issues that you think will become more important in 2009? Where do you see a major break-through over the next 36 months?
I think the use of semantics and intelligent processing of content will become more ubiquitous in 2009 and beyond. For years, it has been making its way from academia to “alternative” search engines, occasionally showing up in the mainstream. I think we are going to see much higher adoption of semantics by major search engines, first of all Google. Things have definitely been in the works, showing as small improvements here and there, but I expect a critical mass of experimenting to accumulate and overflow into standard features at some point. This will be a tremendous shift in the way search is perceived by users and implemented by search engines. The impact on SEO techniques that are primarily keyword based will be huge as well. I am not sure whether this will happen in 2009, but certainly within the next 36 months.
Graphical interfaces and portals (now called composite applications) are making a comeback. Semantic technology can make point and click interfaces more useful. What other uses of semantic technology do you see gaining significance in 2009? What semantic considerations do you bring to your product and research activities?
I expect to see higher proliferation of Semantic Web and linked data. Currently, the applications in this field mostly go after the content that is inherently structured although hidden within the text – contacts, names, dates. I would be interested to see more integration of linked data apps with text mining tools that can understand unstructured content. This would allow automated processing of large volumes of unstructured content, making it semantic web-ready.
Where can we find more information about your products, services, and research?
Our main sites are www.sensebot.net and www.semanticengines.com. LinkSensor, our tool for bloggers/publishers is at www.linksensor.com. A more detailed explanation of our approach with examples can be found in the following article:
http://www.altsearchengines.com/2008/Q7/22/alternative-search-results/.
Stephen Arnold (Harrod’s Creek, Kentucky) and Harry Collier (Tetbury, Glou.), February 10, 2009
Great Bit Faultline: IT and Legal Eagles
February 6, 2009
The legal conference LegalTech generates quite a bit of information and disinformation about search, content processing, and text mining. Vendors with attorneys on the marketing and sales staff are often more cautious in their wording even though these professionals are not the school president type personalities some vendors prefer. Other vendors are “all sales all the time” and this crowd surfs the trend waves.
You will have to decide whose news release to believe. I read an interesting story in Centre Daily Times here called “Continuing Disconnect between IT and Legal Greatly Hindering eDiscovery Efforts, Recommind Survey Finds”. The article makes a point for which I have only anecdotal information; namely, information technology wizards know little about the eDiscovery game. IT wonks want to keep systems running, restore files, and prevent users from mucking up the enterprise systems. eDiscovery, on the other hand, requires poring through data, sucking it into a system that prevents spoliation (a fancy word for deleting or changing documents), and creating a purpose-built system that attorneys can use to fight for truth, justice, and the American way.
Now, Recommind, one of the many firms claiming leadership in the eDiscovery space, reports the results of a survey. (Without access to the sample selection method, the details of the analytic tools, the questionnaire itself, and the folks who did the analysis, I’m flying blind.) The article asserts:
Recommind’s survey demonstrates that there is significant work remaining to achieve this goal: only 37% of respondents reported that legal and IT are working more closely together than a year before. This issue is compounded by the fact that only 21% of IT respondents felt that eDiscovery was a “very high” priority, in stark contrast with the overwhelming importance attached to eDiscovery by corporate legal departments. Furthermore, there remains a significant disconnect between corporate accountability and project responsibility, with legal “owning” accountability for eDiscovery (73% of respondents), records management (47%) and data retention (50%), in spite of the fact that the IT department actually makes the technology buying decisions for projects supporting these areas 72% of the time. Exacerbating these problems is an alarming shortage of technical specifications for eDiscovery-related projects. Only 29% of respondents felt that IT truly understood the technical requirements of eDiscovery. The legal department fared even worse, with only 12% of respondents indicating that legal understood the requirements. Not surprisingly, this disconnect is leading to a lack of confidence in eDiscovery project implementation, with only 27% of respondents saying IT is very helpful during eDiscovery projects, and even fewer (16%) believing legal is.
My reaction to these alleged findings was, “Well, makes sense.” You will need to decide for yourself. My hunch is that IT and legal departments are a little like the Hatfields and the McCoys. No one knows what the problem is, but there is a problem.
What I find interesting is that enterprise search and content processing systems are generally inappropriate for the rigors of eDiscovery and other types of legal work. What’s amusing is a search vendor trying to sell to a lawyer who has just been surprised in a legal action. The lawyer has some specific needs, and most enterprise search systems don’t meet these. Equally entertaining is a purpose built legal system being repackaged as a general purpose enterprise search system. That’s a hoot as well.
As the economy continues its drift into the financial Bermuda Triangle, I think everyone involved in legal matters will become more, not less, testy. Stratify, for example, began life as Purple Yogi and an intelligence-centric tool. Now Stratify is a more narrowly defined system with a clutch of legal functions. Does an IT department understand a Stratify? Nope. Does an IT department understand a general purpose search system like Lucene? Nope. Generalists have a tough time understanding the specific methods of experts who require a point solution.
In short, I think the numbers in the Recommind study may be subject to questions, but the overall findings seem to be generally on target.
Stephen Arnold, February 6, 2009