Semantic Engines Dmitri Soubbotin Exclusive Interview

February 10, 2009

Semantics are booming. Daily I get spam from the trophy generation touting the latest and greatest in semantic technology. A couple of eager folks are organizing a semantic publishing system and gearing up for a semantic conference. These efforts are admirable, but the trophy crowd confuses public relations with programming on occasion. Not Dmitri Soubbotin, one of the senior managers at Semantic Engines. Harry Collier and I were able to get the low-profile wizard to sit down and talk with us. Mr. Soubbotin’s interview with Harry Collier (Infonortics Ltd.) and me appears below.

Please keep in mind that Dmitri Soubbotin is one of the world-class search, content processing, and semantic technologies experts who will be speaking at the April 2009 Boston Search Engine Meeting. Unlike fan-club conferences or SEO programs designed for marketers, the Boston Search Engine Meeting tackles substantive subjects in an informed way. The opportunity to talk with Mr. Soubbotin or any other speaker at this event is a worthwhile experience. The interview with Mr. Soubbotin makes clear the approach taken by the conference committee for the Boston Search Engine Meeting. Substance, not marketing hyperbole, is the focus of the two-day program. For more information and to register, click here.

Now the interview:

Will you describe briefly your company and its search / content processing technology?

Semantic Engines is mostly known for its search engine SenseBot (www.sensebot.net). The idea of it is to provide search results for a user’s query in the form of a multi-document summary of the most relevant Web sources, presented in a coherent order. Through text mining, the engine attempts to understand what the Web pages are about and extract key phrases to create a summary.

So instead of giving a collection of links to the user, we serve an answer in the form of a summary of multiple sources. For many informational queries, this obviates the need to drill down into individual sources and saves the user a lot of time. If the user still needs more detail, or likes a particular source, he may navigate to it right from the context of the summary.

Strictly speaking, this is going beyond information search and retrieval – to information synthesis. We believe that search engines can do a better service to the users by synthesizing informative answers, essays, reviews, etc., rather than just pointing to Web sites. This idea is part of our patent filing.

Other things that we do are Web services for B2B that extract semantic concepts from texts, generate text summaries from unstructured content, etc. We also have a new product for bloggers and publishers called LinkSensor. It performs in-text content discovery to engage the user in exploring more of the content through suggested relevant links.

What are the three major challenges you see in search / content processing in 2009?

There are many challenges. Let me highlight three that I think are interesting:

First,  Relevance: Users spend too much time searching and not always finding. The first page of results presumably contains the most relevant sources. But unless search engines really understand the query and the user intent, we cannot be sure that the user is satisfied. Matching words of the query to words on Web pages is far from an ideal solution.

Second, Volume: The number of results matching a user’s query may be well beyond human capacity to review them. Naturally, the majority of searchers never venture beyond the first page of results – exploring the next page is often seen as not worth the effort. That means that a truly relevant and useful piece of content that happens to be number 11 on the list may become effectively invisible to the user.

Third, Shallow content: Search engines use a formula to calculate page rank. SEO techniques allow a site to improve its ranking through the use of keywords, often propagating a rather shallow site up on the list. The user may not know if the site is really worth exploring until he clicks on its link.

With search / content processing decades old, what have been the principal barriers to resolving these challenges in the past?

Not understanding the intent of the user’s query and matching words syntactically rather than by their sense – these are the key barriers preventing search engines from serving more relevant results. NLP and text mining techniques can be employed to understand the query and the Web page content, and come up with an acceptable answer for the user. Analyzing Web page content on the fly can also help in distinguishing whether a page has value for the user or not. Of course, the infrastructure requirements would be higher when semantic analysis is used, raising the cost of serving search results. This may have been another barrier to broader use of semantics by major search engines.
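
The contrast between syntactic and sense-based matching is easy to see in a few lines of code. The sketch below is purely illustrative (it is not Semantic Engines' method, and the functions and sample text are invented for the example); it assumes the NLTK library with its WordNet corpus is installed:

    # A toy contrast between literal keyword matching and a crude
    # sense-aware match via WordNet synonym expansion.
    # Assumes: pip install nltk, then nltk.download("wordnet").
    from nltk.corpus import wordnet as wn

    def keyword_match(query, document):
        """Literal matching: every query word must appear verbatim."""
        doc_words = set(document.lower().split())
        return all(word in doc_words for word in query.lower().split())

    def sense_match(query, document):
        """Looser matching: a query word counts if any WordNet synonym appears."""
        doc_words = set(document.lower().split())
        for word in query.lower().split():
            synonyms = {word}
            for synset in wn.synsets(word):
                synonyms.update(lemma.lower() for lemma in synset.lemma_names())
            if not synonyms & doc_words:
                return False
        return True

    doc = "the automobile sector saw falling sales in 2008"
    print(keyword_match("car sales", doc))  # False: "car" is absent
    print(sense_match("car sales", doc))    # True: "automobile" is a synonym of "car"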

What is your approach to problem solving in search and content processing? Do you focus on smarter software, better content processing, improved interfaces, or some other specific area?

Smarter, more intelligent software. We use text mining to parse Web pages and pull out the most representative text extracts from them, relevant to the query. We drop the sources that are shallow on content, no matter how highly they were ranked by other search engines. We then order the text extracts to create a summary that ideally serves as a useful answer to the user’s query. This type of result is a good fit for an informational query, where the user’s goal is to understand a concept or event, or to get an overview of a topic. The closer together the source documents are (e.g., within a vertical), the higher the quality of the summary.
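
To make the pipeline concrete, here is a rough sketch of the steps described above: filter out shallow sources, score sentences against the query, and keep the best extracts. This is a toy illustration, not SenseBot's actual algorithm; the function names are invented, and the pages argument is assumed to be a list of (url, extracted_text) pairs prepared upstream.

    # A rough sketch of the kind of pipeline described above, not SenseBot's code.
    import re
    from collections import Counter

    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def summarize(query, pages, min_words=150, top_n=5):
        query_terms = set(tokenize(query))
        candidates = []
        for url, text in pages:
            words = tokenize(text)
            if len(words) < min_words:   # drop sources that are shallow on content
                continue
            counts = Counter(words)
            for sentence in re.split(r"(?<=[.!?])\s+", text):
                overlap = query_terms & set(tokenize(sentence))
                if overlap:
                    # crude relevance: query-term overlap, with rarer terms weighing more
                    score = sum(1.0 / (1 + counts[t]) for t in overlap)
                    candidates.append((score, url, sentence.strip()))
        best = sorted(candidates, key=lambda c: c[0], reverse=True)[:top_n]
        # a production system would reorder the extracts for coherence;
        # here they are simply returned in score order
        return [(url, sentence) for _, url, sentence in best]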

Search / content processing systems have been integrated into such diverse functions as business intelligence and customer support. Do you see search / content processing becoming increasingly integrated into enterprise applications?

More and more, people expect to have the same features and user interface when they search at work as they get at home. The underlying difference is that behind the firewall the repositories and taxonomies are controlled, as opposed to the outside world. On one hand, this makes it easier for a search application within the enterprise, as it narrows the focus and the accuracy of search can get higher. On the other hand, additional features and expertise would be required compared to Web search. In general, I think the opportunities in the enterprise are growing for standalone search providers with unique value propositions.

As you look forward, what are some new features / issues that you think will become more important in 2009? Where do you see a major break-through over the next 36 months?

I think the use of semantics and intelligent processing of content will become more ubiquitous in 2009 and beyond. For years, it has been making its way from academia to “alternative” search engines, occasionally showing up in the mainstream. I think we are going to see much higher adoption of semantics by major search engines, first of all Google. Things have definitely been in the works, showing as small improvements here and there, but I expect a critical mass of experimenting to accumulate and overflow into standard features at some point. This will be a tremendous shift in the way search is perceived by users and implemented by search engines. The impact on SEO techniques that are primarily keyword-based will be huge as well. Not sure whether this will happen in 2009, but certainly within the next 36 months.

Graphical interfaces and portals (now called composite applications) are making a comeback. Semantic technology can make point and click interfaces more useful. What other uses of semantic technology do you see gaining significance in 2009? What semantic considerations do you bring to your product and research activities?

I expect to see higher proliferation of Semantic Web and linked data. Currently, the applications in this field mostly go after the content that is inherently structured although hidden within the text – contacts, names, dates. I would be interested to see more integration of linked data apps with text mining tools that can understand unstructured content. This would allow automated processing of large volumes of unstructured content, making it semantic web-ready.

Where can we find more information about your products, services, and research?

Our main sites are www.sensebot.net and www.semanticengines.com. LinkSensor, our tool for bloggers/publishers is at www.linksensor.com. A more detailed explanation of our approach with examples can be found in the following article:
http://www.altsearchengines.com/2008/07/22/alternative-search-results/.

Stephen Arnold (Harrod’s Creek, Kentucky) and Harry Collier (Tetbury, Glou.), February 10, 2009

Sinequa Lands Co Op Financial Deal

February 10, 2009

The world’s largest consumer co-operative just made Sinequa, http://www.sinequa.com, a part of their plans to develop company efficiency, facilitate collaboration and improve customer service. The Co-operative Financial Services (CFS), http://www.cfs.co.uk, will use the Sinequa CS enterprise search engine to connect different business units. The engine links data sources using connectors so employees can access information across the company based on access rights. The co-op needs Sinequa CS to pull all its pieces parts together. It’s a no brainer: the co-op has about 6.5 million customers to serve. Sinequa CS facilitates natural language queries and retrieves search results by way of patented semantic technology. Results include related documents regardless of format (Office, PDF, HTML etc.) classified by category. More about the search engine is available at http://www.sinequa.com/solutions.html.

Jessica West Bratcher, February 10, 2009

More Trouble for Microsoft: GMail Guns for Hotmail

February 9, 2009

I don’t use GMail. Call me careful. Call me an addled goose. What I do is irrelevant because Garrett Rogers reports that the gap between Hotmail and GMail is closing. You can read his article “Gmail usage appears to be closing in on Hotmail” here. If true, Google seems to be making significant progress in territory that Microsoft once considered part of its dominion. For me, the most interesting comment was:

Google still has a long way to go to catch up to Yahoo, but it’s realistic to think that it could happen as soon as 2011 if you look at current growth rates. Part of the reason Google’s email service is becoming so popular is their ability to push out updates and useful features extremely quickly.

It’s 2009, and I think that 2011 may arrive more quickly than Microsoft expects. What can Microsoft do to respond to Google? Well, cutting back staff and chopping data center spending do not strike me as particularly innovative. As an addled goose, I must be missing a key part of Microsoft’s strategy in search. With big search announcements looming, I am hopeful that Microsoft will leapfrog the GOOG. Stay tuned.

Stephen Arnold, February 8, 2009

Cengage: A Publisher to Watch

February 9, 2009

I chopped this factoid out of the foul papers for Google: The Digital Gutenberg, my forthcoming study of our pal, Googzilla. More information is here. I was poking around for weaknesses in the traditional educational publishing business model. I came across an interesting document on the Web site of an outfit called Cengage Learning Inc., which is an “indirect wholly owned subsidiary of Cengage Learning Holdings II L.P.” (formerly known as TL Holdings II L.P.). To me it looks as if an outfit called Apax Partners in the UK and the Ontario Municipal Employees Retirement System in Canada bought the “old” Thomson Learning in 2007, according to the information here. You can read this document here. One catchphrase for a unit of the company is “A company that delivers highly-customized learning solutions for universities, instructors, students, libraries, government agencies, corporations, and professionals worldwide.”

If you are as sensitive to pre-crash ownership methods as I am, the naming of this company is interesting. I did a little reading and located one factoid that may or may not be spot on. Here is the segment that caught my attention. Remember: this is the pre-financial crisis environment of June 2008:

On the Closing Date, Cengage Learning entered into an Incremental Amendment to its existing Credit Agreement, with The Royal Bank of Scotland plc, as administrative agent, collateral agent and swing line lender, and the other lenders party thereto, pursuant to which the Company borrowed an additional $625 million to finance the Acquisition.  The borrowings were issued at 97.625% of the principal amount thereof and require annual principal payments of 1% with the remaining amount payable on July 3, 2014.  Cengage Learning can elect the term of the borrowing period and each respective rollover period, as well as which benchmark interest rate will apply, subject to contractually specified minimum rates, plus a predefined margin.  The minimum interest rate on the additional borrowing is 7.5%.
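
The figures in that paragraph are easy to turn into rough numbers. The back-of-the-envelope arithmetic below uses only the amounts quoted above; the actual interest expense depends on the benchmark rate plus margin and on the declining balance, so treat these as approximate floor values:

    principal = 625_000_000       # additional borrowing
    issue_price = 0.97625         # issued at 97.625% of principal
    annual_principal_pct = 0.01   # 1% of principal repaid per year
    minimum_rate = 0.075          # contractual minimum interest rate

    proceeds = principal * issue_price                    # about $610.2 million received
    issue_discount = principal - proceeds                 # about $14.8 million of discount
    annual_principal = principal * annual_principal_pct   # $6.25 million per year
    year_one_interest = principal * minimum_rate          # about $46.9 million at the 7.5% floor

    print(f"Proceeds:                  ${proceeds:,.0f}")
    print(f"Original issue discount:   ${issue_discount:,.0f}")
    print(f"Annual principal payment:  ${annual_principal:,.0f}")
    print(f"Minimum year-one interest: ${year_one_interest:,.0f}")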

In short, Cengage borrowed to buy a group of properties. Some of these are in the educational publishing business. I did not do a deep dive on this because I located some references to various legal actions and in my opinion legal issues are like sleeping dogs. Let them rest quietly.

An investor conference call is scheduled for February 12, 2009. Source: www.cengage.com

I did some informal and opinionated thinking.

In my opinion, if traditional publishers are struggling and if buyout deals rely on this Cengage-type financing, I wonder who is going to pay the bill for the interest on the loans. Several thoughts crossed my mind as I realized that Cengage owned such online and publishing operations as Gale Research (now Gale Cengage Learning), parts of the “old” Houghton Mifflin Harcourt Publishing Company, Wadsworth Group, and Macmillan among others. (You can look at the same source I located here.)


Upgrades to ArnoldIT.com’s Google Patent Collection

February 9, 2009

I have made an effort to gather Google’s patent documents filed at the US Patent & Trademark Office. When The Google Legacy came out in 2005, I posted some of the Google patent documents referenced in that study. When Google Version 2.0 was published in 2007, I made available to those who purchased that study a link to the patent documents referenced in that monograph. These Google studies were based on my reading of Google’s open source technical information. Most of the Google books now available steer clear of Google’s systems and methods. My writings are intended for specialist readers, not the consumer audience.

You can now search the full text of Google’s patent documents from 1998 to 2008 by navigating to http://arnoldit.perfectsearchcorp.com. The Perfect Search engineers have indexed the XHTML versions of the documents which have been available on the ArnoldIT.com server. ArnoldIT.com has provided pointers so that a user can click on a link and access the PDF version of Google’s patent applications and patents. No more hunting for a specific patent document PDF using weird and arcane commands. Just click and you can view or download the PDF of a Google patent document. The service is free.

The ArnoldIT.com team has made an attempt to collect most Google patent documents, but there are a number of patent documents referenced in various Google documents that remain elusive. Keep in mind that the information is open source, and I am providing it as a partial step in a user’s journey to understand some aspects of Google. If you are an attorney, you should use the USPTO service or a commercial service from Westlaw or LexisNexis. Those organizations often assert comprehensiveness, accuracy, and a sensitivity to the nuances of legal documents. I am providing a collection that supports my research.

Google is now a decade old, and there is considerable confusion among those who use and analyze Google with regard to the company’s technology. Google provides a patent search service here. But I find it difficult to use, and in some cases, certain documents seem to be hard for me to find.

I hope that Googlers who are quick to tell me that I am writing about Google technology that Google does not possess will be able to use this collection to find Google’s own documents. I have learned that trophy generation Googlers don’t read some of their employer’s open source documents, government filings, and technical papers.

Perfect Search Corp. is the first company to step forward and agree to index these public domain documents. You will find that the Perfect Search system is very fast, and you can easily pinpoint certain Google patent documents in a fraction of the time required when you use Google’s own service or the USPTO’s sluggish and user-hostile system.

“The Google Patent Demonstration illustrates the versatility of the Perfect Search system. Response time is fast, precision and recall are excellent, and access to Google’s inventions is painless,” Tim Stay, CEO of Perfect Search, said.

Perfect Search’s software uses semantic technology and allows clients to index and search massive data sets with near real-time incremental indexing at high speeds without latency. It is meant to augment the Google Search Appliance. Perfect Search technology, explained in depth at http://www.perfectsearchcorp.com/technology-benefits, provides a very economical single-server solution for customers to index files and documents and can add the capability of indexing large amounts of database information as well.

Perfect Search is a software innovation company that specializes in development of search solutions. A total of eight patents have been applied for around the developing technology. The suite of search products at http://www.perfectsearchcorp.com/our-products is available on multiple platforms, from small mobile devices, to single servers, to large server farms. For more information visit http://www.perfectsearchcorp.com/, call +1.801.437.1100, or e-mail info@perfectsearchcorp.com.

In the future, I would like to make this collection available to other search and content processing companies. The goal would be to allow users to be able to dig into some of Google’s inventions and learn about the various search systems. Head-to-head comparisons are very useful, but very few organizations in my experience take the time to prepare a test corpus and then use different systems to determine which is more appropriate for a particular application.

If you have suggestions for this service, use the comments section for this Web log.

Stephen Arnold, February 9, 2009

Daniel Tunkelang: Co-Founder of Endeca Interviewed

February 9, 2009

As other search conferences gasp for the fresh air of energizing speakers, Harry Collier’s Boston Search Engine Conference (more information is here) has landed another thought-leader speaker. Daniel Tunkelang is one of the founders of Endeca. After the implosion of Convera and the buyouts of Fast Search and Verity, Endeca is one of the two flagship vendors of search, content processing, and information management systems recognized by most information technology professionals. Dr. Tunkelang writes an informative Web log, The Noisy Channel, here.

Dr. Daniel Tunkelang. Source: http://www.cs.cmu.edu/~quixote/dt.jpg

You can get a sense of Dr. Tunkelang’s views in this exclusive interview conducted by Stephen Arnold with the assistance of Harry Collier, Managing Director, Infonortics Ltd. If you want to hear and meet Dr. Tunkelang, attend the Boston Search Engine Meeting, a show focused on search and information retrieval. All beef, no filler.

The speakers, like Dr. Tunkelang, will challenge you to think about the nature of information and the ways to deal with substantive issues, not antimacassars slapped on a problem. We interviewed Dr. Tunkelang on February 5, 2009. The full text of this interview appears below.

Tell us a bit about yourself and about Endeca.

I’m the Chief Scientist and a co-founder of Endeca, a leading enterprise search vendor. We are the largest organically grown company in our space (no preservatives or acquisitions!), and we have been recognized by industry analysts as a market and technology leader. Our hundreds of clients include household names in retail (Wal*Mart, Home Depot); manufacturing and distribution (Boeing, IBM); media and publishing (LexisNexis, World Book); financial services (ABN AMRO, Bank of America); and government (Defense Intelligence Agency, National Cancer Institute).

My own background: I was an undergraduate at MIT, double majoring in math and computer science, and I completed a PhD at CMU, where I worked on information visualization. Before joining Endeca’s founding team, I worked at the IBM T. J. Watson Research Center and AT&T Bell Labs.

What differentiates Endeca from the field of search and content processing vendors?

In web search, we type a query in a search box and expect to find the information we need in the top handful of results. In enterprise search, this approach too often breaks down. There are a variety of reasons for this breakdown, but the main one is that enterprise information needs are less amenable to the “wisdom of crowds” approach at the heart of PageRank and related approaches used for web search. As a consequence, we must get away from treating the search engine as a mind reader, and instead promote bi-directional communication so that users can effectively articulate their information needs and the system can satisfy them. The approach is known in the academic literature as human computer information retrieval (HCIR).

Endeca implements an HCIR approach by combining a set-oriented retrieval with user interaction to create an interactive dialogue, offering next steps or refinements to help guide users to the results most relevant for their unique needs. An Endeca-powered application responds to a query with not just relevant results, but with an overview of the user’s current context and an organized set of options for incremental exploration.
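
A toy example makes the set-oriented idea clearer. The sketch below is not Endeca's engine; the records, field names, and search helper are invented for illustration. It shows facet counts computed over the current result set and offered as clickable refinements, rather than a ranked list alone:

    from collections import Counter

    records = [
        {"title": "Hybrid sedan review", "brand": "Toyota", "year": 2008, "type": "sedan"},
        {"title": "Compact SUV roundup", "brand": "Honda", "year": 2008, "type": "suv"},
        {"title": "Sedan safety ratings", "brand": "Honda", "year": 2007, "type": "sedan"},
    ]

    def search(keyword, filters=None):
        filters = filters or {}
        hits = [r for r in records
                if keyword.lower() in r["title"].lower()
                and all(r.get(f) == v for f, v in filters.items())]
        # facet counts over the *current* result set become the suggested next steps
        facets = {field: Counter(r[field] for r in hits)
                  for field in ("brand", "year", "type")}
        return hits, facets

    hits, facets = search("sedan")
    print(len(hits), "results;", dict(facets["brand"]))   # 2 results; {'Toyota': 1, 'Honda': 1}
    hits, facets = search("sedan", {"brand": "Honda"})     # the user clicks a refinement
    print(len(hits), "result;", dict(facets["year"]))      # 1 result; {2007: 1}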

What do you see as the three major challenges facing search and content processing in 2009 and beyond?

There are so many challenges! But let me pick my top three:

Social Search. While the word “social” is overused as a buzzword, it is true that content is becoming increasingly social in nature, both on the consumer web and in the enterprise. In particular, there is much appeal in the idea that people will tag content within the enterprise and benefit from each other’s tagging. The reality of social search, however, has not lived up to the vision. In order for social search to succeed, enterprise workers need to supply their proprietary knowledge in a process that is not only as painless as possible, but demonstrates the return on investment. We believe that our work at Endeca, on bootstrapping knowledge bases, can help bring about effective social search in the enterprise.

Federation. As much as an enterprise may value its internal content, much of the content that its workers need resides outside the enterprise. An effective enterprise search tool needs to facilitate users’ access to all of these content sources while preserving the value and context of each. But federation raises its own challenges, since every repository offers different levels of access to its contents. For federation to succeed, information repositories will need to offer more meaningful access than returning the top few results for a search query.

Search is not a zero-sum game. Web search engines in general–and Google in particular–have promoted a view of search that is heavily adversarial, thus encouraging a multi-billion dollar industry of companies and consultants trying to manipulate result ranking. This arms race between search engines and SEO consultants is an incredible waste of energy for both sides, and distracts us from building better technology to help people find information.

With the rapid change in the business climate, how will the increasing financial pressure on information technology affect search and content processing?

There’s no question that information technology purchase decisions will face stricter scrutiny. But, to quote Rahm Emanuel, “Never let a serious crisis go to waste…it’s an opportunity to do things you couldn’t do before.” Stricter scrutiny is a good thing; it means that search technology will be held accountable for the value it delivers to the enterprise. There will, no doubt, be an increasing pressure to cut costs, from price pressure on vendors to substituting automated techniques for human labor. But that is how it should be: vendors have to justify their value proposition. The difference in today’s climate is that the spotlight shines more intensely on this process.

Search / content processing systems have been integrated into such diverse functions as business intelligence and customer support. Do you see search / content processing becoming increasingly integrated into enterprise applications? If yes, how will this shift affect the companies providing stand alone search / content processing solutions? If no, what do you see the role of standalone search / content processing applications becoming?

Better search is a requirement for many enterprise applications–not just BI and Call Centers, but also e-commerce, product lifecycle management, CRM, and content management.  The level of search in these applications is only going to increase, and at some point it just isn’t possible for workers to productively use information without access to effective search tools.

For stand-alone vendors like Endeca, interoperability is key. At Endeca, we are continually expanding our connectivity to enterprise systems: more connectors, leveraging data services, etc. We are also innovating in the area of building configurable applications, which let businesses quickly deploy the right set of features for their users. Our diverse customer base has driven us to support the diversity of their information needs, e.g., customer support representatives have very different requirements from those of online shoppers. Most importantly, everyone benefits from tools that offer an opportunity to meaningfully interact with information, rather than being subjected to a big list of results that they can only page through.

Microsoft acquired Fast Search & Transfer. SAS acquired Teragram. Autonomy acquired Interwoven and Zantaz. In your opinion, will this consolidation create opportunities or shut doors? What options are available to vendors / researchers in this merger-filled environment?

Yes!  Each acquisition changes the dynamics in the market, both creating opportunities and shutting doors at the same time.  For SharePoint customers who want to keep the number of vendors they work with to a minimum, the acquisition of FAST gives them a better starting point over Microsoft Search Server.  For FAST customers who aren’t using SharePoint, I can only speculate as to what is in store for them.

For other vendors in the marketplace, the options are:

  • Get aligned with (or acquired by) one of the big vendors and get more tightly tied into a platform stack like FAST;
  • Carve out a position in a specific segment, like we’re seeing with Autonomy and e-Discovery, or
  • Be agnostic, and serve a number of different platforms and users like Endeca or Google do.  In this group, you’ll see some cases where functionality is king, and some cases where pricing is more important, but there will be plenty of opportunities here to thrive.

Multicore processors provide significant performance boosts. But search / content processing often faces bottlenecks and latency in indexing and query processing. What’s your view on the performance of your system or systems with which you are familiar? Is performance a non-issue?

Performance is absolutely a consideration, even for systems that make efficient use of hardware resources. And it’s not just about using CPU for run-time query processing: the increasing size of data collections has pushed on memory requirements; data enrichment increases the expectations and resource requirements for indexing; and richer capabilities for query refinement and data visualization present their own performance demands.

Multicore computing is the new shape of Moore’s Law: this is a fundamental consequence of the need to manage power consumption on today’s processors, which contain billions of transistors. Hence, older search systems that were not designed to exploit data parallelism during query evaluation will not scale up as hardware advances.

While tasks like content extraction, enrichment, and indexing lend themselves well to today’s distributed computing approaches, the query side of the problem is more difficult–especially in modern interfaces that incorporate faceted search, group-bys, joins, numeric aggregations, et cetera. Much of the research literature on query parallelism from the database community addresses structured, relational data, and most parallel database work has targeted distributed memory models, so existing techniques must be adapted to handle the problems of search.
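
As a simplified illustration of the data-parallel side of the problem (not any vendor's implementation; the function names and sample data are invented), the sketch below counts a facet over partitions of a result set in separate worker processes and merges the partial counts:

    from collections import Counter
    from concurrent.futures import ProcessPoolExecutor

    def count_partition(partition):
        """Count facet values over one partition of the matching documents."""
        counts = Counter()
        for doc in partition:
            counts[doc["category"]] += 1
        return counts

    def parallel_facet_counts(partitions, workers=4):
        merged = Counter()
        with ProcessPoolExecutor(max_workers=workers) as pool:
            for partial in pool.map(count_partition, partitions):
                merged.update(partial)   # the merge step is cheap compared to the scans
        return merged

    if __name__ == "__main__":
        docs = [{"category": c} for c in "abab" * 10000]   # a stand-in result set
        partitions = [docs[i::4] for i in range(4)]        # split it four ways
        print(parallel_facet_counts(partitions))           # Counter({'a': 20000, 'b': 20000})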

Google has disrupted certain enterprise search markets with its appliance solution. The Google brand creates the idea in the minds of some procurement teams and purchasing agents that Google is the only or preferred search solution. What can a vendor do to adapt to this Google effect? Is Google a significant player in enterprise search, or is Google a minor player?

I think it is a mistake for the higher-end search vendors to dismiss Google as a minor player in the enterprise. Google’s appliance solution may be functionally deficient, but Google’s brand is formidable, as is its positioning of the appliance as a simple, low-cost solution. Moreover, if buyers do not understand the differences among vendor offerings, they may well be inclined to decide based on the price tag–particularly in a cost-conscious economy. It is thus more incumbent than ever on vendors to be open about what their technology can do, as well as to build a credible case for buyers to compare total cost of ownership.

Mobile search is emerging as an important branch of search / content processing. Mobile search, however, imposes some limitations on presentation and query submission. What are your views of mobile search’s impact on more traditional enterprise search / content processing?

A number of folks have noted that the design constraints of the iPhone (and of mobile devices in general) lead to an improved user experience, since site designers do a better job of focusing on the information that users will find relevant. I’m delighted to see designers striving to improve the signal-to-noise ratio in information seeking applications.

Still, I think we can take the idea much further. More efficient or ergonomic use of real estate boils down to stripping extraneous content–a good idea, but hardly novel, and making sites vertically oriented (i.e., no horizontal scrolling) is still a cosmetic change. The more interesting question is how to determine what information is best to present in the limited space–that is the key to optimizing interaction. Indeed, many of the questions raised by small screens also apply to other interfaces, such as voice. Ultimately, we need to reconsider the extreme inefficiency of ranked lists, compared to summarization-oriented approaches. Certainly the mobile space opens great opportunities for someone to get this right on the web.

Semantic technology can make point and click interfaces more useful. What other uses of semantic technology do you see gaining significance in 2009? What semantic considerations do you bring to your product and research activities?

Semantic search means different things to different people, but broadly falls into two categories: using linguistic and statistical approaches to derive meaning from unstructured text, and using Semantic Web approaches to represent meaning in content and query structure. Endeca embraces both of these aspects of semantic search.

From early on, we have developed an extensible framework for enriching content through linguistic and statistical information extraction. We have developed some groundbreaking tools ourselves, but have achieved even better results by combining other vendors’ document analysis tools with our unique ability to improve their results through corpus analysis.

The growing prevalence of structured data (e.g., RDF) with well-formed ontologies (e.g., OWL) is very valuable to Endeca, since our flexible data model is ideal for incorporating heterogeneous, semi-structured content. We have done this in major applications for the financial industry, media/publishing, and the federal government.
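
As a minimal sketch of what working with such structured data looks like in practice, the example below uses the open-source rdflib library (not Endeca's own data model; the triples and predicate names are invented) to load a few RDF statements, including one that could have come from text mining, and query them with SPARQL:

    from rdflib import Graph

    turtle = """
    @prefix ex: <http://example.org/> .
    ex:report42 ex:mentionsCompany ex:AcmeBank ;
                ex:publishedYear "2008" .
    ex:AcmeBank ex:sector "financial services" .
    """

    g = Graph()
    g.parse(data=turtle, format="turtle")

    # Which documents mention companies in the financial services sector?
    q = """
    PREFIX ex: <http://example.org/>
    SELECT ?doc WHERE {
        ?doc ex:mentionsCompany ?company .
        ?company ex:sector "financial services" .
    }
    """
    for row in g.query(q):
        print(row.doc)   # http://example.org/report42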

It is also important to note that semantic search is not just about the data. In the popular conception of semantic search, the computer is wholly responsible for deriving meaning from the unstructured input. Endeca’s philosophy, as per the HCIR vision, is that humans determine meaning, and that our job is to give them clues using all of the structure we can provide.

Where can I find more information about your products, services, and research?

Endeca’s web site is http://endeca.com/. I also encourage you to read my blog, The Noisy Channel (http://thenoisychannel.com/), where I share my ideas (as do a number of other people!) on improving the way that people interact with information.

Stephen Arnold, February 9, 2009

Microsoft Wants Users to Help Live.com

February 8, 2009

Google ignores user inputs in many cases. Microsoft, on the other hand, has made available via Softpedia and other news sources a list of emails to which a person can submit notices of abuse or improper content. You can read the story “Microsoft Keeps a Close Eye on Windows Live Servers” here. The article looked like a PR write up, but I had another idea as I thought about this quote in the article:

“Windows Live aims to protect you in these ways: blocking instant messaging spam and inappropriate communications in Windows Live Messenger and providing a means for you to report abuse; filtering incoming e-mail messages in Windows Live Hotmail for spam/junk; scanning attachments you receive, download, and send in Hotmail; scanning comments on Windows Live Spaces for spam; monitoring shared photos for abuse and inappropriate imagery; and providing abuse@live.com for issues you want to report directly to our Direct Mail Abuse team,” Cannon stated.

Google’s policy of ignoring some inputs may irritate me at times, but I know that Mother Google is deaf to blandishments from the addled goose in Kentucky. As a result, I don’t bother communicating with Googzilla any longer. It is what it is. Microsoft, on the other hand, has posted via Softpedia these contact points:

Messenger: Report abuse
Hotmail: Report abuse
Spaces: Report abuse
SkyDrive: Report abuse
Calendar: Report abuse
Events: Report abuse
Groups: Report abuse
People: Report abuse
Profile: Report abuse

What happens when a person with a peculiar sense of humor floods these addresses with reports that may be false? If anyone has any information about the method used to determine if a complaint is accurate or false, let me know. The policy seems to open the door to issues.

Stephen Arnold, February 7, 2009

Ask.com: Vertical Search Push

February 8, 2009

The harsh world of Web search seems to have ground down Ask.com even further. Search Engine Watch’s “Ask.com Parent IAC Sees Disappointing Revenues, Plans Vertical Search Strategy” here tells the tale. You can read the financial details yourself. For me, the most interesting comment was the strategy intended to turn the sea of red ink into a salmon fishery:

Instead of attempting to take on Google head-on, Ask.com will follow a vertical search strategy, which kicked off last month with deal where Ask will power the search experience on NASCAR.com, provide a NASCAR toolbar, and sponsor a car. IAC plans to roll out from 8 to 10 similar relationships this year.

Yep, the search engine of NASCAR will seek “similar relationships”. One hopes that Ask.com tries to locate a relationship not experiencing sponsor defection and declining attendance. When I was at Ziff, one of the Ziffers was involved with the original AskJeeves.com site. Since its founding more than a decade ago, Ask.com has never been useful for my type of research. Maybe this vertical search approach will work? Vertical search is sort of a hassle for me. I prefer to go to one place and get results. Running the same query on different “vertical” systems means I have to federate the results. Nope, I want the system to do the grunt work.

Stephen Arnold, February 8, 2009

Google Latitude: Search without Entering Keywords

February 8, 2009

I have been fascinated by the media and public reaction to Google’s Latitude service. For a representative example, check out Scientific American’s story here. The idea is that a Google user can activate a tracking feature for friends. The Latitude service is positioned as an option for users. The GOOG’s intent is to allow friends and maybe people like parents to see where a person is on a Google Map. Wow, I received several telephone calls and agreed to participate in two live radio talk show interviews. The two hosts were concerned that their location could be tracked by anyone at any time. Well, that’s sort of correct, but Google Latitude is not the outfit doing that type of tracking as far as I know.

A couple of points I noted that caught the attention of the media personalities who spoke with me:

  1. There was zero awareness that triangulation is a well-known method. GPS-equipped devices that transmit happily even when the owner thinks the device is “off” are standard in certain law enforcement sectors. One anecdote that made the rounds in 2001 was that a certain person of interest loaned his personal mobile phone to a courier who was fetching videos from a city in a far-off land. The homing device in the nose of the missile destroyed the courier’s four wheel drive vehicle. The person of interest switched to a pay-as-you-go phone, having learned an important lesson.
  2. The details of the Google Latitude service, which is flaky and crashes even in Chrome, did not sink into the media personalities’ knowledge base. Google makes clear what the service is and does. The words don’t resonate. Fear does. Little wonder that there is a thriving business in discussing this immature Google service, which works only with certain software on the user’s mobile device. Gory details are here.
  3. The chipper Googler who does the video about the service sounds to me as if the speaker was a cheerleader at a private school where each student had a horse and a chauffeur. There was what I think one wacky college professor called “cognitive dissonance”. Tracking my husband is, like, well, so coool. Maybe it is my age, but this eager beaver approach to friend tracking troubled me more than the unstable, crash-prone service. The video is here.

Next week you will be able to navigate to a Web page and run a query across Google’s USPTO documents and have one click access to a PDF of the patent document. The service is up now and one vendor’s search system is available at this time, but I hope to add additional search systems so you can explore the disclosure corpus yourself. These “innovations” are several years old if you have been reading Google’s technical papers and its patent documents. The baloney that a patent document does not become a product does not hold for Googzilla. If you have been reading my analyses of these documents in The Google Legacy (2005) and Google Version 2.0 (2007) you already know that what is now making its way to alpha and beta testing is three, maybe four years old.

My take on this is that Google watchers are getting blindsided and overly excited too late in the game. When the GOOG rolls out a service or allows a Google wizard to appear in public, the deal is done. Concern about tracking is like fretting over the barn fire three years after the fact. Silly waste of time. The GOOG does a lousy job of hiding its technical direction but few take the time to dig out the information.

Radio hosts should start reading Google technical papers. Would that raise the level of discourse? The tracking service has significant implications for medical device vendors, shipping companies, and law enforcement. So far few pundits are tackling these applications in a substantive way. I touch upon these issues in my forthcoming Google: The Digital Gutenberg here.

Stephen Arnold, February 8, 2009

Metadata Extraction

February 8, 2009

A happy quack to the reader who sent me a link to “Automate Metadata Extraction for Corporate Search and Mashups” by Dan McCreary here. The write up focuses on the UIMA framework and the increasing interest in semantics, not just key word indexing. I found the inclusion of code snippets useful. The goslings here at Beyond Search are urged to copy, cut and paste before writing original scripts. Why reinvent the wheel? The snippets may not be the exact solution one needs, but a quick web footed waddle through them revealed some useful items. Mr. McCreary has added a section about classification and he used the phrase “faceted search” which may agitate the boffins at Endeca and other firms where facets are as valuable as a double eagle silver dollar. I was less enthusiastic about the discussion of Eclipse, but you may find it just what you need to chop down some software costs.
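
For readers who want the flavor of the idea without opening the IBM article, here is a minimal sketch of metadata extraction. It is not Mr. McCreary's UIMA-based code; the patterns and field names are naive inventions that simply pull index-ready fields out of raw text:

    import re

    def extract_metadata(text):
        """Pull a few index-ready fields out of raw text with naive patterns."""
        metadata = {
            "dates":  re.findall(r"\b(?:19|20)\d{2}-\d{2}-\d{2}\b", text),
            "emails": re.findall(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", text),
            "money":  re.findall(r"\$\d[\d,]*(?:\.\d+)?(?:\s*(?:million|billion))?", text),
        }
        return {field: values for field, values in metadata.items() if values}

    doc = "Contact press@example.com about the 2009-02-12 call on the $625 million loan."
    print(extract_metadata(doc))
    # {'dates': ['2009-02-12'], 'emails': ['press@example.com'], 'money': ['$625 million']}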

The write up is in several parts. Here are the links to each section: Part 1, Part 2, and Part 3. I marked this article for future reference. Quite useful if a bit pro-IBM.

Stephen Arnold, February 6, 2009
