SemantiFind: Semantic Plumbing Exposed

November 11, 2008

Two readers sent me links to SemantiFind, a company that offers a Web service for semantic search of the Internet. I am on record as suggesting that semantic technology has an important role to play, but behind the scenes. Most users reap rich information access rewards when semantic and other advanced technology works as plumbing. You use the system by registering and installing a browser plug-in. You then navigate to Google.com, Live.com, or Yahoo.com and run your query. SemantiFind converts your default query into a list of suggestions. You select the word that best matches your intended query. A useful page can be flagged and its content used to formulate future queries. SemantiFind provides a community and a system that may be useful to users who have difficulty thinking of words to perform query expansion or query narrowing. Our test queries returned acceptable results. Check it out. Use it if you find it helpful. More information about the company is available here.
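
To make the mechanics concrete, here is a minimal sketch in Python of how a suggestion layer of this kind might work. The sense table and function name are my own illustration, not SemantiFind's actual code or method.

    # Toy suggestion layer: map an ambiguous query term to candidate
    # senses and let the user pick the one matching the intent.
    # The sense table is an invented example, not SemantiFind's data.
    SENSES = {
        "jaguar": ["jaguar (animal)", "jaguar (automobile)", "jaguar (OS)"],
        "python": ["python (snake)", "python (programming language)"],
    }

    def suggest(query):
        """Return candidate refinements for each term in the query."""
        suggestions = []
        for term in query.lower().split():
            suggestions.extend(SENSES.get(term, [term]))
        return suggestions

    # The user selects the sense that best matches the intended query.
    print(suggest("jaguar repair"))
    # -> ['jaguar (animal)', 'jaguar (automobile)', 'jaguar (OS)', 'repair']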

Stephen Arnold, November 11, 2008

Micro Mart’s Surprising Web Search Findings: Google Is an Also-Ran

November 11, 2008

The trusty newsreader served up a link to a three-part article by Peter Hayes. He wrote a feature “The Secret Life of Search Engines” for Micromart.com. I have conflicting date information for this article. It may have been written yesterday or a year ago.

You can find the first part here. The second part here. And the third part here. The Micromart.com site search engine leaves a bit to be desired because its index does not contain a pointer to the first part of this article. Sigh. My own tools ferreted out the three parts, and I think you will find Mr. Hayes’ analysis surprising. The key point for me is that when a journalist runs benchmark queries across search systems, the gulf between those who understand what readers find interesting and those who build search engines becomes evident. In fact, if Mr. Hayes’ analysis were used as the definitive guide for finding information on the public Web, there would be considerable consternation at a number of high profile firms and cause for joy among a group of search engines that are going nowhere in terms of usage. I want to consider this point at the end of my Beyond Search post. Let’s look at the key points in each of the three parts of this analysis, shall we?

Part One: Outline Politics

Straight off let me say I don’t know what “outline politics” means. I don’t think it matters much beyond privacy and the ambivalent nature of an index’s utility. I did not get the impression that the phrase is particularly significant in the flow of his argument. The series begins with the notion that you can make money offering a product people use every day. The idea is flawless when it comes to a fungible product, but I am not sure it applies to the somewhat more slippery world of information. Nevertheless, the point is that traffic is good. Furthermore, the Internet is changing. Content is tricky. Mr. Hayes introduces the notion of official content and unofficial content. That’s a useful distinction, but it did not resonate with me. Mr. Hayes then asserts that search engines have, and I quote:

two major functions. One is to teach, the other is to search. While both have a large positive side we shouldn’t pretend that there isn’t a downside to any tool. Any tool used for good can also be used for bad.

He is now in full stride and hitting a hot button almost guaranteed to whip up interest among European Web users–privacy. He then heads for the end of Part One with this comment:

My final thought is that search engines are only passengers on the Internet train and not the train itself. The growth of the Internet gives them the prospect of a healthy and prosperous future – but at the same time it is reliant on the safekeeping and update of the Internet to keep up with demand and to protect it from vandals. As our newspaper headlines tell us, the world is not totally a safe and law abiding place.

I must admit that I am not quite sure of the logic of this first section, but let’s move on to Part Two.

Part Two: Tools

Mr. Hayes dives in with location searching and touches upon Boolean logic, promising to tackle this topic elsewhere in his series. His first injunction is to keep a search simple. Web indexes are divided into systems dependent on software and systems dependent on humans. Mr. Hayes does not provide a context for the disparity in usage between these two types of systems, a distinction that will return to haunt him in Part Three of his series. He points out that search systems are not “born equal”. The promised analysis of Boolean arrives and I learn:

Boolean (which consists of the three words AND, OR, NOT, remember) is best explained by example. Some engines don’t allow it and some only use the NOT part. This follows the general rule that nothing to do with the Internet is ever totally straightforward! Typing NOT will take out examples that don’t fit the bill (‘Arsenal NOT soccer’, for example), but this is hard word to use and control. In Yahoo, double meanings are automatically divided out. Also the engine can easily come up with word connections that you would never think of in a million years – including simple names.

I think I understand even though Mr. Hayes’ own examples use symbols for AND, and he does not provide an example of a successful NOT search statement. NOT for Mr. Hayes is a “hard word to control”. I imagine that for him NOT may be troublesome. He points out that:

AND is the least useful of all because most of time, it is taken as read on all known engines that work via keywords. Type ‘Peter Hayes Writing Genius’ it will give the same result as ‘Peter+Hayes+Writing+Genius’ or ‘Peter AND Hayes AND Writing AND Genius’.

The statement confirms my suspicions that Mr. Hayes has taken a very different view of Boolean logic, its complexities, and the way in which logical operators work in his world. I quite like AND, NOT, OR, and even NAND in some systems. You may find AND and NOT useful as well.
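
For readers who want to see what the operators actually do inside an engine, here is a toy inverted index evaluated with Python sets. The documents and posting lists are fabricated for illustration; real engines use compressed posting lists and far more elaborate query planners.

    # Toy inverted index: term -> set of document ids containing the term.
    INDEX = {
        "arsenal": {1, 2, 3},
        "soccer":  {2, 3, 4},
        "london":  {1, 4, 5},
    }
    ALL_DOCS = {1, 2, 3, 4, 5}

    def AND(a, b): return a & b          # documents matching both terms
    def OR(a, b):  return a | b          # documents matching either term
    def NOT(a):    return ALL_DOCS - a   # documents lacking the term

    # Mr. Hayes' example 'Arsenal NOT soccer': pages mentioning arsenal
    # but not the sport.
    print(AND(INDEX["arsenal"], NOT(INDEX["soccer"])))   # -> {1}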

I am not certain what the subsection “Getting It Right” means. The claimed inutility of AND and NOT still echoes in my mind. Part Two ends with an observation about how much of the Internet is indexed. That’s a good question, and I now turn to Part Three, where the intellectual rigor of Mr. Hayes meets the Information Superhighway, if I may indulge in a bit of metaphorical whimsy.

Part Three: The Best UK Web Search Engines

I knew I was in for a delightful few minutes after the first two parts of Mr. Hayes’ feature. In Part Three he lays out 10 test queries. I can’t reproduce the full list, but I can highlight two of his queries:

  • Bring me the site of the best selling newspaper in the UK (The Sun)
  • Find a local newspaper covering the Shetlands

I noted that each query is expressed as a string of text. Some vendors would rush to point out that Mr. Hayes is using natural language queries. Not many systems support natural language queries in particularly sophisticated ways. Some, for instance, create a Boolean query from whatever the user enters in the search box. Other systems consult a look-up table of what has been a satisfactory result for the query recently and deliver that result from cache. Others dump stop words and go with the meaningful words joined by a simple AND or OR Boolean operator, as the sketch below suggests. Others look at what’s available from an advertiser and dump those results directly to the user. Others predict what a user will prefer based on that user’s profile or usage history. This list is not exhaustive by any means.
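
As a concrete illustration of the stop word strategy, here is a sketch of the “dump stop words and AND the rest” approach. The stop list is a tiny invented sample, and the implicit-AND policy is a generic assumption, not a description of any one vendor’s system.

    # Convert a natural language query to a bare Boolean keyword query.
    # The stop list is an illustrative sample, not any vendor's list.
    STOP_WORDS = {"bring", "me", "the", "of", "in", "a", "find", "site"}

    def to_boolean(natural_query, operator="AND"):
        terms = [t for t in natural_query.lower().split()
                 if t not in STOP_WORDS]
        return f" {operator} ".join(terms)

    print(to_boolean("Bring me the site of the best selling newspaper in the UK"))
    # -> 'best AND selling AND newspaper AND uk'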

What did Mr. Hayes learn from his analysis of the 10 queries sent to the UK sites for Lycos, AltaVista, Dogpile, Excite, HotBot, Metacrawler, MSN, Yahoo, Ask, and Google? I have converted Mr. Hayes’ findings into the summary table below. Keep in mind that these are his data in a slightly different form. These are not my or my team’s findings:

Rank  Engine       Hayes’ Take
1     Lycos        Answered questions well
2     AltaVista    Useful but obscure results
3     Dogpile      Surprised it didn’t do better
4     Excite       Respectable performer
5     HotBot       Good all round performer; Mr. Hayes’ favorite
6     Metacrawler  Biggest surprise of the lot
7     MSN          Slick and impressive performer
8     Yahoo        Handpicked and categorized results a plus
9     Ask          Plain English queries
10    Google       Did not outperform the opposition

Mr. Hayes includes “scores” for each engine. The top-rated engine, Lycos, received a Hayes number of 83%; the lowest-rated engine, Google, received a Hayes number of 78%.

Observations

I came away from my reading of this three part series in a semi stunned state. I had a number of major and minor quibbles gallivanting around my cranial cavity. Let me highlight three points and move on:

  1. This article made it clear to me that people don’t know what they don’t know about Web search, its technology, and its nuances. Google is probably correct in sticking with its very simple interface and its behind-the-scenes functions to answer most users’ questions with “good enough” results. If Mr. Hayes is an informed user of Web search systems and finds the HotBot results more useful than other systems’ results, that’s well and good. The idea of using one system to conduct research of any type is anathema to me. Overlap, freshness, scope of index–these are essential factors for each Web indexing system. Insensitivity to these issues makes me downright nervous. I thought, “If Mr. Hayes can’t figure out the important parts, what about a less informed online user?”
  2. The queries Mr. Hayes formulated reveal why natural language systems are not understood. Forget semantic methods. I am not sure how to remediate Mr. Hayes’ test queries. The approach is foreign to me, as is Mr. Hayes’ failure to differentiate each of the test systems with more precision. There is a big difference between a system that federates results, one that indexes only frequently accessed pages, and one that operates with orphaned code on a shoestring.
  3. The failure to point out that Google serves about 70 percent of the queries in North America and more in Denmark, Germany, and the UK is an oversight. The giant gets the lowest score, which doesn’t make sense to me. Mr. Hayes uses subjective criteria to generate his Hayes numbers and provides zero detail about the method used to calculate a score. I think scoring the engines on freshness, features, relevance as measured by the number of on-target hits in the first 10,000 results in a result set, and similar criteria would suggest that Lycos, AltaVista, and HotBot aren’t competitive in today’s market. Microsoft’s Live.com and Yahoo search are in some ways easier to benchmark against the Google. The other vendors are non-starters in my mind because none has the technical or the financial resources to index at the Google, Microsoft Live.com, and Yahoo levels.

Mr. Hayes omitted a Web search engine that I think is better than eight or nine of those on his list; namely, Exalead. I am well pleased with the results I obtain from Exalead.com here. In general, the French make me nervous with their math skills and sense of style, but Exalead is the functional equivalent of Google, operated by Europeans, and a country mile better on my relevance tests than the orphans AltaVista, Excite, and HotBot.

Keep in mind I am stating my opinion. I am an addled goose. I am sure the experts who organize search conferences will be delighted to feature Mr. Hayes as a keynote speaker. The conference organizers and Mr. Hayes’ understanding of search may be well matched.

Stephen Arnold, November 11, 2008

Nstein: Searching for a Better Search

November 8, 2008

Nstein Technologies [http://www.nstein.com/en/] digital publishing specialist Diane Burley presented a webinar titled “Searching … For a better Search!” on November 6, 2008. The point was to teach media companies to evaluate how search works on their Web sites, address the pros and cons of search strategies like link lists or search boxes, and show how sites might be losing readers. Ms. Burley reviewed case studies to illustrate the differences between active and passive search; how to use semantic analysis to improve search; useable ideas for improving stickiness; and real-world examples of media companies using internal and external search. Has Nstein returned to its content processing roots?

Jessica Bratcher, November 7, 2008

Yakabod’s Knowledge Appliance

November 5, 2008

What do you think about having search, content management, social networking, and collaboration all in one secure software appliance that deploys quickly and is super-intuitive? Impossible, you say? Yakabod Inc. says otherwise. Its new appliance, Yakabox, is a knowledge management system (housed in a striking purple box, no less) that not only sorts, stores, and searches hard copy information, but also handles more ephemeral data such as opinions, experience, and brainstorming so you don’t have to reinvent the wheel. A graphic on the Yakabod site says: “Did you know that 40% of the documents U.S. workers create every day already exist?” Frightening. Yakabod markets Yakabox to fight these stumbling blocks: deployment time; security difficulties; integration problems; redundancy; and employee resistance. You can download several white papers here, and there’s a list of comparative options here.

Jessica Bratcher, November 5, 2008

Data Management: A New Search Driver

November 4, 2008

Earlier today I reread “The Claremont Report on Database Research.” I had a few minutes, I recalled reading the document earlier this year, and I wanted to see if I had missed some of its key points. This report is a committee-written document prepared as part of an invitation-only conference focusing on databases. I follow the work of several of the people listed as authors of the report; for example, Michael Stonebraker and Hector Garcia-Molina, among others.

One passage struck me as important on this reading of the document. On page 6, the report said:

The second challenge is to develop methods for effectively querying and deriving insight from the resulting sea of heterogeneous data…. keyword queries are just one entry point into data exploration, and there is a need for techniques that lead users into the most appropriate querying mechanism. Unlike previous work on information integration, the challenges here are that we do not assume we have semantic mappings for the data sources and we cannot assume that the domain of the query or the data sources is known. We need to develop algorithms for providing best-effort services on loosely integrated data. The system should provide some meaningful answers to queries with no need for any manual integration, and improve over time in a “pay-as-you-go” fashion as semantic relationships are discovered and refined. Developing index structures to support querying hybrid data is also a significant challenge. More generally, we need to develop new notions of correctness and consistency in order to provide metrics and to enable users or system designers to make cost/quality tradeoffs. We also need to develop the appropriate systems concepts around which to tie these functionalities.

Several thoughts crossed my mind as I thought about this passage; namely:

  1. The efforts by some vendors to make search a front end or interface for database queries are bringing this function to enterprise customers. The demonstrations by different vendors of business intelligence systems such as Microsoft Fast’s Active Warehouse or Attivio’s Active Intelligence Engine make it clear that search has morphed from key words to answers.
  2. The notion of “pay as you go” translates to smart software; that is, no humans needed. If a human is needed, that involvement is as a system developer. Once the software begins to run, it educates itself. So, pay as you go becomes a colloquial way to describe what some might have labeled “artificial intelligence” in the past. With data volume increasing, the notion of humans getting paid to touch the content recedes.
  3. Database quality in the commercial database sector could be measured by consistency and completeness. The idea that zip codes were consistent was more important than a zip code being accurate. With statistical procedures, the value in a cell may be filled in along with a score that shows the probability that the zip code is correct. Similarly, if one looks for the salary or mobile number of an individual, these probability scores become important guides to the user. (A minimal sketch of this idea appears after this list.)
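
Here is a minimal sketch of the probability-scored fill-in idea from point three. The data and the method (impute a missing zip code from the distribution observed for the same city) are my own hypothetical illustration of the approach, not the report’s algorithm.

    from collections import Counter

    # Fill a missing zip code from the zip codes already observed for
    # the same city, and attach the probability the value is correct.
    # The observations are fabricated for illustration.
    OBSERVED = [
        ("Louisville", "40202"), ("Louisville", "40202"),
        ("Louisville", "40202"), ("Louisville", "40299"),
    ]

    def impute_zip(city):
        zips = Counter(z for c, z in OBSERVED if c == city)
        if not zips:
            return None, 0.0
        best, count = zips.most_common(1)[0]
        return best, count / sum(zips.values())

    print(impute_zip("Louisville"))   # -> ('40202', 0.75)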


“Pay as you go” computing means that the most expensive functions in a data management method have costs reduced because humans are no longer needed to do the “knowledge work” required to winnow and select documents, facts, and information. The company able to implement “pay as you go” computing on a large scale will destabilize the existing database business sector. My research has identified Google as an organization employing research scientists who use the phrase “pay as you go” computing. Is this a coincidence or an indication that Google wants to leapfrog traditional database vendors in the enterprise?

In the last month, a number of companies have been kind enough to show me demonstrations of next generation systems that take a query and generate a report. One system allows me to look at a sample screen, click a few options, and then begin my investigation by scanning a “trial report”. I located a sample Google report in a patent application that generates a dossier when the query is for an individual. That output goes an extra step and includes aliases used by the individual who is the subject of the query and a hot link to a map showing geolocations associated with that individual.

The number of companies offering products or advanced demonstrations of these functions means that the word search is going to be stretched even further than assisted navigation or alerts. The vendors who describe search as an interface for business intelligence are moving well beyond key word queries and the seemingly sophisticated interfaces widely available today.

Despite the economic pressures on organizations today, vendors pushing into data management for the purpose of delivering business intelligence will find customers. The problem will be finding a language in which to discuss these new functions and features. The word search may not be up to the task. The phrase business intelligence is similarly devalued for many applications. An interesting problem now confronts buyers, analysts, and vendors, “How can we describe our systems so people will understand that a revolution is taking place?”

The turgid writing in the Claremont Report is designed to keep the secret for the in-crowd. My hunch is that certain large organizations–possibly Google–are quite far along in this data management deployment. One risk is that some companies will be better at marketing than at deploying industrial strength next generation data management systems. The nest might be fouled by great marketing not supported by equally robust technology. If this happens, the company that says little about its next generation data management system might deploy the system, allow users to discover it, and thus carry the field without any significant sales and marketing effort.

Does anyone have an opinion on whether the “winner” in data management will be a start up like Aster Data, a market leader like Oracle, or a Web search outfit like Google? Let me know.

Stephen Arnold, November 4, 2008

Disturbing Data, Possible Parallel for Search

October 30, 2008

After wrapping up another section of my forthcoming monograph on Google publishing technology for Infonortics Ltd. in Tetbury, England, I scanned the content sucked in by my crawlers. Another odd duck greeted me with the off-point headline “Outlook: Don’t Panic It’s Not 2001” here. (This is a wacky URL, so you may have to navigate to the parent site www.commsdesign.com and hunt for the author Bolaji Ojo.)

For me, one telling paragraph was:

In 2001, for instance, the wireline communications equipment market sank 18 percent to $69.6 billion, from $85.3 billion in the previous year. Semiconductor sales to the segment tumbled 37 percent on a combination of sagging demand and severe pricing declines. Seven years later, wired communications equipment sales have yet to recover to the 2000 level, and estimates indicate the market won’t bounce back fully until sometime in the next decade. ISuppli expects 2009 wired communications sales to be approximately $76.6 billion, improving from an estimated $72.5 billion in 2008, but still below the record 2000 figure of $85 billion.

[Image. Source: http://thesaleswars.wordpress.com/2008/02/]

Another interesting point was:

The entire semiconductor market wasn’t as fortunate. Chip sales plunged 43 percent in 2001, to $101.8 billion from $178.9 billion in 2000, according to the Semiconductor Industry Association. The industry resumed growth in 2002, but it wasn’t until 2004 before global sales finally crawled past the previous record. By then, dozens of semiconductor, passives, interconnect and electromechanical companies and electronic manufacturing services providers had disappeared, some merging with stronger rivals. A few others went under, unable to finance operations as customers froze purchases or exited the embattled networking equipment market.

What these data suggested to me was that the search, content processing, and search enabled application sectors may face significant revenue declines and could take years to recover. The loss of companies that have no revenue is understandable. Funding sources may dry up or cut off the flow of money. Large firms may shed staff, but these vendors will, for the most part, remain in business. The real pressure falls on what I call “tweeners”. Tweeners are organizations that are in growth mode but the broader downturn can reduce their sales and squeeze the companies’ available cash. Slow payment from customers adds to the problem.


Amazon’s iTunes Like Interface

October 28, 2008

Amazon has developed a new interface. You can read the news story on TechCrunch here. The graphical presentation is intended to make it easier and more fun to browse Amazon’s products. Jason Kincaid’s article does a very good job of explaining the features of this interface. For me, the most important comment in the write up was:

The site seems geared towards shoppers who are just looking for ideas, as there isn’t a search feature. Users can scroll through the site using their arrow keys, zooming in on individual products by hitting the spacebar. Each product includes a demo video (in the case of movies, songs, and video games) or an excerpt (from books).

I have often asserted that search is dead. I did not say that search was not useful. Amazon believes it has cracked the code on information retrieval without asking the user to type in the title of a book or an author’s name. Amazon wants to be a combination of Apple and Google. Amazon may have to keep trying to manage this transition.

Stephen Arnold, October 28, 2008

Twine’s Semantic Spin on Bookmarks

October 25, 2008

Twine is a company committed to semantic technology. Semantics can be difficult to define. I keep it simple and suggest that semantic technology allows software to understand the meaning of a document. Semantic technology finds a home inside of many commercial search and content processing systems. Users, however, don’t tinker with the semantic plumbing. Users take advantage of assisted navigation, search suggestions, or a system’s ability to take a single word query and automatically hook the term to a concept or make a human-type connection without a human having to do the brain work.
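
A minimal sketch may make the plumbing idea concrete: a one-word query is hooked to a concept and expanded silently, with no user intervention. The concept map below is my own toy example, not any vendor’s ontology.

    # Toy concept map: hook a single query term to a concept and expand
    # the query behind the scenes. Mappings are invented for illustration.
    CONCEPTS = {
        "mustang": {"concept": "automobile", "expand": ["ford", "coupe"]},
        "merlot":  {"concept": "wine", "expand": ["red wine", "vineyard"]},
    }

    def expand_query(term):
        entry = CONCEPTS.get(term.lower())
        if entry is None:
            return [term]                  # unknown term: pass through
        return [term] + entry["expand"]    # term plus concept neighbors

    print(expand_query("merlot"))   # -> ['merlot', 'red wine', 'vineyard']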

Twine, according to the prestigious MIT publication Technology Review, is breaking new ground. Erica Naone’s article “Untangling Web Information: The Semantic Web Organizer Twine Offers Bookmarking with Built In AI” stops just short of a brass-band-enhanced endorsement but makes Twine’s new service look quite good. You must read the two-part article here. For me, the most significant comment was:

But Jim Hendler, a professor of computer science at Rensselaer Polytechnic Institute and a member of Twine’s advisory board, says that Semantic Web technologies can set Twine apart from other social-networking sites. This could be true, so long as users learn to take advantage of those technologies by paying attention to recommendations and following the threads that Twine offers them. Users could easily miss this, however, by simply throwing bookmarks into Twine without getting involved in public twines or connecting to other users.

Radar Networks developed Twine. The metaphor of twine evokes for me the trouble I precipitated when I tangled my father’s ball of hairy, fibrous string. My hunch is that others will think of twine as tying things together.

You will want to look at the Twine service here. Be sure to compare it to the new Microsoft service U Rank. The functions of Twine and U Rank are different, yet both struck me as strongly committed to saving and sharing Web information that is important to a user. Take a look at IBM’s Dogear. This service has been around for almost a year, yet it is almost unknown. Dogear’s purpose is to give social bookmarking more oomph for the enterprise. You can try this service here.

As I explored the Twine service and refreshed my memory of U Rank and Dogear, several thoughts occurred to me:

  1. Exposing semantic technology in new services is a positive development. The more automatic functions can be significant time savers. A careless user, however, could shift into cruise-control mode, losing sight of what’s happening and of the need to think critically about who recommends what and where information comes from.
  2. Semantic technology may be more useful in the plumbing. As search enabled applications supplant key word search, putting too much semantic functionality in front of a user could baffle some people. Google has stuck with its 1950s, white refrigerator interface because it works. The Google semantic technology hums along out of sight.
  3. The new semantic services, regardless of the vendor developing them, have not convinced me that they can generate enough cash to stay alive. The Radar Networks and the Microsofts will have to do more than provide services that are almost impossible to monetize. IBM’s approach is to think about the enterprise, which may be a better revenue bet.

I am enthusiastic about semantic technology. User facing applications are in their early days. More innovation will be coming.

Stephen Arnold, October 25, 2008

SurfRay Round Up

October 24, 2008

SurfRay and its products have triggered a large number of comments on this Web log. On my recent six-day trip to Europe, I was fortunate to be in a position to talk with people who knew about the company’s products. I also toted my Danish-language financial statements along, and I was able to find some people to walk me through the financials. Finally, I sat down and read the dozens of postings that have accumulated about this company.

I visited the company on a trip to Copenhagen five or six years ago. I wrote some profiles about the market for SharePoint-centric search, sent bills, got paid, and then drifted away from the company. I liked the Mondosoft folks, but I live in rural Kentucky. One of my friends owned a company that ended up in the SurfRay portfolio. I lost track of that product. I recall learning that SurfRay gobbled up an outfit called Ontolica. My recollection was that, like Interse and other SharePoint-centric content processing companies’ technology, Ontolica put SharePoint on life support. What this means is that some of SharePoint’s functions work, but not too well. Third-party vendors pay Microsoft to certify one or more engineers in the SharePoint magic. Then those “certified” companies can sell products to SharePoint customers. If Microsoft likes the technology, a Microsoft engineer may facilitate a deal for a “certified” vendor. I am hazy on the ways in which the Microsoft certification program works, but I have ample data from interviews I have conducted that “certification” yields sales.

[Image: An Ontolica results list.]

Why is this important? It’s background for the points I want to set forth as “believed to be accurate” so the SurfRay folks can comment, correct, clarify, and inform me on what the heck is going on at SurfRay. Here are the points about which comments are in bounds.


Silobreaker: Two New Services Coming

October 24, 2008

I rarely come across real news. In London, England, last week I uncovered some information about Silobreaker‘s new services. I have written about Silobreaker before here and interviewed one of the company’s founders, Mats Bjore here. In the course of my chatting with some of the people I know in London, I garnered two useful pieces of intelligence. Keep in mind that the actual details of these forthcoming services may vary, but I am 99% certain that Silobreaker will introduce:

Contextualized Ad Retrieval in Silobreaker.com

The idea is that Silobreaker’s “smart software,” called a “contextualization engine,” will be applied to advertising. The method understands concepts and topics, not just keywords. I expect to see Silobreaker offering this system to licensees and partners. What’s the implication of this technology? Obviously, for licensees, the system makes it possible to deliver context-based ads. Another use is for a governmental organization to blend a pool of content with a stream of news. In effect, when certain events occur in a news or content stream, an appropriate message or reminder can be displayed for the user. I can think of numerous police and intelligence applications for this blend of static and dynamic content in operational situations.
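
Here is a toy sketch of the triggering idea as I understand it: when the concepts detected in a news item overlap a stored message’s trigger concepts, the message is surfaced. The trigger sets, message names, and overlap test are my own assumptions, not Silobreaker’s contextualization engine.

    # Context-triggered retrieval: surface a stored message when its
    # trigger concepts overlap the concepts found in a news item.
    # Trigger sets are hypothetical examples.
    MESSAGES = {
        "port security advisory": {"shipping", "customs", "smuggling"},
        "pandemic travel notice": {"outbreak", "quarantine", "airline"},
    }

    def triggered(news_concepts):
        return [msg for msg, triggers in MESSAGES.items()
                if triggers & news_concepts]

    print(triggered({"smuggling", "arrest", "customs"}))
    # -> ['port security advisory']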

Enterprise Media Monitoring & Analysis Service

The other new service I learned about is a fully customizable online service that delivers a simple and effective way for enterprise customers to handle the entire work flow around their media monitoring and analysis needs. While today’s media monitoring and news clipping efforts remain resource intensive, Silobreaker Enterprise will be a subscription-based service that automates much of the heavy lifting that either internal or external analysts must perform by hand. The Silobreaker approach is to blend–a key concept in the Silobreaker technical approach–disparate yet related information in a single intuitive user interface. Enterprise customers will be able to define monitoring targets, trigger content aggregation, perform analyses, and display results in a customized Web service. A single mouse click allows a user to generate a report or receive an auto-generated PDF report in response to an event of interest. Silobreaker has also teamed up with a partner company to add sentiment analysis to its already comprehensive suite of analytics. The service is currently in its final testing phase with large multinational corporate test users and is due to be released at the end of 2008 or in early 2009.

Silobreaker is a leader in search enabled intelligence applications. Check out the company at www.silobreaker.com. A happy quack to the reader who tipped me on these Silobreaker developments.

Stephen Arnold, October 23, 2008

