Is Precision and Recall Making a Comeback?

March 15, 2011

Microsoft-centric BA Insight explored two touchstones of traditional information retrieval: precision and recall. These terms have quite specific meanings to those who care about the details of figuring out which indexing method actually delivers useful results. The Web world and most organizations care not a whit about fooling around with this equation:

Precision = (relevant documents retrieved) / (total documents retrieved)

And recall. This is another numerical recipe that causes the procurement team’s eyes to glaze over:

Recall = (relevant documents retrieved) / (total relevant documents in the collection)

I was interested to read The SharePoint and FAST Search Experts Blog’s “What is the Difference Between Precision and Recall?” This is a very basic question for determining the relevance of query search results.

Equations aside, precision is the percentage of relevant retrieved documents, and recall is the percentage of relevant documents that are retrieved.  In other words, when you have a search that’s high in precision, your results list will have a large percentage of items relevant to what you typed in, but you may also be missing a lot of items in the total.

With a search that is high in recall, your results list will include more of the items you’re searching for, but it will also contain a lot of irrelevant items. The post points out that determining the usefulness of search results is actually simpler than it sounds:

“The truth is, you don’t have to calculate relevance to determine how SharePoint or FAST search implementation is performing.  You can look at a much more telling KPI.  Are users actually finding what they are looking for?”

The problem, in my opinion, is that most enterprise search deployments lack a solid understanding of the corpus to be processed. As a result, test queries are difficult to run in a “lab-type” setting. A few random queries are close enough for horseshoes. The cost and time required to benchmark a system and then tune it for optimal precision and recall is a step usually skipped.
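For those who do want to fool around with the equations, here is a minimal sketch of both measures and of the kind of small test-query benchmark described above. The query names, document identifiers, and relevance judgments are invented for illustration; a real benchmark needs judged samples drawn from the actual corpus.

```python
def precision(retrieved, relevant):
    """Share of retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    """Share of relevant documents that were actually retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

# Hypothetical test queries: (what the engine returned, what a human judged relevant)
test_set = {
    "travel policy": ({"d1", "d2", "d7"}, {"d1", "d2", "d3", "d4"}),
    "2010 budget":   ({"d5", "d9"},       {"d5"}),
}

for query, (retrieved, relevant) in test_set.items():
    print(f"{query}: precision={precision(retrieved, relevant):.2f} "
          f"recall={recall(retrieved, relevant):.2f}")
```

Run over a few dozen judged queries, numbers like these show whether tuning is helping or hurting, which is exactly the step most deployments skip.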

Kudos to BA Insight for bringing up the subject of precision and recall. My view is that the present environment for enterprise search puts more emphasis on point-and-click interfaces and training wheels for users who lack the time, motivation, or expertise to formulate effective queries. Even worse, the content processed by the index is usually an unexplored continent. There are more questions like “Why can’t I find that PowerPoint?” than shouts of Eureka! Just my opinion.

Stephen E Arnold, March 15, 2011

Freebie

Google Speeds Tweet Information

March 14, 2011

If you can say one thing about Google, it is that it likes to do things for itself. Soshable reports “Forget Indexing Tweets: Google Is Pulling Them Directly from the API.” Google launched Caffeine last year as a tool for real time Web indexing with a heavy influence on social media.

Google used to display tweets from people’s accounts, but now we have learned the company is linking directly to Twitter’s API, thus reducing latency. Our source said:

“Most tweets are eventually indexed – some within minutes, some within hours or even days. These Tweets are being presented in their raw form prior to being indexed. The Tweets themselves are not being used in search results through this new method. They will be indexed separately and can then appear in searches as their own listings, but this is different. Just as with Google’s “Real-time” search, this feature is a fire hose.”

Once tweets are indexed they can be added to search results as individual listings. One might think this is a new endeavor, but it’s not. It’s only a quicker way for Google to provide real time information, but it is a fact to keep in your frontal memory.
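As a rough illustration of the two paths the quoted passage describes, here is a hedged sketch: items from a real-time feed are surfaced immediately and queued for conventional indexing later. The stream, payloads, and index below are stand-ins invented for the example, not Twitter’s API or Google’s actual pipeline.

```python
import queue
import time

def stream_of_tweets():
    """Simulate items arriving from a real-time feed (the 'fire hose')."""
    for i in range(3):
        yield {"id": i, "text": f"tweet number {i}", "received": time.time()}

index_backlog = queue.Queue()   # tweets waiting to be folded into the index later
search_index = {}               # the conventional index, built minutes or hours later

for tweet in stream_of_tweets():
    print("real-time display:", tweet["text"])   # low-latency path, straight from the feed
    index_backlog.put(tweet)                     # indexing path, handled separately

while not index_backlog.empty():                 # the separate, slower indexing pass
    t = index_backlog.get()
    search_index[t["id"]] = t["text"]

print("indexed listings:", search_index)
```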

Google continues to make speed a differentiator. In addition to reducing latency for Twitter content, the Chrome Version 10 browser has been positioned as “faster” as well.

Whitney Grace, March 14, 2011

Freebie

Google Search Algorithm: A New Direction?

March 11, 2011

Content, content, content. There is a lot of bad information, some so-so information, and not much high value information available on the public Web. The challenge is to pinpoint the high value information. For Google, the challenge is to identify the high value information and keep the Adwords revenue flowing.

After reading “Google’s New Algorithm Puts Content in the Driver’s Seat,” that word content remained entrenched in my consciousness. The author made some compelling points as he discussed Google’s new algorithm and the role content plays in several aspects of online activity. Citing a noticeable, though not complete, improvement in the results of high value search requests, the article expressed both praise for the new formula and relief at what the author sees as an overdue shift in the approach to commerce.

One passage I noted was:

“Give them valuable content.  Free.  Give them plenty of it.”

This certainly seems like sound advice. But I want the information that is germane to my query. Who wants to click through a laundry list of links to find what is needed to meet an information need? I don’t.

Google’s PageRank pivoted a decade ago on the importance of links and a site’s rank. Link popularity works for Lady Gaga. Other types of queries may require content that lacks high click or link scores.
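For readers who want the mechanics, here is a minimal sketch of the link-popularity idea behind PageRank. The toy link graph and domain names are invented, and Google’s production ranking obviously involves far more than this; the point is simply that a high value page with few inbound links scores poorly.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                      # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outlinks:
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

toy_web = {
    "ladygaga.com": ["fan1.com", "fan2.com"],
    "fan1.com": ["ladygaga.com"],
    "fan2.com": ["ladygaga.com"],
    "niche-reference.org": [],   # useful content, but no inbound links
}
print(pagerank(toy_web))         # the niche reference page ends up at the bottom
```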

Maybe I am sensitive to coincidences. Google’s change to its method comes on the heels of some legal issues related to indexing and results ranking. Is Google trying to improve relevance, manage some push back, or generate some positive public relations? I don’t have the answers to these questions.

Micheal Cory, March 11, 2011

Google and Yelp: The Future of Content Peeks around the PR Baloney

March 7, 2011

My personal view is that content is undergoing a bit of chemical change. The idea that authors are important seems to be eroding. The new value is an online system that generates content from software, volunteers, paid contributors, and sucking goodies from different places. There is no single word that encapsulates these trends. I wish there were. With the right word I could explain what is at the core of the unsolvable problem for Google and Yelp. You can get the core of the hassle in “Google Issues Ultimatum to Yelp: Free Content or No Search Indexing.” One interesting comment in the write up was this passage:

The issue has been ongoing for several years. However, Stoppelman said there is no answer to it at the moment, while Google maintains the same position.

I thought Google had the world’s smartest employees. Yelp has some sharp folks as well. No solution. So we have a new example of an unsolved problem. The Yelp Conundrum is right up there with The Goldbach conjecture. Well, that suggests that neither company is  as smart as I thought it was or both companies have what I call a “power” problem.

Yelp is performing in an area where Google is not doing too well. Google wants the Yelp content and will remove Yelp from its index unless Yelp buckles under. When I read this story I thought about Google’s position when China pulled a power play. Now Google seems to be throwing its traffic power around.

Interesting.

With Bing and Yahoo accounting for 12 percent of Web search and Google most of the other traffic, Google’s position strikes an interesting chord with me.

Let’s assume that Google arbitrarily excludes Yelp. What does this say about objectivity in indexing? What does this make clear about Google’s ability to adjust an index manually? No matter how smart Google’s software is, the decision to block or otherwise hamper Yelp tells me quite a bit.

And what about Yelp? Is the company playing hardball because a potential tie up fell through? Is Yelp the most recent casualty of Google’s effort to expand via acquisition, not organic growth?

Too bad I am not a lawyer. I would have an informed opinion. As an observer from far off Kentucky, I see a glimpse of the future of the new content environment.

Stephen E Arnold, March 7, 2011

Freebie

SEO Woe: Cows in the Commons

March 6, 2011

I have another gosling writing about this, but I wanted to weigh in on this fine rainy Sunday in Harrod’s Creek. Point your browser at “SEO Is No Longer a Viable Marketing Strategy for Startups.” The basic idea is that certain search engine optimization methods of building traffic have lost their efficacy. The key point in my opinion is:

I talk to lots of startups and almost none that I know of post-2008 have gained significant traction through SEO (the rare exceptions tend to be focused on content areas that were previously un-monetizable). Google keeps its ranking algorithms secret, but it is widely believed that inbound links are the preeminent ranking factor.  This ends up rewarding sites that are 1) older and have built up years of inbound links 2) willing to engage in aggressive link building, or what is known as black-hat SEO. (It is also very likely that Google rewards sites for the simple fact that they are older. For educated guesses on which factors matter most for SEO, see SEOMoz’s excellent search engine ranking factors survey).

I added boldface to the phrase that struck me as particularly interesting; namely, the one with the word “content”.

Now there are some outfits that have figured out that the lousy economy makes it easy to get people to write articles for a modest amount of money. If a company generates quite a few articles and tosses in the basics of page indexing, the various search engines usually index the content. The more links and buzz an article generates, the more likely the write up is to show up on a Tweetmeme.com list, high in the Google search results, or even in a top spot on Bing.com.

Wonderful. We have a popularity contest much like the one that puts such effective professionals into the various US, state, county, and municipal elected offices.

What I find interesting about the voting approach is that our friend Alexis de Tocqueville pointed out that the majority approach delivers quite a number of outputs. Excellence may not be assumed. Popularity is one yardstick. The measurement of quality in a written document may not lend itself to popularity. One chestnut is the plight of Al Einstein. Nothing he wrote resonated with more than a baker’s dozen of folks. Even the Nobel committee struggled to recognize him. That’s the problem with excellence. Voting does not work particularly well in many situations.

Too many cows in the commons.

What’s this mean for search engine optimization, content factories, and a results list in a free Web search engine? Three points in my opinion:

  1. Gaming the system (no matter what system) is great fun and extremely lucrative for those who can exploit what I call the “something for nothing” approach to information. A short cut is worth a lot of money, particularly at conferences that explain how to send lots of cows into the common fields of content.
  2. Smart search engines are just not that smart and probably will not be. That is the reason that commercial content producers generally offer information that follows a different path. I know that most people are not interested in provenance, fact checking, and accuracy, but most of the commercial database producers do a better job than a Web master looking for a way to boost traffic and either keep a job or get a raise. Content is not job one for these people.
  3. Search engine optimization is pretty much whatever the experts, pundits, and carpetbaggers want it to be. There are tricks to exploit stupid Web indexing methods. I just ignore that sector of what some journalists view as “real search” because search is darned easy for anyone with a net connection and a browser.

Bottom line: SEO won’t go away. In my view, one can’t kill it. After a nuclear blast, certain creatures will survive. Publicly accessible, ad supported indexes will not be as objective as I would like. Nor will the indexes do a particularly good job of delivering precision and recall. The advanced features are little more than efforts to get more advertisers.

In short, Web search, SEO, and much of the content on the Web is like a common grazing area with too many cows, too many footballers, and too many gullible walkers. (These are metaphors for marketing for me.)

SEO is not dead. How do you kill looking for a deal, finding a short cut, getting something for nothing? Tough to do. And those cows in the commons. Lots of output. Lots.

Stephen E Arnold, March 6, 2011

Freebie

Search Engine Optimization Discovered in Miami

March 2, 2011

You know a story is big time when it is covered in a two page article in the Miami Herald. Miami, of course, is the capital city of The Islands, as The Nine Nations of North America pointed out years ago.

The point of the story is that search engine optimization experts—trained at conferences partially underwritten by the Web indexing services—have learned how to fool the Web search engines. It’s not nice to fool Mother Nature, but it is perfectly okay to fool the Web indexes. Hey, traffic equals ad revenue and has for years. Now the Miami Herald has discovered “The Dark World of Search Engine Manipulation.” There you go.

According to the article:

Now Google is incorporating recommendations from your social media “friends” to personalize the search results you get. Who authorized Google to help itself to that information? And precisely how will your so-called friends’ opinions alter the rankings you see?  Google is an extraordinary company, and its credo of “do no harm” is impressive. But it’s difficult to think of another private, profit-seeking entity that has ever exercised such vast power over what the world thinks about and pays attention to. That’s a profoundly public function, and with it comes an obligation of accountability that Google has so far bungled.

Will the Miami Herald’s insight have an impact? Does anyone know how social media can be manipulated to spoof relevance?

Nope.

Stephen E Arnold, March 2, 2011

Freebie unlike the ads on the major Web search engines

Capacity Planning in SharePoint Server 2010

March 1, 2011

“Storage and SQL Server Capacity Planning and Configuration (SharePoint Server 2010)” explains how to plan for and configure the storage and Microsoft SQL Server database tier in your Microsoft SharePoint Server 2010 environment. The article states:

“Because SharePoint Server often runs in environments in which databases are managed by separate SQL Server database administrators, this document is intended for joint use by SharePoint Server farm implementers and SQL Server database administrators. It assumes significant understanding of both SharePoint Server and SQL Server.”

With that as a given, the capacity planning is outlined through several steps.  There is a summary of the databases installed with SharePoint Server 2010 and directions for estimating the IOPS.  The article recommends that you run your environment on the Enterprise Edition of SQL Server 2008 or SQL Server 2008 R2.

The write up advises you on choosing a storage architecture, disk types, and RAID types.  There is a table of guidelines to estimate memory requirements and some advice on network topology requirements.  The Configure SQL server section advises that SharePoint Server 2010 was meant to run on several medium-sized servers rather than a couple of large ones. The final point in the article provides general guidance for monitoring the performance of your system.
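As a back-of-the-envelope illustration, here is a sketch of a content database sizing estimate along the lines the article describes. The 10 KB per-item metadata constant and the structure of the formula are assumptions as I recall them; verify the figures against the TechNet article before doing any real planning.

```python
# Rough content database sizing sketch for SharePoint Server 2010 capacity planning.
# The metadata constant and formula shape are assumptions; check the TechNet article.
def estimate_content_db_bytes(num_documents, avg_doc_size_bytes,
                              list_items, versions_kept):
    """Estimate content database size: document versions plus per-item metadata."""
    metadata_per_item = 10 * 1024   # ~10 KB of metadata per item (assumed constant)
    doc_storage = num_documents * versions_kept * avg_doc_size_bytes
    metadata = metadata_per_item * (list_items + versions_kept * num_documents)
    return doc_storage + metadata

# Example: 200,000 documents averaging 250 KB, 600,000 list items, 2 versions kept.
size = estimate_content_db_bytes(200_000, 250 * 1024, 600_000, 2)
print(f"estimated content DB size: {size / 1024**3:.1f} GB")
```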

This article makes me glad that I am not a database administrator.  With the huge volumes of data that are found on SharePoint, it can be difficult enough to wield as a front-end user.  It reminds me that the more data one has, the more important indexing and semantics become for navigating the wealth of information that someone else plans how to store. Keep in mind that Search Technologies can assist you with your SharePoint capacity planning from the perspective of searchability.

Stephen E Arnold, March 1, 2011

For Search Technologies

Is Customer Support a Revenue Winner for Search Vendors?

February 26, 2011

In a word, “Maybe.” Basic search is now widely available at low or no cost.

InQuira has been a player in customer support for a number of years. The big dogs in customer support are outfits like RightNow, Pega, and a backpack full of offshore outfits. In the last couple of weeks, we have snagged news releases that suggest search vendors are in the customer support business.

Two firms have generated somewhat similar news releases. Coveo, based in Canada, was covered in Search CRM in a story titled “2011 Customer Service Trends: The Mobile Revolution.” The passage that caught our attention was:

The most sophisticated level of mobile enablement includes native applications, such as iPhone applications available from Apple’s App Store, which have been tested and approved by the device manufacturer. Not only do these applications offer the highest level of usability, they allow integration with other device applications. For example, Coveo’s mobile interface for the company’s Customer Information Access Solutions allows you to take action on items in a list of search returns, such as reply to an email or add a comment to a Salesforce.com incident. Like any hot technology trend, when investing in mobile enablement it is important to prioritize projects based on potential return on investment, not “cool” factor.

Okay, mobile for customer support.

Then we saw a few days later “Vivisimo Releases New Customer Experience Optimization Solution” in Destination CRM. Originally a vendor of on-the-fly clustering, Vivisimo has become a full service content processing firm specializing in “information optimization.” The passage that caught our attention was:

Vivisimo has begun to address the needs of these customer-facing professionals with the development of its Customer Experience Optimization (CXO) solution, which gives customer service representatives and account managers quick access to all the information about a customer, no matter where that information is housed and managed—inside or outside a company’s systems, and regardless of the source or type. The company’s products are a hybrid of enterprise search, text-based search, and business intelligence solutions. CXO also targets the $1.4 trillion problem of lost worker productivity fueled by employees losing time looking for information. “All content comes through a single search box,” Calderwood says, “which reduces the amount of time to find information.” CXO works with an enterprise search platform that indexes unstructured data, and a display mechanism that uses analytics to find the data. It sits on top of all the systems and applications a company can have—even hosted applications—and pulls data from them all. It can sync up with major systems from Remedy, Siebel, SAP, Oracle, Microsoft, Salesforce.com, and many others.

So, customer support and customer relationship management it is.
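For what the “single search box” pattern amounts to in practice, here is a generic sketch of query fan-out and merging across several back-end systems. The connector names, scores, and results are invented for illustration and have nothing to do with Vivisimo’s actual CXO implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def search_crm(query):        # stand-in for a CRM connector
    return [{"source": "crm", "title": f"ticket about {query}", "score": 0.7}]

def search_email(query):      # stand-in for an email archive connector
    return [{"source": "email", "title": f"thread about {query}", "score": 0.5}]

def search_erp(query):        # stand-in for an ERP/database connector
    return [{"source": "erp", "title": f"order record for {query}", "score": 0.6}]

CONNECTORS = [search_crm, search_email, search_erp]

def federated_search(query):
    """Fan the query out to every connector, then merge into one ranked list."""
    with ThreadPoolExecutor() as pool:
        result_sets = list(pool.map(lambda connector: connector(query), CONNECTORS))
    merged = [hit for results in result_sets for hit in results]
    return sorted(merged, key=lambda hit: hit["score"], reverse=True)

for hit in federated_search("late shipment"):
    print(hit["source"], "-", hit["title"])
```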

Promises are easy to make and sometimes difficult to keep. Source: http://dwellingintheword.wordpress.com/2009/12/29/172-numbers-30-and-31/

I have documented the changes that search and content processing companies have made in the last year. There have been significant executive changes at Lucid Imagination, MarkLogic, and Sinequa. Companies like Attensity and JackBe have shifted from a singular focus on serving a specific business sector to a broader commercial market. Brainware is pushing into document processing and medical information. Recommind has moved from eDiscovery into enterprise search. Palantir, the somewhat interesting visualization and analytics operation, is pushing into financial services, not just government intelligence sectors. There are numerous examples of search vendors looking for revenue love in various market sectors.

So what?

I see four factors influencing search and content processing vendors. I am putting the finishing touches on a “landscape report” in conjunction with Pandia.com about enterprise search. I dipped into the reference material for that study and noted these points:

  1. Repositioning is becoming standard operating procedure for most search and content processing vendors. Even the giants like Google are trying to find ways to lash their indexing technology to words in hopes of increasing revenue. So wordsmithing is the order of the day. Do these firms have technology that will deliver on the repositioned capability? I am not sure, but I have ample evidence that plain old search is now a commodity. Search does not generate too much excitement among some organizations.
  2. The niches themselves that get attention—customer support, marketers interested in social content, and business intelligence—are in flux. The purpose of customer support is to reduce costs, not put me in touch with an expert who can answer my product question. The social content band wagon is speeding along, but it is unclear if “social media” is useful across a wide swath of business types. Consumer products, yes. Specialty metals, not so much.
  3. A “herd” mentality seems to be operating. Search vendors who once chased “one size fits all” buyers now look at niches. The problem is that some niches like eDiscovery and customer support have quite particular requirements. Consultative selling Endeca-style may be needed, but few search vendors have as many MBA types as Endeca and a handful of other firms. Engineers are not so good at MBA style tailoring, but with staff additions, the gap can be closed, just not overnight. Thus, the herd charges into a sector but there may not be enough grazing to feed everyone.
  4. Significant marketing forces are now at work. You have heard of Watson, I presume. When a company like IBM pushes into search and content processing with a consumer assault, other vendors have to differentiate themselves. Google and Microsoft are also marketing their corporate hearts into the 150-beats-per-minute range. That type of noise forces smaller vendors to amp up their efforts. The result is the type of shape shifting that made the liquid metal terminator so fascinating. But that was a motion picture. Selling information retrieval is real life.

I am confident that the smaller vendors of search and content processing will be moving through a repositioning cycle. The problem for some firms is that their technology is, at the end of the day, roughly equivalent to Lucene/Solr. This means that unless higher value solutions can be delivered, an open source solution may be good enough. Simply saying that a search and retrieval system can deliver eDiscovery, customer support, business intelligence, medical fraud detection, or knowledge management may not be enough to generate needed revenue.

In fact, I think the hunt for revenue is driving the repositioning. Basic search has crumbled away as a money maker. But key word retrieval backed with some categorization is not what makes a customer support solution or one of the other “search positioning plays” work. Each of these niches has specific needs and incumbents who are going to fight back.

Enterprise search and its many variants remains a fascinating niche to monitor. There are changes afoot which are likely to make the known outfits sweat bullets in an effort to find a way to break through the revenue ceilings that seem to be imposed on many vendors of information retrieval technology. Even Google has a challenge, and it has lots of money and smart people. If Google can’t get off its one trick pony, what’s that imply for search vendors with fewer resources?

It is easy to say one has a solution. It is quite another to deliver that solution to an organization with a very real, very large, and very significant problem.

Stephen E Arnold, February 26, 2011

dtSearch Dot Net

February 23, 2011

TechWhack posted a press release titled “Announcing New dtSearch® Product Line Release with Native .NET 4 / 64-Bit SDK.” dtSearch is a leading supplier of enterprise and developer text retrieval and file conversion software. The company has released version 7.66 of its product line; the star feature is a native 64-bit .NET 4 SDK for its search engine:

“The .NET 4 SDK covers the Spider API for indexing local and remote, static and dynamic web-based content, encompassing both public Internet and secure Intranet data. The .NET 4 release also has a sample application for the Microsoft Azure cloud platform. And the new SDK offers performance enhancements for faceted searching involving millions of document metadata tags or database records.”

Other features include:

  • A terabyte indexer: products can index over a terabyte of text in a single index.
  • Cloud applications: .NET 4 code for use with the Microsoft Azure cloud platform.
  • The Spider: an application that adds Web sites to a data collection and can span multiple integrated sources.
  • Built-in proprietary file parsers and converters that cover a wide range of file types.
  • More than twenty-five search options and foreign language support, including federated search, special forensics features, full-text and fielded data search, and Unicode support for right-to-left languages.
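As a generic illustration of faceted searching over document metadata tags, here is a toy sketch. It is written in Python rather than the .NET SDK, and the field names and documents are invented; it shows the idea behind faceting, not dtSearch’s API.

```python
from collections import Counter

documents = [
    {"title": "Q4 report", "filetype": "pdf", "language": "en", "year": 2010},
    {"title": "Budget",    "filetype": "xls", "language": "en", "year": 2011},
    {"title": "Contrat",   "filetype": "doc", "language": "fr", "year": 2011},
]

def facet_counts(docs, fields):
    """Count how many matching documents carry each metadata value, per facet."""
    return {field: Counter(d[field] for d in docs if field in d) for field in fields}

hits = [d for d in documents if d["year"] == 2011]   # pretend this is the result set
print(facet_counts(hits, ["filetype", "language"]))
# -> {'filetype': Counter({'xls': 1, 'doc': 1}), 'language': Counter({'en': 1, 'fr': 1})}
```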

Whitney Grace, February 23, 2011

The Wages of SEO Sin

February 13, 2011

So Google can be fooled. It’s not nice to fool Mother Google. The inverse, however, is not accurate. Mother Google can take some liberties. Any indexing system can. Objectivity is in the eye of the beholder or the person who pays for results.

Judging from the torrent of posts from “experts,” the big guns of search are saying, “We told you so.” The trigger for this outburst of criticism is the New York Times’s write up about JC Penney. You can try this link, but I expect that it and its SEO crunchy headline will go dark shortly. (Yep, the NYT is in the SEO game too.)

Everyone from AOL news to blog-o-rama wizards are reviling Google for not figuring out how to stop folks from gaming the system. Sigh.

I am not sure how many years ago I wrote the “search sucks” article for Searcher Magazine. My position was clear long before the JC Penney affair and the slowly growing awareness that search is anything BUT objective.

Source: http://www.brianjamesnyc.com/blog/?p=157

In the good old days, database bias was set forth in the editorial policies for online files. You could disagree with what we selected for ABI/INFORM, but we made an effort to explain what we selected, why we selected certain items for the file, and how the decision affected assignment of index terms and classification codes. The point was that we were explaining the mechanism for making a database which we hoped would be useful. We were successful, and we tried to avoid the silliness of claiming comprehensive coverage. We had an editorial policy, and we shaped our work to that policy. Most people in 1980 did not know much about online. I am willing to risk this statement: I don’t think too many people in 2011 know about online and Web indexing. In the absence of knowledge, some remarkable actions occur.

You don’t know what you don’t know or the unknown unknowns. Source: http://dealbreaker.com/donald-rumsfeld/

Flash forward to the Web. Most users assume incorrectly that a search engine is objective. Baloney. Just as we set an editorial policy for ABI/INFORM, each crawler and content processing system has similar decisions beneath it.

The difference is that at ABI/INFORM we explained our bias. The modern Web and enterprise search engines don’t. If a system tries to explain what it does, most of the failed Web masters, English majors working as consultants, and unemployed lawyers turned search experts just don’t care.

Search and content processing are complicated businesses, and the gory details about certain issues are of zero interest to most professionals. Here’s a quick list of “decisions” that must be made for a basic search engine (a small configuration sketch follows the list):

  • How deep will we crawl? Most engines set a limit. No one, not even Google, has the time or money to follow every link.
  • How frequently will we update? Most search engines have to allocate resources in order to get a reasonable index refresh. Sites that get zero traffic don’t get updated too often. Sites that are sprawling and deep may get three or four levels of indexing. The rest? Forget it.
  • What will we index? Most people perceive the various Web search systems as indexing the entire Web. Baloney. Bing.com makes decisions about what to index and when, and I find that it favors certain verticals and trendy topics. Google does a bit better, but there are bluebirds, canaries, and sparrows. Bluebirds get indexed thoroughly and frequently. See Google News for an example. For Google’s Uncle Sam, a different schedule applies. In between, there are lots of sites and lots of factors at play, not the least of which is money.
  • What is on the stop list? Yep, a list can kill index pointers, making the site invisible.
  • When will we revisit a site with slow response time?
  • What actions do we take when a site is owned by a key stakeholder?
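To make the point concrete, here is a hedged sketch of what those decisions look like once they become configuration. The site tiers, depths, and refresh intervals are invented for illustration, not any engine’s actual settings; the point is that each number is an editorial choice, not a law of nature.

```python
from dataclasses import dataclass

@dataclass
class CrawlPolicy:
    max_depth: int                  # how deep to follow links from the seed page
    refresh_hours: int              # how often to revisit for re-indexing
    stop_listed: bool               # on the stop list: do not index at all
    slow_site_retry_days: int = 7   # when to come back to a site that times out

POLICIES = {
    "high-traffic news": CrawlPolicy(max_depth=10, refresh_hours=1,   stop_listed=False),
    "ordinary site":     CrawlPolicy(max_depth=3,  refresh_hours=168, stop_listed=False),
    "zero-traffic site": CrawlPolicy(max_depth=1,  refresh_hours=720, stop_listed=False),
    "blocked site":      CrawlPolicy(max_depth=0,  refresh_hours=0,   stop_listed=True),
}

def effective_depth(site_tier):
    """Look up how much of a site gets indexed; zero means the site is invisible."""
    policy = POLICIES.get(site_tier, POLICIES["ordinary site"])
    return 0 if policy.stop_listed else policy.max_depth

print(effective_depth("high-traffic news"))   # 10
print(effective_depth("blocked site"))        # 0: the site never appears in results
```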
