Google Search and Ads: Lawsuit Magnets
February 19, 2009
The Washington Post reposted a Sarah Lacy TechCrunch article called “The New Bulls-Eye on Google” here. The hook of the article, for me, was this passage:
Even before he [President Obama] took office, the Justice Department said it would block Google’s proposed ad search deal with Yahoo, citing the company’s more than 70% market share in search advertising. And now, things look worse. Christine A. Varney has been nominated as Obama’s choice to head the Justice Department’s antitrust division. The same Christine A. Varney who several months ago said, ‘For me, Microsoft is so last century. They are not the problem.’ The new threat? Google, a company that Varney said has ‘acquired a monopoly in Internet online advertising.’ Would that Google-DoubleClick deal have slid through under Varney? Doubtful.
In my opinion Google, like Microsoft, will not be treated like the dry cleaning company in Harrod’s Creek. Litigation is going to play an important part in Google’s life. Google has money. The economy is bad. Winning a settlement makes perfect sense if the risks are acceptable. Law is not fair to addled geese. Google is no goose, and it will have its paws full with legal squabbles for the foreseeable future.
Stephen Arnold, February 19, 2009
Keeping Internet Transparency Clear
February 19, 2009
Lauren Weinstein of Vortex Technology, http://www.vortex.com/, just announced at http://lauren.vortex.com/archive/000506.html a set of forums he is hosting at http://forums.gctip.org/ called GCTIP–the Global Coalition for Transparent Internet Performance. It is meant to address Internet transparency, performance, and ISP issues. The project grew out of a network measurement workshop sponsored by “Father of the Internet” Vint Cerf and Google, whose agenda Weinstein helped organize. Weinstein’s point: It’s impossible to know whether we’re getting enough bang for our buck from the Internet unless we have hard facts, so set up measurement tests. My point: Not only is this already being done, but how would you ever get definitive results, and how will an info dump help? Am I oversimplifying? Comments?
Jessica Bratcher, February 19, 2009
Mysteries of Online 7: Errors, Quality, and Provenance
February 19, 2009
This installment of “Mysteries of Online” tackles a boring subject that means little or nothing to the entitlement generation. I have recycled information from one of my talks in 1998, but some of the ideas may be relevant today. First, let’s define the terms:
- Errors–Something does not work. Information may be wildly inaccurate but the user may not perceive this problem. An error is a browser that crashes, a page that doesn’t render, a Flash that fails. This notion of an error is very important in decision making. A Web site that delivers erroneous information may be perceived as “right” or “good enough”. Pretty exciting consequences result from this notion of an “error” in my experience.
- Quality–Content displayed on a Web page is consistent. The regularity of the presentation of information, the handling of company names in a standard way, and the tidy rows and columns with appropriate values become “quality” output in an online experience. The notions of errors and quality combine to create a belief among some that if the data come from the computer, then those data are right, accurate, reliable.
- Provenance–This is the notion of knowing where an item came from. In the electronic world, I find it difficult to figure out where information originates. The Washington Post reprints a TechCrunch article from a writer who has some nerve ganglia embedded in the companies about which she writes. Is this provenance enough, or do we need the equivalent of a PhD from Oxford University and a peer reviewed document? In my experience, few users of online information know, or know how to think about, the provenance of the information on a Web page or in a search results list. Pay for placement adds spice to provenance in my opinion.
So What?
A gap exists between individuals who want to know whether information is accurate and can be substantiated from multiple sources and those who take what’s on offer. Consider this Web log post. If someone reads it, will that individual poke around to find out about my background, my published work, and my history? In my experience, I see a number of comments that say, “Who do you think you are? You are not qualified to comment on X or Y.” I may be an addled goose, but some of the information recycled for this Web log is more accurate than what appears in some high profile publications. A recent example was a journalist’s report that Google’s government sales were about $4,000, down from a couple of hundred thousand dollars. The facts were wrong, and when I checked back on that story, I found that no one had pointed out the mistake. A single GB 7007 can hit $250,000 without much effort. It doesn’t take many Google Search Appliance sales to beat $4,000 a year in revenue from Uncle Sam.
The point is that most users:
- Lack the motivation or expertise to find out whether an assertion or a fact is correct or incorrect. In my opinion, few people care much about the dull stuff–chasing facts. Even when I chase facts, I can make an error. I try to correct those I can. What makes me nervous are those individuals who don’t care whether information is on target.
- Do not see research as a core competency. Research is difficult and a thankless task. Many people tell me that they have no time to do research. I received an email from a person asking me how I could post to this Web log every day. Answer: I have help. Most of those assisting me are very good researchers. Individuals with solid research skills do not depend solely upon the Web indexes. When was the last time your colleague did research among sources other than those identified in a Web index?
- Get confused by too many results. Most users look at the first page of search results. Fewer than five percent of online users make use of advanced search functions. Google, based on my research, takes a “good enough” approach to its search results. When Google needs “real” research, the company hires professionals. Why? Good enough is not always good enough. Simplified searching and finding becomes a habit. Lazy people use Web search because it is easy. Remember: research is difficult.

Stephen Arnold, February 19, 2009
Google: Warning Bells Clanging
February 19, 2009
Henry Blodget wrote “Yahoo Search Share Rises Again… And Google Falls” here. The hook for the story is comScore tracking data showing that Google’s share of the Web search market “dropped a half point to 63%.” Mr. Blodget added quite correctly, “You don’t see that every day.” Mr. Blodget also flags Yahoo’s increase in search share, which jumped to 21%. Yahoo has made gains in share for the last five months. Congratulations to Yahoo.
Several comments:
- Data about Web search share is often questionable.
- Think back to your first day in statistics. Remember margin of error? When you have questionable data, a narrow gain or loss, and a data gathering system based on some pretty interesting collection methods, what do you get? You get Jello data. (A rough margin-of-error sketch appears after this list.)
- The actual Web log data for outfits like Google and Yahoo often tell the company employees a different story. How different? I was lucky enough last year to see some data that revealed Google’s share of the Web search market north of 80 percent in the US. So which data are correct? The point is that sampled data about Web search usage is wide of the actual data by 10 to 15 percent or more.
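To make the margin-of-error point concrete, here is a minimal sketch of the standard 95 percent confidence calculation for a sampled share. The panel size and the share figures are illustrative assumptions, not comScore's actual panel or methodology.

```python
# Rough sketch: margin of error for a sampled share estimate at 95% confidence.
# The sample size below is a hypothetical assumption, not comScore's panel.
import math

def margin_of_error(share, sample_size, z=1.96):
    """Half-width of the 95% confidence interval for a sampled proportion."""
    return z * math.sqrt(share * (1 - share) / sample_size)

if __name__ == "__main__":
    sample_size = 50_000  # assumed panel size for illustration only
    for label, share in [("Google", 0.63), ("Yahoo", 0.21)]:
        moe = margin_of_error(share, sample_size)
        # with this assumed panel, a half-point month-to-month move sits
        # close to the margin of error -- hence "Jello data"
        print(f"{label}: {share:.0%} +/- {moe:.2%}")
```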
Is Google in trouble? Not as much trouble as Yahoo. Assume the data are correct. The spread between Yahoo and Google is about 40 percent. Alarmism boosts traffic more easily than Yahoo can boost its share of the Web search market in my opinion.
Stephen Arnold, February 18, 2009
Alacra Raises Its Pulse
February 19, 2009
Alacra Inc., http://www.alacra.com/, released their Pulse Platform today. Along the lines of Beyond Search’s own Overflight service, Alacra’s Pulse finds, filters, and packages Web-based content by combining semantic analysis and an existing knowledge base to target and analyze more than 2,000 hand-selected feeds. The first platform app is Street Pulse, which carves out “what do key opinion leaders have to say…” about a given company. A press release says it “integrates comments from sell-side, credit analysts, industry analysts and a carefully vetted list of influential bloggers.” It’s offered free at http://pulse.alacra.com. There’s also a clean, licensed professional version with bells and whistles like e-mail alerts. More apps will follow that may focus on the hush-hush company droppings that everyone loves to muck through. Alacra’s jingle is “Aggregate, Integrate, Package, Deliver.” Since so many people are already wading through info dumps, we see this service growing into a critical search resource. We’re certainly on board the bandwagon and will be watching for further developments.
Jessica W. Bratcher, February 19, 2009
Semantics in Firefox
February 19, 2009
Now available: the wonders of semantic search, plugged right into your Mozilla Firefox browser. headup started in closed testing but is now a public beta downloadable from http://www.headup.com or http://addons.mozilla.org. You do have to register for it because Firefox lists it as “experimental,” but the reviews at https://addons.mozilla.org/en-US/firefox/reviews/display/10359 are glowing. A product of SemantiNet, this plugin is touted to enable “true semantic capabilities” for the first time within any Web page. headup’s engine extracts customized info based on its user and roots out additional data of interest from across the Web, including social media sites like Facebook and Twitter. Looks like this add-on is a step in the right direction toward bringing the Semantic Web down to earth. Check it out and let us know what you think.
Jessica Bratcher, February 19, 2009
Mahalo: SEO Excitement
February 18, 2009
If a Web site is not in Google, the Web site does not exist. I first said this in 2004 in a client briefing before my monograph The Google Legacy was published by Infonortics Ltd. The trophy MBAs laughed and gave me the Ivy-draped dismissal that sets some Wall Street wizards (now former wizards) apart. The reality then was that other online indexing services were looking at what Google favored and working from Google’s sparse comments about how it determined a Web site’s score. I had tracked down some of the components of the PageRank algorithm from various open source documents, and I was explaining the “Google method” as my research revealed it. I had a partial picture, but it was clear that Google had cracked the problem of making the first six or seven hits in a result list useful to a large number of people using the Internet. My example was the query “spears”. Did you get Macedonian spears or links to aboriginal weapons? Nope. Google delivered the pop sensation Britney Spears. Meant zero to me, but with Google’s surging share of the Web search market at that time, Google had hit a solid triple.
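For readers who have not dug through those open source documents, here is a minimal sketch of the published PageRank formulation, the damped link-voting iteration from the Brin and Page papers. The toy link graph, damping factor, and iteration count are assumptions for illustration; Google's production scoring uses far more signals than this.

```python
# Minimal sketch of the textbook PageRank iteration -- not Google's production
# system. Graph, damping factor, and iteration count are illustrative.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, outlinks in links.items():
            if not outlinks:  # dangling page: spread its rank evenly
                share = damping * rank[page] / len(pages)
                for p in pages:
                    new_rank[p] += share
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

if __name__ == "__main__":
    toy_web = {
        "home": ["about", "blog"],
        "about": ["home"],
        "blog": ["home", "about"],
    }
    for page, score in sorted(pagerank(toy_web).items(), key=lambda x: -x[1]):
        print(f"{page}: {score:.3f}")
```

The takeaway for the SEO discussion below: rank flows along links, so schemes that funnel links exist precisely to game this kind of calculation.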
The SEO (search engine optimization) carpetbaggers sensed a pot of gold at the end of desperate Web site owners’ ignorance. SEO provides advice and some technical services to boost a Web page’s or a Web site’s appeal to the Google PageRank method. Over the intervening four or five years, a big business has exploded to help a clueless marketing manager justify the money pumped into a Web site. Most Web sites get minimal traffic. Violate one of Google’s precepts, and the Web site can disappear from the first page or two of Google results. Do something really crazy like BMW’s spamming or Ricoh’s trickery and Googzilla removes the offenders from the Google index. In effect, the Web site disappears. This was bad a couple of years ago, but today, it is the kiss of death.
I received a call from a software company that played fast and loose with SEO. The Web site disappeared into the depths of the Google result list for my test queries. The aggrieved vice president (confident of his expertise in indexing and content) wanted to know how to get back in the Google index and then to the top of the results. My answer then is the same as it is today, “Follow the Google Webmaster guidelines and create compelling content that is frequently updated.”
Bummer.
I was fascinated with “Mahalo Caught Spamming Google with PageRank Funneling Link Scheme” here. The focal point of the story is that Mahalo, a company founded by Jason Calacanis, former journalist, allegedly “was caught ranking pages without any original content–in clear violation of Google’s guidelines.” The article continued:
And now he has taken his spam strategy one step further, by creating a widget that bloggers can embed on their blogs.
You can read the Web log post and explore the links. You can try to use the referenced widget. Have at it. Furthermore, I don’t know if this assertion is 100 percent accurate. In fact, I am not sure I care. Whether it happened in reality or only as a thought experiment, I see this type of activity as reprehensible. Here’s why:
- This gaming of Google and other Web indexing systems costs the indexing companies money. Engineers have to react to the tricks of the SEO carpetbaggers. The SEO carpetbaggers then try to find another way to fool the Web indexing system’s relevance ranking method. A thermonuclear war ensues, and the cost of this improper behavior sucks money from other needed engineering activities.
- The notion that a Web site will generate traffic and pay for itself is a fantasy. It was crazy in 1993 when Chris Kitze and I started work on The Point (Top 5% of the Internet), which was quite similar to some of the Mahalo elements. There was no way to trick Lycos or Harvest because it was a verifiable miracle if those systems could update their indexes and handle queries with an increasing load and what is now old-fashioned, inefficient plumbing. Somehow a restaurant in Louisville, Kentucky, or a custom boat builder in Arizona thinks a Web site will automatically appear when a user types “catering” or “custom boat” in a Google search box. Most sites get minimal traffic, and some may be indexed on a cycle ranging from several days to months. Furthermore, some sites are set up in such a wacky way that the indexing systems may not try to index the full site. The problem is not SEO; the problem is a lack of information about what’s involved in crafting a site that works.
- Content on most Web sites is not very good. I look at my ArnoldIT.com Web site and see a dumping ground for old stuff. We index the content using the Blossom search system so I can find something I wrote in 1990, but I would be stunned if I ran a query for “online database” and saw a link to one of my essays. We digitized some of the older stuff, but no one–I repeat–no one looks at the old content. The action goes to the fresh content on the Web log. The “traditional” Web site is a loser except for archival and historical uses.
The fact that a company like Mahalo allegedly gamed Google is not the issue. The culture of cheating and the cult of the SEO carpetbagger make this type of behavior acceptable. I get snippy little notes from those who bilk money from companies that want to make use of online but don’t know the recipe. The SEO carpetbaggers sell catnip. What these companies need is boring, dull, and substantial intellectual protein.
Google, Microsoft, and Yahoo are somewhat guilty. These companies need content to index. The SEO craziness is a cost of doing business. If a Web site gets some traffic when new, that’s by design. Over time, the Web site will drift down. If the trophy generation Webmaster doesn’t know about content and freshness, the Web indexing companies will sell traffic.
There is no fix. The system is broken. The SEO crowds pay big money to learn how to trick Google and other Web indexing companies. Then the Web indexing companies sell traffic when Web sites don’t appear in a Google results list.
So what’s the fix? Here are some suggestions:
- A Web site is a software program. Like any software, a plan, a design, and a method are needed. This takes work, which is reprehensible to some. Nevertheless, most of the broken Web sites cannot be cosmeticized. Some content management systems generate broken Web sites as seen by a Web indexing system. Fix: when possible, start over and do the fundamentals.
- Content has a short half life. Here’s what this means. If you post a story once a month, your Web site will be essentially invisible even if you are a Fortune 50 company. Exceptions occur when an obscure Web site breaks a story that is picked up and expanded by many other Web sites. Fix: write compelling content daily or better yet more frequently.
- Indexing has to be useful to humans and to content processing systems. Stuffing meaningless words into a metatag is silly and counterproductive. Hiding text by tinting it to match a page’s background color is dumb. Fix: find a librarian or, better yet, take a class in indexing. Select meaningful terms that describe the content or the page accurately. The more specialized your terminology, the narrower the lens. The broader the term, the wider the lens. Broad terms like “financial services” are almost useless, since the bound phrase is devalued. Try some queries looking for a financial services firm in a mid-sized city. Tough to do unless you get a hit in http://local.google.com or just look up the company in a local business publication or ask a friend. (A rough illustration of the metatag point appears after this list.)
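To illustrate the metatag point, here is a crude sketch that pulls the keywords metatag from a page and flags obvious stuffing. The sample HTML, the thresholds, and the heuristic are assumptions for illustration; the real index-quality checks the engines run are far more sophisticated.

```python
# Illustrative only: a crude keyword-stuffing check on <meta name="keywords">.
# Thresholds and sample HTML are assumptions, not any engine's actual rules.
from collections import Counter
from html.parser import HTMLParser

class MetaKeywordParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "keywords":
            content = attrs.get("content", "")
            self.keywords += [k.strip().lower() for k in content.split(",") if k.strip()]

def looks_stuffed(keywords, max_terms=15, max_repeats=2):
    """Flag pages with too many keywords or the same term repeated over and over."""
    counts = Counter(keywords)
    return len(keywords) > max_terms or any(c > max_repeats for c in counts.values())

if __name__ == "__main__":
    page = ('<html><head><meta name="keywords" '
            'content="boats, boats, boats, custom boats, cheap boats"></head></html>')
    parser = MetaKeywordParser()
    parser.feed(page)
    print(parser.keywords, "stuffed?", looks_stuffed(parser.keywords))
```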
As for Mahalo, who cares? The notion of user generated links by a subject matter expert worked in 1993. The method has been replaced by http://search.twitter.com or asking a friend on Facebook.com. Desperate measures are needed when traffic goes nowhere. Just don’t get caught is the catchphrase in my opinion.
Stephen Arnold, February 18, 2009
Twitter and Search
February 18, 2009
I read Peter Hershberg’s “Does Twitter Represent the Future of Search? Or Is It the Other Way Around?” here. The article begins with a reference to search engine optimization guru Danny Sullivan and then races forward with this argument:
people are increasingly turning to Twitter — rather than Google and Yahoo — when looking for information on breaking news. This is a trend we highlighted in our 2009 predictions post at the end of last year. For proof of Twitter’s real-time search capabilities all you need to do is look back at last week’s plane crash in the Hudson to see where the news initially broke. People were talking about the event for several minutes on Twitter before the first mentions of it on Google News or any major media site, for that matter.
For me, the most interesting comment in the article was:
My personal view is that Google and Yahoo haven’t come up with Twitter solutions simply because they did not initially understand what Twitter represents from a search perspective. Twitter themselves may have failed to grasp this initially, before Summize came into the mix. It’s unlikely that either Google or Yahoo saw Twitter’s potential as a search engine. So, it’s only now that they’re probably starting to put adequate resources behind developing a strategy in this area, though I have to believe that it’s become a very high priority, particularly for Google. That’s where this issue gets really interesting – particularly for someone like me who views social media through the lens of search.
The wrap up made a good point:
To this point, the “Twitterverse” has pretty much been living in a bubble – one where all updates are made and consumed within Twitter and its associated applications alone and where some believe that having 10,000 followers means that you are an authoritative or influential figure. While I believe that is, in fact, the case for some (and I won’t diminish the value in having a large following), the volume of traffic some individual Twitter updates will receive from organic search will dwarf what they are typically able to generate from Twitter alone. It also means that Twitter accounts with fewer followers – but with something important to say on a given topic – will start to see some increased attention as well. Much like many of the early bloggers did. And when that happens, the whole question of influence and authority will once again be turned on its head.
As I thought about this good write up, I formulated several questions:
- Will Google’s play be to provide a dataspace in which Twitter comments and other social data are organized, indexed and made useful?
- In a Twitterspace, will new types of queries become essential; for example, provenance and confidence?
- Will Google, like Microsoft, be unable to react to the opportunity of real time search and spend time and money trying to catch up with a train that has left the station?
I have no answers. Twitter is making real time search an important tool for users who have no need for the dinosaur caves of archived data that Google continues to build.
Stephen Arnold, February 18, 2009
Yahoo and Its New Mobile Service
February 18, 2009
Yahoo News posted “Yahoo Mobile Aims to Channel Your Inner iPhone” here. Yahoo access on my various mobile devices seemed to require quite a bit of menu shuffling. I also found the interface’s refusal to remember my log in name somewhat idiosyncratic. But the system worked. The new service as described in the news story seemed to me to be a giant step forward. The news release said:
Yahoo Mobile will be released in three versions — one for the mobile Web, one for the iPhone, and one for other smartphones… Yahoo’s onePlace is also available in all three editions. The service lets a user access and manage, from a single location, favorite content such as news topics and sources, RSS feeds, sports scores, weather conditions, stock quotes, blogs, movie theaters, or horoscopes… In the smartphone version, users can also use oneSearch’s voice-search feature simply by talking. It also offers maps; an integrated mini-version of the popular mobile Web browser Opera; and widgets, which are small applications that provide various services that can be mixed and matched.
I fired up my smartphone and navigated to Yahoo, following the same steps I had used prior to my test on February 17, 2009, at 5 pm Eastern. Instead of a new Yahoo service or the old Yahoo service, here’s what I saw:
Sigh. I understand that new Yahoo is not available, but what about old Yahoo?
Stephen Arnold, February 18, 2009
Exclusive Interview with Kathleen Dahlgren, Cognition Technologies
February 18, 2009
Cognition Technologies’ Kathleen Dahlgren spoke with Harry Collier about her firm’s search and content processing system. Cognition’s core technology, Cognition’s Semantic NLP™, is the outgrowth of ideas and development work which began over 23 years ago at IBM, where Cognition’s founder and CTO, Kathleen Dahlgren, Ph.D., led a research team to create the first prototype of a “natural language understanding system.” In 1990, Dr. Dahlgren left IBM and formed a new company called Intelligent Text Processing (ITP). ITP applied for and won an innovative research grant from the Small Business Administration. This funding enabled the company to develop a commercial prototype of what would become Cognition’s Semantic NLP. That work won a Small Business Innovation Research (SBIR) award for excellence in 1995. In 1998, ITP was awarded a patent on a component of the technology.
Dr. Dahlgren is one of the featured speakers at the Boston Search Engine Meeting. This conference is the world’s leading venue for substantive discussions about search, content processing, and semantic technology. Attendees have an opportunity to hear talks by recognized leaders in information retrieval and then speak with these individuals, ask questions, and engage in conversations with other attendees. You can get more information about the Boston Search Engine Meeting here.
The full text of Mr. Collier’s interview with Dr. Dahlgren, conducted on February 13, 2009, appears below:
Will you describe briefly your company and its search / content processing technology?
CognitionSearch uses linguistic science to analyze language and provide meaning-based search. Cognition has built the largest semantic map of English, with morphology (word stems such as catch-caught, baby-babies, communication, intercommunication), word senses (“strike” meaning hit, “strike” as a state in baseball, etc.), synonymy (“strike” meaning hit, “beat” meaning hit, etc.), hyponymy (“vehicle”-“motor vehicle”-“car”-“Ford”), meaning contexts (“strike” means a game state in the context of “baseball”), and phrases (“bok-choy”). The semantic map enables CognitionSearch to unravel the meaning of text and queries, with the result that search performs with over 90% precision and 90% recall.
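For readers who want a hands-on feel for the kinds of relations Dr. Dahlgren describes, here is a short sketch against WordNet, a public lexical database, via the NLTK library. This is only an analogy: WordNet is not Cognition's semantic map, and the calls below merely illustrate word senses, synonymy, hyponymy, and morphology in a generic resource.

```python
# Illustrative only: WordNet stands in for the kinds of lexical relations
# described above; it is not Cognition's proprietary semantic map.
# Setup assumption: pip install nltk, then run nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

# Word senses: "strike" has many, including the hit sense and the baseball sense
for synset in wn.synsets("strike")[:5]:
    print(synset.name(), "-", synset.definition())

# Synonymy: lemmas that share one "hit" sense
print(wn.synset("hit.v.01").lemma_names())

# Hyponymy: more specific concepts under "vehicle"
print([h.name() for h in wn.synset("vehicle.n.01").hyponyms()][:5])

# Morphology: reduce an inflected form to its stem
print(wn.morphy("caught", wn.VERB))  # -> catch
```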
What are the three major challenges you see in search / content processing in 2009?
That’s a good question. The three challenges in my opinion are:
- Too much irrelevant material retrieved – poor precision
- Too much relevant material missed – poor recall
- Getting users to adopt new ways of searching that are available with advanced search technologies. NLP semantic search lets users state longer queries in plain English and get results, but users are accustomed to keywords, so some adaptation will be required to take advantage of the new technology.
With search / content processing decades old, what have been the principal barriers to resolving these challenges in the past?
Poor precision and poor recall are due to the use of pattern-matching and statistical search software. As long as meaning is not recovered, the current search engines will produce mostly irrelevant material. Statistics on popularity boost many of the relevant results to the top, but as a measure across all retrievals, precision is under 30%. Poor recall means that sometimes there are no relevant hits, even though there may be many hits. This is because the alternative ways of expressing the user’s intended meaning in the query are not understood by the search engine. If engines add synonyms without first determining meaning, recall can improve, but at the expense of extremely poor precision. This is because all the synonyms of an ambiguous word, in all of its meanings, are used as search terms. Most of these are off target. While the ambiguous words in a language are relatively few, they are among the most frequent words. For example, the seventeen thousand most frequent words of English tend to be ambiguous.
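Since the answer leans on precision and recall, here is a minimal sketch of the two measures. The toy result lists are assumptions for illustration; the figures quoted above come from Dr. Dahlgren, not from this code.

```python
# Minimal sketch of precision and recall with an assumed toy result set.

def precision_recall(retrieved, relevant):
    """precision = relevant hits / all hits; recall = relevant hits / all relevant docs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

if __name__ == "__main__":
    retrieved = [f"doc{i}" for i in range(1, 11)]      # ten hits returned
    relevant = ["doc2", "doc7", "doc11", "doc12"]      # four truly relevant docs
    p, r = precision_recall(retrieved, relevant)
    print(f"precision={p:.0%} recall={r:.0%}")         # 20% precision, 50% recall
```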
What is your approach to problem solving in search and content processing?
Cognition focuses on improving search by improving the underlying software and making it mimic human linguistic reasoning in many respects. CognitionSearch first determines the meanings of words in context and then searches on the particular meanings of search terms, their synonyms (also disambiguated) and hyponyms (more specific word meanings in a concept hierarchy or ontology). For example, given a search for “mental disease in kids”, CognitionSearch first determines that “mental disease” is a phrase, and synonymous with an ontological node, that “kids” has stem “kid”, and that it means “human child”, not a type of “goat”. It then finds documents with sentences having “mental disease” or “OCD” or “obsessive compulsive disorder” or “schizophrenia”, etc. and “kid” (meaning human child) or “child” (meaning human child) or “young person” or “toddler”, etc.
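As a rough illustration of the disambiguate-then-expand step described in this answer, here is a toy sketch. The mini semantic map, the sense labels, and the expansions are invented for illustration; they are not Cognition's actual data or algorithm.

```python
# Toy sketch only: an invented sense inventory, not Cognition's semantic map.
SEMANTIC_MAP = {
    ("kid", "human child"): ["child", "young person", "toddler"],
    ("kid", "young goat"): ["goatling"],
    ("mental disease", "illness"): ["mental illness", "OCD",
                                    "obsessive compulsive disorder", "schizophrenia"],
}

def expand(term, sense):
    """Return the term plus synonyms/hyponyms for the chosen sense."""
    return [term] + SEMANTIC_MAP.get((term, sense), [])

# Assume sense disambiguation has already decided that "kid" means the human
# child in the context of "mental disease". The query "mental disease in kids"
# then becomes two expanded term groups, ANDed together at search time.
query_groups = [expand("mental disease", "illness"), expand("kid", "human child")]
print(query_groups)
```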
Multi core processors provide significant performance boosts. But search / content processing often faces bottlenecks and latency in indexing and query processing. What’s your view on the performance of your system or systems with which you are familiar?
Natural language processing systems have been notoriously challenged by scalability. Recent massive upgrades in computer power have now made NLP a possibility in Web search. CognitionSearch has sub-second response time and is fully distributed to as many processors as desired for both indexing and search. Distribution is one solution to scalability. Another, which CognitionSearch implements, is to compile all reasoning into the index, so that any delays caused by reasoning are not experienced by the end user.
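Here is a minimal sketch of the general idea of compiling reasoning into the index: do the expensive analysis once at indexing time and store concept labels, so a query becomes a cheap lookup. The analyze() rules and the documents are assumptions for illustration, not Cognition's implementation.

```python
# Sketch under stated assumptions: the concept table below is invented.
from collections import defaultdict

def analyze(text):
    """Stand-in for the NLP step: map surface words to concept labels."""
    concepts = {"kid": "human_child", "child": "human_child", "children": "human_child",
                "schizophrenia": "mental_illness", "ocd": "mental_illness"}
    return {concepts.get(w.strip(".,").lower(), w.strip(".,").lower())
            for w in text.split()}

index = defaultdict(set)

def add_document(doc_id, text):
    # the reasoning happens here, once, at indexing time
    for concept in analyze(text):
        index[concept].add(doc_id)

def search(query):
    # query time is just an intersection of posting lists
    wanted = analyze(query)
    return set.intersection(*(index[c] for c in wanted)) if wanted else set()

add_document("d1", "Schizophrenia rates in children")
add_document("d2", "My kid has OCD")
print(search("kid schizophrenia"))  # both documents match the two concepts
```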
Google has disrupted certain enterprise search markets with its appliance solution. The Google brand creates the idea in the minds of some procurement teams and purchasing agents that Google is the only or preferred search solution. What can a vendor do to adapt to this Google effect? Is Google a significant player in enterprise search, or is Google a minor player?
Google’s search appliance highlights the weakness of popularity-based searching. On the Web, with Google’s vast history of searches, popularity is effective in positioning the more desired sites at the top of the relevance ranking. Inside the enterprise, popularity is ineffective and Google performs as a plain pattern-matcher. Competitive vendors need to explain this to clients, and even show them head-to-head comparisons of search with Google and search with their software on the same data. Google brand allegiance is a barrier to sales in enterprise search.
Information governance is gaining importance. Search / content processing is becoming part of eDiscovery or internal audit procedures. What’s your view of the role of search / content processing technology in these specialized sectors?
Intelligent search in eDiscovery can dig up the “smoking gun” of violations within an organization. For example, in the recent mortgage crisis, buyers were lent money without proper proof of income. Terms for this were “stated income only”, “liar loan”, “no-doc loan”, “low-documentation loan”. In eDiscovery, intelligent search such as CognitionSearch would find all mentions of that concept, regardless of the way it was expressed in documents and email. Full exhaustiveness in search empowers lawyers analyzing discovery documents to find absolutely everything that is relevant or responsive. Likewise, intelligent search empowers corporate oversight personnel, and corporate staff in general, to find the desired information without being inundated with irrelevant hits (retrievals). Dedicated systems for eDiscovery and corporate search need only house the indices, not the original documents. It should be possible to host a company-wide secure Web site for internal search at low cost.
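As a much-simplified illustration of finding a concept across its surface forms, here is a toy scan over a handful of invented email snippets using the term variants listed in the answer. This is only a variant-list match under stated assumptions, not the meaning-based concept search the answer describes.

```python
# Toy sketch: the term variants come from the answer above; the email
# snippets are invented for illustration, not real discovery documents.
import re

LOAN_CONCEPT = ["stated income only", "liar loan", "no-doc loan",
                "low-documentation loan"]

def mentions(concept_terms, text):
    """Return the concept variants that appear in a document, case-insensitively."""
    return [t for t in concept_terms if re.search(re.escape(t), text, re.IGNORECASE)]

emails = {
    "msg1": "We can push this through as a no-doc loan, keep it quiet.",
    "msg2": "Quarterly numbers attached.",
    "msg3": "Underwriting flagged another liar loan application today.",
}

for doc_id, body in emails.items():
    hits = mentions(LOAN_CONCEPT, body)
    if hits:
        print(doc_id, "->", hits)   # msg1 and msg3 surface the concept
```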
As you look forward, what are some new features / issues that you think will become more important in 2009? Where do you see a major break-through over the next 36 months?
Semantics and the semantic web have attracted a great deal of interest lately. One type of semantic search involves tagging documents and Web sites and relating them to each other in a hierarchy expressed in the tags. This type of semantic search enables taggers to control precisely how the various documents or sites are reasoned over, but it is labor-intensive. Another type of semantic search runs on free text, is fully automatic, and uses semantically based software to automatically characterize the meaning of documents and sites, as with CognitionSearch.
Mobile search is emerging as an important branch of search / content processing. Mobile search, however, imposes some limitations on presentation and query submission. What are your views of mobile search’s impact on more traditional enterprise search / content processing?
Mobile search heightens the need for improved precision, because the devices don’t have space to display millions of results, most of which are irrelevant.
Where can I find more information about your products, services, and research?
Harry Collier, Infonortics, Ltd., February 18, 2009