Natural Search: SEO Boffin Changes His Spots

January 2, 2009

The crackle of gunfire echoed through the hollow this morning. I am not sure if my neighbors are celebrating the new year or just getting some squirrel for a midday burgoo. As I scanned the goodies in my newsreader, I learned about a type of search that had eluded me. I want to capture this notion before it dribbles off my slippery memory. MediaPost reported in “Search Insider: The Inside Line on Search Marketing” that 2009 is ripe for “natural search”. The phrase appears in Rob Garner’s “Measuring Natural Search Marketing Success” here. The notion (I think) is that content helps a Web site come up in a results list. I had to sit down and preen my feathers. I was so excited by this insight I was ruffled. For me the most important comment was:

For starters, think of an investment in natural search as a protection for what you are currently getting from natural search engines across the board. Good natural search advice costs are a drop in the bucket compared to returns from natural search, and the risk of doing harm only once can far exceed your costs, and even do irreparable damage. I see clients with returns coming from natural search at over one half-billion to one billion dollars a year or more, and one simple slip could cost millions.

I must admit that I have to interpolate to ferret out the meaning of this passage. What I concluded (your mileage may differ) is that if you don’t have content, you may not appear in a Google, Microsoft, or Yahoo results list.

What happened to the phrase “organic search”? I thought it evoked a digital Euell Gibbons moving from Web site to Web site, planting content seeds. “Natural search” has for me a murkier connotation. I think of Tom’s toothpaste, the Natural Products Association, and Mold Cleaner Molderizer.

My hunch is that Google’s tweaks to its PageRank algorithm place a heavy load on the shoulders of the SEO consultants. I have heard that some of the higher profile firms (which I will not name) are charging five-figure fees and delivering spotty results. As a result, the SEO mavens are looking for a less risky way to get a Web site to appear in the Google rankings.

Mr. Garner is one of the first in 2009 to suggest that original content offering useful information to a site visitor is an “insurance policy”. I don’t agree. Content is the life support system of a Web site. You buy insurance for your automobile and home.

Stephen Arnold, January 1, 2009

Duplicates and Deduplication

December 29, 2008

In 1962, I was in Dr. Daphne Swartz’s Biology 103 class. I still don’t recall how I ended up amidst the future doctors and pharmacists, but there I was sitting next to my nemesis Camille Berg. She and I competed to get the top grades in every class we shared. I recall that Miss Berg knew that there were five variations of twinning: three dizygotic and two monozygotic. I had just turned 17 and knew about the Doublemint Twins. I had some catching up to do.

Duplicates continue to appear in data just as the five types of twins did in Bio 103. I find it amusing to hear and read about software that performs deduplication; that is, the machine process of determining which item is identical to another. The simplest type of deduplication is to take a list of numbers and eliminate any that are identical. You probably encountered this type of task in your first programming class. Life gets trickier when the values are expressed in different ways; for example, a mixed list with binary, hexadecimal, and real numbers, plus a few more interesting variants tossed in for good measure. Deduplication becomes a bit more complicated.
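
To make that messier case concrete, here is a minimal sketch in Python. It is my illustration, not anything from a product; the “0b” and “0x” prefixes are my assumption about how the binary and hexadecimal values happen to be written.

    # A minimal sketch of deduplicating a mixed list of numbers written in
    # different notations. Assumption: binary values carry a "0b" prefix and
    # hexadecimal values a "0x" prefix.
    def normalize(value: str) -> float:
        """Convert a binary, hexadecimal, or decimal string to a float."""
        v = value.strip().lower()
        if v.startswith("0b"):
            return float(int(v, 2))
        if v.startswith("0x"):
            return float(int(v, 16))
        return float(v)

    raw = ["0b1010", "0xA", "10.0", "10", "3.14"]
    unique_values = sorted({normalize(v) for v in raw})
    print(unique_values)  # [3.14, 10.0]: four spellings of ten collapse to one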

At the other end of the scale, consider the challenge of examining two collections of electronic mail seized from a person of interest’s computers. There is the email from her laptop. And there is the email that resides on her desktop computer. Your job is to determine which emails are identical, prepare a single deduplicated list of those emails, generate a file of emails and attachments, and place the merged and deduplicated list on a system that will be used for eDiscovery.

Here are some of the challenges that you will face once you answer this question, “What’s a duplicate?” You have two allegedly identical emails and their attachments. One email is dated January 2, 2008; the other is dated January 3, 2008. You examine each email and find that the only difference between the two is a single slide in one of the attached PowerPoint decks. Which conclusion do you draw:

  1. The two emails are not identical, so include both emails and both attachments
  2. The earlier email is the accurate one, so exclude the later email
  3. The later email is the accurate one, so exclude the earlier email.

Now consider that you have 10 million emails to process. We have to go back to our definition of a duplicate and apply the rules for that duplicate to the collection of emails. If we get this wrong, there could be legal consequences. A system developer who simply lets a mathematical process decide that one record differs from another may find that approach too crude for the problem in the context of eDiscovery. Math helps, but it is not likely to be able to handle the onerous task of determining near matches and the reasoning required to determine which email is “the” email.
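
To show where the math runs out, here is a minimal sketch in Python. It is my illustration, not anyone’s eDiscovery product, and it assumes each email arrives as a dictionary with “from”, “subject”, “body”, and “attachments” (attachment contents as bytes). Exact duplicates collapse on a hash; near matches, such as decks differing by one slide, are only flagged so a human can apply the rules.

    # My illustration of separating exact duplicates from near duplicates.
    import hashlib
    from difflib import SequenceMatcher

    def fingerprint(email: dict) -> str:
        """Hash normalized headers plus attachment bytes; identical emails match."""
        h = hashlib.sha256()
        h.update(email["from"].lower().encode())
        h.update(email["subject"].strip().lower().encode())
        for blob in email["attachments"]:
            h.update(hashlib.sha256(blob).digest())
        return h.hexdigest()

    def near_duplicate(a: dict, b: dict, threshold: float = 0.95) -> bool:
        """Flag emails whose bodies are almost, but not exactly, the same."""
        return SequenceMatcher(None, a["body"], b["body"]).ratio() >= threshold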


Which is Jill? Which is Jane? Parents keep both. Does data work like this? Source: http://celebritybabies.typepad.com/photos/uncategorized/2008/04/02/natalie_grant_twins.jpg

Here’s another situation. You are merging two files of credit card transactions. You have data from an IBM DB2 system and you have data from an Oracle system. The company wants to transform these data, deduplicate them, normalize them, and merge them to produce one master “clean” data table. No, you can’t Google for an offshore service bureau; you have to perform this task yourself. In my experience, the job is going to be tricky. Let me give you one example. You identify two records which agree in field names and data for a single row in Table A and Table B. But you notice that the telephone number varies by a single digit. Which is the correct telephone number? You do a quick spot check and find that half of the entries from Table B have this variant, or you can flip the analysis around and say that half of the entries in Table A vary from Table B. How do you determine which records are duplicates?
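
Here is a minimal sketch in Python of the kind of field-by-field comparison involved. The field names and rows are hypothetical; the point is that rows agreeing on everything except the telephone number should be routed to review rather than silently deduplicated.

    # My illustration: compare two candidate rows field by field.
    FIELDS = ["name", "card_number", "amount", "phone"]

    def differing_fields(row_a: dict, row_b: dict) -> list:
        """Return the fields on which the two rows disagree."""
        return [f for f in FIELDS if row_a.get(f) != row_b.get(f)]

    a = {"name": "J. Smith", "card_number": "4111-0001", "amount": "19.95", "phone": "502-555-0143"}
    b = {"name": "J. Smith", "card_number": "4111-0001", "amount": "19.95", "phone": "502-555-0148"}

    diffs = differing_fields(a, b)
    if not diffs:
        print("exact duplicate")
    elif diffs == ["phone"]:
        print("probable duplicate: telephone number differs, route to review")
    else:
        print("distinct records:", diffs)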


Moore’s Law: Not Enough for Google

December 29, 2008

I made good progress on my Google and Publishing report for Infonortics over the last three days. I sat down this morning and riffed through my Google technical document collection to find a number. The number is interesting because it appears in a Google patent document and provides a rough estimate of the links that Google would have to process when it runs its loopy text generation system. Here’s the number as it is expressed in the Google patent document:

50 million million billion links

Google’s engineers included an exclamation point in US7231393. The number is pretty big even by Googley standards. And who cares? Few pay much attention to Google’s PhD-like technical documents. Google is a search company that sells advertising, and until the forthcoming book about Google’s other business interests comes out, I don’t think many people realize that Moore’s law is not going to help Google when it processes lots of links–50 million million billion give or take a few million million.
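
For scale, here is my back-of-the-envelope arithmetic, assuming the phrase parses as 50 times 10^6 times 10^6 times 10^9:

    # My reading of the figure, not Google's: "50 million million billion".
    links = 50 * 10**6 * 10**6 * 10**9
    print(f"{links:e} links")  # prints 5.000000e+22 links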

When I scanned “Sustaining Moore’s Law – 10 Years of the CPU” by Vincent Chang here, I realized that Google has little choice but to use fast CPUs and math together. In fact, the faster and more capable the CPU, the more math Google can use. Can you name another company worrying about Kolmogorov’s procedures?

Take a look at Mr. Chang’s article. The graph shows that the number of transistors appears to keep doubling. The problem is that information keeps growing, and the type of analysis Google wants to do with various probabilistic methods is growing even faster.

The idea that building more data centers allows Google to do more is only half the story. The other half is math. Competitors who focus on building data centers, therefore, may be addressing only part of the job when trying to catch up with Google. Leapfrogging Google seems difficult if my understanding of the issue is correct.

Getting Doored by Search

December 28, 2008

Have you been in Manhattan and watched a bike messenger surprised by a car door opening? The bike messenger loses these battles, which typically destroy the front wheel of the bike. When this occurs, the messenger has been doored. You can experience a similar surprise with enterprise search.


What happens when you get doored. Source: http://citynoise.org/author/ken_rosatio

The first situation is one that will be increasingly common in 2009. As the economy tanks, litigation is likely to increase. This means that you will need to provide information as part of the legal discovery process. You will get doored if you try to use your existing search system for this function. No go. You will need specialized systems and you will have to be able to provide assurance that spoliation will not occur. “Spoliation” refers to altering or destroying evidence, such as changing an email. Autonomy offers a solution, to cite one example.

The second situation occurs when you implement one of the social systems; for example, a Web log or a wiki. You will find that most enterprise search systems lack filters to handle the content in blogs. Some vendors–for example, Blossom Search–can index Web log content. Exalead has a connector to index information within Blogger.com and other systems. However, your search system may lack the connector. You will be doored because you will have to code or buy a connector. Ouch.
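
To give a sense of what “code a connector” means, here is a minimal sketch in Python. It is my illustration, not any vendor’s connector; the feed URL is hypothetical, and index_document is a stand-in for whatever ingestion call your search system actually exposes.

    # My illustration of a bare-bones Web log connector: fetch an RSS 2.0 feed
    # and hand each post to an indexing callback.
    import urllib.request
    import xml.etree.ElementTree as ET

    def fetch_posts(feed_url: str):
        """Yield (title, link, description) tuples from a simple RSS 2.0 feed."""
        with urllib.request.urlopen(feed_url) as response:
            tree = ET.parse(response)
        for item in tree.iter("item"):
            yield (
                item.findtext("title", default=""),
                item.findtext("link", default=""),
                item.findtext("description", default=""),
            )

    def index_document(title: str, link: str, body: str) -> None:
        print(f"indexing {link}: {title[:40]}")  # stand-in for the real indexer

    for title, link, body in fetch_posts("https://example.com/blog/rss.xml"):
        index_document(title, link, body)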

The third situation arises when you need to make email searchable from a mobile device. To pull this off, you need to find a way to preserve security, prevent a user from deleting mail from her desktop or the mail server, and deliver results without latency. When you try this trick with most enterprise search systems, you will be doored. The fix is to tap a vendor like Coveo and use that company’s email search system.

There’s a small consulting outfit prancing around like a holiday elf saying, “Search is simple. Search is easy. Search is transparent.” Like elves, this assertion is a weird mix of silliness, fairy dust, and ignorance. If this outfit helps you deal with a “simple” search, prepare to get doored. It may not be the search system; it may be your colleagues.

Stephen Arnold, December 28, 2008

Google Translation Nudges Forward

December 27, 2008

I recall a chipper 20-something telling me what she learned in her first class in engineering; to wit, “Patent applications are not products.” As a trophy generation member, flush with entitlement, she is generally correct, but patent applications are not accidental. They are instrumental. If you are working on translation software, you may want to check out Google’s December 25, 2008, “Machine Translation for Query Expansion.” You can find this document by searching the wonderful USPTO system for US20080319962. Once you have that document in front of you, you will learn that Google asserts that it can snag a query, generate synonyms from its statistical machine translation system, and pull back a collection. There are some other methods in the patent application. When I read it, my thought was, “Run a query in English, get back documents in other languages that match the query, and punch the Google Translate button and see the source document in English.” Your interpretation may vary. I was amused that the document appeared on December 25, 2008, when most of the US government was on holiday. I guess the USPTO is working hard to win the favor of the incoming administration.
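
Here is my reading of that flow as a minimal sketch in Python. The translate function is a hypothetical stub standing in for a statistical machine translation system; none of this is Google’s code, and the query and translations are made-up examples.

    # My illustration of query expansion through round-trip translation.
    def translate(text: str, source: str, target: str) -> list:
        """Hypothetical stand-in for a statistical MT system."""
        stub = {
            ("cheap flights", "en", "es"): ["vuelos baratos", "vuelos económicos"],
            ("vuelos baratos", "es", "en"): ["cheap flights", "inexpensive flights"],
            ("vuelos económicos", "es", "en"): ["economical flights", "budget flights"],
        }
        return stub.get((text, source, target), [])

    def expand_query(query: str) -> set:
        expanded = {query}
        for hypothesis in translate(query, "en", "es"):
            # Round-trip each translation to harvest synonyms in the source language.
            expanded.update(translate(hypothesis, "es", "en"))
        return expanded

    print(expand_query("cheap flights"))
    # {'cheap flights', 'inexpensive flights', 'economical flights', 'budget flights'}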

Stephen Arnold, December 27, 2008

The Future of EasyAsk: Depends on Progress

December 18, 2008

EasyAsk is a search system that works quite well. You can read EasyAsk Facts here. The company is now a unit of Progress Software. Progress began with a core of original code and over the years has acquired a number of companies. I think of the firm as a boutique, which is not what the Progress public relations people want me to keep in my tiny goose brain. I saw a news item about Progress Software’s most recent financial report. You can read a summary of the numbers here. If you want more detail, navigate to Google Finance here. The story is simple: earnings are down to $8.5 million from $15.8 million in the fourth quarter of 2007. With the economic climate in deep chill mode, Progress will have to retool its sales and marketing. If the downdraft continues, the company will have to make some tough decisions about which of its many products to hook up to life support. EasyAsk, like other search systems, is a complicated beastie. Search systems gobble up money, and the sales cycle is often long even when the MBAs are running at full throttle. When the MBAs are home worrying about their mortgage payments, the search business is likely to suffer. One warning sign: EasyAsk was not mentioned in the news release I read. This goose is accustomed to watching the weather for signs of a storm. My thought is that one might be building and heading the EasyAsk way. What’s your take? No PR people need reply, thanks.

Stephen Arnold, December 2008

Leximancer Satmetrix Tie Up

December 18, 2008

Leximancer has partnered with Satmetrix so that Satmetrix can utilize Leximancer’s Customer Insight Portal. Satmetrix provides software applications and consulting services to improve customer loyalty. Using “intuitive concept discovery” — semantic analysis — Leximancer derives insights about customer attitudes. Leximancer will provide customer analytics and unstructured text mining for Satmetrix’s Net Promoter, which automatically sifts and categorizes data from blogs, Web sites, social media, e-mails, service notes and survey feedback to increase companies’ customer loyalty, retention and growth. The focus on analyzing positive and negative trends in text entries from customers is key to speed and response for customer service-oriented companies. Satmetrix serves a wide range of markets, including telecommunications firms like Verizon and business services like Careerbuilder.

Jessica Bratcher, December 17, 2008

SharePoint: ChooseChicago

December 18, 2008

I scanned the MSDN Web log postings and saw this headline: “SharePoint Web Sites in Government.” My first reaction was that the author Jamesbr had compiled a list of public-facing Web sites running on Microsoft’s fascinating SharePoint content management, collaboration, search, and Swiss Army Knife software. No joy. Mr. Jamesbr pointed to another person’s list, which was a trifle thin. You can check out this official WSS tally here. Don’t let the WSS fool you. The sites are SharePoint, and there are 432 of them as of December 16, 2008. I navigated to the featured site, ChooseChicago.com. My broadband connection was having a bad hair day. It took 10 seconds for the base page to render, and I had to hit the escape key after 30 seconds to stop the page from trying to locate a missing resource. Sigh. Because this was a featured site that impressed Jamesbr, I did some exploring. First, I navigated to the ChooseChicago.com site and saw this on December 16, 2008:

chicago splash

The search box is located at the top right-hand corner of the page and also at the bottom right-hand corner. But the search system was a tad sluggish. After entering my query “Chinese”, the system cranked for 20 seconds before returning the results list:

chicago result list
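
For what it is worth, here is a minimal sketch in Python of the informal stopwatch test. It times only the base page response, not a full browser render with images and scripts, so it understates what a visitor actually waits through.

    # My illustration of timing a base page fetch.
    import time
    import urllib.request

    def fetch_seconds(url: str, timeout: float = 30.0) -> float:
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
        return time.monotonic() - start

    print(f"{fetch_seconds('http://www.choosechicago.com'):.1f} seconds")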


Expert System’s Luca Scagliarini

December 18, 2008

ArnoldIT.com’s Search Wizards Speak series has landed another exclusive. Hard on the heels of the interview with Autonomy’s chief operating officer comes a conversation with Luca Scagliarini, one of the senior executives at Expert System in Modena, Italy, who explains the company’s technology and strategy for 2009. Mr. Scagliarini is a technologist’s technologist and a recognized leader in next generation search systems. The company’s COGITO technology has cut a wide swath through European markets and is now available in North America. Mr. Scagliarini told ArnoldIT.com’s Beyond Search:

A major mobile handheld manufacturer uses our technology to address the issue of supporting new users in learning how to use the device. The objective was to reduce the return rate of the device AND to reduce the customer support costs. This natural language-based solution leverages our semantic technology to provide their customers with a simple and effective tool to answer questions and how-to queries with consistency and high precision. As of today the system has answered, in only 5 months, more than 4 million questions with more than 87% precision.

Search is no longer key word matching and long lists of results. Mr. Scagliarini said:

To deliver an effective question and answer system that works on more than a small set of FAQ, it is very important to have a deep understanding of the text. This is possible only through deep semantic analysis. We have several implementations of our natural language Q&A product recently renamed COGITO Answer. In the next 12 months, we will be investing to expand our footprint worldwide–especially in the U.S. and in the Persian Gulf region to replicate our European success there. In the U.S, we are now supporting customer service operations with natural language Q&A for a government unit of the Department of the Interior and we are one of only 5 semantic partners actively promoted by Oracle.

You can read the complete interview with Mr. Scagliarini on the ArnoldIT.com Web site or you can click here. More information about the company and its technology may be found on the firm’s Web site http://www.expertsystem.net or click here.

Semantic Search Laid Bare

December 17, 2008

Yahoo’s Search Blog here has an interesting interview with Dr. Rudi Studer. The focus is semantic search technologies, which are all the rage in enterprise search and Web search circles. Dr. Studer, according to Yahoo:

is no stranger to the world of semantic search. A full professor in Applied Informatics at University of Karlsruhe, Dr. Studer is also director of the Karlsruhe Service Research Institute, an interdisciplinary center designed to spur new concepts and technologies for a services-based economy. His areas of research include ontology management, semantic web services, and knowledge management. He has been a past president of the Semantic Web Science Association and has served as Editor-in-Chief of the journal Web Semantics.

If you are interested in semantics, you will want to read and save the full text of this interview. I want to highlight three points that caught my attention and then–in my goosely manner–offer several observations.

First, Dr. Studer suggests that “lightweight semantic technologies” have a role to play. He said:

In the context of combining Web 2.0 and Semantic Web technologies, we see that the Web is the central point. In terms of short term impact, Web 2.0 has clearly passed the Semantic Web, but in the long run there is a lot that Semantic Web technologies can contribute. We see especially promising advancements in developing and deploying lightweight semantic approaches.

The key idea is lightweight, not giant semantic engines grinding in a lights-out data center.

Second, Dr. Studer asserts:

Once search engines index Semantic Web data, the benefits will be even more obvious and immediate to the end user. Yahoo!’s SearchMonkey is a good example of this. In turn, if there is a benefit for the end user, content providers will make their data available using Semantic Web standards.

The idea is that in this chicken-and-egg problem, it will be the Web page creators’ job to make use of semantic tags.
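
To make “semantic tags” concrete, here is a minimal sketch in Python of what such markup boils down to: subject, predicate, object triples that a crawler could index alongside the visible page. The URL and property names are illustrative, not taken from any real site or standard vocabulary binding.

    # My illustration of the triples behind Semantic Web markup.
    triples = [
        ("http://example.com/page", "dc:title", "Deep Dish Pizza Guide"),
        ("http://example.com/page", "dc:creator", "Jane Author"),
        ("http://example.com/page", "dcterms:subject", "Chicago restaurants"),
    ]
    for subject, predicate, obj in triples:
        print(f'<{subject}> {predicate} "{obj}" .')  # Turtle-style statement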

Finally, Dr. Studer identifies tools as an issue. He said:

One problem in the early days was that the tool support was not as mature as for other technologies. This has changed over the years as we now have stable tooling infrastructure available. This also becomes apparent when looking at the at this year’s Semantic Web Challenge. Another aspect is the complexity of some of the technologies. For example, understanding the foundation of languages such as OWL (being based on Description Logics) is not trivial. At the same time, doing useful stuff does not require being an expert in Logics – many things can already be done exploiting only a small subset of all the language features.

I am no semantic expert. I have watched several semantic centric initiatives enter the world and–somewhat sadly–watched them die. Against this background, let me offer three observations:

  1. Semantic technology is plumbing and like plumbing, semantic technology should be kept out of sight. I want to use plumbing in a user friendly, problem free setting. Beyond that, I don’t want to know anything about plumbing. Lightweight or heavyweight, I think some other users may feel the same way. Do I look at inverted indexes? Do you?
  2. The notion of putting the burden on Web page or content creators is a great idea, but it won’t work. When I analyzed the five Programmable Search Engine inventions by Ramanathan Guha as part of an analysis for the late, great Bear Stearns, it was clear that Google’s clever Dr. Guha assumed most content would not be tagged in a useful way. Sure, if content was properly tagged, Google could ingest that information. But the core of the PSE invention was Google’s method for taking the semantic bull by the horns. If Dr. Guha’s method works, then Google will become the semantic Web because it will do the tagging work that most people cannot or will not do.
  3. The tools are getting better, but I don’t think users want to use tools. Users want life to be easy, and figuring out how to create appropriate tags, inserting them, and conforming to “standards” such as they are is no fun. The tools will thrill developers and leave most people cold. Check out the tools section at a hardware store. What do you see? Hobbyists and tinkerers and maybe a few professionals who grab what they need and head out. Semantic tools will be like hardware: of interest to a few.

In my opinion, the Google – Guha approach is the one to watch. The semantic Web is gaining traction, but it is in its infancy. If Google jump starts the process by saying, “We will do it for you”, then Google will “own” the semantic Web. Then what? The professional semantic Web folks will grouse, but the GOOG will ignore the howls of protest. Why do you think the GOOG hired Dr. Guha from IBM Almaden? Why did the GOOG create an environment for Dr. Guha to write five patent applications, file them on the same day, and have the USPTO publish five documents on the same day in February 2007? No accident tell you I.

Stephen Arnold, December 17, 2008

