ClickZ: Year 2008 as Search Terms

December 31, 2008

For you search engine optimization lovers, navigate to “The Year in Search: A 2008 Review” by Enid Burns, ClickZ here. Ms. Burns has gathered the top searches for 2008. The resulting word list provides you with an indication of what will suck traffic to your Web site. It may help if your Web site is about one of the topics in the word list, but I know a couple of Web site wranglers who worry more about spoofing Google, Microsoft, and Yahoo than about content. Use the list as you will. The Lycos system returned “poker” as the number one search term. If you want to buy a portal and risk your savings, acquire Lycos. None of the terms is surprising because the search terms mirror what’s hot. Web search is different from enterprise search in many ways, but I have yet to see Louis Vuitton or Clay Aiken in the enterprise search logs I have had the opportunity to review.

Stephen Arnold, December 31, 2008

Cruel to Cuil

December 30, 2008

TechCrunch pushed the boulder off the hill, and now the avalanche is crashing down on Cuil. I wrote about this service when it first rolled out. You can read that article here. You can find CNet’s take on the failure of Cuil here. Matt Asay’s “Breaking the Google Habit” summarizes the Web search traffic chasm between the GOOG and every other Web search service. Keep in mind that the Google is not doing as well in China, India, Russia, and a couple of other places. But for most of the US of A, search means Google. (When I make this statement in my public lectures, I get entitlement children and trophy crazed 20 somethings chewing on my ankle. Folks, I am just reporting data, not imagining them. What do you call a 65 to 70 percent market share in Web search and nearly 26,000 Google Search Appliances and nearly complete saturation of US government mapping activities with Google Maps?) Mr. Asay picks up the theme of search as a habit. I have mentioned this characteristic of online once or twice in the last 30 years, but it’s a novel idea for CNet. The point is that Cuil started strong and ended up sucking air behind such stars as Ask.com and AOL.com. For me, the most important comment in Mr. Asay’s write up was:

..for competitors looking to kick the Google search habit, you can’t take the Cuil route and compete on search. It just won’t matter if you’re better. You need to create a different, compelling habit.

Wait a minute. I need to get out my acid free paper and archival ink. I want to write that down. I bet the Cuil venture fund check writers would prefer to capture their thoughts with branding irons on Cuil flesh.

Stephen Arnold, December 30, 2008

Wring Value from Google Analytics

December 30, 2008

I fielded several questions in the days before the ghost of Christmas present visited the coal country where I live in rural Kentucky. One question pertained to Google Analytics. If you haven’t seen analytics in action, point your browser to http://trends.google.com and you can see some of what’s possible. A Google search for “Google Analytics” works well too. To dig more deeply into Google Analytics, you will want to read Kissmetrics’s article “50 Resources for Getting the Most Out of Google Analytics” here. My Web log gets so little traffic, analytics depress me. But you, gentle reader, probably have a high traffic site, and you will benefit by clicking through sites and services grouped by useful headings; for example, “Beginner” and “Plugins, Hacks & Additions”. Very useful collection of information. Highly recommended.

Stephen Arnold, December 30, 2008

Microsoft in the Crystal Ball

December 30, 2008

Five scenarios for Microsoft’s future made the shortlist at InfoWorld. You can read the full, remarkably choppy story here. I don’t want to spoil your fun as you scan the five scenarios the wise journalists cooked up for today’s intellectual meal. I can mention one of the scenarios, however. For example, after noting that Microsoft has wandered a bit in 2008, one of the future scenarios is a gentle drift downwards. When I read this, I thought, “If the economy tanks, how gentle will that crash be for the Redmond wizards?” In my opinion, not too gentle. You can work through the other four scenarios which strongly suggest that someone at InfoWorld might want to sign up for an evening MBA program at one of the universities near the InfoWorld offices. The scenario that I think warrants a bit of thought is the break up. The company may be worth more chopped into three or more segments. With a stock price in the $20 range, beleaguered investors and users who have to clean up Word’s crazy behavior on a Windows machine by firing up Word for a Mac may force the issue. If more trouble looms, maybe Mr. Gates will come back. Marketing is not closing the gap between Redmond and the GOOG. Technology, not Zunes, is necessary. And quickly. Time is dribbling away. The company’s 2008 acquisitions provide additional evidence that Microsoft finds itself in a strange new Googley world.

Stephen Arnold, December 30, 2008

Dead Tree Update: Chicago and Suburban Shoppers

December 29, 2008

Newsweek Magazine, a dead tree publication in some danger of marginalization, published “Chicago’s Newspapers Facing a Troubled Future” here. When I read this article, I had the impression that the author, F.N. D’Alessio, was writing about Newsweek and the Associated Press. Mr. D’Alessio refers to newspaper “addicts”. I don’t know too many. I receive four dead tree newspapers: the Courier Journal, USA Today (affectionately known as McPaper), the New York Times, and the Wall Street Journal. I used to get the Financial Times, but the delivery was so erratic I dropped the paper in January 2008. I received an offer of a year’s subscription for $99, and I threw it in the trash. Too much hassle trying to work through clumps of papers arriving twice a week. For me, the most significant comment in the Newsweek story was a comment about the Tribune’s rival, the Chicago Sun Times:

Hollinger’s biggest move was to create the Sun-Times Media Group by buying up 70 suburban and neighborhood newspapers, more than a dozen of which are dailies. Some of those are profitable, and some newspaper analysts envision the Sun-Times company shutting down the namesake paper and keeping the suburban ones.

I read this as a clear statement that big city papers are gone geese. Check out the Tribune’s online version of the newspaper. It is a disaster. My discussion of this wounded duck is here.

The future for dead tree outfits–if there is to be one–is to become ad supported, micro publications serving narrow markets. For years, I thought the Gaithersburg Gazette had potential. Now that type of publication along with penny shoppers may be the margin of the information world available to the dead tree crowd.

You can make money in niches, but the revenue will buy used Malibus, not the flashy Mercedes the princes of journalism see as suitable transportation.

Stephen Arnold, December 29, 2008

Duplicates and Deduplication

December 29, 2008

In 1962, I was in Dr. Daphne Swartz’s Biology 103 class. I still don’t recall how I ended up amidst the future doctors and pharmacists, but there I was sitting next to my nemesis Camille Berg. She and I competed to get the top grades in every class we shared. I recall that Miss Berg knew that there were five variations of twinning: three dizygotic and two monozygotic. I had just turned 17 and knew about the Doublemint Twins. I had some catching up to do.

Duplicates continue to appear in data just as the five types of twins did in Bio 103. I find it amusing to hear and read about software that performs deduplication; that is, the machine process of determining which item is identical to another. The simplest type of deduplication is to take a list of numbers and eliminate any that are identical. You probably encountered this type of task in your first programming class. Life gets a bit more tricky when the values are expressed in different ways; for example, a mixed list with binary, hexadecimal, and real numbers plus a few more interesting versions tossed in for good measure. Deduplication becomes a bit more complicated.
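To make the mixed-representation case concrete, here is a minimal Python sketch. It assumes the values arrive as text, that hexadecimal and binary entries carry the usual `0x` and `0b` prefixes, and that two entries are duplicates when they parse to the same numeric value. The function names are hypothetical, not taken from any particular product.

```python
def parse_number(text):
    """Parse a decimal, hex (0x...), or binary (0b...) literal into a float.

    Assumption: prefixes mark the base. Very large integers lose precision
    when converted to float, which is acceptable for this illustration.
    """
    s = text.strip().lower()
    try:
        return float(int(s, 0))   # base 0 lets int() honor 0x / 0b prefixes
    except ValueError:
        return float(s)           # fall back to real numbers such as "3.5"


def dedupe_numbers(values):
    """Keep the first spelling of each distinct numeric value, in order."""
    seen = set()
    unique = []
    for raw in values:
        number = parse_number(raw)
        if number not in seen:
            seen.add(number)
            unique.append(raw)
    return unique


mixed = ["10", "0xA", "0b1010", "10.0", "3.5", "0x10", "16"]
print(dedupe_numbers(mixed))  # ['10', '3.5', '0x10']
```

Comparing the raw strings would find no duplicates in this list at all. The duplicates appear only after someone decides what "the same value" means, and the sketch encodes that decision in `parse_number`.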

At the other end of the scale, consider the challenge of examining two collections of electronic mail seized from a person of interest’s computers. There is the email from her laptop. And there is the email that resides on her desktop computer. Your job is to determine which emails are identical, prepare a single deduplicated list of those emails, generate a file of emails and attachments, and place the merged and deduplicated list on a system that will be used for eDiscovery.

Here are some of the challenges that you will face once you answer this question, “What’s a duplicate?” You have two allegedly identical emails and their attachments. One email is dated January 2, 2008; the other is dated January 3, 2008. You examine each email and find that the difference between the two emails is a single slide in the attached PowerPoint decks. What do you conclude?

  1. The two emails are not identical, so include both emails and the two attachments.
  2. The earlier email is the accurate one, so exclude the later email.
  3. The later email is the accurate one, so exclude the earlier email.

Now consider that you have 10 million emails to process. We have to go back to our definition of a duplicate and apply the rules for that duplicate to the collection of emails. If we get this wrong, there could be legal consequences. A system developer who generates a file of emails in which a purely mathematical process has determined that a record is different may be relying on a method too crude for eDiscovery. Math helps, but it is not likely to be able to handle the onerous task of determining near matches and the reasoning required to determine which email is “the” email.
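The gap between "mathematically identical" and "near match" can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any eDiscovery product: the field names (`sender`, `subject`, `body`, `attachment_hashes`), the decision to leave the date out of the fingerprint, and the 0.9 similarity threshold are all hypothetical choices a team would have to defend.

```python
import hashlib
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def fingerprint(email):
    """Hash the fields assumed to define 'the same email' (date excluded)."""
    parts = [normalize(email["sender"]),
             normalize(email["subject"]),
             normalize(email["body"])]
    parts.extend(sorted(email["attachment_hashes"]))
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


def classify(a, b, near_threshold=0.9):
    """Return 'exact', 'near' (route to a human reviewer), or 'different'."""
    if fingerprint(a) == fingerprint(b):
        return "exact"
    same_header = (normalize(a["sender"]) == normalize(b["sender"])
                   and normalize(a["subject"]) == normalize(b["subject"]))
    similarity = SequenceMatcher(
        None, normalize(a["body"]), normalize(b["body"])).ratio()
    if same_header and similarity >= near_threshold:
        return "near"
    return "different"


# Hypothetical records modeled on the two-email example above.
jan2 = {"sender": "jane@example.com", "subject": "Q4 deck",
        "body": "Deck attached. Please review.",
        "attachment_hashes": ["deck-v1"], "date": "2008-01-02"}
jan3 = dict(jan2, attachment_hashes=["deck-v2"], date="2008-01-03")
other = dict(jan2, sender="bob@example.com", body="Lunch on Friday?")

print(classify(jan2, jan3))   # near
print(classify(jan2, other))  # different
```

The hash settles only the easy case. The January 2 and January 3 emails land in the "near" bucket, and the code stops there: choosing which one is "the" email is the reasoning step that the math does not supply.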

Which is Jill? Which is Jane? Parents keep both. Does data work like this? Source: http://celebritybabies.typepad.com/photos/uncategorized/2008/04/02/natalie_grant_twins.jpg

Here’s another situation. You are merging two files of credit card transactions. You have data from an IBM DB2 system and you have data from an Oracle system. The company wants to transform these data, deduplicate them, normalize them, and merge them to produce one master “clean” data table. No, you can’t Google for an offshore service bureau; you have to perform this task yourself. In my experience, the job is going to be tricky. Let me give you one example. You identify two records which agree in field name and data for a single row in Table A and Table B. But you notice that the telephone number varies by a single digit. Which is the correct telephone number? You do a quick spot check and find that half of the entries from Table B have this variant, or you can flip the analysis around and say that half of the entries in Table A vary from Table B. How do you determine which records are duplicates?
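One way to avoid guessing is to have the merge flag the one-digit cases instead of resolving them. The Python sketch below assumes each record is a dictionary with hypothetical `card`, `name`, `amount`, and `phone` fields; the data values are invented, and the rule "one differing digit means review" is an assumption for illustration only.

```python
def digits(phone):
    """Strip punctuation so '(502) 555-0147' and '502-555-0147' compare equal."""
    return "".join(ch for ch in phone if ch.isdigit())


def digit_differences(a, b):
    """Count differing positions; None when the lengths do not match."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def compare(rec_a, rec_b, key_fields=("card", "name", "amount")):
    """Return 'duplicate', 'review', or 'different' for a pair of records."""
    if any(rec_a[field] != rec_b[field] for field in key_fields):
        return "different"
    diff = digit_differences(digits(rec_a["phone"]), digits(rec_b["phone"]))
    if diff == 0:
        return "duplicate"
    if diff == 1:
        return "review"   # one digit off: a person must pick the right number
    return "different"


# Hypothetical rows standing in for Table A (DB2) and Table B (Oracle).
row_a = {"card": "4111", "name": "J. Smith", "amount": "19.99",
         "phone": "502-555-0147"}
row_b = dict(row_a, phone="(502) 555-0147")
row_c = dict(row_a, phone="502-555-0141")

print(compare(row_a, row_b))  # duplicate
print(compare(row_a, row_c))  # review
```

Counting how many pairs fall into "review" gives you the spot-check figure described above. It does not tell you which table holds the correct telephone number.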

Google Translation Nudges Forward

December 27, 2008

I recall a chipper 20 something telling me what she learned in her first class in engineering; to wit, “Patent applications are not products.” As a trophy generation member, flush with entitlement, she is generally correct, but patent applications are not accidental. They are instrumental. If you are working on translation software, you may want to check out Google’s December 25, 2008, “Machine Translation for Query Expansion.” You can find this document by searching the wonderful USPTO system for US20080319962. Once you have that document in front of you, you will learn that Google asserts that it can snag a query, generate synonyms from its statistical machine translation system, and pull back a collection. There are some other methods in the patent application. When I read it, my thought was, “Run a query in English, get back documents in other languages that match the query, and punch the Google Translate button and see the source document in English.” Your interpretation may vary. I was amused that the document appeared on December 25, 2008, when most of the US government was on holiday. I guess the USPTO is working hard to win the favor of the incoming administration.

Stephen Arnold, December 27, 2008

Algorithms for All

December 27, 2008

A happy quack to the reader who sent me this link to the ACM’s collection of old, bad, not too old and not too bad algorithms. You can access the list and download the algorithms here. The collection task was a big one. Tim Hopkins, University of Kent, has his name on the referenced page. The geese at ArnoldIT.com want to thank him for his work. Keep in mind that algorithms’ beauty may be found in the eye of the beholder. Some of these are gems; others will choke even a modern hot rod computer. Test and retest, quacks the goose.

Stephen Arnold, December 27, 2008

Reading Google Paw Lines to Foretell Its Future

December 26, 2008

Alex Chitu must have been close enough to Googzilla to get it to show its paw for a fortune telling session. You can read his “Predictions for Google’s 2009” in Google Operating System here. His observations for the most part are interesting and I think, like Nostradamus, some of these predictions may be “true”. For example, Google Translate will become a more widely deployed function in Google products and services. You will find my discussion of Google’s December 25, 2008, patent application US20080319962 germane to this prediction. If you want to peer beyond Mr. Chitu’s flat statement, download the patent document and check out the claims. I also agree that Google Contacts will gain some beef in 2009. If you have been watching the weird ritual mating dance between Googlers and Salesforce.com, you may conclude that the GOOG wants more from customer relationship management than a quick buy out of Salesforce.com for its multi tenant inventions and the company’s potent marketing engine.  The personalized search ads have been visible to me on a couple of my Google “ig” sessions, so that’s a slam dunk for 2009. You can read his other prognostications here. I would like to mention three predictions that I hoped he would mention but did not. These are quite addled, and so these are ideal for the Beyond Search addled goose crystal ball output; namely:

  1. Companies in sectors unrelated to Web search and online advertising realize that the GOOG is disrupting their businesses. The addled goose watched in 2008 as commercial database companies and telecommunications companies woke up to a strange, new, Googley world. Can you guess the business sectors? You can get a list of these plus a diagram in my 2007 Google Version 2.0 which is still available. Click here to order.
  2. Authors will turn to Google as a way to sell, not just market, their original work. With dead tree publishing companies racing toward Armageddon, the GOOG as a publishing medium will come into its own. Google has quite a few technical documents explaining in considerable detail how to make this happen.
  3. Regulators in various countries will realize that Google heralds a new spin on globalization. Local operations deliver quite specific products and services, yet the plumbing exists “out there” so it is tough to deal with the GOOG under existing regulatory umbrellas.

What do you think the GOOG will do in 2009? Oh, I know that Google is just a Web search company in the business of selling ads in a deteriorating economic climate. I am a silly goose for having articulated that Google is more, much more.

Stephen Arnold, December 26, 2008

European Digital Library Back Online

December 26, 2008

The Inquirer reported here that the digital library sponsored by the European Union is online again. You can read the announcement here in “Euro Library Re-Opens”. More servers and more optimism should help the service which crashed when it first opened. The addled goose asks, “When will the EU lose its appetite for pumping money into infrastructure?” I am now calculating the odds that the EU seeks help from a company able to scale. Google is a long shot, but the Exalead engineers could contribute.

Stephen Arnold, December 26, 2008
