Google Nails Duplicate Detection Invention

December 3, 2009

I know that most of my two or three readers do not give a goose feather for duplicate detection. Pretty boring stuff. Google result lists seem to be just one list with few repeated items. Even in the Google News service, identical stories rarely slip through the digital net.

The ever reliable USPTO has granted a patent to the Google for its duplicate detection method. If you want to know a bit more about the Google approach, you will want to download US7,627,613, “Duplicate Document Detection in a Web Crawler System”. Before my pals at various search and content processing companies email me to explain that their duplicate detection is better, save that energy. No one at the Beyond Search goose pond is asserting “better”. The Google invention deals with scale, petabytes of digital crapola deduped quickly and reasonably effectively. The “scale” idea is one clue to Google’s technology. The challenges of scale are not well understood unless you have to figure out what to do with trillions of instances of digital crapola.

Google says in its glorious prose:

Duplicate documents are detected in a web crawler system. Upon receiving a newly crawled document, a set of documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the selected set of documents is merged into information identifying a new set of documents. Duplicate documents are included and excluded from the new set of documents based on a query independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.

Notice what’s left out? Now read the patent document. Notice what’s left out? Google does not make explicit how these separate inventions interlock. Those interlocks are sort of important, particularly if you are a competitor and one of your 20 somethings says, “That’s obvious. I can code that up myself.” Scale. Remember scale. Remember that Google can convert speech to text and then dedupe those outputs too. Scale. Performance. Cost. Useful Google concepts all.
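For readers who want a rough feel for what the abstract describes, here is a minimal sketch of grouping newly crawled pages by shared content and picking one representative per group with a query independent score. It is not Google’s code; the fingerprint, the score, and the merge logic are my own illustrative stand-ins.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Crude content fingerprint; the patent does not disclose Google's actual method."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

class DuplicateGroups:
    """Groups documents sharing the same content fingerprint and keeps one
    representative per group, chosen by a query independent score (imagine
    a link-based importance value)."""

    def __init__(self):
        self.groups = {}  # fingerprint -> list of (url, score)

    def add(self, url: str, text: str, score: float) -> str:
        """Merge the newly crawled document into its group and return the
        group's current representative."""
        group = self.groups.setdefault(fingerprint(text), [])
        group.append((url, score))
        return max(group, key=lambda item: item[1])[0]

if __name__ == "__main__":
    dupes = DuplicateGroups()
    print(dupes.add("http://a.example/story", "Ducks land in the Kentucky pond", 0.2))
    print(dupes.add("http://b.example/mirror", "Ducks land in the  Kentucky pond", 0.9))
```

At Google scale the interesting part is doing this merge across billions of URLs without stalling the crawler, which is exactly the part the patent leaves fuzzy.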

Stephen Arnold, December 3, 2009

I wish to disclose to the National Constitution Center that I was not paid to write this essay with its implicit reference to the constitutional right of Google competitors to misunderstand the notion of “scale” in Google’s weird vocabulary.

Some Thoughts About Real Time Content Processing

December 2, 2009

I wanted to provide my two or three readers with a summary of my comments about real time content processing at the Incisive international online information conference. I arrived more addled than normal due to three mechanical failures on America’s interpretation of a joint venture between Albanian and Galapagos Airlines. That means Delta Airlines, I think.

What I wanted to accomplish in my talk was to make one point—real time search is here to stay. Why?

First, real time means lots of noise and modest information payload. To deal with lots of content requires a robust and expensive lineup of hardware, software, and network resources. Marketers have been working overtime by slapping “real time” on any software product conceivable in the hopes of making another sale. And big-time search vendors essentially ignored the real time information challenge. Plain vanilla search on content updated when the vendor decided was an easier game.

Real time can mean almost anything. In fact, most search and content processing systems are not even close to real time. The reason is that slowdowns can occur in any component of a large, complex content processing system. As long as the user gets some results, for many of the too-busy 30 somethings that is just fine. Any information is better than no information. Based on the performance of some commercial and governmental organizations, the approach is not working particularly well in my opinion.

Let me give you an example of real time. In the 1920s, America decided that no booze was good news. Rum runners filled the gap. The US Coast Guard learned that it could tune a radio receiver to a frequency used by the liquor smugglers. The intercepts were in real time, and the Coast Guard increased its interdiction rate. The idea was that a bad guy talked and the Coast Guard listened in real time even though there was a slight delay in wireless transmissions. The same idea is operative today when good guys intercept mobile conversations or listen to table talk at a restaurant.

The problem is that communications and content believed to be real time are not. SMS may be delivered quickly, but I have received SMS sent a day or more earlier. The telco takes considerable license in billing for SMS and delivering SMS. No one seems to be the wiser.

A content management system often creates this type of conversation in an organization. Jack: “I can’t find my document.” Jill: “Did you put it in the system with the ‘index me’ metatag?” Jack: “Yes.” Jill: “Gee, that happens to me all the time.” The reason is that the CMS indexes when it can or on a specific schedule. Content in some CMSs is not findable. So much for real time in the organization.

An early version of the Google Search Appliance could index so aggressively that the network was choked by the googlebot. System administrators solved the problem by indexing once a day, maybe twice a day. Again, the user perceives one thing and the system is doing another.

This means that real time will have a specific definition depending on the particular circumstances in which the system is installed and configured.

Several business sectors are gung ho for real time information.

Financial services firms will pay $500,000 for a single Exegy high-speed content processing server. When that machine is saturated, just buy another Exegy server. Microsoft is working on a petascale real time content processing system for the financial services industry, which will compete with such established vendors as Connotate and Relegence. But a delay of a millisecond or two can spoil the fun.

Accountants want to know exactly what money is where. Purchase order systems and accounts receivable have to be fast. Speed does not prevent accidents. The implosion of such corporate giants as Enron and Tyco makes it clear that going faster does not make information or management decisions better.

Intelligence agencies want to know immediately when a term on a watch list appears in a content stream. A good example is “Bin Ladin” or “Bin Laden” or a variant. A delay can cost lives. Systems from Exalead and SRA can handle this type of problem and a range of other real time tasks without breaking a sweat.
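To make the variant problem concrete, here is a toy matcher. The watch list, the regular expression, and the alert format are mine for illustration; production systems from these vendors use far richer name matching than a single hand-written pattern.

```python
import re

# Hypothetical watch list; real systems add transliteration tables, phonetic
# keys, and entity extraction rather than one regular expression per name.
WATCH_TERMS = {
    "bin laden": re.compile(r"\bbin\s+lad[ei]n\b", re.IGNORECASE),
}

def scan(stream):
    """Yield (term, line) for every watch list hit in an iterable of text lines."""
    for line in stream:
        for term, pattern in WATCH_TERMS.items():
            if pattern.search(line):
                yield term, line

if __name__ == "__main__":
    sample = ["routine chatter", "message mentions Bin Ladin and a meeting time"]
    for term, line in scan(sample):
        print(f"ALERT [{term}]: {line}")
```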

The problem is that there is no certifying authority for “real time”. Organizations trying to implement real time may be falling for a pig in a poke or buying a horse without checking to see if it has been enhanced at a horse beauty salon.

In closing, real time is here to stay.

First, Google, Microsoft, and other vendors are jumping into indexing content from social networks, RSS feeds, and Web sites that update when new information is written to their databases. Like it or not, real time links or what appear to be real time links will be in these big commercial systems.

Second, enterprise vendors will provide connectors to handle RSS and other real time content. This geyser of information will be creating wet floors in organizations worldwide.

Third, vendors in many different enterprise sectors will be working to make fresh data available. You may not be able to escape real time information even if you work with an inventory control system.

Finally, users—particularly recent college graduates—will get real time information their own way, like it or not.

To wrap up, “what’s happening now, baby?” is going to be an increasingly common question you will have to answer.

Stephen Arnold, December 2, 2009

Oyez, oyez, I disclose to the National Intelligence Center that the Incisive organization paid me to write about real time information. In theory, I will get some money in eight to 12 weeks. Am I for sale to the highest bidder? I guess it depends on how good looking you are.

Social, Real Time, Content Intelligence

November 29, 2009

I had a long talk this morning about finding useful nuggets from the social content streams. The person with whom I spoke was making a case for tools designed for the intelligence community. My phone pal mentioned JackBe.com, Kapow, and Kroll. None of these outfits is a household word. I pointed to services and software available from NetBase, Radian6, and InsideView.

What came out of this conversation were several broad points of agreement:

First, most search and content processing procurement teams have little or no information about these firms. The horizons of most people working in information technology and content processing are neither wide nor far.

Second, none of these companies has a chance of generating significant traction with their current marketing programs. Sure, the companies make sales, but these are hard won and usually anchored in some type of relationship or a serendipitous event.

Third, users need the type of information these firms can deliver. Those same users cannot explain what they need, so the procurement teams fall back into a comfortable and safe bed like a “brand name” search vendor or some fuzzy wuzzy one-size-fits-all solution like the wondrous SharePoint.

We also disagreed on four points:

First, I don’t think these specialist tools will find broad audiences. The person with whom I was discussing these social content software vendors believed that one would be a break out company.

Second, I think Google will add social content “findability” a baby step at a time. One day, I will arise from my goose nest and the Google will simply be “there”. The person at the other end of my phone call sees Google’s days as being numbered. Well, maybe.

Third, I think that social content is a more far-reaching change than most publishers and analysts realize. My adversary thinks that social content is going to become just another type of content. It’s not revolutionary; it’s mundane. Well, maybe.

Finally, I think that these systems—despite their fancy Dan marketing lingo—offer functions not included in most search and content processing systems. The person disagreeing with me thinks that companies like Autonomy offer substantially similar services.

In short, how many of these vendors’ products do you know? Not many, I wager. So what’s wrong with the coverage of search and content processing by the mavens, pundits, and azure chip consultants? Quite a bit, because these folks may know less about these vendors’ systems than they know about spoofing Google or sounding informed by repeating marketing lingo.

Have a knowledge gap? Better fill it.

Stephen Arnold, November 29, 2009

I want to disclose to the National Intelligence Center that no one paid me to comment on these companies. These outfits are not secret but don’t set the barn on fire with their marketing acumen.

Microsoft and News Corp.: A Tag Team of Giants Will Challenge Google

November 23, 2009

Government regulators are powerless when it comes to online. The best bet, in my opinion, is for large online companies to act as if litigation and regulator hand holding was a cost of doing business. While the legal eagles flap and the regulators meet bright, chipper people, the business of online moves forward.

The news that News Corp. and Microsoft are talking, reported in “Microsoft Offers To Pay News Corp To ‘De-List’ Itself From Google” and by other “experts”, suggests these two giants want to form a digital World Wrestling Federation tag team. In the “fights” to come, these champions, Steve Ballmer and Rupert Murdoch, will take on the unlikely upstarts, Sergey the Algorithm Guy and Larry the Math Whiz.


Which of these two tag teams will grace the cover of the WWF marketing collateral? What will their personas become? Source: http://www.x-entertainment.com/pics5/wwe11click.jpg

The idea is to “pull” News Corp. content from Google or make Google pay through its snout for the right to index News Corp. content. The deal will probably encompass any News Corp. content. Whatever Google deal is in place with News Corp. would be reworked. News Corp., like other traditional media companies, is struggling to regain its revenue traction.

For Microsoft, a new wrestling partner makes sense. Bing is gaining market share, but at the expense of Yahoo’s search share. Microsoft now faces Google’s 1,001 tiny cuts. The most recent is the browser-based operating system. There is the developer problem, with Microsoft’s former employees rallying the Google faithful. There’s the pesky Android phone thing that went from a joke to a coalition of telephone-centric outfits. There’s the annoyance of Google in the US government. On and on. No single Google nick has to kill Microsoft. Nope. Google just needs to let a trickle of revenue slip away from the veins of Microsoft. The company’s rising blood pressure will do the rest. Eventually, the losses from the 1,001 tiny cuts will force the $70 billion Redmond wrestler to take a break. That “rest” may be what gives Google the opportunity to do significant damage with its as-yet-unappreciated play for the TV, cable, and independent motion picture business. Silverlight 4.0 may not be enough to overcome the structural changes in rich media. That’s real money. Almost as much as the telephony play promises to deliver to the somewhat low-key team of Sergey the Algorithm Guy and Larry the Math Whiz.


Sergey the Algorithm Guy and Larry the Math Whiz take a break from discussing the Kolmogorov-Smirnov test of normality. Training is tough for this duo. Long hours of solitary computation may exhaust the team before it tackles the Ballmer-Murdoch duo, which may be the most dangerous opponent the Math Guys have faced.

I look forward to the fight promoter pulling out all the stops. One of the Buffers will be the announcer. The cut man will be the master, Stitch Duran. The venue will be Las Vegas, followed by other world capitals of money, power, and citizen concern.

Nicholas Carlson reported:

Still, if News Corp were to “de-list” from Google, we’d expect to see all kinds of ads touting Bing as the only place to find the Wall Street Journal and MySpace pages online. Maybe that’d swing search engine share some, but we doubt it.

Read more

Google and Artificial Anchors

November 20, 2009

Folks are blinded by Chrome and may miss what is often overlooked: Google’s plumbing. Once you have tired of the shiny, bright chatter about Microsoft’s latest reason for its fear and loathing of Google, you may want to navigate to the USPTO and download 20090287698, “Artificial Anchor for a Document.” Google said:

Methods, systems, and apparatus, including computer program products, for linking to an intra-document portion of a target document includes receiving an address for a target document identified by a search engine in response to a query, the target document including query-relevant text that identifies an intra-document portion of the target document, the intra-document portion including the query relevant text. An artificial anchor is generated, the artificial anchor corresponding to the intra-document portion. The artificial anchor is appended to the address.
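Here is a rough sketch of the general idea, offered as illustration rather than as the claimed method: locate the query relevant passage in the target document and append an artificial anchor that points at it. The “#snippet=” fragment name is my own placeholder, not the scheme in the filing.

```python
from urllib.parse import quote

def artificial_anchor(url: str, document_text: str, query: str, window: int = 60) -> str:
    """Append a fragment identifying the intra-document portion that contains
    the query terms. The fragment syntax here is purely illustrative."""
    position = document_text.lower().find(query.lower())
    if position == -1:
        return url  # no query relevant portion found; return the plain address
    passage = document_text[position:position + window]
    return f"{url}#snippet={quote(passage)}"

if __name__ == "__main__":
    page = "Long introduction ... duplicate documents are detected in a web crawler system ..."
    print(artificial_anchor("http://example.com/page", page, "web crawler"))
```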

The system and method has a multiplicity of uses, and these are spelled out in Googley detail in the claims made for this patent application. In this free Web log, I won’t dive into the implications of artificial anchors. I will let you don your technical scuba gear and surf on the implications of artificial anchors. Chrome is the surface of the Google ocean. Artificial anchors are part of the Google ocean. Big, big difference.

Stephen Arnold, November 21, 2009

I want to disclose to the USPTO itself that no one paid me to be cryptic in this article.

MarkLogic Tames Big Data

November 20, 2009

I spent several hours at the MarkLogic client conference held in Washington, DC, on November 18, 2009. I was expecting another long day of me-too presentations. What a surprise! The conference attracted about 250 people and featured presentations by a number of MarkLogic customers and engineers. There were several points that struck me:

First, unlike the old-fashioned trade show, this program was a combination of briefings, audience interaction, and informal conversations fueled by genuine enthusiasm. Much of that interest came from the people who had used the MarkLogic platform to deliver solutions in very different big data situations. Booz, Allen & Hamilton was particularly enthusiastic. As a former laborer in the BAH knowledge factory, I know that enthusiasm originates in one place—the client. BAH professionals are upbeat *only* when the firm’s customers are happy. BAH described using the MarkLogic platform as a way to solve a number of different client problems.


MarkLogic’s platform applied to an email use case caught the attention of audiences involved in certain types of investigative and data forensics work. Shown is the default interface, which can be customized to the licensee’s requirements.

Second, those in the audience were upfront about their need to find solutions to big data problems—scale, analytics, performance. I assumed that those representing government entities would be looking for ways to respond to President Obama’s mandates. There was an undercurrent of responding to the Administration, but the imperative was the realization that tools like relational databases were not delivering solutions. Some in the audience, based on my observations, were actively looking for new ways to manipulate data. In my view, the MarkLogic system had blipped the radar in some government information technology shops, and the people with problems showed up to learn.

Read more

Google Books, The Nov 14 Edition

November 15, 2009

If you were awake at 11:54 pm Eastern time, you would have seen Google’s “Modifications to the Google Books Settlement.” Prime time for low profile information distribution. I find it interesting that national libraries provided Google an opportunity to do their jobs. Furthermore, despite the revisionism in Sergey Brin’s New York Times editorial, the Google has been chugging away at Google Books for a decade. With many folks up in arms about Google’s pumping its knowledge base and becoming the de facto world library, the Google continues to move forward. Frankly, I am surprised that it has taken those Google users so long to connect Google dots. Google Books embraces more than publishing. Google Books is a small cog in a much larger information system, but the publishing and writing angles have center stage. In my opinion, looking at what the spotlight illuminates may be the least useful place toward which to direct attention. Maybe there’s a knowledge value angle to the Google Books project? You can catch up with Google’s late Friday announcement and enjoy this type of comment:

The changes we’ve made in our amended agreement address many of the concerns we’ve heard (particularly in limiting its international scope), while at the same time preserving the core benefits of the original agreement: opening access to millions of books while providing rights holders with ways to sell and control their work online. You can read a summary of the changes we made here, or by reading our FAQ.

Yep, more opportunities for you, gentle reader, to connect Google dots. What is the knowledge value to Google of book information? Maybe one of the search engine optimization experts will illuminate this dark corner for me? Maybe one of the speakers at an information conference will peek into the wings of the Google Information Theatre?

Stephen Arnold, November 15, 2009

I wish to report to the Advisory Council on Historic Preservation that I was not paid to point out that national libraries abrogated their responsibilities to their nations’ citizens. For this comment, I have received no compensation, either recent or historic. Historical revisionism is an art, not a science. That’s a free editorial comment.

The Google Treadmill System

November 12, 2009

The Google is not in the gym business. The company’s legal eagles find ways of converting wizard whimsy into patents. The tokenspace suite of patent documents does not excite the “Sergey and Larry eat pizza” style of Google watcher. For those who want to get a glimpse of the nuts and bolts in Google’s data management system, check out the treadmill invention by ace Googler, Jeffrey Dean. He had help, of course. The Google likes teams, small teams, but teams nevertheless. Here’s how the invention is described in US7,617,226, “Document Treadmilling System and Method for Updating Documents in a Document Repository and Recovering Storage Space from Invalidated Documents.”

A tokenspace repository stores documents as a sequence of tokens. The tokenspace repository, as well as the inverted index for the tokenspace repository, uses a data structure that has a first end and a second end and allows for insertions at the second end and deletions from the front end. A document in the tokenspace repository is updated by inserting the updated version into the repository at the second end and invalidating the earlier version. Invalidated documents are not deleted immediately; they are identified in a garbage collection list for later garbage collection. The tokenspace repository is treadmilled to shift invalidated documents to the front end, at which point they may be deleted and their storage space recovered.
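To give the quoted mechanism a bit of shape, here is a toy version: new document versions are appended at the tail, superseded versions are flagged invalid rather than removed, and space is reclaimed lazily once invalid entries reach the head. This is my simplification, not the tokenspace design itself; in the patent, still-valid documents are also shifted so that invalidated ones migrate toward the front.

```python
from collections import deque

class TreadmillRepository:
    """Toy repository: updates are appended at one end, stale versions are
    invalidated (not deleted), and storage is recovered only when invalid
    entries reach the other end."""

    def __init__(self):
        self.entries = deque()   # arrival order, oldest at the head
        self.latest = {}         # doc_id -> its current (valid) entry

    def update(self, doc_id, content):
        old = self.latest.get(doc_id)
        if old is not None:
            old["valid"] = False              # invalidate; defer space recovery
        entry = {"doc_id": doc_id, "content": content, "valid": True}
        self.entries.append(entry)            # insert at the "second end"
        self.latest[doc_id] = entry

    def reclaim(self):
        """Pop invalidated entries off the head; return how many were recovered.
        (The treadmilling step that re-appends valid head entries is omitted.)"""
        recovered = 0
        while self.entries and not self.entries[0]["valid"]:
            self.entries.popleft()
            recovered += 1
        return recovered

if __name__ == "__main__":
    repo = TreadmillRepository()
    repo.update("doc1", "version 1")
    repo.update("doc1", "version 2")   # version 1 stays in place, flagged invalid
    print(repo.reclaim())              # -> 1 entry's storage recovered
```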

There are some interesting innovations in this patent document. Manual steps to reclaim storage space are not the main focus. The big idea is that a digital treadmill allows the Google to perform some magic for content updates. The tokenspace is a nifty idea, but the Google has added the endless chain notion. Oh, and there is scale, compression, and access associated with the invention. You can locate the document at http://www.uspto.gov. In my opinion, the tokenspace technology is pretty important. Ah, what’s a tokenspace you ask? Sorry, not in the blog, gentle reader.

Stephen Arnold, November 11, 2009

I don’t think my AdSense check this month was intended for me to write a short blog post calling attention to a system and method that Google would prefer to remain off the radar. Report me to the USPTO. That outfit pushed the info via RSS to me. So, a freebie.

Google Pressures eCommerce Search Vendors

November 6, 2009

Companies like Dieselpoint, Endeca, and Omniture Mercado face a new competitor. The Google has, according to Internet News, “launched Commerce Search, a cloud-based enterprise search application for e-tailers that promises to improve sales conversion rates and simplify the online shopping experience for their customers.” For me the most significant passage in the write up was:

Commerce Search not only integrates the data submitted to Google’s Product Center and Merchant Center but also ties into its popular Google Analytics application, giving e-tailers an opportunity to not only track customer behavior but the effectiveness of the customized search application. Once an e-tailer has decided to give Commerce Search a shot, it uploads an API with all its product catalog, descriptions and customization requirements and then Google shoots back an API with those specifications that’s installed on the Web site. Google also offers a marketing and administration consultation to highlight a particular brand of camera or T-shirt that the retailer wants to prominently place on its now customized search results. It also gives e-tailers full control to create their own merchandising rules so that it can, for example, always display Canon cameras at the top of its digital camera search results or list its latest seasonal items by descending price order.
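The merchandising rules mentioned in that passage amount to a re-ranking layer on top of the result list. A hypothetical sketch of the idea follows; the field names and rule parameters are mine, not Google’s Commerce Search API.

```python
# Illustrative only: a tiny merchandising layer that pins a brand to the top
# of results and orders items by descending price, as the quoted examples suggest.
results = [
    {"title": "Nikon D3000", "brand": "Nikon", "price": 499},
    {"title": "Canon EOS Rebel", "brand": "Canon", "price": 599},
    {"title": "Canon PowerShot", "brand": "Canon", "price": 229},
]

def apply_rules(items, pinned_brand=None, price_descending=False):
    """Re-rank search results according to simple merchandising rules."""
    if price_descending:
        items = sorted(items, key=lambda r: r["price"], reverse=True)
    if pinned_brand:
        # Stable sort: the pinned brand floats to the top, relative order preserved.
        items = sorted(items, key=lambda r: r["brand"] != pinned_brand)
    return items

for r in apply_rules(results, pinned_brand="Canon", price_descending=True):
    print(r["title"], r["price"])
```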

Google’s technical investments in its programmable search engine, context server, and shopping cart service chug along within this new service. Google’s system promises to be fast. Most online shopping services are sluggish. Google knows how to deliver high speed performance. Combining Google’s semantic wizardry with low latency results puts some of the leading eCommerce vendors in a technology arm lock.

Some eCommerce vendors have relied on Intel to provide faster CPUs to add vigor to older eCommerce architectures. There are some speed gains, but Google delivers speed plus important semantic enhancements that offer other performance benefits. One example is content processing. Once changes are pushed to Google or spidered by Google from content exposed to Google, the indexes update quickly. Instead of asking a licensee of a traditional eCommerce system to throw hardware at a performance bottleneck or pay for special system tuning, the Google just delivers speed for structured content processed from the Google platform.

In my opinion, competitors will point out that Google is inexperienced in eCommerce. Google may appear to be a beginner in this important search sector. Looking more deeply into the engineering resources responsible for Commerce Search, however, one finds that Google has depth. I hate to keep mentioning folks like Ramanathan Guha, but he is one touchstone whose deep commercial experience has influenced this Google product.

How will competitors like Dieselpoint, Endeca, and Omniture Mercado respond? The first step will be to downplay the importance of this Google initiative. Next, I expect to learn that Microsoft Fast ESP has a better, faster, and cheaper eCommerce solution that plays well with SharePoint and Microsoft’s own commerce server technology. Finally, search leaders such as Autonomy will find a marketing angle to leave Google in the shadow of clever positioning. But within a year, my hunch is that Google’s Commerce Search will have helped reshape the landscape for eCommerce search. Google may not be perfect, but its products are often good enough, fast, and much loved by those who cannot imagine life without Google.

Stephen Arnold, November 6, 2009

I want to disclose to the Department of the Navy that none of these vendors offered me so much as a how de doo to write this article.

Metadata Now Fair Game

November 2, 2009

The US legal system has spoken. I saw the ZDNet UK story “Watch Out, Your Metadata Is Showing” and chuckled. Not long ago in goose years, legal eagles realized that the Word fast save function preserved text that had once been in a document. Sending a document with fast save activated could allow the curious to see bits and pieces believed to have been deleted. Exciting stuff. Now the Arizona Supreme Court, according to Simon Bisson and Mary Branscombe, “has decided that the metadata of a document is governed by the same rules as the document.” With value-added indexing coming to most SharePoint systems, there will be some interesting discussions about which metadata belong to the document and which metadata are part of another, broader system. If you read vendors’ enthusiastic descriptions of what their smart software will assign to documents, users, and system processes, you will enter into an interesting world. How exciting will it be? Consider a document that has metadata such as date of creation, file format, and the name of the author. Now consider a document that has metadata pertaining to the “aboutness” of a document, who looked at the document, who made which change and when, and who opened the document and for how long. Interesting stuff in my opinion. The courts will be entering data space soon, and I think that journey will be difficult. Next up? A metadata specialist at your local Top 10 law firm. Get your checkbook ready.
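To make that distinction concrete, here is a small sketch contrasting plain file metadata with the kind of value-added metadata a smart indexing layer might attach. The field names and the enrichment examples are illustrative assumptions, not any vendor’s schema.

```python
import datetime
import os

def basic_metadata(path):
    """The plain, everyone-expects-it metadata: format, dates, size."""
    info = os.stat(path)
    return {
        "file_format": os.path.splitext(path)[1],
        "modified": datetime.datetime.fromtimestamp(info.st_mtime).isoformat(),
        "size_bytes": info.st_size,
    }

# Hypothetical value-added metadata of the kind discovery teams may now argue over:
# "aboutness", who viewed the document, who changed what and when.
enriched_metadata = {
    "aboutness": ["metadata", "discovery", "records management"],
    "viewed_by": [("jsmith", "2009-11-01T09:14:00"), ("counsel", "2009-11-02T16:40:00")],
    "edits": [{"user": "jsmith", "when": "2009-10-30", "change": "revised section 2"}],
}

if __name__ == "__main__":
    print(basic_metadata(__file__))
    print(sorted(enriched_metadata.keys()))
```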

Stephen Arnold, November 2, 2009

I say, no pay.

