The Google Treadmill System
November 12, 2009
The Google is not in the gym business. The company’s legal eagles find ways of converting wizard whimsy into patents. The tokenspace suite of patent documents does not excite the “Sergey and Larry eat pizza” style of Google watcher. For those who want a glimpse of the nuts and bolts of Google’s data management system, check out the treadmill invention by ace Googler Jeffrey Dean. He had help, of course. The Google likes teams, small teams, but teams nevertheless. Here’s how the invention is described in US7,617,226, “Document Treadmilling System and Method for Updating Documents in a Document Repository and Recovering Storage Space from Invalidated Documents”:
A tokenspace repository stores documents as a sequence of tokens. The tokenspace repository, as well as the inverted index for the tokenspace repository, uses a data structure that has a first end and a second end and allows for insertions at the second end and deletions from the front end. A document in the tokenspace repository is updated by inserting the updated version into the repository at the second end and invalidating the earlier version. Invalidated documents are not deleted immediately; they are identified in a garbage collection list for later garbage collection. The tokenspace repository is treadmilled to shift invalidated documents to the front end, at which point they may be deleted and their storage space recovered.
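The patent prose is dense, but the mechanism is easy to model. Here is a toy sketch from this addled goose of the treadmill idea: append updates at one end, invalidate the old version in place, and sweep reclaimable space off the other end. This is my illustration of the concept, not Google’s implementation.

```python
from collections import deque

class TreadmillRepository:
    """Toy model of the treadmill idea: append updates at one end,
    invalidate old versions in place, reclaim space from the other end.
    An illustration of the patent's concept, not Google's code."""

    def __init__(self):
        self.docs = deque()    # records: {"id", "tokens", "valid"}
        self.garbage = set()   # doc ids awaiting garbage collection

    def update(self, doc_id, tokens):
        # Invalidate any earlier version; do not delete it yet.
        for record in self.docs:
            if record["id"] == doc_id and record["valid"]:
                record["valid"] = False
                self.garbage.add(doc_id)
        # Insert the updated version at the second (tail) end.
        self.docs.append({"id": doc_id, "tokens": tokens, "valid": True})

    def treadmill(self):
        # "Treadmilling": rotate records off the front end. Valid
        # documents are re-appended at the tail; invalidated ones are
        # dropped and their storage space recovered.
        for _ in range(len(self.docs)):
            record = self.docs.popleft()
            if record["valid"]:
                self.docs.append(record)
            else:
                self.garbage.discard(record["id"])  # space recovered

repo = TreadmillRepository()
repo.update("doc1", ["hello", "world"])
repo.update("doc1", ["hello", "treadmill"])  # old version invalidated
repo.treadmill()                             # old version swept away
```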
There are some interesting innovations in this patent document. Manual steps to reclaim storage space are not the main focus. The big idea is that a digital treadmill allows the Google to perform some magic for content updates. The tokenspace is a nifty idea, but the Google has added the endless chain notion. Oh, and there are scale, compression, and access benefits associated with the invention. You can locate the document at http://www.uspto.gov. In my opinion, the tokenspace technology is pretty important. Ah, what’s a tokenspace, you ask? Sorry, not in the blog, gentle reader.
Stephen Arnold, November 12, 2009
I don’t think my AdSense check this month was payment for writing a short blog post calling attention to a system and method that Google would prefer to keep off the radar. Report me to the USPTO. That outfit pushed the info to me via RSS. So, a freebie.
Google Pressures eCommerce Search Vendors
November 6, 2009
Companies like Dieselpoint, Endeca, and Omniture Mercado face a new competitor. The Google has, according to Internet News, “launched Commerce Search, a cloud-based enterprise search application for e-tailers that promises to improve sales conversion rates and simplify the online shopping experience for their customers.” For me, the most significant passage in the write-up was:
Commerce Search not only integrates the data submitted to Google’s Product Center and Merchant Center but also ties into its popular Google Analytics application, giving e-tailers an opportunity to not only track customer behavior but the effectiveness of the customized search application. Once an e-tailer has decided to give Commerce Search a shot, it uploads an API with all its product catalog, descriptions and customization requirements and then Google shoots back an API with those specifications that’s installed on the Web site. Google also offers a marketing and administration consultation to highlight a particular brand of camera or T-shirt that the retailer wants to prominently place on its now customized search results. It also gives e-tailers full control to create their own merchandising rules so that it can, for example, always display Canon cameras at the top of its digital camera search results or list its latest seasonal items by descending price order.
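The “merchandising rules” notion is the most concrete piece of the write-up, so here is a toy sketch of how such a re-ranking rule might work. The rule format and product fields are my inventions; Google has not published the internals of Commerce Search.

```python
# Hypothetical illustration of merchandising rules applied on top of
# relevance-ranked results. Field names and rule format are invented.

products = [
    {"name": "Nikon D90", "brand": "Nikon", "price": 899.0, "score": 0.91},
    {"name": "Canon T1i", "brand": "Canon", "price": 749.0, "score": 0.87},
    {"name": "Canon G11", "brand": "Canon", "price": 469.0, "score": 0.83},
]

def apply_rules(results, pinned_brand=None, sort_key=None, descending=True):
    """Re-rank relevance-ordered results according to merchant rules."""
    ranked = sorted(results, key=lambda p: p["score"], reverse=True)
    if sort_key:       # e.g., list seasonal items by descending price
        ranked = sorted(ranked, key=lambda p: p[sort_key], reverse=descending)
    if pinned_brand:   # e.g., always display Canon at the top
        ranked = ([p for p in ranked if p["brand"] == pinned_brand] +
                  [p for p in ranked if p["brand"] != pinned_brand])
    return ranked

for p in apply_rules(products, pinned_brand="Canon"):
    print(p["name"])   # Canon T1i, Canon G11, Nikon D90
```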
Google’s technical investments in its programmable search engine, context server, and shopping cart service chug along within this new service. Google’s system promises to be fast. Most online shopping services are sluggish. Google knows how to deliver high speed performance. Combining Google’s semantic wizardry with low latency results puts some of the leading eCommerce vendors in a technology arm lock.
Some eCommerce vendors have relied on Intel to provide faster CPUs to add vigor to older eCommerce architectures. There are some speed gains, but Google delivers speed plus important semantic enhancements that offer other performance benefits. One example is content processing. Once changes are pushed to Google or spidered from content exposed to its crawler, the indexes update quickly. Instead of asking a licensee of a traditional eCommerce system to throw hardware at a performance bottleneck or pay for special system tuning, the Google just delivers speed for structured content processed on its platform.
In my opinion, competitors will point out that Google is inexperienced in eCommerce. Google may appear to be a beginner in this important search sector. Look more deeply into the engineering resources behind Commerce Search, however, and one finds that Google has depth. I hate to keep mentioning folks like Ramanathan Guha, but he is one touchstone whose deep commercial experience has influenced this Google product.
How will competitors like Dieselpoint, Endeca, and Omniture Mercado respond? The first step will be to downplay the importance of this Google initiative. Next I expect to learn that Microsoft Fast ESP has a better, faster, and cheaper eCommerce solution that plays well with SharePoint and Microsoft’s own commerce server technology. Finally, search leaders such as Autonomy will find a marketing angle to leave Google in the shadow of clever positioning. But within a year, my hunch is that Google’s Commerce Search will have helped reshape the landscape for eCommerce search. Google may not be perfect, but its products are often good enough, fast, and much loved by those who cannot imagine life without Google.
Stephen Arnold, November 6, 2009
I want to disclose to the Department of the Navy that none of these vendors offered me so much as a how de doo to write this article.
Metadata Now Fair Game
November 2, 2009
The US legal system has spoken. I saw the ZDNet UK story “Watch Out, Your Metadata Is Showing” and chuckled. Not long ago in goose years, legal eagles realized that Word’s fast save function preserved text that had supposedly been deleted from a document. Sending a document with fast save activated could allow the curious to see those bits and pieces. Exciting stuff. Now the Arizona Supreme Court, according to Simon Bisson and Mary Branscombe, “has decided that the metadata of a document is governed by the same rules as the document.” With value-added indexing coming to most SharePoint systems, there will be some interesting discussions about which metadata belong to the document itself and which metadata are part of another, broader system. If you read vendors’ enthusiastic descriptions of what their smart software will assign to documents, users, and system processes, you will enter an interesting world.

How exciting will it be? Consider a document that has metadata such as date of creation, file format, and the name of the author. Now consider a document that has metadata pertaining to the “aboutness” of the document, who looked at it, who made which change and when, and who opened it and for how long. Interesting stuff, in my opinion. The courts will be entering dataspace soon, and I think that journey will be difficult. Next up? A metadata specialist at your local Top 10 law firm. Get your checkbook ready.
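To make the two tiers concrete, here is a small sketch of the difference. The field names are invented for illustration; no vendor’s schema is implied.

```python
# Tier one: intrinsic properties most systems record automatically.
basic_metadata = {
    "created": "2009-10-30T14:22:00",
    "format": "application/msword",
    "author": "jsmith",
}

# Tier two: value-added metadata that indexing systems increasingly
# assign, and that courts may now treat as part of the document itself.
enriched_metadata = {
    "aboutness": ["merger", "due diligence", "Q3 forecast"],
    "viewers": [("jdoe", "2009-11-01T09:14:00", "11m32s")],
    "revisions": [("jsmith", "2009-10-31T16:05:00", "deleted paragraph 4")],
}
```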
Stephen Arnold, November 2, 2009
I say, no pay.
In-Q-Tel Pumps Cash into Visible Technologies
October 21, 2009
Overflight snagged a news item called “Visible Technologies Announces Strategic Partnership with In-Q-Tel”. In-Q-Tel is the investment arm of one unit of the US government. Visible Technologies is a content processing company that ingests Web logs, Tweets, and other social content and extracts information from these data.
The company said:
Through its comprehensive solution set, Visible Technologies helps organizations adopt new ways of gaining actionable insight from social media conversations. By using Visible Technologies’ platform, organizations both big and small can harness business intelligence derived from social media data to drive strategic direction and tactical implementation of marketing initiatives, improve the customer experience and grow business. Visible Technologies’ end-to-end suite, powered by the truCAST engine, encompasses global features that enable real-time visibility into online social conversations regardless of where dialogue is occurring. Additionally, the company’s truREPUTATION solution is a best-in-class online reputation management service that provides both individuals and brands an effective way to repair, protect and proactively promote their reputation in search engine results.
The company is no spring chicken. Founded in 2003, Visible Technologies has a range of monitoring, reputation, and content analysis tools. The firm’s social media monitoring system is a newer weapon in the company’s arsenal. With police and intelligence agencies struggling to deal with social media, an investment in a firm focusing on this type of content makes clear that the US government wants to keep pace with these content streams.
Stephen Arnold, October 21, 2009
Guha and the Google Trust Method Patent
October 16, 2009
I am a fan of Ramanathan Guha. I had a conversation not long ago with a person who doubted the value of my paying attention to Google’s patent documents. I can’t explain why I find these turgid, chaotic, and cryptic writings of interest. I read stuff about cooling ducts and slugging ads into anything that can be digitized, and I yawn. Then, oh, happy day, one of Google’s core wizards works with the attorneys, and a meaningful patent document arrives in the Harrod’s Creek goose nest.
Today is such a day. The invention is “Search Result Ranking Based on Trust,” which you can read courtesy of the ever reliable USPTO by searching for US7,603,350 (filed in May 2006). Dr. Guha’s invention is described in the patent this way:
A search engine system provides search results that are ranked according to a measure of the trust associated with entities that have provided labels for the documents in the search results. A search engine receives a query and selects documents relevant to the query. The search engine also determines labels associated with selected documents, and the trust ranks of the entities that provided the labels. The trust ranks are used to determine trust factors for the respective documents. The trust factors are used to adjust information retrieval scores of the documents. The search results are then ranked based on the adjusted information retrieval scores.
Now before you email me and ask, “Say, what?”, let me make three observations:
- The invention is a component of a far larger data management technology initiative at Google. The implications of the research program are significant and may disrupt the stressed world of traditional RDBMS vendors at some point.
- The notion of providing a “score” that signals the “reliability” or lack thereof is important in consumer searches, but it has some interesting implications for other sectors; for example, health.
- The plumbing to perform “trust” scoring on petascale data flows gives me confidence to assert that Microsoft and other Google challengers are going to have to get in the game. Google is playing 3D chess and other outfits are struggling with checkers.
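Stripped of the patent prose, the ranking flow in the abstract is short. Here is a minimal sketch; the patent specifies the pipeline, but the averaging of trust ranks and the multiplicative adjustment below are my assumptions, not Dr. Guha’s formulas.

```python
# Sketch of the trust-adjusted ranking flow described in US7,603,350.
# The combination functions here are assumptions for illustration.

trust_rank = {"mayoclinic.org": 0.95, "randomblog.example": 0.20}

def trust_factor(labeling_entities):
    """Combine the trust ranks of the entities that labeled a document."""
    ranks = [trust_rank.get(e, 0.5) for e in labeling_entities]
    return sum(ranks) / len(ranks) if ranks else 0.5

def rank(results):
    """results: list of (doc, ir_score, labeling_entities) tuples."""
    adjusted = [(doc, ir_score * trust_factor(entities))
                for doc, ir_score, entities in results]
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)

hits = [("doc_a", 0.80, ["randomblog.example"]),
        ("doc_b", 0.70, ["mayoclinic.org"])]
print(rank(hits))   # doc_b overtakes doc_a on trust
```

Note what the sketch makes obvious: a highly relevant document from an untrusted labeler can fall below a less relevant document vouched for by a trusted entity. That is the health-sector implication in a nutshell.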
You can read more about Dr. Guha in my Google Version 2.0. He gets an entire chapter (maybe 30 pages of 10 point type) for a suite of inventions that make it possible for Google to be the “semantic Web”. Clever company, brilliant guy, Guha is.
Stephen Arnold, October 16, 2009
LexisNexis Jumps on Semantic Bandwagon
October 15, 2009
Pure Discovery, a Dallas-based search and content processing company, has landed a mid-sized tuna: LexisNexis. Owned by publishing giant Reed Elsevier, LexisNexis faces some strong downstream water. The $1 billion plus operation is paddling its dugout canoe upstream. Government agencies, outfits like Gov Resources, and the Google are offering products and services that address the squeals from law firms. What is the cause of the legal eagle squeaks? The cost of running searches on commercial online services like LexisNexis and Westlaw, among others such as Questel. Clients are putting caps on some law firm expenditures. Even white shoe outfits in New York and Chicago are feeling the pinch.
I saw one short news item about this tie-up in Search Engine Watch.
Patent searching is a particularly exciting field of investigation. If you click over to the responsive USPTO, you can search patents for free. Tip: Print out the search hints before you begin. I am not sure who is responsible for this wonderful search system, but it is a wonder.
Semantic technology along with other sophisticated content processing tools can make life a little – notice the word “little” – easier for those conducting patent research. Even the patent examiners have to use third party systems because the corpus of the USPTO is a bit like a buggy without a horse in my opinion.
The company that LexisNexis tapped to provide its semantic technology is Pure Discovery in Dallas, Texas. I had one reference to the firm in my Overflight service and that was to an individual named Adam Keys, Twitter name therealadam. Mr. Keys left Pure Discovery in 2006 after two years at the company. I had a handwritten note to the effect that venture funding was provided in part by Zon Capital Partners in Princeton, New Jersey. I have little detail about how the Pure Discovery system works.
Here’s a description of the company I pulled from Zon’s Web site:
Pure Discovery (Dallas, TX) has developed enterprise semantic web software. Its offering combines automated semantic discovery with a peer networking architecture to transform static networks into dynamic ecosystems for knowledge discovery.
I snagged a few items from the firm’s Web site.
The product lineup consists of KnowledgeGraph products. These include:

- PD BrainLibrary: “BrainLibrary is a breakthrough technology that harnesses the collective intelligence of organizations and their people in ways that have never been possible before.”
- PD Transparent Concept Search: “PD Concept Search has completely removed the top off the black box and for the first time ever, users are not only able to see what has been learned by the system, but also use our QueryCloud application to control it.”
- PD QueryCloud Visual Query Generator: “QueryCloud then lets users control what terms or phrases are used, not used, emphasized or de-emphasized. All with the simple click of a button.”
- PD Clustering: “PD Clustering dynamically orders similar documents into clusters enabling users to browse data by semantically related groups rather than looking at each individual document. PD Clustering is fast enough to cluster even the largest of document populations with a benchmark of over 80 million pages clustered in a 48 hr period on a single machine.”
- PD Near-Dupe Identification: “PureDiscovery’s Near-Dupe Identification Engine provides instant value to any application by detecting and grouping near duplicate documents. Identifying documents with these slight variances results in dramatic savings in time wasted looking at the same document again and again.”

This information is from the Pure Discovery Web site.
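Pure Discovery does not say how its near-dupe engine works. Shingling documents and comparing the overlap of the shingle sets is one standard technique for the problem; this sketch illustrates the idea and makes no claim about the company’s method.

```python
# Near-duplicate detection via word shingles and Jaccard similarity.
# Illustrative only; not Pure Discovery's algorithm.

def shingles(text, k=4):
    """Break text into overlapping k-word 'shingles'."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Overlap of two shingle sets: 1.0 identical, 0.0 disjoint."""
    return len(a & b) / len(a | b) if a | b else 0.0

def near_dupes(docs, threshold=0.7):
    """Return pairs of doc ids whose shingle overlap meets the threshold."""
    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    ids = sorted(sets)
    return [(x, y) for i, x in enumerate(ids) for y in ids[i + 1:]
            if jaccard(sets[x], sets[y]) >= threshold]

pairs = near_dupes({
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy cat",
    "c": "completely different text about enterprise search",
})
print(pairs)   # [('a', 'b')]
```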
The company also offers its Transparent Concept Search Query Cloud.
The software is available for specific vertical markets and niches; for example, litigation support, “human capital management” (maybe human resources or knowledge management?), intellectual property, and homeland security and defense.
These are sophisticated functions. I look forward to examining the LexisNexis patent documents using this new tool. Perhaps LexisNexis has found a software bullet to kill the beasties chewing into its core business. If not, LexisNexis will face that rushing torrent without a paddle.
As more information flows to me, I will update this write up.
Stephen Arnold, October 15, 2009
I wrote this short post without so much as a thank you from anyone.
Exclusive Interview with CTO of BrightPlanet Now Available
October 13, 2009
William Bushee, BrightPlanet’s Vice President of Development and the company’s chief technologist, spoke with Stephen E. Arnold. The exclusive interview appears in the Search Wizards Speak series. Mr. Bushee was among the first search professionals to tackle Deep Web information harvesting. The “Deep Web” refers to content that traditional Web indexing systems cannot access. Deep Web sites include most major news archives as well as thousands of specialized sources. These sources typically represent the best, most definitive content for their subject areas. For example, in the health sciences field, the Centers for Disease Control, National Institutes of Health, PubMed, Mayo Clinic, and American Medical Association all maintain Deep Web sites, often inaccessible to conventional Web crawlers such as those run by Google and Yahoo. BrightPlanet supported the ArnoldIT.com analysis of the firm’s system. As a result of this investigation, the technology warranted an in-depth discussion with Mr. Bushee.
The wide-ranging interview focuses on BrightPlanet’s search, harvest, and OpenPlanet technology. Mr. Bushee told Search Wizards Speak:
There are two distinct problems that BrightPlanet focuses on for our customers. First we have the ability to harvest content from the Deep Web. And second, we can use our OpenPlanet framework to add enrichment, storage and visualization to harvested content. As more information is being published directly to the Web, or published only on the Web, it is becoming critical that researchers and analysts have better ways of harvesting this content. However, harvesting alone won’t solve the information overload problems researchers are faced with today. The answer to a research project cannot be simply finding 5,000 raw documents, no matter how good they are. Researchers are already overwhelmed with too many links from Google and too much information in general. The answer needs to be better harvested content (not search), better analytics, better enrichment and better visualization of intelligence within the content – this is where BrightPlanet’s OpenPlanet framework comes into play. While BrightPlanet has a solid reputation within the Intelligence Community helping to fight the “War on Terror,” our next mission is to be known as the commercial and academic leaders in harvesting relevant, high quality content from the Deep Web for those who need content for research, business intelligence or analysis.
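BrightPlanet’s harvester is proprietary, but the core Deep Web idea is simple: instead of following hyperlinks, drive a source’s own search form and page through the results. A rough sketch follows; the endpoint and parameter names are invented for illustration.

```python
# Illustrative only: Deep Web content sits behind search forms that
# link-following crawlers never submit. A harvester drives the form
# directly. The URL and parameter names below are invented.
import requests

def harvest(query, pages=3):
    results = []
    for page in range(1, pages + 1):
        resp = requests.post(
            "https://archive.example.gov/search",   # hypothetical source
            data={"q": query, "page": page},
            timeout=30,
        )
        resp.raise_for_status()
        results.append(resp.text)  # real systems parse records here
    return results

# docs = harvest("H1N1 vaccination guidance")
```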
You can read the full text of the interview at http://www.arnoldit.com/search-wizards-speak/brightplanet.html. More information about the company’s products and services is available at http://www.brightplanet.com. Mr. Bushee’s technology has gained solid support from some professional researchers and intelligence agencies. BrightPlanet has moved “beyond search” with its suite of content processing technology.
Stephen Arnold, October 13, 2009
Google on Path to Becoming the Internet
September 28, 2009
I thought I made Google’s intent clear in Google Version 2.0. The company provides a user with access to content within the Google index. The inventions reviewed briefly in The Google Legacy and in greater detail in Google Version 2.0 explain that information within the Google data management system can be sliced, diced, remixed, and output as new information objects. The process is similar to what an MBA does at Booz, McKinsey, or any other rental firm for semi-wizards: intakes become high value outputs. I was delighted to read Erick Schonfeld’s “With Google Places, Concerns Rise that Google Just Wants to Link to Its Own Content.” The story makes clear that folks are now beginning to see that Google is a digital Gutenberg and a different type of information company. Mr. Schonfeld wrote:
The concerns arise, however, back on Google’s main search page, where Google is indexing these Places pages. Since Google controls its own search index, it can push Google Places more prominently if it so desires. There isn’t a heck of a lot of evidence that Google is doing this yet, but the mere fact that Google is indexing these Places pages has the SEO world in a tizzy. And Google is indexing them, despite assurances to the contrary. If you do a search for the Burdick Chocolate Cafe in Boston, for instance, the Google Places page is the sixth result, above results from Yelp, Yahoo Travel, and New York Times Travel. This wouldn’t be so bad if Google wasn’t already linking to itself in the top “one Box” result, which shows a detail from Google Maps. So within the top ten results, two of them link back to Google content.
Directories are variants of vertical search. Google is much more than rich directory listings.
Let me give one example, and you are welcome to snag a copy of my three Google monographs for more examples.
Consider a deal between Google and a mobile telephone company. The users of the mobile telco’s service run a query. The deal makes it possible for the telco to use the content in the Google system. No query goes into the “world beyond Google”. The reason is that Google and the telco gain control over latency, content, and advertising. This makes sense. Let’s assume that this is a deal that Google crafts with an outfit like T-Mobile. Remember: this is a hypothetical example. When I use my T-Mobile device to get access to the T-Mobile Internet service, the content comes from Google with its caches, distributed data centers, and proprietary methods for speeding results to a device. In this example, as a user, I just want fast access to content that is pretty routine; for example, traffic, weather, flight schedules. I don’t do much heavy lifting from my flakey BlackBerry or old-person-hostile iPhone / iTouch device. Google uses its magical ability to predict, slice, and dice to put what I want in my personal queue so it is ready before I know I need the info. Think “I am feeling doubly lucky”, a “real” patent application by the way. T-Mobile wins. The user wins. The Google wins. The stuff not in the Google system loses.
Interesting? I think so. But the system goes well beyond directory listings. I have been writing about Dr. Guha, Simon Tong, Jeff Dean, and the Halevy team for a while. The inventions, systems and methods from this group have revolutionized information access in ways that reach well beyond local directory listings.
The Google has been pecking away for 11 years, and I am pleased that some influential journalists and analysts are beginning to see the shape of the world’s first trans-national information access company. Google is the digital Gutenberg, well into the process of moving info and data into a hyper state. Google is becoming the Internet. If one is not “in” Google, one may not exist for a certain sector of the Google user community. Googleo ergo sum.
Stephen Arnold, September 28, 2009
Google Waves Build
September 24, 2009
I am a supporter of Wave. I wrote a column about Google WAC-ing the enterprise: W means Wave, A is Android, and C represents Chrome. I know that Google’s consumer focus is the pointy end of the Google WAC thrust, but more information about Wave is now splashing around my webbed feet here in rural Kentucky. You can take a look at some interesting screenshots plus commentary in “Google Wave Developer Preview: Screenshots.” Perhaps you will assert, “Hey, addled goose, this is not search.” I reply, “Oh, yes, it is.” The notion of eye candy is like lipstick on a pig. Wave is a new animal that will carry you part of the way into dataspace.
Stephen Arnold, September 24, 2009
What If Google Books Goes Away?
September 21, 2009
I had a talk with one of my partners this morning. The TechRadar article “Google Books Smacked Down by US Government” was the trigger. This Web log post captures the consequences portion of our discussion. I am not sure Google, authors, or any other pundit embroiled in the dust-up over Google Books will agree with these points. That’s okay. I am capturing highlights for myself. If you have forgotten this function of the Beyond Search Web log, quit reading or look at the editorial policy for this marketing/diary publication.
Let’s jump into the discussion in medias res. The battle is joined, and at this time Google is on the defensive. Keep in mind that Google has been plugging away at this Google Books “project” since 2000 or 2001, when it made a key hire from Caere (now folded into Nuance) to turbocharge the Books effort.
Who is David? Who is Goliath?
With nine years of effort under its belt, Google will get a broken snout if the Google Books project stops. Now, let’s assume that the courts stop Google. What might happen?
First, Google could just keep on scanning. Google lawyers will do lawyer-type things. The wheels of justice will grind forward. With enough money and lawyers, Google can buy time. Let’s face it. Publishers could run out of enthusiasm or cash. If the Google keeps on scanning, discourse will deteriorate, but the acquisition of data for the Google knowledge base and for Google repurposing keeps on keeping on.
Second, Google might agree, shut up shop, and go directly to authors with an offer to buy rights to their work. I have four or five publishers right now. I would toss them overboard for a chance to publish my next monograph on the Google system, let Google monetize it any way it sees fit, and give me a percentage of the revenue. Heck, if I get a couple of hundred a month from the Google, I am ahead of the game. Note this: none of my publishers are selling very many expensive studies right now. The for-fee columns I write produce a pittance as well. One publisher cut my pay by 30 percent as part of a shift to a four-day week and a trimmed publishing schedule. Heck, I love my publishers, but I love an outfit that pays money more. I think quite a few authors would find publishing on the Google Press most interesting. If that happens, the Google Books project has a gap, but going forward, Google has the info, and the publishers and non-participating authors have a different type of competitive problem.
Third, Google cuts a new deal, adjusts the terms, and keeps on scanning books. Google’s management throws enough bird feed to the flock. Google is secure in its knowledge that the future belongs to a trans-national digital information platform stuffed with digital information of various types. No publisher or group of publishers has a comparable platform. Microsoft and Yahoo were in the book game and bailed out. Perhaps their platforms can at some point in the future match Google’s. But my hunch is that the critics of Google’s book project are not looking at the value of the information to Google’s knowledge base, Google’s repurposing technologies, and Google’s next generation dataspace applications. Because these are dark corners, the bright light of protest is illuminating the dust and mice only.
One theme runs through these three possibilities. Google gets information. In this game, the publishers have lost but have not recognized it. Without a better idea and without an alternative to the irreversible erosion of libraries, Google is not the miserable little worm that so many want the company to be. Just my opinion.
Stephen Arnold, September 21, 2009