Cuil Your Jets: Take Offs Are Easier than Landings

July 28, 2008

Digging through the rose petals is tough going. Cuil, it seems, has charmed those interested in Web search.

Balanced Comments Are Here

Among the more balanced commentaries are:

  • Search maven Danny Sullivan, here, says, “Can any start-up search engine ‘be the next Google’? Many have wondered this, and today’s launch of Cuil (pronounced ‘cool’) may provide the best test case since Google itself overtook more established search engines.”
  • Michael Arrington, TechCrunch here, says, “Cuil does a good job of guessing what we’ll want next and presents that in the top right widget. That means Cuil saves time for more research based queries.”
  • David Utter, WebProNews here, says, “The real test for Cuil when it comes back will be how well it handles the niche queries people make all the time, expecting a solid result from very few words.”

Now, I don’t want to pull harder on this cool search stallion’s bit. I do want to offer several observations:

First, the size of an index doesn’t matter. If I am looking for the antidote to save a child’s life, the system need only return one result: the name of the antidote. The “size matters” problem surfaced decades ago when ABI/INFORM, a for-fee database with typical annual index growth of about 50,000 new records, found itself challenged as “too small” by a company called Management Contents. Predicasts jumped on the bandwagon. The number of entries in the index does not correlate with satisfying a user’s query. The size of the index provides very useful data that can be used to enhance a search result, but size in and of itself does not translate to “good results”. For example, on Cuil, run the query beyond search. You will see this Web log’s logo mapped to another site. This means nothing to me, but it shows that one must look beyond the excitement of a new system and explore it critically.

Second, the key to consumer search engines is dealing with the average user, who types 2.3 terms per query. The test query spears on Cuil returns the expected britney spears hits. Enter the term britny, and you get very similar results, but the graphics rotate, plucking an image from one site and mashing it into the “hit”. Enter the query “brittany” and you get zero hits for Ms. Spears, superstar. The fuzzy spelling logic and the synonym expansion are not yet tailored for the average user who, if I recall a comment made by Googler Jeff Dean several years ago, can spell Ms. Spears’ name more than 400 ways.
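For the two or three readers curious about what “fuzzy spelling logic” actually involves, here is a minimal sketch of one common approach: edit-distance matching against a known vocabulary. The vocabulary and the threshold are mine, invented for illustration; real engines lean on query logs and statistical language models, not a hard-coded list.

```python
# Minimal sketch of edit-distance spelling correction for queries.
# The vocabulary and threshold are illustrative only.

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

VOCABULARY = ["britney spears", "britney", "spears"]

def correct(query: str, max_distance: int = 2) -> str:
    """Return the closest vocabulary term within max_distance, else the query."""
    best = min(VOCABULARY, key=lambda term: edit_distance(query, term))
    return best if edit_distance(query, best) <= max_distance else query

print(correct("britny"))    # one edit away, corrected to "britney"
print(correct("brittany"))  # three edits away, left alone: one reason it can miss
```

The sketch also shows why a tight threshold can whiff on “brittany”: it is three edits away from “britney”, and it is a legitimate word in its own right.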

Third, I turned on safe search and ran my “brittany” query. Here’s what I saw in the inset that allows me to search by category.

[Image: Cuil safe search category inset showing a Playboy entry]

I like Playboy bunnies, and we have dozens of them hanging around the computer lab here in Harrods Creek. However, in some of the libraries in the Commonwealth of Kentucky, a safe search function that returns a hutch of Playboy bunnies can create some excitement.

Fourth, it is not clear to me which learnings from WebFountain, from Dr. Patterson’s Google patent documents, and from Mr. Monier’s AltaVista.com/eBay/Google experiences have or have not found their way into this service. Search is a pretty difficult challenge, as Microsoft’s struggles over the last 12 or 13 years attest. My hunch is that there are some facets to the intellectual property within Cuil that warrant a lawyer with a magnifying glass.

Net Net

I applaud the Cuil team for getting a service up and running. Powerset was slow out of the starting blocks and wrangled a payday with a modest demo. Cuil, in a somewhat snappier way, launched a full service. Over the coming weeks and months, the issues of precision, recall, relevance, synonym expansion, and filters that surprise will be resolved.
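Since precision and recall get tossed around loosely, here is the arithmetic behind the two terms as a throwaway Python sketch. The document sets are invented; this is the definition, not a measurement of Cuil.

```python
# Toy illustration of precision and recall; the result sets are invented.
relevant = {"doc1", "doc2", "doc3", "doc4"}      # documents that actually answer the query
retrieved = {"doc2", "doc3", "doc9", "doc10"}    # documents the engine returned

true_positives = relevant & retrieved
precision = len(true_positives) / len(retrieved)  # how much of what came back is useful
recall = len(true_positives) / len(relevant)      # how much of the useful stuff came back

print("precision=%.2f, recall=%.2f" % (precision, recall))  # precision=0.50, recall=0.50
```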

I don’t want to suggest this is a Google killer, for several reasons. First, I learned from a respected computer scientist that a Gmail address set up for a test and never released seemed to have been snagged in a Cuil crawl. Subsequent tests showed the offending email address was no longer in the index. My thought was that the distance between Cuil and Google might not be so great. Most of the Cuil team are Xooglers, and some share the Stanford computer science old-school spirit. Therefore, I want to see exactly how close or how far apart Cuil and Google are.

Second, the issue of using images from one site to illustrate a false drop on another site must be resolved. I don’t care, but some may. Here’s an example of this error for the query beyond search.

[Image: ArnoldIT graphic displayed on another site’s hit]

If this happens to a person more litigious than I, Cuil will be spending some of its remaining $33 million in venture funds to battle an aggrieved media giant. Google has learned how testy Viacom is over snippets of Beavis and Butt-head. Cuil may enjoy that experience as well.

To close, exercise Cuil. I will continue to monitor the service. I plan to reread Dr. Patterson’s Google patent documents this week as well. If you want to know what she invented when working for the GOOG, you can find an eight- or nine-page discussion of the inventions in Google Version 2.0. A general “drill down” notion is touched upon in these documents, in my opinion.

And, keep in mind, the premise of The Google Legacy is that Google will be with us for a long time. Cuil is just one example of the Google “legacy”; that is, Xooglers who build on Google’s approach to cloud-based computing services.

Stephen Arnold, July 28, 2008

Cool Discussion of Cuil

July 28, 2008

Xooglers Anna Patterson and Louth man Tom Costello (the husband-and-wife brains behind Xift, which sold to AltaVista.com, and Recall), Louis Monier (AltaVista.com’s top wizard), and Russell Power (who worked on TeraGoogle) teamed up to create a next-generation Google. Michael Liedtke’s “New Search Engine Claims Three Times the Grunt of Google” is worth reading. You can find one instance of the write-up here.

TechCrunch wrote about Cuil in 2007. You can read that essay here. The key points in the TechCrunch write-up were that Cuil can index Web content faster and more economically than Google. Venture funding was $33 million, which is a healthy chunk for search technology.

Mr. Liedtke pulls together some useful information. For me, the most interesting points in the write-up were:

  • The Cuil index contains 120 billion Web pages.
  • Cuil is derived from an Irish name.
  • The search results will appear in a “magazine like format”, not a laundry list of results.
  • Google has looked the same for the last 10 years and will look the same in the next 10 years.

Although Dr. Patterson left Google in 2006, she authored several patent documents related to search. I profiled these documents in Google Version 2.0, and these provide some insight into how Dr. Patterson thinks about extracting meaning from content. The patent documents are available from the USPTO, and she is listed as the sole inventor on the patent applications.

Observations

If Cuil’s index contains 120 billion Web pages, it would be three times the size of Google’s Web page index of 40 billion pages and six times the size of Live.com’s 20-billion-page index. Google has also indexed structured data, which makes its index far larger, but Google does not reveal the total number of items in it. The “my fish was this big” approach to search is essentially meaningless without context.

The AltaVista.com connection via Louis Monier is important. A number of AltaVista.com engineers did not join Google. One company with AltaVista.com roots, Exalead, has plumbing that meets or exceeds Google’s infrastructure. If Exalead could build a killer infrastructure, it is likely that Cuil has one too, and my thought is that Cuil will include innovations that Google cannot easily retrofit. As Mr. Liedtke’s article points out, Google has not changed search in a decade. This observation comes from Dr. Patterson and may have some truth in it. But as Google grows larger, radical change becomes more difficult no matter how many lava lamps there are in the Mountain View office.

The experience Dr. Costello gained in the WebFountain work for IBM suggests that text analytics will get more than casual treatment. Analytics may play a far larger role in Cuil than it did in Recall, Xift, or, for that matter, Google.

The knowledge DNA of the Cuil founders is important. There’s Stanford University, the University of Washington, and AltaVista.com. I make quick judgments about new search technology by looking for this type of knowledge fingerprint.

Other links you may find useful:

  • Cuil bios are here.
  • The Irish Independent write-up about the company is here.
  • A rundown of Xooglers who jumped ship is here and here.
  • A brief description of Xift is here.
  • Recall info is here. Scroll down to the headline “Recall Search through Past”.
  • WebFountain architecture info is here. You have to download the section in which you have interest.

With $33 million in venture funding, it’s tough to determine if Cuil will compete with Google or sell out. This company is on my watch list. If you have information to share about Cuil, please, post it in the comments section.

Stephen Arnold, July 28, 2008

A David Outperforming Two Goliaths: Factiva, Lexis, Silobreaker

July 24, 2008

A thoughtful reader sent me a screen shot of a Compete.com report. This is the metrics company that says, “Track your rivals. Then eat their lunch.” As you may know, I don’t get too excited by third-party analytics. The data have to show me a big jump; otherwise, most market share information is a statistical fuzz ball. When I saw this chart, I took notice.

[Chart: Compete.com traffic comparison of Silobreaker, Factiva, and Nexis]

The time period is a 12-month span ending on June 30, 2008. The first company on the chart is Dow Jones’s “other” online service, Dow Jones Factiva. You can read more about this outfit here. This online service is so adept that its Google ad today (July 24, 2008) returns a 404 error, or “File Not Found”. I clicked on the ad eight or nine times to see if it was traditional media latency or just carelessness. Answer: carelessness.

The second company charted by Compete is LexisNexis, one of the two monopolies in legal information. I love the Lexis tag line: “Lead with Confidence. Work with Confidence. Grow with Confidence.” Unfortunately this Compete.com chart shows Lexis following, not something to inspire confidence or trigger growth. LexisNexis sells online information to lawyers, but, not surprisingly, lawyers have been finding out that their clients expect the legal eagles to use publicly accessible services, not the high-priced services. Accordingly, LexisNexis has been working overtime to make Lexis spin more money. Nexis has been paddling upstream for years, and the brand has less visibility than the hair product (Nexxus), in my opinion. Lexis tried to get the hair product company to change its name. Didn’t work. Tough to confuse a sagging online service with shampoo and conditioners, in my opinion.

Now, the third company, Silobreaker, was co-founded by the former McKinsey manager and intelligence officer Mats Bjore and the CEO Kristofer Mansson. Their company is the one with the soaring line on the chart. When a third party’s chart shows a curve rising that steeply, I take notice. The absolute numbers are less important than the fact that the third party’s sampling process registered a significant change. You can read my interview with Mr. Bjore here.

What’s this chart tell me?

First, Silobreaker is gaining attention at the expense of Factiva and LexisNexis. You can see that in the up and down red and green lines.

Second, Silobreaker’s ascent tells me that the company is getting new customers, not just sucking oxygen from the bigger guys’ base.

Third, whatever goosed Silobreaker to rapid growth took place early in 2008, and the momentum appears to be holding up. There will be a tail-off in the summer when information junkies head for the beach or a trout stream.

But the most useful piece of data is that the “people” score for Silobreaker.com is only slightly less than the combined “people” score of Factiva.com and Nexis.com.

Silobreaker may be a David. The two Goliaths, owned by traditional media companies with a track record of throwing money and people at a “problem”, are not out of the game. But if I were the product manager for either of these two companies, I would be considering one of these actions:

[a] Killing Silobreaker.com with a price war or carpet bomb marketing campaign

[b] Polishing my résumé because I am getting humiliated by a company in Sweden, which has a GDP smaller than my employer’s annual revenue

[c] Buying Silobreaker.com and taking credit for the company’s rapid growth, nifty technology, and developers

[d] Deleting my Silobreaker.com bookmark and pretending that the company does not exist.

Since I worked for the world’s smartest publisher, William Ziff, I would go for [c]. Why pretend that a giant traditional publishing company can make a product that people want, that’s sexy, and that has lift? Buy it, issue a news release, and collect that bonus.

Will Factiva and Lexis wake up? I will keep you posted.

Stephen Arnold, July 24, 2008

Semantra Snags $3 Million in Additional Funding

July 23, 2008

The economy is uneven. Semantra, however, has obtained $3 million in funding from CPMG, a unit of Cardinal Investment Company. The “C” stands for Cardinal, and the “PMG” for Public Market Group. No matter; the CPMG investment in Semantra now totals about $9 million.

What’s a Semantra?

The company is a leader in “conversational analytics.” This buzzword means that a user of a Microsoft Dynamics CRM system can ask a question in plain English. Semantra converts the question to a syntax Microsoft Dynamics understands. Semantra then displays the answer. The company says that it:

…is a pioneer in Natural Language and Semantics that is applied in a search and information access context that enables enterprises to quickly and easily retrieve precise, critical information from complex corporate databases through inquiries in the language of a user’s business. With an understanding of linguistics, conceptual modeling and relational theory, Semantra built its software to empower business users with real time, common language commands and requests unavailable through traditional BI or enterprise search solutions. Semantra significantly improves the value of any enterprise business application. Semantra’s headquarters are located in Dallas, Texas.
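I have no visibility into Semantra’s internals, so take the following as nothing more than a back-of-the-napkin sketch of the general pattern the company describes: map an English question onto fields in a CRM schema and emit a structured query. The field names, the pattern matching, and the SQL are all invented for illustration.

```python
import re

# Invented mini-schema standing in for a CRM table; Semantra's actual mapping
# to Microsoft Dynamics entities is far richer than this.
FIELD_SYNONYMS = {
    "revenue": "annual_revenue",
    "sales": "annual_revenue",
    "customers": "account_name",
    "region": "territory",
}

def question_to_sql(question: str) -> str:
    """Very rough NL-to-query translation: spot a metric and an optional region filter."""
    q = question.lower()
    metric = next((col for word, col in FIELD_SYNONYMS.items() if word in q), "*")
    match = re.search(r"\bin (\w+)", q)
    where = " WHERE territory = '%s'" % match.group(1) if match else ""
    return "SELECT account_name, %s FROM accounts%s;" % (metric, where)

print(question_to_sql("What was the revenue for customers in Texas?"))
# SELECT account_name, annual_revenue FROM accounts WHERE territory = 'texas';
```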

A typical interface looks like this:

[Image: Semantra query interface]

Semantra received an earlier infusion of cash from CPMG about one year ago and plans a product rollout later in 2008. You can learn more about the company here.

Despite the challenges some text and content processing companies face with their sources of funding, Semantra appears to have few problems.

Stephen Arnold, July 23, 2008

Scale Fail: Amazon and Pizza Team Engineering

July 21, 2008

My news reader is chock full of glowing embers of hostility this morning. It’s 8:30 am in rural Kentucky, where nothing works very well. Power failed again last night, but we have oil lamps and candles. Based on scanning a number of posts about the Amazon S3 outage, Amazon may want to shore up Dr. Werner Vogels’ engineering team today. Shoestrings are great for keeping sneakers on my feet, but massively parallel distributed infrastructures need a bit more than shareware, clever graduate students from the Netherlands, and technical reviews by PhD candidates from University of California computer science programs.

Amazon codes using teams small enough to be fed with one pizza. The idea is that a SOCOM-type unit is better than the rigorous engineering approach found at Boeing, or even Microsoft for that matter. Amazon also allows its teams considerable latitude when solving problems. In fact, some teams can use whatever programming language or method allows them to solve the problem.


This is a burned pizza. Great ingredients, distracted chef. Source: http://msp71.photobucket.com/albums/i122/xoaleycat926ox/6298db24.jpg

This approach is fast, economical, and flexible. The downside is that if the fix triggers a fault elsewhere, the pizza team or teams must scramble to figure out what happened and why. If the previous team used some offbeat language or clever method, then the fixers have to puzzle out the solution. Some folks love puzzles, but judging from some of the nastygrams I read this morning, I don’t think Amazon Web Services’ customers are too keen on the approach.

Om Malik’s “S3 Outage Highlights Fragility of Web Services” is among the best of the essays I reviewed. You can read his full post here. For me, the key point in his analysis was:

That said, the outage shows that cloud computing still has a long road ahead when it comes to reliability. NASDAQ, Activision, Business Objects and Hasbro are some of the large companies using Amazon’s S3 Web Services. But even as cloud computing starts to gain traction with companies like these and most of our business and communication activities are shifting online, web services are still fragile, in part because we are still using technologies built for a much less strenuous web.

I quite enjoyed Center Networks’ understatement about the problem, reporting Amazon’s own comment:

Amazon S3 has “elevated error rates”.

I think this means crash or fail.
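In practice, “elevated error rates” means some fraction of requests fail and the client is expected to retry. Here is a minimal sketch of the usual defensive pattern, exponential backoff with jitter; the fetch_from_s3 callable is a hypothetical stand-in for whatever S3 client call an application actually makes.

```python
import random
import time

def fetch_with_backoff(fetch_from_s3, key, max_attempts=5):
    """Retry a flaky call with exponential backoff plus jitter.

    fetch_from_s3 is a hypothetical callable standing in for a real S3 client;
    the pattern, not the client library, is the point here.
    """
    for attempt in range(max_attempts):
        try:
            return fetch_from_s3(key)
        except IOError:
            if attempt == max_attempts - 1:
                raise                       # give up after the last attempt
            sleep_for = (2 ** attempt) + random.random()
            time.sleep(sleep_for)           # roughly 1s, 2s, 4s, 8s, plus jitter
```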


New Idea’s Founders Speak, New Search Tools Service in Beta

July 21, 2008

New Idea Engineering is one of those specialized engineering firms that keep a low profile because the company is swamped with work. Miles Kehoe and Mark Bennett, the two founders of New Idea, have deep experience with search and related technologies. Messrs. Kehoe and Bennett revealed the premise of their firm in an interview for the Search Wizards Speak series:

New Idea has from Day One tried to make our products and tools cross-vendor, but none of the major vendors has any incentive to do so until customers start objecting.

This is a clear statement of one reason why search vendors may not rush to resolve some issues for their customers. The problems with enterprise search are now becoming more widely known. New Idea’s founders explain why:

…The biggest problem we see in failed implementations is that the technology the customer picked is just not the right one for their environment. Corporate IT managers have to remember that a great demo is indistinguishable from product, but sometimes they seem willing to accept the vendor’s demo as a suitable substitute for their environment. There is also a mind set in many IT departments that search is either not critical – it’s still often a “check-box item” – or that it must be terribly easy…

You can read the full text of the interview here. Additional information about New Idea is here. The company has a useful Web log, and a new addition to the New Idea arsenal of useful resources is a listing of software tools that can help untangle some of the Gordian knots in an enterprise search deployment. An alpha version of the new service called Search Components Online is available here.

Disclaimer: I have provided some information about open source and shareware content transformation tools. Kudos to the New Idea Engineering team for creating a much-needed listing that can help those struggling with flawed enterprise search systems or consultants trying to help their customers get their system back online. I have linked to the company’s enterprise search Web log and cheerfully nabbed nuggets from the company’s informed postings.

Stephen Arnold, July 21, 2008

Google’s NLP in the Address Bar

July 15, 2008

The USPTO published US7401072, “Named URL Entry”. Awarded to Google, the patent discloses a system for performing natural language search on words typed in a browser’s navigation bar. The idea is that when Google Toolbar, Google ig, or a Google-friendly browser is installed on a user’s system, a user can type queries in the navigation bar, not just the search box.

How does this magic work? You will want to read the patent document. My initial thought was that the user would have to have a stateful Google session running; for example, Google “ig”, Google Docs, or the Google Toolbar. As I thought about this invention, I wondered, “Will Google introduce its own browser?”
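To make the idea concrete, here is my own naive simplification of the routing decision, not the claims of US7401072: if the text in the navigation bar looks like a URL, navigate; otherwise hand it to the search engine. Real browsers and the patent rely on far more signals (history, DNS probes, query statistics, a stateful session, and so on).

```python
# A naive is-this-a-URL-or-a-query heuristic; illustrative only.
import re
from urllib.parse import quote_plus

def route_address_bar_input(text: str) -> str:
    """Return a navigation URL for URL-ish input, else a search URL."""
    text = text.strip()
    looks_like_url = (
        text.startswith(("http://", "https://"))
        or re.fullmatch(r"[\w.-]+\.[a-z]{2,}(/\S*)?", text) is not None
    )
    if looks_like_url:
        return text if "://" in text else "http://" + text
    return "http://www.google.com/search?q=" + quote_plus(text)

print(route_address_bar_input("arnoldit.com"))             # navigate
print(route_address_bar_input("antidote for snake bite"))  # search
```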

I tried to dig up some useful information about the inventors of this disclosed system and method. What I found was slim pickings. John Piscitello (former Product Manager, Google Video) seems to have left the search giant. Xuefu Wang and Breen Hagan are mysteries to me. And Simon Tong, Senior Research Scientist at Google, leaves few biographical traces in content indexed by public search engines.

I find the lack of information about Dr. Tong interesting. He is mentioned in more than a dozen Google patent documents, which qualifies him as a genuine Google wizard. Dr. Tong has received several Google awards for contributions to the firm; for example, the Google Founders’ Award. He does play ping pong very well and enjoys photography. Beyond those facts and his ties to Stanford’s Daphne Koller, I don’t know much about his technical contributions to Google. He did figure as a co-inventor on what I consider a very important Google invention; namely, Large Scale Machine Learning Systems and Methods, 7222127, May 22, 2007. If you have not reviewed this patent document, a half hour with this disclosure may be helpful in understanding Google’s approach to computational intelligence.

My research suggests that when Dr. Tong’s name is on a Google patent document, that document warrants close attention. Almost as interesting is the impact of this invention if Google brings out its own browser. The notion of a walled garden exerts its charms on many because of the control it delivers along with the joys within.

Stephen Arnold, July 15, 2008

Hakia to Accelerate Semantic Analysis of the Web

July 10, 2008

A somewhat bold headline hopped from my news reader screen this morning (July 10, 2008). A news release from Hakia, one of the players in the semantic search football match, told me: “Hakia Leverages Yahoo Search BOSS to Accelerate Its Semantic Analysis of the World Wide Web.” You can get a copy of this release from Farrah Hamid (farrah at hakia dot com). As of 8:50 am, the news release is not on the Hakia Web log, nor is there a link to this Hakia announcement.

The key point in the news release is that Hakia is using Yahoo’s Build Your Own Search Service, or BOSS. The idea is that Hakia will use Yahoo’s search infrastructure to “accelerate Hakia’s crawling of the Web to identify quality documents for semantic analysis using its advanced QDEX (Query Detection and Extraction) technology.” The “its” refers to Hakia’s patented technology, not Yahoo’s BOSS service.

Using Yahoo makes sense for two reasons. First, scaling to index Web content is expensive, a fact lost on many search mavens who don’t have a sense of the economics of content processing. Second, Yahoo’s BOSS makes it reasonably easy to tap into Yahoo’s plumbing. I wondered why other semantic search vendors have not looked at this type of hook-up to better demonstrate the power of their systems. A couple of years ago, Siderean Software processed the Delicious.com content, and I found that a particularly good demo of the Siderean technology as well as a very useful resource. I have lost track of Siderean’s Delicious index, so I will need to do a bit of sleuthing later today.
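For readers who want a sense of how light the lift is, here is a sketch of a BOSS v1 web search call as I recall the 2008 documentation. The endpoint, parameters, and response fields are from memory and may not be exact, and the service has long since been retired, so treat this as illustrative rather than a working recipe.

```python
# Sketch of a Yahoo BOSS v1 web search call, reconstructed from memory of the
# 2008 documentation; endpoint, parameters, and response shape may not be exact.
import json
from urllib.parse import quote
from urllib.request import urlopen

APP_ID = "YOUR_BOSS_APP_ID"   # placeholder; BOSS required a registered application id

def boss_web_search(query: str, count: int = 10) -> list:
    url = ("http://boss.yahooapis.com/ysearch/web/v1/%s?appid=%s&format=json&count=%d"
           % (quote(query), APP_ID, count))
    with urlopen(url) as response:
        payload = json.load(response)
    return payload["ysearchresponse"]["resultset_web"]

# for result in boss_web_search("semantic search"):
#     print(result["title"], result["url"])
```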

Also, you can refresh your recollection of BOSS at http://www.developer.yahoo.com/boss. While you are at the Yahoo site, check out Yahoo’s own semantic search system, which left me a trifle disappointed. That system lives at this url: http://www.yr-bcn.es/demos/microsearch/. My write-up about yr-bcn is here. One hopes the Hakia system raises the bar for Yahoo-based semantic efforts. It would be useful if Hakia put up a head-to-head comparison of its system with Yahoo’s. You can see the Hakia comparison with Google here.

The choice of the BOSS service is understandable. Yahoo these days seems pliable. Cutting a deal with Google is fuzzy, often depending on which Googler one tracks down via email or at a conference. In my opinion, Google has been playing hardball in the semantic space. I am starting to think Google has designs on jump starting the semantic search “revolution” and putting its own systems and methods in place. The semantic Web certainly has not taken off, so why not entertain the notion of Google as the Semantic Web? Makes sense to me.

Microsoft, fresh from its hunt for semantic technology, is a big outfit, so it is also difficult to find an “owner” for the type of hook-up a company like Hakia wants. Microsoft can put a price tag on accessing its index, which one cheery Redmonian told me now contained 25 billion Web pages. I told the Redmonian, “My tests suggest that the index is in the 5 to 7 billion page range.” I was told that I was an addled goose. So, what’s new?

Yahoo, troubled outfit that it is, probably welcomes an opportunity for Hakia to get the portal some positive media coverage. But if I had been advising Hakia (which I am not), I would have suggested Hakia give Exalead in Paris, France, a jingle. Exalead’s Web index is fresh, contains eight billion or so Web pages, and its engineers are quite open to new ideas. Yandex also might have made my list of partners.

Check out the Hakia system at http://www.hakia.com. When I get additional information, I will try to update this post.

Stephen Arnold, July 10, 2008

Update: July 10, 2008, 10 am: My Hakia post is part of a larger fabric of Yahoo BOSS coverage. You will want to read “Yahoo Radically Opens Web Search with BOSS” in the July 9, 2008, TechCrunch. Mark Hendrickson’s coverage is a very good summary of the information on Yahoo’s Web site. He also takes a positive stance, noting “BOSS is the second concrete product to come out of Yahoo’s Open Strategy. The first was Search Monkey back in April [2008].” I am not ready to even think about being positive. These types of announcements are coming when the firm is in disarray. Any announcement, therefore, may be moving deck chairs on the Titanic. I will take a more skeptical position and say, “Let’s see how this plays out.” Yahoo is in flux, and its own semantic search system, referenced in the essay above, is not too good.

Update 2, July 10, 2008, 10:10 am Eastern time: Hakia provided this information to me just a few moments ago.

  • The news release is on the Hakia Web site at http://company.hakia.com/pr-070308.html. Don’t forget the dots. (How about an explicit link on the splash page, Hakia?)
  • You can find other Hakia news releases at this location http://company.hakia.com/press.
  • The “official” Yahoo release is here: This url is too crazy to reproduce.

WAND: New Business Taxonomy Available

July 10, 2008

Taxonomies are slightly less popular among the enterprise search crowd than Hannah Montana and petrol prices. WAND, a developer of controlled vocabulary tools and services, has rolled out what the company calls “a robust enterprise taxonomy.”

The idea is that most organizations remain clueless about taxonomies, controlled vocabularies, knowledge bases, and ontologies. The words are easy to say, but the ability to create a schema that a human being in an organization can use is a very different kettle of fish.

WAND’s taxonomy will allow a clueless or semi-clueless organization to get a taxonomy, edit it, and use the terms and hierarchies as a way to tag processed content. According to the company’s news release:

WAND’s new business vocabulary provides a four-level hierarchy of important business terminology covering human resources, accounting and finance, sales and marketing, legal, and information technology. The vocabulary includes all the core business concepts that any company has to deal with and can be extended and customized to include company specific terminology. WAND’s enterprise taxonomy can easily be paired with an existing enterprise search engine to improve the relevancy of search results returned.
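What “pairing a taxonomy with a search engine” usually amounts to is tagging processed content with controlled terms and letting queries expand along the hierarchy. Here is a toy tagger over an invented two-branch vocabulary; WAND’s actual taxonomy is four levels deep and far larger.

```python
# Toy controlled vocabulary and tagger; the terms are invented and stand in
# for a licensed taxonomy such as WAND's, which is far richer than this.
TAXONOMY = {
    "Accounting and Finance": {
        "Accounts Payable": ["invoice", "purchase order", "payment terms"],
        "Payroll": ["salary", "withholding", "pay period"],
    },
    "Human Resources": {
        "Recruiting": ["job posting", "candidate", "interview"],
    },
}

def tag_document(text: str) -> set:
    """Return taxonomy paths whose leaf terms appear in the text."""
    text = text.lower()
    tags = set()
    for top, children in TAXONOMY.items():
        for node, terms in children.items():
            if any(term in text for term in terms):
                tags.add(top + " > " + node)
    return tags

print(tag_document("Please approve the invoice and confirm the payment terms."))
# {'Accounting and Finance > Accounts Payable'}
```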

You can learn more about the company and license fees here. I wrote about Arikus, another vendor offering off-the-shelf taxonomies, here. I profile two other taxonomy players, Access Innovations and SchemaLogic, in my Beyond Search study for the Gilbane Group. You can also tap MuseGlobal for this type of information. Some companies assert that you can learn how to “do” a taxonomy quickly by signing up for a one-day class. Okay, maybe that will work. It’s taken most of the professionals working on real-deal controlled vocabularies decades to hone their skills. I thought I knew words, but after working with Betty Eddison, founder of InMagic, and later with the Access Innovations team, I learned that I knew essentially zero. Fortunately, working with these folks helped me become more informed about knowledge systems.

Take a peek at the WAND controlled term list and share what you learn with the two or three readers of this Web log.

Stephen Arnold, July 10, 2008

More Transformation Goodness from the Googleplex

July 8, 2008

In press is one of my for-fee write-ups that talks about the black art of data transformation. I will let you know when it is available and where you can buy it. The subject of this for-fee “note” is one of the least exciting aspects of search and content processing. (I’m not being coy. I am prohibited from revealing the publisher of this note, the blue-chip company issuing the note, and any specific details.) What I can do is give you a hint. You will want to read this post at Google Code (Open Source Google: News about Google’s Open Source Projects and Programs) here. You can read other views of this on two other Google Web logs: the official Google Web log here and Matt Cutts’s Web log here. You will also want to read the information on the Google project page.

The announcement by the Googley Kenton Varda, a member of the software engineering team, is “Protocol Buffers: Google’s Data Interchange Format”. Okay, I know you are yawning, but the DIF (an acronym for something that can chew up one-third of an information technology department’s budget) is reasonably important.

The purpose of a DIF is to take content (Object A in Format X) and, via the magic of a method, change that content into Format Y. Along the way, some interesting things can be included in the method. For example, nasty XML can be converted into little angel XML. The problem is that XML is a fat pig format, and fixing it up is computationally intensive. Google, therefore:

developed Protocol Buffers. Protocol Buffers allow you to define simple data structures in a special definition language, then compile them to produce classes to represent those structures in the language of your choice. These classes come complete with heavily-optimized code to parse and serialize your message in an extremely compact format. Best of all, the classes are easy to use: each field has simple “get” and “set” methods, and once you’re ready, serializing the whole thing to – or parsing it from – a byte array or an I/O stream just takes a single method call.
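To make the quoted description concrete, here is the workflow in miniature, shown in Python for consistency with my other sketches. The Document message is invented for illustration; the SerializeToString and ParseFromString calls follow the protobuf Python API, and the definition would normally live in its own .proto file compiled with protoc.

```python
# A .proto definition like the following (proto2 syntax, as in the 2008 release)
# would be compiled with protoc to produce a document_pb2 module:
#
#   message Document {
#     required string url   = 1;
#     optional string title = 2;
#     repeated string terms = 3;
#   }
#
# Using the generated class then looks like this:

import document_pb2  # hypothetical module generated by protoc from the .proto above

doc = document_pb2.Document()
doc.url = "http://arnoldit.com/wordpress/"
doc.title = "Beyond Search"
doc.terms.extend(["search", "content processing"])

wire_bytes = doc.SerializeToString()   # compact binary encoding, far smaller than XML

round_trip = document_pb2.Document()
round_trip.ParseFromString(wire_bytes)
print(round_trip.title)                # -> "Beyond Search"
```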

The approach is sophisticated and subtle. It shaves with Occam’s Razor, and it is now available to the Open Source community. Why? In my opinion, this is Google’s way of cementing its role as the giant information blender. If Protocol Buffers catch on, a developer can slice, dice, julienne, and chop without some of the ugly, expensive, hand-coded stuff the “other guys’” approaches force on developers.

There will be more of this type of functionality “comin’ round the mountain, when she comes,” as the song says. When the transformation express roars into your town, you will want to ride it to the Googleplex. It will work; it will be economical; and it will leapfrog a number of pitfalls developers unwittingly overlook.

Stephen Arnold, July 8, 2008
