Google Metaweb Deal Points to Possible Engineering Issue
July 19, 2010
Years ago, I wrote a BearStearns’ white paper “Google’s Semantic Web: the Radical Change Coming to Search and the Profound Implications to Yahoo & Microsoft,” May 16, 2007, about the work of Epinions’ founder, Dr. Ramanathan Guha. Dr. Guha bounced from big outfit to big outfit, landing at Google after a stint at IBM Almaden. My BearStearns’ report focused on an interesting series of patent applications filed in February 2007. The five patent applications were published on the same day. These are now popping out of the ever efficient USPTO as granted patents.
A close reading of the Guha February 2007 patent applications and other Google technical papers makes clear that Google had a keen interest in semantic methods. The company’s acquisition of Transformics at about the same time as Dr. Guha’s jump to the Google was another out-of-spectrum signal for most Google watchers.
With Dr. Guha’s Programmable Search Engine inventions and Dr. Alon Halevy’s dataspace methods, Google seemed poised to take over the floundering semantic Web movement. I recall seeing Google classification methods applied in a recipe demo, a headache demo, and a real estate demo. Some of these demos made use of entities; for example, “skin cancer” and “chicken soup”.
Has Google become a one trick pony? The buy-technology trick? Can the Google pony learn the diversify and grow new revenue tricks before it’s time for the glue factory?
In 2006, signals I saw flashed green, and it sure looked as if Google could speed down the Information Highway 101 in its semantic supercar.
Is Metaweb a Turning Point for Google Technology?
What happened?
We know from the cartwheels Web wizards are turning, Google purchased computer Zen master Danny Hillis’ Metaweb business. Metaweb, known mostly to the information retrieval and semantic Web crowd, produced a giant controlled term list of people, places, and things. The Freebase knowledgebase is a next generation open source term list. You can get some useful technical details from the 2007 “On Danny Hillis, eLearning, Freebase, Metaweb, Semantic Web and Web 3.0” and from the Wikipedia Metaweb entry here.
What has been missing in the extensive commentary available to me in my Overflight service is some thinking about what went right or wrong with Google’s investments and research in closely adjacent technologies. Please, keep in mind that the addled goose is offering his observations based on his research for his three Google monographs, The Google Legacy, Google Version 2.0, and Google: the Digital Gutenberg. If you want to honk back, use the comments section of this Web log.
First, Google should be in a position to tap its existing metadata and classification systems such as the Guha context server and the Halevy dataspace method for entities. Failing these methods, Google has its user input methods like Knol and its hugely informative search query usage logs to generate a list of entities. Heck, there is even the disambiguation system to make sense of misspellings of people like Britney Spears. I heard a Googler give a talk that included the factoid that hundreds of variants of Ms. Spears’s name were “known” to the Google system and properly substituted automagically when the user goofed. The fact that Google bought Metaweb makes clear that something is still missing.
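To make the misspelling point concrete, here is a minimal Python sketch of mapping a goofed query to a canonical entity name. It is an illustration only and not Google's method: the entity list, the `canonicalize` function, and the 0.75 cutoff are assumptions, and the matching relies on the string similarity in Python's standard difflib module rather than on query logs.

```python
import difflib

# Hypothetical controlled term list; a real one would hold millions of entities.
ENTITIES = ["Britney Spears", "Danny Hillis", "Ramanathan Guha"]

def canonicalize(query, entities=ENTITIES, cutoff=0.75):
    """Return the closest canonical entity name, or None if nothing is close."""
    # Compare in lower case, but hand back the properly cased name.
    lookup = {name.lower(): name for name in entities}
    matches = difflib.get_close_matches(
        query.lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None

print(canonicalize("Brittany Spears"))
print(canonicalize("chicken soup"))
```

A list of entities like the one Metaweb built is what gives a function of this kind something to match against. Without the list, there is nothing to substitute.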
Lucene Revolution Conference Details
July 15, 2010
The Beyond Search team received an interesting news release from a reader in San Francisco. We think the information reveals the momentum that is building for open source search. Here’s the story as we received it:
San Mateo, Calif. – July 14, 2010 – Lucid Imagination, the commercial company for Apache Lucene and Solr open source search technologies, is pleased to announce speakers for Lucene Revolution, the first-ever conference in the US devoted to open source search. The conference will take place October 7-8, 2010 at the Hyatt Harborside, Boston, Massachusetts. Lucene Revolution is a groundbreaking event that drives broad participation in open source enterprise search, creating opportunities for developers, technologists and business leaders to explore the disruptive new benefits that open source enterprise search makes possible, in a fresh, energetic and forward thinking format.
The diverse and widespread adoption of Lucene/Solr for enterprise search applications is reflected by the broad range of speakers at the event, such as:
- Cisco Systems: Satish Gannu
- eHarmony: Joshua Tuberville
- LinkedIn: John Wang
- Sears: David Oliver
- The McClatchy Company: Martin Streicher
- The Smithsonian: Ching-Hsien Wang
- Twitter: Michael Busch
Conference speakers represent a cross-section of Lucene/Solr adoption – including new media, ecommerce, embedded search applications, content management, social media, and security and intelligence – spanning the broad spectrum of production-class enterprise search implementations, all of whom leverage the power and economics of Lucene/Solr innovation.
Other industry thought leaders participating and sharing their insights into open source enterprise search include Hadley Reynolds (Research Director, Search & Digital Marketplace Technologies, IDC) and Stephen E. Arnold (Beyond Search; Managing Partner, ArnoldIT).
Over the two days of the conference there are over 30 sessions scheduled in a variety of different formats: technical presentations, use cases, panel discussions, and Q&A sessions. In addition there will be an “un-conference” the evening of October 7, where attendees can present lightning talks and take part in hands-on community coding efforts.
Registration for Lucene Revolution is now open for the conference at: http://www.lucenerevolution.com/register. A full list of speakers, along with a complete conference agenda, is available at http://www.lucenerevolution.com/agenda.
If you are not familiar with Lucid, here’s a snapshot:
Lucid Imagination is the commercial company dedicated to Apache Lucene technology. The company provides value-added software, documentation, commercial-grade support, training, high-level consulting, and free certified distributions, for Lucene and Solr. Lucid Imagination’s goal is to serve as a central resource for the entire Lucene community and search marketplace, to make enterprise search application developers more productive. Customers include AT&T, Sears, Ford, Verizon, Elsevier, Zappos, The Motley Fool, Macy’s, Cisco, HP, The Guardian and many other household names. Lucid Imagination is a privately held venture-funded company. Investors include Granite Ventures, Walden International, In-Q-Tel and Shasta Ventures. To learn more please visit www.lucidimagination.com.
Goslings Constance Ard and Dr. Tyra Oldham will be attending. Should be useful. Certainly more timely than the plethora of SharePoint and gasping one-size-fits-all programs. Honk.
Stephen E Arnold, July 15, 2010
Sponsored post.
Autonomy: A Real Success. CMSWatch: Maybe Another Real Miss?
July 12, 2010
In Harrod’s Creek, I can easily spot the real squirrel hunters. They have food. Mostly laconic, these hunters have a big pile of dead squirrels as proof of their competence. There is also the smell of fresh burgoo wafting from their log cabins. I can smell ability from my goose pond.
Lousy hunters have empty gun belts and squirrels shot when snacking on store bought food used to lure the critters. That’s a real danger — cheap tricks or just shooting wildly, often putting bird shot in an innocent’s backsides or the face like the 2006 incident between Vice President Dick Cheney and Texas lawyer Harry Whittington. Some faux hunters have just shot themselves in the foot. Ouch!
Azure chip consultants is a synonym for “bad hunter” in my opinion. Source: http://api.ning.com/files/LCP2NCaWo-ptCqGncB3hGsX8vuh8dnDzSJ0iLnkibas_/18holeinhandG.jpg
One of my two or three readers sent me a link to a write up called “Don’t Ogle Search If You Really Want Content Management”. In my opinion, the write up relies on insinuation, not facts. (I think that some folks are immune to facts, but I find facts useful.) In the article’s headline, the word “ogle”, for example, is one I don’t associate with information retrieval. (The publisher of this “ogle” opinion piece caught my attention in July 2008 with its similar assault on Attivio. My response to that misleading article is here.)
Yet another example of factless criticism of a vendor appears in this segment of the “ogle” write up about Autonomy, one of a very small number of search and content processing vendors with a consistent track record of technical breadth, sales, revenue, and profit:
From an initial focus on enterprise search tools, Autonomy has become a roll-up vendor after acquiring a variety of other information management suppliers such as Interwoven. As a financial strategy this can be successful, and investors seem to cotton to Autonomy. As a technology strategy, vendor roll-ups are problematic. Autonomy’s technology strategy is to rip legacy search subsystems from acquired products, replace them with some pieces from its own IDOL toolset, and then promote its particular approach to search as a distinct advantage for you. Specifically, Autonomy will try to sell you on the value of “meaning-based computing.” Even if you can get your mind around what meaning-based means, you should remain skeptical that Autonomy has technically spectacular or original services here. More importantly, you risk getting sidetracked from your original goal of, say, creating a user-friendly repository for your 50,000 Office documents.
These statements are presented without verifiable foundation to support the allegations in my opinion.
Autonomy is on track to hit $1.0 billion by the end of calendar 2010. The company has a proven track record of improving the performance of the companies it acquires. Autonomy’s management has demonstrated its ability to integrate quickly its acquired products with IDOL (the firm’s integrated data operating layer). The result is Autonomy’s knack of transforming the acquired companies’ position in their markets.
But there are other data that shed light on Autonomy’s track record, which I have documented in my writings such as Beyond Search (Gilbane, 2009), the Enterprise Search Report (CMSWatch.com, 2004-2006), and Successful Enterprise Search Management (Galatea, 2009). Here are three points that must not be overlooked:
- Autonomy has 20,000 plus customers plus around 1,000 licensees of its technologies for use in other enterprise software and systems
- Autonomy has made intelligent acquisitions that have given the firm a strong presence in eDiscovery, rich media, and fraud detection. Autonomy has recently pushed into online marketing using capabilities from Interwoven and its IDOL framework. My research reveals that Autonomy has acquired companies to bring its technology to new markets so more content can be understood.
- Autonomy has grown its revenues and generated a profit, making it possible for other UK based technology companies to ride the Autonomy horse in the race for government and venture funding.
In December a year or so ago, at the International Online Conference, in my for-fee, end note debate, I challenged Andrew Kanter (Autonomy), Charlie Hull (Lemur Consulting), and Dr. Charles Oppenheim (Loughborough University) about their views of search, content processing, and related fields. In front of an audience of about 300 search professionals, I pointed out that key word search was dead. I pointed out that most search systems did not understand the meaning of processed information. Autonomy’s Andrew Kanter strongly and politely disagreed with me. As I recall, he said to the audience and me:
Autonomy IDOL is the only product in the market that can understand the meaning and concepts of all information in any language, including audio and video. This has big implications for the content management market as no other vendor can do this.
I demanded some concrete examples to support his position. Mr. Kanter without missing a beat gave me four concrete examples drawn from Autonomy’s work in intelligence, search enabled applications, fraud detection, and rich media.
What did I do?
Humans Not Replaceable Yet
July 5, 2010
The secret to national security is in searches, or so a recent Federal Times article tries to convince us. Citing the botched Christmas Day terror attempt, it claims that Homeland Security is deluged with so much data that agents could never be expected to stop a suspect in time. “Without better information systems,” the article says, “the intelligence community will be hamstrung in its efforts to transform information into intelligence.” The answer, it claims, is semantic search that draws preliminary conclusions on its own. So much faith in smart and semantic search capabilities is exciting, but it overlooks the human element. High-powered search tools are great, but the technology still cannot surpass human instincts and knowledge, no matter how sensitive the equipment.
Jessica West Bratcher, July 5, 2010
Freebie
Sentiment in an Unsentimental Manner
July 2, 2010
Sentiment analysis is one of those feeder streams in content processing that now are swelling into a torrent. Seth Grimes, a fellow who actually took one dollar from me and then gave it back, has written a useful write up, “My Feelings About Sentiment Analysis.” The format is an interview with Mr. Grimes as the subject. Here’s a comment I noted and tucked in my “recycle this insight in one of my talks” folder:
How organic is it [sentiment analysis]? Does it need to be managed in real time?
Smart, responsive enterprises have effectively been doing sentiment analysis for years: they’ve been listening to customers and the market. The natural next step is to automate analyses, to take advantage of computers’ speed and power in order to build out and systematizes efforts. Technologies are definitely starting to operate in real-time… and beyond. They can not only analyze and automate response to opportunities and threats as they emerge; via predictive modeling, they can drive pro-active steps that create opportunities and close vulnerabilities. This said, I’ll reemphasize that organizations can work their way up from basic monitoring and engagement to full-blown, predictive analytics at a pace that makes sense given needs and budgets.
Good stuff.
Stephen E Arnold, July 2, 2010
Freebie but maybe I will get asked to give a talk at one of Mr. Grimes’ high profile conferences. Beg, beg, whine. Repeat.
Autonomy Tasers Its Competition
July 2, 2010
I can hear the yelps now, “Don’t tase me, man. No, not again.” Bzzzap. “Yow.”
Now I hear a gasping, “Autonomy cannot be Number One. We are Number One.”
Who is doing the complaining? Probably about 300 vendors of search and content processing systems, that is who. Why the howls on this fine summer day?
Navigate to Chron.com and read “Autonomy Is #1 in Search and Discovery Market, According to Leading Market Research Firm.” There is a write up about IDC’s study “Worldwide Search and Discovery 2009 Vendor Shares: An Update on Market Trends.” So, the 300 yelpers have to do more than howl, issue one shot news releases, or drop the ball on marketing, sales, and customer satisfaction. Autonomy, according to a big gun analyst outfit, is the top dog, the king of the hill, and the cat’s pajamas in search and content processing. This is not my opinion, gentle reader. I am pointing you to a rock solid source, IDC.
What’s the write up say? Here’s a snippet:
“Autonomy continues to be the largest enterprise supplier, using its search-based IDOL infrastructure to act as a foundation for content-centric and search-driven business applications including eDiscovery and compliance, Web content management, enterprise content management and rich media, search marketing, intelligence, call center and customer support, and traditional knowledge management applications.” “Businesses from every industry continue to turn to Autonomy to help them achieve what other technology companies fail to deliver on – identifying the meaning within all forms of information, in real-time, in order to protect and promote their organization,” said Mike Lynch, CEO of Autonomy. “Autonomy’s unique meaning-based approach to information computing is what continues to fuel our rapid growth and clear market leadership, as validated by the recent IDC report on Search and Discovery market shares.”
And no big disagreement from the addled goose. I quite like some of the Autonomy technology. I like most of what IDC produces. If the data compiled for the report are accurate, Autonomy has a big footprint and happy customers. Among the thousands of Autonomy licensees are AOL, BAE Systems, BBC, Bloomberg, Boeing, Citigroup, Coca Cola, Daimler AG, Deutsche Bank, DLA Piper, Ericsson, FedEx, Ford, GlaxoSmithKline, Lloyds TSB, NASA, Nestle, the New York Stock Exchange, Reuters, Shell, Tesco, T-Mobile, the U.S. Department of Energy, the U.S. Department of Homeland Security and the U.S. Securities and Exchange Commission.
You may be using Autonomy technology and not even know it. More than 400 companies glue Autonomy to their own systems in order to provide search and content processing functions. Recognize any of these names? Symantec, Citrix, HP, Novell, Oracle, Sybase and TIBCO.
When the competition is able to stop yammering, perhaps some of these 300 vendors will start selling, marketing, and making Autonomy perspire. Google? Microsoft? Are you paying attention? Autonomy has more than 20,000 customers for its search and content processing systems, applications, and services. Oh, keep in mind that IDC offers data to back up its conclusion that Autonomy is Number One.
Competitors who make Kin phones and then kill their Kin the next day may want to reexamine their strategy. Other vendors may want to stop trying to tell governments how to run their railroads and business licensing policies.
Autonomy seems to have more – ah, how shall I say it? – yes, focus.
By the way, how does that taser feel? Want another zap? Bzzzap.
Stephen E Arnold, July 2, 2010
Freebie
Merger Strengthens Law Enforcement Searches
July 1, 2010
Crime solvers now have an improved way to track down clues, thanks to a single merger, a recent V3 article reports. One of the premier analytics firms, SAS in the US, recently purchased the UK based Memex in a step to bolster SAS’s law enforcement services. Memex currently supplies enterprise search solutions to law enforcement agencies from Britain to Los Angeles. By bringing its research abilities to SAS’s global reach, the company aims to help law enforcement and justice and defense agencies share data by making it more widely available and much more searchable. This is an interesting example of a structured data specialist acquiring specialized technology to service a specific niche. We expect to see similar partnerships sprout up.
Patrick Roland, July 1, 2010
Freebie
More Efficient Social Graph and Semantic Analysis
June 30, 2010
Short honk: My hunch is that the University of Maryland has come up with a nifty method to deal with some cumbersome, computationally intensive calculations. Navigate to “Scientists Develop World’s Fastest Program to Find Patterns in Social Networks” and read about fancy math and chopping big data into chunks. With the technique, figuring out patterns gets easier. I will resist a pun about cozying up to big data. Here’s the passage that caught my attention in the write up:
In a paper that has been accepted for presentation at the 2010 Advances in Social Network Analysis and Mining conference to be held in Denmark in August, Broecheler, Pugliese and Subrahmanian [University of Maryland wizards] leveraged a key insight – it is possible to split the social network into a set of almost independent, relatively small sub-networks, each of which is stored on a computer in a cloud computing cluster in such a way that the probability that a query pattern will need to access two nodes is kept as small as possible. Using knowledge of past queries and a complex set of calculations to compute these probabilities, their paper reports algorithms and experiments to answer social network subgraph pattern matching queries on real-world social network data with 778 million edges (which may denote relationships or connections between individuals) in less than one second. More recent results not contained in the paper are able to efficiently answer queries to social network databases containing over a billion edges.
Strikes me as important, particularly for outfits gunning their PT boats toward Fort Google.
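The partitioning idea in the quoted passage can be sketched in a few lines of Python. This is not the Maryland team's algorithm, which uses knowledge of past queries and probability estimates. It is only a toy count of how many edges cross between sub-networks for a given assignment of nodes to machines. The graph, the node names, and the `cut_edges` function are assumptions made for illustration.

```python
def cut_edges(edges, assignment):
    """Count edges whose two endpoints are stored in different partitions."""
    return sum(1 for a, b in edges if assignment[a] != assignment[b])

# Hypothetical six-person network: two tight triangles joined by one edge.
edges = [("ann", "bob"), ("bob", "cat"), ("ann", "cat"),
         ("dan", "eve"), ("eve", "fay"), ("cat", "dan")]

# Keeps each cluster on one machine.
good = {"ann": 0, "bob": 0, "cat": 0, "dan": 1, "eve": 1, "fay": 1}
# Scatters neighbors across machines.
bad = {"ann": 0, "bob": 1, "cat": 0, "dan": 1, "eve": 0, "fay": 1}

print(cut_edges(edges, good), cut_edges(edges, bad))
```

The fewer edges that cross partitions, the less often a pattern query has to hop between machines, which is the cost the quoted passage says the method keeps as small as possible.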
Stephen E Arnold, June 30, 2010
Freebie
Business Intelligence: Optimism and Palantir
June 28, 2010
Business intelligence is in the news. Memex, the low profile UK outfit, sold to SAS. Kroll, another low profile operation, became part of Altegrity, another organization with modest visibility among the vast sea of online experts. Now Palantir snags $90 million, which I learned in “Palantir: the Next Billion Dollar Company Raises $90 Million.” In the post financial meltdown world, there is a lot of money looking for a place that can grow more money. The information systems developed for serious intelligence analysis seem to be a better bet than funding another Web search company.
Palantir has some ardent fans in the US defense and intelligence communities. I like the system as well. What is fascinating to me is that smart money believes that there is gold in them there analytics and visualizations. I don’t doubt for a New York minute that some large commercial organizations can do a better job of figuring out the nuances in their petabytes of data with Palantir-type tools. But Palantir is not exactly Word or Excel.
The system requires an understanding of such nettlesome points as source data, analytic methods, and – yikes – programmatic thinking. The outputs from Palantir are almost good enough for General Stanley McChrystal to get another job. I have seen snippets of some really stunning presentations featuring Palantir outputs. You can see some examples at the Palantir Web site or take a gander (no pun intended by the addled goose) at the image below:
Palantir is an open platform; that is, a licensee with some hefty coinage in their knapsack can use Palantir to tackle the messy problem of data transformation and federation. The approach features dynamic ontologies, which means that humans don’t have to do as much heavy lifting as required by some of the other vendors’ systems. A licensee will want to have a tame rocket scientist around to deal with the internals of pXML, the XML variant used to make Palantir walk and talk.
You can poke around at these links which may go dark in a nonce, of course: https://devzone.palantirtech.com/ and https://www.palantirtech.com/.
Several observations:
- The system is expensive and requires headcount to operate in a way that will deliver satisfactory results under real world conditions
- Extensibility is excellent, but this work is not for a desk jockey no matter how confident that person is in his undergraduate history degree and Harvard MBA
- The approach is industrial strength which means that appropriate resources must be available to deal with data acquisition, system tuning, and programming the nifty little extras that are required to make next generation business intelligence systems smarter than a grizzled sergeant with a purple heart.
Can Palantir become a billion dollar outfit? Well, there is always the opportunity to pump in money, increase the marketing, and sell the company to a larger organization with Stone Age business intelligence systems. If Oracle wanted to get serious about XML, Palantir might be worth a look. I can name some other candidates for making the investors’ day, but I will leave those to your imagination. Will you run your business on a Palantir system in the next month or two? Probably not.
Stephen E Arnold, June 27, 2010
Freebie
Real Time Search Systems, Part 4
June 24, 2010
Editor’s note: In this final snippet from my June 15 and June 17, 2010, lectures, I want to relate the challenge of real-time content to the notion of “aboutness.” An old bit of jargon, I have appropriated the term to embrace the semantic methods necessary to add context to information generated by individuals using such systems as blogging software, Facebook, and Twitter. These three content sources are representative only, and you can toss in any other ephemeral editorial engine you wish. The “aboutness” challenge is that a system must process activity and content. “Activity” refers to who did what when and where. The circumstances are useful as well. The “content” reference refers to the message payload. Appreciate that some message payloads may be rich media, disinformation, or crazy stuff. Figuring out which digital chunk has value for a particular information need is a tough job. No one, to my knowledge, has it right. Heck, people don’t know what “real time” means. The more subtle aspects of the information objects are not on the radar for most of the people in the industry with whom I am acquainted.
Semantics
I hate defining terms. There is always a pedant or a frustrated PhD eager to set me straight. Here’s what I mean when I use the buzzword “semantic”. A numerical recipe figures out what something is about. Other points I try to remember to mention include:
- Algorithms or humans or both looking at messages, trying to map content to concepts or synonyms
- Numerical recipes that send content through a digital rendering plant in order to process words, sentences, and documents and add value to the information object
- Figure out or use probabilities to take a stab at the context for an information object
- Spit out Related Terms, or Use For Terms
- Occupy PhD candidates, Googlers, and 20-something MBAs in search of the next big thing
- A discussion topic for a government committee nailing down the concept before heading out early on a Friday afternoon.
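The first and fourth bullets, mapping content to concepts and spitting out Related Terms or Use For Terms, can be sketched as a tiny thesaurus lookup in Python. This is a toy under stated assumptions and not any vendor's method: the `THESAURUS` entries, the `tag_concepts` and `related_terms` functions, and the plain substring matching are all hypothetical.

```python
# Hypothetical thesaurus: preferred term -> "use for" variants and related terms.
THESAURUS = {
    "skin cancer": {"use_for": ["skin tumor", "skin tumour"],
                    "related": ["sunscreen", "dermatology"]},
    "chicken soup": {"use_for": ["chicken broth"],
                     "related": ["recipe"]},
}

def tag_concepts(text, thesaurus=THESAURUS):
    """Return preferred terms whose label or use-for variant appears in the text."""
    lowered = text.lower()
    found = []
    for preferred, entry in thesaurus.items():
        labels = [preferred] + entry["use_for"]
        if any(label in lowered for label in labels):
            found.append(preferred)
    return found

def related_terms(preferred, thesaurus=THESAURUS):
    """Return the Related Terms listed for a preferred term."""
    return thesaurus[preferred]["related"]

print(tag_concepts("Grandma's chicken broth cured my cold"))
print(related_terms("skin cancer"))
```

Real systems swap the substring test for numerical recipes and probabilities, but the output has the same shape: a message goes in, and concepts plus related terms come out.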
When semantics is figured out and applied, the meaning of Lady Gaga becomes apprehendable to a goose like me:
In order to tackle the semantics of a real time content object, two types of inputs are needed. One is activity, that is, monitoring who does what and when. The other is the information object itself. When the real time system converts digital pork into a high value wiener, the metadata and the content representation become more valuable than the individual content objects. This is an important concept, and I am not going to go into detail. I will show you the index / content representation diagram I used in my lectures:
The nifty thing is that when a system or a human beats on the index / content representation, the amount of real time information increases. The outputs become inputs to the index / content representation. The idea is that as the users beat on the index / content representation, the value of the metadata goes up.
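The feedback loop described above can be sketched as a toy Python index in which every search is written back as activity metadata. This is an assumption-laden illustration and not the system from my lectures: the `ContentIndex` class, its method names, and the substring matching are hypothetical.

```python
from collections import defaultdict

class ContentIndex:
    """Toy index that stores content plus the activity recorded against it."""

    def __init__(self):
        self.content = {}                   # object id -> message payload
        self.activity = defaultdict(list)   # object id -> (who, what, when) events

    def add(self, object_id, payload):
        self.content[object_id] = payload

    def record(self, object_id, who, what, when):
        # Each use of the index is written back as new metadata.
        self.activity[object_id].append((who, what, when))

    def search(self, term, who, when):
        """Return matching object ids and log the search against each hit."""
        hits = [oid for oid, text in self.content.items()
                if term.lower() in text.lower()]
        for oid in hits:
            self.record(oid, who, "search:" + term, when)
        return hits

index = ContentIndex()
index.add("t1", "Lady Gaga tickets on sale")
index.add("t2", "Chicken soup recipe")
index.search("gaga", who="goose", when="2010-06-24T10:00")
index.search("Gaga", who="gosling", when="2010-06-24T10:05")
print(len(index.activity["t1"]))
```

After two searches, object "t1" carries two activity records it did not have when it was added. The outputs of the index have become inputs to it.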