Beyond the Database: Implications for Organizations

September 9, 2009

The challenge in information technology in general and information management in particular is that we face a “bridge” challenge. On one side are individuals using a wide range of devices. These include the Microsoft Zune HD, Google Android phones, and netbooks like the one I am using. Millions of the young-at-heart have full-scale computers like this Apple iPod Touch.

On the other side are large businesses with entrenched information technology infrastructures. Change is expensive, time-consuming, and often fiercely resisted by employees. Change means relearning methods that work.

When someone searches for information, a variety of sources is available. For example, if a NOAA professional gets a weather alert, he or she can pull from many sources. The problem is that the “answer” is not evident.

What about a search for “Florida severe weather”? Bing and Google return laundry lists of results. My research suggests that users do not want laundry lists. Users do want answers or a result that gets them closer to an answer and farther from the almost useless laundry list of results.

[Image]

In this talk (converted to an essay), I will comment about some of Google’s new technology, but I want to point out that Microsoft is working in this field as well. Most of the major players in search, content processing, and business intelligence know that laundry lists are a dead end, of low value, and a commodity.

Google’s corporate strategy looks disorganized. The New York Times’s January 28, 2007, article about Steve Ballmer included a reference to Google’s dependence on search advertising. The implication was that Google is a one-trick pony and therefore vulnerable. Google is in a tough spot because if advertising goes south, the company has to have a way to monetize its infrastructure. Google has spent billions building a global datasphere, a subject to which I will return at the end of this talk / essay.

Stand on the edge of a slice in the land near Antrim, Northern Ireland. You see a gap which you can cross using a rope bridge. Someday, a modern steel structure may be put in place. But for now, the Northern Ireland residents need a “good enough” solution.

That’s the problem the Federal government and many organizations face. Instead of a gap in the terrain, there are many legacy systems inside the organization and new systems outside the organization. The systems gap creates major problems in information access, security, and efficiency. In today’s economic climate and in the new Administration’s commitment to serving citizens, a digital bridge is needed, sooner rather than later.

The opportunity is to bridge these two different sides of the river of technology that flows through our society. Similar gaps can be identified in the structured and unstructured information gap, the legacy systems versus the Web service enabled systems gap, the Microsoft versus Google gap, archived data versus real time data gap, the semantic versus statistical gap, and others.

The question is, “How can we get the bridge built?” and “How can we deal with these gaps?”

[Image]

These are important issues, and the good news is that tools and approaches are now becoming available. I will highlight some of Google’s innovations and mention one company that has a product available that provides a functional “bridge” between existing IT infrastructure and Google’s services. Many tools are surprisingly affordable, so progress—in my opinion—will be picking up steam in the next six to 12 months.

Because I have limited time, I will focus on Google and make do with side references to other vendors working to build bridges between organizations’ internal systems and the fast-moving, sometimes more innovative world external to the organization.

I have written three monographs about Google technology: The Google Legacy in 2005, Google Version 2.0 in 2006, and Google: The Digital Gutenberg this year. Most of the information I am going to mention comes from my research for these monographs which are available from Infonortics, Ltd. (http://www.infonortics.com). The information in my monographs comes from open source intelligence.

In the Q&A session, I will take questions about IBM’s, Microsoft’s, and other companies’ part in this information drama, but in this talk most of my examples will be drawn from Google. I don’t work for Google and Google probably prefers that I retire, stop writing my monographs and blog posts about the company, and find a different research interest.

Let’s start with a query for an airplane flight.

If you navigate to Google and run the query “SFO LGA”, the Google system recognizes the two three-letter strings as airports. The system displays an enhanced result list.
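A minimal sketch of this recognition step might look like the following Python fragment. The code and the abbreviated code table are my own illustration, not Google’s method; a production system would consult the full IATA registry and handle many more query shapes.

```python
# Hypothetical subset of an IATA airport-code table; a production
# system would use the complete registry.
IATA_CODES = {"SFO", "LGA", "JFK", "ORD", "LAX"}

def detect_flight_query(query):
    """Return an (origin, destination) pair when the query looks like
    two airport codes, otherwise None."""
    tokens = query.upper().split()
    if len(tokens) == 2 and all(t in IATA_CODES for t in tokens):
        return tokens[0], tokens[1]
    return None

print(detect_flight_query("SFO LGA"))      # ('SFO', 'LGA')
print(detect_flight_query("kidney stone")) # None
```

When the detector fires, the engine can swap the generic results template for the flight-specific one; when it returns None, the ordinary laundry list appears.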

[Image]

You can see in this enlarged segment of the results list a parametric search box. You can, of course, browse the traditional results list, but for users who want to go directly to a flight selection list, only a click is required. Notice that there are some preferred providers: CheapTickets, Expedia, Hotwire, etc. Is this a convenience or a variation of PageRank? I surmise that these companies are partners and either pay Google for premium placement or return a percentage of the user’s ticket fee to Google.

When we click on Hotwire, notice that the Hotwire system takes the Google information string and displays the lowest fare for the specified flight. Again, this is a small convenience, but it does save time. What Google is doing with Hotwire is creating a mini-application that makes search results more immediately useful. I think the interlocking of the user, Google’s search results, and Hotwire provides an important insight into Google’s approach.

[Image]

Now let’s run a query for “kidney stone”. The result page is difficult to read, but I want you to get a feel for the layout of the page.

[Image]

First, if you click on the plus sign next to the words “Show options”, you will see this list of facets. A “facet” is roughly analogous to a category. Endeca has made “guided navigation” its hallmark. But Google has had a similar capability since 2002 and is only now making this function more widely available. Again the idea is to give the user one-click access to bite-sized, related chunks of information. Google wants to make the benefits of narrowing, slicing, and dicing available with one click. Google automates the process for the “kidney stone” query.
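To make the facet idea concrete, here is a small sketch in Python. The records and field names are hypothetical; the point is only that counting metadata values across a result set yields the one-click narrowing options a facet display presents.

```python
from collections import Counter

# Hypothetical processed results for the query "kidney stone"; each
# record carries metadata fields a facet engine can count.
results = [
    {"title": "Kidney stone overview",    "type": "article",    "year": 2009},
    {"title": "Lithotripsy video",        "type": "video",      "year": 2008},
    {"title": "Treatment forum thread",   "type": "discussion", "year": 2009},
    {"title": "Dietary advice",           "type": "article",    "year": 2007},
]

def build_facets(records, fields):
    """Count distinct values per metadata field; each (field, value,
    count) triple becomes a one-click narrowing option."""
    return {f: Counter(r[f] for r in records) for f in fields}

facets = build_facets(results, ["type", "year"])
print(facets["type"])  # Counter({'article': 2, 'video': 1, 'discussion': 1})
```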

Next, notice the images embedded in the search results. Google introduced this approach as “universal search” in 2007. The idea is that a Google index is not a single entity. There are indexes of collections. Google indexes Web logs and keeps the information separate from the Google image index or the Google Video index. Universal search is really a type of metasearch within Google’s content collections. At Google’s scale of operation, this metasearch operation is a non-trivial technical challenge.
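A toy sketch of the metasearch step, with hypothetical per-collection indexes, might look like this; Google’s actual federation logic is of course far more elaborate and distributed:

```python
# Hypothetical per-collection result lists for one query; real indexes
# would be inverted files spread across many machines.
indexes = {
    "web":    [("Kidney stone - Wikipedia", 0.92)],
    "images": [("kidney_stone_xray.jpg", 0.85)],
    "video":  [("Lithotripsy explained", 0.78)],
}

def universal_search(query, collections):
    """Metasearch sketch: take each collection's hits for the query,
    tag them with their collection, and merge into one
    relevance-ordered list."""
    merged = []
    for name, hits in collections.items():
        merged.extend((score, title, name) for title, score in hits)
    merged.sort(reverse=True)
    return merged

for score, title, coll in universal_search("kidney stone", indexes):
    print(f"{score:.2f}  [{coll}]  {title}")
```

The hard part at scale is not the merge itself but making per-collection scores comparable, which this sketch simply assumes.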

The third region I want you to scrutinize shows a series of categories. In traditional commercial databases, these are roughly analogous to controlled terms. Since late 2003 Google has had a 500,000-term “flat” knowledge classification system. Unlike the manually-crafted, hierarchical ABI / INFORM controlled vocabulary and classification codes with which I was associated, the Google system is generated and updated automatically. An entry can contain one or more terms, a single Web page, or another category.
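The structure of such a flat system can be sketched as follows. The entries and links are invented for illustration; the point is that an entry may reference terms, pages, or other entries, with no fixed hierarchy.

```python
# Sketch of "flat" classification entries: any entry may point at
# terms, Web pages, or other entries. All names here are hypothetical.
categories = {
    "kidney stone": {
        "terms": ["renal calculus", "nephrolithiasis"],
        "pages": ["http://example.gov/kidney-stones"],
        "related": ["urology", "lithotripsy"],
    },
    "urology": {"terms": ["urinary tract"], "pages": [], "related": []},
}

def expand(entry_name, table, seen=None):
    """Collect every term reachable from an entry, following
    category-to-category links without looping."""
    seen = set() if seen is None else seen
    if entry_name in seen or entry_name not in table:
        return []
    seen.add(entry_name)
    entry = table[entry_name]
    terms = list(entry["terms"])
    for rel in entry["related"]:
        terms.extend(expand(rel, table, seen))
    return terms

print(expand("kidney stone", categories))
# ['renal calculus', 'nephrolithiasis', 'urinary tract']
```

Because entries can point at other entries, the system behaves like a graph rather than a tree, which is what allows automatic updating without reshuffling a hierarchy.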

What I want to point out is that Google is making use of this mechanism within certain scientific, medical, and health related queries. STM content lends itself to this type of tagging. In addition, health-related applications are a business sector of considerable interest to Google. I think we are able to glimpse Google’s baby steps in its march toward more robust medical information applications.

As interesting as these current Google functions are, they are not fully developed for public use. Most users don’t want to deal with outputs that force them to do the synthesis and the analysis. For that reason, these current Google services don’t get users where they want to go and, to be blunt, these functions won’t bridge the gap between raw information and answers that relieve the user of the time-consuming, difficult mental work of synthesizing and analyzing information. In my opinion, Google’s present enterprise services do not provide a complete suite of functionality. A bridge can be built between an organization and Google, and I think it is advisable to undertake such a project in the near future. The reason is that an organization can learn how to use Google in a secure, appropriate way by taking proactive steps today.

I surmise that the idea is that XML no longer requires significant investment and that XML content is becoming widely available. Digital content can be “transformed” into XML or an equivalent. As you know, an XML document is structured. With “smart” software, an XML document can be processed and tagged with metadata. Transformed XML can be manipulated in a way roughly analogous to a database’s content. Records or “rows” can be queried, results parsed, outputs generated, and subjected to mathematical recipes to yield potentially useful information about the information.
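A minimal illustration of this idea, using Python’s standard XML library on an invented three-record document:

```python
import xml.etree.ElementTree as ET

# A tiny "transformed" document; the attributes stand in for
# machine-added metadata tags.
doc = """
<records>
  <record year="2009" topic="health">Kidney stone treatment advances</record>
  <record year="2008" topic="weather">Florida severe weather summary</record>
  <record year="2009" topic="weather">NOAA alert archive</record>
</records>
"""

root = ET.fromstring(doc)

# Query the tagged content much as one would query database rows.
weather_2009 = [r.text for r in root.findall("record")
                if r.get("topic") == "weather" and r.get("year") == "2009"]
print(weather_2009)  # ['NOAA alert archive']

# "Information about the information": a simple count by topic.
by_topic = {}
for r in root.findall("record"):
    by_topic[r.get("topic")] = by_topic.get(r.get("topic"), 0) + 1
print(by_topic)  # {'health': 1, 'weather': 2}
```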

Please recognize that Microsoft has a similar interest in next generation data management systems and is relying on researchers from Emory University, the University of Washington, and elsewhere.

I think Google’s interest in next generation data management systems is important because Google dominates digital information retrieval at this time. The company controls about 70 percent of queries for Web content and has an immediate, direct impact on my information retrieval practices. Some of Google’s most recent initiatives – Android, Chrome, and Wave, for instance — have a direct bearing on the next generation data management system “plumbing” Google is building.

The problem with talking about Google is that users, competitors, and financial analysts see Google as an advertising company. Google operates with more sophistication than some Google watchers understand.

Here’s a Google patent application (US7231393). Most of the Google pundits don’t pay much attention to these documents. The most common reason I hear is, “Patents really don’t mean much.” I think Google patent documents do mean something for three reasons: First, the patent process is expensive, and no company gets involved with the multi-year process casually. Second, the patent applications often provide clues to new features in Google’s public service. Third, Google’s patent documents cluster around certain areas or topics. Obviously multiple patent applications in a narrow area like semantic analysis for a programmable search engine reveal what I call “prejudicial intent”.

Let’s look inside this patent application, filed in 2004; that is, roughly five years ago.

What you are looking at is a “dossier” or a “report” like those I used to write when I worked at Booz, Allen & Hamilton. I want to point out three features of this report and then turn to Google’s newer information activities.

[Image]

First, let’s look at the tags. These are in the outside left hand rail. The data have been placed in a structured form. Second, notice that the system has generated aliases for Mr. Jackson; for example, Wacko Jacko. Third, notice the hot link to the phone number and location of Mr. Jackson as determined by the Google method. I find this patent application interesting, but the dossier function dates from 2003 and 2004.

What about more recent innovations?

This example comes from a Google paper delivered in July 2009 at the Very Large Database Conference held in Paris. You see a segment of a Web page. This type of content contains an enumerated list of films, the motion picture company, and the year of the film’s release. Common content and very useful when properly segmented and tagged. With a trillion pages of content in its index, Google obviously has to figure out how to parse content and deal with it on what I call “Google scale”.

[Image]

From the technical paper, I have extracted one figure that shows the straightforward way in which Google figures out what is in the list. In fact, looking at these steps, one might conclude that Google has done nothing particularly innovative. Based on the research I have done for many years into search and content processing, I agree. Google’s engineers have looked at previous research and selected the elements that make the most sense for what Google seeks to accomplish; that is, indexing and making available the world’s information.
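To see why the approach is straightforward, consider a toy version: a regular expression that turns an enumerated film list into tagged records. The snippet and pattern are my illustration, not Google’s code, which must cope with messy markup at trillion-page scale.

```python
import re

# Hypothetical snippet of list-style Web content: title, studio, year.
snippet = """
1. The Matrix - Warner Bros. (1999)
2. Jaws - Universal (1975)
3. Alien - 20th Century Fox (1979)
"""

# One enumerated line: number, title, studio, four-digit year.
pattern = re.compile(r"^\s*\d+\.\s+(.+?)\s+-\s+(.+?)\s+\((\d{4})\)\s*$",
                     re.MULTILINE)

# Each match becomes a tagged record the indexer can treat as data.
records = [{"title": t, "studio": s, "year": int(y)}
           for t, s, y in pattern.findall(snippet)]
print(records[0])
# {'title': 'The Matrix', 'studio': 'Warner Bros.', 'year': 1999}
```

The novelty at Google is not the parsing step itself but doing it reliably across billions of differently formatted lists.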

[Image]

Here is an output from the Google method. This is difficult to read, and I apologize. What is evident is that Google’s approach creates a rich set of metadata. The point is that the metadata are more verbose than the source document.

Keep in mind the scale, scope, diversity, and other challenges implicit in this statement “the world’s information”.

Many researchers at very prestigious companies will assert that “we can do this too.” I recall Yahoo’s head of research, a former Verity and IBM wizard, telling Bear Stearns that Yahoo was far ahead of Google in this type of quasi-semantic content processing.

He was intelligent but poorly informed. Yahoo has nothing comparable to Google’s 2003-2004 technology. The newer technology I will be discussing in the remainder of this talk is not within Yahoo’s reach given the firm’s present circumstances.

Google is adding information—what I call “knowledge value”—to the processed content. Imagine what this type of metadata inflation means across a corpus of one trillion documents.

The technical challenge of writing data to and reading data from disk puts Google at the outer edge of today’s computer science universe. Most pundits do not appreciate what Google has accomplished with its “as is” infrastructure.

Now let’s look at a timeline.

[Image]

I don’t want to get too deeply into the mathematics of the processes that Google uses. I discuss some of the math in my three Google monographs. Let me point out that Google is anchored in math and physics. Programming is a means, not an end, at Google. The first thing I want to point out is that the intervals between these milestones are decreasing. To me, this means that the pace of diffusion of the technology is increasing.

What I find interesting is that Google’s newest information applications are using math with roots back more than 140 years.

Some of the math Google uses for its Wave system is based on a breakthrough by Bernhard Riemann. Riemann, as you may know, shattered notions of Euclidean geometry. He is the father of n-space geometry and of insights such as the idea that geometric structures need not be limited to two or three dimensions. A new type of math is needed to deal with a “manifold.”

[Image]

Source: http://www.math.harvard.edu/tutorials/2003/riemann.gif

Now flash forward to Bell Labs in the mid-1990s. A young Stanford PhD researcher, Alon Levy (who now publishes as Alon Halevy), had been working with n-space and manifolds applied to problems in information processing. Dr. Levy’s “Information Manifold for Query Processing” was filed in 1996 and granted in 1999. The invention is assigned to Lucent, now part of Alcatel, not Google.

[Image]

Dr. Halevy and his structured data team at Google Labs are “off the radar” of many competitors, analysts, and Google mavens. In fact, his area of research is confined to a community of about 100 individuals worldwide with the math background to understand how Dr. Halevy has taken some of Bernhard Riemann’s original insights and created “next generation data management system” technology. Note: this is not a database system. A database is a subset of a next generation data management system. Dr. Halevy’s team at Google is developing the next generation data management system into a process framework.

In my opinion, Dr. Halevy’s research is on a par with that of Claude Shannon (the Bell Labs researcher who created the discipline of information theory) and Dr. Gerard Salton (Cornell University), the father of statistical text analysis. Dr. Halevy is a bright fellow, quite possibly Riemann grade.

2009 and Becoming More Widely Available

You may be familiar with Google Wave. The idea is that a Wave combines communications, collaboration, and email, along with other functions. Google has directed the media spotlight at two brothers (Lars and Jens Rasmussen) in Australia. But the principles that make Wave a digital container for various information objects are the work of Dr. Halevy’s team. The media have accepted that Wave is an enhanced email system that includes features of Lotus Notes and search. My research suggests that Wave is one of Google’s first deployments of some of its next generation data management system capabilities. The diagram below shows a Wave “view”. You will have to wait until later this month or early October, when Wave for education and other limited domains becomes available.

[Image]

Source: Google, 2009

Here’s the trajectory, then: Bernhard Riemann, the 1996 information manifold invention, the Transformic process framework, and Google with a next generation data management system product. I think this technology trajectory is one of the most significant discoveries in my years of research into information and content processing.

You can see other examples of Google’s baby steps with next generation data management system technology. Navigate to Google. Run a query for fusion tables, and you will see a display that allows you to point and click your way to a tabular summary of answers to your query. Here’s one for the Gross Domestic Product by country.

[Image]

You can combine some Google functions; for example, fusion tables and Google’s mapping technology. Because Google builds digital “Lego blocks”, Google developers can put these together to build many different applications. That’s the reason I am confident that the next generation data management system applications will be coming more rapidly. A Google engineer can use the Halevy process framework. Dr. Halevy and his team have built the foundation and most of the plumbing. Other Googlers can just use these blocks to build solutions.
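The “Lego block” idea can be sketched in a few lines: one block supplies a table, another supplies map coordinates, and a third joins them. Everything here is hypothetical, including the GDP figures, which are illustrative rather than actual values.

```python
# Two hypothetical "blocks": a tabular data service and a mapping
# service, composed into a small application.
gdp_table = {"France": 2.9, "Japan": 5.0, "Brazil": 1.7}   # illustrative, trillions USD
coordinates = {"France": (46.2, 2.2), "Japan": (36.2, 138.3),
               "Brazil": (-14.2, -51.9)}                   # rough lat/lon

def join_table_to_map(table, coords):
    """Compose the two blocks: attach each table row to a map point,
    yielding plottable records."""
    return [{"country": c, "gdp": v, "latlon": coords[c]}
            for c, v in table.items() if c in coords]

for point in join_table_to_map(gdp_table, coordinates):
    print(point["country"], point["gdp"], point["latlon"])
```

The composition is trivial precisely because each block exposes data in a predictable shape; that predictability is what the underlying plumbing provides.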

[Image]

What is important for the US government is that some firms have already begun to use Google tools to deliver next generation data management system functions today. Keep in mind that Google is positioning some of its more sophisticated technology as open source software.

Somat Engineering, an 8(a) firm located in Detroit, Michigan, has taken some of my research and built a product that makes use of a Google Search Appliance to deliver some next generation data management system functionality you can deploy today. You can use Somat’s next generation data management system, called Ripply, to derive the benefits of this remarkable new way to process and access information. But more important, based on my testing of the product, Somat’s next generation data management system product acts like an air lock on the International Space Station. The world of a government data system is separate and isolated from Google functions. Somat is worth a hard look because it can deliver next generation data management system benefits today. This allows you to learn in a controlled way about this remarkable new approach. More importantly, the Somat method makes it unnecessary to rip and replace an agency’s existing applications and database or search systems. You can get more information on September 23, 2009, at the National Press Club Ripply press conference, which you are invited to attend.

Toward 2010

Now let’s look into the future. I want to go on the record and say that I am often wrong. If I knew the future, I would be at the river boat casino near my home in rural Kentucky. Nevertheless, I think it can be useful to try to anticipate what Google may do next.

When people interact with information, it is important to know what changes were made, by whom, and when. I call this type of information “meta metadata”. To get this type of information today, one has to rely on human researchers and investigators. If you have a question about a particular item of information, I think you can sense that knowing who made what change is important. If you are involved in an investigation, the time of the change, the path through the network, and other “knowledge value” data are often valuable.

What I have come to recognize is that when I have knowledge value information about user behavior with a document, I can think about fundamentally new types of queries. I don’t want to dig too deeply into the limitations of today’s query and discovery systems. Imagine being able to determine the provenance of a document; that is, you would know where it came from, where it had been, and who did what to that document. Would it be useful to generate a list of people who changed certain Enron documents for shareholders? To get that type of information today, one has to interview people.
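A provenance query can be sketched with nothing more than a change log and a filter; the log entries below are invented for illustration:

```python
from datetime import datetime

# Hypothetical change log; each entry is meta-metadata about one edit.
change_log = [
    {"doc": "memo-17", "who": "alice", "when": datetime(2009, 3, 1, 9, 0),
     "action": "created"},
    {"doc": "memo-17", "who": "bob",   "when": datetime(2009, 3, 2, 14, 30),
     "action": "edited figures"},
    {"doc": "memo-17", "who": "carol", "when": datetime(2009, 3, 5, 11, 15),
     "action": "deleted paragraph"},
]

def provenance(doc_id, log):
    """Answer the provenance question: who did what to this document,
    and when, in time order."""
    events = [e for e in log if e["doc"] == doc_id]
    return sorted(events, key=lambda e: e["when"])

for e in provenance("memo-17", change_log):
    print(e["when"].isoformat(), e["who"], e["action"])
```

The hard problem is not the query but capturing such a log automatically and at scale, which is exactly where a next generation data management system would earn its keep.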

Would it be useful to know which person was most likely to know about a certain event? How useful would you find a system that allows queries for provenance and confidence, a PageRank on steroids in a manner of speaking?

This snippet of a data table comes from a presentation by Jennifer Widom, a Stanford professor and consultant to Google’s next generation data management system unit. The query illustrates how law enforcement can query information about a crime, specifically a hit-and-run accident or a similar event. Notice that the outputs provide a list of suspects and a confidence score. A law enforcement professional can use this type of output to prepare a list of individuals to question. Even more important, the system assigns a probability of 0.75 to Freddy, a piece of information of potential utility to the investigator. So, next generation data management systems make possible new types of queries that exploit relationships, times, and the mathematical properties of manifolds to answer questions.
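The confidence-scored query can be sketched as follows. Only Freddy’s 0.75 score comes from the example above; the other names and numbers are hypothetical.

```python
# Sketch of a confidence-bearing query result: each suspect carries a
# probability, and the system ranks rather than merely matches.
suspects = [
    {"name": "Freddy", "confidence": 0.75},
    {"name": "Billy",  "confidence": 0.50},
    {"name": "Hank",   "confidence": 0.20},
]

def query_with_confidence(rows, threshold):
    """Return rows at or above a confidence threshold, best first, so
    an investigator sees a prioritized interview list."""
    kept = [r for r in rows if r["confidence"] >= threshold]
    return sorted(kept, key=lambda r: r["confidence"], reverse=True)

for row in query_with_confidence(suspects, 0.4):
    print(row["name"], row["confidence"])
```

The difference from an ordinary database query is that uncertainty is a first-class column: the answer is not a set of matches but a ranked list with attached probabilities.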

[Image]

© Jennifer Widom, 2007.

A Matter of Perception

Is your head spinning? I apologize. I know that many of these concepts carry implications for personal privacy, for commercial information companies, and for government agencies generating content. Let’s close with some observations about how Google is approaching the information world at this time. You do not have to agree with me. I want to stimulate your thinking. I want to point out that I am not critical of what Google is doing. I am focused on information retrieval, and I don’t think much about policy or broader issues. That’s my weak spot. I leave those thoughts to others and their specialized skills.

First, here’s how three big companies perceive the world. These firms—AT&T, Viacom, and Microsoft—see themselves as giants in separate sectors. Sure, there are overlapping boundaries, but for the most part these companies are separate and have different ways of generating money. They see Google as an aberration with quirky logos, money from advertising, and a popular Web search system. In the world of these three multi-billion-dollar giants, Google is an upstart. Here is the empty fish bowl. Next I will show you what Google’s “as is” infrastructure looks like; keep in mind that, for my purposes at this moment, the next image is a digital fish bowl.

For government agencies, then, I think knowing how to look at Google and perceive the opportunities it presents is an important short-term job. Over the longer term, I think appropriate use of Google’s advanced technology under controlled conditions makes financial sense: the company’s business model means that advanced technical functions can be obtained at lower cost. If you want to talk with me about how my colleagues and I can assist you with these tasks, please let me know. Just contact me at my office via the links on my Web site, http://www.arnoldit.com.

What you see is a “datasphere” that wraps the earth.

[Image]

© Infonortics Ltd., 2005

Google’s technology can be extended to handle outer space as well, but this is the fruit of a decade of investment in distributed computing, search, data management, etc. Notice that Google can “snap in” one machine or complete data centers the way I plug a mouse into my laptop. Applications, once running on a Google server, can be made available to any other Google system or service. You can get full details of this architecture in my 2005 study The Google Legacy. Now think about this Google datasphere as a fish bowl. Here’s how my research suggests that Google’s datasphere embraces competitors like AT&T, Viacom, and Microsoft. These companies are “inside” the Google datasphere.

[Image]

I think you know that Google is putting pressure on certain business sectors; for instance, telecommunications. My research indicates to me that Google is using its AdWords and AdSense revenue to fund probes into other business sectors.

[Image]

The work of building the basic Google is almost complete. As a result, Google is shifting from fundamental research into next generation data management systems and other fields of inquiry to applications.

[Image]

The Google platform becomes a key component that some organizations will integrate with on-premises computing systems. A blend of local and remote will become a feature of next generation computing systems.

What is important to keep in mind is that the average age of a Googler is lower than in many organizations. As a result, the company sometimes makes immature, uninformed decisions. Google then changes course without warning. The logic of business etiquette is often lost on mathematicians, physicists, and electrical engineers. Nevertheless, keep in mind that Google is not a start up. Xooglers are leaving the company and starting their own firms, which make use of lessons learned at Google. Google’s legacy is that it will not go away. Even if Google were to shut down tomorrow, it has already transformed computing and information access.

Google is a challenge for information companies, for example, because the children of the media company executives are Google’s customers. Lawyers’ courtroom victories make zero difference to the children of the winners and losers in the trial. Pandora’s box of change has been opened. We cannot go back and, therefore, must move forward with resolution.

That’s why for the last six years I have concluded my Google talks with the suggestion, “Surf on Google.” It is more prudent than letting the wave carry you to shore. The best surfers do not fight the wave; surfers ride and enjoy the wave. The path forward for the US government and other organizations is to make use of the Google platform by connecting to the Google datasphere via an “air lock” like Somat’s Ripply technology. Don’t forget your sunscreen.

Stephen Arnold, September 9, 2009

Stephen Arnold, ArnoldIT.com

www.arnoldit.com/sitemap.html

www.arnoldit.com/wordpress

Mr. Arnold is an independent consultant residing in Harrod’s Creek, Kentucky. He is the author of more than 10 monographs. His most recent is Google: The Digital Gutenberg, available from www.infonortics.com. He writes monthly columns for Information World Review and the Smart Business Network. His informal writings about information appear in his Web log, Beyond Search.
