Google’s Janitors: Clean Up Crew Ready for a Clean Sweep

April 15, 2008

At my Buying & Selling eContent keynote this morning, I briefly discussed Google’s invention of “janitors”. You can get the full text of the patent document from the USPTO site. Search for US20070198481, “Automatic Object Reference Identification and Linking in a Browseable Fact Repository.” The inventors are Andrew Hogue and Jonathan Betz of Google, Inc.

The patent is of keen interest to me. It makes use of functions that Google is now making available via its App Engine service, among others. My suggestion is that you read about the App Engine and then look at US20070198481. If you have read about Google’s Programmable Search Engine, you may see linkages among these inventions that the individual patent documents do not make explicit. Google is not hiding any of these technologies, just using its infrastructure in fresh, intriguing ways. Keep in mind that a patent document is not a product. I believe it is useful to look at open source information in order to keep a finger on the pulse of a company’s innovation heartbeat.

Figure from US20070198481

Now look at this illustration, which I used in my keynote. I want to direct your attention to two things. First, the query generates a report about the topic, in this case, the named entity “Michael Jackson”. Second, this result is not a hit list; it is a report. If my research for my new Gilbane Group study Beyond Search is accurate, Google’s US20070198481 seems to address some of the problems that users experience when confronted with results lists.
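
The hit-list-versus-report distinction is easier to see in code. Below is a toy Python sketch, assuming a tiny in-memory fact repository with invented attributes; it has nothing to do with Google’s actual data structures or the patent’s implementation and only illustrates the idea of answering a query with assembled facts rather than ranked documents.

```python
# Toy illustration only: a query answered from a small fact repository.
# The entity attributes below are invented for the example and are not
# drawn from Google's systems or the US20070198481 patent document.
fact_repository = {
    "Michael Jackson": {
        "type": "Person",
        "occupation": "singer, songwriter, dancer",
        "born": "1958-08-29",
        "notable work": "Thriller",
    },
}

def entity_report(query):
    """Return a report assembled from stored facts, not a list of documents."""
    facts = fact_repository.get(query)
    if facts is None:
        return "No facts on record for '%s'." % query
    lines = ["Report: %s" % query]
    for attribute, value in facts.items():
        lines.append("  %s: %s" % (attribute, value))
    return "\n".join(lines)

print(entity_report("Michael Jackson"))
```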

You will need to draw your own conclusions about this type of automated report generation. Google is not just in step with what users want; the company appears to possess technology that makes it possible for the GOOG to jump into professional publishing, expand its reach as a business intelligence tool, and please users who want a distillation, not a laundry list of results.

Stephen Arnold, April 15, 2008

Search, “No Problem”; Explaining the Value of IT, “Problem”

April 15, 2008

Gartner, the IT consulting giant, exposed its list of the major information technology challenges. ZDNet UK points to Silicon.com’s summary in a post titled “Seven IT Challenges to Change the World”.

Enterprise search–indeed search of any type–is not on the list. Please, check out this list of seven items before it becomes unfindable. As you scan the seven items, think about number seven: Developing clear indicators to spell out the financial benefits of IT investment to business.

Presumably once we revolutionize IT with self-charging devices and automated coding, we will have a way to explain in dollars and cents the value of information technology. For organizations struggling with search and retrieval, good news. You will be able to find needed information before you can explain the value of IT to your colleagues, peers, and superiors.

Stephen Arnold, April 15, 2008

Autonomy Aces Its Rivals Once More

April 14, 2008

Autonomy Information Governance is, according to the company, “the industry’s first information governance platform that automates real-time policy management based on forming a conceptual and contextual understanding of all enterprise information.”

The value of this functionality is that risks inherent in information can be reduced by applying policy based on understanding what an email, document or phone recording says instead of relying solely on its metadata. You can read the full news announcement here.
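
To make the metadata-versus-content point concrete, here is a minimal Python sketch of content-based policy assignment. The keyword rules and policy names are invented for illustration; Autonomy’s conceptual and contextual analysis is, by the company’s own description, far more sophisticated than keyword matching.

```python
# Minimal sketch of content-based policy management (illustration only,
# not Autonomy's technology). The document's text, not its metadata,
# determines which policy applies.
POLICY_RULES = [
    ("legal_hold", ["lawsuit", "subpoena", "litigation"]),
    ("retain_seven_years", ["invoice", "purchase order", "payment"]),
]

def apply_policy(text):
    """Pick the first policy whose keywords appear in the text."""
    lowered = text.lower()
    for policy, keywords in POLICY_RULES:
        if any(keyword in lowered for keyword in keywords):
            return policy
    return "no_action"

email_body = "Please preserve every message related to the pending litigation."
print(apply_policy(email_body))  # -> legal_hold, whatever the email's metadata says
```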

Autonomy has consistently beaten its rivals in defining search markets and niches. The company’s “portal in a box” promotion remains a high-water mark in search salesmanship. Most of its rivals follow in Autonomy’s marketing wake. Autonomy’s management has a knack for anticipating opportunities and differentiating its offerings from other vendors’ products. Kudos to Autonomy marketing… again.

Stephen Arnold, April 14, 2008

Groovie Info about Google and Data Management

April 14, 2008

Groovie.org has an essay that does a very good job of explaining what the Google App Engine is and is not. If you are a Google watcher, click here for the information.

The analysis of Google App Engine is one of the more informed reviews of this Google innovation.

Most of the pundits overlook the fact that relational databases, despite their usefulness, pose some cost challenges that only those with deep pockets can resolve.

The difference between data management and a database is significant.

See also: http://highscalability.com/google-appengine-second-look

Stephen Arnold, April 14, 2008

GOOG to SFDC: Push Them Back, Push Them Back, Way Back!

April 14, 2008

One of the worst kept secrets is that Salesforce is supporting Google’s various enterprise applications. Newsfactor’s discussion “Google Gearing Up for the Enterprise” is a good place to start reading about this blog-tacular event.

The tie-up is not new; it is an extension of Google’s cheerleading for Salesforce.com’s approach to the enterprise. Salesforce.com’s marketing angle pivots on a solid anti-Microsoft block of rhetoric. Google is more indirect, even gentler, about Microsoft’s dominance. Furthermore, Google has been talking with Salesforce.com for years, and the most recent “development” is an extension of that relationship. Keep in mind that Google is not acquiring Salesforce.com, at least not yet.

Salesforce.com needs a way to work around some of its architectural issues. Like Amazon, the company needs some razzle-dazzle to deliver cloud-based services. Salesforce.com’s multi-tenancy inventions provide some punch that other companies don’t–as yet–have.

The Google Apps allow Salesforce.com to crank its anti-Microsoft marketing engine, and–perhaps more significantly–allows Google to [a] get more information about the traction its products and services have in the enterprise, [b] learn more about the upside and downside of Salesforce.com as a revenue generator, and [c] observe Microsoft’s reaction. How much does this cost Google? Based on the information available to me, the deal costs Google little, and it delivers a significant “intelligence” upside. Microsoft has shown a strong knee-jerk reaction to Google’s activities, and this deal may be another way to agitate Microsoft’s senior executives.

The big question is, “If this Salesforce.com relationship starts to put wood behind Google’s enterprise efforts, will Google buy Salesforce.com?” On the surface, there are some easy benefits to both Google and Salesforce.com. But there are some significant downsides as well; namely, the somewhat fragile nature of the Salesforce.com “plumbing,” which has a traditional relational database at its core. I’ve been told that Salesforce.com jumped on the Oracle database when it opened for business. That database has been good and bad. The good is that it can be reliable. The bad is that Salesforce.com has had to do many clever things to avoid choking that database with transactions from the Salesforce.com multi-tenant approach; that is, many customers with separate, “virtual” databases. Salesforce.com’s engineers have figured out how to deliver near-real time updates without bringing down the multi-tenant database platform.
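
For readers who have not run into multi-tenancy, here is a generic Python and SQLite sketch of the shared-schema pattern, in which every row carries a tenant identifier and every query is scoped to one customer. This is an assumption-laden illustration of the general idea, not Salesforce.com’s actual, metadata-driven implementation.

```python
import sqlite3

# Generic shared-schema multi-tenancy sketch (illustration only, not
# Salesforce.com's design): many customers share one physical database,
# and a tenant_id column keeps each customer's "virtual database" separate.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        tenant_id  TEXT NOT NULL,      -- which customer owns the row
        account_id INTEGER NOT NULL,
        name       TEXT,
        PRIMARY KEY (tenant_id, account_id)
    )
""")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?, ?)",
    [("acme", 1, "Coyote Ltd"), ("acme", 2, "Roadrunner Inc"), ("globex", 1, "Scorpio LLC")],
)

def accounts_for(tenant):
    """Scope every statement by tenant_id so customers never see one another's rows."""
    return conn.execute(
        "SELECT account_id, name FROM accounts WHERE tenant_id = ?", (tenant,)
    ).fetchall()

print(accounts_for("acme"))   # [(1, 'Coyote Ltd'), (2, 'Roadrunner Inc')]
```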

Maybe Google will learn enough from this deal, stop cheerleading Salesforce.com from the sidelines, and buy the entire team? Salesforce.com would benefit from more substantive Google engineering. To date, that’s a sideline Google has not chosen to step over.

Stephen Arnold, April 14, 2008

Bitext’s Antonio S. Valderrábanos Interviewed

April 14, 2008

You may not be familiar with Bitext, a search and content processing vendor specializing in natural language processing or NLP. The company has found an appetite for its technology in Spain and in other European countries. The company recently landed a deal to provide search and content processing technology to support a new citizen-facing information service in Spain. Dubbed Red 060, this system will be similar to the US government’s service, USA.gov. The company also is working with US search vendor dtSearch.

Antonio Valderrábanos, founder of Bitext in Madrid, Spain, told Beyond Search:

Our goal is to complement search engines, giving them the ability to handle text according to its content, rather than its form as it happens in most applications, including search engines. We are interested in all forms of search, including search in databases or Geographical Information Systems.

Unlike some vendors, the Bitext system meshes with other vendors’ systems, adding important new functionality. Mr. Valderrábanos told Beyond Search:

Our approach is to say, “Okay, you have a perfectly good key word indexing system. We add value to that system in ways that make users happier and without getting rid of the system in which you have invested significant time and money.” We integrate, complement, turbo-charge.

Bitext is working on important enhancements to the company’s content processing functions, including entity extraction. Entity extraction identifies people, places, events, and certain numerical data in a source document.
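
For readers who want to see what the output of entity extraction looks like, here is a minimal sketch using the open-source spaCy library. It is an illustration only and has nothing to do with Bitext’s NLP technology; the sample sentence and model name are assumptions made for the example.

```python
# Illustration only: entity extraction with the open-source spaCy library,
# not Bitext's technology. Requires: pip install spacy
# and: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Antonio Valderrabanos founded Bitext in Madrid and now works with dtSearch in the US.")

for ent in doc.ents:
    # ent.label_ is the entity type, e.g. PERSON, ORG, GPE (place), DATE, MONEY
    print(ent.text, ent.label_)
```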

Looking farther into the future, Bitext engineers are working on new ways to make access easy and intuitive. Mr. Valderrábanos observed:

I think the future will want one single interface to different information sources, whether documents or databases or some combination of data from many different systems.

Of course, the interface will be natural language, the simplest, most effective way of communicating for humans. We will certainly not want to bother with different applications and formal languages–so no key word queries, Boolean statements, SQL strings, or forms. People want to get the information they need without hurdles.

The full interview with Mr. Valderrábanos appears on the ArnoldIT.com Web site as part of the “Search Wizards Speak” series. You can learn more about Bitext’s line of products on the Bitext Web site.

Stephen Arnold, April 14, 2008

Digital Dodos: Fed Web Site Archives

April 13, 2008

Computerworld‘s Heather Havenstein wrote a story on April 11, 2008, titled “Agency Under Fire for Decision Not to Save Federal Web Content”. Please, read it before it goes into the digital never-never land of Computerworld stories, thus becoming almost impossible to find without real sleuthing.

The key point in the story, for me, was this:

NARA, which until this year had collected “harvests” of federal Web sites at the end of presidential and congressional terms, said in a recent memo that it would discontinue the practice at the end of George W. Bush’s presidency.

NARA, for the acronym-challenged, is the National Archives and Records Administration. This Federal entity is supposed to keep a copy of government information. Now, government information is slippery, and it is very difficult to put it in one location.

In 2000, I was one of the lucky dweebs involved in the US Federal government’s citizen-facing portal, now called USA.gov. As part of that project, Inktomi indexed more than 20,000 public-facing Web servers and made the information searchable. I thought indexing Federal Web sites would be a piece of cake. Boy, was I wrong.

A Search Puzzle with Hundreds of Pieces

Just take a gander at the Government Printing Office catalog and then do a bit of poking into the Web sites of the Department of Energy, and you won’t find much overlap for big printed reports and studies. For even more government fun, run a query on the DOE site for “ECCS”. You will get zero results. Now run the query on www.usa.gov, and you get hits related to a nuclear power plant’s “emergency core cooling system”. Related information is not in a single place, and there are different filters in place on different agencies’ Web sites. In short, the job of NARA is to gather the information in one place for researchers or crazed attorneys. There are overlapping jurisdictions, of course. It’s murky water. Few know who is responsible for what information at what point in time.

The same wacky situation plagues the Library of Congress, the library in the US Senate, and the two dozen executive branch agencies. I don’t even want to think about figuring out the information on the public and not-so-public Web sites operated by various intelligence, military, and quasi-government entities. (Remember, I struggled with this information landscape until I threw in the digital towel in 2006.)

You will have to form your own opinion about what information should be gathered by whom. I only know that trying to figure out which agency has what information is no trivial job. With NARA seemingly giving up and other Federal entities grabbing different parts of the information elephant, there may be no solution. Alexa and the Internet Archive have tried, and are trying, to do the work, but over the years I have become less and less confident in those efforts.

Microsoft indexes some Federal content as part of its contract for USA.gov with Vivisimo, but that’s a hit-and-miss index based on my tests. Microsoft asserts that it has more than 30 billion Web pages in its index, but my tests don’t back up that claim. Microsoft is struggling to make resources available for its various initiatives, and I think the index of Federal government content is not at the top of that list. Google indexes a cart load of government information and does a decent job on a number of states’ content.

Let Google Do It

I’m all for letting the GOOG index the Federal government, store the data in the Googleplex, and call it a day. At least I would know where to look for my “emergency core cooling system” documents and the report I did in 1991 about Japan’s investments in high-speed network technology. Under the present system, the information is essentially unfindable with public-facing systems.

If you know a specific item exists, it can be almost impossible to find it with any public index. In my experience, you have to be able to log in to the agency’s network and go data spelunking, find a version of the document, and then gather up the different instances of the document to figure out which is the “official” one. Just when you think you have what you need, someone asks, “Did you check the Lotus Notes repository? I think there are some modifications in those files too.” So, it’s back to the old data cave for more exploration in the dark. My miner’s light burned out, and I won’t go into the dark any more.

Stephen Arnold, April 13, 2008

Interse: Danish Search Vendor Opens a US Office

April 12, 2008

Chances are you have not heard of Interse A/S and its iBox technology for SharePoint. The company’s headquarters are in Copenhagen, Denmark, and the firm has recently opened a US office at 3200 Whitehaven Street NW in Washington, DC.

iBox is a Windows-centric component that adds metadata modeling and classification to an incumbent search solution. I did not include this company in the 24 profiles that make up the bulk of my new Beyond Search: What to Do When Your Enterprise Search System Doesn’t Work study for the Gilbane Group. I did get a demo before the company opened its US offices. You may want to take a look at the Interse approach if you are struggling with a SharePoint search issue.

You can find more information on the company’s Web site at www.interse.com. If you want a demo, you will need to register (an increasingly frequent vendor practice that gets in the way of learning about a system). Give the company a jingle at 202 797 5350.

Stephen Arnold, April 11, 2008

Google Forms: A Data Snout for a Bigger Creature

April 12, 2008

Navigate to Google’s Webmaster Central Blog. Scan the posting written by two wizards whom you probably don’t know, Alon Halevy (senior wizard) and Jayant Madhavan (slightly less senior wizard). Here’s what you will be told in well-chosen, Googley prose:

In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn’t find and index for users who search on Google. Specifically, when we encounter a <FORM> element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the Web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.
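
A crude way to picture what the Google engineers describe is to enumerate the combinations of values a form offers and turn each combination into a crawlable GET URL. The sketch below is my own assumption-laden Python illustration, with a made-up form action and input values; it is not Google’s crawler code.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical form pulled from a page: an action URL plus the values its
# select menus and radio buttons offer (the inputs Google says it samples).
form_action = "http://example.com/search"   # made-up endpoint for illustration
form_inputs = {
    "state": ["KY", "TN", "OH"],            # values from a <select> element
    "category": ["energy", "water"],        # values from radio buttons
}

def candidate_urls(action, inputs):
    """Yield one GET URL per combination of form values."""
    names = sorted(inputs)
    for values in product(*(inputs[name] for name in names)):
        yield action + "?" + urlencode(dict(zip(names, values)))

for url in candidate_urls(form_action, form_inputs):
    print(url)   # each URL is a crawl candidate; only valid, novel pages would be indexed
```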

The idea is that dynamic content does not usually appear in an index. On the public Internet, this type of content is useful to me. For example, when I want to take a Southwest flight, I have to fill in some annoying Southwest forms, fiddle with drop down boxes, and figure out exactly which fare is likely to let me sit in one of the “choice” seats by boarding first. Wouldn’t it be great to be able to run a query on Google, see the flights aggregated, and from that master list jump to the order form? Dynamic content is now becoming more common.

I heard from one wizard at a conference in London that dynamic content now accounts for more than half of the content appearing on the Web. The shift from static to dynamic is, therefore, a fundamental change in the way Web plumbing works, from Web log content management systems to the sprawling craziness of Amazon.com.


A diagram from Dr. Guha’s patent applications with the Context Server shown in relation to the other parts of the PSE. This is a figure from Google Version 2.0: The Calculating Predator, published by Infonortics, Ltd., Tetbury, Glou. in July 2007. Infonortics holds the copyright to this study and its contents.


Microsoft Windows: The Report of My Death Was an Exaggeration

April 11, 2008

The Gartner Group, the publicly traded consultancy, made headlines with its interesting assertion that Microsoft Windows will collapse. Like other blue-chip consulting firms, Gartner knows that generating buzz is good business. I know because I worked at one of the bluest-chip firms in the world, Booz, Allen & Hamilton, a quarter century ago.

At a Gartner symposium, Gartner pundits asserted that Microsoft is big, fumbling, and “overburdened”. Therefore–and this is the part I admired–Windows is “collapsing”. My former boss at Booz, Allen would have swizzled the words, but that’s the difference between a blue-chip and a bluest-chip consulting firm.

Please, read the Computerworld story before it disappears from the public Web site. Also, scan the essay at Read Write Web. Both of these summaries provide useful information about the Gartner pundits’ remarks.

The thought that crossed my mind was that a large number of companies in the technology business are floundering. IBM is a baffler with $96 billion in revenues. Microsoft took advantage of IBM’s skepticism about personal computers; IBM teetered on the precipice until a cookie expert taught the elephant to dance. IBM’s still with us, still pretty confusing to customers and competitors, still in the game.

Hewlett-Packard made what may be the most spectacular non-decision in the history of computing. HP owned AltaVista.com and orphaned it. Along came the Google, which hired Jeffrey Dean and a cast of former AltaVista.com wizards. Messrs. Brin and Page–courtesy of HP–had a once-in-a-lifetime opportunity and were savvy enough to seize it. HP floundered, discovered ink, nuked Ms. Fiorina, and now the company is digesting its $1.2 billion acquisition of Exstream Software. HP is an ink and printing company that technology enables. HP is in the $100 billion revenue territory.

For Microsoft to blow a 95 percent share of desktop and notebook operating systems and applications in 24 to 36 months is a big job. If Microsoft tried to make these customers go away, I don’t think the company could do it. My father is in his mid-80s. He has one PC, which runs XP. He will never upgrade, and he has minimal trouble. True, he has access to free technical support in the form of my visits. Most of the small businesses with which I am familiar aren’t likely to make any big jump to Macs (too expensive for today’s budget) or to Linux (not for the average bear).

Every year, I get at least one call about IBM mainframes, DEC 20s, and AS/400s. I’m amazed at how much of this hardware is still in use. Microsoft has problems, but what company doesn’t today? Did you once work at the fifth largest bank in the US? Well, Bear Stearns is history. Software on lots of computers doesn’t collapse the way a bank does when other bankers want some cash pronto.

Kudos to Gartner for getting more media coverage than the AOL-Yahoo and Google-Yahoo tie-ups. Some Microsofties will be annoyed. But there will be plenty of Windows users in 2010 or 2011 when the meltdown, implosion, or collapse occurs. Bet you a bowl of burgoo that Gartner’s wizards will still be using Word and PowerPoint to crank out their prognostications.

Stephen Arnold, April 11, 2008
