Google and Data Object Visualization

June 30, 2009

The USPTO published US7555471 B2 on June 30, 2009. The Beyond Search goslings think this is a reasonably important Google disclosure. The inventors include one super Googler and a clutch of other Google rock star engineers. Andrew Hogue is a Googler to watch. If you find his official Google page opaque, try this link. He and his band of engineers have received a patent for “Data Object Visualization.” Don’t get too excited about the graphics. The system and method applies to a core Google system for cleaning up discrepancies in fact tables. If you are a fan of Dilbert, this is the invention that gives one of Google’s smartest agents the official descriptor “janitor”. How smart is the janitor? Smart enough to make dataspaces closer to reality. The USPTO system is sluggish today, so you can get info from FreePatentsOnline.com or one of the other services that provide access to these public documents. I love that janitor lingo too. Googley humor for big time inventions makes clear that the 11-year-old Google still possesses math club whimsy. The examples for atomic mass and volcano are equally illuminating.

Stephen Arnold, June 30, 2009

Arnold at NFAIS: Google Books, Scholar, and Good Enough

June 26, 2009

Speaker’s introduction: The text that appears below is a summary of my remarks at the NFAIS Conference on June 26, 2009, in Philadelphia. I talk from notes, not a written manuscript, but it is my practice to create a narrative that summarizes my main points. I have reproduced this working text for readers of this Web log. I find that it is easier to put some of my work in a Web log than it is to create a PDF and post that version of a presentation on my main Web site, www.arnoldit.com. I have skipped the “who I am” part of the talk and jump into the core of the presentation.

Stephen Arnold, June 26, 2009

In the past, epics were a popular form of entertainment. Most of you have read the Iliad, possibly Beowulf, and some Gilgamesh. One convention is that these complex literary constructs begin in the middle, or what my grade school teacher called “in medias res.”

That’s how I want to begin my comments about Google’s scanning project – an epic — usually referred to as Google Books. Then I want to go back to the beginning of the story and then jump ahead to what is happening now. I will close with several observations about the future. I don’t work for Google, and my efforts to get Google to comment on topics are ignored. I am not an attorney, so my remarks have zero legal foundation. And I am not a publisher. I write studies about information retrieval. To make matters even more suspect, I do my work from rural Kentucky. From that remote location, I note that Amazon is concerned about Google Books, probably because Google seeks to enter the eBook sector. This story is one of “good enough”; that is, in a project so large and so sweeping, perfection is not possible. Pages are skewed. Insects are scanned. Coverage is hit and miss. But what other outfit is prepared to spend to scan books?

Let’s begin in the heat of the battle. Google is fighting on a number of fronts. Google finds itself under scrutiny from publishers and authors. These are the entities with whom Google signed a “truce” of sorts regarding the scanning of books. Increasingly libraries have begun to express concern that Google may not be doing the type of preservation job to keep the source materials in a suitable form for scholars. Regulators have taken an interest in the matter because of the publicity swirling around a number of complicated business and legal issues.

These issues threaten Google with several new challenges.

Since its founding in 1998, Google has enjoyed what I would call positive relationships with users, stakeholders, and most of its constituents. The Google Books matter is now creating what I would describe as “rising tension”. If the tension escalates, a series of battles can erupt in the legal arena. As you know, battle is risky when two heroes face off in a sword fight. Fighting in a legal arena is in some ways more risky and more dangerous.

Second, the friction of these battles can distract Google from other business activities. Google, as some commentators, including myself in Google: The Digital Gutenberg, have pointed out, may be vulnerable to new types of information challenges. One example is Google’s absence from the real time indexing sector, where Facebook, Twitter, Scoopler.com, and even Microsoft seem to be outpacing Google. Distractions like the Google Books matter could exclude Google from an important new opportunity.

Finally, Google’s approach to its projects is notable because their scope makes them hard for most people to comprehend. Scanning books takes exabytes of storage. Converting images to ASCII, transforming the text (that is, adding structure tags), and then indexing the content takes a staggering amount of computing resources.

image

Inputs to outputs, an idea that was shaped between 1999 and 2001. © Stephen E. Arnold, 2009
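The input-to-output chain just described — scan, convert the image to ASCII, add structure tags, then index — can be sketched in a few lines of Python. Every function name and data structure here is my own illustrative stand-in, not Google's actual implementation.

```python
# Toy sketch of a scan-to-index pipeline. All names are hypothetical.

def ocr(page_image: str) -> str:
    """Stand-in for image-to-ASCII conversion; a real system runs an OCR engine."""
    return page_image.lower()  # pretend the "image" is already recognizable text

def transform(raw_text: str) -> dict:
    """Add lightweight structure to the recognized text (stand-in for tagging)."""
    return {"body": raw_text, "tokens": raw_text.split()}

def index(doc: dict, inverted: dict) -> None:
    """Record each token's position in a simple inverted index."""
    for position, token in enumerate(doc["tokens"]):
        inverted.setdefault(token, []).append(position)

inverted_index: dict = {}
page = "The Epic of Gilgamesh"
index(transform(ocr(page)), inverted_index)
print(inverted_index["gilgamesh"])  # -> [3]
```

Each stage is cheap here; the point is only that every scanned page must pass through all three stages before it is findable, which is where the staggering resource demand comes from.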

Google has been measured and slow in its approach. The company works with large libraries, provides copies of the scanned material to its partners, and has tried to keep moving forward. Microsoft and Yahoo, database publishers, the Library of Congress, and most libraries have ceded the scanning of books work to Google.

Now Google finds itself having to juggle a large number of balls.

Now let’s go back in time.

I have noticed that most analysts peg the Google Books project as starting right before the initial public offering in 2004. That’s not what my research has revealed. Google’s interest in scanning the contents of books reaches back to 2000.

In fact, an analysis of Google’s patent documents and technical papers for the period from 1998 to 2003 reveals that the company had explored knowledge bases, content transformation, and mashing up information from a variety of sources. In addition, the company had examined various security methods, including methods to prevent certain material from being easily copied or repurposed.

The idea, which I described in The Google Legacy (which I wrote in 2003 and 2004, with publication in early 2005), was to gather a range of information, process that information using mathematical methods in order to produce useful outputs like search results for users, and generate information about the information. The word given to describe value-added indexing is metadata. I prefer the less common but more accurate term meta indexing.
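To make “information about the information” concrete, here is a minimal Python sketch that derives a small meta index record from a document. The fields are my own invention, chosen only to show the idea of indexing derived facts rather than just the words themselves.

```python
from collections import Counter

def meta_index(doc_id: str, text: str) -> dict:
    """Produce value-added data *about* a document, not just its content."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {
        "doc_id": doc_id,
        "length": len(tokens),          # how much information there is
        "vocabulary": len(counts),      # how varied it is
        "top_terms": [t for t, _ in counts.most_common(3)],  # what it is about
    }

record = meta_index("d1", "books books scanned and indexed and transformed")
print(record["length"], record["vocabulary"])  # -> 7 5
```

A system holding millions of such records can answer questions about its collection without rereading the collection, which is the practical payoff of meta indexing.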


Fujitsu Gets Bitten by the Search Bug

June 12, 2009

Juan Carlos Perez’s “Fujitsu Plug In Helps Refine Search Queries” here caught me by surprise. When I think of Japan and search, I think of Just Systems, not Fujitsu. I need to realign my goosely thinking. Mr. Perez wrote:

Fujitsu Laboratories of America has created a browser plug-in that pops-up a cloud of suggested query refinements around search engine boxes. Called Xurch, the tool works with Firefox and Internet Explorer, as well as with several major search engines and some big sites, the company said Thursday [June 11, 2009].

Fujitsu has created a Web site for Xurch here. You can download the free browser plug in here.

zurch

The idea is that the tag cloud shows a “cloud,” or unordered list, of related terms, concepts, and bound phrases. Each is a hot link which chops the longer list of results down. You see only those hits that are germane to your information need. To show the cloud, one moves the xurcher (oops, the cursor) into the search box. To make the cloud go away, move the xurcher (oops, the cursor) out of the search box hot zone.
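The refinement behavior can be sketched in a few lines of Python: the cloud is the set of related terms across the hits, and clicking a term filters the hit list down. The sample data and the matching rule are my assumptions for illustration, not Xurch internals.

```python
# Hypothetical hit list; each hit carries the related terms it matches.
results = [
    {"title": "Jaguar the cat", "terms": {"animal", "cat"}},
    {"title": "Jaguar the car", "terms": {"car", "automobile"}},
    {"title": "Jaguar the OS", "terms": {"software", "apple"}},
]

def cloud(hits):
    """Union of terms across hits: the unordered 'cloud' offered to the user."""
    return set().union(*(h["terms"] for h in hits))

def refine(hits, clicked_term):
    """Clicking a cloud term chops the list down to germane hits."""
    return [h for h in hits if clicked_term in h["terms"]]

print(sorted(cloud(results)))
print([h["title"] for h in refine(results, "car")])  # -> ['Jaguar the car']
```

The appeal of the design is that refinement happens before the user wades through pages of results: one click on a cloud term replaces several rounds of query reformulation.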

tag cloud

My hunch is that the Fujitsu Laboratories of America here have more search goodness in the creative microwaves. NEC Research near the old Bell Labs building in New Jersey did some interesting search related work. Maybe Fujitsu will reignite Japanese-funded rexurch into information retrieval? A search for “information retrieval” on the Fujitsu Labs’ Web site returns a link to a tie up with Open Text but not too much other exciting stuff among the four hits.

Stephen Arnold, June 12, 2009

Similar Sites Is Darn Useful

June 8, 2009

A happy quack to the reader who alerted me to SimilarSites.com. I enter a url that interests me, and the system generates a list of similar sites. Try it here. The service is free and works quite well. There is a service called SimilarSites.net, but I am describing the Dot Com version. The company was founded in 2007 by “Web veterans,” and I will poke into the outfit because I find the service helpful and not annoying. Who wants three clicks to execute a task? Not me. The company offers a browser add-on, which is described in somewhat wacky Web words:

an intelligent browser Add-On that dynamically provides easy access to relevant websites and content. Wherever you go on the web, our technology will work behind the scenes to discover valuable common content and present it to you in a useful way. Built on sophisticated algorithms that scout the internet and taking into account user opinions. It matters not if the user is looking at a major portal or a website of some unknown artist, SimilarWeb provides accurate results for rare sites as well as highly ranked ones, the technology excels in the long tail of the web.

Verbiage aside, worth a look. I put this puppy on my quick links list, moving Similicio.us and Tagomatic.com to my bookmarks. The tagline is particularly good for SimilarSites.com: “Discover without searching.” Dead on in my opinion.

Stephen Arnold, June 8, 2009

Successful Enterprise Search: The Guidebook

June 2, 2009

The reviews are coming in for the study by Martin White and me about enterprise search. The publication is “Successful Enterprise Search Management”. The study brings together a method for implementing an enterprise search solution. The publisher is Galatea in the United Kingdom. You can find information about the study and a page of links to the reviews. If you are involved in enterprise search, you may find the monograph useful. We take care to identify ways to gather information so that decisions about search can be based upon facts. We cover the entire process of planning, procuring, implementing, and enhancing a search system. We do mention some vendors, but the monograph is not a rehash of the unwieldy Enterprise Search Report, nor is it as technically top heavy as my three analyses of Google’s mind boggling technology.

I had one copy, which disappeared in a procurement team meeting. If you want a copy, you, like me, will have to pony up some cash to get this useful roadmap, guidebook, and search Baedeker. The monograph contains information not generally included in the breezy analyses of vendor-specific reviews and the fly overs of the industry that consulting firms generate for their paying customers.

One of the reviewers said:

“Martin White and Steve Arnold have created the authoritative guide for executives and business managers to understanding enterprise search from top to bottom. This book covers the players in detail, as well as emerging technologies that promise to improve the search experience in corporations in the coming years.”

Martin and I have made an effort to be clear, concise, and pragmatic. I wish I had had this monograph when I did my first project in the mid-1970s. I did not know then what I know now. You may be able to get a leg up on what is a quite interesting and challenging application with our study.

Stephen Arnold, June 2, 2009

Francois Schiettecatte, FS Consulting

June 1, 2009

Through a mutual contact, I reconnected with François Schiettecatte, a search engine expert with other computer wizard skills in his toolbox. Mr. Schiettecatte worked on a natural language processing project in the late 1990s. He shifted focus and was a co-founder of Feedster.com. He told me that he had contributed to a number of interesting projects and revealed that he was working on a new search and content processing system.

Mr. Schiettecatte consented to an interview. I spoke with him on May 29, and I put the full text of our discussion in the ArnoldIT.com Search Wizards Speak collection. You can find that series of interviews with influential figures in search and content processing here.

Mr. Schiettecatte and I had a lively discussion and he offered some interesting insights into the trajectory of search and retrieval. Let me highlight two of his comments and invite you to read the full text of the discussion here.

In response to a question about the new start ups entering the search and retrieval sector, Mr. Schiettecatte said:

You can apply different search approaches to different data sets, for example traditional search as well as NLP search to the same set of documents. And certain data set will lend themselves more naturally to one type of search as opposed to another. Of course user needs are key here in deciding what approaches work best for what data. I would also add that we have only begun to tackle search and that there is much more to be done, and new companies are usually the ones willing to bring new approaches to the market.

We then discussed the continuing interest in semantic technology. On this matter, Mr. Schiettecatte offered:

More data to search usually means more possible answers to a search, which means that I have to scan more to arrive at the answer, improved precision will go a long way to address that issue. A more pedestrian way to put this is: “I don’t care if there are about a million result, I just want the one result”. Also, having the search engine take the extra step in extracting data out of the search results and synthesizing that data into a meaningful table/report. This is more complicated but I has the potential to really save time in the long run.

For more information about Mr. Schiettecatte’s most recent project, read the full text of the interview here.

Stephen Arnold, June 2, 2009

Connotate Update

May 30, 2009

Connotate is a content aggregation service. Two days ago a reader sent me a link to a story about Connotate on the MyCentralJersey.com Web site. The article “New Brunswick Software Company Tracks Web Info for Clients” here by Jared Twasser was informative and provided an interesting insight into the nine-year-old company. Mr. Twasser wrote:

Molloy [a Connotate senior manager] said Connotate’s technology is different than search engines, such as Google, that scour the Web searching for keywords. “What we do is we’re able to understand a page at a much deeper level,” he said. “We’re able to understand a page on an element level, not just the whole page, but we can understand objects on the page.” The system works because the user can train the software to find specific information … such as prices, job postings or press releases — on a given Web site. The software was developed at Rutgers University and the company was founded by two Rutgers professors and a former research programmer in 2000.

I found this interesting for two reasons. First, the notion of understanding content is very much in the news with the firestorm of articles about Microsoft’s smart search system, Bing. Second, the idea of parsing content is almost a decade old. More information about the Connotate system is here.
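Element-level, trainable extraction of the sort Mr. Molloy describes can be approximated with field-to-pattern rules: the user "trains" the system by associating a field with examples, and the system then pulls matching elements from any page. Connotate's actual training machinery is not described in the article, so the regex-based Python sketch below is purely my own stand-in.

```python
import re

# Hypothetical "trained" rules: each field maps to a pattern learned from
# the user's examples. These patterns are illustrative assumptions.
rules = {
    "price": re.compile(r"\$\d+(?:\.\d{2})?"),
    "job_posting": re.compile(r"Hiring:\s*(.+)"),
}

def extract(page_text: str) -> dict:
    """Apply every trained rule to a page and collect element-level matches."""
    found = {}
    for field, pattern in rules.items():
        matches = pattern.findall(page_text)
        if matches:
            found[field] = matches
    return found

page = "Widget sale $19.99 today. Hiring: search engineer"
print(extract(page))
```

The contrast with keyword search is the point: a keyword engine would tell you the page mentions "$19.99"; an element-level extractor tells you the page contains a price, and what it is.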

My question: “What is new about Bing’s parsing?”

Any answers, gentle readers?

Stephen Arnold, May 30, 2009

Finding Info about Tsunami Named Google Wave

May 30, 2009

If you want to ride the Google Wave, you need to get up to speed. I found a couple of resources that may be useful to you. I don’t recommend the Google Web site or the Web log posts. These are breezy and are not as comprehensive as some of the third party write ups. I looked at a number of descriptions today. I would recommend that you read Ben Parr’s “Google Wave: A Complete Guide” here. Then you can sit back and check out the official video. You can find an easy link on Google Blogoscoped here or look at the Google Channel on YouTube. Once you have this information under your belt, head on over to my Overflight service here and read the posts about Wave on the Google Web logs. If you are into code instead of marketing frazzle, click here. I want to reiterate what I wrote earlier. The Wave swamped the new Microsoft Web surfer, Bing Kumo.

Stephen Arnold, May 30, 2009

Bing Kumo Rides the Wave, Wave Soaks Bing Kumo

May 29, 2009

Think back a couple of weeks. Wolfram Alpha became available and Google rolled out announcements about enhancements to its system. Microsoft raised the curtain on its Bing search system, and Google rolled a Wave across its developers. No accidents of timing. Google wants to be in charge of the digital information flows, and it is clear to me that Google treats capable mathematicians and $65 billion software giants exactly the same. In the war of visibility and media attention, Google neutralizes other firms’ efforts in all things digital.

The number of articles about Wave and Bing Kumo seemed high. I thought it would be interesting to try and quantify which product name received the most coverage. I took a count before I conked out after a long day in Washington, DC, and then again this morning. To my dismay, the miserable high speed Internet connection timed out in the middle of the script I used for the count. I tried a couple more times and concluded that in terms of Megite.com, Microsoft was the lead story. Google’s Wave was a sublisting under a Microsoft Bing story. Twitter wasn’t much use because I timed out and then got what looked like erroneous results. A quick check of Newsnow.co.uk revealed that Microsoft and Google were not the top stories when I checked at about 7 am Eastern.

I did some poking around and learned two things:

First, Bing is neither a winner nor a loser as a “decision engine”. It is another search engine aimed at consumers. The mash ups, the social functions, and the semantics are present, just not dominant. Product Review Net here described its position in this way:

Microsoft tells us that this new search engine will be far different than we were used to with Live Search, Google and Yahoo Search. Normally when you search for something you then get one answer, Bing is different, as it knows that one answer is not often enough.

The key point in the article was the statement,

Internet users have been asking the same question, “Why Bing” and the answer is simple. Decisionengine.com explains that although current search engines are amazing, but as more than four websites are created every second, this means that half the search results that come up are not the results that people had searched for. Bing is different as it has evolved in to something new and better, but we will only know if this is true once Bing is up and running.

Okay, multiple answers. You may find the Bing video located by Product Reviews here useful.

Second, Warwick Ashford made a good point in his write up “Google Unveils Next Wave of Online Communications” here. Mr. Ashford wrote:

Google has posted examples of how services like Twitter can be automatically included in waves. Rasmussen described it as “concurrent rich-text editing”, where users see nearly instantly what collaborators are typing in your wave as well as being able to use “playback” to rewind the wave to see how it evolved.

Google, if Mr. Ashford is correct, has focused on communication, in which search is one function.

My thoughts about the Wave and Bing Kumo roll outs are:

  1. Microsoft is trying hard to outdo Google in a market sector that focuses on finding information in some consumer areas such as tickets. Although the service is interesting, it is, by definition, constrained and inherently narrow. The method of interaction is well known, focused on accessing previously indexed information, and delivering utility such as a discount in airfare and similar practical information outcomes.
  2. Google seems to be cobbling together mash ups of its various components and moving parts. Wave is new and open. The idea is to allow developers first and then users to create information channels and then have those flows available for communication purposes. Wave is not search.

The contrast strikes me as quite significant in the broader information market. I think these three reasons sum up my thoughts in the early days of both services:

First, both services seem to be works in progress. In short, we are watching pundits, mavens, and self-appointed wizards exercise themselves with what are not much more than demos. Don’t get me wrong. There’s nothing wrong with demos. Most of my work is a demo. But demos are not products, and it is not clear if either of these offerings will have much of an impact on users. In short, I am less than thrilled with both Wave and Bing.

Second, Microsoft seems intent on beating Google at the search game. Google on the other hand is trying hard to invent a new game in which it has not had much success; that is, real time information retrieval. What’s interesting to me is that both Google and Microsoft may be tilting at windmills. My hunch is that Google will plug along in search, and Microsoft will plug along in its desktop applications and server business. Both companies will be hard pressed to achieve much traction in the short term with their Thursday roll outs. Over time, both will be reasonably successful, but I don’t see a future LeBron James in either demo.

Third, both companies underscore how monocultures react to the new information world. The similarity of each company’s approach to these roll outs makes me see two peas in a pod, not innovative, distinctive ways to address the changing needs of users.

Just my opinion. Honk.

Stephen Arnold, May 29, 2009

Enkia: Early Player in Smart Search

May 26, 2009

Last week, I received a call from a defrocked MBA looking for work. (No surprise that!) The young wizard wanted to know about Enkia, a spin out of Georgia Tech’s incubator program in the late 1990s. If you poke around Web traffic reports, you see a surge for Enkia in year 2000 and then a flat line. In November 2008, a person sent this Twitter message that plopped into my tracking system: “Enkia is alive.” I told the job hunter that I would poke through my search archives to see what information I had. I will be in Atlanta in June, and I will try to swing by the company’s office at 85 Fifth Street in Atlanta to see what’s shakin’. (The last time I tried this approach the TeezIR folks kept the door locked. Big addled geese are often not welcome. Gee, maybe it’s because the addled geese don’t believe the chunks of marketing food tossed at them by vendors.)

The Company

According to an August 2000 article here, the company was

building the foundation of the Intelligent Internet(TM) based on the latest discoveries in cognitive science and artificial intelligence. Enkia’s middleware products overcome the limitations of current Internet search technology by sensing what a browser or shopper wants and recommending information quickly and automatically. This software enables portal providers to create personalized experiences that encourage return site visits and increased sales. Founded in 1998, Enkia is a member of the Advanced Technology Development Center (ATDC), the Georgia Institute of Technology’s high-tech business incubator.

What It Does

Enkia, a name derived from Enki, a Sumerian god with special brain power, was an early entrant in the “artificial intelligence for the Web” movement. If you have been following the exploits of Google, Microsoft, and Yahoo, the notion of smart software is with us today. The marketing verbiage is different, but the notion is the same as it was for Enkia.

Here’s a description from a year 2000 business journal story:

The software [Dr. Ashwin Ram and his students developed], called Enkion, has a type of ESP, if you will, sensing browsers’ needs by what they click. Enkion builds on techniques of artificial intelligence to model the human mind. The technology automatically recommends relevant information so that users don’t have to wade through hundreds of search results.

The company put a demo online, and I had a screen shot of the service. I thought I had results screen shots, but my memory deteriorates more quickly than the value of a US government Treasury note.

image

Screen shot of the Enkia Search Orbit interface, no longer available.

When the service rolled out, Dr. Ram said here:

“EnkiaGuide helps anyone find their ‘needles’ in haystacks of data on and off the Internet,” Dr. Ram adds. “It can help users find their way through technical support libraries or large e-commerce sites, and allow corporations to organize pathways through their large proprietary databases. The EnkiaGuide can make sense out of information chaos.”

The Technology

In my archive, I had a copy of an older white paper which is still available online as of May 25, 2009, here:

The IRIA architecture builds upon and extends the experience-based agent approach by embedding it in a knowledge discovery and presentation engine using techniques from artificial intelligence and machine learning. Crushing demands on resources limit the amount of “smarts” typical web search engines can apply to any particular information resource requests.  IRIA’s design overcomes this problem by leveraging existing search engines for the brute force work of indexing and searching the web and by focusing its “smarts” on modeling and understanding the efforts of an individual or workgroup. The core of IRIA that makes this understanding possible is its reminding engine.  The reminding engine directly applies the experience-based agent approach to the problem of information search, consisting of a context-sensitive search mediator which uses a unified semantic knowledge base called a knowledge map to represent indexed pages, queries, and even browsing sessions in a single format.  This uniform representation enables the development of an experience-based map of available information resources, along with judgments about their relevance, allowing precise searches based on the history of research for an individual, group or online community.  The knowledge map is furthermore a browsable information resource in its own right, accessible by standard internetworking protocols; with appropriate security precautions, this enables workgroups at remote sites to view and exploit information collected by another workgroup.
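The white paper's "single format" idea — pages, queries, and even browsing sessions represented uniformly in one knowledge map, so that any item can be compared with any other — can be sketched in Python. The term-vector representation and the cosine measure below are my assumptions for illustration, not the actual IRIA implementation.

```python
from collections import Counter
import math

knowledge_map = {}  # item id -> entry; one format regardless of item kind

def add_item(item_id: str, kind: str, text: str) -> None:
    """Store a page, query, or session as the same kind of term vector."""
    knowledge_map[item_id] = {"kind": kind, "vector": Counter(text.lower().split())}

def similarity(a: str, b: str) -> float:
    """Cosine similarity between any two items in the map, whatever their kinds."""
    va, vb = knowledge_map[a]["vector"], knowledge_map[b]["vector"]
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

add_item("p1", "page", "machine learning for search engines")
add_item("q1", "query", "search engines")
add_item("s1", "session", "cooking pasta recipes")

# Because everything lives in one representation, a "reminding engine" can
# relate a query to a page or a past session without special-case logic.
print(similarity("q1", "p1") > similarity("q1", "s1"))  # -> True
```

The design payoff the paper claims follows directly from this uniformity: a history of research — queries, visited pages, whole sessions — accumulates in the same map and can be searched and shared like any other information resource.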

