Why Dataspaces Matter

August 30, 2008

My posts have been whipping super-wizards into action. I don’t want to disappoint anyone over the long American “end of summer” holiday. Let’s consider a problem in information retrieval and then answer in a very brief way why dataspaces matter. No, this is not a typographical error.

Set Up

A dataspace is somewhat different from a database. A dataspace can encompass databases, but it can also encompass other information objects, garden variety metadata, and new types of metadata which I like to call meta metadata, among others. All of these are represented in an index. For our purpose, we don’t have to worry about the type of index. We’re going to look up something in any of the indexes that represent our dataspace. You can learn more about dataspaces in the IDC report #213562, published on August 28, 2008. It’s a for-fee write up, and I don’t have a copy. I just contribute; I don’t own these analyses published by blue chip firms.

Now let’s consider an interesting problem. We want to index people, figure out what those people know about, and then generate results to a query such as “Who’s an expert on Google?” If you run this query on Google, you get a list of hits like this.

[Image: Google results for the expert query]

This is not what I want. I require a list of people who are experts on Google. Does Live.com deliver this type of output? Here’s the same query on the Microsoft system:

[Image: Live.com results for the expert query]

Same problem.

Now let’s try the query on Cluuz.com, a system that I have written about a couple of times. Run the query “Jayant Madhavan” and I get this:

[Image: Cluuz.com results for the query “Jayant Madhavan”]

I don’t have an expert result list, but I have a wizard and direct links to people Dr. Madhavan knows. I can make the assumption that some of these people will be experts.

If I work in a company, the firm may have the Tacit system. This commercial vendor makes it possible to search for a person with expertise. I can get some of this functionality in the baked-in search system provided with SharePoint. The Microsoft method relies on the number of documents a person known to the system writes on a topic, but that’s better than nothing. I could, if I were working in a certain US government agency, use the MITRE system that delivers a list of experts. The MITRE system is not one whose screen shots I can show, but if you have a friend in a certain government agency, maybe you can take a peek.
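To make the document-count approach concrete, here is a minimal sketch of the general idea, not SharePoint's actual implementation. The corpus, the author names, and the function name are invented for illustration, and matching a topic by substring is an assumption made to keep the example short.

```python
from collections import Counter

# Hypothetical corpus of (author, text) pairs, invented for illustration.
DOCUMENTS = [
    ("alice", "Notes on Google search ranking"),
    ("alice", "Google and MapReduce"),
    ("bob", "Quarterly budget review"),
    ("bob", "A look at Google dataspaces"),
    ("carol", "Travel policy"),
]


def experts_by_document_count(documents, topic):
    """Rank authors by how many of their documents mention the topic."""
    topic = topic.lower()
    counts = Counter(
        author for author, text in documents if topic in text.lower()
    )
    # most_common() sorts by count, highest first.
    return counts.most_common()


print(experts_by_document_count(DOCUMENTS, "google"))
```

The weakness is easy to see: a person who writes a lot outranks a person who knows a lot, and nobody who never writes a document shows up at all.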

None of these systems really do what I want.

Enter Dataspaces

The idea for a dataspace is to process the available information. Some folks call this transformation, and it really helps to have systems and methods to transform, normalize, parse, tag, and crunch the source information. It also helps to monitor the message traffic for some of that meta metadata goodness. An example of meta metadata is an email. I want to index who received the email, who forwarded the email to whom and when, and any cutting or copying of the information in the email to which documents and the people who have access to said information. You get the idea. Meta metadata is where the rubber meets the road in determining what’s important regarding information in a dataspace.
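One way to picture meta metadata for email is as a stream of small event records, one per action taken on a message. The sketch below is an assumption about how such records might look, not a description of any shipping system. The class name, field names, and action labels are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MessageEvent:
    """One action taken on a message. All field names are hypothetical."""
    message_id: str
    actor: str           # the person who took the action
    action: str          # "received", "forwarded", or "copied"
    target: str          # a recipient, or the document text was pasted into
    timestamp: datetime


def build_actor_index(events):
    """Group events by actor, with each actor's events in time order."""
    index = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        index.setdefault(event.actor, []).append(event)
    return index


# Invented sample data.
EVENTS = [
    MessageEvent("m1", "bob", "forwarded", "carol", datetime(2008, 8, 29, 10, 0)),
    MessageEvent("m1", "bob", "received", "bob", datetime(2008, 8, 29, 9, 0)),
    MessageEvent("m1", "carol", "copied", "report.doc", datetime(2008, 8, 29, 11, 30)),
]

index = build_actor_index(EVENTS)
for actor, actions in index.items():
    print(actor, [(e.action, e.target) for e in actions])
```

Indexing by actor rather than by message is a design choice for this example: it puts each person's actions in one place, which is what a query about people needs.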

Dataspaces and their meta metadata make it possible to do some new types of queries. We can look at a person or an “actor” in the dataspace as taking actions with regard to information. By plotting these actions on a timeline, we can learn some interesting things about that person’s interests and that person’s competence in certain subject areas. We can also look at references to the content associated with a person of interest. Armed with that information and some math, we can assign a “score” to the person; for example, for an email from a wizard like a Gartner Group consultant, the system could slap on a 0.999999, the highest possible score which befits a Gartner consultant. I would get a score of 0.000001, one of the lowest allowable scores.
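Here is one way the "some math" could work, sketched under loudly stated assumptions. The action types, their weights, the 180-day half-life, and the squashing function are all invented for illustration; the only figures taken from this post are the 0.999999 ceiling and the 0.000001 figure, which the sketch treats as a floor.

```python
import math
from datetime import datetime

# Hypothetical weights per action type; nothing here is a published method.
ACTION_WEIGHTS = {"authored": 3.0, "forwarded": 1.0, "cited_by_others": 2.0}
HALF_LIFE_DAYS = 180.0                      # assumed recency half-life
MIN_SCORE, MAX_SCORE = 0.000001, 0.999999   # bounds borrowed from the post


def expertise_score(actions, now):
    """Score one person on one topic.

    actions: list of (action_type, timestamp) pairs from the timeline.
    """
    total = 0.0
    for action_type, when in actions:
        age_days = (now - when).total_seconds() / 86400.0
        decay = 0.5 ** (age_days / HALF_LIFE_DAYS)   # older actions count less
        total += ACTION_WEIGHTS.get(action_type, 0.0) * decay
    raw = 1.0 - math.exp(-total)                      # squash into [0, 1)
    return min(MAX_SCORE, max(MIN_SCORE, raw))


now = datetime(2008, 8, 30)
wizard = [("authored", datetime(2008, 8, 1))] * 10   # busy and recent
goose = []                                           # no recorded actions

# Rank people by score, highest first.
scores = {"wizard": expertise_score(wizard, now), "goose": expertise_score(goose, now)}
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.6f}")
```

The decay term is what makes the timeline matter: the same action counts for less as it ages, so a person who stopped working on a topic drifts down the list.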

With the timeline and the “score,” a query for an expert can generate a much more useful result. I need a list of names and a confidence score next to a person’s name. After all, who wants to spend any time with an addled goose?

Where Are We in This?

Researchers in the US, Europe, and Pacific region are working away on this problem. Right now, this type of expert query is in its early stages. The dataspace technology, which dates from the early 1990s by the way, may provide a solution to this particular information problem. My research into dataspaces suggests that the systems and methods may be quite useful in other information retrieval areas as well.

Net Net

Plumbing like MapReduce is essential to performing manipulations in a dataspace. But the technical challenge requires much more than a couple of functions from LISP. We will have to look to companies that have computational horsepower and the expertise to provide a truly useful and reasonably accurate expert finding system. I have a candidate in mind. If you know of a company making progress in this area, please share that information in the comments section of this Web log.
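For readers who have not seen the pattern, here is a toy, single-process sketch of the map and reduce steps applied to counting how often each person acts on each topic. It is not Google's MapReduce, which distributes this work across many machines. The record layout and function names are hypothetical.

```python
from collections import defaultdict
from functools import reduce

# Invented records: each one says an actor touched content on some topics.
RECORDS = [
    {"actor": "alice", "topics": ["google", "search"]},
    {"actor": "bob", "topics": ["google"]},
    {"actor": "alice", "topics": ["google"]},
]


def map_record(record):
    """Map step: emit ((actor, topic), 1) for every topic in a record."""
    return [((record["actor"], topic), 1) for topic in record["topics"]]


def shuffle(pairs):
    """Group emitted values by key, as the framework would between steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(values):
    """Reduce step: sum the values emitted for one key."""
    return reduce(lambda a, b: a + b, values, 0)


def run(records):
    pairs = [pair for record in records for pair in map_record(record)]
    return {key: reduce_counts(values) for key, values in shuffle(pairs).items()}


counts = run(RECORDS)
print(counts)
```

The counting is the easy part. Deciding which actions to count and how much weight each one deserves is where the real work lies.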

Oh, if you want to grouse about my assertion that databases are a subset of a dataspace, write a journal article and logic chop there. Databases are now officially uninteresting to me. We have to move beyond databases just as we have to move beyond key word search. I may be old and an addled goose, but I have a good sense of what’s important when trying to crack tough problems in information access.

Stephen Arnold, August 30, 2008
