Xobni: Email Search and More

July 15, 2008

The process of locating an email message remains–ah, shall I say?–uneven. I have seen demos of some nifty email search from Coveo, the Canadian search and content processing company that is expanding its product portfolio and its top line revenue.

Xobni, an email extender, now integrates with Facebook. You can read about this function here. I am not a fan of social search, and I think that this type of function in an organization can deliver some surprises to senior management unless certain precautions are taken.

A client asked me if Xobni could be used as an alternative to Clearwell Systems. I have described Clearwell’s approach to content processing here. I am working on a more thorough analysis of Xobni now. My hypothesis is that Xobni is designed for the average email user. Clearwell, on the other hand, is tailored to the needs of attorneys and law librarians, among other specialists, working on legal matters.

Xobni does provide email search, but its reach extends to email organization, inbox management, and social functions. Xobni has a Googler on the staff and venture money in its bank account.

You can download a demo of the product here. Xobni runs on Microsoft Windows and requires that you have Outlook 2003 or 2007 installed.

The company has a nifty demo here.

One of the two or three people who read this Web log alerted me to Xobni’s embedded entity extraction function. Xobni can parse emails and pull out phone numbers, among other entities. The software features a function that threads people together. This function is somewhat similar to Clearwell Systems’ email threading operation.
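To make the phone number trick concrete, here is a minimal sketch of pattern-based entity extraction. It is my illustration, not Xobni's code; the pattern assumes North American style numbers, and the function name and sample text are made up.

```python
import re

# Assumption: North American style numbers only; real mail needs a broader pattern.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})(?!\d)"
)


def extract_phone_numbers(body: str) -> list:
    """Return phone numbers found in an email body, normalized to NNN-NNN-NNNN."""
    return ["-".join(m.groups()) for m in PHONE_RE.finditer(body)]


# Hypothetical message text for the demonstration.
sample = "Call me at (502) 555-0142 or 502.555.0199 before Friday."
print(extract_phone_numbers(sample))
```

A production extractor would need international formats, extensions, and some way to tell a phone number from an order number. The lookarounds in the pattern only guard against matching inside a longer run of digits.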

What I find interesting about Xobni is that sophisticated text processing operations are finding their way into what are consumer applications or mainstream business applications.

One risk to Xobni is that Microsoft may embed similar functions in the next release of Outlook. My experience suggests that Xobni is positioning itself to be purchased, possibly by Microsoft.

My concern with any application written for Outlook is that the personal store management issues loom large. Security simply does not exist when most users can copy a PST file and have a go at browsing email, sometimes another person’s.

Take a look at Xobni. There will be more interesting uses of text processing functions.

Stephen Arnold, July 15, 2008

Microsoft Zips Up Zoomix

July 15, 2008

Data are a problem. Microsoft, despite some of its resellers’ and cheerleaders’ assurances, lacked data cleansing tools. No longer. On July 14, 2008, Microsoft bought Zoomix, a company with a system that “uses guided self-learning technology to easily build a knowledge of how to parse, match, classify and clean data, and applies what it has learned to every new piece of information fed into the system, even if it has not encountered similar data before.”

Why is this important? Fiddling with data–normalization, clean up, transformation, and other arcana–can consume more than 30 percent of an information technology budget each year. You can read the Zoomix news release about the deal here. The company says that it delivers a software-based self-learning data quality engine.

Zoomix’s management has buzzword fever. You will have to do some careful reading to figure out what a PIM Accelerator, a BI Accelerator, an MDM Accelerator, and UNSPSC Auto Classification do. Hint: they eliminate most of the manual intervention required in traditional data cleansing processes.
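For readers who have not wrestled with data clean up, here is a rough sketch of the normalize-and-match chore. It is not Zoomix's self-learning method, which the company does not spell out. I am swapping in plain fuzzy string matching from Python's standard difflib module, and the suffix list, cutoff, and vendor names are my assumptions.

```python
import difflib
import re
from typing import List, Optional

# Assumption: a short, hand-made list of company suffixes to ignore.
NOISE_WORDS = {"inc", "corp", "ltd", "llc", "co"}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and drop common suffixes."""
    text = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(w for w in text.split() if w not in NOISE_WORDS)


def match_record(raw: str, known: List[str], cutoff: float = 0.8) -> Optional[str]:
    """Map a messy value to the closest known clean value, or None if nothing is close."""
    by_key = {normalize(k): k for k in known}
    hits = difflib.get_close_matches(normalize(raw), list(by_key), n=1, cutoff=cutoff)
    return by_key[hits[0]] if hits else None


# Hypothetical reference list and messy inputs.
known_vendors = ["Microsoft", "Fast Search & Transfer", "Autonomy"]
for messy in ["MICROSOFT Corp.", "Fast Search and Transfer", "Zoomix"]:
    print(messy, "->", match_record(messy, known_vendors))
```

The manual intervention lives in the suffix list and the cutoff. Someone has to pick them, check the misses, and tune them again when new data arrive. That is the labor a self-learning engine promises to reduce.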

Ars Technica has a useful description of the Zoomix technology here.

My research suggests that Zoomix technology will make SQL Server licensees happy. The auto classification technology could bring much needed robustness to the current versions of SharePoint. Zoomix technology can improve some of the native Fast Search & Transfer processes as well, but that use of the technology will take longer to deploy.

Stephen Arnold, July 15, 2008

Cluuz Gets BOSS-y

July 12, 2008

Cluuz.com’s owner–Sprylogics in Toronto, Ontario–alerted me that BOSS–Yahoo’s Build Your Own Search Service–will power Cluuz.com’s search results. The Sprylogics news release made me chuckle with this statement:

… www.cluuz.com will now be powered by one of the most valuable assets on the Web, the Yahoo! Search infrastructure.

Yahoo.com certainly makes headlines, but it isn’t doing a spectacular job in Web search. Google’s lead keeps inching forward. Yahoo drifts backward.

Sprylogics, like Hakia and other search and content processing system providers, can use the Yahoo plumbing without paying the sometimes sky-high fees asked of some in recent years. As a user of Cluuz.com, I find that the Yahoo results can yield some useful nuggets. Yahoo’s native search system doesn’t meet my needs, so I poke around the specialized engines that are listed here.

In the near future, I will run some test queries and record my opinions in this Web log. I wrote a short description of the Cluuz.com system on May 8, 2008. You can find this write up here.

Stephen Arnold, July 12, 2008

More Artificial Intelligence: This Time Search

July 11, 2008

I remember when InfoWorld was a big, fat tabloid. I had to keep two subscriptions going because I would be summarily dropped. So, my dog at the time–Kelsey Benjamin–got one, and my now deceased partner, Ken Toth, got the other one. It was easy to spoof the circulation folks who had me fill out forms. I used to check my company size as $10 to $20 million and assert that I bought more than $1 million in networking gear.

Paul Krill wrote “Artificial Intelligence Tied to Search Future”, which appeared on July 12, 2008, on the InfoWorld Web site. You can read the story here. (Search is not a core competency of most publishing companies, so you may have to enlist the help of a gum shoe if this link goes dead quickly.)

The point of the well-written essay is that an IBM wizard asserts that artificial intelligence will be instrumental in advanced text processing.

No disagreement from me on that assertion. What struck me as interesting was this passage from the essay:

“We’re going to see in the next five years next-generation search systems based on things like Open IE (Information Extraction),” Etzioni said. Open IE involves techniques for mapping sentences to logical expressions and could apply to arbitrary sentences on the Web, he said.

The Etzioni referenced in the passage is none other than Oren Etzioni, director of the Turing Center at the University of Washington.

Why is this important?

Google and Microsoft hire the junior wizards from this institution, pay them pretty well, and let them do stuff like develop systems that use artificial intelligence. The only point omitted from the article is that smart software has been part of the plumbing at Google for a decade, but Google prefers the term “janitors” to “smartbots”. Microsoft in 1998 was aware of smart software, and the Redmonians have been investing in artificial intelligence for quite a while.

My point is that AI is not new, and it is not in disfavor among wizards. AI has been in disfavor among marketers and pundits. The marketers avoid the term because it evokes the image of SkyNet in Terminator. SkyNet is smart software that wants to kill all humans. The pundits overhyped AI years ago, then discovered that smart software was useful in aircraft control systems (yawn) and determining what content to cache on Akamai’s content delivery network servers (bigger yawn).

Now AI is back with zippier names–the essay includes a raft of them, and you can dig through the long list on page 2 of Mr. Krill’s essay. More important, the applications are ones that may mean something to an average Web surfer.

I must admit I don’t know what this means, however:

Etzioni emphasized more intelligent Internet searching. “We’re going to see in the next five years next-generation search systems based on things like Open IE (Information Extraction),” Etzioni said. Open IE involves techniques for mapping sentences to logical expressions and could apply to arbitrary sentences on the Web, he said.
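My best guess, and it is only a guess, is that "mapping sentences to logical expressions" means turning a sentence into something like relation(argument one, argument two). Here is a toy sketch of that idea. It is not Dr. Etzioni's method. The hand-picked verb list and the function name are my assumptions, and the sketch handles only the simplest subject-verb-object sentences.

```python
import re

# Assumption: a tiny hand-picked verb list stands in for whatever a real
# Open IE system would discover on its own.
RELATION_RE = re.compile(
    r"^(?P<arg1>.+?)\s+(?P<rel>acquired|bought|hires|powers)\s+(?P<arg2>.+?)\.?$"
)


def to_logical_expression(sentence: str):
    """Map a simple sentence to a (relation, arg1, arg2) tuple, or None."""
    m = RELATION_RE.match(sentence.strip())
    if m is None:
        return None
    return (m["rel"], m["arg1"], m["arg2"])


print(to_logical_expression("Microsoft bought Zoomix."))
print(to_logical_expression("Yahoo powers Cluuz.com."))
print(to_logical_expression("Data are a problem."))
```

The "open" part, as I read the quote, is that a real system would not need my verb list and would work on arbitrary sentences.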

If you know, use the comments section for this Web log to help me out. In the meantime, run a Google query from www.google.com/ig. There’s AI under the hood. Few take the time to lift it and look. Some of the really neat stuff is coming from Dr. Etzioni’s former students just as it has for the last decade at Google.

Stephen Arnold, July 10, 2008

Autonomy Discovers Virtualization (Not My Headline)

July 10, 2008

Internet News’s February 6, 2008, essay “Autonomy Discovers Virtualization” turned up in my news reader this morning. You can read the full but old story here.

The point of the article is that Autonomy acquired Zantaz. Zantaz has software called Intraspect. The Intraspect software is, according to Internet News, “the first to offer automated search or discovery in a wide range of virtual environments, including VMWare, a process that usually requires a time-consuming, manual set of steps, if it’s done at all.”

And who am I to doubt Internet News?

What caught my eye was the reference to VMWare. That company is in the news. ZDNews has a useful overview of the company’s problems here. My hunch is that filters are on the lookout for VMWare as the company spirals into more rough winds. Autonomy may get some play, but in the context of VMWare, I am not sure the halo effect is working the way it should.

Oh, the Internet News story reminded one of my engineers of former Vice President Al Gore’s statement about “inventing the Internet”. The word “discovers” in the Internet News story appears to have a similar effect on my technical team.

Stephen Arnold, July 10, 2008

WAND: New Business Taxonomy Available

July 10, 2008

Taxonomies are slightly less popular among the enterprise search crowd than Hannah Montana and petrol prices. WAND, a developer of controlled vocabulary tools and services, has rolled out what the company calls “a robust enterprise taxonomy.”

The idea is that most organizations remain clueless about taxonomies, controlled vocabularies, knowledge bases, and ontologies. The words are easy to say, but the ability to create a schema that a human being in an organization can use is a very different kettle of fish.

WAND’s taxonomy will allow a clueless or semi-clueless organization to get a taxonomy, edit it, and use the terms and hierarchies as a way to tag processed content. According to the company’s news release:

WAND’s new business vocabulary provides a four-level hierarchy of important business terminology covering human resources, accounting and finance, sales and marketing, legal, and information technology. The vocabulary includes all the core business concepts that any company has to deal with and can be extended and customized to include company specific terminology. WAND’s enterprise taxonomy can easily be paired with an existing enterprise search engine to improve the relevancy of search results returned.
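To show what tagging processed content with a four-level hierarchy looks like, here is a minimal sketch. The terms and paths are made up for illustration and are not taken from WAND's vocabulary. The exact word matching is my simplification, not the company's method.

```python
# Hypothetical four-level paths; none of these come from WAND's taxonomy.
TAXONOMY = {
    "invoice": ("Business", "Accounting and Finance", "Accounts Payable", "Invoice"),
    "payroll": ("Business", "Human Resources", "Compensation", "Payroll"),
    "nda": ("Business", "Legal", "Contracts", "NDA"),
}


def tag_document(text: str) -> list:
    """Return the full hierarchy path for every taxonomy term found in the text."""
    words = {w.strip(".,;:!?()").lower() for w in text.split()}
    return [" > ".join(path) for term, path in TAXONOMY.items() if term in words]


# Hypothetical document text.
doc = "Please sign the NDA before we send the first invoice."
for tag in tag_document(doc):
    print(tag)
```

A search engine can index these paths as extra fields, so a query scoped to Legal picks up the document even though the word "legal" never appears in it. Real controlled vocabularies also carry synonyms and related terms, which this sketch leaves out.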

You can learn more about the company and license fees here. I wrote about Arikus, another vendor offering off-the-shelf taxonomies, here. I profile two other taxonomy players, Access Innovations and SchemaLogic, in my Beyond Search study for the Gilbane Group. You can also tap MuseGlobal for this type of information. Some companies assert that you can learn how to “do” a taxonomy quickly by signing up for a one-day class. Okay, maybe that will work. It’s taken most of the professionals working on real-deal controlled vocabularies decades to hone their skills. I thought I knew words, but after working with Betty Eddison, founder of InMagic, and later with the Access Innovations’ team, I learned that I knew essentially zero. Fortunately, working with these folks helped me to be more informed about knowledge systems.

Take a peek at the WAND controlled term list and share what you learn with the two or three readers of this Web log.

Stephen Arnold, July 10, 2008

Copernic Desktop Search Updated

July 9, 2008

Copernic, the Canadian developer of search systems, has released a new version of Copernic Desktop Search. You can download a trial version here. Version 2.3 features speed improvements, a “did you mean” function to correct common misspellings, and a federation feature. A user can now search all index categories with an “All” feature. I particularly liked the “save search” feature. I often run the same query in the course of a project. For me, this is an important time saver. In my opinion, you will want to download the new version and drive it around your data race track.

Stephen Arnold, July 9, 2008

Concept Searching for SharePoint

July 9, 2008

My SharePoint posts continue to thrill the two or three readers of this Web log. So, here’s another joy booster. You can add taxonomy navigation, concept searching, and classification functions with a snap-in from Concept Searching.

The company has offices in the UK (headquarters), spyland in McLean, Virginia, and Cape Town, South Africa. The firm’s tag line is “Retrieval Just Got Smarter,” which sums up the company’s approach to content processing quite nicely, thank you. The company was founded in 2002. John Challis (CEO and CTO) wanted to develop statistical search and classification products with a difference. The idea was to provide a method that reduced the “drift” that afflicts some statistical methods. You can download a useful fact sheet here.

The SharePoint conceptClassifier, according to the “Microsoft Enterprise Search Blog”:

adds automatic document classification and taxonomy management to Microsoft SharePoint and works without the need to build another search index. It is installed as a set of Features that, when activated, cause new columns to be displayed in the document library listings and new menu options appear that allow authorized users to edit the automatically generated metadata, if required.

To see the system in action navigate to http://moss.conceptsearching.com. When you get to the demo screen, click on concept searching in the left hand panel. You will be able to explore a limited set of content. Some documents return 404 errors, but you will get the idea of the system’s functionality.

Among the features the system adds to SharePoint are:

  • Automatic Classification
  • Controlled Vocabulary
  • Multiple Taxonomies
  • Folksonomies
  • Auto Clue suggestion
  • AJAX Environment
  • Document Movement
  • SQL Based

This is an impressive line up, and you will want to test the system to make sure it meets your needs. The company, like Interse in Copenhagen, recognizes the appetite SharePoint administrators have for features that make the system more useful to SharePoint users, which number somewhere between 65 and 100 million worldwide.

Stephen Arnold, July 9, 2008

More Transformation Goodness from the Googleplex

July 8, 2008

In press is one of my for-fee write ups that talks about the black art of data transformation. I will let you know when it is available and where you can buy it. The subject of this for-fee “note” is one of the least exciting aspects of search and content processing. (I’m not being coy. I am prohibited from revealing the publisher of this note, the blue-chip company issuing the note, and any specific details.) What I can do is give you a hint. You will want to read this Web log post at Google Code: Open Source Google. News about Google’s Open Source Projects and Programs here. You can read other views of this on two other Google Web logs: The Official Google Web log here and Matt Cutts’s Web log here. You will also want to read the information on the Google project page.

The announcement by the Googley Kenton Varda, a member of the software engineering team, is “Protocol Buffers: Google’s Data Interchange Format”. Okay, I know you are yawning, but the DIF (an acronym for something that can chew up one-third of an information technology department’s budget) is reasonably important.

The purpose of a DIF is to take content (Object A in Format X) and via the magic of a method change that content into Format Y. Along the way, some interesting things can be included in the method. For example, nasty XML can be converted into little angel XML. The problem is that XML is a fat pig format and fixing it up is computationally intensive. Google, therefore:

developed Protocol Buffers. Protocol Buffers allow you to define simple data structures in a special definition language, then compile them to produce classes to represent those structures in the language of your choice. These classes come complete with heavily-optimized code to parse and serialize your message in an extremely compact format. Best of all, the classes are easy to use: each field has simple “get” and “set” methods, and once you’re ready, serializing the whole thing to – or parsing it from – a byte array or an I/O stream just takes a single method call.

The approach is sophisticated and subtle. Google’s approach shaves with Occam’s Razor, and the approach is now available to the Open Source community. Why? In my opinion, this is Google’s way of cementing its role as the giant information blender. If protocol buffers catch on, a developer can slice, dice, julienne, and chop without some of the ugly, expensive, hand-coded stuff the “other guys’ approach” forces on developers.
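To show why the format is compact, here is a small sketch that hand-encodes a single integer field the way the published Protocol Buffers encoding documentation describes: a key built from the field number and wire type, then the value as a base-128 varint. The function names and the XML snippet I compare against are my own inventions. Real projects use the classes the protocol buffer compiler generates, not hand-rolled code like this.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint, low-order group first."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_int_field(field_number: int, value: int) -> bytes:
    """Encode one integer field: key (field number plus wire type 0), then the value."""
    key = (field_number << 3) | 0  # wire type 0 means varint
    return encode_varint(key) + encode_varint(value)


binary = encode_int_field(1, 150)
# Hypothetical XML equivalent, made up for the size comparison.
xml = b"<message><a>150</a></message>"
print(binary.hex(), "is", len(binary), "bytes versus", len(xml), "bytes of XML")
```

The field name never travels over the wire, only its number. The receiver needs the same definition file to make sense of the bytes, which is the trade the format makes for its small size.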

There will be more of this type of functionality “comin’ round the mountain, when she comes,” as the song says. When the transformation express roars into your town, you will want to ride it to the Googleplex. It will work; it will be economical; and it will leapfrog a number of pitfalls developers unwittingly overlook.

Stephen Arnold, July 8, 2008

Microsoft Powerset Could Unseat Google

July 8, 2008

You may find this essay stimulating. I did. Rebecca Sato’s essay “Microsoft Acquires Powerset: Why a Semantic Web Will Be Smarter, Faster & All-Around Better” is remarkable. Please, navigate to The Daily Galaxy and get the inside scoop on the future of the Web. For example, Ms. Sato writes:

Microsoft’s acquisition of Powerset signals a the building of a future when the entire world will likely have access to virtual “software agents” who will “roam” across the Web, making our travel arrangements, doctor’s appointments and basically taking care of all the day-to-day hassles for humankind. It’s a great vision, but it will never be achieved with today’s current Internet.

My take on Ms. Sato’s thesis is that today, users must struggle with text documents that require the user to figure out what’s important. The future is smarter software, richer indexing, and more dimensionality for the information. Ms. Sato acknowledges that Powerset-type functions are in their early stages. I agree.

Let me offer two observations:

  • Smart software can be resource intensive. As a result, semantic systems may have to start small and grow as the computing resources become available. To me, this means that semantic systems may be confined to modest roles, often as utilities or special purpose operations. If this happens, semantic systems may take years to deliver on their potential.
  • Semantic technology may find itself playing catch up to search systems that use smart shortcuts. For example, user tagging may provide acceptable payoffs without the complexity and cost of semantic systems. If this happens, the search revolution may be people power, not smart software.

Agree? Disagree? Let me know.

Stephen Arnold, July 8, 2008
