Dow Jones and Automatic Taxonomy Generation

September 30, 2008

An eager beaver reader (I only have two or three) sent me a link to “Taxonomies for Human Vs Auto-Indexing.” The author of the Synaptica Central write up is Wendy Lim. She is summarizing or reproducing information attributed to Heather Hedden. From a bibliographic angle, I think a tad more work could be done to make clear who was writing what, where, and when. But that’s an old, failed database goose quacking about the brilliant work done by “experts” decades younger than I. Quack. Quack.

You can read the September 26, 2008, write up here. The article is about a Taxonomy Bootcamp. After a bit of sleuthing, I discovered that this is an add on to some Information Today trade shows. The bootcamp, as I understand it, is an intellectual Camp Lejeune except that the attendees skip the push ups, the 5 am wake up calls, and the 20 mile runs. Over a period of two or three days, taxonomy recruits emerge battle ready, honed to deal with the intellectual rigors of creating taxonomies.

A real taxonomy. Source: www.nnf.org.na

The word “taxonomy” is more popular than “enterprise search” and for good reason. Enterprise search has emerged from organizations with a bold 4F stamped on its fitness report. After hours, maybe months of work, and some hefty bills to pay, enterprise search customers are looking for a way to kill the enterprise search enemy. That’s where a taxonomy comes in. I’m no expert in taxonomies. I know I was involved in creating taxonomies for some once-hot commercial databases like ABI / INFORM, Business Dateline, General Business File, Health Reference Center, and the 1993 Web directory Point (Top 5% of the Internet). What those experiences taught me was that I don’t know too much about taxonomies or classification systems in general for that matter. I keep in touch with people who do know; for example, Marje Hlava at Access Innovations, Barbara Quint (Searcher Magazine), Marydee Ojala (Online Magazine), Ulla de Stricker (De Stricker & Associates), and other specialists. I get nervous when a 20- or 30-something explains that taxonomies are no big deal or that a business process can crack a taxonomy problem or a certain vendor’s software can auto-magically create a taxonomy.

A Synaptica Central tag cloud.

In my experience, the truth is not to be found in any one solution. In fact, the reality of taxonomies is that the concept has gained traction because of fundamental errors in planning and deploying information access systems. I don’t think a taxonomy can retrofit stupid, short sighted decisions. For that reason, I steer clear of most taxonomy discussions because after working with these beasts for more than 30 years, I understand their unpredictable behavior.

Expert System: Morphing into an Online Advertising Tool Vendor

September 28, 2008

Several years ago, YourAmigo (an Australian search and content processing vendor) shifted from enterprise search to search engine optimization. I stopped following the company because I have zero interest in figuring out how to get traffic to my Web site or my Web log. Now Expert System has rolled out what it calls Cogito Advertiser. A brief write up appeared in DMReview.com when I was in Europe. You can read that article here.

The new service, according to DMReview.com:

automatically analyzes Web pages to identify the most relevant topics and extract the main themes included in the text. It classifies content by assigning the category related to the text in real time, based on an optimized taxonomy and high precision. By processing the text, it collects all useful data in an output format structured to be uploaded into a database and directly integrates it with the ad server.

Expert System has some interesting technology. The idea is that software that can “understand” will be able to do a better job of key word identification than a human, often fresh out of college with a vocabulary flush with “ums”, “ers”, and “you knows”.
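To make the quoted description concrete, here is a minimal sketch of taxonomy-based page classification that emits a structured, database-ready record. It is not Expert System's method, which the write up does not describe in detail. The `TAXONOMY` categories, the trigger terms, and the `classify` function are all hypothetical, and simple term counting stands in for whatever linguistic analysis a commercial product performs.

```python
import json
import re
from collections import Counter

# Hypothetical mini taxonomy: category -> trigger terms (invented for illustration).
TAXONOMY = {
    "travel": {"flight", "hotel", "airline", "vacation"},
    "finance": {"loan", "mortgage", "bank", "credit"},
    "autos": {"car", "truck", "sedan", "dealer"},
}


def classify(text):
    """Score each category by counting its trigger terms in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    scores = {
        category: sum(counts[term] for term in terms)
        for category, terms in TAXONOMY.items()
    }
    best = max(scores, key=scores.get)
    # No trigger term found at all means no category is assigned.
    category = best if scores[best] > 0 else None
    # Themes are the taxonomy terms that actually occur on the page.
    themes = sorted(
        token for token in counts
        if any(token in terms for terms in TAXONOMY.values())
    )
    return {"category": category, "scores": scores, "themes": themes}


page = "Compare hotel rates and book a flight. The airline adds a hotel discount."
record = classify(page)
# A flat JSON record is one plausible shape for loading into a database or ad server.
print(json.dumps(record, sort_keys=True))
```

Even this toy version shows where the work hides: someone has to build and maintain the term lists, and a page that uses none of the listed words gets no category at all.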

You can learn more about the company here. As the financial and competitive pressures mount, I expect other vendors to repackage their technology in an effort to tap into more rapidly growing markets with shorter buying cycles than enterprise search typically merits.

Stephen Arnold, September 28, 2008

Taxonomy: Silver Bullet or Shallow Puddle

September 27, 2008

Taxonomy is hot. One of my few readers sent me a link to Fumsi, a Web log that contains a two part discussion of taxonomy. I urge you to read this post by James Kelway, whom I don’t know. You can find the article here. The write up is far better than most of the Webby discussions of taxonomies. After a quick pass at nodes and navigation, he jumps into information architecture requiring fewer than 125 words. The often unreliable Wikipedia discussion of taxonomy here chews up more than 6,000. Brevity is the soul of wit, and whoever contributed to the Wikipedia article must be SWD; that is, severely wit deprived.

Take a look at the Google Trends chart I generated at 8 pm on Friday, September 26, 2008. Taxonomy is generating more Google traffic than the now mud crawler enterprise search. Taxonomy is not as popular as “CMS”, the shorthand for content management system. But “taxonomy” is a specialist concept that seems to be moving into the mainstream. At the just concluded Information Today trifecta conference featuring search, knowledge management (whatever that is), and streaming media, taxonomy was a hot topic. At the Wednesday roof top cocktail, where I worked on my tan in the 90 degree ambient air temperature, I was asked four times about taxonomies. I know I worked on commercial taxonomies and controlled vocabularies for databases, but I learned from those years of experience that taxonomies are really tough, demanding, time consuming intellectual undertakings. I thought I was pretty good at making logical, coherent lists. Then I met the late Betty Eddison and the very active Marje Hlava. These two pros taught me a thing or 50.

In the dumper is the red line which maps “enterprise search” popularity. The blue line is the up and coming taxonomy popularity. The top line is the really popular, yet hugely disappointing, content management term traffic.

I heard people who have been responsible for failed search systems and non functional content management systems asking, “Will a taxonomy improve our content processing?” The answer is, “Sure, if you get an appropriate taxonomy.” I then excuse myself and head to the bar man for a Diet 7 Up. The kicker, of course, is “appropriate”. Figuring out what’s appropriate and then creating a taxonomy that users will actually exploit directly or indirectly is tough work. But today, you can learn how to do a taxonomy in a 40 minute presentation or, if you are really studious, a full eight hour seminar.

I remember talking with Betty Eddison and Marje Hlava about their learning how to craft appropriate taxonomies. Marje just laughed and turned to her business partner who also burst out laughing. Betty smiled and in her deep, pleasant voice said, “A life time, kiddo.” She called me “kiddo”, and I don’t think anyone else ever did. Marje Hlava chimed in and added, “Well, Jay [her business partner] and I have been at it for two life times.” I figured out pretty quickly that building “appropriate” taxonomies required more than persistence and blissfully ignorant confidence.

Why are taxonomies perceived as the silver bullet that will kill the vampire search or CMS system? A vampire system is one that will suck those working on it into endless nights and weekends and then gobble available budget dollars. In my opinion, here are the top five reasons:

  1. The notion of a taxonomy as a quick fix is easy to understand. Most people think of a taxonomy as the equivalent of the Dewey Decimal system or the Library of Congress subject headings and think, “How tough can this taxonomy stuff be?” After a couple of runs at the problem, the notion of a quick fix withers and dies.
  2. Vendors of lousy enterprise search systems wriggle off the hook by asserting, “You just need a taxonomy and then our indexing system will be able to generate an assisted navigation interface.” This is the search equivalent of “The check is in the mail.”
  3. CMS vendors, mired in sluggish performance, lost information, and users who can’t find their writings, can suggest, “A taxonomy and classification module makes it much easier to pinpoint the marketing collateral. If you search for a common term, our system displays those documents with that common term. Yes, a taxonomy will do the trick.” This is the same as “Let’s do lunch” repeated every week to a person whom you know but with whom you don’t want to talk for more than 30 seconds on a street corner in mid town Manhattan.
  4. A shill at a user group meeting–now called a “summit”–praises the usefulness of the taxonomy in making it easier for users to find information. Vendors work hard to get a system that works and win over the project manager. Put on center stage and pampered by the vendor’s PR crafts people, the star customer presents a Kodachrome version of the value of taxonomies. Those in the audience often swallow the tale the way my dog Tess goes after a hot dog that falls from the grill. There’s not much thinking in Tess’s actions either.
  5. Vendors of “automated” taxonomy systems demonstrate how their software chops a tough problem down to size in a matter of hours or days. Stuff in some sample content and the smart algorithms do the work of Betty Eddison and Marje Hlava in a nonce. Not on your life, kiddo. The automated systems really are not 100 percent automatic. The training corpus is tough to build. The tuning is a manual task. The smart software needs dummies like me to fiddle. Even more startling to licensees of automatic taxonomy systems is that you may have to buy a third party tool from Access Innovations, Marje Hlava’s company, to get the job done. That old phrase “If ignorance is bliss, hello, happy” comes to mind when I hear vendors pitch the “automated taxonomy” tale.
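
The fifth point, that “automated” systems still need a corpus and a human to tune the output, can be illustrated with a small sketch. This is not how any named vendor's product works. The sample `corpus`, the tiny `STOPWORDS` list, the `min_docs` threshold, and the `rejected_by_editor` set are all assumptions invented for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be far longer.
STOPWORDS = {"the", "a", "of", "and", "for", "in", "to", "is", "on"}


def candidate_terms(documents, min_docs=2):
    """Suggest terms that appear in at least min_docs documents."""
    doc_freq = Counter()
    for doc in documents:
        # Count each word once per document, ignoring stopwords.
        words = set(re.findall(r"[a-z]+", doc.lower())) - STOPWORDS
        doc_freq.update(words)
    return sorted(term for term, n in doc_freq.items() if n >= min_docs)


# Hypothetical training corpus; building a good one is the hard part.
corpus = [
    "Patent search for crop protection",
    "Crop yield and patent filings",
    "Annual report on crop science",
]

suggested = candidate_terms(corpus)

# The manual step: an editor decides which suggestions belong in the taxonomy.
rejected_by_editor = {"patent"}
approved = [term for term in suggested if term not in rejected_by_editor]

print("suggested:", suggested)
print("approved:", approved)
```

The software produces a list of candidates. Deciding which candidates are appropriate, how they relate to one another, and what the corpus failed to mention is still a job for a person.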

I assume that some readers may violently disagree with my view of 21st century taxonomy work. That’s okay. Use the comments section to teach this 65 year old dog some new tricks. I promise I will try to learn from those who bring hard data. If you only make assertions, you won’t get too far with me.

Stephen Arnold, September 27, 2008

IBM: Another New Search System from Big Blue

September 27, 2008

IBM announced its eDiscovery Analyzer. You can read the IBM news release on the MarketWatch news release aggregation page here. Alternatively you can put up with the sluggish response of IBM.com and read more details here. You won’t be able to locate this page using IBM.com’s search function. The eDiscovery Analyzer had not been indexed when I ran the query at 7:30 pm on September 27, 2008. I *was* able to locate the page using Google.com. If I were the IBM person running site search, I would shift to Google, which works.

The eDiscovery Analyzer, according to Big Blue:

… provides conceptual search and analysis of cases created by IBM eDiscovery Manager.

Translating: eDiscovery Manager assists with legal discovery, a formal investigation governed by court rules and conducted before trial, and internal investigations on possible violations of company policies, by enabling users to search e-mail documents that were archived from multiple mailboxes or Mail Journaling databases into a central repository. You license eDiscovery Manager, the bits and pieces needed to make it go, and then you license the brand new eDiscovery Analyzer component.

I believe that this is the current interface for the “new” IBM eDiscovery Analyzer. Source: IBM’s Information Management Software IBM eDiscovery Analyzer 2.1 marketing collateral.

You will need FileNet, IBM’s aging content management system. The phrase I liked best in the IBM write up was, “[eDiscovery Analyzer] is easy to deploy and use, Web 2.0 based interface requires minimal user training.” I’m not sure about the easy to deploy assertion. And the system has to be easy to use because the intended users are attorneys. In my experience, which is limited, legal eagles are not too excited about complicated technology unless it boosts their billable hours. You can run your FileNet add in on AIX (think IBM servers) or Windows (think lots of servers).

You can read about IBM’s search and discovery technology here. You can tap into such “easy to deploy” systems as classification, content analysis, OmniFind search, and, if you are truly fortunate, DB2, IBM’s user friendly enterprise database management system. You might want to have a certified database administrator, an expert in SQL, and an IBM-trained optimization engineer on hand in case you run into problems with these user friendly systems. If these systems leave you with an appetite for more sophisticated functions, click here to learn about other IBM search and discovery products. You can, for example, read about four different versions of OmniFind and learn how to buy these products.

Remember: look for IBM products by searching Google. IBM.com’s search system won’t do the job. Of course, IBM’s enterprise eDiscovery Analyzer is a different animal, and I assume it works. By the way, when you try to download the user guide, you get to answer a question about the usefulness of the information *before* you have received the file. I conclude that IBM prefers users who are able to read documents without actually having the document.

Stephen Arnold, September 27, 2008

Linguamatics Sells Bayer CropScience

September 27, 2008

My newsreader snagged this item, which I found interesting. The little-known Linguamatics (a content processing company based in the UK) retained its deal with the warm and friendly Bayer CropScience. The Linguamatics’ technology is called I2E, and Bayer has been using the I2E system since the summer of 2007. In September, Bayer CropScience decided to renew its license and process patent documents, scientific and technical information, and perform knowledge discovery. (I must admit I am not sure how one discovers knowledge, but I will believe the article that you can find here.)

For me, this small news item was interesting for several reasons. First, for many years a relatively small number of companies had been granted access to the inner circle of European pharma. I find it refreshing that after two centuries, upstarts like Linguamatics are able to follow in the footsteps of Temis and other firms who have worked to make sales in these somewhat conservative companies. “Conservative” might not be the correct word. Computational chemists are a fun-loving group. One computational chemist told me last October in Barcelona that computational chemists were pharma’s equivalent to Brazilian soccer fans. On the off chance that a clinical trial goes off the rails, some pharma players prefer keeping “knowledge” quite undiscovered until an “issue” can be resolved.

A representative I2E results display. © Linguamatics, 2008.

Second, Linguamatics, a company I profiled after significant bother and effort, appears in my April 2008 study Beyond Search, published by the Gilbane Group. You can learn more about this study here. Ferreting out information about I2E was not the walk in the park that I expected from a content processing company, even one with a somewhat low profile. Linguamatics has some interesting technology, and I surmise that the uses of the system are somewhat more sophisticated and useful to Bayer CropScience than “discovering knowledge”.

Finally, Bayer CropScience is a subsidiary of the influential Bayer AG, an outfit with an annual turnover of about US$8.0 billion, give or take a billion because of the sad state of the dollar on the international market. My hunch is that if the CropScience deal feels good, other units of this chemical and pharmaceutical giant will learn to love the I2E system.

Stephen Arnold, September 27, 2008

TeezIR BV: Coquette or Quitter

September 26, 2008

For my first visit to Utrecht, once a bastion of Catholicism and now a Rabobank stronghold, I wanted to speak with interesting companies engaged in search and content processing. After a little sleuthing, I spotted TeezIR, a company founded in November 2007. When I tried to track down one of the principals–Victor Van Tol, Arthus Van Bunningen, and Thijs Westerveld–I was stonewalled. I snagged a taxi and visited the firm’s address (according to trusty Google Maps) at Kanaalweg 17L-E, Building A6. I made my way to the second floor but was unable to rouse the TeezIR team. I am hesitant to say, “No one was there”. My ability to peer through walls after a nine hour flight is limited.

I asked myself, “Is TeezIR playing the role of a coquette or has the aforementioned team quit the search and content processing business?” I still don’t know. At the Hartmann conference, no one had heard of the company. One person asked me, “How did you find out about the company?” I just smiled my crafty goose grin and quacked in an evasive manner.

The trick was that one of my two or three readers of this Web log sent me a snippet of text and asked me if I knew of the company:

Proprietary, state-of-the-art technology is information retrieval and search technology. Technology is built up in “standardized building blocks” around search technology.

So, let’s assume TeezIR is still in business. I hope this is true because search, content processing, and the enterprise systems dependent on these functions are in a sorry state. Cloud computing is racing toward traditional on premises installations the way hurricanes line up to smash the American south east. There’s a reason cloud computing is gaining steam–on premises installations are too expensive, too complicated, and too much of a drag on a struggling business. I wanted to know if TeezIR was the next big thing.

My research revealed that TeezIR had some ties to the University of Twente. One person at the Hartmann conference told me that he thought he heard that a company in Ede had been looking for graduate students to do some work in information retrieval. Beyond that tantalizing comment, I was able to find some references to Antal van den Bosch, who has expertise in entity extraction. I found a single mention of Luuk Kornelius, who may have been an interim officer at TeezIR and at one time a laborer in the venture capital field with Arengo (no valid link found on September 16, 2009). Other interesting connections emerged from TeezIR to Arjen P. de Vries (University of Twente), Thomas Roelleke (once hooked up with Fredhopper), and Guido van’t Noordende (security specialist). Adding these names to the management team here, TeezIR looked like a promising start up.

Since I was drawing a blank on getting people affiliated with TeezIR to speak with me, I turned to my own list of international search engines here, and I began the thrilling task of hunting for needles in hay stacks. I tell people that research for me is a matter of running smart software. But for TeezIR, the work was the old-fashioned variety.

Overview

Here’s what I learned:

First, the company seemed to focus on the problem of locating experts. I grudgingly must call this a knowledge problem. In a large organization, it can be hard to find a colleague who, in theory, knows an answer to another employee’s question. Here’s a depiction of the areas in which TeezIR is (was?) working:

image

Second, TeezIR’s approach is (was?) to make search an implicit function. Like me, the TeezIR team realized that by itself search is a commodity, maybe a non starter in the revenue department. Here’s how TeezIR relates content processing to the problem of finding experts:

image

Eaagle Text Processing Swoops In

September 26, 2008

Eaagle Software announced the availability of Full Text Mapper (FTM), a desktop software program that provides analysis of unstructured data. Eaagle Software brings together advanced text mining technology and desktop computing. ‘Our philosophy is that text mining and data analysis tools should be easy-to-use and not require any particular skills,’ states Yves Kergall, president and CEO of Eaagle. ‘Our software doesn’t require any setup or predefinition to begin discovering knowledge. Simply highlight the information, launch FTM, and instantly visualize your data to begin your analysis…it is that easy.’ You can read the full news story here. For more information about Eaagle, navigate to the company’s Web site here. A single user license is about $4,000.

Stephen Arnold, September 26, 2008

Knol Understanding

September 23, 2008

Slate’s Farhad Manjoo’s “Why Google’s Online Encyclopedia Will Never Be as Good as Wikipedia” takes a somewhat frosty stance toward Knol. You can read his interesting essay here. For me the most significant point was this one:

Knol is a wasteland of such articles: text copied from elsewhere, outdated entries abandoned by their creators, self-promotion, spam, and a great many old college papers that people have dug up from their files. Part of Knol’s problem is its novelty. Google opened the system for public contribution just a couple months ago, so it’s unreasonable to expect too much of it at the moment; Wikipedia took years to attract the sort of contributors and editors who’ve made it the amazing resource it is now.

Knol is one of those Google products that appear and seem to have little or no overt support. I agree. I would like to make three comments:

  1. Knol may be a way for Google to get content for itself first and then secondarily for its users. Google wants information, and Knol is a different mechanism for information acquisition. Assuming that it is a Wikipedia may only be partially correct.
  2. Knol, like many other Google services, does not appear to have a champion. As a result, Knol evolves slowly or not at all. Knol may be another way for Google to determine interest, learn about authors who are alleged experts, and determine if submitted content validates or invalidates other data known to Google.
  3. Knol may be part of a larger grid or data ecosystem. As a result, looking at it out of context and comparing it to a product with which it may not be designed to compete might be a partially informed approach.

Based on my analysis of the Google JotSpot acquisition and the still youthful Knol service, I’m not prepared to label Knol or describe it as either a success or failure. In my opinion, Knol is a multi purpose beta. Its principal value may be in the enterprise, not the consumer space. But for me, I have too little data and an incomplete understanding of how the JotSpot “plumbing” is implemented; therefore, I am neutral. What’s your view?

Stephen Arnold, September 23, 2008

Cognition’s Semantic Map

September 22, 2008

I profiled Cognition Technologies in my April 2008 “Beyond Search” report for the Gilbane Group here. I can’t reproduce the profile in my Web log, but you can find out about Cognition by reading the information on the company’s Web site. My take on the firm was that it was working to tame the semantic beast that is prowling around many procurement team meetings. The company has released a knowledge base that “teaches computers the meanings behind words.” You can read more about the semantic map in the RawStory.com article “Computers Figuring Out What Words Mean” here. Cognition has, according to RawStory, licensed the map to LexisNexis, one of the early entrants in online for-fee content access. If you are in the market for a semantic map, check out Cognition’s new offering. My view of semantic technology is that Google seems to be ideally positioned to become the Semantic Web. I provided details behind this assertion in the 2007 report I did for Bear Stearns before it went down in flames earlier this year. Google has quite a few of its Googley souls laboring in the semantic vineyard. As a result, the semantic efforts of smaller companies and larger outfits like Microsoft have to make significant progress and fast. Cognition’s Web site is here.

Stephen Arnold, September 22, 2008

Business Intelligence: Getting Smarter in a Class with Some Lousy Students

September 22, 2008

Business intelligence sounds more up town than search. Analytics resonates with quantitative goodness. Most employees look back on their classes in mathematics with a combination of nostalgia as in “I wish I would have taken more math” and horror as in “I hated Miss Blackburn’s algebra class”. I did a job for a major university to answer the question, “Can we be number one in computer science?” The answer was, “No.” There were not many math majors who planned on working in the US once the sheepskins were handed out. It’s tough to rise to the top when your future endowment funding sources are working in Wuhan or Mumbai. Loyalties and money may go to the local high school where the math wizards’ genius was first recognized and cultivated.

I find it amusing that search vendors are rushing to become players in the business intelligence arena. Now established business intelligence companies are encouraging the running of the bull-oney. SPSS, SAS, Cognos, and Business Objects have learned to love text because their customers demanded that structured and unstructured data be mined for insights. Comments on warranty cards, in emails, or in voice calls to a help desk do yield useful information. Some companies learn what customers loathe and then don’t fix the problem. Called your mobile provider lately? How about your bank, assuming it’s still in business? See what I mean?

When I read a good analysis of how business intelligence vendors are getting smarter, I learn something about how the market perceives business intelligence. But I wonder why these analyses don’t dig into the deeper issues associated with vendors who reinvent themselves in order to make sales. I’m not sure the product innovation is of the same quality as the marketing collateral. In short, vendors talk a good game, but the delivery remains much the way it always has. Math and programming people have to be taught the system. The business intelligence system is then set up with rules spelled out. The biggest change is that the traditional method is too expensive, so companies want short cuts to business intelligence goodness. Enter the search and content processing vendor. The idea is simple: index content and convert a user’s query to a form that generates a report. Now will the report have the same concern with the niceties and nuances of hand crafted statistical instructions operating on a well formed data cube? Maybe? But the new approaches are a heck of a lot easier, faster, and cheaper. Licensees are asked to conclude, “You get all three with our new system.”
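
The search vendor pitch described above, index content and then convert a query into a report, can be sketched in a few lines. This is an illustration only, not any vendor's implementation. The `DOCS` records, the field names, and the functions `build_index` and `report` are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical help desk comments, each tagged with a product.
DOCS = {
    1: {"product": "phone", "text": "Battery dies fast and billing is wrong"},
    2: {"product": "phone", "text": "Billing error again"},
    3: {"product": "router", "text": "Battery backup works fine"},
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Build an inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in tokenize(doc["text"]):
            index[token].add(doc_id)
    return index


def report(query, docs, index):
    """Turn a keyword query into a count of matching comments per product."""
    terms = tokenize(query)
    if not terms:
        return {}
    # A document matches only if it contains every query term.
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    counts = defaultdict(int)
    for doc_id in matches:
        counts[docs[doc_id]["product"]] += 1
    return dict(counts)


index = build_index(DOCS)
print(report("billing", DOCS, index))
print(report("battery billing", DOCS, index))
```

The report is easy, fast, and cheap to produce. Notice what is missing: no weighting, no handling of synonyms or misspellings, and no statistical test of whether two billing complaints mean anything. That is the gap between a keyword count and hand crafted analysis on a well formed data cube.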

Take a gander at the well written “Business Intelligence Gets Smart” published on September 5, 2008, by Intelligent Enterprise’s Doug Henschen here. You will have to put up with an annoying ad flop over, but the content is worth the annoyance. The central claim of the write up is that business intelligence “improves business performance.” This is a key point. Most search and content processing systems don’t generate a hard return on investment. Business intelligence, according to the Information Week Research Business Intelligence Survey cited by Mr. Henschen, does. That’s good news, and it encourages vendors with non-ROI systems to repackage these products as bottom line centric solutions. For me, the most important parts of this write up were the charts and graphs. Mr. Henschen does a good job of pulling together the numbers that help put business intelligence in context.

I would like to offer several observations and, of course, invite comment:

  1. Business intelligence remains a complicated area, and it does not lend itself to facile solutions.
  2. Most business intelligence systems require that content be transformed, then processed, and finally analyzed. If the content processing goes off track, the fix can be time consuming and expensive. BI systems, like search and content processing systems, can experience cost overruns because the assumptions about the source information were wrong or shallow.
  3. Business intelligence, even when implemented with some of the search centric solutions on the market like Endeca’s Latitude, requires a math or programming wizard to configure the systems.

Quite a few search and text analytics companies are asserting that “we do business intelligence”. The statement is both true and false. In order to avoid coming down on the false side of the statement, short cuts should be avoided. Implementing business intelligence is similar to Miss Blackburn’s algebra class. It’s demanding, a great deal of work, and usually disliked by those without the appetite or the aptitude for the tasks.

Stephen Arnold, September 22, 2008
