Dow Jones and Automatic Taxonomy Generation

September 30, 2008

An eager beaver reader (I only have two or three) sent me a link to “Taxonomies for Human Vs Auto-Indexing.” The author of the Synaptica Central write up is Wendy Lim. She is summarizing or reproducing information attributed to Heather Hedden. From a bibliographic angle, I think a tad more work could be done to make clear who was writing what, where, and when. But that’s an old, failed database goose quacking about the brilliant work done by “experts” decades younger than I. Quack. Quack.

You can read the September 26, 2008, write up here. The article is about a Taxonomy Bootcamp. After a bit of sleuthing, I discovered that this is an add on to some Information Today trade shows. The bootcamp, as I understand it, is an intellectual Camp Lejeune except that the attendees skip the push ups, the 5 am wake up calls, and the 20 mile runs. Over a period of two or three days, taxonomy recruits emerge battle ready, honed to deal with the intellectual rigors of creating taxonomies.

image

A real taxonomy. Source: www.nnf.org.na

The word “taxonomy” is more popular than “enterprise search” and for good reason. Enterprise search has emerged from organizations with a bold 4F stamped on its fitness report. After hours, maybe months of work, and some hefty bills to pay, enterprise search customers are looking for a way to kill the enterprise search enemy. That’s where a taxonomy comes in. I’m no expert in taxonomies. I know I was involved in creating taxonomies for some once-hot commercial databases like ABI / INFORM, Business Dateline, General Business File, Health Reference Center, and the 1993 Web directory Point (Top 5% of the Internet). What those experiences taught me was that I don’t know too much about taxonomies or classification systems in general for that matter. I keep in touch with people who do know; for example, Marje Hlava at Access Innovations, Barbara Quint (Searcher Magazine), Marydee Ojala (Online Magazine), Ulla de Stricker (De Stricker & Associates), and other specialists. I get nervous when a 20- or 30-something explains that taxonomies are no big deal or that a business process can crack a taxonomy problem or a certain vendor’s software can auto-magically create a taxonomy.

tag cloud

A Synaptica Central tag cloud.

In my experience, the truth is not to be found in any one solution. In fact, the reality of taxonomies is that the concept has gained traction because of fundamental errors in planning and deploying information access systems. I don’t think a taxonomy can retrofit stupid, short sighted decisions. For that reason, I steer clear of most taxonomy discussions because after working with these beasts for more than 30 years, I understand their unpredictable behavior.


Expert System: Morphing into an Online Advertising Tool Vendor

September 28, 2008

Several years ago, YourAmigo (an Australian search and content processing vendor) shifted from enterprise search to search engine optimization. I stopped following the company because I have zero interest in figuring out how to get traffic to my Web site or my Web log. Now Expert System has rolled out what it calls Cogito Advertiser. A brief write up appeared in DMReview.com when I was in Europe. You can read that article here.

The new service, according to DMReview.com:

automatically analyzes Web pages to identify the most relevant topics and extract the main themes included in the text. It classifies content by assigning the category related to the text in real time, based on an optimized taxonomy and high precision. By processing the text, it collects all useful data in an output format structured to be uploaded into a database and directly integrates it with the ad server.

Expert System has some interesting technology. The idea is that software that can “understand” will be able to do a better job of key word identification than a human, often fresh out of college with a vocabulary flush with “ums”, “ers”, and “you knows”.
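To make the “assigning the category” step concrete, here is a minimal Python sketch of the crudest possible version of the idea: count how often a page mentions terms from each branch of a taxonomy, pick the best scoring branch, and emit a structured record that could be loaded into a database. This is not Expert System’s method. The taxonomy, the trigger terms, and the function name classify_page are all hypothetical, and plain word counting is a stand-in for the semantic analysis the vendor describes.

```python
from collections import Counter
import re

# Hypothetical taxonomy: category -> trigger terms (invented for illustration).
TAXONOMY = {
    "Automotive": {"car", "engine", "sedan", "dealer"},
    "Travel": {"flight", "hotel", "airline", "resort"},
    "Finance": {"loan", "mortgage", "bank", "interest"},
}

def classify_page(text):
    """Return a structured record: the best category plus the matched theme terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Score each category by how many times its trigger terms occur in the text.
    scores = {
        category: sum(counts[term] for term in terms)
        for category, terms in TAXONOMY.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        # Nothing in the taxonomy matched; do not force a category.
        return {"category": None, "themes": []}
    themes = sorted(term for term in TAXONOMY[best] if counts[term] > 0)
    return {"category": best, "themes": themes}
```

Even this toy shows where the work hides: the output is only as good as the taxonomy and the trigger terms someone had to build by hand.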

You can learn more about the company here. As the financial and competitive pressures mount, I expect other vendors to repackage their technology in an effort to tap into more rapidly growing markets with shorter buying cycles than enterprise search typically merits.

Stephen Arnold, September 28, 2008

Taxonomy: Silver Bullet or Shallow Puddle

September 27, 2008

Taxonomy is hot. One of my few readers sent me a link to Fumsi, a Web log that contains a two part discussion of taxonomy. I urge you to read this post by James Kelway, whom I don’t know. You can find the article here. The write up is far better than most of the Webby discussions of taxonomies. After a quick pass at nodes and navigation, he jumps into information architecture requiring fewer than 125 words. The often unreliable Wikipedia discussion of taxonomy here chews up more than 6,000. Brevity is the soul of wit, and whoever contributed to the Wikipedia article must be SWD; that is, severely wit deprived.

Take a look at the Google Trends chart I generated at 8 pm on Friday, September 26, 2008. Taxonomy is generating more Google traffic than the now mud crawler enterprise search. Taxonomy is not as popular as “CMS”, the shorthand for content management system. But “taxonomy” is a specialist concept that seems to be moving into the mainstream. At the just concluded Information Today trifecta conference featuring search, knowledge management (whatever that is), and streaming media, taxonomy was a hot topic. At the Wednesday roof top cocktail, where I worked on my tan in the 90 degree ambient air temperature, I was asked four times about taxonomies. I know I worked on commercial taxonomies and controlled vocabularies for databases, but I learned from those years of experience that taxonomies are really tough, demanding, time consuming intellectual undertakings. I thought I was pretty good at making logical, coherent lists. Then I met the late Betty Eddison and the very active Marje Hlava. These two pros taught me a thing or 50.

google trends taxnonomy

In the dumper is the red line which maps “enterprise search” popularity. The blue line is the up and coming taxonomy popularity. The top line is the really popular, yet hugely disappointing, content management term traffic.

I heard people who have been responsible for failed search systems and non functional content management systems asking, “Will a taxonomy improve our content processing?” The answer is, “Sure, if you get an appropriate taxonomy.” I then excuse myself and head to the bar man for a Diet 7 Up. The kicker, of course, is “appropriate”. Figuring out what’s appropriate and then creating a taxonomy that users will actually exploit directly or indirectly is tough work. But today, you can learn how to do a taxonomy in a 40 minute presentation or, if you are really studious, a full eight hour seminar.

I remember talking with Betty Eddison and Marje Hlava about their learning how to craft appropriate taxonomies. Marje just laughed and turned to her business partner who also burst out laughing. Betty smiled and in her deep, pleasant voice said, “A life time, kiddo.” She called me “kiddo”, and I don’t think anyone else ever did. Marje Hlava chimed in and added, “Well, Jay [her business partner] and I have been at it for two life times.” I figured out pretty quickly that building “appropriate” taxonomies required more than persistence and blissfully ignorant confidence.

Why are taxonomies perceived as the silver bullet that will kill the vampire search or CMS system? A vampire system is one that will suck those working on it into endless nights and weekends and then gobble available budget dollars. In my opinion, here are the top five reasons:

  1. The notion of a taxonomy as a quick fix is easy to understand. Most people think of a taxonomy as the equivalent of the Dewey Decimal system or the Library of Congress subject headings and think, “How tough can this taxonomy stuff be?” After a couple of runs at the problem, the notion of a quick fix withers and dies.
  2. Vendors of lousy enterprise search systems wriggle off the hook by asserting, “You just need a taxonomy and then our indexing system will be able to generate an assisted navigation interface.” This is the search equivalent of “The check is in the mail.”
  3. CMS vendors, mired in sluggish performance, lost information, and users who can’t find their writings, can suggest, “A taxonomy and classification module makes it much easier to pinpoint the marketing collateral. If you search for a common term, our system displays those documents with that common term. Yes, a taxonomy will do the trick.” This is the same as “Let’s do lunch” repeated every week to a person whom you know but with whom you don’t want to talk for more than 30 seconds on a street corner in mid town Manhattan.
  4. A shill at a user group meeting–now called a “summit”–praises the usefulness of the taxonomy in making it easier for users to find information. Vendors work hard to get a system that works and win over the project manager. Put on center stage and pampered by the vendor’s PR crafts people, the star customer presents a Kodachrome version of the value of taxonomies. Those in the audience often swallow the tale the way my dog Tess goes after a hot dog that falls from the grill. There’s not much thinking in Tess’s actions either.
  5. Vendors of “automated” taxonomy systems demonstrate how their software chops a tough problem down to size in a matter of hours or days. Stuff in some sample content and the smart algorithms do the work of Betty Eddison and Marje Hlava in a nonce. Not on your life, kiddo. The automated systems really are not 100 percent automatic. The training corpus is tough to build. The tuning is a manual task. The smart software needs dummies like me to fiddle. Even more startling to licensees of automatic taxonomy systems is that you may have to buy a third party tool from Access Innovations, Marje Hlava’s company, to get the job done. That old phrase “If ignorance is bliss, hello, happy” comes to mind when I hear vendors pitch the “automated taxonomy” tale.
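The fifth point can be illustrated with a minimal Python sketch. It assumes the simplest imaginable “automated” step, namely pulling candidate terms out of sample content by document frequency, and then shows the manual step that no vendor demo dwells on. The stop word list and the function names candidate_terms and apply_editor_decisions are hypothetical, and no commercial system is claimed to work this way.

```python
from collections import Counter
import re

# Tiny illustrative stop word list; a real one is much longer and hand tuned.
STOPWORDS = {"the", "a", "of", "and", "for", "in", "to", "is"}

def candidate_terms(documents, min_docs=2):
    """Return terms that appear in at least min_docs documents, most frequent first."""
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document (document frequency).
        terms = set(re.findall(r"[a-z]+", doc.lower())) - STOPWORDS
        doc_freq.update(terms)
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, n in ranked if n >= min_docs]

def apply_editor_decisions(candidates, rejected, broader):
    """Keep only terms a human editor approved; attach the broader term the editor chose."""
    return {t: broader.get(t) for t in candidates if t not in rejected}
```

The first function is the part that runs in a nonce. The second function takes its `rejected` and `broader` arguments from a person, and filling in those arguments is the life time of work Betty Eddison was talking about.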

I assume that some readers may violently disagree with my view of 21st century taxonomy work. That’s okay. Use the comments section to teach this 65 year old dog some new tricks. I promise I will try to learn from those who bring hard data. If you only make assertions, you won’t get too far with me.

Stephen Arnold, September 27, 2008

Linguamatics Sells Bayer CropScience

September 27, 2008

My newsreader snagged this item, which I found interesting. The little-known Linguamatics (a content processing company based in the UK) retained its deal with the warm and friendly Bayer CropScience. The Linguamatics’ technology is called I2E, and Bayer has been using the I2E system since the summer of 2007. In September, Bayer CropScience decided to renew its license and process patent documents, scientific and technical information, and perform knowledge discovery. (I must admit I am not sure how one discovers knowledge, but I will believe the article that you can find here.)

For me, this small news item was interesting for several reasons. First, for many years a relatively small number of companies had been granted access to the inner circle of European pharma. I find it refreshing that after two centuries, upstarts like Linguamatics are able to follow in the footsteps of Temis and other firms who have worked to make sales in these somewhat conservative companies. “Conservative” might not be the correct word. Computational chemists are a fun-loving group. One computational chemist told me last October in Barcelona that computational chemists were pharma’s equivalent to Brazilian soccer football fans. On the off chance that a clinical trial goes off the rails, some pharma players prefer keeping “knowledge” quite undiscovered until an “issue” can be resolved.

lingua_searchresults

A representative I2E results display. © Linguamatics, 2008.

Second, Linguamatics–a company I profiled after significant bother and effort–is profiled in my April 2008 study Beyond Search, published by the Gilbane Group. You can learn more about this study here because ferreting out information about I2E is not the walk in the park that I expected from a content processing company with a somewhat low profile. Linguamatics has some interesting technology, and I surmise that the uses of the system are somewhat more sophisticated and useful to Bayer CropScience than “discovering knowledge”.

Finally, Bayer CropScience is a subsidiary of the influential Bayer AG, an outfit with an annual turnover of about US$8.0 billion, give or take a billion because of the sad state of the dollar on the international market. My hunch is that if the CropScience deal feels good, other units of this chemical and pharmaceutical giant will learn to love the I2E system.

Stephen Arnold, September 27, 2008

TeezIR BV: Coquette or Quitter

September 26, 2008

For my first visit to Utrecht, once a bastion of Catholicism and now Rabobank stronghold, I wanted to speak with interesting companies engaged in search and content processing. After a little sleuthing, I spotted TeezIR, a company founded in November 2007. When I tried to track down one of the principals–Victor Van Tol, Arthus Van Bunningen, and Thijs Westerveld–I was stonewalled. I snagged a taxi and visited the firm’s address (according to trusty Google Maps) at Kanaalweg 17L-E, Building A6. I made my way to the second floor but was unable to rouse the TeezIR team. I am hesitant to say, “No one was there”. My ability to peer through walls after a nine hour flight is limited.

I asked myself, “Is TeezIR playing the role of a coquette or has the aforementioned team quit the search and content processing business?” I still don’t know. At the Hartmann conference, no one had heard of the company. One person asked me, “How did you find out about the company?” I just smiled my crafty goose grin and quacked in an evasive manner.

The trick was that one of my two or three readers of this Web log sent me a snippet of text and asked me if I knew of the company:

Proprietary, state-of-the-art technology is information retrieval and search technology. Technology is built up in “standardized building blocks” around search technology.

So, let’s assume TeezIR is still in business. I hope this is true because search, content processing, and the enterprise systems dependent on these functions are in a sorry state. Cloud computing is racing toward traditional on premises installations the way hurricanes line up to smash the American south east. There’s a reason cloud computing is gaining steam–on premises installations are too expensive, too complicated, and too much of a drag on a struggling business. I wanted to know if TeezIR was the next big thing.

My research revealed that TeezIR had some ties to the University of Twente. One person at the Hartmann conference told me that he thought he heard that a company in Ede had been looking for graduate students to do some work in information retrieval. Beyond that tantalizing comment, I was able to find some references to Antal van den Bosch, who has expertise in entity extraction. I found a single mention of Luuk Kornelius, who may have been an interim officer at TeezIR and at one time a laborer in the venture capital field with Arengo (no valid link found on September 16, 2009). Other interesting connections emerged from TeezIR to Arjen P. de Vries (University of Twente), Thomas Roelleke (once hooked up with Fredhopper), and Guido van’t Noordende (security specialist). Adding these names to the management team here, TeezIR looked like a promising start up.

Since I was drawing a blank on getting people affiliated with TeezIR to speak with me, I turned to my own list of international search engines here, and I began the thrilling task of hunting for needles in hay stacks. I tell people that research for me is a matter of running smart software. But for TeezIR, the work was the old-fashioned variety.

Overview

Here’s what I learned:

First, the company seemed to focus on the problem of locating experts. I grudgingly must call this a knowledge problem. In a large organization, it can be hard to find a colleague who, in theory, knows an answer to another employee’s question. Here’s a depiction of the areas in which TeezIR is (was?) working:

image

Second, TeezIR’s approach is (was?) to make search an implicit function. Like me, the TeezIR team realized that by itself search is a commodity, maybe a non starter in the revenue department. Here’s how TeezIR relates content processing to the problem of finding experts:

image


Eaagle Text Processing Swoops In

September 26, 2008

Eaagle Software announced the availability of Full Text Mapper (FTM), a desktop software program that provides analysis of unstructured data. Eaagle Software brings together advanced text mining technology and desktop computing. ‘Our philosophy is that text mining and data analysis tools should be easy-to-use and not require any particular skills,’ states Yves Kergall, president and CEO of Eaagle. ‘Our software doesn’t require any setup or predefinition to begin discovering knowledge. Simply highlight the information, launch FTM, and instantly visualize your data to begin your analysis…it is that easy.’ You can read the full news story here. For more information about Eaagle, navigate to the company’s Web site here. A single user license is about $4,000.

Stephen Arnold, September 26, 2008

hakia: A Cloudsourcing Twist on Semantic Search

September 25, 2008

hakia [www.hakia.com], a semantic search engine, recently announced that it’s adding a new program designed to mine more resources for users, specifically information professionals who need to tap more than the usual 10 percent of web content.

Users can enter URLs now, not just search terms, to target credible content, not just popular results. hakia will process the URL with its semantic technology to make concept and meaning matches.

A hakia rep told me “this is the first time a search engine has channeled the collective knowledge of these expert groups to generate credibility-stamped results using semantic technology.” They’re promoting it as “Trusted Results” – returned information is run through peer review and professionals are invited to submit web sites. hakia is now expanding its content by making an open call for those submissions.

The project is in beta phase, focused on health and medical resources. For instance, results returned will come from the World Health Organization or the Mayo Clinic instead of WebMD or Wikipedia. I hope they work on expanding soon, because it’s a great idea. There’s so much popular information on the Internet, it’s really difficult to search and sort through all the MedicalNet resources when I need serious bibliography material.

You can get more information at Club hakia [http://club.hakia.com/]; you just need to complete a free registration. They’ve got a really nifty setup where you can enter search terms in both hakia and Google side-by-side. I entered “search engine optimization.” Google’s top returns were from Wikipedia, Google search support, SEO Chat, and then news results. hakia’s top returns included Turks Daily World News, Wikipedia, SEO.com, and Search Optimization Journal.

Jessica Bratcher, September 25, 2008

Cognition’s Semantic Map

September 22, 2008

I profiled Cognition Technologies in my April 2008 “Beyond Search” report for the Gilbane Group here. I can’t reproduce the profile in my Web log, but you can find out about Cognition by reading the information on the company’s Web site. My take on the firm was that it was working to tame the semantic beast that is prowling around many procurement team meetings. The company has released a knowledge base that “teaches computers the meanings behind words.” You can read more about the semantic map in the RawStory.com article “Computers Figuring Out What Words Mean” here. Cognition has, according to RawStory, licensed the map to LexisNexis, one of the early entrants in online for-fee content access. If you are in the market for a semantic map, check out Cognition’s new offering. My view of semantic technology is that Google seems to be ideally positioned to become the Semantic Web. I provided details behind this assertion in the 2007 report I did for Bear Stearns before it went down in flames earlier this year. Google has quite a few of its Googley souls laboring in the semantic vineyard. As a result, the semantic efforts of smaller companies and larger outfits like Microsoft have to make significant progress and fast. Cognition’s Web site is here.

Stephen Arnold, September 22, 2008

Business Intelligence: Getting Smarter in a Class with Some Lousy Students

September 22, 2008

Business intelligence sounds more up town than search. Analytics resonates with quantitative goodness. Most employees look back on their classes in mathematics with a combination of nostalgia as in “I wish I would have taken more math” and horror as in “I hated Miss Blackburn’s algebra class”. I did a job for a major university to answer the question, “Can we be number one in computer science?” The answer was, “No.” There were not many math majors who planned on working in the US once the sheepskins were handed out. It’s tough to rise to the top when your future endowment funding sources are working in Wuhan or Mumbai. Loyalties and money may go to the local high school where the math wizards’ genius was first recognized and cultivated.

I find it amusing that search vendors are rushing to become players in the business intelligence arena. Now established business intelligence companies are encouraging the running of the bull-oney. SPSS, SAS, Cognos, and Business Objects have learned to love text because their customers demanded that structured and unstructured data be mined for insights. Comments on warranty cards, in emails, or in voice calls to a help desk do yield useful information. Some companies learn what customers loathe and then don’t fix the problem. Called your mobile provider lately? How about your bank, assuming it’s still in business? See what I mean?

When I read a good analysis of how business intelligence vendors are getting smarter, I learn something about how the market perceives business intelligence. But I wonder why these analyses don’t dig into the deeper issues associated with vendors who reinvent themselves in order to make sales. I’m not sure the product innovation is of the same quality as the marketing collateral. In short, vendors talk a good game, but the delivery remains much the way it always has. Math and programming people have to be taught the system. The business intelligence system is then set up with rules spelled out. The biggest change is that the traditional method is too expensive, so companies want short cuts to business intelligence goodness. Enter the search and content processing vendor. The idea is simple: index content and convert a user’s query to a form that generates a report. Now will the report have the same concern with the niceties and nuances of hand crafted statistical instructions operating on a well formed data cube? Maybe? But the new approaches are a heck of a lot easier, faster, and cheaper. Licensees are asked to conclude, “You get all three with our new system.”
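The “index content and convert a user’s query to a form that generates a report” idea is simple enough to sketch in a few lines of Python. What follows is a toy under stated assumptions, not any vendor’s implementation: the record layout, the field name passed as group_by, and the functions build_index and report are all hypothetical.

```python
from collections import defaultdict
import re

def build_index(records):
    """Build an inverted index: token -> set of record positions containing it."""
    index = defaultdict(set)
    for i, rec in enumerate(records):
        for tok in re.findall(r"[a-z]+", rec["text"].lower()):
            index[tok].add(i)
    return index

def report(records, index, query, group_by):
    """Turn a key word query into a count of matching records per group_by value."""
    terms = re.findall(r"[a-z]+", query.lower())
    if not terms:
        return {}
    # A record matches only if it contains every query term (AND semantics).
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    counts = defaultdict(int)
    for i in hits:
        counts[records[i][group_by]] += 1
    return dict(counts)
```

Fed a handful of help desk comments tagged with a region, a query for “battery” comes back as a count of complaints per region. That is easier, faster, and cheaper than a data cube, and it also shows what is missing: no weighting, no statistical test, no notion of whether “battery” in a comment is praise or a complaint.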

Take a gander at the well written “Business Intelligence Gets Smart” published on September 5, 2008, by Intelligent Enterprise’s Doug Henschen here. You will have to put up with an annoying ad flop over, but the content is worth the annoyance. The key point of the write up is that business intelligence “improves business performance.” This is a key point. Most search and content processing systems don’t generate a hard return on investment. Business intelligence, according to the Information Week Research Business Intelligence Survey cited by Mr. Henschen, does. That’s good news, and it encourages vendors with non-ROI systems to repackage these products as bottom line centric solutions. For me, the most important parts of this write up were the charts and graphs. Mr. Henschen does a good job of pulling together the numbers that help put business intelligence in context.

I would like to offer several observations and, of course, invite comment:

  1. Business intelligence remains a complicated area, and it does not lend itself to facile solutions.
  2. Most business intelligence systems require that content be transformed, then processed, and finally analyzed. If the content processing goes off track, the fix can be time consuming and expensive. BI systems, like search and content processing systems, can experience cost overruns because the assumptions about the source information were wrong or shallow.
  3. Business intelligence, even when implemented with some of the search centric solutions on the market like Endeca’s Latitude, requires a math or programming wizard to configure the systems.

Quite a few search and text analytics companies are asserting that “we do business intelligence”. The statement is both true and false. In order to avoid coming down on the false side of the statement, short cuts should be avoided. Implementing business intelligence is similar to Miss Blackburn’s algebra class. It’s demanding, a great deal of work, and usually disliked by those without the appetite or the aptitude for the tasks.

Stephen Arnold, September 22, 2008

Microsoft Powerset Arrives

September 18, 2008

The Powerset Web log contains a summary of the progress made with Powerset’s technology here. You can see the system in action by navigating to Live.com search here and entering the phrase “Chrysler Building”. The system displays an “answer” in the form of an extract from Wikipedia. For me the most interesting part of the Microsoft Powerset article was this statement:

But, many topical queries do not show Answers today such as  musicians, albums, films, etc. For this experiment, we selected some of these categories and will return a topic summary with links, similar to the Freebase Answers we show in Powerset, using data from Freebase.  Eventually, we hope to give Answers for even more topics.

The Answers feature, therefore, may not be available to you. If you launch queries not supported by the system, you won’t see any of the Powerset technology.

The demonstration looks interesting, and as the Web log post states, the Powerset team pulled off this impressive display in only 30 days. This contrasts sharply with the Microsoft Fast Search Web part, a project completed in only 45 days. To me, it looks as if Powerset’s presentation of its Wikipedia search demo was easier to port to Live Search than it was for Fast Search to make its pre-existing Web part available for SharePoint.

I am looking forward to more substantive innovations from both Powerset and Fast Search in the near future. Although interesting, both the Powerset and the Fast Search projects left me wanting more. In fact, I thought of the old Wendy’s advertising theme “Where’s the beef?” for both of these initial development efforts.

Stephen Arnold, September 18, 2008
