Mondeca: A Semantic Technology Company

April 25, 2008

Twice in the last two days I’ve been asked about Mondeca, based in Paris. If you are not familiar with the company, it has been involved in semantic content processing for almost a decade. The company describes itself in this way:

Mondeca provides software solutions that leverage semantics to help organizations obtain maximum return from their accumulated knowledge, content and software applications. Its solutions are used by publishing, media, industry, tourism, sustainable development and government customers worldwide.

The company made a splash in professional publishing with its work for some of the largest scientific, technical, legal, and business publishers. Its customers include Novartis, the Thomson Corporation, LexisNexis, and Strabon.

Mondeca makes a goodly amount of information available on its Web site. You can learn more about the company’s technology, solutions, and management team by working through the links on the Web site.

Indexing by the Book: Automatic Functions Plus Human Interaction

Semantic technology or semantic content analysis can carry different freights of meaning. My understanding is that Mondeca has been a purist when it comes to observing standards, enforcing the rules for well-formed taxonomies, and assembling internally consistent and user-friendly controlled term lists. If you are not familiar with the specifics of a rigorous approach to controlled terms and taxonomies, take a look at this screen shot of Mondeca’s subject matter expert interface. Be aware that I pulled this from my files, so the interface shipping today may differ from this approach. The principal features and functions will remain behind the digital drapery, however. My recollection is that this is the interface used by Wolters Kluwer for some of its legal content.

Interface

What is obvious to me is that Mondeca and a handful of other companies involved in semantic technology take an “old school” approach with no short cuts. Alas, some of the more jejune pundits in the controlled vocabulary and taxonomy game can sometimes be more relaxed. Without training in the fine art of thesauri, an observer will find it difficult to spot at a glance the logical problems and inconsistencies in a thesaurus or taxonomy. However, after the user runs some queries that deliver more chaff than wheat, the quick-and-dirty approach is like one of those sugar-free and fat-free cookies. There’s just not enough substance to satisfy the user’s information craving.


Newspapers: Hastening Their Own Demise

April 24, 2008

I dreamed of Darwin. I think my semiconscious was mulling about survival and adaptation. The financial news from the newspaper publishing world was interesting. Losses at Gannett, McClatchy, and the New York Times suggest continued worsening of their financial weather. You can point and click through the remarkable financial picture by running this query on Google News.

To add insult to injury, Moody’s Investors Service, according to CNN.com, downgraded the New York Times Company’s senior unsecured ratings to “Baa3” from “Baa1” and its commercial paper rating to “Prime-3” from “Prime-2”. This is the difference between a premier league soccer team and a third-division squad playing for beer. The news story I read reported that Moody’s said the New York Times had a “stable” financial outlook. If the first quarter results are stable, I must not have a good grasp of how financial whiz kids think. (Please, read this story quickly. These CNN.com links disappear quickly.)

Enterprise search systems can ingest news and information from third parties. Some news organizations sell live feeds directly into companies. The information is then indexed and made available to employees within the enterprise search system. Over the last few years, I’ve seen an increase in the use of news on Internet sites first as a supplement to commercial vendors’ news and now as a replacement in some organizations. Are commercial news vendors, newspapers, and legitimate commercial aggregators losing their grip in this important market?

I think newspapers are. It may be too soon to tell if outfits like the Associated Press or giant combines will be affected as well. The digitally adept may be able to deal with Darwinian forces. Others won’t be so fortunate.

Every few months I bump into an executive from a New York publishing company. Some of these titans of information work for media companies with newspapers; others labor within the multi-national combines that own professional publishing companies. A few ride the air currents rising from the burning piles of unsold books, magazines, peer-reviewed journals, and controlled-circulation publications.

Viewed as a group, these companies present a clear financial picture. Consolidation is inevitable. I dropped my subscription to the Financial Times because I was getting three deliveries a week, not six. The FT’s hard copy distribution system was incapable of delivering the paper on a daily basis to my redoubt in rural Kentucky. No apologies and no explanations were forthcoming after three years of complaining to my elusive delivery person. My emails to the FT customer center went unheeded. At a trade show, a chipper Financial Times’s booth worker tried to give me a tan baseball cap with an embroidered “FT” logo. I returned the hat to the young person saying, “No, thanks. I have a Google cap and that is already broken in.”

Three Sources of “Real” News

I want to steer clear of the well-worn theme that Web logs provide an alternative to “real” journalism. The best Web logs from my point of view are those written by individuals who were or could have been crackerjack journalists. I worked at the Courier-Journal & Louisville Times in its salad days. I also worked for the fellow once described to me as “the most hated man in New York publishing,” the sharp-as-a-tack Bill Ziff. Mr. Ziff created three media conglomerates and sold each at the peak of its valuation. He would still be working his magic if age and illness had not sidelined him. The best Web log writers could have found a home at either the CJ or at Ziff when these outfits were firing on all cylinders.

I want to take a look at three exemplary news services in a cursory way and then offer some observations about why the newspaper publishers who are losing money are probably going to continue losing money for the foreseeable future. If Rupert Murdoch’s legal eagles are reading this essay, calm down. I am not discussing News Corp., the Wall Street Journal, or the likely takeover of Newsday.

First, navigate to a site called Newsnow. I haven’t kept up with the company after speaking with executives a couple of years ago. The service provides a series of links to news grouped by categories. The center panel presents headlines and one-sentence summaries of the major stories. When I visited the site this morning (April 24, 2008), I had a tidy lineup of items relating to the mortgage crisis affecting Europe. An important point is that even on my really lousy Verizon high-speed, use-it-anywhere wireless service, Newsnow loads quickly and is not annoying.

Newsnow


Calais: Free Semantic Tagger

April 22, 2008

If you want to see how cloud-based software can perform rich metatagging, you will want to give the free Calais service a whirl. Navigate to the Calais Gallery and scroll down to the Capability Demonstrations and select the Calais Document Viewer. If you don’t see the link, click here.

Now cut a document and paste it into the window. The system will display this type of result:

calais_parser

The tags the ClearForest system automatically identifies are highlighted. The left-hand column of the display shows the types of tags identified; for example, city, company, person, etc. A single click opens a drop-down list of what the system found. The system worked well and quickly, with no “false drops” in the sample document. Performance showed some latency, but that’s not unusual with a cloud-based service and some fancy text crunching taking place on remote servers.
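For readers who want the mechanics, here is a toy sketch of what a semantic tagger does: scan text, assign each match an entity type, and group the results by type, much like the left-hand column of the Calais document viewer. The patterns, categories, and sample text below are invented stand-ins, not the Calais API or ClearForest’s actual rules.

```python
import re
from collections import defaultdict

# Toy stand-in for a semantic tagger: a few hand-written patterns per
# entity type. A real service such as Calais uses trained models and
# rule sets far beyond this sketch.
PATTERNS = {
    "company": re.compile(r"\b[A-Z][A-Za-z]+ (?:Inc|Corp|Ltd)\.?"),
    "money":   re.compile(r"\$\d[\d,]*(?:\.\d+)?"),
    "year":    re.compile(r"\b(?:19|20)\d{2}\b"),
}

def tag(text):
    """Return entities grouped by type, one list per entity category."""
    found = defaultdict(list)
    for etype, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found[etype].append(match.group())
    return dict(found)

sample = "Acme Corp. reported $1,200,000 in revenue in 2007."
print(tag(sample))
```

The value of the real service is that the categories and patterns are maintained for you; the structure of the output, typed entities grouped by category, is the same idea.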

More about Calais

For now Calais is working to build a community to extend the Semantic Web. Without tools like Calais, the Semantic Web is likely to remain a great idea that failed because people don’t want to do tagging. When tagging is done, it’s lousy. I’m supposed to know how to index, and the tags for my Web log are pretty miserable. The reasons may be broader than just my own approach. First, indexing, to be useful, must use a body of terms that the average user can hit upon and remember. So neologisms are out, and weird jargon won’t work at all. Second, writing for a Web site or a Web log like this one is supposed to be disciplined, but it’s not. I have other research work that commands my primary attention. The Web log, while important, comes second, maybe third on some busy days. Finally, I’m not sure what I will write about. I react to information people send me in email, stories in my RSS reader, and comments made–often off the cuff–on a phone call. It is difficult for me to create a controlled term list because I’m not sure what the topics will be. Therefore, lousy tagging.
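That first failing, matching a writer’s words to terms a user can actually remember, is exactly what a controlled vocabulary addresses. A minimal sketch of the idea follows; the vocabulary and mappings are invented for illustration and are not drawn from Calais or any real thesaurus.

```python
# A minimal controlled-vocabulary tagger. A real thesaurus maps many
# entry terms (synonyms, jargon, neologisms) onto one preferred
# descriptor so that users and writers meet on common ground.
VOCABULARY = {
    "search": "information retrieval",
    "retrieval": "information retrieval",
    "findability": "information retrieval",  # neologism mapped to a standard term
    "taxonomy": "controlled vocabulary",
    "thesaurus": "controlled vocabulary",
}

def assign_terms(text):
    """Return the sorted set of preferred descriptors for a document."""
    words = text.lower().split()
    return sorted({VOCABULARY[w] for w in words if w in VOCABULARY})

print(assign_terms("Findability depends on a good thesaurus"))
```

The point of the design is that however the writer phrases it, the document gets indexed under the same preferred term, so a searcher’s query can find it.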

Calais asserts that its technologies can address my three failings and probably yours as well. You can download developer tools, upload content to Calais, or use the functions on the Calais Web site. Reuters-ClearForest has posted some useful documentation about Calais here. If you’re a bit nerdy, you can do some integration of Calais and your application. The best way to get a sense of what’s possible is to explore the sample applications on the Calais Web site.

More about ClearForest

ClearForest was founded by text mining guru Ronen Feldman. You can get the inside scoop on this wizard’s approach to squeezing information from text in his 2006 book, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data (Cambridge University Press ISBN 13: 9780521836579).

The ClearForest technology performs “discovery”; that is, the system processes text and identifies important information. The company found a ready market wherever executives wanted to find the “hidden” information in text. I recall attending a presentation by Dr. Feldman in which he showed the ClearForest system processing auto warranty data in the written comments from customer support reps and owners who sent email about their vehicles.

The ClearForest system processed these comments and displayed important discoveries in easy-to-understand reports. One example concerned a flawed component that the ClearForest system pinpointed as one that was causing problems previously overlooked by the automobile manufacturer. The kicker to this example was that the manufacturer was able to make a change to the affected component and take preemptive action to save significant amounts of warranty cost and avoid customer complaints.

The earlier versions of the ClearForest system made use of rules. Some of these required hand-tweaking or a ClearForest-adept programmer to set up, tune, and deploy. Over the years, ClearForest, like other companies looking for ways to leverage “smart” software, has injected more automation into its system.


Search Vendors: Sniper Firefights Break Out

April 21, 2008

Yesterday, I posted an email containing a statement of a publicly-traded company’s intent to replace its incumbent search system. You can read the full text of this document, which I have verified as originating with a reputable company on the West Coast of the US. I deleted the references to the vendor whose search system is getting the boot. I also redacted the name of the company reaching its boiling point and the name of the hapless information technology manager who was responsible for the acquisition of the incumbent system.

Just three days before I received this email from the aggrieved licensee of a blue-chip search system, I spoke on the telephone with a leading European investment bank’s lead analyst for a publicly-traded company with a presence in search and retrieval. That call probed my knowledge of customers using the publicly-traded firm’s search system. I don’t have too much detail (what the analyst whiz kids call “color”) about expensive systems that are a pain in the neck. What I know I keep to myself. I was interested in why the zingy young MBA was calling me in my underground bunker in Harrod’s Creek, Kentucky. The reason was that the investment bank had heard that some high-profile licensees of this publicly-traded company’s search system were going bonkers over costs, erratic search results, and performance. This is a hat trick of sorts, and I slithered out of the call.

Today I spoke with my partner in Washington, DC. She told me, “I have worked with most of the big guys. None of this stuff works without work–a lot of work. So what’s new?”

I guess not much when it comes to enterprise search (what I call Intranet search or behind-the-firewall search). What is new is that the public airing of complaints seems to be ratcheting upwards. A few years ago, an organization would assume that cost overruns, grousing users, and system flakiness were problems originating anywhere except with the search vendor.

Not today. Some licensees are savvier, and several licensees are not too shy about telling the world, “Hey, this stuff is a major problem. It doesn’t work as advertised.”

Contravallation: An Old Strategy Might Resurface

I am not sure if you are familiar with the word contravallation. Popular among the war college set, the term refers to a fortification set up to protect a besieging force from attack by the defenders of the besieged place. In short, a contravallation is a defensive shield designed to protect one party from another. The Romans apparently had an appetite for contravallations, using them to keep their enemies from slipping out of a besieged town. The besiegers wall the besieged in. A contravallation makes sure no one gets away.

A Civil War Contravallation. It would fence me in.

My hunch is that we are about to see a number of defensive fortifications erected by search vendors to prevent licensees from escaping. Let me be clear. This is not a problem of one vendor. This is a problem created by many different vendors. This is not a problem that appeared overnight. The pot has been boiling for years, a decade or more in some cases.

The worsening economy makes an expensive search system more than a casual expense. Users and some information technology professionals have a much deeper understanding of what a search system can do. Some vendors make an attempt to “lock in” a licensee with a “perpetual license”, deferred payment schedule, bundles of maintenance and support available only with a multi-year deal, and other techniques. Others assume that an investment in a “platform” cannot be easily discarded. The sheer scope and complexity of an information processing system puts a licensee under a state of siege.

The vendors’ contravallation will be designed to prevent a licensee from breaking an agreement.


Indexing Dynamic Databased Content

April 20, 2008

In the last week, there’s been considerable discussion of what is now called “deep Web” content. The idea is that some content requires the user to enter a query. The system processes the query and generates a search result from a database. This function is easier to illustrate than explain in words.

Look at the screen shot below. I have navigated to the Southwest Airlines Web page and entered a query for flights from Louisville, Kentucky, to Baltimore, Maryland.

southwest form

Here’s what the system shows me:

southwest result

If you do a search on Google, Live.com, or Yahoo, you won’t see the specific listing of flights shown below:

southwest flight listing
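A toy sketch makes the crawler’s problem concrete: the flight listing exists only as the answer to a parameterized query, never as a static page a spider can fetch. The routes and fares below are invented for illustration; nothing here reflects Southwest’s actual systems.

```python
# Why crawlers miss "deep Web" content: a spider fetching the URL gets
# only the search form, while the useful listing is generated on the
# fly from a database after the user submits query parameters.
FLIGHTS = [
    {"origin": "SDF", "dest": "BWI", "depart": "07:15", "fare": 129},
    {"origin": "SDF", "dest": "BWI", "depart": "13:40", "fare": 99},
    {"origin": "SDF", "dest": "MDW", "depart": "09:05", "fare": 89},
]

def static_page():
    """What a crawler fetching the URL sees: just the search form."""
    return "<form>From: ___  To: ___  <button>Search</button></form>"

def run_query(origin, dest):
    """What the site generates after the user submits the form."""
    return [f for f in FLIGHTS if f["origin"] == origin and f["dest"] == dest]

print(static_page())            # indexable by Google, Live.com, Yahoo
print(run_query("SDF", "BWI"))  # visible only to the querying user
```

An engine that wants this content has to either submit queries itself or get a direct feed from the database; ordinary link-following never surfaces it.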


Traditional Publishers: Patricians under Siege

April 19, 2008

This is an abbreviated version of Stephen Arnold’s keynote at the Buying and Selling eContent Conference on April 15, 2008. A full text of the remarks is here.

Roman generals like Caesar relied on towers spaced about 3000 feet apart. Torch signals allowed messages to be passed. Routine communications used a Roman version of the “pony express”, based on innovations in Persia centuries before Rome took to the battlefield.

Today, you rely on email and your mobile phones. Teens and tweens Twitter and use “instant” social messaging systems like those in Facebook and Google Mail. Try to imagine how difficult it would be for Caesar to understand the technology behind Twitter. But how many of you think Caesar would have hit upon a tactical use of this “faster than flares” technology?


The Text Mining Can of Worms

April 17, 2008

In October 2007, I participated in a half-day “text mining” tutorial held after the International Chemical Information Conference in that most appealing Spanish city, Barcelona. The attendees–I think there were about 24 people mostly from European companies–wanted to learn about advanced text mining systems–in theory. Reality often intrudes, however.

textmining worms copy

Fresh from the primary research for my Beyond Search: What to Do When Your Enterprise Search System Won’t Work, I had a significant amount of information about 50 vendors’ text mining systems and their technologies. The structure of the Barcelona tutorial was straightforward. After defining text mining and differentiating it from the better-known data mining, I walked through some case examples of text mining successes. The second part of the tutorial focused on the business issues of text mining. The end point of this segment tackled three key challenges to which I will return in a moment. The third segment of the tutorial took a look at what Google was disclosing through its engineering papers and speeches about its approach to text mining. This is a very interesting block of information, and I may at some point in the future describe a little of our findings. The tutorial wrap up was a series of observations with time for the attendees to ask additional questions and share some of their experiences.


Leximancer: Divining Meaning from Words

April 17, 2008

In Australia last year, I met several information technology professionals who mentioned the Leximancer text and content processing system to me. Leximancer now has offices in three cities: Brisbane, Australia; London, England; and Boulder, Colorado. I updated my Leximancer files and made a mental note that the company had some nifty visualization technology. Based on comments made to me, analysts in police and intelligence as well as the academic community find the product of significant value. I heard that the company has more than 200 licensees and is growing at a brisk pace.

At the eContent conference in Phoenix, Arizona, one of the attendees was grilling me about text analytics. As the grill-ee, I was reluctant to provide too much information to the grill-er. Most of what the young, confident MBA wanted is in my new study Beyond Search: What to Do When Your Enterprise Search System Won’t Work. Furthermore, she was convinced, after text mining industry research that included healthy bites of blue-chip consultancies’ pontifications, that no firm combined text analysis, discovery, and useful point-and-click visualizations of the topic and concept space of a collection.

Sigh. Like the Fortune 500 country clubbers, vendors are so darn inadequate. Maybe? Sometimes it’s the Fortune 500 Ivy leaguers who are missing a card or two in their deck, not the vendors. Just a thought.

This short essay is a partial response to her assertion, which was–by the way–100 percent incorrect. For some reason, her research overlooked high-profile tools from dozens of vendors as well as point specialists. On the flight back last night, I recalled the Leximancer system, and I thought I would provide some color about that firm’s approach for two reasons: [a] I find it useful to look at companies with interesting search-related technologies and [b] I want to underscore that her assertion and her research were woefully inadequate.

What’s a Leximancer?

Leximancer is text mining software that you can use to analyze the content of collections of textual documents. The system then displays the extracted information in a browser. Leximancer’s approach to visualization is to use a “concept map”. The idea is that a user can glance at the map, get an overview, and then explore the relationships that Leximancer discovers within the text.

concept map
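The intuition behind a concept map can be sketched in a few lines: concepts that appear together in the same document get linked, and the link weight is the co-occurrence count. This is only the intuition, not Leximancer’s actual algorithm, and the sample documents and seed concepts below are invented.

```python
from collections import Counter
from itertools import combinations

# Sketch of the idea behind a concept map: concepts co-occurring in a
# document are linked; heavier edges mean stronger association. A real
# system learns its concepts from the text rather than taking a fixed list.
def cooccurrence_edges(documents, concepts):
    edges = Counter()
    for doc in documents:
        present = sorted(c for c in concepts if c in doc.lower())
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges

docs = [
    "Police analysts mine interview text for evidence.",
    "Text analytics helps police discover links in evidence.",
    "Academic researchers study text analytics methods.",
]
print(cooccurrence_edges(docs, ["police", "text", "evidence"]))
```

Feed the resulting weighted edges to any graph layout and you get the familiar picture: strongly associated concepts cluster, and an analyst can explore the neighborhoods.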


Linguistic Agents: Smart Software from Israel

April 16, 2008

In my new study “Beyond Search”, I profile a number of non-US content processing companies. Several years ago I learned about Jerusalem-based Linguistic Agents, which uses an interesting technique for its natural language processing system.

The firm’s founder is Sasson Margaliot. In 1999, Mr. Margaliot wanted to convert linguistic theories into practical technologies. The goal was to enable computers to understand human language and context. Like other innovators in content processing, Mr. Margaliot had expertise in theoretical linguistics and application software development. He studied Linguistics at UCLA and Computer Science at Hebrew University of Jerusalem.

The company’s chief scientist is Alexander Demidov. Mr. Demidov was responsible for the development of linguistic grammars for the company’s NanoSyntactic Parser, the precursor of today’s Streaming Logic engine. Previously, he worked for the Moscow Institute of Applied Mathematics and at Zehut, a company that developed advanced compression and protection algorithms for digital imaging.

Computerworld identified the company in the summer of 2007 as having one of the “cool cutting-edge technologies on the horizon”. Since that burst of publicity in the US, not much has been done to keep the company’s profile above the water line.

The company uses “nano syntax” to extract meaning from documents. On the surface, the approach seems to share some features with Attensity, the “deep extraction company” and the firm that I included in my new study as an exemplar of recursive analysis and linguistic processing for meaning.

The idea is that a series of parallelized processes converts a sentence into a representation that preserves its syntactical meaning. The technology can be applied to search as well as context-based advertising. The company asserts, “The technology can revolutionize how computers and people interact –computers will learn our language instead of vice versa.”


Google’s Janitors: Clean Up Crew Ready for a Clean Sweep

April 15, 2008

At my Buying & Selling eContent keynote this morning, I discussed briefly Google’s invention of “janitors”. You can get the full text of the patent from the USPTO site. Search for US20070198481, “Automatic Object Reference Identification and Linking in a Browseable Fact Repository.” The inventors are Andrew Hogue and Jonathan Betz, Google, Inc.

The patent is of keen interest to me. It makes use of functions that Google is now making available via its App Engine service, among others. My suggestion is that you read about the App Engine and then look at US20070198481. If you have read about Google’s Programmable Search Engine, you may see linkages among these inventions that the individual patent documents do not make explicit. Google is not hiding any of these technologies, just using its infrastructure in fresh, intriguing ways. Keep in mind that a patent document is not a product. I believe it is useful to look at open source information in order to keep a finger on the pulse of a company’s innovation heartbeat.

Figure from US20070198481

Now look at this illustration, which I used in my keynote. I want to direct your attention to two things. First, the query generates a report about the topic, in this case, the named entity “Michael Jackson”. Second, this result is not a hit list; it is a report. If my research for my new Gilbane Group study Beyond Search is accurate, Google’s US20070198481 seems to address some of the problems that users experience when confronted with results lists.

You will need to draw your own conclusions about this type of automated report generation. Google is not just in step with what users want; the company appears to possess technology that makes it possible for the GOOG to jump into professional publishing, expand its reach as a business intelligence tool, and please users who want a distillation, not a laundry list of results.
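The report-not-hit-list idea can be sketched as a lookup against a fact repository: answer a query about a named entity by assembling its stored attributes rather than returning ranked documents. The repository contents below are invented placeholders; the actual repository, object-reference linking, and ranking machinery described in US20070198481 are far richer.

```python
# Sketch of the idea in US20070198481: a query against a fact
# repository yields an assembled report about an entity, not a hit list.
# The stored attributes here are made up for illustration only.
FACT_REPOSITORY = {
    "michael jackson": {
        "type": "person",
        "occupation": "recording artist",
        "born": "1958",
    },
}

def entity_report(query):
    facts = FACT_REPOSITORY.get(query.lower())
    if facts is None:
        return f"No facts stored for '{query}'."  # fall back to a hit list
    lines = [f"Report: {query}"]
    lines += [f"  {attr}: {value}" for attr, value in facts.items()]
    return "\n".join(lines)

print(entity_report("Michael Jackson"))
```

The contrast with a conventional engine is the output shape: one synthesized answer object per entity instead of ten blue links for the user to sift.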

Stephen Arnold, April 15, 2008
