Exogenous Complexity 1: Search

January 31, 2012

I am now using the phrase “exogenous complexity” to describe systems, methods, processes, and procedures which are likely to fail due to outside factors. This initial post focuses on indexing, but I will extend the concept to other content centric applications in the future. Disagree with me? Use the comments section of this blog, please.

What is an outside factor?

Let’s think about value adding indexing, content enrichment, or metatagging. The idea is that unstructured text contains entities, facts, bound phrases, and other identifiable elements. A key word search system is mostly blind to the meaning of a number in the form nnn nn nnnn, which in the United States is the pattern for a Social Security Number. There are similar patterns in Federal Express, financial, and other types of sequences. The idea is that a system will recognize these strings and tag them appropriately; for example:

nnn nn nnnn Social Security Number

Thus, a query for Social Security Numbers will return strings of digits matching the pattern. The same logic can be applied to certain entities. With the help of a knowledge base, Bayesian numerical recipes, and other techniques such as synonym expansion, a system can determine that a query for Obama residence should return White House and that a query for the White House should return links to the Obama residence.
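Here is a minimal Python sketch of the two ideas above: pattern tagging and synonym expansion. Only the nnn nn nnnn layout and the Obama residence / White House pairing come from the post. The names PATTERNS, SYNONYMS, tag_entities, and expand_query are hypothetical, accepting hyphens as separators is an assumption, and the hand-built dictionary stands in for a real knowledge base.

```python
import re

# Hypothetical pattern table: label -> compiled regex. Only the SSN layout
# (nnn nn nnnn) comes from the text; allowing hyphens is an assumption.
PATTERNS = {
    "Social Security Number": re.compile(r"\b\d{3}[ -]\d{2}[ -]\d{4}\b"),
}

# Hypothetical hand-built stand-in for a knowledge base of equivalent phrases.
SYNONYMS = {
    "obama residence": {"white house"},
    "white house": {"obama residence"},
}


def tag_entities(text):
    """Return (matched string, label) pairs for every known pattern."""
    tags = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            tags.append((match.group(), label))
    return tags


def expand_query(query):
    """Return the normalized query plus any synonyms found for it."""
    key = query.strip().lower()
    return {key} | SYNONYMS.get(key, set())


if __name__ == "__main__":
    print(tag_entities("Applicant 123 45 6789 filed on 2012-01-31."))
    print(expand_query("Obama residence"))
```

Note that a part number or order code that happens to use the same layout, say 555 12 3456, gets tagged as a Social Security Number too. That is a small instance of the false drop problem discussed below: the pattern matches, but the meaning is wrong.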

One wishes that value added indexing systems were as predictable as a kabuki drama. What vendors of next generation content processing systems participate in is a kabuki which leads to failure two thirds of the time. A tragedy? It depends on whom one asks.

The problem is that companies offering automated solutions to value adding indexing, content enrichment, or metatagging are likely to fail for three reasons:

First, there is the issue of humans who use language in unexpected or what some poets call “fresh” or “metaphoric” ways. English is synthetic in that any string of sounds can be used in quite unexpected ways. Whether it is the use of the name of the fruit “mango” as a code name for software or the conversion of a noun like information into a verb like informationize, which appears in Japanese government English language documents, the automated system may miss the boat. When the boat is missed, continued iterations try to arrive at the correct linkage, but as anyone who has used fully automated systems or who paid attention in math class knows, the recovery from an initial error can be time consuming and sometimes difficult. Therefore, an automated system, no matter how clever, may find itself fooled by the stream of content flowing through its content processing work flow. The user pays the price because false drops mean more work, and the suggestions are not just off the mark; they are difficult for a human to figure out. You can get the inside dope on why poor suggestions are an issue in Thinking, Fast and Slow.

Read more

Synaptica Independent Taxonomy Resource

January 27, 2012

Synaptica started out as Synapse Corporation under founders Trish Yancey and Dave Clarke. The company offered taxonomies, software solutions, and professional lexicography and indexing services for businesses and organizations based on its Synaptica product, a knowledge management and indexing software application which helps enterprises manage taxonomies, thesauri, classification schemes, authority control files, and indexes. In 2005, the company, renamed Synaptica, was acquired by Dow Jones and placed in its Factiva unit. Clarke has subsequently regained control of Synaptica.

The company has also revamped its informational website, Taxonomy Warehouse, a free online resource that answers enquiries about taxonomies. Named as one of KM World magazine’s “Trend-Setting Products of 2011,” Synaptica is an editorial tool designed for use by professional taxonomists. In 2011, the company added a complementary suite of front-end publication tools that make it easy for any taxonomy or ontology to be presented to end-users. The Ontology Publishing Suite gives administrators better control over which parts of a master ontology are exposed to end-users, as well as how they are laid out on-screen. Other parts of the Synaptica product suite include Synaptica Enterprise, the behind-the-firewall solution for larger organizations; Synaptica Express, a cloud-computing solution for individuals or small-business users; Synaptica IMS, a complementary suite of tools designed to support the human indexing of content using taxonomies stored in Synaptica; and Synaptica SharePoint Integration, an add-on module that lets taxonomies managed within Synaptica be applied as meta-tags to content stored in SharePoint document libraries and be used for search.

The technology has found a home in corporate, pharmaceutical, government, and e-commerce markets. Clients include Verizon, ProQuest, the BBC, and Harvard Business Publishing. Competitors include LexisNexis, Dun & Bradstreet, and InsideView. (I would not include Concept Searching or Ontoprise in this short list due to exogenous complexity factors.)

Stephen E Arnold, January 27, 2012

Sponsored by Pandia.com

OpenCalais: From the Innovators at Thomson Reuters

January 27, 2012

Thomson Reuters is now testing a new print publication called Reuters at the World Economic Forum. Before the firm returned to print, Thomson Reuters was probing automated tagging.

Founded in 1998, ClearForest was previously an independent software start-up. It was acquired by Reuters in 2007 and is now part of the Markets division of Thomson Reuters. OpenCalais is a strategic initiative from Thomson Reuters, based on ClearForest technology, to support the interoperability of content across the digital landscape.

OpenCalais is free to use in both commercial and non-commercial settings but can only be used on public content. It can process up to 50,000 documents per day (blog posts, news stories, Web pages, etc.) free of charge.  For users needing to process more than that, there is Calais Professional. While it does not keep a copy of the content, it does keep a copy of the metadata it extracts. Offering a de-facto standard for making content interoperable in a fashion that complies with Semantic Web standards ultimately benefits Thomson Reuters, which is then able to track themes, memes and trends on the Web and to potentially do things like link to relevant content that helps provide context to its readers, customers and other constituents.

After releasing a couple of major upgrades – in particular the incorporation of a whole Linked Data ecosystem underneath OpenCalais for companies, geographies, products and a few other things – with little or no adoption and no fundamentally new capabilities being built, the OpenCalais team, headed by Tom Tague, decided to slow down development and let the market for semantic extraction mature. Thomson Reuters believes that there are massive opportunities for OpenCalais in the areas of news, its integration with social media and its utilization as a massive repository of knowledge.

OpenCalais’ early adopters include CBS Interactive / CNET, Huffington Post, Slate, Al Jazeera, “The New Republic,” the White House and more. Customers include Kodak, Dow Chemical, Eastman Chemical, NASD, EDS, Boeing, US Dept. Air Force, Reuters, Dow Jones, and Thomson Financial. Competitors include Eqentia and Evri. (I would not include Concept Searching or Ontoprise in this short list due to exogenous complexity factors.)

Stephen E Arnold, January 27, 2012

Intellisophic: Formerly Indraweb

January 26, 2012

Founded in 1999 as Indraweb and changing its name in 2005, Intellisophic, Inc., is a privately-funded technology company that is the world’s largest provider of taxonomic content. Its technology, originating from the work of founders Henry Kon, PhD, George Burch, and Michael Hoey, is based on the premise that concepts within unstructured information can be systematically derived by leveraging the trusted taxonomies of the reference book community. Within this core idea, Intellisophic developed and patented the Orthogonal Corpus Indexing algorithm for extracting and using taxonomies from reference and education books.

During a stint as principal investigator for MIT’s Context Interchange, CTO Kon researched and implemented methodologies for enterprise integration of structured and semi-structured data over independently managed and disparate schema databases. He researched, designed, and prototyped integration engines for distributed multi-database query and caching over heterogeneous, distributed, and partially connected databases. As a member of MIT’s Composite Information Systems Laboratory, Kon published on multi-database integration engines and the use of ontology for bridging database schema. With Intellisophic, he has pioneered innovation in the conceptual management of unstructured information and in the integration of structured, semi-structured and unstructured content.

Intellisophic content is machine-developed, leveraging knowledge from respected reference works. The taxonomies are unbounded by subject coverage and are cost-effective to create. The taxonomy library covers several million topic areas defined by hundreds of millions of terms. In addition to taxonomic content, the company offers intelligent solutions, such as enterprise search and retrieval, business intelligence, categorization and classification, compliance management, portal infrastructure, social networking, content and knowledge management, electronic discovery, data warehousing, and government intelligence.

Its strategic alliance partners include Mark Logic, DataLever, SchemaLogic, DFI International, and Mosaic, Inc. Competitors include Sandpiper, Intellidimension, and HighFleet. The depth and breadth of Intellisophic’s taxonomies, along with its support of the leading text mining, search, and categorization applications, make it a good solution for many industries. (I would not include Concept Searching or Ontoprise in this short list due to exogenous complexity factors.)

Stephen E Arnold, January 26, 2012

Mondeca: How Smart Is Your Content?

January 25, 2012

Here in Harrod’s Creek, we and our content are not too smart. Mondeca believes it can change this hapless condition.

Founded in 1999 by Jean Delahousse and others, Mondeca asserts that it is the leading European provider of technology for the management of advanced knowledge structures: ontologies, thesauri, taxonomies, terminologies, metadata repositories, knowledge bases, and Linked Open Data.

Based in Paris, France, the company has been financed by its founders, as well as investment funds Trinova and Banque Populaire. Before starting Mondeca, Delahousse worked for Andersen Consulting, Paris Stock Exchange and Diagram, a publisher of financial software. With expertise in semantic web, ontologies, and content management, he has experience in the design and launch of large software applications, as well as in implementation of semantic technologies for large international clients.

Mondeca’s products help enterprises integrate and interlink heterogeneous information by mapping it to explicit knowledge references. They also improve the way information is retrieved, analyzed, and reused by producing consistent, precise, and relevant metadata and by supplying the relevant context. Mondeca’s technology is at the core of the Semantic Enterprise Information Architecture, which interconnects people and resources and extracts the most value from information.

Its products include Content Annotation Manager and Intelligent Topic Manager. Content Annotation Manager is a platform for building and managing customized workflows for semantic annotation of content; it coordinates content analysis, data mapping, human validation, and knowledge enrichment components. Intelligent Topic Manager supports the management of complex knowledge structures throughout their lifecycle, from authoring to delivery. It can be used independently to store and manage complex domain-specific knowledge structures, or as a service that enhances enterprise search, knowledge discovery, and text mining solutions.

Mondeca has also built its credibility in the Semantic Web space as a key contributor to widely-used international standards: OWL, RDF, SKOS, ISO 25964, and Topic Maps. Clients include Hachette Filipacchi, the World Tourism Organization, and Thomson Scientific. Competitors include Layer2 and Wordmap. 

Stephen E Arnold, January 25, 2012

Taxonomy Presentation from Project Performance Corporation

January 20, 2012

Talk about taxonomy. Synaptica Central announces, “Taxonomy More Complex than Five Years Ago.” While the title states the obvious, the write up points to a presentation that may be worth a look. We learn from the posting:

Zach Wahl of Project Performance Corporation (PPC) said that the average taxonomy application is deeper and more complex than five years ago, and so the need for more sophisticated taxonomy software tools is becoming widely recognized.  PPC is a leading management consultancy with a growing taxonomy practice.  Wahl’s comments drew upon observations of the evolution of RFP requirements over the last few years.

The Project Performance Corporation works to bring efficiency to its clients by divining their best management practices and most effective, up-to-date technology. The company strives to treat its employees well, to give back to communities, and to always continue improving.

There is some room for improvement in this example, I’m afraid. We found the presentation, “Taxonomy Tools Requirements and Capabilities,” to be a gathering of truisms and some tough to understand magic. Check it out, but your mileage may vary.

Cynthia Murrell, January 20, 2012

Blekko Gets a Makeover

January 19, 2012

According to the Search Engine Watch article “Blekko Gets New User Interface and Ramps Up Auto-Slashing,” Blekko has undergone a massive upgrade. The upgrade expands the company’s automatic slashtags, which will now cover more than 500 content categories. In addition, the company will debut a new user interface (UI). Blekko designed slashtags to provide users with relevant and quality results.

Mike Markson, co-founder and VP of Marketing stated:

Blekko “believes you’re better off as a user not searching the entire internet, but the top sites, because that’s where the quality content resides.”

The new UI gives the interface a cleaner look while still providing customers with easy to use and attractive features. One notable new slashtag worth mentioning is /monte. The article uses an interesting comparison: “Results using /monte are like a blind taste test. Type in any query – one column is a result from Google, one from Bing, and one from Blekko.” Users can save time by easily scanning the lists, weeding out the junk, and finding what they want. Kind of gives a whole new meaning to “great customer service with a smile.” Yandex has a stake in Blekko and we believe that these two services pose a significant threat to Google in Web search.

April Holmes, January 19, 2012

Data Harmony: Sweet Tune for Knowledge Management Experts

January 10, 2012

Short honk: Here in Harrod’s Creek, we find meet ups, hoe downs, and webinars plentiful and out of tune with our needs. We want to put on your calendar an event that seems to offer a sweet tune about knowledge management.

The Eighth Annual Data Harmony Users Group (DHUG) meeting, scheduled February 7 to 9, 2012, in Albuquerque, New Mexico will focus on helping users get the most from their investment in the knowledge management software suite, which helps users organize information resources based on a well-built and systematically applied taxonomy or thesaurus.

We learned:

“This meeting is an exciting opportunity to learn how to fully utilize the power of Data Harmony software to maximize the effectiveness and profitability of your organization for your members, customers and staff,” said Marjorie M.K. Hlava, president of Access Innovations.

You can get complete details from Access Innovations. The widely read Web log Taxodiary is encouraging anyone who wishes to share their story at the meeting to contact Data Harmony at this link. Registrations are also now being accepted. For more information about the Eighth Annual Data Harmony Users Group meeting, click here or call (505) 998-0800 or 1-800-926-8328. We hope that Access Innovations captures their knowledge in a monograph. Too many amateur taxonomists and knowledge mavens are pumping out inaccurate or incomplete information. In our experience, the go-to experts gravitate to the performances by the Mozarts of mark up.

Sounds excellent to us.

Stephen E Arnold, January 10, 2012

Everlasting Metadata?

January 4, 2012

Professional photographers are working to protect their rights in the digital world, as CNET reveals in “Should Metadata Be Permanent?” The groups supporting an initiative to require that metadata be permanently adhered to image, text, audio, and video files are understandably focused on protecting copyrights. However, there could be other repercussions to the move. Writer Alexandra Savvides points out:

Imagine a whistle-blowing case involving photographic evidence, where the metadata clearly reveals who took the photo. The manifesto also doesn’t seem to address issues of data tampering or manipulation. We’ve seen numerous cases where photo-encryption systems have been cracked, showing that an obviously manipulated image is an original file created by a camera in question. There is nothing to stop similar methodologies being developed that could change the metadata to imply that another person created an image.

It’s a thorny question. I sympathize with artists who must protect their work. On the other hand, there’s the law of unintended consequences. There is also the question of “language drift.” If metadata are not up to date, the searcher of the future might not be able to locate the information object because the search term does not match the metadata’s lingo.

Our question, though, is a little more pragmatic: what if the metadata need to be changed? Hmm. Inconvenient, that.

Cynthia Murrell, January 4, 2012

Search Engines May Take Action Against Pirate Web Sites

January 3, 2012

From the Sooner or Later Department:

Google has been in the news a lot lately for being biased when it comes to search result ranking. According to the recent Telegraph article “Google May Give Pirate Sites Lower Ranking,” that bias may be leading to positive results. A new code will force search engines to automatically rank pirate websites lower than official ones and give priority to those that were certified under a recognized scheme.

The article states:

According to research by the Publisher’s Association, Google searches for the 50 best-selling books in one week in March returned an average of four illegal links in the top 10 listings. The previous year that figure was closer to two.

Under the code, Google as well as other search engines would stop allowing illegal sites to advertise and would step up their efforts in delisting pirate websites as soon as they are flagged by legitimate rights holders.

While the search engines have yet to respond to the proposal, we believe that if this policy goes into effect, there may be some unforeseen consequences. Exciting to be the one to define “pirate.”

Jasmine Ashton, January 3, 2012
