Exogenous Complexity 1: Search

January 31, 2012

I am now using the phrase “exogenous complexity” to describe systems, methods, processes, and procedures which are likely to fail due to outside factors. This initial post focuses on indexing, but I will extend the concept to other content centric applications in the future. Disagree with me? Use the comments section of this blog, please.

What is an outside factor?

Let’s think about value adding indexing, content enrichment, or metatagging. The idea is that unstructured text contains entities, facts, bound phrases, and other identifiable entities. A key word search system is mostly blind to the meaning of a number in the form nnn nn nnnn, which in the United States is the pattern for a Social Security Number. There are similar patterns in Federal Express, financial, and other types of sequences. The idea is that a system will recognize these strings and tag them appropriately; for example:

nnn nn nnnn Social Security Number

Thus, a query for Social Security Numbers will return strings of digits matching the pattern. The same logic can be applied to certain entities. With the help of a knowledge base, Bayesian numerical recipes, and other techniques such as synonym expansion, a system can determine that a query for Obama residence should return White House or that a query for the White House should return links to the Obama residence.
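The tagging and synonym expansion described above can be sketched in a few lines of Python. This is a minimal illustration, not how any vendor's system works: the regular expression, the `SYNONYMS` table, and the function names `tag_entities` and `expand_query` are all assumptions made up for this example.

```python
import re

# Hypothetical pattern for the "nnn nn nnnn" form; spaces or hyphens
# are accepted as separators. This is an assumption for illustration.
SSN_PATTERN = re.compile(r"\b\d{3}[ -]\d{2}[ -]\d{4}\b")

# Hypothetical hand-built synonym table standing in for a knowledge base.
SYNONYMS = {
    "obama residence": ["white house"],
    "white house": ["obama residence"],
}


def tag_entities(text):
    """Return (matched string, tag) pairs for each SSN-shaped string."""
    return [
        (match.group(0), "Social Security Number")
        for match in SSN_PATTERN.finditer(text)
    ]


def expand_query(query):
    """Return the normalized query followed by any known synonyms."""
    key = query.lower().strip()
    return [key] + SYNONYMS.get(key, [])


if __name__ == "__main__":
    print(tag_entities("Employee 123 45 6789 filed a claim."))
    print(expand_query("Obama residence"))
```

Note that the pattern tags any string with this shape, whether or not it is really a Social Security Number. That is one small way a fully automated tagger can produce a false drop.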

One wishes that value added indexing systems were as predictable as a kabuki drama. What vendors of next generation content processing systems participate in is a kabuki which leads to failure two thirds of the time. A tragedy? It depends on whom one asks.

The problem is that companies offering automated solutions to value adding indexing, content enrichment, or metatagging are likely to fail for three reasons:

First, there is the issue of humans who use language in unexpected or what some poets call “fresh” or “metaphoric” methods. English is synthetic in that any string of sounds can be used in quite unexpected ways. Whether it is the use of the name of the fruit “mango” as a code name for software or the conversion of a noun like information into a verb like informationize, which appears in Japanese government English language documents, the automated system may miss the boat. When the boat is missed, continued iterations try to arrive at the correct linkage, but as anyone who has used fully automated systems or who paid attention in math class knows, the recovery from an initial error can be time consuming and sometimes difficult. Therefore, an automated system, no matter how clever, may find itself fooled by the stream of content flowing through its content processing work flow. The user pays the price because false drops mean more work and because suggestions which are off the mark are also difficult for a human to figure out. You can get the inside dope on why poor suggestions are an issue in Thinking, Fast and Slow.

Read more

Taxonomy Meetings: Change in 2011 or a Realization?

January 26, 2012

Editor’s Note: Please see the full version of this article at Marjorie Hlava’s Taxodiary blog.

Where should a taxonomist go to learn about the latest implementations of controlled vocabulary strategies? The meetings we have attended for years are dying on the vine. The SLA Expo was sparse, the Information Today meetings are smaller, Online Information (formerly International Online) was nearly empty, and NFAIS remains the same size each year.

The Internet has made many things possible. We can convene a meeting electronically in a very short time. People have turned increasingly to webinars and web searching. We follow blogs to read opinions and discussions. If we go to a meeting, we are expecting something else. We want to find community.

Selling of the speaking slots has had a deleterious effect on the quality of the meetings. The costs have reached a point where they no longer provide a good return on investment. But more than that, the challenge remains: how do you get a sense of community?

There are several budding online communities, which seem to be flourishing. Taxonomy Community of Practice is one; the Taxonomy Division of SLA is another. The rest are in user groups. Access Innovations’ Data Harmony User Group meeting will be held in Albuquerque February 7-9, 2012. Come join the community!

Marjorie Hlava. January 26, 2012

Sponsored by Pandia.com

TopQuadrant Earns ReportingHub Contract

January 21, 2012

TopQuadrant, a well-known semantic data integration company, has been awarded a major contract by the Exploration and Production Information Management Association or EPIM.

According to the document “TopQuadrant To Deliver New Data Reporting System For Oil and Gas Operators on the Norwegian Continental Shelf” on TopQuadrant.com, EPIM has awarded its ReportingHub contract to TopQuadrant. EPIM focuses on coming up with IT solutions that efficiently move information among all users. The association has been working closely with the oil industry to come up with a plan that will allow companies to collect, normalize, validate, analyze, and report data concerning the daily activities of the North Sea oil and gas drillers.

According to an Executive Director at EPIM,

TopQuadrant offers scalable data integration and reporting solutions that support end-to-end systems for the oil and gas industry. Working with EPIM, TopQuadrant will develop a new reporting system that will provide flexibility to meet current and future information sharing needs of the NCS operators and authorities.

TopQuadrant has worked to position itself as a developer of taxonomy tools. We find it interesting that content processing firms are finding new ways to leverage core information retrieval technology.

April Holmes, January 21, 2012

Taxonomy Presentation from Project Performance Corporation

January 20, 2012

Talk about taxonomy. Synaptica Central announces, “Taxonomy More Complex than Five Years Ago.” While the title states the obvious, the write up points to a presentation that may be worth a look. We learn from the posting:

Zach Wahl of Project Performance Corporation (PPC) said that the average taxonomy application is deeper and more complex than five years ago, and so the need for more sophisticated taxonomy software tools is becoming widely recognized.  PPC is a leading management consultancy with a growing taxonomy practice.  Wahl’s comments drew upon observations of the evolution of RFP requirements over the last few years.

The Project Performance Corporation works to bring efficiency to its clients by divining their best management practices and most effective, up-to-date technology. The company strives to treat its employees well, to give back to communities, and to always continue improving.

There is some room for improvement in this example, I’m afraid. We found the presentation, “Taxonomy Tools Requirements and Capabilities,” to be a gathering of truisms and some tough to understand magic. Check it out, but your mileage may vary.

Cynthia Murrell, January 20, 2012

Crowdsourcing a Taxonomy: Useful or Useless?

January 18, 2012

We vote for useless.

However, the TopCoder blog recently shared an article that breaks down Crowdsourcing into four categories and combines real world examples within the defined taxonomy they are offering. The post is called “Why the Taxonomy of Crowdsourcing Can Not Categorize Software Development.”

According to the article, there has been a push to categorize what crowdsourcing is, which can be a good thing. However, the blog found that for software developers like TopCoder this can be very difficult to do.

The article states:

As we read through the aforementioned crowdsourcing.org article, it struck us that a taxonomy such as this would have a very hard time categorizing what TopCoder accomplishes. You may or may not know what we do. Through our global competitive community of more than 321,000 professionals – we don’t often use the term crowd – we create innovative software, algorithms that optimize business and scientific solutions and graphical digital assets. The further we studied the 4 different categories presented by Crowdsourcing.org, the more we realized that TopCoder competitions fit into all four categories presented.

If TopCoder feels this way, we wonder if other companies will find crowdsourcing a taxonomy to be a flop as well. There are useless taxonomies which do little to assist findability. Then there are ANSI standard taxonomies which work just for folks who understand Boolean, take care to formulate search strategies, and enjoy “real” research. Most of the world prefers the “slap in a word” or “take what the service delivers” approach. Sigh.

Jasmine Ashton, January 18, 2012

Data Harmony: Sweet Tune for Knowledge Management Experts

January 10, 2012

Short honk: Here in Harrod’s Creek, we find meet ups, hoe downs, and webinars plentiful and out of tune with our needs. We want to put on your calendar an event that seems to offer a sweet tune about knowledge management.

The Eighth Annual Data Harmony Users Group (DHUG) meeting, scheduled February 7 to 9, 2012, in Albuquerque, New Mexico will focus on helping users get the most from their investment in the knowledge management software suite, which helps users organize information resources based on a well-built and systematically applied taxonomy or thesaurus.

We learned:

“This meeting is an exciting opportunity to learn how to fully utilize the power of Data Harmony software to maximize the effectiveness and profitability of your organization for your members, customers and staff,” said Marjorie M.K. Hlava, president of Access Innovations.

You can get complete details from Access Innovations. The widely read Web log Taxodiary is encouraging anyone who wishes to share their story at the meeting to contact Data Harmony at this link. Registrations are also now being accepted. For more information about the Eighth Annual Data Harmony Users Group meeting, click here or call (505) 998-0800 or 1-800-926-8328. We hope that Access Innovations captures their knowledge in a monograph. Too many amateur taxonomists and knowledge mavens are pumping out inaccurate or incomplete information. In our experience, the go-to experts gravitate to the performances by the Mozarts of mark up.

Sounds excellent to us.

Stephen E Arnold, January 10, 2012

60 Months, Minimal Search Progress

January 1, 2012

When I was writing the Enterprise Search Report, I was younger, less informed, and slightly more optimistic. I wrote in August 2005 “Recent Trends in Enterprise Search”:

The truth is that nothing associated with locating information is cheap, easy or fast.

I omitted one item: accurate. About five years after writing this sentence, I have come to my senses. The volume of information flushing through the “tubes” continues to increase. To explain what petabytes means to the average liberal arts major now working at a services firm, someone coined the phrase “big data.” Simple. Tidy. Inaccurate.

That’s why the notion of accurate information is on my mind. I am tough to motivate in general, and burro like when I have to admit that something I wrote in one of my addled states is incomplete, stupid, or just plain wrong.

Let me start the New Year correctly. Here are four observations which will probably annoy the “real” experts, the self appointed search mavens, and the failed middle school teachers now consulting in the fields of ontology, massive parallelization in virtual environments, and “big data.” I don’t plan to alter my rhetorical approach, so too bad about giving some of these rescued Burger King workers some respite. Won’t happen.

First observation: Even a person as wild-and-wonderful as Jason Calacanis, the much admired innovator who makes a retreating Russian army’s scorched earth policy look green, wants to limit Internet content. “Jason Calacanis: Blogging Is Dead & Why Stupid People Shouldn’t Write” captures his take on accuracy. If one assumes stupid people should not write, then one reason may be that stupid people produce inaccurate information. Sounds okay to me, so let’s go with the stupid angle. In the era of “big data”, trimming out the stupid people should result in higher value information. Keep in mind I am addled. I am not sure where to stand on the “stupid” thing.

Image source: http://www.northernsun.com/Boldly-Going-Nowhere-T-Shirt-(8257).html

Second observation: Disinformation is becoming easier for me to spot. For you? I am not so sure. Let me give you a couple of examples. Navigate to the now out of date list of taxonomy systems prepared by Will Power. The page is available from Willpower Information in Middlesex. Now scan the description of the taxonomy system called MTM. Here’s a snippet:

MTM is the software for multilingual thesauri building and maintenance. It has been designed as a configurable system assisting a user in creating concepts, linking them by means of a set of predefined relations, and controlling the validity of the thesaurus structure…

The main features of the software are inter alia:

  • thesaurus maintenance and support system;
  • KWOC and full tree representation and navigation tools available on-line;
  • KWIC, KWOC and full tree printouts (in an alphabetic and systematic order);
  • defining and customization of up to 100 conceptual relationship types;
  • management of facets, codes (top classification), sources, regional variants, historical notes, etc.;
  • support of the various types of authority files;
  • computer assisted merging;
  • thesauri comparison by means of windows;
  • support of the various alphabets;
  • support of linguistic and orthographic variants;
  • sorting facilities consistent with national standards;
  • variable length data handling;
  • flexibility in defining input and output forms;
  • versatility in terms of relative ease of configuring the software for the various sets of languages;
  • flexibility in defining data structures needed for a given application;
  • a possibility to exchange data with other organizations and systems through exporting and importing terms and relations.

Read more

The Heat in SharePoint Semantics: December 16 to 23, 2011

December 27, 2011

This week, SharePoint Semantics shared several informative articles on how to improve the functionality of Microsoft SharePoint for end users and explain its often complex new and improved features. I would like to highlight several exceptionally informative posts related to search.

In the post “On-Ramp Your Paper Documents Into SharePoint and Maximize Your ROI” Ken Toth highlights an article that talks about the advantages of getting your enterprise’s paper documents onto SharePoint.

Toth remarks:

The article notes that in the majority of situations with document imaging solutions, paper and costs are reduced, business process workflow is streamlined, and SharePoint investments are maximized.  Digitizing paper is not difficult and results in concrete savings, opportunity savings, and intangible savings.  These savings can commonly be realized in a mere three to six months.

With technology, there is often a lot of speculation around when the newest versions of hot products will hit the market. In “Speculation on Future Microsoft SharePoint Release Dates” Toth shares an article that makes predictions about the popular enterprise search platform.

The article states:

However, the most compelling argument against the next version of SharePoint being SharePoint 2012 is this: The Office 15 client suite consistently refers to SharePoint 2010 and SharePoint 2013, for example in Visio, where you can create workflows for either SharePoint 2010 or SharePoint 2013. In fact, in Visio, even the next version of SharePoint Designer is called “SharePoint Designer 2013.”

In “Tips and Traps for SharePoint 2010 Legacy CMS Migration” we look at how content management systems unify an organization and create a true technological enterprise but how managing and moving that content is often tricky.

Toth asserts:

Third-party software is often a reasonable way to make sure that all metadata is being transmitted properly. Whether by utilizing third-party software or third-party engineers, it will cost you, but the job will be done right the first time. Organizing your data properly the first time and only moving what you need will take away some of the headache, as well.

Although it is Microsoft’s fastest selling product and the emerging industry standard for enterprise information management, SharePoint is far from perfect. It’s important that, in addition to paying attention to the valuable information in the articles from SharePoint Semantics, users also look into third party products like Smartlogic’s Semaphore content intelligence platform to assist them when the going gets tough.

Jasmine Ashton, December 27, 2011

Taxonomy: More Marketing Craziness in Play?

December 12, 2011

For whatever reason, I have been picking up rumors, factoids, and complaints about the sales and marketing tactics of various search and content processing vendors. With holidays just around the corner, one would think that in run up to Kwanzaa, Christmas, Hanukkah, and Boxing Day folks would chill.

Ah, Agility!

The first dust up concerns tag lines. At issue is the word “agile”, which is becoming one of the more popular terms. I was in a meeting at which a heated discussion broke out about whose search and content processing system is agile. Endeca claims agility. I am not going to dispute that a 13 or 14 year old system is agile, but in Internet years, there may be some flexibility lost. Run a query for “agile” and “search” and you get a hit to a recruitment firm, a marketing outfit, and something called the Tamilan Search Engine. I also spotted PolySpot, a French infrastructure, solutions, and applications company. The problem is that words are slippery. What are the synonyms for “agile”? I expect to see some of these turning up in 2012. How about gazelle search or spry search?

In tough economic times, financial pressures can distort business methods.

Circular Partnerships: Snakes Eating Their Tails

The second dust up concerns partnerships. I have been looking through the list of partners identified by such companies as Microsoft, WAND, and others. What I have discovered is that most of the partners are either household names like IBM or companies I have never heard of. Furthermore, when I dig into the partners’ names unfamiliar to me, I discover companies which are consulting firms or resellers who offer a roster of “stuff.” I understand the importance of amplifying a sales force. A partnership plan is little more than a way to reduce the cost of getting a lead and making a sales call. One of the experts in this game is the struggling giant Thomson Reuters. The company signs up partners when sales flag. In the taxonomy game, the partnerships have another twist. The linkages are circular. Antidot or Mondeca points to partners and partners point to other search and content processing vendors which point to the original company. I find this confusing because “partner plays” are gaining momentum among specialist firms. I think the “partner” card is an indication that a search and content processing firm may be beating the bushes to get revenue. Just my opinion, of course.

Today, everything is for sale. Be wary if a pitch sounds too good to be true. Image source: http://asksistermarymartha.blogspot.com/2009_10_01_archive.html

Pitching Automation No Matter the Consequences

The third dust up involves taxonomies and is related to the circular nature of partnerships and financial pressures. Now there is considerable contention in the market with regard to taxonomies. The word “taxonomy” itself is a shuttlecock with software badminton players swinging with abandon. The idea is simple: A hierarchical word list. But the idea now comes with hot new spins like ontology (not to be confused with the branch of metaphysics that deals with the nature of being), metatagging, and categorization.

On one side of the dictionary are those who want the software to discover the concepts, terms, and bound phrases. Then these terms are automatically assigned to content processed by the system. If this sounds like the Bayesian magic associated with Hewlett Packard Autonomy or Recommind, you are on the money. There are alternative approaches which have considerable payoff. A good example is the work done by Tim Estes and his team at Digital Reasoning, a firm which received financial goodness from SilverLake Sumeru. The idea is that humans play either a modest role or no role at all. Because of the volume of data flowing through a system, human intermediated systems struggle to keep pace with the fluidity of human discourse. On one side, therefore, automation. For simplicity’s sake, let’s call this the Google approach.

On the other side of the dictionary are those who see humans with subject matter expertise playing an important role. The idea, which seems quaint to many of the self appointed experts and azure chip consultants, is that human beings can set up a conceptual scheme, populate it with words, terms, and bound phrases. Thus, armed with a controlled term list, a system can use those terms to index or tag content. The idea has merit because the American National Standards Institute has spelled out guidelines for controlled term lists.

Here’s how the battle shapes up. On one side is the “we don’t need any humans” crowd. In my opinion, some enthusiasts for this no-humans position are TEMIS, Google, and in some cases Autonomy. Many of the automated indexing and tagging systems work quite well when the corpus of content is tightly bounded. What do I mean by “tightly bounded?” Pick up a hard copy of a medical journal about cancer or about nuclear engineering. The vocabulary does not vary too much from article to article within each topic area. In fact, once you learn about 2,000 nuclear terms, you can figure out the basic idea of most nuclear power write ups.

Are some search and content processing vendors taking notice of sales methods associated with used car sales professionals? Even Google is advertising on the “vast wasteland”. Image source: http://www.townhillautosales.com/?24

What happens when you process unbounded content? Well, real life language use is more tricky. Non experts simplify complex ideas, often importing non specialist terms for arcane jargon. Do you know what an ECCS is? Probably not. A “real” journalist or consultant will convert the notion of an emergency core cooling system to something along the lines of a “spare radiator.” Not exactly on the money, but indicative of how precise language is softened. In these situations, it is useful to have a term list of the specialist words, terms, and bound phrases. Subcategories under Cooling Systems can contain the ECCS entry and others. The idea is that content can be assigned certain terms no matter what the words and phrases in the source document may be.
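A controlled term list of the kind just described can be sketched as a small lookup table. The entry below, its variant phrases, and the function name `assign_terms` are hypothetical and exist only to show the idea; a real term list would be built and maintained by subject matter experts.

```python
# Hypothetical controlled term list: each preferred term records its
# broader category and the variant phrases an expert has mapped to it.
TERM_LIST = {
    "ECCS": {
        "broader": "Cooling Systems",
        "variants": [
            "eccs",
            "emergency core cooling system",
            "spare radiator",
        ],
    },
}


def assign_terms(text):
    """Return (preferred term, broader term) pairs for any variant found.

    Matching is a plain case-insensitive substring test, which is an
    assumption made to keep the sketch short.
    """
    lowered = text.lower()
    assigned = []
    for preferred, entry in TERM_LIST.items():
        if any(variant in lowered for variant in entry["variants"]):
            assigned.append((preferred, entry["broader"]))
    return assigned


if __name__ == "__main__":
    print(assign_terms("The plant's spare radiator kicked in on time."))
```

The point of the sketch is that the document gets the ECCS term even though the acronym never appears in the text. When a journalist coins a new simplification, a human has to add it to the variants, which is where the subject matter expert earns a paycheck.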

Some companies like TEMIS, Google, and Yandex are not too keen on the human involvement. The reasons range from the cost of getting humans to do index and taxonomy development to an arrogance about how software performs. Wizards see the world in terms of their wizardry, which is okay with me. I think it is silly to assume software can handle language with the facility of humans, but I have some experience with what happens when “good enough” is not.

Other companies like Access Innovations (a former client from days of yore) and (believe it or not) Dow Jones (a component of the exciting Murdoch organization) believe that humans are important. The humans can develop the lists, set up guidelines or rules for the indexing system to consult, and provide interfaces to allow subject matter experts to adjust the term list and tune the indexing system. The benefit is that the accuracy of the indexing, based on my real life experience, is much better. There is language drift, but there are methods to intervene and correct that drift.

Without a method to adjust to what software is too stupid to see, the indexing “drifts”. The impact of this is not too good. You run a query for a particular snake bite treatment, and you cannot locate the content. The term you use is not assigned by the system and it does not appear in the source document. So what? Well, how about your child dies. Maybe this is an unpleasant thought, but the consequences of lousy indexing and concept assignment are often more serious than not finding a pizza joint in San Jose.

Here’s what one indexing professional told me. I have to mask the name and company to avoid a hassle, but you will get the idea from this comment I captured:

Some companies such as a certain Paris-based company sell expensive software to clients and then leave. People don’t know what to do with it.  So they have an expensive difficult to implement natural language processing systems which could work but are left hanging.  The package from us is the whole thing we are big on total service, follow up training, and getting people implemented and using it without our help but we are there – just a phone call or email away to help and support them. The Paris based company says companies like Access Innovations are not a natural language processing system and although we do have the natural language processing  we don’t make people pay for it separately. With most systems, rules are often needed to achieve more than “good enough” tagging.  Access Innovations, a specialist able to generate ANSI compliant term lists, delivers 85 to 90 percent accuracy. The Paris-based outfit delivers far lower accuracy. Clients don’t understand the issues with low accuracy tagging, findability, and long term system usability.

So What?

What we have, gentle reader, is an example of the automation crowd glossing over the need for human-intermediation solutions. What disturbs me is that the chatter about taxonomy in boot camps, companies which are coming from left field, and self appointed experts is putting the spotlight on indexing and classifying content.

That’s a plus.

The downside is that when the indexing goes off the rails, the user may not be able to find the needed information. That’s why companies like Digital Reasoning and Access Innovations have the ability to deliver automation plus human-intermediated interactions. The licensee suffers when automation goes wrong. The users suffer. The search system vendor may be blamed. Beware the taxonomy vendor spouting glittering generalities about smart software. Usually the “spout” dispenses tainted outputs.

Bottom line: I avoid vendors who present to me the “one true way.” This approach may work when preparing foie gras. For some taxonomy vendors hungry for cash, the traditional, labor intensive methods get in the way of making a quick sale. Unfortunately when humans create language, more traditional methods are often completely appropriate for mission critical indexing tasks. Honk!

Stephen E Arnold, December 12, 2011

BA Insight Interview

December 11, 2011

Short honk: We overlooked a new interview with Guy Mounier, BA-Insight. If you track the vendors who provide components to extend and enhance Microsoft SharePoint, you may find the interview with BA Insight interesting.

You can find the interview at this link. The interview carries the date of September 27, 2011. Our error. At age 67, I lose my pen several times a day.

Stephen E Arnold, December 11, 2011