The Failure of Search: Let Many Flowers Bloom and… Die Alone and Sad

November 1, 2022

I read “Taxonomy is Hard.” No argument from me. Yesterday (October 31, 2022) I spoke with a long time colleague and friend. Our conversations usually include some discussion about the loss of the expertise embodied in the early commercial database firms. The old frameworks, work processes, and shared beliefs among the top 15 or 20 for fee online database companies seem to have scattered and recycled in a quantum crazy digital world. We did not mention Google once, but we could have. My colleague and I agreed on several points:

  • Those who want to make digital information must have an informing editorial policy; that is, what’s the content space, what’s included, what’s excluded, and what problem does the commercial database solve
  • Finding information today is more difficult than it has been at any time in our two professional lives. We don’t know if the data are current and accurate (for example, whether online corrections appear when publications issue fixes) or whether the data fit within an editorial policy, if there is one. Where there is no policy, the content is shaped by the invisible hand of politics, advertising, and indifference to intellectual nuances. In some services, “old” data are disappeared, presumably due to the cost of maintaining the content, updating it if that is actually done, and working out how to make in depth queries work within available time and budget constraints
  • The steady erosion of precision and recall as reliable yardsticks for determining what a search system can find within a specific body of content
  • Professional indexing and content curation are being compressed or ignored by many firms. The process is expensive, time consuming, and intellectually difficult.
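
For readers who have not worked with the two yardsticks named in the list above, here is a minimal Python sketch of how precision and recall are computed for a single query. The function name and the document IDs are hypothetical and exist only for illustration; the "relevant" set assumes a subject matter expert has already judged the collection.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document IDs the search system returned
    relevant:  document IDs an expert judged relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents the system actually found
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, for illustration only.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7", "d8", "d9"]
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Both numbers depend on knowing what is in the body of content and what counts as relevant, which is exactly the knowledge an editorial policy and professional indexing supply.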

The cited article reflects some of these issues. However, the mirror is shaped by the systems and methods in use today. The approaches pivot on metadata (index terms) and tagging (more indexing). The approach is understandable. The shift to technology which slashes the need for subject matter experts, manual methods, meetings about specific terms or categories, and the other impedimenta is the new normal.

A couple of observations:

  1. The problems of social media boil down to editorial policies. Without these guard rails and the specialists needed to maintain them, finding specific items of information on widely used platforms like Facebook, TikTok, or Twitter, among others is difficult
  2. The challenges of processing video are enormous. The obvious fix is to gate the volume and implement specific editorial guidelines before content is made available to a user. Skipping this basic work task leads to the craziness evident in many services today
  3. Indexing can be supplemented by smart software. However, that smart software can drift off course, so specialists have to intervene and recalibrate the system.
  4. Semantic, statistical, or behavior centric methods for identifying and suggesting possibly relevant content require the same expert centric approach. There is no free lunch in automated indexing, even for narrow vocabulary technical fields like nuclear physics or engineered materials. What smart software knows how to deal with new breakthroughs in physics which emerge from the study of inter cell behavior among proteins in the human brain?

Net net: Is it time to re-evaluate some discarded systems and methods? Is it time to accept the fact that technology cannot solve in isolation certain problems? Is it time to recognize that close enough for horseshoes and good enough are not appropriate when it comes to knowledge centric activities? Search engines die when the information garden cannot support the buds and shoots of finding useful information the user seeks.

Stephen E Arnold, November 1, 2022

The Rigors of College: Carpal Tunnel Thumb

May 27, 2021

I don’t think I would be able to graduate from a university today. I am not equipped with the wetware necessary to perform in the current academic environment. Okay, I skipped some grades. I have degrees. I am one of those individuals who prefer paper, pencils, notebooks, and motivated specialists to present information in person. Then I like to ask questions and participate in study groups with fellow students. I am not into Zoom and video games.

“College Credit for Playing Video Games? At Some California Campuses, It’s Happening” reminded me that I am not shaped for today’s digital world. I noted this passage:

“Higher ed needs to evolve or die,” said Dina Ibrahim, the academic advisor of the SF State esports athletic club and a professor of broadcast journalism. “We need to be teaching students relevant skills, that’s going to get them jobs in a rapidly changing landscape.”

That landscape means that esports, games, and unconferences are hot.

The article points out:

Those skills could help students land their first media jobs, said Mark “Garvey” Candella, director of student and education programs for Twitch, a $15 billion company that draws 30 million, mostly younger, visitors to its website daily. Amazon Inc. bought Twitch in 2014 for $970 million. The company makes money by showing ads to viewers, selling subscriptions, and taking a cut of any money viewers donate to streamers. “All the skills that you’re learning and using while you participate in gaming and esports are highly transferable and valuable skills in emerging new and digital media…”

No doubt.

Math, science, language study, and logic: no jobs in those fields, I assume.

I will stay home and wait for government checks. I will earn an F in egames and avoid carpal tunnel wrist braces.

Stephen E Arnold, May 27, 2021

Update for TemaTres, a Taxonomy Tool

March 25, 2020

In order to create and maintain a Web site, database, or other information source, a powerful knowledge management application is needed. There are numerous proprietary knowledge management packages on the market, but the problems are often the price tag and the fact that the solutions do not work out of the box. Open source software is the best way to save money and tailor a knowledge management application to your specifications. The question remains: what open source knowledge management software should you download?

One of the top knowledge management packages available via open source is TemaTres. TemaTres is described as a:

“Web application for management formal representations of knowledge, thesauri, taxonomies and multilingual vocabularies.”

TemaTres allows users to manage, publish, and share ontologies, taxonomies, thesauri, and glossaries. TemaTres includes numerous features that are designed for the best taxonomy development experience. Among these features are: MARC21 XML Schema, search function, keyword suggestions, user management, multilingual interface, scope notes, relationship visualizations, term reports, terminology mapping, unique code for each term, free terms control, vocabulary harmonization features, no limits on delimiters, integration into web tools, and more.

TemaTres requires programming knowledge to make it functional. Data governance is an important part of knowledge management and it gives editorial control over content. It is an underrated, but valuable tool.

Whitney Grace, March 25, 2020

Glossary Shrinks Data Science to 27 Concepts

August 6, 2015

I am not exactly sure how mathematics morphed into Big Data and then into data science. The evolutionary spark is powerful indeed. I came across a Glossary for Data Science. I am interested in term lists. For most fancy math I rely on one of my printed references or a Web site like Mathway.com. I was interested in the MapR Blog’s word list. The field of data science is boiled down to 27 terms (concepts). These are useful, but I will continue to use my copy of the Oxford Concise Dictionary of Mathematics. Brief lists of terms are not as useful as more comprehensive compilations. We are finalizing the Glossary for the forthcoming Dark Web Basics study. The term list has more than a couple dozen entries too.

Stephen E Arnold, August 6, 2015

Semantic Search: The View from a Taxonomy Consultant

May 9, 2015

My team and I are working on a new project. With our Overflight system, we have an archive of memorable and not so memorable factoids about search and content processing. One of the goslings who was actually working yesterday asked me, “Do you recall this presentation?”

The presentation was “Implementing Semantic Search in the Enterprise,” created in 2009, which works out to six years ago. I did not recall the presentation. But the title evoked an image in my mind like this:

image

I asked, “How is this germane to our present project?”

The reply the gosling quacked was, “Semantic search means taxonomy.” The gosling enjoined me to examine this impressive looking diagram:

image

Okay.

I don’t want a document. I don’t want formatted content. I don’t want unformatted content. I want on point results I can use. To illustrate the gap between dumping a document on my lap and presenting something useful, look at this visualization from Geofeedia:

image

The idea is that a person can draw a shape on a map, see the real time content flowing via mobile devices, and look at a particular object. There are search tools and other utilities. The user of this Geofeedia technology examines information in a manner that does not produce a document to read. Sure, a user can read a tweet, but the focus is on understanding information, regardless of type, in a particular context in real time. There is a classification system operating in the plumbing of this system, but the key point is the functionality, not the fact that a consulting firm specializing in taxonomies is making a taxonomy the Alpha and the Omega of an information access system.

The deck starts with the premise that semantic search pivots on a taxonomy. The idea is that a “categorization scheme” makes it possible to index a document even though the words in the document may not be the words in the taxonomy.

image

For me, the slide deck’s argument was off kilter. The mixing up of a term list and semantic search is the evidence of a Rube Goldberg approach to a quite important task: Accessing needed information in a useful, actionable way. Frankly, I think that dumping buzzwords into slide decks creates more confusion when focus and accuracy are essential.

At lunch the goslings and I flipped through the PowerPoint deck which is available via LinkedIn Slideshare. You may have to register to view the PowerPoint deck. I am never clear about what is viewable, what’s downloadable, and what’s on Slideshare. LinkedIn has its real estate, publishing, and personnel businesses to which to attend, so search and retrieval is obviously not a priority. The entire experience was superficially amusing but on a more profound level quite disturbing. No wonder enterprise search implementations careen in a swamp of cost overruns and angry users.

Now creating taxonomies or what I call controlled term lists can be a darned exciting process. If one goes the human route, there are discussions about what term maps to what word or phrase. Think buzz group and discussion group and online collaboration. What terms go with what other terms? In the good old days, these term lists were crafted by subject matter and indexing specialists. For example, the guts of the ABI/INFORM classification coding terms originated in the 1981-1982 period and were the product of more than 14 individuals, one advisor (the now deceased Betty Eddison), and the begrudging assistance of the Courier Journal’s information technology department, which performed analyses of the index terms and key words in the ABI/INFORM database. The classification system was reasonably good, and it was licensed by the Royal Bank of Canada, IBM, and some other savvy outfits for their own indexing projects.

As you might know, investing two years in human and some machine inputs was an expensive proposition. It was the initial step in the reindexing of the ABI/INFORM database, which at the time was one of the go to sources of high value business and management information culled from more than 800 publications worldwide.

The only problem I have with the slide deck’s making a taxonomy a key concept is that one cannot craft a taxonomy without knowing what one is indexing. For example, you have a flow of content through and into an organization. In a business engaged in the manufacture of laboratory equipment, there will be a wide range of information. There will be unstructured information like Word documents prepared by wild eyed marketing associates. There will be legal documents artfully copied and pasted together from boiler plate. There will be images of the products themselves. There will be databases containing the names of customers, prospects, suppliers, and consultants. There will be information that employees download from the Internet or tote into the organization on a storage device.

The key concept of a taxonomy has to be anchored in reality, not an external term list like those which used to be provided by Oracle for certain vertical markets. In short, the time and cost of processing these items of information so that confidentiality is not breached is likely to make the organization’s accountant sit up and take notice.

Today many vendors assert that their systems can intelligently, automatically, and rapidly develop a taxonomy for an organization. I suggest you read the fine print. Even the whizziest taxonomy generator is going to require some baby sitting. To get a sense of what is required, track down an experienced licensee of the Autonomy IDOL system. There is a training period which requires a cohesive corpus of representative source material. Sorry, no images or videos accepted but the existing image and video metadata can be processed. Once the system is trained, then it is run against a test set of content. The results are examined by a human who knows what he or she is doing, and then the system is tuned. After the smart system runs for a few days, the human inspects and calibrates. The idea is that as content flows through the system and periodic tweaks are made, the system becomes smarter. In reality, indexing drift creeps in. In effect, the smart software never strays too far from the human subject matter experts riding herd on algorithms.
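
The inspect and calibrate step can be made concrete with a small Python sketch. This is not how Autonomy IDOL or any other product works; it is one assumed way a reviewer might quantify indexing drift by comparing machine assigned terms with an expert’s terms on a sample of documents. The function names, the sample terms, and the 0.6 threshold are all hypothetical.

```python
def jaccard(a, b):
    """Overlap between two sets of index terms, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def review_sample(machine_terms, expert_terms, threshold=0.6):
    """Compare machine indexing with an expert's indexing for a sample of documents.

    Returns the mean agreement and the document IDs whose agreement falls
    below the (hypothetical) threshold and therefore need human attention.
    """
    scores = {
        doc: jaccard(machine_terms.get(doc, ()), terms)
        for doc, terms in expert_terms.items()
    }
    mean = sum(scores.values()) / len(scores) if scores else 1.0
    flagged = sorted(doc for doc, score in scores.items() if score < threshold)
    return mean, flagged


# Hypothetical sample from a lab equipment manufacturer's content flow.
expert = {
    "doc1": {"centrifuge", "calibration"},
    "doc2": {"pipette", "warranty"},
}
machine = {
    "doc1": {"centrifuge", "calibration"},
    "doc2": {"pipette", "marketing", "pricing"},
}
mean_agreement, needs_review = review_sample(machine, expert)
print(mean_agreement, needs_review)
```

Running a check like this on each new sample gives the human expert a number to watch. A falling mean is a signal that the system needs recalibration, and the expert sample itself still has to be produced by someone who knows the content.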

The problem exists even when there is a relatively stable core of technical terminology. The content problem of a lab gear manufacturer is many times greater than the problem of a company focusing on a specific branch of engineering, science, technology, or medicine. Indexing Halliburton nuclear energy information is trivial when compared to indexing more generalized business content like that found in ABI/INFORM or the typical services organization today.

I agree that a controlled term list is important. One cannot easily resolve entities unless there is a combination of automated processes and look up lists. An example is figuring out if a reference to I.B.M., Big Blue, or Armonk is a reference to the much loved marketers of Watson. Now handle a transliterated name like Anwar al-Awlaki and its variants. This type of indexing is quite important. Get it wrong and one cannot find information germane to a query. When one is investigating aliases used by bad actors, an error can become a bad day for some folks.
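
A minimal Python sketch of the “automated processes and look up lists” combination follows. The alias table is a tiny, hand built, hypothetical look up list; the spelling variants and the 0.85 similarity cutoff are illustrative assumptions, not a vetted authority file. The fuzzy step uses the standard library’s difflib as a stand in for whatever matching method a production system would use.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical look up list: canonical entity -> known aliases and variants.
ALIASES = {
    "IBM": ["I.B.M.", "Big Blue", "Armonk"],
    "Anwar al-Awlaki": ["Anwar al-Aulaqi"],
}


def normalize(name):
    """Lowercase, strip accents, and drop punctuation so variants compare cleanly."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()


# Index every canonical name and alias by its normalized form.
INDEX = {}
for canonical, variants in ALIASES.items():
    for form in [canonical, *variants]:
        INDEX[normalize(form)] = canonical


def resolve(mention, cutoff=0.85):
    """Return the canonical entity for a mention, or None if nothing is close enough."""
    key = normalize(mention)
    if key in INDEX:  # exact hit in the look up list
        return INDEX[key]
    # Otherwise fall back to approximate matching for unseen spelling variants.
    best_form = max(INDEX, key=lambda form: SequenceMatcher(None, key, form).ratio())
    if SequenceMatcher(None, key, best_form).ratio() >= cutoff:
        return INDEX[best_form]
    return None


print(resolve("I.B.M."))           # found via the look up list
print(resolve("Anwar al-Awlaqi"))  # not in the list; matched approximately
print(resolve("Oracle"))           # no match
```

The cutoff is where the human judgment hides. Set it too low and unrelated names merge; set it too high and a new transliteration slips past, which is the bad day scenario.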

The remainder of the slide deck rides the taxonomy pony into the sunset. When one looks at the information created 72 months ago, it is easy for me to understand why enterprise search and content processing has become an “oh, my goodness” problem in many organizations. I think that a mid sized company would grind to a halt if it needed a controlled vocabulary which matched today’s content flows.

My take away from the slide deck is easy to summarize: Putting the cart before the horse won’t get enterprise search where it must go to retain credibility and deliver utility.

Stephen E Arnold, May 9, 2015

Taxonomy Turmoil: Good Enough May Be Too Much

February 28, 2015

For years, I have posted a public indexing Overflight. You can examine the selected outputs at this Overflight link. (My non public system is more robust, but the public service is a useful temperature gauge for a slice of the content processing sector.)

When it comes to indexing, most vendors provide keyword, concept tagging, and entity extraction. But are these tags spot on? No, most are good enough.

image

A happy quack to Jackson Taylor for this “good enough” cartoon. The salesman makes it clear that good enough is indeed good enough in today’s marketing enabled world.

I chose about 50 companies that asserted their systems performed some type of indexing or taxonomy function. I learned that the taxonomy business is “about to explode.” I find that to be either an interesting investment tip or a statement that is characteristic of content processing optimists.

Like search and retrieval, plugging in “concepts” or other index terms is a utility function. For example, if one indexes each word in an article appearing in this blog, the article might be about another subject. For example, in this post, I am talking about Overflight, but the real topic is the broader use of metadata in information retrieval systems. I could assign the term “faceted navigation” to this article as a way to mark this article as germane to point and click navigation systems.

If you examine the “reports” Overflight outputs for each of the companies, you will discover several interesting things as I did on February 28, 2015 when I assembled this short article.

  1. Mergers and purchases of failed vendors at fire sale prices are taking place. Examples include Lucidea’s purchase of Cuadra and InMagic. Both of these firms are anchored in traditional indexing methods and seemed to be within a revenue envelope until their sell out. Business Objects acquired Inxight and then SAP acquired Business Objects. Bouvet acquired Ontopia. Teradata acquired Revelytix.
  2. Moving indexing into open source. Thomson Reuters acquired ClearForest and made most of the technology available as OpenCalais. OpenText, a rollup outfit, acquired Nstein. SAS acquired Teragram. Smartlogic acquired Schemalogic. (A free report about Schemalogic is available at www.xenky.com/vendor-profiles.)
  3. A number of companies just failed, shut down, or went quiet. These include Active Classification, Arikus, Arity, Forth ICA, MaxThink, Millennium Engineering, Navigo, Progris, Protege, punkt.net, Questans, Quiver, Reuse Company, Sandpiper,
  4. The indexing sector includes a number of companies my non public system monitors; for example, the little known Data Harmony with six figure revenues after decades of selling really hard to traditional publishers. Conclusion: Indexing is a tough business to keep afloat.

There are numerous vendors who assert their systems perform indexing, entity, and metadata extraction. More than 18 of these companies are profiled in CyberOSINT, my new monograph. Oracle owns Triple Hop, RightNow, and Endeca. Each of these acquired companies performs indexing and metadata operations. Even the mashed potatoes search solution from Microsoft includes indexing tools. The proprietary XML data management vendor MarkLogic asserts that it performs indexing operations on content stored in its repository. Conclusion: More cyber oriented firms are likely to capture the juicy deals.

So what’s going on in the world of taxonomies? Several observations strike me as warranted:

First, none of the taxonomy vendors are huge outfits. I suppose one could argue that IBM’s Lucene based system is a billion dollar baby, but that’s marketing peyote, not reality. Perhaps MarkLogic which is struggling toward $100 million in revenue is the largest of this group. But the majority of the companies in the indexing business are small. Think in terms of a few hundred thousand in annual revenue to $10 million with generous accounting assumptions.

What’s clear to me is that indexing, like search, is a utility function. If a good enough search system delivers good enough indexing, then why spend for humans to slog through the content and make human judgments? Why not let Google funded Recorded Future identify entities, assign geo codes, and extract meaningful signals? Why not rely on Haystax or RedOwl or any one of the more agile firms to deliver higher value operations?

I would assert that taxonomies and indexing are important to those who desire the accuracy of a human indexed system. This assumes that the humans are subject matter specialists, the humans are not fatigued, and the humans can keep pace with the flow of changed and new content.

The reality is that companies focused on delivering old school solutions to today’s problems are likely to lose contracts to companies that deliver what the customer perceives as a higher value content processing solution.

What can a taxonomy company do to ignite its engines of growth? Based on the research we performed for CyberOSINT, the future belongs to those who embrace automated collection, analysis, and output methods. Users may, if the user so chooses, provide guidance to the system. But the days of yore, when monks with varying degrees of accuracy created catalog sheets for the scriptoria have been washed to the margin of the data stream by today’s content flows.

What’s this mean for the folks who continue to pump money into taxonomy centric companies? Unless the cyber OSINT drum beat is heeded, the failure rate of the Overflight sample is a wake up call.

Buying Apple bonds might be a more prudent financial choice. On the other hand, there is an opportunity for taxonomy executives to become “experts” in content processing.

Stephen E Arnold, February 28, 2015

DBpedia Makes Wikipedia Part Of The Semantic Web

November 21, 2014

SemanticWeb.com posted an article called “Retrieving And Using Taxonomy Data From DBpedia” with an interesting introduction. It explains that DBpedia is a crowd-sourced Internet community whose entire goal is to extract structured information from Wikipedia and share it. The introduction continues that DBpedia already has over three billion facts expressed in the W3C standard RDF data model and ready for application use.

The taxonomy data are expressed using the SKOS vocabulary, a W3C standard used by the New York Times, the Library of Congress, and other organizations for their own taxonomies and subject headings. Users can extract the data and use it in their own RDF applications with the goal of giving their data more value.

DBpedia is doing a wonderful service for users so they do not have to rely on proprietary software to deliver them rich taxonomies. The taxonomies can be retrieved under the open source community bylaws and gain instant improvement for content. There is one caveat:

“Remember that, for better or worse, the data is based on Wikipedia data. If you extend the structure of the query above to retrieve lower, more specific levels of horror film categories, you’d probably find the work of film scholars who’ve done serious research as well as the work of nutty people who are a little too into their favorite subgenres.”

Remember Wikipedia is a good reference tool to gain an understanding of a topic, but you still need to check more verifiable resources for hard facts.
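
For a sense of what retrieving SKOS taxonomy data from DBpedia can look like, here is a minimal Python sketch. It is not the query from the cited article. It assumes DBpedia’s public SPARQL endpoint is reachable at the address shown and that category resources carry skos:broader and skos:prefLabel properties. The function names are hypothetical, and the sketch only builds the query and the request URL without calling the service.

```python
from urllib.parse import urlencode

# Assumed public endpoint; availability and response format are not guaranteed.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def narrower_categories_query(category, limit=25):
    """Build a SPARQL query for the categories one level below `category`."""
    return f"""
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?narrower ?label WHERE {{
  ?narrower skos:broader <http://dbpedia.org/resource/Category:{category}> ;
            skos:prefLabel ?label .
}} LIMIT {int(limit)}
""".strip()


def request_url(query):
    """Encode the query as a GET request URL asking for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return DBPEDIA_ENDPOINT + "?" + urlencode(params)


query = narrower_categories_query("Horror_films", limit=10)
print(query)
print(request_url(query))
```

Whatever comes back is only as good as the Wikipedia category structure behind it, which is the caveat the article raises.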

Whitney Grace, November 21, 2014
Sponsored by ArnoldIT.com, developer of Augmentext

Wave Your WAND for a New Taxonomy Portal

May 2, 2014

If a library is in need of a taxonomy, most of the time all it needs to do is wave a magic wand and its taxonomy wish is granted. Actually, it becomes a client of WAND, Inc., the world’s leading taxonomy provider. According to the WAND Inc. blog, the company has launched a new endeavor: “WAND Announces Launch Of New Taxonomy Portal.” The WAND Taxonomy Library Portal helps companies develop a taxonomy strategy that is integral to enterprise management strategy.

“According to Mark Leher, WAND’s COO, ‘The amount of unstructured information and data inside organizations continues to explode. Companies need a taxonomy strategy to organize information and make it easily accessible to enterprise information workers. The WAND Taxonomy Library Portal is a valuable resource that provides the foundation for a corporate taxonomy strategy.’”

WAND Taxonomy Library Portal subscribers receive access to all of WAND’s taxonomies. They cover a range of topics, including insurance, medical equipment and supplies, travel, personal care, human resources, and many more. The portal is designed to help companies get the highest return on investment from management applications.

“Leher continued, ‘What most people don’t realize is that there are more than 150 common enterprise information management applications that are designed to leverage taxonomy. We estimate that most large organizations have already invested in 10-20 of those applications. At WAND, our goal is to provide taxonomies that make those applications more effective and increase the return on investment.’”

Taxonomies are lists of terms. It is hard to imagine that term lists are an integral part of using management applications, but they are important to identifying content and building a reference framework.

Whitney Grace, May 02, 2014

Taxonomy Round-Up Includes Variety of Taxonomy Discussions

January 27, 2014

The Synaptica Central article “End of 2013 Round Up of Taxonomy Blogs, Part 1” is exactly what it sounds like: an end of the year look at taxonomy through articles, blog posts, videos, and more, gathered from such diverse areas of the internet as Pinterest, Twitter, StackExchange, and YouTube. The list also includes “Taxonomy as it relates to Drupal” and posts on augmenting a taxonomy.

The author explains:

“It’s that time of year, folks, when it seems like everyone is publishing some kind of year end list or “round up” of the years news highlights, blogs and blog posts, photos, videos, and more. So why not do the same here at Synaptica? To keep things manageable, listed here are some 2013 blogs and blog entries, culled from almost a thousand, that address taxonomy in one way or another.”

Overall the article presents an interesting list of important taxonomy blog entries. Many touch on Bloom, such as the article How to Use Pinterest with Bloom’s Taxonomy Infographic and Revised Bloom’s Taxonomy and the Need for Higher Order Thinking. Another helpful highlight is Twitter Aligned with Bloom’s Taxonomy for Your Students. Whether you are looking for guidance in implementing or augmenting a taxonomy, or are just interested in a dialogue on the subject, this list is certainly a perfect starting point and should direct you toward discussions and communities galore.

Chelsea Kerwin, January 27, 2014


Another Content Management Company Another Day

August 12, 2013

Content management companies are springing up and gaining attention due to the Big Data boom. One of the companies that our content wranglers pulled out of an Internet search is Applied Relevance. The company specializes in several aspects of the content management spectrum, but its Web site prominently promotes its taxonomy services. Applied Relevance offers the AR-Classifier tagging engine, which can run on a variety of platforms. AR-Semantics is its flagship organization and categorization software, AR-Taxonomy is the tool for editing and managing taxonomies, and AR-Navigator is available for searching them.

All this talk about Applied Relevance’s taxonomy software is informative, but what is interesting is the company’s description on the main page:

“Applied Relevance produces software and services to help enterprise users find the information they need. Our solutions augment traditional search engines by providing context for the search results. The AR toolset and our partners provide cost effective technology for the full spectrum of enterprise content management and search applications. With our tools, a search term and a few clicks, users can zero-in past ambiguities and come up with the right answer in the right context. Applied Relevance is located on the west coast of the east coast of North America.”

Descriptive, but not a word on taxonomy or what exactly the company does. The tagline at the end about Applied Relevance’s location is even more ambiguous.

Whitney Grace, August 12, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search
