Glossary Shrinks Data Science to 27 Concepts

August 6, 2015

I am not exactly sure how mathematics morphed into Big Data and then into data science. The evolutionary spark is powerful indeed. I came across a Glossary for Data Science. I am interested in term lists. For most fancy math I rely on one of my printed references or a Web site like Mathway.com. I was interested in the MapR Blog’s word list. The field of data science is boiled down to 27 terms (concepts). These are useful, but I will continue to use my copy of the Oxford Concise Dictionary of Mathematics. Brief lists of terms are not as useful as more comprehensive compilations. We are finalizing the Glossary for the forthcoming Dark Web Basics study. The term list has more than a couple dozen entries too.

Stephen E Arnold, August 6, 2015

Semantic Search: The View from a Taxonomy Consultant

May 9, 2015

My team and I are working on a new project. With our Overflight system, we have an archive of memorable and not so memorable factoids about search and content processing. One of the goslings who was actually working yesterday asked me, “Do you recall this presentation?”

The presentation was “Implementing Semantic Search in the Enterprise,” created in 2009, which works out to six years ago. I did not recall the presentation. But the title evoked an image in my mind like this:

image

I asked, “How is this germane to our present project?”

The reply the gosling quacked was, “Semantic search means taxonomy.” The gosling enjoined me to examine this impressive looking diagram:

image

Okay.

I don’t want a document. I don’t want formatted content. I don’t want unformatted content. I want on point results I can use. To illustrate the gap between dumping a document on my lap and presenting something useful, look at this visualization from Geofeedia:

image

The idea is that a person can draw a shape on a map, see the real time content flowing via mobile devices, and look at a particular object. There are search tools and other utilities. The user of this Geofeedia technology examines information in a manner that does not produce a document to read. Sure, a user can read a tweet, but the focus is on understanding information, regardless of type, in a particular context in real time. There is a classification system operating in the plumbing of this system, but the key point is the functionality, not the fact that a consulting firm specializing in taxonomies is making a taxonomy the Alpha and the Omega of an information access system.

The deck starts with the premise that semantic search pivots on a taxonomy. The idea is that a “categorization scheme” makes it possible to index a document even though the words in the document may not be the words in the taxonomy.

image

For me, the slide deck’s argument was off kilter. The mixing up of a term list and semantic search is the evidence of a Rube Goldberg approach to a quite important task: Accessing needed information in a useful, actionable way. Frankly, I think that dumping buzzwords into slide decks creates more confusion when focus and accuracy are essential.

At lunch the goslings and I flipped through the PowerPoint deck which is available via LinkedIn Slideshare. You may have to register to view the PowerPoint deck. I am never clear about what is viewable, what’s downloadable, and what’s on Slideshare. LinkedIn has its real estate, publishing, and personnel businesses to which to attend, so search and retrieval is obviously not a priority. The entire experience was superficially amusing but on a more profound level quite disturbing. No wonder enterprise search implementations careen in a swamp of cost overruns and angry users.

Now creating taxonomies or what I call controlled term lists can be a darned exciting process. If one goes the human route, there are discussions about what term maps to what word or phrase. Think buzz group and discussion group and online collaboration. What terms go with what other terms? In the good old days, these term lists were crafted by subject matter and indexing specialists. For example, the guts of the ABI/INFORM classification coding terms originated in the 1981-1982 period and were the product of more than 14 individuals, one advisor (the now deceased Betty Eddison), and the begrudging assistance of the Courier Journal’s information technology department which performed analyses of the index terms and key words in the ABI/INFORM database. The classification system was reasonably good, and it was licensed by the Royal Bank of Canada, IBM, and some other savvy outfits for their own indexing projects.

As you might know, investing two years in human and some machine inputs was an expensive proposition. It was the initial step in the reindexing of the ABI/INFORM database, which at the time was one of the go to sources of high value business and management information culled from more than 800 publications worldwide.

The only problem I have with the slide deck’s making a taxonomy a key concept is that one cannot craft a taxonomy without knowing what one is indexing. For example, you have a flow of content through and into an organization. In a business engaged in the manufacture of laboratory equipment, there will be a wide range of information. There will be unstructured information like Word documents prepared by wild eyed marketing associates. There will be legal documents artfully copied and pasted together from boiler plate. There will be images of the products themselves. There will be databases containing the names of customers, prospects, suppliers, and consultants. There will be information that employees download from the Internet or tote into the organization on a storage device.

The key concept of a taxonomy has to be anchored in reality, not an external term list like those which used to be provided by Oracle for certain vertical markets. In short, the time and cost of processing these items of information so that confidentiality is not breached is likely to make the organization’s accountant sit up and take notice.

Today many vendors assert that their systems can intelligently, automatically, and rapidly develop a taxonomy for an organization. I suggest you read the fine print. Even the whizziest taxonomy generator is going to require some baby sitting. To get a sense of what is required, track down an experienced licensee of the Autonomy IDOL system. There is a training period which requires a cohesive corpus of representative source material. Sorry, no images or videos accepted but the existing image and video metadata can be processed. Once the system is trained, then it is run against a test set of content. The results are examined by a human who knows what he or she is doing, and then the system is tuned. After the smart system runs for a few days, the human inspects and calibrates. The idea is that as content flows through the system and periodic tweaks are made, the system becomes smarter. In reality, indexing drift creeps in. In effect, the smart software never strays too far from the human subject matter experts riding herd on algorithms.

The problem exists even when there is a relatively stable core of technical terminology. The content problem of a lab gear manufacturer is many times greater than that of a company focusing on a specific branch of engineering, science, technology, or medicine. Indexing Halliburton nuclear energy information is trivial when compared to indexing more generalized business content like that found in ABI/INFORM or the typical services organization today.

I agree that a controlled term list is important. One cannot easily resolve entities unless there is a combination of automated processes and look up lists. An example is figuring out if a reference to I.B.M., Big Blue, or Armonk is a reference to the much loved marketers of Watson. Now handle a transliterated name like Anwar al-Awlaki and its variants. This type of indexing is quite important. Get it wrong and one cannot find information germane to a query. When one is investigating aliases used by bad actors, an error can become a bad day for some folks.
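The look up list half of that combination can be sketched in a few lines of Python. This is only an illustration: the alias table, the `normalize` helper, and the `resolve` function are hypothetical names invented here, and the transliteration variants are a tiny sample, not an authoritative list. A real system would add context checks, because a word like Armonk can also be just a place name.

```python
import re

# Hypothetical look up list: canonical entity -> known surface forms.
ALIASES = {
    "IBM": ["I.B.M.", "IBM", "Big Blue", "Armonk"],
    "Anwar al-Awlaki": ["Anwar al-Awlaki", "Anwar al-Aulaqi", "Anwar Awlaki"],
}


def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


# Invert the alias table once so each normalized form points at its entity.
LOOKUP = {
    normalize(form): canonical
    for canonical, forms in ALIASES.items()
    for form in forms
}


def resolve(mention):
    """Return the canonical entity for a mention, or None if it is unknown."""
    return LOOKUP.get(normalize(mention))
```

Calling `resolve("Big Blue")` or `resolve("i.b.m")` returns `"IBM"`, while a mention that is not in the list returns `None` and would have to be handed to an automated process or a human.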

The remainder of the slide deck rides the taxonomy pony into the sunset. When one looks at the information created 72 months ago, it is easy for me to understand why enterprise search and content processing has become an “oh, my goodness” problem in many organizations. I think that a mid sized company would grind to a halt if it needed a controlled vocabulary which matched today’s content flows.

My take away from the slide deck is easy to summarize: Putting the cart before the horse won’t get enterprise search where it must go to retain credibility and deliver utility.

Stephen E Arnold, May 9, 2015

Taxonomy Turmoil: Good Enough May Be Too Much

February 28, 2015

For years, I have posted a public indexing Overflight. You can examine the selected outputs at this Overflight link. (My non public system is more robust, but the public service is a useful temperature gauge for a slice of the content processing sector.)

When it comes to indexing, most vendors provide keyword, concept tagging, and entity extraction. But are these tags spot on? No, most are good enough.

image

A happy quack to Jackson Taylor for this “good enough” cartoon. The salesman makes it clear that good enough is indeed good enough in today’s marketing enabled world.

I chose about 50 companies that asserted their systems performed some type of indexing or taxonomy function. I learned that the taxonomy business is “about to explode.” I find that to be either an interesting investment tip or a statement that is characteristic of content processing optimists.

Like search and retrieval, plugging in “concepts” or other index terms is a utility function. For example, if one indexes each word in an article appearing in this blog, the article might be about another subject. For example, in this post, I am talking about Overflight, but the real topic is the broader use of metadata in information retrieval systems. I could assign the term “faceted navigation” to this article as a way to mark this article as germane to point and click navigation systems.

If you examine the “reports” Overflight outputs for each of the companies, you will discover several interesting things as I did on February 28, 2015 when I assembled this short article.

  1. Mergers or purchases of failed vendors at fire sale prices are taking place. Examples include Lucidea’s purchase of Cuadra and InMagic. Both of these firms are anchored in traditional indexing methods and seemed to be within a revenue envelope until their sell out. Business Objects acquired Inxight and then SAP acquired Business Objects. Bouvet acquired Ontopia. Teradata acquired Revelytix.
  2. Moving indexing into open source. Thomson Reuters acquired ClearForest and made most of the technology available as OpenCalais. OpenText, a rollup outfit, acquired Nstein. SAS acquired Teragram. Smartlogic acquired Schemalogic. (A free report about Schemalogic is available at www.xenky.com/vendor-profiles.)
  3. A number of companies just failed, shut down, or went quiet. These include Active Classification, Arikus, Arity, Forth ICA, MaxThink, Millennium Engineering, Navigo, Progris, Protege, punkt.net, Questans, Quiver, Reuse Company, Sandpiper,
  4. The indexing sector includes a number of companies my non public system monitors; for example, the little known Data Harmony with six figure revenues after decades of selling really hard to traditional publishers. Conclusion: Indexing is a tough business to keep afloat.

There are numerous vendors who assert their systems perform indexing, entity, and metadata extraction. More than 18 of these companies are profiled in CyberOSINT, my new monograph. Oracle owns Triple Hop, RightNow, and Endeca. Each of these acquired companies performs indexing and metadata operations. Even the mashed potatoes search solution from Microsoft includes indexing tools. The proprietary XML data management vendor MarkLogic asserts that it performs indexing operations on content stored in its repository. Conclusion: More cyber oriented firms are likely to capture the juicy deals.

So what’s going on in the world of taxonomies? Several observations strike me as warranted:

First, none of the taxonomy vendors are huge outfits. I suppose one could argue that IBM’s Lucene based system is a billion dollar baby, but that’s marketing peyote, not reality. Perhaps MarkLogic which is struggling toward $100 million in revenue is the largest of this group. But the majority of the companies in the indexing business are small. Think in terms of a few hundred thousand in annual revenue to $10 million with generous accounting assumptions.

What’s clear to me is that indexing, like search, is a utility function. If a good enough search system delivers good enough indexing, then why spend for humans to slog through the content and make human judgments? Why not let Google funded Recorded Future identify entities, assign geo codes, and extract meaningful signals? Why not rely on Haystax or RedOwl or any one of the more agile firms to deliver higher value operations?

I would assert that taxonomies and indexing are important to those who desire the accuracy of a human indexed system. This assumes that the humans are subject matter specialists, the humans are not fatigued, and the humans can keep pace with the flow of changed and new content.

The reality is that companies focused on delivering old school solutions to today’s problems are likely to lose contracts to companies that deliver what the customer perceives as a higher value content processing solution.

What can a taxonomy company do to ignite its engines of growth? Based on the research we performed for CyberOSINT, the future belongs to those who embrace automated collection, analysis, and output methods. Users may, if they so choose, provide guidance to the system. But the days of yore, when monks with varying degrees of accuracy created catalog sheets for the scriptoria, have been washed to the margin of the data stream by today’s content flows.

What’s this mean for the folks who continue to pump money into taxonomy centric companies? Unless the cyber OSINT drum beat is heeded, the failure rate of the Overflight sample is a wake up call.

Buying Apple bonds might be a more prudent financial choice. On the other hand, there is an opportunity for taxonomy executives to become “experts” in content processing.

Stephen E Arnold, February 28, 2015

DBpedia Makes Wikipedia Part Of The Semantic Web

November 21, 2014

SemanticWeb.com posted an article called “Retrieving And Using Taxonomy Data From DBpedia” with an interesting introduction. It explains that DBpedia is a crowd-sourced Internet community whose entire goal is to extract structured information from Wikipedia and share it. The introduction continues that DBpedia already has over three billion facts expressed in the W3C standard RDF data model and ready for application use.

The taxonomy data is expressed in SKOS, the W3C standard vocabulary used by the New York Times, the Library of Congress, and other organizations for their own taxonomies and subject headers. Users can extract the data and use it in their own RDF applications with the goal of giving their data more value.
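To make the retrieval step concrete, here is a minimal Python sketch that builds a SPARQL query for the categories one level below a given DBpedia category. It is not the query from the SemanticWeb.com article. The function names are hypothetical, the `Horror_films` category is used only as an example, and the sketch assumes that the public endpoint at `https://dbpedia.org/sparql` accepts `query` and `format` parameters and that categories carry `skos:broader` links and `rdfs:label` values.

```python
from urllib.parse import urlencode

# Assumption: DBpedia's public SPARQL endpoint.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_subcategory_query(category):
    """Return a SPARQL query for categories whose skos:broader is `category`."""
    return f"""
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?child ?label WHERE {{
  ?child skos:broader <http://dbpedia.org/resource/Category:{category}> .
  ?child rdfs:label ?label .
  FILTER (lang(?label) = "en")
}}
ORDER BY ?label
""".strip()


def build_request_url(query):
    """Encode the query as a GET URL; the format value is an assumption."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return DBPEDIA_ENDPOINT + "?" + urlencode(params)


query = build_subcategory_query("Horror_films")
url = build_request_url(query)
```

The resulting `url` can be fetched with any HTTP client. Nothing is sent over the network here, so the sketch shows only how a SKOS relationship becomes a query.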

DBpedia is doing a wonderful service for users so they do not have to rely on proprietary software to deliver them rich taxonomies. The taxonomies can be retrieved under the open source community’s bylaws and put to work improving content right away. There is one caveat:

“Remember that, for better or worse, the data is based on Wikipedia data. If you extend the structure of the query above to retrieve lower, more specific levels of horror film categories, you’d probably find the work of film scholars who’ve done serious research as well as the work of nutty people who are a little too into their favorite subgenres.”

Remember Wikipedia is a good reference tool to gain an understanding of a topic, but you still need to check more verifiable resources for hard facts.

Whitney Grace, November 21, 2014
Sponsored by ArnoldIT.com, developer of Augmentext

Wave Your WAND for a New Taxonomy Portal

May 2, 2014

If a library is in need of a taxonomy, most of the time all it needs to do is wave a magic wand and its taxonomy wish is granted. Actually, it becomes a client of WAND, Inc., the world’s leading taxonomy provider. According to the WAND Inc. blog, the company has launched a new endeavor: “WAND Announces Launch Of New Taxonomy Portal.” The WAND Taxonomy Library Portal helps companies develop a taxonomy strategy that is integral to an enterprise management strategy.

“According to Mark Leher, WAND’s COO, ‘The amount of unstructured information and data inside organizations continues to explode. Companies need a taxonomy strategy to organize information and make it easily accessible to enterprise information workers. The WAND Taxonomy Library Portal is a valuable resource that provides the foundation for a corporate taxonomy strategy.’”

WAND Taxonomy Library Portal subscribers receive access to all of WAND’s taxonomies. They cover a range of topics, including insurance, medical equipment and supplies, travel, personal care, human resources, and many more. The portal is designed to help companies get the highest return on investment from management applications.

“Leher continued, ‘What most people don’t realize is that there are more than 150 common enterprise information management applications that are designed to leverage taxonomy. We estimate that most large organizations have already invested in 10-20 of those applications. At WAND, our goal is to provide taxonomies that make those applications more effective and increase the return on investment.’”

Taxonomies are lists of terms. It is hard to imagine that term lists are an integral part of using management applications, but they are important to identifying content and building a reference framework.

Whitney Grace, May 02, 2014

Taxonomy Round-Up Includes Variety of Taxonomy Discussions

January 27, 2014

The article on Synaptica Central titled “End of 2013 Round Up of Taxonomy Blogs, Part 1” is exactly what it sounds like: an end of the year look at taxonomy through articles, blog posts, videos, and more, gathered from such diverse areas of the Internet as Pinterest, Twitter, StackExchange, and YouTube. Topics include “Taxonomy as it relates to Drupal” and posts on augmenting a taxonomy.

The author explains:

“It’s that time of year, folks, when it seems like everyone is publishing some kind of year end list or “round up” of the years news highlights, blogs and blog posts, photos, videos, and more. So why not do the same here at Synaptica? To keep things manageable, listed here are some 2013 blogs and blog entries, culled from almost a thousand, that address taxonomy in one way or another.”

Overall the article presents an interesting list of important taxonomy blog entries. Many touch on Bloom, such as the articles “How to Use Pinterest with Bloom’s Taxonomy Infographic” and “Revised Bloom’s Taxonomy and the Need for Higher Order Thinking.” Another helpful highlight is “Twitter Aligned with Bloom’s Taxonomy for Your Students.” Whether you are looking for guidance in implementing or augmenting a taxonomy, or are just interested in a dialogue on the subject, this list is a good starting point and should direct you toward discussions and communities galore.

Chelsea Kerwin, January 27, 2014

Another Content Management Company Another Day

August 12, 2013

Content management companies are springing up and gaining attention due to the Big Data boom. One of the companies that our content wranglers pulled out of an Internet search is Applied Relevance. The company specializes in several aspects of the content management spectrum, but its Web site prominently promotes its taxonomy services. Applied Relevance offers the AR-Classifier tagging engine that can run on a variety of platforms. Its AR-Semantics is the flagship organization and categorization software, while AR-Taxonomy is the tool needed to edit and manage taxonomies. If you want to search your taxonomies, AR-Navigator is available.

All this talk about Applied Relevance’s taxonomy software is informative, but what is interesting is the company’s description on the main page:

“Applied Relevance produces software and services to help enterprise users find the information they need. Our solutions augment traditional search engines by providing context for the search results. The AR toolset and our partners provide cost effective technology for the full spectrum of enterprise content management and search applications. With our tools, a search term and a few clicks, users can zero-in past ambiguities and come up with the right answer in the right context. Applied Relevance is located on the west coast of the east coast of North America.”

Descriptive, but not a word on taxonomy or on what exactly the company does. The tagline at the end about Applied Relevance’s location is even more ambiguous.

Whitney Grace, August 12, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Spotlight on Access Innovations Living Up to Name

May 14, 2013

New York-based print and digital educational content company Triumph Learning has struck up a partnership with taxonomy development leader Access Innovations, Inc. Together, they will be creating a new taxonomy designed to align standards-based instructional content for the K-12 education market. The news release, “Triumph Learning Partners with Access Innovations on Common Core Standards-Integrated Taxonomy,” explains more.

Content management can be a difficult challenge for companies like Triumph Learning, but Access Innovations facilitated a more efficient management system by developing and building a taxonomy out of a structured vocabulary for math and English.

We learned about how the Common Core State Standards apply:

“The Common Core State Standards provide concepts and terminology that Triumph Learning writers and editors can use to link pieces of content such as instruction and practice activities, as well as other supplemental material, to corresponding grade-level standards. ‘By using Access Innovations expertise we will be able to properly align our content for both teachers and students,’ said Aoife Dempsey, Chief Technology Officer at Triumph Learning.”

For a company that has been around since 1978, Access Innovations truly lives up to its name. Its database and taxonomy creation capabilities and semantic integration technology stand out among others, and it looks like its spotlight will continue to shine, especially now that the company is involved in bolstering educational reform on a national level.

Megan Feil, May 14, 2013

Taxonomy or Ontology

April 1, 2013

The WAND blog article “Common Taxonomy Questions: What is the difference between a Taxonomy and an Ontology?” attempts to clear up the confusion between taxonomy and ontology. The article notes that these are two common words in the information management world, but many people do not truly understand the difference.

“Taxonomy is a collection of terms that are connected by broader term, narrower term, related term, and synonym relationships.”

The article makes an interesting comparison. A taxonomy is a tree that has parent/child relationships between terms, and it usually covers a specific subject area. Taxonomies can be a valuable tool when adding structure and context to unstructured information, which makes the information more easily searchable. Multiple taxonomies can be used together as filters to help make the search experience more powerful and exact. Popular sites such as Amazon and Costco use this tool. When it comes to ontologies, the author offers a different picture.

“Ontologies can be thought of more like a web, with many different types of relationships between all concepts. Ontologies can have infinite number of relationships between concepts and it is easier to create relationships between concepts across different subject domains.”
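The tree versus web distinction can be shown with two small data structures. The sketch below is an illustration only: the terms, the relationship names, and the helper functions are hypothetical and do not come from WAND’s products. The taxonomy stores one broader term per entry, while the ontology stores typed relationships that can cross subject areas.

```python
from collections import defaultdict

# Taxonomy as a tree: each term points to exactly one broader (parent) term.
BROADER = {
    "Laptops": "Computers",
    "Desktops": "Computers",
    "Computers": "Electronics",
}


def ancestors(term):
    """Walk up the tree and return the chain of broader terms."""
    path = []
    while term in BROADER:
        term = BROADER[term]
        path.append(term)
    return path


# Ontology as a web: (subject, relationship, object) statements of any type.
TRIPLES = [
    ("Laptops", "broader", "Computers"),
    ("Laptops", "poweredBy", "Batteries"),
    ("Batteries", "soldBy", "Hardware stores"),
]


def related(term):
    """Group everything a term points to by relationship type."""
    grouped = defaultdict(list)
    for subject, relationship, obj in TRIPLES:
        if subject == term:
            grouped[relationship].append(obj)
    return dict(grouped)
```

Here `ancestors("Laptops")` walks a single path up the tree, while `related("Laptops")` returns several kinds of links at once. The same term list can back both structures, which may be why a product can be described as a taxonomy and an ontology at the same time.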

Ontologies are handy for those who want a more sophisticated information model that could be valuable when doing advanced natural language processing or text analytics. Though the name of the system is WAND Product and Service Taxonomy, believe it or not, it is also an ontology. The blog provides a good distinction between ontology and taxonomy but then says that the WAND system is actually both, which makes one wonder how to distinguish the two. Looks like more questions than answers. Here we go again.

April Holmes, April 01, 2013

A Cause for Celebration

April 1, 2013

The best way to mark the successful completion of a project is with a celebration, and no celebration is complete without a cake. Synaptica definitely knows how to throw a party. According to the Synaptica Central piece “Elsevier Celebrates New Installation,” Synaptica and Elsevier recently celebrated the successful completion of their software development project with a tasty cake.

“It is a pleasure when one of our customers has a specially decorated cake made to celebrate the successful deployment of their customized Synaptica taxonomy management software. The project, completed this month, was a collaboration between Synaptica and the content management team at Elsevier, Netherlands.”

Elsevier got its start with journal and book publishing but is also known for providing scientific, technical, and medical information as well as various other products. Synaptica was started in 1995 and is owned by Trish Yancey and Dave Clarke. The company is an industry leader in taxonomy management and ontology software. Its software gives users several key benefits, such as increased relevance thanks to a synonym-rich indexing vocabulary and the ability to visualize taxonomies in a variety of both textual and graphical formats. Synaptica software can work in the enterprise world and has been integrated with several different third-party applications. In addition, Synaptica is user friendly and can be set up in only a matter of minutes. Synaptica taxonomy software is used by a variety of organizations for their metadata management and information access applications. The company even received the “100 Companies that Matter” award. Looks like they definitely have a reason to celebrate.

April Holmes, April 01, 2013
