Text Analytics World
An Interview with Tom Reamy
With the increasing interest in Big Data, text analytics has become an important management issue. One of the more interesting conferences focused on this booming sector is Text Analytics World, which will be held in San Francisco April 17-18, 2013. The conference addresses several important topics:
- Social Media – Including sentiment analysis, voice of the customer, and other new social applications
- Enterprise Applications – Including enterprise search and search based applications, knowledge management (collaboration, expertise location, advanced knowledge bases), content management, and other productivity applications
- Intelligence Applications – Including business intelligence, customer intelligence, security, fraud detection, survey analysis, and more
- Knowledge Organization – Including categorization techniques, noun phrase extraction approaches, taxonomy development and applications, semantic web and semantic technologies, and new knowledge models.
The program chair is Tom Reamy. He is the Chief Knowledge Architect and founder of KAPS Group, a group of knowledge architecture, taxonomy, and eLearning consultants.
I spoke with Mr. Reamy on February 8, 2013. The full text of my interview with him appears below:
What was the magnet that pulled you toward text processing?
That’s a long story. It started with a degree in the History of Ideas, followed by a failed romance with Artificial Intelligence (seduced by the promise of the imminent breakthrough that we’re still waiting for). Then came a few years developing computer games and educational software, and then a consulting gig helping a library design and organize a corporate intranet for a large pharmaceutical company.
That was where I got involved in, and became convinced of, the importance of metadata and taxonomies for improving access to information. That was also my next exposure to one more “revolution,” something that has given me a healthy skepticism toward a whole host of IT-inspired “revolutions” that turned out to be rather less than earth shaking.
However, my flirtation with taxonomy also revealed a major gap: while taxonomies and metadata were clearly needed to make sense of all that unstructured information, the problem was how to apply the taxonomies to the content. There are problems with every approach and, as well, taxonomies tended to be too inflexible and formal to really model how different real people think.
Solving those problems, particularly the major one of applying taxonomies to content, led me to text analytics which offers the best, most consistent, and most economical way of applying taxonomies and other metadata to content.
What's the problem with basic search?
The basic problem is that search engines don’t understand meaning. Humans think in concepts, and search engines deal with meaningless strings. Search companies and users alike have been very creative about trying to overcome the basic stupidity of search engines, from adding dictionary stemming to Google’s PageRank algorithm, which simply substitutes human judgments about the worth of documents. To get a handle on just how stupid search engines are, imagine a human being reading a document. Someone asks what the document is about, and to answer the question, the reader starts counting up words and replies, “It must be about policies because the word policy is the most common word.” Right.
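Reamy’s word-counting caricature is easy to make concrete. The sketch below is purely illustrative (the stopword list and sample document are invented): it shows a string-based notion of “aboutness” that counts terms, picks the most frequent one, and declares that the topic, with no concepts involved.

```python
import re
from collections import Counter

def naive_aboutness(text, stopwords={"the", "a", "is", "of", "and", "to", "before"}):
    """Declare the most frequent non-stopword term the 'topic' of a
    document. This is string counting, not understanding."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    term, _ = counts.most_common(1)[0]
    return term

doc = ("The policy covers travel. The policy excludes storms. "
       "Read the policy before filing a claim.")
print(naive_aboutness(doc))  # -> policy
```

A reader sees a document about insurance coverage; the counter sees only that “policy” occurs three times.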
As long as search engines can’t deal with meaning, they will continue to underperform. And, of course, there are other problems as well, but we can talk about them later.
So it's the enterprise-search business case that's a difficult problem to crack: enterprises should spend money on tools that deliver less enterprise value at higher cost than Web-search engines deliver? Where does text analytics fit today?
First of all, Web search and enterprise search are operating in very different worlds. Ask anyone who has tried to take a successful web search engine and simply point it at internal content. The content is different in both scale and variety, and even more importantly, the questions people ask are very different. In addition, in some ways search is becoming the least important function of an enterprise search system. More important is what kinds of applications you can build with it.
So the cost comparison with web search is not very useful. The answer is yes, people should spend more on enterprise search and they should devote more resources to enhancing search. I’ve seen a number of surveys showing that the average number of people assigned to search is less than one, and then people wonder why satisfaction with search remains under 50%.
So where does text analytics (TA) fit in? Well, quite simply, it’s potentially the salvation of search. TA is the piece that can add meaning to search by developing categorization and extraction capabilities that are more flexible than taxonomies, can more closely match how a variety of people think, and can do it more cheaply and consistently than human taggers.
However, it does require effort and resources to set it up. And without that effort, TA will fail almost as badly as search has.
What sets your text analytics conference apart from the others pursuing this market?
You mean besides that I’m the program chair. [Reamy grins.] Probably the biggest difference that I bring to the conference is that I’ve been doing content structure and TA for a long time and for a lot of different clients. My first project in TA was partnering with Inxight and developing a set of categorization taxonomies for a news aggregator back in 2004. What this has meant for the conference is that we cover the entire spectrum of text analytics, from big data/social media to enterprise text analytics, which includes everything from fixing enterprise search to developing advanced, smart applications that gain real value from all that unstructured text.
We also cover all the practical use case examples that let new developers learn how to do TA and more experienced developers share the latest techniques. We cover the real business value that TA can bring to the enterprise and, lastly, given my interest in theory, we share new ideas and new techniques that enrich the theoretical foundation that is needed to deliver the best (and most practical) applications.
What is the impact of big data on text analytics?
So far the actual impact has been fairly small, but that seems to be changing. It is more of a two-way street, with each enriching the other. One thing to remember is that Big Text is bigger than Big Data.
One of the ways they work together is that TA can be used to, in essence, convert much of Big Text into Big Data. That means all those incredible new techniques being developed in Big Data, including predictive analytics, can work their magic on the 90% of information that is unstructured. In addition, the growth of social media content and the techniques to extract meaningful patterns through text mining are opening up new areas of research that are interesting in their own right, but can also be combined with text analytics to get the best of both worlds of data and text.
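One way to picture “converting Big Text into Big Data” is a pipeline that reduces free text to structured rows a predictive model can consume. The sketch below is illustrative only: the word lists stand in for real extraction taxonomies and sentiment resources.

```python
import re

# Toy resources standing in for real TA categorization/extraction assets.
POSITIVE = {"great", "love", "excellent"}
NEGATIVE = {"broken", "terrible", "slow"}
PRODUCTS = {"phone", "laptop", "tablet"}  # a tiny extraction "taxonomy"

def text_to_record(review: str) -> dict:
    """Turn one unstructured review (Big Text) into a structured
    row (Big Data) with extracted entity, sentiment, and length."""
    words = set(re.findall(r"[a-z]+", review.lower()))
    return {
        "product": next((p for p in PRODUCTS if p in words), None),
        "sentiment": len(words & POSITIVE) - len(words & NEGATIVE),
        "length": len(review.split()),
    }

rows = [text_to_record(r) for r in [
    "I love this phone, the screen is excellent",
    "The laptop arrived broken and support was terrible",
]]
```

Once the text is in rows like these, the usual data-side machinery (aggregation, prediction, trending) applies to content that started out unstructured.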
Among the technologies getting attention are systems that apply index terms and classification tags that place a document in a context or in an enriched index. The buzzwords for this type of process range from metatagging to ontological indexing. What's your view of rich indexing (metadata) which vendors now use to feed algorithms, not searchers?
I’m not sure how to react to this because, in one sense, that is just what TA does. It can create enriched indexes, generate metadata of all kinds, and then the rich index can be used for any number of applications. It’s what I have been referring to above, but it can feed both algorithms and searchers.
However, one thing I’m sure about is this: if a vendor comes to you and says their software is so advanced that all you have to do is point it at your content and it will magically find all the same patterns and structure that your users want, run screaming from the room. Yes, some of that software is getting better and better, but the secret is, as always, finding the best way to combine human semantic capabilities and machine capabilities. It could be as simple as using highly developed content structure, including human-developed ontologies and taxonomies. This works if the domain is limited and the targets are broad, as in eDiscovery applications, where the goal is not to find the answer but to improve the efficiency of the human effort by removing 80% of the content and giving the human analyst less to wade through.
Where I see the most value from text mining and other “automatic” or more machine learning approaches is not by themselves but in combination with text analytics and search and taxonomies and metadata.
The Semantic Web contributed some important methods, including a focus on both content and document structure. Google, Microsoft, and Yahoo, as well as vendors focused on the enterprise, describe their systems as having "semantic" functions. What's your view of semantic technology? Has it not stalled out?
I think it’s ironic that there is such a disagreement about what semantic means. And it is probably unfortunate that the semantic technology people took over the word since their meaning has little to do with language and unstructured text and is mostly about objects and data, but so it goes.
I’ve been exploring semantic technology for a few years, mostly by speaking at and attending SemTech. The conference is certainly vibrant and well attended (way higher than ESS or TAW), but they do seem to still be struggling with how to actually realize value from it. There are a few success cases like the BBC, but mostly people seem to struggle – all the while having a great time talking about how exciting the technology is.
So I would say that it has not stalled out, but it is at a crossroads. Personally, I think it will continue to grow and a lot of that growth will come from partnering with other areas, including text analytics, predictive analytics, and even search. For example, I still hear people talk about how fantastic it is that their software can quickly extract billions and billions of triples as if that was a good thing. Reminds me of the early days of Intranets where the excitement was all about how we could put all that information in one place. But if you combine those triples with other techniques then you can realize value.
So, I’d say that semantic technology needs more “semantics” in the broader sense of the word with an additional capability of handling concepts and language and taxonomies and categorization and a more sophisticated handling of the complexity of language.
One thing that makes semantic technology still exciting to me is the ability to develop reasoning about the implications of the relationships within an ontology. Knowing that companies pay people (and lots of other relationships) gives you the ability to know that a document that talks about a person and a company could also have implications for pay rates even when those terms are not mentioned in the document. Now we just have to figure out how to build applications that also know what to ignore amongst the billions and billions of triples.
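The pay-rates example can be sketched with a toy triple store and a single inference rule. Everything here (the entities, the `pays` relation, the rule itself) is invented for illustration, not drawn from any actual ontology or standard.

```python
# A toy triple store: (subject, relation, object).
triples = {
    ("Company", "pays", "Person"),
    ("Acme", "is_a", "Company"),
    ("Alice", "is_a", "Person"),
}

def implied_topics(doc_entities):
    """If a document mentions an instance of X and an instance of Y,
    and the ontology asserts X <rel> Y, then <rel> is an implied topic
    of the document, even if the relation word never appears in it."""
    types = {e: t for (e, r, t) in triples if r == "is_a" and e in doc_entities}
    return {rel for (x, rel, y) in triples
            if rel != "is_a"
            and x in types.values() and y in types.values()}

# A document mentioning Acme and Alice, but never the word "pay":
print(implied_topics({"Acme", "Alice"}))  # -> {'pays'}
```

The hard part Reamy points to is the inverse problem: at billions of triples, almost everything is weakly implied, so the application also needs a principled way to ignore most of these inferences.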
In the last year, I have noticed that search has been in some cases embedded in other enterprise functions. Google enhances its enterprise Apps and search is just "there". Document management, customer support, and back office enterprise systems are offering more robust search systems. What do you make of this growing interest in embedding search?
It depends on how it is done. For document management, there is a much better model discussed below, but for applications I think it is the wave of the future. Search based applications or Info Apps succeed, in part, by limiting the scope of content and the domain of questions. In addition, they have one major advantage: they are not called search. At last spring’s Enterprise Search Summit I was on a panel with Sue Feldman and we all agreed: To succeed with search in the enterprise, whatever you do, don’t call it search. Search has been failing for so long that no one wants to be associated with it.
However, as content continues to grow as it always does, the success of these Info Apps rests ultimately on how good the underlying search is, and that means once again that it will depend on adding the ability to deal with meaning, which means text analytics.
Stand alone search and content processing systems can be expensive. In your experience, what can an organization do to get the functionality and control the costs for indexing, processing big data, and providing useful information to employees? Why not use Google and call it a day?
First of all, Google inside the enterprise just doesn’t work the way Google on the Internet works. PageRank and thousands of editors tagging work wonders on the Internet, but not inside, where the documents and links are fewer and the questions more focused. Without those, Google is just another search engine.
Second, search isn’t (or shouldn’t be) about technology. There was a great headline in my local paper yesterday about a company whose “Software searches documents faster!”. But no mention of “better”, just faster. That technology focus is one reason search continues to fail.
So, how to keep costs down? Simple: focus your attention on semantics, not technology. I’d like to issue a challenge. Take any company that is getting ready to spend millions on a new search engine (or that already has one of the most advanced and expensive search engines). Give me 25% of their budget and, using any basic search engine, a few people (taxonomists, editors, librarians, etc.), and text analytics, I will produce better results.
I think one problem is simply the way accounting systems work with capital expenses being better on the books than hiring people. That and the belief that search is really about technology.
Enhanced text processing can be computationally demanding. What technical innovations have you noticed that deliver enhanced content processing without requiring significant investments in computing infrastructure?
Actually, surprisingly little investment in computing infrastructure is needed. If you want a complete package with advanced text mining and are indexing the Internet, then yes, but still not that much compared with other IT applications. For example, most of the text analytics companies use a server model where, if you need more power, you just add a few relatively cheap servers.
New vendors are entering the market on a weekly, maybe daily basis. I just wrote about Visual Analytics and Cybertap, to name just two new players. Yet most of the 300 search and content processing companies I track seem to be struggling to get market traction. Are search and content processing technologies on a gerbil track, lots of activity but no progress?
By themselves, yes, it’s gerbils in spinning wheels. However, when I do information strategy consulting, I usually promote a hybrid model that consists of a few main elements. The first piece is content management, augmented by text analytics to semi-automate the generation of metadata and the application of taxonomies to content. Authors are terrible at tagging and won’t do it, but if you give them an automatically generated set of metadata tags, including the topic of the document, then they are pretty good at reacting. The second piece is search, also augmented by text analytics to create dynamic tags, particularly all kinds of faceted metadata. This enables the system to handle external content or legacy content. The final piece is building Info Apps sitting on top of the CM/search platform to create smart applications whose only limit is your imagination.
But if the semantics aren’t there, none of the pieces work.
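The semi-automated tagging step in that hybrid model can be sketched as a rule-based categorizer that proposes tags for an author to accept or reject. The taxonomy and document below are invented for illustration; a real deployment would use a curated taxonomy and a commercial categorization engine.

```python
# Toy taxonomy: category -> indicator terms. Illustrative only.
TAXONOMY = {
    "HR": {"benefits", "hiring", "vacation"},
    "Finance": {"budget", "invoice", "audit"},
    "IT": {"server", "network", "password"},
}

def suggest_tags(text, threshold=1):
    """Propose taxonomy categories whose indicator terms appear in the
    text at least `threshold` times; an author then reviews the list."""
    words = set(text.lower().split())
    hits = {cat: len(words & terms) for cat, terms in TAXONOMY.items()}
    return sorted(c for c, n in hits.items() if n >= threshold)

doc = "Please submit the invoice before the quarterly budget audit"
print(suggest_tags(doc))  # -> ['Finance']
```

The point of the semi-automated design is exactly what Reamy describes: the machine does the consistent first pass, and the human only has to react to a short list rather than tag from scratch.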
A number of vendors have shown me very fancy interfaces. Are today's info literate online users ignoring information provenance and accuracy for ease and convenience?
Except when they still can’t find anything. Being able to play with visualizing results can only keep you occupied for so long.
What are the hot trends in search for the next 12 to 24 months?
Hopefully, Text Analytics will be one, but we’ll see. I do see more interest in semantic search (we just have to figure out what that means); for me it means adding Text Analytics to search. Info Apps are another area that should continue to grow. The last trend I’d like to see is related to Info Apps: more integration of technologies (always with a strong semantic foundation) so that search, text mining, big data, social media analysis, predictive analytics, and more all work together to create much more intelligent handling of information of all kinds.
Where can people get more information?
Well, of course, the best answer is Text Analytics World in San Francisco, April 17-18. There are other conferences, like the Text Analytics Summit, and I try to maintain a set of resources on my company’s web site, KAPS Group. Unfortunately, there aren’t any good books on text analytics, although there are some on text mining, such as the classic The Text Mining Handbook by Ronen Feldman and James Sanger. The other books are all pretty much on theoretical topics in text mining. I keep threatening to write a text analytics book, but have yet to find much time to work on it. Stay tuned.
Text Analytics World strikes us as a must-attend event.
Stephen E. Arnold, February 19, 2013