
Medical Search: A Long Road to Travel

April 13, 2015

Do you want a way to search medical information without false drops, without the need to learn specialized vocabularies, and without Boolean? Apparently the purveyors of medical search systems have left users scratching without an antihistamine within reach.

Navigate to Slideshare (yep, LinkedIn) and flip through “Current Advances to Bridge the Usability Expressivity Gap in biomedical Semantic Search.” Before reading the 51 slide deck, you may want to refresh yourself with Quertle, PubMed, MedNar, or one of the other splendiferous medical information resources for researchers.

The slide deck identifies the problems with the existing search approaches. I can relate to these points. For example, those who tout question answering systems ignore the difficulty of passing a question from medicine to a domain consisting of math content. With math as the plumbing in many advanced medical processes, the weakness is a bit of a problem and has been for decades.

The “fix” is semantic search. Well, that’s the theory. I interpreted the slide deck as communicating how a medical search system called ReVeaLD would crack this somewhat difficult nut. As an aside: I don’t like the wonky spelling that some researchers and marketers are foisting on the unsuspecting.

I admit that I am skeptical about many NGIA or next generation information access systems. One reason medical research works as well as it does is its body of generally standardized controlled terms. Learn MeSH and you have a fighting chance of figuring out if the drug the doctor prescribed is going to kill off your liver as it remediates your indigestion. Controlled vocabularies in scientific, technical, engineering, and medical domains address the annoying ambiguity problems encountered when one mixes colloquial words with quasi consultant speak. A technical buzzword is part of a technical education. It works, maybe not too well, but it works better than some of the wild and crazy systems which I have explored over the years.

You will have to dig through old jargon and new jargon such as entity reconciliation. In the law enforcement and intelligence fields, an entity from one language has to be “reconciled” with versions of the “entity” in other languages and from other domains. The technology is easier to market than make work. The ReVeaLD system is making progress as I understand the information in the slide deck.

Like other advanced information access systems, ReVeaLD has a fair number of moving parts, which Slide 27 in the deck diagrams.


There is also a video available at this link. The video explains that the Granatum Project uses a constrained domain specific language. So much for cross domain queries, gentle reader. What is interesting to me is the similarity between the ReVeaLD system and some of the cyber OSINT next generation information access systems profiled in my new monograph. There is a visual query builder, a browser for structured data, visualization, and a number of other bells and whistles.

Several observations:

  • Finding relevant technical information requires effort. NGIA systems also require the user to exert effort. Finding the specific information required to solve a time critical problem remains a hurdle for broader deployment of some systems and methods.
  • The computational load for sophisticated content processing is significant. The ReVeaLD system is likely to suck up its share of machine resources.
  • Maintaining a system with many moving parts when deployed outside of a research demonstration presents another series of technical challenges.

I am encouraged, but I want to make certain that my one or two readers understand this point: Demos and marketing are much easier to roll out than a hardened, commercial system. Just like the EC’s Promise program, ReVeaLD may have to communicate its achievements to the outside world. A long road must be followed before this particular NGIA system becomes available in Harrod’s Creek, Kentucky.

Stephen E Arnold, April 13, 2015

Spelling Suggestions via the Bisect Module

April 13, 2015

I know that those who want to implement their own search and retrieval systems learn that some features are tricky to implement. I read “Typos in Search Queries at Khan Academy.”

The author states:

The idea is simple. Store a hash of each word in a sorted array and then do binary search on that array. The hashes are small and can be tightly packed in less than 2 MB. Binary search is fast and allows the spell checking algorithm to service any query.
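The quoted idea can be sketched with Python's standard bisect module. This is a minimal illustration, not the Khan Academy implementation: the 32-bit truncated MD5 hash, the one-edit candidate generator, and the function names are all assumptions made for the example.

```python
import bisect
import hashlib
import string
from array import array


def word_hash(word):
    """Map a word to a 32-bit integer (truncated MD5; an arbitrary choice)."""
    digest = hashlib.md5(word.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def build_index(words):
    """Return a sorted, tightly packed array of unique word hashes."""
    return array("L", sorted({word_hash(w) for w in words}))


def is_known(index, word):
    """Binary-search the sorted hashes; collisions can cause rare false hits."""
    h = word_hash(word)
    i = bisect.bisect_left(index, h)
    return i < len(index) and index[i] == h


def single_edits(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in letters}
    inserts = {a + c + b for a, b in splits for c in letters}
    return deletes | transposes | replaces | inserts


def suggest(index, word):
    """Return one-edit candidates whose hashes appear in the index."""
    if is_known(index, word):
        return []
    return sorted(c for c in single_edits(word.lower()) if is_known(index, c))
```

Because only hashes are stored, the index cannot list its own words; suggestions come from generating candidate spellings and testing each one. That trade keeps memory small at the cost of an occasional false match when two strings share a hash.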

What is not included in the write up is detail about the time required and the frustration experienced to implement what some senior managers assume is trivial. Yep, search is not too tough when the alleged “expert” has never implemented a system.

With education struggling to teach the three Rs, the need for software that caulks the leaks in users’ ability to spell is a must have.

Stephen E Arnold, April 13, 2015

The Challenge of Synonyms

April 12, 2015

I am okay with automated text processing systems. The challenge is for software to keep pace with the words and phrases that questionable or bad actors use to communicate. The marketing baloney cranked out by vendors suggests that synonyms are not a problem. I don’t agree. I think that words used to reference a subject can fool smart software and some humans as well. For an example of the challenge, navigate to “The Euphemisms People Use to Pay Their Drug Dealer in Public on Venmo.” The write up presents some of the synonyms for controlled substances; for example:

  • Kale salad thanks
  • Columbia in the 1980s
  • Road trip groceries
  • Sanity 2.0
  • 10 lbs of sugar

The synonym I found interesting was an emoji, which most search and content processing systems cannot “understand.”




Attensity asserts that it can “understand” emojis. Sure, if there is a look up list hard wired to a meaning. What happens if the actor changes the emoji? Like other text processing systems, the smart software may become less adept than the marketers state.
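The brittleness of a hard wired look up list is easy to show. The sketch below is hypothetical and is not Attensity's method; the table contents, the label, and the function name are invented for illustration.

```python
# Hypothetical lookup table: each emoji is hard wired to one label.
EMOJI_MEANINGS = {
    "\U0001F33F": "controlled_substance",  # herb emoji
    "\U0001F48A": "controlled_substance",  # pill emoji
}


def tag_emojis(text):
    """Return the labels of every listed emoji that appears in the text."""
    return {label for symbol, label in EMOJI_MEANINGS.items() if symbol in text}


# A listed emoji is caught; a substitute the list has never seen is not.
caught = tag_emojis("thanks \U0001F33F")   # herb emoji is in the table
missed = tag_emojis("thanks \U0001F341")   # maple leaf is not in the table
```

The second call returns an empty set. Until someone updates the table, the swapped emoji passes through untagged.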

But why rain on the hype parade and remind you that search is difficult? Moving on.

Stephen E Arnold, April 12, 2015

Twitter Plays Hard Ball or DataSift Knows the End Is in Sight

April 11, 2015

I read “Twitter Ends its Partnership with DataSift – Firehose Access Expires on August 13, 2015.” DataSift supports a number of competitive and other intelligence services with its authorized Twitter stream. The write up says:

DataSift’s customers will be able to access Twitter’s firehose of data as normal until August 13th, 2015. After that date all the customers will need to transition to other providers to receive Twitter data. This is an extremely disappointing result to us and the ecosystem of companies we have helped to build solutions around Twitter data.

I found this interesting. Plan now or lose that authorized firehose. Perhaps Twitter wants more money? On the other hand, maybe DataSift realizes that for some intelligence tasks, Facebook is where the money is. Twitter is a noise machine. Facebook, despite its flaws, is anchored in humans, but the noise is increasing. Some content processes become more tricky with each business twist and turn.

Stephen E Arnold, April 11, 2015

Search Is Simple: Factoid Question Answering Made Easy

April 10, 2015

I know that everyone is an expert when it comes to search. There are the Peter Principle managers at Fortune 100 companies who know so much about information retrieval. There are the former English majors who pontificate about next generation systems. There are marketers who have moved from Best Buy to the sylvan manses of faceted search and clustering.

But when one gets into the nitty gritty of figuring out how to identify information that answers a user’s question, the sunny day darkens, just like the shadows in a furrowed brow.

Navigate to “A Neural Network for Factoid Question Answering Over Paragraphs.” Download the technical paper at this link. What we have is an interesting discussion of a method for identifying facts that appear in separate paragraphs. The approach makes use of a number of procedures, including a helpful vector space visualization to make evident the outcome of the procedures.

Now does the method work?

There is scant information about throughput, speed of processing, or what has to be done to handle multi-lingual content, blog type information, and short strings like content in WhatsApp.

One thing is known: Answering user questions is not yet akin to nuking a burrito in a microwave.

Stephen E Arnold, April 10, 2015

Predicting Plot Holes Isn’t So Easy

April 10, 2015

According to The Paris Review’s blog post “Man In Hole II: Man In Deeper Hole,” Matthew Jockers created an analysis tool to predict archetypal book plots:

A rough primer: Jockers uses a tool called “sentiment analysis” to gauge “the relationship between sentiment and plot shape in fiction”; algorithms assign every word in a novel a positive or negative emotional value, and in compiling these values he’s able to graph the shifts in a story’s narrative. A lot of negative words mean something bad is happening, a lot of positive words mean something good is happening. Ultimately, he derived six archetypal plot shapes.

Academics, however, found some problems with Jockers’s tool, such as whether it is possible to assign every word an emotional value and whether all plots can really take basic forms. The problem is that words are as nuanced as human emotion, perspectives change in an instant, and sentiments are subjective. How would the tool rate sarcasm?
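The word scoring idea, and the sarcasm problem, can be sketched in a few lines of Python. This is not Jockers’s tool: the tiny lexicon, the window size, and the function names are assumptions chosen only to show the mechanics.

```python
import re

# Hypothetical mini lexicon: +1 for positive words, -1 for negative words.
LEXICON = {
    "love": 1, "happy": 1, "joy": 1, "great": 1,
    "death": -1, "sad": -1, "lost": -1, "terrible": -1,
}


def word_scores(text):
    """Score each word: lexicon value if listed, otherwise 0 (neutral)."""
    words = re.findall(r"[a-z']+", text.lower())
    return [LEXICON.get(w, 0) for w in words]


def plot_shape(text, window=3):
    """Average the word scores over a sliding window to trace the arc."""
    scores = word_scores(text)
    if len(scores) < window:
        return [sum(scores) / len(scores)] if scores else []
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


# The arc falls as the words turn negative.
arc = plot_shape("Happy joy then death", window=2)

# Sarcasm is scored at face value: two "positive" words, total of 2.
sarcasm_total = sum(word_scores("Oh great, I love Mondays"))
```

A word list has no notion of tone, so the sarcastic sentence registers as cheerful. That is the objection the academics raise.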

All stories have supposedly been broken down into seven basic plots, so why should it not be possible to do the same for book plots? Jockers has already identified six basic book plots, and there are some who are curiously optimistic about his analysis tool. It does raise the question of whether the tool will stanch authors’ creativity or make English professors derive even more subjective meaning from Ulysses.

Whitney Grace, April 10, 2015

Stephen E Arnold, Publisher of CyberOSINT at

Attensity Adds Semantic Markup

April 3, 2015

You have been waiting for more markup. I know I have, and that is why I read “Attensity Semantic Annotation: NLP-Analyse für Unternehmensapplikationen.”

So your wait and mine—over.

Attensity, a leader in figuring out what human discourse means, has rolled out a software development kit so you can do a better job with customer engagement and business intelligence. Attensity offers Dynamic Data Discovery. Unlike traditional analysis tools, Attensity does not focus on keywords. You know, what humans actually use to communicate.

Attensity uses natural language processing in order to identify concepts and issues in plain language. I must admit that I have heard this claim from a number of vendors, including long forgotten systems like DR LINK, among others.

The idea is that the SDK makes it easier to filter data to evaluate textual information and identify issues. Furthermore the SDK performs fast content fusion. The result is, as other vendors have asserted, insight. There was a vendor called Inxight which asserted quite similar functions in 1997. At one time, Attensity had a senior manager from Inxight, but I assume the attribution of functions is one of Attensity’s innovations. (Forgive me for mentioning vendors which some 20 somethings know quite well.)

If you are dependent upon Java, Attensity is an easy fit. I assume that if you are one of the 150 million plus Microsoft SharePoint outfits, Attensity integration may require a small amount of integration work.

According to Attensity, the benefit of Attensity’s markup approach is that the installation is on site and therefore secure. I am not sure about this because security is person dependent, so cloud or on site, security remains an “issue” different from the ones Attensity’s system identifies.

Attensity, like Oracle, provides a knowledge base for specific industries. Oracle made term lists available for a number of years. Maybe since its acquisition of Artificial Linguistics in the mid 1990s?

Attensity supports five languages. For these five languages, Attensity can determine the “tone” of the words used in a document. Presumably a company like Bitext can provide additional language support if Attensity does not have these ready to install.

Vendors continue to recycle jargon and buzzwords to describe certain core functions available from many different vendors. If your metatagging outfit is not delivering, you may want to check out Attensity’s solution.

Stephen E Arnold, April 3, 2015

SAS Text Miner Provides Valuable Predictive Analytics

March 25, 2015

If you are searching for predictive analytics software that provides in-depth text analysis with advanced linguistic capabilities, you may want to check out “SAS Text Miner.” Predictive Analytics Today runs down the features of SAS Text Miner and details how it works.

It is user-friendly software with data visualization, flexible entity options, document theme discovery, and more.

“The text analytics software provides supervised, unsupervised, and semi-supervised methods to discover previously unknown patterns in document collections.  It structures data in a numeric representation so that it can be included in advanced analytics, such as predictive analysis, data mining, and forecasting.  This version also includes insightful reports describing the results from the rule generator node, providing clarity to model training and validation results.”

SAS Text Miner includes other features that draw on automatic Boolean rule generation to categorize documents, and the resulting rules can be exported as Boolean rules. Data sets can be made from a directory or crawled from the Web. The visual analysis feature highlights the relationships between discovered patterns and displays them using a concept link diagram. SAS Text Miner has received high praise as predictive analytics software, and it might be the solution your company is looking for.

Whitney Grace, March 25, 2015

Aberdeen Consulting Labors to Pump up the Watson Balloon

March 7, 2015

I read “IBM Watson and Answering the Questions of the World with Cognitive Computing.” Darned amazing mid tier consulting firm dream spinning it is. Here’s the paragraph I noted, which is a quote from an IBM Watson guru named Rob High, the chief technical officer for Watson:

…We’re going to see this cognitive computing capability be brought down deeper into things that we do on a daily basis. I think we’re going to find this to be the dominant form of computing in the future, especially given that to personalize all of those things is not something you can conceivably do if you had to program all the logic around that for each individual person. These systems are only going to be able to achieve that kind of personalized value if they’re able to learn, learn about you, learn about your way of interpreting the world and the way that you envision the world and the priorities that are important to you, that perhaps you find useful in how you conduct your life. That’s why I think that’s the role that cognitive computing is going to have for us, is to provide that degree of personalization.

I can hear personalized trumpet fanfares…almost, maybe. I think I hear the heavy breathing of the number one trumpeter.

Now where’s the real sound, the sound of cash registers ringing as companies spend for Watson’s wizardry?

Stephen E Arnold, March 7, 2015

Opening Watson to the Masses

March 4, 2015

IBM is struggling financially, and one of the ways it hopes to pull itself out of the swamp is to find new applications for its supercomputers and software. One way it is trying to cash in on Watson is to create cognitive computing apps. EWeek alerts open source developers, coders, and friendly hackers that IBM released a bunch of beta services: “13 IBM Services That Simplify The Building Of Cognitive Watson Apps.”

IBM now allows all software geeks the chance to add their own input to cognitive computing. How?

“Since its creation in October 2013, the Watson Developer Cloud (WDC) has evolved into a community of over 5,000 partners who have unlocked the power of cognitive computing to build more than 6,000 apps to date. With a total of 13 beta services now available, the IBM Watson Group is quickly expanding its developer ecosystem with innovative and easy-to-use services to power entirely new classes of cognitive computing apps—apps that can learn from experience, understand natural language, identify hidden patterns and trends, and transform entire industries and professions.”

The thirteen new IBM services involve language, text processing, analytical tools, and data visualization. These services can be applied to a wide range of industries and fields, improving the way people work and interact with their data. While it’s easy to imagine the practical applications, it remains to be seen how they will actually be used.

Whitney Grace, March 04, 2015
Sponsored by, developer of Augmentext
