From Jeopardy to the Hospital: Interesting Text Retrieval Route
March 16, 2011
Healthcare researchers now have a valuable tool at their disposal, asserts eWeek.com in “IBM Collaborates with BJC, WUSM on Health Care Data Analytics.”
Working with BJC Healthcare and the Washington University School of Medicine Center for Biometrics, IBM is using its content analytics for good, extracting medical data from a whopping 50 million documents, including clinical notes, electronic health records, and diagnostic reports:
By being able to extract key data from up to 50 million documents in medical records, BJC and WUSM will be able to increase the speed of research, and therefore boost patient care. ‘You can never read 50 million documents and understand what the trends and patterns were across 50 million documents; it’s impossible,’ Rhinehart explained. ‘You couldn’t even take 500 people to do it, because there is never an efficient way to consistently understand the behavior in those documents and then figure out all the trends and patterns’.
The assembled information can be used to draw conclusions or test a hypothesis, for example. It’s about time semantic technology was applied to medical research. What better field?
Now we have some observations. First, applying semantic or other next generation search methods to medical content is somewhat less onerous than trying to figure out colloquial blog posts in Farsi. Second, IBM sells Lucene as OmniFind 9. If the technology is up to medical snuff, IBM needs to apply this method to its Web site’s search and retrieval. We find the access to IBM content on IBM’s own Web site sufficiently frustrating to give us a headache. Third, IBM is sending mixed messages. Is it search, text mining, data mining, or game show winning?
We think it is public relations and eWeek is happy to disseminate the joy.
Stephen E Arnold, March 16, 2011
Freebie unlike open source search wrapped in an OmniFind package
Digital Reasoning Garners Patent for Groundbreaking Invention
March 16, 2011
There are outfits in the patent fence business. Google, Hitachi, and IBM come to my mind. The patent applications are interesting because they provide a window through which one can gaze at some of the thinking of the firm’s legal, engineering and management professionals.
Then there are outfits who come up with useful and novel systems and methods. The Digital Reasoning patent US7882055, “Knowledge Discovery Agent System and Method”, granted on February 1, 2011, falls into this category. The patent application was filed in July 2007, so it took the ever efficient USPTO about 43 months to figure out what struck me when I first read the application. But the USPTO makes its living with dogged thoroughness. I supplement my retirement income by tracking and following really smart people like Tim Estes. I make my judgments about search and content processing based on my experience, knowledge of what other outfits have claimed as a unique system and method, and talking with the inventor. You can read two of my conversations with Tim Estes in the ArnoldIT.com Search Wizards Speak series. Links to my 2010 interview and my 2011 interview are at www.arnoldit.com/search-wizards-speak. (I did an interview with a remarkable engineer, Abe Music, at Digital Reasoning here.) Keep in mind that I was able to convert my dogging of this company to a small project this year. Hooray!
The guts of the invention are:
A system and method for processing information in unstructured or structured form, comprising a computer running in a distributed network with one or more data agents. Associations of natural language artifacts may be learned from natural language artifacts in unstructured data sources and semantic and syntactic relationship may be learned in structured data sources, using grouping based on a criteria of shared features that are dynamically determined without the use of a priori classifications, by employing conditional probability constraints.
I learned from my contacts at Digital Reasoning:
The pioneering invention entails intelligent software agents that extract meaning from text as humans do – by analyzing concepts and entities in context. The software learns as it runs, continually comparing new text to existing knowledge. Associated entities and synonym relationships are automatically discovered and relevant documents are identified from across extremely large corpora.
The patent specifically covers the mechanism of measurement and the applications of algorithms to develop machine-understandable structures from patterns of symbol usage. In addition, it covers the semantic alignment of those learned structures from unstructured data with pre-existing structured data – a necessary step in creating enterprise-class entity-oriented systems. The technology as implemented in Synthesys (TM) provides a unique and now protected means of bringing automated understanding to end users in the enterprise and beyond.
So what’s this mean?
| The Traditional Method | The Digital Reasoning Method |
In financial analysis, health information, and intelligence applications, which do you want you and your colleagues to use? I go for the Veyron. The 1998 Mustang is great as a back up or knock about. The Veyron means business in my opinion.
Three points:
- This is a true “beyond text” system and method. Key word search and 1998-type methods cannot deliver Synthesys 3.0 (TM) functionality.
- Users don’t want laundry lists. The invention delivers actionable information. The value of the method is proven each day in certain very important applications which involve the top concerns of Maslow’s hierarchy.
- The system can make use of human inputs but can operate in automatic mode. Many systems include automatic functions, but the method invented by Mr. Estes is a new one. Think of the difference in performance between a 1998 Mustang and the new Bugatti Veyron. Both are automobiles, but there is a difference between the state of the art a long time ago and the state of the art now.
If you want more information about Digital Reasoning, the company’s Web site is www.digitalreasoning.com.
Stephen E Arnold, March 15, 2011
Freebie but I want a T shirt from Music Row in Nashville
OpenText Buys Metastorm
March 16, 2011
OpenText has bought Metastorm and sooner than expected. “OpenText Says Metastorm Boosts MS Partnership, Centralizes SharePoint Management” outlines the colossal enterprise content management software company’s plans for its new acquisition. A key point in the article was:
In effect, what OpenText is talking about is turning its enterprise content management system from a behemoth into a super-behemoth.
That’s because OpenText is planning to combine its Enterprise Content Management (ECM) and its Business Process Management (BPM) due to client demand. As well, Metastorm’s BPM can integrate with SharePoint and OpenText offers users centralized management of SharePoint sites. Will this new iteration work? Maybe.
What we know is that SurfRay’s technology can make SharePoint content more accessible. To learn more, navigate to www.surfray.com.
Torben Ellert, March 16, 2011
Metadata Are Important. Good to Know.
March 16, 2011
I read “When it Comes to Securing and Managing Data, It’s all about the Metadata.” The goslings and I have no disagreement about the importance of metadata. We do prefer words and phrases like controlled term lists, controlled vocabularies, classification systems, indexing, and geotagging. But metadata is hot so metadata the term shall be.
There is a phrase that is useful when talking about indexing and the sorts of things in our preferred terms list. That phrase is “editorial policy.” Today’s pundits, former English majors, and unemployed Webmasters like the word “governance.” I find the word disconcerting because “governance” is unfamiliar to me. The word is fuzzy and, therefore, ideal for the poobahs who advise organizations unable to find content on the reasons for the lousy performance of one or more enterprise search systems.
The article gallops through these concepts. I learned about the growing issue of managing and securing structured and semi structured data within the enterprise. (Isn’t this part of security?) I learned that collaborative content technologies are on the increase. (This is an echo of locking a file which several people edit in an authoring system.)
I did notice this factoid:
IDC forecasts that the total digital universe volume will increase by a factor of 44 in 2020. According to the report, unstructured data and metadata have an average annual growth rate of 62 percent. More importantly, high-value information is also skyrocketing. In 2008, IDC found that 22 to 33 percent of the digital universe was high-value information (data and content that are governed by security, compliance and preservation obligations). Today, IDC forecasts that high-value information will comprise close to 50 percent of the digital universe by the end of 2020.
There you go. According to the article, metadata framework technology is a large part of the answer to this problem because it collects user and group information, permissions information, access activity, and sensitive content indicators.
My view is to implement an editorial policy for content. Skip the flowery and made-up language. Get back to basics. That would be what I call indexing, a component addressed in an editorial policy. Leave the governance to the government. The government is so darn good at everything it undertakes.
Stephen E Arnold, March 16, 2011
Freebie
Microsoft SharePoint Suggestions
March 16, 2011
Here’s a useful item for you SharePoint fans and consultants. The write up “Tools and Web Parts for SharePoint 2007 and 2010” explains Web parts. This is Microsoft speak for code gadgets. What makes the article useful is that it provides a succinct summary of how a programmer can set up SharePoint to make available more suggestions to a user for his/her query. The key point is:
The only way to add more words to the suggestion feature is using a PowerShell script to add them and run the job manually or with the script. This tool was created to handle all words of suggestions in each Search Services Applications created in SharePoint 2010.
The article has a link to download the tool, as well as an explanation and ten images to explain how to use it. Web parts or web widgets can add needed functionality. Our question: “Why isn’t a more robust suggestion tool included with SharePoint?” We think the answer is that Microsoft likes to leave third parties with opportunities to earn money from the millions of SharePoint licensees. The tactic, in my opinion, is intentional incompleteness.
Stephen E Arnold, March 15, 2011
Freebie
It Is Not You, Computerworld. It Is Me
March 15, 2011
Okay, maybe it is you. I was directed to “8 IT Clichés That Must Go”. In light of the changing of the sports seasons, each replete with personal sets of hackneyed musings, Computerworld has reposted a CIO.com listing of IT specific clichés to show us we aren’t so different after all.
The most universal of the bunch:
“We delivered the project on time and on budget.” This really means: “we managed to deliver a subset of the original user requirements while spending the same amount of money over a longer period of time to which we got them to agree.”
Granted, perhaps a chief information officer would find this smattering of overused sayings more entertaining than I. In an effort to give credit where credit is due, when you want the best in clichés, go to the source.
Check it out for yourself, form your own opinion. After all, whatever doesn’t kill you makes you stronger, right? Maybe that’s why enterprise search is a walk in the park.
Sarah Rogers, March 15, 2011
Freebie
EPi and Smartlogic
March 15, 2011
Years ago I did some work for EPi, a Swedish company. I learned quite a bit from my visit with the firm. I also have done a little work for Smartlogic, a company with software and systems that make SharePoint and other content centric systems work better. What caught my attention was a story that mentioned the two companies in one write up.
The idea expressed in “Implementing the Semantic Web in EPiServer 6 Using Smartlogic and the Google Search Appliance” was interesting. In a nutshell, EPi crafted one seamless system by connecting EPiServer with Smartlogic’s semantic functions. Smartlogic’s Semaphore provided the mechanism to define and store taxonomical and ontological information, automatically classify documents, and enhance search facilities. Then, the Google Search Appliance was integrated to provide the basic search indexing function.
According to the write up:
“As the GSA indexes are created we are then able to call services in the service layer in the Rufus Leonard Smartlogic solution to receive meaningful results so that we can bring back content that otherwise would not have been found in a standard search. Not only do we use these search results when a user actually queries the site, but we also use these to drive users to other content by suggesting related pages or populating content areas with pages that might be of interest.”
A tip of the hat to the EPi and Smartlogic engineers. The Google Search Appliance is an interesting search product. However, fresh out of the oven, the GSA is not going to provide the type of functionality that some licensees require. With the integration of a content management system and a rich semantic and term management component, the GSA gains important functionality.
For more information about EPi, navigate to www.episerver.com. For more information about Smartlogic, go to www.smartlogic.com.
Stephen E Arnold, March 15, 2011
Believe it or not a freebie
Are Precision and Recall Making a Comeback?
March 15, 2011
Microsoft-centric BA Insight explored these touch points of traditional information retrieval. Precision and recall have quite specific meanings to those who care about the details of figuring out which indexing method actually delivers useful results. The Web world and most organizations care not a whit about fooling around with this equation.
And recall. This is another numerical recipe that causes the procurement team’s eyes to glaze.
I was interested to read The SharePoint and FAST Search Experts Blog’s “What is the Difference Between Precision and Recall?” This is a very basic question for determining the relevance of query search results.
Equations aside, precision is the percentage of retrieved documents that are relevant, and recall is the percentage of all relevant documents that are retrieved. In other words, when you have a search that’s high in precision, your results list will have a large percentage of items relevant to what you typed in, but you may also be missing a lot of the relevant items in the total.
With a search that is high in recall, your results list will have more of the items you’re searching for, but will also have a lot of irrelevant items. The post points out that determining the usefulness of search results is actually simpler than this sounds:
“The truth is, you don’t have to calculate relevance to determine how SharePoint or FAST search implementation is performing. You can look at a much more telling KPI. Are users actually finding what they are looking for?”
The problem, in my opinion, is that most enterprise search deployments lack a solid understanding of the corpus to be processed. As a result, test queries are difficult to run in a “lab-type” setting. A few random queries are close enough for horseshoes. Benchmarking a system and then tuning it for optimal precision and recall takes time and money, so the step is usually skipped.
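For readers who want to see what the arithmetic looks like for a single test query, here is a minimal sketch in Python. It assumes someone has already judged which documents in the corpus are relevant to the query, which is exactly the step most deployments skip. The function name and the document IDs are made up for illustration. They do not come from the BA Insight post or from any SharePoint or FAST tool.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document IDs the search system returned
    relevant:  document IDs a human judged relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually returned
    # Guard against empty sets so the function never divides by zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical test query: these IDs are invented for the example.
returned_by_search = ["doc1", "doc2", "doc3", "doc4"]
judged_relevant = ["doc2", "doc4", "doc7", "doc8", "doc9"]

p, r = precision_recall(returned_by_search, judged_relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # prints precision=0.50 recall=0.40
```

Two of the four returned documents are relevant, so precision is 50 percent. Two of the five relevant documents were returned, so recall is 40 percent. The hard part is not the division. It is building the list of judged-relevant documents for a corpus nobody has explored.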
Kudos to BA Insight for bringing up the subject of precision and recall. My view is that the present environment for enterprise search puts more emphasis on point and click interfaces and training wheels for users who lack the time, motivation, or expertise to formulate effective queries. Even worse, the content processed by the index is usually an unexplored continent. There are more questions like “Why can’t I find that PowerPoint?” than shouts of Eureka! Just my opinion.
Stephen E Arnold, March 15, 2011
Freebie
Exclusive Interview with Kamran Khan
March 15, 2011
Enterprise search vendors are changing their market positioning more quickly than at any other time. The vendors’ technology gets new features and functions. With an already complex system, a licensee often needs the help of specialists to get the system up and running. Other companies may have a search system, find it unsuitable, and need help preparing a business case for a new procurement. In short, almost any facet of an enterprise search project may need specialized expertise.
Search Technologies Corp., a privately held firm, has experienced steady, rapid growth over the last five years. The economic downturn had little effect on the company which now has offices across the US and in the United Kingdom. I was able to talk with the founder of this professional engineering services firm in order to get some insight into why Search Technologies has been unaffected by the economic storms that ripple across the business landscape.
Kamran Khan, the founder of Search Technologies, told me in response to my question, “What’s the secret of Search Technologies’ success?”
The founders and the management team are all veterans of the enterprise search industry with between 15 and 22 years experience. I entered the industry in the early 1990s. We all used to work for major search engine vendors. It seemed to us that most search engines contained great technology but were poorly implemented, and we thought that forming a company focused on helping people to implement search software made a lot of sense. Today, we have more than 80 staff, but implementing search solutions is still all we do. Word of mouth about our competence is another important factor in our success.
(You can read the full text of the interview at this link.)
Many companies assert their expertise in dealing with the search and content processing systems of such companies as Microsoft, Google, Autonomy and others. But customers want more than key words. One hot trend is data fusion or a mash up of disparate information instead of a results list. I asked Mr. Khan about this demand. He told me:
Absolutely. Many customers are demanding more than a laundry list of results. Mash ups, data fusion and other sophisticated approaches to information presentation will no doubt proliferate. In our experience, the importance of data structure creation is often under-estimated though. People get excited by cool new features but don’t follow through and plan properly to create the necessary data structures to support the cool features, or put processes in place to maintain data structure quality through time as their data set evolves. Data sets have an annoying habit of evolving, just when you thought you’d nailed the search engine implementation. So a substantial part of our business involves helping customers with existing search systems to address challenges, such as relevancy issues. A lack of attention to detail in preparing the data set for search is often the root cause. This has become a significant part of our business and we’ve established a specific services practice around something we call Document Preparation Methodology for Search. Maintaining data structure requires proven ongoing processes, and not just technology.
To read the full text of my interview with one of the leaders in the search engineering and services sector, navigate to Search Wizards Speak on the ArnoldIT.com Web site. For more information about Search Technologies, visit www.searchtechnologies.com.
Stephen E Arnold, March 15, 2011
SharePoint for Business Intelligence?
March 15, 2011
“Revamped Business Intelligence Scenario Hub on TechNet” announces six end-to-end BI scenarios from Microsoft as part of their new content hub. For example, one is “Analyze Sales Performance by Using a Dashboard Built on SharePoint Server, SQL Server, and Office.”
Scenarios can be helpful for working out solutions, but wouldn’t it be nice if solutions were already available? SurfRay provides tools that function as business intelligence systems based on the knowledge creation and retrieval process.
For more information about how SurfRay’s business intelligence solutions leverage search, visit www.surfray.com.
Torben Ellert, March 15, 2011
SurfRay

