From Jeopardy to the Hospital: An Interesting Text Retrieval Route
March 16, 2011
Healthcare researchers now have a valuable tool at their disposal, asserts eWeek.com in “IBM Collaborates with BJC, WUSM on Health Care Data Analytics.”
Working with BJC HealthCare and the Washington University School of Medicine Center for Biometrics, IBM is using its content analytics for good, extracting medical data from a whopping 50 million documents, including clinical notes, electronic health records, and diagnostic reports:
By being able to extract key data from up to 50 million documents in medical records, BJC and WUSM will be able to increase the speed of research, and therefore boost patient care. ‘You can never read 50 million documents and understand what the trends and patterns were across 50 million documents; it’s impossible,’ Rhinehart explained. ‘You couldn’t even take 500 people to do it, because there is never an efficient way to consistently understand the behavior in those documents and then figure out all the trends and patterns’.
The assembled information can be used to draw conclusions or test a hypothesis, for example. It’s about time semantic technology was applied to medical research. What better field?
Now we have some observations. First, applying semantic or other next generation search methods to medical content is somewhat less onerous than trying to figure out colloquial blog posts in Farsi. Second, IBM sells Lucene as OmniFind 9. If the technology is up to medical snuff, IBM needs to apply this method to its own Web site’s search and retrieval. We find access to IBM content on IBM’s own Web site sufficiently frustrating to give us a headache. Third, IBM is sending mixed messages. Is it search, text mining, data mining, or game show winning?
We think it is public relations and eWeek is happy to disseminate the joy.
Stephen E Arnold, March 16, 2011
Freebie, unlike open source search wrapped in an OmniFind package
Digital Reasoning Garners Patent for Groundbreaking Invention
March 16, 2011
There are outfits in the patent fence business. Google, Hitachi, and IBM come to mind. The patent applications are interesting because they provide a window through which one can gaze at some of the thinking of a firm’s legal, engineering, and management professionals.
Then there are outfits that come up with useful and novel systems and methods. The Digital Reasoning patent US7882055, “Knowledge Discovery Agent System and Method”, granted on February 1, 2011, falls into this category. The patent application was filed in July 2007, so it took the ever-efficient USPTO about 43 months to figure out what struck me when I first read the application. But the USPTO makes its living with dogged thoroughness. I supplement my retirement income by tracking and following really smart people like Tim Estes. I make my judgments about search and content processing based on my experience, knowledge of what other outfits have claimed as a unique system and method, and talking with the inventor. You can read two of my conversations with Tim Estes in the ArnoldIT.com Search Wizards Speak series. The links to my 2010 interview and my 2011 interview are at www.arnoldit.com/search-wizards-speak. (I did an interview with a remarkable engineer, Abe Music, at Digital Reasoning here.) Keep in mind that I was able to convert my dogging of this company to a small project this year. Hooray!
The guts of the invention are:
A system and method for processing information in unstructured or structured form, comprising a computer running in a distributed network with one or more data agents. Associations of natural language artifacts may be learned from natural language artifacts in unstructured data sources, and semantic and syntactic relationships may be learned in structured data sources, using grouping based on criteria of shared features that are dynamically determined without the use of a priori classifications, by employing conditional probability constraints.
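The claim language is abstract, so here is a minimal sketch, in ordinary Python, of the flavor of technique the abstract describes: grouping terms by shared co-occurrence features using conditional probabilities, with no a priori categories. The corpus, names, and threshold are invented for illustration; this is emphatically not Digital Reasoning’s code.

```python
from collections import defaultdict
from itertools import combinations

# Invented mini-corpus; the patented system targets corpora of millions of documents.
docs = [
    "acme corp acquired beta systems",
    "beta systems sells search software",
    "acme corp sells analytics software",
    "gamma llc acquired delta inc",
]

# Count how often each token appears and how often token pairs co-occur in a document.
token_count = defaultdict(int)
pair_count = defaultdict(int)
for doc in docs:
    tokens = set(doc.split())
    for t in tokens:
        token_count[t] += 1
    for a, b in combinations(sorted(tokens), 2):
        pair_count[(a, b)] += 1

def cond_prob(a, b):
    """Estimate P(b occurs | a occurs) from document-level co-occurrence."""
    pair = (a, b) if a < b else (b, a)
    return pair_count[pair] / token_count[a]

# Group tokens whose mutual conditional probabilities clear a threshold.
# No predefined classes: the groups emerge from the data itself.
THRESHOLD = 0.8
groups = defaultdict(set)
for a, b in pair_count:
    if cond_prob(a, b) >= THRESHOLD and cond_prob(b, a) >= THRESHOLD:
        groups[a].add(b)
        groups[b].add(a)

for token, related in sorted(groups.items()):
    print(token, "->", sorted(related))
```

On this toy corpus, "acme" groups with "corp", "beta" with "systems", and "gamma", "llc", "delta", and "inc" cluster together, without anyone telling the code those are company names.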
I learned from my contacts at Digital Reasoning:
The pioneering invention entails intelligent software agents that extract meaning from text as humans do – by analyzing concepts and entities in context. The software learns as it runs, continually comparing new text to existing knowledge. Associated entities and synonym relationships are automatically discovered and relevant documents are identified from across extremely large corpora.
The patent specifically covers the mechanism of measurement and the applications of algorithms to develop machine-understandable structures from patterns of symbol usage. In addition, it covers the semantic alignment of those learned structures from unstructured data with pre-existing structured data – a necessary step in creating enterprise-class entity-oriented systems. The technology as implemented in Synthesys (TM) provides a unique and now protected means of bringing automated understanding to end users in the enterprise and beyond.
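The “associated entities and synonym relationships” idea can also be made concrete. Below is a rough, hypothetical illustration of discovering synonym-like pairs from shared contexts. The sentences are invented, and the measure (Jaccard overlap of context words) is my stand-in for illustration, not necessarily the mechanism the patent protects.

```python
from collections import defaultdict

# Invented sentences; real corpora would be far larger.
sentences = [
    "the attorney filed the motion",
    "the lawyer filed the motion",
    "the attorney argued the case",
    "the lawyer argued the case",
    "the judge denied the motion",
]

# Collect the context words seen around each word.
contexts = defaultdict(set)
for s in sentences:
    words = s.split()
    for i, w in enumerate(words):
        contexts[w].update(words[:i] + words[i + 1:])

def similarity(a, b):
    """Jaccard overlap of the contexts two words appear in."""
    ca, cb = contexts[a], contexts[b]
    return len(ca & cb) / len(ca | cb) if ca | cb else 0.0

# Words that share most of their contexts behave like synonyms here.
print(round(similarity("attorney", "lawyer"), 2))  # 1.0: identical contexts
print(round(similarity("attorney", "judge"), 2))   # 0.33: much lower overlap
```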
So what’s this mean?
[A comparison table in the original post contrasts the Traditional Method with the Digital Reasoning Method.]
In financial analysis, health information, and intelligence applications, which do you want you and your colleagues to use? I go for the Veyron. The 1998 Mustang is great as a backup or knockabout. The Veyron means business, in my opinion.
Three points:
- This is a true “beyond text” system and method. Keyword search and 1998-type methods cannot deliver Synthesys 3.0 (TM) functionality
- Users don’t want laundry lists. The invention delivers actionable information. The value of the method is proven each day in certain very important applications which involve the top concerns of Maslow’s hierarchy
- The system can make use of human inputs but can operate in automatic mode. Many systems include automatic functions, but the method invented by Mr. Estes is a new one. Think of the difference in performance between a 1998 Mustang and the new Bugatti Veyron. Both are automobiles, but there is a difference between the state of the art a long time ago and the state of the art now.
If you want more information about Digital Reasoning, the company’s Web site is www.digitalreasoning.com.
Stephen E Arnold, March 15, 2011
Freebie, but I want a T-shirt from Music Row in Nashville
Metadata Are Important. Good to Know.
March 16, 2011
I read “When it Comes to Securing and Managing Data, It’s all about the Metadata.” The goslings and I have no disagreement about the importance of metadata. We do prefer words and phrases like controlled term lists, controlled vocabularies, classification systems, indexing, and geotagging. But metadata is hot, so metadata the term shall be.
There is a phrase that is useful when talking about indexing and the sorts of things in our preferred terms list. That phrase is “editorial policy.” Today’s pundits, former English majors, and unemployed Webmasters like the word “governance.” I find the word disconcerting because “governance” is unfamiliar to me. The word is fuzzy and, therefore, ideal for the poobahs who advise organizations, unable to find content, on the reasons for the lousy performance of one or more enterprise search systems.
The article gallops through these concepts. I learned about the growing issue of managing and securing structured and semi-structured data within the enterprise. (Isn’t this part of security?) I learned that collaborative content technologies are on the increase, which is an echo of locking a file that several people edit in an authoring system.
I did notice this factoid:
IDC forecasts that the total digital universe volume will increase by a factor of 44 in 2020. According to the report, unstructured data and metadata have an average annual growth rate of 62 percent. More importantly, high-value information is also skyrocketing. In 2008, IDC found that 22 to 33 percent of the digital universe was high-value information (data and content that are governed by security, compliance and preservation obligations). Today, IDC forecasts that high-value information will comprise close to 50 percent of the digital universe by the end of 2020.
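A quick sanity check (my arithmetic, not IDC’s): compounding the quoted 62 percent average annual growth rate shows the figures are at least in the same neighborhood as one another.

```python
# 62 percent average annual growth compounds quickly: in roughly eight
# years it multiplies volume by about 47, near IDC's "factor of 44."
annual_growth = 1.62
print(round(annual_growth ** 8))  # -> 47
```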
There you go. According to the article, metadata framework technology is a large part of the answer to this problem: collecting user and group information, permissions information, access activity, and sensitive content indicators.
My view is to implement an editorial policy for content. Skip the flowery and made-up language. Get back to basics. That would be what I call indexing, a component addressed in an editorial policy. Leave the governance to the government. The government is so darn good at everything it undertakes.
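To make “editorial policy” concrete, here is a minimal sketch of indexing against a controlled vocabulary. The terms and the sample document are invented; the point is simply that tags come from a maintained list, not from whatever word is fashionable this quarter.

```python
# Hypothetical controlled vocabulary, the sort of thing an editorial
# policy would define and maintain.
CONTROLLED_TERMS = {
    "enterprise search": ["search engine", "findability", "retrieval"],
    "metadata": ["tagging", "indexing", "taxonomy"],
    "governance": ["compliance", "policy", "oversight"],
}

def index_document(text):
    """Assign controlled terms to a document based on preferred
    terms and their variants -- no free-form keywords allowed."""
    text = text.lower()
    assigned = []
    for preferred, variants in CONTROLLED_TERMS.items():
        if preferred in text or any(v in text for v in variants):
            assigned.append(preferred)
    return assigned

print(index_document("Our findability project needs better tagging."))
# -> ['enterprise search', 'metadata']
```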
Stephen E Arnold, March 16, 2011
Freebie
Microsoft SharePoint Suggestions
March 16, 2011
Here’s a useful item for you SharePoint fans and consultants. The write-up “Tools and Web Parts for SharePoint 2007 and 2010” explains Web parts. This is Microsoft-speak for code gadgets. What makes the article useful is that it provides a succinct summary of how a programmer can set up SharePoint to offer a user more suggestions for a query. The key point is:
The only way to add more words to the suggestion feature is to use a PowerShell script to add them and run the job manually or with the script. This tool was created to handle all suggestion words in each Search Service Application created in SharePoint 2010.
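SharePoint’s actual mechanism is the PowerShell route the quote describes. Purely as a conceptual illustration of what a suggestion store does once phrases are loaded, here is a hypothetical sketch; the phrases are invented, and this is not SharePoint’s API.

```python
# A minimal sketch of a type-ahead suggestion store. SharePoint keeps
# such phrases per Search Service Application; these are made up.
SUGGESTION_PHRASES = [
    "expense report form",
    "expense policy 2011",
    "employee handbook",
    "engineering wiki",
]

def suggest(prefix, phrases=SUGGESTION_PHRASES, limit=5):
    """Return stored phrases that start with what the user has typed so far."""
    prefix = prefix.lower().strip()
    return [p for p in phrases if p.startswith(prefix)][:limit]

print(suggest("exp"))  # -> ['expense report form', 'expense policy 2011']
```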
The article has a link to download the tool, as well as an explanation and ten images showing how to use it. Web parts, or Web widgets, can add needed functionality. Our question: “Why isn’t a more robust suggestion tool included with SharePoint?” We think the answer is that Microsoft likes to leave third parties with opportunities to earn money from the millions of SharePoint licensees. The tactic, in our opinion, is intentional incompleteness.
Stephen E Arnold, March 15, 2011
Freebie
It Is Not You, Computerworld. It Is Me
March 15, 2011
Okay, maybe it is you. I was directed to “8 IT Clichés That Must Go”. In light of the changing of the sports seasons, each replete with its own set of hackneyed musings, Computerworld has reposted a CIO.com listing of IT-specific clichés to show us we aren’t so different after all.
The most universal of the bunch:
“We delivered the project on time and on budget.” This really means: “we managed to deliver a subset of the original user requirements while spending the same amount of money over a longer period of time to which we got them to agree.”
Granted, perhaps a chief information officer would find this smattering of overused sayings more entertaining than I. In an effort to give credit where credit is due, when you want the best in clichés, go to the source.
Check it out for yourself, form your own opinion. After all, whatever doesn’t kill you makes you stronger, right? Maybe that’s why enterprise search is a walk in the park.
Sarah Rogers, March 15, 2011
Freebie
Are Precision and Recall Making a Comeback?
March 15, 2011
Microsoft-centric BA Insight has explored two touch points of traditional information retrieval: precision and recall. These terms have quite specific meanings to those who care about the details of figuring out which indexing method actually delivers useful results. The Web world and most organizations care not a whit about fooling around with the precision equation. And recall? That is another numerical recipe that causes a procurement team’s eyes to glaze over.
I was interested to read the SharePoint and FAST Search Experts Blog’s “What Is the Difference Between Precision and Recall?” This is a very basic question for determining the relevance of query search results.
Equations aside, precision is the percentage of retrieved documents that are relevant, and recall is the percentage of relevant documents that are retrieved. In other words, when a search is high in precision, the results list will contain a large percentage of items relevant to what you typed, but it may miss many relevant items in the total collection.
With a search that is high in recall, the results list will include more of the items you are looking for, but it will also include many irrelevant items. The post points out that determining the usefulness of search results is actually simpler than this sounds:
“The truth is, you don’t have to calculate relevance to determine how a SharePoint or FAST search implementation is performing. You can look at a much more telling KPI. Are users actually finding what they are looking for?”
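Before accepting that shortcut, it is worth seeing how small the arithmetic actually is. Here is a minimal worked example of the two measures; the document identifiers are invented.

```python
# precision = |retrieved AND relevant| / |retrieved|
# recall    = |retrieved AND relevant| / |relevant|
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["d1", "d2", "d3", "d4", "d5"]   # what the engine returned
relevant = ["d1", "d2", "d6", "d7"]          # what actually answers the query
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")   # precision=0.40 recall=0.50
```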
The problem, in my opinion, is that most enterprise search deployments lack a solid understanding of the corpus to be processed. As a result, test queries are difficult to run in a “lab-type” setting. A few random queries are close enough for horseshoes. The cost and time required to benchmark a system and then tune it for optimal precision and recall is a step usually skipped.
Kudos to BA Insight for bringing up the subject of precision and recall. My view is that the present environment for enterprise search puts more emphasis on point-and-click interfaces and training wheels for users who lack the time, motivation, or expertise to formulate effective queries. Even worse, the content processed by the index is usually an unexplored continent. There are more questions like “Why can’t I find that PowerPoint?” than shouts of “Eureka!” Just my opinion.
Stephen E Arnold, March 15, 2011
Freebie
Exclusive Interview with Kamran Khan
March 15, 2011
Enterprise search vendors are changing their market positioning more quickly than at any other time. The vendors’ technology gets new features and functions. With an already complex system, a licensee often needs the help of specialists to get the system up and running. Other companies may have a search system, find it unsuitable, and need help preparing a business case for a new procurement. In short, almost any facet of an enterprise search project may need specialized expertise.
Search Technologies Corp., a privately held firm, has experienced steady, rapid growth over the last five years. The economic downturn had little effect on the company, which now has offices across the US and in the United Kingdom. I was able to talk with the founder of this professional engineering services firm in order to get some insight into why Search Technologies has been unaffected by the economic storms that ripple across the business landscape.
I asked Kamran Khan, the founder of Search Technologies, “What’s the secret of Search Technologies’ success?” He told me:
The founders and the management team are all veterans of the enterprise search industry with between 15 and 22 years’ experience. I entered the industry in the early 1990s. We all used to work for major search engine vendors. It seemed to us that most search engines contained great technology but were poorly implemented, and we thought that forming a company focused on helping people to implement search software made a lot of sense. Today, we have more than 80 staff, but implementing search solutions is still all we do. Word of mouth about our competence is another important factor in our success.
(You can read the full text of the interview at this link.)
Many companies assert their expertise in dealing with the search and content processing systems of such companies as Microsoft, Google, Autonomy, and others. But customers want more than key words. One hot trend is data fusion, or a mash-up of disparate information instead of a simple results list. I asked Mr. Khan about this demand. He told me:
Absolutely. Many customers are demanding more than a laundry list of results. Mash ups, data fusion and other sophisticated approaches to information presentation will no doubt proliferate. In our experience, the importance of data structure creation is often under-estimated though. People get excited by cool new features but don’t follow through and plan properly to create the necessary data structures to support the cool features, or put processes in place to maintain data structure quality through time as their data set evolves. Data sets have an annoying habit of evolving, just when you thought you’d nailed the search engine implementation. So a substantial part of our business involves helping customers with existing search systems to address challenges, such as relevancy issues. A lack of attention to detail in preparing the data set for search is often the root cause. This has become a significant part of our business and we’ve established a specific services practice around something we call Document Preparation Methodology for Search. Maintaining data structure requires proven ongoing processes, and not just technology.
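Mr. Khan’s point about data structure quality lends itself to a small illustration. The sketch below shows the kind of pre-indexing validation a document preparation step might perform; the field names and rules are hypothetical, not Search Technologies’ methodology.

```python
# Hypothetical required metadata for documents entering a search index.
REQUIRED_FIELDS = {"title", "author", "date", "doc_type"}

def prepare(doc):
    """Validate and normalize a document record before indexing.
    Returns (ok, problems) so bad records can be fixed, not indexed."""
    problems = [f for f in REQUIRED_FIELDS if not doc.get(f)]
    # Normalize what we can: consistent casing keeps facets from splitting.
    if doc.get("doc_type"):
        doc["doc_type"] = doc["doc_type"].strip().lower()
    return (not problems, problems)

doc = {"title": "Q4 Sales Review", "author": "J. Smith", "doc_type": " PDF "}
ok, problems = prepare(doc)
print(ok, problems)  # False ['date'] -- catch the gap before it hurts relevancy
```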
To read the full text of my interview with one of the leaders in the search engineering and services sector, navigate to Search Wizards Speak on the ArnoldIT.com Web site. For more information about Search Technologies, visit www.searchtechnologies.com.
Stephen E Arnold, March 15, 2011
SharePoint for Business Intelligence?
March 15, 2011
“Revamped Business Intelligence Scenario Hub on TechNet” announces six end-to-end BI scenarios from Microsoft as part of its new content hub. For example, one is “Analyze Sales Performance by Using a Dashboard Built on SharePoint Server, SQL Server, and Office.”
Scenarios can be helpful for working out solutions, but wouldn’t it be nice if solutions were already available? SurfRay provides tools that function as business intelligence systems based on the knowledge creation and retrieval process.
For more information about how SurfRay’s business intelligence solutions leverage search, visit www.surfray.com.
Torben Ellert, March 15, 2011
SurfRay
Next Up for YouTube Next
March 14, 2011
Google’s umbrella just grew a little larger. “YouTube Buys NextNewNetworks, Launches YouTube Next” gives a glimpse of the video-sharing host’s latest move: YouTube Next. This incarnation is intended to prop up creator development while increasing partner growth. Built off of NextNewNetworks’ model of “developing, packaging and building audiences for original web video programming”, YouTube is going so far as to stamp its brand name on certain programs.
I am all for cutting out the restriction-wielding, high-dollar middleman, at least as much as is feasible. The introduction of new services like Amazon VOD empowers the small video creator and offers heretofore unseen access to an expansive audience. We are beginning to see the fall of the movie production and distribution industries as we know them, adding more tumbleweeds to the desolate landscape of recorded songs and printed words. The signs had been emerging with the advent of the minor players, and now Google is throwing its weight behind the endeavor. It will be interesting to see the veritable rainbow of protests from the MPAA as its long-awaited irrelevance solidifies. I sense some long and protracted litigation on the horizon.
There is speculation that Google was shaken when Charlie Sheen spurned YouTube and chose UStream to broadcast his trifling monologue, and that the snub prompted the purchase of this new asset. I maintain that Google’s ocular dollar signs erupted and the company simply wants a larger piece of the video action. And that Charlie Sheen is not important. Per the article:
“Google announced the acquisition today, saying that the acquisition was part of a larger effort to support content partners, of which there are more than 15,000 worldwide. Over 2010, the number of partners raking in more than $1,000 a month increased by 300% and YouTube is looking to push that number even higher with YouTube Next.”
NextNewNetworks will be delivering its adoptive parent six million subscribers, and its partners potentially exponential growth. The positive effects of this transaction for creators, then, seem to be merely coincidental. Still, if Google is willing to pitch the people a bone in exchange for an extra cent, perhaps at day’s end this is a symbiotic relationship.
Sarah Rogers, March 14, 2011
Freebie