Microsoft SharePoint: Controlled Term Functionality

June 13, 2012

The article “SharePoint Search, Synonyms, Thesaurus, and You” provides a useful summary of Microsoft SharePoint’s native support for controlled term lists. Today, the buzzwords taxonomy and ontology are used to refer to term lists which SharePoint can use to index content. Term lists may consist of company-specific vocabulary, the names of people and companies with which a firm does business, or formal lists of words and phrases with “Use for” and “See also” cross references.

The importance of a controlled term list is often lost when today’s automated indexing systems process content. Almost any search system benefits when the content processing subsystem can use a controlled term list as well as the automated methods baked into the indexer.

In this TechGrowingPains write up, the author says:

A little known, and interesting, feature in SharePoint search is the ability to create customized thesaurus word sets. The word sets can either be synonyms, or word replacements, augmenting search functionality. This ability is not limited to single words; it can also be extended to specific phrases.

The article explains how controlled term lists can be used to assist a user in formulating a query. The method is called “replacement words”. The idea of suggesting terms is a good one, and many users find it a time saver when doing research. The synonym expansion function is mentioned as well. SharePoint can insert broader or substitute terms into a user’s query, which, depending on whether terms are expanded or replaced, increases or decreases the size of the result set.

The centerpiece of the article is a recipe for activating this functionality. A helpful code snippet is included as well.
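
The article’s snippet is SharePoint specific. As a rough, hypothetical illustration of the general mechanics only, the Python sketch below shows how synonym expansion and word replacement change a query before it runs; the term lists and function names are invented and are not SharePoint’s thesaurus format or the article’s code.

```python
# Illustrative sketch only: generic query-time synonym expansion and word
# replacement. The term lists below are hypothetical, not SharePoint config.

# Any term in a synonym set is expanded to the full set at query time.
SYNONYMS = {
    "laptop": {"laptop", "notebook"},
    "notebook": {"laptop", "notebook"},
}

# Replacement pairs swap the typed term for a preferred term or phrase.
REPLACEMENTS = {
    "IE": "Internet Explorer",
}

def rewrite_query(query: str) -> str:
    """Return the query with replacements applied and synonyms OR'ed in."""
    rewritten = []
    for term in query.split():
        if term in REPLACEMENTS:
            rewritten.append(REPLACEMENTS[term])
        elif term in SYNONYMS:
            rewritten.append("(" + " OR ".join(sorted(SYNONYMS[term])) + ")")
        else:
            rewritten.append(term)
    return " ".join(rewritten)

if __name__ == "__main__":
    print(rewrite_query("cheap laptop"))  # cheap (laptop OR notebook)
    print(rewrite_query("IE settings"))   # Internet Explorer settings
```

Note the different effects: expansion tends to widen the result set, while replacement simply substitutes a preferred term, which can narrow it.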

If you want additional technical support, let us know. Our Search Technologies team has deep experience in Microsoft SharePoint search and customization. We can implement advanced controlled term features in almost any SharePoint system.

Iain Fletcher, June 13, 2012

Autonomy Offers Automatic Classification and Taxonomy Generation

May 7, 2012

Conceptualizing how data is stored and organized in an age ruled by unstructured content and meta-tags can prove overwhelming. We found a useful source of information from Autonomy, which explains its Automatic Classification and Taxonomy Generation offering.

With their eye on functionality, IDOL’s classification solutions help users to circumvent issues that have arisen in a time of exponential data growth.

The offering includes Taxonomy Libraries, Automatic Categorization and Channels, and the Autonomy Collaborative Classifier. The company’s website clearly delineates how these elements work.

The website states the following information regarding Taxonomy Libraries:

“Built by experienced knowledge engineers using best practices learned through hundreds of consulting engagements, Autonomy taxonomies let organizations rapidly deploy industry-standard taxonomies that can be combined with your corporate taxonomies or easily customized to meet company and industry-specific requirements. Each Autonomy taxonomy is based on industry standards, and built using IDOL’s conceptual analysis that provides the highest level of accuracy.”

IDOL includes a variety of taxonomies, ranging from biotechnology to financial services: a comprehensive solution, indeed. Overall, IDOL seems equipped to eliminate the time-consuming manual intervention required in the past. But open source alternatives exist and should be considered by procurement teams.

Megan Feil, May 7, 2012

Sponsored by Ikanow

Protege 4.2 Now Available

May 5, 2012

Version 4.2 (beta) of Protégé from Stanford University is now available. The open source application serves as an ontology editor and knowledge-base framework. The product description states:

“The Protégé platform supports two main ways of modeling ontologies via the Protégé-Frames and Protégé-OWL editors. Protégé ontologies can be exported into a variety of formats including RDF(S), OWL, and XML Schema.

“Protégé is based on Java, is extensible, and provides a plug-and-play environment that makes it a flexible base for rapid prototyping and application development.

“Protégé is supported by a strong community of developers and academic, government and corporate users, who are using Protégé for knowledge solutions in areas as diverse as biomedicine, intelligence gathering, and corporate modeling.”

The editor can be customized to provide domain-friendly support for creating knowledge models and entering data. The National Library of Medicine supports Protégé’s biomedical ontologies and knowledge bases, which serve as national resources. The editor is a core component of The National Center for Biomedical Ontology.

Do taxonomy vendors face the open source ogre?

Cynthia Murrell, May 5, 2012

Sponsored by Ikanow

Will Harvard Library Jettison Paid Access Academic Journals?

May 3, 2012

In what could be another step toward knowledge failure, BoingBoing reports “Harvard Library to Faculty: We’re Going Broke Unless You Go Open Access.” Struggling with the high costs of academic journal access fees, the Harvard Library Faculty Advisory Council warns that it cannot sustain the library’s paid scholarly subscriptions.

There’s no doubt that these charges are out of control, and steadily encroaching on the budgets for other acquisitions. Writer Cory Doctorow quotes the Council’s Memorandum on Journal Pricing:

“Harvard’s annual cost for journals from these providers now approaches $3.75M. . . . Some journals cost as much as $40,000 per year, others in the tens of thousands. Prices for online content from two providers have increased by about 145% over the past six years, which far exceeds not only the consumer price index, but also the higher education and the library price indices.”

We understand that the library must control costs. It is unfortunate, however, that such knowledge may no longer be at students’ fingertips. The open access academic world is still sparsely populated, and the Council makes this plea in hope of a richer open access community in the future:

“It’s suggesting that faculty make their research publicly available, switch to publishing in open access journals and consider resigning from the boards of journals that don’t allow open access.”

Perhaps the scholarly open access options will grow, in time. In the meanwhile, it will be the students who miss out on key knowledge.

Cynthia Murrell, May 3, 2012

Sponsored by PolySpot

MuseGlobal and Info Library Team for Mobile Access to Libraries

May 3, 2012

As book retailers shut down amid the rise of e-books and tablets, many worry about what will become of our public libraries. Info Library and Information Solutions recently reported on a partnership that may provide a solution to this problem in the article “Muse Global and Info Library and Information Solutions on Mobile Search Platform.”

According to the write-up, MuseGlobal and Info Library and Information Solutions have come together to make libraries more mobile friendly by offering a custom mobile search platform using MuseGlobal’s cloud-based mobile search interface and platform.

Info Library and Information Solutions also brings quite a bit to the table. Kristina Bivens, MuseGlobal’s CEO, stated:

“Info Library and Information Solutions is well known for their end-user oriented product focus in delivering innovative technology solutions that help libraries serve, interact with and empower users with customizable, on-demand information discovery tools. The NOW platform clearly reflects this commitment and we are delighted to collaborate with Info Library and Information Solutions in extending the NOW platform’s offerings to bring together all of the library’s collections, third-party content, and custom services in one convenient mobile interface.”

MuseGlobal’s technology will give libraries a platform that is easy to implement and offers users access to the entire catalog without requiring additional time and resources. Sounds like an excellent idea to me.

Jasmine Ashton, May 3, 2012

Sponsored by Ikanow

Is the End Approaching for Commercial Metadata Vendors?

April 26, 2012

This is a very interesting move, one that may have implications for the organizations which sell library metadata. Joho the Blog reports, “‘Big Data for Books’: Harvard Puts Metadata for 12M Library Items into the Public Domain.” We learn from the write up:

“Harvard University has today put into the public domain (CC0) full bibliographic information about virtually all the 12M works in its 73 libraries. This is (I believe) the largest and most comprehensive such contribution. The metadata, in the standard MARC21 format, is available for bulk download from Harvard. The University also provided the data to the Digital Public Library of America’s prototype platform for programmatic access via an API. The aim is to make rich data about this cultural heritage openly available to the Web ecosystem so that developers can innovate, and so that other sites can draw upon it.”

Wow. Now, Harvard does ask users to respect community norms, like attributing sources of metadata. Blogger David Weinberger notes that licensing issues have held up the release of library metadata, and that this move makes the metadata of many, many of the most-used library items accessible.

What will happen next? Will the sellers of library metadata fight back?

Cynthia Murrell, April 26, 2012

Sponsored by PolySpot

OpenText Offers Content Auto Classification Solution

April 23, 2012

Open Text recently reported on a new transparent and defensible auto-classification solution designed for records managers in the release “Open-Text Auto Classification.”

According to the article, very few companies have a sound information governance strategy with appropriate records management services in place, and they therefore fail to dispose of unstructured content that is no longer in use.

In response to this issue, the article states:

“At the core, the issue is that content needs to be classified or understood in order to determine why it must be retained, how long it must be retained and when it can be dispositioned.  Managing the retention and disposition of information reduces litigation risk, it reduces discovery and storage costs, and it ensures organizations maintain regulatory compliance.”

The article goes on to explain why many end users fail to go through the tedious process of cataloging and managing their information and then advocates for Open Text’s auto classification solution. We found this article to be an interesting explanation of the Nstein technology with some new twists and recommend it as a great read for those interested in understanding records management more thoroughly.

Jasmine Ashton, April 23, 2012

Sponsored by PolySpot

Open Source Medical Controlled Term Management: Apelon DTS 4.0

March 26, 2012

Apelon Medical Terminology in Practice recently posted a news release introducing the latest version of its open source terminology management software, titled “Apelon Introduces Distributed Terminology System 4.0.”

According to the article, Apelon’s latest DTS is a comprehensive open source solution for the acquisition, management, and practical deployment of standardized healthcare terminologies. It is built on the JEE platform and allows for simplified integration into existing enterprise systems.

The article states:

DTS users easily manage the complete terminology lifecycle. The system provides the ability to transparently view, query, and browse across terminology versions. This facilitates the management of rapidly evolving standards such as SNOMED CT, ICD-10-CM, LOINC and RxNorm, and supports their use for longitudinal electronic health records. Local vocabularies, subsets and cross-maps can be versioned and queried in the same way, meaning that DTS users can tailor and adapt standards to their particular needs.

The growing volume of technical terminology in the medical industry that must be referenced quickly and accurately has driven the need for enhanced terminology management tools. The latest version of this software is easier to use than its predecessors and will help even more institutions integrate the latest decision support technologies into their daily work.

Jasmine Ashton, March 26, 2012

Sponsored by Pandia.com

Metadata: To the Roots!

February 15, 2012

According to the Computer Weekly article “Diving Deeper than Metadata, Down to ‘Contextual’ Metadata,” content management isn’t what it used to be. Social business tools have now entered the corporate world, and they are a crucial part of content management. The article asserts:

“Systems of record themselves now face the additional challenge of not only tracking a firm’s own processes, but also accommodating for what Forrester Research defines as “out of process” applications from third parties or those that only happen infrequently.”

This new form of analytics is referred to as contextual analytics. IBM uses Lucene search in its IBM Content Analytics program, which uses “annotators” to help define content management metatags. According to IBM: “Content analytics solutions can understand the meaning and context of human language and rapidly process information to improve knowledge-driven search and surface new insights from your enterprise content.”
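
As a generic sketch of the annotator idea only (the names and rules below are invented and do not reflect IBM Content Analytics’ actual interfaces), the Python fragment below runs two toy annotators over a passage and collects the metadata tags they emit.

```python
# Generic, hypothetical sketch of annotator-style tagging; not IBM's API.
import re

def date_annotator(text):
    # Tag ISO-style dates as a crude example of contextual metadata.
    return [("DATE", m.group()) for m in re.finditer(r"\d{4}-\d{2}-\d{2}", text)]

def product_annotator(text):
    # Tag terms drawn from a small controlled list (hypothetical values).
    controlled_terms = ["Lucene", "Content Analytics"]
    return [("PRODUCT", term) for term in controlled_terms if term in text]

def annotate(text, annotators):
    # Each annotator contributes its own (tag, value) pairs.
    tags = []
    for annotator in annotators:
        tags.extend(annotator(text))
    return tags

if __name__ == "__main__":
    sample = "Content Analytics indexed the filing dated 2012-02-15 using Lucene."
    print(annotate(sample, [date_annotator, product_annotator]))
```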

Looks like IBM is focused on digging deep and getting to the root of the problem. Digging is good but the scattering of service after service, solution after solution, strikes me as a trifle untidy.

April Holmes, February 15, 2012

Sponsored by Pandia.com

Exogenous Complexity 1: Search

January 31, 2012

I am now using the phrase “exogenous complexity” to describe systems, methods, processes, and procedures which are likely to fail due to outside factors. This initial post focuses on indexing, but I will extend the concept to other content centric applications in the future. Disagree with me? Use the comments section of this blog, please.

What is an outside factor?

Let’s think about value-adding indexing, content enrichment, or metatagging. The idea is that unstructured text contains entities, facts, bound phrases, and other identifiable items. A keyword search system is mostly blind to the meaning of a number in the form nnn nn nnnn, which in the United States is the pattern for a Social Security Number. There are similar patterns in Federal Express tracking numbers, financial account numbers, and other types of sequences. The idea is that a system will recognize these strings and tag them appropriately; for example:

nnn nn nnnn Social Security Number

Thus, a query for Social Security Numbers will return strings of digits matching the pattern. The same logic can be applied to certain entities. With the help of a knowledge base, Bayesian numerical recipes, and other techniques such as synonym expansion, the system can determine that a query for “Obama residence” should return “White House,” or that a query for “the White House” should return links to the Obama residence.
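
To make the tagging step concrete, here is a minimal Python sketch that flags strings matching a Social Security Number-style pattern. The pattern and tag name are illustrative only; production content processing pipelines layer knowledge bases and statistical methods on top of simple rules like this.

```python
# Minimal, illustrative pattern tagger; the regular expression and tag
# name are examples, not a production rule set.
import re

PATTERNS = {
    # nnn nn nnnn, allowing spaces or hyphens between the groups
    "SOCIAL_SECURITY_NUMBER": re.compile(r"\b\d{3}[- ]\d{2}[- ]\d{4}\b"),
}

def tag_entities(text):
    """Return (tag, matched_string) pairs for every pattern hit."""
    hits = []
    for tag, pattern in PATTERNS.items():
        hits.extend((tag, m.group()) for m in pattern.finditer(text))
    return hits

if __name__ == "__main__":
    print(tag_entities("Employee record 123 45 6789 was flagged."))
    # [('SOCIAL_SECURITY_NUMBER', '123 45 6789')]
```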

One wishes that value added indexing systems were as predictable as a kabuki drama. What vendors of next generation content processing systems participate in is a kabuki which leads to failure two thirds of the time. A tragedy? It depends on whom one asks.

The problem is that companies offering automated solutions for value-adding indexing, content enrichment, or metatagging are likely to fail for three reasons:

First, there is the issue of humans who use language in unexpected or what some poets call “fresh” or “metaphoric” ways. English is synthetic in that any string of sounds can be used in quite unexpected ways. Whether it is the use of the name of the fruit “mango” as a code name for software, or the conversion of a noun like information into a verb like informationize, which appears in Japanese government English language documents, the automated system may miss the boat. When the boat is missed, continued iterations try to arrive at the correct linkage, but as anyone who has used fully automated systems or who paid attention in math class knows, recovery from an initial error can be time consuming and sometimes difficult. Therefore, an automated system, no matter how clever, may find itself fooled by the stream of content flowing through its content processing work flow. The user pays the price because false drops mean more work and suggestions which are not just off the mark but also difficult for a human to figure out. You can get the inside dope on why poor suggestions are an issue in Thinking, Fast and Slow.

Read more
