Meme Attention Deficit

April 27, 2014

I read “Algorithm Distinguishes Memes from Ordinary Information.” The article reports that algorithms can pick out memes. A “meme”, according to Google, is “an element of a culture or system of behavior that may be considered to be passed from one individual to another by nongenetic means, especially imitation.” The passage that caught my attention is:

Having found the most important memes, Kuhn and co studied how they have evolved in the last hundred years or so. They say most seem to rise and fall in popularity very quickly. “As new scientific paradigms emerge, the old ones seem to quickly lose their appeal, and only a few memes manage to top the rankings over extended periods of time,” they say.

The factoid that reminded me how far smart software has yet to travel is:

To test whether these phrases are indeed interesting topics in physics, Kuhn and co asked a number of experts to pick out those that were interesting. The only ones they did not choose were: 12. Rashba, 14. ‘strange nonchaotic’ and 15. ‘in NbSe3’. Kuhn and co also checked Wikipedia, finding that about 40 per cent of these words and phrases have their own corresponding entries. Together this provides compelling evidence that the new method is indeed finding interesting and important ideas.

Systems produce outputs that are not yet spot on. I concluded that scientists, like marketers, like whizzy new phrases and ideas. Jargon, it seems, is an important part of specialist life.

Stephen E Arnold, April 27, 2014

US Government Content Processing: A Case Study

March 24, 2014

I know that the article “Sinkhole of Bureaucracy” describes a single case. Nevertheless, the write up tickled my funny bone. With the fancy technology and hyper-modern content processing systems used in many Federal agencies, reality is stranger than science fiction.

This passage snagged my attention:

inside the caverns of an old Pennsylvania limestone mine, there are 600 employees of the Office of Personnel Management. Their task is nothing top-secret. It is to process the retirement papers of the government’s own workers. But that system has a spectacular flaw. It still must be done entirely by hand, and almost entirely on paper.

One of President Obama’s advisors is quoted as describing the manual operation as “that crazy cave.”

And the fix? The article asserts:

That failure imposes costs on federal retirees, who have to wait months for their full benefit checks. And it has imposed costs on the taxpayer: The Obama administration has now made the mine run faster, but mainly by paying for more fingers and feet. The staff working in the mine has increased by at least 200 people in the past five years. And the cost of processing each claim has increased from $82 to $108, as total spending on the retirement system reached $55.8 million.

One of the contractors operating the system is Iron Mountain. You may recall that this outfit has a search system and caught my attention when Iron Mountain sold the quite old Stratify (formerly Purple Yogi) automatic indexing system to Autonomy.

My observations:

  1. Many systems have a human component that managers ignore, do not know about, or lack the management horsepower to address. When search systems or content processing systems generate floods of red ink, human processes are often the culprit.
  2. The notion that modern technology has permeated organizations is false. The cost friction in many companies is directly related to small decisions that grow like a snowball rolling down a hill. When these processes reach the bottom, the mess is no longer amusing.
  3. Moving significant information from paper to a digital form and then using those data in a meaningful way to answer questions is quite difficult.

Do managers want to tackle these problems? In my experience, keeping up appearances and cost cutting are more important than old fashioned problem solving. In a recent LinkedIn post I pointed out that automatic indexing systems often require human input. Forgetting about those costs produces problems that are expensive to fix. Simple indexing won’t bail out the folks in the cave.

Stephen E Arnold, March 24, 2014

SchemaLogic Profile Available

December 3, 2013

A new profile is available on the Xenky site today. SchemaLogic is a controlled vocabulary management system. The system combines traditional vocabulary management with an organization-wide content management system built specifically for indexing words and phrases. The analysis provides some insight into how a subsystem can easily boost the staff and infrastructure cost of a basic search system.

Taxonomy became a chrome-trimmed buzzword almost a decade ago. Indexing itself has been around far longer and has a complete body of practices and standards for the practitioner to use when indexing content objects.

Just what an organization needs to make sense of its text, images, videos, and other digital information/data. At a commercial database publishing company, more than a dozen people can be involved in managing a controlled term list and classification coding scheme. When a term is misapplied, finding a content object can be quite a challenge. If audio or video are misindexed, the content object may require a human to open, review, and close files until the required image or video can be located. Indexing is important, but many MBAs do not understand the cost of indexing until a needed content object cannot be found; for example, in a legal discovery process related to a patent matter. A happy quack for the example of a single segment of a much larger organization centric taxonomy. Consider managing a controlled term list with more than 20,000 terms and a 400 node taxonomy across a Fortune 500 company or for the information stored in your laptop computer.

Even early birds in the search and content processing sector like Fulcrum Technologies and Verity embraced controlled vocabularies. A controlled term list contains forms of words and phrases and often the classification categories into which individual documents can be tagged.

The problem was that lists of words had to be maintained. Clever poobahs and mavens created new words to describe allegedly new concepts. Scientists, engineers, and other tech types whipped up new words and phrases to help explain their insights. And humans, often loosey goosey with language, shifted meanings. For example, when I was in college a half century ago, there was a class in “discussion.” Today that class might be called “collaboration.” Software often struggles with these language realities.

What happens when “old school” search and content processing systems try to index documents?

The systems can “discover” terms and apply them. Vendors with “smart software” use a range of statistical and linguistic techniques to figure out entities, bound phrases, and concepts. Other approaches include sucking in dictionaries and encyclopedias. The combination of a “knowledgebase” like Wikipedia and other methods works reasonably well.
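The “discover terms and apply them” step can be illustrated with a short sketch: match document text against a controlled term list that maps each preferred term to its accepted surface forms. This is a minimal illustration with a made-up vocabulary, not SchemaLogic’s actual data model:

```python
# Minimal sketch of controlled-vocabulary tagging: a hypothetical term list
# mapping each preferred term to the surface forms that should trigger it.
import re

VOCABULARY = {
    "collaboration": ["collaboration", "discussion", "teamwork"],
    "taxonomy": ["taxonomy", "classification scheme"],
}

def tag_document(text: str) -> set:
    """Return the preferred terms whose surface forms appear in the text."""
    tags = set()
    lowered = text.lower()
    for preferred, forms in VOCABULARY.items():
        # Word-boundary match so "discussions of" still hits "discussion"-adjacent
        # forms only when the whole form is present.
        if any(re.search(r"\b" + re.escape(form) + r"\b", lowered) for form in forms):
            tags.add(preferred)
    return tags

print(tag_document("The class in discussion covered our classification scheme."))
```

Keeping the list small is the easy part; as the earlier example of a 20,000-term list suggests, the real cost is the human effort of maintaining the vocabulary as language shifts.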


New Version of Media Mining Indexer (6.2) from SAIL LABS Technology

November 9, 2013

The release titled “SAIL LABS Announces New Release of Media Mining Indexer 6.2,” dated August 5, 2013, provides some insight into the latest version of the Media Mining Indexer. SAIL LABS Technology considers itself an innovator in creating solutions for vertical markets and in enhancing technologies for advanced language understanding.

The newest release offers such features as:

“Improved named entity detection of names via unified lists across languages… improved topic models for all languages… improved text preprocessing for Greek, Hebrew, Italian, Frasi, US and international English…support of further languages: Catalan, Swedish, Portuguese, Bahasa (Indonesia), Italian, Farsi and Romanian…improved communication with Media Mining Server to relate recognized speakers to their respective profiles.”

Gerhard Backfried, Head of Research at SAIL LABS, called the latest release a “quantum leap forward” for the system’s tractability, constancy, and ability to respond to clients’ needs. The flagship product is based on SAIL LABS speech recognition technology, which has won awards, and offers a suite of components for multimedia processing and the transformation of audio and video data into searchable information. The features include accurate speech-to-text conversion with Automatic Speech Recognition and the ability to detect different speakers with Speaker Change Detection.

Chelsea Kerwin, November 09, 2013

Sponsored by, developer of Augmentext

Database Indexing Explained

July 29, 2013

Finally, everything you need to explain database indexing to your mom over breakfast. Stack Overflow hosts the discussion, “How Does Database Indexing Work?” The original question, posed by a user going by Zenph Yan, asks for an answer at a “database agnostic level.” The lead answer, also submitted by Zenph Yan, makes for a respectable article all by itself. (Asked and answered by the same user? Odd, perhaps, but that is actively encouraged at Stack Overflow.)

Yan clearly defines the subject at hand:

“Indexing is a way of sorting a number of records on multiple fields. Creating an index on a field in a table creates another data structure which holds the field value, and pointer to the record it relates to. This index structure is then sorted, allowing Binary Searches to be performed on it.

“The downside to indexing is that these indexes require additional space on the disk, since the indexes are stored together in a table using the MyISAM engine, this file can quickly reach the size limits of the underlying file system if many fields within the same table are indexed.”

Yan’s explanation also describes why indexing is needed, how it works (with examples), and when it is called for. It is worth checking out for those pondering his question. A couple of other users contributed links to helpful resources. Der U suggests another Stack Overflow discussion, “What Do Clustered and Non Clustered Index Actually Mean?”, while dohaivu recommends the site Use the Index, Luke.
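Yan’s core idea, a separate data structure of (field value, pointer) pairs kept sorted so it can be binary searched, can be sketched in a few lines. This is purely illustrative and not how any particular database engine stores its indexes:

```python
# Illustrative sketch of a single-field index: sorted (value, row_pointer)
# pairs searched with bisect instead of scanning every row.
import bisect

rows = [
    {"id": 0, "name": "Smith"},
    {"id": 1, "name": "Jones"},
    {"id": 2, "name": "Adams"},
]

# Build the index: a sorted list of (field value, pointer to the row).
name_index = sorted((row["name"], i) for i, row in enumerate(rows))
index_keys = [value for value, _ in name_index]

def find_by_name(name):
    """Binary-search the index, then follow the pointer back to the row."""
    pos = bisect.bisect_left(index_keys, name)
    if pos < len(index_keys) and index_keys[pos] == name:
        return rows[name_index[pos][1]]
    return None

print(find_by_name("Jones"))
```

The trade-off Yan mentions is visible even here: the index is a second copy of the field values, so it costs extra space in exchange for turning a linear scan into a logarithmic lookup.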

Cynthia Murrell, July 29, 2013

Sponsored by, developer of Augmentext

Tag Management Systems Use Governance to Improve Indexing

March 18, 2013

An SEO expert advocates better indexing in the recent article “Top 5 Arguments For Implementing a Tag Management Solution” on Search Engine Watch. The article shares that because of increased functionality and matured capabilities of such systems, tag management is set for a “blowout year” in 2013.

Citing such reasons as ease of modifying tags and cost reduction, it is easy to see how businesses will begin to adopt these systems if they haven’t already. I found the point on code portability and becoming vendor agnostic most appealing:

“As the analytics industry matures, many of us are faced with sharing information between different systems, which can be a huge challenge with respect to back-end integrations. Tag management effectively bridges the gap between several front-end tagging methodologies that can be used to leverage existing development work and easily port information from one script or beacon to another.”

I think this is a very interesting concept and I love the notion of governance as a way to improve indexing. I am reminded of the original method from the days of the library at Ephesus. Next month, the same author will tackle the most common arguments against implementing a tag management system. We will keep an eye out.

Andrea Hayden, March 18, 2013

Sponsored by, developer of Beyond Search

LucidWorks Interviews Update

March 8, 2013

We had a report that the Lucid Imagination and LucidWorks interview links on an index page were not resolving. If you are looking for these interviews, here’s a snapshot of the interviews we have conducted with LucidWorks’ professionals since 2009.

Mark Bennett
LucidWorks, March 4, 2013

Miles Kehoe
LucidWorks, January 29, 2013

Paul Doscher
LucidWorks, April 16, 2012

Brian Pinkerton
LucidWorks, December 21, 2010

Marc Krellenstein
Lucid Imagination, March 17, 2009

Remember: LucidWorks is the new name for Lucid Imagination.

Tony Safina, March 8, 2013

Oracle Rolls Out Text Index Strategy

March 7, 2013

Oracle’s support of locally partitioned indexes has created a need for users to be able to split those indexes and rebuild them in a timely manner. How do you rebuild an index without making your application unavailable for the entire time?

Prsync’s look into the maintenance disadvantages and Oracle’s subsequent problem solving in “Partition Maintenance and Oracle Text Indexes” gives us a look at something new: the “Without Validation” and “Split Partition” features. These options offer a way to rebuild indexes without a line-by-line check first.

“That solves the problem, but it’s rather heavy-handed. So instead we need to institute some kind of “change management”. There are doubtless several ways to achieve this, but I’ve done it by creating triggers which monitor any updates or inserts on the base table, and copy them to a temporary “staging” table. These transactions can then be copied back to the main table after the partition split or merge is complete, and the index sync’d in the normal way.”

So there is a solution. But because the “without validation” option skips the check that every partition key value is going to the correct partition, the feature calls for extra care.
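The change-management pattern the quoted passage describes, divert writes that arrive during the long-running partition operation into a staging area, then replay them and re-sync the index afterward, can be sketched in Python. Oracle implements this with triggers and a staging table; the names and structure below are hypothetical:

```python
# Sketch of the change-capture pattern: while index maintenance is running,
# incoming writes go to a staging list instead of the table, then are
# replayed once the maintenance (e.g., a partition split) completes.
class Table:
    def __init__(self):
        self.rows = []
        self.staging = None  # None means writes go straight to the table

    def insert(self, row):
        if self.staging is not None:   # maintenance in progress: stage the write
            self.staging.append(row)
        else:
            self.rows.append(row)

    def begin_maintenance(self):
        self.staging = []              # plays the role of the trigger + staging table

    def end_maintenance(self):
        staged, self.staging = self.staging, None
        for row in staged:             # replay captured writes, then the index
            self.rows.append(row)      # can be sync'd in the normal way

t = Table()
t.insert("row-1")
t.begin_maintenance()   # partition split starts
t.insert("row-2")       # captured in staging; the base table is untouched
t.end_maintenance()     # staged rows copied back
print(t.rows)
```

The point of the pattern is that the application keeps accepting writes during the rebuild; nothing is lost, and nothing blocks.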

It is a long-needed saving grace that will save time and ultimately money by getting applications back up and running more efficiently, but there is no substitute for attention to detail. For a more in-depth look at the process, we suggest heading over to Prsync.

Leslie Radcliff, March 07, 2013

Sponsored by, developer of Augmentext

Google: Objective Indexing and a Possible Weak Spot

February 6, 2013

A reader sent me a link to “Manipulating Google Scholar Citations and Google Scholar Metrics: Simple, Easy, and Tempting.” I am not sure how easy and tempting it is to get a fake scholarly paper into the Google index, but the information provided is food for thought. Worth a look, particularly if you are a fan of traditional methods for building a corpus and delivering on-point results which the researcher can trust. The notion of “ethics” is an interesting addition to a paper which focuses on fake or misleading research.

Stephen E Arnold, February 7, 2013

List of Significant Open Source Programs Neglects Search Engines

January 28, 2013

ZDNet’s recent article “The 10 Oldest Significant Open Source Programs” lists projects still in popular usage today, but the list is redundant and neglects to mention other, more relevant projects. Open source software and freeware projects have been influencing software development since the early days of computers.

According to the article:

“Both concepts were actually used long before proprietary software showed up. As Richard M. Stallman, (rms) free software’s founder noted, ‘When I started working at the MIT Artificial Intelligence Lab in 1971, I became part of a software-sharing community that had existed for many years. Sharing of software was not limited to our particular community; it is as old as computers [...]‘”

Linux has certainly had incredible success as the foundation of the Internet and the most ported operating system in the world, running on everything from Android devices to supercomputers. Python has also proven its impact by becoming the fastest growing open source programming language.

While the article goes on to list several other programming languages and another operating system, I cannot help but notice the lack of open source search engine and indexing software. Lucene and Solr have been around since 1999 and 2004, respectively. These projects merged in March 2010 and have just received a robust update. Not only are these programs still in use, but they are making strides toward solving the search problems that plague big enterprise.

Michael Cole, January 28, 2013

Sponsored by, developer of Beyond Search
