Basho Riak Gets Developer Love: Syslog Indexing

April 25, 2012

If you are not familiar with Basho Riak, you can work through the www.basho.com Web site, or you can navigate to www.opensearchnews.com and request our profile of the company. (Click on the “Profile” link at the top of the page.) You may want to check out “Full Text Indexing of Syslog Messages with Riak.” The article describes a tool called riak-syslog. The utility sucks up syslog messages and allows the user to search those messages using the Riak full text search system. The write up also points to a post about indexing syslog messages with Solr. Useful.
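The riak-syslog utility does the heavy lifting, but the general pattern is easy to picture. Below is a minimal sketch, not the tool itself: it parses classic syslog lines, stores them as JSON objects over Riak’s HTTP interface, and queries them through the Solr-style endpoint that Riak Search exposed in that era. The host, port, bucket name, and the assumption that full text search is enabled on the bucket are all illustrative.

```python
# Minimal sketch (not riak-syslog itself): push syslog lines into a Riak
# bucket over HTTP and query them via Riak Search's Solr-style endpoint.
# Paths, port, and bucket name are assumptions for illustration.
import json
import re
import requests

RIAK = "http://localhost:8098"   # default Riak HTTP port (assumption)
BUCKET = "syslog"                # hypothetical bucket with search enabled

SYSLOG_RE = re.compile(r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) "
                       r"(?P<host>\S+) (?P<tag>[^:]+): (?P<msg>.*)$")

def index_line(line, key):
    """Parse one classic syslog line and store it as JSON in Riak."""
    match = SYSLOG_RE.match(line)
    if not match:
        return
    requests.put(f"{RIAK}/buckets/{BUCKET}/keys/{key}",
                 data=json.dumps(match.groupdict()),
                 headers={"Content-Type": "application/json"})

def search(query):
    """Query the Solr-compatible search interface Riak Search exposed."""
    resp = requests.get(f"{RIAK}/solr/{BUCKET}/select",
                        params={"q": query, "wt": "json"})
    return resp.json()

if __name__ == "__main__":
    index_line("Apr 25 10:15:02 web01 sshd[812]: Failed password for root",
               "msg-0001")
    print(search("msg:Failed"))
```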

Stephen E Arnold, April 25, 2012

Sponsored by PolySpot

Creative Tip to Avoid Indexing in SharePoint Fast

January 4, 2012

At his Tech and Me blog, Mikael Svenson provides a unique search tip in “How to Prevent an Item from Being Indexed with FAST for SharePoint.” Keeping an item from being indexed in FAST using the metadata or text of a file has long been considered next to impossible. Svenson, however, has found a way, and that way is through profanity. Yes, you can use the Offensive Content Filter to your advantage. The article explains:

The thing about the offensive content filter is that it will prevent documents from being indexed if they contain a certain amount of bad language. If you get embarrassed by such words, then skip reading 🙂 So now we have a stage which can drop items, the rest is to assign enough bad words to ‘ocfcontribution’ to get above the threshold it triggers on.

See the write up for a detailed description of how to implement this creative approach.
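For readers who want a feel for the mechanics, here is a hedged sketch of what such a pipeline extensibility stage might look like. It is not Svenson’s code: the property set GUID, the “NOINDEX” marker convention, and the exact XML handling are illustrative assumptions. The constant is the idea of writing enough flagged words into the ‘ocfcontribution’ property to trip the Offensive Content Filter; consult the write up for the real wiring.

```python
# Hedged sketch of a FAST for SharePoint pipeline extensibility stage in the
# spirit of the article: if a document carries a "do not index" marker,
# flood 'ocfcontribution' with flagged words so the Offensive Content Filter
# drops the item. GUID, marker, and XML shape are illustrative assumptions.
import sys
import xml.etree.ElementTree as ET

FLAGGED_WORDS = " ".join(["censored"] * 50)  # stand-in for the profanity payload
PROPERTY_SET = "00000000-0000-0000-0000-000000000000"  # placeholder GUID

def main(input_path, output_path):
    tree = ET.parse(input_path)
    body = "".join(cp.text or "" for cp in tree.iter("CrawledProperty"))

    doc = ET.Element("Document")
    if "NOINDEX" in body:  # hypothetical marker an author places in content
        prop = ET.SubElement(doc, "CrawledProperty",
                             propertySet=PROPERTY_SET,
                             propertyName="ocfcontribution",
                             varType="31")
        prop.text = FLAGGED_WORDS
    ET.ElementTree(doc).write(output_path, encoding="utf-8",
                              xml_declaration=True)

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])  # stage receives input and output file paths
```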

Svenson notes one important caveat: if you have any documents containing profanity that you actually want to have indexed, this solution may backfire. Avoid difficulties by tapping the deep search expertise of Search Technologies.

Iain Fletcher, January 4, 2012

Sponsored by Pandia.com

Hlava on Machine Assisted Indexing

September 8, 2011

On September 7, 2011, I interviewed Margie Hlava, president and co-founder of Access Innovations. Access Innovations has been delivering professional taxonomy, indexing, and consulting services to organizations worldwide for more than 30 years. In our first interview, Ms. Hlava discussed the need for standards and the costs associated with flawed controlled term lists and loosely formed indexing methods.

In this podcast, I spoke with her about her MAI or machine assisted indexing technology. The idea is that automated systems can tag high volume flows of data in a consistent manner. The “big data” challenge often creates significant performance problems for some content processing systems. MAI balances high speed processing with the ability to accommodate the inevitable “language drift” that is a natural part of human content generation.
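Access Innovations’ MAI is a commercial, rule driven system, so the following is only a toy illustration of the underlying idea: map incoming text onto a controlled vocabulary so that high volume content gets tagged consistently. The vocabulary and synonym lists are invented for this sketch.

```python
# Toy illustration of the machine assisted indexing idea: map free text onto
# a controlled vocabulary so content is tagged consistently at scale. The
# vocabulary and synonyms are made up; real MAI rule bases are far richer.
CONTROLLED_VOCABULARY = {
    "Enterprise Search": ["search engine", "information retrieval", "findability"],
    "Taxonomy": ["controlled vocabulary", "term list", "thesaurus"],
    "Metadata": ["tagging", "indexing", "metadata enrichment"],
}

def suggest_tags(text):
    """Return controlled terms whose preferred label or synonyms appear in the text."""
    lowered = text.lower()
    tags = []
    for preferred_term, synonyms in CONTROLLED_VOCABULARY.items():
        if any(s in lowered for s in [preferred_term.lower()] + synonyms):
            tags.append(preferred_term)
    return tags

print(suggest_tags("The webinar covered findability and metadata enrichment."))
# -> ['Enterprise Search', 'Metadata']
```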

In this interview, Ms. Hlava discusses:

  • The value of a neutral format so that content and tags can be easily repurposed
  • The importance of metadata enrichment, which allows an indexing process to capture the nuances of meaning as well as the tagging required to let a user “zoom” to a specific location in a document, pinpoint the entities in a document, and get automated summaries of documents
  • The role of an inverted index versus the tagging of records with a controlled vocabulary.

One of the key points is that flawed indexing contributes to user dissatisfaction with some search and retrieval systems. She said, “Search is like standing in line for a cold drink on a hot day. No matter how good the drink, there will be some dissatisfaction with the wait, the length of the line, and the process itself.”

You can listen to the second podcast, recorded on September 7, 2011, by pointing your browser to http://arnoldit.com/podcasts/. For more information about Access Innovations, navigate to this link. The company publishes Taxodiary, a highly regarded Web log about indexing and taxonomy related topics.

Stephen E Arnold, September 8, 2011

Sponsored by Pandia.com, publishers of The New Landscape of Enterprise Search

Hlava on Indexing, Metadata, and Findability

September 1, 2011

On August 31, 2011, I spoke with Margie Hlava, president and co-founder of Access Innovations. The idea for a podcast grew out of our lunch chatter. I then brought her back to the ArnoldIT office and we recorded a conversation about the challenges of “after the fact” indexing. One of the key points surfacing in the interview is the importance of a specific work process required for developing an indexing approach. “Fire, ready, aim!” is a method which can undermine an otherwise effective search solution. In the podcast, Ms. Hlava makes three points:

  • Today’s search systems are often making it difficult for users to locate exactly the information needed. Access Innovations’ software and services can change “search to found.”
  • Support for standards is important. Once a controlled term list or other value adding indexing process has been implemented, Access Innovations makes it easy for clients to repurpose and move their metadata. Ms. Hlava said, “We are standards wonks.”
  • Indexing and metadata are challenging tasks. On the surface, creating a word list looks easy, but errors in logic make locating information more difficult. Informed support and the right taxonomy management system are important. Access Innovations’ solutions are available as cloud services or as on premises installations.

The challenge is that automated content processing without controlled term lists creates a wide range of problems for users.

You can listen to the podcast by navigating to http://arnoldit.com/podcasts/. For more information about Access Innovations, point your browser to www.accessinn.com. Be sure to take a look at Access Innovations’ Web log, Taxodiary. Updated each day, the blog is at www.taxodiary.com.

Stephen E Arnold, September 1, 2011

Sponsored by Pandia.com

Indexing a Good Start, not a Solution

July 28, 2011

We love it when 20 somethings discover the wheel, fire, or song. Almost as exciting is the breakthrough that allows today’s digital “experts” to see value in indexing. Yes.

InfoWorld asks, “Can Metadata Save Us from Cloud Data Overload?”

The simple answer: not by itself.

Writer David Linthicum acknowledges that the rapid and redundant proliferation of data demands action beyond moving it all to the Cloud. Many think that metadata is the solution. Properly tagging your data is necessary, but the big picture is more complicated than that. He states:

The management of data needs to be in the context of an overreaching data management strategy. That means actually considering the reengineering of existing systems, as well as understanding the common data elements among the systems. Doing so requires much more than just leveraging metadata; it calls for understanding the information within the portfolio of applications, cloud or not. It eventually leads to the real fix. The problem with this approach is that it’s a scary concept to consider.

Well, sort of. But indexing is not horse shoes. People and businesses often, though unwisely, spin their wheels looking for easy solutions rather than confront an overwhelming reality. The truth, though, is that indexing is not a silver bullet. There are issues related to editorial policy, use of a controlled term list, and quality control.

The sooner companies face this fact and get into the nuts and bolts of their data operations, the sooner they will reap the rewards of efficiency. In the meantime, we await the next big thing. We have heard it is drawing on cave walls. There’s an automated image indexing system ready to tag those graphic outputs too.

Stephen E Arnold, July 28, 2011

Sponsored by ArnoldIT.com, the resource for enterprise search information and current news about data fusion

Latent Semantic Indexing: Just What Madison Avenue Needs

June 29, 2011

Ontosearch examines “The Use of Latent Semantic Indexing in Internet Marketing.” Going beyond the traditional use of simple keywords, Latent Semantic Indexing (LSI) puts words into context. On the assumption that words used in the same context are synonyms, the method uses math to find patterns within text; this process is known as Singular Value Decomposition. The word “latent” refers to correlations that sit within the text sample, waiting to provide important clues to the reader (either human or software).
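For readers who want to see the mechanics, here is a minimal sketch of the LSI pipeline using scikit-learn: build a weighted term document matrix, reduce it with truncated SVD, and compare documents in the reduced space. The three document corpus is invented for illustration.

```python
# Minimal LSI sketch: build a tf-idf term-document matrix and reduce it with
# truncated SVD so documents that share context land close together. The
# tiny corpus is invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The car is parked in the garage overnight",
    "The automobile is parked in the garage overnight",
    "Fresh bread and pastries come out of the oven each morning",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
lsi = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# The first two documents describe the same scene with different head nouns;
# in the reduced space they score as far more similar to each other than
# either does to the bakery document.
print(cosine_similarity(lsi))
```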

When used by a search engine to determine ranking, LSI is a huge advance in establishing relevance to the user’s query. It also helps to lower the rank of duplicate websites. A company’s marketing department must keep this process in mind, and refuse to rely on keywords alone.

Google recently made headlines by revamping their search engine to increase the relevancy of their search results. Enhanced LSI was at the root of that change. Many users have been happy with the results, but a lot of businesses found themselves scrambling to recover their coveted high rankings. Adjustments had to be made.

Ontosearch’s post examines the response to this technique in the marketing world:

Latent Semantic system, is known to enhance or compliment the traditional net marketing keyword analysis technique rather than replacing or competing with them. One drawback of the LSI system is that it is based on a mathematical set of rules, which means that it can be justified mathematically but in the natural term, it has hardly any meaning to the users. The use of Latent Semantic System does not mean that you get rid of the standard use of keywords for search reference, instead it is suggested that you maintain a good density of specific keywords along with a good number of related keywords for appropriate Web marketing of the sites.

That technique allows marketing departments to maximize their search rankings. Wow, the marketers are moving to the future! I guess they know what’s good for them. Any company that refuses to embrace the newest techniques risks being left in the dust, especially these days.

But what happens if the Latent Semantic interpretation is incorrect? It can’t guess correctly every time. Check up on search engines’ interpretation of your site’s text to be sure you appear where you think you should.
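A rough version of that self-check is easy to script. The sketch below simply measures how often a target keyword and its related terms appear in a page’s text; the keywords and sample text are invented, and a real audit would fetch live pages and use a proper tokenizer.

```python
# Rough sketch of the self-check suggested above: measure how often a target
# keyword and its related terms appear in a page's text. Keywords and sample
# text are invented for illustration.
import re

def keyword_report(text, target, related):
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    target_hits = words.count(target.lower())
    related_hits = sum(words.count(r.lower()) for r in related)
    return {
        "target_density": target_hits / total if total else 0.0,
        "related_terms_found": related_hits,
        "total_words": total,
    }

page = ("Our bakery bakes sourdough bread daily. Fresh bread, artisan loaves, "
        "and pastries are available every morning at the bakery.")
print(keyword_report(page, "bread", ["bakery", "sourdough", "loaves", "pastries"]))
```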

During a quick Web search (no, the irony is not lost on me), I found that the method has been used to filter spam. That’s welcome. It’s also been applied to education. It’s also been applied to the study of human memory. Interesting. (That reminds me, have I taken my Ginkgo biloba today?)

Our view is that semantic methods have been in use in the plumbing of Google-like systems for years. The buzz about semantic technology is one of the search methods that surf on Kondratieff waves. This has been a long surf board ride. The shore is in sight.

Cynthia Murrell, June 29, 2011

You can read more about enterprise search and retrieval in The New Landscape of Enterprise Search, published by Pandia in Oslo, Norway, in June 2011.

Autonomy Boosts the Discipline of Indexing

April 14, 2011

We found the story “Indexer Flourishes as Search Fails” quite interesting. A few days ago Autonomy, a global leader in enterprise software and “meaning based computing”, released its new service pack for WorkSite Indexer 8.5 as well as for its new Universal Search Server. While the indexer has done well and received many good reviews, the notion of a “universal server” is a difficult concept. The pre-Microsoft Fast Search & Transfer promised a number of “universal” functions. When “universal” became mired in time consuming and expensive local fixes, some vendors did a global search and replace.

The service pack touts a new Autonomy control center which simplifies the management structure of a multi server environment, improved query returns, additional control over Autonomy’s IDOL components, and an automatic restart feature in case service is snarled due to a problem outside of Autonomy’s span of control during a crawl. Network latency continues to be an issue despite the marketing hoo-hah about gigabit this and gigabit that. Based on the information we have at ArnoldIT.com, thus far the service pack has been deployed with little or no trouble.

We have heard some reports that the Universal Search Server can create some extra perspiration when one tries to deploy multiple WorkSite engines. From the article cited above, we learned:

Autonomy has identified this as a high priority issue and expects to have a resolution out in the very near future.

Autonomy has been among the more responsive vendors of enterprise solutions. We are confident a fix may be available as you read this or in a day or two. If you are an Autonomy licensee, contact your reseller or Autonomy.

Stephen E Arnold, April 14, 2011

Freebie but maybe some day?

Greed Feedback Loops: Web Indexing, SEO, and Content

December 5, 2010

Wow, I thought the teeth gnashing over “objective search results” was a dead issue. Objectivity is not part of the “free” Web search method. Uninformed people accept results as factual, relevant, and worth an invitation to have lunch with Plato. Wrong. Objective search results are a bit of a myth and have been for decades.

Some education, gentle reader. A commercial database exercises editorial control. If you ran a query for ESOP on the Dialog system for File 15, you got a list of results in which the controlled term was applied or, if you were a savvy searcher, in documents in which the string ESOP appeared in a field or an abstract/full text field. The only objectivity involved was that Dialog matched on a string. No string. No match.

Online information is rife with subjectivity.

In the commercial database world, the subjectivity comes into play when the database producer selects an article to summarize and the controlled terms to apply, when the searcher frames his or her query, and when the searcher decides what file to use in the first place. In ABI/INFORM, the content set guaranteed that you would get only articles from magazines and journals we thought were important. The terms were the domain of the editors. The searcher controlled the query. Dialog was passive.
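The distinction matters, so here is a toy contrast between the two behaviors described above: a Dialog-style literal string match versus retrieval on a controlled term applied by an editor. The miniature ABI/INFORM-like records are invented for illustration.

```python
# Toy contrast: literal string matching (no string, no match) versus
# retrieval on editorially assigned controlled terms. Records are invented.
records = [
    {"title": "Employee stock ownership plans in mid-size firms",
     "abstract": "ESOP adoption rose among manufacturers in 1989.",
     "controlled_terms": ["Employee Stock Ownership Plans", "Manufacturing"]},
    {"title": "Sharing equity with the workforce",
     "abstract": "Broad-based ownership programs change incentives.",
     "controlled_terms": ["Employee Stock Ownership Plans"]},
]

def string_match(query):
    """Dialog-style: no string in the record, no match."""
    return [r for r in records
            if query.lower() in (r["title"] + " " + r["abstract"]).lower()]

def controlled_term_match(term):
    """Editorial indexing: match on the term a human indexer applied."""
    return [r for r in records if term in r["controlled_terms"]]

print(len(string_match("ESOP")))                                     # 1 record
print(len(controlled_term_match("Employee Stock Ownership Plans")))  # 2 records
```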

Flash forward to free Web search.

Search is expensive, and the money to pay for content processing and the other bits and pieces of the so called “free system” has to come from somewhere. The most used Web search services get money mostly from advertising; that is, from third party payers. The reason advertisers pay money is to get access to Web search users. The present Web search system is largely built to maximize the money that flows to the search service provider. Nothing about the process is objective in my opinion. Unlike Dialog, free Web search meddles with the search results anywhere it can in order to derive benefit for itself. A happy user is not the goal of the system. A happy advertiser is the main focus in my opinion.

In the good old days, there was overt meddling, but it was limited to the user’s query and the database producer’s editorial policy. The timesharing company providing the service selected some databases for its service and excluded others. Users had no control over the timesharing vendor. Dialog and LexisNexis did what was necessary to maximize revenues and control the customer, the database producer, and the revenues.

But even in the good old days most online searchers did not worry much about the database producers’ editorial policies. Today almost no one thinks about the provenance of a content object. The Web search service wants clicks and advertisers. The advertiser wants clicks, leads, and sales. The content is not the main concern of the advertiser. Getting traffic is the main concern. And the Webmaster of an individual Web site wants traffic. The user wants information for free. The SEO industry sprang up to help anyone with money spoof the free Web indexes in order to get more traffic for a Web site which had little or no traffic in many cases. These are the ingredients of the feedback loop that has made free Web search the biased service it is. And it is the feedback loop that almost guarantees a lack of objectivity.

Now read “When Businesses Attack Their Customers” or one of the dozens of other write ups by English majors, failed programmers, and search engine optimization experts. The notion of a Web search system fiddling the results seems to be a real light bulb moment. Give me a break. Consider these typical functions in Web indexing and posting today:

  • Lousy content created to get clicks from the clueless. There’s big money in crap content because of programs like Google AdWords. Those annoying pop up ads are just variations on the crap content scheme. Lousy content exists because search engines incentivize its creators. Users are unable to think critically about information, preferring to take whatever is dished up as gospel.
  • The Web indexes are not in the education business. Web indexes are in the traffic and advertising business, and these outfits will do what’s necessary to get traffic. If the National Railway Retirement Board adds an important document, that document may wait a long time before a Web search engine indexes it. Put up a post about Mel Gibson’s court battle, and that document is front and center really fast. Certain content attracts clicks, and that content gets the limelight.
  • People who use the Web describe themselves as good researchers. Baloney. Most people look for information the way a Stone Age person made a fire: Wait for a lightning strike, steal or borrow a burning stick from a tribesman, or get two rocks and bash them together. Primitive queries cause Web search systems to deliver what the user wants without the user having to think about source, provenance, accuracy, or freshness. By delivering what users may want, Web search engines create a way to offer advertisers what appears to be a great sales advantage. I think the present approach delivers advertisers meaningless clicks, big bills, and lots of wacky metrics. Sales. Not so much.

I don’t think the commercial online search systems and the commercial database producers have a future filled with exploding revenues and ever higher quality content. I think the feedback loop set up and fed by free Web search is broken. In its wake is the even more subjective and probably easier to manipulate “social search” method. If you don’t know something, just ask a friend. That will work really well on certain topics. The uninformed are now leading the uninformed. Stupid is as stupid does.

I use the Exalead Web index. No index is perfect, but I am more confident in Exalead’s approach because the company is not into the ad game. I also use DuckDuckGo and Blekko. Neither is perfect, and while I have more confidence in the relevancy of their results, I don’t know the scope of the companies’ indexes nor their respective editorial policies. The other Web indexes are little more than ad engines.

And SEO or search engine optimization? That “discipline” was created to get a Web page to the top of a results list. The SEO motivation was never precision, recall, or relevancy. Accuracy of the content was not a primary concern. Clicks were it. As SEO “experts” trashed relevancy methods, the Web search engines abandoned objectivity and went for the clicks and money. I don’t have a problem with this; what I have a problem with is the baloney manufactured about bias, lousy search results, and other problems. These problems, in my opinion, complement the naive and uninformed approach to research most users of Web search systems rely upon.

A failure in some education systems virtually ensures that critical thinking is in danger of becoming extinct. In an iPad mad world with attention deficit disorder professionals running rampant, I suppose the howls of outrage may be news. For me, this is an old story and an indication of the state of Web search.

The feedback loop is up and operating. Irrelevancy will increase in the quest for ad revenue. No easy fix is in sight for a problem that’s been around for a decade. Now the Web search providers want to push search results to users before the users search. Gee, that’s a great opportunity to deliver subjectively ordered results based on advertiser needs. The scary part is that many Web users neither know nor care about provenance, precision, recall, or relevance.

Welcome to a future with lots of lousy searchers who think they are experts.

Give me a break.

If you know an information professional, sometimes called a librarian, take a moment and get some advice from a real pro about searching. Too much work? Maybe that’s why so many bad decisions are evident today? Bad data, uninformed decisions, a lack of critical thinking, and flawed information skills are nutrients for big and bad mistakes.

Stephen E Arnold, November 30, 2010

Freebie

Indexing and Content Superficialities

November 27, 2010

“Understanding Content Collection and Indexing” provides a collection of definitions and generalizations which makes clear why so many indexing efforts by eager twenty-somethings with degrees in Home Economics and Eighteenth Century Literature go off the rails: it takes more than learning a list of definitions to create a truly useful indexing system. In our opinion, the process should be about solving problems. As the article states:

The ability to find information is important for myriad reasons. Spending too much time looking for information means we’re unable to spend time on other tasks. An inability to find information might force us to make an uninformed or incorrect decision. In worse scenarios, inability to locate can cause regulatory problems, or, in a hospital, lead to a fatal mistake.

This list is a place to start. It does describe the very basics of content collection, indexing, language processing, classification, metasearch, and document warehousing. We have to ask, though: is this analysis inspired by Associated Content or Demand Media?

For the real deal on indexing, navigate to www.taxodiary.com.

Cynthia Murrell, November 27, 2010

Freebie

Google Extends Government Indexing

June 18, 2010

Google has a better index of US government content than either the government or the vendors who are beavering away on this treasure trove. Now Google has added another chunk of content to its system. You can benefit from these data, but I would assert that Google’s MOMA Intranet may make even better use of the information. How? Just ask your local Googler for a demo.

The US Patent and Trademark Office (USPTO) is entering into a two year, no cost agreement with Google to make bulk electronic patent and trademark public data available. In this arrangement, the USPTO provides the data, Google hosts it for the public.

Research Buzz reported in their post, “Google Teaming Up With USPTO To Make Patent and Trademark Data Available” that the estimated size of this data storage will be about ten terabytes. This not so humble chunk of data will include patent grants and applications, trademark applications, and patent and trademark assignments, with more data (like trademark file histories) available in the future.

Google noted that it is only hosting the data provided by the USPTO; it isn’t altering or changing it in any way. It should also be noted that the bulk data is provided in zip files. It appears that Google wants you to download it to your own machines before you start analyzing it.
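Working with the bulk files is as plain as that sounds. The sketch below downloads one archive and inspects its contents before any analysis; the URL is a placeholder, not a real Google or USPTO bulk data address, so substitute an actual archive location.

```python
# Hedged sketch of handling the bulk zip files described above: fetch one
# archive to local disk, then peek at its contents before analyzing. The
# URL is a placeholder, not a real bulk data address.
import urllib.request
import zipfile

ARCHIVE_URL = "https://example.com/uspto-bulk/ipg100101.zip"  # placeholder
LOCAL_PATH = "ipg100101.zip"

urllib.request.urlretrieve(ARCHIVE_URL, LOCAL_PATH)  # download to your machine first

with zipfile.ZipFile(LOCAL_PATH) as archive:
    for name in archive.namelist()[:10]:             # peek at the first few entries
        info = archive.getinfo(name)
        print(f"{name}: {info.file_size} bytes uncompressed")
```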

Skeptical geese might ask, “Why not crunch that content with the Guha / Halevy methods?” I think making the data available with the benefit of semantic processing is slightly more useful than a big zip file.

Melody K. Smith, June 18, 2010

Freebie
