Access Innovations Dazzles on the High Wire

February 9, 2011

The American Association for Cancer Research (AACR) is partnering with Access Innovations’ semantic services to tag and index its vast journal content on the HighWire platform. The project will allow users to better access and correlate articles across several websites.

As reported in “AACR Selects Access Innovations for Semantic Indexing of Content on High Wire,” AACR has confidence in its choice:

“We are thrilled to be working with Access Innovations to develop an AACR taxonomy that can be applied to our content, and with HighWire Press to allow us to present related articles across our journals,” said Diane Scott-Lichter, publisher of the AACR suite of journals. “We expect to expand these efforts to include semantic tagging of other AACR information.”

Between Access Innovations’ years of experience managing data and the advanced HighWire taxonomy, AACR looks to be on solid ground.

Cynthia Murrell, February 9, 2011

Freebie

Big Data Action from Cloudant

January 26, 2011

Do you have problems searching big data stored in CouchDB? Cloudant has discovered the solution to your problem by adding full text indexing to CouchDB and applying it to search. CMS Newswire provides the details in “Cloudant Has Found the Answer to Searching Big Data.” Three MIT particle physicists created Cloudant when their old tools weren’t enough to manage their research.

“Cloudant’s product is the only one that integrates search directly into CouchDB to provide real-time access to data. Many of their customers were storing content in two places: in CouchDB and in Solr; Cloudant saw an opportunity to provide an easier, low cost solution.”

The new program combines the open source search library Lucene with CouchDB to make customized searching easy. Cloudant is available free to hosting customers, with an upgrade due in February 2011.
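
For the curious, here is a minimal sketch of what querying a Lucene-backed search index on a CouchDB/Cloudant database over HTTP might look like. The account URL, database, design document, and index names are placeholders rather than Cloudant’s documented defaults; check the vendor’s documentation for the exact endpoint layout of your release.

```python
# Minimal sketch: querying a Lucene-backed search index on a CouchDB/Cloudant
# database over HTTP. Account URL, database, design doc, and index name are
# hypothetical placeholders.
import requests

ACCOUNT = "https://example.cloudant.com"   # hypothetical account URL
DB, DDOC, INDEX = "articles", "search", "by_text"

def search(query, limit=10):
    """Run a Lucene query string against the index and return matching rows."""
    url = f"{ACCOUNT}/{DB}/_design/{DDOC}/_search/{INDEX}"
    resp = requests.get(url, params={"q": query, "limit": limit})
    resp.raise_for_status()
    return resp.json().get("rows", [])

if __name__ == "__main__":
    for row in search('title:"particle physics" AND year:[2009 TO 2011]'):
        print(row["id"], row.get("fields"))
```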

Whitney Grace, January 26, 2011

Freebie

Hit Boosting and Google

January 20, 2011

Well, well, well. The Google watchers have discovered hit boosting. “Hit boosting” is the must-have function in a real-world search system. Forget the research computing lab demos. In the real world, folks like Dick Cheney want their Web sites at the top of a results list. Do you rely on fancy indexing methods or egghead numerical recipes? This goose doesn’t. “Hit boosting” is a shortcut. There are many ways to make a certain result, or hits from a specific site or about a certain topic, come up first in a results list. If you want more impact than the number one spot in a results list, the goose can slap a box on the results page with the boosted content front and center. To make the content even more visible, the goose can force you to hunt for a tiny “close” link.

Hit boosting has apparently been discovered by non-search experts. Navigate to “Google Gins Search Formula to Favor Its Own Services.” Here’s the key passage in my opinion:

Harvard professor Ben Edelman and colleague Benjamin Lockwood found that Google’s algorithm links to Gmail, YouTube, and other house brands three times more often than other search engines. Search terms such as “mail”, “email”, “maps”, or “video” all yield top results featuring Google’s services, they found. The practice, which Yahoo! was also found to engage in – albeit less blatantly – puts the search engines’ interests ahead of users’ need for unbiased data about the most useful sites on the web, they warned.

Okay, I wonder if the researchers realize that content tuning is a function of a number of search systems. In fact, if there is no administrative control to “weight” content, teenagers can be pressed into duty to write stored queries or hard-wired routines to make sure that the boosted content sits at the top of a results list. What if the content is not relevant to the user’s query? It doesn’t matter. Hit boosting is not a particularly user-sensitive method. The content is put where it is required. Period.
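
To make the mechanism concrete, here is an illustrative sketch, not any vendor’s actual code, of a hard-wired “stored query” rule that pins hand-picked URLs to the top of a results list whenever the query matches a stored pattern. The rule table and URLs are hypothetical.

```python
# Illustrative sketch only: a hard-wired "hit boosting" rule that pins selected
# URLs ahead of organically ranked hits for matching queries, regardless of relevance.
BOOST_RULES = {
    "energy policy": ["https://www.example.gov/energy"],  # hypothetical boosted URLs
}

def apply_hit_boosting(query, ranked_results):
    """Move boosted URLs ahead of organically ranked hits for matching queries."""
    boosted = []
    for pattern, urls in BOOST_RULES.items():
        if pattern in query.lower():
            boosted.extend(urls)
    # Boosted items first, then the organic list minus any duplicates.
    organic = [hit for hit in ranked_results if hit not in boosted]
    return boosted + organic

print(apply_hit_boosting("US energy policy 2011",
                         ["https://example.org/a", "https://example.org/b"]))
```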

Do users know about hit boosting? Nah. The computer is supposed to deliver “objective” results. Talk about cluelessness! Most results lists will contain useful information, but there can be lots of other stuff on a Web page. In fact, the Web page may not be a results list. The hits are in a container surrounded by other containers of quite interesting stuff.

Is Google doing something unusual? Nah. Categorical affirmatives are risky, but I can safely say that hit boosting is the norm in many search deployments. Google is just being logical.

Did Dick Cheney’s content appear at the top of the results list? You bet. Did he notice? No. But his Yale and Harvard helpers were on top of the situation. Just like the boosted content.

Amazing what some folks assume.

Stephen E Arnold, January 20, 2011

Freebie

What’s New with dtSearch?

January 11, 2011

Content processing vendors are constantly vying for a larger slice of the market they inhabit. New versions of existing software, the updating of specific features, and the ability to remain agile as companies merge or are aggressively consumed remain vital to any firm or product’s survival. dtSearch, the Maryland-based tech merchant, is no exception.

The company boasts fifteen search options in a new rundown posted to its site. These include Natural language searching, which allows users to enter an unstructured search request in any international language, as well as Phrase and Phonic search choices. Other alternatives are multiple Proximity options, Numeric searching, and Wildcard searching, which allows “?” to hold a single letter place and “*” to hold multiple letter places. In my opinion, one of the most interesting capabilities is Fuzzy Searching, an option that uses a proprietary algorithm to successfully execute the query despite misspelled terms. Fuzzy Searching also “adjusts from 0 to 10 so you can fine-tune fuzziness to the level of OCR or typographical errors in your files.”
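
dtSearch’s fuzzy algorithm is proprietary, so the following is only a sketch of the general idea: treat the 0 to 10 fuzziness dial as a maximum edit distance between the query term and terms in the index.

```python
# Illustrative sketch of adjustable "fuzzy" matching: the fuzziness dial is
# treated as a maximum Levenshtein edit distance. Not dtSearch's actual algorithm.
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(term, indexed_terms, fuzziness=2):
    """Return indexed terms within `fuzziness` edits of the query term."""
    return [t for t in indexed_terms
            if edit_distance(term.lower(), t.lower()) <= fuzziness]

print(fuzzy_match("taxonomy", ["taxnomy", "toxicology", "taxonomies"], fuzziness=2))
```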

The end of the summer saw the release of two versions of dtSearch 7.65 (Builds 7906 and 7907), which resolved minor bugs in the software’s indexing features and reduced memory requirements for parsing large XLS files. The dtSearch line of products fills many customer needs, and the company continues to hone that line to offer high quality at competitive pricing.

Stephen E Arnold, January 11, 2011

Freebie

Wikileaks and Metadata

January 7, 2011

ITReseller’s “Working to Prevent Being the Next Wikileak? Don’t Forget the Metadata” is worth a look. The write-up calls attention to indexing as part of buttoning up an organization’s document access procedures.

ITReseller says this about metadata:

A key part of the solution is metadata – data about data (or information about information) – and the technology needed to leverage it. When it comes to identifying sensitive data and protecting access to it, a number of types of metadata are relevant: user and group information, permissions information, access activity, and sensitive content indicators. A key benefit to leveraging metadata for preventing data loss is that it can be used to focus and accelerate the data classification process. In many instances the ability to leverage metadata can speed up the process by up to 90 percent, providing a shortlist of where an organisation’s most sensitive data is, where it is most at risk, who has access to it and who shouldn’t.

Each file and folder, and user or group, has many metadata elements associated with it at any given point in time – permissions, timestamps, location in the file system, etc. – and the constantly changing files and folders generate streams of metadata, especially when combined with access activity. These combined metadata streams become a torrent of critical metadata. To capture, analyze, store and understand so much metadata requires metadata framework technology specifically designed for this purpose.
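
As a rough illustration of the kind of metadata harvesting the passage describes, here is a minimal sketch that walks a directory tree, collects permissions and timestamps, and flags files that look over-exposed or sensitive. The world-writable test and keyword hints are hypothetical stand-ins for a real metadata framework, not a substitute for one.

```python
# Minimal sketch: harvest file metadata (permissions, timestamps, simple
# sensitive-content indicators) and shortlist files worth reviewing.
import os
import stat
from datetime import datetime

SENSITIVE_HINTS = ("confidential", "salary", "passport")  # hypothetical indicators

def harvest_metadata(root):
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            yield {
                "path": path,
                "mode": stat.filemode(st.st_mode),
                "modified": datetime.fromtimestamp(st.st_mtime).isoformat(),
                "world_writable": bool(st.st_mode & stat.S_IWOTH),
                "name_looks_sensitive": any(h in name.lower() for h in SENSITIVE_HINTS),
            }

def shortlist(root):
    """Return only records flagged as potentially sensitive or over-exposed."""
    return [m for m in harvest_metadata(root)
            if m["world_writable"] or m["name_looks_sensitive"]]

if __name__ == "__main__":
    for record in shortlist("."):
        print(record["path"], record["mode"], record["modified"])
```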

Some good points here, but what raised our eyebrows was the thought that organizations have not yet figured out how to “index”. Automation is a wonderful thing; however, the uses of metadata are often anchored in humans. One can argue that humans need play no part in indexing or metadata.

We don’t agree. Maybe organizations will take a fresh look at adding trained staff to tackle metadata. By closing in house libraries, many organizations lost the expertise needed to deal with some of the indexing issues touched upon in the article.

Stephen E Arnold, January 7, 2011

Freebie

Facebook Gouges Google TV

January 4, 2011

The basic information about Facebook’s TV service is set forth in “Reality TV for the Rest of Us.” The idea is that TV listings recommended by friends are better than slogging through lots of channels or, even worse, relying on recommendations generated by a numerical recipe with thresholds that may or may not deliver what you expect.

This service is a fresh approach to finding. The method ignores the brute force indexing of some companies and relies on recommendations. The significance is that brute force search does not hold the leadership position inside a social walled garden like Facebook’s.

The shift is going to be dismissed by Google. Google will attempt to slot social content recommendations into its services. Who knows? Maybe Facebook will implode. Google would then have a shot to get back in the game. I think that this “reality TV” thing is going to be as painful and damaging to Google as a world champion fighter getting a thumb in the eye and then losing vision in that eye.

Why?

First, lightweight. Recommendations are just less hassle than brute force search. Humans do the work. Volunteer work.

Second, relevance. What are friends for? People trust referrals from friends. Word of mouth works. Different and better than a numerical recipe. (Go ahead. Disagree.)

Third, fits core demographics’ established behavior. For those hooked on Facebook, having Facebook spit out TV shows when one is listening to music, texting, and doing homework is a really nifty attention deficit disorder service. I might be driven crazy. For Facebook’s users, the new service is likely to be a must use function. Habitual behavior means a big win for Facebook.

Google may have to go through its social life with one eye operating at 50 percent. Upside? Maybe Google won’t see all those Android devices using Facebook to find content on the vast wasteland?

Stephen E Arnold, January 4, 2011

Freebie

Start Your Year with Your Content Radar On

January 2, 2011

I am concerned about the quality of information which appears in public Web search results. I was fooling around with queries for the “new” silver bullet, which is made of Fool’s Gold. You know this search revolution as taxonomy. Everyone wants a taxonomy because key word indexing usually disappoints the inept searcher. A taxonomy, therefore, is one way to allow a user to slam in a word and maybe get a “use for” or “broader term” to make the results more “relevant.”
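
A toy sketch shows how that works in practice: the user’s term is expanded with its “use for” synonyms and its broader term before the query runs. The two-entry taxonomy below is entirely hypothetical.

```python
# Toy sketch: expand a user's query term with its "use for" synonyms and
# broader term from a tiny, hypothetical taxonomy.
TAXONOMY = {
    "oncology": {"use_for": ["cancer research", "tumor biology"], "broader": "medicine"},
    "taxonomy": {"use_for": ["controlled vocabulary"], "broader": "knowledge organization"},
}

def expand_query(term):
    """Return the original term plus its synonyms and broader term, if any."""
    entry = TAXONOMY.get(term.lower())
    if not entry:
        return [term]
    return [term] + entry["use_for"] + [entry["broader"]]

print(expand_query("taxonomy"))
# ['taxonomy', 'controlled vocabulary', 'knowledge organization']
```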

But a taxonomy goes only so far. The depleted uranium bullet is one that uses “facets”, another faerie dust term. The hapless user clicks on a descriptor or bound phrase that is broader than a taxonomy entry and magic happens. The results will contain something even the junior college graduate can use.

There is a level above taxonomy and facets too. This is the Disneyworld of predictive search. The idea is that the “system knows best.” The user does not have to do much more than fire up the app or poke her nose against the touch pad’s icon and the system predicts and delivers the needed information. Sounds great.

The problem, gentle reader, is that indexing systems don’t know when the content is addled, wrong, shaped, or just chock full of crapola. Let me illustrate with two examples from an outfit with Web sites such as JazdTech.com. Yep, “Jazd”, not “jazzed.” That’s a clue that I notice. Some search systems are not as picky.

I use the little-known metasearch system Devilfinder.com. Be alert. Turn on “safe search.” Now run the query for taxonomy software vendors, and in the results list you find these promising links:

[screenshot]

There you go. “2011 Top Taxonomy Software Companies in Pharma.” Right on the money. The problem is that the results are not germane to anything remotely close to taxonomy software narrowed to pharmaceutical applications. When I clicked on the link on New Year’s Eve, I saw this Web page:

[screenshot]

It looks okay but the links are useless and so far off the keywords I used for the query that I laughed out loud. Okay, a metasearch system can make mistakes.

I ran the query “2011 Top Taxonomy Software Companies” on Google and was greeted with a display that contained not one or two entries pointing to JazdTech.com’s lousy content but many listings.

[screenshot]

After the ads on which Google feeds came 11 hits to pages containing irrelevant information that superficially looks like content.

What’s my point?

It is easy to run queries which return hits to Web pages that are like sugar-free candy for dieters. The goodies look like the real thing, but are not. That’s okay when fooling the snack addict. For online searching, users expect nutritious information.

JazdTech.com is one outfit benefiting from the indifference of “real” search and metasearch systems. The screenshot below contains lots of information which I find questionable. I can guard myself against most flawed Web content. Others may not be so equipped.

[screenshot]

The domain is registered to an outfit called JAZD Markets, allegedly operating out of Hampstead, New Hampshire. There appears to be a reference to a street address in Andover, Massachusetts, on Dundee Park Drive. The “service” is hosted by my favorite outfit, Hostgator.com. The staff at JAZD Markets list themselves on LinkedIn, but provide modest information about the quality control in use for the firm’s software listings. Perhaps one purchases a listing and selects a category in which to appear? I will have to check out Firehouse BBQ and Pig Roast when I am next in Andover, a lovely place.

The problem is that some researchers may waste valuable time or use information that will make their search and retrieval cannon explode in their face.

Stephen E Arnold, January 1, 2011

Sponsored by Pandia.com

A Warm Solr Goodie

December 28, 2010

Expanding on the open-source Apache Lucene search software, the Apache Solr project adds another layer of customizable capabilities (see http://lucene.apache.org/solr). As Jayant Kumar puts it in his December 12 blog article, “How to Go about Apache-Solr”:

“Definitely solr provides an easy to use – ready made solution for search on lucene – which is also scalable.”

Kumar also points out that, when it comes to importing data, Sphinx users have an advantage: they don’t have to write code in order to port data. He also assures us that Solr’s use of an HTTP interface is no reason to avoid it.
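
For readers who want to see what that HTTP interface amounts to, here is a minimal sketch that sends a query to Solr’s select handler and reads back JSON. The host, port, and field names are placeholders for your own install, and older Solr releases may expose the handler at a slightly different path.

```python
# Minimal sketch: query Solr's select handler over HTTP and print matching docs.
# Host, port, and field names are placeholders.
import requests

SOLR_SELECT = "http://localhost:8983/solr/select"   # adjust for your core/collection

def solr_search(query, rows=10):
    """Send a query to the select handler and return the list of matching docs."""
    params = {"q": query, "rows": rows, "wt": "json"}
    resp = requests.get(SOLR_SELECT, params=params)
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

for doc in solr_search("title:lucene AND category:search"):
    print(doc.get("id"), doc.get("title"))
```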

This article is a useful write-up worth adding to your Solr reference library. It provides detailed instructions for installation, configuration, indexing, and importing data using this valuable resource.

Cynthia Murrell, December 28, 2010

Freebie

Content Tagging Costs

December 27, 2010

We read an interesting blog post called “The Search for Machine-Aided Indexing: Why a Rule-Based System is the Cost-Effective Choice.” Information about the costs of indexing content using different methods is often difficult to locate.

The article provides some useful information; however, I always verify any dollar estimates. Vendors often do custom price quotations, which makes it difficult to compare certain products and services.

Here’s the passage that caught my attention:

The database company manager could not give an exact figure for what their final actual costs were for purchasing Nstein; however, she did state that it was “not cheap.” She admitted that it was more expensive than all of the other MAI software products that they considered. (A press release from Nstein reported that the deal was worth approximately $CAN 450,000). When asked about staffing requirements, the manager estimated that it took the time of five full-time indexers and two indexing managers about a “month or so” at first. She added that there is a need for “constant” (she then rephrased that to “annual”) training. The investment company manager preferred not to discuss the actual implementation costs of Nstein, as there was a good deal of negotiation with non-cash assets involved. (A press release from Nstein of March 14th, 2002 reported that the deal was a five-year deal valued at over $CAN 650,000).

I downloaded this write-up and tucked it in my Search 2011 pricing file. One never knows when these types of estimates will come in handy. I noticed on a LinkedIn thread relating to enterprise search that a person posted prices for the Google Search Appliance. I did a bit of clicking around and tracked down the original source of the data: SearchBlox Software. The data on the chart reported prices for the Google Mini. When one explores the US government’s price list for Google appliances that can handle 20 million documents, a count encountered in some search applications, the cost estimates were off by quite a bit. Think in terms of $250,000, not $3,000.

I use whatever pricing data is available via open source research, and I know that hard data are often difficult to locate. The “appliance” approach is one way to control some costs. The “appliance” is designed to limit, like an iPad, what the user can do. Custom installations, by definition, are more expensive. When rules have to be created for any content processing system, the costs can become interesting.

Stephen E Arnold, December 27, 2010

Freebie, although Access Innovations bought me a keema nan several weeks ago.

Enterprise Search: Baloney Six Ways, like Herring

December 21, 2010

When my team and I talked over my write-up about the shift of some vendors from search to business intelligence, quite a bit of discussion ensued.

The idea that a struggling search vendor, most often an outfit with older technology, “reinvents” itself as a purveyor of business intelligence systems is common, and it evoked some strong reactions.

One side of the argument was that an established set of methods for indexing unstructured content could be extended. The words used to describe this digital alchemy were Web services, connectors, widgets, and federated content. Now these are, or were, useful terms. But the synthetic nature of English makes it easy to use familiar-sounding words in a way that performs an end run around the casual listener’s mental filters. It is just not polite to ask a vendor to define a phrase like business intelligence. The way people react is to nod in a knowing manner and say “for sure” or “I’ve got it.”


Have you taken steps to see through the baloney passed off as enterprise search, business intelligence, and knowledge management?

The other side of the argument was that companies are no longer willing to pay big money for key word retrieval. The information challenge requires a rethink of what information is available within and to an organization. Then a system developed to “unlock the nuggets” in that treasure trove is needed. This side of the argument points to the use of systems developed for certain government agencies. The idea is that a person wanting to know which supplier delivers the components with the fewest defects needs an entirely different type of system. I understand this side of the argument. I am not sure that I agree, but I have heard this case so often that the USB with the MP3 of the business intelligence sound file just runs.

As we approach 2011, I think a different way to look at the information access options is needed. To that end, I have created a tabular representation of information access. I call the table and its content “The Baloney Scorecard, 2011.”

