Cheerleading for dtSearch

September 30, 2014

Short honk: Want to read how one dtSearch user just loves the desktop search system to death? Navigate to “dtSearch: How to handle Big or even Biggish Data.” What strikes me is the write up’s creation of a new buzzword: biggish. dtSearch has been around since 1991 and is now at version 7. The system was once Microsoft centric, but a version for Android is allegedly in beta testing.

The write up states:

The performance of dtSearch is truly impressive and the fact that it’s not only fast but can handle Big Data makes it ideal for all sorts of heavy lifting searches as well as digital forensics; indeed, the company has extensive advice on how to use dtSearch for just that purpose.

There are a few, apparently minor downsides:

There are some things dtSearch doesn’t do such as exporting the data from only one or more indexed fields (for example, just “Sender” and “Date”) although exporting to CSV and importing into Excel allows you to slice and dice the data with ease. My only other criticism of dtSearch is that its user interface looks a little dated.

No information about the time required to add additional content to the index. What happens when dtSearch hits a Drobo with a terabyte of text? Answer: it takes days to index the content collection.

The big plus for dtSearch is not mentioned. In my opinion, dtSearch is one of the few remaining commercial personal desktop search solutions. Exalead and ISYS Search Software have left the field of battle. Freeware and shareware products have an odd predilection to crash and burn.

Check out dtSearch at

Stephen E Arnold, September 30, 2014

Search: Just an Activity

September 30, 2014

Well, this is going to be a surprise for some folks at Google. After building a brand and habit for the search box at, search is just an activity. I learned this in “Search Is No Longer a Destination. It’s an Activity.”

If I am an advertiser using AdWords or Facebook’s mechanism, I just want sales. Does the shift from destination to activity increase the value of a Facebook ad versus a Google ad?

The article points out:

Search engines have always had a hard time differentiating themselves to the masses. While digital marketers love analyzing the differences between algorithms, targeting methods, and result page layouts, the average person can’t tell much of a difference. That’s why for years “” was one of the top searches on Yahoo. That’s why despite some very clever (in my opinion) “Bing It On” TV commercials and some great case studies, Bing has had a very difficult time winning search traffic away from Google. As long as users aren’t dissatisfied with the results, they’ll keep searching wherever is convenient – often without even realizing what search engine they’re using.

Well, I am not sure that “always” is exactly on target. I think Chemical Abstracts differentiates itself quite well from Bing, Google, and a query about torts passed against Lexis. I know. I know. The article is aimed at folks who think about search in terms of Google, not the context of search and its more uninteresting manifestations.

The one point that I noted as fodder for my files was this one:

Context is the key element that powers these new search experiences. While some still contain a box where you can enter a query, their core functionality is around understanding and anticipating the searcher’s needs in the moment based on secondary signals like location, history, and other personal data the user chooses to share. And should the user need answers outside of this proactive information, voice search is the primary point of interaction.

I suppose I should be cheered that Delve, Microsoft’s search for Office 365, is going to get some blogger love. I am not exactly sure how a person looking for specific information will go about that task if accounts to commercial databases are not affordable and information access becomes an app.

I do not need to worry. The author provides this glimpse of the benefits of the death of traditional search:

No matter what format search marketing may take in the future, brands that build their strategy around providing valuable answers to their customers’ questions will continue to drive success in search – regardless of how the consumer searches, or if they even know what engine they’re using.

Right. When someone looks for a household cleanser, those ads for big name consumer products will fill the bill. How reassuring.

Stephen E Arnold, September 30, 2014

Processing Content Is Easy, Right?

September 30, 2014

A mobile search app would be useful and appreciated by mobile device users. According to the URX Blog post “Deduplication Of Web Content,” it is relatively easy to create a search app, but creating a robust search app is the challenge. A robust search app would need to include link prioritization, feature extraction, re-crawl estimation, and content deduplication. The post is the first in an article series on developing a mobile search app.

Deduplicating content is important for user experience:

“Duplicate pages in a search index poison search results. The goal of a search engine is to return both relevant and diverse documents, allowing users to decide the optimal resolution for a query. Without deduplication, the top-k results returned for a user’s query would likely contain duplicate content. In the extreme, all k results will be copies of the same page. This creates a bad user experience where, as the crawler scales out, the duplicate likelihood increases. In fact, Google’s Matt Cutts believes that up to 20% of web content is duplicated.”

The rest of the post examines the different types of duplication, how to identify them, and remove them from search results.
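
To make the near-duplicate idea concrete, here is a minimal sketch in Python. It is not the URX method; it assumes one common approach, word shingles compared with Jaccard similarity. The function names, the shingle size of three words, and the 0.8 threshold are all illustrative assumptions.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or nothing at all).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def deduplicate(docs, threshold=0.8, k=3):
    """Keep the first document of each near-duplicate group, in input order."""
    kept = []  # pairs of (document, shingle set)
    for doc in docs:
        s = shingles(doc, k)
        if all(jaccard(s, kept_set) < threshold for _, kept_set in kept):
            kept.append((doc, s))
    return [doc for doc, _ in kept]

# Hypothetical pages: the first two differ only in punctuation.
pages = [
    "Cheap flights to Paris from New York, book today",
    "Cheap flights to Paris from New York -- book today!",
    "A guide to indexing a terabyte of text",
]
unique_pages = deduplicate(pages)
```

The pairwise comparison is quadratic in the number of documents, so a crawler at scale would presumably need hashing or sketching on top of this; the sketch only shows what "duplicate" can mean in practice.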

While the search app will serve an important function, it does not make sense to me why people cannot just open a Web browser on a mobile device and conduct a regular search. What I would like to see is an app that searches content on apps on a device.

Whitney Grace, September 30, 2014
Sponsored by, developer of Augmentext

Internet Business: Slightly Different Points of View

September 29, 2014

First, navigate to “Another Top Investor Sounds the Alarm: When the Market Turns, a Bunch of Startups Are Going to Vaporize.” No big surprise here. The main idea is, in my opinion:

Over the past few years, it’s been relatively easy for startups to raise money from venture capitalists. In some cases, they’re raising hundreds of millions of dollars to keep their companies afloat. But behind the scenes, they’re plowing through that money either on marketing, overhead, or some other expense, which results in high burn rates. These bloated companies are using their millions to hide serious flaws in their business models.

At some point, those who provide the bucks to the venture firms will want a return. Many of the Fancy Dan outfits are not among the world’s most liquid operations. To raise cash, MBAs and accountants can cook up some quite remarkable solutions. The actions cascade down the line and end up pushing technology companies like those that pitch wild and crazy content technology into an Iron Maiden. This is essentially a casket with spikes protruding inward from both the box and its lid.

Ta da.

The individual is placed into the Iron Maiden and the door is shut. Ouch.

Now navigate to either the Google book itself or the concepts Web site at Eric Schmidt argues that businesses should be like Google. You know the moon shots, trying stuff and failing fast (I am not sure how fast Google has failed at social networking, but I don’t want to be argumentative), and value numbers/data over any humanoid subjectivity.

For many search and content processing companies, the senior managers have been failing for years in some cases. I want to make a list of would be start ups and then provide their date of inception. Heck, why embarrass outfits like Attivio, Coveo, Digital Reasoning, Lucid Imagination (now Lucid Works to which I am tempted to add “Really?” but I will not), and quite a few others.

The point is that we have two somewhat conflicting interpretations of the present business climate. The tweets that inspired the Business Insider write up are taking a hard look at what happens when the money goes away. No money means that affected firms fire people, raise prices, and pivot along with a half dozen or so MBA maneuvers before shutting the doors as Convera, Delphes, and Entopia did. A few lucky outfits will sell out like Endeca, Exalead, and iPhrase. A few will struggle along sort of open and sort of closed like a number of French search and content processing firms.

On one hand, these outfits are toast if more money is not “found.” On the other hand, forget money. In Google’s world view, these companies need to be more like Google or out Google Google.

The reality is that the contraction of search and content processing has already begun. Some outfits are going to have to find a way to deliver a solution that solves an actual problem and generates sustainable revenue. Companies in this spot include IBM with its Watson project, Hewlett Packard with its Autonomy IDOL technology, and Palantir, a billion dollar baby of considerable note.

My view is that the doom and gloom expressed in the Business Insider write up is more likely to occur than a Google style entity arising from the Google Moon shot and allied suggestions. I am not sure the Google recommendations apply to Google. A company that is 15 years old and has one revenue stream may be a success that fulfills Steve Ballmer’s one trick pony observation.

For search and content processing vendors, there is no easy way out unless money remains plentiful and Google’s advice actually works for an information retrieval company.

Stephen E Arnold, September 29, 2014

Why Good Enough Is the New Norm in Search

September 29, 2014

Navigate to “Postgres Full Text Search Is Good Enough.” I first heard this argument at a German information technology conference a few years ago. The idea is surprisingly easy to understand. As long as a user can bang in a couple of key words, scan a result list, and locate information that the user finds helpful: job done. The search results may consist of flawed or manipulated information. The search results may be off point for the user’s query when evaluated by old fashioned methods such as precision and recall. The user may be dumb and rely on what the user finds accurate.
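
For readers who have not run into the old fashioned methods, here is a minimal sketch of precision and recall in Python. The document identifiers and relevance judgments are made up for illustration; real evaluations depend on human judgments about which documents are relevant.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: the engine returns four documents, and a human
# judge says five documents in the collection are actually relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]
p, r = precision_recall(retrieved, relevant)
```

In this made up case half of the result list is on point and three of the five relevant documents never surface. A user who finds d1 helpful will probably never know about the other three, which is the good enough argument in a nutshell.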


This write up explains the good enough approach in terms of PostgreSQL, a useful open source Codd type data management system. Please, note. I am not uncomfortable with good enough search. I understand that when the herd stampedes, it is not particularly easy to stop the run. Prudence suggests that one take cover.

Here’s the guts of the write up:

What do I mean by ‘good enough’? I mean a search engine with the following features:

  • Stemming
  • Ranking / Boost
  • Support Multiple languages
  • Fuzzy search for misspelling
  • Accent support

Luckily PostgreSQL supports all these features.

The write up contains some useful code snippets to make use of search features. The discussion of full text search is coherent and addresses a vast swath of content. Note that proprietary vendors have tilled acres of marketing earth and fertilizer to convert search into a mind boggling range of functions.

This article includes code snippets to tackle full text within PostgreSQL.

Querying is included as well. Again, code snippets are included. (My teenage advisors said, “Very useful snippets.” Okay. Good.)

The write up concludes:

We have seen how to build a decent multi-language search engine based on a non-trivial document. This article is only an overview but it should give you enough background and examples to get you started with your own….Postgres is not as advanced as ElasticSearch and SOLR but these two are dedicated full-text search tools whereas full-text search is only a feature of PostgreSQL and a pretty good one

Reasonable observation. Worth reading.

If you are a vendor of proprietary search technology, there will be more individuals infused with the spirit of open source, not fewer. How many experts are there for proprietary systems? Fewer than the cadres of open source volk I surmise.

Stephen E Arnold, September 29, 2014

Yahoo Kills Its Directory

September 28, 2014

Yahooooo. Remember that sound. Once it was a happy yodel. Soon it will be a howl of agony. The Directory created by the original Yahoos, Messrs. Filo and Yang, is to be terminated with extreme prejudice. The top Xoogler has decided, I learned in “Yahoo to Shut Down Another Batch of Products as Activist Investor Pushes for AOL Acquisition.” The Directory spawned Web search. Web search spawned online advertising. Online advertising created the environment that killed precision and recall. In 20 years, finding information related to what the user actually wanted arrived and will soon depart.

The article asserts:

Directory, meanwhile, is one of Yahoo’s oldest services. As the name suggests, Directory is basically a directory listing designed to help users find the types of websites they’re looking for. Years ago, services like this were a valuable resource but times have certainly changes and Directory will come to an end on December 31.

My hunch is that Yahoo itself may experience the departure of a senior executive in about the same time frame. And the AOL clarion call? Two aged sparrows do not a peacock make.

Stephen E Arnold, September 28, 2014

NSA Catalog Available

September 27, 2014

Short honk: If you want a copy of National Security Agency 2014 Technology Catalog: Technology Transfer Program, you can download it for now from this link. I found pages 26 to 40 fascinating. Will IDC issue its own version of this document, using its surfing technique demonstrated by Dave Schubmehl with my content? I will keep my eye open.

Stephen E Arnold, September 27, 2014

Google-News Corp. Marvelous, He Said, She Said

September 26, 2014

I don’t have a dog in this hunt. I think both Google and News Corp. are wonderful. Weaknesses, none. Both companies just have strengths. Google has its Washington, DC lobbying effort and News Corp. has Fox News. Google has its ups and downs with the privacy issue (except at Stanford University). News Corp. has that alleged telephone tapping matter. Google has legions of users in Europe. News Corp. has fingers clutching newspapers, eyeballs watching television, and some Web users.

One big difference. Google is a 15 year old adolescent. News Corp. is an aged information company.

The two, like a May December romance gone sour, will face nothing but irreconcilable differences. Need an example? Check out this blog post from the Google charmingly labeled “Dear Rupert.” See, Google does have a sense of humor.

I don’t have the energy to walk through the arguments and counter arguments. I do want to highlight one point and comment about it. News Corp. leaves a door open with its comment: Google’s “power” makes it hard for people to “access information independently and meaningfully.” Google is “willing to exploit [its] dominant market position to stifle competition.”

The Google response is wonderful. I believe that Commodore Vanderbilt, Jay Gould, John D. Rockefeller (oops sorry. He’s apoplectic about his descendants’ dumping holdings in fossil fuels), and JP Morgan (you remember: the fellow whose portrait makes it appear he is holding a knife as he starts to push himself from a chair) could not have collectively inked a better response:

With the Internet, people enjoy greater choice than ever before — and because the competition is just one click away online, barriers to switching are very, very low.

Well, I sort of enjoy the one click notion, but the reality for online users is that once a habit is formed, users have a tough time breaking it. Google is a habit with a market share only a drug lord can seek: 95% of users in Denmark, 66% of users in the US, 95% of users in France (gasp, France, home of Exalead, the former Quaero brain trust centroid, and numerous search vendors), etc. For more data see

Add to the monopoly position Google search controls, competition is few and far between. is just not able to gain significant market share from the GOOG. The hot ticket search engines according to search “experts” are and Er, these are metasearch engines and need access to other vendors’ indexes. As search vendors doing primary indexing bite the dust, these metasearch outfits face some tough choices if they want to stay in business. The little known Exalead search offers a tiny fraction of the GOOG’s coverage. And Yandex? Well, Mr. Putin may make it difficult for that outfit to remain in business without picking up and heading to a new campground.

One click? Nope. As users shift to mobile devices, the information access mode shifts to applications or apps. Maybe News Corp. can tackle Google in this new space. I am not sure, however, if those who know how to do intercepts focus much on cutting Google off at the knees with an appealing online ad platform.

You can work through the rest of the arguments. Remember, you are one click away from finding a new search engine.

Stephen E Arnold, September 26, 2014

A Detailed Look at SharePoint 2013

September 25, 2014

If you’re looking to pull back the curtain on SharePoint, check out “Deep-Dive of Search in SharePoint 2013, Office 365 and SharePoint Online ‘From the Trenches’” at the EPC Group’s blog. That company has been implementing SharePoint and Office 365 hybrids for years, and is highly regarded by many SharePoint analysts. The introduction to the detailed article tells us:

“In this blog post, EPC Group’s Sr. Search Architects will cover the key service applications and services that power SharePoint 2013, Office 365 and SharePoint Online’s search to enable your organization’s data to easily be found on-demand as well to enable the accuracy of your search results.”

The first section lists SharePoint’s search applications and related services, and notes some things to keep in mind. For example, both “federated search” and “scopes” are now known as “result sources.” Also, a default crawl account must be established; the post explains:

“In order for search to properly work, the SharePoint 2013 Search service must configure a default crawl account which is also referred to as the default content access account. This account must be an active, Active Directory Domain Services domain account. This account should not be setup as an individual or a specific person in IT as EPC Group has seen SharePoint search issues caused by this account being deactivated and an entire organization’s SharePoint search cease to work until the account issue was resolved.”

The article delves into detail on the platform’s components: Search, Crawl, Content Processing, Analytics Processing, Search Administration, Search Index, Search Query, and Search Diagnostics. The flow charts and bulleted lists make this an easy resource to reference; I’d recommend bookmarking to anyone who has a SharePoint system to maintain.

Cynthia Murrell, September 25, 2014

Sponsored by, developer of Augmentext

Elasticsearch Optimization Tips

September 25, 2014

One of the Elasticsearch experts at Found shares some of his wisdom in “Optimizing Elasticsearch Searches.” Writer and open source enthusiast Alex Brasetvik emphasizes that Elasticsearch often offers several ways to approach a problem, and that his suggestions can lead to improved performance. The post begins with a look at the way the platform’s filters work:

“Understanding how filters work is essential to making searches faster. A lot of search optimization is really about how to use filters, where to place them and when to (not) cache them….

“This is the key property of filters: the result will be the same for all searches, hence the result of a filter can be cached and reused for subsequent searches. Caching them is quite cheap, as you can store them as a compact bitmap. When you search with filters that have been cached, you are essentially manipulating in-memory bitmaps – which is just about as fast as it can possibly get.

“A rule of thumb is to use filters when you can and queries when you must: when you need the actual scoring from the queries.”
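
The caching idea in the quoted passage can be sketched in a few lines of Python. This is a toy model, not Elasticsearch code: the `FilterCache` class, its methods, and the sample documents are hypothetical, and a Python integer stands in for the compact bitmap.

```python
class FilterCache:
    """Toy model of cached filters: one bitmap per (field, value) pair."""

    def __init__(self, docs):
        self.docs = docs
        self.cache = {}   # (field, value) -> int used as a bitmap
        self.misses = 0   # how many times a bitmap had to be built

    def bitmap(self, field, value):
        """Return the bitmap for a filter, building it only on first use."""
        key = (field, value)
        if key not in self.cache:
            self.misses += 1
            bits = 0
            for doc_id, doc in enumerate(self.docs):
                if doc.get(field) == value:
                    bits |= 1 << doc_id  # set the bit for a matching document
            self.cache[key] = bits
        return self.cache[key]

    def search(self, *filters):
        """AND the cached bitmaps together and return matching document ids."""
        bits = (1 << len(self.docs)) - 1  # start with every document set
        for field, value in filters:
            bits &= self.bitmap(field, value)
        return [i for i in range(len(self.docs)) if (bits >> i) & 1]

# Hypothetical three-document index.
docs = [
    {"lang": "en", "year": 2014},
    {"lang": "fr", "year": 2014},
    {"lang": "en", "year": 2013},
]
cache = FilterCache(docs)
first = cache.search(("lang", "en"), ("year", 2014))
second = cache.search(("lang", "en"))  # reuses the cached "lang" bitmap
```

The second search never touches the documents; it only reads a bitmap built during the first. No scores are computed anywhere, which is the trade the rule of thumb describes: a filter answers yes or no, so its result can be reused, while a query has to score each match.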

Brasetvik goes on to elaborate on points such as effective filter usage, combining filters, acceleration filters, aggregation issues, scoring, and important Things to Avoid. The helpful post concludes with a list of further resources.

Cynthia Murrell, September 25, 2014

Sponsored by, developer of Augmentext
