Web Search: Picking Sides from the Bleachers

June 5, 2011

One of the interesting characteristics of fans is that their shouts can inspire the athletes in the game. Here in rural Kentucky, fans can also focus on one another. Instead of the usual Southern civility, shouting matches or fisticuffs can break out. The players continue playing as the “game within the game” unfolds.

image

Fans cheer, but whether the noise alters the outcome of the game is a matter for a PhD dissertation, not a career.

The Gold Team

I read “How Facebook Can Put Google Out of Business.” The write up takes a premise set up by Googler Eric Schmidt, who, until recently, was the CEO of the company. The PR-inspired mea culpa positioned Mr. Schmidt as the person responsible for Google’s failures in social media. I recall seeing references to social functions in Google’s patent documents even before Google’s purchase of Orkut in 2003, and Orkut’s trajectory has been quite interesting. As you may know, the path wandered through a legal thicket, toured the more risk-filled environs of Brazil, and ended up parked next to the railroad tracks near the Googleplex in Mountain View.

The TechCrunch article pointed out that Facebook has detailed information about its 500 or 600 million “members”. The idea is that Facebook can leverage the information about these members to create a more compelling “finding” system.

I suppose I can nitpick about the write up, but it presents information that I have touched upon in this Web log for a couple of years. When I read the article, my reaction was, “I thought everyone already knew this.”

The Blue Team

Then I read “The Silliest Idea Ever: Facebook Going After Google In Search.” This write up used a rhetorical technique that I have long employed; namely, taking a contrary position in order to highlight certain features of an issue. In my experience, the approach annoys 30-somethings who have memorized an elevator pitch and want to get back to Call of Duty or their iPhone. However, I enjoy the intellectual exercise and will continue the practice.

The main premise of the “Silliest Idea Ever” is that competing with Google in search is expensive, Google is a moving target, and other types of disruption will influence what happens between Google and Facebook in search. You should read the original write up to get the full freight of meaning.

Read more

Google, Mobile, and Money: Can We Discern a Pattern, Connect Some Dots?

May 31, 2011

I woke up early this morning, mostly because the crows decided to have a post-Memorial Day celebration here in the hollow near Harrod’s Creek. Beautiful birds. Often their discourse reminds me of data about the success of Android, the lack of success at RIM, and the slow start Microsoft’s Windows Phone 7 has had. And Apple? Well, even the crows have iPhones in Kentucky.

What I found interesting was more data about the success/failure of Android and Apple in the mobile game. “Nielsen: Android’s Lead Over iOS May Have Stopped Growing” reports that Android is popular “but no more than it was in March [2011].” You can work through the numbers, which are based on Nielsen’s survey results. Note that Nielsen is hedging its bets on its results. My experience is that such results are often driven by the needs of marketing and sales and not so much by what I want to know.

image

I want to connect the dots, but I am not sure what’s happening. Source: http://corknuts.tumblr.com/

Here’s the passage I noted on my trusty iPad:

Read more

The Web, Blogs, and the Reed Effect

May 31, 2011

There was a blip in the blogosphere about the infusion of capital into the big, firm information arteries of GigaOM, founded by Om Malik. Even the trend-tracking Mashable covered the story in “Tech Blog GigaOM Shifts Focus to Premium Content.”

The money apparently flowed from Reed Elsevier Ventures, with some other investors betting on the blog news and analysis service. The founder added some cash to the pot, as did Alloy Ventures. The funding flies in the face of the well-received Business Insider presentation about how traditional media companies can behave more like start ups.

image

Traditional professional publishers push prices to the peak of Mount Tolerance. As long as revenues do not decline, the number of customers is irrelevant. Remember the concept of elasticity in pricing from Econ 100?

This is an interesting development for three reasons:

First, although the Huffington Post hit the jackpot, the GigaOM investment is suggestive. What I see is that the GigaOM content play is interesting but not yet at the Huffington Post level. Investors hope to reach that benchmark in money magnetism so outfits like AOL will acquire GigaOM for an even juicier pay day.

Second, the shift signals more trouble for the advertiser-supported model of publishing. Google Adsense seems to be losing some steam, and pitching vendors to support a blog is expensive and time consuming. With more cash, GigaOM can follow in the footsteps of more traditional publishing, consulting, and analysis businesses. Get subscribers, sell reports, and cherry pick other money making opportunities as they come along. Sounds like a plan to me. For outfits like Google, the river of money may behave like Lake Hamoun and dry up. The Reed Effect, in my view, is pushing prices to the heights. If customers want the information, those customers can pay.
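
The Econ 100 elasticity point from the caption above is worth a quick illustration. Here is a minimal sketch with entirely hypothetical numbers, nothing from Reed’s actual books: when demand is inelastic, a publisher can raise prices, shed customers, and still grow revenue.

    # Price elasticity of demand: E = (% change in quantity) / (% change in price).
    # All figures below are hypothetical, for illustration only.
    price, customers = 1000.0, 500      # a $1,000 subscription with 500 subscribers
    elasticity = -0.4                   # inelastic demand: |E| < 1
    hike = 0.20                         # raise the price 20 percent

    new_price = price * (1 + hike)                        # 1200.0
    new_customers = customers * (1 + elasticity * hike)   # 460.0

    print(price * customers)            # 500000.0 in revenue before the hike
    print(new_price * new_customers)    # 552000.0 after: fewer customers, more revenue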

Read more

Mapping the New Landscape of Enterprise Search

May 23, 2011

What has happened to enterprise search? In a down economy, confusion among potential licensees has increased, based on the information I gathered for my forthcoming The Landscape of Enterprise Search, to be published by Pandia in June 2011. The price for the 186-page report is $20 US or 15 euros. Pandia and I decided that the information in the report should be available to those wrestling with enterprise search. With some “experts” charging $500 and more for brief, pay-to-play studies, our approach is to provide substantive information at a very competitive price point.

In this completely new report, my team and I compress a complex subject into a manageable 150 pages of text. There are 30 pages of supplementary material, which you can use as needed. The core of the report is an eyes-wide-open analysis of six key vendors: Autonomy, Endeca, Exalead, Google, Microsoft, and Vivisimo.

image

You may recall that in the 2004 edition of the Enterprise Search Report, I covered about two dozen vendors. By the time I completed the third edition (the last one I wrote), the coverage had swelled to more than 28 vendors and to an unwieldy 600-plus pages of text.

In this new Landscape report, the publisher, my team, and I focused on the companies most often included in procurement reviews. With more than 200 vendors offering enterprise search solutions, there are 194 vendors who could argue that their system is better, faster, and cheaper than the systems discussed in Landscape. That may be true, but including a large number of vendors makes for another unwieldy report. I know from conversations with people who call me asking about another “encyclopedia of search” that most people want two or three profiles of search vendors. We maintain profiles for about 50 systems, and we track about 300 vendors in our in-house Overflight system.

My team and I have tried to make clear the key points about the age and technical aspects of each vendor’s search solution. I have also focused on explaining what the systems can and cannot do. If you want information that will strike you as new and different, you will want a copy of my new Landscape report.

image

Are you lost in the alchemist’s laboratory? This is a place where unscientific methods and fiddling take precedence over facts. Little wonder that when “experts” explain enterprise search, there is no “lead into gold” moment. There is a mess. The New Landscape of Search helps you avoid the alchemists’ approach. Facts help reduce the risk in procuring an enterprise search solution.

Read more

Search: An Information Retrieval Fukushima?

May 18, 2011

Information about the scale of the horrific nuclear disaster in Japan at the Fukushima Daiichi nuclear complex is now becoming more widely known.

Expertise and Smoothing

My interest in the event is the engineering of a necklace of old-style reactors and the problems the LOCA (loss of coolant accident) triggered. The nagging thought I had was that today’s nuclear engineers understood the issues with the reactor design, the placement of the spent fuel pool, and the risks posed by an earthquake. After my years in the nuclear industry, I am quite confident that engineers articulated these issues. However, technical information gets “smoothed” and simplified. The complexities of nuclear power generation are well known, at least in engineering schools. Nuclear engineers are often viewed as odd ducks by the civil engineers and mechanical engineers. A nuclear engineer has to do the regular engineering stuff of calculating loads and looking up data in hefty tomes. But the nukes also need grounding in chemistry, physics, and math, lots of math. Then the engineer who wants to become a certified professional nuclear engineer has some other hoops to jump through. I won’t bore you with the details, but the end result of the process produces people who can explain clearly a particular process and its impacts.

image

Does your search experience emit signs of trouble within?

The problem is that art history majors, journalists, failed Web masters, and even Harvard and Wharton MBAs get bored quickly. The details of a particular nuclear process make zero sense to someone more comfortable commenting on the color of Mona Lisa’s gown. So “smoothing” takes place. The ridges and outcrops of scientific and statistical knowledge get simplified. Once a complex situation has been smoothed, the need for hard expertise is diminished. With these simplifications, the liberal arts crowd can “reason” about risks, costs, upsides, and downsides.

image

A nuclear fallout map. The effect of a search meltdown extends far beyond the boundaries of a single user’s actions. Flawed search and retrieval has major consequences, many of which cannot be predicted with high confidence.

Everything works in an acceptable or okay manner until there is a LOCA or some other problem like a stuck valve or a crack in a pipe in a radioactive area of the reactor. Quickly the complexities, risks, and costs of the “smoothed problem” reveal the fissures and crags of reality.

Web search and enterprise search are now experiencing what I call a Fukushima event. After years of contentment with finding information, suddenly the dashboards are blinking yellow and red. Users are unable to find the information needed to do their job or something as basic as locate a colleague’s telephone number or office location. I have separated Web search and enterprise search in my professional work.

I want to depart for a moment and consider the two “species” of search as a single process before the ideas slip away from me. I know that Web search processes publicly accessible content, has the luxury of ignoring servers with high latency, and filters content to create an index that meets the vendor’s needs, not the users’ needs. I know that enterprise search must handle diverse content types, must cope with security and access controls, and must perform more functions than one of those two-inch-wide Swiss Army knives on sale at the airport in Geneva. I understand. My concern is broader in this write up. Please, bear with me.

Read more

Google and Search

May 11, 2011

Over the last five days, I have been immersed in conversations about Google and its public Web search system. I am not able to disclose the people with whom I have spoken. However, I want to isolate the issues that surfaced in my face-to-face and telephone conversations and offer some observations about the role of traditional Web sites. In fact, one of the participants directed my attention to the post “Google Panda=Disaster.” I don’t think the problem is Panda. I think a more fundamental change has taken place, and Google’s methods are simply out of sync with the post-shift environment. But hope is not lost. At the end of this write up, I provide a way for you to learn about a different approach. Sales pitch? Sure, but a gentle one.

Relevance versus Selling Advertising

The main thrust of the conversations was that Google’s Web search is degrading. I have not experienced this problem, but the three groups with whom I spoke have. Each had different data to show that Google’s method of handling its publicly accessible Web site had changed.

First, one vendor reported that traffic to the firm’s Web site had dropped from 2,000 uniques per month to 100. The Web site is informational. There is a widget that displays headlines from the firm’s Web log. The code is clean and the site is not complex.

Second, another vendor reported that content from the firm’s news page was appearing on competitors’ Web sites. More troubling, the content was appearing high in a Google results list. However, the creator of the content found that the stories from the originating Web site were buried deep in the Google results list. The point is that others were recycling original content and receiving a higher ranking than the source of the original content.

image

Traditional Web advertising depicted brilliantly by Ken Rockwell. See his work at http://www.kenrockwell.com/canon/compacts/sd880/gallery-10.htm

Third, another company found that its core business was no longer appearing in a Google results list for a query about the type of service the firm offered. Instead, the company was turning up in an unrelated or, at best, secondary results list.

I had no answer to the question each firm asked me, “What’s going on?”

Through various contacts, I pieced together a picture that suggests Google itself may not know what is happening. One source indicated that the core search team responsible for the PageRank output is doing its work much as it has for the last 12 years. Googlers responsible for selling advertising were not sure what changes were being made in the core search team’s algorithm tweaks. Not surprisingly, most people are scrutinizing search results, fiddling with metatags and other aspects of a Web site, and then checking to see what happened. The approach is time consuming and, in my opinion, very much like that of the person who plugs a token into a slot machine and hits the jackpot. There is great excitement at the payoff, but the process is not likely to work on the next go-round.

Net net: I think there is a communications filter (intentional or unintentional) between the group at Google working to improve relevance and the sales professionals at Google who need to sell advertising. On one hand, this is probably healthy because many organizations put a wall between certain company functions. On the other hand, if Adwords and Adsense are linked to traffic and that traffic is highly variable, some advertisers may look to alternatives. Facebook’s alleged 30 percent share of the banner advertising market may grow if the efficacy of Google’s advertising programs drops.

Read more

Tracking: Does It Matter?

May 11, 2011

A news story broke this week that was difficult for many to ignore; it seems our beloved iPhones and iPads are paying us the same attention we lavish on them. It turns out these Apple devices keep an internal log of every cell tower or hot spot they connect to, in essence creating a map of the user’s movements for as long as ten months. It gets better. The log file is highly visible and unencrypted, making it accessible to anyone with your device in their hands.
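
How visible is “highly visible”? The reporting describes an ordinary SQLite database copied out of the device or its iTunes backup. Here is a minimal sketch of what a curious examiner might run, assuming the widely reported layout (a consolidated.db file with a CellLocation table holding latitude, longitude, and a Mac-epoch timestamp; the file and column names come from press accounts, not my own testing):

    import sqlite3
    from datetime import datetime, timedelta

    # consolidated.db: the unencrypted location log pulled from an iTunes backup.
    conn = sqlite3.connect("consolidated.db")
    rows = conn.execute(
        "SELECT Timestamp, Latitude, Longitude "
        "FROM CellLocation ORDER BY Timestamp DESC LIMIT 5")

    MAC_EPOCH = datetime(2001, 1, 1)  # iOS stores seconds since 2001-01-01
    for ts, lat, lon in rows:
        print(MAC_EPOCH + timedelta(seconds=ts), lat, lon)

No password, no decryption, no forensic toolkit: a stock database client and a dozen lines of script.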

image

Getting the scent. Source: http://www2.journalnow.com/news/2011/feb/07/wsweat01-beagle-found-in-a-jiffy-by-tracking-dogs-ar-760887/

This news stems from a couple of British programmers who stumbled upon said “secret” location file. In the midst of the melee that ensued from outraged consumers and lawmakers alike, I was directed to a Bloomberg article titled “Researcher: iPhone Location Data Already Used By Cops”.

Interestingly enough, a rendition of this same story was covered by the press months ago, only featured in a different light, courtesy of an individual studying forensic computing. Per the write-up: “In a post on his blog, he explains that the existence of the location database—which tracks the cell phone towers your phone has connected to—has been public in security circles for some time.

While it’s not widely known, that’s not the same as not being known at all. In fact, he has written and presented several papers on the subject and even contributed a chapter on the location data in a book that covers forensic analysis of the iPhone.”

Read more

New Spin for OmniFind: Content Analytics

May 2, 2011

IBM has dominated my thinking with its bold claims for Watson. In the blaze of game show publicity, I lost track of the Lucene-based search system OmniFind 9.x. My Overflight system alerted me to “Content Analytics Starter Pack.” According to the April 2011 announcement:

The Starter Pack offers an advanced content analytics platform with Content Analytics and industry-leading, knowledge-driven enterprise search with OmniFind Enterprise Edition in a combined package. IBM Content Analytics with Enterprise Search empowers organizations to search, assess, and analyze large volumes of content in order to explore and surface relevant insight quickly to gain the most value from their information repositories inside and outside the firewall.

The product allows IBM licensees to:

  • Find relevant enterprise content more quickly
  • Turn raw text into rapid insight from content sources internal and external to your enterprise
  • Customize rapid insight to industry and customer specific needs
  • Enable deeper insights through integration to other systems and solutions.

At first glance, I thought IBM Content Analytics V2.2 was one program. It is not. OmniFind Enterprise Edition 9.1 has one set of hardware requirements at http://goo.gl/Wie0X, and the analytics component has another set at http://goo.gl/5J1ox. In addition, there are specific software requirements for each product.

The “new” product includes “improved support for content assessment, Cognos® Business Intelligence, and Advanced Case Management.”

image

Is IBM’s bundling of analytics and search a signal that the era of traditional search and retrieval has officially ended? Base image source: www.awesomefunnyclever.com

When you navigate to http://goo.gl/he3NR, you can see the different configurations available for this combo product.

What’s the pricing? According to IBM, “The charges are unchanged by this announcement.” The pricing seems to be based on processor value units, or PVUs. Without a link, I am a bit at sea with regard to pricing. IBM does point out:

For clarification, note that if for any reason you are dissatisfied with the program and you are the original licensee, you may obtain a refund of the amount you paid for it, if within 30 days of your invoice date you return the program and its PoE to the party from whom you obtained it. If you downloaded the program, you may contact the party from whom you acquired it for instructions on how to obtain the refund. For clarification, note that for programs acquired under the IBM International Passport Advantage Agreement, this term applies only to your first acquisition of the program.

Read more

Google and Mobile: Will the Pass from Web to Mobile Search Be Smooth?

April 25, 2011

Over the bunny weekend, I spoke with two people about the direction the Web is moving. In those informal conversations, I learned some interesting factoids. First, the Web today is different from the Web of five, even two, years ago. One person used the word “ephemeral” to describe much of the information that is available. I thought that “ephemeral” applied to Twitter “tweets” and some of the short content posted in the comments sections of blogs and other social media. As I learned, this definition is too narrow. The ephemeral nature of the Web applies to such content types as:

  • Dynamic Web pages such as those produced by airline ticket or hotel reservation systems. The content, which is mostly availability and price, changes with each screen refresh.
  • Junk pages that someone produces until the pages stop attracting traffic, often leaving no trace anywhere. To see an example, navigate to Webspace.com.
  • Test Web sites or blogs put up and then abandoned. To see an example, navigate to Captain Roy. The Web page stays behind, but the blog and its content are temporary.

I did not agree with the person’s approach to ephemera, but I did agree with the perception that the texture of information available via the Web is quite different today from what it was a few years back.

image

Can Google’s Web search pass the baton to Google mobile search without losing cadence, speed, or control?

The second conversation focused on the notion of the volume of data. I had heard some astounding and unsubstantiated claims about the rate of growth of digital information. One person told me that Web and organizational content was doubling every two months. This person was the president of a trendy software company, so I zipped my lip. But on the call over the weekend, a person who shall remain anonymous asserted, “Web content doubles every 72 hours.” Again, I did not push the issue, but that is a heck of a statement.

Two observations:

There is a lot of digital information, and some of it is clearly not intended to be substantive. Persistence, if it does occur, is accidental or irrelevant to the person creating the information. Other content is machine generated, like the Webspace.com “page”, and is little more than a placeholder or a way to generate ad revenue or click throughs.

Finding information in today’s environment is not particularly easy. The general purpose Web search engines like Bing.com and Google.com are able to provide pointers to more traditional Web content. To locate information that appears in a tweet, I have to exert considerable effort. For companies with distinct names, my Overflight service works okay, but some outfits have names that make it almost impossible to find them without lots of false drops to games, rock and roll, or other content that has appropriated a word, phrase, or semantic space. Examples include Brainware, Stratify, and Thunderstone.

Mobile search is the primary means of finding information for many people. On my trip to Hong Kong at the end of March 2011, I watched people in public spaces like the Starbucks at the giant mall near the central rapid transit station. There were a few laptops and iPads, but the majority of the people were using mobile devices. A similar uptake is evident in most big cities. Here in Harrod’s Creek, there are precious few people, so the one person using a clunky laptop at the Dairy Queen is out of the mainstream.

In the business section of my printed edition of the New York Times today (April 25, 2011), I read “Google, a Giant in Mobile Search, Seeks New Ways to Make It Pay.” The “it”, of course, is mobile search in particular and, more generally, mobile online information access. You may be able to read the story online, but the links often go dead. More ephemera, I suppose. Try this one, but no guarantees: http://goo.gl/Ebpnz.

Read more

Google, Traffic, English 101, and an Annoying Panda

April 21, 2011

I read a snippet on my iPad and then the full story in the hard copy of the Wall Street Journal: “Sites Retool for Google Effect.” You can find this story on page B4 of the version that gets tossed in the wet grass in Harrod’s Creek, Kentucky. Online? Not too sure anymore. This link may work. But, then again, maybe not.

The point of the story is that Google has changed its method of determining relevance. A number of sites, mostly unfamiliar to me, made the point that Google’s rankings are important to businesses. One example was One Way Furniture, an outfit that operates in Melville, New York. Another was M2commerce LLC, an office supply retailer in Atlanta, Georgia. My takeaway from the story is that these sites’ owners are going to find a way to deliver content that Google perceives as relevant.

image

A panda attack. Some Web site owners suffer serious wounds. Who are these Web site owners trying to please? Google or their customers? Image source: http://tomdoerr.wordpress.com/2011/03/25/whos-in-the-house-panda-in-da-house/

I don’t want to be too much like my auto mechanic here in Harrod’s Creek, but what about the customer? My thought is that an outfit posting information should ask, “What does our customer need to make an informed decision?” The Wall Street Journal story left me with the impression, which is probably incorrect, that the question has become, “What do I need to create so Google will reward me with a high Google rank?”

For many years I have been avoiding search engine optimization. When I explained how some of Google’s indexing “worked” on lecture tours for my 2004-2005 Google monograph, The Google Legacy, pesky SEO kept popping up. Google has done a reasonable job of explaining how its basic voting mechanism works. For those of you who are fans of Jon Kleinberg, you know that Google was influenced to some extent by Clever. There are other touch points in the Backrub/Google PageRank methods disclosed in the now famous PageRank patent. Not familiar with that document? You can find a reasonable summary on Wikipedia or in my The Google Legacy.
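
For readers who have not waded through the patent or the summaries, the basic voting mechanism fits in a few lines of code. Here is a minimal sketch of the original power-iteration idea, not Google’s production system and certainly not the layered 2011 version I discuss below:

    import numpy as np

    def pagerank(links, damping=0.85, iters=50):
        """Toy PageRank: links[i, j] = 1 when page j links to page i."""
        n = links.shape[0]
        out = links.sum(axis=0)          # each page's outbound link count
        out[out == 0] = 1                # crude fix for dangling pages
        M = links / out                  # a page splits its vote among its outlinks
        rank = np.full(n, 1.0 / n)
        for _ in range(iters):
            rank = (1 - damping) / n + damping * M @ rank
        return rank

    # Three pages: 0 and 1 both link to page 2; page 2 links back to page 0.
    A = np.array([[0, 0, 1],
                  [0, 0, 0],
                  [1, 1, 0]], dtype=float)
    print(pagerank(A))                   # page 2 collects the most "votes"

Everything Google has added since, the layers in the sweater-and-coat metaphor below, sits on top of a core no more exotic than this.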

If we flash forward from 1996, 1997, and 1998 to the present, quite a bit has happened to relevance ranking in the intervening 13 to 15 years. First, note that we are talking about more than a decade. The guts of PageRank remain, but the method has been treated the way my mother treated a cold day. She used to put on a sweater. Then she put on a light jacket. After adding a scarf, she donned her heavy wool coat. Underneath, it was still my mom, but she added layers of “stuff” to keep her warm.

image

All wrapped up, just slow moving with reduced vision. Layers have an operational downside.

That’s what has happened, in part, to Google. The problem with technology is that if you build a giant facility, it becomes difficult, time consuming, and expensive to tear big chunks of that facility apart and rebuild it. The method of change taught in MBA class is to draw a couple of boxes, babble a few buzzwords, get a quick touch of Excel fever, and then head to the squash court. The engineering reality is that the MBA diagrams get implemented incrementally. Eventually the desired rebuild is accomplished, but at any point there is a lot of the original facility still around. If you took an archaeology class for something other than the field trips, you know that humans leave foundations, walls, and even gutters in place. The discarded material is then recycled into the “new” building.

How does this apply to Google? It works the same way.

How significant are the changes that Google has made in the last few months? The answer is, “It depends.”

Google has to serve a number of different constituencies. Each constituency has to be kept happy, and the “gravity” of each constituency carefully balanced. Algorithms, even Google algorithms, are still software. Software, even smart software that scurries to a lookup table to get a red-hot value or weight, is chock full of bugs, unknown dependencies, and weird actions that trigger volleyball games or some other mind-clearing activity.

image

Google has to make progress and keep its different information “packages” in balance and hooked up.

The first constituency is the advertiser. I know you think that giant companies care about “you” and “your Web site”, but that is just not true. I don’t care about individuals who have trouble using the comments section of this blog. If a user can’t figure something out, what am I supposed to do? Call WordPress and tell them to fix the comments function because one user does not know how to fill in a Web form? I won’t do that. WordPress won’t do that. I am not confident you, gentle reader, would do that. Google will fiddle with its relevance method only because there are some very BIG reasons to take a step as risky and uncharted as slapping another layer of functionality on top of the aging PageRank method. My view is that Google is concerned enough to fool with the plumbing because it is aware that the golden goose of Adwords and Adsense is honking in a manner that signals distress. No advertisers, no Google. Pretty simple equation, but that’s one benefit of living in rural Kentucky. I can only discern the obvious.

Read more
