Solr 4 Webinar from LucidWorks

July 16, 2013

Make plans to attend the Solr 4 webinar this Thursday hosted by the experts at LucidWorks, through their open resource portal SearchHub. Read all the details of the upcoming event in the LucidWorks release, “WEBINAR: Scaling Through Partitioning and Shard Splitting in Solr 4.”

The details state:

“Over time, even the best designed Solr cluster will reach a point where individual shards are too large to maintain query performance. In this Webinar, you’ll learn about new features in Solr to help manage large-scale clusters. Specifically, we’ll cover data partitioning and shard splitting. Partitioning helps you organize subsets of data based on data contained in your documents, such as a date or customer ID. We’ll see how to use custom hashing to route documents to specific shards during indexing. Shard splitting allows you to split a large shard into 2 smaller shards to increase parallelism during query execution.”
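
To make the two features concrete before the session, here is a minimal Python sketch of what document routing and shard splitting look like against a SolrCloud 4.x collection. The node address, collection name, and field names are assumptions for a default test install; the composite-ID ("prefix!id") routing syntax and the SPLITSHARD collection action (Solr 4.3+) are the features the webinar covers.

    import json
    import requests  # third-party: pip install requests

    SOLR = "http://localhost:8983/solr"  # assumed SolrCloud 4.x node

    # With the compositeId router, every ID sharing the "acme!" prefix
    # hashes to the same shard, keeping one customer's data together.
    docs = [
        {"id": "acme!doc1", "customer_s": "acme", "body_t": "first order"},
        {"id": "acme!doc2", "customer_s": "acme", "body_t": "second order"},
    ]
    requests.post(
        SOLR + "/collection1/update?commit=true",
        data=json.dumps(docs),
        headers={"Content-Type": "application/json"},
    )

    # Later, split an oversized shard into two smaller ones (Solr 4.3+).
    resp = requests.get(
        SOLR + "/admin/collections",
        params={"action": "SPLITSHARD", "collection": "collection1",
                "shard": "shard1"},
    )
    print(resp.status_code)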

Attendees will come away with real-world examples and applications to make Solr 4 production-ready. The featured presenter is Timothy Potter, senior Big Data architect at Dachis Group and a true expert in the field. Register today for the free webinar.

Emily Rae Aldridge, July 16, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

PRatronizer Alert: Have Info for ArnoldIT? Proceed with Caution

July 4, 2013

I am not a journalist. My academic training is in medieval poetry in Latin. I was lucky to get out of high school, college, and a couple of graduate programs. Few people embraced my interest in indexing medieval Latin manuscripts. Among those who made the most fun of my interests were those in journalism school, electrical engineers, and people studying to be middle school teachers.

In graduate school, the mathematics majors found my work interesting and offered grudging respect because one of my relatives was Vladimir Ivanovich Arnold, a co-worker of that so-so math guy, the long-distance hiker Andrey Kolmogorov.

I have, therefore, some deep-seated skepticism about “real” journalists, folks who carry around soldering irons, and the aforementioned middle school teachers.

Last week I received a semi-snarky email about one of my articles. The person writing me shall remain nameless. I have assembled some thoughts designed to address his question, “Why did you not mention [company A] and [company B] in your article about desktop search?” I think this was a for-fee column which appeared in KMWorld, but I can’t be sure. My team and I produce a number of “articles” every day, and I am not a librarian, another group granted an exemption from my anti-journalist, anti-EE, and anti-middle-school stance.


Let me highlight the points which are important to me. I understand that you, gentle reader, probably do not have much interest. But this is my blog and I am not a journalist.

First, each of my for-fee columns, which run in four different publications, focuses on something “sort of” connected to search, online, analytics, knowledge management (whatever that means), and the even more indefinable content processing. I write about topics which my team suggests might be interesting to people younger and smarter than I. In short, PR people, stay away. I pay professionals to identify topics for me. I don’t need help from you. I don’t need the PR attitude which I call “PRatronizing.” Is this clear enough? Do not spam me with crazy “news” releases. Do not call me and pretend we are pals. When a call came in yesterday, I was in a meeting with a law librarian. I put the call on the speakerphone and told the caller to know whom she buzzes before she pretends we are pals. The PRatronizer was annoyed. The law librarian said, “None of us on your team are that friendly to you. Heck, I don’t think you are my friend after four years of daily work.” My reaction: “That’s why you are sitting here with me and the PRatronizer is dealing with a firm ‘Get lost.’”


LucidWorks Webinar Available on Solr 4

July 3, 2013

Several posts of late have revolved around the news of the Solr 4 release. The open source community is excited and ready to see what this new iteration can do. LucidWorks is a company that builds its value-added search and Big Data products on top of the Apache Lucene/Solr platform, so it has a genuine vested interest in the Lucene/Solr open source community. One of its experts offered a webinar on Solr 4. Read the details in the release, “Webinar: Solr 4, the NoSQL Search Server.”

The summary begins:

“The long awaited Solr 4 release brings a large amount of new functionality that blurs the line between search engines and NoSQL databases. Now you can have your cake and search it too with Atomic updates, Versioning and Optimistic Concurrency, Durability, and Real-time Get! Learn about new Solr NoSQL features and implementation details of how the distributed indexing of Solr Cloud was designed from the ground up to accommodate them.”
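
For readers who want to see one of these features in action, here is a minimal sketch of an atomic update in Python. The URL, core name, and field names are assumptions for a default Solr 4 install (atomic updates also require stored fields and an update log in solrconfig.xml); the “set”/“inc” modifier syntax is the Solr 4 feature the summary mentions.

    import json
    import requests  # third-party: pip install requests

    SOLR = "http://localhost:8983/solr/collection1"  # assumed Solr 4 core
    HEADERS = {"Content-Type": "application/json"}

    # Index a document the usual way.
    doc = [{"id": "book1", "title_t": "Solr 4", "popularity_i": 1}]
    requests.post(SOLR + "/update?commit=true", data=json.dumps(doc),
                  headers=HEADERS)

    # Atomic update: send only the fields to change. "set" replaces a value
    # and "inc" increments one; the rest of the document is left alone.
    patch = [{"id": "book1", "popularity_i": {"inc": 1}}]
    requests.post(SOLR + "/update?commit=true", data=json.dumps(patch),
                  headers=HEADERS)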

The presentation was given by Yonik Seeley, an expert in the field if there ever was one. Seeley created Apache Solr and is a co-founder of LucidWorks. This sort of training from an expert is invaluable, and LucidWorks is providing it for free! Do not miss your opportunity to get up to speed on all that Solr 4 has to offer.

Emily Rae Aldridge, July 3, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

SolrCloud Configuration

June 17, 2013

SolrCloud is a set of new distributed capabilities in Solr. It is useful for setting up a highly available, fault-tolerant cluster of Solr servers. Systems Architect has a useful guide for configuring the system. Read the advice in the entry, “Painless Guide to Solr Cloud Configuration.”

The article begins:

“’Cloud’ has become a very ambiguous term, and it can mean virtually anything these days. If you are not familiar with Solr Cloud, think about it as one logical service hosted on multiple servers. Distributed architecture helps with scaling, fault tolerance, and distributed indexing, and generally speaking improves search capabilities. All of that is very exciting and I’m highly impressed with how the service is designed, but… it’s a relatively new product.”
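
As a small illustration of the “one logical service on multiple servers” idea, the Python sketch below queries two nodes of an assumed two-node SolrCloud 4.x test cluster (the ports follow the old Solr wiki walkthrough) and gets the same collection-wide answer from each; either node can serve the whole collection, which is the fault tolerance the article describes.

    import requests  # third-party: pip install requests

    # Assumed nodes of a two-node SolrCloud test cluster.
    nodes = ("http://localhost:8983", "http://localhost:7574")

    for node in nodes:
        # Any node accepts the query and fans it out to all shards.
        resp = requests.get(
            node + "/solr/collection1/select",
            params={"q": "*:*", "rows": 0, "wt": "json"},
        )
        print(node, "->", resp.json()["response"]["numFound"], "docs")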

Cloud capability is a highly desirable attribute in enterprise search solutions, one that many service providers are rapidly adopting. LucidWorks builds its value-added enterprise search and Big Data solutions on top of the power of Apache Lucene/Solr. However, instead of requiring customers to configure everything independently, LucidWorks offers this capability out of the box, along with an award-winning support and services network. Both solutions are available for deployment on-site, in the Cloud, or in hybrid form.

Emily Rae Aldridge, June 17, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

The Fastest Windows Desktop Search

June 11, 2013

The MakeUseOf article “What Are the Fastest Tools for Windows Desktop Search?” gives readers a glimpse of several desktop search tools and tries to determine whether Windows desktop search really is fast or whether it comes up short compared to third-party tools. Windows search is easy to use. Open any Explorer window or folder and you will find a search bar at the top right corner; searches can also be initiated from the Start Menu. In the tests, a Windows search averaged 3 minutes 30 seconds un-indexed and under one second indexed. Windows search also keeps a continual index of all files and folders, which can improve overall search speeds.
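
The indexed-versus-un-indexed gap is easy to reproduce. Here is a toy Python sketch (not how Windows Search is actually implemented, which uses a persistent on-disk index) that walks a directory tree once to build an in-memory filename index and then compares a fresh walk against an index lookup for the same query.

    import os
    import time

    ROOT = os.path.expanduser("~")  # directory tree to search; adjust as needed
    QUERY = "report"                # hypothetical filename fragment

    def unindexed_search(name):
        """Un-indexed search: rescan the whole tree on every query."""
        hits = []
        for dirpath, _dirs, files in os.walk(ROOT):
            hits.extend(os.path.join(dirpath, f)
                        for f in files if name in f.lower())
        return hits

    # Build the index once, the expensive step the desktop tools amortize.
    start = time.time()
    index = []
    for dirpath, _dirs, files in os.walk(ROOT):
        index.extend(os.path.join(dirpath, f) for f in files)
    print("index built in %.1fs over %d files"
          % (time.time() - start, len(index)))

    start = time.time()
    unindexed_search(QUERY)
    print("un-indexed lookup: %.1fs" % (time.time() - start))

    start = time.time()
    hits = [p for p in index if QUERY in os.path.basename(p).lower()]
    print("indexed lookup: %.3fs (%d hits)" % (time.time() - start, len(hits)))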

The next program featured is the search tool Everything. Its minimal interface provides an empty window with a search bar across the top and results delivered below as you type. This simple yet effective tool produces instantaneous, real-time results, and it indexes files to keep those results fast. Listary was the third tool reviewed, and unlike the previous two it does not have a separate search interface: you simply start typing and it determines whether you want to search. Its average time for a computer-wide search was under one second. Though all three are great tools, the author has a clear winner.

“My winner? I prefer Everything. Listary offers the same “find as you type” instantaneous search results but the interface can sometimes be intrusive, especially when you accidentally bring it up. I like how Everything is both fast and compact and only shows up when I open it myself.”

Both third-party tools seem worth a try, but neither made the June 2013 Information Today article about desktop search, which makes one wonder what other potential winners are out there just waiting to be discovered.

April Holmes, June 11, 2013

Sponsored by ArnoldIT.com, developer of Augmentext

LucidWorks Continues Training through Webinars

June 7, 2013

SearchHub is one way that LucidWorks keeps in touch with the open source developer community, particularly those concerned with Apache Lucene and Solr. In addition to providing videos, podcasts, and other reference materials, LucidWorks also posts upcoming webinars and other training opportunities. Check out the latest in the entry, “Webinar: Solr 4, the NoSQL Search Server.”

The webinar will cover:

“The long awaited Solr 4 release brings a large amount of new functionality that blurs the line between search engines and NoSQL databases. Now you can have your cake and search it too with Atomic updates, Versioning and Optimistic Concurrency, Durability, and Real-time Get! Learn about new Solr NoSQL features and implementation details of how the distributed indexing of Solr Cloud was designed from the ground up to accommodate them.”
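
Two of the features named in the summary, real-time get and optimistic concurrency, are easy to sketch. The following Python fragment assumes a default Solr 4 core at localhost:8983 with the update log enabled; the /get handler and the _version_ field are the Solr 4 mechanisms in question, while the document and field names are made up.

    import json
    import requests  # third-party: pip install requests

    SOLR = "http://localhost:8983/solr/collection1"  # assumed Solr 4 core
    HEADERS = {"Content-Type": "application/json"}

    # Index without committing, then read the doc back via real-time get:
    # /get serves the latest version from the update log, pre-commit.
    requests.post(SOLR + "/update",
                  data=json.dumps([{"id": "talk1", "title_t": "NoSQL search"}]),
                  headers=HEADERS)
    current = requests.get(SOLR + "/get",
                           params={"id": "talk1", "wt": "json"}).json()["doc"]

    # Optimistic concurrency: resubmit with the _version_ just read. Solr
    # rejects the write with HTTP 409 if another writer got there first.
    update = [{"id": "talk1", "title_t": "NoSQL search, revised",
               "_version_": current["_version_"]}]
    resp = requests.post(SOLR + "/update?commit=true",
                         data=json.dumps(update), headers=HEADERS)
    print(resp.status_code)  # 200 on success, 409 on a version conflict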

LucidWorks continues to invest in the open source community through such training and support opportunities. LucidWorks as a company is known for the support and services that surround its value-added enterprise search and Big Data solutions. But LucidWorks is also committed to the foundation of its success: the open source community and the innovation and agility it brings.

Emily Rae Aldridge, June 7, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Bitext Delivers a Breakthrough in Localized Sentiment Analysis

May 29, 2013

Identifying user sentiment has become one of the most powerful analytic tools provided by text processing companies, and Bitext’s integrative software approach is making sentiment analysis available to companies seeking to capitalize on its benefits while avoiding burdensome implementation costs. A few years ago, Lexalytics merged with Infonics. Since that time, Lexalytics has been marketing aggressively to position the company as one of the leaders in sentiment analysis. Exalead also offered sentiment analysis functionality several years ago. I recall a demonstration which generated a report showing how reviewers of a particular restaurant expressed their satisfaction.

Today vendors of enterprise search systems have added “sentiment analysis” as one of the features of their systems. The phrase “sentiment analysis” usually appears cheek-by-jowl with “customer relationship management,” “predictive analytics,” and “business intelligence.” My view is that the early text analysis vendors, such as TREC participants in the early 2000s, recognized that keyword indexing was not useful for certain types of information retrieval tasks. Go back and look at the suggestions for the benefit of sentiment functions within natural language processing, and you will see that the idea is a good one, but it has taken a decade or more to become a buzzword. (See, for example, Y. Wilks and M. Stevenson, “The Grammar of Sense: Using Part-of-Speech Tags as a First Step in Semantic Disambiguation,” Journal of Natural Language Engineering, 1998, Number 4, pages 135–144.)

One of the hurdles to sentiment analysis has been the need to bolt yet another complex, computationally costly function onto existing systems. In an uncertain economic environment, additional expenses are scrutinized. Not surprisingly, organizations which understand the value of sentiment analysis and want to be in step with the data implications of the shift to mobile devices want a solution which works well and is affordable.

Fortunately, Bitext has stepped forward with a semantic analysis program that focuses on complementing and enriching systems rather than replacing them. This is bad news for some of the traditional text analysis vendors and for enterprise search vendors whose programs often require a complete overhaul or replacement of existing enterprise applications.

I recently saw a demonstration of Bitext’s local sentiment system that highlights some of the integrative features of the application. The demonstration walked me through an online service which delivered an opinion and sentiment snap-in, together with topic categorization. The “snap-in,” or cloud-based, approach eliminates much of the resource burden imposed by other companies’ approaches, and this information can be easily integrated with any local app or review site.

The Bitext system, however, goes beyond what I call basic sentiment. The company’s approach processes content from user-generated reviews as well as more traditional data such as information in a CRM solution or a database of agent notes, as it does with the Salesforce Marketing Cloud. One important step forward for Bitext’s system is its inclusion of trend analysis. Another is its “local sentiment” function, coupled with categorization. Local sentiment means that when I am in a city looking for a restaurant, I can display the locations and consumers’ assessments of nearby dining establishments. While a standard review consists of 10 or 20 lines of text and an overall star score, Bitext can add precisely which topics a review touches and the sentiment attached to each. For a simple review like “the food was excellent but the service was not that good,” Bitext will return two topics and two valuations (food, positive +3; service, negative -1).
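
To show the shape of that output, here is a hypothetical Python sketch; it hard-codes the scores for the example review rather than calling Bitext’s actual service, whose API is not described in this post.

    # Hypothetical structure for topic-level ("aspect") sentiment; the
    # scores mirror the example above, not a real Bitext API response.
    review = "the food was excellent but the service was not that good"
    analysis = {
        "text": review,
        "topics": [
            {"topic": "food", "sentiment": 3},
            {"topic": "service", "sentiment": -1},
        ],
    }

    # A stars-only system collapses the review to a single number, which
    # hides the mixed signal that topic-level scores preserve.
    scores = [t["sentiment"] for t in analysis["topics"]]
    print("overall:", sum(scores) / len(scores))  # 1.0 looks mildly positive
    for t in analysis["topics"]:
        print(t["topic"], "->", t["sentiment"])   # food +3, service -1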

A tap displays a detailed list of opinions, positive and negative, generated automatically on the fly. The Bitext addition includes a “local sentiment score” for each restaurant identified on the map, presented alongside location-based data and publicly accessible reviews.

Bitext’s system can be used to provide deep insight into consumer opinions and developing trends over a range of consumer activities. The system can aggregate ratings and complex opinions on shopping experiences, events, restaurants, or any other local issue. Bitext’s system can enrich reviews from such sources as Yelp, TripAdvisor, Epinions, and others in a multilingual environment.

Bitext boasts social media savvy. The system can process content from Twitter, Google+ Local, FourSquare, Bing Maps, and Yahoo! Local, among others, and easily integrates with any of these applications.

The system can also rate products, customer service representatives, and other organizational concerns. Data processed by the Bitext system includes enterprise data sources, such as contact center transcripts or customer surveys, as well as web content.

In my view, the Bitext approach goes well beyond the three-stars or two-dollar-signs approach of some systems. Bitext can evaluate topics or “aspects.” The system can generate opinions for each topic or facet in the content stream. Furthermore, Bitext’s use of natural language provides qualitative information and insight about each topic, revealing a more accurate understanding of specific consumer needs than purely quantitative rating systems can offer. Unlike other systems I have reviewed, Bitext presents an easy-to-understand and easy-to-use way to get a sense of what users really have to say, and in multiple languages, not just English!

For those interested in analytics, the Bitext system can identify trending “places” and topics with a click.

Stephen E Arnold, May 29, 2013

Sponsored by Augmentext

LucidWorks to Participate in OSCON

May 22, 2013

OSCON, the Open Source Convention, will take place in Portland, Oregon in July. Themes of the conference include not just innovation and the exchange of ideas, but also how open source can give back to the community and support upcoming developers. This year, LucidWorks will support the conference. Read more on the LucidWorks Events page.

The event overview begins:

“OSCON is the best place on the planet to prepare for what comes next, from learning new skills to understanding how new and emerging open source technologies are going to impact how we live, work, and do business. In keeping with its O’Reilly heritage, OSCON is a unique gathering of all things open source, where participants find inspiration, confront new challenges, share their expertise, renew bonds to community, make significant connections, and find ways to give back to the open source movement.” Erik Hatcher from LucidWorks will be presenting at the event.

Stay tuned for more details about what Hatcher will present in his Solr Quick Start session. Attendees can expect information regarding installing and running Solr, indexing data, configuring schema, tuning and scaling, and more. LucidWorks offers some of the best value-added open source software with its LucidWorks Search and LucidWorks Big Data offerings. Perhaps more importantly, LucidWorks has a long track record of investing in open source development, training, and support, including employing one-quarter of the committers on the Apache Lucene/Solr project.
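
As a taste of the quick-start material, here is a minimal Python sketch of the index-then-query loop against an assumed default Solr 4 core at localhost:8983; the document and field names are invented for illustration.

    import json
    import requests  # third-party: pip install requests

    SOLR = "http://localhost:8983/solr/collection1"  # assumed Solr 4 core

    # Index one document and commit it.
    doc = [{"id": "oscon1", "title_t": "Solr Quick Start", "cat_s": "session"}]
    requests.post(SOLR + "/update?commit=true", data=json.dumps(doc),
                  headers={"Content-Type": "application/json"})

    # Query it back; the text field analysis lowercases "Quick" to match.
    result = requests.get(SOLR + "/select",
                          params={"q": "title_t:quick", "wt": "json"}).json()
    print(result["response"]["numFound"])  # expect 1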

Emily Rae Aldridge, May 22, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Video Search: Will It Get Better Post Viacom?

April 19, 2013

I know there’s a push to make sense out of Twitter. I know that millions of people post updates to Facebook. I know about text. Searching for text is pretty lousy, but it is trivial compared to video search. Even the remarkable micro-electronics of Glass are child’s play compared to making sense out of digital video flooding the “inner tubes” of the Internet.

This issue is addressed in part in “Why Video Discovery Startups Fail.” Startup video search and discovery systems do face challenges. The broader question is, “Why doesn’t video search work better on well-funded services such as Google’s YouTube or in governmental systems where ‘finding’ a video needle in a digital haystack is very important?”

The article says:

Video discovery startups are flawed products and even worse businesses. Why? Because they don’t fit into a consumer’s mental model.

The article identifies some challenges. These range from notions I don’t understand like “context” to concepts I partially grasp; namely, monetization.

My list of reasons video search and discovery fails includes:

  1. The cost of processing large volumes of data
  2. The lack of software which minimizes false drops
  3. The time required for humans to review what automated systems do
  4. The need for humans to cope with problematic videos due to resolution issues
  5. The financial costs of collection, pre-processing, processing, and managing the video flows

What happens is that eager folks and high rollers believe the hype. Video search and indexing is a problem. If we can’t do text well, video remains a problem for the future. Viacom decision or no Viacom decision, video search is a reminder that finding information in digitized video is a tough problem, and it becomes tougher as the volume of digitized video increases.

Stephen E Arnold, April 19, 2013

Sponsored by Augmentext

HP Shares Some of Its 2013 Autonomy Positioning

April 17, 2013

Readers of this information service, which I use to keep track of information I find useful for my columns and speeches, know that I have held Autonomy’s marketing in high regard. There are some azure chip consultants and failed webmasters who pointed out that the phrase “meaning based computing” was not particularly useful. I disagreed. Autonomy, the pre-acquisition version of the company, was a darned good marketing and sales organization.

What is easily forgotten in today’s “did I get more traffic on my Facebook page” world is that Autonomy excelled in three areas:

  1. The company was able to enter new markets such as video indexing and fraud detection when other search vendors were running around pitching, “We can index all an organization’s information in one interface.” Autonomy picked a sector and figured out how to paint a compelling story around the IDOL black box, the notion of autonomous operation to reduce some costs, and “meaning based computing.” Competitors responded with a flood of buzzwords, which made sense at an off-site strategy meeting but did not translate to simple propositions like “automatic,” “reduce costs,” and “process content in more than 400 different formats.” As a sales pitch, Autonomy did a good job and managed to stay at the top of the search vendor stack in terms of closing deals.
  2. The company used a combination of buying firms which would permit upsells of IDOL-related products and very capable management methods to help make the deals pay off. Examples include the Zantaz buy and the subsequent leveraging of that firm’s technology into a cloud service. Autonomy also bought Interwoven and pulled together its marketing services into a reasonably compelling bundle of analytics with IDOL sauce.
  3. Autonomy developed what I thought were clever products and services which caught the eye of certain customers and helped the firm enter new markets. Examples range from the now mostly forgotten Kenjin (a smart desktop service) to Aurasma, an augmented reality service for print advertisers.

HP’s management and advisors paid a lot of money to own Autonomy. Like most search and content processing acquisitions, the realities of running a company in this very tough sector became apparent after a few months. I am not interested in the financial and legal battles underway. What’s important is that HP purchased a company, and HP now has to make it work.

A very interesting pair of semi-marketing articles appeared in eWeek on April 16, 2013. The first is “HP’s Autonomy: 10 Ways It’s Contributing to HP’s Hardware Story.” Slideshows like this are ways to get page views, so please flip through the images. Here’s what I noted:

First, HP seems to acknowledge that turnover and management of the HP version of Autonomy has been a problem. The slideshow calls this a “rebirth.” But the big news from a marketing historian’s point of view is that “meaning based computing” is gone, replaced by “the OS for human information.” I find this fascinating. On one hand, competitors can now carp at the scope of the IDOL technology. On the other, in this social buzzword era, “human information” is actually quite a nice turn of phrase. I won’t make a big deal of the fact that when IDOL’s fraud detection algorithms are working on content, the data does not have to be “human” at all. It can be based on a person’s credit transactions, but algorithms for fraud work on machine- and human-generated information. No big deal, because such distinctions are not of interest in today’s here-and-now environment.

Second, I did not notice much emphasis on search and retrieval. For someone familiar with Autonomy IDOL, I suppose that search is self-evident. Autonomy, however, is mostly an information access system. The add-ons were, as I noted above, extensions or wrappers of the IDOL core, based on Bayesian methods and enhanced in many ways since the mid-1990s. Yep, Autonomy’s technology may seem magical to HP management, but it has been around a while and does not perform some of the functions which Google-backed Recorded Future performs or which a skilled SAP programmer can crank out. To me, this is a big deal because it underscores the futility of HP’s trying to make big money deals based on plain old search. Companies chasing search deals are not landing huge deals like those HP needs to make its top line grow.

Third, the “10 ways” are focused almost exclusively on Autonomy capabilities which have been available for a long time. I think the notion of putting Autonomy functions in a printer is interesting, but that idea has been floating around for years. I heard presentations from Intel and Xerox which talked about putting content processing in hardware. Interesting stuff, but the “10 ways” are useful because each makes clear to competitors where HP’s marketing and sales will be going. Examples include using Autonomy for customer support and content management.

Great stuff.

The second write up is “HP’s Autonomy Focused on Big Data, Cloud, Mobile, Security: GM”.

This write up contains a number of quite useful insights into HP Autonomy. The “voice” of the article is Robert Youngjohns, the HP manager for the Autonomy unit. I found a number of passages which warrant quoting. I want to highlight three snippets from the three-page article. You can get the complete picture in the original article, which is worth reading carefully.

First, the story contains a reference to “magical.” Autonomy is math, not magic. The use of the word “magical” is fascinating. It suggests that Autonomy goes well beyond what “normal” content processing can deliver.

Second, the interview lays out the markets which Autonomy will focus upon. These are, as I understand the lingo, big data, information governance, and digital marketing. I am not sure what these phrases encompass, but it is clear that “search” is not playing a front and center role.

Third, there is acknowledgment that the content archiving market is important. The pairing of Autonomy and various HP products is significant. Autonomy will be, to some degree, baked into other HP products and services. This is, in my opinion, an extension of the formula which made Autonomy a revenue producer prior to its sale to HP.

Net net: the 2013 version of Autonomy will be fascinating to monitor.

Stephen E Arnold, April 17, 2013
