Text Mining and Predicting Doom

June 23, 2009

The New Scientist does not cover the information retrieval sector. Occasionally the publication runs an article like “Email Patterns Can Predict Impending Doom” which gets into a content processing issue. I quite liked the confluence of three buzz words in today’s ever thrilling milieu: “predict”, “email”, and “doom”. What’s the New Scientist’s angle? The answer is that as tension within an organization increases, communication patterns in email can be discerned via text mining. The article hints that analysis of email is tough with privacy a concern. The article offers a suggestive reference to an email project at Yahoo, but provides few details. With monitoring of real time data flows available to anyone with an Internet connection, message patterns seem to be quite useful to those lucky enough to have the tools needed to ferret out the nuggets. Nothing about fuzzification of data, however. Nothing about which vendors are leaders in the space except for the Yahoo and Enron comments. I think there is more to be said on this topic.

Stephen Arnold, June 23, 2009

SharePoint Virtualization

June 23, 2009

“SharePoint Virtualization Survey Results” offered some insight into how one sample of Microsoft licensees uses virtual servers. The person running the survey and preparing the summary of results is Wictor Wilén. The findings I found interesting were these:

  • About 96 percent of the respondents virtualized their development environment and half virtualized their production environments. (I was surprised at the disparity between the two percentages.)
  • The Web front end was virtualized by most respondents; query service was the second most virtualized operation. Mr. Wilén wrote: “A quite high number of respondents answered that they were virtualizing the database role (73,9%) but only half of them could really recommend it (37,2%). The Excel Services role was something that about half of the participants virtualized (47,8%) and recommended for virtualization (44,2%).”

You can get more survey details from Mr. Wilén’s Web site.

Gee, Will Microsoft Increase Its Ad Buys in PC World

June 23, 2009

Short honk: The economy is lousy. Microsoft is spending big bucks advertising Bing. I was surprised to find what I thought was a negative story about Microsoft. Commenting about Microsoft’s challenges in a zero readership, no ad revenue Web log is one thing. Exposing the company’s strategy as Swiss cheese is another. Judge for yourself. Read “Can Microsoft Truly Go Mobile?” which apparently was written for the specialist publication CIO. Even the addled goose can find some good things to say about the Redmond giant; for example, Visio works reasonably well.

Stephen Arnold, June 23, 2009

Lucid Meet Up: Open Source Search Draws Crowd

June 23, 2009

I was in San Francisco the day of the open source Lucene meet up sponsored by Lucid Imagination. The New Idea Engineering Web log wrote a useful summary of what transpired. You can find “Impressions of First Lucene / Solr Meet Up” on the Enterprise Search Blog. Keep in mind that the founders of the Enterprise Search Blog liked the study “Successful Enterprise Search Management” Martin White and I wrote. People who like what I do may have unusual tolerance for addled geese. You have been warned.

I noted the upside and downside of a technical meet up, but I wanted to know more. I chased down David Fishman, one of the spark plugs for Lucid Imagination. You can read an interview with one of the founders of Lucid Imagination, Marc Krellenstein, in the ArnoldIT.com “Search Wizards Speak” series.

I came away from my discussion with Mr. Fishman more than a little impressed. Some of the items that remained pinned to my brain’s search bulletin board warrant sharing.

First, open source is hot. Few information technology professionals want to go to a meeting about search without first hand information about Apache Lucene (http://lucene.apache.org/) and Solr.

Second, Lucid Imagination (www.lucidimagination.com) is gaining traction with its industrial strength approach to the open source search technology that promises relief from the seven figure licensing fees imposed by some of the high profile search and retrieval vendors.

The meet up brought together almost 50 engineers and programmers on June 3. Featured speakers included Grant Ingersoll and Erik Hatcher, both members of the Apache Lucene project development team and co-founders of Lucid Imagination; Mr. Hatcher is also the author of Lucene in Action. Jason Rutherglen and Jake Mannix of LinkedIn talked about how they’ve implemented search at the core of their cutting edge social network. Other speakers covered a wide range of deep search questions, including numeric search, aka Trie Range queries. Avi Rappoport, a search consultant, talked about the approach to “stop words”, encouraging search application developers not to ignore words like “the”, “in”, and the like, given the power of today’s compute resources to deal with such nuances.
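To make the stop word point concrete, here is a minimal Python sketch. It is not code from the meet up or from Lucene; the stop list and the `tokenize` helper are assumptions invented for illustration. It shows how a query made up entirely of common words vanishes when stop words are dropped and survives when they are kept.

```python
# Hypothetical stop list for illustration only; real engines ship their own.
STOP_WORDS = {"a", "an", "and", "be", "in", "not", "of", "or", "the", "to"}


def tokenize(text, drop_stop_words):
    """Lowercase and split on whitespace, optionally removing stop words."""
    tokens = text.lower().split()
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


query = "to be or not to be"
print(tokenize(query, drop_stop_words=True))   # []
print(tokenize(query, drop_stop_words=False))  # ['to', 'be', 'or', 'not', 'to', 'be']
```

With the stop list applied, nothing is left to search on. Keeping the words costs index space and compute, which is the trade Ms. Rappoport suggested today's hardware can afford.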

Back to Lucid: Grant Ingersoll’s talk focused on innovations in Solr 1.4, the forthcoming release of the search platform built around the Lucene search engine. The release has a good number of important new features, including Trie-range queries for better searching of numeric data, plus advanced replication and better logging for improved scalability and deployment. That’s just the latest in a string of enterprise grade innovations that the open source community has rolled together, closing the gap with many, if not most, of the meaningful technology features of commercial enterprise search software. Erik Hatcher spoke about a new search engine for search developers (http://search.lucidimagination.com) that Lucid sponsors for the community. It uses Lucene and Solr technology to plow through the abundant discussions and technical info created over the years, providing faster troubleshooting and education than programmers could get before.

There were three takeaways from the meeting, according to David Fishman, who does marketing for Lucid Imagination. First, the breadth and depth of the search problem set means that it’s not going to be solved by one company or one set of people; the active, engaged open source community is constantly adding and innovating new features, putting them through their paces, and pushing the frontier faster than any single company could.

Second, the technology upon which open source search rests is as good as or maybe better than some of the commercial products’ code base. Many hands and many eyes mean that the gotchas hiding in some of the high profile brands’ products are not going to jump out and bite an administrator.

Third, the demand is real: innovative companies as different as IBM, Zappos, Netflix, LinkedIn, Digg, AOL, MySpace, Apple, Comcast Interactive and more have built mission critical search services at the core of their business using this technology. The people who came to this meet up, and one just like it two weeks earlier in Reston, Virginia (http://www.meetup.com/NOVA-Lucene-Solr-Meetup/), are part of that rapidly accelerating adoption curve. There’s no need to call a salesperson or schedule a demo to get started; the community lowers the barriers to experimentation and participation.

Not least important is what wasn’t covered, said Fishman. Innovation is half the battle; the other, reliability. As Mark Bennett observed on his blog, this meet up was not the crowd that keeps datacenter and IT managers sleeping soundly through the night. Commercial grade reliability comes from a commercial-grade company with the expertise to help get it working and keep it working. And having talked to the Lucid Imagination team, I can say they not only “get” search. They “get” service level agreements. That may be one reason why they’re in the business of offering commercial grade support for these technologies.

To sum up, what strikes me as new is that Lucid’s pool of engineers is available to help — many of them, the same engineers who help write the code and manage the innovations with the Apache Lucene community. What the IT guys get by working with Lucid is the combination of innovation with peace of mind and better control of customization and maintenance.

My hunch is that a company with a search system is going to invest in professional services for support no matter what search solution it deploys. Even if open source makes it easy to get search, it takes expertise to get search right.

If I know Marc Krellenstein, the Lucid Imagination team will be able to deliver that expertise at competitive rates. Certainly, the range of companies represented suggests that open source search is moving toward center stage.

Can open source search gain traction in the enterprise? In some organizations, the answer is, “Yes.”
Open source search is here and Lucene/Solr promises to push beyond simple search and retrieval.

Stephen Arnold, June 23, 2009

A Google Vulnerability Exposed

June 22, 2009

Erick Schonfeld’s “When It Comes to Search Trends, Google Is Lagging Behind Bing” identifies a potential Google weakness. I think TechCrunch is on to something, but I think the visible vulnerability explained by Mr. Schonfeld is a symptom of a deeper problem.

The weakness is an inability to handle what’s new and what’s happening. Mr. Schonfeld wrote:

As Microsoft tries to take away market share from Google with its new search engine, Bing, it is battling Google feature by feature. One feature where Microsoft seems to be edging out Google is with displaying recent search trends. This may not be a major feature, but it shows a weakness in Google’s armor.

Mr. Schonfeld presented sample queries that illustrate this issue. The bottom line is that for the most recent information, I may want to use more than Google. Bing.com is one option and there are the numerous real time search systems available.

My take on this is different. Keep in mind that I think Mr. Schonfeld has identified a symptom; the deeper disease is “time deficiency.” As zippy as the Google system is when responding to queries, the Google is not as fast on the intake and indexing of real time data flows such as those from social networks.

My research has identified several reasons:

  1. Google’s attention is on its leapfrog technologies such as Google Fusion and Google Wave. Both of these are manifestations of a larger Google play. While the wizards focused on these innovations, the real time content explosion took place, leaving Google without a here-and-now response.
  2. Google is big and it is suffering from the same administrative friction that plagued IBM when Microsoft pulled off the disk operating system coup and that hobbled Microsoft when Google zoomed into Web search. Now the Google finds itself aware of Facebook, Twitter, and similar services yet without a here-and-now response. Slow out of the blocks may mean losing the race.
  3. Google’s plumbing is not connected to the real time streams from social and RSS services. Sure, there is some information, but it is simply not as fresh as what I can find on Scoopler and some other services.

What we have is a happy circumstance. If Microsoft can exploit that weakness, I think it has a chance to capture traffic in the real time sector. But having identified a weakness does not mean that hemlock can be poured into Googzilla’s ear.

There are some other weaknesses at the Google as well. I will be talking about one at the NFAIS conference on Friday, June 26, 2009. Get too many weaknesses, and these nicks start to hurt. Addled geese have to be very careful but big companies are often too big and tough to be worried about a few nicks. If there are a thousand of them, well, the big outfit might notice.

Stephen Arnold, June 22, 2009

Enterprise Search and Choice

June 22, 2009

I participated in a teleconference last week during which the following comment was made: “We bought the Google Search Appliance because it did not offer us too many choices.” I found a link to a short article by Derek Sivers that shed some light on this comment. “Customers Given Too Many Choices Are 10X Less Likely to Buy” puts some weight behind the injunction, “Keep it simple, stupid” or KISS. Mr. Sivers cites a university professor and reports on a customer behavior test. You can get the details from Mr. Sivers’ write up. The key points were, in my opinion:

Lessons learned: [1] Having many choices seems appealing (40% vs 60% stopped to taste) [2] Having many choices makes them 10 times less likely to buy (30% vs 3% actually bought). Surgeon Atul Gawande found that 65% of people surveyed said if they were to get cancer, they’d want to choose their own treatment. Among people surveyed who really do have cancer, only 12% of patients want to choose their own treatment.

What’s the lesson for search marketers? Don’t complicate the pitch with the many choices the system presents to the user. There’s a downside to the KISS approach. The teleconference was about replacing a search appliance with a more sophisticated system that offered more choice to the administrator. Go figure.

Stephen Arnold, June 22, 2009

Parsing Oracle Text Input

June 22, 2009

Short honk: A happy quack to the reader who sent me a link to this tip for chopping up a list of telephone numbers separated by asterisks. There are a couple of tips revealed by Michel in this post on the Oracle FAQ site. You can get the info by clicking here. The method will vary depending on your specific source file.
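For readers who want to see the chore itself, here is a minimal Python sketch. It is not Michel's Oracle method, which lives in the linked post; the `split_phone_list` helper and the sample numbers are made up for illustration. It assumes the input is a single string with an asterisk between entries and possibly some stray spaces or doubled separators.

```python
def split_phone_list(raw, separator="*"):
    """Split a separator-delimited string into trimmed, non-empty entries."""
    return [part.strip() for part in raw.split(separator) if part.strip()]


# Made-up sample data; a real source file may need extra cleanup.
sample = "502-555-0101*502-555-0102* 502-555-0103 **"
print(split_phone_list(sample))
# ['502-555-0101', '502-555-0102', '502-555-0103']
```

Dropping the empty pieces handles doubled or trailing asterisks. As the post says, the details depend on the source file.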

Stephen Arnold, June 22, 2009

Twitter Tools

June 22, 2009

Now that outfits like the New York Times and CNN have concluded that Twitter is useful when reporting certain events, the Social Media Guide’s round up of Twitter tools may find some use in the newsroom. The round up “The Ultimate List of Twitter Tools” is long, grouped, and quite good. Highly recommended for dinosaurs and new forms of sentient information life. A reminder: there are other sources of real time info as well. Keep those options open, the addled goose honks.

Stephen Arnold, June 22, 2009

Library Teaches Search – More Instruction Needed

June 22, 2009

My recollection is that libraries taught search as far back as 1980. I recall that either database vendors would run demonstrations or that librarians skilled in the use of online would provide guidance to those who asked. I recall running a class in ABI/INFORM at Chicago Public Library and there was an overflow crowd of both staff and research minded patrons. I was delighted, therefore, to see an article in the Sacramento Bee that described the Sutter Library’s classes in finding health and medical information online. The class is a reminder to me that:

  1. Librarians and information professionals often know how to search and have an interest in sharing that knowledge
  2. Patrons are smart enough to know that despite the marketing hype and the pundits’ assertions that search is a “done deal” additional instruction attracts people and finds its way into The Sacramento Bee

We have a long way to go before information professionals will be relics of a long gone time. The people who tell me that they “know how to search” and “can locate almost anything online” are kidding themselves. I think I am a reasonably good researcher. But if you spend time monitoring how I find information, you will learn quickly that I turn to experts who make my search skills look primitive. Even my nifty Overflight system pales in comparison with the type of information that my research team generates by:

  • Knowing what content is located where
  • Understanding the editorial method behind or absent from certain online systems
  • Leveraging hard-to-manipulate resources such as information from government repositories, specialized services, and individual experts.

I would like to see more libraries move aggressively into online instruction, market those programs, and raise the level of expertise. Most of the people who claim to be experts at search are clueless about how bad their skills are. Among the worst offenders are self appointed search experts who have trouble figuring out when something is likely to be baloney and when something is just plain wrong. Enterprise search, content management, and text mining are three disciplines where better research will be most beneficial in my opinion. Then we need critical thinking skills. Schools have dropped the ball. Maybe libraries can help in this area as well? Search procurement teams will be well served if the team has one or more librarians in the huddle.

Stephen Arnold, June 22, 2009

Why Social Information Becomes More Important to Investors

June 22, 2009

Few people in Harrod’s Creek, Kentucky, pay much attention to the publishing flow from financial services and its related service industry. Most of the puffery gets recycled on the local news program, boiled down to a terse statement about hog prices and the cost of a gallon of gasoline. The Wall Street Journal has become softer in the last two years with about 20 percent of the Friday edition and 30 percent of the Saturday edition devoted to wine, automobiles, and lifestyles (now including sports). I am waiting for a regular feature about sports betting, which is one of the key financial interests in Kentucky.

Asking your pal at the local country club is not likely to get you a Bernie Madoff scale tip, but there are quite a few churners. Each is eager to take what money one has, recycle it, and scrape off sufficient commissions to buy a new Porsche. As the deer have been nuked by heavy traffic in the hollow, zippy sports cars are returning to favor. A Porsche driver fears no big bodywork repair by smoking a squirrel.

I read with interest “Washington Moves to Muzzle Wall Street” by Mike Larson. I think Mr. Larson puts his photo on his Web site, and he looks like a serious person. Squirrels won’t run in front of his vehicle I surmise. He wrote:

The Obama administration revealed a sweeping series of new proposed regulations and reforms — all designed to prevent the next great financial catastrophe. The plan is multi-faceted and complex. Among other things, it aims to increase the Fed’s power, regulate the derivatives and securitization markets more effectively, protect consumers from the potential harm of complex financial products, and more. It’s been a long time in the making, with input from key policymakers, consumer groups, academics, and others.

After the set up, Mr. Larson reviews the components of the Administration’s plan. He observed:

I’m hopeful we’ll see meaningful action this year. More importantly, I’m hopeful that policymakers who are empowered to take new actions to police the markets and protect consumers actually exercise them. That’s the key to making any of this stuff work. It’s unclear exactly when these provisions will start to impact the disclosures you get when you take out a mortgage, or when you’ll be able to protest to the new consumer protection agency should you get shafted on a financial transaction.

His story triggered my thinking. One angle that crossed my mind was that the information generated about the US financial circus may get sucked into the gravitational pull of this initiative. The reason is that money is a form of information. Regulate the money, the information stream is affected.

One consequence is that the type of information generated by social networks, Web logs, Facebook posts, and other “off the radar” sources is likely to become more important. If I am right, the value of companies that can make “off the radar” information available or, better yet, deliver it in a form that makes sense of many data points will go up.

My first thought is that if the Wall Street crowd gets muzzled to a greater degree, then the underside of reportage–bloggers like me–may become more important. Just my opinion, of course.

In the months ahead, I want to noodle this idea. My thoughts are exploratory, but I have decided that my preliminary musings will be made available as a PDF which you can download without paying for the information. Keep in mind that the editorial policy in the “About” section of this Web log will apply to free stuff that I am not forcing anyone to read.

Stephen Arnold, June 22, 2009
