Splunk and Real Time Search

April 6, 2010

My column for Information World Review addressed the issue of latency in what marketers call real-time search. I am not sure when the article goes on the Information World Review Web site at http://www.iwr.co.uk/, but I can hit the three points in the write up.

Real time means different things in different contexts
The services which return results with less latency are specialist vendors such as Collecta, Surchur, and Twitter, among others.
The real time results in the Big Three’s search systems are uniformly disappointing.

When I read “Splunk Goes Real-Time, Eliminates Latency from IT Data Search,” I wondered what I missed. After working through the write up, I realized that “real time search” was not defined. Assuming that a buzzword makes sense to a casual reader like me is a common practice.

The write up said:

With a major upgrade, Splunk eliminates the latency by opening the doors to real-time search, analysis and monitoring for live streaming data. The company offered a glimpse by allowing me to go into the site and conduct a random search so that I could see my own search appear in real-time data, just as an IT admin might see it.

Splunk is a company that specializes in log management. Logs are important for such applications as search engine optimization and certain security-related tasks. Here’s how the company describes itself:

Splunk is software that provides unique visibility across your entire IT infrastructure from one place in real time. Only Splunk enables you to search, report, monitor and analyze streaming and historical data from any source. Now troubleshoot application problems and investigate security incidents in minutes instead of hours or days, monitor to avoid service degradation or outages, deliver compliance at lower cost and gain new business insights from your IT data.

The addition of a search function that indexes in real time is a potentially big improvement over traditional log file analysis. The system includes a function to post Splunk saved search results to Twitter. You can get the script here.

The ZDNet write up includes a diagram for “Machine-Generated IT Data Contains a Categorical Record of Activity and Behavior”.

Splunk is a low latency search system that indexes certain types of content. “Real time” is a murky concept, and in my experience, every system exhibits latency to some degree.

Stephen E Arnold, April 6, 2010

This is an unsponsored post.

Written by Stephen E. Arnold · Filed Under Enterprise, News, Real time search, Search

Comments

One Response to “Splunk and Real Time Search”

Michael Wilde on April 6th, 2010 8:00 pm

Stephen…

Interesting perspective. Since there appears to be some murkiness, let me still the waters and expose what’s on the ocean floor, accessed through a cave called Splunk.

Since 2005, Splunk has *always* had realtime indexing of any kind of IT data. By realtime, I mean, if a machine can generate it and spray it at Splunk (such as firewall traffic coming over UDP), it will index it. If Splunk can watch a file and capture changes in realtime (akin to unix “tail -f”), it will index it.
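The “tail -f” comparison above can be sketched in a few lines of Python. This is only an illustration of the idea of watching a file for appended data, not Splunk's code; the function name `read_new_lines` and the byte-offset bookkeeping are assumptions made for the example.

```python
def read_new_lines(path, offset):
    """Return (lines, new_offset) for complete lines appended after `offset`.

    Hypothetical helper: a caller would invoke it repeatedly, passing back
    the offset it returned last time, much as "tail -f" keeps its place.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read()
    # Consume only up to the last newline so a half-written line is
    # left in place and picked up on the next call.
    end = chunk.rfind(b"\n") + 1
    lines = chunk[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end
```

A watcher would call this in a loop, handing each returned line to whatever does the indexing and remembering the new offset between calls.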

Splunk is at its heart a “time-series” search engine. Time is its organizing vector, the way it stores data, and the way data is retrieved (for the most part). It can index nearly any kind of unstructured or structured IT data. Google has it easy as far as document formats: they’re all standard. In a datacenter, there are no logging or data formatting standards other than SNMP (which is a fraction). In its real-time indexing, Splunk processes data, classifies it, finds event boundaries and then finally writes to the index. That index is a non-relational datastore which is updated as data comes in and is searchable within an instant of data being written to disk. That is a fairly different search index structure than that of Google, Yahoo, or even the open source Lucene systems. But no diss on Google or Yahoo. They solve easy problems at MASSIVE SCALE.
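To make “time is its organizing vector” concrete, here is a toy in-memory index kept sorted by timestamp. It is a sketch under heavy assumptions: the class name `TimeIndex`, the use of epoch seconds, and the list-based storage are all invented for illustration and say nothing about how Splunk actually lays data out on disk.

```python
import bisect


class TimeIndex:
    """Toy time-ordered event store (hypothetical, not Splunk's format)."""

    def __init__(self):
        self._times = []   # timestamps, kept sorted
        self._events = []  # raw event text, parallel to _times

    def add(self, ts, raw):
        # Insert at the sorted position so late-arriving events still
        # land in time order; the event is searchable as soon as it is added.
        i = bisect.bisect_right(self._times, ts)
        self._times.insert(i, ts)
        self._events.insert(i, raw)

    def search(self, earliest, latest):
        # Return raw events with earliest <= ts <= latest.
        lo = bisect.bisect_left(self._times, earliest)
        hi = bisect.bisect_right(self._times, latest)
        return self._events[lo:hi]
```

The point of the sketch is only that a time range maps to a contiguous slice, so “last 5 minutes” is a cheap lookup no matter what the events contain.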

In that model (prior to this particular week), versions 4.0 and earlier, a user (or the system on behalf of a user) is able to search from that exact time to a time period prior (e.g., “Last 5 minutes”, “Last 7 days”, “any weekday”, or even “all time”). Search in Splunk is a twofold process: first, retrieval of data that matches the search; second, metadata extraction. Splunk extracts fields at search time, which means the organization of the index has no bearing (other than time) on how data is used by the end user.
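The twofold process described above (retrieve by time and term first, extract fields second) might look roughly like this. It is a minimal sketch, assuming events are `(timestamp, raw_text)` pairs and that fields appear as `key=value` text; the regular expression and the names `extract_fields` and `search` are illustrative only, not Splunk's search language or API.

```python
import re

# Assumed field syntax for the example: key=value or key="quoted value".
KV = re.compile(r'(\w+)=("[^"]*"|\S+)')


def extract_fields(raw):
    """Pull key/value pairs out of raw text at search time."""
    return {key: value.strip('"') for key, value in KV.findall(raw)}


def search(events, earliest, latest, term):
    """Yield (ts, fields) for events in the time range that contain `term`.

    Step 1 is retrieval (time range plus raw text match);
    step 2 is field extraction, applied only to what was retrieved.
    """
    for ts, raw in events:
        if earliest <= ts <= latest and term in raw:
            yield ts, extract_fields(raw)
```

Because the fields are computed only when a search runs, nothing about them has to be decided when the data is indexed.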

An IT organization, when solving a problem, finding a security issue, measuring a business statistic, or handling many other use cases, will likely retrieve data from many different sources (java logs, apache logs, changed config files, transactions, mail messages, etc). Splunk allows not only “search-time structured query” (which is a heady concept better seen than described), but also map-reduce summarization and graphical reporting. It’s as if Google, Hadoop, Hyperion, Crystal Reports, and our old pal “grep” all slept together and had a baby… Hard to believe, I’m sure.

Now comes Splunk 4.1, the subject of the ZDNet writeup, and that very marketing-esque diagram. Oh, and the source of your commentary.

Suppose all of the above were still true: “reading from the real time IT data stream and indexing.” To make it more useful (especially during a security situation, or a customer issue), it would probably be nice if someone could ask the search engine to filter and display results as they came in *prior* to being written to disk. Additionally, you’d want to watch all future events (or data) as they came into the system. You might want this for alerting purposes, you might decide you want visualization in real time, or you may want to stare at the screen and watch the flow of data *as it happens*. In Splunk 4.1 you get access to the full search language, and you can say “show me any time there is a login failure on any server in my network, in real time for a 5 minute historical window (for example).” That is a rolling five minutes that includes “now”. But wait.. “now” just passed.. ok.. then whatever is about to be now.. and whatever comes down the pipe in the future.
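The rolling five-minute window can be pictured as a filter that sits on the incoming stream and forgets matches as they age out. The sketch below is an assumption-laden illustration, not how Splunk 4.1 implements it: the class `RollingWindow`, the plain-string predicate, and the requirement that events arrive in time order are all choices made for the example.

```python
from collections import deque


class RollingWindow:
    """Keep the matching events seen in the last `seconds` of the stream."""

    def __init__(self, seconds, predicate):
        self.seconds = seconds
        self.predicate = predicate
        self.matches = deque()  # (ts, raw) pairs, oldest first

    def push(self, ts, raw):
        """Feed one event as it arrives; return the matches still in the window.

        Assumes events are pushed in non-decreasing timestamp order.
        """
        if self.predicate(raw):
            self.matches.append((ts, raw))
        # "Now" is the timestamp of the newest event; evict anything
        # that has fallen out of the trailing window.
        while self.matches and self.matches[0][0] <= ts - self.seconds:
            self.matches.popleft()
        return list(self.matches)
```

For the login-failure example, something like `RollingWindow(300, lambda raw: "login failure" in raw)` would be fed every event as it streams in, and the returned list is what a live display would show at that moment.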

Lastly.. Splunk loves to be distributed. Index data on 5 servers at the same time, and search across all of them as if they were one big real-time stream.
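Searching several indexers “as if they were one big stream” amounts to running the same search on each and merging the results by time. Here is a minimal single-process sketch, assuming each indexer is just a time-sorted list of `(timestamp, raw_text)` pairs; the function `search_all` is hypothetical, and a real deployment would of course fan the search out over the network.

```python
import heapq


def search_all(indexers, term):
    """Run the same term search on every indexer and merge results by time.

    `indexers` is a list of event lists, each already sorted by timestamp.
    """
    per_indexer = [
        [(ts, raw) for ts, raw in events if term in raw]
        for events in indexers
    ]
    # heapq.merge combines already-sorted inputs into one sorted stream.
    return list(heapq.merge(*per_indexer))
```

Each indexer only filters its own data, and the merge step is what makes the combined result read like a single time-ordered stream.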

So, in a nutshell: Splunk, the real-time IT data indexing engine, just came out with actual real-time search.

Really.

Hopefully that helps clear the murkiness. Try it.. or ping me.. I’d be happy to show you Splunk and answer any questions.

Michael Wilde
Splunk Ninja
thewilde (AT) splunk.com
http://splunkninja.com
512-524 -97-FOUR-TWO


Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.