Old-New Method from BackType
January 13, 2011
There is considerable interest in real-time processing and big data. One question we hear is, “How does the infrastructure deliver the throughput?” Answering this question can be difficult. We found “Secrets of BackType’s Data Engineers” quite useful. The tips and approach may not be right for every organization, but the information about software and “plumbing” is a quick introduction to one company’s approach.
For me the most striking segment of the write-up was:
They experimented with writing out the data to a Cassandra cluster, but ran into performance issues. What they ended up creating instead was a system they call ElephantDB. It takes all the data from a batch job, splits it up into shards, each of which is written out to disk as BerkeleyDB-format files. After that they fire up an ElephantDB cluster to serve the shards. Unlike many traditional databases, it’s read-only, so to update data served from the batch layer you create a new set of shards.
So that’s how the heavy processing is done, but what about instant updates? The speed layer exists to compensate for the high latency of the batch layer. It is completely transient and because the batch layer is constantly running it only needs to worry about new data. The speed layer can often make aggressive trade-offs for performance because the batch layer will later extract deep insights and run tougher computations. It takes the data that came in after the last batch processing job and applies fast running algorithms. Because the Hadoop processing is run once or twice a day, the fast layer only has to keep track of a few hours of data to produce its results. The smaller volume makes it easy to use database technologies like MySQL, Tokyo Tyrant and Cassandra in the speed layer.
Crawlers put new data on Gearman queues and workers process and write to a database. When the API is called, a thin layer of code queries both the speed layer database and the batch ElephantDB system, and merges the information from both to produce the final output that’s shown to the outside world.
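For readers who think in code, here is a minimal sketch of the pattern the excerpt describes: a batch job's output is split into read-only shards, a transient speed layer tracks only what arrived since the last batch run, and a thin query function merges the two. This is not BackType's code. Plain Python dictionaries stand in for the BerkeleyDB files and the ElephantDB cluster, the names (`shard_for`, `build_shards`, `SpeedLayer`, `query`) are hypothetical, and the assumption that the values are counts merged by addition is ours, since the article does not say how the merge works.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # assumed shard count, chosen only for illustration


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard deterministically from a hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def build_shards(batch_output: dict, num_shards: int = NUM_SHARDS) -> list:
    """Split one batch job's output into shards (dicts stand in for files).

    The shards are treated as read-only: to update the batch view,
    call this again on the next batch job's output and swap in the result.
    """
    shards = [{} for _ in range(num_shards)]
    for key, value in batch_output.items():
        shards[shard_for(key, num_shards)][key] = value
    return shards


class SpeedLayer:
    """Transient counts for data that arrived after the last batch run."""

    def __init__(self) -> None:
        self.counts = Counter()

    def record(self, key: str, n: int = 1) -> None:
        self.counts[key] += n

    def reset(self) -> None:
        """Discard everything once a new batch run has absorbed the data."""
        self.counts.clear()


def query(key: str, shards: list, speed: SpeedLayer) -> int:
    """Thin merge layer: batch value plus whatever the speed layer has seen."""
    batch_value = shards[shard_for(key, len(shards))].get(key, 0)
    return batch_value + speed.counts[key]


# Hypothetical data: mention counts per URL from the last batch job.
shards = build_shards({"example.com/a": 120, "example.com/b": 7})
speed = SpeedLayer()
speed.record("example.com/a", 3)  # three new mentions since the batch ran
print(query("example.com/a", shards, speed))  # 123
```

The point of the sketch is the division of labor: the shards never change in place, so serving them is simple, and the speed layer can be thrown away after every batch run because the batch layer will recompute everything from the full data set.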
The combination of time-proven methods, such as batch processing and read-only data files, with some of the newer engineering ideas is instructive. A mix of methods can provide the building blocks for a reliable, high-performance system. Useful article.
Stephen E Arnold, January 13, 2011
Freebie