Spark Burns Down Hadoop
October 20, 2015
I read “Apache Spark vs Hadoop.” I pictured Ronda Rousey climbing into the octagon with Ramazan Emeev. A big gate. As a certain presidential candidate might say, “Huge.”
Alas, the dust-up between Spark (MapReduce on steroids) and Hadoop (a batch operation clustering system) was not much of a contest, according to the article.
I highlighted this passage:
With Apache Spark, you can act on your data in whatever way you want. Want to look for interesting tidbits in your data? You can perform some quick queries. Want to run something you know will take a long time? You can use a batch job. Want to process your data streams in real time? You can do that too.
The key to the Spark wonderfulness is RDDs, or resilient distributed datasets. I underlined this definition:
They’re fine-grained, keeping track of all changes that have been made from other transformations such as map or join. This means that it’s possible to recover from failures by rebuilding from these transformations (which is why they’re called Resilient Distributed Datasets).
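For those who like to see the trick rather than read about it, here is a minimal sketch of the rebuild-from-transformations idea in plain Python. This is not Spark's API. The `ToyRDD` class and its `lose_data` and `lineage` methods are hypothetical names I made up for illustration, and the sketch assumes the original source data is always still available.

```python
from typing import Any, Callable, List, Optional


class ToyRDD:
    """Hypothetical stand-in for an RDD: it remembers how it was derived."""

    def __init__(
        self,
        source: Optional[List[Any]] = None,
        parent: Optional["ToyRDD"] = None,
        transform: Optional[Callable[[List[Any]], List[Any]]] = None,
        step: str = "source",
    ) -> None:
        self._source = list(source) if source is not None else None
        self._parent = parent
        self._transform = transform
        self._step = step
        self._cache: Optional[List[Any]] = None  # computed rows, may be lost

    def map(self, fn: Callable[[Any], Any]) -> "ToyRDD":
        # Record the transformation; nothing is computed yet.
        return ToyRDD(parent=self, step="map",
                      transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred: Callable[[Any], bool]) -> "ToyRDD":
        return ToyRDD(parent=self, step="filter",
                      transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self) -> List[Any]:
        # If the rows are missing, rebuild them by replaying the lineage.
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self._transform(self._parent.collect())
        return list(self._cache)

    def lose_data(self) -> None:
        # Simulate a failure: the computed rows vanish, the recipe survives.
        self._cache = None

    def lineage(self) -> List[str]:
        # Steps from the original source down to this dataset.
        steps = [] if self._parent is None else self._parent.lineage()
        return steps + [self._step]


base = ToyRDD(source=[1, 2, 3, 4, 5])
even_squares = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

first = even_squares.collect()
even_squares.lose_data()          # pretend a node died
rebuilt = even_squares.collect()  # recomputed from the recorded steps
```

After the simulated failure, `rebuilt` matches `first` because the dataset kept its recipe (`even_squares.lineage()` lists the steps), not because anyone kept a backup copy. That, as I read the definition, is the resilient part.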
My goodness, with these features, poor, old Hadoop may not stand a chance. Now who would win a fight between Rousey and Emeev? One could, I assume, input data about the two fighters, perform some quick queries, and get an “answer.”
As with most NoSQL confections, the question is whether the answer will match what happens in the ring.
Stephen E Arnold, October 20, 2015