Spark Burns Down Hadoop

October 20, 2015

I read “Apache Spark vs Hadoop.” I pictured Ronda Rousey climbing into the octagon with Ramazan Emeev. A big gate. As a certain presidential candidate might say, “Huge.”

Alas, the dust-up between Spark (MapReduce on steroids) and Hadoop (a batch-oriented cluster processing system) was not much of a contest, according to the article.

I highlighted this passage:

With Apache Spark, you can act on your data in whatever way you want. Want to look for interesting tidbits in your data? You can perform some quick queries. Want to run something you know will take a long time? You can use a batch job. Want to process your data streams in real time? You can do that too.

The key to the Spark wonderfulness is RDDs, or resilient distributed datasets. I underlined this definition:

They’re fine-grained, keeping track of all changes that have been made from other transformations such as map or join. This means that it’s possible to recover from failures by rebuilding from these transformations (which is why they’re called Resilient Distributed Datasets).
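The rebuild-from-transformations idea in that definition can be shown with a toy model. What follows is a minimal sketch in plain Python, not Spark code and not the Spark API. The ToyRDD class and its map, filter, collect, and lose_data methods are hypothetical names invented for illustration. The sketch assumes only what the quoted passage says: each dataset remembers the transformation that produced it, so a lost result can be recomputed from its parent.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not the Spark API).

    Each instance records how it was derived, not just the data it holds.
    """
    source: Optional[List[int]] = None   # base data, set only on the root
    parent: Optional["ToyRDD"] = None    # dataset this one was derived from
    transform: Optional[Callable[[List[int]], List[int]]] = None
    cache: Optional[List[int]] = None    # materialized result, may be lost

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # Record the transformation; nothing is computed yet.
        return ToyRDD(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred: Callable[[int], bool]) -> "ToyRDD":
        return ToyRDD(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self) -> List[int]:
        # If the result is missing, rebuild it by replaying the lineage.
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.source or [])
            else:
                self.cache = self.transform(self.parent.collect())
        return self.cache

    def lose_data(self) -> None:
        # Simulate a failure that wipes the materialized result.
        self.cache = None


base = ToyRDD(source=[1, 2, 3, 4, 5])
squares_of_evens = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

before = squares_of_evens.collect()
squares_of_evens.lose_data()          # pretend a node died
after = squares_of_evens.collect()    # recomputed from the recorded steps
print(before == after)
```

The real thing spreads partitions across a cluster and recomputes only the pieces that went missing. The toy version rebuilds the whole list, which is enough to show why recording the transformations makes recovery possible.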

My goodness, with these features, poor old Hadoop may not stand a chance. Now who would win a fight between Rousey and Emeev? One could, I assume, input data about the two fighters, perform some quick queries, and get an “answer.”

As with most NoSQL confections, will the answer match what happens in the ring?

Stephen E Arnold, October 20, 2015

Written by Stephen E. Arnold · Filed Under Big data, Database, News

Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.