Big Data and Their Interesting Processes

March 25, 2015

I love it when mid-tier consultants wax enthusiastic about Big Data. “Search your data lake,” enjoins one clueless marketer. “Big Data is the future,” sings a self-appointed expert. Yikes.

To get a glimpse of exactly what has to be done to process certain types of Big Data in an economical yet timely manner, I suggest you read “Analytics on the Cheap.” The author is 0X74696D. Get it?

The write-up explains the procedures required to crunch the data while managing the budget. The workflow I found interesting is:

  • Incoming message passes through our CDN to pick up geolocation headers
  • Message has its session authenticated (this happens at our routing layer in Nginx/OpenResty)
  • Message is routed to an ingest server
  • Ingest server transforms message and headers into a single character-delimited querystring value
  • Ingest server makes an HTTP GET to a 0-byte file on S3 with that querystring
  • The bucket on S3 has S3 logging turned on
  • We ingest the S3 logs directly into Redshift on a daily basis
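The clever part of the workflow above is the 0-byte GET: the request itself carries the data, and S3 server access logging records the full request URI (querystring included) for the later load into Redshift. A minimal sketch of the ingest step might look like the following; the field names, delimiter, bucket host, and `pixel` object key are illustrative assumptions, not details from the write-up.

```python
import urllib.parse
import urllib.request


def build_ping_url(bucket_host, fields, delimiter="|"):
    """Pack message fields into one character-delimited querystring value.

    The keys are sorted so the field order in the delimited payload is
    stable from request to request (an assumption; the write-up does not
    specify an ordering scheme).
    """
    payload = delimiter.join(str(fields[k]) for k in sorted(fields))
    qs = urllib.parse.urlencode({"d": payload})
    # GET against a hypothetical 0-byte object named "pixel"; the response
    # body is empty. The point is the side effect: S3 access logging writes
    # the request URI, querystring and all, into the log files that get
    # COPY-loaded into Redshift daily.
    return f"https://{bucket_host}/pixel?{qs}"


def send_ping(url):
    """Fire-and-forget GET; the 0-byte object returns no useful content."""
    with urllib.request.urlopen(url) as resp:
        return resp.status
```

For example, `build_ping_url("example-bucket.s3.amazonaws.com", {"event": "click", "geo": "US"})` yields a URL whose querystring encodes `click|US`, which would then appear verbatim in the bucket's access log.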

The write-up then provides code snippets and some business commentary. The author also identifies the upside of the approach used.

Why is this important? It is easy to talk about Big Data. Looking at what is required to make use of Big Data reveals the complexity of the task.

Keep this hype-versus-reality split in mind the next time you listen to a search vendor yak about Big Data.

Stephen E Arnold, March 25, 2015
