LinkedIn’s Data Infrastructure

August 17, 2010

LinkedIn is one of the premier networking sites for business professionals. It is a great way for professionals to meet others in their industry or stay connected to their colleagues. LinkedIn has millions of members in countries around the world, which adds up to massive amounts of data being processed daily. In “LinkedIn’s Data Infrastructure” LinkedIn’s Principal Engineer and Engineering Manager Jay Kreps gives attendees at the Hadoop Summit an inside look at how LinkedIn processes data.

LinkedIn keeps the majority of its vital data offline. Offline processing is relatively slow, so LinkedIn processes its data in daily batches. The company uses the open source framework Hadoop for these daily calculations. The term open source describes any program whose owner provides the source code along with a software license that allows users to modify the software if necessary. Hadoop is a popular framework because it is designed to help users work with massive amounts of data.
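
To make the batch idea concrete, here is a minimal, hypothetical Hadoop MapReduce job in Java that counts daily log events per member. The log format (a member ID and an event name separated by a tab), the paths, and the class names are illustrative assumptions, not details from Kreps’ presentation.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Hypothetical daily batch job: count log events per member. */
public class MemberEventCount {

    /** Emits (memberId, 1) for each log line "memberId<TAB>event". */
    public static class EventMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text memberId = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t", 2);
            if (fields.length == 2) {
                memberId.set(fields[0]);
                context.write(memberId, ONE);
            }
        }
    }

    /** Sums the per-member counts produced by the mappers. */
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "member-event-count");
        job.setJarByClass(MemberEventCount.class);
        job.setMapperClass(EventMapper.class);
        job.setCombinerClass(SumReducer.class);  // pre-aggregate on each mapper node
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. a day's log directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // per-member counts
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```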

Kreps made sure to mention two of LinkedIn’s most vital open source projects, Azkaban and Voldemort. The article describes Azkaban as “an open source workflow system for Hadoop, providing cron-like scheduling and make-like dependency analysis, including restart.” Its main purpose in the system is to control the ETL jobs that push the database, as well as any event logs, to Voldemort.
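
For a rough sense of that cron-plus-make model, Azkaban jobs are declared in small properties files, with dependencies between jobs named explicitly. The job names and shell commands below are hypothetical stand-ins for LinkedIn’s actual ETL steps.

```properties
# extract.job : hypothetical step that pulls data out of the database
type=command
command=sh extract_from_db.sh

# push_to_voldemort.job : runs only after the "extract" job succeeds
type=command
command=sh push_to_voldemort.sh
dependencies=extract
```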

Voldemort can simply be described as LinkedIn’s NoSQL key/value storage system. LinkedIn produces a daily relationship graph that is used for querying in Web page results. The data must go through an extensive process that was once handled by a database. However, that approach was counterproductive because the database had to be modified first and then the data had to be moved manually. Voldemort makes partitioning, along with the entire data movement process, easier and more productive.
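
To show what reading and writing Voldemort looks like in practice, here is a minimal sketch using Voldemort’s Java client API. The bootstrap URL, the store name (“test” is the sample store in Voldemort’s single-node configuration), and the key/value contents are assumptions for illustration.

```java
import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class VoldemortExample {
    public static void main(String[] args) {
        // Bootstrap against a (hypothetical) local Voldemort node.
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));

        // "test" is the sample store that ships with Voldemort's single-node config.
        StoreClient<String, String> client = factory.getStoreClient("test");

        // Simple key/value write, followed by a versioned read.
        client.put("member:42", "{\"connections\": 311}");
        Versioned<String> value = client.get("member:42");
        System.out.println(value.getValue());

        factory.close();
    }
}
```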

Readers can go to “Data Applications and Infrastructure at LinkedIn Hadoop Summit 2010” to view the data path and additional information. LinkedIn also has a handy index structure implemented in its Hadoop pipeline for extensive searches. The open source Lucene/Solr is used in the search features to make sure users can conduct advanced searches and obtain accurate information quickly. Open source was instrumental in LinkedIn’s ability to build a productive database that can handle the company’s massive data load, which was exactly what the doctor ordered.
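
For a flavor of what Lucene provides underneath such search features, here is a small self-contained indexing-and-search sketch in Java. The field names and sample profile are invented, and the exact classes vary somewhat across Lucene versions.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class ProfileSearch {
    public static void main(String[] args) throws Exception {
        Directory index = new ByteBuffersDirectory();  // in-memory index for the demo
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index one hypothetical member profile.
        try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("name", "Jane Example", Field.Store.YES));
            doc.add(new TextField("headline", "data infrastructure engineer", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Query the "headline" field the way a people-search box might.
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("headline", analyzer).parse("infrastructure");
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("name"));
            }
        }
    }
}
```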

April Holmes, August 17, 2010

