Common Crawl Makes Baby Steps Towards Google’s Index Numbers
November 14, 2011
Read Write Web recently published an interesting article called “New 5 Billion Page Web Index With Page Rank Now Available.” Not only are the page index and page ranks openly accessible, but so are the link graphs and other metadata. Hosted on Amazon EC2, this feat was announced by the Common Crawl Foundation.
Unfortunately for the Common Crawl Foundation, we heard Google indexes 32 billion web pages. Still, the foundation remains optimistic because its cloud computing infrastructure theoretically provides unlimited storage in addition to localized access to an elastic compute cloud.
The three-year-old organization has just started releasing information about itself publicly. It made the following statement:
“Common Crawl is a Web Scale crawl, and as such, each version of our crawl contains billions of documents from the various sites that we are successfully able to crawl. This dataset can be tens of terabytes in size, making transfer of the crawl to interested third parties costly and impractical. In addition to this, performing data processing operations on a dataset this large requires parallel processing techniques, and a potentially large computer cluster.”
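The parallel processing the statement alludes to is typically done in the MapReduce style, where work is split across a cluster of machines. As a hedged, single-machine sketch (the record layout and field names below are simplified stand-ins, not Common Crawl’s actual ARC file format), a mapper/reducer pair that counts crawled pages per domain might look like:

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical, simplified records; a real crawl segment would be
# read from archive files hosted on Amazon's cloud, not hard-coded.
SAMPLE_RECORDS = [
    {"url": "http://example.com/a", "content": "..."},
    {"url": "http://example.com/b", "content": "..."},
    {"url": "http://archive.org/x", "content": "..."},
]

def mapper(record):
    """Emit (domain, 1) for each crawled page."""
    yield urlparse(record["url"]).netloc, 1

def reducer(key, values):
    """Sum the per-page counts for one domain."""
    return key, sum(values)

def run_mapreduce(records):
    # Single-process simulation of the shuffle phase; on a real
    # cluster a framework like Hadoop distributes these steps.
    groups = defaultdict(list)
    for rec in records:
        for k, v in mapper(rec):
            groups[k].append(v)
    return dict(reducer(k, vs) for k, vs in groups.items())

print(run_mapreduce(SAMPLE_RECORDS))
# {'example.com': 2, 'archive.org': 1}
```

Because the mapper and reducer operate on one record or one key at a time, the same logic scales from this toy list to a multi-terabyte crawl simply by running it across many machines.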
While they expect the project to spur a new wave of “innovation, education and research,” they will need to ramp up their numbers before they can really claim to provide comprehensive access.
Megan Feil, November 14, 2011