Common Crawl Makes Baby Steps Towards Google’s Index Numbers
November 14, 2011
Read Write Web recently published an interesting article called “New 5 Billion Page Web Index With Page Rank Now Available.” Not only are the page index and page ranks openly accessible, but so are the link graphs and other metadata. Hosted on Amazon EC2, this feat was announced by the Common Crawl Foundation.
Unfortunately for the Common Crawl Foundation, we heard Google indexes 32 billion web pages. Still, the foundation remains optimistic because its cloud computing infrastructure theoretically provides unlimited storage in addition to localized access to an elastic compute cloud.
The three-year-old organization has just started releasing information about itself publicly. It made the following statement:
“Common Crawl is a Web Scale crawl, and as such, each version of our crawl contains billions of documents from the various sites that we are successfully able to crawl. This dataset can be tens of terabytes in size, making transfer of the crawl to interested third parties costly and impractical. In addition to this, performing data processing operations on a dataset this large requires parallel processing techniques, and a potentially large computer cluster.”
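The parallel processing the statement alludes to is typically done in the MapReduce style, where work is split across a cluster of machines. As a hedged, single-machine sketch (the record layout and field names below are simplified stand-ins, not Common Crawl’s actual ARC file format), a mapper/reducer pair that counts crawled pages per domain might look like:

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical, simplified records; a real crawl segment would be
# read from archive files hosted on Amazon's cloud, not hard-coded.
SAMPLE_RECORDS = [
    {"url": "http://example.com/a", "content": "..."},
    {"url": "http://example.com/b", "content": "..."},
    {"url": "http://archive.org/x", "content": "..."},
]

def mapper(record):
    """Emit (domain, 1) for each crawled page."""
    yield urlparse(record["url"]).netloc, 1

def reducer(key, values):
    """Sum the per-page counts for one domain."""
    return key, sum(values)

def run_mapreduce(records):
    # Single-process simulation of the shuffle phase; on a real
    # cluster a framework like Hadoop distributes these steps.
    groups = defaultdict(list)
    for rec in records:
        for k, v in mapper(rec):
            groups[k].append(v)
    return dict(reducer(k, vs) for k, vs in groups.items())

print(run_mapreduce(SAMPLE_RECORDS))
# {'example.com': 2, 'archive.org': 1}
```

Because the mapper and reducer operate on one record or one key at a time, the same logic scales from this toy list to a multi-terabyte crawl simply by running it across many machines.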
While they expect the project to spur a new wave of “innovation, education and research,” they will need to ramp up their numbers before they can really claim to provide comprehensive access.
Megan Feil, November 14, 2011