Proof Behind Common Crawl Claims

September 18, 2013

Common Crawl is a non-profit foundation whose mission is to build and maintain an open crawl of the Web that everyone can access and analyze, with the goal of an open Web that supports education, research, and business. Does that sound like too lofty a goal? According to the post “A Look Inside Our 210TB 2012 Web Corpus” on Common Crawl’s main Web site, Sebastian Spiegler, a volunteer for the foundation, investigated the crawl’s effectiveness.

Spiegler wanted to see how the crawl measured up, so he conducted an exploratory analysis of its 2012 data. He wrote a paper to share his findings and summarized it as follows:

“The 2012 Common Crawl corpus is an excellent opportunity for individuals or businesses to cost-effectively access a large portion of the internet: 210 terabytes of raw data corresponding to 3.83 billion documents or 41.4 million distinct second-level domains. Twelve of the top-level domains have a representation of above 1% whereas documents from .com account to more than 55% of the corpus. The corpus contains a large amount of sites from youtube.com, blog publishing services like blogspot.com and wordpress.com as well as online shopping sites such as amazon.com. These sites are good sources for comments and reviews. Almost half of all web documents are utf-8 encoded whereas the encoding of the 43% is unknown. The corpus contains 92% HTML documents and 2.4% PDF files. The remainders are images, XML or code like JavaScript and cascading style sheets.”
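To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above; the decimal reading of "terabyte" (10^12 bytes) and the function name `derived_figures` are assumptions made for illustration, not anything taken from Spiegler's paper.

```python
# Figures quoted in the summary above.
CORPUS_BYTES = 210 * 10**12   # assumption: decimal terabytes (10^12 bytes)
DOCUMENTS = 3.83e9            # 3.83 billion documents
DOMAINS = 41.4e6              # 41.4 million distinct second-level domains


def derived_figures(corpus_bytes=CORPUS_BYTES, documents=DOCUMENTS, domains=DOMAINS):
    """Return rough averages implied by the quoted corpus statistics."""
    return {
        "avg_kb_per_document": corpus_bytes / documents / 1000,
        "avg_documents_per_domain": documents / domains,
        "html_documents": 0.92 * documents,   # 92% HTML
        "pdf_documents": 0.024 * documents,   # 2.4% PDF
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions the average document comes to roughly 55 KB, and each second-level domain contributes between 92 and 93 documents on average. These are simple averages, so they say nothing about how unevenly documents are spread across domains.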

Spiegler found that Common Crawl is a cost-effective way to access a large portion of the Web's data. Inexpensive, feasible solutions are desirable, so Common Crawl just needs to ramp up the advertising.

Whitney Grace, September 18, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

