Short Honk: Crawl the Web at Scale
September 30, 2015
Short honk: I read “Aduana: Link Analysis to Crawl the Web at Scale.” The write up explains an open source project that can copy content “dispersed all over the Web.” Keep in mind that the approach focuses primarily on text. Aduana is a special back end for Frontera, developer Scrapinghub’s tool for speeding up broad crawls, and it is built on top of the LMDB data store.
According to the write up:
we wanted to locate relevant pages first rather than on an ad hoc basis. We also wanted to revisit the more interesting ones more often than the others. We ultimately ran a pilot to see what happens. We figured our sheer capacity might be enough. After all, our cloud-based platform’s users scrape over two billion web pages per month…. We think Aduana is a very promising tool to expedite broad crawls at scale. Using it, you can prioritize crawling pages with the specific type of information you’re after. It’s still experimental. And not production-ready yet.
In its present form, Aduana is able to:
- Analyze news.
- Search locations and people.
- Perform sentiment analysis.
- Find and classify companies.
- Extract job listings.
- Find all sellers of certain products.
The write up contains links to the relevant GitHub repository, some code snippets, and descriptive information.
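For a sense of what the setup looks like, here is a minimal sketch of a Frontera settings module pointed at an Aduana back end. The back end module path and the Aduana-specific setting names are assumptions drawn from the write up’s description, not a verified configuration; the project’s GitHub repository has the authoritative examples.

```python
# frontera_settings.py -- a minimal sketch of pointing a Frontera crawl
# at an Aduana back end. The module path and the Aduana-specific setting
# names below are assumptions, not a verified configuration; check the
# project's GitHub repository for the real ones.

BACKEND = 'aduana.frontera.Backend'  # assumed path of the Aduana back end class

# Standard Frontera knob: how many URLs the frontier hands the spider per batch.
MAX_NEXT_REQUESTS = 256

# Hypothetical Aduana knobs: where the LMDB-backed page database lives on
# disk, and which link-analysis scorer decides what gets crawled next.
PAGE_DB_PATH = 'crawl-data'
SCORER = 'HitsScorer'
```

The design point is the split itself: the spider only fetches pages, while the back end’s link analysis scores decide which discovered URLs are worth fetching next and which pages deserve revisiting, rather than crawling in discovery order.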
Stephen E Arnold, September 30, 2015