FoxySpider is a Personal Web Crawler for Firefox

February 14, 2014

Ever wish you could have your own, personal web crawler? Now, if you browse with the open-source Mozilla Firefox, you can. Just download the add-on FoxySpider from Firefox’s Add-Ons site. The spider’s About section explains:

“With FoxySpider you can:

*Get all photos from an entire website

*Get all video clips from an entire website

*Get all audio files from an entire website

Well, actually get any file type you want from an entire website.

“FoxySpider can be used to create a thumbnail gallery containing links to rich media files of any file types you are interested in. It can also crawl deep to any level on a website and display the applicable files it found in the same gallery. FoxySpider is useful for different media content pages (music, video, images, documents), thumbnail gallery post (TGP) sites, podcasts. You can narrow and expand the search to support exactly what you want. Once the thumbnail gallery is created you can view, download or share (on Facebook and Twitter) every file that was fetched by FoxySpider.”
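
For readers curious what the file-type filtering described above might look like under the hood, here is a minimal Python sketch. It is not FoxySpider's code: the class name MediaLinkParser, the function find_media_links, and the example.com URLs are hypothetical and exist only for illustration. The sketch scans one page's HTML for href and src attributes and keeps the links whose path ends in one of the requested extensions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath


class MediaLinkParser(HTMLParser):
    """Hypothetical helper: collects href/src URLs that end in a wanted extension."""

    def __init__(self, base_url, extensions):
        super().__init__()
        self.base_url = base_url
        # Compare extensions case-insensitively (".JPG" matches ".jpg").
        self.extensions = {ext.lower() for ext in extensions}
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page's own URL.
                url = urljoin(self.base_url, value)
                ext = posixpath.splitext(urlparse(url).path)[1].lower()
                if ext in self.extensions and url not in self.found:
                    self.found.append(url)


def find_media_links(html, base_url, extensions):
    """Return the media URLs found in one page, in document order."""
    parser = MediaLinkParser(base_url, extensions)
    parser.feed(html)
    return parser.found


# Made-up sample page for illustration only.
SAMPLE_PAGE = """
<a href="/audio/episode1.mp3">Episode 1</a>
<img src="photos/cat.JPG">
<a href="about.html">About</a>
"""

links = find_media_links(SAMPLE_PAGE, "http://example.com/show/", {".mp3", ".jpg"})
for link in links:
    print(link)
```

A real crawler would also fetch each page, follow links to whatever depth you choose, and respect robots.txt. This sketch covers only the filtering step.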

One podcast-loving user gives FoxySpider five out of five stars and calls it an “essential tool for harvesting sites rich in your interest.” Another, who bestows four stars, says the crawler crashed his browser, but admits that his “PC is not so good.” If you believe your machine can handle the rigors of a resource-pounding application, download away, gentle reader.

Cynthia Murrell, February 14, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

Written by Stephen E. Arnold · Filed Under Crawl, News

Discover the Open Source Alternative to the Autonomy Crawler

February 7, 2014

Whatever one makes of Autonomy’s claimed product success, it is proprietary software and comes with a large price tag. The average small business or individual user cannot afford HP Autonomy’s IDOL Crawler. Open source is the best alternative, but for the longest time no open source software was comparable to IDOL Crawler. Norconex says that has changed in the article, “An Open Source Crawler For Autonomy IDOL.” Norconex has released an HP Autonomy IDOL Committer for its open source Web crawler, Norconex HTTP Collector.

The HTTP Collector is available on GitHub. The developer encourages people to download it and contribute to the project. Its features are mostly the same as those of HP Autonomy’s HTTP Connector.

The article states:

“Most key features of HP Autonomy HTTP Connector are available in Norconex HTTP Collector, including document changes detection on incremental crawls and purging documents from IDOL for deleted web pages. New ones are introduced, such as having different hit interval at different time of the day and the ability to overwrite pretty much every part of the web crawling flow with your own implementation logic. The IDOL Committer has been tested on diverse public and internal web sites with great performance.”
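
The “document changes detection on incremental crawls” and the purging of deleted pages mentioned in the quote can be illustrated with a small Python sketch. This is an assumption about one common way to do it (comparing content checksums between crawls), not Norconex's actual implementation; the function names checksum and diff_crawl and the example.com URLs are hypothetical.

```python
import hashlib


def checksum(content):
    """Fingerprint a page body so two crawls can be compared cheaply."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def diff_crawl(previous, current_pages):
    """Compare the current crawl with the previous one.

    previous:      dict of url -> checksum saved from the last crawl
    current_pages: dict of url -> page content fetched in this crawl
    Returns (to_commit, to_delete, new_state):
      to_commit  new or changed URLs that should be sent to the index
      to_delete  URLs that disappeared and should be purged from the index
      new_state  url -> checksum to save for the next crawl
    """
    new_state = {url: checksum(body) for url, body in current_pages.items()}
    to_commit = sorted(
        url for url, digest in new_state.items() if previous.get(url) != digest
    )
    to_delete = sorted(url for url in previous if url not in new_state)
    return to_commit, to_delete, new_state


# Made-up crawl data for illustration only.
first_crawl = {
    "http://example.com/": "home v1",
    "http://example.com/old": "old page",
}
second_crawl = {
    "http://example.com/": "home v2",
    "http://example.com/new": "new page",
}

_, _, state = diff_crawl({}, first_crawl)
to_commit, to_delete, state = diff_crawl(state, second_crawl)
print("commit:", to_commit)
print("delete:", to_delete)
```

On the second pass, unchanged pages are skipped entirely, which is what makes an incremental crawl cheaper than starting over.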

The open source community teaches that when the software you want does not exist, you can either wait for a developer to build it or take the initiative and build it yourself.

Whitney Grace, February 07, 2014

Written by Stephen E. Arnold · Filed Under Crawl, News, Open source, Tools

Proof Behind Common Crawl Claims

September 18, 2013

Common Crawl is a non-profit foundation whose mission is to build and maintain an open crawl of the Web that everyone can access and analyze, with the goal of an open Web that supports education, research, and business. Does that sound like too lofty a goal? According to the post “A Look Inside Our 210TB 2012 Web Corpus” on Common Crawl’s main Web site, Sebastian Spiegler, a volunteer for the foundation, investigated the crawl’s effectiveness.

Spiegler wanted to see how the crawl measured up, so he conducted an exploratory analysis of its 2012 data. He wrote a summary paper to share his findings. It states:

“The 2012 Common Crawl corpus is an excellent opportunity for individuals or businesses to cost-effectively access a large portion of the internet: 210 terabytes of raw data corresponding to 3.83 billion documents or 41.4 million distinct second-level domains. Twelve of the top-level domains have a representation of above 1% whereas documents from .com account to more than 55% of the corpus. The corpus contains a large amount of sites from youtube.com, blog publishing services like blogspot.com and wordpress.com as well as online shopping sites such as amazon.com. These sites are good sources for comments and reviews. Almost half of all web documents are utf-8 encoded whereas the encoding of the 43% is unknown. The corpus contains 92% HTML documents and 2.4% PDF files. The remainders are images, XML or code like JavaScript and cascading style sheets.”
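
To put the quoted figures in perspective, a few lines of Python arithmetic help. Every input comes from the quote; the one assumption is that “terabytes” means decimal terabytes (10^12 bytes), and the variable names are simply illustrative.

```python
# Figures taken from the quoted summary.
TOTAL_BYTES = 210 * 10**12   # assumption: 1 TB = 10**12 bytes
DOCUMENTS = 3.83e9           # 3.83 billion documents
DOMAINS = 41.4e6             # 41.4 million second-level domains
HTML_SHARE = 0.92            # 92% HTML documents
PDF_SHARE = 0.024            # 2.4% PDF files

avg_doc_kb = TOTAL_BYTES / DOCUMENTS / 1000   # average raw size per document
docs_per_domain = DOCUMENTS / DOMAINS         # average documents per domain
html_docs = DOCUMENTS * HTML_SHARE
pdf_docs = DOCUMENTS * PDF_SHARE

print(f"Average document size: {avg_doc_kb:.1f} KB")
print(f"Documents per domain:  {docs_per_domain:.1f}")
print(f"HTML documents:        {html_docs / 1e9:.2f} billion")
print(f"PDF documents:         {pdf_docs / 1e6:.1f} million")
```

Under that assumption, the average document weighs in at roughly 55 KB and each second-level domain contributes about 92 documents on average.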

Spiegler found that Common Crawl is a cost-effective way to access a large portion of the Web’s data. Inexpensive, feasible solutions are desirable, so Common Crawl just needs to ramp up the advertising.

Whitney Grace, September 18, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Written by Stephen E. Arnold · Filed Under Crawl, News

Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.