Resource Links: Text Extraction From HTML Documents

March 28, 2011

We found another nifty links page to add to your software utility file. The list comes from Tomaž Kovačič’s Tech Blog. He gathered resource links about text extraction from HTML documents to aid the wayward IT worker.

He first highlights articles that cover the basics of text extraction. Reading these articles gives you general knowledge of text extraction and the best way to approach it for your needs. He also mentions how to eliminate content “noise” (i.e. content farms).

He’s also collected a comprehensive list of links to text extraction software. He says, “There is only a small amount of competition when it comes to software capable of [removing boilerplate text / extracting article text / cleaning web pages / predicting informative content blocks] or whatever terms authors are using to describe the capabilities of their product.”

Extracting text from an HTML document is relatively simple. The complexity comes from the type of software you use. He ends with information about APIs and other miscellaneous links that will be helpful. Stash the list away for future use.
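
For a rough sense of what the "simple" part looks like, here is a minimal sketch that strips tags with Python's standard-library `html.parser` module. The class name `NaiveTextExtractor`, the helper `extract_text`, and the sample markup are hypothetical illustrations and do not come from Kovačič's list. The sketch only removes markup; it makes no attempt to decide which text is the article and which is boilerplate.

```python
from html.parser import HTMLParser


class NaiveTextExtractor(HTMLParser):
    """Collect visible text nodes, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return all non-script, non-style text joined by single spaces."""
    parser = NaiveTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical sample page: one real paragraph surrounded by page furniture.
sample = (
    "<html><head><title>Post</title><style>p{color:red}</style></head>"
    "<body><div id='nav'>Home | About</div>"
    "<p>The article text.</p>"
    "<script>var x = 1;</script>"
    "<div id='footer'>Subscribe</div></body></html>"
)
print(extract_text(sample))
```

The result still contains the page title, the navigation text, and the footer text alongside the one real paragraph. Separating that boilerplate from the article is the harder problem that the software in the list sets out to solve.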

Whitney Grace, March 28, 2011

Written by Stephen E. Arnold · Filed Under Connectors, Database, News, Text processing

Comments

One Response to “Resource Links: Text Extraction From HTML Documents”

Tomaž on March 29th, 2011 4:19 pm

I’m a bit surprised that you wrote “Extracting text from an HTML document is relatively simple”. It sounds and feels simple, but when you get into the details the simplicity goes out the door.

Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.