Resource Links: Text Extraction From HTML Documents

March 28, 2011

We found another nifty links page to add to your software utility file.  The list comes from Tomaž Kovačič’s Tech Blog.  He gathered resource links about text extraction from HTML documents to aid the wayward IT worker.

He first highlights articles that cover the basics of text extraction.  By reading these articles, you gain a general knowledge about text extraction and the best way to approach it for your needs.  He also mentions how to eliminate content “noise” (i.e. content farms).

He has also collected a comprehensive list of links to text extraction software.  He says, “There is only a small amount of competition when it comes to software capable of [removing boilerplate text / extracting article text / cleaning web pages / predicting informative content blocks] or whatever terms authors are using to describe the capabilities of their product.”

Extracting text from an HTML document is relatively simple.  The type of software you use makes it more complex.  He ends with information about APIs and other miscellaneous links that will be helpful. Stash it away for future use.
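
To make the "relatively simple" part concrete, here is a minimal sketch in Python that uses only the standard library's html.parser module. The class name NaiveTextExtractor, the helper extract_text, and the sample page are made up for illustration; none of them come from the linked resources. The sketch strips tags and skips script and style contents, and it makes no attempt to separate article text from boilerplate.

```python
from html.parser import HTMLParser


class NaiveTextExtractor(HTMLParser):
    """Hypothetical helper: collect text nodes, skipping <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # non-empty text nodes, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Return the text of an HTML string, one text node per line."""
    parser = NaiveTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up sample page: one paragraph of "article" plus typical page furniture.
SAMPLE_PAGE = """
<html><head><title>Demo</title><style>p { color: red; }</style></head>
<body>
  <ul><li>Archives</li><li>Recent Posts</li></ul>
  <p>The article text we actually want.</p>
  <script>var tracking = 1;</script>
</body></html>
"""

if __name__ == "__main__":
    print(extract_text(SAMPLE_PAGE))
```

On the sample page the script and style contents are dropped, but the page title and the "Archives" and "Recent Posts" navigation labels come out alongside the paragraph. Deciding which of those lines is the article and which is boilerplate is the job the tools on the list are built for.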

Whitney Grace, March 28, 2011

Comments

One Response to “Resource Links: Text Extraction From HTML Documents”

  1. Tomaž on March 29th, 2011 4:19 pm

    I’m a bit surprised that you wrote “Extracting text from an HTML document is relatively simple”. It sounds and feels simple, but when you get into the details the simplicity goes out the door.
