Parsing and Coding Guide to Extract Main Sentence Topics

May 30, 2013

The article “An Efficient Way to Extract the Main Topics from a Sentence,” on The Tokenizer, explores ways of parsing sentences to pull out their main topics. The author, working in Python, found that while his results were correct, the performance was not up to par. The article explains some of his technique, which mixes parsing with custom code. He discovered that full parsing went farther than the task required:

“For example CYK algorithm has the complexity of O(n^3 * |G|) !…Full-parsing was a bit of an overkill for what I wanted to achieve. First, I decided to define my own Part of Speech tagger. Luckily I found this article which was very useful. Second, I decided to define some “Semi-CFG”, which holds the patterns of the Noun Phrases. So in one sentence – My code just tags the sentence with my tagger, then searches for NP patterns in the sentence.”
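To get a feel for why O(n^3 * |G|) counts as overkill here, a back-of-the-envelope comparison in Python may help. This is only a rough sketch: the function names, the grammar size of 1,000 rules, and the five-pattern figure are assumptions made for illustration, not numbers from the original article.

```python
def cyk_cost(n_words, grammar_size):
    """Rough upper bound on CYK work: n^3 * |G| basic steps."""
    return n_words ** 3 * grammar_size


def pattern_scan_cost(n_words, n_patterns):
    """Rough cost of one left-to-right pass that checks each pattern at each word."""
    return n_words * n_patterns


# Hypothetical figures: a 20-word sentence, a 1,000-rule grammar, 5 noun-phrase patterns.
full_parse = cyk_cost(20, 1000)        # 20^3 * 1000 = 8,000,000
tag_and_scan = pattern_scan_cost(20, 5)  # 20 * 5 = 100

print(full_parse, tag_and_scan)
```

Under these assumed numbers the full parse does tens of thousands of times more work per sentence, which is the gap the author set out to avoid.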

The article also summarizes the code used. For example, tokenize_sentence splits a sentence into single words, whereas extract splits the sentence, tags it, and searches for patterns. The article concludes that full parsing, while the ideal method, sometimes slows the process down too much. All handy information for the search developer’s notebook.
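For readers who want to see the shape of the idea, here is a minimal sketch of the tag-then-match approach. It is not the author's code. The names tokenize_sentence and extract come from the article, but everything else is a stand-in: TOY_LEXICON is a hypothetical hand-written word list in place of a trained tagger, NP_PATTERNS is a hypothetical two-rule “Semi-CFG”, and unknown words are simply assumed to be nouns.

```python
import re

# Hypothetical toy lexicon standing in for a trained part-of-speech tagger.
TOY_LEXICON = {
    "the": "DT", "a": "DT",
    "quick": "JJ", "brown": "JJ", "lazy": "JJ",
    "fox": "NN", "dog": "NN",
    "jumps": "VBZ", "over": "IN",
}

# Hypothetical "Semi-CFG": adjacent tag pairs that merge into a single noun tag.
NP_PATTERNS = {
    ("JJ", "NN"): "NN",  # adjective + noun -> noun phrase
    ("NN", "NN"): "NN",  # noun + noun -> noun phrase
}


def tokenize_sentence(sentence):
    """Split a sentence into single words, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9']+", sentence)


def tag(tokens):
    """Look each word up in the toy lexicon; unknown words are assumed to be nouns."""
    return [(token, TOY_LEXICON.get(token.lower(), "NN")) for token in tokens]


def extract(sentence):
    """Split the sentence, tag it, then merge adjacent words that match an NP pattern."""
    tagged = tag(tokenize_sentence(sentence))
    merged = True
    while merged:
        merged = False
        for i in range(len(tagged) - 1):
            (word1, tag1), (word2, tag2) = tagged[i], tagged[i + 1]
            new_tag = NP_PATTERNS.get((tag1, tag2))
            if new_tag:
                # Replace the matching pair with one merged entry, then rescan.
                tagged[i:i + 2] = [(word1 + " " + word2, new_tag)]
                merged = True
                break
    return [word for word, pos in tagged if pos == "NN"]


print(extract("The quick brown fox jumps over the lazy dog"))
# ['quick brown fox', 'lazy dog']
```

No parse tree is ever built. The sentence is tagged once and then adjacent tags are collapsed until no pattern matches, which is the trade the article describes: less linguistic rigor in exchange for speed.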

Chelsea Kerwin, May 30, 2013

Sponsored by ArnoldIT.com, developer of Augmentext

Written by Stephen E. Arnold · Filed Under News

