Parsing and Coding Guide to Extract Main Sentence Topics

May 30, 2013

The article “An Efficient Way to Extract the Main Topics from a Sentence” on The Tokenizer explores methods of parsing sentences to pull out their main topics. The author, working in Python, found that while his results were correct, the performance was not up to par. The article explains some of his technique, which mixes parsing and coding. He discovered that full parsing went farther than his task required:

“For example CYK algorithm has the complexity of O(n^3 * |G|)! … Full-parsing was a bit of an overkill for what I wanted to achieve. First, I decided to define my own Part of Speech tagger. Luckily I found this article which was very useful. Second, I decided to define some “Semi-CFG”, which holds the patterns of the Noun Phrases. So in one sentence – My code just tags the sentence with my tagger, then searches for NP patterns in the sentence.”
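
To make the tag-then-match idea concrete, here is a minimal sketch in Python. It is not the author's code: the toy lexicon, the tag() helper, and the NP_PATTERNS list are all assumptions made for illustration, and a real tagger would be trained on a corpus rather than hard-coded. Only the names tokenize_sentence and extract are borrowed from the original article, and their bodies here are guesses at the general shape.

```python
import re

# Toy lexicon standing in for a trained part-of-speech tagger (assumption).
TOY_LEXICON = {
    "the": "DT", "a": "DT",
    "quick": "JJ", "brown": "JJ", "lazy": "JJ",
    "fox": "NN", "dog": "NN",
    "jumps": "VBZ", "over": "IN",
}

# Hypothetical noun-phrase patterns over tag sequences, longest first.
NP_PATTERNS = [
    ("JJ", "JJ", "NN"),
    ("JJ", "NN"),
    ("NN", "NN"),
    ("NN",),
]


def tokenize_sentence(sentence):
    """Split a sentence into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def tag(tokens):
    """Look each token up in the toy lexicon; unknown words default to NN."""
    return [(tok, TOY_LEXICON.get(tok, "NN")) for tok in tokens]


def extract(sentence):
    """Tokenize, tag, then scan left to right for noun-phrase tag patterns."""
    tagged = tag(tokenize_sentence(sentence))
    tags = [t for _, t in tagged]
    phrases = []
    i = 0
    while i < len(tagged):
        for pattern in NP_PATTERNS:
            n = len(pattern)
            if tuple(tags[i:i + n]) == pattern:
                # Take the first (longest) matching pattern and skip past it.
                phrases.append(" ".join(w for w, _ in tagged[i:i + n]))
                i += n
                break
        else:
            # No pattern starts at this position; move on one token.
            i += 1
    return phrases


print(extract("The quick brown fox jumps over the lazy dog"))
```

With a fixed set of short patterns, this scan does a bounded amount of work per token, so its cost grows roughly linearly with sentence length. That is the trade the quoted passage describes: giving up a full parse tree in exchange for speed.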

The article also provides a summary of the code used. For example, tokenize_sentence splits a sentence into single words, whereas extract splits the sentence, tags it, and searches for patterns. The article concludes that full parsing, while the ideal method, sometimes slows down the process too much. All handy information for the search developer’s notebook.

Chelsea Kerwin, May 30, 2013

Sponsored by ArnoldIT.com, developer of Augmentext
