Patterns in Web Content
March 10, 2011
Data mining refers to applications that look for common themes or patterns in specific pools of information. It is most popular in the scientific community, though the technology is increasingly being applied across the commercial sector.
The exponential growth of the Web has made it increasingly necessary to trace and scrutinize the relationships hidden in such collections of information.
The Computational Linguistics & Psycholinguistics Research Center (CLiPS), located in Belgium, has just released Pattern, a mining module designed to work with the Python programming language. The Pattern Web site says:
“It [Pattern] bundles tools for data retrieval (Google + Twitter + Wikipedia API, Web spider, HTML DOM parser), text analysis (rule-based shallow parser, WordNet interface, syntactical + semantical n-gram search algorithm, tf-idf + cosine similarity + LSA metrics) and data visualization (graph networks).”
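For readers unfamiliar with two of the metrics named in that list, tf-idf and cosine similarity, here is a minimal sketch of the general technique. It uses only the Python standard library and does not call Pattern's own API; the function names `tf_idf` and `cosine_similarity` and the sample sentences are made up for illustration, and Pattern's implementation may weight terms differently.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Illustrative helper, not part of Pattern. Weight is
    (term count / document length) * log(N / document frequency).
    """
    n = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, already lowercased and split on spaces.
docs = [
    "the election results are in".split(),
    "election polls show a close race".split(),
    "python makes text mining easy".split(),
]
vectors = tf_idf(docs)
```

In this sample, the first two documents share the term "election" and so score above zero against each other, while the first and third share no terms and score zero. Terms that appear in every document get a weight of zero, which is how tf-idf plays down words that do not help tell documents apart.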
When you follow the link above, you can access the release directly. Check out the specifications for compatibility.
I found it interesting that the designers, in a trial of their creation, used the software to track the progress of local politicians in the 2010 elections in their home country. Pattern scanned thousands of tweets, split between two languages, updating the data pool on a daily basis. The results were fascinating. You can read a detailed description of the experiment here.
Micheal Cory, March 10, 2011