Data Mining Algorithms Explained
May 18, 2015
In plain English too. Navigate to “Top 10 Data Mining Algorithms in Plain English.” When you fire up an enterprise content processing system, the algorithms beneath the user experience layer are chestnuts. Universities do a good job of teaching students about some reliable methods to perform data operations. In fact, the universities do such a good job that most content processing systems include almost the same old chestnuts in their solutions. The decision to use some or all of the top 10 data mining algorithms has some interesting consequences, but you will have to attend one of my lectures about the weaknesses of these numerical recipes to get some details.
The write up is worth a read. The article includes a link to information which underscores the ubiquitous nature of these methods. This is the Xindong Wu et all write up “Top 10 Algorithms in Data Mining.” Our research reveals that dependence on these methods is more wide spread now than they were seven years ago when the paper first appeared.
The implication then and now is that content processing systems are more alike than different. The use of similar methods means that the differences among some systems is essentially cosmetic. There is a flub in the paper. I am confident that you, gentle reader, will spot it easily.
Now to the “made simple” write up. The article explains quite clearly the what and why of 10 widely used methods. The article also identifies some of the weaknesses of each method. If there is a weakness, do you think it can be exploited? This is a question worth considering I suggest.
Example: What is a weakness of k means:
Two key weaknesses of k-means are its sensitivity to outliers, and its sensitivity to the initial choice of centroids. One final thing to keep in mind is k-means is designed to operate on continuous data — you’ll need to do some tricks to get it to work on discrete data.
Note the key word “tricks.” When one deals with math, the way to solve problems is to be clever. It follows that some of the differences among content processing systems boils down to the cleverness of the folks working on a particular implementation. Think back to your high school math class. Was there a student who just spit out an answer and then said, “It’s obvious.” Well, that’s the type of cleverness I am referencing.
The author does not dig too deeply into PageRank, but it too has some flaws. An easy way to identify one is to attend a search engine optimization conference. One flaw turbocharges these events.
My relative Vladimir Arnold, whom some of the Arnolds called Vlad the Annoyer, would have liked the paper. So do I. The write up is a keeper. Plus there is a video, perfect for the folks whose attention span is better than a goldfish’s.
Stephen E Arnold, May 18, 2015