Machine Learning Process Explained
September 5, 2012
Building commercial databases no longer requires as many humans as it once did, we learn from a recent post on Factual’s blog. “A Brief Tour of Factual’s Machine Learning Pipeline” does a good job of explaining the machine learning workflow. The description is specific to Factual, of course, but it is also a good source for understanding the process more generally.
First, algorithms clean up and standardize the wealth of available data. Next comes resolution: determining whether the data can help identify matches, non-matches, both, or neither. The corner cases left by this process are then analyzed, and all the results are ranked for trustworthiness and controlled for quality. That last step is where the humans finally come in. Timothy Chklovski writes:
“One important principle of our systems is that we don’t assume everything can be automated. We take random samples of our data and run them by our quality control team and through teams organized with the help of Amazon’s Mechanical Turk service. This allows us to identify specific cases that slipped through our algorithms. The errors then go back to our algorithm designers who try to find improvements.”
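To make the workflow above a little more concrete, here is a minimal sketch in Python. It is not Factual's code: the record fields (`name`, `phone`), the matching rule, and the function names are all assumptions made up for illustration. It only shows the general shape of the steps described: standardize records, label pairs as matches, non-matches, or corner cases, and pull a random sample for human reviewers.

```python
import random
import re


def normalize(record):
    """Standardize one record (hypothetical fields: 'name' and 'phone')."""
    cleaned = {}
    for key, value in record.items():
        text = str(value).lower()
        if key == "phone":
            # Keep digits only so differently formatted numbers compare equal.
            cleaned[key] = re.sub(r"\D", "", text)
        else:
            # Drop punctuation, then collapse runs of whitespace.
            text = re.sub(r"[^\w\s]", "", text)
            cleaned[key] = " ".join(text.split())
    return cleaned


def classify_pair(a, b):
    """Label two normalized records as 'match', 'non-match', or 'unsure'."""
    same_name = a["name"] == b["name"]
    same_phone = a["phone"] == b["phone"]
    if same_name and same_phone:
        return "match"
    if not same_name and not same_phone:
        return "non-match"
    return "unsure"  # corner case: the evidence points both ways


def sample_for_review(items, k, seed=0):
    """Draw a reproducible random sample to hand to human reviewers."""
    rng = random.Random(seed)
    return rng.sample(items, min(k, len(items)))


# Illustrative input; these records are invented for the example.
raw = [
    {"name": "Joe's Diner", "phone": "(310) 555-0100"},
    {"name": "joes diner", "phone": "310-555-0100"},
    {"name": "Joes Diner", "phone": "310 555 0199"},
    {"name": "Main St. Cafe", "phone": "213.555.0142"},
]
records = [normalize(r) for r in raw]

# Compare every pair of records and label it.
labels = {
    (i, j): classify_pair(records[i], records[j])
    for i in range(len(records))
    for j in range(i + 1, len(records))
}

# Send a couple of randomly chosen pairs to quality control.
review_queue = sample_for_review(sorted(labels), k=2)
```

In a real pipeline, the pairs that reviewers flag as wrong would presumably feed back to the people tuning `normalize` and `classify_pair`, which is the loop the quote describes.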
The write-up points out that Factual strongly embraces open source solutions. Not surprisingly, they use Hadoop and HBase; they also incorporate the data processing library Cascalog, the dynamic programming language Clojure, and the URI-based repository dCache. Founded in 2007, the open data platform company is headquartered in Los Angeles.
Cynthia Murrell, September 05, 2012