Machine Learning Process Explained

September 5, 2012

Building commercial databases no longer requires as many humans, according to a recent post on Factual’s blog. “A Brief Tour of Factual’s Machine Learning Pipeline” does a good job of explaining the machine learning workflow. The description is specific to Factual, of course, but it is also a good source for understanding the process more generally.

First, algorithms clean up and standardize the wealth of available data. Next comes the process of resolving whether the data can help identify matches, non-matches, both, or neither. The corner cases left by this process are then analyzed, and all the results are ranked for trustworthiness and checked for quality. That last step is where the humans finally come in. Timothy Chklovski writes:

“One important principle of our systems is that we don’t assume everything can be automated. We take random samples of our data and run them by our quality control team and through teams organized with the help of Amazon’s Mechanical Turk service. This allows us to identify specific cases that slipped through our algorithms. The errors then go back to our algorithm designers who try to find improvements.”
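
To make the workflow above a little more concrete, here is a minimal Python sketch of the same stages: clean and standardize, sort record pairs into matches, non-matches, and corner cases, then pull a random sample for human review. This is not Factual's code. The function names, the field-equality score, and the two thresholds are all assumptions made up for illustration.

```python
import random
import re


def normalize(record: dict) -> dict:
    """Standardize a record: lowercase, drop punctuation, collapse whitespace."""
    cleaned = {}
    for key, value in record.items():
        text = re.sub(r"[^\w\s]", "", str(value).lower())
        cleaned[key] = re.sub(r"\s+", " ", text).strip()
    return cleaned


def similarity(a: dict, b: dict) -> float:
    """Fraction of shared fields whose normalized values are identical."""
    keys = a.keys() & b.keys()
    if not keys:
        return 0.0
    return sum(a[k] == b[k] for k in keys) / len(keys)


def classify_pair(a: dict, b: dict,
                  match_at: float = 0.8,
                  non_match_at: float = 0.3) -> str:
    """Label a pair; the thresholds are hypothetical, not taken from the post."""
    score = similarity(normalize(a), normalize(b))
    if score >= match_at:
        return "match"
    if score <= non_match_at:
        return "non-match"
    return "corner-case"  # left for further analysis


def sample_for_review(decisions: list, k: int, seed: int = 0) -> list:
    """Draw a random sample of decisions to send to human reviewers."""
    rng = random.Random(seed)
    return rng.sample(decisions, min(k, len(decisions)))


if __name__ == "__main__":
    a = {"name": "Joe's Cafe", "city": "Los Angeles", "phone": "555-1234"}
    b = {"name": "joes  cafe", "city": "los angeles", "phone": "5551234"}
    c = {"name": "Joe's Cafe", "city": "Los Angeles", "phone": "555-9999"}
    decisions = [classify_pair(a, b), classify_pair(a, c)]
    print(decisions)
    print(sample_for_review(decisions, k=1))
```

The design point is the middle label: anything the thresholds cannot settle is kept as a corner case rather than forced into a yes or no, and the random sample gives reviewers a chance to catch errors in the confident decisions as well.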

The write-up points out that Factual strongly embraces open source solutions. Not surprisingly, the company uses Hadoop and HBase; it also incorporates the data processing tool Cascalog, the dynamic programming language Clojure, and the URI-based repository dCache. Founded in 2007, the open data platform company is headquartered in Los Angeles.

Cynthia Murrell, September 05, 2012

Written by Stephen E. Arnold · Filed Under Data, Database, News, Open source
