ThisPlusThat for Smarter Searches
December 2, 2013
Leave it to an astrophysicist to make search smarter. Christopher Moody, one of the fellows at the Insight Data Science Fellows Program, describes how his search engine uses word vectors to produce more accurate search results in “ThisPlusThat.me: a Search Engine that Lets You ‘Add’ Words as Vectors.” The scientist says he was inspired by the possibilities presented by word2vec, Google’s new word-vector algorithm. He explains:
“What [Google] doesn’t do is understand the relationships between words and understand the similarities or dissimilarities. That’s where ThisPlusThat.me comes in–a search site I built to experiment with the word2vec algorithm recently released by Google. word2vec allows you to add and subtract concepts as if they were vectors, and get out sensible, and interesting results. I applied it to the Wikipedia corpus, and in doing so, tried creating an interactive search site that would allow users to put word2vec through its paces.”
Moody supplies several examples of his project in action. The first and most elementary: querying “King – Man + Woman” returns “Queen.” Since the algorithm was trained using Wikipedia’s vast collection of data, Moody explains, it has “a pretty good grasp of not only common words like ‘smart’ or ‘American’ but also loads of human concepts and real world objects, allowing us to manipulate proper nouns.” You can try ThisPlusThat.me for yourself.
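To make the “King – Man + Woman” idea concrete, here is a minimal sketch in Python with NumPy. It is not Moody’s code: the three-dimensional vectors are hand-made toy values and the `solve` helper is a hypothetical name chosen for illustration. Real word2vec vectors are learned from text and have far more dimensions. The sketch adds and subtracts the vectors, then picks the remaining word whose vector points in the most similar direction.

```python
import numpy as np

# Toy embeddings invented for this example (assumption, not real word2vec output).
vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve(positive, negative):
    """Add the 'positive' words, subtract the 'negative' ones, and return
    the closest remaining word (query words themselves are excluded)."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    exclude = set(positive) | set(negative)
    scores = {w: cosine(target, v) for w, v in vectors.items() if w not in exclude}
    return max(scores, key=scores.get)

result = solve(positive=["king", "woman"], negative=["man"])
print(result)
```

With these toy values the nearest remaining word is "queen". Excluding the query words from the candidates matters, because the combined vector often stays close to one of its own inputs.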
Moody explains how he approached word2vec’s huge table of high-dimensional vectors using Hadoop’s Map functions. To speed computation, he tried a number of tools: NumPy, Cython, Numba, and Numexpr. Near the end of the article, Moody shares links to his code and notebook experiments. The write-up is worth a look for anyone interested in the development of natural language algorithms.
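The general map-style approach to searching a large vector table can be sketched in plain Python with NumPy. This is an illustration under stated assumptions, not Moody’s Hadoop job: the `map_chunk` and `top_k` names, the chunk size, and the random table are all hypothetical. Each “map” call scores one slice of the table with a vectorized NumPy product and returns its local best matches, and a final merge step keeps the overall top results.

```python
import numpy as np

def map_chunk(chunk_words, chunk_vectors, query, k):
    """Map step: cosine-score one slice of the table, return its local top-k."""
    norms = np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query)
    scores = chunk_vectors @ query / norms
    top = np.argsort(scores)[::-1][:k]
    return [(float(scores[i]), chunk_words[i]) for i in top]

def top_k(words, table, query, k=3, chunk_size=4):
    """Split the table into chunks, map over them, then merge the partial results."""
    partial = []
    for start in range(0, len(words), chunk_size):
        end = start + chunk_size
        partial.extend(map_chunk(words[start:end], table[start:end], query, k))
    # Merge step: the global top-k is always among the per-chunk top-k lists.
    return sorted(partial, reverse=True)[:k]

# Random stand-in for a vector table (assumption: real tables are far larger).
rng = np.random.default_rng(0)
words = [f"word{i}" for i in range(10)]
table = rng.normal(size=(10, 8))

best = top_k(words, table, query=table[4], k=3)
print(best[0][1])
```

Because each chunk is scored independently, the map calls could run on separate workers; here they run in a simple loop to keep the sketch self-contained.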
Cynthia Murrell, December 02, 2013