Update from Lucene
May 10, 2016
It has been a while since we heard about our old friend Apache Lucene, but the open source search engine has something new, says Open Source Connections in the article, “BM25 The Next Generation Of Lucene Relevance.” Lucene has added BM25 to its search software, and it just might improve search results.
“BM25 improves upon TF*IDF. BM25 stands for “Best Match 25”. Released in 1994, it’s the 25th iteration of tweaking the relevance computation. BM25 has its roots in probabilistic information retrieval. Probabilistic information retrieval is a fascinating field unto itself. Basically, it casts relevance as a probability problem. A relevance score, according to probabilistic information retrieval, ought to reflect the probability a user will consider the result relevant.”
Apache Lucene formerly relied on TF*IDF to rank how relevant a text match is to a user. The score rests on two factors: term frequency (how often a term appears in a document) and inverse document frequency, aka IDF (how many documents a term appears in, which determines how “special” it is). BM25 improves on the old TF*IDF, but the raw probabilistic IDF behind it can go negative for terms with very high document frequency. The IDF in Lucene’s BM25 solves this problem by adding 1 inside the logarithm, making it impossible to deliver a negative value.
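To make the fix concrete, here is a minimal sketch in Python (not Lucene’s Java internals) contrasting a classic probabilistic IDF, which can dip below zero for very common terms, with a BM25-style IDF that adds 1 inside the logarithm; the document counts are invented for illustration.

```python
import math

def classic_idf(total_docs, doc_freq):
    # Probabilistic IDF: turns negative once a term appears in more than half the documents.
    return math.log((total_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_idf(total_docs, doc_freq):
    # BM25-style IDF as described above: the added 1 keeps the logarithm non-negative.
    return math.log(1 + (total_docs - doc_freq + 0.5) / (doc_freq + 0.5))

total_docs = 1000                 # invented corpus size
for doc_freq in (5, 500, 990):    # a rare, a common, and a very common term
    print(doc_freq, round(classic_idf(total_docs, doc_freq), 3),
          round(bm25_idf(total_docs, doc_freq), 3))
```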
BM25 will have a big impact on Solr and Elasticsearch, improving search results and relevance accuracy, in part through term frequency saturation.
Whitney Grace, May 10, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Machine Learning: 10 Numerical Recipes
April 8, 2016
The chatter about smart software is loud. I cannot hear the mixes on my Creamfields 2014 CD. Mozart, you are a goner.
If you want to cook up some smart algorithms to pick music or drive your autonomous vehicle without crashing into a passenger-carrying bus, navigate to “Top 10 Machine Learning Algorithms.”
The write up points out that just like pop music, there is a top 10 list. More important in my opinion is the concomitant observation that smart software may be based on a limited number of procedures. Hey, this stuff is taught in many universities. Go with what you know maybe?
What are the top 10? The write up asserts:
- Linear regression
- Logistic regression
- Linear discriminant analysis
- Classification and regression trees
- Naive Bayes
- K nearest neighbors
- Learning vector quantization
- Support vector machines
- Bagged decision trees and random forest
- Boosting and AdaBoost.
The article tosses in a bonus too: Gradient descent.
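As a small, hedged illustration of that bonus item, here is a sketch of plain gradient descent fitting the first algorithm on the list, a one-variable linear regression; the data points and learning rate are invented for the example.

```python
# Gradient descent on a toy one-variable linear regression, y ≈ w*x + b.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 8.1]

w, b, learning_rate = 0.0, 0.0, 0.01
for _ in range(5000):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 2), round(b, 2))   # should land near w ≈ 2 and b ≈ 0
```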
What is interesting is that there is considerable overlap with the list I developed for my lecture on manipulating content processing using shaped or weaponized text strings. How’s that, Ms. Null?
The point is that when systems use the same basic methods, are those systems sufficiently different? If so, in what ways? How are systems using standard procedures configured? What if those configurations or “settings” are incorrect?
Exciting.
Stephen E Arnold, April 8, 2016
Google Maps: Accuracy Is Relative
April 1, 2016
Editorial comment: Not an April Fools’ bit of spoofery.
I read “Demolition Company Says a Google Maps Error Led Them to Tear Down the Wrong House.” Pesky humans. Google’s autonomous automobiles do not have accidents. When those accidents occur, a human is at fault. Bus drivers, grrrr.
The write up suggests that a Google Map caused a human to instruct demolition workers to level a house at 7601 Cousteau Drive. The human allegedly pointed a finger at Google Maps, a geospatial system which has some big fans in various governmental outfits.
Here’s what the story asserts:
Google Maps has declined to make a statement, but it did fix the map to pin the correct address.
Yep, what does a “real” journalist expect when asking about one of Google’s algorithmic services?
Life would be simpler if humans were not getting in the way of Google efficiency, solutions, and services.
Thought: Perhaps one should not use an Android phone and Google Maps to navigate to the edge of the Grand Canyon.
Stephen E Arnold, April 1, 2016
Netflix Algorithm Defaults To “White” Content, Sweeps Diversity Under the Rug
April 1, 2016
The Marie Claire article titled Blackflix: How Netflix’s Algorithm Exposes Technology’s Racial Bias delves into the racial ramifications of Netflix’s much-lauded content recommendation algorithm. Many users may have had strange realizations about themselves or their preferences after collisions with the system the article calls “uncannily spot-on.” To sum it up: Netflix is really good at showing us what we want to watch, but only based on what we have already watched. When it comes to race, sexuality, even feminism (how many movies have I watched in the category “Movies With a Strong Female Lead”?), Netflix stays the course by showing you only films similar in diversity to what you have already selected. The article states,
“Or perhaps I could see the underlying problem, not in what we’re being shown, but in what we’re not being shown. I could see the fact that it’s not until you express specific interest in “black” content that you see how much of it Netflix has to offer. I could see the fact that to the new viewer, whose preferences aren’t yet logged and tracked by Netflix’s algorithm, “black” movies and shows are, for the most part, hidden from view.”
This sort of “default” suggests quite a lot about what Netflix has decided to put forward as normal or inoffensive content. To be fair, the company does stress the importance of logging preferences from the initial sign-up, but there is something troubling about the idea that people can live in a bubble of straight, white (or black-and-white) content. Some of those people might really enjoy and appreciate a powerful and relevant film like Fruitvale Station. If it wants to stay current, Netflix needs to show more appreciation, or even awareness, of its technical bias.
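As a toy illustration of the history-driven pattern described above (not Netflix’s actual system, whose internals are not public), here is a sketch of a recommender that scores candidates purely by tag overlap with what a user has already watched, so categories absent from that history never surface; the catalog and tags are invented.

```python
# Toy tag-overlap recommender: items are scored only against the viewing history,
# so categories the user has never sampled stay hidden. Catalog and tags are invented.
catalog = {
    "Title A": {"drama", "strong-female-lead"},
    "Title B": {"drama", "black-cinema"},
    "Title C": {"documentary", "black-cinema"},
    "Title D": {"comedy"},
}

def recommend(history, n=2):
    watched_tags = set().union(*(catalog[title] for title in history))
    scored = [(len(catalog[title] & watched_tags), title)
              for title in catalog if title not in history]
    return [title for score, title in sorted(scored, reverse=True) if score > 0][:n]

print(recommend(["Title A"]))   # only drama-adjacent titles ever appear
```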
Chelsea Kerwin, April 1, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Stanford Offers Course Overviewing Roots of the Google Algorithm
March 23, 2016
The syllabus for Stanford’s computer science course CS 349: Data Mining, Search, and the World Wide Web, posted on Stanford.edu, provides an overview of some of the technologies and advances that led to Google search. The syllabus states,
“There has been a close collaboration between the Data Mining Group (MIDAS) and the Digital Libraries Group at Stanford in the area of Web research. It has culminated in the WebBase project whose aims are to maintain a local copy of the World Wide Web (or at least a substantial portion thereof) and to use it as a research tool for information retrieval, data mining, and other applications. This has led to the development of the PageRank algorithm, the Google search engine…”
The syllabus alone offers some extremely useful insights that could help students and laypeople understand the roots of Google search. Key inclusions are the Digital Equipment Corporation (DEC) and PageRank, the algorithm named for Larry Page that enabled Google to become Google. The algorithm ranks web pages based on how many other websites link to them. Jon Kleinberg also played a key role by recognizing that pages that link out to many others (like a search engine) should be seen as important in their own right. The larger context of the course is data mining and information retrieval.
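To ground the description, here is a minimal sketch of the PageRank idea in Python (a textbook power-iteration version, not Google’s production implementation); the toy link graph is invented.

```python
# Textbook PageRank sketch: rank mass flows along links, and a damping
# factor models a surfer who occasionally jumps to a random page.
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if not outlinks:                       # dangling page: spread its mass evenly
                for target in pages:
                    new_ranks[target] += damping * ranks[page] / n
            else:
                share = damping * ranks[page] / len(outlinks)
                for target in outlinks:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks

# Toy web of three pages: the heavily linked-to page ends up with the highest rank.
print(pagerank({"A": ["B"], "B": ["C"], "C": ["A", "B"]}))
```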
Chelsea Kerwin, March 23, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
A Dead Startup Tally Sheet
March 17, 2016
Startups are the buzzword for companies that are starting up in the tech industry, usually with an innovative idea that garners them several million dollars in investments. Some startups are successful, others plod along, and many simply fail. CB Insights makes an interesting (and valid) comparison between today’s tech startups and the dot-com bust companies that fizzled out quicker than a faulty firecracker.
While most startups appear to be run by competent teams, sometimes they fizzle out or are acquired by a larger company. Many of them will not make it as headlining companies. As a result, CB Insights created “The Downround Tracker: Which Companies Are Not Living Up To The Expectations?”
CB Insights named this tech boom the “unicorn era,” probably for the rare and mythical sightings of some of these companies. The Downround Tracker follows unicorn era startups that have folded or were purchased. Since 2015, fifty-six companies have made the Downround Tracker list, including LiveScribe, Fab.com, Yodle, Escrow.com, eMusic, Adesto Technologies, and others.
Browse through the list and some of the names will be familiar, while others will make you wonder what the companies did in the first place. Companies come and go in a fashion that appears quicker than in any earlier generation. At least it shows that human ingenuity is still working; cue Kansas’s “Dust in the Wind.”
Whitney Grace, March 17, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Bitcoin Textbook to Become Available from Princeton
March 16, 2016
Bitcoin is all over the media, but this form of currency may not be thoroughly understood by many, including researchers and scholars. A post on this topic, The Princeton Bitcoin textbook is now freely available, was recently published on Freedom to Tinker, a blog hosted by Princeton’s Center for Information Technology Policy. The article announces the first completed draft of a Princeton Bitcoin textbook. At 300 pages, the manuscript is geared to those who hope to gain a technical understanding of how Bitcoin works and is appropriate for those who have a basic understanding of computer science and programming. According to the write-up,
“Researchers and advanced students will find the book useful as well — starting around Chapter 5, most chapters have novel intellectual contributions. Princeton University Press is publishing the official, peer-reviewed, polished, and professionally done version of this book. It will be out this summer. If you’d like to be notified when it comes out, you should sign up here. Several courses have already used an earlier draft of the book in their classes, including Stanford’s CS 251. If you’re an instructor looking to use the book in your class, we welcome you to contact us, and we’d be happy to share additional teaching materials with you.”
As Bitcoin educational resources catch fire in academia, it is only a matter of time before other Bitcoin experts begin creating resources to help other audiences understand the currency of the Dark Web. Additionally, it will be interesting to see if research emerges regarding connections between Bitcoin, the Dark Web and the mainstream internet.
Megan Feil, March 16, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Google Now Has Dowsing Ability
March 16, 2016
People who claim to be psychic are fakes. There is no way to predict the future, instantly locate a lost person or item, or read someone’s aura, and no scientific study has proven such abilities exist. One of the abilities psychics purport to have is “dowsing,” the power to sense where water, precious stones or metals, and even people are hiding. Instead of relying on a suspended crystal or an angular stick, Google now claims it can identify any location based solely on images, says Technology Review in the article, “Google Unveils Neural Network With ‘Superhuman’ Ability To Determine The Location Of Almost Any Image.”
Using computer algorithms, not magic powers, Tobias Weyand and a team of tech-savvy people developed a way for a Google deep-learning machine to identify the location of pictures. Weyand and his team designed the tool, PlaNET, and accomplished this by dividing the world into a grid of about 26,000 squares (sans oceans and poles), with cells of varying sizes depending on how populous an area is.
“Next, the team created a database of geolocated images from the Web and used the location data to determine the grid square in which each image was taken. This data set is huge, consisting of 126 million images along with their accompanying Exif location data.
Weyand and co used 91 million of these images to teach a powerful neural network to work out the grid location using only the image itself. Their idea is to input an image into this neural net and get as the output a particular grid location or a set of likely candidates.”
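To make the grid-as-classification idea concrete, here is a simplified sketch that converts a latitude/longitude geotag into a grid-cell class ID; PlaNET’s real cells are adaptively sized rather than uniform, so this is only an approximation of the labeling step, with the grid resolution chosen arbitrarily.

```python
# Simplified labeling step: turn an image's geotag into a grid-cell class ID.
# PlaNET sizes its ~26,000 cells adaptively; a uniform grid is used here for clarity.
def cell_id(lat, lon, cell_degrees=2.0):
    row = int((lat + 90.0) // cell_degrees)
    col = int((lon + 180.0) // cell_degrees)
    cells_per_row = int(360.0 / cell_degrees)
    return row * cells_per_row + col

# Each training image's location becomes a class label; the neural network then
# predicts a probability distribution over cell IDs for a previously unseen photo.
print(cell_id(48.8584, 2.2945))   # the cell containing the Eiffel Tower
```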
With the remaining 34 million images in the data set, the team tested PlaNET to check its accuracy. PlaNET correctly places 3.6 percent of images at street level, 10.1 percent at city level, 28.4 percent at country level, and 48 percent at continent level. These results are very good compared to the limited geographic knowledge a human keeps in his or her head.
Weyand believes that PlaNET is able to determine locations because it has learned to recognize subtle patterns about areas that humans cannot distinguish, having in effect “been” to more places than any human. What is even more amazing is how little memory PlaNET uses: only 377 MB!
When will PlaNET become available as a GPS app?
Whitney Grace, March 16, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Natural Language Processing App Gains Increased Vector Precision
March 1, 2016
For us, concepts have meaning in relationship to other concepts, but it is easier for computers to define concepts in terms of usage statistics. The post Sense2vec with spaCy and Gensim from spaCy’s blog offers a well-written outline explaining how this kind of natural language processing works, highlighting their new sense2vec app. This application is an upgraded version of word2vec that works with more context-sensitive word vectors. The article describes how sense2vec achieves this extra precision,
“The idea behind sense2vec is super simple. If the problem is that duck as in waterfowl and duck as in crouch are different concepts, the straight-forward solution is to just have two entries, duckN and duckV. We’ve wanted to try this for some time. So when Trask et al (2015) published a nice set of experiments showing that the idea worked well, we were easy to convince.
We follow Trask et al in adding part-of-speech tags and named entity labels to the tokens. Additionally, we merge named entities and base noun phrases into single tokens, so that they receive a single vector.”
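A rough sketch of that idea follows, assuming spaCy (with the en_core_web_sm model) and gensim are installed; this is not the authors’ actual training pipeline, which ran over a far larger Reddit corpus, just a minimal illustration of appending part-of-speech tags to tokens before training word vectors.

```python
import spacy
from gensim.models import Word2Vec

# Tag each token with its part of speech so that "duck|NOUN" and "duck|VERB"
# become distinct vocabulary entries, which is the core sense2vec trick.
nlp = spacy.load("en_core_web_sm")

texts = ["The duck swam across the pond.", "I had to duck under the low beam."]
sentences = [[f"{token.text.lower()}|{token.pos_}" for token in nlp(text) if not token.is_punct]
             for text in texts]

# Train a small word2vec model over the sense-annotated tokens (toy corpus, default settings).
model = Word2Vec(sentences, min_count=1)
print(model.wv.most_similar("duck|NOUN", topn=3))
```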
Curious about the meta definition of natural language processing from SpaCy, we queried natural language processing using Sense2vec. Its neural network is based on every word on Reddit posted in 2015. While it is a feat for NLP to learn from a dataset on one platform, such as Reddit, what about processing that scours multiple data sources?
Megan Feil, March 1, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Squiz and Verint Team up to Save the Government from Itself
February 9, 2016
The article titled Verint and Squiz Announce Partnership to Further Enable Digital Transformation for Government on BusinessWire conveys the global ambitions of the two companies. The article positions Verint, an intel-centric company, and Squiz, an Australian content management company, as the last hope for the world’s governments (at the local, regional, and national levels). While things may not be quite so dire, the partnership is aimed at improving governmental organization, digital management, and customer engagement. The article explains,
“Today, national, regional and local governments across the world are implementing digital transformation strategies, reflecting the need to proactively help deliver citizen services and develop smarter cities. A key focus of such strategies is to help make government services accessible and provide support to their citizens and businesses when needed. This shift to digital is more responsive to citizen and community needs, typically reducing phone or contact center call volumes, and helps government organizations identify monetary savings.”
It will come as no surprise to learn that government bureaucracy is causing obstacles when it comes to updating IT processes. Together, Squiz and Verint hope to aid officials in implementing streamlined, modernized procedures and IT systems while focusing on customer-facing features and ensuring intuitive, user-friendly interfaces. Verint in particular emphasizes superior engagement practices through its Verint Engagement Management service.
Chelsea Kerwin, February 9, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph