Yandex Incorporates Semantic Search

March 15, 2017

Apparently ahead of a rumored IPO launch, Russian search firm Yandex is introducing “Spectrum,” a semantic search feature. We learn of the development from “Russian Search Engine Yandex Gets a Semantic Injection” at the Association of Internet Research Specialists’ Articles Share pages. Writer Wushe Zhiyang observes that, though Yandex claims Spectrum can read users’ minds, the tech appears to be a mix of semantic technology and machine learning. He specifies:

The system analyses users’ searches and identifies objects like personal names, films or cars. Each object is then classified into one or more categories, e.g. ‘film’, ‘car’, ‘medicine’. For each category there is a range of search intents. [For example] the ‘product’ category will have search intents such as buy something or read customer reviews. So we have a degree of natural language processing, taxonomy, all tied into ‘intent’, which sounds like a very good recipe for highly efficient advertising.

But what if a search query has many potential meanings? Yandex says that Spectrum is able to choose the category and the range of potential user intents for each query to match a user’s expectations as closely as possible. It does this by looking at historic search patterns. If the majority of users searching for ‘gone with the wind’ expect to find a film, the majority of search results will be about the film, not the book.
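The mechanics Yandex hints at can be imagined as a simple lookup over historical behavior. Below is a minimal, hypothetical sketch; the query log, categories, and ranking rule are invented for illustration and are not Yandex’s implementation:

```python
from collections import Counter

# Hypothetical sketch of the disambiguation idea described above: pick the
# dominant category for an ambiguous query from historical user behavior,
# then push results in that category to the top. Not Yandex's actual code;
# the query log and result data are invented.

# Toy "historical" log: for past searches of this query, the category of
# the result each user ultimately clicked.
HISTORY = {
    "gone with the wind": ["film", "film", "film", "book", "film", "book"],
}

def dominant_category(query):
    """Return the category most users wanted for this query, if known."""
    clicks = HISTORY.get(query.lower())
    if not clicks:
        return None
    category, _count = Counter(clicks).most_common(1)[0]
    return category

def rank(results, query):
    """Sort results so the historically preferred category comes first."""
    preferred = dominant_category(query)
    return sorted(results, key=lambda r: r["category"] != preferred)

results = [
    {"title": "Gone with the Wind (novel)", "category": "book"},
    {"title": "Gone with the Wind (1939 film)", "category": "film"},
]
print(rank(results, "Gone with the Wind")[0]["title"])  # the film ranks first
```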

‘As users’ interests and intents tend to change, the system performs query analysis several times a week,’ says Yandex. This amounts to Spectrum analysing about five billion search queries.

Yandex has been busy. The site recently partnered with VKontakte, Russia’s largest social network, and plans to surface public-facing parts of VKontakte user profiles, in real time, in Yandex searches. If the rumors of a plan to go public are true, will these added features help make Yandex’s IPO a success?

Cynthia Murrell, March 15, 2017

The Human Effort Behind AI Successes

March 14, 2017

An article at Recode, “Watson Claims to Predict Cancer, but Who Trained It To Think,” reminds us that even the most successful AI software was trained by humans, using data collected and input by humans. We have developed high hopes for AI, expecting it to help us cure disease, make our roads safer, and put criminals behind bars, among other worthy endeavors. However, we must not overlook the datasets upon which these systems are built, and the human labor used to create them. Writer (and CEO of DaaS firm Captricity) Kuang Chen points out:

The emergence of large and highly accurate datasets have allowed deep learning to ‘train’ algorithms to recognize patterns in digital representations of sounds, images and other data that have led to remarkable breakthroughs, ones that outperform previous approaches in almost every application area. For example, self-driving cars rely on massive amounts of data collected over several years from efforts like Google’s people-powered street canvassing, which provides the ability to ‘see’ roads (and was started to power services like Google Maps). The photos we upload and collectively tag as Facebook users have led to algorithms that can ‘see’ faces. And even Google’s 411 audio directory service from a decade ago was suspected to be an effort to crowdsource data to train a computer to ‘hear’ about businesses and their locations.

Watson’s promise to help detect cancer also depends on data: decades of doctor notes containing cancer patient outcomes. However, Watson cannot read handwriting. In order to access the data trapped in the historical doctor reports, researchers must have had to employ an army of people to painstakingly type and re-type (for accuracy) the data into computers in order to train Watson.

Chen notes that more and more workers in regulated industries, like healthcare, are mining for gold in their paper archives—manually inputting the valuable data hidden among the dusty pages. That is a lot of data entry. The article closes with a call for us all to remember this caveat: when considering each new and exciting potential application of AI, ask where the training data is coming from.

Cynthia Murrell, March 14, 2017

Yandex Finally Catches the Long-Tailed Queries

March 7, 2017

One of the happiest moments in a dog’s life comes when, after countless hours spinning in circles, it finally catches its tail. It wags for joy, even though it is chomping on its own happiness. When search engines were finally programmed to handle long-tail queries, that is, queries with many words, such as a question, people’s happiness was akin to a dog catching its tail. Google released RankBrain to handle long-winded and natural language queries, and now Yandex has released its own algorithm to handle questions, described in “Yandex Launches New Algorithm Named Palekh To Improve Search Results For Long-Tail Queries” from the AIRS Association.

Yandex is Russia’s most-used search engine, and in order to improve the user experience, it released Palekh to better process long-tail queries. Palekh, like RankBrain, brings the search engine closer to understanding natural language, or the common vernacular. Yandex chose the name Palekh because the Russian town of the same name has a firebird on its coat of arms. The firebird has a long tail, so the name fits perfectly.

Yandex handles more than 100 million queries per day that fall under the long-tail umbrella. When asked whether Palekh was based on RankBrain, Yandex responded only that the two algorithms are similar in their purposes. Yandex also uses machine learning and neural networks to build a smarter search engine:

Yandex’s Palekh algorithm has started to use neural networks as one of 1,500 factors of ranking. A Yandex spokesperson told us they have “managed to teach our neural networks to see the connections between a query and a document even if they don’t contain common words.” They did this by “converting the words from billions of search queries into numbers (with groups of 300 each) and putting them in 300-dimensional space — now every document has its own vector in that space,” they told us. “If the numbers of a query and numbers of a document are near each other in that space, then the result is relevant,” they added.
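The mechanics described in the quote, stripped of the 1,500 ranking factors, amount to embedding queries and documents as vectors and scoring relevance by their nearness. Here is a minimal sketch of that idea; the 300-dimensional “embeddings” are random stand-ins rather than trained vectors, and nothing here is Yandex’s actual code:

```python
import numpy as np

# Toy sketch of vector-based relevance in the spirit of the quote above:
# queries and documents become 300-dimensional vectors, and nearness in
# that space stands in for relevance. The "embeddings" here are random
# stand-ins, not trained vectors, and this is not Yandex's code.

DIM = 300
rng = np.random.default_rng(42)

# Pretend vocabulary of learned word vectors.
VOCAB = {word: rng.normal(size=DIM) for word in
         ["cat", "kitten", "feline", "car", "engine", "road"]}

def embed(text):
    """Average the vectors of the words we recognize in the text."""
    vectors = [VOCAB[w] for w in text.lower().split() if w in VOCAB]
    return np.mean(vectors, axis=0) if vectors else np.zeros(DIM)

def relevance(query, document):
    """Cosine similarity: nearby vectors score high even without shared words."""
    q, d = embed(query), embed(document)
    denom = np.linalg.norm(q) * np.linalg.norm(d)
    return float(q @ d / denom) if denom else 0.0

# With trained embeddings the first pair would normally score higher than
# the second; with random stand-ins the numbers only show the mechanics.
print(relevance("kitten", "feline cat"))
print(relevance("kitten", "car engine"))
```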

Yandex is one of Google’s biggest rivals and it does not come as a surprise that they are experimenting with algorithms that will expand machine learning and NLP.

Whitney Grace, March 7, 2017

When AI Spreads Propaganda

February 28, 2017

We thought Google was left-leaning, but an article at the Guardian, “How Google’s Search Algorithm Spreads False Information with a Rightwing Bias,” seems to contradict that assessment. The article cites recent research by the Observer, which found neo-Nazi and anti-Semitic views prominently featured in Google search results. The Guardian followed up with its own research and documented more examples of right-leaning misinformation, like climate-change denials, anti-LGBT tirades, and Sandy Hook conspiracy theories. Reporters Olivia Solon and Sam Levin tell us:

The Guardian’s latest findings further suggest that Google’s searches are contributing to the problem. In the past, when a journalist or academic exposes one of these algorithmic hiccups, humans at Google quietly make manual adjustments in a process that’s neither transparent nor accountable.

At the same time, politically motivated third parties including the ‘alt-right’, a far-right movement in the US, use a variety of techniques to trick the algorithm and push propaganda and misinformation higher up Google’s search rankings.

These insidious manipulations – both by Google and by third parties trying to game the system – impact how users of the search engine perceive the world, even influencing the way they vote. This has led some researchers to study Google’s role in the presidential election in the same way that they have scrutinized Facebook.

Robert Epstein from the American Institute for Behavioral Research and Technology has spent four years trying to reverse engineer Google’s search algorithms. He believes, based on systematic research, that Google has the power to rig elections through something he calls the search engine manipulation effect (SEME).

Epstein conducted five experiments in two countries to find that biased rankings in search results can shift the opinions of undecided voters. If Google tweaks its algorithm to show more positive search results for a candidate, the searcher may form a more positive opinion of that candidate.

This does add a whole new, insidious dimension to propaganda. Did Orwell foresee algorithms? Further complicating the matter is the element of filter bubbles, through which many consume only information from homogenous sources, allowing no room for contrary facts. The article delves into how propagandists are gaming the system and describes Google’s response, so interested readers may wish to navigate there for more information.

One particular point gives me chills: Epstein states that research shows the vast majority of readers are not aware that bias exists within search rankings; they have no idea they are being manipulated. Perhaps those of us with some understanding of search algorithms can spread that insight to the rest of the multitude. It seems such education is sorely needed.

Cynthia Murrell, February 28, 2017

Forecasting Methods: Detail without Informed Guidance

February 27, 2017

Let’s create a scenario. You are a person trying to figure out how to index a chunk of content. You are working with cancer information sucked down from PubMed or a similar source. You run an extraction process and push the text through an indexing system. You use a system like Leximancer and look at the results. Hmmm.

Next you take a corpus of blog posts dealing with medical information. You suck down the content and run it through your extractor, your indexing system, and your Leximancer set up. You look at the results. Hmmm.

How do you figure out what terms are going to be important for your next batch of mixed content?

You might navigate to “Selecting Forecasting Methods in Data Science.” The write up does a good job of outlining some of the numerical recipes taught in university courses and discussed in textbooks. For example, you can get an overview in this nifty graphic:

[Graphic in the original post: an overview of common forecasting methods]

And you can review outputs from the different methods identified like this:

[Graphic in the original post: sample outputs from the different forecasting methods]

Useful.
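For readers who want to try a couple of these textbook recipes before handing the work to a consultant, here is a minimal sketch assuming nothing more than a list of monthly values; the data and parameters are made up:

```python
# Minimal sketch of two textbook forecasting recipes applied to a toy
# monthly series. Illustrative only; the data and parameters are made up.

series = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118]

def moving_average_forecast(values, window=3):
    """Forecast the next point as the mean of the last `window` points."""
    return sum(values[-window:]) / window

def exponential_smoothing_forecast(values, alpha=0.4):
    """Simple exponential smoothing: recent points count more as alpha grows."""
    level = values[0]
    for v in values[1:]:
        level = alpha * v + (1 - alpha) * level
    return level

print(round(moving_average_forecast(series), 1))
print(round(exponential_smoothing_forecast(series), 1))
```

Both functions will cheerfully produce a number from any column of figures, which is exactly the trap the rest of this post describes.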

What’s missing? For the person floundering away, like an employee at one government agency where I worked years ago, you pick the trend line you want. Then you try to plug in the numbers and generate some useful data. If that is too tough, you hire your friendly GSA schedule consultant to do the work for you. Yep, that’s how I ended up looking at:

  • Manually selected data
  • Lousy controls
  • Outputs from different systems
  • Misindexed text
  • Entities which were not really entities
  • A confused government employee.

Here’s the takeaway. Just because software is available to output stuff in a log file and Excel makes it easy to wrangle most of the data into rows and columns, none of the information may be useful, valid, or even in the same ball game.

When one then applies different forecasting methods without understanding them, we have an example of how an individual can create a pretty exciting data analysis.

Descriptions of algorithms do not correlate with high value outputs. Data quality, sampling, understanding why curves are “different,” and other annoying details don’t fit into some professionals’ busy work lives.

Stephen E Arnold, February 27, 2017

Software Bias Is Being Addressed

February 27, 2017

Researchers are working to fix the problem of bias in software, we learn from the article, “He’s Brilliant, She’s Lovely: Teaching Computers to Be Less Sexist” at NPR’s blog, All Tech Considered. Writer Byrd Pinkerton begins by noting that this issue of software reflecting human biases is well-documented, citing this article from his colleague. He then informs us that Microsoft, for one, is doing something about it:

Adam Kalai thinks we should start with the bits of code that teach computers how to process language. He’s a researcher for Microsoft and his latest project — a joint effort with Boston University — has focused on something called a word embedding. ‘It’s kind of like a dictionary for a computer,’ he explains. Essentially, word embeddings are algorithms that translate the relationships between words into numbers so that a computer can work with them. You can grab a word embedding ‘dictionary’ that someone else has built and plug it into some bigger program that you are writing. …

Kalai and his colleagues have found a way to weed these biases out of word embedding algorithms. In a recent paper, they’ve shown that if you tell the algorithms to ignore certain relationships, they can extrapolate outwards.
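The quote stays high level, but the usual recipe for this kind of correction is to identify a “bias direction” in the embedding space (for example, the difference between the vectors for “he” and “she”) and remove that component from words that ought to be neutral. The sketch below illustrates only that projection step, with tiny invented vectors rather than a real embedding; it is not the researchers’ code:

```python
import numpy as np

# Sketch of the neutralize-along-a-direction idea for debiasing word
# embeddings. The three-dimensional vectors are invented for illustration;
# real embeddings have hundreds of dimensions and learned values.

E = {
    "he":        np.array([ 1.0, 0.2, 0.1]),
    "she":       np.array([-1.0, 0.2, 0.1]),
    "brilliant": np.array([ 0.6, 0.8, 0.3]),   # leans toward "he" in this toy space
}

def bias_direction(emb):
    """A crude bias axis: the normalized difference between 'he' and 'she'."""
    d = emb["he"] - emb["she"]
    return d / np.linalg.norm(d)

def neutralize(vec, direction):
    """Remove the component of `vec` that lies along the bias direction."""
    return vec - (vec @ direction) * direction

g = bias_direction(E)
fixed = neutralize(E["brilliant"], g)
print(np.round(fixed @ g, 6))  # ~0: no remaining component along the bias axis
```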

And voila, a careful developer can teach an algorithm to fix its own bias. If only the process were so straightforward for humans. See the article for more about the technique.

Ultimately, though, the problem lies less with the biased algorithms themselves and more with the humans who seek to use them in decision-making. Researcher Kalai points to the marketing of health-related products as a project for which a company might actually want to differentiate between males and females. Pinkerton concludes:

For Kalai, the problem is not that people sometimes use word embedding algorithms that differentiate between gender or race, or even algorithms that reflect human bias. The problem is that people are using the algorithms as a black box piece of code, plugging them in to larger programs without considering the biases they contain, and without making careful decisions about whether or not they should be there.

So, though discoveries about biased software are concerning, it is good to know the issue is being addressed. We shall see how fast the effort progresses.

Cynthia Murrell, February 27, 2017


Bing Improvements

February 17, 2017

Online marketers are usually concerned with the latest Google algorithm, but Microsoft’s Bing is also a viable SEO target. Business2Community shares recent upgrades to that Internet search engine in its write-up, “2016 New Bing Features.” The section on the mobile app seems to be the most relevant to those interested in Search developments. Writer Asaf Hartuv tells us:

For search, product and local results were improved significantly. Now when you search using the Bing app on an iPhone, you will get more local results with more information featured right on the page. You won’t have to click around to get what you want.

Similarly, when you search for a product you want to buy, you will get more options from more stores, such as eBay and Best Buy. You won’t have to go to as many websites to do the comparison shopping that is so important to making your purchase decision.

While these updates were made to the app, the image and video search results were also improved. You get far more options in a more user-friendly layout when you search for these visuals.

The Bing app also includes practical updates that go beyond search. For example, you can choose to follow a movie and get notified when it becomes available for streaming. Or you can find local bus routes or schedules based on the information you select on a map.

Hartuv also discusses upgrades to Bing Ads (a bargain compared to Google Ads, apparently), and the fact that Bing is now powering AOL’s search results (after being dropped by Yahoo). He also notes that, while not a new feature, Bing Trends is always presenting newly assembled, specialized content to enhance users’ understanding of current events. Hartuv concludes by prompting SEO pros to remember the value of Bing.

Cynthia Murrell, February 17, 2017

Enterprise Heads in the Sand on Data Loss Prevention

February 16, 2017

Enterprises could be doing so much more to protect themselves from cyber attacks, asserts Auriga Technical Manager James Parry in his piece, “The Dark Side: Mining the Dark Web for Cyber Intelligence” at Information Security Buzz. Parry informs us that most businesses fail to do even the bare minimum they should to protect against hackers. This minimum, as he sees it, includes monitoring social media and underground chat forums for chatter about their company. After all, hackers are not known for their modesty, and many do boast about their exploits in the relative open. Most companies just aren’t bothering to look that direction. Such an effort can also reveal those impersonating a business by co-opting its slogans and trademarks.

Companies who wish to go beyond the bare minimum will need to expand their monitoring to the dark web (and expand their data-processing capacity). From “shady” social media to black markets to hacker libraries, the dark web can reveal much about compromised data to those who know how to look. Parry writes:

Yet extrapolating this information into a meaningful form that can be used for threat intelligence is no mean feat. The complexity of accessing the dark web combined with the sheer amount of data involved, correlation of events, and interpretation of patterns is an enormous undertaking, particularly when you then consider that time is the determining factor here. Processing needs to be done fast and in real-time. Algorithms also need to be used which are able to identify and flag threats and vulnerabilities. Therefore, automated event collection and interrogation is required and for that you need the services of a Security Operations Centre (SOC).

The next generation SOC is able to perform this type of processing and detect patterns, from disparate data sources, real-time, historical data etc. These events can then be threat assessed and interpreted by security analysts to determine the level of risk posed to the enterprise. Forewarned, the enterprise can then align resources to reduce the impact of the attack. For instance, in the event of an emerging DoS attack, protection mechanisms can be switched from monitoring to mitigation mode and network capacity adjusted to weather the attack.
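Parry does not spell out an implementation, but the “automated event collection and interrogation” he describes can be pictured as a stream of events run through scoring rules before an analyst sees them. A minimal, hypothetical sketch follows; the event format, flag terms, and weights are invented for illustration:

```python
# Hypothetical sketch of automated event flagging of the kind a SOC might
# run before handing events to human analysts. The event format, keywords,
# and scoring are invented for illustration.

FLAG_TERMS = {
    "credential dump": 8,
    "ddos": 6,
    "acme corp": 5,          # the monitored brand name (made up)
    "exploit for sale": 7,
}

def score_event(text):
    """Add up weights for every flag term that appears in the event text."""
    lowered = text.lower()
    return sum(weight for term, weight in FLAG_TERMS.items() if term in lowered)

def triage(events, threshold=8):
    """Return only the events risky enough to show a security analyst."""
    return [e for e in events if score_event(e) >= threshold]

chatter = [
    "selling exploit for sale targeting acme corp portal",
    "anyone up for pizza later?",
    "planning ddos against acme corp friday",
]
for event in triage(chatter):
    print("FLAGGED:", event)
```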

Note that Parry’s company, Auriga, supplies a variety of software and R&D services, including a Security Operations Center platform, so he might be a tad biased. Still, he has some good points. The article notes SOC insights can also be used to predict future attacks and to prioritize security spending. Typically, SOC users have been big businesses, but, Parry advocates, scalable and entry-level packages are making such tools available to smaller companies.

From monitoring mainstream social media to setting up an SOC to comb through dark web data, tools exist to combat hackers. The question, Parry observes, is whether companies will face the growing need to embrace those methods.

Cynthia Murrell, February 16, 2017

Google Battling Pirates More and More Each Year

February 10, 2017

So far, this has been a booming year for DMCA takedown requests, we learn from TorrentFreak’s article, “Google Wipes Record Breaking Half Billion Pirate Links in 2016.” The number of wiped links has been growing rapidly over the last several years, but is that good or bad news for copyright holders? That depends on whom you ask. Writer Ernesto reveals the results of TorrentFreak’s most recent analysis:

Data analyzed by TorrentFreak reveals that Google recently received its 500 millionth takedown request of 2016. The counter currently [in mid-July] displays more than 523,000,000, which is yet another record. For comparison, last year it took almost the entire year to reach the same milestone. If the numbers continue to go up at the same rate throughout the year, Google will process a billion allegedly infringing links during the whole of 2016, a staggering number.

According to Google roughly 98% of the reported URLs are indeed removed. This means that half a billion links were stripped from search results this year alone. However, according to copyright holders, this is still not enough. Entertainment industry groups such as the RIAA, BPI and MPAA have pointed out repeatedly that many files simply reappear under new URLs.
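The “billion links” projection is plain linear extrapolation from the mid-year count. A rough check, where the day count for mid-July is an assumption:

```python
# Back-of-the-envelope version of the extrapolation in the quote above:
# roughly 523 million takedown requests by mid-July 2016, projected over
# the full year. The day count is an assumption for illustration.
received = 523_000_000
days_elapsed = 197                 # approximately mid-July in a leap year
daily_rate = received / days_elapsed
full_year = daily_rate * 366       # 2016 had 366 days
print(f"about {full_year / 1e9:.2f} billion links")   # about 0.97 billion
```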

Indeed, copyright holders continue to call for Google to take stronger measures. For its part, the company insists the increased link removals are evidence that its process is working quite well. It issued an update of its report, “How Google Fights Piracy.” The two sides remain deeply divided, and will likely be at odds for some time. Ernesto tells us some copyright holders are calling for the government to step in. That could be interesting.

Cynthia Murrell, February 10, 2017

Probability Algorithms: Boiling Prediction Down

February 6, 2017

I read “The Algorithms Behind Probabilistic Programming.” Making a somewhat less than familiar topic accessible is a good idea. If you want to get a sense of predictive analytics, why not read a blog post about Bayesian methods with a touch of Markov? The write up pitches a more in-depth report about predictive analytics and makes it clear that prediction is pegged to methods which continue to confound some “real” consultants. I like the mentions of Monte Carlo methods and the aforementioned sporty Markov. I did not see a reference to Laplace. Will you be well on your way to understanding predictive analytics after working through the article from Fast Forward Labs? No, but you will have some useful names to Google. When I read explanations of these methods, I like to reflect on Autonomy’s ground breaking products from the 1990s.
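For readers who want more than names to Google, the Monte Carlo plus Markov combination behind much probabilistic programming (Markov chain Monte Carlo) fits in a few lines. The sketch below is a bare-bones Metropolis sampler estimating a coin’s bias from toy flips; it illustrates the general technique, not the approach of any particular system or of the cited article:

```python
import math
import random

# Bare-bones Metropolis sampler: estimate a coin's heads-probability from
# observed flips. Illustrative of the Markov chain Monte Carlo idea only;
# the data are invented.

flips = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]   # toy data: 7 heads, 3 tails

def log_posterior(p):
    """Log likelihood of the flips under bias p, with a flat prior on (0, 1)."""
    if not 0 < p < 1:
        return float("-inf")
    heads = sum(flips)
    tails = len(flips) - heads
    return heads * math.log(p) + tails * math.log(1 - p)

random.seed(0)
samples, p = [], 0.5
for _ in range(20_000):
    proposal = p + random.gauss(0, 0.1)              # random-walk proposal
    if math.log(random.random()) < log_posterior(proposal) - log_posterior(p):
        p = proposal                                 # accept the move
    samples.append(p)

burned_in = samples[5_000:]                          # discard early samples
print(sum(burned_in) / len(burned_in))               # posterior mean, roughly 0.67
```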

Stephen E Arnold, February 6, 2017
