Forecasting Methods: Detail without Informed Guidance
February 27, 2017
Let’s create a scenario. You are a person trying to figure out how to index a chunk of content. You are working with cancer information sucked down from PubMed or a similar source. You run an extraction process and push the text through an indexing system. You use a system like Leximancer and look at the results. Hmmm.
Next you take a corpus of blog posts dealing with medical information. You suck down the content and run it through your extractor, your indexing system, and your Leximancer set up. You look at the results. Hmmm.
How do you figure out what terms are going to be important for your next batch of mixed content?
You might navigate to “Selecting Forecasting Methods in Data Science.” The write up does a good job of outlining some of the numerical recipes taught in university courses and discussed in textbooks. For example, you can get an overview in this nifty graphic:
And you can review outputs from the different methods identified like this:
Useful.
What’s missing? For the person floundering away, like an employee at one government agency where I worked years ago, you pick the trend line you want. Then you try to plug in the numbers and generate some useful data. If that is too tough, you hire your friendly GSA schedule consultant to do the work for you. Yep, that’s how I ended up looking at:
- Manually selected data
- Lousy controls
- Outputs from different systems
- Misindexed text
- Entities which were not really entities
- A confused government employee.
Here’s the takeaway. Just because software is available to output stuff in a log file and Excel makes it easy to wrangle most of the data into rows and columns, none of the information may be useful, valid, or even in the same ball game.
When one then applies different forecasting methods without understanding them, we have an example of how an individual can create a pretty exciting data analysis.
Descriptions of algorithms do not correlate with high value outputs. Data quality, sampling, understanding why curves are “different”, and other annoying details don’t fit into some busy work lives.
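To see why curves are “different,” here is a minimal sketch in Python. The six data points are invented for illustration, and the helper names (fit_line, forecast_linear, forecast_exponential) are hypothetical. The sketch fits a straight line and an exponential curve to the same numbers by ordinary least squares, then pushes both out to a later period.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def forecast_linear(xs, ys, x_new):
    """Extend a straight-line trend to x_new."""
    intercept, slope = fit_line(xs, ys)
    return intercept + slope * x_new


def forecast_exponential(xs, ys, x_new):
    """Fit a line to log(y), then transform back. Requires every y > 0."""
    intercept, slope = fit_line(xs, [math.log(y) for y in ys])
    return math.exp(intercept + slope * x_new)


# Made-up series: six periods of some count (an assumption, not real data).
periods = [0, 1, 2, 3, 4, 5]
counts = [10, 12, 15, 19, 24, 30]

linear_10 = forecast_linear(periods, counts, 10)
exponential_10 = forecast_exponential(periods, counts, 10)

print(f"linear trend at period 10:      {linear_10:.1f}")
print(f"exponential trend at period 10: {exponential_10:.1f}")
```

Both curves track the six observed points reasonably well, yet the two forecasts for period 10 are far apart. Which one deserves trust depends on knowing the data, not on the software.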
Stephen E Arnold, February 27, 2017
Software Bias Is Being Addressed
February 27, 2017
Researchers are working to fix the problem of bias in software, we learn from the article, “He’s Brilliant, She’s Lovely: Teaching Computers to Be Less Sexist” at NPR’s blog, All Tech Considered. Writer Byrd Pinkerton begins by noting that this issue of software reflecting human biases is well-documented, citing this article from his colleague. He then informs us that Microsoft, for one, is doing something about it:
Adam Kalai thinks we should start with the bits of code that teach computers how to process language. He’s a researcher for Microsoft and his latest project — a joint effort with Boston University — has focused on something called a word embedding. ‘It’s kind of like a dictionary for a computer,’ he explains. Essentially, word embeddings are algorithms that translate the relationships between words into numbers so that a computer can work with them. You can grab a word embedding ‘dictionary’ that someone else has built and plug it into some bigger program that you are writing. …
Kalai and his colleagues have found a way to weed these biases out of word embedding algorithms. In a recent paper, they’ve shown that if you tell the algorithms to ignore certain relationships, they can extrapolate outwards.
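As a rough illustration of what “ignore certain relationships” could mean, the sketch below uses three-dimensional toy vectors invented for this example; real embeddings have hundreds of dimensions and are learned from text. It estimates a gender direction from the pair “he” and “she,” then removes that component from another word’s vector. This projection step is a simplified stand-in and is not necessarily the exact procedure in the researchers’ paper.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: 1 means same direction, 0 means unrelated."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def subtract(a, b):
    return [x - y for x, y in zip(a, b)]


def remove_component(vec, direction):
    """Project vec onto direction and subtract that projection."""
    scale = dot(vec, direction) / dot(direction, direction)
    return [v - scale * d for v, d in zip(vec, direction)]


# Toy "dictionary" (hypothetical values, chosen only to make the point).
toy_embedding = {
    "he": [1.0, 0.2, 0.1],
    "she": [-1.0, 0.2, 0.1],
    "brilliant": [0.4, 0.9, 0.3],
}

gender_direction = subtract(toy_embedding["he"], toy_embedding["she"])

before = cosine(toy_embedding["brilliant"], gender_direction)
neutralized = remove_component(toy_embedding["brilliant"], gender_direction)
after = cosine(neutralized, gender_direction)

print(f"lean toward 'he' before: {before:.3f}")
print(f"lean toward 'he' after:  {after:.3f}")
```

In the toy data, “brilliant” starts out leaning toward “he.” After the projection is removed, the word no longer leans either way along that one direction, while its other components are untouched.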
And voila, a careful developer can teach an algorithm to fix its own bias. If only the process were so straightforward for humans. See the article for more about the technique.
Ultimately, though, the problem lies less with the biased algorithms themselves and more with the humans who seek to use them in decision-making. Researcher Kalai points to the marketing of health-related products as a project for which a company might actually want to differentiate between males and females. Pinkerton concludes:
For Kalai, the problem is not that people sometimes use word embedding algorithms that differentiate between gender or race, or even algorithms that reflect human bias. The problem is that people are using the algorithms as a black box piece of code, plugging them in to larger programs without considering the biases they contain, and without making careful decisions about whether or not they should be there.
So, though discoveries about biased software are concerning, it is good to know the issue is being addressed. We shall see how fast the effort progresses.
Cynthia Murrell, February 27, 2017
Bing Improvements
February 17, 2017
Online marketers are usually concerned with the latest Google algorithm, but Microsoft’s Bing is also a viable SEO target. Business2Community shares recent upgrades to that Internet search engine in its write-up, “2016 New Bing Features.” The section on the mobile app seems to be the most relevant to those interested in Search developments. Writer Asaf Hartuv tells us:
For search, product and local results were improved significantly. Now when you search using the Bing app on an iPhone, you will get more local results with more information featured right on the page. You won’t have to click around to get what you want.
Similarly, when you search for a product you want to buy, you will get more options from more stores, such as eBay and Best Buy. You won’t have to go to as many websites to do the comparison shopping that is so important to making your purchase decision.
While these updates were made to the app, the image and video search results were also improved. You get far more options in a more user-friendly layout when you search for these visuals.
The Bing app also includes practical updates that go beyond search. For example, you can choose to follow a movie and get notified when it becomes available for streaming. Or you can find local bus routes or schedules based on the information you select on a map.
Hartuv also discusses upgrades to Bing Ads (a bargain compared to Google Ads, apparently), and the fact that Bing is now powering AOL’s search results (after being dropped by Yahoo). He also notes that, while not a new feature, Bing Trends is always presenting newly assembled, specialized content to enhance users’ understanding of current events. Hartuv concludes by prompting SEO pros to remember the value of Bing.
Cynthia Murrell, February 17, 2017
Enterprise Heads in the Sand on Data Loss Prevention
February 16, 2017
Enterprises could be doing so much more to protect themselves from cyber attacks, asserts Auriga Technical Manager James Parry in his piece, “The Dark Side: Mining the Dark Web for Cyber Intelligence” at Information Security Buzz. Parry informs us that most businesses fail to do even the bare minimum they should to protect against hackers. This minimum, as he sees it, includes monitoring social media and underground chat forums for chatter about their company. After all, hackers are not known for their modesty, and many do boast about their exploits in the relative open. Most companies just aren’t bothering to look that direction. Such an effort can also reveal those impersonating a business by co-opting its slogans and trademarks.
Companies who wish to go beyond the bare minimum will need to expand their monitoring to the dark web (and expand their data-processing capacity). From “shady” social media to black markets to hacker libraries, the dark web can reveal much about compromised data to those who know how to look. Parry writes:
Yet extrapolating this information into a meaningful form that can be used for threat intelligence is no mean feat. The complexity of accessing the dark web combined with the sheer amount of data involved, correlation of events, and interpretation of patterns is an enormous undertaking, particularly when you then consider that time is the determining factor here. Processing needs to be done fast and in real-time. Algorithms also need to be used which are able to identify and flag threats and vulnerabilities. Therefore, automated event collection and interrogation is required and for that you need the services of a Security Operations Centre (SOC).
The next generation SOC is able to perform this type of processing and detect patterns, from disparate data sources, real-time, historical data etc. These events can then be threat assessed and interpreted by security analysts to determine the level of risk posed to the enterprise. Forewarned, the enterprise can then align resources to reduce the impact of the attack. For instance, in the event of an emerging DoS attack, protection mechanisms can be switched from monitoring to mitigation mode and network capacity adjusted to weather the attack.
Note that Parry’s company, Auriga, supplies a variety of software and R&D services, including a Security Operations Center platform, so he might be a tad biased. Still, he has some good points. The article notes SOC insights can also be used to predict future attacks and to prioritize security spending. Typically, SOC users have been big businesses, but, Parry argues, scalable and entry-level packages are making such tools available to smaller companies.
From monitoring mainstream social media to setting up an SOC to comb through dark web data, tools exist to combat hackers. The question, Parry observes, is whether companies will face the growing need to embrace those methods.
Cynthia Murrell, February 16, 2017
Google Battling Pirates More and More Each Year
February 10, 2017
So far, this has been a booming year for DMCA takedown requests, we learn from TorrentFreak’s article, “Google Wipes Record Breaking Half Billion Pirate Links in 2016.” The number of wiped links has been growing rapidly over the last several years, but is that good or bad news for copyright holders? That depends on whom you ask. Writer Ernesto reveals the results of TorrentFreak’s most recent analysis:
Data analyzed by TorrentFreak reveals that Google recently received its 500 millionth takedown request of 2016. The counter currently [in mid-July] displays more than 523,000,000, which is yet another record. For comparison, last year it took almost the entire year to reach the same milestone. If the numbers continue to go up at the same rate throughout the year, Google will process a billion allegedly infringing links during the whole of 2016, a staggering number.
According to Google roughly 98% of the reported URLs are indeed removed. This means that half a billion links were stripped from search results this year alone. However, according to copyright holders, this is still not enough. Entertainment industry groups such as the RIAA, BPI and MPAA have pointed out repeatedly that many files simply reappear under new URLs.
Indeed, copyright holders continue to call for Google to take stronger measures. For its part, the company insists the increase in link removals is evidence that its process is working quite well. It issued an update of its report, “How Google Fights Piracy.” The two sides remain deeply divided and will likely be at odds for some time. Ernesto tells us some copyright holders are calling for the government to step in. That could be interesting.
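The arithmetic behind the quoted figures is easy to check. The sketch below assumes that “mid-July” means July 15 and that requests keep arriving at a constant daily rate; both are assumptions for illustration, not details from TorrentFreak.

```python
from datetime import date

reported_so_far = 523_000_000      # counter value quoted by TorrentFreak
removal_rate = 0.98                # share of reported URLs Google says it removes

snapshot = date(2016, 7, 15)       # assumption: "mid-July" taken as July 15
year_start = date(2016, 1, 1)
year_end = date(2016, 12, 31)

days_elapsed = (snapshot - year_start).days + 1
days_in_year = (year_end - year_start).days + 1   # 2016 is a leap year

# Straight-line extrapolation: same daily rate for the rest of the year.
projected_total = reported_so_far * days_in_year / days_elapsed
removed_so_far = reported_so_far * removal_rate

print(f"days elapsed: {days_elapsed} of {days_in_year}")
print(f"projected links for 2016: {projected_total:,.0f}")
print(f"links removed so far:     {removed_so_far:,.0f}")
```

Under those assumptions the projection lands a little under a billion, and 98% of the mid-July count is a bit more than half a billion, which squares with the quoted passage.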
Cynthia Murrell, February 10, 2017
Probability Algorithms: Boiling Prediction Down
February 6, 2017
I read “The Algorithms Behind Probabilistic Programming.” Making a somewhat less than familiar topic accessible is a good idea. If you want to get a sense for predictive analytics, why not read a blog post about Bayesian methods with a touch of Markov? The write up pitches a more in depth report about predictive analytics. The “The Algorithms Behind…” write up makes it clear that prediction is pegged on a method which continues to confound some “real” consultants. I like the mentions of Monte Carlo methods and the aforementioned sporty Markov. I did not see a reference to Laplace. Will you be well on your way to understanding predictive analytics after working through the article from Fast Forward Labs? No, but you will have some useful names to Google. When I read explanations of these methods, I like to reflect on Autonomy’s ground breaking products from the 1990s.
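For readers who want something more concrete than names to Google, here is a minimal sketch of a Metropolis sampler, a standard Markov chain Monte Carlo method for Bayesian estimation. The coin-flip data (7 heads, 3 tails), the flat prior, and the function names are assumptions made for illustration; none of it comes from the Fast Forward Labs write up.

```python
import math
import random


def log_posterior(p, heads, tails):
    """Log of (flat prior x binomial likelihood), up to a constant."""
    if not 0.0 < p < 1.0:
        return float("-inf")   # outside the prior's support
    return heads * math.log(p) + tails * math.log(1.0 - p)


def metropolis(heads, tails, n_samples=20000, step=0.1, seed=0):
    """Random-walk Metropolis: each draw depends only on the previous one."""
    rng = random.Random(seed)
    current = 0.5
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0.0, step)
        log_ratio = (log_posterior(proposal, heads, tails)
                     - log_posterior(current, heads, tails))
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            current = proposal
        samples.append(current)
    return samples


samples = metropolis(heads=7, tails=3)
kept = samples[2000:]                      # drop early draws as burn-in
posterior_mean = sum(kept) / len(kept)

# With a flat prior the exact posterior is Beta(8, 4), whose mean is 8/12.
print(f"sampled posterior mean: {posterior_mean:.3f}")
print(f"exact posterior mean:   {8 / 12:.3f}")
```

The Markov part is the random walk from one guess to the next. The Monte Carlo part is averaging the guesses. The Bayesian part is the prior times the likelihood. The exact answer is known for this toy case, so the sampler can be checked against it.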
Stephen E Arnold, February 6, 2017
Smart Software Recipe Fiesta
February 2, 2017
I read “140 Machine Learning Formulas.” The listing hits the top 10 most popular algorithms and adds an additional 130. The summary of the formulas is at this link. A happy quack to Rubens Zimbres who compiled the list. A profile of Mr. Zimbres is available at this link. FYI. He’s looking for a new challenge.
Stephen E Arnold, February 2, 2017
Now Watson Wants to Be a Judge
December 27, 2016
IBM has deployed Watson in many fields, including the culinary arts, sports, and medicine. The big data supercomputer can be used in any field or industry that creates a lot of data. Watson, in turn, will digest the data and, depending on the algorithms, spit out results. Now IBM wants Watson to take on the daunting task of judging, says The Drum in “Can Watson Pick A Cannes Lion Winner? IBM’s Cognitive System Tries Its Arm At Judging Awards.”
According to the article, judging is a cognitive process and requires special algorithms, not to mention the bias of certain judges. In other words, it should be right up Watson’s alley (perhaps the results will be less subjective as well). The Drum decided to put Watson to the ultimate creative test and fed Watson thousands of previous Cannes films. Then Watson predicted who would win the Cannes Film Festival in the Outdoor category this year.
This could change the way contests are judged:
The Drum’s magazine editor Thomas O’Neill added: “This is an experiment that could massively disrupt the awards industry. We have the potential here of AI being able to identify an award winning ad from a loser before you’ve even bothered splashing out on the entry fee. We’re looking forward to seeing whether it proves as accurate in reality as it did in training.”
I would really like to see this applied to the Academy Awards, which are often criticized for their lack of diversity and for an Academy consisting of older, white men. It would be great to see if Watson would yield different results than what the Academy actually selects.
Whitney Grace, December 27, 2016
An Apologia for People. Big Data Are Just Peachy Keen
December 25, 2016
I read “Don’t Blame Big Data for Pollsters’ Failings.” The news about the polls predicting a victory for Hillary Clinton reached me in Harrod’s Creek five days after the election. Hey, Beyond Search is in rural Kentucky. It looks from the news reports and the New York Times’s odd letter about doing “real” journalism that the pundits predicted that the mare would win the US derby.
The write up explains that Big Data did not fail. The reason? The pollsters were not using Big Data. The sample sizes were about 1,000 people. Check your statistics book. In the back will be sample sizes for populations. If you have an older statistics book, you have to use a formula.
Big Data doesn’t fool around with formulas. Big Data just uses “big data.” Is the idea that the bigger the data, the better the output?
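One common textbook choice is Cochran’s sample size formula with an optional finite population correction. It is used here as an assumption, since the post does not say which formula it had in mind. The 95 percent confidence level, the 3 percent margin of error, and the population figures below are likewise illustrative, not values from the write up.

```python
import math


def sample_size(margin_of_error, confidence_z=1.96, proportion=0.5,
                population=None):
    """Cochran's formula: n0 = z^2 * p * (1 - p) / e^2.

    If a population size is given, apply the finite population
    correction: n = n0 / (1 + (n0 - 1) / N).
    """
    n0 = (confidence_z ** 2) * proportion * (1.0 - proportion) \
        / margin_of_error ** 2
    if population is not None:
        n0 = n0 / (1.0 + (n0 - 1.0) / population)
    return math.ceil(n0)


# z = 1.96 corresponds to 95% confidence; p = 0.5 is the worst case.
print(sample_size(0.03))                          # very large population
print(sample_size(0.03, population=100_000_000))  # hypothetical electorate
print(sample_size(0.03, population=5_000))        # hypothetical small town
```

For a 3 percent margin of error the formula asks for roughly a thousand respondents whether the population is one hundred million or effectively unlimited. That is why polls of about 1,000 people are so common. The formula says nothing about whether those respondents are the right ones.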
The write up states that the problem was the sample itself: The actual humans.
The write up quotes a mid tier consultant from an outfit called Ovum which reminds me of eggs. I circled this statement:
“When you have data sets that are large enough, you can find signals for just about anything,” says Tony Baer, a big data analyst at Ovum. “So this places a premium on identifying the right data sets and asking the right questions, and relentlessly testing out your hypothesis with test cases extending to more or different data sets.”
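The point about finding signals for just about anything can be shown with pure noise. The sketch below is an illustration under stated assumptions: every series is random, the sizes (200 candidate series, 20 observations each) are arbitrary, and the names are hypothetical. It looks for the candidate that correlates most strongly with a random target.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)
n_observations = 20
n_candidates = 200

# The target and every candidate are independent noise by construction.
target = [rng.gauss(0.0, 1.0) for _ in range(n_observations)]
candidates = [
    [rng.gauss(0.0, 1.0) for _ in range(n_observations)]
    for _ in range(n_candidates)
]

strongest = max(abs(pearson(series, target)) for series in candidates)
print(f"strongest correlation found in pure noise: {strongest:.2f}")
```

None of the candidates has any real relationship to the target, yet the best of 200 will usually look like a respectable “signal.” Testing the hypothesis on more or different data, as the quote suggests, is what exposes it.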
The write up tosses in social media. Facebook takes the position that its information had minimal effect on the election. Nifty assertion that.
The solution is, as I understand the write up, to use a more real time system, different types of data, and math. The conclusion is:
With significant economic consequences attached to political outcomes, it is clear that those companies with sufficient depth of real-time behavioral data will likely increase in value.
My view is that hope and other distinctly human behaviors certainly threw an egg at reality. It is great to know that there is a fix and that Big Data emerge as the path forward. More work ahead for the consultants who often determine sample sizes by looking at Web sites like SurveySystem and get their sample from lists of contributors, a 20 something’s mobile phone contact list, or lists available from friends.
If you use Big Data, tap into real time streams of information, and do the social media mining—you will be able to predict the future. Sounds logical? Now about that next Kentucky Derby winner? Happy or unhappy holiday?
Stephen E Arnold, December 25, 2016
Big Data Needs to Go Public
December 16, 2016
Big Data touches every part of our lives, and we are unaware of it. Have you ever noticed when you listen to the news, read an article, or watch a YouTube video that people say things such as “experts claim,” “science says,” etc.? In the past, these statements relied on less than trustworthy sources, but now they can use Big Data to back up their claims. However, popular opinion and puff pieces still need to back up their big data with hard fact. Nature.com says that transparency is a big deal for Big Data and algorithm designers need to work on it in the article, “More Accountability For Big-Data Algorithms.”
One of the hopes is that big data will be used to bridge the divide between one bias and another, except that the opposite can happen. In other words, Big Data algorithms can be designed with a bias:
There are many sources of bias in algorithms. One is the hard-coding of rules and use of data sets that already reflect common societal spin. Put bias in and get bias out. Spurious or dubious correlations are another pitfall. A widely cited example is the way in which hiring algorithms can give a person with a longer commute time a negative score, because data suggest that long commutes correlate with high staff turnover.
Even worse is that people and organizations can design an algorithm to support science or facts they want to pass off as the truth. There is a growing demand for “algorithm accountability,” mostly in academia. The demands are that data sets fed into the algorithms are made public. There are also plans to make algorithms that monitor algorithms for bias.
Big Data is here to stay, but relying too much on algorithms can distort the facts. This is why the human element is still needed to distinguish between fact and fiction. Minority Report is closer to being our present than ever before.
Whitney Grace, December 16, 2016