Big Data and Predictive Math: Some Doubters

January 19, 2018

I love Big Data. I love fancy math. I spotted two articles this morning which offer a contrarian view about two popular buzzwords: Big Data and Predictive Analytics.

The first write up is from the capitalist’s tool, Forbes Magazine. I cannot tell what’s an ad or what’s a “real” journalistic endeavor. But in today’s world? Maybe the distinction is like arguing with St. Thomas Aquinas about the cause of evil.

Forbes’ story is “Big Data Is Overrated Compared To Human Ingenuity.” The main point is that humans with intelligence are more ingenious than software. No software, as far as I can tell, was consulted when formulating the thesis. The main point for me was:

an algorithm may be able to cover sports, you cannot clone or generate whimsy or humor or the essence of what makes writing enjoyable to read. We are not (at least not yet) at a point where computers are able to have full conversations, let alone exude the creativity to come up with ideas. The creative geniuses of the future may, in fact, be aided by big data, but they will simply use it (as one would use Google to search the giant database known as the internet) to ask the right questions to solve the world’s problems.

My thought is, “What about robot wars?” Does that TV show presage the NFL of the future?

The second write up is from a British online publication. The article’s title is “Software That Predicts Whether Crims Will Break the Law Again Is No Better Than You or Me.”

The main idea strikes me as:

…if you took someone with no legal, psychological or criminal justice system training – perhaps you, dear reader – and showed them a few bits of information about a given defendant, they’d be able to guess as well as this software as to whether the criminal would break the law again.

Interesting point; however, software might be able to chop through a backlog of cases, thus reducing costs. Sure, a few good apples will be tossed into the for-profit prisons, but that’s just a statistical error.

What I find amusing is the point made by a TV pundit in “How to Stop ‘Extremely Disruptive’ AI from Harming Society: Robert Shiller.” I don’t know about you but knowing unintended consequences before they occur might be difficult. Facebook has been around for years, and people are just now figuring out that the system can do more than help grandmother keep track of the grandchildren.

Exciting stuff. Predictive law enforcement is important. Big Data are getting bigger and being used to sell ads to people who don’t recognize the message as an ad. Regulating technology is like standing on the pier after the Queen Mary set sail and shouting, “Hey, come back.”

Stephen E Arnold, January 19, 2018

The Consequences of an Echo Chamber for Google Search

January 19, 2018

I read “Google Memory Loss.” The author is a fellow who created a text search engine, helped found OpenText, did some time at the GOOG, and swam in the Semantic Web pond.

The write up provides useful information to anyone wondering why a Google query for a company name goes off the rails or why the Google suggestions have zero relevance to the user’s query.

There were some important points in the write up; for example:

  1. Search is “crushingly expensive”. This means that when Google needs to cut costs and maximize revenue, the company will make business decisions. The decisions may favor advertising revenues. Maybe.
  2. Archival information is not popular. The reasoning may be, “Why index this stuff or revisit the archive to figure out if there is ‘new information’ in the old archive?” If old information is not important, what about unpopular sites like the National Railway Retirement Board’s Web content?
  3. Google is into the timely, not the research-centric type of query.
  4. Dr. Bray uses Google but supplements the look up by using very un-Googley search systems.

Here in Harrod’s Creek, we love the Google. Filtered, ad-tailored results are perfect for looking up KY Fry or the NCAA rules committee’s favorite team, the Louisville Cardinals.

A search for Cardinals returns this results page this morning:


Lots of Googlers love March Madness. Too bad if a 7th grader has to look up information about cardinals with feathers.

Stephen E Arnold, January 19, 2018

Transcribing Podcasts with Help from Amazon

January 19, 2018

I enjoy walking the dog and listening to podcasts. However, I read more quickly than I listen. Speed up is a feature which works well for those in their mid 20s. At age 74, not so much.

Few podcasts create transcripts. Kudos to Steve Gibson at Security Now. He pays for this work himself because other podcasts on the Twit network don’t offer much in the way of transcripts. And in the case of This Week in Law, there aren’t weekly programs. Recently, no programs. Helpful, no?

You can get the basics of the transcriptions produced by Amazon Transcribe in “Podcast Transcription with Amazon Transcribe.”

One has to be a programmer to use the service. Here’s the passage in the write up I highlighted:

The first thing that I would want out of this is speaker detection, i.e. knowing how many different speakers there are and to be able to differentiate their voices. Podcasts typically have more than one host, or a host and a guest for an interview, so that would be helpful. Also, it would be great to be able to send back corrections on words somehow, to help with the training. I’m sure Amazon has a pretty good thing going, but maybe on an account level? Or for proper nouns? I still think it would be good for people to provide that feedback.

Perhaps the podcast transcript void can be filled—at long last.

Stephen E Arnold, January 19, 2018

How SEO Has Shaped the Web

January 19, 2018

With the benefit of hindsight, big-name thinker Anil Dash has concluded that SEO has contributed to the ineffectiveness of Web search. He examines how we got here in his article, “Underscores, Optimization & Arms Races” at Medium. Starting with the year 2000, Dash traces the development of Internet content management systems (CMS’s), of which he was a part. (It is a good brief summary for anyone who wasn’t following along at the time.) WordPress is an example of a CMS.

As Google’s influence grew, online publishers became aware of an opportunity—they could game the search algorithm to move their site to the top of “relevant” results by playing around with keywords and other content details. The question of whether websites should bow to Google’s whims seemed to go unasked, as site after site fell into this pattern, later to be known as Search Engine Optimization. For Dash, the matter was symbolized by a question over hyphens or underbars to represent spaces in web addresses. Now, of course, one can use either without upsetting Google’s algorithm, but that was not the case at first. When Google’s Matt Cutts stated a preference for the hyphen in 2005, most publishers fell in line. Including Dash, eventually and very reluctantly; for him, the choice represented nothing less than the very nature of the Internet.

He writes:

You see, the theory of how we felt Google should work, and what the company had often claimed, was that it looked at the web and used signals like the links or the formatting of webpages to indicate the quality and relevance of content. Put simply, your search ranking with Google was supposed to be based on Google indexing the web as it is. But what if, due to the market pressure of the increasing value of ranking in Google’s search results, websites were incentivized to change their content to appeal to Google’s algorithm? Or, more accurately, to appeal to the values of the people who coded Google’s algorithm?

Eventually, even Dash and his CMS caved and switched to hyphens. What he did not notice at the time, he muses, was the unsettling development of the entire SEO community centered around appeasing these algorithms. He concludes:

By the time we realized that we’d gotten suckered into a never-ending two-front battle against both the algorithms of the major tech companies and the destructive movements that wanted to exploit them, it was too late. We’d already set the precedent that independent publishers and tech creators would just keep chasing whatever algorithm Google (and later Facebook and Twitter) fed to us. Now, the challenge is to reform these systems so that we can hold the big platforms accountable for the impacts of their algorithms. We’ve got to encourage today’s newer creative communities in media and tech and culture to not constrain what they’re doing to conform to the dictates of an opaque, unknowable algorithm.

Is that doable, or have we gone too far toward appeasing the Internet behemoths to turn back?

Cynthia Murrell, January 19, 2018

Are There Only 10,000 Machine Learning Experts? LinkedIn Offers a Different Number, 651,627

January 18, 2018

I read in the dead tree edition of the New York Times (still not a tabloid sized “real” journalism delivery vehicle) that there are 10,000 machine learning experts in the world. You can find a version of this story at this link.

Just to check the validity of this magical number, which reinforces the notion of elitism, the one percent of the one percent, and the complexity of the Dark Arts of smart software, I did some research.

I turned to LinkedIn, entered the phrase “machine learning” and this is what I learned from the Microsoft professional social media search system:


I realize that the low key colors and gray type are unreadable, but contact Microsoft LinkedIn, not me.

There are more than 38,000 jobs open for experts in machine learning.

What’s the talent pool?

The number is 651,627.

Now I understand that if one is making a list of top anything, the peak of the pyramid will be, by definition, one. For music, you may have disagreements. For machine learning, it’s different.

Since machine learning and other smart software jargon is pretty vague, mostly incorrect, and generally misunderstood, the New York Times’ story missed the mark by a mere 641,627 “experts.” Keep in mind anyone can say one is an expert in anything unless the government regulates via licenses like those issued to doctors, lawyers, and beauticians. Beauticians? Yep.

Ah, you say. LinkedIn is for marketers and headhunters.

Yes, I respond.

But the point is that in jargon charged disciplines, it is tough to convince me that there are 10,000 machine learning experts in the world. My hunch is that the cream of the crop will be a handful of people, assuming that one can define what it takes to be an expert; for instance:

  1. Math skills that go beyond the required course in computer science with an emphasis on artificial intelligence
  2. Math skills which nose into the territory of Kolmogorov and his cronies (yep, my uncle, the crony)
  3. Database skills tuned to deal with machine learning
  4. Linguistics capabilities to cope with multi lingual content
  5. Engineering skills tuned to the peculiar demands of a real time stream of intercepted data from an outfit like WebHose
  6. Subject matter experts with knowledge of such exciting topics as Bayesian “drift” and how to make necessary human interventions to get the statistical ship back on course
  7. Operations experts who can get something useful from a ML-infused application like creating a smart home appliance which does not burn the roast chicken which must be well done for an ageing boxer.

I could go on.

Right now, anyone can claim to be an expert in machine learning. The problem is that machine learning is not one thing. Google is bundling up a bunch of stuff and making it available to LinkedIn type experts.

What could possibly go wrong? Let’s hope the New York Times knows exactly which type of expert in the components of machine learning to consult in order to have a reasonable shot at reporting on the event that catches a “real” newsperson’s attention.

Stephen E Arnold, January 18, 2018

We Are Without a Paddle on Growing Data Lakes

January 18, 2018

The pooling of big data is commonly known as a “data lake.” While this technique was first met with excitement, it is beginning to look like a problem, as we learned in a recent InfoWorld story, “Use the Cloud to Create Open, Connected Data Lakes for AI, Not Data Swamps.”

According to the story:

A data scientist will quickly tell you that the data lake approach is a recipe for a data swamp, and there are a few reasons why. First, a good amount of data is often hastily stored, without a consistent strategy in place around how to organize, govern and maintain it. Think of your junk drawer at home: Various items get thrown in at random over time, until it’s often impossible to find something you’re looking for in the drawer, as it’s gotten buried.

This disorganization leads to the second problem: users are often not able to find the dataset once ingested into the data lake.

So, how does one take aggregate data from a stagnant swamp to a lake one can traverse? According to Scientific Computing, the secret lies in separating the search function into two pieces, finding and searching. When you combine this thinking with InfoWorld’s logic of using the cloud, suddenly these massive swamps are drained.

Patrick Roland, January 18, 2018



Some Think Google Is No Longer the King of Search

January 18, 2018

Google is much more than a search engine; it’s a verb. As with Xerox and Kleenex before it, that says something about the company’s place in its market. However, some are claiming it’s time for alternatives (in search, not in copy-making or nose blowing). That is the argument of a recent Eyerys story, “Searching Beyond Google: When The Internet is Too Big for a Single Search Engine.”

According to the story:

[T]he information you need might be hidden from the tools you use. Either because the webmasters wanted that to happen by blocking search engines’ access, or inaccessible by search engine because they are behind paywalls or login forms, or lies inside the deep web.

To access them, you need more specific tools other than search engines, and look at the right place, with the right privilege.

If and only if you still can’t find the information you’re looking for, it’s either not available on the internet, or doesn’t exist in the first place.

Or, they could be hidden inside database, encrypted, lies deeper and accessible to only using certain IPs, classified methods or privilege. In this case, it’s not publicly available though it is there. You need to be a hacker to get yourself into that, and that is certainly illegal by any means.

While the story has its heart in the right place, recommending alternative engines, like DuckDuckGo, and giving tips on using social media for search, it’s not entirely believable. For one, humans are creatures of habit, and they are stuck on the single search engine method. The advice is wishful thinking; it makes sense in places, but we can’t see it happening.

Patrick Roland, January 18, 2018

Qwant Goes to China

January 17, 2018

The roots of Qwant stretch back to Pertimm, an interesting search system which pre-dated today’s Qwant. Information in my files about Qwant reminded me that Qwant is a metasearch system which adds its own crawling of French sources. The key feature of Qwant is that it does not retain data about users’ queries. It is important to keep in mind that legal intercepts can capture Internet data and may be able to map user actions to particular Web sites or topics.

According to the article “Not Just a Horse: Macron Also Brings Privacy-Based Browser on Trip to China,” the French delegation’s visit to Chinese officials is, in part, designed to promote the use of Qwant.

I noted this statement in the article. One of the founders of Qwant allegedly stated:

Yes, we need a lot of data but we don’t need to know that it’s you or me. The whole idea of Qwant is to make AI and IoT without the data of the users. In our case, based on the fact that we are a privacy-based search engine, we don’t need people’s data. So maybe we’ll have some technology that we can use more easily in China than some of our competitors.

My perception is that China is quite interested in who searches what, particularly within the Middle Kingdom. Qwant will follow “local regulations.”

My recollection is that Google has not achieved in China the same level of dominance that it has in Europe, home of Qwant.

Since the demise of Quaero and Muscat, Yandex has become one of the European alternatives to Google. The Exalead Web search system is still online, but it does not attract much attention. I find it useful because Google results are thin when I search for older content. You can locate the Exalead search system at this link. Dassault Systèmes uses Exalead for its product component search, and I am surprised that the company does not push the Web search capability more aggressively.

If you have not tried Qwant, give it a try. Compare the results with the Exalead system and the Russian Yandex system.

In my tests, I find it necessary to use multiple search systems, including low profile ones. It is more difficult than ever to locate certain types of information in general purpose Web search systems. This applies to metasearch systems like Ixquick (since renamed), Unbubble, Izito, and other systems which try to offer researchers an alternative to Google.

Google works well for pizza. Looking for other types of information? Qwant and other low profile systems have to be used. The process of locating something as basic as the address of a company in Madrid can require quite vigorous hoop jumping.

But China? Interesting.

Stephen E Arnold, January 17, 2018

Amazon Cloud Injected with AI Steroids

January 17, 2018

Amazon, Google, and Microsoft are huge cloud computing rivals. Amazon wants to keep up with the competition, says Fortune, in the article, “Amazon Reportedly Beefing Up Cloud Capabilities In The Cloud.” Amazon is “beefing up” its cloud performance by injecting it with more machine learning and artificial intelligence. The world’s biggest retailer is doing this by teaming up with AI-based startups Domino Data Lab and DataRobot.

Individuals mostly use cloud computing for computer backups and for access to their files from anywhere. Businesses use it to run their applications and store data, but as cloud computing becomes more standard, they want to run machine learning tasks and big data analysis.

Amazon’s new effort is code-named Ironman and is aimed at completing tasks for companies focused on insurance, energy, fraud detection, and drug discovery, The Information reported. The services will be offered to run on graphic processing chips made by Nvidia as well as so-called field programmable gate array chips, which can be reprogrammed as needed for different kinds of software.

Nvidia and other high-performing chip manufacturers such as Advanced Micro Devices and Intel are ecstatic about the competition because it means more cloud operators will purchase their products. Amazon Web Services is one of Amazon’s fastest growing areas and continues to bring in profits.

Whitney Grace, January 17, 2018

Out with the Old, in with the New at Google

January 17, 2018

It may have started with its finance app, but Google is making some drastic changes you might want to keep an eye on. We discovered the tip of the iceberg with the Google Blog piece, “Stay on Top of Finance Information on Google.”

According to the story:

Now under a new search navigation tab called “Finance,” you’ll have easier access to finance information based on your interests, keeping you in the know about the latest market news and helping you get in-depth insights about companies. On this page, you can see performance information about stocks you’ve chosen to follow, recommendations on other stocks to follow based on your interests, related news, market indices, and currencies.

As part of this revamped experience, we’re retiring a few features of the original Google Finance, including the portfolio, the ability to download your portfolio, and historical tables. However, a list of the stocks from your portfolio will be accessible through Your Stocks in the search result, and you can get notifications when there are any notable changes on their performance.

Not a big shock, but a big part of Google trying to freshen things up. The company has been in hot water over a string of YouTube videos deemed objectionable. So, with moves like improving its algorithm to weed out fake news, changes to Google Home, and even Maps, Google is sending a message. The message is one of change and one we hope is for the better.

Patrick Roland, January 17, 2018
