Voice Search and Big Data: Defining Technologies for 2017

April 20, 2017

I read “Voice Search and Data: The Two Trends That Will Shape Online Marketing in 2017.” If the story is accurate, figuring out what people say and making sense of data (lots of data) will create new opportunities for innovators.

The article states:

Advancements in voice search and artificial intelligence (AI) will drive rich answers that will help marketers understand the customer intent behind I-want-to-go, I-want-to-know, I-want-to-buy and I-want-to-do micro-moments. Google has developed algorithms to cater directly to the search intent of the customers behind these queries, enabling customers to find the right answers quickly.

My view is that the article is correct in its assessment.

Where the article and I differ boils down to search engine optimization. The idea that voice search and Big Data will make fooling the relevance algorithms of Bing, Google, and Yandex a windfall for search engine optimization experts is partially true. Marketing whiz kids will do and say many things to deliver results that do not answer my query or meet my expectation of a “correct” answer.

My view is that the proliferation of systems which purport to understand human utterances in text,and voice-to-text conversions will discover that the the error rates of 60 to 75 percent are not good enough. Errors can be buried in long lists of results. They can be sidestepped if a voice enabled system works from a set of rules confined to a narrow topic domain.

Open the door to natural language parsing, and the error rates which once were okay become a liability. In my opinion, this will set off a scramble among companies struggling to get their smart software to provide information that customers accept and use repeatedly. Fail and customer turnover can be a fatal knife wound to the heart of an organization. The cost of replacing a paying customer is high. Companies need to keep the customers they have with technology that helps keep paying customers smiling.

What companies are able to provide higher accuracy linguistic functions? There are dozens of companies which assert that their systems can extract entities, figure out semantic relationships, and manipulate content in a handful of languages.

The problem with most of these systems is that certain, very widely used methods collapse when high accuracy is required for large volumes of text. The short cut is to use numerical tricks, and some of those tricks create disconnects between the information the user requests or queries and the results the system displays. Examples range from the difficulties of tuning the Autonomy Digital Reasoning Engine to figuring out how in the heck Google Home arrived at a particular song when the user wanted something else entirely.

Our suggestion is that instead of emailing IBM to sign a deal for that companies language technology, you might have a more productive result if you contact Bitext. This is a company which has been on my mind. I interviewed the founder and CEO (an IBM alum as I learned) and met with some of the remarkable Bitext team.

I am unable to disclose Bitext’s clients. I can suggest that if you fancy a certain German sports car or use one of the world’s most popular online services, you will be bumping into Bitext’s Digital Linguistic Analysis platform. For more information, navigate to Bitext.com.

The data I reviewed suggested that Bitext’s linguistic platform delivers accuracy significantly better than some of the other systems’ outputs I have reviewed. How accurate? Good enough to get an A in my high school math class.

Stephen E Arnold, April 20, 2017

Image Search: Biased by Language. The Fix? Use Humans!

April 19, 2017

Houston, we (male, female, uncertain) have a problem. Bias is baked into some image analysis and just about every other type of smart software.

The culprit?

Numerical recipes.

The first step in solving a problem is to acknowledge that a problem exists. The second step is more difficult.

I read “The Reason Why Most of the Images That Show Up When You Search for Doctor Are White Men.” The headline identifies the problem. However, what does one do about biases rooted in human utterance.

My initial thought was to eliminate human utterances. No fancy dancing required. Just let algorithms do what algorithms do. I realized that although this approach has a certain logical completeness, implementation may meet with a bit of resistance.

What does the write up have to say about the problem? (Remember. The fix is going to be tricky.)

I learned:

Research from Princeton University suggests that these biases, like associating men with doctors and women with nurses, come from the language taught to the algorithm. As some data scientists say, “garbage in, garbage out”: Without good data, the algorithm isn’t going to make good decisions.

Okay, right coast thinking. I feel more comfortable.

What does the write up present as wizard Aylin Caliskan’s view of the problem? A post doctoral researcher seems to be a solid choice for a source. I assume the wizard is a human, so perhaps he, she, it is biased? Hmmm.

I highlighted in true blue several passages from the write up / interview with he, she, it. Let’s look at three statements, shall we?

Regarding genderless languages like Turkish:

when you directly translate, and “nurse” is “she,” that’s not accurate. It should be “he or she or it” is a nurse. We see that it’s making a biased decision—it’s a very simple example of machine translation, but given that these models are incorporated on the web or any application that makes use of textual data, it’s the foundation of most of these applications. If you search for “doctor” and look at the images, you’ll see that most of them are male. You won’t see an equal male and female distribution.

If accurate, this observation means that the “fix” is going to be difficult. Moving from a language without gender identification to a language with gender identification requires changing the target language. Easy for software. Tougher for a human. If the language and its associations are anchored in the brain of a target language speaker, change may be, how shall I say it, a trifle difficult. My fix looks pretty good at this point.

And what about images and videos? I learned:

Yes, anything that text touches. Images and videos are labeled to they can be used on the web. The labels are in text, and it has been shown that those labels have been biased.

And the fix is a human doing the content selection, indexing, and dictionary tweaking. Not so fast. The cost of indexing with humans is very expensive. Don’t believe me. Download 10,000 Wikipedia articles and hire some folks to index them from the controlled term list humans set up. Let me know if you can hit $17 per indexed article. My hunch is that you will exceed this target by several orders of magnitude. (Want to know where the number comes from? Contact me and we discuss a for fee deal for this high value information.)

How does the write up solve the problem? Here’s the capper:

…you cannot directly remove the bias from the dataset or model because it’s giving a very accurate representation of the world, and that’s why we need a specialist to deal with this at the application level.

Notice that my solution is to eliminate humans entirely. Why? The pipe dream of humans doing indexing won’t fly due to [a] time, [b] cost, [c] the massive flows of data to index. Forget the mother of all bombs.

Think about the mother of all indexing backlogs. The gap would make the Modern Language Association’s “gaps” look like weekend catch up party. Is this a job for the operating system for machine intelligence?

Stephen E Arnold, April 17, 2017

Smart Software, Dumb Biases

April 17, 2017

Math is objective, right? Not really. Developers of artificial intelligence systems, what I call smart software, rely on what they learned in math school. If you have flipped through math books ranging from the Googler’s tome on artificial intelligence Artificial Intelligence: A Modern Approach to the musings of the ACM’s journals, you see the same methods recycled. Sure, the algorithms are given a bath and their whiskers are cropped. But underneath that show dog’s sleek appearance, is a familiar pooch. K-means. We have k-means. Decision trees? Yep, decision trees.

What happens when developers feed content into Rube Goldberg machines constructed of mathematical procedures known and loved by math wonks the world over?

The answer appears in “Semantics Derived Automatically from Language Corpora Contain Human Like Biases.” The headline says it clearly, “Smart software becomes as wild and crazy as a group of Kentucky politicos arguing in a bar on Friday night at 2:15 am.”

Biases are expressed and made manifest.

The article in Science reports with considerable surprise it seems to me:

word embeddings encode not only stereotyped biases but also other knowledge, such as the visceral pleasantness of flowers or the gender distribution of occupations.

Ah, ha. Smart software learns biases. Perhaps “smart” correlates with bias?

The canny whiz kids who did the research crawfish a bit:

We stress that we replicated every association documented via the IAT that we tested. The number, variety, and substantive importance of our results raise the possibility that all implicit human biases are reflected in the statistical properties of language. Further research is needed to test this hypothesis and to compare language with other modalities, especially the visual, to see if they have similarly strong explanatory power.

Yep, nothing like further research to prove that when humans build smart software, “magic” happens. The algorithms manifest biases.

What the write up did not address is a method for developing less biases smart software. Is such a method beyond the ken of computer scientists?

To get more information about this question, I asked on the world leader in the field of computational linguistics, Dr. Antonio Valderrabanos, the founder and chief executive officer at Bitext. Dr. Valderrabanos told me:

We use syntactic relations among words instead of using n-grams and similar statistical artifacts, which don’t understand word relations. Bitext’s Deep Linguistics Analysis platform can provide phrases or meaningful relationships to uncover more textured relationships. Our analysis will provide better content to artificial intelligence systems using corpuses of text to learn.

Bitext’s approach is explained in the exclusive interview which appeared in Search Wizards Speak on April 11, 2017. You can read the full text of the interview at this link and review the public information about the breakthrough DLA platform at www.bitext.com.

It seems to me that Bitext has made linguistics the operating system for artificial intelligence.

Stephen E Arnold, April 17, 2017

Bitext: Exclusive Interview with Antonio Valderrabanos

April 11, 2017

On a recent trip to Madrid, Spain, I was able to arrange an interview with Dr. Antonio Valderrabanos, the founder and CEO of Bitext. The company has its primary research and development group in Las Rosas, the high-technology complex a short distance from central Madrid. The company has an office in San Francisco and a number of computational linguists and computer scientists in other locations. Dr. Valderrabanos worked at IBM in an adjacent field before moving to Novell and then making the jump to his own start up. The hard work required to invent a fundamentally new way to make sense of human utterance is now beginning to pay off.

Antonio Valderrabanos of Bitext

Dr. Antonio Valderrabanos, founder and CEO of Bitext. Bitext’s business is growing rapidly. The company’s breakthroughs in deep linguistic analysis solves many difficult problems in text analysis.

Founded in 2008, the firm specializes in deep linguistic analysis. The systems and methods invented and refined by Bitext improve the accuracy of a wide range of content processing and text analytics systems. What’s remarkable about the Bitext breakthroughs is that the company support more than 40 different languages, and its platform can support additional languages with sharp reductions in the time, cost, and effort required by old-school systems. With the proliferation of intelligent software, Bitext, in my opinion, puts the digital brains in overdrive. Bitext’s platform improves the accuracy of many smart software applications, ranging from customer support to business intelligence.

In our wide ranging discussion, Dr. Valderrabanos made a number of insightful comments. Let me highlight three and urge you to read the full text of the interview at this link. (Note: this interview is part of the Search Wizards Speak series.)

Linguistics as an Operating System

One of Dr. Valderrabanos’ most startling observations addresses the future of operating systems for increasingly intelligence software and applications. He said:

Linguistic applications will form a new type of operating system. If we are correct in our thought that language understanding creates a new type of platform, it follows that innovators will build more new things on this foundation. That means that there is no endpoint, just more opportunities to realize new products and services.

Better Understanding Has Arrived

Some of the smart software I have tested is unable to understand what seems to be very basic instructions. The problem, in my opinion, is context. Most smart software struggles to figure out the knowledge cloud which embraces certain data. Dr. Valderrabanos observed:

Search is one thing. Understanding what human utterances mean is another. Bitext’s proprietary technology delivers understanding. Bitext has created an easy to scale and multilingual Deep Linguistic Analysis or DLA platform. Our technology reduces costs and increases user satisfaction in voice applications or customer service applications. I see it as a major breakthrough in the state of the art.

If he is right, the Bitext DLA platform may be one of the next big things in technology. The reason? As smart software becomes more widely adopted, the need to make sense of data and text in different languages becomes increasingly important. Bitext may be the digital differential that makes the smart applications run the way users expect them to.

Snap In Bitext DLA

Advanced technology like Bitext’s often comes with a hidden cost. The advanced system works well in a demonstration or a controlled environment. When that system has to be integrated into “as is” systems from other vendors or from a custom development project, difficulties can pile up. Dr. Valderrabanos asserted:

Bitext DLA provides parsing data for text enrichment for a wide range of languages, for informal and formal text and for different verticals to improve the accuracy of deep learning engines and reduce training times and data needs. Bitext works in this way with many other organizations’ systems.

When I asked him about integration, he said:

No problems. We snap in.

I am interested in Bitext’s technical methods. In the last year, he has signed deals with companies like Audi, Renault, a large mobile handset manufacturer, and an online information retrieval company.

When I thanked him for his time, he was quite polite. But he did say, “I have to get back to my desk. We have received several requests for proposals.”

Las Rosas looked quite a bit like Silicon Valley when I left the Bitext headquarters. Despite the thousands of miles separating Madrid from the US, interest in Bitext’s deep linguistic analysis is surging. Silicon Valley has its charms, and now it has a Bitext US office for what may be the fastest growing computational linguistics and text analysis system in the world. Worth watching this company I think.

For more about Bitext, navigate to the firm’s Web site at www.bitext.com.

Stephen E Arnold, April 11, 2017

Upgraded Social Media Monitoring

February 20, 2017

Analytics are catching up to content. In a recent ZDNet article, Digimind partners with Ditto to add image recognition to social media monitoring, we are reminded images reign supreme on social media. Between Pinterest, Snapchat and Instagram, messages are often conveyed through images as opposed to text. Capitalizing on this, and intelligence software company Digimind has announced a partnership with Ditto Labs to introduce image-recognition technology into their social media monitoring software called Digimind Social. We learned,

The Ditto integration lets brands identify the use of their logos across Twitter no matter the item or context. The detected images are then collected and processed on Digimind Social in the same way textual references, articles, or social media postings are analysed. Logos that are small, obscured, upside down, or in cluttered image montages are recognised. Object and scene recognition means that brands can position their products exactly where there customers are using them. Sentiment is measured by the amount of people in the image and counts how many of them are smiling. It even identifies objects such as bags, cars, car logos, or shoes.

It was only a matter of time before these types of features emerged in social media monitoring. For years now, images have been shown to increase engagement even on platforms that began focused more on text. Will we see more watermarked logos on images? More creative ways to visually identify brands? Both are likely and we will be watching to see what transpires.

Megan Feil, February 20, 2017


Alleged Google Loophole Lets Fake News Flow

January 1, 2017

I read a write up which, like 99 percent of the information available for free via the Internet, is 100 percent accurate.

The write up’s title tells the tale: “Google Does a Better Job with Fake News Than Facebook, but There’s a Big Loophole It Hasn’t Fixed.” What’s the loophole? The write up reports:

…the “newsy” modules that sit at the top of many Google searches (the “In the news” section on desktop, and the “Top stories” section on mobile) don’t pull content straight from Google News. They pull from all sorts of content available across the web, and can include sites not approved by Google News. This is particularly confusing for users on the desktop version of Google’s site, where the “In the news” section lives.Not only does the “In the news” section literally have the word “news” in its name, but the link at the bottom of the module, which says “More news for…,” takes you to the separate Google News page, which is comprised only of articles that Google’s editorial system has approved.

So why isn’t the “In the news” section just the top three Google News results?

The short answer is because Google sees Google Search and Google News as separate products.

The word “news” obviously does not mean news. We reported last week about Google’s effort to define “monopoly” for the European Commission investigating allegations of Google’s being frisky with its search results. News simply needs to be understood in the Google contextual lexicon.

The write up helps me out with this statement:

So why isn’t the “In the news” section just the top three Google News results? The short answer is because Google sees Google Search and Google News as separate products.

Logical? From Google’s point of view absolutely crystal clear.

The write up amplifies the matter:

Google does, however, seem to want to wipe fake news from its platform. “From our perspective, there should just be no situation where fake news gets distributed, so we are all for doing better here,” Google CEO Sundar Pichai said recently. After the issue of fake news entered the spotlight after the election, Google announced it would ban fake-news sites from its ad network, choking off their revenue. But even if Google’s goal is to kick fake-news sites out of its search engine, most Google users probably understand that Google search results don’t have carry the editorial stamp of approval from Google.

Fake news, therefore, is mostly under control. The Google users just have to bone up on how Google works to make information available.

What about mobile?

Google AMP is not news; AMP content labeled as “news” is part of the AMP technical standard which speeds up mobile page display.

Google, like Facebook, may tweak its approach to news.

Beyond Search would like to point out that wild and crazy news releases from big time PR dissemination outfits can propagate a range of information (some mostly accurate and some pretty crazy). The handling of high value sources allows some questionable content to flow. Oh, there are other ways to inject questionable content into the Web indexing systems.

There is not one loophole. There are others. Who wants to nibble into revenue? Not Beyond Search.

Stephen E Arnold, January 1, 2017

Smarter Content for Contentier Intelligence

December 28, 2016

I spotted a tweet about making smart content smarter. It seems that if content is smarter, then intelligence becomes contentier. I loved my logic class in 1962.

Here’s the diagram from this tweet. Hey, if the link is wonky, just attend the conference and imbibe the intelligence directly, gentle reader.


The diagram carries the identifier Data Ninja, which echoes Palantir’s use of the word ninja for some of its Hobbits. Data Ninja’s diagram has three parts. I want to focus on the middle part:


What I found interesting is that instead of a single block labeled “content processing,” the content processing function is broken into several parts. These are:

A Data Ninja API

A Data Ninja “knowledgebase,” which I think is an iPhrase-type or TeraText type of method. Not familiar with iPhrase and TeraText, feel free to browse the descriptions at the links.

A third component in the top box is the statement “analyze unstructured text.” This may refer to indexing and such goodies as entity extraction.

The second box performs “text analysis.” Obviously this process is different from “the analyze unstructured text” step; otherwise, why run the same analyses again? The second box performs what may be clustering of content into specific domains. This is important because a “terminal” in transportation may be different from a “terminal” in a cloud hosting facility. Disambiguation is important because the terminal may be part of a diversified transportation company’s computing infrastructure. I assume Data Ninja’s methods handles this parsing of “concepts” without many errors.

Once the selection of a domain area has been performed, the system appears to perform four specific types of operations as the Data Ninja practice their katas. These are the smart components:

  • Smart sentiment; that is, is the content object weighted “positive” or “negative”, “happy” or “sad”, or green light or red light, etc.
  • Smart data; that is, I am not sure what this means
  • Smart content; that is, maybe a misclassification because the end result should be smart content, but the diagram shows smart content as a subcomponent within the collection of procedures/assertions in the middle part of the diagram
  • Smart learning; that is, the Data Ninja system is infused with artificial intelligence, smart software, or machine learning (perhaps the three buzzwords are combined in practice, not just in diagram labeling?)
  • The end result is an iPhrase-type representation of data. (Note: that this approach infuses TeraText, MarkLogic, and other systems which transform unstructured data to metadata tagged structured information).

The diagram then shows a range of services “plugging” into the box performing the functions referenced in my description of the middle box.

If the system works as depicted, Data Ninjas may have the solution to the federation challenge which many organizations face. Smarter content should deliver contentier intelligence or something along that line.

Stephen E Arnold, November 28, 2016

Search Email: Not Yours. A Competitor’s.

December 2, 2016

I read “This Startup Helps You Deep Snoop Competitor Email Marketing.” I like that “deep snoop” thing. That works pretty well until one loses access to content to analyze. Just ask Geofeedia which is scrambling since it lost access to Twitter and other social media content.

The outfit Rival Explorer offers:

a tool designed to help users improve their email marketing strategy and product pricing and promotion through comprehensive monitoring of their competitor’s email newsletters. After creating a free account, users can browse through a database of marketing emails from over 50,000 brands. Rival Explorer offers access to a number of different email types, including newsletters, cart abandonment emails, welcome emails, and other transactional messages.

In terms of information access, the Rival Explorer customers:

can search by brand, subject, message body, date, day of week, industry, category, and custom tags and keywords. When users select a message, they’re able to view the sender email, subject line, and timestamp of the messages. In addition to those details, users can view the emails as they appear on tablets and smartphones, plus they also can toggle images to get a better idea of design and copy strategy.

You can get more information at this link. Public content and marketing information can be useful it seems.

Stephen E Arnold, December 2, 2016

Pitching All Source Analysis: Just Do Dark Data. Really?

November 25, 2016

I read “Shedding Light on Dark Data: How to Get Started.” Okay, Dark Data. Like Big Data, the phrase is the fruit of the nomads at Garner Group. The person embracing this sort of old concept is an outfit OdinText. Spoiler: I thought the write up was going to identify outfits like BAE Systems, Centrifuge Systems, IBM Analyst’s Notebook, Palantir Technologies, and Recorded Future (an In-Q-Tel and Google backed outfit). Was I wrong? Yes.

The write up explains that a company has to tackle a range of information in order to be aware, informed, or insightful. Pick one. Here’s the list of Dark Data types, which the aforementioned companies have been working to capture, analyze, and make sense of for almost 20 years in the case of NetReveal (Detica) and Analyst’s Notebook. The other companies are comparative spring chickens with an average of seven years’ experience in this effort.

  • Customer relationship management data
  • Data warehouse information
  • Enterprise resource planning information
  • Log files
  • Machine data
  • Mainframe data
  • Semi structured information
  • Social media content
  • Unstructured data
  • Web content.

I think the company or non profit which tries to suck in these data types and process them may run into some cost and legal issues. Analyzing tweets and Facebook posts can be useful, but there are costs and license fees required. Frankly not even law enforcement and intelligence entities are able to do a Cracker Jack job with these content streams due to their volume, cryptic nature, and pesky quirks related to metadata tagging. But let’s move on. To this statement:

Phone transcripts, chat logs and email are often dark data that text analytics can help illuminate. Would it be helpful to understand how personnel deal with incoming customer questions? Which of your products are discussed with which of your other products or competitors’ products more often? What problems or opportunities are mentioned in conjunction with them? Are there any patterns over time?

Yep, that will work really well in many legal environments. Phone transcripts are particularly exciting.

How does one think about Dark Data? Easy. Here’s a visualization from the OdinText folks:


Notice that there are data types in this diagram NOT included in the listing above. I can’t figure out if this is just carelessness or an insight which escapes me.

How does one deal with Dark Data? OdinText, of course. Yep, of course. Easy.

Stephen E Arnold, November 25, 2016

The House Cleaning of Halevy Dataspace: A Web Curiosity

November 14, 2016

I am preparing three seven minute videos. That effort will be one video each week starting on 20 December 2016. The subject is my Google Trilogy, published by an antique outfit which has drowned in River Avon. The first video is about the 2004 monograph, The Google Legacy. I coined the term “Googzilla” in that 230 page discussion of how Google became baby Google. The second video summarizes several of the take aways from Google: The Calculating Predator, published in 2007. The key to the monograph is the bound phrase “calculating predator.” Yep, not the happy little search out most know and love. The third video hits the main points of Google: The Digital Gutenberg, published in 2009. The idea is that Google spits out more digital content than almost anyone. Few think of the GOOG as the content generator the company has become. Yep, a map is a digital artifact.

Now to the curiosity. I wanted to reference the work of Dr. Alon Halevy, a former University of Washington professor and founder of Nimble and Transformic. I had a stack of links I used when I was doing the research for my predator book. Just out of curiosity I started following the links. I do have PDF versions of most of the open source Halevy-centric content I located.

But guess what?

Dr. Alon Halevy has disappeared. I could not locate the open source version of his talk about dataspaces. I could not locate the Wayback Machine’s archived version of the Transformic.com Web site. The links returned these weird 404 errors. My assumption was that Wayback’s Web pages resided happily on the outfit’s servers. I was incorrect. Here’s what I saw:


I explored the bound phrase “Alon Halvey” with various other terms only to learn that the bulk of the information has disappeared. No PowerPoints, no much substantive information. There were a few “information objects” which have not yet disappeared; for example:

  • An ACM blog post which references “the structured data team” and Nimble and Transformic
  • A Google research paper which will not make those who buy into David Gelerter’s The Tides of the Mind thesis
  • A YouTube video of a lecture given at Technion.

I found the gap between my research gathered in 2005 to 2007 interesting. I asked myself, “How did I end up with so many dead links about a technology I have described as one of the most important in database, data management, data analysis, and information retrieval?

Here are the answers I formulated:

  1. The Web is a lousy source of information. Stuff just disappears like the Darpa listing of open source Dark Web software, blogs, and Web sites
  2. I did really terrible research and even worse librarian type behavior. Yep, mea culpa.
  3. Some filtering procedures became a bit too aggressive and the information has been swept from assorted indexes
  4. The Wayback Machine ran off the rails and pointed to an actual 2005 Web site which its system failed to copy when the original spidering was completed.
  5. Gremlins. Hey, they really do exist. Just ask Grace Hopper. Yikes, she’s not available.

I wanted to mention this apparent or erroneous scrubbing. The story in this week HonkinNews video points out that 89 percent of journalists do their research via Google. Now if information is not in Google, what does that imply for a “real” journalist trying to do an objective, comprehensive story? I leave it up to you, gentle reader, to penetrate this curiosity.

Watch for the Google Trilogy seven minute videos on December 20, 2016, December 27, 2016, and

Stephen E Arnold, November 14, 2016, and January 3, 2017. Free. No pay wall. No Patreon.com pleading. No registration form. Just honkin’ news seven days a week and some video shot on an old Bell+Howell camera in a log cabin in rural Kentucky.

« Previous PageNext Page »

  • Archives

  • Recent Posts

  • Meta