Voice Search and Big Data: Defining Technologies for 2017

April 20, 2017

I read “Voice Search and Data: The Two Trends That Will Shape Online Marketing in 2017.” If the story is accurate, figuring out what people say and making sense of data (lots of data) will create new opportunities for innovators.

The article states:

Advancements in voice search and artificial intelligence (AI) will drive rich answers that will help marketers understand the customer intent behind I-want-to-go, I-want-to-know, I-want-to-buy and I-want-to-do micro-moments. Google has developed algorithms to cater directly to the search intent of the customers behind these queries, enabling customers to find the right answers quickly.

My view is that the article is correct in its assessment.

Where the article and I differ boils down to search engine optimization. The idea that voice search and Big Data will make fooling the relevance algorithms of Bing, Google, and Yandex a windfall for search engine optimization experts is partially true. Marketing whiz kids will do and say many things to deliver results that do not answer my query or meet my expectation of a “correct” answer.

My view is that the vendors of the proliferating systems which purport to understand human utterances in text and voice-to-text conversions will discover that error rates of 60 to 75 percent are not good enough. Errors can be buried in long lists of results. They can be sidestepped if a voice-enabled system works from a set of rules confined to a narrow topic domain.

Open the door to natural language parsing, and the error rates which once were okay become a liability. In my opinion, this will set off a scramble among companies struggling to get their smart software to provide information that customers accept and use repeatedly. Fail and customer turnover can be a fatal knife wound to the heart of an organization. The cost of replacing a paying customer is high. Companies need to keep the customers they have with technology that helps keep paying customers smiling.

What companies are able to provide higher accuracy linguistic functions? There are dozens of companies which assert that their systems can extract entities, figure out semantic relationships, and manipulate content in a handful of languages.

The problem with most of these systems is that certain, very widely used methods collapse when high accuracy is required for large volumes of text. The short cut is to use numerical tricks, and some of those tricks create disconnects between the information the user requests or queries and the results the system displays. Examples range from the difficulties of tuning the Autonomy Dynamic Reasoning Engine to figuring out how in the heck Google Home arrived at a particular song when the user wanted something else entirely.

Our suggestion is that instead of emailing IBM to sign a deal for that company’s language technology, you might have a more productive result if you contact Bitext. This is a company which has been on my mind. I interviewed the founder and CEO (an IBM alum as I learned) and met with some of the remarkable Bitext team.

I am unable to disclose Bitext’s clients. I can suggest that if you fancy a certain German sports car or use one of the world’s most popular online services, you will be bumping into Bitext’s Deep Linguistic Analysis platform. For more information, navigate to Bitext.com.

The data I reviewed suggested that Bitext’s linguistic platform delivers accuracy significantly better than some of the other systems’ outputs I have reviewed. How accurate? Good enough to get an A in my high school math class.

Stephen E Arnold, April 20, 2017

Image Search: Biased by Language. The Fix? Use Humans!

April 19, 2017

Houston, we (male, female, uncertain) have a problem. Bias is baked into some image analysis and just about every other type of smart software.

The culprit?

Numerical recipes.

The first step in solving a problem is to acknowledge that a problem exists. The second step is more difficult.

I read “The Reason Why Most of the Images That Show Up When You Search for Doctor Are White Men.” The headline identifies the problem. However, what does one do about biases rooted in human utterance?

My initial thought was to eliminate human utterances. No fancy dancing required. Just let algorithms do what algorithms do. I realized that although this approach has a certain logical completeness, implementation may meet with a bit of resistance.

What does the write up have to say about the problem? (Remember. The fix is going to be tricky.)

I learned:

Research from Princeton University suggests that these biases, like associating men with doctors and women with nurses, come from the language taught to the algorithm. As some data scientists say, “garbage in, garbage out”: Without good data, the algorithm isn’t going to make good decisions.

Okay, right coast thinking. I feel more comfortable.

What does the write up present as wizard Aylin Caliskan’s view of the problem? A post doctoral researcher seems to be a solid choice for a source. I assume the wizard is a human, so perhaps he, she, it is biased? Hmmm.

I highlighted in true blue several passages from the write up / interview with he, she, it. Let’s look at three statements, shall we?

Regarding genderless languages like Turkish:

when you directly translate, and “nurse” is “she,” that’s not accurate. It should be “he or she or it” is a nurse. We see that it’s making a biased decision—it’s a very simple example of machine translation, but given that these models are incorporated on the web or any application that makes use of textual data, it’s the foundation of most of these applications. If you search for “doctor” and look at the images, you’ll see that most of them are male. You won’t see an equal male and female distribution.

If accurate, this observation means that the “fix” is going to be difficult. Moving from a language without gender identification to a language with gender identification requires changing the target language. Easy for software. Tougher for a human. If the language and its associations are anchored in the brain of a target language speaker, change may be, how shall I say it, a trifle difficult. My fix looks pretty good at this point.

And what about images and videos? I learned:

Yes, anything that text touches. Images and videos are labeled so they can be used on the web. The labels are in text, and it has been shown that those labels have been biased.

And the fix is a human doing the content selection, indexing, and dictionary tweaking. Not so fast. Indexing with humans is very expensive. Don’t believe me? Download 10,000 Wikipedia articles and hire some folks to index them from the controlled term list humans set up. Let me know if you can hit $17 per indexed article. My hunch is that you will exceed this target by several orders of magnitude. (Want to know where the number comes from? Contact me and we can discuss a for fee deal for this high value information.)

How does the write up solve the problem? Here’s the capper:

…you cannot directly remove the bias from the dataset or model because it’s giving a very accurate representation of the world, and that’s why we need a specialist to deal with this at the application level.

Notice that my solution is to eliminate humans entirely. Why? The pipe dream of humans doing indexing won’t fly due to [a] time, [b] cost, [c] the massive flows of data to index. Forget the mother of all bombs.

Think about the mother of all indexing backlogs. The gap would make the Modern Language Association’s “gaps” look like a weekend catch up party. Is this a job for the operating system for machine intelligence?

Stephen E Arnold, April 17, 2017

Smart Software, Dumb Biases

April 17, 2017

Math is objective, right? Not really. Developers of artificial intelligence systems, what I call smart software, rely on what they learned in math school. If you have flipped through math books ranging from the Googler’s tome on artificial intelligence, Artificial Intelligence: A Modern Approach, to the musings of the ACM’s journals, you see the same methods recycled. Sure, the algorithms are given a bath and their whiskers are cropped. But underneath that show dog’s sleek appearance is a familiar pooch. K-means. We have k-means. Decision trees? Yep, decision trees.

What happens when developers feed content into Rube Goldberg machines constructed of mathematical procedures known and loved by math wonks the world over?

The answer appears in “Semantics Derived Automatically from Language Corpora Contain Human Like Biases.” The headline says it clearly, “Smart software becomes as wild and crazy as a group of Kentucky politicos arguing in a bar on Friday night at 2:15 am.”

Biases are expressed and made manifest.

The article in Science reports with considerable surprise it seems to me:

word embeddings encode not only stereotyped biases but also other knowledge, such as the visceral pleasantness of flowers or the gender distribution of occupations.

Ah, ha. Smart software learns biases. Perhaps “smart” correlates with bias?

The canny whiz kids who did the research crawfish a bit:

We stress that we replicated every association documented via the IAT that we tested. The number, variety, and substantive importance of our results raise the possibility that all implicit human biases are reflected in the statistical properties of language. Further research is needed to test this hypothesis and to compare language with other modalities, especially the visual, to see if they have similarly strong explanatory power.

Yep, nothing like further research to prove that when humans build smart software, “magic” happens. The algorithms manifest biases.
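A minimal sketch of how an association of this kind can be measured, assuming tiny made-up three-dimensional vectors in place of real word embeddings. The vectors, the word lists, and the function names are illustrative assumptions, not data or code from the Science article. The idea is to compare a word's average cosine similarity to one attribute set against its average similarity to another.

```python
import math

# Toy 3-d "embeddings". The values are invented for illustration only.
VECTORS = {
    "doctor": (0.9, 0.3, 0.1),
    "nurse": (0.2, 0.9, 0.1),
    "he": (1.0, 0.1, 0.0),
    "him": (0.9, 0.2, 0.1),
    "she": (0.1, 1.0, 0.0),
    "her": (0.2, 0.9, 0.1),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def association(word, set_a, set_b):
    """Mean similarity of word to set_a minus mean similarity to set_b."""
    vec = VECTORS[word]
    mean_a = sum(cosine(vec, VECTORS[w]) for w in set_a) / len(set_a)
    mean_b = sum(cosine(vec, VECTORS[w]) for w in set_b) / len(set_b)
    return mean_a - mean_b


MALE = ["he", "him"]
FEMALE = ["she", "her"]

for target in ("doctor", "nurse"):
    # A positive score leans toward MALE, a negative score toward FEMALE.
    print(target, round(association(target, MALE, FEMALE), 3))
```

With real embeddings trained on a large corpus, the same arithmetic is what surfaces the occupation and gender associations the researchers describe. The toy vectors here are rigged to show the effect.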

What the write up did not address is a method for developing less biased smart software. Is such a method beyond the ken of computer scientists?

To get more information about this question, I asked one of the world leaders in the field of computational linguistics, Dr. Antonio Valderrabanos, the founder and chief executive officer at Bitext. Dr. Valderrabanos told me:

We use syntactic relations among words instead of using n-grams and similar statistical artifacts, which don’t understand word relations. Bitext’s Deep Linguistics Analysis platform can provide phrases or meaningful relationships to uncover more textured relationships. Our analysis will provide better content to artificial intelligence systems using corpuses of text to learn.
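The contrast between n-grams and syntactic relations can be sketched in a few lines. This is not Bitext's method or API. The dependency triples below are hand-annotated assumptions standing in for the output of a real parser, and the helper names are hypothetical.

```python
def bigrams(tokens):
    """Return adjacent word pairs, the simplest kind of n-gram."""
    return list(zip(tokens, tokens[1:]))


sentence = "the phone that my sister bought is broken"
tokens = sentence.split()

# Hand-annotated (head, relation, dependent) triples, assumed for
# illustration. A real system would obtain these from a parser.
dependencies = [
    ("broken", "subject", "phone"),
    ("bought", "subject", "sister"),
    ("bought", "object", "phone"),
]


def related(word_a, word_b, pairs):
    """True if the two words appear together in any pair, in either order."""
    return any({word_a, word_b} == {x, y} for x, y in pairs)


ngram_pairs = bigrams(tokens)
syntax_pairs = [(head, dep) for head, _, dep in dependencies]

# "phone" and "broken" are five words apart, so no bigram links them,
# but the subject relation links them directly.
print(related("phone", "broken", ngram_pairs))
print(related("phone", "broken", syntax_pairs))
```

Widening the n-gram window would eventually catch the pair, but it would also pull in unrelated neighbors such as "sister" and "broken".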

Bitext’s approach is explained in the exclusive interview which appeared in Search Wizards Speak on April 11, 2017. You can read the full text of the interview at this link and review the public information about the breakthrough DLA platform at www.bitext.com.

It seems to me that Bitext has made linguistics the operating system for artificial intelligence.

Stephen E Arnold, April 17, 2017

Bitext: Exclusive Interview with Antonio Valderrabanos

April 11, 2017

On a recent trip to Madrid, Spain, I was able to arrange an interview with Dr. Antonio Valderrabanos, the founder and CEO of Bitext. The company has its primary research and development group in Las Rosas, the high-technology complex a short distance from central Madrid. The company has an office in San Francisco and a number of computational linguists and computer scientists in other locations. Dr. Valderrabanos worked at IBM in an adjacent field before moving to Novell and then making the jump to his own start up. The hard work required to invent a fundamentally new way to make sense of human utterance is now beginning to pay off.

Antonio Valderrabanos of Bitext

Dr. Antonio Valderrabanos, founder and CEO of Bitext. Bitext’s business is growing rapidly. The company’s breakthroughs in deep linguistic analysis solve many difficult problems in text analysis.

Founded in 2008, the firm specializes in deep linguistic analysis. The systems and methods invented and refined by Bitext improve the accuracy of a wide range of content processing and text analytics systems. What’s remarkable about the Bitext breakthroughs is that the company supports more than 40 different languages, and its platform can support additional languages with sharp reductions in the time, cost, and effort required by old-school systems. With the proliferation of intelligent software, Bitext, in my opinion, puts the digital brains in overdrive. Bitext’s platform improves the accuracy of many smart software applications, ranging from customer support to business intelligence.

In our wide ranging discussion, Dr. Valderrabanos made a number of insightful comments. Let me highlight three and urge you to read the full text of the interview at this link. (Note: this interview is part of the Search Wizards Speak series.)

Linguistics as an Operating System

One of Dr. Valderrabanos’ most startling observations addresses the future of operating systems for increasingly intelligent software and applications. He said:

Linguistic applications will form a new type of operating system. If we are correct in our thought that language understanding creates a new type of platform, it follows that innovators will build more new things on this foundation. That means that there is no endpoint, just more opportunities to realize new products and services.

Better Understanding Has Arrived

Some of the smart software I have tested is unable to understand what seem to be very basic instructions. The problem, in my opinion, is context. Most smart software struggles to figure out the knowledge cloud which embraces certain data. Dr. Valderrabanos observed:

Search is one thing. Understanding what human utterances mean is another. Bitext’s proprietary technology delivers understanding. Bitext has created an easy to scale and multilingual Deep Linguistic Analysis or DLA platform. Our technology reduces costs and increases user satisfaction in voice applications or customer service applications. I see it as a major breakthrough in the state of the art.

If he is right, the Bitext DLA platform may be one of the next big things in technology. The reason? As smart software becomes more widely adopted, the need to make sense of data and text in different languages becomes increasingly important. Bitext may be the digital differential that makes the smart applications run the way users expect them to.

Snap In Bitext DLA

Advanced technology like Bitext’s often comes with a hidden cost. The advanced system works well in a demonstration or a controlled environment. When that system has to be integrated into “as is” systems from other vendors or from a custom development project, difficulties can pile up. Dr. Valderrabanos asserted:

Bitext DLA provides parsing data for text enrichment for a wide range of languages, for informal and formal text and for different verticals to improve the accuracy of deep learning engines and reduce training times and data needs. Bitext works in this way with many other organizations’ systems.

When I asked him about integration, he said:

No problems. We snap in.

I am interested in Bitext’s technical methods. In the last year, he has signed deals with companies like Audi, Renault, a large mobile handset manufacturer, and an online information retrieval company.

When I thanked him for his time, he was quite polite. But he did say, “I have to get back to my desk. We have received several requests for proposals.”

Las Rosas looked quite a bit like Silicon Valley when I left the Bitext headquarters. Despite the thousands of miles separating Madrid from the US, interest in Bitext’s deep linguistic analysis is surging. Silicon Valley has its charms, and now it has a Bitext US office for what may be the fastest growing computational linguistics and text analysis system in the world. Worth watching this company I think.

For more about Bitext, navigate to the firm’s Web site at www.bitext.com.

Stephen E Arnold, April 11, 2017

Upgraded Social Media Monitoring

February 20, 2017

Analytics are catching up to content. A recent ZDNet article, “Digimind partners with Ditto to add image recognition to social media monitoring,” reminds us that images reign supreme on social media. Between Pinterest, Snapchat and Instagram, messages are often conveyed through images as opposed to text. Capitalizing on this, intelligence software company Digimind has announced a partnership with Ditto Labs to introduce image-recognition technology into its social media monitoring software, Digimind Social. We learned:

The Ditto integration lets brands identify the use of their logos across Twitter no matter the item or context. The detected images are then collected and processed on Digimind Social in the same way textual references, articles, or social media postings are analysed. Logos that are small, obscured, upside down, or in cluttered image montages are recognised. Object and scene recognition means that brands can position their products exactly where their customers are using them. Sentiment is measured by the amount of people in the image and counts how many of them are smiling. It even identifies objects such as bags, cars, car logos, or shoes.

It was only a matter of time before these types of features emerged in social media monitoring. For years now, images have been shown to increase engagement even on platforms that began focused more on text. Will we see more watermarked logos on images? More creative ways to visually identify brands? Both are likely and we will be watching to see what transpires.

Megan Feil, February 20, 2017


Smarter Content for Contentier Intelligence

December 28, 2016

I spotted a tweet about making smart content smarter. It seems that if content is smarter, then intelligence becomes contentier. I loved my logic class in 1962.

Here’s the diagram from this tweet. Hey, if the link is wonky, just attend the conference and imbibe the intelligence directly, gentle reader.


The diagram carries the identifier Data Ninja, which echoes Palantir’s use of the word ninja for some of its Hobbits. Data Ninja’s diagram has three parts. I want to focus on the middle part:


What I found interesting is that instead of a single block labeled “content processing,” the content processing function is broken into several parts. These are:

A Data Ninja API

A Data Ninja “knowledgebase,” which I think is an iPhrase-type or TeraText type of method. Not familiar with iPhrase and TeraText, feel free to browse the descriptions at the links.

A third component in the top box is the statement “analyze unstructured text.” This may refer to indexing and such goodies as entity extraction.

The second box performs “text analysis.” Obviously this process is different from the “analyze unstructured text” step; otherwise, why run the same analyses again? The second box performs what may be clustering of content into specific domains. This is important because a “terminal” in transportation may be different from a “terminal” in a cloud hosting facility. Disambiguation is important because the terminal may be part of a diversified transportation company’s computing infrastructure. I assume Data Ninja’s methods handle this parsing of “concepts” without many errors.
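To make the “terminal” problem concrete, here is a minimal sketch of domain disambiguation by counting cue words. It is an assumption-laden toy, not Data Ninja's method: the cue lists, the domain names, and the function name are all invented for illustration.

```python
# Hypothetical cue words for two domains, chosen by hand.
DOMAIN_CUES = {
    "transportation": {"bus", "freight", "airport", "passengers", "cargo", "gate"},
    "computing": {"server", "cloud", "command", "login", "shell", "hosting"},
}


def guess_domain(text):
    """Pick the domain whose cue words overlap most with the text.

    Returns None when there is no evidence or when two domains tie.
    """
    cleaned = text.lower().replace(",", " ").replace(".", " ")
    words = set(cleaned.split())
    scores = {domain: len(words & cues) for domain, cues in DOMAIN_CUES.items()}
    best = max(scores, key=scores.get)
    top = scores[best]
    if top == 0 or list(scores.values()).count(top) > 1:
        return None
    return best


print(guess_domain("Passengers waited at the bus terminal near gate four."))
print(guess_domain("Open a terminal and login to the cloud server."))
# A transportation company's computing gear mixes both vocabularies.
print(guess_domain("The freight terminal runs a server."))
```

The third call is the diversified transportation company case: one cue from each domain, so the toy declines to answer. A production system needs far more context than a word count to avoid that kind of tie.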

Once the selection of a domain area has been performed, the system appears to perform four specific types of operations as the Data Ninja practice their katas. These are the smart components:

  • Smart sentiment; that is, is the content object weighted “positive” or “negative”, “happy” or “sad”, or green light or red light, etc.
  • Smart data; that is, I am not sure what this means
  • Smart content; that is, maybe a misclassification because the end result should be smart content, but the diagram shows smart content as a subcomponent within the collection of procedures/assertions in the middle part of the diagram
  • Smart learning; that is, the Data Ninja system is infused with artificial intelligence, smart software, or machine learning (perhaps the three buzzwords are combined in practice, not just in diagram labeling?)

The end result is an iPhrase-type representation of data. (Note that this approach infuses TeraText, MarkLogic, and other systems which transform unstructured data to metadata-tagged structured information.)

The diagram then shows a range of services “plugging” into the box performing the functions referenced in my description of the middle box.

If the system works as depicted, Data Ninjas may have the solution to the federation challenge which many organizations face. Smarter content should deliver contentier intelligence or something along that line.

Stephen E Arnold, November 28, 2016

Why Search When You Can Discover

November 11, 2016

What’s next in search? My answer is, “No search at all. The system thinks for you.” Sounds like Utopia for the intellectual couch potato to me.

I read “The Latest in Search: New Services in the Content Discovery Marketplace.” The main point of the write up is to highlight three “discovery” services. A discovery service is one which offers “information users new avenues to the research literature.”

See, no search needed.

The three services highlighted are:

  • Yewno, which is powered by an inference engine. (Does anyone remember the Inference search engine from days gone by?) The Yewno system uses “computational analysis and a concept map.” The problem is that it “supplements institutional discovery.” I don’t know what “institutional discovery” means, and my hunch is that folks living outside of rural Kentucky know what “institutional discovery” means. Sorry to be so ignorant.
  • ScienceOpen, which delivers a service which “complements open Web discovery.” Okay. I assume that this means I run an old fashioned query and ScienceOpen helps me out.
  • TrendMD, which “serves as a classic “onward journey tool” that aims to generate relevant recommendations serendipitously.”

I am okay with the notion of having tools to make it easier to locate information germane to a specific query. I am definitely happy with tools which can illustrate connections via concept maps, link analysis, and similar outputs. I understand that lawyers want to type in a phrase like “Panama deal” and get a set of documents related to this term so the mass of data can be chopped down by sender, recipient, time, etc.

But setting up discovery as a separate operation from keyword or entity based search seems a bit forced to me. The write up spins its lawn mower blades over the TrendMD service. That’s fine, but there are a number of ways to explore scientific, technical, and medical literature. Some are or were delightful like Grateful Med; others are less well known; for example, Mednar and Quertle.

Discovery means one thing to lawyers. It means another thing to me: A search add on.

Stephen E Arnold, November 11, 2016

Palantir Technologies: Less War with Gotham?

November 9, 2016

I read “Peter Thiel Explains Why His Company’s Defense Contracts Could Lead to Less War.” I noted that the write up appeared in the Washington Post, a favorite of Jeff Bezos I believe. The write up referenced a refrain which I have heard before:

Washington “insiders” currently leading the government have “squandered” money, time and human lives on international conflicts.

What I highlighted as an interesting passage was this one:

a spokesman for Thiel explained that the technology allows the military to have a more targeted response to threats, which could render unnecessary the wide-scale conflicts that Thiel sharply criticized.

I also put a star by this statement from the write up:

“If we can pinpoint real security threats, we can defend ourselves without resorting to the crude tactic of invading other countries,” Thiel said in a statement sent to The Post.

The write up pointed out that Palantir booked about $350 million in business between 2007 and 2016 and added:

The total value of the contracts awarded to Palantir is actually higher. Many contracts are paid in a series of installments as work is completed or funds are allocated, meaning the total value of the contract may be reflected over several years. In May, for example, Palantir was awarded a contract worth $222.1 million from the Defense Department to provide software and technical support to the U.S. Special Operations Command. The initial amount paid was $5 million with the remainder to come in installments over four years.

I was surprised at the Washington Post’s write up. No ads for Alexa and no Beltway snarkiness. That too was interesting to me. And I don’t have a dog in the fight. For those with dogs in the fight, there may be some billability worries ahead. I wonder if the traffic jam at 355 and Quince Orchard will now abate when IBM folks do their daily commute.

Stephen E Arnold, November 9, 2016

Entity Extraction: No Slam Dunk

November 7, 2016

There are differences among these three use cases for entity extraction:

  1. Operatives reviewing content for information about watched entities prior to an operation
  2. Identifying people, places, and things for a marketing analysis by a PowerPoint ranger
  3. Indexing Web content to add concepts to keyword indexing.

Regardless of your experience with software which identifies “proper nouns,” events, meaningful digits like license plate numbers, organizations, people, and locations (accepted and colloquial), you will find the information in “Performance Comparison of 10 Linguistic APIs for Entity Recognition” thought provoking.

The write up identifies the systems which perform the best and the worst.

Here are the five systems and the number of errors each generated in a test corpus. The “scores” are based on a test which contained 150 targets. The “best” system got more correct than incorrect. I find the results interesting but not definitive.

The five best performing systems on the test corpus were:

The five worst performing systems on the test corpus were:

There are some caveats to consider:

  1. Entity identification works quite well when the training set includes the entities and their synonyms as part of the training set
  2. Multi-language entity extraction requires additional training set preparation. “Learn as you go” is often problematic when dealing with social messages, certain intercepted content, and colloquialisms
  3. Identification of content used as a code—for example, Harrod’s teddy bear for contraband—is difficult even for smart software operating with subject matter experts’ input. (Bad guys are often not stupid and understand the concept of using one word to refer to another thing based on context or previous interactions).
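The first and third caveats can be illustrated with a minimal sketch of dictionary-based entity lookup. The watch list, the sample sentences, and the function name are hypothetical, and this is not how any of the ten tested APIs works. It shows only that a lookup finds what its list contains and nothing else.

```python
import re

# Hypothetical watch list: canonical entity name -> known surface forms.
GAZETTEER = {
    "International Business Machines": [
        "International Business Machines",
        "IBM",
        "Big Blue",
    ],
    "Palantir": ["Palantir Technologies", "Palantir"],
}


def extract_entities(text):
    """Return the canonical names whose known variants appear in the text."""
    found = set()
    for canonical, variants in GAZETTEER.items():
        for variant in variants:
            pattern = r"\b" + re.escape(variant) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                found.add(canonical)
                break  # one matching variant is enough for this entity
    return found


# The synonym is in the list, so the entity is found.
print(extract_entities("Big Blue signed a deal last week."))
# A code word is not in the list, so nothing is found.
print(extract_entities("The teddy bear ships on Friday."))
```

Statistical extractors generalize better than a lookup table, but the code word case stays hard for them too, because the surface text contains nothing that marks the phrase as an entity of interest.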

Net net: Automated systems are essential. The error rates may be fine for some use cases and potentially dangerous for others.

Stephen E Arnold, November 7, 2016

Falcon Searches Through Browser History

October 21, 2016

Have you ever visited a Web site and then lost the address or could not find a particular section on it?  You know that the page exists, but no matter how often you use an advanced search feature or scour through your browser history it cannot be found.  If you use Google Chrome as your main browser then there is a solution, says GHacks in the article, “Falcon: Full-Text history Search For Chrome.”

Falcon is a Google Chrome extension that adds full-text history search to a browser.  Chrome usually remembers Web sites and their extensions when you type them into the address bar.  The Falcon extension augments the default behavior to match text found on previously visited Web Sites.

Falcon is a search option within a search feature:

The main advantage of Falcon over Chrome’s default way of returning results is that it may provide you with better results.  If the title or URL of a page don’t contain the keyword you entered in the address bar, it won’t be displayed by Chrome as a suggestion even if the page is full of that keyword. With Falcon, that page may be returned as well in the suggestions.
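The difference described in the quote can be sketched as two matching functions over a small in-memory history. This is an illustration under stated assumptions, not Falcon's or Chrome's actual code: the `Page` record, the sample history, and both function names are invented here.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    title: str
    body: str


# A made-up browsing history for illustration.
HISTORY = [
    Page("https://example.com/post/123", "Weekly notes",
         "A long discussion of entity extraction errors."),
    Page("https://example.com/entity-extraction", "Entity extraction guide",
         "How to tag names."),
    Page("https://example.com/recipes", "Dinner ideas", "Pasta and salad."),
]


def match_title_or_url(keyword, pages):
    """Match only against title and URL, as in the default behavior described."""
    key = keyword.lower()
    return [p.url for p in pages
            if key in p.title.lower() or key in p.url.lower()]


def match_full_text(keyword, pages):
    """Also match against the stored page text."""
    key = keyword.lower()
    return [p.url for p in pages
            if key in p.title.lower()
            or key in p.url.lower()
            or key in p.body.lower()]


print(match_title_or_url("extraction", HISTORY))
print(match_full_text("extraction", HISTORY))
```

The first page mentions the keyword only in its body, so it shows up in the full-text results and not in the title-or-URL results.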

The new Chrome extension acts as a delimiter to recorded Web history and improves a user’s search experience so they do not have to sift through results individually.

Whitney Grace, October 21, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

