Bitvore: The AI, Real Time, Custom Report Search Engine

May 16, 2017

Just when I thought information access had slumped quietly through another week, I read in the capitalist tool which you know as Forbes, the content marketing machine, this article:

This AI Search Engine Delivers Tailored Data to Companies in Real Time.

This write up struck me as more interesting than the most recent IBM Watson prime time commercial about smart software for zealous professional basketball fans or Lucidworks’ (really?) acquisition of the interface builder Twigkit. Forbes Magazine’s write up did not point out that the company seems to be channeling Palantir Technologies; for example, Jeff Curie, the president, refers to employees as Bitvorians. Take that, you Hobbits and Palanterians.

A Bitvore 3D data structure.

The AI, real time, custom report search engine is called Bitvore. Here in Harrod’s Creek, we recognized the combination of the computer term “bit” with a syllable from one of our favorite morphemes “vore” as in carnivore or omnivore or the vegan-sensitive herbivore.

Machine Learning Going Through a Phase

May 10, 2017

People think that machine learning is like an algorithm magic wand. It works by someone writing the algorithmic code, popping in the data, and the computer learns how to do a task. It is not that easy. The Bitext blog reveals that machine learning needs assistance in the post, “How Phrase Structure Can Help Machine Learning For Text Analysis.”

Machine learning techniques used for text analysis are not that accurate. The post explains that instead of learning the meaning of words in a sentence according to its structure, all the words are tossed into a bag and treated individually. The context and meaning are lost. A real world example is Chinese and Japanese because both use Chinese characters (hanzi in Chinese, kanji in Japanese). In both languages, a character’s meaning changes based on the context. The result is that both languages have a lot of puns and are a nightmare for text analytics.

As you can imagine there are problems in Germanic and Latin-based languages too:

Ignoring the structure of a sentence can lead to various types of analysis problems. The most common one is incorrectly assigning similarity to two unrelated phrases such as “Social Security in the Media” and “Security in Social Media” just because they use the same words (although with a different structure).

Besides, this approach has stronger effects for certain types of “special” words like “not” or “if”. In a sentence like “I would recommend this phone if the screen was bigger”, we don’t have a recommendation for the phone, but this could be the output of many text analysis tools, given that we have the words “recommendation” and “phone”, and given that the connection between “if” and “recommend” is not detected.

If you rely solely on the “bag of words” approach for text analysis the problems only get worse. That is why phrase structure is very important for text and sentiment analysis. Bitext incorporates phrase structure and other techniques in their analytics platform used by a large search engine company and another tech company that likes fruit.
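The two failure modes in the Bitext post are easy to reproduce. Here is a minimal Python sketch, not Bitext’s method. The stop-word list, the function names, and the keyword rule are all assumptions made up for illustration. It shows that a bag of words scores the two “Social Security” phrases as identical, that even crude word pairs (bigrams) tell them apart, and that a keyword check misses the “if.”

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a real lexicon).
STOP_WORDS = {"in", "the"}


def tokens(text, drop_stop_words=True):
    """Lowercase, split on whitespace, optionally drop stop words."""
    words = text.lower().split()
    return [w for w in words if not (drop_stop_words and w in STOP_WORDS)]


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[k] * b[k] for k in a)
    sq_a = sum(v * v for v in a.values())
    sq_b = sum(v * v for v in b.values())
    return dot / math.sqrt(sq_a * sq_b) if sq_a and sq_b else 0.0


def bag_of_words_similarity(x, y):
    """Word order is thrown away; only the word counts survive."""
    return cosine(Counter(tokens(x)), Counter(tokens(y)))


def bigram_similarity(x, y):
    """Adjacent word pairs keep a little of the phrase structure."""
    def bigrams(text):
        t = tokens(text, drop_stop_words=False)
        return Counter(zip(t, t[1:]))
    return cosine(bigrams(x), bigrams(y))


def naive_recommendation(text):
    """Hypothetical keyword rule: fires on 'recommend' plus 'phone', blind to 'if'."""
    words = set(tokens(text))
    return "recommend" in words and "phone" in words


a = "Social Security in the Media"
b = "Security in Social Media"
print(bag_of_words_similarity(a, b))  # identical bags, so the score is 1.0
print(bigram_similarity(a, b))        # only one shared word pair, roughly 0.29
print(naive_recommendation(
    "I would recommend this phone if the screen was bigger"))  # True, wrongly
```

Bigrams are only a stand-in here. The post argues for real phrase structure, which would also catch the link between “if” and “recommend” that the keyword rule ignores.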

Whitney Grace, May 10, 2017

Enterprise Search and a Chimera: Analytical Engines

May 1, 2017

I put on my steam punk outfit before reading “Leading Analytical Engines for Enterprise Search.” Now there was one small factual error; specifically, the Google Search Appliance is a goner. When it was alive and tended to by authorized partners, it was not particularly adept at doing “analytical engine” type things.

What about the rest of the article? Well, I found it amusing.

Let me get to the good stuff and then deal with the nasty reality which confronts the folks who continue to pump money into enterprise search.

What companies does this “real journalism” outfit identify as purveyors of high top shoes for search? Yikes, sorry. I meant to say enterprise search systems which do analytical engine things.

Here’s the line up:

The Google Search Appliance. As noted, this is a goner. Yep, the Google threw in the towel. Lots of reasons, but my sources say, cost of sales was a consideration. Oh, and there were a couple of Google FTEs plus assorted costs for dealing with those annoyed with the product’s performance, relevance, customization, etc. Anyway. Museum piece.

Microsoft SharePoint. I find this a side splitter. Microsoft SharePoint is many things. In fact, armed with Visual Studio one can actually make the system work in a useful manner. Don’t tell the HR folks who wonder why certified SharePoint experts chew up a chunk of the budget and “fast.” Insider joke. Yeah, Excel is the go to analysis tool no matter what others may say. The challenge is to get the Excel thing to interact in a speedy, useful way with whatever the SharePoint administrator has managed to get working in a reliable way. Nuff said.

Coveo. Interesting addition to the list because Coveo is doing the free search thing, the Salesforce thing, the enterprise search thing, the customer support thing, and I think a bunch of other things. The Canadian outfit wants to do more than surf on government inducements, investors’ trust and money, and a key word based system. So it’s analytical engine time. I am not sure how the wrappers required to make key word search do analytics help out performance, but the company says it is an “analytical engine.” So be it.

Attivio. This is an interesting addition. The company emerged from some “fast” movers and shakers. The baseball data demo was nifty about six years ago. Now the company does search, publishing, analytics, etc. The shift from search to analytical engine is somewhat credible. The challenge the company faces is closing deals and generating sustainable revenue. There is that thing called “open source”. A clever programmer can integrate Lucene (Elasticsearch), use its open source components, and maybe dabble with Ikanow. The result? Perhaps an Attivio killer? Who knows.

Lucidworks (Really?). Yep, this is the Avis to the Hertz in the open source commercial sector. Lucidworks (Really?) is now just about everything sort of associated with Big Data, search, smart software, etc. The clear Lucid problem is Shay Banon and Elastic. Not only does Elastic have more venture money, Elastic has more deployments and, based on information available to me, more revenue, partners, and clout in the open source world. Lucidworks (Really?) has a track record of executive and founder turnover and the thrill of watching Amazon benefit from a former Lucid employee’s inputs. Exciting. Really?

So what do I think of this article in CIO Review? Two things:

  1. It is not too helpful to me and those looking for search solutions in Harrod’s Creek, Kentucky. The reason? The GSA error and gasping effort to make key word search into something hot and cool. “Analytical engines” does not rev my motor. In fact, it does not turn over.
  2. CIO Review does not want me to copy a quote from the write up. Tip to CIO Review: Anyone can copy the wildly crazy analytical engines article by viewing source and copying the somewhat uninteresting content.

Stephen E Arnold, May 1, 2017

Palantir Technologies: A Beatdown Buzz Ringing in My Ears

April 27, 2017

I have zero contacts at Palantir Technologies. The one time I valiantly contacted the company about a speaking opportunity at one of my wonky DC invitation-only conferences, a lawyer from Palantir referred my inquiry to a millennial who had a one word vocabulary, “No.”

There you go.

I have written about Palantir Technologies because I used to be an adviser to the pre-IBM incarnation of i2 and its widely used investigation tool, Analyst’s Notebook. I did write about a misadventure between i2 Group and Palantir Technologies, but no one paid much attention to my commentary.

An outfit called Buzzfeed, however, does pay attention to Palantir Technologies. My hunch is that the online real news outfit believes there is a story in the low profile, Peter Thiel-supported company. The technology Palantir has crafted is not that different from the Analyst’s Notebook, Centrifuge Systems’ solution, and quite a few other companies which provide industrial-strength software and systems to law enforcement, security firms, and the intelligence community. (I list about 15 of these companies in my forthcoming “Dark Web Notebook.” No, I won’t provide that list in this free blog. I may be retired, but I am not giving away high value information.)

So what’s caught my attention? I read the article “Palantir’s Relationship with the Intelligence Community Has Been Worse Than You Think.” The main idea is that the procurement of Palantir’s Gotham and supporting services provided by outfits specializing in Palantir systems has not been sliding on President Reagan’s type of Teflon. The story has been picked up and recycled by several “real” news outfits; for example, Brainsock. The story meshes like matryoshkas with other write ups; for example, “Inside Palantir, Silicon Valley’s Most Secretive Company” and “Palantir Struggles to Retain Clients and Staff, BuzzFeed Reports.” Palantir, it seems to me in Harrod’s Creek, is a newsy magnet.

The write up about Palantir’s lousy relationship with the intelligence community pivots on a two year old video. I learned that the Big Dog at Palantir, Alex Karp, said in a non public meeting which some clever Hobbit type videoed on a smartphone words presented this way by the real news outfit:

The private remarks, made during a staff meeting, are at odds with a carefully crafted public image that has helped Palantir secure a $20 billion valuation and win business from a long list of corporations, nonprofits, and governments around the world. “As many of you know, the SSDA’s recalcitrant,” Karp, using a Palantir codename for the CIA, said in the August 2015 meeting. “And we’ve walked away, or they walked away from us, at the NSA. Either way, I’m happy about that.” The CIA, he said, “may not like us. Well, when the whole world is using Palantir they can still not like us. They’ll have no choice.” Suggesting that the Federal Bureau of Investigation had also had friction with Palantir, he continued, “That’s de facto how we got the FBI, and every other recalcitrant place.”

Okay, I don’t know the context of the remarks. It does strike me that 2015 was more than a year ago. In the zippy doo world of Sillycon Valley, quite a bit can change in one year.

I don’t know if you recall Paul Doscher who was the CEO of Exalead USA and Lucid Imagination (before the company asserted that its technology actually “works”). Mr. Doscher is a good speaker, but he delivered a talk in 2009, captured on video, during which he was interviewed by a fellow in a blue sport coat and shirt. Mr. Doscher wore a baseball cap in gangsta style, a crinkled unbuttoned shirt, and evidenced a hipster approach to discussing travel. Now if you know Mr. Doscher, he is not a manager influenced by gangsta style. My hunch is that he responded to an occasion, and he elected to approach travel with a bit of insouciance.

Could Mr. Karp, the focal point of the lousy relationship article, have been responding to an occasion? Could Mr. Karp have adopted a particular tone and style to express frustration with US government procurement? Keep in mind that a year later, Palantir sued the US Army. My hunch is that views expressed in front of a group of employees may not be news of the moment. Interesting? Sure.

What I find interesting is that the coverage of Palantir Technologies does not dig into the parts of the company which I find most significant. To illustrate: Palantir has a system and method for an authorized user to add new content to the Gotham system. The approach makes it possible to generate an audit trail to make it easy (maybe trivial) to answer these questions:

  1. What data were added?
  2. When were the data added?
  3. What person added the data?
  4. What index terms were added to the data?
  5. What entities were added to the metadata?
  6. What special terms or geographic locations were added to the data?

You get the idea. Palantir’s Gotham brings to intelligence analysis the type of audit trail I found compelling in the Clearwell system and other legal oriented systems. Instead of a person in information technology saying in response to a question like “Where did this information come from?”, “Duh. I don’t know.”

Gotham gets me an answer.
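To make the six questions concrete, here is a minimal Python sketch of what one audit record could hold. I have not seen Palantir’s schema. Every class, field, and value name below is hypothetical. It shows the general idea of logging provenance at the moment content is added, not Gotham’s actual implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    """One hypothetical provenance entry; fields mirror the six questions."""
    content_id: str           # 1. what data were added
    added_at: datetime        # 2. when the data were added
    added_by: str             # 3. what person added the data
    index_terms: tuple = ()   # 4. index terms added to the data
    entities: tuple = ()      # 5. entities added to the metadata
    special_terms: tuple = () # 6. special terms or geographic locations


class AuditTrail:
    """Append-only log; records are frozen, so entries cannot be edited later."""

    def __init__(self):
        self._records = []

    def log(self, record):
        self._records.append(record)

    def provenance(self, content_id):
        """Answer 'Where did this information come from?' for one item."""
        return [r for r in self._records if r.content_id == content_id]


trail = AuditTrail()
trail.log(AuditRecord(
    content_id="doc-001",                      # made-up identifier
    added_at=datetime.now(timezone.utc),
    added_by="analyst_7",                      # made-up user name
    index_terms=("shipping", "invoice"),
    entities=("Acme Freight",),
    special_terms=("Harrod's Creek, Kentucky",),
))

for record in trail.provenance("doc-001"):
    print(record.added_by, record.added_at.isoformat(), record.index_terms)
```

The point of the sketch is the design choice: the record is written when the content goes in, so the “Duh. I don’t know” answer never has to be reconstructed after the fact.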

For me, explaining the reasoning behind Palantir’s approach warrants a write up. I think quite a few people struggling with problems of data quality and what is called by the horrid term “governance” would find Palantir’s approach of some interest.

Now do I care about Palantir? Nah.

Do I care about bashing Palantir? Nah.

What I do care about is tabloidism taking precedence over substantive technical approaches. From my hollow in rural Kentucky, I see folks looking for “sort of” information.

How about more substantive information? I am fed up with podcasts which recycle old information with fake good cheer. I am weary of leaks. I want to know about Palantir’s approach to search and content processing and have its systems and methods compared to what its direct competitors purport to do.

Yeah, I know this is difficult to do. But nothing worthwhile comes easy, right?

I can hear the millennials shouting, “Wrong, you dinosaur.” Hey, no problem. I own a house. I don’t need tabloidism. I have picked out a rest home, and I own 60 cemetery plots.

Do your thing, dudes and dudettes of “real” journalism.

Stephen E Arnold, April 27, 2017

Voice Search and Big Data: Defining Technologies for 2017

April 20, 2017

I read “Voice Search and Data: The Two Trends That Will Shape Online Marketing in 2017.” If the story is accurate, figuring out what people say and making sense of data (lots of data) will create new opportunities for innovators.

The article states:

Advancements in voice search and artificial intelligence (AI) will drive rich answers that will help marketers understand the customer intent behind I-want-to-go, I-want-to-know, I-want-to-buy and I-want-to-do micro-moments. Google has developed algorithms to cater directly to the search intent of the customers behind these queries, enabling customers to find the right answers quickly.

My view is that the article is correct in its assessment.

Where the article and I differ boils down to search engine optimization. The idea that voice search and Big Data will make fooling the relevance algorithms of Bing, Google, and Yandex a windfall for search engine optimization experts is partially true. Marketing whiz kids will do and say many things to deliver results that do not answer my query or meet my expectation of a “correct” answer.

My view is that the proliferation of systems which purport to understand human utterances in text, and voice-to-text conversions will discover that the error rates of 60 to 75 percent are not good enough. Errors can be buried in long lists of results. They can be sidestepped if a voice enabled system works from a set of rules confined to a narrow topic domain.

Open the door to natural language parsing, and the error rates which once were okay become a liability. In my opinion, this will set off a scramble among companies struggling to get their smart software to provide information that customers accept and use repeatedly. Fail and customer turnover can be a fatal knife wound to the heart of an organization. The cost of replacing a paying customer is high. Companies need to keep the customers they have with technology that helps keep paying customers smiling.

What companies are able to provide higher accuracy linguistic functions? There are dozens of companies which assert that their systems can extract entities, figure out semantic relationships, and manipulate content in a handful of languages.

The problem with most of these systems is that certain, very widely used methods collapse when high accuracy is required for large volumes of text. The short cut is to use numerical tricks, and some of those tricks create disconnects between the information the user requests or queries and the results the system displays. Examples range from the difficulties of tuning the Autonomy Dynamic Reasoning Engine to figuring out how in the heck Google Home arrived at a particular song when the user wanted something else entirely.

Our suggestion is that instead of emailing IBM to sign a deal for that company’s language technology, you might have a more productive result if you contact Bitext. This is a company which has been on my mind. I interviewed the founder and CEO (an IBM alum as I learned) and met with some of the remarkable Bitext team.

I am unable to disclose Bitext’s clients. I can suggest that if you fancy a certain German sports car or use one of the world’s most popular online services, you will be bumping into Bitext’s Digital Linguistic Analysis platform. For more information, navigate to Bitext.com.

The data I reviewed suggested that Bitext’s linguistic platform delivers accuracy significantly better than some of the other systems’ outputs I have reviewed. How accurate? Good enough to get an A in my high school math class.

Stephen E Arnold, April 20, 2017

Image Search: Biased by Language. The Fix? Use Humans!

April 19, 2017

Houston, we (male, female, uncertain) have a problem. Bias is baked into some image analysis and just about every other type of smart software.

The culprit?

Numerical recipes.

The first step in solving a problem is to acknowledge that a problem exists. The second step is more difficult.

I read “The Reason Why Most of the Images That Show Up When You Search for Doctor Are White Men.” The headline identifies the problem. However, what does one do about biases rooted in human utterance?

My initial thought was to eliminate human utterances. No fancy dancing required. Just let algorithms do what algorithms do. I realized that although this approach has a certain logical completeness, implementation may meet with a bit of resistance.

What does the write up have to say about the problem? (Remember. The fix is going to be tricky.)

I learned:

Research from Princeton University suggests that these biases, like associating men with doctors and women with nurses, come from the language taught to the algorithm. As some data scientists say, “garbage in, garbage out”: Without good data, the algorithm isn’t going to make good decisions.

Okay, right coast thinking. I feel more comfortable.

What does the write up present as wizard Aylin Caliskan’s view of the problem? A post doctoral researcher seems to be a solid choice for a source. I assume the wizard is a human, so perhaps he, she, it is biased? Hmmm.

I highlighted in true blue several passages from the write up / interview with he, she, it. Let’s look at three statements, shall we?

Regarding genderless languages like Turkish:

when you directly translate, and “nurse” is “she,” that’s not accurate. It should be “he or she or it” is a nurse. We see that it’s making a biased decision—it’s a very simple example of machine translation, but given that these models are incorporated on the web or any application that makes use of textual data, it’s the foundation of most of these applications. If you search for “doctor” and look at the images, you’ll see that most of them are male. You won’t see an equal male and female distribution.

If accurate, this observation means that the “fix” is going to be difficult. Moving from a language without gender identification to a language with gender identification requires changing the target language. Easy for software. Tougher for a human. If the language and its associations are anchored in the brain of a target language speaker, change may be, how shall I say it, a trifle difficult. My fix looks pretty good at this point.

And what about images and videos? I learned:

Yes, anything that text touches. Images and videos are labeled so they can be used on the web. The labels are in text, and it has been shown that those labels have been biased.

And the fix is a human doing the content selection, indexing, and dictionary tweaking. Not so fast. Indexing with humans is very expensive. Don’t believe me? Download 10,000 Wikipedia articles and hire some folks to index them from the controlled term list humans set up. Let me know if you can hit $17 per indexed article. My hunch is that you will exceed this target by several orders of magnitude. (Want to know where the number comes from? Contact me and we can discuss a for fee deal for this high value information.)

How does the write up solve the problem? Here’s the capper:

…you cannot directly remove the bias from the dataset or model because it’s giving a very accurate representation of the world, and that’s why we need a specialist to deal with this at the application level.

Notice that my solution is to eliminate humans entirely. Why? The pipe dream of humans doing indexing won’t fly due to [a] time, [b] cost, [c] the massive flows of data to index. Forget the mother of all bombs.

Think about the mother of all indexing backlogs. The gap would make the Modern Language Association’s “gaps” look like a weekend catch up party. Is this a job for the operating system for machine intelligence?

Stephen E Arnold, April 17, 2017

Smart Software, Dumb Biases

April 17, 2017

Math is objective, right? Not really. Developers of artificial intelligence systems, what I call smart software, rely on what they learned in math school. If you have flipped through math books ranging from the Googler’s tome on artificial intelligence Artificial Intelligence: A Modern Approach to the musings of the ACM’s journals, you see the same methods recycled. Sure, the algorithms are given a bath and their whiskers are cropped. But underneath that show dog’s sleek appearance, is a familiar pooch. K-means. We have k-means. Decision trees? Yep, decision trees.

What happens when developers feed content into Rube Goldberg machines constructed of mathematical procedures known and loved by math wonks the world over?

The answer appears in “Semantics Derived Automatically from Language Corpora Contain Human Like Biases.” The headline says it clearly, “Smart software becomes as wild and crazy as a group of Kentucky politicos arguing in a bar on Friday night at 2:15 am.”

Biases are expressed and made manifest.

The article in Science reports with considerable surprise it seems to me:

word embeddings encode not only stereotyped biases but also other knowledge, such as the visceral pleasantness of flowers or the gender distribution of occupations.

Ah, ha. Smart software learns biases. Perhaps “smart” correlates with bias?

The canny whiz kids who did the research crawfish a bit:

We stress that we replicated every association documented via the IAT that we tested. The number, variety, and substantive importance of our results raise the possibility that all implicit human biases are reflected in the statistical properties of language. Further research is needed to test this hypothesis and to compare language with other modalities, especially the visual, to see if they have similarly strong explanatory power.

Yep, nothing like further research to prove that when humans build smart software, “magic” happens. The algorithms manifest biases.
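How does a researcher “see” a bias inside a pile of numbers? The usual measurement is cosine similarity between word vectors. Here is a minimal Python sketch of that mechanic. The three-dimensional vectors are invented for this illustration and were not learned from any corpus, so the output proves nothing about real language. The `association` function is my own simplification in the spirit of the association tests the paper describes, not the authors’ code.

```python
import math

# Toy vectors made up for this sketch. Real embeddings have hundreds of
# dimensions and are learned from large text corpora.
VECTORS = {
    "doctor": (0.9, 0.3, 0.1),
    "nurse":  (0.2, 0.9, 0.1),
    "he":     (1.0, 0.1, 0.0),
    "she":    (0.1, 1.0, 0.0),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def association(word, attr_a, attr_b):
    """Positive: word sits closer to attr_a. Negative: closer to attr_b."""
    w = VECTORS[word]
    return cosine(w, VECTORS[attr_a]) - cosine(w, VECTORS[attr_b])


for occupation in ("doctor", "nurse"):
    score = association(occupation, "he", "she")
    leaning = "he" if score > 0 else "she"
    print(f"{occupation}: {score:+.2f} (leans toward '{leaning}')")
```

With vectors learned from real text, no one types in the numbers. The geometry comes from how people actually write, which is why the biases come along for the ride.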

What the write up did not address is a method for developing less biased smart software. Is such a method beyond the ken of computer scientists?

To get more information about this question, I asked one of the world leaders in the field of computational linguistics, Dr. Antonio Valderrabanos, the founder and chief executive officer at Bitext. Dr. Valderrabanos told me:

We use syntactic relations among words instead of using n-grams and similar statistical artifacts, which don’t understand word relations. Bitext’s Deep Linguistics Analysis platform can provide phrases or meaningful relationships to uncover more textured relationships. Our analysis will provide better content to artificial intelligence systems using corpuses of text to learn.

Bitext’s approach is explained in the exclusive interview which appeared in Search Wizards Speak on April 11, 2017. You can read the full text of the interview at this link and review the public information about the breakthrough DLA platform at www.bitext.com.

It seems to me that Bitext has made linguistics the operating system for artificial intelligence.

Stephen E Arnold, April 17, 2017

Bitext: Exclusive Interview with Antonio Valderrabanos

April 11, 2017

On a recent trip to Madrid, Spain, I was able to arrange an interview with Dr. Antonio Valderrabanos, the founder and CEO of Bitext. The company has its primary research and development group in Las Rozas, the high-technology complex a short distance from central Madrid. The company has an office in San Francisco and a number of computational linguists and computer scientists in other locations. Dr. Valderrabanos worked at IBM in an adjacent field before moving to Novell and then making the jump to his own start up. The hard work required to invent a fundamentally new way to make sense of human utterance is now beginning to pay off.

Dr. Antonio Valderrabanos, founder and CEO of Bitext. Bitext’s business is growing rapidly. The company’s breakthroughs in deep linguistic analysis solve many difficult problems in text analysis.

Founded in 2008, the firm specializes in deep linguistic analysis. The systems and methods invented and refined by Bitext improve the accuracy of a wide range of content processing and text analytics systems. What’s remarkable about the Bitext breakthroughs is that the company supports more than 40 different languages, and its platform can support additional languages with sharp reductions in the time, cost, and effort required by old-school systems. With the proliferation of intelligent software, Bitext, in my opinion, puts the digital brains in overdrive. Bitext’s platform improves the accuracy of many smart software applications, ranging from customer support to business intelligence.

In our wide ranging discussion, Dr. Valderrabanos made a number of insightful comments. Let me highlight three and urge you to read the full text of the interview at this link. (Note: this interview is part of the Search Wizards Speak series.)

Linguistics as an Operating System

One of Dr. Valderrabanos’ most startling observations addresses the future of operating systems for increasingly intelligent software and applications. He said:

Linguistic applications will form a new type of operating system. If we are correct in our thought that language understanding creates a new type of platform, it follows that innovators will build more new things on this foundation. That means that there is no endpoint, just more opportunities to realize new products and services.

Better Understanding Has Arrived

Some of the smart software I have tested is unable to understand what seem to be very basic instructions. The problem, in my opinion, is context. Most smart software struggles to figure out the knowledge cloud which embraces certain data. Dr. Valderrabanos observed:

Search is one thing. Understanding what human utterances mean is another. Bitext’s proprietary technology delivers understanding. Bitext has created an easy to scale and multilingual Deep Linguistic Analysis or DLA platform. Our technology reduces costs and increases user satisfaction in voice applications or customer service applications. I see it as a major breakthrough in the state of the art.

If he is right, the Bitext DLA platform may be one of the next big things in technology. The reason? As smart software becomes more widely adopted, the need to make sense of data and text in different languages becomes increasingly important. Bitext may be the digital differential that makes the smart applications run the way users expect them to.

Snap In Bitext DLA

Advanced technology like Bitext’s often comes with a hidden cost. The advanced system works well in a demonstration or a controlled environment. When that system has to be integrated into “as is” systems from other vendors or from a custom development project, difficulties can pile up. Dr. Valderrabanos asserted:

Bitext DLA provides parsing data for text enrichment for a wide range of languages, for informal and formal text and for different verticals to improve the accuracy of deep learning engines and reduce training times and data needs. Bitext works in this way with many other organizations’ systems.

When I asked him about integration, he said:

No problems. We snap in.

I am interested in Bitext’s technical methods. In the last year, he has signed deals with companies like Audi, Renault, a large mobile handset manufacturer, and an online information retrieval company.

When I thanked him for his time, he was quite polite. But he did say, “I have to get back to my desk. We have received several requests for proposals.”

Las Rozas looked quite a bit like Silicon Valley when I left the Bitext headquarters. Despite the thousands of miles separating Madrid from the US, interest in Bitext’s deep linguistic analysis is surging. Silicon Valley has its charms, and now it has a Bitext US office for what may be the fastest growing computational linguistics and text analysis system in the world. Worth watching this company I think.

For more about Bitext, navigate to the firm’s Web site at www.bitext.com.

Stephen E Arnold, April 11, 2017

Upgraded Social Media Monitoring

February 20, 2017

Analytics are catching up to content. In a recent ZDNet article, “Digimind partners with Ditto to add image recognition to social media monitoring,” we are reminded that images reign supreme on social media. Between Pinterest, Snapchat and Instagram, messages are often conveyed through images as opposed to text. Capitalizing on this, intelligence software company Digimind has announced a partnership with Ditto Labs to introduce image-recognition technology into its social media monitoring software, Digimind Social. We learned:

The Ditto integration lets brands identify the use of their logos across Twitter no matter the item or context. The detected images are then collected and processed on Digimind Social in the same way textual references, articles, or social media postings are analysed. Logos that are small, obscured, upside down, or in cluttered image montages are recognised. Object and scene recognition means that brands can position their products exactly where their customers are using them. Sentiment is measured by the amount of people in the image and counts how many of them are smiling. It even identifies objects such as bags, cars, car logos, or shoes.

It was only a matter of time before these types of features emerged in social media monitoring. For years now, images have been shown to increase engagement even on platforms that began focused more on text. Will we see more watermarked logos on images? More creative ways to visually identify brands? Both are likely and we will be watching to see what transpires.

Megan Feil, February 20, 2017

Smarter Content for Contentier Intelligence

December 28, 2016

I spotted a tweet about making smart content smarter. It seems that if content is smarter, then intelligence becomes contentier. I loved my logic class in 1962.

Here’s the diagram from this tweet. Hey, if the link is wonky, just attend the conference and imbibe the intelligence directly, gentle reader.

image

The diagram carries the identifier Data Ninja, which echoes Palantir’s use of the word ninja for some of its Hobbits. Data Ninja’s diagram has three parts. I want to focus on the middle part:

image

What I found interesting is that instead of a single block labeled “content processing,” the content processing function is broken into several parts. These are:

A Data Ninja API

A Data Ninja “knowledgebase,” which I think is an iPhrase-type or TeraText type of method. Not familiar with iPhrase and TeraText? Feel free to browse the descriptions at the links.

A third component in the top box is the statement “analyze unstructured text.” This may refer to indexing and such goodies as entity extraction.

The second box performs “text analysis.” Obviously this process is different from the “analyze unstructured text” step; otherwise, why run the same analyses again? The second box performs what may be clustering of content into specific domains. This is important because a “terminal” in transportation may be different from a “terminal” in a cloud hosting facility. Disambiguation is important because the terminal may be part of a diversified transportation company’s computing infrastructure. I assume Data Ninja’s methods handle this parsing of “concepts” without many errors.

Once the selection of a domain area has been performed, the system appears to perform four specific types of operations as the Data Ninja practice their katas. These are the smart components:

  • Smart sentiment; that is, is the content object weighted “positive” or “negative”, “happy” or “sad”, or green light or red light, etc.
  • Smart data; that is, I am not sure what this means
  • Smart content; that is, maybe a misclassification because the end result should be smart content, but the diagram shows smart content as a subcomponent within the collection of procedures/assertions in the middle part of the diagram
  • Smart learning; that is, the Data Ninja system is infused with artificial intelligence, smart software, or machine learning (perhaps the three buzzwords are combined in practice, not just in diagram labeling?)

The end result is an iPhrase-type representation of data. (Note that this approach infuses TeraText, MarkLogic, and other systems which transform unstructured data to metadata tagged structured information.)

The diagram then shows a range of services “plugging” into the box performing the functions referenced in my description of the middle box.

If the system works as depicted, Data Ninjas may have the solution to the federation challenge which many organizations face. Smarter content should deliver contentier intelligence or something along that line.

Stephen E Arnold, November 28, 2016
