Palantir Technologies: A Beatdown Buzz Ringing in My Ears

April 27, 2017

I have zero contacts at Palantir Technologies. The one time I valiantly contacted the company about a speaking opportunity at one of my wonky DC invitation-only conferences, a lawyer from Palantir referred my inquiry to a millennial who had a one word vocabulary, “No.”

There you go.

I have written about Palantir Technologies because I used to be an adviser to the pre-IBM incarnation of i2 and its widely used investigation tool, Analyst’s Notebook. I did write about a misadventure between i2 Group and Palantir Technologies, but no one paid much attention to my commentary.

An outfit called Buzzfeed, however, does pay attention to Palantir Technologies. My hunch is that the online real news outfit believes there is a story in the low profile, Peter Thiel-supported company. The technology Palantir has crafted is not that different from the Analyst’s Notebook, Centrifuge Systems’ solution, and quite a few other companies which provide industrial-strength software and systems to law enforcement, security firms, and the intelligence community. (I list about 15 of these companies in my forthcoming “Dark Web Notebook.” No, I won’t provide that list in this free blog. I may be retired, but I am not giving away high value information.)

So what’s caught my attention? I read the article “Palantir’s Relationship with the Intelligence Community Has Been Worse Than You Think.” The main idea is that the procurement of Palantir’s Gotham and supporting services provided by outfits specializing in Palantir systems has not been sliding on President Reagan’s type of Teflon. The story has been picked up and recycled by several “real” news outfits; for example, Brainsock. The story meshes like matryoshkas with other write ups; for example, “Inside Palantir, Silicon Valley’s Most Secretive Company” and “Palantir Struggles to Retain Clients and Staff, BuzzFeed Reports.” Palantir, it seems to me in Harrod’s Creek, is a newsy magnet.

The write up about Palantir’s lousy relationship with the intelligence community pivots on a two year old video. I learned that the Big Dog at Palantir, Alex Karp, said in a non public meeting which some clever Hobbit type videoed on a smartphone words presented this way by the real news outfit:

The private remarks, made during a staff meeting, are at odds with a carefully crafted public image that has helped Palantir secure a $20 billion valuation and win business from a long list of corporations, nonprofits, and governments around the world. “As many of you know, the SSDA’s recalcitrant,” Karp, using a Palantir codename for the CIA, said in the August 2015 meeting. “And we’ve walked away, or they walked away from us, at the NSA. Either way, I’m happy about that.” The CIA, he said, “may not like us. Well, when the whole world is using Palantir they can still not like us. They’ll have no choice.” Suggesting that the Federal Bureau of Investigation had also had friction with Palantir, he continued, “That’s de facto how we got the FBI, and every other recalcitrant place.”

Okay, I don’t know the context of the remarks. It does strike me that 2015 was more than a year ago. In the zippy doo world of Sillycon Valley, quite a bit can change in one year.

I don’t know if you recall Paul Doscher who was the CEO of Exalead USA and Lucid Imagination (before the company asserted that its technology actually “works”). Mr. Doscher is a good speaker, but he delivered a talk in 2009, captured on video, during which he was interviewed by a fellow in a blue sport coat and shirt. Mr. Doscher wore a baseball cap in gangsta style, a crinkled unbuttoned shirt, and evidenced a hipster approach to discussing travel. Now if you know Mr. Doscher, he is not a manager influenced by gangsta style. My hunch is that he responded to an occasion, and he elected to approach travel with a bit of insouciance.

Could Mr. Karp, the focal point of the lousy relationship article, have been responding to an occasion? Could Mr. Karp have adopted a particular tone and style to express frustration with US government procurement? Keep in mind that a year later, Palantir sued the US Army. My hunch is that views expressed in front of a group of employees may not be news of the moment. Interesting? Sure.

What I find interesting is that the coverage of Palantir Technologies does not dig into the parts of the company which I find most significant. To illustrate: Palantir has a system and method for an authorized user to add new content to the Gotham system. The approach makes it possible to generate an audit trail to make it easy (maybe trivial) to answer these questions:

  1. What data were added?
  2. When were the data added?
  3. What person added the data?
  4. What index terms were added to the data?
  5. What entities were added to the metadata?
  6. What special terms or geographic locations were added to the data?

You get the idea. Palantir’s Gotham brings to intelligence analysis the type of audit trail I found so compelling in the Clearwell system and other legal oriented systems. Instead of a person in information technology saying in response to a question like “Where did this information come from?”, “Duh. I don’t know.”

Gotham gets me an answer.
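
The six questions above map neatly onto an append-only audit record. Here is a minimal sketch in Python. It is not Palantir’s implementation, which I have not seen; the class names (AuditRecord, AuditTrail), the field names, and the sample values are all my own inventions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class AuditRecord:
    """One immutable entry. Hypothetical schema, one field per question."""
    content_id: str                       # 1. what data were added
    added_at: datetime                    # 2. when the data were added
    added_by: str                         # 3. who added the data
    index_terms: Tuple[str, ...] = ()     # 4. index terms attached
    entities: Tuple[str, ...] = ()        # 5. entities added to the metadata
    special_terms: Tuple[str, ...] = ()   # 6. special terms or locations


class AuditTrail:
    """Append-only log: entries are added, never edited or deleted."""

    def __init__(self) -> None:
        self._records: List[AuditRecord] = []

    def record(self, content_id: str, added_by: str,
               index_terms=(), entities=(), special_terms=(),
               added_at: Optional[datetime] = None) -> AuditRecord:
        entry = AuditRecord(
            content_id=content_id,
            added_at=added_at or datetime.now(timezone.utc),
            added_by=added_by,
            index_terms=tuple(index_terms),
            entities=tuple(entities),
            special_terms=tuple(special_terms),
        )
        self._records.append(entry)
        return entry

    def provenance(self, content_id: str) -> List[AuditRecord]:
        """Answer 'Where did this information come from?' for one item."""
        return [r for r in self._records if r.content_id == content_id]


# Made-up sample values, for illustration only.
trail = AuditTrail()
trail.record("report-001", added_by="analyst_7",
             index_terms=["procurement"], entities=["Example Corp"],
             special_terms=["Harrod's Creek"])
for entry in trail.provenance("report-001"):
    print(entry.added_by, entry.added_at.isoformat(), entry.index_terms)
```

The frozen dataclass and the absence of any update or delete method are the point of the sketch: an audit trail is only useful if the entries cannot be quietly rewritten.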

For me, explaining the reasoning behind Palantir’s approach warrants a write up. I think quite a few people struggling with problems of data quality and what is called by the horrid term “governance” would find Palantir’s approach of some interest.

Now do I care about Palantir? Nah.

Do I care about bashing Palantir? Nah.

What I do care about is tabloidism taking precedence over substantive technical approaches. From my hollow in rural Kentucky, I see folks looking for “sort of” information.

How about more substantive information? I am fed up with podcasts which recycle old information with fake good cheer. I am weary of leaks. I want to know about Palantir’s approach to search and content processing and have its systems and methods compared to what its direct competitors purport to do.

Yeah, I know this is difficult to do. But nothing worthwhile comes easy, right?

I can hear the millennials shouting, “Wrong, you dinosaur.” Hey, no problem. I own a house. I don’t need tabloidism. I have picked out a rest home, and I own 60 cemetery plots.

Do your thing, dudes and dudettes of “real” journalism.

Stephen E Arnold, April 27, 2017

Voice Search and Big Data: Defining Technologies for 2017

April 20, 2017

I read “Voice Search and Data: The Two Trends That Will Shape Online Marketing in 2017.” If the story is accurate, figuring out what people say and making sense of data (lots of data) will create new opportunities for innovators.

The article states:

Advancements in voice search and artificial intelligence (AI) will drive rich answers that will help marketers understand the customer intent behind I-want-to-go, I-want-to-know, I-want-to-buy and I-want-to-do micro-moments. Google has developed algorithms to cater directly to the search intent of the customers behind these queries, enabling customers to find the right answers quickly.

My view is that the article is correct in its assessment.

Where the article and I differ boils down to search engine optimization. The idea that voice search and Big Data will make fooling the relevance algorithms of Bing, Google, and Yandex a windfall for search engine optimization experts is partially true. Marketing whiz kids will do and say many things to deliver results that do not answer my query or meet my expectation of a “correct” answer.

My view is that the proliferation of systems which purport to understand human utterances in text and voice-to-text conversions will reveal that error rates of 60 to 75 percent are not good enough. Errors can be buried in long lists of results. They can be sidestepped if a voice enabled system works from a set of rules confined to a narrow topic domain.

Open the door to natural language parsing, and the error rates which once were okay become a liability. In my opinion, this will set off a scramble among companies struggling to get their smart software to provide information that customers accept and use repeatedly. Fail and customer turnover can be a fatal knife wound to the heart of an organization. The cost of replacing a paying customer is high. Companies need to keep the customers they have with technology that helps keep paying customers smiling.

What companies are able to provide higher accuracy linguistic functions? There are dozens of companies which assert that their systems can extract entities, figure out semantic relationships, and manipulate content in a handful of languages.

The problem with most of these systems is that certain, very widely used methods collapse when high accuracy is required for large volumes of text. The short cut is to use numerical tricks, and some of those tricks create disconnects between the information the user requests or queries and the results the system displays. Examples range from the difficulties of tuning the Autonomy Digital Reasoning Engine to figuring out how in the heck Google Home arrived at a particular song when the user wanted something else entirely.

Our suggestion is that instead of emailing IBM to sign a deal for that company’s language technology, you might have a more productive result if you contact Bitext. This is a company which has been on my mind. I interviewed the founder and CEO (an IBM alum as I learned) and met with some of the remarkable Bitext team.

I am unable to disclose Bitext’s clients. I can suggest that if you fancy a certain German sports car or use one of the world’s most popular online services, you will be bumping into Bitext’s Digital Linguistic Analysis platform. For more information, navigate to Bitext.com.

The data I reviewed suggested that Bitext’s linguistic platform delivers accuracy significantly better than some of the other systems’ outputs I have reviewed. How accurate? Good enough to get an A in my high school math class.

Stephen E Arnold, April 20, 2017

Image Search: Biased by Language. The Fix? Use Humans!

April 19, 2017

Houston, we (male, female, uncertain) have a problem. Bias is baked into some image analysis and just about every other type of smart software.

The culprit?

Numerical recipes.

The first step in solving a problem is to acknowledge that a problem exists. The second step is more difficult.

I read “The Reason Why Most of the Images That Show Up When You Search for Doctor Are White Men.” The headline identifies the problem. However, what does one do about biases rooted in human utterance?

My initial thought was to eliminate human utterances. No fancy dancing required. Just let algorithms do what algorithms do. I realized that although this approach has a certain logical completeness, implementation may meet with a bit of resistance.

What does the write up have to say about the problem? (Remember. The fix is going to be tricky.)

I learned:

Research from Princeton University suggests that these biases, like associating men with doctors and women with nurses, come from the language taught to the algorithm. As some data scientists say, “garbage in, garbage out”: Without good data, the algorithm isn’t going to make good decisions.

Okay, right coast thinking. I feel more comfortable.

What does the write up present as wizard Aylin Caliskan’s view of the problem? A post doctoral researcher seems to be a solid choice for a source. I assume the wizard is a human, so perhaps he, she, it is biased? Hmmm.

I highlighted in true blue several passages from the write up / interview with he, she, it. Let’s look at three statements, shall we?

Regarding genderless languages like Turkish:

when you directly translate, and “nurse” is “she,” that’s not accurate. It should be “he or she or it” is a nurse. We see that it’s making a biased decision—it’s a very simple example of machine translation, but given that these models are incorporated on the web or any application that makes use of textual data, it’s the foundation of most of these applications. If you search for “doctor” and look at the images, you’ll see that most of them are male. You won’t see an equal male and female distribution.

If accurate, this observation means that the “fix” is going to be difficult. Moving from a language without gender identification to a language with gender identification requires changing the target language. Easy for software. Tougher for a human. If the language and its associations are anchored in the brain of a target language speaker, change may be, how shall I say it, a trifle difficult. My fix looks pretty good at this point.

And what about images and videos? I learned:

Yes, anything that text touches. Images and videos are labeled so they can be used on the web. The labels are in text, and it has been shown that those labels have been biased.

And the fix is a human doing the content selection, indexing, and dictionary tweaking. Not so fast. Indexing with humans is very expensive. Don’t believe me? Download 10,000 Wikipedia articles and hire some folks to index them from the controlled term list humans set up. Let me know if you can hit $17 per indexed article. My hunch is that you will exceed this target by several orders of magnitude. (Want to know where the number comes from? Contact me and we can discuss a for fee deal for this high value information.)
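
For readers who want to run the numbers, here is a back-of-the-envelope sketch in Python. The 10,000 articles and the $17 target come from the paragraph above. The minutes per article and the hourly rate are hypothetical inputs I made up to show the arithmetic; plug in your own.

```python
def total_indexing_cost(articles: int, cost_per_article: float) -> float:
    """Budget needed to index a collection at a given unit cost."""
    return articles * cost_per_article


def cost_per_article(minutes_per_article: float, hourly_rate: float) -> float:
    """Unit cost implied by indexer speed and pay."""
    return minutes_per_article / 60.0 * hourly_rate


TARGET = 17.0        # dollars per indexed article, from the text
ARTICLES = 10_000    # from the text

# Hypothetical inputs, not measured values.
unit = cost_per_article(minutes_per_article=30, hourly_rate=40.0)

print(f"Budget at target: ${total_indexing_cost(ARTICLES, TARGET):,.0f}")
print(f"Budget at assumed speed and pay: ${total_indexing_cost(ARTICLES, unit):,.0f}")
print("Target met" if unit <= TARGET else "Target exceeded")
```

The sketch ignores training, quality checks, and rework, each of which pushes the real unit cost higher.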

How does the write up solve the problem? Here’s the capper:

…you cannot directly remove the bias from the dataset or model because it’s giving a very accurate representation of the world, and that’s why we need a specialist to deal with this at the application level.

Notice that my solution is to eliminate humans entirely. Why? The pipe dream of humans doing indexing won’t fly due to [a] time, [b] cost, [c] the massive flows of data to index. Forget the mother of all bombs.

Think about the mother of all indexing backlogs. The gap would make the Modern Language Association’s “gaps” look like a weekend catch up party. Is this a job for the operating system for machine intelligence?

Stephen E Arnold, April 19, 2017

Smart Software, Dumb Biases

April 17, 2017

Math is objective, right? Not really. Developers of artificial intelligence systems, what I call smart software, rely on what they learned in math school. If you have flipped through math books ranging from the Googler’s tome on artificial intelligence, Artificial Intelligence: A Modern Approach, to the musings of the ACM’s journals, you see the same methods recycled. Sure, the algorithms are given a bath and their whiskers are cropped. But underneath that show dog’s sleek appearance is a familiar pooch. K-means. We have k-means. Decision trees? Yep, decision trees.

What happens when developers feed content into Rube Goldberg machines constructed of mathematical procedures known and loved by math wonks the world over?

The answer appears in “Semantics Derived Automatically from Language Corpora Contain Human Like Biases.” The headline says it clearly, “Smart software becomes as wild and crazy as a group of Kentucky politicos arguing in a bar on Friday night at 2:15 am.”

Biases are expressed and made manifest.

The article in Science reports with considerable surprise it seems to me:

word embeddings encode not only stereotyped biases but also other knowledge, such as the visceral pleasantness of flowers or the gender distribution of occupations.

Ah, ha. Smart software learns biases. Perhaps “smart” correlates with bias?
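
How does one measure a bias in word embeddings? The basic idea is to compare how close a word’s vector sits to one set of attribute words versus another, using cosine similarity. Here is a toy sketch in Python. The three-dimensional vectors are numbers I made up so the example runs; they are not from the Science paper or from any trained model, and the association function is a simplified version of the idea, not the researchers’ exact statistic.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(word, attrs_a, attrs_b, vectors):
    """Mean similarity to attribute set A minus mean similarity to set B.

    Positive means the word leans toward A, negative toward B.
    """
    mean_a = sum(cosine(vectors[word], vectors[a]) for a in attrs_a) / len(attrs_a)
    mean_b = sum(cosine(vectors[word], vectors[b]) for b in attrs_b) / len(attrs_b)
    return mean_a - mean_b


# Invented toy vectors, for illustration only.
TOY_VECTORS = {
    "doctor": (0.9, 0.3, 0.1),
    "nurse": (0.2, 0.9, 0.1),
    "he": (1.0, 0.1, 0.0),
    "she": (0.1, 1.0, 0.0),
}

for occupation in ("doctor", "nurse"):
    score = association(occupation, ["he"], ["she"], TOY_VECTORS)
    leaning = "he" if score > 0 else "she"
    print(f"{occupation}: leans toward '{leaning}'")
```

With real embeddings the vectors come from the training corpus, which is where the human utterances, and their biases, get in.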

The canny whiz kids who did the research crawfish a bit:

We stress that we replicated every association documented via the IAT that we tested. The number, variety, and substantive importance of our results raise the possibility that all implicit human biases are reflected in the statistical properties of language. Further research is needed to test this hypothesis and to compare language with other modalities, especially the visual, to see if they have similarly strong explanatory power.

Yep, nothing like further research to prove that when humans build smart software, “magic” happens. The algorithms manifest biases.

What the write up did not address is a method for developing less biased smart software. Is such a method beyond the ken of computer scientists?

To get more information about this question, I asked one of the world leaders in the field of computational linguistics, Dr. Antonio Valderrabanos, the founder and chief executive officer at Bitext. Dr. Valderrabanos told me:

We use syntactic relations among words instead of using n-grams and similar statistical artifacts, which don’t understand word relations. Bitext’s Deep Linguistic Analysis platform can provide phrases or meaningful relationships to uncover more textured relationships. Our analysis will provide better content to artificial intelligence systems using corpuses of text to learn.
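
To make the contrast concrete, here is a small Python sketch. The n-gram function is the standard sliding window. The syntactic relations are triples I wrote by hand for one sentence; they are not output from Bitext’s platform or from any parser, and the relation labels are my own.

```python
def ngrams(tokens, n=2):
    """All runs of n adjacent tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "the nurse who treated the doctor was tired"
tokens = sentence.split()
bigrams = ngrams(tokens)

# Hand-written (head, relation, dependent) triples, for illustration only.
relations = [
    ("tired", "subject", "nurse"),
    ("treated", "subject", "nurse"),
    ("treated", "object", "doctor"),
]

# Bigrams see only neighbors, so "doctor was" appears as a pair...
print(("doctor", "was") in bigrams)
# ...while the actual subject of "tired" is five tokens away and is missed.
print(("nurse", "tired") in bigrams)
# The relation triples record who was tired regardless of distance.
print([dep for head, rel, dep in relations
       if head == "tired" and rel == "subject"])
```

A system counting adjacent words would associate “doctor” with “tired.” A system working from word relations would not.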

Bitext’s approach is explained in the exclusive interview which appeared in Search Wizards Speak on April 11, 2017. You can read the full text of the interview at this link and review the public information about the breakthrough DLA platform at www.bitext.com.

It seems to me that Bitext has made linguistics the operating system for artificial intelligence.

Stephen E Arnold, April 17, 2017

Bitext: Exclusive Interview with Antonio Valderrabanos

April 11, 2017

On a recent trip to Madrid, Spain, I was able to arrange an interview with Dr. Antonio Valderrabanos, the founder and CEO of Bitext. The company has its primary research and development group in Las Rosas, the high-technology complex a short distance from central Madrid. The company has an office in San Francisco and a number of computational linguists and computer scientists in other locations. Dr. Valderrabanos worked at IBM in an adjacent field before moving to Novell and then making the jump to his own start up. The hard work required to invent a fundamentally new way to make sense of human utterance is now beginning to pay off.

Dr. Antonio Valderrabanos, founder and CEO of Bitext. Bitext’s business is growing rapidly. The company’s breakthroughs in deep linguistic analysis solve many difficult problems in text analysis.

Founded in 2008, the firm specializes in deep linguistic analysis. The systems and methods invented and refined by Bitext improve the accuracy of a wide range of content processing and text analytics systems. What’s remarkable about the Bitext breakthroughs is that the company supports more than 40 different languages, and its platform can support additional languages with sharp reductions in the time, cost, and effort required by old-school systems. With the proliferation of intelligent software, Bitext, in my opinion, puts the digital brains in overdrive. Bitext’s platform improves the accuracy of many smart software applications, ranging from customer support to business intelligence.

In our wide ranging discussion, Dr. Valderrabanos made a number of insightful comments. Let me highlight three and urge you to read the full text of the interview at this link. (Note: this interview is part of the Search Wizards Speak series.)

Linguistics as an Operating System

One of Dr. Valderrabanos’ most startling observations addresses the future of operating systems for increasingly intelligent software and applications. He said:

Linguistic applications will form a new type of operating system. If we are correct in our thought that language understanding creates a new type of platform, it follows that innovators will build more new things on this foundation. That means that there is no endpoint, just more opportunities to realize new products and services.

Better Understanding Has Arrived

Some of the smart software I have tested is unable to understand what seem to be very basic instructions. The problem, in my opinion, is context. Most smart software struggles to figure out the knowledge cloud which embraces certain data. Dr. Valderrabanos observed:

Search is one thing. Understanding what human utterances mean is another. Bitext’s proprietary technology delivers understanding. Bitext has created an easy to scale and multilingual Deep Linguistic Analysis or DLA platform. Our technology reduces costs and increases user satisfaction in voice applications or customer service applications. I see it as a major breakthrough in the state of the art.

If he is right, the Bitext DLA platform may be one of the next big things in technology. The reason? As smart software becomes more widely adopted, the need to make sense of data and text in different languages becomes increasingly important. Bitext may be the digital differential that makes the smart applications run the way users expect them to.

Snap In Bitext DLA

Advanced technology like Bitext’s often comes with a hidden cost. The advanced system works well in a demonstration or a controlled environment. When that system has to be integrated into “as is” systems from other vendors or from a custom development project, difficulties can pile up. Dr. Valderrabanos asserted:

Bitext DLA provides parsing data for text enrichment for a wide range of languages, for informal and formal text and for different verticals to improve the accuracy of deep learning engines and reduce training times and data needs. Bitext works in this way with many other organizations’ systems.

When I asked him about integration, he said:

No problems. We snap in.

I am interested in Bitext’s technical methods. In the last year, he has signed deals with companies like Audi, Renault, a large mobile handset manufacturer, and an online information retrieval company.

When I thanked him for his time, he was quite polite. But he did say, “I have to get back to my desk. We have received several requests for proposals.”

Las Rosas looked quite a bit like Silicon Valley when I left the Bitext headquarters. Despite the thousands of miles separating Madrid from the US, interest in Bitext’s deep linguistic analysis is surging. Silicon Valley has its charms, and now it has a Bitext US office for what may be the fastest growing computational linguistics and text analysis system in the world. Worth watching this company I think.

For more about Bitext, navigate to the firm’s Web site at www.bitext.com.

Stephen E Arnold, April 11, 2017

Upgraded Social Media Monitoring

February 20, 2017

Analytics are catching up to content. In a recent ZDNet article, “Digimind partners with Ditto to add image recognition to social media monitoring,” we are reminded that images reign supreme on social media. Between Pinterest, Snapchat and Instagram, messages are often conveyed through images as opposed to text. Capitalizing on this, intelligence software company Digimind has announced a partnership with Ditto Labs to introduce image-recognition technology into its social media monitoring software, Digimind Social. We learned:

The Ditto integration lets brands identify the use of their logos across Twitter no matter the item or context. The detected images are then collected and processed on Digimind Social in the same way textual references, articles, or social media postings are analysed. Logos that are small, obscured, upside down, or in cluttered image montages are recognised. Object and scene recognition means that brands can position their products exactly where their customers are using them. Sentiment is measured by the amount of people in the image and counts how many of them are smiling. It even identifies objects such as bags, cars, car logos, or shoes.

It was only a matter of time before these types of features emerged in social media monitoring. For years now, images have been shown to increase engagement even on platforms that began focused more on text. Will we see more watermarked logos on images? More creative ways to visually identify brands? Both are likely and we will be watching to see what transpires.

Megan Feil, February 20, 2017

Alleged Google Loophole Lets Fake News Flow

January 1, 2017

I read a write up which, like 99 percent of the information available for free via the Internet, is 100 percent accurate.

The write up’s title tells the tale: “Google Does a Better Job with Fake News Than Facebook, but There’s a Big Loophole It Hasn’t Fixed.” What’s the loophole? The write up reports:

…the “newsy” modules that sit at the top of many Google searches (the “In the news” section on desktop, and the “Top stories” section on mobile) don’t pull content straight from Google News. They pull from all sorts of content available across the web, and can include sites not approved by Google News. This is particularly confusing for users on the desktop version of Google’s site, where the “In the news” section lives. Not only does the “In the news” section literally have the word “news” in its name, but the link at the bottom of the module, which says “More news for…,” takes you to the separate Google News page, which is comprised only of articles that Google’s editorial system has approved.

So why isn’t the “In the news” section just the top three Google News results?

The short answer is because Google sees Google Search and Google News as separate products.

The word “news” obviously does not mean news. We reported last week about Google’s effort to define “monopoly” for the European Commission investigating allegations of Google’s being frisky with its search results. News simply needs to be understood in the Google contextual lexicon.

The write up helps me out with this statement:

So why isn’t the “In the news” section just the top three Google News results? The short answer is because Google sees Google Search and Google News as separate products.

Logical? From Google’s point of view absolutely crystal clear.

The write up amplifies the matter:

Google does, however, seem to want to wipe fake news from its platform. “From our perspective, there should just be no situation where fake news gets distributed, so we are all for doing better here,” Google CEO Sundar Pichai said recently. After the issue of fake news entered the spotlight after the election, Google announced it would ban fake-news sites from its ad network, choking off their revenue. But even if Google’s goal is to kick fake-news sites out of its search engine, most Google users probably understand that Google search results don’t carry the editorial stamp of approval from Google.

Fake news, therefore, is mostly under control. The Google users just have to bone up on how Google works to make information available.

What about mobile?

Google AMP is not news; AMP content labeled as “news” is part of the AMP technical standard which speeds up mobile page display.

Google, like Facebook, may tweak its approach to news.

Beyond Search would like to point out that wild and crazy news releases from big time PR dissemination outfits can propagate a range of information (some mostly accurate and some pretty crazy). The handling of high value sources allows some questionable content to flow. Oh, there are other ways to inject questionable content into the Web indexing systems.

There is not one loophole. There are others. Who wants to nibble into revenue? Not Beyond Search.

Stephen E Arnold, January 1, 2017

Smarter Content for Contentier Intelligence

December 28, 2016

I spotted a tweet about making smart content smarter. It seems that if content is smarter, then intelligence becomes contentier. I loved my logic class in 1962.

Here’s the diagram from this tweet. Hey, if the link is wonky, just attend the conference and imbibe the intelligence directly, gentle reader.

image

The diagram carries the identifier Data Ninja, which echoes Palantir’s use of the word ninja for some of its Hobbits. Data Ninja’s diagram has three parts. I want to focus on the middle part:

image

What I found interesting is that instead of a single block labeled “content processing,” the content processing function is broken into several parts. These are:

A Data Ninja API

A Data Ninja “knowledgebase,” which I think is an iPhrase-type or TeraText type of method. Not familiar with iPhrase and TeraText? Feel free to browse the descriptions at the links.

A third component in the top box is the statement “analyze unstructured text.” This may refer to indexing and such goodies as entity extraction.

The second box performs “text analysis.” Obviously this process is different from “the analyze unstructured text” step; otherwise, why run the same analyses again? The second box performs what may be clustering of content into specific domains. This is important because a “terminal” in transportation may be different from a “terminal” in a cloud hosting facility. Disambiguation is important because the terminal may be part of a diversified transportation company’s computing infrastructure. I assume Data Ninja’s methods handle this parsing of “concepts” without many errors.
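
What might domain assignment look like in its simplest form? Here is a Python sketch which picks a domain by counting cue words near the ambiguous term. I have no visibility into Data Ninja’s methods; the cue lists, the function name guess_domain, and the sample sentences are all hypothetical, and a production system would need far more than keyword overlap.

```python
import re

# Hypothetical cue words per domain, for illustration only.
DOMAIN_CUES = {
    "transportation": {"bus", "freight", "cargo", "passengers", "rail", "port"},
    "cloud hosting": {"server", "ssh", "shell", "login", "rack", "datacenter"},
}


def guess_domain(text, cues=DOMAIN_CUES):
    """Return the domain whose cue words overlap the text most, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {domain: len(words & cue_words) for domain, cue_words in cues.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_domain("Freight arrives at the terminal by rail."))
print(guess_domain("Open a terminal and ssh into the server."))
print(guess_domain("The terminal is closed."))
```

The third sentence shows the catch: with no cue words in range, the sketch gives up. The diversified transportation company with its own computing infrastructure will trip up this approach as well.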

Once the selection of a domain area has been performed, the system appears to perform four specific types of operations as the Data Ninja practice their katas. These are the smart components:

  • Smart sentiment; that is, is the content object weighted “positive” or “negative”, “happy” or “sad”, or green light or red light, etc.
  • Smart data; that is, I am not sure what this means
  • Smart content; that is, maybe a misclassification because the end result should be smart content, but the diagram shows smart content as a subcomponent within the collection of procedures/assertions in the middle part of the diagram
  • Smart learning; that is, the Data Ninja system is infused with artificial intelligence, smart software, or machine learning (perhaps the three buzzwords are combined in practice, not just in diagram labeling?)

The end result is an iPhrase-type representation of data. (Note that this approach infuses TeraText, MarkLogic, and other systems which transform unstructured data to metadata tagged structured information.)

The diagram then shows a range of services “plugging” into the box performing the functions referenced in my description of the middle box.

If the system works as depicted, Data Ninjas may have the solution to the federation challenge which many organizations face. Smarter content should deliver contentier intelligence or something along that line.

Stephen E Arnold, December 28, 2016

Search Email: Not Yours. A Competitor’s.

December 2, 2016

I read “This Startup Helps You Deep Snoop Competitor Email Marketing.” I like that “deep snoop” thing. That works pretty well until one loses access to content to analyze. Just ask Geofeedia which is scrambling since it lost access to Twitter and other social media content.

The outfit Rival Explorer offers:

a tool designed to help users improve their email marketing strategy and product pricing and promotion through comprehensive monitoring of their competitor’s email newsletters. After creating a free account, users can browse through a database of marketing emails from over 50,000 brands. Rival Explorer offers access to a number of different email types, including newsletters, cart abandonment emails, welcome emails, and other transactional messages.

In terms of information access, the Rival Explorer customers:

can search by brand, subject, message body, date, day of week, industry, category, and custom tags and keywords. When users select a message, they’re able to view the sender email, subject line, and timestamp of the messages. In addition to those details, users can view the emails as they appear on tablets and smartphones, plus they also can toggle images to get a better idea of design and copy strategy.

You can get more information at this link. Public content and marketing information can be useful it seems.

Stephen E Arnold, December 2, 2016

Pitching All Source Analysis: Just Do Dark Data. Really?

November 25, 2016

I read “Shedding Light on Dark Data: How to Get Started.” Okay, Dark Data. Like Big Data, the phrase is the fruit of the nomads at Gartner Group. The outfit embracing this sort of old concept is OdinText. Spoiler: I thought the write up was going to identify outfits like BAE Systems, Centrifuge Systems, IBM Analyst’s Notebook, Palantir Technologies, and Recorded Future (an In-Q-Tel and Google backed outfit). Was I wrong? Yes.

The write up explains that a company has to tackle a range of information in order to be aware, informed, or insightful. Pick one. Here’s the list of Dark Data types, which the aforementioned companies have been working to capture, analyze, and make sense of for almost 20 years in the case of NetReveal (Detica) and Analyst’s Notebook. The other companies are comparative spring chickens with an average of seven years’ experience in this effort.

  • Customer relationship management data
  • Data warehouse information
  • Enterprise resource planning information
  • Log files
  • Machine data
  • Mainframe data
  • Semi structured information
  • Social media content
  • Unstructured data
  • Web content.

I think the company or non profit which tries to suck in these data types and process them may run into some cost and legal issues. Analyzing tweets and Facebook posts can be useful, but there are costs and license fees required. Frankly not even law enforcement and intelligence entities are able to do a Cracker Jack job with these content streams due to their volume, cryptic nature, and pesky quirks related to metadata tagging. But let’s move on. To this statement:

Phone transcripts, chat logs and email are often dark data that text analytics can help illuminate. Would it be helpful to understand how personnel deal with incoming customer questions? Which of your products are discussed with which of your other products or competitors’ products more often? What problems or opportunities are mentioned in conjunction with them? Are there any patterns over time?

Yep, that will work really well in many legal environments. Phone transcripts are particularly exciting.

How does one think about Dark Data? Easy. Here’s a visualization from the OdinText folks:

image

Notice that there are data types in this diagram NOT included in the listing above. I can’t figure out if this is just carelessness or an insight which escapes me.

How does one deal with Dark Data? OdinText, of course. Yep, of course. Easy.

Stephen E Arnold, November 25, 2016
