July 8, 2016
Another day, another merger. PR Newswire released a story, VirtualWorks and Language Tools Announce Merger, which covers Virtual Works’ purchase of Language Tools. In Language Tools, they will inherit computational linguistics and natural language processing technologies. Virtual Works is an enterprise search firm. Erik Baklid, Chief Executive Officer of VirtualWorks is quoted in the article,
“We are incredibly excited about what this combined merger means to the future of our business. The potential to analyze and make sense of the vast unstructured data that exists for enterprises, both internally and externally, cannot be understated. Our underlying technology offers a sophisticated solution to extract meaning from text in a systematic way without the shortcomings of machine learning. We are well positioned to bring to market applications that provide insight, never before possible, into the vast majority of data that is out there.”
This is another case of a company positioning themselves as a leader in enterprise search. Are they anything special? Well, the news release mentions several core technologies will be bolstered due to the merger: text analytics, data management, and discovery techniques. We will have to wait and see what their future holds in regards to the enterprise search and business intelligence sector they seek to be a leader in.
June 27, 2016
Trainspotting is a collection of short stories or a novel presented as a series of short stories by Irvine Welsh. The fun lovers in the fiction embrace avocations which seem to be addictive. The thrill is the thing. Now I think I have identified Palantir spotting.
Navigate to “Palantir Seeks to Muzzle Former Employees.” I am not too interested in the allegations in the write up. What is interesting is that the article is one of what appears to be of series of stories about Palantir Technologies enriched with non public documents.
The Thingverse muzzle might be just the ticket for employees who want to chatter about proprietary information. I assume the muzzle is sanitary and durable, comes in various sizes, and adapts to the jaw movement of the lucky dog wearing the gizmo.
Why use the phrase “Palantir spotting.” It seems to me that making an outfit which provides services and software to government entities is an unusual hobby. I, for example, lecture about the Dark Web, how to recognize recycled analytics algorithms and their assorted “foibles,” and how to find information in the new, super helpful Google Web search system.
Poking the innards of an outfit with interesting software and some wizards who might be a bit testy is okay if done with some Onion type or Colbert like humor. Doing what one of my old employers did in the 1970s to help ensure that company policies remain inside the company is old hat to me.
In the write up, I noted:
The Silicon Valley data-analysis company, which recently said it would buy up to $225 million of its own common stock from current and former staff, has attached some serious strings to the offer. It is requiring former employees who want to sell their shares to renew their non-disclosure agreements, agree not to poach Palantir employees for 12 months, and promise not to sue the company or its executives, a confidential contract reviewed by BuzzFeed News shows. The terms also dictate how former staff can talk to the press. If they get any inquiries about Palantir from reporters, the contract says, they must immediately notify Palantir and then email the company a copy of the inquiry within three business days. These provisions, which haven’t previously been reported, show one way Palantir stands to benefit from the stock purchase offer, known as a “liquidity event.”
Okay, manage information flow. In my experience, money often comes with some caveats. At one time I had lots and lots of @Home goodies which disappeared in a Sillycon Valley minute. The fine print for the deal covered the disappearance. Sigh. That’s life with techno-financial wizards. It seems life has not changed too much since the @Home affair decades ago.
I expect that there will be more Palantir centric stories. I will try to note these when they hit my steam powered radar detector in Harrod’s Creek. My thought is that like the protagonists in Trainspotting, Palantir spotting might have some after effects.
I keep asking myself this question:
How do company confidential documents escape the gravitational field of a comparatively secretive company?
The Palantir spotters are great data gatherers or those with access to the documents are making the material available. No answers yet. Just that question about “how”.
Stephen E Arnold, June 27, 2016
June 1, 2016
A few days ago, I stumbled upon a copy of a letter from the GAO concerning Palantir Technologies dated May 18, 2016. The letter became available to me a few days after the 18th, and the US holiday probably limited circulation of the document. The letter is from the US Government Accountability Office and signed by Susan A. Poling, general counsel. There are eight recipients, some from Palantir, some from the US Army, and two in the GAO.
Has the US Army put Palantir in an untenable spot? Is there a deus ex machina about to resolve the apparent checkmate?
The letter tells Palantir Technologies that its protest of the DCGS Increment 2 award to another contractor is denied. I don’t want to revisit the history or the details as I understand them of the DCGS project. (DCGS, pronounced “dsigs”, is a US government information fusion project associated with the US Army but seemingly applicable to other Department of Defense entities like the Air Force and the Navy.)
The passage in the letter I found interesting was:
While the market research revealed that commercial items were available to meet some of the DCGS-A2 requirements, the agency concluded that there was no commercial solution that could meet all the requirements of DCGS-A2. As the agency explained in its report, the DCGS-A2 contractor will need to do a great deal of development and integration work, which will include importing capabilities from DCGS-A1 and designing mature interfaces for them. Because the agency concluded that significant portions of the anticipated DCSG-A2 scope of work were not available as a commercial product, the agency determined that the DCGS-A2 development effort could not be procured as a commercial product under FAR part 12 procedures. The protester has failed to show that the agency’s determination in this regard was unreasonable.
The “importing” point is a big deal. I find it difficult to imagine that IBM i2 engineers will be eager to permit the Palantir Gotham system to work like one happy family. The importation and manipulation of i2 data in a third party system is more difficult than opening an RTF file in Word in my experience. My recollection is that the unfortunate i2-Palantir legal matter was, in part, related to figuring out how to deal with ANB files. (ANB is i2 shorthand for Analysts Notebook’s file format, a somewhat complex and closely-held construct.)
Net net: Palantir Technologies will not be the dog wagging the tail of IBM i2 and a number of other major US government integrators. The good news is that there will be quite a bit of work available for firms able to support the prime contractors and the vendors eligible and selected to provide for-fee products and services.
Was this a shoot-from-the-hip decision to deny Palantir’s objection to the award? No. I believe the FAR procurement guidelines and the content of the statement of work provided the framework for the decision. However, context is important as are past experiences and perceptions of vendors in the running for substantive US government programs.
May 19, 2016
I read “The Real Lesson for Data Science That is Demonstrated by Palantir’s Struggles · Simply Statistics.” I love write ups that plunk the word statistics near simple.
Here’s the passage I highlighted in money green:
… What is the value of data analysis?, and secondarily, how do you communicate that value?
I want to step away from the Palantir Technologies’ example and consider a broader spectrum of outfits tossing around the jargon “big data,” “analytics,” and synonyms for smart software. One doesn’t communicate value. One finds a person who needs a solution and crafts the message to close the deal.
When a company and its perceived technology catches the attention of allegedly informed buyers, a bandwagon effort kicks in. Talks inside an organization leads to mentions in internal meetings. The vendor whose products and services are the subject of these comments begins to hint at bigger and better things at conferences. Then a real journalist may catch a scent of “something happening” and writes an article. Technical talks at niche conferences generate wonky articles usually without dates or footnotes which make sense to someone without access to commercial databases. If a social media breeze whips up the smoldering interest, then a fire breaks out.
A start up should be so clever, lucky, or tactically gifted to pull off this type of wildfire. But when it happens, big money chases the outfit. Once money flows, the company and its products and services become real.
The problem with companies processing a range of data is that there are some friction inducing processes that are tough to coat with Teflon. These include:
- Taking different types of data, normalizing it, indexing it in a meaningful manner, and creating metadata which is accurate and timely
- Converting numerical recipes, many with built in threshold settings and chains of calculations, into marching band order able to produce recognizable outputs.
- Figuring out how to provide an infrastructure that can sort of keep pace with the flows of new data and the updates/corrections to the already processed data.
- Generating outputs that people in a hurry or in a hot zone can use to positive effect; for example, in a war zone, not get killed when the visualization is not spot on.
The write up focuses on a single company and its alleged problems. That’s okay, but it understates the problem. Most content processing companies run out of revenue steam. The reason is that the licensees or customers want the systems to work better, faster, and more cheaply than predecessor or incumbent systems.
The vast majority of search and content processing systems are flawed, expensive to set up and maintain, and really difficult to use in a way that produces high reliability outputs over time. I would suggest that the problem bedevils a number of companies.
Some of those struggling with these issues are big names. Others are much smaller firms. What’s interesting to me is that the trajectory content processing companies follow is a well worn path. One can read about Autonomy, Convera, Endeca, Fast Search & Transfer, Verity, and dozens of other outfits and discern what’s going to happen. Here’s a summary for those who don’t want to work through the case studies on my Xenky intel site:
Stage 1: Early struggles and wild and crazy efforts to get big name clients
Stage 2: Making promises that are difficult to implement but which are essential to capture customers looking actively for a silver bullet
Stage 3: Frantic building and deployment accompanied with heroic exertions to keep the customers happy
Stage 4: Closing as many deals as possible either for additional financing or for licensing/consulting deals
Stage 5: The early customers start grousing and the momentum slows
Stage 6: Sell off the company or shut down like Delphes, Entopia, Siderean Software and dozens of others.
The problem is not technology, math, or Big Data. The force which undermines these types of outfits is the difficulty of making sense out of words and numbers. In my experience, the task is a very difficult one for humans and for software. Humans want to golf, cruise Facebook, emulate Amazon Echo, or like water find the path of least resistance.
Making sense out of information when someone is lobbing mortars at one is a problem which technology can only solve in a haphazard manner. Hope springs eternal and managers are known to buy or license a solution in the hopes that my view of the content processing world is dead wrong.
So far I am on the beam. Content processing requires time, humans, and a range of flawed tools which must be used by a person with old fashioned human thought processes and procedures.
Value is in the eye of the beholder, not in zeros and ones.
Stephen E Arnold, May 19, 2016
April 19, 2016
I read an article in Jeff Bezos’ newspaper. The title was “We Analyzed the Names of Almost Every Chinese Restaurant in America. This Is What We Learned.” The almost is a nifty way of slip sliding around the sampling method which used restaurants listed in Yelp. Close enough for “real” journalism.
Using the notion of a frequency count, the write up revealed:
- The word appearing most frequently in the names of the sample was “restaurant.”
- The words “China” and “Chinese” appear in about 15,000 of the sample’s restaurant names
- “Express” is a popular word, not far ahead of “panda”.
The word list and their frequencies were used to generate a word cloud:
To answer the question where Chinese food is most popular in the US, the intrepid data wranglers at Jeff Bezos’ newspaper output a map:
Amazing. I wonder if law enforcement and intelligence entities know that one can map data to discover things like the fact that the word “restaurant” is the most used word in a restaurant’s name.
Stephen E Arnold, April 19, 2016
April 8, 2016
The chatter about smart is loud. I cannot hear the mixes on my Creamfields 2014 CD. Mozart, you are a goner.
If you want to cook up some smart algorithms to pick music or drive your autonomous vehicle without crashing into a passenger carrying bus, navigate to “Top 10 Machine Learning Algorithms.”
The write up points out that just like pop music, there is a top 10 list. More important in my opinion is the concomitant observation that smart software may be based on a limited number of procedures. Hey, this stuff is taught in many universities. Go with what you know maybe?
What are the top 10? The write up asserts:
- Linear regression
- Logistic regression
- Linear discriminant analysis
- Classification and regression trees
- Naive Bayes
- K nearest neighbors
- Learning vector quantization
- Support vector machines
- Bagged decision trees and random forest
- Boosting and AdaBoost.
The article tosses in a bonus too: Gradient descent.
What is interesting is that there is considerable overlap with the list I developed for my lecture on manipulating content processing using shaped or weaponized text strings. How’s that, Ms. Null?
The point is that when systems use the same basic methods, are those systems sufficiently different? If so, in what ways? How are systems using standard procedures configured? What if those configurations or “settings” are incorrect?
Stephen E Arnold, April 8, 2016
March 30, 2016
Short honk: The adventure of Attensity continues. Attensity Europe has renamed itself Sematell Interactive Solutions. You can read about the change here. The news release reminds the reader that Sematell is “the leading provider of interaction solutions.” I am not able to define interaction solutions, but I assume the company named by combining semantic and intelligence will make the “interaction solutions” thing crystal clear. The url is www.sematell.de.
Stephen E Arnold, March 30, 2016
March 16, 2016
Short honk: Want a growth business in a niche function that supports enterprise platforms? Well, gentle reader, look no farther than text analytics. Get your checkbook out and invest in this remarkable sector. It will be huuuuge.
Navigate to “Text Analytics Market to Account for US$12.16 bn in Revenue by 2024.” What is text analytics? How big is text analytics today? How long has text analytics been a viable function supporting content processing?
Ah, good questions, but what’s really important is this passage:
According to this report, the global text analytics market revenue stood at US$2.82 bn in 2015 and is expected to reach US$12.16 bn by 2024, at a CAGR of 17.6% from 2016 to 2024.
I love these estimates. Imagine. Close out your life savings and invest in text analytics. You will receive a CAGR of 17.6 percent which you can cash in and buy stuff in 2024. That’s just eight years.
Worried about the economy? Want to seek the safe shelter of bonds? Forget the worries. If text analytics is so darned hot, why is the consulting firm pitching this estimate writing reports. Why not invest in text analytics?
Answer: Maybe the estimate is a consequence of spreadsheet fever?
Text analytics is a rocket just like the ones Jeff Bezos will use to carry you into space.
Stephen E Arnold, March 16, 2016
February 25, 2016
I am skeptical about lists of problems which hot buzzwords leave in their wake. I read “Why Data Insight Remains Elusive,” which I though was another content marketing pitch to buy, buy, buy. Not so. The write up contains some clearly expressed, common sense reminds for those who want to crunch big data and point and click their way through canned reports. Those who actually took the second semester of Statistics 101 know that ignoring the data quality and the nitty gritty of the textbook procedures can lead to bone head outputs.
The write up identifies some points to keep in mind, regardless of which analytics vendor system a person is using to make more informed or “augmented” decisions.
Here’s the pick of the litter:
- Manage the data. Yep, time consuming, annoying, and essential. Skip this step at your decision making peril.
- Manage the indexing. The buzzword is metadata, but assigning keywords and other indexing items makes the difference when trying to figure out who, what, why, when, and where. Time? Yep, metadata which not even the Alphabet Google thing does particularly well.
- Create data models. Do the textbook stuff. Get the model wrong, and what happens? Failure on a scale equivalent to fumbling the data management processes.
- Visualization is not analytics. Visualization makes outputs of numerical recipes appear in graphical form. Do not confuse Hollywood outputs with relevance, accuracy, or math on point to the problem one is trying to resolve.
- Knee jerking one’s way through analytics. Sorry, reflexes are okay but useless without context. Yep, have a problem, get the data, get the model, test, and examine the outputs.
Common sense. Most basic stuff was in the textbooks for one’s college courses. Too bad more folks did not internalize those floorboards and now seek contractors to do a retrofit. Quite an insight when the bill arrives.
Stephen E Arnold, February 25, 2016
February 10, 2016
I located a list of companies involved in content processing. You may want to add one or more of these to your retirement investment portfolio. Which one will be the next Facebook, Google, or Uber? I know I would love to have a hat or T shirt from each of these outfits:
TEMIS (Expert System)
Stephen E Arnold, February 8, 2016