Big Data Blending Solution
January 20, 2016
I would have used Palantir or maybe our own tools. But an outfit named National Instruments found a different way to perform data blending. “How This Instrument Firm Tackled Big Data Blending” provides a case study and a rah rah for Alteryx. Here’s the paragraph I highlighted:
The software it [National Instruments] selected, from Alteryx, takes a somewhat unique approach in that it provides a visual representation of the data transformation process. Users can acquire, transform, and blend multiple data sources essentially by dragging and dropping icons on a screen. This GUI approach is beneficial to NI employees who aren’t proficient at manipulating data using something like SQL.
The graphical approach has been part of a number of tools. There are also some systems which just figure out where to put what.
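For those who prefer text to icons, here is a minimal sketch of what "acquire, transform, and blend" boils down to once the drag and drop layer is peeled away. The two sources, their field names, and the `blend` function are my own inventions for illustration; nothing here comes from Alteryx or the National Instruments case study.

```python
# Hypothetical sample data: neither source comes from the NI case study.
orders = [
    {"customer_id": 1, "amount": 250.0},
    {"customer_id": 2, "amount": 75.5},
    {"customer_id": 1, "amount": 100.0},
]
customers = [
    {"customer_id": 1, "region": "EMEA"},
    {"customer_id": 2, "region": "APAC"},
]


def blend(orders, customers):
    """Join orders to customers on customer_id, then total the amounts per region."""
    # Transform: build a lookup table from the second source.
    region_by_id = {c["customer_id"]: c["region"] for c in customers}
    totals = {}
    for order in orders:
        # Blend: attach the region to each order; unmatched ids land in UNKNOWN.
        region = region_by_id.get(order["customer_id"], "UNKNOWN")
        totals[region] = totals.get(region, 0.0) + order["amount"]
    return totals


region_totals = blend(orders, customers)
print(region_totals)
```

The sketch works because both sources are tidy rows with a shared key. That is the easy case a visual tool handles well.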
The issue for me is, “What happens to rich media like imagery and unstructured information like email?”
There are systems which handle these types of content.
Another challenge is the dependence on structured relational data tables. Certain types of operations are difficult in this environment.
The write up is interesting, but it reveals that a narrow view of available tools may produce a partial solution.
Stephen E Arnold, January 20, 2016
Boolean Search: Will George Boole Rotate in His Grave?
January 12, 2016
George Boole is, for most math wonks, the father of Boolean logic. This is a nifty way to talk about sets and what they contain. One can perform algebra and differential equations whilst pondering George and his method for thinking about fruits when he went shopping.
In the good old days of search, there was one way to search. One used AND, OR, NOT, and maybe a handful of other logic operators to retrieve information from structured indexes and content. Most folks with a library science degree or a friendly math major can explain Boolean reasonably well. Here’s an example which might even work on CSA ProQuest (née Lockheed Dialog) even today:
CC=77? AND scam?
The systems, when fed the right query, would reply with pretty good precision and recall. Precision meant that most of what came back was on point. Recall meant that most of what should have come back was in the result set.
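For the record, here is a minimal sketch of how AND, OR, and NOT reduce to set operations on an inverted index, plus the arithmetic behind precision and recall. The toy documents, the relevance judgments, and the helper names are my assumptions. I treat a trailing ? as a truncation wildcard, which is how I read the query above. This is not how any commercial system is actually built.

```python
# Toy corpus; the documents and relevance judgments are invented for illustration.
docs = {
    1: "phone scam targets retirees",
    2: "new scams reported by banks",
    3: "bank opens new branch",
    4: "scammer arrested in phone fraud case",
}


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


def truncated(index, stem):
    """Ids of documents holding any term that starts with stem (like scam?)."""
    hits = set()
    for term, ids in index.items():
        if term.startswith(stem):
            hits |= ids
    return hits


def precision_recall(retrieved, relevant):
    """Precision: share of retrieved that is relevant. Recall: share of relevant that is retrieved."""
    true_hits = len(retrieved & relevant)
    precision = true_hits / len(retrieved) if retrieved else 0.0
    recall = true_hits / len(relevant) if relevant else 0.0
    return precision, recall


index = build_index(docs)

and_hits = truncated(index, "scam") & index.get("phone", set())  # scam? AND phone
or_hits = truncated(index, "scam") | truncated(index, "bank")    # scam? OR bank?
not_hits = truncated(index, "scam") - index.get("phone", set())  # scam? NOT phone

relevant = {1, 2, 4}  # assumed human judgment about which documents concern scams
precision, recall = precision_recall(and_hits, relevant)
print(and_hits, or_hits, not_hits, precision, recall)
```

Note the trade off: the AND query returns only on point documents but misses one relevant item. Loosening the query with OR pulls in everything relevant along with the bank branch story.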
I thought about Boole, fruit, and logic when I read “The Best Boolean and Semantic Search Tool.” Was I going to read about SDC’s ORBIT, ESA Quest, or (heaven help me) the original Lexis system?
Nope.
I learned about LinkedIn. Not one word about Palantir’s injecting Boolean logic squarely in the middle of its advanced data management processes. Nope.
LinkedIn. I thought that LinkedIn used open source Lucene, but maybe the company has invested in Exorbyte, Funnelback, or some other information access system.
The write up stated:
If you use any source of human capital data to find and recruit people (e.g., your ATS/CRM, resume databases, LinkedIn, Google, Facebook, Github, etc.) and you really want to understand how to best approach your talent sourcing efforts, I recommend watching this video when you have the time.
Okay, human resource functions. LinkedIn, right.
But there is zero content in the write up. I was pointed to a video called “Become a LinkedIn Search Ninja: Advanced Boolean Search” on YouTube.
Here’s what I learned before I killed the one hour video:
- The speaker is in charge of personnel and responsible for Big Data activities related to human resources
- Search is important to LinkedIn users
- Profiles of people are important
- Use OR. (I found this suggestion amazing.)
- Use iterative, probabilistic, and natural language search, among others. (Yep, that will make sense to personnel professionals.)
Okay. I hit the stop button. Not only will George be rotating, I may have nightmares.
Please, let librarians explicitly trained in online search and retrieval explain methods for obtaining on point results. Failing a friendly librarian, ask someone who has designed a next generation system which provides “helpers” to allow the user to search and get useful outputs.
Entity queries are important. LinkedIn can provide some useful information. The tools to obtain that high value information are a bit more sophisticated than the recommendations in this video.
Stephen E Arnold, January 12, 2016
The Secret Weapon of Predictive Analytics Revealed
January 8, 2016
I like it when secrets are revealed. I learned how to unlock the treasure chest containing predictive analytics’ secret weapon. You can too. Navigate to “Contextual Integration Is the Secret Weapon of Predictive Analytics.”
The write up reports:
Predictive analytics has been around for years, but only now have data teams begun to refine the process to develop more accurate predictions and actionable business insights. The availability of tremendous amounts of data, cheap computation, and advancements in artificial intelligence has presented a massive opportunity for businesses to go beyond their legacy methodologies when it comes to customer data.
And what is the secret?
Contextual transformation.
Here’s the explanation:
A major part of this transformation is the realization that data needs to be looked at from as many angles as possible in an effort to create a multi-dimensional profile of the customer. As a consequence, we view recommendations through the lens of ensembles in which each modeled dimension may be weighted differently based on real-time contextual information. This means that, rather than looking at just transactional information, layering in other types of information, such as behavioral data, gives context and allows organizations to make more accurate predictions.
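My reading of that passage, as a minimal sketch: each "dimension" produces a score, and the context decides how much each score counts. The dimension names, the scores, and the weights below are all hypothetical. The article does not disclose its actual models or weighting scheme.

```python
def ensemble_score(model_scores, context_weights):
    """Weighted average of per-dimension scores, with weights normalized to sum to 1."""
    total = sum(context_weights[name] for name in model_scores)
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    weighted = sum(model_scores[name] * context_weights[name] for name in model_scores)
    return weighted / total


# Hypothetical per-dimension scores for one customer and one candidate recommendation.
scores = {"transactional": 0.2, "behavioral": 0.8}

# Hypothetical contexts: behavior counts for more while browsing,
# purchase history counts for more at checkout.
browsing_weights = {"transactional": 1.0, "behavioral": 3.0}
checkout_weights = {"transactional": 3.0, "behavioral": 1.0}

browsing = ensemble_score(scores, browsing_weights)
checkout = ensemble_score(scores, checkout_weights)
print(browsing, checkout)
```

Same customer, same models, different answer depending on the context. Choosing the weights is where the hard work hides.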
Is this easy?
Nope. The article reminds the reader:
A sound approach follows the scientific method, starting with understanding the business domain and the underlying data that is available. Then data scientists can prepare to test a particular hypothesis, build a model, evaluate results, and refine the model to draw general conclusions.
I would point out that folks at Palantir, Recorded Future, and other outfits have been working for years to deal with integration, math, and sense making.
I wonder if the wonks at these firms have realized that contextual integration is the secret? I assume one could ask IBM Watson or just understand that the difference between interpreting marketing inputs from a closed user base and dealing with slightly more slippery data involves more than one secret.
Stephen E Arnold, January 8, 2016
Dark Web and Tor Investigative Tools Webinar
January 5, 2016
Telestrategies announced on January 4, 2016, a new webinar for active LEA and intel professionals. The one hour program is focused on tactics, new products, and ongoing developments for Dark Web and Tor investigations. The program is designed to provide an overview of public, open source, and commercial systems and products. These systems may be used as standalone tools or integrated with IBM i2 ANB or Palantir Gotham. More information about the program is available from Telestrategies. There is no charge for the program. In 2016, Stephen E Arnold’s new Dark Web Notebook will be published. More information about the new monograph upon which the webinar is based may be obtained by writing benkent2020 at yahoo dot com.
Stephen E Arnold, January 5, 2016
Are Search Unicorns Sub Prime Unicorns?
January 4, 2016
The question is a baffler. Navigate to “Sorting Truth from Myth at Technology Unicorns.” If the link is bad or you have to pay to read the article in the Financial Times, pony up, go to the library, or buy hard copy. Don’t complain to me, gentle reader. Publishers are in need of revenue. Now the write up:
The assumption is that a unicorn exists. What exists are firms with massive amounts of venture funding and billion dollar valuations. I know the money is or was real, but the “sub prime unicorn” is a confection from a money thought leader Michael Moritz. A subprime unicorn is a company “built on the flimsiest of edifices.” Does this mean fairy dust or something more substantial?
According to the write up:
But the way in which private market valuations have become skewed and inflated as start-ups have delayed IPOs raises questions about the financing of innovation. Despite the excitement, venture capital has produced weak returns in recent decades — only a minority of funds have produced rewards high enough to compensate investors for illiquidity and opacity.
Why would funding start ups perform better than a start up financed by mom, dad, and one’s slightly addled, but friendly, great aunt?
The article then makes a reasonably sane point:
With the rise in US interest rates, the era of ultra-cheap financing is ending. As it does, Silicon Valley’s unicorns are losing their mystique and having to work to raise equity, sometimes at valuations below those they achieved before. The promise of private financing is being tested, and there will be disappointments. It does not pay to be dazzled by mythical beasts.
Let’s think a moment about search and content processing. The mid tier consulting firms—the outfits I call azure chip outfits—have generated some pretty crazy estimates about the market size for search and content processing solutions.
The reality is at odds with these speculative, marketing fueled prognostications. Yep, I would include the wizards at IDC who wanted $3,500 to sell an eight page document with my name on it without my permission. Refresh yourself on the IDC Schubmehl maneuver at this link.
Based on my research, two enterprise search outfits broke $150 million in revenues prior to 2011: Endeca tallied an estimated $150 million in revenues and Autonomy reported $700 million in revenues. Both outfits were sold.
Since 2012 exactly zero enterprise search firms have generated more than $700 million in revenues. Now the wild and crazy funding of search vendors has continued apace since 2012. There are a number of search and retrieval companies and some next generation content processing outfits which have ingested tens of millions of dollars.
How many of these outfits have gone public in the zero cost money environment? Based on my records, zero. Why haven’t Attivio, BA Insight, Coveo, Palantir and others cashed in on their technology, surging revenues, and market demand?
There are three reasons:
- The revenues are simply acceptable, not stunning. In the post Fast Search & Transfer era, twiddling the finances carries considerable risks. Think about a guilty decision for a search wizard. Yep, bad.
- The technology is a rehash gilded with new jargon. Take a look at the search and content processing systems, and you find the same methods and functions that have been known and in use for more than 30 years. The flashy interfaces are new, but the plumbing still delivers precision and recall which has hit a glass ceiling at 80 to 90 percent accuracy for the top performing systems. Looking for a recipe with good enough relevance is acceptable. Looking for a bad actor with a significant margin for error is not so good.
- The smart software performs certain functions at a level comparable to the performance of a subject matter index when certain criteria are met. The notion of human editors riding herd on entity and synonym dictionaries is not one that makes customers weep with joy. Smart software helps with some functions, but today’s systems remain anchored in human operators, and the work these folks have to perform to keep the systems in tip top shape is expensive. Think about this human aspect in terms of how Palantir explains architects’ changes to type operators or the role of content intake specialists using the revisioning and similar field operations.
Why do I make this point in the context of unicorns? Search has one or two unicorns. I would suggest Palantir is a unicorn. When I think of Palantir, I consider this item:
To summarize, only a small number of companies reach the IPO stage.
Also, the HP Autonomy “deal” is a quasi unicorn. IBM’s investment in Watson is a potential unicorn if and when IBM releases financial data about its TV show champion.
Then there are a number of search and content processing creatures which could be hybrids of a horse and a donkey. The investors are breeders who hope that the offspring become champions. Long shots all.
The Financial Times’s article expresses a broad concept. The activities of the search and content processing vendors in the next 12 to 18 months will provide useful data about the genetic make up of some technology lab creations.
Stephen E Arnold, January 4, 2016
Weekly Watson: In the Real World
January 2, 2016
I want to start off the New Year with a look at Watson in the real world. My real world is circumscribed by abandoned coal mines and hollows in rural Kentucky. I am pretty sure this real world is not the real world assumed in “IBM Watson: AI for the Real World.” IBM has tapped Bob Dylan, a TV game show, and odd duck quasi chemical symbols to communicate the importance of search and content processing.
The write up takes a different approach. In fact, the article begins with an interesting comment:
Computers are stupid.
There you go. A snazzy one liner.
The purpose of the reminder that a man made device is not quite the same as one’s faithful boxer dog or next door neighbor’s teen is startling.
The article summarizes an interview with a Watson wizard, Steven Abrams, director of technology for the Watson Ecosystem. This is one of those PR inspired outputs which I quite enjoy.
The write up quotes Abrams as saying:
“You debug Watson’s system by asking, ‘Did we give it the right data?'” Abrams said. “Is the data and experience complete enough?”
Okay, but isn’t this Dr. Mike Lynch’s approach? Lynch, as you may recall, was the Cambridge University wizard who was among the first to commercialize “learning” systems in the 1990s.
According to the write up:
Developers will have data sets they can “feed” Watson through one of over 30 APIs. Some of them are based on XML or JSON. Developers familiar with those formats will know how to interact with Watson, he [Abrams] explained.
As those who have used the 25 year old Autonomy IDOL system know, preparing the training data takes a bit of effort. Then as current content is fed into the Autonomy IDOL system, the humans have to keep an eye on the indexing. Ignore the system too long, and the indexing “drifts”; that is, the learned content is not in tune with the current content processed by the system. Sure, algorithms attempt to keep the calibrations precise, but there is that annoying and inevitable “drift.”
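One generic way to watch for this sort of drift, offered as a sketch and not as a description of what Autonomy IDOL or Watson does internally: compare the vocabulary of the training set with the vocabulary of the content arriving now, and raise a flag when the two stop resembling each other. The sample texts, the function names, and the 0.5 threshold are my assumptions.

```python
import math
from collections import Counter


def term_vector(texts):
    """Count how often each term appears across a batch of texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two term count vectors (0.0 when either is empty)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def has_drifted(training_texts, current_texts, threshold=0.5):
    """Flag drift when current content no longer resembles the training content."""
    similarity = cosine_similarity(term_vector(training_texts), term_vector(current_texts))
    return similarity < threshold


# Invented sample batches for illustration.
training = ["merger talks stall", "merger approved by board"]
similar_batch = ["board approved merger"]
different_batch = ["ransomware hits hospital network"]

print(has_drifted(training, similar_batch))
print(has_drifted(training, different_batch))
```

A flag like this only tells the humans when to look. Someone still has to retrain the system, and that is the expensive part.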
IBM’s system, which strikes me as a modification of the Autonomy IDOL approach with a touch of Palantir analytics stirred in, is likely to be one expensive puppy to groom for the dog show ring.
The article profiles the efforts of a couple of IBM “partners” to make Watson useful for the “real” world. But the snip I circled in IBM red-ink red was this one:
But Watson should not be mistaken for HAL. “Watson will not initiate conduct on its own,” IBM’s Abrams pointed out. “Watson does not have ambition. It has no objective to respond outside a query.” “With no individual initiative, it has no way of going out of control,” he continued. “Watson has a plug,” he quipped. It can be disconnected. “Watson is not going to be applied without individual judgment … The final decision in any Watson solution … will always be [made by] a human, being based on information they got from Watson.”
My hunch is that Watson will require considerable human attention. But it may perform best on a TV show or in a motion picture where post production can smooth out the rough edges.
Maybe entertainment is “real”, not the world of a Harrod’s Creek hollow.
Stephen E Arnold, January 2, 2016
IBM: There Are Doubters
December 31, 2015
Watson has its work cut out for it in 2016. I read “IBM Set to Drop 13% in 2015.” When one is tossing around a $100 billion outfit, the thought of a share drop is disconcerting. Will Alibaba or Jeff Bezos step in? Fixing up the Washington Post may be trivial compared with an IBM scale challenge.
According to the write up:
Much of the disappointment in the tech company is because it has been unable to replace its hardware and software legacy products with new cloud-based and AI products — at least not at a rate that would pull IBM’s revenue up. Its major branded product in new age technology is Watson. While Watson has been the source of press releases and small customer alliances, outsiders have trouble seeing what it does to sharply increase IBM’s sales. Granted, Watson may be one of the most impressive product advances among large companies in the sector recently, but what it does for IBM may be very modest.
Somewhat of a downer I perceive.
The smart software thing is not new. In the last 18 months, awareness of the use of various numerical recipes has increased. Faster chips, memories, and interconnections have worked their magic.
The challenge for IBM is to make money, not just marketing hyperbole. The crunch is that expectations for certain technologies are often more robust than possible in a market.
Watson, when one keeps one’s eye on the ball, is a search and content processing system. The wrappers make it possible to call assorted functions. Unlike Palantir, which has its own revenue fish to catch, IBM is a publicly traded company. Palantir does its magic as a privately held company, ingesting money at rates which would make a beluga whale’s diet look modest.
But IBM has exposed itself. The Watson marketing push is dragged into the reality of IBM’s overall company performance. In 2016, IBM Watson will have to deliver the bacon, or some of the millennialesque PR and marketing folks will have an opportunity to work elsewhere. Talk about smart software is not generating sustainable revenue from smart software.
Stephen E Arnold, December 31, 2015
Index and Search: The Threat Intel Positioning
December 24, 2015
The Dark Web is out there. Not surprisingly, there are a number of companies indexing Dark Web content. One of these firms is Digital Shadows. I learned in “Cyber Threat Intelligence and the Market of One” that search and retrieval has a new suit of clothes. The write up states:
Cyber situational awareness shifts from only delivering generic threat intelligence that informs, to also delivering specific information to defend against adversaries launching targeted attacks against an organization or individual(s) within an organization. Cyber situational awareness brings together all the information that an organization possesses about itself such as its people, risk posture, attack surface, entire digital footprint and digital shadow (a subset of a digital footprint that consists of exposed personal, technical or organizational information that is often highly confidential, sensitive or proprietary). Information is gathered by examining millions of social sites, cloud-based file sharing sites and other points of compromise across a multi-lingual, global environment spanning the visible, dark and deep web.
The approach seems to echo the Palantir “platform” approach. Palantir, one must not forget, is a 2015 version of the Autonomy platform. The notion is that content is acquired, federated, and made useful via outputs and user friendly controls.
What’s interesting is that Digital Shadows indexes content and provides a search system to authorized users. Commercial access is available via a tie up in the UK.
My point is that search is alive and well. The positioning of search and retrieval is undergoing some fitting and tucking. There are new terms, new rationale for business cases (fear is workable today), and new players. Under the surface are crawlers, indexes, and search functions.
The death of search may be news to the new players like Digital Shadows, Palantir, and Recorded Future, among numerous other shape shifters.
Stephen E Arnold, December 24, 2015
Gibiru Compromised?
December 22, 2015
I assume, gentle reader, that you are aware of the anonymizing search system called Gibiru. Today (December 22, 2015) I received this notification when I attempted to run a query about Palantir on this search system:
The Kaspersky information link is a 404. I located no substantive information about this possible issue when I poked around online. I had in my files a link to https://anonymous-gibiru.com/ which did not trigger the malicious file warning.
Stephen E Arnold, December 22, 2015
Search Vendors Under Pressure: Welcome to 2016
December 21, 2015
I read “Silicon Valley’s Cash Party Is Coming to an End.” What took so long? I suppose reality is less fun than fantasy. Why watch a science documentary when one can get lost in Netflix binging?
The write up reports:
Based on interviews with about two dozen venture capitalists and tech investors, 2016 is shaping up to be a year of reckoning for scores of technology start-ups that have yet to prove out their business models and equally challenging for those that raised money at unjustifiably high prices.
Forget the unicorns. There are some enterprise search outfits which have ingested millions of dollars, have convinced investors that big revenue or an HP-Autonomy scale buy out is just around the corner, and proprietary technology or consulting plus open source will produce gushers of organic revenue. Other vendors have tapped their moms, their nest eggs, and angels who believe in fairies.
I am not sure there is a General Leia Organa to fight Star Wars: The Revenue Battle for most vendors of search and content processing. Bummer. Despite the lack of media coverage for search and content processing vendors, the number of companies pitching information access is hefty. I track about 200 outfits, but many of these are unknown either because they don’t want to be visible or lack any substantive “newsy” magnetism.
My hunch is that this article suggests that 2016 may be different from the free money era the article suggests is ending. In 2016, my view is that many vendors will find themselves in a modest tussle with their stakeholders. I worked through some of the search and content processing companies taking cash from folks with deep pockets often filled with other people’s money. (Note that investment totals come from Crunchbase). Here’s a list of search and content processing vendors who may face stakeholder and investor pressure. The more money ingested, the greater the interest investors may have in getting a return:
- Antidot, $3 million
- Attensity, $90 million
- Attivio, $71 million
- BA Insight, $14 million
- Connotate, $12 million
- Coveo, $69 million
- Digital Reasoning, $28 million
- Elastic (formerly Elasticsearch), $104 million
- Lucidworks, $53 million
- MarkLogic, $175 million
- Perfect Search, $4 million
- Palantir, $1.7 billion
- Recommind, $22 million
- Sinequa, $5 million
- Sophia Ambiance, $5 million
- X1, $12 million.
Then there are the search systems which have been acquired. One assumes these deals will have to produce sustainable revenues in some form:
- Hewlett Packard with Autonomy
- IBM with Vivisimo
- Dassault Systèmes with Exalead
- Lexmark with Brainware and ISYS Search
- Microsoft with Fast Search
- OpenText with BASIS, BRS, Fulcrum, and Nstein
- Oracle with Endeca, InQuira, and Rightnow
- Thomson Reuters with Solcara
Are there sufficient prospects to generate deals large enough to keep these outfits afloat?
There are search and content processing vendors competing for sales with free and open source options and the vendors with proprietary software:
- Ami Albert
- Content Analyst
- Concept Searching
- dtSearch
- EasyAsk
- Exorbyte
- Fabasoft Mindbreeze
- Funnelback
- IHS Goldfire
- SLI Systems
- Smartlogic
- Sprylogics
- SurfRay
- Thunderstone
- WCC Elise
- Zaizi
These search vendors plus many smaller outfits like Intrafind and Srch2 have to find a way to close deals to avoid the fate of Arikus, Convera, Delphes, Dieselpoint, Entopia, Hakia, Kartoo, NuTech Search, and Siderean Software, among others.
Despite the lack of coverage from mid tier consultants and the “real” journalists, the information access sector is moving along. In fact, when one looks at the software options, search and content processing vendors are easily found.
The problem for 2016 will be making sales, generating sustainable revenues, and paying back stakeholders. For many of these companies, the new year will be one which sees a number of outfits going dark. A few will thrive.
Darned exciting times in findability.
Stephen E Arnold, December 21, 2015