Why Search When You Can Discover

November 11, 2016

What’s next in search? My answer is, “No search at all. The system thinks for you.” Sounds like Utopia for the intellectual couch potato to me.

I read “The Latest in Search: New Services in the Content Discovery Marketplace.” The main point of the write up is to highlight three “discovery” services. A discovery service is one which offers “information users new avenues to the research literature.”

See, no search needed.

The three services highlighted are:

  • Yewno, which is powered by an inference engine. (Does anyone remember the Inference search engine from days gone by?). The Yewno system uses “computational analysis and a concept map.” The problem is that it “supplements institutional discovery.” I don’t know what “institutional discovery” means, and my hunch is that folks living outside of rural Kentucky know what “institutional discovery” means. Sorry to be so ignorant.
  • ScienceOpen, which delivers a service which “complements open Web discovery.” Okay. I assume that this means I run an old fashioned query and ScienceOpen helps me out.
  • TrendMD, which “serves as a classic ‘onward journey tool’ that aims to generate relevant recommendations serendipitously.”

I am okay with the notion of having tools to make it easier to locate information germane to a specific query. I am definitely happy with tools which can illustrate connections via concept maps, link analysis, and similar outputs. I understand that lawyers want to type in a phrase like “Panama deal” and get a set of documents related to this term so the mass of data can be chopped down by sender, recipient, time, etc.
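That chop-down operation is nothing exotic. Here is a minimal sketch in Python; the fields, addresses, and documents are hypothetical, not taken from any product mentioned in the write up:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Document:
    sender: str
    recipient: str
    sent: date
    text: str

# A hypothetical result set for the query "Panama deal"
docs = [
    Document("alice@example.com", "bob@example.com", date(2016, 4, 3),
             "Notes on the Panama deal term sheet"),
    Document("carol@example.com", "bob@example.com", date(2016, 5, 9),
             "Panama deal: revised closing schedule"),
]

# Chop the mass of data down by sender, recipient, and time.
hits = [d for d in docs
        if "panama deal" in d.text.lower()
        and d.recipient == "bob@example.com"
        and d.sent >= date(2016, 5, 1)]

for d in hits:
    print(d.sender, d.sent, d.text)
```

Real eDiscovery platforms do the same thing at scale, with custodians, date ranges, and deduplication layered on top, but the principle is the same.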

But setting up discovery as a separate operation from keyword or entity based search seems a bit forced to me. The write up spins its lawn mower blades over the TrendMD service. That’s fine, but there are a number of ways to explore scientific, technical, and medical literature. Some are or were delightful, like Grateful Med; others are less well known, Mednar and Quertle, for example.

Discovery means one thing to lawyers. It means another thing to me: a search add-on.

Stephen E Arnold, November 11, 2016

Palantir Technologies: Less War with Gotham?

November 9, 2016

I read “Peter Thiel Explains Why His Company’s Defense Contracts Could Lead to Less War.” I noted that the write up appeared in the Washington Post, a favorite of Jeff Bezos, I believe. The write up referenced a refrain which I have heard before:

Washington “insiders” currently leading the government have “squandered” money, time and human lives on international conflicts.

What I highlighted as an interesting passage was this one:

a spokesman for Thiel explained that the technology allows the military to have a more targeted response to threats, which could render unnecessary the wide-scale conflicts that Thiel sharply criticized.

I also put a star by this statement from the write up:

“If we can pinpoint real security threats, we can defend ourselves without resorting to the crude tactic of invading other countries,” Thiel said in a statement sent to The Post.

The write up pointed out that Palantir booked about $350 million in business between 2007 and 2016 and added:

The total value of the contracts awarded to Palantir is actually higher. Many contracts are paid in a series of installments as work is completed or funds are allocated, meaning the total value of the contract may be reflected over several years. In May, for example, Palantir was awarded a contract worth $222.1 million from the Defense Department to provide software and technical support to the U.S. Special Operations Command. The initial amount paid was $5 million with the remainder to come in installments over four years.

I was surprised at the Washington Post’s write up. No ads for Alexa and no Beltway snarkiness. That too was interesting to me. And I don’t have a dog in the fight. For those with dogs in the fight, there may be some billability worries ahead. I wonder if the traffic jam at 355 and Quince Orchard will now abate when IBM folks do their daily commute.

Stephen E Arnold, November 9, 2016

Ontotext: The Fabric of Relationships

November 9, 2016

Relationships among metadata, words, and other “information” are important. Google’s Dr. Alon Halevy, founder of Transformic which Google acquired in 2006, has been beavering away in this field for a number of years. His work on “dataspaces” is important for Google and germane to the “intelligence-oriented” systems which knit together disparate factoids about a person, event, or organization. I recall one of his presentations (specifically the PODS 2006 keynote) in which he reproduced a “colleague’s” diagram of a flow chart which made it easy to see who received a document, who edited it and what changes were made, and to whom recipients forwarded it.

Here’s the diagram from Dr. Halevy’s lecture:


Principles of Dataspace Systems, slide 4, by Dr. Alon Halevy, delivered on June 26, 2006, at PODS. Note that PODS is an annual ACM database-centric conference.

I found the Halevy discussion interesting.
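The knitting together of disparate factoids boils down to storing relationships and walking them. A minimal sketch, with invented triples standing in for the flow chart’s document trail (an illustration only, not Ontotext’s software or Dr. Halevy’s system):

```python
from collections import defaultdict

# (subject, relation, object) triples: disparate factoids about
# documents, people, and organizations knit into one structure.
# All names here are invented.
triples = [
    ("memo_17", "received_by", "A. Smith"),
    ("memo_17", "edited_by", "B. Jones"),
    ("memo_17", "forwarded_to", "C. Lee"),
    ("B. Jones", "works_for", "Acme Corp"),
]

graph = defaultdict(list)
for subject, relation, obj in triples:
    graph[subject].append((relation, obj))

# Everything the system "knows" about memo_17: who received it,
# who edited it, and to whom it was forwarded.
for relation, obj in graph["memo_17"]:
    print(f"memo_17 --{relation}--> {obj}")
```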


Entity Extraction: No Slam Dunk

November 7, 2016

There are differences among these three use cases for entity extraction:

  1. Operatives reviewing content for information about watched entities prior to an operation
  2. Identifying people, places, and things for a marketing analysis by a PowerPoint ranger
  3. Indexing Web content to add concepts to keyword indexing.

Regardless of your experience with software which identifies “proper nouns,” events, meaningful digits like license plate numbers, organizations, people, and locations (accepted and colloquial)—you will find the information in “Performance Comparison of 10 Linguistic APIs for Entity Recognition” thought provoking.

The write up identifies the systems which perform the best and the worst.

The “scores” are based on a test corpus containing 150 targets; the “best” system got more correct than incorrect. The five best and five worst performers appear in the source article. I find the results interesting but not definitive.

There are some caveats to consider:

  1. Entity identification works quite well when the training set includes the entities and their synonyms
  2. Multi-language entity extraction requires additional training set preparation. “Learn as you go” is often problematic when dealing with social messages, certain intercepted content, and colloquialisms
  3. Identification of content used as a code (for example, Harrod’s teddy bear for contraband) is difficult even for smart software operating with subject matter experts’ input; see the sketch after this list. (Bad guys are often not stupid and understand the concept of using one word to refer to another thing based on context or previous interactions.)
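To make the caveats concrete, here is a minimal sketch using spaCy, an off-the-shelf entity extraction library. Note that spaCy is my example, not one of the ten systems tested in the write up:

```python
# spaCy setup (an assumption of this sketch, not from the write up):
#   pip install spacy
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Bob Smith flew from Geneva to London on May 4 to pick up "
          "Harrod's teddy bear for an associate.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)

# A stock model tags the surface types it saw in training (PERSON,
# GPE, DATE, and so on). "Harrod's teddy bear" used as a code word
# for contraband is tagged, if at all, by surface type, never by its
# covert meaning: caveat three above.
```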

Net net: Automated systems are essential. The error rates may be fine for some use cases and potentially dangerous for others.

Stephen E Arnold, November 7, 2016

Falcon Searches Through Browser History

October 21, 2016

Have you ever visited a Web site and then lost the address or could not find a particular section on it?  You know that the page exists, but no matter how often you use an advanced search feature or scour your browser history, it cannot be found.  If you use Google Chrome as your main browser, then there is a solution, says GHacks in the article, “Falcon: Full-Text History Search for Chrome.”

Falcon is a Google Chrome extension that adds full-text history search to the browser.  Chrome normally matches what you type in the address bar against the titles and URLs of previously visited sites.  The Falcon extension augments this default behavior to match text found on the pages themselves.

Falcon is, in effect, a search feature layered on top of a search feature:

The main advantage of Falcon over Chrome’s default way of returning results is that it may provide you with better results.  If the title or URL of a page don’t contain the keyword you entered in the address bar, it won’t be displayed by Chrome as a suggestion even if the page is full of that keyword. With Falcon, that page may be returned as well in the suggestions.

The new Chrome extension acts as a full-text filter for recorded Web history and improves the search experience so users do not have to sift through results one by one.
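The difference is easy to demonstrate. Below is a toy sketch of the two matching strategies; Falcon’s actual implementation is not described in the write up, and the history entries are invented:

```python
from collections import defaultdict

# Toy browsing history: (url, title, page_text). Entries are invented.
history = [
    ("https://example.com/post/42", "Weekly Update",
     "a deep dive into falconry hoods and stitching techniques"),
]

# Chrome-style suggestion: match against title and URL only.
def omnibox_match(query):
    query = query.lower()
    return [url for url, title, _ in history
            if query in url.lower() or query in title.lower()]

# Falcon-style: inverted index over the full page text.
index = defaultdict(set)
for url, _, text in history:
    for word in text.lower().split():
        index[word].add(url)

print(omnibox_match("stitching"))  # [] -- keyword absent from title/URL
print(index["stitching"])          # {'https://example.com/post/42'}
```

The trade-off is index storage and indexing time, which is presumably why Chrome’s address bar sticks to titles and URLs.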

Whitney Grace, October 21, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph


Structured Search: New York Style

October 10, 2016

An interesting and brief search-related content marketing white paper, “InnovationQ Plus Search Engine Technology,” attracted my attention. What’s interesting is that the IEEE is apparently in the search engine content marketing game. The example I have in front of me is from a company doing business as IP.com.

What does InnovationQ Plus do to deliver on point results? The write up says:

This engine is powered by IP.com’s patented neural network machine learning technology that improves searcher productivity and alleviates the difficult task of identifying and selecting countless keywords/synonyms to combine into Boolean syntax. Simply cut and paste abstracts, summaries, claims, etc. and this state-of-the art system matches queries to documents based on meaning rather than keywords. The result is a search that delivers a complete result set with less noise and fewer false positives. Ensure you don’t miss critical documents in your search and analysis by using a semantic engine that finds documents that other tools do not.
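For the curious, here is a minimal sketch of query-by-document using garden variety TF-IDF and cosine similarity. IP.com’s patented neural network engine is not public, so treat this as an illustration of the general idea (paste text, skip the Boolean syntax), not the InnovationQ Plus method:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A toy patent corpus; the texts are invented for illustration.
patents = [
    "A neural network method for ranking patent documents by meaning.",
    "A mechanical linkage for folding bicycle frames compactly.",
    "A machine learning system matching queries to documents semantically.",
]

# Paste an abstract instead of crafting Boolean keyword syntax.
snippet = ["Using machine learning to match a pasted abstract "
           "against prior art based on meaning rather than keywords."]

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(patents)
scores = cosine_similarity(vectorizer.transform(snippet), doc_matrix)[0]

# Documents ranked by similarity to the snippet, best first.
for score, text in sorted(zip(scores, patents), reverse=True):
    print(f"{score:.2f}  {text}")
```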

The use of snippets of text as the raw material for a behind-the-scenes query generator reminds me of the original DR-LINK method, among others. Perhaps there is some Syracuse University “old school” search DNA in the InnovationQ Plus approach? Perhaps the TextWise system has manifested itself as a “new” approach to patent and STEM (science, technology, engineering, and medicine) online searching? Perhaps Manning & Napier’s interest in information access has inspired a new generation of search capabilities?

My hunch is, “Yep.”

If you don’t have a handy snippet encapsulating your search topic, just fill in the query form. Google offers a similar “fill in the blanks” approach, even though only a tiny percentage of those looking for information on Google use advanced search. You can locate the Google advanced search form at this link.

Part of the “innovation” is the use of fielded search. Fielded search is useful. It was the go-to method for locating information in the late 1960s. The method fell out of favor with the Sillycon Valley crowd when the idea of talking to one’s mobile phone became a synonym for good enough search.

To access the white paper, navigate the IEEE registration page and fill out the form at this link.

From my vantage point, structured search with “more like this” functions is a good way to search for information. There is a caveat. The person doing the looking has to know what he or she needs to know.

Good enough search takes a different approach. The systems try to figure out what the searcher needs to know and then deliver it. The person looking for information is not required to do much thinking.

The InnovationQ Plus approach shifts the burden from smart software to smart searchers.

Good enough search is winning the battle. In fact, some Sillycon Valley folks, far from upstate New York, have embraced good enough search with both hands. Why use words at all? There are emojis, smart software systems predicting what the user wants to know, and Snapchat infused image based methods.

The challenge will be to find a way to bridge the gap between the Sillycon Valley good enough methods and the more traditional structured search methods.

IEEE seems to agree as long as the vendor “participates” in a suitable IEEE publishing program.

Stephen E Arnold, October 10, 2016

Crimping: Is the Method Used for Text Processing?

October 4, 2016

I read an article I found quite thought provoking. “Why Companies Make Their Products Worse” explains that reducing costs allows a manufacturer to expand the market for a product. The idea is that more people will buy a product if it is less expensive than a more sophisticated version. The example which I highlighted in eyeshade green explained that IBM introduced an expensive printer in the 1980s. IBM then manufactured a different version of the printer using cheaper labor. The folks from Big Blue added electronic components to make the cheaper printer slower. The result was a lower cost printer that was “worse” than the original.


Perhaps enterprise search and content processing is a hybrid of two or more creatures?

The write up explained that this approach to degrading a product to make more money has a name—crimping. The concept creates “product sabotage”; that is, intentionally degrading a product for business reasons.

The comments to the article offer additional examples and one helpful person with the handle Dadpolice stated:

The examples you give are accurate, but these aren’t relics of the past. They are incredibly common strategies that chip makers still use today.

I understand the hardware or tangible product application of this idea. I began to think about the use of the tactic by text processing vendors.

The Google Search Appliance may have been a product subject to crimping. As I recall, the most economical GSA was less than $2000, a price which was relatively easy to justify in many organizations. Over the years, the low cost option disappeared and the prices for the Google Search Appliances soared to Autonomy- and Fast Search-levels.

Other vendors introduced search and content processing systems, but the prices remained lofty. Search and content processing in an organization never seemed to get less expensive when one considered the resources required, the license fees, the “customer” support, the upgrades, and the engineering for customization and optimization.

My hypothesis is that enterprise content processing does not yield compelling examples like the IBM printer example.

Perhaps the adoption rate for open source content processing reflects a pent-up demand for “crimping”? Perhaps some clever graduate student will take the initiative to examine content processing product prices? Licensees spend for sophisticated solutions like those available from outfits like IBM and Palantir Technologies. The money comes from the engineering and what I call “soft” charges; that is, training, customer support, and engineering and consulting services.

At the other end of the content processing spectrum are open source solutions. The middle ground between free or low cost systems and high end solutions does not have many examples. I am confident there are some, but I could identify only Funnelback, dtSearch, and a handful of other outfits.

Perhaps “crimping” is not a universal principle? On the other hand, perhaps content processing is an example of a technical software which has its own idiosyncrasies.

Content processing products, I believe, become worse over time. The reason is not “crimping.” The trajectory of lousiness comes from:

  • Layering features on keyword retrieval in hopes of generating keen buyer interest
  • Adding features to justify price increases
  • Increasing system complexity so the licensee is less able to fiddle with the system
  • Refusing to admit that content processing is a core component of many other types of software, so “finding information” has become a standard feature of other applications.

If content processing is idiosyncratic, that might explain why investors pour money into content processing companies which have little chance to generate sufficient revenue to pay off investors, generate a profit, and build a sustainable business. Enterprise search and content processing vendors seem to be in a state of reinventing or reimagining themselves. Guitar makers just pursue cost cutting and expand their market. It is not so easy for content processing companies.

Stephen E Arnold, October 4, 2016

Pharmaceutical Research Made Simple

October 3, 2016

Pharmaceutical companies are a major power in the United States.  Their power comes from the medicine they produce and the wealth they generate.  In order to maintain both wealth and power, pharmaceutical companies conduct a lot of market research.  Market research is a field based on people’s opinions and reactions; in other words, it contains information that is hard to process into black and white data.  Lexalytics is a big data platform with built-in sentiment analysis for turning market research into usable data.

Inside Big Data explains how “Lexalytics Radically Simplifies Market Research And Voice Of Customer Programs For The Pharmaceutical Industry” with a new package called the Pharmaceutical Industry Pack.  Lexalytics uses a combination of machine learning and natural language processing to understand the meaning and sentiment in text documents.  The new pack can help pharmaceutical companies interpret how their customers react to medications, what their symptoms are, and what side effects the medications may have.

“Our customers in the pharmaceutical industry have told us that they’re inundated with unstructured data from social conversations, news media, surveys and other text, and are looking for a way to make sense of it all and act on it,” said Jeff Catlin, CEO of Lexalytics. “With the Pharmaceutical Industry Pack — the latest in our series of industry-specific text analytics packages — we’re excited to dramatically simplify the jobs of CEM and VOC pros, market researchers and social marketers in this field.”

Along with basic natural language processing features, the Lexalytics Pharmaceutical Industry Pack contains 7,000 sentiment terms drawn from healthcare content as well as other medical references to help interpret market research data.  Lexalytics makes market research easier and surfaces insights that would otherwise go unnoticed.
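How might domain-specific sentiment terms change the arithmetic? A minimal sketch with a handful of hypothetical lexicon entries; Lexalytics’ actual 7,000 terms and their weights are proprietary, so the values below are invented:

```python
# Hypothetical healthcare lexicon entries; real term weights are
# proprietary, so these values are invented for illustration.
pharma_lexicon = {"relief": 2.0, "remission": 3.0,
                  "nausea": -2.0, "flare-up": -2.5, "recall": -3.0}

def score(text):
    words = text.lower().replace(",", " ").replace(".", " ").split()
    return sum(pharma_lexicon.get(word, 0.0) for word in words)

review = "Fast relief, but the nausea after the second dose was rough."
print(score(review))  # 0.0: relief (+2.0) is canceled by nausea (-2.0)
```

A general-purpose sentiment model would likely read “nausea” as neutral jargon; the domain lexicon is what puts it in the red.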

Whitney Grace, October 3, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Attensity: A Big 404 in Text Analytics

October 1, 2016

Search vendors can save their business by embracing text analytics. Sounds like a wise statement, right? I would point out that our routine check of search and content processing companies turned up a big 404 where the Web page of Attensity, the Xerox PARC love child and one-time hot big dog in text analysis, used to be.


Attensity joins a long list of search-related companies which have had to reinvent themselves.

The company pulled in $90 million from a “mystery investor” in 2014. A pundit was still tweeting about the company in 2015.

In February 2016, Attensity morphed into Sematell GmbH, a company offering interaction solutions.

I mention this arabesque because it underscores:

  1. No single add-on to enterprise search will “save” an information access company
  2. Enterprise search has become a utility function. Witness the shift to cloud-based services like SearchBlox, appliances like Maxxcat, and open source options. Who will go out on a limb for a proprietary utility when open source variants are available and improving?
  3. Pundits who champion a company often have skin in the game. Self-appointed experts for cognitive computing, predictive analytics, or semantic link analysis are tooting a horn without other instruments.

Attensity is a candidate to join the enterprise search Hall of Fame. In the shrine are Delphes, Entopia, et al. I anticipate more members, and I have a short list of “who is next” taped on my watch wall.

Stephen E Arnold, October 1, 2016

Bam! Pow! Zap! Palantir Steps Up Fight with US Army

September 25, 2016

Many moons ago I worked at that fun-loving outfit Booz, Allen & Hamilton. I recall one Master of the Universe telling me, “Keep the client happy.” Today an alternative approach has emerged. I term it “Fight with the client.” I assume the tactic works really well.


I read “Palantir Claims Army Misled to Keep It Out of DCGS-A Program.” As I understand the Mixed Martial Arts cage match, the US Army wants to build its own software system. Like many ideas emerging from Washington, DC, the system strikes me as complex and expensive. The program’s funding stretches back a decade. My hunch is that the software system will eventually knit together the digital information required by the US Army to complete its missions. Like many other US government programs, there are numerous vendors involved. Many of these are essentially focused on meeting the needs of the US government.

Palantir Technologies is a Sillycon Valley construct. The company poked its beak through a silicon shell in 2003 and opened for “real” business in 2004. That makes the company 12 years old. Like many disruptive unicorns, Palantir appears to be convinced that its Gotham system can do what the US Army wants done. The Shire and its Hobbits are girding for battle. What are the odds that a high technology company can mount its unicorns, charge into battle, and win?


The Palantirians’ reasoning is, by Sillycon Valley standards, logical. Google, by way of comparison, believes that it can solve death and compete with AT&T in high speed fiber. Google may demonstrate that the Sillycon Valley way is more than selling ads, but for now, Google is not gaining traction in some of its endeavors. Palantir wants to activate its four-wheel drive and power the US Army to digital nirvana.

The Defense News write up is a 1,200-word explanation of Palantir’s locker room planning. I noted this passage:

The Palo Alto-based company has argued the way the Army wrote its requirements in a request for proposals to industry would shut out Silicon Valley companies that provide commercially available products. The company contended that the Army’s plan to award just one contract to a lead systems integrator means commercially available solutions would have to be excluded.
Palantir is seeking to show the court that its data-management product — Palantir Gotham Platform — does exactly what DCGS-A is trying to do and comes at a much lower cost.

I like the idea of demonstrating the capabilities of Gotham to legal eagles. I know that lawyers are among the most technologically sophisticated professionals in the world. In addition, most lawyers are really skilled at technical problem solving and can work math puzzles while waiting for a Teavana Shaken Iced Tea.


The article also references “a chain of emails.” Yep, emails can be an interesting component of a cage match. With some Palantir proprietary information apparently surfacing in Buzzfeed, perhaps more emails will be forthcoming.

I have formulated three hypotheses about this tussle with the US Army:

  1. Palantir Technologies is not making progress with Gotham because of the downstream consequences of the i2 Analyst’s Notebook legal matter. The i2 product is owned by IBM, and IBM is a potentially important vendor to the US Army. IBM also has some chums in other big outfits working on the DCGS project. Palantir wants to live in the big dogs’ kennel, but no go.
  2. Palantir’s revenue may need the DCGS contracts to make up for sales challenges in other market sectors. Warfighting and related security jobs can be more predictable than selling a one-off to a hospital chain in Tennessee.
  3. Palantir’s perception of Washington may be somewhat negative. Sillycon Valley companies “know” that their “solutions” are the “logical” ones. When Sillycon Valley logic confronts the reality of government contracting, sparks may become visible.

For me, I think the Booz, Allen & Hamilton truism may be on target. Does one keep a customer happy by fighting a public battle designed to prove the “logic” of the Sillycon Valley way?

I don’t think most of the DCGS contractors are lining up to mud wrestle the US Army. I would enjoy watching how legal eagles react to the Gotham wheel menu and learning how long it takes for a savvy lawyer to move discovery content into the Gotham system.

My seeing stone shows a messy five-round battle and a lot of clean up and medical treatment after the fight.

Stephen E Arnold, September 25, 2016
