The House Cleaning of Halevy Dataspace: A Web Curiosity

November 14, 2016

I am preparing three seven minute videos. That effort will be one video each week starting on 20 December 2016. The subject is my Google Trilogy, published by an antique outfit which has drowned in River Avon. The first video is about the 2004 monograph, The Google Legacy. I coined the term “Googzilla” in that 230 page discussion of how Google became baby Google. The second video summarizes several of the take aways from Google: The Calculating Predator, published in 2007. The key to the monograph is the bound phrase “calculating predator.” Yep, not the happy little search outfit most know and love. The third video hits the main points of Google: The Digital Gutenberg, published in 2009. The idea is that Google spits out more digital content than almost anyone. Few think of the GOOG as the content generator the company has become. Yep, a map is a digital artifact.

Now to the curiosity. I wanted to reference the work of Dr. Alon Halevy, a former University of Washington professor and founder of Nimble and Transformic. I had a stack of links I used when I was doing the research for my predator book. Just out of curiosity I started following the links. I do have PDF versions of most of the open source Halevy-centric content I located.

But guess what?

Dr. Alon Halevy has disappeared. I could not locate the open source version of his talk about dataspaces. I could not locate the Wayback Machine’s archived version of the Web site. The links returned these weird 404 errors. My assumption was that Wayback’s Web pages resided happily on the outfit’s servers. I was incorrect. Here’s what I saw:


I explored the bound phrase “Alon Halevy” with various other terms only to learn that the bulk of the information has disappeared. No PowerPoints, not much substantive information. There were a few “information objects” which have not yet disappeared; for example:

  • An ACM blog post which references “the structured data team” and Nimble and Transformic
  • A Google research paper which will not make those who buy into David Gelernter’s The Tides of Mind thesis happy
  • A YouTube video of a lecture given at Technion.

I found the gap between my research gathered from 2005 to 2007 and what I could locate today interesting. I asked myself, “How did I end up with so many dead links about a technology I have described as one of the most important in database, data management, data analysis, and information retrieval?”

Here are the answers I formulated:

  1. The Web is a lousy source of information. Stuff just disappears like the Darpa listing of open source Dark Web software, blogs, and Web sites
  2. I did really terrible research and even worse librarian type behavior. Yep, mea culpa.
  3. Some filtering procedures became a bit too aggressive and the information has been swept from assorted indexes
  4. The Wayback Machine ran off the rails and pointed to an actual 2005 Web site which its system failed to copy when the original spidering was completed.
  5. Gremlins. Hey, they really do exist. Just ask Grace Hopper. Yikes, she’s not available.

I wanted to mention this apparent or erroneous scrubbing. The story in this week’s HonkinNews video points out that 89 percent of journalists do their research via Google. Now if information is not in Google, what does that imply for a “real” journalist trying to do an objective, comprehensive story? I leave it up to you, gentle reader, to penetrate this curiosity.

Watch for the Google Trilogy seven minute videos on December 20, 2016, December 27, 2016, and January 3, 2017. Free. No pay wall. No pleading. No registration form. Just honkin’ news seven days a week and some video shot on an old Bell+Howell camera in a log cabin in rural Kentucky.

Stephen E Arnold, November 14, 2016

Why Search When You Can Discover

November 11, 2016

What’s next in search? My answer is, “No search at all. The system thinks for you.” Sounds like Utopia for the intellectual couch potato to me.

I read “The Latest in Search: New Services in the Content Discovery Marketplace.” The main point of the write up is to highlight three “discovery” services. A discovery service is one which offers “information users new avenues to the research literature.”

See, no search needed.

The three services highlighted are:

  • Yewno, which is powered by an inference engine. (Does anyone remember the Inference search engine from days gone by?). The Yewno system uses “computational analysis and a concept map.” The problem is that it “supplements institutional discovery.” I don’t know what “institutional discovery” means, and my hunch is that folks living outside of rural Kentucky know what “institutional discovery” means. Sorry to be so ignorant.
  • ScienceOpen, which delivers a service which “complements open Web discovery.” Okay. I assume that this means I run an old fashioned query and ScienceOpen helps me out.
  • TrendMD, which “serves as a classic “onward journey tool” that aims to generate relevant recommendations serendipitously.”

I am okay with the notion of having tools to make it easier to locate information germane to a specific query. I am definitely happy with tools which can illustrate connections via concept maps, link analysis, and similar outputs. I understand that lawyers want to type in a phrase like “Panama deal” and get a set of documents related to this term so the mass of data can be chopped down by sender, recipient, time, etc.

But setting up discovery as a separate operation from keyword or entity based search seems a bit forced to me. The write up spins its lawn mower blades over the TrendMD service. That’s fine, but there are a number of ways to explore scientific, technical, and medical literature. Some are or were delightful like Grateful Med; others are less well known; for example, Mednar and Quertle.

Discovery means one thing to lawyers. It means another thing to me: A search add on.

Stephen E Arnold, November 11, 2016

Palantir Technologies: Less War with Gotham?

November 9, 2016

I read “Peter Thiel Explains Why His Company’s Defense Contracts Could Lead to Less War.” I noted that the write up appeared in the Washington Post, a favorite of Jeff Bezos I believe. The write up referenced a refrain which I have heard before:

Washington “insiders” currently leading the government have “squandered” money, time and human lives on international conflicts.

What I highlighted as an interesting passage was this one:

a spokesman for Thiel explained that the technology allows the military to have a more targeted response to threats, which could render unnecessary the wide-scale conflicts that Thiel sharply criticized.

I also put a star by this statement from the write up:

“If we can pinpoint real security threats, we can defend ourselves without resorting to the crude tactic of invading other countries,” Thiel said in a statement sent to The Post.

The write up pointed out that Palantir booked about $350 million in business between 2007 and 2016 and added:

The total value of the contracts awarded to Palantir is actually higher. Many contracts are paid in a series of installments as work is completed or funds are allocated, meaning the total value of the contract may be reflected over several years. In May, for example, Palantir was awarded a contract worth $222.1 million from the Defense Department to provide software and technical support to the U.S. Special Operations Command. The initial amount paid was $5 million with the remainder to come in installments over four years.

I was surprised at the Washington Post’s write up. No ads for Alexa and no Beltway snarkiness. That too was interesting to me. And I don’t have a dog in the fight. For those with dogs in the fight, there may be some billability worries ahead. I wonder if the traffic jam at 355 and Quince Orchard will now abate when IBM folks do their daily commute.

Stephen E Arnold, November 9, 2016

Ontotext: The Fabric of Relationships

November 9, 2016

Relationships among metadata, words, and other “information” are important. Google’s Dr. Alon Halevy, founder of Transformic which Google acquired in 2006, has been beavering away in this field for a number of years. His work on “dataspaces” is important for Google and germane to the “intelligence-oriented” systems which knit together disparate factoids about a person, event, or organization. I recall one of his presentations, specifically the PODs 2006 keynote, in which he reproduced a “colleague’s” diagram of a flow chart which made it easy to see who received the document, who edited the document and what changes were made, and to whom recipients of the document forwarded the document.

Here’s the diagram from Dr. Halevy’s lecture:


Principles of Dataspace Systems, Slide 4 by Dr. Alon Halevy, delivered on June 26, 2006, at PODs. Note that “PODs” is an annual ACM database-centric conference.

I found the Halevy discussion interesting.


Entity Extraction: No Slam Dunk

November 7, 2016

There are differences among these three use cases for entity extraction:

  1. Operatives reviewing content for information about watched entities prior to an operation
  2. Identifying people, places, and things for a marketing analysis by a PowerPoint ranger
  3. Indexing Web content to add concepts to keyword indexing.

Regardless of your experience with software which identifies “proper nouns,” events, meaningful digits like license plate numbers, organizations, people, and locations (accepted and colloquial), you will find the information in “Performance Comparison of 10 Linguistic APIs for Entity Recognition” thought provoking.

The write up identifies the systems which perform the best and the worst.

Here are the five systems and the number of errors each generated in a test corpus. The “scores” are based on a test which contained 150 targets. The “best” system got more correct than incorrect. I find the results interesting but not definitive.

The five best performing systems on the test corpus were:

The five worst performing systems on the test corpus were:

There are some caveats to consider:

  1. Entity identification works quite well when the training set includes the entities and their synonyms as part of the training set
  2. Multi-language entity extraction requires additional training set preparation. “Learn as you go” is often problematic when dealing with social messages, certain intercepted content, and colloquialisms
  3. Identification of content used as a code—for example, Harrod’s teddy bear for contraband—is difficult even for smart software operating with subject matter experts’ input. (Bad guys are often not stupid and understand the concept of using one word to refer to another thing based on context or previous interactions).
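The first and third caveats can be illustrated with a minimal sketch. This is not how any of the tested APIs work; it is a simple dictionary (gazetteer) lookup, and the entity list, the synonyms, and the function names are hypothetical. The point is only that a system finds what it has been told about and nothing else.

```python
import re

# Hypothetical gazetteer: canonical entity -> known surface forms (synonyms).
GAZETTEER = {
    "Palantir Technologies": ["palantir technologies", "palantir"],
    "Google": ["google", "goog", "googzilla"],
}


def build_pattern(gazetteer):
    """Compile one regex that matches any known surface form as a whole word."""
    lookup = {}
    for canonical, forms in gazetteer.items():
        for form in forms:
            lookup[form] = canonical
    # Longest forms first so "palantir technologies" wins over "palantir".
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(a) for a in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return pattern, lookup


def extract_entities(text, gazetteer=GAZETTEER):
    """Return the canonical names of known entities, in order of appearance."""
    pattern, lookup = build_pattern(gazetteer)
    return [lookup[m.group(1).lower()] for m in pattern.finditer(text)]


# Known names and synonyms are found.
print(extract_entities("The GOOG and Palantir Technologies compete."))

# A code word is invisible unless someone adds it to the list.
print(extract_entities("Deliver the teddy bear on Tuesday."))
```

The second call returns an empty list. A lookup table cannot know that “teddy bear” stands for something else, which is why the subject matter experts’ input matters and why the error rates deserve attention.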

Net net: Automated systems are essential. The error rates may be fine for some use cases and potentially dangerous for others.

Stephen E Arnold, November 7, 2016

Falcon Searches Through Browser History

October 21, 2016

Have you ever visited a Web site and then lost the address or could not find a particular section on it? You know that the page exists, but no matter how often you use an advanced search feature or scour through your browser history it cannot be found. If you use Google Chrome as your main browser then there is a solution, says GHacks in the article, “Falcon: Full-Text history Search For Chrome.”

Falcon is a Google Chrome extension that adds full-text history search to a browser. Chrome usually remembers Web sites and their extensions when you type them into the address bar. The Falcon extension augments the default behavior to match text found on previously visited Web sites.

Falcon is a search option within a search feature:

The main advantage of Falcon over Chrome’s default way of returning results is that it may provide you with better results.  If the title or URL of a page don’t contain the keyword you entered in the address bar, it won’t be displayed by Chrome as a suggestion even if the page is full of that keyword. With Falcon, that page may be returned as well in the suggestions.
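The difference described in the quote can be sketched in a few lines. This is not Falcon’s code or Chrome’s; the `Visit` record, the sample history, and both function names are assumptions made for illustration. The sketch only contrasts matching on title and URL with matching on stored page text.

```python
from dataclasses import dataclass


@dataclass
class Visit:
    url: str
    title: str
    body: str  # page text captured at visit time (assumed to be stored)


# Hypothetical browsing history.
HISTORY = [
    Visit("https://example.com/a", "Weekly notes",
          "A long post about dataspaces and metadata."),
    Visit("https://example.com/dataspaces", "Dataspaces primer", "Intro text."),
]


def suggest_title_url(history, keyword):
    """Default-style suggestions: keyword must appear in the title or URL."""
    k = keyword.lower()
    return [v.url for v in history
            if k in v.title.lower() or k in v.url.lower()]


def suggest_full_text(history, keyword):
    """Full-text suggestions: the stored page body is searched as well."""
    k = keyword.lower()
    return [v.url for v in history
            if k in v.title.lower() or k in v.url.lower() or k in v.body.lower()]


print(suggest_title_url(HISTORY, "dataspaces"))
print(suggest_full_text(HISTORY, "dataspaces"))
```

The first call returns only the page with the keyword in its title and URL. The second also returns the “Weekly notes” page, which mentions the keyword only in its text.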

The new Chrome extension acts as a delimiter to recorded Web history and improves a user’s search experience so they do not have to sift through results individually.

Whitney Grace, October 21, 2016
Sponsored by, publisher of the CyberOSINT monograph


Structured Search: New York Style

October 10, 2016

An interesting and brief search related content marketing white paper “InnovationQ Plus Search Engine Technology” attracted my attention. What’s interesting is that the IEEE is apparently in the search engine content marketing game. The example I have in front of me is from a company doing business as

What does InnovationQ Plus do to deliver on point results? The write up says:

This engine is powered by’s patented neural network machine learning technology that improves searcher productivity and alleviates the difficult task of identifying and selecting countless keywords/synonyms to combine into Boolean syntax. Simply cut and paste abstracts, summaries, claims, etc. and this state-of-the art system matches queries to documents based on meaning rather than keywords. The result is a search that delivers a complete result set with less noise and fewer false positives. Ensure you don’t miss critical documents in your search and analysis by using a semantic engine that finds documents that other tools do not.

The use of snippets of text as the raw material for a behind-the-scenes query generator reminds me of the original DR-LINK method, among others. Perhaps there is some Syracuse University “old school” search DNA in the InnovationQ Plus approach? Perhaps the TextWise system has manifested itself as a “new” approach to patent and STEM (scientific, technology, engineering, and medical)  online searching? Perhaps Manning & Napier’s interest in information access has inspired a new generation of search capabilities?

My hunch is, “Yep.”

If you don’t have a handy snippet encapsulating your search topic, just fill in the query form. Google offers a similar “fill in the blanks” approach even though a tiny percentage of those looking for information on Google use advanced search. You can locate the Google advanced search form at this link.

Part of the “innovation” is the use of fielded search. Fielded search is useful. It was the go to method for locating information in the late 1960s. The method fell out of favor with the Sillycon Valley crowd when the idea of talking to one’s mobile phone became the synonym for good enough search.
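For readers who have not met fielded search, here is a minimal sketch of the idea. It is not the InnovationQ Plus method; the records, the field names, and the `fielded_search` function are invented for illustration. The searcher names the field and the value, and the system returns only records where every named field matches.

```python
# Hypothetical bibliographic records with named fields.
RECORDS = [
    {"title": "Neural network query expansion", "assignee": "Example Corp", "year": 2014},
    {"title": "Boolean retrieval for patents", "assignee": "Sample LLC", "year": 2009},
    {"title": "Semantic patent search", "assignee": "Example Corp", "year": 2016},
]


def fielded_search(records, **criteria):
    """Return records where every named field satisfies its criterion.

    String criteria match case-insensitively as substrings;
    all other criteria must be equal to the field value.
    """
    def matches(record):
        for field, wanted in criteria.items():
            value = record.get(field)
            if isinstance(wanted, str):
                if not isinstance(value, str) or wanted.lower() not in value.lower():
                    return False
            elif value != wanted:
                return False
        return True

    return [r for r in records if matches(r)]


hits = fielded_search(RECORDS, assignee="example corp", title="patent")
print([r["title"] for r in hits])
```

Only “Semantic patent search” satisfies both fields. The burden is on the searcher to know which fields exist and what to put in them, which is the caveat that applies to any structured approach.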

To access the white paper, navigate to the IEEE registration page and fill out the form at this link.

From my vantage point, structured search with “more like this” functions is a good way to search for information. There is a caveat. The person doing the looking has to know what he or she needs to know.

Good enough search takes a different approach. The systems try to figure out what the searcher needs to know and then deliver it. The person looking for information is not required to do much thinking.

The InnovationQ Plus approach shifts the burden from smart software to smart searchers.

Good enough search is winning the battle. In fact, some Sillycon Valley folks, far from upstate New York, have embraced good enough search with both hands. Why use words at all? There are emojis, smart software systems predicting what the user wants to know, and Snapchat infused image based methods.

The challenge will be to find a way to bridge the gap between the Sillycon Valley good enough methods and the more traditional structured search methods.

IEEE seems to agree as long as the vendor “participates” in a suitable IEEE publishing program.

Stephen E Arnold, October 10, 2016

Crimping: Is the Method Used for Text Processing?

October 4, 2016

I read an article I found quite thought provoking. “Why Companies Make Their Products Worse” explains that reducing costs allows a manufacturer to expand the market for a product. The idea is that more people will buy a product if it is less expensive than a more sophisticated version of the product. The example which I highlighted in eyeshade green explained that IBM introduced an expensive printer in the 1980s. Then IBM manufactured a different version of the printer using cheaper labor. The folks from Big Blue added electronic components to make the cheaper printer slower. The result was a lower cost printer that was “worse” than the original.


Perhaps enterprise search and content processing is a hybrid of two or more creatures?

The write up explained that this approach to degrading a product to make more money has a name—crimping. The concept creates “product sabotage”; that is, intentionally degrading a product for business reasons.

The comments to the article offer additional examples and one helpful person with the handle Dadpolice stated:

The examples you give are accurate, but these aren’t relics of the past. They are incredibly common strategies that chip makers still use today.

I understand the hardware or tangible product application of this idea. I began to think about the use of the tactic by text processing vendors.

The Google Search Appliance may have been a product subject to crimping. As I recall, the most economical GSA was less than $2000, a price which was relatively easy to justify in many organizations. Over the years, the low cost option disappeared and the prices for the Google Search Appliances soared to Autonomy- and Fast Search-levels.

Other vendors introduced search and content processing systems, but the prices remained lofty. Search and content processing in an organization never seemed to get less expensive when one considered the resources required, the license fees, the “customer” support, the upgrades, and the engineering for customization and optimization.

My hypothesis is that enterprise content processing does not yield compelling examples like the IBM printer example.

Perhaps the adoption rate for open source content processing reflects a pent up demand for “crimping”? Perhaps some clever graduate student would take the initiative to examine the content processing product prices? Licensees spend for sophisticated solution systems like those available from outfits like IBM and Palantir Technologies. The money comes from the engineering and what I call “soft” charges; that is, training, customer support, and engineering and consulting services.

At the other end of the content processing spectrum are open source solutions. The middle between free or low cost systems and high end solutions does not have too many examples. I am confident there are some, but I could identify only Funnelback, dtSearch, and a handful of other outfits.

Perhaps “crimping” is not a universal principle? On the other hand, perhaps content processing is an example of a technical software which has its own idiosyncrasies.

Content processing products, I believe, become worse over time. The reason is not “crimping.” The trajectory of lousiness comes from:

  • Layering features on keyword retrieval in hopes of finding a way to generate keen buyer interest
  • Adding features helps justify price increases
  • The greater the complexity of the system, the less likely the licensee will be able to fiddle with the system
  • A refusal to admit that content processing is a core component of many other types of software so “finding information” has become a standard component for other applications.

If content processing is idiosyncratic, that might explain why investors pour money into content processing companies which have little chance to generate sufficient revenue to pay off investors, generate a profit, and build a sustainable business. Enterprise search and content processing vendors seem to be in a state of reinventing or reimagining themselves. Guitar makers just pursue cost cutting and expand their market. It is not so easy for content processing companies.

Stephen E Arnold, October 4, 2016

Pharmaceutical Research Made Simple

October 3, 2016

Pharmaceutical companies are a major power in the United States. Their power comes from the medicine they produce and the wealth they generate. In order to maintain both wealth and power, pharmaceutical companies conduct a lot of market research. Market research is a field based on people’s opinions and their reactions; in other words, it contains information that is hard to process into black and white data. Lexalytics is a big data platform built with sentiment analysis to turn market research into useable data.

Inside Big Data explains how “Lexalytics Radically Simplifies Market Research And Voice Of Customer Programs For The Pharmaceutical Industry” with a new package called the Pharmaceutical Industry Pack. Lexalytics uses a combination of machine learning and natural language processing to understand the meaning and sentiment in text documents. The new pack can help pharmaceutical companies interpret how their customers react to medications, what their symptoms are, and possible side effects of medication.

‘Our customers in the pharmaceutical industry have told us that they’re inundated with unstructured data from social conversations, news media, surveys and other text, and are looking for a way to make sense of it all and act on it,’ said Jeff Catlin, CEO of Lexalytics. ‘With the Pharmaceutical Industry Pack, the latest in our series of industry-specific text analytics packages, we’re excited to dramatically simplify the jobs of CEM and VOC pros, market researchers and social marketers in this field.’

Along with basic natural language processing features, the Lexalytics Pharmaceutical Industry Pack contains 7000 sentiment terms from healthcare content as well as other medical references to understand market research data.  Lexalytics makes market research easy and offers invaluable insights that would otherwise go unnoticed.

Whitney Grace, October 3, 2016
Sponsored by, publisher of the CyberOSINT monograph

Attensity: A Big 404 in Text Analytics

October 1, 2016

Search vendors can save their business by embracing text analytics. Sounds like a wise statement, right? I would point out that our routine check of search and content processing companies turned up this inspiring Web page for Attensity, the Xerox Parc love child and once hot big dog in text analysis:


Attensity joins a long list of search-related companies which have had to reinvent themselves.

The company pulled in $90 million from a “mystery investor” in 2014. A pundit tweeted in 2015:


In February 2016, Attensity morphed into Sematell GmbH, a company with interaction solutions.

I mention this arabesque because it underscores:

  1. No single add on to enterprise search will “save” an information access company
  2. Enterprise search has become a utility function. Witness the shift to cloud based services like SearchBlox, appliances like Maxxcat, and open source options. Who will go out on a limb for a proprietary utility when open source variants are available and improving?
  3. Pundits who champion a company often have skin in the game. Self appointed experts for cognitive computing, predictive analytics, or semantic link analysis are tooting a horn without other instruments.

Attensity is a candidate to join the enterprise search Hall of Fame. In the shrine are Delphes, Entopia, et al. I anticipate more members, and I have a short list of “who is next” taped on my watch wall.

Stephen E Arnold, October 1, 2016
