
Does America Want to Forget Some Items in the Google Index?

July 8, 2015

The idea that the Google sucks in data without much editorial control is just now grabbing brain cells in some folks. The Web indexing approach has traditionally allowed the crawlers to index what was available without too much latency. If there were servers which dropped a connection or returned an error, some Web crawlers would try again. Our Point crawler just kept on truckin’. I like the mantra, “Never go back.”

Google developed a more nuanced approach to Web indexing. The link thing, the popularity thing, and the hundred plus “factors” allowed the Google to figure out what to index, how often, and how deeply (no, grasshopper, not every page on a Web site is indexed with every crawl).

The notion of “right to be forgotten” amounts to a third party asking the GOOG to delete an index pointer in an index. This is sort of a hassle and can create some exciting moments for the programmers who have to manage the “forget me” function across distributed indexes and keep the eager beaver crawler from reindexing a content object.
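For the programmers in the audience, here is a minimal sketch of what a “forget me” function might look like. This is my illustration, not Google's method. The names `ForgetList`, `purge`, and `should_index` are hypothetical, each index shard is assumed to be a plain dictionary of URL to document ID, and the URL normalization deliberately ignores query strings.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce a URL to a comparable key: lowercase host plus path.

    Scheme, query string, and fragment are ignored (a simplifying assumption).
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return f"{parts.netloc.lower()}{path}"


class ForgetList:
    """Hypothetical suppression list shared by every index shard and the crawler."""

    def __init__(self) -> None:
        self._blocked: set[str] = set()

    def forget(self, url: str) -> None:
        self._blocked.add(normalize(url))

    def is_forgotten(self, url: str) -> bool:
        return normalize(url) in self._blocked


def purge(shards: list[dict[str, str]], forget_list: ForgetList) -> int:
    """Delete forgotten pointers from each shard; return how many were removed."""
    removed = 0
    for shard in shards:
        # Collect matches first so the dictionary is not modified while iterating.
        for url in [u for u in shard if forget_list.is_forgotten(u)]:
            del shard[url]
            removed += 1
    return removed


def should_index(url: str, forget_list: ForgetList) -> bool:
    """The crawler asks this before indexing or reindexing a content object."""
    return not forget_list.is_forgotten(url)
```

The point of the sketch is the second function, not the first. Deleting the pointer is the easy part. Keeping the eager beaver crawler from putting it back requires that the crawler consult the same list on every pass.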

The Google has to provide this type of third party editing for most of the requests from individuals who want one or more documents to be “forgotten”; that is, no longer in the Google index which the public users’ queries “hit” for results.

I read “Google Is Facing a Fight over Americans’ Right to Be Forgotten.” The write up states:

Consumer Watchdog’s privacy project director John Simpson wrote to the FTC yesterday, complaining that though Google claims to be dedicated to user privacy, its reluctance to allow Americans to remove ‘irrelevant’ search results is “unfair and deceptive.”

I am not sure how quickly the various political bodies will move to make being forgotten a real thing. My hunch is that it will become an issue with legs. Down the road, the third party editing is likely to be required. The First Amendment is a hurdle, but when it comes time to fund a campaign or deal with winning an election, there may be some flexibility in third party editing’s appeal.

From my point of view, an index is an index. I have seen some frisky analyses of my blog articles and my for fee essays. I am not sure I want criticism of my work to be forgotten. Without an editorial policy, third party, ad hoc deletion of index pointers distorts the results as much as, if not more than, results skewed by advertisers’ personal charm.

How about an editorial policy and then the application of that policy so that results are within applicable guidelines and representative of the information available on the public Internet?

Wow, that sounds old fashioned. The notion of an editorial policy is often confused with information governance. Nope. Editorial policies inform the database user of the rules of the game and what is included and excluded from an online service.

I like dinosaurs too. Like a cloned brontosaurus, is it time to clone the notion of editorial policies for corpus indices?

Stephen E Arnold, July 8, 2015

A Reminder about What Is Available to Search

July 5, 2015

Navigate to “Big Data, Big Problems: 4 Major Link Indexes Compared.” The write up explains why link indexes differ in their content. The services referenced in the write up are:

  • Ahrefs. A backlink index updated every 15 minutes.
  • Majestic. A big data solution for marketers and others. The company says, “Majestic-12 has crawled the web again, and again, and again. We have seen 2.7 trillion URLs come and go, and in the last 90 days we have seen, checked, scored and categorized 715 billion URLs.”
  • Moz. Products for inbound marketers.
  • SEMrush. Search engine marketing for digital marketers.

Despite the marketing focus, there were some interesting comments based on the analysis of backlink services (who links to what). Here’s one point I highlighted:

Each organization has to create a crawl prioritization strategy.

The article points out:

The bigger the crawl, the more the crawl prioritization will cause disparities. This is not a deficiency; this is just the nature of the beast.

Yep, editorial choice. Inclusions and exclusions. The takeaway: When you run a query, chances are you are getting biased, incomplete information for the query.
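To make the crawl prioritization point concrete, here is a small sketch. The pages, the signal names (`inlinks`, `freshness`), and the `crawl` function are my inventions for illustration and do not describe how any of the four services actually works. Two crawlers look at the same candidate pages with the same page budget but different scoring rules.

```python
import heapq
from typing import Callable

# Made-up candidate pages with made-up signals.
PAGES = {
    "a.example/home": {"inlinks": 900, "freshness": 0.1},
    "b.example/news": {"inlinks": 40, "freshness": 0.9},
    "c.example/blog": {"inlinks": 300, "freshness": 0.6},
    "d.example/old": {"inlinks": 500, "freshness": 0.05},
}


def crawl(
    candidates: dict[str, dict[str, float]],
    score: Callable[[dict[str, float]], float],
    budget: int,
) -> list[str]:
    """Return the URLs a crawler fetches when it can afford only `budget` pages."""
    # heapq is a min-heap, so store negative scores to pop the best page first.
    frontier = [(-score(signals), url) for url, signals in candidates.items()]
    heapq.heapify(frontier)
    fetched = []
    while frontier and len(fetched) < budget:
        _, url = heapq.heappop(frontier)
        fetched.append(url)
    return fetched


# Same pages, same budget, different editorial choice.
by_links = crawl(PAGES, lambda s: s["inlinks"], budget=2)
by_fresh = crawl(PAGES, lambda s: s["freshness"], budget=2)
```

With these numbers the link-driven crawler indexes the `a.example` and `d.example` pages, and the freshness-driven crawler indexes the `b.example` and `c.example` pages. The two indexes have no pages in common, and neither crawler made a mistake.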

The most important statement in the write up, in my opinion, is this one:

If anything rings true, it is that once again it makes sense to get data from as many sources as possible.

Good advice for search experts and sixth graders. Oh, MBAs may want to heed the statement as well.

But who cares? Probably not too many Internet users. Exciting when these “incomplete” information searchers make decisions.

Stephen E Arnold, July 5, 2015

France: Annoying the GOOG. Do the French Change a Cheese Process?

June 15, 2015

I have no chien in this fight. I read “France Orders Google to Scrub Search Globally in Right to Be Forgotten Requests.” Since I had been in a far off land then beavering away in a place where open carry enhances one’s machismo, the story may be old news to you. To me, it was like IBM innovation: Looked fresh, probably recycled.

Nevertheless, the article reports that the folks who bedeviled Julius Caesar are now irritating the digital Roman Empire. I learned:

France’s Commission nationale de l’informatique et des libertés (CNIL), the country’s data protection authority, has ordered Google to apply delisting on all domain names of its search engine. CNIL said in its news release that it’s received hundreds of complaints following Google’s refusals to carry out delisting. According to its latest transparency report, last updated on Friday 12 June, Google had received a total of 269,314 removal requests, had evaluated 977,948 URLs, and had removed 41.3% of those URLs.

I had an over the transom email from a person who identified himself with two initials only. He wrote:


For some reason the person was unhappy with Google’s responsiveness. I pointed the person to the appropriate Google Web page. But the two initial person continues to ask me to help. Yo, dude, I am retired. Google does not perceive me as much more than a person who should be buying Adwords.

Apparently, folks like my two letter person feel similarly frustrated.

As I understand the issue, France, like some other countries, wants the Google to remove links to content flagged by a person or entity filling in the form, and to move quickly and with extreme prejudice.

We will see. The Google does not do sprints, even when the instructions come from a country with more than 200 varieties of cheese, a plethora of search and retrieval systems, and some unsullied landscapes.

My hunch is that it may be quicker to create a Le Châtelain Camembert than to modify Google’s internal work flows. Well, maybe Roquefort or a Tomme de Savoie. Should France stick with cheese and leave the Googling to Google?

Stephen E Arnold, June 15, 2015

Medical Tagging: No Slam Dunk

May 28, 2015

The taxonomy/ontology/indexing professionals have a challenge. I am not sure many of the companies pitching better, faster, cheaper—no, strike that—better automated indexing of medical information will become too vocal about a flubbed layup.

Navigate to “Coalition for ICD 10 Responds to AMA.” It seems as if indexing what is a more closed corpus is a sticky ball of goo. The issue is the coding scheme required by everyone who wants to get reimbursed and retain certification.

The write up quotes a person who is supposed to be in the know:

“We’d see 13,000 diagnosis codes balloon into 68,000 – a five-fold increase.” [Dr. Robert Wah of the AMA]

The idea is that the controlled terms are becoming obese, weighty, and frankly sufficiently numerous to require legions of subject matter experts and software a heck of a lot more functional than Watson to apply “correctly.” I will let you select the definition of “correctly” which matches your viewpoint from this list of Beyond Search possibilities:

  • Health care administrators: Get paid
  • Physicians: Avoid scrutiny from any entity or boss
  • Insurance companies: Pay the least possible amount yet have an opportunity for machine assisted claim identification for subrogation
  • Patients: Oh, I forgot. The patients are of lesser importance.

You, gentle reader, are free to insert your own definition.

I circled this statement as mildly interesting:

As to whether ICD-10 will improve care, it would seem obvious that more precise data should lead to better identification of potential quality problems and assessment of provider performance. There are multiple provisions in current law that alter Medicare payments for providers with excess patient complications. Unfortunately, the ICD-9 codes available to identify complications are woefully inadequate. If a patient experiences a complication from a graft or device, there is no way to specify the type of graft or device nor the kind of problem that occurred. How can we as a nation assess hospital outcomes, pay fairly, ensure accurate performance reports, and embrace value-based care if our coded data doesn’t provide such basic information? Doesn’t the public have a right to know this kind of information?

Maybe. In my opinion, the public may rank below patients in the priorities of some health care delivery outfits, professionals, and advisers.

Indexing is necessary. Are the codes the ones needed? In an automatic indexing system, what’s more important: [a] Generating revenue for the vendor; [b] Reducing costs to the customer of the automated tagging system; [c] Making the indexing look okay and good enough?

Stephen E Arnold, May 28, 2015

Cerebrant Discovery Platform from Content Analyst

May 6, 2015

A new content analysis platform boasts the ability to find “non-obvious” relationships within unstructured data, we learn from a write-up hosted at PRWeb, “Content Analyst Announces Cerebrant, a Revolutionary SaaS Discovery Platform to Provide Rapid Insight into Big Content.” The press release explains what makes Cerebrant special:

“Users can identify and select disparate collections of public and premium unstructured content such as scientific research papers, industry reports, syndicated research, news, Wikipedia and other internal and external repositories.

“Unlike alternative solutions, Cerebrant is not dependent upon Boolean search strings, exhaustive taxonomies, or word libraries since it leverages the power of the company’s proprietary Latent Semantic Indexing (LSI)-based learning engine. Users simply take a selection of text ranging from a short phrase, sentence, paragraph, or entire document and Cerebrant identifies and ranks the most conceptually related documents, articles and terms across the selected content sets ranging from tens of thousands to millions of text items.”

We’re told that Cerebrant is based on the company’s prominent CAAT machine learning engine. The write-up also notes that the platform is cloud-based, making it easy to implement and use. Content Analyst launched in 2004, and is based in Reston, Virginia, near Washington, DC. They also happen to be hiring, in case anyone here is interested.

Cynthia Murrell, May 6, 2015

Sponsored by, publisher of the CyberOSINT monograph

Microsoft Nudges English to Ideographs

May 5, 2015

Short honk: In my college days, I studied with a fellow who was the world’s expert in the morpheme burger. You are familiar with hamburger. Lev Soudek (I believe this was his name) set out to catalog every use of –burger he could find. Dr. Soudek was convinced that words had a future.

He is probably pondering the rise of ideographs like emoji. For insiders, a pictograph can be worth a thousand words. I suppose the morpheme burger is important to the emergence of the hamburger icon like this:


Microsoft is pushing into new territory according to “Microsoft Is First to Let You Flip the Middle Finger Emoji.” Attensity, Smartlogic, and other content processing systems will be quick to adapt. The new Microsoft is a pioneering outfit.

Is it possible to combine the hamburger icon with the middle finger emoji to convey a message without words?

Dr. Soudek, what do you think?

image image

What about this alternative?

image image

How would one express this thought? Modern language? Classy!

Stephen E Arnold, May 5, 2015

Indexing Rah Rah Rah!

May 4, 2015

Enterprise search is one of the most important features for enterprise content management systems and there is a huge industry for designing and selling taxonomies. The key selling features for taxonomies are their diversity, accuracy, and quality. The categories within taxonomies make it easier for people to find their content, but Tech Target’s Search Content Management blog says there is room for improvement in the post: “Search-Based Applications Need The Engine Of Taxonomy.”

Taxonomies are used for faceted search, allowing users to expand and limit their search results. Faceted search gives users a selection of filters to change their results, including file type, key words, and more of the ever popular content categories. Users usually don’t access the categories directly; they are used primarily behind the scenes, and the aggregated results appear on the dashboard.
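A minimal sketch may help show what a facet is in practice. The documents, field names, and the helper functions `facet_counts` and `narrow` below are hypothetical and assume each document already carries its taxonomy category as a tag.

```python
from collections import Counter

# Hypothetical tagged documents; "category" stands in for a taxonomy node.
DOCS = [
    {"title": "Q1 budget", "file_type": "xlsx", "category": "Accounting"},
    {"title": "Audit memo", "file_type": "pdf", "category": "Accounting"},
    {"title": "Five year plan", "file_type": "pdf", "category": "Strategy"},
]


def facet_counts(docs, field):
    """Count how many documents fall under each value of one facet."""
    return Counter(doc[field] for doc in docs)


def narrow(docs, **selected):
    """Keep only documents that match every selected facet value."""
    return [
        doc
        for doc in docs
        if all(doc.get(field) == value for field, value in selected.items())
    ]


# The counts feed the sidebar; the narrowed list is what the user sees.
sidebar = facet_counts(DOCS, "file_type")
results = narrow(DOCS, file_type="pdf", category="Accounting")
```

The user clicks “pdf” and “Accounting” and sees one document. The category tags did the work behind the scenes, which is why the quality of the tagging matters more than the dashboard.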

Taxonomies, however, take their information from more than what the user provides:

“We are now able to assemble a holistic view of the customer based on information stored across a number of disparate solutions. Search-based applications can also include information about the customer that was inferred from public content sources that the enterprise does not own, such as news feeds, social media and stock prices.”

Whether you know it or not, taxonomies are vital to enterprise search. Companies that have difficulty finding their content need to consider creating a taxonomy plan or investing in category lists from a proven company.

Whitney Grace, May 4, 2015

BA Insight: More Auto Classification for SharePoint

April 30, 2015

I thought automatic indexing and classifying of content was a slam dunk. One could download Elastic and Carrot2 or just use Microsoft’s tools to whip up a way to put accounting tags on accounting documents, and planning tags on strategic management documents.

There are a number of SharePoint centric “automated solutions” available, and now there is one more.

I noticed on the BA Insight Web site this page:


There was some rah rah in US and Australian publications. But the big point is that either SharePoint administrators have a problem that existing solutions cannot solve or the competitors’ solutions don’t work particularly well.

My hunch is that automatic indexing and classifying in a wonky SharePoint set up is a challenge. The indexing can be done by humans and be terrible. Alternatively, the tagging can be done by an automated system and be terrible.

The issues range from entity resolution (remember the different spellings of Al Qaeda) to “drift.” In my lingo, “drift” means that the starting point for automated indexing just wanders as more content flows through the system and the administrator does not provide the time consuming and often expensive tweaking to get the indexing back on track.
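Here is a minimal sketch of both problems. It is an illustration under stated assumptions, not a description of any vendor's system. The alias table is hand built, the `resolve` function is hypothetical, and I am using the share of mentions that fail to resolve as a crude stand-in for “drift.” That proxy is my assumption.

```python
import re

# Hand-maintained table: normalized spelling -> canonical entity name.
ALIASES = {
    "alqaeda": "Al Qaeda",
    "alqaida": "Al Qaeda",
    "alqaidah": "Al Qaeda",
}


def key(name: str) -> str:
    """Lowercase the name and drop everything except the letters a-z."""
    return re.sub(r"[^a-z]", "", name.lower())


def resolve(name: str):
    """Return the canonical entity, or None when the spelling is unknown."""
    return ALIASES.get(key(name))


def unresolved_rate(mentions) -> float:
    """Share of mentions the table cannot resolve; a rising value hints at drift."""
    if not mentions:
        return 0.0
    misses = sum(1 for mention in mentions if resolve(mention) is None)
    return misses / len(mentions)
```

A spelling such as “al-Qa'ida” resolves because the punctuation is stripped before the lookup. A spelling such as “Al Kaida” does not resolve until someone adds it to the table. That someone is the administrator doing the time consuming and often expensive tweaking.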

There are smarter systems than some of those marketed to the struggling SharePoint licensees. I profile a number of NGIA systems in my new monograph CyberOSINT: Next Generation Information Access.

The SharePoint folks are not featured in my study because the demands of multi lingual, real time content processing do not work with solutions from more traditional vendors.

On any given day, I am asked to sit through Webinars about concepts, semantics, and classification. If these solutions worked, the market for SharePoint add ins would begin to coalesce.

So far, dealing with the exciting world of SharePoint content processing remains a work very much in progress.

Stephen E Arnold, April 30, 2015

Ontotext Pursues Visibility

April 23, 2015

Do you know Ontotext? The company is making an effort to become more visible. Navigate to “Vassil Momtchev talks Insights with the Bloor Group.” The interview provides a snapshot of the company’s history which dates from 2001. After 14 years, the interview reports that Ontotext “keeps its original company spirit.”

Other points from the write up:

  • The company’s technology makes use of semantic and ontology modeling
  • A knowledge base represents complex information and makes asking questions better
  • Semantic applications can deliver complete applications.

For more information about Ontotext and its “ontological” approach, visit the company’s Web site at

Stephen E Arnold, April 23, 2015

Enterprise Search Is Important: But Vendor Survey Fails to Make Its Case

March 20, 2015

I read “Concept Searching Survey Shows Enterprise Search Rises in the Ranks of Strategic Applications.” Over the years, I have watched enterprise search vendors impale themselves on their swords. In a few instances, licensees of search technology loosed legal eagles to beat the vendors to the ground. Let me highlight a few of the milestones in enterprise search before commenting on this “survey says, it must be true” news release.

A Simple Question?

What do these companies have in common?

  • Autonomy
  • Convera
  • Fast Search & Transfer?

I know from my decades of work in the information retrieval sector that financial doubts plagued these firms. Autonomy, as you know, is the focal point of on-going litigation over accounting methods, revenue, and its purchase price. Like many high-tech companies, Autonomy achieved significant revenues and caused some financial firms to wonder how Autonomy achieved its hundreds of millions in revenue. There was a report from Cazenove Capital I saw years ago, and it contained analyses that suggested search was not the money machine for the company.

And Convera? Excalibur, a document scanning outfit with some brute force searching technology, acquired the manual-indexing ConQuest Technologies and then morphed into Convera. Convera suggested that it could perform indexing magic on text and video. Intel dived in and so did the NBA. These two deals did not work out and the company fell on hard times. With an investment from Allen & Company, Convera tried its hand at Web indexing. Finally, stakeholders lost faith and Convera sold off its government sales and folded its tent. (Some of the principals cooked up another search company. This time the former Convera wizards got into the consulting engineering business.) Convera lives on in a sense as part of the Ntent system. Convera lost some money along the way. Lots of money as I recall.

And Fast Search? Microsoft paid $1.2 billion for Fast Search. Now the 1998 technology lives on within Microsoft SharePoint. But Fast Search has the unique distinction of facing both a financial investigation for fancy dancing with its profit and loss statement and the distinction of having its founder facing a jail term. Fast Search ran into trouble when its marketers promised magic from the ESP system. When the pixie dust caused licensees to develop an allergic reaction, Fast ran into trouble. The scrambling caused some managers to flee the floundering Norwegian search ship and found another search company. For those who struggle with Fast Search in its present guise, you understand the issues created by Fast Search’s “sell it today and program it tomorrow” approach.

Is There a Lesson in These Vendors’ Trajectories?

What do these three examples tell us? High flying enterprise search vendors seem to have run into some difficulties. Not surprisingly, the customers of these companies are often wary of enterprise search. Perhaps that is the reason so many enterprise search vendors do not use the words “enterprise search”, preferring euphemisms like customer support, business intelligence, and knowledge management?

The Rush to Sell Out before Drowning in Red Ink

Now a sidelight. Before open source search effectively became the go to keyword search system, there were vendors who had products that for the most part worked when installed to do basic information retrieval. These companies’ executives worked overtime to find buyers. The founders cashed out and left the new owners to figure out how to make sales, pay for research, and generate sufficient revenue to get the purchase price back. Which companies are these? Here’s a short and incomplete list to help jog your memory:

  • Artificial Linguistics (sold to Oracle)
  • BRS Search (sold to OpenText)
  • EasyAsk (first to Progress Software and then to an individual investor)
  • Endeca to Oracle
  • Enginium (sold to Kroll and now out of business)
  • Exalead to Dassault
  • Fulcrum Technology to IBM (quite a story. See the Fulcrum profile at
  • InQuira to Oracle
  • Information Dimensions (sold to OpenText)
  • Innerprise (Microsoft centric, sold to GoDaddy)
  • iPhrase to IBM (iPhrase was a variant of Teratext’s approach)
  • ISYS Search Software to Lexmark (yes, a printer company)
  • RightNow to Oracle (RightNow acquired Dutch technology for its search function)
  • Schemalogic to Smartlogic
  • Stratify/Purple Yogi (sold to Iron Mountain and then to Autonomy)
  • Teratext to SAIC, now Leidos
  • TripleHop to Oracle
  • Verity to Autonomy and then HP bought Autonomy
  • Vivisimo to IBM (how clustering and metasearch magically became a Big Data system from the company that “invented” Watson).

The brand impact of these acquired search vendors is dwindling. The only “name” on the list which seems to have some market traction is Endeca.

Some outfits just did not make it or are in a very quiet, almost dormant, mode. Consider these search vendors:

  • Delphes (academic thinkers with linguistic leanings)
  • Edgee
  • Dieselpoint (structured data search)
  • DR LINK (Syracuse University and an investment bank)
  • Executive Search (not a headhunting outfit, an enterprise search outfit)
  • Grokker
  • Intrafind
  • Kartoo
  • Lextek International
  • Maxxcat
  • Mondosoft
  • Pertimm (reincarnated with Axel Springer (Macmillan) money as Qwant, which according to Eric Schmidt, is a threat to Google. Yeah, right.)
  • Siderean Software (semantic search)
  • Speed of Mind
  • Suggest (Weitkämper Technology)?
  • Thunderstone

This is not a comprehensive list. I just wanted to lay out some facts about vendors who tilted at the enterprise search windmill. I think that a reasonable person might conclude that enterprise search has been a tough sell. Of the companies that developed a brand, none was able to achieve sustainable revenues. The information highway is littered with the remains of vendors who pitched enterprise search as the killer app for anything to do with information.

Now the survey purports to reveal insights to which I have been insensitive in my decades of work in digital information access.

Here’s what the company sponsoring the survey offers:

Concept Searching [the survey promulgator], the global leader in semantic metadata generation, auto-classification, and taxonomy management software, and developer of the Smart Content Framework™, is compiling the statistics from its 2015 SharePoint and Office 365 Metadata survey, currently unpublished. One of the findings, gathered from over 360 responses, indicates a renewed focus on improving enterprise search.

The focus seems to be on SharePoint. I thought SharePoint was a mishmash of content management, collaboration, and contacts along with documents created by the fortunate SharePoint users. Question: Is enterprise search conflated with SharePoint?

I would not make this connection.

If I understand this, the survey makes clear that some of the companies in the “sample” (method of selection not revealed) want better search. I want better information access, not search per se.

Each day I have dozens of software applications which require information access activity. I also have a number of “enterprise” search systems available to me. Nevertheless, the finding suggests to me that enterprise search is not and has not been particularly good. If I put on my SharePoint sunglasses, I see a glint of the notion that SharePoint search is not very good. The dying sparks of Fast Search technology smoldering in the fire at Camp DontWorkGud.

Images, videos, and audio content present me with a challenge. Enterprise search and metatagging systems struggle to deal with these content types. I also get odd ball file formats; for example, Framemaker, Quark, and AS/400 DB2 UDB files.

The survey points out that the problem with enterprise search is that indexing is not very good. That may be an understatement. But the remedy is not just indexing, is it?

After reading the news release, I formed the opinion that the fix is to use the type of system available from the survey sponsor Concept Searching. Is that a coincidence?

Frankly, I think the problems with search are more severe than bad indexing, whether performed by humans or traditional “smart” software.

According to the news release, my view is not congruent with the survey or the implications of the survey data:

A new focus on enterprise search can be viewed as a step forward in the management and use of unstructured content. Organizations are realizing that the issue isn’t going to go away and is now impacting applications such as records management, security, and litigation support. This translates into real business currency and increases the risk of non-compliance and security breaches. You can’t find, protect, or use what you don’t know exists. For those organizations that are using, or intend to deploy, a hybrid environment, the challenges of leveraging metadata across the entire enterprise can be daunting, without the appropriate technology to automate tagging.

Real business currency. Is that money?

Are system administrators still indexing human resource personnel records, in process legal documents related to litigation, data from research tests and trials in an enterprise search system? I thought a more fine-grained approach to indexing was appropriate. If an organization has a certain type of government work, knowledge of that work can only be made available to those with a need to know. Is indiscriminate and uncontrolled indexing in line with a “need to know” approach?

Information access has a bright future. Open source technology such as Lucene/Solr/Searchdaimon/SphinxSearch, et al. is a reasonable approach to keyword functionality.

Value-added content processing is also important but not as an add on. I think that the type of functionality available from BAE, Haystax, Leidos, and Raytheon is more along the lines of the type of indexing, metatagging, and coding I need. The metatagging is integrated into a more modern system and architecture.

For instance, I want to map geo-coordinates in the manner of Geofeedia to each item of data. I also want context. I need an entity (Barrerra) mapped to an image integrated with social media. And, for me, predictive analytics are essential. If I have the name of an individual, I want that name and its variants. I want the content to be multi-language.

I want what next generation information access systems deliver. I don’t want indexing and basic metatagging. There is a reason for Google’s investing in Recorded Future, isn’t there?

The future of buggy whip enterprise search is probably less of a “strategic application” and more of a utility. Microsoft may make money from SharePoint. But for certain types of work, SharePoint is a bit like Windows 3.11. I want a system that solves problems, not one that spawns new challenges on a daily basis.

Enterprise search vendors have been delivering so-so, flawed, and problematic functionality for 40 years. After decades of vendor effort to make information findable in an organization, has significant progress been made? DARPA doesn’t think search is very good. The agency is seeking better methods of information access.

What I see when I review the landscape of enterprise search is that today’s “leaders” (Attivio, BA Insight, Coveo, dtSearch, Exorbyte, among others) remind me of the buggy whip makers driving a Model T to lecture farmers that their future depends on the horse as the motive power for their tractor.

Enterprise search is a digital horse, and one that is approaching breakdown.

Enterprise search is a utility within more feature rich, mission critical systems. For a list of 20 companies delivering NGIA with integrated content processing, check out

Stephen E Arnold, March 20, 2015
