The Google: A Real Newspaper Discovers Modern Research

December 4, 2016

I read “Google, Democracy and the Truth about Internet Search.” One more example of a person who thinks he or she is an excellent information hunter and gatherer. Let’s be candid. A hunter gatherer flailing away for 15 or so years using online research tools, libraries, and conversations with actual humans should be able to differentiate a bunny rabbit from a female wolf with baby wolves at her feet.

Natural selection works differently in the hunting and gathering world of online. The intrepid knowledge warrior can make basic mistakes, use assumptions without consequence, and accept whatever a FREE online service delivers. No natural selection operates.


A “real” journalist discovers the basics of online search’s power. Great insight, just 50 years from the time online search became available to this moment of insight in December 2016. Slow on the trigger or just clueless?

That’s scary. When the 21st century hunter gatherer seems to have a moment of inspiration and realizes that online services, particularly ad supported free services, crank out baloney, it’s frightening. The write up makes clear that a “real” journalist seems to have figured out that online outputs are not exactly the same as sitting at a table with several experts and discussing an issue. Online is not the same as going to a library and reading books and journal articles, thinking about what each source presents as actual factoids.

Here’s an example of the “understanding” one “real” journalist has about online information:

Google is knowledge. It’s where you go to find things out.

There you go. Reliance on one service to provide “knowledge.” From an ad supported. Free. Convenient. Ubiquitous. Online service.

Yep, that’s the way to keep track of “knowledge.”

Google and Its Search Results: Objective or Subjective

December 1, 2016

I love the Alphabet Google thing. The information I obtain via a Google query is spot on, accurate, perfect, and highly credible. Run the query “dancing with the stars” and what do you get? Substance. Rock solid factoids.

I read “Google Search Results Tend to Have Liberal Bias That Could Influence Public Opinion.” The write up informed me:

After analyzing nearly 2,000 pages, a panel rated 31% pages as liberal as opposed to only 22% that were conservative; the remaining 47% pages were neutral that included government or mainstream news websites.

And the source of this information? An outfit called That sounds like a company that would make Ian Sharp sit up and take notice. Don’t remember Ian Sharp? Well, too bad. He founded IP Sharp Associates and had some useful insights about the subjective/objective issues in algorithms.

The methodology is interesting too:

The study conducted by online search marketer found that 50 most recent searches for political terms on the search engine showed more liberal-leaning Web pages rather than conservative ones.

The Google insists that its results are objective. But Google keeps its ranking method secret. The write up quotes a computer science professor as saying:

“No one really knows what Google’s search engine is doing,” said Christo Wilson, a Northeastern University computer science professor. “This is a big, complex system that’s been evolving for 15 years.”

Hmm. Evolving. I thought that the Google wraps its 1998 methods and just keeps on trucking. My hunch is that the wrappers which have been added by those trying to deal with the new content and new uses to which the mobile and desktop Web search systems are put are add ons. Think of the customization of a celebrity’s SUV. That’s how Google relevance has evolved. Cool, right?

The write up points out:

Google denies results are politically slanted and says its algorithms use several factors.

My hunch is that the search marketer is well meaning, but it may have some biases baked into its study. The study, like the Google, is based on human choices. When humans fiddle, subjectivity enters the arena. For real objectivity, check out Google’s translation system which may have created its own inter-lingua. That’s objective as long as one does not try to translate a colloquial code phrase from a group of individuals seeking to secure their communications.

Subjective humans are needed for that task. Humans are subjective. So how does the logic flow? Oh, right. Google must be subjective. This is news? Ask Foundem.

Stephen E Arnold, December 1, 2016

Partnership Aims to Establish AI Conventions

October 24, 2016

Artificial intelligence research has been booming, and it is easy to see why: recent advances in the field have opened some exciting possibilities, both for business and society as a whole. Still, it is important to proceed carefully, given the potential dangers of relying too much on the judgement of algorithms. The Philadelphia Inquirer reports on a joint effort to develop some AI principles and best practices in its article, “Why This AI Partnership Could Bring Profits to These Tech Titans.” Writer Chiradeep BasuMallick explains:

Given this backdrop, the grandly named Partnership on AI to Benefit People and Society is a bold move by Alphabet, Facebook, IBM and Microsoft. These globally revered companies are literally creating a technology Justice League on a quest to shape public/government opinion on AI and to push for friendly policies regarding its application and general audience acceptability. And it should reward investors along the way.

The job at hand is very simple: Create a wave of goodwill for AI, talk about the best practices and then indirectly push for change. Remember, global laws are still obscure when it comes to AI and its impact.

Curiously enough, this elite team is missing two major heavyweights. Apple and Tesla Motors are notably absent. Apple Chief Executive Tim Cook, always secretive about AI work, though we know about the estimated $200 million Turi project, is probably waiting for a more opportune moment. And Elon Musk, co-founder, chief executive and product architect of Tesla Motors, has his own platform to promote technology, called OpenAI.

Along with representatives of each participating company, the partnership also includes some independent experts in the AI field. To say that technology is advancing faster than the law can keep up with is a vast understatement. This ongoing imbalance underscores the urgency of this group’s mission to develop best practices for companies and recommendations for legislators. Their work could do a lot to shape the future of AI and, by extension, society itself. Stay tuned.

Cynthia Murrell, October 24, 2016
Sponsored by, publisher of the CyberOSINT monograph

Semantiro and Ontocuro Basic

October 20, 2016

Quick update from the Australian content processing vendor SSAP or Semantic Software Asia Pacific Limited. The company’s Semantiro platform now supports the new Ontocuro tool.

Semantiro is a platform which “promises the ability to enrich the semantics of data collected from disparate data sources, and enables a computer to understand its context and meaning,” according to “Semantic Software Announces Artificial Intelligence Offering.”

I learned:

Ontocuro is the first suite of core components to be released under the Semantiro platform. These bespoke components will allow users to safely prune unwanted concepts and axioms; validate existing, new or refined ontologies; and import, store and share these ontologies via the Library.

The company’s approach is to leapfrog the complex interfaces other indexing and data tagging tools impose on the user. The company’s Web site for Ontocuro is at this link.

Stephen E Arnold, October 20, 2016

Online and without Ooomph: Social Content

October 15, 2016

I am surprised when Scientific American Magazine runs a story somewhat related to online information access. Navigate to read “The Bright Side of Internet Shaming.” The main point is that shaming has “become so common that it might soon begin to lose its impact.” Careful wording, of course. It is Scientific American, and the write up has few facts of the scientific ilk.

I highlighted this passage:

…these days public shaming are increasingly frequent. They’ve become a new kind of grisly entertainment, like a national reality show.

Yep, another opinion from Scientific American.

I then circled in Hawthorne Scarlet A red:

there’s a certain kind of hope in the increasing regularity of shamings. As they become commonplace, maybe they’ll lose their ability to shock. The same kinds of ugly tweets have been repeated so many times, they’re starting to become boilerplate.

I don’t pay much attention to social media unless the data are part of a project. I have a tough time distinguishing misinformation, disinformation, and run of the mill information.

What’s the relationship to search? Locating “shaming” type messages is difficult. Social media search engines don’t work particularly well. The half hearted attempts at indexing are not consistent. No surprise in that because user generated input is often uninformed input, particularly when it comes to indexing.

My thought is that Scientific American reflects shaming. The write up is not scientific. I would have found the article more interesting if it had included:

  • Data based on tweet or Facebook post analyses based on negative or “shaming” words
  • Facts about the increase or decrease in “shaming” language for some “boilerplate” words
  • A Palantir-type link analysis illustrating the centroids for one solid shaming example.
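
The first two bullets are not hard to rough out. Here is a minimal sketch, assuming a hypothetical list of “shaming” words and a couple of invented posts. The write up supplies neither, so `SHAMING_WORDS` and `SAMPLE_POSTS` are placeholders, not data. A real analysis would need an actual tweet or Facebook collection.

```python
import re
from collections import Counter

# Hypothetical lexicon; the article names no specific "shaming" words.
SHAMING_WORDS = {"disgrace", "shame", "pathetic", "idiot"}

# Invented sample posts as (year, text) pairs, for illustration only.
SAMPLE_POSTS = [
    (2015, "What a disgrace. Shame on you."),
    (2016, "Pathetic. Shame, shame, shame!"),
]


def shaming_counts(posts):
    """Count lexicon hits across posts, keyed by (year, word)."""
    counts = Counter()
    for year, text in posts:
        # Lowercase and split into simple word tokens.
        for token in re.findall(r"[a-z']+", text.lower()):
            if token in SHAMING_WORDS:
                counts[(year, token)] += 1
    return counts


def yearly_totals(posts):
    """Collapse the per-word counts into one total per year."""
    totals = Counter()
    for (year, _word), n in shaming_counts(posts).items():
        totals[year] += n
    return dict(totals)
```

Comparing `yearly_totals` across years would show whether the “boilerplate” language is rising or falling, which is the sort of fact the article could have offered.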

Scientific American has redefined science it seems. Thus, a search for science might return a false drop for the magazine. I will skip the logic of the write up because the argument strikes me as subjective American thought.

Stephen E Arnold, October 15, 2016

Deindexing: A Thing?

October 12, 2016

There was the right to be forgotten. There were reputation management companies promising to scrub unwanted information from indexes using humans, lawyers (a different species, of course), and software agents.

Now I have learned that “dozens of suspicious court cases, with missing defendants, aim at getting web pages taken down or deindexed.” The write up asserts:

Google and various other Internet platforms have a policy: They won’t take down material (or, in Google’s case, remove it from Google indexes) just because someone says it’s defamatory. Understandable — why would these companies want to adjudicate such factual disputes? But if they see a court order that declares that some material is defamatory, they tend to take down or deindex the material, relying on the court’s decision.

Two thoughts:

  1. Have reputation management experts cooked up some new broth?
  2. How long will the lovely word “deindex” survive in the maelstrom of the information flow?

I love the idea of indexing content. Perhaps there is a new opportunity for innovation with the deindexing thing? Semantic deindexing? Structured deindexing? And my fave unstructured deindexing in federated cloud based data lakes. I wish I were 21 years old again. A new career beckons with declassification codes, delanguage processing, and even desmart software.

Stephen E Arnold, October 22, 2016

Hacking Federal Agencies Now Child’s Play

October 12, 2016

A potentially dangerous piece of malware called GovRat, which is effective in cyber-espionage, is available on the Dark Web for as little as $1,000.

IBTimes recently published an article, “Malware used to target US Government and military being sold on Dark Web,” in which the author states:

The evolved version of GovRat, which builds on a piece of malware first exposed in November last year, can be used by hackers to infiltrate a victim’s computer, remotely steal files, upload malware or compromised usernames and passwords.

The second version of this malware has already caused significant damage. Along with it, the seller is also willing to give away credentials to access US government servers and military groups.

Though the exact identity of the creator of GovRat 2.0 is unknown, the article states:

“Several of these individuals are known as professional hackers for hire,” Komarov explained. He cited one name as ROR [RG] – a notorious hacker who previously targeted Ashley Madison, AdultFriendFinder and the Turkish General Directorate of Security (EGM).

Data of large numbers of federal employees are already compromised and details like email, home address, login IDs and hashed passwords are available for anyone who can pay the price.

InfoArmor, a cybersecurity and identity protection firm, unearthed this information while scanning Dark Web forums and has already passed the details to the affected parties. The extent of the damage is unknown, but the stolen information can be used to cause further damage.

Vishal Ingole, October 12, 2016

Google: Algorithms Are Objective

July 17, 2016

I know that Google’s algorithms are tireless, objective numerical recipes. However, “Google: Downranking Online Piracy Sites in Search Results Has Led to a 89% Decrease in Traffic” sparked in my mind the notion that human intervention may be influencing some search result rankings. I highlighted these statements in the write up:

“Google does not proactively remove hyperlinks to any content unless first notified by copyright holders, but the tech giant says that it is now processing copyright removal notices in less than six hours on average…” I assume this work is performed by objective algorithms.

“…it is happy to demote links to pages that explicitly contain or link to content that infringes copyright.” Again, a machine process and, therefore, objective?

Human intervention in high volume flows of information is often difficult. If Google is not using machine processes, perhaps the company is forced to group sites and then have humans make decisions.

Artificial intelligence, are you not up to the task?

Stephen E Arnold, July 21, 2016

Semantics Made Easier

May 9, 2016

For fans of semantic technology, Ontotext has a late spring delight for you. The semantic platform vendor Ontotext has released GraphDB 7. I read “Ontotext Releases New Version of Semantic Graph Database.” According to the announcement, set up and data access are easier. I learned:

The new release offers new tools to access and explore data, eliminating the need to know everything about the dataset before start working with it. GraphDB 7 enables users to navigate their way through third-party and any other dataset regardless of data volumes, which makes it a powerful Big Data analytics tool. Ver.7 offers visual exploration of the loaded data schema – ontology, interactive query builder for better entity retrieval, and full support for RDF 1.1 allowing smooth import of a huge number of public Open Data as well as proprietary Linked Datasets.

If you want to have a Palantir-type system, check out Ontotext. The company is confident that semantic technology will yield benefits, a claim made by other semantic technology vendors. But the complexity challenges associated with conversion and normalization of content are likely to be a pebble in the semantic sneaker.

Stephen E Arnold, May 9, 2016

Patents and Semantic Search: No Good, No Good

March 31, 2016

I have been working on a profile of Palantir (open source information only, however) for my forthcoming Dark Web Notebook. I bumbled into a video from an outfit called ClearstoneIP. I noted that ClearstoneIP’s video showed how one could select from a classification system. With every click, the result set changed. For some types of searching, a user may find the point-and-click approach helpful. However, there are other ways to root through what appears to be patent applications. There are the very expensive methods happily provided by Reed Elsevier and Thomson Reuters, two fine outfits. And then there are less expensive methods like Alphabet Google’s odd ball patent search system or the quite functional FreePatentsOnline service. In between, you and I have many options.

None of them is a slam dunk. When I was working through the publicly accessible Palantir Technologies’ patents, I had to fall back on my very old-fashioned method. I tracked down a PDF, printed it out, and read it. Believe me, gentle reader, this is not the most fun I have ever had. In contrast to the early Google patents, Palantir’s documents lack the detailed “background of the invention” information which the salad days’ Googlers cheerfully presented. Palantir’s write ups are slogs. Perhaps the firm’s attorneys were born with dour brain circuitry.

I did a side jaunt and came across a white paper from ClearstoneIP called “Why Semantic Searching Fails for Freedom-to-Operate (FTO).” ClearstoneIP is a patent analysis company, and its 12 page write up is about patent searching. The company, according to its Web site, is a “paradigm shifter.” The company describes itself this way:

ClearstoneIP is a California-based company built to provide industry leaders and innovators with a truly revolutionary platform for conducting product clearance, freedom to operate, and patent infringement-based analyses. ClearstoneIP was founded by a team of forward-thinking patent attorneys and software developers who believe that barriers to innovation can be overcome with innovation itself.

The “freedom to operate” phrase is a bit of legal jargon which I don’t understand. I am, thank goodness, not an attorney.

The firm’s search method makes much of the ontology, taxonomy, classification approach to information access. Hence, the reason my exploration of Palantir’s dynamic ontology with objects tossed ClearstoneIP into one of my search result sets.

The white paper is interesting if one works around the legal mumbo jumbo. The company’s approach is remarkable and invokes some of my caution light words; for example:

  • “Not all patent searches are the same.”, page two
  • “This all leads to the question…”, page seven
  • “…there is never a single “right” way to do so.”, page eight
  • “And if an analyst were to try to capture all of the ways…”, page eight
  • “to capture all potentially relevant patents…”, page nine.

The absolutist approach to argument is fascinating.

Okay, what’s the ClearstoneIP search system doing? Well, it seems to me that it is taking a path to consider some of the subtleties in patent claims’ statements. The approach is very different from that taken by Brainware and its tri-gram technology. Now that Lexmark owns Brainware, the application of the Brainware system to patent searching has fallen off my radar. Brainware relied on patterns; ClearstoneIP uses the ontology-classification approach.

Both are useful in identifying patents related to a particular subject.
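
For readers who have not bumped into tri-grams, here is a minimal sketch of the general pattern idea: break text into overlapping three-character chunks and compare the chunk sets. This is a generic character trigram comparison using Jaccard overlap, which is my choice for illustration. It is an assumption, not Brainware’s actual method, and the function names are hypothetical.

```python
def trigrams(text):
    """Return the set of character trigrams in a lowercased, space-normalized string."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def trigram_similarity(a, b):
    """Jaccard overlap of two trigram sets, from 0.0 (nothing shared) to 1.0 (identical sets)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        # Strings shorter than three characters have no trigrams to compare.
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

The appeal of the pattern approach is that “fastener” and “fasteners” score as near matches without any dictionary or ontology. The weakness is the flip side: two claims that mean the same thing in different words share few patterns.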

What is interesting in the write up is its approach to “semantics.” I highlighted in billable hour green:

Anticipating all the ways in which a product can be described is serious guesswork.

Yep, but isn’t that where a human with relevant training and expertise becomes important? The white paper takes the approach that semantic search fails for the ClearstoneIP method dubbed FTO or freedom to operate information access.

The white paper asserted:


Semantic searching is the primary focus of this discussion, as it is the most evolved.

ClearstoneIP defines semantic search in this way:

Semantic patent searching generally refers to automatically enhancing a text-based query to better represent its underlying meaning, thereby better identifying conceptually related references.

I think the definition of semantic is designed to strike directly at the heart of the methods offered to lawyers with paying customers by Lexis-type and Westlaw-type systems. Lawyers-to-be usually have access to the commercial-type services when in law school. In the legal market, there are quite a few outfits trying to provide better, faster, and sometimes less expensive ways to make sense of the Miltonesque prose popular among the patent crowd.

The white paper, in a lawyerly way, describes the approach of semantic search systems. Note that the “narrowing” to the concerns of attorneys engaged in patent work is in the background even though the description seems to be painted in broad strokes:

This process generally includes: (1) supplementing terms of a text-based query with their synonyms; and (2) assessing the proximity of resulting patents to the determined underlying meaning of the text-based query. Semantic platforms are often touted as critical add-ons to natural language searching. They are said to account for discrepancies in word form and lexicography between the text of queries and patent disclosure.
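
The two steps in that passage can be roughed out in a few lines. This is a toy sketch under stated assumptions: the synonym table, the sample documents, and the simple word overlap score standing in for “proximity” are all hypothetical. No vendor’s system is this simple, and the white paper gives no implementation details.

```python
# Hypothetical synonym table; a real platform would use a far larger thesaurus.
SYNONYMS = {
    "fastener": {"screw", "bolt"},
    "vehicle": {"car", "automobile"},
}

# Invented one-line "patents", for illustration only.
DOCS = {
    "A": "a bolt secures the panel to the car",
    "B": "a method for brewing coffee",
    "C": "vehicle with a fastener",
}


def expand_query(query):
    """Step 1: supplement each query term with its synonyms."""
    terms = set(query.lower().split())
    for term in list(terms):
        terms |= SYNONYMS.get(term, set())
    return terms


def rank(query, documents):
    """Step 2: score documents by overlap with the expanded query, best first."""
    expanded = expand_query(query)
    if not expanded:
        return []
    scored = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        score = len(expanded & words) / len(expanded)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by document id for a stable order.
    return [doc_id for score, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

With the query “vehicle fastener,” document A surfaces only because of the synonym step; a plain keyword match would miss it. The catch, which the white paper leans on, is that the method finds only the wordings someone thought to put in the synonym table.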

The white paper offers this conclusion about semantic search:

it [semantic search] is surprisingly ineffective for FTO.

Seems reasonable, right? Semantic search assumes a “paradigm.” In my experience, taxonomies, classification schema, and ontologies perform the same intellectual trick. The idea is to put something into a cubby. Organizing information makes manifest what something is and where it fits in a mental construct.

But these semantic systems do a lousy job figuring out what’s in the Claims section of a patent. That’s a flaw which is a direct consequence of the lingo lawyers use to frame the claims themselves.

Search systems use many different methods to pigeonhole a statement. The “aboutness” of a statement or a claim is a sticky wicket. As I have written in many articles, books, and blog posts, finding on point information is very difficult. Progress has been made when one wants a pizza. Less progress has been made in finding the colleagues of the bad actors in Brussels.

Palantir requires that those adding content to the Gotham data management system add tags from a “dynamic ontology.” In addition to what the human has to do, the Gotham system generates additional metadata automatically. Other systems use mostly automatic systems which are dependent on a traditional controlled term list. Others just use algorithms to do the trick. The systems which are making friends with users strike a balance; that is, using human input directly or indirectly and some administrator only knowledgebases, dictionaries, synonym lists, etc.

ClearstoneIP keeps its eye on its FTO ball, which is understandable. The white paper asserts:

The point here is that semantic platforms can deliver effective results for patentability searches at a reasonable cost but, when it comes to FTO searching, the effectiveness of the platforms is limited even at great cost.

Okay, I understand. ClearstoneIP includes a diagram which drives home how its FTO approach soars over the competitors’ systems:


ClearstoneIP, © 2016

My reaction to the white paper is that for decades I have evaluated and used information access systems. None of the systems is without serious flaws. That includes the clever n gram-based systems, the smart systems from dozens of outfits, the constantly reinvented keyword centric systems from the Lexis-type and Westlaw-type vendors, even the simplistic methods offered by free online patent search systems like

What seems to be the reality of the legal landscape is:

  1. Patent experts use a range of systems. With lots of budget, many free and for fee systems will be used. The name of the game is meeting the client needs and obviously billing the client for time.
  2. No patent search system to which I have been exposed does an effective job of thinking like a very good patent attorney. I know that the notion of artificial intelligence is the hot trend, but the reality is that seemingly smart software usually cheats by formulating queries based on analysis of user behavior, facts like geographic location, and who pays to get their pizza joint “found.”
  3. A patent search system, in order to be useful for the type of work I do, has to index germane content generated in the course of the patent process. Comprehensiveness is simply not part of the patent search systems’ modus operandi. If there’s a B, where’s the A? If there is a germane letter about a patent, where the heck is it?

I am not on the “side” of the taxonomy-centric approach. I am not on the side of the crazy semantic methods. I am not on the side of the keyword approach when inventors use different names on different patents, Babak Parviz aliases included. I am not in favor of any one system.

How do I think patent search is evolving? ClearstoneIP has it sort of right. Attorneys have to tag what is needed. The hitch in the git along has been partially resolved by Palantir-type systems; that is, the ontology has to be dynamic and available to anyone authorized to use a collection in real time.

But for lawyers there is one added necessity which will not leave us any time soon. Lawyers bill; hence, whatever is output from an information access system has to be read, annotated, and considered by a semi-capable human.

What’s the future of patent search? My view is that there will be new systems. The one constant is that, by definition, a lawyer cannot trust the outputs. The way to deal with this is to pay a patent attorney to read patent documents.

In short, like the person looking for information in the scriptoria at the Alexandria Library, the task ends up as a manual one. Perhaps there will be a friendly Boston Dynamics librarian available to do the work some day. For now, search systems won’t do the job because attorneys cannot trust an algorithm when the likelihood of missing something exists.

Oh, I almost forgot. Attorneys have to get paid via that billable time thing.

Stephen E Arnold, March 30, 2016