
Google: Algorithms Are Objective

July 17, 2016

I know that Google’s algorithms are tireless, objective numerical recipes. However, “Google: Downranking Online Piracy Sites in Search Results Has Led to a 89% Decrease in Traffic” sparked in my mind the notion that human intervention may be influencing some search result rankings. I highlighted these statements in the write up:

“Google does not proactively remove hyperlinks to any content unless first notified by copyright holders, but the tech giant says that it is now processing copyright removal notices in less than six hours on average…” I assume this work is performed by objective algorithms.

“…it is happy to demote links to pages that explicitly contain or link to content that infringes copyright.” Again, a machine process and, therefore, objective?

Human intervention in high volume flows of information is often difficult. If Google is not using machine processes, perhaps the company is forced to group sites and then have humans make decisions.

Artificial intelligence, are you not up to the task?

Stephen E Arnold, July 21, 2016

Semantics Made Easier

May 9, 2016

For fans of semantic technology, Ontotext has a late spring delight for you. The semantic platform vendor Ontotext has released GraphDB 7. I read “Ontotext Releases New Version of Semantic Graph Database.” According to the announcement, set up and data access are easier. I learned:

The new release offers new tools to access and explore data, eliminating the need to know everything about the dataset before start working with it. GraphDB 7 enables users to navigate their way through third-party and any other dataset regardless of data volumes, which makes it a powerful Big Data analytics tool. Ver.7 offers visual exploration of the loaded data schema – ontology, interactive query builder for better entity retrieval, and full support for RDF 1.1 allowing smooth import of a huge number of public Open Data as well as proprietary Linked Datasets.

If you want to have a Palantir-type system, check out Ontotext. The company is confident that semantic technology will yield benefits, a claim made by other semantic technology vendors. But the complexity challenges associated with conversion and normalization of content are likely to be a pebble in the semantic sneaker.

Stephen E Arnold, May 9, 2016

Patents and Semantic Search: No Good, No Good

March 31, 2016

I have been working on a profile of Palantir (open source information only, however) for my forthcoming Dark Web Notebook. I bumbled into a video from an outfit called ClearstoneIP. I noted that ClearstoneIP’s video showed how one could select from a classification system. With every click, the result set changed. For some types of searching, a user may find the point-and-click approach helpful. However, there are other ways to root through what appears to be patent applications. There are the very expensive methods happily provided by Reed Elsevier and Thomson Reuters, two fine outfits. And then there are less expensive methods like Alphabet Google’s odd ball patent search system or the quite functional FreePatentsOnline service. In between, you and I have many options.

None of them is a slam dunk. When I was working through the publicly accessible Palantir Technologies’ patents, I had to fall back on my very old-fashioned method. I tracked down a PDF, printed it out, and read it. Believe me, gentle reader, this is not the most fun I have ever had. In contrast to the early Google patents, Palantir’s documents lack the detailed “background of the invention” information which the salad days’ Googlers cheerfully presented. Palantir’s write ups are slogs. Perhaps the firm’s attorneys were born with dour brain circuitry.

I did a side jaunt and came across a white paper from ClearstoneIP called “Why Semantic Searching Fails for Freedom-to-Operate (FTO).” ClearstoneIP is a patent analysis company, and its 12 page write up is about patent searching. The company, according to its Web site, is a “paradigm shifter.” The company describes itself this way:

ClearstoneIP is a California-based company built to provide industry leaders and innovators with a truly revolutionary platform for conducting product clearance, freedom to operate, and patent infringement-based analyses. ClearstoneIP was founded by a team of forward-thinking patent attorneys and software developers who believe that barriers to innovation can be overcome with innovation itself.

The “freedom to operate” phrase is a bit of legal jargon which I don’t understand. I am, thank goodness, not an attorney.

The firm’s search method makes much of the ontology, taxonomy, classification approach to information access. Hence, the reason my exploration of Palantir’s dynamic ontology with objects tossed ClearstoneIP into one of my search result sets.

The white paper is interesting if one works around the legal mumbo jumbo. The company’s approach is remarkable and invokes some of my caution light words; for example:

  • “Not all patent searches are the same.”, page two
  • “This all leads to the question…”, page seven
  • “…there is never a single “right” way to do so.”, page eight
  • “And if an analyst were to try to capture all of the ways…”, page eight
  • “to capture all potentially relevant patents…”, page nine.

The absolutist approach to argument is fascinating.

Okay, what’s the ClearstoneIP search system doing? Well, it seems to me that it is taking a path to consider some of the subtleties in patent claims’ statements. The approach is very different from that taken by Brainware and its tri-gram technology. Now that Lexmark owns Brainware, the application of the Brainware system to patent searching has fallen off my radar. Brainware relied on patterns; ClearstoneIP uses the ontology-classification approach.

Both are useful in identifying patents related to a particular subject.
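To make the pattern idea concrete, here is a minimal sketch of character tri-gram matching. It is an illustration of the general technique only, not Brainware's actual system; the function names and the Jaccard-style score are assumptions chosen for brevity.

```python
def char_trigrams(text: str) -> set[str]:
    """Return the set of three-character windows in a normalized string."""
    # Lowercase and collapse runs of whitespace so spacing does not matter.
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the two tri-gram sets (0.0 = no overlap, 1.0 = identical sets)."""
    grams_a, grams_b = char_trigrams(a), char_trigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical phrases: similar wording scores higher than unrelated wording.
close = trigram_similarity("fastener assembly", "fastening assembly")
far = trigram_similarity("fastener assembly", "optical sensor")
```

A pattern matcher of this kind rewards shared character sequences. It knows nothing about what a claim means, which is where a classification-based approach takes a different path.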

What is interesting in the write up is its approach to “semantics.” I highlighted in billable hour green:

Anticipating all the ways in which a product can be described is serious guesswork.

Yep, but isn’t that where a human with relevant training and expertise becomes important? The white paper takes the approach that semantic search fails for the ClearstoneIP method dubbed FTO or freedom to operate information access.

The white paper asserted:


Semantic searching is the primary focus of this discussion, as it is the most evolved.

ClearstoneIP defines semantic search in this way:

Semantic patent searching generally refers to automatically enhancing a text-based query to better represent its underlying meaning, thereby better identifying conceptually related references.

I think the definition of semantic is designed to strike directly at the heart of the methods offered to lawyers with paying customers by Lexis-type and Westlaw-type systems. Lawyers-to-be usually have access to the commercial-type services when in law school. In the legal market, there are quite a few outfits trying to provide better, faster, and sometimes less expensive ways to make sense of the Miltonesque prose popular among the patent crowd.

The white paper, in a lawyerly way, describes the approach of semantic search systems. Note that the “narrowing” to the concerns of attorneys engaged in patent work is in the background even though the description seems to be painted in broad strokes:

This process generally includes: (1) supplementing terms of a text-based query with their synonyms; and (2) assessing the proximity of resulting patents to the determined underlying meaning of the text-based query. Semantic platforms are often touted as critical add-ons to natural language searching. They are said to account for discrepancies in word form and lexicography between the text of queries and patent disclosure.
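The two steps in that passage can be sketched in a few lines. This is a toy illustration under loud assumptions: the synonym table, the sample documents, and the overlap score are all hypothetical, and real semantic platforms are far more elaborate.

```python
# Hypothetical synonym table; a real system would draw on a large thesaurus.
SYNONYMS = {
    "fastener": {"screw", "bolt", "rivet"},
    "attach": {"secure", "couple", "affix"},
}


def expand_query(query: str) -> set[str]:
    """Step 1: supplement each query term with its synonyms."""
    terms = set(query.lower().split())
    expanded = set(terms)
    for term in terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded


def score(expanded: set[str], document: str) -> float:
    """Step 2 (crudely): fraction of expanded terms that appear in the document."""
    if not expanded:
        return 0.0
    words = set(document.lower().split())
    return len(expanded & words) / len(expanded)


def rank(query: str, documents: list[str]) -> list[str]:
    """Order documents from best to worst overlap with the expanded query."""
    expanded = expand_query(query)
    return sorted(documents, key=lambda doc: score(expanded, doc), reverse=True)


docs = ["a bolt and rivet hold the panel", "a lens focuses light"]
ranked = rank("fastener", docs)
```

Note what the sketch cannot do: a claim that describes the same part with words missing from the synonym table scores zero. That is the "serious guesswork" problem in miniature.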

The white paper offers this conclusion about semantic search:

it [semantic search] is surprisingly ineffective for FTO.

Seems reasonable, right? Semantic search assumes a “paradigm.” In my experience, taxonomies, classification schema, and ontologies perform the same intellectual trick. The idea is to put something into a cubby. Organizing information makes manifest what something is and where it fits in a mental construct.

But these semantic systems do a lousy job figuring out what’s in the Claims section of a patent. That’s a flaw which is a direct consequence of the lingo lawyers use to frame the claims themselves.

Search systems use many different methods to pigeonhole a statement. The “aboutness” of a statement or a claim is a sticky wicket. As I have written in many articles, books, and blog posts, finding on point information is very difficult. Progress has been made when one wants a pizza. Less progress has been made in finding the colleagues of the bad actors in Brussels.

Palantir requires that those adding content to the Gotham data management system add tags from a “dynamic ontology.” In addition to what the human has to do, the Gotham system generates additional metadata automatically. Other systems use mostly automatic systems which are dependent on a traditional controlled term list. Others just use algorithms to do the trick. The systems which are making friends with users strike a balance; that is, using human input directly or indirectly and some administrator only knowledgebases, dictionaries, synonym lists, etc.

ClearstoneIP keeps its eye on its FTO ball, which is understandable. The white paper asserts:

The point here is that semantic platforms can deliver effective results for patentability searches at a reasonable cost but, when it comes to FTO searching, the effectiveness of the platforms is limited even at great cost.

Okay, I understand. ClearstoneIP includes a diagram which drives home how its FTO approach soars over the competitors’ systems:


ClearstoneIP, © 2016

My reaction to the white paper is that for decades I have evaluated and used information access systems. None of the systems is without serious flaws. That includes the clever n gram-based systems, the smart systems from dozens of outfits, the constantly reinvented keyword centric systems from the Lexis-type and Westlaw-type vendors, even the simplistic methods offered by free online patent search systems like

What seems to be reality of the legal landscape is:

  1. Patent experts use a range of systems. With lots of budget, many free and for fee systems will be used. The name of the game is meeting the client needs and obviously billing the client for time.
  2. No patent search system to which I have been exposed does an effective job of thinking like a very good patent attorney. I know that the notion of artificial intelligence is the hot trend, but the reality is that seemingly smart software usually cheats by formulating queries based on analysis of user behavior, facts like geographic location, and who pays to get their pizza joint “found.”
  3. A patent search system, in order to be useful for the type of work I do, has to index germane content generated in the course of the patent process. Comprehensiveness is simply not part of the patent search systems’ modus operandi. If there’s a B, where’s the A? If there is a germane letter about a patent, where the heck is it?

I am not on the “side” of the taxonomy-centric approach. I am not on the side of the crazy semantic methods. I am not on the side of the keyword approach when inventors use different names on different patents, Babak Parviz aliases included. I am not in favor of any one system.

How do I think patent search is evolving? ClearstoneIP has it sort of right. Attorneys have to tag what is needed. The hitch in the git along has been partially resolved by Palantir-type systems; that is, the ontology has to be dynamic and available to anyone authorized to use a collection in real time.

But for lawyers there is one added necessity which will not leave us any time soon. Lawyers bill; hence, whatever is output from an information access system has to be read, annotated, and considered by a semi-capable human.

What’s the future of patent search? My view is that there will be new systems. The one constant is that, by definition, a lawyer cannot trust the outputs. The way to deal with this is to pay a patent attorney to read patent documents.

In short, like the person looking for information in the scriptoria at the Alexandria Library, the task ends up as a manual one. Perhaps there will be a friendly Boston Dynamics librarian available to do the work some day. For now, search systems won’t do the job because attorneys cannot trust an algorithm when the likelihood of missing something exists.

Oh, I almost forgot. Attorneys have to get paid via that billable time thing.

Stephen E Arnold, March 30, 2016

DeepGram: Audio Search in Lectures and Podcasts

March 23, 2016

I read “DeepGram Lets You Search through Lectures and Podcasts for Your Favorite Quotes.” I don’t think the system is available at this time. The article states:

Search engines make it easy to look through text files for specific words, but finding phrases and keywords in audio and video recordings could be a hassle. Fortunately, California-based startup DeepGram is working on a tool that will make this process simpler.

The hint is the “is working.” Not surprisingly, the system is infused with artificial intelligence. The process is to convert speech to text and then index the result.
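The "index the result" half of that process is easy to sketch; the speech-to-text half is the hard part and is not shown. The snippet below assumes a transcript already exists as (start time in seconds, text) segments. The segment data and function names are hypothetical, and nothing here reflects how DeepGram actually works.

```python
from collections import defaultdict

PUNCTUATION = ".,!?\"'"


def build_index(segments):
    """Map each word to the start times of the transcript segments containing it."""
    index = defaultdict(set)
    for start, text in segments:
        for token in text.lower().split():
            word = token.strip(PUNCTUATION)
            if word:
                index[word].add(start)
    return index


def find_segments(index, query):
    """Return start times of segments that contain every word in the query."""
    words = [w.strip(PUNCTUATION) for w in query.lower().split()]
    words = [w for w in words if w]
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


# Hypothetical transcript of a short lecture.
segments = [
    (0.0, "Welcome to the lecture."),
    (12.5, "Search is hard, audio search is harder."),
    (30.0, "Thanks for listening"),
]
index = build_index(segments)
```

With the start times in hand, a player could jump straight to the matching moment in the recording instead of making me sit through the whole thing.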

Exalead had an interesting system seven or eight years ago. I am not sure what happened to that demonstration. My recollection is that the challenge is to have sufficient processing power to handle the volume of audio and video content available for indexing.

When an outfit like Google is not able to pull off a comprehensive search system for its audio and video content, my hunch is that the task for a robust volume of content might be a challenge.

But if there is sufficient money, engineering talent, and processing power, perhaps I will no longer have to watch serial videos and listen to lousy audio to figure out what some folks are trying to communicate in their presentations.

Stephen E Arnold, March 23, 2016


Interview with Stephen E Arnold, Reveals Insights about Content Processing

March 22, 2016

Nikola Danaylov of the Singularity Weblog interviewed technology and financial analyst Stephen E. Arnold on the latest episode of his podcast, Singularity 1 on 1. The interview, Stephen E. Arnold on Search Engines and Intelligence Gathering, offers thought-provoking ideas on important topics related to sectors — such as intelligence, enterprise search, and financial — which use indexing and content processing methods Arnold has worked with for over 50 years.

Arnold attributes the origins of his interest in technology to a programming challenge he sought and accepted from a computer science professor, outside of the realm of his college major of English. His focus on creating actionable software and his affinity for problem-solving of any nature led him to leave PhD work for a job with Halliburton Nuclear. His career includes employment at Booz, Allen & Hamilton, the Courier Journal & Louisville Times, and Ziff Communications, before starting strategic information services in 1991. He co-founded and sold a search system to Lycos, Inc., worked with numerous organizations including several intelligence and enforcement organizations such as US Senate Police and General Services Administration, and authored seven books and monographs on search related topics.

With a continued emphasis on search technologies, Arnold began his blog, Beyond Search, in 2008 aiming to provide an independent source of “information about what I think are problems or misstatements related to online search and content processing.” Speaking to the relevance of the blog to his current interest in the intelligence sector of search, he asserts:

“Finding information is the core of the intelligence process. It’s absolutely essential to understand answering questions on point and so someone can do the job and that’s been the theme of Beyond Search.”

As Danaylov notes, the concept of search encompasses several areas where information discovery is key for one audience or another, whether counter-terrorism, commercial, or other purposes. Arnold agrees,

“It’s exactly the same as what the professor wanted to do in 1962. He had a collection of Latin sermons. The only way to find anything was to look at sermons on microfilm. Whether it is cell phone intercepts, geospatial data, processing YouTube videos uploaded from a specific IP address– exactly the same problem and process. The difficulty that exists is that today we need to process data in a range of file types and at much higher speeds than ever anticipated, but the processes remain the same.”

Arnold explains the iterative nature of his work:

“The proof of the value of the legacy is I don’t really do anything new, I just keep following these themes. The Dark Web Notebook is very logical. This is a new content domain. And if you’re an intelligence or information professional, you want to know, how do you make headway in that space.”

Describing his most recent book, Dark Web Notebook, Arnold calls it “a cookbook for an investigator to access information on the Dark Web.” This monograph includes profiles of little-known firms which perform high-value Dark Web indexing and follows a book he authored in 2015 called CYBEROSINT: Next Generation Information Access.


Around Paywalls? Probably Not Spot On

February 27, 2016

I read “How Google’s Web Crawler Bypasses Paywalls.” I am not confident the write up is spot on. You may find the information useful in your own efforts to do the Connotate-type or Kimono-type thing.

The outfit with the paywall tunnel, according to the write up, is Alphabet’s Google unit. Talk about the tail wagging the dog.

The write up points out that the method uses Referer and User-Agent headers.

The approach is detailed in the article via code snippets. It’s in the cards, so have at it.
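For readers who have not fiddled with HTTP headers, here is a minimal sketch of how the two headers are attached to a request using Python's standard library. The URL and header values are neutral placeholders of my own, not the values from the article, and the sketch builds the request without sending anything.

```python
from urllib.request import Request


def build_request(url: str, referer: str, user_agent: str) -> Request:
    """Create (but do not send) a request carrying Referer and User-Agent headers."""
    return Request(url, headers={"Referer": referer, "User-Agent": user_agent})


# Placeholder values for illustration only.
req = build_request(
    "https://example.com/article",
    referer="https://example.com/",
    user_agent="ExampleCrawler/1.0",
)

# urllib stores header names with only the first letter capitalized,
# so the lookup key is "User-agent" rather than "User-Agent".
referer_value = req.get_header("Referer")
agent_value = req.get_header("User-agent")
```

A server sees these two values on every request and can vary its response based on them, which is the behavior the write up examines.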

Oh, there may be other methods in play, but I will leave you to your experimentation.

Stephen E Arnold, February 23, 2016

Need an Open Source Semantic Web Crawler?

December 17, 2015

If you do, the beleaguered Yahoo has some open source goodies for you. Navigate to “Yahoo Open Sources Anthelion Web Crawler for Parsing Structured Data on HTML Pages.” The software, states the write up, is “designed for parsing structured data from HTML pages under an open source license.”

There is a statement I found darned interesting:

“To the best of our knowledge, we are first to introduce the idea of a crawler focusing on semantic data, embedded in HTML pages using markup languages as microdata, microformats or RDFa,” wrote authors Peter Mika and Roi Blanco of Yahoo Labs and Robert Meusel of Germany’s University of Mannheim.
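To show what "semantic data embedded in HTML pages" looks like in practice, here is a simplified sketch that pulls microdata properties out of a page with Python's standard HTML parser. It is not Anthelion's code; the class name and sample markup are hypothetical, and the sketch ignores nested items, microformats, and RDFa.

```python
from html.parser import HTMLParser


class MicrodataCollector(HTMLParser):
    """Collect (itemprop, value) pairs from simple microdata markup."""

    def __init__(self):
        super().__init__()
        self.properties = []
        self._pending = None  # itemprop waiting for its text content

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        prop = attributes.get("itemprop")
        if prop is None:
            return
        if attributes.get("content") is not None:
            # <meta itemprop="..." content="..."> carries its value inline.
            self.properties.append((prop, attributes["content"]))
        else:
            self._pending = prop

    def handle_data(self, data):
        # The first non-blank text after an itemprop tag is taken as its value.
        if self._pending and data.strip():
            self.properties.append((self._pending, data.strip()))
            self._pending = None


# Hypothetical page fragment using schema.org vocabulary.
page = (
    '<div itemscope itemtype="https://schema.org/Person">'
    '<span itemprop="name">Ada</span>'
    '<meta itemprop="jobTitle" content="Analyst">'
    "</div>"
)
collector = MicrodataCollector()
collector.feed(page)
```

A focused crawler would presumably go further and use the presence of such markup to decide which links to follow next, but that prioritization logic is beyond this sketch.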

My immediate thought was, “Why don’t these folks take a look at the 2007 patent documents penned by Ramanathan Guha?” Those documents present a rather thorough description of a semantic component which hooks into the Google crawlers. Now the Google has not open sourced these systems and methods.

My reaction is, “Yahoo may want to ask the former Yahooligans who are now working at Yahoo how novel the Yahoo approach really is.”

Failing that, Yahoo may want to poke around in the literature, including patent documents, to see which outfits have trundled down the semantic crawling Web thing before. Would it have been more economical and efficient to license the Siderean Software crawler and build on that?

Stephen E Arnold, December 17, 2015

Google Indexes Some Dynamic Content

December 10, 2015

If you generate Web pages dynamically (who doesn’t?), you may want to know if the Alphabet Google thing can index the content on dynamic pages.

For some apparently objective information about the GOOG’s ability to index dynamic content, navigate to “Does Google Crawl Dynamic Content?” The article considers 11 types of dynamic content methods.

Here’s the passage I highlighted:

  • Google crawls and indexes all content that was injected by javascript.
  • Google even shows results in the SERP that are based on asynchronously injected content.
  • Google can handle content from httpRequest().
  • However, JSON-LD as such does not necessarily lead to SERP results (as opposed to the officially supported SERP entities that are not only indexed, but also used to decorate the SERP).
  • Injected JSON-LD gets recognized by the structured data testing tool – including Tag Manager injection. This means that once Google decides to support the entities, indexing will not be a problem.
  • Dynamically updated meta elements get crawled and indexed, too.
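For the JSON-LD items in that list, here is a minimal sketch of how an indexer might read JSON-LD out of a page. It assumes the page has already been rendered, so any script element injected by JavaScript is present in the HTML string; the class, function, and sample markup are hypothetical and say nothing about Google's own pipeline.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect parsed JSON-LD blocks from <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed blocks rather than fail the whole page.


def extract_json_ld(html: str) -> list:
    """Return every JSON-LD object found in the rendered HTML."""
    collector = JsonLdCollector()
    collector.feed(html)
    return collector.items


# Hypothetical rendered page.
rendered = (
    '<html><head><script type="application/ld+json">'
    '{"@type": "Article", "headline": "Hello"}'
    "</script></head><body></body></html>"
)
found = extract_json_ld(rendered)
```

Recognizing the markup is the easy step. Whether the entity then shows up in a results page is a separate decision, which is the distinction the bullet about JSON-LD draws.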

The question one may wish to consider is, “What does Alphabet Google do with that information and data?” There are some clues in the Ramanathan Guha patent documents filed in 2007.

Stephen E Arnold, December 10, 2015

Medical Search Solved Again

December 10, 2015

I have looked at a wide range of medical information search systems over the years. These range from Medline to the Grateful Med.

I read “A Cure for Medical Researchers’ Big Data Headache.” The Big Data in question is the Medline database. The new search tool is ORiGAMI (I love that wonky upper and lower case thing).

The basic approach involves:

Apollo, a Cray Urika graph computer, possesses massive multithreaded processors and 2 terabytes of shared memory, attributes that allow it to host the entire MEDLINE database and compute multiple pathways on multiple graphs simultaneously. Combined with Helios, CADES’ Cray Urika extreme analytics platform, Sukumar’s team had the cutting-edge hardware needed to process large datasets quickly—about 1,000 times faster than a workstation—and at scale.

And the payoff?

Once the MEDLINE database was brought into the CADES environment, Sreenivas Rangan Sukumar’s [a data scientist at the Department of Energy’s Oak Ridge National Laboratory] team applied advanced graph theory models that implement semantic, statistical, and logical reasoning algorithms to create ORiGAMI. The result is a free online application capable of delivering health insights in less than a second based on the combined knowledge of a worldwide medical community.

My view is that Medline is not particularly big. The analysis of the content pool can generate lots of outputs.

From my vantage point in rural Kentucky, this is another government effort to create a search system. Perhaps this is the breakthrough that will surpass IBM Watson’s medical content capabilities?

Does your local health care provider have access to a Cray computer and the other bits and pieces, including a local version of Dr. Sukumar?

Stephen E Arnold, December 10, 2015

Visual Content: An Indexing Challenge

December 4, 2015

I read “5 Visual Content Tools to Boost Engagement.” The write up points to a handful of services which generate surveys, infographics, and collages of user supplied photos. If I knew a millennial, I can imagine hearing the susurration of excitement emitted by the lad or lass.

Here’s one example from the write up:

The average bounce rate on blogs for new visitors is 60.2%, and the average reader stays only 1 to 2 minutes on your website. One way to get people to really engage with your content is to use a tool like Roojoom, which is a content curation and creation platform.

Roojoom lets you collect content from your online and offline sources (such as your web pages, videos, PDFs and marketing materials) to create a “content journey” for readers. You then guide readers step by step through the journey, all from within one centralized place.

Now I don’t want to rain on the innovation parade. Years ago, an outfit called i2 Group Ltd. developed a similar solution. After dogging and ponying the service, it became clear that in the early 2000s, there was not much appetite for this type of data exploration. i2 eventually sold out to IBM and the company returned to its roots in intelligence and law enforcement.

The thought I had after reading about Roojoom and the other services was this:

How will the information be indexed and made findable?

As content becomes emoji-ized, the indexing task does not become easier. Making sense of images is not yet a slam dunk. Heck, automated indexing only shoots accurately 80 to 90 percent of the time. In a time of heightened concern about risks, is a one in five bet a good one? I try to narrow the gap, but many are okay without worrying too much.
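The arithmetic behind the one in five figure is worth spelling out. It corresponds to the low end of the accuracy range; at the high end the miss rate is one in ten. A small sketch, with the collection size of 1,000 documents being an assumption for illustration:

```python
from fractions import Fraction


def miss_rate(accuracy_percent: int) -> Fraction:
    """Share of items indexed incorrectly, as an exact fraction."""
    return 1 - Fraction(accuracy_percent, 100)


def expected_misses(accuracy_percent: int, documents: int) -> Fraction:
    """Expected number of mis-indexed items in a collection of the given size."""
    return miss_rate(accuracy_percent) * documents


low_end = miss_rate(80)    # one in five
high_end = miss_rate(90)   # one in ten
misses_low = expected_misses(80, 1000)
misses_high = expected_misses(90, 1000)
```

For a 1,000 document collection that works out to somewhere between 100 and 200 items a researcher may never find.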

As visual content becomes more desirable, the indexing systems will have to find a way to make this content findable. Words baffle many of these content processing outfits. Pictures are another hill to climb. If it is not indexed, the content may not be findable. Is this a problem for researchers and analysts? And for you, gentle reader?

Stephen E Arnold, December 4, 2015
