Patents and Semantic Search: No Good, No Good

March 31, 2016

I have been working on a profile of Palantir (open source information only, however) for my forthcoming Dark Web Notebook. I bumbled into a video from an outfit called ClearstoneIP. I noted that ClearstoneIP’s video showed how one could select from a classification system. With every click, the result set changed. For some types of searching, a user may find the point-and-click approach helpful. However, there are other ways to root through what appear to be patent applications. There are the very expensive methods happily provided by Reed Elsevier and Thomson Reuters, two fine outfits. And then there are less expensive methods like Alphabet Google’s oddball patent search system or the quite functional FreePatentsOnline service. In between, you and I have many options.

None of them is a slam dunk. When I was working through the publicly accessible Palantir Technologies’ patents, I had to fall back on my very old-fashioned method. I tracked down a PDF, printed it out, and read it. Believe me, gentle reader, this is not the most fun I have ever had. In contrast to the early Google patents, Palantir’s documents lack the detailed “background of the invention” information which the salad days’ Googlers cheerfully presented. Palantir’s write ups are slogs. Perhaps the firm’s attorneys were born with dour brain circuitry.

I did a side jaunt and came across a white paper from ClearstoneIP called “Why Semantic Searching Fails for Freedom-to-Operate (FTO).” The 12 page write up from this patent analysis company is about patent searching. The company, according to its Web site, is a “paradigm shifter.” The company describes itself this way:

ClearstoneIP is a California-based company built to provide industry leaders and innovators with a truly revolutionary platform for conducting product clearance, freedom to operate, and patent infringement-based analyses. ClearstoneIP was founded by a team of forward-thinking patent attorneys and software developers who believe that barriers to innovation can be overcome with innovation itself.

The “freedom to operate” phrase is a bit of legal jargon which I don’t understand. I am, thank goodness, not an attorney.

The firm’s search method makes much of the ontology, taxonomy, classification approach to information access. Hence, the reason my exploration of Palantir’s dynamic ontology with objects tossed ClearstoneIP into one of my search result sets.

The white paper is interesting if one works around the legal mumbo jumbo. The company’s approach is remarkable and invokes some of my caution light words; for example:

  • “Not all patent searches are the same.” (page two)
  • “This all leads to the question…” (page seven)
  • “…there is never a single ‘right’ way to do so.” (page eight)
  • “And if an analyst were to try to capture all of the ways…” (page eight)
  • “…to capture all potentially relevant patents…” (page nine)

The absolutist approach to argument is fascinating.

Okay, what’s the ClearstoneIP search system doing? Well, it seems to me that it is taking a path to consider some of the subtleties in patent claims’ statements. The approach is very different from that taken by Brainware and its tri-gram technology. Now that Lexmark owns Brainware, the application of the Brainware system to patent searching has fallen off my radar. Brainware relied on patterns; ClearstoneIP uses the ontology-classification approach.

Both are useful in identifying patents related to a particular subject.
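For readers who have not seen pattern-based matching in action, here is a minimal sketch of character trigram similarity, the general family of technique Brainware was known for. Brainware’s actual implementation is proprietary, and the sample texts below are invented for illustration.

```python
# Minimal sketch of character trigram matching, the general pattern-based
# approach Brainware-style systems relied on. Illustrative only; the real
# Brainware implementation is proprietary.

def trigrams(text: str) -> set:
    """Break text into overlapping three-character fragments."""
    cleaned = " ".join(text.lower().split())
    return {cleaned[i:i + 3] for i in range(len(cleaned) - 2)}

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of trigram sets: 1.0 means identical fragments."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

claim = "a fastening device comprising a threaded shaft"
candidate = "fastener device with threaded shafts"
print(round(similarity(claim, candidate), 2))  # rough lexical closeness
```

The point of the pattern approach is that no human has to maintain a taxonomy; the point of the ontology approach is that a human decides what the cubby holes mean.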

What is interesting in the write up is its approach to “semantics.” I highlighted in billable hour green:

Anticipating all the ways in which a product can be described is serious guesswork.

Yep, but isn’t that where the role of a human with relevant training and expertise becomes important? The white paper takes the position that semantic search fails for the task ClearstoneIP dubs FTO, or freedom-to-operate, information access.

The white paper asserted:

Semantic searching is the primary focus of this discussion, as it is the most evolved.

ClearstoneIP defines semantic search in this way:

Semantic patent searching generally refers to automatically enhancing a text-based query to better represent its underlying meaning, thereby better identifying conceptually related references.

I think the definition of semantic is designed to strike directly at the heart of the methods offered to lawyers with paying customers by Lexis-type and Westlaw-type systems. Lawyers-to-be usually have access to the commercial-type services when in law school. In the legal market, there are quite a few outfits trying to provide better, faster, and sometimes less expensive ways to make sense of the Miltonesque prose popular among the patent crowd.

The white paper, in a lawyerly way, describes the approach of semantic search systems. Note that the “narrowing” to the concerns of attorneys engaged in patent work is in the background even though the description seems to be painted in broad strokes:

This process generally includes: (1) supplementing terms of a text-based query with their synonyms; and (2) assessing the proximity of resulting patents to the determined underlying meaning of the text-based query. Semantic platforms are often touted as critical add-ons to natural language searching. They are said to account for discrepancies in word form and lexicography between the text of queries and patent disclosure.
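As a rough sketch of step one, synonym expansion of a query might look like the following. The synonym table here is hypothetical; commercial systems rely on curated thesauri and statistical models rather than a hand-built dictionary.

```python
# Minimal sketch of synonym-based query expansion, step (1) above.
# The SYNONYMS table is invented for illustration, not a real thesaurus.

SYNONYMS = {
    "fastener": ["screw", "bolt", "rivet"],
    "vehicle": ["automobile", "car", "conveyance"],
}

def expand_query(query: str) -> list[str]:
    """Supplement each query term with its synonyms."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand_query("vehicle fastener"))
# ['vehicle', 'automobile', 'car', 'conveyance', 'fastener', 'screw', 'bolt', 'rivet']
```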

The white paper offers this conclusion about semantic search:

it [semantic search] is surprisingly ineffective for FTO.

Seems reasonable, right? Semantic search assumes a “paradigm.” In my experience, taxonomies, classification schema, and ontologies perform the same intellectual trick. The idea is to put something into a cubby. Organizing information makes manifest what something is and where it fits in a mental construct.

But these semantic systems do a lousy job figuring out what’s in the Claims section of a patent. That’s a flaw which is a direct consequence of the lingo lawyers use to frame the claims themselves.

Search systems use many different methods to pigeonhole a statement. The “aboutness” of a statement or a claim is a sticky wicket. As I have written in many articles, books, and blog posts, finding on point information is very difficult. Progress has been made when one wants a pizza. Less progress has been made in finding the colleagues of the bad actors in Brussels.

Palantir requires that those adding content to the Gotham data management system add tags from a “dynamic ontology.” In addition to what the human has to do, the Gotham system generates additional metadata automatically. Other systems use mostly automatic methods which depend on a traditional controlled term list. Others just use algorithms to do the trick. The systems which are making friends with users strike a balance; that is, they use human input directly or indirectly plus administrator-only knowledge bases, dictionaries, synonym lists, etc.
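A generic sketch of that balance appears below: a human supplies tags drawn from a controlled ontology, and the system layers on automatically generated metadata. The ontology terms and extraction rules are invented for illustration; this is not Gotham’s actual interface or data model.

```python
# Illustrative sketch only: human-supplied ontology tags plus automatically
# generated metadata. The ontology terms and extraction rules are invented;
# this is not Palantir Gotham's actual API.

import re
from datetime import datetime

ONTOLOGY = {"Person", "Organization", "Location", "Event"}   # hypothetical

def tag_document(text: str, human_tags: set) -> dict:
    unknown = human_tags - ONTOLOGY
    if unknown:
        raise ValueError(f"tags not in ontology: {unknown}")
    auto_metadata = {
        "ingested_at": datetime.utcnow().isoformat(),
        "word_count": len(text.split()),
        "dates_mentioned": re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text),
    }
    return {"text": text, "tags": sorted(human_tags), "auto": auto_metadata}

doc = tag_document("Meeting in Brussels on 2016-03-22.", {"Event", "Location"})
print(doc["tags"], doc["auto"]["dates_mentioned"])
```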

ClearstoneIP keeps its eye on its FTO ball, which is understandable. The white paper asserts:

The point here is that semantic platforms can deliver effective results for patentability searches at a reasonable cost but, when it comes to FTO searching, the effectiveness of the platforms is limited even at great cost.

Okay, I understand. ClearstoneIP includes a diagram which drives home how its FTO approach soars over the competitors’ systems:

[Diagram: ClearstoneIP’s comparison of FTO approaches, © ClearstoneIP 2016]

My reaction to the white paper is that for decades I have evaluated and used information access systems. None of the systems is without serious flaws. That includes the clever n-gram-based systems, the smart systems from dozens of outfits, the constantly reinvented keyword-centric systems from the Lexis-type and Westlaw-type vendors, even the simplistic methods offered by free online patent search systems like Pat2PDF.org.

What seems to be the reality of the legal landscape is:

  1. Patent experts use a range of systems. With lots of budget, many free and for-fee systems will be used. The name of the game is meeting the client’s needs and obviously billing the client for time.
  2. No patent search system to which I have been exposed does an effective job of thinking like a very good patent attorney. I know that the notion of artificial intelligence is the hot trend, but the reality is that seemingly smart software usually cheats by formulating queries based on analysis of user behavior, facts like geographic location, and who pays to get their pizza joint “found.”
  3. A patent search system, in order to be useful for the type of work I do, has to index germane content generated in the course of the patent process. Comprehensiveness is simply not part of the patent search systems’ modus operandi. If there’s a B, where’s the A? If there is a germane letter about a patent, where the heck is it?

I am not on the “side” of the taxonomy-centric approach. I am not on the side of the crazy semantic methods. I am not on the side of the keyword approach when inventors use different names on different patents, Babak Parviz aliases included. I am not in favor of any one system.

How do I think patent search is evolving? ClearstoneIP has it sort of right. Attorneys have to tag what is needed. The hitch in the git along has been partially resolved by Palantir-type systems; that is, the ontology has to be dynamic and available to anyone authorized to use a collection in real time.

But for lawyers there is one added necessity which will not leave us any time soon. Lawyers bill; hence, whatever is output from an information access system has to be read, annotated, and considered by a semi-capable human.

What’s the future of patent search? My view is that there will be new systems. The one constant is that, by definition, a lawyer cannot trust the outputs. The way to deal with this is to pay a patent attorney to read patent documents.

In short, like the person looking for information in the scriptoria at the Alexandria Library, the task ends up as a manual one. Perhaps there will be a friendly Boston Dynamics librarian available to do the work some day. For now, search systems won’t do the job because attorneys cannot trust an algorithm when the likelihood of missing something exists.

Oh, I almost forgot. Attorneys have to get paid via that billable time thing.

Stephen E Arnold, March 30, 2016

Microsoft and the Open Source Trojan Horse

March 30, 2016

Quite a few outfits embrace open source. There are a number of reasons:

  1. It is cheaper than writing original code
  2. It is less expensive than writing original code
  3. It is more economical than writing original code.

The article “Microsoft is Pretending to be a FOSS Company in Order to Secure Government Contracts With Proprietary Software in ‘Open’ Clothing” reminded me that there is another reason.

No kidding.

I know that IBM has snagged Lucene and waved its once magical wand over the information access system and pronounced, “Watson.” I know that deep inside the kind, gentle heart of Palantir Technologies, there are open source bits. And there are others.

The write up asserted:

For those who missed it, Microsoft is trying to EEE GNU/Linux servers amid Microsoft layoffs; selfish interests of profit, as noted by some writers [1,2] this morning, nothing whatsoever to do with FOSS (there’s no FOSS aspect to it at all!) are driving these moves. It’s about proprietary software lock-in that won’t be available for another year anyway. It’s a good way to distract the public and suppress criticism with some corny images of red hearts.

The other interesting point I highlighted was:

reject the idea that Microsoft is somehow “open” now. The European Union, the Indian government and even the White House now warm up to FOSS, so Microsoft is pretending to be FOSS. This is protectionism by deception from Microsoft and those who play along with the PR campaign (or lobbying) are hurting genuine/legitimate FOSS.

With some government statements of work requiring “open” technologies, Microsoft may be doing what other firms have been doing for a while. See points one to three above. Microsoft is just late to the accountants’ party.

Why not replace the SharePoint search thing with an open source solution? What about the $1.2 billion MSFT paid for the fascinating Fast Search & Transfer technology in 2008? It works really well, right?

Stephen E Arnold, March 30, 2016

Google Reveals Personal Data in Search Results

March 30, 2016

Our lives are already all over the Internet, but Google recently unleashed a new feature that takes it to a new level.  Search Engine Watch tells us about “Google Shows Personal Data Within Search Results, Tests ‘Recent Purchases’ Feature” and the new way to see your Internet purchases.

Google most likely pulls the purchase information from Gmail or Chrome.  The official explanation is that Google search is now more personalized because it pulls information from Google apps:

“You can search for information from other Google products you use, like Gmail, Google Calendar, and Google+. For example, you can search for information about your upcoming flights, restaurant reservations, or appointments.”

Personalized Google search can now display results not only from purchases but also bills, flights, reservations, packages, events, and Google Photos.  It is part of Google’s mission to not only organize the world, but also be a personal assistant, part of the new Google Now.

While it is a useful tool to understand your personal habits, organize information, and interact with data as if in a science-fiction show, it is at the same time creepy to be able to search your life with Google.  Some will relish the idea of having their lives organized at their fingertips, but others will feel like the NSA or even Dark Web predators will hack into their lives.

 

Whitney Grace, March 30, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 

Google Dorking: It Is Search, Folks

March 29, 2016

I received a call from a former client this morning (March 28, 2016). The question? Google dorking. Relax. Google dorking is another way to say advanced search. In those How to Search with Google seminars I used to do for an outfit where the metros are unreliable and trust is a weird concept, I covered a number of Google dorking methods.

I don’t make those lectures’ content available for free in this blog, but you can round up some basic info at these links:

The dear, dear Alphabet Google thing kills or breaks useful search functions. This weekend, the FILETYPE: instruction performed like the University of Virginia men’s basketball team. You will have to do some thinking.
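For the curious, the operators themselves are plain query syntax. Here is a small sketch of how one might assemble “dorked” queries; the operators shown (site:, filetype:, intitle:, inurl:) are standard Google syntax, though, as noted above, Google changes or breaks them without warning.

```python
# A few of the standard "dorking" operators, assembled into query strings.
# site:, filetype:, intitle:, and inurl: are documented Google operators;
# their behavior can change or break at any time.

def dork(terms: str, site: str = None, filetype: str = None,
         intitle: str = None, inurl: str = None) -> str:
    parts = [terms]
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    if intitle:
        parts.append(f'intitle:"{intitle}"')
    if inurl:
        parts.append(f"inurl:{inurl}")
    return " ".join(parts)

print(dork("annual report", site="example.gov", filetype="pdf"))
# annual report site:example.gov filetype:pdf
```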

By the way, as Google shifts to its magical artificial intelligence methods, finding information via Google is getting more and more difficult.

We do webinars on how to deal with the Alphabet Google thing. Write seaky2000 at yahoo dot com and inquire about a 75 minute webinar. Yep, the same one I do for government types.

Stephen E Arnold, March 29, 2016

Retraining the Librarian for the Future

March 28, 2016

The Internet is often described as the world’s biggest library, containing all the world’s knowledge, which someone dumped on the floor.  The Internet is the world’s biggest information database as well as the world’s biggest data mess.  In the olden days, librarians were the gateway to knowledge management, but they need to revamp their skills beyond the Dewey Decimal System and database searching.  Librarians need to do more, and Christian Lauersen’s personal blog explains how in “Data Scientist Training For Librarians: Re-Skilling Libraries For The Future.”

DST4L is a boot camp for librarians and other information professionals to learn new skills and maintain relevance.  The most recent session was described this way:

“DST4L has been held three times in The States and was to be set for the first time in Europe at Library of Technical University of Denmark just outside of Copenhagen. 40 participants from all across Europe were ready to get there hands dirty over three days marathon of relevant tools within data archiving, handling, sharing and analyzing. See the full program here and check the #DST4L hashtag at Twitter.”

Over the course of three days, the participants learned about OpenRefine, a spreadsheet-like application that can be used for data cleanup and transformation.  They also learned about the benefits of GitHub and how to program in Python.  These skills are well beyond the classes taught in library graduate programs, but it is a good sign that the profession is evolving even if the academic side lags behind.
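To give a flavor of the skills involved, here is a rough Python version of the key-collision clustering trick OpenRefine offers for cleanup: rows whose normalized fingerprints match are probably the same entity spelled differently. The sample records are invented, and OpenRefine’s own fingerprinting is more elaborate than this sketch.

```python
# A rough Python take on OpenRefine-style key-collision clustering for data
# cleanup. Records that normalize to the same fingerprint are likely the
# same entity with different spelling or word order. Illustrative only.

import re
from collections import defaultdict

def fingerprint(value: str) -> str:
    """Lowercase, strip punctuation, sort unique tokens."""
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(sorted(set(tokens)))

records = ["University of Copenhagen", "of University Copenhagen",
           "Copenhagen, University of", "Technical University of Denmark"]

clusters = defaultdict(list)
for record in records:
    clusters[fingerprint(record)].append(record)

for key, members in clusters.items():
    print(key, "->", members)   # the first three records collapse into one cluster
```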

Whitney Grace, March 28, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 

Search as a Framework

March 26, 2016

A number of search and content processing vendors suggest their information access system can function as a framework. The idea is that search is more than a utility function.

If the information in the article “Abusing Elasticsearch as a Framework” is spot on, a non search vendor may have taken an important step to making an assertion into a reality.

The article states:

Crate is a distributed SQL database that leverages Elasticsearch and Lucene. In it’s infant days it parsed SQL statements and translated them into Elasticsearch queries. It was basically a layer on top of Elasticsearch.

The idea is that the framework uses discovery, master election, replication, and so on, along with the Lucene search and indexing operations.
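To make the idea concrete, here is a toy sketch of the kind of translation the quoted passage describes: a simple SQL WHERE clause rendered as an Elasticsearch bool query. Crate’s real parser is far more involved; this only shows the shape of the mapping.

```python
# Toy illustration of SQL-to-Elasticsearch translation. The term and range
# clauses are standard Elasticsearch query DSL; this is not Crate's actual
# parser, only a sketch of the mapping it performs.

def sql_where_to_es(column: str, op: str, value):
    if op == "=":
        return {"term": {column: value}}
    if op in (">", ">=", "<", "<="):
        range_ops = {">": "gt", ">=": "gte", "<": "lt", "<=": "lte"}
        return {"range": {column: {range_ops[op]: value}}}
    raise ValueError(f"unsupported operator: {op}")

# SELECT * FROM logs WHERE status = 500 AND bytes > 1024
query = {"bool": {"must": [
    sql_where_to_es("status", "=", 500),
    sql_where_to_es("bytes", ">", 1024),
]}}
print(query)
```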

Crate, the framework, is a distributed SQL database “that leverages Elasticsearch and Lucene.”

Stephen E Arnold, March 26, 2016

Ixquick and StartPage Become One

March 25, 2016

Ixquick was created by a person in Manhattan. Then the system shifted from the USA to Europe. I lost track. I read “Ixquick Merges with StartPage Search Engine.” Web search is a hideously expensive activity to fund. Costs can be suppressed if one just passes the user’s query to Bing, Google, or some other Web indexing search system. The approach delivers what is called a value-added opportunity. Vivisimo used the approach before it morphed into a unit of IBM and emerged not as a search federation system but as a Big Data system. Most search traffic flows to the Alphabet Google advertising system. Those who use federated search systems often don’t know the difference and, based on my observations, don’t care.

According to the write up:

The main difference between StartPage and the current version of Ixquick is that the former is powered exclusively by Google search results while the latter aggregates data from multiple search engines to rank them based on factors such as prominence and quantity. Both search engines are privacy orientated, and the merging won’t change the fact. IP addresses are not recorded for instance, and data is not shared with third-parties.
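Here is a toy rank-fusion sketch of the aggregation described above: results from several engines are scored by how prominently and how often they appear. Ixquick’s actual formula is not public, and the engines and result lists below are invented.

```python
# Toy rank-fusion sketch: merge several engines' result lists, rewarding
# prominence (high rank) and quantity (appearing in many lists). Ixquick's
# real scoring formula is not public; this is illustrative only.

from collections import defaultdict

engine_results = {            # hypothetical result lists, best first
    "engine_a": ["url1", "url2", "url3"],
    "engine_b": ["url2", "url1", "url4"],
    "engine_c": ["url2", "url5"],
}

scores = defaultdict(float)
for results in engine_results.values():
    for rank, url in enumerate(results, start=1):
        scores[url] += 1.0 / rank   # higher rank contributes a bigger share

merged = sorted(scores, key=scores.get, reverse=True)
print(merged)   # url2 wins: it ranks high and appears in every list
```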

Like DuckDuckGo.com, Ixquick.com and StartPage.com “protect the user’s privacy.” My thought is that I am not confident Tor sessions are able to protect a user’s privacy. A general interest search engine which delivers on this assertion is interesting indeed.

If you want to use the Ixquick function that presents only Google results, navigate to www.ixquick.eu. There are other privacy oriented systems; for example, Gibiru and Unbubble.

Sorry, I won’t/can’t go into the privacy angle. You may want to poke around how secure a VPN session, Tails, and Tor are. The exploration may yield some useful information. Make sure your computing device does not have malware installed, please. Otherwise, the “privacy” issue is off the table.

Stephen E Arnold, March 25, 2016

Play Search the Game

March 25, 2016

Within the past few years, gamers have had the privilege of easily playing brand new games as well as the old classics.  Nearly all of the games ever programmed are available through various channels, from Steam to simulators to system emulators.  It is easy to locate a game if you know the name, main character, or even the gaming system, but with thousands of games available maybe you want to save time and not have to use a general search engine.  Good news, everyone!

Sofotex, a free software download Web site, has a unique piece of freeware that you will probably want to download if you are a gamer. Igrulka is a search engine app programmed to search only games.  Here is the official description:

Igrulka is a unique software that helps you to search, find and play millions of games in the network.

“Once you download the installer, all you have to do is go to the download location on your computer and install the app.

Igrulka allows you to search for the games that you love either according to the categories they are in or by name. For example, you get games in the shooter, arcade, action, puzzle or racing games categories among many others.

If you would like to see more details about the available games, their names as well as their descriptions, all you have to do is hover over them using your mouse as shown below. Choose the game you want to play and click on it.”

According to the description, it looks like Igrulka searches through free games and perhaps the classics from older systems.  To find out what Igrulka can do, download it and play search results roulette.

 

Whitney Grace, March 25, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 

Wikipedia Grants Users Better Search

March 24, 2016

Wikipedia is the de facto encyclopedia for sorting fact from fiction, although academic circles shun its use (scholars do use it but never cite it).  Wikipedia does not usually make the news unless the story is tied to its fundraising campaign or Wikileaks releases sensitive information meant to remain confidential.  The Register tells us that Wikipedia makes the news for another reason in “Reluctant Wikipedia Lifts Lid On $2.5m Internet Search Engine Project.”  Wikipedia is better associated with the cataloging and dissemination of knowledge, but in order to use that knowledge it needs to be searched.

Perhaps that is why the Wikimedia Foundation is “doing a Google” and will be investing a Knight Foundation grant in a search-related project.  The Wikimedia Foundation finally released information about the Knight Foundation grant, which is dedicated to providing funds for organizations invested in innovative solutions related to information, community, media, and engagement.

“The grant provides seed money for stage one of the Knowledge Engine, described as “a system for discovering reliable and trustworthy information on the Internet”. It’s all about search and federation. The discovery stage includes an exploration of prototypes of future versions of Wikipedia.org which are “open channels” rather than an encyclopedia, analysing the query-to-content path, and embedding the Wikipedia Knowledge Engine ‘via carriers and Original Equipment Manufacturers’.”

The discovery stage will last twelve months, ending in August 2016.  The biggest risk for the search project would be if Google or Yahoo decided to invest in something similar.

What is interesting is that Wikipedia co-founder Jimmy Wales denied the Wikimedia Foundation was working on a search engine via the Knowledge Engine.  Wales has since left, and Andreas Kolbe reported in a Wikipedia Signpost article that a search engine is indeed being built; readers were led to believe it would merely find information spread across the Wikipedia portals, but it is something much more powerful.

Here is what the actual grant is funding:

“To advance new models for finding information by supporting stage one development of the Knowledge Engine by Wikipedia, a system for discovering reliable and trustworthy public information on the Internet.”

It sounds like a search engine that provides true and verifiable search results, which is what academic scholars have been after for years!  Wow!  Wikipedia might actually be worth a citation now.

 

Whitney Grace, March 24, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

VPN Disables Right to Be Forgotten for Users in European Union

March 24, 2016

Individuals in the European Union have been granted legal protection to request that unwanted information about themselves be removed from search engines. An article from Wired, “In Europe, You’ll Need a VPN to See Real Google Search Results,” explains the latest on the European Union’s “right to be forgotten” laws. Formerly, privacy requests would only scrub sites with European country extensions like .fr, but now Google.com will filter results for privacy for those with a European IP address. However, European users can rely on a VPN to make their location appear to be elsewhere. The article offers context and insight:

“China has long had its “Great Firewall,” and countries like Russia and Brazil have tried to build their own barriers to the outside ‘net in recent years. These walls have always been quite porous thanks to VPNs. The only way to stop it would be for Google to simply stop allowing people to access its search engine via a VPN. That seems unlikely. But with Netflix leading the way in blocking access via VPNs, the Internet may yet fracture and localize.”

The demand for browsing the web using surreptitious methods, VPN or otherwise, only seems to be increasing. Whether motivations are to uncover personal information about certain individuals, watch Netflix content available in other countries or use forums on the Dark Web, the landscape of search appears to be changing in a major way.

 

Megan Feil, March 24, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
