Semantic Search and Challenging Patent Document Content Domains

July 7, 2015

Over the years, I have bumped into some challenging content domains. One of the most difficult was the collection of mathematical papers organized with the Dienst architecture. Another was a collection of blog posts from African bulletin board systems in a number of different languages, peppered with insider jargon. I also recall my jousts with patent documents for some pretty savvy outfits.

Processing each of these corpora and making them searchable by a regular human being remains an unsolved problem. Progress has been slow, and the focus of many innovators has been on workarounds. The challenge of each corpus remains a high hurdle, and in my opinion, no search sprinter is able to make it over the race course without catching a toe and plunging head first into the Multi-layer SB Resin covered surface.

I read “Why Is Semantic Search So Important for Patent Searching?” My answer was and remains, “Because vendors will grab at any buzzy concept in the hopes of capturing a share of the patent research market.”

The write up takes a different approach, an approach which I find interesting and somewhat misleading.

The write up states that there are two ways to search for information: navigational search, sort of like Endeca I assume, and research search, which is the old fashioned Boolean logic that I really like.

The article points out that keyword search sucks if the person looking for information does not know the exact term. That’s why I used the reference to Dienst. I wanted to provide an example which requires precise knowledge of terminology. That’s a challenge and it requires specialized knowledge from a person who recognizes that he or she may not know the exact terminology required to locate the needed information. Try the Dienst query. Navigate to a whizzy new search engine and plug away. How is that working out for you? But don’t cheat. You can’t use the term Dienst.

If you run the query on a point and click Web search system, you cannot locate the term without running a keyword search.
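
For readers who want to see the exact-term problem in miniature, here is a rough sketch, not a description of any vendor's system. It assumes a made-up three-document corpus (`DOCS`) and two hypothetical helpers, `build_index` and `boolean_and`, which do nothing more than exact token matching with a Boolean AND.

```python
from collections import defaultdict

# Hypothetical three-document corpus; the texts are invented for illustration.
DOCS = {
    1: "Dienst protocol for a distributed digital library of technical reports",
    2: "semantic search for patent documents",
    3: "distributed library architecture for mathematical papers",
}


def build_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def boolean_and(index, *terms):
    """Return ids of documents containing every term (exact token match)."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()


INDEX = build_index(DOCS)

# The searcher who knows the exact term finds the document.
print(boolean_and(INDEX, "dienst"))
# Boolean AND narrows to documents holding both terms.
print(boolean_and(INDEX, "distributed", "library"))
# The searcher who guesses at the vocabulary gets nothing back.
print(boolean_and(INDEX, "math", "repository"))
```

The last query comes back empty because neither "math" nor "repository" appears as a token in the toy corpus, even though document 3 is about mathematical papers. That is the whole complaint about keyword search in a few lines.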

The problems in patents, whether the documents are indexed with value added metadata, by humans laboring in a warehouse, or with semantic methods, are:

  1. Patent documents exist in versions and each document drags along assorted forms which may or may not be findable. Trips to the USPTO with hat in hand and a note from a senator often do not work. Fancy Dan patent attorneys fall back on the good old method of hunting using intermediaries. Not pretty, not easy, not cheap, and not foolproof. The versions and assorted attachments are often unfindable. (There are sometimes interesting reasons for this kettle of fish and the fish within it.) I don’t have a solution to the chains of documents and the versions of patent documents. Sigh.
  2. Patents include art. Usually the novice reacts negatively to lousy screenshots, clunky drawings, and equations which make it tough to figure out what a superscript character is. Keywords and pointing and clicking, metaphors, razzle dazzle search systems, and buzzword charged solutions from outfits like Thomson Reuters and Lexis are just tools, stone tools chiseled by some folks who want to get paid. I don’t have a good solution to the arts and crafts aspect of patent documents. Sigh sigh.
  3. Patent documents are written at a level of generalization, with jargon, Latinate constructs, and assertions that usually give me a headache. Who signed up to read lots of really bad poetry? Working through the Old Norse version of Heimskringla is a walk in the park compared to figuring out what some patents “mean.” I spent a number of years indexing 15th century Latin sermons. At least in that corpus, the common knowledge base was social and political events and assorted religious material. Patents can be all over the known knowledge universe. I don’t know of a patent processing system which can make this weird prose-poetry understandable if there is litigation or findable if there is a need to figure out if someone cooked up the same system and method before the document in question was crafted. Sigh sigh sigh.
  4. None of the systems I have used over the past 40 years does a bang up job of identifying prior art in scientific, technical or medical journal articles, blog posts, trade publications, or Facebook posts by a socially aware astrophysicist working for a social media company. Finding antecedents is a great deal of work. Has been and will be in my opinion. Sigh sigh sigh sigh. But the patent attorneys cry, “Hooray. We get to bill time.”

The write up presents some of those top brass magnets: Snappy visualizations. The idea is that a nifty diagram will address the four problems I identified in the preceding paragraphs. Visualizations may be able to provide some useful way to conceptualize where a particular patent document falls in a cluster of correctly processed patent documents. But an image does not deliver the mental equivalent of a NOW Foods Whey Protein Isolate.

Net net: Pitching semantic search as a solution to the challenges of patent information access is a ball. Strikes in patent searching are not easily obtained unless you pay expert patent attorneys and their human assets to do the job. Just bring your checkbook.

Stephen E Arnold, July 7, 2015

