Semantics Made Easier

May 9, 2016

For fans of semantic technology, Ontotext has a late spring delight for you. The semantic platform vendor Ontotext has released GraphDB 7. I read “Ontotext Releases New Version of Semantic Graph Database.” According to the announcement, set up and data access are easier. I learned:

The new release offers new tools to access and explore data, eliminating the need to know everything about the dataset before start working with it. GraphDB 7 enables users to navigate their way through third-party and any other dataset regardless of data volumes, which makes it a powerful Big Data analytics tool. Ver.7 offers visual exploration of the loaded data schema – ontology, interactive query builder for better entity retrieval, and full support for RDF 1.1 allowing smooth import of a huge number of public Open Data as well as proprietary Linked Datasets.

If you want to have a Palantir-type system, check out Ontotext. The company is confident that semantic technology will yield benefits, a claim made by other semantic technology vendors. But the complexity challenges associated with conversion and normalization of content are likely to be a pebble in the semantic sneaker.

Stephen E Arnold, May 9, 2016

Semantic Search: Clustering and Heat Maps Explain Creativity

May 8, 2016

I know zero about semantics as practiced at big time universities. I know about the same when it comes to semantic search. With my background as a tabula rasa, I read “A Semantic Map for Evaluating Creativity.” According to the write up:

We present a semantic map of words related with creativity. The aim is to empirically derive terms which can be used to rate processes or products of computational creativity. The words in the map are based on association studies performed by human subjects and augmented with words derived from the literature (based on human raters).

After considerable text processing and a dose of analytics, the paper states:

… There is an overlap in the set of words formed by the two methods, but there are also some differences. Further investigations could reveal how these methods are related and if they are both needed (as complements) to arrive at more objective procedures for the evaluation of computational (and human) creativity.

I await a mid tier consulting firm’s for fee study about the applications of this technology in determining which companies are creative. And what about government use cases; for example, which entry level professional is most creative. Then there are academic applications; for instance, which professors are the most creative. Creative folks can create creative ways to understand creativity. Stay tuned.

Stephen E Arnold, May 8, 2016

Mouse Movements Are the New Fingerprints

May 6, 2016

A martial artist once told me that an individual’s fighting style, if defined enough, was like a set of fingerprints.  The same can be said for painting style, book preferences, and even Netflix selections, but what about something as anonymous as a computer mouse’s movement?  Here is a new scary thought from PC & Tech Authority: “Researcher Can Identify Tor Users By Their Mouse Movements.”

Juan Carlos Norte is a researcher in Barcelona, Spain and he claims to have developed a series of fingerprinting methods using JavaScript that measures times, mouse wheel movements, speed movement, CPU benchmarks, and getClientRects.   Combining all of this data allowed Norte to identify Tor users based on how they used a computer mouse.

It seems far-fetched, especially when one considers how random this data is, but

“’Every user moves the mouse in a unique way,’ Norte told Vice’s Motherboard in an online chat. ‘If you can observe those movements in enough pages the user visits outside of Tor, you can create a unique fingerprint for that user,’ he said. Norte recommended users disable JavaScript to avoid being fingerprinted.  Security researcher Lukasz Olejnik told Motherboard he doubted Norte’s findings and said a threat actor would need much more information, such as acceleration, angle of curvature, curvature distance, and other data, to uniquely fingerprint a user.”

This is the age of big data, but looking at Norte’s claim from a logical standpoint, one needs to consider that not all computer mice are made the same. Some use lasers, others are trackballs, and what about a laptop’s track pad?  As diverse as computer users are, there are similarities within the population, and random mouse movement is not individualistic enough to ID a person.  Fear not Tor users, move and click away in peace.
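For readers wondering what signals like the “acceleration” and “angle of curvature” Olejnik mentions might look like as numbers, here is a minimal Python sketch. It is not Norte’s method. The function name `movement_features` and the `(time, x, y)` sample format are assumptions made up for illustration, and the samples are assumed to arrive in time order.

```python
import math

def movement_features(samples):
    """Derive simple motion features from (t_seconds, x, y) pointer samples.

    Hypothetical helper for illustration only; assumes samples are in time order.
    """
    segments = []  # one (midpoint_time, speed, heading) entry per pair of samples
    for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        dx, dy = x1 - x0, y1 - y0
        segments.append(((t0 + t1) / 2, math.hypot(dx, dy) / dt, math.atan2(dy, dx)))

    accelerations = []
    turns = []
    for (ta, va, ha), (tb, vb, hb) in zip(segments, segments[1:]):
        # change in speed per second between neighbouring segments
        accelerations.append((vb - va) / (tb - ta))
        # change in heading, wrapped into the range [-pi, pi)
        turns.append((hb - ha + math.pi) % (2 * math.pi) - math.pi)

    speeds = [speed for _, speed, _ in segments]
    return {"speeds": speeds, "accelerations": accelerations, "turns": turns}


features = movement_features([(0.0, 0, 0), (1.0, 3, 4), (2.0, 3, 14)])
```

Whether numbers like these differ enough from person to person to single anyone out is exactly the point in dispute between Norte and Olejnik.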


Whitney Grace, May 6, 2016
Sponsored by, publisher of the CyberOSINT monograph

An Open Source Search Engine to Experiment With

May 1, 2016

Apache Lucene receives the most headlines when it comes to discussion about open source search software.  My RSS feed pulled up another open source search engine that shows promise in being a decent piece of software.  Open Semantic Search is free software that can be used for text mining, analytics, a search engine, data explorer, and other research tools.  It is based on Elasticsearch/Apache Solr’s open source enterprise search.  It was designed with open standards and with a robust semantic search.

As with any open source search, it can be programmed with numerous features based on the user’s preference.  These include tagging, annotation, varying file format support, multiple data sources support, data visualization, newsfeeds, automatic text recognition, faceted search, interactive filters, and more.  It has the benefit that it can be programmed for mobile platforms, metadata management, and file system monitoring.

Open Semantic Search is described as

“Research tools for easier searching, analytics, data enrichment & text mining of heterogeneous and large document sets with free software on your own computer or server.”

While its base code is derived from Apache Lucene, it takes the original product and builds something better.  Proprietary software is an expense dubbed a necessary evil if you work in a large company.  If, however, you are a programmer and have the time to develop your own search engine and analytics software, do it.  It could even turn out better than the proprietary stuff.


Whitney Grace, May 1, 2016

Expert System: Inspired by Endeca

April 23, 2016

Years ago I listened to Endeca (now owned by Oracle) extol the virtues of its various tools. The idea was that the tools made it somewhat easier to get Endeca up and running. The original patents for Endeca reveal the computational blender which the Endeca method required. Endeca shifted from licensing software to bundling consulting with a software license. Setting up Endeca required MBAs, patience, and money. Endeca rose to generate more than $120 million in revenues before its sale to Oracle. Today Endeca is still available, and the Endeca patents—particularly 7035864—reveal how Endeca pulled off its facets. Today Endeca has lost a bit of its spit and polish, a process that began when Autonomy blasted past the firm in the early 2000s.

Endeca rolled out its “studio” a decade ago. I recall that Business Objects had a “studio.” The idea behind a studio was to simplify the complex task of creating something an end user could use without much training. But the studio was not aimed at an end user. The studio was a product for a developer, who found the tortuous, proprietary methods complex and difficult to learn. A studio would unleash the developers and, of course, propel the vendors with studios to new revenue heights.

Studio is back, if the information in “Expert System Releases Cogito Studio for Combining the Advantages of Semantic Technology with Deep Learning” is accurate. This time the spin is that semantic technology and deep learning, two buzzwords near and dear to the heart of those in search of the next big thing, will be a boon. Who is the intended user? Well, developers. These folks are learning that marketing talk is a heck of a lot easier than designing, coding, debugging, stabilizing, and then generating useful outputs, which is quite difficult work.

According to the Expert System announcement:

“The new release of Cogito Studio is the result of the hard work and dedication of our labs, which are focused on developing products that are both powerful and easy to use,” said Marco Varone, President and CTO, Expert System. “We believe that we can make significant contributions to the field of artificial intelligence. In our vision of AI, typical deep learning algorithms for automatic learning and knowledge extraction can be made more effective when combined with algorithms based on a comprehension of text and on knowledge structured in a manner similar to that of humans.”

Does this strike you as vague?

Expert System is an Italian high tech outfit, which was founded in 1989. That’s almost a decade before the Endeca system poked its moist nose into the world of search. Fellow travelers from this era include Fulcrum Technologies and ISYS Search Software. Both of these companies’ technologies are still available today.

Thus, it makes sense that the idea of a “studio” becomes a way to chop away at the complexity of Expert System-type systems.

According to Google Finance, Expert System’s stock is trending upwards.

That’s a good sign. My hunch is that announcements about “studios” wrapped in lingo like semantics and Big Data are a good thing.

Stephen E Arnold, April 23, 2016

Free Book? Semantic Mining of Social Networks

April 14, 2016

I saw a reference to a 2015 book, Semantic Mining of Social Networks by Jie Tang and Juanzi Li. This volume consists of essays about things semantic. Published by Morgan & Claypool publishers, the link I clicked did not return a bibliographic citation nor a review. The link displayed the book which appeared to be downloadable. If your engines are revved with the notion of semantic analysis, you may want to explore the volume yourself. I advocate purchasing monographs. Here’s the link I followed. Keep in mind that if the link 404s you, the fault is not mine.

Stephen E Arnold, April 14, 2016

Semantic Search Craziness Makes Search Increasingly Difficult

April 3, 2016

How is that for a statement? Search is getting hard. No, search is becoming impossible.

For evidence, I point to the Wired Magazine article “Search Today and Beyond: Optimizing for the Semantic Web.”

Here’s a passage I noted:

Despite the billions and billions of searches, Google reports that 20 percent of all searches in 2012 were new. It seems quite staggering, but it’s a product of the semantic search rather than the simple keyword search.

Wow, unique queries. How annoying. Isn’t it better for people to just run queries which Google has already seen and for which it has cached the results?

I have been poking around for information about a US government program called “DCGS.” Enter the query and what do you get? A number of results unrelated to the terms in my query; for example, US Army. Toss in quotes to “tell” Google to focus only on the string DCGS. Nah, does not do the job. Add the filetype:ppt operator and what do you get? Documents in other formats too.

Semantic search is now a buzzword which is designed to obfuscate one important point: Methods for providing on point information are less important than assertions about what jargon can deliver.

For me, when I enter a query, I want the search system to deliver documents in which the key words appear. I want an option to see related documents. I do not want the search system doing the ad thing, the cheapest and most economical query, and I don’t want unexpected behaviors from a search and retrieval system.

Unfortunately lots of folks, including Wired Magazine, think that semantic search optimizes. Wonderful. With baloney like this I am not sure about the future of search; to wit:

…the future possibilities are endless for those who are studious enough to keep pace and agile enough to adjust.

Yeah, agile. What happened to the craziness that search is the new interface to Big Data? Right, agile.

Stephen E Arnold, April 3, 2016

GoPubMed Sorts Searching

March 31, 2016

Do you search the US government’s content? If you use the system, you may want to come at the information in different or more useful ways.

The German company Transinsight makes available its search system, GoPubMed.

You can use the semantic search system to explore the knowledgebase.

Features of the system include:

  • A sidebar which allows one click access to concepts, authors, journals, etc.
  • A search box which accepts keyword queries and offers suggestions for the query
  • A results list.

The layout is clear. A bit of hunting around is necessary, but that is a common experience when trying to figure out if there is a way to narrow a broad search based on a lousy query.

There have been many search systems built to make the PubMed information findable. My favorite, though long gone, was Grateful Med. Like patent searching, queries of medical information are tricky. Some day I will write about the Information Health Reference Center, circa 1989. That was exciting.

Stephen E Arnold, March 31, 2016

Patents and Semantic Search: No Good, No Good

March 31, 2016

I have been working on a profile of Palantir (open source information only, however) for my forthcoming Dark Web Notebook. I bumbled into a video from an outfit called ClearstoneIP. I noted that ClearstoneIP’s video showed how one could select from a classification system. With every click, the result set changed. For some types of searching, a user may find the point-and-click approach helpful. However, there are other ways to root through what appears to be patent applications. There are the very expensive methods happily provided by Reed Elsevier and Thomson Reuters, two fine outfits. And then there are less expensive methods like Alphabet Google’s odd ball patent search system or the quite functional FreePatentsOnline service. In between, you and I have many options.

None of them is a slam dunk. When I was working through the publicly accessible Palantir Technologies’ patents, I had to fall back on my very old-fashioned method. I tracked down a PDF, printed it out, and read it. Believe me, gentle reader, this is not the most fun I have ever had. In contrast to the early Google patents, Palantir’s documents lack the detailed “background of the invention” information which the salad days’ Googlers cheerfully presented. Palantir’s write ups are slogs. Perhaps the firm’s attorneys were born with dour brain circuitry.

I did a side jaunt and came across a white paper from ClearstoneIP called “Why Semantic Searching Fails for Freedom-to-Operate (FTO).” The 12 page write up is from a company called ClearstoneIP, which is a patent analysis company. The firm’s 12 pager is about patent searching. The company, according to its Web site, is a “paradigm shifter.” The company describes itself this way:

ClearstoneIP is a California-based company built to provide industry leaders and innovators with a truly revolutionary platform for conducting product clearance, freedom to operate, and patent infringement-based analyses. ClearstoneIP was founded by a team of forward-thinking patent attorneys and software developers who believe that barriers to innovation can be overcome with innovation itself.

The “freedom to operate” phrase is a bit of legal jargon which I don’t understand. I am, thank goodness, not an attorney.

The firm’s search method makes much of the ontology, taxonomy, classification approach to information access. Hence, the reason my exploration of Palantir’s dynamic ontology with objects tossed ClearstoneIP into one of my search result sets.

The white paper is interesting if one works around the legal mumbo jumbo. The company’s approach is remarkable and invokes some of my caution light words; for example:

  • “Not all patent searches are the same.”, page two
  • “This all leads to the question…”, page seven
  • “…there is never a single “right” way to do so.”, page eight
  • “And if an analyst were to try to capture all of the ways…”, page eight
  • “to capture all potentially relevant patents…”, page nine.

The absolutist approach to argument is fascinating.

Okay, what’s the ClearstoneIP search system doing? Well, it seems to me that it is taking a path to consider some of the subtleties in patent claims’ statements. The approach is very different from that taken by Brainware and its tri-gram technology. Now that Lexmark owns Brainware, the application of the Brainware system to patent searching has fallen off my radar. Brainware relied on patterns; ClearstoneIP uses the ontology-classification approach.

Both are useful in identifying patents related to a particular subject.
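For those who have not bumped into the pattern idea, here is a minimal Python sketch of character trigram matching. This is a generic illustration and not Brainware’s implementation, which I have not seen. The function names and the choice of a Jaccard overlap score are assumptions of mine.

```python
def trigrams(text):
    """Return the set of three-character windows in a lowercased, whitespace-normalized string."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def trigram_similarity(a, b):
    """Jaccard overlap of two trigram sets: 0.0 means no shared pattern, 1.0 means identical sets."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Near-identical word forms share most trigrams; unrelated words share none.
close = trigram_similarity("fastener", "fasteners")
far = trigram_similarity("fastener", "adhesive")
```

A pattern method of this sort catches variant spellings and word forms without any vocabulary, which is the opposite trade-off from a curated classification scheme.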

What is interesting in the write up is its approach to “semantics.” I highlighted in billable hour green:

Anticipating all the ways in which a product can be described is serious guesswork.

Yep, but isn’t that where a human with relevant training and expertise becomes important? The white paper takes the approach that semantic search fails for the ClearstoneIP method dubbed FTO or freedom to operate information access.

The white paper asserted:


Semantic searching is the primary focus of this discussion, as it is the most evolved.

ClearstoneIP defines semantic search in this way:

Semantic patent searching generally refers to automatically enhancing a text-based query to better represent its underlying meaning, thereby better identifying conceptually related references.

I think the definition of semantic is designed to strike directly at the heart of the methods offered to lawyers with paying customers by Lexis-type and Westlaw-type systems. Lawyers-to-be usually have access to the commercial-type services when in law school. In the legal market, there are quite a few outfits trying to provide better, faster, and sometimes less expensive ways to make sense of the Miltonesque prose popular among the patent crowd.

The white paper, in a lawyerly way, describes the approach of semantic search systems. Note that the “narrowing” to the concerns of attorneys engaged in patent work is in the background even though the description seems to be painted in broad strokes:

This process generally includes: (1) supplementing terms of a text-based query with their synonyms; and (2) assessing the proximity of resulting patents to the determined underlying meaning of the text-based query. Semantic platforms are often touted as critical add-ons to natural language searching. They are said to account for discrepancies in word form and lexicography between the text of queries and patent disclosure.
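The two steps in that passage can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: the synonym table is hand-made and hypothetical, the “proximity” is a crude word overlap score, and none of it reflects how ClearstoneIP or any commercial platform actually works.

```python
# Hypothetical, hand-made synonym list used only for this illustration.
SYNONYMS = {
    "fastener": {"screw", "bolt", "rivet"},
    "adhesive": {"glue", "epoxy"},
}


def expand_query(query):
    """Step 1: supplement the query terms with their synonyms."""
    terms = set(query.lower().split())
    expanded = set(terms)
    for term in terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded


def rank(query, documents):
    """Step 2: score each document by the share of expanded terms it contains."""
    expanded = expand_query(query)
    scored = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        score = len(expanded & words) / len(expanded)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties are broken by document id for a stable order.
    return [doc_id for score, doc_id in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


docs = {
    "A": "a bolt joins two panels",
    "B": "a hinge with a spring",
    "C": "glue and a screw hold the panel",
}
results = rank("fastener adhesive", docs)
```

Note that everything depends on what made it into the synonym table. A claim that describes the same part in words nobody anticipated scores zero, which is the guesswork problem the white paper complains about.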

The white paper offers this conclusion about semantic search:

it [semantic search] is surprisingly ineffective for FTO.

Seems reasonable, right? Semantic search assumes a “paradigm.” In my experience, taxonomies, classification schema, and ontologies perform the same intellectual trick. The idea is to put something into a cubby. Organizing information makes manifest what something is and where it fits in a mental construct.

But these semantic systems do a lousy job figuring out what’s in the Claims section of a patent. That’s a flaw which is a direct consequence of the lingo lawyers use to frame the claims themselves.

Search systems use many different methods to pigeonhole a statement. The “aboutness” of a statement or a claim is a sticky wicket. As I have written in many articles, books, and blog posts, finding on point information is very difficult. Progress has been made when one wants a pizza. Less progress has been made in finding the colleagues of the bad actors in Brussels.

Palantir requires that those adding content to the Gotham data management system add tags from a “dynamic ontology.” In addition to what the human has to do, the Gotham system generates additional metadata automatically. Other systems use mostly automatic systems which are dependent on a traditional controlled term list. Others just use algorithms to do the trick. The systems which are making friends with users strike a balance; that is, using human input directly or indirectly and some administrator only knowledgebases, dictionaries, synonym lists, etc.

ClearstoneIP keeps its eye on its FTO ball, which is understandable. The white paper asserts:

The point here is that semantic platforms can deliver effective results for patentability searches at a reasonable cost but, when it comes to FTO searching, the effectiveness of the platforms is limited even at great cost.

Okay, I understand. ClearstoneIP includes a diagram which drives home how its FTO approach soars over the competitors’ systems:


ClearstoneIP, © 2016

My reaction to the white paper is that for decades I have evaluated and used information access systems. None of the systems is without serious flaws. That includes the clever n gram-based systems, the smart systems from dozens of outfits, the constantly reinvented keyword centric systems from the Lexis-type and Westlaw-type vendors, even the simplistic methods offered by free online patent search systems like

What seems to be reality of the legal landscape is:

  1. Patent experts use a range of systems. With lots of budget, many free and for fee systems will be used. The name of the game is meeting the client needs and obviously billing the client for time.
  2. No patent search system to which I have been exposed does an effective job of thinking like a very good patent attorney. I know that the notion of artificial intelligence is the hot trend, but the reality is that seemingly smart software usually cheats by formulating queries based on analysis of user behavior, facts like geographic location, and who pays to get their pizza joint “found.”
  3. A patent search system, in order to be useful for the type of work I do, has to index germane content generated in the course of the patent process. Comprehensiveness is simply not part of the patent search systems’ modus operandi. If there’s a B, where’s the A? If there is a germane letter about a patent, where the heck is it?

I am not on the “side” of the taxonomy-centric approach. I am not on the side of the crazy semantic methods. I am not on the side of the keyword approach when inventors use different names on different patents, Babak Parviz aliases included. I am not in favor of any one system.

How do I think patent search is evolving? ClearstoneIP has it sort of right. Attorneys have to tag what is needed. The hitch in the git along has been partially resolved by Palantir-type systems; that is, the ontology has to be dynamic and available to anyone authorized to use a collection in real time.

But for lawyers there is one added necessity which will not leave us any time soon. Lawyers bill; hence, whatever is output from an information access system has to be read, annotated, and considered by a semi-capable human.

What’s the future of patent search? My view is that there will be new systems. The one constant is that, by definition, a lawyer cannot trust the outputs. The way to deal with this is to pay a patent attorney to read patent documents.

In short, like the person looking for information in the scriptoria at the Alexandria Library, the task ends up as a manual one. Perhaps there will be a friendly Boston Dynamics librarian available to do the work some day. For now, search systems won’t do the job because attorneys cannot trust an algorithm when the likelihood of missing something exists.

Oh, I almost forgot. Attorneys have to get paid via that billable time thing.

Stephen E Arnold, March 30, 2016

Celebros Launches Natural Language Processing Ecommerce Extension with Seven Conversions

March 9, 2016

An e-commerce site search company, Celebros, shared a news release touting its new product. “Celebros, First to Launch Natural Language Site Search Extension for Magento 2.0” announces the company’s Semantic Site Search extension for Magento 2.0. Magento 2.0 boasts the largest marketplace of e-commerce extensions in the world. This product, along with other Magento extensions, is designed to help online merchants expand their marketing and e-commerce capabilities. Celebros CMO and President of Global Sales Jeffrey Tower states,

“Celebros is proud to add the new Magento 2 extension to our existing and very successful Magento 1 extension. Celebros will offer the new extension free of charge to our entire Magento client base to ensure an easy, fast and pain-free upgrade while providing free integrations to new Celebros clients world-wide. The new extension encompasses our Natural Language Site Search in seven languages along with eight additional features that include our advanced auto-complete, guided navigation, dynamic landing pages and merchandising engine, product recommendations and more.”

For online retailers, extension products like Celebros may make or break the platforms like Magento 2.0, as these products are what add value and drive e-commerce technologies forward. It is intriguing that the Celebros natural language processing technology offers conversions available in seven languages. We live in an increasingly globalized world.


Megan Feil, March 9, 2016
