
More Semantic Search and Search Engine Optimization Chatter

June 10, 2015

I read “Understanding Semantic Search.” I had high hopes. The notion of Semantic Search as set forth by Tim Bray, Ramanathan Guha, and some other wizards years ago continues to intrigue me. The challenge has been to deliver high value outputs that generate sufficient revenue to pay for the plumbing, storage, and development good ideas can require.

I spent considerable time exploring one of the better known semantic search systems before the company turned off the lights and locked its doors. Siderean Software offered its Seamark system which could munch on triples and output some quite remarkable results. I am not sure why the company was not able to generate more revenue.
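Triple processing of the sort Seamark performed can be sketched in a few lines. This is an invented illustration of pattern matching over subject-predicate-object triples, not Siderean's actual system or data:

```python
# Minimal triple-pattern matcher (illustrative; not Siderean's Seamark API).
def match(triples, s=None, p=None, o=None):
    """Return triples matching a (subject, predicate, object) pattern;
    None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Hypothetical content triples of the kind a semantic engine might munch on.
triples = [
    ("doc1", "type", "Document"),
    ("doc1", "about", "semantic search"),
    ("doc2", "type", "Document"),
    ("doc2", "about", "faceted navigation"),
]

# Everything asserted about doc1, and every document about semantic search.
print(match(triples, s="doc1"))
print(match(triples, p="about", o="semantic search"))
```

Relationship-driven "graph search" of the kind Siderean suggested builds on exactly this primitive: chaining pattern matches across triples to pinpoint content.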

The company emphasized “discovery searching.” Vivisimo later imitated Siderean’s user input feature. The idea is that if a document required an additional key word, the system accepted the user input and added the term to the index. Siderean was one of the first search vendors to suggest that “graph search” or relationships would allow users to pinpoint content processed by the system. In the 2006-2007 period, Siderean indexed Oracle text content as a demonstration. (At the time, Oracle had the original Artificial Linguistics’ technology, the Oracle Text function, Triple Hop, and PL/SQL queries. Not surprisingly, Oracle did not show the search acquisition appetite the company demonstrated a few years later when Oracle bought Endeca’s ageing technology, the RightNow Netherlands-originated technology, or the shotgun marriage search vendor InQuira.)

I also invested some time on behalf of the client in the semantic inventions of Dr. Ramanathan Guha. This work was summarized in Google Version 2.0, now out of print. Love those print publishers, folks.

Dr. Guha applied the features of the Semantic Web to plumbing which, if fully implemented, would have allowed Google to build a universal database of knowledge, serve up snippets from a special semantic server, and perform a number of useful functions. This work was done by Dr. Guha when he was at IBM Almaden and at Google. My analysis of Dr. Guha’s work suggests that Google has more semantic plumbing than most observers of the search giant notice. The reason, I concluded, was that semantic technology works behind the scenes. Dragging the user into OWL, RDF, and other semantic nuances does not pay off as well as embedding certain semantic functions behind the scenes.

In the “Understanding Semantic Search” write up, I learned that my understanding of semantic search is pretty much a wild and crazy collection of half truths. Let me illustrate what the article presents as the “understanding” function for addled geese like me.

  • Searches have a context
  • Results can be local or national
  • Entities are important; for example, the White House is different from a white house

So far, none of this helps me understand semantic search as embodied in the W3C standards, in the implementations of companies like Siderean, or in the Google-Guha patent documents from 2007 forward.

The write up makes a leap from context to the question, “Are key words still important?”

From that question, the article informs me that I need to use schema markup: additional markup embedded in a page’s code which provides information to crawlers and other software about the content the user sees on a rendering device.
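For concreteness, here is a minimal sketch of the kind of schema.org markup the article means. The product data is invented; a real page would describe its actual content:

```python
import json

# Hypothetical schema.org Product description, expressed as JSON-LD.
product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example Widget",
    "offers": {
        "@type": "Offer",
        "price": "19.99",
        "priceCurrency": "USD",
    },
}

# Embedded in a page as a <script type="application/ld+json"> block, this
# tells crawlers what the visible content describes, invisibly to the user.
snippet = '<script type="application/ld+json">{}</script>'.format(
    json.dumps(product))
print(snippet)
```

The page's visible content does not change; only the machine-readable layer does.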

And that’s it.

So let’s recap. I learned that context is important via illustrations which show Google using different methods to localize or personalize content. The write up does not enumerate the different methods which use browser histories, geolocation, and other signals. The write up then urges me to use additional mark up.

I think I will stick with my understanding of semantics. My work with Siderean and my research for an investment bank provided a richer base of knowledge about the real-world applications of semantic technology. That technology, I wish to point out, can be computationally demanding unless one has sufficient resources to perform the work.

What is happening in this “Understanding Semantic Search” article is an attempt to generate business for search engine optimization experts. Key word stuffing and doorway pages no longer work very well. In fact, SEO itself is a problem because it undermines precision and recall. Spoofing relevance is not my idea of a useful activity.

For those looking to semantics to deliver Google traffic, you might want to invest the time and effort in creating content which pulls users to you.

Stephen E Arnold, June 9, 2015

Semantic Search Hoohah: Hakia

June 8, 2015

My Overflight system snagged an updated post to an article written in 2006 about Hakia. Hakia, as you may know, was a semantic search system. I ran an interview with Riza C. Berkan in 2008. You can find that Search Wizards Speak interview here.

Hakia went quiet months ago. The author of “Hakia Is a True Semantic Search Engine” posted a sentence that said: “Hakia, unfortunately, failed and went out of business.”

I reviewed that nine-year-old article this morning and highlighted several passages. These are important because these snippets illustrate how easy it is to create a word picture which does not match reality. Search engine developers find describing their vision a heck of a lot easier than converting talk into sustainable revenues.

Let’s run down three of the passages proudly displaying my blue highlighter’s circles and arrows. The red text is from the article describing Hakia and the blue text is what the founder of Hakia said in the Search Wizards Speak interview.

Passage One

So a semantic search engine doesn’t have to address every word in the English language, and in fact it may be able to get by with a very small core set of words. Let’s say that we want to create our own semantic search engine. We can probably index most mainstream (non-professional) documents with 20-30,000 words. There will be a few gaps here and there, but we can tolerate those gaps. But still, the task of computing relevance for millions, perhaps billions of documents that use 30,000 words is horrendously monumental. If we’re going to base our relevance scoring on semantic analysis, we need to reduce the word-set as much as possible.

This passage is less about Hakia and more about the author’s perception of semantic search. Does this explanation resonate with you? For me, many semantic methods are computationally burdensome. As a result the systems are often sluggish and unable to keep pace with updated content and new content.
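The "reduced word set" idea in the passage can be sketched in a few lines. The mapping and the scoring below are invented for illustration; this is not Hakia's method:

```python
# Hypothetical mapping from surface words to a small canonical vocabulary.
CORE = {
    "pipe": "plumbing", "faucet": "plumbing", "leak": "plumbing",
    "wire": "electrical", "outlet": "electrical", "fuse": "electrical",
}

def reduce_to_core(text):
    """Collapse a text to the set of core concepts it mentions."""
    return {CORE[w] for w in text.lower().split() if w in CORE}

def relevance(query, document):
    """Score overlap between query concepts and document concepts."""
    q, d = reduce_to_core(query), reduce_to_core(document)
    return len(q & d) / len(q) if q else 0.0

print(relevance("burst pipe leak", "licensed plumbing faucet and pipe repair"))
```

Even this toy version hints at the cost problem: every document must be pushed through the reduction step before relevance can be computed.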

Here’s how Dr. Riza C. Berkan, a nuclear engineer and math whiz, explained semantics:

With semantic search, this is not a problem. We will extract everything of the 500 words that is relevant content. That is why Google has a credibility problem. Google cannot guarantee credibility because its system relies on link statistics. Semantic methods do not rely on links. Semantic methods use the content itself. For example, hakia QDexed approximately 10 million PubMed pages. If there are 100 million questions, hakia will bring you to the correct PubMed page 99 percent of the time, whereas other engines will bring you perhaps 25 percent of the time, depending on the level of available statistics. For certain things, the big players do not like awareness. Google has never made, and probably never will make, credibility important. You can do advanced search and do “site: sitename” but that is too hard for the user; less than 0.5% of users ever use advanced search features.

Passage Two

What I believe the founders of Hakia have done is borrow the concept of Lambda Calculus from compiler theory to speed the process of reducing elements on pages to their conceptual foundations. That is, if we assume everyone writes like me, then most documents can be reduced to a much smaller subset of place-holders that accurately convey the meaning of all the words we use.

Okay, but in my Search Wizards Speak interview, the founder of Hakia said:

We can analyze 70 average pages per second per server. Scaling: The beauty of QDexing is that QDexing grows with new knowledge and sequences, but not with new documents. If I have one page, two pages or 1,000 pages of the OJ Simpson trial, they are all talking about the same thing, and thus I need to store very little of it. The more pages that come, the more the quality of the results increase, but only with new information is the amount of QDex stored information increased. At the beginning, we have a huge steep curve, but then, processing and storage are fairly low cost. The biggest cost is the storage, as we have many many QDex files, but these are tiny two to three Kb files. Right now, we are going through news, and we are showing a seven to 10 minute lag for fully QDexing news.

No reference to a type of calculus that thrills Googlers. In fact, a review of the patent shows that well-known methods are combined in what appears to be an interesting way.
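The storage claim in the quoted passage, that the index grows with new information rather than with new documents, can be illustrated with a toy deduplication scheme. This is an invented sketch, not Hakia's actual QDex method:

```python
import hashlib

# Store each distinct sentence once, keyed by a content hash, so repeated
# coverage of the same facts adds nothing (illustrative, not QDex itself).
store = {}

def ingest(document):
    """Index a document; return how many sentences were actually new."""
    new = 0
    for sentence in document.split("."):
        sentence = sentence.strip()
        if not sentence:
            continue
        key = hashlib.sha1(sentence.lower().encode()).hexdigest()
        if key not in store:
            store[key] = sentence
            new += 1
    return new

print(ingest("The trial began today. The verdict is pending."))
print(ingest("The trial began today. Jurors were selected."))
```

The second document adds only one new sentence to the store; the repeated sentence costs nothing, which is the shape of the "steep curve at the beginning, low cost later" claim.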

Passage Three

Documents can still pass value by reference in a semantic index, but the mechanics of reference work differently. You have more options, so less sophisticated writers who don’t embed links in their text can have just as much impact on another document’s importance as professional search optimization copywriters. Paid links may not become a thing of the past very quickly, but you can bet your blog posts that buying references is going to be more sophisticated if this technology takes off. That is what is so exciting about Hakia. They haven’t just figured out a way to produce a truly semantic search engine. They have just cut through a lot of the garbage (at a theoretical level) that permeates the Web. Google AdSense arbitragers who rely on scraping other documents to create content will eventually find their cash cows drying up. The semantic index will tell Hakia where the original content came from more often than not.

Here’s what the founder says in the Search Wizards Speak interview:

With semantic search, this is not a problem. We will extract everything of the 500 words that is relevant content. That is why Google has a credibility problem. Google cannot guarantee credibility because its system relies on link statistics. Semantic methods do not rely on links. Semantic methods use the content itself. For example, hakia QDexed approximately 10 million PubMed pages. If there are 100 million questions, hakia will bring you to the correct PubMed page 99 percent of the time, whereas other engines will bring you perhaps 25 percent of the time, depending on the level of available statistics. For certain things, the big players do not like awareness. Google has never made, and probably never will make, credibility important. You can do advanced search and do “site: sitename” but that is too hard for the user; less than 0.5% of users ever use advanced search features.

The key fact is that Hakia failed. The company tried to get traction with health and medical information. The vocabulary for scientific, technical, and medical content is less poetic than the writing in business articles and general blog posts. Nevertheless, the customers and users did not bite.

Notice that the author of the article did not come to grips with the specific systems and methods used by Hakia. The write up “sounds good” but lacks substance. The founder’s explanation reveals his confidence in what “should be,” not what was and is.

My point: Writing about search is difficult. Founders see the world one way; those writing about search interpret the descriptions in terms of their knowledge.

Where can one get accurate, objective information about search? The options are limited and have been for decades. Little wonder that search remains a baffler to many people.

Stephen E Arnold, June 8, 2015

The Semantic Blenders: Not Consumable by Most

June 7, 2015

I read “Schema Markup and Microformatting Is Only the First Step in your Semantic Search Strategy.”

Okay, schema markup and microformatting. These are, according to the headline, one thing.

I am probably off base here in Harrod’s Creek, but I thought:

  1. Schema markup. Google’s explanation is designed to help out the GOOG, not the user. The methods of Guha and Halevy have proven difficult to implement. The result is a Googley move: Have the developers insert data into Web pages. Easy. Big benefit for Google too.
  2. Microformatting. A decade old effort to add additional information to a Web page. You can find examples galore at

I am not very good at math, but it sure seems to me that these are two different processes.

But the burr under my blanket is that one cannot apply anything unless there is something written or displayed on a Web page. Therefore, these two additions to a Web page’s code cannot be the first thing. Tagging can occur after something has been written or at the same time when the writing is done with a smart input system.

That these rather squishy logical mistakes occur in the headline did not rev my engine as I worked through the 1,800 words in the article. The assumption in the write up is that a reader wants to create an ecommerce site which garners a top Google result. The idea is that one types in a key word like “cyberosint” and the first hit in the result list points to the ecommerce page.

The hitch in the git along is that more queries are arriving from mobile devices. The consequence of this is that the mobile system will be filtering content and displaying information which the system calculates as important to the user.

I don’t want to rain on the semanticists’ parade, nor do I want to point out that search engine optimization is pretty much an undrinkable concoction of buzz words, jargon, and desperation.

Here’s one of the passages in the write up that I marked, inking a blue exclamation point in the margin of my printout:

Within Search Engine Optimization, many businesses focus on keywords, phrases, and search density as a way of sending clues to search engines that they should be known for those things. But let’s look at it from the human side: how can we make sure that our End User makes those associations? How can we build that Brand Association and Topical Relevance to a human being? By focusing our content strategy and providing quality content curation.

Well, SEO folks, I am not too keen on brand associations and I am not sure I want to be relevant to another human being. Whether I agree or not, the fix is not to perform these functions:

  • Bridges of association
  • Social listening
  • Quality reputation (a method used I might add on the Dark Web)

This smoothie is a mess.

There are steps a person with a Web page can take to communicate with spiders and human readers. I am not sure the effort, cost, and additional page fluff are going to work.

Perhaps the semanticists should produce something other than froth? To help Google, write and present information which is clear, concise, and consistent just like in junior high school English class.

Stephen E Arnold, June 7, 2015

Semantic Search Failure Rate: 50% and There Is Another Watson Search System

June 1, 2015

The challenge of creating a semantic search system is a mini Mt. Everest during an avalanche. One of the highest profile semantic search systems was Siderean Software. The company quietly shut down several years ago. I thought about Siderean when I followed up on a suggestion made by one of the stalwarts who read Beyond Search.

That reader sent me a link to a list of search systems. The list appeared on AI3. I could not determine when the list was compiled. To check the sticking power of the companies/organizations on the list, we looked up each vendor.

The results were interesting. Half of the listed companies were no longer in the search business.

Here’s the full list and the Beyond Search researcher’s annotations:

Search System | Type
Antidot Finder Suite | Commercial vendor
BAAGZ | Not available
Beagle++ | Not available
BuddyFinder (CORDER) | Search buddyspace and Jabber
CognitionSearch | Emphasis on monitoring
ConWeaver | Customer support
DOAPspace | Search not a focus of the site
EntityCube | Displays a page with a handful of ideographs
Falcons | Search system from Nanjing University
Ferret | Open source search library
Flamenco | A Marti Hearst search interface framework
HyperTwitter | Does not search current Twitter stream
LARQ | Redirects to Apache Jena, an open source Java framework for building Semantic Web and Linked Data applications
Lucene | Apache Lucene Core
Lucene-skos | Deprecated; points visitor to Lucene
LuMriX | Medical search
Lupedia | 404 error
OntoFrame | Redirect due to 404 error
Ontogator | Link to generic view based RDF search engine
OntoSearch | 404 error
Opossum | Page content not related to search
Picky | Search engine in Ruby script
Searchy | A metasearch engine performing a semantic translation into RDF; page updated in 2006
Semantic Search | 404
Semplore | 404
SemSearch | Keyword based semantic search. Link points to defunct Google Code service
Sindice | 404
SIREn | 404
SnakeT | Page renders; service 404s
Swangler | Displays; last update 2005
Swoogle | Search over 10,000 ontologies
SWSE | 404
TrueKnowledge | 404
Watson | Not IBM; searches semantic documents
Zebra | General purpose open source structured text indexing and retrieval engine
ZoomInfo | Commercial people search system

The most interesting entry in the list is the Watson system which seems to be operating as part of an educational institution.

Here’s what this Watson looks like:


IBM’s attorneys may want to see who owns what rights to the name “Watson.” But for IBM’s work on a Watson cookbook, this errant Watson might already have been investigated, eh, Sherlock.

Stephen E Arnold, June 1, 2015

The Future of Enterprise and Web Search: Worrying about a Very Frail Goose

May 28, 2015

For a moment, I thought search was undergoing a renaissance. But I was wrong. I noted a chart which purports to illustrate that the future is not keyword search. You can find the illustration (for now) at this Twitter location. The idea is that keyword search is less and less effective as the volume of data goes up. I don’t want to be a spoilsport, but for certain queries key words and good old Boolean may be the only way to retrieve certain types of information. Don’t believe me? Log on to your organization’s network or to Google. Now look for the telephone number of a specific person whose name you know or a tire company located in a specific city with a specific name which you know. Would you prefer to browse a directory, a word cloud, a list of suggestions? I want to zing directly to the specific fact. Yep, key word search. The old reliable.

But the chart points out that the future is composed of three “webs”: the Social Web, the Semantic Web, and the Intelligent Web. The date for the Intelligent Web appears to be 2018 (the diagram at which I am looking is fuzzy). We are now perched halfway through 2015. In 30 months, the Intelligent Web will arrive with these characteristics:


  • Web scale reasoning (Don’t we have Watson? Oh, right. I forgot.)
  • Intelligent agents (Why not tap Connotate? Agents ready to roll.)
  • Natural language search (Yep, talk to your phone. How is that working out on a noisy subway train?)
  • Semantics. (Embrace the OWL. Now.)

Now these benchmarks will arrive in the next 30 months, which implies a gradual emergence of Web 4.0.

The hitch in the git along, like most futuristic predictions about information access, is that reality behaves in some unpredictable ways. The assumption behind this graph is “Semantic technology help to regain productivity in the face of overwhelming information growth.”


Hijacking Semantics for Search Engine Optimization

May 26, 2015

I am just too old and cranky to get with the search engine optimization program. If a person cannot find your content, too bad. SEO has caused some of the erosion of relevance across public Web search engines.

The reason is that pages with lousy content are marketed as having other, more valuable content. The result is queries like this:


I want information about methods of digital reasoning. What I get is a company profile.

How do I get information for my specific requirement? I have to know how to work around the problems SEO puts in my face every day, over and over again.

This query works on Bing, Google, and Yandex: artificial intelligence decision procedures.


The results do not point to a small company in Tennessee, but to substantive documents from which other, pointed queries can be launched for individuals, industry associations, and methods.

When I read “Semantic Search Strategies That Work,” I became agitated. The notions of “forgetting about content” and “focusing on quality” miss the mark. Telling me to “spend time on engagement” is just one of a collection of unrelated assertions.

The goal of semantics for SEO is to generate traffic. The search systems suck in shaped content and persist in directing people to topics that may have little or nothing to do with the information a person needs to solve his or her problem.

In short, the bastardization of semantics in the name of SEO is ensuring that some users will define the world from the point of view of marketing, not objective information.

What’s the fix?

Here’s the shocker: There is no fix. As individuals abrogate their responsibility to demand high value, on point results, schlock becomes the order of the day.

So much for clear thinking. Semantic strategies that erode relevance do not “work” from my point of view. This type of semantics thickens the cloud of unknowing.

Stephen E Arnold, May 26, 2015

Semantic Search: A Return to Hieroglyphics

May 20, 2015

I am so out of date, lost in time, and dumb that I experienced a touch of nausea when I read “Feeligo Expands Semantic Search for Branded Online Stickers.” Feeligo, I learned, is “a leading provider of branded stickers for online conversations.”



The leap from a sticker to semantic search is dazzling. According to the write up, Feeligo has 500 million users. These folks are doing semantic search. How does this sticker-semantic marriage work? The article says:

Feeligo has developed a platform that capitalizes on the growing awareness of marketing to online and mobile users through social conversations, including comment forums and user forums. Feeligo offers clients a plug-and-play solution for all messaging services, complete with generic and branded stickers, which are installed on client sites. Through Feeligo’s semantic recommendation algorithms, direct matches between words and phrases in users’ text conversations and stickers are made, enabling users to quickly find the appropriate sticker for a user’s message.
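The word-to-sticker matching the passage describes can be sketched in a few lines. The sticker catalog and the matching rule are invented for illustration; Feeligo's actual algorithms are not public:

```python
# Hypothetical sticker catalog mapping trigger words to sticker assets.
STICKERS = {
    "pizza": "sticker_pizza.png",
    "coffee": "sticker_coffee.png",
    "birthday": "sticker_cake.png",
}

def suggest_stickers(message):
    """Return stickers whose trigger words appear in the message."""
    words = [w.strip("!?.,") for w in message.lower().split()]
    return [STICKERS[w] for w in words if w in STICKERS]

print(suggest_stickers("happy birthday! coffee later?"))
```

Calling direct word matching "semantic recommendation" is, of course, part of what this post is grumbling about.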

I have watched enterprise search vendors distort language in their remarkable attempts to generate sales. I have watched the search engine optimization crowd trash relevance and then embrace the jargon of RDF and OWL. I have now seen how purveyors of digital stickers have tapped semantic technology to make hieroglyphics a brand message technique.

Does anyone notice that a digital sticker is a cartoon empowered to generate three views every second? Do these sticker consumers consider dipping into William James or Charles Dickens? Nah, no stickers for that irrelevant material.

Stephen E Arnold, May 20, 2015

Developing an NLP Semantic Search

May 15, 2015

Can you imagine a natural language processing semantic search engine? It would be a lovely tool to use in your daily routines and would make research a bit easier. If you are working on such a project and are making progress, keep at that startup, because this is a lucrative field at the moment. Over at Stack Overflow, an enterprising spirit is trying to develop “Semantic Search with NLP and Elasticsearch”:

“I am experimenting with Elasticsearch as a search server and my task is to build a “semantic” search functionality. From a short text phrase like “I have a burst pipe” the system should infer that the user is searching for a plumber and return all plumbers indexed in Elasticsearch.

Can that be done directly in a search server like Elasticsearch or do I have to use a natural language processing (NLP) tool like e.g. Maui Indexer. What is the exact terminology for my task at hand, text classification? Though the given text is very short as it is a search phrase.”

Given that this question was asked about three years ago, a lot has been done not only with Elasticsearch but also with NLP. Search is moving toward a more organic experience, but accuracy is often muddled by different factors. These include the quality of the technology, classification, taxonomies, ads in results, and even keywords (still!).
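One common way to approach the Stack Overflow question is to split it in two: first map the short phrase to a service category with a classifier, then issue a structured Elasticsearch query for providers in that category. The keyword lists, index, and field names below are invented; a production system would use a trained classifier rather than hand-picked keywords:

```python
# Hypothetical keyword sets mapping query vocabulary to service categories.
CATEGORY_KEYWORDS = {
    "plumber": {"pipe", "leak", "faucet", "drain"},
    "electrician": {"wiring", "outlet", "fuse", "breaker"},
}

def classify(phrase):
    """Return the category whose keywords best overlap the phrase, or None."""
    words = set(phrase.lower().split())
    best = max(CATEGORY_KEYWORDS,
               key=lambda c: len(words & CATEGORY_KEYWORDS[c]))
    return best if words & CATEGORY_KEYWORDS[best] else None

def build_query(phrase):
    """Build an Elasticsearch query body: structured if we can classify,
    plain full-text match otherwise."""
    category = classify(phrase)
    if category is None:
        return {"query": {"match": {"description": phrase}}}
    return {"query": {"term": {"profession": category}}}

print(build_query("I have a burst pipe"))
```

The resulting body dict could then be handed to a real client, for example elasticsearch-py's `es.search(index="providers", body=...)`; again, the index and field names here are assumptions for the sketch.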

NLP semantic search is closer now than it was three years ago, but technology companies would invest a lot of money in a startup that can bridge the gap between natural language and machine learning.

Whitney Grace, May 15, 2015

Sponsored by, publisher of the CyberOSINT monograph

The Philosophy of Semantic Search

May 13, 2015

The article “Taking Advantage of Semantic Search NOW: Understanding Semiotics, Signs, & Schema” on Lunametrics delves into semantics on a philosophical and linguistic level as well as in regard to business. The author goes through the emergence of semantic search, beginning with Ray Kurzweil’s interest in machines learning meaning as opposed to simpler keyword search. In order to fully grasp this concept, the author provides a brief refresher on Saussure’s semiotics.

“a Sign is comprised of a signifier, or the name of a thing, and the signified, what that thing represents… Say you sell iPad accessories. “iPad case” is your signifier, or keyword in search marketing speak. We’ve abused the signifier to the utmost over the years, stuffing it onto pages, calculating its density with text tools, jamming it into title tags, in part because we were speaking to robot who read at a 3-year-old level.”

In order to create meaning, we must go beyond even just the addition of a price tag and picture to create a sign. The article suggests the need for schema, in the addition of some indication of whom and what the thing is for. The author, Michael Bartholow, has a background in linguistics, marketing, and search engine optimization. His article ends with the question of when linguists, philosophers, and humanists will be invited into the conversation with businesses, perhaps making him a true visionary in a field populated by data engineers with tunnel vision.

Chelsea Kerwin, May 13, 2015


RichRelevance Promises Complete Omnichannel Personalization

May 7, 2015

The article on MarketWatch titled RichRelevance Extends Its Partner Ecosystem to Support True Omnichannel Personalization predicts the consequences of San Francisco-based company RichRelevance’s recent announcement that they will be amping up partner support in order to improve the continuity of the customer experience across “web, mobile, call center and store.” The article explains what is meant by omnichannel personalization and why it is so important,

“Personalization has emerged as the most important strategic imperative for global businesses,” said Eduardo Sanchez, CEO of RichRelevance. “Our partner ecosystem provides our customers with a unique resource to support the implementation of different components of the Relevance Cloud in their business, as well as customize personalization according to the highly specific demands of their own businesses and consumer base.” Gartner predicts that 89% of companies plan to compete primarily on the basis of the customer experience by 2016…”

The Relevance Cloud is available to RichRelevance partners and includes such core capabilities as pre-built personalization apps for recommendations and search, the Open Innovation Platform for Build, and Relevance in Store for the reported 90% of sales that occur in-store. The announcement ensures that the collaboration RichRelevance emphasizes with its partners will span all areas of customer engagement.

Chelsea Kerwin, May 7, 2015

