
Old Wine: Semantic Search from the Enlightenment

June 24, 2015

I read a weird disclaimer. Here it is:

This is an archived version of Pandia’s original article “Top 5 Semantic Search Engines”, we made it available to the users mainly because it is still among the most sought articles from old site. You can also check kids, radio search, news, people finder and q-cards sections.

An article from the defunct search newsletter Pandia surfaced in a news aggregation list. Pandia published one of my books, but at the moment I cannot remember which of my studies it was.

The write up identifies “semantic search engines.” Here’s the list with my status update in bold face:

  • Hakia. Out of business.
  • SenseBot. Out of business.
  • Powerset. Bought by Microsoft. Fate unknown in the new Delve/Bing world.
  • DeepDyve. Talks about semantics, but the system is a variation of the Dialog/BRS for-fee search model from the late 1970s.
  • Cognition (Cognition Technologies). May be a unit of Nuance?

What’s the score?

Two failures. Two sales to other companies. One survivor with an old-school business model. My take? Zero significant impact on information retrieval.

Feel free to disagree, but the promise of semantic search seems to pivot on finding a buyer and surviving by selling online research. Why so much semantic cheerleading? Beats me. Semantic methods are useful in the plumbing as a component of a richer, more robust system. Most cyberOSINT systems follow this path. Users don’t care too much about plumbing in my experience.

Stephen E Arnold, June 24, 2015

Expert System Acquires TEMIS

June 22, 2015

In a move to improve its product offerings, Expert System acquired TEMIS. The two companies will combine their assets to create a leading semantic provider for cognitive computing. Reuters described the acquisition in very sparse detail: “Expert System Signs Agreement To Acquire French TEMIS SA.”

Reuters describes the merger as:

Reported on Wednesday that it [Expert System] signed binding agreement to buy 100 percent of TEMIS SA, a French company offering solutions in text analytics

  • Deal value is 12 million euros ($13.13 million)

TEMIS creates technology that helps organizations leverage, manage, and structure their unstructured information assets.  It is best known for Luxid, which identifies and extracts information to semantically enrich content with domain-specific metadata.

Expert System, on the other hand, is another semantically inclined company, and its flagship product is Cogito. The Cogito software is designed to understand content within unstructured text, systems, and analytics. The goal is to give organizations a complete picture of their information, because Cogito actually understands what it is processing.

TEMIS and Expert System have similar goals: to make unstructured data useful to organizations. Other than the actual acquisition deal, details on how Expert System plans to use TEMIS have not been revealed. Expert System, of course, plans to use TEMIS to improve its own semantic technology and increase revenue. Both companies are pleased with the acquisition, but if you consider other buyouts in recent times, the cost to Expert System is very modest. Thirteen million dollars puts the valuations of other text analysis companies in perspective; most of them would command considerably more than TEMIS did.

Whitney Grace, June 22, 2015

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 

Sprylogics Repositioned to Mobile Search

June 20, 2015

I learned about Cluuz.com in a briefing in a gray building in a gray room with gray carpeting. The person yapping explained how i2 Ltd.-type relationship analysis was influencing certain intelligence-centric software. I jotted down some URLs the speaker mentioned.

When I returned to my office, I checked out the URLs. I found the Cluuz.com service interesting. The system allowed me to run a query, review results with inline extracts, and see relationship visualizations among entities. In that 2007 version of Cluuz.com’s system, I found the presentation and the inclusion of emails, phone numbers, and parent-child relationships quite useful. The demonstration used queries passed against Web indexes. Technically, Cluuz.com belonged to the category of search systems which I call “metasearch” engines. The Googles and Yahoos index the Web; Cluuz.com added value. Nifty.
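
For readers who like to see the mechanics behind the marketing, here is a minimal sketch of the kind of value-add layer a metasearch system bolts onto someone else’s Web index: pull back result snippets, extract emails and phone numbers, and turn entity co-occurrence into relationship edges. This is my own toy Python example with invented names and data, not Sprylogics’ or Cluuz.com’s code.

    import re
    from itertools import combinations
    from collections import Counter

    # Toy "search engine results" standing in for hits pulled from a Web index.
    # A metasearch layer does not crawl the Web itself; it post-processes
    # results returned by the engines that do.
    results = [
        "You can reach Jane Smith of Acme Analytics at jane.smith@acme.example or 502-555-0134.",
        "Acme Analytics and Beta Partners announced a joint study led by Jane Smith.",
        "Beta Partners hired John Doe; reach him at john.doe@beta.example for details.",
    ]

    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.\w+")
    PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
    # Naive entity spotter: runs of capitalized words. Good enough for a demo.
    ENTITY = re.compile(r"(?:[A-Z][a-z]+ )+[A-Z][a-z]+")

    edges = Counter()
    for snippet in results:
        print("emails:", EMAIL.findall(snippet), "phones:", PHONE.findall(snippet))
        entities = set(ENTITY.findall(snippet))
        # Entities mentioned in the same snippet become an edge in the relationship graph.
        for a, b in combinations(sorted(entities), 2):
            edges[(a, b)] += 1

    print("relationship edges:", dict(edges))

The real system obviously used far better linguistics and live engine results, but the division of labor is the same: someone else indexes, and the metasearch layer enriches.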

I chased down Alex Zivkovic, the individual then identified as the chief technical professional at Sprylogics. You can read my 2008 interview with Zivkovic in my Search Wizards Speak collection. The Cluuz.com system originated with a former military professional’s vision for information analysis. According to Zivkovic, the prime mover for Cluuz.com was Avi Shachar. At the time of the interview, the company focused on enterprise customers.

Zivkovic told me in 2008:

We have clustering. We have entity extraction. We have relationship analysis in a graph format. I want to point out that for enterprise applications, the Cluuz.com functions are significantly more rich. For example, a query can be run across internal content and external content. The user sees that the internal information is useful but not exactly on point. Our graph technology makes it easy for the user to spot useful information from an external source such as the Web in conjunction with the internal information. With a single click, the user can be looking into those information objects. We think we have come up with a very useful way to allow an organization to give its professionals an efficient way to search for content that is behind the firewall and on the Web. The main point, however, is that the user does not have to be trained. Our graphical interface makes it obvious what information is available from which source. Instead of formulating complex queries, the person doing the search can scan, click, and browse. Trips back to the search box are options, not mandatory.

I visited the Sprylogics.com Web site the other day and learned that the Cluuz.com-type technology has been repackaged as a mobile search solution and real time sports application.

There is a very good explanation of the company’s use of its technology in a more consumer friendly presentation. You can find that presentation at this link, but the material can be removed at any time, so don’t blame me if the link is dead when you try to review the explanation of the 2015 version of Sprylogics.

From my point of view, the Sprylogics’ repositioning is an excellent example of how a company with technology designed for intelligence professionals can be packaged into a consumer application. The firm has more than a dozen patents, which some search and content processing companies cannot match. The semantic functions and the system’s ability to process Web content in near real time make the firm’s Poynt product interesting to me.

Sprylogics’ approach, in my opinion, is a far more innovative approach to leveraging advanced content processing capabilities than approaches taken by most search vendors. It is easier to slap a customer relationship management, customer support, or business intelligence label on what is essentially search and retrieval software than to create a consumer-facing app.

Kudos to Sprylogics. The ArnoldIT team hopes their stock, which is listed on the Toronto Stock Exchange, takes wing.

Stephen E Arnold, June 20, 2015

Solcara Is The Best!  Ra Ra Ra!

June 15, 2015

Thomson Reuters is a world-renowned news syndicator, but the company also has its own line of search software called Solcara Federated Search, also known as Solcara SolSearch. In a cheerleading press release, Q-resolve highlights Solcara’s features and benefits: “Solcara Legal Search, Federated Search And Know How.” Solcara allows users to search multiple information resources, including intranets, databases, knowledge management systems, and library and document management systems. It returns accurate results according to the entered search terms or keywords. In other words, it acts like an RSS feed combined with Google.

Solcara also has a search product specially designed for those in the legal profession, and the press release uses a smooth-reading product description to sell it:

“Solcara legal Search is as easy to use as your favorite search engine. With just one search you can reference internal documents and approved legal information resources simultaneously without the need for large scale content indexing, downloading or restructuring. What’s more, you can rely on up-to-date content because all searches are carried out in real time.”

The press release also mentions some other tools and case studies, and references the semantic Web. While Solcara does sound like a good product and comes from a reliable news aggregator like Thomson Reuters, the description and organization of the press release make it hard to understand all the features and who the target consumer group is. Do they want to sell to the legal profession and only that group, or do they want to demonstrate how Solcara can be adapted to all industries that digest huge amounts of information? The point of advertising is to focus the potential buyer’s attention. This one jumps all over the place.

Whitney Grace, June 15, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

More Semantic Search and Search Engine Optimization Chatter

June 10, 2015

I read “Understanding Semantic Search.” I had high hopes. The notion of Semantic Search as set forth by Tim Bray, Ramanathan Guha, and some other wizards years ago continues to intrigue me. The challenge has been to deliver high value outputs that generate sufficient revenue to pay for the plumbing, storage, and development good ideas can require.

I spent considerable time exploring one of the better known semantic search systems before the company turned off the lights and locked its doors. Siderean Software offered its Seamark system which could munch on triples and output some quite remarkable results. I am not sure why the company was not able to generate more revenue.

The company emphasized “discovery searching.” Vivisimo later imitated Siderean’s user input feature. The idea is that if a document required an additional key word, the system accepted the user input and added the term to the index. Siderean was one of the first search vendors to suggest that “graph search” or relationships would allow users to pinpoint content processed by the system. In the 2006-2007 period, Siderean indexed Oracle text content as a demonstration. (At the time, Oracle had the original Artificial Linguistics’ technology, the Oracle Text function, Triple Hop, and PL/SQL queries. Not surprisingly, Oracle did not show the search acquisition appetite the company demonstrated a few years later when Oracle bought Endeca’s ageing technology, the RightNow Netherlands-originated technology, or the shotgun marriage search vendor InQuira.)
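
For those who never saw Seamark, here is the general idea in a few lines of Python. It is strictly my own toy, not Siderean’s code: facts are stored as subject-predicate-object triples, patterns with wildcards pull facets out, and a user-supplied key word simply becomes one more triple.

    # A toy triple store, my own sketch rather than anything from Seamark.
    # Facts live as (subject, predicate, object) triples; "discovery searching"
    # lets a user bolt a new descriptor onto a document, which just adds a triple.

    triples = {
        ("doc:report-7", "dc:subject", "semantic search"),
        ("doc:report-7", "dc:creator", "J. Smith"),
        ("doc:whitepaper-42", "dc:subject", "faceted navigation"),
    }

    def match(s=None, p=None, o=None):
        """Return triples matching a pattern; None is a wildcard."""
        return [t for t in triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]

    # A user decides doc:whitepaper-42 also deserves the tag "discovery searching".
    # The system asserts another triple; the index grows with the vocabulary.
    triples.add(("doc:whitepaper-42", "user:tag", "discovery searching"))

    print(match(o="semantic search"))     # facet-style lookup by value
    print(match(s="doc:whitepaper-42"))   # everything known about one document

The appeal is obvious; the cost of doing this at Web scale is the part that proved hard to pay for.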

I also invested some time on behalf of a client in the semantic inventions of Dr. Ramanathan Guha. This work was summarized in Google Version 2.0, now out of print. Love those print publishers, folks.

Dr. Guha applied the features of the Semantic Web to plumbing which, if fully implemented, would have allowed Google to build a universal database of knowledge, serve up snippets from a special semantic server, and perform a number of useful functions. This work was done by Dr. Guha when he was at IBM Almaden and at Google. My analysis of Dr. Guha’s work suggests that Google has more semantic plumbing than most observers of the search giant notice. The reason, I concluded, was that semantic technology works behind the scenes. Dragging the user into OWL, RDF, and other semantic nuances does not pay off as well as embedding certain semantic functions behind the scenes.

In the “Understanding Semantic Search” write up, I learned that my understanding of semantic search is pretty much a wild and crazy collection of half truths. Let me illustrate what the article presents as the “understanding” function for addled geese like me.

  • Searches have a context
  • Results can be local or national
  • Entities are important; for example, the White House is different from a white house

So far, none of this helps me understand semantic search as embodied in the W3C standards, in the implementations of companies like Siderean, or in the Google-Guha patent documents from 2007 forward.

The write up makes a leap from context to the question, “Are key words still important?”

From that question, the article informs me that I need to utilize schema markup. These are additional bits of code, hidden from the reader, which provide information to crawlers and other software about the content which the user sees on a rendering device.

And that’s it.

So let’s recap. I learned that context is important via illustrations which show Google using different methods to localize or personalize content. The write up does not enumerate the different methods which use browser histories, geolocation, and other signals. The write up then urges me to use additional mark up.
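
For anyone who has not looked under the hood, the “additional mark up” in question is usually schema.org vocabulary serialized as JSON-LD and dropped into the page where only crawlers will read it. Here is a minimal sketch in Python; the organization, address, and URL are invented for illustration.

    import json

    # An invented organization, used only to show the shape of schema.org JSON-LD.
    # A human visitor never sees this block; crawlers do.
    org = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": "Example Analytics LLC",
        "url": "https://www.example-analytics.example",
        "address": {
            "@type": "PostalAddress",
            "addressLocality": "Harrod's Creek",
            "addressRegion": "KY",
        },
    }

    # The structured data rides along in the page head inside a script tag.
    print('<script type="application/ld+json">')
    print(json.dumps(org, indent=2))
    print("</script>")

Whether a block like that rescues a page with nothing worth finding is, of course, the question the article never asks.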

I think I will stick with my understanding of semantics. My work with Siderean and my research for an investment bank provided a richer base of knowledge about the real world applications of semantic technology. Technology, I wish to point out, which can be computationally demanding unless one has sufficient resources to perform the work.

What is happening in this “Understanding Semantic Search” article is an attempt to generate business for search engine optimization experts. Key word stuffing and doorway pages no longer work very well. In fact, SEO itself is a problem because it undermines precision and recall. Spoofing relevance is not my idea of a useful activity.

For those looking to semantics to deliver Google traffic, you might want to invest the time and effort in creating content which pulls users to you.

Stephen E Arnold, June 9, 2015

Semantic Search Hoohah: Hakia

June 8, 2015

My Overflight system snagged an updated post to an article written in 2006 about Hakia. Hakia, as you may know, was a semantic search system. I ran an interview with Riza C. Berkan in 2008. You can find that Search Wizards Speak interview here.

Hakia went quiet months ago. The author of “Hakia Is a True Semantic Search Engine” posted a sentence that said: “Hakia, unfortunately, failed and went out of business.”

I reviewed that nine-year-old article this morning and highlighted several passages. These are important because these snippets illustrate how easy it is to create a word picture which does not match reality. Search engine developers find describing their vision a heck of a lot easier than converting talk into sustainable revenues.

Let’s run down three of the passages proudly displaying my blue highlighter’s circles and arrows. The red text is from the article describing Hakia and the blue text is what the founder of Hakia said in the Search Wizards Speak interview.

Passage One

So a semantic search engine doesn’t have to address every word in the English language, and in fact it may be able to get by with a very small core set of words. Let’s say that we want to create our own semantic search engine. We can probably index most mainstream (non-professional) documents with 20-30,000 words. There will be a few gaps here and there, but we can tolerate those gaps. But still, the task of computing relevance for millions, perhaps billions of documents that use 30,000 words is horrendously monumental. If we’re going to base our relevance scoring on semantic analysis, we need to reduce the word-set as much as possible.

This passage is less about Hakia and more about the author’s perception of semantic search. Does this explanation resonate with you? For me, many semantic methods are computationally burdensome. As a result, the systems are often sluggish and unable to keep pace with updated content and new content.

Here’s how Dr. Riza C. Berkan, a nuclear engineer and math whiz, explained semantics:

With semantic search, this is not a problem. We will extract everything of the 500 words that is relevant content. That is why Google has a credibility problem. Google cannot guarantee credibility because its system relies on link statistics. Semantic methods do not rely on links. Semantic methods use the content itself. For example, hakia QDexed approximately 10 million PubMed pages. If there are 100 million questions, hakia will bring you to the correct PubMed page 99 percent of the time, whereas other engines will bring you perhaps 25 percent of the time, depending on the level of available statistics. For certain things, the big players do not like awareness. Google has never made, and probably never will make, credibility important. You can do advanced search and do “site: sitename” but that is too hard for the user; less than 0.5% of users ever use advanced search features.

Passage Two

What I believe the founders of Hakia have done is borrow the concept of Lambda Calculus from compiler theory to speed the process of reducing elements on pages to their conceptual foundations. That is, if we assume everyone writes like me, then most documents can be reduced to a much smaller subset of place-holders that accurately convey the meaning of all the words we use.

Okay, but in my Search Wizards Speak interview, the founder of Hakia said:

We can analyze 70 average pages per second per server. Scaling: The beauty of QDexing is that QDexing grows with new knowledge and sequences, but not with new documents. If I have one page, two pages or 1,000 pages of the OJ Simpson trial, they are all talking about the same thing, and thus I need to store very little of it. The more pages that come, the more the quality of the results increase, but only with new information is the amount of QDex stored information increased. At the beginning, we have a huge steep curve, but then, processing and storage are fairly low cost. The biggest cost is the storage, as we have many many QDex files, but these are tiny two to three Kb files. Right now, we are going through news, and we are showing a seven to 10 minute lag for fully QDexing news.

No reference to a type of calculus that thrills Googlers. In fact, a review of the patent shows that well-known methods are combined in what appears to be an interesting way.

Passage Three

Documents can still pass value by reference in a semantic index, but the mechanics of reference work differently. You have more options, so less sophisticated writers who don’t embed links in their text can have just as much impact on another document’s importance as professional search optimization copywriters. Paid links may not become a thing of the past very quickly, but you can bet your blog posts that buying references is going to be more sophisticated if this technology takes off. That is what is so exciting about Hakia. They haven’t just figured out a way to produce a truly semantic search engine. They have just cut through a lot of the garbage (at a theoretical level) that permeates the Web. Google AdSense arbitragers who rely on scraping other documents to create content will eventually find their cash cows drying up. The semantic index will tell Hakia where the original content came from more often than not.

Here’s what the founder says in the Search Wizards Speak interview:

With semantic search, this is not a problem. We will extract everything of the 500 words that is relevant content. That is why Google has a credibility problem. Google cannot guarantee credibility because its system relies on link statistics. Semantic methods do not rely on links. Semantic methods use the content itself. For example, hakia QDexed approximately 10 million PubMed pages. If there are 100 million questions, hakia will bring you to the correct PubMed page 99 percent of the time, whereas other engines will bring you perhaps 25 percent of the time, depending on the level of available statistics. For certain things, the big players do not like awareness. Google has never made, and probably never will make, credibility important. You can do advanced search and do “site: sitename” but that is too hard for the user; less than 0.5% of users ever use advanced search features.

The key fact is that Hakia failed. The company tried to get traction with health and medical information. The vocabulary for scientific, technical, and medical content is less poetic than the writing in business articles and general blog posts. Nevertheless, the customers and users did not bite.

Notice that the author of the article did not come to grips with the specific systems and methods used by Hakia. The write up “sounds good” but lacks substance. The founder’s explanation reveals his confidence in what “should be,” not what was and is.

My point: Writing about search is difficult. Founders see the world one way; those writing about search interpret the descriptions in terms of their knowledge.

Where can one get accurate, objective information about search? The options are limited and have been for decades. Little wonder that search remains a baffler to many people.

Stephen E Arnold, June 8, 2015

The Semantic Blenders: Not Consumable by Most

June 7, 2015

I read “Schema Markup and Microformatting Is Only the First Step in your Semantic Search Strategy.”

Okay, schema markup and microformatting. These are, according to the headline, one thing.

I am probably off base here in Harrod’s Creek, but I thought:

  1. Schema markup. Google’s explanation is designed to help out the GOOG, not the user. The methods of Guha and Halevy have proven difficult to implement. The result is a Googley move: Have the developers insert data into Web pages. Easy. Big benefit for Google too.
  2. Microformatting. A decade-old effort to add additional information to a Web page. You can find examples galore at http://microformats.org/.

I am not very good at math, but it sure seems to me that these are two different processes.
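
To make the point concrete, here is the same invented business card marked up both ways; the details are mine, not the article’s. Microformats work through agreed-upon class names, while schema.org markup works through itemscope/itemprop attributes (or a separate JSON-LD block). Two vocabularies, two processes.

    # Invented example: one person, two markup regimes.

    microformats_hcard = """
    <div class="h-card">
      <span class="p-name">Jane Smith</span>
      <a class="u-url" href="https://example.example">Example Analytics</a>
    </div>
    """

    schema_org_microdata = """
    <div itemscope itemtype="https://schema.org/Person">
      <span itemprop="name">Jane Smith</span>
      <a itemprop="url" href="https://example.example">Example Analytics</a>
    </div>
    """

    print(microformats_hcard)
    print(schema_org_microdata)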

But the burr under my blanket is that one cannot apply anything unless there is something written or displayed on a Web page. Therefore, these two additions to a Web page’s code cannot be the first thing. Tagging can occur after something has been written or at the same time when the writing is done with a smart input system.

The fact that these rather squishy logical mistakes appear in the headline did not rev my engine as I worked through the 1,800 words in the article. The assumption in the write up is that a reader wants to create an ecommerce site which garners a top Google result. The idea is that one types in a key word like “cyberosint” and the first hit in the result list points to the ecommerce page.

The hitch in the git along is that more queries are arriving from mobile devices. The consequence of this is that the mobile system will be filtering content and displaying information which the system calculates as important to the user.

I don’t want to rain on the semanticists’ parade, nor do I want to point out that search engine optimization is pretty much an undrinkable concoction of buzz words, jargon, and desperation.

Here’s one of the passages in the write up that I marked, inking a blue exclamation point in the margin of my printout:

Within Search Engine Optimization, many businesses focus on keywords, phrases, and search density as a way of sending clues to search engines that they should be known for those things. But let’s look at it from the human side: how can we make sure that our End User makes those associations? How can we build that Brand Association and Topical Relevance to a human being? By focusing our content strategy and providing quality content curation.

Well, SEO folks, I am not too keen on brand associations and I am not sure I want to be relevant to another human being. Whether I agree or not, the fix is not to perform these functions:

  • Bridges of association
  • Social listening
  • Quality reputation (a method used, I might add, on the Dark Web)

This smoothie is a mess.

There are steps a person with a Web page can take to communicate with spiders and human readers. I am not sure the effort, cost, and additional page fluff are going to work.

Perhaps the semanticists should produce something other than froth? To help Google, write and present information which is clear, concise, and consistent just like in junior high school English class.

Stephen E Arnold, June 7, 2015

Semantic Search Failure Rate: 50% and There Is Another Watson Search System

June 1, 2015

The challenge of creating a semantic search system is a mini Mt. Everest during an avalanche. One of the highest-profile semantic search systems was Siderean Software. The company quietly shut down several years ago. I thought about Siderean when I followed up on a suggestion made by one of the stalwarts who read Beyond Search.

That reader sent me a link to a list of search systems. The list appeared on AI3. I could not determine when the list was compiled. To check the staying power of the companies/organizations on the list, we looked up each vendor.

The results were interesting. Half of the listed companies were no longer in the search business.

Here’s the full list and the Beyond Search researcher’s annotations:

  • Antidot Finder Suite: Commercial vendor
  • BAAGZ: Not available
  • Beagle++: Not available
  • BuddyFinder (CORDER): Search buddyspace and Jabber
  • CognitionSearch: Emphasis on monitoring
  • ConWeaver: Customer support
  • DOAPspace: Search not a focus of the site
  • EntityCube: Displays a page with a handful of ideographs
  • Falcons: Search system from Nanjing University
  • Ferret: Open source search library
  • Flamenco: A Marti Hearst search interface framework
  • HyperTwitter: Does not search current Twitter stream
  • LARQ: Redirects to Apache Jena, an open source Java framework for building Semantic Web and Linked Data applications
  • Lucene: Apache Lucene Core
  • Lucene-skos: Deprecated; points visitor to Lucene
  • LuMriX: Medical search
  • Lupedia: 404 error
  • OntoFrame: Redirect due to 404 error
  • Ontogator: Link to generic view based RDF search engine
  • OntoSearch: 404 error
  • Opossum: Page content not related to search
  • Picky: Search engine in Ruby script
  • Searchy: A metasearch engine performing a semantic translation into RDF; page updated in 2006
  • Semantic Search: 404
  • Semplore: 404
  • SemSearch: Keyword based semantic search; link points to defunct Google Code service
  • Sindice: 404
  • SIREn: 404
  • SnakeT: Page renders; service 404s
  • Swangler: Displays SemWebCentral.org; last update 2005
  • Swoogle: Search over 10,000 ontologies
  • SWSE: 404
  • TrueKnowledge: 404
  • Watson: Not IBM; searches semantic documents
  • Zebra: General purpose open source structured text indexing and retrieval engine
  • ZoomInfo: Commercial people search system

The most interesting entry in the list is the Watson system which seems to be operating as part of an educational institution.

Here’s what the Open.ac.uk Watson looks like:

[Screenshot: the Open.ac.uk Watson search interface]

IBM’s attorneys may want to see who owns what rights to the name “Watson.” But for IBM’s working on a Watson cookbook, this errant Watson may have been investigated, eh, Sherlock.

Stephen E Arnold, June 1, 2015

The Future of Enterprise and Web Search: Worrying about a Very Frail Goose

May 28, 2015

For a moment, I thought search was undergoing a renascence. But I was wrong. I noted a chart which purports to illustrate that the future is not keyword search. You can find the illustration (for now) at this Twitter location. The idea is that keyword search is less and less effective as the volume of data goes up. I don’t want to be a spoil sport, but for certain queries key words and good old Boolean may be the only way to retrieve certain types of information. Don’t believe me. Log on to your organization’s network or to Google. Now look for the telephone number of a specific person whose name you know or a tire company located in a specific city with a specific name which you know. Would you prefer to browse a directory, a word cloud, a list of suggestions? I want to zing directly to the specific fact. Yep, key word search. The old reliable.
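
To be clear about what I mean by old reliable, here is a toy inverted index in Python. The records and numbers are invented; the point is that a Boolean AND over postings lists goes straight to the one record that answers a pinpoint question, no semantic inference required.

    from collections import defaultdict

    # Invented directory records. A pinpoint lookup, say a tire company in a known
    # city, is exactly what key word plus Boolean retrieval is built for.
    records = {
        1: "Acme Tire Company, Louisville, 502-555-0101",
        2: "Jane Smith, accounting, extension 4471",
        3: "Beta Tire and Rubber, Lexington, 859-555-0188",
    }

    index = defaultdict(set)
    for doc_id, text in records.items():
        for token in text.lower().replace(",", " ").split():
            index[token].add(doc_id)

    def boolean_and(*terms):
        """Classic Boolean AND: intersect the postings lists for every term."""
        postings = [index.get(t.lower(), set()) for t in terms]
        return set.intersection(*postings) if postings else set()

    for doc_id in boolean_and("tire", "louisville"):
        print(records[doc_id])  # the single record that answers the question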

But the chart points out that the future is composed of three “webs”: the Social Web, the Semantic Web, and the Intelligent Web. The date for the Intelligent Web appears to be 2018 (the diagram at which I am looking is fuzzy). We are now perched halfway through 2015. In 30 months, the Intelligent Web will arrive with these characteristics:


  • Web scale reasoning (Don’t we have Watson? Oh, right. I forgot.)
  • Intelligent agents (Why not tap Connotate? Agents ready to roll.)
  • Natural language search (Yep, talk to your phone. How is that working out on a noisy subway train?)
  • Semantics. (Embrace the OWL. Now.)

Now these benchmarks will arrive in the next 30 months, which implies a gradual emergence of Web 4.0.

The hitch in the git along, like most futuristic predictions about information access, is that reality behaves in some unpredictable ways. The assumption behind this graph is “Semantic technology help to regain productivity in the face of overwhelming information growth.”


Hijacking Semantics for Search Engine Optimization

May 26, 2015

I am just too old and cranky to get with the search engine optimization program. If a person cannot find your content, too bad. SEO has caused some of the erosion of relevance across public Web search engines.

The reason is that pages with lousy content are marketed as having other, more valuable content. The result is queries like this:

[Screenshot: Web search results for the query “digital reasoning”]

I want information about methods of digital reasoning. What I get is a company profile.

How do I get information for my specific requirement? I have to know how to work around the problems SEO puts in my face every day, over and over again.

This query works on Bing, Google, and Yandex: artificial intelligence decision procedures.

[Screenshot: results for the query “artificial intelligence decision procedures”]

The results do not point to a small company in Tennessee, but to substantive documents from which other, pointed queries can be launched for individuals, industry associations, and methods.

When I read “Semantic Search Strategies That Work,” I became agitated. The notions of “forgetting about content” and “focusing on quality” miss the mark. Telling me to “spend time on engagement” is one more entry in a collection of unrelated assertions.

The goal of semantics for SEO is to generate traffic. The search systems suck in shaped content and persist in directing people to topics that may have little or nothing to do with the information a person needs to solve his or her problem.

In short, the bastardization of semantics in the name of SEO is ensuring that some users will define the world from the point of view of marketing, not objective information.

What’s the fix?

Here’s the shocker: There is no fix. As individuals abrogate their responsibility to demand high-value, on-point results, schlock becomes the order of the day.

So much for clear thinking. Semantic strategies that erode relevance do not “work” from my point of view. This type of semantics thickens the cloud of unknowing.

Stephen E Arnold, May 26, 2015
