Semantic Search and Challenging Patent Document Content Domains

July 7, 2015

Over the years, I have bumped into some challenging content domains. One of the most difficult was the collection of mathematical papers organized with the Dienst architecture. Another was a collection of blog posts from African bulletin board systems in a number of different languages, peppered with insider jargon. I also recall my jousts with patent documents for some pretty savvy outfits.

Processing each of these corpora and making them searchable by a regular human being remains an unsolved problem. Progress has been slow, and the focus of many innovators has been on workarounds. The challenge of each corpus remains a high hurdle, and in my opinion, no search sprinter is able to make it over the race course without catching a toe and plunging head first into the Multi-layer SB Resin covered surface.

I read “Why Is Semantic Search So Important for Patent Searching?” My answer was and remains, “Because vendors will grab at any buzzy concept in the hopes of capturing a share of the patent research market?”

The write up takes a different approach, one which I find interesting and somewhat misleading.

The write up states that there are two ways to search for information: navigational search, sort of like Endeca I assume, and research search, which is the old fashioned Boolean logic I really like.

The article points out that keyword search sucks if the person looking for information does not know the exact term. That’s why I used the reference to Dienst. I wanted to provide an example which requires precise knowledge of terminology. That’s a challenge, and it requires a person to recognize that he or she may not know the exact terminology required to locate the needed information. Try the Dienst query. Navigate to a whizzy new search engine like www.unbubble.eu and plug away. How is that working out for you? Don’t cheat: you can’t use the term Dienst.

If you run the query on a point and click Web search system like Qwant.com, you cannot locate the term without running a keyword search.

The problems in patents, whether indexed with value added metadata, by humans laboring in a warehouse, or with semantic methods, are:

  1. Patent documents exist in versions and each document drags along assorted forms which may or may not be findable. Trips to the USPTO with hat in hand and a note from a senator often do not work. Fancy Dan patent attorneys fall back on the good old method of hunting using intermediaries. Not pretty, not easy, not cheap, and not foolproof. The versions and assorted attachments are often unfindable. (There are sometimes interesting reasons for this kettle of fish and the fish within it.) I don’t have a solution to the chains of documents and the versions of patent documents. Sigh.
  2. Patents include art. Usually the novice reacts negatively to lousy screenshots, clunky drawings, and equations which make it tough to figure out what a superscript character is. Keywords and pointing and clicking, metaphors, razzle dazzle search systems, and buzzword charged solutions from outfits like Thomson Reuters and Lexis are just tools, stone tools chiseled by some folks who want to get paid. I don’t have a good solution to the arts and crafts aspect of patent documents. Sigh sigh.
  3. Patent documents are written at a level of generalization, with jargon, Latinate constructs, and assertions that usually give me a headache. Who signed up to read lots of really bad poetry? Working through the Old Norse version of Heimskringla is a walk in the park compared to figuring out what some patents “mean.” I spent a number of years indexing 15th century Latin sermons. At least in that corpus, the common knowledge base was social and political events and assorted religious material. Patents can be all over the known knowledge universe. I don’t know of a patent processing system which can make this weird prose-poetry understandable if there is litigation or findable if there is a need to figure out if someone cooked up the same system and method before the document in question was crafted. Sigh sigh sigh.
  4. None of the systems I have used over the past 40 years does a bang up job of identifying prior art in scientific, technical or medical journal articles, blog posts, trade publications, or Facebook posts by a socially aware astrophysicist working for a social media company. Finding antecedents is a great deal of work. Has been and will be in my opinion. Sigh sigh sigh sigh. But the patent attorneys cry, “Hooray. We get to bill time.”

The write up presents some of those top brass magnets: snappy visualizations. The idea is that a nifty diagram will address the four problems I identified in the preceding paragraphs. Visualizations may be able to provide some useful way to conceptualize where a particular patent document falls in a cluster of correctly processed patent documents. But an image does not deliver the mental equivalent of a NOW Foods Whey Protein Isolate.

Net net: Pitching semantic search as a solution to the challenges of patent information access is a ball. Strikes in patent searching are not easily obtained unless you pay expert patent attorneys and their human assets to do the job. Just bring your checkbook.

Stephen E Arnold, July 7, 2015

Need Semantic Search: Lucidworks Asserts It Is the Answer by Golly

July 3, 2015

If you read this blog, you know that I comment on semantic technology every month or so. In June I pointed to an article which had been tweeted as “new stuff.” Wrong. Navigate to “Semantic Search Hoohah: Hakia”; you will learn that Hakia is a quiet outfit. Quiet as in no longer on the Web. Maybe gone?

There are other write ups in my free and for fee columns about semantic search. The theme has been consistent. My view is that semantic technology is one component in a modern cybernized system. (To learn about my use of the term cyber, navigate to www.xenky.com/cyberosint.)

I find the promotion of search engine optimization as “semantic” amusing. I find the search service firms’ promotion of their semantic expertise amusing. I find the notion of open source outfits deep in hock to venture capitalists asserting their semantic wizardry amusing.

I don’t know if you are quite as amused as I am. Here’s an easy way to determine your semantic humor score. Navigate to this slideshare link and cruise through the 34 slide deck presented by one of Lucidworks’ search mavens. Lucidworks is a company I have followed since it fired up its jets with Marc Krellenstein on board. Dr. Krellenstein ejected in short order, and the company has consumed many venture dollars with management shifts, repositionings, and the Big Data thing.

We now have Lucidworks in the semantic search sector.

Here’s what I learned from the deck:

  1. The company has a new logo. I think this is the third or fourth.
  2. Search is about technology and language. Without Google’s predictive and personalized routines, words are indeed necessary.
  3. Buzzwords and jargon do not make semantic methods simple. Consider this statement from the deck, “Tokenization plus vector mathematics (TF/IDF) or one of its cousins—“bag of words” – Algorithmic tweaks – enhanced bag of words.” Got that, gentle reader? If not, check out “sausagization.” A rough sketch of the TF/IDF and bag of words idea appears after this list.
  4. Lucidworks offers a “field cache.” Okay, I am not unfamiliar with caching in order to goose performance, which can be an issue with some open source search systems. But Searchdaimon, an open source search system developed in Norway, runs circles around Lucidworks. (My team did the benchmark test of major open source systems. Searchdaimon was the speed champ and had other sector leading characteristics as well.) A bare bones sketch of the field cache idea also appears after this list.
  5. Lucidworks does the ontology thing as well. The tie up of “category nodes” and “evidence nodes” may be one reason the performance goblin noses into the story.
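
To make item 3 slightly less jargon soaked, here is a minimal sketch of the “tokenization plus vector mathematics (TF/IDF)” idea in plain Python. This is my own illustration, not Lucidworks’ code, and the toy documents are invented for the example:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Turn each document into a 'bag of words' and weight every term by
    term frequency times inverse document frequency."""
    bags = [Counter(doc.lower().split()) for doc in docs]
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())
    n = len(docs)
    vectors = []
    for bag in bags:
        total = sum(bag.values())
        vectors.append({
            term: (count / total) * math.log(n / doc_freq[term])
            for term, count in bag.items()
        })
    return vectors

docs = [
    "semantic search for patent documents",
    "keyword search for patent attorneys",
    "semantic methods versus bag of words tricks",
]
# Print the three highest weighted terms for each toy document.
for vector in tf_idf(docs):
    print(sorted(vector.items(), key=lambda kv: -kv[1])[:3])
```

The “enhanced bag of words” and “algorithmic tweaks” in the deck are, as far as I can tell, this sort of weighting with extra adjustments bolted on.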
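
And for item 4, a bare bones illustration of what a “field cache” buys you. This is a generic sketch of the caching idea, not Lucidworks’ implementation; the class and field names are hypothetical:

```python
class FieldCache:
    """Cache one field's values for every document so sorting and faceting
    do not re-read stored documents on each query (hypothetical sketch)."""

    def __init__(self, documents):
        self._documents = documents
        self._cache = {}  # field name -> list of values, indexed by doc id

    def values(self, field):
        if field not in self._cache:
            # One pass over the stored documents; later queries hit memory.
            self._cache[field] = [doc.get(field) for doc in self._documents]
        return self._cache[field]

docs = [
    {"title": "Patent A", "year": 2012},
    {"title": "Patent B", "year": 2009},
    {"title": "Patent C", "year": 2015},
]
cache = FieldCache(docs)
years = cache.values("year")
# Sort document ids by the cached field instead of touching each document.
print(sorted(range(len(docs)), key=lambda i: years[i]))
```

Real systems do this at the index level, but the trade off is the same: memory spent to avoid repeated disk reads.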

The problem I encountered is that the write up for the slide deck emphasized Fusion as a key component. I have been poking around the “fusion” notion as we put our new study of the Dark Web together. Fusion is a tricky problem and the US government has made fusion a priority. Keep in mind that content is more than text. There are images, videos, geocodes, cryptic tweets in Farsi, and quite a few challenging issues with making content available to a researcher or analyst.

It seems that Lucidworks has cracked a problem which continues to trouble some reasonably sophisticated folks in the content analysis business. Here’s the “evidence” that Lucidworks can do what others cannot:

[Diagram from the Lucidworks deck: connectors feeding index and query pipelines]

This diagram shows that once a connector is available, “pipelines proliferate.” Well, okay.

I thought the goal was to process content objects with low latency, easily, and with semantic value adds. “Lots of stages” and “index pipelines: one way query pipelines: round trip” does not compute for this addled goose.

If the Lucidworks approach makes sense to you, go for it. My team and I will stick to here and now tools and open source technology which works without the semantic jargon, which is pretty much incidental to the matter. We need to process more than text. CyberOSINT vendors deliver, and most use open source search as a utility function. Yep, utility. Not the main event. The failure of semantic search vendors suggests that the buzzword is not the solution to marketing woes. Pop. (That’s a pre Fourth of July celebratory ladyfinger.)

Stephen E Arnold, July 3, 2015

Old Wine: Semantic Search from the Enlightenment

June 24, 2015

I read a weird disclaimer. Here it is:

This is an archived version of Pandia’s original article “Top 5 Semantic Search Engines”, we made it available to the users mainly because it is still among the most sought articles from old site. You can also check kids, radio search, news, people finder and q-cards sections.

An article from the defunct search newsletter Pandia surfaced in a news aggregation list. Pandia published one of my books, but at the moment I cannot remember which of my studies.

The write up identifies “semantic search engines.” Here’s the list with my status update in bold face:

  • Hakia. Out of business
  • SenseBot. Out of business.
  • Powerset. Bought by Microsoft. Fate unknown in the new Delve/Bing world.
  • DeepDyve. Talk about semantics, but the system is a variation of the Dialog/BRS for-fee search model from the late 1970s.
  • Cognition (Cognition Technologies). May be a unit of Nuance?

What’s the score?

Two failures. Two sales to another company. One survivor which has an old school business model. My take? Zero significant impact on information retrieval.

Feel free to disagree, but the promise of semantic search seems to pivot on finding a buyer and surviving by selling online research. Why so much semantic cheerleading? Beats me. Semantic methods are useful in the plumbing as a component of a richer, more robust system. Most cyberOSINT systems follow this path. Users don’t care too much about plumbing in my experience.

Stephen E Arnold, June 24, 2015

Expert System Acquires TEMIS

June 22, 2015

In a move to improve its product offerings, Expert System acquired TEMIS. The two companies will combine their assets to create a leading semantic provider for cognitive computing. Reuters described the acquisition in very sparse detail: “Expert System Signs Agreement To Acquire French TEMIS SA.”

Reuters describes the merger as:

“Reported on Wednesday that it [Expert System] signed binding agreement to buy 100 percent of TEMIS SA, a French company offering solutions in text analytics

  • Deal value is 12 million euros ($13.13 million)”

TEMIS creates technology that helps organizations leverage, manage, and structure their unstructured information assets.  It is best known for Luxid, which identifies and extracts information to semantically enrich content with domain-specific metadata.

Expert System, on the other hand, is another semantically inclined company, and its flagship product is Cogito. The Cogito software is designed to understand content within unstructured text, systems, and analytics. The goal is to give organizations a complete picture of their information, because Cogito actually understands what it is processing.

TEMIS and Expert System have similar goals: to make unstructured data useful to organizations. Other than the actual acquisition deal, details on how Expert System plans to use TEMIS have not been revealed. Expert System, of course, plans to use TEMIS to improve its own semantic technology and increase revenue. Both companies are pleased with the acquisition, but compared with other buy outs in recent times, the cost to Expert System is very modest. Thirteen million dollars puts the valuations of other text analysis companies in perspective; most would cost considerably more than TEMIS.

Whitney Grace, June 22, 2015

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 

Sprylogics Repositioned to Mobile Search

June 20, 2015

I learned about Cluuz.com in a briefing in a gray building in a gray room with gray carpeting. The person yapping explained how i2 Ltd.-type relationship analysis was influencing certain intelligence-centric software. I jotted down some URLs the speaker mentioned.

When I returned to my office, I checked out the URLs. I found the Cluuz.com service interesting. The system allowed me to run a query, review results with inline extracts, and view relationship visualizations among entities. In that 2007 version of Cluuz.com’s system, I found the presentation and the inclusion of emails, phone numbers, and parent-child relationships quite useful. The demonstration used queries passed against Web indexes. Technically, Cluuz.com belonged to the category of search systems which I call “metasearch” engines. The Googles and Yahoos index the Web; Cluuz.com added value. Nifty.

I chased down Alex Zivkovic, the individual then identified as the chief technical professional at Sprylogics. You can read my 2008 interview with Zivkovic in my Search Wizards Speak collection. The Cluuz.com system originated with a former military professional’s vision for information analysis. According to Zivkovic, the prime mover for Cluuz.com was Avi Shachar. At the time of the interview, the company focused on enterprise customers.

Zivkovic told me in 2008:

We have clustering. We have entity extraction. We have relationship analysis in a graph format. I want to point out that for enterprise applications, the Cluuz.com functions are significantly more rich. For example, a query can be run across internal content and external content. The user sees that the internal information is useful but not exactly on point. Our graph technology makes it easy for the user to spot useful information from an external source such as the Web in conjunction with the internal information. With a single click, the user can be looking into those information objects. We think we have come up with a very useful way to allow an organization to give its professionals an efficient way to search for content that is behind the firewall and on the Web. The main point, however, is that the user does not have to be trained. Our graphical interface makes it obvious what information is available from which source. Instead of formulating complex queries, the person doing the search can scan, click, and browse. Trips back to the search box are options, not mandatory.

I visited the Sprylogics.com Web site the other day and learned that the Cluuz.com-type technology has been repackaged as a mobile search solution and real time sports application.

There is a very good explanation of the company’s use of its technology in a more consumer friendly presentation. You can find that presentation at this link, but the material can be removed at any time, so don’t blame me if the link is dead when you try to review the explanation of the 2015 version of Sprylogics.

From my point of view, the Sprylogics’ repositioning is an excellent example of how a company with technology designed for intelligence professionals can be packaged into a consumer application. The firm has more than a dozen patents, which some search and content processing companies cannot match. The semantic functions and the system’s ability to process Web content in near real time make the firm’s Poynt product interesting to me.

Sprylogics’ approach, in my opinion, is a far more innovative way to leverage advanced content processing capabilities than the approaches taken by most search vendors. It is easier to slap a customer relationship management, customer support, or business intelligence label on what is essentially search and retrieval software than to create a consumer facing app.

Kudos to Sprylogics. The ArnoldIT team hopes their stock, which is listed on the Toronto Stock Exchange, takes wing.

Stephen E Arnold, June 20, 2015

Solcara Is The Best!  Ra Ra Ra!

June 15, 2015

Thomson Reuters is a world renowned news syndicator, but the company also has its own line of search software called Solcara Federated Search, also known as Solcara SolSearch. In a cheerleading press release, Q-resolve highlights Solcara’s features and benefits: “Solcara Legal Search, Federated Search And Know How.” Solcara allows users to search multiple information resources, including intranets, databases, knowledge management systems, and library and document management systems. It returns accurate results according to the inputted search terms or keywords. In other words, it acts like an RSS feed combined with Google.

Solcara also has a search product specially designed for those in the legal profession and the press release uses a smooth reading product description to sell it:

“Solcara legal Search is as easy to use as your favorite search engine. With just one search you can reference internal documents and approved legal information resources simultaneously without the need for large scale content indexing, downloading or restructuring. What’s more, you can rely on up-to-date content because all searches are carried out in real time.”

The press release also mentions some other tools and case studies and references the semantic Web. While Solcara does sound like a good product and comes from a reliable news aggregator like Thomson Reuters, the description and organization of the press release make it hard to understand all the features and who the target consumer group is. Do they want to sell to the legal profession and only that group, or do they want to demonstrate how Solcara can be adapted to all industries that digest huge amounts of information? The point of advertising is to focus the potential buyer’s attention. This one jumps all over the place.

Whitney Grace, June 15, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

More Semantic Search and Search Engine Optimization Chatter

June 10, 2015

I read “Understanding Semantic Search.” I had high hopes. The notion of Semantic Search as set forth by Tim Bray, Ramanathan Guha, and some other wizards years ago continues to intrigue me. The challenge has been to deliver high value outputs that generate sufficient revenue to pay for the plumbing, storage, and development good ideas can require.

I spent considerable time exploring one of the better known semantic search systems before the company turned off the lights and locked its doors. Siderean Software offered its Seamark system which could munch on triples and output some quite remarkable results. I am not sure why the company was not able to generate more revenue.

The company emphasized “discovery searching.” Vivisimo later imitated Siderean’s user input feature. The idea is that if a document required an additional key word, the system accepted the user input and added the term to the index. Siderean was one of the first search vendors to suggest that “graph search” or relationships would allow users to pinpoint content processed by the system. In the 2006-2007 period, Siderean indexed Oracle text content as a demonstration. (At the time, Oracle had the original Artificial Linguistics’ technology, the Oracle Text function, Triple Hop, and PL/SQL queries. Not surprisingly, Oracle did not show the search acquisition appetite the company demonstrated a few years later when Oracle bought Endeca’s ageing technology, the RightNow Netherlands-originated technology, or the shotgun marriage search vendor InQuira.)
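
For readers who never watched a system “munch on triples,” here is a tiny sketch of the idea using the open source rdflib library. The subjects, predicates, and data are invented for illustration; Siderean’s actual Seamark internals were, of course, proprietary:

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")

g = Graph()
# Each fact is a (subject, predicate, object) triple.
g.add((EX.report42, EX.author, EX.jane_doe))
g.add((EX.report42, EX.topic, Literal("faceted navigation")))
g.add((EX.jane_doe, EX.worksFor, Literal("Acme Research")))

# A graph query: find documents and the employers of their authors.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?doc ?employer WHERE {
        ?doc ex:author ?person .
        ?person ex:worksFor ?employer .
    }
""")
for doc, employer in results:
    print(doc, employer)
```

The relationship or “graph search” angle is, at bottom, this kind of join across triples, dressed up with a visualization layer.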

I also invested some time on behalf of the client in the semantic inventions of Dr. Ramanathan Guha. This work was summarized in Google Version 2.0, now out of print. Love those print publishers, folks.

Dr. Guha applied the features of the Semantic Web to plumbing which, if fully implemented, would have allowed Google to build a universal database of knowledge, serve up snippets from a special semantic server, and perform a number of useful functions. This work was done by Dr. Guha when he was at IBM Almaden and at Google. My analysis of Dr. Guha’s work suggests that Google has more semantic plumbing than most observers of the search giant notice. The reason, I concluded, was that semantic technology works behind the scenes. Dragging the user into OWL, RDF, and other semantic nuances does not pay off as well as embedding certain semantic functions behind the scenes.

In the “Understanding Semantic Search” write up, I learned that my understanding of semantic search is pretty much a wild and crazy collection of half truths. Let me illustrate what the article presents as the “understanding” function for addled geese like me.

  • Searches have a context
  • Results can be local or national
  • Entities are important; for example, the White House is different from a white house

So far, none of this helps me understand semantic search as embodied in the W3C standards, in the implementations of companies like Siderean, or in the Google-Guha patent documents from 2007 forward.

The write up makes a leap from context to the question, “Are key words still important?”

From that question, the article informs me that I need to utilize schema markup. These are additional bits of code behind the page which provide information to crawlers and other software about the content which the user sees on a rendering device.
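
For the curious, “schema markup” in practice usually means a block of schema.org JSON-LD embedded in the page. Here is a minimal sketch, generated with Python for convenience; the headline, author, and date values are invented for the example:

```python
import json

# A minimal schema.org description of an article (values are illustrative).
article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Understanding Semantic Search",
    "author": {"@type": "Person", "name": "Jane Doe"},
    "datePublished": "2015-06-10",
}

# The crawler-facing snippet a page would carry; the reader never sees it.
print('<script type="application/ld+json">\n%s\n</script>'
      % json.dumps(article, indent=2))
```

That is the whole trick: the page carries a machine readable summary the crawler can parse without guessing.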

And that’s it.

So let’s recap. I learned that context is important via illustrations which show Google using different methods to localize or personalize content. The write up does not enumerate the different methods which use browser histories, geolocation, and other signals. The write up then urges me to use additional mark up.

I think I will stick with my understanding of semantics. My work with Siderean and my research for an investment bank provided a richer base of knowledge about the real world applications of semantic technology. Technology, I wish to point out, which can be computationally demanding unless one has sufficient resources to perform the work.

What is happening in this “Understanding Semantic Search” article is an attempt to generate business for search engine optimization experts. Key word stuffing and doorway pages no longer work very well. In fact, SEO itself is a problem because it undermines precision and recall. Spoofing relevance is not my idea of a useful activity.

For those looking to semantics to deliver Google traffic, you might want to invest the time and effort in creating content which pulls users to you.

Stephen E Arnold, June 9, 2015

Semantic Search Hoohah: Hakia

June 8, 2015

My Overflight system snagged an updated post to an article written in 2006 about Hakia. Hakia, as you may know, was a semantic search system. I ran an interview with Riza C. Berkan in 2008. You can find that Search Wizards Speak interview here.

Hakia went quiet months ago. The author of “Hakia Is a True Semantic Search Engine” posted a sentence that said: “Hakia, unfortunately, failed and went out of business.”

I reviewed that nine year old article this morning and highlighted several passages. These are important because these snippets illustrate how easy it is to create a word picture which does not match reality. Search engine developers find describing their vision a heck of a lot easier than converting talk into sustainable revenues.

Let’s run down three of the passages proudly displaying my blue highlighter’s circles and arrows. The red text is from the article describing Hakia and the blue text is what the founder of Hakia said in the Search Wizards Speak interview.

Passage One

So a semantic search engine doesn’t have to address every word in the English language, and in fact it may be able to get by with a very small core set of words. Let’s say that we want to create our own semantic search engine. We can probably index most mainstream (non-professional) documents with 20-30,000 words. There will be a few gaps here and there, but we can tolerate those gaps. But still, the task of computing relevance for millions, perhaps billions of documents that use 30,000 words is horrendously monumental. If we’re going to base our relevance scoring on semantic analysis, we need to reduce the word-set as much as possible.

This passage is less about Hakia and more about the author’s perception of semantic search. Does this explanation resonate with you? For me, many semantic methods are computationally burdensome. As a result the systems are often sluggish and unable to keep pace with updated content and new content.

Here’s how Dr. Riza C. Berkan, a nuclear engineer and math whiz, explained semantics:

With semantic search, this is not a problem. We will extract everything of the 500 words that is relevant content. That is why Google has a credibility problem. Google cannot guarantee credibility because its system relies on link statistics. Semantic methods do not rely on links. Semantic methods use the content itself. For example, hakia QDexed approximately 10 million PubMed pages. If there are 100 million questions, hakia will bring you to the correct PubMed page 99 percent of the time, whereas other engines will bring you perhaps 25 percent of the time, depending on the level of available statistics. For certain things, the big players do not like awareness. Google has never made, and probably never will make, credibility important. You can do advanced search and do “site: sitename” but that is too hard for the user; less than 0.5% of users ever use advanced search features.

Passage Two

What I believe the founders of Hakia have done is borrow the concept of Lambda Calculus from compiler theory to speed the process of reducing elements on pages to their conceptual foundations. That is, if we assume everyone writes like me, then most documents can be reduced to a much smaller subset of place-holders that accurately convey the meaning of all the words we use.

Okay, but in my Search Wizards Speak interview, the founder of Hakia said:

We can analyze 70 average pages per second per server. Scaling: The beauty of QDexing is that QDexing grows with new knowledge and sequences, but not with new documents. If I have one page, two pages or 1,000 pages of the OJ Simpson trial, they are all talking about the same thing, and thus I need to store very little of it. The more pages that come, the more the quality of the results increase, but only with new information is the amount of QDex stored information increased. At the beginning, we have a huge steep curve, but then, processing and storage are fairly low cost. The biggest cost is the storage, as we have many many QDex files, but these are tiny two to three Kb files. Right now, we are going through news, and we are showing a seven to 10 minute lag for fully QDexing news.

No reference to a type of calculus that thrills Googlers. In fact, a review of the patent shows that well known methods are combined in what appears to be an interesting way.

Passage Three

Documents can still pass value by reference in a semantic index, but the mechanics of reference work differently. You have more options, so less sophisticated writers who don’t embed links in their text can have just as much impact on another document’s importance as professional search optimization copywriters. Paid links may not become a thing of the past very quickly, but you can bet your blog posts that buying references is going to be more sophisticated if this technology takes off. That is what is so exciting about Hakia. They haven’t just figured out a way to produce a truly semantic search engine. They have just cut through a lot of the garbage (at a theoretical level) that permeates the Web. Google AdSense arbitragers who rely on scraping other documents to create content will eventually find their cash cows drying up. The semantic index will tell Hakia where the original content came from more often than not.

Here’s what the founder says in the Search Wizards Speak interview:

With semantic search, this is not a problem. We will extract everything of the 500 words that is relevant content. That is why Google has a credibility problem. Google cannot guarantee credibility because its system relies on link statistics. Semantic methods do not rely on links. Semantic methods use the content itself. For example, hakia QDexed approximately 10 million PubMed pages. If there are 100 million questions, hakia will bring you to the correct PubMed page 99 percent of the time, whereas other engines will bring you perhaps 25 percent of the time, depending on the level of available statistics. For certain things, the big players do not like awareness. Google has never made, and probably never will make, credibility important. You can do advanced search and do “site: sitename” but that is too hard for the user; less than 0.5% of users ever use advanced search features.

The key fact is that Hakia failed. The company tried to get traction with health and medical information. The vocabulary for scientific, technical, and medical content is less poetic than the writing in business articles and general blog posts. Nevertheless, the customers and users did not bite.

Notice that neither the author of the article nor the founder comes to grips with the specific systems and methods used by Hakia. The write up “sounds good” but lacks substance. The founder’s explanation reveals his confidence in what “should be,” not what was and is.

My point: Writing about search is difficult. Founders see the world one way; those writing about search interpret the descriptions in terms of their knowledge.

Where can one get accurate, objective information about search? The options are limited and have been for decades. Little wonder that search remains a baffler to many people.

Stephen E Arnold, June 8, 2015

The Semantic Blenders: Not Consumable by Most

June 7, 2015

I read “Schema Markup and Microformatting Is Only the First Step in your Semantic Search Strategy.”

Okay, schema markup and microformatting. These are, according to the headline, one thing.

I am probably off base here in Harrod’s Creek, but I thought:

  1. Schema markup. Google’s explanation is designed to help out the GOOG, not the user. The methods of Guha and Halevy have proven difficult to implement. The result is a Googley move: Have the developers insert data into Web pages. Easy. Big benefit for Google too.
  2. Microformatting. A decade old effort to add additional information to a Web page. You can find examples galore at http://microformats.org/. (A minimal example of each approach appears after this list.)
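
To show why I count two processes rather than one, here is a minimal, hypothetical example of each. The class names follow the microformats2 h-card convention and the second block follows schema.org JSON-LD; the organization details are borrowed from this blog purely for illustration:

```python
import json

# 1. Microformatting: class names woven into the visible HTML (h-card style).
microformat_html = """\
<div class="h-card">
  <span class="p-name">Beyond Search</span>
  <a class="u-url" href="http://www.xenky.com/cyberosint">CyberOSINT</a>
</div>"""

# 2. Schema markup: a separate JSON-LD block the reader never sees.
schema_block = '<script type="application/ld+json">\n%s\n</script>' % json.dumps(
    {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": "Beyond Search",
        "url": "http://www.xenky.com/cyberosint",
    },
    indent=2,
)

print(microformat_html)
print(schema_block)
```

Same content, two different tagging mechanisms, which is why lumping them together as one thing strikes me as sloppy.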

I am not very good at math, but it sure seems to me that these are two different processes.

But the burr under my blanket is that one cannot apply anything unless there is something written or displayed on a Web page. Therefore, these two additions to a Web page’s code cannot be the first thing. Tagging can occur after something has been written, or at the same time if the writing is done with a smart input system.

The fact that these rather squishy logical mistakes occur in the headline did not rev my engine as I worked through the 1,800 words in the article. The assumption in the write up is that a reader wants to create an ecommerce site which garners a top Google result. The idea is that one types in a key word like “cyberosint” and the first hit in the result list points to the ecommerce page.

The hitch in the git along is that more queries are arriving from mobile devices. The consequence of this is that the mobile system will be filtering content and displaying information which the system calculates as important to the user.

I don’t want to rain on the semanticists’ parade, nor do I want to point out that search engine optimization is pretty much an undrinkable concoction of buzz words, jargon, and desperation.

Here’s one of the passages in the write up that I marked, inking a blue exclamation point in the margin of my print out:

Within Search Engine Optimization, many businesses focus on keywords, phrases, and search density as a way of sending clues to search engines that they should be known for those things. But let’s look at it from the human side: how can we make sure that our End User makes those associations? How can we build that Brand Association and Topical Relevance to a human being? By focusing our content strategy and providing quality content curation.

Well, SEO folks, I am not too keen on brand associations and I am not sure I want to be relevant to another human being. Whether I agree or not, the fix is not to perform these functions:

  • Bridges of association
  • Social listening
  • Quality reputation (a method used, I might add, on the Dark Web)

This smoothie is a mess.

There are steps a person with a Web page can take to communicate with spiders and human readers. I am not sure the effort, cost, and additional page fluff are going to work.

Perhaps the semanticists should produce something other than froth? To help Google, write and present information which is clear, concise, and consistent just like in junior high school English class.

Stephen E Arnold, June 7, 2015

Semantic Search Failure Rate: 50% and There Is Another Watson Search System

June 1, 2015

The challenge of creating a semantic search system is a mini Mt. Everest during an avalanche. One of the highest profile semantic search systems was Siderean Software. The company quietly went dark several years ago. I thought about Siderean when I followed up on a suggestion made by one of the stalwarts who read Beyond Search.

That reader sent me a link to a list of search systems. The list appeared on AI3. I could not determine when the list was compiled. To check the sticking power of the companies/organizations on the list, we looked up each vendor.

The results were interesting. Half of the listed companies were no longer in the search business.

Here’s the full list and the Beyond Search researcher’s annotations:

  • Antidot Finder Suite: Commercial vendor
  • BAAGZ: Not available
  • Beagle++: Not available
  • BuddyFinder (CORDER): Search buddyspace and Jabber
  • CognitionSearch: Emphasis on monitoring
  • ConWeaver: Customer support
  • DOAPspace: Search not a focus of the site
  • EntityCube: Displays a page with a handful of ideographs
  • Falcons: Search system from Nanjing University
  • Ferret: Open source search library
  • Flamenco: A Marti Hearst search interface framework
  • HyperTwitter: Does not search current Twitter stream
  • LARQ: Redirects to Apache Jena, an open source Java framework for building Semantic Web and Linked Data applications
  • Lucene: Apache Lucene Core
  • Lucene-skos: Deprecated; points visitor to Lucene
  • LuMriX: Medical search
  • Lupedia: 404 error
  • OntoFrame: Redirect due to 404 error
  • Ontogator: Link to generic view based RDF search engine
  • OntoSearch: 404 error
  • Opossum: Page content not related to search
  • Picky: Search engine in Ruby script
  • Searchy: A metasearch engine performing a semantic translation into RDF; page updated in 2006
  • Semantic Search: 404
  • Semplore: 404
  • SemSearch: Keyword based semantic search; link points to defunct Google Code service
  • Sindice: 404
  • SIREn: 404
  • SnakeT: Page renders; service 404s
  • Swangler: Displays SemWebCentral.org; last update 2005
  • Swoogle: Search over 10,000 ontologies
  • SWSE: 404
  • TrueKnowledge: 404
  • Watson: Not IBM; searches semantic documents
  • Zebra: General purpose open source structured text indexing and retrieval engine
  • ZoomInfo: Commercial people search system

The most interesting entry in the list is the Watson system which seems to be operating as part of an educational institution.

Here’s what the Open.ac.uk Watson looks like:

[Screenshot: the Open.ac.uk Watson semantic search interface]

IBM’s attorneys may want to see who owns what rights to the name “Watson.” Were IBM not busy working on a Watson cookbook, this errant Watson might have been investigated, eh, Sherlock?

Stephen E Arnold, June 1, 2015
