The Semantic Web and JSON LD: Some Irritation Perhaps?

July 30, 2015

I read the Wikipedia article about JSON-LD, or JavaScript Object Notation for Linked Data, when I was pondering the fate of the XML centric start ups like MarkLogic. I highlighted one sentence in the Wikipedia write up, which is subject to the usual caveats about bias, incorrect information, etc. That sentence was:

JSON-LD is designed around the concept of a “context” to provide additional mappings from JSON to an RDF model.

Yes, the much loved RDF model.
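
For readers who have not seen a “context” in action, here is a minimal sketch of the idea in the highlighted sentence: the context maps plain JSON keys to IRIs, and the object can then be read as RDF-style triples. The document, the example.org URLs, and the choice of FOAF terms are assumptions made for illustration, and `to_triples` is a hypothetical helper that handles only a flat object. A real JSON-LD processor does far more (nesting, lists, language tags, remote contexts).

```python
import json

# Hypothetical document: the example.org URLs and the FOAF-based context are
# illustrative assumptions, not taken from the Wikipedia article.
DOC = json.loads("""
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "homepage": {"@id": "http://xmlns.com/foaf/0.1/homepage", "@type": "@id"}
  },
  "@id": "http://example.org/people/alice",
  "name": "Alice",
  "homepage": "http://example.org/alice/"
}
""")


def to_triples(doc):
    """Map a flat JSON-LD-style object to (subject, predicate, object) triples.

    Simplified on purpose: only a flat object with string values and
    simple term definitions is supported.
    """
    context = doc["@context"]
    subject = doc["@id"]
    triples = []
    for key, value in doc.items():
        if key.startswith("@"):
            continue  # keywords such as @context and @id are not properties
        term = context[key]
        if isinstance(term, dict):
            # Expanded term definition: may say the value is itself an IRI.
            predicate = term["@id"]
            is_iri = term.get("@type") == "@id"
        else:
            predicate = term
            is_iri = False
        obj = f"<{value}>" if is_iri else json.dumps(value)
        triples.append((f"<{subject}>", f"<{predicate}>", obj))
    return triples


for s, p, o in to_triples(DOC):
    print(s, p, o, ".")
```

The JSON stays readable for a developer who never wants to hear the letters RDF, while the context quietly supplies the mapping for those who do.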

When I read “JSON-LD and Why I Hate the Semantic Web,” I noticed a bit of friskiness in the word choice; for example, misguided souls, cryptic, complicated, market share, “kick RDF in the nuts,” and similar rhetorical arabesques. I do like the active verb “kick” however.

The passage I highlighted with my bright orange marker was this one:

The problem with getting a room full of smart people together is that the group’s world view gets skewed. There are many reasons that a working group filled with experts don’t consistently produce great results. For example, many of the participants can be humble about their knowledge so they tend to think that a good chunk of the people that will be using their technology will be just as enlightened. Bad feature ideas can be argued for months and rationalized because smart people, lacking any sort of compelling real world data, are great at debating and rationalizing bad decisions.

Seems normal to me.

In my opinion, this write up explains why some XML centric, Semantic Web cheerleaders have labored to generate organic growth. Just a thought. Talking to fellow travelers is reassuring and comfortable. Those not on the cruise ship may have a different point of view.

Stephen E Arnold, July 30, 2015

Italian Firm Delivers Semantic API to Wall Street

July 22, 2015

Short honk: There are quite a few high technology firms chasing the deep pockets on Wall Street and in the City. Some, like Digital Reasoning, have teamed with larger players to capture customers. Others, like Connotate, have relied on their stakeholders to open doors. Many companies attended financial technology showcases to demonstrate the power of their intelligent systems; for example, Digital Shadows. Some companies like Terbium Labs show up and demonstrate how their advanced technology reduces risk and improves financial performance.

Expert System is approaching the market with what it calls the “first semantic API”. The idea is that money folks can create cognitive computing systems. You can read about the system at this link.

Expert System is betting that this is true. The news release quotes Luca Scagliarini, CEO, as saying:

Intelligent solutions for strategic information management are absolutely critical in today’s big data world, and no where is this more critical than in the financial services industry where inaccurate or incomplete data can lead to fatal decisions. With Cogito API Finance, we are filling a big gap and tremendous need for customized knowledge management solutions in the financial industry.

Expert System is a publicly traded company (EXSY:MI) so the payoff from this cognitive push should be evident in the firm’s next financial report.

Today shares are trading at 2.12, up 0.02 or 0.76 percent. BAE Systems, a company with its NetReveal / Detica technologies which are in use in a number of financial applications, is trading at 29.35. There is market headroom available.

Stephen E Arnold, July 22, 2015

On Embedding Valuable Outside Links

July 21, 2015

If media websites take this suggestion from an article at Monday Note, titled “How Linking to Knowledge Could Boost News Media,” there will be no need to search; we’ll just follow the yellow brick links. Writer Frederic Filloux laments the current state of affairs, wherein websites mostly link to internal content, and describes how embedded links could be much, much more valuable. He describes:

“Now picture this: A hypothetical big-issue story about GE’s strategic climate change thinking, published in the Wall Street Journal, the FT, or in The Atlantic, suddenly opens to a vast web of knowledge. The text (along with graphics, videos, etc.) provided by the news media staff, is amplified by access to three books on global warming, two Ted Talks, several databases containing references to places and people mentioned in the story, an academic paper from Knowledge@Wharton, a MOOC from Coursera, a survey from a Scandinavian research institute, a National Geographic documentary, etc. Since (supposedly), all of the above is semanticized and speaks the same lingua franca as the original journalistic content, the process is largely automatized.”

Filloux posits that such a trend would be valuable not only for today’s Web surfers, but also for future historians and researchers. He cites recent work by a couple of French scholars, Fabian Suchanek and Nicoleta Preda, who have been looking into what they call “Semantic Culturonomics,” defined as “a paradigm that uses semantic knowledge bases in order to give meaning to textual corpora such as news and social media.” Web media that keeps this paradigm in mind will wildly surpass newspapers in the role of contemporary historical documentation, because good outside links will greatly enrich the content.

Before this vision becomes reality, though, media websites must be convinced that linking to valuable content outside their site is worth the risk that users will wander away. The write-up insists that a reputation for providing valuable outside links will more than make up for any amount of such drifting visitors. We’ll see whether media sites agree.

Cynthia Murrell, July 21, 2015

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Want To Know What A Semantic Ecosystem Is?

July 8, 2015

Do you want to know what a semantic ecosystem is? The answer is available from TopQuadrant in its article, “Semantic Ecosystem-What’s That About?”  According to the article, a semantic ecosystem lets an organization discover patterns, show the relationships between and within data sources, add meaning to raw data artifacts, and dynamically bring information together.

In short, it shows how data and its sources connect with each other and extracts relationships from them.

What follows the brief explanation about what a semantic ecosystem can do is a paragraph about the importance of data, how it takes many forms, etc., etc.  Trust me, you have heard it before. It then makes a comparison with a natural ecosystem, i.e., the ones found in nature.

The article continues with this piece:

“As in natural ecosystems, we believe that success in business is based on capability – and the ability to adapt and evolve new capabilities. Semantic ecosystems transform existing diverse information into valuable semantic assets. Key characteristics of a semantic ecosystem are that it is adaptable and evolvable. You can start small – with one or more key business solutions and a few data sources – and the semantic foundation can grow and evolve with you.”

It turns out a semantic ecosystem is just another name for information management.  TopQuadrant coined the term to associate with their products and services.  Talk about fancy business jargon, but TopQuadrant makes a point about having an information system work so well that it seems natural.  When a system works naturally, it is able to intuit needs, interpret patterns, and make educated correlations between data.

Whitney Grace, July 8, 2015

Semantic Search and Challenging Patent Document Content Domains

July 7, 2015

Over the years, I have bumped into some challenging content domains. One of the most difficult was the collection of mathematical papers organized with the Dienst architecture. Another was a collection of blog posts from African bulletin board systems in a number of different languages, peppered with insider jargon. I also recall my jousts with patent documents for some pretty savvy outfits.

The processing of each of these corpuses and making them searchable by a regular human being remains an unsolved problem. Progress has been slow, and the focus of many innovators has been on workarounds. The challenge of each corpus remains a high hurdle, and in my opinion, no search sprinter is able to make it over the race course without catching a toe and plunging head first into the Multi-layer SB Resin covered surface.

I read “Why Is Semantic Search So Important for Patent Searching?” My answer was and remains, “Because vendors will grab at any buzzy concept in the hopes of capturing a share of the patent research market?”

The write up takes a different approach, an approach which I find interesting and somewhat misleading.

The write up states that there are two ways to search for information: navigational search, sort of like Endeca I assume, and research search, which is the old fashioned Boolean logic which I really like.

The article points out that keyword search sucks if the person looking for information does not know the exact term. That’s why I used the reference to Dienst. I wanted to provide an example which requires precise knowledge of terminology. That’s a challenge and it requires specialized knowledge from a person who recognizes that he or she may not know the exact terminology required to locate the needed information. Try the Dienst query. Navigate to a whizzy new search engine like www.unbubble.eu and plug away. How is that working out for you? And don’t cheat. You can’t use the term Dienst.

If you run the query on a point and click Web search system like Qwant.com, you cannot locate the term without running a keyword search.
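
The exact-term problem is easy to reproduce at toy scale. What follows is a minimal sketch of old fashioned Boolean retrieval over an inverted index. The three document titles are made up for illustration, and `build_index`, `boolean_and`, and `boolean_or` are hypothetical helper names, not functions from any search product mentioned here.

```python
from collections import defaultdict

# Made-up mini corpus: the titles are illustrative, not real records.
DOCS = {
    1: "Dienst architecture for distributed technical report libraries",
    2: "A protocol for a distributed digital library of computer science papers",
    3: "Patent claims for a distributed search system",
}


def build_index(docs):
    """Inverted index: term -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, *terms):
    """Ids of documents containing every term (exact match only)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, *terms):
    """Ids of documents containing at least one of the terms."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.union(*postings) if postings else set()


index = build_index(DOCS)
print(boolean_and(index, "dienst"))                  # works only if you know the word
print(boolean_and(index, "distributed", "library"))  # misses the doc that says "libraries"
print(boolean_or(index, "library", "libraries"))     # the searcher supplies the variants
```

The index does exactly what it is told. The searcher who knows the term, and its variants, gets the document. Everyone else gets a lesson in terminology.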

The problems in patents, whether indexed with value added metadata, humans laboring in a warehouse, or with semantic methods are:

  1. Patent documents exist in versions and each document drags along assorted forms which may or may not be findable. Trips to the USPTO with hat in hand and a note from a senator often do not work. Fancy Dan patent attorneys fall back on the good old method of hunting using intermediaries. Not pretty, not easy, not cheap, and not foolproof. The versions and assorted attachments are often unfindable. (There are sometimes interesting reasons for this kettle of fish and the fish within it.) I don’t have a solution to the chains of documents and the versions of patent documents. Sigh.
  2. Patents include art. Usually the novice reacts negatively to lousy screenshots, clunky drawings, and equations which make it tough to figure out what a superscript character is. Keywords and pointing and clicking, metaphors, razzle dazzle search systems, and buzzword charged solutions from outfits like Thomson Reuters and Lexis are just tools, stone tools chiseled by some folks who want to get paid. I don’t have a good solution to the arts and crafts aspect of patent documents. Sigh sigh.
  3. Patent documents are written at a level of generalization, with jargon, Latinate constructs, and assertions that usually give me a headache. Who signed up to read lots of really bad poetry? Working through the Old Norse version of Heimskringla is a walk in the park compared to figuring out what some patents “mean.” I spent a number of years indexing 15th century Latin sermons. At least in that corpus, the common knowledge base was social and political events and assorted religious material. Patents can be all over the known knowledge universe. I don’t know of a patent processing system which can make this weird prose-poetry understandable if there is litigation or findable if there is a need to figure out if someone cooked up the same system and method before the document in question was crafted. Sigh sigh sigh.
  4. None of the systems I have used over the past 40 years does a bang up job of identifying prior art in scientific, technical or medical journal articles, blog posts, trade publications, or Facebook posts by a socially aware astrophysicist working for a social media company. Finding antecedents is a great deal of work. Has been and will be in my opinion. Sigh sigh sigh sigh. But the patent attorneys cry, “Hooray. We get to bill time.”

The write up presents some of those top brass magnets: Snappy visualizations. The idea is that a nifty diagram will address the four problems I identified in the preceding paragraphs. Visualizations may be able to provide some useful way to conceptualize where a particular patent document falls in a cluster of correctly processed patent documents. But an image does not deliver the mental equivalent of a NOW Foods Whey Protein Isolate.

Net net: Pitching semantic search as a solution to the challenges of patent information access is a ball. Strikes in patent searching are not easily obtained unless you pay expert patent attorneys and their human assets to do the job. Just bring your checkbook.

Stephen E Arnold, July 7, 2015

Need Semantic Search: Lucidworks Asserts It Is the Answer by Golly

July 3, 2015

If you read this blog, you know that I comment on semantic technology every month or so. In June I pointed to an article which had been tweeted as “new stuff.” Wrong. Navigate to “Semantic Search Hoohah: Hakia”; you will learn that Hakia is a quiet outfit. Quiet as in no longer on the Web. Maybe gone?

There are other write ups in my free and for fee columns about semantic search. The theme has been consistent. My view is that semantic technology is one component in a modern cybernized system. (To learn about my use of the term cyber, navigate to www.xenky.com/cyberosint.)

I find the promotion of search engine optimization as “semantic” amusing. I find the search service firms’ promotion of their semantic expertise amusing. I find the notion of open source outfits deep in hock to venture capitalists asserting their semantic wizardry amusing.

I don’t know if you are quite as amused as I am. Here’s an easy way to determine your semantic humor score. Navigate to this slideshare link and cruise through the 34 slide deck presented by one of Lucidworks’ search mavens. Lucidworks is a company I have followed since it fired up its jets with Marc Krellenstein on board. Dr. Krellenstein ejected in short order, and the company has consumed many venture dollars with management shifts, repositionings, and the Big Data thing.

We now have Lucidworks in the semantic search sector.

Here’s what I learned from the deck:

  1. The company has a new logo. I think this is the third or fourth.
  2. Search is about technology and language. Without Google’s predictive and personalized routines, words are indeed necessary.
  3. Buzzwords and jargon do not make semantic methods simple. Consider this statement from the deck, “Tokenization plus vector mathematics (TF/IDF) or one of its cousins—“bag of words” – Algorithmic tweaks – enhanced bag of words.” Got that, gentle reader? If not, check out “sausagization.”
  4. Lucidworks offers a “field cache.” Okay, I am not unfamiliar with caching in order to goose performance, which can be an issue with some open source search systems. (But Searchdaimon, an open source search system developed in Norway, runs circles around Lucidworks. My team did the benchmark test of major open source systems. Searchdaimon was the speed champ and had other sector leading characteristics as well.)
  5. Lucidworks does the ontology thing as well. The tie up of “category nodes” and “evidence nodes” may be one reason the performance goblin noses into the story.
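
For the gentle reader who did not get the deck’s “tokenization plus vector mathematics (TF/IDF)” line in item 3, here is a minimal sketch of a bag of words with TF/IDF weights and a cosine comparison. It is not Lucidworks’ implementation. The toy corpus is made up, the whitespace tokenizer is the crudest possible choice, and the unsmoothed weighting (term frequency times log(N/df)) is just one of several common variants.

```python
import math
from collections import Counter

# Toy corpus: the three "documents" are made up for illustration.
DOCS = [
    "semantic search for patent documents",
    "keyword search for patent attorneys",
    "open source search as a utility",
]


def tokenize(text):
    """Tokenization at its crudest: lowercase and split on whitespace."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (a weighted bag of words)."""
    bags = [Counter(tokenize(d)) for d in docs]
    n = len(bags)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for bag in bags for term in bag)
    vectors = []
    for bag in bags:
        total = sum(bag.values())
        vectors.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in bag.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


vecs = tfidf_vectors(DOCS)
print(cosine(vecs[0], vecs[1]))  # the two patent documents share weighted terms
print(cosine(vecs[0], vecs[2]))  # only "search" is shared, and its weight is zero
```

Note that “search” appears in every document, so its weight drops to zero. The math counts words and does the arithmetic. Whether that amounts to “semantic” is the reader’s call.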

The problem I encountered is that the write up for the slide deck emphasized Fusion as a key component. I have been poking around the “fusion” notion as we put our new study of the Dark Web together. Fusion is a tricky problem and the US government has made fusion a priority. Keep in mind that content is more than text. There are images, videos, geocodes, cryptic tweets in Farsi, and quite a few challenging issues with making content available to a researcher or analyst.

It seems that Lucidworks has cracked a problem which continues to trouble some reasonably sophisticated folks in the content analysis business. Here’s the “evidence” that Lucidworks can do what others cannot:

image

This diagram shows that after a connector is available, then “pipelines proliferate.” Well, okay.

I thought the goal was to process content objects with low latency, easily, and with semantic value adds. “Lots of stages” and “index pipelines: one way query pipelines: round trip” does not compute for this addled goose.

If the Lucidworks approach makes sense to you go for it. My team and I will stick to here and now tools and open source technology which works without the semantic jargon which is pretty much incidental to the matter. We need to process more than text. CyberOSINT vendors deliver and most use open source search as a utility function. Yep, utility. Not the main event. The failure of semantic search vendors suggests that the buzzword is not the solution to marketing woes. Pop. (That’s a pre fourth of July celebratory ladyfinger.)

Stephen E Arnold, July 3, 2015

Old Wine: Semantic Search from the Enlightenment

June 24, 2015

I read a weird disclaimer. Here it is:

This is an archived version of Pandia’s original article “Top 5 Semantic Search Engines”, we made it available to the users mainly because it is still among the most sought articles from old site. You can also check kids, radio search, news, people finder and q-cards sections.

An article from the defunct search newsletter Pandia surfaced in a news aggregation list. Pandia published one of my books, but at the moment I cannot remember which of my studies.

The write up identifies “semantic search engines.” Here’s the list with my status update in bold face:

  • Hakia. Out of business.
  • SenseBot. Out of business.
  • Powerset. Bought by Microsoft. Fate unknown in the new Delve/Bing world.
  • DeepDyve. Talk about semantics but the system is a variation of the Dialog/BRS for fee search model from the late 1970s.
  • Cognition (Cognition Technologies). May be a unit of Nuance?

What’s the score?

Two failures. Two sales to another company. One survivor which has an old school business model. My take? Zero significant impact on information retrieval.

Feel free to disagree, but the promise of semantic search seems to pivot on finding a buyer and surviving by selling online research. Why so much semantic cheerleading? Beats me. Semantic methods are useful in the plumbing as a component of a richer, more robust system. Most cyberOSINT systems follow this path. Users don’t care too much about plumbing in my experience.

Stephen E Arnold, June 24, 2015

Expert System Acquires TEMIS

June 22, 2015

In a move to improve its product offerings, Expert System acquired TEMIS.  The two companies will combine their assets to create a leading semantic provider for cognitive computing.  Reuters described the acquisition in sparse detail: “Expert System Signs Agreement To Acquire French TEMIS SA.”

Reuters describes the merger as:

“Reported on Wednesday that it [Expert System] signed binding agreement to buy 100 percent of TEMIS SA, a French company offering solutions in text analytics

  • Deal value is 12 million euros ($13.13 million)”

TEMIS creates technology that helps organizations leverage, manage, and structure their unstructured information assets.  It is best known for Luxid, which identifies and extracts information to semantically enrich content with domain-specific metadata.

Expert System, on the other hand, is another semantically inclined company, and its flagship product is Cogito.  The Cogito software is designed to understand content within unstructured text, systems, and analytics.  The goal is to give organizations a complete picture of their information, because Cogito actually understands what it is processing.

TEMIS and Expert System have similar goals to make unstructured data useful to organizations.  Other than the actual acquisition deal, details on how Expert System plans to use TEMIS have not been revealed.  Expert System, of course, plans to use TEMIS to improve its own semantic technology and increase revenue.  Both companies are pleased with the acquisition, but if you consider other buyouts in recent times, the cost to Expert System is very modest.  Thirteen million dollars is low next to the valuations of other text analysis companies, which would definitely cost more than TEMIS.

Whitney Grace, June 22, 2015

Sprylogics Repositioned to Mobile Search

June 20, 2015

I learned about Cluuz.com in a briefing in a gray building in a gray room with gray carpeting. The person yapping explained how i2 Ltd.-type relationship analysis was influencing certain intelligence-centric software. I jotted down some URLs the speaker mentioned.

When I returned to my office, I checked out the URLs. I found the Cluuz.com service interesting. The system allowed me to run a query and review results with inline extracts and relationship visualizations among entities. In that 2007 version of Cluuz.com’s system, I found the presentation, the inclusion of emails, phone numbers, and parent child relationships quite useful. The demonstration used queries passed against Web indexes. Technically, Cluuz.com belonged to the category of search systems which I call “metasearch” engines. The Googles and Yahoos index the Web; Cluuz.com added value. Nifty.

I chased down Alex Zivkovic, the individual then identified as the chief technical professional at Sprylogics. You can read my 2008 interview with Zivkovic in my Search Wizards Speak collection. The Cluuz.com system originated with a former military professional’s vision for information analysis. According to Zivkovic, the prime mover for Cluuz.com was Avi Shachar. At the time of the interview, the company focused on enterprise customers.

Zivkovic told me in 2008:

We have clustering. We have entity extraction. We have a relational ship analysis in a graph format. I want to point out that for enterprise applications, the Cluuz.com functions are significantly more rich. For example, a query can be run across internal content and external content. The user sees that the internal information is useful but not exactly on point. Our graph technology makes it easy for the user to spot useful information from an external source such as the Web in conjunction with the internal information. With a single click, the user can be looking into those information objects. We think we have come up with a very useful way to allow an organization to give its professionals an efficient way to search for content that is behind the firewall and on the Web. The main point, however, is that user does not have to be trained. Our graphical interface makes it obvious what information is available from which source. Instead of formulating complex queries, the person doing the search can scan, click, and browse. Trips back to the search box are options, not mandatory.

I visited the Sprylogics.com Web site the other day and learned that the Cluuz.com-type technology has been repackaged as a mobile search solution and real time sports application.

There is a very good explanation of the company’s use of its technology in a more consumer friendly presentation. You can find that presentation at this link, but the material can be removed at any time, so don’t blame me if the link is dead when you try to review the explanation of the 2015 version of Sprylogics.

From my point of view, the Sprylogics repositioning is an excellent example of how technology designed for intelligence professionals can be packaged into a consumer application. The firm has more than a dozen patents, a number which some search and content processing companies cannot match. The semantic functions and the system’s ability to process Web content in near real time make the firm’s Poynt product interesting to me.

Sprylogics’ approach, in my opinion, is a far more innovative way to leverage advanced content processing capabilities than the approaches taken by most search vendors. It is easier to slap a customer relationship management, customer support, or business intelligence label on what is essentially search and retrieval software than to create a consumer facing app.

Kudos to Sprylogics. The ArnoldIT team hopes their stock, which is listed on the Toronto Stock Exchange, takes wing.

Stephen E Arnold, June 20, 2015
