Apache Tika Could Be the Google of Dark Web?

January 16, 2017

Conventional search engines can effectively index text-based content. However, Apache Tika, an open source toolkit used in the Defense Advanced Research Projects Agency (DARPA) Memex program, can identify and analyze all kinds of content. This might enable law enforcement agencies to track all kinds of illicit activities on the Dark Web and possibly end them.

An article by Christian Mattmann titled “Could This Tool for the Dark Web Fight Human Trafficking and Worse?” that appears on StartupSmart says:

At present the most easily indexed material from the web is text. But as much as 89 to 96 percent of the content on the internet is actually something else – images, video, audio, in all thousands of different kinds of non-textual data types. Further, the vast majority of online content isn’t available in a form that’s easily indexed by electronic archiving systems like Google’s.

Apache Tika, which Mattmann helped develop, bridges the gap by analyzing a file’s metadata to determine its content type and then identifying the file’s content using techniques like Named Entity Recognition (NER). Apache Tika was instrumental in tracking down players in the Panama Papers scandal.
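
To make the idea of content-type detection concrete, here is a minimal Python sketch that guesses a file’s type from its leading “magic” bytes. It is an illustration only and not Tika’s API: the function name sniff_type and the small signature table are assumptions made for this example, and a real detector covers far more formats.

```python
# Hypothetical illustration: guess a content type from leading "magic" bytes.
# This is NOT Apache Tika's API; the table below is a tiny sample.
MAGIC_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),
]


def sniff_type(data: bytes) -> str:
    """Return a MIME type guess for the given bytes."""
    for signature, mime in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return mime
    # Fall back to plain text if the bytes decode cleanly as UTF-8.
    try:
        data.decode("utf-8")
        return "text/plain"
    except UnicodeDecodeError:
        return "application/octet-stream"
```

Once the type is known, a system can route the file to a suitable parser, for example a text extractor for PDFs or an image analyzer for photos.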

If Apache Tika is as capable as claimed, many illicit activities on the Dark Web, like human trafficking and drug and arms peddling, can be stopped in their tracks. As the author points out in the article:

Employing Tika to monitor the deep and dark web continuously could help identify human- and weapons-trafficking situations shortly after the photos are posted online. That could stop a crime from occurring and save lives.

However, the system is not yet sophisticated enough to handle the amount of content that is out there. Because the code is open source, someone may be able to make it capable of doing so in the near future. Until then, the actors of the Dark Web can heave a sigh of relief.

Vishal Ingole, January 16, 2017


RAVN Flaps Amidst a Flurry of Feathers

January 12, 2017

I read “Abraaj Drives Innovation in Private Equity Market with Implementation of RAVN’s Cognitive Search Solution.” The main idea is that RAVN, a vendor of enterprise search, has snagged a customer. That’s good. What’s interesting about the write up is the language of the “news.” Here’s a rundown of the words I highlighted as flaps of the RAVN’s marketing department wings:

  • Access
  • Artificial intelligence and AI
  • Classify
  • Cognitive search
  • Collaborate
  • Component
  • Connect enterprise
  • Data mining
  • Deal flow
  • Differentiation
  • Drive innovation
  • Dynamic decisions
  • Engagement
  • Engine as in “cognitive engine”
  • Experts and expertise
  • Extract
  • Functional knowledge
  • Ground breaking
  • Growth markets organization
  • Highly distributed network
  • Internal and external content
  • Intelligently transforms
  • Interrelationships
  • Knowledge graph
  • Knowledge management
  • Knowledge sources
  • Leverage
  • Lifecycle
  • Monitoring
  • Multi geography
  • Navigate
  • Phases
  • Platform
  • Proprietary
  • Sector knowledge
  • Sectoral
  • Secure
  • Solutions
  • Teams
  • Transformation
  • Unstructured
  • Visualize

What’s left out? No analytics, which is one of the must have functions for a modern search and content processing system. My hunch is that RAVN has numbers in its nest. In the millennial editing frenzy, counting visitors and other useful items was overlooked. Amazing stuff. No wonder some folks roll their eyes when enterprise search vendors trot out keyword search dressed in rhetoric honed by Sophists.

For more lingo which makes search seem more than it is, review the list of cacaphones at this link. Ah, the cacophony of search and retrieval vendors.

Stephen E Arnold, January 12, 2017

Epi-Search and dtSearch

January 11, 2017

I read about Epi-Search in “Epi-Search, A Free Academic Resource, Lets Researchers Enter Up to 10,000 Words of Text and, Using a dtSearch® Engine API, Find “More Like This” Across the ISCE Library.” The idea is a good one. Plug in a research summary or journal article abstract. The system then outputs related documents. The search plumbing is provided by dtSearch, a vendor based in Bethesda, Maryland.
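
A “more like this” feature can be pictured as ranking library documents by how closely their vocabulary matches the submitted text. The Python sketch below uses cosine similarity over raw word counts. That choice is an assumption for illustration; the write up does not say how the dtSearch engine scores similarity, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def more_like_this(query_text: str, library: dict, top_n: int = 10) -> list:
    """Return up to top_n (title, score) pairs, best match first."""
    query = term_counts(query_text)
    scored = [
        (title, cosine_similarity(query, term_counts(body)))
        for title, body in library.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

A production system would typically weight rare terms more heavily and drop common words, which is one reason results from a simple scheme like this can drift away from the topic the searcher had in mind.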

I ran a test using the description of my Cyberosint: Next Generation Information Access monograph on the system at this link. The system returned 10 hits to related documents. Here’s a partial list:


Only the “Penal Populism” was in the same city as the ball park in which I kick around.

The link to Google search results was in the ball park’s parking lot. But the majority of the Google hits point to sites for cyber security, not for the use of open source intelligence to obtain operational intelligence. The Google search grabbed the notion of bad actors compromising network integrity. Important, but not the game I follow. The Google search results returned by the Epi system were PDF files and advertisements.

On a side note, there is a product called Episerver which is offered by a company called Epi. You can get information at this link. Epi’s content management system includes a search engine marketed as Find. Perhaps the name clash can be resolved?

Stephen E Arnold, January 11, 2017

Coveo Search: Reveal Engine

January 11, 2017

Enterprise search is alive and well when it comes to pivots and jargon. One example is Coveo’s adoption of the moniker “Reveal Engine.” The idea is that one can search to find needed information. If the notion is unclear, you can watch the “Coveo Reveal Engine Explainer Video.” The idea is that Coveo’s software puts intelligence everywhere. I love those categorical affirmatives too.

Coveo explains:

Meet Coveo Reveal Engine, the self-learning technology in the Coveo platform that makes intelligent search even smarter. It continuously analyzes your visitors’ click stream data and behavior patterns captured by search usage analytics, then accurately serves up the content that is most likely to drive conversions and ensure self-service success. In other words, Coveo learns which content delivers the best outcomes and then automatically tunes search results to ensure the right content always rises to the top – without any manual effort on your part. Think of it as having a built-in concierge that intuitively knows what your visitors are looking for. It makes intelligent recommendations based on a deep understanding of what has worked best for others. Coveo understands your visitors’ intent. It auto-completes search terms, provides relevant related search suggestions, and even recommends complementary content they hadn’t originally sought out.


The pivot is that Coveo is positioning its search system to handle self service questions. More information about Coveo is available at www.coveo.com.

Stephen E Arnold, January 11, 2017

Improve Your B2B Search with Klevu

January 6, 2017

Ecommerce sites rely on a strong search tool to bring potential customers to their online stores and to help them find specific products without a hassle. B2B companies have the same goal, but they need an entirely different approach to search. If you run a B2B company, you might want to take a gander at Klevu and their solutions: “Search Requirements For A B2B Retailer.”

In the blog post, Klevu explains that B2B companies have multiple customer groups, each of which may have different pricing, products, discounts, etc. Customers see prices based on the group the store has assigned them to, so the store cannot show a single price for every item. Search results have to reflect this as well. Klevu describes how its Magento plugin handles group pricing:

With the help of our partners and our in-house expertise, we came up with a solution that allows such group prices to automatically work with Klevu. The Klevu Magento plugin fetches the group prices and, at the time of showing the search results, Klevu’s JavaScript determines the relevant price for rendering. We’ve also ensure that this works in Ajax / quick search as well, as this was an important requirement.

The Klevu Magento plugin also offers SKU search, keeps the same landing page within search results, and provides instant faceted search. Klevu researched the issues that gave its B2B customers the most trouble and created solutions. The company is actively pursuing ways to resolve bothersome issues as they pop up, and this is just the start.
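
The group pricing step quoted above can be sketched in a few lines of Python. This is not Klevu’s code, which the post says runs as JavaScript in the browser; the record layout (sku, price, group_prices) and both function names are assumptions for illustration.

```python
from typing import Dict, List, Optional


def price_for_group(product: dict, customer_group: Optional[str]) -> float:
    """Use the group's price when one is defined, else the base price."""
    group_prices: Dict[str, float] = product.get("group_prices", {})
    if customer_group in group_prices:
        return group_prices[customer_group]
    return product["price"]


def render_results(products: List[dict], customer_group: Optional[str]) -> List[str]:
    """Format each search hit with the price this customer should see."""
    return [
        f"{p['sku']}: {price_for_group(p, customer_group):.2f}"
        for p in products
    ]
```

The point of resolving the price at render time is that the same search index can serve every customer group, with only the displayed price changing.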

Whitney Grace, January 6, 2017

Textkernel: Narrowing Search to an HR Utility

January 5, 2017

Remember the good old days of search? Autonomy, Convera, Endeca, Fast Search, and others from the go go 2000s identified search as a solution to enterprise information access. Well, those assertions proved to be difficult to substantiate. Marketing is one thing; finding information is another.

How does a vendor of Google style searching with some pre-sell Clearwell Systems-type business process tweaking avoid the problems which other enterprise search vendors have encountered?

The answer is, “Market search as a solution for hiring.” Just as Clearwell Systems and its imitators did in the legal sector, Textkernel, founded in 2001 and sold to CareerBuilder in 2015, is doing résumé indexing and search focused on finding people to hire. Search becomes “recruitment technology,” which is reasonably clever buzzworking.

The company explains its indexing of CVs (curricula vitae) this way:

CV parsing, also called resume parsing or CV extraction, is the process of converting an unstructured (so-called free-form) CV/resume or social media profile into a structured format that can be integrated into any software system and made searchable. CV parsing eliminates manual data entry, allows candidates to apply via any (mobile) device and enables better search results.
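
As a rough picture of what “converting an unstructured CV into a structured format” means, here is a minimal Python sketch that pulls a few fields out of free-form text with regular expressions. It is a toy under stated assumptions, not Textkernel’s method: the field names, the patterns, and the rule that the first non-empty line is the candidate’s name are all hypothetical.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Digits with optional spaces, dots, dashes, or parentheses on one line.
PHONE_RE = re.compile(r"\+?\d[\d ().-]{7,}\d")


def parse_cv(text: str) -> dict:
    """Turn free-form CV text into a small structured record."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    email = EMAIL_RE.search(text)
    phone = PHONE_RE.search(text)
    return {
        # Assumption: the first non-empty line is the candidate's name.
        "name": lines[0] if lines else None,
        "email": email.group(0) if email else None,
        "phone": phone.group(0) if phone else None,
    }
```

A record like this can be loaded into a database or a search index, which is what makes the CV searchable by field rather than by keyword alone.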

The Textkernel Web site provides more details about the company’s use of tried and true enterprise search functions like metadata generation and report generation (called a “candidate profile”).

In 2015 the company had about 70 employees. Using the Overflight revenue estimation tool, Beyond Search pegs the 2015 revenue in the $5 million range.

The good news is that the company avoided the catastrophic thrashing which other European enterprise search vendors experienced. The link to the video on the Textkernel page is broken, which does not bode well for Web coding expertise. However, you can bite into some text kernels at this link.

Stephen E Arnold, January 5, 2017

Is Your Data up for Sale on Dark Web?

January 4, 2017

A new service has been launched in the UK that enables users to find out if their confidential information is up for sale on the Dark Web.

Hacked reports on the service in an article titled “This Tool Lets You Scan the Dark Web for Your (Stolen) Personal Data”:

The service is called OwlDetect and is available for £3,5 a month. It allows users to scan the dark web in search for their own leaked information. This includes email addresses, credit card information and bank details.

The service uses a supposedly sophisticated algorithm that allegedly can penetrate up to 95% of the content on the Dark Web. The inability of open Web search engines to index and penetrate the Dark Web has led to a mushrooming of Dark Web search engines.

OwlDetect works much like early-stage Google, as becomes apparent in the article:

This new service has a database of stolen data. This database was created over the past 10 years, presumably with the help of their software and team. A real deep web search engine does exist, however.

This means the search is not real time and is about as good as searching your local hard drive. Most of the data might be outdated, and the companies that owned this data might have migrated to more secure platforms. Moreover, the user might have deleted the old data. Thus, the service just tells you whether you were ever hacked or whether your data was ever stolen.

Vishal Ingole,  January 4, 2017

US Patent Search Has a Ways to Go

January 3, 2017

The U.S. Government Accountability Office recently released a report entitled “Patent Office Should Strengthen Search Capabilities and Better Monitor Examiners’ Work.” Published on June 30, 2016, the report runs 91 pages as a PDF. It examines the challenges that U.S. Patent and Trademark Office (USPTO) examiners face in identifying information relevant to a claimed invention, challenges that affect patent search. The website says the following in regard to the reason for this study:

GAO was asked to identify ways to improve patent quality through use of the best available prior art. This report (1) describes the challenges examiners face in identifying relevant prior art, (2) describes how selected foreign patent offices have addressed challenges in identifying relevant prior art, and (3) assesses the extent to which USPTO has taken steps to address challenges in identifying relevant prior art. GAO surveyed a generalizable stratified random sample of USPTO examiners with an 80 percent response rate; interviewed experts active in the field, including patent holders, attorneys, and academics; interviewed officials from USPTO and similarly sized foreign patent offices, and other knowledgeable stakeholders; and reviewed USPTO documents and relevant laws.

In short, the state of patent search is currently not very good. Timeliness and accuracy continue to be concerns when it comes to providing effective search in any capacity. Based on the study’s findings, improving these areas appears especially troublesome because of problems with the clarity of patent applications and with USPTO’s policies and search tools.

Megan Feil, January 3, 2017

Tor Anonymity Not 100 Percent Guaranteed

January 1, 2017

An article at Naked Security reveals some information turned up by innovative Tor-exploring hidden services: “‘Honey Onions’ Probe the Dark Web: At Least 3% of Tor Nodes are Rogues.” By “rogues,” writer Paul Ducklin is referring to sites, run by criminals and law enforcement alike, that are able to track users through Tor entry and/or exit nodes. The article nicely lays out how this small fraction of sites can capture IP addresses, so see the article for that explanation. As Ducklin notes, three percent is a small enough window that someone just wishing to avoid having their shopping research tracked may remain unconcerned, but it is a bigger matter for, say, a journalist investigating events in a war-torn nation. He writes:

Two researchers from Northeastern University in Boston, Massachussets, recently tried to measure just how many rogue HSDir nodes there might be, out of the 3000 or more scattered around the world. Detecting that there are rogue nodes is fairly easy: publish a hidden service, tell no one about it except a minimum set of HSDir nodes, and wait for web requests to come in.[…]

With 1500 specially-created hidden services, amusingly called ‘Honey Onions,’ or just Honions, deployed over about two months, the researchers measured 40,000 requests that they assume came from one or more rogue nodes. (Only HSDir nodes ever knew the name of each Honion, so the researchers could assume that all connections must have been initiated by a rogue node.) Thanks to some clever mathematics about who knew what about which Honions at what time, they calculated that these rogue requests came from at least 110 different HSDir nodes in the Tor network.

It is worth noting that many of those requests were simple pings, but others were actively seeking vulnerabilities. So, if you are doing anything more sensitive than comparing furniture prices, you’ll have to decide whether you want to take that three percent risk. Ducklin concludes by recommending added security measures for anyone concerned.
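
For readers who want to see where the “three percent” figure comes from, the arithmetic can be checked with the numbers in the quoted passage. The short Python sketch below is only a back-of-the-envelope calculation; the function name is hypothetical, and because the article gives “at least 110” rogue nodes and “3000 or more” HSDir nodes, the result is an approximation rather than an exact rate.

```python
def rogue_fraction(rogue_nodes: int, total_nodes: int) -> float:
    """Share of HSDir nodes flagged as rogue, as a percentage."""
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    return 100.0 * rogue_nodes / total_nodes


# Figures quoted above: at least 110 rogue nodes among roughly 3000 HSDirs.
print(f"{rogue_fraction(110, 3000):.1f}%")  # prints 3.7%
```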

Cynthia Murrell, January 1, 2017

Connexica (Formerly Ardentia NetSearch) Embraces Business Analytics

December 31, 2016

You may remember Ardentia NetSearch. The company’s original product was NetSearch, which was designed to be quick to deploy and built for the end user, not the information technology department. The company changed its name to Connexica in 2001. I checked the company’s Web site and noted that the company positions itself this way:

Our mission is to turn smart data discovery into actionable information for everyone.

What’s interesting is that Connexica asserts that

“search engine technology is the simplest and fastest way for users to service their own information needs.”

The idea is that if one can use Google, one can use Connexica’s systems. A brief description of the company states:

Connexica is the world’s pioneer of search based analytics.

The company offers Cxair. This is a Java based Web application. The application provides search engine based data discovery. The idea is that Cxair permits “fast, effective and agile business analytics.” What struck me was the assertion that Cxair is usable with “poor quality data.” The idea is to create reports without having to know the formal query syntax of SQL.

The company’s MetaVision product is a Java based Web application that “interrogates database metadata.” The idea, as I understand it, is to use MetaVision to help migrate data into Hadoop, Cxair, or ElasticSearch.

Connexica, partly funded by Midven, is a privately held company based in the UK. The firm has more than 200 customers and more than 30 employees. When updating my files, I noted that Zoominfo reports that the firm was founded in 2006, but that conflicts with my file data, which pegs the company as operating as early as 2001.

A quick review of the company’s information on its Web site and open sources suggests that the firm is focusing its sales and marketing efforts on health care, finance, and government customers.

Connexica is another search vendor which has performed a successful pivot. Search technology is secondary to the company’s other applications.

Stephen E Arnold, December 31, 2016
