Need an Open Source Semantic Web Crawler?

December 17, 2015

If you do, the beleaguered Yahoo has some open source goodies for you. Navigate to “Yahoo Open Sources Anthelion Web Crawler for Parsing Structured Data on HTML Pages.” The software, states the write up, is “designed for parsing structured data from HTML pages under an open source license.”

There is a statement I found darned interesting:

“To the best of our knowledge, we are first to introduce the idea of a crawler focusing on semantic data, embedded in HTML pages using markup languages as microdata, microformats or RDFa,” wrote authors Peter Mika and Roi Blanco of Yahoo Labs and Robert Meusel of Germany’s University of Mannheim.

My immediate thought was, “Why don’t these folks take a look at the 2007 patent documents penned by Ramanathan Guha?” Those documents present a rather thorough description of a semantic component which hooks into the Google crawlers. Now the Google has not open sourced these systems and methods.

My reaction is, “Yahoo may want to ask the former Yahooligans who are now working at Google how novel the Yahoo approach really is.”

Failing that, Yahoo may want to poke around in the literature, including patent documents, to see which outfits have trundled down the semantic Web crawling path before. Would it have been more economical and efficient to license the Siderean Software crawler and build on that?
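For readers who have not looked at the markup a semantic crawler hunts for, here is a minimal sketch of pulling microdata out of an HTML page with Python's standard library. It is not Anthelion's code and makes no claim about how Yahoo or Google do it. The class name `MicrodataParser`, the helper `extract_microdata`, and the sample page are all made up for illustration, and the parser is deliberately naive: it handles flat items only and treats the first closing tag as the end of a property.

```python
from html.parser import HTMLParser


class MicrodataParser(HTMLParser):
    """Collect (itemprop, text) pairs from flat microdata markup.

    Simplifications (assumptions of this sketch): nested itemscope
    elements are not tracked, attribute-valued properties such as
    href or content are ignored, and a property ends at the first
    closing tag seen after it opens.
    """

    def __init__(self):
        super().__init__()
        self.item_types = []    # values of itemtype attributes
        self.properties = []    # (itemprop, text) pairs in page order
        self._open_prop = None  # name of the itemprop being read
        self._text = []         # text fragments for the open itemprop

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.item_types.append(attrs["itemtype"])
        if "itemprop" in attrs:
            self._open_prop = attrs["itemprop"]
            self._text = []

    def handle_data(self, data):
        if self._open_prop is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._open_prop is not None:
            value = "".join(self._text).strip()
            self.properties.append((self._open_prop, value))
            self._open_prop = None


def extract_microdata(html):
    """Return (item types, properties) found in an HTML string."""
    parser = MicrodataParser()
    parser.feed(html)
    parser.close()
    return parser.item_types, parser.properties


# Hypothetical sample page, not taken from any real site.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Ada Lovelace</span>
  <span itemprop="jobTitle">Mathematician</span>
</div>
"""

types, props = extract_microdata(SAMPLE_HTML)
print(types)
print(props)
```

A focused crawler would presumably use a signal like "this page yielded properties" to decide which links to follow next. That scoring step is the interesting part and is not shown here.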

Stephen E Arnold, December 17, 2015

Google Indexes Some Dynamic Content

December 10, 2015

If you generate Web pages dynamically (who doesn’t?), you may want to know if the Alphabet Google thing can index the content on dynamic pages.

For some apparently objective information about the GOOG’s ability to index dynamic content, navigate to “Does Google Crawl Dynamic Content?” The article considers 11 types of dynamic content methods.

Here’s the passage I highlighted:

  • Google crawls and indexes all content that was injected by javascript.
  • Google even shows results in the SERP that are based on asynchronously injected content.
  • Google can handle content from httpRequest().
  • However, JSON-LD as such does not necessarily lead to SERP results (as opposed to the officially supported SERP entities that are not only indexed, but also used to decorate the SERP).
  • Injected JSON-LD gets recognized by the structured data testing tool – including Tag Manager injection. This means that once Google decides to support the entities, indexing will not be a problem.
  • Dynamically updated meta elements get crawled and indexed, too.
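To make the JSON-LD bullets concrete, here is a minimal sketch of what recognizing a JSON-LD block involves once the markup is present in the page. It assumes the HTML has already been rendered, so any JavaScript injection has happened; the sketch does not execute scripts and says nothing about how Google's own pipeline works. The names `JsonLdParser` and `extract_json_ld` and the sample page are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect parsed JSON-LD blocks from <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self.blocks = []         # successfully parsed JSON-LD objects
        self._in_jsonld = False  # True while inside a JSON-LD script tag
        self._buffer = []        # raw text of the current script body

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the page


def extract_json_ld(rendered_html):
    """Return the list of JSON-LD objects found in already-rendered HTML."""
    parser = JsonLdParser()
    parser.feed(rendered_html)
    parser.close()
    return parser.blocks


# Hypothetical rendered page; in the scenario above the script tag
# would have been injected by JavaScript before this snapshot was taken.
RENDERED_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Organization", "name": "Example Co"}
</script>
</head><body><p>Hello</p></body></html>
"""

blocks = extract_json_ld(RENDERED_HTML)
print(blocks)
```

Finding the block is the easy part. Whether the entity ends up decorating a results page is a separate decision, which is the distinction the quoted bullets draw.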

The question one may wish to consider is, “What does Alphabet Google do with that information and data?” There are some clues in the Ramanathan Guha patent documents filed in 2007.

Stephen E Arnold, December 10, 2015

Medical Search Solved Again

December 10, 2015

I have looked at a wide range of medical information search systems over the years. These range from Medline to the Grateful Med.

I read “A Cure for Medical Researchers’ Big Data Headache.” The Big Data in question is the Medline database. The new search tool is ORiGAMI (I love that wonky upper and lower case thing).

The basic approach involves:

Apollo, a Cray Urika graph computer, possesses massive multithreaded processors and 2 terabytes of shared memory, attributes that allow it to host the entire MEDLINE database and compute multiple pathways on multiple graphs simultaneously. Combined with Helios, CADES’ Cray Urika extreme analytics platform, Sukumar’s team had the cutting-edge hardware needed to process large datasets quickly—about 1,000 times faster than a workstation—and at scale.

And the payoff?

Once the MEDLINE database was brought into the CADES environment, Sreenivas Rangan Sukumar’s [a data scientist at the Department of Energy’s Oak Ridge National Laboratory] team applied advanced graph theory models that implement semantic, statistical, and logical reasoning algorithms to create ORiGAMI. The result is a free online application capable of delivering health insights in less than a second based on the combined knowledge of a worldwide medical community.

My view is that Medline is not particularly big. The analysis of the content pool can generate lots of outputs.

From my vantage point in rural Kentucky, this is another government effort to create a search system. Perhaps this is the breakthrough that will surpass IBM Watson’s medical content capabilities?

Does your local health care provider have access to a Cray computer and the other bits and pieces, including a local version of Dr. Sukumar?

Stephen E Arnold, December 10, 2015

Visual Content: An Indexing Challenge

December 4, 2015

I read “5 Visual Content Tools to Boost Engagement.” The write up points to a handful of services which generate surveys, infographics, and collages of user supplied photos. If I knew a millennial, I could imagine hearing the susurration of excitement emitted by the lad or lass.

Here’s one example from the write up:

The average bounce rate on blogs for new visitors is 60.2%, and the average reader stays only 1 to 2 minutes on your website. One way to get people to really engage with your content is to use a tool like Roojoom, which is a content curation and creation platform. Roojoom lets you collect content from your online and offline sources (such as your web pages, videos, PDFs and marketing materials) to create a “content journey” for readers. You then guide readers step by step through the journey, all from within one centralized place.

Now I don’t want to rain on the innovation parade. Years ago, an outfit called i2 Group Ltd. developed a similar solution. After dogging and ponying the service, it became clear that in the early 2000s, there was not much appetite for this type of data exploration. i2 eventually sold out to IBM and the company returned to its roots in intelligence and law enforcement.

The thought I had after reading about Roojoom and the other services was this:

How will the information be indexed and made findable?

As content becomes emoji-ized, the indexing task does not become easier. Making sense of images is not yet a slam dunk. Heck, automated indexing only shoots accurately 80 to 90 percent of the time. In a time of heightened concern about risks, is a one in five bet a good one? I try to narrow the gap, but many are okay without worrying too much.

As visual content becomes more desirable, the indexing systems will have to find a way to make this content findable. Words baffle many of these content processing outfits. Pictures are another hill to climb. If it is not indexed, the content may not be findable. Is this a problem for researchers and analysts? And for you, gentle reader?

Stephen E Arnold, December 4, 2015

XML Marches On

December 2, 2015

For fans of XML and automated indexing, there’s a new duo in town. The shoot out at the JSON corral is not scheduled, but you can get the pre show down information in “Smartlogic and MarkLogic Corporation Enhance Platform Integration between Semaphore and MarkLogic Database.” Rumors of closer ties between the outfits surfaced earlier this year. I pinged one of the automated indexing company’s wizards and learned, “Nope, nothing going on.” Gee, I almost believed this until a virtual strategy story turned up. Virtual no more.

According to the write up:

Smartlogic, the Content Intelligence Company, today announced tighter software integration with MarkLogic, the Enterprise NoSQL database platform provider, creating a seamless approach to semantic information management where organizations maximize information to drive change. Smartlogic’s Content Intelligence capabilities provide a robust set of semantic tools which create intelligent metadata, enhancing the ability of the enterprise-grade MarkLogic database to power smarter applications.

For fans of user friendliness, the tie up may mean more XQuery scripting and some Semaphore tweaks. And JSON? Not germane.

What is germane is that Smartlogic may covet some of MarkLogic’s publishing licensees. After slicing and dicing, some of these outfits are having trouble finding out what their machine assisted editors have crafted with refined quantities of editorial humans.

Stephen E Arnold, December 2, 2015

Alphabet Google Misspells Relevance, Yikes, Yelp?

November 25, 2015

I read “Google Says Local Search Result That Buried Rivals Yelp, Trip Advisor Is Just a Bug.” I thought the relevance, precision, and objectivity issues had been put into a mummy style sleeping bag and put in the deep freeze.

According to the write up:

executives from public Internet companies Yelp and TripAdvisor noted a disturbing trend: Google searches on smartphones for their businesses had suddenly buried their results beneath Google’s own. It looked like a flagrant reversal of Google’s stated position on search, and a move to edge out rivals.

The article contains this statement attributed to the big dog at Yelp:

Far from a glitch, this is a pattern of behavior by Google.

I don’t have a dog in this fight nor am I looking for a dog friendly hotel or a really great restaurant in Rooster Run, Kentucky.

My own experience running queries on Google is okay. Of course, I have the goslings, several of whom are real live expert searchers with library degrees and one has a couple of well received books to her credit. Oh, I forgot. We also have a pipeline to a couple of high profile library schools, and I have a Rolodex with the names and numbers of research professionals who have pretty good search skills.

So maybe my experience with Google is different from the folks who are not able to work around what the Yelp top dog calls, according to the article, “Google’s monopoly.”

My thought is that those looking for free search results need to understand how oddities like relevance, precision, and objectivity are defined at the Alphabet Google thing.

Google even published a chunky manual to help Web masters, who may have been failed middle school teachers in a previous job, do things the Alphabet Google way. You can find the rules of the Google information highway here.

The Google relevance, precision, and objectivity thing has many moving parts. Glitches are possible. Do Googlers make errors? In my experience, not too many. Well, maybe solving death, Glass, and finding like minded folks in the European Union regulators’ office.

My suggestion? Think about other ways to obtain information. When a former Gannett sci tech reporter could not find Cuba Libre restaurant in DC on his Apple phone, there was an option. I took him there even though the eatery was not in the Google mobile search results. Cuba Libre is not too far from the Alphabet Google DC office. No problem.

Stephen E Arnold, November 25, 2015

Indexing: A Cautionary Example

November 17, 2015

I read “Half of World’s Museum Specimens Are Wrongly Labeled, Oxford University Finds.” Anyone involved in indexing knows the perils of assigning labels, tags, or what the whiz kids call metadata to an object.

Humans make mistakes. According to the write up:

As many as half of all natural history specimens held in some of the world’s greatest institutions are probably wrongly labeled, according to experts at Oxford University and the Royal Botanic Garden in Edinburgh. The confusion has arisen because even accomplished naturalists struggle to tell the difference between similar plants and insects. And with hundreds or thousands of specimens arriving at once, it can be too time-consuming to meticulously research each and guesses have to be made.

Yikes. Only half. I know that human indexers get tired. Now there is just too much work to do. The reaction is typical of busy subject matter experts. Just guess. Close enough for horse shoes.

What about machine indexing? Anyone who has retrained an HP Autonomy system knows that humans get involved as well. If humans make mistakes with bugs and weeds, imagine what happens when a human has to figure out a blog post in a dialect of Korean.

The brutal reality is that indexing is a problem. When dealing with humans, the problems do not go away. When humans interact with automated systems, the automated systems make mistakes, often more rapidly than the sorry human indexing professionals do.

What’s the point?

I would sum up the implication as:

Do not believe a human (indexing species or marketer of automated indexing species).

Acceptable indexing with accuracy above 85 percent is very difficult to achieve. Unfortunately the graduates of a taxonomy boot camp or the entrepreneur flogging an automatic indexing system which is powered by artificial intelligence may not be reliable sources of information.

I know that this notion of high error rates is disappointing to those who believe their whizzy new system works like a champ.

Reality is often painful, particularly when indexing is involved.

What are the consequences? Here are three:

  1. Results of queries are incomplete or just wrong
  2. Users are unaware of missing information
  3. Failure to maintain either human, human assisted, or automated systems results in indexing drift. Eventually the indexing is just misleading if not incorrect.

How accurate is your firm’s indexing? How accurate is your own indexing?
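One hedged way to answer those questions: pull a sample of documents, have a second person review the tags, and count the agreements. The sketch below assumes one tag per document and treats the reviewed tag as correct, which the museum example shows is itself an optimistic assumption. The function names and the sample data are invented for illustration.

```python
def indexing_accuracy(assigned, reviewed):
    """Share of sampled documents whose assigned tag matches the reviewed tag.

    Both arguments map a document id to a single tag. Only documents
    present in both mappings are counted.
    """
    shared = assigned.keys() & reviewed.keys()
    if not shared:
        raise ValueError("no documents in common between the two samples")
    correct = sum(1 for doc_id in shared if assigned[doc_id] == reviewed[doc_id])
    return correct / len(shared)


def disagreements(assigned, reviewed):
    """Sorted ids of sampled documents where the two tags differ."""
    shared = assigned.keys() & reviewed.keys()
    return sorted(d for d in shared if assigned[d] != reviewed[d])


# Hypothetical sample: 20 specimens, all tagged "plant" by the indexer,
# three of which the reviewer says are insects.
assigned_tags = {f"doc{i:02d}": "plant" for i in range(20)}
reviewed_tags = dict(assigned_tags)
for doc_id in ("doc03", "doc11", "doc17"):
    reviewed_tags[doc_id] = "insect"

accuracy = indexing_accuracy(assigned_tags, reviewed_tags)
wrong = disagreements(assigned_tags, reviewed_tags)
print(f"accuracy: {accuracy:.0%}, disagreements: {wrong}")
```

Rerunning the same check on a fresh sample every few months is one simple way to spot the indexing drift described in consequence 3.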

Stephen E Arnold, November 17, 2015

An Early Computer-Assisted Concordance

November 17, 2015

An interesting post at Mashable, “1955: The Univac Bible,” takes us back in time to examine an innovative indexing project. Writer Chris Wild tells us about the preacher who realized that these newfangled “computers” might be able to help with a classically tedious and time-consuming task: compiling a book’s concordance, or alphabetical list of key words, their locations in the text, and the context in which each is used. Specifically, Rev. John Ellison and his team wanted to create the concordance for the recently completed Revised Standard Version of the Bible (also newfangled.) Wild tells us how it was done:

“Five women spent five months transcribing the Bible’s approximately 800,000 words into binary code on magnetic tape. A second set of tapes was produced separately to weed out typing mistakes. It took Univac five hours to compare the two sets and ensure the accuracy of the transcription. The computer then spat out a list of all words, then a narrower list of key words. The biggest challenge was how to teach Univac to gather the right amount of context with each word. Bosgang spent 13 weeks composing the 1,800 instructions necessary to make it work. Once that was done, the concordance was alphabetized, and converted from binary code to readable type, producing a final 2,000-page book. All told, the computer shaved an estimated 23 years off the whole process.”

The article is worth checking out, both for more details on the project and for the historic photos. How much time would that job take now? It is good to remind ourselves that tagging and indexing data has only recently become a task that can be taken for granted.
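As a rough answer to that question, here is a minimal sketch of the core of a concordance in Python: an alphabetical list of key words, the position of each occurrence, and a few words of surrounding context. The stop word list, the two-word context window, and the function name `build_concordance` are assumptions made for illustration, not the rules the 1955 team used.

```python
import re
from collections import defaultdict

# Hypothetical stop words: common words left out of the key word list.
STOP_WORDS = {"a", "and", "in", "of", "the", "to", "was"}


def build_concordance(text, context=2):
    """Map each key word to a list of (word position, context snippet).

    The snippet holds up to `context` words on either side of the key
    word. Keys are lowercased and returned in alphabetical order.
    """
    words = re.findall(r"[A-Za-z']+", text)
    concordance = defaultdict(list)
    for pos, word in enumerate(words):
        key = word.lower()
        if key in STOP_WORDS:
            continue
        start = max(0, pos - context)
        snippet = " ".join(words[start:pos + context + 1])
        concordance[key].append((pos, snippet))
    return dict(sorted(concordance.items()))


SAMPLE_TEXT = (
    "In the beginning God created the heaven and the earth. "
    "And the earth was without form."
)

concordance = build_concordance(SAMPLE_TEXT)
for key, hits in concordance.items():
    print(key, hits)
```

The hard part in 1955, gathering the right amount of context, is a single slice here. Choosing the window and the key word list is still a human judgment call.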

Cynthia Murrell, November 17, 2015

Sponsored by, publisher of the CyberOSINT monograph


RAVN Pipeline Coupled with ElasticSearch to Improve Indexing Capabilities

October 28, 2015

The PR Newswire article “RAVN Systems Releases its Enterprise Search Indexing Platform, RAVN Pipeline, to Ingest Enterprise Content Into ElasticSearch” unpacks the decision to improve the ElasticSearch platform by supplying the indexing platform of the RAVN Pipeline. RAVN Systems is a UK company, founded by consultants and developers, with expertise in processing unstructured data. Their stated goal is to discover new lands in the world of information technology. The article states,

“RAVN Pipeline delivers a platform approach to all your Extraction, Transformation and Load (ETL) needs. A wide variety of source repositories including, but not limited to, File systems, e-mail systems, DMS platforms, CRM systems and hosted platforms can be connected while maintaining document level security when indexing the content into Elasticsearch. Also, compressed archives and other complex data types are supported out of the box, with the ability to retain nested hierarchical structures.”

The added indexing ability is very important, especially for users trying to index from or into cloud-based repositories. Even a single instance of any type of data can be indexed with the Pipeline, which also enriches data during indexing with auto-tagging and classifications. The article also promises that non-specialists (by which I assume they mean people) will be able to use the new systems due to their being GUI driven and intuitive.
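To give a feel for what enrich-then-index means in practice, here is a minimal sketch that tags documents with a keyword rule and formats them for Elasticsearch's bulk endpoint, which accepts newline-delimited JSON: an action line followed by a source line for each document. This is not RAVN Pipeline code. The tagging rules, the `allowed_groups` field standing in for document level security, and the index name are all assumptions, and nothing is sent over the network. Older Elasticsearch releases also expected a `_type` in the action line, which is omitted here.

```python
import json

# Hypothetical auto-tagging rules: keyword found in the body -> tag.
TAG_RULES = {"contract": "legal", "invoice": "finance"}


def enrich(doc):
    """Return a copy of the document with a sorted list of matched tags."""
    text = doc["body"].lower()
    tags = sorted({tag for keyword, tag in TAG_RULES.items() if keyword in text})
    return {**doc, "tags": tags}


def to_bulk_ndjson(index_name, documents):
    """Format documents as a bulk request body (newline-delimited JSON).

    Each document contributes an action line and a source line. The
    body must end with a newline.
    """
    lines = []
    for doc in documents:
        action = {"index": {"_index": index_name, "_id": doc["id"]}}
        source = {k: v for k, v in doc.items() if k != "id"}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"


# Hypothetical extracted documents. allowed_groups is an assumed field
# that a search front end could filter on to enforce access rules.
docs = [
    {"id": "1", "body": "Signed contract for services",
     "allowed_groups": ["legal-team"]},
    {"id": "2", "body": "Invoice 42 and contract addendum",
     "allowed_groups": ["finance", "legal-team"]},
]

payload = to_bulk_ndjson("enterprise-content", [enrich(d) for d in docs])
print(payload)
```

Carrying the access list along with each document is only half of document level security. The search side still has to filter on it at query time, and that part is not shown.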

Chelsea Kerwin, October 28, 2015


Short Honk: Crawl the Web at Scale

September 30, 2015

Short honk: I read “Aduana: Link Analysis to Crawl the Web at Scale.” The write up explains an open source project which can copy content “dispersed all over the Web.” Keep in mind that the approach focuses primarily on text. Aduana is a special back end for the developer’s tool for speeding up crawls which is built on top of a data management system.

According to the write up:

we wanted to locate relevant pages first rather than on an ad hoc basis. We also wanted to revisit the more interesting ones more often than the others. We ultimately ran a pilot to see what happens. We figured our sheer capacity might be enough. After all, our cloud-based platform’s users scrape over two billion web pages per month….We think Aduana is a very promising tool to expedite broad crawls at scale. Using it, you can prioritize crawling pages with the specific type of information you’re after. It’s still experimental. And not production-ready yet.

In its present form, Aduana is able to:

  • Analyze news.
  • Search locations and people.
  • Perform sentiment analysis.
  • Find companies to classify them.
  • Extract job listings.
  • Find all sellers of certain products.

The write up contains links to the relevant github information, some code snippets, and descriptive information.
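For a sense of how link analysis can decide which pages to visit first, here is a minimal sketch that scores a toy link graph with PageRank and sorts the crawl queue by score. PageRank is used only as a familiar example of link analysis; the sketch is not Aduana's code and makes no claim about which algorithms the project ships. The graph, the damping factor of 0.85, and the function names are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score pages in a link graph by power iteration.

    `links` maps each page to the list of pages it links to. Pages
    that appear only as link targets are included automatically.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no outgoing links: spread its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


def crawl_order(links):
    """Return pages sorted by descending score, ties broken by name."""
    rank = pagerank(links)
    return sorted(rank, key=lambda page: (-rank[page], page))


# Hypothetical four-page site.
TOY_LINKS = {
    "home": ["news", "jobs"],
    "news": ["home"],
    "jobs": ["home", "news"],
    "orphan": ["home"],
}

order = crawl_order(TOY_LINKS)
print(order)
```

In a real broad crawl the graph is only partly known and the scores have to be updated as new links turn up, which is where the engineering effort goes.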

Stephen E Arnold, September 30, 2015
