
Medical Publisher Does Rah Rah for MarkLogic

November 20, 2015

Now MarkLogic is a unicorn. The company wants to generate revenues. Okay. No problem.

I found “200-Year-Old Publisher Finds Happiness with NoSQL Database” quite interesting. The write up explains that the New England Journal of Medicine uses MarkLogic’s XML data management system to — well — manage its text and other content.

The write up states:

With features like XQuery, a SQL-like query engine for XML data, MarkLogic promised to retrieve unstructured data at speeds no SQL database could approach.

What did I note? The big thing is that this deal went down when MarkLogic was a “fledgling company.” Hmm. Was this a Dave Kellogg-era deal? I also noted that the write up did not beat the drum for MarkLogic as a business and government intelligence, email management, and analytics Swiss Army knife able to cut into the revenues of Oracle and other Codd database outfits.

MarkLogic’s marketing may be making progress by emphasizing what MarkLogic’s technology was built to deliver: A data management system for publishers. The publication still uses SQL for financial records and dabbles with the open source quasi-doppelgänger MongoDB.

MarkLogic hit a wall at about $60 million. Today the fledgling is a unicorn. Will MarkLogic put wings on its unicorn? Stakeholders sure think it is going to happen. For me, I will observe. Will the proprietary MarkLogic prevail or will open source alternatives nibble into this box of Kellogg’s revenue?

Stephen E Arnold, November 20, 2015

Icann Is an I Won’t

November 16, 2015

Have you ever heard of Icann?  You are probably like many people within the United States and have not heard of the non-profit private company.  What does Icann do?  Icann is responsible for Internet protocol addresses (IP) and coordinating domain names, so basically the company is responsible for a huge portion of the Internet.  According to The Guardian in “The Internet Is Run By An Unaccountable Private Company. This Is A Problem,” the US supposedly runs the Icann but its role is mostly clerical and by September 30, 2015 it was supposed to hand the reins over to someone else.

The “else” is the biggest question.  The Icann community spent hours trying to figure out who would manage the company, but they ran into a huge brick wall.  The biggest issue is that the volunteers want Icann to have more accountability, which does not seem feasible. Icann’s directors cannot be fired, except by each other.  Finances are another problem with possible governance risks and corruption.

A supposed solution is to create a membership organization, a common business model for non-profits, which would give power to the community.  Icann’s directors are not too happy and have been allowed to add their own opinions.  Decisions are not being made at Icann and with the new presidential election the entire power shift could be off.  It is not the worst that could happen:

“But there’s much more at stake. Icann’s board – as ultimate authority in this little company running global internet resources, and answerable (in fact, and in law) to no one – does have the power to reject the community’s proposals. But not everything that can be done, should be done. If the board blunders on, it will alienate those volunteers who are the beating heart of multi-stakeholder governance. It will also perfectly illustrate why change is required.”

The board has all the power and they do not have anyone to hold them accountable.  Icann directors just have to stall long enough to keep things the same and they will be able to give themselves more raises.

Whitney Grace, November 16, 2015
Sponsored by, publisher of the CyberOSINT monograph

Crazy, Wild Hadoop Prioritization Advice

November 12, 2015

I read “Top 10 Priorities for a Successful Hadoop Implementation.” A listicle. I understand. Clicks. Visibility. Fame. Fortune. Well, hopefully.

I wanted to highlight two pieces of advice delivered in a somber, parental manner. Here are two highlights from the write up intended to help a Hadoop administrator get ‘er done and keep the paychecks rolling in.

Item 2 of 10: “Innovate with Big Data on enterprise Hadoop.” I find it amusing when advisors, poobahs, and former middle school teachers tell another person to innovate. Yep, that works really well. Even those who innovate are faced with failure many times. I think the well ran dry for some of the Italian Renaissance artists when the examples of frescos in Nero’s modest home were recycled. Been there. Done that. The notion of a person innovating with an enterprise deployment of Hadoop strikes me as interesting, but probably not a top 10 priority. How about getting the data into the system, formulating a meaningful query, and figuring out how to deal with the batchiness of the system?

Item 9 of 10: “Look for capabilities that make Hadoop data look relational.” There are reasons to use Codd type data management systems. Those reasons include that they work when properly set up, and they require data which can be sliced and diced. Maybe not easily, but no one fools himself or herself thinking, “Gee, why don’t I dump everything into one big data lake and pull out the big, glossy fish automagically.”

I am okay with advice. Perhaps it should reflect the reality which open source data management tools present to an enterprise user seeking guidance. Enterprise search vendors got themselves into a world of hurt with this type of casual advice. Where are those vendors now?

Stephen E Arnold, November 12, 2015

Amazon Punches Business Intelligence

November 11, 2015

Amazon already gave technology a punch when it launched AWS, but now it is releasing a business intelligence application that will change the face of business operations or so Amazon hopes.  ZDNet describes Amazon’s newest endeavor in “AWS QuickSight Will Disrupt Business Intelligence, Analytics Markets.”  The market is already saturated with business intelligence technology vendors, but Amazon’s new AWS QuickSight will cause another market upheaval.

“This month is no exception: Amazon crashed the party by announcing QuickSight, a new BI and analytics data management platform. BI pros will need to pay close attention, because this new platform is inexpensive, highly scalable, and has the potential to disrupt the BI vendor landscape. QuickSight is based on AWS’ cloud infrastructure, so it shares AWS characteristics like elasticity, abstracted complexity, and a pay-per-use consumption model.”

Another monkey wrench for business intelligence vendors is that AWS QuickSight’s prices are not only reasonable, but are borderline scandalous: standard for $9/month per user or enterprise edition for $18/month per user.

Keep in mind, however, that AWS QuickSight is the newest shiny object on the business intelligence market, so it will have out-of-the-box problems, unknown long-term ramifications, and a reliance on database models and schemas.  Do not forget that most business intelligence solutions do not resolve all issues, including ease of use and comprehensiveness.  It might be better to wait until all the bugs are worked out of the system, unless you do not mind being a guinea pig.

Whitney Grace, November 11, 2015


Photo Farming in the Early Days

November 9, 2015

Have you ever wondered what your town looked like while it was still rural and used as farmland?  Instead of having to visit your local historical society or library (although we do encourage you to do so), the United States Farm Security Administration and Office Of War Information (known as FSA-OWI for short) developed Photogrammer.  Photogrammer is a Web-based image platform for organizing, viewing, and searching farm photos from 1935-1945.

Photogrammer uses an interactive map of the United States, where users can click on a state and then a city or county within it to see the photos from the timeline.  The archive contains over 170,000 photos, but only 90,000 have a geographic classification.  They have also been grouped by the photographer who took the photos, although it is limited to fifteen people.  Other than city, photographer, year, and month, the collection can be sorted by collection tags and lot numbers (although these are not discussed in much detail).

While farm photographs from 1935-1945 do not appear to need their own photographic database, the collection’s history is interesting:

“In order to build support for and justify government programs, the Historical Section set out to document America, often at her most vulnerable, and the successful administration of relief service. The Farm Security Administration—Office of War Information (FSA-OWI) produced some of the most iconic images of the Great Depression and World War II and included photographers such as Dorothea Lange, Walker Evans, and Arthur Rothstein who shaped the visual culture of the era both in its moment and in American memory. Unit photographers were sent across the country. The negatives were sent to Washington, DC. The growing collection came to be known as “The File.” With the United State’s entry into WWII, the unit moved into the Office of War Information and the collection became known as the FSA-OWI File.”

While the photos do have historical importance, rather than creating a separate database with its small flaws, it would be more useful if it was incorporated into a larger historical archive, like the Library of Congress, instead of making it a pet project.

Whitney Grace, November 9, 2015


TemaTres Open Source Vocabulary Server

November 3, 2015

The latest version of the TemaTres vocabulary server is now available, we learn from the company’s blog post, “TemaTres 2.0 Released.” Released under the GNU General Public License version 2.0, the web application helps manage taxonomies, thesauri, and multilingual vocabularies. The web application can be downloaded at SourceForge. Here’s what has changed since the last release:

*Export to Moodle your vocabulary: now you can export to Moodle Glossary XML format

*Metadata summary about each term and about your vocabulary (data about terms, relations, notes and total descendants terms, deep levels, etc)

*New report: reports about terms with mapping relations, terms by status, preferred terms, etc.

*New report: reports about terms without notes or specific type of notes

*Import the notes type defined by user (custom notes) using tagged file format

*Select massively free terms to assign to other term

*Improve utilities to take terminological recommendations from other vocabularies (more than 300:

*Update Zthes schema to Zthes 1.0 (Thanks to Wilbert Kraan)

*Export the whole vocabulary to Metadata Authority Description Schema (MADS)

*Fixed bugs and improved several functional aspects.

*Uses Bootstrap v3.3.4

See the server’s SourceForge page, above, for the full list of features. Though as of this writing only 21 users had rated the product, all seemed very pleased with the results. The TemaTres website notes that running the server requires some other open source tools: PHP, MySQL, and an HTTP Web server. It also specifies that, to update from version 1.82, keep the db.tematres.php, but replace the code. To update from TemaTres 1.6 or earlier, first go in as an administrator and update to version 1.7 through Menu-> Administration -> Database Maintenance.
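For readers who want to script the 1.82 update step, here is a minimal sketch in Python of "keep db.tematres.php, but replace the code." It is not an official TemaTres tool. The function name, the directory arguments, and the choice to match the configuration file by name wherever it sits in the tree are all assumptions made for illustration. Back up the installation before trying anything like this.

```python
import shutil
from pathlib import Path


def upgrade_preserving_config(install_dir, release_dir, config_name="db.tematres.php"):
    """Copy files from an unpacked release over an existing install.

    Hypothetical helper: any file named `config_name` that already exists
    in the install is left untouched; everything else is overwritten.
    Returns the list of configuration files that were kept.
    """
    install, release = Path(install_dir), Path(release_dir)
    kept = []
    for src in release.rglob("*"):
        if src.is_dir():
            continue
        dest = install / src.relative_to(release)
        # Keep the site's existing database settings file as-is.
        if src.name == config_name and dest.exists():
            kept.append(dest)
            continue
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
    return kept
```

Calling `upgrade_preserving_config("/path/to/install", "/path/to/unpacked/release")` with your own paths returns the configuration files it left alone, which makes it easy to confirm the database settings survived.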

Cynthia Murrell, November 3, 2015


Libraries’ Failure to Make Room for Developer Librarians

October 23, 2015

The article titled Libraries’ Tech Pipeline Problem on Geek Feminism explores the lack of diverse developers. The author, a librarian, is extremely frustrated with the approach many libraries have taken. Rather than refocusing their hiring and training practices to emphasize technical skills, many are simply hiring more and more vendors, hardly a solution. The article states,

“The biggest issue I see is that we offer a fair number of very basic learn-to-code workshops, but we don’t offer a realistic path from there to writing code as a job. To put a finer point on it, we do not offer “junior developer” positions in libraries; we write job ads asking for unicorns, with expert- or near-expert-level skills in at least two areas (I’ve seen ones that wanted strong skills in development, user experience, and devops, for instance).”

The options available are that librarians either learn to code in their spare time (not viable), or enter the tech workforce temporarily and bring their skills back after a few years. This option is also full of drawbacks, especially that even white women are marginalized in the tech industry. Instead, the article argues that libraries need to make more room for hiring and promoting people with coding skills and interests while also joining coding communities like Code4Lib.


Chelsea Kerwin, October 23, 2015



Spark Burns Down Hadoop

October 20, 2015

I read “Apache Spark vs Hadoop.” I conceptualized Ronda Rousey climbing in the octagon with Ramazan Emeev. A big gate. As a certain presidential candidate might say, “Huge.”

Alas, the dust up between Spark (MapReduce on steroids) and Hadoop (a batch operation clustering system) was not much of a contest, according to the article.

I highlighted this passage:

With Apache Spark, you can act on your data in whatever way you want. Want to look for interesting tidbits in your data? You can perform some quick queries. Want to run something you know will take a long time? You can use a batch job. Want to process your data streams in real time? You can do that too.

The key to the Spark wonderfulness is RDDs or resilient distributed datasets. I underlined this definition:

They’re fine-grained, keeping track of all changes that have been made from other transformations such as map or join. This means that it’s possible to recover from failures by rebuilding from these transformations (which is why they’re called Resilient Distributed Datasets).
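The recovery idea in that passage can be illustrated with a toy sketch in plain Python. This is not Spark and does not use Spark's API. The class `LineageDataset` and its methods are hypothetical names invented here to show one thing: a dataset that remembers the transformations that produced it can rebuild lost results by replaying them.

```python
class LineageDataset:
    """Toy dataset that records how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, transform=None):
        self._source = source        # base records, set only on a root dataset
        self._parent = parent        # dataset this one was derived from
        self._transform = transform  # function applied to the parent's records
        self._cache = None           # materialized records; may be lost

    def map(self, fn):
        # Record the transformation instead of running it immediately.
        return LineageDataset(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return LineageDataset(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Rebuild from lineage whenever the materialized records are missing.
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source)
            else:
                self._cache = self._transform(self._parent.collect())
        return list(self._cache)

    def lose_cache(self):
        """Simulate a failure that wipes the materialized records."""
        self._cache = None


base = LineageDataset(source=range(10))
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.collect()    # computed by replaying filter, then map
evens_squared.lose_cache()         # pretend the node holding the results died
rebuilt = evens_squared.collect()  # recomputed from the recorded lineage
```

Here `first` and `rebuilt` hold the same records. A real cluster system adds partitioning, scheduling, and distribution on top of this bookkeeping, none of which the sketch attempts.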

My goodness, with these features, poor, old Hadoop may not stand a chance. Now who would win a fight between Rousey and Emeev? One could, I assume, input data about the two fighters and perform some quick queries and get an “answer.”

Like most NoSQL confections, will the answer match what happens in the ring?

Stephen E Arnold, October 20, 2015

Quote to Note: Halevy after 10 Years Before the Ads

September 23, 2015

If you track innovations at the Alphabet Google thing, you will know that a number of wizards make the outfit hum. One of the big wizards is Dr. Alon Halevy. He is a database guru, has patents, and is now an essayist.

Navigate to “A Decade at Google.” The write up does not reference the ad model which makes research possible. Legal dust ups are sidestepped. The management approach and the reorganization are not part of the write up.

I did note an interesting passage, which I flagged as a quote to note:

It is common wisdom that you should not choose a project that a product team is likely to be embarking on in the short term (e.g., up to a year). By the time you’ll get any results, they will have done it already. They might not do it as well as or as elegantly as you can, but that won’t matter at that point.

I interpreted this to underscore Alphabet Google thing’s “good enough” approach to its technology. If you have time, think about the confluence of Dr. Halevy’s research and Dr. Guha’s. The semantic search engine optimization crowd may have a field day.

Stephen E Arnold, September 23, 2015

13 Big Data Trends: Fodder for Mid Tier Consultants

September 20, 2015

Let’s assume that a colleague has lost his or her job (xe, in Tennessee, I heard). The question becomes, “What can I do with my current skills to make big money in a hot new sector?”

The answer appears in “13 New Trends in Big Data and Data Science.” The write up is intended to be a round up of jazzy hot topics in a couple of even hotter quasi-new facets of the database world. Like enterprise search, databases are in need of juice. Nothing helps established technology more than new spins in old orbits.

My suggestion is to read through the list of 13 “new trends.” Pick one, and suggest that your prospect hunting pal get hired. Nothing to it.

Allow me to illustrate the method in action.

I have selected trend 8 “The rise of mobile data exploitation.” There are some companies active in this field; for example, S2T. The S2T name means simulation software and technology. The outfit processes a range of digital information and analyzes it with the company’s own tools. Anyone can work in this sector. The demand for talent is high. The work is not too difficult. The desire to hire “experts” in various aspects of data is keen. No problem. Sure, there may be some trivial requirements like checking with a person’s mom and his or her best friends to make sure the applicant can be trusted. Hot trend. No problemo.

Let’s look at another field.

Trend 11. High performance computing (HPC). What could be faster than Apple’s new mobile chip? What could be higher performance than the Facebook or Google infrastructure? If the job seeker is familiar with these technologies, the world of Big Data excitement awaits. The experience is the important thing, not knowledge of optimized parallelization pipelines.


Each of the 13 trends makes it clear that there are numerous opportunities. These range from digital health (IBM Watson is a PR player) to the trivial world of analytic apps and APIs.

After reading the article, I was delighted to see how many important trends are getting buzz.

Big Data is definitely the go to discipline. I anticipate that anyone interested in search and content processing will be able to pursue a career in Big Data.

Now some skeptics believe that Big Data is a nebulous concept. Do not be dissuaded. The 13 trends are evidence that databases and the analysis of their contents are the future. Just as these activities have been since the days of Edgar Codd.

The mid tier consultants can ride with the hounds.

Stephen E Arnold, September 20, 2015
