To Make Data Analytics Sort of Work: Attention to Detail

March 10, 2017

I read “The Much-Needed Business Facet for Modern Data Integration.” The write up presents some useful information. Not many of the “go fast and break things” crowd will relate to some of the ideas and suggestions, but I found the article refreshing.

What does one do to make modern data centric activities sort of work? The answers are ones that I have found many more youthful wizards often elect to ignore.

Here they are:

  1. Do data preparation. Yikes. Normalization of data. I have fielded this question in the past, “Who has time for that?” Answer: Too few, gentle reader. Too few.
  2. Profile the data. Another gasp. In my experience it is helpful to determine what data are actually germane to the goal. Think about the polls for the recent
  3. Create data libraries. Good idea. But it is much more fun to just recreate data sets. Very Zen like.
  4. Have rules which are now explained as “data governance.” The jargon does not change the need for editorial and data guidelines.
  5. Take a stab at data quality. This is another way of saying, “Clean up the data.” Even whiz bang modern systems are confused with differences like I.B.M and International Business Machines or numbers with decimal points in the incorrect place.
  6. Get colleagues in the game. This is a good idea, but in many organizations in which I have worked “team” is spelled “my bonus.”
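
For the youthful wizards who ask what "clean up the data" looks like in practice, here is a minimal sketch of item 5. It is not from the article. The alias table, the function names, and the factor-of-ten threshold are all assumptions made for illustration, and the decimal check assumes positive values.

```python
import re
from statistics import median

# Hypothetical alias table; a real one would live in a curated data library.
ALIASES = {
    "ibm": "International Business Machines",
    "international business machines": "International Business Machines",
}

def normalize_company(name: str) -> str:
    """Map spelling variants such as 'I.B.M' to one canonical name."""
    key = re.sub(r"[^a-z0-9 ]", "", name.lower())  # drop punctuation
    key = re.sub(r"\s+", " ", key).strip()          # collapse whitespace
    return ALIASES.get(key, name.strip())           # unknown names pass through

def flag_decimal_slips(values, factor=10.0):
    """Return values at least `factor` times above or below the median,
    a crude hint that a decimal point may be misplaced (positive values only)."""
    mid = median(values)
    return [v for v in values if v >= mid * factor or v <= mid / factor]
```

Calling normalize_company on "I.B.M" and on "International Business Machines" yields the same canonical string, which is the whole point. The decimal check only flags suspects; a human still has to look at them.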

Useful checklist. I fear that those who color unicorns will not like the dog work which accompanies implementing the ideas. That’s what makes search and content processing so darned interesting.

Stephen E Arnold, March 10, 2017

ScyllaDB Version 1.3 Available

March 8, 2017

According to Scylla, their latest release is currently the fastest NoSQL database. We learn about the update from SiliconAngle’s article, “ScyllaDB Revamps NoSQL Database in 1.3 Release.” To support their claim, the company points to a performance benchmark test executed by the Yahoo Cloud Serving Benchmark project. That group compared ScyllaDB to the open source Cassandra database, and found Scylla to be 4.6 times faster than a standard Cassandra cluster.

Writer Mike Wheatley elaborates on the product:

ScyllaDB’s biggest differentiator is that it’s compatible with the Apache Cassandra database APIs. As such, the creators claims that ScyllaDB can be used as a drop-in replacement for Cassandra itself, offering users the benefit of improved performance and scale that comes from the integration with a light key/value store.

The company says the new release is geared towards development teams that have struggled with Big Data projects, and claims a number of performance advantages over more traditional development approach, including:

*10X throughput of baseline Cassandra – more than 1,000,000 CQL operations per second per node

*Sub 1msec 99% latency

*10X per-node storage capacity over Cassandra

*Self-tuning database: zero configuration needed to max out hardware

*Unparalleled high availability, native multi-datacenter awareness

*Drop-in replacement for Cassandra – no additional scripts or code required

Wheatley cites Scylla’s CTO when he points to better integration with graph databases and improved support for Thrift, Date Tiered Compaction Strategy, Large Partitions, Docker, and CQL tracing. I notice the company is hiring as of this writing. Don’t let the Tel Aviv location of Scylla’s headquarters stop you from applying if you don’t happen to live nearby; they note that their developers can work from anywhere in the world.

Cynthia Murrell, March 8, 2017

IBM and Root Access Misstep?

March 2, 2017

Maybe this is fake news? Maybe. Navigate to “Big Blue’s Big Blunder: IBM Accidentally Hands Over Root Access to Its Data Science Servers.” When I read the title, my first reaction was, “Hey, Yahoot is back in the security news.” Wrong.

According to the write up, which I assume to be exposing the “truth”:

IBM left private keys to the Docker host environment in its Data Science Experience service inside freely available containers. This potentially granted the cloud service’s users root access to the underlying container-hosting machines – and potentially to other machines in Big Blue’s Spark computing cluster. Effectively, Big Blue handed its cloud users the secrets needed to potentially commandeer and control its service’s computers.

IBM hopped to it. Two weeks after the stumble was discovered, IBM fixed the problem.

The write up includes this upbeat statement, attributed to the person using a demo account which exposed the glitch:

I think that IBM already has some amazing infosec people and a genuine commitment to protecting their services, and it’s a matter of instilling security culture and processes across their entire organization. That said, any company that has products allowing users to run untrusted code should think long and hard about their system architecture. This is not to imply that containers were poorly designed (because I don’t think they were), but more that they’re so new that best practices in their use are still being actively developed. Compare a newer-model table saw to one decades old: The new one comes stock with an abundance of safety features including emergency stopping, a riving knife, push sticks, etc, as a result of evolving culture and standards through time and understanding.

Bad news. Good news.

Let’s ask Watson about IBM security. Hold that thought, please. Watson is working on health care information. And don’t forget the March 2017 security conference sponsored by those security pros at IBM.

Stephen E Arnold, March 2, 2017

Bad Big Data? Get More Data Then

March 2, 2017

I like the idea that more is better. The idea is particularly magnetic when a company cannot figure out what its own, in house, proprietary data mean. Think of the legions of consultants from McKinsey and BCG telling executives what their own data “means.” Toss in the notion of Big Data in a giant “data lake,” and you have decision makers who cannot use the information they already have.

Well, how does one fix that problem? Easy. Get more data. That sounds like a plan, particularly when the professionals struggling are in charge of figuring out if sales and marketing investments sort of pay for themselves.

I learned that I need more data by reading “Deepening The Data Lake: How Second-Party Data Increases AI For Enterprises.” The headline introduces the amazing data lake concept along with two giant lake front developments: More data and artificial intelligence.

Buzzwords? Heck no. Just solid post millennial reasoning; for example:

there are many marketers with surprisingly sparse data, like the food marketer who does not get many website visitors or authenticated customers downloading coupons. Today, those marketers face a situation where they want to use data science to do user scoring and modeling but, because they only have enough of their own data to fill a shallow lake, they have trouble justifying the costs of scaling the approach in a way that moves the sales needle.

I like that sales needle phrase. Marketers have to justify themselves and many have only “sparse” data. I would suggest that marketers often have useless data like the number of unique clicks, but that’s only polluting the data lake.

The fix is interesting. I learned:

we can think of the marketer’s first-party data – media exposure data, email marketing data, website analytics data, etc. – being the water that fills a data lake. That data is pumped into a data management platform (pictured here as a hydroelectric dam), pumped like electricity through ad tech pipes (demand-side platforms, supply-side platforms and ad servers) and finally delivered to places where it is activated (in the town, where people live)… this infrastructure can exist with even a tiny bit of water but, at the end of the cycle, not enough electricity will be generated to create decent outcomes and sustain a data-driven approach to marketing. This is a long way of saying that the data itself, both in quality and quantity, is needed in ever-larger amounts to create the potential for better targeting and analytics.

Yep, more data.

And what about making sense of the additional data? I learned:

The data is also of extremely high provenance, and I would also be able to use that data in my own environment, where I could model it against my first-party data, such as site visitors or mobile IDs I gathered when I sponsored free Wi-Fi at the last Country Music Awards. The ability to gather and license those specific data sets and use them for modeling in a data lake is going to create massive outcomes in my addressable campaigns and give me an edge I cannot get using traditional ad network approaches with third-party segments. Moreover, the flexibility around data capture enables marketers to use highly disparate data sets, combine and normalize them with metadata – and not have to worry about mapping them to a predefined schema. The associative work happens after the query takes place. That means I don’t need a predefined schema in place for that data to become valuable – a way of saying that the inherent observational bias in traditional approaches (“country music fans love mainstream beer, so I’d better capture that”) never hinders the ability to activate against unforeseen insights.
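
The "no predefined schema" claim in that passage is what database folks call schema on read. Here is a minimal sketch of the idea, not the article's system. The sample records, the field names, and the query function are invented for illustration.

```python
import json

# Raw records from different hypothetical sources, stored as-is with no shared schema.
RAW_LAKE = [
    '{"source": "site", "visitor_id": "v1", "page": "/coupons"}',
    '{"source": "wifi", "mobile_id": "m7", "event": "Country Music Awards"}',
    '{"source": "site", "visitor_id": "v2", "page": "/home", "referrer": "email"}',
]

def query(raw_records, wanted_fields):
    """Parse each record at read time and keep only the fields the query asks for."""
    rows = []
    for line in raw_records:
        record = json.loads(line)  # structure is discovered here, not at load time
        row = {f: record[f] for f in wanted_fields if f in record}
        if row:  # skip records that carry none of the wanted fields
            rows.append(row)
    return rows
```

Nothing was mapped to a schema when the records were stored. The cost moves to query time, and so does the work of deciding what the fields mean.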

Okay, I think I understand. No wonder companies hire outfits like blue chip consulting firms to figure out what is going on in their companies. Stated another way, insiders live in the swamp. Outsiders can put the swamp into a context and maybe implement some pollution control systems.

Stephen E Arnold, March 2, 2017

Gradescope Cuts Grading Time in Half, Makes Teachers’ Lives 50% More Bearable

February 8, 2017

The article titled Professors of the World, Rejoice: Gradescope Brings AI to Grading on Nvidia might more correctly be titled: TAs of the World, Rejoice! In my experience, those hapless, hardworking, underpaid individuals are the ones doing most of the grunt work on college campuses. Any grad student who has faced a stack of essays or tests when their “real work” is calling knows the pain and redundancy of grading. Gradescope is an exciting innovation that cuts the time spent grading in half. The article explains,

The AI isn’t used to directly grade the papers; rather, it turns grading into an automated, highly repeatable exercise by learning to identify and group answers, and thus treat them as batches. Using an interface similar to a photo manager, instructors ensure that the automatically suggested answer groups are correct, and then score each answer with a rubric. In this way, input from users lets the AI continually improve its future predictions.
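
The group-then-score workflow in that excerpt can be sketched in a few lines. This is only an illustration under loud assumptions: Gradescope groups handwritten answers with a trained model, while the stand-in below groups typed answers by lowercasing and collapsing whitespace. The function names and rubric format are hypothetical.

```python
from collections import defaultdict

def group_answers(answers):
    """Group student answers whose normalized text matches.

    `answers` maps student name -> answer text. Normalization here is plain
    lowercasing and whitespace collapsing, a stand-in for a learned model.
    """
    groups = defaultdict(list)
    for student, text in answers.items():
        key = " ".join(text.lower().split())
        groups[key].append(student)
    return dict(groups)

def score_groups(groups, rubric):
    """Apply one rubric score per group, so each distinct answer is graded once."""
    scores = {}
    for key, students in groups.items():
        for student in students:
            scores[student] = rubric.get(key, 0)  # answers missing from the rubric score 0
    return scores
```

The time saving comes from the second function: the grader scores each batch once, not each paper.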

The trickiest part of this technology was handwriting recognition, and the Berkeley team used a “recurrent neural network trained using the Tesla K40 and GEForce GTX 980 Ti GPUs.” Interestingly, the app was initially created at least partly to prevent cheating. Students have been known to alter their answers after the fact and argue a failure of grading, so a digital record of the paper is extremely useful. This might sound like the end of teachers, but in reality it is the beginning of a giant, global teacher party!

Chelsea Kerwin, February 8, 2017

JustOne: When a Pivot Is Not Possible

February 4, 2017

CopperEye hit my radar when I did a project for the now-forgotten Speed of Mind search system. CopperEye delivered high speed search in a patented hierarchical data management system. The company snagged some In-Q-Tel interest in 2007, but by 2009, I lost track of the company. Several of the CopperEye senior managers teamed to create the JustOne database, search and analytic system. One of the new company’s inventions is documented in “Apparatus, Systems, and Methods for Data Storage and/or Retrieval Based on a Database Model-agnostic, Schema-Agnostic, and Workload-Agnostic Data Storage and Access Models.” If you are into patent documents about making sense of Big Data, you will find US20140317115 interesting. I will leave it to you to determine if there is any overlap between this system and method and those of the now low profile CopperEye.

Why would In-Q-Tel get interested in another database? From my point of view, CopperEye was interesting because:

  1. The system and method was ideal for finding information from large collections of intercept information
  2. The tech whiz behind the JustOne system wanted to avoid “band-aid” architectures; that is, software shims, wrappers, and workarounds that other data management and information access systems generated like rabbits
  3. The method of finding information achieved or exceeded the performance of the very, very snappy Speed of Mind system
  4. The system sidestepped a number of the problems which plague Oracle-style databases trying to deal with floods of real time information from telecommunication traffic, surveillance, and Internet of Things transmissions or “emissions.”

How important is JustOne? I think the company is one of those outfits which has a better mousetrap. Unlike the champions of XML, JustOne uses JSON and other “open” technologies. In fact, a useful version of the JustOne system is available for download from the JustOne Web site. Be aware that the name “JustOne” is in use by other vendors.


The fragmented world of database and information access. Source: Duncan Pauly

A good, but older, write up explains some of the strengths of the JustOne approach to search and retrieval couched in the lingo of the database world. The key points from “The Evolution of Data Management” strike me as helpful in understanding why Jerry Yang and Scott McNealy invested in the CopperEye veterans’ start up. I highlighted these points:

  • Databases have to be operational and analytical; that is, storing information is not enough
  • Transaction rates are high; that is, real time flows from telecommunications activity
  • Transaction size varies from the very small to hefty; that is, the opposite of the old school records associated with the old school IBM IMS system
  • High concurrency; that is, more than one “thing” at a time
  • Dynamic schema and query definition

I highlighted this statement as suggestive:

In scaled-out environments, transactions need to be able to choose what guarantees they require – rather than enforcing or relaxing ACID constraints across a whole database. Each transaction should be able to decide how synchronous, atomic or durable it needs to be and how it must interact with other transactions. For example, must a transaction be applied in chronological order or can it be allowed out of time order with other transactions providing the cumulative result remains the same? Not all transactions need be rigorously ACID and likewise not all transactions can afford to be non-atomic or potentially inconsistent.
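
To make the per-transaction idea concrete, here is a toy sketch in which each transaction states the guarantees it wants. This is not JustOne's design. The class names, the two flags, and the in-memory "log" are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Txn:
    timestamp: int
    key: str
    delta: int
    ordered: bool = False   # must it be applied in chronological order?
    durable: bool = True    # must it be logged before it is acknowledged?

class ToyStore:
    def __init__(self):
        self.balances = {}
        self.log = []              # stand-in for a durable write-ahead log
        self.last_ordered_ts = -1

    def apply(self, txn: Txn) -> bool:
        """Apply a transaction, honoring only the guarantees it asked for."""
        if txn.ordered:
            if txn.timestamp <= self.last_ordered_ts:
                return False       # reject: would violate chronological order
            self.last_ordered_ts = txn.timestamp
        if txn.durable:
            self.log.append(txn)
        # Additive updates commute, so unordered ones may arrive in any order.
        self.balances[txn.key] = self.balances.get(txn.key, 0) + txn.delta
        return True
```

Two unordered increments give the same cumulative result whichever arrives first, which is the case the quotation describes. An ordered transaction that arrives late is refused instead of being applied out of sequence.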

My take on this CopperEye wind down and JustOne wind up is that CopperEye, for whatever management reason, was not able to pivot from where CopperEye was to where CopperEye had to be to grow. More information is available from the JustOne Database Web site at

Is Duncan Pauly one of the most innovative engineers laboring in the database search sector? Could be.

Stephen E Arnold, February 4, 2017

Google and the Cloud Take on Corporate Database Management

February 1, 2017

The article titled Google Cloud Platform Releases New Database Services, Fighting AWS and Azure for Corporate Customers on GeekWire suggests that Google’s corporate offerings have been weak in the area of database management. Compared to Amazon Web Services and Microsoft Azure, Google is only wading into the somewhat monotonous arena of corporate database needs. The article goes into detail on the offerings,

Cloud SQL, Second Generation, is a service offering instances of the popular MySQL database. It’s most comparable to AWS’s Aurora and SQL Azure, though there are some differences from SQL Azure, so Microsoft allows running a MySQL database on Azure. Google’s Cloud SQL supports MySQL 5.7, point-in-time recovery, automatic storage resizing and one-click failover replicas, the company said. Cloud Bigtable is a NoSQL database, the same one that powers Google’s own search, analytics, maps and Gmail.

The Cloud Bigtable database is made to handle major workloads of 100+ petabytes, and it comes equipped with resources such as Hadoop and Spark. It will be fun to see what happens as Google’s new service offering hits the ground running. How will Amazon and Microsoft react? Will price wars arise? If so, only good can come of it, at least for the corporate consumers.

Chelsea Kerwin, February 1, 2017

Et Tu, Brutus? Oracle Database on the Way Out

January 10, 2017

I read “NoSQL to Undo Oracle’s Database Reign.” The author is a person who once worked at Oracle. Like Brutus, the author knows Julius Caesar. Sorry, I meant the jet loving, top dog at Oracle.

The tussle between Oracle and MarkLogic seems likely to continue in 2017. The write up explains that Oracle has become a lot like IBM. I learned:

Like IBM did in the past, Oracle and the other incumbents are adding features to old technologies in an attempt to meet today’s challenges — features such as in-memory, graph, JSON and XML support. None of them have changed their underlying architectures so their efforts will fall short, just as IBM’s did in the last generational shift of the database industry 35 years ago. What’s more, their widely publicized moves of shifting old technology to the cloud changes the deployment model but doesn’t help solve the modern data challenges their customers are facing. An outdated database technology on the cloud is still an outdated database.

The new champion of the data management world is MarkLogic, the outfit where Gary Bloom labors. MarkLogic, I concluded, is one of the “emergent winners.”

That’s good.

MarkLogic is an XML centric data management system. XML is ideal for slicing and dicing once the data have been converted to validated XML. For some folks, changing a legacy AS/400 Ironside output into XML might be interesting. But it seems that MarkLogic has cracked the data conversion, transformation, extraction, and loading processes. Anyone can do it. Perhaps not everyone because there are some proprietary tweaks to the open source methods required by the MarkLogic system. No problem, but volume, time, and cost constraints might be an issue for some use cases.

I noted this passage in the undated write up:

There is definitely shake out of the NoSQL vendors and MarkLogic is one of the emergent victors. As an enterprise-ready NoSQL database that handles multiple models natively and doesn’t care if you have two or hundreds of data silos, MarkLogic is becoming the database platform for those with complex data integration problems. In fact, some companies are skipping the relational generation altogether and going straight from the mainframe to NoSQL. Virginia’s Fairfax County recently migrated years of historical data from its 30-year-old mainframe system to MarkLogic’s NoSQL. Residents and employees can now more easily and quickly search all the data—including property records going back to the 1950s and both old and new data coming from multiple data silos.

MarkLogic, however, is no spring chicken. The company was founded in 2001, which works out to 16 years old. Oh, you might recall that the total equity funding is $173.23 million with the most recent round contributing $102 million in May 2015 if the Crunchbase data are on the money. Some of that $102 million came from Gary Bloom, the author of the write up. (No wonder he is optimistic about MarkLogic. Hope is better than fear that one might have to go look for another job.)

My view is that MarkLogic wants a big fight with Oracle. That adds some zip to what is one of the less magnetic types of software in a business world excited by Amazon,  Google, Facebook, Tesla, and Uber. Personally I find data management exciting, but I gravitate to the systems and methods articulated by Googler Ramanathan Guha. Your mileage may vary.

The challenge for MarkLogic is to generate sufficient sustainable revenue to achieve one of these outcomes:

  1. A sale of the company to a firm which believes in the XML tinted world of the XML rock stars. (Yes, there is an XML rock star video at this link.) Obviously the folks watching their $173 million grow into a huge payday would find a lucrative sale worthy of a happy face emoji.
  2. A surge in the number of companies convinced that MarkLogic, and not an open source, no license fee alternative, is the answer, with those companies writing checks for multi year licenses and six figure service deals. Rapid revenue growth and high margin services may not get the $173 million back, but life would be less stressful if those numbers soar.
  3. MarkLogic goes public fueled in part by a PR battle with Oracle.

Will systems like MarkLogic’s become the future of next generation operational and transaction systems? MarkLogic believes NoSQL is the future. Will Oracle wake up and buy MarkLogic? Will Google realize its error when it passed on a MarkLogic buy out? Will Amazon figure out that life will be better without the home brew approach to data management that Amazon has taken since it shifted from an Oracle type fixation? Will Facebook see MarkLogic as a solution to some of its open source data management hassles?

Here in Harrod’s Creek, we still remember the days when MarkLogic was explaining that it was an enterprise search system, an analytics system, and a content production system. A database can be many things. The one important characteristic, however, is that the data management system generate substantial revenue and juicy profits.

Stephen E Arnold, January 10, 2017

Smarter Content for Contentier Intelligence

December 28, 2016

I spotted a tweet about making smart content smarter. It seems that if content is smarter, then intelligence becomes contentier. I loved my logic class in 1962.

Here’s the diagram from this tweet. Hey, if the link is wonky, just attend the conference and imbibe the intelligence directly, gentle reader.


The diagram carries the identifier Data Ninja, which echoes Palantir’s use of the word ninja for some of its Hobbits. Data Ninja’s diagram has three parts. I want to focus on the middle part:


What I found interesting is that instead of a single block labeled “content processing,” the content processing function is broken into several parts. These are:

A Data Ninja API

A Data Ninja “knowledgebase,” which I think is an iPhrase-type or TeraText type of method. Not familiar with iPhrase and TeraText? Feel free to browse the descriptions at the links.

A third component in the top box is the statement “analyze unstructured text.” This may refer to indexing and such goodies as entity extraction.

The second box performs “text analysis.” Obviously this process is different from the “analyze unstructured text” step; otherwise, why run the same analyses again? The second box performs what may be clustering of content into specific domains. This is important because a “terminal” in transportation may be different from a “terminal” in a cloud hosting facility. Disambiguation is important because the terminal may be part of a diversified transportation company’s computing infrastructure. I assume Data Ninja’s methods handle this parsing of “concepts” without many errors.
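
The “terminal” problem can be illustrated with a crude keyword overlap sketch. Data Ninja's actual method is not described in the diagram, so everything here is an assumption: the two domain vocabularies, the function name, and the scoring rule are invented for illustration.

```python
# Hypothetical domain vocabularies; a production system would learn or curate these.
DOMAIN_TERMS = {
    "transportation": {"bus", "freight", "passenger", "rail", "airport", "cargo"},
    "computing": {"server", "cloud", "shell", "login", "hosting", "command"},
}

def guess_domain(text: str) -> str:
    """Pick the domain whose vocabulary overlaps most with the words in `text`."""
    words = set(text.lower().replace(",", " ").replace(".", " ").split())
    scores = {domain: len(words & terms) for domain, terms in DOMAIN_TERMS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

A sentence about a bus terminal near the airport lands in transportation, and one about logging in to a cloud server lands in computing. A diversified transportation company's server room would score in both domains, which is exactly where a method this simple starts making errors.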

Once the selection of a domain area has been performed, the system appears to perform four specific types of operations as the Data Ninja practice their katas. These are the smart components:

  • Smart sentiment; that is, is the content object weighted “positive” or “negative”, “happy” or “sad”, or green light or red light, etc.
  • Smart data; that is, I am not sure what this means
  • Smart content; that is, maybe a misclassification because the end result should be smart content, but the diagram shows smart content as a subcomponent within the collection of procedures/assertions in the middle part of the diagram
  • Smart learning; that is, the Data Ninja system is infused with artificial intelligence, smart software, or machine learning (perhaps the three buzzwords are combined in practice, not just in diagram labeling?)

The end result is an iPhrase-type representation of data. (Note that this approach infuses TeraText, MarkLogic, and other systems which transform unstructured data to metadata tagged structured information.)

The diagram then shows a range of services “plugging” into the box performing the functions referenced in my description of the middle box.

If the system works as depicted, Data Ninjas may have the solution to the federation challenge which many organizations face. Smarter content should deliver contentier intelligence or something along that line.

Stephen E Arnold, December 28, 2016

On the Hunt for Thesauri

December 15, 2016

How do you create a taxonomy? These curated lists do not just write themselves, although they seem to do that these days.  Companies that specialize in file management and organization develop taxonomies.  Usually they offer customers an out-of-the-box option that can be individualized with additional words, categories, etc.  Taxonomies can be generalized lists, think of a one size fits all deal.  Certain industries, however, need specialized taxonomies that include words, phrases, and other jargon particular to that field.  Similar to the generalized taxonomies, there are canned industry specific taxonomies, except the more specialized the industry the less likely there is a canned list.

This is where the taxonomy lists need to be created from scratch.  Where do the taxonomy writers get the content for their lists?  They turn to the tried, true resources that have aided researchers for generations: dictionaries, encyclopedias, technical manuals, and thesauri.  Thesauri are perhaps among the most important tools for taxonomy writers, because they include not only words and their meanings, but also the synonyms and antonyms of words within a field.
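
As a rough illustration of why a thesaurus helps, here is a minimal sketch of a thesaurus entry and an index that maps every synonym back to one preferred term.  This is not how MultiTes or any particular product stores its data.  The field names and the sample entry are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    term: str                                    # preferred term for the taxonomy
    meaning: str
    synonyms: set = field(default_factory=set)
    antonyms: set = field(default_factory=set)

def build_index(entries):
    """Map every preferred term and synonym (lowercased) to its preferred term."""
    index = {}
    for entry in entries:
        index[entry.term.lower()] = entry.term
        for syn in entry.synonyms:
            index[syn.lower()] = entry.term
    return index
```

With an index like this, a document that says "heart attack" and one that says "myocardial infarction" end up under the same taxonomy node.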

If you need to write a taxonomy and are at a loss, check out MultiTes.  It is a Web site that includes tools and other resources to get your taxonomy job done.  Multisystems built MultiTes and they:

…developed our first computer program for Thesaurus Management on PC’s in 1983, using dBase II under CPM, predecessor of the DOS operating system.  Today, more than three decades later, our products are as easy to install and use. In addition, with MultiTes Online all that is needed is a web connected device with a modern web browser.

In other words, they have experience and know their taxonomies.

Whitney Grace, December 15, 2016
