July 18, 2014
The article titled Call Me Maybe: Elasticsearch on Aphyr explores potential issues with Elasticsearch. Jepsen is a series on Aphyr that tests how different data stores and software behave under various types of network failure. Elasticsearch is built on the solid Java indexing library Apache Lucene. The article begins with an overview of how Elasticsearch scales through sharding and replication.
“The document space is sharded–sliced up–into many disjoint chunks, and each chunk allocated to different nodes. Adding more nodes allows Elasticsearch to store a document space larger than any single node could handle, and offers quasilinear increases in throughput and capacity with additional nodes. For fault-tolerance, each shard is replicated to multiple nodes. If one node fails or becomes unavailable, another can take over…Because index construction is a somewhat expensive process, Elasticsearch provides a faster database backed by a write-ahead log.”
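The sharding-plus-replication scheme in the quote is configured per index. A minimal sketch of the settings an Elasticsearch client would send when creating an index (the index name and the shard and replica counts here are invented for illustration):

```python
import json

# Hypothetical settings: five primary shards, each replicated once.
# With one replica, every shard lives on two nodes, so a single node
# failure leaves a copy available to take over.
settings = {
    "settings": {
        "number_of_shards": 5,     # disjoint slices of the document space
        "number_of_replicas": 1,   # extra copies for fault tolerance
    }
}

# A client would PUT this JSON body to the cluster, e.g. PUT /articles
body = json.dumps(settings)
print(body)
```

Adding nodes lets the cluster spread those shards and replicas out, which is where the quasilinear gains in throughput and capacity come from.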
Over a series of tests (with results summarized by delightful Barbie and Ken doll memes), the article concludes that while version control may be considered a “lost cause,” Elasticsearch handles inserts superbly. For more information on how Elasticsearch behaved through speed bumps, building a nemesis, nontransitive partitions, needless data loss, random and fixed transitive partitions, and more, read the full article. It ends with recommendations for Elasticsearch and for users, and concedes that the post provides far more information on Elasticsearch than anyone would ever desire.
Chelsea Kerwin, July 18, 2014
July 17, 2014
The article titled Scoop: A Glimpse Into the NYTimes CMS on the New York Times Blog discusses the importance of Content Management Systems (CMS) for the future of journalism. Recently, journalist Ezra Klein reportedly left The Washington Post for Vox Media largely for Vox’s preferable CMS. The NYT has its own CMS called Scoop, described in the article,
“…It is a system for managing content and publishing data so that other applications can render the content across our platforms. This separation of functions gives development teams at The Times the freedom to build solutions on top of that data independently, allowing us to move faster than if Scoop were one monolithic system. For example, our commenting platform and recommendations engine integrate with Scoop but remain separate applications.”
So it does seem that there is some wheel reinventing going on at the NYT. The article outlines the major changes that Scoop has undergone in the past few years, with live article editing that sounds like Google Docs, tagging, notifications, and simplified processes for adding photographs and multimedia. While there is some debate about where Scoop stands on the list of Content Management Systems, the Times certainly has invested in it for the long haul.
Chelsea Kerwin, July 17, 2014
July 8, 2014
I read an interview conducted by the consulting firm PWC. The interview appeared with the title “Making Hadoop Suitable for Enterprise Data Science.” The interview struck me as important for two reasons. First, the questioner and the interview subject introduce a number of buzzwords and business generalizations that will be bandied about in the near future. Second, the interview provides a glimpse of the fish with sharp teeth that swim in what seems to be a halcyon data lake. With Hadoop goodness replenishing the “data pond,” Big Data is a life-sustaining force. That’s the theory.
The interview subject is Mike Lang, the CEO of Revelytix. (I am not familiar with Revelytix, and I don’t know how to pronounce the company’s name.) The interviewer is one of those tag teams that high-end consulting firms deploy to generate “real” information. Big-time consulting firms publish magazines, emulating the McKinsey Quarterly. The idea is that Big Ideas need to be explained so that MBAs can convert information into anxiety among prospects. The purpose of these bespoke business magazines is to close deals and highlight technologies that may be recommended to a consulting firm’s customers. Some quasi-consulting firms borrow other people’s work. For an example of this shortcut approach, see the IDC Schubmehl write up.
Several key buzzwords appear in the interview:
- Nimble. Once data are in Hadoop, the Big Data software system has to be quick and light in movement or action. Sounds very good, especially for folks dealing with Big Data. So with Hadoop one has to use “nimble analytics.” Also sounds good. I am not sure what a “nimble analytic” is, but, hey, do not slow down generality machines with details, please.
- Data lakes. These are “pools” of data from different sources. Once data is in a Hadoop “data lake”, every water or data molecule is the same. It’s just like chemistry sort of…maybe.
- A dump. This is a mixed metaphor, but it seems that PWC wants me to put my heterogeneous data, which is now like water molecules, in a “dump.” A mixed metaphor, is it not? Again, a mere detail. A data lake has dumps or a dump has data lakes. I am not sure which has what. Trivial and irrelevant, of course.
- Data schema. To make data fit a schema with an old fashioned system like Oracle, it takes time. With a data lake and a dump, someone smashes up data and shapes it. Here’s the magic: “They might choose one table and spend quite a bit of time understanding and cleaning up that table and getting the data into a shape that can be used in their tool. They might do that across three different files in HDFS [Hadoop Distributed File System]. But, they clean it as they’re developing their model, they shape it, and at the very end both the model and the schema come together to produce the analytics.” Yep, magic.
- Predictive analytics, not just old boring statistics. The idea is that with a “large scale data lake”, someone can make predictions. Here’s some color on predictive analytics: “This new generation of processing platforms focuses on analytics. That problem right there is an analytical problem, and it’s predictive in its nature. The tools to help with that are just now emerging. They will get much better about helping data scientists and other users. Metadata management capabilities in these highly distributed big data platforms will become crucial—not nice-to-have capabilities, but I-can’t-do-my-work-without-them capabilities. There’s a sea of data.”
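The “data schema” bullet describes schema-on-read: clean and shape raw records while building the model, and let the schema emerge at the end. A rough sketch of that workflow (the records and field names are invented; this is not Revelytix or PWC code):

```python
# Raw, heterogeneous records as they might sit in a "data lake" (invented data).
raw = [
    {"name": " Alice ", "revenue": "1200"},
    {"name": "Bob", "revenue": None},
    {"name": "Carol", "revenue": "950"},
]

def shape(record):
    """Clean one record into the schema the analysis needs (schema-on-read)."""
    return {
        "name": record["name"].strip(),
        "revenue": float(record["revenue"]) if record["revenue"] else 0.0,
    }

# The schema and the model come together only at analysis time.
shaped = [shape(r) for r in raw]
total = sum(r["revenue"] for r in shaped)
print(total)  # 2150.0
```

The point being sold is that the shaping logic lives next to the model instead of in an up-front, Oracle-style schema.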
My take is that PWC is going to bang the drum for Hadoop. Never mind that Hadoop may not be the Swiss Army knife that some folks want it to be. I don’t want to rain on the parade, but Hadoop requires some specialized skills. Fancy math requires more specialized skills. Interpretation of the outputs from data lakes and predictive systems requires even more specialized skills.
No problem as long as the money lake is sufficiently deep, broad, and full.
The search for a silver bullet continues. That’s what makes search and content processing so easy. Unfortunately the buzzwords may not deliver the type of results that inform decisions. Fill that money lake because it feeds the dump.
Stephen E Arnold, July 7, 2014
July 8, 2014
The article on FlowingData titled How to Make Government Data Sites Better uses the Centers for Disease Control website to illustrate measures the government should take to make its data more accessible and manageable. The first suggestion is to provide files in a usable format. By avoiding PDFs and providing CSV files (or even raw data), the user will be in a much better position to work with the data. Another suggestion is simply losing, or simplifying, the multipart form that makes search nearly impossible. The author also proposes clearer and more consistent annotation, using the following scenario to illustrate the point,
“The CDC data subdomain makes use of the Socrata Open Data API,… It’s weekly data that has been updated regularly for the past few months. There’s an RSS feed. There’s an API. There’s a lot to like… There’s also a lot of variables without much annotation or metadata … When you share data, tell people where the data is from, the methodology behind it, and how we should interpret it. At the very least, include a link to a report in the vicinity of the dataset.”
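The “usable format” suggestion is concrete: a CSV file with a header row can be loaded in a few lines of standard-library code, where a table locked inside a PDF cannot. A minimal sketch (the weekly counts below are invented, not CDC data):

```python
import csv
import io

# A well-formed CSV with a header row is immediately machine-readable.
data = io.StringIO("week,cases\n2014-06-01,12\n2014-06-08,9\n")
rows = list(csv.DictReader(data))

for row in rows:
    print(row["week"], int(row["cases"]))
```

No scraping, no form-filling: the column names travel with the data, which is most of what the author is asking for.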
Overall, the author makes many salient points about transparency, consistency and clutter. But there is an assumption in the article that the government actually desires to make data sites better, which may be the larger question. If no one implements these ideas, perhaps that will be answer enough.
Chelsea Kerwin, July 08, 2014
July 7, 2014
The data-analysis work of recently prominent economist Thomas Piketty receives another whack, this time from computer scientist and blogger Daniel Lemire in, “You Shouldn’t Use a Spreadsheet for Important Work (I Mean It).” Piketty is not alone in Lemire’s reproach; last year, he took Harvard-based economists Carmen Reinhart and Kenneth Rogoff to task for building their influential 2010 paper on an Excel spreadsheet.
The article begins by observing that Piketty’s point, that in today’s world the rich get richer and the poor poorer, is widely made but difficult to prove. Though he seems to applaud Piketty’s attempt to do so, Lemire really wishes the economist had chosen specialized software, like Stata, SAS, or “even” R or Fortran. He writes:
“What is remarkable regarding Piketty’s work, is that he backed his work with comprehensive data and thorough analysis. Unfortunately, like too many people, Piketty used spreadsheets instead of writing sane software. On the plus side, he published his code… on the negative side, it appears that Piketty’s code contains mistakes, fudging and other problems….
“I will happily use a spreadsheet to estimate the grades of my students, my retirement savings, or how much tax I paid last year… but I will not use Microsoft Excel to run a bank or to compute the trajectory of the space shuttle. Spreadsheets are convenient but error prone. They are at their best when errors are of little consequence or when problems are simple. It looks to me like Piketty was doing complicated work and he bet his career on the accuracy of his results.”
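Part of Lemire’s “sane software” point is that code, unlike a spreadsheet, can carry explicit checks that fail loudly when something is wrong. A toy illustration (the wealth-share figures are invented, not Piketty’s data):

```python
# Wealth shares for invented population groups; in a spreadsheet, an
# off-by-one cell range could silently drop a row with no warning.
shares = {"top 1%": 0.22, "next 9%": 0.38, "bottom 90%": 0.40}

total = sum(shares.values())
# An explicit sanity check: the shares must account for everyone.
assert abs(total - 1.0) < 1e-9, f"shares sum to {total}, not 1.0"
print(total)
```

The assertion is the difference: drop a group and the program refuses to run, rather than quietly producing a plausible-looking number.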
The write-up notes that Piketty admits there are mistakes in his work, but asserts they are “probably inconsequential.” That’s missing the point, says Lemire, who insists that a responsible data analyst would have taken more time to ensure accuracy. My parents always advised me to use the right tool for a job: that initial choice can make a big difference in the outcome. It seems economists may want to heed that common (and common sense) advice.
Cynthia Murrell, July 07, 2014
June 25, 2014
Data is no longer just facts, figures, and black and white graphs. Data visualizations are becoming an increasingly important way that data (and even Big Data) is demonstrated and communicated. A few data visualization solutions are making big waves, and Visage is one on the rise. It is highlighted in the FastCompany article, “A Tool For Building Beautiful Data Visualizations.”
The article begins:
“Visage, a newly launched platform, provides custom templates for graphics. There are myriad tools on the market that do this (for a gander at 30 of them, check out this list), but Visage is the latest, and it’s gaining traction with designers at Mashable, MSNBC, and A&E. That’s due in part to Visage’s offerings, which are designed to be more flexible, and more personalized, than other services.”
More and more companies are working on ways to help organizations decipher and make sense of Big Data. But what good is the information if it cannot be effectively communicated? This is where data visualizations come in – helping to communicate complex data through clean visuals.
Emily Rae Aldridge, June 25, 2014
June 20, 2014
In February 2014, NJTC TechWire wrote an article on “Connotate Announces 25% YOY Growth In Total Contract Value For 2013.” Connotate has made a name for itself as a leading provider of Web data extraction and monitoring solutions. The company’s revenue grew 25% in 2013, and among other positives for Connotate were the release of Connotate 4.0, a new Web site, and new multi-year deal renewals. On top of the record growth, BIIA reports that “Connotate Launches Connotate4,” a Web browser that simplifies and streamlines Web data extraction. Connotate4 will do more than provide users with a custom browser:
- “Inline data transformations within the Agent development process is a powerful new capability that will ease data integration and customization.
- Enhanced change detection with highlighting can be requested during the Agent development process via a simple point-and-click checkbox, enabling highlighted change detection that is easily illustrated at the character, word or phrase level.
- Parallel extraction tasks make it faster to complete tasks, allowing even more scalability for even larger extractions.
- Build and expand capabilities turn the act of re-using a single Agent for related extraction tasks into a one-click event, allowing for faster Agent creation.
- A simplified user interface enabling simplified and faster Agent development.”
Connotate brags that the new browser will give users access to around 95% of Web data and is adaptable as new technologies emerge. Connotate aims to place itself in the next wave of indispensable enterprise tools.
June 13, 2014
An article on Gigaom is titled Michael Stonebraker’s New Startup, Tamr, Wants to Help Get Messy Data in Shape. With help ($16 million) from Google Ventures and New Enterprise Associates, Stonebraker and partner Andy Palmer are working to crack the ongoing problem of data transformation and normalization. The article explains,
“Essentially, the Tamr tool is a data cleanup automation tool. The machine-learning algorithms and software can do the dirty work of organizing messy data sets that would otherwise take a person thousands of hours to do the same, Palmer said. It’s an especially big problem for older companies whose data is often jumbled up in numerous data sources and in need of better organization in order for any data analytic tool to actually work with it.”
Attempting to give machines some human-like insight into repetitive cleanup work just might be the trick. Tamr does still require a human in the management seat, known as the data steward: someone who reads the results of a proposed comparison between two separate data sets and decides whether it is a good match. Tamr has been compared to Trifacta, but Palmer insists that Tamr is preferable for its ability to compare thousands of data sources with a data steward’s oversight. He also noted that Trifacta co-founder Joe Hellerstein was a student of Stonebraker’s in a PhD program.
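The workflow described, in which an algorithm proposes a match between records from two sources and a human data steward accepts or rejects it, can be sketched with standard-library fuzzy matching (the records are invented, and this is not Tamr’s actual algorithm):

```python
from difflib import SequenceMatcher

# The same entity as recorded in two separate, messy sources (invented data).
source_a = "Acme Corp., 123 Main St, Springfield"
source_b = "ACME Corporation, 123 Main Street, Springfield"

# The machine proposes a match with a similarity score...
score = SequenceMatcher(None, source_a.lower(), source_b.lower()).ratio()
proposed = score > 0.7  # hypothetical threshold

# ...and the data steward gets the final say on borderline pairs.
print(f"similarity {score:.2f}, propose match: {proposed}")
```

The automation does the thousands of hours of pairwise comparison; the steward only reviews the proposals, which is the division of labor Palmer describes.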
Chelsea Kerwin, June 13, 2014
June 10, 2014
At this year’s Gigaom Structure Data conference, Palantir’s Ari Gesher offered an apt parallel for the data field’s current growing pains: using computers before the dawn of operating systems. Gigaom summarizes his explanation in, “Palantir: Big Data Needs to Get Even More Abstract(ions).” Writer Tom Krazit tells us:
“Gesher took attendees on a bit of a computer history lesson, recalling how computers once required their users to manually reconfigure the machine each time they wanted to run a new program. This took a fair amount of time and effort: ‘if you wanted to use a computer to solve a problem, most of the effort went into organizing the pieces of hardware instead of doing what you wanted to do.’
“Operating systems brought abstraction, or a way to separate the busy work from the higher-level duties assigned to the computer. This is the foundation of modern computing, but it’s not widely used in the practice of data science.
“In other words, the current state of data science is like ‘yak shaving,’ a techie meme for a situation in which a bunch of tedious tasks that appear pointless actually solve a greater problem. ‘We need operating system abstractions for data problems,’ Gesher said.”
An operating system for data analysis? That’s one way to look at it, I suppose. The article invites us to click through to a video of the session, but as of this writing it is not functioning. Perhaps they will heed the request of one commenter and fix it soon.
Based in Palo Alto, California, Palantir focuses on improving the methods their customers use to analyze data. The company was founded in 2004 by some folks from PayPal and from Stanford University. The write-up makes a point of noting that Palantir is “notoriously secretive” and that part(s) of the U.S. government can be found among its clients. I’m not exactly sure, though, how that ties into Gesher’s observations. Does Krazit suspect it is the federal government calling for better organization and a simplified user experience? Now, that would be interesting.
Cynthia Murrell, June 10, 2014
June 2, 2014
In the fast moving world of technology, updated resources are especially important. The Data Journalism Handbook is a new one that is worth a second look. Available in a variety of languages, the handbook aims to be a primer for the emerging world of data journalism.
The overview states:
“The Data Journalism Handbook is a free, open source reference book for anyone interested in the emerging field of data journalism. It was born at a 48 hour workshop at MozFest 2011 in London. It subsequently spilled over into an international, collaborative effort involving dozens of data journalism’s leading advocates and best practitioners.”
Freely available online via a Creative Commons license, the handbook is an initiative of the European Journalism Centre. Download your free copy today to see if data journalism is a field in which you can participate.
Emily Rae Aldridge, June 02, 2014