August 14, 2014
Data integration from an old system to a new system is coded for trouble. The system swallowing the data is guaranteed to have indigestion, and the only way to relieve the problem is by burping. Chez Brochez has dealt with his own share of data integration issues, and in his article, “Tips and Tricks For Optimizing Oracle Endeca Data Ingestion,” he details some of the best ways to burp.
Here he explains why he wrote the blog post:
“More than once I’ve been on a client site to try to deal with a data build that was either taking too long, or was no longer completing successfully. The handraulic analysis to figure out what was causing the issues can take a long time. The rewards however are tremendous. Not simply fixing a build that was failing, but in some cases cutting the time demand in half meant a job could be run overnight rather than scheduled for weekends. In some cases verifying with the business users what attributes are loaded and how they are interacted with can make their lives easier.”
While the post focuses on Oracle Endeca, skimming through the tips will benefit anyone working with data. Many of them are common sense, such as having data integrations do the heavy lifting in off-hours and shutting down competing programs. Others require more in-depth knowledge. It boils down to this: getting content into an old-school system requires a couple of simple steps and a lot of quite complex ones.
August 12, 2014
An article on the Library Journal Infodocket is titled Co-Founder of Vivisimo Launches “OnlyBoth” and It’s Super Cool! The article continues in this entirely unbiased vein. OnlyBoth, it explains, was created by Raul Valdes-Perez and Andre Lessa. It offers an automated process of finding data and delivering it to the user in perfect English. The article states,
“What does OnlyBoth do? Actions speak louder than words so go take a look but in a nutshell, OnlyBoth can mine a dataset, discover insights, and then write what it finds in grammatically correct sentences. The entire process is automated. At launch, OnlyBoth offers an application providing insights on 3,122 U.S. colleges and universities described by 190 attributes. Entries also include a list of similar and neighboring institutions. More applications are forthcoming.”
The article suggests that this technology will easily lend itself to more applications; for now, it is limited to presenting the facts about colleges and baseball in perfect English. The idea is called “niche finding,” which Valdes-Perez developed in the early 2000s and never finished. The technology focuses on factual data that requires some reasoning. For example, the OnlyBoth website suggests that the insight “If California were a country, it would be the tenth biggest in the world” is a more complicated piece of information than a simple fact like the population of California. OnlyBoth promises that more applications are forthcoming.
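The kind of output described might be sketched as follows; this is purely illustrative (the dataset, attribute names, and phrasing are invented, not OnlyBoth’s actual method): scan a small table for an attribute on which one entity stands alone, then phrase the finding as an English sentence.

```python
# A hypothetical sketch of automated insight-finding: look for a record
# that uniquely maximizes an attribute, then render the finding as a
# grammatical sentence. All data and wording here are invented.

colleges = [
    {"name": "Alpha College", "students": 4200},
    {"name": "Beta University", "students": 31000},
    {"name": "Gamma Institute", "students": 900},
]

def superlative_insights(records, attribute, phrasing):
    """Yield a sentence for the record that uniquely maximizes the attribute."""
    top = max(records, key=lambda r: r[attribute])
    others = [r for r in records if r is not top]
    if all(top[attribute] > r[attribute] for r in others):
        yield phrasing.format(name=top["name"], value=top[attribute])

for sentence in superlative_insights(
        colleges, "students",
        "{name} enrolls more students ({value:,}) than any peer in the set."):
    print(sentence)
```

The interesting work in a real system would be deciding which of the many true sentences are worth saying; the sketch only shows the mechanical part.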
Chelsea Kerwin, August 12, 2014
July 18, 2014
The article titled Call Me Maybe: Elasticsearch on Aphyr explores potential issues with Elasticsearch. Jepsen is a section of Aphyr that tests how different technologies and software behave under various types of network failure. Elasticsearch is built on the solid Java indexing library Apache Lucene. The article begins with an overview of how Elasticsearch scales through sharding and replication.
“The document space is sharded–sliced up–into many disjoint chunks, and each chunk allocated to different nodes. Adding more nodes allows Elasticsearch to store a document space larger than any single node could handle, and offers quasilinear increases in throughput and capacity with additional nodes. For fault-tolerance, each shard is replicated to multiple nodes. If one node fails or becomes unavailable, another can take over…Because index construction is a somewhat expensive process, Elasticsearch provides a faster database backed by a write-ahead log.”
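The sharding idea in the quote is easy to sketch. The following is a minimal illustration (not Elasticsearch’s actual routing code; the shard count and node names are invented) of hashing a document id to one of N disjoint shards and assigning each shard a primary node plus a replica:

```python
# A minimal sketch of shard routing and replica placement, in the spirit
# of the quote above. This is illustrative only, not Elasticsearch's code.

import hashlib

NUM_SHARDS = 5
NODES = ["node-a", "node-b", "node-c"]
REPLICAS = 1  # one replica per shard, kept on a different node than the primary

def shard_for(doc_id: str) -> int:
    """Deterministically map a document id to a shard."""
    digest = hashlib.md5(doc_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def nodes_for(shard: int) -> list:
    """Primary plus replica nodes for a shard (simple round-robin layout)."""
    return [NODES[(shard + i) % len(NODES)] for i in range(1 + REPLICAS)]

doc = "article-42"
s = shard_for(doc)
print(f"{doc} -> shard {s} on {nodes_for(s)}")
```

Because the mapping is deterministic, any node can compute where a document lives; if the primary for a shard is unreachable, a replica on another node can take over, which is exactly the failure mode Jepsen probes.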
Over a series of tests (with results summarized by delightful Barbie and Ken doll memes), the article concludes that while version control may be considered a “lost cause,” Elasticsearch handles inserts superbly. For more information on how Elasticsearch behaved through speed bumps, building a nemesis, nontransitive partitions, needless data loss, random and fixed transitive partitions, and more, read the full article. It ends with recommendations for Elasticsearch and for users, and concedes that the post provides far more information on Elasticsearch than anyone would ever desire.
Chelsea Kerwin, July 18, 2014
July 17, 2014
The article titled Scoop: A Glimpse Into the NYTimes CMS on the New York Times Blog discusses the importance of Content Management Systems (CMS) for the future of journalism. Recently, journalist Ezra Klein reportedly left The Washington Post for Vox Media largely for Vox’s preferable CMS. The NYT has its own CMS called Scoop, described in the article,
“…It is a system for managing content and publishing data so that other applications can render the content across our platforms. This separation of functions gives development teams at The Times the freedom to build solutions on top of that data independently, allowing us to move faster than if Scoop were one monolithic system. For example, our commenting platform and recommendations engine integrate with Scoop but remain separate applications.”
So it does seem that there is some wheel reinventing going on at the NYT. The article outlines the major changes that Scoop has undergone in the past few years, with live article editing that sounds like Google Docs, tagging, notifications, and simplified processes for adding photographs and multimedia. While there is some debate about where Scoop stands on the list of Content Management Systems, the Times certainly has invested in it for the long haul.
Chelsea Kerwin, July 17, 2014
July 8, 2014
I read an interview conducted by the consulting firm PWC. The interview appeared with the title “Making Hadoop Suitable for Enterprise Data Science.” The interview struck me as important for two reasons. First, the questioner and the interview subject introduce a number of buzzwords and business generalizations that will be bandied about in the near future. Second, the interview provides a glimpse of the fish with sharp teeth that swim in what seems to be a halcyon data lake. With Hadoop goodness replenishing the “data pond,” Big Data is a life-sustaining force. That’s the theory.
The interview subject is Mike Lang, the CEO of Revelytix. (I am not familiar with Revelytix, and I don’t know how to pronounce the company’s name.) The interviewer is one of those tag teams that high end consulting firms deploy to generate “real” information. Big time consulting firms publish magazines, emulating the McKinsey Quarterly. The idea is that Big Ideas need to be explained so that MBAs can convert information into anxiety among prospects. The purpose of these bespoke business magazines is to close deals and highlight technologies that may be recommended to a consulting firm’s customers. Some quasi consulting firms borrow other people’s work. For an example of this short cut approach, see the IDC Schubmehl write up.
Several key buzzwords appear in the interview:
- Nimble. Once data are in Hadoop, the Big Data software system has to be quick and light in movement or action. Sounds very good, especially for folks dealing with Big Data. So with Hadoop one has to use “nimble analytics.” Also sounds good. I am not sure what a “nimble analytic” is, but, hey, do not slow down generality machines with details, please.
- Data lakes. These are “pools” of data from different sources. Once data is in a Hadoop “data lake,” every water or data molecule is the same. It’s just like chemistry, sort of…maybe.
- A dump. This is a mixed metaphor, but it seems that PWC wants me to put my heterogeneous data, which is now like water molecules, in a “dump.” A mixed metaphor, is it not? Again, a mere detail. A data lake has dumps, or a dump has data lakes; I am not sure which has what. Trivial and irrelevant, of course.
- Data schema. To make data fit a schema with an old fashioned system like Oracle, it takes time. With a data lake and a dump, someone smashes up data and shapes it. Here’s the magic: “They might choose one table and spend quite a bit of time understanding and cleaning up that table and getting the data into a shape that can be used in their tool. They might do that across three different files in HDFS [Hadoop Distributed File System]. But, they clean it as they’re developing their model, they shape it, and at the very end both the model and the schema come together to produce the analytics.” Yep, magic.
- Predictive analytics, not just old boring statistics. The idea is that with a “large scale data lake”, someone can make predictions. Here’s some color on predictive analytics: “This new generation of processing platforms focuses on analytics. That problem right there is an analytical problem, and it’s predictive in its nature. The tools to help with that are just now emerging. They will get much better about helping data scientists and other users. Metadata management capabilities in these highly distributed big data platforms will become crucial—not nice-to-have capabilities, but I-can’t-do-my-work-without-them capabilities. There’s a sea of data.”
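Stripped of the magic, the “data schema” bullet describes what is usually called schema-on-read: dump the raw files into the lake now, and impose structure only when an analyst reads them. A toy sketch, with invented file contents and field names:

```python
# A hedged sketch of "schema-on-read": raw lines land in the lake as-is,
# and a (date, customer, amount) schema is applied only at analysis time.
# File contents and field names are invented for illustration.

import csv
import io

raw_dump = io.StringIO(
    "2014-07-08,acme,1200\n"
    "2014-07-09,acme,1350\n"
    "bad line that a rigid loader would choke on\n"
)

def read_with_schema(lines):
    """Shape and clean the raw rows at read time; skip junk."""
    for row in csv.reader(lines):
        if len(row) != 3:
            continue  # the cleanup happens here, not at load time
        date, customer, amount = row
        yield {"date": date, "customer": customer, "amount": int(amount)}

records = list(read_with_schema(raw_dump))
print(records)
```

An old-fashioned relational load would have rejected the whole file at the bad line; the schema-on-read style defers that cost, which is precisely what makes the approach attractive to the Hadoop crowd and nerve-wracking to everyone else.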
My take is that PWC is going to bang the drum for Hadoop. Never mind that Hadoop may not be the Swiss Army knife that some folks want it to be. I don’t want to rain on the parade, but Hadoop requires some specialized skills. Fancy math requires more specialized skills. Interpretation of the outputs from data lakes and predictive systems requires even more specialized skills.
No problem as long as the money lake is sufficiently deep, broad, and full.
The search for a silver bullet continues. That’s what makes search and content processing so easy. Unfortunately the buzzwords may not deliver the type of results that inform decisions. Fill that money lake because it feeds the dump.
Stephen E Arnold, July 7, 2014
July 8, 2014
The article on FlowingData titled How to Make Government Data Sites Better uses the Centers for Disease Control and Prevention (CDC) website to illustrate measures the government should take to make its data more accessible and manageable. The first suggestion is to provide files in a usable format. By avoiding PDFs and providing CSV files (or even raw data), the user will be in a much better position to work with the data. Another suggestion is simply losing or simplifying the multipart form that makes search nearly impossible. The author also proposes clearer and more consistent annotation, using the following scenario to illustrate the point,
“The CDC data subdomain makes use of the Socrata Open Data API,… It’s weekly data that has been updated regularly for the past few months. There’s an RSS feed. There’s an API. There’s a lot to like… There’s also a lot of variables without much annotation or metadata … When you share data, tell people where the data is from, the methodology behind it, and how we should interpret it. At the very least, include a link to a report in the vicinity of the dataset.”
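The author’s first point is easy to demonstrate: a CSV file parses in a few lines of standard-library code, while a PDF table has to be scraped. A sketch with invented data (these are not real CDC figures):

```python
# Why CSV beats PDF for published data: a machine-readable file yields
# usable records in a few lines of stdlib code. Contents are invented.

import csv
import io

cdc_csv = io.StringIO(
    "week,region,cases\n"
    "2014-26,Northeast,112\n"
    "2014-26,South,205\n"
)

rows = list(csv.DictReader(cdc_csv))
total = sum(int(r["cases"]) for r in rows)
print(rows[0]["region"], total)  # direct access to fields, no scraping
```

With a PDF, each of those fields would have to be recovered by position-guessing text extraction, which is exactly the burden the author wants lifted from the user.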
Overall, the author makes many salient points about transparency, consistency and clutter. But there is an assumption in the article that the government actually desires to make data sites better, which may be the larger question. If no one implements these ideas, perhaps that will be answer enough.
Chelsea Kerwin, July 08, 2014
July 7, 2014
The data-analysis work of recently prominent economist Thomas Piketty receives another whack, this time from computer scientist and blogger Daniel Lemire in, “You Shouldn’t Use a Spreadsheet for Important Work (I Mean It).” Piketty is not alone in Lemire’s reproach; last year, he took Harvard-based economists Carmen Reinhart and Kenneth Rogoff to task for building their influential 2010 paper on an Excel spreadsheet.
The article begins by observing that Piketty’s point, that in today’s world the rich get richer and the poor poorer, is widely made but difficult to prove. Though he seems to applaud Piketty’s attempt to do so, Lemire really wishes the economist had chosen specialized software, like STATA, SAS, or “even” R or Fortran. He writes:
“What is remarkable regarding Piketty’s work, is that he backed his work with comprehensive data and thorough analysis. Unfortunately, like too many people, Piketty used spreadsheets instead of writing sane software. On the plus side, he published his code… on the negative side, it appears that Piketty’s code contains mistakes, fudging and other problems….
“I will happily use a spreadsheet to estimate the grades of my students, my retirement savings, or how much tax I paid last year… but I will not use Microsoft Excel to run a bank or to compute the trajectory of the space shuttle. Spreadsheets are convenient but error prone. They are at their best when errors are of little consequence or when problems are simple. It looks to me like Piketty was doing complicated work and he bet his career on the accuracy of his results.”
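Lemire’s distinction can be made concrete: unlike a formula hidden in a cell, code can carry its own sanity checks. A toy example with invented numbers (not anything from Piketty’s dataset):

```python
# "Sane software" in miniature: the formula is visible, named, and guarded
# by an assertion that a silent spreadsheet cell error could not pass.
# The numbers are invented for illustration.

def compound_growth_rate(start: float, end: float, years: int) -> float:
    """Annualized growth rate between two values."""
    if start <= 0 or years <= 0:
        raise ValueError("start and years must be positive")
    return (end / start) ** (1 / years) - 1

rate = compound_growth_rate(start=100.0, end=121.0, years=2)
assert abs(rate - 0.10) < 1e-9  # a known case double-checks the formula
print(f"{rate:.2%}")
```

A spreadsheet can be audited too, of course, but the audit trail here is the point: the check runs every time, not only when a referee asks.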
The write-up notes that Piketty admits there are mistakes in his work, but asserts they are “probably inconsequential.” That’s missing the point, says Lemire, who insists that a responsible data analyst would have taken more time to ensure accuracy. My parents always advised me to use the right tool for a job: that initial choice can make a big difference in the outcome. It seems economists may want to heed that common (and common-sense) advice.
Cynthia Murrell, July 07, 2014
June 25, 2014
Data is no longer just facts, figures, and black and white graphs. Data visualizations are becoming an increasingly important way that data (and even Big Data) is demonstrated and communicated. A few data visualization solutions are making big waves, and Visage is one on the rise. It is highlighted in the FastCompany article, “A Tool For Building Beautiful Data Visualizations.”
The article begins:
“Visage, a newly launched platform, provides custom templates for graphics. There are myriad tools on the market that do this (for a gander at 30 of them, check out this list), but Visage is the latest, and it’s gaining traction with designers at Mashable, MSNBC, and A&E. That’s due in part to Visage’s offerings, which are designed to be more flexible, and more personalized, than other services.”
More and more companies are working on ways to help organizations decipher and make sense of Big Data. But what good is the information if it cannot be effectively communicated? This is where data visualizations come in – helping to communicate complex data through clean visuals.
Emily Rae Aldridge, June 25, 2014
June 20, 2014
In February 2014, NJTC TechWire ran an article, “Connotate Announces 25% YOY Growth In Total Contract Value For 2013.” Connotate has made a name for itself as a leading provider of Web data extraction and monitoring solutions. The company’s revenue grew 25% in 2013, and among other positives for Connotate were the release of Connotate 4.0, a new Web site, and new multi-year deal renewals. On top of the record growth, BIIA reports that “Connotate Launches Connotate4,” a Web browser that simplifies and streamlines Web data extraction. Connotate4 will do more than provide users with a custom browser:
- “Inline data transformations within the Agent development process is a powerful new capability that will ease data integration and customization.
- Enhanced change detection with highlighting can be requested during the Agent development process via a simple point-and-click checkbox, enabling highlighted change detection that is easily illustrated at the character, word or phrase level.
- Parallel extraction tasks makes it faster to complete tasks, allowing even more scalability for even larger extractions.
- Build and expand capabilities turn the act of re-using a single Agent for related extraction tasks a one-click event, allowing for faster Agent creation.
- A simplified user interface enabling simplified and faster Agent development.”
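Of the features listed, “parallel extraction tasks” is the easiest to picture in code. A generic sketch (this is not Connotate’s API; the sources and the extraction function are stand-ins) of running several independent extraction jobs concurrently instead of one after another:

```python
# A generic sketch of parallel extraction: fan a batch of independent
# extraction jobs out over a thread pool. Not Connotate's API; the
# extract() function is a stand-in for one Agent run.

from concurrent.futures import ThreadPoolExecutor

def extract(source: str) -> dict:
    """Stand-in for one extraction job against one source."""
    return {"source": source, "records": len(source)}  # fake "work"

sources = ["site-a", "site-b", "site-c", "site-d"]

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(extract, sources))  # order matches `sources`

print(results)
```

For jobs dominated by waiting on remote sites, this kind of fan-out is where the “faster to complete tasks” claim would come from.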
Connotate brags that the new browser will give users access to around 95% of Web data and is adaptable as new technologies emerge. Connotate aims to place itself in the next wave of indispensable enterprise tools.
June 13, 2014
An article on Gigaom is titled Michael Stonebraker’s New Startup, Tamr, Wants to Help Get Messy Data in Shape. With help ($16 million) from Google Ventures and New Enterprise Associates, Stonebraker and partner Andy Palmer are working to crack the ongoing problem of data transformation and normalization. The article explains,
“Essentially, the Tamr tool is a data cleanup automation tool. The machine-learning algorithms and software can do the dirty work of organizing messy data sets that would otherwise take a person thousands of hours to do the same, Palmer said. It’s an especially big problem for older companies whose data is often jumbled up in numerous data sources and in need of better organization in order for any data analytic tool to actually work with it.”
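The matching problem described can be sketched in a few lines (this is not Tamr’s algorithm; the company names and the similarity threshold are invented): score name pairs across two messy sources and flag likely duplicates for a human to confirm.

```python
# A toy sketch of the record-matching work a tool like this automates:
# score candidate pairs from two sources and surface the likely matches.
# Not Tamr's algorithm; names and threshold are invented.

from difflib import SequenceMatcher

source_a = ["Acme Corporation", "Globex Inc.", "Initech LLC"]
source_b = ["ACME Corp", "Initech", "Umbrella Co"]

def similarity(a: str, b: str) -> float:
    """Rough case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

candidates = [
    (a, b, round(similarity(a, b), 2))
    for a in source_a for b in source_b
    if similarity(a, b) > 0.6  # threshold chosen arbitrarily for the demo
]
print(candidates)  # pairs a data steward would review
```

The pairs that clear the threshold are exactly what the article’s “data steward” would be shown: the machine proposes, the human disposes.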
Teaching machines some human-like insight into repetitive cleanup work just might be the trick. Tamr still requires a human in the management seat, known as the data steward, someone who reads the results of a projected comparison between two separate data sets and decides whether the match is a good one. Tamr has been compared to Trifacta, but Palmer insists that Tamr is preferable for its ability to compare thousands of data sources with a data steward’s oversight. He also noted that Trifacta co-founder Joe Hellerstein was a student of Stonebraker’s in a PhD program.
Chelsea Kerwin, June 13, 2014