The Problem with Data Silos for Companies and Consumers

October 13, 2014

The article titled What If Your Data Worked Together on The Woopra Blog makes a plea for normalized and federated information. It also stomps data silos in the dirt for causing frustration in both customers and employees. The call for efficiency in this article does laud certain companies for organizing relevant data, with the example that follows,

“I had ordered a bed from ( and called them a few days later to ask a question about the delivery. The woman who answered my call didn’t ask me for a single piece of information, just “How can I help you?”. She already knew exactly who I was and what I had ordered…she told me that their system automatically gave her my profile based on my phone number.”

This particular example resonated with me, especially after dealing with certain cable companies who seem to keep all of their data in lockboxes and throw away the keys. The article went on to suggest that data silos hurt companies as much as customers by segmenting data and making it more difficult to understand the entire story in a certain usage. This article ends with a promise that it will follow up with more information on data harmony, and we can only hope that someone out there is listening.

Chelsea Kerwin, October 13, 2014

Sponsored by, developer of Augmentext

Google and Images: What Does Remove Mean?

October 4, 2014

I read “After Legal Threat, Google Says It Removed ‘Tens of Thousands’ of iCloud Hack Pics.” On the surface, the story is straightforward. A giant company gets a ringy dingy from attorneys. The giant company takes action. Legal eagles return to their nests.

However, a question zipped through my mind:

What does remove mean?

If one navigates to a metasearch engine like, the user can run queries. A query often generates results with a hot link to the Google cache. Have other services constructed versions of the Google index to satisfy certain types of queries? Are their third parties that have content in Web mirrors? Is content removed from those versions of content? Does “remove” mean from the Fancy Dan pointers to content or from the actual Google or other data structure? (See my write ups in Google Version 2.0 and The Digital Gutenberg to get a glimpse of how certain content can be deconstructed and stored in various Google data structures.)

Does remove mean a sweep of Google Images? Again are the objects themselves purged or are the pointers deleted.

Then I wondered what happens if Google suffers a catastrophic failure. Will the data and content objects be restored by a back up. Are those back ups purged?

I learned in the write up:

The Hollywood Reporter on Thursday published a letter to Google from Hollywood lawyers representing “over a dozen” of the celebrity victims of last month’s leak of nude photos. The lawyers accused Google of failing to expeditiously remove the photos as it is required to do under the Digital Millennium Copyright Act. They also demanded that Google remove the images from Blogger and YouTube as well as suspend or terminate any offending accounts. The lawyers claimed that four weeks after sending the first DMCA takedown notice relating to the images, and filing  over a dozen more since, the photos are still available on the Google sites.

What does “remove” mean?

Stephen E Arnold, October 4, 2014

Among More Changes Connotate Adds New Leader

September 30, 2014

Connotate has been going through many changes through 2014. According to Virtual Strategy they can count adding a new leader to the list: “Connotate Appoints Rich Kennelly As Chief Executive.” Connotate sells big data technology, specializing in enterprise grade Web data harvesting services. The newest leader for the company is Richard J. Kennelly. Kennelly has worked in the IT sector for over twenty years. Most of his experience has been helping developing businesses harness Internet and data. He has worked at Ipswitch and Akami Technologies, holding leadership roles at both companies.

Kennelly is excited about his new position:

“ ‘This is the perfect time to join Connotate,’ said Kennelly. ‘The Web is the largest data source ever created.  The biggest brands are moving quickly to leverage that data to drive competitive advantage and create new revenue streams. Connotate’s patented technology, scalability, and deep technical expertise make us the natural choice for these forward thinking companies.’”

The rest of the quote includes a small, but impressive client list, more praise for Kennelly, and how Connotate is a leading big data company.

If Connotate did not have good products and services, then they would not keep their clients. Despite the big names, they are still going through financial woes. Is choosing Kennelly a sign that they are trying to raise harvest more funding?

Whitney Grace, September 30, 2014
Sponsored by, developer of Augmentext

Automating Data with SharePoint to Boost Efficiency

September 25, 2014

Automating data with SharePoint in order to save cost and time is the subject of an upcoming webinar, “SharePoint Automates EHS Programs: Easy, Flexible, Powerful.” Occurring October 1st, the free webinar focuses on how environmental, health, and safety managers can streamline data collection, processing, and reporting. Read the details in the article, “Automate EHS Data Collection & Reporting with Microsoft SharePoint to Save Time & Cost is Subject of October 1st Webinar.”

The press release says:

“Environmental, health and safety programs require the ongoing routine tasks of data collection, data processing, data analysis, corrective action tracking, and report generation. The essentially manual and time-consuming process places a significant strain on already stretched EHS resources. However, with the use of Microsoft SharePoint — already available in many companies and institutions — EHS managers can automate these tasks to cut both processing time and costs.”

Stephen E. Arnold has a vested interest in SharePoint news and events. His career is focused on following the latest in search, and he makes his findings available via His SharePoint feed is particularly helpful for users who need to keep up with the latest SharePoint news, tips, and tricks.

Emily Rae Aldridge, September 25, 2014

Data Burping

August 14, 2014

Data integration from an old system to a new system is coded for trouble. The system swallowing the data is guaranteed to have indigestion and the only way to relieve the problem is by burping. Chez Brochez has dealt with his own share of data integration issues and in his article, “Tips and Tricks For Optimizing Oracle Endeca Data Ingestion” he details some of the best way burp.

Here he explains why he wrote the blog post:

“More than once I’ve been on a client site to try to deal with a data build that was either taking too long, or was no longer completing successfully. The handraulic analysis to figure out what was causing the issues can take a long time. The rewards however are tremendous. Not simply fixing a build that was failing, but in some cases cutting the time demand in half meant a job could be run overnight rather than scheduled for weekends. In some cases verifying with the business users what attributes are loaded and how they are interacted with can make their lives easier.”

While the post focuses on Oracle Endeca, skimming through the tips will benefit anyone working with data. Many of them are common sense, such as having data integrations do the heavy lifting in off-hours and shutting down competing programs. Others require more in-depth knowledge. It beats down to getting content into a old school system requires a couple of simple steps and lots of quite complex ones.

Whitney Grace, August 14, 2014
Sponsored by, developer of Augmentext

OnlyBoth Launches “Niche Finding” Data Search

August 12, 2014

An article on the Library Journal Infodocket is titled Co-Founder of Vivisimo Launches “OnlyBoth” and It’s Super Cool! The article continues in this entirely unbiased vein. OnlyBoth, it explains, was created by Raul Valdes- Perez and Andre Lessa. It offers an automated process of finding data and delivering it to the user in perfect English. The article states,

“What does OnlyBoth do? Actions speak louder than words so go take a look but in a nutshell, OnlyBoth can mine a dataset, discover insights, and then write what it finds in grammatically correct sentences. The entire process is automated. At launch, OnlyBoth offers an application providing insights o 3,122 U.S. colleges and universities described by 190 attributes. Entries also include a list of similar and neighboring institutions. More applications are forthcoming.”

The article suggests that this technology will easily lend itself to more applications, for now it is limited to presenting the facts about colleges and baseball in perfect English. The idea is called “niche finding” which Valedes-Perez developed in the early 2000s and never finished. The technology focuses on factual data that requires some reasoning. For example, the Onlyboth website suggests that the insight “If California were a country, it would be the tenth biggest in the world” is a more complicated piece of information than just a simple fact like the population of California. OnlyBoth promises that more applications are forthcoming.

Chelsea Kerwin, August 12, 2014

Sponsored by, developer of Augmentext

Jepsen-Testing Elasticsearch for Safety and Data Loss

July 18, 2014

The article titled Call Me Mayble: Elasticsearch on Aphyr explores potential issues with Elasticsearch. Jepsen is a section of Aphyr that tests the behaviors of different technology and software under types of network failure. Elasticsearch comes with the solid Java indexing library of Apache-Lucene. The article begins with an overview of how Elasticsearch scales through sharding and replication.

“The document space is sharded–sliced up–into many disjoint chunks, and each chunk allocated to different nodes. Adding more nodes allows Elasticsearch to store a document space larger than any single node could handle, and offers quasilinear increases in throughput and capacity with additional nodes. For fault-tolerance, each shard is replicated to multiple nodes. If one node fails or becomes unavailable, another can take over…Because index construction is a somewhat expensive process, Elasticsearch provides a faster database backed by a write-ahead log.”

Over a series of tests, (with results summarized by delightful Barbie and Ken doll memes) the article decides that while version control may be considered a “lost cause” Elasticsearch handles inserts superbly. For more information on how Elasticsearch behaved through speed bumbs, building a nemesis, nontransitive partitions, needless data loss, random and fixed transitive partitions, and more, read the full article. It ends with recommendations for Elasticsearch and for users, and concedes that the post provides far more information on Elasticsearch than anyone would ever desire.

Chelsea Kerwin, July 18, 2014

Sponsored by, developer of Augmentext

The Future of Journalism Linked to Content Management Systems

July 17, 2014

The article titled Scoop: A Glimpse Into the NYTimes CMS on the New York Times Blog discusses the importance of Content Management Systems (CMS) for the future of journalism. Recently, journalist Ezra Klein reportedly left The Washington Post for Vox Media largely for Vox’s preferable CMS. The NYT has its own CMS called Scoop, described in the article,

“…It is a system for managing content and publishing data so that other applications can render the content across our platforms. This separation of functions gives development teams at The Times the freedom to build solutions on top of that data independently, allowing us to move faster than if Scoop were one monolithic system. For example, our commenting platform and recommendations engine integrate with Scoop but remain separate applications.”

So it does seem that there is some wheel reinventing going on at the NYT. The article outlines the major changes that Scoop has undergone in the past few years, with live article editing that sounds like Google Docs, tagging, notifications, and simplified processes for the addition of photographs multimedia. While there is some debate about where Scoop stands on the list of Content Management Systems, the Times certainly has invested in it for the long haul.

Chelsea Kerwin, July 17, 2014

Sponsored by, developer of Augmentext

Swimming in a Hadoop Data Lake

July 8, 2014

I read an interview conducted by the consulting firm PWC. The interview appeared with the title “Making Hadoop Suitable for Enterprise Data Science.” The interview struck me as important for two reasons. The questioner and the interview subject introduce a number of buzzwords and business generalizations that will be bandied about in the near future. Second, the interview provides a glimpse of the fish with sharp teeth that swim in what seems to be a halcyon data lake. With Hadoop goodness replenishing the “data pond,” Big Data is a life sustaining force. That’s the theory.

The interview subject is Mike Lang, the CEO of Revelytix. (I am not familiar with Revelytix, and I don’t know how to pronounce the company’s name.) The interviewer is one of those tag teams that high end consulting firms deploy to generate “real” information. Big time consulting firms publish magazines, emulating the McKinsey Quarterly. The idea is that Big Ideas need to be explained so that MBAs can convert information into anxiety among prospects. The purpose of these bespoke business magazines is to close deals and highlight technologies that may be recommended to a consulting firm’s customers. Some quasi consulting firms borrow other people’s work. For an example of this short cut approach, see the IDC Schubmehl write up.

Several key buzzwords appear in the interview:

  • Nimble. Once data are in Hadoop, the Big Data software system, has to be quick and light in movement or action. Sounds very good, especially for folks dealing with Big Data. So with Hadoop one has to use “nimble analytics.” Also, sounds good. I am not sure what a “nimble analytic” is, but, hey, do not slow down generality machines with details, please.
  • Data lakes. These are “pools” of data from different sources. Once data is in a Hadoop “data lake”, every water or data molecule is the same. It’s just like chemistry sort of…maybe.
  • A dump. This is a mixed metaphor, but it seems that PWC wants me to put my heterogeneous data which is now like water molecules in a “dump”. Mixed metaphor is it not? Again. A mere detail. A data lake has dumps or a dump has data lakes. I am not sure which has what. Trivial and irrelevant, of course.
  • Data schema. To make data fit a schema with an old fashioned system like Oracle, it takes time. With a data lake and a dump, someone smashes up data and shapes it. Here’s the magic: “They might choose one table and spend quite a bit of time understanding and cleaning up that table and getting the data into a shape that can be used in their tool. They might do that across three different files in HDFS [Hadoop Distributed File System]. But, they clean it as they’re developing their model, they shape it, and at the very end both the model and the schema come together to produce the analytics.” Yep, magic.
  • Predictive analytics, not just old boring statistics. The idea is that with a “large scale data lake”, someone can make predictions. Here’s some color on predictive analytics: “This new generation of processing platforms focuses on analytics. That problem right there is an analytical problem, and it’s predictive in its nature. The tools to help with that are just now emerging. They will get much better about helping data scientists and other users. Metadata management capabilities in these highly distributed big data platforms will become crucial—not nice-to-have capabilities, but I-can’t-do-my-work-without-them capabilities. There’s a sea of data.”

My take is that PWC is going to bang the drum for Hadoop. Never mind that Hadoop may not be the Swiss Army knife that some folks want it to be. I don’t want to rain on the parade, but Hadoop requires some specialized skills. Fancy math requires more specialized skills. Interpretation of the outputs from data lakes and predictive systems requires even more specialized skills.

No problem as long as the money lake is sufficiently deep, broad, and full.

The search for a silver bullet continues. That’s what makes search and content processing so easy. Unfortunately the buzzwords may not deliver the type of results that inform decisions. Fill that money lake because it feeds the dump.

Stephen E Arnold, July 7, 2014

Steps Offered to Improve Government Data Sites

July 8, 2014

The article on FlowingData titled How to Make Government Data Sites Better uses the Center for Disease Control website to illustrate measures the government should take to make their data more accessible and manageable. The first suggestion is to provide files in a useable format. By avoiding PDFs and providing CSV files (or even raw data), the user will be in a much better position to work with the data. Another suggestion is simply losing or simplifying the multipart form that makes search nearly impossible. The author also proposes clearer and more consistent annotation, using the following scenario to illustrate the point,

“The CDC data subdomain makes use of the Socrata Open Data API,… It’s weekly data that has been updated regularly for the past few months. There’s an RSS feed. There’s an API. There’s a lot to like… There’s also a lot of variables without much annotation or metadata … When you share data, tell people where the data is from, the methodology behind it, and how we should interpret it. At the very least, include a link to a report in the vicinity of the dataset.”

Overall, the author makes many salient points about transparency, consistency and clutter. But there is an assumption in the article that the government actually desires to make data sites better, which may be the larger question. If no one implements these ideas, perhaps that will be answer enough.

Chelsea Kerwin, July 08, 2014

Sponsored by, developer of Augmentext

Next Page »