
Big Data Lake: Are the Data Safe to Consume?

August 2, 2015

I read “The Analytics Journey Leading to the Business Data Lake.” Data lake is one of the terms floating around (pun definitely intended!) to stimulate sales. If one has a great deal of water, one needs a place to put it. Even though water is dammed, piped, used, recycled, and dumped—storage is the key.

Enter EMC, a company which is in the business of helping those with water store it and make use of that substance.

The write up reflects effort. I assume there was a PowerPoint slide deck in the mix. There are some snazzy graphics. Here’s one that caught my eye:


Instead of enterprise search being the go-to enterprise software solution, EMC has slugged in the following umbrella terms:

  • Information ecosystem
  • Business intelligence (perhaps an oxymoron in light of this article)
  • Advanced analytics (obviously because regular analytics just aren’t zippy enough)
  • Knowledge layer (I remain puzzled about knowledge because I have a tough time defining it. In fact, I resigned from my for-fee knowledge management column because I just don’t know what the heck “knowledge” means.)
  • The unfathomable data lake (yep, pun intended). What’s wrong with the word “storage” or “database” by the way?
  • Master data which is also baffling. Is there servant data too?
  • Machine data. Again I have no clue what this means.

The chart scatters undefined and fuzzy buzzwords like a crazed Jethro Tull, a water soluble blend of Jethro Tull (inventor of the seed drill) and Jethro Tull (the commercially successful and eccentric rock band).

The write up is important because EMC has sucked in the jargon and assertions once associated with enterprise search and applied them to the dark and mysterious data lake.

I highlighted:

Our data lake is one logical data platform with multiple tiers of performance and storage levels to optimally serve various data needs based on Service Level Agreements (SLA). It will provide a vast amount of structured and unstructured data at the Hadoop and Greenplum layers to data scientists for advanced analytics innovation. The higher performance levels powered by Greenplum and in-memory caching databases will serve mission-critical and real-time analytics and application solutions. With more robust data governance and data quality management, we can ensure authoritative, high-quality data driving all of EMC business insights and analytics driven applications using data services from the lake.

Ah, the Mariana Trench of enterprise information: Governance. Like “knowledge” and “advanced analytics,” governance has euphony. I think of the water lapping against the shore of Lake Paseco.

So what? Several observations:

  1. This type of “suggest lots” marketing ended poorly for a number of companies that used this type of rhetoric when marketing search.
  2. The folks who swallow this bait are likely to find themselves in a most uncomfortable spot.
  3. The problems associated with making use of information to improve decision making by reducing risk are not going to be solved by crazy diagrams and unsupported assertions.

EMC has been able to return to revenue growth. But the company’s profit margin has flatlined.


I am not sure that increasing the buzzword density in marketing write ups will help angle the red lines to low earth orbit. With better margins, it is much easier to check out the topographic view and see where lakes meet land.

Stephen E Arnold, August 2, 2015

The Hadoop Spark Thing: Simple, Simple

July 30, 2015

I am fascinated with the cheerleading about open source software which makes Big Data as easy as driving a Fiat 500 through a car wash. (Make sure the wheels fit inside the automated pulley system, of course.)

Navigate to “The Big Big Data Question: Hadoop or Spark?” Be prepared to read about two—count ‘em—two systems working as smoothly as the engine in a technical high school’s auto repair class’ project car.

I want to highlight two statements in the write up.

The first is:

As I [a Big Data practitioner] mentioned, Spark does not include its own system for organizing files in a distributed way (the file system) so it requires one provided by a third-party. For this reason many Big Data projects involve installing Spark on top of Hadoop, where Spark’s advanced analytics applications can make use of data stored using the Hadoop Distributed File System (HDFS).

In short, Spark is what I call a wrapper. One uses it like a taco shell to keep the goods in position for real time munching.

The second is this comment:

The open source principle is a great thing, in many ways, and one of them is how it enables seemingly similar products to exist alongside each other – vendors can sell both (or rather, provide installation and support services for both, based on what their customers actually need in order to extract maximum value from their data.

What the write up omits is that there are some other bits and pieces needed; for example, how does one locate a particular string amidst the Big Data?
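One naive answer to the string question is a brute-force scan. The sketch below is only an illustration, not anything from the write up: it assumes the data arrive as named text chunks, and the function name `scan_for_string` and the sample chunks are hypothetical. No Hadoop or Spark API is used; it shows only the logic that some extra piece of software would have to supply.

```python
from typing import Iterable, Iterator, Tuple


def scan_for_string(
    chunks: Iterable[Tuple[str, str]], needle: str
) -> Iterator[Tuple[str, int, str]]:
    """Yield (chunk_name, line_number, line) for every line containing needle.

    `chunks` is a hypothetical stand-in for file blocks spread across a
    cluster: each item is (chunk_name, chunk_text).
    """
    for chunk_name, text in chunks:
        for line_number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                yield chunk_name, line_number, line


# Hypothetical sample data, invented for illustration only.
SAMPLE_CHUNKS = [
    ("part-00000", "alpha,42\nbravo,17\n"),
    ("part-00001", "charlie,99\nbravo,3\n"),
]

hits = list(scan_for_string(SAMPLE_CHUNKS, "bravo"))
for chunk_name, line_number, line in hits:
    print(chunk_name, line_number, line)
```

Every chunk is read end to end for each query, which is the reason a separate index or search layer usually ends up in the stack.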

The point, for me, is that these nested and layered systems are truly exciting to troubleshoot. Not only are there issues with the integrity of the data, there is the thrill of getting each subsystem to work and then figuring out how to get useful outputs from the digital equivalent of a Roy’s Place Lassie’s Double Revenge sandwich before it closed its doors in 2013.

A Lassie’s Double Revenge consisted of a knockwurst, cheese, grilled onions, baked beans, and assorted seasonings served to the discerning diner.

A little like an open source Big Data mash up.

As a bonus, one gets to hire consultants who can make separate products, systems, and solutions work in a way which benefits the licensee and the system’s users.

Stephen E Arnold, July 30, 2015

PowerPoint Enabled Big Data Presenters Rejoice

July 27, 2015

Navigate to “A Plethora of Big Data Infographics.” Note that the original write up misspells “plethora” as “pletora” but, as many in Big Data say, “it is close enough for horseshoes.”


I quit browsing after a baker’s dozen of these puppies. If you want to be an expert in Big Data, these charts will do the trick. I would steer clear of a person with a PhD in statistics, however.

Stephen E Arnold, July 27, 2015

Forbes and Some Big Data Forecasts

July 26, 2015

Short honk: For fee, mid tier consultants have had their thunder stolen. Forbes, the capitalist tool, wants to make certain its readers know how juicy Big Data is as a market. Navigate to “Roundup Of Analytics, Big Data & Business Intelligence Forecasts And Market Estimates, 2015.”

The write up summarizes the eye watering examples of spreadsheet fever’s impact on otherwise semi-rational MBAs, senior managers, and used car sales professionals. IDC, without the inputs of Dave Schubmehl, comes up with a spectacular number: $125 billion in 2015.

Sounds good, right?

The data will find their way into innumerable PowerPoint presentations. Snag ‘em while you can.

Stephen E Arnold, July 26, 2015

Big Data: Slow Down, Think

July 25, 2015

I read “Contradictions of Big Data.” Few articles which I see take a common sense approach to Big Data baloney. (Azure chip consultants bristle at my use of baloney. Too bad.) I liked this article.

The article appeared in my Overflight a day ago even though the write up was posted in March 2015. Big Data does not mean rapid data.

I highlighted this passage:

have been waging an uphill battle against the nonsensical and unsubstantiated idea that more data is better data, but now this view is getting some additional support, and from some surprising corners.

I do not agree. The yap about Big Data has almost overpowered the craziness of search engine optimization’s shouting about semantic search.

The write up points out:

Take it from me [Martyn Jones], most businesses will not be basing their business strategies on the analysis of a glut of selfies, home videos of cute kittens, or the complete works of William Shakespeare or Dan Brown. Almost all business analysis will continue to be carried out on structured data obtained primarily from internal operational systems and external structured data providers.

The write up points out the silliness of velocity and several other slices of marketing baloney. (Make a sandwich, please.)

I found this paragraph insightful:

I have seen data scientists at work, and the word science doesn’t actually jump out and grab you. It’s difficult to make the connection, just as it is to accurately connect some popular science magazines with fundamental scientific research. If a professional and qualified statistician wants to label themselves a data scientist then I have no issue with that, it’s their problem, but I am not willing to lend credibility to the term ‘data scientist’ when it is merely an interesting job title, with at most a tenuous connection to the actual role, and one that is liberally applied, with the almost customary largesse of IT, to creative code hackers and business-averse dabblers in data.

Harsh words for those who combine an undergraduate degree minor in math with Twitter and come up with data scientist.

Hopefully others will pick up this practical approach to the sliced and processed meat wrapped in plastic and branded Big Data.

Stephen E Arnold, July 25, 2015

Lucidworks (Really?) Does Fusion Too

July 23, 2015

I read “Lucidworks Delivers Fusion 2.0 with Spark Integration.” The idea is that search is not exactly flying off the shelves. Why not download Elasticsearch and move on? The way to make search relevant is to make it a Big Data thing. This is the hard to believe path IBM took with Vivisimo’s technology. Where is Vivisimo in the IBM revenue picture? Well, that picture seems gloomy. Maybe the Big Data thing doesn’t work particularly well.

In terms of venture backed Lucidworks, the write up explains:

Fusion 2.0 provides an organization with access to a streamlined, consumer-like search experience with enterprise-grade speed and scalability. The new release integrates Lucidworks’ Fusion with Apache Spark to enable real-time data analytics. Fusion 2.0 also features a new version of the company’s SiLK user interface (UI) that simplifies dashboard visualizations and enhances the user experience.  The SiLK UI runs on top of Fusion and the Apache Solr search platform, upon which Fusion is based. SiLK gives users the power to perform ad-hoc search and analysis of massive amounts of multi-structured and time series data. Users can swiftly transform their findings into visualizations and dashboards.

I think I understand. Wrappers of software provide more developer-friendly tools. There may be one slight hitch in the git along. Those familiar with the technology of open source and fluent in the mumbo jumbo jargon that Lucid and other repositioning enterprise search vendors employ may not comprise a giant pool of prospects.

In short, writing wrappers is hard work. Dealing with fusion in an effective manner is harder work. Eliminating the latency that accompanies layers and handoffs is the hardest work of all.

The challenge will be generating substantial organic revenue and having enough profit to satisfy the investors who have been very patient with the Lucidworks outfit. No, really.

Stephen E Arnold, July 23, 2015

IBM SAP Versus SAS: A Faux Dust Up

July 22, 2015

Ah, the freebie statistics are like gnats. One or two make no difference when one is eating a chicken leg. Toss in 20,000 or more and the leg eating becomes a chore.

I read an oblique write up called “SAS UK Chief: Envious Rivals, Skills Gap and Analytics in the Cloud.” The topics are interesting because they are mixed together, a fruit salad to go with that picnic chicken.

The write up begins with a statement attributed to an IBM SAP executive along the lines: “SAS could be entirely replaced.” That seems a bit of fortune telling which might not be entirely in line with some SAS users’ plans. IBM, as you may know, is fresh from 13 straight quarters of revenue decline. I interpreted the feisty comment as a signal to IBM management that the much loved SAP division is replete with machismo and doing its bit to increase revenues. There’s nothing like a statistics squabble to pump up the sales spice.

As I understand the write up, that allegedly “put ‘em up, chump” statement caused an SAS executive to flounder. SAS’s problem is that it is still a little chunk of graduate school. SAS faces competition from upstarts like Talend. SAP, on the other hand, is chasing consulting and giant IBM cloud-type things. But the two outfits are old school operations. For proof just ask a graduate student in statistics.

The reality is that both SAP and SAS may be victims of the same market shifts. In order to get either company’s products to deliver a perfect grilled chicken, one has to know about statistics and have resources (money, gentle reader).

Big companies are okay with these requirements. But the buzz in the analytics world is for open source, point and click, ready to run solutions. The outputs of these next generation systems may not meet the standards of the SAPs and the SASs of the world, but the customers don’t care.

These two firms are facing many gnats. Neither is going to have a pleasant meal. The good old days of sunshine, blue skies, and a bug free experience are gone.

Stephen E Arnold, July 22, 2015

Big Data Vendor List

July 19, 2015

I scanned the Big Data list. I won’t linger too long. You can too. (Apologies to Robert Frost and “The Pasture.” The clarity part I will leave to you.)

The list appears in this article: “42 Big Data Startups.” One reader added 16 other companies. I am unclear. I tried to “wait to watch the water clear” but it did not.

Main thoughts:

  1. What’s a start up? A number of the companies in the list have been around for a while; for example, Talend was founded in 2005. Let’s see, despite the muddy water, that works out to a decade.
  2. Why is there just one company with “search” solutions on the list? The search-aware outfit is Datastax. But the company’s information access capability was not mentioned. The list totters as a result like the “little calf that’s standing by the mother.”
  3. What’s the rationale for clumping in an earthworm type laundry list services, software, applications that sit on top of data management systems, and outfits which focus on a niche like geolocation or search engine optimization? There are no horses, sheep, or pigs in the Frost poem. At least, I did not discern any nor did the person who came along.

Listicles can be interesting, humorous, and informative. Lists without logic are not particularly useful unless one is eager to demonstrate the importance of specified criteria and some sort of useful classification of items in the list.

Stephen E Arnold, July 19, 2015

Kashman to Host Session at SharePoint Fest Seattle

July 14, 2015

Mark Kashman, Senior Product Manager at Microsoft, will deliver a presentation at the upcoming SharePoint Fest Seattle in August. All eyes remain peeled for any news about the new SharePoint Server 2016 release, so his talk entitled, “SharePoint at the Core of Reinventing Productivity,” should be well watched. Benzinga gives a sneak peek with their article, “Microsoft’s Mark Kashman to Deliver Session at SharePoint Fest Seattle.”

The article begins:

“Mark Kashman will deliver a session at SharePoint Fest Seattle on August 19, 2015. His session will be held at the Washington State Convention Center in downtown Seattle. SharePoint Fest is a two-day training conference (plus an optional day of workshops) that will have over 70 sessions spread across multiple tracks that brings together SharePoint enthusiasts and practitioners with many of the leading SharePoint experts and solution providers in the country.”

Stephen E. Arnold is also keeping an eye out for the latest news surrounding SharePoint and its upcoming release. His Web service efficiently synthesizes and summarizes essential tips, tricks, and news surrounding all things search, including SharePoint. The dedicated SharePoint feed can save users time by serving as a one-stop-shop for the most pertinent pieces for users and managers alike.

Emily Rae Aldridge, July 14, 2015

Sponsored by, publisher of the CyberOSINT monograph

SAS Explains Big Data. Includes Cartoon, Excludes Information about Cost

July 13, 2015

I know that it is easy to say Big Data. It is easy to say Hadoop. It is easy to make statements in marketing collateral, in speeches, and in blogs written by addled geese. Honk!


I wish to point out that any use of these terms in the same sentence requires an important catalyst: Money. Money that has been, in the words of the government procurement officer, “Allocated, not just budgeted.”

Here are the words:

  1. Big Data
  2. Hadoop
  3. Unstructured data.

Point your monitored browser at “Marketers Ask: What Can Hadoop Do That My Data Warehouse Can’t?” The write up originates with SAS. When a company is anchored in statistics, I expect some familiarity with numbers. (Yep, just like the class you have blocked from your mind. The mid term? What mid term?)

The write up points out that unstructured data comes in many flavors. This chart, complete with cartoon, identifies 15 content types. I was amazed. Just 15. What about the data in that home brew content management system or tucked in the index of the no longer supported DEC 20 TIPS system? Yes, that data.


How does Hadoop deal with the orange and blue? Pretty well but you and the curious marketer must attend to three steps. Count ‘em off, please:

  1. Identify the business issue. I think this means know what problem one is trying to solve. This is a good idea, but I think most marketing problems boil down to generating revenue and proving it to senior management. Marketing looks for silver bullets when the sales are not dropping from the sky like packages for the believers in the Cargo Cult.
  2. Get top management support. Yep, this is a good idea because the catalyst—money—has to be available to clean, acquire, and load the goodies in the blue boxes and the wonky stuff from the home brew CMS.
  3. Develop a multi play plan. I think this means that the marketer has zero clue how complicated the Hadoop magic is. The excitement of extract, transform, and load. The thrill of batch processing awaits. Then the joy of looking at outputs which baffle the marketer more comfortable selecting colors and looking at Adwords’ reports than Hadoop data.
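For the marketer wondering what extract, transform, and load looks like in miniature, here is a toy batch run. It is a sketch under loud assumptions: the CSV columns, the campaign names, and the three function names are invented for illustration and do not come from the SAS write up, and nothing here touches Hadoop. The real thing adds clusters, schedulers, and invoices.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw export, invented for illustration only.
RAW_CSV = """campaign,clicks,spend
spring sale,120,30.50
Spring Sale ,80,19.50
newsletter,not available,5.00
newsletter,40,10.00
"""


def extract(raw):
    """Extract: parse the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(raw)))


def transform(rows):
    """Transform: normalize campaign names and drop rows with unusable numbers."""
    clean = []
    for row in rows:
        try:
            clicks = int(row["clicks"])
            spend = float(row["spend"])
        except ValueError:
            continue  # rows with bad numbers are discarded, not repaired
        name = row["campaign"].strip().lower()
        clean.append({"campaign": name, "clicks": clicks, "spend": spend})
    return clean


def load(rows):
    """Load: aggregate the cleaned rows into an in-memory table keyed by campaign."""
    table = defaultdict(lambda: {"clicks": 0, "spend": 0.0})
    for row in rows:
        table[row["campaign"]]["clicks"] += row["clicks"]
        table[row["campaign"]]["spend"] += row["spend"]
    return dict(table)


totals = load(transform(extract(RAW_CSV)))
for campaign, numbers in totals.items():
    print(campaign, numbers["clicks"], numbers["spend"])
```

Even in this toy, one of four rows is thrown away and two campaign spellings have to be merged by hand-written rules. Multiply that by 15 content types and the money question gets real.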

My thought is that SAS understands data, statistical methods, and the reality of a revolution which is taking place without the strictures of SAS approaches.

I do like the cartoon. I do not like the omission of the money part of the task. Doing the orange and blue thing for marketers is expensive. Do the marketers know this?


Stephen E Arnold, July 13, 2015
