Big Data, Analytics, and Time to Burn

June 11, 2015

Short honk: I read “Riding Dirty: The Science of Cars and Rap Lyrics.” The concept is to match rap music references to models of automobiles. To what end, I am not certain. I was delighted to find that Mercedes is the most popular auto in rap music. I used to own a Subaru, which ranks dead last. I am obviously at the bottom of the rap auto popularity heap. Great use of time, I assert. How will the auto manufacturers use these data? Any ideas, Cadillac (#2) and Chevrolet (#3)?
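For the curious, the matching exercise can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the write up: the brand list is limited to the four names mentioned above, and the function name and any lyric lines fed to it are hypothetical.

```python
import re
from collections import Counter

# Brands named in this post; a real study would need a far longer list.
BRANDS = ["Mercedes", "Cadillac", "Chevrolet", "Subaru"]


def count_brand_mentions(lines, brands=BRANDS):
    """Count case-insensitive whole-word mentions of each brand in the lines."""
    counts = Counter({brand: 0 for brand in brands})
    for brand in brands:
        # \b keeps "Subaru" from matching inside a longer word.
        pattern = re.compile(r"\b" + re.escape(brand) + r"\b", re.IGNORECASE)
        for line in lines:
            counts[brand] += len(pattern.findall(line))
    return counts
```

Calling `count_brand_mentions(lines).most_common()` returns the brands ordered by mention count, which is roughly what a popularity ranking needs. Nicknames and model names ("Benz," "Caddy") would slip past this simple matcher.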

Stephen E Arnold, June 12, 2015

Amazon and Elasticsearch

May 29, 2015

If you are curious about the utility of Elastic’s technology, you will find “Indexing Common Crawl Metadata on Amazon EMR Using Cascading and Elasticsearch” a useful article to review. The main idea is that Amazon made Elasticsearch do some circus tricks. The write up explains the approach, provides code snippets, and includes a couple of nifty graphics which help those zany Zonies figure out the implications of the data crunched. In short, Elasticsearch did something useful with content in everyone’s favorite magic wand, Hadoop. Why didn’t Amazon use LucidWorks (Really?)? Hmm. Good question.
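The article’s own snippets use Cascading on EMR. As a rough stand-in, not the article’s code, here is a minimal Python sketch of how records can be packaged for Elasticsearch’s bulk endpoint, which expects newline-delimited JSON: one action line followed by one document line per record. The index name and the metadata fields are hypothetical, sending the body to a cluster is left out, and older Elasticsearch releases also expected a `_type` in the action line.

```python
import json


def build_bulk_body(index_name, records):
    """Build a newline-delimited JSON body for the Elasticsearch bulk API.

    records is an iterable of (doc_id, source_dict) pairs.
    """
    lines = []
    for doc_id, source in records:
        # Action line: tells Elasticsearch to index the document that follows.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

The point of batching is to cut round trips: many documents travel in one request instead of one request per document.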

Stephen E Arnold, May 29, 2015

Unstructured Data Challenge: An Infographic Does the Job

May 29, 2015

Search and content processing vendors talk about their systems’ handling of structured and unstructured data. One outfit thinks that the challenge is unstructured information. After decades of floundering, the search sector lacks a solution that makes accountants and users beam with happiness.

What’s the fix?

According to “Meeting the Challenge of Unstructured Data,” the unstructured data speed bumps can be resolved by realizing there are challenges. The fix begins with realizing that one may not be prepared for them, which is like showing up for Marine boot camp in formal wear.

The infographic does a good job of presenting the issues. Most organizations are not willing to shift from the business of making money to the business of dealing with lots of email, PowerPoints, and Web pages.

For example, the infographic asserts:

  • Big Data come from many sources, and Big Data come in many shapes, sizes, and colors
  • Network loads will go up in 24 months
  • 60 percent of organizations are ready for “the surge in network traffic.”

Okay, let’s step back.

At this time, most sentient managers know that there is a great deal of unstructured information in their organization. Most managers are not able to find information in a way that makes them emulate a happy face. Accountants wrinkle their baby smooth foreheads when tallying up the costs for digital information storage, findability, maintenance, and unbudgeted expenses to get the existing systems to do their often weak kneed thing.

These challenges are truisms.

My question: “When will innovations have an impact on these challenges?” Based on progress in the last few decades, solutions will arrive with marketing parades. The results, in my view, will be the same old issues: Users cannot locate the information required to make their work a doddle.

It is easier to identify deal breakers than unbreak the deals. It is easier to use jargon to help close deals than provide solutions that deal with information challenges.

One does not meet the challenges of unstructured data with lists of facts that make clear that today’s solutions are not, shall we say, efficacious.

Stephen E Arnold, May 29, 2015

IBM and Hadoop: Closer Than Ever

May 28, 2015

I read “Hadoop and IBM i: Not as Far Apart as One Might Think.” The letter “i” is important if you are in the IBM lingo parade. The letter “i” refers to the EBCDIC based operating system which runs on IBM Power and Pure Systems. If you don’t know EBCDIC, you should go back to your iOS device and wait for an Apple IBM app that runs on this puppy.

If your inner AS/400 itch needs scratching, you can fire up your system and use wrapper software from Mrc Productivity. You can then do the Hadoop thing.

The write up mentions other vendors working in this sector, but if you are an IBM i shop, you have companies like mrc on your iPhone’s speed dialer.

The article does state:

Not every IBM i shop is asking for Hadoop capabilities, but there have been some inquiries, says mrc’s marketing director Steve Hansen. [mrc is leading the Hadoop thing on the i/AS/400 platforms]… We’re not telling people it’s time to replace IBM i. We’re saying the data is getting bigger. There’s unstructured data and social data, and businesses just aren’t doing much with it yet. I think it’s overwhelming. Right now we’re [mrc folks] trying to build awareness to what Hadoop is and how people who are using IBM i can take this data that they’re not taking advantage of and put it into Hadoop. I don’t see it as a replacement for their IBM i. It’s more something that can enhance what they’re currently doing and tracking all this data they’re not tracking.

Yep, overwhelming.

For more information about mrc, navigate to http://www.mrc-productivity.com/.

Stephen E Arnold, May 28, 2015

Generalizations about Big Data: Hail, the Mighty Hadoop

May 26, 2015

I read “A Big Data Cheat Sheet: What Executives Want to Know.” The hidden agenda in the write up is revealed with the juxtaposition of the source Social Media Today and the technology Hadoop.

Big Data is one of those buzzwords which now grates on me. When I hear it, I wonder what the outfit is pitching and how something as nebulous as Big Data is going to save someone’s bacon or, if one is a vegetarian, tofu.

This write up beats the Hadoop drum. Isn’t Hadoop one method for performing certain types of data management tasks and extracting results from those tasks? Hadoop is a tool, and like a router in the home workshop, a pretty feisty gizmo in the hands of a novice.

The article suggests that Hadoop is a federation system. Hadoop can be a federation system, but it can handle data from a single source; for example, log files. Federation is not magic; it requires work. In fact, federation may render the benefits of Hadoop secondary to the cost of the resources required to utilize Hadoop in an effective way.
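To make the single source case concrete, here is a minimal sketch of the map, shuffle, and reduce pattern that Hadoop popularized, written in plain Python rather than against Hadoop itself. The log line layout (timestamp, status code, path) is a hypothetical example, and the function names are mine.

```python
from collections import defaultdict


def map_phase(log_lines):
    """Emit (status_code, 1) for each well-formed log line."""
    for line in log_lines:
        parts = line.split()
        # Assumed layout: "<timestamp> <status> <path>"; skip anything shorter.
        if len(parts) >= 2:
            yield parts[1], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}
```

Running `reduce_phase(shuffle(map_phase(lines)))` yields a count per status code. The logic is trivial on one machine; the work, and the cost, comes from distributing it across a cluster and from federating sources that do not share a layout.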

There are other assertions as well; for example:

  • Hadoop can archive “all data.” Hmmm. “All.” Does this sound a bit overblown?
  • Hadoop is enterprise ready? Sure, if the enterprise has the resources to make appropriate use of Hadoop.
  • Are data lakes and data warehouses the same? According to the write up, the data warehouse uses structured data and the data lake is just a big pool of disparate data. Queries across this type of “pool” can be exciting and expensive.
  • The upsides and downsides of the data lake pivot on data management. Okay, that is definitely true. What is not explored is the cost of managing large volumes of data, their updates, and their manipulation. Queries can be expensive.

My point is that sweeping generalizations about a technology which is useful are not helpful. Firing buzzwords into the mushy brain of a person involved in social media can have some interesting consequences.

Hadoop is not magic. Hadoop requires specialized knowledge. Hadoop does not deliver like the tooth fairy leaving a quarter under one’s pillow. If Hadoop were the answer to Big Data problems, why are so many Hadoop projects vulnerable to very common problems in configuration, memory handling, lousy performance, and problematic hives?

Social media experts are not likely to appreciate these challenges as they work to deal with large volumes of data, updates, and queries. Oh, are the outputs valid? Frankly some Hadoop projects never face that problem.

Stephen E Arnold, May 26, 2015

Welcome YottaSearch

May 26, 2015

There is another game player in the world of enterprise search: Yotta Data Technologies announced their newest product: “Yotta Data Technologies Announces Enterprise Search And Big Data Analytics Platform.”  Yotta Data Technologies is known for its affordable and easy to use information management solutions. Yotta has increased its solutions by creating YottaSearch, a data analytics and search platform designed to be a data hub for organizations.

“YottaSearch brings together the most powerful and agile open source technologies available to enable today’s demanding users to easily collect data, search it, analyze it and create rich visualizations in real time.  From social media and email for Information Governance and eDiscovery to web and network server logs for Information Technology Operations Analytics (ITOA), YottaSearch™ provides the Big Data Analytics for users to derive information intelligence that may be critical to a project, case, business unit or market.”

YottaSearch uses the popular SaaS model and offers users not only data analytics and search, but also knowledge management, information governance, eDiscovery, and IT operations analytics.  Yotta decided to create YottaSearch to earn revenue from the burgeoning big data market, especially the enterprise search end.

The market is worth $1.7 billion, so Yotta has a lot of competition, but if they offer something different and better than their rivals they stand a chance to rise to the top.

Whitney Grace, May 26, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Maana from Heaven: Sustaining Big Data Search

May 23, 2015

Need to search Big Data in Hadoop? Other data management systems? Maana is now ready to assist you. Fresh from stealth mode, the company received an infusion of venture capital which now totals $14.2 million. (You may have to pay to access the details of this cash injection.) Maana garnered only a fraction of the money pumped into search vendors Attivio ($71 million), Coveo ($34 million) or Palantir (hundreds of millions). But Maana has some big name backers; for example, GE Ventures and Intel Capital, among others.

Maana’s manna looks a lot like legal tender.

According to the company:

Maana is pioneering new search technology for big data. It helps corporations drive significant improvements in productivity, efficiency, safety, and security in the operations of their core assets.

This value proposition strikes me as familiar.

Maana is ready to enable customers to perform knowledge modeling, evaluation, data understanding, data shaping, and orchestration. Differentiation is likely to be a challenge. The company offers this diagram to assist prospects in understanding why Maana is different from other Big Data search solutions:

Image from www.maana.com

A key differentiator is that the company says:

Maana is not based on open source Solr/Lucene.

That should chop out LucidWorks (Really?) and other open source Big Data options in a competitive fray.

Will Maana’s positioning tactic thwart other proprietary Big Data information access solutions? Hewlett Packard, are you ready to rumble? Oracle. Wait. Oracle is always ready to rumble. Google and In-Q-Tel backed Recorded Future? Oops. Recorded Future is jammed with work and inquiries as I understand it. Whatever. Let the proprietary Big Data search Copa de Data off begin.

Stephen E Arnold, May 23, 2015

Is Collaboration the Key to Big Data Progress?

May 22, 2015

The article titled “Big Data Must Haves: Capacity, Compute, Collaboration” on GCN offers insights into the best areas of focus for big data researchers. The Internet2 Global Summit is in D.C. this year with many exciting panelists who support the emphasis on collaboration in particular. The article mentions the work being presented by several people, including Clemson professor Alex Feltus:

“…his research team is leveraging the Internet2 infrastructure, including its Advanced Layer 2 Service high-speed connections and perfSONAR network monitoring, to substantially accelerate genomic big data transfers and transform researcher collaboration…Arizona State University, which recently got 100 gigabit/sec connections to Internet2, has developed the Next Generation Cyber Capability, or NGCC, to respond to big data challenges.  The NGCC integrates big data platforms and traditional supercomputing technologies with software-defined networking, high-speed interconnects and visualization for medical research.”

Arizona State’s NGCC provides the essence of the article’s claims, stressing capacity with Internet2, several types of computing, and of course collaboration between everyone at work on the system. Feltus commented on the importance of cooperation in Arizona State’s work, suggesting that personal relationships outweigh individual successes. He claims his own teamwork with network and storage researchers helped him find new potential avenues of innovation that might not have occurred to him without thoughtful collaboration.

Chelsea Kerwin, May 22, 2015

Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com

Big Data: The Shrinky Dink Approach

May 21, 2015

I read “To Handle Big Data, Shrink It.” Years ago I did a job for a unit of a blue chip consulting firm. My task was to find a technology which allowed a financial institution to query massive data sets without bringing the computing system to its knees and causing the on-staff programmers to howl with pain.

I located an outfit in what is now someplace near a Prague-like location. The company was CrossZ, and it used a wonky method of compression and a modified version of SQL with a point and click interface. The idea was that a huge chunk of the bank data (for instance, the transactions in the week before Mother’s Day) could be queried for purchasing-related trends. Think fraud. Think flowers. Think special promotions that increased sales. I have not kept track of the low profile, secretive company. I did think of it when I read the “shrink Big Data” story.

This passage resonated and sparked my memory:

MIT researchers will present a new algorithm that finds the smallest possible approximation of the original matrix that guarantees reliable computations. For a class of problems important in engineering and machine learning, this is a significant improvement over previous techniques. And for all classes of problems, the algorithm finds the approximation as quickly as possible.

The point is that it is now 2015 and a mid 1990s notion seems to be fresh. My hunch is that the approach will be described as novel, innovative, and a solution to the problems Big Data poses.

Perhaps the MIT approach is those things. For me, the basic idea is that Big Data has to be approached in a rational way. Otherwise, how will queries of “Big Data” which has been processed, plus a stream of new or changed “Big Data,” be handled in a way that is affordable, is computable, and is meaningful to a person who has no clue what is “in” the Big Data?

Fractal compression, recursive methods, mereological techniques, and other methods are a good idea. I am delighted with the notion that Big Data has to be made small in order to be more useful to companies with limited budgets and a desire to answer some basic business questions with small data.
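To show what “shrinking” a matrix can mean in practice, here is a minimal sketch in plain Python. It is not the MIT algorithm and not the CrossZ method; it swaps in a textbook technique, power iteration, to find the best rank-one approximation of a matrix. The function name and the example values are mine, and the sketch assumes a non-zero matrix that the starting vector is not orthogonal to.

```python
import math


def rank_one_approximation(matrix, iterations=50):
    """Return (sigma, u, v) so that sigma * u[i] * v[j] approximates matrix[i][j].

    Uses power iteration to estimate the dominant singular triple.
    """
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols  # arbitrary unit-length starting vector
    sigma = 0.0
    for _ in range(iterations):
        # u is A times v, normalized to unit length.
        u = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        norm_u = math.sqrt(sum(x * x for x in u))
        u = [x / norm_u for x in u]
        # v is A-transpose times u; its length estimates the singular value.
        v = [sum(matrix[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / sigma for x in v]
    return sigma, u, v
```

The payoff is storage: a matrix with r rows and c columns holds r times c numbers, while the approximation keeps r plus c plus one. Whether the smaller version answers the business question reliably is the hard part, which is what an approximation guarantee addresses.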

Stephen E Arnold, May 21, 2015

Hadoop: Its Inventor Speaks

May 18, 2015

I must have my wires crossed about Hadoop. I thought other folks were the creators of what became Hadoop. I read “Where Next for Hadoop? An Interview with Co-Creator Doug Cutting” to get my memory refreshed. (Note: you may have to register or pay to view the full text of this interview.)

According to the article, Doug Cutting and Mike Cafarella cooked up Hadoop in 2005. Cutting now works at Cloudera, which, according to Crunchbase, is

an enterprise software company that provides Apache Hadoop-based software and training to data-driven enterprises.

You can find some objective analyses of the company and its technology at http://bit.ly/1desDEN. I use the term “objective” to mean written by mid tier consultants.

I highlighted this statement:

Hadoop is already much more versatile and user-friendly than it was in the early days and innovations such as Yarn, Impala and Spark as well as a hardening of the platform’s security have all made it more “enterprise ready” too…

To underscore the user friendliness of Hadoop I circled in high intensity pink:

Asked whether some IT people are so bowled over by the number and choice of big data tools that they neglect to think how they will use them, Cutting agrees that this can be the case, but says that as use cases grow this issue will diminish. “It’s in an early stage of maturity so that’s not unexpected, but I think over time people are going to think about the functionality you’ve got in the distribution. You could have a SQL engine for analytics queries. You’ve got a NoSQL engine for reporting queries,” he says. So are companies like Cloudera, which, thanks to support from the likes of Intel (see below) and its vast marketing budget, distracting the market from the bigger picture? “There is confusion but I think it’s mostly because people are new to it and do not have much experience,” Cutting says.

And a final snippet:

“Mostly I think this mantle of open and standard is deceptive. It is neither open in that everybody’s really invited on equal terms to play, nor is it a standard. It’s a minority of people out there.”

There are other comments about Hadoop. I will leave them to you. Easy to use, not confusing, and no problems with open and standard. There are many consulting firms thrilled with Hadoop. Snap it in and dig into data. Versatile too.

Stephen E Arnold, May 18, 2015
