
Generalizations about Big Data: Hail, the Mighty Hadoop

May 26, 2015

I read “A Big Data Cheat Sheet: What Executives Want to Know.” The hidden agenda in the write up is revealed with the juxtaposition of the source Social Media Today and the technology Hadoop.

Big Data is one of those buzzwords which now grates on me. When I hear it, I wonder what the outfit is pitching and how something as nebulous as Big Data is going to save someone’s bacon or, if one is a vegetarian, tofu.

This write up beats the Hadoop drum. Isn’t Hadoop one method for performing certain types of data management tasks and extracting results from those tasks? Hadoop is a tool, and like a router in the home workshop, a pretty feisty gizmo in the hands of a novice.

The article suggests that Hadoop is a federation system. Hadoop can be a federation system, but it can handle data from a single source; for example, log files. Federation is not magic; it requires work. In fact, federation may render the benefits of Hadoop secondary to the cost of the resources required to utilize Hadoop in an effective way.
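For the simple, single source case, a Hadoop Streaming job is little more than two small scripts. Here is a minimal sketch in Python that counts status codes in Apache style log lines (the field position is an assumption about the log format); Hadoop would run the map and reduce stages as separate processes over HDFS splits:

```python
from collections import defaultdict

def map_line(line):
    """Emit (status_code, 1) for one Apache-style log line.
    Assumes the status code is the ninth whitespace-separated field."""
    fields = line.split()
    if len(fields) > 8 and fields[8].isdigit():
        return (fields[8], 1)
    return None  # malformed line: skip it

def reduce_pairs(pairs):
    """Sum the counts per status code, as a Hadoop reducer would."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# Hadoop Streaming runs the map and reduce stages as separate
# processes over HDFS splits; here we chain them over sample lines.
sample = [
    '1.2.3.4 - - [26/May/2015:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '1.2.3.4 - - [26/May/2015:10:00:01 +0000] "GET /x HTTP/1.1" 404 0',
]
counts = reduce_pairs(p for p in map(map_line, sample) if p)
```

Federating several sources into one such job is where the work, and the cost, begins: each source needs its own parsing, cleansing, and reconciliation before the tidy map and reduce steps apply.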

There are other assertions as well; for example:

  • Hadoop can archive “all data.” Hmmm. “All.” Does this sound a bit overblown?
  • Hadoop is enterprise ready? Sure, if the enterprise has the resources to make appropriate use of Hadoop.
  • Are data lakes and data warehouses the same? According to the write up, the data warehouse uses structured data and the data lake is just a big pool of disparate data. Queries across this type of “pool” can be exciting and expensive.
  • The upsides and downsides of the data lake pivot on data management. Okay, that is definitely true. What is not explored is the cost of managing large volumes of data, their updates, and their manipulation. Queries can be expensive.
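To see why lake queries get “exciting,” consider a toy “schema on read” pass. The record shapes and field names below are invented; the point is that a data lake pays the normalization cost on every query, while a warehouse paid it once at load time:

```python
def normalize(record):
    """Map two hypothetical source formats onto one schema at query
    time. In a data lake this runs on every query; in a warehouse the
    equivalent work was done once, at load time."""
    if "customer_id" in record:   # shape A: a CRM export
        return {"id": record["customer_id"], "spend": record["total_spend"]}
    if "cust" in record:          # shape B: log-derived records
        return {"id": record["cust"], "spend": float(record.get("amt", 0))}
    return None                   # unrecognized shape: the "exciting" part

def total_spend(lake):
    """Answer one business question across the disparate 'pool'."""
    rows = (normalize(r) for r in lake)
    return sum(r["spend"] for r in rows if r is not None)
```

Multiply the branches in `normalize` by every source format in the pool, and the “expensive” part of the data lake becomes concrete.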

My point is that sweeping generalizations about a technology which is useful are not helpful. Firing buzzwords into the mushy brain of a person involved in social media can have some interesting consequences.

Hadoop is not magic. Hadoop requires specialized knowledge. Hadoop does not deliver the way the tooth fairy delivers a quarter under one’s pillow. If Hadoop were the answer to Big Data problems, why are so many Hadoop projects vulnerable to very common problems in configuration, memory handling, lousy performance, and problematic hives?

Social media experts are not likely to appreciate these challenges as they work to deal with large volumes of data, updates, and queries. Oh, are the outputs valid? Frankly some Hadoop projects never face that problem.

Stephen E Arnold, May 26, 2015

Welcome YottaSearch

May 26, 2015

There is another player in the game of enterprise search: Yotta Data Technologies announced its newest product in “Yotta Data Technologies Announces Enterprise Search And Big Data Analytics Platform.”  Yotta Data Technologies is known for its affordable and easy to use information management solutions. Yotta has expanded its lineup by creating YottaSearch, a data analytics and search platform designed to be a data hub for organizations.

“YottaSearch brings together the most powerful and agile open source technologies available to enable today’s demanding users to easily collect data, search it, analyze it and create rich visualizations in real time.  From social media and email for Information Governance and eDiscovery to web and network server logs for Information Technology Operations Analytics (ITOA), YottaSearch™ provides the Big Data Analytics for users to derive information intelligence that may be critical to a project, case, business unit or market.”

YottaSearch uses the popular SaaS model and offers users not only data analytics and search, but also knowledge management, information governance, eDiscovery, and IT operations analytics.  Yotta decided to create YottaSearch to earn revenue from the burgeoning big data market, especially the enterprise search end.

The market is worth $1.7 billion, so Yotta has a lot of competition, but if they offer something different and better than their rivals they stand a chance to rise to the top.

Whitney Grace, May 26, 2015
Sponsored by, publisher of the CyberOSINT monograph

Maana from Heaven: Sustaining Big Data Search

May 23, 2015

Need to search Big Data in Hadoop? Other data management systems? Maana is now ready to assist you. Fresh from stealth mode, the company received an infusion of venture capital which now totals $14.2 million. (You may have to pay to access the details of this cash injection.) Maana garnered only a fraction of the money pumped into search vendors Attivio ($71 million), Coveo ($34 million) or Palantir (hundreds of millions). But Maana has some big name backers; for example, GE Ventures and Intel Capital, among others.

Maana’s manna looks a lot like legal tender.

According to the company:

Maana is pioneering new search technology for big data. It helps corporations drive significant improvements in productivity, efficiency, safety, and security in the operations of their core assets.

This value proposition strikes me as familiar.

Maana is ready to enable customers to perform knowledge modeling, evaluation, data understanding, data shaping, and orchestration. Differentiation is likely to be a challenge. The company offers this diagram to assist prospects in understanding why Maana is different from other Big Data search solutions:



A key differentiator is that the company says:

Maana is not based on open source Solr/Lucene.

That should chop out Lucidworks (Really?) and other open source Big Data options in a competitive fray.

Will Maana’s positioning tactic thwart other proprietary Big Data information access solutions? Hewlett Packard, are you ready to rumble? Oracle. Wait. Oracle is always ready to rumble. Google and In-Q-Tel backed Recorded Future? Oops. Recorded Future is jammed with work and inquiries as I understand it. Whatever. Let the proprietary Big Data search Copa de Data begin.

Stephen E Arnold, May 23, 2015

Is Collaboration the Key to Big Data Progress?

May 22, 2015

The article titled “Big Data Must Haves: Capacity, Compute, Collaboration” on GCN offers insights into the best areas of focus for big data researchers. The Internet2 Global Summit is in D.C. this year with many exciting panelists who support the emphasis on collaboration in particular. The article mentions the work being presented by several people, including Clemson professor Alex Feltus:

“…his research team is leveraging the Internet2 infrastructure, including its Advanced Layer 2 Service high-speed connections and perfSONAR network monitoring, to substantially accelerate genomic big data transfers and transform researcher collaboration…Arizona State University, which recently got 100 gigabit/sec connections to Internet2, has developed the Next Generation Cyber Capability, or NGCC, to respond to big data challenges.  The NGCC integrates big data platforms and traditional supercomputing technologies with software-defined networking, high-speed interconnects and visualization for medical research.”

Arizona’s NGCC provides the essence of the article’s claims, stressing capacity with Internet2, several types of computing, and of course collaboration between everyone at work on the system. Feltus commented on the importance of cooperation in Arizona State’s work, suggesting that personal relationships outweigh individual successes. He claims his own teamwork with network and storage researchers helped him find new potential avenues of innovation that might not have occurred to him without thoughtful collaboration.

Chelsea Kerwin, May 22, 2015

Stephen E Arnold, Publisher of CyberOSINT at

Big Data: The Shrinky Dink Approach

May 21, 2015

I read “To Handle Big Data, Shrink It.” Years ago I did a job for a unit of a blue chip consulting firm. My task was to find a technology which allowed a financial institution to query massive data sets without bringing the computing system to its knees and causing the on-staff programmers to howl with pain.

I located an outfit in what is now someplace near a Prague-like location. The company was CrossZ, and it used a wonky method of compression and a modified version of SQL with a point and click interface. The idea was that a huge chunk of the bank data—for instance, the transactions in the week before Mother’s Day—could be queried for purchasing-related trends. Think fraud. Think flowers. Think special promotions that increased sales. I have not kept track of the low profile, secretive company. I did think of it when I read the “shrink Big Data” story.

This passage resonated and sparked my memory:

MIT researchers will present a new algorithm that finds the smallest possible approximation of the original matrix that guarantees reliable computations. For a class of problems important in engineering and machine learning, this is a significant improvement over previous techniques. And for all classes of problems, the algorithm finds the approximation as quickly as possible.

The point is that it is now 2015 and a mid 1990s notion seems to be fresh. My hunch is that the approach will be described as novel, innovative, and a solution to the problems Big Data poses.

Perhaps the MIT approach is those things. For me, the basic idea is that Big Data has to be approached in a rational way. Otherwise, how will queries against “Big Data” which has already been processed, plus a stream of new or changed “Big Data,” be handled in a way that is affordable, computable, and meaningful to a person who has no clue what is “in” the Big Data?

Fractal compression, recursive methods, mereological techniques, and other methods are a good idea. I am delighted with the notion that Big Data has to be made small in order to be more useful to companies with limited budgets and a desire to answer some basic business questions with small data.
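The flavor of “making Big Data small” can be illustrated with a truncated SVD, a standard low rank approximation technique. To be clear, this is not the MIT algorithm, just an old, well understood stand in for the idea of replacing a matrix with a much smaller object that still answers the same questions:

```python
import numpy as np

def truncated_svd(A, k):
    """Return the best rank-k approximation of A (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# A 100 x 100 matrix that is secretly rank 2.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 2)) @ rng.standard_normal((2, 100))

# The rank-2 stand-in reproduces A almost perfectly while needing
# 2 * (100 + 100) numbers instead of 100 * 100.
A_small = truncated_svd(A, 2)
rel_err = np.linalg.norm(A - A_small) / np.linalg.norm(A)
```

Real data is rarely exactly low rank, which is why guaranteeing reliable computations from the shrunken matrix is the interesting part of the MIT work.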

Stephen E Arnold, May 21, 2015

Hadoop: Its Inventor Speaks

May 18, 2015

I must have my wires crossed about Hadoop. I thought other folks were the creators of what became Hadoop. I read “Where Next for Hadoop? An Interview with Co-Creator Doug Cutting” to get my memory refreshed. (Note: you may have to register or pay to view the full text of this interview.)

According to the article, Doug Cutting and Mike Cafarella cooked up Hadoop in 2005. Cutting now works at Cloudera, which, according to Crunchbase, is

an enterprise software company that provides Apache Hadoop-based software and training to data-driven enterprises. –

You can find some objective analyses of the company and its technology at I use the term “objective” to mean written by mid tier consultants.

I highlighted this statement:

Hadoop is already much more versatile and user-friendly than it was in the early days and innovations such as Yarn, Impala and Spark as well as a hardening of the platform’s security have all made it more “enterprise ready” too…

To underscore the user friendliness of Hadoop I circled in high intensity pink:

Asked whether some IT people are so bowled over by the number and choice of big data tools that they neglect to think how they will use them, Cutting agrees that this can be the case, but says that as use cases grow this issue will diminish. “It’s in an early stage of maturity so that’s not unexpected, but I think over time people are going to think about the functionality you’ve got in the distribution. You could have a SQL engine for analytics queries. You’ve got a NoSQL engine for reporting queries,” he says. So are companies like Cloudera, which, thanks to support from the likes of Intel (see below) and its vast marketing budget, distracting the market from the bigger picture? “There is confusion but I think it’s mostly because people are new to it and do not have much experience,” Cutting says.

And a final snippet:

Mostly I think this mantle of open and standard is deceptive. It is neither open in that everybody’s really invited on equal terms to play, nor is it a standard. It’s a minority of people out there.”

There are other comments about Hadoop. I will leave them to you. Easy to use, not confusing, and no problems with open and standard. There are many consulting firms thrilled with Hadoop. Snap it in and dig into data. Versatile too.

Stephen E Arnold, May 18, 2015

Preserves Online Information

May 18, 2015

Today’s information seekers use the Internet the way some of us used reference books growing up. Unlike the paper tomes on our dusty bookshelves, however, websites can change their content without so much as a by-your-leave. Suggestions for preserving online information can be found in “Create Publicly Available Web Page Archives with” at

Writer Martin Brinkmann begins by listing several local options familiar to many of us. There’s Ctrl-s, of course, and assorted screenshot-saving methods. Website archivers like Httrack perform their own crawls and save the results to the user’s local machine. Remotely, automatically creates snapshots of prominent sites, but users cannot control the results. Enter Brinkmann writes: is a free service that helps you out. To use it, paste a web address into the form on the services main page and hit submit url afterwards. The service takes two snapshots of that page at that point in time and makes it available publicly. The first takes a static snapshot of the site. You find images, text and other static contents included while dynamic contents and scripts are not. The second snapshot takes a screenshot of the page instead. An option to download the data is provided. Note that this downloads the textual copy of the site only and not the screenshot. A Firefox add-on has been created for the service which may be useful to some of its users. It creates automatic snapshots of every web page that you bookmark in the web browser after installation of the add-on.”

Wow, don’t set and forget that Firefox option! In fact, the article cautions, be mindful of the public availability of every snapshot; Brinkmann reasonably suggests the tool could benefit from a password feature. Still, this could be an option to preserve important (but, for the prudent, impersonal) information found online.
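The “static snapshot” idea—keep the text, drop the scripts—can be approximated with Python’s standard library. This is a rough sketch of the concept, not the service’s actual method:

```python
from html.parser import HTMLParser

class StaticSnapshot(HTMLParser):
    """Collect visible text while skipping <script> and <style>
    bodies, the way a static snapshot discards dynamic content."""
    def __init__(self):
        super().__init__()
        self.skipping = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skipping += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skipping:
            self.skipping -= 1

    def handle_data(self, data):
        if not self.skipping and data.strip():
            self.chunks.append(data.strip())

def snapshot_text(html):
    parser = StaticSnapshot()
    parser.feed(html)
    return " ".join(parser.chunks)
```

Run against a fetched page and saved to disk, a snippet like this gives a local, private version of the textual snapshot the service makes public.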

Cynthia Murrell, May 18, 2015

Stephen E Arnold, Publisher of CyberOSINT at

Exit Governance. Enter DMP.

May 17, 2015

A DMP is a data management platform. I think in terms of databases. I find that software does not do a particularly reliable job “managing data.” Software can run processes, write log files, and perform other functions. But management, based on my experience at Booz, Allen & Hamilton, requires humans. Talking about analytics from Big Data and implementing a platform to perform management are apples and house paint in my mind.

Intrigued by the reference, I downloaded a document available upon registration from Infinitive. You can find the company’s Web site at The white paper lists 10 ways a data management platform can help me.

I was not familiar with Infinitive. According to the firm’s Web site: Infinitive is

A Different Kind of Consultancy. Results-driven and client-centric. Fun, focused and flexible. Highly engaged and easy to work with. Those are the qualities that make Infinitive a different kind of consultancy. And they’re the pillars of our unique culture. Headquartered in the Washington, D.C. area, Infinitive specializes in digital ad solutions, business transformation, customer & audience intelligence and enterprise risk management. Leveraging best practices in process engineering, change management and program management, we design and deliver custom solutions for leading organizations in communications, media and entertainment, financial services and educational services. For our clients, the results include quantifiable performance improvement and tangible bottom-line value in addressing their most pressing challenges and fulfilling their top-priority objectives.

What is a data management platform?

The white paper, really a two page document, identifies these benefits of a DMP. I was hoping for an explanation of the “platform,” but let’s look at the payoffs from the platform.

The company points out that a DMP makes ad money go farther. Big Data become actionable. A DMP provides a foundation for analytics. The DMP “ensures the quality and accessibility of customer and audience intelligence data.” The DMP can harmonize data. A DMP allows me to “adapt traditional CRM strategies and technology to incorporate new customer behavior.” I can create new customer and audience “segments.” The DMP becomes the central nervous system for my company. And the DMP protects privacy.

That is a bundle of benefits. But what is the platform provided by a consulting company, especially one that is “fun”? I was not able to locate details about the platform. The company appears to be a firm focused on advertising.

The Web site includes a page about the DMP at this link. The information is buzzword heavy and fact free. My view is that the DMP is a marketing hook. The implied technology is consulting services. That’s okay, but I find the approach representative of marketing billable time, not delivering a platform with the remarkable and perhaps unattainable benefits suggested in the white paper.
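Stripped of the buzzwords, “harmonizing” and “segmenting” boil down to grouping records by behavior. A toy sketch, with invented rules and thresholds, shows how little “platform” the idea itself requires:

```python
def assign_segment(customer):
    """Toy segmentation rules. A real DMP would derive segments from
    models; the thresholds here are invented for illustration."""
    if customer["purchases"] == 0:
        return "prospect"
    if customer["purchases"] >= 10:
        return "loyal"
    return "active"

def build_segments(customers):
    """'Harmonized' output: one map from segment name to customer ids."""
    segments = {}
    for c in customers:
        segments.setdefault(assign_segment(c), []).append(c["id"])
    return segments
```

The hard, billable part is everything the sketch omits: collecting the records, reconciling identities across sources, and keeping the segments current.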

The approach must work. The company’s Web site points out this message:


Not a platform, however.

Stephen E Arnold, May 17, 2015

HP Idol and Hadoop: Search, Analytics, and Big Data for You

May 16, 2015

I was clicking through links related to Autonomy IDOL. One of the links which I noted was to a YouTube video labeled “HP IDOL for Hadoop: Create a Smarter Data Lake.” Hadoop has become a synonym for making sense of Big Data. I am not sure what Big Data are, but I assume I will know when my eight gigabyte USB key cannot accept another file. Big Data? Doesn’t it depend on one’s point of view?

What is fascinating about the HP Idol video is that it carries a posting date of October 2014, which is in the period when HP was ramping up its anti-Autonomy legal activities. The video, I assumed before watching, would break from the Autonomy marketing assertions and move in a bold, new direction.

The video contained some remarkable assertions. Please, watch the video yourself because I may have missed some howlers as I was chuckling and writing on my old school notepad with a decidedly old fashioned pencil. Hey, these tools work, which is more than I can say for some of the software we examined last week.

Here’s what I noted with the accompanying screenshot so you can locate the frame in the YouTube video to double check my observation with the reality of the video.

First, there is the statement that in an organization 88 percent of its information is “unanalyzed.” The source is a 2012 study, Forrsights Strategy Spotlight: Business Intelligence and Big Data. Forrester, another mid tier consulting firm, produces these reports for its customers. Okay, the research is a couple of years old. Maybe it is valid? Maybe not? My thought was that HP may be a company which did not examine the data to which it had access about Autonomy before it wrote a check for billions of dollars. I assume HP has rectified any glitch along this line. HP’s litigation with Autonomy and the billions in write down for the deal underscore the problem with unanalyzed data. Alas, no reference was made to this case example in the HP video.

Second, Hadoop, a variant of Google’s MapReduce technology, is presented as a way to reap the benefits of cost efficiency and scalability. These are generally desirable attributes of Hadoop and other data management systems. The hitch, in my opinion, is that it is a collection of projects. These have been developed via the open source / commercial model. Hadoop works well for certain types of problems. Extract, transform, and load works reasonably well once the Hadoop installation is set up, properly resourced, and the Java code debugged so it works. Hadoop requires some degree of technical sophistication; otherwise, the system can be slow, stuffed with duplicates, and a bit like a Rube Goldberg machine. But the Hadoop references in the video are not a demonstration. I noted this “explanation.”
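The “stuffed with duplicates” failure mode usually comes from re-running ingest jobs without idempotent keys. A minimal dedup pass of the kind a transform step needs (the record shape is hypothetical):

```python
def dedup(records, key_fields=("source", "record_id")):
    """Drop re-ingested duplicates by composite key, keeping the first
    occurrence. Without a step like this, re-running an ETL job simply
    appends the same rows to the store again."""
    seen = set()
    kept = []
    for r in records:
        key = tuple(r[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept
```

At Hadoop scale the `seen` set itself becomes a distributed problem, which is exactly the kind of technical sophistication the marketing videos skip.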


Third, HP jumps from the Hadoop segment to “what if” questions. I liked the “democratize Big Data” because “Big Data Changes everything.” Okay, but the solution is Idol for Hadoop. The HP approach is to create a “smarter data lake.” Hmmm. Hadoop to Idol to data lake for the purpose of advanced analytics, machine learning functions, and enterprise level security. That sounds quite a bit like Autonomy’s value proposition before it was purchased from Dr. Lynch and company. In fact, Autonomy’s connectors permitted the system to ingest disparate types of data as I recall.

Fourth, the next logical discontinuity is the shift from Hadoop to something called “contextual search.” A Gartner report is presented which states with Douglas MacArthur-like confidence:

HP Idol. A leader in the 2014 Gartner Magic Quadrant for Contextual Search.

What the heck is contextual search in a Hadoop system accessed by Autonomy Idol? The answer is SEARCH. Yep, a concept that has been difficult to implement for 20, maybe 30 years. Search is so difficult to sell that Dr. Lynch generated revenues by acquiring companies and applying his neuro-linguistic methods to these firms’ software. I learned:

The sophistication and extensibility of HP Autonomy’s Intelligent Data Operating Layer (Idol) offering enable it to tackle the most demanding use cases, such as fraud detection and search within large video libraries and feeds.

Yo, video. I thought Autonomy acquired video centric companies and the video content resided within specialized storage systems using quite specific indexing and information access features. Has HP cracked the problem of storing video in Hadoop so that a licensee can perform fraud detection and search within video libraries? My experience with large video libraries is that certain video, like surveillance footage, is pretty tough to process with accuracy. Humans, even academic trainees, can be placed in front of a video monitor and told, “Watch this stream. Note anomalies.” Not exciting but necessary because processing large volumes of video remains what I would describe as “a bit of a challenge, grasshopper.” Why is Google adding wild and crazy banners, overlays, and required metadata inputs? Maybe because automated processing and magical deep linking are out of reach? HP appears to have improved or overhauled Autonomy’s video analysis functions, and the Gartner analyst is reporting a major technical leap forward. Identifying a muzzle flash is different from recognizing a face in a flow of subway patrons captured on a surveillance camera, is it not?


I have heard some pre HP Autonomy sales pitches, but I can’t recall hearing that Idol can crunch flows of video content unless one uses the quite specialized system Autonomy acquired. Well, I have been wrong before, and I am certainly not qualified to be an analyst like the ones Gartner relies upon. I learned that HP Idol has a comprehensive list of data connectors. I think I would use the word “library,” but why niggle?

Fifth, the video jumps to a presentation of a “content hub.” The idea is that HP Idol provides visual programming tools. I assume an HP Idol customer will point and click to create queries. The queries will deliver outputs from the Hadoop data management system and the content which embodies the data lake. The user can also run a query and see a list of documents. But the video skips past what strikes me as exactly what many users no longer want to do to locate information. One can search effectively when one knows what one is looking for and when the needed information is actually in the index. The use case appears to be health care, and the video concludes with a reminder that one can perform advanced analytics. There is a different point of view available in this ParAccel white paper.

I understand the strengths and weaknesses of videos. I have been doing some home brew videos since I retired. But HP is presenting assertions about Autonomy’s technology which seem to be out of step with my understanding of what Idol, the digital reasoning engine, and Autonomy’s acquired video technology can deliver.

The point is that HP seems to be out-marketing Autonomy’s marketing. The assertions and logical leaps in the HP Idol Hadoop video stretch the boundaries of my credulity. I find this interesting because HP is alleging that Autonomy used similar verbal polishing to convince HP to write a billion dollar check for a search vendor which had grown via acquisitions over a period of 15 years.

Stephen E Arnold, May 16, 2015

Explaining Big Data Mythology

May 14, 2015

Mythologies usually develop over the course of centuries, but big data has only been around for (arguably) a couple decades—at least in its modern incarnation.  Recently big data has received a lot of media attention and product development, which was enough to give the Internet time to create a big data mythology.  The Globe and Mail wanted to dispel some of the bigger myths in the article “Unearthing Big Myths About Big Data.”

The article focuses on Prof. Joerg Niessing’s big data expertise and how he explains the truth behind many of the biggest big data myths.  One of the biggest points Niessing wants people to understand is that gathering data does not equal dollar signs; you have to be active with the data:

“You must take control, starting with developing a strategic outlook in which you will determine how to use the data at your disposal effectively. “That’s where a lot of companies struggle. They do not have a strategic approach. They don’t understand what they want to learn and get lost in the data,” he said in an interview. So before rushing into data mining, step back and figure out which customer segments and what aspects of their behavior you most want to learn about.”

Niessing says that big data is not really big, but made up of many diverse data points.  Big data also does not have all the answers; instead it provides ambiguous results that need to be interpreted.  Have the questions you want answered in hand before gathering data.  Also, not all of the data returned is useful.  Some of it is actually garbage, so it cannot be used for a project.  Several other myths are uncovered, but the truth remains that having a strategic big data plan in place is the best way to make the most of big data.

Whitney Grace, May 14, 2015

Sponsored by, publisher of the CyberOSINT monograph
