Archive.is Preserves Online Information

May 18, 2015

Today’s information seekers use the Internet the way some of us used reference books growing up. Unlike the paper tomes on our dusty bookshelves, however, websites can change their content without so much as a by-your-leave. Suggestions for preserving online information can be found in “Create Publicly Available Web Page Archives with Archive.is” at gHacks.net.

Writer Martin Brinkmann begins by listing several local options familiar to many of us. There’s Ctrl-s, of course, and assorted screenshot-saving methods. Website archivers like Httrack perform their own crawls and save the results to the user’s local machine. Remotely, Archive.org automatically creates snapshots of prominent sites, but users cannot control the results. Enter Archive.is. Brinkmann writes:

“Archive.is is a free service that helps you out. To use it, paste a web address into the form on the service’s main page and hit submit url afterwards. The service takes two snapshots of that page at that point in time and makes it available publicly. The first takes a static snapshot of the site. You find images, text and other static contents included while dynamic contents and scripts are not. The second snapshot takes a screenshot of the page instead. An option to download the data is provided. Note that this downloads the textual copy of the site only and not the screenshot. A Firefox add-on has been created for the service which may be useful to some of its users. It creates automatic snapshots of every web page that you bookmark in the web browser after installation of the add-on.”

Wow, don’t set and forget that Firefox option! In fact, the article cautions, be mindful of the public availability of every Archive.is snapshot; Brinkmann reasonably suggests the tool could benefit from a password feature. Still, this could be an option to preserve important (but, for the prudent, impersonal) information found online.
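For readers who would rather script the submission than visit the form, here is a minimal sketch in Python. It assumes Archive.is accepts a plain form POST to a /submit/ endpoint with a url field; the service has not published a formal API, so treat the endpoint and the response handling as assumptions.

import requests

ARCHIVE_ENDPOINT = "https://archive.is/submit/"  # assumed form endpoint; may change without notice

def archive_page(url):
    """Ask Archive.is to snapshot a page and return the snapshot location."""
    response = requests.post(
        ARCHIVE_ENDPOINT,
        data={"url": url},
        headers={"User-Agent": "archive-sketch/0.1"},
        timeout=60,
    )
    response.raise_for_status()
    # The service typically points to the new snapshot via a Refresh header
    # or a redirect; fall back to the final response URL otherwise.
    return response.headers.get("Refresh", response.url)

if __name__ == "__main__":
    print(archive_page("https://example.com/"))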

Cynthia Murrell, May 18, 2015

Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com

Exit Governance. Enter DMP.

May 17, 2015

A DMP is a data management platform. I think in terms of databases. I find that software does not do a particularly reliable job “managing data.” Software can run processes, write log files, and perform other functions. But management, based on my experience at Booz, Allen & Hamilton, requires humans. Talking about analytics from Big Data and implementing a platform to perform management are apples and house paint in my mind.

Intrigued by the reference, I downloaded a document available upon registration from Infinitive. You can find the company’s Web site at www.infinitive.com. The white paper maps out 10 ways a data management platform can help me.

I was not familiar with Infinitive. According to the firm’s Web site, Infinitive is:

A Different Kind of Consultancy. Results-driven and client-centric. Fun, focused and flexible. Highly engaged and easy to work with. Those are the qualities that make Infinitive a different kind of consultancy. And they’re the pillars of our unique culture. Headquartered in the Washington, D.C. area, Infinitive specializes in digital ad solutions, business transformation, customer & audience intelligence and enterprise risk management. Leveraging best practices in process engineering, change management and program management, we design and deliver custom solutions for leading organizations in communications, media and entertainment, financial services and educational services. For our clients, the results include quantifiable performance improvement and tangible bottom-line value in addressing their most pressing challenges and fulfilling their top-priority objectives.

What is a data management platform?

The white paper, really a two page document, identifies these benefits of a DMP. I was hoping for an explanation of the “platform,” but let’s look at the payoffs from the platform.

The company points out that a DMP makes ad money go farther. Big Data become actionable. A DMP provides a foundation for analytics. The DMP “ensures the quality and accessibility of customer and audience intelligence data.” The DMP can harmonize data. A DMP allows me to “adapt traditional CRM strategies and technology to incorporate new customer behavior.” I can create new customer and audience “segments.” The DMP becomes the central nervous system for my company. And the DMP protects privacy.

That is a bundle of benefits. But what is the platform provided by a consulting company, especially one that is “fun”? I was not able to locate details about the platform. The company appears to be a firm focused on advertising.

The Web site includes a page about the DMP at this link. The information is buzzword heavy and fact free. My view is that the DMP is a marketing hook. The implied technology is consulting services. That’s okay, but I find the approach representative of marketing billable time, not delivering a platform with the remarkable and perhaps unattainable benefits suggested in the white paper.

The approach must work. The company’s Web site points out this message:

[Image from the Infinitive Web site]

Not a platform, however.

Stephen E Arnold, May 17, 2015

HP Idol and Hadoop: Search, Analytics, and Big Data for You

May 16, 2015

I was clicking through links related to Autonomy IDOL. One of the links which I noted was to a YouTube video labeled “HP IDOL for Hadoop: Create a Smarter Data Lake.” Hadoop has become a synonym for making sense of Big Data. I am not sure what Big Data are, but I assume I will know when my eight gigabyte USB key cannot accept another file. Big Data? Doesn’t it depend on one’s point of view?

What is fascinating about the HP Idol video is that it carries a posting date of October 2014, which is in the period when HP was ramping up its anti-Autonomy legal activities. The video, I assumed before watching, would break from the Autonomy marketing assertions and move in a bold, new direction.

The video contained some remarkable assertions. Please, watch the video yourself because I may have missed some howlers as I was chuckling and writing on my old school notepad with a decidedly old fashioned pencil. Hey, these tools work, which is more than I can say for some of the software we examined last week.

Here’s what I noted, with the accompanying screenshots, so you can locate the frames in the YouTube video and double check my observations against the reality of the video.

First, there is the statement that in an organization 88 percent of its information is “unanalyzed.” The source is a 2012 study from Forrsights Strategy Spotlight: Business Intelligence and Big Data. Forrester, another mid tier consulting firm, produces these reports for its customers. Okay, the research is a couple of years old. Maybe it is valid? Maybe not? My thought was that HP may be a company which did not examine the data to which it had access about Autonomy before it wrote a check for billions of dollars. I assume HP has rectified any glitch along this line. HP’s litigation with Autonomy and the billions in write down for the deal underscore the problem with unanalyzed data. Alas, no reference was made to this case example in the HP video.

Second, Hadoop, a variant of Google’s MapReduce technology, is presented as a way to reap the benefits of cost efficiency and scalability. These are generally desirable attributes of Hadoop and other data management systems. The hitch, in my opinion, is that it is a collection of projects. These have been developed via the open source / commercial model. Hadoop works well for certain types of problems. Extract, transform, and load works reasonably well once the Hadoop installation is set up, properly resourced, and the Java code debugged so it works. Hadoop requires some degree of technical sophistication; otherwise, the system can be slow, stuffed with duplicates, and a bit like a Rube Goldberg machine. But the Hadoop references in the video are not a demonstration. I noted this “explanation.”

[Image: the Hadoop “explanation” frame from the HP video]
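Since the video offers no demonstration, here is a minimal sketch of the kind of job Hadoop actually runs: a mapper and a reducer for Hadoop Streaming, which sidesteps the Java debugging step by piping records through Python scripts. The input layout, field positions, and paths are assumptions for illustration, not anything shown in the HP video.

#!/usr/bin/env python3
"""Minimal Hadoop Streaming job: count events per customer ID.

Assumed invocation (paths and field layout are illustrative only):
  hadoop jar hadoop-streaming.jar \
    -input /data/raw/events -output /data/out/event_counts \
    -mapper "python3 etl_job.py map" -reducer "python3 etl_job.py reduce" \
    -file etl_job.py
"""
import sys

def mapper():
    # Extract: assume tab-separated lines with the customer ID in column 0.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            print(f"{fields[0]}\t1")

def reducer():
    # Aggregate: Hadoop delivers keys sorted, so sum each run of equal keys.
    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{count}")
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        print(f"{current_key}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()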

Third, HP jumps from the Hadoop segment to “what if” questions. I liked the “democratize Big Data” because “Big Data Changes everything.” Okay, but the solution is Idol for Hadoop. The HP approach is to create a “smarter data lake.” Hmmm. Hadoop to Idol to data lake for the purpose of advanced analytics, machine learning functions, and enterprise level security. That sounds quite a bit like Autonomy’s value proposition before it was purchased from Dr. Lynch and company. In fact, Autonomy’s connectors permitted the system to ingest disparate types of data as I recall.

Fourth, the next logical discontinuity is the shift from Hadoop to something called “contextual search.” A Gartner report is presented which states with Douglas MacArthur-like confidence:

HP Idol. A leader in the 2014 Gartner Magic Quadrant for Contextual Search.

What the heck is contextual search in a Hadoop system accessed by Autonomy Idol? The answer is SEARCH. Yep, a concept that has been difficult to implement for 20, maybe 30 years. Search is so difficult to sell that Dr. Lynch generated revenues by acquiring companies and applying his neuro-linguistic methods to these firms’ software. I learned:

The sophistication and extensibility of HP Autonomy’s Intelligent Data Operating Layer (Idol) offering enable it to tackle the most demanding use cases, such as fraud detection and search within large video libraries and feeds.

Yo, video. I thought Autonomy acquired video centric companies and the video content resided within specialized storage systems using quite specific indexing and information access features. Has HP cracked the problem of storing video in Hadoop so that a licensee can perform fraud detection and search within video libraries? My experience with large video libraries is that certain video, like surveillance footage, is pretty tough to process with accuracy. Humans, even academic trainees, can be placed in front of a video monitor and told, “Watch this stream. Note anomalies.” Not exciting but necessary because processing large volumes of video remains what I would describe as “a bit of a challenge, grasshopper.” Why is Google adding wild and crazy banners, overlays, and required metadata inputs? Maybe because automated processing and magical deep linking are out of reach? HP appears to have improved or overhauled Autonomy’s video analysis functions, and the Gartner analyst is reporting a major technical leap forward. Identifying a muzzle flash is different from recognizing a face in a flow of subway patrons captured on a surveillance camera, is it not?

[Image: frame from the HP Idol video]

I have heard some pre HP Autonomy sales pitches, but I can’t recall hearing that Idol can crunch flows of video content unless one uses the quite specialized system Autonomy acquired. Well, I have been wrong before, and I am certainly not qualified to be an analyst like the ones Gartner relies upon. I learned that HP Idol has a comprehensive list of data connectors. I think I would use the word “library,” but why niggle?

Fifth, the video jumps to a presentation of a “content hub.” The idea is that HP Idol provides visual programming tools. I assume an HP Idol customer will point and click to create queries. The queries will deliver outputs from the Hadoop data management system and the content which embodies the data lake. The user can also run a query and see a list of documents, which strikes me as exactly what many users no longer want to do to locate information. One can search effectively when one knows what one is looking for and the needed information is actually in the index. The use case appears to be health care, and the video concludes with a reminder that one can perform advanced analytics. There is a different point of view available in this ParAccel white paper.
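For contrast with the point-and-click content hub, here is a minimal sketch of how a query against a Hadoop-backed data lake is often expressed today, using Spark SQL rather than HP’s tooling. The HDFS path, table, and column names are invented for illustration.

from pyspark.sql import SparkSession

# Build a Spark session; in a real cluster this would run against YARN/Hadoop.
spark = SparkSession.builder.appName("data-lake-query-sketch").getOrCreate()

# Assume clinical documents have been landed in the lake as Parquet files.
docs = spark.read.parquet("hdfs:///lake/healthcare/documents")
docs.createOrReplaceTempView("documents")

# A plain SQL query over the lake: the kind of known-item retrieval the
# video describes, minus the visual programming layer.
results = spark.sql("""
    SELECT doc_id, title, admit_date
    FROM documents
    WHERE body LIKE '%sepsis%'
    ORDER BY admit_date DESC
    LIMIT 20
""")
results.show(truncate=False)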

I understand the strengths and weaknesses of videos. I have been doing some home brew videos since I retired. But HP is presenting assertions about Autonomy’s technology which seem to be out of step with my understanding of what Idol, the digital reasoning engine, and Autonomy’s acquired video technology can do.

The point is that HP seems to be out-marketing Autonomy’s marketing. The assertions and logical leaps in the HP Idol Hadoop video stretch the boundaries of my credulity. I find this interesting because HP is alleging that Autonomy used similar verbal polishing to convince HP to write a multi-billion dollar check for a search vendor which had grown via acquisitions over a period of 15 years.

Stephen E Arnold, May 16, 2015

Explaining Big Data Mythology

May 14, 2015

Mythologies usually develop over the course of centuries, but big data has only been around for (arguably) a couple decades, at least in its modern incarnation. Recently big data has received a lot of media attention and product development, which was enough to give the Internet time to create a big data mythology. The Globe and Mail wanted to dispel some of the bigger myths in the article, “Unearthing Big Myths About Big Data.”

The article focuses on Prof. Joerg Niessing’s big data expertise and how he explains the truth behind many of the biggest big data myths. One of the biggest items that Niessing wants people to understand is that gathering data does not equal dollar signs; you have to be active with the data:

“You must take control, starting with developing a strategic outlook in which you will determine how to use the data at your disposal effectively. “That’s where a lot of companies struggle. They do not have a strategic approach. They don’t understand what they want to learn and get lost in the data,” he said in an interview. So before rushing into data mining, step back and figure out which customer segments and what aspects of their behavior you most want to learn about.”

Niessing says that big data is not really big, but made up of many diverse data points. Big data also does not have all the answers; instead, it provides ambiguous results that need to be interpreted. Have questions you want answered before gathering data. Also, not all of the data returned is the greatest. Some of it is actually garbage, so it cannot be used for a project. Several other myths are uncovered, but the truth remains that having a strategic big data plan in place is the best way to make the most of big data.

Whitney Grace, May 14, 2015

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

SharePoint Server 2016 Details Released

May 12, 2015

Some details about the rollout of SharePoint Server 2016 were revealed at the much-anticipated Ignite event in Chicago last week. Microsoft now says it is on track with the project, with a public beta available in the fourth quarter of this year and “release candidate” and “general availability” versions to follow. Read more in the Redmond Magazine article, “SharePoint Server 2016 Roadmap Highlighted at Ignite Event.”

The article addresses the tension between cloud and on-premises versions:

“While Microsoft has been developing the product based on its cloud learnings, namely SharePoint Online as part of its Office 365 services, those cloud-inspired features eventually will make their way back into the server product. The capabilities that don’t make it into the server will be offered as Office 365 services that can be leveraged by premises-based systems.”

It appears that the delayed timeline may be a “worst case scenario” measure, and that the release could happen earlier. After all, it is better for customers to be prepared for the worst and be pleasantly surprised. To stay in touch with the latest news regarding features and timeline, keep an eye on ArnoldIT.com, specifically the SharePoint feed. Stephen E. Arnold is a longtime leader in search and serves as a great resource for individuals who need access to the latest SharePoint news at a glance.

Emily Rae Aldridge, May 12, 2015

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Neural Networks Finally Have Their Day

May 11, 2015

The Toronto Star offers a thoughtful piece about deep learning titled, “How a Toronto Professor’s Research Revolutionized Artificial Intelligence.” Professor Geoffrey Hinton has been instrumental in pursuing the development of neural network-based AI since long before the concept was popular. Lately, though, this “deep learning” approach has taken off, launching many a product, corporate division, and startup. Reporter Kate Allen reveals who we can credit for leading neural networks through the shadows of doubt:

“Ask anyone in machine learning what kept neural network research alive and they will probably mention one or all of these three names: Geoffrey Hinton, fellow Canadian Yoshua Bengio and Yann LeCun, of Facebook and New York University.

“But if you ask these three people what kept neural network research alive, they are likely to cite CIFAR, the Canadian Institute for Advanced Research. The organization creates research programs shaped around ambitious topics. Its funding, drawn from both public and private sources, frees scientists to spend more time tackling those questions, and draws experts from different disciplines together to collaborate.”

Hooray for CIFAR! The detailed article describes what gives deep learning the edge, explains why “machine learning” is a better term than “AI”, and gives several examples of ways deep learning is being used today, including Hinton’s current work at Google and the University of Toronto. Allen also traces the history of the neural network from its conceptualization in 1958 by Frank Rosenblatt, through an era of skepticism, to its recent warm embrace by the AI field. I recommend interested parties check out the full article. We’re reminded:

“In 2006, Hinton and a PhD student, Ruslan Salakhutdinov, published two papers that demonstrated how very large neural networks, once too slow to be effective, could work much more quickly than before. The new nets had more layers of computation: they were ‘deep,’ hence the method’s rebranding as deep learning. And when researchers began throwing huge data sets at them, and combining them with new and powerful graphics processing units originally built for video games, the systems began beating traditional machine learning systems that had been tweaked for decades. Neural nets were back.”
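As a concrete illustration of “more layers of computation,” here is a minimal sketch of a small multi-layer network in plain NumPy. It is a toy forward pass, not Hinton’s method; the layer sizes are arbitrary, and training is omitted.

import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# A "deep" network is just several stacked layers; older nets used one or two.
layer_sizes = [784, 256, 128, 10]  # e.g. image pixels -> hidden -> hidden -> classes
weights = [rng.normal(0, 0.01, (m, n)) for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def forward(x):
    """Run one input through every layer; GPUs make these matrix products fast."""
    for w, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ w + b)
    return x @ weights[-1] + biases[-1]

sample = rng.normal(size=784)      # stand-in for a flattened 28x28 image
print(forward(sample).shape)       # -> (10,)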

What detailed discussion of machine learning would be complete without a nod to concerns that we develop AI at our peril? Allen takes some time to sketch out both sides of that debate, and summarizes:

“Some in the field believe that artificial intelligence will augment, not replace: algorithms will free us from rote tasks like memorizing reams of legal precedents and allow us to pursue the higher-order thinking our massive brains are capable of. Others think the only tasks machines can’t do better are creative ones.”

I suppose the answers to those debates will present themselves eventually. Personally, I’m more excited than scared by the possibilities. How about you, dear reader?

Cynthia Murrell, May 11, 2015

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Blur Private Search Promises to Hide User Identities from Google

May 8, 2015

We advise you to not take this advice: ReadWrite purports to tell us “How to Blur your Search Tracks on Google.” The article profiles Blur Private Search from privacy company Abine, a shield service that works to hide your identity from Google’s prying databases. The tool does this by setting each user up with a fake, cookie-free identity for each search. Writer Yael Grauer tells us:

“Private Search provides a new made-up identity for each individual search. It then funnels the request through an SSL tunnel, so that the search is encrypted—even Abine can’t see what you’re searching for. And every phrase or topic you search appears as if it is unconnected to previous searches, since each query is sent through Abine’s server with an entirely different IP address (which is yet another avenue by which websites can track people).

“Your search requests are modified before leaving your browser in a way that breaks the identity connection between your searches and the rest of your tabs. That means you can keep your YouTube tab open with all of your videos, and stay logged into Gmail, all without allowing Google to link your search queries with your account (and identity).”
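The mechanics Grauer describes boil down to giving every query its own throwaway identity. Here is a minimal sketch of that idea in Python: each search gets a brand-new, cookie-free session routed through a proxy. The proxy address is a placeholder, and the whole thing illustrates the concept only; it is not Abine’s implementation.

import requests

PROXY = {"https": "http://proxy.example.net:8080"}  # placeholder, not Abine's infrastructure

def private_search(query):
    """Send one query with a brand-new session so no cookies or headers link it
    to earlier searches; the proxy hides the client's own IP address."""
    with requests.Session() as session:  # fresh session = fresh throwaway "identity"
        session.headers["User-Agent"] = "generic-browser/1.0"
        response = session.get(
            "https://www.google.com/search",
            params={"q": query},
            proxies=PROXY,
            timeout=30,
        )
        response.raise_for_status()
        return response.text

# Each call below is unlinkable to the others at the cookie level.
for q in ("tulle skirts", "power8 hana"):
    page = private_search(q)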

At this time, the tool runs only in Firefox, and they have not yet implemented the in-results visuals that let you know it is working. Those problems will be fixed, but the bigger issue lies in trying to hide the tracks of anything typed into Google. Even the folks at Abine admit that people with something to hide that could put them in actual danger (Chinese dissidents, for example) would be better off going through Tor. There are other engines that don’t track in the first place, too. At the same time, it is true that Google’s functionality is unmatched, so users must weigh their priorities; one might use a non-tracking tool for anything financial, health, or uprising-related, for example, and Google for everything else. Just a suggestion.

Boston-based Abine bills itself as “the online privacy company,” and its goal is to bring user-friendly security to anyone who goes online. Its other products include DoNotTrackMe, MaskMe, and DeleteMe. The company was founded in 2008.

Cynthia Murrell, May 8, 2015

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 

IBM and SAP: More Power Delivered for Big Data

May 5, 2015

I read “IBM Creates Power Systems Servers for Big Data Crunching in SAP HANA.” The story line is easy to grasp: struggling IBM has purpose-built fast servers for the IBM-like SAP. According to the write up:

IBM has expanded its partnership with SAP by creating Power Systems server configurations specifically designed to enhance the way SAP HANA is deployed for big data projects.

IBM said its Power Systems Solution Editions for SAP HANA will allow users of IBM’s Power8 systems to deploy the in-memory database management platform faster and in a more cost-effective manner.

What’s interesting is that both companies have compute intensive content processing systems. The challenge of making sense of structured and unstructured information is a need IBM and SAP customers have.

The fix is big iron. Crunching large volumes of data in real time appears to be an issue both IBM and SAP wish to resolve.

The implication is that cloud services like those available from Amazon and HP are not up to the task. The tie up sounds good. The article references content processing as well:

Powering big data analytics and database management appears to be a major part of IBM’s strategy. The company recently entered the healthcare big data market by creating Watson Health after snapping up big data and cloud startups. Big Blue is also teaming up with Twitter to analyze big data harvested from the social network.

One minor point: Will customers be able to realize cost savings? Are IBM and a company with IBM’s DNA cost effective? “Cost savings” are easy to say and sometimes difficult to deliver. I assume one can ask Watson.

Stephen E Arnold, May 5, 2015

A Binging Double Take 

May 1, 2015

After you read this headline from Venture Beat, you will definitely be doing a double take: “ComScore: Bing Passes 20% Share In The US For The First Time.” Bing has been the punch line for search experts and IT professionals ever since it was deployed a few years ago. Anyone will attest that Bing is not the most accurate search engine, a reputation owed mostly to its being a Microsoft product. Bing developers have been working to improve the search engine’s accuracy, and for the first time ComScore showed both Google and Yahoo falling 0.1 percentage points while Bing gained 0.3 points, most likely taking share from DuckDuckGo and other smaller search engines. Microsoft can proudly state that one in five searches is conducted on Bing.

The change comes after months of stagnation:

“For many months, ComScore’s reports showed next to no movement for each search service (a difference of 0.1 points or 0.2 points one way or the other, if that). A 0.3 point change is not much larger, but it does come just a few months after big gains from Yahoo. So far, 2015 is already a lot more exciting, and it looks like the search market is going to be worth paying close attention to.”

The article says that most search engine usage is driven by which Internet browsers people use. Yahoo keeps telling people to move to Firefox, and Google wants people to download Chrome. The browser and search engine rivalries continue, but Google still remains on top. How long will Bing be able to keep this bragging point?

Whitney Grace, May 1, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Google and Its Fashion Forward Strategy

April 27, 2015

Enough with the advanced technology. The Google is getting into the Project Runway world. I learned a couple of days ago that the fascinating Google Glass is making a comeback. I mean a fashionable comeback. I think the phrase is fashion forward. None of the Glasshole stuff. Like New Coke, the fashionable Glass will be a winner from the Italian outfit. You can read about the new Glass or Glass 2.0 in “Luxottica Working on Intel Powered Google Glass 2.0.” Curious about Luxottica? Here’s some background information:

[Image: Luxottica background information]

I see you.

Another fashion-tastic announcement hit my Overflight system. Here is the write up which snagged my attention: “The Latest Fashion, Trending on Google.” I learned:

…Consumers are Googling tulle skirts, midi skirts, palazzo pants and jogger pants, according to the company, which plans to start issuing fashion trend reports based on user searches twice a year. The new trend aggregations are part of the company’s bid to become a bigger player in e-commerce and fashion beyond its product search engine or advertising platform. In its inaugural report, Google distinguishes between “sustained growth” trends, like tulle skirts and jogger pants; flash-in-the-pan obsessions like emoji shirts and kale sweatshirts; and “seasonal growth” trends, or styles that have come back stronger every spring, like white jumpsuits. It makes similar distinctions among sustained declines (peplum dresses), seasonal ones (skinny jeans) and fads that are probably over and done (scarf vests).
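Google’s three buckets, sustained growth, seasonal growth, and flash-in-the-pan fads, are really claims about the shape of a search-interest curve. Here is a minimal sketch of one naive way to label a monthly interest series along those lines; the thresholds and the sample numbers are invented for illustration.

def classify_trend(monthly_interest):
    """Crudely label a series of monthly search-interest values."""
    n = len(monthly_interest)
    first_year, last_year = monthly_interest[:12], monthly_interest[-12:]
    growth = (sum(last_year) + 1) / (sum(first_year) + 1)
    peak_share = max(monthly_interest) / (sum(monthly_interest) + 1)

    if peak_share > 0.25:
        return "fad"                 # one spike dominates the whole history
    if growth > 1.5:
        return "sustained growth"    # recent year clearly above the first year
    # Seasonal: spring months (March-May) consistently above the overall average.
    springs = [monthly_interest[i] for i in range(n) if i % 12 in (2, 3, 4)]
    if sum(springs) / len(springs) > 1.3 * (sum(monthly_interest) / n):
        return "seasonal growth"
    return "flat or declining"

# Invented sample: three years of monthly values for a hypothetical query.
jogger_pants = [10, 11, 12, 13, 14, 14, 15, 16, 17, 18, 19, 20,
                21, 22, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
                34, 36, 38, 40, 41, 42, 44, 45, 46, 48, 49, 50]
print(classify_trend(jogger_pants))   # -> "sustained growth"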

My reaction to the announcement was anticipation. I believe that one or more of the fashionistas at Google will soon be booked to appear on Project Runway. Perhaps the Style cable channel will cover Google’s on-campus lectures. Will there be a Marie Claire photo spread about Googlers wearing the latest in Silicon Valley fashion? There are some flashy dressers at the various GOOG offices. A certain Robert W. attended a meeting with me in London in a quite sporty outfit. My recollection is that the person from a certain government agency asked me, “Is that the type of stuff Mr. Brin wore to his initial meetings with Washington DC’s movers and shakers?” I replied, “No, I think that Mr. Brin wore a T shirt with sneakers.”

I am so excited about this festive development. I will set my video recorder so I don’t miss a single episode of Project Runway. Imagine. Tim and Heidi in Google Glass 2.0. I have to take a deep breath. Will the designers use Google to make certain their one day wonders are right in step with the bpm of the style makers?

Stephen E Arnold, April 27, 2015

