Hadoop: Its Inventor Speaks

May 18, 2015

I must have my wires crossed about Hadoop. I thought other folks were the creators of what became Hadoop. I read “Where Next for Hadoop? An Interview with Co-Creator Doug Cutting” to get my memory refreshed. (Note: you may have to register or pay to view the full text of this interview.)

According to the article, Doug Cutting and Mike Cafarella cooked up Hadoop in 2005. Cutting now works at Cloudera, which, according to Crunchbase, is

an enterprise software company that provides Apache Hadoop-based software and training to data-driven enterprises.

You can find some objective analyses of the company and its technology at http://bit.ly/1desDEN. I use the term “objective” to mean written by mid tier consultants.

I highlighted this statement:

Hadoop is already much more versatile and user-friendly than it was in the early days and innovations such as Yarn, Impala and Spark as well as a hardening of the platform’s security have all made it more “enterprise ready” too…

To underscore the user friendliness of Hadoop I circled in high intensity pink:

Asked whether some IT people are so bowled over by the number and choice of big data tools that they neglect to think how they will use them, Cutting agrees that this can be the case, but says that as use cases grow this issue will diminish. “It’s in an early stage of maturity so that’s not unexpected, but I think over time people are going to think about the functionality you’ve got in the distribution. You could have a SQL engine for analytics queries. You’ve got a NoSQL engine for reporting queries,” he says. So are companies like Cloudera, which, thanks to support from the likes of Intel (see below) and its vast marketing budget, distracting the market from the bigger picture? “There is confusion but I think it’s mostly because people are new to it and do not have much experience,” Cutting says.

And a final snippet:

“Mostly I think this mantle of open and standard is deceptive. It is neither open in that everybody’s really invited on equal terms to play, nor is it a standard. It’s a minority of people out there.”

There are other comments about Hadoop. I will leave them to you. Easy to use, not confusing, and no problems with open and standard. There are many consulting firms thrilled with Hadoop. Snap it in and dig into data. Versatile too.

Stephen E Arnold, May 18, 2015

Written by Stephen E. Arnold · Filed Under Big data, Database, News

Glass: A Family

May 18, 2015

Forget YouTube search, which is allegedly going to get better real soon.

Navigate to “Google Glass Tipped to Become Product Family.” I learned:

We also get some clues from the new Glass team description, which reads: “The Google Glass division is a world-class team focused on the cutting edge of hardware, software, and industrial design.” It continues: “It is charged with pioneering, developing, building, and launching smart eyewear and other related products in line with Google’s ambitious and visionary objectives.”

If I have an Apple Watch and ride around in an autonomous auto or snag an Uber ride, tell me again why I need to wear another gadget over my trifocals. How will that work? I suppose I can wear the new design instead of my trifocals. Wait! Then I would not be able to see my Apple Watch or the Uber car. I am excited.

Stephen Arnold, May 18, 2015

Written by Stephen E. Arnold · Filed Under Google, Marketing, News

Archive.is Preserves Online Information

May 18, 2015

Today’s information seekers use the Internet the way some of us used reference books growing up. Unlike the paper tomes on our dusty bookshelves, however, websites can change their content without so much as a by-your-leave. Suggestions for preserving online information can be found in “Create Publicly Available Web Page Archives with Archive.is” at gHacks.net.

Writer Martin Brinkmann begins by listing several local options familiar to many of us. There’s Ctrl-s, of course, and assorted screenshot-saving methods. Website archivers like Httrack perform their own crawls and save the results to the user’s local machine. Remotely, Archive.org automatically creates snapshots of prominent sites, but users cannot control the results. Enter Archive.is. Brinkmann writes:

“Archive.is is a free service that helps you out. To use it, paste a web address into the form on the services main page and hit submit url afterwards. The service takes two snapshots of that page at that point in time and makes it available publicly. The first takes a static snapshot of the site. You find images, text and other static contents included while dynamic contents and scripts are not. The second snapshot takes a screenshot of the page instead. An option to download the data is provided. Note that this downloads the textual copy of the site only and not the screenshot. A Firefox add-on has been created for the service which may be useful to some of its users. It creates automatic snapshots of every web page that you bookmark in the web browser after installation of the add-on.”

Wow, don’t set and forget that Firefox option! In fact, the article cautions, be mindful of the public availability of every Archive.is snapshot; Brinkmann reasonably suggests the tool could benefit from a password feature. Still, this could be an option to preserve important (but, for the prudent, impersonal) information found online.

Cynthia Murrell, May 18, 2015

Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com

Written by Stephen E. Arnold · Filed Under Big data, Data, News, Search, Search quality

Behind The Google X Doors

May 18, 2015

Google X is Google’s top-secret laboratory, where the company develops new, innovative technology projects. The main purpose behind Google X is to make technology more adaptable and useful and to improve people’s lives. Google Glass was one of its projects; so is Project Loon, in which giant, high altitude balloons are released into the sky to bring Internet service to rural areas. Also do not forget the driverless car. EWeek has listed “10 Bold Google X Projects Aiming For Tech Breakthroughs,” exploring the new wonders that could eventually be available to you or me.

Are you interested in cleaner, renewable energy? So are the folks at Makani Power, a Google X project that builds wind turbines and then makes them airborne using kites. The wind turbines make energy for human consumption. While energy is important for modern human life, health is a big issue too.

Google X has four projects dedicated to learning more about the human body and disease. One is a contact lens that measures glucose levels in tears, so diabetics will not have to prick themselves with needles to measure their sugar levels. The Baseline Study project analyzes medical information and uses genomics to define what the human body actually is. This project’s goal is to predict major diseases before their onset. Lift Labs, acquired in 2014, invented a spoon that counteracts the hand tremors of Parkinson’s disease. The most astounding is something out of a science-fiction novel:

“Google X is in the nanoparticles business. The company in October unveiled a platform that uses nanoparticles to detect disease. In January, it followed that up with the announcement of the creation of synthetic skin as a proof-of-concept to show what nanoparticle technology might achieve in human biology and health.”

Nanoparticles? Self-driving cars? Wind turbines on kites? What will Google X work on next?

Whitney Grace, May 18, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Written by Stephen E. Arnold · Filed Under Business intelligence, Data, Google, News, Open source

Yotta Search: A Full Service Solution

May 17, 2015

I spoke to a colleague who asked me about Yotta Search. I dug through my Overflight files and located a write up about the new enterprise search system from Yotta Data Technologies and a company called Yotta Customer Analytics. One Yotta is in Cleveland. The other is in Silicon Valley. Both are in the analytics game.

A “yotta” is a whole lotta data, the biggest unit of data. I wonder if the company has a comment on a set of yottas?

I checked my files for the company offering Yotta search, based in Cleveland, home of EPI Thunderstone, another enterprise search vendor. The company behind Yotta Search is Yotta Data Technologies.

According to the firm’s Web site at www.yottadatatechnologies.com:

Yotta Data Technologies (YDT) is a technology company built on a foundation of deep industry experience and driven by a passion for innovative excellence. We provide data management and information governance solutions to corporations, firms and agencies, whether they be a small local firm or a multinational corporation with offices around the globe. Each of our platforms maintains the high levels of quality, performance and security that are critical within information governance initiatives and any data management project.

The search system appears to be based on open source technology if I understand this Web site information:

Yotta Search is a versatile enterprise search solution being developed by Yotta Data Technologies (YDT) for teams, small to medium sized businesses and large corporations. Yotta Search provides powerful, fast and flexible technology that is not only well beyond full text search, but also powers the search and analysis features of many of the world’s largest internet sites and data platforms.

The operative phrase is “being developed.” The company asserts capabilities in these functions:

Business intelligence
Discovery
Information governance
Virtual data rooms.

I noticed a news item called “Yotta Data Technologies Announces Enterprise Search and Big Data Analytics Platform.” If the information is correct, Yotta is no longer “being developed”; one can license the system. The url provided is www.yottasearch.info. The story describes the Yotta search system in this way:

YottaSearch is easy – and budget friendly – to implement with a cloud-based, Software-as-a-Service (SaaS) delivery model and a disruptive, subscription-based pricing model.

Key Functionality of the YottaSearch

Data Point Connectors – Local, Network, Email, Enterprise Systems, Databases, Social Media

File Crawlers – Detects & Parses over 1,000 file types

File Indexer – Language Detection, Deduplication, Near Real Time, Distributed, Scalable

Advanced Search Engines – Based on the high performance Apache Lucene library

Data Analytics – Intelligent analysis of structured and unstructured data

Dynamic Dashboards – Explore, analyze, navigate and define large volumes of complex data.

The system can be used for a number of applications, according to the write up:

Enterprise Search and Analytics
Information Governance
IT Operations Analytics (ITOA)
Investigations & eDiscovery
Knowledge Management (KM)
Internet of Things (IoT), Event & Log Data Analysis

Also, Yotta offers global data services and global electronic discovery services. The company’s tag line is “Information intelligence for corporations, firms, and agencies.”

Like I said, a lotta yottas and a robust line up of functionality which some more established search and content processing systems do not possess. Is Yotta competing with Elastic or is Yotta competing with the ABC vendors: Attivio, BA Insight, or Coveo? Worth watching.

Stephen E Arnold, May 17, 2015

Written by Stephen E. Arnold · Filed Under Business strategy, Enterprise search, News

Exit Governance. Enter DMP.

May 17, 2015

A DMP is a data management platform. I think in terms of databases. I find that software does not do a particularly reliable job “managing data.” Software can run processes, write log files, and perform other functions. But management, based on my experience at Booz, Allen & Hamilton, requires humans. Talking about analytics from Big Data and implementing a platform to perform management are apples and house paint in my mind.

Intrigued by the reference, I downloaded a document available upon registration from Infinitive. You can find the company’s Web site at www.infinitive.com. The white paper maps out 10 ways a data management platform can help me.

I was not familiar with Infinitive. According to the firm’s Web site: Infinitive is

A Different Kind of Consultancy. Results-driven and client-centric. Fun, focused and flexible. Highly engaged and easy to work with. Those are the qualities that make Infinitive a different kind of consultancy. And they’re the pillars of our unique culture. Headquartered in the Washington, D.C. area, Infinitive specializes in digital ad solutions, business transformation, customer & audience intelligence and enterprise risk management. Leveraging best practices in process engineering, change management and program management, we design and deliver custom solutions for leading organizations in communications, media and entertainment, financial services and educational services. For our clients, the results include quantifiable performance improvement and tangible bottom-line value in addressing their most pressing challenges and fulfilling their top-priority objectives.

What is a data management platform?

The white paper, or two page document, identifies these benefits of a DMP. I was hoping for an explanation of the “platform,” but let’s look at the payoffs from the platform.

The company points out that a DMP makes ad money go farther. Big Data become actionable. A DMP provides a foundation for analytics. The DMP “ensures the quality and accessibility of customer and audience intelligence data.” The DMP can harmonize data. A DMP allows me to “adapt traditional CRM strategies and technology to incorporate new customer behavior.” I can create new customer and audience “segments.” The DMP becomes the central nervous system for my company. And the DMP protects privacy.

That is a bundle of benefits. But what is the platform provided by a consulting company, especially one that is “fun”? I was not able to locate details about the platform. The company appears to be a firm focused on advertising.

The Web site includes a page about the DMP at this link. The information is buzzword heavy and fact free. My view is that the DMP is a marketing hook. The implied technology is consulting services. That’s okay, but I find the approach representative of marketing billable time, not delivering a platform with the remarkable and perhaps unattainable benefits suggested in the white paper.

The approach must work. The company’s Web site points out this message:

Not a platform, however.

Stephen E Arnold, May 17, 2015

Written by Stephen E. Arnold · Filed Under Analytics, Big data, Management, News

Quote to Note: How to Make Search Relevant

May 16, 2015

Short honk: I read “Intranet Search? Sssh! Don’t Speak of It.” It seems that enterprise search is struggling and sweeping generalizations about information governance and knowledge management are not helping the situation. But that’s just my opinion.

But set that “issue” aside. Here’s the quote I noted:

The only way this situation [search is a problem] will change is with intranet managers stepping up to the challenge and telling stories internally. The problem with search analytics (even if you do everything that Lou Rosenfeld [search wizard] recommends) is that there is no direct evidence of the day-to-day impact of search.

Will accountants respond to search stories? Why is there no direct evidence of the day to day impact of search? Perhaps search, along with some other hoo hah endeavors, is simply not relevant in today’s business environment? Won’t more hyperbole filled marketing solve the problem? Another conference?

The wet blanket on enterprise search remains “there is no direct evidence of the day to day impact of search.” After 30 or 40 years of implementations and hundreds of millions in search development, why not? Er, what about this thought:

Search is a low value utility which has been over hyped.

Stephen E Arnold, May 17, 2015

Written by Stephen E. Arnold · Filed Under Enterprise search, Management, News, Quotation

Connotate Reveals There Are One Billion Web Sites

May 16, 2015

I did not know there were one billion Web sites. Here’s the Web page on Connotate’s Web site which puts me in the know:

Source: www.connotate.com

The figure has been bandied about by Internet Live Stats, Business Insider, and the Daily Mail. This number was hit in late 2014 and confirmed by “the inventor of the Internet.” I noted that no one asked Google, an outfit which has a reasonable log file of its crawling activities. Doesn’t Google “know” a number? If the GOOG does, it is not talking or maybe the company is not returning phone calls from people asking, “How many Web sites make up the Internet?”

I navigated to Internet Live Stats on May 16, 2015, and noted this item of information:

I don’t want to rain on the parade, but the number is 900 million and apparently growing. Apparently the “number” can vary. Internet Live Stats says:

We do expect, however, to exceed 1 billion websites again sometime in 2015 and to stabilize the count above this historic milestone in 2016.

So what? Frankly the one billion number is irrelevant to me. What is relevant is that a company is using what I suppose is a sketchy number as a way to capture business. That is a good example of the marketing used by search and content processing vendors.

I know that generating organic, sustainable revenue from search and content processing, information access, and indexing software is very difficult.

The number of Web sites does not mean much, if anything. In an interview with BrightPlanet, I learned that savvy customers narrow the focus of their content acquisition and analysis. Less, it seems to me, may be more. Also, Darpa’s MEMEX project is designed to figure out the width, depth, and breadth of the Dark Net. Is it larger or smaller than the Clear Net?

I prefer value propositions and “marketing hooks” that do not equate size with importance or trigger the fear of not knowing what’s out there. But if it works, it is definitely okay in today’s pressurized sales environment.

There are a billion crazy search and content marketing assertions. Wait. Make that two billion.

Stephen E Arnold, May 16, 2015

Written by Stephen E. Arnold · Filed Under Marketing, News

HP Idol and Hadoop: Search, Analytics, and Big Data for You

May 16, 2015

I was clicking through links related to Autonomy IDOL. One of the links which I noted was to a YouTube video labeled “HP IDOL for Hadoop: Create a Smarter Data Lake.” Hadoop has become a synonym for making sense of Big Data. I am not sure what Big Data are, but I assume I will know when my eight gigabyte USB key cannot accept another file. Big Data? Doesn’t it depend on one’s point of view?

What is fascinating about the HP Idol video is that it carries a posting date of October 2014, which is in the period when HP was ramping up its anti-Autonomy legal activities. The video, I assumed before watching, would break from the Autonomy marketing assertions and move in a bold, new direction.

The video contained some remarkable assertions. Please, watch the video yourself because I may have missed some howlers as I was chuckling and writing on my old school notepad with a decidedly old fashioned pencil. Hey, these tools work, which is more than I can say for some of the software we examined last week.

Here’s what I noted with the accompanying screenshot so you can locate the frame in the YouTube video to double check my observation with the reality of the video.

First, there is the statement that in an organization 88 percent of its information is “unanalyzed.” The source is a 2012 study from Forrsights Strategy Spotlight: Business Intelligence and Big Data. Forrester, another mid tier consulting firm, produces these reports for its customers. Okay, the research is a couple of years old. Maybe it is valid? Maybe not? My thought was that HP may be a company which did not examine the data to which it had access about Autonomy before it wrote a check for billions of dollars. I assume HP has rectified any glitch along this line. HP’s litigation with Autonomy and the billions in write down for the deal underscore the problem with unanalyzed data. Alas, no reference was made to this case example in the HP video.

Second, Hadoop, an open source framework modeled on Google’s MapReduce technology, is presented as a way to reap the benefits of cost efficiency and scalability. These are generally desirable attributes of Hadoop and other data management systems. The hitch, in my opinion, is that Hadoop is a collection of projects. These have been developed via the open source / commercial model. Hadoop works well for certain types of problems. Extract, transform, and load works reasonably well once the Hadoop installation is set up, properly resourced, and the Java code debugged so it works. Hadoop requires some degree of technical sophistication; otherwise, the system can be slow, stuffed with duplicates, and a bit like a Rube Goldberg machine. But the Hadoop references in the video are not a demonstration. I noted this “explanation.”

Third, HP jumps from the Hadoop segment to “what if” questions. I liked the “democratize Big Data” because “Big Data Changes everything.” Okay, but the solution is Idol for Hadoop. The HP approach is to create a “smarter data lake.” Hmmm. Hadoop to Idol to data lake for the purpose of advanced analytics, machine learning functions, and enterprise level security. That sounds quite a bit like Autonomy’s value proposition before it was purchased from Dr. Lynch and company. In fact, Autonomy’s connectors permitted the system to ingest disparate types of data as I recall.

Fourth, the next logical discontinuity is the shift from Hadoop to something called “contextual search.” A Gartner report is presented which states with Douglas MacArthur-like confidence:

HP Idol. A leader in the 2014 Gartner Magic Quadrant for Contextual Search.

What the heck is contextual search in a Hadoop system accessed by Autonomy Idol? The answer is SEARCH. Yep, a concept that has been difficult to implement for 20, maybe 30 years. Search is so difficult to sell that Dr. Lynch generated revenues by acquiring companies and applying his neuro-linguistic methods to these firms’ software. I learned:

The sophistication and extensibility of HP Autonomy’s Intelligent Data Operating Layer (Idol) offering enable it to tackle the most demanding use cases, such as fraud detection and search within large video libraries and feeds.

Yo, video. I thought Autonomy acquired video centric companies and the video content resided within specialized storage systems using quite specific indexing and information access features. Has HP cracked the problem of storing video in Hadoop so that a licensee can perform fraud detection and search within video libraries? My experience with large video libraries is that certain video like surveillance footage is pretty tough to process with accuracy. Humans, even academic trainees, can be placed in front of a video monitor and told, “Watch this stream. Note anomalies.” Not exciting but necessary because processing large volumes of video remains what I would describe as “a bit of a challenge, grasshopper.” Why is Google adding wild and crazy banners, overlays, and required metadata inputs? Maybe because automated processing and magical deep linking are out of reach? HP appears to have improved or overhauled Autonomy’s video analysis functions, and the Gartner analyst is reporting a major technical leap forward. Identifying a muzzle flash is different from recognizing a face in a flow of subway patrons captured on a surveillance camera, is it not?

I have heard some pre HP Autonomy sales pitches, but I can’t recall hearing that Idol can crunch flows of video content unless one uses the quite specialized system Autonomy acquired. Well, I have been wrong before, and I am certainly not qualified to be an analyst like the ones Gartner relies upon. I learned that HP Idol has a comprehensive list of data connectors. I think I would use the word “library,” but why niggle?

Fifth, the video jumps to a presentation of a “content hub.” The idea is that HP Idol provides visual programming tools. I assume an HP Idol customer will point and click to create queries. The queries will deliver outputs from the Hadoop data management system and the content which embodies the data lake. The user can also run a query and see a list of documents, but that strikes me as exactly what many users no longer want to do to locate information, and the video jumps away from it. One can search effectively when one knows what one is looking for and that the needed information is actually in the index. The use case appears to be health care and the video concludes with a reminder that one can perform advanced analytics. There is a different point of view available in this ParAccel white paper.

I understand the strengths and weaknesses of videos. I have been doing some home brew videos since I retired. But HP is presenting assertions about Autonomy’s technology which seem to be out of step with my understanding of what Idol, the digital reasoning engine, and Autonomy’s acquired video technology can do.

The point is that HP seems to be out marketing Autonomy’s marketing. The assertions and logical leaps in the HP Idol Hadoop video stretch the boundaries of my credulity. I find this interesting because HP is alleging that Autonomy used similar verbal polishing to convince HP to write a billion dollar check for a search vendor which had grown via acquisitions over a period of 15 years.

Stephen E Arnold, May 16, 2015

Written by Stephen E. Arnold · Filed Under Big data, Marketing, News, Search

Image Search: Getting Better and Better

May 15, 2015

Image search means having software which can figure out from a digital photo that a cow is a cow. In more complex photos, the software identifies what it can. I recall one demonstration which recognized me as a 20 year old criminal. Close but no cigar.

I received an email from a former clandestine professional. The link provided informed me that Baidu was better at image recognition than the Google. The alleged error rate is 4.58 percent. I love the two decimal accuracy.

Not to be outdone, WolframAlpha is in the image recognition game as well. Navigate to “Wolfram Alpha Image Identification Identifies Steven Wolfram as Podium.” The write up points out:

Speaking of which, a picture of Steven Wolfram returned the answer ‘podium’. So no recognition for the creator. Unfortunately, it couldn’t identify a map of France at all and just came back with a big question mark. Sorry, France.

You can try the system at this page.

I uploaded the image of the cover of my new CyberOSINT study. The system returned this result:

My book cover is a piece of electronic equipment that mixes two or more input signals to give a single output signal.

I did not know that. I thought it was a book cover with a blue hand.

Stephen E Arnold, May 15, 2015

Written by Stephen E. Arnold · Filed Under Image search, News, Rich media


Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.