Pomposity and Stakeholders: The Big Data Play

July 9, 2014

I read with considerable amusement “With Big Data Comes Big Responsibility.” Let’s think about the premise of this write up. Here’s the passage which I think expresses one of the main ideas about the uses of Big Data and the public’s cluelessness:

I am actually amazed that cities are willing to trade data such as photos from traffic cameras that impacts its citizenry to a privately-owned company (in this case, Google) without as much as a debate. I am sure, a new parking lot gets more attention from the legislators.

From my vantage point in Harrod’s Creek, there are some realities that some Ivory Tower-type thinkers do not accept. Let me invite you to read the “Big Responsibility” article and then consider these observations. Make your own decision about the likelihood of rejigging the definition of responsibility.

Money First

The notion that whiz kids and their digital creations are about helping people is baloney. The objective is to win and win as much as possible. The German football team should not have slacked off in the second half. The proof of winning is crushing competitors, getting money, having lots of power, and obtaining ever increasing adulation of peers. Responsibility is defined by a hierarchy of needs that does not include some of the touchstone values of JP Morgan, Cornelius Vanderbilt, or John D. Rockefeller. These guys were not digitally hip and, therefore, could not leverage data effectively.

Mr. Rockefeller said, “God gave me my money.” Now there’s confidence for a business model.

Mr. Morgan said, “A man generally has two reasons for doing a thing. One that sounds good and a real one.”

Mr. Vanderbilt said, “I don’t care half so much about making money as I do about making my point, and coming out ahead.”

Other Directed Behavior

I have been lucky enough to work inside some outfits which saw themselves as the new elite. There were the Halliburton NUS nuclear engineers and wizards like the now deceased Jim Terwilliger whose life vision was, “Anyone not able to deal with my nuclear-focused mathematics is a loser.” I also did a stint at Booz, Allen, and Hamilton before it degraded to azure chip consultant status. The officers’ meetings were tributes to the specialness of the top performers among the many smart people at the firm. An outside speaker could not be just anyone. We enjoyed the wit and wisdom of Henry Kissinger, a pal of partner William Simon. Even the rental cars used to get to the hideaway were special. I recall a replica 1940s Ford convertible and assorted luxury vehicles. Special, special, special. I have done consulting work for some outfits whose names even a Beyond Search reader will recognize. Take it from me, everything was special, special, special. Outfits with folks who are smart and set themselves apart from those not good enough to be admitted to the “club” are into other directed behavior among their peers AND only if there is an upside. Forget lip service like saving stray dogs. Special is special. To be judged as super special by your in crowd is one major pivot point.

Silly Concerns

A typical silly concern is privacy. The folks who amass, resell, exploit, manipulate, and leverage data are operating under the Law of Eminent Domain. The whole point is to take advantage for one’s self, peers, and stakeholders. Other folks can work harder or try to get a better roll of the dice. Most folks don’t have a glimmer of insight about information manipulation. They never will. The notion that someone’s Ivory Tower values are going to grab and hold on is as silly as trying to explain that the Facebook experiment is one that was found out. There are other experiments and because these are not known, the experiments and their learnings are not available to the users of a TV or digital gambling device.

The notion of a moral imperative will make for excellent conversation at a coffee shop. It won’t have any impact on the juggernauts now racing through certain developed societies. Barn burnt. Horses gone. Amazon distribution center erected on the site. Google. Well, too bad for those looking for Cuba Libre via Google Maps. And Facebook. My dog has mounting friend requests and is now getting junk mail via her “real” Facebook page. The past is gone. The reality is what’s cooking near Ukraine, the freshly minted “states” in the East, and the shift from phishing email to kidnapping in certain African countries. Walled communities are back. It may be the dawning of the new Dark Age. [Update: This link may provide a useful example of how a moral imperative is put into action by a high flying Silicon Valley professional. I wonder how one would explain the discontinuity between intelligent, five children, and heroin to the surviving spouse. Well, I will leave the gilding of the lily to a pundit. Added, July 10, 2014.]

Those old Roman emperors like JP, JD, and Corny may not look so bad today. These folks had the right idea in the view of some modern captains of Big Data.

Stephen E Arnold, July 9, 2014

ZyLAB’s Mary Mack Urges Caution with Predictive Coding

July 9, 2014

An article titled ZyLAB’s Mary Mack on Predictive Coding Myths and Traps for the Unwary on The eDisclosure Information Project offers some insight into the trend of viewing predictive coding as some form of “magic.” This idea is quickly brushed aside and predictive coding is returned to the realm of statistics and technology. The article quotes Mary Mack of ZyLAB:

“Machine learning and artificial intelligence for legal applications is our future. It’s a wonderful advance that the judiciary is embracing machine-assisted review in the form of predictive coding. While we steadily move into the second and much less risky generation of predictive coding, there are still traps and pitfalls that are better considered early for mitigation. This session and the session on eDiscovery taboos will expose a few concerns to consider when evaluating predictive coding for specific or portfolio litigation.”

In this article ZyLAB offers a counterpoint to Recommind, which asserted in a recent article that predictive coding is to eDiscovery what a GPS is to driving cross-country. ZyLAB prefers a much more cautious approach to the innovative technology. The article stresses that an objective, fact-based discussion of the merits and pitfalls of predictive coding is a necessary step in its growth.

Chelsea Kerwin, July 09, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

Predictive Coding for eDiscovery Users in a Hurry

July 9, 2014

The article on Recommind titled Why eDiscovery Needs GPS (And a Soundtrack) whimsically applies the basic tenets of GPS to the eDiscovery process with the aid of song titles. If you can get through the song titles bit, there is some meat to the article, though not much. The author suggests several areas where predictive coding might make eDiscovery easier and more efficient. The author explains his thinking:

“A good eDiscovery navigator will help you take a reliable Estimation Sample… early on to determine the statistically likely number of responsive documents for any issue in your matter.  It will then plot that destination clearly, along with the appropriate margin of error, and show your status toward it at every point along The Long and Winding Road. It should also clearly display the responsiveness levels you’re experiencing with each iteration as you review the machine-suggested document batches.”

The type of guidance and efficiency that predictive coding offers is already being utilized by companies conducting internal investigations and “reviewing data already seized by a regulatory agency.” The author conditions the usefulness of predictive coding on its being flexible and able to recalculate based on any change in direction. When speed and effectiveness are of paramount importance, a GPS for eDiscovery might be the best possible tool.
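The “Estimation Sample” idea in the quoted passage can be sketched in a few lines. This is a minimal illustration, not Recommind’s method: it assumes a simple random sample and the textbook normal approximation for a proportion, and the function name and all the numbers are hypothetical.

```python
import math

def estimate_responsive(sample_size, responsive_in_sample, collection_size, z=1.96):
    """Estimate how many documents in a collection are responsive.

    Assumes a simple random sample and the normal approximation for a
    proportion; z=1.96 corresponds to roughly 95 percent confidence.
    """
    p = responsive_in_sample / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    low = max(0.0, p - margin)
    high = min(1.0, p + margin)
    return {
        "proportion": p,
        "margin_of_error": margin,
        "estimated_count": round(p * collection_size),
        "low_count": math.floor(low * collection_size),
        "high_count": math.ceil(high * collection_size),
    }

# Hypothetical matter: 1,500 documents sampled, 180 judged responsive,
# one million documents in the full collection.
result = estimate_responsive(
    sample_size=1500, responsive_in_sample=180, collection_size=1_000_000
)
print(result)
```

The margin of error shrinks as the sample grows, which is why the quoted “navigator” wants the estimate plotted along with its margin rather than as a single number.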

Chelsea Kerwin, July 09, 2014


Swimming in a Hadoop Data Lake

July 8, 2014

I read an interview conducted by the consulting firm PWC. The interview appeared with the title “Making Hadoop Suitable for Enterprise Data Science.” The interview struck me as important for two reasons. First, the questioner and the interview subject introduce a number of buzzwords and business generalizations that will be bandied about in the near future. Second, the interview provides a glimpse of the fish with sharp teeth that swim in what seems to be a halcyon data lake. With Hadoop goodness replenishing the “data pond,” Big Data is a life sustaining force. That’s the theory.

The interview subject is Mike Lang, the CEO of Revelytix. (I am not familiar with Revelytix, and I don’t know how to pronounce the company’s name.) The interviewer is one of those tag teams that high end consulting firms deploy to generate “real” information. Big time consulting firms publish magazines, emulating the McKinsey Quarterly. The idea is that Big Ideas need to be explained so that MBAs can convert information into anxiety among prospects. The purpose of these bespoke business magazines is to close deals and highlight technologies that may be recommended to a consulting firm’s customers. Some quasi consulting firms borrow other people’s work. For an example of this short cut approach, see the IDC Schubmehl write up.

Several key buzzwords appear in the interview:

  • Nimble. Once data are in Hadoop, the Big Data software system has to be quick and light in movement or action. Sounds very good, especially for folks dealing with Big Data. So with Hadoop one has to use “nimble analytics.” Also, sounds good. I am not sure what a “nimble analytic” is, but, hey, do not slow down generality machines with details, please.
  • Data lakes. These are “pools” of data from different sources. Once data is in a Hadoop “data lake”, every water or data molecule is the same. It’s just like chemistry sort of…maybe.
  • A dump. This is a mixed metaphor, but it seems that PWC wants me to put my heterogeneous data, which is now like water molecules, in a “dump”. Mixed metaphor, is it not? Again. A mere detail. A data lake has dumps or a dump has data lakes. I am not sure which has what. Trivial and irrelevant, of course.
  • Data schema. To make data fit a schema with an old fashioned system like Oracle, it takes time. With a data lake and a dump, someone smashes up data and shapes it. Here’s the magic: “They might choose one table and spend quite a bit of time understanding and cleaning up that table and getting the data into a shape that can be used in their tool. They might do that across three different files in HDFS [Hadoop Distributed File System]. But, they clean it as they’re developing their model, they shape it, and at the very end both the model and the schema come together to produce the analytics.” Yep, magic.
  • Predictive analytics, not just old boring statistics. The idea is that with a “large scale data lake”, someone can make predictions. Here’s some color on predictive analytics: “This new generation of processing platforms focuses on analytics. That problem right there is an analytical problem, and it’s predictive in its nature. The tools to help with that are just now emerging. They will get much better about helping data scientists and other users. Metadata management capabilities in these highly distributed big data platforms will become crucial—not nice-to-have capabilities, but I-can’t-do-my-work-without-them capabilities. There’s a sea of data.”

My take is that PWC is going to bang the drum for Hadoop. Never mind that Hadoop may not be the Swiss Army knife that some folks want it to be. I don’t want to rain on the parade, but Hadoop requires some specialized skills. Fancy math requires more specialized skills. Interpretation of the outputs from data lakes and predictive systems requires even more specialized skills.

No problem as long as the money lake is sufficiently deep, broad, and full.

The search for a silver bullet continues. That’s what makes search and content processing so easy. Unfortunately the buzzwords may not deliver the type of results that inform decisions. Fill that money lake because it feeds the dump.

Stephen E Arnold, July 7, 2014

Hadoop Annual Growth Numbers Sky-High

July 8, 2014

The article titled Hadoop Sector will Have Annual Growth of 58% for 2013-2020 in CloudTimes offers a wild and crazy market size estimate for the sector. Hadoop is open source so this is a lot of services revenue. Hadoop’s achievement is based on work in big data analysis, access to big data at high speeds, and the management of unstructured data. Keeping costs low while maintaining effectiveness spelled success for Hadoop. The article states:

“The report categorized the Hadoop software market into application software, management software, packaged software and performance monitoring software and found that application software category is leading the global Hadoop software market due to high return in its increasing implementation by developers to build real time applications. Also, Hadoop packaged software provides easier deployment of Hadoop clusters. Thus, Hadoop projects such as MapReduce, Sqoop, Hive and others can be smoothly integrated.”

The article does offer some caution to balance the wildly positive report for Hadoop. Due to a shortage of qualified staff, there has been some slowing of growth, especially in small and medium enterprises, which might hesitate to adopt the software. Hadoop is booming in government, manufacturing, BFSI, retail, and healthcare, among other areas.
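To get a feel for what 58 percent annual growth implies, here is a rough compounding sketch. It assumes the rate compounds once a year across the seven years from 2013 to 2020; the report’s own base figure and method are not given here, so only the multiplier is computed.

```python
def compound_multiplier(annual_rate, years):
    """Return how many times larger a value becomes after compounding."""
    return (1 + annual_rate) ** years

# Assumption: seven yearly compounding steps between 2013 and 2020.
years = 2020 - 2013
multiplier = compound_multiplier(0.58, years)
print(f"{multiplier:.1f}x over {years} years")
```

A market that grows at that pace for seven years ends up more than twenty times its starting size, which is why the estimate reads as sky-high.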

Chelsea Kerwin, July 08, 2014


Steps Offered to Improve Government Data Sites

July 8, 2014

The article on FlowingData titled How to Make Government Data Sites Better uses the Centers for Disease Control and Prevention website to illustrate measures the government should take to make its data more accessible and manageable. The first suggestion is to provide files in a usable format. By avoiding PDFs and providing CSV files (or even raw data), the site puts the user in a much better position to work with the data. Another suggestion is simply losing or simplifying the multipart form that makes search nearly impossible. The author also proposes clearer and more consistent annotation, using the following scenario to illustrate the point:

“The CDC data subdomain makes use of the Socrata Open Data API,… It’s weekly data that has been updated regularly for the past few months. There’s an RSS feed. There’s an API. There’s a lot to like… There’s also a lot of variables without much annotation or metadata … When you share data, tell people where the data is from, the methodology behind it, and how we should interpret it. At the very least, include a link to a report in the vicinity of the dataset.”

Overall, the author makes many salient points about transparency, consistency and clutter. But there is an assumption in the article that the government actually desires to make data sites better, which may be the larger question. If no one implements these ideas, perhaps that will be answer enough.
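The case for CSV over PDF is easy to show. The sketch below uses only Python’s standard library to total a column from CSV text; the sample rows and column names are made up for illustration and are not CDC data.

```python
import csv
import io

# Hypothetical sample; a real file would come from a data site's download link.
SAMPLE_CSV = """week,region,cases
2014-06-01,Northeast,12
2014-06-01,South,30
2014-06-08,Northeast,15
2014-06-08,South,27
"""

def total_by_column(csv_text, group_col, value_col):
    """Sum an integer column, grouped by the values of another column."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row[group_col]
        totals[key] = totals.get(key, 0) + int(row[value_col])
    return totals

totals = total_by_column(SAMPLE_CSV, "region", "cases")
print(totals)
```

Doing the same with a table locked inside a PDF would first require extraction and cleanup, which is the friction the FlowingData author wants removed.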

Chelsea Kerwin, July 08, 2014


SharePoint Potential for Surface Pro 3

July 8, 2014

Microsoft’s Surface Pro 3 made waves as one of the first high profile enterprise ready tablets. Mobility is no longer a trend, but a necessity, with mobile search set to surpass desktop search this year. SharePoint needs to keep itself in the mobility game, and Surface Pro 3 may be one way to do that. Redmond covers the story in their article, “Why SharePoint Admins Should Check Out Surface Pro 3.”

Tamir Orbach, Metalogix’s director of product management for SharePoint migration products, gave his opinion on the new device:

“‘Pretty much all of us professionals want or need both a laptop or desktop and a slate,’ Orbach said. ‘It’s so light that you can carry it anywhere you want and you would barely even feel it. And the screen is big enough, the resolution is good, the functionality is powerful enough to be used as our day-to-day computer.’”

Stephen E. Arnold has made a career out of following all things search. Enterprise is particularly affected by search, good and bad, and SharePoint is unequivocally the biggest player in the enterprise game. However, it has struggled with mobile functionality. And while critics will not be completely satisfied if Microsoft claims SharePoint’s mobile struggles must be settled with another Microsoft product, it does show some movement in the right direction. Keep an eye on Arnold’s SharePoint feed on his Web site, ArnoldIT.com, for the latest news, tips, and tricks.

Emily Rae Aldridge, July 8, 2014

Quote to Note: The Curse of Smart People

July 7, 2014

Here’s a gem. The source is “Google Employee Blows the Whistle on Search Giant’s Problem with Over-Confident ‘Geek Types Living in a Bubble.’”

The quote is from Avery Pennarun, a Googler (maybe soon to be a Xoogler) who allegedly said:

Smart people have a problem especially when you put them in large groups. That problem is an ability to convincingly rationalize nearly anything.

I must admit that here in Harrod’s Creek, Kentucky, rationalizations are not as popular as brawls and shouting.

I don’t understand the “Impostor Syndrome”, but I accept that I don’t get quite a few aspects of the modern world.

I would be happy if search results were focused on precision, recall, and similar old-fashioned ideas. I wonder if the Googlers in Mountain View have ever walked around East Palo Alto and checked out the housing, the stores, and the quality of life? Does a Google Maps search show details in East Palo Alto? Cuba Libre in Washington, DC, was unfindable and it was only a couple of blocks from Google’s DC office.

Stephen E Arnold, July 7, 2014

Google and Amazon: The Cost Challenged Prepare to Squabble

July 7, 2014

I read “Inside Google’s Big Plan to Race Amazon to Your Door.” The US of A is a big place. Making money with to-my-door deliveries is an interesting business proposition. Amazon floated the idea of drones dropping boxes in my yard and has some United States Postal Service trucks putting Amazon boxes on my brick mailbox on Sunday.

Well, Google wants to “race” Amazon. As any F1 team knows, racing can be expensive, very expensive.

The write up does not dwell on costs, preferring to point out:

Google is the undisputed king of search in all but one lucrative and vital category: Product searches.

I also noticed this passage:

Unlike Amazon, Google does not operate its own giant warehouses or store inventory for more than a few hours. Instead, it fulfills customer orders by picking up items from nearby retail stores. So rather than compete directly against retailers like Amazon does, Google is attempting to position itself as an ally. Shoppers in cities where the service is available — mainly areas around San Francisco, Los Angeles and New York City for now — visit a dedicated Google Shopping Express website where they can choose to buy goods like groceries, cameras and clothing from a selection of retail partners.

How many of those partners are willing and able to provide same day delivery? In Harrod’s Creek, some of the “we deliver” pizza joints don’t deliver to some locations not on a paved road or in a specific zip code. Will merchants change their tune when the GOOG is involved?

I also underlined the “me too” approach of Google in its battle with Amazon:

Eventually, Google plans to launch a flat-fee membership model similar to Amazon Prime…

In addition to consolidation of shopping, Google wants to be just like Amazon. The innovation is difficult for me to spot. I recall the Google Catalog project. The idea was to scan pages of printed catalogs. Now that did not seem like a particularly useful service. Google killed it. Then there was Froogle, and it disappeared. Now I think there is Google shopping, but I don’t use the service because I browsed pages and pages of non store listings. I recall that the service was not helpful to me.

Google’s approach is to be a partner. Okay, sounds good. Here’s the passage I dog-eared:

Google has assembled a respectable group of partners to the program. Several of them say participating in the Google Shopping Express program gives them a way to evaluate whether it’s more cost effective to offer same-day and next-day delivery themselves, through a partner or whether they should at all.

Google “assembled” a group of college and university partners. How has that worked out?

Several observations:

First, both Amazon and Google have a cost control problem. The hope is that the massive spending will turn into piles of money. My hunch is that when the costs become greater than the income, both Amazon and Google will have to find a way to produce the returns investors want. The bubble economy in the US may put increased pressure on Amazon and Google to generate better returns. Someone has to pay for the rising investments both companies are making.

Second, Google is less diversified in terms of its revenue than Amazon. As Steve Ballmer said years ago, Google is a one trick revenue pony. Generating meaningful revenue streams from the Overture/GoTo/Yahoo pay to play model is not as interesting to me as search shifts from the desktop to mobile devices. Google has to amp up its revenue. Amazon does too; hence, Amazon is doing some interesting things to publishers, for example. The objective is money, not stroking the publishers.

Third, I am weary of spending more and more time working around ads, search engine optimization content, and plain old flawed information. The Google engine does me no favors when my alert for the phrase “enterprise search” returns a pointer to an SEO outfit that does business as TopSEO. Pure garbage in my opinion.

Net net: Innovation is now enshrined as imitation. That’s okay. We know how the Serbian-American inventor Tesla ended up. Fascinating 21st century business creativity. Oh, by the way, I don’t need same day delivery. I like to go to the farmers’ market.

Converting Amazon and Google to a digital WalMart leaves me cold.

Stephen E Arnold, July 7, 2014

Google and What You Cannot Find

July 7, 2014

I don’t have much information about the “right to be forgotten” process at the GOOG. I have been watching the streams my Overflight system tracks. I did find one Web page that I found interesting. Navigate to Forgotten Results.

You can explore the links and the source for each entry. I clicked on a few and found the information suggestive, not definitive. I did a couple of quick checks and the content for which I looked was available via other indexes or from other Google domains when I used a Web proxy.

For most users, information not in the Google index does not exist. The approach is, I think, “Hey, Google indexes all the world’s information, right?”

Sort of.

You can ponder the value of being able to delete certain information from online indexes used to satisfy a Web query. My hunch is that some outfits who continue to grouse about Google (maybe, Foundem), certain types of content (information not deemed to be high priority), and other digital information can be deleted. Most folks won’t know the difference.

Keep in mind that among the people who are online searchers, almost everyone is an expert in their own mind. There are professionals like Marydee Ojala, Barbara Quint, Anne Mintz, and Ruth Pagell who are significantly more “expert” than the over confident MBAs, mobile phone search wizards, search engine optimization gurus, and the majority of short cut focused college students chasing a library or information science degree.

What’s important to me is that it is now possible to be confident that locating the information one has in mind becomes much harder. Multiple queries and different search systems must be used. Will Bing maps show you the location of a certain facility in Scotland? Why are some government servers not in the USA.gov service? Why is Yahoo’s presentation of the “news” focused squarely on the inconsequential and stale?

The question about Google is a pretty good one. In our tests, identical queries across different search systems generate anywhere from 60 to 75 percent overlap. Flip this around and you will have to work really hard to find the other 25 to 40 percent.
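For anyone who wants to run this kind of check at home, here is a minimal sketch of one way to measure overlap between two result lists. It is an assumption about the measure, not a description of our test setup: overlap here means the share of one engine’s distinct URLs that also show up in the other’s, and the URLs are placeholders.

```python
def overlap_percent(results_a, results_b):
    """Percent of distinct URLs in results_a that also appear in results_b."""
    set_a, set_b = set(results_a), set(results_b)
    if not set_a:
        return 0.0
    return 100.0 * len(set_a & set_b) / len(set_a)

# Placeholder result lists for the same query on two hypothetical engines.
engine_one = ["a.example/1", "b.example/2", "c.example/3", "d.example/4"]
engine_two = ["a.example/1", "b.example/2", "c.example/3", "e.example/5"]

shared = overlap_percent(engine_one, engine_two)
only_in_one = sorted(set(engine_one) - set(engine_two))
print(f"overlap: {shared:.0f} percent; only in engine one: {only_in_one}")
```

The items in `only_in_one` are the ones a single-engine searcher never sees, which is the part that takes the real work.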

Research is hard work. The right to be forgotten just ups the ante for specialists in open source online research. I suppose that’s one reason my intel conference briefings on alternatives to Google.com search continue to pack ’em in.

Stephen E Arnold, July 7, 2014
