Big Data Boom Pushes Schools to Create Big Data Programs

July 29, 2014

Can education catch up to progress? Perhaps, especially when corporations take an interest. Fortune discusses “Educating the ‘Big Data’ Generation.” As companies try to move from simply collecting vast amounts of data to putting that information to use, they find a serious dearth of qualified workers in the field. In fact, Gartner predicted in 2012 that 4.4 million big-data IT jobs would be created globally by 2015 (1.9 million in the U.S.). Schools are now working to catch up with this demand, largely as the result of prodding from the big tech companies.

The field of big data collection and analysis presents a previously rare requirement—workers that understand both technology and business. Reporter Katherine Noyes cites MIT’s Erik Brynjolfsson, who will be teaching a course on big data this summer:

“‘We have more data than ever,’ Brynjolfsson said, ‘but understanding how to apply it to solve business problems needs creativity and also a special kind of person.’ Neither the ‘pure geeks’ nor the ‘pure suits’ have what it takes, he said. ‘We need people with a little bit of each.’”

Over at Arizona State, which boasts year-old master’s and bachelor’s programs in data analytics, Information Systems chair Michael Goul agrees:

“‘We came to the conclusion that students needed to understand the business angle,’ Goul said. ‘Describing the value of what you’ve discovered is just as key as discovering it.’”

In order to begin meeting this new need for business-minded geeks (or tech-minded business people), companies are helping schools develop programs to churn out that heretofore suspect hybrid. For example, Noyes writes:

“MIT’s big-data education programs have involved numerous partners in the technology industry, including IBM […], which began its involvement in big data education about four years ago. IBM revealed to Fortune that it plans to expand its academic partnership program by launching new academic programs and new curricula with more than twenty business schools and universities, to begin in the fall….

“Business analytics is now a nearly $16 billion business for the company, IBM says—which might be why it is interested in cultivating partnerships with more than 1,000 institutions of higher education to drive curricula focused on data-intensive careers.”

Whatever forms these programs, and these jobs, ultimately take, one thing is clear: for those willing and able to gain the skills, the field of big data is wide open. Anyone with a strong love of (and aptitude for) working with data should consider entering the field now, while competition for qualified workers is so very high.

Cynthia Murrell, July 29, 2014

Sponsored by, developer of Augmentext

From Search to Sentiment

July 28, 2014

Attivio has placed itself in the news again, this time for scoring a new patent. Virtual-Strategy Magazine declares, “Attivio Awarded Breakthrough Patent for Big Data Sentiment Analysis.” I’m not sure “breakthrough” is completely accurate, but that’s the language of press releases for you. Still, any advance can provide an advantage. The write-up explains that the company:

“… announced it was awarded U.S. Patent No. 8725494 for entity-level sentiment analysis. The patent addresses the market’s need to more accurately analyze, assign and understand customer sentiment within unstructured content where multiple brands and people are referenced and discussed. Most sentiment analysis today is conducted on a broad level to determine, for example, if a review is positive, negative or neutral. The entire entry or document is assigned sentiment uniformly, regardless of whether the feedback contains multiple comments that express a combination of brand and product sentiment.”

I can see how picking up on nuances can lead to a more accurate measurement of market sentiment, though it does seem more like an incremental step than a leap forward. Still, the patent is evidence of Attivio’s continued ascent. Founded in 2007 and headquartered in Massachusetts, Attivio maintains offices around the world. The company’s award-winning Active Intelligence Engine integrates structured and unstructured data, facilitating the translation of that data into useful business insights.
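To make the document-level versus entity-level distinction concrete, here is a minimal Python sketch. It is not Attivio's patented method; the tiny lexicon, the sentence-splitting shortcut, and the brand names Acme and Zenith are all assumptions made up for illustration.

```python
import re

# Toy sentiment lexicon; the words and weights are illustrative assumptions.
LEXICON = {"love": 1, "great": 1, "fast": 1, "terrible": -1, "slow": -1, "broke": -1}


def score(text):
    """Sum lexicon weights for the words found in text."""
    return sum(LEXICON.get(w, 0) for w in re.findall(r"[a-z']+", text.lower()))


def document_sentiment(text):
    """Document-level: one score assigned uniformly to the whole entry."""
    return score(text)


def entity_sentiment(text, entities):
    """Entity-level (crude): score each sentence and credit the entities it names."""
    totals = {e: 0 for e in entities}
    for sentence in re.split(r"[.!?]+", text):
        s = score(sentence)
        for e in entities:
            if e.lower() in sentence.lower():
                totals[e] += s
    return totals


# Hypothetical review mentioning two made-up brands.
review = "I love the Acme phone. The Zenith charger broke and support was terrible."
print(document_sentiment(review))
print(entity_sentiment(review, ["Acme", "Zenith"]))
```

The single document score reads the review as mildly negative, while the per-entity tally separates praise for one brand from complaints about the other. A real system would need proper entity recognition and coreference handling rather than substring matching.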

Cynthia Murrell, July 28, 2014

Is New Math Really New Yet?

July 21, 2014

I read “Scientific Data Has Become So Complex, We Have to Invent New Math to Deal With It.” My hunch is that this article will become Google spider food with a protein punch.

In my lectures for the police and intelligence community, I review research findings from journals and my work that reveal a little appreciated factoid; to wit: The majority of today’s content processing systems use a fairly narrow suite of numerical recipes that have been embraced for decades by vendors, scientists, mathematicians, and entrepreneurs. Due to computational constraints and limitations of even the slickest of today’s modern computers, processing certain data sets is a very difficult job, and an expensive one in human, programming, and machine time.

Thus, the similarity among systems comes from several factors.

  1. The familiar is preferred to the onerous task of finding a slick new way to compute k-means or perform one of the other go-to functions in information processing.
  2. Systems have to deliver certain types of functions in order to make it easy for a procurement team or venture oriented investor to ask, “Does your system cluster?” Answer: Yes. Venture oriented investor responds, “Check.” The procedure accounts for the sameness of the feature lists between Palantir, Recorded Future, and similar systems. When the similarities make companies nervous, litigation results. Example: Palantir versus i2 Ltd. (now a unit of IBM).
  3. Alternative methods of addressing tasks in content processing exist, but they are tough to implement in today’s computing systems. The technical reason for the reluctance to use some fancy math from my uncle Vladimir Ivanovich Arnold’s mentor Andrey Kolmogorov is that in many applications the computing system cannot complete the computation. The buzzword for this is P=NP? Here’s MIT’s 2009 explanation.
  4. Savvy researchers have to find a way to get from A to B that works within the constraints of time, confidence level required, and funding.

The Wired article identifies other hurdles; for example, the need for constant updating. A system might be able to compute a solution using fancy math on a right sized data set. But toss in constantly updating information and the computing resources often just keep getting hungrier for more storage, bandwidth, and computational power. Then the bigger the data, the more the computing system has to shove that data around. As fast as an iPad or modern Dell notebook seems, the friction adds latency to a system. For some analyses, delays can have significant repercussions. Most Big Data systems are not the fleetest of foot.

The Wired article explains how fancy math folks cope with these challenges:

Vespignani uses a wide range of mathematical tools and techniques to make sense of his data, including text recognition. He sifts through millions of tweets looking for the most relevant words to whatever system he is trying to model. DeDeo adopted a similar approach for the Old Bailey archives project. His solution was to reduce his initial data set of 100,000 words by grouping them into 1,000 categories, using key words and their synonyms. “Now you’ve turned the trial into a point in a 1,000-dimensional space that tells you how much the trial is about friendship, or trust, or clothing,” he explained.
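The dimension reduction DeDeo describes can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not his actual pipeline: the three categories, their word lists, and the sample sentence are invented, and a real version would use roughly 1,000 categories covering some 100,000 words.

```python
from collections import Counter

# Hypothetical category map: each category lists key words and their synonyms.
CATEGORIES = {
    "friendship": {"friend", "companion", "acquaintance"},
    "trust": {"trust", "honest", "credit"},
    "clothing": {"coat", "hat", "breeches"},
}

# Invert the map once so each word looks up its category directly.
WORD_TO_CATEGORY = {w: c for c, words in CATEGORIES.items() for w in words}
AXES = sorted(CATEGORIES)  # fixed axis order for the vector


def to_vector(text):
    """Return the share of categorized words in each category, ordered by AXES."""
    counts = Counter(
        WORD_TO_CATEGORY[w] for w in text.lower().split() if w in WORD_TO_CATEGORY
    )
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in AXES]


# Made-up trial text; each trial becomes one point with len(AXES) coordinates.
trial = "the prisoner took my coat and hat though i did trust him as a friend"
print(AXES)
print(to_vector(trial))
```

Each coordinate says how much of the trial is "about" that category, which is the sense in which a document becomes a point in a lower-dimensional space.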

Wired labels this approach as “piecemeal.”

The fix? Wired reports:

the big data equivalent of a Newtonian revolution, on par with the 17th century invention of calculus, which he [Yalie mathematician Ronald Coifman] believes is already underway.

Topological analyses and sparsity may offer a path forward.

The kicker in the Wired story is the use of the phrase “tractable computational techniques.” The notion of “new math” is an appealing one.

For the near future, the focus will be on optimization of methods that can be computed on today’s gizmos. One widely used method in Autonomy, Recommind, and many other systems originates with the Reverend Thomas Bayes, who died in 1761. My relative died in 2010. I understand there were some promising methods developed after Kolmogorov died in 1987.

Inventing new math is underway. The question is, “When will computing systems become available to use these methods without severe sampling limitations?” In the meantime, Big Data keep on rolling in, possibly mis-analyzed and contributing to decisions with unacceptable levels of risk.

Stephen E Arnold, July 21, 2014

Big Data Stress at Both Ends of the Lens

July 11, 2014

An interesting article at the New Inquiry looks at the psychological effects of both surveilling and being surveilled in the modern, data-driven world. It’s a worthwhile read, and I urge the curious to check out the whole piece. Writer Kate Crawford begins with ways intelligence agencies use big data and some of the stumbling blocks that come with it: No matter how much information they collect, the picture is incomplete. On the other hand, the more information they stockpile, the easier it is to miss important clues and fail to prevent a crisis. Agencies continue to search for frameworks that will help them put the pieces together faster.

On the other side are private citizens, who feel (because they are) increasingly under observation. With each advance in surveillance tech, people feel more anxious, even if they have nothing to hide. Just the feeling of being watched gives us the heebie jeebies (I believe that’s the technical term). Crawford points to the “normcore” trend as evidence that more and more people just want to blend in to escape notice. Isn’t that kind of sad? She also notes that the trend toward privacy violation is unlikely to slow as long as the big data philosophy, that more data is always better, holds sway.

Crawford concludes:

“If we take these twinned anxieties — those of the surveillers and the surveilled — and push them to their natural extension, we reach an epistemological end point: on one hand, the fear that there can never be enough data, and on the other, the fear that one is standing out in the data. These fears reinforce each other in a feedback loop, becoming stronger with each turn of the ratchet. As people seek more ways to blend in — be it through normcore dressing or hardcore encryption — more intrusive data collection techniques are developed. And yet, this is in many ways the expected conclusion of big data’s neopositivist worldview. As historians of science Lorraine Daston and Peter Galison once wrote, all epistemology begins in fear — fear that the world cannot be threaded by reason, fear that memory fades, fear that authority will not be enough.”

Ah yes, fear. Its effects on society have been widespread from the beginning, of course, but now it has scarier technology to work with. It will be interesting to see how this plays out, and which sci-fi plots the path will most resemble.

Cynthia Murrell, July 11, 2014

Pomposity and Stakeholders: The Big Data Play

July 9, 2014

I read with considerable amusement “With Big Data Comes Big Responsibility.” Let’s think about the premise of this write up. Here’s the passage which I think expresses one of the main ideas about the uses of Big Data and the public’s cluelessness:

I am actually amazed that cities are willing to trade data such as photos from traffic cameras that impacts its citizenry to a privately-owned company (in this case, Google) without as much as a debate. I am sure, a new parking lot gets more attention from the legislators.

From my vantage point in Harrod’s Creek, there are some realities that some Ivory Tower-type thinkers do not accept. Let me invite you to read the “Big Responsibility” article and then consider these observations. Make your own decision about the likelihood of rejigging the definition of responsibility.

Money First

The notion that whiz kids and their digital creations are about helping people is baloney. The objective is to win and win as much as possible. The German football team should not have slacked off in the second half. The proof of winning is crushing competitors, getting money, having lots of power, and obtaining ever increasing adulation of peers. Responsibility is defined by a hierarchy of needs that does not include some of the touchstone values of JP Morgan, Cornelius Vanderbilt, or John D. Rockefeller. These guys were not digitally hip and, therefore, could not leverage data effectively.

Mr. Rockefeller said, “God gave me my money.” Now there’s confidence for a business model.

Mr. Morgan said, “A man generally has two reasons for doing a thing. One that sounds good and a real one.”

Mr. Vanderbilt said, “I don’t care half so much about making money as I do about making my point, and coming out ahead.”

Other Directed Behavior

I have been lucky enough to work inside some outfits which saw themselves as the new elite. There were the Halliburton NUS nuclear engineers and wizards like the now deceased Jim Terwilliger whose life vision was, “Anyone not able to deal with my nuclear-focused mathematics is a loser.” I also did a stint at Booz, Allen, and Hamilton before it degraded to azure chip consultant status. The officers’ meetings were tributes to the specialness of the top performers among the many smart people at the firm. An outside speaker could not be anyone. We enjoyed the wit and wisdom of Henry Kissinger, a pal of partner William Simon. Even the rental cars used to get to the hideaway were special. I recall a replica 1940s Ford convertible and assorted luxury vehicles. Special, special, special. I have done consulting work for some outfits whose names even a Beyond Search reader will recognize. Take it from me, everything was special, special, special. Outfits with folks who are smart and set themselves apart from those not good enough to be admitted to the “club” are into other directed behavior among their peers AND only if there is an upside. Forget lip service like saving stray dogs. Special is special. To be judged as super special by your in crowd is one major pivot point.

Silly Concerns

A typical silly concern is privacy. The folks who amass, resell, exploit, manipulate, and leverage data are operating under the Law of Eminent Domain. The whole point is to take advantage for one’s self, peers, and stakeholders. Other folks can work harder or try to get a better roll of the dice. Most folks don’t have a glimmer of insight about information manipulation. They never will. The notion that someone’s Ivory Tower values are going to grab and hold on is as silly as trying to explain that the Facebook experiment is one that was found out. There are other experiments and because these are not known, the experiments and their learnings are not available to the users of a TV or digital gambling device.

The notion of a moral imperative will make for excellent conversation at a coffee shop. It won’t have any impact on the juggernauts now racing through certain developed societies. Barn burnt. Horses gone. Amazon distribution center erected on the site. Google. Well, too bad for those looking for Cuba Libre via Google Maps. And Facebook. My dog has mounting friend requests and is now getting junk mail via her “real” Facebook page. The past is gone. The reality is what’s cooking near Ukraine, the freshly minted “states” in the East, and the shift from phishing email to kidnapping in certain African countries. Walled communities are back. It may be the dawning of the new Dark Age. [Update: This link may provide a useful example of how a moral imperative is put into action by a high flying Silicon Valley professional. I wonder how one would explain the discontinuity between intelligent, five children, and heroin to the surviving spouse. Well, I will leave the gilding of the lily to a pundit. Added, July 10, 2014.]

Those old Roman emperors like JP, JD, and Corny may not look so bad today. These folks had the right idea in the view of some modern captains of Big Data.

Stephen E Arnold, July 9, 2014

Swimming in a Hadoop Data Lake

July 8, 2014

I read an interview conducted by the consulting firm PWC. The interview appeared with the title “Making Hadoop Suitable for Enterprise Data Science.” The interview struck me as important for two reasons. First, the questioner and the interview subject introduce a number of buzzwords and business generalizations that will be bandied about in the near future. Second, the interview provides a glimpse of the fish with sharp teeth that swim in what seems to be a halcyon data lake. With Hadoop goodness replenishing the “data pond,” Big Data is a life sustaining force. That’s the theory.

The interview subject is Mike Lang, the CEO of Revelytix. (I am not familiar with Revelytix, and I don’t know how to pronounce the company’s name.) The interviewer is one of those tag teams that high end consulting firms deploy to generate “real” information. Big time consulting firms publish magazines, emulating the McKinsey Quarterly. The idea is that Big Ideas need to be explained so that MBAs can convert information into anxiety among prospects. The purpose of these bespoke business magazines is to close deals and highlight technologies that may be recommended to a consulting firm’s customers. Some quasi consulting firms borrow other people’s work. For an example of this short cut approach, see the IDC Schubmehl write up.

Several key buzzwords appear in the interview:

  • Nimble. Once data are in Hadoop, the Big Data software system has to be quick and light in movement or action. Sounds very good, especially for folks dealing with Big Data. So with Hadoop one has to use “nimble analytics.” Also, sounds good. I am not sure what a “nimble analytic” is, but, hey, do not slow down generality machines with details, please.
  • Data lakes. These are “pools” of data from different sources. Once data is in a Hadoop “data lake”, every water or data molecule is the same. It’s just like chemistry sort of…maybe.
  • A dump. This is a mixed metaphor, but it seems that PWC wants me to put my heterogeneous data which is now like water molecules in a “dump”. Mixed metaphor is it not? Again. A mere detail. A data lake has dumps or a dump has data lakes. I am not sure which has what. Trivial and irrelevant, of course.
  • Data schema. To make data fit a schema with an old fashioned system like Oracle, it takes time. With a data lake and a dump, someone smashes up data and shapes it. Here’s the magic: “They might choose one table and spend quite a bit of time understanding and cleaning up that table and getting the data into a shape that can be used in their tool. They might do that across three different files in HDFS [Hadoop Distributed File System]. But, they clean it as they’re developing their model, they shape it, and at the very end both the model and the schema come together to produce the analytics.” Yep, magic.
  • Predictive analytics, not just old boring statistics. The idea is that with a “large scale data lake”, someone can make predictions. Here’s some color on predictive analytics: “This new generation of processing platforms focuses on analytics. That problem right there is an analytical problem, and it’s predictive in its nature. The tools to help with that are just now emerging. They will get much better about helping data scientists and other users. Metadata management capabilities in these highly distributed big data platforms will become crucial—not nice-to-have capabilities, but I-can’t-do-my-work-without-them capabilities. There’s a sea of data.”
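For readers who want the "shape it as you go" idea without the magic, here is a minimal Python sketch of what is often called schema-on-read. It is an assumption-laden toy, not anything Revelytix or PWC describes: the CSV text stands in for a file in HDFS, and the column names and cleaning rules are invented.

```python
import csv
import io

# Raw data lands in the "lake" untouched; this text stands in for a file in HDFS.
RAW = """id,amount,region
1, 19.50 ,east
2,n/a,West
3,7,EAST
"""


def parse_amount(value):
    """Clean one amount; keep the row but flag an unparseable value as None."""
    try:
        return float(value.strip())
    except ValueError:
        return None


# The schema lives with the analysis, not the storage: column -> cleaning function.
SCHEMA = {
    "id": int,
    "amount": parse_amount,
    "region": lambda v: v.strip().lower(),
}


def read_with_schema(raw_text, schema):
    """Apply the schema while reading; the stored text never changes."""
    reader = csv.DictReader(io.StringIO(raw_text))
    return [{col: cast(row[col]) for col, cast in schema.items()} for row in reader]


rows = read_with_schema(RAW, SCHEMA)
east_total = sum(
    r["amount"] for r in rows if r["region"] == "east" and r["amount"] is not None
)
print(rows)
print(east_total)
```

The point of the design is that a different analyst can read the same raw text with a different SCHEMA tomorrow. The cost is that every reader repeats the cleaning work, which is where the specialized skills come in.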

My take is that PWC is going to bang the drum for Hadoop. Never mind that Hadoop may not be the Swiss Army knife that some folks want it to be. I don’t want to rain on the parade, but Hadoop requires some specialized skills. Fancy math requires more specialized skills. Interpretation of the outputs from data lakes and predictive systems requires even more specialized skills.

No problem as long as the money lake is sufficiently deep, broad, and full.

The search for a silver bullet continues. That’s what makes search and content processing so easy. Unfortunately the buzzwords may not deliver the type of results that inform decisions. Fill that money lake because it feeds the dump.

Stephen E Arnold, July 7, 2014

Hadoop Annual Growth Numbers Sky-High

July 8, 2014

The article titled “Hadoop Sector will Have Annual Growth of 58% for 2013-2020” in CloudTimes offers a wild and crazy market size estimate for the technology. Hadoop is open source so this is a lot of services revenue. Hadoop’s achievement is based on work in big data analysis, access to big data at high speeds, and the management of unstructured data. Keeping costs low while maintaining effectiveness spelled success for Hadoop. The article states,

“The report categorized the Hadoop software market into application software, management software, packaged software and performance monitoring software and found that application software category is leading the global Hadoop software market due to high return in its increasing implementation by developers to build real time applications. Also, Hadoop packaged software provides easier deployment of Hadoop clusters. Thus, Hadoop projects such as MapReduce, Sqoop, Hive and others can be smoothly integrated.”

The article does offer some caution to balance the wildly positive report for Hadoop. Because qualified staff are in short supply, growth has slowed somewhat, especially among small and medium enterprises, which might hesitate to adopt the software. Hadoop is booming with government sectors, manufacturing, BFSI, retail and healthcare, among other areas.

Chelsea Kerwin, July 08, 2014

Presentation by a NoSQL Leader

July 4, 2014

The purported father of NoSQL, Norman T. Kutemperor, made an appearance at this year’s Enterprise Search & Discovery conference, we learn from “Scientel Presented Advanced Big Data Content Management & Search With NoSQL DB at Enterprise Search Summit in NY on May 13” at IT Business Net. The press release states:

“Norman T. Kutemperor, President/CEO of Scientel, presented on Scientel’s Enterprise Content Management & Search System (ECMS) capabilities using Scientel’s Gensonix NoSQL DB on May 13 at the Enterprise Search & Discovery 2014 conference in NY. Mr. Kutemperor, who has been termed the ‘Father of NoSQL,’ was quoted as saying, ‘When it comes to Big Data, advanced content management and extremely efficient searchability and discovery are key to gaining a competitive edge.’ The presentation focused on: ‘The Power of Content – More power in a NoSQL environment.’”

According to the write-up, Kutemperor spoke about the growing need to manage multiple types of unstructured data within a scalable system, noting that users now expect drag-and-drop functionality. He also asserted that any NoSQL system should automatically extract text and build an index that can be searched by both keywords and sentences. Of course, no discussion of databases would be complete without a note about the importance of security, and Kutemperor emphasized that point as well.

The veteran info-tech company Scientel has been in business since 1977. These days, they focus on NoSQL database design; however, it should be noted that they also design and produce optimized, high-end servers to go with their enterprise Gensonix platform. The company makes its home in Bingham Farms, Michigan.

Cynthia Murrell, July 04, 2014

NoSQL Has a Weakness. Just Tell No One.

July 2, 2014

I read “The Rise (and Fall?) of NoSQL.” The write up seems to take a stance somewhat different from that adopted by enterprise search vendors. With search getting more difficult to sell for big bucks, findability folks are reinventing themselves as Big Data mavens. Examples range from the Fast Search clones to tagging outfits. (Sorry, no names this morning. Search and content processing vendors with chunks of venture firm cash do not need any more fireworks today.)

Is Big Data the white knight that will allow those venture funded companies to deliver a huge payday? I don’t know, but I keep my nest egg in less risky places.

Here’s the segment I noted:

It’s quite simple: analytics tooling for NoSQL databases is almost non-existent. Apps stuff a lot of data into these databases, but legacy analytics tooling based on relational technology can’t make any sense of it (because it’s not uniform, tabular data). So what usually happens is that companies extract, transform, normalize, and flatten their NoSQL data into an RDBMS, where they can slice and dice data and build reports. The cost and pain of this process, together with the fact that NoSQL databases aren’t fully self-contained (using them requires using their “competition” for analytics!) is the biggest threat to the possible dominance of NoSQL databases.
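The extract, flatten, and load chore described in that passage looks roughly like the following Python sketch. Everything here is a made-up stand-in: the documents imitate what an app might store in a NoSQL database, the table layout is my own, and an in-memory SQLite database plays the part of the RDBMS.

```python
import sqlite3

# Hypothetical documents as an app might store them in a NoSQL database:
# nested, and not uniform from one record to the next.
DOCS = [
    {
        "_id": "a1",
        "user": {"name": "Ann", "city": "Louisville"},
        "orders": [{"sku": "X", "qty": 2}, {"sku": "Y", "qty": 1}],
    },
    {"_id": "b2", "user": {"name": "Bob"}, "orders": [{"sku": "X", "qty": 5}]},
]


def flatten(docs):
    """Yield one flat row per order; missing fields become None (SQL NULL)."""
    for doc in docs:
        user = doc.get("user", {})
        for order in doc.get("orders", []):
            yield (doc["_id"], user.get("name"), user.get("city"), order["sku"], order["qty"])


# An in-memory SQLite database stands in for the relational system.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (doc_id TEXT, name TEXT, city TEXT, sku TEXT, qty INTEGER)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?)", flatten(DOCS))

# Now the relational "slice and dice" works.
totals = conn.execute(
    "SELECT sku, SUM(qty) FROM orders GROUP BY sku ORDER BY sku"
).fetchall()
print(totals)
```

Even in this toy, the flatten step has to decide what to do with a missing city and with one-to-many orders. Multiply those decisions by every collection in a production system and the "cost and pain" in the quote becomes easy to believe.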

My take on this searchification of Big Data boils down to one word: Scrambling for revenues. Perhaps some of the money pumped into crazy marketing schemes might be directed at creating something that works. Systems that dip into a barrel of trail mix return a snack that cannot replace a square meal.

Stephen E Arnold, July 2, 2014

Keeping Up with IBM: A New Daily Is Available

July 1, 2014

I try to keep up with Watson, the billion dollar bet that is much loved by gourmets at Bon Appétit. If you want daily IBM news and information for free, navigate to the “The THINQ Magazine Daily.” I think the THINQ is a modern version of the IBM sign I saw in the Federal Systems’ offices in 1973. That sign said, “Think.” Spelling aside, the algorithm harvests information from Web sites and presents it in a zippy format. Vivisimo, now an IBM company involved in Big Data, offered a service about enterprise search.

The issue I am viewing today (July 1, 2014) covers stuff I never heard of; for example, Bluemix and “world class analytics.” There are some stories with which I am familiar; for example, Watson crafts recipes. I wrote about tamarind as an ingredient not long ago too. The THINQ content includes links to IBM videos. I was not familiar with what is labeled “theCUBE.” I am not into videos because it takes too much time to watch talking heads and PowerPoint slides. Reading is quicker and easier for me, but I am old fashioned.

There is a selection of photos. Some of these come from sources other than IBM. I assume these are snapshots from IBM partners. A number of the pictures show really happy people looking at computing devices and somewhat baffling images with text asking me, “Why do you love social media?” I don’t love social media, but for certain types of law enforcement work, social media is darned useful. Facebook users often post snaps of themselves at crime scenes or capture their thoughts moments before taking some action I find disturbing.

There is an article telling small businesses how these small outfits can use Big Data. The link points to Inc. Magazine and an article with the title “What 3 Small Businesses Learned From Big Data.” The THINQ title does not quite capture what the Inc. article actually says, but I assume that most THINQ visitors will not pay much attention to the meaning adjustment.

If you are curious about IBM, take a look at THINQ. I will stick to my own system for monitoring the exciting world of IBM.

Stephen E Arnold, July 1, 2014
