May 22, 2015
The article titled Big Data Must Haves: Capacity, Compute, Collaboration on GCN offers insights into the best areas of focus for big data researchers. The Internet2 Global Summit is in D.C. this year with many exciting panelists who support the emphasis on collaboration in particular. The article mentions the work being presented by several people including Clemson professor Alex Feltus,
“…his research team is leveraging the Internet2 infrastructure, including its Advanced Layer 2 Service high-speed connections and perfSONAR network monitoring, to substantially accelerate genomic big data transfers and transform researcher collaboration…Arizona State University, which recently got 100 gigabit/sec connections to Internet2, has developed the Next Generation Cyber Capability, or NGCC, to respond to big data challenges. The NGCC integrates big data platforms and traditional supercomputing technologies with software-defined networking, high-speed interconnects and visualization for medical research.”
Arizona’s NGCC provides the essence of the article’s claims, stressing capacity with Internet2, several types of computing, and of course collaboration between everyone at work on the system. Feltus commented on the importance of cooperation in Arizona State’s work, suggesting that personal relationships outweigh individual successes. He claims his own teamwork with network and storage researchers helped him find new potential avenues of innovation that might not have occurred to him without thoughtful collaboration.
Chelsea Kerwin, May 22, 2015
Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com
May 18, 2015
In plain English too. Navigate to “Top 10 Data Mining Algorithms in Plain English.” When you fire up an enterprise content processing system, the algorithms beneath the user experience layer are chestnuts. Universities do a good job of teaching students about some reliable methods to perform data operations. In fact, the universities do such a good job that most content processing systems include almost the same old chestnuts in their solutions. The decision to use some or all of the top 10 data mining algorithms has some interesting consequences, but you will have to attend one of my lectures about the weaknesses of these numerical recipes to get some details.
The write up is worth a read. The article includes a link to information which underscores the ubiquitous nature of these methods. This is the Xindong Wu et al. write up “Top 10 Algorithms in Data Mining.” Our research reveals that dependence on these methods is more widespread now than it was seven years ago when the paper first appeared.
The implication then and now is that content processing systems are more alike than different. The use of similar methods means that the differences among some systems are essentially cosmetic. There is a flub in the paper. I am confident that you, gentle reader, will spot it easily.
Now to the “made simple” write up. The article explains quite clearly the what and why of 10 widely used methods. The article also identifies some of the weaknesses of each method. If there is a weakness, do you think it can be exploited? This is a question worth considering, I suggest.
Example: What is a weakness of k-means?
Two key weaknesses of k-means are its sensitivity to outliers, and its sensitivity to the initial choice of centroids. One final thing to keep in mind is k-means is designed to operate on continuous data — you’ll need to do some tricks to get it to work on discrete data.
Note the key word “tricks.” When one deals with math, the way to solve problems is to be clever. It follows that some of the differences among content processing systems boil down to the cleverness of the folks working on a particular implementation. Think back to your high school math class. Was there a student who just spit out an answer and then said, “It’s obvious”? Well, that’s the type of cleverness I am referencing.
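The two k-means weaknesses quoted above can be seen in a few lines of code. What follows is a minimal sketch, not the implementation used by any content processing system: it assumes one-dimensional data and a plain Lloyd-style loop, and the function name `kmeans_1d` and the sample points are made up for illustration.

```python
def kmeans_1d(points, initial_centroids, max_iterations=50):
    """Plain Lloyd-style k-means on 1-D data (illustrative sketch only)."""
    centroids = [float(c) for c in initial_centroids]
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        updated = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:  # nothing moved, so the loop has converged
            break
        centroids = updated
    return centroids


# Hypothetical data: three obvious groups near 0, 10, and 20.
groups = [0, 1, 10, 11, 20, 21]
good_start = kmeans_1d(groups, [0, 10, 20])  # one starting centroid per group
bad_start = kmeans_1d(groups, [0, 1, 15])    # two starting centroids in one group

# Hypothetical data: two groups, then the same data plus one outlier at 100.
clean = kmeans_1d([1, 2, 3, 7, 8, 9], [2, 8])
with_outlier = kmeans_1d([1, 2, 3, 7, 8, 9, 100], [2, 8])

print(good_start, bad_start)
print(clean, with_outlier)
```

With the poor starting centroids the loop settles on a split that lumps two of the three groups together, and the single outlier at 100 claims a centroid of its own, which merges the two real groups. Production systems usually soften both problems with repeated random restarts or smarter seeding, which is one place the “tricks” come in.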
The author does not dig too deeply into PageRank, but it too has some flaws. An easy way to identify one is to attend a search engine optimization conference. One flaw turbocharges these events.
My relative Vladimir Arnold, whom some of the Arnolds called Vlad the Annoyer, would have liked the paper. So do I. The write up is a keeper. Plus there is a video, perfect for the folks whose attention span is better than a goldfish’s.
Stephen E Arnold, May 18, 2015
May 14, 2015
Mythologies usually develop over the course of centuries, but big data has only been around for (arguably) a couple of decades, at least in its modern incarnation. Recently big data has received so much media attention and product development that the Internet has had enough material to create a big data mythology. The Globe and Mail wanted to dispel some of the bigger myths in the article, “Unearthing Big Myths About Big Data.”
The article focuses on Prof. Joerg Niessing’s big data expertise and how he explains the truth behind many of the biggest big data myths. One of the biggest points Niessing wants people to understand is that gathering data does not equal dollar signs; you have to be active with data:
“You must take control, starting with developing a strategic outlook in which you will determine how to use the data at your disposal effectively. “That’s where a lot of companies struggle. They do not have a strategic approach. They don’t understand what they want to learn and get lost in the data,” he said in an interview. So before rushing into data mining, step back and figure out which customer segments and what aspects of their behavior you most want to learn about.”
Niessing says that big data is not really big, but made up of many diverse data points. Big data also does not have all the answers; instead, it provides ambiguous results that need to be interpreted. Have the questions you want answered in hand before gathering data. Not all of the data returned is good, either. Some of it is actually garbage and cannot be used for a project. Several other myths are uncovered, but the truth remains that having a strategic big data plan in place is the best way to make the most of big data.
Whitney Grace, May 14, 2015
April 13, 2015
Bing is considered a search engine joke, but it might be working its way toward becoming a viable search solution…maybe. In “How Bing Predicts Has Become So Good,” MakeUseOf credits the improvement to Microsoft actually listening to its users and improving the search results with the idea that “Bing is for doing.” One way Microsoft is putting its search engine to work is with Bing Predicts, a tool that predicts who will win competitions, the weather, and other outcomes by analyzing popular searches, social media, regional trends, and more.
It takes a bit more for Predicts to divine sporting event outcomes; for those, Bing relies on historic team data, key player data, opinions from top news sources, and pre-game report predictions.
“Microsoft researcher, and serial predictor David Rothschild believes the prediction engine is ‘an interesting way to show users that Bing has a lot of horsepower beyond just providing good search results.’ Data is everything. Even regular Internet users understand the translation of data to power, so Microsoft’s bold step forward with their predictions underscores the confidence in their own algorithms, and their ability to handle the data coming into Redmond.”
Other than predicting games and the next American Idol winner, Bing Predicts has applications in social fields and industry. Companies are already implementing some forms of future analysis, and for social causes it can be used to predict the best ways to conserve resources, medicinal supplies, food, and even conservatism.
Whitney Grace, April 13, 2015
February 2, 2015
I find the complaints about Google’s inability to handle time amusing. On the surface, Google seems to demote, ignore, or just not understand the concept of time. For the vast majority of Google service users, Google is no substitute for the users’ investment of time and effort into dating items. But for the wide, wide Google audience, ads, not time, are more important.
Does Google really get an F in time? The answer is, “Nope.”
In CyberOSINT: Next Generation Information Access I explain that Google’s time sense is well developed and of considerable importance to next generation solutions the company hopes to offer. Why the crawfishing? Well, Apple could just buy Google and make the bitter taste of the Apple Board of Directors’ experience a thing of the past.
Now to temporal matters in the here and now.
CyberOSINT relies on automated collection, analysis, and report generation. In order to make sense of data and information crunched by an NGIA system, time is a really key metatag item. To figure out time, a system has to understand:
- The date and time stamp
- Versioning (previous, current, and future document, data items, and fact iterations)
- Times and dates contained in a structured data table
- Times and dates embedded in content objects themselves; for example, a reference to “last week” or in some cases, optical character recognition of the data on a surveillance tape image.
For the average query, this type of time detail is overkill. The “time and date” of an event, therefore, requires disambiguation, determination and tagging of specific time types, and then capturing the date and time data with markers for document or data versions.
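The requirements listed above can be sketched in code. This is a rough illustration only, not a description of how Recorded Future, Google, or any NGIA system actually works: it assumes a tiny hand-made phrase table (`RELATIVE_OFFSETS`), a hypothetical function name `tag_time_references`, and the simplification that “last week” resolves to a single date seven days before the document’s time stamp.

```python
from datetime import date, timedelta

# Hypothetical phrase table: day offsets relative to the document's own date.
# A real system would need far richer rules; this is an assumption for the sketch.
RELATIVE_OFFSETS = {
    "yesterday": -1,
    "today": 0,
    "tomorrow": 1,
    "last week": -7,
    "next week": 7,
}


def tag_time_references(text, document_date, version):
    """Find relative time phrases in text and resolve them against the document date.

    Each record keeps the time type, the resolved date, and the document
    date and version marker the reference came from.
    """
    lowered = text.lower()
    records = []
    for phrase, offset_days in RELATIVE_OFFSETS.items():
        if phrase in lowered:
            records.append({
                "phrase": phrase,
                "time_type": "relative",
                "resolved_date": document_date + timedelta(days=offset_days),
                "document_date": document_date,
                "version": version,
            })
    return records


# Usage: a document stamped February 2, 2015, version 1, that mentions "last week".
for record in tag_time_references(
    "The shipment arrived last week.", date(2015, 2, 2), version=1
):
    print(record["phrase"], "->", record["resolved_date"].isoformat())
```

The point of the sketch is that an embedded phrase such as “last week” means nothing until it is tied to the date and time stamp and the version of the item that contains it.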
A simplification of Recorded Future’s handling of unstructured data. The system can also handle structured data and a range of other data management content types. Image copyright Recorded Future 2014.
Sounds like a lot of computational and technical work.
In CyberOSINT, I describe Google’s and In-Q-Tel’s investments in Recorded Future, one of the data forward NGIA companies. Recorded Future has wizards who developed the Spotfire system which is now part of the Tibco service. There are Xooglers like Jason Hines. There are assorted wizards from Sweden, countries that most US high school sophomores cannot locate on a map, and assorted veterans of high technology start ups.
An NGIA system delivers actionable information to a human or to another system. Conversely a licensee can build and integrate new solutions on top of the Recorded Future technology. One of the company’s key inventions is numerical recipes that deal effectively with the notion of “time.” Recorded Future uses the name “Tempora” as shorthand for the advanced technology that makes time along with predictive algorithms part of the Recorded Future solution.
January 12, 2015
If you are a fan of “knowledge,” you probably follow the information provided by www.KDNuggets.com. I read “Research Leaders on Data Science and big Data Key Trends, Top Papers.” The information is quite interesting. I did note that the paper was kicked off with this statement:
As for the papers, we found that many researchers were so busy that they did not really have the time to read many papers by others. Of course, top researchers learn about works of others from personal interactions, including conferences and meetings, but we hope that professors have enough students who do read the papers and summarize the important ones for them!
Okay, everyone is really busy.
Among the 13 experts cited, I noted that there were two papers that seemed to call attention to the issue of accuracy. These were:
“Preventing False Discovery in Interactive Data Analysis is Hard,” Moritz Hardt and Jonathan Ullman
“Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images,” Anh Nguyen, Jason Yosinski, Jeff Clune.
A related paper noted in the article is “Intriguing Properties of Neural Networks,” by Christian Szegedy, et al. The KDNuggets comment states:
It found that for every correctly classified image, one can generate an “adversarial”, visually indistinguishable image that will be misclassified. This suggests potential deep flaws in all neural networks, including possibly a human brain.
My take away is that automation is coming down the pike. Accuracy could get hit by a speeding output.
Stephen E Arnold, January 12, 2015
November 13, 2014
The article on Inside BigData titled RapidMiner Moves Predictive Analytics, Data Mining and Machine Learning into the Cloud promotes RapidMiner Cloud, the recently announced tool for business analysts. The technology allows users to leverage over 300 cloud platforms such as Amazon, Twitter and Dropbox at an affordable price ($39/month). The article quotes RapidMiner CEO Ingo Mierswa, who emphasized the “single click” necessary for users to gain important predictive analytics. The article says,
“RapidMiner understands the unique needs of today’s mobile workforce. RapidMiner Cloud includes connectors to cloud-based data sources that can be used on-premises and in the cloud with seamless transitioning between the two. This allows users to literally process Big Data at anytime and in any place, either working in the cloud or picking up where they left off when back in the office. This feature is especially important for mobile staff and consultants in the field.”
RapidMiner Cloud also contains the recently launched Wisdom of the Crowds Operator Recommendations, which culls insights into the analytics process from the millions of models created by members of the RapidMiner community. The article also suggests that RapidMiner is uniquely capable of integration with open-source solutions; rather than competing with them, the platform is more invested in source-code availability.
Chelsea Kerwin, November 13, 2014
October 3, 2014
I read “MarkLogic Positioned as a Leader in NoSQL Document Databases Report by Independent Research Firm.” The research firm is the mid tier outfit Forrester Research Inc. Forrester creates “wave” reports. These are Forrester’s response to various grids, quadrants, and tables cranked out by Gartner, Ovum, Butler, Kelsey, and a life boat stuffed with consulting firm shakeout survivors. Dated October 2, 2014, the MarkLogic news release will be the first of a half dozen or more issued by companies in this “independent research firm’s” report. The mid tier analyses are crafted so that negatives are swathed in high density, low impact foam like spray on insulation.
Like Heaven’s Gate’s media event, any publicity is good publicity. At least, that’s the public relations mantra. Look at IBM Watson and its BBQ sauce recipe with tamarind. I mention that innovation as frequently as possible.
Well, let me do my part for this report:
The write up asserts:
“MarkLogic offers the most mature and scalable NoSQL document database. Unlike other NoSQL document databases, MarkLogic has been offering a NoSQL solution for more than a decade,” stated Forrester in the report that evaluated select companies against 57 criteria. “MarkLogic has the most comprehensive data management features and functionality to store, process, and access any kind of structured and multi structured data.” Forrester’s evaluation of NoSQL document database vendors scored factors like performance, scalability, integration, security, high availability, workload management and form factor. MarkLogic was cited as a Leader in the evaluation, receiving its highest score in the go-to-market category.
Okay. The news release provides a link so the reader can get a copy of the “independent research firm’s” report. If you want to skip the original document and go to the registration form so you can download the “independent research firm’s” report, navigate to http://bit.ly/1oGQCvf. In my experience, some follow up by the “leader” MarkLogic may take place.
In my view, content marketing covers these “independent” reports. The idea makes clear that attention is required in order to kindle interest in a product or a service. Now MarkLogic is an Extensible Markup Language data management system. The company has been in business since 2003. The firm has ingested more than $70 million in venture funding. The firm has experienced the same type of revolving door for senior management that other ageing start ups experience; for example, Lucid Imagination (now Lucid Works, which I write as Lucid Works. Really?). MarkLogic, in order to meet stakeholders’ expectations, has to find a growth bull, get it in a corral, and convert the animal to high value revenue.
- Proprietary XML systems positioned as NoSQL alternatives have to find a way to convince a prospect that proprietary is a better value than open source. The impact of Hadoop, a variant of Google’s Big Table, is long in the tooth and faces some of its own value challenges.
- Companies like Oracle are providing some of their clients with the comfort of a proprietary system with compatibility with open source technology. Thus, some large companies may be reluctant to dismount one old nag and climb on another. IBM also does some anti open source marketing but that’s another story. For some insights, run a query for Watson on the Beyond Search index.
- The noise surrounding NoSQL is creating some confusion. This means that firms that are neither big nor small have to find a way to make their size into a positive. Enter content marketing and reports that present a group of companies in a simplified table.
- Do the “independent” experts use the products included in a variant of the Boston Consulting Group’s matrix? You know: Install, optimize, customize, and utilize with their own brain, fingers, and eyeballs? My hunch is that none of this “real” experience stuff is germane to cranking out an “independent” report. Just my uninformed opinion, you understand.
If a company requires a NoSQL solution, how does that firm select a vendor? Based on the research that IDC used to skip Dave Schubmehl to expert status, large companies are more likely to try open source for a new project. Smaller firms often look for brand name software in order to show investors that the base technology has a brand name.
Forrester-type firms (Gartner, IDC, Ovum, etc.) generate “independent” reports to inflate the balloon. The French have a delightful verb for this: “se gonfler”. So, nous [MarkLogic] gonflons notre ballon. (If the translation is poor, blame Google, the inventor of Big Table more than a decade ago.)
Stephen E Arnold, October 3, 2014
April 2, 2014
Have you ever heard of transfinancial economics? It is a concept originated by Robert Searle and he writes about the topic and other related concepts on his blog The Economic Realms. Searle explains that:
“[Transfinancial economics] believes that apart from earned money, new unearned money could be electronically created without serious inflation notably for key climate change/ environmentally sustainable projects, and for high ethical/ social ‘enterprises.’”
It is a possible theory that could be explored, but investigating Searle’s blog posts and his user profile brings to light that Searle is either an extremely longwinded person or a dummy SEO profile. While we were trying to study his reasoning for transfinancial economics, we found a blog post he wrote that explains how data mining will be important to it.
He then copied the entire Wikipedia entry on data mining. Browsing through his other posts shows that he has copied other Wikipedia entries among a few original entries. If Searle is a real person, his blog follows a Pat Gunkel-esque writing style. He spins his ideas to connect to each other, from transfinancial economics to improvisational whistling. If you have time, you can work through the entire blog for an analysis of the discipline and how transfinancial economics works. We doubt that Searle will be writing a book on the topic soon.
April 1, 2014
Tech Radar has an article that suggests an idea we have never heard before: “How Text Mining Can Help Your Business Dig Gold.” Be mindful that was a sarcastic comment. It is already common knowledge that text mining is an advantageous tool for learning about customers, products, new innovations, market trends, and other patterns. One of big data’s main purposes is capturing that information from an organization’s data. The article explains how much data is created in a single minute from text with some interesting facts (2.46 million Facebook posts, wow!).
It suggests understanding the type of knowledge you wish to capture and finding software with a user-friendly dashboard. It ends on this note:
“In summary, you need to listen to what the world is trying to tell you, and the premier technology for doing so is “text mining.” But, you can lean on others to help you use this daunting technology to extract the right conversations and meanings for you.”
The entire article is an overview of what text mining can do and how it is beneficial. It does not go further than basic explanations or explain how to mine the gold in the data mine. That will require further reading. We suggest a follow up article that explains how text mining can also lead to fool’s gold.