
Governance for Big Data. A Sure Fire Winner for Consultants

July 28, 2016

I read “What’s Next for Big Data Analytics?” I didn’t know the answer to this question, and I still don’t. The angle of attack is common sense. Companies with experience in dealing with digital information often have viewpoints different from the marketing collateral produced by their colleagues. This write up seems to fall in the category of Mr. Bush’s request, “Please, clap.”

The idea is that an organization has to have information policies. That sounds like consultant speak. Most organizations struggle to figure out what their company party policies are. Digital data policies are one of those tasks that senior managers leave for others to wrestle to the ground and force a tap out.

The write up includes a number of diagrams. I highlighted this one:


The red area is the governance and management thing. Good luck with that. Companies need revenue. Big Data is supposed to deliver. If not, those policies and governance meeting minutes, along with the consultants who billed big bucks for them, are going to the shredder, in my opinion.

Stephen E Arnold, July 28, 2016

Scholarship Evolving with the Web

July 21, 2016

Is big data good only for the hard sciences, or does it have something to offer the humanities? Writer Marcus A Banks thinks it does, as he states in, “Challenging the Print Paradigm: Web-Powered Scholarship is Set to Advance the Creation and Distribution of Research” at the Impact Blog (a project of the London School of Economics and Political Science). Banks suggests that data analysis can lead to a better understanding of, for example, how the perception of certain historical events has evolved over time. He goes on to explain what the literary community has to gain by moving forward:

“Despite my confidence in data mining I worry that our containers for scholarly works — ‘papers,’ ‘monographs’ — are anachronistic. When scholarship could only be expressed in print, on paper, these vessels made perfect sense. Today we have PDFs, which are surely a more efficient distribution mechanism than mailing print volumes to be placed onto library shelves. Nonetheless, PDFs reinforce the idea that scholarship must be portioned into discrete units, when the truth is that the best scholarship is sprawling, unbounded and mutable. The Web is flexible enough to facilitate this, in a way that print could never do. A print piece is necessarily reductive, while Web-oriented scholarship can be as capacious as required.

“To date, though, we still think in terms of print antecedents. This is not surprising, given that the Web is the merest of infants in historical terms. So we find that most advocacy surrounding open access publishing has been about increasing access to the PDFs of research articles. I am in complete support of this cause, especially when these articles report upon publicly or philanthropically funded research. Nonetheless, this feels narrow, quite modest. Text mining across a large swath of PDFs would yield useful insights, for sure. But this is not ‘data mining’ in the maximal sense of analyzing every aspect of a scholarly endeavor, even those that cannot easily be captured in print.”

Banks does note that a cautious approach to such fundamental change is warranted, citing the development of the data paper in 2011 as an example.  He also mentions Scholarly HTML, a project that hopes to evolve into a formal W3C standard, and the Content Mine, a project aiming to glean 100 million facts from published research papers. The sky is the limit, Banks indicates, when it comes to Web-powered scholarship.


Cynthia Murrell, July 21, 2016

Sponsored by, publisher of the CyberOSINT monograph

There is a Louisville, Kentucky Hidden Web/Dark
Web meet up on July 26, 2016.
Information is at this link:


Attivio Targets Profitability by the End of 2016 Through $31M Financing Round

July 18, 2016

The article on VentureBeat titled “Attivio Raises $31 Million to Help Companies Make Sense of Big Data” discusses the promises of profitability that Attivio has made since its inception in 2007. According to Crunchbase, the search vendor has raised over $100 million from four investors. In March 2016, the company closed a financing round at $31M with the expectation of becoming profitable within 2016. The article explains,

“Our increased investment underscores our belief that Attivio has game-changing capabilities for enterprises that have yet to unlock the full value of Big Data,” said Oak Investment Partners’ managing partner, Edward F. Glassmeyer. Attivio also highlighted such recent business victories as landing lab equipment maker Thermo Fisher Scientific as a client and partnering with medical informatics shop PerkinElmer. Oak Investment Partners, General Electric Pension Trust, and Tenth Avenue Holdings participated in the investment, which pushed Attivio’s funding to at least $102 million.”

In the VentureBeat profile about the deal, Stephen Baker, CEO of Attivio, makes it clear that 2015 was a turning point for the company, or in his words, “a watershed year.” Attivio prides itself on both speeding up the data preparation process and empowering its customers to “achieve true Data Dexterity.”  And hopefully it will also be profitable, soon.


Chelsea Kerwin, July 18, 2016



The Web, the Deep Web, and the Dark Web

July 18, 2016

As if it were not challenge enough to understand how the Internet works and avoid identity theft, try carving through the various layers of the Internet, such as the Deep Web and the Dark Web.  It gets confusing, but “Big Data And The Deep, Dark Web” from Data Informed clears up some of the clouds that darken Internet browsing.

The differences between the three are not that difficult to understand once they are spelled out.  The Web is the part of the Internet we use daily to check our email, read the news, visit social media sites, etc.  The Deep Web is the sector of the Internet not readily picked up by search engines: password-protected sites, very specific information such as booking a flight with a particular airline on a certain date, and the Tor servers that allow users to browse anonymously.  The Dark Web consists of Web pages that are not indexed by search engines, some of which sell illegal goods and services.
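The layering can be illustrated with a toy triage function. The `.onion` suffix used by Tor hidden services is the one reliable signal visible in a URL; everything else here, including the function name and sample addresses, is invented for illustration:

```python
from urllib.parse import urlparse

def classify_url(url):
    """Rough, illustrative triage of a URL into the article's layers."""
    host = urlparse(url).hostname or ""
    if host.endswith(".onion"):
        # Tor hidden service: reachable only through Tor and never crawled
        # by mainstream search engines -- the Dark Web in the article's terms.
        return "dark web"
    # Surface vs. Deep Web cannot be read off the URL alone; it depends on
    # whether the page sits behind a login, a search form, or robots.txt rules.
    return "surface or deep web"

print(classify_url("http://examplemarket.onion/listings"))
print(classify_url("https://news.example.com/story"))
```

The point of the sketch is the asymmetry: the Dark Web announces itself in the address scheme, while the surface/deep distinction is a property of how a page is served, not of its URL.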

“We do not know everything about the Dark Web, much less the extent of its reach.

“What we do know is that the deep web has between 400 and 550 times more public information than the surface web. More than 200,000 deep web sites currently exist. Together, the 60 largest deep web sites contain around 750 terabytes of data, surpassing the size of the entire surface web by 40 times. Compared with the few billion individual documents on the surface web, 550 billion individual documents can be found on the deep web. A total of 95 percent of the deep web is publically accessible, meaning no fees or subscriptions.”
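Taken at face value, the quoted figures imply a back-of-the-envelope size for the surface web. All numbers below are the article’s estimates, not measurements of mine:

```python
# The article's estimates, not independent measurements.
top_60_deep_web_tb = 750.0   # quoted size of the 60 largest deep web sites
surface_multiple = 40.0      # quoted: those 60 sites are 40x the surface web

implied_surface_web_tb = top_60_deep_web_tb / surface_multiple
print(implied_surface_web_tb)  # 18.75 -- under 19 TB for the whole surface web
```

Whether 19 terabytes for the entire surface web is plausible is another question; the figures date from early deep web studies and should be treated as rough orders of magnitude.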

The biggest seller on the Dark Web is child pornography.  Most transactions take place using Bitcoin, with an estimated $56,000 in daily sales.  Criminals are not the only ones who use the Dark Web; whistle-blowers, journalists, and security organizations use it as well.  Big data has barely scratched the surface when it comes to mining this material, but those interested can find information and do their own mining with a little digging.


Whitney Grace, July 18, 2016

Short Honk: Elassandra

July 16, 2016

Just a factoid. There is now a version of Elasticsearch which is integrated with Cassandra. You can get the code for version 2.1.1-14 via GitHub. Just another example of the diffusion of the Elasticsearch system.
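For the curious: Elassandra exposes the standard Elasticsearch REST API while persisting the data in Cassandra tables, so an ordinary index request should work unchanged. A minimal sketch of what such a request looks like, assuming a node listening on `localhost:9200`; the index, type, and document here are invented:

```python
import json

def build_index_request(host, index, doc_type, doc_id, doc):
    """Build the URL and JSON body for an Elasticsearch-style index call.

    Against Elassandra, the index maps to a Cassandra keyspace and the
    type to a table, but the HTTP request itself is the same one any
    Elasticsearch client would send."""
    url = "http://{}:9200/{}/{}/{}".format(host, index, doc_type, doc_id)
    return url, json.dumps(doc)

url, body = build_index_request("localhost", "articles", "post", "1",
                                {"title": "Short Honk: Elassandra"})
print(url)  # http://localhost:9200/articles/post/1
```

That API compatibility is the pitch: existing Elasticsearch tooling keeps working while Cassandra handles storage and replication underneath.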

Stephen E Arnold, July 16, 2016

Books about Data Mining: Some Free, Some for Fee

July 14, 2016

If you want to load up on fun beach reading, I have a suggestion for you, gentle reader. KDnuggets posted “60+ Free Books on Big Data, Data Science, Data Mining, Machine Learning, Python, R, and More.” The list does contain books about data mining and a number of other subjects. You will have to read the list and figure out which titles are germane to your interests. A number of the books include a helpful Amazon link. If you click on the hyperlink you may get a registration form, a PDF of the book, or this message:


Stephen E Arnold, July 14, 2016

Big Data Diagram Reveals Database Crazy Quilt

July 7, 2016

I was cruising through the outputs of my Overflight system and spotted a write up with the fetching title “Big Data Services | @CloudExpo #BigData #IoT #M2M #ML #InternetOfThings.” Unreadable? Nah. Just a somewhat interesting attempt to get a marketing write up indexed by a Web search engine. Unfortunately, humans have to get involved at some point. Thus, in my quest to learn what the heck Big Data is, I explored the content of the write up. What the article presents is mini summaries of slide decks developed by assorted mavens, wizards, and experts. I dutifully viewed most of the information but tired quickly as I moved through a truly unusual article about a conference held in early June. I assume that the “news” is that the post conference publicity is going to provide me with high value information in exchange for the time I invested in trying to figure out what the heck the title means.

I viewed a slide deck from an outfit called Cazena. You can view “Tech Primer: Big Data in the Cloud.” I want to highlight this deck because it contains one of the most amazing diagrams I have seen in months. Here’s the image:


Not only is the diagram enhanced by the colors and lines, the world it depicts is a listing of data management products. The image was produced in June 2015 by a consulting firm and recycled in “Tech Primer” a year later.

I assume the folks in the audience benefited from the presentation of information from mid tier consulting firms. I concluded that the title of the article is actually pretty clear.

I wonder: is a T shirt available with the database graphic? If so, I want one. Perhaps I can search for the strings “#M2M #ML.”

Stephen E Arnold, July 7, 2016

What Makes Artificial Intelligence Relevant to Me

July 7, 2016

Artificial intelligence makes headlines every once in a while when a new supercomputer beats a pro player at chess, go, or even Jeopardy.  It is amazing how these machines replicate human thought processes, but it is more of a novelty than a practical application.  ITProPortal discusses the actual real world benefits of artificial intelligence in “How Semantic Technology Is Making Sense Of Our Big Data.”

The answer, of course, revolves around big data and how industries cannot keep up with the surge of unstructured data generated as technology advances.  Artificial intelligence processes the data and interprets it into recognizable patterns.

The article then turns to the benefits of natural language processing: how it scours information and extrapolates context from natural speech patterns.  It also explains how semantic technology picks up the slack when natural language processing falls short.  The entire goal is to make unstructured data more structured:

“It is also reasonable to note that the challenge also relates to the structure and output of your data management. The application of semantic technologies within an unstructured data environment can only draw real business value if the output is delivered in a meaningful way for the human tasked with looking at the relationships. It is here that graphical representations add user interface value and presents a cohesive approach to improving the search and understanding of enterprise data.”
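The unstructured-to-structured goal can be sketched with a toy extractor. Real NLP and semantic systems are vastly richer than a regular expression; the pattern, field names, and sample sentence below are purely illustrative:

```python
import re

def extract_records(text):
    """Pull (name, year) pairs out of free text into structured rows --
    a toy stand-in for what NLP and semantic pipelines do at scale."""
    pattern = re.compile(r"([A-Z][a-z]+ [A-Z][a-z]+) \((\d{4})\)")
    return [{"name": name, "year": int(year)}
            for name, year in pattern.findall(text)]

sample = "Speakers included Jane Doe (2015) and John Smith (2016)."
print(extract_records(sample))
```

The output is a list of uniform records a database or graph can ingest, which is the “meaningful way for the human” the quoted passage is after, minus the graphics.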

The article is an informative fluff piece that sells big data technology and explains the importance of taking charge of data.  It has been discussed before.


Whitney Grace, July 7, 2016

More Data Truths: Painful Stuff

July 4, 2016

I read “Don’t Let Your Data Lake Turn into a Data Swamp.” Nice idea, but there may be a problem which resists some folks’ best efforts to convert that dicey digital real estate into a tidy subdivision. Swamps are wetlands. As water levels change, the swamps come and go, ebb and flow as it were. More annoying is the fact that swamps are not homogeneous. Fens, muskegs, and bogs add variety to the happy hiker who strays into the Vasyugan Swamp as the spring thaw progresses.

The notion of a data swamp is an interesting one. I am not certain how zeros and ones in a storage medium relate to the Okavango delta, but let’s give this metaphor a go. The write up reveals:

Data does not move easily. This truth has plagued the world of Big Data for some time and will continue to do so. In the end, the laws of physics dictate a speed limit, no matter what else is done. However, somewhere between data at rest and the speed of light, there are many processes that must be performed to make data mobile and useful. Integrating data and managing a data pipeline are two of these necessary tasks.

Okay, no swamp thing here.

The write up shifts gears and introduces the “data pipeline” and the concept of “keeping the data lake clean.”

Let’s step back. What seems to be the motive force for this item about information in digital form has several gears:

  1. Large volumes of data are a mess. Okay, but not all swamps are messes. The real problem is that whoever stored data did it without figuring out what to do with the information. Collection is not application.
  2. The notion of a data pipeline implies movement of information from Point A to Point B or through a series of processes which convert Input A into Output B. Data pipelines are easy to talk about, but in my experience these require knowing what one wants to achieve and then constructing a system to deliver. Talking about a data pipeline is not a data pipeline in my wetland.
  3. The concept of pollution seems to suggest that dirty data are bad. Making certain data are accurate and normalized requires effort.
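The pipeline idea in point two can be sketched as a chain of stages, each nudging Input A a little closer to Output B. The stages, field names, and sample rows below are invented for illustration; a real pipeline would add validation, logging, and error handling:

```python
def strip_fields(record):
    # Stage 1: trim stray whitespace -- one small kind of "dirty data" cleanup.
    return {k: v.strip() for k, v in record.items()}

def normalize_city(record):
    # Stage 2: normalize values so downstream analysis can aggregate them.
    record["city"] = record["city"].title()
    return record

def run_pipeline(records, stages):
    """Push every record through each stage in order: Point A to Point B."""
    for stage in stages:
        records = [stage(r) for r in records]
    return records

rows = [{"name": "alice", "city": "  louisville"},
        {"name": "bob",   "city": "LOUISVILLE "}]
print(run_pipeline(rows, [strip_fields, normalize_city]))
```

Note that the hard part is not the plumbing; it is deciding, before the data lands in storage, what the stages should be and what Output B is for.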

My view is that this write up is trying to communicate the fact that Big Data is not too helpful if one does not take care of the planning before clogging a storage subsystem with digital information.

Seems obvious, but I suppose that’s why we have Love Canals and an ever-efficient Environmental Protection Agency to clean up shortcuts.

Stephen E Arnold, July 4, 2016

Bad News for Instant Analytics Sharpies

June 28, 2016

I read “Leading Statisticians Establish Steps to Convey Statistics a Science Not Toolbox.” I think “steps” are helpful. The challenge will be to corral the escaped ponies who have made fancy analytics a point-and-click, drop-down punch list. Who needs to understand anything? Hit the button and generate visualizations until something looks really super. Does anyone know a general who engages in analytic one-upmanship? Content and clarity sit in the backseat of the JLTV.

The write up is similar to teens who convince their less well-liked “pals” to go on a snipe hunt. I noted this passage:

To this point, Meng [real statistics person] notes “sound statistical practices require a bit of science, engineering, and arts, and hence some general guidelines for helping practitioners to develop statistical insights and acumen are in order. No rules, simple or not, can be 100% applicable or foolproof, but that’s the very essence that I find this is a useful exercise. It reminds practitioners that good statistical practices require far more than running software or an algorithm.”

Many vendors emphasize how easy smart analytics systems are to use. The outputs are presentation ready. Checks and balances are mostly pushed to the margins of the interface.

Here are the 10 rules.

  1. Statistical Methods Should Enable Data to Answer Scientific Questions
  2. Signals Always Come with Noise
  3. Plan Ahead, Really Ahead
  4. Worry about Data Quality
  5. Statistical Analysis Is More Than a Set of Computations
  6. Keep it Simple
  7. Provide Assessments of Variability
  8. Check Your Assumptions
  9. When Possible, Replicate!
  10. Make Your Analysis Reproducible
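Rules 7 and 10 at least are cheap to demonstrate: a bootstrap estimate of variability, with a fixed seed so every rerun gives the same interval. A sketch, not a statistics lesson; the data are made up:

```python
import random
import statistics

def bootstrap_interval(data, n_resamples=1000, seed=42):
    """Rule 7: report variability, not just a point estimate.
    Rule 10: the fixed seed makes every rerun produce the same interval."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choice(data) for _ in data)
        for _ in range(n_resamples)
    )
    # Approximate 95% interval from the resampled means.
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples)]

data = [4, 8, 15, 16, 23, 42]
low, high = bootstrap_interval(data)
print(low, high)  # identical numbers on every run, thanks to the seed
```

A dashboard that shipped the interval alongside the pretty chart, and recorded the seed, would already be ahead of most point-and-click output.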

I think I can hear the guffaws from the analytics vendors now. I have tears in my eyes when I think about “statistical methods should enable data to answer scientific questions.” I could have sold that line to Jack Benny if he were still alive and doing comedy. Scientific questions from data which no human has checked for validity. Oh, my goodness. Then reproducibility. That’s a good one too.

Stephen E Arnold, June 28, 2016
