
Managing Unstructured Data in Just Nine Steps

February 25, 2015

I’m sure Datamation will help some companies with its post, “Big Data: 9 Steps to Extract Insight from Unstructured Data.” However, we think these steps may make other companies look for easier options. Writer Salil Godika explains why he feels these steps are worth the effort:

“Organizations have to study both structured and unstructured data to arrive at meaningful business decisions…. Not only do they have to analyze information provided by consumers and other organizations, information collected from devices must be scrutinized. This must be done not only to ensure that the organization is on top of any network security threats, but to also ensure the proper functioning of embedded devices.

“While sifting through vast amounts of information can look like a lot of work, there are rewards. By reading large, disparate sets of unstructured data, one can identify connections from unrelated data sources and find patterns. What makes this method of analysis extremely effective is that it enables the discovery of trends; traditional methods only work with what is already quantifiable, while looking through unstructured data can cause revelations.”

The nine steps presented in the article begin at the beginning (“make sense of the disparate data sources”) and end at the logical destination (“obtain insight from the analysis and visualize it”). See the article for the steps in between and their descriptions. A few highlights include designating the technology stack for data processing and storage, creating a “term frequency matrix” to understand word patterns and flow, and performing an ontology evaluation.
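For readers unfamiliar with the “term frequency matrix” mentioned above, here is a minimal sketch in Python. It is not taken from the Datamation article; the function name, the tokenizing rule, and the sample documents are all assumptions made for illustration. Each row is a document, each column is a term, and each cell counts how often that term appears in that document.

```python
import re
from collections import Counter


def term_frequency_matrix(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i.

    Hypothetical helper for illustration, not the article's own method.
    """
    # Lowercase each document and split it into alphabetic tokens.
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    # The vocabulary is the sorted set of every term seen in any document.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    # Count term occurrences per document, then lay them out column by column.
    counts = [Counter(tokens) for tokens in tokenized]
    matrix = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, matrix


# Sample documents invented for this example.
sample_docs = [
    "Server down, server restarted",
    "Customer says the server is slow",
]
vocabulary, matrix = term_frequency_matrix(sample_docs)

for doc, row in zip(sample_docs, matrix):
    print(row, "<-", doc)
```

A production pipeline would likely add stop-word removal, stemming, and a sparse matrix representation; those are left out here to keep the idea visible.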

Writer Salil Godika concludes with a reminder that new types of information call for new approaches, including revised skillsets for data scientists. The ability to easily blend and analyze information from disparate sources in a variety of formats remains the ultimate data-analysis goal.

Cynthia Murrell, February 25, 2015

Sponsored by, developer of Augmentext

DataStax Buys Graph-Database Startup Aurelius

February 20, 2015

DataStax has purchased the open-source graph-database company Aurelius, we learn in “DataStax Grabs Aurelius in Graph Database Acqui-Hire” at TechCrunch. Aurelius’ eight engineers will reportedly be working at DataStax, delving right into a scalable graph component for the company’s Cassandra-based Enterprise database. This acquisition, DataStax declares, makes theirs the only database platform with graph, analytics, search, and in-memory in one package. Writer Ron Miller tells us:

“DataStax is the commercial face of the open source Apache Cassandra database. Aurelius was the commercial face of the Titan graph database.

“Matt Pfeil, co-founder and chief customer officer at DataStax, says customers have been asking about graph database functionality for some time. Up until now customers have been forced to build their own on top of the DataStax offering.

“‘This was something that was on our radar. As we started to ramp up, it made sense from corporate [standpoint] to buy it instead of build it.’ He added that getting the graph-database engineering expertise was a bonus. ‘There’s not a ton of graph database experts [out there],’ he said.

“This expertise is especially important as two of the five major DataStax key use cases — fraud detection and recommendation engines — involve a graph database.”

Though details of the deal have not been released, see the write-up for some words on the fit between these two companies. Founded on an open-source model, Aurelius was doing just fine on its own. Co-founder Matthias Bröcheler is excited, though, about what his team can do at DataStax. Bröcheler did note that the graph database’s open-source version, Titan, will live on. Aurelius is located in Oakland, California, and was just launched in 2014.

Headquartered in San Mateo, California, DataStax was founded in 2010. Their Cassandra-based software implementations are flexible and scalable. Clients range from young startups to Fortune 100 companies, including such notables as eBay, Netflix and HealthCare Anytime.

Cynthia Murrell, February 20, 2015


Chilling Effects Censors Its Own Database

February 13, 2015

In the struggle between privacy and transparency, score one for the privacy advocates. Or, at least, for those looking to protect intellectual property. TorrentFreak tells us that “Chilling Effects DMCA Archive Censors Itself.” Chilling Effects is a site and database set up in response to takedown requests; its homepage describes its goal:

“The Chilling Effects database collects and analyzes legal complaints and requests for removal of online materials, helping Internet users to know their rights and understand the law. These data enable us to study the prevalence of legal threats and let Internet users see the source of content removals.”

Now, though, the site has decided to conceal the non-linked URLs that could be used to find material that has been removed due to copyright infringement complaints. The TorrentFreak (TF) article explains:

“The Chilling Effects DMCA clearing house is one of the few tools that helps to keep copyright holders accountable. Founded by Harvard’s Berkman Center, it offers an invaluable database for researchers and the public in general. At TF we use the website on a weekly basis to spot inaccurate takedown notices and other wrongdoings. Since the native search engine doesn’t always return the best results, we mostly use Google to spot newsworthy notices on the site. This week, however, we were no longer able to do so. The Chilling Effects team decided to remove its entire domain from all search engines, including its homepage and other informational and educational resources.”

Yes, information is tough to find if it is not indexed. For their part, the folks at Chilling Effects feel this step is necessary, at least for the time being; they “continue to think things through” as they walk the line between legally protected privacy and freedom of information.

Cynthia Murrell, February 13, 2015


Four Visualization Tools to Choose From

February 12, 2015

MakeUseOf offers us a list of graphic-making options in its “4 Data Visualization Tools for Captivating Data Journalism.” Writer Brad Jones describes four options, ranging from the quick and easy to more complex solutions. The first entry, Tableau Public, may be the best place for new users to start. The write-up tells us:

“Data visualization can be a very complex process, and as such the programs and tools used to achieve good results can be similarly complex. Tableau Public, at first glance, is not — it’s a very accommodating, intuitive piece of software to start using. Simply import your data as a text file, an Excel spreadsheet or an Access database, and you’re up and running.

“You can create a chart simply by dragging and dropping various dimensions and measures into your workspace. Figuring out exactly how to produce the sort of visualizations you’re looking for might take some experimentation, but there’s no great challenge in creating simple charts and graphs.

“That said, if you’re looking to go further, Tableau Public can cater to you. It’ll take some time on your part to really understand the breadth of what’s on offer, but it’s a matter of learning a skill rather than the program itself being difficult to use.”

The next entry is Google Fusion Tables, which helpfully links to other Google services and automates much of its process. The strengths of Infoactive are its ability to combine datasets and a wealth of options to create cohesive longer content. Rounding out the list is R, which Jones warns is “obtuse and far from user friendly”; he says it even requires a working knowledge of JavaScript and R’s own language to make the most of its capabilities. However, he says there is simply nothing better for producing exactly what one needs.

Cynthia Murrell, February 12, 2015


Linguistic Analysis and Data Extraction with IBM Watson Content Analytics

January 30, 2015

The article on IBM titled Discover and Use Real-World Terminology with IBM Watson Content Analytics provides an overview of domain-specific terminology through the linguistic facets of Watson Content Analytics. The article begins with a brief reminder that most data, whether in the form of images or texts, is unstructured. IBM’s linguistic analysis focuses on extracting relevant unstructured data from texts in order to make it more useful and usable in analysis. The article details the processes of IBM Watson Content Analytics:

“WCA processes raw text from the content sources through a pipeline of operations that is conformant with the UIMA standard. UIMA (Unstructured Information Management Architecture) is a software architecture that is aimed at the development and deployment of resources for the analysis of unstructured information. WCA pipelines include stages such as detection of source language, lexical analysis, entity extraction… Custom concept extraction is performed by annotators, which identify pieces of information that are expressed as segments of text.”

The main uses of WCA are exploring insights through facets as well as extracting concepts in order to apply WCA analytics. The latter might include excavating lab analysis reports to populate patient records, for example. If any of these functionalities sound familiar, it might not surprise you that IBM bought iPhrase, and much of this article is reminiscent of iPhrase functionality from about 15 years ago.

Chelsea Kerwin, January 30, 2015


Guide to Getting the Most Out of Your Unstructured Data

January 23, 2015

The article on Datamation titled Big Data: 9 Steps to Extract Insight from Unstructured Data explores the process of analyzing all of the data organizations collect from phone calls, emails and social media. The article stipulates that this data does contain insights into patterns and connections important to the company. The suggested starting point is deciding what data needs to be analyzed, based on relevance. At this point, the reason for the analysis and what will be done with the information should be clear. After the technology stack has been planned, the information should be kept in a data lake. The article explains:

“Traditionally, an organization obtained or generated information, sanitized it and stored it away… Anything useful that was discarded in the initial data load was lost as a result… However, with the advent of Big Data, it has come into common practice to do the opposite. With a data lake, information is stored in its native format until it is actually deemed useful and needed for a specific purpose, preserving metadata or anything else that might assist in the analysis.”

The article continues with steps 5-9, which include preparing the data for storage, saving useful information, ontology evaluation, statistical modeling and finally, gaining insights from the analysis. While the breakdown of the process is interesting, the number of steps might seem overwhelming for companies that are in a hurry or lack technical depth.

Chelsea Kerwin, January 23, 2015


It May Be Too Soon to Dismiss Tape Storage

January 20, 2015

Is the cloud giving new life to tape? The Register reports on “The Year When Google Made TAPE Cool Again….” With both Google and Amazon archiving data onto tape, the old medium suddenly seems relevant again. Our question—does today’s tape guarantee 100% restores? We want to see test results. One cannot find information that is no longer there, after all.

The article hastens to point out that the tape-manufacturing sector is not out of the irrelevance woods yet. Reporter Chris Mellor writes about industry attempts to survive:

“Overall the tape backup market is still in decline, with active vendors pursuing defensive strategies. Struggling tape system vendors Overland Storage and Tandberg Data, both pushing forwards in disk-based product sales, are merging in an attempt to gain critical and stable business mass for profitable revenues.

“Quantum still has a large tape business and is managing its decline and hoping a profitable business will eventually emerge. SpectraLogic has emerged as an archive tape champion and one of the tape technology area’s leaders, certainly the most visionary with the Black Pearl technology.

“Oracle launched its higher-capacity T10000D format and IBM is pushing tape drive and media capacity forwards, heading past a 100TB capacity tape.”

The write-up concludes with ambivalence about the future of tape. Mellor does not see the medium disappearing any time soon, but is less confident about its long-term relevance. After all, who knows what storage-medium breakthrough is around the corner?

Cynthia Murrell, January 20, 2015


Divining Unemployment Patterns from Social Media Data

January 14, 2015

It is now possible to map regional unemployment estimates based solely on social-media data. That’s the assertion of a little write-up posted by Cornell University Library titled “Social Media Fingerprints of Unemployment.” Researchers Alejandro Llorente, Manuel Garcia-Herranz, Manuel Cebrian, and Esteban Moro reveal:

“Recent wide-spread adoption of electronic and pervasive technologies has enabled the study of human behavior at an unprecedented level, uncovering universal patterns underlying human activity, mobility, and inter-personal communication. In the present work, we investigate whether deviations from these universal patterns may reveal information about the socio-economical status of geographical regions. We quantify the extent to which deviations in diurnal rhythm, mobility patterns, and communication styles across regions relate to their unemployment incidence. For this we examine a country-scale publicly articulated social media dataset, where we quantify individual behavioral features from over 145 million geo-located messages distributed among more than 340 different Spanish economic regions, inferred by computing communities of cohesive mobility fluxes. We find that regions exhibiting more diverse mobility fluxes, earlier diurnal rhythms, and more correct grammatical styles display lower unemployment rates.”

The team used these patterns to create a model they say paints an accurate picture of regional unemployment incidence. They assure us that these results can be created at low cost using publicly available data from social media sources. The team’s paper on the subject is available as a PDF.

Cynthia Murrell, January 14, 2015


The Continuing Issue of Data Integration for Financial Services Organizations

January 12, 2015

The article on Kapow Software titled Easy Integration of External Data? Don’t Bank On It shows that data integration and fusion still create issues. The article claims that any manual process for integrating external data cannot really be called timely. Financial services organizations need information from external sources like social media, and this often means the manual integration of structured and unstructured data. A survey brought to light some of the issues with data handling. The article explains:

“Integrating internal systems with external data sources can be challenging to say the least, especially when organizations are constantly adding new external sources of information to their operations, and these external websites and web portals either don’t provide APIs or the development efforts are too time consuming and costly… manual processes no longer fit into any financial organization business process. It’s clear these time consuming development projects used to integrate external data sources into an enterprise infrastructure are not a long-term viable strategy.”

Perhaps the top complaint companies have about data is the cost of the time spent manually importing and then validating it. Forty-three percent of companies surveyed said that they “struggle” with the integration between internal systems and external data sources. The article finishes with the suggestion that a platform for data integration that is both user-friendly and customizable is a necessity.

Chelsea Kerwin, January 12, 2015


Security, Data Analytics Make List of Predicted Trends in 2015

January 9, 2015

The article on ZyLab titled Looking Ahead to 2015 sums up the latest areas of focus at the end of one year and the beginning of the next. Obviously security is at the top of the list. According to the article, incidents of breaches in security grew 43% in 2014. We assume Sony would be the first to agree that security is of the utmost importance to most companies. The article goes on to predict that audio data will become increasingly important as evidence:

“Audio evidence brings many challenges. For example, the review of audio evidence can be more labor intensive than other types of electronically stored information because of the need to listen not only to the words but also take into consideration tone, expression and other subtle nuances of speech and intonation…As a result, the cost of reviewing audio evidence can quickly become prohibitive and with only a proportional of the data relevant in most cases.”

The article also briefly discusses various data sources, data analytics and information governance in its predictions for 2015. It makes a point of focusing on the growth of data and of the types of data sources, which will hopefully coincide with an improved ability to discover the sort of insights that companies desire.

Chelsea Kerwin, January 9, 2015

