
Linguistic Analysis and Data Extraction with IBM Watson Content Analytics

January 30, 2015

The article on IBM titled Discover and Use Real-World Terminology with IBM Watson Content Analytics provides an overview of extracting domain-specific terminology through the linguistic facets of Watson Content Analytics. The article begins with a brief reminder that most data, whether in the form of images or text, is unstructured. IBM’s linguistic analysis focuses on extracting relevant information from unstructured text in order to make it more useful and usable in analysis. The article details the processing pipeline of IBM Watson Content Analytics,

“WCA processes raw text from the content sources through a pipeline of operations that is conformant with the UIMA standard. UIMA (Unstructured Information Management Architecture) is a software architecture that is aimed at the development and deployment of resources for the analysis of unstructured information. WCA pipelines include stages such as detection of source language, lexical analysis, entity extraction… Custom concept extraction is performed by annotators, which identify pieces of information that are expressed as segments of text.”
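The quoted pipeline (language detection, lexical analysis, then annotators that mark up segments of text) can be sketched in a few lines. Below is a minimal, illustrative Python version; it is not WCA or UIMA code, and the stage names, regexes, and sample sentence are assumptions chosen only to show the pattern of a document passing through successive annotation stages.

```python
# Minimal sketch of an annotator-style pipeline in the spirit of the UIMA quote:
# each stage reads the raw text and adds typed annotations (segments of text).
# Illustrative only; this is not IBM WCA or Apache UIMA code, and the stages
# and regexes are assumptions for the example.
import re
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)  # (type, start, end, covered_text)

def detect_language(doc):
    # Placeholder stage; a real pipeline would call a language-identification model.
    doc.annotations.append(("language", 0, len(doc.text), "en"))

def lexical_analysis(doc):
    for m in re.finditer(r"\w+", doc.text):
        doc.annotations.append(("token", m.start(), m.end(), m.group()))

def entity_extraction(doc):
    # Toy annotator: flags capitalized multi-word segments as candidate entities.
    for m in re.finditer(r"(?:[A-Z][a-z]+\s)+[A-Z][a-z]+", doc.text):
        doc.annotations.append(("entity", m.start(), m.end(), m.group()))

def run_pipeline(text, stages):
    doc = Document(text)
    for stage in stages:
        stage(doc)
    return doc

doc = run_pipeline("Watson Content Analytics processes lab reports at Mercy Hospital.",
                   [detect_language, lexical_analysis, entity_extraction])
print([a for a in doc.annotations if a[0] == "entity"])
```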

The main uses of WCA are exploring insights through facets and extracting concepts to feed WCA analytics applications. The latter might include mining lab analysis reports to populate patient records, for example. If any of these functionalities sound familiar, it might not surprise you that IBM bought iPhrase; much of this article is reminiscent of iPhrase functionality from about 15 years ago.

Chelsea Kerwin, January 30, 2015

Sponsored by ArnoldIT.com, developer of Augmentext

Guide to Getting the Most Out of Your Unstructured Data

January 23, 2015

The article on Datamation titled Big Data: 9 Steps to Extract Insight from Unstructured Data explores the process of analyzing all of the data organizations collect from phone calls, emails, and social media. The article stipulates that this data does contain insights into patterns and connections important to the company. The suggested starting point is deciding what data needs to be analyzed, based on relevance. At this point, the reason for the analysis and what will be done with the information should be clear. After planning the technology stack, the information should be stored in a data lake. The article explains,

“Traditionally, an organization obtained or generated information, sanitized it and stored it away… Anything useful that was discarded in the initial data load was lost as a result… However, with the advent of Big Data, it has come into common practice to do the opposite. With a data lake, information is stored in its native format until it is actually deemed useful and needed for a specific purpose, preserving metadata or anything else that might assist in the analysis.”
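The contrast the quote draws (store first, interpret later) can be sketched very simply. The directory layout, metadata fields, and helper names below are hypothetical and only illustrate the idea: ingestion keeps the native file plus its metadata, and interpretation is deferred until a specific purpose appears.

```python
# Minimal sketch of the data lake idea from the quote: keep raw content in its
# native format alongside metadata, and only interpret it when a use case appears.
# The layout, metadata fields, and function names are assumptions for illustration.
import json, pathlib, shutil, datetime

LAKE = pathlib.Path("data_lake")

def ingest(raw_file: pathlib.Path, source: str):
    """Copy a file into the lake untouched and record metadata alongside it."""
    dest = LAKE / source / raw_file.name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(raw_file, dest)
    meta = {
        "original_name": raw_file.name,
        "source": source,
        "ingested_at": datetime.datetime.utcnow().isoformat(),
        "size_bytes": dest.stat().st_size,
    }
    dest.with_name(dest.name + ".meta.json").write_text(json.dumps(meta))

def find(source=None):
    """Later, when a purpose is known, locate candidate raw files by their metadata."""
    for meta_path in LAKE.rglob("*.meta.json"):
        meta = json.loads(meta_path.read_text())
        if source is None or meta["source"] == source:
            yield meta
```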

The article continues with steps 5-9, which include preparing the data for storage, saving useful information, ontology evaluation, statistical modeling and, finally, gaining insights from the analysis. While the article offers an interesting breakdown of the process, the number of steps might seem overwhelming for companies that are in a hurry or not technically robust.

Chelsea Kerwin, January 23, 2015

Sponsored by ArnoldIT.com, developer of Augmentext

It May Be Too Soon to Dismiss Tape Storage

January 20, 2015

Is the cloud giving new life to tape? The Register reports on “The Year When Google Made TAPE Cool Again….” With both Google and Amazon archiving data onto tape, the old medium suddenly seems relevant again. Our question—does today’s tape guarantee 100% restores? We want to see test results. One cannot find information that is no longer there, after all.

The article hastens to point out that the tape-manufacturing sector is not out of the irrelevance woods yet. Reporter Chris Mellor writes about industry attempts to survive:

“Overall the tape backup market is still in decline, with active vendors pursuing defensive strategies. Struggling tape system vendors Overland Storage and Tandberg Data, both pushing forwards in disk-based product sales, are merging in an attempt to gain critical and stable business mass for profitable revenues.

“Quantum still has a large tape business and is managing its decline and hoping a profitable business will eventually emerge. SpectraLogic has emerged as an archive tape champion and one of the tape technology area’s leaders, certainly the most visionary with the Black Pearl technology.

“Oracle launched its higher-capacity T10000D format and IBM is pushing tape drive and media capacity forwards, heading past a 100TB capacity tape.”

The write-up concludes with ambivalence about the future of tape. Mellor does not see the medium disappearing any time soon, but is less confident about its long-term relevance. After all, who knows what storage-medium breakthrough is around the corner?

Cynthia Murrell, January 20, 2015

Sponsored by ArnoldIT.com, developer of Augmentext

Divining Unemployment Patterns from Social Media Data

January 14, 2015

It is now possible to map regional unemployment estimates based solely on social-media data. That’s the assertion of a little write-up posted by Cornell University Library titled, “Social Media Fingerprints of Unemployment.” Researchers Alejandro Llorente, Manuel Garcia-Herranz, Manuel Cebrian, and Esteban Moro reveal:

“Recent wide-spread adoption of electronic and pervasive technologies has enabled the study of human behavior at an unprecedented level, uncovering universal patterns underlying human activity, mobility, and inter-personal communication. In the present work, we investigate whether deviations from these universal patterns may reveal information about the socio-economical status of geographical regions. We quantify the extent to which deviations in diurnal rhythm, mobility patterns, and communication styles across regions relate to their unemployment incidence. For this we examine a country-scale publicly articulated social media dataset, where we quantify individual behavioral features from over 145 million geo-located messages distributed among more than 340 different Spanish economic regions, inferred by computing communities of cohesive mobility fluxes. We find that regions exhibiting more diverse mobility fluxes, earlier diurnal rhythms, and more correct grammatical styles display lower unemployment rates.”

The team used these patterns to create a model they say paints an accurate picture of regional unemployment incidence. They assure us that these results can be produced at low cost using publicly available data from social media sources. Click here (PDF) to view the team’s paper on the subject.
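As a rough illustration of what such a model might look like, here is a minimal sketch: per-region behavioral features regressed against unemployment rates. The synthetic numbers and the plain least-squares fit are stand-ins, not the authors’ data or method; only the feature names and the direction of the effects come from the abstract quoted above.

```python
# Hedged sketch of the modeling idea: aggregate per-region behavioral features
# from geo-located messages and relate them to unemployment. The data below is
# synthetic and the linear model is a stand-in for the authors' actual pipeline.
import numpy as np

rng = np.random.default_rng(0)
n_regions = 340  # roughly the number of Spanish economic regions in the study

# Hypothetical per-region features (in the paper these come from the message corpus).
mobility_diversity = rng.uniform(0, 1, n_regions)    # diversity of mobility fluxes
morning_activity   = rng.uniform(0, 1, n_regions)    # share of early-morning messages
misspelling_rate   = rng.uniform(0, 0.2, n_regions)  # proxy for grammatical style

# Synthetic target consistent with the reported direction of the effects:
# more diverse mobility, earlier rhythms, fewer misspellings -> lower unemployment.
unemployment = (0.30 - 0.10 * mobility_diversity - 0.08 * morning_activity
                + 0.50 * misspelling_rate + rng.normal(0, 0.02, n_regions))

X = np.column_stack([mobility_diversity, morning_activity, misspelling_rate,
                     np.ones(n_regions)])
coef, *_ = np.linalg.lstsq(X, unemployment, rcond=None)
print(dict(zip(["mobility", "morning", "misspelling", "intercept"], coef.round(3))))
```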

Cynthia Murrell, January 14, 2015

Sponsored by ArnoldIT.com, developer of Augmentext

The Continuing Issue of Data Integration for Financial Services Organizations

January 12, 2015

The article on Kapow Software titled Easy Integration of External Data? Don’t Bank On It shows that data integration and fusion still create issues. The article claims that any manual process for integrating external data cannot really be called timely. Financial services organizations need information from external sources like social media, and this often means the manual integration of structured and unstructured data. A survey through Computerworld.com brought to light some of the issues with data handling. The article explains,

“Integrating internal systems with external data sources can be challenging to say the least, especially when organizations are constantly adding new external sources of information to their operations, and these external websites and web portals either don’t provide APIs or the development efforts are too time consuming and costly… manual processes no longer fit into any financial organization business process. It’s clear these time consuming development projects used to integrate external data sources into an enterprise infrastructure are not a long-term viable strategy.”
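To make the cost concrete, here is a hedged sketch of the kind of import-and-validate step that gets rebuilt by hand for every new source. The CSV export, field names, and validation rules are assumptions, not anything from the survey; the point is only that each external source without an API needs its own version of this glue.

```python
# Sketch of a manual external-data integration step: records arrive without an
# API (assumed here to be a CSV export), and each row must be validated before
# it is merged with an internal system. Field names and rules are hypothetical.
import csv

REQUIRED = ("account_id", "counterparty", "amount")

def validate(row):
    errors = [f"missing {f}" for f in REQUIRED if not row.get(f)]
    try:
        float(row.get("amount", ""))
    except ValueError:
        errors.append("amount is not numeric")
    return errors

def merge_external(path, internal_index):
    """internal_index: dict mapping account_id -> internal record."""
    accepted, rejected = [], []
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            errs = validate(row)
            if errs or row["account_id"] not in internal_index:
                rejected.append((row, errs or ["unknown account_id"]))
            else:
                internal_index[row["account_id"]].setdefault("external", []).append(row)
                accepted.append(row)
    return accepted, rejected
```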

Perhaps the top complaint companies have about data is the costliness of the time spent manually importing and then validating it. Forty-three percent of companies surveyed said that they “struggle” with integration between internal systems and external data sources. The article finishes with the suggestion that a user-friendly, customizable platform for data integration is a necessity.

Chelsea Kerwin, January 12, 2015

Sponsored by ArnoldIT.com, developer of Augmentext

Security, Data Analytics Make List of Predicted Trends in 2015

January 9, 2015

The article on ZyLab titled Looking Ahead to 2015 sums up the latest areas of focus at the end of one year and the beginning of the next. Obviously, security is at the top of the list. According to the article, security breaches grew 43% in 2014. We assume Sony would be the first to agree that security is of the utmost importance to most companies. The article goes on to predict that audio data will be increasingly important as evidence,

“Audio evidence brings many challenges. For example, the review of audio evidence can be more labor intensive than other types of electronically stored information because of the need to listen not only to the words but also take into consideration tone, expression and other subtle nuances of speech and intonation…As a result, the cost of reviewing audio evidence can quickly become prohibitive and with only a proportional of the data relevant in most cases.”
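One way teams try to keep those costs down, assuming transcripts are available from an earlier speech-to-text pass, is to triage recordings by keyword hits so reviewers listen to the likeliest material first. The sketch below is a hypothetical illustration of that triage step, not anything from the ZyLab article; as the quote notes, tone and intonation still require a human ear.

```python
# Hedged sketch of triaging audio for review: rank recordings by keyword hits in
# their transcripts (transcripts are assumed to exist from a prior speech-to-text
# step). Recording names and keywords are made up for illustration.
from collections import Counter

def rank_recordings(transcripts, keywords):
    """transcripts: dict mapping recording id -> transcript text."""
    scores = Counter()
    for rec_id, text in transcripts.items():
        lowered = text.lower()
        scores[rec_id] = sum(lowered.count(k.lower()) for k in keywords)
    return scores.most_common()

ranked = rank_recordings(
    {"call_001": "we should move the payment before the audit",
     "call_002": "are we still on for lunch on thursday?"},
    keywords=["payment", "audit"],
)
print(ranked)  # review call_001 first
```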

The article also briefly discusses various data sources, data analytics, and information governance in its predictions of the trends for 2015. The article makes a point of focusing on the growth of data and types of data sources, which will hopefully coincide with an improved ability to discover the sort of insights that companies desire.

Chelsea Kerwin, January 09, 2015

Sponsored by ArnoldIT.com, developer of Augmentext

Inside the Creative Commons Dataset from Yahoo and Flickr

January 5, 2015

These are not our grandparents’ photo albums. With today’s technology, photos and videos are created and shared at a truly astounding pace. Much of that circulation occurs on Flickr, which teamed up with Yahoo to create a cache of nearly 100 million photos and almost 800,000 videos with Creative Commons licenses for us all to share. Code.flickr.com gives us the details in “The Ins and Outs of the Yahoo Flickr Creative Commons 100 Million Dataset.” Researchers Bart Thomée and David A. Shamma report:

“To understand more about the visual content of the photos in the dataset, the Flickr Vision team used a deep-learning approach to find the presence of visual concepts, such as people, animals, objects, events, architecture, and scenery across a large sample of the corpus. There’s a diverse collection of visual concepts present in the photos and videos, ranging from indoor to outdoor images, faces to food, nature to automobiles.”

The article goes on to explore the frequency of certain tags, both user-annotated and machine-generated. The machine tags include factors like time, location, and camera used, suggesting rich material for data analysts to play with. The researchers conclude with praise for their team’s project:

“The collection is one of the largest released for academic use, and it’s incredibly varied—not just in terms of the content shown in the photos and videos, but also the locations where they were taken, the photographers who took them, the tags that were applied, the cameras that were used, etc. The best thing about the dataset is that it is completely free to download by anyone, given that all photos and videos have a Creative Commons license. Whether you are a researcher, a developer, a hobbyist or just plain curious about online photography, the dataset is the best way to study and explore a wide sample of Flickr photos and videos.”
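Returning to the tag exploration mentioned above, here is a rough sketch of how one might tally user-annotated versus machine-generated tags from the dataset’s metadata. The tab-separated layout and the column positions are assumptions for illustration, not the dataset’s actual schema.

```python
# Sketch of a tag-frequency pass over the dataset's metadata, assumed to be
# tab-separated lines with comma-separated user tags and machine tags in known
# columns. The column indices are hypothetical, not the real schema.
import csv
from collections import Counter

USER_TAG_COL, MACHINE_TAG_COL = 8, 9  # assumed positions

def count_tags(tsv_path, limit=100_000):
    user_tags, machine_tags = Counter(), Counter()
    with open(tsv_path, newline="", encoding="utf-8") as fh:
        for i, row in enumerate(csv.reader(fh, delimiter="\t")):
            if i >= limit:
                break
            user_tags.update(t for t in row[USER_TAG_COL].split(",") if t)
            machine_tags.update(t for t in row[MACHINE_TAG_COL].split(",") if t)
    return user_tags.most_common(20), machine_tags.most_common(20)
```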

See the article for more details on those tags found within the massive dataset. To download the whole assemblage from Yahoo Labs, click here.

Cynthia Murrell, January 05, 2015

Sponsored by ArnoldIT.com, developer of Augmentext

A SASsy Hadoop Data Connection

January 2, 2015

It has been a while since we posted an article that highlights Hadoop’s capabilities and benefits. The SAS Data Management blog talks about how data sources are increasing and how Hadoop can help companies organize and use their data in “The Snap, Crackle, And Pop Of Data Management On Hadoop.”

SAS is a leading provider of data management solutions, including an entire line based on the open source Hadoop software. They offer several ways to control data, including the FROM, WITH, and IN options. While the names are simple, they sum up the processes in one word.

The SAS FROM option allows users to connect to the Hadoop cluster. It connects to Hadoop using a SAS/ACCESS engine, which collects metadata built in Hadoop and makes it available in the data flows. This allows the software to make performance decisions without user intervention.

SAS WITH is more complicated because of its give-and-take function:

“The SAS WITH story provides transformation capabilities not yet available in Hadoop. UPDATE and DELETE are standard SQL transformations used in a variety of data processing programs. Hive does not yet support these functions, but you can utilize PROC IMSTAT (part of the WITH story) to lift a table or partition into memory and perform these functions in parallel. The table or partition could then be reincorporated into the Hive table, alleviating the need to truncate and reload from an RDBMS data source.”
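The pattern the quote describes (lift a table or partition into memory, apply the UPDATE and DELETE that Hive lacks, then reincorporate the result) is easy to picture. The sketch below uses pandas rather than SAS PROC IMSTAT purely to illustrate the idea; the table, column names, and file paths are assumptions.

```python
# Same lift-modify-reincorporate pattern as the quote, sketched with pandas
# instead of SAS PROC IMSTAT. Table contents, columns, and paths are hypothetical.
import pandas as pd

# "Lift" the partition into memory (in practice it would be read from Hive/HDFS).
df = pd.read_parquet("orders_2014_partition.parquet")

# UPDATE: correct a status value in place.
df.loc[df["status"] == "PENDNG", "status"] = "PENDING"

# DELETE: drop rows flagged as test records.
df = df[~df["is_test_record"]]

# Reincorporate the cleaned partition, avoiding a truncate-and-reload from the
# original RDBMS source.
df.to_parquet("orders_2014_partition_clean.parquet", index=False)
```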

SAS IN has the most advanced coding capabilities for data management. It allows users to run a program in which eight functions execute in parallel against Hadoop data tables. They can also use the DS2 language to perform difficult transformations of a table in parallel.

SAS’s three new Hadoop interactions allow for better streamlining of data from multiple sources and provide more insight into industry applications.

Whitney Grace, January 02, 2015
Sponsored by ArnoldIT.com, developer of Augmentext

Mastering Data Quality Requires Change

December 31, 2014

Big data means big changes for data management and for ensuring its quality. Computer users, especially those ingrained in their ways, have never been keen on changing their habits. Add trainings and meetings, and you have a general idea of what it takes to instill data acceptance. Dylan Jones at SAS’s Data Roundtable wrote an editorial, “Data Quality Mastery Depends On Change Management Essentials.”

Jones writes that data management is still viewed as strictly an IT domain, and data quality suffers for it. It takes change management to make other departments understand the necessity of the changes.

Change management involves:

• “Ownership and leadership from the top

• Alignment with the overall strategy of the organization

• A clear vision for data quality

• Constant dialogue and consultation”

Jones notes that leaders are difficult to work with when it comes to change implementation, because they do not see what the barriers are. This translates to a company’s failure to adapt and learn. He recommends having an outside consultant, with an objective perspective, help when trying to make big changes.

Jones makes good suggestions, but he offers little advice on how to feasibly accomplish these tasks. What he also needs to consider is that data quality is constantly changing as new advances are made. Is he aware that some users cannot keep up with the daily changes?

Whitney Grace, December 31, 2014
Sponsored by ArnoldIT.com, developer of Augmentext

Data Analysis by Algorithm

December 22, 2014

The folks at Google may have the answer for the dearth of skilled data analysts out there. Unfortunately for our continuing job crisis, that answer does not lie in (human) training programs. Google Research Blog discusses “Automatically Making Sense of Data.” Writers Kevin Murphy and David Harper ask:

“What if one could automatically discover human-interpretable trends in data in an unsupervised way, and then summarize these trends in textual and/or visual form? To help make progress in this area, Professor Zoubin Ghahramani and his group at the University of Cambridge received a Google Focused Research Award in support of The Automatic Statistician project, which aims to build an ‘artificial intelligence for data science’.”

Trends in time-series data have thus far provided much fodder for the team’s research. The article details an example involving solar-irradiance levels over time, and discusses modeling the data with Gaussian process models, whose behavior is governed by a kernel function. Murphy and Harper report on the Cambridge team’s progress:

“Prof Ghahramani’s group has developed an algorithm that can automatically discover a good kernel, by searching through an open-ended space of sums and products of kernels as well as other compositional operations. After model selection and fitting, the Automatic Statistician translates each kernel into a text description describing the main trends in the data in an easy-to-understand form.”
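For a sense of what searching over sums and products of kernels means in practice, here is a minimal sketch using scikit-learn’s Gaussian process tools rather than the Automatic Statistician itself. The synthetic series, the tiny candidate set, and comparison by log marginal likelihood are stand-ins for the project’s open-ended search and its natural-language summaries.

```python
# Hedged sketch of kernel composition and selection with scikit-learn (not the
# Automatic Statistician): build candidate kernels from sums and products of
# simple parts, fit each to a synthetic 1-D series, and compare the fits.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, WhiteKernel

rng = np.random.default_rng(1)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 0.5 * X.ravel() + np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.2, 200)

candidates = {
    "smooth trend":     RBF() + WhiteKernel(),
    "periodic":         ExpSineSquared() + WhiteKernel(),
    "trend + periodic": RBF() + ExpSineSquared() + WhiteKernel(),
    "trend x periodic": RBF() * ExpSineSquared() + WhiteKernel(),
}

for name, kernel in candidates.items():
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
    print(f"{name:18s} log marginal likelihood = "
          f"{gp.log_marginal_likelihood_value_:.1f}")
```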

Naturally, the team is going on to work with other kinds of data. We wonder—have they tried it on Google Glass market projections?

There’s a simplified version available for demo at the project’s website, and an expanded version should be available early next year. See the write-up for the technical details.

Cynthia Murrell, December 22, 2014

Sponsored by ArnoldIT.com, developer of Augmentext
