December 22, 2014
The folks at Google may have the answer for the dearth of skilled data analysts out there. Unfortunately for our continuing job crisis, that answer does not lie in (human) training programs. Google Research Blog discusses “Automatically Making Sense of Data.” Writers Kevin Murphy and David Harper ask:
“What if one could automatically discover human-interpretable trends in data in an unsupervised way, and then summarize these trends in textual and/or visual form? To help make progress in this area, Professor Zoubin Ghahramani and his group at the University of Cambridge received a Google Focused Research Award in support of The Automatic Statistician project, which aims to build an ‘artificial intelligence for data science’.”
Trends in time-series data have thus far provided much fodder for the team’s research. The article details an example involving solar-irradiance levels over time, and discusses modeling the data using Gaussian process models. Murphy and Harper report on the Cambridge team’s progress:
“Prof Ghahramani’s group has developed an algorithm that can automatically discover a good kernel, by searching through an open-ended space of sums and products of kernels as well as other compositional operations. After model selection and fitting, the Automatic Statistician translates each kernel into a text description describing the main trends in the data in an easy-to-understand form.”
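The compositional search described above can be illustrated with a toy sketch. To be clear, this is not the Automatic Statistician’s code; the base kernels, their descriptions, and the expansion rules below are invented to show the flavor of searching an open-ended space of kernel sums and products and translating the result into text:

```python
BASE_KERNELS = {
    "SE": "a smooth trend",         # squared-exponential kernel
    "PER": "a periodic component",  # periodic kernel
    "LIN": "a linear trend",        # linear kernel
}

def expand(expr):
    """Generate the next level of candidate kernel expressions."""
    candidates = []
    for base in BASE_KERNELS:
        candidates.append(("+", expr, base))  # add a new component
        candidates.append(("*", expr, base))  # modulate the current expression
    return candidates

def describe(expr):
    """Translate a kernel expression into a rough English summary."""
    if isinstance(expr, str):
        return BASE_KERNELS[expr]
    op, left, right = expr
    joiner = " plus " if op == "+" else " modulated by "
    return describe(left) + joiner + describe(right)

# One step of the search, starting from a squared-exponential kernel:
for candidate in expand("SE"):
    print(describe(candidate))
```

In the real system, each candidate would be fitted to the data and scored (for example by marginal likelihood) before the search expands the best expression further; here only the grammar and the text translation are sketched.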
Naturally, the team is going on to work with other kinds of data. We wonder—have they tried it on Google Glass market projections?
There’s a simplified version available for demo at the project’s website, and an expanded version should be available early next year. See the write-up for the technical details.
Cynthia Murrell, December 22, 2014
December 15, 2014
Did you know that there was hidden data in big data? Okay, that makes a little sense given that big data software is designed to find the hidden trends and patterns, but RC Wireless’ “Discovering Big Data Unknowns” article points out that there is even more data left unexplored. Why? Because people are only searching in the known areas. What about the unknown areas?
The article focuses on Katherine Matsumoto of Attensity and how she uses natural language processing to “social listen” in these grey areas. Attensity is a company that specializes in natural language processing analytics that make sense of unstructured data—big data white noise. Attensity views the Internet as the world’s largest consumer focus group, and it helps clients understand consumer habits. The new Attensity Q platform enables users to identify these patterns in real time and detect big data unknowns.
“The company’s platform combines sentiment and trend analysis with geospatial information and information on trend influencers, and said its approach of analyzing the conversations around emerging trends enables it to act as an “early warning” system for market shifts.”
The biggest problems Attensity faces are filtering out spam and understanding the data’s context. Establishing that context is the main way companies can harness social data.
Sifting the useful information out of the white noise is a hard job. Could the same technology be applied to online ads, to filter the scams from the legitimate ones?
December 2, 2014
The article titled Distell Supports Business Growth Through Improved Information Management, posted at OpenText, tells the story of booming South African beverage producer Distell. Since forming in 2000, the company has grown quickly, and that rapid growth left unstructured data stored in unconnected silos. Needless to say, this was detrimental to the company’s efficiency. The article explains,
“Today, there are over 13 million information assets in the Distell Enterprise Content Management (ECM) platform or repository; with tens of thousands of items being added weekly. Helping make sense of this wealth of corporate intellectual property are OpenText ECM solutions, from archiving to document management and secure file sharing in the cloud. This collaborative, searchable, secure repository enables marketing, sales, operations, production and service functions in one continent to access information from peers across the globe.”
The article reads like an OpenText success story, with improved collaboration and efficiency throughout Distell. The company takes on around 30 new employees a month, and the ECM’s largest benefits are considered to be productivity and continuity of services. No word on how much this implementation cost, but you can almost hear an OpenText representative asking, “Can you put a price on empowering your employees?”
Chelsea Kerwin, December 02, 2014
December 1, 2014
“Kapow Enterprise 9.3 introduces new capabilities that give organizations greater flexibility, speed and reach in turning Big Data into business insights. These enhancements extend Kapow Enterprise as the leading data integration platform to access, integrate, deliver and explore data from the widest variety of internal and external sources.”
The new version boasts added flexibility and coverage when acquiring data across disparate sources. It also offers enhanced data distribution and exploration; of particular value to many will be the platform’s visual presentation of data through auto-generated graphs and tables, both of which update themselves as users add and remove filters. Kapow has also improved its Kapplets, the feature that lets users easily publish web apps that combine information into easily-digested interactive presentations. See the post for more information, or contact the company to request a demo.
Priding themselves on their products’ flexibility, integration-and-automation firm Kapow serves businesses of all sizes around the world. Headquartered in Palo Alto, California, Kapow was founded in 2005. The promising company was snapped up by process-applications outfit Kofax in 2013. Kofax is also based in Palo Alto, and was founded back in 1991.
Cynthia Murrell, December 01, 2014
November 26, 2014
Enterprise Apps Today has an article called “Attensity Boosts Ability To Discover ‘Unknown’ Trends In Data,” discussing how Attensity was updated with new features to detect themes in real-time social data, catch spam, and make it easier to compose and filter queries. Before Attensity’s updates, social analytics tools used mentions to measure interest in products. Mentions, however, are not the most reliable way to gauge whether a product is successful.
The new Attensity Q tracks themes, trends, anomalies, and events around a product in the context of online conversations. This makes it easier to incorporate new vocabularies and brand-specific terms into queries.
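Anomaly detection of this general sort can be sketched simply. The following is an illustrative toy, not Attensity’s method: it flags any mention count that jumps more than three standard deviations above the trailing window’s average:

```python
import statistics

def find_anomalies(counts, window=5, threshold=3.0):
    """Flag indices whose count exceeds the trailing window's mean
    by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(counts)):
        recent = counts[i - window:i]
        mean = statistics.mean(recent)
        stdev = statistics.pstdev(recent) or 1.0  # guard against flat data
        if counts[i] > mean + threshold * stdev:
            anomalies.append(i)
    return anomalies

mentions = [10, 12, 11, 9, 10, 11, 48, 10, 12, 11]
print(find_anomalies(mentions))  # the spike at index 6 stands out
```

A production system would of course work over richer signals than raw counts, but the shape of the problem (spotting what you did not know to look for) is the same.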
“Social analytics has largely been limited up to this point by forming hypotheses and testing them – the hunting and pecking for insights that traditional search requires you to do,” [Senior Project Manager and NLP Strategist Katherine] Matsumoto said. “But there is a growing need for our customers to be presented with findings that they didn’t know to look for. These findings may be within their search topic, adjacent to it or many degrees removed through nested relationships.”
Attensity Q has applications beyond retail. It can be used by legal departments to detect fraudulent activity and by HR departments to target areas for improvement. It could even be used with healthcare patient data to track unusual patterns and support better diagnoses.
Rather than bragging about big data’s possibilities, Attensity is describing some practical applications and their uses.
November 24, 2014
Data is messy and needs to be kept clean. Data on an enterprise scale is a nightmare for neat freaks, because without an organizational hierarchy it would take years to sift through. Wand Inc.’s corporate blog posted some exciting news, “Expert System And WAND Partner For A More Effective Management Of Enterprise Information.” WAND is known throughout big data as the leader in enterprise taxonomies, while Expert System is renowned for its semantic technology.
The goal of the partnership is to help enterprise systems make their data more findable, manage better client relationships, and decrease operational risks. While the partnership will affect enterprise systems overall, there are three main factors that will overhaul the enterprise content management process:
1. “Taxonomy selection: WAND offers the biggest library of out-of-the-box taxonomies available on the market today. By selecting one of the available sector specific taxonomies, customers can speed up significantly their implementation time without compromising their specific classification requirements.
2. Automatic Classification based on the selected taxonomy: once the customer chooses the taxonomy, Expert System makes a full set of tools available to define the semantic based categorization rules and the engine that enables the automatic categorization of all the enterprise content.
3. Native integration with the most common document and collaboration systems, including Microsoft SharePoint.”
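The automatic classification step can be imagined, at its simplest, as rules attached to taxonomy nodes. The sketch below is a toy stand-in, not Expert System’s semantic engine; the categories and keywords are invented for illustration:

```python
# Invented taxonomy nodes, each carrying simple keyword rules.
TAXONOMY_RULES = {
    "Legal/Contracts": ["contract", "indemnity", "clause"],
    "HR/Recruiting": ["candidate", "resume", "interview"],
    "Finance/Invoices": ["invoice", "payment", "remittance"],
}

def categorize(text):
    """Assign every taxonomy category whose keyword rules match the text."""
    lowered = text.lower()
    return sorted(category for category, keywords in TAXONOMY_RULES.items()
                  if any(keyword in lowered for keyword in keywords))

print(categorize("Please review the indemnity clause before payment"))
```

A real semantic engine disambiguates word senses and relationships rather than matching keywords, but the pipeline (pick a taxonomy, attach rules, auto-classify content) follows the three steps quoted above.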
WAND and Expert System’s combined forces will make enterprise data more findable. While the partnership is beneficial, it reads like most big data tie-ups; what makes it different are the names attached.
November 21, 2014
SemanticWeb.com posted an article called “Retrieving And Using Taxonomy Data From DBpedia” with an interesting introduction. It explains that DBpedia is a crowd-sourced community whose entire goal is to extract structured information from Wikipedia and share it. The introduction continues that DBpedia already holds over three billion facts, expressed in the W3C-standard RDF data model and ready for application use.
Much of this data is expressed using SKOS, the W3C-standard vocabulary employed by the New York Times, the Library of Congress, and other organizations for their own taxonomies and subject headings. Users can extract the data and implement it in their own RDF applications, giving their own data more value.
DBpedia is doing users a wonderful service: they do not have to rely on proprietary software to deliver rich taxonomies. The taxonomies can be retrieved under open community licenses and used to instantly improve content. There is one caveat:
“Remember that, for better or worse, the data is based on Wikipedia data. If you extend the structure of the query above to retrieve lower, more specific levels of horror film categories, you’d probably find the work of film scholars who’ve done serious research as well as the work of nutty people who are a little too into their favorite subgenres.”
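The category data the caveat refers to is retrieved with SPARQL. As a hedged sketch modeled on the article’s horror-film example (DBpedia’s public endpoint at dbpedia.org/sparql is assumed; the request is built but not sent, to keep the example self-contained):

```python
import urllib.parse

# A SPARQL query pulling subcategories of a Wikipedia category via
# skos:broader, modeled on the article's horror-film example.
query = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?subcategory WHERE {
  ?subcategory skos:broader <http://dbpedia.org/resource/Category:Horror_films> .
}
"""

# DBpedia's public endpoint accepts the query as a URL parameter.
endpoint = "http://dbpedia.org/sparql"
url = endpoint + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
print(url[:80] + "...")
```

Extending the query to walk further down skos:broader links is how one would reach the “lower, more specific levels” the author warns about.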
Remember Wikipedia is a good reference tool to gain an understanding of a topic, but you still need to check more verifiable resources for hard facts.
November 18, 2014
The article on CNN Money titled Varonis Announces Metadata Framework Version 6, Including New Functionality For Four Varonis Solutions explores the new features of Version 6. Varonis, a leading software provider, focuses on human-generated unstructured data, which might include anything from spreadsheets to emails to text messages. The company boasts over 3,000 customers in fields as varied as healthcare, media, and financial services. The Varonis Metadata Framework has been refined over the last decade. The article describes it this way,
“[It is] a single platform on a unifying code base, purpose-built to tackle the many challenges and use cases that arise from the massive volumes of unstructured data files created and stored by organizations of all sizes. Currently powering five distinct Varonis products, the Varonis Metadata Framework intelligently extracts and analyzes metadata from customers’ vast, distributed unstructured data stores, and enables a variety of use cases, including data governance, data security, archiving, file synchronization, enhanced mobile data accessibility, search, and business collaboration.”
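As a purely hypothetical illustration of the basic idea (nothing here is Varonis code), metadata extraction over a file store can start as simply as walking a directory tree and recording filesystem attributes for later analysis:

```python
import os
import time

def collect_metadata(root):
    """Walk a directory tree and record basic filesystem metadata per file."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            records.append({
                "path": path,
                "size_bytes": info.st_size,
                "modified": time.ctime(info.st_mtime),
            })
    return records
```

The framework described in the quote goes far beyond this, layering permissions, access events, and content analysis on top, but filesystem attributes like these are the raw material.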
Exciting new features in Version 6 include a search API for DatAnswers, “bi-directional permissions visibility” for DatAdvantage to reduce operational overhead, and reduced risk through DatAlert with the information of malware location and timing.
Chelsea Kerwin, November 18, 2014
November 12, 2014
The article titled The Five Rules for Data Discovery on Computerworld discusses Enterprise Data Discovery. In the pursuit of fast-paced, accurate data analytics, Enterprise Data Discovery is touted in this article as a ramped-up tool for accessing relevant information quickly. The first capability is “governed self-service discovery,” which enables users to reformulate their data searches on their own. This also allows for the blending of data types, including social media and unstructured data. The article also emphasizes the importance of having a dialogue with the data,
“You also discovered that the spike in sales occurred in the middle of the media campaign and during the time of the spike, there was a major sporting event. This new clue prompts a new question – what could a sporting event have to do with the spike? Again, the data reveals its value by providing a new answer – one of the advertisements from the campaign got additional play at the event. Now, you have something solid to work on.”
According to the article, Enterprise Data Discovery offers a view of the road less travelled, enabling users to approach their discovery with new questions. Of course, the question that arises while reading this article is, who has time for this? The emphasis on self-service is interesting, but it also suggests that users will be spending a good chunk of time manipulating the data on their own.
Chelsea Kerwin, November 12, 2014
November 10, 2014
Depending on one’s field, it may seem like every bit of information in existence is now just an Internet search away. However, as researchers well know, there is a wealth of potentially crucial information that is still difficult to access. In fact, GCN tells us that market research firm IDC estimates up to 90 percent of “big data” falls into this category. GCN also turns our attention to a potential solution in, “Brown Dog Digs Into the Deep, Dark Web.”
Brown Dog is a project out of the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign. In 2013, the team received a $10 million, five-year award from the National Science Foundation for the project. Already, they have developed two services that facilitate access to uncurated data collections. The write-up reports:
“The first, called Data Access Proxy (DAP), transforms unreadable files into readable ones by linking together a series of computing and translational operations behind the scenes. Similar to an Internet gateway, the configuration of the DAP would be entered into a user’s machine settings. Thereafter, data requests over HTTP would first be examined by the proxy to determine if the native file format is readable on the client device.
“The second tool, the Data Tilling Service (DTS), lets individuals search collections of data, using an existing file to discover similar files in the data. For example, while browsing an online image collection, a user could drop an image of three people into the search field, and the DTS would return images in the collection that also contain three people. If the DTS encounters a file format it is unable to parse, it would use the Data Access Proxy to make the file accessible. It also indexes the data and extracts and appends metadata to files to give users a sense of the type of data they are encountering.”
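The DAP’s chaining of translational operations can be sketched in miniature. The code below is purely hypothetical, not Brown Dog’s; the formats and converters are made up to show how a proxy might route a file through successive conversions until it reaches a format the client can read:

```python
# Invented format pairs: each entry converts a source format to a target.
CONVERTERS = {
    ("doc", "pdf"): lambda name: name.rsplit(".", 1)[0] + ".pdf",
    ("pdf", "txt"): lambda name: name.rsplit(".", 1)[0] + ".txt",
}

def convert_path(filename, target, converters=CONVERTERS):
    """Chain conversions until the file reaches the target format."""
    fmt = filename.rsplit(".", 1)[1]
    chain, seen = [filename], {fmt}
    while fmt != target:
        step = next(((src, dst) for (src, dst) in converters if src == fmt), None)
        if step is None or step[1] in seen:
            return None  # no route to a readable format
        fmt = step[1]
        seen.add(fmt)
        filename = converters[step](filename)
        chain.append(filename)
    return chain

print(convert_path("report.doc", "txt"))
```

The real proxy works on file contents rather than names, of course, but the idea of linking operations behind the scenes until the client can read the result is the same.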
The article notes that Brown Dog’s makers are building on previous software development, and that they hope to “bring together every possible source of automated help already in existence.” That’s some goal! Not surprisingly, the prospective tools have been likened to a time machine of sorts. Kenton McHenry, one of the project’s leaders, reminds us that the world’s first web browser, Mosaic, was also developed at NCSA; his team hopes to leave a similarly significant legacy.
Cynthia Murrell, November 10, 2014