Going Beyond ETL
May 25, 2013
Traditional data warehousing, or as it is often called, Extract, Transform and Load (ETL), has constituted an important enterprise software category. Now, these capabilities are built into products solving other data needs. The article, “Talend Ships Version 5.3 of Data Integration Platform,” talks about Talend’s new platform.
Talend specializes in Data Integration and offers an open source distribution of its platform. Beyond ETL it features Master Data Management, Data Quality, Business Process Integration and Enterprise Service Bus.
The inevitable question in this area is: what about Hadoop?
“Talend version 5.3 now features a graphical mapper for building Apache Pig data transformation scripts visually (rather than having to code the data flows in the component’s language, “Pig Latin”), thus making an important Hadoop stack component a bit more analyst-friendly. Talend 5.3 can also generate native Java MapReduce code, which allows data transformations to run right on the Hadoop cluster, avoiding burdensome data movement, and making use of general purpose SQL and import/export tools like Hive and Sqoop unnecessary.”
The rest of the article mentions that Talend beefs up its connectors. They have added support for Couchbase, CouchDB, and Neo4j. Now they offer connectivity to databases across all four major NoSQL categories.
Megan Feil, May 25, 2013
Sponsored by ArnoldIT.com, developer of Beyond Search
LucidWorks Raises 10 Million in Capital
May 23, 2013
LucidWorks continues to raise capital, helping the company build and support open source software that empowers organizations to manage their multi-structured data. Venture Beat covers this latest round of venture capital in their story, “LucidWorks Pulls in $10M to Turn Open Source Data Into ‘Business Gold.’”
The article states:
“‘Big data’ startup LucidWorks has raised $10 million to help enterprise companies ‘turn multistructured data into business gold’ . . . According to a form filed with the SEC, existing investors Shasta Ventures, Granite Ventures, and Walden International contributed to this third round of funding. It brings LucidWorks’ total capital raised to $26 million.”
The company employs one-fourth of the committers on the Apache Lucene/Solr project, upon which their LucidWorks Search and LucidWorks Big Data offerings are built. Big customers include AT&T, Elsevier, Cisco, Nike, Sears, and Ford, among others. The company is truly doing well, and this additional capital will help improve their scope and reach. Their support offerings set them apart from the pack, and their investment in open source is sincere, sponsoring multiple training and development events across the country. If they stay on this path, good things will continue to happen to LucidWorks.
Emily Rae Aldridge, May 23, 2013
Sponsored by ArnoldIT.com, developer of Beyond Search
Phone Data Value And What Companies Are Doing With It
May 23, 2013
Smartphones are an extension of a person’s life, and they record that life every time they are used. Smithsonian Magazine takes a look at how phone companies are tracking and using the data from phones in “What Phone Companies Are Doing With All That Data From Your Phone.” Verizon Wireless is aware of the phone data goldmine and has added a new division called Precision Market Insights, and Telefonica is adding a new business unit, Telefonica Dynamic Insights, to do the same thing. Phone data is being used for market, medical, and social science research. The biggest usage is tracking how people move in real time. The data collected is supposed to remain anonymous, but that is not happening.
People can be tracked:
“But a study published in Scientific Reports in March found that even data made anonymous may not be so anonymous after all. A team of researchers from Louvain University in Belgium, Harvard and M.I.T. found that by using data from 15 months of phone use by 1.5 million people, together with a similar dataset from Foursquare, they could identify about 95 percent of the cell phones users with just four data points and 50 percent of them with just two data points. A data point is an individual’s approximate whereabouts at the approximate time they’re using their cell phone.”
People’s travel and cell phone patterns are repetitive and unique, making it easy to narrow down results to an individual user. Anonymity is a hard thing to achieve with a smartphone. To confuse the data, a person could get two mobile phones, but then does that increase the fun or increase the risk?
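The re-identification idea in the quoted study can be illustrated with a toy calculation. The sketch below is not the researchers' method or data: the three traces, the cell names, and the function names are all hypothetical, and it enumerates every possible sample of k points rather than drawing random ones. It only shows why knowing more (place, hour) points about a person makes a match with a single trace more likely.

```python
from itertools import combinations

def matching_users(traces, sample):
    """Return the users whose trace contains every point in `sample`."""
    sample = set(sample)
    return sorted(user for user, points in traces.items() if sample <= points)

def share_unique(traces, k):
    """Share of all k-point samples that match exactly one user's trace."""
    total = unique = 0
    for points in traces.values():
        for sample in combinations(sorted(points), k):
            total += 1
            if len(matching_users(traces, sample)) == 1:
                unique += 1
    return unique / total

# Hypothetical traces: each point is (cell tower, hour of day).
traces = {
    "u1": {("cell_A", 8), ("cell_B", 12), ("cell_C", 18)},
    "u2": {("cell_A", 8), ("cell_B", 12), ("cell_D", 18)},
    "u3": {("cell_A", 8), ("cell_E", 13), ("cell_C", 18)},
}

for k in (1, 2, 3):
    print(k, round(share_unique(traces, k), 2))
```

In this made-up data set, a single known point rarely singles anyone out, while three known points always do.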
Whitney Grace, May 23, 2013
Sponsored by ArnoldIT.com, developer of Beyond Search
Potentially New Web Page Data Mining Tool
May 22, 2013
Extracting content from a Web page can be a maddening process, requiring specialized scripts and time spent coding them. Taking a look at available tools, Softpedia touts “FMiner Pro 7.05.” FMiner Pro is advertised as a reliable application that allows users to easily handle Web content without scripts. The software can pull data from any page type, including https, plugins, JavaScript, and even complete data structures.
After the data is extracted much can be done with it:
“Extracted results can be saved to csv, Excel(xls), SQLite, Access, SQL Server, MySQL, PostgreSQL, and can specify the database fields’ types and attributes(eg, UNIQUE can avoid duplication of the extracted data). According to the setting, program can build, rebuild or load the database structure, and save the data to an existing database. Professional edition support incremental extraction, clear extraction and schedule extraction.”
FMiner Pro is available as a free fifteen-day trial for anyone who wants to see how well it performs. Judging by the specs, FMiner Pro is worth a shot. It can probably save coders hours of script writing, and organizing Web content is a tedious job no one likes to do. Having a program to do it is preferable.
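The UNIQUE attribute mentioned in the quoted description is a standard SQL feature, and its effect on repeated extractions can be sketched with Python's built-in sqlite3 module. This is not FMiner's implementation; the table name, column names, and URLs are hypothetical and chosen only for illustration.

```python
import sqlite3

def save_rows(conn, rows):
    """Insert (url, title) rows, skipping URLs that are already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT UNIQUE, title TEXT)"
    )
    # INSERT OR IGNORE drops any row that would violate the UNIQUE constraint.
    conn.executemany(
        "INSERT OR IGNORE INTO pages (url, title) VALUES (?, ?)", rows
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]

conn = sqlite3.connect(":memory:")
first = save_rows(conn, [("http://example.com/a", "A"),
                         ("http://example.com/b", "B")])
# The second run repeats one URL, so only the new URL is added.
second = save_rows(conn, [("http://example.com/a", "A again"),
                          ("http://example.com/c", "C")])
print(first, second)
```

Running the same extraction twice therefore grows the table only by the rows that are actually new.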
Whitney Grace, May 22, 2013
Sponsored by ArnoldIT.com, developer of Beyond Search
Generational Generation of Digital Information–a Breakdown
May 21, 2013
In the article Digital Footprints Broken Down by Generation on Bit Rebels, there are some very interesting facts laid out about the amount of data generated every day by humans. According to the article, citing “science”, the internet in all its encompassing hugeness weighs no more than the average strawberry, but the data printed out could cross the US and China fifteen times. To discover your data footprint and get a label (such as super-user) you can visit Cisco’s website What is Your Digital Footprint. The article also states,
“What I found to be the most staggering stat is that from the beginning of time until 2003, humans generated 5 billion gigabytes worth of data. Today, we generate that much data every two days. In a year from now, we will generate that much data every ten minutes. What will it be like in 10 years from now? Doesn’t it seem like at some point it would get full? Science is full of mystery and wonder.”
For a breakdown of generational usage of everything from TV to smartphones to desktop computers, examine this handy infographic by Wikibon. It predicts that soon we will generate 5 billion gigabytes of data every ten minutes. Some of this data is private, and at ArnoldIT you can learn from a team of professionals about how to avoid leaving a digital footprint that will open you up to risk and embarrassment.
Chelsea Kerwin, May 21, 2013
If you are interested in gourmet food and spirits, read Gourmet De Ville.
A Fresh Look at Big Data
May 8, 2013
Next week I am doing an invited talk in London. My subject is search and Big Data. I will be digging into this notion in this month’s Honk newsletter and adding some business intelligence related comments at an Information Today conference in New York later this month. (I have chopped the number of talks I am giving this year because at my age air travel and the number of 20 somethings at certain programs make me jumpy.)
I want to highlight one point in my upcoming London talk; namely, the financial challenge which companies face when they embrace Big Data and then want to search the information in the system and search the Big Data system’s outputs.
Here are the simplified curves:
Notice that precision and recall have not improved significantly over the last 30 years. I anticipate that many search vendors will tell me that their systems deliver excellent precision and recall. I am not convinced. The data which I have reviewed show that over a period of 10 years most systems hit the 80 to 85 percent precision and recall level for content which is about a topic. Content collections composed of scientific, technical, and medical information where the terminology is reasonably constrained can do better. I have seen scores above 90 percent. However, for general collections, precision and recall have not been improving relative to the advances in other disciplines; for example, converting structured data outputs to fancy graphics.
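For readers who want the two measures pinned down: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. A minimal sketch of the standard definitions follows; the document identifiers are made up and do not come from any of the systems discussed above.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query's result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets so the function never divides by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: five documents returned, five actually relevant.
retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = ["d1", "d2", "d3", "d4", "d6"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here four of the five returned documents are relevant and four of the five relevant documents were found, so both scores land at the 80 percent level cited above.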
Hadoop Heats Up with New Startups
May 8, 2013
While there is some controversy over whether Hadoop is the only necessary tool to mine opportunities from big data, Hadoop and insights from big data seem to be synonymous according to Datamation’s recent article. They give us the rundown on “Seven Hot Hadoop Startups that Will Tame Big Data.”
According to this article, the current Hadoop ecosystem market is worth around $77 million. With growth, the value is projected to be at $813 million by 2016. The article notes that Hadoop has not been proven as completely effective in the enterprise world. Queries are still a weak point.
The article discusses seven startups, Alpine Data Labs among them, that intend to see Hadoop through to maturity. The following excerpt explains why Alpine Data Labs is on this list:
“According to Alpine Data, part of the problem is that it’s much too difficult to get real insights out of Hadoop and other parallel platforms. Most companies don’t know what to do with massive datasets, and few have gotten any further with Hadoop than batch processing and basic querying. Alpine Data set out to simplify machine-learning methods and make them available on petabyte-scale datasets. Their tools make these methods available in a lightweight web application with a code-free, drag-and-drop interface.”
With the amount of attention on Hadoop over the years, Hadoop startups are not a commodity. A list featuring a selection of the new ones to watch is much appreciated. Check out the full and useful list of hot Hadoop startups.
Megan Feil, May 08, 2013
Sponsored by ArnoldIT.com, developer of Beyond Search
Inventive Graduate Student Builds Breakthrough Database
April 30, 2013
For some folks, deadlines can lead to innovation. One graduate student’s efforts to speed up his research have resulted in the inspired, high-speed parallel database MapD, we learn from DataInformed’s encouraging piece, “Fast Database Emerges from MIT Class, GPUs and Student’s Invention.” Todd Mostak’s in-a-pinch breakthrough could soon help others in business as well as academia.
The informative article contains too many specifics to cover here, but I suggest checking it out. It should be fascinating reading for anyone interested in data management. I personally think the use of graphics processors designed for gaming is a stroke of genius. Or maybe desperation (the two can be closely related). Reporter Ian B. Murphy tells us:
“While taking a class on databases at MIT, Mostak built a new parallel database, called MapD, that allows him to crunch complex spatial and GIS data in milliseconds, using off-the-shelf gaming graphical processing units (GPU) like a rack of mini supercomputers. Mostak reports performance gains upwards of 70 times faster than CPU-based systems. . . .
“‘I had the realization that this had the potential to be majorly disruptive,’ Mostak said. ‘There have been all these little research pieces about this algorithm or that algorithm on the GPU, but I thought, “Somebody needs to make an end-to-end system.” I was shocked that it really hadn’t been done.’”
Well, sometimes it takes someone from outside a field to see what seems obvious in retrospect. Mostak’s undergraduate experience was in economics, anthropology, and math, and he was in Harvard’s Middle Eastern Studies program when he was compelled to develop MapD. A database class at MIT gave him the knowledge he needed to build this tool, which he created to help with the tweet-heavy, Arab Spring-related thesis he was working on.
MIT’s Computer Science and Artificial Intelligence Lab has now snapped up the innovator. Though some questioned hiring someone with such a lean computer-science education, Lab director Sam Madden knows that Mostak’s unconventional background only means he has a unique point of view. The nascent computer scientist has already shown he has the talent to make it in this field.
Though Mostak says he still has work ahead to perfect his system, he does plan to share MapD as an open source project in the near future. Is he concerned about opening his work to the public? Nope; he states:
“If worse comes to worst, and somebody steals the idea, or nobody likes it, then I have a million other things I want to do too, in my head. I don’t think you can be scared. Life is too short.”
That it is. I suspect we will be hearing more from this creative thinker in the years to come.
Cynthia Murrell, April 30, 2013
Sponsored by ArnoldIT.com, developer of Augmentext
The Heat in Text Radar: April 12 to April 18
April 23, 2013
This week the Text Radar advanced intelligence blog covered a myriad of articles related to the big data deluge and its impact on a variety of different sectors.
One example of the unique ways that big data is being used is seen in “Using Big Data to Geotag the History of Human Events.” The article discusses a database that aims to contain a list of every event in human history.
Why is database journalism important? The author explains:
“It matters because historians have long feared that we live in a digital dark ages - where our history will have vanished when future generations try to look back on these electronic decades.
That is the purpose of GDELT: Global Data on Events, Location and Tone. Primarily set up by Kalev Leetaru at the University of Illinois it is literally a giant list: over 250m events in over 300 categories from riots and protests to diplomatic exchanges and peace appeals.
Crucially, it contains latitude and longitude for every event – all of them are now geotagged to city level.”
There are other ways that big data is having a big impact. “Kenneth Cukier on Big Data and How it is Changing Our World” explains the impact that big data is having on journalism and patient care and treatment in healthcare.
The article characterizes big data as:
“There is no concrete definition and that is probably a good thing since to define is also to limit. But it’s not woolly either. We can understand big data by its features, and the central one is this: we can do things with a huge corpus of data that we are unable to do with smaller amounts, to extract new insights and create new sources of value. This encompasses things like machine learning, in which we have self-driving cars and decent language translation.”
While big data is certainly taking off in the United States and around the world, there remain more than a few skeptics. “Daniel Rasmus on Skepticism with Big Data Implementation” explains that healthy skepticism is important when discussing such a large topic.
The article states:
“Rasmus explains that asking data for an answer involves serious programming needs, such as selecting relevant data, normalizing it, and producing results that a human or machine can act upon. It is tricky business. The article provides an in-depth review of the topic and what seem to be valid issues worth considering.”
Luckily for those who find big data research daunting, there are plenty of experts out there to help. We highly recommend Smartlogic’s Semaphore Content Intelligence Platform to add meaning to your data and deliver actionable insights.
Jasmine Ashton, April 23, 2013
Sponsored by ArnoldIT.com, developer of Augmentext.com
Silo Syndrome Claims the Sky Is Falling
April 18, 2013
Organizations in the financial services, healthcare, technology, e-business and government industries are at an increased risk for the newly diagnosed “Silo Syndrome”, according to the article “Thousands of Companies Diagnosed with Dreaded ‘Silo Syndrome’” published by PR Newswire.
Apparently, the symptoms of corporate “Silo Syndrome” are as follows:
“- An inability to immediately access business information
- Searching for answers but never really finding them
- Problems processing terms like “unstructured content”
- A penchant to unnecessarily flatten relational data
- Inability to join concepts together in real-time
- Needlessly accessing multiple systems for ‘what’ and ‘why’ answers”
Big data giant Attivio is championing awareness initiatives for what they claim is an increasingly ubiquitous syndrome, as CTO Sid Probstein stars in his very own PSA-style video. Attivio has also created a “Six Signs of Silo Syndrome” warning sign, which can be printed and displayed anywhere.
While Attivio no doubt holds the cure to “Silo Syndrome”, maybe humans build silos because silos are useful. After all, silos are required by various regulations, and silos simply make sense for certain types of business processes. Sure there is room for improvement, but sometimes silos just make sense.
Samantha Plappert, April 18, 2013
Sponsored by ArnoldIT.com, developer of Beyond Search





