Dataiku: Former Exalead Wizard Strikes Big Data Fire

January 24, 2015

I read “Big Data : Le Français Dataiku Lève 3 millions d’Euros” (“Big Data: France’s Dataiku Raises 3 Million Euros”). The recipient of the cash infusion is Dataiku. Founded by former Exalead wizard Florian Douetteau, Dataiku offers:

a software platform that aggregates all the steps and big data tools necessary to get from raw data to production ready applications. It shortens the load-prepare-test-deploy cycles required to create data driven applications.

The company’s approach is to reduce the complexity of Big Data app construction. The company’s algorithms support predictive analytics. A community edition is available for download.

Dataiku plans to open an office in the US in 2015.

Information about Dataiku is available on the company’s Web site.

Stephen E Arnold, January 24, 2015

Enterprise Search: Is Search Big Data Ready?

January 17, 2015

At lunch on Thursday, January 15, 2015, one of my colleagues called my attention to “10 Hot Big Data Startups to Watch in 2015 from A to Z.” The story is by a professional at a company named Zementis. The story appears on a LinkedIn page, and I believe this may be from a person whom LinkedIn considers a thought leader.

The reason I perked up when my colleague read the list of 10 companies was twofold. First, the author put his own company, Zementis, on the list. Second, the consulting services firm LucidWorks, which I write as LucidWorks (Really?), turned up.

Straight away, here’s the list of the “hot startups” I am enjoined to “watch” in 2015. I assume that startup means “a newly established business,” according to Google’s nifty, attribution-free definition service. “New” means “not existing before; made, introduced, or discovered recently or now for the first time.” Okay, with the housekeeping out of the way, on to the list:

  • Alpine Data Labs, founded in 2010
  • Confluent, founded in 2014 by LinkedIn engineers
  • Databricks, founded in 2013
  • Datameer, founded in 2009
  • Hadoop, now about 10 years old; originally an open source project, not a company (figure 2004)
  • Interana, founded in 2014 by former Facebook engineers
  • LucidWorks (Really?), né Lucid Imagination, founded in 2007
  • Paxata, founded in 2012
  • Trifacta, founded in 2012
  • Zementis, founded in 2004

Of these 10 companies, the one that is not a commercial enterprise is Hadoop. Wikipedia suggests that Hadoop is an open source implementation of the MapReduce approach the search giant Google developed prior to 2004.
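For readers who have not met it, the MapReduce model Hadoop implements boils down to a map step that emits key-value pairs, a shuffle that groups them by key, and a reduce step that aggregates each group. A toy word count in plain Python sketches the dataflow (illustrative only; Hadoop distributes these phases across a cluster):

```python
from collections import defaultdict

# Toy MapReduce word count, the canonical illustration of the model
# Hadoop implements. Everything runs in one process here; Hadoop's
# contribution is running map and reduce across many machines.

def map_phase(docs):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values independently.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data", "big search"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 1, 'search': 1}
```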

Okay, now we have nine hot data startups.

I am okay with Confluent and Interana being considered new. That leaves seven companies that do not strike me as either “hot” or “new.” These non-hot and non-new outfits are Alpine Data Labs (five years old), Databricks (two years old), Datameer (six years old), LucidWorks Really? (eight years old), Paxata (three years old), Trifacta (three years old), and Zementis (11 years old).

I guess I can see that one could describe a few of these companies as startups, but I cannot accept the “new” or “hot” moniker without some client names, revenue data, or some sort of factual substantiation.

Now we have two companies to consider: LucidWorks Really? and Zementis.

LucidWorks Really? is a value-added services firm built on Lucene/Solr. The company charges for its home-brew software and for consulting and engineering services. According to Wikipedia, Lucene is:

Apache Lucene is a free open source information retrieval software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License.

Apache offers this about Solr:

Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene. [Lucene is a trademark of Apache it seems]

As Elasticsearch’s success in combining several open source products as a mechanism for accessing large datasets shows, it is possible to use Lucene as a query tool for information. But, and this is a large but, both the thriving Elasticsearch and LucidWorks Really? are search and retrieval systems. Yep, good old keyword search with some frosting tossed in by various community members and companies repackaging and marketing special builds of what is free software. LucidWorks has been around for eight years. I have trouble perceiving this company and its repositionings as “new”. The Big Data label seems little more than a marketing move as the company struggles to generate revenues.
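Under the frosting, keyword retrieval of the Lucene/Solr sort reduces to an inverted index: a map from each term to the documents containing it. A toy sketch in plain Python makes the point (this is not Lucene’s actual code; real engines add analyzers, relevance ranking such as BM25, and index compression):

```python
from collections import defaultdict

# Toy inverted index: the core data structure behind keyword engines
# such as Lucene/Solr. The sample documents are invented for this sketch.

docs = {
    1: "big data platform for search",
    2: "open source search library",
    3: "predictive analytics for big data",
}

# Build the index: term -> set of document ids containing that term.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    """AND-style keyword search: ids of docs containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(docs)
    for term in terms:
        result &= index.get(term, set())
    return result

print(sorted(search("big data")))  # [1, 3]
print(sorted(search("search")))    # [1, 2]
```

Everything beyond this lookup — stemming, scoring, faceting, sharding — is the “frosting” the various repackagers sell.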

Now Zementis. Like Recorded Future (funded by the GOOG and In-Q-Tel), Zementis is in the predictive analytics game. The company focuses on “holistic and actionable customer insight across all channels.” I did not include this company in my CyberOSINT study because the company seems to focus on commercial clients like retail stores and financial services. CyberOSINT is an analysis of next generation information access companies primarily serving law enforcement and intelligence entities.

But the deal breaker for me is not the company’s technology. I find it difficult to accept that a company founded 11 years ago is new. As with LucidWorks Really?, the startup label has more to do with the need to find a positioning that allows the company to generate sales and sustainable revenue.

These are essential imperatives. I do not accept the assertions about new, startup, and, to some degree, Big Data.

Furthermore, the inclusion of a project as a startup just adds evidence to support this hypothesis:

The write up is a listicle with little knowledge value.

Why am I summarizing this information? The volume of disinformation suggests that companies engaged in next generation information access are making the same marketing mistakes that pushed Delphes, Fast Search & Transfer, Entopia, Fulcrum Technology, iPhrase, and other hype oriented vendors into a corner.

Why not explain what a product does to solve a problem, offer specific case examples, and deal in concrete facts?

I assume that is just too much for the enterprise search and content processing “experts” to achieve in today’s business climate. Wow, what a confused listicle.

Stephen E Arnold, January 17, 2015

Startup SpaceCurve Promises Speedy Geospatial Data Analysis

January 15, 2015

The big-data field has recently seen a boom in technology that collects location-related information. The ability to quickly make good use of that data, though, has lagged behind our capacity to collect it. That gap is now being addressed, according to IT World’s piece, “Startup Rethinks Databases for the Real-Time Geospatial Era.” SpaceCurve, launched in 2009 and based in Seattle, recently released their new database system (also named “SpaceCurve”) intended to analyze geospatial data as it comes in. Writer Joab Jackson summarizes some explanatory tidbits from SpaceCurve CEO Dane Coyer:

“Traditional databases and even newer big data processing systems aren’t really optimized to quickly analyze such data, even though most all systems have some geospatial support. And although there are no shortage of geographic information systems, they aren’t equipped to handle the immense volumes of sensor data that could be produced by Internet-of-things-style sensor networks, Coyer said.

“The SpaceCurve development team developed a set of geometric computational algorithms that simplifies the parsing of geographic data. They also built the core database engine from scratch, and designed it to run across multiple servers in parallel.

“As a result, SpaceCurve, unlike big data systems such as Hadoop, can perform queries on real-time streams of data, and do so at a fraction of the cost of in-memory analysis systems such as Oracle’s TimesTen, Coyer said.”

Jackson gives a brief rundown of ways this data can be used. Whether these examples illustrate mostly positive or negative impacts on society I leave for you, dear readers, to judge for yourselves. The piece notes that SpaceCurve can work with data that has been packaged with REST, JSON, or ArcGIS formats. The platform does require Linux, and can be run on cloud services like Amazon Web Services.

Naturally, SpaceCurve is not the only company that has noticed the niche springing up around geospatial data. IBM, for example, markets its InfoSphere Streams as able to handily analyze large chunks of such information.

Cynthia Murrell, January 15, 2015

Sponsored by, developer of Augmentext

Top Papers in Data Mining: Some Concern about Possibly Flawed Outputs

January 12, 2015

If you are a fan of “knowledge,” you probably follow the information provided by KDnuggets. I read “Research Leaders on Data Science and Big Data Key Trends, Top Papers.” The information is quite interesting. I did note that the paper was kicked off with this statement:

As for the papers, we found that many researchers were so busy that they did not really have the time to read many papers by others. Of course, top researchers learn about works of others from personal interactions, including conferences and meetings, but we hope that professors have enough students who do read the papers and summarize the important ones for them!

Okay, everyone is really busy.

Among the 13 experts cited, I noted two papers that seemed to call attention to the issue of accuracy. These were:

  • “Preventing False Discovery in Interactive Data Analysis is Hard,” Moritz Hardt and Jonathan Ullman
  • “Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images,” Anh Nguyen, Jason Yosinski, Jeff Clune

A related paper noted in the article is “Intriguing Properties of Neural Networks,” by Christian Szegedy, et al. The KDnuggets comment states:

It found that for every correctly classified image, one can generate an “adversarial”, visually indistinguishable image that will be misclassified. This suggests potential deep flaws in all neural networks, including possibly a human brain.
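The adversarial effect the quote describes can be demonstrated even without a deep network. The sketch below is my own illustration, not code from the paper: a toy linear classifier stands in for the network, and a fast-gradient-sign perturbation (the dimensions and the 0.02 step size are arbitrary choices) nudges every “pixel” imperceptibly, yet the aggregate effect flips the classification.

```python
import numpy as np

# Toy demonstration of the adversarial-image result: for a correctly
# classified input, a tiny per-pixel perturbation can flip the decision.
# A hand-built linear model stands in for the neural network; this is an
# illustration of the mechanism, not the paper's method.

rng = np.random.default_rng(0)
w = rng.normal(size=10_000)    # weights of a toy linear classifier
x = w / np.linalg.norm(w)      # an input the model scores strongly positive

def predict(v):
    """Binary decision of the toy linear model."""
    return 1 if float(v @ w) > 0 else 0

# Move each "pixel" by at most 0.02 -- tiny on a 0-to-1 pixel scale --
# in the direction that hurts the correct class. In 10,000 dimensions
# the small nudges add up and the label flips.
epsilon = 0.02
x_adv = x - epsilon * np.sign(w)

print(predict(x))      # 1: original input classified positive
print(predict(x_adv))  # 0: near-identical input misclassified
```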

My takeaway is that automation is coming down the pike. Accuracy could get hit by a speeding output.

Stephen E Arnold, January 12, 2015

Small Data Sprawl

January 9, 2015

I read “Forget Big Data, Small Data Is Going to Be 100 Times Harder to Tackle—EMC.” The point of the write up is that when devices generate data, life will be tougher than with old fashioned Big Data.

My thought is that wordsmithing works wonders for companies seeking purchase on the slippery face of Mt. Sales.

My reaction is that each datum of small data “belongs” to something. Maybe Google? Maybe a company in Shanghai? Won’t these folks aggregate the data and sell it? There are nascent data hubs. Perhaps these companies will get into the aggregation business. It worked for Ian Sharp at IP Sharp decades ago. I think it will probably work again.

Stephen E Arnold, January 9, 2015

Centrifuge Analytics v3 Promises Better Understanding of Big Data Through Visualization

December 23, 2014

The article on WMC Action News 5 titled “Centrifuge Analytics v3 is Now Available: Large Scale Data Discovery Never Looked Better” promotes the availability of Centrifuge Analytics v3, a product that enables users to see the results of their data analysis like never before. This intuitive, efficient tool helps users dig deeper into the meaning of their data. Centrifuge Systems has gained a reputation in data discovery software, particularly in the fields of cyber security, counter-terrorism, homeland defense, and financial crimes analysis, among others. Chief Executive Officer Simita Bose is quoted in the article:

“Centrifuge exists to help customers with critical missions, from detecting cyber threats to uncovering healthcare fraud…Centrifuge Analytics v3 is an incredibly innovative product that represents a breakthrough for big data discovery.” “Big data is here to stay and is quickly becoming the raw material of business,” says Stan Dushko, Chief Product Officer at Centrifuge Systems. “Centrifuge Analytics v3 allows users to answer the root cause and effect questions to help them take the right actions.”

The article also lists several of the perks of Centrifuge Analytics v3, including that it is easy to deploy in multiple settings, from a laptop to the cloud. It also offers powerful visuals in a fully integrated background that is easy for users to explore, and even add to if source data is complete. This may be an answer for companies that have all the big data they need, but don’t know what it means.

Chelsea Kerwin, December 23, 2014

Sponsored by, developer of Augmentext

Narrative Science Gets Money to Crunch Numbers

December 18, 2014

A smaller big data sector that specializes in text analysis to generate content and reports is burgeoning with startups. Venture Beat takes a look at how one of these startups, Narrative Science, is gaining more attention in the enterprise software market: “Narrative Science Pulls In $10M To Analyze Corporate Data And Turn It Into Text-Based Reports.”

Narrative Science started out with software that created sports stories and basic earnings articles for newspaper filler. It has since grown into helping businesses in different industries take their data by the digital horns and leverage it.

Narrative Science recently received $10 million in funding to further develop its software. Stuart Frankel, chief executive, is driven to help all industries save time and resources by better understanding their data.

“ ‘We really want to be a technology provider to those media organizations as opposed to a company that provides media content,’ Frankel said… ‘When humans do that work…it can take weeks. We can really get that down to a matter of seconds.’”

From making content to providing technology? It is quite a leap for Narrative Science. While the company appears to have a good product, what exactly does it do?

Whitney Grace, December 18, 2014
Sponsored by, developer of Augmentext

Short Honk: Google and Fish

December 17, 2014

You may want to read “Google Helps to Use Big Data for Global Surveillance—And That’s Good.” I have no big thoughts about this write up. Googlers like sushi, so protecting fish from overzealous fisher people seems logical to me. I would raise two questions for you to ponder after you have read the article:

What happens when humans are tracked and analyzed in this manner?

Is this function in place as you read this?

I have no answers, but I enjoy learning what other people think. We do not need to discuss the meaning of “good.”

Stephen E Arnold, December 17, 2014

Hidden Data In Big Data

December 15, 2014

Did you know that there was hidden data in big data? Okay, that makes a little sense given that big data software is designed to find the hidden trends and patterns, but RCR Wireless’s “Discovering Big Data Unknowns” article points out that there is even more data left unexplored. Why? Because people are only searching in the known areas. What about the unknown areas?

The article focuses on Katherine Matsumoto of Attensity and how she uses natural language processing to “social listen” in these grey areas. Attensity is a company that specializes in natural language processing analytics to understand the content around unstructured data: the big data white noise. Attensity views the Internet as the world’s largest consumer focus group, and the company helps its clients understand consumer habits. The new Attensity Q platform enables users to identify these patterns in real time and detect big data unknowns.

“The company’s platform combines sentiment and trend analysis with geospatial information and information on trend influencers, and said its approach of analyzing the conversations around emerging trends enables it to act as an “early warning” system for market shifts.”

The biggest problem Attensity faces is filtering out spam and understanding the data’s context. Finding the context is the main way social data can be harnessed for companies.

Scooping the useful information out of the white noise is a hard job. Can the same technology be applied to online ads to filter out the scams from legitimate ones?

Whitney Grace, December 15, 2014
Sponsored by, developer of Augmentext

A Possibility of Profit from Autonomy Deal

December 15, 2014

This is the season of miracles and magic, usually reserved for Hallmark movies and people in need, but one could argue that HP was in desperate need after the Autonomy fiasco. Maybe its Christmas wish will come true if the InformationWeek article “HP Cloud Adds Big Data Options” makes a correct prediction.

HP will release its Haven big data analytics platform through the HP Helion cloud as Haven OnDemand. The writer believes this is HP’s next logical step, given that Autonomy IDOL was released in January as SaaS. The popular Vertica DBMS will also be launched as a cloud service.

“Cloud-based database services have proven to be popular, with Amazon’s fast-growing Redshift service being an obvious point of comparison. Both HP Vertica and Redshift are distributed, columnar databases that are ideally suited to high-scale data-mart and data-warehouse use cases.”
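The columnar point in the quote is worth a concrete illustration. In the toy Python sketch below (my own example, not Vertica or Redshift code), an aggregate over one column in a columnar layout reads a single contiguous array instead of touching every field of every row, which is why the design suits data-warehouse workloads:

```python
# Toy contrast between row and columnar layouts, the design point behind
# analytics databases such as Vertica and Redshift. The sample records
# are invented for this sketch.

rows = [
    {"region": "east", "sales": 100, "units": 3},
    {"region": "west", "sales": 250, "units": 7},
    {"region": "east", "sales": 125, "units": 4},
]

# Row layout: summing sales must walk every record and pick out one field.
row_total = sum(r["sales"] for r in rows)

# Columnar layout: each field is stored contiguously, so the same
# aggregate scans a single array and ignores the other columns entirely.
columns = {
    "region": ["east", "west", "east"],
    "sales": [100, 250, 125],
    "units": [3, 7, 4],
}
col_total = sum(columns["sales"])

print(row_total, col_total)  # 475 475
```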

HP wants to make a mark in the big data market and help its clients harness the valuable insights hiding in structured and unstructured data. HP is on its way to becoming a key player in big data software, but it still needs improvement to compete. It does not offer Hadoop OnDemand, and it also lacks ETL, analytics software, and BI solutions that run alongside HP Haven OnDemand.

The company is finally moving forward and developing products that will start making up for the money lost in the Autonomy deal. How long will it take, however, to get every penny back?

Whitney Grace, December 15, 2014
Sponsored by, developer of Augmentext
