
More Open Source Search Excitement: Solr Flare Erupts

February 20, 2015

I read “Yonik Seeley, Creator of Apache Solr Search Engine Joins Cloudera.” Most personnel moves in the search and retrieval sector are ho-hum events. Seeley’s jump from Heliosearch to Cloudera may disrupt activities a world away from the former Lucid Imagination, now chasing Big Data under the moniker “LucidWorks.” I write the company’s name as LucidWorks (Really?) because the company has undergone some Cirque du Soleil moves since the management revolving door was installed.

Seeley was one of the founders and top engineers at Lucid. Following his own drum beat, he formed his own company to support Solr. In my opinion, Seeley played a key role in shaping Solr into a reasonable alternative to proprietary findability solutions like Endeca. With Seeley at Cloudera, Lucid’s vision of becoming the search solution for Hadoop-like data management systems may suffer a transmission outage. I think of this as a big Solr flare.

Cloudera will move forward and leverage Seeley’s expertise. It is possible that Lucid will move out of the Big Data orbit and find a way to generate sustainable revenues. However, Cloudera now has an opportunity to add some fuel to its solutions.

For me, the Seeley move is good news for Cloudera. For Lucid, Seeley’s joining Cloudera is yet another challenge. I think the Lucid operation is still dazed after four or five years of sharp blows to the corporate body.

The patience of Lucid’s investors may be tested again. The management issues, the loss of a key executive to Amazon, the rise of Elasticsearch, and now the Seeley shift in orbit—these are the times that may try the souls of those who expect a payoff from their investments in Lucid’s open source dream. Cloudera and Elasticsearch are now companies with a fighting chance to become the next Red Hat. Really.

Stephen E Arnold, February 20, 2015

Statistics, Statistics. Disappointing Indeed

February 16, 2015

At dinner on Saturday evening, a medical research professional mentioned that reproducing results from tests conducted in the researcher’s lab was tough. I think the buzzword for this is “non-reproducibility.” The question was asked, “Perhaps the research is essentially random?” There were some furrowed brows. My reaction was, “How does one know what’s what with experiments, data, or reproducibility tests?” The table talk shifted to a discussion of Saturday Night Live’s 40th anniversary. Safer ground.

Navigate to “Science’s Significant Stat Problem.” The article makes clear that 2013 thinking may have some relevance today. Here’s a passage I highlighted in pale blue:

Scientists use elaborate statistical significance tests to distinguish a fluke from real evidence. But the sad truth is that the standard methods for significance testing are often inadequate to the task.

There you go. And the supporting information for this statement?

One recent paper found an appallingly low chance that certain neuroscience studies could correctly identify an effect from statistical data. Reviews of genetics research show that the statistics linking diseases to genes are wrong far more often than they’re right. Pharmaceutical companies find that test results favoring new drugs typically disappear when the tests are repeated.

For the math-inclined, the write up offers:

It’s like flipping coins. Sometimes you’ll flip a penny and get several heads in a row, but that doesn’t mean the penny is rigged. Suppose, for instance, that you toss a penny 10 times. A perfectly fair coin (heads or tails equally likely) will often produce more or fewer than five heads. In fact, you’ll get exactly five heads only about a fourth of the time. Sometimes you’ll get six heads, or four. Or seven, or eight. In fact, even with a fair coin, you might get 10 heads out of 10 flips (but only about once for every thousand 10-flip trials). So how many heads should make you suspicious? Suppose you get eight heads out of 10 tosses. For a fair coin, the chances of eight or more heads are only about 5.5 percent. That’s a P value of 0.055, close to the standard statistical significance threshold. Perhaps suspicion is warranted.
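For readers who want to check the arithmetic, a few lines of Python (standard library only; math.comb needs Python 3.8 or later) reproduce the figures quoted above. The numbers come from the article; the snippet is just a back-of-the-envelope verification:

    from math import comb

    def prob_heads(k, n=10):
        # Probability of exactly k heads in n tosses of a fair coin.
        return comb(n, k) / 2 ** n

    print(round(prob_heads(5), 3))                              # ~0.246, "about a fourth of the time"
    print(round(prob_heads(10), 4))                             # ~0.001, roughly once per thousand 10-flip trials
    print(round(sum(prob_heads(k) for k in range(8, 11)), 3))   # ~0.055, the quoted P value for 8 or more heads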

Now the kicker:

And there’s one other type of paper that attracts journalists while illustrating the wider point: research about smart animals. One such study involved a fish—an Atlantic salmon—placed in a brain scanner and shown various pictures of human activity. One particular spot in the fish’s brain showed a statistically significant increase in activity when the pictures depicted emotional scenes, like the exasperation on the face of a waiter who had just dropped his dishes. The scientists didn’t rush to publish their finding about how empathetic salmon are, though. They were just doing the test to reveal the quirks of statistical significance. The fish in the scanner was dead.

How are those Big Data analyses working out, folks?

Stephen E Arnold, February 16, 2015

VVVVV and Big Data

February 7, 2015

Somewhere along the line a marketer cooked up volume, variety, and velocity to describe Big Data. Well, VVV is good, but now we have VVVVV. Want to know more about “value” and “veracity”? Navigate to “2 More Big Data V’s—Value and Veracity.” The new Vs are slippery. How does one demonstrate value? The write up does not nail down the concept. There are MBA-type references to ROI, use cases, and brand. Not much numerical evidence or a credible analytic foundation is presented. Isn’t “value” a matter of perception? Numbers may not be needed.

Veracity is also a bit mushy. What about Brian Williams and his oft-repeated “conflation”? What about marketing collateral for software vendors in search of a sale?

I typed 25 and moved on. Neither a big number nor much in the way of big data.

Stephen E Arnold, February 7, 2015

Dataiku: Former Exalead Wizard Strikes Big Data Fire

January 24, 2015

I read “Big Data : Le Français Dataiku Lève 3 millions d’Euros” (Big Data: French Firm Dataiku Raises 3 Million Euros). The recipient of the cash infusion is Dataiku. Founded by former Exalead wizard Florian Douetteau, Dataiku offers:

a software platform that aggregates all the steps and big data tools necessary to get from raw data to production ready applications. It shortens the load-prepare-test-deploy cycles required to create data driven applications.

The company’s approach is to reduce the complexity of Big Data app construction. The company’s algorithms support predictive analytics. A community edition download is available at http://www.dataiku.com/dss/editions/.

Dataiku plans to open an office in the US in 2015.

Information about Dataiku is at http://www.dataiku.com.

Stephen E Arnold, January 24, 2015

Enterprise Search: Is Search Big Data Ready?

January 17, 2015

At lunch on Thursday, January 15, 2015, one of my colleagues called my attention to “10 Hot Big Data Startups to Watch in 2015 from A to Z.” The story is by a professional at a company named Zementis. The story appears on LinkedIn, and I believe it may be from a person whom LinkedIn considers a thought leader.

The reason I perked up when my colleague read the list of 10 companies was twofold. First, the author put his company, Zementis, on the list. Second, the consulting services firm LucidWorks—which I write as LucidWorks (Really?)—turned up.

Straight away, here’s the list of the “hot startups” I am enjoined to “watch” in 2015. I assume that startup means “a newly established business,” according to Google’s nifty, attribution-free definition service. “New” means “not existing before; made, introduced, or discovered recently or now for the first time.” Okay, with the housekeeping out of the way, on to the list:

  • Alpine Data Labs, founded in 2010
  • Confluent, founded in 2014 by LinkedIn engineers
  • Databricks, founded in 2013
  • Datameer, founded in 2009
  • Hadoop, now about 10 years old, originally (and still) an open source project rather than a company; figure 2004
  • Interana, founded in 2014 by former Facebook engineers
  • LucidWorks (Really?), né Lucid Imagination, founded in 2007
  • Paxata, founded in 2012
  • Trifacta, founded in 2012
  • Zementis, founded in 2004

Of these 10 companies, the one that is not a commercial enterprise is Hadoop. Wikipedia suggests that Hadoop is a set of algorithms: an open source implementation of the MapReduce approach the search giant Google developed prior to 2004.

Okay, now we have nine hot data startups.

I am okay with Confluent and Interana being considered new. That leaves seven companies that do not strike me as either “hot” or “new”. These non-hot and non-new outfits are Alpine Data Labs (five years old), Databricks (two years old), Datameer (six years old), LucidWorks Really? (eight years old), Paxata (three years old), Trifacta (three years old), and Zementis (11 years old).

I guess I can see that one could describe some of these companies as startups, but I cannot accept the “new” or “hot” moniker without some client names, revenue data, or some sort of factual substantiation.

Now we have two companies to consider: LucidWorks Really? and Zementis.

LucidWorks Really? is a value-added services firm built on Lucene/Solr. The company charges for its home-brew software and for consulting and engineering services. According to Wikipedia, Lucene is:

Apache Lucene is a free open source information retrieval software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License.

Apache offers this about Solr:

Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene. [Lucene is a trademark of Apache it seems]

Elasticsearch’s success in combining several open source products into a mechanism for accessing large datasets shows that it is possible to use Lucene as a query tool for information. But, and this is a large but, both the thriving Elasticsearch and LucidWorks Really? are search and retrieval systems. Yep, good old keyword search with some frosting tossed in by various community members and companies repackaging and marketing special builds of what is free software. LucidWorks has been around for eight years. I have trouble perceiving this company and its repositionings as “new”. The Big Data label seems little more than a marketing move as the company struggles to generate revenues.
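For anyone who has never seen good old keyword search in the raw, here is a minimal sketch of a query against a stock Solr instance, using only the Python standard library. The collection name (“books”) and the field (“title”) are my own illustrative stand-ins, not anything lifted from Lucid’s or Elastic’s marketing collateral:

    import json
    import urllib.parse
    import urllib.request

    # Build a plain keyword query against Solr's standard /select handler.
    params = urllib.parse.urlencode({
        "q": "title:analytics",   # keyword search on a single field
        "rows": 5,                # return at most five documents
        "wt": "json",             # ask for a JSON response
    })
    url = "http://localhost:8983/solr/books/select?" + params

    with urllib.request.urlopen(url) as response:
        results = json.load(response)

    # Each hit is a dictionary of stored fields.
    for doc in results["response"]["docs"]:
        print(doc.get("title"))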

Now Zementis. Like Recorded Future (funded by the GOOG and In-Q-Tel), Zementis is in the predictive analytics game. The company focuses on “holistic and actionable customer insight across all channels.” I did not include this company in my CyberOSINT study because the company seems to focus on commercial clients like retail stores and financial services. CyberOSINT is an analysis of next generation information access companies primarily serving law enforcement and intelligence entities.

But the deal breaker for me is not the company’s technology. I find it difficult to accept that a company founded 11 years ago is new. As with LucidWorks Really?, the startup label has more to do with the need to find a positioning that allows the company to generate sales and sustainable revenue.

These are essential imperatives. I do not accept the assertions about new, startup, and, to some degree, Big Data.

Furthermore, the inclusion of a project as a startup just adds evidence to support this hypothesis:

The write up is a listicle with little knowledge value. See http://amzn.to/1rUoQyn.

Why am I summarizing this information? The volume of disinformation suggests that companies engaged in next generation information access are making the same marketing mistakes that pushed Delphes, Fast Search & Transfer, Entopia, Fulcrum Technology, iPhrase, and other hype-oriented vendors into a corner.

Why not explain what a product does to solve a problem, offer specific case examples, and deal in concrete facts?

I assume that is just too much for the enterprise search and content processing “experts” to achieve in today’s business climate. Wow, what a confused listicle.

Stephen E Arnold, January 17, 2015

Startup SpaceCurve Promises Speedy Geospatial Data Analysis

January 15, 2015

The big-data field has recently seen a boom in technology that collects location-related information. The ability to quickly make good use of that data, though, has lagged behind our capacity to collect it. That gap is now being addressed, according to IT World’s piece, “Startup Rethinks Databases for the Real-Time Geospatial Era.” SpaceCurve, launched in 2009 and based in Seattle, recently released their new database system (also named “SpaceCurve”) intended to analyze geospatial data as it comes in. Writer Joab Jackson summarizes some explanatory tidbits from SpaceCurve CEO Dane Coyer:

“Traditional databases and even newer big data processing systems aren’t really optimized to quickly analyze such data, even though most all systems have some geospatial support. And although there are no shortage of geographic information systems, they aren’t equipped to handle the immense volumes of sensor data that could be produced by Internet-of-things-style sensor networks, Coyer said.

“The SpaceCurve development team developed a set of geometric computational algorithms that simplifies the parsing of geographic data. They also built the core database engine from scratch, and designed it to run across multiple servers in parallel.

“As a result, SpaceCurve, unlike big data systems such as Hadoop, can perform queries on real-time streams of data, and do so at a fraction of the cost of in-memory analysis systems such as Oracle’s TimesTen, Coyer said.”

Jackson gives a brief rundown of ways this data can be used. Whether these examples illustrate mostly positive or negative impacts on society I leave for you, dear readers, to judge for yourselves. The piece notes that SpaceCurve can work with data that has been packaged with REST, JSON, or ArcGIS formats. The platform does require Linux, and can be run on cloud services like Amazon Web Services.

Naturally, SpaceCurve is not the only company that has noticed the niche springing up around geospatial data. IBM, for example, markets its InfoSphere Streams as able to handily analyze large chunks of such information.

Cynthia Murrell, January 15, 2015

Sponsored by ArnoldIT.com, developer of Augmentext

Top Papers in Data Mining: Some Concern about Possibly Flawed Outputs

January 12, 2015

If you are a fan of “knowledge,” you probably follow the information provided by www.KDNuggets.com. I read “Research Leaders on Data Science and Big Data Key Trends, Top Papers.” The information is quite interesting. I did note that the write up was kicked off with this statement:

As for the papers, we found that many researchers were so busy that they did not really have the time to read many papers by others. Of course, top researchers learn about works of others from personal interactions, including conferences and meetings, but we hope that professors have enough students who do read the papers and summarize the important ones for them!

Okay, everyone is really busy.

Among the 13 experts cited, I noted two papers that seemed to call attention to the issue of accuracy. These were:

“Preventing False Discovery in Interactive Data Analysis is Hard,” Moritz Hardt and Jonathan Ullman

“Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images,” Anh Nguyen, Jason Yosinski, and Jeff Clune.

A related paper noted in the article is “Intriguing Properties of Neural Networks,” by Christian Szegedy, et al. The KDNuggets’ comment states:

It found that for every correctly classified image, one can generate an “adversarial”, visually indistinguishable image that will be misclassified. This suggests potential deep flaws in all neural networks, including possibly a human brain.
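The mechanics are easier to see on a toy model than on a deep network. Below is a minimal sketch of the fast gradient sign idea (a closely related technique; the cited papers use their own methods): nudge every “pixel” by a visually negligible amount in the direction that most damages the classifier’s score. The linear classifier, weights, and input are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 10_000                      # pretend each input is a 100x100 "image"
    w = rng.normal(size=d)          # made-up "trained" weights of a linear classifier
    x = rng.normal(size=d)          # a made-up input for the classifier to label

    score = w @ x                   # positive score means class 1, negative means class 0
    label = 1 if score > 0 else 0

    # Fast gradient sign idea: for a linear score w.x, the per-pixel direction
    # that pushes the score toward the opposite class is simply -sign(w)
    # (or +sign(w) when the current label is 0).
    eps = 0.05                      # tiny nudge relative to pixel values of order 1
    direction = -np.sign(w) if label == 1 else np.sign(w)
    x_adv = x + eps * direction

    print("original score:   ", score)
    print("adversarial score:", w @ x_adv)   # the sign should flip, i.e. misclassification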

My takeaway is that automation is coming down the pike. Accuracy could get hit by a speeding output.

Stephen E Arnold, January 12, 2015

Small Data Sprawl

January 9, 2015

I read “Forget Big Data, Small Data Is Going to Be 100 Times Harder to Tackle—EMC.” The point of the write up is that when devices generate data, life will be tougher than with old fashioned Big Data.

My thought is that wordsmithing works wonders for companies seeking purchase on the slippery face of Mt. Sales.

My reaction is that each datum of small data “belongs” to something. Maybe Google? Maybe a company in Shanghai? Won’t these folks aggregate the data and sell it? There are nascent data hubs. Perhaps these companies will get into the aggregation business. It worked for Ian Sharp at IP Sharp decades ago. I think it will probably work again.

Stephen E Arnold, January 9, 2015

Centrifuge Analytics v3 Promises Better Understanding of Big Data Through Visualization

December 23, 2014

The article on WMC Action News 5 titled “Centrifuge Analytics v3 is Now Available - Large Scale Data Discovery Never Looked Better” promotes the availability of Centrifuge Analytics v3, a product that enables users to see the results of their data analysis like never before. This intuitive, efficient tool helps users dig deeper into the meaning of their data. Centrifuge Systems has gained a reputation in data discovery software, particularly in the fields of cyber security, counter-terrorism, homeland defense, and financial crimes analysis, among others. Chief Executive Officer Simita Bose is quoted in the article:

“Centrifuge exists to help customers with critical missions, from detecting cyber threats to uncovering healthcare fraud…Centrifuge Analytics v3 is an incredibly innovative product that represents a breakthrough for big data discovery.” “Big data is here to stay and is quickly becoming the raw material of business,” says Stan Dushko, Chief Product Officer at Centrifuge Systems. “Centrifuge Analytics v3 allows users to answer the root cause and effect questions to help them take the right actions.”

The article also lists several of the perks of Centrifuge Analytics v3, including that it is easy to deploy in multiple settings, from a laptop to the cloud. It also offers powerful visuals in a fully integrated environment that is easy for users to explore and even add to if source data is complete. This may be an answer for companies that have all the big data they need but don’t know what it means.

Chelsea Kerwin, December 23, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

Narrative Science Gets Money to Crunch Numbers

December 18, 2014

A smaller big data sector that specializes in analyzing data to generate text-based content and reports is burgeoning with startups. Venture Beat takes a look at how one of these startups, Narrative Science, is gaining more attention in the enterprise software market: “Narrative Science Pulls In $10M To Analyze Corporate Data And Turn It Into Text-Based Reports.”

Narrative Science started out with software that created sports stories and basic earnings articles for newspaper filler. It has since grown into helping businesses in different industries take their data by the digital horns and leverage it.

Narrative Science recently received $10 million in funding to further develop its software. Stuart Frankel, chief executive, is driven to help all industries save time and resources by better understanding their data:

“ ‘We really want to be a technology provider to those media organizations as opposed to a company that provides media content,’ Frankel said… ‘When humans do that work…it can take weeks. We can really get that down to a matter of seconds.’”

From making content to providing technology? It is quite a leap for Narrative Science. While the company appears to have a good product, what exactly does it do?

Whitney Grace, December 18, 2014
Sponsored by ArnoldIT.com, developer of Augmentext
