Honkin' News banner

LucidWorks Bet on Spark. Now What?

August 8, 2016

Many clear night ago, Lucid Imagination offered an open source enterprise search solution. Presidents came. Presidents went. Lucid Imagination morphed into LucidWorks. I promptly referred to the company in this way: Lucid works, really?

The firm embraced Spark and did a not-unexpected pirouette into a Big Data outfit. I know. I know. Lucid Imagination is a company anchored in key word search, but this is the 21st century. Pirouettes are better than mere pivots, so Big Data it is.

I read “Big Data Brawlers: 4 Challengers to Spark” and the write up triggered some thoughts about LucidWorks. Really.

The point of the story is to identify four open source solutions which do what Spark allegedly does so darned well. Each of these challengers:

  • Scales
  • Handles Big Data (whatever that means)
  • Exploits cheap memory so there are no slug like disc writes
  • Does the old school batch processing thing.

What are the “challengers” to Spark? Here are the contenders:

  • Apache Apex. Once proprietary, now open source, the software does micro batching for almost, sort of real time functions
  • Heron. Another real time solution with spouts and bolts. Excited?
  • Apache Flink. This is an open source library with a one two punch: It does the Flink stuff and the Spark stuff.
  • Onyx. This is a distributed computation system which will appeal to the Java folks.

What do these Spark alternatives have to do with LucidWorks, really? I think there is going to be one major impact. LucidWorks will have to spend or invest in supporting whatever becomes the next big thing. Recommind hit a glass ceiling with its business model. LucidWorks may be bumping into the open source sky light. Instead of being stopped, LucidWorks has to keep investing to keep pace with what the community driven folks generate with little thought to the impact on companies trying to earn a living with open source.

Stephen E Arnold, August 8, 2016

Amazon AWS Jungle Snares Some Elasticsearch Functions

July 1, 2016

Elastic’s Elasticsearch has become one of the go to open source search and retrieval solutions. Based on Lucene, the system has put the heat on some of the other open source centric search vendors. However, search is a tricky beastie.

Navigate to “AWS Elasticsearch Service Woes” to get a glimpse of some of the snags which can poke holes in one’s rip stop hiking garb. The problems are not surprising. One does not know what issues will arise until a search system is deployed and the lucky users are banging away with their queries or a happy administrator discovers that Button A no longer works.

The write up states:

We kept coming across OOM issues due the JVMMemoryPresure spiking and inturn the ES service kept crapping out. Aside from some optimization work, we’d more than likely have to add more boxes/resources to the cluster which then means more things to manage. This is when we thought, “Hey, AWS have a service for this right? Let’s give that a crack?!”. As great as having it as a service is, it certainly comes with some fairly irritating pitfalls which then causes you to approach the situation from a different angle.

One approach is to use templates to deal with the implementation of shard management in AWS Elasticsearch. Sample templates are provided in the write up. The fix does not address some issues. The article provides a link to a reindexing tool called es-tool.

The most interesting comment in the article in my opinion is:

In hindsight I think it may have been worth potentially sticking with and fleshing out the old implementation of Elasticsearch, instead of having to fudge various things with the AWS ES service. On the other hand it has relieved some of the operational overhead, and in terms of scaling I am literally a couple of clicks away. If you have large amounts of data you pump into Elasticsearch and you require granular control, AWS ES is not the solution for you. However if you need a quick and simple Elasticsearch and Kibana solution, then look no further.

My takeaway is to do some thinking about the strengths and weaknesses of the Amazon AWS before chopping through the Bezos cloud jungle.

Stephen E Arnold, July 1, 2016

Microsoft and the Open Source Trojan Horse

March 30, 2016

Quite a few outfits embrace open source. There are a number of reasons:

  1. It is cheaper than writing original code
  2. It is less expensive than writing original code
  3. It is more economical than writing original code.

The article “Microsoft is Pretending to be a FOSS Company in Order to Secure Government Contracts With Proprietary Software in ‘Open’ Clothing” reminded me that there is another reason.

No kidding.

I know that IBM has snagged Lucene and waved its once magical wand over the information access system and pronounced, “Watson.” I know that deep inside the kind, gentle heart of Palantir Technologies, there are open source bits. And there are others.

The write up asserted:

For those who missed it, Microsoft is trying to EEE GNU/Linux servers amid Microsoft layoffs; selfish interests of profit, as noted by some writers [1,2] this morning, nothing whatsoever to do with FOSS (there’s no FOSS aspect to it at all!) are driving these moves. It’s about proprietary software lock-in that won’t be available for another year anyway. It’s a good way to distract the public and suppress criticism with some corny images of red hearts.

The other interesting point I highlighted was:

reject the idea that Microsoft is somehow “open” now. The European Union, the Indian government and even the White House now warm up to FOSS, so Microsoft is pretending to be FOSS. This is protectionism by deception from Microsoft and those who play along with the PR campaign (or lobbying) are hurting genuine/legitimate FOSS.

With some government statements of work requiring “open” technologies, Microsoft may be doing what other firms have been doing for a while. See points one to three above. Microsoft is just late to the accountants’ party.

Why not replace the SharePoint search thing with an open source solution? What’s the $1.2 billion MSFT paid for the fascinating Fast Search & Transfer technology in 2008? It works just really well, right?

Stephen E Arnold, March 30, 2016

Elasticsearch for Text Analysis

March 29, 2016

Short honk: Put your code hat on. “Mining Mailboxes with Elasticsearch and Kibana” walks a reader through using open source technology to do text analysis. The example under the microscope is email, but the method will work for any text corpus ingested by Elasticsearch. The write up includes code samples and enough explanation to get the Elastic system moving forward. Visualizations are included. These make it easy to spot certain trends; for example, the top recipients of the email analyzed for the tutorial. Worth a look.

Stephen E Arnold, March 29, 2016

Search as a Framework

March 26, 2016

A number of search and content processing vendors suggest their information access system can function as a framework. The idea is that search is more than a utility function.

If the information in the article “Abusing Elasticsearch as a Framework” is spot on, a non search vendor may have taken an important step to making an assertion into a reality.

The article states:

Crate is a distributed SQL database that leverages Elasticsearch and Lucene. In it’s infant days it parsed SQL statements and translated them into Elasticsearch queries. It was basically a layer on top of Elasticsearch.

The idea is that the framework uses discovery, master election, replication, etc along with the Lucene search and indexing operations.

Crate, the framework, is a distributed SQL database “that leverages Elasticsearch and Lucene.”

Stephen E Arnold, March 26, 2016

Reviews on Dark Web Email Providers Shared by Freedom Hacker

February 10, 2016

The Dark Web has many layers of sites and services, as the metaphor provided in the .onion extension suggests. List of secure Dark Web email providers in 2016 was recently published on Freedom Hacker to detail and review the Dark Web email providers currently available. These services, typically offering both free and pro account versions, facilitate emailing without any type of third-party services. That even means you can forget any hidden Google scripts, fonts or trackers. According to this piece,

“All of these email providers are only accessible via the Tor Browser, an anonymity tool designed to conceal the end users identity and heavily encrypt their communication, making those who use the network anonymous. Tor is used by an array of people including journalists, activists, political-dissidents, government-targets, whistleblowers, the government and just about anyone since it’s an open-source free tool. Tor provides a sense of security in high-risk situations and is often a choice among high-profile targets. However, many use it day-to-day as it provides identity concealment seamlessly.”

We are intrigued by the proliferation of these services and their users. While usage numbers in this article are not reported, the write-up of the author’s top five email applications indicate enough available services to necessitate reviews. Equally interesting will be the response by companies on the clearweb, or the .com and other regular sites. Not to mention how the government and intelligence agencies will interact with this burgeoning ecosystem.


Megan Feil, February 10, 2016

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph


More Open Source Smart Software

January 15, 2016

The gift giver this time is Baidu. Navigate to “Baidu Open-Sources Its WARP-CTC Artificial Intelligence Software.” Baidu’s method is call the connectionist temporal classification or CTC method. Is the innovation from the Middle Kingdom? Nah. Switzerland. You know, the country where Einstein whacked away with his so so computational skills.

According to the write up:

The CTC approach involves recurrent neural networks (RNNs), an increasingly common component used for a type of AI called deep learning. Recurrent nets have been shown to work well even in noisy environments.

Have at the code, gentle read. The link is https://github.com/baidu-research/warp-ctc

Stephen E Arnold, January 14, 2016

Open Source Data Management: It Is Now Easy to Understand

January 10, 2016

I read “16 for 16: What You Must Know about Hadoop and Spark Right Now.” I like the “right now.” Urgency. I am not sure I feel too much urgency at the moment. I will leave that wonderful feeling to the executives who have sucked in venture money and have to find a way to generate revenue in the next 11 months.

The article runs down the basic generalizations associated with each of these open source data management components:

  • Spark
  • Hive
  • Kerberos
  • Ranger/Sentry
  • HBase/Phoenix
  • Impala
  • Hadoop Distributed File System (HDFS)
  • Kafka
  • Storm/Apex
  • Ambari/Cloudera Manager
  • Pig
  • Yarn/Mesos
  • Nifi/Kettle
  • Knox
  • Scala/Python
  • Zeppelin/Databricks

What the list tells me is two things. First, the proliferation of open source data tools is thriving. Second, there will have to be quite a few committed developers to keep these projects afloat.

The write up is not content with this shopping list. The intrepid reader will have an opportunity to learn a bit about:

  • Kylin
  • Atlas/Navigator

As the write up swoops to its end point, I learned about some open source projects which are a bit of a disappointment; for example, Oozie and Tez.

The key point of the article is that Google’s MapReduce which is now pretty long in the tooth is now effectively marginalized.

The Balkanization of data management is evident. The challenge will be to use one or more of these technologies to make some substantial revenue flow.

What happens if a company jumps on the wrong bandwagon as it leaves the parade ground? I would suggest that it may be more like a Pig than an Atlas. The investors will change from Rangers looking for profits to Pythons ready to strike. A Spark can set fire to some hopes and dreams in the Hive. Poorly constructed walls of Databricks can come falling down. That will be an Oozie.

Dear old Oracle, DB2, and SQLServer will just watch.

Stephen E Arnold, January 10, 2016

Short Honk: Hadoop Ecosystem Made Clear

January 3, 2016

Love Hadoop. Love all things Hadoopy? You will want to navigate to “The Hadoop Ecosystem Table.” You have categories of Hadoopiness with examples of the Hadoop amoebae. You are able to see where Spark “fits” or Kudu. Need some document data model options? The table will deliver: ArangoDB and more. Useful stuff.

Stephen E Arnold, December 30, 2015

The Importance of Google AI

December 23, 2015

According to Business Insider, we’ve all been overlooking something crucial about Google. Writer Lucinda Shen reports, “Top Internet Analyst: There Is One Thing About Google that Everyone Is Missing.” Shen cites an observation by prominent equity analyst Carlos Kirjner. She writes:

“Kirjner, that thing [that everyone else is missing] is AI at Google. ’Nobody is paying attention to that because it is not an issue that will play out in the next few quarters, but longer term it is a big, big opportunity for them,’ he said. ‘Google’s investments in artificial intelligence, above and beyond the use of machine learning to improve character, photo, video and sound classification, could be so revolutionary and transformational to the point of raising ethical questions.’

“Even if investors and analysts haven’t been closely monitoring Google’s developments in AI, the internet giant is devoted to the project. During the company’s third-quarter earnings call, CEO Sundar Pichai told investors the company planned to integrate AI more deeply within its core business.”

Google must be confident in its AI if it is deploying it across all its products, as reported. Shen recalls that the company made waves back in November, when it released the open-source AI platform TensorFlow. Is Google’s AI research about to take the world by storm?


Cynthia Murrell, December 23, 2015

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Next Page »