September 20, 2016
Google has been under scrutiny for suspected tax evasion. Yahoo published a brief piece updating us on the investigation: “Data analysis from Paris raid on Google will take months, possibly years: prosecutor.” French police raided Google’s office in Paris, taking the tax avoidance inquiry to a new level. The raid follows pressure from across Europe to prevent multinational corporations from using their worldwide presence to pay less tax. Financial prosecutor Eliane Houlette is quoted as saying:
“We have collected a lot of computer data,” Houlette said in an interview with Europe 1 radio, TV channel iTele and newspaper Le Monde, adding that 96 people took part in the raid. “We need to analyze (the data) … (it will take) months, I hope that it won’t be several years, but we are very limited in resources.” Google, which said it is complying fully with French law, is under pressure across Europe from public opinion and governments angry at the way multinationals exploit their global presence to minimize tax liabilities.
While big data search technology exists, government and law enforcement agencies may not have the funds to use it. Or perhaps these agencies are simply unaware of the open source options. If nothing else, Houlette’s comments show the need for more focus on upgrading systems for real-time, rapid data analysis.
Megan Feil, September 20, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
There is a Louisville, Kentucky Hidden Web/Dark Web meet up on September 27, 2016.
Information is at this link: https://www.meetup.com/Louisville-Hidden-Dark-Web-Meetup/events/233599645/
August 29, 2016
I read “Faster, Better Text Classification.” Facebook’s artificial intelligence team has made available some of its whizzy code. The software may be a bit of a challenge to the vendors of proprietary text classification software, but Facebook wants to help everyone. Think of the billion plus Facebook users who need to train an artificially intelligent system with one billion words in 10 minutes. You may want to try this on your Chromebook, gentle reader.
Automatic text processing forms a key part of the day-to-day interaction with your computer; it’s a critical component of everything from web search and content ranking to spam filtering, and when it works well, it’s completely invisible to you. With the growing amount of online data, there is a need for more flexible tools to better understand the content of very large datasets, in order to provide more accurate classification results. To address this need, the Facebook AI Research (FAIR) lab is open-sourcing fastText, a library designed to help build scalable solutions for text representation and classification.
What does the Facebook text classification code deliver as open sourciness? I learned:
FastText combines some of the most successful concepts introduced by the natural language processing and machine learning communities in the last few decades. These include representing sentences with bag of words and bag of n-grams, as well as using subword information, and sharing information across classes through a hidden representation. We also employ a hierarchical softmax that takes advantage of the unbalanced distribution of the classes to speed up computation. These different concepts are being used for two different tasks: efficient text classification and learning word vector representations.
The write up details some of the benefits of the code; for example, its multilingual capabilities and its accuracy.
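To make the “bag of n-grams” and “subword information” ideas concrete, here is a minimal Python sketch. It is an illustration only, not fastText’s own code: the function names, the whitespace tokenizer, and the n-gram lengths are assumptions made for the example.

```python
def word_ngrams(tokens, n):
    """Return contiguous word n-grams, each joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(word, min_n=3, max_n=4):
    """Return character n-grams of a word wrapped in boundary markers.

    The "<" and ">" markers let a model tell prefixes and suffixes apart
    from n-grams that occur in the middle of a word.
    """
    wrapped = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def bag_of_features(text, max_word_n=2):
    """Collect unigrams plus higher-order word n-grams into one flat bag."""
    tokens = text.lower().split()  # hypothetical tokenizer: whitespace only
    features = []
    for n in range(1, max_word_n + 1):
        features.extend(word_ngrams(tokens, n))
    return features


if __name__ == "__main__":
    print(bag_of_features("Open source search"))
    print(char_ngrams("cat", min_n=3, max_n=3))
```

A classifier in this style would map each feature to a vector and average the vectors into the hidden representation the quote mentions; that step is left out here to keep the sketch short.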
What will other do gooders like Amazon, Google, and Microsoft do to respond to Facebook’s generosity? My thought is that more text processing software will find its way to open source green pastures.
What will the for fee vendors peddling proprietary classification systems do? Here’s a short list of ideas I had:
- Pivot to become predictive analytics companies and seek new rounds of financing
- Pretend that open source options are available but not good enough for real world tasks
- Generate white papers and commission mid tier consulting firms to extol the virtues of their innovative, unique, high speed, smart software
- Look for another line of work in search engine optimization, direct sales for a tool and die company, or check out Facebook.
Stephen E Arnold, August 29, 2016
August 8, 2016
Many clear nights ago, Lucid Imagination offered an open source enterprise search solution. Presidents came. Presidents went. Lucid Imagination morphed into LucidWorks. I promptly referred to the company in this way: Lucid works, really?
The firm embraced Spark and did a not-unexpected pirouette into a Big Data outfit. I know. I know. Lucid Imagination is a company anchored in key word search, but this is the 21st century. Pirouettes are better than mere pivots, so Big Data it is.
I read “Big Data Brawlers: 4 Challengers to Spark” and the write up triggered some thoughts about LucidWorks. Really.
The point of the story is to identify four open source solutions which do what Spark allegedly does so darned well. Each of these challengers:
- Handles Big Data (whatever that means)
- Exploits cheap memory so there are no slug like disk writes
- Does the old school batch processing thing.
What are the “challengers” to Spark? Here are the contenders:
- Apache Apex. Once proprietary, now open source, the software does micro batching for almost, sort of real time functions
- Heron. Another real time solution with spouts and bolts. Excited?
- Apache Flink. This is an open source library with a one two punch: It does the Flink stuff and the Spark stuff.
- Onyx. This is a distributed computation system which will appeal to the Java folks.
What do these Spark alternatives have to do with LucidWorks, really? I think there is going to be one major impact. LucidWorks will have to spend or invest in supporting whatever becomes the next big thing. Recommind hit a glass ceiling with its business model. LucidWorks may be bumping into the open source sky light. Instead of being stopped, LucidWorks has to keep investing to keep pace with what the community driven folks generate with little thought to the impact on companies trying to earn a living with open source.
Stephen E Arnold, August 8, 2016
July 1, 2016
Elastic’s Elasticsearch has become one of the go to open source search and retrieval solutions. Based on Lucene, the system has put the heat on some of the other open source centric search vendors. However, search is a tricky beastie.
Navigate to “AWS Elasticsearch Service Woes” to get a glimpse of some of the snags which can poke holes in one’s rip stop hiking garb. The problems are not surprising. One does not know what issues will arise until a search system is deployed and the lucky users are banging away with their queries or a happy administrator discovers that Button A no longer works.
The write up states:
We kept coming across OOM issues due the JVMMemoryPresure spiking and inturn the ES service kept crapping out. Aside from some optimization work, we’d more than likely have to add more boxes/resources to the cluster which then means more things to manage. This is when we thought, “Hey, AWS have a service for this right? Let’s give that a crack?!”. As great as having it as a service is, it certainly comes with some fairly irritating pitfalls which then causes you to approach the situation from a different angle.
One approach is to use templates to deal with the implementation of shard management in AWS Elasticsearch. Sample templates are provided in the write up. The fix does not address some issues. The article provides a link to a reindexing tool called es-tool.
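For readers who have not seen an index template, the sketch below builds one in Python. It does not reproduce the templates from the write up. The index pattern and the shard and replica counts are hypothetical, and the top-level "template" key is the form used by the Elasticsearch releases current in 2016; later releases use a different key for the pattern.

```python
import json


def build_index_template(pattern, shards, replicas):
    """Build a template body that fixes shard settings for matching indices."""
    return {
        "template": pattern,  # index name pattern the template applies to
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
    }


if __name__ == "__main__":
    # Hypothetical values: daily log indices with two shards and one replica.
    body = build_index_template("logs-*", shards=2, replicas=1)
    print(json.dumps(body, indent=2))
```

The resulting JSON would be sent to the cluster’s `_template` endpoint so that every new index matching the pattern picks up the shard settings without manual intervention.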
The most interesting comment in the article in my opinion is:
In hindsight I think it may have been worth potentially sticking with and fleshing out the old implementation of Elasticsearch, instead of having to fudge various things with the AWS ES service. On the other hand it has relieved some of the operational overhead, and in terms of scaling I am literally a couple of clicks away. If you have large amounts of data you pump into Elasticsearch and you require granular control, AWS ES is not the solution for you. However if you need a quick and simple Elasticsearch and Kibana solution, then look no further.
My takeaway is to do some thinking about the strengths and weaknesses of Amazon AWS before chopping through the Bezos cloud jungle.
Stephen E Arnold, July 1, 2016
March 30, 2016
Quite a few outfits embrace open source. There are a number of reasons:
- It is cheaper than writing original code
- It is less expensive than writing original code
- It is more economical than writing original code.
The article “Microsoft is Pretending to be a FOSS Company in Order to Secure Government Contracts With Proprietary Software in ‘Open’ Clothing” reminded me that there is another reason.
I know that IBM has snagged Lucene and waved its once magical wand over the information access system and pronounced, “Watson.” I know that deep inside the kind, gentle heart of Palantir Technologies, there are open source bits. And there are others.
The write up asserted:
For those who missed it, Microsoft is trying to EEE GNU/Linux servers amid Microsoft layoffs; selfish interests of profit, as noted by some writers [1,2] this morning, nothing whatsoever to do with FOSS (there’s no FOSS aspect to it at all!) are driving these moves. It’s about proprietary software lock-in that won’t be available for another year anyway. It’s a good way to distract the public and suppress criticism with some corny images of red hearts.
The other interesting point I highlighted was:
reject the idea that Microsoft is somehow “open” now. The European Union, the Indian government and even the White House now warm up to FOSS, so Microsoft is pretending to be FOSS. This is protectionism by deception from Microsoft and those who play along with the PR campaign (or lobbying) are hurting genuine/legitimate FOSS.
With some government statements of work requiring “open” technologies, Microsoft may be doing what other firms have been doing for a while. See points one to three above. Microsoft is just late to the accountants’ party.
Why not replace the SharePoint search thing with an open source solution? What about the $1.2 billion MSFT paid for the fascinating Fast Search & Transfer technology in 2008? It works just really well, right?
Stephen E Arnold, March 30, 2016
March 29, 2016
Short honk: Put your code hat on. “Mining Mailboxes with Elasticsearch and Kibana” walks a reader through using open source technology to do text analysis. The example under the microscope is email, but the method will work for any text corpus ingested by Elasticsearch. The write up includes code samples and enough explanation to get the Elastic system moving forward. Visualizations are included. These make it easy to spot certain trends; for example, the top recipients of the email analyzed for the tutorial. Worth a look.
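The tutorial’s own code is not reproduced here. As a rough sketch of the preparation step, the Python below parses raw messages with the standard library, turns each into a flat document of the sort one might index, and counts the top recipients locally. The field names and sample addresses are assumptions for illustration.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def message_to_doc(raw):
    """Parse one raw RFC 822 message into a flat dictionary."""
    msg = message_from_string(raw)
    # getaddresses splits comma-separated header values into (name, address) pairs.
    recipients = [addr for _, addr in getaddresses(msg.get_all("To", []))]
    return {
        "sender": msg.get("From", ""),
        "recipients": recipients,
        "subject": msg.get("Subject", ""),
    }


def top_recipients(docs, k=3):
    """Return the k most frequent recipient addresses with their counts."""
    counts = Counter(addr for doc in docs for addr in doc["recipients"])
    return counts.most_common(k)


if __name__ == "__main__":
    raw_messages = [
        "From: a@example.com\nTo: b@example.com, c@example.com\nSubject: hi\n\nbody",
        "From: c@example.com\nTo: b@example.com\nSubject: re\n\nbody",
    ]
    docs = [message_to_doc(raw) for raw in raw_messages]
    print(top_recipients(docs, k=1))
```

In the Elasticsearch version of this exercise, the counting would be done by an aggregation over the indexed documents rather than by a local Counter.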
Stephen E Arnold, March 29, 2016
March 26, 2016
A number of search and content processing vendors suggest their information access system can function as a framework. The idea is that search is more than a utility function.
If the information in the article “Abusing Elasticsearch as a Framework” is spot on, a non search vendor may have taken an important step toward turning an assertion into a reality.
The article states:
Crate is a distributed SQL database that leverages Elasticsearch and Lucene. In it’s infant days it parsed SQL statements and translated them into Elasticsearch queries. It was basically a layer on top of Elasticsearch.
The idea is that the framework reuses Elasticsearch’s discovery, master election, replication, etc., along with the Lucene search and indexing operations.
Crate, the framework, is a distributed SQL database “that leverages Elasticsearch and Lucene.”
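What does “parsed SQL statements and translated them into Elasticsearch queries” look like in miniature? The toy Python below handles exactly one statement shape and emits a term query. It is not Crate’s parser; the regular expression, the function name, and the output layout are assumptions made for the example.

```python
import re

# Accepts only: SELECT * FROM <table> WHERE <column> = '<value>'
_STATEMENT = re.compile(
    r"^\s*SELECT\s+\*\s+FROM\s+(\w+)\s+WHERE\s+(\w+)\s*=\s*'([^']*)'\s*;?\s*$",
    re.IGNORECASE,
)


def sql_to_es(sql):
    """Translate one narrow SQL statement shape into a query dictionary."""
    match = _STATEMENT.match(sql)
    if match is None:
        raise ValueError("unsupported statement: " + sql)
    table, column, value = match.groups()
    # The table name becomes the index; the equality test becomes a term query.
    return {"index": table, "body": {"query": {"term": {column: value}}}}


if __name__ == "__main__":
    print(sql_to_es("SELECT * FROM users WHERE name = 'megan'"))
```

A real SQL layer needs a proper grammar, type handling, and query planning, which is presumably why the article describes the thin-layer approach as belonging to the product’s infant days.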
Stephen E Arnold, March 26, 2016
February 10, 2016
The Dark Web has many layers of sites and services, as the metaphor provided in the .onion extension suggests. “List of secure Dark Web email providers in 2016” was recently published on Freedom Hacker to detail and review the Dark Web email providers currently available. These services, typically offering both free and pro account versions, facilitate emailing without any third-party services. That even means you can forget any hidden Google scripts, fonts or trackers. According to this piece,
“All of these email providers are only accessible via the Tor Browser, an anonymity tool designed to conceal the end users identity and heavily encrypt their communication, making those who use the network anonymous. Tor is used by an array of people including journalists, activists, political-dissidents, government-targets, whistleblowers, the government and just about anyone since it’s an open-source free tool. Tor provides a sense of security in high-risk situations and is often a choice among high-profile targets. However, many use it day-to-day as it provides identity concealment seamlessly.”
We are intrigued by the proliferation of these services and their users. While usage numbers are not reported in this article, the write-up of the author’s top five email applications indicates there are enough services available to warrant reviews. Equally interesting will be the response by companies on the clearweb, or the .com and other regular sites, not to mention how the government and intelligence agencies will interact with this burgeoning ecosystem.
Megan Feil, February 10, 2016
January 15, 2016
The gift giver this time is Baidu. Navigate to “Baidu Open-Sources Its WARP-CTC Artificial Intelligence Software.” Baidu’s method is called the connectionist temporal classification or CTC method. Is the innovation from the Middle Kingdom? Nah. Switzerland. You know, the country where Einstein whacked away with his so so computational skills.
According to the write up:
The CTC approach involves recurrent neural networks (RNNs), an increasingly common component used for a type of AI called deep learning. Recurrent nets have been shown to work well even in noisy environments.
Have at the code, gentle reader. The link is https://github.com/baidu-research/warp-ctc
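For a feel of what CTC does with a recurrent net’s frame-by-frame output, the small Python sketch below shows the collapsing rule at the heart of the method: merge repeated symbols, then drop the blank symbol. It is not taken from warp-ctc, which computes the training loss; the blank character and the function name are assumptions for illustration.

```python
BLANK = "_"  # hypothetical blank symbol for this example


def ctc_collapse(path, blank=BLANK):
    """Collapse a per-frame label path into an output string.

    Consecutive repeats are merged first, then blanks are removed, so a
    blank between two identical symbols keeps both of them.
    """
    output = []
    previous = None
    for symbol in path:
        if symbol != previous and symbol != blank:
            output.append(symbol)
        previous = symbol
    return "".join(output)


if __name__ == "__main__":
    print(ctc_collapse("hh_e_ll_lo"))
```

The blank is what allows a doubled letter to survive: in the sample path the two runs of “l” are separated by a blank, so both are kept.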
Stephen E Arnold, January 14, 2016
January 10, 2016
I read “16 for 16: What You Must Know about Hadoop and Spark Right Now.” I like the “right now.” Urgency. I am not sure I feel too much urgency at the moment. I will leave that wonderful feeling to the executives who have sucked in venture money and have to find a way to generate revenue in the next 11 months.
The article runs down the basic generalizations associated with each of these open source data management components:
- Hadoop Distributed File System (HDFS)
- Ambari/Cloudera Manager
What the list tells me is two things. First, open source data tools are proliferating. Second, there will have to be quite a few committed developers to keep these projects afloat.
The write up is not content with this shopping list. The intrepid reader will have an opportunity to learn a bit about:
As the write up swoops to its end point, I learned about some open source projects which are a bit of a disappointment; for example, Oozie and Tez.
The key point of the article is that Google’s MapReduce, which is pretty long in the tooth, is now effectively marginalized.
The Balkanization of data management is evident. The challenge will be to use one or more of these technologies to make some substantial revenue flow.
What happens if a company jumps on the wrong bandwagon as it leaves the parade ground? I would suggest that it may be more like a Pig than an Atlas. The investors will change from Rangers looking for profits to Pythons ready to strike. A Spark can set fire to some hopes and dreams in the Hive. Poorly constructed walls of Databricks can come falling down. That will be an Oozie.
Dear old Oracle, DB2, and SQLServer will just watch.
Stephen E Arnold, January 10, 2016