A Little Lucene History
March 26, 2015
Instead of venturing to Wikipedia to learn about Lucene’s history, visit the Parse.ly blog and read the post, “Lucene: The Good Parts.” After detailing how Doug Cutting created Lucene in 1999, the post describes how searching through SQL in the early 2000s was a huge task. SQL databases are not the best when it comes to unstructured search, so developers installed Lucene to make SQL document search more reliable. What is interesting is how much it has been adopted:
“At the time, Solr and Elasticsearch didn’t yet exist. Solr would be released in one year by the team at CNET. With that release would come a very important application of Lucene: faceted search. Elasticsearch would take another 5 years to be released. With its recent releases, it has brought another important application of Lucene to the world: aggregations. Over the last decade, the Solr and Elasticsearch packages have brought Lucene to a much wider community. Solr and Elasticsearch are now being considered alongside data stores like MongoDB and Cassandra, and people are genuinely confused by the differences.”
The post is worth reading if you need a refresher or a brief overview of how Lucene works, related jargon, tips for using it in big data projects, and a few more tricks. Lucene might just be a Java library, but it makes using databases much easier. We have said for a while that information is only useful if you can find it easily. Lucene made information search and retrieval much simpler and more accurate. It laid the groundwork for the current big data boom.
Whitney Grace, March 26, 2015
Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com
Big Data and Their Interesting Processes
March 25, 2015
I love it when mid tier consultants wax enthusiastic about Big Data. Search your data lake, enjoins one clueless marketer. Big Data is the future, sings a self appointed expert. Yikes.
To get a glimpse of exactly what has to be done to process certain types of Big Data in an economical yet timely manner, I suggest you read “Analytics on the Cheap.” The author is 0X74696D. Get it?
The write up explains the procedures required to crunch data and manage the budget. The work flow process I found interesting is:
- Incoming message passes through our CDN to pick up geolocation headers
- Message has its session authenticated (this happens at our routing layer in Nginx/OpenResty)
- Message is routed to an ingest server
- Ingest server transforms message and headers into a single character-delimited querystring value
- Ingest server makes a HTTP GET to a 0-byte file on S3 with that querystring
- The bucket on S3 has S3 logging turned on.
- We ingest the S3 logs directly into Redshift on a daily basis.
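Steps four and five of the list above can be sketched in a few lines of Python. This is only an illustration, not the author's code: the bucket URL, the `|` delimiter, the `q` parameter name, and the field list are all assumptions made for the example, and no request is actually sent.

```python
from urllib.parse import quote, unquote, urlencode, urlsplit, parse_qs

# All of these names are hypothetical; the write up does not specify them.
DELIMITER = "|"
TRACKING_URL = "https://example-bucket.s3.amazonaws.com/t.gif"  # stands in for the 0-byte file
FIELDS = ("session_id", "event", "country", "city")


def encode_record(message, headers):
    """Flatten message and header fields into one delimited string."""
    merged = {**headers, **message}
    values = []
    for name in FIELDS:
        value = str(merged.get(name, ""))
        # Percent-encode each value so a delimiter inside it cannot split the field.
        values.append(quote(value, safe=""))
    return DELIMITER.join(values)


def build_tracking_url(message, headers):
    """Build the GET URL whose querystring ends up in the S3 access log."""
    return TRACKING_URL + "?" + urlencode({"q": encode_record(message, headers)})


def decode_record(url):
    """Recover the fields from a logged URL, as a later load step might."""
    raw = parse_qs(urlsplit(url).query, keep_blank_values=True)["q"][0]
    return dict(zip(FIELDS, (unquote(v) for v in raw.split(DELIMITER))))


if __name__ == "__main__":
    url = build_tracking_url(
        {"session_id": "abc123", "event": "page_view"},
        {"country": "US", "city": "New York"},
    )
    print(url)
    print(decode_record(url))
```

The point of the design is that the ingest server never writes to a database itself. The GET request is the write, and S3 logging does the storage.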
The write up then provides code snippets and some business commentary. The author also identifies the upside of the approach used.
Why is this important? It is easy to talk about Big Data. Looking at what is required to make use of Big Data reveals the complexity of the task.
Keep this hype versus real world split in mind the next time you listen to a search vendor yak about Big Data.
Stephen E Arnold, March 25, 2015
Modus Operandi Gets a Big Data Storage Contract
March 24, 2015
The US Missile Defense Agency awarded Modus Operandi a huge government contract to develop an advanced data storage and retrieval system for the Ballistic Missile Defense System. Modus Operandi specializes in big data analytic solutions for national security and commercial organizations. Modus Operandi posted a press release on their Web site to share the news, “Modus Operandi Awarded Contract To Develop Advanced Data Storage And Retrieval System For The US Missile Defense Agency.”
The contract is a Phase I Small Business Innovation Research (SBIR) award, under which Modus Operandi will work on the BMDS Analytic Semantic System (BASS). The BASS will replace the old legacy system and update it to be compliant with social media communities, the Internet, and intelligence.
“ ‘There has been a lot of work in the areas of big data and analytics across many domains, and we can now apply some of those newer technologies and techniques to traditional legacy systems such as what the MDA is using,’ said Dr. Eric Little, vice president and chief scientist, Modus Operandi. ‘This approach will provide an unprecedented set of capabilities for the MDA’s data analysts to explore enormous simulation datasets and gain a dramatically better understanding of what the data actually means.’ ”
It is worrisome that the missile defense system is relying on an old legacy system, but at least it is being upgraded now. Modus Operandi also sells Cyber OSINT and they are applying this technology in an interesting way for the government.
Whitney Grace, March 24, 2015
Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com
Digital Shadows Searches the Shadow Internet
March 23, 2015
The deep Web is not hidden from Internet users, but regular search engines like Google and Bing do not index it in their results. Security Affairs reported on a new endeavor to search the deep Web in the article, “Digital Shadows Firm Develops A Search Engine For The Deep Web.” Memex and Flashpoint are two search engine projects that are already able to scan the deep Web. Digital Shadows, a British cyber security firm, is working on another search engine specially designed to search the Tor network.
The CEO of Digital Shadows, Alistair Paterson, describes the project as Google for Tor. It was made for:
“Digital Shadows developed the deep Web search engine to offer its services to private firms to help them identifying cyber threats or any other illegal activity that could represent a threat.”
While private firms will need and want this software to detect illegal activities, law enforcement officials currently need deep Web search tools more than other fields. They use them to track fraud, drug and sex trafficking, robberies, and contraband. Digital Shadows is creating a product that is part of a growing industry. The company will not only make a profit, but also help people at the same time.
Whitney Grace, March 23, 2015
Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com
Apache Samza Revamps Databases
March 19, 2015
Databases have advanced far beyond the basic relational databases. They need to be consistently managed and have real-time updates to keep them useful. The Apache Software Foundation developed the Apache Samza software to help maintain an asynchronous stream processing network. Samza was made in conjunction with Apache Kafka.
If you are interested in learning how to use Apache Samza, the Confluent blog posted “Turning The Database Inside-Out With Apache Samza” by Martin Kleppmann. Kleppmann recorded a seminar he gave at Strange Loop 2014 that explains how Samza can improve many features of a database:
“This talk introduces Apache Samza, a distributed stream processing framework developed at LinkedIn. At first it looks like yet another tool for computing real-time analytics, but it’s more than that. Really it’s a surreptitious attempt to take the database architecture we know, and turn it inside out. At its core is a distributed, durable commit log, implemented by Apache Kafka. Layered on top are simple but powerful tools for joining streams and managing large amounts of data reliably.”
Learning new ways to improve database features and functionality always improves your skill set. Apache software also forms the basis for many open source projects and startups. Martin Kleppmann’s talk might give you a brand new idea or at least improve your database.
Whitney Grace, March 20, 2015
Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com
Give Employees the Data they Need
March 19, 2015
A classic quandary: will it take longer to reinvent a certain proverbial wheel, or to find the documentation from the last time one of your colleagues reinvented it? That all depends on your organization’s search system. An article titled “Help Employees to ‘Upskill’ with Access to Information” at DataInformed makes the case for implementing a user-friendly, efficient data-management platform. Writer Diane Berry, not coincidentally a marketing executive at enterprise-search company Coveo, emphasizes that re-covering old ground can really sap workers’ time and patience, ultimately impacting customers. Employees simply must be able to quickly and easily access all company data relevant to the task at hand if they are to do their best work. Berry explains why this is still a problem:
“Why do organizations typically struggle with implementing these strategies? It revolves around two primary reasons. The first reason is that today’s heterogeneous IT infrastructures form an ‘ecosystem of record’ – a collection of newer, cloud-based software; older, legacy systems; and data sources that silo valuable data, knowledge, and expertise. Many organizations have tried, and failed, to centralize information in a ‘system of record,’ but IT simply cannot keep up with the need to integrate systems while also constantly moving and updating data. As a result, information remains disconnected, making it difficult and time consuming to find. Access to this knowledge often requires end-users to conduct separate searches within disconnected systems, often disrupting co-workers by asking where information may be found, and – even worse – moving forward without the knowledge necessary to make sound decisions or correctly solve the problem at hand.
“The second reason is more cultural than technological. Overcoming the second roadblock requires an organization to recognize the value of information and knowledge as a key organizational asset, which requires a cultural shift in the company.”
Fair enough; she makes a good case for a robust, centralized data-management solution. But what about that “upskill” business? Best I can tell, it seems the term is not about improving skills, but about supplying employees with resources they need to maximize their existing skills. The term was a little confusing to me, but I can see how it might be catchy. After all, marketing is the author’s forte.
Cynthia Murrell, March 19, 2015
Stephen E Arnold, Publisher of CyberOSINT at www.xenky.com
Attivio Does the Hadoop the Loop
March 9, 2015
What does a company founded by former Fast Search & Transfer executives do? Attivio took a reasonable path:
- Present the company’s mash up of open source and proprietary code as a report generator that answered questions
- Put search in a subordinate role to the report output
- Bang the drum about the upside of the approach in order to attract millions in venture funding
- Replace the Fast founders with hardier stock
- Unveil the new Attivio as a Big Data and Discovery platform.
The transformation took from 2007 until I read the official announcement in this write up “Attivio Previews Big Data Profiling & Discovery Platform at Strata + Hadoop World 2015.”
The question is, “Will the Fast DNA go gently into the good night?” My hunch is that Attivio’s founders realized that search was not the killer app. Fast Search during its spectacular implosion learned that talking about a “platform” was different from delivering a functioning platform.
Attivio tried to avoid that error. According to the write up:
Attivio, Inc., the software company reinventing enterprise search and Big Data discovery, today announced that it will showcase its new Big Data Profiling and Discovery Platform at Strata + Hadoop World 2015. Demonstrations of the Big Data Profiling and Discovery Platform will take place at booth #1136 in the main exhibit hall.
After eight years in business, some stakeholders may be looking for a solid payback. With the discovery and Big Data market choked with companies offering knock out services, Attivio may face some challenges.
One of these is the fact that Hortonworks, one of the cheerleaders for Big Data systems based loosely on Google’s approach from 2002 and 2003, missed its revenue target. If “Hortonworks Q4 Misses on Revenue” is accurate, the Big Data market could be one of those fanciful confections that enthrall pundits, mid tier consulting firms, and former enterprise search wizards.
Hadoop is morphing into other types of software. For me, this looks like a reprise of the Fast Search strategy: Start with something familiar and then add software widgets until people start to buy. Once a deal is closed, assemble the solution. Rinse and repeat.
What could go wrong?
Stephen E Arnold, March 9, 2015
More Open Source Search Excitement: Solr Flare Erupts
February 20, 2015
I read “Yonik Seeley, Creator of Apache Solr Search Engine Joins Cloudera.” Most personnel moves in the search and retrieval sector are ho hum events. Seeley’s jump from Heliosearch to Cloudera may disrupt activities a world away from the former Lucid Imagination now chasing Big Data under the moniker “LucidWorks.” I write the company’s name as LucidWorks (Really?) because the company has undergone some Cirque du Soleil moves since the management revolving door was installed.
Seeley was one of the founders and top engineers at Lucid. Following his own drum beat, he formed his own company to support Solr. In my opinion, Seeley played a key role in shaping Solr into a reasonable alternative to proprietary findability solutions like Endeca. With Seeley at Cloudera, Lucid’s vision of becoming the search solution for Hadoop-like data management systems may suffer a transmission outage. I think of this as a big Solr flare.
Cloudera will move forward and leverage Seeley’s expertise. It is possible that Lucid will move out of the Big Data orbit and find a way to generate sustainable revenues. However, Cloudera now has an opportunity to add some fuel to its solutions.
For me, the Seeley move is good news for Cloudera. For Lucid, it is yet another challenge. I think the Lucid operation is still dazed after four or five years of sharp blows to the corporate body.
The patience of Lucid’s investors may be tested again. The management issues, the loss of a key executive to Amazon, the rise of Elasticsearch, and now the Seeley shift in orbit: these are the times that may try the souls of those who expect a payoff from their investments in Lucid’s open source dream. Cloudera and Elasticsearch are now companies with a fighting chance to become the next RedHat. Really.
Stephen E Arnold, February 20, 2015
Statistics, Statistics. Disappointing Indeed
February 16, 2015
At dinner on Saturday evening, a medical research professional mentioned that reproducing results from tests conducted in the researcher’s lab was tough. I think the buzzword for this is “non reproducibility.” The question was asked, “Perhaps the research is essentially random?” There were some furrowed brows. My reaction was, “How does one know what’s what with experiments, data, or reproducibility tests?” The table talk shifted to a discussion of Saturday Night Live’s 40th anniversary. Safer ground.
Navigate to “Science’s Significant Stat Problem.” The article makes clear that 2013 thinking may have some relevance today. Here’s a passage I highlighted in pale blue:
Scientists use elaborate statistical significance tests to distinguish a fluke from real evidence. But the sad truth is that the standard methods for significance testing are often inadequate to the task.
There you go. And the supporting information for this statement?
One recent paper found an appallingly low chance that certain neuroscience studies could correctly identify an effect from statistical data. Reviews of genetics research show that the statistics linking diseases to genes are wrong far more often than they’re right. Pharmaceutical companies find that test results favoring new drugs typically disappear when the tests are repeated.
For the math inclined the write up offers:
It’s like flipping coins. Sometimes you’ll flip a penny and get several heads in a row, but that doesn’t mean the penny is rigged. Suppose, for instance, that you toss a penny 10 times. A perfectly fair coin (heads or tails equally likely) will often produce more or fewer than five heads. In fact, you’ll get exactly five heads only about a fourth of the time. Sometimes you’ll get six heads, or four. Or seven, or eight. In fact, even with a fair coin, you might get 10 heads out of 10 flips (but only about once for every thousand 10-flip trials). So how many heads should make you suspicious? Suppose you get eight heads out of 10 tosses. For a fair coin, the chances of eight or more heads are only about 5.5 percent. That’s a P value of 0.055, close to the standard statistical significance threshold. Perhaps suspicion is warranted.
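The coin arithmetic in that passage is easy to check. Here is a minimal Python sketch using the binomial formula; the function names are mine, not the article's, and the defaults assume a fair coin tossed 10 times as in the quote.

```python
from math import comb


def prob_exactly(heads, flips=10, p=0.5):
    """Probability of exactly `heads` heads in `flips` tosses."""
    return comb(flips, heads) * p**heads * (1 - p) ** (flips - heads)


def prob_at_least(heads, flips=10, p=0.5):
    """Probability of `heads` or more heads: the one-sided P value."""
    return sum(prob_exactly(k, flips, p) for k in range(heads, flips + 1))


if __name__ == "__main__":
    print(f"exactly 5 heads:  {prob_exactly(5):.4f}")   # 0.2461, about a fourth
    print(f"8 or more heads:  {prob_at_least(8):.4f}")  # 0.0547, the 5.5 percent
    print(f"10 heads in a row: {prob_exactly(10):.6f}")  # 0.000977, about 1 in 1,000
```

The numbers match the quote: 252 of the 1,024 equally likely outcomes have exactly five heads, 56 have eight or more, and one has all ten.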
Now the kicker:
And there’s one other type of paper that attracts journalists while illustrating the wider point: research about smart animals. One such study involved a fish—an Atlantic salmon—placed in a brain scanner and shown various pictures of human activity. One particular spot in the fish’s brain showed a statistically significant increase in activity when the pictures depicted emotional scenes, like the exasperation on the face of a waiter who had just dropped his dishes. The scientists didn’t rush to publish their finding about how empathetic salmon are, though. They were just doing the test to reveal the quirks of statistical significance. The fish in the scanner was dead.
How are those Big Data analyses working out, folks?
Stephen E Arnold, February 16, 2015
VVVVV and Big Data
February 7, 2015
Somewhere along the line a marketer cooked up volume, variety, and velocity to describe Big Data. Well, VVV is good but now we have VVVVV. Want to know more about “value” and “veracity”? Navigate to “2 More Big Data V’s—Value and Veracity.” The new Vs are slippery. How does one demonstrate value? The write up does not nail down the concept. There are MBA type references to ROI, use cases, and brand. Not much numerical evidence or a credible analytic foundation is presented. Isn’t “value” a matter of perception? Numbers may not be needed.
Veracity is also a bit mushy. What about Brian Williams and his oft repeated “conflation”? What about marketing collateral for software vendors in search of a sale?
I typed 25 and moved on. Neither a big number nor much in the way of big data.
Stephen E Arnold, February 7, 2015