
Metaphor of the Week: Hadoop Is Not Like a City

August 30, 2015

Navigate to “Five Open Source Big Data Projects to Watch.” You will learn a little about Flink, Samza, Ibis, an incubating Twill, and Mahout-Samsara. That’s just the starter. The write up had a heck of a finisher.

Here’s the passage I highlighted in bright orange:

If this small sampling of some of the many Big Data open source projects out there shows anything, it’s that Hadoop isn’t merely like a city, but rather a major metropolitan area. It has its suburbs, where its mayor has no jurisdiction, and where political beliefs may differ from those in the center of town. But it has its core character and it must be treated as a market in its own right. Practitioners have to approach “greater” Hadoop, not just the core project itself, or they risk missing trends in its adoption and evolution.

Isn’t open source great? No wonder mid tier consultants tippy toe around open source technology. Volunteers. Who needs them in Consultingland?

Stephen E Arnold, August 30, 2015

Quote to Note: Confluent

August 20, 2015

I read “Meet Confluent, The Big-Data Startup That Has Silicon Valley Buzzing.” Confluent can keep “the data flowing at some of the biggest and most information-rich firms in Silicon Valley.” The company uses Apache Kafka to deliver its value to customers.

Here’s the passage I noted:

Experts suggest Confluent’s revenue could approach $10 million next year and pass $50 million in 2017. The company could echo the recent success of another open-source darling, Docker, which has turned record adoption of its computing tools called “containers” into a growing enterprise suite and a $1 billion valuation. Confluent is likely worth about one-sixth that today but not for long. “Every person we hire uncovers millions of dollars in sales,” says early investor Eric Vishria of Benchmark. “There’s real potential [for Confluent] to be an enterprise phenomenon.”

I noted the congruence of Docker and Confluent. I enjoyed the word “every”. Categorical affirmatives are thrilling. I also liked “phenomenon.” The article’s omission of a reference to Palantir surprised me.

Nevertheless, I have a question: “Has another baby unicorn been birthed?” According to Crunchbase, the company has raised more than $50 million. With 17 full time employees, Confluent may be hiring. Perhaps some lucid engineers will see the light?

Stephen E Arnold, August 20, 2015

Watson: Following in the Footsteps of America Online with PR, not CD ROMs

July 31, 2015

I am now getting interested in the marketing efforts of IBM Watson’s professionals. I have written about some of the items which my Overflight system snags.

I have gathered a handful of gems from the past week or so. As you peruse these items, remember several facts:

  • Watson is Lucene, home brew scripts, and acquired search utilities like Vivisimo’s clustering and de-duplicating technology
  • IBM said that Watson would be a multi billion dollar business and then dropped that target from 10 or 12 Autonomy scale operations to something more modest. How modest the company won’t say.
  • IBM has tallied a baker’s dozen of quarterly reports with declining revenues
  • IBM’s reallocation of employee resources continues as IBM is starting to run out of easy ways to trim expenses
  • The good old mainframe is still a technology wonder, and it produces something Watson only dreams about: Profits.

Here we go. Remember high school English class and the “willing suspension of disbelief.” Keep that in mind, please.

ITEM 1: “IBM Watson to Help Cities Run Smarter.” The main assertion, which comes from unicorn land, is: “Purple Forge’s ‘Powered by IBM Watson’ solution uses Watson’s question answering and natural language processing capabilities to let users ask questions and get evidence-based answers using a website, smartphone or wearable devices such as the Apple Watch, without having to wait for a call agent or a reply to an email.” There you go. Better customer service. Aren’t governments supposed to serve their citizens? Does the project suggest that city governments are not performing this basic duty? Smarter? Hmm.

ITEM 2: “Why I’m So Excited about Watson, IBM’s Answer Man.” In this remarkable essay, an “expert” explains that the president of IBM explained to a TV interviewer that IBM was being “reinvented.” Here’s the quote that I found amusing: “IBM invented almost everything about data,” Rometty insisted. “Our research lab was the first one ever in Silicon Valley. Creating Watson made perfect sense for us. Now he’s ready to help everyone.” Now the author is probably unaware that I was, lo, these many years ago, involved with IBM’s Herb Noble, who was struggling to make IBM’s own and much loved STAIRS III work. I wish to point out that Silicon Valley research did not have its hands on the steering wheel when it came to the STAIRS system. In fact, the job of making this puppy work fell to IBM folks in Germany as I recall.

ITEM 3: “IBM Watson, CVS Deal: How the Smartest Computer on Earth Could Shake Up Health Care for 70m Pharmacy Customers.” Now this is an astounding chunk of public relations output. I am confident that the author is confident that “real journalism” was involved. You know: Interviewing, researching, analyzing, using Watson, talking to customers, etc. Here’s the passage I highlighted: “One of the most frustrating things for patients can be a lack of access to their health or prescription history and the ability to share it. This is one of the things both IBM and CVS officials have said they hope to solve.” Yes, hope. It springs eternal as my mother used to say.

If you find these fact filled romps through the market activating technology of Watson convincing, you may be qualified to become a Watson believer. For me, I am reminded of Charles Bukowski’s alleged quip:

The problem with the world is that the intelligent people are full of doubts while the stupid ones are full of confidence.

Stephen E Arnold, July 31, 2015

The Hadoop Spark Thing: Simple, Simple

July 30, 2015

I am fascinated with the cheerleading about open source software which makes Big Data as easy as driving a Fiat 500 through a car wash. (Make sure the wheels fit inside the automated pulley system, of course.)

Navigate to “The Big Big Data Question: Hadoop or Spark?” Be prepared to read about two—count ‘em—two systems working as smoothly as the engine in a technical high school’s auto repair class’ project car.

I want to highlight two statements in the write up.

The first is:

As I [a Big Data practitioner] mentioned, Spark does not include its own system for organizing files in a distributed way (the file system) so it requires one provided by a third-party. For this reason many Big Data projects involve installing Spark on top of Hadoop, where Spark’s advanced analytics applications can make use of data stored using the Hadoop Distributed File System (HDFS).

In short, Spark is what I call a wrapper. One uses it like a taco shell to keep the good stuff in position for real time munching.

The second is this comment:

The open source principle is a great thing, in many ways, and one of them is how it enables seemingly similar products to exist alongside each other – vendors can sell both (or rather, provide installation and support services for both), based on what their customers actually need in order to extract maximum value from their data.

What the write up omits is that there are some other bits and pieces needed; for example, how does one locate a particular string amidst the Big Data?
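
For the curious, here is a minimal sketch of the missing bit: a toy inverted index that maps each word to the documents containing it. This is an illustration only, not how Hadoop, Spark, or any vendor handles search. The function names (`build_index`, `find`) and the three sample documents are hypothetical, and real systems add tokenization rules, ranking, and distribution across nodes.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def find(index, term):
    """Return the sorted ids of the documents that contain the term."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical sample documents, invented for this sketch.
docs = {
    "a": "Spark runs on top of HDFS",
    "b": "HDFS stores the files",
    "c": "Kafka moves the data",
}
index = build_index(docs)
print(find(index, "hdfs"))
```

Even this toy is one more layer to install, feed, and troubleshoot on top of the file system and the analytics engine.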

The point, for me, is that these nested and layered systems are truly exciting to troubleshoot. Not only are there issues with the integrity of the data, there is the thrill of getting each subsystem to work and then figuring out how to get useful outputs from the digital equivalent of a Lassie’s Double Revenge sandwich from Roy’s Place, which closed its doors in 2013.

A Lassie’s Double Revenge consisted of a knockwurst, cheese, grilled onions, baked beans, and assorted seasonings served to the discerning diner.

A little like an open source Big Data mash up.

As a bonus, one gets to hire consultants who can make separate products, systems, and solutions work in a way which benefits the licensee and the system’s users.

Stephen E Arnold, July 30, 2015

Neural Networks and Thought Commands

July 22, 2015

If you’ve been waiting for the day you can operate a computer by thinking at it, check out “When Machine Learning Meets the Mind: BBC and Google Get Brainy” at the Inquirer. Reporter Chris Merriman brings our attention to two projects, one about hardware and one about AI, that stand at the intersection of human thought and machine. Neither venture is anywhere near fruition, but a peek at their progress gives us clues about the future.

The internet-streaming platform iPlayer is a service the BBC provides to U.K. residents who wish to catch up on their favorite programmes. In pursuit of improved accessibility, the organization’s researchers are working on a device that allows users to operate the service with their thoughts. The article tells us:

“The electroencephalography wearable that powers the technology requires lucidity of thought, but is surprisingly light. It has a sensor on the forehead, and another in the ear. You can set the headset to respond to intense concentration or meditation as the ‘fire’ button when the cursor is over the option you want.”

Apparently this operation is easier for some subjects than for others, but all users were able to work the device to some degree. Creepy or cool? Perhaps it’s both, but there’s no escaping this technology now.

As for Google’s undertaking, we’ve examined this approach before: the development of artificial neural networks. This is some exciting work for those interested in AI. Merriman writes:

“Meanwhile, a team of Google researchers has been looking more closely at artificial neural networks. In other words, false brains. The team has been training systems to classify images and better recognise speech by bombarding them with input and then adjusting the parameters to get the result they want.

But once equipped with the information, the networks can be flipped the other way and create an impressive interpretation of objects based on learned parameters, such as ‘a screw has twisty bits’ or ‘a fly has six legs’.”

This brain-in-progress still draws some chuckle-worthy and/or disturbing conclusions from images, but it is learning. No one knows what the end result of Google’s neural network research will be, but it’s sure to be significant. In a related note, the article points out that IBM is donating its machine learning platform to Apache Spark. Who knows where the open-source community will take it from here?

Cynthia Murrell, July 22, 2015

Sponsored by, publisher of the CyberOSINT monograph


Short Honk: Open Semantic Search Appliance

July 17, 2015

Several people have asked me about Open Semantic Search. I sent a couple of emails to the professional identified on the DNS record as the contact point. No response yet from our inquiry emails, but this is not unusual. People are so darned busy today.

The Open Semantic Search organization is offering an open semantic search appliance. The appliance is not a box like the much loved Google Search Appliance or the Maxxcat solutions. The appliance is virtual.

The explanation of the data enriching system is located at this link. The resources required are modest and, based on the information I scanned, the open semantic search appliance is a solution to many information access woes.

I will be able to search, explore, and analyze. Give the system a whirl. We will add it to our list of tasks. We assume it will present the same exciting challenges as other Lucene/Solr solutions. The addition of semantics will add a new wrinkle or two.

If you are into semantics and open source, the system may be for you.

Stephen E Arnold, July 17, 2015

Hadoop Rounds Up Open Source Goodies

July 17, 2015

Summer time is here and what better way to celebrate the warm weather and fun in the sun than with some fantastic open source tools.  Okay, so you probably will not take your computer to the beach, but if you have a vacation planned one of these tools might help you complete your work faster so you can get closer to that umbrella and cocktail.  Datamation has a great listicle focused on “Hadoop And Big Data: 60 Top Open Source Tools.”

Hadoop is one of the most adopted open source tools for providing big data solutions.  The Hadoop market is expected to be worth $1 billion by 2020 and IBM has dedicated 3,500 employees to develop Apache Spark, part of the Hadoop ecosystem.

As open source is a huge part of the Hadoop landscape, Datamation’s list provides invaluable information on tools that could mean the difference between a successful project and a failed one.  They could also save some extra cash on the IT budget.

“This area has seen a lot of activity recently, with the launch of many new projects. Many of the most noteworthy projects are managed by the Apache Foundation and are closely related to Hadoop.”

Datamation has maintained this list for a while and they update it from time to time as the industry changes.  The list isn’t ranked from best to worst; rather, the tools are grouped into categories and a short description is given to explain what each tool does. The categories include: Hadoop-related tools, big data analysis platforms and tools, databases and data warehouses, business intelligence, data mining, big data search, programming languages, query engines, and in-memory technology.  There is a tool for nearly every sort of problem that could come up in a Hadoop environment, so the listicle is definitely worth a glance.

Whitney Grace, July 17, 2015


Microsoft Takes SharePoint Criticism Seriously

July 16, 2015

Organizations are reaching the point where a shift toward mobile productivity and adoption must take place; therefore, their enterprise solution must follow suit. While Office 365 adoption has soared in light of the realization, Microsoft still has work to do in order to give users the experience that they demand from a mobile and social heavy platform. ComputerWorld goes into more details with their article, “Onus on Microsoft as SharePoint and OneDrive Roadmaps Reach Crossroads.”

The article states Microsoft’s current progress and future goals:

“With the advent of SharePoint Server 2016 (public beta expected 4Q 2015, with general availability 2Q 2016), Edwards believes Microsoft is placing renewed focus on file management, content management, sites, and portals. Going forward, Redmond claims it will also continue to develop the hybrid capabilities of SharePoint, recognizing that hybrid deployments are a steady state for many large organizations, and not just a temporary position to enable migration to the cloud.”

Few users chose to adopt the opportunities offered by Office 365 and SharePoint 2013, so Microsoft has to make SharePoint Server 2016 look like a new, enticing offering worthy of being taken seriously. So far, they have done a good job of building up some hype and attention. Stephen E. Arnold is a longtime leader in search and he has been covering the news surrounding the release. Additionally, his dedicated SharePoint feed makes it easy to catch the latest news, tips, and tricks at a glance.

Emily Rae Aldridge, July 16, 2015


Need Semantic Search: Lucidworks Asserts It Is the Answer by Golly

July 3, 2015

If you read this blog, you know that I comment on semantic technology every month or so. In June I pointed to an article which had been tweeted as “new stuff.” Wrong. Navigate to “Semantic Search Hoohah: Hakia”; you will learn that Hakia is a quiet outfit. Quiet as in no longer on the Web. Maybe gone?

There are other write ups in my free and for fee columns about semantic search. The theme has been consistent. My view is that semantic technology is one component in a modern cybernized system. (To learn about my use of the term cyber, navigate to

I find the promotion of search engine optimization as “semantic” amusing. I find the search service firms’ promotion of their semantic expertise amusing. I find the notion of open source outfits deep in hock to venture capitalists asserting their semantic wizardry amusing.

I don’t know if you are quite as amused as I am. Here’s an easy way to determine your semantic humor score. Navigate to this slideshare link and cruise through the 34 slide presentation made by one of Lucidworks’ search mavens. Lucidworks is a company I have followed since it fired up its jets with Marc Krellenstein on board. Dr. Krellenstein ejected in short order, and the company has consumed many venture dollars with management shifts, repositionings, and the Big Data thing.

We now have Lucidworks in the semantic search sector.

Here’s what I learned from the deck:

  1. The company has a new logo. I think this is the third or fourth.
  2. Search is about technology and language. Without Google’s predictive and personalized routines, words are indeed necessary.
  3. Buzzwords and jargon do not make semantic methods simple. Consider this statement from the deck, “Tokenization plus vector mathematics (TF/IDF) or one of its cousins—“bag of words” – Algorithmic tweaks – enhanced bag of words.” Got that, gentle reader. If not, check out “sausagization.”
  4. Lucidworks offers a “field cache.” Okay, I am not unfamiliar with caching in order to goose performance, which can be an issue with some open source search systems. But Searchdaimon, an open source search system developed in Norway, runs circles around Lucidworks. (My team did the benchmark test of major open source systems. Searchdaimon was the speed champ and had other sector leading characteristics as well.)
  5. Lucidworks does the ontology thing as well. The tie up of “category nodes” and “evidence nodes” may be one reason the performance goblin noses into the story.
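
For readers who did not get that, the jargon in item 3 can be unpacked in a few lines. What follows is a rough sketch of bag-of-words TF/IDF weighting, not Lucidworks’ code and not Lucene’s scoring formula. The helper names (`tokenize`, `tf_idf`) and the three sample sentences are hypothetical, and the unsmoothed idf, log(N / document frequency), is just one common textbook variant.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenization: lowercase and split on whitespace."""
    return text.lower().split()


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = log(N / number of documents containing the term)
    """
    bags = [Counter(tokenize(d)) for d in docs]  # the "bag of words"
    n = len(docs)
    df = Counter()
    for bag in bags:
        df.update(bag.keys())  # count each term once per document
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append(
            {t: (c / total) * math.log(n / df[t]) for t, c in bag.items()}
        )
    return weights


# Hypothetical sample documents, invented for this sketch.
docs = [
    "search is about language",
    "search is about technology",
    "semantic search adds jargon",
]
weights = tf_idf(docs)
for term, weight in sorted(weights[0].items()):
    print(term, round(weight, 3))
```

A term that appears in every document, like “search” here, gets a weight of zero, while a term unique to one document floats to the top. That is the whole trick before the “algorithmic tweaks” arrive.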

The problem I encountered is that the write up for the slide deck emphasized Fusion as a key component. I have been poking around the “fusion” notion as we put our new study of the Dark Web together. Fusion is a tricky problem and the US government has made fusion a priority. Keep in mind that content is more than text. There are images, videos, geocodes, cryptic tweets in Farsi, and quite a few challenging issues with making content available to a researcher or analyst.

It seems that Lucidworks has cracked a problem which continues to trouble some reasonably sophisticated folks in the content analysis business. Here’s the “evidence” that Lucidworks can do what others cannot:


This diagram shows that after a connector is available, then “pipelines proliferate.” Well, okay.

I thought the goal was to process content objects with low latency, easily, and with semantic value adds. “Lots of stages” and “index pipelines: one way query pipelines: round trip” does not compute for this addled goose.

If the Lucidworks approach makes sense to you, go for it. My team and I will stick to here and now tools and open source technology which works without the semantic jargon which is pretty much incidental to the matter. We need to process more than text. CyberOSINT vendors deliver and most use open source search as a utility function. Yep, utility. Not the main event. The failure of semantic search vendors suggests that the buzzword is not the solution to marketing woes. Pop. (That’s a pre Fourth of July celebratory ladyfinger.)

Stephen E Arnold, July 3, 2015

Forget Oracle. Think about Vendors of Proprietary Enterprise Search Systems.

June 14, 2015

Database revenue doom looms for Oracle. Who did not know that, Mr. BigTable and Ms. Spark? Navigate to “Oracle Sales Erode as Startups Embrace Souped-Up Free Software.” The write up makes this point:

The impact [use of proprietary software] shows up in Oracle’s sales of new software licenses, which have declined for seven straight quarters compared with the period a year earlier. New licenses made up 25 percent of total revenue in fiscal 2014, down from 28 percent a year earlier — a sign the company is becoming increasingly dependent on revenue from supporting and maintaining products at existing customers and having a harder time finding new business. Oracle reports fiscal fourth-quarter earnings next week. To blunt this, the Redwood City, California-based company is expanding efforts in cloud computing, which will let it sell packaged high-margin services to customers. That may help balance the slowdown in the basic business. It also operates an open-source database called MySQL.

The unarticulated issue is the word “startup.” Research we conducted and which was verified by various third party sources revealed in 2012 that open source software was getting more attention from Fortune 1000 companies. The reason was that these outfits had the resources to deal with the excitement open source software provides in a Blue Apron type package.

If this Bloomberg write up is correct, the startup crowd is stepping away from Microsoft software and other well known brands toward open source. One can raise prices in the Fortune 1000 arena for a short time. Then, as Thomson Reuters- and Reed Elsevier-type companies have learned, the big boys just go a different direction. Thus, the start up and mid sized markets become more and more important to proprietary software vendors.

When the small folks head for the hills, where’s the growth? Price increases? Me too plays? Marketing two steps?

I don’t think so.

Ergo. Trouble ahead for Oracle, but the challenges facing the down market and up market proprietary enterprise search vendors are going to become more severe if Bloomie is on the beam.

Stephen E Arnold, June 14, 2015
