
Short Honk: Crawl the Web at Scale

September 30, 2015

Short honk: I read “Aduana: Link Analysis to Crawl the Web at Scale.” The write up explains an open source project which can copy content “dispersed all over the Web.” Keep in mind that the approach focuses primarily on text. Aduana is a special back end, built on top of a data management system, for the developer’s crawl tool; its purpose is to speed up broad crawls.

According to the write up:

we wanted to locate relevant pages first rather than on an ad hoc basis. We also wanted to revisit the more interesting ones more often than the others. We ultimately ran a pilot to see what happens. We figured our sheer capacity might be enough. After all, our cloud-based platform’s users scrape over two billion web pages per month….We think Aduana is a very promising tool to expedite broad crawls at scale. Using it, you can prioritize crawling pages with the specific type of information you’re after. It’s still experimental. And not production-ready yet.

In its present form, Aduana is able to:

  • Analyze news.
  • Search locations and people.
  • Perform sentiment analysis.
  • Find companies to classify them.
  • Extract job listings.
  • Find all sellers of certain products.

The write up contains links to the relevant github information, some code snippets, and descriptive information.
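The “locate relevant pages first” idea in the quote above is, at bottom, a scored crawl frontier. Here is a minimal sketch in Python; the class, the scores, and the URLs are all invented for illustration and are not Aduana’s actual API:

```python
import heapq

class ScoredFrontier:
    """Toy crawl frontier: pop the highest-scoring URL next.

    Scores are supplied by the caller (e.g. by a link-analysis pass);
    re-adding a URL with a fresh score models "revisit the more
    interesting pages more often".
    """
    def __init__(self):
        self._heap = []      # (negated score, tie-breaker, url) min-heap
        self._counter = 0    # tie-breaker keeps pops stable for equal scores

    def add(self, url, score):
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def next_url(self):
        if not self._heap:
            return None
        _, _, url = heapq.heappop(self._heap)
        return url

frontier = ScoredFrontier()
frontier.add("", 0.2)
frontier.add("", 0.9)   # link analysis says this one is hot
frontier.add("", 0.5)

order = [frontier.next_url() for _ in range(3)]
print(order)  # highest score first
```

A real broad crawler adds politeness delays, deduplication, and persistence on top, but the prioritization itself is this simple.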

Stephen E Arnold, September 30, 2015

Spark: Another Open Source Game Changer

September 24, 2015

Gentle reader, I know that knowledge about Spark is as widespread as information about the woes of the Philadelphia Eagles. My understanding of Spark is that it is an open source engine for large scale data processing. It is faster than Hadoop. It is easy to use. It is flexible enough to allow the intrepid Spark aficionado to combine structured query language, streaming, and analytics in one software system. Spark runs “everywhere.” For more about Spark, see this Apache project page.

Spark is one of the next big things, poised to ignite innovation, consulting revenues, and vendor repositionings.

I approached “Game-Changing Real-time Uses for Apache Spark” in order to learn how Spark can change the game for real time data and information work. Game changing means that old school outfits are going to lose because the new game has new rules, new players, and new everything.

The write up identified these ways Spark will change some quite significant markets:

  • Credit card fraud detection
  • Network security
  • Genomic sequencing
  • Real time ad processing
  • Medical

My goodness, Spark will become the number one enabling technology for some very problematic market spaces.

Let’s look at what Spark will do to real time ad processing. The write up reports:

One advertising firm uses Spark, on MapR-DB, to build a real-time ad targeting platform. The system looks at user data and decides which ads to show users on the Internet based on demographic data. Since advertising is so time-sensitive, advertisers have to move fast if they want to capture mindshare. Spark Streaming is one way to help them do that.
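The decision step the quote describes, matching an incoming user event against demographic data and picking an ad, can be sketched as below. The data store, the rules, and the field names are invented for illustration; the firm in the quote runs this kind of logic as a Spark Streaming job backed by MapR-DB:

```python
# Toy version of the ad-targeting decision: look up a user's
# demographics, then pick the first ad rule that matches.
# All names and values here are hypothetical.

DEMOGRAPHICS = {            # stand-in for the user-data store
    "u1": {"age_band": "18-24", "region": "US"},
    "u2": {"age_band": "45-54", "region": "UK"},
}

AD_RULES = [                # first matching rule wins
    (lambda d: d["age_band"] == "18-24", "sneaker_ad"),
    (lambda d: d["region"] == "UK", "tea_ad"),
]
DEFAULT_AD = "generic_ad"

def choose_ad(event):
    demo = DEMOGRAPHICS.get(event["user_id"], {})
    for predicate, ad in AD_RULES:
        if demo and predicate(demo):
            return ad
    return DEFAULT_AD       # unknown users get a fallback ad

stream = [{"user_id": "u1"}, {"user_id": "u2"}, {"user_id": "u3"}]
served = [choose_ad(e) for e in stream]
print(served)
```

The time-sensitive part is not the lookup itself but doing it for millions of events per second, which is where the streaming framework earns its keep.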

What strikes me is that Spark requires programmers, software engineering, and then integration of different components. If an error manifests itself, the Spark solution may require those who embrace it to perform some old fashioned work.

In a sense, the game hasn’t changed at all. Open source software reduces license fees and provides a developer with some freedom from license restrictions. On the other hand, the difficult task of getting a complex system to work as intended remains.

My hunch is that Spark is an interesting open source project. The consultants and start ups see Spark as an opportunity. The game changing nature of Spark is potential energy, not a sure thing.

Stephen E Arnold, September 23, 2015

Metaphor of the Week: Hadoop Is Not Like a City

August 30, 2015

Navigate to “Five Open Source Big Data Projects to Watch.” You will learn a little about Flink, Samza, Ibis, an incubating Twill, and Mahout-Samsara. That’s just the starter. The write up had a heck of a finisher.

Here’s the passage I highlighted in bright orange:

If this small sampling of some of the many Big Data open source projects out there shows anything, it’s that Hadoop isn’t merely like a city, but rather a major metropolitan area. It has its suburbs, where its mayor has no jurisdiction, and where political beliefs may differ from those in the center of town. But it has its core character and it must be treated as a market in its own right. Practitioners have to approach “greater” Hadoop, not just the core project itself, or they risk missing trends in its adoption and evolution.

Isn’t open source great? No wonder mid tier consultants tippy toe around open source technology. Volunteers. Who needs them in Consultingland?

Stephen E Arnold, August 30, 2015

Quote to Note: Confluent

August 20, 2015

I read “Meet Confluent, The Big-Data Startup That Has Silicon Valley Buzzing.” Confluent can keep “the data flowing at some of the biggest and most information-rich firms in Silicon Valley.” The company uses Apache Kafka to deliver its value to customers.

Here’s the passage I noted:

Experts suggest Confluent’s revenue could approach $10 million next year and pass $50 million in 2017. The company could echo the recent success of another open-source darling, Docker, which has turned record adoption of its computing tools called “containers” into a growing enterprise suite and a $1 billion valuation. Confluent is likely worth about one-sixth that today but not for long. “Every person we hire uncovers millions of dollars in sales,” says early investor Eric Vishria of Benchmark. “There’s real potential [for Confluent] to be an enterprise phenomenon.”

I noted the congruence of Docker and Confluent. I enjoyed the word “every.” Categorical affirmatives are thrilling. I also liked “phenomenon.” The article’s omission of a reference to Palantir surprised me.

Nevertheless, I have a question: “Has another baby unicorn been birthed?” According to Crunchbase, the company has raised more than $50 million. With 17 full time employees, Confluent may be hiring. Perhaps some lucid engineers will see the light?

Stephen E Arnold, August 20, 2015

Watson: Following in the Footsteps of America Online with PR, not CD ROMs

July 31, 2015

I am now getting interested in the marketing efforts of IBM Watson’s professionals. I have written about some of the items which my Overflight system snags.

I have gathered a handful of gems from the past week or so. As you peruse these items, remember several facts:

  • Watson is Lucene, home brew scripts, and acquired search utilities like Vivisimo’s clustering and de-duplicating technology
  • IBM said that Watson would be a multi billion dollar business and then dropped that target from 10 or 12 Autonomy scale operations to something more modest. How modest the company won’t say.
  • IBM has tallied a baker’s dozen of quarterly reports with declining revenues
  • IBM’s reallocation of employee resources continues as IBM is starting to run out of easy ways to trim expenses
  • The good old mainframe is still a technology wonder, and it produces something Watson only dreams about: Profits.

Here we go. Remember high school English class and the “willing suspension of disbelief.” Keep that in mind, please.

ITEM 1: “IBM Watson to Help Cities Run Smarter.” The main assertion, which comes from unicorn land, is: “Purple Forge’s “Powered by IBM Watson” solution uses Watson’s question answering and natural language processing capabilities to let users ask questions and get evidence-based answers using a website, smartphone or wearable devices such as the Apple Watch, without having to wait for a call agent or a reply to an email.” There you go. Better customer service. Aren’t governments supposed to serve their citizens? Does the project suggest that city governments are not performing this basic duty? Smarter? Hmm.

ITEM 2: “Why I’m So Excited about Watson, IBM’s Answer Man.” In this remarkable essay, an “expert” explains that the president of IBM explained to a TV interviewer that IBM was being “reinvented.” Here’s the quote that I found amusing: “IBM invented almost everything about data,” Rometty insisted. “Our research lab was the first one ever in Silicon Valley. Creating Watson made perfect sense for us. Now he’s ready to help everyone.” Now the author is probably unaware that I was, lo, these many years ago, involved with IBM’s Herb Noble, who was struggling to make IBM’s own and much loved STAIRS III work. I wish to point out that Silicon Valley research did not have its hands on the steering wheel when it came to the STAIRS system. In fact, the job of making this puppy work fell to IBM folks in Germany as I recall.

ITEM 3: “IBM Watson, CVS Deal: How the Smartest Computer on Earth Could Shake Up Health Care for 70m Pharmacy Customers.” Now this is an astounding chunk of public relations output. I am confident that the author is confident that “real journalism” was involved. You know: Interviewing, researching, analyzing, using Watson, talking to customers, etc. Here’s the passage I highlighted: “One of the most frustrating things for patients can be a lack of access to their health or prescription history and the ability to share it. This is one of the things both IBM and CVS officials have said they hope to solve.” Yes, hope. It springs eternal as my mother used to say.

If you find these fact filled romps through the market activating technology of Watson convincing, you may be qualified to become a Watson believer. As for me, I am reminded of Charles Bukowski’s alleged quip:

The problem with the world is that the intelligent people are full of doubts while the stupid ones are full of confidence.

Stephen E Arnold, July 31, 2015

The Hadoop Spark Thing: Simple, Simple

July 30, 2015

I am fascinated with the cheerleading about open source software which makes Big Data as easy as driving a Fiat 500 through a car wash. (Make sure the wheels fit inside the automated pulley system, of course.)

Navigate to “The Big Big Data Question: Hadoop or Spark?” Be prepared to read about two—count ‘em—two systems working as smoothly as the engine in a technical high school’s auto repair class’ project car.

I want to highlight two statements in the write up.

The first is:

As I [a Big Data practitioner] mentioned, Spark does not include its own system for organizing files in a distributed way (the file system) so it requires one provided by a third-party. For this reason many Big Data projects involve installing Spark on top of Hadoop, where Spark’s advanced analytics applications can make use of data stored using the Hadoop Distributed File System (HDFS).

In short, Spark is what I call a wrapper. One uses it like a taco shell to keep the good in position for real time munching.
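The taco-shell arrangement is, in software terms, a pluggable storage backend: the engine owns the computation and delegates file access to whatever system it is handed, HDFS being the typical Hadoop pairing. A toy sketch, with all class and method names invented rather than taken from Spark’s real API:

```python
# Sketch of the layering described above: the analytics engine does
# the computing but borrows its file system from a third party,
# which is why "Spark on top of Hadoop" is such a common pairing.

class InMemoryFS:
    """Stand-in for a distributed file system such as HDFS."""
    def __init__(self, files):
        self._files = files
    def read_lines(self, path):
        return self._files[path].splitlines()

class AnalyticsEngine:
    """The 'wrapper': computation here, storage injected from outside."""
    def __init__(self, fs):
        self.fs = fs                    # third-party file system
    def word_count(self, path):
        counts = {}
        for line in self.fs.read_lines(path):
            for word in line.split():
                counts[word] = counts.get(word, 0) + 1
        return counts

fs = InMemoryFS({"/data/log.txt": "spark hadoop\nspark"})
engine = AnalyticsEngine(fs)
print(engine.word_count("/data/log.txt"))
```

Swap the file system and the engine neither knows nor cares, which is the whole appeal, and also the whole integration headache.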

The second is this comment:

The open source principle is a great thing, in many ways, and one of them is how it enables seemingly similar products to exist alongside each other – vendors can sell both (or rather, provide installation and support services for both, based on what their customers actually need in order to extract maximum value from their data).

What the write up omits is that there are some other bits and pieces needed; for example, how does one locate a particular string amidst the Big Data?

The point, for me, is that these nested and layered systems are truly exciting to troubleshoot. Not only are there issues with the integrity of the data, there is the thrill of getting each subsystem to work and then figuring out how to get useful outputs from the digital equivalent of a Lassie’s Double Revenge sandwich from Roy’s Place, which closed its doors in 2013.

A Lassie’s Double Revenge consisted of a knockwurst, cheese, grilled onions, baked beans, and assorted seasonings served to the discerning diner.

A little like an open source Big Data mash up.

As a bonus, one gets to hire consultants who can make separate products, systems, and solutions work in a way which benefits the licensee and the system’s users.

Stephen E Arnold, July 30, 2015

Neural Networks and Thought Commands

July 22, 2015

If you’ve been waiting for the day you can operate a computer by thinking at it, check out “When Machine Learning Meets the Mind: BBC and Google Get Brainy” at the Inquirer. Reporter Chris Merriman brings our attention to two projects, one about hardware and one about AI, that stand at the intersection of human thought and machine. Neither venture is anywhere near fruition, but a peek at their progress gives us clues about the future.

The internet-streaming platform iPlayer is a service the BBC provides to U.K. residents who wish to catch up on their favorite programmes. In pursuit of improved accessibility, the organization’s researchers are working on a device that allows users to operate the service with their thoughts. The article tells us:

“The electroencephalography wearable that powers the technology requires lucidity of thought, but is surprisingly light. It has a sensor on the forehead, and another in the ear. You can set the headset to respond to intense concentration or meditation as the ‘fire’ button when the cursor is over the option you want.”

Apparently this operation is easier for some subjects than for others, but all users were able to work the device to some degree. Creepy or cool? Perhaps it’s both, but there’s no escaping this technology now.
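Stripped of the hardware, the interaction the BBC describes reduces to thresholding a concentration signal while the cursor dwells on an option. A hypothetical sketch, with a made-up threshold and made-up readings:

```python
# Toy version of the 'concentrate to fire' interaction: treat the
# headset's concentration reading as a 0-100 signal and fire the
# highlighted option once the reading stays above a threshold for
# a few consecutive samples. All numbers here are invented.

FIRE_THRESHOLD = 70
DWELL_SAMPLES = 3           # consecutive samples required to fire

def select_option(readings, options, cursor_index):
    streak = 0
    for level in readings:
        streak = streak + 1 if level >= FIRE_THRESHOLD else 0
        if streak >= DWELL_SAMPLES:
            return options[cursor_index]
    return None             # concentration never held long enough

options = ["EastEnders", "Doctor Who", "Top Gear"]
readings = [40, 72, 75, 80, 55]   # concentration over time
print(select_option(readings, options, cursor_index=1))
```

The consecutive-sample requirement is what keeps a stray moment of focus from firing the wrong programme, and presumably explains why some subjects found the device easier to drive than others.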

As for Google’s undertaking, we’ve examined this approach before: the development of artificial neural networks. This is some exciting work for those interested in AI. Merriman writes:

“Meanwhile, a team of Google researchers has been looking more closely at artificial neural networks. In other words, false brains. The team has been training systems to classify images and better recognise speech by bombarding them with input and then adjusting the parameters to get the result they want.

But once equipped with the information, the networks can be flipped the other way and create an impressive interpretation of objects based on learned parameters, such as ‘a screw has twisty bits’ or ‘a fly has six legs’.”

This brain-in-progress still draws some chuckle-worthy and/or disturbing conclusions from images, but it is learning. No one knows what the end result of Google’s neural network research will be, but it’s sure to be significant. In a related note, the article points out that IBM is donating its machine learning platform to Apache Spark. Who knows where the open-source community will take it from here?
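The “flipping” trick mentioned above, holding the learned parameters fixed and adjusting the input instead to maximize a class score, can be shown with a deliberately tiny model. The two-feature “fly” detector and its weights below are invented for illustration; Google’s networks do the same thing at vastly larger scale, with gradient ascent on whole images:

```python
# Toy 'network inversion': instead of adjusting weights to fit an
# input (training), hold the weights fixed and nudge the input to
# maximize a class score. The model and weights are invented.

WEIGHTS = {"legs": 1.5, "wings": 0.8}   # a learned 'fly' detector

def fly_score(features):
    return sum(WEIGHTS[k] * features[k] for k in WEIGHTS)

def dream(features, steps=50, lr=0.1, eps=1e-3):
    f = dict(features)
    for _ in range(steps):
        for k in f:
            # finite-difference gradient of the score w.r.t. input k
            bumped = dict(f)
            bumped[k] += eps
            grad = (fly_score(bumped) - fly_score(f)) / eps
            # climb the gradient, keeping features in [0, 1]
            f[k] = min(1.0, max(0.0, f[k] + lr * grad))
    return f

start = {"legs": 0.1, "wings": 0.2}     # a nearly blank 'image'
dreamed = dream(start)
print(dreamed)   # both features pushed toward 1.0
```

Run on real images with a real network, this same loop is what produces those chuckle-worthy dream pictures: the input drifts toward whatever the network thinks “fly-ness” looks like.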

Cynthia Murrell, July 22, 2015

Sponsored by, publisher of the CyberOSINT monograph


Short Honk: Open Semantic Search Appliance

July 17, 2015

Several people have asked me about Open Semantic Search. I sent a couple of emails to the professional identified on the DNS record as the contact point. No response yet from our inquiry emails, but this is not unusual. People are so darned busy today.

The Open Semantic Search organization is offering an open semantic search appliance. The appliance is not a box like the much loved Google Search Appliance or the Maxxcat solutions. The appliance is virtual.

The explanation of the data enriching system is located at this link. The resources required are modest and based on the information I scanned, the open semantic search appliance is a solution to many information access woes.

I will be able to search, explore, and analyze. Give the system a whirl. We will add it to our list of tasks. We assume it will present the same exciting challenges as other Lucene/Solr solutions. The addition of semantics will add a new wrinkle or two.
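Since the appliance sits on Lucene/Solr, giving it a whirl will likely come down to issuing Solr-style queries. This sketch only constructs the request URL; the host, core name, and field name are hypothetical, and a real installation may expose different ones:

```python
# Build a Solr-style select URL of the kind a Lucene/Solr-based
# appliance typically answers. Host, core, and field names below
# are assumptions for illustration, not the appliance's documented API.

from urllib.parse import urlencode

def solr_query_url(base, core, text, rows=10):
    params = {
        "q": f"content_txt:{text}",  # hypothetical full-text field
        "rows": rows,
        "wt": "json",
    }
    return f"{base}/solr/{core}/select?{urlencode(params)}"

url = solr_query_url("http://localhost:8983", "opensemanticsearch",
                     "semantics", rows=5)
print(url)
```

The semantic layer presumably adds enrichment fields (entities, locations, and the like) on top of this plain keyword plumbing, which is where the new wrinkles will show up.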

If you are into semantics and open source, the system may be for you.

Stephen E Arnold, July 17, 2015

Hadoop Rounds Up Open Source Goodies

July 17, 2015

Summer time is here, and what better way to celebrate the warm weather and fun in the sun than with some fantastic open source tools?  Okay, so you probably will not take your computer to the beach, but if you have a vacation planned, one of these tools might help you complete your work faster so you can get closer to that umbrella and cocktail.  Datamation has a great listicle focused on “Hadoop And Big Data: 60 Top Open Source Tools.”

Hadoop is one of the most widely adopted open source tools for providing big data solutions.  The Hadoop market is expected to be worth $1 billion by 2020, and IBM has dedicated 3,500 employees to developing Apache Spark, part of the Hadoop ecosystem.

As open source is a huge part of the Hadoop landscape, Datamation’s list provides invaluable information on tools that could mean the difference between a successful project and a failed one.  They could also save some extra cash in the IT budget.

“This area has a seen a lot of activity recently, with the launch of many new projects. Many of the most noteworthy projects are managed by the Apache Foundation and are closely related to Hadoop.”

Datamation has maintained this list for a while, and they update it from time to time as the industry changes.  The list is not ranked from best to worst; rather, the tools are grouped into categories, and a short description explains what each tool does. The categories include: Hadoop-related tools, big data analysis platforms and tools, databases and data warehouses, business intelligence, data mining, big data search, programming languages, query engines, and in-memory technology.  There is a tool for nearly every sort of problem that could come up in a Hadoop environment, so the listicle is definitely worth a glance.

Whitney Grace, July 17, 2015

Microsoft Takes SharePoint Criticism Seriously

July 16, 2015

Organizations are reaching the point where a shift toward mobile productivity and adoption must take place; therefore, their enterprise solution must follow suit. While Office 365 adoption has soared in light of the realization, Microsoft still has work to do in order to give users the experience that they demand from a mobile and social heavy platform. ComputerWorld goes into more details with their article, “Onus on Microsoft as SharePoint and OneDrive Roadmaps Reach Crossroads.”

The article states Microsoft’s current progress and future goals:

“With the advent of SharePoint Server 2016 (public beta expected 4Q 2015, with general availability 2Q 2016), Edwards believes Microsoft is placing renewed focus on file management, content management, sites, and portals. Going forward, Redmond claims it will also continue to develop the hybrid capabilities of SharePoint, recognizing that hybrid deployments are a steady state for many large organizations, and not just a temporary position to enable migration to the cloud.”

Few users chose to adopt the opportunities offered by Office 365 and SharePoint 2013, so Microsoft has to make SharePoint Server 2016 look like a new, enticing offering worthy of being taken seriously. So far, the company has done a good job of building up hype and attention. Stephen E. Arnold is a longtime leader in search, and he has been covering the news surrounding the release. Additionally, his dedicated SharePoint feed makes it easy to catch the latest news, tips, and tricks at a glance.
Emily Rae Aldridge, July 16, 2015
