DataSift Architecture

March 1, 2012

So you want to do “big data”? This is for the SEO, PR, and marketing consultants who assert that “big data” is part of their firms’ standard fanny pack. A large version of this DataSift architecture image is available. DataSift, as you may know, processes the Twitter tweet stream. Yep, big data. The IT folks at the new age Madison Avenue firms have this type of technology with their Starbucks latte:


The DataSift Architecture: A Bird’s Eye View.

Trivial for the SEO experts and former middle school English teachers.

Stephen E Arnold, March 1, 2012

Sponsored by


Exogenous Complexity 3: Being Clever

February 24, 2012

I just submitted my March 2012 column to Enterprise Technology Management, published in London by IMI Publishing. In that column I explored the impact of Google’s privacy stance on the firm’s enterprise software business. I am not letting any tiny cat out of a big bag when I suggest that the blowback might be a thorn in Googzilla’s extra large foot.

In this essay, I want to consider exogenous complexity in the context of the consumerization of information technology and, by extension, information access in an organization. The spark for my thinking was the write up “Google, Safari and Our Final Privacy Wake-Up Call.”

Here’s a clever action. MIT students put a red truck on top of the dome.

If you do not have an iPad or an iPhone or an Android device, you will want to stop reading. Consumerization of information technology boils down to employees and contract workers who show up with mobile devices (yes, including laptops) at work. In the brave new world, the nanny instincts of traditional information technology managers are little more than annoying nags from a corporate mom.

The reality is that when consumer devices enter the workplace, three externalities occur, in my experience.

First, security is mostly ineffective. Clever folks then exploit vulnerable systems. I think this is why clever people say that the customer is to blame. So clever exploits cluelessness. Clever is exogenous for the non clever. There are some actions an employer can take; for example, confiscating personal devices before the employee enters the work area. This works in certain law enforcement, intelligence, and a handful of other environments; for example, fabrication facilities in electronics or pharmaceuticals. Mobile devices have cameras and can “do” video. “Secret” processes can become un-secret in a nonce. In the free flowing, disorganized craziness of most organizations, personal devices are ignored or overlooked. In short, in a monitored financial trading environment, a professional can send messages outside the firm and the bank’s security and monitoring systems are happily ignorant. Dropping a truly secure box around a work place is expensive and beyond the core competency of most information technology professionals.

Second, employees blur information which is “for work” with information which is “for friends, lovers, or acquaintances.” The exogenous factor is political. To fix the problem, rules are framed. The more rules applied to a flawed system, the greater the likelihood is that clever people will exploit systems which ignore the rules. Clever actions, therefore, increase. In short, this is a variation of the Facebook phenomenon, in which a posting can reach many people quickly or lie dormant until the data load explodes like a long forgotten Fourth of July firecracker. As people chase the fire, clever folks exploit the fire. Information time bombs are not thought about by most senior managers, but they are on the radar of those involved in a legal matter and in the minds of some disgruntled programmers. The half life of information is less well understood by most professionals than the difference between a uranium based reactor and a thorium based reactor. Work and life information are blended, and in my opinion, the compound is a dangerous one.

Third, vendors focusing on consumerizing information technology spur adoption of devices and practices which cannot be easily controlled. The data-Hoovering processes, therefore, can suck up information which is proprietary, of high value, and potentially damaging to the information owner. Information is not “like sand grains.” Some information is valueless; other information commands a high price. In fact, modern content processing and data analytic systems can take fragments of information and “fuse” them. To most people these amalgams are of little interest. But to someone with specialized knowledge, the fused data are not gold nuggets; the fused data are a chunky rosy diamond, maybe a Pink Panther. As a result, an exogenous factor increases the flow of high value data through uncontrolled channels.


A happy quack to Gunaxin. You can see how clever, computer situations, and real life blend in this “pranking” poster. I would have described the wrapping of equipment in plastic as “clever.” But I am the fume hood guy, Woodruff High School, 1958 to 1962.

Now, let’s think about being clever. When I was in high school, I was one of a group of 25 students who were placed in an “advanced” program. Part of the program included attending universities for additional course work. I ended up at the University of Illinois at age 15. I went back to regular high school, did some other Fancy Dan learning programs, and eventually graduated. My specialty was tricking students in “regular” chemistry into modifying their experiments to produce interesting results. One of these suggestions resulted in a fume hood catching fire. Another dispersed carbon strands through the school’s ventilation system. I thought I was clever, but eventually Mr. Shepherd, the chemistry teacher, found out that I was the “clever” one. I sat in the hall for the balance of the semester. I adapted quickly, got an A, and became semi-famous. I was already sitting in the hall for writing essays filled with double entendres. Sigh. Clever has its burdens. Some clever folks just retreat into a private world. The Internet is ideal for providing an environment in which isolated clever people can find a “friend.” Once a couple of clever folks hook up, the result is lots of clever activity. Most of the clever activity is not appreciated by the non clever. There is the social angle and the understanding angle. In order to explain a clever action, one has to be somewhat clever. The non clever have no clue what has been done, why, when, or how. There is a general annoyance factor associated with any clever action. So, clever usually gets masked or shrouded in something along the lines of “Gee, I am sorry” or “Goodness gracious, I did not think you would be annoyed.” Apologies usually work because the non clever believe the person saying “I’m sorry” really means it. Nah. I never meant it. I did not pay for the fume hood or the air filter replacement. Clever, right?

What happens when folks from the type of academic experience I had go to work in big companies? Well, it is sink or swim. I have been fortunate because my “real” work experiences began at Halliburton Nuclear Services and continued at Booz, Allen & Hamilton when it was a solid blue chip firm, not the azure chip outfit it is today. At Halliburton I was surrounded by nuclear engineers whose idea of socializing was arguing about Monte Carlo code and nuclear fuel degradation at the local exercise club. At Booz, Allen the environment was not as erudite as the nuclear outfit, but there were lots of bright people who were actually able to conduct a normal conversation. Nevertheless, the Type As made life interesting for one another, senior managers, clients, and family. Ooops. At the Booz, Allen I knew, one’s family was one’s colleagues. Most spouses had no idea about the odd ball world of big time consulting. There were exceptions. Some folks married a secretary or colleague. That way the spouse knew what work was like. Others just married the firm, converting “quality time” into two days with the dependents at a posh resort.

So clever usually causes one to seek out other clever people or find a circle of friends who appreciate the heat generated by aluminum powder in an oxygen rich environment. When a company employs clever people, it is possible to generalize:

Clever people do clever things.

What’s this mean in search and information access? You probably already know that clever people often have a healthy sense of self worth. There is also arrogance, a most charming quality among other clever people. The non-clever find the arrogance “thing” less appealing.

Let’s talk about information access.

Let’s assume that a clever person wants to know where a particular group of users navigate via a mobile device or a traditional browser. Clever folks know about persistent cookies, workarounds for default privacy settings, spoofing built in browser functions, or installation of rogue code which resets certain user selected settings on a heartbeat or restart. Now those in my advanced class would get a kick out of these types of actions. Clever people appreciate the work of clever people. When the work leaves the “non advanced” in a clueless state, the fun curve does the hockey stick schtick. So clever enthuses those who are clever. The unclever are, by definition, clueless and not impressed. For really nifty clever actions, the unclever get annoyed, maybe mad. I was threatened by one student when the Friday afternoon fume hood event took place. Fortunately my debate coach intervened. Hey, I was winning and a broken nose would have imperiled my chances at the tournament on Saturday.

Now more exogenous complexity. Those who are clever often ignore unintended consequences. I could have been expelled, but I figured my getting into big trouble would have created problems with far reaching implications. I won a State Championship in the year of the fume hood. I won some silly scholarship. I published a story in the St Louis Post Dispatch called “Burger Boat Drive In.” I had a poem in a national anthology. So, I concluded that a little sport in regular chemistry class would not have any significant impact. I was correct.

However, when clever people do clever things in a larger arena, then the assumptions have to be recalibrated. Clever people may not look beyond their cube or outside their computer’s display. That’s when the exogenous complexity thing kicks in.

So Google’s clever folks allegedly did some work arounds. But the work around allowed Microsoft to launch an attack on Google. Then the media picked up on the work around and the Microsoft push back. The event allowed me to raise the question, “So workers bring their own consumerized device to work. What’s being tracked? Do you know? Answer: Nope.” What’s Google do? Apologize. Hey, this worked for me with the fume hood event, but on a global stage when organizations are pretty much lost in space when it comes to control of information, effective security, and managing crazed 20 somethings—wow.

In short, the datasphere encourages and rewards exogenous behavior by clever people. Those who are unclever take actions which set off a flood of actions which benefit the clever.

Clever. Good sometimes. Other times. Not so good. But it is better to be clever than unclever. Exogenous factors reward the clever and brutalize the unclever.

Stephen E Arnold, February 24, 2012


Al Jazeera and Its US Reach

January 24, 2012

We were surprised, then resigned. Has the US slipped lower on yet another yardstick of achievement?

Al Jazeera English, an international 24 hour English-Language news and current affairs TV channel headquartered in Doha, Qatar, has now reached 250 million homes — 5 million of those being in the U.S.

The Los Angeles Times reported on this startling milestone in the article “Al Jazeera English Now Reaches 250 Million Households.”

We learned:

Five years after its launch, there are 130 countries that carry Al Jazeera English, but in the U.S., the channel has limited availability; it can be found on cable systems in Washington, D.C.; New York; Burlington, Vt.; Toledo, Ohio; and, recently, Chicago and in Los Angeles on KCET. And while the U.S. makes up a fraction of the quarter-billion households, it is a major source of AJE’s Web traffic, totaling 40 percent, according to the network.

The fact that Al Jazeera English has such a large web following in the United States despite its limited availability leads me to think that a significant shift has taken place.

Jasmine Ashton, January 24, 2012


Google Does Real Time Again

October 28, 2011

Google+ Rolls Out Real-Time Search and Hashtag Support

On October 12, Google Plus rolled out two new features; both allow users to create custom news streams based around topics being shared and build upon the search functionality of the network. The first feature, a real-time search, finds results from Google+ posts that are related to the search term a user enters. As new posts are created centering around the search topic, the user is notified and a real-time stream of posts is begun. ZDNet’s article, “Google+ Real-Time Search: The Social News ‘Ticker,’” tells us more about the changes:

… Google engineer Vic Gundotra – who posted the news from his Google Plus feed – notes that it’s a great way to keep up with real-time news events, such as a speech, a court trial or a sporting event. Basically, it’s a real-time news ticker for niche topics. The second feature – hashtag support – essentially turns any hashtag in a post into a searchable term that can be used as another way to create feeds and real-time streams.

This is a catchy notion. I’m interested to see if Google+ will begin integrating all social networking posts into their search results. What they’re doing right now isn’t groundbreaking; Twitter already offers the exact same feature. However, it would be groundbreaking to be able to follow trending topics on all the major social networking sites as they correlate to breaking news.
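For readers who want to see what “turning a hashtag into a searchable term” might look like, here is a minimal sketch in Python. It is not Google’s implementation, which is not public. The function name `build_hashtag_index`, the regular expression, and the sample posts are all made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: a hashtag is "#" followed by letters, digits, or underscores.
HASHTAG = re.compile(r"#(\w+)")

def build_hashtag_index(posts):
    """Map each lowercased hashtag to the ids of the posts that contain it."""
    index = defaultdict(list)
    for post_id, text in posts:
        # Deduplicate within a post so a repeated tag is indexed only once.
        for tag in sorted({t.lower() for t in HASHTAG.findall(text)}):
            index[tag].append(post_id)
    return index

# Invented sample posts, not real Google+ content.
posts = [
    (1, "Watching the launch live #space #NASA"),
    (2, "Court trial updates all day #trial"),
    (3, "More from the pad #nasa"),
]
index = build_hashtag_index(posts)
print(index["nasa"])  # [1, 3]
```

A real-time stream would append to the index as each new post arrives; the lookup itself stays a simple dictionary access.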

But Google did real time before. What’s “real time”? Whatever Google wants it to be I suppose from a marketing viewpoint.

Stephen E Arnold, October 28, 2011


Lucid Imagination: Open Source Search Reaches for Big Data

September 30, 2011

We are wrapping up a report about the challenges “big data” pose to organizations. Perhaps the most interesting outcome of our research is that there are very few search and content processing systems which can cope with the digital information required by some organizations. Three examples merit listing before I comment on open source search and “big data”.

The first example is the challenge of filtering information required by organizations and produced within the organization by the organization’s staff, contractors, and advisors. We learned in the course of our investigation that the promises of processing updates to Web pages, price lists, contracts, sales and marketing collateral, and other routine information are largely unmet. One of the problems is that the disparate content types have different update and change cycles. The most widely used content management system based on our research results is SharePoint, and SharePoint is not able to deliver a comprehensive listing of content without significant latency. Fixes are available but these are engineering tasks which consume resources. Cloud solutions do not fare much better, once again due to latency. The bottom line is that for information produced within an organization employees are mostly unable to locate information without a manual double check. Latency is the problem. We did identify one system which delivered documented latency across disparate content types of 10 to 15 minutes. The solution is available from Exalead, but the other vendors’ systems were not able to match this performance in putting fresh, timely information produced within an organization in front of system users. Shocked? We were.


Reducing latency in search and content processing systems is a major challenge. Vendors often lack the resources required to solve a “hard problem” so “easy problems” are positioned as the key to improving information access. Is latency a popular topic? A few vendors do address the issue; for example, Digital Reasoning and Exalead.
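One way to make “latency” concrete is to compare when a document changed with when the change became searchable. The sketch below is a rough illustration in Python, not any vendor’s method. The function name `index_latency_minutes`, the content type labels, and the timestamps are invented for the example and are not measurements from our research.

```python
from datetime import datetime
from statistics import median

def index_latency_minutes(records):
    """Return the median indexing latency in minutes for each content type.

    Each record is (content_type, modified_at, indexed_at).
    """
    by_type = {}
    for content_type, modified_at, indexed_at in records:
        delta = (indexed_at - modified_at).total_seconds() / 60
        by_type.setdefault(content_type, []).append(delta)
    return {ctype: median(vals) for ctype, vals in by_type.items()}

# Invented sample data: two price list updates and one contract update.
records = [
    ("price_list", datetime(2011, 9, 30, 9, 0), datetime(2011, 9, 30, 9, 12)),
    ("price_list", datetime(2011, 9, 30, 10, 0), datetime(2011, 9, 30, 10, 14)),
    ("contract", datetime(2011, 9, 30, 9, 0), datetime(2011, 9, 30, 13, 0)),
]
latency = index_latency_minutes(records)
print(latency)  # {'price_list': 13.0, 'contract': 240.0}
```

Reporting the figure per content type matters because, as noted above, disparate content types have different update and change cycles; a single average would hide the slow ones.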

Second, when organizations tap into content produced by third parties, the latency problem becomes more severe. There is the issue of the inefficiency and scaling of frequent index updates. But the larger problem is that once an organization “goes outside” for information, additional variables are introduced. In order to process the broad range of content available from publicly accessible Web sites or the specialized file types used by certain third party content producers, connectors become a factor. Most search vendors obtain connectors from third parties. These work pretty much as advertised for common file types such as Lotus Notes. However, when one of the targeted Web sites such as a commercial news service or a third-party research firm makes a change, the content acquisition system cannot acquire content until the connectors are “fixed”. No problem as long as the company needing the information is prepared to wait. In my experience, broken connectors mean another variable. Again, no problem unless critical information needed to close a deal is overlooked.


Endeca Clicks into Real Time Search with DataSift

September 26, 2011

Endeca, known for its e-commerce software, is pairing with DataSift, a provider of aggregated social data feeds at Web scale. Their partnership will produce visualizations and advanced analytics on semi-structured content in real time. Benzinga covers the latest in, “Endeca and DataSift Team to Analyze the Real Time Web.” The write up asserts:

Pairing Endeca Latitude®, an Agile BI platform, with the breadth of social data like Facebook, Twitter, and WordPress as well as other popular social solutions, enables organizations to react to the “big data fire hose” alongside internal data, for marketing analytics, customer intelligence, CRM and competitive intelligence. Endeca and DataSift will demonstrate their joint offering at O’Reilly’s Strata Conference on September 22-23 in New York.

DataSift’s granular and modular sifting abilities combine with Endeca Latitude’s intuitive interface to produce a product that is both powerful and cost-effective. The as yet unnamed offering will help companies mine the business value out of the gushing well of new social data.

Our view is that “latency” exists across the six major types of “real time” solutions. What does “real time” mean? Well, it means different things depending upon the application. Some solutions are mind bogglingly expensive. Think Thomson Reuters’ feeds of financial data on certain investments. Others are pretty leisurely; for example, what is trending in the world of Lady Gaga. Interesting tie up. No solid definition of latency yet. We are watching and waiting. You know. Latency.

Emily Rae Aldridge, September 23, 2011

Alerts When Search Is Hit and Miss

August 21, 2011

Search seems like the answer to Every Man’s information needs. It is not. Not by a long shot.

If organizations cannot determine which individuals need a piece of information, they will invariably push content onto a whole group of people. AFV-News reported “U.S. Army Deploys AtHoc IWSAlerts Emergency Mass Notification System.”

Businesses, schools, universities, and military groups all use emergency alerts, providing mass notifications to everyone in their system. Fort Jackson brags that its AtHoc alerts span 25,000 personnel and dependents.

AtHoc IWS Alerts offer control from a unified Web-based console, which allows Fort Jackson to send alerts to cell phones, landlines, smart phones, SMS text and email. It’s not just Fort Jackson: AtHoc serves more than 1.5 million Department of Defense personnel, more than any other provider.

We learned about AtHoc’s capabilities and infrastructure from the AFV-News article:

[The] system integrates with the post’s existing Internet Protocol network services, which means reduced infrastructure and maintenance costs. Personnel accountability is accomplished through the bi-directional capability, allowing responses to notifications in real-time. Network alert delivery and response can be tracked, ensuring that targeted recipients have received and responded to alerts.

While alerts for dangerous situations and testing can save lives and are obviously a necessity, mass alert systems also unfortunately end up in too many unnecessary inboxes.

Megan Feil, August 21, 2011


IBM May Need a More Robust Classification Solution

August 18, 2011

According to talk around the water cooler, some IBM content and search units are poking around for a classification “solution”. We think the rumor is mostly big company confusion since IBM already has software available to assess and address an organization’s content classification needs through the use of several components. According to the IBM website:

Most unstructured content is either trapped in silos across the organization or entirely unmanaged “content in the wild.” A majority of that unstructured content can be deemed unnecessary – over-retained, irrelevant, or duplicate – and should be either decommissioned or deleted.

As we understand it, one licenses the Classification Module and/or Content Analytics software to prevent the previously stated problem and to provide content classification.

Sounds great like the ads for IBM mainframes and the promotional information about

But a disturbing question to the ArnoldIT goslings who wear blue IBM logos: What if this stuff costs too much and does not deliver on the fly classification for real time processing of tweets and Google Plus public content?

Maybe an IBM box of parts with an expensive IBM engineering team is not exactly what some outfits require? Perhaps IBM should look around and maybe snap up one of the hot players in the space. IBM has been announcing partnerships with a number of interesting companies. We track Digital Reasoning and think its technology looks very promising. IBM is in a good position to have an impact in the data analysis space, but it needs tools that go beyond its in house code and Cognos and SPSS methods in our opinion.

Jasmine Ashton, August 19, 2011

Sponsored by, publishers of The New Landscape of Enterprise Search

Social Content Feed Tool from Know about It

August 2, 2011

When all your Facebook, Twitter, and other social streams become so convoluted, you might miss out on that link, photo, or music video you would’ve loved. You’ll never know – until now…maybe. Marshall Kirkpatrick looks at the new start-up, Know About It, in “New Service Sniffs out Secret Gems from across Your News Feeds.”

The service brings in all your subscribed content from major social networks, then offers a number of different ways to sort what it finds. My favorite is the filter called “Potentially Missed” – links from people who don’t share a lot of links.

Know About It explains on its Web site they collect all the links passing through your social streams and perform a “bunch of analysis on each one to determine which are most likely to be of interest to you.”

Sounds helpful. The idea of sorting all your inbound information in a variety of ways is appealing. You can also look at the service’s recommendations based on your expressed interest or get a personalized email digest.
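The “Potentially Missed” idea is easy to approximate. Here is a minimal sketch in Python, assuming the filter simply keeps links from people who post few of them. Know About It does not describe its actual analysis, so the function name `potentially_missed`, the `max_shares` threshold, and the sample data are all hypothetical.

```python
from collections import Counter

def potentially_missed(shared_links, max_shares=2):
    """Return links posted by people who shared at most `max_shares` links.

    `shared_links` is a list of (person, link) pairs from a social stream.
    """
    counts = Counter(person for person, _ in shared_links)
    return [link for person, link in shared_links if counts[person] <= max_shares]

# Invented sample stream: ann shares a lot, bob rarely does.
stream = [
    ("ann", "a1"),
    ("ann", "a2"),
    ("ann", "a3"),
    ("bob", "b1"),
]
missed = potentially_missed(stream)
print(missed)  # ['b1']
```

Note that even this toy version needs to see every link in your streams, which is exactly where the privacy question comes in.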

Mr. Kirkpatrick has not yet tested the service but likes the idea. What isn’t mentioned? Privacy. So what is the ‘bunch of analysis’ and where do all those links end up? Advertisers? If the start-up is successful, time will tell. But with the social web moving at a never-ending pace and growing, social media users wanting to sort their feeds likely won’t mind too much. We think these types of tools are likely to grow in importance as free real time search becomes a difficult service to monetize.

Philip West, August 2, 2011


Synthesio Releases New Social Media Monitoring Tool

May 22, 2011

More social media monitoring. “This New Dashboard Lets You Monitor Social Media Conversations About Your Brand Everywhere” describes a dashboard called Unity. The solution is from Synthesio, and it could quickly become an essential marketing tool.

Unlike TweetDeck, Unity is not free. However, the cost may be worth it. The article points to two components that put this app far ahead:

  • “It monitors much more than Facebook or Twitter, in particular it crawls user forums, which is trickier and in practice is often much more important for many brands;
  • It works in over 30 languages. Synthesio has teams of translators around the world and around the clock that monitor conversations in many languages and make it all accessible to marketers in one dashboard.”

For your money, you get information about how to customize your dashboard. Regular analytic reports are available for an added cost. Such monitoring of the real time environment may soon be essential for companies to stay competitive or, well, bring the future home today.

Cynthia Murrell, May 22, 2011
