No Mole, Just Data

November 23, 2015

It all comes down to putting together the pieces, we learn from Salon’s article, “How to Explain the KGB’s Amazing Success Identifying CIA Agents in the Field?” For years, the CIA was convinced there was a Soviet mole in its midst; how else to explain the uncanny knack of the 20th century’s KGB for identifying CIA agents? Now we know it was due to the brilliance of one data-savvy KGB officer, Yuri Totrov, who analyzed the U.S. government’s personnel data to separate the spies from the rest of our workers overseas. The technique was very effective, and all without the benefit of today’s analytics engines.

Totrov began by searching the KGB’s own data, and that of allies like Cuba, for patterns in known CIA agent postings. He also gleaned a lot of info from publicly available U.S. literature and from local police. Totrov was able to derive 26 “unchanging indicators” that would pinpoint a CIA agent, as well as many other markers less universal but still useful: things like CIA agents driving the same car and renting the same apartment as their immediate predecessors. Apparently, logistics staff back at Langley did not foresee that such consistency, though cost-effective, could be used against us.
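The 26 indicators were never published as a checklist, but the underlying technique, scoring each personnel record against a set of yes/no rules, is easy to sketch. Here is a minimal Python illustration; the indicators and records are invented stand-ins, not Totrov’s real criteria:

```python
# Hypothetical sketch of indicator-based scoring in the spirit of
# Totrov's method. The indicators and records below are invented;
# the real 26 indicators were never published as a list.

RECORDS = [
    {"name": "Officer A", "entry_pay": "high", "tour_years": 6,
     "in_state_dept_listing": False, "kept_predecessors_car": True},
    {"name": "Officer B", "entry_pay": "normal", "tour_years": 3,
     "in_state_dept_listing": True, "kept_predecessors_car": False},
]

# Each indicator is a named predicate over a personnel record.
INDICATORS = [
    ("high entry pay", lambda r: r["entry_pay"] == "high"),
    ("tour longer than four years", lambda r: r["tour_years"] > 4),
    ("absent from State Department listings",
     lambda r: not r["in_state_dept_listing"]),
    ("inherited predecessor's car", lambda r: r["kept_predecessors_car"]),
]

def matched_indicators(record):
    """Return the names of every indicator the record matches."""
    return [name for name, test in INDICATORS if test(record)]

for record in RECORDS:
    hits = matched_indicators(record)
    print(f"{record['name']}: {len(hits)} of {len(INDICATORS)} indicators: {hits}")
```

No analytics engine required; a patient analyst with a card file could, and apparently did, run this by hand.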

Reporter Jonathan Haslam elaborates:

“Thus one productive line of inquiry quickly yielded evidence: the differences in the way agency officers undercover as diplomats were treated from genuine foreign service officers (FSOs). The pay scale at entry was much higher for a CIA officer; after three to four years abroad a genuine FSO could return home, whereas an agency employee could not; real FSOs had to be recruited between the ages of 21 and 31, whereas this did not apply to an agency officer; only real FSOs had to attend the Institute of Foreign Service for three months before entering the service; naturalized Americans could not become FSOs for at least nine years but they could become agency employees; when agency officers returned home, they did not normally appear in State Department listings; should they appear they were classified as research and planning, research and intelligence, consular or chancery for security affairs; unlike FSOs, agency officers could change their place of work for no apparent reason; their published biographies contained obvious gaps; agency officers could be relocated within the country to which they were posted, FSOs were not; agency officers usually had more than one working foreign language; their cover was usually as a ‘political’ or ‘consular’ official (often vice-consul); internal embassy reorganizations usually left agency personnel untouched, whether their rank, their office space or their telephones; their offices were located in restricted zones within the embassy; they would appear on the streets during the working day using public telephone boxes; they would arrange meetings for the evening, out of town, usually around 7.30 p.m. or 8.00 p.m.; and whereas FSOs had to observe strict rules about attending dinner, agency officers could come and go as they pleased.”

In the era of Big Data, it seems like common sense to expect such deviations to be noticed and correlated, but it was not always so obvious. Nevertheless, Totrov’s methods did cause embarrassment for the agency when they were revealed. Surely the CIA has changed its logistical ways dramatically since then to avoid such discernible patterns. Right?

Cynthia Murrell, November 23, 2015

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph


Data and Information: Three Cs for an Average Grade

November 21, 2015

I read “Why Companies Are Not Engaging with Their Data.” The write up boils down the “challenge” to three Cs; that is, a mnemonic which makes it easy to pinpoint Big Data clumsiness.

The three Cs are:

  • Callowness
  • Cost
  • Complexity.

How does one get past the notion of inexperience? I suppose one muddles through grade school, high school, college, and maybe graduate school. Then one uses “experience” to get a job, and one can repeat this process with Big Data. How many organizations will have an appetite for this organic approach to inexperience? Not many, I assert. We live in a quick fix, do it now environment which darned well better deliver an immediate payoff or “value.” Big Data may require experience, but the real world wants instant gratification.

Cost remains a bit of a challenge, particularly when revenues are under pressure. Data analytics can be expensive when done correctly and really costly if done incorrectly.

Complexity. Math remains math. Engineering data management systems tickles the fancy of problem solvers. Combine the two, and the senior management of many firms is essentially clueless about what is required to deliver outputs which are on the money and within budget.

The write up states:

As a recent report from Ernst & Young points out, ‘Most organizations have complex and fragmented architecture landscapes that make the cohesive collation and dissemination of data difficult.’

In short, big hat, no cattle. Just like the promises of enterprise search vendors to make information accessible to those making business decisions, the verbal picture painted by marketers is more enticing than the shadow cast by Big Data’s three Cs. I see that.

Stephen E Arnold, November 21, 2015

Predictions for a Big Data Future

November 19, 2015

Want to know what the future will look like? Navigate to “7 Reasons Why the Algorithmic Business Will Change Society.” The predictions come from a mid tier consulting firm via Datafloq. I find them oddly out of step with the milieu in which I live. That’s okay, but this list of seven changes raises a number of questions and seems to sidestep some of the social consequences of the world the predictions foreshadow. Finding information is, let me say at the outset, not part of the Big Data future.

Here are the seven predictions:

    1. By 2018, 20% of all business content will be authored by machines, which means a hiring freeze on copywriters in favor of robowriting algorithms;
    2. By 2020, autonomous software agents, or algorithms outside human control, will participate in 5% of all economic transactions, thanks to, among other things, blockchain. On the other hand, we will need pattern-matching algorithms to detect robot thieves.
    3. By 2018, more than 3 million workers globally will be supervised by a “roboboss”. These algorithms will determine what work you would need to do.
    4. By 2018, 50% of the fastest growing companies will have fewer employees than smart machines. Companies will become smaller due to expanding presence of algorithms.
    5. By 2018, customer digital assistants will recognize individuals by face and voice across channels and partners. Although this will benefit the customer, organizations should prevent the creepiness-factor.
    6. By 2018, 2 million employees will be required to wear health and fitness tracking devices. The data generated from these devices will be monitored by algorithms, which will inform management of any actions to be taken.
    7. By 2020, smart agents will facilitate 40% of mobile transactions, and the post-app era will begin to dominate, where algorithms in the cloud guide us through our daily tasks without the need for individual apps.

Fascinating. Who will work? What will people do in a Big Data world? What about social issues? How will one find information? What happens if one or more algorithms drift and deliver flawed outputs?

No answers of course, but that’s the great advantage of talking about a digital future three or more years down the road. I assume folks will have time to plan their Big Data strategy for this predicted world. I suppose one could ask Google, Watson, or one’s roboboss.

Stephen E Arnold, November 19, 2015

A Modest Dust Up between Big Data and Text Analytics

November 18, 2015

I wonder if you will become involved in this modest dust up between the Big Data folks and the text analytics adherents. I know that I will sit on the sidelines and watch the battle unfold. I may be mostly alone on that fence for three reasons:

  • Some text analytics outfits are Big Data oriented. I would point modestly to Terbium Labs and Recorded Future. Both do the analytics thing and both use “text” in their processing. (I know that learning about these companies is not as much fun as reading about Facebook friends, but it is useful to keep up with cutting edge outfits in my opinion.)
  • Text analytics can produce Big Data. I know that sounds like a fish turned inside out. Trust me. It happens. Think about some wan government worker in the UK grinding through Twitter and Facebook posts. The text analytics process outputs lots of data.
  • A faux dust up is mostly a marketing play. I enjoyed search and content processing vendor presentations which pitted features of one system versus another. This approach is not too popular because every system says it can do what every other system can do. The reality of the systems is, in most cases, not discernible to the casual failed webmaster now working as a “real” wizard.

Navigate to “Text Analytics Gurus Debunk 4 Big Data Myths.” You will learn that there are four myths which are debunked. Here are the myths:

  1. Big Data survey scores reign supreme. Hey, surveys are okay because outfits like SurveyMonkey and the crazy pop up technology from that outfit in Michigan are easy to implement. Correct? Not important. Usable data for marketing? Important.
  2. Bigger social media data analysis is better. The outfits able to process the real time streams from Facebook and Twitter have lots of resources. Most companies do not have these resources. Ergo: Statistics 101 reigns no matter what the marketers say. (A sketch of the Statistics 101 alternative appears after this list.)
  3. New data sources are the most valuable. The idea is that data which are valid, normalized, and available for processing trump bigness. No argument from me.
  4. Keep your eye on the ball by focusing on how customers view you. Right. The customer is king in marketing land. In reality, the customer is a code word for generating revenue. Neither Big Data nor text analytics produce enough revenue in my world view. Sounds great though.
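About that Statistics 101 point in item two: if a company cannot afford to ingest the full Twitter firehose, a uniform random sample often answers the same questions. A minimal reservoir sampling sketch in Python; the stream of fake posts is invented for illustration:

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Keep a uniform random sample of k items from a stream of unknown
    length, using memory proportional to k rather than to the stream."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Item i survives with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample

# A stand-in for a social media firehose: one million fake posts.
firehose = (f"post-{n}" for n in range(1_000_000))
print(reservoir_sample(firehose, k=5))
```

No Facebook-scale cluster required, and for many survey-style questions the statistics work out the same.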

Will Big Data respond to this slap down? Will text analytic gurus mount their steeds and take another run down Marketing Lane to the windmill set up as a tourist attraction in an Amsterdam suburb?

Nope. The real battle involves organic, sustainable revenues. Talk is easy. Closing deals is hard. This dust up is not a mixed martial arts pay per view show.

Stephen E Arnold, November 18, 2015

Deleting Data: Are They Really Gone?

November 17, 2015

I read “Gawker Media’s Data Guru Presents the Case for Deleting Data.” The main idea is that physical hoarding makes for a reality TV program; hoarding data may not be good TV.

The write up points out that data cleaning is not cheap. Storage also costs money.

A Gawker wizard is quoted as saying:

We effectively are setting traps in our data sets for our future selves and our colleagues… Increasingly, I find that eliminating this data from our databases is the best solution. Gawker’s traffic data is maintained for just a few months. In our own logs and databases, we only have traffic data since February, and even that’s of limited use: We’ll toss some of it before the end of the year.

Seems reasonable. However, there may be instances when dumping or just carelessly overwriting log files might not be expedient or legal. For example, in one government agency, the secretary’s “bonus” depends on showing how Internet site usage relates to paperwork reduction. The idea is that when a “customer” of the government uses a Web site and does not show up in person at an office to fill out a request, the “customer” allegedly gets better service and costs, in theory, should drop. Also, some deals require that data be retained. You can use your imagination if you are an ISP in a country recently attacked by terrorists and your usage logs are “disappeared.” SEC and IRS retention guidelines? Worth noting in some cases.
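Gawker describes its retention policy only in prose, so what follows is a guess at the mechanics: a rolling window applied to timestamped log rows. The schema, field names, and 90 day cutoff are all assumptions for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical traffic log rows; the schema is invented for illustration.
LOG = [
    {"url": "/article/old", "ts": datetime(2015, 2, 10)},
    {"url": "/article/new", "ts": datetime(2015, 10, 30)},
]

def apply_retention(rows, days=90, now=None):
    """Drop rows older than the retention window. Note that this deletes
    only this copy; backups, mirrors, and men in the middle keep theirs,
    which is exactly the zombie data problem discussed below."""
    now = now or datetime(2015, 11, 17)
    cutoff = now - timedelta(days=days)
    return [row for row in rows if row["ts"] >= cutoff]

print(apply_retention(LOG))  # only the recent row survives
```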

The question is, “Are data really gone once deleted?” The fact of automatic backups, services in the middle routinely copying data, and other ways of creating unobserved backups may mean that deleted data can come back to life.

Pragmatism and legal constraints as well as the “men in the middle” issue can create zombie data, which, unlike the fictional zombies, can bite.

Stephen E Arnold, November 17, 2015

Quote to Note: Big Data Must Be Small

November 16, 2015

The consulting firm KPMG China tweeted a quote I found worthy of my Quote to Note folder. You may be able to read this gem in this tweet, at least for now.

Here’s the quote attributed to Dr. Mark Kennedy, whom I presume is either a KPMG expert or an advisor to the blue chip firm:

To get value from Big Data, make it small.

That quote seems to complement the definition in “Big Data Explained in Less Than 2 Minutes to Absolutely Anyone”; to wit:

The idea behind the phrase ‘Big Data’ is that everything we do is increasingly leaving a digital trace (or data), which we (and others) can use and analyze. Big Data therefore refers to that data being collected and our ability to make use of it.

Does this mean that Big Data are just data with spray on marketing silicone? Definitions of big and small might be helpful. The fish I caught last summer was this big.
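If “make it small” means anything operational, it is probably aggregation: collapsing event-level records into a summary a person can act on. A minimal sketch in Python, with invented page view data:

```python
from collections import Counter

# Invented event-level "big" data: one row per page view. In real life
# this would be millions of rows, not six.
events = [
    {"page": "/home"}, {"page": "/search"}, {"page": "/home"},
    {"page": "/about"}, {"page": "/home"}, {"page": "/search"},
]

# The "small" version: a handful of counts a manager can actually read.
summary = Counter(event["page"] for event in events)
for page, views in summary.most_common():
    print(f"{page}: {views} views")
```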

Stephen E Arnold, November 16, 2015

More Big Data Value Floundering

November 15, 2015

Here in Harrod’s Creek, Kentucky, the mist is rising from the mine drainage ditch. Value is calculated in a couple of easy ways. Here are two concrete examples:

One of my neighbors buys my collection of used auto parts. Before he puts the parts in his truck, a 1950 Chevrolet, he pays me cash money. Once I count the money, I help him load the parts and watch him drive away in a haze of Volkswagen type emissions.

Here’s another:

A person calls me and wants to talk with me about enterprise search and content processing. I explain that I don’t “talk” for free. If the caller transfers cash money to my PayPal account, then I call the person and answer questions. The payment buys minutes. When the minutes are consumed, I hang up.

The notion of value, therefore, is focused on cash, not feeling good, having a nice day, or winning an election as the friendliest retired consultant in Harrod’s Creek.

Now navigate to “What Is the Value of Big Data to Your Business?” There is a gap between my definition of value and the definition of value set forth in this write up.

Here’s an example of Big Data value:

Big data and how it shapes your company

Big data is at the center of many decisions in any company. It will allow your company to:

Reduce and manage risk

Without data, organizations are vulnerable to many risks. Big data allows financial institutions to profile their customers when giving them credit facilities. Insurance companies can also create risk profiles which will allow them to set appropriate premiums for different customers. Agricultural enterprises as well, can use data on weather and food pricing to control production.

Better decision making

Collecting data on employees’ interests, behavior, interactions, work time, resource use and resource allocation can be very instrumental in creating better structures, improving the flow of information, increasing inter-departmental cooperation, increasing efficiency, saving time and saving resources.

Get a competitive edge

Monitoring competitor products, marketing activities, sales and pricing will help you to respond urgently with your own counter measures. If you are selling your products on a platform like Amazon, you can keep an eye on your biggest competitors and respond accordingly when they seem to be outselling you.

News flash. None of these listicle items deliver value from my point of view. Like other buzzwords and whizzy concepts, backfilling with generalizations is not going to convince me that Big Data has “value” unless the situation is linked to cash money.

Call me old fashioned, but this approach to value is one reason many companies are struggling to generate revenue from their search and content processing efforts.

Stephen E Arnold, November 15, 2015

Crazy, Wild Hadoop Prioritization Advice

November 12, 2015

I read “Top 10 Priorities for a Successful Hadoop Implementation.” A listicle. I understand. Clicks. Visibility. Fame. Fortune. Well, hopefully.

I wanted to highlight two pieces of advice delivered in a somber, parental manner. Here are two highlights from the write up intended to help a Hadoop administrator get ‘er done and keep the paychecks rolling in.

Item 2 of 10: “Innovate with Big Data on enterprise Hadoop.” I find it amusing when advisors, poobahs, and former middle school teachers tell another person to innovate. Yep, that works really well. Even those who innovate are faced with failure many times. I think the well ran dry for some of the Italian Renaissance artists when the examples of frescos in Nero’s modest home were recycled. Been there. Done that. The notion of a person innovating with an enterprise deployment of Hadoop strikes me as interesting, but probably not a top 10 priority. How about getting the data into the system, formulating a meaningful query, and figuring out how to deal with the batchiness of the system?

Item 9 of 10: “Look for capabilities that make Hadoop data look relational.” There are reasons to use Codd type data management systems. Those reasons include that they work when properly set up and that they require data which can be sliced and diced. Maybe not easily, but no one fools himself or herself thinking, “Gee, why don’t I dump everything into one big data lake and pull out the big, glossy fish automagically.”
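For what it is worth, “make Hadoop data look relational” usually means putting a schema and a SQL layer over raw files. A minimal sketch of the idea using PySpark; the HDFS path and column names are assumptions, and the temp view call shown here is the modern API rather than the 2015-era registerTempTable:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("relational-veneer").getOrCreate()

# Raw, schema-on-read JSON sitting in HDFS; the path and the fields
# are hypothetical.
logs = spark.read.json("hdfs:///data/lake/clickstream/")

# Register the files as a "table" so analysts can query the lake as if
# it were a relational database.
logs.createOrReplaceTempView("clickstream")

top_pages = spark.sql("""
    SELECT page, COUNT(*) AS views
    FROM clickstream
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""")
top_pages.show()
```

The veneer is handy, but it does not repeal item 9’s unstated premise: someone still has to get clean, consistently shaped data into the lake first.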

I am okay with advice. Perhaps it should reflect the reality which open source data management tools present to an enterprise user seeking guidance. Enterprise search vendors got themselves into a world of hurt with this type of casual advice. Where are those vendors now?

Stephen E Arnold, November 12, 2015

Data Lake and Semantics: Swimming in Waste Water?

November 6, 2015

I read a darned fascinating write up called “Use Semantics to Keep Your Data Lake Clear.” There is a touch of fantasy in the idea of importing heterogeneous “data” into a giant data lake. The result is, in my experience, more like waste water in a pre-treatment plant in Saranda, Albania. Trust me. Distasteful.


The write up invokes a mid tier consultant and then tosses in the fuzzy term “governance.” We are now on semi solid ground, right? I do like the image of a data swamp, which contrasts nicely with the images from On Golden Pond.

I noted this passage:

Using a semantic data model, you represent the meaning of a data string as binary objects – typically in triplicates made up of two objects and an action. For example, to describe a dog that is playing with a ball, your objects are DOG and BALL, and their relationship is PLAY. In order for the data tool to understand what is happening between these three bits of information, the data model is organized in a linear fashion, with the active object first – in this case, DOG. If the data were structured as BALL, DOG, and PLAY, the assumption would be that the ball was playing with the dog. This simple structure can express very complex ideas and makes it easy to organize information in a data lake and then integrate additional large data stores.

Okay.
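The DOG, PLAY, BALL example maps onto the subject-predicate-object triples of the semantic web world. A minimal sketch in plain Python (no triple store needed) showing why the ordering carries the meaning:

```python
# A triple is (subject, predicate, object); the order is the grammar.
triples = {
    ("DOG", "PLAY", "BALL"),   # the dog plays with the ball
    ("DOG", "IS_A", "ANIMAL"),
    ("BALL", "IS_A", "TOY"),
}

def facts_about(subject, store):
    """Return every (predicate, object) pair asserted about a subject."""
    return [(p, o) for s, p, o in store if s == subject]

print(facts_about("DOG", triples))
# e.g. [('PLAY', 'BALL'), ('IS_A', 'ANIMAL')]; set order may vary.
# ("BALL", "PLAY", "DOG") would be a different, and wrong, claim:
# it would assert that the ball is the actor.
```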

Next I circled:

A semantic data lake is incredibly agile. The architecture quickly adapts to changing business needs, as well as to the frequent addition of new and continually changing data sets. No schemas, lengthy data preparation, or curating is required before analytics work can begin. Data is ingested once and is then usable by any and all analytic applications. Best of all, analysis isn’t impeded by the limitations of pre-selected data sets or pre-formulated questions, which frees users to follow the data trail wherever it may lead them.

Yep, makes perfect sense. But there is one tiny problem. Garbage in, garbage out. Not even modern jargon can solve this decades old computer challenge.

Fantasy is much better than reality.

Stephen E Arnold, November 6, 2015

Braiding Big Data

October 26, 2015

An apt metaphor to explain big data is the act of braiding.  Braiding requires a person to take three or more locks of hair and alternately weave them together.  The end result is a clean, pretty hairstyle that keeps a person’s hair in place and off the face.  Big data is like braiding, because specially tailored software takes an unruly mess of data, including the combed and uncombed strands, and organizes it into a legible format.  Perhaps this is why TopQuadrant named its popular big data software TopBraid.  Read more about the software upgrade in “TopQuadrant Launches TopBraid 5.0.”

TopBraid Suite is an enterprise Web-based solution set that simplifies the development and management of standards-based, model-driven solutions focused on taxonomy, ontology, metadata management, reference data governance, and data virtualization.  The newest upgrade for TopBraid builds on the current enterprise information management solutions and adds new options:

“It continues to be our goal to improve ways for users to harness the full potential of their data,” said Irene Polikoff, CEO and co-founder of TopQuadrant. “This latest release of 5.0 includes an exciting new feature, AutoClassifier. While our TopBraid Enterprise Vocabulary Net (EVN) Tagger has let users manually tag content with concepts from their vocabularies for several years, AutoClassifier completely automates that process.”

The AutoClassifier makes it easier to add and edit tags before making them a part of the production tag set.  Other new features are for TopBraid Enterprise Vocabulary Net (TopBraid EVN), TopBraid Reference Data Manager (RDM), TopBraid Insight, and the TopBraid platform, including improvements in internationalization and a new component for increasing system availability in enterprise environments, TopBraid DataCache.
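TopQuadrant does not spell out how AutoClassifier works, so here is only a naive illustration of the general concept of automated tagging against a controlled vocabulary; the vocabulary, concepts, and document are invented:

```python
# Naive auto-tagger: match controlled-vocabulary terms in document text.
# This illustrates the concept only; it is not TopBraid's algorithm.

VOCABULARY = {
    "hadoop": "Infrastructure",
    "taxonomy": "Information Management",
    "metadata": "Information Management",
}

def auto_classify(text):
    """Return the set of concepts whose vocabulary terms appear in text."""
    words = set(text.lower().split())
    return {concept for term, concept in VOCABULARY.items() if term in words}

doc = "Our taxonomy team manages metadata for the Hadoop cluster"
print(auto_classify(doc))  # {'Information Management', 'Infrastructure'}
```

A real system would add stemming, phrase matching, and confidence scores, with a human review step before tags enter the production tag set, which is the workflow the paragraph above describes.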

TopBraid might be the solution an enterprise system needs to braid its data into style.

Whitney Grace, October 26, 2015

Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
