August 17, 2015
IBM has made available two open source repositories for the IBM i2 intelligence platform: the Data-Acquisition-Accelerators and Intelligence-Analysis-Platform can both be found on the IBM-i2 page at GitHub. The IBM i2 suite of products includes many parts that work together to give law enforcement, intelligence organizations, and the military powerful data analysis capabilities. For a glimpse of what these products can do, we recommend checking out the videos at the IBM i2 Analyst’s Notebook page. (You may have to refresh the page before the videos will play.)
The Analyst’s Notebook is but one piece, of course. For the suite’s full description, I turned to the product page, IBM i2 Intelligence Analysis Platform V3.0.11. The Highlights summary describes:
“The IBM i2 Intelligence Analysis product portfolio comprises a suite of products specifically designed to bring clarity through the analysis of the mass of information available to complex investigations and scenarios to help enable analysts, investigators, and the wider operational team to identify, investigate, and uncover connections, patterns, and relationships hidden within high-volume, multi-source data to create and disseminate intelligence products in real time. The offerings target law enforcement, defense, government agencies, and private sector businesses to help them maximize the value of the mass of information that they collect to discover and disseminate actionable intelligence to help them in their pursuit of predicting, disrupting, and preventing criminal, terrorist, and fraudulent activities.”
The description goes on to summarize each piece, from the Intelligence Analysis Platform to the Information Exchange Visualizer. I recommend readers check out this page, and, especially, the videos mentioned above for better understanding of this software’s capabilities. It is an eye-opening experience.
Cynthia Murrell, August 18, 2015
August 13, 2015
Call me skeptical. Okay, call me a person who is fed up with silly jargon. You know what a database is, right? You know what a data warehouse is, well, sort of, maybe? Do you know what a data lake is? I don’t.
A lake, according to the search engine du jour Gibiru:
An area prototypically filled with water, also of variable size.
A data lake, therefore, is an area filled with zeros and ones, also of variable size. How does a data lake differ from a database or a data warehouse?
According to the write up “Sink or Swim – Why your Organization Needs a Data Lake”:
A Data Lake is a storage repository that holds a vast amount of raw data in its native format for processing later by the business.
The magic in this unnecessary jargon is, in my opinion, a quest (perhaps Quixotic?) for sales leads. The write up points out that a data lake is available. A data lake is accessible. A data lake is—wait for it—Hadoop.
What happens if the water is neither clear nor pristine? One cannot unleash the hounds of the EPA to resolve the problem of data which may not be very good until validated, normalized, and subjected to the ho hum tests which some folks would have me believe are irrelevant steps in the land of a marketer’s data lakes.
My admonition, “Don’t drink the water until you know it won’t make life uncomfortable—or worse. Think fatal.”
Stephen E Arnold, August 13, 2015
August 11, 2015
Editor’s note: The full text of the exclusive interview with Dr. Daniel J. Rogers, co-founder of Terbium Labs, is available on the Xenky Cyberwizards Speak Web service at www.xenky.com/terbium-labs. The interview was conducted on August 4, 2015.
Significant innovations in information access, despite the hyperbole of marketing and sales professionals, are relatively infrequent. In an exclusive interview, Danny Rogers, one of the founders of Terbium Labs, has developed a way to flip on the lights to make it easy to locate information hidden in the Dark Web.
Web search has been a one-trick pony since the days of Excite, HotBot, and Lycos. For most people, a mobile device takes cues from the user’s location and click streams and displays answers. Access to digital information requires more than parlor tricks and pay-to-play advertising. A handful of companies are moving beyond commoditized search, and they are opening important new markets such as the detection of secret and high value data theft. Terbium Labs can “illuminate the Dark Web.”
In an exclusive interview, Dr. Danny Rogers, who founded Terbium Labs with Michael Moore, explained the company’s ability to change how data breaches are located. He said:
Typically, breaches are discovered by third parties such as journalists or law enforcement. In fact, according to Verizon’s 2014 Data Breach Investigations Report, that was the case in 85% of data breaches. Furthermore, discovery, because it is by accident, often takes months, or may not happen at all when limited personnel resources are already heavily taxed. Estimates put the average breach discovery time between 200 and 230 days, an exceedingly long time for an organization’s data to be out of their control. We hope to change that. By using Matchlight, we bring the breach discovery time down to between 30 seconds and 15 minutes from the time stolen data is posted to the web, alerting our clients immediately and automatically. By dramatically reducing the breach discovery time and bringing that discovery into the organization, we’re able to reduce damages and open up more effective remediation options.
Terbium’s approach, it turns out, can be applied to traditional research into content domains to which most systems are effectively blind. At this time, a very small number of companies are able to index content that is not available to traditional content processing systems. Terbium acquires content from Web sites which require specialized software to access. Terbium’s system then processes the content, converting it into the equivalent of an old-fashioned fingerprint. Real-time pattern matching makes it possible for the company’s system to locate a client’s content, either in textual form, software binaries, or other digital representations.
One of the most significant information access innovations uses systems and methods developed by physicists to deal with the flood of data resulting from research into the behaviors of difficult-to-differentiate subatomic particles.
One part of the process is for Terbium to acquire (crawl) content and convert it into encrypted 14-byte strings of zeros and ones. A client such as a bank then uses the Terbium content encryption and conversion process to produce representations of the confidential data, computer code, or other data. Terbium’s system, in effect, looks for matching digital fingerprints. The task of locating confidential or proprietary data via traditional means is expensive and often a hit and miss affair.
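The general idea can be sketched in a few lines. This is a minimal illustration of fingerprint matching, not Terbium’s patented method: it hashes overlapping windows of content into fixed-length fingerprints so that only hashes, never raw data, need to be compared. The window size, sample strings, and use of truncated SHA-256 are all my assumptions for illustration.

```python
import hashlib

FINGERPRINT_BYTES = 14  # the interview mentions 14-byte strings

def fingerprints(text: str, window: int = 32) -> set:
    """Hash overlapping byte windows into fixed-length fingerprints.

    Neither party needs to share raw content; only hashes are compared.
    """
    data = text.encode("utf-8")
    return {
        hashlib.sha256(data[i:i + window]).digest()[:FINGERPRINT_BYTES]
        for i in range(0, max(len(data) - window + 1, 1))
    }

# The client fingerprints a confidential record locally (hypothetical data)...
client_prints = fingerprints("ACCT 4421-9932 routing 021000021")
# ...while the crawler fingerprints content found on a paste site.
crawled_prints = fingerprints("dump: ACCT 4421-9932 routing 021000021 pwd hunter2")

# Any overlap suggests the client's data has surfaced.
print(len(client_prints & crawled_prints) > 0)  # prints True
```

Because the crawler slides its window over every offset, a verbatim copy of the client’s record anywhere in the crawled text produces at least one matching hash, which is the essence of the approach described above.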
Terbium Labs changes the rules of the game and in the process has created a way to provide its licensees with anti-fraud and anti-theft measures which are unique. In addition, Terbium’s digital fingerprints make it possible to find, analyze, and make sense of digital information not previously available. The system has applications for the Clear Web, which millions of people access every minute, to the hidden content residing on the so called Dark Web.
Terbium Labs, a start up located in Baltimore, Maryland, has developed technology that makes use of advanced mathematics—what I call numerical recipes—to perform analyses for the purpose of finding connections. The firm’s approach is one that deals with strings of zeros and ones, not the actual words and numbers in a stream of information. By matching these numerical tokens with content such as a data file of classified documents or a record of bank account numbers, Terbium does what strikes many, including myself, as a remarkable achievement.
Terbium’s technology can identify highly probable instances of improper use of classified or confidential information. Terbium can pinpoint where the compromised data reside on either the Clear Web, another network, or the Dark Web. Terbium then alerts the organization about the compromised data and works with the victim of Internet fraud to resolve the matter in a satisfactory manner.
Terbium’s breakthrough has attracted considerable attention in the cyber security sector, and applications of the firm’s approach are beginning to surface for disciplines from competitive intelligence to health care.
We spent a significant amount of time working on both the private data fingerprinting protocol and the infrastructure required to privately index the dark web. We pull in billions of hashes daily, and the systems and technology required to do that in a stable and efficient way are extremely difficult to build. Right now we have over a quarter trillion data fingerprints in our index, and that number is growing by the billions every day.
The idea for the company emerged from a conversation with a colleague who wanted to find out immediately if a high profile client list was ever leaked to the Internet. But, said Rogers, “This individual could not reveal to Terbium the list itself.”
How can an organization locate secret information if that information cannot be provided to a system able to search for the confidential information?
The solution Terbium’s founders developed relies on novel use of encryption techniques, tokenization, Clear and Dark Web content acquisition and processing, and real time pattern matching methods. The interlocking innovations have been patented (US8,997,256), and Terbium is one of the few companies in the world, perhaps the only one, able to crack open Dark Web content within regulatory and national security constraints.
I think I have to say that the adversaries are winning right now. Despite billions being spent on information security, breaches are happening every single day. Currently, the best the industry can do is be reactive. The adversaries have the perpetual advantage of surprise and are constantly coming up with new ways to gain access to sensitive data. Additionally, the legal system has a long way to go to catch up with technology. It really is a free-for-all out there, which limits the ability of governments to respond. So right now, the attackers seem to be winning, though we see Terbium and Matchlight as part of the response that turns that tide.
Terbium’s product is Matchlight. According to Rogers:
Matchlight is the world’s first truly private, truly automated data intelligence system. It uses our data fingerprinting technology to build and maintain a private index of the dark web and other sites where stolen information is most often leaked or traded. While the space on the internet that traffics in that sort of activity isn’t intractably large, it’s certainly larger than any human analyst can keep up with. We use large-scale automation and big data technologies to provide early indicators of breach in order to make those analysts’ jobs more efficient. We also employ a unique data fingerprinting technology that allows us to monitor our clients’ information without ever having to see or store their originating data, meaning we don’t increase their attack surface and they don’t have to trust us with their information.
Stephen E Arnold, August 11, 2015
August 8, 2015
I read “What Does a Data Scientist Do That a Traditional Data Analytics Team Can’t?” Good marketing question. Math, until the whole hearted embrace of fuzziness, was reasonably objective. Survivors of introductory statistics learned about the subjectivity involved with Bayesian antics and the wonder of fiddling with thresholds. You remember. Above this value, do this; below this value, do that. Eventually one can string together numerical recipes which make threshold decisions based on inputs. In the hands of responsible, capable, and informed professionals, the systems work reasonably. Sure, smart software can drift and then run off the rails. There are procedures to keep layered systems on track. They work reasonably well for horseshoes. You know. Close enough for horseshoes. Monte Carlo’s bright lights beckon.
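The threshold fiddling mentioned above can be made concrete with a toy sketch. The function name, scores, and cutoff values are my own illustrations; the point is only that the same inputs yield different “truths” depending on where an analyst sets the line.

```python
def classify(score: float, threshold: float = 0.7) -> str:
    """Above this value, do this; below this value, do that."""
    return "flag" if score >= threshold else "pass"

# Three hypothetical model scores.
scores = [0.65, 0.72, 0.90]

# Same inputs, two different thresholds, two different sets of decisions.
print([classify(s, threshold=0.7) for s in scores])  # ['pass', 'flag', 'flag']
print([classify(s, threshold=0.8) for s in scores])  # ['pass', 'pass', 'flag']
```

String enough of these threshold decisions together and you have the layered numerical recipes the paragraph describes, with all the subjectivity that implies.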
The write up takes a different approach. The idea is that someone who does descriptive procedures is an apple. The folks who do predictive procedures are oranges. One lets the data do the talking. Think of a spreadsheet jockey analyzing historical pre tax profits at a public company. Now contrast that with a person who looks at data and makes judgments about what the data “mean.”
Close enough for horse shoes.
Which is more fun? Go with the fortune tellers, of course.
The write up also raises the apparent black-white issue of structured versus unstructured data. The writer says:
Unstructured or “dirty” data is in many ways the opposite of its more organized counterpart, and is what data scientists rely on for their analysis. Data of this type is made up of qualitative rather than quantitative information — descriptive words instead of measurable numbers — and comes from more obscure sources such as emails, sentiment expressed in blogs or engagement across social media. Processing this information also involves the use of probability and statistical algorithms to translate what is learned into advanced applications for machine learning or even artificial intelligence, and these skills are often well beyond those of the average data analyst.
There you go. One does not want to be average. I am tempted to ask mode, median, or mean?
Net net: If the mathematical foundation is wrong, if the selected procedure is inappropriate, if the data are not validated—errors are likely and they will propagate.
One does not need to be too skilled in mathematics to understand that mistakes are not covered up or ameliorated with buzzwords.
Stephen E Arnold, August 8, 2015
August 7, 2015
I noted the article “IBM Adds Medical Images to Watson, Buying Merge Healthcare for $1 Billion.” The company is in the content management business. Medical images are pretty much a hassle whether in good old fashioned film form or in digital versions. In the few opportunities I have had to look at murky gray or odd duck enhanced color images, I marveled at how a professional would make sense of the data displayed. Did this explanation trigger thoughts of IBM FileNet?
The image processing technology available from specialist firms for satellite or surveillance image analysis is a piece of cake compared to the medical imaging examples I reviewed. From my point of view, the nifty stuff available to an analyst looking at the movement of men and equipment was easier to figure out.
Merge delivers a range of image and content management services to health care outfits. The systems can work with on premises systems and park data in the cloud in a way that keeps the compliance folks happy.
According to the write up:
When IBM set up its Watson health business in April, it began with a couple of smaller medical data acquisitions and industry partnerships with Apple, Johnson & Johnson and Medtronic. Last week, IBM announced a partnership with CVS Health, the large pharmacy chain, to develop data-driven services to help people with chronic ailments like diabetes and heart disease better manage their health.
Now Watson is plopping down $1 billion to get a more substantive, image centric, and—dare I say it—more traditional business.
The idea I learned:
“We’re bringing Watson and analytics to the largest data set in health care — images,” John Kelly, IBM’s senior vice president of research who oversees the Watson business, said in an interview.
The idea, as I understand the management speak, is that Watson will be able to perform image analysis, thus allowing IBM to convert Watson into a significant revenue generator. IBM does need all the help it can get. The company has just achieved a milestone of sorts; IBM’s revenue has declined for 13 consecutive quarters.
My view is that the integration of the Merge systems with the evolving Watson “solution” will be expensive, slow, and frustrating to those given the job of making image analysis better, faster, and cheaper.
My hunch is that the time and cost required to integrate Watson and Merge will be an issue in six or nine months. Once the “integration” is complete, the costs of adding new features and functions to keep pace with regulations and advances in diagnosis and treatment will create a 21st century version of FileNet. (FileNet, as you, gentle reader, know, was a 2006 acquisition. At the time, nine years ago, IBM said that the FileNet technology would
“advance its Information on Demand initiative, IBM’s strategy for pursuing the growing market opportunity around helping clients capture insights from their information so it can be used as a strategic asset. FileNet is a leading provider of business process and content management solutions that help companies simplify critical and everyday decision making processes and give organizations a competitive advantage.”
FileNet was an imaging technology for financial institutions and a search system which allowed a person with access to the system to locate a check or other scanned document.)
And FileNet today? Well, like many IBM acquisitions it is still chugging along, just part of the services oriented architecture at Big Blue. Why, one might ask, was the FileNet technology not applicable to health care? I will leave you to ponder the answer.
I want to be optimistic about the upside of this Merge acquisition for the companies involved and for the health care professionals who will work with the Watsonized system. I assume that IBM will put on a happy face about Watson’s image analysis capabilities. I, however, want to see the system in action and have some hard data, not M&A fluff, about the functionality and accuracy of the merged systems.
At this moment, I think Watson and other senior IBM managers are looking for a way to make a lemon grove from Watson. Nothing makes bankers and deal makers happier than a big, out of the blue acquisition.
Now the job is to find a way to sell enough lemons to pay for the maintenance and improvement of the lemon grove. I assume Watson has an answer to ongoing costs for maintenance and enhancements, bug finding and stomping, and the PR such activities trigger. Yep, costs and revenue. Boring but important to IBM’s stakeholders.
Stephen E Arnold, August 7, 2015
August 7, 2015
IT architecture might appear to be the same across the board, but the standards change depending on the industry. Rupert Brown wrote “From BCBS To TOGAF: The Need For A Semantically Rigorous Business Architecture” for Bob’s Guide, and he discusses how TOGAF is the de facto standard for global enterprise architecture. He explains that while TOGAF has its strengths, its weaknesses include its reliance on diagrams and on PowerPoint to produce them.
Brown spends a large portion of the article stressing that the information content and the model are more important and that a diagram should only be rendered later. He goes on to argue that as industries have advanced, the tools have become more complex, and it is very important to have a more universal approach to IT architecture.
What is Brown’s supposed solution? Semantics!
“The mechanism used to join the dots is Semantics: all the documents that are the key artifacts that capture how a business operates and evolves are nowadays stored by default in Microsoft or Open Office equivalents as XML and can have semantic linkages embedded within them. The result is that no business document can be considered an island any more – everything must have a reason to exist.”
The reason that TOGAF has not been standardized using semantics is the lack of something to connect various architecture models together. A standardized XBRL language for financial and regulatory reporting would help get the process started, but the biggest problem will be people who make a decent living using PowerPoint (so he claims).
Brown calls for a global reporting standard for all industries, but that is a pie in the sky hope unless the government imposes regulations or all industries have a meeting of the minds. Why? The different industries do not always mesh, think engineering firms vs. a publishing house, and each has its own list of needs and concerns. Why not focus on getting standards right for one industry rather than across the board?
August 6, 2015
I read “Why Quality Management Needs Text Analytics.” I learned:
To analyze customer quality complaints to find the most common complaints and steer the production or service process accordingly can be a very tedious job. It takes time and resources.
This idea is similar to the one expressed by Ronen Feldman in a presentation he gave in the early 2000s. My notes of the event record that he reviewed the application of ClearForest technology to reports from automobile service professionals which presented customer comments and data about repairs. ClearForest’s system was able to pinpoint that a particular mechanical issue was emerging. The client responded to the signals from the ClearForest system and took remediating action. The point was that sometime in the early 2000s, ClearForest had built and deployed a text analytics system with a quality-centric capability.
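The quality-signal idea Feldman described can be sketched crudely. ClearForest’s actual system used far richer entity extraction than this; the sample notes and the simple frequency-jump rule below are my illustrative assumptions only.

```python
from collections import Counter

# Hypothetical service notes from two reporting periods.
last_month = [
    "oil change, routine",
    "brake pads worn, replaced",
    "routine inspection",
]
this_month = [
    "transmission slipping at highway speed",
    "customer reports transmission hesitation",
    "transmission fluid leak, seal replaced",
    "routine oil change",
]

def term_counts(notes):
    """Count word occurrences across a batch of free-text notes."""
    return Counter(word for note in notes for word in note.lower().split())

baseline, current = term_counts(last_month), term_counts(this_month)

# Flag terms whose frequency jumped against the baseline: a crude signal
# that a particular mechanical issue is emerging.
emerging = {t: c for t, c in current.items() if c >= 3 and baseline[t] == 0}
print(emerging)  # {'transmission': 3}
```

Even this naive count surfaces the sudden cluster of transmission complaints, which is the kind of early signal the ClearForest deployment acted on.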
I mention this point because many companies are recycling ideas and concepts which are in some cases long beards. ClearForest was acquired by the estimable Thomson Reuters. Some of the technology is available as open source at Calais.
In search and content processing, the case examples, the lingo, and even the technology have entered what I call a “recycling” phase.
I learned about several new search systems this week. I looked at each. One was a portal, another a metasearch system, and a third a privacy centric system with a somewhat modest index. Each was presented as new, revolutionary, and innovative. The reality is that today’s information highways are manufactured from recycled plastic bottles.
Stephen E Arnold, August 6, 2015
August 4, 2015
IBM’s supercomputer Watson is being “trained” in various fields, such as healthcare, app creation, customer service relations, and creating brand new recipes. The applications for Watson are possibly endless. The supercomputer is combining its “skills” from healthcare and recipes by trying its hand at nutrition. Welltok invented the CaféWell Health Optimization Platform, a PaaS that creates individualized healthcare plans, and it applied Watson’s big data capabilities to its Healthy Dining CaféWell personal concierge app. eWeek explains that “Welltok Takes IBM Watson Out To Dinner,” so it can offer clients personalized restaurant menu choices.
“‘Optimal nutrition is one of the most significant factors in preventing and reversing the majority of our nation’s health conditions, like diabetes, overweight and obesity, heart disease and stroke and Alzheimer’s,’ said Anita Jones-Mueller, president of Healthy Dining, in a statement. ‘Since most Americans eat away from home an average of five times each week and it can be almost impossible to know what to order at restaurants to meet specific health needs, it is very important that wellness and condition management programs empower smart dining out choices. We applaud Welltok’s leadership in providing a new dimension to healthy restaurant dining through its groundbreaking CaféWell Concierge app.’”
Restaurant menus are very vague when it comes to nutritional information. A menu will state whether something is gluten-free, spicy, or a vegetarian option, but all other information is missing. In order to find a restaurant’s nutritional information, you have to hit the Internet and conduct research. A new law will force restaurants to post calorie counts, but that will not include the amount of sugar, sodium, and other information. People have been making poor eating choices, partially due to the lack of information; if they knew what they were eating, they could improve their health. If Watson’s abilities can decrease the US’s waistline, it is for the better. The bigger challenge would be to get people to use the information.
August 3, 2015
Before IBM purchased i2 Ltd from an investment outfit, I did some work for Mike Hunter, one of the founders of i2 Ltd. i2 is not a household name. The fault lies not with i2’s technology; the fault lies at the feet of IBM.
A bit of history. Back in the 1990s, Hunter was working on an advanced degree in physics at Cambridge University. His undergraduate degree was from Manchester University. At about the same time, Michael Lynch, founder of Autonomy and DarkTrace, was a graduate of Cambridge and an early proponent of guided machine learning implemented in the Digital Reasoning Engine or DRE, an influential invention from Lynch’s pre-Autonomy student research. Interesting product name: Digital Reasoning Engine. Lynch’s work was influential and triggered some me too approaches in the world of information access and content processing. Examples can be found in the original Fast Search & Transfer enterprise systems and in Recommind’s probabilistic approach, among others.
By 2001, i2 had placed its content processing and analytics systems in most of the NATO alliance countries. There were enough i2 Analyst Workbenches in Washington, DC to cause the Cambridge-based i2 to open an office in Arlington, Virginia.
In the mid 1990s, i2 delivered tools which allowed an analyst to identify people of interest, display relationships among these individuals, and drill down into underlying data to examine surveillance footage or look at text from documents (public and privileged).
IBM has i2 technology, and it also owns the Cybertap technology. The combination allows IBM to deploy for financial institutions a remarkable range of field proven, powerful tools. These tools are mature.
Due to the marketing expertise of IBM, a number of firms looked at what Hunter “invented” and concluded that there were whizzier ways to deliver certain functions. Palantir, for example, focused on Hollywood style visualization, Digital Reasoning emphasized entity extraction, and Haystax stressed insider threat functions. Today there are more than two dozen companies involved in what I call the Hunter-i2 market space.
Some of these have pushed in important new directions. Three examples of important innovators are: Diffeo, Recorded Future, and Terbium Labs. There are others which I can name, but I will not. You will have to wait until my new Dark Web study becomes available. (If you want to reserve a copy, send an email to benkent2020 at yahoo dot com. The book will run about 250 pages and cost about $100 when available as a PDF.)
The reason I mention i2 is that a recent Wall Street Journal article, “Spy Tools Come to Wall Street” (print edition for August 3, 2015), published online as “Spy Software Gets a Second Life on Wall Street,” did not. That’s not a surprise because the Murdoch property defines “news” in an interesting way.
The write up profiles a company called Digital Reasoning, which was founded in 2000 by a clever lad from the University of Virginia. I am confident of the academic excellence of the university because my son graduated from this fine institution too.
Digital Reasoning is one of the firms engaged in cognitive computing. I am not sure what this means, but I know IBM is pushing the concept for its fascinating Watson technology, which can create recipes and cure cancer. I am not sure about generating a profit, but that’s another issue associated with the cognitive computing “revolution.”
In pitching prospective clients, Digital Reasoning often shows a demonstration of how its system responded when it was fed 500,000 emails related to the Enron scandal made available by the Federal Energy Regulatory Commission. After being “taught” some key concepts about compliance, the Synthesys program identified dozens of suspicious emails in which participants were using language that suggested attempts to conceal or destroy information.
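The “teach key concepts, then flag” workflow in that demo can be caricatured in a few lines. This toy stand-in uses hand-picked phrases and simple substring matching; Synthesys itself is a trained system, and the phrases, function names, and sample emails below are entirely my own illustration.

```python
# Hypothetical "taught" compliance concepts, reduced to trigger phrases.
CONCEALMENT_PHRASES = [
    "delete this email",
    "shred",
    "keep this off the books",
    "don't put this in writing",
]

def flag_suspicious(emails):
    """Return the messages whose wording suggests concealment or destruction."""
    flagged = []
    for body in emails:
        text = body.lower()
        if any(phrase in text for phrase in CONCEALMENT_PHRASES):
            flagged.append(body)
    return flagged

corpus = [
    "Please review the Q3 numbers before the board call.",
    "Shred the Raptor files and delete this email after reading.",
]
print(flag_suspicious(corpus))
```

A real compliance system replaces the phrase list with learned models, but the input-output shape, corpus in, a short list of suspect messages out, is the same as the demonstration described above.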
Interesting. I would suggest that the Digital Reasoning approach is 15 years old; that is, only marginally newer than the i2 system. Digital Reasoning lacks the functionality of Cybertap. Furthermore, companies like Diffeo, Recorded Future, and Terbium incorporate sophisticated predictive methods which operate in an environment of real time information flows. The idea is that looking at an archive is interesting and useful to an attorney or investigator looking backwards. However, the focus for many financial firms is on what is happening “now.”
The Wall Street Journal story reminds me of the third party descriptions of Autonomy’s mid 1990s technology. Those who fail to understand the quantity of content preparation and manual, subject matter expert effort required to obtain high value outputs are watching smoke, not investigating the fire.
For organizations looking for next generation technology which is and has been working for several years, one must push beyond the Palantir valuation and look to the value of innovative systems and methods.
For a starter, check out Diffeo, Recorded Future, and Terbium Labs. Please, push IBM to exert some effort to explain the i2-Cybertap capabilities. I tip my hat to the PR firm which may have synthesized some information for a story that is likely to make the investors’ hearts race this fine day.
Stephen E Arnold, August 3, 2015
August 3, 2015
I read “Data Scientists to CEOs: You Can’t Handle the Truth.” I enjoy write ups about data science which start off with the notion of truth. I know that the “truth” referenced is the outputs of analytics systems.
Call me skeptical. If the underlying data are not normalized, validated, and timely, the likelihood of truth becomes even murkier than it was in my college philosophy class. Roger Ailes allegedly said:
Truth is whatever people will believe.
Toss in the criticism of a senior manager who in the US is probably a lawyer or an accountant, and you have a foul brew. Why would a manager charged with hitting quarterly targets or generating enough money to meet payroll quiver with excitement when a data scientist presents “truth”?
There is that pesky perception thing. There are frames of reference. There are subjective factors in play. Think of the dentist who killed Cecil. I am not sure data science will solve his business and personal challenges. Do you?
The write up is a silly fan rant for the fuzzy discipline of data science. Data science does not pivot on good old statisticians with their love of SAS and SPSS, fancy math, and 17th century notions of what constitutes a valid data set. Nope.
The data scientist has to communicate the known unknowns to his or her CEO. Shades of Rumsfeld. Does today’s CEO want to know more about the uncertainty in the business? The answer is, “Maybe.” But senior managers often get information that is filtered, shaped, and presented to create an illusion. Shattering those illusions can have some negative career consequences even for data scientists, assuming there is such a discipline as data science.
Evoking the truth from statistical processes which are output from systems configured by others can be interesting. Those threshold settings are not theoretical. Those settings determine what the outputs are and what they are “about.”
Connecting an automated output to something that the data scientist asserts should be changed strikes me as somewhat parental. How does that work on a manager like Dick Cheney? How does that work on the manager of a volunteer committee working on a parent teacher luncheon?
I thought the Jack Benny program from the 1930s to 1960s was amusing. Some of the output about data science suggests that comedy may be a more welcoming profession than management based on truth from data science. Truth and statistics. Amazing comedy.
Stephen E Arnold, August 3, 2015