Machine Learning Does Not Have All the Answers

November 25, 2016

Despite our broader knowledge, we still believe that if we press a few buttons and hit enter, computers can do all the work for us.  The advent of machine learning and artificial intelligence has not dispelled this belief; instead, big data vendors rely on this image to sell their wares.  Big data, though, has its weaknesses, and before you deploy a solution you should read Network World’s “6 Machine Learning Misunderstandings.”

Juniper Networks security intelligence software engineer Roman Sinayev explains some of the pitfalls to avoid before implementing big data technology.  It is important to take into consideration all the variables, including the unexpected ones; otherwise, one forgotten factor could wreak havoc on your system.  Also, do not forget to actually understand the data you are analyzing and its origin.  Pushing forward on a project without understanding the data’s background is a guaranteed failure.

Other practical advice is to build a test model and to add more data when the model does not deliver, but some of the advice is new even to us:

One type of algorithm that has recently been successful in practical applications is ensemble learning – a process by which multiple models combine to solve a computational intelligence problem. One example of ensemble learning is stacking simple classifiers like logistic regressions. These ensemble learning methods can improve predictive performance more than any of these classifiers individually.

Employing more than one algorithm?  It makes sense and is practical advice; why did that not cross our minds?  The rest of the advice offered is general stuff that can be applied to any project in any field; just change the lingo and the expert providing it.
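For readers who want to see the idea in code, here is a minimal stacking sketch using scikit-learn (our choice of library; the article names none): two simple base classifiers feed a logistic regression meta-learner.

```python
# A minimal stacking sketch with scikit-learn (our assumption; the
# article names no library). Two simple base classifiers feed a
# logistic-regression meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("tree", DecisionTreeClassifier(max_depth=3)),
    ],
    final_estimator=LogisticRegression(),  # meta-learner combines base predictions
)
stack.fit(X_train, y_train)
print("stacked accuracy:", stack.score(X_test, y_test))
```

In practice one would compare this score against each base classifier alone, which is exactly the improvement the quoted passage describes.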

Whitney Grace, November 25, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

 

Do Not Forget to Show Your Work

November 24, 2016

Showing your work is a messy but necessary step to prove how one arrived at a solution.  Most of the time it is never reviewed, but with big data, people wonder how computer algorithms arrive at their conclusions.  Engadget explains that computers are being forced to prove their results in “MIT Makes Neural Networks Show Their Work.”

Understanding neural networks is extremely difficult, but MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) has developed a way to map these complex systems.  CSAIL figured the task out by splitting a network into two smaller modules: one extracts text segments and scores them according to their length and coherence, and the second predicts each segment’s subject and attempts to classify it.  The mapping modules sound almost as complex as the actual neural networks.  To alleviate the stress and add a giggle to their research, CSAIL had the modules analyze beer reviews:

For their test, the team used online reviews from a beer rating website and had their network attempt to rank beers on a 5-star scale based on the brew’s aroma, palate, and appearance, using the site’s written reviews. After training the system, the CSAIL team found that their neural network rated beers based on aroma and appearance the same way that humans did 95 and 96 percent of the time, respectively. On the more subjective field of “palate,” the network agreed with people 80 percent of the time.

One set of data is as good as another to test CSAIL’s network mapping tool.  CSAIL hopes to fine-tune the machine learning project and use it in breast cancer research to analyze pathology data.
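To make the two-module idea concrete, here is a toy sketch of the extractor-plus-predictor pattern (our reconstruction in PyTorch, not CSAIL’s code; the module names and sizes are ours):

```python
# Toy sketch of the two-module idea (our reconstruction, not CSAIL's
# code): an extractor scores each token, and a predictor classifies
# using only the highly scored tokens.
import torch
import torch.nn as nn

VOCAB, EMB, SEQ = 100, 16, 12

class Extractor(nn.Module):
    """Scores each token; high scores mark the extracted 'rationale'."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.score = nn.Linear(EMB, 1)

    def forward(self, tokens):
        emb = self.embed(tokens)                # (batch, seq, emb)
        gates = torch.sigmoid(self.score(emb))  # (batch, seq, 1)
        return emb * gates, gates               # keep only scored text

class Predictor(nn.Module):
    """Classifies the segment from the gated embeddings alone."""
    def __init__(self, n_classes=5):
        super().__init__()
        self.classify = nn.Linear(EMB, n_classes)

    def forward(self, gated_emb):
        return self.classify(gated_emb.mean(dim=1))  # pool, then classify

extractor, predictor = Extractor(), Predictor()
tokens = torch.randint(0, VOCAB, (4, SEQ))       # a fake batch of reviews
gated, gates = extractor(tokens)
logits = predictor(gated)
print(logits.shape, gates.mean().item())         # (4, 5) and the average gate
```

The gates make the network’s reasoning inspectable: the highly scored tokens are the “work” the network shows.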

Whitney Grace, November 24, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Big Data Teaches Us We Are Paranoid

November 18, 2016

I love election years!  Actually, that is sarcasm.  Election years bring out the worst in Americans.  The media runs rampant with predictions that each nominee is the equivalent of the anti-Christ and will “doom America,” “ruin the nation,” or “destroy humanity.”  The sane voter knows that whoever the next president is will probably not destroy the nation or everyday life…much.  Fear, hysteria, and paranoia sell better than puff pieces, and big data supports that theory.  Popular news site Newsweek shares that “Our Trust In Big Data Shows We Don’t Trust Ourselves.”

The article starts with a new acronym: DATA.  It is not that new, but Newsweek puts a new spin on it.  D means dimensions, or different datasets: the ability to combine multiple data streams for new insights.  A is for automatic, which is self-explanatory.  T stands for time, as in data processed in real time.  The second A is for artificial intelligence, which discovers all the patterns in the data.

Artificial intelligence is where the problems start to emerge.  Big data algorithms can be unintentionally programmed with bias.  In order to interpret data, artificial intelligence must learn from prior datasets.  These older datasets can show human bias, such as racism, sexism, and socioeconomic prejudices.
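A toy demonstration of the point (our sketch, not Newsweek’s; the data and feature names are invented): train a model on historical decisions that penalized one group, and the model reproduces the penalty even for equally qualified candidates.

```python
# Toy illustration (our sketch) of how bias in historical labels
# survives training: past hiring decisions penalized group B, and the
# trained model learns to do the same.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, n)                  # 0 = group A, 1 = group B
skill = rng.normal(0, 1, n)                    # skill distributed equally
# Historical labels: skill matters, but group B was penalized.
hired = (skill - 1.0 * group + rng.normal(0, 0.5, n)) > 0

model = LogisticRegression().fit(np.column_stack([group, skill]), hired)
equal_skill = np.array([[0, 0.5], [1, 0.5]])   # same skill, different group
print(model.predict_proba(equal_skill)[:, 1])  # group B scores lower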

Our machines are not as objective as we believe:

But our readiness to hand over difficult choices to machines tells us more about how we see ourselves.

Instead of seeing a job applicant as a person facing their own choices, capable of overcoming their disadvantages, they become a data point in a mathematical model. Instead of seeing an employer as a person of judgment, bringing wisdom and experience to hard decisions, they become a vector for unconscious bias and inconsistent behavior.  Why do we trust the machines, biased and unaccountable as they are? Because we no longer trust ourselves.

Newsweek really knows how to be dramatic.  We no longer trust ourselves?  No, we trust ourselves more than ever, because we rely on machines to make our simple decisions so we can concentrate on more important topics.  However, what we deem important is biased.  Taking the Newsweek example, what a job applicant considers an important submission, an HR representative will see as the 500th submission that week.  Big data should provide us with better, more diverse perspectives.

Whitney Grace, November 18, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Hard and Soft Clustering Explained

November 17, 2016

I read “An Introduction to Clustering and Different Methods of Clustering.” Clustering, it seems, remains a popular topic among the quasi-search and content processing crowd. What’s interesting about this write up is that it introduces hard clustering and soft clustering. I had assumed that clustering was neither hard nor soft. Here’s the distinction:

  • In hard clustering, each data point either belongs to a cluster completely or not. For example, in the above example each customer is put into one group out of the 10 groups.
  • In soft clustering, instead of putting each data point into a separate cluster, a probability or likelihood of that data point to be in those clusters is assigned.

The write up then highlights these go-to methods of clustering:

  • K-means clustering
  • Hierarchical clustering
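To make the distinction and the methods concrete, here is a minimal scikit-learn sketch (ours, not the write up’s): K-means assigns each point to exactly one cluster, a Gaussian mixture assigns a probability per cluster, and agglomerative clustering covers the hierarchical method.

```python
# Hard vs. soft clustering in scikit-learn (our sketch, not the
# article's code). K-means gives one label per point; a Gaussian
# mixture gives a probability for each cluster.
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

hard = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)
print("hard labels:", hard[:5])                 # one cluster per point

soft = GaussianMixture(n_components=3, random_state=7).fit(X)
print("soft memberships:", soft.predict_proba(X[:2]).round(2))

# Hierarchical clustering, the other go-to method the write up names.
hier = AgglomerativeClustering(n_clusters=3).fit_predict(X)
print("hierarchical labels:", hier[:5])
```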

The write up introduces the idea of supervised learning. I noted that the article did not point out that training is a time-consuming and often expensive exercise. The omission complements the “quick look” approach in the write up.

I am not sure that a person interested in clustering will be able to make a giant leap forward. Perhaps the effort will result in a hard soft landing?

Stephen E Arnold, November 17, 2016

AI to Profile Gang Members on Twitter

November 16, 2016

Researchers from the Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis) claim that an algorithm they developed is capable of identifying gang members on Twitter.

Vice.com recently published an article titled “Researchers Claim AI Can Identify Gang Members on Twitter,” which describes:

A deep learning AI algorithm that can identify street gang members based solely on their Twitter posts, and with 77 percent accuracy.

The article then points out the shortcomings of the algorithm:

According to one expert contacted by Motherboard, this technology has serious shortcomings that might end up doing more harm than good, especially if a computer pegs someone as a gang member just because they use certain words, enjoy rap, or frequently use certain emojis—all criteria employed by this experimental AI.
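Those criteria describe a fairly conventional text classifier. Here is a naive sketch of that kind of model (entirely our illustration; the Kno.e.sis deep learning system, its features, and its training data are not public), which shows how easily surface features drive the label:

```python
# A naive tweet classifier of the kind the criticism describes (our
# illustration; the actual Kno.e.sis deep learning model is not public).
# It keys on words and emojis -- surface features, not ground truth.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["new mixtape out 🎤", "lyrics all day", "family bbq today", "game night"]
labels = [1, 1, 0, 0]  # hypothetical labels; the real training data is unknown

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # char n-grams also capture emojis
    LogisticRegression(),
)
clf.fit(tweets, labels)
print(clf.predict(["mixtape drop"]))  # likely flagged on word choice alone
```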

The shortcomings do not end there. The data on Twitter is being analyzed in a silo. For example, let us assume that a few gang members are identified using the algorithm (remember, no location information is taken into consideration by the AI); what next?

Is it not necessary then to also identify other social media profiles of the supposed gang members, look at the big data they generate, analyze their communication patterns, and then form some conclusion? Unfortunately, the AI does none of this. It would, in fact, be a mammoth task to extrapolate data from multiple sources just to identify people with certain traits.

And most importantly, what if the AI is put in place and someone, just for fun, paints an innocent person as a gang member? As the article rightly points out, machines trained on prejudiced data tend to reproduce those same, very human, prejudices.

Vishal Ingole, November 16, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Lawyers Might Be Automated Too

November 14, 2016

The worry with artificial intelligence is that it will automate jobs and leave people without a way to earn income.  The general belief is that AI will automate manufacturing, retail, food service, and other industries, but what about law?  One would think that lawyers would never lose their jobs, because a human is required to navigate litigation and represent a person in court, right?  According to The Inquirer article “UCL Creates AI ‘Lawbot’ That Rules on Cases With Surprising Accuracy,” lawyers might be automated too.

On a level akin to Watson, researchers at University College London, led by Dr. Nikolaos Aletras, created an algorithm that peruses case information and predicts verdicts with notable accuracy.  The UCL team fed the algorithm litigation information from cases about torture, degrading treatment, privacy, and fair trials.  They hope the algorithm will be used to identify patterns in human rights abuses.
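For flavor, here is a minimal sketch of outcome prediction from case text (our assumption of the general setup; the case snippets and labels are invented, and this is not the UCL team’s code):

```python
# A minimal sketch of predicting case outcomes from text (our
# assumption of the setup; not the UCL team's code or data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

cases = [
    "applicant detained without trial for months",
    "authorities provided prompt judicial review",
    "prolonged solitary confinement and no counsel",
    "hearing held within statutory deadline",
]
outcomes = [1, 0, 1, 0]  # hypothetical: 1 = violation found

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(cases, outcomes)
print(model.predict(["no access to counsel during detention"]))  # likely 1
```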

Dr. Aletras does not think AI will replace judges and lawyers, but it could be used as a tool to identify patterns in cases with specific outcomes.  The algorithm has a 79% accuracy rate, which is not bad considering the amount of documentation involved.  Also, the downside is:

At a wider level, although 79 percent is a bit more ED-209 than we’d like for now, it does suggest that we’re a long way towards being able to install an ethical and moral code that would allow AI to … you know, not kill us and that.  With so many doomsayers warning us that the closer that we get to the so-called ‘singularity’ between humans and machines, the more likely we are to be toast as a race, it’s something of a good news story to see what’s being done to ensure AI stays on the straight and narrow.

Automation in the legal arena is a strong possibility for when “…implementation and interpretation of the law that is required, less so than the fact themselves.”  The human element is still needed to decide cases, but perhaps it would cut down on the number of lenient verdicts for pedophiles, sex traffickers, rapists, and other bad guys.  It does make one wonder what alternative fields lawyers would consider.

Whitney Grace, November 14, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

IBM Watson: Cruel, Cruel Caveats

November 12, 2016

There’s nothing like a cruel caveat applied to IBM Watson. Navigate to “Cognitive Computing Applications Present New Business Challenges.” These challenges are not “new”; what’s new is that naive smart software licensees are discovering that training software is difficult, time-consuming, and expensive. Best of all, the training is not forever. Smart systems need to be retrained because language and data change.
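A toy illustration of why retraining matters (our sketch, not from the write up; the synthetic data stands in for drifting language): a model trained once degrades as the data shifts, and retraining restores accuracy.

```python
# Toy illustration of drift (our sketch): a model trained once degrades
# as the data distribution shifts; retraining recovers the accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def sample(shift):
    X = rng.normal(shift, 1, (500, 3))
    y = (X.sum(axis=1) > shift * 3).astype(int)
    return X, y

X0, y0 = sample(0.0)                     # data at deployment time
model = LogisticRegression().fit(X0, y0)

X1, y1 = sample(1.5)                     # the data drifted later
print("stale model:", model.score(X1, y1))   # near chance
print("retrained:  ", LogisticRegression().fit(X1, y1).score(X1, y1))
```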

The write up reports that an executive involved in smart software at Rabobank, a Dutch outfit, offered this observation at the World of Watson conference held at the end of October 2016:

AI is everywhere, and people think it’s so fantastic. And these companies, including IBM, come in and then you go to do a project and see that it’s not really that great yet,” Serrurier Schepper said. “You have to train a model, and it takes time.”

The story continues:

After building a centralized AI unit, teams should look for quick wins and then publicize their success, Serrurier Schepper said. Models may take a long time to train, but once they’re delivering strong results, sharing this with the rest of the company can help build support for future initiatives.

Yep, time. Time is money, which is a statement any bank professional with Excel can understand.

How does one avoid failing? That’s easy. The write up reports:

Choosing the right use cases for cognitive computing applications is also important. There is a general notion that AI software can perform just about any task. And while that may be the ultimate goal of the technology, today’s tools are a ways off from that. Enterprises need to identify business problems where the technology is competent, and that’s not always a simple proposition.

The point is that no matter how general-purpose smart software like Watson is perceived to be, the licensee has to figure out exactly what problem to attack. The reason is that the time and cost of creating a model and then training the smart software will put the project deep into a swamp of red, mercury-tinged muck.

But be prepared to spend money. The write up quotes another Watson aware executive as saying:

“If you get too hung up on ROI, you’ll never do anything.”

I disagree. Those involved in the project may have an opportunity to look for a new job. It’s the time and cost thing that creates these new horizons for some smart software champions.

Stephen E Arnold, November 12, 2016

Lucidworks Hires Watson

November 7, 2016

One of our favorite companies to track is Lucidworks, due to their commitment to open source technology and development in business enterprise systems.  The San Diego Times shares that “Lucidworks Integrates IBM Watson To Fusion Enterprise Discovery Platform.”  This means that Lucidworks has integrated IBM’s supercomputer into their Fusion platform to help developers create discovery applications to capture data and discover insights.  In short, they have added a powerful big data algorithm.

While Lucidworks is built on open source software, adding a proprietary supercomputer will only benefit their clients.  Watson has proven itself an invaluable big data tool and, paired with the Fusion platform, will do wonders for enterprise systems.  Data is a key component of every industry, but understanding and implementing it is difficult:

Lucidworks’ Fusion is an application framework for creating powerful enterprise discovery apps that help organizations access all their information to make better, data-driven decisions. Fusion can process massive amounts of structured and multi-structured data in context, including voice, text, numerical, and spatial data. By integrating Watson’s ability to read 800 million pages per second, Fusion can deliver insights within seconds. Developers benefit from this platform by cutting down the work and time it takes to create enterprise discovery apps from months to weeks.

With the Watson upgrade to Lucidworks’ Fusion platform, users gain natural language processing and machine learning.  It makes the Fusion platform act more like a Star Trek computer that can provide data analysis and even interpret results.
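Conceptually, the integration pattern looks something like the following sketch (entirely our illustration; the names `watson_enrich` and `index_into_fusion` are hypothetical, not Lucidworks’ or IBM’s API): documents get enriched with NLP annotations before they are indexed for discovery.

```python
# Conceptual sketch of the integration pattern (entirely our
# illustration; Lucidworks' actual Fusion-Watson connector is not
# shown): enrich documents with NLP annotations, then index them.
from dataclasses import dataclass, field

@dataclass
class Document:
    text: str
    annotations: dict = field(default_factory=dict)

def watson_enrich(doc: Document) -> Document:
    """Hypothetical stand-in for a Watson NLP call (entities, sentiment,
    etc.); a real deployment would call the Watson API here."""
    doc.annotations["tokens"] = doc.text.lower().split()
    doc.annotations["length"] = len(doc.annotations["tokens"])
    return doc

def index_into_fusion(doc: Document) -> None:
    """Hypothetical indexing step; Fusion exposes its own APIs for this."""
    print("indexed:", doc.annotations)

for raw in ["Watson reads quickly", "Fusion serves discovery apps"]:
    index_into_fusion(watson_enrich(Document(raw)))
```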

Whitney Grace, November 7, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Is Your Company a Data Management Leader or Laggard?

November 4, 2016

The article titled “Companies Are Falling Short in Data Management” on IT ProPortal describes the obstacles facing many businesses when it comes to data management optimization. Why does this matter? The article states that big data analytics and the internet of things will combine to form an over $300 billion industry by 2020. Companies that fail to build up their capabilities will lose out—big. The article explains,

More than two thirds of data management leaders believe they have an effective data management strategy. They also believe they are approaching data cleansing and analytics the right way…The [SAS] report also says that approximately 10 per cent of companies it calls ‘laggards’, believe the same thing. The problem is – there are as many ‘laggards’, as there are leaders in the majority of industries, which leads SAS to a conclusion that ‘many companies are falling short in data management’.

In order to avoid this trend, company leaders must identify the obstacles impeding their path. A better focus on staff training and development is only possible after recognizing that a lack of internal skills is one of the most common issues. Additionally, companies must clearly define their data strategy and disseminate the vision among all levels of personnel.

Chelsea Kerwin, November 4, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph

Self Service Business Intelligence: Some Downers

November 2, 2016

Perhaps I am looking at a skewed sample of write ups. I noted another downer about easy-to-use, do-it-yourself business intelligence systems. These systems allow anyone to derive high value insights from data with the click of a mouse.

That’s been a dream of some for many years. I recall one of my colleagues at Halliburton NUS, a civil engineer with a focus on wastewater, repeating to anyone who would listen, “I want to walk into my office and have the computer tell me what I need to know today.”

Yep, how’s that coming along?

The write up “9 Ways Self Service BI Solutions Fall Short” suggests that the comment made by the sewage expert in 1972 is not yet a reality. The write up identifies nine “reasons,” but I circled three as of particular interest to me and my research goslings. You will need to read the original “Fall Short” article for the full complement of downers, or “challenges” in today’s parlance.

  1. Hidden complexity. Yep, folks who don’t know what they don’t know but just want a good enough answer struggle with the realities of data integrity, mathematics, and assumptions. A pretty chart may be eye-catching and “simple.” But is it on point? Well, that’s part of the complexity the pretty chart is doing its best to keep hidden. Out of sight, out of mind, right?
  2. Customization. Yep, the chart is pretty but it does not answer the question of a particular user. Now the plumbing must be disassembled in order to get what the self service BI user wants. Okay, but what if that self service user who is in a hurry cannot put the plumbing together again. Messy, right?
  3. Cost and scalability. The problem with self service is that low cost comes from standardization. You can have any color so long as it is black. The notion of mass customization persists even though every Apple iPhone is the same. The user has to figure out how to set up the phone to do what the user wants. The result is that most iPhone users make minimal changes to the software on the phone. Default settings are the settings for the vast majority of a system’s users. When a change has to be made, that change comes at a cost, and neither users nor the accountants are too keen on the unique snowflake approach to hardware or software. The outputs from a BI system, therefore, get used with zero or minimal modifications.

What are the risks of self service business intelligence? These range from governmental flops like 18F to Google’s failure with its fiber play. Think of the inefficiency resulting from the use of business intelligence systems marketed as the answer to the employee’s need for on-point information.

When I walk into my office, no system tells me what I need to know. Nice idea, though.

Stephen E Arnold, November 2, 2016
