Machine Learning Frameworks: Why Not Just Use Amazon?

September 16, 2018

A colleague sent me a link to “The 10 Most Popular Machine Learning Frameworks Used by Data Scientists.” I found the write up interesting despite the author’s failure to define the word popular and the bound phrase data scientists. But few folks in an era of “real” journalism fool around with my quaint notions.

According to the write up, the data come from an outfit called Figure Eight. I don’t know the company, but I assume their professionals adhere to the basics of Statistics 101. You know the boring stuff like sample size, objectivity of the sample, sample selection, data validity, etc. As with information in our time of “real” news and “real” journalists, these are some of the annoying aspects of churning out data in which an old geezer like me can have some confidence. You know like the 70 percent accuracy of some US facial recognition systems. Close enough for horseshoes, I suppose.


Here’s the list. My comments about each “learning framework” appear in italics after each “learning framework’s” name:

  1. Pandas — an open source, BSD-licensed library
  2. Numpy — a package for scientific computing with Python
  3. Scikit-learn — another BSD licensed collection of tools for data mining and data analysis
  4. Matplotlib — a Python 2D plotting library for graphics
  5. TensorFlow — an open source machine learning framework
  6. Keras — a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano
  7. Seaborn — a Python data visualization library based on matplotlib
  8. Pytorch & Torch
  9. AWS Deep Learning AMI — infrastructure and tools to accelerate deep learning in the cloud. Not to be annoying but defining AMI as Amazon Machine Image might be useful to some
  10. Google Cloud ML Engine — neural-net-based ML service with a typically Googley line up of Googley services.

Stepping back, I noticed a handful of what I am sure are irrelevant points which are of little interest to “real” journalists creating “real” news.

First, notice that the list is self referential with Python love. Frameworks depend on other Python loving frameworks. There’s nothing inherently bad about this self referential approach to whipping up a list, and it makes it a heck of a lot easier to create the list in the first place.
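To make the self referential point concrete, here is a rough Python sketch. The dependency map is hand-written for illustration from each project's commonly documented core requirements; it is an assumption, not output from a package manager, and it leaves out the cloud services and PyTorch.

```python
from collections import Counter

# Hand-written, illustrative map: list entry -> other list entries it needs.
# These edges are assumptions based on each project's documented core
# requirements, not something the survey itself reports.
DEPENDS_ON = {
    "pandas": {"numpy"},
    "numpy": set(),
    "scikit-learn": {"numpy"},
    "matplotlib": {"numpy"},
    "tensorflow": {"numpy"},
    "keras": {"numpy"},
    "seaborn": {"numpy", "pandas", "matplotlib"},
}


def self_referential(deps):
    """Return the entries that depend on at least one other entry in the map."""
    names = set(deps)
    return sorted(name for name, needs in deps.items() if needs & names)


def most_depended_on(deps):
    """Return (name, count) for the entry that the most other entries need."""
    counts = Counter(dep for needs in deps.values() for dep in needs)
    return counts.most_common(1)[0]


if __name__ == "__main__":
    print(self_referential(DEPENDS_ON))
    print(most_depended_on(DEPENDS_ON))
```

Under these assumptions, six of the seven Python entries lean on another entry in the same list, and Numpy sits underneath all six.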

Second, the information about Amazon is slightly misleading. In my lecture in Washington, DC on September 7, I mentioned that Amazon’s approach to machine learning supports Apache MXNet and Gluon, TensorFlow, Microsoft Cognitive Toolkit, Caffe, Caffe2, Theano, Torch, PyTorch, Chainer, and Keras. I found this approach interesting, but of little interest to those creating a survey or developing an informed list about machine learning frameworks; for example, Amazon is executing a quite clever play. In bridge, I think the phrase “trump card” suggests what the Bezos momentum machine has cooked up. Notice the past tense because this Amazon stuff has been chugging along in at least one US government agency for about four, four and one half years.

Third, Google brings up the rear, dead last. What about IBM? What about Microsoft and its CNTK? Ah, another acronym, but I, as a non real journalist, will reveal that this acronym means Microsoft Cognitive Toolkit. More information is available in Microsoft’s wonderful prose at this link. By the way, the Amazon machine learning spinning momentum thing supports the CNTK. Imagine that? Right, I didn’t think so.

Net net: The machine learning framework list may benefit from a bit of refinement. On the other hand, just use Amazon and move down the road to a new type of smart software lock in. Want to know more? Write benkent2020 @ yahoo dot com and inquire about our for fee Amazon briefing about machine learning, real time data marketplaces, and a couple of other mostly off the radar activities. Have you seen Amazon’s facial recognition camera? It’s part of the Amazon machine learning initiative, and it has some interesting capabilities.

Stephen E Arnold, September 16, 2018

IBM Watson Workspace

August 6, 2018

I read “What Is Watson Workspace?” I have been assuming that WW is a roll up of:

  • IBM Lotus Connections
  • IBM Lotus Domino
  • IBM Lotus Mashups
  • IBM Lotus Notes
  • IBM Lotus Quickr
  • IBM Lotus Sametime


The write up explains how wrong I am (yet again. Such a surprise for a person who resides in rural Kentucky). The write up states:

IBM Watson Workspace offers a “smart” destination for employees to collaborate on projects, share ideas, and post questions, all built from the ground up to take advantage of Watson’s cognitive computing abilities.

Yeah, but I thought the Lotus products provided these services.

How silly of me?

The difference is that WW includes cognitive APIs. Sounds outstanding. I can:

  • Draw insights from conversations
  • Turn conversations into actions
  • Access video conferencing
  • Customize Watson Workspace.

When I was doing a little low level work for one of the US government agencies (maybe it was the White House?) I recall sitting in a briefing in which these functions were explained. A short time thereafter I had the thankless job of reviewing a minor contract to answer an almost irrelevant question. Guess what? The “workspace” did not contain the email nor the attachments I sought. The explanation, delivered to me by someone from IBM in Gaithersburg, was that it was not the fault of the IBM system.


Business Intelligence: What Is Hot? What Is Not?

July 16, 2018

I read “Where Business Intelligence is Delivering Value in 2018.” The write up summarizes principal findings from a study conducted by Dresner Advisory Services, an outfit with which I am not familiar. I suggest you scan the summary in Cloud Tweaks and then, if you find the data interesting, chase after the Dresner outfit. My hunch is that the sales professionals will respond to your query.

Several items warranted my uncapping my trusty pink marker and circling an item of information.

First, I noticed a chart called Technologies and Initiatives Strategic to Business Intelligence. The chart presents data about 36 “technologies.” I noticed that “enterprise search” did not make the list. I did note that cognitive business intelligence, artificial intelligence, text analytics, and natural language analytics did. If I were generous to a fault, I would say, “These Dresner analysts are covering enterprise search, just taking the Tinker Toy approach by naming areas of technologies.” However, I am not feeling generous, and I find it difficult to believe that Dresner or any other knowledge worker can do “work” without being able to find a file, data, look up a factoid, or perform even the most rudimentary type of research without using search. The omission of this category is foundational, and I am not sure I have much confidence in the other data arrayed in the report.

Second, I don’t know what “data storytelling” is. I suppose (and I am making a wild and crazy guess here) that a person who has some understanding of the source data, the algorithmic methods used to produce output, and the time to think about the likely accuracy of the output creates a narrative. For example, I have been in a recent meeting with the president of a high technology company who said, “We have talked to our customers, and we know we have to create our own system.” Obviously the fellow knows his customers, essentially government agencies. The customers (apparently most of them) want an alternative, and he realizes change is necessary. The actual story, based on my knowledge of the company, the product and service he delivers, and the government agencies’ budget constraints, is different. The “real story” boils down to: “Deliver a cheaper product or you will lose the contract.” Stories, like those from teenagers who lose their homework, often do not reflect reality. What’s astounding is that data storytelling is number eight on the hit parade of initiatives strategic to business intelligence. I was indeed surprised. But governance made the list as did governance. What the heck is governance?


What Has Happened to Enterprise Search?

June 28, 2018

I read “Enterprise Search in 2018: What a Long Strange Trip It’s Been.” I found the information presented interesting. The thesis is that enterprise search has been on a journey almost like a “Wizard of Oz” experience.

The idea of consolidation, from my point of view, boils down to executives who want to cash in, get out, and move on. The reasons are not far to seek: Over promising and under delivering, financial manipulations, and positioning a nuts and bolts utility as something that solves information problems.


Some, maybe many, licensees of proprietary enterprise search systems may have viewed their investment as an opportunity that delivered an unexpected but inevitable outcome. Where is that lush scenery? Where’s the beach?

The reality is that enterprise search vendors were aced by Shay Banon. His Act II of Compass: A Finding Story was Elasticsearch and the company Elastic. Why not use free and open source software? At least the code gets some bugs fixed unlike old school proprietary enterprise search systems. Bug fixes? Yep, good luck with your Fast Search & Transfer ESP platform idiosyncrasies.

The landscape today is a bit like the volcanic transformation of Hawaii’s Vacationland. Real estate agents will be explaining that the lava flows have created new beach views, promising that eruptions are a low probability event.

The write up does not highlight one simple fact: Enterprise search has given way to “roll up” services or what I call “meta-plays.” The idea is that search is tucked inside systems like Diffeo, Palantir Gotham, and other “intelligence” platforms.

Why aren’t these enterprise grade solutions sold as “enterprise search” or “enterprise business intelligence and discovery solutions”?

The answer is that the information retrieval nest has been marginalized by the actions of vendors stretching back to the Smart system and to the present with “proprietary” solutions which actually include open source technology. These systems are anchored in the past.

Consider Diffeo?

Why offer enterprise search when one can provide a solution that delivers information in context, provides collaboration tools, and displays information in different ways with a single mouse click?


Picking and Poking Palantir Technologies: A New Blood Sport?

April 25, 2018

My reaction to “Palantir Has Figured Out How to Make Money by Using Algorithms to Ascribe Guilt to People, Now They’re Looking for New Customers” is a sigh and a groan.

I don’t work for Palantir Technologies, although I have been a consultant to one of its major competitors. I do lecture about next generation information systems at law enforcement and intelligence centric conferences in the US and elsewhere. I also wrote a book called “CyberOSINT: Next Generation Information Access.” That study has spawned a number of “experts” who are recycling some of my views and research. A couple of government agencies have shortened my word “cyberosint” into “cyint.” In a manner of speaking, I have an information base which can be used to put the actions of companies which offer services similar to those available from Palantir in perspective.

The article in Boing Boing falls into the category of “yikes” analysis. Suddenly, it seems, the idea that cook book mathematical procedures can be used to make sense of a wide range of data is news. Let me assure you that this is not a new development, and Palantir is definitely not the first of the companies developing applications for law enforcement and intelligence professionals to land customers in financial and law firms.


A Palantir bubble gum card shows details about a person of interest and links to underlying data from which the key facts have been selected. Note that this is from an older version of Palantir Gotham. Source: Google Images, 2015

Decades ago, a friend of mine (Ev Brenner, now deceased) was one of the pioneers using technology and cook book math to make sense of oil and gas exploration data. How long ago? Think 50 years.

The focus of “Palantir Has Figured Out…” is that:

Palantir seems to be the kind of company that is always willing to sell magic beans to anyone who puts out an RFP for them. They have promised that with enough surveillance and enough secret, unaccountable parsing of surveillance data, they can find “bad guys” and stop them before they even commit a bad action.

Okay, that sounds good in the context of the article, but Palantir is just one vendor responding to the need for next generation information access tools from many commercial sectors.


Taking Time for Search Vendor Limerance

April 18, 2018

Life is a bit hectic. The Beyond Search and the DarkCyber teams are working on the US government hidden Web presentation scheduled this week. We also have final research underway for the two Telestrategies ISS CyberOSINT lectures. The first is a review of the DarkCyber approach to deanonymizing Surface Web and hidden Web chat. The second focuses on deanonymizing digital currency transactions. Both sessions provide attendees with best practices, commercial solutions, open source tools, and the standard checklists which are a feature of my LE and intel lectures.

However, one of my associates asked me if I knew what the word “limerance” meant. This individual is reasonably intelligent, but the bar for brains is pretty low here in rural Kentucky. I told the person, “I think it is psychobabble, but I am not sure.”

The fix was a quick Bing.com search. The wonky relevance of the Google was the reason for the shift to the once indomitable Microsoft.

Limerance, according to Bing’s summary of Wikipedia, means “a state of mind which results from a romantic attraction to another person typically including compulsive thoughts and fantasies and a desire to form or maintain a relationship and have one’s feelings reciprocated.”


Upon reflection, I decided that limerance can be liberated from the woozy world of psychologists, shrinks, and wielders of water witches.

Consider this usage in the marginalized world of enterprise search:

Limerance: The state of mind which causes a vendor of key word search to embrace any application or use case which can be stretched to trigger a license to the vendor’s “finding” system.

 


Speeding Up Search: The Challenge of Multiple Bottlenecks

March 29, 2018

I read “Search at Scale Shows ~30,000X Speed Up.” I have been down this asphalt road before, many times in fact. The problem with search and retrieval is that numerous bottlenecks exist; for example, dealing with exceptions (content which the content processing system cannot manipulate).

Those who want relevant information or those who prefer superficial descriptions of search speed focus on a nice, easy-to-grasp metric; for example, how quickly do results display.

May I suggest you read the source document, work through the rat’s nest of acronyms, and swing your mental machete against the “metrics” in the write up?

Once you have taken these necessary steps, consider this statement from the write up:

These results suggest that we could use the high-quality matches of the RWMD to query — in sub-second time — at least 100 million documents using only a modest computational infrastructure.


The path to responsive search and retrieval is littered with multiple speed bumps. Hitting any one when going too fast can break the search low rider.

I wish to list some of the speed bumps which the write up does not adequately address or, in some cases, acknowledge:

  • Content flows are often in the terabit or petabit range for certain filtering and query operations. One hundred million documents won’t ring the bell.
  • Content has to be normalized. This is the transform in ETL operations. Normalizing content takes some time, particularly when the historical on disc content from multiple outputs and the real-time flows from systems such as Cisco Systems intercept devices are large. Please, think in terms of gigabytes per second and petabytes of archived data parked on servers in some countries’ government storage systems.
  • Populating an index structure with new items also consumes time. If an object is not in an index of some sort, it is tough to find.
  • Shaping the data set over time. Content has a weird property. It evolves. Lowly chat messages can contain a wide range of objects. Jump to today’s big light bulb which illuminates some blockchains’ ability to house executables, videos, off color images, etc.
  • Because IBM inevitably drags Watson to the party, keep in mind that Watson still requires humans to perform gorilla style grooming before it’s show time at the circus. Questions have to be considered. Content sources selected. The training wheels bolted to the bus. Then trials have to be launched. What good is a system which returns off point answers?
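A back of the envelope calculation shows why the scale matters. The Python sketch below is only arithmetic under stated assumptions: the 10 KB average document size and the sustained 1 GB per second processing rate are illustrative figures I picked, not numbers from the write up.

```python
# All figures below are illustrative assumptions, not measurements.
KB = 10**3
GB = 10**9
PB = 10**15


def seconds_to_process(total_bytes, bytes_per_second):
    """Time to push a corpus through one processing stage at a fixed rate."""
    return total_bytes / bytes_per_second


# Assumed: 100 million documents averaging 10 KB each, roughly 1 TB in total.
hundred_million_docs = 100_000_000 * 10 * KB

# Assumed: a single petabyte of archived content.
one_petabyte = 1 * PB

rate = 1 * GB  # assumed sustained throughput, bytes per second

docs_minutes = seconds_to_process(hundred_million_docs, rate) / 60
archive_days = seconds_to_process(one_petabyte, rate) / 86_400

if __name__ == "__main__":
    print(f"100 million documents: about {docs_minutes:.1f} minutes")
    print(f"One petabyte archive: about {archive_days:.1f} days")
```

With those assumptions, the 100 million document set clears a single pass in under 20 minutes, while one petabyte takes more than 11 days, and that is before indexing, exceptions, or any reprocessing.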

I think you get the idea.


Crime Prediction: Not a New Intelligence Analysis Function

March 16, 2018

We noted “New Orleans Ends Its Palantir Predictive Policing Program.” The interest in this Palantir Technologies’ project surprised us from our log cabin with a view of the mine drainage run off pond. The predictive angle is neither new nor particularly stealthy. Many years ago when I worked for one of the outfits developing intelligence analysis systems, the “predictive” function was a routine function.

Here’s how it works:

  • Identify an entity of interest (person, event, organization, etc.)
  • Search for other items including the entity
  • Generate near matches. (We called this “fuzzification” because we wanted hits which were “near” the entity in which we had an interest. Plus, the process worked reasonably well in reverse too.)
  • Punch the analyze function.

Once one repeats the process several times, the system dutifully generates reports which make it easy to spot:

  • Exact matches; for example, a “name” has a telephone number and a dossier
  • Close matches; for example, a partial name or organization is associated with the telephone number of the identity
  • Predicted matches; for example, based on available “knowns”, the system can generate a list of highly likely matches.

The particular systems with which I am familiar allow the analyst, investigator, or intelligence professional to explore the relationships among these pieces of information. Timeline functions make it trivial to plot when events took place and retrieve from the analytics module highly likely locations for future actions. If an “organization” held a meeting with several “entities” at a particular location, the geographic component can plot the actual meetings and highlight suggestions for future meetings. In short, prediction functions work in a manner similar to Excel’s filling in items in a number series.
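For readers who want to see the mechanics, here is a minimal Python sketch of the exact match, close match, and series fill ideas described above. It uses the standard library’s difflib for the “fuzzification” step; the record fields, the 0.75 threshold, and the sample names are hypothetical, and no commercial system is claimed to work this way.

```python
from difflib import SequenceMatcher
from statistics import mean


def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify(query, records, close_threshold=0.75):
    """Split records into exact and close ("fuzzified") matches on name."""
    exact, close = [], []
    for rec in records:
        score = similarity(query, rec["name"])
        if score == 1.0:
            exact.append(rec)
        elif score >= close_threshold:
            close.append(rec)
    return exact, close


def predict_next(values):
    """Extend a numeric series by its average step, like a spreadsheet fill."""
    steps = [b - a for a, b in zip(values, values[1:])]
    return values[-1] + mean(steps)


# Hypothetical records for illustration only.
RECORDS = [
    {"name": "Acme Trading", "phone": "555-0100"},
    {"name": "Acme Tradng", "phone": "555-0100"},
    {"name": "Blue Harbor LLC", "phone": "555-0199"},
]

if __name__ == "__main__":
    exact, close = classify("Acme Trading", RECORDS)
    print("exact:", [r["name"] for r in exact])
    print("close:", [r["name"] for r in close])
    # Meetings on days 1, 8, and 15 suggest the next one on day 22.
    print("next meeting day:", predict_next([1, 8, 15]))
```

The point of the sketch is that nothing here is exotic. String similarity plus a simple extrapolation covers the basic idea; the commercial systems add data volume, entity resolution, and a nicer interface.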


What would you predict as a “hot spot” based on this map? The red areas, the yellow areas, the orange areas, or the areas without an overlay? Prediction is facilitated with some outputs from intelligence analysis software. (Source: Palantir via Google Image search)


The New York Times Wants to Change Your Google Habit

March 1, 2018

Sunday is a slightly less crazy day. I took time to scan “The Case Against Google.” I had the dead tree edition of the New York Times Magazine for February 25, 2018. You may be able to access this remarkable hybridization of Harvard MBA think, DNA engineered to stick pins in Google, and good old establishment journalism toasted at Yale University.


The author is a wildly successful author. Charles Duhigg loves his family, makes time for his children, writes advice books, and immerses himself in a single project at a time. When he comes up for air, he breathes deeply of Google outputs in order to obtain information. If the Google fails, he picks up the phone. I assume those whom he calls answer the ring tone. I find that most people do not answer their phones, but that’s another habit which may require analysis.

I worked through the write up. I noted three things straight away.

First, the timeline structure of the story is logical. However, leaving it up to me to figure out which date matched which egregious Google action was annoying. Fortunately, after writing The Google Legacy, Google Version 2.0, and Google: The Digital Gutenberg, I had the general timeline in mind. Other readers may not.

Second, the statement early in the write up reveals the drift of the essay’s argument. The best selling author of The Power of Habit writes:

Within computer science, this kind of algorithmic alchemy is sometimes known as vertical search, and it’s notoriously hard to master. Even Google, with its thousands of Ph.D.s, gets spooked by vertical-search problems.

I am not into arguments about horizontal and vertical search. I ran around that mulberry tree with a number of companies, including a couple of New York investment banks. Been there. Done that. There are differences in how the components of a findability solution operate, but the basic plumbing is similar. One must not confuse search with the specific technology employed to deliver a particular type of output. Want to argue? First, read The New Landscape of Search, published by Pandia before the outfit shut down. Then, send me an email with your argument.

Third, cherry picking from Google’s statements makes it possible to paint a somewhat negative picture of the great and much loved Google. With more than 60,000 employees, many blogs, many public presentations, oodles of YouTube videos, and a library full of technical papers and patents, the Google folks say a lot. The problem is that finding a quote to support almost any statement is not hard; it just takes persistence. Here’s an example:

We absolutely do not make changes to our search algorithm to disadvantage competitors.


Governance: Now That Is a Management Touchstone for MBA Experts

February 27, 2018

I read “Unlocking the Power of Today’s Big Data through Governance.” Quite a lab grown meat wiener is that headline, with its “unlocking,” “power,” “Big Data,” and “governance.” Yep, it comes from IDG, the outfit which cannot govern its own agreements with the people the firm pays to make the IDG experts so darned smart. (For the back-story, check out this snapshot of governance in action.)


What’s the write up with the magical word governance about?

Instead of defining “governance,” I learn what governance is not; to wit:

Data governance isn’t about creating a veil of secrecy around data

I have zero idea what this means. Back to the word “governance.” Google and Wikipedia define the word in this way:

Governance is all of the processes of governing, whether undertaken by a government, market or network, whether over a family, tribe, formal or informal organization or territory and whether through the laws, norms, power or language of an organized society.

Okay, governing. What’s governing mean? Back to the GOOG. Here’s one definition which seems germane to MBA speakers:

control, influence, or regulate (a person, action, or course of events).

The essay drags out the chestnuts about lots of information. Okay, I think I understand because Big Data has been touted for many years. Now, mercifully I assert, the drums are beating out the rhythm of “artificial intelligence” and its handmaiden “algos,” the terrific abbreviation some of the marketing jazzed engineers have coined. Right, algos, bro.

What’s the control angle for Big Data? The answer is that “data governance” will deal with:

  • Shoddy data
  • Incomplete data
  • Off point data
  • Made up data
  • Incorrect data

Presumably these thorny issues will yield to a manager who knows the ins and outs of governance. I suppose there are many experts in governance; for example, the fine folks who have tamed content chaos with their “governance” of content management systems or the archiving mavens who have figured out what to do with tweets at the Library of Congress. (The answer is to not archive tweets. There you go. Governance in action.)

The article suggests a “definitive data governance program.” Right. If one cannot deal with backfiles, changes to the data in the archives, and the new flows of data—how does one do the “definitive governance program” thing? The answer is, “Generate MBA baloney and toss around buzzwords.” Check out the list of tasks which, in my experience, are difficult to accomplish when resources are available and the organization has a can-do attitude:

  • Document data and show its lineage.
  • Set appropriate policies, and enforce them.
  • Address roles and responsibilities of everyone who touches that data, encouraging collaboration across the organization.

These types of tasks are the life blood of consultants who purport to have the ability to deliver the near impossible.

What happens if we apply the guidelines in the Governance article to the data sets listed in “Big Data And AI: 30 Amazing (And Free) Public Data Sources For 2018.” In my experience, the cost of normalizing the data is likely to be out of reach for most organizations. Once these data have been put in a form that permits machine-based quality checks, the organization has to figure out what questions the data can answer with a reasonable level of confidence. Getting over these hurdles then raises the question, “Are these data up to date?” And, if the data are stale, “How do we update the information?” There are, of course, other questions, but the flag waving about governance operates at an Ivory Tower level. Dealing with data takes place with one’s knees on the ground and one’s hands in the dirt. If the public data sources are not pulling the hay wagon, what’s the time, cost, and complexity of obtaining original data sets, validating them, and whipping them into shape for use by an MBA?

You know the answer: “This is not going to happen.”
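To show what even the most basic “machine-based quality checks” involve, here is a small Python sketch that flags incomplete and stale records. The field names, the one year staleness cutoff, and the sample records are hypothetical; real data sets need checks tailored to the questions being asked.

```python
from datetime import date

# Hypothetical schema: every record should carry these three fields.
REQUIRED_FIELDS = ("id", "value", "updated")


def check_record(record, today, max_age_days=365):
    """Return a list of problems found in one record (empty list if clean)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    updated = record.get("updated")
    if isinstance(updated, date) and (today - updated).days > max_age_days:
        problems.append("stale")
    return problems


if __name__ == "__main__":
    today = date(2018, 2, 27)
    fresh = {"id": 1, "value": 3.2, "updated": date(2018, 2, 1)}
    shoddy = {"id": 2, "value": "", "updated": date(2015, 1, 1)}
    print(check_record(fresh, today))
    print(check_record(shoddy, today))
```

Note what the sketch cannot do: it catches missing and old values, but it has no way to spot made up, off point, or incorrect data. Those require someone who understands the source.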

Here’s a paragraph which I circled in Oscar Mayer wiener pink:

One of the more significant, and exciting, changes in data governance has been the shift in focus to business users. Historically, data has been a technical issue owned by IT and locked within the organization by specific functions and silos. But if data is truly going to be an asset, everyday users—those who need to apply the data in different contexts—must have access and control over it and trust the data. As such, data governance is transforming from a technical tool to a business application. And chief data officers (CDOs) are starting to see the technologies behind data governance as their critical operating environment, in much the same way SAP serves CFOs, and Salesforce supports CROs. It is rare to find an opportunity to build a new system of record for a market.

Let’s look at this low calorie morsel and consider some of its constituent elements. (Have you ever seen wieners being manufactured? Fill in that gap in your education if you have not had the first hand learning experience.)

First, business users want to see a pretty dashboard, click on something that looks interesting in a visualization, and have an answer delivered. Most of the business people I know struggle to understand if the data in their system is accurate and have limited expertise to understand the mathematical processes which churn away to display an “answer.”

The reference to SAP is fascinating, but I think of IBM-type systems as somewhat out of step with the more sophisticated tools available to deal with certain data problems. In short, SAP is an artifact of an earlier era, and its lessons, even when understood, have been inadequate in the era of real time data analysis.

Let me be clear: Data governance is management malarkey. Look closely at organizations which are successful. Peer inside their data environments. When I have looked, I have seen clever solutions to specific problems. The cleverness can create its own set of challenges.

The difference between a Google and a Qwant, a LookingGlass Cyber and IBM i2, or Amazon and Wal-Mart is not Big Data. It is not the textbook definition of “governance.” Success has more to do with effective problem solving on a set of data required by a task. Google sells ads and deals with Big Data to achieve its revenue goals. LookingGlass addresses chat information for a specific case. Amazon recommends products in order to sell more products.

Experts who invoke governance on a broad scale as a management solution are disconnected from the discipline required to identify a problem and deal with data required to solve that problem.

Few organizations can do this with their “content management systems”, their “business intelligence systems,” or their “product information systems.” Why? Talking about a problem is not solving a problem.

Governance is wishful thinking and not something that is delivered by a consultant. Governance is an emergent characteristic of successful problem solving. Governance is not paint; it is not delivered by an MBA and a PowerPoint; it is not a core competency of jargon.

In Harrod’s Creek, governance is getting chicken to the stores in the UK. Whoops. That management governance is not working. So much in modern business does not work very well.

Stephen E Arnold, February 27, 2018
