Big Data Blending Solution
January 20, 2016
I would have used Palantir or maybe our own tools. But an outfit named National Instruments found a different way to perform data blending. “How This Instrument Firm Tackled Big Data Blending” provides a case study and a rah rah for Alteryx. Here’s the paragraph I highlighted:
The software it [National Instruments] selected, from Alteryx, takes a somewhat unique approach in that it provides a visual representation of the data transformation process. Users can acquire, transform, and blend multiple data sources essentially by dragging and dropping icons on a screen. This GUI approach is beneficial to NI employees who aren’t proficient at manipulating data using something like SQL.
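What does “blending” look like without the GUI? Here is a minimal sketch in Python with pandas; the file names and columns are my inventions, not NI’s actual data. The point is the work the drag-and-drop icons hide:

```python
# A minimal data blending sketch. File and column names are hypothetical.
import pandas as pd

# Acquire: read two separate sources.
orders = pd.read_csv("orders.csv")        # e.g., order_id, customer_id, amount
customers = pd.read_csv("customers.csv")  # e.g., customer_id, region

# Transform: normalize the join key so the two sources line up.
orders["customer_id"] = orders["customer_id"].astype(str).str.strip()
customers["customer_id"] = customers["customer_id"].astype(str).str.strip()

# Blend: join the sources, then aggregate by region.
blended = orders.merge(customers, on="customer_id", how="left")
print(blended.groupby("region")["amount"].sum())
```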
The graphical approach has been part of a number of tools. There are also some systems which just figure out where to put what.
The issue for me is, “What happens to rich media like imagery and unstructured information like email?”
There are systems which handle these types of content.
Another challenge is the dependence on structured relational data tables. Certain types of operations are difficult in this environment.
The write up is interesting, but it reveals that a narrow view of available tools may produce a partial solution.
Stephen E Arnold, January 20, 2016
Machine Learning Hindsight
January 18, 2016
Have you ever found yourself saying, “If I only knew then, what I know now”? It is a moment we all experience, but instead of stewing over our past mistakes it is better to share the lessons we’ve learned with others. Data scientist Peadar Coyle learned some valuable lessons when he first started working with machine learning. He discusses three main things he learned in the article, “Three Things I Wish I Knew Earlier About Machine Learning.”
Here are the three things he wishes he had known then about machine learning but knows now:
- “Getting models into production is a lot more than just micro services
- Feature selection and feature extraction are really hard to learn from a book
- The evaluation phase is really important”
Developing models is an easy step, but putting them into production is difficult. There are many major steps that need attention, and doing all of the little jobs is not feasible on huge projects, so Peadar recommends outsourcing when you can. Books and online information are good reference tools, but when the knowledge cannot be applied to actual situations it is useless; Peadar learned that real world experience has no substitute. Evaluation is just as important: life does not hand out perfect datasets, and testing the model against different situations will evaluate it better.
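Coyle’s point about evaluation is worth one concrete illustration. Here is a minimal sketch using scikit-learn and its bundled iris data purely for illustration: hold out data the model never saw, and let cross validation temper the optimism of a single split.

```python
# Never score a model on the data it was trained on: hold some out.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# Cross validation gives a less optimistic picture than a single split.
print("5-fold scores:", cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5))
```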
Peadar’s advice applies to machine learning, but it applies just as well to life in general.
Whitney Grace, January 18, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Open Source Data Management: It Is Now Easy to Understand
January 10, 2016
I read “16 for 16: What You Must Know about Hadoop and Spark Right Now.” I like the “right now.” Urgency. I am not sure I feel too much urgency at the moment. I will leave that wonderful feeling to the executives who have sucked in venture money and have to find a way to generate revenue in the next 11 months.
The article runs down the basic generalizations associated with each of these open source data management components:
- Spark
- Hive
- Kerberos
- Ranger/Sentry
- HBase/Phoenix
- Impala
- Hadoop Distributed File System (HDFS)
- Kafka
- Storm/Apex
- Ambari/Cloudera Manager
- Pig
- Yarn/Mesos
- Nifi/Kettle
- Knox
- Scala/Python
- Zeppelin/Databricks
The list tells me two things. First, open source data tools continue to proliferate. Second, quite a few committed developers will be needed to keep these projects afloat.
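For readers who know these projects only as names, here is the classic word count, a minimal PySpark sketch of the kind of chore this ecosystem exists to parallelize. A local Spark installation is assumed, and the input path is hypothetical:

```python
# Classic word count on the RDD API. The input file is hypothetical.
from pyspark import SparkContext

sc = SparkContext("local", "sketch")
counts = (sc.textFile("events.log")              # read lines
            .flatMap(lambda line: line.split())  # split into words
            .map(lambda word: (word, 1))         # pair each word with 1
            .reduceByKey(lambda a, b: a + b))    # sum counts per word
print(counts.takeOrdered(10, key=lambda kv: -kv[1]))  # ten most frequent
sc.stop()
```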
The write up is not content with this shopping list. The intrepid reader will have an opportunity to learn a bit about:
- Kylin
- Atlas/Navigator
As the write up swoops to its end point, I learned about some open source projects which are a bit of a disappointment; for example, Oozie and Tez.
The key point of the article is that Google’s MapReduce, now pretty long in the tooth, has been effectively marginalized.
The Balkanization of data management is evident. The challenge will be to use one or more of these technologies to make some substantial revenue flow.
What happens if a company jumps on the wrong bandwagon as it leaves the parade ground? I would suggest that it may be more like a Pig than an Atlas. The investors will change from Rangers looking for profits to Pythons ready to strike. A Spark can set fire to some hopes and dreams in the Hive. Poorly constructed walls of Databricks can come falling down. That will be an Oozie.
Dear old Oracle, DB2, and SQL Server will just watch.
Stephen E Arnold, January 10, 2016
Short Honk: Hadoop Ecosystem Made Clear
January 3, 2016
Love Hadoop. Love all things Hadoopy? You will want to navigate to “The Hadoop Ecosystem Table.” You have categories of Hadoopiness with examples of the Hadoop amoebae. You are able to see where Spark “fits” or Kudu. Need some document data model options? The table will deliver: ArangoDB and more. Useful stuff.
Stephen E Arnold, January 3, 2016
Data Managers as Data Librarians
December 31, 2015
The tools of a librarian may be the key to better data governance, according to an article at InFocus titled, “What Librarians Can Teach Us About Managing Big Data.” Writer Joseph Dossantos begins by outlining the plight data managers often find themselves in: executives can talk a big game about big data, but want to foist all the responsibility onto their overworked and outdated IT departments. The article asserts, though, that today’s emphasis on data analysis will force a shift in perspective and approach—data organization will come to resemble the Dewey Decimal System. Dossantos writes:
“Traditional Data Warehouses do not work unless there [is] a common vocabulary and understanding of a problem, but consider how things work in academia. Every day, tenured professors and students pore over raw material looking for new insights into the past and new ways to explain culture, politics, and philosophy. Their sources of choice: archived photographs, primary documents found in a city hall, monastery or excavation site, scrolls from a long-abandoned cave, or voice recordings from the Oval office – in short, anything in any kind of format. And who can help them find what they are looking for? A skilled librarian who knows how to effectively search for not only books, but primary source material across the world, who can understand, create, and navigate a catalog to accelerate a researcher’s efforts.”
The article goes on to discuss the influence of the “Wikipedia mindset”; data accuracy and whether it matters; and devising structures to address different researchers’ needs. See the article for details on each of these (especially on meeting different needs). The write-up concludes with a call for data-governance professionals to think of themselves as “data librarians.” Is this approach the key to more effective data search and analysis?
Cynthia Murrell, December 31, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
Caution about NoSQL Databases
December 22, 2015
I read “Exasol and Birst Join In-Memory Database to ‘Networked’ BI to Aid Mutual Expansion.” Another day, another marketing tie up. But the article contained a very interesting statement, attributed to a Birst big dog:
“NoSQL databases are great for atomic storage and retrieval, and for elastic scaling over a distributed [server] environment, but when it comes to doing aggregations with joins – and that’s what analytics is about – it is just not what they are built for.”
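What does “aggregations with joins” mean in practice? A minimal sketch using Python’s built-in sqlite3 module makes the workload concrete; the tables and values are invented for illustration:

```python
# The analytics workload the quote describes: join two tables, then aggregate.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (id INTEGER, region TEXT);
    INSERT INTO customers VALUES (1, 'east'), (2, 'west');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")

for row in con.execute("""
    SELECT c.region, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.id
    GROUP BY c.region
"""):
    print(row)  # ('east', 15.0) and ('west', 7.5)
```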
I wonder if that shot is aimed at outfits like MarkLogic. Worth watching this partnership.
Stephen E Arnold, December 22, 2015
Two AI Paths Pondered by Teradata
December 20, 2015
I read the content marketing write up by Karthik Guruswamy. I like the “guru” part of the expert’s name. I am stuck with the “old” part of my name.
The write up is called “Data Science: Machine Learning Vs. Rules Based Systems.” I know a little bit about both of these methods, and I know a teeny tiny bit about Teradata, an outstanding data warehouse solution chugging along with its stock in the high $20s per share. The Google Finance chart suggests to my unlearned eye that the company has some challenges with net income and profit margin.
Looks like some content marketing oomph is needed to move that top line number.
I learned in the write up:
Rules based systems will work effectively if all the situations, under which decisions can be made, are known ahead of time.
Okay. Insight. Know everything ahead of time and one can write rules to cover the situation. Is this expensive? Is this a never-ending job? Consultants sure hope so.
There is an alternative:
Enter Machine Learning or ML! If we classify the data into good vs. bad data sets or categorize them into different labels like A, B, C, D etc., the Machine Learning algorithms can help build rules on the fly. This step is called training which results in a model. During operationalization, this model is used by the prediction algorithm to classify the incoming data in the right way which in turn leads to sound decision making.
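The contrast can be sketched in miniature. Below, a hand-written rule sits next to a model which learns an equivalent rule from labeled examples; the data and the threshold are my inventions, not Teradata’s:

```python
# Rules based vs. machine learning, in miniature. Data and threshold invented.
from sklearn.tree import DecisionTreeClassifier

# Rules based: a human writes the decision logic ahead of time.
def rule_based(amount):
    return "bad" if amount > 100 else "good"

# Machine learning: the algorithm derives the rule from labeled examples.
X = [[10], [25], [60], [120], [150], [300]]        # feature: amount
y = ["good", "good", "good", "bad", "bad", "bad"]  # labels supplied by humans

model = DecisionTreeClassifier().fit(X, y)
print(rule_based(130), model.predict([[130]])[0])  # both should say "bad"
```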
I recall that Autonomy used this approach for its system. Those familiar with Autonomy have some experience with retraining, Bayesian drift, and other exciting facets of machine learning based systems. Consultants love to build new training sets.
The write up asserts:
With Machine Learning, one can iteratively achieve good results by cleansing & prepping the data, changing or combining algorithms or merely tweaking the algorithm parameters. This is becoming much easier thanks to the increased awareness and the availability of different types of data science tools in the market today.
High five.
My view is that the write up left out some information. But there is one omission which warrants a special comment.
Neither of these systems works without human intervention.
Bummer. Reality is sort of a drag, but maybe that’s why Teradata is wrestling with revenue and net profit alligators. Consultants, on the other hand, can bill to enhance either approach.
What about the customer? Well, some customers of brand name data warehouse systems struggle to get data into and out of these whiz bang systems in my experience. Regardless of the craziness involved with Hadoop and Spark, these open source approaches may make more sense than pumping six or seven figures into a proprietary system.
Consultants can still bill, of course. That’s one upside of any approach one wishes to embrace.
Stephen E Arnold, December 20, 2015
XML Marches On
December 2, 2015
For fans of XML and automated indexing, there’s a new duo in town. The shoot out at the JSON corral is not scheduled, but you can get the pre-showdown information in “Smartlogic and MarkLogic Corporation Enhance Platform Integration between Semaphore and MarkLogic Database.” Rumors of closer ties between the outfits surfaced earlier this year. I pinged one of the automated indexing company’s wizards and learned, “Nope, nothing going on.” Gee, I almost believed this until a Virtual Strategy story turned up. Virtual no more.
According to the write up:
Smartlogic, the Content Intelligence Company, today announced tighter software integration with MarkLogic, the Enterprise NoSQL database platform provider, creating a seamless approach to semantic information management where organizations maximize information to drive change. Smartlogic’s Content Intelligence capabilities provide a robust set of semantic tools which create intelligent metadata, enhancing the ability of the enterprise-grade MarkLogic database to power smarter applications.
For fans of user friendliness, the tie up may mean more XQuery scripting and some Semaphore tweaks. And JSON? Not germane.
What is germane is that Smartlogic may covet some of MarkLogic’s publishing licensees. After slicing and dicing, some of these outfits are having trouble finding out what their machine assisted editors have crafted with refined quantities of editorial humans.
Stephen E Arnold, December 2, 2015
Medical Publisher Does Rah Rah for MarkLogic
November 20, 2015
Now MarkLogic is a unicorn. The company wants to generate revenues. Okay. No problem.
I found “200-Year-Old Publisher Finds Happiness with NoSQL Database” quite interesting. The write up explains that the New England Journal of Medicine uses MarkLogic’s XML data management system to — well — manage its text and other content.
The write up states:
With features like XQuery, a SQL-like query engine for XML data, MarkLogic promised to retrieve unstructured data at speeds no SQL database could approach.
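XQuery itself is beyond a short example, but the flavor of querying XML without a relational schema can be sketched with Python’s standard library. The document below is invented, not the Journal’s, and MarkLogic’s engine does this sort of thing server side at scale:

```python
# XPath-style retrieval from XML, no relational schema required.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<issue>
  <article type="research"><title>Trial Results</title></article>
  <article type="editorial"><title>On Peer Review</title></article>
</issue>
""")

# Select the titles of research articles only.
for title in doc.findall("./article[@type='research']/title"):
    print(title.text)
```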
What did I note? The big thing is that this deal went down when MarkLogic was a “fledgling company.” Hmm. Was this a Dave Kellogg-era deal? I also noted that the write up did not beat the drum for MarkLogic as a business and government intelligence, email management, and analytics Swiss Army knife able to cut into the revenues of Oracle and other Codd database outfits.
MarkLogic’s marketing may be making progress by emphasizing what MarkLogic’s technology was built to deliver: a data management system for publishers. The publication still uses SQL for financial records and dabbles with the open source quasi-doppelgänger MongoDB.
MarkLogic hit a wall at about $60 million. Today the fledgling is a unicorn. Will MarkLogic put wings on its unicorn? Stakeholders sure think it is going to happen. For me, I will observe. Will the proprietary MarkLogic prevail, or will open source alternatives nibble into this box of Kellogg’s revenue?
Stephen E Arnold, November 20, 2015
Icann Is an I Won’t
November 16, 2015
Have you ever heard of Icann? Like many people in the United States, you probably have not heard of the non-profit private company. What does Icann do? Icann is responsible for Internet protocol (IP) addresses and for coordinating domain names, so the company is basically responsible for a huge portion of the Internet. According to The Guardian in “The Internet Is Run By An Unaccountable Private Company. This Is A Problem,” the US supposedly runs Icann, but its role is mostly clerical, and by September 30, 2015 it was supposed to hand the reins over to someone else.
The “else” is the biggest question. The Icann community spent hours trying to figure out who would manage the company, but they ran into a huge brick wall. The biggest issue is that the volunteers want Icann to have more accountability, which does not seem feasible: Icann’s directors cannot be fired, except by each other. Finances are another problem, with possible governance risks and corruption.
A proposed solution is to create a membership organization, a common business model for non-profits that would give power to the community. Icann’s directors are not too happy and have been allowed to add their own opinions. Decisions are not being made at Icann, and with the new presidential election the entire power shift could be called off. That is not the worst that could happen:
“But there’s much more at stake. Icann’s board – as ultimate authority in this little company running global internet resources, and answerable (in fact, and in law) to no one – does have the power to reject the community’s proposals. But not everything that can be done, should be done. If the board blunders on, it will alienate those volunteers who are the beating heart of multi-stakeholder governance. It will also perfectly illustrate why change is required.”
The board has all the power, and no one can hold it accountable. Icann directors just have to stall long enough to keep things the same, and they will be able to give themselves more raises.
Whitney Grace, November 16, 2015
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph