MarkLogic Aims to Take on Oracle in Enterprise Class Data Hub Frameworks
October 10, 2017
MarkLogic is trying to give Oracle a run for its money in the world of enterprise-class data hubs. According to a recent press release on ITWire, “MarkLogic Releases New Enterprise Class Data Hub Framework to Enhance Agility and Speed Digital Transformations.”
How does MarkLogic plan to pull this off? According to the release:
Traditionally, integrating data from silos has been very costly and time consuming for large organizations looking to make faster and better decisions based on their data assets. The Data Hub Framework simplifies and speeds the process of building a MarkLogic solution by providing a framework around how to data model, load data, harmonize data, and iterate with new data and compliance requirements.
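The “harmonize data” step is the one with real technical content: pulling differently shaped silo records into a single canonical shape. Below is a minimal sketch of that idea in generic Python with invented field names and mappings; it is not the MarkLogic Data Hub Framework’s actual API.

```python
# Illustrative sketch only: harmonizing two differently shaped source records
# into one canonical "customer" entity. Generic Python, not the MarkLogic
# Data Hub Framework API; all field names and mappings are invented.

SOURCE_MAPPINGS = {
    "crm":     {"customer_id": "cust_no", "full_name": "name",  "email": "email_addr"},
    "billing": {"customer_id": "acct_id", "full_name": "payer", "email": "contact"},
}

def harmonize(record: dict, source: str) -> dict:
    """Map a raw source record onto one canonical shape, keeping the original."""
    mapping = SOURCE_MAPPINGS[source]
    entity = {canonical: record.get(raw) for canonical, raw in mapping.items()}
    entity["_source"] = source   # provenance, handy when requirements change
    entity["_raw"] = record      # keep the original so it can be re-harmonized later
    return entity

if __name__ == "__main__":
    crm_rec = {"cust_no": "C-17", "name": "Ada Lovelace", "email_addr": "ada@example.com"}
    bill_rec = {"acct_id": "C-17", "payer": "A. Lovelace", "contact": "ada@example.com"}
    print(harmonize(crm_rec, "crm"))
    print(harmonize(bill_rec, "billing"))
```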
But is that enough to unseat Oracle, which has long sat at the head of the table, especially since Oracle has its own new framework hitting the market? That is still up for debate, but MarkLogic is confident in its ability to compete. According to the piece:
Unlike other databases, NoSQL was specifically designed to ingest and integrate all types of disparate data to find relationships among data, and drive searches and analytics—within seconds.
This battle is just beginning and we have no indication of who has the edge, but you can bet it will be an interesting fight in the marketplace between these two titans.
Patrick Roland, October 10, 2017
Yet Another Digital Divide
September 8, 2017
Recommind sums up what happened at a recent technology convention in the article, “Why Discovery & ECM Haven’t, Must Come Together (CIGO Summit 2017 Recap).” Author Hal Marcus opens by noting that he has long issued a staunch challenge to anyone who says they can provide a complete information governance solution. He recently spoke at CIGO Summit 2017 about how to make information governance a feasible goal for organizations.
The problem with information governance is that there is no one simple solution, and projects tend to be self-contained with only one goal: data collection, data reduction, and so on. In his talk he gave five main reasons why there is no one comprehensive solution: it takes a while to complete a project and define its parameters, data can come from multiple streams, mass-scale indexing is challenging, analytics only help if there are humans to interpret the data, and risk and cost put a damper on projects.
Yet we are closer to a solution:
- Corporations seem to be dedicating more resources for data reduction and remediation projects, triggered largely by high profile data security breaches.
- Multinationals are increasingly scrutinizing their data sharing and retention practices, spurred by the impending May 2018 GDPR deadline.
- ECA for data culling is becoming more flexible and mature, supported by the growing availability and scalability of computing resources.
- Discovery analytics are being offered at lower, all-you-can-eat rates, facilitating a range of corporate use cases like investigations, due diligence, and contract analysis.
- Tighter, more seamless and secure integration of ECM and discovery technology is advancing and seeing adoption in corporations, to great effect.
And it always seems farther away.
Whitney Grace, September 8, 2017
Big Data Too Is Prone to Human Bug
August 2, 2017
Conventional wisdom says that Big Data, being the realm of machines, is immune to human behavioral traits like discrimination. Insights from data scientists, however, suggest otherwise.
According to an article published by PHYS.ORG titled “Discrimination, Lack of Diversity, and Societal Risks of Data Mining Highlighted in Big Data,” the author says:
Despite the dramatic growth in big data affecting many areas of research, industry, and society, there are risks associated with the design and use of data-driven systems. Among these are issues of discrimination, diversity, and bias.
The crux of the problem is the way data is mined and processed and the way decisions are made. At every step, humans need to be involved to tell machines how each of these processes is executed. If the person guiding the system is biased, those biases are bound to seep into the subsequent processes in some way.
Beyond decisions like granting credit, human resources work, which is also being automated, may have diversity issues. The fundamental problem remains the same in this case too.
Big Data was touted as the next big thing and may turn out to be so, but most companies are yet to figure out how to utilize it. Streamlining the processes and making them efficient would be the next step.
Vishal Ingole, August 2, 2017
Big Data in Biomedical
July 19, 2017
The biomedical field, which is replete with unstructured data, is set to take a giant leap toward standardization with the Biological Text Mining Unit.
According to PHYS.ORG, in a peer-reviewed article titled “Researchers Review the State-of-the-Art Text Mining Technologies for Chemistry,” the author states:
Being able to transform unstructured biomedical research data into structured databases that can be more efficiently processed by machines or queried by humans is critical for a range of heterogeneous applications.
Scientific data has a fixed vocabulary, which makes standardization and indexing easier. However, most big names in Big Data and enterprise search are concentrating their efforts on e-commerce.
Hundreds of new compounds are discovered every year. If the data pertaining to these compounds is made available to other researchers, advancements in this field will be very rapid. The major hurdle is that the data is in an unstructured format, a problem the Biological Text Mining Unit’s standards intend to overcome.
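As a rough illustration of what turning unstructured research text into structured, queryable records involves, here is a minimal dictionary-based extraction sketch. The compound lexicon, concept identifiers, and field names are invented for the example; real chemistry text mining relies on much larger lexicons plus statistical taggers.

```python
import re

# Minimal, illustrative sketch of dictionary-based chemical entity extraction.
# The lexicon, concept IDs, and output fields are invented for this example.
LEXICON = {"aspirin": "CHEM:0001", "ibuprofen": "CHEM:0002", "caffeine": "CHEM:0003"}

def extract_compounds(text: str, doc_id: str) -> list:
    """Return one structured record per compound mention found in the text."""
    records = []
    for name, concept_id in LEXICON.items():
        for match in re.finditer(rf"\b{re.escape(name)}\b", text, re.IGNORECASE):
            records.append({
                "doc_id": doc_id,
                "compound": name,
                "concept_id": concept_id,
                "offset": match.start(),
            })
    return records

if __name__ == "__main__":
    sentence = "Caffeine potentiated the analgesic effect of ibuprofen in the trial."
    for rec in extract_compounds(sentence, doc_id="DOC-001"):
        print(rec)
```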
Vishal Ingole, July 19, 2017
Does This Count As Irony?
May 16, 2017
Does this count as irony?
Palantir, which has built its data-analysis business largely on its relationships with government organizations, has a Department of Labor analysis to thank for recent charges of discrimination. No word on whether that Department used Palantir software to “sift through” the reports. Now, Business Insider tells us, “Palantir Will Shell Out $1.7 Million to Settle Claims that It Discriminated Against Asian Engineers.” Writer Julie Bort tells us that, in addition to that payout, Palantir will make job offers to eight unspecified Asians. She also explains:
The issue arose because, as a government contractor, Palantir must report its diversity statistics to the government. The Labor Department sifted through these reports and concluded that even though Palantir received a huge number of qualified Asian applicants for certain roles, it was hiring only small numbers of them. Palantir, being the big data company that it is, did its own sifting and produced a data-filled response that it said refuted the allegations and showed that in some tech titles 25%-38% of its employees were Asians. Apparently, Palantir’s protestations weren’t enough to satisfy government regulators, so the company agreed to settle.
For its part, Palantir insists it is innocent but says it settled in order to put the matter behind it. Bort notes the unusual nature of this case; according to the Equal Employment Opportunity Commission, African-Americans, Latin-Americans, and women are more underrepresented in tech fields than Asians. Is the Department of Labor making it a rule to analyze the hiring patterns of companies required to report diversity statistics? If it is consistent, there should soon be a number of such lawsuits regarding discrimination against other groups. We shall see.
Cynthia Murrell, May 16, 2017
Bad Big Data? Get More Data Then
March 2, 2017
I like the idea that more is better. The idea is particularly magnetic when a company cannot figure out what its own, in-house, proprietary data mean. Think of the legions of consultants from McKinsey and BCG telling executives what their own data “means.” Toss in the notion of Big Data in a giant “data lake,” and you have decision makers who cannot use the information they already have.
Well, how does one fix that problem? Easy: get more data. That sounds like a plan, particularly when the struggling professionals are the ones in charge of figuring out whether sales and marketing investments sort of pay for themselves.
I learned that I need more data by reading “Deepening The Data Lake: How Second-Party Data Increases AI For Enterprises.” The headline introduces the amazing data lake concept along with two giant lake front developments: More data and artificial intelligence.
Buzzwords? Heck no. Just solid post millennial reasoning; for example:
there are many marketers with surprisingly sparse data, like the food marketer who does not get many website visitors or authenticated customers downloading coupons. Today, those marketers face a situation where they want to use data science to do user scoring and modeling but, because they only have enough of their own data to fill a shallow lake, they have trouble justifying the costs of scaling the approach in a way that moves the sales needle.
I like that sales needle phrase. Marketers have to justify themselves, and many have only “sparse” data. I would suggest that marketers often have useless data, like the number of unique clicks, but that is only polluting the data lake.
The fix is interesting. I learned:
we can think of the marketer’s first-party data – media exposure data, email marketing data, website analytics data, etc. – being the water that fills a data lake. That data is pumped into a data management platform (pictured here as a hydroelectric dam), pumped like electricity through ad tech pipes (demand-side platforms, supply-side platforms and ad servers) and finally delivered to places where it is activated (in the town, where people live)… this infrastructure can exist with even a tiny bit of water but, at the end of the cycle, not enough electricity will be generated to create decent outcomes and sustain a data-driven approach to marketing. This is a long way of saying that the data itself, both in quality and quantity, is needed in ever-larger amounts to create the potential for better targeting and analytics.
Yep, more data.
And what about making sense of the additional data? I learned:
The data is also of extremely high provenance, and I would also be able to use that data in my own environment, where I could model it against my first-party data, such as site visitors or mobile IDs I gathered when I sponsored free Wi-Fi at the last Country Music Awards. The ability to gather and license those specific data sets and use them for modeling in a data lake is going to create massive outcomes in my addressable campaigns and give me an edge I cannot get using traditional ad network approaches with third-party segments. Moreover, the flexibility around data capture enables marketers to use highly disparate data sets, combine and normalize them with metadata – and not have to worry about mapping them to a predefined schema. The associative work happens after the query takes place. That means I don’t need a predefined schema in place for that data to become valuable – a way of saying that the inherent observational bias in traditional approaches (“country music fans love mainstream beer, so I’d better capture that”) never hinders the ability to activate against unforeseen insights.
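The “no predefined schema” bit is essentially schema-on-read: records are stored as they arrive, and structure is imposed only when a question is asked. Here is a toy sketch of that idea, with all field names invented for the example.

```python
# Toy schema-on-read sketch: heterogeneous records are stored as-is, and a
# schema (here just a per-source field accessor) is applied only at query time.
# All field names are invented for illustration.

data_lake = [
    {"source": "web",   "visitor_id": "v1", "geo": {"country": "US"}},
    {"source": "wifi",  "device": {"mobile_id": "m9", "country_code": "US"}},
    {"source": "email", "recipient": "v1"},  # no geography at all
]

# Rules defined when the question is asked, not when the data is loaded.
COUNTRY_RULES = {
    "web":  lambda rec: rec["geo"]["country"],
    "wifi": lambda rec: rec["device"]["country_code"],
}

def countries(records):
    """Yield a country for every record whose source we know how to read."""
    for rec in records:
        rule = COUNTRY_RULES.get(rec["source"])
        if rule is not None:
            yield rule(rec)

print(list(countries(data_lake)))  # ['US', 'US']; the email record is simply skipped
```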
Okay, I think I understand. No wonder companies hire outfits like blue chip consulting firms to figure out what is going on in their companies. Stated another way, insiders live in the swamp. Outsiders can put the swamp into a context and maybe implement some pollution control systems.
Stephen E Arnold, March 2, 2017
Big Data Needs to Go Public
December 16, 2016
Big Data touches every part of our lives, and we are unaware of it. Have you ever noticed that when you listen to the news, read an article, or watch a YouTube video, people say things such as “experts claim” or “science says”? In the past, these statements relied on less than trustworthy sources, but now they can use Big Data to back up their claims. However, popular opinion and puff pieces still need to back up their big data with hard fact. Nature.com says that transparency is a big deal for Big Data and that algorithm designers need to work on it in the article, “More Accountability for Big-Data Algorithms.”
One of the hopes is that big data will be used to bridge the divide between one bias and another, except that the opposite can happen. In other words, Big Data algorithms can be designed with a bias:
There are many sources of bias in algorithms. One is the hard-coding of rules and use of data sets that already reflect common societal spin. Put bias in and get bias out. Spurious or dubious correlations are another pitfall. A widely cited example is the way in which hiring algorithms can give a person with a longer commute time a negative score, because data suggest that long commutes correlate with high staff turnover.
Even worse is that people and organizations can design an algorithm to support science or facts they want to pass off as the truth. There is a growing demand for “algorithm accountability,” mostly in academia. The demand is that the data sets fed into the algorithms be made public. There are also plans to build algorithms that monitor algorithms for bias.
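One concrete shape an “algorithm that monitors algorithms” can take is a disparate-impact check comparing positive-outcome rates across groups. A minimal sketch follows; the numbers are made up, and the 0.8 cutoff echoes the common four-fifths rule rather than anything from the article.

```python
# Minimal bias-audit sketch: compare an algorithm's positive-outcome rates
# across groups. Numbers and the 0.8 "four-fifths rule" threshold are
# illustrative only.

def selection_rate(outcomes: list) -> float:
    """Fraction of positive decisions (1 = selected, 0 = rejected)."""
    return sum(outcomes) / len(outcomes)

def disparate_impact(group_a: list, group_b: list) -> float:
    """Ratio of the lower selection rate to the higher one (1.0 = parity)."""
    rate_a, rate_b = selection_rate(group_a), selection_rate(group_b)
    return min(rate_a, rate_b) / max(rate_a, rate_b)

if __name__ == "__main__":
    hired_group_a = [1, 1, 0, 1, 0, 1, 1, 0]   # 62.5% selected
    hired_group_b = [1, 0, 0, 0, 1, 0, 0, 0]   # 25.0% selected
    ratio = disparate_impact(hired_group_a, hired_group_b)
    print(f"disparate impact ratio: {ratio:.2f}")
    print("flag for review" if ratio < 0.8 else "within threshold")
```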
Big Data is here to stay, but relying too much on algorithms can distort the facts. This is why the human element is still needed to distinguish between fact and fiction. Minority Report is closer to being our present than ever before.
Whitney Grace, December 16, 2016
Algorithm Bias in Beauty Contests
September 16, 2016
I don’t read about beauty contests. In my college dorm, I recall that the televised broadcast of the Miss America pageant was popular among some of the residents. I used the attention grabber as my cue to head to the library so I could hide reserved books from my classmates. Every little bit helps in the dog eat dog world of academic achievement.
“When Artificial Intelligence Judges a Beauty Contest, White People Win” surprised me. I thought that algorithms were objective little numerical recipes. Who could fiddle with 1 + 1 = 2?
I learned:
The foundation of machine learning is data gathered by humans, and without careful consideration, the machines learn the same biases of their creators. Sometimes bias is difficult to track, but other times it’s clear as the nose on someone’s face—like when it’s a face the algorithm is trying to process and judge.
It seems that the algorithm likes white people. The write up informed me:
An online beauty contest called Beauty.ai, run by Youth Laboratories (that lists big names in tech like Nvidia and Microsoft as “partners and supporters” on the contest website), solicited 600,000 entries by saying they would be graded by artificial intelligence. The algorithm would look at wrinkles, face symmetry, amount of pimples and blemishes, race, and perceived age. However, race seemed to play a larger role than intended; of the 44 winners, 36 were white.
Oh, oh. Microsoft and its smart software seem to play a role in this drama.
What’s the fix? Better data. The write up includes this statement from a Microsoft expert:
“If a system is trained on photos of people who are overwhelmingly white, it will have a harder time recognizing non-white faces,” writes Kate Crawford, principal researcher at Microsoft Research New York City, in a New York Times op-ed. “So inclusivity matters—from who designs it to who sits on the company boards and which ethical perspectives are included. Otherwise, we risk constructing machine intelligence that mirrors a narrow and privileged vision of society, with its old, familiar biases and stereotypes.”
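Crawford’s point about training data can be made concrete with a toy example: a detector calibrated on a sample dominated by one group misses the other group. All numbers below are invented to show the mechanism; they are not drawn from Beauty.ai.

```python
from statistics import mean

# Toy illustration: a detector calibrated on a skewed sample recognizes the
# over-represented group and misses the other. All numbers are invented.
train_group_a = [0.80, 0.82, 0.78, 0.81, 0.79, 0.83, 0.80, 0.77, 0.82]  # 9 samples
train_group_b = [0.35]                                                  # 1 sample

# Naive detector: "recognized" if the feature value is near the training mean.
center = mean(train_group_a + train_group_b)
TOLERANCE = 0.15

def recognized(value: float) -> bool:
    return abs(value - center) <= TOLERANCE

test_a = [0.79, 0.81, 0.84]
test_b = [0.33, 0.38, 0.40]
print("group A recognized:", sum(recognized(v) for v in test_a), "of", len(test_a))
print("group B recognized:", sum(recognized(v) for v in test_b), "of", len(test_b))
```

With these made-up numbers, every test face from the over-represented group is recognized and none from the other group is, which is the mechanism Crawford describes.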
In the last few months, Microsoft’s folks were involved in Tay, a chatbot which allegedly learned to be racist. Then there was the translation of “Daesh” as Saudi Arabia. Now algorithms appear to favor folks of a particular stripe.
Exciting math. But Microsoft has also managed to gum up webcams and Kindle access in Windows 10. Yep, the new Microsoft is a sparkling example of smart.
Stephen E Arnold, September 16, 2016
In-Q-Tel Wants Less Latency, Fewer Humans, and Smarter Dashboards
September 15, 2016
I read “The CIA Just Invested in a Hot Startup That Makes Sense of Big Data.” I love the “just.” In-Q-Tel investments are not like bumping into a friend in Penn Station. Zoomdata, founded in 2012, has been making calls, raising venture funding (more than $45 million in four rounds from 21 investors), and staffing up to about 100 full time equivalents. With its headquarters in Reston, Virginia, the company is not exactly operating from a log cabin west of Paducah, Kentucky.
The write up explains:
Zoom Data uses something called Data Sharpening technology to deliver visual analytics from real-time or historical data. Instead of a user searching through an Excel file or creating a pivot table, Zoom Data puts what’s important into a custom dashboard so users can see what they need to know immediately.
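The quote stays at the marketing level, but the underlying idea of surfacing “what’s important” is automated aggregation plus ranking. Here is a generic sketch with invented event data; it is not Zoomdata’s Data Sharpening technology.

```python
from collections import Counter

# Generic sketch of "surface what matters" dashboarding: aggregate raw events
# and rank the largest contributors. Invented data; not Zoomdata's
# Data Sharpening technology.

events = [
    {"region": "EMEA", "errors": 2}, {"region": "APAC", "errors": 41},
    {"region": "EMEA", "errors": 3}, {"region": "AMER", "errors": 5},
    {"region": "APAC", "errors": 37}, {"region": "AMER", "errors": 4},
]

totals = Counter()
for event in events:
    totals[event["region"]] += event["errors"]

# The "dashboard": top regions by error volume, highest first.
for region, count in totals.most_common(3):
    print(f"{region:5s} {count:4d}")
```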
What Zoomdata does is offer hope to its customers for less human fiddling with data and faster outputs of actionable intelligence. If you recall how IBM i2 and Palantir Gotham work, humans are needed. IBM even snagged Palantir’s jargon of AI for “augmented intelligence.”
In-Q-Tel wants more smart software with less dependence on expensive, hard to train, and often careless humans. When incoming rounds hit near a mobile operations center, it is possible to lose one’s train of thought.
Zoomdata has some Booz, Allen DNA, some MIT RNA, and protein from other essential chemicals.
The write up mentions Palantir, but does not make explicit the need to reduce, to some degree, the human-centric approaches which are part of the major systems’ core architecture. You have nifty cloud stuff, but you have less nifty humans in most mission critical work processes.
To speed up the outputs, software should be the answer. An investment in Zoomdata delivers three messages to me here in rural Kentucky:
- In-Q-Tel continues to look for ways to move along the “less wait and less weight” requirement of those involved in operations. “Weight” refers to heavy, old-fashioned systems. “Wait” refers to the latency imposed by manual processes.
- Zoomdata and other investments are whips to the flanks of the BAE Systems, IBMs, and Palantirs chasing government contracts. The investment focuses attention not on scope changes but on figuring out how to deal with the unacceptable complexity and latency of many existing systems.
- In-Q-Tel has upped the value of Zoomdata. With consolidation in the commercial intelligence business rolling along at NASCAR speeds, it won’t take long before Zoomdata finds itself going to big company meetings to learn what the true costs of being acquired are.
For more information about Zoomdata, check out the paid-for reports at this link.
Stephen E Arnold, September 15, 2016
How Collaboration and Experimentation Are Key to Advancing Machine Learning Technology
September 12, 2016
The article on CIO titled “Machine Learning ‘Still a Cottage Industry’” conveys the sentiments of a man at the heart of the industry in Australia, Professor Bob Williamson. Williamson is the chief scientist of the Commonwealth Scientific and Industrial Research Organisation’s (CSIRO’s) Data61 group. His work in machine learning and data analytics led him to the conclusion that for machine learning to truly move forward, scientists must find a way to collaborate. He is quoted in the article:
“There’s these walled gardens: ‘I’ve gone and coded my models in a particular way, you’ve got your models coded in a different way, we can’t share.’ This is a real challenge for the community. No one’s cracked this yet.” A number of start-ups have entered the “machine-learning-as-a-service” market, such as BigML, Wise.io, and Precog, and big names including IBM, Microsoft, and Amazon haven’t been far behind. Though these MLaaS offerings herald some impressive results, Williamson warned businesses to be cautious.
Williamson speaks to the possibility of stagnation in machine learning due to the emphasis on data mining as opposed to experimenting. He hopes businesses will do more with their data than simply look for patterns. It is a refreshing take on the industry from an outsider/insider, a scientist more interested in the science of it all than the massive stacks of cash at stake.
Chelsea Kerwin, September 12, 2016
Sponsored by ArnoldIT.com, publisher of the CyberOSINT monograph
There is a Louisville, Kentucky Hidden Web/Dark Web meet up on September 27, 2016.
Information is at this link: https://www.meetup.com/Louisville-Hidden-Dark-Web-Meetup/events/233599645/