Oracle and Blockchain
July 28, 2020
Amidst the angst about US big technology companies, Rona, and Intel’s management floundering, Oracle blockchain is easy to overlook. “Oracle Updates Blockchain Platform Cloud Service.” The title alone invokes the image of Amazon’s blockchain platform and its associated moving parts.
The write up focuses on Oracle as if Amazon and other options do not exist. But the parallels with Amazon’s blockchain services are hard to miss. The article reports:
Blockchain Platform Cloud Service features stronger access controls for sharing confidential information, greater decentralization capabilities for blockchain consortiums, and stronger auditability when the rich history database feature is used in conjunction with Oracle Database Blockchain Tables.
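For readers who have not poked at the blockchain tables feature, here is a bare-bones sketch of what creating one looks like from Python via the python-oracledb driver. The table name, columns, connection details, and retention clauses are illustrative only; clause syntax varies by Oracle release, so treat this as a sketch, not gospel.

```python
# A bare-bones sketch of creating an Oracle blockchain table from Python via
# the python-oracledb driver. Table name, columns, and connection details are
# illustrative; clause syntax varies by Oracle release.
import oracledb

conn = oracledb.connect(user="demo", password="demo", dsn="localhost/orclpdb1")  # hypothetical credentials
cur = conn.cursor()

# Rows in a blockchain table are chained with cryptographic hashes and cannot
# be updated or, during the locked retention period, deleted.
cur.execute("""
    CREATE BLOCKCHAIN TABLE bank_ledger (
        bank            VARCHAR2(128),
        deposit_date    DATE,
        deposit_amount  NUMBER
    )
    NO DROP UNTIL 31 DAYS IDLE
    NO DELETE LOCKED
    HASHING USING "SHA2_512" VERSION "v1"
""")

cur.execute(
    "INSERT INTO bank_ledger VALUES (:1, SYSDATE, :2)",
    ["Example Bank", 100.00],
)
conn.commit()
```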
Even more Amazon envy seems to have influenced this “new” feature:
Oracle Cloud Infrastructure Availability Domains (and in the regions with a single Availability Domain, three Fault Domains) to provide stronger resilience and recoverability, with the SLA for the Enterprise SKUs of at least 99.95%.
The lineup of services strikes me as having been developed after reading Amazon’s blockchain documentation; for example:
- On demand storage
- Spiffed up access controls
- Workflow functions.
There is one difference, however. It appears that Oracle wants to tackle Amazon blockchain at a weak point: price. Oracle is not likely to be significantly cheaper than AWS blockchain, but it wants to make its pricing more or less understandable to a prospect.
Will clarity allow Oracle to compete with Amazon blockchain?
After losing Amazon as a customer and watching the online book store pump out blockchain inventions for several years, Oracle hopes its approach will prevail or at least catch up with the Bezos bulldozer.
Stephen E Arnold, July 28, 2020
TileDB Developing a Solution to Database Headaches
July 27, 2020
Developers at TileDB are working on a solution to the many problems traditional and NoSQL databases create, and now they have secured more funding to help them complete their platform. The company’s blog reports, “TileDB Closes $15M Series A for Industry’s First Universal Data Engine.” The funding round is led by Two Bear Capital, whose managing partner will be joining TileDB’s board of directors. The company’s CEO, Stavros Papadopoulos, writes:
“The Series A financing comes after TileDB was chosen by customers who experienced two key pains: scalability for complex data and deployment. Whole-genome population data, single-cell gene data, spatio-temporal satellite imagery, and asset-trading data all share multi-dimensional structures that are poorly handled by monolithic databases, tables, and legacy file formats. Newer computational frameworks evolved to offer ‘pluggable storage’ but that forces another part of the stack to deal with data management. As a result, organizations waste resources on managing a sea of files and optimizing storage performance, tasks traditionally done by the database. Moreover, developers and data scientists are spending excessive time in data engineering and deployment, instead of actual analysis and collaboration. …
“We invented a database that focuses on universal storage and data management rather than the compute layer, which we’ve instead made ‘pluggable.’ We cleared the path for analytics professionals and data scientists by taking over the messiest parts of data management, such as optimized storage for all data types on numerous backends, data versioning, metadata, access control within or outside organizational boundaries, and logging.”
So with this tool, developers will be freed from tedious manual steps, leaving more time to innovate and draw conclusions from their complex data. TileDB has also developed APIs to facilitate integration with tools like Spark, Dask, MariaDB, and PrestoDB, while TileDB Cloud enables easy, secure sharing and scalability. See the write-up for praise from excited customers-to-be, or check out the company’s website. Readers can also access the open-source TileDB Embedded storage engine on GitHub. Founded in 2017, TileDB is based in Cambridge, Massachusetts.
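For the curious, here is a minimal sketch of the array model TileDB Embedded exposes, using the open-source tiledb Python package. The array location, dimensions, and attribute are invented for illustration, and API details shift between releases, so treat this as a sketch rather than a recipe.

```python
# A minimal sketch of TileDB's array model using the open-source tiledb
# Python package. The array location and attribute name are illustrative.
import numpy as np
import tiledb

uri = "quickstart_dense"  # hypothetical local array location

# Define a 4x4 dense array with a single integer attribute.
dom = tiledb.Domain(
    tiledb.Dim(name="rows", domain=(1, 4), tile=2, dtype=np.int32),
    tiledb.Dim(name="cols", domain=(1, 4), tile=2, dtype=np.int32),
)
schema = tiledb.ArraySchema(
    domain=dom,
    sparse=False,
    attrs=[tiledb.Attr(name="a", dtype=np.int32)],
)
tiledb.Array.create(uri, schema)

# Write a block of values, then slice it back out.
with tiledb.open(uri, mode="w") as arr:
    arr[:] = np.arange(16, dtype=np.int32).reshape(4, 4)

with tiledb.open(uri, mode="r") as arr:
    print(arr[1:3, 1:3]["a"])  # NumPy-style slicing over the stored array
```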
Cynthia Murrell, July 27, 2020
IHS Markit Data Lake “Catalog”
July 14, 2020
One member of the DarkCyber research team spotted this product announcement from IHS, a diversified information company: “IHS Markit’s New Data Lake Delivers Over 1,000 Datasets in an Integrated Catalogued Platform.” The article states:
The cloud-based platform stores, catalogues, and governs access to structured and unstructured data. Data Lake solutions include access to over 1,000 proprietary data assets, which will be expanded over time, as well as a technology platform allowing clients to manage their own data. The IHS Markit Data Lake Catalogue offers robust search and exploration capabilities, accessed via a standardized taxonomy, across datasets from the financial services, transportation and energy sectors.
The idea is consistently organized information. Queries can run across the content to which the customer has access.
Similar services are available from other companies; for example, Oracle BlueKai.
One question which comes up is, “What exactly are the data on offer?” Another is, “How much does it cost to use the service?”
Let’s tackle the first question: Scope.
None of the aggregators make it easy to scan a list of datasets, click on an item, and get a useful synopsis of the content, content elements, number of items in the dataset, update frequency (annual, monthly, weekly, near real time), and the cost method applicable to a particular “standard” query.
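To make the complaint concrete, here is a minimal sketch of the kind of catalog record DarkCyber has in mind. The field names are hypothetical; they simply mirror the wish list in the paragraph above, not IHS Markit’s actual schema.

```python
# A minimal sketch of the kind of catalog record the paragraph above asks
# for. The field names are hypothetical, not IHS Markit's actual schema.
from dataclasses import dataclass
from typing import List

@dataclass
class DatasetCatalogEntry:
    name: str                    # dataset name
    synopsis: str                # what the content covers
    content_elements: List[str]  # fields or record types included
    record_count: int            # number of items in the dataset
    update_frequency: str        # "annual", "monthly", "weekly", "near real time"
    pricing_method: str          # cost basis for a "standard" query

entry = DatasetCatalogEntry(
    name="Example vehicle history feed",
    synopsis="Title, accident, and odometer events for US passenger vehicles",
    content_elements=["VIN", "event_type", "event_date", "reporting_source"],
    record_count=25_000_000,
    update_frequency="weekly",
    pricing_method="per query, tiered by monthly volume",
)
print(entry)
```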
A search of Bing and Google reveals the names of particular datasets; for example, Carfax. However, getting answers to the scope question can require direct interaction with the company. Some aggregators operate in a similar manner.
The second question: Cost?
The answer to the cost question is a tricky one. The data aggregators have adopted a set or a cluster of pricing scenarios. It is up to the customer to look at the disclosed data and do some figuring. In DarkCyber’s experience, the data aggregators know much more about which content, processes, functions, or operations generate the maximum profit for the vendor. The customer does not have this insight. Only by using the system, analyzing the invoices, and paying them is it possible to get a grip on costs.
DarkCyber’s view is that data marketplaces are vulnerable to disruption. With a growing demand for a wide range of information, some potential customers want answers before signing a contract and shelling out big bucks.
Aggregators are participants in what DarkCyber calls “professional publishing.” The key to this sector is mystery and a reluctance to spell out exact answers to important questions.
What company is poised to disrupt the data aggregation business? Is it a small-scale specialist like the firms pursued relentlessly by “real” journalists seeking a story about violations of privacy? Is it a giant company casting about for a new source of revenue and, therefore, easily overlooked? Aggregation is not exactly exciting for many people.
DarkCyber does not know. One thing seems highly likely: the professional publishing data aggregation sector will face competitive pressure in the months ahead.
Some customers may be fed up with the secrecy and lack of clarity; entrepreneurs will spot the opportunity and move forward. Rich innovators will just buy the vendors and move in new directions.
Stephen E Arnold, July 14, 2020
The Myth of Data Federation: Not a New Problem, Not One Easily Solved
July 8, 2020
I read “A Plan to Make Police Data Open Source Started on Reddit.” The main point of this particular article is:
The Police Data Accessibility Project aims to request, download, clean, and standardize public records that right now are overly difficult to find.
Interesting, but I interpreted the Silicon Valley centric write up differently. If you are a marketer of systems which purport to normalize disparate types of data, aggregate them, federate indexes, and make the data accessible, analyzable, retrievable, and bang on dead simple — stop reading now. I don’t want to deal with squeals from vendors about their superior systems.
For the individual reading this sentence, a word of advice. Fasten your seat belt.
Some points to consider when reading the article cited above, listening to a Vimeo “insider” sales pitch, or just doing techno babble with your Spin class pals:
- Dealing with disparate data requires time and money as well as NOT ONE but multiple software tools.
- Even with a well-resourced and technologically adept staff, exceptions require attention. A failure to deal with the stuff in the Exceptions folder can skew the outputs of some Fancy Dan analytic systems (see the sketch after this list). Example: How about that Detroit facial recognition system? Nifty, eh?
- The flows of real-time data are a big problem (are you ready for this?) even for the Facebooks, Googles, and Microsofts of the world. The reason is that the volume of data, plus CHANGES TO THOSE ALREADY PROCESSED ITEMS OF INFORMATION, is a very, very tough problem. No, faster processors, bigger pipes, and zippy SSDs won’t do the job. The trouble lies within: the intra-device and intra-software-module flow. The fix is to sample, and sampling increases the risk of inaccuracies. Example: Remember Detroit’s facial recognition accuracy? The arrested individual may share some impressions with you.
- The baloney about “all” data or “any” type is crazy talk. When one deals with more than 18,000 police forces in the US, outputs from surveillance devices made by different vendors, and the geodumps of individuals and their ad-tracking beacons, is all of that really going to be mashed up and made usable? Noble idea. There are many noble ideas.
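To make the first two points concrete, here is a minimal sketch of the normalization-plus-exceptions grind. The department layouts and field names are invented for illustration; real public-records feeds are messier, and this is not the Police Data Accessibility Project’s code.

```python
# A minimal sketch of the normalization-plus-exceptions problem described in
# the list above. Department layouts and field names are invented; real
# public-records feeds are far messier.
from datetime import datetime

def normalize(record: dict, source: str) -> dict:
    """Map one department's layout onto a common schema."""
    if source == "dept_a":   # hypothetical CSV export
        return {
            "incident_id": record["IncidentNo"],
            "occurred_at": datetime.strptime(record["Date"], "%m/%d/%Y"),
            "offense": record["Offense"].strip().lower(),
        }
    if source == "dept_b":   # hypothetical JSON feed
        return {
            "incident_id": record["id"],
            "occurred_at": datetime.fromisoformat(record["occurred"]),
            "offense": record["offense_desc"].strip().lower(),
        }
    raise ValueError(f"unknown source: {source}")

incoming = [
    ("dept_a", {"IncidentNo": "A-1001", "Date": "06/01/2020", "Offense": " Theft "}),
    ("dept_b", {"id": "B-77", "occurred": "2020-06-02T14:30:00", "offense_desc": "Assault"}),
    ("dept_b", {"id": "B-78", "occurred": "not a date", "offense_desc": "Fraud"}),
]

clean, exceptions = [], []
for source, record in incoming:
    try:
        clean.append(normalize(record, source))
    except (KeyError, ValueError) as err:
        # Records that do not fit go to the exceptions folder. Ignore this
        # pile and downstream analytics quietly skew.
        exceptions.append((source, record, str(err)))

print(len(clean), "normalized;", len(exceptions), "sent to the exceptions folder")
```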
Why am I taking the time to repeat what anyone with experience in large scale data normalization and analysis knows?
Baloney can be thinly sliced, smeared with gochujang, and served on Delft plates. Know what? Still baloney.
Gobble this:
Still, data is an important piece of understanding what law enforcement looks like in the US now, and what it could look like in the future. And making that information more accessible, and the stories people tell about policing more transparent, is a first step.
But the killer assumption is that the humans involved don’t make errors, systems remain online, and file formats are forever.
That’s baloney. It really is incredible. Just not what you think.
Stephen E Arnold, July 8, 2020
Content for Deep Learning: The Lionbridge View
March 17, 2020
Here is a handy resource. Lionbridge AI shares “The Best 25 Datasets for Natural Language Processing.” The list is designed as a starting point for those just delving into NLP. Writer Meiryum Ali begins:
“Natural language processing is a massive field of research. With so many areas to explore, it can sometimes be difficult to know where to begin – let alone start searching for data. With this in mind, we’ve combed the web to create the ultimate collection of free online datasets for NLP. Although it’s impossible to cover every field of interest, we’ve done our best to compile datasets for a broad range of NLP research areas, from sentiment analysis to audio and voice recognition projects. Use it as a starting point for your experiments, or check out our specialized collections of datasets if you already have a project in mind.”
The suggestions are divided by purpose. For use in sentiment analysis, Ali notes one needs to train machine learning models on large, specialized datasets like the Multidomain Sentiment Analysis Dataset or the Stanford Sentiment Treebank. Some text datasets she suggests for natural language processing tasks like voice recognition or chatbots include 20 Newsgroups, the Reuters News Dataset, and Princeton University’s WordNet. Audio speech datasets that made the list include the audiobooks of LibriSpeech, the Spoken Wikipedia Corpora, and the Free Spoken Digit Dataset. The collection concludes with some more general-purpose datasets, like Amazon Reviews, the Blogger Corpus, the Gutenberg eBooks List, and a set of questions and answers from Jeopardy. See the write-up for more on each of these entries as well as the rest of Ali’s suggestions in each category.
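As a concrete starting point, here is a minimal sketch of loading one of the listed text datasets, 20 Newsgroups, via scikit-learn’s built-in fetcher and turning it into features. The other datasets on the list ship as plain downloads and need their own loaders.

```python
# A minimal sketch of loading one dataset from the list, 20 Newsgroups, via
# scikit-learn's built-in fetcher. Other listed datasets need their own loaders.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
print(len(train.data), "documents across", len(train.target_names), "newsgroups")

# Turn the raw posts into TF-IDF features, a common first step before
# training a topic or sentiment classifier.
vectorizer = TfidfVectorizer(max_features=20_000, stop_words="english")
X = vectorizer.fit_transform(train.data)
print(X.shape)
```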
This being a post from Lionbridge, an AI training data firm, it naturally concludes with an invitation to contact them when ready to move beyond these pre-made datasets to one customized for you. Based in Waltham, Massachusetts, the company was founded in 1996 and acquired by H.I.G. Capital in 2017.
Cynthia Murrell, March 17, 2020
LiveRamp: Data Aggregation Under the Marketing Umbrella
March 10, 2020
Editor’s Note: We posted a short item about Venntel. This sparked some email and phone calls from journalists wanting to know more about data aggregation. There are a number of large data aggregation companies. Many of these work with diverse partners. If the data aggregation companies do not sell directly to the US government, some of the partners of these firms might. One of the larger data aggregation companies positions itself as a specialist, a niche player. We have pulled some information from our files to illustrate what data aggregation, cross correlation, and identity resolution contribute to advertisers, political candidates, and other entities.
Introduction
LiveRamp is the former Acxiom, and it occupies a leadership position in resolving identity across data sets. The system can be used by a company to generate revenue from its information. The company says:
We’re innovators, engineers, marketers, and data ethics experts on a mission to make data safe and easy to use.
LiveRamp also makes it easy for a company to obtain certain types of data and services which can be made more accurate via LiveRamp methods. The information is first-, second-, and third-party data. First means the company captures the data directly. Second means the data come from a partner. Third means that, like distant cousins, there’s mostly a tenuous relationship among the source of the data, the creator of the data, the collector of the data, and the intermediary who provides the data to LiveRamp. There’s a 2016 how-to at this link.
According to a former LiveRamp employee:
LiveRamp doesn’t actually provide intelligence on the data, it just moves the data around effectively, quickly, seamlessly, and accurately.
The basic mechanism was explained in “The Hidden Value of Acxiom’s LiveRamp”:
An alternative approach is to designate a single company to be the hub of all ID syncs. The hub can collect IDs from each participating ad tech partner and then form mutual ID syncs as needed. Think of this as a match maker who knows the full universe of eligible singles and can then introduce couples. LiveRamp has established itself as this match maker…
This is ID syncing; that is, figuring out who is who or what is what via anonymized or incomplete data sets.
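To make the match-maker metaphor concrete, here is a minimal sketch of a hub-style ID sync in Python. The partner names and cookie values are invented, and this illustrates the general pattern only, not LiveRamp’s actual implementation.

```python
# A minimal sketch of a hub-style ID sync: each ad tech partner knows a user
# by its own cookie ID, and the hub keeps the mapping so any two partners can
# translate between their IDs. Partner names and cookie values are invented;
# this is the general pattern, not LiveRamp's actual implementation.
from collections import defaultdict
from typing import Optional

class SyncHub:
    def __init__(self):
        self._links = defaultdict(dict)  # hub_id -> {partner: partner_cookie_id}
        self._reverse = {}               # (partner, cookie_id) -> hub_id
        self._next = 0

    def record_sync(self, partner: str, cookie_id: str, hub_id: Optional[str] = None) -> str:
        """Register that `partner` saw `cookie_id`; mint a hub ID if none exists."""
        if hub_id is None:
            hub_id = self._reverse.get((partner, cookie_id))
        if hub_id is None:
            hub_id = f"hub-{self._next}"
            self._next += 1
        self._links[hub_id][partner] = cookie_id
        self._reverse[(partner, cookie_id)] = hub_id
        return hub_id

    def translate(self, from_partner: str, cookie_id: str, to_partner: str) -> Optional[str]:
        """Given one partner's cookie ID, return another partner's ID for the same browser."""
        hub_id = self._reverse.get((from_partner, cookie_id))
        return self._links[hub_id].get(to_partner) if hub_id else None

hub = SyncHub()
hub_id = hub.record_sync("dsp_example", "cookie-123")        # first partner syncs
hub.record_sync("ssp_example", "cookie-abc", hub_id=hub_id)  # second partner syncs to the same hub ID
print(hub.translate("dsp_example", "cookie-123", "ssp_example"))  # -> cookie-abc
```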
There’s nothing unusual in what LiveRamp does. Oracle and other firms perform onboarding. Why? Data are a hot mess. “Hot” means that government agencies, companies, digital currency providers, and non-governmental organizations will license access to these data. The “mess” means that information is messy, incomplete, and inaccurate. Cross correlation can address some, but not all, of these characteristics.
The Business: License Access to Data
Think of LiveRamp as an old-school mailing list company. There’s a difference. LiveRamp drinks protein shakes, follows a keto diet, and makes full use of digital technology.
We have a unique philosophy and approach to onboarding [that’s the LiveRamp lingo for importing data]. It’s not just about bringing offline data online. It’s about bringing siloed first-, second-, and third-party data together in a privacy-conscious manner and then resolving it to a single persistent identifier called an IdentityLink.
DarkCyber is no expert in the business processes of LiveRamp. We can express some of these ideas in our own words.
Onboarding means importing. In order to import data, LiveRamp, a Fiverr worker, or smart software has to convert the source data to a format LiveRamp can import. There are other steps to make sure the data are consistent, the fields exist, and the fields are what the bringer of the data says they are; for example, that the number of records matches what the data provider asserts.
Siloed data are data kept apart from other data. The reason for creating separate, often locked-down sets of data is secrecy, licensing compliance, or business policy; for example, a pharma outfit developing a Covid-19 treatment does not want those data floating around anywhere except in a very narrow slice of the research facility. Once siloed data appear anywhere, DarkCyber becomes quite curious about the who, what, when, where, why, and the all-important how. How answers the question, “How did the data escape the silo?”
“Privacy conscious” is a phrase that seems a bit like Facebook lingo. No comment or further explanation is needed from DarkCyber’s point of view.
IdentityLink is essentially an accession number for a profile. Law enforcement gives prisoners numbers and gathers data in a profile. LiveRamp does much the same for the entities its cross-correlation methods resolve. Once an individual profile exists, other numerical procedures can be applied to assign “values” or “classifications” to the entities; for example, sports fan or maybe millennial big spender. One may be able to “resolve identity” if a customer does not know “who” an entity is.
Cookie data are available. These are useful for a range of specialized functions; for example, trying to determine where an individual has “gone” on the Internet and related operations.
In a nutshell, this is the business of LiveRamp.
Open Source Contributions
LiveRamp has more than three dozen repositories on GitHub. Examples include:
- Cascading_ext, which allows LiveRamp customers to build, debug, and run simple data workflows.
- HyperMinHash-java. Cross correlation by any other name still generates useful outputs. (A plain MinHash sketch of the underlying idea appears after this list.)
- Munkres. Optimization made semi-easy.
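The MinHash family of tricks estimates how much two large sets overlap from tiny fixed-size signatures. Here is a plain, textbook MinHash sketch; it illustrates the underlying idea only and is not LiveRamp’s HyperMinHash-java code, which layers HyperLogLog-style compression on top.

```python
# A plain MinHash sketch of set-similarity estimation. This is the textbook
# technique, not LiveRamp's HyperMinHash-java implementation.
import hashlib

def minhash_signature(items, num_hashes=128):
    """Return a MinHash signature: for each seed, the minimum hash over the set."""
    sig = []
    for seed in range(num_hashes):
        lowest = min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest(),
                "big",
            )
            for item in items
        )
        sig.append(lowest)
    return sig

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots approximates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

audience_a = {"user-1", "user-2", "user-3", "user-4"}
audience_b = {"user-3", "user-4", "user-5", "user-6"}
print(estimate_jaccard(minhash_signature(audience_a), minhash_signature(audience_b)))
# roughly 1/3, the true Jaccard similarity of these two sets
```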
People
The LiveRamp CEO is Scott Howe, who used to work at Microsoft. LiveRamp purchased Data Plus Math, a firm specializing in analyzing targeted ads on traditional and streaming TV. Data Plus Math co-founders, CEO John Hoctor and Chief Technology Officer Matthew Emans, allegedly have work experience with Mr. Howe and Microsoft’s advertising unit.
Interesting Customers
- Advertising agencies
- Political campaigns
- Ad inventory brokers.
Stephen E Arnold, March 10, 2020
Enterprise Document Management: A Remarkable Point of View
March 3, 2020
DarkCyber spotted “What Is an Enterprise Document Management (EDM) System? How to Implement Full Document Control.” The write up is lengthy, running about 4,000 words. There are pictures, including one showing ECM (enterprise content management) with Enterprise Document Management in the middle, abbreviated DMS, not EDM.
The idea is that documents have to be managed, and DarkCyber assumes that most organizations do not manage their content — regardless of its format — particularly well until the company is involved in a legal matter. Then document management becomes the responsibility of the lawyers.
In order to do any type of document or content management, employees have to follow the rules. The rules are the underlying foundation of the article. A company manufacturing interior panels for an automaker will have to have a product management system, a system to deal with drawings (paper and digital), supplier data, and other bits and pieces to make sure the “door cards” are produced.
The problem is that guidelines often do not translate into consistent employee behavior. One big reason is that the guidelines don’t fit into the workflows, and the incentive schemes do not reward the time and effort required to make sure the information ends up in the “system.” Many professionals write something, text it, and move on. Enterprise systems typically do not track fine-grained information very well.
Like enterprise search, the “document management” folks expect workers, who may be preoccupied with becoming redundant, a sick child, an angry boss, or any other perturbation in the consultant’s checklist, to follow many information rules. Those workers ignore them.
There is an association focused on records management. There are companies concerned with content management. There are vendors who focus on images, videos, audio, and tweets.
The myth that an EDM, ECM, or enterprise search system can create an affordable, non-invasive, legally compliant, and effective way to deal with the digital fruitcake in organizations is worth lots of money.
The problem is that these systems, methods, guidelines, data lakes, federation technologies, smart software, etc. etc. don’t work.
The article does a good job of explaining what a consultant recommends. The information it presents provides fodder for the marketing animals who are going to help sell systems, training, and consulting.
The reality is that humans generate information and use a range of systems to produce content. Tweets about a missed shipment sent from a personal mobile phone may be prohibited. Yeah, explain that to the person who got the order in the door and kept the commitment to the customer.
There are conferences, blogs, consulting firms, reports, and BrightPlanet videos about managing information.
The write up states:
There is no use documenting and managing poor workflows, processes, and documentation. To survive in business, you have to adapt, change and improve. That means continuously evaluating your business operations to identify shortfalls, areas for improvements, and strengths for continuous investment. Regular internal audits of your management systems will enable you to evaluate the effectiveness of your Enterprise Document Management solution.
Right. When these silver bullet, pie-in-the-sky solutions cost more than budgeted, employees quit using them, and triage costs threaten the survival of the company — call in the consultants.
Today’s systems do not work with the people actually doing information creation. As a result, most fail to deliver. Sound familiar? It should. You, gentle reader, will never follow the information rules unless you are specifically paid to follow them or given an ultimatum like “do this or get fired.”
Tweet that and let me know if you managed that information.
Stephen E Arnold, March 3, 2020
After Decades of Marketing Chaff, Data Silos Thrive
March 2, 2020
Here’s another round of data silo baloney—“Top 4 Ways to Eliminate Data Fragmentation Within Your Organization” from IT Brief. Surveys have found that many businesses are not making the most of all that data they’ve been collecting, and it has become common to blame data silos. It is true that some organizations could store and access their data more efficiently. There’s just one problem, and it is one we have mentioned before—there are some very good reasons to keep some data fragmented. Silos exist because of things like government requirements, legal processes, sensitive medical data, experts protecting their turf, and basic common sense.
The article asserts:
“Many organizations are finding it difficult to extract meaningful value from their data due to one endemic problem: mass data fragmentation. With mass data fragmentation, data volumes continue to rise exponentially, but companies struggle to manage that data because it’s scattered across locations and infrastructure silos, both in on-premises data centers and in the cloud. Organizations often don’t know what data exists, where it is and whether it’s being stored securely and in compliance with regulations.”
Of course, entities must ensure data is stored securely and that they comply with regulations. Also, the write-up’s advice to keep redundancies to a minimum and to understand how one’s data is stored and accessed in the cloud is sound. However, the exhortation to eliminate silos entirely is off the mark; trying to do so can be a fruitless exercise in expense and frustration.
Why?
- A person wants to hoard his or her information
- Rules or regulations prevent sharing with those “not in the fox hole”
- Lawyers and HR professionals don’t want legal documents available, and “people” managers definitely do not want employee health and salary data flying around like particles in Brownian motion.
Net net: Reality has silos. Accept it. Omit the marketing silliness.
Stephen E Arnold, March 2, 2020
GraphQL: The Future Five Years Later
February 28, 2020
GraphQL is “a query language for APIs and a runtime for fulfilling those queries with your existing data.” The technology allegedly was a result of Facebook’s technical wizardry in 2012. The digital information weapon vendor released GraphQL as open source in 2015. You can get insights, links, and techno babble on the GraphQL Foundation Web site.
DarkCyber noted that Hasura snagged about $10 million to make GraphQL easier to use. The story appeared in TechCrunch on February 26, 2020. Is Hasura a frillback pigeon?
Or is the company one of those lovable creatures found in Washington Square Park in the spring?
As it turns out, GraphQL is enjoying a mini boomlet in the database universe. There are companies supporting the GraphQL Facebook innovation, plus others like IBM and the PR world’s fave Twitter.
However, there are other companies in the “graph” business, along with another dozen or so innovators.
Altexsoft asserts that GraphQL is good for complex systems. (A minimal request sketch appears after the list of upsides.) Other upsides include:
- Retrieves data with a single call
- Delivers just what’s needed
- Permits validation and type checks
- Auto generates API documentation
- Supports rapid application prototyping (the move fast and break things approach perhaps?)
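Here is the promised request sketch: one POST carries a query naming exactly the fields the client wants, which is the “single call, just what’s needed” pitch in practice. The endpoint URL and schema fields are hypothetical.

```python
# A minimal sketch of a GraphQL request: one POST, one query that names
# exactly the fields the client wants. The endpoint and schema are hypothetical.
import requests

query = """
query RecentOrders($count: Int!) {
  customer(id: "42") {
    name
    orders(last: $count) {
      id
      total
      status
    }
  }
}
"""

response = requests.post(
    "https://api.example.com/graphql",  # hypothetical endpoint
    json={"query": query, "variables": {"count": 5}},
    timeout=10,
)

# Note the downside list below: the transport status is commonly 200 even when
# the "errors" key is populated, so check the body, not just the status code.
payload = response.json()
print(payload.get("data"), payload.get("errors"))
```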
There are some downsides; for example:
- Complexity
- Performance
- The ever helpful HTTP status code of 200, returned even for errors (helpful indeed)
- Complexity (Oh, sorry, I mentioned that).
Now back to the TechCrunch story about Hasura. The reason the company was funded may relate to the firm’s unique selling proposition: Our approach makes GraphQL easy.
Will easy sell? Worth watching in order to determine what breed of pigeon is flying through disparate sets of big data.
Stephen E Arnold, February 28, 2020
NoSQL DBMS: A Surprising Inclusion
February 12, 2020
“Top Databases Used in Machine Learning Project” is a listicle. The information in the write up is similar to the lists of “best” products whipped up by Silicon Valley-type publications, mid-tier consulting firms (a shade off the blue-chip outfits like McKinsey, Booz, and BCG), and 20-somethings fresh from university.
The interesting inclusion in the list of DBMS is?
If you said Elasticsearch, you would be correct. Elasticsearch is an open-source play doing business as Elastic. The open-source version is at its core a search and retrieval system. (Does this mean the index is the data and the database?)
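A minimal sketch of why the question arises: Elasticsearch stores JSON documents in an index and answers queries against that same index, so the index does double duty as the data store. The index name and fields are invented, and the call style follows recent versions of the official elasticsearch Python client.

```python
# A minimal sketch of why "is the index the database?" comes up: documents go
# into an index, and queries run against that same index. Index name and
# fields are invented; call style follows recent elasticsearch-py releases.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# "Insert" a record: index a JSON document.
es.index(
    index="ml-experiments",
    id="run-001",
    document={"model": "gradient_boosting", "auc": 0.87, "notes": "baseline features"},
)
es.indices.refresh(index="ml-experiments")

# "Select" records: run a full-text query against the same index.
hits = es.search(
    index="ml-experiments",
    query={"match": {"notes": "baseline"}},
)
for hit in hits["hits"]["hits"]:
    print(hit["_id"], hit["_source"]["auc"])
```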
DarkCyber is not going to get into a discussion of whether an enterprise search system can be a database management system. Both sides in the battle are less interested in resolving the fuzzy language than making sales.
Maybe Elasticsearch is just doing what other enterprise search systems have done since the 1980s? Vendors describe search and retrieval as the solution to the world’s data management Wu Flu.
Net net: Without boundaries, why make distinctions? Just close the deal. Distinctions are irrelevant for some business tasks.
Stephen E Arnold, February 12, 2020