Audio Data Set: Start Your AI Engines
August 16, 2019
Machine learning projects have a new source of training data. BoingBoing announces the new “Open Archive of 240,000 Hours’ Worth of Talk Radio, Including 2.8 Billion Words of Machine-Transcription.” A project of the MIT Media Lab, RadioTalk holds a wealth of machine-generated transcriptions of talk radio broadcasts aired between October 2018 and March 2019. Naturally, the text is all tagged with machine-readable metadata. The team hopes their work will enrich research in natural language processing, conversational analysis, and social sciences. Writer Cory Doctorow comments:
“I’m mostly interested in the social science implications here: talk radio is incredibly important to the US political discourse, but because it is ephemeral and because recorded speech is hard to data-mine, we have very little quantitative analysis of this body of work. As Gretchen McCulloch points out in her new book on internet-era language, Because Internet, research on human speech has historically relied on expensive human transcription, leading to very small corpuses covering a very small fraction of human communication. This corpus is part of a shift that allows social scientists, linguists and political scientists to study a massive core-sample of spoken language in our public discourse.”
The metadata attached to these transcripts includes information about geographical location, speaker turn boundaries, gender, and radio program information. Curious readers can access the researchers’ paper here (PDF).
Cynthia Murrell, August 16, 2019
Hadoop Fail: A Warning Signal in Big Data Fantasy Land?
August 11, 2019
DarkCyber notices when high-profile companies talk about data federation, data lakes, and intelligent federation of real-time data with historical data. Examples include Amazon and Anduril, to name two companies offering this type of capability.
“What Happened to Hadoop and Where Do We Go from Here?” does not directly discuss the data management systems in Amazon and Anduril, but the points the author highlights may be germane to thinking about what is possible and what remains just out of reach when it comes to processing the rarely defined world of “Big Data.”
The write up focuses on Hadoop, the elephant logo thing. Three issues are identified:
- Data provenance was tough to maintain and, therefore, to determine. This is a variation on the GIGO theme (garbage in, garbage out).
- Creating a data lake is complicated. With talent shortages, the problem of complexity may hardwire failure.
- The big pool of data becomes the focus. That’s okay, but the application to solve the problem is often lost.
Why is a discussion of Hadoop relevant to Amazon and Anduril? The reason is that, despite the weaknesses of these systems, both companies are addressing the “Hadoop problem,” each in its own way.
These two firms, therefore, may be significant because of their different angles of attack.
Amazon is providing a platform which, in the hands of a skilled Amazon technologist, can deliver a cohesive data environment. Furthermore, the digital craftsman can build a solution that works. It may be expensive and possibly flakey, but it mostly works.
Anduril, on the other hand, delivers federation in a box: a hardware product, smart software, and applications. License, deploy, and use.
Despite the different angles of attack, both companies are making headway in the data federation, data lake, and real time analytics sector.
The issue is not what will happen to Hadoop; the issue is how quickly competitors will respond to these different ways of dealing with Big Data.
Stephen E Arnold, August 11, 2019
MarkLogic: A NoSQL Vertical Jump for More Revenue?
July 24, 2019
Is NoSQL database platform firm MarkLogic emulating Dialog Information Services and LexisNexis, or is it making vertical plays for quirky, controversial niche markets like drug and medical device specific services? MarkLogic’s push into other verticals like professional publishing and finance has not generated the type of buzz and revenue that other Silicon Valley firms have sparked. Maybe pharma is the key which will unlock massive returns for the stakeholders? MarkLogic has resisted the type of acquisition and repositioning play that kCura executed in eDiscovery. Perhaps pharma, a sector whose revenue grows as the number of global players shrinks, is the answer?
The company announced the MarkLogic Pharma Research Hub, created to bring the power of federated search to the field of pharmaceutical R&D. The product description tells us:
“For pharmaceutical companies, the discovery of new molecules and the cost of developing a successful medicine can take up to 15 years and $2.6 billion — slowing potentially life-saving drugs from getting to the patients who need them and resulting in abandonment of drug trials when faced with potential failure. In this industry, even small improvements to streamline R&D processes can lead to substantially higher revenue and lower costs. To achieve those goals, pharmaceutical companies need to leverage their massive data assets that include decades of research and clinical trial data. The challenge is that researchers are often unable to access the information they need. And, even when data does get consolidated, researchers find it difficult to sift through it all and make sense of it in order to confidently draw the right conclusions and share the right results.”
The product announcement elaborated:
“The main challenge facing IT departments that serve pharma R&D is patchwork infrastructure that creates the data silos that isolate and restrict access to data. Pharmas need to leverage massive data sets, including decades of research and clinical trials information.”
In addition, we’re reminded, disparate data silos hamper collaboration, upon which researchers rely heavily. The announcement goes on to outline the platform’s advanced features: the ability to load any pharmaceutical data set, relationship visualizations and discovery, and customizable search results. Naturally, these functions are made possible by machine-learning AI.
Founded in 2001 as Cerisent, MarkLogic is based in San Carlos, California, with several offices in the U.S. and in Europe. After changing its name, it released Version 1 of its platform in 2003. The company has ingested more than $170 million in venture funding. The firm has probed the intelligence sector and marketed itself as an enterprise search solution. But revenues? MarkLogic is a privately held firm just 18 years young.
Cynthia Murrell, July 24, 2019
A Partial Look: Data Discovery Service for Anyone
July 18, 2019
F-Secure has made available a Data Discovery Portal. The idea is that a curious person (not anyone on the DarkCyber team, but one of our contractors will be beavering away today) can “find out what information you have given to the tech giants over the years.” Pick a service — for example, Apple.
A curious person plugs in the Apple ID information and F-Secure obtains and displays the “data.” If one works through the services for which F-Secure offers this data discovery service, the curious user will have provided some interesting data to F-Secure.
Sound like a good idea? You can try it yourself at this F-Secure link.
F-Secure operates from Finland and was founded in 1988.
Do you trust the Finnish antivirus wizards with your user names and passwords to your social media accounts?
Are the data displayed by F-Secure comprehensive? Filtered? Accurate?
Stephen E Arnold, July 18, 2019
A Reminder about Deleting Data
July 15, 2019
If you believe data are deleted, you may want to take a deep breath and read “Good Luck Deleting Someone’s Private Info from a Trained Neural Network – It’s Likely to Bork the Whole Thing. Researchers Show Limited Success in Getting Rid of Data.”
With a title like this, there’s not much left to say. We did note this one cautious quote:
Zou [a whiz researcher] said it isn’t entirely impossible, however. “We don’t have tools just yet but we are hoping to develop these deletion tools in the next few months.”
Will there be such tools? I have been stumbling along with databases since the 1960s, and deletes which delete are still not available.
Just a reminder that what one believes is not what happens within data management systems.
Stephen E Arnold, July 15, 2019
When Is a Deletion a Real Deletion?
June 29, 2019
Years ago we created the Point (Top 5% of the Internet). You oldsters may remember our badge, which was, for a short period of Internet time, a thing.
When we started work on the service in either 1992 or 1993, one of the people working with the team put the demo in the Paradox database. Hey, who knew that traffic would explode, and advertisers would contact us to put their messages on the site.
The Paradox database was not designed to deal with the demands we put upon it. One of its charming characteristics was that when we deleted something, the space was not reclaimed. Paradox — like many, many other databases — just removed the index pointer. The “space” and hence some charming idiosyncrasies remained.
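For the technically curious, here is a minimal Python sketch of that behavior. It is not Paradox’s actual on-disk format, just an invented illustration of a delete that drops the index pointer while the record bytes remain in the data file:

```python
# Hypothetical sketch, not Paradox's on-disk format: a "delete" that
# forgets the index pointer while the record bytes stay in the data file.
import os

class ToyTable:
    def __init__(self, path="records.dat"):
        self.path = path
        self.index = {}                      # key -> byte offset of the record
        open(self.path, "wb").close()        # start with an empty data file

    def insert(self, key, value):
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()
            f.write(f"{key}={value}\n".encode())
        self.index[key] = offset             # only the index knows where the record lives

    def delete(self, key):
        self.index.pop(key, None)            # drop the pointer; reclaim nothing

    def get(self, key):
        if key not in self.index:
            return None
        with open(self.path, "rb") as f:
            f.seek(self.index[key])
            return f.readline().decode().split("=", 1)[1].strip()

table = ToyTable()
table.insert("advertiser_42", "ACME Widgets")
table.delete("advertiser_42")
print(table.get("advertiser_42"))            # None: logically deleted
print(open("records.dat", "rb").read())      # ...but the bytes are still on disk
os.remove("records.dat")
```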
Flash forward decades. A deletion may not be a deletion. Different “databases” handle deletions in different ways. Plus, anyone with experience working with long forgotten systems, from the Information Dimensions’ system to the total weirdness of a CICS system, knows that paranoid people back up and back up as often as possible. Why? Fool with an AS/400 database at the wrong time doing something trivial and poof. Everything is gone. More modern databases? Consider this passage from the Last Pickle:
The process of deletion becomes more interesting when we consider that Cassandra stores its data in immutable files on disk. In such a system, to record the fact that a delete happened, a special value called a “tombstone” needs to be written as an indicator that previous values are to be considered deleted.
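A minimal Python sketch of the tombstone idea, invented for illustration rather than drawn from Cassandra’s code, looks like this:

```python
# Hypothetical sketch of tombstone-style deletion in an append-only store.
# An illustration of the idea, not Cassandra's actual implementation.

TOMBSTONE = "<deleted>"

log = []  # immutable, append-only sequence of (key, value) writes

def put(key, value):
    log.append((key, value))

def delete(key):
    # Nothing is removed; a special marker is appended instead.
    log.append((key, TOMBSTONE))

def get(key):
    # The latest entry wins; a tombstone means "treat as deleted".
    for k, v in reversed(log):
        if k == key:
            return None if v == TOMBSTONE else v
    return None

put("user:101", "last_seen=2019-06-01")
delete("user:101")
print(get("user:101"))   # None: logically deleted
print(log)               # ...yet the original value is still sitting in the log
```

The “deleted” value is still sitting in the log; only a later compaction, if and when it runs, physically removes it.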
When one digs around in database files, it is possible to come across these deleted data. People are amazed when a Windows file can be recovered. Yep, deletions don’t explain exactly what has been “deleted” and the conditions under which the data can be undeleted. Deletion allows one to assume one thing when the data have been safely archived, converted to tokens, or munged into a dossier.
Put these two things together and what do you get? A minimum of two places to look for deleted data. Look in the database files themselves, and look in backups.
In short, deleted data may not be deleted.
How does one know if data are “there”? Easy. Grunt work.
Why is this journey to the world of Paradox relevant?
Navigate to “Google Now Lets Users Auto-Delete Their Location and Web History.” Note this passage:
Specifically, Google account holders will be able to choose a time limit of either 3 or 18 months, after which, their location, web, and app history will automatically be deleted.
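Before the questions, consider a minimal, hypothetical sketch (the table and index names are invented, not Google’s) of how an “auto-delete after N months” purge can clear a primary history table while leaving derived indexes and backups untouched:

```python
# Hypothetical sketch: a retention purge that clears the primary history table
# but never touches the structures derived from the same events.
from datetime import datetime, timedelta

RETENTION = timedelta(days=90)   # the "3 month" setting
now = datetime(2019, 7, 1)

history = [
    {"user": "u1", "query": "pizza near me", "ts": datetime(2019, 1, 15)},
    {"user": "u1", "query": "flights to Helsinki", "ts": datetime(2019, 6, 20)},
]
# Derived structures built from the same events:
query_index = {"pizza near me": ["u1"], "flights to Helsinki": ["u1"]}
backup = list(history)

def purge(rows, retention, now):
    return [r for r in rows if now - r["ts"] <= retention]

history = purge(history, RETENTION, now)

print(len(history))        # 1: the old query is gone from the primary table
print(query_index)         # ...but it survives in the derived index
print(len(backup))         # ...and in the backup
```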
Some questions?
- Who verifies that the content has been removed from indexes and data files?
- Who verifies that the data have been expunged from metadata linked to the user?
- What does deletion mean as the word is used by Google?
- From what has something been deleted?
Like hotel temperature controls, fiddling with the knobs may change nothing.
Stephen E Arnold, June 29, 2019
Data Science Book: Free for Now
May 24, 2019
We spotted a post by Capri Granville which points to a free data science book. The post also provides a link to other free books. The book is “Foundations of Data Science” by Avrim Blum, John Hopcroft, and Ravindran Kannan of Microsoft Research India. You can, as of May 24, 2019, download the book without charge at this link: https://www.cs.cornell.edu/jeh/book.pdf. Cornell charges students about $55,188 for an academic year. DarkCyber believes that “free” may not be an operative word where the Theory Center used to love those big IBM computers. No, they were not painted Azure.
Stephen E Arnold, May 24, 2019
IBM Hyperledger: More Than a Blockchain or Less?
May 17, 2019
Though the IBM-backed open-source project Hyperledger has been prominent on the blockchain scene since 2016, The Next Web declares, “IBM’s Hyperledger Isn’t a Real Blockchain—Here’s Why.” Kadena president Stuart Popejoy, the author of the piece, tells us:
“A blockchain is a decentralized and distributed database, an immutable ledger of events or transactions where truth is determined by a consensus mechanism — such as participants voting to agree on what gets written — so that no central authority arbitrates what is true. IBM’s definition of blockchain captures the distributed and immutable elements of blockchain but conveniently leaves out decentralized consensus — that’s because IBM Hyperledger Fabric doesn’t require a true consensus mechanism at all.”
We noted this statement as well:
“Instead, it suggests using an ‘ordering service’ called Kafka, but without enforced, democratized, cryptographically-secure voting between participants, you can’t really prove whether an agent tampers with the ledger. In effect, IBM’s ‘blockchain’ is nothing more than a glorified time-stamped list of entries. IBM’s architecture exposes numerous potential vulnerabilities that require a very small amount of malicious coordination. For instance, IBM introduces public-key cryptography ‘inside the network’ with validator signatures, which fundamentally invalidates the proven security model of Bitcoin and other real blockchains, where the network can never intermediate a user’s externally-provided public key signature.”
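To make the author’s point concrete, here is a minimal Python sketch, invented for illustration and not drawn from Hyperledger, Kadena, or Bitcoin code. It shows only the tamper-evidence that hash-chaining provides; the decentralized voting Popejoy says Fabric lacks is a separate mechanism. A plain time-stamped list has nothing comparable, since editing any entry leaves no trace:

```python
# Hypothetical sketch of why hash-chaining matters: chained entries make
# tampering detectable, while a plain time-stamped list accepts edits silently.
import hashlib, json, time

def entry_hash(body):
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def append(chain, record):
    prev = chain[-1]["hash"] if chain else "0" * 64
    body = {"ts": time.time(), "record": record, "prev": prev}
    body["hash"] = entry_hash({k: body[k] for k in ("ts", "record", "prev")})
    chain.append(body)

def is_valid(chain):
    prev = "0" * 64
    for e in chain:
        expected = entry_hash({"ts": e["ts"], "record": e["record"], "prev": e["prev"]})
        if e["prev"] != prev or e["hash"] != expected:
            return False
        prev = e["hash"]
    return True

ledger = []
append(ledger, "Alice pays Bob 5")
append(ledger, "Bob pays Carol 2")
print(is_valid(ledger))                     # True

ledger[0]["record"] = "Alice pays Bob 500"  # quietly rewrite history
print(is_valid(ledger))                     # False: the chain no longer checks out
```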
Then there are IBM’s approaches to architecture, security flaws, and smart contracts to consider, as well as misleading performance numbers. See the article for details on each of those criticisms. Popejoy concludes with the prediction that better blockchains are bound to be developed, along with a more positive approach to technology across society.
Cynthia Murrell, May 17, 2019
Machine Learning and Data Quality
April 23, 2019
We’re updating our data quality files as part of the run-up to my lecture at the TechnoSecurity & Digital Forensics Conference. A write up by Sanau.co is worth reading if you are thinking about how to solve some issues with the accuracy of the outputs of some machine learning systems: “Dear AI Startups: Your ML Models Are Dying Quietly.” The slow deterioration of certain Bayesian methods has been a subject I have addressed for years. The Sanau write up called to my attention another source of data deterioration or data rot; that is, seemingly logical changes made to field names and the insidious downstream consequences of these changes. The article provides useful explanations and a concrete example drawn from ecommerce, but the point has much broader application. Worth reading.
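A minimal, hypothetical Python sketch of the field-rename failure mode (the field and event names are invented, not taken from the Sanau article) shows why the damage is silent:

```python
# Hypothetical sketch: an upstream team renames a field, and the feature
# pipeline degrades quietly instead of failing loudly.

def featurize(event):
    # Written when the field was called "purchase_amount".
    # .get() with a default hides the breakage instead of raising an error.
    return [
        float(event.get("purchase_amount", 0.0)),
        float(event.get("num_items", 0)),
    ]

old_event = {"purchase_amount": 129.99, "num_items": 3}
new_event = {"order_total": 129.99, "num_items": 3}   # field quietly renamed upstream

print(featurize(old_event))   # [129.99, 3.0]  -- what the model was trained on
print(featurize(new_event))   # [0.0, 3.0]     -- no exception, just worse predictions
```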
Stephen E Arnold, April 23, 2019
The Surf Is Up for the Word Dark
April 4, 2019
Just a short note. I read this puffy wuffy write up about a new market research report. Its title?
What caught my attention is not the authors’ attempt to generate some dough via open source data collection and a touch of Excel fever.
Here’s what caught my attention:
Dark analytics is the analysis of dark data present in the enterprises. Dark data is generally is referred as raw data or information buried in text, tables, figures that organizations acquire in various business operations and store it but, is unused to derive insights and for decision making in business. Organizations nowadays are realizing that there is a huge risk associated with losing competitive edge in business and regulatory issues that comes with not analyzing and processing this data. Hence, dark analytics is a practice followed in enterprises that advances in analyzing computer network operations and pattern recognition.
Yes, buried data treasure. Now the cost of locating, accessing, validating, and normalizing these time encrusted nuggets?
Answer: A lot. A whole lot. That’s part of the reason old data are not particularly popular in some organizations. The idea of using a consulting firm or software from SAP is not particularly thrilling to my DarkCyber team. (Our use of “dark” is different too.)
Stephen E Arnold, April 4, 2019