Computational Constraints: Big Data Are Big
July 8, 2015
Navigate to “Genome Researchers Raise Alarm over Big Data.” The point of the write up is that “genome data will exceed the computing challenges of YouTube and Twitter.” This may be a surprise to the faux Big Data experts. The write up points out:
… they [computer wizards] agree that the computing needs of genomics will be enormous as sequencing costs drop and ever more genomes are analyzed. By 2025, between 100 million and 2 billion human genomes could have been sequenced, according to the report, which is published in the journal PLoS Biology. The data-storage demands for this alone could run to as much as 2–40 exabytes (1 exabyte is 10^18 bytes), because the number of data that must be stored for a single genome are 30 times larger than the size of the genome itself, to make up for errors incurred during sequencing and preliminary analysis.
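The arithmetic behind an estimate like this can be roughed out in a few lines. The sketch below is not the report's method. It assumes a human genome of about 3.2 billion bases and a hypothetical packing of 2 bits per base, then applies the 30-fold storage multiplier from the quote. The function name, the constants, and the two genome counts are illustrative inputs.

```python
# Back-of-envelope storage estimate for sequenced human genomes.
# Assumptions (not taken from the report): ~3.2 billion bases per genome
# and 2 bits per base (A/C/G/T packed, no quality scores or metadata).
# The 30x multiplier is the figure given in the quoted passage.

BYTES_PER_EXABYTE = 10**18   # 1 exabyte is 10^18 bytes
GENOME_BASES = 3.2e9         # approximate length of a human genome, in bases
BITS_PER_BASE = 2            # assumed encoding
STORAGE_MULTIPLIER = 30      # stored data vs. size of the genome itself


def storage_exabytes(genomes,
                     bases=GENOME_BASES,
                     bits_per_base=BITS_PER_BASE,
                     multiplier=STORAGE_MULTIPLIER):
    """Return the estimated storage, in exabytes, for `genomes` genomes."""
    bytes_per_genome = bases * bits_per_base / 8
    return genomes * bytes_per_genome * multiplier / BYTES_PER_EXABYTE


if __name__ == "__main__":
    for count in (100_000_000, 2_000_000_000):
        print(f"{count:,} genomes: about {storage_exabytes(count):.1f} exabytes")
```

Under these assumptions, 100 million genomes come to about 2.4 exabytes and 2 billion genomes to about 48 exabytes, the same order of magnitude as the range in the quote. A less compact encoding, such as one byte per base, would multiply both figures by four.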
The write up suggests restraint until computing resources are sufficiently robust and affordable:
Nevertheless, Desai [an expert] says, genomics will have to address the fundamental question of how much data it should generate. “The world has a limited capacity for data collection and analysis, and it should be used well. Because of the accessibility of sequencing, the explosive growth of the community has occurred in a largely decentralized fashion, which can’t easily address questions like this,” he says. Other resource-intensive disciplines, such as high-energy physics, are more centralized; they “require coordination and consensus for instrument design, data collection and sampling strategies”, he adds. But genomics data sets are more balkanized, despite the recent interest of cloud-computing companies in centrally storing large amounts of genomics data.
Will the reality of Big Data increase awareness of the need for Little Data; that is, trimmed sets? Nah, probably not.
Stephen E Arnold, July 8, 2015