Archiving Genomic Data: A Proposal
One of the big problems facing whole-genome research efforts these days is archiving. A single experiment can generate a terabyte or more of data, and while hard drives store it all conveniently in the short term, they're a poor medium for handing down the scientific heritage of mankind.
The problem is twofold: digital storage technology changes constantly, and many formats that were sold as “archival” have since turned out to be alarmingly perishable. If you put a home movie on DVD just five years ago, it might be unplayable already. Files I saved to floppy disk in the 1990s are likely lost forever, and computer programs written in the 1980s on a VAX minicomputer are essentially unrecoverable.
Considering the importance of research data, we can’t just keep tossing it onto the medium du jour and expect future generations of scientists to be able to use it. Unfortunately, nobody’s come up with a better plan. Right now, taxpayers and corporate investors are spending huge sums of money and betting the future of medicine on genomic science, but there’s absolutely no guarantee the field’s results will be accessible more than a few years into the future. That’s unconscionable.
This has bothered me for a while, but I didn’t have any better ideas. Now I think I might have stumbled onto one.
A couple of weeks ago, I opened an archive of research data that my grandfather had saved more than 70 years ago. It was about 500 gigabytes of information, saved on a cheap, widely available medium, and it hadn’t been stored very carefully; the shoebox-size collection had kicked around in various family attics and basements for its entire life.
Nonetheless, using a fully mature technology, I recovered enough of the data to determine that it needed to be handed over to a professional curator, and that’s exactly where it’s heading. There’d been some degradation, but with existing techniques it would be entirely possible to recover nearly all of the information as it was originally stored.
What was this amazing high-density archival data storage medium? Film.
The movies from my grandfather were shot on 16mm Kodak Safety Film, which has since been supplanted by film types with much longer archival lifespans. Movie film also comes in a 35mm size, which is the main format Hollywood has used throughout the history of cinema. There’s a vast infrastructure built around recording, handling, saving, displaying, and restoring film, and we know from experience - not “simulated aging” experiments or manufacturers’ hype, but actual experience - that stored film lasts at least a century.
Indeed, even the film industry has recently realized just how cheap, reliable, and dense film is as a data storage medium. In a recent report, the Academy of Motion Picture Arts and Sciences pointed out that the storage costs for digital films are drastically higher than those for celluloid, and warned that relying on digital archiving jeopardizes the movie industry’s future.
How much data can we store on film? A back-of-the-envelope calculation shows that a frame of 35mm film holds about 60 MB. There are 16 frames per foot of cinema film, and a projector operating at 24 frames per second shows 1,000 feet in about 11 minutes. In other words, if you leave the theater to get a bag of popcorn, you’ll probably miss a terabyte or so. To produce a two-hour Hollywood epic, a director shoots each scene at least a few times, along with many scenes that never make it into the final cut. Add it all up, and a single copy of a single movie’s archive is about two petabytes.
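For anyone who wants to check the arithmetic, here’s a minimal sketch in Python that reproduces these back-of-the-envelope figures; the 60 MB-per-frame number is simply taken from the estimate above, not measured:

```python
# Film-capacity figures from the paragraph above (all values are the
# article's own assumptions, not measurements).
MB_PER_FRAME = 60          # assumed data content of one 35mm frame
FRAMES_PER_FOOT = 16       # standard 4-perf 35mm film
FPS = 24                   # standard projection speed
REEL_FEET = 1000           # one standard reel

frames_per_reel = REEL_FEET * FRAMES_PER_FOOT           # 16,000 frames
reel_minutes = frames_per_reel / FPS / 60               # ~11.1 minutes
reel_terabytes = frames_per_reel * MB_PER_FRAME / 1e6   # ~0.96 TB

feature_minutes = 120
feature_reels = feature_minutes / reel_minutes          # ~10.8 reels
feature_terabytes = feature_reels * reel_terabytes      # ~10 TB for the final cut alone

print(f"One 1,000-foot reel: {reel_minutes:.1f} min, {reel_terabytes:.2f} TB")
print(f"Two-hour final cut: {feature_reels:.1f} reels, {feature_terabytes:.1f} TB")
```

The two-petabyte figure for a full archive then comes from everything shot but never shown: multiple takes, deleted scenes, and alternate versions on top of that roughly 10-terabyte final cut.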
Storing all that information requires nothing more than a cool, dry place. In Hollywood, archivists favor large vaults built inside abandoned mines. Park your film there, and twenty or fifty or eighty years later you can pull it out, dust it off, and show it again. Can your hard drive do that?
Genomics researchers could tap into this amazing archiving infrastructure pretty easily. First, agree on a standard for encoding uncompressed genomic data as color images. Next, transfer the data to film with a film recorder, and send it for processing and storage. Whenever needed, use a high-resolution telecine or film scanner to recover the data back to a hard drive (or whatever data storage device people are using in the year 2100). There’s no need to reinvent the wheel: off-the-shelf filmmaking equipment could handle these transfers with little or no modification.
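To make that first step concrete, here’s a purely hypothetical sketch in Python (using numpy and Pillow) of what such an encoding might look like: raw bytes packed three per pixel into RGB frames at an assumed 4K frame size, with a matching decoder. A real standard would also need error correction, calibration targets, and frame indexing, none of which is shown here.

```python
# Hypothetical encoding sketch: pack a raw byte stream into color image
# frames suitable for film-out. Frame size, 8 bits per channel, and the
# absence of error correction are simplifying assumptions, not a standard.
import numpy as np
from PIL import Image

FRAME_W, FRAME_H = 4096, 3072            # assumed per-frame record/scan resolution
BYTES_PER_FRAME = FRAME_W * FRAME_H * 3  # 3 bytes per pixel (R, G, B)

def encode_to_frames(data: bytes) -> list[Image.Image]:
    """Split a byte stream into RGB frames, zero-padding the last frame."""
    frames = []
    for start in range(0, len(data), BYTES_PER_FRAME):
        chunk = data[start:start + BYTES_PER_FRAME].ljust(BYTES_PER_FRAME, b"\x00")
        pixels = np.frombuffer(chunk, dtype=np.uint8).reshape(FRAME_H, FRAME_W, 3)
        frames.append(Image.fromarray(pixels, mode="RGB"))
    return frames

def decode_from_frames(frames: list[Image.Image], length: int) -> bytes:
    """Reassemble the byte stream from recovered frames, trimming the padding."""
    chunks = [np.asarray(f.convert("RGB"), dtype=np.uint8).tobytes() for f in frames]
    return b"".join(chunks)[:length]

if __name__ == "__main__":
    payload = b"ACGT" * 1_000_000                 # stand-in for real sequence data
    frames = encode_to_frames(payload)
    assert decode_from_frames(frames, len(payload)) == payload
    print(f"{len(payload)} bytes -> {len(frames)} frame(s)")
```

At the assumed resolution, each frame carries roughly 38 MB, so the round trip from bytes to frames and back is the easy part; the hard part, agreeing on one documented standard and sticking to it, is exactly what the paper documentation described below is for.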
Once the system is set up, document it fully and print the description on paper. Store several copies of that documentation with the film, and we could be nearly certain that today’s data would remain recoverable for a very, very long time.