Decoding ENCODE

Today, a scientific collaboration called the Encyclopedia of DNA Elements (ENCODE) published some of its data. When I say “collaboration,” I mean more than 400 scientists working in 32 different labs, and when I say “some of its data,” I mean over 1,600 experiments involving 24 types of analyses on 147 cultured cell lines.

A typical ENCODE experimental result. Links to the original figure.

ENCODE didn’t publish this massive data set in a paper. They published it in 30 papers that came out simultaneously in three journals, plus additional commentary elsewhere. To help people make sense of this information glut, Nature, the main publisher, set up a special web page where all of the papers are freely available, released an iPad app that lets users explore the results through different “threads” of inquiry, and held a press conference that featured several of the consortium’s principal researchers as well as an interpretive dance performance inspired by the results. Yes, really.

As regular readers know, I’m always ready to call out publishers who engage in excessive hype. In this case, though, I think Nature’s hoopla is entirely appropriate. This is a $185 million project that’s trying to figure out how humans work at a molecular level, and the current batch of publications presents both a rough sketch of an answer and a whole new list of big questions.

Most science news stories on ENCODE will probably begin and end with an observation about “junk DNA,” and how the new data apparently overturn the notion that most of the human genome is just taking up space. Perhaps acknowledging that this is the most easily digested result, the press materials and many of the commentary articles highlight it. Molecular biologist Joseph Ecker puts it this way in his synopsis:

One of the more remarkable findings … is that 80% of the genome contains elements linked to biochemical functions, dispatching the widely held view that the human genome is mostly ‘junk DNA.’ The authors report that the space between genes is filled with enhancers (regulatory DNA elements), promoters (the sites at which DNA’s transcription into RNA is initiated) and numerous previously overlooked regions that encode RNA transcripts that are not translated into proteins but might have regulatory roles.

But neither that result nor any other individual piece of the data is really the main point. What matters about ENCODE is the totality of it, and what the scale of the data says about the future of biology.

When the Human Genome Project released its draft sequence 11 years ago, it was a bit like Deep Thought reporting that the answer was in fact 42. By itself, the genome sequence told us that we only had about 20,000 genes, and that most of our DNA didn’t look like it had any function at all. There was obviously a lot more going on than we’d be able to glean just by looking at the sequence.

ENCODE is a follow-up project, in which researchers used a huge variety of techniques to probe the functions of all of the parts of our DNA, not just the segments that contain obvious genes. They looked for enhancers that can control the expression of genes in other parts of the genome. They screened all of the RNA in cells to find new pieces of micro-RNA, a type of gene-controlling molecule we didn’t even know about when I went to graduate school. They tested which parts of the genome were wrapped up in chromatin, a sort of deep storage system, and which were open for business in different types of cells. And on and on. In short, they examined what every piece of the genome was doing under as many different conditions as they could.

Besides finding that most of the genome is probably doing something to earn its keep, ENCODE has illuminated the scope of the problem biologists now face. It’s huge.

A graphic accompanying Brendan Maher’s excellent news feature on the project shows what ENCODE has accomplished so far, and how much work remains just to finish its initial phase. For example, the investigators have looked at only 120 of an estimated 1,800 transcription factors, proteins that control gene expression directly, and they’ve only looked at those factors in a subset of the cell lines they set out to study. That one snippet of the work produced a massive amount of information by itself.

Even after ENCODE finishes, what we’ll have will be more of a pamphlet than an encyclopedia. Cultured human cell lines are a great tool for laboratory studies, but they only partly mimic the behavior of the cells that make up a real human, which in turn vary from person to person and within a single person over time. ENCODE is giving us a two-dimensional view of a system that’s at least five-dimensional. That’s not to minimize the project; the team has made astonishing progress, but it’s just a start.

After doing the rest of the cultured cell experiments, biologists will have to figure out the results, which raises a whole new problem. I can’t tell you what all of the ENCODE data mean. Neither can the people who generated them. Besides the 30 new papers (and their supplementary online sections), the project has also produced databases, software, and other analytical tools so scientists can dive into the results directly. The conclusions of the new papers are just the bits that the experimenters thought were most interesting. As happened with the human genome sequence, people will be digging new publications out of these data for years.

Right now, we’re like astronomers looking at millions of smudges of light we can see with a new telescope, and it’s just dawning on us that those aren’t stars. They’re galaxies.

ENCODE is also part of a trend that’s raising tough ancillary questions for scientists and science publishers. Though its principal investigators undoubtedly see the project as worthwhile, $185 million is a lot of money, and the reality of government-sponsored science is that it’s a zero-sum game. Despite what some big science proponents claim, funding for consortium-based “factory research” studies such as this necessarily comes at the expense of individual investigator-led projects. In an environment where thousands of promising young researchers are scrambling for grants, can we be sure this was the best way to spend those funds?

From the publishers’ perspective, big science is fraught with disputes over credit, concerns about oversight and data integrity, and fundamental questions regarding the proper length and format for a paper. It’s not even clear that a project like this should be published in a conventional journal; perhaps the data should simply go online, accompanied (or not) by a few comments from the lead scientists. As the ENCODE juggernaut keeps rolling along, and as subsequent, even bigger projects follow it, it might not even be possible to crank out papers for each new batch of work.

But this, too, is an expected result. This is what science does: uses what’s possible to redefine what’s possible. The ability to sequence a gene becomes the ability to sequence a genome becomes the ability to sequence a thousand genomes. When our minds can’t accommodate the new information, we’ll just have to expand them.


Genetic Privacy: An Inalienable Right?

Congresswoman Louise Slaughter (D-NY) has a thought-provoking editorial in the 11 May issue of Science:

The Genetic Information Nondiscrimination Act (GINA) languished in past Congresses for 12 years. But finally, new leadership in the House of Representatives has given the bill its best chance to become law since its introduction in 1995. On 25 April, GINA passed the House by a vote of 420 to 3. The act will prohibit health insurers from denying coverage or charging higher premiums to a healthy individual solely because they possess a genetic predisposition to develop a disease in the future. It will also bar employers from using genetic information in hiring, firing, job placement, or promotion decisions.

She goes on to argue that genetic discrimination is a real and insidious danger, and that the new legislation is critical to stopping it. While I do believe that people are already facing uninformed discrimination on the basis of primitive, misinterpreted genetic tests, I’m not convinced that we should have an inalienable right to keep our genetic information secret from insurers and employers. Currently, insurers can ask if I have a family history of, say, cancer or diabetes. That’s genetic information. So if a test comes along that makes solid predictions from my actual DNA sequence, rather than a potentially flawed inference from vague family history data, why am I suddenly allowed to keep that a secret?

More Money for Cheap Genomes

The National Human Genome Research Institute, one of the National Institutes of Health, today announced two efforts to accelerate the development of cheaper whole-genome sequencing. One announcement, which had been anticipated, was that the NHGRI has awarded several new grants, totaling about $13 million, to researchers developing the next generation of genome sequencing technologies. The other development is a new $10 million reward offering from the X Prize Foundation, the folks who are famous for trying to bring space flight to the masses – or at least to the merely rich.

The focus of the grants and the prize is the much-touted “thousand dollar genome,” which would enable anyone with a grand to get his or her entire genome sequenced. Naive readers might now ask “why would anyone want to do that?” I’m afraid I don’t have a good answer.

The NHGRI has been hyping the thousand-dollar genome idea for a few years now, ever since its primary reason for existence ended. NHGRI was created to compile a single “average” human genome sequence from several samples, a job it accomplished in a highly publicized race with privately funded sequencers. When new companies and government organizations get created to accomplish a single, finite job, there’s always a certain amount of awkwardness when the job is done (see also: NORAD). So we sequenced the genome, now … let’s sequence everyone’s genomes for less money.

A thousand-dollar genome sequencing platform is another achievable goal – really a tricky engineering problem rather than a fundamental scientific one – and it’s pretty clear that cheap sequencing will eventually happen, even if, as Rob Carlson argues, it might take longer than expected to get there. But what’s the killer application for this new toy?

Genome sequencing fans will argue that nobody had a clear application for the first human sequence until it was done, and that this is “hypothesis-generating” research rather than hypothesis-testing. That’s fine as far as it goes, but it’s important not to overpromise such a fishing expedition.

“The ability to sequence an individual genome cost-effectively could enable health care professionals to tailor diagnosis, treatment and prevention to each person’s unique genetic profile,” says the press release. Yes, it could. Or, it might not.

The first complete human genome sequence, and the complete sequences of numerous other organisms, have indeed led to interesting new basic research strategies. I have no doubt that bringing down the price of genome sequencing will yield similar benefits for basic science.

But the clinical applications are much murkier. With our current understanding, a complete genome sequence would be a bit like a whole-body MRI scan on a healthy person: interesting, but not especially informative. That’s because we know so little about what our genes are doing, or how the information in them changes in different cellular contexts. Scientists can’t even agree on which parts of the genome are doing anything at all. A few genetic diseases are linked to changes in single genes, but those tend to be rare and relatively easy to identify. Routine whole-genome sequencing in the doctor’s office, in contrast, will give patients a deluge of information long before anyone has a clue how to interpret it.

This could go in a lot of strange and disturbing directions. Patients might even start lobbying their physicians to practice one of the worst forms of medicine: treating a test result rather than a disease. “Ask your doctor about Inhibistatex, the newest inhibitor of gene X.”

Predictably, it took me several tries to generate a drug name in that last sentence that wasn’t already in use. Where’s Philip K. Dick now that we need him?