Assembling Genomes from Scratch, at a Fraction of the Cost

June 5, 2017
Hertz Staff

From: Laboratory Equipment

When the 2015/2016 Zika virus epidemic swept through the Americas and several islands in the Pacific and Southeast Asia, researchers were urged to focus their efforts on developing treatments and vaccines to combat the virus’ effects.

A team of researchers from the Center for Genome Architecture at Baylor College of Medicine coincidentally had crucial but fragmented pieces of relevant data from previous research that could greatly impact this effort. Using a 3-D genome assembly approach referred to as Hi-C, the team was able to quickly assemble the 1.2 billion-letter genome of Aedes aegypti, the Zika-carrying mosquito, producing the first end-to-end assembly of each of its three chromosomes.

“When the Aedes mosquito came into the spotlight in relation to the Zika epidemic, we found ourselves sitting on a bunch of relevant data and proof-of-principle work,” said Olga Dudchenko, a postdoctoral fellow at The Center for Genome Architecture. “The situation prompted us to polish our methods and share the data we had.”

The development provides a much-needed boost to research and treatment options for the Zika virus by identifying vulnerabilities in the mosquito that the virus uses to spread.

With the success of assembling the Aedes genome, the Baylor team has also shown that the 3-D assembly technique can be an important tool for similar outbreaks in the future, and could also aid in personalized care for human patients suffering from a variety of diseases.


Erez Lieberman Aiden, Director of the Center for Genome Architecture, originally proposed the general idea of 3-D assembly in 2009.

Lieberman Aiden and colleagues first tested their technique in 2013 by sequencing a human genome and comparing the data with that made available by the Human Genome Project.

The team found that their assembly agreed with the reference data from the Human Genome Project with 99 percent accuracy, validating the method. Moreover, the 3-D assembly method produced these results in a fraction of the time and at significantly lower cost.

They then switched their focus to the Aedes aegypti mosquito, which is responsible for the spread of not only Zika virus but also dengue, chikungunya and yellow fever. When the Zika outbreak grew into a global health threat, the team knew they could piece together information they had acquired from previous research to create a clear, cohesive picture of the mosquito’s genome.

3-D assembly allowed the team to create the 1.2 billion-letter genome of the mosquito for about $10,000—a price comparable to that of an MRI scan.

The third phase of their research included assembling the genome of the Culex quinquefasciatus mosquito, a carrier of West Nile virus.

“Culex is another important genome to have since it is responsible for transmitting so many diseases,” said Lieberman Aiden. “Still, trying to guess what genome is going to be critical ahead of time is not a good plan. Instead, we need to be able to respond quickly to unexpected events. Whether it is a patient with a medical emergency or the outbreak of an epidemic, these methods will allow us to assemble de novo genomes in days, instead of years.”

For the Culex sequence, the researchers carried out their work with IBM’s VOLTRON, a high-performance computing (HPC) system. VOLTRON is based on the company’s Power Systems platform, which provides the scalable HPC capabilities needed to accommodate a broad spectrum of data-enabled research activities. The Power Systems platform has also been selected for use by the Department of Energy’s Oak Ridge and Lawrence Livermore National Laboratories, and by the UK government’s Science and Technology Facilities Council’s Hartree Centre.

“3-D assembly and IBM technology are a terrific combination: one requires extraordinary computational firepower, which the other provides,” said Lieberman Aiden.

Incorporated into the design of VOLTRON is a POWER and Tesla technology combination that allowed Baylor researchers to handle extreme amounts of data with incredible speed. VOLTRON comprises a cluster of four systems, each featuring a set of eight NVIDIA Tesla GPUs tuned by NVIDIA engineers to help Baylor’s researchers achieve optimum performance on their data-intensive genomic research computations.

The team also made a new discovery about these mosquito families during sequencing. They found that chromosome content did not mix much across the species, which could prove helpful in any future outbreak with a mosquito as a carrier, whether or not its genome has been sequenced.

“If you’re looking at various mammals, chromosome content will mix a lot from one species to another. But what turned out to happen in mosquitoes was a very different story. What we saw is that although things mix a lot locally (within chromosome arms), very little content jumped from one arm to another, or from one chromosome to another chromosome,” explained Dudchenko.

This fundamental fact of mosquito evolution offers immediate benefit for researchers. If another epidemic hits through a completely different mosquito carrier that researchers and health officials know nothing about, they can at least infer where to look for particular genes or targets within the new carrier because of this unique property. In the event of a new outbreak, this technology could answer questions much faster and lead to the rapid development of a treatment or vaccine, saving lives and money.

Advancing capabilities
As Dudchenko explained to Laboratory Equipment, genome assembly is a very actively developing field, providing researchers with genetic information that was previously unobtainable.

But for certain applications, assembly methods can be “suboptimal.”

Prior to the 3-D genome assembly method, it was challenging and extremely expensive to sequence a genome by starting at the beginning of a chromosome and reading all the way through to the end. Instead, researchers would use methods that read small snippets of chromosomes many times over, find overlaps in those snippets and piece them together to create longer, continuous sequences.

The problem arises, however, when repetitive fragments appear, or there’s a large amount of variation across a species.
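The overlap-and-merge idea described above can be illustrated with a toy example. This is a minimal sketch, not the software any assembly group actually uses: the function names, the `min_overlap` threshold and the example reads are all made up for illustration, and real assemblers handle sequencing errors and repeats far more carefully.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no overlap of at least `min_overlap` letters exists.
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the largest overlap.

    Hypothetical helper for illustration only; it stops when no two
    sequences overlap, leaving the assembly in separate pieces.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining overlaps: the assembly stays fragmented
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three made-up snippets that overlap end to end.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
# Two made-up snippets with nothing in common stay as separate pieces.
print(greedy_assemble(["AAAC", "GGGT"]))
```

When the same stretch of letters appears in many places, several merges look equally good and a greedy choice like this one can join the wrong pieces, which is the kind of ambiguity that repetitive fragments introduce.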

“It’s like reading a book in which the pages aren’t bound or numbered in any meaningful way,” Dudchenko said. “In some sense, the information is all there and if you’re lucky enough that the information you want to read off is in the paragraph on the same page, you can make sense of it. But if you have longer stretches of text, or are unlucky and hit the end of the page, you don’t know where to go from there, and there’s not a lot you can do with this information.”

Another problem with short reads or “jumping methods” is that the chromosomes rarely remain unbroken when DNA is extracted.

But Hi-C provides the needed information at the scale of whole chromosomes.

The technique traces the genome as it folds inside the nucleus, and shows how frequently different stretches of the genome come into contact with each other. This enabled the team to stitch together hundreds of millions of short DNA reads into the sequences of entire chromosomes.
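To make the idea of stitching by contact frequency concrete, here is a minimal sketch. It assumes, purely for illustration, that pieces lying close together on a chromosome touch each other more often than distant ones, and it orders four hypothetical pieces with a simple greedy rule. The function name, the piece labels and the contact counts are invented; this is not the Baylor team's algorithm.

```python
def order_contigs(names, contacts):
    """Greedy ordering of assembled pieces from pairwise contact counts.

    `names` needs at least two entries. `contacts` maps
    frozenset({a, b}) to how often pieces a and b were seen in contact.
    Hypothetical helper for illustration only.
    """
    def count(a, b):
        return contacts.get(frozenset((a, b)), 0)

    # Seed the chain with the pair that is in contact most often.
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    first, second = max(pairs, key=lambda pair: count(*pair))
    chain = [first, second]
    remaining = [n for n in names if n not in chain]

    # Attach the piece that touches either end of the chain most often.
    while remaining:
        best = max(
            remaining,
            key=lambda n: max(count(n, chain[0]), count(n, chain[-1])),
        )
        if count(best, chain[0]) > count(best, chain[-1]):
            chain.insert(0, best)
        else:
            chain.append(best)
        remaining.remove(best)
    return chain


# Made-up contact counts: neighbouring pieces touch far more often.
example_contacts = {
    frozenset(("A", "B")): 90,
    frozenset(("B", "C")): 80,
    frozenset(("C", "D")): 85,
    frozenset(("A", "C")): 30,
    frozenset(("B", "D")): 25,
    frozenset(("A", "D")): 5,
}
print(order_contigs(["A", "B", "C", "D"], example_contacts))
```

The sketch only recovers the order of the pieces, and the chain may come out reversed depending on the input order; working out the orientation of each piece and separating different chromosomes are further steps it does not attempt.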

“People can view Hi-C as a type of ‘jumping library’ that spans all scales,” said Dudchenko.

Clinical relevance
The short-read format also greatly reduces the cost of assembly, making it an option for doctors to conduct a personalized genome project on individual patients as needed.

“This is the technology that will get you there,” said Dudchenko.

Prior to Hi-C, a clinician might sequence a person’s DNA but, instead of fully assembling it, align it with a reference and look for “typos” to see where the patient differs from the genome reference. The disadvantage of this approach is that the clinician is looking at someone’s genome through the lens of an existing reference, creating a biased view. It relies heavily on prior work, and if there’s something unexpectedly different, the clinician may not notice, according to Dudchenko.
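The reference-based search for “typos” can be pictured with a toy comparison. This is a minimal sketch under strong assumptions: the two sequences are made up, already lined up letter for letter, and of equal length, and the function name is hypothetical. Real clinical pipelines are far more involved.

```python
def find_typos(reference, sample):
    """List positions where an aligned sample differs from the reference.

    Returns (position, reference_letter, sample_letter) tuples, with
    positions counted from 0. Hypothetical helper for illustration only.
    """
    if len(reference) != len(sample):
        raise ValueError("this sketch only compares sequences of equal length")
    return [
        (position, ref_letter, sample_letter)
        for position, (ref_letter, sample_letter) in enumerate(zip(reference, sample))
        if ref_letter != sample_letter
    ]


# Made-up sequences that differ at a single position.
print(find_typos("ACGTACGT", "ACGAACGT"))
```

A comparison like this can only report differences at positions the reference already has, which is one way to picture the bias described above.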

The next step for the Baylor team is to build a more stable infrastructure so that other research groups can use the Hi-C technology in their respective fields, and to make it more straightforward for people who do not come from a 3-D genomics background with the relevant expertise.

“We’re very excited to see how this technology will play out with different genes and assemblies, and how far we will be able to go with this technology,” added Dudchenko. “Right now, we’re pretty optimistic.”