Native to the west coast of the United States, the giant sequoia (Sequoiadendron giganteum) and coast redwood (Sequoia sempervirens) are two of the largest living organisms on the planet. Over the past century, 95% of ancient redwoods, which can live for more than 2,000 years, have been lost, shrinking the pool of genetic diversity and leaving these species endangered.
To support conservation and breeding efforts, researchers at the University of California, Davis and Johns Hopkins University initiated an ambitious project to sequence the massive genomes of these giant trees [1]. At 8.2 Gb for sequoia and 26.5 Gb for redwood, the genomes are, respectively, 2.6 and 8.3 times the size of the human genome. To tackle this challenge, the team deployed a ‘hybrid’ genome assembly strategy, combining short-read sequencing technology with long nanopore sequencing reads. According to Professor Steven Salzberg, one of the project leaders, at lengths in excess of 10 kb the nanopore sequencing reads can span nearly all common repeats, simplifying the assembly process [1].
The team deployed the MaSuRCA hybrid assembler, an open-source tool developed by Aleksey Zimin, a senior scientist in Professor Salzberg’s lab. Briefly, this uses a k-mer lookup to extend short sequencing reads base by base, at both the 5’ and 3’ ends (as long as the extension is unique), to form much longer ‘super-reads’. The combination of super-reads and long nanopore sequencing reads then enables the generation of even larger ‘mega-reads’.
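The unique-extension idea behind super-reads can be illustrated with a minimal Python sketch. This is not MaSuRCA's implementation: the function names (`build_kmer_index`, `extend_read`), the toy sequence and the choice of k = 4 are all assumptions made for illustration, and real assemblers also handle sequencing errors, reverse complements and far larger k-mer tables.

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Index every k-mer in the reads (hypothetical helper, k >= 2).

    succ maps a (k-1)-mer to the set of bases seen immediately after it;
    pred maps a (k-1)-mer to the set of bases seen immediately before it.
    """
    succ = defaultdict(set)
    pred = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            succ[kmer[:-1]].add(kmer[-1])
            pred[kmer[1:]].add(kmer[0])
    return succ, pred


def extend_read(read, succ, pred, k, max_steps=10_000):
    """Extend a read one base at a time while the next base is unique.

    Extension stops at the first position with zero or several candidate
    bases; max_steps guards against looping forever on circular sequence.
    """
    seq = read
    # Extend the 3' end.
    for _ in range(max_steps):
        candidates = succ.get(seq[-(k - 1):], ())
        if len(candidates) != 1:
            break
        seq += next(iter(candidates))
    # Extend the 5' end.
    for _ in range(max_steps):
        candidates = pred.get(seq[:k - 1], ())
        if len(candidates) != 1:
            break
        seq = next(iter(candidates)) + seq
    return seq


# Toy example: three overlapping reads drawn from a made-up 15-base sequence.
toy_sequence = "ATGGCGTACCTTAGA"
toy_reads = [toy_sequence[0:8], toy_sequence[4:12], toy_sequence[7:15]]
succ, pred = build_kmer_index(toy_reads, k=4)
print(extend_read(toy_reads[1], succ, pred, k=4))  # prints ATGGCGTACCTTAGA
```

In this toy case every extension is unique, so the middle read grows in both directions until it covers the whole sequence. In a repetitive genome, extension halts wherever two or more bases could follow, which is the kind of ambiguity the long reads help to resolve.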
The sequoia sample was sequenced to a depth of 135x using short-read technology and 22x using nanopore sequencing on the MinION. Assembly using the short-read data alone generated 2,507,175 contigs; however, addition of the long-read nanopore data delivered a 30-fold reduction in contig number (Table 1).
Table 1: Addition of long nanopore sequencing reads provided a 30-fold reduction in the number of contigs, and a 20-fold increase in contig sizes, for the sequoia genome assembly. Data courtesy of Professor Steven Salzberg, Johns Hopkins University, US.
To further enhance assembly contiguity, the team collaborated with Dovetail Genomics to use Hi-C chromosome conformation capture in conjunction with Dovetail's HiRise assembly algorithm, an approach they had previously used to generate a chromosome-level assembly of the walnut (Juglans regia L.) genome [2]. Comparing the walnut and sequoia nanopore sequencing reads, Professor Salzberg commented that the more recent sequoia reads were significantly longer, reflecting the rapid development of nanopore technology and optimisation of the team's sequencing workflow.
Assembly using the HiRise algorithm generated 11 ‘enormous’ chromosome-size scaffolds ranging from 443 Mb to 985 Mb in size. Describing such large scaffolds as ‘spectacular’ and ‘transformative’, Professor Salzberg noted that these are the largest scaffolds ever assembled for any genome.
The 26.5 Gb hexaploid (six copies of each chromosome) coast redwood genome presented the team with an even sterner assembly challenge. In total, 3.2 trillion bases of short-read data and 582 billion bases of nanopore sequencing data were generated, representing 122x and 21x genome coverage respectively. Confirming the scale of the task, the subsequent genome assembly took 6 months (or approximately 700,000 CPU hours post error correction). Hi-C scaffolding is currently ongoing; however, the initial hybrid assembly strategy delivered an N50 contig size of 110 kb and a longest contig of 2.4 Mb. Professor Salzberg suggests that, using the final assembly, it may be possible to segment the redwood genome into its three sub-genomes, shedding new light on the evolutionary history of this iconic species [1].
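For readers unfamiliar with the N50 statistic quoted above: it is the contig length at which contigs of that length or longer contain at least half of all assembled bases. The following Python sketch shows one common way to compute it. The function name `n50` and the toy contig lengths are illustrative assumptions and are not taken from the redwood assembly.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of all bases.
    """
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (arbitrary units), for illustration only.
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_lengths))  # prints 8: the two longest contigs hold 16 of 32 bases
```

Because the statistic is weighted by length, a small number of long contigs can raise the N50 substantially even when many short contigs remain in the assembly.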
The Redwood Genome Project is led by Professor David Neale at UC Davis and Professor Steven Salzberg at Johns Hopkins University, and is funded by the non-profit conservation group Save the Redwoods League.