Combining Illumina and ONT reads makes your de novo assembly successful

Intro

Although state-of-the-art Next-Generation sequencing (NGS) technologies generate an adequate amount of sufficient quality data for de novo genome assembly of organisms for which the reference sequence is not known yet, the genome assembly process itself is still a major challenge for bioinformaticians. In this article, we will try to outline the issue of assembly and show possible solutions.

One of the most widespread NGS platforms is Illumina. The main advantage of this platform is the high quality of reads, the ability to sequence paired end reads, the relatively affordable cost of sequencing and robustness. The disadvantage is the reading length, which is limited to 2x300 b for this platform. Even with high coverage there are specific regions of the genome which are insufficiently, or not at all, sequenced. We are talking about GC rich regions, chromosome beginnings and ends, repetitive regions, etc. These issues may have negative effect on the process of assembling the complete genome, which results in incomplete and fragmented sequences and leads to the formation of a larger number of shorter contigs, which is not easy to properly assemble into one coherent unit. One way how to bridge these inconsistencies is usage of mate-paired libraries. This kind of sequencing does not fill the gaps in assembly but helps to arrange contigs to the correct order.

Another possibility is to incorporate long fragments of DNA obtained from different sequencing platform into assembly. Oxford Nanopore Technologies (ONT) seems to be a suitable adept. According to the manufacturer's specifications, it is possible to obtain reads of up to several megabases. Ordinary sequencing run contains shorter fragments (few kilobases) which are, however, sufficient for purpose of analysis. Even if the ONT lags behind Illumina in quality, it is being very quickly developed and improved.

One could certainly make De novo assembly based only on long reads but in case of higher error rate higher read depth is necessary. Here the question arises – is it possible to combine these sequencing technologies and obtain the highest quality genome? The answer is yes and we have more than one solution. For example, hybrid assembly is able to combine shorter high quality reads from Illumina with long reads which are ideal for filling the gaps between contigs. Second possibility is assembly based on long reads (e.g. ONT) and improved with the high quality Illumina short reads - so-called polishing.

We have recently addressed a similar situation - we performed de novo assembly of the bacterial genome using data from both Illumina and ONT. The reason for assembling the genome using multiple approaches was to try to find the most ideal assembly method for the data.

SPAdes Genome Assembler was chosen for hybrid assembly. This program uses de Bruijn graphs for read assembly and was developed for de novo assembly of small genomes. SPAdes enables to combine short reads and long reads or assemble them separately. Second approach was assembly of long reads using Canu assembler. This assembler focuses on analyzing reads from PacBio and ONT platforms. Canu works in three steps: corrections, trimming and assembly. Inconsistent parts of assembly of long reads are repaired with Illumina data in following step calls polishing.

Note: All programs used in this comparison are free and can be distributed and/or modified according to the GNU General Public Licence version 2.

Data

In our case, the new bacterial genome was assembled with estimated size 3 Mb and higher level of GC content. For this purpose were prepared two libraries.

Whole genome shotgun DNA library, sequenced 2x150 b (Illumina)
Non-sheared whole genome DNA, mean read length 10 kb approx. (ONT)

For comparison, four different approaches of assembly were run:

De novo assembly based only on Illumina data (SPAdes)
Hybrid de novo assembly combining Illumina and ONT data (SPAdes)
De novo assembly based on ONT data (Canu)
De novo assembly based on ONT data, polishing by Illumina data (Canu, Pilon)

Assembly results were compared by program Quast. Only contigs longer or equal to 500bp were included in the statistical calculations. See summary in the table below.

Results

Contigs created only from Illumina data were analysed first. Program SPAdes assembled reads into 50 contigs, 31 of them were shorter than 500bp, the remaining of 19 contigs were included to the statistics. The longest contig had 659 929 bases and total length of assembly was 2 727 401 bases. N50 statistic was equal to 359 855 bp and L50 statistic contained three contigs. GC content approached to 70.63%.
The second column describes hybrid assembly of short Illumina reads and long Nanopore data made by SPAdes assembler. Total amount of contigs decreased to 12 and 2 of them were longer than 500bp. The longest contig had 1 768 225 bases and total length of assembly was 2 735 673 bases. N50 was 1 767 225 bp and L50 was represented by one contig. GC content was 70.60%.
Third column represents results of Nanopore data assembly created by Canu. One single contig was assembled with total length 2 742 214 bases and GC content 70.83%.
Last column shows statistics of Nanopore data assembly by Canu and draft polishing made by program Pilon. Final assembly is composed by only one contig with total length 2 774 588 bases and GC content 70,63%.

Statistical comparison of all assemblies - Quast

Comment

Total length of assembly was around 2.7Mb in all four examples which corresponded to the expected size of studied genome. In the first case (Illumina_spades), the genome was assembled into several contigs. More than half of them were shorter than 500 bp. The longest contig covered about one quarter of the total length of assembly. In comparisons with the second analysis (Illumina_nanopore_spades) interesting difference can be seen between assembly composed only from Illumina data and hybrid assembly. In case of hybrid assembly, count of contigs dropped down (nevertheless, majority of contigs was shorter than 500 bp). Total length of assembly is slightly longer, but more interesting is the size of longest contig which grew up more than 2,5 times in comparison to the assembly from Illumina data! It is obvious that long reads helped to bridge parts of genome which could not be properly assembled with short reads.

The assembly of long reads produced one final contig (nanopore_canu). Compared to the previous results, we can observe a slightly higher GC content, which may be due to the poorer quality of ONT reads. The last column represents the results of the assembly of long reads (Canu) after the subsequent correction by Pilon using Illumina reads (Nanopore_illumina_canu_pilon). This correction helped to slightly lengthen the final contig and adjust the GC composition.

These results clearly indicate that the de novo assembly created by combining short reads (Illumina data) with long reads (ONT data) contributed greatly convincing results and helped to technically improve the final assembly of the genome. Although it is not possible to say with 100% certainty that this genome was assembled completely and correctly (which, unfortunately, can never be done in principle), we are close to the expected assumptions and commonly used metrics in this case clearly indicate the suitability of this approach. In general, it cannot be declared that the only method of assembly with subsequent polishing is the best approach. Subsequent genome annotation and in vitro control of results are necessary. It is always necessary to take into account the specifics of individual species, the overall size and complexity of the genome. Therefore, we recommend using multiple procedures during data analysis. The most suitable approach for specific data should be chosen based on critical evaluation of results.

It is certainly clear that this combination of ONT and Illumina technologies, inevitably more costly than the use of only one of them, offers fundamental advantages for successful de novo assembly. If you are planning a similar project, we recommend that you contact our application specialists soon enough. The above example is proof that the basis of success is to plan the project well and we will be happy to help you!

NGS Lab, ngs@seqme.eu

Literature

Pilon - Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, et al. Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement. PLoS ONE. (2014)
Canu - Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research. (2017)
SPADes – Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology. (2012)
Quast - Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. (2013)

Magazine

Combining Illumina and ONT reads makes your de novo assembly successful

Related articles

Express links

Contact Us