HapCol: haplotype assembly from long reads
HapCol is an algorithm designed for the haplotype assembly of long reads produced by future-generation sequencing technologies. The work has recently been published in the journal Bioinformatics, currently in advance online access, and it can be found at this link: http://bioinformatics.oxfordjournals.org/content/early/2015/09/17/bioinformatics.btv495.
This report was written by one of the first authors, Simone Zaccaria.
The genome of diploid organisms, such as humans, is composed of two sets of chromosomes, one inherited from each parent. Hence, each chromosome comes in two copies, and each copy is called a haplotype. The two haplotypes of the same chromosome are very similar to each other, but they are not identical; in fact, they exhibit differences at Single Nucleotide Polymorphism (SNP) positions.
Haplotypes provide information that may be of fundamental importance for many applications (especially in genomic medicine), such as analyzing the relationships between genetic variation and gene function, or between genetic variation and disease susceptibility. However, since a direct experimental reconstruction is not yet cost-effective, the haplotypes have to be reconstructed from another kind of data. Haplotype assembly aims to reconstruct the haplotypes starting from a collection of sequencing reads. These reads are aligned to a reference genome in order to obtain a global alignment (Figure 1).
At this point, the problem is to bipartition the reads into two sets, such that all the reads belonging to the same part are used to reconstruct one haplotype (Figure 2). However, the presence of sequencing or mapping errors leads the problem to be modelled as an optimization problem. Indeed, several formulations have been proposed in the literature, and one of the most prominent is the Minimum Error Correction (MEC) problem (Figure 3), which asks for the minimum number of corrections to the SNP values such that the resulting reads can be unambiguously partitioned into two sets, each one identifying a haplotype.
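To make the MEC objective concrete, here is a minimal brute-force sketch, not HapCol's algorithm: the toy fragment matrix, the function names, and the use of '-' for uncovered positions are all illustrative assumptions. For every bipartition of the fragments, the cheapest pair of consistent haplotypes is obtained by correcting, in each column of each group, the minority allele to the majority one; the MEC value is the minimum of this cost over all bipartitions.

```python
from itertools import product

# Toy fragment matrix: one row per fragment, one column per SNP position.
# Entries are the observed alleles; '-' would mark an uncovered position.
fragments = ["0011", "0111", "1100", "1110"]

def bipartition_cost(fragments, assignment):
    """Corrections needed for a given bipartition of the fragments:
    in each column of each group, flip the minority allele to the majority."""
    cost = 0
    for col in range(len(fragments[0])):
        for group in (0, 1):
            counts = [0, 0]
            for row, g in enumerate(assignment):
                if g == group and fragments[row][col] != '-':
                    counts[int(fragments[row][col])] += 1
            cost += min(counts)
    return cost

def mec(fragments):
    """Exhaustive MEC: minimize the cost over all 2^n bipartitions."""
    return min(bipartition_cost(fragments, a)
               for a in product((0, 1), repeat=len(fragments)))

print(mec(fragments))  # → 2: one corrected entry per haplotype group
```

Enumerating all 2^n bipartitions is clearly exponential and only viable on toy instances; it serves here just to pin down what is being minimized.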
Haplotype assembly highly benefits from the advent of ‘future-generation’ sequencing technologies and their capability to produce long reads at increasing coverage. Existing methods are not able to deal with such data in a fully satisfactory way, either because accuracy or performance degrades as read length and sequencing coverage increase, or because they are based on restrictive assumptions.
We introduce a novel constrained variant of the MEC problem that is strongly motivated by a characteristic of future-generation sequencing technologies, namely the uniform distribution of sequencing errors. In particular, the key idea is to limit the maximum number of corrections that are allowed on each SNP position. We have designed an algorithm, called HapCol, that solves this novel formulation with a dynamic programming approach. Figure 4 shows a solution of the MEC problem on a small collection of fragments, represented as a matrix where rows represent fragments and columns represent positions. In this example, it is clear that fixing to 1 the maximum number of corrections on each column, as in our constrained variant, still allows the same solution to be found while strongly limiting the search space.
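The effect of the per-column bound can be sketched by restricting the same kind of brute-force search to bipartitions whose columns each need at most k corrections. This is again a toy illustration with assumed names and data, not HapCol's dynamic programming, which exploits the bound far more efficiently.

```python
from itertools import product

# Same kind of toy 0/1 fragment matrix as above (illustrative data).
fragments = ["0011", "0111", "1100", "1110"]

def column_corrections(fragments, assignment, col):
    """Corrections needed in a single column for a given bipartition."""
    cost = 0
    for group in (0, 1):
        counts = [0, 0]
        for row, g in enumerate(assignment):
            if g == group and fragments[row][col] != '-':
                counts[int(fragments[row][col])] += 1
        cost += min(counts)
    return cost

def k_mec(fragments, k):
    """Brute-force k-constrained MEC: a bipartition is feasible only if
    no column needs more than k corrections; return the minimum total
    cost over feasible bipartitions, or None if none is feasible."""
    best = None
    cols = range(len(fragments[0]))
    for a in product((0, 1), repeat=len(fragments)):
        per_col = [column_corrections(fragments, a, c) for c in cols]
        if max(per_col) <= k and (best is None or sum(per_col) < best):
            best = sum(per_col)
    return best

print(k_mec(fragments, 1))  # → 2: the MEC optimum survives the bound k=1
print(k_mec(fragments, 0))  # → None: no error-free bipartition exists
```

On this toy matrix the bound k=1 discards many bipartitions without losing the MEC optimum, which mirrors the intuition behind the constrained formulation: uniformly distributed sequencing errors rarely pile up on a single SNP position.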
HapCol has been implemented in C++ and is available under the terms of the GPL at http://hapcol.algolab.eu/. We experimentally compared HapCol with current state-of-the-art approaches, both on a real standard benchmark of long reads (from the HapMap sample NA12878) and on a collection of realistically simulated datasets. On the real dataset, HapCol proved to be competitive with the current state-of-the-art methods, improving the accuracy and the number of phased positions. Furthermore, experiments on the realistically simulated datasets revealed that HapCol requires significantly less computing resources, especially memory. Thanks to its computational efficiency, HapCol can overcome the limits of previous approaches, allowing it to phase datasets with higher coverage and without the traditional all-heterozygous assumption.