**Yuri Pirola**

**Covering pairs in directed acyclic graphs**,*LATA 2014*(slides).- The Minimum Path Cover problem on directed acyclic graphs (DAGs) is a classical problem that provides a clear and simple mathematical formulation for several applications in different areas and that has an efficient algorithmic solution. In this paper, we study the computational complexity of two constrained variants of Minimum Path Cover motivated by the recent introduction of next-generation sequencing technologies in bioinformatics. The first problem (MinPCRP), given a DAG and a set of pairs of vertices, asks for a minimum cardinality set of paths covering all the vertices such that both vertices of each pair belong to the same path. For this problem, we show that, while it is NP-hard to compute if there exists a solution consisting of at most three paths, it is possible to decide in polynomial time whether a solution consisting of at most two paths exists. The second problem (MaxRPSP), given a DAG and a set of pairs of vertices, asks for a single path containing the maximum number of the given pairs of vertices. We show its NP-hardness and also its W[1]-hardness when parameterized by the number of covered pairs. On the positive side, we give a fixed-parameter algorithm when the parameter is the maximum overlapping degree, a natural parameter in the bioinformatics applications of the problem.

**Combinatorial structures and NGS data in transcriptomics: some results**,*CS4SA 2013*(slides).- In this talk we discuss a combinatorial structure that gives a compact representation of gene structures—called splicing graph—and investigate the computational problem of building a splicing graph that is compatible with Next-Generation Sequencing data. More precisely, a fundamental question we are going to investigate is: under which conditions can the reconstruction of a gene structure be efficiently accomplished using only information provided by RNA-Seq data?

In order to partially answer this question, we introduce the formal definition of the computational problem of reconstructing the gene structure from RNA-Seq data when the solution is represented by a splicing graph and we give some necessary conditions under which the reconstructed splicing graph represents the real (unknown) gene structure. Finally we describe an efficient algorithm that, under some conditions, is able to exactly solve our problem but that is also able to achieve good accuracy on real genes violating such conditions.

- In this talk we discuss a combinatorial structure that gives a compact representation of gene structures—called splicing graph—and investigate the computational problem of building a splicing graph that is compatible with Next-Generation Sequencing data. More precisely, a fundamental question we are going to investigate is: under which conditions can the reconstruction of a gene structure be efficiently accomplished using only information provided by RNA-Seq data?
**Haplotype-based prediction of gene alleles using pedigrees and SNP genotypes**,*ACM-BCB 2013*(slides).- In this work, we propose a new gene allele prediction method based on a simple, but robust, combinatorial formulation for the problem of discovering haplotype-allele associations. The inherent uncertainty of the haplotype inference process is reduced by taking into account the inheritance of gene alleles across the population pedigree while genotypes are phased. The accuracy of the method has been extensively evaluated on a representative real-world livestock dataset under several scenarios and choices of parameters. The median error rate ranged from 0.0537 to 0.0896, with an average of 0.0678; this is 21% better than another state-of-the-art prediction algorithm that does not use the pedigree information.
- The software, released under the GNU General Public License, can be freely downloaded from this page.

**Covering pairs in directed acyclic graphs**,*ICTCS 2013*(slides).- In this work we study a constrained version of the Minimum Path Cover problem motivated by applications in software testing and in the analysis of Next-Generation Sequencing data in bioinformatics. Given a directed acyclic graph and a set of pairs of vertices, the problem asks for a path that contains the maximum number of pairs. We investigate the complexity of the problem and we show that it is not only NP-hard, but also W[1]-hard when the parameter is the size of the solution, i.e. the number of covered pairs. On the positive side, we give a fixed-parameter algorithm when the parameter is the number of overlapping pairs, a natural parameter in some applications of the problem.

**A fast and practical approach to genotype phasing and imputation on a pedigree with erroneous and incomplete information**,*ICCABS 2012*(slides).- This work proposes the Min-Recombinant Haplotype Configuration with Bounded Errors problem (MRHCE), which extends the original Min-Recombinant Haplotype Configuration formulation by incorporating two common characteristics of real data: errors and missing genotypes (including untyped individuals). We describe a practical algorithm for MRHCE that is based on a reduction to the Satisfiability problem (SAT) and exploits recent advances in the constraint programming literature. An experimental analysis demonstrates the soundness of our model and the effectiveness of the algorithm under several scenarios.
- The analysis on real data and the comparison with state-of-the-art programs reveals that our approach couples better scalability to large and complex pedigrees with the explicit inclusion of genotyping errors into the model.
- The software, released under the GNU General Public License, can be freely downloaded from this page.

**PIntron: a fast method for gene structure prediction via maximal pairings of a pattern and a text**,*ICCABS 2011*(slides).- In this work, we propose a novel pipeline for computational gene-structure prediction based on spliced alignment of expressed sequences (ESTs and mRNAs). This pipeline, called
**PIntron**, is composed by four steps: Firstly, alternative alignments of expressed sequences to a reference genomic sequence are implicitly computed and represented in a graph (called embedding graph) by a novel fast spliced alignment procedure. Secondly, biologically meaningful alignments are extracted. Then, a consensus gene structure induced by the previously computed alignments is determined based on a parsimony principle. Finally, the resulting introns are reconciliated and classified according to general biological criteria. - The software, released under the GNU Affero General Public License, can be freely downloaded from this page.

- In this work, we propose a novel pipeline for computational gene-structure prediction based on spliced alignment of expressed sequences (ESTs and mRNAs). This pipeline, called
**Haplotype Inference on Pedigrees with Recombinations and Mutations**,*WABI 2010*(slides).- The work proposes a new combinatorial formulation for haplotype inference on general pedigrees that generalizes the existing combinatorial ones to a more realistic settings. We prove an approximation-preserving reduction from this problem, called
**Minimum Change Haplotype Configuration**(MCHC), to a well-known coding theory problem, namely the Nearest Codeword Problem. This reduction automatically implies the approximability of MCHC within a factor O(n/log n) and, more importantly, it leads to a new very efficient heuristic algorithm that has been experimentally compared with other state-of-the-art HI methods. - The software implementing the heuristic is freely available at this page under the GNU General Public License.

- The work proposes a new combinatorial formulation for haplotype inference on general pedigrees that generalizes the existing combinatorial ones to a more realistic settings. We prove an approximation-preserving reduction from this problem, called
**Minimum Factorization Agreement of Spliced ESTs**,*WABI 2009*(slides).- The work presents a new parsimony-based formulation for the problem of choosing the best alignment of an expressed sequence against a genomic sequence exploiting the redundancy of current expressed sequence databases. A preliminary experimental evaluation of the formulation and the related algorithm shows their applicability on real instances.
- The implementation of this approach is integrated in our software PIntron.

**Pure Parsimony Xor Haplotyping**,*ISBRA 2009*(slides).- In this work we addressed the problem of haplotype inference from xor genotypes under the pure parsimony assumption. Exact algorithms for restricted instances, a fixed parameter algorithm, an approximation algorithm, and an effective heuristic have been proposed.
- A prototypical implementation of the heuristic is freely available at this page under the GNU General Public License.

## License

This page and all the material here available are licensed under the

Creative Commons Attribuzione-Non commerciale-Non opere derivate 2.5 Italia License.