Workshop on Graph Assembly Algorithms for omics data

November 18th, 2016

Univ. Milano-Bicocca. Building U24 – Room C02

Theme and scope

New and different sequencing technologies have appeared in the last few years with the promise of overtaking Illumina as the reference platform: PacBio, 10x Genomics, and MinION are the most relevant today. These technological improvements have exposed the limitations of working with linear strings instead of richer models, which are usually graph-based. Transcriptomics was the first field where the need to impose a structure over strings became clear, leading to the notion of a splicing graph to represent all possible transcripts of a gene, as well as its exon-intron structure. Genome assembly developers are increasingly aware that a diploid organism has two different, but very similar, copies of its genome: hence the correct outcome of an assembly cannot be a single string. Finally, metagenomics studies different species that can share portions of the genome: another example where the correct representation of an assembly cannot be linear. Moreover, the 1000 Genomes project has generated new interest in the analysis of a population genome.

From a bioinformatics point of view, this calls for refining the classical approach where a genome is represented as a single string. Approaches based on graph representations have become more popular instead. For instance, recent work has led to a common format (GFA – Graphical Fragment Assembly) to represent an assembly graph.
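As a rough illustration of what such a format looks like in practice, the sketch below reads the two most common record types of GFA version 1: tab-separated `S` lines (segment name and sequence) and `L` lines (an oriented link between two segments with an overlap given as a CIGAR string). The function name `parse_gfa` and the tiny example graph are hypothetical and only meant for illustration; other record types and optional tags are ignored.

```python
from collections import defaultdict


def parse_gfa(text):
    """Collect the S (segment) and L (link) records of a GFA 1 document."""
    segments = {}              # segment name -> sequence
    links = defaultdict(list)  # (name, orientation) -> [(name, orientation, overlap)]
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            segments[fields[1]] = fields[2]
        elif fields[0] == "L" and len(fields) >= 6:
            src, src_orient, dst, dst_orient, overlap = fields[1:6]
            links[(src, src_orient)].append((dst, dst_orient, overlap))
        # Header (H), path (P) and any other records are skipped in this sketch.
    return segments, dict(links)


# Hypothetical two-segment graph: segment 1 overlaps segment 2 by 3 bases (TAC).
example = "\n".join([
    "H\tVN:Z:1.0",
    "S\t1\tACGTAC",
    "S\t2\tTACGGA",
    "L\t1\t+\t2\t+\t3M",
])
segments, links = parse_gfa(example)
```

Storing links keyed by an oriented segment keeps the sketch close to the file layout; a real tool would also need the reverse-complement links implied by each `L` record.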

The aim of this informal workshop is to bring together researchers from different but related areas, under the widest possible notion of graph assembly algorithms (genome assembly, alternative splicing prediction, metagenomics, graph manipulation and representation), who investigate these problems with both theoretical and experimental approaches. The goal is to brainstorm which challenges are the most important and can drive the research agenda in the next few years, and to jumpstart a proposal for a European Training Network (specific call) under the Horizon 2020 program and/or a COST action (call).

Nonexclusive list of relevant topics

  • graph assembly
  • variation graphs
  • de Bruijn graphs
  • string graphs
  • alternative splicing
  • metagenomics
  • the population genome
  • pan-genome graph
  • graph fragment assembly

Tentative schedule

9-9.30: Welcome and registration

9.30-11.35: Session 1 (5 talks, 25 minutes each)

  1. Alexander Schoenhuth (Centrum Wiskunde & Informatica). De novo assembly of viral quasispecies using overlap graphs (slides)

  2. Leena Salmela (University of Helsinki). Path searching problems in de Bruijn graphs (slides)

  3. Tomas Vinar (Comenius University). Probabilistic Models for Genome Assembly

  4. Jouni Siren (Wellcome Trust Sanger Institute). Indexing Graphs for Path Queries (slides)

  5. Guillaume Holley (Universität Bielefeld). Bloom Filter Trie: an efficient data structure for colored de Bruijn graphs (slides1 slides2)

11.35-12.00: Coffee break

12.00-13.40: Session 2 (4 talks, 25 minutes each)

  1. Pierre Peterlongo (Université de Rennes 1). Graph assembly algorithms at the Genscale team (slides)

  2. Jean-François Flot (Université libre de Bruxelles). Read compaction from a De Bruijn graph: a new paradigm for polymorphic genome assembly

  3. Rayan Chikhi (CNRS). Efficient construction of compacted de Bruijn graphs (slides)

  4. Rumen Andonov (Université de Rennes 1). Global Optimization Methods for Genome Scaffolding and Completing Genome Assemblies (slides)

13.40-14.30: Lunch

14.30-15.45: Session 3 (3 talks, 25 minutes each)

  1. Tobias Marschall (Universität des Saarlandes). Computational pan-genomics: status, promises and challenges (slides)

  2. Eric Rivals (CNRS). Superstring Graph: an alternative to assembly with multiple de Bruijn graphs

  3. Gianluca Della Vedova (Università di Milano-Bicocca). BWT and Graph Algorithms

15.45-17.30: Discussion

Venue and accommodation

The workshop will take place in Milan, at the Department of Computer Science (DISCo), in the U24 building (Edificio U24).

Milan is easily reachable from all major European airports and the main international ones. How to reach us

The suggested hotels are Starhotel Tourist and Hotel Arcimboldi, both within a 5-minute walk of the workshop site.

List of speakers

Scientific Committee:

Paola Bonizzoni
Gianluca Della Vedova
Raffaella Rizzi

List of participants

  • Rumen Andonov
  • Stefano Beretta
  • Paola Bonizzoni
  • Broňa Brejová
  • Rayan Chikhi
  • Gianluca Della Vedova
  • Guillaume Holley
  • Jean-François Flot
  • Murray Patterson
  • Pierre Peterlongo
  • Nadia Pisanti
  • Alberto Policriti
  • Marco Previtali
  • Eric Rivals
  • Raffaella Rizzi
  • Romeo Rizzi
  • Giovanna Rosone
  • Leena Salmela
  • Alexander Schönhuth
  • Marinella Sciortino
  • Jouni Sirén
  • Jens Stoye
  • Tomáš Vinař

Thesis: Indexing algorithms for the pan-genome graph

Description

The thesis deals with modelling, designing and implementing new pattern matching methods where the text is no longer a linear structure, but can be represented as a directed acyclic graph whose vertices correspond to portions of the genome shared by one or more individuals and whose arcs identify the genomic variants present in each individual.

The topic leaves room for several master's theses, possibly coordinated with one another. For example:

  1. define and study the computational problem when the pattern is itself a graph
  2. implement a pattern matching algorithm where the text is a graph
  3. design and implement an algorithm for aligning a read to a graph
  4. model and design an algorithm for the case of genomes with structural variations
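To make the setting concrete, here is a minimal sketch of exact pattern matching when the text is a vertex-labelled directed acyclic graph. It is a naive baseline, not one of the indexing methods the thesis aims to develop: it tries every start position and follows arcs whenever a vertex label is exhausted, so it can take exponential time on graphs with many branching paths. The function name `occurs_in_graph`, the dictionary-based graph encoding, and the small SNP example are all assumptions made for illustration.

```python
def occurs_in_graph(labels, edges, pattern):
    """Return the sorted (vertex, offset) positions where `pattern` can be
    spelled along some path of the DAG, starting inside that vertex label.

    labels: vertex -> string label; edges: vertex -> list of successors.
    """
    def match(v, i, j):
        # Can pattern[j:] be read starting at character i of vertex v?
        label = labels[v]
        while i < len(label) and j < len(pattern):
            if label[i] != pattern[j]:
                return False
            i += 1
            j += 1
        if j == len(pattern):
            return True
        # The label is exhausted: continue in any successor vertex.
        return any(match(w, 0, j) for w in edges.get(v, ()))

    return sorted(
        (v, i)
        for v in labels
        for i in range(len(labels[v]))
        if match(v, i, 0)
    )


# Hypothetical graph for a single SNP: ACG, then T or G, then CA.
labels = {"a": "ACG", "b": "T", "c": "G", "d": "CA"}
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
hits = occurs_in_graph(labels, edges, "GTC")
```

In this example `"GTC"` is found starting at offset 2 of vertex `a` and continuing through `b` into `d`; an index in the spirit of the thesis would answer the same query without scanning every start position.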

Location

Experimental Algorithmics Lab (Laboratorio di Algoritmica Sperimentale), AlgoLab, http://www.algolab.eu

Requirements

Interest in and enthusiasm for inventing new BWT-based (Burrows-Wheeler transform) algorithms for indexing graphs

Expected results

New models or algorithms for indexing and querying data that can be modelled as graphs.

For information, contact

Paola Bonizzoni <bonizzoni@disco.unimib.it> or Gianluca Della Vedova <gianluca.dellavedova@unimib.it>

Duration

6 months

Advanced Techniques for Combinatorial Algorithms

In February-March 2016, we will give a PhD course on Advanced Techniques for Combinatorial Algorithms.

This course focuses on the design and the analysis of efficient algorithms, with a strong
emphasis on theoretical aspects.

The goal of the course is to give a broad coverage of the main techniques in algorithm
design and analysis, so that the course can be useful also for a researcher in a different
field. To this purpose, the computational problems tackled are among the most basic
problems on strings and graphs, such as pattern matching, vertex cover and max cut.

The official page is here.

Schedule

  • Feb. 2, 10:30-12:30 – meeting room III floor, U14, (lecturer Gianluca Della Vedova). Parallel Algorithms. (PRAM model, prefix sum algorithm, tree depth, connectivity). slides
  • Feb. 3, 10:30-12:30 – meeting room III floor, U14, (lecturer Gianluca Della Vedova). Randomized Algorithms. (Karp-Rabin algorithm for pattern matching). Succinct representations of data (Entropy, Huffman codes, Jensen’s Inequality, Elias’ gamma and delta codes) slides
  • Feb. 8, 14:00-16:00 – Seminar Room, U14 (lecturer Pawel Gawrychowski). Suffix arrays (construction and pattern matching) slides
  • Feb. 9, 14:00-16:00 – T014 (lecturer Pawel Gawrychowski). Suffix arrays 2 (Burrows-Wheeler transform and Lempel-Ziv factorization) slides
  • Feb. 11, 11:00-13:00 – Seminar Room, U14 (lecturer Pawel Gawrychowski). Indexing with errors (indexing with k mismatches, heavy path decomposition) slides
  • Feb. 16, 11:00-13:00 – meeting room III floor, U14 (lecturer Travis Gagie). Basic Lossless Data compression (The Kraft Inequality and Jensen’s Inequality; Shannon’s Noiseless Coding Theorem; Shannon and Huffman Codes; Canonical Codes; Elias Codes; empirical entropy; LZ77 and LZ78; BWT)
  • Feb. 17, 11:00-13:00 – meeting room III floor, U14, (lecturer Travis Gagie). Compressed data structures (Bitvectors; wavelet trees; compressed permutations; rank and select on sequences; succinct trees). encyclopedia entry, introductory article, slides
  • Feb. 18, 11:00-13:00 – Seminar Room, U14, (lecturer Travis Gagie). Self-Indexes for Strings and Graphs (LZ-based indexes; FM-indexes; LZ/FM-hybrid indexes; the XBWT for labelled trees; BWT-based representations of de Bruijn graphs; the GCSA for genomic graph references) introductory article, slides
  • Mar. 1, 14:00 – 16:00 – meeting room III floor, U14, (lecturer Gianluca Della Vedova) Approximation Algorithms. (Algorithms for Vertex Cover, Satisfiability, Max Cut. Approximation Complexity)
  • Mar. 8, 14:00 – 16:00 – meeting room III floor, U14, (lecturer Gianluca Della Vedova) Approximation Algorithms. slides

An in-silico framework for comparing and validating transcripts predicted from single and paired-end reads

Anna Paola Carrieri, Stefano Beretta, Gianluca Della Vedova, Ernesto Picardi, Yuri Pirola, Raffaella Rizzi, Graziano Pesole, and Paola Bonizzoni, NGS 2012 (poster and abstract).

With the advent of high-throughput transcriptome sequencing (RNA-Seq), several computational methods that use RNA-Seq data to assemble full-length mRNA isoforms have been proposed, although none solves the problem completely. We have analyzed some of the most widely used tools, evaluating their performance and accuracy.

Our experimental analysis reveals that using GSNAP instead of TopHat gives more specific predictions, together with a minor (but statistically inconclusive) improvement in sensitivity.

We plan to extend our study (i) by introducing some alternatives to EVAL for comparing predictions, (ii) by considering different kinds of simulated data (more coverage levels and/or errors), as well as real data, and (iii) by analyzing in more detail the structure of predicted transcripts, since a preliminary study in this direction reveals that current methods have various shortcomings in assembling transcripts.

Detection of long noncoding RNA (lncRNA) involved in RNA-Seq isoforms

  • Hassan Mahmoud, DISCo, May 28, 2015 (slides).
    Recent discoveries have shown that the human and other mammalian genomes produce thousands of mRNA-like molecules, namely long noncoding RNAs (lncRNAs). Almost on a weekly basis, biological studies report lncRNAs that are up- or down-regulated in a particular disease. These lncRNAs, which lack significant protein-coding capacity, have been implicated in a wide range of biological functions through diverse and as yet poorly understood molecular mechanisms. The majority of human genes are alternatively spliced in a highly tissue- and cell type-specific manner that is essential for generating protein diversity. The production of alternatively spliced mRNAs is regulated by the combinatorial use of multiple cis-acting RNA elements along the precursor mRNA (pre-mRNA). In this talk, I will present the concepts and subtypes of lncRNA, their role in gene regulation and their relation to alternative splicing, and the bioinformatics tools used for detecting lncRNAs, mainly from sequenced isoforms.