Workshop on Graph Assembly Algorithms for omics data

November 18th, 2016

Univ. Milano-Bicocca. Building U24 – Room C02

Theme and scope

New and different sequencing technologies have appeared in the last few years with the promise of overcoming Illumina as the reference platform: PacBio, 10x Genomics, and MinION being the most relevant today. Those technological improvements have shown the limitations of managing linear strings instead of richer models that are usually graph-based. Transcriptomics was the first field where the need of imposing a structure over strings has become clear, leading to the notion of a splicing graph to represent all possible transcripts of a gene, as well as the exon-intron structure. Genome assemblers are increasingly aware that a diploid organism has two different, but very similar, copies of each genome: hence the correct outcome of an assembly cannot be a single string. Finally, metagenomics studies different species that can share portions of the genome: another example where the correct representation of an assembly cannot be linear. Moreover, the 1000 Genomes project has generated new interest in the analysis of a population genome.

From a bioinformatics point of view, this results in the need for refining the classical approach where a genome is represented as a single string. Instead, some approaches based on graph representations have become more popular. For instance, some recent developments have been obtained on defining a common format (GFA – Graph Fragment Assembly) to represent an assembly graph.

The aim of this informal workshop is to bring together researchers from different, but correlated, areas under the widest possible notion of graph assembly algorithms, such as genome assembly, alternative splicing prediction, metagenomics, graph manipulation and representation, that are involved in the thorough investigation of these problems with both theoretical and experimental approaches. The goal is to brainstorm which challenges are the most important and can drive the research agenda in the next few years, and to jumpstart a proposal for a European Training Network (specific call) under the Horizon 2020 program and/or a COST action (call).

Nonexclusive list of relevant topics

  • graph assembly
  • variation graphs
  • de Bruijn graphs
  • string graphs
  • alternative splicing
  • metagenomics
  • the population genome
  • pan-genome graph
  • graph fragment assembly

Tentative schedule

9-9.30: Welcome and registration

9.30-11.35: Session 1 (5 talks, 25 minutes each)

  1. Alexander Schoenhuth (Centrum Wiskunde & Informatica). De novo assembly of viral quasispecies using overlap graphs (slides)

  2. Leena Salmela (University of Helsinki). Path searching problems in de Bruijn graphs (slides)

  3. Tomas Vinar (Comenius University). Probabilistic Models for Genome Assembly

  4. Jouni Siren (Wellcome Trust Sanger Institute). Indexing Graphs for Path Queries (slides)

  5. Guillaume Holley (Universität Bielefeld). Bloom Filter Trie: an efficient data structure for colored de Bruijn graphs (slides1 slides2)

11.35-12.00: Coffee break

12.00-13.40: Session 2 (4 talks, 25 minutes each)

  1. Pierre Peterlongo (Université de Rennes 1). Graph assembly algorithms at the Genscale team (slides)

  2. Jean-François Flot (Université libre de Bruxelles). Read compaction from a De Bruijn graph: a new paradigm for polymorphic genome assembly

  3. Rayan Chikhi (CNRS). Efficient construction of compacted de Bruijn graphs (slides)

  4. Rumen Andonov (Université de Rennes 1). Global Optimization Methods for Genome Scaffolding and Completing Genome Assemblies (slides)

13.40-14.30: Lunch

14.30-15.45: Session 3 (3 talks, 25 minutes each)

  1. Tobias Marschall (Universität des Saarlandes). Computational pan-genomics: status, promises and challenges (slides)

  2. Eric Rivals (CNRS). Superstring Graph: an alternative to assembly with multiple de Bruijn graphs

  3. Gianluca Della Vedova (Università di Milano-Bicocca). BWT and Graph Algorithms

15.45-17.30: Discussion

Venue and accommodation

The workshop will take place in Milano, at the Department of Computer Science (DISco), at the U24 building (Edificio U24).

Milan is easily reachable by all major European and main international airports. How to reach us

The suggested hotels are Starhotel Tourist and Hotel Arcimboldi that are both within a 5-minute walk from the workshop site.

List of speakers

Scientific Committee:

Paola Bonizzoni
Gianluca Della Vedova
Raffaella Rizzi

List of partecipants

  • Rumen Andonov
  • Stefano Beretta
  • Paola Bonizzoni
  • Broňa Brejová
  • Rayan Chikhi
  • Gianluca Della Vedova
  • Guillaume Holley
  • Jean-François Flot
  • Murray Patterson
  • Pierre Peterlongo
  • Nadia Pisanti
  • Alberto Policriti
  • Marco Previtali
  • Eric Rivals
  • Raffaella Rizzi
  • Romeo Rizzi
  • Giovanna Rosone
  • Leena Salmela
  • Alexander Schönhuth
  • Marinella Sciortino
  • Jouni Sirén
  • Jens Stoye
  • Tomáš Vinař

Advanced Techniques for Combinatorial Algorithms

In February-March 2016, we will give a PhD course on Advanced Techniques for Combinatorial Algorithms.

This course focuses on the design and the analysis of efficient algorithms, with a strong
emphasis on theoretical aspects.

The goal of the course is to give a broad coverage of the main techniques in algorithm
design and analysis, so that the course can be useful also for a researcher in a different
field. To this purpose, the computational problems tackled are among the most basic
problems on strings and graphs, such as pattern matching, vertex cover and max cut.

The official page is here.


  • Feb. 2, 10:30-12:30 – meeting room III floor, U14, (lecturer Gianluca Della Vedova). Parallel Algorithms. (PRAM model, prefix sum algorithm, tree depth, connectivity). slides
  • Feb. 3, 10:30-12:30 – meeting room III floor, U14, (lecturer Gianluca Della Vedova). Randomized Algorithms. (Karp-Rabin algorithm for pattern matching). Succinct representations of data (Entropy, Huffman codes, Jensen’s Inequality, Elias’ gamma and delta codes) slides
  • Feb. 8, 14:00-16:00 – Seminar Room, U14 (lecturer Pawel Gawrychowski). Suffix arrays (construction and pattern matching) slides
  • Feb. 9, 14:00-16:00 – T014 (lecturer Pawel Gawrychowski). Suffix arrays 2 (Burrows-Wheeler transform and Lempel-Ziv factorization) slides
  • Feb. 11, 11:00-13:00 – Seminar Room, U14 (lecturer Pawel Gawrychowski). Indexing with errors (indexing with k mismatches, heavy path decomposition) slides
  • Feb. 16, 11:00-13:00 – meeting room III floor, U14 (lecturer Travis Gagie). Basic Lossless Data compression (The Kraft Inequality and Jensen’s Inequality; Shannon’s Noiseless Coding Theorem; Shannon and Huffman Codes; Canonical Codes; Elias Codes; empirical entropy; LZ77 and LZ78; BWT)
  • Feb. 17, 11:00-13:00 – meeting room III floor, U14, (lecturer Travis Gagie). Compressed data structures (Bitvectors; wavelet trees; compressed permutations; rank and select on sequences; succinct trees). voce enciclopedia articolo introduttivo slides
  • Feb. 18, 11:00-13:00 – Seminar Room, U14, (lecturer Travis Gagie). Self-Indexes for Strings and Graphs (LZ-based indexes; FM-indexes; LZ/FM-hybrid indexes; the XBWT for labelled trees; BWT-based representations of de Bruijn graphs; the GCSA for genomic graph references) articolo introduttivo slides
  • Mar. 1, 14:00 – 16:00 – meeting room III floor, U14, (lecturer Gianluca Della Vedova) Approximation Algorithms. (Algorithms for Vertex Cover, Satisfiability, Max Cut. Approximation Complexity)
  • Mar. 8, 14:00 – 16:00 – meeting room III floor, U14, (lecturer Gianluca Della Vedova) Approximation Algorithms. slides

Combinatorial Structures for Sequence Analysis in Bioinformatics

27 Novembre 2013
Aula Seminari
Dipartimento di Informatica, Sistemistica e Comunicazione (Edificio U14)
Univ. degli Studi di Milano-Bicocca


Nell’ambito del Progetto PRIN 2010/2011 “Automi e Linguaggi Formali: aspetti matematici e applicativi” si svolgerà il 27 Novembre 2013 presso il Dipartimento di Informatica, Sistemistica e Comunicazione dell’Università di Milano-Bicocca un workshop dedicato a strutture combinatorie per l’analisi di sequenze in bioinformatica.

L’incontro si propone l’obiettivo di dare la possibilità, sia ai ricercatori del progetto PRIN che ad altri che si occupano di algoritmi su stringhe e sequenze di riunirsi per presentare le proprie linee di ricerca, i risultati ottenuti e i problemi aperti al fine di iniziare nuove collaborazioni e rinforzare quelle già esistenti.


9.15 – 9.30  Apertura lavori

9.30 – 10.30 Gregory Kucherov: “Algorithms and data structures for Next-Generation Sequencing data”.

10.30 – 11.00  Discussione

11.00 – 11.30  Giovanna Rosone: “Extended Burrows-Wheeler Transform and analysis of biological sequences”

11.30 – 11.45 Discussione

11.45 – 12.15   Yuri Pirola: “Combinatorial structures and NGS data in Transcriptomics: some results”

12.15 – 12.45  Paola Bonizzoni: “Combinatorial structures and NGS data in Transcriptomics: some open problems”

12.45 – 13.00  Discussione

13.00 – 14.30  Pausa pranzo

14.30 – 15.00 Francesco Masulli: “Predicting microRNA Target Genes”

15.00 – 15.15  Discussione

15.15 – 15.45 Zsuzsanna Lipták: “Jumbled String Matching: Motivations, Variants, Algorithms”

15.45 – 16.00 Discussione

16.00 – 18.00 Apertura della tavola rotonda: presentazione di problemi aperti e discussione

Ulteriori informazioni

Il workshop si terrà nell’aula Seminari del Dipartimento di Informatica, Sistemistica e Comunicazione. Indicazioni su come raggiungere la sede del workshop sono disponibili a questa pagina.

Per ogni informazione, contattare

Abstract  degli interventi

Gregory Kucherov: “Algorithms and data structures for Next-Generation Sequencing data“.

In the middle of the 00’s, the whole area of bioinformatics, already under rapide development, underwent a new breakthrough caused by the advent of Next Generation Sequencing (NGS) technologies. I will give a very quick introduction to NGS for computer scientists, highlighting main involved computational challenges. Then I’ll survey some my contributions to the area, namely on read mapping on the one hand, and on storing de Bruijn graphs for genome assembly on the other hand.

Giovanna Rosone: “Extended Burrows-Wheeler Transform and analysis of biological sequences

The advent of “next-generation” DNA sequencing (NGS) technologies has meant that collections of hundreds of millions of DNA sequences are now commonplace in Bioinformatics.
For instance, a human genome sequencing experiment might yield a billion or more sequences, each 100 characters in length. Such a dataset can now be generated in just a few days on a single sequencing machine.
Many algorithms and data structures for compression and analysis of a text have the BWT at their heart, and it would be of great interest to explore their applications to sequence collections such as these.
We describe and highlight an extension of the Burrows-Wheeler Transform (EBWT), of the suffix array and of the longest common prefix array to very large string collections.
We introduce a new methodology for computing the EBWT for 100 billion characters or more of data in external memory. We use the EBWT for applications in the context of string processing, such as large-scale compression of genomic sequence databases, compression of sequence quality scores and comparison of DNA Sequence Collections.

Yuri Pirola: “Combinatorial structures and NGS data in Transcriptomics: some results”

Next-Generation Sequencing (NGS) technologies need new methodologies for Alternative Splicing (AS) analysis.
In this talk we discuss a combinatorial structure that gives a compact representation of gene structures—called  splicing graph—and investigate the computational problem of building a splicing graph that is compatible with NGS data. More precisely, a fundamental question we are going to investigate is: under which conditions can the reconstruction of a gene structure be efficiently accomplished using only information provided by RNA-Seq data?
In order to partially answer this question, we introduce the formal definition of the computational problem of reconstructing the gene structure from RNA-Seq data when the solution is represented by a splicing graph and we give some necessary conditions under which the reconstructed splicing graph represents the real (unknown) gene structure. Finally we describe an efficient algorithm that, under some conditions, is able to exactly solve our problem but that is also able to achieve good accuracy on real genes violating such conditions.

Bonizzoni Paola: “Combinatorial structures and NGS data in Transcriptomics: some open problems”

In this talk we discuss new approaches to the comparison of short sequences (RNA-Seq sequences) produced by NGS technologies when applied  to transcriptomic data. We discuss the computational problem of assembling a string-graph from RNA-Seq data using the Burrows Wheeler transform and a space efficient FM-index to detect overlaps in a collection of sequences.

Francesco Masulli: “Predicting microRNA Target Genes”

miRNA (microRNAs) are a class of short, non coding RNA capable of base-pairing with imperfect complementarity to the transcripts of animal protein-coding genes (targets). In such a way, they negatively regulate gene expression; however, the exact mechanism remains still unclear. Since each miRNA targets several hundreds of transcripts and each transcript can be the target of multiple miRNAs, the complexity of this class of RNA is remarkable. miRNA are in involved in the pathway of several biological processes, including tumors such as prostate cancer, lung cancer, breast cancer, and colorectal cancer, as well neurogenesis and synaptic function. Computational prediction of miRNA targets is not an easy task. In the last few years, more than 10 target prediction programs have been developed. They typically analyze sequences by a sliding window and, in order to reduce false positive predictions, they test each site of the gene with some algorithms in cascade. miRanda, proposed in 2003, is the most popular miRNA target prediction method. It splits the target site prediction task into three distinct steps carried out in sequence: (a) Homology evaluation; (b) Free energy computation; (c) Evolutionary conservation computation. The parameters used in miRanda have been manually optimized according to biological knowledge and software tools available in 2003 (with some updates).
We present an approach to improve the computational estimation of target sites, maximizing the adherence of results to biological evidence. The proposed method provides some improvements on the original miRanda program. In particular, we employ the new RNAcofold routine for free energy computation and we improve the match between the set of estimated target genes and the set of biologically validated target genes nowadays known by using a genetic algorithm. Our method is more selective than miRanda. The validated targets with the proposed method have consistently better free energy than the others in the list; this is not true for miRanda. The increased selectivity of the method may also be used to guide experimental validation in a more focused direction.

Zsuzsanna Lipták: “Jumbled String Matching: Motivations, Variants, Algorithms”

In the classical exact string matching problem, we are given two strings s (the text) and t (the pattern), and we want to find all occurrences of t as substrings of s. In jumbled string matching, the goal is to find all occurrences of substrings t’ of s with the same multiplicity of characters as the pattern, i.e. all permuted (or jumbled) versions of the pattern. Alternatively, the pattern can be given as a vector specifying the character multiplicities (a so-called Parikh vector), and we want to find all occurrences of substrings with this Parikh vector.
For example, the text s=baacabccab has two occurrences of the pattern (2,1,1), i.e. of substrings consisting of two a’s, one b, and one c: in position 1 and in position 3. The online version of the problem can be solved optimally in O(n) time with a simple sliding window algorithm. However, the indexed version, where we want to answer multiple queries fast, using an index of the text, has proved challenging.
Variants of this problem are motivated by applications in mass spectrometry data interpretation, by gene clusters, and by motif search in networks. I will talk about different algorithms for several variants of the problem, including the case of binary alphabets, which has very specific properties, and which has aroused much interest in recent years.

Ph.D. course: Massive data analysis in the Web and post-genomic era

Lectures (April-May 2012):

1. Introduction Part I
April 16th (h. 10.30)
Paola Bonizzoni (Università di Milano-Bicocca)

The first lectures aim to provide the theoretical bases required to face the research topics introduced in  the course, as well as the main technological motivations of big data. The course is oriented to computer scientists, physicists, statisticians, genetic epidemiologists, bioinformaticians,  genome biologists and aims to open a discussion on the challenges and opportunities in next-generation sequencing data analysis and massive data analysis.
Massive data, deep sequencing and indexing techniques. Software tools.

2. Inferring Genetic Diversity from NGS
April 16th (h. 15.30) and 17th (h. 9.30)
Niko Beerenwinkel (ETH Zurich)

With high-coverage next-generation sequencing (NGS), the genetic diversity of mixed samples can be probed at an unprecedented level of detail in a cost-effective manner. However, NGS reads tend to be erroneous and they are relatively short, complicating the detection of low-frequency variants and the reconstruction of long haplotype sequences. In this lecture, I will introduce computational and statistical challenges associated witgenetic diversity estimation from NGS data. I will discuss several approaches to their solution based on probabilistic graphical models and on combinatorial optimization techniques. Two major applications will be presented: the genetic diversity of HIV within patients and the genetic diversity of cancer cells within tumors.
Part 1: Detecting low-frequency single-nucleotide variants (SNVs).
Part 2: Local haplotype inference and global quasispecies assembly.


3. Introduction Part II
April 23rd (h. 13:00) and 24th (h. 14:00)
Gianluca Della Vedova (Università di Milano-Bicocca)

Moore’s Law: current trends and the big data revolution. Approaches to work splitting: parallel algorithms, map reduce, data streaming.


4. The Paradigm of Data Stream for Next Generation Internet
May 2nd (14.00-16.00) and 3rd (9.00-11.00)
Irene Finocchi (Università La Sapienza – Roma)

Data stream algorithmics has gained increasing popularity in the last few years as an effective paradigm for processing massive data sets. A wide range of applications in computational sciences generate huge and rapidly changing streams of data that need to be continuously monitored and processed in one or few sequential passes, using a limited amount of working memory. Despite the heavy restrictions on time and space resources imposed by this data access model, major progress has been achieved in the last ten years in the design of streaming algorithms for several fundamental data sketching and statistics problems. The lectures will overview this rapidly evolving area and present basic algorithmic ideas, techniques, and challenges in data stream processing.Slides

5. Next Generation Sequencing analysis
May 8th (9.30-13.30)
Nadia Pisanti (Università di Pisa)

New Sequencing Technologies have dramatically decreased costs and thus opened the way to new challenges in applications such as metagenomics and transcriptome analysis by means of sequences; in particular, low costs of re-sequencing applied to the human genome opens the way to new issued in personalized medicine. As a consequence, a new phase has been opened for genome research. From the point of view of the computer scientist, the management of huge amount of data, the small size of sequenced fragments (with respect to previous technologies), and the new applications that bring down on sequences lots of data that used to be managed with arrays, has led to several new problems in string algorithms. We will try to give an overview on them and on possible approaches to address these problems.

Slides NGS Slides SV e CNV

6. Combinatoria delle parole ed applicazioni alla biologia teorica
May 29th (afternoon)
Giuseppe Pirillo (IASI, CNR)