**27 Novembre 2013**

**Aula Seminari**

Dipartimento di Informatica, Sistemistica e Comunicazione (Edificio U14)

Univ. degli Studi di Milano-Bicocca

#### Presentazione

Nell’ambito del Progetto PRIN 2010/2011 “Automi e Linguaggi Formali: aspetti matematici e applicativi” si svolgerà il 27 Novembre 2013 presso il Dipartimento di Informatica, Sistemistica e Comunicazione dell’Università di Milano-Bicocca un workshop dedicato a strutture combinatorie per l’analisi di sequenze in bioinformatica.

L’incontro si propone l’obiettivo di dare la possibilità, sia ai ricercatori del progetto PRIN che ad altri che si occupano di algoritmi su stringhe e sequenze di riunirsi per presentare le proprie linee di ricerca, i risultati ottenuti e i problemi aperti al fine di iniziare nuove collaborazioni e rinforzare quelle già esistenti.

#### Programma

9.15 – 9.30 Apertura lavori

9.30 – 10.30 Gregory Kucherov: *“Algorithms and data structures for Next-Generation Sequencing data”*.

10.30 – 11.00 Discussione

11.00 – 11.30 Giovanna Rosone: *“Extended Burrows-Wheeler Transform and analysis of biological sequences”*

11.30 – 11.45 Discussione

11.45 – 12.15 Yuri Pirola: *“Combinatorial structures and NGS data in Transcriptomics: some results”*

12.15 – 12.45 Paola Bonizzoni: *“Combinatorial structures and NGS data in Transcriptomics: some open problems”*

12.45 – 13.00 Discussione

13.00 – 14.30 Pausa pranzo

14.30 – 15.00 Francesco Masulli: *“Predicting microRNA Target Genes”*

15.00 – 15.15 Discussione

15.15 – 15.45 Zsuzsanna Lipták: *“Jumbled String Matching: Motivations, Variants, Algorithms”*

15.45 – 16.00 Discussione

16.00 – 18.00 Apertura della tavola rotonda: presentazione di problemi aperti e discussione

#### Ulteriori informazioni

Il workshop si terrà nell’aula Seminari del Dipartimento di Informatica, Sistemistica e Comunicazione. Indicazioni su come raggiungere la sede del workshop sono disponibili a questa pagina.

Per ogni informazione, contattare *gianluca.dellavedova@unimib.it*

#### Abstract degli interventi

Gregory Kucherov: “*Algorithms and data structures for Next-Generation Sequencing data*“.

In the middle of the 00’s, the whole area of bioinformatics, already under rapide development, underwent a new breakthrough caused by the advent of Next Generation Sequencing (NGS) technologies. I will give a very quick introduction to NGS for computer scientists, highlighting main involved computational challenges. Then I’ll survey some my contributions to the area, namely on read mapping on the one hand, and on storing de Bruijn graphs for genome assembly on the other hand.

Giovanna Rosone: “*Extended Burrows-Wheeler Transform and analysis of biological sequences*”

The advent of “next-generation” DNA sequencing (NGS) technologies has meant that collections of hundreds of millions of DNA sequences are now commonplace in Bioinformatics.

For instance, a human genome sequencing experiment might yield a billion or more sequences, each 100 characters in length. Such a dataset can now be generated in just a few days on a single sequencing machine.

Many algorithms and data structures for compression and analysis of a text have the BWT at their heart, and it would be of great interest to explore their applications to sequence collections such as these.

We describe and highlight an extension of the Burrows-Wheeler Transform (EBWT), of the suffix array and of the longest common prefix array to very large string collections.

We introduce a new methodology for computing the EBWT for 100 billion characters or more of data in external memory. We use the EBWT for applications in the context of string processing, such as large-scale compression of genomic sequence databases, compression of sequence quality scores and comparison of DNA Sequence Collections.

Yuri Pirola: *“Combinatorial structures and NGS data in Transcriptomics: some results”*

Next-Generation Sequencing (NGS) technologies need new methodologies for Alternative Splicing (AS) analysis.

In this talk we discuss a combinatorial structure that gives a compact representation of gene structures—called splicing graph—and investigate the computational problem of building a splicing graph that is compatible with NGS data. More precisely, a fundamental question we are going to investigate is: *under which conditions can the reconstruction of a gene structure be efficiently accomplished using only information provided by RNA-Seq data*?

In order to partially answer this question, we introduce the formal definition of the computational problem of reconstructing the gene structure from RNA-Seq data when the solution is represented by a splicing graph and we give some necessary conditions under which the reconstructed splicing graph represents the real (unknown) gene structure. Finally we describe an efficient algorithm that, under some conditions, is able to exactly solve our problem but that is also able to achieve good accuracy on real genes violating such conditions.

Bonizzoni Paola: *“Combinatorial structures and NGS data in Transcriptomics: some open problems”*

In this talk we discuss new approaches to the comparison of short sequences (RNA-Seq sequences) produced by NGS technologies when applied to transcriptomic data. We discuss the computational problem of assembling a string-graph from RNA-Seq data using the Burrows Wheeler transform and a space efficient FM-index to detect overlaps in a collection of sequences.

Francesco Masulli: *“Predicting microRNA Target Genes”*

miRNA (microRNAs) are a class of short, non coding RNA capable of base-pairing with imperfect complementarity to the transcripts of animal protein-coding genes (targets). In such a way, they negatively regulate gene expression; however, the exact mechanism remains still unclear. Since each miRNA targets several hundreds of transcripts and each transcript can be the target of multiple miRNAs, the complexity of this class of RNA is remarkable. miRNA are in involved in the pathway of several biological processes, including tumors such as prostate cancer, lung cancer, breast cancer, and colorectal cancer, as well neurogenesis and synaptic function. Computational prediction of miRNA targets is not an easy task. In the last few years, more than 10 target prediction programs have been developed. They typically analyze sequences by a sliding window and, in order to reduce false positive predictions, they test each site of the gene with some algorithms in cascade. miRanda, proposed in 2003, is the most popular miRNA target prediction method. It splits the target site prediction task into three distinct steps carried out in sequence: (a) Homology evaluation; (b) Free energy computation; (c) Evolutionary conservation computation. The parameters used in miRanda have been manually optimized according to biological knowledge and software tools available in 2003 (with some updates).

We present an approach to improve the computational estimation of target sites, maximizing the adherence of results to biological evidence. The proposed method provides some improvements on the original miRanda program. In particular, we employ the new RNAcofold routine for free energy computation and we improve the match between the set of estimated target genes and the set of biologically validated target genes nowadays known by using a genetic algorithm. Our method is more selective than miRanda. The validated targets with the proposed method have consistently better free energy than the others in the list; this is not true for miRanda. The increased selectivity of the method may also be used to guide experimental validation in a more focused direction.

Zsuzsanna Lipták: *“Jumbled String Matching: Motivations, Variants, Algorithms”*

In the classical exact string matching problem, we are given two strings s (the text) and t (the pattern), and we want to find all occurrences of t as substrings of s. In jumbled string matching, the goal is to find all occurrences of substrings t’ of s with the same multiplicity of characters as the pattern, i.e. all permuted (or jumbled) versions of the pattern. Alternatively, the pattern can be given as a vector specifying the character multiplicities (a so-called Parikh vector), and we want to find all occurrences of substrings with this Parikh vector.

For example, the text s=baacabccab has two occurrences of the pattern (2,1,1), i.e. of substrings consisting of two a’s, one b, and one c: in position 1 and in position 3. The online version of the problem can be solved optimally in O(n) time with a simple sliding window algorithm. However, the indexed version, where we want to answer multiple queries fast, using an index of the text, has proved challenging.

Variants of this problem are motivated by applications in mass spectrometry data interpretation, by gene clusters, and by motif search in networks. I will talk about different algorithms for several variants of the problem, including the case of binary alphabets, which has very specific properties, and which has aroused much interest in recent years.