Ph.D. course: Massive data analysis in the Web and post-genomic era

Lectures (April-May 2012):

1.	Introduction Part I April 16th (h. 10.30) Paola Bonizzoni (Università di Milano-Bicocca) The first lectures aim to provide the theoretical foundations needed to tackle the research topics introduced in the course, as well as the main technological motivations behind big data. The course is aimed at computer scientists, physicists, statisticians, genetic epidemiologists, bioinformaticians and genome biologists, and seeks to open a discussion on the challenges and opportunities in next-generation sequencing data analysis and massive data analysis. Topics: massive data, deep sequencing and indexing techniques; software tools.
2.	Inferring Genetic Diversity from NGS April 16th (h. 15.30) and 17th (h. 9.30) Niko Beerenwinkel (ETH Zurich) With high-coverage next-generation sequencing (NGS), the genetic diversity of mixed samples can be probed at an unprecedented level of detail in a cost-effective manner. However, NGS reads tend to be erroneous and they are relatively short, complicating the detection of low-frequency variants and the reconstruction of long haplotype sequences. In this lecture, I will introduce computational and statistical challenges associated with genetic diversity estimation from NGS data. I will discuss several approaches to solving them, based on probabilistic graphical models and on combinatorial optimization techniques. Two major applications will be presented: the genetic diversity of HIV within patients and the genetic diversity of cancer cells within tumors. Part 1: Detecting low-frequency single-nucleotide variants (SNVs). Part 2: Local haplotype inference and global quasispecies assembly. Slides
3.	Introduction Part II April 23rd (h. 13:00) and 24th (h. 14:00) Gianluca Della Vedova (Università di Milano-Bicocca) Moore’s Law: current trends and the big data revolution. Approaches to work splitting: parallel algorithms, MapReduce, data streaming. Slides
4.	The Paradigm of Data Stream for Next Generation Internet May 2nd (14.00-16.00) and 3rd (9.00-11.00) Irene Finocchi (Università La Sapienza – Roma) Data stream algorithmics has gained increasing popularity in the last few years as an effective paradigm for processing massive data sets. A wide range of applications in computational sciences generate huge and rapidly changing streams of data that need to be continuously monitored and processed in one or a few sequential passes, using a limited amount of working memory. Despite the heavy restrictions on time and space resources imposed by this data access model, major progress has been achieved in the last ten years in the design of streaming algorithms for several fundamental data sketching and statistics problems. The lectures will give an overview of this rapidly evolving area and present basic algorithmic ideas, techniques, and challenges in data stream processing. Slides
5.	Next Generation Sequencing analysis May 8th (9.30-13.30) Nadia Pisanti (Università di Pisa) New Sequencing Technologies have dramatically decreased costs and thus opened the way to new challenges in applications such as metagenomics and transcriptome analysis by means of sequences; in particular, the low cost of re-sequencing applied to the human genome opens the way to new issues in personalized medicine. As a consequence, a new phase has begun for genome research. From the point of view of the computer scientist, the management of huge amounts of data, the small size of sequenced fragments (with respect to previous technologies), and the new applications that move to sequences large amounts of data that used to be handled with arrays have led to several new problems in string algorithms. We will try to give an overview of these problems and of possible approaches to address them. Slides NGS Slides SV and CNV
6.	Combinatorics on words and applications to theoretical biology May 29th (afternoon) Giuseppe Pirillo (IASI, CNR)