Genome Assembly and Indexing Data Structures for Bioinformatics Applications

Genome assembly is one of the fundamental problems in bioinformatics. Although studied for more than twenty years, new methods and tools for solving this problem are published monthly. Research in this direction is mostly motivated by the sheer amount of data produced by second and third generation sequencing machines that makes managing and analyzing genomic data challenging.

Genome assembly, and in particular de novo genome assembly, is closely related to efficient indexing strategies of the input data and to stringology. In particular the Burrows-Wheeler Transform and the de Bruijn graphs have a central role in this field.

The group research in this field mostly focuses on developing softwares to assemble genomes from raw data, in developing new algorithms for building the multi-BWT of a sequencing experiment, and in developing new data structures to index de Bruijn graphs.

SOFTWARE

  • LightStringGraph : C++ software that builds string graphs using negligible amount of RAM.
  • FastStringGraph : C++ software that builds string graphs in parallel.
  • bwt-lcp-em : C++ software that builds the multi-BWT and the LCP of a sequencing experiment (in FASTA format) with an extremely low main memory footprint.

REFERENCES

Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, Marco Previtali, Raffaella Rizzi: An External-Memory Algorithm for String Graph Construction. Algorithmica 78(2): 394-424 (2017)

Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, Marco Previtali, Raffaella Rizzi: FSG: Fast String Graph Construction for De Novo Assembly. Journal of Computational Biology 24(10): 953-968 (2017)

Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, Marco Previtali, Raffaella Rizzi: Computing the BWT and LCP array of a Set of Strings in External Memory. CoRR abs/1705.07756 (2017)

Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, Marco Previtali, Raffaella Rizzi: LSG: An External-Memory Tool to Compute String Graphs for Next-Generation Sequencing Data Assembly. Journal of Computational Biology 23(3): 137-149 (2016)

Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, Marco Previtali, Raffaella Rizzi: FSG: Fast String Graph Construction for De Novo Assembly of Reads Data. ISBRA 2016: 27-39

Djamal Belazzougui, Travis Gagie, Veli Mäkinen, Marco Previtali, Simon J. Puglisi: Bidirectional Variable-Order de Bruijn Graphs. LATIN 2016: 164-178

Djamal Belazzougui, Travis Gagie, Veli Mäkinen, Marco Previtali: Fully Dynamic de Bruijn Graphs. SPIRE 2016: 145-152

Paola Bonizzoni, Gianluca Della Vedova, Serena Nicosia, Marco Previtali, Raffaella Rizzi: A New Lightweight Algorithm to compute the BWT and the LCP array of a Set of Strings. CoRR abs/1607.08342 (2016)

Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, Marco Previtali, Raffaella Rizzi: Constructing String Graphs in External Memory. WABI 2014: 311-325