Text indexing

Text indexing in genomics

We present an efficient external memory algorithm to compute the string graph from a collection of reads, which is a fundamental data representation used for sequence assembly.

Our algorithm builds upon some recent results on lightweight Burrows-Wheeler Transform (BWT) and Longest Common Prefix (LCP) construction providing, as a by-product, an efficient procedure to extend intervals of the BWT that could be of independent interest.

We have implemented our algorithm and compared its efficiency against SGA—the most advanced assembly string graph construction program.

Constructing String Graphs in External Memory P Bonizzoni, G Della Vedova, Y Pirola, M Previtali, R Rizzi WABI Algorithms in Bioinformatics, 311-325