A kmer-based parallel algorithm for pattern searching in DNA sequences on shared-memory model

Abstract:

The explosive growth of biological sequences over the last decade has consequently led

to a lot of research on effective techniques for storing and analysing this big data.

Generating Deoxyribonucleic Acid (DNA) sequences has become more reliable, faster,

easier and cheaper due to latest sequencing technology leading to big data sequences.

This poses a challenge to efficient analysis to such massive sequences. Advanced

indexing algorithmic techniques are required to facilitate efficient analysis. A particular

interesting indexing data structure for such analysis is the Suffix Tree. Search times for

patterns have linear times, however longer construction times and high memory

requirements make it prohibitive to use this data structure for very long sequences such

as DNA sequences. To index these long sequences, several algorithms have been

explored in recent years. These algorithms all target to improve running time and

memory consumption for the Suffix Tree. That is, being able to index a long sequence

by making it fit in the main memory and being able to look for patterns in reasonable

times. DNA sequences carry basic information for each human being. Searching for

patterns in DNA sequences has been shown to have biological relevance since patterns

signify some underlying biological function such as disease detection, genetic variation,

and regulation, etc. This thesis explores the use of truncating the generalised Suffix Tree

basing on the details of the input, that is, the size of the repeat to be searched. This is

aimed at improving the memory requirements. In this thesis, parallelisation of the Suffix

Tree data structure is further explored to reduce the construction times. Parallelisation

has become of more interest since parallel architectures are now ubiquitous in standard

desktop computers. Based on the theoretical and experimental findings, we can conclude

that specification of pattern size, prior to Suffix Tree construction by truncating, reduces

the memory requirements of the generalised Suffix Tree. In conclusion, the novel

alphabet dependant parallelisation algorithm achieves a speedup of 15x over the existing

state of art sequential algorithm. Despite the success of our Parallel algorithm for Suffix

Tree (PaST) algorithm, its main limitation is that of specification of pattern size prior to

construction or searching. Other than that, comparatively our algorithm shows improved

performance in terms of memory and time requirements.