Abstract:
The explosive growth of biological sequences over the last decade has consequently led
to a lot of research on effective techniques for storing and analysing this big data.
Generating Deoxyribonucleic Acid (DNA) sequences has become more reliable, faster,
easier and cheaper due to latest sequencing technology leading to big data sequences.
This poses a challenge to efficient analysis to such massive sequences. Advanced
indexing algorithmic techniques are required to facilitate efficient analysis. A particular
interesting indexing data structure for such analysis is the Suffix Tree. Search times for
patterns have linear times, however longer construction times and high memory
requirements make it prohibitive to use this data structure for very long sequences such
as DNA sequences. To index these long sequences, several algorithms have been
explored in recent years. These algorithms all target to improve running time and
memory consumption for the Suffix Tree. That is, being able to index a long sequence
by making it fit in the main memory and being able to look for patterns in reasonable
times. DNA sequences carry basic information for each human being. Searching for
patterns in DNA sequences has been shown to have biological relevance since patterns
signify some underlying biological function such as disease detection, genetic variation,
and regulation, etc. This thesis explores the use of truncating the generalised Suffix Tree
basing on the details of the input, that is, the size of the repeat to be searched. This is
aimed at improving the memory requirements. In this thesis, parallelisation of the Suffix
Tree data structure is further explored to reduce the construction times. Parallelisation
has become of more interest since parallel architectures are now ubiquitous in standard
desktop computers. Based on the theoretical and experimental findings, we can conclude
that specification of pattern size, prior to Suffix Tree construction by truncating, reduces
the memory requirements of the generalised Suffix Tree. In conclusion, the novel
alphabet dependant parallelisation algorithm achieves a speedup of 15x over the existing
state of art sequential algorithm. Despite the success of our Parallel algorithm for Suffix
Tree (PaST) algorithm, its main limitation is that of specification of pattern size prior to
construction or searching. Other than that, comparatively our algorithm shows improved
performance in terms of memory and time requirements.
Freeson, K (2024). A kmer-based parallel algorithm for pattern searching in DNA sequences on shared-memory model. Afribary. Retrieved from https://tracking.afribary.com/works/a-kmer-based-parallel-algorithm-for-pattern-searching-in-dna-sequences-on-shared-memory-model
Freeson, Kaniwa "A kmer-based parallel algorithm for pattern searching in DNA sequences on shared-memory model" Afribary. Afribary, 30 Mar. 2024, https://tracking.afribary.com/works/a-kmer-based-parallel-algorithm-for-pattern-searching-in-dna-sequences-on-shared-memory-model. Accessed 21 Nov. 2024.
Freeson, Kaniwa . "A kmer-based parallel algorithm for pattern searching in DNA sequences on shared-memory model". Afribary, Afribary, 30 Mar. 2024. Web. 21 Nov. 2024. < https://tracking.afribary.com/works/a-kmer-based-parallel-algorithm-for-pattern-searching-in-dna-sequences-on-shared-memory-model >.
Freeson, Kaniwa . "A kmer-based parallel algorithm for pattern searching in DNA sequences on shared-memory model" Afribary (2024). Accessed November 21, 2024. https://tracking.afribary.com/works/a-kmer-based-parallel-algorithm-for-pattern-searching-in-dna-sequences-on-shared-memory-model