Avoided Words, Overabundant Words, Maximal Palindromes and Applicatons

Student thesis: Doctoral ThesisDoctor of Philosophy

Abstract

The study of string algorithms is essential for computer science and computational molecular biology. Recently, a number of algorithms on strings appear in many fields, such as pattern matching, combinatorics on words, string processing and automata the-ory, because of their magnitude advances in applications and theories. These advances have made to develop faster algorithms and to deal with certain natural problems.
In this thesis, we focus on computing certain structures in biological sequences using different algorithmic methods. Firstly, we study the computation of avoided words and overabundant words in biological sequences. The observed frequency of the longest proper prefix, the longest proper suffix, and the longest infix of a word in a given sequence can be used for classifying that word as avoided or overabundant. This concept is particularly useful in DNA linguistic analysis. The definitions used for the expectation and deviation of the word in this statistical model were described and biologically justified by Brendel et al. [19]. We present some algorithms that can be used effectively for computing such words. Furthermore, experimental results, using both real and synthetic data, which further highlight the effectiveness of this model, show the efficiency and applicability of our implementation in biological sequence analysis.
Secondly, we consider a special type of uncertain sequence called weighted string. In a weighted string every position contains a subset of the alphabet and every letter of the alphabet is associated with a probability of occurrence such that the sum of probabilities at each position equals 1. We generalize Alatabbi et al.’s [2] solution for standard strings to compute maximal palindromes of a weighted string. We provide some efficient algorithms to compute maximal palindromes in weighted strings. Last but not least, we make available an implementation of our algorithms, using synthetic data, show the efficiency of our implementation.
Date of Award2018
Original languageEnglish
Awarding Institution
  • King's College London
SupervisorCostas Iliopoulos (Supervisor) & Max Crochemore (Supervisor)

Cite this

'