Abstract
This thesis is devoted to algorithms and data structures for classical problems in stringology in the internal and dynamic settings.
In the internal setting, the task is to preprocess a string S so that queries of interest for substrings of S can be answered efficiently. Note that a substring of S can be specified in constant time by the indices of an occurrence of it as a fragment of S. Efficient data structures for problems in the internal setting have proved useful as building blocks in the design of algorithms and data structures for problems on strings.
Here, we consider the INTERNAL DICTIONARY MATCHING problem, where one is given a text T of length n and a dictionary D consisting of substrings of T, each given as a fragment of T. This way, D takes space proportional to the number d=|D| of patterns rather than their total length, which could be Θ(n∙d). We show how to construct in (n+d)∙log^{O(1)}n time a data structure that answers the following types of queries in log^{O(1)}n+O(|output|) time: reporting/counting all occurrences of patterns from D in a fragment T[i..j] and reporting distinct patterns from D that occur in T[i..j]. We also present data structures for the problem of counting distinct patterns from D that occur in T[i..j]. Finally, we also address these problems for a dynamic dictionary, in which case queries are interleaved with updates to D.
In the dynamic setting, the task is to maintain a collection of strings under updates, such as insertions and deletions of letters, so that information of interest can be efficiently retrieved. We consider the following problems in such a setting:
- INTERNAL PATTERN MATCHING: We show that the data structure of Gawrychowski et al. [SODA 2018] for maintaining a persistent collection X of strings, of total length at most N, under operations “split”, “concatenate” and “insert a string of length 1 to X”, in O(logN) time each, can be augmented to return the occurrences of any string P ϵ X in any string T ϵ X in O(|T|/|P|∙log^{2}N) time.
- REPETITION DETECTION: We show how to maintain the longest square substring of a dynamic string, of length at most n, in O(log^{3}n) time per update, using O(n) space. One of the main ingredients of our algorithm is our aforementioned implementation of internal pattern matching queries in the dynamic setting.
- LONGEST COMMON FACTOR: We show that a longest common factor (equiv. substring) of two dynamic strings, of total length at most n, can be maintained in amortised O(log^{8}n) time per update, using Õ(n) space. An exponentially faster algorithm with polynomial space requirements is ruled out due to a lower bound in the cell-probe model of computation [Charalampopoulos et al., ICALP 2020]. We also show more efficient solutions for the case where one of the two strings is static.
- STRING ALIGNMENT: We consider the problem of dynamically maintaining an optimal alignment of two strings, each of length at most n, as they undergo edit operations. The string alignment problem generalises the longest common subsequence (LCS) problem and the edit distance problem (as long as insertions and deletions cost the same). The conditional lower bound of Backurs and Indyk [SICOMP 2018] for computing the LCS in the static case implies that strongly sublinear update time for the problem in scope is unlikely. We essentially match this lower bound when the alignment weights are constants, by showing how to process each update in Õ(n) time. When the weights are integers, bounded in absolute value by some w=n^{O(1)}, we can maintain the alignment in n∙min{√n,w}∙log^{O(1)}n time per update.
Interestingly, we use a wide range of tools in order to obtain our results, such as: string periodicity, locally consistent parsing of strings, orthogonal range queries, efficient (min,+)-multiplication of simple unit-Monge matrices, and data structures for computing distances in planar graphs.
Date of Award | 1 May 2021 |
---|---|
Original language | English |
Awarding Institution |
|
Supervisor | Tomasz Radzik (Supervisor) & Solon Pissis (Supervisor) |