Abstract
Given a string on alphabet (Formula presented.) the partitioning problem is to compute classes of equivalences on the set of positions of the input string. These classes implicitly memorise identical factors of the string and, hence, their efficient computation is essential for a wide range of string processing applications. We study this problem for a weighted string: for every position of the weighted string and every letter of the alphabet a probability of occurrence of this letter at this position is given. Thus a weighted string may represent many different strings, each with probability of occurrence equal to the product of probabilities of its letters at subsequent positions. In this article, we present a non-trivial generalisation of Crochemore’s partitioning algorithm (IPL, 1981) that works on weighted strings requiring time (Formula presented.), where n is the length of the string, (Formula presented.), (Formula presented.) is the size of (Formula presented.), and 1 / z is a cumulative weight threshold, defined as the minimal probability of occurrence of factors in the string. Our contributions can be summarised as follows: (a) we design the first algorithm to solve the partitioning problem on weighted strings for arbitrary z and (Formula presented.) in time (Formula presented.) and space (Formula presented.) improving the state of the art for (Formula presented.); (b) we improve the state of the art for numerous other string processing problems; and (c) we show further combinatorial insight into the relation between weighted and indeterminate strings, that is, sequences of alphabet subsets without associated occurrence probabilities.
Original language | English |
---|---|
Pages (from-to) | 496–514 |
Number of pages | 19 |
Journal | ALGORITHMICA |
Volume | 80 |
Early online date | 27 Jan 2017 |
DOIs | |
Publication status | Published - Feb 2018 |
Keywords
- Covers
- Crochemore partitioning
- Position weight matrix
- Repetitions
- Seeds
- Uncertain sequences
- Weighted strings