Linear-time computation of minimal absent words using suffix array

Carl Barton; Alice Heliou; Laurent Mouchard; Solon P. Pissis

doi:10.1186/1471-2105-15-388

Informatics

Research output: Contribution to journal › Article › peer-review

36 Citations (Scopus)

Abstract

Background:
An absent word of a word y of length n is a word that does not occur in y. It is a minimal absent word if all its proper factors occur in y. Minimal absent words have been computed in genomes of organisms from all domains of life; their computation also provides a fast alternative for measuring approximation in sequence comparison. There exists an O(n)-time and O(n)-space algorithm for computing all minimal absent words on a fixed-sized alphabet based on the construction of suffix automata (Crochemore et al., 1998). No implementation of this algorithm is publicly available. There also exists an O(n^2)-time and O(n)-space algorithm for the same problem based on the construction of suffix arrays (Pinho et al., 2009). An implementation of this algorithm was also provided by the authors and is currently the fastest available.
Results:
Our contribution in this article is twofold: first, we bridge this unpleasant gap by presenting an O(n)-time and O(n)-space algorithm for computing all minimal absent words based on the construction of suffix arrays; and second, we provide the respective implementation of this algorithm. Experimental results, using real and synthetic data, show that this implementation outperforms the one by Pinho et al. The open-source code of our implementation is freely available at http://github.com/solonas13/maw.
Conclusions:
Classical notions for sequence comparison are increasingly being replaced by other similarity measures that refer to the composition of sequences in terms of their constituent patterns. One such measure is based on minimal absent words. In this article, we present a new linear-time and linear-space algorithm for the computation of minimal absent words based on the suffix array.
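
To make the definition in the abstract concrete, the following is a minimal brute-force sketch in Python. It is not the linear-time suffix-array algorithm of the article; it enumerates all factors of y explicitly and is intended only for very short words. The function names `factors` and `minimal_absent_words` are hypothetical and chosen for illustration. The sketch relies on the fact that every proper factor of a word x is a factor of its longest proper prefix or its longest proper suffix, so it is enough to check those two.

```python
def factors(y):
    """Return the set of all factors (substrings) of y, including the empty word."""
    return {y[i:j] for i in range(len(y) + 1) for j in range(i, len(y) + 1)}


def minimal_absent_words(y, alphabet):
    """Brute-force illustration of the definition (NOT the article's algorithm).

    A word x is a minimal absent word of y if x does not occur in y
    but all proper factors of x do. Checking the longest proper prefix
    x[:-1] and the longest proper suffix x[1:] is sufficient.
    """
    occurring = factors(y)
    result = set()
    # Every candidate x has an occurring prefix x[:-1], so x = f + c
    # for some occurring factor f and some letter c of the alphabet.
    for f in occurring:
        for c in alphabet:
            x = f + c
            if x not in occurring and x[1:] in occurring:
                result.add(x)
    # Sort by length, then lexicographically, for a stable output order.
    return sorted(result, key=lambda w: (len(w), w))


if __name__ == "__main__":
    print(minimal_absent_words("abaab", "ab"))
```

For y = abaab over the alphabet {a, b}, the sketch reports bb, aaa, bab and aaba: each is absent from y, while every proper factor of each (for example aab and aba for aaba) occurs in y. Materialising the full factor set makes this approach far slower than linear time, which is the cost the suffix-array approach is meant to avoid.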

Original language	English
Article number	388
Number of pages	10
Journal	BMC Bioinformatics
Volume	15
Issue number	1
DOIs	https://doi.org/10.1186/1471-2105-15-388
Publication status	Published - Dec 2014

Cite this

@article{9ebf56c676e24093bbbb47110c81a5e8,

title = "Linear-time computation of minimal absent words using suffix array",

abstract = "Background: An absent word of a word y of length n is a word that does not occur in y. It is a minimal absent word if all its proper factors occur in y. Minimal absent words have been computed in genomes of organisms from all domains of life; their computation also provides a fast alternative for measuring approximation in sequence comparison. There exists an O(n)-time and O(n)-space algorithm for computing all minimal absent words on a fixed-sized alphabet based on the construction of suffix automata (Crochemore et al., 1998). No implementation of this algorithm is publicly available. There also exists an O(n^2)-time and O(n)-space algorithm for the same problem based on the construction of suffix arrays (Pinho et al., 2009). An implementation of this algorithm was also provided by the authors and is currently the fastest available. Results: Our contribution in this article is twofold: first, we bridge this unpleasant gap by presenting an O(n)-time and O(n)-space algorithm for computing all minimal absent words based on the construction of suffix arrays; and second, we provide the respective implementation of this algorithm. Experimental results, using real and synthetic data, show that this implementation outperforms the one by Pinho et al. The open-source code of our implementation is freely available at http://github.com/solonas13/maw. Conclusions: Classical notions for sequence comparison are increasingly being replaced by other similarity measures that refer to the composition of sequences in terms of their constituent patterns. One such measure is the minimal absent words. In this article, we present a new linear-time and linear-space algorithm for the computation of minimal absent words based on the suffix array.",

author = "Carl Barton and Alice Heliou and Laurent Mouchard and Pissis, {Solon P.}",

year = "2014",

month = dec,

doi = "10.1186/1471-2105-15-388",

language = "English",

volume = "15",

journal = "BMC Bioinformatics",

issn = "1471-2105",

publisher = "BioMed Central",

number = "1",

}

TY - JOUR

T1 - Linear-time computation of minimal absent words using suffix array

AU - Barton, Carl

AU - Heliou, Alice

AU - Mouchard, Laurent

AU - Pissis, Solon P.

PY - 2014/12

Y1 - 2014/12

N2 - Background: An absent word of a word y of length n is a word that does not occur in y. It is a minimal absent word if all its proper factors occur in y. Minimal absent words have been computed in genomes of organisms from all domains of life; their computation also provides a fast alternative for measuring approximation in sequence comparison. There exists an O(n)-time and O(n)-space algorithm for computing all minimal absent words on a fixed-sized alphabet based on the construction of suffix automata (Crochemore et al., 1998). No implementation of this algorithm is publicly available. There also exists an O(n^2)-time and O(n)-space algorithm for the same problem based on the construction of suffix arrays (Pinho et al., 2009). An implementation of this algorithm was also provided by the authors and is currently the fastest available. Results: Our contribution in this article is twofold: first, we bridge this unpleasant gap by presenting an O(n)-time and O(n)-space algorithm for computing all minimal absent words based on the construction of suffix arrays; and second, we provide the respective implementation of this algorithm. Experimental results, using real and synthetic data, show that this implementation outperforms the one by Pinho et al. The open-source code of our implementation is freely available at http://github.com/solonas13/maw. Conclusions: Classical notions for sequence comparison are increasingly being replaced by other similarity measures that refer to the composition of sequences in terms of their constituent patterns. One such measure is the minimal absent words. In this article, we present a new linear-time and linear-space algorithm for the computation of minimal absent words based on the suffix array.

AB - Background: An absent word of a word y of length n is a word that does not occur in y. It is a minimal absent word if all its proper factors occur in y. Minimal absent words have been computed in genomes of organisms from all domains of life; their computation also provides a fast alternative for measuring approximation in sequence comparison. There exists an O(n)-time and O(n)-space algorithm for computing all minimal absent words on a fixed-sized alphabet based on the construction of suffix automata (Crochemore et al., 1998). No implementation of this algorithm is publicly available. There also exists an O(n^2)-time and O(n)-space algorithm for the same problem based on the construction of suffix arrays (Pinho et al., 2009). An implementation of this algorithm was also provided by the authors and is currently the fastest available. Results: Our contribution in this article is twofold: first, we bridge this unpleasant gap by presenting an O(n)-time and O(n)-space algorithm for computing all minimal absent words based on the construction of suffix arrays; and second, we provide the respective implementation of this algorithm. Experimental results, using real and synthetic data, show that this implementation outperforms the one by Pinho et al. The open-source code of our implementation is freely available at http://github.com/solonas13/maw. Conclusions: Classical notions for sequence comparison are increasingly being replaced by other similarity measures that refer to the composition of sequences in terms of their constituent patterns. One such measure is the minimal absent words. In this article, we present a new linear-time and linear-space algorithm for the computation of minimal absent words based on the suffix array.

U2 - 10.1186/1471-2105-15-388

DO - 10.1186/1471-2105-15-388

M3 - Article

SN - 1471-2105

VL - 15

JO - BMC Bioinformatics

JF - BMC Bioinformatics

IS - 1

M1 - 388

ER -