On overabundant words and their application to biological sequence analysis

Yannis Almirantis; Panagiotis Charalampopoulos; Jia Gao; Costas S. Iliopoulos; Manal Mohamed; Solon P. Pissis; Dimitris Polychronopoulos

doi:10.1016/j.tcs.2018.09.011

Research output: Contribution to journal › Article › peer-review

Abstract

The observed frequency of the longest proper prefix, the longest proper suffix, and the longest infix of a word w in a given sequence x can be used for classifying w as avoided or overabundant. The definitions used for the expectation and deviation of w in this statistical model were described and biologically justified by Brendel et al. (J Biomol Struct Dyn 1986, [1]). We have very recently introduced a time-optimal algorithm for computing all avoided words of a given sequence over an integer alphabet (Algorithms Mol Biol 2017, [2]). In this article, we extend this study by presenting an O(n)-time and O(n)-space algorithm for computing all overabundant words in a sequence x of length n over an integer alphabet. Our main result is based on a new non-trivial combinatorial property of the suffix tree T of x: the number of distinct factors of x whose longest infix is the label of an explicit node of T is no more than 3n−4. We further show that the presented algorithm is time-optimal by proving that O(n) is a tight upper bound for the number of overabundant words. Finally, we present experimental results, using both synthetic and real data, which justify the effectiveness and efficiency of our approach in practical terms.
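The classification described in the abstract can be illustrated with a small sketch. The abstract does not spell out the formulas, so the ones used here are assumptions based on the usual statement of the Brendel et al. model. For a word w with longest proper prefix w_p, longest proper suffix w_s and longest infix w_i, the expectation is E(w) = f(w_p) · f(w_s) / f(w_i). The deviation is dev(w) = (f(w) − E(w)) / max(√E(w), 1). A word is reported as overabundant when dev(w) ≥ ρ for a chosen threshold ρ > 0. The function names, the threshold parameter `rho`, and the restriction to words of length at least 3 (so that the infix is non-empty) are all hypothetical choices made for illustration. The sketch counts factors naively, so it is not the O(n)-time suffix-tree algorithm of the article.

```python
from collections import Counter
from math import sqrt


def factor_counts(x: str) -> Counter:
    """Count occurrences of every non-empty factor (substring) of x.

    Naive enumeration of all O(n^2) factors; for illustration only.
    """
    counts = Counter()
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n + 1):
            counts[x[i:j]] += 1
    return counts


def deviation(w: str, f: Counter) -> float:
    """Deviation of w under the assumed model (see the lead-in)."""
    wp, ws, wi = w[:-1], w[1:], w[1:-1]  # longest proper prefix, suffix, infix
    # Assumed expectation: E(w) = f(wp) * f(ws) / f(wi), or 0 if f(wi) = 0.
    expected = f[wp] * f[ws] / f[wi] if f[wi] > 0 else 0.0
    # Assumed normalisation: divide by max(sqrt(E(w)), 1).
    return (f[w] - expected) / max(sqrt(expected), 1.0)


def overabundant_words(x: str, rho: float) -> list[str]:
    """Return, in sorted order, the factors of x of length >= 3 with dev >= rho."""
    f = factor_counts(x)
    return sorted(w for w in f if len(w) >= 3 and deviation(w, f) >= rho)


if __name__ == "__main__":
    print(overabundant_words("abcxbyabc", 0.5))
```

In the toy sequence "abcxbyabc", the word "abc" occurs twice while its assumed expectation is f(ab) · f(bc) / f(b) = 2 · 2 / 3, so its deviation is positive. Raising `rho` shrinks the reported set.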

Original language	English
Journal	Theoretical Computer Science
Early online date	12 Sept 2018
DOIs	https://doi.org/10.1016/j.tcs.2018.09.011
Publication status	E-pub ahead of print - 12 Sept 2018

Keywords

Overabundant words
Avoided words
Pattern matching
Suffix tree
DNA sequence analysis

Access to Document

On overabundant words and_ALMIRANTIS_Firstonline12September2018_GREEN AAM (CC BY-NC-ND): Accepted author manuscript, 436 KB. Licence: CC BY-NC-ND

Cite this

@article{ec2b4dfd49cb4be3bc3d7cc48325339e,

title = "On overabundant words and their application to biological sequence analysis",

abstract = "The observed frequency of the longest proper prefix, the longest proper suffix, and the longest infix of a word w in a given sequence x can be used for classifying w as avoided or overabundant. The definitions used for the expectation and deviation of w in this statistical model were described and biologically justified by Brendel et al. (J Biomol Struct Dyn 1986, [1]). We have very recently introduced a time-optimal algorithm for computing all avoided words of a given sequence over an integer alphabet (Algorithms Mol Biol 2017, [2]). In this article, we extend this study by presenting an O(n)-time and O(n)-space algorithm for computing all overabundant words in a sequence x of length n over an integer alphabet. Our main result is based on a new non-trivial combinatorial property of the suffix tree T of x: the number of distinct factors of x whose longest infix is the label of an explicit node of T is no more than 3n−4. We further show that the presented algorithm is time-optimal by proving that O(n) is a tight upper bound for the number of overabundant words. Finally, we present experimental results, using both synthetic and real data, which justify the effectiveness and efficiency of our approach in practical terms.",

keywords = "Overabundant words, Avoided words, Pattern matching, Suffix tree, DNA sequence analysis",

author = "Yannis Almirantis and Panagiotis Charalampopoulos and Jia Gao and Iliopoulos, {Costas S.} and Manal Mohamed and Pissis, {Solon P.} and Dimitris Polychronopoulos",

year = "2018",

month = sep,

day = "12",

doi = "10.1016/j.tcs.2018.09.011",

language = "English",

journal = "Theoretical Computer Science",

issn = "0304-3975",

publisher = "Elsevier",

}

TY - JOUR

T1 - On overabundant words and their application to biological sequence analysis

AU - Almirantis, Yannis

AU - Charalampopoulos, Panagiotis

AU - Gao, Jia

AU - Iliopoulos, Costas S.

AU - Mohamed, Manal

AU - Pissis, Solon P.

AU - Polychronopoulos, Dimitris

PY - 2018/9/12

Y1 - 2018/9/12

AB - The observed frequency of the longest proper prefix, the longest proper suffix, and the longest infix of a word w in a given sequence x can be used for classifying w as avoided or overabundant. The definitions used for the expectation and deviation of w in this statistical model were described and biologically justified by Brendel et al. (J Biomol Struct Dyn 1986, [1]). We have very recently introduced a time-optimal algorithm for computing all avoided words of a given sequence over an integer alphabet (Algorithms Mol Biol 2017, [2]). In this article, we extend this study by presenting an O(n)-time and O(n)-space algorithm for computing all overabundant words in a sequence x of length n over an integer alphabet. Our main result is based on a new non-trivial combinatorial property of the suffix tree T of x: the number of distinct factors of x whose longest infix is the label of an explicit node of T is no more than 3n−4. We further show that the presented algorithm is time-optimal by proving that O(n) is a tight upper bound for the number of overabundant words. Finally, we present experimental results, using both synthetic and real data, which justify the effectiveness and efficiency of our approach in practical terms.

KW - Overabundant words

KW - Avoided words

KW - Pattern matching

KW - Suffix tree

KW - DNA sequence analysis

U2 - 10.1016/j.tcs.2018.09.011

DO - 10.1016/j.tcs.2018.09.011

M3 - Article

SN - 0304-3975

JO - Theoretical Computer Science

JF - Theoretical Computer Science

ER -
