Constructing Antidictionaries in Output-Sensitive Space

Lorraine A.K. Ayad; Golnaz Badkobeh; Gabriele Fici; Alice Heliou; Solon P. Pissis

doi:10.1109/DCC.2019.00062

Constructing Antidictionaries in Output-Sensitive Space

Lorraine A.K. Ayad, Golnaz Badkobeh, Gabriele Fici, Alice Heliou, Solon P. Pissis

Informatics

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › peer-review

6 Citations (Scopus)

100 Downloads (Pure)

Abstract

A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y-1, y-2,...,y-k over an alphabet Σ, we are asked to compute the set M^ℓ-y-1#...#y-k of minimal absent words of length at most ℓ of word y=y-1#y-2#...#y-k, NotElementΣ. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. This computation generally requires Ω(n) space for n=|y| using any of the plenty available O(n)-time algorithms. This is because an Ω(n)-sized text index is constructed over y which can be impractical for large n. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when ||M^ℓ-y-1#...#y-N || =o(n), for all N [1, k]. For instance, in the human genome, n ≈ 3 × 10^9 but ||M^12-y-1#...#y-k|| ≈ 10^6. We consider a constant-sized alphabet for stating our results. We show that all M^ℓ-y-1,...,M^ℓ-y-1#...#y-k can be computed in O(kn+k-N=1||M^ℓ-y-1#...#y-N||) total time using O(MaxIn+MaxOut) space, where MaxIn is the length of the longest word in y-1,...,y-k and MaxOut=max{||M^ℓ-y-1#...#y-N||:N [1, k]. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.

Original language	English
Title of host publication	Proceedings - DCC 2019
Subtitle of host publication	2019 Data Compression Conference
Editors	James A. Storer, Ali Bilgin, Joan Serra-Sagrista, Michael W. Marcellin
Pages	538-547
Number of pages	10
Volume	2019-March
ISBN (Electronic)	9781728106571
DOIs	https://doi.org/10.1109/DCC.2019.00062
Publication status	Published - 10 May 2019

Keywords

Absent words
Antidictionaries
Data compression
Output sensitive algorithms
String algorithms

Access to Document

10.1109/DCC.2019.00062

Constructing Antidictionaries in_AYAD_Firstonline26March2019_GREEN AAMAccepted author manuscript, 357 KB

Cite this

@inbook{667f8b9a73b34002be5f653e1759bffe,

title = "Constructing Antidictionaries in Output-Sensitive Space",

abstract = "A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y-1, y-2,...,y-k over an alphabet Σ, we are asked to compute the set M^ℓ-y-1#...#y-k of minimal absent words of length at most ℓ of word y=y-1#y-2#...#y-k, NotElementΣ. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. This computation generally requires Ω(n) space for n=|y| using any of the plenty available O(n)-time algorithms. This is because an Ω(n)-sized text index is constructed over y which can be impractical for large n. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when ||M^ℓ-y-1#...#y-N || =o(n), for all N [1, k]. For instance, in the human genome, n ≈ 3 × 10^9 but ||M^12-y-1#...#y-k|| ≈ 10^6. We consider a constant-sized alphabet for stating our results. We show that all M^ℓ-y-1,...,M^ℓ-y-1#...#y-k can be computed in O(kn+k-N=1||M^ℓ-y-1#...#y-N||) total time using O(MaxIn+MaxOut) space, where MaxIn is the length of the longest word in y-1,...,y-k and MaxOut=max{||M^ℓ-y-1#...#y-N||:N [1, k]. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.",

keywords = "Absent words, Antidictionaries, Data compression, Output sensitive algorithms, String algorithms",

author = "Ayad, {Lorraine A.K.} and Golnaz Badkobeh and Gabriele Fici and Alice Heliou and Pissis, {Solon P.}",

year = "2019",

month = may,

day = "10",

doi = "10.1109/DCC.2019.00062",

language = "English",

volume = "2019-March",

pages = "538--547",

editor = "Storer, {James A.} and Ali Bilgin and Joan Serra-Sagrista and Marcellin, {Michael W.}",

booktitle = "Proceedings - DCC 2019",

}

Constructing Antidictionaries in Output-Sensitive Space. / Ayad, Lorraine A.K.; Badkobeh, Golnaz; Fici, Gabriele et al.
Proceedings - DCC 2019: 2019 Data Compression Conference. ed. / James A. Storer; Ali Bilgin; Joan Serra-Sagrista; Michael W. Marcellin. Vol. 2019-March 2019. p. 538-547 8712742.

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › peer-review

TY - CHAP

T1 - Constructing Antidictionaries in Output-Sensitive Space

AU - Ayad, Lorraine A.K.

AU - Badkobeh, Golnaz

AU - Fici, Gabriele

AU - Heliou, Alice

AU - Pissis, Solon P.

PY - 2019/5/10

Y1 - 2019/5/10

N2 - A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y-1, y-2,...,y-k over an alphabet Σ, we are asked to compute the set M^ℓ-y-1#...#y-k of minimal absent words of length at most ℓ of word y=y-1#y-2#...#y-k, NotElementΣ. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. This computation generally requires Ω(n) space for n=|y| using any of the plenty available O(n)-time algorithms. This is because an Ω(n)-sized text index is constructed over y which can be impractical for large n. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when ||M^ℓ-y-1#...#y-N || =o(n), for all N [1, k]. For instance, in the human genome, n ≈ 3 × 10^9 but ||M^12-y-1#...#y-k|| ≈ 10^6. We consider a constant-sized alphabet for stating our results. We show that all M^ℓ-y-1,...,M^ℓ-y-1#...#y-k can be computed in O(kn+k-N=1||M^ℓ-y-1#...#y-N||) total time using O(MaxIn+MaxOut) space, where MaxIn is the length of the longest word in y-1,...,y-k and MaxOut=max{||M^ℓ-y-1#...#y-N||:N [1, k]. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.

AB - A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y-1, y-2,...,y-k over an alphabet Σ, we are asked to compute the set M^ℓ-y-1#...#y-k of minimal absent words of length at most ℓ of word y=y-1#y-2#...#y-k, NotElementΣ. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. This computation generally requires Ω(n) space for n=|y| using any of the plenty available O(n)-time algorithms. This is because an Ω(n)-sized text index is constructed over y which can be impractical for large n. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when ||M^ℓ-y-1#...#y-N || =o(n), for all N [1, k]. For instance, in the human genome, n ≈ 3 × 10^9 but ||M^12-y-1#...#y-k|| ≈ 10^6. We consider a constant-sized alphabet for stating our results. We show that all M^ℓ-y-1,...,M^ℓ-y-1#...#y-k can be computed in O(kn+k-N=1||M^ℓ-y-1#...#y-N||) total time using O(MaxIn+MaxOut) space, where MaxIn is the length of the longest word in y-1,...,y-k and MaxOut=max{||M^ℓ-y-1#...#y-N||:N [1, k]. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.

KW - Absent words

KW - Antidictionaries

KW - Data compression

KW - Output sensitive algorithms

KW - String algorithms

UR - http://www.scopus.com/inward/record.url?scp=85066302019&partnerID=8YFLogxK

U2 - 10.1109/DCC.2019.00062

DO - 10.1109/DCC.2019.00062

M3 - Conference paper

VL - 2019-March

SP - 538

EP - 547

BT - Proceedings - DCC 2019

A2 - Storer, James A.

A2 - Bilgin, Ali

A2 - Serra-Sagrista, Joan

A2 - Marcellin, Michael W.

ER -

Constructing Antidictionaries in Output-Sensitive Space

Abstract

Keywords

Access to Document

Other files and links

Fingerprint

Cite this