Space-Efficient Indexes for Uncertain Strings

Esteban Gabory; Chang Liu; Grigorios Loukides; Solon P. Pissis; Wiktor Zuba

doi:10.1109/ICDE60146.2024.00367

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › peer-review

Abstract

Strings in the real world are often encoded with some level of uncertainty, for example, due to: unreliable data measurements; flexible sequence modeling; or noise introduced for privacy protection. In the character-level uncertainty model, an uncertain string X of length n on an alphabet Σ is a sequence of n probability distributions over Σ. Given an uncertain string X and a weight threshold 1/z ∈ (0,1), we say that pattern P occurs in X at position i, if the product of probabilities of the letters of P at positions i, ..., i+|P|-1 is at least 1/z. While indexing standard strings for online pattern searches can be performed in linear time and space, indexing uncertain strings is much more challenging. Specifically, the state-of-the-art index for uncertain strings has O(nz) size, requires O(nz) time and O(nz) space to be constructed, and answers pattern matching queries in the optimal O(m + |Occ|) time, where m is the length of P and |Occ| is the total number of occurrences of P in X. For large n and (moderate) z values, this index is completely impractical to construct, which outweighs the benefit of the supported optimal pattern matching queries. We were thus motivated to design a space-efficient index at the expense of slower yet competitive pattern matching queries. We show that when we have at hand a lower bound ℓ on the length of the supported pattern queries, as is often the case in real-world applications, we can slash the index size and the construction space roughly by ℓ. In particular, we propose an index of O((nz/ℓ) log z) expected size, which can be constructed using O((nz/ℓ) log z) expected space, and supports very fast pattern matching queries in expectation, for patterns of length m ≥ ℓ. We have implemented and evaluated several versions of our index. The best-performing version of our index is up to two orders of magnitude smaller than the state of the art in terms of both index size and construction space, while offering faster or very competitive query and construction times.
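
The occurrence definition in the abstract can be illustrated with a minimal sketch. This is a naive scan that checks every position directly, not the index proposed in the paper. The function name `occurrences`, the representation of X as a list of letter-to-probability dictionaries, and the use of 0-based positions are all assumptions made for illustration.

```python
def occurrences(X, P, z):
    """Return the 0-based positions i at which P occurs in X.

    X is a list of dicts mapping letters to probabilities (one dict per
    position); this representation is an assumption for illustration.
    P occurs at i when the product of the probabilities of its letters
    at positions i, ..., i+|P|-1 is at least 1/z.
    """
    threshold = 1.0 / z
    hits = []
    for i in range(len(X) - len(P) + 1):
        weight = 1.0
        for j, letter in enumerate(P):
            # A letter missing from a distribution has probability 0.
            weight *= X[i + j].get(letter, 0.0)
            # Probabilities are at most 1, so the product never grows back:
            # once it drops below the threshold we can stop early.
            if weight < threshold:
                break
        else:
            hits.append(i)
    return hits


# A small uncertain string of length 4 over the alphabet {a, b}.
X = [
    {"a": 0.5, "b": 0.5},
    {"a": 1.0},
    {"a": 0.25, "b": 0.75},
    {"b": 1.0},
]

print(occurrences(X, "ab", 4))  # [1, 2]
print(occurrences(X, "ab", 2))  # [1]
```

With z = 4 the pattern "ab" occurs at positions 1 (weight 0.75) and 2 (weight 0.25); raising the threshold to 1/2 leaves only position 1. The scan takes O(nm) time per query, which is the cost that an index is meant to avoid.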

Original language	English
Title of host publication	Proceedings - 2024 IEEE 40th International Conference on Data Engineering, ICDE 2024
Publisher	IEEE Computer Society
Pages	4828-4842
Number of pages	15
ISBN (Electronic)	9798350317152
DOIs	https://doi.org/10.1109/ICDE60146.2024.00367
Publication status	Accepted/In press - 9 Mar 2024
Event	40th IEEE International Conference on Data Engineering, ICDE 2024 - Utrecht, Netherlands Duration: 13 May 2024 → 17 May 2024

Publication series

Name	Proceedings - International Conference on Data Engineering
ISSN (Print)	1084-4627
ISSN (Electronic)	2375-0286

Conference

Conference	40th IEEE International Conference on Data Engineering, ICDE 2024
Country/Territory	Netherlands
City	Utrecht
Period	13/05/2024 → 17/05/2024

Keywords

index
space-efficient
uncertain strings

Access to Document

10.1109/ICDE60146.2024.00367

Cite this

@inbook{e7daab9d5f0946bfbd6e45f6f5809a5e,

title = "Space-Efficient Indexes for Uncertain Strings",

abstract = "Strings in the real world are often encoded with some level of uncertainty, for example, due to: unreliable data measurements; flexible sequence modeling; or noise introduced for privacy protection. In the character-level uncertainty model, an uncertain string X of length n on an alphabet Σ is a sequence of n probability distributions over Σ. Given an uncertain string X and a weight threshold 1/z ∈ (0,1), we say that pattern P occurs in X at position i, if the product of probabilities of the letters of P at positions i, ..., i+|P|-1 is at least 1/z. While indexing standard strings for online pattern searches can be performed in linear time and space, indexing uncertain strings is much more challenging. Specifically, the state-of-the-art index for uncertain strings has O(nz) size, requires O(nz) time and O(nz) space to be constructed, and answers pattern matching queries in the optimal O(m + |Occ|) time, where m is the length of P and |Occ| is the total number of occurrences of P in X. For large n and (moderate) z values, this index is completely impractical to construct, which outweighs the benefit of the supported optimal pattern matching queries. We were thus motivated to design a space-efficient index at the expense of slower yet competitive pattern matching queries. We show that when we have at hand a lower bound ℓ on the length of the supported pattern queries, as is often the case in real-world applications, we can slash the index size and the construction space roughly by ℓ. In particular, we propose an index of O((nz/ℓ) log z) expected size, which can be constructed using O((nz/ℓ) log z) expected space, and supports very fast pattern matching queries in expectation, for patterns of length m ≥ ℓ. We have implemented and evaluated several versions of our index. The best-performing version of our index is up to two orders of magnitude smaller than the state of the art in terms of both index size and construction space, while offering faster or very competitive query and construction times.",

keywords = "index, space-efficient, uncertain strings",

author = "Esteban Gabory and Chang Liu and Grigorios Loukides and Pissis, {Solon P.} and Wiktor Zuba",

note = "Publisher Copyright: {\textcopyright} 2024 IEEE.; 40th IEEE International Conference on Data Engineering, ICDE 2024 ; Conference date: 13-05-2024 Through 17-05-2024",

year = "2024",

month = mar,

day = "9",

doi = "10.1109/ICDE60146.2024.00367",

language = "English",

series = "Proceedings - International Conference on Data Engineering",

publisher = "IEEE Computer Society",

pages = "4828--4842",

booktitle = "Proceedings - 2024 IEEE 40th International Conference on Data Engineering, ICDE 2024",

address = "United States",

}

Gabory, E, Liu, C, Loukides, G, Pissis, SP & Zuba, W 2024, Space-Efficient Indexes for Uncertain Strings. in Proceedings - 2024 IEEE 40th International Conference on Data Engineering, ICDE 2024. Proceedings - International Conference on Data Engineering, IEEE Computer Society, pp. 4828-4842, 40th IEEE International Conference on Data Engineering, ICDE 2024, Utrecht, Netherlands, 13/05/2024. https://doi.org/10.1109/ICDE60146.2024.00367

Space-Efficient Indexes for Uncertain Strings. / Gabory, Esteban; Liu, Chang; Loukides, Grigorios et al.
Proceedings - 2024 IEEE 40th International Conference on Data Engineering, ICDE 2024. IEEE Computer Society, 2024. p. 4828-4842 (Proceedings - International Conference on Data Engineering).

TY - CHAP

T1 - Space-Efficient Indexes for Uncertain Strings

AU - Gabory, Esteban

AU - Liu, Chang

AU - Loukides, Grigorios

AU - Pissis, Solon P.

AU - Zuba, Wiktor

PY - 2024/3/9

Y1 - 2024/3/9

AB - Strings in the real world are often encoded with some level of uncertainty, for example, due to: unreliable data measurements; flexible sequence modeling; or noise introduced for privacy protection. In the character-level uncertainty model, an uncertain string X of length n on an alphabet Σ is a sequence of n probability distributions over Σ. Given an uncertain string X and a weight threshold 1/z ∈ (0,1), we say that pattern P occurs in X at position i, if the product of probabilities of the letters of P at positions i, ..., i+|P|-1 is at least 1/z. While indexing standard strings for online pattern searches can be performed in linear time and space, indexing uncertain strings is much more challenging. Specifically, the state-of-the-art index for uncertain strings has O(nz) size, requires O(nz) time and O(nz) space to be constructed, and answers pattern matching queries in the optimal O(m + |Occ|) time, where m is the length of P and |Occ| is the total number of occurrences of P in X. For large n and (moderate) z values, this index is completely impractical to construct, which outweighs the benefit of the supported optimal pattern matching queries. We were thus motivated to design a space-efficient index at the expense of slower yet competitive pattern matching queries. We show that when we have at hand a lower bound ℓ on the length of the supported pattern queries, as is often the case in real-world applications, we can slash the index size and the construction space roughly by ℓ. In particular, we propose an index of O((nz/ℓ) log z) expected size, which can be constructed using O((nz/ℓ) log z) expected space, and supports very fast pattern matching queries in expectation, for patterns of length m ≥ ℓ. We have implemented and evaluated several versions of our index. The best-performing version of our index is up to two orders of magnitude smaller than the state of the art in terms of both index size and construction space, while offering faster or very competitive query and construction times.

KW - index

KW - space-efficient

KW - uncertain strings

U2 - 10.1109/ICDE60146.2024.00367

DO - 10.1109/ICDE60146.2024.00367

M3 - Conference paper

AN - SCOPUS:85200512989

T3 - Proceedings - International Conference on Data Engineering

SP - 4828

EP - 4842

BT - Proceedings - 2024 IEEE 40th International Conference on Data Engineering, ICDE 2024

PB - IEEE Computer Society

T2 - 40th IEEE International Conference on Data Engineering, ICDE 2024

Y2 - 13 May 2024 through 17 May 2024

ER -
