TY - CHAP
T1 - Making de Bruijn Graphs Eulerian
AU - Bernardini, Giulia
AU - Chen, Huiping
AU - Loukides, Grigorios
AU - Pissis, Solon P.
AU - Stougie, Leen
AU - Sweering, Michelle
N1 - Funding Information:
Funding The work in this paper is supported in part by: the Netherlands Organisation for Scientific Research (NWO) through project OCENW.GROOT.2019.015 “Optimization for and with Machine Learning (OPTIMAL)” and Gravitation-grant NETWORKS-024.002.003; a CSC scholarship; the Leverhulme Trust RPG-2019-399 project; and the PANGAIA and ALPACA projects that have received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively.
Funding Information:
The work in this paper is supported in part by: the Netherlands Organisation for Scientific Research (NWO) through project OCENW.GROOT.2019.015 “Optimization for and with Machine Learning (OPTIMAL)” and Gravitation-grant NETWORKS-024.002.003; a CSC scholarship; the Leverhulme Trust RPG-2019-399 project; and the PANGAIA and ALPACA projects that have received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively.
Publisher Copyright:
© Giulia Bernardini, Huiping Chen, Grigorios Loukides, Solon P. Pissis, Leen Stougie, and Michelle Sweering; licensed under Creative Commons License CC-BY 4.0
PY - 2022/6/1
Y1 - 2022/6/1
N2 - A directed multigraph is called Eulerian if it has a circuit which uses each edge exactly once. Euler's theorem tells us that a weakly connected directed multigraph is Eulerian if and only if every node is balanced. Given a collection S of strings over an alphabet Σ, the de Bruijn graph (dBG) of order k of S is a directed multigraph GS,k(V, E), where V is the set of length-(k − 1) substrings of the strings in S, and GS,k contains an edge (u, v) with multiplicity mu,v, if and only if the string u[0] · v is equal to the string u · v[k − 2] and this string occurs exactly mu,v times in total in strings in S. Let GΣ,k(VΣ,k, EΣ,k) be the complete dBG of Σk. The Eulerian Extension (EE) problem on GS,k asks to extend GS,k with a set B of nodes from VΣ,k and a smallest multiset A of edges from EΣ,k to make it Eulerian. Note that extending dBGs is algorithmically much more challenging than extending general directed multigraphs because some edges in dBGs are by definition forbidden. Extending dBGs lies at the heart of sequence assembly [Medvedev et al., WABI 2007], one of the most important tasks in bioinformatics. The novelty of our work with respect to existing works is that we allow not only to duplicate existing edges of GS,k but to also add novel edges and nodes, in an effort to (i) connect multiple components and (ii) reduce the total EE cost. It is easy to show that EE on GS,k is NP-hard via a reduction from shortest common superstring. We further show that EE remains NP-hard, even when we are not allowed to add new nodes, via a highly non-trivial reduction from 3-SAT. We thus investigate the following two problems underlying EE in dBGs: 1. When GS,k is not weakly connected, we are asked to connect its d > 1 components using a minimum-weight spanning tree, whose edges are paths on the underlying GΣ,k and weights are the corresponding path lengths. This way of connecting guarantees that no new unbalanced node is added. We show that this problem can be solved in O(|V |k log d + |E|) time, which is nearly optimal, since the size of GS,k is Θ(|V |k + |E|). 2. When GS,k is not balanced, we are asked to extend GS,k to HS,k(V ∪ B, E ∪ A) such that every node of HS,k is balanced and the total number |A| of added edges is minimized. We show that this problem can be solved in the optimal O(k|V | + |E| + |A|) time. Let us stress that, although our main contributions are theoretical, the algorithms we design for the above two problems are practical. We combine the two algorithms in one method that makes any dBG Eulerian; and show experimentally that the cost of the obtained feasible solutions on real-world dBGs is substantially smaller than the corresponding cost obtained by existing greedy approaches.
AB - A directed multigraph is called Eulerian if it has a circuit which uses each edge exactly once. Euler's theorem tells us that a weakly connected directed multigraph is Eulerian if and only if every node is balanced. Given a collection S of strings over an alphabet Σ, the de Bruijn graph (dBG) of order k of S is a directed multigraph GS,k(V, E), where V is the set of length-(k − 1) substrings of the strings in S, and GS,k contains an edge (u, v) with multiplicity mu,v, if and only if the string u[0] · v is equal to the string u · v[k − 2] and this string occurs exactly mu,v times in total in strings in S. Let GΣ,k(VΣ,k, EΣ,k) be the complete dBG of Σk. The Eulerian Extension (EE) problem on GS,k asks to extend GS,k with a set B of nodes from VΣ,k and a smallest multiset A of edges from EΣ,k to make it Eulerian. Note that extending dBGs is algorithmically much more challenging than extending general directed multigraphs because some edges in dBGs are by definition forbidden. Extending dBGs lies at the heart of sequence assembly [Medvedev et al., WABI 2007], one of the most important tasks in bioinformatics. The novelty of our work with respect to existing works is that we allow not only to duplicate existing edges of GS,k but to also add novel edges and nodes, in an effort to (i) connect multiple components and (ii) reduce the total EE cost. It is easy to show that EE on GS,k is NP-hard via a reduction from shortest common superstring. We further show that EE remains NP-hard, even when we are not allowed to add new nodes, via a highly non-trivial reduction from 3-SAT. We thus investigate the following two problems underlying EE in dBGs: 1. When GS,k is not weakly connected, we are asked to connect its d > 1 components using a minimum-weight spanning tree, whose edges are paths on the underlying GΣ,k and weights are the corresponding path lengths. This way of connecting guarantees that no new unbalanced node is added. We show that this problem can be solved in O(|V |k log d + |E|) time, which is nearly optimal, since the size of GS,k is Θ(|V |k + |E|). 2. When GS,k is not balanced, we are asked to extend GS,k to HS,k(V ∪ B, E ∪ A) such that every node of HS,k is balanced and the total number |A| of added edges is minimized. We show that this problem can be solved in the optimal O(k|V | + |E| + |A|) time. Let us stress that, although our main contributions are theoretical, the algorithms we design for the above two problems are practical. We combine the two algorithms in one method that makes any dBG Eulerian; and show experimentally that the cost of the obtained feasible solutions on real-world dBGs is substantially smaller than the corresponding cost obtained by existing greedy approaches.
KW - de Bruijn graph
KW - Eulerian graph
KW - graph algorithms
KW - string algorithms
UR - http://www.scopus.com/inward/record.url?scp=85134359609&partnerID=8YFLogxK
U2 - 10.4230/LIPIcs.CPM.2022.12
DO - 10.4230/LIPIcs.CPM.2022.12
M3 - Conference paper
AN - SCOPUS:85134359609
T3 - Leibniz International Proceedings in Informatics, LIPIcs
BT - 33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022
A2 - Bannai, Hideo
A2 - Holub, Jan
PB - Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing
T2 - 33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022
Y2 - 27 June 2022 through 29 June 2022
ER -