Making de Bruijn Graphs Eulerian

Giulia Bernardini*, Huiping Chen, Grigorios Loukides, Solon P. Pissis, Leen Stougie, Michelle Sweering

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference paperpeer-review

1 Citation (Scopus)

Abstract

A directed multigraph is called Eulerian if it has a circuit which uses each edge exactly once. Euler's theorem tells us that a weakly connected directed multigraph is Eulerian if and only if every node is balanced. Given a collection S of strings over an alphabet Σ, the de Bruijn graph (dBG) of order k of S is a directed multigraph GS,k(V, E), where V is the set of length-(k − 1) substrings of the strings in S, and GS,k contains an edge (u, v) with multiplicity mu,v, if and only if the string u[0] · v is equal to the string u · v[k − 2] and this string occurs exactly mu,v times in total in strings in S. Let GΣ,k(VΣ,k, EΣ,k) be the complete dBG of Σk. The Eulerian Extension (EE) problem on GS,k asks to extend GS,k with a set B of nodes from VΣ,k and a smallest multiset A of edges from EΣ,k to make it Eulerian. Note that extending dBGs is algorithmically much more challenging than extending general directed multigraphs because some edges in dBGs are by definition forbidden. Extending dBGs lies at the heart of sequence assembly [Medvedev et al., WABI 2007], one of the most important tasks in bioinformatics. The novelty of our work with respect to existing works is that we allow not only to duplicate existing edges of GS,k but to also add novel edges and nodes, in an effort to (i) connect multiple components and (ii) reduce the total EE cost. It is easy to show that EE on GS,k is NP-hard via a reduction from shortest common superstring. We further show that EE remains NP-hard, even when we are not allowed to add new nodes, via a highly non-trivial reduction from 3-SAT. We thus investigate the following two problems underlying EE in dBGs: 1. When GS,k is not weakly connected, we are asked to connect its d > 1 components using a minimum-weight spanning tree, whose edges are paths on the underlying GΣ,k and weights are the corresponding path lengths. This way of connecting guarantees that no new unbalanced node is added. We show that this problem can be solved in O(|V |k log d + |E|) time, which is nearly optimal, since the size of GS,k is Θ(|V |k + |E|). 2. When GS,k is not balanced, we are asked to extend GS,k to HS,k(V ∪ B, E ∪ A) such that every node of HS,k is balanced and the total number |A| of added edges is minimized. We show that this problem can be solved in the optimal O(k|V | + |E| + |A|) time. Let us stress that, although our main contributions are theoretical, the algorithms we design for the above two problems are practical. We combine the two algorithms in one method that makes any dBG Eulerian; and show experimentally that the cost of the obtained feasible solutions on real-world dBGs is substantially smaller than the corresponding cost obtained by existing greedy approaches.

Original languageEnglish
Title of host publication33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022
EditorsHideo Bannai, Jan Holub
PublisherSchloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing
ISBN (Electronic)9783959772341
DOIs
Publication statusPublished - 1 Jun 2022
Event33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022 - Prague, Czech Republic
Duration: 27 Jun 202229 Jun 2022

Publication series

NameLeibniz International Proceedings in Informatics, LIPIcs
Volume223
ISSN (Print)1868-8969

Conference

Conference33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022
Country/TerritoryCzech Republic
CityPrague
Period27/06/202229/06/2022

Keywords

  • de Bruijn graph
  • Eulerian graph
  • graph algorithms
  • string algorithms

Fingerprint

Dive into the research topics of 'Making de Bruijn Graphs Eulerian'. Together they form a unique fingerprint.

Cite this