MoTeX-II: structured MoTif eXtraction from large-scale datasets

Solon P. Pissis

doi:10.1186/1471-2105-15-235

MoTeX-II: structured MoTif eXtraction from large-scale datasets

Solon P. Pissis

Informatics

Research output: Contribution to journal › Article › peer-review

22 Citations (Scopus)

Abstract

Background
Identifying repeated factors that occur in a string of letters or common factors that occur in a set of strings represents an important task in computer science and biology. Such patterns are called motifs, and the process of identifying them is called motif extraction. In biology, motif extraction constitutes a fundamental step in understanding regulation of gene expression. State-of-the-art tools for motif extraction have their own constraints. Most of these tools are only designed for single motif extraction; structured motifs additionally allow for distance intervals between their single motif components. Moreover, motif extraction from large-scale datasets-for instance, large-scale ChIPSeq datasets?cannot be performed by current tools. Other constraints include high time and/or space complexity for identifying long motifs with higher error thresholds.

Results
In this article, we introduce MoTeX-II, a word-based high-performance computing tool for structured MoTif eXtraction from large-scale datasets. Similar to its predecessor for single motif extraction, it uses state-of-the-art algorithms for solving the fixed-length approximate string matching problem. It produces similar and partially identical results to state-of-the-art tools for structured motif extraction with respect to accuracy as quantified by statistical significance measures. Moreover, we show that it matches or outperforms these tools in terms of runtime efficiency by merging single motif occurrences efficiently. MoTeX-II comes in three flavors: a standard CPU version; an OpenMP-based version; and an MPI-based version. For instance, the MPI-based version of MoTeX-II requires only a couple of hours to process all human genes for structured motif extraction on 1056 processors, while current sequential tools require more than a week for this task. Finally, we show that MoTeX-II is successful in extracting known composite transcription factor binding sites from real datasets.

Conclusions
Use of MoTeX-II in biological frameworks may enable deriving reliable and important information since real full-length datasets can now be processed with almost any set of input parameters for both single and structured motif extraction in a reasonable amount of time. The open-source code of MoTeX-II is freely available at http://www.inf.kcl.ac.uk/research/projects/motex/.

Original language	English
Article number	235
Number of pages	17
Journal	BMC Bioinformatics
Volume	15
DOIs	https://doi.org/10.1186/1471-2105-15-235
Publication status	Published - 8 Jul 2014

Access to Document

10.1186/1471-2105-15-235

Cite this

@article{88f8971727134a9fba6899412d84a49f,

title = "MoTeX-II: structured MoTif eXtraction from large-scale datasets",

abstract = "BackgroundIdentifying repeated factors that occur in a string of letters or common factors that occur in a set of strings represents an important task in computer science and biology. Such patterns are called motifs, and the process of identifying them is called motif extraction. In biology, motif extraction constitutes a fundamental step in understanding regulation of gene expression. State-of-the-art tools for motif extraction have their own constraints. Most of these tools are only designed for single motif extraction; structured motifs additionally allow for distance intervals between their single motif components. Moreover, motif extraction from large-scale datasets-for instance, large-scale ChIPSeq datasets?cannot be performed by current tools. Other constraints include high time and/or space complexity for identifying long motifs with higher error thresholds.ResultsIn this article, we introduce MoTeX-II, a word-based high-performance computing tool for structured MoTif eXtraction from large-scale datasets. Similar to its predecessor for single motif extraction, it uses state-of-the-art algorithms for solving the fixed-length approximate string matching problem. It produces similar and partially identical results to state-of-the-art tools for structured motif extraction with respect to accuracy as quantified by statistical significance measures. Moreover, we show that it matches or outperforms these tools in terms of runtime efficiency by merging single motif occurrences efficiently. MoTeX-II comes in three flavors: a standard CPU version; an OpenMP-based version; and an MPI-based version. For instance, the MPI-based version of MoTeX-II requires only a couple of hours to process all human genes for structured motif extraction on 1056 processors, while current sequential tools require more than a week for this task. Finally, we show that MoTeX-II is successful in extracting known composite transcription factor binding sites from real datasets.ConclusionsUse of MoTeX-II in biological frameworks may enable deriving reliable and important information since real full-length datasets can now be processed with almost any set of input parameters for both single and structured motif extraction in a reasonable amount of time. The open-source code of MoTeX-II is freely available at http://www.inf.kcl.ac.uk/research/projects/motex/.",

author = "Pissis, {Solon P.}",

year = "2014",

month = jul,

day = "8",

doi = "10.1186/1471-2105-15-235",

language = "English",

volume = "15",

journal = "BMC Bioinformatics",

issn = "1471-2105",

publisher = "BioMed Central",

}

TY - JOUR

T1 - MoTeX-II

T2 - structured MoTif eXtraction from large-scale datasets

AU - Pissis, Solon P.

PY - 2014/7/8

Y1 - 2014/7/8

N2 - BackgroundIdentifying repeated factors that occur in a string of letters or common factors that occur in a set of strings represents an important task in computer science and biology. Such patterns are called motifs, and the process of identifying them is called motif extraction. In biology, motif extraction constitutes a fundamental step in understanding regulation of gene expression. State-of-the-art tools for motif extraction have their own constraints. Most of these tools are only designed for single motif extraction; structured motifs additionally allow for distance intervals between their single motif components. Moreover, motif extraction from large-scale datasets-for instance, large-scale ChIPSeq datasets?cannot be performed by current tools. Other constraints include high time and/or space complexity for identifying long motifs with higher error thresholds.ResultsIn this article, we introduce MoTeX-II, a word-based high-performance computing tool for structured MoTif eXtraction from large-scale datasets. Similar to its predecessor for single motif extraction, it uses state-of-the-art algorithms for solving the fixed-length approximate string matching problem. It produces similar and partially identical results to state-of-the-art tools for structured motif extraction with respect to accuracy as quantified by statistical significance measures. Moreover, we show that it matches or outperforms these tools in terms of runtime efficiency by merging single motif occurrences efficiently. MoTeX-II comes in three flavors: a standard CPU version; an OpenMP-based version; and an MPI-based version. For instance, the MPI-based version of MoTeX-II requires only a couple of hours to process all human genes for structured motif extraction on 1056 processors, while current sequential tools require more than a week for this task. Finally, we show that MoTeX-II is successful in extracting known composite transcription factor binding sites from real datasets.ConclusionsUse of MoTeX-II in biological frameworks may enable deriving reliable and important information since real full-length datasets can now be processed with almost any set of input parameters for both single and structured motif extraction in a reasonable amount of time. The open-source code of MoTeX-II is freely available at http://www.inf.kcl.ac.uk/research/projects/motex/.

AB - BackgroundIdentifying repeated factors that occur in a string of letters or common factors that occur in a set of strings represents an important task in computer science and biology. Such patterns are called motifs, and the process of identifying them is called motif extraction. In biology, motif extraction constitutes a fundamental step in understanding regulation of gene expression. State-of-the-art tools for motif extraction have their own constraints. Most of these tools are only designed for single motif extraction; structured motifs additionally allow for distance intervals between their single motif components. Moreover, motif extraction from large-scale datasets-for instance, large-scale ChIPSeq datasets?cannot be performed by current tools. Other constraints include high time and/or space complexity for identifying long motifs with higher error thresholds.ResultsIn this article, we introduce MoTeX-II, a word-based high-performance computing tool for structured MoTif eXtraction from large-scale datasets. Similar to its predecessor for single motif extraction, it uses state-of-the-art algorithms for solving the fixed-length approximate string matching problem. It produces similar and partially identical results to state-of-the-art tools for structured motif extraction with respect to accuracy as quantified by statistical significance measures. Moreover, we show that it matches or outperforms these tools in terms of runtime efficiency by merging single motif occurrences efficiently. MoTeX-II comes in three flavors: a standard CPU version; an OpenMP-based version; and an MPI-based version. For instance, the MPI-based version of MoTeX-II requires only a couple of hours to process all human genes for structured motif extraction on 1056 processors, while current sequential tools require more than a week for this task. Finally, we show that MoTeX-II is successful in extracting known composite transcription factor binding sites from real datasets.ConclusionsUse of MoTeX-II in biological frameworks may enable deriving reliable and important information since real full-length datasets can now be processed with almost any set of input parameters for both single and structured motif extraction in a reasonable amount of time. The open-source code of MoTeX-II is freely available at http://www.inf.kcl.ac.uk/research/projects/motex/.

U2 - 10.1186/1471-2105-15-235

DO - 10.1186/1471-2105-15-235

M3 - Article

SN - 1471-2105

VL - 15

JO - BMC Bioinformatics

JF - BMC Bioinformatics

M1 - 235

ER -

MoTeX-II: structured MoTif eXtraction from large-scale datasets

Abstract

Access to Document

Fingerprint

Cite this