MinSet: a general approach to derive maximally representative database subsets by using fragment dictionaries and its application to the SCOP database

Alessandro Pandini; L Bonati; F Fraternali; J Kleinjung

doi:10.1093/bioinformatics/btl637

Research output: Contribution to journal › Article › peer-review

12 Citations (Scopus)

Abstract

Motivation: The size of current protein databases is a challenge for many bioinformatics applications, both in terms of processing speed and information redundancy. It may therefore be desirable to efficiently reduce the database of interest to a maximally representative subset.

Results: The MinSet method employs a combination of a Suffix Tree and a Genetic Algorithm for the generation, selection and assessment of database subsets. The approach is generally applicable to any type of string-encoded data, allowing for a drastic reduction of the database size whilst retaining most of the information contained in the original set. We demonstrate the performance of the method on a database of protein domain structures encoded as strings. We applied the method to the SCOP40 domain database, translating protein structures into character strings by means of a structural alphabet and extracting optimized subsets according to an entropy score that is based on a constant-length fragment dictionary. The optimized subsets are therefore maximally representative of the distribution and range of local structures. Subsets containing only 10% of the SCOP structure classes show a coverage of >90% for fragments of length 1-4.
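To make the scoring idea in the abstract concrete, the following is a minimal sketch of a constant-length fragment dictionary, a Shannon entropy computed over it, and a fragment coverage measure for a subset. It is an illustration under stated assumptions, not the MinSet implementation: the function names, the toy strings and the exact definitions of entropy and coverage are hypothetical, plain counting stands in for the Suffix Tree, and the Genetic Algorithm that searches for good subsets is omitted.

```python
from collections import Counter
from math import log2


def fragment_counts(strings, k):
    """Count every length-k fragment (substring) across the given strings."""
    counts = Counter()
    for s in strings:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def shannon_entropy(counts):
    """Shannon entropy in bits of a fragment frequency distribution (assumed score)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * log2(n / total) for n in counts.values())


def coverage(subset, full, k):
    """Fraction of distinct length-k fragments of `full` that also occur in `subset`."""
    full_frags = set(fragment_counts(full, k))
    if not full_frags:
        return 0.0
    sub_frags = set(fragment_counts(subset, k))
    return len(full_frags & sub_frags) / len(full_frags)


# Toy strings standing in for structures encoded with a structural alphabet.
database = ["AABBA", "ABBAA", "BBAAB", "AAAAA"]
subset = ["AABBA", "BBAAB"]

for k in (1, 2, 3):
    entropy = shannon_entropy(fragment_counts(subset, k))
    print(k, round(entropy, 3), coverage(subset, database, k))
```

A subset search could use a score like this as its fitness function, preferring candidate subsets whose fragment distribution stays close to that of the full database.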

Original language	English
Pages (from-to)	515 - 516
Number of pages	2
Journal	Bioinformatics
Volume	23
Issue number	4
DOIs	https://doi.org/10.1093/bioinformatics/btl637
Publication status	Published - 15 Feb 2007

Cite this

@article{f01020cd09844cb088eb8d11f12a85b5,

title = "MinSet: a general approach to derive maximally representative database subsets by using fragment dictionaries and its application to the SCOP database",

abstract = "Motivation: The size of current protein databases is a challenge for many Bioinformatics applications, both in terms of processing speed and information redundancy. It may be therefore desirable to efficiently reduce the database of interest to a maximally representative subset. Results: The MinSet method employs a combination of a Suffix Tree and a Genetic Algorithm for the generation, selection and assessment of database subsets. The approach is generally applicable to any type of string-encoded data, allowing for a drastic reduction of the database size whilst retaining most of the information contained in the original set. We demonstrate the performance of the method on a database of protein domain structures encoded as strings. We used the SCOP40 domain database by translating protein structures into character strings by means of a structural alphabet and by extracting optimized subsets according to an entropy score that is based on a constant-length fragment dictionary. Therefore, optimized subsets are maximally representative for the distribution and range of local structures. Subsets containing only 10% of the SCOP structure classes show a coverage of > 90% for fragments of length 1-4",

author = "Alessandro Pandini and L Bonati and F Fraternali and J Kleinjung",

year = "2007",

month = feb,

day = "15",

doi = "10.1093/bioinformatics/btl637",

language = "English",

volume = "23",

pages = "515--516",

journal = "Bioinformatics",

publisher = "Oxford University Press (OUP)",

number = "4",

}

TY - JOUR

T1 - MinSet: a general approach to derive maximally representative database subsets by using fragment dictionaries and its application to the SCOP database

AU - Pandini, Alessandro

AU - Bonati, L

AU - Fraternali, F

AU - Kleinjung, J

PY - 2007/2/15

Y1 - 2007/2/15

AB - Motivation: The size of current protein databases is a challenge for many Bioinformatics applications, both in terms of processing speed and information redundancy. It may be therefore desirable to efficiently reduce the database of interest to a maximally representative subset. Results: The MinSet method employs a combination of a Suffix Tree and a Genetic Algorithm for the generation, selection and assessment of database subsets. The approach is generally applicable to any type of string-encoded data, allowing for a drastic reduction of the database size whilst retaining most of the information contained in the original set. We demonstrate the performance of the method on a database of protein domain structures encoded as strings. We used the SCOP40 domain database by translating protein structures into character strings by means of a structural alphabet and by extracting optimized subsets according to an entropy score that is based on a constant-length fragment dictionary. Therefore, optimized subsets are maximally representative for the distribution and range of local structures. Subsets containing only 10% of the SCOP structure classes show a coverage of > 90% for fragments of length 1-4

U2 - 10.1093/bioinformatics/btl637

DO - 10.1093/bioinformatics/btl637

M3 - Article

VL - 23

SP - 515

EP - 516

JO - Bioinformatics

JF - Bioinformatics

IS - 4

ER -
