MinSet: a general approach to derive maximally representative database subsets by using fragment dictionaries and its application to the SCOP database

Alessandro Pandini, L Bonati, F Fraternali, J Kleinjung

Research output: Contribution to journal › Article › peer-review

12 Citations (Scopus)

Abstract

Motivation: The size of current protein databases is a challenge for many bioinformatics applications, both in terms of processing speed and information redundancy. It may therefore be desirable to efficiently reduce the database of interest to a maximally representative subset.

Results: The MinSet method employs a combination of a Suffix Tree and a Genetic Algorithm for the generation, selection and assessment of database subsets. The approach is generally applicable to any type of string-encoded data, allowing for a drastic reduction of the database size whilst retaining most of the information contained in the original set. We demonstrate the performance of the method on a database of protein domain structures encoded as strings. We applied the method to the SCOP40 domain database, translating protein structures into character strings by means of a structural alphabet and extracting optimized subsets according to an entropy score that is based on a constant-length fragment dictionary. Optimized subsets are therefore maximally representative of the distribution and range of local structures. Subsets containing only 10% of the SCOP structure classes show a coverage of > 90% for fragments of length 1-4.
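The idea of scoring a subset against a constant-length fragment dictionary can be illustrated with a minimal sketch. It is not the MinSet implementation: it uses a plain counter rather than a Suffix Tree, omits the Genetic Algorithm, and assumes that the entropy score is the Shannon entropy of the fragment frequency distribution, which the abstract does not specify. The function names (`fragment_counts`, `shannon_entropy`, `coverage`) and the toy strings are hypothetical.

```python
from collections import Counter
from math import log2


def fragment_counts(strings, k):
    """Count all overlapping fragments of length k in a list of strings."""
    counts = Counter()
    for s in strings:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a fragment frequency distribution.

    Assumption: stands in for the paper's entropy score, whose exact
    definition is not given in the abstract.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts.values())


def coverage(subset, full, k):
    """Fraction of distinct length-k fragments of `full` found in `subset`."""
    full_frags = set(fragment_counts(full, k))
    if not full_frags:
        return 0.0
    sub_frags = set(fragment_counts(subset, k))
    return len(sub_frags & full_frags) / len(full_frags)


# Toy "database" of strings over a hypothetical three-letter alphabet.
database = ["ABCAB", "BCABC", "AABBA", "CCCAB"]
subset = database[:2]

if __name__ == "__main__":
    for k in range(1, 5):
        h = shannon_entropy(fragment_counts(subset, k))
        c = coverage(subset, database, k)
        print(f"k={k}: entropy={h:.3f} bits, coverage={c:.2%}")
```

In a search such as the one the abstract describes, a score of this kind would be evaluated for many candidate subsets, and the candidates with the highest score retained.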
Original language: English
Pages (from-to): 515-516
Number of pages: 2
Journal: Bioinformatics
Volume: 23
Issue number: 4
DOIs
Publication status: Published - 15 Feb 2007
