Antje Krause (email@example.com)
Martin Vingron (firstname.lastname@example.org)
Deutsches Krebsforschungszentrum (DKFZ), Abt. Theoretische Bioinformatik
Im Neuenheimer Feld 280, D-69120 Heidelberg, Germany
An iterative database search method is introduced and applied to the design of a database clustering procedure. The search method virtually never produces false-positive hits, yet it still finds meaningfully large sets of sequences related to the query. A novel set-theoretic database clustering algorithm exploits this property and avoids the traditional distance-based clustering step. This makes the algorithm fast and applicable to data sets as large as, e.g., the Swiss-Prot database. In practice, we achieve unambiguous assignment of 80% of Swiss-Prot sequences to non-overlapping sequence clusters in an entirely automatic fashion.