K-Boost: a Scalable Algorithm for High-Quality Clustering of Microarray Gene Expression Data

Filippo Geraci (1), Mauro Leoncini (2), Manuela Montangero (2), Marco Pellegrini (1) and M. Elena Renda (1)


1 Istituto di Informatica e Telematica (IIT)
Consiglio Nazionale delle Ricerche (CNR)
I-56100
Pisa (PI) ITALY

2 Dipartimento di Ingegneria dell'Informazione
University of Modena and Reggio Emilia
I-41100
Modena (MO) ITALY

Contacts:
Filippo.Geraci_AT_iit.cnr.it
Mauro.Leoncini_AT_unimo.it
Manuela.Montangero_AT_unimo.it
Marco.Pellegrini_AT_iit.cnr.it
Elena.Renda_AT_iit.cnr.it


Abstract. Microarray technology for profiling gene expression levels is a popular tool in modern biological research. Applications range from tissue classification to the detection of metabolic networks, and from drug discovery to time-critical personalized medicine. As the data sets produced grow in size and complexity, their analysis is becoming problematic in terms of time/quality trade-offs. Clustering genes with similar expression profiles is a key initial step for subsequent manipulations, and the increasing volumes of data to be analyzed require methods that are at the same time efficient (completing an analysis in minutes rather than hours) and effective (identifying significant clusters with high biological correlations). In this paper, we propose K-Boost, a clustering algorithm that combines the furthest-point-first (FPF) heuristic for solving the metric k-center problem, a stability-based method for determining the number of clusters, and a k-means-like cluster refinement. K-Boost runs in O(|N|·k) time, where N is the input matrix and k is the number of proposed clusters. Experiments show that this low complexity is usually coupled with very good quality of the computed clusterings, which we measure using both internal and external criteria. Supporting data can be found as online Supplementary Material at www.liebertonline.com.
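
For readers unfamiliar with the FPF heuristic named in the abstract, the following is a minimal sketch of the generic furthest-point-first procedure for the metric k-center problem. It is not the K-Boost implementation: the function names (fpf_centers, euclidean), the use of Euclidean distance, and the choice of the first row as the initial center are assumptions made for illustration only, and the stability-based choice of k and the k-means-like refinement are not shown.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fpf_centers(points, k, dist=euclidean):
    """Pick k centers with the furthest-point-first heuristic.

    Returns (centers, assign): centers holds the indices of the chosen
    points, and assign[i] is the position in centers of the center
    closest to point i.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(points)")

    # Assumption: the first point serves as the arbitrary initial center.
    centers = [0]
    # nearest[i] is the distance from point i to its closest chosen center.
    nearest = [dist(p, points[0]) for p in points]
    assign = [0] * n

    while len(centers) < k:
        # The next center is the point furthest from all current centers.
        far = max(range(n), key=nearest.__getitem__)
        centers.append(far)
        # One pass over the points updates distances and assignments,
        # so the whole procedure needs about n * k distance evaluations.
        for i, p in enumerate(points):
            d = dist(p, points[far])
            if d < nearest[i]:
                nearest[i] = d
                assign[i] = len(centers) - 1
    return centers, assign


if __name__ == "__main__":
    # Toy expression profiles (hypothetical data, two conditions per gene).
    profiles = [(0, 0), (0, 1), (10, 10), (10, 11), (30, 0)]
    print(fpf_centers(profiles, 3))
```

Each round performs a single scan of the input, which is what keeps the cost of this step linear in the number of points for each center chosen.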

 


© 2009 Published by Mary Ann Liebert. doi:10.1089/cmb.2008.0201


BibTeX

@Article{K-BoostJCB09,
author = "Geraci, Filippo and Leoncini, Mauro and Montangero, Manuela and Pellegrini, Marco and Renda, M. Elena",
title = "K-Boost: a Scalable Algorithm for High-Quality Clustering of Microarray Gene Expression Data",
journal = "Journal of Computational Biology",
volume = "16",
number = "6",
year = "2009",
publisher = "Mary Ann Liebert",
pages = "859--873"
}