K-Boost: a Scalable Algorithm for High-Quality Clustering of Microarray Gene Expression Data

Filippo Geraci1,3, Mauro Leoncini2, Manuela Montangero2, Marco Pellegrini1 and M. Elena Renda1


Istituto di Informatica e Telematica - C.N.R.

Technical Report Number: 2007-TR-15
Via G. Moruzzi, 1
I-56124 Pisa (PI) ITALY


1 Istituto di Informatica e Telematica (IIT)
Consiglio Nazionale delle Ricerche (CNR)
I-56100
Pisa (PI) ITALY

2 Dipartimento di Ingegneria dell'Informazione
University of Modena and Reggio Emilia
I-41100
Modena (MO) ITALY

3 Dipartimento di Ingegneria dell'Informazione
University of Siena
I-53100
Siena (SI) ITALY


Abstract. Motivation: Microarray technology for profiling gene expression levels is a popular tool in modern biological research. Applications range from tissue classification to the detection of metabolic networks, and from drug discovery to time-critical personalized medicine. As the data sets produced grow in size and complexity, their analysis is becoming problematic in terms of time/quality tradeoffs. Clustering genes with similar expression profiles is a key initial step for subsequent manipulations, and the increasing volumes of data to be analyzed require methods that are at the same time efficient (completing an analysis in minutes rather than hours) and effective (identifying significant clusters with high biological correlations).
Results: In this paper we propose K-Boost, a novel clustering algorithm based on a combination of the Furthest-Point-First (FPF) heuristic for solving the metric k-centers problem, a stability-based method for determining the number of clusters (i.e., the value of k), and a k-means-like cluster refinement. K-Boost is able to detect the optimal number of clusters to produce. It is scalable to large data sets without sacrificing output quality as measured by several internal and external criteria.
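
For readers unfamiliar with the FPF heuristic named above, the following is a minimal sketch of the textbook Furthest-Point-First procedure for the metric k-centers problem. It is not the K-Boost implementation: the stability-based choice of k and the k-means-like refinement are omitted, and the function names (fpf_centers, euclidean), the Euclidean distance, and the choice of the first center are assumptions made only for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fpf_centers(points, k, dist=euclidean, first=0):
    """Textbook Furthest-Point-First heuristic for metric k-centers.

    Returns (centers, assign): centers is a list of k indices into points,
    and assign[i] is the position in centers of the center closest to point i.
    The names and the 'first' parameter are hypothetical, for illustration only.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")

    centers = [first]
    # nearest[i] holds the distance from point i to its closest chosen center.
    nearest = [dist(p, points[first]) for p in points]
    assign = [0] * n

    while len(centers) < k:
        # The next center is the point furthest from all current centers.
        nxt = max(range(n), key=nearest.__getitem__)
        centers.append(nxt)
        # Reassign only the points that are closer to the new center.
        for i, p in enumerate(points):
            d = dist(p, points[nxt])
            if d < nearest[i]:
                nearest[i] = d
                assign[i] = len(centers) - 1

    return centers, assign
```

Calling fpf_centers(profiles, k) on a list of expression profiles returns the indices of the chosen centers and a cluster label for each profile. Each round costs one pass over the data, so the whole procedure uses on the order of n*k distance computations.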


For the full paper, please contact M. Elena Renda (Elena.Renda_AT_iit.cnr.it).