TRStalker: an Efficient Heuristic for Finding NP-Complete Tandem Repeats

Marco Pellegrini1, M. Elena Renda1 and Alessio Vecchio2


Istituto di Informatica e Telematica - C.N.R.

Technical Report Number: 2009-TR-08
Via G. Moruzzi,1
I-56124 Pisa (PI) ITALY


1 Istituto di Informatica e Telematica (IIT)
Consiglio Nazionale delle Ricerche (CNR)
I-56100
Pisa (PI) ITALY

2 Dipartimento di Ingegneria dell'Informazione
University of Pisa
I-56100
Pisa (PI) ITALY


Abstract. Genomic sequences in higher eucaryotic organisms contain a substantial amount of (almost) repeated sequences. Tandem Repeats (TRs) constitute a large class of repetitive sequences that are originated via phenomena such as replication slippage, are characterized by close spatial contiguity, and play an important role in several molecular regulatory mechanisms. Certain types of tandem repeats are highly polymorphic and constitute a fingerprint feature of individuals. Abnormal TRs are known to be linked to several diseases. Researchers in bio-informatics in the last 20 years have proposed many formal definitions for the rather loose notion of a Tandem Repeat and have proposed exact or heuristic algorithms to detect TRs in genomic sequences. The general trend has been to use formal (implicit or explicit) definitions of TR for which verification of the solution was easy (with complexity linear, or polynomial in the TR’s length and substitution+indel rates) while the effort was directed towards identifying efficiently the sub-strings of the input to submit to the verification phase (either implicitly or explicitly). In this paper we take a step forward: we use a definition of TR for which also the verification step is difficult (in effect, NP-complete) and we develop new filtering techniques for coping with high error levels. The resulting heuristic algorithm, christened TRStalker, is approximate since it cannot guarantee that all NP-Complete Tandem Repeats satisfying the target definition in the input string will be found. However, in synthetic experiments with 30% of errors allowed, TRStalker has demonstrated a very high recall (ranging from 100% to 60%, depending on motif length and repetition number) for the NP-complete TRs. TRStalker has consistently better performance than some state- of-the-art methods for a large range of parameters on the class of NP-complete Tandem Repeats. TRStalker aims at improving the capability of TR detection for classes of TRs for which existing methods do not perform well.


For full paper: please contact me (Elena.Renda_AT_iit.cnr.it).