A New Idea for Improving the Running Time of PMS Algorithm

The motif finding problem is a major challenge in biology, with significant applications in the detection of transcription factor binding sites and transcriptional regulatory elements that are crucial to understanding gene expression and function, human disease, drug design, etc. Two types of motif finding problems have been investigated: the Planted Motif Search Problem (PMSP), which is defined as finding motifs that appear in all sequences, and a restricted version of it, the “Planted Motif Search Problem-Sample Driven” (PMSP-SD), where the motifs themselves are found in the input. The first version is NP-Complete and the second version can be trivially solved in polynomial time. In this paper, a new idea is used to speed up the PMS-SD algorithm. Although PMS-SD is a polynomial-time algorithm and the new idea does not improve its asymptotic running time, most motif search algorithms combine a sample-driven approach with a pattern-driven approach, so speeding up PMS-SD also speeds up the PMS algorithm. To verify the performance of the modified algorithms, which are called PMS two-step and PMS-SD two-step, these algorithms are tested on simulated data. The experimental results confirm the improvements.


Introduction
Motifs, which are approximately conserved sequences across DNA or protein sequences, lead biologists to new biological discoveries. Regulatory regions in a genome, such as promoters, enhancers, and locus control regions, contain motifs that control many biological processes such as gene expression [1], [2]. In fact, some proteins known as transcription factors, which bind to motif locations in regulatory regions, can regulate gene expression.
A large number of methods have been proposed to investigate motifs in biological sequences. One class of approaches is the combinatorial approaches, such as Planted Motif Search (PMS), Simple Motif Search (SMS), and Edited-distance-based Motif Search (EMS) [3].
Among the different combinatorial formulations, the PMS problem is the most popular because it is closest to how motifs occur in practice. A motif in the PMS problem is referred to as an (l,d)-motif, where l is the length of the motif and d is the number of mismatches allowed in its instances. The problem is to extract common substrings that appear in every input sequence with the pre-specified number of mismatches allowed. Each instance has length l, and its allowed d mismatches may fall at different positions in each instance. An algorithm that solves the PMS problem is called a PMS algorithm.
Motif finding approaches fall into two categories: sample-driven and pattern-driven. A pattern-driven approach tries all |Σ|^l possible l-mers as motif candidates, which is an exponential search space, whereas a sample-driven approach considers only the candidates generated from the l-mers in the input strings, which can be searched in polynomial time. It has been proven that the PMS problem is NP-hard, which means that it is unlikely that any algorithm solves it in polynomial time [4].
Due to this NP-hardness, PMS algorithms are of two kinds: exact and approximate. An exact algorithm can find all the motifs in the input sequences, while an approximate algorithm may not be able to find all of them. All existing exact algorithms solve the PMS problem in time that is exponential in some of its parameters. Some of the most important exact algorithms are PMS1 [5], PMS2 [5], PMS3 [5], PMSi [6], PMSP [6], PMSP4 [7], Stemming [8], PMS5 [9], PMS6 [10], PMS8 [11], qPMS9 [12], PMS Prune [13], Algorithm Voting [14] and RISSOTO [15]. Approximate algorithms, on the other hand, take less time than exact algorithms. They usually employ heuristics such as local search, Gibbs sampling, exponential optimization, etc. Some examples of approximate algorithms are Algorithm MEME [16], Algorithm PROJECTION [17], Algorithm Gibbs DNA [18], Algorithm WINNOWER [19], and Algorithm Random Projection [20]. Other approximate PMS algorithms are MULTIPROFILER [21], Algorithm Pattern Branching [22], Algorithm Profile Branching [22], Algorithm CONSENSUS [23] and a genetic algorithm [24].
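To make the difference between the pattern-driven and sample-driven search spaces concrete, the following minimal sketch counts the candidates in each case. The values t = 20 and n = 600 are those of the simulated data used in the experiments of this paper; the function names and the chosen values of l are illustrative assumptions.

```python
# Candidate counts for pattern-driven vs. sample-driven motif search.
# Function names and the values of l are illustrative assumptions.

def pattern_driven_candidates(alphabet_size: int, l: int) -> int:
    """Number of all possible l-mers over the alphabet: |Sigma|^l."""
    return alphabet_size ** l


def sample_driven_candidates(t: int, n: int, l: int) -> int:
    """Upper bound on the number of l-mers occurring in t strings of length n."""
    return t * (n - l + 1)


if __name__ == "__main__":
    for l in (11, 15, 19):
        print(l,
              pattern_driven_candidates(4, l),
              sample_driven_candidates(20, 600, l))
```

For l = 15 over the DNA alphabet, the pattern-driven space has 4^15 = 1,073,741,824 candidates, while the input contains at most 20 × 586 = 11,720 l-mers.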
In this paper, a new simple idea is proposed to speed up the PMS-SD algorithm; since PMS algorithms use PMS-SD as a subroutine, a faster PMS-SD algorithm results in a faster PMS algorithm. The remainder of this paper is organized as follows. In Section 2, some definitions and theorems are introduced. In Section 3, Algorithm PMS is described briefly, and then a new algorithm based on PMS is proposed. In Section 4, the PMS-SD algorithm, in which the search zone is restricted to the input sequences, is improved. Experimental results are shown in Section 5, and Section 6 concludes the paper.

Definitions and Theorems
(3) and (4) would be proven in exactly the same way as (1).
Definition 3. Let s and x be two strings with lengths n and l, respectively, such that l < n. We define
Based on Theorem 2, if … and … then … has a chance to be in …. The next theorem is based on this fact.
Theorem 3. Let … and … be two strings over an alphabet Σ.
(1) and (4) in Theorem 3.
Theorem 1 and the definition of … help us reduce the local search in the input sequences, which is a part of all motif search algorithms. This result is stated in the next theorem.
Theorem 4. Let s_i and s_j be two strings over an alphabet Σ.
Definition 4 (planted motif search problem). Let S = {s_1, s_2, …, s_t} be a set of t strings of length n over an alphabet Σ, and let l and d be nonnegative integers satisfying 0 ≤ d < l < n. The (l,d)-motif search problem is to find a string x of length l, called a motif, such that every string in S contains an l-mer that differs from x in at most d positions.
Now, we use Corollary 1 and Theorem 4 to propose a technique for speeding up the PMS algorithm, which we call PMS two-step.
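The planted motif condition can be checked directly, as the following minimal sketch shows. It assumes that "mismatches" means Hamming distance and that a string x qualifies when every input string contains an l-mer within distance d of it. The helper names (hamming, lmers, distance_to_string, is_motif) are illustrative and do not come from the paper.

```python
from typing import Iterator, List


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def lmers(s: str, l: int) -> Iterator[str]:
    """All substrings of s of length l, from left to right."""
    return (s[i:i + l] for i in range(len(s) - l + 1))


def distance_to_string(x: str, s: str) -> int:
    """Smallest Hamming distance between x and any l-mer of s (l = len(x))."""
    return min(hamming(x, y) for y in lmers(s, len(x)))


def is_motif(x: str, strings: List[str], d: int) -> bool:
    """True if every string contains an l-mer within distance d of x."""
    return all(distance_to_string(x, s) <= d for s in strings)


if __name__ == "__main__":
    strings = ["ACGTACGT", "TTACGAAA", "GGGACCTG"]
    print(is_motif("ACGT", strings, 1))  # every string has an l-mer within 1 mismatch
    print(is_motif("ACGT", strings, 0))  # only the first string contains ACGT exactly
```

This brute-force check costs O(t·n·l) per candidate; it is useful as a reference when validating faster algorithms.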

PMS Two-Step Algorithm for Pattern-Driven Motif
First, we describe Algorithm PMS briefly, because our new idea is applied to it. For more details about Algorithm PMS, the reader is referred to [13].
The main flow of the algorithm is as follows. First, for each l-mer x of the first string, find all of its neighbors within distance d. Each of these neighbors could be a motif candidate. Let x′ be one of these neighbors. To investigate whether x′ could be a motif, we proceed as follows: if every other string contains an l-mer that is within distance 2d of x and within distance d of x′, then x′ is reported as a motif. Algorithm PMS can be presented as the following pseudocode.
Steps 8 to 12 follow from Corollary 1.
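One common way to reduce the number of full distance calculations between equal-length strings is a sliding-window update. The sketch below illustrates it as an assumption about the kind of shortcut involved; it is not a restatement of Corollary 1. When both windows are shifted by one position, the new distance is the old one, minus the mismatch of the pair that leaves the window, plus the mismatch of the pair that enters it. The function name diagonal_distances and its parameters are hypothetical.

```python
from typing import List


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(ca != cb for ca, cb in zip(a, b))


def diagonal_distances(s1: str, s2: str, l: int, i0: int, j0: int) -> List[int]:
    """Distances between s1[i0+k : i0+k+l] and s2[j0+k : j0+k+l] for k = 0, 1, ...

    The first pair of windows is compared in full (O(l)); every later
    distance is obtained from the previous one in O(1).
    """
    steps = min(len(s1) - l - i0, len(s2) - l - j0)  # how many times both windows can slide
    if steps < 0:
        return []
    dist = hamming(s1[i0:i0 + l], s2[j0:j0 + l])
    out = [dist]
    for k in range(steps):
        i, j = i0 + k, j0 + k
        dist -= s1[i] != s2[j]          # character pair leaving the windows
        dist += s1[i + l] != s2[j + l]  # character pair entering the windows
        out.append(dist)
    return out


if __name__ == "__main__":
    print(diagonal_distances("ACGTACGTAC", "ACCTAGGTAA", 4, 0, 0))
```

Note that the first window pair on each diagonal has no predecessor, so its distance still has to be computed in full.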
Step 13 is based on the fact that the first l-mers of strings 2 to t cannot be obtained using Corollary 1. So we have to investigate these t-1 remaining l-mers for distance 2d separately. … space to do this process. If … is changed to … for k = 1, 2, …, t, the PMS algorithm is restricted to finding motifs that are present in the input sequences and is called the PMS-SD algorithm.
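The overall flow of the baseline algorithm described above can be sketched as follows. This is a plain illustration under stated assumptions, not the authors' pseudocode: it does not follow their step numbering and does not include the two-step speed-up. It assumes Hamming distance, a DNA alphabet by default, and that the sample-driven variant simply discards candidates that do not occur as l-mers of the input. All function and parameter names are hypothetical.

```python
from itertools import combinations, product
from typing import Iterator, List, Set


def hamming(a: str, b: str) -> int:
    return sum(ca != cb for ca, cb in zip(a, b))


def lmers(s: str, l: int) -> Iterator[str]:
    return (s[i:i + l] for i in range(len(s) - l + 1))


def neighbors(x: str, d: int, alphabet: str = "ACGT") -> Iterator[str]:
    """Every string within Hamming distance d of x, each generated once."""
    for k in range(d + 1):
        for positions in combinations(range(len(x)), k):
            # At each chosen position, substitute a letter different from the original.
            choices = [[c for c in alphabet if c != x[p]] for p in positions]
            for letters in product(*choices):
                y = list(x)
                for p, c in zip(positions, letters):
                    y[p] = c
                yield "".join(y)


def pms(strings: List[str], l: int, d: int,
        alphabet: str = "ACGT", sample_driven: bool = False) -> Set[str]:
    """Return all (l,d)-motifs; with sample_driven=True keep only input l-mers."""
    first, rest = strings[0], strings[1:]
    input_lmers = {y for s in strings for y in lmers(s, l)}
    motifs: Set[str] = set()
    for x in set(lmers(first, l)):
        # Only l-mers within 2d of x can be within d of a neighbor of x.
        near = [[y for y in lmers(s, l) if hamming(x, y) <= 2 * d] for s in rest]
        if any(not ys for ys in near):
            continue
        for cand in neighbors(x, d, alphabet):
            if cand in motifs or (sample_driven and cand not in input_lmers):
                continue
            if all(any(hamming(cand, y) <= d for y in ys) for ys in near):
                motifs.add(cand)
    return motifs


if __name__ == "__main__":
    data = ["TTACGTAA", "GGACCTGG", "CCAGGTCC"]
    print(sorted(pms(data, 4, 1)))
    print(sorted(pms(data, 4, 1, sample_driven=True)))
```

The 2d filter is justified by the triangle inequality: if x′ is within d of x and an l-mer y is within d of x′, then y is within 2d of x.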
In this section, we use Corollary 1 and Theorem 4 to improve the PMS algorithm for the sample-driven motif search problem. In other words, we search for motifs that are present in the given strings.
For each 6

Finding the Simulated DNA Motif
The PMS-SD two-step and PMS two-step algorithms are evaluated by computational experiments. To do this, a set of random data is generated as described in [13]. First, 20 strings of length 600 are generated randomly such that each letter has equal probability 1/|Σ|. Then, a motif of length l is generated randomly in the same manner. Next, 20 instances are generated from the motif by mutating the letter at exactly d random positions. Finally, one instance is planted in each sequence at a randomly selected position. The proposed algorithms have been written in C. The computational comparison has been performed on a laptop computer with an Intel Core i3-M330 CPU (2.13 GHz) and 4 GB of memory. The running times of PMS and PMS two-step are reported in Table 1, and those of Algorithm PMS-SD and Algorithm PMS-SD two-step in Table 2. Based on Table 2, PMS-SD two-step has reduced the running time of PMS-SD by about 40 percent.
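The data generation procedure described above can be sketched as follows. This is an illustration only, not the C code used in the experiments. It assumes that planting overwrites l consecutive letters of the sequence, so that the length stays 600, and that each mutated position receives a letter different from the original. The function names, the seed parameter, and the default values l = 15 and d = 5 are hypothetical choices for the example.

```python
import random
from typing import List, Tuple


def random_string(n: int, alphabet: str, rng: random.Random) -> str:
    """n letters drawn independently and uniformly from the alphabet."""
    return "".join(rng.choice(alphabet) for _ in range(n))


def mutate(motif: str, d: int, alphabet: str, rng: random.Random) -> str:
    """Change the letter at exactly d distinct random positions."""
    chars = list(motif)
    for p in rng.sample(range(len(motif)), d):
        chars[p] = rng.choice([c for c in alphabet if c != chars[p]])
    return "".join(chars)


def planted_instance(t: int = 20, n: int = 600, l: int = 15, d: int = 5,
                     alphabet: str = "ACGT", seed: int = 0) -> Tuple[str, List[str]]:
    """Return (motif, strings) with one mutated motif instance planted per string."""
    rng = random.Random(seed)
    strings = [random_string(n, alphabet, rng) for _ in range(t)]
    motif = random_string(l, alphabet, rng)
    planted = []
    for s in strings:
        instance = mutate(motif, d, alphabet, rng)
        pos = rng.randrange(n - l + 1)  # uniformly chosen planting position
        planted.append(s[:pos] + instance + s[pos + l:])  # overwrite (assumption)
    return motif, planted


if __name__ == "__main__":
    motif, strings = planted_instance()
    print(motif, len(strings), len(strings[0]))
```

Fixing the seed makes a generated instance reproducible, which helps when comparing the running times of two algorithms on the same input.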

Finding Real DNA Motif
The proposed algorithm was tested on real DNA data sets, namely the preproinsulin, DHFR, metallothionein, c-fos and Yeast ECB data sets, as in [25]. Algorithm PMS two-step found transcription regulatory elements in these data sets.

Conclusion
In this paper, a new idea was used to speed up the running time of the PMS algorithm. Using some characteristics of DNA strings, some theorems were proven that reduce the number of distance calculations between two strings of the same length. Algorithms PMS two-step and PMS-SD two-step, based on Algorithm PMS, were proposed. As Table 1 shows, although the voting algorithm finds the (15,5) motif in 14.8 min, which is better than PMS two-step (22.7 min) and PMS (23.5 min), it fails to find the (17,6) motif, which is found by PMS two-step (8 h) and PMS (7.8 h). For the SD motif search problem, Table 2 shows the results obtained from Algorithm PMS-SD and Algorithm PMS-SD two-step. For all tested (l,d) motifs, the running time of Algorithm PMS-SD two-step is better than that of Algorithm PMS-SD. For example, for the (25,10) motif, Algorithm PMS-SD two-step takes 1.5 s, while Algorithm PMS-SD needs 2.26 s to find the motif.