String Matching Problem

Given a text string T of length n and a pattern string P of length m, the exact string matching problem is to find all occurrences of P in T.

Example: T = AGCTTGA, P = GCT (P occurs in T starting at position 2)

Applications:
Searching for keywords in a file
Search engines (like Google and Openfind)
Database searching (GenBank)
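
As a quick illustration of the problem statement, the sketch below lists every occurrence of P in T using Python's built-in `str.find`. The helper name `find_all` is an assumption made for this example and is not from the slides; positions are 0-based, so the slide's position 2 appears as index 1.

```python
def find_all(text: str, pattern: str) -> list:
    """Return the 0-based start index of every occurrence of pattern in text."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping occurrences are reported too.
        start = text.find(pattern, start + 1)
    return positions


print(find_all("AGCTTGA", "GCT"))  # [1]
```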

What is pattern matching?


Problem: find an occurrence of a pattern (string) P in a string S, and report the position in S where the match occurs.

Brute Force algorithm


The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either
a match is found, or
all placements of the pattern have been tried.

Brute-force
algorithm brute-force:
  input: an array of characters, T (the string to be analyzed), length n
         an array of characters, P (the pattern to be searched for), length m
  for i := 0 to n-m do
    for j := 0 to m-1 do
      compare T[i+j] with P[j]
      if not equal, exit the inner loop
    if the inner loop ran to completion, report a match at shift i
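
The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `brute_force_search` is an assumption, indices are 0-based, and the function collects every matching shift rather than stopping at the first one.

```python
def brute_force_search(text, pattern):
    """Return every 0-based shift at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for i in range(n - m + 1):  # each possible shift of the pattern
        j = 0
        # Advance while the characters under the pattern keep matching.
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:  # the whole pattern matched at shift i
            shifts.append(i)
    return shifts


print(brute_force_search("AGCTTGA", "GCT"))  # [1]
```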

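Worst case: O(m*n) comparisons, when almost all of P matches at every shift.
Best case: O(n) comparisons, when the first character of P mismatches at every shift.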
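
To make the two bounds concrete, the sketch below counts character comparisons for one input of each kind. The helper `count_comparisons` and the chosen test strings are assumptions made for illustration.

```python
def count_comparisons(text, pattern):
    """Count the character comparisons made by the brute-force scan."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for i in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[i + j] != pattern[j]:
                break  # mismatch: move on to the next shift
    return comparisons


# Worst case: every shift matches m-1 characters before failing on the last one.
worst = count_comparisons("a" * 20, "aaaab")   # (20 - 5 + 1) * 5 comparisons
# Best case: the first character fails at every shift.
best = count_comparisons("a" * 20, "bbbbb")    # 20 - 5 + 1 comparisons
print(worst, best)  # 80 16
```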

Example
Compare each character of p with S: if they match, continue; otherwise shift p one position to the right.

String S:  a b c a b a a b c a b a c
Pattern p: a b a a

Step 1: compare p[1] with S[1] (match)

S  a b c a b a a b c a b a c
p  a b a a

Step 2: compare p[2] with S[2] (match)

S  a b c a b a a b c a b a c
p  a b a a

Step 3: compare p[3] with S[3]

S  a b c a b a a b c a b a c
p  a b a a
       ^ Mismatch occurs here.

Since a mismatch is detected, shift p one position to the right and repeat the matching procedure from step 1.
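
The walkthrough above can be replayed in code. The sketch below records, for each shift of p, the first index in p that mismatches (or `None` for a full match). The function name `trace_brute_force` is an assumption, and indices are 0-based, so the slide's mismatch at p[3] shows up as index 2.

```python
def trace_brute_force(text, pattern):
    """Return a list of (shift, mismatch_index); mismatch_index is None on a match."""
    n, m = len(text), len(pattern)
    steps = []
    for shift in range(n - m + 1):
        mismatch = None
        for j in range(m):
            if text[shift + j] != pattern[j]:
                mismatch = j
                break
        steps.append((shift, mismatch))
    return steps


for shift, mismatch in trace_brute_force("abcabaabcabac", "abaa"):
    if mismatch is None:
        print(f"shift {shift}: match")
    else:
        print(f"shift {shift}: mismatch at pattern index {mismatch}")
```

With these inputs the first line reports a mismatch at pattern index 2 for shift 0, and the only match is at shift 3.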

The Knuth-Morris-Pratt Algorithm


Knuth, Morris and Pratt proposed a linear-time algorithm for the string matching problem. A matching time of O(n) is achieved by avoiding comparisons with elements of S that have previously been involved in a comparison with some element of the pattern p to be matched; i.e., backtracking on the string S never occurs.

Components of KMP algorithm


The prefix function, π
The prefix function π for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. This information can be used to avoid useless shifts of the pattern p. In other words, it enables avoiding backtracking on the string S.

The KMP Matcher
With string S, pattern p and prefix function π as inputs, it finds the occurrences of p in S and returns the number of shifts of p after which each occurrence is found.

Knuth-Morris-Pratt algorithm
Algorithm Compute-Prefix-Function(P)
  m ← length[P]
  π[1] ← 0
  k ← 0
  for q ← 2 to m
    do while k > 0 and P[k + 1] ≠ P[q]
         do k ← π[k]
       /* if k = 0 or P[k + 1] = P[q], going out of the while-loop. */
       if P[k + 1] = P[q]
         then k ← k + 1
       π[q] ← k
  return π
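
A minimal Python sketch of the prefix function follows. It assumes 0-based indexing, so `pi[q]` here corresponds to π[q + 1] in the 1-based pseudocode; the function name `compute_prefix_function` is chosen to mirror the pseudocode.

```python
def compute_prefix_function(pattern):
    """pi[q] = length of the longest proper prefix of pattern[:q+1]
    that is also a suffix of pattern[:q+1]."""
    m = len(pattern)
    pi = [0] * m
    k = 0  # length of the currently matched prefix
    for q in range(1, m):
        # Fall back to shorter borders until the next character can extend one.
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


print(compute_prefix_function("ababababca"))
# [0, 0, 1, 2, 3, 4, 5, 6, 0, 1]
```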

Knuth-Morris-Pratt algorithm
Algorithm KMP-Matcher(T, P)
  n ← length[T]
  m ← length[P]
  π ← Compute-Prefix-Function(P)
  q ← 0
  for i ← 1 to n
    do while q > 0 and P[q + 1] ≠ T[i]
         do q ← π[q]
       if P[q + 1] = T[i]
         then q ← q + 1
       if q = m
         then print "pattern occurs with shift" i − m
              q ← π[q]
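
The matcher can be sketched in Python along the same lines. This is an illustrative version under stated assumptions: 0-based indexing, shifts returned as a list instead of printed, an empty pattern treated as having no occurrences, and the prefix function repeated so the block runs on its own.

```python
def compute_prefix_function(pattern):
    """0-based prefix function: pi[q] is the longest border of pattern[:q+1]."""
    pi = [0] * len(pattern)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_search(text, pattern):
    """Return every 0-based shift at which pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return []  # assumption: an empty pattern has no occurrences
    pi = compute_prefix_function(pattern)
    shifts = []
    q = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):  # i only moves forward: no backtracking on text
        while q > 0 and pattern[q] != ch:
            q = pi[q - 1]
        if pattern[q] == ch:
            q += 1
        if q == m:
            shifts.append(i - m + 1)
            q = pi[q - 1]  # keep going to find later (possibly overlapping) matches
    return shifts


print(kmp_search("ababaababababca", "ababababca"))  # [5]
```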

Compute prefix function


P = ababababca, T = ababaababababca

π[1] = 0
k = 0, q = 2: P[k + 1] = P[1] = a, P[q] = P[2] = b, P[k + 1] ≠ P[q]; π[q] ← k (π[2] ← 0)
k = 0, q = 3: P[k + 1] = P[1] = a, P[q] = P[3] = a, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[3] ← 1)
k = 1, q = 4: P[k + 1] = P[2] = b, P[q] = P[4] = b, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[4] ← 2)
k = 2, q = 5: P[k + 1] = P[3] = a, P[q] = P[5] = a, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[5] ← 3)
k = 3, q = 6: P[k + 1] = P[4] = b, P[q] = P[6] = b, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[6] ← 4)
k = 4, q = 7: P[k + 1] = P[5] = a, P[q] = P[7] = a, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[7] ← 5)
k = 5, q = 8: P[k + 1] = P[6] = b, P[q] = P[8] = b, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[8] ← 6)
k = 6, q = 9: P[k + 1] = P[7] = a, P[q] = P[9] = c, P[k + 1] ≠ P[q]; k ← π[k] (k ← π[6] = 4)
              P[k + 1] = P[5] = a, P[q] = P[9] = c, P[k + 1] ≠ P[q]; k ← π[k] (k ← π[4] = 2)
              P[k + 1] = P[3] = a, P[q] = P[9] = c, P[k + 1] ≠ P[q]; k ← π[k] (k ← π[2] = 0)
k = 0, q = 9: P[k + 1] = P[1] = a, P[q] = P[9] = c, P[k + 1] ≠ P[q]; π[q] ← k (π[9] ← 0)
k = 0, q = 10: P[k + 1] = P[1] = a, P[q] = P[10] = a, P[k + 1] = P[q]; k ← k + 1, π[q] ← k (π[10] ← 1)

After the prefix computation, the table is as shown below.

P = ababababca

i      1  2  3  4  5  6  7  8  9  10
P[i]   a  b  a  b  a  b  a  b  c  a
π[i]   0  0  1  2  3  4  5  6  0  1

Iterating π starting from q = 8 gives the prefixes of P that are also suffixes of P8:

π[8] = 6, π[6] = 4, π[4] = 2, π[2] = 0

P8  a b a b a b a b
P6      a b a b a b
P4          a b a b
P2              a b
P0  (empty string)
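
The chain π[8] = 6, π[6] = 4, π[4] = 2, π[2] = 0 can be reproduced by applying π repeatedly. The sketch below is one way to do that; the name `border_chain` is an assumption, q is the 1-based prefix length used on the slides, and the prefix function is repeated so the block runs on its own.

```python
def compute_prefix_function(pattern):
    """0-based prefix function: pi[q] is the longest border of pattern[:q+1]."""
    pi = [0] * len(pattern)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def border_chain(pattern, q):
    """Lengths reached by iterating pi from the prefix of length q down to 0."""
    pi = compute_prefix_function(pattern)
    chain = []
    k = q
    while k > 0:
        k = pi[k - 1]  # pi[k - 1] is the slide's 1-based pi[k]
        chain.append(k)
    return chain


print(border_chain("ababababca", 8))  # [6, 4, 2, 0]
```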

Another Example for KMP Algorithm


Phase 1: first finish the prefix computation.
Phase 2: next, the search phase computation.

[Figure: worked example of the two phases. Recoverable annotations: f(4−1)+1 = f(3)+1 = 0+1 = 1; matched; f(13−1)+1 = 4+1 = 5]
