Vous êtes sur la page 1sur 11

Strings

6. Strings

String. Sequence of characters. Ex. Natural languages, Java programs, genomic sequences,

The digital information that underlies biochemistry, cell biology, and development can be represented by a simple string of G's, A's, T's and C's. This string is the root data structure of an organism's biology. -M. V. Olson

Robert Sedgewick and Kevin Wayne Copyright 2006 http://www.Princeton.EDU/~cos226

Using Strings in Java


String concatenation. Append one string to end of another string. Substring. Extract a contiguous list of characters from a string.

Implementing Strings In Java


Memory. 40 + 2N bytes for a virgin string!
could use byte array instead of String to save space

s
0

t
1

r
2

i
3

n
4

g
5

s
6

public final class String private char[] value; private int offset; private int count; private int hash;

implements Comparable<String> { // characters // index of first char into array // length of string // cache of hashCode()

String char String String

s c t u

= = = =

"strings"; s.charAt(2); s.substring(2, 6); s + t;

// // // //

s c t u

= = = =

"strings" 'r' "ring" "stringsring"

private String(int offset, int count, char[] value) { this.offset = offset; this.count = count; this.value = value; } public String substring(int from, int to) { return new String(offset + from, to - from, value); } }
3

java.lang.String
4

String vs. StringBuilder


String. [immutable] Fast substring, slow concatenation. StringBuilder. [mutable] Slow substring, fast (amortized) append.

Longest Common Prefix


Longest common prefix. Given two strings, find the common prefix that is as long as possible.

public static String reverse(String s) { String rev = ""; for (int i = s.length() - 1; i >= 0; i--) rev += s.charAt(i); return rev; }

p
quadratic time
0

r
1

e
2

f
3

i
4

x
5 6 7

public static String reverse(String s) { StringBuilder rev = new StringBuilder(); for (int i = s.length() - 1; i >= 0; i--) rev.append(s.charAt(i)); return rev.toString(); }

linear time

public static String lcp(String s, String t) { int n = Math.min(s.length(), t.length()); for (int i = 0; i < n; i++) { if (s.charAt(i) != t.charAt(i)) return s.substring(0, i); } return s.substring(0, n); }

Radix Sorting

Radix Sorting

Radix sorting. Specialized sorting solution for strings. Same ideas for bits, digits, etc.
! !

Applications. Bioinformatics. Sorting strings. Full text indexing. Plagiarism detection. Burrows-Wheeler transform. [see data compression]
! ! ! ! !

Reference: Chapter 13, Algorithms in Java, 3rd Edition, Robert Sedgewick.

Robert Sedgewick and Kevin Wayne Copyright 2006 http://www.Princeton.EDU/~cos226

An Application: Redundancy Detector


Longest repeated substring. Given a string of N characters, find the longest repeated substring. Ex: a a c a a g t t t a c a a g c Application: bioinformatics.
! ! !

An Application: Redundancy Detector


Longest repeated substring. Given a string of N characters, find the longest repeated substring. Ex: a a c a a g t t t a c a a g c Application: bioinformatics.
! ! !

Dumb brute force. Try all indices i and j, and all match lengths k, and check. O(W N3 ) time, where W is length of longest match.
! !

Brute force. Try all indices i and j for start of possible match, and check. O(W N2 ) time, where W is length of longest match.
! !

a
i

t
j

a
i

a
j

10

An Application: Redundancy Detector


Longest repeated substring. Given a string of N characters, find the longest repeated substring. Ex: a a c a a g t t t a c a a g c Application: bioinformatics.
! ! !

Suffix Sorting

suffixes
0 1 2

sorted suffixes
t a c a a g c a c a a g c c a a g c a a g c a g c g c c
0 11 3 9 1 12 4 14 10 2 13 5 8 7 6

a a c a a g t t t a c a a g c

a c a a g t t t a c a a g c

c a a g t t t a c a a g c

a a g t t t a c a a g c

a g t t t a c a a g c

g t t t a c a a g c

t t t a c a a g c

t t a c a a g c

a a a a a a a c c c g g t t t

a a a c c g g a a c t a t t

c g g a a c t a a t c a t

a c t a a t g g t a c a

a t g g t c t a a a c

g t c t a

t a t c

t c t a

t a a a

a a c g

c g a c

a c a

Suffix sort solution. Form N suffixes of original string. Sort to bring longest repeated substrings together. O(W N log N) time, where W is length of longest match.
! ! !

3 4 5 6 7 8 9 10 11 12 13 14

t c g a a

t a c g a

a a c g

c g

a c

11

12

Suffix Sorting: Java Implementation


public class LRS { public static void main(String[] args) { String s = StdIn.readAll(); int N = s.length(); String[] suffixes = new String[N]; for (int i = 0; i < N; i++) suffixes[i] = s.substring(i, N); Arrays.sort(suffixes);
read input

String Sorting Performance

String Sort Worst Case Brute Quicksort W N2 W N log N

Suffix (sec) Moby Dick 36,000 9.5

create suffixes (linear time) sort suffixes

String lrs = ""; for (int i = 0; i < N - 1; i++) { String x = lcp(suffixes[i], suffixes[i+1]); if (x.length() > lrs.length()) lrs = x; } find longest match System.out.println(lrs); } }
W = max length of string N = number of strings
1.2 million for Moby Dick

estimate probabilistic guarantee

% java LRS < mobydick.txt ,- Such a funny, sporty, gamy, jesty, joky, hoky-poky lad, is the Ocean, oh! Th
13

14

String Sorting Notation

LSD Radix Sort

Notation. String = sequence of characters. W = max # characters per string. N = # input strings. R = radix.
! ! ! !

256 for extended ASCII, 65,536 for original Unicode

Java syntax. Array of strings: Number of strings: The ith string: The dth character of the ith string: Strings to be sorted:
! ! ! ! !

String[] a N = a.length a[i] a[i].charAt(d) a[0], , a[N-1]

Lysergic Acid Diethylamide, Circa 1960

Card Sorter, Circa 1960

Robert Sedgewick and Kevin Wayne Copyright 2006 http://www.Princeton.EDU/~cos226

16

Key Indexed Counting


Key indexed counting. Count frequencies of each letter. [0th character]
!

Key Indexed Counting


Key indexed counting. Count frequencies of each letter. [0th character] Compute cumulative frequencies.
! !

a
0 1 d a c f f b d b f b e a a d a a e a a e e e b c b d b d e d d e d d b e

count
a b c d e f g 0 2 3 1 2 1 3 0 1

a
d a c f f b d b f b e a a d a a e a a e e e b c b d b d e d d e d d b e a b c d e f g 0 2 3 1 2 1 3

count
a b c d e f 0 2 5 6 8 9

int N = a.length; int[] count = new int[256+1]; for (int i = 0; i < N; i++) { char c = a[i].charAt(d); count[c+1]++; d=0 }
frequencies

2 3 4 5 6 7 8 9 10 11

for (int i = 1; i < 256; i++) count[i] += count[i-1];


cumulative counts

2 3 4 5 6 7 8 9 10 11

g 12

18

19

Key Indexed Counting


Key indexed counting. Count frequencies of each letter. [0th character] Compute cumulative frequencies. Use cumulative frequencies to rearrange strings.
! ! !

Key Indexed Counting


Key indexed counting. Count frequencies of each letter. [0th character] Compute cumulative frequencies. Use cumulative frequencies to rearrange strings.
! ! !

a
0 1 d a c f f b d b f b e a a d a a e a a e e e b c b d b d e d d e d d b e

count
a b c d e 1 2 5 6 8 9 0 1 2 3 4 5 6 7 8 9 10 11

temp
a a b b b c d d e f f f d c a e e a a a b a e e d e d e d b b d b d e d
32

a
0 1 a a b b b c d d e f f f d c a e e a a a b a e e d e d e d b b d b d e d

count
a b c d e 2 5 6 8 9 0 1 2 3 4 5 6 7 8 9 10 11

temp
a a b b b c d d e f f f d c a e e a a a b a e e d e d e d b b d b d e d
33

for (int i = 0; i < N; i++) { char c = a[i].charAt(d); temp[count[c]++] = a[i]; }


rearrange

2 3 4 5 6 7 8 9 10 11

for (int i = 0; i < N; i++) a[i] = temp[i];


copy back

2 3 4 5 6 7 8 9 10 11

f 12 g 12

f 12 g 12

Least Significant Digit Radix Sort


LSD. Consider digits d from right to left: stably sort using dth character as the key via key-indexed counting.

Least Significant Digit Radix Sort


LSD. Consider digits d from right to left: stably sort using dth character as the key via key-indexed counting.

sort key

sort key

sort key

0 1 2 3 4 5 6 7 8 9 10 11

d a c f f b d b f b e a

a d a a e a a e e e b c

b d b d e d d e d d b e

0 1 2 3 4 5 6 7 8 9 10 11

d c e a f b d f b f b a

a a b d a a a e e e e c

b b b d d d d d d e e e

0 1 2 3 4 5 6 7 8 9 10 11

d c f b d e a a f b f b

a a a a a b c d e e e e

b b d d d b e d d d e e

0 1 2 3 4 5 6 7 8 9 10 11

a a b b b c d d e f f f

c d a e e a a a b a e e

e d d d e b b d b d d e
34 35

public static void lsd(String[] a) { int W = a[0].length(); for (int d = W-1; d >= 0; d--) { // do key-indexed counting sort on digit d ... } } Assumes fixed length strings (length = W)

LSD Radix Sort: Correctness


Pf 1. [left-to-right] If two strings differ on first character, keyindexed sort puts them in proper relative order. If two strings agree on first character, stability keeps them in proper relative order.
! !

LSD Radix Sort Correctness


Running time. !(W(N + R)).
why doesn't it violate N log N lower bound?

Advantage. Fastest sorting method for random fixed length strings. Disadvantages. Accesses memory "randomly." Inner loop has a lot of instructions. Wastes time on low-order characters. Doesn't work for variable-length strings. Not much semblance of order until very last pass.
! ! ! ! !

Pf 2. [right-to-left] If the characters not yet examined differ, it doesn't matter what we do now. If the characters not yet examined agree, stability ensures later pass won't affect order.
! !

Goal. Find fast algorithm for variable length strings.

36

37

MSD Radix Sort

MSD Radix Sort

Most significant digit radix sort. Partition file into 256 pieces according to first character. Recursively sort all strings that start with the same character, etc.
! !

Q. How to sort on dth character? A. Key-indexed counting.

Robert Sedgewick and Kevin Wayne Copyright 2006 http://www.Princeton.EDU/~cos226

39

MSD Radix Sort Implementation

String Sorting Performance

public static void msd(String[] a) { msd(a, 0, a.length, 0); }

String Sort Worst Case


inclusive exclusive

Suffix (sec) Moby Dick 36,000

Brute Quicksort

W N2 W N log N W(N + R) W(N + R)

9.5 395

private static void msd(String[] a, int l, int r, int d) { if (r <= l + 1) return; // key-indexed counting sort on digit d of a[l] to a[r] int[] count = new int[256+1]; ... // recursively sort 255 subfiles assumes '\0' terminated for (int i = 0; i < 255; i++) msd(a, l + count[i], l + count[i+1], d+1); }

LSD * MSD

R = radix W = max length of string N = number of strings


1.2 million for Moby Dick

estimate * fixed length strings only probabilistic guarantee

40

41

MSD Radix Sort: Small Files


Disadvantages. Too slow for small files. ASCII: 100x slower than insertion sort for N = 2 Unicode: 30,000x slower for N = 2 Huge number of recursive calls on small files.
! !

String Sorting Performance

String Sort Worst Case Brute Quicksort LSD * W N2 W N log N W(N + R) W(N + R) W(N + R)

Suffix (sec) Moby Dick 36,000 9.5 395 6.8

Solution. Cutoff to insertion sort for small N. Consequence. Competitive with quicksort for string keys.

MSD MSD with cutoff

R = radix W = max length of string N = number of strings


1.2 million for Moby Dick
42

estimate * fixed length strings only probabilistic guarantee

43

Recursive Structure: MSD


Recursive structure. R-way branching leads to lots of empty calls.

3-Way Radix Quicksort

44

Robert Sedgewick and Kevin Wayne Copyright 2006 http://www.Princeton.EDU/~cos226

3-Way Radix Quicksort


Idea 1. Use dth character to "sort" into 3 pieces instead of 256, and sort each piece recursively. Idea 2. Keep all duplicates together in partitioning step.

Recursive Structure: MSD vs. 3-Way Quicksort


3-way radix quicksort collapses empty links in MSD tree.

MSD radix sort recursion tree (1035 null links, not shown)

3-way partition

3-way radix quicksort


46

3-way radix quicksort recursion tree (155 null links)


47

3-Way Radix Quicksort


private static void quicksortX(String a[], int lo, int hi, int d) { if (hi - lo <= 0) return; int i = lo-1, j = hi; int p = lo-1, q = hi; char v = a[hi].charAt(d);
repeat until pointers cross while (i < j) { while (a[++i].charAt(d) < v) if (i == hi) break; find i on left and while (v < a[--j].charAt(d)) if (j == lo) break; j on right to swap if (i > j) break; exch(a, i, j); if (a[i].charAt(d) == v) exch(a, ++p, i); swap equal chars if (a[j].charAt(d) == v) exch(a, j, --q); to left or right } if (p == q) { special case for if (v != '\0') quicksortX(a, lo, hi, d+1); all equal chars return; } if (a[i].charAt(d) < v) i++; for (int k = lo; k <= p; k++) exch(a, k, j--); swap equal ones back to middle for (int k = hi; k >= q; k--) exch(a, k, i++); quicksortX(a, lo, j, d); if ((i == hi) && (a[i].charAt(d) == v)) i++; sort 3 pieces recursively if (v != '\0') quicksortX(a, j+1, i-1, d+1); quicksortX(a, i, hi, d);

Quicksort vs. 3-Way Radix Quicksort


Quicksort. 2N ln N string comparisons on average. Long keys are costly to compare if they differ only at the end, and this is common case! absolutism, absolut, absolutely, absolute.
! ! !

3-way radix quicksort. Avoids re-comparing initial parts of the string. Uses just "enough" characters to resolve order. 2 N ln N character comparisons on average for random strings. Sub-linear sort for large W since input is of size NW.
! ! ! !

Theorem. Quicksort with 3-way partitioning is OPTIMAL. Pf. Ties cost to entropy. Beyond scope of 226.

}
48 49

String Sorting Performance

Suffix Sorting: Worst Case Input


Length of longest match small. Hard to beat 3-way radix quicksort.
!

String Sort Worst Case Brute Quicksort LSD * MSD MSD with cutoff 3-way radix quicksort W N2 W N log N W(N + R) W(N + R) W(N + R) W N log N

Suffix (sec) Moby Dick 36,000 9.5 395 6.8 2.8

Length of longest match very long. 3-way radix quicksort is quadratic. Ex: two copies of Moby Dick.
! !

Can we do better? !(N log N) ? !(N) ?

Observation. Must find longest repeated substring while suffix sorting to beat N2.
R = radix W = max length of string N = number of strings estimate * fixed length strings only probabilistic guarantee

abcdefghi abcdefghiabcdefghi bcdefghi bcdefghiabcdefghi cdefghi cdefghiabcdefgh defghi efghiabcdefghi efghi fghiabcdefghi fghi ghiabcdefghi fhi hiabcdefghi hi iabcdefghi i Input: "abcdeghiabcdefghi"

1.2 million for Moby Dick


50 51

Suffix Sorting in Linearithmic Time: Key Idea


sorted

Suffix Sorting in Sub-Quadratic Time


Manber's MSD algorithm. Phase 0: sort on first character using key-indexed sorting. Phase i: given list of suffixes sorted on first 2i-1 characters, create list of suffixes sorted on first 2i characters Finishes after lg N phases.
! ! !

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

babaaaabcbabaaaaa0 abaaaabcbabaaaaa0b baaaabcbabaaaaa0ba aaaabcbabaaaaa0bab aaabcbabaaaaa0baba aabcbabaaaaa0babaa abcbabaaaaa0babaaa bcbabaaaaa0babaaaa cbabaaaaa0babaaaab babaaaaa0babaaaabc abaaaaa0babaaaabcb baaaaa0babaaaabcba aaaaa0babaaaabcbab aaaa0babaaaabcbaba aaa0babaaaabcbabaa aa0babaaaabcbabaaa a0babaaaabcbabaaaa 0babaaaabcbabaaaaa

0 + 4 = 4 9 + 4 = 13

17 16 15 14 3 12 13 4 5 1 10 6 2 11 0 9 7 8

0bab a0ba aa0b aaa0 aaaa aaaa aaaa aaab aabc abaa abaa abcb baaa baaa baba baba bcba cbab

aaaa baaa abaa baba bcba a0ba 0bab cbab baba aabc aaa0 abaa abcb aa0b aaab aaaa baaa aaaa

bcbabaaa abcbabaa aabcbaba aaabcbab baaaaa0b baaaabcb aaaabcba aaaaa0ba aaaa0bab babaaaaa babaaaab aaa0baba abaaaaa0 abaaaabc cbabaaaa 0babaaaa aa0babaa a0babaaa

aa aa aa aa ab ab ba ba aa 0b cb aa ba ba a0 bc aa ab

Manber's LSD algorithm. Same idea but go from right to left. O(N log N) guaranteed running time. O(N) extra space (but need several auxiliary arrays).
! ! !

Best in theory. O(N) but more complicated to implement.

text = "babaaaabcbabaaaaa"
52 53

String Sorting Performance

String Sort Worst Case Brute Quicksort LSD * MSD MSD with cutoff 3-way radix quicksort Manber

Suffix Sort (seconds) Moby Dick 36,000

AesopAesop 3,990 167 memory 162 400 8.5

W N2 W N log N W(N + R) W(N + R) W(N + R) W N log N N log N


9.5 395 6.8 2.8 17

R = radix W = max length of string. N = number of strings


1.2 million for Moby Dick 191 thousand for Aesop's Fables

estimate fixed length strings only probabilistic guarantee suffix sorting only

54

Vous aimerez peut-être aussi