Académique Documents
Professionnel Documents
Culture Documents
6. Strings
String. Sequence of characters. Ex. Natural languages, Java programs, genomic sequences,
The digital information that underlies biochemistry, cell biology, and development can be represented by a simple string of G's, A's, T's and C's. This string is the root data structure of an organism's biology. -M. V. Olson
s
0
t
1
r
2
i
3
n
4
g
5
s
6
public final class String private char[] value; private int offset; private int count; private int hash;
implements Comparable<String> { // characters // index of first char into array // length of string // cache of hashCode()
s c t u
= = = =
// // // //
s c t u
= = = =
private String(int offset, int count, char[] value) { this.offset = offset; this.count = count; this.value = value; } public String substring(int from, int to) { return new String(offset + from, to - from, value); } }
3
java.lang.String
4
public static String reverse(String s) { String rev = ""; for (int i = s.length() - 1; i >= 0; i--) rev += s.charAt(i); return rev; }
p
quadratic time
0
r
1
e
2
f
3
i
4
x
5 6 7
public static String reverse(String s) { StringBuilder rev = new StringBuilder(); for (int i = s.length() - 1; i >= 0; i--) rev.append(s.charAt(i)); return rev.toString(); }
linear time
public static String lcp(String s, String t) { int n = Math.min(s.length(), t.length()); for (int i = 0; i < n; i++) { if (s.charAt(i) != t.charAt(i)) return s.substring(0, i); } return s.substring(0, n); }
Radix Sorting
Radix Sorting
Radix sorting. Specialized sorting solution for strings. Same ideas for bits, digits, etc.
! !
Applications. Bioinformatics. Sorting strings. Full text indexing. Plagiarism detection. Burrows-Wheeler transform. [see data compression]
! ! ! ! !
Dumb brute force. Try all indices i and j, and all match lengths k, and check. O(W N3 ) time, where W is length of longest match.
! !
Brute force. Try all indices i and j for start of possible match, and check. O(W N2 ) time, where W is length of longest match.
! !
a
i
t
j
a
i
a
j
10
Suffix Sorting
suffixes
0 1 2
sorted suffixes
t a c a a g c a c a a g c c a a g c a a g c a g c g c c
0 11 3 9 1 12 4 14 10 2 13 5 8 7 6
a a c a a g t t t a c a a g c
a c a a g t t t a c a a g c
c a a g t t t a c a a g c
a a g t t t a c a a g c
a g t t t a c a a g c
g t t t a c a a g c
t t t a c a a g c
t t a c a a g c
a a a a a a a c c c g g t t t
a a a c c g g a a c t a t t
c g g a a c t a a t c a t
a c t a a t g g t a c a
a t g g t c t a a a c
g t c t a
t a t c
t c t a
t a a a
a a c g
c g a c
a c a
Suffix sort solution. Form N suffixes of original string. Sort to bring longest repeated substrings together. O(W N log N) time, where W is length of longest match.
! ! !
3 4 5 6 7 8 9 10 11 12 13 14
t c g a a
t a c g a
a a c g
c g
a c
11
12
String lrs = ""; for (int i = 0; i < N - 1; i++) { String x = lcp(suffixes[i], suffixes[i+1]); if (x.length() > lrs.length()) lrs = x; } find longest match System.out.println(lrs); } }
W = max length of string N = number of strings
1.2 million for Moby Dick
% java LRS < mobydick.txt ,- Such a funny, sporty, gamy, jesty, joky, hoky-poky lad, is the Ocean, oh! Th
13
14
Notation. String = sequence of characters. W = max # characters per string. N = # input strings. R = radix.
! ! ! !
Java syntax. Array of strings: Number of strings: The ith string: The dth character of the ith string: Strings to be sorted:
! ! ! ! !
16
a
0 1 d a c f f b d b f b e a a d a a e a a e e e b c b d b d e d d e d d b e
count
a b c d e f g 0 2 3 1 2 1 3 0 1
a
d a c f f b d b f b e a a d a a e a a e e e b c b d b d e d d e d d b e a b c d e f g 0 2 3 1 2 1 3
count
a b c d e f 0 2 5 6 8 9
int N = a.length; int[] count = new int[256+1]; for (int i = 0; i < N; i++) { char c = a[i].charAt(d); count[c+1]++; d=0 }
frequencies
2 3 4 5 6 7 8 9 10 11
2 3 4 5 6 7 8 9 10 11
g 12
18
19
a
0 1 d a c f f b d b f b e a a d a a e a a e e e b c b d b d e d d e d d b e
count
a b c d e 1 2 5 6 8 9 0 1 2 3 4 5 6 7 8 9 10 11
temp
a a b b b c d d e f f f d c a e e a a a b a e e d e d e d b b d b d e d
32
a
0 1 a a b b b c d d e f f f d c a e e a a a b a e e d e d e d b b d b d e d
count
a b c d e 2 5 6 8 9 0 1 2 3 4 5 6 7 8 9 10 11
temp
a a b b b c d d e f f f d c a e e a a a b a e e d e d e d b b d b d e d
33
2 3 4 5 6 7 8 9 10 11
2 3 4 5 6 7 8 9 10 11
f 12 g 12
f 12 g 12
sort key
sort key
sort key
0 1 2 3 4 5 6 7 8 9 10 11
d a c f f b d b f b e a
a d a a e a a e e e b c
b d b d e d d e d d b e
0 1 2 3 4 5 6 7 8 9 10 11
d c e a f b d f b f b a
a a b d a a a e e e e c
b b b d d d d d d e e e
0 1 2 3 4 5 6 7 8 9 10 11
d c f b d e a a f b f b
a a a a a b c d e e e e
b b d d d b e d d d e e
0 1 2 3 4 5 6 7 8 9 10 11
a a b b b c d d e f f f
c d a e e a a a b a e e
e d d d e b b d b d d e
34 35
public static void lsd(String[] a) { int W = a[0].length(); for (int d = W-1; d >= 0; d--) { // do key-indexed counting sort on digit d ... } } Assumes fixed length strings (length = W)
Advantage. Fastest sorting method for random fixed length strings. Disadvantages. Accesses memory "randomly." Inner loop has a lot of instructions. Wastes time on low-order characters. Doesn't work for variable-length strings. Not much semblance of order until very last pass.
! ! ! ! !
Pf 2. [right-to-left] If the characters not yet examined differ, it doesn't matter what we do now. If the characters not yet examined agree, stability ensures later pass won't affect order.
! !
36
37
Most significant digit radix sort. Partition file into 256 pieces according to first character. Recursively sort all strings that start with the same character, etc.
! !
39
Brute Quicksort
9.5 395
private static void msd(String[] a, int l, int r, int d) { if (r <= l + 1) return; // key-indexed counting sort on digit d of a[l] to a[r] int[] count = new int[256+1]; ... // recursively sort 255 subfiles assumes '\0' terminated for (int i = 0; i < 255; i++) msd(a, l + count[i], l + count[i+1], d+1); }
LSD * MSD
40
41
String Sort Worst Case Brute Quicksort LSD * W N2 W N log N W(N + R) W(N + R) W(N + R)
Solution. Cutoff to insertion sort for small N. Consequence. Competitive with quicksort for string keys.
43
44
MSD radix sort recursion tree (1035 null links, not shown)
3-way partition
3-way radix quicksort. Avoids re-comparing initial parts of the string. Uses just "enough" characters to resolve order. 2 N ln N character comparisons on average for random strings. Sub-linear sort for large W since input is of size NW.
! ! ! !
Theorem. Quicksort with 3-way partitioning is OPTIMAL. Pf. Ties cost to entropy. Beyond scope of 226.
}
48 49
String Sort Worst Case Brute Quicksort LSD * MSD MSD with cutoff 3-way radix quicksort W N2 W N log N W(N + R) W(N + R) W(N + R) W N log N
Length of longest match very long. 3-way radix quicksort is quadratic. Ex: two copies of Moby Dick.
! !
Observation. Must find longest repeated substring while suffix sorting to beat N2.
R = radix W = max length of string N = number of strings estimate * fixed length strings only probabilistic guarantee
abcdefghi abcdefghiabcdefghi bcdefghi bcdefghiabcdefghi cdefghi cdefghiabcdefgh defghi efghiabcdefghi efghi fghiabcdefghi fghi ghiabcdefghi fhi hiabcdefghi hi iabcdefghi i Input: "abcdeghiabcdefghi"
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
babaaaabcbabaaaaa0 abaaaabcbabaaaaa0b baaaabcbabaaaaa0ba aaaabcbabaaaaa0bab aaabcbabaaaaa0baba aabcbabaaaaa0babaa abcbabaaaaa0babaaa bcbabaaaaa0babaaaa cbabaaaaa0babaaaab babaaaaa0babaaaabc abaaaaa0babaaaabcb baaaaa0babaaaabcba aaaaa0babaaaabcbab aaaa0babaaaabcbaba aaa0babaaaabcbabaa aa0babaaaabcbabaaa a0babaaaabcbabaaaa 0babaaaabcbabaaaaa
0 + 4 = 4 9 + 4 = 13
17 16 15 14 3 12 13 4 5 1 10 6 2 11 0 9 7 8
0bab a0ba aa0b aaa0 aaaa aaaa aaaa aaab aabc abaa abaa abcb baaa baaa baba baba bcba cbab
aaaa baaa abaa baba bcba a0ba 0bab cbab baba aabc aaa0 abaa abcb aa0b aaab aaaa baaa aaaa
bcbabaaa abcbabaa aabcbaba aaabcbab baaaaa0b baaaabcb aaaabcba aaaaa0ba aaaa0bab babaaaaa babaaaab aaa0baba abaaaaa0 abaaaabc cbabaaaa 0babaaaa aa0babaa a0babaaa
aa aa aa aa ab ab ba ba aa 0b cb aa ba ba a0 bc aa ab
Manber's LSD algorithm. Same idea but go from right to left. O(N log N) guaranteed running time. O(N) extra space (but need several auxiliary arrays).
! ! !
text = "babaaaabcbabaaaaa"
52 53
String Sort Worst Case Brute Quicksort LSD * MSD MSD with cutoff 3-way radix quicksort Manber
estimate fixed length strings only probabilistic guarantee suffix sorting only
54