Overview of Talk
1. String Patterns
Regular expressions (also called regexes) are used to find string patterns. Many software packages implement them, e.g., Mathematica, Perl, SAS, and Emacs. We'll use them to find interesting words (from wordlists available on the web) and interesting numbers (e.g., squares with unusual digit patterns).
Perl Regexes
For a wordlist with one word per line:
/cat/ would match cat, cats, scatter but NOT Cat
/[cC]at/ would match cat, Cat, or Catcher
/cat/i would match cat, CaT, or sCaTtEr (i stands for case insensitive)
/cat|dog/ would match either cat or dog
We'll see examples of more complex string patterns.
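The four patterns above can be tried on a tiny wordlist. The talk uses Perl syntax; the sketch below uses Python's `re` module as a stand-in, and the sample words and the helper name `matches` are assumptions made for illustration.

```python
import re

# A small sample wordlist (illustrative only, not from the talk's wordlists).
words = ["cat", "cats", "scatter", "Cat", "Catcher", "CaT", "sCaTtEr", "dog"]

def matches(pattern, flags=0):
    """Return the words that contain a match for `pattern` anywhere."""
    rx = re.compile(pattern, flags)
    return [w for w in words if rx.search(w)]

cat_exact = matches(r"cat")                    # Perl: /cat/
cat_either = matches(r"[cC]at")                # Perl: /[cC]at/
cat_nocase = matches(r"cat", re.IGNORECASE)    # Perl: /cat/i
cat_or_dog = matches(r"cat|dog")               # Perl: /cat|dog/

print(cat_exact)
print(cat_either)
print(cat_nocase)
print(cat_or_dog)
```

`rx.search` is used rather than `rx.match` because Perl's `/cat/` matches anywhere in the line, not only at its start.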
Word Graphs
before
    b-e-f
      | |
      r-o

concern
    e-c
    | |\
    r-n-o

state
    s-t-a
      |
      e

decency
    d-e-c
      |/|
      n y
Each node is a distinct letter, and each edge connects letters that are adjacent in the word. Graphs are directed: the arrows are understood. From Section 24 of Eckler (1996).
What is the longest such word? Answer: ambidextrously has 14 letters. The only examples with 13 letters are lycanthropies, metalworkings, multibranched, and unpredictably.
Longest cyclic words are 12 letters long: spaceflights, speculations, subharmonics, subordinates, switchblades, switchboards, sympathizers.
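A word graph as described here can be built from adjacent letter pairs. The following Python sketch is one way to do it; the function names are hypothetical, and the two predicates encode an assumed reading of the slides: a "path" word has no repeated letter, and a "cyclic" word repeats only its first letter, at the very end.

```python
def word_graph(word):
    """Return (nodes, edges): distinct letters and directed adjacent-letter pairs."""
    nodes = sorted(set(word))
    edges = sorted({(a, b) for a, b in zip(word, word[1:])})
    return nodes, edges

def is_path_word(word):
    """Assumed reading: no letter repeats, so the graph is a simple path."""
    return len(set(word)) == len(word)

def is_cyclic_word(word):
    """Assumed reading: the only repeat is the first letter reappearing as the last."""
    return (len(word) > 2
            and word[0] == word[-1]
            and len(set(word)) == len(word) - 1)

print(word_graph("before"))
print(is_path_word("ambidextrously"), is_cyclic_word("switchblades"))
```

Running either predicate over a wordlist and keeping the longest hits is a straightforward way to check the record words quoted above.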
Are the Digits of Squares Random? The initial digits are not.
The limiting proportion of initial digit k, for 1 ≤ k ≤ 9, is given by: (Sqrt[k+1] - Sqrt[k] + Sqrt[10(k+1)] - Sqrt[10k])/9.
Initial digit(s)   Lower   Upper   Lower^2   Upper^2
1                  100     141     10000     19881
2                  142     173     20164     29929
3                  174     199     30276     39601
4                  200     223     40000     49729
5                  224     244     50176     59536
6                  245     264     60025     69696
7                  265     282     70225     79524
8                  283     299     80089     89401
9                  300     316     90000     99856
10                 317     447     100489    199809
20                 448     547     200704    299209
30                 548     632     300304    399424
40                 633     707     400689    499849
50                 708     774     501264    599076
60                 775     836     600625    698896
70                 837     894     700569    799236
80                 895     948     801025    898704
90                 949     999     900601    998001
Digit   1        2        3        4        5       6       7       8       9       Total
Prob.   19.16%   14.70%   12.39%   10.92%   9.87%   9.08%   8.45%   7.93%   7.50%   100.00%
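The limiting proportions in this table can be checked numerically. The Python sketch below evaluates (Sqrt[k+1] - Sqrt[k] + Sqrt[10(k+1)] - Sqrt[10k])/9 and compares it with a brute-force count of leading digits of n^2; the function names and the cutoff 99999 are choices made for this illustration.

```python
from math import sqrt

def leading_digit_proportion(k):
    """Limiting proportion of squares whose first digit is k (1 <= k <= 9)."""
    return (sqrt(k + 1) - sqrt(k) + sqrt(10 * (k + 1)) - sqrt(10 * k)) / 9

def empirical_leading_digits(limit):
    """Observed proportions of the first digit of n*n for n = 1..limit."""
    counts = [0] * 10
    for n in range(1, limit + 1):
        counts[int(str(n * n)[0])] += 1
    return [c / limit for c in counts[1:]]

theory = [leading_digit_proportion(k) for k in range(1, 10)]
observed = empirical_leading_digits(99999)
for k in range(1, 10):
    print(k, f"{theory[k - 1]:.4f}", f"{observed[k - 1]:.4f}")
```

The cutoff is taken just below a power of 10 so that every block of n values covers complete ranges of leading digits.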
{a_n} satisfies Benford's law iff Log[10, a_n] (mod 1) is uniformly distributed. See http://en.wikipedia.org/wiki/Benford's_law. Benford's law does not fit the distribution of initial digits of squares.
Digit   1        2        3        4       5       6       7       8       9       Total
Prob.   30.10%   17.61%   12.49%   9.69%   7.92%   6.69%   5.80%   5.12%   4.58%   100.00%
For a proof see Walter Penney (1960), On the Final Digits of Squares. Also see Walter Stangl (1996), Counting Squares in Z_n.
Equi-Pandigital Primes
An equi-pandigital number in base $b$ contains each digit from $0$ through $b-1$ exactly the same number of times.

Theorem. For $b > 3$, there are no equi-pandigital primes.

Proof. Let $n$ be an equi-pandigital number in base $b$. Then, mod $(b-1)$, $n$ is congruent to the sum of its digits because $b^k \equiv 1^k = 1$. Let $r$ be the number of repetitions of the digits $0, 1, 2, \ldots, b-1$, which sum to $b(b-1)/2$. So we have:
$$n \equiv r\,\frac{b(b-1)}{2} \pmod{b-1}.$$
If $b$ is even, then $n \equiv 0 \pmod{b-1}$ since $b/2$ is an integer, so $(b-1)$ divides $n$. If $b$ is odd, then $(b-1)/2$ is an integer, so either $n \equiv 0$ or $n \equiv (b-1)/2 \pmod{b-1}$. In both cases, $(b-1)/2$ divides $n$. For $b > 3$, $(b-1)$ and $(b-1)/2$ are nontrivial, so $n$ is not prime. QED

Remark 1: Finding a base 10 equi-pandigital prime will take some trickery.

Remark 2: $10_2$, $100101_2$, $101001_2$, $10001011_2$, etc. are prime, as are $102_3$, $201_3$, $100012212_3$, $100022112_3$, etc.
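The theorem is easy to spot-check by brute force. The Python sketch below tests the definition directly and confirms, for small bases, that every equi-pandigital number is divisible by b - 1 (b even) or (b - 1)/2 (b odd); all function names are hypothetical.

```python
def digits_in_base(n, b):
    """Digits of the positive integer n in base b, most significant first."""
    digits = []
    while n > 0:
        n, r = divmod(n, b)
        digits.append(r)
    return digits[::-1]

def is_equi_pandigital(n, b):
    """True if every digit 0..b-1 occurs in n the same positive number of times."""
    ds = digits_in_base(n, b)
    counts = [ds.count(d) for d in range(b)]
    return counts[0] > 0 and len(set(counts)) == 1

def forced_divisor(b):
    """Divisor that the proof guarantees for every base-b equi-pandigital number."""
    return b - 1 if b % 2 == 0 else (b - 1) // 2

for b in (4, 5, 6):
    hits = [n for n in range(1, b ** b) if is_equi_pandigital(n, b)]
    print(b, len(hits), all(n % forced_divisor(b) == 0 for n in hits))
```

The search range `b ** b` only covers numbers that use each digit once; longer examples (r > 1) behave the same way by the congruence in the proof.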
Search Results
By computer search, there are 69774 equi-pandigital Gaussian primes of the form a + b i, a > b > 0. Here are some interesting ones:
Pandigital Gaussian prime   Distinguishing property
96530 + 87421i              Max norm
20468 + 13597i              Min norm
98765 + 10234i              Max real − imaginary parts
60143 + 59872i              Min real − imaginary parts
86420 + 79513i              Largest real part with all even digits
20864 + 13579i              Smallest imaginary part with all odd digits
97531 + 82604i              Largest real part with all odd digits
We'll also consider anagrams of numbers. In what follows, initial zeros are forbidden.
E.g., 13^2 = 169, 14^2 = 196, and 31^2 = 961 are anagrams of each other (in base 10).
The above wordlists include all the inflected forms of words: nouns with both singular and plural forms, adjectives with comparative forms, verbs with all conjugated forms, etc.
If the key already exists, then an anagram has been discovered. Example: evil, live, vile, veil all have the key eilv.
aa aah aahed aahing aahs aal aalii aaliis aals aardvark aardvarks aardwolf aardwolves aas aasvogel aasvogels aba abaca abacas abaci aback abacus abacuses
Step 3: Print out the hash with the keys sorted in alphabetical order.
The result (see right) is an anagram dictionary. Invaluable for word games such as Scrabble and Jumble: just sort the letters at hand and check if they form a word. Looking for entries with two or more commas reveals word anagrams.
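The sorted-letter key idea can be sketched in a few lines. The version below is in Python rather than the Perl hash used in the talk, and the function name and sample words are assumptions for illustration: each word is filed under its letters in alphabetical order, and the table is printed with keys sorted.

```python
from collections import defaultdict

def anagram_dictionary(words):
    """Map each sorted-letter key to the list of words sharing that key."""
    table = defaultdict(list)
    for word in words:
        key = "".join(sorted(word))   # e.g. "evil" -> "eilv"
        table[key].append(word)       # a second word on a key means an anagram
    return dict(sorted(table.items()))

sample = ["evil", "live", "vile", "veil", "abracadabra", "aa"]
for key, group in anagram_dictionary(sample).items():
    print(key + ", " + ", ".join(group))
```

To look up a rack of letters, sort them the same way and check whether that key is present.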
Most words do not have anagrams.
aa, aa
aaaaabbcdrr, abracadabra
aaaabcceelrstu, baccalaureates
aaaabcceelrtu, baccalaureate
aaaabdilmorss, ambassadorial
aaaabenn, anabaena
aaaabenns, anabaenas
aaaaccdiiklllsy, lackadaisically
aaaaccdiiklls, lackadaisical
aaaaccrr, caracara
aaaaccrrs, caracaras
aaaacgnr, caragana
aaaacgnrs, caraganas
aaaacmnrst, catamarans
aaaacmnrt, catamaran
The program above can easily be modified to find anagrams of a set of numbers. In recreational mathematics, it is well known that 12^2 = 144, 21^2 = 441; and 13^2 = 169, 31^2 = 961, 14^2 = 196. Unlike words, it turns out that it is easy to find two or more squares that are anagrams. For example, the following 87 squares are anagrams of each other:
1026753849, 1042385796, 1098524736, 1237069584, 1248703569, 1278563049, 1285437609, 1382054976, 1436789025, 1503267984, 1532487609, 1547320896, 1643897025, 1827049536, 1927385604, 1937408256, 2076351489, 2081549376, 2170348569, 2386517904, 2431870596, 2435718609, 2571098436, 2913408576, 3015986724, 3074258916, 3082914576, 3089247561, 3094251876, 3195867024, 3285697041, 3412078569, 3416987025, 3428570916, 3528716409, 3719048256, 3791480625, 3827401956, 3928657041, 3964087521, 3975428601, 3985270641, 4307821956, 4308215769, 4369871025, 4392508176, 4580176329, 4728350169, 4730825961, 4832057169, 5102673489, 5273809641, 5739426081, 5783146209, 5803697124, 5982403716, 6095237184, 6154873209, 6457890321, 6471398025, 6597013284, 6714983025, 7042398561, 7165283904, 7285134609, 7351862049, 7362154809, 7408561329, 7680594321, 7854036129, 7935068241, 7946831025, 7984316025, 8014367529, 8125940736, 8127563409, 8135679204, 8326197504, 8391476025, 8503421796, 8967143025, 9054283716, 9351276804, 9560732841, 9614783025, 9761835204, 9814072356.
A Pattern Emerges
# Digits   # Squares   # Anasquares   Proportion
1          3           0              0.00%
2          6           0              0.00%
3          22          7              31.82%
4          68          13             19.12%
5          217         86             39.63%
6          683         293            42.90%
7          2163        1212           56.03%
8          6837        4699           68.73%
9          21623       17380          80.38%
10         68377       60623          88.66%
In fact, looking at n-digit squares, it seems that as n increases, the proportion of squares with square anagrams (let's call these anasquares) keeps increasing. What is the limit?
The above table is Table 1 from Bilisoly (2008a). Also see http://oeis.org/A177952.
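The first rows of this table can be recomputed with the sorted-digit key. The Python sketch below (function name assumed) counts the d-digit squares and how many of them share their digit multiset with another d-digit square.

```python
from collections import Counter
from math import isqrt

def anasquare_counts(num_digits):
    """Return (# of squares, # of anasquares) among squares with num_digits digits."""
    lo = 10 ** (num_digits - 1)
    hi = 10 ** num_digits - 1
    first = isqrt(lo - 1) + 1          # smallest n with n*n >= lo
    last = isqrt(hi)                   # largest n with n*n <= hi
    squares = [n * n for n in range(first, last + 1)]
    key_counts = Counter("".join(sorted(str(s))) for s in squares)
    anasquares = sum(1 for s in squares
                     if key_counts["".join(sorted(str(s)))] > 1)
    return len(squares), anasquares

for d in range(1, 7):
    total, ana = anasquare_counts(d)
    print(d, total, ana, f"{100 * ana / total:.2f}%")
```

Because only squares with the same number of digits are compared, anagrams with a leading zero are excluded automatically.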
End of Proof
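The number of distinct hash keys is:
$$\binom{d+b-1}{d} = \frac{(d+b-1)(d+b-2)\cdots(d+1)}{(b-1)!}$$
The number of $d$-digit squares is approximately:
$$\sqrt{b^{d}} - \sqrt{b^{d-1}} = b^{d/2} - b^{(d-1)/2} = b^{d/2}\left(1 - 1/\sqrt{b}\right).$$
Hence the number of $d$-digit squares is exponential in $d$, but the number of patterns is a polynomial in $d$, so the proportion of anasquares is bounded below by the following, which $\to 1$ as $d \to \infty$ (with $b$ fixed):
$$\max\left( \frac{b^{d/2}\left(1 - 1/\sqrt{b}\right) - \binom{d+b-1}{d}}{b^{d/2}\left(1 - 1/\sqrt{b}\right)},\; 0 \right)$$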
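The counting argument can be made concrete with a few numbers. The Python sketch below assumes the bound has the pigeonhole form max(0, 1 - keys/squares), where keys = C(d+b-1, d) is the number of digit multisets and squares is approximately b^(d/2) (1 - 1/sqrt(b)); the function name and this exact form are an interpretation, not taken verbatim from the slides.

```python
from math import comb, sqrt

def anasquare_lower_bound(d, b=10):
    """Pigeonhole lower bound on the proportion of d-digit anasquares in base b."""
    keys = comb(d + b - 1, d)                       # distinct sorted-digit patterns
    squares = b ** (d / 2) * (1 - 1 / sqrt(b))      # approximate count of d-digit squares
    return max(0.0, 1 - keys / squares)

for d in (10, 12, 14, 16, 20, 30):
    print(d, f"{anasquare_lower_bound(d):.6f}")
```

For small d there are more patterns than squares, so the bound is 0 and says nothing; it only becomes informative once the exponential count of squares overtakes the polynomial count of patterns.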
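$$E(N_{\text{shared}}) = \int_0^\infty \prod_{i=1}^{365} (1 + p_i t)\,\exp(-t)\,dt$$
Corollary (The Coupon Problem) We need all letters to appear at least once.
$$E(N_{\text{all}}) = \int_0^\infty \left(1 - \prod_{i=1}^{365} \left(1 - \exp(-p_i t)\right)\right) dt$$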
Application to Birthdays
What is the expected number of people needed so that 2 people share a birthday? Mathematica gives E(N_shared) = 24.6166, which assumes each day is equally likely. Note E(N_all) = 2364.65.
What is the expected number of people born in 1978 needed so that 2 people share a birthday? Mathematica gives E(N_shared) = 24.5262, and note E(N_all) = 2435.14.
Plot of Julian Day vs. Proportion of births on that day for 1978. Which days does the lower band represent?
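The uniform-birthday figure can be reproduced without Mathematica. The Python sketch below is not the talk's computation: for E(N_shared) it sums P(first n people all have distinct birthdays) over n, which equals the integral of prod(1 + p_i t) exp(-t) term by term; for E(N_all) it applies a plain trapezoid rule to the coupon integral. The function names, the step count, and the integration cutoff are assumptions.

```python
from math import exp

def expected_until_shared(probs):
    """E(N_shared) = sum over n >= 0 of P(the first n draws are all distinct)."""
    m = len(probs)
    # q[n] = n! * e_n(p), where e_n is the n-th elementary symmetric polynomial,
    # built up one category at a time so every entry stays a probability.
    q = [1.0] + [0.0] * m
    for p in probs:
        for n in range(m, 0, -1):
            q[n] += n * p * q[n - 1]
    return sum(q)

def expected_until_all(probs, steps=20000):
    """E(N_all) by trapezoid integration of 1 - prod(1 - exp(-p_i t)) over [0, upper]."""
    upper = 50.0 / min(probs)          # assumed cutoff; the tail beyond it is negligible
    h = upper / steps

    def integrand(t):
        prod = 1.0
        for p in probs:
            prod *= 1.0 - exp(-p * t)
        return 1.0 - prod

    total = 0.5 * (integrand(0.0) + integrand(upper))
    total += sum(integrand(i * h) for i in range(1, steps))
    return total * h

uniform = [1 / 365] * 365
shared_uniform = expected_until_shared(uniform)
print(round(shared_uniform, 4))
```

Passing the observed 1978 daily proportions in place of `uniform` would give the non-uniform versions; `expected_until_all` on 365 categories is slow in pure Python because each step evaluates 365 exponentials.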
Data Source: Todd Swanson's Home Page: http://www.math.hope.edu/swanson/data/birthdays.txt
Pangrammatic Windows
The Spirit dropped beneath it, so that the extinguisher covered its whole form; but though Scrooge pressed it down with all his force, he could not hide the light: which streamed from under it, in an unbroken flood upon the ground. He was conscious of being exhausted, and overcome by an irresistible drowsiness; and, further, of being in his own bedroom. He gave the cap a parting squeeze, in which his hand relaxed; and had barely time to reel to bed, before he sank into a heavy sleep. AWAKING in the middle of a prodigiously tough snore, and sitting up in bed to get his thoughts together, Scrooge had no occasion to be told that the bell was again upon the stroke of One. He felt that he was restored to consciousness in the right nick of time, for the especial purpose of holding a conference with the second messenger dispatched to him through Jacob Marley's intervention.
This text is from Charles Dickens's A Christmas Carol. The blue portion is a pangrammatic window, i.e., it contains each letter of the alphabet at least once. There are 679 letters in color. The search started with "The Spirit", and the window could be shortened by dropping letters from the beginning.
We'll search A Christmas Carol for pangrams by selecting random starting positions. Then we compare this to independently generated letters using the letter frequencies of this novel. The counts and the proportions are listed to the right. Of course, letters are not independent, but the question is this: How do the actual pangram lengths differ from the simulated independent pangram lengths?
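The search and the independent-letter comparison can be sketched as follows. This is a Python illustration, not the code used for the talk; the function names are hypothetical, and a short pangram sentence stands in for the novel's text.

```python
import random
import string

ALPHABET = set(string.ascii_lowercase)

def letters_only(text):
    """Lowercase the text and keep only the letters a-z."""
    return [c for c in text.lower() if c in ALPHABET]

def pangram_window_length(letters, start):
    """Letters read from `start` until all 26 have appeared, or None if never."""
    missing = set(ALPHABET)
    for pos in range(start, len(letters)):
        missing.discard(letters[pos])
        if not missing:
            return pos - start + 1
    return None

def simulate_independent(letters, length, rng):
    """Draw `length` independent letters with the empirical frequencies of `letters`."""
    return rng.choices(letters, k=length)

text = letters_only("The quick brown fox jumps over the lazy dog")
rng = random.Random(0)
simulated = simulate_independent(text, 2000, rng)
starts = [rng.randrange(1000) for _ in range(5)]
print([pangram_window_length(simulated, s) for s in starts])
```

Sampling uniformly from the list of observed letters is equivalent to sampling each letter with its observed proportion, so no separate frequency table is needed.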
Letter   Count   Proportion
p        2119    0.017505
q        97      0.000801
r        7031    0.058082
s        7900    0.065261
t        10869   0.089787
u        3335    0.02755
v        1022    0.008443
w        3096    0.025576
x        131     0.001082
y        2298    0.018983
z        84      0.000694
Pangram Lengths
The left histogram shows lengths of pangrams found in A Christmas Carol using random starting points. The right histogram shows lengths of pangrams found in a simulated string of independent letters using the proportions found in A Christmas Carol.
N = 1000
     "Why, it's old  Fezziwig! Bless his heart; i
ess his heart; it's  Fezziwig alive again!"
              o Old  Fezziwig laid down his pen,
     "Yo ho, there!  Ebenezer! Dick!"
 ho, my boys!" said  Fezziwig. "No more work to-n
e, Dick. Christmas,  Ebenezer! Let's have the shu
ters up," cried old  Fezziwig, with a sharp clap
illi-ho!" cried old  Fezziwig, skipping down from
-ho, Dick! Chirrup,  Ebenezer!"
ared away, with old  Fezziwig looking on. It was
aches. In came Mrs.  Fezziwig, one vast substanti
came the three Miss  Fezziwigs, beaming and lovabl
 brought about, old  Fezziwig, clapping his hands
 overley." Then old  Fezziwig stood out to dance
 to dance with Mrs.  Fezziwig. Top couple, too; w
ah, four times--old  Fezziwig would have been a m
, and so would Mrs.  Fezziwig. As to her, she was
eared to issue from  Fezziwig's calves. They shon
 next. And when old  Fezziwig and Mrs. Fezziwig h
d Fezziwig and Mrs.  Fezziwig had gone all throug
gain to your place;  Fezziwig "cut"--cut so deftl
ke up. Mr. and Mrs.  Fezziwig took their stations
hearts in praise of  Fezziwig: and when he had do
The middle section has 43 of the 84 z's, but represents only 3 of 83 pages of Dickens (1986).
luence over him, he  seized the extinguisher-ca
e the cap a parting  squeeze, in which his hand
ore and centre of a  blaze of ruddy light, whi
ore alarming than a  dozen ghosts, as he was p
; and such a mighty  blaze went roaring up the
, half thawed, half  frozen, whose heavier part
ught fire, and were  blazing away to their dear
anding his gigantic  size, he could accommoda
chit, kissing her a  dozen times, and taking o
erness and flavour,  size and cheapness, were
, so hard and firm,  blazing in half of half-a-q
e flickering of the  blaze showed preparations
g grew but moss and  furze, and coarse rank gr
 of endeavouring to  seize you, which would ha
relents," she said,  amazed, "there is! Nothing
grave his own name,  EBENEZER SCROOGE.
er they've sold the  prize Turkey that was han
re?--Not the little  prize Turkey: the big one
 it. It's twice the  size of Tiny Tim. Joe Mi
e passed the door a  dozen times, before he ha
References
Roger Bilisoly (2008a). Anasquares: Square Anagrams of Squares. Mathematical Gazette, 92, 58-63.
Roger Bilisoly (2008b). Practical Text Mining with Perl, Wiley.
Roger Bilisoly (2009). Two Language-based Examples for Use in the Statistics Classroom. American Statistical Association Proceedings of the Joint Statistical Meetings, Section on Statistical Education.
Gunnar Blom, Lars Holst, and Dennis Sandell (1993). Problems and Snapshots from the World of Probability, Springer.
W. E. Deskins (1964). Abstract Algebra, MacMillan.
Charles Dickens (1986). A Christmas Carol, Bantam.
Philippe Flajolet, Daniele Gardy, and Loys Thimonier (1992). Birthday Paradox, Coupon Collectors, Caching Algorithms and Self-Organizing Search. Discrete Applied Mathematics, 39, 207-229.
Walter Penney (1960). On the Final Digits of Squares. The American Mathematical Monthly, Vol. 67, No. 10, pp. 1000-1002.
Walter Stangl (1996). Counting Squares in Z_n. Mathematics Magazine, Vol. 69, No. 4, pp. 285-289.
Kenneth Williams (1995). Some Refinements of an Algorithm of Brillhart. Canadian Mathematical Society Conference Proceedings, Volume 15, 409-416. Available at http://www.math.carleton.ca/~williams/papers/pdf/202.pdf.
Web References
Benford's Law
http://mathworld.wolfram.com/BenfordsLaw.html
http://en.wikipedia.org/wiki/Benford's_law
http://www.asahi-net.or.jp/~KC2H-MSM/mathland/math02/math0203.htm
http://mathworld.wolfram.com/Baxter-HickersonFunction.html
http://cryptogram.org/
http://www.puzzlers.org/
http://icon.shef.ac.uk/Moby/
http://oeis.org/A177952
http://www.math.hope.edu/swanson/data/birthdays.txt
http://wordways.com/
Wordplay References
Tony Augarde (1994). The Oxford A to Z of Word Games, Oxford.
Tony Augarde (2003). The Oxford Guide to Word Games, Oxford.
  o Has historical information.
Dmitri Borgmann (1967). Beyond Language, Scribners.
Ross Eckler (1979). Word Recreations, Dover.
  o Most examples originally appeared in Word Ways.
Ross Eckler (1996). Making the Alphabet Dance, St. Martin's.
  o Most examples originally appeared in Word Ways.
Dave Morice (1997). Alphabet Avenue, Chicago Review Press.
Dave Morice (2001). The Dictionary of Word Play, Teachers and Writers Collaborative.
Warren F. Motte, Jr. (1998). Oulipo: A Primer of Potential Literature, Dalkey Archive.
  o Oulipo stands for Ouvroir de Litterature Potentielle, which is a group of writers, mathematicians, and other people interested in literary structures.
Bought by A. Ross Eckler, Jr. in 1968. He was editor and publisher from 1968 to 2006.
  o PhD in mathematics from Princeton, 1954
  o Worked at Bell Labs, 1954-84
  o Published Word Recreations (1979), Names and Games: Onomastics and Recreational Linguistics (1986), Making the Alphabet Dance (1996)
Online at http://wordways.com/
Open question: What are the upper and lower bounds of this plot? Points are squares in base 10 with 12 or fewer digits. This is Figure 2 of Bilisoly (2008a).
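Define $N_{jk}$ = number of letters drawn from an alphabet (with replacement) so that there are $j$ distinct letters that each appear at least $k$ times. Let $e_k(t)$ = $k$th order Taylor series expansion of $\exp(t)$. Theorem 1 of Flajolet, Gardy and Thimonier (1992) states:
$$E(N_{jk}) = \sum_{l=0}^{j-1} \int_0^\infty [x^l] \prod_{i=1}^{n_a} \Big( e_{k-1}(p_i t) + x\big(\exp(p_i t) - e_{k-1}(p_i t)\big) \Big) \exp(-t)\,dt$$
Here $n_a$ is the number of letters in the alphabet, $p_i$ is the probability of letter $i$, and $[x^l]$ extracts the coefficient of $x^l$ from the product of 1st degree polynomials in $x$.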
Corollary (The Birthday Problem) We need $j = 1$ day to appear at least $k = 2$ times. Note that the sum has only one term, and $N_{12} = N_{\text{shared}}$.
$$E(N_{12}) = \int_0^\infty \prod_{i=1}^{365} (1 + p_i t)\,\exp(-t)\,dt$$
See Corollary 1 of Flajolet et al. (1992)
$$E(N_{\text{all}}) = \int_0^\infty \left(1 - \prod_{i=1}^{n_a} \left(1 - \exp(-p_i t)\right)\right) dt$$
For uniformly likely birthdays, 2364.65 people are needed on average to get all 365 days to appear. For 1978, we expect to need 2435.14 people.
Pangrams have the alphabet {a, b, c, ..., z}, $n_a = 26$, and $p_i$ determined by frequencies found in a text sample.
Example of Mathematica 8 code to find 14 letter words with no multiple edges and diameter = 2.
n          n^2
97         9409
997        994009
9997       99940009
99997      9999400009
999997     999994000009
1233335    1521115222225
12333335   152111152222225
See http://www.asahi-net.or.jp/~KC2H-MSM/mathland/math02/math0203.htm
However, the analogous argument for squares with 4 distinct digits results in $\sum_n (4/\sqrt{10})^n$, which diverges.