Python: Regular Expressions
Bruce Beckles
Bob Dowling
University Computing Service
Scientific Computing Support e-mail address:
escience-support@ucs.cam.ac.uk

Welcome to the University Computing Service's "Python: Regular Expressions"
course.
The official UCS e-mail address for all scientific computing support queries, including
any questions about this course, is:
escience-support@ucs.cam.ac.uk

This course:
    basic regular expressions
    getting Python to use them
Another UCS course:
    Pattern Matching Using Regular Expressions
    more powerful regular expressions
    efficient regular expressions

Before we start, let's specify just what is and isn't in this course.
This course is a very simple, beginner's course on regular expressions. It mostly
covers how to get Python to use them. If you want to learn the full power of regular
expressions go to the UCS two-afternoon course called "Pattern Matching Using
Regular Expressions" which will cover them in detail, but in a way that doesn't focus
on any particular language. For further details of this course see the course
description at:
http://training.csx.cam.ac.uk/course/regex
There is an on-line introduction called the Python "Regular Expression HowTo" at:
http://www.amk.ca/python/howto/regex/
and the formal Python documentation at
http://docs.python.org/library/re.html
There is a good book on regular expressions in the O'Reilly series called "Mastering
Regular Expressions" by Jeffrey E. F. Friedl. Be sure to get the third edition (or later)
as its author has added a lot of useful information since the second edition. There are
details of this book at:
http://regex.info/
http://www.oreilly.com/catalog/regex3/
There is also a Wikipedia page on regular expressions which has useful information
itself buried within it and a further set of references at the end:
http://en.wikipedia.org/wiki/Regular_Expression

A regular expression is a "pattern" describing some text:
    \d+         "a series of digits"
    [a-z]\d+    "a lower case letter followed by some digits"
    .+\.\w+     "a mixture of characters except for new line, followed by a full stop
                 and one or more letters or numbers"

A regular expression is simply some means to write down a pattern describing some
text. (There is a formal mathematical definition but we're not bothering with that here.
What the computing world calls regular expressions and what the strict mathematical
grammarians call regular expressions are slightly different things.)
For example we might like to say "a series of digits" or "a single lower case letter
followed by some digits". There are terms in regular expression language for all of
these concepts.
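As a rough illustration of the three patterns on the slide, the sketch below tries each one against a few strings. The sample strings and the helper name matching_patterns are made up for this example. The re.compile() and search() calls, and the r"..." form of string, are all explained later in the course.

```python
import re

# The three patterns from the slide, written as raw strings so that
# Python leaves the backslashes alone.
PATTERNS = {
    "a series of digits": re.compile(r"\d+"),
    "a lower case letter followed by some digits": re.compile(r"[a-z]\d+"),
    "characters, a full stop, then letters or numbers": re.compile(r".+\.\w+"),
}

def matching_patterns(text):
    """Return the descriptions of the patterns found somewhere in text."""
    return [name for name, regexp in PATTERNS.items() if regexp.search(text)]

# Illustrative sample strings (not taken from the course files).
for sample in ["room 101", "x42", "report.txt", "nothing here"]:
    print(sample, "->", matching_patterns(sample))
```

Note that "room 101" contains digits but no letter immediately followed by digits, so only the first description applies to it.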

A regular expression is a "pattern" describing some text:
    \d+
    [a-z]\d+
    .+\.\w+
Isn't this just gibberish?
The language of regular expressions

We will cover what this means in a few slides' time. We will start with a "trivial" regular
expression, however, which simply matches a fixed bit of text.

Classic regular expression filter
for each line in a file
    does the line match a pattern?     (how can we tell?)
    if it does, output something       (what?)
        "Hey! Something matched!"
        The line that matched
        The bit of the line that matched

This is a course on using regular expressions from Python, so before we introduce
even our most trivial expression we should look at how Python drives the regular
expression system.
Our basic script for this course will run through a file, a line at a time, and compare the
line against some regular expression. If the line matches the regular expression the
script will output something. That "something" might be just a notice that it happened
(or a line number, or a count of lines matched, etc.) or it might be the line itself.
Finally, it might be just the bit of the line that matched.
Programs like this, that produce a line of output if a line of input matches some
condition and no line of output if it doesn't, are called "filters".

Task: Look for "Fred" in a list of names
names.txt:
    Alice
    Bob
    Charlotte
    Derek
    Ermintrude
    Fred
    Freda
    Frederick
    Felicity
    …
freds.txt:
    Fred
    Freda
    Frederick

So we will start with a script that looks for the fixed text "Fred" in the file names.txt.
For each line that matches, the line is printed. For each line that doesn't, nothing is
printed.

cf. grep
    $ grep 'Fred' < names.txt
    Fred
    Freda
    Frederick
    $

This is equivalent to the traditional Unix command, grep.

Skeleton Python script
    import sys                                # to get stdin & stdout
    import regular expression module
    define pattern
    set up regular expression
    for line in sys.stdin:                    # read in the lines one at a time
        compare line to regular expression
        if regular expression matches:
            sys.stdout.write(line)            # write out the matching lines

So we will start with the outline of a Python script and review the non-regular
expression lines first.
Because we are using standard input and standard output, we will import the sys
module to give us sys.stdin and sys.stdout.
We will process the file a line at a time. The Python object sys.stdin corresponds
to the standard input of the program and if we use it like a list, as we do here, then it
behaves like a list of lines. So the Python "for line in sys.stdin:" sets up a
for loop running through a line at a time, setting the variable line to be one line of the
file after another as the loop repeats. The loop ends when there are no more lines in
the file to read.
The if statement simply looks at the results of the comparison to see if it was a
successful comparison for this particular value of line or not.
The sys.stdout.write() line in the script simply prints the line. We could just use
print but we will use sys.stdout for symmetry with sys.stdin.
The pseudo-script on the slide contains all the non-regular-expression code required.
What we have to do now is to fill in the rest.

Skeleton Python script
    import sys
    import regular expression module          # "re"
    define pattern                            # "gibberish"
    set up regular expression                 # prepare the reg. exp.
    for line in sys.stdin:
        compare line to regular expression    # use the reg. exp.
        if regular expression matches:        # see what we got
            sys.stdout.write(line)

Now let's look at the regular expression lines we need to complete.
The regular expression module in Python is called "re", so we will need the line
"import re" to load what we need.
In Python we need to set up the regular expression in advance of using it. (Actually
that's not always true but this pattern is more flexible and more efficient so we'll focus
on it in this course.)
Finally, for each line we read in we need some way to determine whether our regular
expression matches that line or not.

Skeleton Python script
    import sys
    import re                                 # Ready to use regular expressions
    define pattern
    set up regular expression
    for line in sys.stdin:
        compare line to regular expression
        if regular expression matches:
            sys.stdout.write(line)

So we start by importing the regular expressions module, re.

Defining the pattern
    pattern = "Fred"                          # Simple string

In this very simple case of looking for an exact string, the pattern is simply that string.
So, given that we are looking for "Fred", we set the pattern to be "Fred".

Skeleton Python script
    import sys
    import re
    pattern = "Fred"                          # Define the pattern
    set up regular expression
    for line in sys.stdin:
        compare line to regular expression
        if regular expression matches:
            sys.stdout.write(line)

In our skeleton script, though, we simply define a string.
This is just a Python string. We still need to turn it into something that can do the
searching for "Fred".

Setting up a regular expression
    regexp = re.compile(pattern)
        regexp     regular expression object
        re         from the re module
        compile    compile the pattern
        pattern    "Fred"

Next we need to look at how to use a function from this module to set up a regular
expression object in Python from that simple string.
The re module has a function "compile()" which takes this string and creates an
object Python can do something with. This is deliberately the same word as we use
for the processing of source code (text) into machine code (program). Here we are
taking a pattern (text) and turning it into the mini-program that does the testing.
The result of this compilation is a "regular expression object", the mini program that
will do work relevant to the particular pattern "Fred". We store this in a variable,
"regexp", so we can use it later in the script.

Skeleton Python script
    import sys
    import re
    pattern = "Fred"
    regexp = re.compile(pattern)              # Prepare the regular expression
    for line in sys.stdin:
        compare line to regular expression
        if regular expression matches:
            sys.stdout.write(line)

So we put that compilation line in our script instead of our placeholder.
Next we have to apply that regular expression object, regexp, to each line as it is
read in to see if the line matches.

Using a regular expression
    result = regexp.search(line)
        result    The result of the test.
        regexp    The reg. exp. object we just prepared.
        search    The reg. exp. object's search() method.
        line      The text being tested.

We start by doing the test and then we will look at the test's results.
The regular expression object that we have just created, "regexp", has a method (a
built in function specific to itself) called "search()". So to reference it in our script we
need to refer to "regexp.search()". This method takes the text being tested (our
input line in this case) as its only argument. The input line is in the variable line so we
need to run "regexp.search(line)" to get our result.
Note that the string "Fred" appears nowhere in this line. It is built in to the regexp
object.
Incidentally, there is a related, confusingly similar method called "match()". Don't use
it. (And that's the only time it will be mentioned in this course.)

Skeleton Python script
    import sys
    import re
    pattern = "Fred"
    regexp = re.compile(pattern)
    for line in sys.stdin:
        result = regexp.search(line)          # Use the reg. exp.
        if regular expression matches:
            sys.stdout.write(line)

So we put that search line in our script instead of our placeholder.
Next we have to test the result to see if the search was successful.

Testing a regular expression's results
    if result:
        result is the result of the regular expression's search() method.
        Search successful: tests as True
        Search unsuccessful: tests as False

The search() method returns the Python "null object", None, if there is no match and
something else (which we will return to later) if there is one. So the result variable
now refers to whatever it was that search() returned.
None is Python's way of representing "nothing". The if test in Python treats None as
False and the "something else" as True so we can use result to provide us with a
simple test.
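A small sketch of what the if test sees, using two of the names from names.txt. The variable names hit and miss are chosen just for this example.

```python
import re

regexp = re.compile("Fred")

hit = regexp.search("Frederick\n")   # "Fred" occurs in this line
miss = regexp.search("Alice\n")      # "Fred" does not occur in this line

print(miss is None)                  # search() gave back None
print(hit is None)                   # search() gave back "something else"
print(bool(hit), bool(miss))         # how "if result:" treats the two values
```

This is why the script can write "if result:" rather than comparing result with None explicitly.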

Skeleton Python script
    import sys
    import re
    pattern = "Fred"
    regexp = re.compile(pattern)
    for line in sys.stdin:
        result = regexp.search(line)
        if result:                            # See if the line matched
            sys.stdout.write(line)

So if we drop that line into our skeleton Python script we have completed it.
This Python script is a fairly generic filter. If an input line matches the pattern, write the
line out. If it doesn't, don't write anything.
We will only see two variants of this script in the entire course: in one we only print out
certain parts of the line and in the other we allow for there being multiple Freds in a
single line.
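The same filter logic can be sketched as a function, which makes it easy to try on a list of lines instead of standard input. The function name filter_lines is hypothetical and is not part of the course scripts. Passing sys.stdin as the second argument would give the behaviour of the script above.

```python
import re

def filter_lines(pattern, lines):
    """Yield each line in which the pattern is found."""
    regexp = re.compile(pattern)      # prepare the regular expression once
    for line in lines:
        if regexp.search(line):       # None (no match) tests as False
            yield line

# A few of the names from names.txt, with their newlines.
names = ["Alice\n", "Bob\n", "Fred\n", "Freda\n", "Frederick\n", "Felicity\n"]
freds = list(filter_lines("Fred", names))
print(freds)
```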

Exercise 1: complete the file
filter01.py:
    import sys
    import re
    pattern = "Fred"
    regexp = …
    for line in sys.stdin:
        result = …
        if result:
            sys.stdout.write(line)

If you look in the directory prepared for you, you will find a Python script called
"filter01.py" which contains just this script with a couple of critical lines missing.
Your first exercise is to edit that file to make it a search for the string 'Fred'.

Exercise 1: test your file
    $ python filter01.py < names.txt
    Fred
    Freda
    Frederick

Once you have edited the script give it a try. Note that three names match the test
pattern: Fred, Freda and Frederick. If you don't get this result go back to the script and
correct it.

Case sensitive matching
names.txt:
    Fred
    Freda
    Frederick
    Manfred
Python matches are case sensitive by default

Note that it did not pick out the name "Manfred" also in the file. Python regular
expressions are case sensitive by default; they do not equate "F" with "f".

Case insensitive matching
    regexp = re.compile(pattern, options)
Options are given as module constants:
    re.IGNORECASE    re.I    case insensitive matching
and other options (some of which we'll meet later).
    regexp = re.compile(pattern, re.I)

We can build ourselves a case insensitive regular expression mini-program if we want
to. The re.compile() function we saw earlier can take a second, optional
argument. This argument is a set of flags which modify how the regular expression
works. One of these flags makes it case insensitive.
The options are set as a series of values that need to be combined (added together or,
more conventionally, joined with Python's bitwise OR operator, "|"). We're
currently only interested in one of them, though, so we can give "re.IGNORECASE"
(the IGNORECASE constant from the re module) as the second argument.
For those of you who dislike long options, the I constant is a synonym for the
IGNORECASE constant, so we can use "re.I" instead of "re.IGNORECASE" if we
wish. We use re.I in the slide just to make the text fit, but generally we would
encourage the long forms as better reminders of what the options mean for when you
come back to this script having not looked at it for six months.
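A minimal sketch of the difference the flag makes, using the names from the slides plus "Alice" as a non-matching control. The variable names are chosen just for this example.

```python
import re

names = ["Fred", "Freda", "Frederick", "Manfred", "Alice"]

sensitive = re.compile("Fred")                    # the default behaviour
insensitive = re.compile("Fred", re.IGNORECASE)   # "F" now also matches "f"

case_sensitive_hits = [name for name in names if sensitive.search(name)]
case_insensitive_hits = [name for name in names if insensitive.search(name)]

print(case_sensitive_hits)
print(case_insensitive_hits)
print(re.I == re.IGNORECASE)   # the short and long names are the same option
```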

Exercise 2: modify the script
    $ python filter01.py < names.txt
    Fred
    Freda
    Frederick
    Manfred

So, modify your filter01.py script from the first exercise to work case insensitively.

Serious example:
Post-processing program output
atoms.log:
    RUN 000001 COMPLETED. OUTPUT IN FILE hydrogen.dat.
    RUN 000002 COMPLETED. OUTPUT IN FILE helium.dat.
    …
    RUN 000039 COMPLETED. OUTPUT IN FILE yttrium.dat. 1 UNDERFLOW WARNING.
    RUN 000040 COMPLETED. OUTPUT IN FILE zirconium.dat. 2 UNDERFLOW WARNINGS.
    …
    RUN 000057 COMPLETED. OUTPUT IN FILE lanthanum.dat. ALGORITHM DID NOT CONVERGE AFTER 100000 ITERATIONS.
    …
    RUN 000064 COMPLETED. OUTPUT IN FILE gadolinium.dat. OVERFLOW ERROR.
    …

Now let's look at a more serious example.
The file "atoms.log" is the output of a set of programs which do something involving
atoms of the elements. (It's a fictitious example, so don't obsess on the detail.)
It has a collection of lines corresponding to how various runs of a program completed.
Some are simple success lines such as the first line:
    RUN 000001 COMPLETED. OUTPUT IN FILE hydrogen.dat.
Others have additional information indicating that things did not go so well.
    RUN 000039 COMPLETED. OUTPUT IN FILE yttrium.dat. 1 UNDERFLOW WARNING.
    RUN 000057 COMPLETED. OUTPUT IN FILE lanthanum.dat. ALGORITHM DID NOT CONVERGE AFTER 100000 ITERATIONS.
    RUN 000064 COMPLETED. OUTPUT IN FILE gadolinium.dat. OVERFLOW ERROR.
Our job will be to unpick the "good lines" from the rest.

What do we want?
The file names for the runs with no warning or error messages.
What pattern does this require?
    RUN 000017 COMPLETED. OUTPUT IN FILE chlorine.dat.
    RUN 000016 COMPLETED. OUTPUT IN FILE sulphur.dat.
    RUN 000018 COMPLETED. OUTPUT IN FILE argon.dat.

We will build the pattern required for these good lines bit by bit. It helps to have some
lines "in mind" while developing the pattern, and to consider which bits change
between lines and which bits don't.
Because we are going to be using some leading and trailing spaces in our strings we
are marking them explicitly in the slides.

Fixed text
    RUN 000017 COMPLETED. OUTPUT IN FILE chlorine.dat.
    "Literal" text: "RUN ", " COMPLETED. OUTPUT IN FILE " and ".dat."

The fixed text is shown here. Note that while the element part of the file name varies,
its suffix is constant.

Six digits
    RUN 000017 COMPLETED. OUTPUT IN FILE chlorine.dat.
    Digits: "000017"

The first part of the line that varies is the set of six digits.

Sequence of lower case letters
    RUN 000017 COMPLETED. OUTPUT IN FILE chlorine.dat.
    Letters: "chlorine"

The second part is the primary part of the file name.

And no more!
    RUN 000017 COMPLETED. OUTPUT IN FILE chlorine.dat.
    The line starts here (before "RUN") … and ends here (after ".dat.")

What we have described to date matches all the lines. They all start with that same
sentence. What distinguishes the good lines from the bad is that this is all there is.
The lines start and stop with exactly this, no more and no less.
It is good practice to match against as much of the line as possible as it lessens the
chance of accidentally matching a line you didn't plan to. Later on it will become
essential as we will be extracting elements from the line.

Building the pattern - 1
Start of the line marked with ^
    RUN 000017 COMPLETED. OUTPUT IN FILE chlorine.dat.
    ^

We will be building the pattern at the bottom of the slide. We start by saying that the
line begins here. Nothing may precede it.
The start of line is represented with the "caret" or "circumflex" character, "^".
^ is known as an anchor, because it forces the pattern to match only at a fixed point
(the start, in this case) of the line. Such patterns are called anchored patterns.
Patterns which don't have any anchors in them are known as (surprise!) unanchored
patterns.

Building the pattern - 2
Literal text. Don't forget the space!
    RUN 000017 COMPLETED. OUTPUT IN FILE chlorine.dat.
    "^RUN "

Next comes some literal text. We just add this to the pattern as is.
There's one gotcha we will return to later. It's easy to get the wrong number of spaces
or to mistake a tab stop for a space. In this example it's a single space, but we will
learn how to cope with generic "white space" later.

Building the pattern - 3
Six digits
    [0-9]    "any single character between 0 and 9"
    \d       "any digit"
    RUN 000017 COMPLETED. OUTPUT IN FILE chlorine.dat.
    ^RUN \d\d\d\d\d\d        (inelegant)

Next comes a run of six digits. There are two approaches we can take here. A digit
can be regarded as a character between "0" and "9" in the character set used, but it is
more elegant to have a pattern that explicitly says "a digit".
The sequence "\d" means exactly that in regular expression patterns: a digit.
However, a line of six instances of "\d" is not particularly elegant. Can you imagine
counting them if there were sixty rather than six?
(The sequence "[0-9]" has the meaning "one character between '0' and '9' in the
character set". We will meet this use of square brackets in detail in a few slides' time.)

Building the pattern - 4
Six digits
    \d         "any digit"
    \d{6}      "six digits"
    \d{5,7}    "five, six or seven digits"
    RUN 000017 COMPLETED. OUTPUT IN FILE chlorine.dat.
    ^RUN \d{6}

Regular expression pattern language has a solution to that inelegance. Following any
pattern with a number in curly brackets ("braces") means to iterate that pattern that
many times.
Note that "\d{6}" means "six digits in a row". It does not mean "the same digit six
times". We will see how to describe that later.
The syntax can be extended:
    \d{6}      six digits
    \d{5,7}    five, six or seven digits
    \d{5,}     five or more digits
    \d{,7}     no more than seven digits (including no digits)
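The brace counts can be checked with a short sketch. The helper matches_whole is hypothetical: it wraps a pattern in the start-of-line anchor "^" and the end-of-line anchor "$" (the latter is introduced a few slides further on) so that the count applies to the whole string. The r"..." form of string is also explained later.

```python
import re

def matches_whole(pattern, text):
    """True if the pattern, anchored at both ends, accounts for all of text."""
    return re.search("^" + pattern + "$", text) is not None

print(matches_whole(r"\d{6}", "000017"))      # six digits, not all the same
print(matches_whole(r"\d{6}", "00017"))       # only five digits
print(matches_whole(r"\d{5,7}", "1234567"))   # seven is within five to seven
print(matches_whole(r"\d{5,}", "123456789"))  # nine is "five or more"
print(matches_whole(r"\d{,7}", ""))           # "no digits" is allowed
```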

Building the pattern - 5
Literal text (with spaces).
    RUN 000017 COMPLETED. OUTPUT IN FILE chlorine.dat.
    "^RUN \d{6} COMPLETED. OUTPUT IN FILE "

Next comes some more fixed text.

Building the pattern - 6
Sequence of lower case letters
    [a-z]     "any single character between a and z"
    [a-z]+    "one or more characters between a and z"
    RUN 000017 COMPLETED. OUTPUT IN FILE chlorine.dat.
    ^RUN \d{6} COMPLETED. OUTPUT IN FILE [a-z]+

Next comes the name of the element. We will ignore for these purposes the fact that
we know these are the names of elements. For our purposes they are sequences of
lower case letters.
This time we will use the square bracket notation. This is identical to the wild cards
used by Unix shells, if you are already familiar with that syntax.
The regular expression pattern "[aeiou]" means "exactly one character which can be
either an 'a', an 'e', an 'i', an 'o', or a 'u'".
The slight variant "[a-m]" means "exactly one character between 'a' and 'm' inclusive
in the character set". In the standard computing character sets (with no
internationalisation turned on) the digits, the lower case letters, and the upper case
letters form uninterrupted runs. So "[0-9]" will match a single digit. "[a-z]" will
match a single lower case letter. "[A-Z]" will match a single upper case letter.
But we don't want to match a single lower case letter. We want to match an unknown
number of them. Any pattern can be followed by a "+" to mean "repeat the pattern one
or more times". So "[a-z]+" matches a sequence of one or more lower case letters.
(Again, it does not mean "the same lower case letter multiple times".) It is equivalent to
"[a-z]{1,}".

Building the pattern - 7
Literal text
    RUN 000017 COMPLETED. OUTPUT IN FILE chlorine.dat.
    ^RUN \d{6} COMPLETED. OUTPUT IN FILE [a-z]+.dat.

Next we have the closing literal text.
(Strictly speaking the dot is a special character in regular expressions but we will
address that in a later slide.)

Building the pattern - 8
End of line marked with $
    RUN 000017 COMPLETED. OUTPUT IN FILE chlorine.dat.
    ^RUN \d{6} COMPLETED. OUTPUT IN FILE [a-z]+.dat.$

Finally, and crucially, we identify this as the end of the line. The lines with warnings
and errors go beyond this point.
The dollar character, "$", marks the end of the line.
This is another anchor, since it forces the pattern to match only at another fixed place
(the end) of the line.
Note that although we are using both ^ and $ in our pattern, you don't have to always
use both of them in a pattern. You may use both, or only one, or neither, depending
on what you are trying to match.
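To see what the closing "$" buys us, here is a sketch that runs the pattern as built so far, with and without the final anchor, over three lines in the style of atoms.log. The variable names are chosen for this example, and the patterns are written as r"..." strings, which are explained later in the course.

```python
import re

lines = [
    "RUN 000017 COMPLETED. OUTPUT IN FILE chlorine.dat.\n",
    "RUN 000039 COMPLETED. OUTPUT IN FILE yttrium.dat. 1 UNDERFLOW WARNING.\n",
    "RUN 000064 COMPLETED. OUTPUT IN FILE gadolinium.dat. OVERFLOW ERROR.\n",
]

# The pattern from this slide (the dots are not yet treated specially).
with_dollar = re.compile(r"^RUN \d{6} COMPLETED. OUTPUT IN FILE [a-z]+.dat.$")
# The same pattern without the end-of-line anchor.
without_dollar = re.compile(r"^RUN \d{6} COMPLETED. OUTPUT IN FILE [a-z]+.dat.")

good_only = [line for line in lines if with_dollar.search(line)]
everything = [line for line in lines if without_dollar.search(line)]

print(len(good_only), len(everything))
```

Only the chlorine line survives the anchored pattern, while all three lines begin with text that satisfies the unanchored one.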

Exercise 3: running the filter
1. Copy filter01.py to filter02.py
2. Edit filter02.py
   Use the ^RUN… regular expression.
   Make it case sensitive again.
3. Test it against atoms.log
    $ python filter02.py < atoms.log

You should try this regular expression for yourselves and get your fingers used to
typing some of the strange sequences.
Copy the filter01.py file that you developed previously to a new file called
filter02.py.
Then edit the simple "Fred" string to the new expression we have developed. This
search should be case sensitive.
Then try it out for real.
    $ cp filter01.py filter02.py
    $ gedit filter02.py
    $ python filter02.py < atoms.log
If it doesn't work, go back and fix filter02.py until it does.

Exercise 3: changing the filter
4. Edit filter02.py
   Lose the $ at the end of the regular expression.
5. What output do you think you will get this time?
6. Test it against atoms.log again.
    $ python filter02.py < atoms.log

Then change the regular expression very slightly simply by removing the final dollar
character that anchors the expression to the end of the line.
What extra lines do you think it will match now?
Try the script again. Were you right?

Special codes in regular expressions
    ^   \A    Anchor start of line
    $   \Z    Anchor end of line
    \d        Any digit
    \D        Any non-digit

We have now started to meet some of the special codes that the regular expression
language uses in its patterns.
The caret character, "^", means "start of line". The caret is traditional, but there is an
equivalent which is "\A".
The dollar character, "$", means "end of line" and has a near-equivalent, "\Z". (One
difference matters for our filter: "$" also matches just before a newline at the very end
of the text, whereas "\Z" matches only at the very end, so "\Z" will not match at the end
of a line that still carries its newline character.)
The sequence "\d" means "a digit". Note that the capital version, "\D", means exactly
the opposite.
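The sketch below tries these codes on a line as the filter script would see it, that is, with its newline still attached. One practical point it shows is that "$" tolerates the trailing newline while "\Z" matches only at the very end of the string. The variable names are chosen for this example.

```python
import re

line = "RUN 000017\n"            # as read from a file, newline included
stripped = line.rstrip("\n")     # the same text without the newline

caret_start = re.search(r"^RUN", line) is not None
backslash_a_start = re.search(r"\ARUN", line) is not None
dollar_end = re.search(r"\d$", line) is not None
backslash_z_end = re.search(r"\d\Z", line) is not None
backslash_z_stripped = re.search(r"\d\Z", stripped) is not None
non_digit_in_digits = re.search(r"\D", "000017") is not None

print(caret_start, backslash_a_start)
print(dollar_end, backslash_z_end, backslash_z_stripped)
print(non_digit_in_digits)
```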

What can go in "[…]"?
    [aeiou]       any lowercase vowel
    [A-Z]         any uppercase alphabetic
    [A-Za-z]      any alphabetic
    [A-Za-z\]]    any alphabetic or a ']'    (backslashed character)
    [A-Za-z\-]    any alphabetic or a '-'    (backslashed character)
    [-A-Za-z]     any alphabetic or a '-'    ('-' as first character: special behaviour for '-' only)

We also need to consider just what can go in between the square brackets.
If we have just a set of simple characters (e.g. "[aeiou]") then it matches any one
character from that set. Note that the set of simple characters can include a space,
e.g. "[ aeiou]" matches a space or an "a" or an "e" or an "i" or an "o" or a "u".
If we put a dash between two characters then it means any one character from that
range. So "[a-z]" is exactly equivalent to "[abcdefghijklmnopqrstuvwxyz]".
We can repeat this for multiple ranges, so "[A-Za-z]" is equivalent to
"[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]".
If we want one of the characters in the set to be a dash, "-", there are two ways we
can do this. We can precede the dash with a backslash, "\-", to mean "include the
character '-' in the set of characters we want to match", e.g. "[A-Za-z\-]" means
"match any alphabetic character or a dash". Alternatively, we can make the first
character in the set a dash in which case it will be interpreted as a literal dash ("-")
rather than indicating a range of characters, e.g. "[-A-Za-z]" also means "match any
alphabetic character or a dash".
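A short sketch of these sets in action. The helper only_from is hypothetical: it checks that every character of a string comes from the given set by anchoring the set, followed by "+", at both ends. The sample strings are illustrative.

```python
import re

def only_from(char_set, text):
    """True if text consists entirely of characters from the [...] set."""
    return re.search("^" + char_set + "+$", text) is not None

print(only_from(r"[A-Za-z]", "Ermintrude"))   # all alphabetic
print(only_from(r"[A-Za-z]", "www-data"))     # the dash is not alphabetic
print(only_from(r"[A-Za-z\-]", "www-data"))   # backslashed dash in the set
print(only_from(r"[-A-Za-z]", "www-data"))    # dash as first character
print(only_from(r"[ aeiou]", "a e i o u"))    # a space can be in the set
```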

Counting in regular expressions
    [abc]         Any one of 'a', 'b' or 'c'.
    [abc]+        One or more 'a', 'b' or 'c'.
    [abc]*        Zero or more 'a', 'b' or 'c'.
    [abc]?        Zero or one 'a', 'b' or 'c'.
    [abc]{6}      Exactly 6 of 'a', 'b' or 'c'.
    [abc]{5,7}    5, 6 or 7 of 'a', 'b' or 'c'.
    [abc]{5,}     5 or more of 'a', 'b' or 'c'.
    [abc]{,7}     7 or fewer of 'a', 'b' or 'c'.

We also saw that we can count in regular expressions. These counting modifiers
appear in the slide after the example pattern "[abc]". They can follow any regular
expression pattern.
We saw the plus modifier, "+", meaning "one or more". There are a couple of related
modifiers that are often useful: a query, "?", means "zero or one" of the pattern and
asterisk, "*", means "zero or more".
Note that in shell expansion of file names ("globbing") the asterisk means "any string".
In regular expressions it means nothing on its own and is purely a modifier.
The more precise counting is done with curly brackets.
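The three single-character modifiers can be compared side by side. In this sketch the helper whole is hypothetical and simply anchors a pattern at both ends, so that the modifier has to account for the entire string.

```python
import re

def whole(pattern, text):
    """True if the anchored pattern accounts for all of text."""
    return re.search("^" + pattern + "$", text) is not None

samples = ["", "a", "abca"]

# "+" needs at least one character; "*" and "?" also accept none;
# "?" accepts at most one.
plus_results = [whole(r"[abc]+", text) for text in samples]
star_results = [whole(r"[abc]*", text) for text in samples]
query_results = [whole(r"[abc]?", text) for text in samples]

print(plus_results)
print(star_results)
print(query_results)
```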

What matches "["?
If "[abcd]" matches any one of "a", "b", "c" or "d", what matches "[abcd]"?
    [abcd]        Any one of 'a', 'b', 'c', 'd'.
    \[abcd\]      [abcd]

Now let's pick up a few stray questions that might have arisen as we built that pattern.
If square brackets identify sets of letters to match, what matches a square bracket?
How would I match the literal string "[abcde]", for example?
The way to mean "a real square bracket" is to precede it with a backslash. Generally
speaking, if a character has a special meaning then preceding it with a backslash
turns off that specialness. So "[" is special, but "\[" means "just an open square
bracket". (Similarly, if we want to match a backslash we use "\\".)
We will see more about backslash shortly.

Backslash
    \
    [  ]       used to hold sets of characters
    \[  \]     the real square brackets
    d          the letter "d"
    \d         any digit
    literal characters:    d     \[  \]
    specials:              \d    [  ]

The way to mean "a real square bracket" is to precede it with a backslash. Generally
speaking, if a character has a special meaning then preceding it with a backslash
turns off that specialness. So "[" is special, but "\[" means "just an open square
bracket". (Similarly, if we want to match a backslash we use "\\".)
Conversely, if a character is just a plain character then preceding it with a backslash
can make it special. For example, "d" matches just the lower case letter "d" but "\d"
matches any one digit.
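A brief sketch of the backslash working in both directions. The test strings and variable names are made up for illustration, and the patterns are written as r"..." strings, which are explained later in the course.

```python
import re

# Backslash removes the special meaning of the square brackets...
literal_brackets_found = re.search(r"\[abcd\]", "the set [abcd]") is not None
literal_brackets_missing = re.search(r"\[abcd\]", "just a") is not None
# ...whereas without backslashes the brackets describe a set of characters.
set_found = re.search(r"[abcd]", "just a") is not None

# Backslash gives the plain letter "d" a special meaning.
plain_d_matches_digit = re.search(r"d", "7") is not None
backslash_d_matches_digit = re.search(r"\d", "7") is not None

# A doubled backslash in the pattern matches one real backslash.
real_backslash_found = re.search(r"\\", "C:\\temp") is not None

print(literal_brackets_found, literal_brackets_missing, set_found)
print(plain_d_matches_digit, backslash_d_matches_digit)
print(real_backslash_found)
```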

What does dot match?
We've been using dot as a literal character.
Actually…
    .     "." matches any character except "\n".
    \.    "\." matches just the dot.

There's also an issue with using ".". We've been using it as a literal character,
matching the full stops at the ends of sentences, or in file name suffixes, but actually
it's another special character that matches any single character except for the new line
character ("\n" matches the new line character). We've just been lucky so far that the
only possible match has been to a real dot. If we want to force the literal character we
place a backslash in front of it, just as we did with square brackets.

Special codes in regular expressions
    ^   \A    Anchor start of line
    $   \Z    Anchor end of line
    \d        Any digit
    \D        Any non-digit
    .         Any character except newline

So we can add the full stop to our set of special codes.

Building the pattern - 9
Actual full stops in the literal text.
    RUN 000017 COMPLETED. OUTPUT IN FILE chlorine.dat.
    ^RUN \d{6} COMPLETED\. OUTPUT IN FILE [a-z]+\.dat\.$

So our filter expression for the atoms.log file needs a small tweak to indicate that
the dots are real dots and not just "any character except the newline" markers.
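The effect of the tweak can be sketched by comparing the two versions of the pattern. The second test line below is deliberately made up (it does not come from atoms.log) to show what the unescaped dots would let through. The names loose and strict are chosen for this example.

```python
import re

# Dots left as "any character except newline".
loose = re.compile(r"^RUN \d{6} COMPLETED. OUTPUT IN FILE [a-z]+.dat.$")
# Dots backslashed so that only real full stops match.
strict = re.compile(r"^RUN \d{6} COMPLETED\. OUTPUT IN FILE [a-z]+\.dat\.$")

good = "RUN 000017 COMPLETED. OUTPUT IN FILE chlorine.dat.\n"
odd = "RUN 000017 COMPLETED! OUTPUT IN FILE chlorine-dat?\n"   # made-up line

loose_results = [loose.search(text) is not None for text in (good, odd)]
strict_results = [strict.search(text) is not None for text in (good, odd)]

print(loose_results)
print(strict_results)
```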

Exercise 4: changing the atom filter
1. Edit filter02.py
   Replace the $ at the end.
   Fix the dots.
2. Run filter02.py again to check it.
    $ python filter02.py < atoms.log

So apply this change to the dots to your script, and put back the end of line anchor
while you're there.

Exercise 5 and coffee break
Input:  messages
Script: filter03.py
Match lines with "Invalid user".
Match the whole line.

We'll take a break to grab some coffee. Over the break try this exercise. Copy the file
"filter02.py" to "filter03.py" and change the pattern to solve this problem:
The file "messages" contains a week's logs from one of the authors' workstations. In
it are a number of lines containing the phrase "Invalid user". Write a regular
expression to match these lines and then print them out.
Match the whole line, not just the phrase. We will want to use the rest of the line for a
later exercise. In addition, it forces you to think about how to match the terms that
appear in that line such as dates and time stamps.
There are 1,366 matching lines. Obviously, that's too many lines for you to sensibly
count just by looking at the screen, so you can use the Unix command wc to do this
for you like this:
    $ python filter03.py < messages | wc -l
    1366
(The "-l" option to wc tells it just to count the number of lines in the output. The pipe
symbol, "|", tells the operating system to take the output of your Python script and
give it to the wc command as input.)

Answer to exercise
(a single pattern, split across three lines on the slide; the parts are separated by single spaces)
    ^[A-Z][a-z]{2} [123 ][0-9] \d\d:\d\d:\d\d
    noether sshd\[\d+\]: Invalid user [A-Za-z0-9/\-]+
    from \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$

    ^                                     Start of line
    [A-Z][a-z]{2}                         "Jan", "Feb", "Mar", …
    [123 ][0-9]                           " 2", "12", …
    \d\d:\d\d:\d\d                        "01:23:34", "12:34:50", …
    \[\d+\]                               "[567]", "[12345]"
    [A-Za-z0-9/\-]+                       "admin", "www-data", "mp3", …
    \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}    "131.111.4.12", …
    $                                     End of line

Here's the answer to the exercise. If any of it doesn't make sense, please tell the
lecturer now.
The patterns aren't perfect. The pattern used to match the three character
abbreviation of the month also matches "Jeb" and "Fan", for example.
Note how to match dates which run from 1 to 31 we don't say "two digits". We say "a
leading 1, 2, 3 or space" followed by an arbitrary digit. Again, this matches some
things which are not dates, 32 for example, but does exclude some other not-dates.
The pattern "[A-Za-z0-9/\-]+" to match the IDs being offered is based purely on
visual inspection of those in the file. In the future a hacker may attempt to access the
system using a login with a different character in it. We will see how to deal with this
uncertainty very soon.

Tightening up the regular expression
    \s+     general white space
    \S+     general non-white space
    \       Backslash is special in Python strings ("\n")
    r'…'  r"…"
            Putting an "r" in front of a string turns
            off any special treatment of backslash.
            (Routinely used for regular expressions.)

At this point we ought to develop some more syntax and a useful Python trick to help
us round some problems which we have skated over.
An issue with breaking up lines like this is that some systems use single spaces while
others use double spaces at the ends of sentences or tab stops between fields etc.
The sequence "\s" means "a white space character" (space and tab mostly) so "\s+"
is commonly used for "some white space".
Note that white space is marked by a backslashed lowercase "s". An uppercase "S"
means exactly the opposite. "\s" matches a single white space character and "\S"
matches a single character that is not white space. This will let us work round the
problem of not knowing what characters will appear in the invalid logins in advance.
We seem to be using backslash a lot in regular expressions. Unfortunately backslash
is also special for ordinary Python strings. "\n", for example, means a new line, "\t"
means a tab, and so on. We want to make sure that our regular expression use of
backslash does not clash with the Python use of backslash. The way we do this is to
precede the string with the letter "r". This turns off any special backslash handling
Python would otherwise do. This is usually only done for regular expressions.
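A small sketch of both points: what the leading "r" changes about a Python string, and how "\s+" and "\S+" cope with different kinds of white space. The sample strings and the name fields are illustrative.

```python
import re

# Without the "r", Python turns "\t" into a single tab character.
plain_length = len("\t")
raw_length = len(r"\t")        # a backslash followed by the letter "t"

# Two ways of writing the same pattern text.
same_pattern = ("\\d+" == r"\d+")

# Two "words" separated by any amount of white space.
fields = re.compile(r"^\S+\s+\S+$")
results = [fields.search(text) is not None
           for text in ["admin  root", "admin\troot", "admin"]]

print(plain_length, raw_length, same_pattern)
print(results)
```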

The final string
Pattern:
    ^[A-Z][a-z]{2} [123 ][0-9] \d\d:\d\d:\d\d noether sshd\[\d+\]: Invalid user \S+ from \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$
String:
    r'^[A-Z][a-z]{2} [123 ][0-9] \d\d:\d\d:\d\d noether sshd\[\d+\]: Invalid user \S+ from \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$'

So our exercise's pattern gets this final form, with a leading "r" on the string.
We have also chosen to replace the "[A-Za-z0-9/\-]+" pattern which happened to
work for our particular log file with "\S+" (that's an uppercase "S") to match more
generally.

Special codes in regular expressions

^  \A   Anchor start of line
$  \Z   Anchor end of line
\d      Any digit
\D      Any non-digit
\w      Any word character (letter, digit, "_")
\W      Any non-word character
.       Any character except newline
\s      Any white-space
\S      Any non-white-space

So we can add \s and \S to our set of special codes and we will take the opportunity
to include just two more. The code \w matches any character that's likely to be in a
(computerish) word. These are the letters (both upper and lower case), the digits and
the underscore character. There is no punctuation included. The upper case version
means the opposite again.
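
As a quick illustration of the codes in the table, the sketch below pulls the runs of each kind of character out of a made-up string. It uses re.findall(), a function from the re module that this course does not otherwise cover, simply because it makes the result easy to see.

```python
import re

sample = "user_42 logged-in at 09:15"    # made-up sample text

words = re.findall(r"\w+", sample)       # runs of letters, digits and "_"
non_words = re.findall(r"\W+", sample)   # whatever lies between them
digits = re.findall(r"\d+", sample)      # runs of digits only
```

Note that the underscore stays inside "user_42" while the hyphen splits "logged-in" in two.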

Exercise 6: improving the filter

Edit filter03.py
Replace  [A-Za-z0-9/\-]+  in a ""  string
with     \S+              in an r"" string

Now you can improve your filter03.py script. You have now done all that can
really be done to the pattern to make it as good as it gets.

Complex regular expressions

^[A-Z][a-z]{2} [123 ][0-9] \d\d:\d\d:\d\d noether sshd\[\d+\]: Invalid user \S+ from \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$

If regular expressions were a programming language…
comments
layout
meaningful variable names

As we've just seen, regular expressions can get really complicated. If regular
expressions were a programming language in their own right, we would expect to be
able to lay them out sensibly to make them easier to read and to include comments.
Python allows us to do both of these with a special option to the re.compile()
function which we will meet now.
(We might also expect to have variables with names, and we will come to that in this
course too.)

Verbose mode

^[A-Z][a-z]{2} [123 ][0-9] \d\d:\d\d:\d\d noether sshd\[\d+\]: Invalid user \S+ from \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$

Hard to write
Harder to read
Hardest to maintain

Multi-line layout
Comments
"Verbose" mode

Our fundamental problem is that the enormous regular expression we have just written
runs the risk of becoming gibberish. It was a struggle to write and if you passed it to
someone else it would be even more of a struggle to read. It gets even worse if you
are asked to maintain it after not looking at it for six months.
The problem is that there is nothing that looks like a useful language for our eyes to
hook on; it looks too much like nonsense.
We need to be able to spread it out over several lines so that how it breaks down into
its component parts becomes clearer. It would be nice if we had comments so we
could annotate it too.
Python's regular expression system has all this and calls it, rather unfairly, "verbose"
mode.

Layout

^
[A-Z][a-z]{2}
[123 ][0-9]
\d\d:\d\d:\d\d
noether sshd
\[\d+\]:
Invalid user
\S+
from
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
$

What about spaces?
Significant space
Ignorable space
Need to treat spaces specially

To fix our layout concerns let's try splitting our pattern over several lines. We'll use the
one from the exercise as it is by far the most complex pattern we have seen to date.
The first issue we hit concerns spaces. We want to match on spaces, and our original
regular expression had spaces in it. However, multi-line expressions like this typically
have trailing spaces at the ends of lines ignored. In particular any spaces between the
end of the line and the start of any putative comments mustn't contribute towards the
matching component.
We will need to treat spaces differently in the verbose version.

Layout

^
[A-Z][a-z]{2}\ 
[123 ][0-9]\ 
\d\d:\d\d:\d\d\ 
noether\ sshd
\[\d+\]:\ 
Invalid\ user\ 
\S+\ 
from\ 
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
$

Backslashes!
"\ "    Significant space (backslash followed by a space)
" "     Ignorable space

We use a backslash.
In a verbose regular expression pattern spaces become special characters; "special"
because they are completely ignored. So we make a particular space "ordinary"
(i.e. just a space) by preceding it with a backslash, just as we did for square brackets.
Slightly paradoxically, where the space appears inside square brackets to indicate
membership of a set of characters it doesn't need backslashing as its meaning is
unambiguous.

Spaces in verbose mode

Space ignored generally         (a bare space)
Backslash space recognised      "\ "
Backslash not needed in […]     [123 ]

So this is a slight change needed to our regular expression language to support
multi-line regular expressions.
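
The three rules on the slide can be checked directly. The following is a small sketch with made-up test strings; the variable names are only for illustration.

```python
import re

# In verbose mode the bare space between the two words is ignored,
# so this pattern really means "Invaliduser".
ignored = re.compile(r"Invalid user", re.VERBOSE)

# A backslashed space is an ordinary, significant space again.
escaped = re.compile(r"Invalid\ user", re.VERBOSE)

# Inside square brackets a space needs no backslash.
in_class = re.compile(r"[123 ][0-9]", re.VERBOSE)

bare_matches_spaced = ignored.search("Invalid user") is not None
bare_matches_joined = ignored.search("Invaliduser") is not None
escaped_matches = escaped.search("Invalid user") is not None
class_matches = in_class.search(" 7") is not None
```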

Comments

^
[A-Z][a-z]{2}\ # Month
[123 ][0-9]\ 
\d\d:\d\d:\d\d\ 
noether\ sshd
\[\d+\]:\ 
Invalid\ user\ 
\S+\ 
from\ 
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
$

In "\ # Month":
"\ "        Significant space
" "         Ignored space
"# Month"   Comment

Now let's add comments.
We will introduce them using exactly the same character as is used in Python proper,
the "hash" character, "#".
Any text from the hash character to the end of the line is ignored.
This means that we will have to have some special treatment for hashes if we want to
match them as ordinary characters, of course. It's time for another backslash.

Hashes in verbose mode

Hash introduces a comment       # Month
Backslash hash matches "#"      \#
Backslash not needed in […]     [123#]

In verbose (multi-line) mode, hashes introduce comments. The backslashed hash, "\#",
matches the hash character itself. Again, just as with space, you don't need the
backslash inside square brackets.
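
A short sketch of the hash rules follows. The pattern and the test string are invented for illustration and have nothing to do with the log files used in the exercises.

```python
import re

pattern = r"""
    item\ \#    # the literal text "item #": space and hash both backslashed
    [#\d]+      # inside square brackets a hash is just a hash
"""
regexp = re.compile(pattern, re.VERBOSE)

match = regexp.search("item ##42")
matched_text = match.group(0)
```

Everything after an unbackslashed hash outside square brackets, up to the end of that line, is treated as a comment and plays no part in the match.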

Verbose mode

^
[A-Z][a-z]{2}\                          # Month
[123 ][0-9]\                            # Day
\d\d:\d\d:\d\d\                         # Time
noether\ sshd
\[\d+\]:\                               # Process ID
Invalid\ user\ 
\S+\                                    # User ID
from\ 
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}      # IP address
$

So this gives us our more legible mode. Each element of the regular expression gets a
line to itself so it at least looks like smaller pieces of gibberish. Furthermore each can
have a comment so we can be reminded of what the fragment is trying to match.

Python long strings

r"""                                    (start raw long string)
^
[A-Z][a-z]{2}\                          # Month
[123 ][0-9]\                            # Day
\d\d:\d\d:\d\d\                         # Time
noether\ sshd
\[\d+\]:\                               # Process ID
Invalid\ user\ 
\S+\                                    # User ID
from\ 
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}      # IP address
$
"""                                     (end long string)

This verbose regular expression pattern covers many lines. Python has a mechanism
specifically designed for multi-line strings: the triple-quoted string. We always use that,
in conjunction with the r ("raw") qualifier, to carry these verbose regular expression
patterns.

Telling Python to "go verbose"

Verbose mode
Another option, like ignoring case
Module constant, like re.IGNORECASE
re.VERBOSE    re.X

So now all we have to do is to tell Python to use this verbose mode instead of its usual
one. We do this as an option on the re.compile() function just as we did when we
told it to work case insensitively. There is a Python module constant re.VERBOSE
which we use in exactly the same way as we did re.IGNORECASE. It has a short
name "re.X" too.
Incidentally, if you ever wanted case insensitivity and verbosity, you add the two
together:
regexp = re.compile(pattern, re.IGNORECASE+re.VERBOSE)
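
Here is a minimal sketch of combining the two options. It uses the "|" (bitwise or) operator, which is the spelling the Python documentation uses for combining flags; for two different flags it gives the same value as adding them. The pattern and test string are made up.

```python
import re

flags = re.IGNORECASE | re.VERBOSE

regexp = re.compile(r"""
    invalid\ user    # matches in any mix of upper and lower case
    """, flags)

found = regexp.search("Jan  1 INVALID USER bob") is not None

# For two distinct flags, "or" and "add" produce the same number.
same_as_sum = (flags == re.IGNORECASE + re.VERBOSE)
```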

Before:
    import sys
    import re
    …
    pattern = r"^[A-Z][a-z]{2}…$"
    regexp = re.compile(pattern)
    …

After:
    import sys
    import re
    …
    pattern = r"""^                             (1)
    [A-Z][a-z]{2}\                              (2)
    …
    $
    """
    regexp = re.compile(pattern, re.VERBOSE)    (3)
    …

So how would we change a filter script in practice to use verbose regular expressions?
It's actually a very easy three step process.
1. First you convert your pattern string into a multi-line string.
2. Then you make the backslash tweaks necessary.
3. Then you change the re.compile() call to have the re.VERBOSE option.
(And then you should test your script to see if it still works!)
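
The three steps, and the final check that nothing has broken, might look like the sketch below. The sample log lines are invented to fit the pattern (they are not taken from the course's messages file), and every line that ends in a significant space has been given a comment so that the backslashed space cannot be lost as trailing white space.

```python
import re

# The original single-line pattern.
compact = (r"^[A-Z][a-z]{2} [123 ][0-9] \d\d:\d\d:\d\d noether sshd\[\d+\]: "
           r"Invalid user \S+ from \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$")

# Steps 1 and 2: a raw triple-quoted string, with significant spaces backslashed.
verbose = r"""
    ^
    [A-Z][a-z]{2}\                         # Month
    [123 ][0-9]\                           # Day
    \d\d:\d\d:\d\d\                        # Time
    noether\ sshd                          # Host and program
    \[\d+\]:\                              # Process ID
    Invalid\ user\                         # Literal text
    \S+\                                   # User ID
    from\                                  # Literal text
    \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}     # IP address
    $
"""

# Step 3: pass re.VERBOSE when compiling the multi-line version.
compact_re = re.compile(compact)
verbose_re = re.compile(verbose, re.VERBOSE)

# Hypothetical log lines used only to test that both forms agree.
good = "Jan  5 12:34:56 noether sshd[1234]: Invalid user admin from 192.168.0.1"
bad = "Jan  5 12:34:56 noether sshd[1234]: Accepted password for bob"

agree_on_good = bool(compact_re.search(good)) and bool(verbose_re.search(good))
agree_on_bad = not compact_re.search(bad) and not verbose_re.search(bad)
```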

Exercise 7: use verbose mode

filter02.py
filter03.py
single line regular expression → verbose regular expression

So now that you've seen how to turn an "ordinary" regular expression into a "verbose"
one with comments, it's time to try it for real.
Edit the files filter02.py and filter03.py so that the regular expression
patterns they use are "verbose" ones laid out across multiple lines with suitable
comments. Test them against the same input files as before.
If you have any problems with this exercise, please ask the lecturer.

Extracting bits from the line

^                                       # Start of line
RUN\ 
\d{6}                                   # Job number
\ COMPLETED\.\ OUTPUT\ IN\ FILE\ 
[a-z]+\.dat                             # File name
\.
$                                       # End of line

Suppose we wanted to extract just these two components
(the job number and the file name).

We're almost finished with the regular expression syntax now. We have most of what
we need for this course and can now get on with developing Python's system for using
it. We will continue to use the verbose version of the regular expressions as it is easier
to read, which is helpful for courses as well as for real life! Note that nothing we teach
in the remainder of this course is specific to verbose mode; it will all work equally well
in the concise mode too.
Suppose we are particularly interested in two parts of the line, the job number and the
file name. Note that the file name includes both the component that varies from line to
line, "[a-z]+", and the constant, fixed suffix, ".dat".
What we will do is label the two components in the pattern and then look at Python's
mechanism to get at their values.

Changing the pattern

^                                       # Start of line
RUN\ 
(\d{6})                                 # Job number
\ COMPLETED\.\ OUTPUT\ IN\ FILE\ 
([a-z]+\.dat)                           # File name
\.
$                                       # End of line

Parentheses around the patterns
"Groups"

We start by changing the pattern to place parentheses (round brackets) around the
two components of interest. These two selected areas are called "groups".

Parentheses in regular expressions

Parentheses surround a group          (…)
Use backslash to match "(" or ")"     \(  \)
Backslash not needed in […]           [123(]
Parentheses without backslashes must "balance"

If you want to match a literal parenthesis use "\(" or "\)".
Note that because (unbackslashed) parentheses have this special meaning of defining
subsets of the matching line they must match. If they don't then the re.compile()
function will give an error similar to this:
>>> pattern='('
>>> regexp=re.compile(pattern)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.6/re.py", line 188, in compile
    return _compile(pattern, flags)
  File "/usr/lib64/python2.6/re.py", line 243, in _compile
    raise error, v # invalid expression
sre_constants.error: unbalanced parenthesis
>>>
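
In a script you would normally want to catch this error rather than let it stop the program. The sketch below assumes only that the exception is available as re.error, which is how the re module exposes it; the exact wording of the message varies between Python versions, so it is stored but not relied upon.

```python
import re

try:
    re.compile("(")            # an opening parenthesis with no partner
    compiled_ok = True
    message = ""
except re.error as err:
    compiled_ok = False
    message = str(err)         # wording differs between Python versions

# Backslashed, the parenthesis is an ordinary character again.
literal = re.compile(r"\(")
found = literal.search("f(x)") is not None
```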

The "match object"

    …
    regexp = re.compile(pattern, re.VERBOSE)
    for line in sys.stdin:
        result = regexp.search(line)
        if result:
            …

Now we are asking for certain parts of the pattern to be specially treated (as "groups")
we must turn our attention to the result of the search to get at those groups.
To date all we have done with the results is to test them for truth or falsehood: "is
there something there or not?" Now we will dig in more deeply.

Using the match object

Line:
RUN 000001 COMPLETED. OUTPUT IN FILE hydrogen.dat.

result.group(1)    '000001'
result.group(2)    'hydrogen.dat'
result.group(0)    whole pattern

We get at the groups from the match object. The method result.group(1) will
return the contents of the first pair of parentheses and the method
result.group(2) will return the contents of the second.
Avid Pythonistas will recall that Python usually counts from zero and may wonder what
result.group(0) gives. This returns whatever the entire pattern matched. In our
case, where our regular expression defines the whole line (^ to $), this is equivalent to
the whole line.
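
The slide's example can be reproduced as a self-contained sketch. The line is the one shown on the slide; the only liberty taken is the extra "Literal text" comments, which keep the significant trailing spaces from being lost.

```python
import re

pattern = r"""
    ^                                    # Start of line
    RUN\                                 # Literal text
    (\d{6})                              # Job number  (group 1)
    \ COMPLETED\.\ OUTPUT\ IN\ FILE\     # Literal text
    ([a-z]+\.dat)                        # File name   (group 2)
    \.
    $                                    # End of line
"""
regexp = re.compile(pattern, re.VERBOSE)

line = "RUN 000001 COMPLETED. OUTPUT IN FILE hydrogen.dat."
result = regexp.search(line)

job_number = result.group(1)
file_name = result.group(2)
everything = result.group(0)
```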

Putting it all together

    …
    regexp = re.compile(pattern, re.VERBOSE)
    for line in sys.stdin:
        result = regexp.search(line)
        if result:
            sys.stdout.write("%s\t%s\n" % (result.group(1), result.group(2)))

So now we can write out just those elements of the matching lines that we are
interested in.
Note that we still have to test the result variable to make sure that it is not None
(i.e. that the regular expression matched the line at all). This is what the "if …" test does
because None tests false. We cannot ask for the group() method on None because
it doesn't have one. If you make this mistake you will get an error message:
AttributeError: 'NoneType' object has no attribute 'group'
and your script will terminate abruptly.
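
The failure described above is easy to provoke deliberately. This sketch uses a made-up input string that contains no job number, shows the AttributeError being raised, and then shows the safe form that tests the result first.

```python
import re

regexp = re.compile(r"(\d{6})")
result = regexp.search("no job number here")   # nothing matches, so None

try:
    result.group(1)                 # None has no group() method
    error_name = ""
except AttributeError as err:
    error_name = type(err).__name__

# The safe form: test the result before using it.
if result:
    safe = result.group(1)
else:
    safe = "no match"
```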

Exercise 8: limited output

Modify the log file filter to output just
the account name and the IP address.
filter03.py → filter04.py
sys.stdout.write("%s\t%s\n" % (name, address))

Now try it for yourselves. You have a file filter03.py which you created to answer
an earlier exercise. This finds the lines from the messages file which indicate an
Invalid user. Copy this script to filter04.py and edit it so that you define groups for
the account name (matched by \S+) and the IP address (matched by
\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).
The bottom of the slide is a quick reminder of the string substitution syntax in Python.
This will get you nicely tab-aligned text.

Limitations of numbered groups

The problem:
Insert a group: all following numbers change
"What was group number three again?"
The solution: use names instead of numbers
Insert a group: it gets its own name
Use sensible names.

Groups in regular expressions are good but they're not perfect. They suffer from the
sort of problem that creeps up on you only after you've been doing Python regular
expressions for a bit.
Suppose you decide you need to capture another group within a regular expression. If
it is inserted between the first and second existing group, say, then the old group
number 2 becomes the new number 3, the old 3 the new 4 and so on.
There's also a problem that "result.group(2)" doesn't shout out what the second
group actually was.
There's a solution to this. We will associate names with groups rather than just
numbers.

Named groups

^                                       # Start of line
RUN\ 
(?P<jobnum>\d{6})                       # Job number
\ COMPLETED\.\ OUTPUT\ IN\ FILE\ 
(?P<filename>[a-z]+\.dat)               # File name
\.
$                                       # End of line

Specifying the name
A group named "jobnum"

So how do we do this naming?
We insert some additional controls immediately after the open parenthesis. In general
in Python's regular expression syntax "(?" introduces something special that may not
even be a group (though in this case it is). We specify the name with the rather bizarre
syntax "?P<groupname>".

Naming a group

(?P<filename>[a-z]+\.dat)

(…)            group
?P             define a named group
<…>            group name
[a-z]+\.dat    pattern

So the group is defined as usual by parentheses (round brackets).
Next must come "?P" to indicate that we are handling a named group.
Then comes the name of the group in angle brackets.
Finally comes the pattern that actually does the matching. None of the ?P<…>
business is used for matching; it is purely for naming.

Using the named group

Line:
RUN 000001 COMPLETED. OUTPUT IN FILE hydrogen.dat.

result.group('jobnum')      '000001'
result.group('filename')    'hydrogen.dat'

To refer to a group by its name, you simply pass the name to the group() method as
a string. You can still also refer to the group by its number. So in the example here,
result.group('jobnum') is the same as result.group(1), since the first group
is named "jobnum".
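
A self-contained sketch of the named version, using the line from the slide, shows that the name and the number reach the same text. As before, the added "Literal text" comments are only there to protect the significant trailing spaces.

```python
import re

pattern = r"""
    ^
    RUN\                                 # Literal text
    (?P<jobnum>\d{6})                    # Job number
    \ COMPLETED\.\ OUTPUT\ IN\ FILE\     # Literal text
    (?P<filename>[a-z]+\.dat)            # File name
    \.
    $
"""
regexp = re.compile(pattern, re.VERBOSE)

result = regexp.search("RUN 000001 COMPLETED. OUTPUT IN FILE hydrogen.dat.")

by_name = result.group("jobnum")
by_number = result.group(1)          # named groups keep their numbers too
file_name = result.group("filename")
```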

Putting it all together – 1

    …
    pattern=r"""
    ^
    RUN\ 
    (?P<jobnum>\d{6})                   # Job number
    \ COMPLETED\.\ OUTPUT\ IN\ FILE\ 
    (?P<filename>[a-z]+\.dat)           # File name
    \.
    $
    """
    …

So if we edit our filter02.py script we can allocate group name "jobnum" to the
series of six digits and "filename" to the file name (complete with suffix ".dat"). This
is all done in the pattern string.

Putting it all together – 2

    …
    regexp = re.compile(pattern, re.VERBOSE)
    for line in sys.stdin:
        result = regexp.search(line)
        if result:
            sys.stdout.write("%s\t%s\n" % (result.group('jobnum'), result.group('filename')))

At the bottom of the script we then modify the output line to use the names of the
groups in the write statement.

Parentheses in regular expressions

Numbered group    (…)
Named group       (?P<name>…)

So here's a new use of parentheses (without backslashes).

Exercise 9: use named groups

filter04.py
numbered groups → named groups

Now try it for yourselves. Adapt the filter04.py script to use named groups (with
meaningful group names). Make sure you test it to check it still works!
If you have any problems with this exercise, please ask the lecturer.

Ambiguous groups within the regular expression

Dictionary:  /var/lib/dict/words
Reg. exp.:   ^([a-z]+)([a-z]+)$
Script:      filter05.py
What part of the word goes in group 1,
and what part goes in group 2?

Groups are all well and good, but are they necessarily well-defined? What happens if
a line can fit into groups in two different ways?
For example, consider the list of words in /var/lib/dict/words. The lower case
words in this file all match the regular expression "[a-z]+[a-z]+" because each is a
series of lower case letters followed by a series of lower case letters. But if we assign
groups to these parts, "([a-z]+)([a-z]+)", which part of the word goes into the first
group and which into the second?
You can find out by running the script filter05.py.

"Greedy" expressions

^ ([a-z]+) ([a-z]+) $

$ python filter05.py
…
aa          h
aali        i
aardvar     k
aardvark    s
…

The first group is "greedy" at the expense of the second group.
Aim to avoid ambiguity

Python's implementation of regular expressions makes the first group "greedy"; the
first group swallows as many letters as it can at the expense of the second.
There is no guarantee that other languages' implementations will do the same,
though. You should always aim to avoid this sort of ambiguity.
You can change the greed of various groups with yet more use of the query character
("?"), but please note the ambiguity caution above. If you find yourself wanting to play
with the greediness you're almost certainly doing something wrong at a deeper level.
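
To see the greed in action without the dictionary file, here is a small sketch on a single word. The second pattern uses "+?", the non-greedy form of "+" from Python's regular expression syntax, purely to show what "changing the greed" means; the caution above still applies. The groups() method, which returns all the groups at once, is used for brevity.

```python
import re

greedy = re.compile(r"^([a-z]+)([a-z]+)$")
lazy = re.compile(r"^([a-z]+?)([a-z]+)$")    # "+?" takes as little as it can

word = "aardvark"

# The first group takes everything except the one letter the second needs.
greedy_split = greedy.search(word).groups()

# With a non-greedy first group the division is reversed.
lazy_split = lazy.search(word).groups()
```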

Referring to numbered groups within a regular expression

^([a-z]+)\1$

([a-z]+)    Group 1: matches a sequence of letters
\1          Matches "the same as group 1"

Now that we have groups in our regular expression we can use them for more. So far
the bracketing to create groups has been purely labelling, to identify sections we can
extract later. Now we will use them in the expression itself.
We can use a backslash in front of a number (for integers from 1 to 99) to mean "that
number group in the current expression". So "^([a-z]+)\1$" matches any string
which is the same sequence of lower case letters repeated twice.
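
A quick sketch of the back-reference at work, on a hand-picked list of words rather than the full dictionary:

```python
import re

doubled = re.compile(r"^([a-z]+)\1$")

# A few hand-picked words; only the first two are a sequence repeated twice.
words = ["bonbon", "beriberi", "banana", "aardvark"]
matches = [word for word in words if doubled.search(word)]
```

Note the raw string: without the leading "r", Python itself would interpret "\1" before the re module ever saw it.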

Referring to named groups within a regular expression

^
(?P<half>[a-z]+)    Creates a group, "half"
(?P=half)           Refers to the group; does not create a group
$

If we have given names to our groups, then we use the special Python syntax
"(?P=groupname)" to mean "the group groupname in the current expression". So
"^(?P<word>[a-z]+)(?P=word)$" matches any string which is the same
sequence of lower case letters repeated twice.
Note that in this case the (?…) expression does not create a group; instead, it refers
to one that already exists. Observe that there is no pattern language in that second
pair of parentheses.
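
The same test written with a named group might look like this sketch. It also checks the claim that the reference creates no group of its own, using the compiled pattern's groups attribute, which holds the number of groups in the pattern.

```python
import re

pattern = r"""
    ^
    (?P<half>[a-z]+)    # Creates the group "half"
    (?P=half)           # Refers to it; creates no new group
    $
"""
regexp = re.compile(pattern, re.VERBOSE)

result = regexp.search("atlatl")
half = result.group("half")

group_count = regexp.groups          # how many groups the pattern defines
no_match = regexp.search("atlas")    # not a repeated sequence
```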

Example

$ python filter06.py
atlatl
baba
beriberi
bonbon
booboo
bulbul
…

The file filter06.py does precisely this using a named group.
I have no idea what half of these words mean.

Parentheses in regular expressions

Create a numbered group    (…)
Create a named group       (?P<name>…)
Refer to a named group     (?P=name)

This completes all the uses of parentheses in Python regular expressions that we are
going to meet in this course. There are more.

Exercise 10

filter06.py → filter07.py
Find all the words with the pattern ABABA
e.g. entente:  A = e,  B = nt

Copy the script filter06.py to filter07.py and edit the latter to find all the
words with the form ABABA. (Call your groups "a" and "b" if you are stuck for
meaningful names.)
Note that for the example word on the slide, the A pattern just happens to be one
letter long (the lower case letter "e"), whilst the B pattern is two letters long (the lower
case letter sequence "nt").
Hint: On PWF Linux the /var/lib/dict/words dictionary contains 5 such words.
No, I have no idea what most of them mean, either.

Multiple matches

Data: boil.txt
Basic entry:
Ar 87.3
Different number of entries on each line:
Ar 87.3
Re 5900.0 Ra 2010.0
K 1032.0 Rn 211.3 Rh 3968.0
Want to unpick this mess

Now we will move on to a more powerful use of groups. Consider the file boil.txt.
This contains the boiling points (in Kelvin at standard pressure) of the various
elements but it has lines with different numbers of entries on them. Some lines have a
single element/temperature pair, others have two, three, or four. We will presume that
we don't know what the maximum per line is.

What pattern do we need?

K 1032.0 Rn 211.3 Rh 3968.0

[A-Z][a-z]?    "element"
\s+            white space
\d+\.\d+       "boil"

We need it multiple times
…but we don't know how many

The basic structure of each line is straightforward, so long as we can have an arbitrary
number of instances of a group.

Elementary pattern

K 1032.0

(?P<element>[A-Z][a-z]?)
\s+
(?P<boil>\d+\.\d+)

Matches a single pair

We start by building the basic pattern that will be repeated multiple times. The basic
pattern contains two groups which isolate the components we want from each repeat:
the name of the element and the temperature.
Note that because the pattern can occur anywhere in the line we don't use the "^/$"
pair.

Putting it all together – 1

    …
    pattern=r"""
    (?P<element>[A-Z][a-z]?)
    \s+
    (?P<boil>\d+\.\d+)
    """
    regexp = re.compile(pattern, re.VERBOSE)
    …

We put all this together in a file called filter08.py.
Our pattern matches one of the element name/boiling point pairs and names the two
groups appropriately. Because we know we don't have a line per pair we aren't using
the ^…$ anchors.

Putting it all together – 2

    …
    for line in sys.stdin:
        result = regexp.search(line)
        if result:
            sys.stdout.write("%s\t%s\n" % (result.group('element'), result.group('boil')))

At the bottom of the script we print out whatever the two groups have matched.

Try that pattern

$ python filter08.py < boil.txt
Ar 87.3
Re 5900.0
K 1032.0
…
Ag 2435.0
Au 3129.0

First matching case of each line
But only the first

We will start by dropping this pattern into our standard script, mostly to see what
happens. The script does generate some output, but the pattern only matches the
first pair on each line. It finishes with a line as soon as it has matched once.

Multiple matches

regexp.search(line)      returns a single match
regexp.finditer(line)    returns a list of matches
It would be better called searchiter() but never mind

The problem lies in our use of regexp's search() method. It returns a single
MatchObject, corresponding to that first instance of the pattern in the line.
The regular expression object has another method called "finditer()" which
returns a list of matches, one for each that it finds in the line. (It would be better called
"searchiter()" but never mind.)
(Actually, it doesn't return a list, but rather one of those Python objects that can be
treated like a list. They're called "iterators", which is where the name of the method
comes from.)

The original script

    …
    for line in sys.stdin:
        result = regexp.search(line)
        if result:
            sys.stdout.write("%s\t%s\n" % (result.group('element'), result.group('boil')))

So, we return to our script and observe that it currently uses search() to return a
single MatchObject and tests on that object.

The changed script

    …
    for line in sys.stdin:
        results = regexp.finditer(line)
        for result in results:
            sys.stdout.write("%s\t%s\n" % (result.group('element'), result.group('boil')))

The pattern remains exactly the same.
We change the line that called search() and stored a single MatchObject for a line
that calls finditer() and stores a list of MatchObjects.
Instead of the if statement we have a for statement to loop through all of the
MatchObjects in the list. (If none are found it's an empty list.)
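
The difference between the two methods can be seen on a single line without needing boil.txt. This sketch uses the three-entry line shown on the earlier slide and collects the pairs into a list instead of writing them out.

```python
import re

pattern = r"""
    (?P<element>[A-Z][a-z]?)
    \s+
    (?P<boil>\d+\.\d+)
"""
regexp = re.compile(pattern, re.VERBOSE)

line = "K 1032.0 Rn 211.3 Rh 3968.0"

# search() stops at the first pair it finds.
first = regexp.search(line)
first_pair = (first.group("element"), first.group("boil"))

# finditer() yields one match object per pair on the line.
all_pairs = [(result.group("element"), result.group("boil"))
             for result in regexp.finditer(line)]
```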

Using finditer()

$ python filter09.py < boil.txt
Ar 87.3
Re 5900.0
Ra 2010.0
…
Au 3129.0
At 610.0
In 2345.0

Every matching case in each line

And it works! This time we get all the element/temperature pairs.

Exercise 11

filter10.py
Edit the script so that text is split into one word
per line with no punctuation or spaces output.
$ python filter10.py < paragraph.txt
This
is
free
…

One last exercise in class. The file filter10.py that you have is a skeleton script
that needs lines completed. Edit the file so that it can be used to split incoming text
into individual words, printing one on each line. Punctuation should not be printed.
You may find it useful to recall the definition of "\w".

We have covered only simple regular expressions!
Capable of much more!
We have focused on getting Python to use them.
UCS course:
Pattern Matching Using Regular Expressions
focuses on the expressions themselves and not on
the language using them.

And that's it!
However, let me remind you that this course has concentrated on getting regular
expressions to work in Python and has only introduced regular expression syntax
where necessary to illustrate features in Python's re module. Regular expressions
are capable of much, much more and the UCS offers a two-afternoon course, "Pattern
Matching Using Regular Expressions", that covers them in full detail. For further
details of this course see the course description at:
http://training.csx.cam.ac.uk/course/regex

Final exercise

Data: atoms2.log
(extended log file carrying timing information)
Find the total CPU time taken for all:
1. successful runs
2. unsuccessful runs

We will end with a final exercise which you can either do in the class room or take
away to try at home. The file atoms2.log is similar to atoms.log except that the
lines have been extended to include CPU timing information. Write a script which
determines the total time taken for both the successful and unsuccessful runs and
which prints out both figures.
If you want to do this exercise outside of class, the atoms2.log file is available
on-line at this URL:
http://www-uxsup.csx.cam.ac.uk/courses/PythonRE/atoms2.log
So that you have some idea of whether or not your script is correct, here are the
total times for successful and unsuccessful runs:
Total CPU seconds taken for successful runs: 275.35
Total CPU seconds taken for unsuccessful runs: 806.19
