
Regular Expressions: Where They're Used

Before we discuss what regular expressions look like, let me give you a brief rundown of what Perl lets you do with them:

Search for matches in strings and lists of strings.
Search and replace substrings.
Split strings into lists.

Matching strings is not new - we learned in Lesson 3 how index() can find a substring within a larger string. But regular expressions take this power much further: they can match strings of variable length, they support wild cards (matches against many possible characters), they let you customize the match. As we'll see in this lesson and beyond, regular expressions let you easily solve problems that would otherwise be much more difficult. Here are the functions and operators that use regular expressions and that we'll explore throughout this lesson:
Name: Match
Usage: $text =~ m/EXPR/
Description: Look for a regular expression match in the text.
    $text: The string to search.
    EXPR: The regular expression to match.
    This operator returns true if there is a match, false otherwise.

Name: Substitution
Usage: $text =~ s/EXPR/REPL/
Description: Look for a regular expression match in the text. If found, replace it with new text.
    $text: The string to search and replace.
    EXPR: The regular expression to match.
    REPL: Replacement text.
    This operator returns true if a match and substitution took place, false otherwise.

Name: Split
Usage: split(/EXPR/, $text)
Description: Split a text string into a list of substrings. The split occurs on substrings that match the regular expression.
    EXPR: The regular expression to match. As a special case, using the string " " instead of a regular expression splits on white space.
    $text: The text string to split.
    Returns a list of substrings extracted from the original string. If there is a match at the beginning of $text, the first element of the result list is an empty string (unless you are using the special case " " expression).

Name: List Filter
Usage: grep(/EXPR/, @list)
Description: Examine a list of strings; build a new list containing only those elements that match the regular expression.
    EXPR: The regular expression to match.
    @list: The list to examine.
    In list context, returns a list of elements from the original list that match the expression. In scalar context, returns the count of matching elements.

The Whirlwind Tour

I'll start with a quick peek at what these various calls look like. A word of warning - it looks pretty strange at first:

#!/usr/bin/perl
# A quick tour of regular expression functions and operators

$mystring = "How now brown cow";

# Does our string contain "BROWN"?
if ($mystring =~ m/BROWN/)
{
    print "String contains \"BROWN\"\n";
}

# Does our string contain "BROWN" if we ignore case?
if ($mystring =~ m/BROWN/i)
{
    print "String contains \"BROWN\" (ignoring case)\n";
}

# Split the string wherever we find an "ow"
@pieces = split(/ow/, $mystring);
# Result will be the list ("H", " n", " br", "n c")

# Find all elements in @pieces containing the letter "n":
@pieces_containing_n = grep(/n/, @pieces);
# Result will be the list (" n", "n c")

# Back to our original string. Substitute "ex" for every "ow"
$mystring =~ s/ow/ex/g;
# $mystring now contains "Hex nex brexn cex"

# Make a list of all two-letter substrings in $mystring ending in "e"
@strings_ending_in_e = $mystring =~ m/.e/g;
# Result will be the list ("He", "ne", "re", "ce")

Not a very interesting script - this was a contrived example just to show you what these features look like. Now let's apply this stuff to a real problem.

A Serious Example: Counting Words

To show regular expressions in actual use, we'll revisit an old friend - the word-counting project we've been playing with since Lesson 4 - and its use of the split() function. Here's what it last looked like (not counting your work in the Lesson 7 assignment):

#!/usr/bin/perl

# Make sure our hash is clean
undef %wordcounts;

# Read the file until we're done
while (defined($line = <>))
{
    # Add each word to our hash
    foreach $word (split(" ", $line))
    {
        # Increment the count for this word. If we haven't yet seen
        # the word, this will replace an undefined value with a 1
        $wordcounts{$word}++;
    }
}

# Print out the results
foreach $word (sort(keys(%wordcounts)))
{
    print "$word: $wordcounts{$word}\n";
}

As you know from our earlier discussion of split(), this script splits words wherever it finds white space - the spaces between words. So what happens if you run this program on itself? That is, what happens if you save it in wordcount.pl and then ask it to count the words in wordcount.pl? Here's the beginning of its output:

"$word:: 1
",: 1
#: 6
#!/usr/bin/perl: 1
$line)): 1
$word: 2
$wordcounts{$word}++;: 1
$wordcounts{$word}\n";: 1
%wordcounts;: 1
(defined($line: 1
(sort(keys(%wordcounts))): 1

It's doing exactly as we asked - showing us everything it finds between the spaces. But it's not doing what we want. We really need to find the words; things like ",", "$line))", and "$wordcounts{$word}\n";" don't look much like words. So I'll make a change to the loop that splits the lines and processes the words:

foreach $word (split(/\W+/, $line))
{
    # Ignore potential empty string at the beginning
    if (length($word) == 0) { next; }

    # Increment the count for this word. If we haven't yet seen
    # the word, this will replace an undefined value with a 1
    $wordcounts{$word}++;
}

Here's the result (the first few lines of output) from running this new version of the script - again, using the script itself as the input to analyze:

0: 1
1: 1
Add: 1
If: 1
Ignore: 1
Increment: 1
Make: 1
Print: 1
Read: 1
W: 1
a: 1
an: 1
at: 1
beginning: 1
bin: 1
clean: 1
count: 1
defined: 1

All the funny characters are gone - the script is now cleanly extracting the words from the text. I made two changes to the script: added a regular expression to split on groups of non-word characters, and added a test to ignore any empty strings returned by split(). When the split() call handles a line like this:

print "$word: $wordcounts{$word}\n";

it extracts five words (print, word, wordcounts, word, and n) and discards everything else. This magic comes from the tiny regular expression /\W+/ that I passed to the split() call.

By the way, we've just passed an important milestone! The word-counting script finally does everything right - we're done improving it. You won't be seeing it again... unless it shows up as a Perl-powered Web page in a few lessons.

What Regular Expressions Look Like

OK, so let's talk about what we've seen. Regular expressions are small, strange, and cryptic - that much is already clear - and it takes some time to master them. Please be patient with them. We won't cover everything that's possible with regular expressions, but I want to give you a good start.

A regular expression is surrounded by slash delimiters (delimiters are characters that mark the start and end of items of data). Regular expressions contain three types of things between the delimiters:

Normal characters. For example, the expression /abc/ matches the text string "abc".

Metacharacters: Special characters with magic meaning for more complex matches. For example, /^abc/ matches the text string "abc" only if it occurs at the beginning of a line; /abc$/ only matches an "abc" at the end of the line; /ab+c/ matches "abc", "abbc", "abbbc", "abbbbc", and so on.

Backslash Sequences: The backslash character can change the meaning of the character that follows it. It turns a metacharacter into a normal character - /abc\$/ matches the text string "abc$" anywhere on the line. And it turns certain normal characters into metacharacters - I'll show you those in the table that immediately follows.

Here are some of the metacharacters and metacharacter sequences often used in regular expressions. Don't worry about fully understanding this table just yet - I'll show you in Chapter 3 how these are used:
Character/Sequence    Meaning
.                     Match any character except newline.
^                     Match the beginning of the line.
$                     Match the end of the line.
\b                    Match at a word boundary.
\B                    Match anywhere except at a word boundary.
|                     Match one of several alternatives.
()                    Group characters together.
[]                    Character class.
\                     Quote: interpret the next metacharacter or delimiter as a normal character.
*                     Match the last character, group, or class 0 or more times.
+                     Match the last character, group, or class 1 or more times.
?                     Match the last character, group, or class 0 or 1 time.
{n}                   Match the last character, group, or class n times.
{n,}                  Match the last character, group, or class n or more times.
{n,m}                 Match the last character, group, or class at least n times but no more than m times.
\t                    Match a single tab character.
\n                    Match a single newline character.
\r                    Match a single carriage-return character.
\w                    Match a single word character (letter, digit, or _).
\W                    Match a single non-word character.
\s                    Match a single whitespace character.
\S                    Match a single non-whitespace character.
\d                    Match a single digit character.
\D                    Match a single non-digit character.
\1, \2, ...           Repeat the first, second, ... ()-group match.

Using these metacharacters, you can put together all sorts of interesting regular expressions. Here's a preview of the kind of expressions we'll assemble, starting in Chapter 3:

Look for lines that begin with white space: /^\s/
Look for all words 10 letters or longer: /\w{10,}/
Look for characters that repeat at least twice in a row (that is, three or more of the same character together): /(.)\1{2,}/
Look for words that repeat: /\b(\w+)\s+\1\b/

A regular expression can also contain a scalar variable - the contents of that variable will be interpolated into the expression. In other words, these two cases are the same:

# Use the expression directly
$text =~ m/abc/;

# Use the expression from a variable
$pattern = "abc";
$text =~ m/$pattern/;

To reiterate: these expressions take some getting used to, and building regular expressions is sometimes like solving a puzzle. The next two chapters will help explain how to harness their power. And you'll see some real projects later in the course that would not even be possible without regular expressions.
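As a rough sketch of the preview above, the four expressions can be tried against a handful of lines with grep(). The sample lines are invented for this illustration, and the first pattern is deliberately supplied through a variable to show interpolation at work.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample lines are made up for this illustration.
my @lines = (
    " indented line",
    "an extraordinarily long word",
    "gooood spellling",
    "this is is a typo",
);

# The expression can come from a variable; it is interpolated into m//.
my $pattern  = '^\s';
my @indented = grep(/$pattern/, @lines);

# The other three preview expressions, written directly.
my @long_words = grep(/\w{10,}/, @lines);
my @tripled    = grep(/(.)\1{2,}/, @lines);
my @repeated   = grep(/\b(\w+)\s+\1\b/, @lines);

print "Begins with white space: '@indented'\n";
print "Has a 10+ letter word:   '@long_words'\n";
print "Has a tripled character: '@tripled'\n";
print "Has a repeated word:     '@repeated'\n";
```

Each grep() here selects exactly one of the sample lines, so the four printed lists make it easy to see which expression reacts to which kind of text.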

A Regular Expression Learning Tool

To help you understand how regular expressions work, I've written a learning tool that lets you play with them. Please copy and save it in the projects directory as regex.pl:

#!/usr/bin/perl

# Demonstrate regular expression action. Given a regular expression as a
# command-line argument, this script processes stdin and generates stdout
# illustrating where matches occur.
#
# Usage: $0 regular-expression

if (@ARGV != 1)
{
    print STDERR "Usage: $0 regular-expression\n";
    exit(1);
}

$regex = $ARGV[0];
print "Regular expression: /$regex/\n";

while (defined($line = <STDIN>))
{
    chomp($line);

    # Remove any embedded tabs
    while ($line =~ s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e) {}

    # Print the line we're analyzing
    print("$line\n");

    # Start the analysis
    $pos = 0;

    # Loop while the expression matches
    while ($line =~ m/$regex/go)
    {
        # Compute location and size of the expression
        $offset = length($`);
        $newpos = pos($line);
        $length = $newpos - $offset;

        # Print X's at the right locations
        if ($length > 0)
        {
            print(" " x ($offset - $pos) . "X" . "x" x ($length - 1));
            $pos = $newpos;
        }
    }
    print ("\n");
}

To use the tool, you supply a regular expression - surrounded with double-quotes and without the slash delimiters - on the command line. It then reads text from STDIN. To test one of the examples from the end of Chapter 2, you can run it like this:

perl regex.pl "(.)\1{2,}"

and type in lines of text to test. It prints out each line it reads, followed by a map of where it finds matching regular expressions. If you type in this line:

gooood spellling is verrry importttant

it will generate this output:
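A minimal sketch of that map, assuming a hypothetical helper named match_map that applies the same marking idea as regex.pl to a single string instead of reading STDIN. It uses Perl's @- and @+ arrays (match start and end offsets) rather than $`, which is a substitution on my part and not how the tool itself is written.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# match_map is a hypothetical helper, not part of the original regex.pl.
# It returns the row of X's and x's that marks each match in $line.
sub match_map {
    my ($regex, $line) = @_;
    my $map = "";
    my $pos = 0;
    while ($line =~ m/$regex/g) {
        my $offset = $-[0];              # where this match starts
        my $length = $+[0] - $offset;    # how many characters it covers
        next if $length == 0;            # nothing to mark for an empty match
        $map .= " " x ($offset - $pos) . "X" . "x" x ($length - 1);
        $pos = $offset + $length;
    }
    return $map;
}

my $line = "gooood spellling is verrry importttant";
my $map  = match_map('(.)\1{2,}', $line);
print "$line\n$map\n";

# Prints:
# gooood spellling is verrry importttant
#  Xxxx     Xxx         Xxx       Xxx
```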

Each "Xxxx" shows where the expression matches part of the line above it - the uppercase X marks the beginning of the match. You can type in more lines, then enter ctrl-Z when you're done (or ctrl-D if you're running on Unix or Linux). You can also have the script read its input from a file, using DOS or Unix input redirection:

perl regex.pl "(.)\1{2,}" < filename


As we study how to build regular expressions in the remainder of the lesson, I encourage you to use this tool to help you understand how the expressions work - and I'll be using it to generate some examples in the next few sections. By the way, don't look too closely at this script! It uses some advanced techniques I won't be covering. But you can peek at the FAQ for this lesson if you'd like to learn more about them.

Assembling Regular Expressions

I'm going to focus now on what goes into a regular expression: the characters and metacharacters we mentioned at the end of Chapter 2. This information applies to all the functions and operators - matching, substitution, splitting, and filtering. Later, I'll go into more detail on using the various functions and operators; for now we'll restrict the discussion to matching. Everything we talk about here can be tried with the regex.pl tool.

Simple Text Match

The simplest regular expression is one containing no metacharacters. For example, searching for /ow/ in the string "How now brown cow" matches on the instances of "ow" in each of the words. To see this with regex.pl, you can run:

perl regex.pl "ow"

and type in that text. You'll see this result:
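The same search can also be sketched directly in Perl, recording where each match begins. This is only an illustration of the idea, not the tool's output format; $-[0] is Perl's built-in offset of the start of the most recent match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "How now brown cow";

# Collect the starting offset of every "ow" in the string.
my @offsets;
while ($text =~ m/ow/g) {
    push(@offsets, $-[0]);
}

print "Found ", scalar(@offsets), " matches at offsets: @offsets\n";
# Prints: Found 4 matches at offsets: 1 5 10 15
```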

If you need to do a simple text match on a metacharacter - that is, you want to match a ".", a "+", or whatever as a normal character - you can quote it with the backslash character. You can also use the backslash to quote the backslash character itself, and to quote the delimiter character. For example, this regular expression:

/ab\.cd\+ef\\gh\/ij/

will match the string "ab.cd+ef\gh/ij". Why do we need to quote the delimiter character? Because if we did not, it would look like the end of the expression!

Anchor Metacharacters

Four anchor metacharacters, "^", "$", "\b" and "\B", let us restrict where a match can occur. These metacharacters don't match any characters in the text (Perl calls them zero-width assertions); they just control where matches can happen. The first two anchors are simple - limiting the match to the beginning or end of a line (/^abc/ and /abc$/, for example). The other two anchors control whether a match can take place at a word boundary. For example, /\babc/ will only match an "abc" that occurs at the beginning of a word, /abc\b/ will only match an "abc" at the end of a word, and /\Babc/ will match an "abc" anywhere except the beginning of a word.

Character Classes

The next step up in capability is character classes - a way to describe entire groups of characters. For example: "." matches any character, "\w" matches a word character (letters, numbers, and underscore), "\W" matches anything except a word character, and so on. You can also define your own character classes by enclosing them in square brackets. For example: "[aeiou]" matches any lowercase vowel. Here's an example of using this custom character class in regex.pl:
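As a stand-in sketch, the matches can be listed directly in Perl. The sample sentence below is an assumption on my part: it is the familiar pangram, chosen because it happens to contain eleven lowercase vowels.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed sample line; any text would do.
my $text = "The quick brown fox jumps over the lazy dog";

# In list context with the g modifier, m// returns every match.
my @vowels = $text =~ m/[aeiou]/g;

print scalar(@vowels), " matches: @vowels\n";
# Prints: 11 matches: e u i o o u o e e a o
```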

We have eleven matches on this particular line.

You can easily define entire ranges of characters in a class by using a hyphen. In these three classes:

/[a-m]/
/[0-5]/
/[0-9a-fA-F]/

the first expression matches any lowercase letter from "a" to "m", the second matches any digit from "0" to "5", and the third matches any hexadecimal digit: a digit from "0" to "9", or a letter from "a" to "f" or "A" to "F". Finally, you can reverse the meaning of a custom character class by using the caret character ("^") at the beginning. This matches on all characters except uppercase letters:

/[^A-Z]/
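The quoting, anchor, and character-class rules above can be checked with a few quick matches. This is a sketch with made-up sample strings; it stores each pattern with Perl's qr// operator, which this lesson has not introduced, simply so the patterns can sit in a list.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each entry: description, pattern, sample text (all samples are made up).
my @checks = (
    [ 'quoted metacharacters', qr/ab\.cd\+ef\\gh\/ij/, 'ab.cd+ef\gh/ij' ],
    [ 'start-of-line anchor',  qr/^abc/,               'abcdef'         ],
    [ 'end-of-line anchor',    qr/abc$/,               'xyzabc'         ],
    [ 'start of a word',       qr/\babc/,              'an abcdef'      ],
    [ 'not at a word start',   qr/\Babc/,              'xabc'           ],
    [ 'range a-m',             qr/[a-m]/,              'k'              ],
    [ 'hexadecimal digits',    qr/^[0-9a-fA-F]+$/,     'c0ffEE'         ],
    [ 'negated class',         qr/[^A-Z]/,             'ABCd'           ],
);

my $failures = 0;
foreach my $check (@checks) {
    my ($what, $regex, $sample) = @$check;
    my $result = ($sample =~ $regex) ? "matches" : "does not match";
    $failures++ if $result ne "matches";
    print "$what: \"$sample\" $result\n";
}

# Counter-examples: anchors and negated classes also rule matches out.
my $mid_word  = ("xabc" =~ /\babc/)  ? 1 : 0;   # 0: "abc" is not at a word start
my $all_upper = ("ABC"  =~ /[^A-Z]/) ? 1 : 0;   # 0: only uppercase letters present
print "mid-word \\babc: $mid_word, all-uppercase [^A-Z]: $all_upper\n";
```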
Match Quantifiers

So far, the characters and character classes we've talked about match exactly one character: /A/ matches a single occurrence of the letter "A", /\s/ matches a single white-space character, and so on. In the most recent example, /[aeiou]/ matched 11 individual instances of a vowel. Quantifiers give you more flexibility - you can match any number of times, including zero. I'll illustrate a few of them (see the table in Chapter 2 for the list of all of them):

"*" matches zero or more occurrences. For example, /ab*c/ will match "ac", "abc", "abbc", "abbbc", "abbbbc", and so on.

"+" matches one or more occurrences. For example, /ab+c/ will match everything /ab*c/ matches except "ac".

"?" matches zero or one occurrence. For example, /ab?c/ will match "ac" or "abc". It's like saying, "this part of the match is optional".

"{n,m}" matches between n and m occurrences, where n and m are numbers. For example, /ab{3,5}c/ will match "abbbc", "abbbbc", or "abbbbbc".

So what would happen with our recent example if we changed /[aeiou]/ to /[aeiou]+/? It would match on one or more occurrences of a vowel:
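A sketch of both points, again assuming the pangram as the sample line (an assumption, not necessarily the line used in the original figure) and a made-up list of strings for the four quantifiers.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed sample line.
my $text = "The quick brown fox jumps over the lazy dog";

# With "+", neighbouring vowels are collected into a single match.
my @runs = $text =~ m/[aeiou]+/g;
print scalar(@runs), " matches: @runs\n";
# Prints: 10 matches: e ui o o u o e e a o

# Try each quantifier from the list above against the same sample strings.
my @samples  = ("ac", "abc", "abbc", "abbbc", "abbbbbbc");
my @patterns = ('ab*c', 'ab+c', 'ab?c', 'ab{3,5}c');
my %matched;
foreach my $pattern (@patterns) {
    $matched{$pattern} = join(" ", grep(/$pattern/, @samples));
    print "/$pattern/ matches: $matched{$pattern}\n";
}
# Prints:
# /ab*c/ matches: ac abc abbc abbbc abbbbbbc
# /ab+c/ matches: abc abbc abbbc abbbbbbc
# /ab?c/ matches: ac abc
# /ab{3,5}c/ matches: abbbc
```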

Notice that the "ui" in "quick" is now a single match instead of two separate matches. Depending on what you're trying to do - matching, substitution, or whatever - this can be an important difference.

Match Grouping and Back-References

OK, our expressions are going to start getting more difficult - so hang on and enjoy the ride! You know how to match repeated characters (or repeated character classes) - now how do you match repeated sequences of characters? For that, we use parentheses to group matches together. For example: /(abc)+/ will match on "abc", "abcabc", "abcabcabc", and so on. Now we can apply the same quantifiers to groups of characters that we apply to single characters.

Grouping does something else for us - it allows us to refer back to an earlier match. Every group in a regular expression is numbered - the first group is #1, the second group is #2, and so on. You can refer back to group matches with a backslash followed by a number: "\1", "\2", "\3", and so on. This is called a back-reference. For example: the expression /\b(\w+)\s+\1\b/ places the \w+ match inside parentheses, then refers back to it later. The meaning of this regular expression is "match on a sequence of word characters that begins on a word boundary, followed by a sequence of white space, followed by a repeat of the earlier match that ends on a word boundary".

What would we do with such a match? How about looking for words that repeat (which might, after all, be typos in the text)? Let's run this example:

perl regex.pl "\b(\w+)\s+\1\b"

on some input with some repeated words:
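A small sketch of the same search in plain Perl, using an invented input line. After each successful match, $1 holds the text captured by the first ()-group, which here is the repeated word.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up input containing two doubled words.
my $text = "Paris in the the spring is is lovely";

# Collect each word that is immediately repeated.
my @repeats;
while ($text =~ m/\b(\w+)\s+\1\b/g) {
    push(@repeats, $1);
}

print "Repeated words: @repeats\n";
# Prints: Repeated words: the is
```

The leading \b matters: without it, the "is" at the end of "Paris" followed by a word starting with "is" could be mistaken for a repeat.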

How's that for clever matching? When we discuss substitution in Chapter 4, you'll see a further use for the back-reference.

Alternation

I'll mention one final feature: you can use the vertical-stroke metacharacter ("|") to choose between match alternatives. Alternatives should be grouped within parentheses. For example, /(fee|fie|foe|fum)/ will match on either "fee", "fie", "foe", or "fum".

Moving On

Whew! As you can see, regular expressions are a compact way to build some powerful logic for matching text. In this chapter, I've focused on the features I use all the time. There are more advanced features you can learn about in the perlre document, but please don't rush into that - I use them so rarely I have to look them up whenever I need them. In Chapter 4, we'll move on to discussing the functions that use regular expressions.
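As a quick recap of grouping and alternation from this chapter, here is a short sketch with made-up strings. It assumes nothing beyond the operators already shown.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Alternation: pick out any of four words from a made-up line.
my $chant = "fee fie foe fum and a bottle of rum";
my @giant_words = $chant =~ m/(fee|fie|foe|fum)/g;
print "Alternatives found: @giant_words\n";
# Prints: Alternatives found: fee fie foe fum

# Grouping with a quantifier: the whole group repeats, not just one letter.
my $repeated = "";
if ("xxabcabcabcxx" =~ m/(abc)+/) {
    $repeated = $&;    # the full text that matched
}
print "Repeated group matched: $repeated\n";
# Prints: Repeated group matched: abcabcabc
```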

Functions and Operators for Regular Expressions

Now that you've had your immersion into the strange but powerful language of regular expressions, let's see what we can do with them. This chapter will talk about each of the four regular expression functions and operators I mentioned in Lesson 4.

Regular Expression Matching

As I described and illustrated in chapters 2 and 3, you can check for the presence of a regular expression match like this:

$string =~ m/regular expression/;

This operator returns a logical value: true if there is a match with the $string, otherwise false.

Variations on Matching Notation

Here are a few variations on how you can write regular expressions:

You can use delimiters other than a slash. For example, you might choose to use a hyphen delimiter:

$string =~ m-regular expression-;

This is convenient if you happen to have a lot of slashes in the expression itself - it lets you avoid endless quoting. So you can match five slashes with m-/////- instead of m/\/\/\/\/\//.

If you use the standard slash delimiter, the "m" at the beginning is optional. So our earlier expression might also be written

$string =~ /regular expression/;

However, you must still use the "m" if you use any other delimiter.

If you want to match against the $_ convenience variable, you can leave off the $string =~ part of the expression.

Perl programmers often use the second and third variations in simple scripts. Here's a quick script that scans a file for any lines containing my name:

#!/usr/bin/perl

while (<>)
{
    if (/Nathan/)
    {
        print($_);
    }
}
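The three notations can be compared side by side. This sketch loops over a few invented strings with $_ as the loop variable, so the second and third tests can leave off the explicit target.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample strings.
my @lines = ("/usr/bin/perl", "/home/nathan/projects", "no slashes here");

my @report;
foreach (@lines) {
    my $explicit = ($_ =~ m/\/bin\//) ? 1 : 0;   # explicit target, quoted slashes
    my $hyphen   = (m-/bin/-)         ? 1 : 0;   # implied $_, hyphen delimiter
    my $name     = (/nathan/)         ? 1 : 0;   # implied $_, "m" omitted
    push(@report, "$_ => explicit=$explicit hyphen=$hyphen name=$name");
}

print "$_\n" foreach @report;
# Prints:
# /usr/bin/perl => explicit=1 hyphen=1 name=0
# /home/nathan/projects => explicit=0 hyphen=0 name=1
# no slashes here => explicit=0 hyphen=0 name=0
```

The first two tests always agree, since they are the same expression written with different delimiters.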

Modifiers for Matching

Perl lets you modify the match in several ways - it's done by adding a one-letter command after the final delimiter. The one modification I use frequently is to ignore case: to let uppercase and lowercase letters match each other. You request this with the letter "i":

$text =~ m/abc/
$text =~ m/abc/i

In this example, the first version will match only on "abc", while the second version will also match on "ABC", "ABc", "aBc", and so on.

There are other modifiers that allow you to perform multiple matches or to examine multiple lines when matching. You can learn more about them from the perlre documentation. I almost never use them, so I did not mention them here. Strangely enough, I used some of them for the first time ever when I wrote the regex.pl tool. You can read about those in the FAQ for this lesson.

Predefined Variables

Finally, let's talk about some predefined variables that are affected by regular-expression matching. These are set after a match is performed:
Variable            Meaning
$&                  Match: this variable contains the text that was matched in the most recent regular expression match.
$`                  Prematch: this variable contains the text leading up to the most recent regular expression match.
$'                  Postmatch: this variable contains the text after the most recent regular expression match.
$1, $2, $3, ...     Numbered buffer match: $1 contains the contents of the first ()-group, $2 contains the contents of the second ()-group, and so on.

For example, if you perform this match:

$text = "The quick brown fox jumps over the lazy dog";
$text =~ m/f(.)x/;

then, after the match, $& will contain "fox", $` will contain "The quick brown ", $' will contain " jumps over the lazy dog", and $1 will contain the "o" that was matched by the (.).

Regular Expression Substitution

Substitution is how we search and replace with regular expressions. The substitution operator looks similar to matching - with an additional delimiter:

$string =~ s/regular expression/substitute/;

For example, this will replace "dog" with "cat":

$string =~ s/dog/cat/;
Variations on Substitution Notation

Just as with expression matching, you can use a different delimiter with substitution. For example:

$string =~ s-regular expression-substitute-;


Also, as with expression matching, the substitution operator works on the $_ convenience variable if you leave off the $string =~ part.

Modifiers for Substitution

As with matching, you can use special one-letter codes to modify how substitution works. The modifier to ignore case looks similar to what we saw before:

$string =~ s/dog/cat/i;
will perform the substitution for "dog", "DOG", "DoG", and so on. Another important modifier is the global modifier. Normally, a substitution is applied just once: for example, only the first "dog" found in $string is replaced with "cat". Adding the "g" modifier causes all of the "dog"s to be replaced:

$string =~ s/dog/cat/g;
And you can combine modifiers. This example replaces every "dog", "DOG", "DoG" (and so on) with "cat":

$string =~ s/dog/cat/gi;
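The effect of each modifier is easiest to see on one string. In this sketch the sample sentence is invented; it also prints the value returned by s///, which in Perl is the number of substitutions made (and therefore true whenever at least one took place).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample text with "dog" in several capitalizations.
my $original = "Dog eat dog: DOG days for every dog";

# Work on three separate copies so each substitution starts fresh.
my ($first, $all, $any_case) = ($original) x 3;

my $n_first = ($first    =~ s/dog/cat/);     # first lowercase "dog" only
my $n_all   = ($all      =~ s/dog/cat/g);    # every lowercase "dog"
my $n_any   = ($any_case =~ s/dog/cat/gi);   # every "dog", ignoring case

print "$n_first: $first\n";
print "$n_all: $all\n";
print "$n_any: $any_case\n";
# Prints:
# 1: Dog eat cat: DOG days for every dog
# 2: Dog eat cat: DOG days for every cat
# 4: cat eat cat: cat days for every cat
```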
Metacharacters for Substitution

There are two important metacharacter sequences you can use in the substitute part of the command - the part between the second and third delimiter:

Metacharacters        Meaning
$&, $`, and $'        You can use the predefined expression-matching variables in the substitute part of the command. They have the same meaning here that we just discussed for matching.
\1, \2, etc.          You can use back-references, which refer back to the ()-groupings that were used in the regular expression between the first and second delimiter.

I'll illustrate with some examples. First, here's a substitution that matches every word on the line and puts quotes around it:

s/\w+/"$&"/g
If you'd like to see this in action, here's a script you can try:

#!/usr/bin/perl

while (defined($line = <>))
{
    $line =~ s/\w+/"$&"/g;
    print $line;
}


And here's a substitution that swaps the first two words on every line. It defines four ()-groupings - leading white space (if any), the first word, the white space between words, and the second word - and then rearranges them:

s/^(\s*)(\S+)(\s+)(\S+)/\1\4\3\2/
Here's a script that lets you try this substitution:

#!/usr/bin/perl

while (defined($line = <>))
{
    $line =~ s/^(\s*)(\S+)(\s+)(\S+)/\1\4\3\2/;
    print $line;
}
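Both substitutions can also be tried on a fixed string, without reading a file. This sketch writes the swap with $1 through $4 rather than \1 through \4: Perl accepts either form in the replacement part, but its documentation recommends the $-variables there, and \1 draws a warning when warnings are switched on. The sample line is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample line with leading white space.
my $line = "  hello world, how are you";

# Copy the line, then swap the first two words in the copy.
(my $swapped = $line) =~ s/^(\s*)(\S+)(\s+)(\S+)/$1$4$3$2/;

# Copy the line, then put quotes around every word in the copy.
(my $quoted = $line) =~ s/\w+/"$&"/g;

print "$swapped\n";
print "$quoted\n";
# Prints:
#   world, hello how are you
#   "hello" "world", "how" "are" "you"
```

Notice that the swap treats "world," (comma included) as the second word, because \S+ matches any run of non-whitespace characters.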


Splitting Lines With split()

You already know this function from Lesson 4. We originally used a special-case first argument of a single space character; now you know how to use split() with regular expressions. There are just a few things to be aware of:

If the regular expression matches at the beginning of the string, the first element of the result is an empty string. For example,

split(/\W+/, " Hello world")

returns a list with three elements: ("", "Hello", "world"). This does not happen when you use the special-case " " argument instead of a regular expression.

Using multiplier metacharacters can have important effects. Consider these two cases (notice we have two spaces in the string being matched):

split(/\W+/, "Hello  world")
split(/\W/, "Hello  world")

The expression in the first split() matches on the double space between the words, and returns the list ("Hello", "world"). The second split() sees two single-character matches between the words, so it splits the string into three parts: ("Hello", "", "world").

Just as with matching and substitution, you can use modifiers in the regular expression. For example:

split(/dog/i, $text)
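The split() cases above can be run together for comparison. In this sketch, show is a hypothetical helper that formats a list with quotes so that empty strings are visible, and the last sample string is invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# show is a hypothetical helper: it renders a list as ("a", "b", ...).
sub show {
    return "(" . join(", ", map { "\"$_\"" } @_) . ")";
}

my @results = (
    show(split(/\W+/,  " Hello world")),        # leading match gives an empty first element
    show(split(" ",    " Hello world")),        # special case: no empty first element
    show(split(/\W+/,  "Hello  world")),        # two spaces treated as one separator
    show(split(/\W/,   "Hello  world")),        # two spaces treated as two separators
    show(split(/dog/i, "catDOGbirddogfish")),   # separator matched ignoring case
);

print "$_\n" foreach @results;
# Prints:
# ("", "Hello", "world")
# ("Hello", "world")
# ("Hello", "world")
# ("Hello", "", "world")
# ("cat", "bird", "fish")
```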
Filtering a List With grep()

There's not much for me to add about grep(): it takes a list and returns a smaller list containing only those elements that match the expression. As with the other functions and operators, you can use modifiers with the regular expression.

Practical Uses for Regular Expressions

It's easy to get bogged down in details when we talk about regular expressions. So I'd like to finish up with one more project that will take some of what we've learned and do something useful with it. This script summarizes the contents of an email box: it looks at the file containing your emails and prints out information on the sender, receiver, date, subject, and length of each email.

Unfortunately, not everyone can use this script. It will work on mailbox files in the standard email file format used by the Netscape mailer and by most Unix and Linux mail programs. But it cannot read the proprietary format used by Microsoft Outlook and Outlook Express.

Whether or not you can use this script on your mailbox, you can still see how it uses regular expressions to scan the mailbox for information headers that are found at the beginning of every piece of email. It assembles what it knows from those headers, then prints out the results for each email. Here is the script:

#!/usr/bin/perl

undef(%headers);
undef($last_header);
undef($past_headers);

while (defined($line = <>))
{
    chomp($line);

    # Is this the start of a new piece of mail?
    if ($line =~ /^From /)
    {
        # Yes. Report on the previous piece of mail
        &dump_mail_headers();
        undef($past_headers);
        next;
    }

    # We're inside a piece of mail. Have we gone past the headers?
    if ($past_headers) { next; }

    # No. Are we at the end of the headers?
    if (length($line) == 0)
    {
        $past_headers = 1;
        next;
    }

    # No. Check for headers we're interested in.
    if (&checkheader("From")) { next; };
    if (&checkheader("To")) { next; };
    if (&checkheader("Date")) { next; };
    if (&checkheader("Subject")) { next; };
    if (&checkheader("Lines")) { next; };

    # Is this a continuation of a header line we care about?
    if (defined($last_header) && $line =~ s/^\s+//)
    {
        # Yes, append it to the header we're currently building
        $headers{$last_header} .= " " . $line;
    }
    else
    {
        # No, this header's done
        undef($last_header);
    }
}

# We're done - report on the last piece of mail
&dump_mail_headers();
exit(0);

sub dump_mail_headers
{
    if (exists($headers{"From"}))
    {
        &printheader("From");
        &printheader("To");
        &printheader("Date");
        &printheader("Subject");
        &printheader("Lines");
        print("\n");
        undef(%headers);
        undef($last_header);
    }
}

sub checkheader
{
    $header = $_[0];
    if ($line =~ s/^$header:\s+//)
    {
        $headers{$header} = $line;
        $last_header = $header;
        # Return true
        return 1;
    }
    # Return false
    return undef;
}

sub printheader
{
    $header = $_[0];
    if (exists($headers{$header}))
    {
        print "$header: $headers{$header}\n";
    }
}
If you want to use this script, save it as mailsummary.pl in the projects directory and give it the name of the file containing your email:

perl mailsummary.pl mymail
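If you have no mailbox file to hand, the header-scanning idea can be sketched on a few in-memory lines. This is a simplified variant rather than the script above: it handles only two headers, uses alternation and ()-groups in place of the checkheader routine, and runs on made-up message text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A made-up message in the mailbox layout the script expects.
my @mail = (
    'From alice@example.com Mon Jan  1 10:00:00 2001',
    'From: Alice <alice@example.com>',
    'Subject: A very long subject line',
    '    that continues on the next line',
    '',
    'Body text. Subject: this is not a header',
);

my %headers;
my $last_header;
foreach my $line (@mail) {
    next if $line =~ /^From /;       # separator line between pieces of mail
    last if length($line) == 0;      # a blank line ends the headers

    if ($line =~ /^(From|Subject):\s+(.*)/) {
        # A header we care about: $1 is its name, $2 is its value
        $headers{$1} = $2;
        $last_header = $1;
    }
    elsif (defined($last_header) && $line =~ /^\s+(.*)/) {
        # An indented line continues the header we are building
        $headers{$last_header} .= " $1";
    }
    else {
        undef($last_header);
    }
}

foreach my $name ("From", "Subject") {
    print "$name: $headers{$name}\n";
}
# Prints:
# From: Alice <alice@example.com>
# Subject: A very long subject line that continues on the next line
```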
