
Overview

Introduction
Regular expressions are tiny programs in their own special language, built inside Perl. They allow fast, flexible, and reliable string handling. A regular expression, often called a pattern in Perl, is a template that either matches or doesn't match a given string. That is, there are an infinite number of possible text strings; a given pattern divides that infinite set into two groups: the ones that match, and the ones that don't. Don't confuse regular expressions with shell filename-matching patterns, called globs, which are a different sort of pattern with their own rules.

Simple Pattern
To match a pattern (regular expression) against the contents of $_, simply put the pattern between a pair of forward slashes (/).
$_ = "yabba dabba doo";
if (/abba/) {
    print "It matched!\n";
}

The expression /abba/ looks for that four-letter string in $_; if it finds it, it returns a true value.

Unicode Properties
Unicode characters know something about themselves; they aren't just sequences of bits. Instead of matching on a particular character, you can match a type of character. To match a particular property, you put its name in \p{PROPERTY}.
if (/\p{Space}/) {    # 26 different possible characters
    print "The string has some whitespace.\n";
}
if (/\p{Digit}/) {    # 411 different possible characters
    print "The string has a digit.\n";
}

More properties are listed in the perluniprops documentation.

Meta-characters
The dot (.) is a wildcard character; it matches any single character except a newline.
/bet.y/ matches betty, betsy, bet=y, and bet.y; it doesn't match bety or betsey.

The dot always matches exactly one character. If you want the pattern to match just a period, backslash the dot.
/3\.141/ matches 3.141596456; it doesn't match 3a141545.
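
The claims above about the dot can be checked by running both patterns over the candidate strings. This is only a minimal sketch: the helper name matches_of is made up for illustration, and qr// is used here simply to pass a pattern to a subroutine.

```perl
use strict;
use warnings;

# Hypothetical helper: return the candidates that match the given pattern.
sub matches_of {
    my ($pattern, @candidates) = @_;
    return grep { /$pattern/ } @candidates;   # grep puts each candidate in $_
}

my @dot     = matches_of(qr/bet.y/,  qw(betty betsy bet=y bet.y bety betsey));
my @literal = matches_of(qr/3\.141/, qw(3.141596456 3a141545));

print "dot:     @dot\n";       # betty betsy bet=y bet.y
print "literal: @literal\n";   # 3.141596456
```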

If you mean a real backslash, use a pair of them.
$_ = 'a real \\ backslash';
if (/\\/) {
    print "It matched!\n";
}

Simple Quantifiers
* -- zero or more occurrences
/fred\t*barney/ matches fredbarney, fred\tbarney, fred\t\tbarney
/fred.*barney/ matches fredbarney, fredabcdbarney

+ -- one or more occurrences
/fred\t+barney/ matches fred\tbarney, fred\t\tbarney; doesn't match fredbarney

? -- zero or one occurrence
/bam-?bam/ matches bambam, bam-bam; doesn't match bam-----bam
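
As a rough check of the three quantifiers, the sketch below counts how many of the sample strings each pattern accepts. The counter names are illustrative only.

```perl
use strict;
use warnings;

my ($star, $plus, $optional) = (0, 0, 0);

# Zero, one, and two tabs between the names.
for ("fredbarney", "fred\tbarney", "fred\t\tbarney") {
    $star++ if /fred\t*barney/;   # * accepts zero or more tabs
    $plus++ if /fred\t+barney/;   # + needs at least one tab
}

for ("bambam", "bam-bam", "bam-----bam") {
    $optional++ if /bam-?bam/;    # ? accepts at most one hyphen
}

# prints: * matched 3 of 3, + matched 2 of 3, ? matched 2 of 3
print "* matched $star of 3, + matched $plus of 3, ? matched $optional of 3\n";
```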

Grouping in Patterns
Use parentheses, ( ), to group parts of a pattern. That makes parentheses meta-characters as well.
/fred+/ matches fredddd, fredd (the + applies only to the d)
/(fred)+/ matches fred, fredfred, fredfredfred
/(fred)*/ matches any string at all, even hello or barney, because zero occurrences of fred is enough

Using parentheses also makes Perl store the matched text in the special variables $1, $2, and so on. The number denotes the capture group.
$_ = "perl version is 5.14";
if (/perl version is (.*)/) {
    print $1;    # prints 5.14
}

Use back references to refer to text that you matched in the parentheses, called a capture group. You denote a back reference as a backslash followed by a number, like \1, \2, and so on.
$_ = "abba";
if (/(.)\1/) {    # matches 'bb'
    print "It matched same character next to itself!\n";
}

$_ = "yabba dabba doo";
if (/y(....) d\1/) {
    print "It matched the same after y and d!\n";
}

$_ = "yabba dabba doo";
if (/y(.)(.)\2\1/) {    # matches 'abba'
    print "It matched after the y!\n";
}

How do I know which group gets which number? Just count the opening parentheses from left to right, ignoring nesting.
$_ = "yabba dabba doo";
if (/y((.)(.)\3\2) d\1/) {
    print "It matched!\n";
}
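
To see the numbering rule in action, the following sketch prints what each capture group holds after the nested match. The array name @groups is only there for illustration.

```perl
use strict;
use warnings;

$_ = "yabba dabba doo";
my @groups;

if (/y((.)(.)\3\2) d\1/) {
    @groups = ($1, $2, $3);   # numbered by the position of each opening parenthesis
    print "group 1: $1\n";    # abba (the outer parentheses open first)
    print "group 2: $2\n";    # a
    print "group 3: $3\n";    # b
}
```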

Consider the problem where you want to use a back reference next to a part of the pattern that is a number. In this regular expression, you want to use \1 to repeat the character you matched in the parentheses and follow that with the literal string 11.
$_ = "aa11bb";
if (/(.)\111/) {    # ambiguous: Perl does not read this as \1 followed by 11
    print "It matched!\n";
}

Is that \1, \11, or \111?

Starting with Perl 5.10, writing \g{1} disambiguates the back reference from the literal part of the pattern:
use 5.010;
$_ = "aa11bb";
if (/(.)\g{1}11/) {
    print "It matched!\n";
}

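With the \g{N} notation, you can also use negative numbers. A negative number counts backward from the back reference's own position, so \g{-1} means the capture group immediately before it.
use 5.010;
$_ = "xaa11bb";
if (/(.)(.)\g{-1}11/) {
    print "It matched!\n";
}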
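
One reason to prefer a relative back reference such as \g{-1} is that it keeps pointing at the nearest preceding group when another group is added in front. The sketch below illustrates this under made-up inputs; the test string and variable names are assumptions chosen for the example.

```perl
use 5.010;
use strict;
use warnings;

$_ = "xyaa11";

# Relative reference: the group immediately before \g{-1}.
my $before   = /(.)(.)\g{-1}11/    ? 1 : 0;

# A new group (x) is added in front; \g{-1} still means the nearest group.
my $relative = /(x)(.)(.)\g{-1}11/ ? 1 : 0;

# An absolute \g{2} now refers to the group holding "y", so the match fails.
my $absolute = /(x)(.)(.)\g{2}11/  ? 1 : 0;

say "before: $before, relative: $relative, absolute: $absolute";
# prints: before: 1, relative: 1, absolute: 0
```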

Alternatives
The vertical bar (|), often called "or" in this usage, means that if the part of the pattern on the left of the bar fails, the part on the right gets a chance to match.
/fred|barney|betty/ matches fred, barney, or betty.
/fred( |\t)+barney/ matches if fred and barney are separated by spaces, tabs, or a mixture of the two.
/fred( +|\t+)barney/ matches if fred and barney are separated either only by spaces or only by tabs, not by a mixture of the two.
/fred (and|or) barney/ matches fred and barney, or fred or barney. It is the same as the pattern /fred and barney|fred or barney/.
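
The difference between ( |\t)+ and ( +|\t+) is easy to miss, so here is a small sketch that counts which separators each pattern accepts. The sample strings are invented for the illustration.

```perl
use strict;
use warnings;

my @separators = (
    "fred barney",       # one space
    "fred\tbarney",      # one tab
    "fred \t barney",    # a mixture of spaces and a tab
    "fredbarney",        # no separator at all
);

# grep in scalar context returns the number of matching elements.
my $mixed = grep { /fred( |\t)+barney/ }   @separators;   # any mixture is fine
my $pure  = grep { /fred( +|\t+)barney/ }  @separators;   # all spaces or all tabs

print "mixed allowed: $mixed, pure only: $pure\n";
# prints: mixed allowed: 3, pure only: 2
```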

Character Classes
A character class is a list of possible characters inside square brackets. It matches just one single character, but that one character may be any of the ones you list in the brackets.
[abcwxyz] matches a, b, c, w, x, y, or z (any of those seven characters)

You may specify a range of characters with a hyphen (-).
[a-cw-z] means any letter from a to c or from w to z
[a-zA-Z0-9] means any ASCII alphanumeric character

$_ = "The HAL-9000 requires authorization to continue.";
if (/HAL-[0-9]+/) {
    print "The string mentions some model of HAL computer.\n";
}

Character Class Shortcuts
Some character classes appear so frequently that they have shortcuts. The shortcut for the character class of any digit is \d.
use 5.010;    # needed for say
$_ = 'The HAL-9000 requires authorization to continue.';
if (/HAL-[\d]+/) {
    say 'The string mentions some model of HAL computer.';
}

However, there are many more digits than the 0 to 9 that you may expect from ASCII, so that pattern will also match HAL- followed by digits from other scripts. Recognizing this problematic shift from ASCII to Unicode, Perl 5.14 added the /a modifier; placing it at the end of the match operator tells Perl to use the old ASCII interpretation.
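
The effect of the ASCII interpretation can be sketched as follows, assuming Perl 5.14 or later. The model string is made up for the example; it uses Arabic-Indic digits, written as \x{...} escapes so the source file stays plain ASCII.

```perl
use 5.014;
use strict;
use warnings;

# "HAL-" followed by ARABIC-INDIC DIGIT NINE and three ARABIC-INDIC DIGIT ZEROs.
$_ = "HAL-\x{0669}\x{0660}\x{0660}\x{0660}";

my $unicode = /HAL-\d+/  ? 1 : 0;   # \d accepts any Unicode decimal digit
my $ascii   = /HAL-\d+/a ? 1 : 0;   # under /a, \d accepts only 0 to 9

say "Unicode interpretation: $unicode, ASCII interpretation: $ascii";
# prints: Unicode interpretation: 1, ASCII interpretation: 0
```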

\s matches any whitespace, which is almost the same as the Unicode property \p{Space}.
\h matches only horizontal whitespace.
\v matches only vertical whitespace. Taken together, \h and \v are the same as \p{Space}.
\R, introduced in Perl 5.10, matches any sort of line break, independent of operating system.
\w matches a "word" character: in ASCII terms, the set [a-zA-Z0-9_].

Negating the Shortcuts
To specify the characters you want to leave out, rather than the ones to include, use a caret (^). A caret at the start of a character class (that is, just inside the opening square bracket) negates the class.
[^def] matches any single character except one of those three.
[^n\-z] matches any character except n, hyphen, or z.

To negate a shortcut, use its uppercase form.
\S matches any non-space.
\D matches any non-digit.
[\d\D] matches any digit or any non-digit, that is, any character at all.
[^\d\D] matches anything that's not either a digit or a non-digit, that is, nothing!
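
A simple way to explore negated classes is to test them one character at a time. The sketch below splits a made-up sample string into characters and keeps the ones each class accepts.

```perl
use strict;
use warnings;

my @chars = split //, "on-z9";   # o, n, -, z, 9

my @not_nz     = grep { /[^n\-z]/ }  @chars;   # everything except n, hyphen, z
my @non_digits = grep { /\D/ }       @chars;   # everything except digits
my @never      = grep { /[^\d\D]/ }  @chars;   # nothing can match this class

print "not n, -, z: @not_nz\n";                # o 9
print "non-digits:  @non_digits\n";            # o n - z
print "never:       ", scalar(@never), "\n";   # 0
```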

Matches with m//
We put patterns in pairs of forward slashes, like /fred/. But this is actually a shortcut for m//, the pattern match operator. You may choose any pair of delimiters to quote the contents.
m(fred), m<fred>, m{fred}, m[fred], m,fred,, m!fred!, m^fred^

The shortcut is that if you choose the forward slash as the delimiter, you may omit the initial m. It's wise to choose a delimiter that doesn't appear in your pattern.
Use m%http://% instead of /http:\/\// to match the initial "http://".
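
The three spellings below are intended to be equivalent; only the delimiters differ. This is a minimal sketch, and the URL is a placeholder chosen for the example.

```perl
use strict;
use warnings;

$_ = "http://www.example.com/index.html";   # placeholder URL

my $slashes = /http:\/\// ? 1 : 0;   # default delimiter, slashes must be escaped
my $percent = m%http://%  ? 1 : 0;   # percent signs as delimiters
my $braces  = m{http://}  ? 1 : 0;   # paired delimiters

print "slashes: $slashes, percent: $percent, braces: $braces\n";
# prints: slashes: 1, percent: 1, braces: 1
```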

Match Modifiers
Case-Insensitive Matching with /i
$_ = "Is Freddy there?";
if (/freddy/i) {
    print "Yes Freddy is here";
}

Matching Any Character with /s
The /s modifier makes the dot (.) match any character, including a newline. It does this by making every dot in the pattern act like the character class [\d\D], which matches anything. The effect shows only when the string contains newline characters.
$_ = "I saw Barney\ndown at the bowling alley\nwith Fred\nlast night.\n";
if (/Barney.*Fred/s) {
    print "That string mentions Fred after Barney!\n";
}

Without the /s modifier, that match would fail, since the two names aren't on the same line. What if you still want to match any character except a newline? You could use the character class [^\n], or the shortcut \N, added in Perl 5.12, which means the complement of \n.
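
To compare the dot with and without /s, and to see that \N refuses a newline either way, here is a minimal sketch assuming Perl 5.12 or later. The variable names are illustrative.

```perl
use 5.012;
use strict;
use warnings;

$_ = "I saw Barney\ndown at the bowling alley\nwith Fred\nlast night.\n";

my $plain      = /Barney.*Fred/    ? 1 : 0;   # dot stops at the first newline
my $with_s     = /Barney.*Fred/s   ? 1 : 0;   # dot may cross newlines
my $no_newline = /Barney\N*Fred/s  ? 1 : 0;   # \N never matches a newline

say "plain: $plain, with /s: $with_s, \\N under /s: $no_newline";
# prints: plain: 0, with /s: 1, \N under /s: 0
```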

Many other modifiers are described in the perlop documentation. A few are covered below.

Adding Whitespace with /x
The /x modifier allows you to add arbitrary whitespace to a pattern in order to make it easier to read.
/-?[0-9]+\.?[0-9]*/          # what is this doing?
/ -? [0-9]+ \.? [0-9]* /x    # a little better

Because /x allows whitespace inside the pattern, Perl ignores literal space and tab characters within it. When you want to match whitespace, use a backslashed space, \t, or, more commonly, \s (or \s* or \s+).

Perl considers comments a type of whitespace, so you can put comments into the pattern to say what you are trying to do:
/
    -?        # an optional minus sign
    [0-9]+    # one or more digits before the decimal point
    \.?       # an optional decimal point
    [0-9]*    # some optional digits after the decimal point
/x            # end of pattern

Because the pound sign indicates the start of a comment, use the escaped character, \#, or the character class, [#], if you need to match a literal pound sign:
/
    [0-9]+    # one or more digits before the decimal point
    [#]       # literal pound sign
/x            # end of pattern

Be careful not to include the closing delimiter inside the comments, or it will prematurely terminate the pattern. This pattern ends before you think it does:
/
    -?        # with / without - <--- OOPS!
    [0-9]+    # one or more digits before the decimal point
    \.?       # an optional decimal point
    [0-9]*    # some optional digits after the decimal point
/x            # end of pattern
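
One possible way around the problem is to pick a delimiter that the comments don't contain, as with any other pattern. The sketch below uses m{ }x so that a slash inside a comment is harmless; the input list is invented for the example, and the comments must then avoid unbalanced braces instead.

```perl
use strict;
use warnings;

my @inputs = ("-12.5", "42", "x7.", "none");

my @numeric = grep {
    m{
        -?        # with / without a minus sign
        [0-9]+    # one or more digits before the decimal point
        \.?       # an optional decimal point
        [0-9]*    # some optional digits after the decimal point
    }x
} @inputs;

# The pattern is not anchored, so "x7." also counts: it contains "7.".
print "strings containing a number: @numeric\n";   # -12.5 42 x7.
```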

Combining Option Modifiers
If you want to use more than one modifier on the same match, just put them all at the end (their order isn't significant).
if (/barney.*fred/is) {    # both /i and /s
    print "That string mentions Fred after Barney!\n";
}

Or as a more expanded version with comments:
if (m{
    barney    # the little guy
    .*        # anything in between
    fred      # the loud guy
}isx) {       # all three of /s and /i and /x
    print "That string mentions Fred after Barney!\n";
}

Misc
The trick with a good pattern is to not match more than you ever mean to match.