Regular Expressions
Many of today's web applications need to match patterns in a text document to look
for specific information. A good example is parsing an HTML file to extract the <img> tags of a
web document. If the image locations are available, then we can write a script to
automatically download these images to a location we specify. Looking for tags like
<img> is a form of searching for a pattern. Pattern searches are widely used in many
applications, such as search engines. A regular expression (regex) is a pattern
that defines a class of strings. Given a string, we can then test whether the string belongs to this
class. Regular expressions are used by many Unix utilities, such as grep,
sed, awk, vi, and emacs. We will learn the syntax for describing regexes later.
Pattern search is a useful activity and can be used in many applications. We are already
doing some level of pattern search when we use wildcards such as *. For example,
> ls *.c
lists all the files with the .c extension, and
ls ab*
lists all file names in the current directory that start with ab. These types of commands
(ls, dir, etc.) are available on Windows, Unix, and most other operating systems. That is, the command ls
will look for files whose names follow a certain pattern, but it is limited in the ways we can describe
patterns. The wildcard (*) is typically used with many commands in Unix. For example,
cp *.c /afs/andrew.cmu.edu/course/15/123/handin/Lab6/guna
copies all .c files in the current directory to the given directory.
Unix commands like ls and cp can use simple wildcard (*) syntax to describe specific
patterns and perform the corresponding tasks. However, there are many powerful Unix
utilities that can look for patterns described in a general-purpose notation. One such utility
is the grep command.
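The wildcard behavior described above can be sketched in Python with the standard fnmatch module, which implements shell-style wildcards. This is only an illustration: the list of file names is hypothetical, and fnmatchcase is used so that the comparison is case-sensitive on every platform.

```python
from fnmatch import fnmatchcase

# Hypothetical directory listing, used only for illustration.
names = ["main.c", "abacus.txt", "util.c", "about.html", "notes.md"]

# Rough equivalent of `ls *.c`: names that end in .c
c_files = [n for n in names if fnmatchcase(n, "*.c")]

# Rough equivalent of `ls ab*`: names that start with ab
ab_files = [n for n in names if fnmatchcase(n, "ab*")]

print(c_files)
print(ab_files)
```

Note that the wildcard * here means "any run of characters", which is different from the meaning of * in a regular expression.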
grep
NAME
grep, egrep, fgrep - print lines matching a pattern
SYNOPSIS
grep [options] PATTERN [FILE...]
grep [options] [-e PATTERN | -f FILE] [FILE...]
DESCRIPTION
grep searches the named input FILEs (or standard input if no files are
named, or the file name - is given) for lines containing a match to
the given PATTERN. By default, grep prints the matching lines.
Source: unix manual
grep (Global Regular Expression Print) is a Unix command-line utility that can be used to
find specific patterns described by regular expressions, a notation which we will learn
shortly. For example, the grep command can be used to find all lines containing a
specific pattern:
grep "<a href" guna.html > output.txt
writes all lines containing the substring <a href to the file output.txt. The pattern is quoted so that the shell does not treat < as input redirection.
The grep command can be an extremely handy tool for searching for patterns. If we run
grep foo filename
it prints all lines of the file given by filename that contain the string foo.
Unix provides the | (pipe) operator to send the output of one process as input to
another process. Say, for example, we would like to find all file names that contain the pattern
guna. We can do the following to accomplish that task.
> ls | grep guna
In the above command, we are sending the output of ls
as input to the grep command.
If we need to find the pattern a.c in a file name (that is, any file name that has a substring matching
a.c, where the dot (.) stands for any single character), we can give the
command
ls | grep a.c
So we can find file names that contain the substrings aac, abc, a_c, etc.
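As a rough sketch of what ls | grep a.c does, we can filter a list of names with Python's re module. The file names below are made up for illustration, and re.search is used because grep looks for the pattern anywhere in the line.

```python
import re

# Hypothetical file names standing in for the output of ls.
names = ["aac.txt", "abc", "data_a_c.log", "ac", "a.c", "xyz"]

# The dot matches any single character, so a.c matches aac, abc, a_c, a.c, ...
pattern = re.compile(r"a.c")

# Keep the names that contain a match anywhere, as grep does with lines.
matches = [n for n in names if pattern.search(n)]
print(matches)
```

The name ac is not selected, because the dot must match exactly one character between a and c.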
Copyright @ 2009 Ananda Gunawardena
Regular expressions
Regular expressions, which define patterns in strings, are used by many programs such
as grep, sed, awk, vi, and emacs. Perl (which we will discuss soon) is a
scripting language in which regular expressions are used extensively for pattern
matching.
The origin of the regular expressions can be traced back to formal language theory or
automata theory, both of which are part of theoretical computer science.
A formal language consists of an alphabet, say {a,b,c}, and a set of strings over that alphabet that belong to the
language. For example, a language defined on the alphabet {a,b,c} could be all strings
that have at least one a. So ababb and abcbbc are valid strings, while ccb is not.
An automaton (plural: automata) is a machine that can recognize the valid strings
of a formal language. A finite automaton is a mathematical model of a finite
state machine (FSM), an abstract model that underlies modern computers. An
FSM is a machine that consists of a finite set of states and a transition table. The FSM
can be in any one of the states and can move from one state to another based on
rules given by a transition function.
Example: think about an FSM that has the alphabet {a,b} and three states, Q0, Q1, and Q2.
Define Q0 as the initial state, Q1 as an intermediate state, and Q2 as the final (accepting) state.
Complete the transitions so that the machine accepts any string that begins with zero or
more a's, immediately followed by one or more b's, and then ends with an a. So the
strings accepted by this FSM would include aba, aaba, ba, aaaaaabbbbbbbbba,
etc.
[Figure: state diagram with states Q0, Q1, and Q2; transitions left to be completed]
Finite automata can be built to recognize the valid strings of a formal language. For
example, we can use a machine as defined above to find all substrings that begin with
zero or more a's, immediately followed by one or more b's, and then end with an a.
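One possible way to complete the transitions of the three-state machine described above is sketched below in Python. The transition table and the function name accepts are our own assumptions, not part of the original notes. Any (state, symbol) pair that is missing from the table is treated as a rejection.

```python
# Transition table: (current state, input symbol) -> next state.
# Pairs that are not listed cause the machine to reject the input.
TRANSITIONS = {
    ("Q0", "a"): "Q0",  # zero or more leading a's
    ("Q0", "b"): "Q1",  # the first b
    ("Q1", "b"): "Q1",  # further b's
    ("Q1", "a"): "Q2",  # the final a
}
ACCEPTING = {"Q2"}


def accepts(text):
    """Return True if the FSM ends in an accepting state after reading text."""
    state = "Q0"
    for symbol in text:
        state = TRANSITIONS.get((state, symbol))
        if state is None:
            return False  # no transition defined: reject
    return state in ACCEPTING


for s in ["aba", "aaba", "ba", "aaaaaabbbbbbbbba", "ab", "abab"]:
    print(s, accepts(s))
```

Since Q2 has no outgoing transitions, any input that continues after the final a is rejected.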
One important limitation of a finite state machine is that it cannot count without bound. That is,
an FSM has no memory other than its current state. For example, we can build a machine to accept all strings that
have an even number of a's, but we cannot build an FSM that counts the a's in the string.
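To see why an even number of a's needs no counting, here is a small sketch of a two-state machine in Python. The state names and the function has_even_as are hypothetical, and the input is assumed to use only the alphabet {a,b}. The machine remembers only whether the number of a's seen so far is even or odd, never the count itself.

```python
EVEN, ODD = "even", "odd"


def has_even_as(text):
    """Two-state FSM over {a,b}: accept when the number of a's is even."""
    state = EVEN  # zero a's seen so far, and zero is even
    for symbol in text:
        if symbol == "a":
            # Each a flips the state between EVEN and ODD.
            state = ODD if state == EVEN else EVEN
        # A b leaves the state unchanged.
    return state == EVEN


for s in ["", "abba", "ab", "bab", "aabaa"]:
    print(repr(s), has_even_as(s))
```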
Our discussion of FSMs now leads to regular expressions. Regular expressions and
FSMs are equivalent concepts: a regular expression is a pattern that can be recognized by
an FSM.
Other Facts
.           matches any single character
.*          matches any string
[a-zA-Z]*   matches any string of alphabetic characters
[ag].*      matches any string that starts with a or g
[a-d].*     matches any string that starts with a, b, c, or d
^ab         matches any string that begins with ab. In general, to match all lines that
            begin with a given string, use ^string
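The patterns listed above can be tried out with Python's re module, as in the sketch below. The test strings are our own. We use re.fullmatch so that each pattern has to describe the whole string, which is the sense in which the list says "matches any string"; grep, by contrast, reports a line if the pattern matches anywhere in it.

```python
import re

# (pattern, test string, expected result) triples; the strings are made up.
checks = [
    (r".", "x", True),
    (r".", "xy", False),           # a single dot matches exactly one character
    (r".*", "anything at all", True),
    (r"[a-zA-Z]*", "Hello", True),
    (r"[a-zA-Z]*", "Hello1", False),
    (r"[ag].*", "grep", True),
    (r"[ag].*", "sed", False),
    (r"[a-d].*", "cat", True),
    (r"[a-d].*", "emacs", False),
]

results = [bool(re.fullmatch(p, s)) for p, s, _ in checks]

# The anchor ^ ties the match to the beginning of the string.
starts_with_ab = bool(re.search(r"^ab", "abacus"))
contains_ab_later = bool(re.search(r"^ab", "cab"))

print(results, starts_with_ab, contains_ab_later)
```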
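The grep command used with regular expression notation makes a powerful scripting
tool. When using grep, be sure to escape special characters with \ and to quote the pattern so that the shell passes the backslashes through to grep.
Eg: grep 'b\(a\|c\)b' filename looks for the patterns bab or bcb specifically (\| is supported by GNU grep)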
Example
1. Find all subdirectories within a directory
Answer: > ls -l | grep '^d'
Character Classes
Character classes such as digits (0-9) can be described using a shorthand syntax such as
\d. A Perl (a language we will learn shortly) programmer is free to use either [0-9] or \d to
describe a character class.
\d   digit                        [0-9]
\D   non-digit                    [^0-9]
\w   word character               [0-9a-z_A-Z]
\W   non-word character           [^0-9a-z_A-Z]
\s   a whitespace character       [ \t\n\r\f]
\S   a non-whitespace character   [^ \t\n\r\f]
Note: we will not use these shorthand notations with grep
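Python's re module supports the same shorthand classes, so we can sketch the equivalences from the table there. The sample text is made up. One assumption to note: in Python, \d and \w also match non-ASCII digits and letters by default, so the re.ASCII flag is passed to get the bracket-expression meaning shown in the table.

```python
import re

# Made-up sample text containing digits, word characters and whitespace.
text = "Lab 6, room_15\tdue 2009"

# \d with re.ASCII behaves like [0-9].
digits_short = re.findall(r"\d", text, flags=re.ASCII)
digits_long = re.findall(r"[0-9]", text)

# \w with re.ASCII behaves like [0-9a-z_A-Z].
words_short = re.findall(r"\w+", text, flags=re.ASCII)
words_long = re.findall(r"[0-9a-z_A-Z]+", text)

# \s matches the spaces and the tab in the sample text.
spaces = re.findall(r"\s", text)

print(digits_short)
print(words_short)
print(spaces)
```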
Other, more general regex grammar includes
1. {n,m} match at least n but not more than m times
2. {n,} match at least n times
3. {n} match exactly n times
Examples:
Find all patterns that have at least one but no more than three a's
(ans: grep 'a\{1,3\}' filename)
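The three repetition forms above can be checked with a short Python sketch. The helper name matches_whole is hypothetical. Unlike basic grep syntax, Python writes the braces without backslashes, and re.fullmatch is used so that the whole string must consist of the repeated a's.

```python
import re


def matches_whole(pattern, text):
    """Return True if the pattern describes the entire string."""
    return re.fullmatch(pattern, text) is not None


# {n,m}: at least 1 but not more than 3 a's
print(matches_whole(r"a{1,3}", "aaa"), matches_whole(r"a{1,3}", "aaaa"))

# {n,}: at least 2 a's
print(matches_whole(r"a{2,}", "aaaaa"), matches_whole(r"a{2,}", "a"))

# {n}: exactly 3 a's
print(matches_whole(r"a{3}", "aaa"), matches_whole(r"a{3}", "aa"))
```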
Finding non-matches
To exclude characters we can use [^class]. For example, to match any character that is not
a numeric digit, write [^0-9]. Likewise, we can exclude all lines that begin
with a-z by writing
> grep '^[^a-z]' filename
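The effect of ^[^a-z] can be sketched in Python on a few made-up lines. The first ^ anchors the match at the start of the line, while the ^ inside the brackets negates the class, so a line is kept only if its first character exists and is not a lowercase letter.

```python
import re

# Made-up input lines; the last one is empty.
lines = ["apple", "Banana", "9 lives", "zebra", "_private", ""]

# Keep lines whose first character is anything other than a-z.
kept = [line for line in lines if re.search(r"^[^a-z]", line)]
print(kept)
```

The empty line is not kept, because [^a-z] still has to match one character.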
Group Matching
If we group part of a match by using ( ), then the matching groups are referred to as \1, \2, etc.
For example, the regex
"<h\([1-4]\)>.*h\([1-3]\)>"
saves as \1 the digit that matched the first group \([1-4]\) and
as \2 the digit that matched the second group \([1-3]\).
This can be useful in looking for patterns based on previous patterns found. For example,
the regex
h[1-4] can match h1, h2, h3, or h4.
Suppose later in the expression we need to match the same index. We can do this by
grouping [1-4] as \([1-4]\). Note that we need \( so that ( is not treated as a literal
character. Now the match is saved in variable 1 (which must be referred to as \1), and it can be used later in the
expression to match the same index. An application of this would be to look for
<h1> ... </h1> but not <h1> ... </h2>
Here is how you do that.
> grep 'h\([1-4]\).*h\1' filename
In general, the back-reference \n, where n is a single digit, matches the substring
previously matched by the nth parenthesized subexpression of the regular expression.
This concept is called a back-reference. We will expand on these ideas later in
Perl programming.
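A back-reference can be sketched in Python as follows. The HTML lines are made up, and the pattern is a slightly stricter variant of the grep example because it also spells out the angle brackets and the slash of the closing tag. In Python the grouping parentheses are written without backslashes, while the back-reference is still \1.

```python
import re

# ([1-4]) saves the heading level; \1 requires the same digit in the closing tag.
pattern = re.compile(r"<h([1-4])>.*</h\1>")

# Made-up sample lines.
lines = [
    "<h1>Intro</h1>",
    "<h1>Broken</h2>",
    "<h3>Details</h3>",
    "<h5>Too deep</h5>",
]

matched = [line for line in lines if pattern.search(line)]
level = pattern.search(lines[0]).group(1)

print(matched)
print(level)
```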
Two regular expressions may be joined by the infix operator |; the resulting
regular expression matches any string matching either subexpression.
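As a small sketch of the | operator, the Python pattern below accepts either a or c between two b's. The test words are made up, and re.search is used so that the pattern may match anywhere in the word, as grep does within a line.

```python
import re

# b, then either a or c, then b
pattern = re.compile(r"b(a|c)b")

# Made-up test words.
words = ["bab", "bcb", "bbb", "cabcbd", "bacb"]

hits = [w for w in words if pattern.search(w)]
print(hits)
```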
Exercises
1. Build an FSM that accepts any string that has an even number of a's. Assume the
alphabet is {a,b}.
2. Using the grep command and regular expressions, list all files in the default directory
that others can read or write
3. Write a regex that matches emails of the form userid@domain.edu, where
userid is one or more word characters or + and the domain is one or more word
characters.
4. Construct an FSM and a regular expression that match patterns
containing at least one ab followed by any number of b's.
ANSWERS
1. Build an FSM that accepts any string that has an even number of a's. Assume the
alphabet is {a,b}.
2. Using the grep command and regular expressions, list all files in the default directory
that others can read or write
Ans: ls -l | grep '^.\{7\}rw'
3. Write a regex that matches emails of the form userid@domain.edu, where
userid is one or more alpha characters or + and the domain is one or more alpha
characters.
Ans: grep '[a-z+]\+@[a-z]\+\.edu' filename
4. Construct an FSM and a regular expression that match patterns
containing at least one ab followed by any number of b's.
Ans: grep '\(ab\)\+b*' filename
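The regular expressions for answers 3 and 4 can be checked with a short Python sketch. The sample addresses and strings are made up for illustration. Python writes + and the grouping parentheses without backslashes, and the dot before edu is escaped so that it matches only a literal dot.

```python
import re

# Answer 3: userid of lowercase letters or +, an @, a lowercase domain, then .edu
email = re.compile(r"[a-z+]+@[a-z]+\.edu")

# Answer 4: at least one ab, followed by any number of b's
ab_run = re.compile(r"(ab)+b*")

# fullmatch requires the whole string to fit the pattern.
print(bool(email.fullmatch("user+tag@school.edu")))
print(bool(email.fullmatch("user@school.com")))
print(bool(ab_run.fullmatch("ababbbb")))
print(bool(ab_run.fullmatch("ba")))
```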