
Understanding Regular Expressions

by Peter Robson, JustSQL

This article is about the new Regular Expressions that were introduced by Oracle in 10g. It is based closely on a presentation that I have made to several conferences and meetings. One thing became clear from these events: many people are peripherally aware of these new regular expressions, but have never tried to use them. So this article is for these people, strictly beginners in the field, but people who are very comfortable using SQL itself. It is a first-timer's introduction, and will prepare you to explore the topic on your own.

Overview

Regular expressions have long been used in the Unix world. With Oracle 10g, they have now entered the SQL toolbox. For those unfamiliar with them, regular expressions can be both complex and confusing, but do bring substantial flexibility to the developer. They enable the user to drill down into the detailed constituents of individual database text fields. They can hunt for complex text patterns embedded within fields, constrained by position in the fields (as defined by any arbitrary character position). Therefore, they add considerably to the ability of the user to really explore and manipulate in great detail the contents of their database.

This paper attempts to introduce them to the competent SQL user who has no experience with regular expressions, either in 10g or in the Unix world. Rather than simply list their parameters, this paper will demonstrate, using simple worked examples, how the four types of regular expressions can be used in everyday coding situations. It will show how these expressions are built on the existing and familiar SQL character functions, and will explain in detail each of the various parameters associated with the four regular expressions.

Background

The four regular expressions introduced into Oracle 10g are all based on the four SQL character functions substr, like, instr, and replace. Their names too are recognizable, all being prefaced with the text ‘regexp’, as follows:

REGEXP_LIKE
REGEXP_SUBSTR
REGEXP_INSTR
REGEXP_REPLACE

Each one of these terms will be explained and demonstrated, showing how the original abilities of the SQL character functions have been extended into the regular expression forms. Their function falls into one of two categories: either to identify patterns of data within discrete strings or attributes, or to manipulate target patterns within discrete strings or attributes. By ‘string’ or ‘attribute’, I mean a string literal which may be part of a SQL expression, or alternatively (and more likely) an attribute value held within a database table. It is when the two activities of identification and manipulation are combined that the true strength of regular expressions can be seen. They are indeed very powerful tools with which to both search and modify data, but with this power comes danger, which we will discuss later.

REGEXP_LIKE

As indicated, this is based on the SQL character function ‘like’. Let's look at three example uses of ‘like’:

Select * from tab where field like 'START%';

Select * from tab where field like '_START%';

Select * from tab where field like '%START%';

In each case the pattern to be found is ‘START’, but the constructs enable a certain amount of choice over where that string is to be found. The ‘%’ symbol acts as the familiar wild card. The text pattern may either be found at the beginning of the string (irrespective of what may follow), or commencing at the 2nd character position in the string (and again irrespective of what may follow), or thirdly, anywhere in the string, irrespective of what may precede or follow. Note that in the second example, the underscore symbol (‘_’) stands for exactly one character of any kind.

The regular expression regexp_like can of course achieve the same result, but uses a slightly different construct:

Select * from tab where REGEXP_LIKE(field,'START');

Note that no percentage wild card is required: the match will happen if the string ‘START’ is found anywhere within the target field. This is where additional parameters allow you to control much more precisely what will be matched with a regular expression. A variety of ‘metacharacters’, embedded within the target search string, enables you to control the condition of the match.

Metacharacters

It is worth listing examples of metacharacters before discussing in detail how some of them work:

.        (a full stop) Any character
+        One or more of the previous expression
?        Zero or one of the previous expression
*        Zero or more of the previous expression
{m}      Exact count of the previous expression
{m,}     Count of 'm' or more
{m,n}    Count of at least 'm', maximum 'n'
[…]      List of characters to be matched
[^…]     List of characters NOT to be matched
|        Or, alternatively
(…)      Group or sub-expression of characters
^        Match from the first character in line or field
$        Match the very last character in line / field

Most of the above are self-explanatory, but you should experiment with them in simple examples to see exactly how they vary the results you can obtain.

To return to our simple example, let's use a simple metacharacter (^) to force our pattern search to begin at the first character position in our data source:

Select * from tab where REGEXP_LIKE(field,'^START');

Here we would only match those fields in which the string START was located at the beginning of the field. This we achieve by inserting the metacharacter ^ (caret, or hat) immediately before the required string. Notice that Oracle automatically distinguishes between metacharacters and normal text. If you need to include a character in your search template that is also a metacharacter, you can do this by using the ‘escape’ device of a backward slash, which forces the next character to be interpreted as normal text.

If we use the metacharacter ‘dot’ (or ‘stop’, ‘.’), we can create a slightly anomalous query. Look at this:

Select * from tab where REGEXP_LIKE(field,'^...');

Here we are looking for any three contiguous characters in our data source; they can be anything. We have also prefaced those dots with the hat (^), instructing the search to begin at the beginning of the data field. Syntactically this construct is fine, it is just a bit superfluous. It is easy to make errors like this with regular expressions, and conversely, to create an expression which is syntactically correct, but which cannot possibly retrieve a result!

Our simple query could also be written as follows, which makes more explicit the use of a pair of (complementary) metacharacters, the round brackets:

Select * from tab where REGEXP_LIKE(field,'(^START)');
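The anchored patterns discussed in this section can be tried without a real table. The following is a minimal sketch, assuming an Oracle 10g or later session; the inline view `sample` and its three rows are invented purely for illustration, with `field` standing in for the column used in the queries here.

```sql
-- Hypothetical inline data: the view name 'sample' and its rows are
-- assumptions made for illustration, not part of the original article.
WITH sample AS (
  SELECT 'START here' AS field FROM dual UNION ALL
  SELECT 'a START'    AS field FROM dual UNION ALL
  SELECT 'st'         AS field FROM dual
)
SELECT field,
       -- pattern anchored to the first character of the field
       CASE WHEN REGEXP_LIKE(field, '^START')   THEN 'Y' ELSE 'N' END AS hat_start,
       -- any three characters, counted from the start of the field
       CASE WHEN REGEXP_LIKE(field, '^...')     THEN 'Y' ELSE 'N' END AS hat_dots,
       -- the same anchored pattern wrapped in round brackets
       CASE WHEN REGEXP_LIKE(field, '(^START)') THEN 'Y' ELSE 'N' END AS bracketed
FROM sample;
```

Only 'START here' should be flagged in hat_start, and 'st' should fail hat_dots because it holds fewer than three characters. The bracketed column should agree with hat_start on every row, since it is the round-bracket form of the same pattern.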
But in this case the round brackets are not strictly required, and might confuse.

There is a final aspect of regexp_like to be noted, namely the optional ‘Match-Parameter’, which is used to control the case of the character pattern being searched for. So again, our simple example can include this parameter as follows:

Select * from tab where REGEXP_LIKE(field,'^START','i');

Here the letter ‘i’ is used to indicate that the string START can be any mixture of upper or lowercase: it is to be case independent. Use of the alternative parameter ‘c’ will force the character string to match exactly the case of the string in the query.

The formal definition of regexp_like is as follows:

REGEXP_LIKE
( source_string, pattern
[, match_parameter] )

Because it is enclosed in square brackets, the match parameter is defined as optional.

REGEXP_SUBSTR

We now know enough to progress on to the next regular expression, regexp_substr. This is in fact highly complementary to regexp_like; the two work together most effectively. The reason for this is that you can never actually see what is being matched by regexp_like. All you see is the retrieval that you requested if the match is made. The two are entirely disconnected. For example, you may inadvertently forget to switch the match parameter from case dependent to case independent, and may therefore get a result, but based entirely on a false premise. This is where regexp_substr allows you to display the actual pattern that is successfully matched with the regexp_like function.

Let's start with the formal definition of this expression:

REGEXP_SUBSTR
(source, pattern
[, position
[, occurrence
[, match_parameter ] ] ] )

Unlike regexp_like, this expression enables you to retrieve exactly the pattern within the text string or table field that matches the pattern as defined. So if the regexp_like and regexp_substr constructs are identical, you can be confident that what you see retrieved by regexp_substr is exactly what is being matched by regexp_like.

Notice that there are a couple more optional parameters available for use with regexp_substr, which makes it a superset of regexp_like.
Any construct in regexp_like can therefore be used with regexp_substr. Let's see how this can work:

Select REGEXP_SUBSTR(field,'START') from tab
where REGEXP_LIKE(field,'START');

Complexity starts to appear if you want to use the match_parameter. In regexp_substr it is the third optional parameter, which obliges you to insert values for the previous two parameters, namely for ‘position’ and ‘occurrence’. This is easy, however. ‘Position’ refers to the position in the text string from which to start looking for a pattern. The numeric ‘1’ will force the examination to commence from the beginning, and is the default. The ‘occurrence’ parameter simply indicates which occurrence of the pattern you wish to find, counting from the start position; the default is 1, the first. (regexp_instr, which we will discuss later, takes an occurrence parameter in the same position.)

We can modify our example to force a case independent search as follows, but we must insert at least the default values for the two preceding parameters in the regexp_substr expression:

Select REGEXP_SUBSTR(field,'START',1,1,'i')
from tab
where REGEXP_LIKE (field,'START','i');

The great benefit of using regexp_like and regexp_substr together is that they are an ideal method for debugging complex expressions. As with so much of SQL, it is the old problem of the declarative versus the procedural. Often one can build a syntactically correct SQL regular expression, but the actual semantic meaning of the expression can be very different from what you think you have written. By using these two expressions in tandem, and by doing several tests, you can both learn a lot more about regular expressions, and furthermore, be confident that the answer you get is the answer to the question you really asked, and not what you thought you asked.

REGEXP_INSTR

The third of the four regular expressions is distinctly different from the two previous examples that we have looked at. Now, the value returned from regexp_instr is a numeric value which identifies the position of the start (or the end) of a particular pattern in the data source character string. The original SQL character function is defined as follows:

INSTR
(source, pattern
[, starting at M
[, the Nth occurrence ]] )

REGEXP_INSTR takes this basic form as its starting point.
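As a baseline for what follows, plain INSTR can be tried on a string literal. This is a minimal sketch rather than anything taken from the original presentation: the sample sentence and the alias `pos` are chosen only for illustration.

```sql
-- Plain INSTR: search from position 1 and report where the
-- 2nd occurrence of the literal text 'beer' begins.
SELECT INSTR('Two beers or three beers, sir?', 'beer', 1, 2) AS pos
FROM dual;
-- pos = 20
```

INSTR looks for literal text only; it has no notion of metacharacters and no option for a case-independent match.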
The regular expression version adds a couple of additional properties, the return_option and the match_parameter. Let's have a look at the formal definition:

REGEXP_INSTR
(source, pattern
[, starting at M
[, the Nth occurrence
[, return_option
[, match_parameter ] ] ] ] )

It may be easier to understand what is happening if the above definition is transposed into free text: the expression is looking for the position of the Nth occurrence of pattern in the source, starting at character position M, with the return_option of either the start or end position of the pattern, with an optional match_parameter governing the case sensitivity of the pattern search.

As with the two previous expressions, the source and the pattern operate in exactly the same way, complete with controlling metacharacters when required. Let's look at the following example:

select regexp_instr('Two beers or three beers, sir?','beer',1,1,0,'i') result from dual;

RESULT
----------
5

The result gives the character position in the string of the first occurrence of the first letter of the pattern ‘beer’, which is of course the integer 5. We can modify the query to look for the second instance of ‘beer’:

select regexp_instr('Two beers or three beers, sir?','beer',1,2,0,'i') result from dual;

RESULT
----------
20

If the return option is changed from ‘0’ to ‘1’, to return the integer indicating the end of the pattern string, the query and result appear as follows:

select regexp_instr('Two beers or three beers, sir?','beer',1,2,1,'i') result from dual;

RESULT
----------
24

But here is a subtlety that is easily overlooked. The result returned is not the numeric position of the last character of the search pattern ‘beer’, but rather the next position after the end of the pattern, which is the letter ‘s’.

REGEXP_REPLACE

The fourth of the regular expressions is the only one to actually manipulate a target pattern. As previously, it is based on the original SQL character function of ‘replace’, with the addition of three extra constraints. The definition of replace looks like this:

REPLACE
(source, pattern, replace_string )
An example could be as follows:

Select replace('this is a small example','small','tiny') from dual;

The result takes the target string, substitutes ‘small’ with ‘tiny’ and returns the modified string ‘this is a tiny example’.

This structure is developed further in regexp_replace, which has the following definition:

REGEXP_REPLACE
( source, pattern
[, replace_string
[, position
[, occurrence
[, match_parameter ] ] ] ] )

Once again, let's see what an example of this expression might look like. We can use it to prune out multiple instances of spaces between characters, as here:

select regexp_replace('This   written  by a    not-very   good  typist','( ){2,}',' ') Result from dual;

The construct above contains three items: a string with multiple spaces between words; a pattern to search for (a single space identified in a closed bracket construct) followed by a numeric argument which says ‘find at least 2 contiguous spaces with no upper limit on how many occur together’; and finally the string with which to replace these multiple spaces, namely one single space.

The result would return the following string:

‘This written by a not-very good typist’

In the above example, we have not seen the use of either ‘position’, ‘occurrence’, or ‘match_parameter’. Their use is exactly the same as with the previous regular expressions. The following example shows how one can apply the replacement selectively within a string:

select regexp_replace('Regexp could be used to encrypt text','(e)','*XYZ*',10,1,'i') result from dual;

In this example, we are going to use ‘regexp_replace’ as a crude form of encryption, by looking for the first occurrence of the letter ‘e’ after character position 10 in the string, and then replacing that letter with the string ‘*XYZ*’. The final result looks like this:

‘Regexp could b*XYZ* used to encrypt text’

Syntax

So far we have only considered the most simple of examples, and deliberately so.
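Those simple examples can be lined up in a single statement, which makes it easier to compare what each function returns for the same pattern. This is a sketch under assumptions: the replacement text 'wine' and the column aliases are invented for illustration, and the sample sentence is simply a convenient literal.

```sql
-- Same source, pattern, position (1), occurrence (2) and match parameter ('i')
-- passed to three of the four functions. Aliases and 'wine' are illustrative only.
SELECT REGEXP_SUBSTR ('Two beers or three beers, sir?', 'beer', 1, 2, 'i')         AS matched,
       REGEXP_INSTR  ('Two beers or three beers, sir?', 'beer', 1, 2, 0, 'i')      AS starts_at,
       REGEXP_INSTR  ('Two beers or three beers, sir?', 'beer', 1, 2, 1, 'i')      AS ends_after,
       REGEXP_REPLACE('Two beers or three beers, sir?', 'beer', 'wine', 1, 2, 'i') AS replaced
FROM dual;
-- matched    = beer
-- starts_at  = 20
-- ends_after = 24
-- replaced   = Two beers or three wines, sir?
```

REGEXP_LIKE is left out because it is a condition rather than a value-returning function, so it belongs in a WHERE clause (or a CASE test) rather than in the select list.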
We have touched on the variety of metacharacters available to use. These are the characters which retain their own meaning within a pattern string, and which are used to apply constraints to the interpretation of that character string. But there are traps here: look at the use of the ‘hat’ or ‘caret’ (^). It can have two entirely different meanings, dependent on its context. In this form ( [^…] ) it is defining characters which must not be matched. Here (^…) it requires the parser to start hunting for the pattern from the beginning of the text data source (where ‘…’ stands for the pattern that follows). Note also the use of the back slash \, the ‘escape’ character, which will force a metacharacter to be treated as an ordinary text character. It is all too easy to overlook these small aspects of regular expression construction, which if misused can alter entirely the behaviour of the query.

There are also groups of characters known as ‘character classes’, as distinct from metacharacters. These are less confusing, and indeed if used carefully, can do much to remove some of the confusion which can arise in trying to manage complex strings. Let's have a look at some of the more useful character classes:

[:alpha:]    alphabetic characters only
[:digit:]    numerics
[:lower:]    lower case alphabetic characters
[:cntrl:]    nonprinting or control characters
[:xdigit:]   hexadecimal digits
[:alnum:]    alphanumeric characters

You can see immediately how a compound pattern expression, looking for both upper and lower case letters and numerics, constructed as follows:

[A-Za-z0-9]

can be simplified into a single expression as follows:

[[:alnum:]]

Note how the character classes are defined with opening and closing square brackets. Because we have replaced the pattern string ‘A-Za-z0-9’, which was itself contained within square brackets, those outer brackets have to be retained. The pairs of square brackets are a familiar feature of regular expressions, so you will immediately recognize that a character class is being used.

Synthesis

The above gentle introduction to regular expressions might seem straightforward, but it masks a degree of complexity which can become seriously confusing. Users familiar with the old SQL character functions will recall that you often had to use multiple functions to achieve the desired result. For example, consider a replace action.
Such an action could be used with one or more instr functions to restrict both the start and the end point of the replacement, possibly even with the use of the ‘length’ function. This degree of complexity has been partially resolved with regular expressions, in that this sort of control can now be obtained within one expression. But equally, there is no reason why one cannot embed a succession of regular expressions, each within the other. Any one of these expressions can be embedded in any other, which will of course automatically increase the overall complexity.

If you are building a complex suite of expressions, be sure to build incrementally, and test at each stage as you build up the layers. If you are trying to understand what one expression is actually doing, take it apart and break it down into its individual components, according to the formal definition of each expression. Clarity of layout is the great issue here, just as it is in writing and reading a complex piece of SQL code.

Conclusions

Regular expressions are a powerful addition to the SQL toolbox. As well as becoming familiar with their complexity and slightly confusing syntax (and this may not be a rapid task!), you should be rather careful where you employ these techniques. Inevitably, their use in a simple query against a text field in a table, such as the examples discussed in this article, will result in a full table scan. With very large tables, this can result in disastrously slow execution times. Where large tables are involved, try to ensure that the regular expression operates on the smallest possible subset of a query result, by ensuring that as much data as possible is retrieved using high performance indexes before applying time consuming regular expression processing.

Regular expressions can also provide beneficial functionality as part of pre-load validation, but again, only so long as load volumes are kept under careful review.

Finally, beware of using regular expressions when there is an easier way of accomplishing the same task! It may well be the case that with some carefully constructed SQL code, you can achieve the same result, with a consequent performance benefit too. If you still can't achieve your objectives with ‘pure’ SQL, look again at the old character functions before finally resorting to the regular expressions.

Further help

Any internet search in which the words ‘Oracle’ and ‘Regular Expressions’ are submitted will return many useful pages of information. In particular, I would commend the following works for further reading, and indeed for ongoing reference material. Be careful, though, with text books that address regular expressions in general.
Such books will tend to be directed at the Unix / shell script community. Just to confuse things even further, every different context in which regular expressions appear tends to have its own syntax variants.

“Writing Better SQL Using Regular Expressions”, Alice Rischert. Google search; a previous article in Oramag. Highly recommended.

Oracle on-line documentation:
http://download-uk.oracle.com/docs/cd/B14117_01/server.101/b10759/toc.htm
(This is ‘SQL Reference for 10g’; search on ‘regexp’)

An excellent blog / discussion on Regular Expressions by Andrew Clarke (Radio Free Tooting):
http://radiofreetooting.blogspot.com/2006/09/useful-article-on-regex-in-oracle-10g.html

“Oracle Regular Expression Pocket Reference” (O’Reilly), Jonathan Gennick & Peter Linsley (cheap at $10!)

Peter Robson (peter_robson@justsql.com) is a database specialist with a particular interest in SQL. His web site www.justsql.com contains a number of unusual examples of the use of SQL. Based in Scotland, he is also a Director of the UK Oracle User Group.
