Overview

Regular expressions have long been used in the Unix world. With Oracle 10g, they have now entered the SQL toolbox. For those unfamiliar with them, regular expressions can be both complex and confusing, but they do bring substantial flexibility to the developer. They enable the user to drill down into the detailed constituents of individual database text fields. They can hunt for complex text patterns embedded within fields, constrained by position in the fields (as defined by any arbitrary character position). Therefore, they add considerably to the ability of the user to ...

Background

The four regular expressions introduced into Oracle 10g are all based on the four SQL character functions substr, like, instr, and replace. Their names too are recognizable, all being prefaced with the text ‘regexp’, as follows:

REGEXP_LIKE
REGEXP_SUBSTR
REGEXP_INSTR
REGEXP_REPLACE

Each one of these terms will be explained and demonstrated, showing how the original abilities of the SQL character functions have been extended into the regular expression forms. Their function falls into one of two categories, either to identify patterns of data within discrete strings or attributes, or to manipulate target patterns within discrete strings or attributes. By ‘string’ or ‘attributes’, I mean a string literal which may be part of a SQL expression, or ...

REGEXP_LIKE

As indicated, this is based on the SQL character function like:

Select * from tab where field like 'START%';
Select * from tab where field like '_START%';
Select * from tab where field like '%START%';

In each case the pattern to be found is ‘START’, but the constructs enable a certain amount of choice over where that string is to be found. The ‘%’ symbol acts as the familiar wild card. The text pattern may either be found at the beginning of the string (irrespective of what may follow), or commencing at the 2nd character position in the string (and again irrespective of what may follow), or thirdly, anywhere in the string, irrespective of what may precede or follow. Note that in the second example, I have used the underscore symbol (‘_’) to stand for a single character position!

The regular expression regexp_like can of course achieve the same result, but in addition its metacharacters allow you to control the condition of the match. It is worth listing examples of metacharacters before discussing in detail how some of them work:

.        (a full stop) Any character
+        One or More of the previous expression
?        Zero or One of the previous expression
*        Zero or More of the previous expression
{m}      Exact Count of the previous expression
{m,}     Count of 'm' or more
{m,n}    Count of at least 'm', maximum 'n'
[…]      List of characters to be matched
[^…]     List of characters NOT to be matched
|        Or, alternatively
(…)      Group or sub-expression of characters
^        Match from the first character in line or field
$        Match the very last character in line or field

Select * from tab where REGEXP_LIKE(field,'^START');

Here we would only match those fields in which the string START was located at the beginning of the field. This we achieve by inserting the metacharacter ^ (caret, or hat) immediately before the required string. Notice that Oracle automatically distinguishes between metacharacters and normal text. If you need to include a character in your search template that is also a metacharacter, you can do this by using the ‘escape’ device of a backward slash, which forces the next character to be interpreted as normal text.

If we use the metacharacter ‘dot’ (or ‘stop’, ‘.’), we can create a slightly anomalous query. Look at this:

Select * from tab where REGEXP_LIKE(field,'^...');

Here we are looking for any three contiguous characters in our data source; they can be anything. We have also prefaced those dots with the hat, ^, ...

Select * from tab where REGEXP_LIKE(field,'(^START)');
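A minimal sketch of how these patterns behave, assuming a handful of made-up rows (the inline demo rows and the result column names are illustrative assumptions, not part of the original examples):

```sql
-- Hypothetical sample rows built from DUAL; no real table is assumed.
with demo as (
  select 'START of line' as field from dual union all
  select 'Late START'    as field from dual union all
  select 'ab'            as field from dual union all
  select 'cost: 3.50'    as field from dual
)
select field,
       -- anchored at the beginning of the field
       case when regexp_like(field, '^START')   then 'Y' else 'N' end as starts_with,
       -- same pattern, grouped in round brackets
       case when regexp_like(field, '(^START)') then 'Y' else 'N' end as grouped,
       -- any three characters at the beginning of the field
       case when regexp_like(field, '^...')     then 'Y' else 'N' end as three_chars,
       -- backslash forces the dot to be read as a literal full stop
       case when regexp_like(field, '3\.50')    then 'Y' else 'N' end as literal_dot
from demo;
```

On these rows the test for three characters is true for every field except 'ab', and the escaped pattern is true only where a literal full stop sits between 3 and 50. The anchored and grouped columns should agree on every row, since wrapping the whole pattern in round brackets does not change what it matches.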
But in this case the round brackets are not strictly required, and might confuse.

There is a final aspect of regexp_like to be noted, namely the optional ‘Match-Parameter’, which is used to control the case of the character pattern being searched for. So again, our simple example can include this parameter as follows:

Select * from tab where REGEXP_LIKE(field,'^START','i');

Here the letter ‘i’ is used to indicate that the string START can be any mixture of upper or lowercase: it is to be case insensitive.

REGEXP_LIKE
( source_string, pattern
[, match_parameter] )

Because it is enclosed in square brackets, the match parameter is defined as optional.

REGEXP_SUBSTR

We now know enough to progress on to the next regular expression, regexp_substr. This is in fact highly complementary to regexp_like; the two work together most effectively. The reason for this is that you can never actually see what is being matched by regexp_like: all you see is the retrieval that you requested if the match is made. The two are entirely disconnected. For example, you may inadvertently forget to switch the match parameter from case dependent to case independent, and may therefore get a result, but based entirely on a false premise. This is where regexp_substr allows you to display the actual pattern that is successfully matched with the regexp_like function.

Let's start with the formal definition of this ...

Unlike regexp_like, this expression enables you to retrieve exactly the pattern within the text string or table field that matches the pattern as defined. So if the regexp_like and regexp_substr constructs are identical, you can be absolutely confident that what you see retrieved by regexp_substr is exactly what is being matched by regexp_like.

Notice that there are a couple more optional parameters available for use with regexp_substr. But this makes it a superset of the regexp_like construct.
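As a sketch of the false-premise risk with the match parameter, the same literal can be probed with and without the case-insensitive flag. The sample sentence and the column aliases are made up for illustration. Per Oracle's documentation, the optional arguments after the pattern are the start position, the occurrence and the match parameter:

```sql
-- Hypothetical literal; only DUAL is needed to run this.
select
  -- default matching is case sensitive, so 'START' is not found
  regexp_substr('Please start the engine', 'START')            as case_sensitive,
  -- position 1, occurrence 1, 'i' = ignore case
  regexp_substr('Please start the engine', 'START', 1, 1, 'i') as case_insensitive
from dual;
```

The first column comes back NULL because nothing matches in upper case. The second returns the lower-case text 'start', exactly as it appears in the source, which is the visibility that regexp_like on its own cannot give.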
Being a superset, regexp_substr accepts any construct used in regexp_like. Let's see how this can work:

Select REGEXP_SUBSTR(field,'START') from tab
where REGEXP_LIKE(field,'START');

Complexity starts to appear if you want ...

Select REGEXP_SUBSTR(field,'START',1,1,'i')
from tab
where REGEXP_LIKE(field,'START','i');

The great benefit of using regexp_like and regexp_substr together is that they are an ideal method for debugging complex expressions. As with so much of SQL, it is the old problem of the declarative versus the procedural. Often one can build a syntactically correct SQL regular expression, but the actual semantic meaning of the expression can be very different from what you think you have ...

REGEXP_INSTR

INSTR
(source, pattern
[, starting at M
[, the Nth occurrence ]] )

The regular expression version takes this basic position and adds a couple of
additional properties, the return_option and the match_parameter. Let's have a look at the formal definition:

REGEXP_INSTR
(source, pattern
[, starting at M
[, the Nth occurrence
[, return_option
[, match_parameter ] ] ] ] )

It may be easier to understand what is happening if the above definition is transposed into free text: the expression is looking for the position of the Nth occurrence of pattern in the source, starting at character position M, with the return_option of either the start or end position of the pattern, with an optional match_parameter governing the case sensitivity of the pattern search.

As with the two previous expressions, the source and the pattern operate in exactly the same way, complete with controlling metacharacters when required. Let's look at the following example:

select
regexp_instr('Two beers or three beers, sir?','beer',1,1,0,'i')
result from dual;

RESULT
----------
5

The result gives the character position in the string of the first occurrence of the first letter of the pattern ‘beer’, which is of course the integer 5. We can modify the query to look for the second instance of ‘beer’:

select
regexp_instr('Two beers or three beers, sir?','beer',1,2,0,'i')
result from dual;

RESULT
----------
20

If the return option is changed from ‘0’ to ‘1’, to return the integer indicating the end of the pattern string, the query and result appear as follows:

select
regexp_instr('Two beers or three beers, sir?','beer',1,2,1,'i')
result from dual;

RESULT
----------
24

But here is a subtlety easily overlooked. The result returned is not the numeric position of the last character of the search pattern ‘beer’, but rather the next position after the end of the pattern, which is the letter ‘s’.

REGEXP_REPLACE

The fourth of the regular expressions is the only one to actually manipulate a target pattern. As previously, it is based on the original SQL character function of ‘replace’, with the addition of three extra constraints. The definition of replace looks like this:

REPLACE
(source, pattern, replace_string )
of which an example could be as follows: ...

...
[, match_parameter ] ] ] ] )

Once again, let’s see what an example of this expression might look like. We can use it to prune out multiple instances of spaces between characters, as here:

select regexp_replace
('This written by a not-very good typist','( ){2,}', ' ') Result from dual;

The construct above contains three items: a string with multiple spaces between words, a pattern to search for (a single space identified in a closed ... multiple spaces, namely one single space.

In this example, we are going to use ‘regexp_replace’ as a crude form of encryption, by looking for the first occurrence of the letter ‘e’ after character position 10 in the string, and then replacing that letter with the string ‘*XYZ*’. The final result looks like this:

'Regexp could b*XYZ* used to encrypt ...
Conclusions
Further help
Peter Robson
(peter_robson@justsql.com) is a
database specialist with a particular
interest in SQL. His web site
www.justsql.com contains a number of
unusual examples of the use of SQL.