ISO/IEC TR 19075-1:2011
(Main)Information technology — Database languages — SQL Technical Reports — Part 1: XQuery Regular Expression Support in SQL
Information technology — Database languages — SQL Technical Reports — Part 1: XQuery Regular Expression Support in SQL
ISO/IEC TR 19075-1:2011 describes the regular expression support in SQL adopted from the regular expression syntax of XQuery 1.0 and XPath 2.0 Functions and Operators (Second Edition), which is derived from Perl. It discusses five operators using this regular expression syntax: LIKE_REGEX predicate, to determine the existence of a match to a regular expression. OCCURRENCES_REGEX numeric function, to determine the number of matches to a regular expression. POSITION_REGEX function, to determine the position of a match. SUBSTRING_REGEX function, to extract a substring matching a regular expression. TRANSLATE_REGEX function, to perform replacements using a regular expression.
Technologies de l'information — Langages de base de données — Rapport techniques SQL — Partie 1: Support d'expressions régulières de XQuery en SQL
General Information
Relations
Standards Content (Sample)
TECHNICAL ISO/IEC
REPORT TR
19075-1
First edition
2011-07-15
Information technology — Database
languages — SQL Technical Reports —
Part 1:
XQuery Regular Expression Support in
SQL
Technologies de l'information — Langages de base de données —
Rapport techniques SQL —
Partie 1: Support d'expressions régulières de XQuery en SQL
Reference number
ISO/IEC TR 19075-1:2011(E)
©
ISO/IEC 2011
---------------------- Page: 1 ----------------------
ISO/IEC TR 19075-1:2011(E)
COPYRIGHT PROTECTED DOCUMENT
© ISO/IEC 2011
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means,
electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or
ISO's member body in the country of the requester.
ISO copyright office
Case postale 56 CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland
ii © ISO/IEC 2011 – All rights reserved
---------------------- Page: 2 ----------------------
ISO/IEC TR 19075-1:2011(E)
Contents Page
Foreword. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
1 Scope. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 XQuery regular expressions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Matching a specific character. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Metacharacters and escape sequences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Dot. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Anchors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.5 Line terminators. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6
2.6 Bracket expressions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7
2.6.1 Listing characters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7
2.6.2 Matching a range. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.6.3 Negation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.6.4 Character class subtraction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.7 Alternation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.8 Quantifiers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.9 Locating a match. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.10 Capture and back-reference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.11 Precedence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.12 Modes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Operators using regular expressions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1 LIKE_REGEX. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 OCCURRENCES_REGEX. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 POSITION_REGEX. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.4 SUBSTRING_REGEX. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.5 TRANSLATE_REGEX. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Bibliography. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
©ISO/IEC 2011 – All rights reserved Contents iii
---------------------- Page: 3 ----------------------
ISO/IEC TR 19075-1:2011(E)
(Blank page)
iv XQuery Regular Expression Support in SQL/Foundation ©ISO/IEC 2011 – All rights reserved
---------------------- Page: 4 ----------------------
ISO/IEC TR 19075-1:2011(E)
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission)
form the specialized system for worldwide standardization. National bodies that are members of ISO or IEC
participate in the development of International Standards through technical committees established by the
respective organization to deal with particular fields of technical activity. ISO and IEC technical committees
collaborate in fields of mutual interest. Other international organizations, governmental and non-governmental,
in liaison with ISO and IEC, also take part in the work. In the field of information technology, ISO and IEC
have established a joint technical committee, ISO/IEC JTC 1.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of the joint technical committee is to prepare International Standards. Draft International Standards
adopted by the joint technical committee are circulated to national bodies for voting. Publication as an Interna-
tional Standard requires approval by at least 75 % of the national bodies casting a vote.
In exceptional circumstances, when the joint technical committee has collected data of a different kind from
that which is normally published as an International Standard (“state of the art”, for example), it may decide
to publish a Technical Report. A Technical Report is entirely informative in nature and shall be subject to
review every five years in the same manner as an International Standard.
Attention is drawn to the possibility that some of the elements of this document may be the subject of patent
rights. ISO and IEC shall not be held responsible for identifying any or all such patent rights.
ISO/IEC TR 19075-1 was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology,
Subcommittee SC 32, Data management and interchange.
ISO/IEC TR 19075 consists of the following parts, under the general title Information technology — Database
languages — SQL Technical Reports:
— Part 1: XQuery Regular Expression Support in SQL
©ISO/IEC 2011 – All rights reserved Foreword v
---------------------- Page: 5 ----------------------
ISO/IEC TR 19075-1:2011(E)
Introduction
The organization of this part of ISO/IEC TR 19075 is as follows:
1) Clause 1, “Scope”, specifies the scope of this part of ISO/IEC TR 19075.
2) Clause 2, “XQuery regular expressions”, explains how XQuery regular expressions are formed.
3) Clause 3, “Operators using regular expressions”, explains how the SQL operators use regular expressions.
vi XQuery Regular Expression Support in SQL/Foundation ©ISO/IEC 2011 – All rights reserved
---------------------- Page: 6 ----------------------
TECHNICAL REPORT ISO/IEC TR 19075-1:2011(E)
Information technology — Database languages — SQL Technical Reports —
Part 1:
XQuery Regular Expression Support in SQL
1 Scope
This Technical Report describes the regular expression support in SQL adopted from the regular expression
syntax of [XQuery F&O], which is derived from Perl. This Technical Report discusses five operators using
this regular expression syntax:
— LIKE_REGEX predicate, to determine the existence of a match to a regular expression.
— OCCURRENCES_REGEX numeric function, to determine the number of matches to a regular expression.
— POSITION_REGEX function, to determine the position of a match.
— SUBSTRING_REGEX function, to extract a substring matching a regular expression.
— TRANSLATE_REGEX function, to perform replacements using a regular expression.
©ISO/IEC 2011 – All rights reserved Scope 1
---------------------- Page: 7 ----------------------
ISO/IEC TR 19075-1:2011(E)
(Blank page)
2 XQuery Regular Expression Support in SQL/Foundation ©ISO/IEC 2011 – All rights reserved
---------------------- Page: 8 ----------------------
ISO/IEC TR 19075-1:2011(E)
2.1 Matching a specific character
2 XQuery regular expressions
XQuery regular expression syntax is specified in [XQuery F&O], section 7.6.1, “Regular expression syntax”.
This paper references the XQuery specification, with two small modifications (required since character strings
in an RDBMS are not necessarily normalized according to XML conventions). The following subsections
provide an overview of this syntax.
The XQuery regular expression syntax is itself a modification of another regular expression syntax found in
[XML Schema: Datatypes].
This section presents an overview of the capabilities of XQuery regular expression syntax. In the process, this
section will illustrate some of the SQL operators. The SQL operators themselves are presented in the next
section.
The following discussion does not cover every aspect of XQuery regular expressions; for this, [XQuery F&O]
is the reference (though hardly a tutorial; a variety of popular works contain detailed treatments of regular
expressions).
2.1 Matching a specific character
Perhaps the most elementary pattern matching requirement is the ability to match a single character or string.
For most characters, this is done by simply writing the character in the regular expression. For example, suppose
you want to know if a string S contains the letters “xyz”. This could be done with the following predicate:
S LIKE_REGEX 'xyz'
Note that the SQL LIKE predicate would require an exact match for “xyz”. However, the convention with
regular expressions is that S need only contain a substring that is “xyz”. For example, all of the following
values of S would yield True for the predicate above:
xyz
abcxyz123
1 xyz 2 xyz 3 xyz
Note that in the last example, there are actually three occurrences of the regular expression “xyz” within the
tested value. The user may wish to know the number of occurrences of a match. This can be done with
OCCURRENCES_REGEX. For example:
OCCURRENCES_REGEX ('xyz' IN '1 xyz 2 xyz 3 xyz') = 3
The user might also wish to know the position of a specific match. This can be done using POSITION_REGEX.
For example, to learn the starting character position of the second occurrence,
POSITION_REGEX ( 'xyz' IN '1 xyz 2 xyz 3 xyz' OCCURRENCE 2 ) = 9
It is also possible to ask for the character position of the first character after the match. For example:
©ISO/IEC 2011 – All rights reserved XQuery regular expressions 3
---------------------- Page: 9 ----------------------
ISO/IEC TR 19075-1:2011(E)
2.1 Matching a specific character
POSITION_REGEX ( AFTER 'xyz' IN '1 xyz 2 xyz 3 xyz' OCCURRENCE 2 ) = 12
If AFTER is used and the last character of the subject string is consumed, then the result is the length of the
string plus 1 (one):
POSITION_REGEX ( AFTER 'xyz' IN 'xyz' ) = 4
2.2 Metacharacters and escape sequences
As mentioned, most characters can be matched by simply writing the character in the regular expression.
However, certain characters are reserved as metacharacters. The complete list of metacharacters is:
. \ ? * + { } ( ) | [ ] ^ $
The use of each of these metacharacters will be explained later. If you want to match a metacharacter, then you
need to use an escape sequence, consisting of a backslash (“/”) followed by the metacharacter. For example,
to test whether a string contains a dollar sign, you could write
S LIKE_REGEX '\$'
In particular, the escape sequence representing a backslash is two consecutive backslashes. There are various
other defined escape sequences, matching either a single character, or any of a group of characters. The single
character escape sequences are:
\n newline (U+000A)
\r return (U+000D)
\t tab (U+0009)
\- minus sign ('-')
The so-called category escapes are exemplified by “\p{L}” or “\p{Lu}”. A category escape begins with
“\p{” followed by one uppercase letter, optionally a lowercase letter, and then the closing brace. In these
example, “\p{L}” matches any letter (as defined by Unicode) and “\p{Lu}” matches any uppercase letter.
Some interesting category escapes are listed below:
\p{L} Any letter.
\p{Lu} Any uppercase letter.
\p{Ll} Any lowercase letter.
\p{Nd} Any decimal digit.
\p{P} Any punctuation mark.
\p{Z} Any separator (space, line, paragraph, etc.).
The complete list of category escapes is found in [XML Schema: Datatypes], section F.1.1, “Character class
escapes”.
There are also complementary category escapes, which are exemplified by “\P{L}” or “\P{Lu}”. A comple-
mentary category escape matches any character that would not be matched by the corresponding category
4 XQuery Regular Expression Support in SQL/Foundation ©ISO/IEC 2011 – All rights reserved
---------------------- Page: 10 ----------------------
ISO/IEC TR 19075-1:2011(E)
2.2 Metacharacters and escape sequences
escape. The difference is that the (positive) character escape is written with a lowercase “p” whereas the com-
plementary character escape is written with an uppercase “P”.
The so-called block escapes match any character in a block of Unicode, that is, a predefined consecutive range
of code points. For example, “\p{IsBasicLatin}” matches the ASCII character set. There are also com-
plementary block escapes, such as “\P{IsBasicLatin}”, which matches any single character that is not
an ASCII character.
Finally, there are the following multi-character escape sequences:
\s As defined by [XML Schema: Datatypes], this escape matches space (U+0020), tab (U+0009),
newline (U+000A), or return (U+000D). Since character strings in an RDBMS have not undergone
XML line termination normalization, we broaden it to include any character or two-character
sequence that is recognized by [Unicode18] as a line terminator. Subclause 2.5, “Line terminators”,
discusses this issue further.
\S Any single character not matched by \s.
\i Underscore (“_”), colon (“:”) or letter (this is a lot more than just the Latin letters; see [XML 1.0]
appendix B, rule [84]).
\I Any single character not matched by \i.
\c Any single character matched by NameChar, as defined in [XML 1.0] section 2.3, rule [4].
\C Any single character not matched by \c.
\d Any single digit
\D Any single character not matched by \d.
\w Any single Unicode character except those classified as “punctuation”, “separator”, or “other”.
\W The complement of \w.
2.3 Dot
Dot (period, “.”) is a metacharacter that is used to match any single character (the same behavior as “_” in
LIKE predicates), or any single character that is not a line terminator. The default is to match anything except
a line terminator. The alternative, called dot-all mode, is specified using a flag that contains a lowercase “s”.
For example
S LIKE_REGEX 'a.b'
matches the following:
'xa0by'
but not the following:
'xa
by'
because the character between the “a” and the “b” is a line terminator. However, using dot-all mode like this:
©ISO/IEC 2011 – All rights reserved XQuery regular expressions 5
---------------------- Page: 11 ----------------------
ISO/IEC TR 19075-1:2011(E)
2.3 Dot
S LIKE_REGEX 'a.b' FLAG 's'
would match both examples.
2.4 Anchors
We have seen that regular expressions look for a match anywhere within a string, without needing to match
the entire string. But what if you want to require a match of the entire string? For this, you can use anchors.
The anchors are the metacharacters “^” for the start of a string (or line), and “$” for the end of a string (or line).
For example:
S LIKE_REGEX '^xyz$'
can only match a string that is precisely 'xyz'.
Anchors may be used separately to require a “begins with” or “end with” match. For example
S LIKE_REGEX '^xyz'
matches any string that begins with “xyz”, and
S LIKE_REGEX 'xyz$'
matches any string that ends with “xyz”.
Instead of matching the begin or end of the string, the anchors may be used to anchor a match to the begin or
end of a line, by performing the match in multi-line mode. Multi-line mode is specified using a flag containing
a lowercase “m”. For example:
S LIKE_REGEX '^xyz' FLAG 'm'
performs an anchored search in multi-line mode, matching any string containing a line that begins with “xyz”.
The example above would match the following string:
'line one
xyz
line three'
2.5 Line terminators
The metacharacters “.”, “^”, and “$” and the multi-character escape sequences “\s” and “\S” are defined in
terms of a “line terminator”. What counts as a line terminator? [XQuery F&O] only recognizes a line feed
(U+000A) as a line terminator. This definition works well for XQuery, because XML normalizes the line ter-
minators on various platforms to a line feed.
A closer look shows that XML has two definitions of line handling, in section 2.11, “End-of-line handling”,
of [XML 1.0] and [XML 1.1]. So which should we use for SQL?
A first stop in answering this is to look at [SQL/XML WD] Subclause 6.17, “”, which requires
XML 1.0 as a basic level of support, and permits XML 1.1 support in the form of Feature X211, “XML 1.1
6 XQuery Regular Expression Support in SQL/Foundation ©ISO/IEC 2011 – All rights reserved
---------------------- Page: 12 ----------------------
ISO/IEC TR 19075-1:2011(E)
2.5 Line terminators
support”. So, we might specify that the character string is normalized according to either XML 1.0 or XML
1.1 as an implementation-defined choice, or perhaps via a conformance feature.
However, some of the line terminators, even in XML 1.0, are two-character sequences. XML normalizes its
input, which means that such two-character sequences are converted to a single character. This changes the
relative position of every subsequent character, which would cause unexpected results for POSITION_REGEX.
Our solution is to look to [Unicode18], a Unicode standard containing guidelines for regular expression proces-
sors. This provides a referenceable definition of line terminator that does not require normalizing the subject
character string.
2.6 Bracket expressions
So far, we have seen how to match a specific character, or any character from certain predefined sets of characters.
Using bracket expressions, you can specify your own group of characters. (XML Schema and XQuery call
these character class expressions, but the term bracket expression is in common use.)
A bracket expression is begun by a left bracket “[” and terminated by a right bracket “]”. Bracket expressions
have a different list of special characters, namely
^ [ ] \
For clarity, we will call these special characters, in contrast to the metacharacters listed earlier.
2.6.1 Listing characters
If a bracket expression does not contain any of the special characters, then the bracket expression matches any
single character that is listed between the brackets. For example,
S LIKE_REGEX '[abc]'
matches any of the following:
'say'
'boy'
'lack'
All backslash escape sequences are available for use within a bracket expression. For example, to match either
a caret or a backslash, you can use
S LIKE_REGEX '[\^\\]'
To match all letters or digits, one might use
S LIKE_REGEX '[\p{L}\p{Nd}]'
where “\p{L}” is the escape matching any letter and “\p{Nd}” is the escape matching any digit.
©ISO/IEC 2011 – All rights reserved XQuery regular expressions 7
---------------------- Page: 13 ----------------------
ISO/IEC TR 19075-1:2011(E)
2.6 Bracket expressions
2.6.2 Matching a range
A minus sign “-” is used to specify a character range. For example:
S LIKE_REGEX '[sa-my]'
matches the lowercase letters “s”, all the letters between “a” and “m” inclusive, and “y”. Ranges are defined
in terms of the UCS code point ordering. When there are multiple ranges, the bracket expression matches the
union of the ranges. For example:
S LIKE_REGEX '[a-me-z]'
matches all lowercase letters.
Using a special character in a range is sometimes permitted, but tricky. Rather than present the rules here, our
advice is to use a backslash escape if the start or end point of a range must be a special character.
2.6.3 Negation
A caret “^” is a special character when it is the first character of a bracket expression, where it indicates that
the set of characters is anything not listed by the following bracket expression. For example:
S LIKE_REGEX '[^aj-m]'
is True if S contains any character that is not “a”, “j”, “k”, “l”, or “m”.
2.6.4 Character class subtraction
A bracket expression may conclude with a minus sign “-” followed by a nested bracket expression. This is
called a character class subtraction, and indicates that any character matched by the nested bracket expression
is to be removed from the set of characters that might be a match. For example:
S LIKE_REGEX '[a-z-[m-p]]'
matches anything between “a” and “z”, except for the letters between “m” and “p”, inclusive. This example
is equivalent to:
S LIKE_REGEX '[a-lq-z]'
Seemingly you can nest character class subtractions indefinitely. This concludes the presentation of bracket
expressions.
2.7 Alternation
You can specify a choice of regular expressions using a vertical bar “|”. For example:
8 XQuery Regular Expression Support in SQL/Foundation ©ISO/IEC 2011 – All rights reserved
---------------------- Page: 14 ----------------------
ISO/IEC TR 19075-1:2011(E)
2.7 Alternation
S LIKE_REGEX 'a|b'
is True if S contains either an “a” or a “b”.
Alternation has lower precedence than concatenation. Thus
S LIKE_REGEX 'ab|xyz'
is True if S contains either “ab” or “xyz”. To override this precedence, you can use parentheses, such as this
example:
S LIKE_REGEX 'a(b|xy)z'
The preceding example is True if S contains either “abz” or “axyz”.
2.8 Quantifiers
Quantifiers are metacharacters that specify a match for some number of repetitions of a regular expression.
There are two sets of quantifiers, the greedy and the reluctant. The greedy quantifiers are:
{n} Exactly n repetitions.
{n,} n or more repetitions.
{n,m} Between n and m repetitions, inclusive.
? 0 (zero) or 1 (one) repetition; equivalent to {0,1}.
* 0 (zero) or more repetitions; equivalent to {0,}.
+ 1 (one) or more repetitions; equivalent to {1,}.
The reluctant quantifiers are formed by suffixing a question mark to a greedy quantifier. Thus, “*?” is the
reluctant form of “*”, and “??” is the reluctant form of “?”. The greedy quantifiers try to match as much as
possible, whereas the reluctant quantifiers try to match as little as possible (while still allowing the overall
regular expression to match). There is no difference in behavior between the greedy and reluctant quantifiers
for LIKE_REGEX. We will look at this difference for the other operators shortly.
Examples:
S LIKE_REGEX 'a{3}'
is equivalent to
S LIKE_REGEX 'aaa'
and matches any string containing at least three consecutive instances of “a”. Note that if S contains more than
three consecutive instances of “a”, it still matches; to test whether S contains a substring of three consecutive
instances of “a” and no more is a lot harder, since you have to also require something other than an “a” at both
ends of the substring.
S LIKE_REGEX 'ab+c'
is equivalent to
©ISO/IEC 2011 – All rights reserved XQuery regular expressions 9
---------------------- Page: 15 ----------------------
ISO/IEC TR 19075-1:2011(E)
2.8 Quantifiers
S LIKE_REGEX 'ab{1,}c'
and matches any string that contains a substring consisting of an “a”, one or more “b”s, and then a “c”.
2.9 Locating a match
LIKE_REGEX only cares whether a match exists; the other operators care about where a match is located in
the string. Consider the regular expression “a+” and the string “"a1aa2aaa3”. There are ten possible matches
for “a+”, indicated by the underlining on the following lines:
'a1aa2aaa3' -- position 1, length 1
'a1aa2aaa3' -- position 3, length 1
'a1aa2aaa3' -- position 3, length 2
'a1aa2aaa3' -- position 4, length 1
'a1aa2aaa3' -- position 6, length 1
'a1aa2aaa3' -- position 6, length 2
'a1aa2aaa3' -- position 6, length 3
'a1aa2aaa3' -- position 7, length 1
'a1aa2aaa3' -- position 8, length 2
'a1aa2aaa3' -- position 9, length 1
Notice that some of the matches are substrings of other matches. The rules of XQuery regular expressions are
designed to ignore certain matches, so that the recognized matches are mutually disjoint. Obviously there are
many ways to do this, so the rules provide priorities in determining the recognized matches. There are three
priorities:
1) The top priority is to find a match as early in the string as possible. This is commonly called the leftmost
rule.
2) The second priority is to find the first alternative of an alternation, if possible. We are unaware of a common
name for this rule.
3) The last priority is to find the longest possible match for greedy quantifiers, and the shortest match for
reluctant quantifiers. In the case of greedy quantifiers, this is commonly called the longest rule; we are
unaware of a common name for the rule regarding reluctant quantifiers.
[Historical note: POSIX only has a leftmost longest rule. There were no reluctant quantifiers, and the priority
for matching alternations was the longest match rather than the first alternative.]
These rules will be illustrated by examples:
Subject regular match(es) priority
string expression underlined
baaaaaa ba|a* baaaaaa leftmost (even though baaaaaa would be longer);
baaaaaa second match must start after the first match
ab a|ab ab first alternative (rather than matching ab)
10 XQuery Regular Expression Support in SQL/Foundation ©ISO/IEC 2011 – All rights reserved
---------------------- Page: 16 ----------------------
ISO/IEC TR 19075-1:2011(E)
2.9 Locating a match
Subject regular match(es) priority
string expression underlined
abcabbabc ab* abcabbabc leftmost
abcabbabc longest (greedy quantifier consumes two "b"s)
abcabbabc longest
abcabbabc ab*? abcabbabc shortest (no need to match "b")
abcabbabc shortest
abcabbabc shortest
2.10 Capture and back-reference
A parenthesized sub-expression is a portion of a regular expression that is enclosed in parentheses. Parenthesized
sub-expressions are numbered in order of their left parenthesis. For example, in the regular expression
((a)|(b))
there are three sub-expressions:
1) ((a)|(b))
2) (a)
3) (b)
A sub-expression can be referenced later in a regular expression using a back-reference, taking the form of a
backslash followed by one or more digits. Thus the three sub-expressions in the example can be referenced as
“\1”, “\2”, and “\3”. For example, consider the regular expression:
\p{Z}(\p{L}*)\p{Z}*\1\p{Z}
The first and only parenthesized sub-expression (“\p{L}*”) matches any sequence of letters that is bounded
by some kind of space character (“\p{Z}”) before and after the sequence of letters. The back-reference (“\1”)
matches whatever sequence of letters was captured by the first sub-expression. This regular expression might
be used to search for occurrences of a repeated word (perhaps caused by a cut-and-paste error). Here is an
example of a subject string, with underlining to indicate the match for the entire regular expression:
Hello Dolly you're looking looking swell
When a back-reference references a parenthesized group with a quantifier, then the back-reference matches the
last iteration of the quantified sub-expression. For example, consider the regular expression:
'(ab*)*c*\1'
and the subject string:
'abbbabbabcabbbbb'
The matches to “(ab*)” are shown by underlining below:
'abbbabbabcabbbbb'
©ISO/IEC 2011 – All rights reserved XQuery regular expressions 11
---------------------- Page: 17 ----------------------
ISO/IEC TR 19075-1:2011(E)
2.10 Capture and back-reference
'abbbabbabcabbbbb'
'abbbabbabcabbbbb'
These three iterations of “(ab*)” are matched by “(ab*)*” and then the “c” is matched. Next, we need to
match “\1”. The last match for the first parenthesized sub-expression is “ab”, so the overall match is indicate
by underlining below:
'abbbabbabcabbbbb'
In the event that a sub-expression is unmatched, a back-reference to it matches the zero-length string. For
example, consider the regular
...
Questions, Comments and Discussion
Ask us and Technical Secretary will try to provide an answer. You can facilitate discussion about the standard in here.