<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
|
|
<html>
|
|
<head>
|
|
<meta name="generator" content="HTML Tidy, see www.w3.org">
|
|
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
|
|
<link type="text/css" rel="stylesheet" href="style.css"><!-- Generated by The Open Group's rhtm tool v1.2.1 -->
|
|
<!-- Copyright (c) 2001 The Open Group, All Rights Reserved -->
|
|
<title>regcomp</title>
|
|
</head>
|
|
<body bgcolor="white">
|
|
|
|
<basefont size="3"> <a name="regcomp"></a> <a name="tag_03_603"></a><!-- regcomp -->
|
|
<!--header start-->
|
|
<center><font size="2">The Open Group Base Specifications Issue 6<br>
|
|
IEEE Std 1003.1-2001<br>
|
|
Copyright © 2001 The IEEE and The Open Group, All Rights reserved.</font></center>
|
|
|
|
<!--header end-->
|
|
<hr size="2" noshade>
|
|
<h4><a name="tag_03_603_01"></a>NAME</h4>
|
|
|
|
<blockquote>regcomp, regerror, regexec, regfree - regular expression matching</blockquote>
|
|
|
|
<h4><a name="tag_03_603_02"></a>SYNOPSIS</h4>
|
|
|
|
<blockquote class="synopsis">
|
|
<p><code><tt>#include &lt;<a href="../basedefs/regex.h.html">regex.h</a>&gt;<br>
|
|
<br>
|
|
int regcomp(regex_t *restrict</tt> <i>preg</i><tt>, const char *restrict</tt> <i>pattern</i><tt>,<br>
|
|
int</tt> <i>cflags</i><tt>);<br>
|
|
size_t regerror(int</tt> <i>errcode</i><tt>, const regex_t *restrict</tt> <i>preg</i><tt>,<br>
|
|
char *restrict</tt> <i>errbuf</i><tt>, size_t</tt> <i>errbuf_size</i><tt>);<br>
|
|
int regexec(const regex_t *restrict</tt> <i>preg</i><tt>, const char *restrict</tt> <i>string</i><tt>,<br>
|
|
size_t</tt> <i>nmatch</i><tt>, regmatch_t</tt> <i>pmatch</i><tt>[restrict], int</tt>
|
|
<i>eflags</i><tt>);<br>
|
|
void regfree(regex_t *</tt><i>preg</i><tt>);<br>
|
|
</tt></code></p>
|
|
</blockquote>
|
|
|
|
<h4><a name="tag_03_603_03"></a>DESCRIPTION</h4>
|
|
|
|
<blockquote>
|
|
<p>These functions interpret <i>basic</i> and <i>extended</i> regular expressions as described in the Base Definitions volume of
|
|
IEEE Std 1003.1-2001, <a href="../basedefs/xbd_chap09.html">Chapter 9, Regular Expressions</a>.</p>
|
|
|
|
<p>The <b>regex_t</b> structure is defined in <a href="../basedefs/regex.h.html"><i>&lt;regex.h&gt;</i></a> and contains at least
|
|
the following member:</p>
|
|
|
|
<center>
|
|
<table border="1" cellpadding="3" align="center">
|
|
<tr valign="top">
|
|
<th align="center">
|
|
<p class="tent"><b>Member Type</b></p>
|
|
</th>
|
|
<th align="center">
|
|
<p class="tent"><b>Member Name</b></p>
|
|
</th>
|
|
<th align="center">
|
|
<p class="tent"><b>Description</b></p>
|
|
</th>
|
|
</tr>
|
|
|
|
<tr valign="top">
|
|
<td align="left">
|
|
<p class="tent">size_t</p>
|
|
</td>
|
|
<td align="left">
|
|
<p class="tent">re_nsub</p>
|
|
</td>
|
|
<td align="left">
|
|
<p class="tent">Number of parenthesized subexpressions.</p>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
</center>
|
|
|
|
<p>The <b>regmatch_t</b> structure is defined in <a href="../basedefs/regex.h.html"><i>&lt;regex.h&gt;</i></a> and contains at
|
|
least the following members:</p>
|
|
|
|
<center>
|
|
<table border="1" cellpadding="3" align="center">
|
|
<tr valign="top">
|
|
<th align="center">
|
|
<p class="tent"><b>Member Type</b></p>
|
|
</th>
|
|
<th align="center">
|
|
<p class="tent"><b>Member Name</b></p>
|
|
</th>
|
|
<th align="center">
|
|
<p class="tent"><b>Description</b></p>
|
|
</th>
|
|
</tr>
|
|
|
|
<tr valign="top">
|
|
<td align="left">
|
|
<p class="tent"><b>regoff_t</b></p>
|
|
</td>
|
|
<td align="left">
|
|
<p class="tent"><i>rm_so</i></p>
|
|
</td>
|
|
<td align="left">
|
|
<p class="tent">Byte offset from start of <i>string</i> to start of substring.</p>
|
|
</td>
|
|
</tr>
|
|
|
|
<tr valign="top">
|
|
<td align="left">
|
|
<p class="tent"><b>regoff_t</b></p>
|
|
</td>
|
|
<td align="left">
|
|
<p class="tent"><i>rm_eo</i></p>
|
|
</td>
|
|
<td align="left">
|
|
<p class="tent">Byte offset from start of <i>string</i> of the first character after the end of substring.</p>
|
|
</td>
|
|
</tr>
|
|
</table>
|
|
</center>
|
|
|
|
<p>The <i>regcomp</i>() function shall compile the regular expression contained in the string pointed to by the <i>pattern</i>
|
|
argument and place the results in the structure pointed to by <i>preg</i>. The <i>cflags</i> argument is the bitwise-inclusive OR
|
|
of zero or more of the following flags, which are defined in the <a href="../basedefs/regex.h.html"><i>&lt;regex.h&gt;</i></a>
|
|
header:</p>
|
|
|
|
<dl compact>
|
|
<dt>REG_EXTENDED</dt>
|
|
|
|
<dd>Use Extended Regular Expressions.</dd>
|
|
|
|
<dt>REG_ICASE</dt>
|
|
|
|
<dd>Ignore case in match. (See the Base Definitions volume of IEEE Std 1003.1-2001, <a href=
|
|
"../basedefs/xbd_chap09.html">Chapter 9, Regular Expressions</a>.)</dd>
|
|
|
|
<dt>REG_NOSUB</dt>
|
|
|
|
<dd>Report only success/fail in <i>regexec</i>().</dd>
|
|
|
|
<dt>REG_NEWLINE</dt>
|
|
|
|
<dd>Change the handling of &lt;newline&gt;s, as described in the text.</dd>
|
|
</dl>
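<p>As an informal illustration of combining <i>cflags</i> values with bitwise-inclusive OR, the following sketch compiles an extended
regular expression with REG_ICASE and REG_NOSUB. The helper name <tt>contains_ci</tt> and its return convention are assumptions made
for this example; they are not defined by this volume of IEEE Std 1003.1-2001.</p>

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the standard): returns 1 if the ERE in
 * pattern matches anywhere in text ignoring case, 0 if it does not match,
 * and -1 if the pattern fails to compile.
 */
int contains_ci(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    /* The three flags are combined with bitwise-inclusive OR. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
        return -1;

    /* With REG_NOSUB set, regexec() ignores nmatch and pmatch. */
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

<p>Because REG_NOSUB is set, only success or failure is reported, which is all this helper needs.</p>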
|
|
|
|
<p>The default regular expression type for <i>pattern</i> is a Basic Regular Expression. The application can specify Extended
|
|
Regular Expressions using the REG_EXTENDED <i>cflags</i> flag.</p>
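<p>The difference between the two regular expression types can be sketched as follows. The helper <tt>matches_with</tt> is a
hypothetical name introduced for this example; it compiles the same pattern with whatever <i>cflags</i> the caller supplies.</p>

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the standard): compile pattern with the
 * given cflags and report whether it matches text. Returns 1 for a match,
 * 0 for no match, and -1 if the pattern fails to compile.
 */
int matches_with(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

<p>With REG_EXTENDED, the pattern <tt>"a|b"</tt> is an alternation and matches the string <tt>"b"</tt>. With <i>cflags</i> of 0 the
pattern is a Basic Regular Expression, in which the vertical line is an ordinary character, so the same pattern matches only text
containing the three characters <tt>"a|b"</tt>.</p>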
|
|
|
|
<p>If the REG_NOSUB flag was not set in <i>cflags</i>, then <i>regcomp</i>() shall set <i>re_nsub</i> to the number of
|
|
parenthesized subexpressions (delimited by <tt>"\(\)"</tt> in basic regular expressions or <tt>"()"</tt> in extended regular
|
|
expressions) found in <i>pattern</i>.</p>
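<p>A minimal sketch of reading <i>re_nsub</i> after a successful compile is shown below. The helper name
<tt>count_subexpressions</tt> is an assumption for illustration only. REG_NOSUB is deliberately not passed, since <i>re_nsub</i> is
only required to be set when that flag is absent.</p>

```c
#include <regex.h>

/*
 * Hypothetical helper (not part of the standard): returns the number of
 * parenthesized subexpressions regcomp() reports for pattern, or -1 if
 * compilation fails. Callers should not include REG_NOSUB in cflags.
 */
long count_subexpressions(const char *pattern, int cflags)
{
    regex_t re;
    long n;

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    n = (long) re.re_nsub;   /* set by regcomp() on success */
    regfree(&re);
    return n;
}
```

<p>For example, the extended regular expression <tt>"(a)(b(c))"</tt> has three subexpressions, while <tt>"(a)"</tt> compiled as a
basic regular expression has none, because unescaped parentheses are ordinary characters there.</p>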
|
|
|
|
<p>The <i>regexec</i>() function compares the null-terminated string specified by <i>string</i> with the compiled regular
|
|
expression <i>preg</i> initialized by a previous call to <i>regcomp</i>(). If it finds a match, <i>regexec</i>() shall return 0;
|
|
otherwise, it shall return non-zero indicating either no match or an error. The <i>eflags</i> argument is the bitwise-inclusive OR
|
|
of zero or more of the following flags, which are defined in the <a href="../basedefs/regex.h.html"><i>&lt;regex.h&gt;</i></a>
|
|
header:</p>
|
|
|
|
<dl compact>
|
|
<dt>REG_NOTBOL</dt>
|
|
|
|
<dd>The first character of the string pointed to by <i>string</i> is not the beginning of the line. Therefore, the circumflex
|
|
character ( <tt>'^'</tt> ), when taken as a special character, shall not match the beginning of <i>string</i>.</dd>
|
|
|
|
<dt>REG_NOTEOL</dt>
|
|
|
|
<dd>The last character of the string pointed to by <i>string</i> is not the end of the line. Therefore, the dollar sign (
|
|
<tt>'$'</tt> ), when taken as a special character, shall not match the end of <i>string</i>.</dd>
|
|
</dl>
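<p>The effect of the two <i>eflags</i> values can be sketched as follows. The helper name <tt>match_with_eflags</tt> and its return
convention are assumptions for this example.</p>

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the standard): compile the BRE in
 * pattern, run it against text with the given eflags, and return 1 for a
 * match, 0 for no match, or -1 if the pattern fails to compile.
 */
int match_with_eflags(const char *pattern, const char *text, int eflags)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, eflags);
    regfree(&re);
    return rc == 0;
}
```

<p>With this helper, <tt>"^abc"</tt> matches <tt>"abc"</tt> when <i>eflags</i> is 0 but not when REG_NOTBOL is given, and
<tt>"abc$"</tt> stops matching <tt>"abc"</tt> when REG_NOTEOL is given. A pattern without anchors is unaffected by either flag.</p>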
|
|
|
|
<p>If <i>nmatch</i> is 0 or REG_NOSUB was set in the <i>cflags</i> argument to <i>regcomp</i>(), then <i>regexec</i>() shall ignore
|
|
the <i>pmatch</i> argument. Otherwise, the application shall ensure that the <i>pmatch</i> argument points to an array with at
|
|
least <i>nmatch</i> elements, and <i>regexec</i>() shall fill in the elements of that array with offsets of the substrings of
|
|
<i>string</i> that correspond to the parenthesized subexpressions of <i>pattern</i>: <i>pmatch</i>[ <i>i</i>]. <i>rm_so</i> shall
|
|
be the byte offset of the beginning and <i>pmatch</i>[ <i>i</i>]. <i>rm_eo</i> shall be one greater than the byte offset of the end
|
|
of substring <i>i</i>. (Subexpression <i>i</i> begins at the <i>i</i>th matched open parenthesis, counting from 1.) Offsets in
|
|
<i>pmatch</i>[0] identify the substring that corresponds to the entire regular expression. Unused elements of <i>pmatch</i> up to
|
|
<i>pmatch</i>[ <i>nmatch</i>-1] shall be filled with -1. If there are more than <i>nmatch</i> subexpressions in <i>pattern</i> (
|
|
<i>pattern</i> itself counts as a subexpression), then <i>regexec</i>() shall still do the match, but shall record only the first
|
|
<i>nmatch</i> substrings.</p>
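<p>The use of <i>pmatch</i> described above can be sketched as follows. The helper <tt>value_of</tt>, the fixed pattern, and the
use of -1 as a failure value are all assumptions made for this example. The array has three elements: element 0 for the whole
match and elements 1 and 2 for the two parenthesized subexpressions.</p>

```c
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of the standard): find the first
 * "name=digits" pair in line and return the digits as a number, or -1 if
 * there is no such pair or the pattern fails to compile.
 */
long value_of(const char *line)
{
    regex_t re;
    regmatch_t pm[3];   /* pm[0]: whole match, pm[1]: name, pm[2]: digits */
    long value = -1;

    if (regcomp(&re, "([a-z]+)=([0-9]+)", REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, line, 3, pm, 0) == 0 && pm[2].rm_so != -1) {
        /* rm_so is a byte offset from the start of line; strtol() stops
         * at the first non-digit, which is the byte at offset rm_eo. */
        value = strtol(line + pm[2].rm_so, NULL, 10);
    }
    regfree(&re);
    return value;
}
```

<p>The offsets are relative to the start of the string passed to <i>regexec</i>(), so <tt>line + pm[2].rm_so</tt> points at the
first byte of the second subexpression's match.</p>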
|
|
|
|
<p>When matching a basic or extended regular expression, any given parenthesized subexpression of <i>pattern</i> might participate
|
|
in the match of several different substrings of <i>string</i>, or it might not match any substring even though the pattern as a
|
|
whole did match. The following rules shall be used to determine which substrings to report in <i>pmatch</i> when matching regular
|
|
expressions:</p>
|
|
|
|
<ol>
|
|
<li>
|
|
<p>If subexpression <i>i</i> in a regular expression is not contained within another subexpression, and it participated in the
|
|
match several times, then the byte offsets in <i>pmatch</i>[ <i>i</i>] shall delimit the last such match.</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>If subexpression <i>i</i> is not contained within another subexpression, and it did not participate in an otherwise successful
|
|
match, the byte offsets in <i>pmatch</i>[ <i>i</i>] shall be -1. A subexpression does not participate in the match when:</p>
|
|
|
|
<blockquote><tt>'*'</tt> or <tt>"\{\}"</tt> appears immediately after the subexpression in a basic regular expression, or
|
|
<tt>'*'</tt> , <tt>'?'</tt> , or <tt>"{}"</tt> appears immediately after the subexpression in an extended regular expression, and
|
|
the subexpression did not match (matched 0 times)</blockquote>
|
|
|
|
<p>or:</p>
|
|
|
|
<blockquote><tt>'|'</tt> is used in an extended regular expression to select this subexpression or another, and the other
|
|
subexpression matched.</blockquote>
|
|
</li>
|
|
|
|
<li>
|
|
<p>If subexpression <i>i</i> is contained within another subexpression <i>j</i>, and <i>i</i> is not contained within any other
|
|
subexpression that is contained within <i>j</i>, and a match of subexpression <i>j</i> is reported in <i>pmatch</i>[ <i>j</i>],
|
|
then the match or non-match of subexpression <i>i</i> reported in <i>pmatch</i>[ <i>i</i>] shall be as described in 1. and 2.
|
|
above, but within the substring reported in <i>pmatch</i>[ <i>j</i>] rather than the whole string. The offsets in <i>pmatch</i>[
|
|
<i>i</i>] are still relative to the start of <i>string</i>.</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>If subexpression <i>i</i> is contained in subexpression <i>j</i>, and the byte offsets in <i>pmatch</i>[ <i>j</i>] are -1, then
|
|
the byte offsets in <i>pmatch</i>[ <i>i</i>] shall also be -1.</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>If subexpression <i>i</i> matched a zero-length string, then both byte offsets in <i>pmatch</i>[ <i>i</i>] shall be the byte
|
|
offset of the character or null terminator immediately following the zero-length string.</p>
|
|
</li>
|
|
</ol>
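<p>The reporting rules above can be observed with a small probe such as the following sketch. The helper <tt>sub_offset</tt>, the
limit of nine array elements, and the -2 failure value are assumptions for this example.</p>

```c
#include <regex.h>
#include <stddef.h>

enum { MAX_SUBS = 9 };   /* arbitrary limit chosen for this sketch */

/*
 * Hypothetical helper (not part of the standard): match the ERE in
 * pattern against text and return rm_so (want_end == 0) or rm_eo
 * (want_end != 0) of pmatch[i]. Returns -2 if i is out of range, the
 * pattern fails to compile, or there is no match.
 */
long sub_offset(const char *pattern, const char *text, size_t i, int want_end)
{
    regex_t re;
    regmatch_t pm[MAX_SUBS];
    long result = -2;

    if (i >= MAX_SUBS || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -2;
    if (regexec(&re, text, MAX_SUBS, pm, 0) == 0)
        result = (long) (want_end ? pm[i].rm_eo : pm[i].rm_so);
    regfree(&re);
    return result;
}
```

<p>For instance, matching <tt>"([ab])*c"</tt> against <tt>"abc"</tt> reports offsets 1 and 2 for subexpression 1 (the last
iteration, rule 1); matching <tt>"(a)|(b)"</tt> against <tt>"b"</tt> reports -1 for subexpression 1 (rule 2); and matching
<tt>"a(b*)c"</tt> against <tt>"ac"</tt> reports 1 for both offsets of subexpression 1 (rule 5).</p>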
|
|
|
|
<p>If, when <i>regexec</i>() is called, the locale is different from when the regular expression was compiled, the result is
|
|
undefined.</p>
|
|
|
|
<p>If REG_NEWLINE is not set in <i>cflags</i>, then a &lt;newline&gt; in <i>pattern</i> or <i>string</i> shall be treated as an
ordinary character. If REG_NEWLINE is set, then &lt;newline&gt; shall be treated as an ordinary character except as follows:</p>
|
|
|
|
<ol>
|
|
<li>
|
|
<p>A &lt;newline&gt; in <i>string</i> shall not be matched by a period outside a bracket expression or by any form of a
|
|
non-matching list (see the Base Definitions volume of IEEE Std 1003.1-2001, <a href="../basedefs/xbd_chap09.html">Chapter
|
|
9, Regular Expressions</a>).</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>A circumflex ( <tt>'^'</tt> ) in <i>pattern</i>, when used to specify expression anchoring (see the Base Definitions volume of
|
|
IEEE Std 1003.1-2001, <a href="../basedefs/xbd_chap09.html#tag_09_03_08">Section 9.3.8, BRE Expression Anchoring</a>),
|
|
shall match the zero-length string immediately after a &lt;newline&gt; in <i>string</i>, regardless of the setting of
|
|
REG_NOTBOL.</p>
|
|
</li>
|
|
|
|
<li>
|
|
<p>A dollar sign ( <tt>'$'</tt> ) in <i>pattern</i>, when used to specify expression anchoring, shall match the zero-length string
|
|
immediately before a &lt;newline&gt; in <i>string</i>, regardless of the setting of REG_NOTEOL.</p>
|
|
</li>
|
|
</ol>
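<p>The three exceptions above can be sketched with the following probe. The helper name <tt>newline_match</tt> and its
<tt>use_newline</tt> parameter are assumptions for this example.</p>

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the standard): compile the ERE in
 * pattern, adding REG_NEWLINE when use_newline is non-zero, and return 1
 * if it matches text, 0 if it does not, or -1 on a compile failure.
 */
int newline_match(const char *pattern, const char *text, int use_newline)
{
    regex_t re;
    int cflags = REG_EXTENDED | REG_NOSUB;
    int rc;

    if (use_newline)
        cflags |= REG_NEWLINE;
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

<p>Against the two-line string <tt>"a\nb"</tt>, the patterns <tt>"^b"</tt> and <tt>"a$"</tt> match only when REG_NEWLINE is set,
whereas <tt>"a.b"</tt> and <tt>"a[^x]b"</tt> match only when it is not set.</p>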
|
|
|
|
<p>The <i>regfree</i>() function frees any memory allocated by <i>regcomp</i>() associated with <i>preg</i>.</p>
|
|
|
|
<p>The following constants are defined as error return values:</p>
|
|
|
|
<dl compact>
|
|
<dt>REG_NOMATCH</dt>
|
|
|
|
<dd><i>regexec</i>() failed to match.</dd>
|
|
|
|
<dt>REG_BADPAT</dt>
|
|
|
|
<dd>Invalid regular expression.</dd>
|
|
|
|
<dt>REG_ECOLLATE</dt>
|
|
|
|
<dd>Invalid collating element referenced.</dd>
|
|
|
|
<dt>REG_ECTYPE</dt>
|
|
|
|
<dd>Invalid character class type referenced.</dd>
|
|
|
|
<dt>REG_EESCAPE</dt>
|
|
|
|
<dd>Trailing <tt>'\'</tt> in pattern.</dd>
|
|
|
|
<dt>REG_ESUBREG</dt>
|
|
|
|
<dd>Number in <tt>"\digit"</tt> invalid or in error.</dd>
|
|
|
|
<dt>REG_EBRACK</dt>
|
|
|
|
<dd><tt>"[]"</tt> imbalance.</dd>
|
|
|
|
<dt>REG_EPAREN</dt>
|
|
|
|
<dd><tt>"\(\)"</tt> or <tt>"()"</tt> imbalance.</dd>
|
|
|
|
<dt>REG_EBRACE</dt>
|
|
|
|
<dd><tt>"\{\}"</tt> imbalance.</dd>
|
|
|
|
<dt>REG_BADBR</dt>
|
|
|
|
<dd>Content of <tt>"\{\}"</tt> invalid: not a number, number too large, more than two numbers, first larger than second.</dd>
|
|
|
|
<dt>REG_ERANGE</dt>
|
|
|
|
<dd>Invalid endpoint in range expression.</dd>
|
|
|
|
<dt>REG_ESPACE</dt>
|
|
|
|
<dd>Out of memory.</dd>
|
|
|
|
<dt>REG_BADRPT</dt>
|
|
|
|
<dd><tt>'?'</tt> , <tt>'*'</tt> , or <tt>'+'</tt> not preceded by valid regular expression.</dd>
|
|
</dl>
|
|
|
|
<p>The <i>regerror</i>() function provides a mapping from error codes returned by <i>regcomp</i>() and <i>regexec</i>() to
|
|
unspecified printable strings. It generates a string corresponding to the value of the <i>errcode</i> argument, which the
|
|
application shall ensure is the last non-zero value returned by <i>regcomp</i>() or <i>regexec</i>() with the given value of
|
|
<i>preg</i>. If <i>errcode</i> is not such a value, the content of the generated string is unspecified.</p>
|
|
|
|
<p>If <i>preg</i> is a null pointer, but <i>errcode</i> is a value returned by a previous call to <i>regexec</i>() or
|
|
<i>regcomp</i>(), the <i>regerror</i>() function still generates an error string corresponding to the value of <i>errcode</i>, but it might
|
|
not be as detailed under some implementations.</p>
|
|
|
|
<p>If the <i>errbuf_size</i> argument is not 0, <i>regerror</i>() shall place the generated string into the buffer of size
|
|
<i>errbuf_size</i> bytes pointed to by <i>errbuf</i>. If the string (including the terminating null) cannot fit in the buffer,
|
|
<i>regerror</i>() shall truncate the string and null-terminate the result.</p>
|
|
|
|
<p>If <i>errbuf_size</i> is 0, <i>regerror</i>() shall ignore the <i>errbuf</i> argument, and return the size of the buffer needed
|
|
to hold the generated string.</p>
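<p>A caller using a fixed-size buffer can detect truncation by comparing the return value of <i>regerror</i>() with the buffer
size, as in the following sketch. The helper name <tt>error_message_fits</tt> is an assumption for this example, and a null
<i>preg</i> is passed for simplicity, which may yield a less detailed message on some implementations.</p>

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the standard): write the message for
 * code into buf (of buf_size bytes, buf_size > 0) and return 1 if the
 * whole message, including the terminating null, fit; 0 if it was
 * truncated.
 */
int error_message_fits(int code, char *buf, size_t buf_size)
{
    /* The return value is the size needed for the complete message. */
    size_t needed = regerror(code, NULL, buf, buf_size);

    return needed <= buf_size;
}
```

<p>When the helper returns 0, <tt>buf</tt> still holds a null-terminated prefix of the message.</p>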
|
|
|
|
<p>If the <i>preg</i> argument to <i>regexec</i>() or <i>regfree</i>() is not a compiled regular expression returned by
|
|
<i>regcomp</i>(), the result is undefined. A <i>preg</i> is no longer treated as a compiled regular expression after it is given to
|
|
<i>regfree</i>().</p>
|
|
</blockquote>
|
|
|
|
<h4><a name="tag_03_603_04"></a>RETURN VALUE</h4>
|
|
|
|
<blockquote>
|
|
<p>Upon successful completion, the <i>regcomp</i>() function shall return 0. Otherwise, it shall return an integer value indicating
|
|
an error as described in <a href="../basedefs/regex.h.html"><i>&lt;regex.h&gt;</i></a>, and the content of <i>preg</i> is
undefined. If a code is returned, the interpretation shall be as given in <a href="../basedefs/regex.h.html"><i>&lt;regex.h&gt;</i></a>.</p>
|
|
|
|
<p>If <i>regcomp</i>() detects an invalid RE, it may return REG_BADPAT, or it may return one of the error codes that more precisely
|
|
describes the error.</p>
|
|
|
|
<p>Upon successful completion, the <i>regexec</i>() function shall return 0. Otherwise, it shall return REG_NOMATCH to indicate no
|
|
match.</p>
|
|
|
|
<p>Upon successful completion, the <i>regerror</i>() function shall return the number of bytes needed to hold the entire generated
|
|
string, including the null termination. If the return value is greater than <i>errbuf_size</i>, the string returned in the buffer
|
|
pointed to by <i>errbuf</i> has been truncated.</p>
|
|
|
|
<p>The <i>regfree</i>() function shall not return a value.</p>
|
|
</blockquote>
|
|
|
|
<h4><a name="tag_03_603_05"></a>ERRORS</h4>
|
|
|
|
<blockquote>
|
|
<p>No errors are defined.</p>
|
|
</blockquote>
|
|
|
|
<hr>
|
|
<div class="box"><em>The following sections are informative.</em></div>
|
|
|
|
<h4><a name="tag_03_603_06"></a>EXAMPLES</h4>
|
|
|
|
<blockquote>
|
|
<pre>
<tt>#include &lt;regex.h&gt;
#include &lt;stddef.h&gt;
<br>
/*
 * Match string against the extended regular expression in
 * pattern, treating errors as no match.
 *
 * Return 1 for match, 0 for no match.
 */
<br>
int
match(const char *string, const char *pattern)
{
    int    status;
    regex_t    re;
<br>
    if (regcomp(&amp;re, pattern, REG_EXTENDED|REG_NOSUB) != 0) {
        return(0);      /* Pattern did not compile: report as no match. */
    }
    status = regexec(&amp;re, string, (size_t) 0, NULL, 0);
    regfree(&amp;re);
    if (status != 0) {
        return(0);      /* No match, or an error during matching. */
    }
    return(1);
}
</tt>
</pre>
|
|
|
|
<p>The following demonstrates how the REG_NOTBOL flag could be used with <i>regexec</i>() to find all substrings in a line that
|
|
match a pattern supplied by a user. (For simplicity of the example, very little error checking is done.)</p>
|
|
|
|
<pre>
<tt>(void) regcomp (&amp;re, pattern, 0);
p = buffer;    /* char *p marks where the next search starts. */
/* This call to regexec() finds the first match on the line. */
error = regexec (&amp;re, p, 1, &amp;pm, 0);
while (error == 0) {  /* While matches found. */
    /* Substring found between p + pm.rm_so and p + pm.rm_eo. */
    p += pm.rm_eo;    /* Offsets are relative to p, not to buffer. */
    if (pm.rm_so == pm.rm_eo) {  /* Empty match: avoid looping forever. */
        if (*p == '\0') break;
        p++;
    }
    /* This call to regexec() finds the next match. */
    error = regexec (&amp;re, p, 1, &amp;pm, REG_NOTBOL);
}
</tt>
</pre>
|
|
</blockquote>
|
|
|
|
<h4><a name="tag_03_603_07"></a>APPLICATION USAGE</h4>
|
|
|
|
<blockquote>
|
|
<p>An application could use:</p>
|
|
|
|
<pre>
|
|
<tt>regerror(code,preg,(char *)NULL,(size_t)0)
|
|
</tt>
|
|
</pre>
|
|
|
|
<p>to find out how big a buffer is needed for the generated string, <a href="../functions/malloc.html"><i>malloc</i>()</a> a buffer
|
|
to hold the string, and then call <i>regerror</i>() again to get the string. Alternatively, it could allocate a fixed, static
|
|
buffer that is big enough to hold most strings, and then use <a href="../functions/malloc.html"><i>malloc</i>()</a> to allocate a
|
|
larger buffer if it finds that this is too small.</p>
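<p>The two-call approach described above can be sketched as follows. The helper name <tt>compile_error_message</tt> and its return
convention are assumptions for this example.</p>

```c
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of the standard): try to compile pattern.
 * Returns NULL if it compiles (or if malloc() fails); otherwise returns a
 * malloc()ed error message that the caller must free().
 */
char *compile_error_message(const char *pattern, int cflags)
{
    regex_t re;
    int code = regcomp(&re, pattern, cflags);
    size_t needed;
    char *msg;

    if (code == 0) {
        regfree(&re);
        return NULL;
    }
    /* First call: errbuf_size of 0 asks only for the required size. */
    needed = regerror(code, &re, NULL, 0);
    msg = malloc(needed);
    if (msg != NULL) {
        /* Second call: fetch the message into the exactly sized buffer. */
        (void) regerror(code, &re, msg, needed);
    }
    return msg;
}
```

<p>The size returned by the first call already includes the terminating null, so no extra byte is added before calling
<i>malloc</i>().</p>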
|
|
|
|
<p>To match a pattern as described in the Shell and Utilities volume of IEEE Std 1003.1-2001, <a href=
|
|
"../utilities/xcu_chap02.html#tag_02_13">Section 2.13, Pattern Matching Notation</a>, use the <a href=
|
|
"../functions/fnmatch.html"><i>fnmatch</i>()</a> function.</p>
|
|
</blockquote>
|
|
|
|
<h4><a name="tag_03_603_08"></a>RATIONALE</h4>
|
|
|
|
<blockquote>
|
|
<p>The <i>regexec</i>() function must fill in all <i>nmatch</i> elements of <i>pmatch</i>, where <i>nmatch</i> and <i>pmatch</i>
|
|
are supplied by the application, even if some elements of <i>pmatch</i> do not correspond to subexpressions in <i>pattern</i>. The
|
|
application writer should note that there is probably no reason for using a value of <i>nmatch</i> that is larger than
|
|
<i>preg</i>-&gt;<i>re_nsub</i>+1.</p>
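<p>One way to size <i>pmatch</i> exactly is to allocate <i>re_nsub</i>+1 elements after compiling, as in the following sketch. The
helper name <tt>count_participating</tt> is an assumption for this example.</p>

```c
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of the standard): match the ERE in
 * pattern against text and return how many pmatch elements, including
 * pmatch[0], describe a real substring. Returns -1 on compile failure,
 * allocation failure, or no match.
 */
int count_participating(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t *pm;
    size_t n, i;
    int count = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    n = re.re_nsub + 1;   /* whole match plus one per subexpression */
    pm = malloc(n * sizeof *pm);
    if (pm != NULL && regexec(&re, text, n, pm, 0) == 0) {
        count = 0;
        for (i = 0; i < n; i++) {
            if (pm[i].rm_so != -1)   /* -1 marks a non-participating entry */
                count++;
        }
    }
    free(pm);
    regfree(&re);
    return count;
}
```

<p>Allocating more than <i>re_nsub</i>+1 elements would only yield additional entries filled with -1.</p>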
|
|
|
|
<p>The REG_NEWLINE flag supports a use of RE matching that is needed in some applications like text editors. In such applications,
|
|
the user supplies an RE asking the application to find a line that matches the given expression. An anchor in such an RE anchors at
|
|
the beginning or end of any line. Such an application can pass a sequence of &lt;newline&gt;-separated lines to <i>regexec</i>() as
a single long string and specify REG_NEWLINE to <i>regcomp</i>() to get the desired behavior. The application must ensure that
there are no explicit &lt;newline&gt;s in <i>pattern</i> if it wants to ensure that any match occurs entirely within a single
|
|
line.</p>
|
|
|
|
<p>The REG_NEWLINE flag affects the behavior of <i>regexec</i>(), but it is in the <i>cflags</i> parameter to <i>regcomp</i>() to
|
|
allow flexibility of implementation. Some implementations will want to generate the same compiled RE in <i>regcomp</i>() regardless
|
|
of the setting of REG_NEWLINE and have <i>regexec</i>() handle anchors differently based on the setting of the flag. Other
|
|
implementations will generate different compiled REs based on the setting of REG_NEWLINE.</p>
|
|
|
|
<p>The REG_ICASE flag supports the operations taken by the <a href="../utilities/grep.html"><i>grep</i></a> <b>-i</b> option and
|
|
the historical implementations of <a href="../utilities/ex.html"><i>ex</i></a> and <a href="../utilities/vi.html"><i>vi</i></a>.
|
|
Including this flag will make it easier for application code to be written that does the same thing as these utilities.</p>
|
|
|
|
<p>The substrings reported in <i>pmatch</i>[] are defined using offsets from the start of the string rather than pointers. Since
|
|
this is a new interface, there should be no impact on historical implementations or applications, and offsets should be just as
|
|
easy to use as pointers. The change to offsets was made to facilitate future extensions in which the string to be searched is
|
|
presented to <i>regexec</i>() in blocks, allowing a string to be searched that is not all in memory at once.</p>
|
|
|
|
<p>The type <b>regoff_t</b> is used for the elements of <i>pmatch</i>[] to ensure that the application can represent either the
|
|
largest possible array in memory (important for an application conforming to the Shell and Utilities volume of
|
|
IEEE Std 1003.1-2001) or the largest possible file (important for an application using the extension where a file is
|
|
searched in chunks).</p>
|
|
|
|
<p>The standard developers rejected the inclusion of a <i>regsub</i>() function that would be used to do substitutions for a
|
|
matched RE. While such a routine would be useful to some applications, its utility would be much more limited than the matching
|
|
function described here. Both RE parsing and substitution are possible to implement without support other than that required by the
|
|
ISO C standard, but matching is much more complex than substituting. The only difficult part of substitution, given the
|
|
information supplied by <i>regexec</i>(), is finding the next character in a string when there can be multi-byte characters. That
|
|
is a much larger issue, and one that needs a more general solution.</p>
|
|
|
|
<p>The <i>errno</i> variable has not been used for error returns to avoid filling the <i>errno</i> name space for this feature.</p>
|
|
|
|
<p>The interface is defined so that the matched substring offsets <i>rm_so</i> and <i>rm_eo</i> are in a separate <b>regmatch_t</b>
|
|
structure instead of in <b>regex_t</b>. This allows a single compiled RE to be used simultaneously in several contexts; in
|
|
<i>main</i>() and a signal handler, perhaps, or in multiple threads or lightweight processes. (The <i>preg</i> argument to
|
|
<i>regexec</i>() is declared with type <b>const</b>, so the implementation is not permitted to use the structure to store
|
|
intermediate results.) It also allows an application to request an arbitrary number of substrings from an RE. The number of
|
|
subexpressions in the RE is reported in <i>re_nsub</i> in <i>preg</i>. With this change to <i>regexec</i>(), consideration was
|
|
given to dropping the REG_NOSUB flag since the user can now specify this with a zero <i>nmatch</i> argument to <i>regexec</i>().
|
|
However, keeping REG_NOSUB allows an implementation to use a different (perhaps more efficient) algorithm if it knows in
|
|
<i>regcomp</i>() that no subexpressions need be reported. The implementation is only required to fill in <i>pmatch</i> if
|
|
<i>nmatch</i> is not zero and if REG_NOSUB is not specified. Note that the <b>size_t</b> type, as defined in the ISO C
|
|
standard, is unsigned, so the description of <i>regexec</i>() does not need to address negative values of <i>nmatch</i>.</p>
|
|
|
|
<p>REG_NOTBOL was added to allow an application to do repeated searches for the same pattern in a line. If the pattern contains a
|
|
circumflex character that should match the beginning of a line, then the pattern should only match when matched against the
|
|
beginning of the line. Without the REG_NOTBOL flag, the application could rewrite the expression for subsequent matches, but in the
|
|
general case this would require parsing the expression. The need for REG_NOTEOL is not as clear; it was added for symmetry.</p>
|
|
|
|
<p>The addition of the <i>regerror</i>() function addresses the historical need for conforming application programs to have access
|
|
to error information more than "Function failed to compile/match your RE for unknown reasons".</p>
|
|
|
|
<p>This interface provides for two different methods of dealing with error conditions. The specific error codes (REG_EBRACE, for
|
|
example), defined in <a href="../basedefs/regex.h.html"><i>&lt;regex.h&gt;</i></a>, allow an application to recover from an error
|
|
if it is so able. Many applications, especially those that use patterns supplied by a user, will not try to deal with specific
|
|
error cases, but will just use <i>regerror</i>() to obtain a human-readable error message to present to the user.</p>
|
|
|
|
<p>The <i>regerror</i>() function uses a scheme similar to <a href="../functions/confstr.html"><i>confstr</i>()</a> to deal with
|
|
the problem of allocating memory to hold the generated string. The scheme used by <a href=
|
|
"../functions/strerror.html"><i>strerror</i>()</a> in the ISO C standard was considered unacceptable since it creates
|
|
difficulties for multi-threaded applications.</p>
|
|
|
|
<p>The <i>preg</i> argument is provided to <i>regerror</i>() to allow an implementation to generate a more descriptive message than
|
|
would be possible with <i>errcode</i> alone. An implementation might, for example, save the character offset of the offending
|
|
character of the pattern in a field of <i>preg</i>, and then include that in the generated message string. The implementation may
|
|
also ignore <i>preg</i>.</p>
|
|
|
|
<p>A REG_FILENAME flag was considered, but omitted. This flag would have caused <i>regexec</i>() to match patterns as described in the Shell
|
|
and Utilities volume of IEEE Std 1003.1-2001, <a href="../utilities/xcu_chap02.html#tag_02_13">Section 2.13, Pattern
|
|
Matching Notation</a> instead of REs. This service is now provided by the <a href="../functions/fnmatch.html"><i>fnmatch</i>()</a>
|
|
function.</p>
|
|
|
|
<p>Notice that there is a difference in philosophy between the ISO POSIX-2:1993 standard and IEEE Std 1003.1-2001 in
|
|
how to handle a "bad" regular expression. The ISO POSIX-2:1993 standard says that many bad constructs "produce undefined
|
|
results", or that "the interpretation is undefined". IEEE Std 1003.1-2001, however, says that the interpretation of
|
|
such REs is unspecified. The term "undefined" means that the action by the application is an error, of similar severity to
|
|
passing a bad pointer to a function.</p>
|
|
|
|
<p>The <i>regcomp</i>() and <i>regexec</i>() functions are required to accept any null-terminated string as the <i>pattern</i>
|
|
argument. If the meaning of the string is "undefined", the behavior of the function is "unspecified".
|
|
IEEE Std 1003.1-2001 does not specify how the functions will interpret the pattern; they might return error codes, or
|
|
they might do pattern matching in some completely unexpected way, but they should not do something like abort the process.</p>
|
|
</blockquote>
|
|
|
|
<h4><a name="tag_03_603_09"></a>FUTURE DIRECTIONS</h4>
|
|
|
|
<blockquote>
|
|
<p>None.</p>
|
|
</blockquote>
|
|
|
|
<h4><a name="tag_03_603_10"></a>SEE ALSO</h4>
|
|
|
|
<blockquote>
|
|
<p><a href="fnmatch.html"><i>fnmatch</i>()</a> , <a href="glob.html"><i>glob</i>()</a> , Shell and Utilities volume of
|
|
IEEE Std 1003.1-2001, <a href="../utilities/xcu_chap02.html#tag_02_13">Section 2.13, Pattern Matching Notation</a>, Base
|
|
Definitions volume of IEEE Std 1003.1-2001, <a href="../basedefs/xbd_chap09.html">Chapter 9, Regular Expressions</a>, <a
|
|
href="../basedefs/regex.h.html"><i>&lt;regex.h&gt;</i></a>, <a href="../basedefs/sys/types.h.html"><i>&lt;sys/types.h&gt;</i></a></p>
|
|
</blockquote>
|
|
|
|
<h4><a name="tag_03_603_11"></a>CHANGE HISTORY</h4>
|
|
|
|
<blockquote>
|
|
<p>First released in Issue 4. Derived from the ISO POSIX-2 standard.</p>
|
|
</blockquote>
|
|
|
|
<h4><a name="tag_03_603_12"></a>Issue 5</h4>
|
|
|
|
<blockquote>
|
|
<p>Moved from POSIX2 C-language Binding to BASE.</p>
|
|
</blockquote>
|
|
|
|
<h4><a name="tag_03_603_13"></a>Issue 6</h4>
|
|
|
|
<blockquote>
|
|
<p>In the SYNOPSIS, the optional include of the <a href="../basedefs/sys/types.h.html"><i>&lt;sys/types.h&gt;</i></a> header is
|
|
removed.</p>
|
|
|
|
<p>The following new requirements on POSIX implementations derive from alignment with the Single UNIX Specification:</p>
|
|
|
|
<ul>
|
|
<li>
|
|
<p>The requirement to include <a href="../basedefs/sys/types.h.html"><i>&lt;sys/types.h&gt;</i></a> has been removed. Although <a
href="../basedefs/sys/types.h.html"><i>&lt;sys/types.h&gt;</i></a> was required for conforming implementations of previous POSIX
|
|
specifications, it was not required for UNIX applications.</p>
|
|
</li>
|
|
</ul>
|
|
|
|
<p>The DESCRIPTION is updated to avoid use of the term "must" for application requirements.</p>
|
|
|
|
<p>The REG_ENOSYS constant is removed.</p>
|
|
|
|
<p>The <b>restrict</b> keyword is added to the <i>regcomp</i>(), <i>regerror</i>(), and <i>regexec</i>() prototypes for alignment
|
|
with the ISO/IEC 9899:1999 standard.</p>
|
|
</blockquote>
|
|
|
|
<div class="box"><em>End of informative text.</em></div>
|
|
|
|
<hr>
|
|
<hr size="2" noshade>
|
|
<center><font size="2"><!--footer start-->
|
|
UNIX ® is a registered Trademark of The Open Group.<br>
|
|
POSIX ® is a registered Trademark of The IEEE.<br>
|
|
[ <a href="../mindex.html">Main Index</a> | <a href="../basedefs/contents.html">XBD</a> | <a href=
|
|
"../utilities/contents.html">XCU</a> | <a href="../functions/contents.html">XSH</a> | <a href="../xrat/contents.html">XRAT</a>
|
|
]</font></center>
|
|
|
|
<!--footer end-->
|
|
<hr size="2" noshade>
|
|
</body>
|
|
</html>
|
|
|