<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link type="text/css" rel="stylesheet" href="style.css"><!-- Generated by The Open Group's rhtm tool v1.2.1 -->
<!-- Copyright (c) 2001 The Open Group, All Rights Reserved -->
<title>Rationale</title>
</head>
<body>

<basefont size="3">

<center><font size="2">The Open Group Base Specifications Issue 6<br>
IEEE Std 1003.1-2001<br>
Copyright © 2001 The IEEE and The Open Group</font></center>

<hr size="2" noshade>
<h3><a name="tag_01_09"></a>Regular Expressions</h3>

<p>Rather than repeating the description of REs for each utility supporting REs, the standard developers preferred a common,
comprehensive description of regular expressions in one place. The most common behavior is described here, and exceptions or
extensions to this are documented for the respective utilities, as appropriate.</p>

<p>The BRE corresponds to the <a href="../utilities/ed.html"><i>ed</i></a> or historical <a href=
"../utilities/grep.html"><i>grep</i></a> type, and the ERE corresponds to the historical <i>egrep</i> type (now <a href=
"../utilities/grep.html"><i>grep</i></a> <b>-E</b>).</p>

<p>The text is based on the <a href="../utilities/ed.html"><i>ed</i></a> description and substantially modified, primarily to aid
developers and others in understanding the capabilities and limitations of REs. Much of this was influenced by
internationalization requirements.</p>

<p>It should be noted that the definitions in this section do not cover the <a href="../utilities/tr.html"><i>tr</i></a> utility;
the <a href="../utilities/tr.html"><i>tr</i></a> syntax does not employ REs.</p>

<p>The specification of REs is particularly important to internationalization because pattern matching is a very basic
operation in business and other applications. The syntax and rules of REs are intended to be as intuitive as possible to make them
easy to understand and use. The historical rules and behavior do not provide that capability to non-English language users, and do
not provide the necessary support for commonly used characters and language constructs. It was necessary to provide extensions to
the historical RE syntax and rules to accommodate other languages.</p>

<p>Because these extensions are limited to bracket expressions, the rationale for them is in the Base Definitions volume of
IEEE Std 1003.1-2001, <a href="../basedefs/xbd_chap09.html#tag_09_03_05">Section 9.3.5, RE Bracket Expression</a>.</p>

<h4><a name="tag_01_09_01"></a>Regular Expression Definitions</h4>

<p>It is possible to determine what strings correspond to subexpressions by recursively applying the leftmost longest rule to each
subexpression, but only with the proviso that the overall match is leftmost longest. For example, matching
<tt>"\(ac*\)c*d[ac]*\1"</tt> against <i>acdacaaa</i> matches <i>acdacaaa</i> (with \1=<i>a</i>); simply matching the longest match
for <tt>"\(ac*\)"</tt> would yield \1=<i>ac</i>, but the overall match would be smaller (<i>acdac</i>). Conceptually, the
implementation must examine every possible match and, among those that yield the leftmost longest total matches, pick the one that
yields the longest match for the leftmost subexpression, and so on. Note that this means that matching by subexpressions is
context-dependent: a subexpression within a larger RE may match a different string from the one it would match as an independent
RE, and two instances of the same subexpression within the same larger RE may match different lengths even in similar sequences of
characters. For example, in the ERE <tt>"(a.*b)(a.*b)"</tt>, the two identical subexpressions would match four and six characters,
respectively, of <i>accbaccccb</i>.</p>
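
<p>The two examples above can be reproduced, with caveats, using Python's <tt>re</tt> module. This is only a sketch: Python uses a
backtracking first-match engine rather than a POSIX matcher, and the BRE is transliterated into Python syntax (unescaped
parentheses); both are assumptions outside the standard. An unanchored search returns the first match the backtracker finds, while
requiring the match to span the whole string recovers the leftmost-longest result and its value for \1.</p>

```python
import re

# Python transliteration of the BRE "\(ac*\)c*d[ac]*\1" (assumption: Python
# uses unescaped parentheses for grouping).
PATTERN = re.compile(r"(ac*)c*d[ac]*\1")
TEXT = "acdacaaa"

# First-match (backtracking) behavior: the group greedily takes "ac" and the
# engine accepts the first overall match it finds.
first = PATTERN.search(TEXT)
first_match, first_group = first.group(0), first.group(1)

# Requiring the match to cover the whole string gives the leftmost-longest
# overall match; the group then has to shrink to "a".
longest = PATTERN.fullmatch(TEXT)
longest_match, longest_group = longest.group(0), longest.group(1)

# ERE "(a.*b)(a.*b)": identical subexpressions matching different lengths.
pair = re.fullmatch(r"(a.*b)(a.*b)", "accbaccccb")
lengths = [len(g) for g in pair.groups()]

print(first_match, first_group)      # acdac ac
print(longest_match, longest_group)  # acdacaaa a
print(lengths)                       # [4, 6]
```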

<p>The definition of single character has been expanded to also include collating elements consisting of two or more characters;
this expansion is applicable only when a bracket expression is included in the BRE or ERE. An example of such a collating element
may be the Dutch <i>ij</i>, which collates as a <tt>'y'</tt>. In some encodings, a ligature "i with j" exists as a character and
would represent a single-character collating element. In another encoding, no such ligature exists, and the two-character sequence
<i>ij</i> is defined as a multi-character collating element. Outside brackets, the <i>ij</i> is treated as a two-character RE and
matches the same characters in a string. Historically, a bracket expression only matched a single character. The
ISO POSIX-2:1993 standard required bracket expressions like <tt>"[^[:lower:]]"</tt> to match multi-character collating
elements such as <tt>"ij"</tt>. However, this requirement led to behavior that many users did not expect and that could not
feasibly be mimicked in user code, and it was rarely if ever implemented correctly. The current standard leaves it unspecified
whether a bracket expression matches a multi-character collating element, allowing both historical and ISO POSIX-2:1993
standard implementations to conform.</p>

<p>Also, in the current standard, it is unspecified whether character class expressions like <tt>"[:lower:]"</tt> can include
multi-character collating elements like <tt>"ij"</tt>; hence <tt>"[[:lower:]]"</tt> can match <tt>"ij"</tt>, and
<tt>"[^[:lower:]]"</tt> can fail to match <tt>"ij"</tt>. Common practice is for a character class expression to match a collating
element if it matches the collating element's first character.</p>

<h4><a name="tag_01_09_02"></a>Regular Expression General Requirements</h4>

<p>The definition of which sequence is matched when several are possible is based on the leftmost-longest rule historically used by
deterministic recognizers. This rule is easier to define and describe, and arguably more useful, than the first-match rule
historically used by non-deterministic recognizers. It is thought that dependencies on the choice of rule are rare; carefully
contrived examples are needed to demonstrate the difference.</p>

<p>A formal expression of the leftmost-longest rule is:</p>

<blockquote>The search is performed as if all possible suffixes of the string were tested for a prefix matching the pattern; the
longest suffix containing a matching prefix is chosen, and the longest possible matching prefix of the chosen suffix is identified
as the matching sequence.</blockquote>
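
<p>The formal statement can be transcribed almost literally into a brute-force search. The sketch below is illustrative only: the
function name <tt>leftmost_longest</tt> is hypothetical, and Python's <tt>re</tt> module (a first-match engine, not a POSIX
matcher) is used merely as an oracle that answers whether a given substring matches the whole pattern.</p>

```python
import re

def leftmost_longest(pattern, text):
    """Return (start, end) of the leftmost-longest match, or None.

    Suffixes are tried from the longest (earliest start) to the shortest;
    for each suffix, prefixes are tried from the longest to the shortest.
    """
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):               # longest suffix first
        for end in range(len(text), start - 1, -1):  # longest prefix first
            if compiled.fullmatch(text, start, end):
                return start, end
    return None

first_match_span = re.search(r"a|ab", "xab").span()
longest_span = leftmost_longest(r"a|ab", "xab")
empty_span = leftmost_longest(r"b*", "abb")

print(first_match_span)  # (1, 2): the first alternative wins
print(longest_span)      # (1, 3): longest match at the leftmost start
print(empty_span)        # (0, 0): leftmost takes priority over longer
```

<p>The pattern <tt>"a|ab"</tt> is one of the contrived cases in which the two rules disagree; the last call shows that an empty
match at the leftmost position is preferred to a longer match further right.</p>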

<p>Historically, most RE implementations only match lines, not strings. However, that is more an effect of the usage than of an
inherent feature of REs themselves. Consequently, IEEE Std 1003.1-2001 does not regard &lt;newline&gt;s as special; they
are ordinary characters, and both a period and a non-matching list can match them. Those utilities (like <a href=
"../utilities/grep.html"><i>grep</i></a>) that do not allow &lt;newline&gt;s to match are responsible for eliminating any
&lt;newline&gt; from strings before matching against the RE. The <a href="../functions/regcomp.html"><i>regcomp</i>()</a> function,
however, can provide support for such processing without violating the rules of this section.</p>

<p>Some implementations of <i>egrep</i> have had very limited flexibility in handling complex EREs. IEEE Std 1003.1-2001
does not attempt to define the complexity of a BRE or ERE, but does place a lower limit on it: any RE must be handled, as long as it
can be expressed in 256 bytes or less. (Of course, this does not place an upper limit on the implementation.) There are historical
programs using a non-deterministic-recognizer implementation that should have no difficulty with this limit. It is possible that a
good approach would be to attempt to use the faster, but more limited, deterministic recognizer for simple expressions and to fall
back on the non-deterministic recognizer for those expressions requiring it. Non-deterministic implementations must be careful to
observe the rules on which match is chosen; the longest match, not the first match, starting at a given character is used.</p>

<p>The term "invalid" highlights a difference between this section and some others: IEEE Std 1003.1-2001 frequently
avoids mandating errors for syntax violations because they can be used by implementors to trigger extensions. However, the
authors of the internationalization features of REs wanted to mandate errors for certain conditions to identify usage problems or
non-portable constructs. These are identified within this rationale as appropriate. The remaining syntax violations have been left
implicitly or explicitly undefined. For example, the BRE construct <tt>"\{1,2,3\}"</tt> does not comply with the grammar. A
conforming application cannot rely on it either producing an error or matching the literal characters <tt>"\{1,2,3\}"</tt>. The term
"undefined" was used in favor of "unspecified" because many of the situations are considered errors on some implementations,
and the standard developers considered that consistency throughout the section was preferable to mixing undefined and
unspecified.</p>

<h4><a name="tag_01_09_03"></a>Basic Regular Expressions</h4>

<p>There is no additional rationale provided for this section.</p>

<h5><a name="tag_01_09_03_01"></a>BREs Matching a Single Character or Collating Element</h5>

<p>There is no additional rationale provided for this section.</p>

<h5><a name="tag_01_09_03_02"></a>BRE Ordinary Characters</h5>

<p>There is no additional rationale provided for this section.</p>

<h5><a name="tag_01_09_03_03"></a>BRE Special Characters</h5>

<p>There is no additional rationale provided for this section.</p>

<h5><a name="tag_01_09_03_04"></a>Periods in BREs</h5>

<p>There is no additional rationale provided for this section.</p>

<h5><a name="tag_01_09_03_05"></a>RE Bracket Expression</h5>

<p>Range expressions are, historically, an integral part of REs. However, the requirements of "natural language behavior" and
portability do conflict. In the POSIX locale, ranges must be treated according to the collating sequence and include those
characters that fall within the range based on that collating sequence, regardless of character values. In other locales, ranges
have unspecified behavior.</p>

<p>Some historical implementations allow range expressions where the ending range point of one range is also the starting point of
the next (for instance, <tt>"[a-m-o]"</tt>). This behavior should not be permitted, but to avoid breaking historical
implementations, it is now <i>undefined</i> whether it is a valid expression and how it should be interpreted.</p>

<p>Current practice in <a href="../utilities/awk.html"><i>awk</i></a> and <a href="../utilities/lex.html"><i>lex</i></a> is to
accept escape sequences in bracket expressions as per the Base Definitions volume of IEEE Std 1003.1-2001, Table 5-1,
Escape Sequences and Associated Actions, while the normal ERE behavior is to regard such a sequence as consisting of two
characters. Allowing the <a href="../utilities/awk.html"><i>awk</i></a>/<a href="../utilities/lex.html"><i>lex</i></a> behavior in
EREs would change the normal behavior in an unacceptable way; it is expected that <a href="../utilities/awk.html"><i>awk</i></a>
and <a href="../utilities/lex.html"><i>lex</i></a> will decode escape sequences in EREs before passing them to <a href=
"../functions/regcomp.html"><i>regcomp</i>()</a> or comparable routines. Each utility describes the escape sequences it accepts as
an exception to the rules in this section; the list is not the same, for historical reasons.</p>

<p>As noted previously, the new syntax and rules have been added to accommodate languages other than English. The remainder of this
section describes the rationale for these modifications.</p>

<p>In the POSIX locale, a regular expression that starts with a range expression matches a set of strings that are contiguously
sorted, but this is not necessarily true in other locales. For example, a French locale might have the following behavior:</p>

<pre>
<b>$</b> <tt>ls
alpha Alpha estimé ESTIMé été eurêka
</tt><b>$</b> <tt>ls [a-e]*
alpha Alpha estimé eurêka
</tt>
</pre>

<p>Such disagreements between matching and contiguous sorting are unavoidable because POSIX sorting cannot be implemented in terms
of a deterministic finite-state automaton (DFA), but range expressions by design are implementable in terms of DFAs.</p>

<p>Historical implementations used native character order to interpret range expressions. The ISO POSIX-2:1993 standard
instead required collating element order (CEO): the order in which collating elements were specified between the <b>order_start</b> and
<b>order_end</b> keywords in the <i>LC_COLLATE</i> category of the current locale. CEO had some advantages in portability over the
native character order, but it also had some disadvantages:</p>

<ul>
<li>
<p>CEO could not feasibly be mimicked in user code, leading to inconsistencies between POSIX matchers and matchers in popular user
programs like Emacs, <i>ksh</i>, and Perl.</p>
</li>

<li>
<p>CEO caused range expressions to match accented and capitalized letters contrary to many users' expectations. For example,
<tt>"[a-e]"</tt> typically matched both <tt>'E'</tt> and <tt>'á'</tt> but neither <tt>'A'</tt> nor <tt>'é'</tt>.</p>
</li>

<li>
<p>CEO was not consistent across implementations. In practice, CEO was often less portable than native character order. For
example, it was common for the CEOs of two implementation-supplied locales to disagree, even if both locales were named
<tt>"da_DK"</tt>.</p>
</li>
</ul>

<p>Because of these problems, some implementations of regular expressions continued to use native character order. Others used the
collation sequence, which is more consistent with sorting than either CEO or native order, but which departs further from the
traditional POSIX semantics because it generally requires <tt>"[a-e]"</tt> to match either <tt>'A'</tt> or <tt>'E'</tt> but not
both. As a result of this kind of implementation variation, programmers who wanted to write portable regular expressions could not
rely on the ISO POSIX-2:1993 standard guarantees in practice.</p>
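
<p>The difference between native character order and a collation sequence can be sketched in a few lines. Everything named here is
hypothetical: the string <tt>COLLATION</tt> is an invented sequence that interleaves lowercase and uppercase letters (real locales
define their own order), and the two helper functions are not part of any standard interface.</p>

```python
# Hypothetical collation sequence that interleaves cases (an assumption for
# illustration; it does not reproduce any real locale).
COLLATION = "aAbBcCdDeEfF"

def in_range_native(ch, lo, hi):
    """Range test by native character (code point) order."""
    return ord(lo) <= ord(ch) <= ord(hi)

def in_range_collation(ch, lo, hi, sequence=COLLATION):
    """Range test by position in a collation sequence."""
    if ch not in sequence:
        return False
    return sequence.index(lo) <= sequence.index(ch) <= sequence.index(hi)

# Evaluate the range a-e under both interpretations.
results = {
    ch: (in_range_native(ch, "a", "e"), in_range_collation(ch, "a", "e"))
    for ch in "AEe"
}
print(results)
```

<p>Under this invented sequence the range a-e includes <tt>'A'</tt> but not <tt>'E'</tt>, whereas native order excludes both,
which is the kind of variation the paragraph above describes.</p>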

<p>While revising the standard, lengthy consideration was given to proposals to attack this problem by adding an API for querying
the CEO to allow user-mode matchers, but none of these proposals had implementation experience and none achieved consensus. Leaving
the standard alone was also considered, but rejected due to the problems described above.</p>

<p>The current standard leaves unspecified the behavior of a range expression outside the POSIX locale. This makes it clearer that
conforming applications should avoid range expressions outside the POSIX locale, and it allows implementations and compatible
user-mode matchers to interpret range expressions using native order, CEO, collation sequence, or other, more advanced techniques.
The concerns which led to this change were raised in IEEE PASC interpretation 1003.2 #43 and others, and related to ambiguities in
the specification of how multi-character collating elements should be handled in range expressions. These ambiguities had led to
multiple, conflicting interpretations of the specification, which in turn led to varying implementations. As noted above, efforts
were made to resolve the differences, but no solution has been found that would be specific enough to allow for portable software
while not invalidating existing implementations.</p>

<p>The standard developers recognize that collating elements are important, such elements being common in several European
languages; for example, <tt>'ch'</tt> or <tt>'ll'</tt> in traditional Spanish; <tt>'aa'</tt> in several Scandinavian languages.
Existing internationalized implementations have processed, and continue to process, these elements in range expressions. Efforts
are expected to continue in the future to find a way to define the behavior of these elements precisely and portably.</p>

<p>The ISO POSIX-2:1993 standard required <tt>"[b-a]"</tt> to be an invalid expression in the POSIX locale, but this
requirement has been relaxed in this version of the standard so that <tt>"[b-a]"</tt> can instead be treated as a valid expression
that does not match any string.</p>

<h5><a name="tag_01_09_03_06"></a>BREs Matching Multiple Characters</h5>

<p>The limit of nine back-references to subexpressions in the RE is based on the use of a single-digit identifier; increasing this
to multiple digits would break historical applications. This does not imply that only nine subexpressions are allowed in REs. The
following is a valid BRE with ten subexpressions:</p>

<pre>
<tt>\(\(\(ab\)*c\)*d\)\(ef\)*\(gh\)\{2\}\(ij\)*\(kl\)*\(mn\)*\(op\)*\(qr\)*
</tt>
</pre>
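
<p>The subexpression count can be checked mechanically. The sketch below transliterates the BRE into Python <tt>re</tt> syntax,
which is an assumption outside the standard: unescaped parentheses and braces stand in for the BRE <tt>"\("</tt>, <tt>"\)"</tt>,
<tt>"\{"</tt>, and <tt>"\}"</tt>. The sample string is likewise invented for illustration.</p>

```python
import re

# Python transliteration of the ten-subexpression BRE shown above.
TEN_GROUPS = re.compile(r"(((ab)*c)*d)(ef)*(gh){2}(ij)*(kl)*(mn)*(op)*(qr)*")

# Number of parenthesized subexpressions in the pattern.
group_count = TEN_GROUPS.groups

# An invented sample string that exercises the first, fifth, and tenth groups.
sample = TEN_GROUPS.fullmatch("abcdefghghqr")
captured = (sample.group(1), sample.group(5), sample.group(10))

print(group_count)  # 10
print(captured)     # ('abcd', 'gh', 'qr')
```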

<p>The standard developers regarded the common historical behavior, which supported <tt>"\n*"</tt>, but not
<tt>"\n\{min,max\}"</tt>, <tt>"\(...\)*"</tt>, or <tt>"\(...\)\{min,max\}"</tt>, as a non-intentional result of a specific
implementation, and they chose to support both duplication and interval expressions following subexpressions and back-references.</p>

<p>The changes to the processing of the back-reference expression remove an unspecified or ambiguous behavior in the Shell and
Utilities volume of IEEE Std 1003.1-2001, aligning it with the requirements specified for the <a href=
"../functions/regcomp.html"><i>regcomp</i>()</a> expression, and are the result of PASC Interpretation 1003.2-92 #43 submitted for
the ISO POSIX-2:1993 standard.</p>

<h5><a name="tag_01_09_03_07"></a>BRE Precedence</h5>

<p>There is no additional rationale provided for this section.</p>

<h5><a name="tag_01_09_03_08"></a>BRE Expression Anchoring</h5>

<p>Often, the dollar sign is viewed as matching the ending &lt;newline&gt; in text files. This is not strictly true; the
&lt;newline&gt; is typically eliminated from the strings to be matched, and the dollar sign matches the terminating null
character.</p>

<p>The ability of <tt>'^'</tt>, <tt>'$'</tt>, and <tt>'*'</tt> to be non-special in certain circumstances may be confusing to
some programmers, but this situation was changed only in a minor way from historical practice to avoid breaking many historical
scripts. Some consideration was given to making the use of the anchoring characters undefined if not escaped and not at the
beginning or end of strings. This would cause a number of historical BREs, such as <tt>"2^10"</tt>, <tt>"$HOME"</tt>, and
<tt>"$1.35"</tt>, that relied on the characters being treated literally, to become invalid.</p>

<p>However, one relatively uncommon case was changed to allow an extension used on some implementations. Historically, the BREs
<tt>"^foo"</tt> and <tt>"\(^foo\)"</tt> did not match the same string, despite the general rule that subexpressions and entire BREs
match the same strings. To increase consensus, IEEE Std 1003.1-2001 has allowed an extension on some implementations to
treat these two cases in the same way by declaring that anchoring <i>may</i> occur at the beginning or end of a subexpression.
Therefore, portable BREs that require a literal circumflex at the beginning or a dollar sign at the end of a subexpression must
escape them. Note that a BRE such as <tt>"a\(^bc\)"</tt> will either match <tt>"a^bc"</tt> or nothing on different systems under
the rules.</p>

<p>ERE anchoring has been different from BRE anchoring in all historical systems. An unescaped anchor character has never matched
its literal counterpart outside a bracket expression. Some implementations treated <tt>"foo$bar"</tt> as a valid expression that
never matched anything; others treated it as invalid. IEEE Std 1003.1-2001 mandates the former, valid unmatched
behavior.</p>
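
<p>The anchoring cases discussed above can be observed on one concrete engine. The sketch uses Python's <tt>re</tt> module, which
is not a POSIX matcher and treats an unescaped circumflex or dollar sign as an anchor everywhere outside a bracket expression; the
BRE <tt>"a\(^bc\)"</tt> is transliterated to Python syntax as an assumption. It shows the "matches nothing" outcome, the escaped
form that the text recommends for portability, and the valid-but-unmatched behavior of <tt>"foo$bar"</tt>.</p>

```python
import re

# "a(^bc)": the circumflex inside the group acts as an anchor here, so the
# expression cannot match after the leading "a".
anchored = re.search(r"a(^bc)", "a^bc")

# Escaping the circumflex requests a literal character, as the text advises
# for portable expressions.
escaped = re.search(r"a(\^bc)", "a^bc")

# "foo$bar": accepted as valid, but the mid-pattern anchor prevents any match.
mid_anchor = re.search(r"foo$bar", "foo$bar")

print(anchored)          # None
print(escaped.group(0))  # a^bc
print(mid_anchor)        # None
```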

<p>Some implementations have extended the BRE syntax to add alternation. For example, the subexpression <tt>"\(foo$\|bar\)"</tt>
would match either <tt>"foo"</tt> at the end of the string or <tt>"bar"</tt> anywhere. The extension is triggered by the use of the
undefined <tt>"\|"</tt> sequence. Because the BRE is undefined for portable scripts, the extending system is free to make other
assumptions, such as that the <tt>'$'</tt> represents the end-of-line anchor in the middle of a subexpression. If it were not for the
extension, the <tt>'$'</tt> would match a literal dollar sign under the rules.</p>

<h4><a name="tag_01_09_04"></a>Extended Regular Expressions</h4>

<p>As with BREs, the standard developers decided to make the interpretation of escaped ordinary characters undefined.</p>

<p>The right parenthesis is not listed as an ERE special character because it is only special in the context of a preceding left
parenthesis. If found without a preceding left parenthesis, the right parenthesis has no special meaning.</p>

<p>The interval expression, <tt>"{m,n}"</tt>, has been added to EREs. Historically, the interval expression has only been
supported in some ERE implementations. The standard developers estimated that the addition of interval expressions to EREs would
not decrease consensus and would also make BREs more of a subset of EREs than in many historical implementations.</p>

<p>It was suggested that, in addition to interval expressions, back-references (<tt>'\n'</tt>) should also be added to EREs. This
was rejected by the standard developers as likely to decrease consensus.</p>

<p>In historical implementations, multiple duplication symbols are usually interpreted from left to right and treated as additive.
As an example, <tt>"a+*b"</tt> matches zero or more instances of <tt>'a'</tt> followed by a <tt>'b'</tt>. In
IEEE Std 1003.1-2001, multiple duplication symbols are undefined; that is, they cannot be relied upon for conforming
applications. One reason for this is to provide some scope for future enhancements.</p>
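
<p>As an illustration of why stacked duplication symbols cannot be relied upon, the sketch below probes one engine, Python's
<tt>re</tt> module, which is not a POSIX implementation and happens to reject <tt>"a+*b"</tt> outright. The helper name
<tt>try_compile</tt> is hypothetical. The single-symbol form <tt>"a*b"</tt> expresses the historical meaning described above
without depending on undefined behavior.</p>

```python
import re

def try_compile(expr):
    """Return a compiled pattern, or None if this engine rejects the syntax."""
    try:
        return re.compile(expr)
    except re.error:
        return None

stacked = try_compile(r"a+*b")   # multiple duplication symbols
portable = try_compile(r"a*b")   # one symbol, same historical meaning

stacked_rejected = stacked is None
portable_results = [bool(portable.fullmatch(s)) for s in ("b", "aaab", "a")]

print(stacked_rejected)   # True
print(portable_results)   # [True, True, False]
```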

<p>The precedence of operations differs between EREs and those in <a href="../utilities/lex.html"><i>lex</i></a>; in <a href=
"../utilities/lex.html"><i>lex</i></a>, for historical reasons, interval expressions have a lower precedence than
concatenation.</p>

<h5><a name="tag_01_09_04_01"></a>EREs Matching a Single Character or Collating Element</h5>

<p>There is no additional rationale provided for this section.</p>

<h5><a name="tag_01_09_04_02"></a>ERE Ordinary Characters</h5>

<p>There is no additional rationale provided for this section.</p>

<h5><a name="tag_01_09_04_03"></a>ERE Special Characters</h5>

<p>There is no additional rationale provided for this section.</p>

<h5><a name="tag_01_09_04_04"></a>Periods in EREs</h5>

<p>There is no additional rationale provided for this section.</p>

<h5><a name="tag_01_09_04_05"></a>ERE Bracket Expression</h5>

<p>There is no additional rationale provided for this section.</p>

<h5><a name="tag_01_09_04_06"></a>EREs Matching Multiple Characters</h5>

<p>There is no additional rationale provided for this section.</p>

<h5><a name="tag_01_09_04_07"></a>ERE Alternation</h5>

<p>There is no additional rationale provided for this section.</p>

<h5><a name="tag_01_09_04_08"></a>ERE Precedence</h5>

<p>There is no additional rationale provided for this section.</p>

<h5><a name="tag_01_09_04_09"></a>ERE Expression Anchoring</h5>

<p>There is no additional rationale provided for this section.</p>

<h4><a name="tag_01_09_05"></a>Regular Expression Grammar</h4>

<p>The grammars are intended to represent the range of acceptable syntaxes available to conforming applications. There are
instances in the text where undefined constructs are described; as explained previously, these allow implementation extensions.
There is no intended requirement that an implementation extension must somehow fit into the grammars shown here.</p>

<p>The BRE grammar does not permit L_ANCHOR or R_ANCHOR inside <tt>"\("</tt> and <tt>"\)"</tt> (which implies that <tt>'^'</tt> and
<tt>'$'</tt> are ordinary characters). This reflects the semantic limits on the application, as noted in the Base Definitions
volume of IEEE Std 1003.1-2001, <a href="../basedefs/xbd_chap09.html#tag_09_03_08">Section 9.3.8, BRE Expression
Anchoring</a>. Implementations are permitted to extend the language to interpret <tt>'^'</tt> and <tt>'$'</tt> as anchors in these
locations, and as such, conforming applications cannot use unescaped <tt>'^'</tt> and <tt>'$'</tt> in positions inside
<tt>"\("</tt> and <tt>"\)"</tt> that might be interpreted as anchors.</p>

<p>The ERE grammar does not permit several constructs that the Base Definitions volume of IEEE Std 1003.1-2001, <a href=
"../basedefs/xbd_chap09.html#tag_09_04_02">Section 9.4.2, ERE Ordinary Characters</a> and the Base Definitions volume of
IEEE Std 1003.1-2001, <a href="../basedefs/xbd_chap09.html#tag_09_04_03">Section 9.4.3, ERE Special Characters</a>
specify as having undefined results:</p>

<ul>
<li>
<p>ORD_CHAR preceded by <tt>'\'</tt></p>
</li>

<li>
<p><i>ERE_dupl_symbol</i>(s) appearing first in an ERE, or immediately following <tt>'|'</tt>, <tt>'^'</tt>, or <tt>'('</tt></p>
</li>

<li>
<p><tt>'{'</tt> not part of a valid <i>ERE_dupl_symbol</i></p>
</li>

<li>
<p><tt>'|'</tt> appearing first or last in an ERE, or immediately following <tt>'|'</tt> or <tt>'('</tt>, or immediately preceding
<tt>')'</tt></p>
</li>
</ul>

<p>Implementations are permitted to extend the language to allow these. Conforming applications cannot use such constructs.</p>

<h5><a name="tag_01_09_05_01"></a>BRE/ERE Grammar Lexical Conventions</h5>

<p>There is no additional rationale provided for this section.</p>

<h5><a name="tag_01_09_05_02"></a>RE and Bracket Expression Grammar</h5>

<p>The removal of the <i>Back_open_paren</i> <i>Back_close_paren</i> option from the <i>nondupl_RE</i> specification is the result
of PASC Interpretation 1003.2-92 #43 submitted for the ISO POSIX-2:1993 standard. Although the grammar required support for
null subexpressions, this section does not describe the meaning of, and historical practice did not support, this construct.</p>

<h5><a name="tag_01_09_05_03"></a>ERE Grammar</h5>

<p>There is no additional rationale provided for this section.</p>

<hr size="2" noshade>
<center><font size="2"><!--footer start-->
UNIX ® is a registered Trademark of The Open Group.<br>
POSIX ® is a registered Trademark of The IEEE.<br>
[ <a href="../mindex.html">Main Index</a> | <a href="../basedefs/contents.html">XBD</a> | <a href=
"../utilities/contents.html">XCU</a> | <a href="../functions/contents.html">XSH</a> | <a href="../xrat/contents.html">XRAT</a>
]</font></center>

<!--footer end-->
<hr size="2" noshade>
</body>
</html>