<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link type="text/css" rel="stylesheet" href="style.css"><!-- Generated by The Open Group's rhtm tool v1.2.1 -->
<!-- Copyright (c) 2001 The Open Group, All Rights Reserved -->
<title>Rationale</title>
</head>
<body>

<basefont size="3">

<center><font size="2">The Open Group Base Specifications Issue 6<br>
IEEE Std 1003.1-2001<br>
Copyright © 2001 The IEEE and The Open Group</font></center>

<hr size="2" noshade>
<h3><a name="tag_01_06"></a>Character Set</h3>

<h4><a name="tag_01_06_01"></a>Portable Character Set</h4>

<p>The portable character set is listed in full so there is no dependency on the ISO/IEC 646:1991 standard (or historically
ASCII) encoded character set, although the set is identical to the characters defined in the International Reference version of the
ISO/IEC 646:1991 standard.</p>

<p>IEEE Std 1003.1-2001 poses no requirement that multiple character sets or codesets be supported, leaving this as a
marketing differentiation for implementors. Although multiple charmap files are supported, it is the responsibility of the
implementation to provide the file(s); if only one is provided, only that one will be accessible using the <a href=
"../utilities/localedef.html"><i>localedef</i></a> <b>-f</b> option.</p>

<p>The statement about invariance in codesets for the portable character set is worded to avoid precluding implementations where
multiple incompatible codesets are available (for instance, ASCII and EBCDIC). The standard utilities cannot be expected to produce
predictable results if they access portable characters that vary on the same implementation.</p>

<p>Not all character sets need include the portable character set, but each locale must include it. For example, a Japanese-based
locale might be supported by a mixture of character sets: JIS X 0201 Roman (a Japanese version of the
ISO/IEC 646:1991 standard), JIS X 0208, and JIS X 0201 Katakana. Not all of these character sets include
the portable characters, but at least one does (JIS X 0201 Roman).</p>

<h4><a name="tag_01_06_02"></a>Character Encoding</h4>

<p>Encoding mechanisms based on single shifts, such as the EUC encoding used in some Asian and other countries, can be supported
via the current charmap mechanism. With single-shift encoding, each character is preceded by a shift code (SS2 or SS3). A complete
EUC code, consisting of the portable character set (G0) and up to three additional character sets (G1, G2, G3), can be described
using the current charmap mechanism; the encoding for each character in additional character sets G2 and G3 must then include its
single-shift code. Other mechanisms to support locales based on encoding mechanisms such as locking shift are not addressed by this
volume of IEEE Std 1003.1-2001.</p>
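
<p>As an illustration of the single-shift scheme described above, the following minimal Python sketch splits an EUC byte string
into characters and labels each with its character set. The byte counts assume the EUC-JP layout (two-byte G1, SS2 plus one byte
for G2, SS3 plus two bytes for G3); the names <tt>split_euc</tt> and <tt>TRAILING</tt> are hypothetical and are not defined by
IEEE Std 1003.1-2001.</p>

```python
SS2, SS3 = 0x8E, 0x8F  # single-shift codes that introduce G2 and G3 characters

# Number of bytes that follow the first byte of a character.
# Assumption: EUC-JP layout; other EUC variants use different counts.
TRAILING = {"G1": 1, "G2": 1, "G3": 2}


def split_euc(data: bytes):
    """Split an EUC byte string into (character_set, encoded_bytes) pairs."""
    out = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            name, length = "G0", 1                    # portable character set
        elif first == SS2:
            name, length = "G2", 1 + TRAILING["G2"]   # shift code is part of the encoding
        elif first == SS3:
            name, length = "G3", 1 + TRAILING["G3"]
        else:
            # Sketch only: any other high-bit byte is treated as a G1 lead byte.
            name, length = "G1", 1 + TRAILING["G1"]
        chunk = data[i:i + length]
        if len(chunk) != length:
            raise ValueError("truncated character at offset %d" % i)
        out.append((name, chunk))
        i += length
    return out
```

<p>Note that the G2 and G3 entries returned by the sketch keep their SS2 or SS3 byte, mirroring the requirement that a charmap
encoding for those characters include the single-shift code.</p>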

<h4><a name="tag_01_06_03"></a>C Language Wide-Character Codes</h4>

<p>There is no additional rationale provided for this section.</p>

<h4><a name="tag_01_06_04"></a>Character Set Description File</h4>

<p>IEEE PASC Interpretation 1003.2 #196 is applied, removing three lines of text dealing with ranges of symbolic names using
position constant values which had been erroneously included in the final IEEE P1003.2b draft standard.</p>

<h5><a name="tag_01_06_04_01"></a>State-Dependent Character Encodings</h5>

<p>A requirement was considered that would force utilities to eliminate any redundant locking shifts, but this was left as a
quality of implementation issue.</p>

<p>This change satisfies the following requirement from the ISO POSIX-2:1993 standard, Annex H.1:</p>

<blockquote><i>The support of state-dependent (shift encoding) character sets should be addressed fully. See descriptions of these
in the Base Definitions volume of IEEE Std 1003.1-2001, Section 6.2, Character Encoding. If such character encodings are
supported, it is expected that this will impact the Base Definitions volume of IEEE Std 1003.1-2001, Section 6.2,
Character Encoding, the Base Definitions volume of IEEE Std 1003.1-2001, <a href="../basedefs/xbd_chap07.html">Chapter 7,
Locale</a>, the Base Definitions volume of IEEE Std 1003.1-2001, <a href="../basedefs/xbd_chap09.html">Chapter 9, Regular
Expressions</a>, and the <a href="../utilities/comm.html"><i>comm</i></a>, <a href="../utilities/cut.html"><i>cut</i></a>, <a
href="../utilities/diff.html"><i>diff</i></a>, <a href="../utilities/grep.html"><i>grep</i></a>, <a href=
"../utilities/head.html"><i>head</i></a>, <a href="../utilities/join.html"><i>join</i></a>, <a href=
"../utilities/paste.html"><i>paste</i></a>, and <a href="../utilities/tail.html"><i>tail</i></a> utilities.</i></blockquote>

<p>The character set description file provides:</p>

<ul>
<li>
<p>The capability to describe character set attributes (such as collation order or character classes) independent of character set
encoding, and using only the characters in the portable character set. This makes it possible to create generic <a href=
"../utilities/localedef.html"><i>localedef</i></a> source files for all codesets that share the portable character set (such as the
ISO 8859 family or IBM Extended ASCII).</p>
</li>

<li>
<p>Standardized symbolic names for all characters in the portable character set, making it possible to refer to any such character
regardless of encoding.</p>
</li>
</ul>

<p>Implementations are free to choose their own symbolic names, as long as the names identified by the Base Definitions volume of
IEEE Std 1003.1-2001 are also defined; this provides support for already existing "character names".</p>

<p>The names selected for the members of the portable character set follow the ISO/IEC 8859-1:1998 standard and the
ISO/IEC 10646-1:2000 standard. However, several commonly used UNIX system names occur as synonyms in the list:</p>

<ul>
<li>
<p>The historical UNIX system names are used for control characters.</p>
</li>

<li>
<p>The word "slash" is given in addition to "solidus".</p>
</li>

<li>
<p>The word "backslash" is given in addition to "reverse-solidus".</p>
</li>

<li>
<p>The word "hyphen" is given in addition to "hyphen-minus".</p>
</li>

<li>
<p>The word "period" is given in addition to "full-stop".</p>
</li>

<li>
<p>For digits, the word "digit" is eliminated.</p>
</li>

<li>
<p>For letters, the words "Latin Capital Letter" and "Latin Small Letter" are eliminated.</p>
</li>

<li>
<p>The words "left brace" and "right brace" are given in addition to "left-curly-bracket" and "right-curly-bracket".</p>
</li>

<li>
<p>The names of the digits are preferred over the numbers to avoid possible confusion between <tt>'0'</tt> and <tt>'O'</tt>, and
between <tt>'1'</tt> and <tt>'l'</tt> (one and the letter ell).</p>
</li>
</ul>

<p>The names for the control characters in the Base Definitions volume of IEEE Std 1003.1-2001, <a href=
"../basedefs/xbd_chap06.html">Chapter 6, Character Set</a> were taken from the ISO/IEC 4873:1991 standard.</p>

<p>The charmap file was introduced to resolve portability problems, especially those affecting <a href=
"../utilities/localedef.html"><i>localedef</i></a> sources. IEEE Std 1003.1-2001 assumes that the portable character set
is constant across all locales, but does not prohibit implementations from supporting two incompatible codings, such as both ASCII
and EBCDIC. Such dual-support implementations should have all charmaps and <a href=
"../utilities/localedef.html"><i>localedef</i></a> sources encoded using one portable character set, in effect cross-compiling for
the other environment. Naturally, charmaps (and <a href="../utilities/localedef.html"><i>localedef</i></a> sources) are only
portable without transformation between systems using the same encodings for the portable character set. They can, however, be
transformed between two sets using only a subset of the actual characters (the portable character set). Even so, the particular
coded character set used for an application or an implementation does not necessarily imply different characteristics or collation;
on the contrary, these attributes should in many cases be identical, regardless of codeset. The charmap provides the capability to
define a common locale definition for multiple codesets (the same <a href="../utilities/localedef.html"><i>localedef</i></a> source
can be used for codesets with different extended characters; the ability in the charmap to define empty names allows for characters
missing in certain codesets).</p>
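
<p>The separation between symbolic names and encodings can be sketched in a few lines of Python. The two dictionaries below are
tiny hand-written excerpts standing in for charmaps, not complete charmap files; the EBCDIC values are those of IBM code page 037,
and the names <tt>ASCII_CHARMAP</tt>, <tt>EBCDIC_CHARMAP</tt>, and <tt>encode</tt> are hypothetical.</p>

```python
# Illustrative excerpts only: symbolic name -> encoded byte(s).
ASCII_CHARMAP = {"<A>": b"\x41", "<zero>": b"\x30", "<slash>": b"\x2f"}
EBCDIC_CHARMAP = {"<A>": b"\xc1", "<zero>": b"\xf0", "<slash>": b"\x61"}  # code page 037


def encode(symbols, charmap):
    """Encode a sequence of symbolic names using one charmap.

    A name that the charmap does not define raises KeyError, which is how
    this sketch surfaces a character missing from a codeset.
    """
    return b"".join(charmap[name] for name in symbols)


# One codeset-independent "source" expressed purely in symbolic names.
source = ["<slash>", "<A>", "<zero>"]
ascii_bytes = encode(source, ASCII_CHARMAP)
ebcdic_bytes = encode(source, EBCDIC_CHARMAP)
```

<p>The same list of symbolic names yields different byte sequences under each charmap, which is the sense in which one source can
serve several codesets.</p>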

<p>The <b>&lt;escape_char&gt;</b> declaration was added at the request of the international community to ease the creation of
portable charmap files on terminals not implementing the default backslash escape. The <b>&lt;comment_char&gt;</b> declaration was
added at the request of the international community to eliminate the potential confusion between the number sign and the pound
sign.</p>
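
<p>A rough sketch of how a charmap reader might honor these two declarations follows. It assumes a simplified charmap layout
(declarations, then a <tt>CHARMAP</tt> ... <tt>END CHARMAP</tt> body with one name and one single-constant encoding per line) and
ignores ranges and multi-byte encodings; <tt>parse_charmap</tt> and <tt>SAMPLE</tt> are hypothetical names.</p>

```python
def parse_charmap(text):
    """Return (escape_char, comment_char, {symbolic_name: encoding}).

    Encodings are normalized so that the default backslash replaces
    whatever escape character the file declared.
    """
    escape, comment = "\\", "#"   # defaults when no declaration is present
    mapping = {}
    in_body = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(comment):
            continue              # blank line or comment line
        fields = line.split()
        if not in_body:
            if fields[0] == "<escape_char>":
                escape = fields[1]
            elif fields[0] == "<comment_char>":
                comment = fields[1]
            elif fields[0] == "CHARMAP":
                in_body = True
            continue
        if fields[:2] == ["END", "CHARMAP"]:
            break
        mapping[fields[0]] = fields[1].replace(escape, "\\")
    return escape, comment, mapping


# Sample using '/' as the escape and '%' as the comment character.
SAMPLE = """\
<comment_char> %
<escape_char> /
% a comment introduced by the redefined comment character
CHARMAP
<A>     /x41
<slash> /x2f
END CHARMAP
"""

escape, comment, mapping = parse_charmap(SAMPLE)
```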

<p>The octal number notation with no leading zero required was selected to match that of <a href=
"../utilities/awk.html"><i>awk</i></a> and <a href="../utilities/tr.html"><i>tr</i></a> and is consistent with that used by <a
href="../utilities/localedef.html"><i>localedef</i></a>. To avoid confusion between an octal constant and the back-references used
in <a href="../utilities/localedef.html"><i>localedef</i></a> source, the octal, hexadecimal, and decimal constants must contain at
least two digits. As single-digit constants are relatively rare, this should not impose any significant hardship. Provision is made
for more digits to account for systems in which the byte size is larger than 8 bits. For example, a Unicode
(ISO/IEC 10646-1:2000 standard) system that has defined 16-bit bytes may require six octal, four hexadecimal, and five decimal
digits.</p>
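
<p>The two-digit rule and the digit counts quoted for 16-bit bytes can be checked with a short Python sketch. It assumes the
default backslash escape and the three constant forms (<tt>\d</tt> decimal, <tt>\x</tt> hexadecimal, bare octal); the names
<tt>CONSTANT</tt> and <tt>parse_constant</tt> are hypothetical, and no upper limit on the number of digits is enforced here.</p>

```python
import re

# One escape character followed by a decimal, hexadecimal, or octal constant,
# each with at least two digits.
CONSTANT = re.compile(
    r"\\(?:d(?P<dec>[0-9]{2,})|x(?P<hex>[0-9A-Fa-f]{2,})|(?P<oct>[0-7]{2,}))"
)


def parse_constant(token):
    """Return the integer value of a single charmap encoding constant."""
    m = CONSTANT.fullmatch(token)
    if m is None:
        raise ValueError("not a valid constant: %r" % token)
    if m.group("dec") is not None:
        return int(m.group("dec"), 10)
    if m.group("hex") is not None:
        return int(m.group("hex"), 16)
    return int(m.group("oct"), 8)
```

<p>With this sketch, <tt>\101</tt>, <tt>\x41</tt>, and <tt>\d65</tt> all yield 65, a single-digit form such as <tt>\7</tt> is
rejected, and the largest 16-bit value needs six octal, four hexadecimal, or five decimal digits (<tt>\177777</tt>,
<tt>\xffff</tt>, <tt>\d65535</tt>).</p>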

<p>The decimal notation is supported because some newer international standards define character values in decimal, rather than in
the old column/row notation.</p>

<p>The charmap identifies the coded character sets supported by an implementation. At least one charmap must be provided, but no
implementation is required to provide more than one. Likewise, implementations can allow users to generate new charmaps (for
instance, for a new version of the ISO 8859 family of coded character sets), but do not have to do so. If users are allowed
to create new charmaps, the system documentation describes the rules that apply (for instance, "only coded character sets that are
supersets of the ISO/IEC 646:1991 standard IRV, no multi-byte characters").</p>

<p>This addition of the <b>WIDTH</b> specification satisfies the following requirement from the ISO POSIX-2:1993 standard,
Annex H.1:</p>

<blockquote>
<dl compact>
<dt>(9)</dt>

<dd><i>The definition of column position relies on the implementation's knowledge of the integral width of the characters. The
charmap or</i> LC_CTYPE locale definitions should be enhanced to allow application specification of these widths.</dd>
</dl>
</blockquote>

<p>The character "width" information was first considered for inclusion under <i>LC_CTYPE</i> but was moved because it is more
closely associated with the information in the charmap than information in the locale source (cultural conventions information).
Concerns were raised that formalizing this type of information is moving the locale source definition from the codeset-independent
entity that it was designed to be to a repository of codeset-specific information. A similar issue occurred with the
<b>&lt;code_set_name&gt;</b>, <b>&lt;mb_cur_max&gt;</b>, and <b>&lt;mb_cur_min&gt;</b> information, which was resolved to reside in
the charmap definition.</p>

<p>The width definition was added to the IEEE P1003.2b draft standard with the intent that the <a href=
"../functions/wcswidth.html"><i>wcswidth</i>()</a> and/or <a href="../functions/wcwidth.html"><i>wcwidth</i>()</a> functions
(currently specified in the System Interfaces volume of IEEE Std 1003.1-2001) be the mechanism to retrieve the character
width information.</p>
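
<p>To make the relationship between charmap width data and column positions concrete, the following Python sketch reads a
simplified width section and sums the widths of a string given as symbolic names. It assumes a <tt>WIDTH</tt> ... <tt>END
WIDTH</tt> block with one name and one width per line, an optional <tt>WIDTH_DEFAULT</tt> line, and a default width of 1 when none
is given; ranges are not handled. The symbolic names <tt>&lt;j0101&gt;</tt> and <tt>&lt;j0102&gt;</tt> and the functions
<tt>parse_widths</tt> and <tt>string_width</tt> are hypothetical.</p>

```python
def parse_widths(text):
    """Return (default_width, {symbolic_name: width}) from charmap-style text."""
    default, widths, in_width = 1, {}, False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "WIDTH_DEFAULT":
            default = int(fields[1])
        elif fields == ["WIDTH"]:
            in_width = True
        elif fields == ["END", "WIDTH"]:
            in_width = False
        elif in_width:
            widths[fields[0]] = int(fields[1])
    return default, widths


def string_width(symbols, default, widths):
    """Column positions occupied by a sequence of symbolic names,
    loosely analogous to what wcswidth() reports for a wide string."""
    return sum(widths.get(name, default) for name in symbols)


SAMPLE_WIDTHS = """\
WIDTH
<j0101> 2
<j0102> 2
END WIDTH
WIDTH_DEFAULT 1
"""

default, widths = parse_widths(SAMPLE_WIDTHS)
columns = string_width(["<A>", "<j0101>", "<j0102>"], default, widths)
```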

<hr size="2" noshade>
<center><font size="2"><!--footer start-->
UNIX ® is a registered Trademark of The Open Group.<br>
POSIX ® is a registered Trademark of The IEEE.<br>
[ <a href="../mindex.html">Main Index</a> | <a href="../basedefs/contents.html">XBD</a> | <a href=
"../utilities/contents.html">XCU</a> | <a href="../functions/contents.html">XSH</a> | <a href="../xrat/contents.html">XRAT</a>
]</font></center>

<!--footer end-->
<hr size="2" noshade>
</body>
</html>