<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link type="text/css" rel="stylesheet" href="style.css"><!-- Generated by The Open Group's rhtm tool v1.2.1 -->
<!-- Copyright (c) 2001 The Open Group, All Rights Reserved -->
<title>Rationale</title>
</head>
<body>

<basefont size="3">

<center><font size="2">The Open Group Base Specifications Issue 6<br>
IEEE Std 1003.1-2001<br>
Copyright &copy; 2001 The IEEE and The Open Group</font></center>

<hr size="2" noshade>
<h3><a name="tag_01_06"></a>Character Set</h3>

<h4><a name="tag_01_06_01"></a>Portable Character Set</h4>

<p>The portable character set is listed in full so there is no dependency on the ISO/IEC&nbsp;646:1991 standard (or historically
ASCII) encoded character set, although the set is identical to the characters defined in the International Reference version of the
ISO/IEC&nbsp;646:1991 standard.</p>

<p>IEEE&nbsp;Std&nbsp;1003.1-2001 imposes no requirement that multiple character sets or codesets be supported, leaving this as a
marketing differentiation for implementors. Although multiple charmap files are supported, it is the responsibility of the
implementation to provide the file(s); if only one is provided, only that one will be accessible using the <a href=
"../utilities/localedef.html"><i>localedef</i></a> <b>-f</b> option.</p>

<p>The statement about invariance in codesets for the portable character set is worded to avoid precluding implementations where
multiple incompatible codesets are available (for instance, ASCII and EBCDIC). The standard utilities cannot be expected to produce
predictable results if they access portable characters that vary on the same implementation.</p>

<p>Not all character sets need include the portable character set, but each locale must include it. For example, a Japanese-based
locale might be supported by a mixture of character sets: JIS&nbsp;X&nbsp;0201 Roman (a Japanese version of the
ISO/IEC&nbsp;646:1991 standard), JIS&nbsp;X&nbsp;0208, and JIS&nbsp;X&nbsp;0201 Katakana. Not all of these character sets include
the portable characters, but at least one does (JIS&nbsp;X&nbsp;0201 Roman).</p>

<h4><a name="tag_01_06_02"></a>Character Encoding</h4>

<p>Encoding mechanisms based on single shifts, such as the EUC encoding used in some Asian and other countries, can be supported
via the current charmap mechanism. With single-shift encoding, each character is preceded by a shift code (SS2 or SS3). A complete
EUC code, consisting of the portable character set (G0) and up to three additional character sets (G1, G2, G3), can be described
using the current charmap mechanism; the encoding for each character in additional character sets G2 and G3 must then include their
single-shift code. Other mechanisms to support locales based on encoding mechanisms such as locking shift are not addressed by this
volume of IEEE&nbsp;Std&nbsp;1003.1-2001.</p>
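
<p>As an illustration of the single-shift scheme described above, the following minimal sketch classifies encoded characters by
their first byte. It assumes EUC-JP as one concrete EUC codeset (this rationale does not name a particular one) and relies on
Python's built-in <tt>euc_jp</tt> codec; the helper name <tt>euc_codeset</tt> and the sample labels are hypothetical.</p>

```python
# EUC-JP is used only as one concrete EUC codeset; the choice is an assumption.
SS2 = 0x8E  # single-shift 2: introduces one character from G2
SS3 = 0x8F  # single-shift 3: introduces one character from G3


def euc_codeset(encoded: bytes) -> str:
    """Classify one encoded EUC character by its first byte."""
    first = encoded[0]
    if first == SS2:
        return "G2"
    if first == SS3:
        return "G3"
    if first >= 0xA1:
        return "G1"
    return "G0"


samples = {
    "A": "A".encode("euc_jp"),                          # portable character set
    "hiragana a": "\u3042".encode("euc_jp"),            # JIS X 0208
    "halfwidth katakana a": "\uff71".encode("euc_jp"),  # JIS X 0201 Katakana
}

for label, encoded in samples.items():
    print(label, encoded.hex(), euc_codeset(encoded))
```

<p>A charmap describing such a codeset would list the G2 and G3 encodings with the shift byte included, which is why no separate
shift mechanism is needed in the charmap itself.</p>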

<h4><a name="tag_01_06_03"></a>C Language Wide-Character Codes</h4>

<p>There is no additional rationale provided for this section.</p>

<h4><a name="tag_01_06_04"></a>Character Set Description File</h4>

<p>IEEE PASC Interpretation 1003.2 #196 is applied, removing three lines of text that had been erroneously included in the final
IEEE&nbsp;P1003.2b draft standard; the text dealt with ranges of symbolic names using position constant values.</p>

<h5><a name="tag_01_06_04_01"></a>State-Dependent Character Encodings</h5>

<p>A requirement was considered that would force utilities to eliminate any redundant locking shifts, but this was left as a
quality of implementation issue.</p>

<p>This change satisfies the following requirement from the ISO&nbsp;POSIX-2:1993 standard, Annex H.1:</p>

<blockquote><i>The support of state-dependent (shift encoding) character sets should be addressed fully. See descriptions of these
in the Base Definitions volume of IEEE&nbsp;Std&nbsp;1003.1-2001, Section 6.2, Character Encoding. If such character encodings are
supported, it is expected that this will impact the Base Definitions volume of IEEE&nbsp;Std&nbsp;1003.1-2001, Section 6.2,
Character Encoding, the Base Definitions volume of IEEE&nbsp;Std&nbsp;1003.1-2001, <a href="../basedefs/xbd_chap07.html">Chapter 7,
Locale</a>, the Base Definitions volume of IEEE&nbsp;Std&nbsp;1003.1-2001, <a href="../basedefs/xbd_chap09.html">Chapter 9, Regular
Expressions</a>, and the <a href="../utilities/comm.html"><i>comm</i></a>, <a href="../utilities/cut.html"><i>cut</i></a>, <a
href="../utilities/diff.html"><i>diff</i></a>, <a href="../utilities/grep.html"><i>grep</i></a>, <a href=
"../utilities/head.html"><i>head</i></a>, <a href="../utilities/join.html"><i>join</i></a>, <a href=
"../utilities/paste.html"><i>paste</i></a>, and <a href="../utilities/tail.html"><i>tail</i></a> utilities.</i></blockquote>

<p>The character set description file provides:</p>

<ul>
<li>
<p>The capability to describe character set attributes (such as collation order or character classes) independent of character set
encoding, and using only the characters in the portable character set. This makes it possible to create generic <a href=
"../utilities/localedef.html"><i>localedef</i></a> source files for all codesets that share the portable character set (such as the
ISO&nbsp;8859 family or IBM Extended ASCII).</p>
</li>

<li>
<p>Standardized symbolic names for all characters in the portable character set, making it possible to refer to any such character
regardless of encoding.</p>
</li>
</ul>

<p>Implementations are free to choose their own symbolic names, as long as the names identified by the Base Definitions volume of
IEEE&nbsp;Std&nbsp;1003.1-2001 are also defined; this provides support for already existing &quot;character names&quot;.</p>

<p>The names selected for the members of the portable character set follow the ISO/IEC&nbsp;8859-1:1998 standard and the
ISO/IEC&nbsp;10646-1:2000 standard. However, several commonly used UNIX system names occur as synonyms in the list:</p>

<ul>
<li>
<p>The historical UNIX system names are used for control characters.</p>
</li>

<li>
<p>The word &quot;slash&quot; is given in addition to &quot;solidus&quot;.</p>
</li>

<li>
<p>The word &quot;backslash&quot; is given in addition to &quot;reverse-solidus&quot;.</p>
</li>

<li>
<p>The word &quot;hyphen&quot; is given in addition to &quot;hyphen-minus&quot;.</p>
</li>

<li>
<p>The word &quot;period&quot; is given in addition to &quot;full-stop&quot;.</p>
</li>

<li>
<p>For digits, the word &quot;digit&quot; is eliminated.</p>
</li>

<li>
<p>For letters, the words &quot;Latin Capital Letter&quot; and &quot;Latin Small Letter&quot; are eliminated.</p>
</li>

<li>
<p>The words &quot;left brace&quot; and &quot;right brace&quot; are given in addition to &quot;left-curly-bracket&quot; and &quot;right-curly-bracket&quot;.</p>
</li>

<li>
<p>The names of the digits are preferred over the numbers to avoid possible confusion between <tt>'0'</tt> and <tt>'O'</tt>, and
between <tt>'1'</tt> and <tt>'l'</tt> (one and the letter ell).</p>
</li>
</ul>

<p>The names for the control characters in the Base Definitions volume of IEEE&nbsp;Std&nbsp;1003.1-2001, <a href=
"../basedefs/xbd_chap06.html">Chapter 6, Character Set</a> were taken from the ISO/IEC&nbsp;4873:1991 standard.</p>

<p>The charmap file was introduced to resolve portability problems, especially those affecting <a href=
"../utilities/localedef.html"><i>localedef</i></a> sources. IEEE&nbsp;Std&nbsp;1003.1-2001 assumes that the portable character set
is constant across all locales, but does not prohibit implementations from supporting two incompatible codings, such as both ASCII
and EBCDIC. Such dual-support implementations should have all charmaps and <a href=
"../utilities/localedef.html"><i>localedef</i></a> sources encoded using one portable character set, in effect cross-compiling for
the other environment. Naturally, charmaps (and <a href="../utilities/localedef.html"><i>localedef</i></a> sources) are only
portable without transformation between systems using the same encodings for the portable character set. They can, however, be
transformed between two sets using only a subset of the actual characters (the portable character set). The particular
coded character set used for an application or an implementation does not necessarily imply different characteristics or collation;
on the contrary, these attributes should in many cases be identical, regardless of codeset. The charmap provides the capability to
define a common locale definition for multiple codesets (the same <a href="../utilities/localedef.html"><i>localedef</i></a> source
can be used for codesets with different extended characters; the ability in the charmap to define empty names allows for characters
missing in certain codesets).</p>
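
<p>The idea of one locale definition serving several codesets can be sketched as follows. The two miniature charmaps and the
ordering fragment are hypothetical and greatly simplified; the byte values are those of ASCII and of EBCDIC code page 037, which
is assumed here as a representative EBCDIC variant.</p>

```python
# Hypothetical miniature charmaps: symbolic name -> encoded byte.
ASCII_CHARMAP = {"<A>": b"\x41", "<a>": b"\x61", "<slash>": b"\x2f"}
EBCDIC_CHARMAP = {"<A>": b"\xc1", "<a>": b"\x81", "<slash>": b"\x61"}

# Hypothetical locale source fragment: an ordering stated purely in symbolic names.
ORDER = ["<slash>", "<a>", "<A>"]


def compile_order(order, charmap):
    """Resolve each symbolic name through the charmap, keeping the stated order."""
    return [charmap[name] for name in order]


ascii_order = compile_order(ORDER, ASCII_CHARMAP)
ebcdic_order = compile_order(ORDER, EBCDIC_CHARMAP)
print(ascii_order)
print(ebcdic_order)
```

<p>The source ordering is written once; only the charmap changes between the two results.</p>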

<p>The <b>&lt;escape_char&gt;</b> declaration was added at the request of the international community to ease the creation of
portable charmap files on terminals not implementing the default backslash escape. The <b>&lt;comment_char&gt;</b> declaration was
added at the request of the international community to eliminate the potential confusion between the number sign and the pound
sign.</p>
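
<p>A rough sketch of how a reader of charmap files might honor these two declarations is shown below. The parser is a
simplification written for illustration (the function name <tt>parse_charmap</tt> is hypothetical, and only two-digit hexadecimal
constants are handled); it is not the grammar defined by the standard.</p>

```python
def parse_charmap(text):
    """Parse a simplified charmap: declarations, comments, and name/encoding pairs."""
    escape_char = "\\"  # default escape character
    comment_char = "#"  # default comment character
    mapping = {}
    in_body = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(comment_char):
            continue
        fields = line.split()
        if fields[0] == "<escape_char>":
            escape_char = fields[1]
        elif fields[0] == "<comment_char>":
            comment_char = fields[1]
        elif line == "CHARMAP":
            in_body = True
        elif line == "END CHARMAP":
            in_body = False
        elif in_body:
            name, encoding = fields[0], fields[1]
            prefix = escape_char + "x"
            # Simplification: only hexadecimal constants are accepted here.
            if not encoding.startswith(prefix):
                raise ValueError("unsupported constant: " + encoding)
            mapping[name] = bytes([int(encoding[len(prefix):], 16)])
    return escape_char, comment_char, mapping


SAMPLE = """\
<escape_char> /
<comment_char> %
% a comment introduced by the redefined comment character
CHARMAP
<A> /x41
<number-sign> /x23
END CHARMAP
"""

escape_char, comment_char, mapping = parse_charmap(SAMPLE)
print(escape_char, comment_char, mapping)
```

<p>With both characters redefined, the sample file contains neither a backslash nor a number sign outside the encoded values.</p>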

<p>The octal number notation, which requires no leading zero, was selected to match that of <a href=
"../utilities/awk.html"><i>awk</i></a> and <a href="../utilities/tr.html"><i>tr</i></a> and is consistent with that used by <a
href="../utilities/localedef.html"><i>localedef</i></a>. To avoid confusion between an octal constant and the back-references used
in <a href="../utilities/localedef.html"><i>localedef</i></a> source, the octal, hexadecimal, and decimal constants must contain at
least two digits. As single-digit constants are relatively rare, this should not impose any significant hardship. Provision is made
for more digits to account for systems in which the byte size is larger than 8 bits. For example, a Unicode
(ISO/IEC&nbsp;10646-1:2000 standard) system that has defined 16-bit bytes may require six octal, four hexadecimal, and five decimal
digits.</p>
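
<p>The three notations and the two-digit minimum can be illustrated with a small sketch. It assumes the default backslash escape
character and the <tt>\d</tt> and <tt>\x</tt> prefixes for decimal and hexadecimal constants; the function name
<tt>parse_constant</tt> is hypothetical, and no upper limit on the number of digits is enforced.</p>

```python
import re

# One constant: backslash, then "d" plus decimal digits, "x" plus hexadecimal
# digits, or bare octal digits. At least two digits are required in every form.
CONSTANT = re.compile(
    r"\\(?:d(?P<dec>[0-9]{2,})|x(?P<hex>[0-9A-Fa-f]{2,})|(?P<oct>[0-7]{2,}))"
)


def parse_constant(token):
    """Return the integer value of one octal, hexadecimal, or decimal constant."""
    match = CONSTANT.fullmatch(token)
    if match is None:
        raise ValueError(f"not a valid constant: {token!r}")
    if match.group("dec") is not None:
        return int(match.group("dec"), 10)
    if match.group("hex") is not None:
        return int(match.group("hex"), 16)
    return int(match.group("oct"), 8)


for token in (r"\101", r"\x41", r"\d065", r"\177777", r"\xFFFF", r"\d65535"):
    print(token, parse_constant(token))
```

<p>A single-digit token such as <tt>\1</tt> is rejected by this sketch, which leaves that form free for use as a
back-reference.</p>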

<p>The decimal notation is supported because some newer international standards define character values in decimal, rather than in
the old column/row notation.</p>

<p>The charmap identifies the coded character sets supported by an implementation. At least one charmap must be provided, but no
implementation is required to provide more than one. Likewise, implementations can allow users to generate new charmaps (for
instance, for a new version of the ISO&nbsp;8859 family of coded character sets), but do not have to do so. If users are allowed
to create new charmaps, the system documentation describes the rules that apply (for instance, &quot;only coded character sets that are
supersets of the ISO/IEC&nbsp;646:1991 standard IRV, no multi-byte characters&quot;).</p>

<p>This addition of the <b>WIDTH</b> specification satisfies the following requirement from the ISO&nbsp;POSIX-2:1993 standard,
Annex H.1:</p>

<blockquote>
<dl compact>
<dt>(9)</dt>

<dd><i>The definition of column position relies on the implementation's knowledge of the integral width of the characters. The
charmap or</i> LC_CTYPE <i>locale definitions should be enhanced to allow application specification of these widths.</i></dd>
</dl>
</blockquote>

<p>The character &quot;width&quot; information was first considered for inclusion under <i>LC_CTYPE</i> but was moved because it is more
closely associated with the information in the charmap than information in the locale source (cultural conventions information).
Concerns were raised that formalizing this type of information is moving the locale source definition from the codeset-independent
entity that it was designed to be to a repository of codeset-specific information. A similar issue occurred with the
<b>&lt;code_set_name&gt;</b>, <b>&lt;mb_cur_max&gt;</b>, and <b>&lt;mb_cur_min&gt;</b> information; it was resolved that this
information should also reside in the charmap definition.</p>

<p>The width definition was added to the IEEE&nbsp;P1003.2b draft standard with the intent that the <a href=
"../functions/wcswidth.html"><i>wcswidth</i>()</a> and/or <a href="../functions/wcwidth.html"><i>wcwidth</i>()</a> functions
(currently specified in the System Interfaces volume of IEEE&nbsp;Std&nbsp;1003.1-2001) be the mechanism to retrieve the character
width information.</p>
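
<p>The following sketch models only the table lookup that width information in a charmap makes possible; it does not call or
reproduce <i>wcwidth</i>(). The fragment syntax is simplified (ranges of symbolic names are not handled), the symbolic names
<tt>&lt;kanji-a&gt;</tt> and <tt>&lt;kanji-b&gt;</tt> are hypothetical, and a default width of 1 is assumed when no
<b>WIDTH_DEFAULT</b> line is present.</p>

```python
def parse_widths(text):
    """Read WIDTH entries and WIDTH_DEFAULT from a simplified charmap fragment."""
    widths = {}
    default = 1  # assumed default when WIDTH_DEFAULT is absent
    in_width = False
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields == ["WIDTH"]:
            in_width = True
        elif fields == ["END", "WIDTH"]:
            in_width = False
        elif fields[0] == "WIDTH_DEFAULT":
            default = int(fields[1])
        elif in_width:
            widths[fields[0]] = int(fields[1])
    return widths, default


def column_width(names, widths, default):
    """Sum the column widths of a sequence of symbolic character names."""
    return sum(widths.get(name, default) for name in names)


SAMPLE = """\
WIDTH
<kanji-a> 2
<kanji-b> 2
END WIDTH
WIDTH_DEFAULT 1
"""

widths, default = parse_widths(SAMPLE)
total = column_width(["<A>", "<kanji-a>", "<kanji-b>"], widths, default)
print(widths, default, total)
```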


<hr size="2" noshade>
<center><font size="2"><!--footer start-->
UNIX &reg; is a registered Trademark of The Open Group.<br>
POSIX &reg; is a registered Trademark of The IEEE.<br>
[ <a href="../mindex.html">Main Index</a> | <a href="../basedefs/contents.html">XBD</a> | <a href=
"../utilities/contents.html">XCU</a> | <a href="../functions/contents.html">XSH</a> | <a href="../xrat/contents.html">XRAT</a>
]</font></center>

<!--footer end-->
<hr size="2" noshade>
</body>
</html>