oldlinux-files/study/Ref-docs/C/charset.html

<HTML><HEAD><TITLE>Characters</TITLE></HEAD><BODY BGCOLOR="#FFFFFF">

<H1><A NAME="Characters">Characters</A></H1><HR>

<P><B>
<A HREF="#Character Sets">Character Sets</A>
&#183; <A HREF="#Character Sets and Locales">Character Sets and Locales</A>
&#183; <A HREF="#Escape Sequences">Escape Sequences</A>
&#183; <A HREF="#Numeric Escape Sequences">Numeric Escape Sequences</A>
&#183; <A HREF="#Trigraphs">Trigraphs</A>
&#183; <A HREF="#Multibyte Characters">Multibyte Characters</A>
&#183; <A HREF="#Wide-Character Encoding">Wide-Character Encoding</A>
</B></P>
<HR>

<P>Characters play a central role in Standard C. You represent
a C program as one or more
<B><A NAME="source file">source files</A></B>.
The translator reads a source file as a
<A NAME="text stream">text stream</A>
consisting of characters that you can read when you
display the stream on a terminal screen or produce hard copy with a
printer. You often manipulate text when a C program executes. The
program might produce a text stream that people can read, or it might
read a text stream entered by someone typing at a keyboard
or from a file modified using a text editor.
This document describes the characters that you
use to write C source files and that you manipulate as streams
when executing C programs.</P>

<H2><A NAME="Character Sets">Character Sets</A></H2>

<P>When you write a program, you express C source files as
<A HREF="lib_file.html#text lines" tppabs="http://ccs.ucsd.edu/c/lib_file.html#text lines">text lines</A>
containing characters from the
<B><A NAME="source character set">source character set</A></B>.
When a program executes in the
<B><A NAME="target environment">target environment</A></B>,
it uses characters from the
<B><A NAME="target character set">target character set</A></B>.
These character sets are related, but need not have
the same encoding or all the same members.</P>

<P>Every character set contains a distinct code value for each
character in the
<B><A NAME="basic C character set">basic C character set</A></B>.
A character set can also contain additional characters
with other code values. For example:</P>

<UL>
<LI>The
<B><A NAME="character constant">character constant</A></B>
<CODE>'x'</CODE> becomes the value of
the code for the character corresponding to <CODE>x</CODE> in the target
character set.

<LI>The
<B><A NAME="string literal">string literal</A></B>
<CODE>"xyz"</CODE> becomes a sequence of
character constants stored in successive bytes of memory, followed
by a byte containing the value zero:<BR>
<CODE>{'x', 'y', 'z', '\0'}</CODE>
</UL>

<P>A string literal is one way to specify a
<B><A NAME="null-terminated string">null-terminated string</A></B>,
an array of zero or more bytes followed by a byte containing the
value zero.</P>

<P><B><A NAME="visible graphic characters">Visible graphic characters</A></B>
in the basic C character set:</P>

<PRE>
<B>Form         Members</B>
<I>letter</I>       A B C D E F G H I J K L M
             N O P Q R S T U V W X Y Z
             a b c d e f g h i j k l m
             n o p q r s t u v w x y z

<I>digit</I>        0 1 2 3 4 5 6 7 8 9

<I>underscore</I>   _

<I>punctuation</I>  ! " # % &amp; ' ( ) * + , - . / :
             ; &lt; = &gt; ? [ \ ] ^ { | } ~</PRE>

<P><B><A NAME="additional graphic characters">Additional
graphic characters</A></B> in the basic C character set:</P>

<PRE>
<B>Character    Meaning</B>
<A NAME="space"><I>space</I></A>        <B>leave blank space</B>
<A NAME="BEL"><I>BEL</I></A>          <B>signal an alert (BELl)</B>
<A NAME="BS"><I>BS</I></A>           <B>go back one position (BackSpace)</B>
<A NAME="FF"><I>FF</I></A>           <B>go to top of page (Form Feed)</B>
<A NAME="NL"><I>NL</I></A>           <B>go to start of next line (NewLine)</B>
<A NAME="CR"><I>CR</I></A>           <B>go to start of this line (Carriage Return)</B>
<A NAME="HT"><I>HT</I></A>           <B>go to next Horizontal Tab stop</B>
<A NAME="VT"><I>VT</I></A>           <B>go to next Vertical Tab stop</B></PRE>

<P>The code value zero is reserved for the
<B><A NAME="null character">null character</A></B>
which is always in the target character set. Code values for the basic
C character set are positive when stored in an object of type <I>char.</I>
Code values for the digits are contiguous, with increasing value.
For example, <CODE>'0' + 5</CODE> equals <CODE>'5'</CODE>.
Code values for any
two letters are <I>not</I> necessarily contiguous.</P>

<H3><A NAME="Character Sets and Locales">
Character Sets and Locales</A></H3>

<P>An implementation can support multiple
<A HREF="locale.html" tppabs="http://ccs.ucsd.edu/c/locale.html">locales</A>, each
with a different character set. A locale summarizes conventions peculiar
to a given culture, such as how to format dates or how to sort names.
To change locales and, therefore, target character sets while the
program is running, use the function
<A HREF="locale.html#setlocale" tppabs="http://ccs.ucsd.edu/c/locale.html#setlocale"><CODE>setlocale</CODE></A>.
The translator encodes character constants and
string literals for the
<A HREF="locale.html#C locale" tppabs="http://ccs.ucsd.edu/c/locale.html#C locale"><CODE>"C"</CODE></A> locale,
which is the locale in effect at program startup.</P>

<H2><A NAME="Escape Sequences">Escape Sequences</A></H2>

<P>Within character constants and string literals, you can write
a variety of <B>escape sequences</B>. Each escape sequence determines
the code value for a single character. You use escape sequences
to represent character codes:</P>

<UL>
<LI>you cannot otherwise write (such as <CODE>\n</CODE>)

<LI>that can be difficult to read properly (such as <CODE>\t</CODE>)

<LI>that might change value in different target character sets (such
as <CODE>\a</CODE>)

<LI>that must not change in value among different target environments
(such as <CODE>\0</CODE>)
</UL>

<P>An escape sequence takes the form:</P>

<P><IMG SRC="escape.gif" tppabs="http://ccs.ucsd.edu/c/gif/escape.gif"></P>

<P><B><A NAME="mnemonic escape sequences">
Mnemonic escape sequences</A></B>
help you remember the characters they represent:</P>

<PRE>
<B>Character    Escape Sequence</B>
"            \"
'            \'
?            \?
\            \\
<I>BEL</I>          \a
<I>BS</I>           \b
<I>FF</I>           \f
<I>NL</I>           \n
<I>CR</I>           \r
<I>HT</I>           \t
<I>VT</I>           \v</PRE>

<H3><A NAME="Numeric Escape Sequences">
Numeric Escape Sequences</A></H3>

<P>You can also write <B>numeric escape sequences</B> using either
octal or hexadecimal digits. An
<B><A NAME="octal escape sequence">octal escape sequence</A></B>
takes one of the forms:</P>

<PRE>
    \<I>d</I> <B>or</B> \<I>dd</I> <B>or</B> \<I>ddd</I></PRE>

<P>The escape sequence yields a code value that is the numeric
value of the 1-, 2-, or 3-digit octal number following the backslash
(<CODE>\</CODE>). Each <CODE><I>d</I></CODE> can be
any digit in the range <CODE>0-7</CODE>.</P>

<P>A
<B><A NAME="hexadecimal escape sequence">
hexadecimal escape sequence</A></B> takes one of the forms:</P>

<PRE>
    \x<I>h</I> <B>or</B> \x<I>hh</I> <B>or ...</B></PRE>

<P>The escape sequence yields a code value that is the numeric
value of the arbitrary-length hexadecimal number following the backslash
(<CODE>\</CODE>). Each <CODE><I>h</I></CODE> can be any
decimal digit <CODE>0-9</CODE>, or
any of the letters <CODE>a-f</CODE> or <CODE>A-F</CODE>.
The letters represent
the digit values 10-15, where either <CODE>a</CODE> or <CODE>A</CODE> has
the value 10.</P>

<P>A numeric escape sequence terminates with the first character
that does not fit the digit pattern. Here are some examples:</P>

<UL>
<LI>You can write the
<A HREF="#null character">null character</A>
as <CODE>'\0'</CODE>.

<LI>You can write a newline character (<CODE><I>NL</I></CODE>)
within a string literal by writing:<BR>
<CODE>"hi\n"         <B>which becomes the array</B><BR>
               {'h', 'i', '\n', 0}</CODE>

<LI>You can write a string literal that begins with a specific numeric
value:<BR>
<CODE>"\3abc"        <B>which becomes the array</B><BR>
               {3, 'a', 'b', 'c', 0}</CODE>

<LI>You can write a string literal that contains the hexadecimal
escape sequence <CODE>\xF</CODE> followed by
the digit <CODE>3</CODE> by writing
two string literals:<BR>
<CODE>"\xF" "3"      <B>which becomes the array</B><BR>
               {0xF, '3', 0}</CODE>
</UL>

<H2><A NAME="Trigraphs">Trigraphs</A></H2>

<P>A <B>trigraph</B> is a sequence of three characters that begins
with two question marks (<CODE>??</CODE>). You use trigraphs to write C
source files with a character set that does not contain convenient
graphic representations for some punctuation characters. (The resultant
C source file is not necessarily more readable, but it is unambiguous.)</P>

<P>The list of all
<B><A NAME="defined trigraphs">defined trigraphs</A></B> is:</P>

<PRE>
<B>Character   Trigraph</B>
[           ??(
\           ??/
]           ??)
^           ??'
{           ??&lt;
|           ??!
}           ??&gt;
~           ??-
#           ??=</PRE>

<P>These are the only trigraphs. The translator does not alter any other
sequence that begins with two question marks.</P>

<P>For example, the expression statements:</P>

<PRE>
    printf("Case ??=3 is done??/n");
    printf("You said what????/n");</PRE>

<P>are equivalent to:</P>

<PRE>
    printf("Case #3 is done\n");
    printf("You said what??\n");</PRE>

<P>The translator replaces each trigraph with its equivalent single
character representation in an early
<A HREF="preproc.html#Phases of Translation" tppabs="http://ccs.ucsd.edu/c/preproc.html#Phases of Translation">phase of translation</A>.
You can always treat a trigraph as a single source character.</P>

<H2><A NAME="Multibyte Characters">Multibyte Characters</A></H2>

<P>A source character set or target character set can also contain
<B>multibyte characters</B> (sequences of one or more bytes). Each
sequence represents a single character in the
<B><A NAME="extended character set">extended character set</A></B>.
You use multibyte characters to represent large sets of characters,
such as Kanji. A multibyte character can be a one-byte sequence that
is a character from the
<A HREF="#basic C character set">basic C character set</A>,
an additional one-byte sequence that is
<A HREF="portable.html#implementation defined" tppabs="http://ccs.ucsd.edu/c/portable.html#implementation defined">implementation defined</A>,
or an additional sequence of two or more bytes that is
implementation defined.</P>

<P>Any multibyte encoding that contains sequences of two or more
bytes depends, for its interpretation between bytes, on a
<B><A NAME="conversion state">conversion state</A></B> determined
by bytes earlier in the sequence of characters. In the
<B><A NAME="initial conversion state">initial conversion state</A></B>
if the byte immediately following matches one of the characters
in the basic C character set, the byte must represent that character.</P>

<P>For example, the
<B><A NAME="EUC encoding">EUC encoding</A></B> is a superset of ASCII.
A byte value in the interval [0xA1, 0xFE] is the first of a two-byte
sequence (whose second byte value is in the interval [0x80, 0xFF]).
All other byte values are one-byte sequences. Since all members of the
<A HREF="#basic C character set">basic C character set</A>
have byte values in the range [0x00, 0x7F] in ASCII,
EUC meets the requirements
for a multibyte encoding in Standard C. Such a sequence is <I>not</I>
in the initial conversion state immediately after a byte value in
the interval [0xA1, 0xFe]. It is ill-formed if a second byte
value is not in the interval [0x80, 0xFF].

<P>Multibyte characters can also have a
<B><A NAME="state-dependent encoding">state-dependent encoding</A></B>.
How you interpret a byte in such an encoding depends on a
conversion state that involves both a
<B><A NAME="parse state">parse state</A></B>, as before, and a
<B><A NAME="shift state">shift state</A></B>, determined
by bytes earlier in the sequence of characters. The
<B><A NAME="initial shift state">initial shift state</A></B>,
at the beginning of a new multibyte character, is also the
initial conversion state. A subsequent
<B><A NAME="shift sequence">shift sequence</A></B> can determine an
<B><A NAME="alternate shift state">alternate shift state</A></B>,
after which all byte sequences (including one-byte sequences) can have
a different interpretation. A byte containing the value zero,
however, always represents the
<A HREF="#null character">null character</A>.
It cannot occur as any of the bytes of another multibyte character.</P>

<P>For example, the
<B><A NAME="JIS encoding">JIS encoding</A></B> is another superset of ASCII.
In the initial shift state, each byte represents a single character,
except for two three-byte shift sequences:</P>

<UL>
<LI>The three-byte sequence <CODE>"\x1B$B"</CODE> shifts to two-byte mode.
Subsequently, two successive bytes (both with values
in the range [0x21, 0x7E]) constitute a single multibyte character.

<LI>The three-byte sequence <CODE>"\x1B(B"</CODE> shifts back
to the initial shift state.
</UL>

<P>JIS also meets the requirements for a multibyte encoding in Standard C.
Such a sequence is <I>not</I> in the initial conversion state
when partway through a three-byte shift sequence
or when in two-byte mode.</P>

(<A HREF="lib_over.html#Amendment 1" tppabs="http://ccs.ucsd.edu/c/lib_over.html#Amendment 1">Amendment 1</A> adds the type
<A HREF="wchar.html#mbstate_t" tppabs="http://ccs.ucsd.edu/c/wchar.html#mbstate_t"><CODE>mbstate_t</CODE></A>,
which describes an object that can store a conversion state.
It also relaxes the above rules for
<A HREF="lib_file.html#generalized multibyte characters" tppabs="http://ccs.ucsd.edu/c/lib_file.html#generalized multibyte characters">
generalized multibyte characters</A>, which describe the encoding
rules for a broad range of
<A HREF="lib_file.html#wide stream" tppabs="http://ccs.ucsd.edu/c/lib_file.html#wide stream">wide streams</A>.)

<P>You can write multibyte characters in C source text as part
of a comment, a character constant, a string literal, or a filename in an
<A HREF="preproc.html#include directive" tppabs="http://ccs.ucsd.edu/c/preproc.html#include directive"><I>include</I> directive</A>.
How such characters print is
<A HREF="portable.html#implementation defined" tppabs="http://ccs.ucsd.edu/c/portable.html#implementation defined">implementation defined</A>.
Each sequence of multibyte characters that you write must
begin and end in the initial shift state.
The program can also include multibyte characters in
<A HREF="#null-terminated string">null-terminated</A>
<A HREF="lib_over.html#C string" tppabs="http://ccs.ucsd.edu/c/lib_over.html#C string">C strings</A>
used by several library functions, including the
<A HREF="lib_prin.html#format string" tppabs="http://ccs.ucsd.edu/c/lib_prin.html#format string">format strings</A> for
<A HREF="stdio.html#printf" tppabs="http://ccs.ucsd.edu/c/stdio.html#printf"><CODE>printf</CODE></A> and
<A HREF="stdio.html#scanf" tppabs="http://ccs.ucsd.edu/c/stdio.html#scanf"><CODE>scanf</CODE></A>.
Each such character string must begin and end
in the initial shift state.</P>

<H3><A NAME="Wide-Character Encoding">
Wide-Character Encoding</A></H3>

<P>Each character in the extended character set also has an integer
representation, called a <B>wide-character encoding</B>.
Each extended character has a unique wide-character value.
The value zero always corresponds to the
<B><A NAME="null wide character">null wide character</A></B>.
The type definition
<A HREF="stddef.html#wchar_t" tppabs="http://ccs.ucsd.edu/c/stddef.html#wchar_t"><CODE>wchar_t</CODE></A>
specifies the integer type that represents wide characters.</P>

<P>You write a
<B><A NAME="wide-character constant">wide-character constant</A></B>
as <CODE>L'mbc'</CODE>, where <CODE>mbc</CODE> represents
a single multibyte character.
You write a
<B><A NAME="wide-character string literal">
wide-character string literal</A></B> as <CODE>L"mbs"</CODE>,
where <CODE>mbs</CODE> represents
a sequence of zero or more multibyte characters.
The wide-character string literal
<CODE>L"xyz"</CODE> becomes a sequence of
wide-character constants stored in successive bytes of memory, followed
by a null wide character:<BR>
<CODE>{L'x', L'y', L'z', L'\0'}</CODE>

<P>The following library functions
help you convert between the multibyte
and wide-character representations of extended characters:
<A HREF="wchar.html#btowc" tppabs="http://ccs.ucsd.edu/c/wchar.html#btowc"><CODE>btowc</CODE></A>,
<A HREF="stdlib.html#mblen" tppabs="http://ccs.ucsd.edu/c/stdlib.html#mblen"><CODE>mblen</CODE></A>,
<A HREF="wchar.html#mbrlen" tppabs="http://ccs.ucsd.edu/c/wchar.html#mbrlen"><CODE>mbrlen</CODE></A>,
<A HREF="wchar.html#mbrtowc" tppabs="http://ccs.ucsd.edu/c/wchar.html#mbrtowc"><CODE>mbrtowc</CODE></A>,
<A HREF="wchar.html#mbsrtowcs" tppabs="http://ccs.ucsd.edu/c/wchar.html#mbsrtowcs"><CODE>mbsrtowcs</CODE></A>,
<A HREF="stdlib.html#mbstowcs" tppabs="http://ccs.ucsd.edu/c/stdlib.html#mbstowcs"><CODE>mbstowcs</CODE></A>,
<A HREF="stdlib.html#mbtowc" tppabs="http://ccs.ucsd.edu/c/stdlib.html#mbtowc"><CODE>mbtowc</CODE></A>,
<A HREF="wchar.html#wcrtomb" tppabs="http://ccs.ucsd.edu/c/wchar.html#wcrtomb"><CODE>wcrtomb</CODE></A>,
<A HREF="wchar.html#wcsrtombs" tppabs="http://ccs.ucsd.edu/c/wchar.html#wcsrtombs"><CODE>wcsrtombs</CODE></A>,
<A HREF="stdlib.html#wcstombs" tppabs="http://ccs.ucsd.edu/c/stdlib.html#wcstombs"><CODE>wcstombs</CODE></A>,
<A HREF="wchar.html#wctob" tppabs="http://ccs.ucsd.edu/c/wchar.html#wctob"><CODE>wctob</CODE></A>, and
<A HREF="stdlib.html#wctomb" tppabs="http://ccs.ucsd.edu/c/stdlib.html#wctomb"><CODE>wctomb</CODE></A>.
</P>

<P>The macro
<A HREF="limits.html#MB_LEN_MAX" tppabs="http://ccs.ucsd.edu/c/limits.html#MB_LEN_MAX"><CODE>MB_LEN_MAX</CODE></A>
specifies the length of the longest possible multibyte sequence required
to represent a single character
<A HREF="portable.html#implementation defined" tppabs="http://ccs.ucsd.edu/c/portable.html#implementation defined">defined</A>
by the implementation across supported locales. And the macro
<A HREF="stdlib.html#MB_CUR_MAX" tppabs="http://ccs.ucsd.edu/c/stdlib.html#MB_CUR_MAX"><CODE>MB_CUR_MAX</CODE></A>
specifies the length of the longest possible multibyte sequence required
to represent a single character defined for the current
<A HREF="locale.html#locale" tppabs="http://ccs.ucsd.edu/c/locale.html#locale">locale</A>.</P>

<P>For example, the
<A HREF="#string literal">string literal</A>
<CODE>"hello"</CODE> becomes an array of six <I>char</I>:</P>

<PRE>
    {'h', 'e', 'l', 'l', 'o', 0}</PRE>

<P>while the wide-character string literal
<CODE>L"hello"</CODE> becomes
an array of six integers of type
<A HREF="stddef.html#wchar_t" tppabs="http://ccs.ucsd.edu/c/stddef.html#wchar_t"><CODE>wchar_t</CODE></A>:</P>

<PRE>
    {L'h', L'e', L'l', L'l', L'o', 0}</PRE>

<HR>
<P>See also the
<B><A HREF="index.html#Table of Contents" tppabs="http://ccs.ucsd.edu/c/index.html#Table of Contents">Table of Contents</A></B> and the
<B><A HREF="_index.html" tppabs="http://ccs.ucsd.edu/c/_index.html">Index</A></B>.</P>

<P><I>
<A HREF="crit_pb.html" tppabs="http://ccs.ucsd.edu/c/crit_pb.html">Copyright</A> &#169; 1989-1996
by P.J. Plauger and Jim Brodie. All rights reserved.</I></P>

</BODY></HTML>