<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link type="text/css" rel="stylesheet" href="style.css"><!-- Generated by The Open Group's rhtm tool v1.2.1 -->
<!-- Copyright (c) 2001 The Open Group, All Rights Reserved -->
<title>Rationale</title>
</head>
<body>
<basefont size="3">
<center><font size="2">The Open Group Base Specifications Issue 6<br>
IEEE Std 1003.1-2001<br>
Copyright &copy; 2001 The IEEE and The Open Group</font></center>
<hr size="2" noshade>
<h3><a name="tag_03_02"></a>General Information</h3>
<h4><a name="tag_03_02_01"></a>Use and Implementation of Functions</h4>
<p>The information concerning the use of functions was adapted from a description in the ISO&nbsp;C standard. Here is an example of
how an application program can protect itself from functions that may or may not be macros, rather than true functions:</p>
<p>The <a href="../functions/atoi.html"><i>atoi</i>()</a> function may be used in any of several ways:</p>
<ul>
<li>
<p>By use of its associated header (possibly generating a macro expansion):</p>
<blockquote>
<pre>
<tt>#include &lt;stdlib.h&gt;
/* ... */
i = atoi(str);
</tt>
</pre>
</blockquote>
</li>
<li>
<p>By use of its associated header (assuredly generating a true function call):</p>
<blockquote>
<pre>
<tt>#include &lt;stdlib.h&gt;
#undef atoi
/* ... */
i = atoi(str);
</tt>
</pre>
</blockquote>
<p>or:</p>
<blockquote>
<pre>
<tt>#include &lt;stdlib.h&gt;
/* ... */
i = (atoi) (str);
</tt>
</pre>
</blockquote>
</li>
<li>
<p>By explicit declaration:</p>
<blockquote>
<pre>
<tt>extern int atoi (const char *);
/* ... */
i = atoi(str);
</tt>
</pre>
</blockquote>
</li>
<li>
<p>By implicit declaration:</p>
<blockquote>
<pre>
<tt>/* ... */
i = atoi(str);
</tt>
</pre>
</blockquote>
<p>(Assuming no function prototype is in scope. This is not allowed by the ISO&nbsp;C standard for functions with variable
arguments; furthermore, parameter type conversion &quot;widening&quot; is subject to different rules in this case.)</p>
</li>
</ul>
<p>Note that the ISO&nbsp;C standard reserves names starting with <tt>'_'</tt> for the compiler. Therefore, the compiler could, for
example, implement an intrinsic, built-in function <i>_asm_builtin_atoi</i>(), which it recognized and expanded into inline
assembly code. Then, in <a href="../basedefs/stdlib.h.html"><i>&lt;stdlib.h&gt;</i></a>, there could be the following:</p>
<blockquote>
<pre>
<tt>#define atoi(X) _asm_builtin_atoi(X)
</tt>
</pre>
</blockquote>
<p>The user's &quot;normal&quot; call to <a href="../functions/atoi.html"><i>atoi</i>()</a> would then be expanded inline, but the
implementor would also be required to provide a callable function named <a href="../functions/atoi.html"><i>atoi</i>()</a> for use
when the application requires it; for example, if its address is to be stored in a function pointer variable.</p>
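<p>For example, the following fragment (illustrative only) always refers to the true function, because a function-like
macro is expanded only when its name is followed by a left parenthesis:</p>
<blockquote>
<pre>
<tt>#include &lt;stdlib.h&gt;
/* ... */
int (*convert)(const char *) = atoi;  /* never a macro expansion */
</tt>
</pre>
</blockquote>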
<h4><a name="tag_03_02_02"></a>The Compilation Environment</h4>
<h5><a name="tag_03_02_02_01"></a>POSIX.1 Symbols</h5>
<p>This and the following section address the issue of &quot;name space pollution&quot;. The ISO&nbsp;C standard requires that the name
space beyond what it reserves not be altered except by explicit action of the application writer. This section defines the actions
to add the POSIX.1 symbols for those headers where both the ISO&nbsp;C standard and POSIX.1 need to define symbols, and also where
the XSI extension extends the base standard.</p>
<p>When headers are used to provide symbols, there is a potential for introducing symbols that the application writer cannot
predict. Ideally, each header should only contain one set of symbols, but this is not practical for historical reasons. Thus, the
concept of feature test macros is included. Two feature test macros are explicitly defined by IEEE&nbsp;Std&nbsp;1003.1-2001; it is
expected that future revisions may add to this. <basefont size="2"></p>
<dl>
<dt><b>Note:</b></dt>
<dd>Feature test macros allow an application to announce to the implementation its desire to have certain symbols and prototypes
exposed. They should not be confused with the version test macros and constants for options in <a href=
"../basedefs/unistd.h.html"><i>&lt;unistd.h&gt;</i></a> which are the implementation's way of announcing functionality to the
application.</dd>
</dl>
<basefont size="3">
<p>It is further intended that these feature test macros apply only to the headers specified by IEEE&nbsp;Std&nbsp;1003.1-2001.
Implementations are expressly permitted to make visible symbols not specified by IEEE&nbsp;Std&nbsp;1003.1-2001, within both
POSIX.1 and other headers, under the control of feature test macros that are not defined by IEEE&nbsp;Std&nbsp;1003.1-2001.</p>
<h5><a name="tag_03_02_02_02"></a>The _POSIX_C_SOURCE Feature Test Macro</h5>
<p>Since _POSIX_SOURCE specified by the POSIX.1-1990 standard did not have a value associated with it, the _POSIX_C_SOURCE macro
replaces it, allowing an application to inform the system of the revision of the standard to which it conforms. This symbol will
allow implementations to support various revisions of IEEE&nbsp;Std&nbsp;1003.1-2001 simultaneously. For instance, when either
_POSIX_SOURCE is defined or _POSIX_C_SOURCE is defined as 1, the system should make visible the same name space as permitted and
required by the POSIX.1-1990 standard. When _POSIX_C_SOURCE is defined, the state of _POSIX_SOURCE is completely irrelevant.</p>
<p>It is expected that C bindings to future POSIX standards will define new values for _POSIX_C_SOURCE, with each new value
reserving the name space for that new standard, plus all earlier POSIX standards.</p>
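<p>For example, an application written to this revision of the standard might begin, before any <b>#include</b>
statements, with the following (illustrative) definition:</p>
<blockquote>
<pre>
<tt>#define _POSIX_C_SOURCE 200112L /* IEEE Std 1003.1-2001 */
<br>
#include ...
</tt>
</pre>
</blockquote>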
<h5><a name="tag_03_02_02_03"></a>The _XOPEN_SOURCE Feature Test Macro</h5>
<p>The feature test macro _XOPEN_SOURCE is provided as the announcement mechanism for the application that it requires
functionality from the Single UNIX Specification. _XOPEN_SOURCE must be defined to the value 600 before the inclusion of any header
to enable the functionality in the Single UNIX Specification. Its definition subsumes the use of _POSIX_SOURCE and
_POSIX_C_SOURCE.</p>
<p>An extract of code from a conforming application, as it would appear before any <b>#include</b> statements, is given below:</p>
<pre>
<tt>#define _XOPEN_SOURCE 600 /* Single UNIX Specification, Version 3 */
<br>
#include ...
</tt>
</pre>
<p>Note that the definition of _XOPEN_SOURCE with the value 600 makes the definition of _POSIX_C_SOURCE redundant and it can safely
be omitted.</p>
<h5><a name="tag_03_02_02_04"></a>The Name Space</h5>
<p>The reservation of identifiers is paraphrased from the ISO&nbsp;C standard. The text is included because it needs to be part of
IEEE&nbsp;Std&nbsp;1003.1-2001, regardless of possible changes in future versions of the ISO&nbsp;C standard.</p>
<p>These identifiers may be used by implementations, particularly for feature test macros. Implementations should not use feature
test macro names that might be reasonably used by a standard.</p>
<p>Including headers more than once is a reasonably common practice, and it should be carried forward from the ISO&nbsp;C standard.
More significantly, having definitions in more than one header is explicitly permitted. Where the potential declaration is
&quot;benign&quot; (the same definition twice) the declaration can be repeated, if that is permitted by the compiler. (This is usually true
of macros, for example.) In those situations where a repetition is not benign (for example, <b>typedef</b>s), conditional
compilation must be used. The situation actually occurs both within the ISO&nbsp;C standard and within POSIX.1: <b>time_t</b>
should be in <a href="../basedefs/sys/types.h.html"><i>&lt;sys/types.h&gt;</i></a>, and the ISO&nbsp;C standard mandates that it be
in <a href="../basedefs/time.h.html"><i>&lt;time.h&gt;</i></a>.</p>
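<p>A typical way for an implementation to keep such a repeated definition benign is to guard it with conditional
compilation, as in the following sketch (the guard macro name and the underlying type chosen here are purely
hypothetical):</p>
<blockquote>
<pre>
<tt>#ifndef __time_t_defined
#define __time_t_defined
typedef long time_t;   /* the actual type is implementation-defined */
#endif
</tt>
</pre>
</blockquote>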
<p>The area of name space pollution <i>versus</i> additions to structures is difficult because of the macro structure of C. The
following discussion summarizes all the various problems with and objections to the issue.</p>
<p>Note the phrase &quot;user-defined macro&quot;. Users are not permitted to define macro names (or any other name) beginning with
<tt>"_[A-Z_]"</tt> . Thus, the conflict cannot occur for symbols reserved to the vendor's name space, and the permission to add
fields automatically applies, without qualification, to those symbols.</p>
<ol>
<li>
<p>Data structures (and unions) need to be defined in headers by implementations to meet certain requirements of POSIX.1 and the
ISO&nbsp;C standard.</p>
</li>
<li>
<p>The structures defined by POSIX.1 are typically minimal, and any practical implementation would wish to add fields to these
structures either to hold additional related information or for backwards-compatibility (or both). Future standards (and <i>de
facto</i> standards) would also wish to add to these structures. Issues of field alignment make it impractical (at least in the
general case) to simply omit fields when they are not defined by the particular standard involved.</p>
<p>The <b>dirent</b> structure is an example of such a minimal structure (although one could argue about whether the other fields
need visible names). The <i>st_rdev</i> field of most implementations' <b>stat</b> structure is a common example where extension is
needed and where a conflict could occur.</p>
</li>
<li>
<p>Fields in structures are in an independent name space, so the addition of such fields presents no problem to the C language
itself in that such names cannot interact with identically named user symbols because access is qualified by the specific structure
name.</p>
</li>
<li>
<p>There is an exception to this: macro processing is done at a lexical level. Thus, symbols added to a structure might be
recognized as user-provided macro names at the location where the structure is declared. This only can occur if the user-provided
name is declared as a macro before the header declaring the structure is included. The user's use of the name after the declaration
cannot interfere with the structure because the symbol is hidden and only accessible through access to the structure. Presumably,
the user would not declare such a macro if there was an intention to use that field name.</p>
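<p>A hypothetical fragment showing how such interference would arise:</p>
<blockquote>
<pre>
<tt>#define st_rdev 42       /* user-provided macro, defined first */
#include &lt;sys/stat.h&gt; /* the st_rdev member declaration is now mangled */
</tt>
</pre>
</blockquote>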
</li>
<li>
<p>Macros from the same or a related header might use the additional fields in the structure, and those field names might also
collide with user macros. Although this is a less frequent occurrence, since macros are expanded at the point of use, no constraint
on the order of use of names can apply.</p>
</li>
<li>
<p>An &quot;obvious&quot; solution of using names in the reserved name space and then redefining them as macros when they should be visible
does not work because this has the effect of exporting the symbol into the general name space. For example, given a (hypothetical)
system-provided header <i>&lt;h.h&gt;</i>, and two parts of a C program in <b>a.c</b> and <b>b.c</b>, in header
<i>&lt;h.h&gt;</i>:</p>
<blockquote>
<pre>
<tt>struct foo {
    int __i;
};
<br>
#ifdef _FEATURE_TEST
#define i __i
#endif
</tt>
</pre>
</blockquote>
<p>In file <b>a.c</b>:</p>
<blockquote>
<pre>
<tt>#include &lt;h.h&gt;
extern int i;
...
</tt>
</pre>
</blockquote>
<p>In file <b>b.c</b>:</p>
<blockquote>
<pre>
<tt>extern int i;
...
</tt>
</pre>
</blockquote>
<p>The symbol that the user thinks of as <i>i</i> in both files has an external name of <i>__i</i> in <b>a.c</b>; the same symbol
<i>i</i> in <b>b.c</b> has an external name <i>i</i> (ignoring any hidden manipulations the compiler might perform on the names).
This would cause a mysterious name resolution problem when <b>a.o</b> and <b>b.o</b> are linked.</p>
<p>Simply avoiding definition then causes alignment problems in the structure.</p>
<p>A structure of the form:</p>
<blockquote>
<pre>
<tt>struct foo {
    union {
        int __i;
#ifdef _FEATURE_TEST
        int i;
#endif
    } __ii;
};
</tt>
</pre>
</blockquote>
<p>does not work because the name of the logical field <i>i</i> is <i>__ii.i</i>, and introduction of a macro to restore the
logical name immediately reintroduces the problem discussed previously (although its manifestation might be more immediate because
a syntax error would result if a recursive macro did not cause it to fail first).</p>
</li>
<li>
<p>A more workable solution would be to declare the structure:</p>
<blockquote>
<pre>
<tt>struct foo {
#ifdef _FEATURE_TEST
    int i;
#else
    int __i;
#endif
};
</tt>
</pre>
</blockquote>
<p>However, if a macro (particularly one required by a standard) is to be defined that uses this field, two must be defined: one
that uses <i>i</i>, the other that uses <i>__i</i>. If more than one additional field is used in a macro and they are conditional
on distinct combinations of features, the complexity goes up as 2<small><sup><i>n</i></sup></small>.</p>
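<p>For example, a single macro using the field would itself have to be defined conditionally (illustrative):</p>
<blockquote>
<pre>
<tt>#ifdef _FEATURE_TEST
#define foo_get(p) ((p)-&gt;i)
#else
#define foo_get(p) ((p)-&gt;__i)
#endif
</tt>
</pre>
</blockquote>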
</li>
</ol>
<p>All this leaves a difficult situation: vendors must provide very complex headers to deal with what is conceptually
simple and safe, namely adding a field to a structure. It is the possibility of user-provided macros with the same name that
makes this difficult.</p>
<p>Several alternatives were proposed that involved constraining the user's access to part of the name space available to the user
(as specified by the ISO&nbsp;C standard). In some cases, this was only until all the headers had been included. There were two
proposals discussed that failed to achieve consensus:</p>
<ol>
<li>
<p>Limiting it for the whole program.</p>
</li>
<li>
<p>Restricting the use of identifiers containing only uppercase letters until after all system headers had been included. It was
also pointed out that because macros might wish to access fields of a structure (and macro expansion occurs totally at point of
use) restricting names in this way would not protect the macro expansion, and thus the solution was inadequate.</p>
</li>
</ol>
<p>It was finally decided that reservation of symbols would occur, but only in the constrained manner described here.</p>
<p>The current wording also allows the addition of fields to a structure, but requires that user macros of the same name not
interfere. This allows vendors to do one of the following:</p>
<ul>
<li>
<p>Not create the situation (do not extend the structures with user-accessible names or use the solution in (7) above)</p>
</li>
<li>
<p>Extend their compilers to allow some way of adding names to structures and macros safely</p>
</li>
</ul>
<p>There are at least two ways that the compiler might be extended: add new preprocessor directives that turn off and on macro
expansion for certain symbols (without changing the value of the macro) and a function or lexical operation that suppresses
expansion of a word. The latter seems more flexible, particularly because it addresses the problem in macros as well as in
declarations.</p>
<p>The following seems to be a possible implementation extension to the C language that will do this: any token that during macro
expansion is found to be preceded by three <tt>'#'</tt> symbols shall not be further expanded in exactly the same way as described
for macros that expand to their own name as in Section 3.8.3.4 of the ISO&nbsp;C standard. A vendor may also wish to implement this
as an operation that is lexically a function, which might be implemented as:</p>
<blockquote>
<pre>
<tt>#define __safe_name(x) ###x
</tt>
</pre>
</blockquote>
<p>Using a function notation would insulate vendors from changes in standards until such a functionality is standardized (if ever).
Standardization of such a function would be valuable because it would then permit third parties to take advantage of it portably in
software they may supply.</p>
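<p>Under such a hypothetical extension, a header macro could reference a structure field without risk of expanding a
user-provided macro of the same name; for example:</p>
<blockquote>
<pre>
<tt>#define foo_field(p) ((p)-&gt;__safe_name(i))
</tt>
</pre>
</blockquote>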
<p>The symbols that are &quot;explicitly permitted, but not required by IEEE&nbsp;Std&nbsp;1003.1-2001&quot; include those classified
below. (That is, the symbols classified below might, but are not required to, be present when _POSIX_C_SOURCE is defined to have
the value 200112L.)</p>
<ul>
<li>
<p>Symbols in <a href="../basedefs/limits.h.html"><i>&lt;limits.h&gt;</i></a> and <a href=
"../basedefs/unistd.h.html"><i>&lt;unistd.h&gt;</i></a> that are defined to indicate support for options or limits that are
constant at compile-time</p>
</li>
<li>
<p>Symbols in the name space reserved for the implementation by the ISO&nbsp;C standard</p>
</li>
<li>
<p>Symbols in a name space reserved for a particular type of extension (for example, type names ending with <b>_t</b> in <a href=
"../basedefs/sys/types.h.html"><i>&lt;sys/types.h&gt;</i></a>)</p>
</li>
<li>
<p>Additional members of structures or unions whose names do not reduce the name space reserved for applications</p>
</li>
</ul>
<p>Since both implementations and future revisions of IEEE&nbsp;Std&nbsp;1003.1 and other POSIX standards may use symbols in the
reserved spaces described in these tables, there is a potential for name space clashes. To avoid future name space clashes when
adding symbols, implementations should not use the posix_, POSIX_, or _POSIX_ prefixes.</p>
<h4><a name="tag_03_02_03"></a>Error Numbers</h4>
<p>It was the consensus of the standard developers that to allow the conformance document to state that an error occurs and under
what conditions, but to disallow a statement that it never occurs, does not make sense. It could be implied by the current wording
that this is allowed, but to reduce the possibility of future interpretation requests, it is better to make an explicit
statement.</p>
<p>The ISO&nbsp;C standard requires that <i>errno</i> be an assignable lvalue. Originally, the definition in POSIX.1 was stricter
than that in the ISO&nbsp;C standard, <b>extern int</b> <i>errno</i>, in order to support historical usage. In a multi-threaded
environment, implementing <i>errno</i> as a global variable results in non-deterministic results when accessed. It is required,
however, that <i>errno</i> work as a per-thread error reporting mechanism. In order to do this, a separate <i>errno</i> value has
to be maintained for each thread. The following section discusses the various alternative solutions that were considered.</p>
<p>In order to avoid this problem altogether for new functions, these functions avoid using <i>errno</i> and, instead, return the
error number directly as the function return value; a return value of zero indicates that no error was detected.</p>
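<p>The threads interfaces follow this convention. In the following sketch (illustrative, not normative), the caller
examines the return value rather than <i>errno</i>:</p>
<blockquote>
<pre>
<tt>#include &lt;pthread.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
<br>
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
/* ... */
int err = pthread_mutex_lock(&amp;mutex);
if (err != 0)
    fprintf(stderr, "lock: %s\n", strerror(err));
</tt>
</pre>
</blockquote>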
<p>For any function that can return errors, the function return value is not used for any purpose other than for reporting errors.
Even when the output of the function is scalar, it is passed through a function argument. While it might have been possible to
allow some scalar outputs to be coded as negative function return values and mixed in with positive error status returns, this was
rejected; using the return value for a mixed purpose was judged to be of limited use and error-prone.</p>
<p>Checking the value of <i>errno</i> alone is not sufficient to determine the existence or type of an error, since it is not
required that a successful function call clear <i>errno</i>. The variable <i>errno</i> should only be examined when the return
value of a function indicates that the value of <i>errno</i> is meaningful. In that case, the function is required to set the
variable to something other than zero.</p>
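<p>The portable pattern is therefore to test the function's return value first, and only then consult <i>errno</i>,
as in this sketch:</p>
<blockquote>
<pre>
<tt>#include &lt;errno.h&gt;
#include &lt;fcntl.h&gt;
#include &lt;stdio.h&gt;
/* ... */
int fd = open(path, O_RDONLY);
if (fd == -1) {
    /* the return value indicates failure; errno is now meaningful */
    perror(path);
}
</tt>
</pre>
</blockquote>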
<p>The variable <i>errno</i> is never set to zero by any function call; to do so would contradict the ISO&nbsp;C standard.</p>
<p>POSIX.1 requires (in the ERRORS sections of function descriptions) certain error values to be set in certain conditions because
many existing applications depend on them. Some error numbers, such as [EFAULT], are entirely implementation-defined and are noted
as such in their description in the ERRORS section. This section otherwise allows wide latitude to the implementation in handling
error reporting.</p>
<p>Some of the ERRORS sections in IEEE&nbsp;Std&nbsp;1003.1-2001 have two subsections. The first:</p>
<blockquote>
<pre>
&quot;The function shall fail if:&quot;
</pre>
</blockquote>
<p>could be called the &quot;mandatory&quot; section.</p>
<p>The second:</p>
<blockquote>
<pre>
&quot;The function may fail if:&quot;
</pre>
</blockquote>
<p>could be informally known as the &quot;optional&quot; section.</p>
<p>Attempting to infer the quality of an implementation based on whether it detects optional error conditions is not useful.</p>
<p>Following each one-word symbolic name for an error, there is a description of the error. The rationale for some of the symbolic
names follows:</p>
<dl compact>
<dt>[ECANCELED]</dt>
<dd>This spelling was chosen as being more common.</dd>
<dt>[EFAULT]</dt>
<dd>Most historical implementations do not catch an error and set <i>errno</i> when an invalid address is given to the functions <a
href="../functions/wait.html"><i>wait</i>()</a>, <a href="../functions/time.html"><i>time</i>()</a>, or <a href=
"../functions/times.html"><i>times</i>()</a>. Some implementations cannot reliably detect an invalid address. And most systems that
detect invalid addresses will do so only for a system call, not for a library routine.</dd>
<dt>[EFTYPE]</dt>
<dd>This error code appeared in earlier proposals as &quot;Inappropriate operation for file type&quot;, meaning that the operation
requested is not appropriate for the file specified in the function call. It was proposed even though the same idea was already
covered by [ENOTTY], because the connotations of that name would be misleading. It was pointed out that the <a href=
"../functions/fcntl.html"><i>fcntl</i>()</a> function uses the error code [EINVAL] for this notion, and hence all instances of
[EFTYPE] were changed to this code.</dd>
<dt>[EINTR]</dt>
<dd>POSIX.1 prohibits conforming implementations from restarting interrupted system calls of conforming applications unless the
SA_RESTART flag is in effect for the signal. However, it does not require that [EINTR] be returned when another legitimate value
may be substituted; for example, a partial transfer count when <a href="../functions/read.html"><i>read</i>()</a> or <a href=
"../functions/write.html"><i>write</i>()</a> are interrupted. This is only given when the signal-catching function returns normally
as opposed to returns by mechanisms like <a href="../functions/longjmp.html"><i>longjmp</i>()</a> or <a href=
"../functions/siglongjmp.html"><i>siglongjmp</i>()</a>.</dd>
<dt>[ELOOP]</dt>
<dd>In specifying conditions under which implementations would generate this error, the following goals were considered:
<ul>
<li>
<p>To ensure that actual loops are detected, including loops that result from symbolic links across distributed file systems.</p>
</li>
<li>
<p>To ensure that during pathname resolution an application can rely on the ability to follow at least {SYMLOOP_MAX} symbolic links
in the absence of a loop.</p>
</li>
<li>
<p>To allow implementations to provide the capability of traversing more than {SYMLOOP_MAX} symbolic links in the absence of a
loop.</p>
</li>
<li>
<p>To allow implementations to detect loops and generate the error prior to encountering {SYMLOOP_MAX} symbolic links.</p>
</li>
</ul>
</dd>
<dt>[ENAMETOOLONG]</dt>
<dd>
When a symbolic link is encountered during pathname resolution, the contents of that symbolic link are used to create a new
pathname. The standard developers intended to allow, but not require, that implementations enforce the restriction of {PATH_MAX} on
the result of this pathname substitution.</dd>
<dt>[ENOMEM]</dt>
<dd>The term &quot;main memory&quot; is not used in POSIX.1 because it is implementation-defined.</dd>
<dt>[ENOTSUP]</dt>
<dd>This error code is to be used when an implementation chooses to implement the required functionality of
IEEE&nbsp;Std&nbsp;1003.1-2001 but does not support optional facilities defined by IEEE&nbsp;Std&nbsp;1003.1-2001. The return of
[ENOSYS] is to be taken to indicate that the function of the interface is not supported at all; the function will always fail with
this error code.</dd>
<dt>[ENOTTY]</dt>
<dd>The symbolic name for this error is derived from a time when device control was done by <a href=
"../functions/ioctl.html"><i>ioctl</i>()</a> and that operation was only permitted on a terminal interface. The term &quot;TTY&quot; is
derived from &quot;teletypewriter&quot;, the devices to which this error originally applied.</dd>
<dt>[EOVERFLOW]</dt>
<dd>Most of the uses of this error code are related to large file support. Typically, these cases occur on systems which support
multiple programming environments with different sizes for <b>off_t</b>, but they may also occur in connection with remote file
systems.
<p>In addition, when different programming environments have different widths for types such as <b>int</b> and <b>uid_t</b>,
several functions may encounter a condition where a value in a particular environment is too wide to be represented. In that case,
this error should be raised. For example, suppose the currently running process has 64-bit <b>int</b>, and file descriptor
9223372036854775807 is open and does not have the close-on-<i>exec</i> flag set. If the process then uses <a href=
"../functions/execl.html"><i>execl</i>()</a> to <i>exec</i> a file compiled in a programming environment with 32-bit <b>int</b>,
the call to <a href="../functions/execl.html"><i>execl</i>()</a> can fail with <i>errno</i> set to [EOVERFLOW]. A similar failure
can occur with <a href="../functions/execl.html"><i>execl</i>()</a> if any of the user IDs or any of the group IDs to be assigned
to the new process image are out of range for the executed file's programming environment.</p>
<p>Note, however, that this condition cannot occur for functions that are explicitly described as always being successful, such as
<a href="../functions/getpid.html"><i>getpid</i>()</a>.</p>
</dd>
<dt>[EPIPE]</dt>
<dd>This condition normally generates the signal SIGPIPE; the error is returned if the signal does not terminate the process.</dd>
<dt>[EROFS]</dt>
<dd>In historical implementations, attempting to <a href="../functions/unlink.html"><i>unlink</i>()</a> or <a href=
"../functions/rmdir.html"><i>rmdir</i>()</a> a mount point would generate an [EBUSY] error. An implementation could be envisioned
where such an operation could be performed without error. In this case, if <i>either</i> the directory entry or the actual data
structures reside on a read-only file system, [EROFS] is the appropriate error to generate. (For example, changing the link count
of a file on a read-only file system could not be done, as is required by <a href="../functions/unlink.html"><i>unlink</i>()</a>,
and thus an error should be reported.)</dd>
</dl>
<p>Three error numbers, [EDOM], [EILSEQ], and [ERANGE], were added to this section primarily for consistency with the ISO&nbsp;C
standard.</p>
<h5><a name="tag_03_02_03_01"></a>Alternative Solutions for Per-Thread errno</h5>
<p>The usual implementation of <i>errno</i> as a single global variable does not work in a multi-threaded environment. In such an
environment, a thread may make a POSIX.1 call and get a -1 error return, but before that thread can check the value of
<i>errno</i>, another thread might have made a second POSIX.1 call that also set <i>errno</i>. This behavior is unacceptable in
robust programs. There were a number of alternatives that were considered for handling the <i>errno</i> problem:</p>
<ul>
<li>
<p>Implement <i>errno</i> as a per-thread integer variable.</p>
</li>
<li>
<p>Implement <i>errno</i> as a service that can access the per-thread error number.</p>
</li>
<li>
<p>Change all POSIX.1 calls to accept an extra status argument and avoid setting <i>errno</i>.</p>
</li>
<li>
<p>Change all POSIX.1 calls to raise a language exception.</p>
</li>
</ul>
<p>The first option offers the highest level of compatibility with existing practice but requires special support in the linker,
compiler, and/or virtual memory system to support the new concept of thread private variables. When compared with current practice,
the third and fourth options are much cleaner, more efficient, and encourage a more robust programming style, but they require new
versions of all of the POSIX.1 functions that might detect an error. The second option offers compatibility with existing code that
uses the <a href="../basedefs/errno.h.html"><i>&lt;errno.h&gt;</i></a> header to define the symbol <i>errno</i>. In this option,
<i>errno</i> may be a macro defined:</p>
<blockquote>
<pre>
<tt>#define errno (*__errno())
extern int *__errno();
</tt>
</pre>
</blockquote>
<p>This option may be implemented as a per-thread variable whereby an <i>errno</i> field is allocated in the user space object
representing a thread, and whereby the function <i>__errno</i>() makes a system call to determine the location of its user space
object and returns the address of the <i>errno</i> field of that object. Another implementation, one that avoids calling the
kernel, involves allocating stacks in chunks. The stack allocator keeps a side table indexed by chunk number containing a pointer
to the thread object that uses that chunk. The <i>__errno</i>() function then looks at the stack pointer, determines the chunk
number, and uses that as an index into the chunk table to find its thread object and thus its private value of <i>errno</i>. On
most architectures, this can be done in four to five instructions. Some compilers may wish to implement <i>__errno</i>() inline to
improve performance.</p>
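<p>A minimal sketch of the chunk-based scheme just described follows; the chunk size, the side table, and the thread
object shown here are all hypothetical:</p>
<blockquote>
<pre>
<tt>#define __CHUNK_SHIFT 20  /* assume stacks are allocated in 1 MiB chunks */
<br>
struct __thread { int __errno_value; /* ... */ };
extern struct __thread *__chunk_table[];  /* maintained by the stack allocator */
<br>
int *__errno(void)
{
    char local;  /* its address approximates the stack pointer */
    unsigned long chunk = (unsigned long)&amp;local &gt;&gt; __CHUNK_SHIFT;
    return &amp;__chunk_table[chunk]-&gt;__errno_value;
}
</tt>
</pre>
</blockquote>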
<h5><a name="tag_03_02_03_02"></a>Disallowing Return of the [EINTR] Error Code</h5>
<p>Many blocking interfaces defined by IEEE&nbsp;Std&nbsp;1003.1-2001 may return [EINTR] if interrupted during their execution by a
signal handler. Blocking interfaces introduced under the Threads option do not have this property. Instead, they require that the
interface appear to be atomic with respect to interruption. In particular, clients of blocking interfaces need not handle any
possible [EINTR] return as a special case since it will never occur. If it is necessary to restart operations or complete
incomplete operations following the execution of a signal handler, this is handled by the implementation, rather than by the
application.</p>
<p>Requiring applications to handle [EINTR] errors on blocking interfaces has been shown to be a frequent source of often
unreproducible bugs, and it adds no compelling value to the available functionality. Thus, blocking interfaces introduced for use
by multi-threaded programs do not use this paradigm. In particular, in none of the functions <a href=
"../functions/flockfile.html"><i>flockfile</i>()</a>, <a href=
"../functions/pthread_cond_timedwait.html"><i>pthread_cond_timedwait</i>()</a>, <a href=
"../functions/pthread_cond_wait.html"><i>pthread_cond_wait</i>()</a>, <a href=
"../functions/pthread_join.html"><i>pthread_join</i>()</a>, <a href=
"../functions/pthread_mutex_lock.html"><i>pthread_mutex_lock</i>()</a>, and <a href=
"../functions/sigwait.html"><i>sigwait</i>()</a> did providing [EINTR] returns add value, or even particularly make sense. Thus,
these functions do not provide for an [EINTR] return, even when interrupted by a signal handler. The same arguments can be applied
to <a href="../functions/sem_wait.html"><i>sem_wait</i>()</a>, <a href="../functions/sem_trywait.html"><i>sem_trywait</i>()</a>, <a
href="../functions/sigwaitinfo.html"><i>sigwaitinfo</i>()</a>, and <a href=
"../functions/sigtimedwait.html"><i>sigtimedwait</i>()</a>, but implementations are permitted to return [EINTR] error codes for
these functions for compatibility with earlier versions of IEEE&nbsp;Std&nbsp;1003.1. Applications cannot rely on calls to these
functions returning [EINTR] error codes when signals are delivered to the calling thread, but they should allow for the
possibility.</p>
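<p>A portable application therefore allows for the possibility by retrying, as in this sketch:</p>
<blockquote>
<pre>
<tt>#include &lt;errno.h&gt;
#include &lt;semaphore.h&gt;
<br>
sem_t sem;
/* ... */
while (sem_wait(&amp;sem) == -1 &amp;&amp; errno == EINTR)
    continue;  /* interrupted by a signal handler; retry */
</tt>
</pre>
</blockquote>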
<h5><a name="tag_03_02_03_03"></a>Additional Error Numbers</h5>
<p>The ISO&nbsp;C standard defines the name space for implementations to add additional error numbers.</p>
<h4><a name="tag_03_02_04"></a>Signal Concepts</h4>
<p>Historical implementations of signals, using the <a href="../functions/signal.html"><i>signal</i>()</a> function, have
shortcomings that make them unreliable for many application uses. Because of this, a new signal mechanism, based very closely on
the one of 4.2 BSD and 4.3 BSD, was added to POSIX.1.</p>
<h5><a name="tag_03_02_04_01"></a>Signal Names</h5>
<p>The restriction on the actual type used for <b>sigset_t</b> is intended to guarantee that these objects can always be assigned,
have their address taken, and be passed as parameters by value. It is not intended that this type be a structure including pointers
to other data structures, as that could impact the portability of applications performing such operations. A reasonable
implementation could be a structure containing an array of some integer type.</p>
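<p>One such representation might be the following (hypothetical) declaration:</p>
<blockquote>
<pre>
<tt>typedef struct {
    unsigned long __bits[4];  /* one bit per signal */
} sigset_t;
</tt>
</pre>
</blockquote>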
<p>The signals described in IEEE&nbsp;Std&nbsp;1003.1-2001 must have unique values so that they may be named as parameters of
<b>case</b> statements in the body of a C-language <b>switch</b> clause. However, implementation-defined signals may have values
that overlap with each other or with signals specified in IEEE&nbsp;Std&nbsp;1003.1-2001. An example of this is SIGABRT, which
traditionally overlaps some other signal, such as SIGIOT.</p>
<p>SIGKILL, SIGTERM, SIGUSR1, and SIGUSR2 are ordinarily generated only through the explicit use of the <a href=
"../functions/kill.html"><i>kill</i>()</a> function, although some implementations generate SIGKILL under extraordinary
circumstances. SIGTERM is traditionally the default signal sent by the <a href="../utilities/kill.html"><i>kill</i></a>
command.</p>
<p>The signals SIGBUS, SIGEMT, SIGIOT, SIGTRAP, and SIGSYS were omitted from POSIX.1 because their behavior is
implementation-defined and could not be adequately categorized. Conforming implementations may deliver these signals, but must
document the circumstances under which they are delivered and note any restrictions concerning their delivery. The signals SIGFPE,
SIGILL, and SIGSEGV are similar in that they also generally result only from programming errors. They were included in POSIX.1
because they do indicate three relatively well-categorized conditions. They are all defined by the ISO&nbsp;C standard and thus
would have to be defined by any system with an ISO&nbsp;C standard binding, even if not explicitly included in POSIX.1.</p>
<p>There is very little that a Conforming POSIX.1 Application can do by catching, ignoring, or masking any of the signals SIGILL,
SIGTRAP, SIGIOT, SIGEMT, SIGBUS, SIGSEGV, SIGSYS, or SIGFPE. They will generally be generated by the system only in cases of
programming errors. While it may be desirable for some robust code (for example, a library routine) to be able to detect and
recover from programming errors in other code, these signals are not nearly sufficient for that purpose. One portable use that does
exist for these signals is that a command interpreter can recognize them as the cause of a process' termination (with <a href=
"../functions/wait.html"><i>wait</i>()</a>) and print an appropriate message. The mnemonic tags for these signals are derived from
their PDP-11 origin.</p>
<p>The signals SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU, and SIGCONT are provided for job control and are unchanged from 4.2 BSD. The
signal SIGCHLD is also typically used by job control shells to detect children that have terminated or, as in 4.2 BSD, stopped.</p>
<p>Some implementations, including System&nbsp;V, have a signal named SIGCLD, which is similar to SIGCHLD in 4.2 BSD. POSIX.1
permits implementations to have a single signal with both names. POSIX.1 carefully specifies ways in which conforming applications
can avoid the semantic differences between the two different implementations. The name SIGCHLD was chosen for POSIX.1 because most
current application usages of it can remain unchanged in conforming applications. SIGCLD in System&nbsp;V has more cases of
semantics that POSIX.1 does not specify, and thus applications using it are more likely to require changes in addition to the name
change.</p>
<p>The signals SIGUSR1 and SIGUSR2 are commonly used by applications for notification of exceptional behavior and are described as
&quot;reserved as application-defined&quot; so that such use is not prohibited. Implementations should not generate SIGUSR1 or SIGUSR2,
except when explicitly requested by <a href="../functions/kill.html"><i>kill</i>()</a>. It is recommended that libraries not use
these two signals, as such use in libraries could interfere with their use by applications calling the libraries. If such use is
unavoidable, it should be documented. It is prudent for non-portable libraries to use non-standard signals to avoid conflicts with
use of standard signals by portable libraries.</p>
<p>There is no portable way for an application to catch or ignore non-standard signals. Some implementations define the range of
signal numbers, so applications can install signal-catching functions for all of them. Unfortunately, implementation-defined
signals often cause problems when caught or ignored by applications that do not understand the reason for the signal. While the
desire exists for an application to be more robust by handling all possible signals (even those only generated by <a href=
"../functions/kill.html"><i>kill</i>()</a>), no existing mechanism was found to be sufficiently portable to include in POSIX.1. The
value of such a mechanism, if included, would be diminished given that SIGKILL would still not be catchable.</p>
<p>A number of new signal numbers are reserved for applications because the two user signals defined by POSIX.1 are insufficient
for many realtime applications. A range of signal numbers is specified, rather than an enumeration of additional reserved signal
names, because different applications and application profiles will require a different number of application signals. It is not
desirable to burden all application domains and therefore all implementations with the maximum number of signals required by all
possible applications. Note that in this context, signal numbers are essentially different signal priorities.</p>
<p>The relatively small number of required additional signals, {_POSIX_RTSIG_MAX}, was chosen so as not to require an unreasonably
large signal mask/set. While this number of signals defined in POSIX.1 will fit in a single 32-bit word signal mask, it is
recognized that most existing implementations define many more signals than are specified in POSIX.1 and, in fact, many
implementations have already exceeded 32 signals (including the &quot;null signal&quot;). Support of {_POSIX_RTSIG_MAX} additional signals
may push some implementation over the single 32-bit word line, but is unlikely to push any implementations that are already over
that line beyond the 64-signal line.</p>
<h5><a name="tag_03_02_04_02"></a>Signal Generation and Delivery</h5>
<p>The terms defined in this section are not used consistently in documentation of historical systems. Each signal can be
considered to have a lifetime beginning with generation and ending with delivery or acceptance. The POSIX.1 definition of
&quot;delivery&quot; does not exclude ignored signals; this is considered a more consistent definition. This revised text in several parts
of IEEE&nbsp;Std&nbsp;1003.1-2001 clarifies the distinct semantics of asynchronous signal delivery and synchronous signal
acceptance. The previous wording attempted to categorize both under the term &quot;delivery&quot;, which led to conflicts over whether the
effects of asynchronous signal delivery applied to synchronous signal acceptance.</p>
<p>Signals generated for a process are delivered to only one thread. Thus, if more than one thread is eligible to receive a signal,
one has to be chosen. The choice of threads is left entirely up to the implementation both to allow the widest possible range of
conforming implementations and to give implementations the freedom to deliver the signal to the &quot;easiest possible&quot; thread should
there be differences in ease of delivery between different threads.</p>
<p>Note that should multiple delivery among cooperating threads be required by an application, this can be trivially constructed
out of the provided single-delivery semantics. The construction of a <i>sigwait_multiple</i>() function that accomplishes this goal
is presented with the rationale for <a href="../functions/sigwaitinfo.html"><i>sigwaitinfo</i>()</a>.</p>
<p>Implementations should deliver unblocked signals as soon after they are generated as possible. However, it is difficult for
POSIX.1 to make specific requirements about this, beyond those in <a href="../functions/kill.html"><i>kill</i>()</a> and <a href=
"../functions/sigprocmask.html"><i>sigprocmask</i>()</a>. Even on systems with prompt delivery, scheduling of higher priority
processes is always likely to cause delays.</p>
<p>In general, the interval between the generation and delivery of unblocked signals cannot be detected by an application. Thus,
references to pending signals generally apply to blocked, pending signals. An implementation registers a signal as pending on the
process when no thread has the signal unblocked and there are no threads blocked in a <a href=
"../functions/sigwait.html"><i>sigwait</i>()</a> function for that signal. Thereafter, the implementation delivers the signal to
the first thread that unblocks the signal or calls a <a href="../functions/sigwait.html"><i>sigwait</i>()</a> function on a signal
set containing this signal rather than choosing the recipient thread at the time the signal is sent.</p>
<p>In the 4.3 BSD system, signals that are blocked and set to SIG_IGN are discarded immediately upon generation. For a signal that
is ignored as its default action, if the action is SIG_DFL and the signal is blocked, a generated signal remains pending. In the
4.1 BSD system and in System&nbsp;V Release 3 (two other implementations that support a somewhat similar signal mechanism), all
ignored blocked signals remain pending if generated. Because it is not normally useful for an application to simultaneously ignore
and block the same signal, it was unnecessary for POSIX.1 to specify behavior that would invalidate any of the historical
implementations.</p>
<p>There is one case in some historical implementations where an unblocked, pending signal does not remain pending until it is
delivered. In the System&nbsp;V implementation of <a href="../functions/signal.html"><i>signal</i>()</a>, pending signals are
discarded when the action is set to SIG_DFL or a signal-catching routine (as well as to SIG_IGN). Except in the case of setting
SIGCHLD to SIG_DFL, implementations that do this do not conform completely to POSIX.1. Some earlier proposals for POSIX.1
explicitly stated this, but these statements were redundant due to the requirement that functions defined by POSIX.1 not change
attributes of processes defined by POSIX.1 except as explicitly stated.</p>
<p>POSIX.1 specifically states that the order in which multiple, simultaneously pending signals are delivered is unspecified. This
order has not been explicitly specified in historical implementations, but has remained quite consistent and been known to those
familiar with the implementations. Thus, there have been cases where applications (usually system utilities) have been written with
explicit or implicit dependencies on this order. Implementors and others porting existing applications may need to be aware of such
dependencies.</p>
<p>When there are multiple pending signals that are not blocked, implementations should arrange for the delivery of all signals at
once, if possible. Some implementations stack calls to all pending signal-catching routines, making it appear that each
signal-catcher was interrupted by the next signal. In this case, the implementation should ensure that this stacking of signals
does not violate the semantics of the signal masks established by <a href="../functions/sigaction.html"><i>sigaction</i>()</a>.
Other implementations process at most one signal when the operating system is entered, with remaining signals saved for later
delivery. Although this practice is widespread, this behavior is neither standardized nor endorsed. In either case, implementations
should attempt to deliver signals associated with the current state of the process (for example, SIGFPE) before other signals, if
possible.</p>
<p>In 4.2 BSD and 4.3 BSD, it is not permissible to ignore or explicitly block SIGCONT, because if blocking or ignoring this signal
prevented it from continuing a stopped process, such a process could never be continued (only killed by SIGKILL). However, 4.2 BSD
and 4.3 BSD do block SIGCONT during execution of its signal-catching function when it is caught, creating exactly this problem. A
proposal was considered to disallow catching SIGCONT in addition to ignoring and blocking it, but this limitation led to
objections. The consensus was to require that SIGCONT always continue a stopped process when generated. This removed the need to
disallow ignoring or explicit blocking of the signal; note that SIG_IGN and SIG_DFL are equivalent for SIGCONT.</p>
<h5><a name="tag_03_02_04_03"></a>Realtime Signal Generation and Delivery</h5>
<p>The Realtime Signals Extension option to POSIX.1 signal generation and delivery behavior is required for the following
reasons:</p>
<ul>
<li>
<p>The <b>sigevent</b> structure is used by other POSIX.1 functions that result in asynchronous event notifications to specify the
notification mechanism to use and other information needed by the notification mechanism. IEEE&nbsp;Std&nbsp;1003.1-2001 defines
only three symbolic values for the notification mechanism. SIGEV_NONE is used to indicate that no notification is required when the
event occurs. This is useful for applications that use asynchronous I/O with polling for completion. SIGEV_SIGNAL indicates that a
signal is generated when the event occurs. SIGEV_THREAD provides for &quot;callback functions&quot; for asynchronous notifications done by
a function call within the context of a new thread. This provides a multi-threaded process a more natural means of notification
than signals. The primary difficulty with previous notification approaches has been to specify the environment of the notification
routine.</p>
<ul>
<li>
<p>One approach is to limit the notification routine to call only functions permitted in a signal handler. While the list of
permissible functions is clearly stated, this is overly restrictive.</p>
</li>
<li>
<p>A second approach is to define a new list of functions or classes of functions that are explicitly permitted or not permitted.
This would give a programmer more lists to deal with, which would be awkward.</p>
</li>
<li>
<p>The third approach is to define completely the environment for execution of the notification function. A clear definition of an
execution environment for notification is provided by executing the notification function in the environment of a newly created
thread.</p>
</li>
</ul>
<p>Implementations may support additional notification mechanisms by defining new values for <i>sigev_notify</i>.</p>
<p>For a notification type of SIGEV_SIGNAL, the other members of the <b>sigevent</b> structure defined by
IEEE&nbsp;Std&nbsp;1003.1-2001 specify the realtime signal (that is, the signal number and application-defined value that
differentiates between occurrences of signals with the same number) that will be generated when the event occurs. The structure is
defined in <a href="../basedefs/signal.h.html"><i>&lt;signal.h&gt;</i></a>, even though the structure is not directly used by any
of the signal functions, because it is part of the signals interface used by the POSIX.1b &quot;client functions&quot;. When the client
functions include <a href="../basedefs/signal.h.html"><i>&lt;signal.h&gt;</i></a> to define the signal names, the <b>sigevent</b>
structure will also be defined.</p>
<p>An application-defined value passed to the signal handler is used to differentiate between different &quot;events&quot; instead of
requiring that the application use different signal numbers for several reasons:</p>
<ul>
<li>
<p>Realtime applications potentially handle a very large number of different events. Requiring that implementations support a
correspondingly large number of distinct signal numbers will adversely impact the performance of signal delivery because the signal
masks to be manipulated on entry and exit to the handlers will become large.</p>
</li>
<li>
<p>Event notifications are prioritized by signal number (the rationale for this is explained in the following paragraphs) and the
use of different signal numbers to differentiate between the different event notifications overloads the signal number more than
has already been done. It also requires that the application writer make arbitrary assignments of priority to events that are
logically of equal priority.</p>
</li>
</ul>
<p>A union is defined for the application-defined value so that either an integer constant or a pointer can be portably passed to
the signal-catching function. On some architectures a pointer cannot be cast to an <b>int</b> and <i>vice versa</i>.</p>
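<p>For example (a sketch assuming the Realtime Signals Extension), either member of the union can be passed when a
signal is queued:</p>
<blockquote>
<pre>
<tt>#include &lt;signal.h&gt;
/* ... */
union sigval value;
value.sival_int = 42;  /* or: value.sival_ptr = some_pointer; */
sigqueue(pid, SIGRTMIN, value);
</tt>
</pre>
</blockquote>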
<p>Use of a structure here with an explicit notification type discriminant rather than explicit parameters to realtime functions,
or embedded in other realtime structures, provides for future extensions to IEEE&nbsp;Std&nbsp;1003.1-2001. Additional, perhaps
more efficient, notification mechanisms can be supported for existing realtime function interfaces, such as timers and asynchronous
I/O, by extending the <b>sigevent</b> structure appropriately. The existing realtime function interfaces will not have to be
modified to use any such new notification mechanism. The revised text concerning the SIGEV_SIGNAL value makes consistent the
semantics of the members of the <b>sigevent</b> structure, particularly in the definitions of <a href=
"../functions/lio_listio.html"><i>lio_listio</i>()</a> and <a href="../functions/aio_fsync.html"><i>aio_fsync</i>()</a>. For
uniformity, other revisions cause this specification to be referred to rather than inaccurately duplicated in the descriptions of
functions and structures using the <b>sigevent</b> structure. The revised wording does not relax the requirement that the signal
number be in the range SIGRTMIN to SIGRTMAX to guarantee queuing and passing of the application value, since that requirement is
still implied by the signal names.</p>
</li>
<li>
<p>IEEE&nbsp;Std&nbsp;1003.1-2001 is intentionally vague on whether &quot;non-realtime&quot; signal-generating mechanisms can result in a
<b>siginfo_t</b> being supplied to the handler on delivery. In one existing implementation, a <b>siginfo_t</b> is posted on signal
generation, even though the implementation does not support queuing of multiple occurrences of a signal. It is not the intent of
IEEE&nbsp;Std&nbsp;1003.1-2001 to preclude this, independent of the mandate to define signals that do support queuing. Any
interpretation that appears to preclude this is a mistake in the reading or writing of the standard.</p>
</li>
<li>
<p>Signals handled by realtime signal handlers might be generated by functions or conditions that do not allow the specification of
an application-defined value and do not queue. IEEE&nbsp;Std&nbsp;1003.1-2001 specifies the <i>si_code</i> member of the
<b>siginfo_t</b> structure used in existing practice and defines additional codes so that applications can detect whether an
application-defined value is present or not. The code SI_USER for <a href="../functions/kill.html"><i>kill</i>()</a>-generated
signals is adopted from existing practice.</p>
</li>
<li>
<p>The <a href="../functions/sigaction.html"><i>sigaction</i>()</a> <i>sa_flags</i> value SA_SIGINFO tells the implementation that
the signal-catching function expects two additional arguments. When the flag is not set, a single argument, the signal number, is
passed as specified by IEEE&nbsp;Std&nbsp;1003.1-2001. Although IEEE&nbsp;Std&nbsp;1003.1-2001 does not explicitly allow the
<i>info</i> argument to the handler function to be NULL, this is existing practice. This provides for compatibility with programs
whose signal-catching functions are not prepared to accept the additional arguments. IEEE&nbsp;Std&nbsp;1003.1-2001 is explicitly
unspecified as to whether signals actually queue when SA_SIGINFO is not set for a signal, as there appear to be no benefits to
applications in specifying one behavior or another. One existing implementation queues a <b>siginfo_t</b> on each signal
generation, unless the signal is already pending, in which case the implementation discards the new <b>siginfo_t</b>; that is, the
queue length is never greater than one. This implementation only examines SA_SIGINFO on signal delivery, discarding the queued
<b>siginfo_t</b> if its delivery was not requested.</p>
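<p>When SA_SIGINFO is set, the signal-catching function therefore has the following form (a sketch; the test for a
NULL <i>info</i> argument reflects the existing practice noted above):</p>
<blockquote>
<pre>
<tt>void handler(int signo, siginfo_t *info, void *context)
{
    if (info != NULL &amp;&amp; info-&gt;si_code == SI_QUEUE) {
        /* info-&gt;si_value carries the application-defined value */
    }
}
</tt>
</pre>
</blockquote>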
<p>IEEE&nbsp;Std&nbsp;1003.1-2001 specifies several new values for the <i>si_code</i> member of the <b>siginfo_t</b> structure. In
existing practice, a <i>si_code</i> value of less than or equal to zero indicates that the signal was generated by a process via
the <a href="../functions/kill.html"><i>kill</i>()</a> function. In existing practice, values of <i>si_code</i> that provide
additional information for implementation-generated signals, such as SIGFPE or SIGSEGV, are all positive. Thus, if implementations
define the new constants specified in IEEE&nbsp;Std&nbsp;1003.1-2001 to be negative numbers, programs written to use existing
practice will not break. IEEE&nbsp;Std&nbsp;1003.1-2001 chose not to attempt to specify existing practice values of <i>si_code</i>
other than SI_USER both because it was deemed beyond the scope of IEEE&nbsp;Std&nbsp;1003.1-2001 and because many of the values in
existing practice appear to be platform and implementation-defined. But, IEEE&nbsp;Std&nbsp;1003.1-2001 does specify that if an
implementation (for example, one that does not have existing practice in this area) chooses to define additional values for
<i>si_code</i>, these values have to be different from the values of the symbols specified by IEEE&nbsp;Std&nbsp;1003.1-2001. This
will allow conforming applications to differentiate between signals generated by one of the POSIX.1b asynchronous events and those
generated by other implementation events in a manner compatible with existing practice.</p>
<p>The unique values of <i>si_code</i> for the POSIX.1b asynchronous events have implications for implementations of, for example,
asynchronous I/O or message passing in user space library code. Such an implementation will be required to provide a hidden
interface to the signal generation mechanism that allows the library to specify the standard values of <i>si_code</i>.</p>
<p>Existing practice also defines additional members of <b>siginfo_t</b>, such as the process ID and user ID of the sending process
for <a href="../functions/kill.html"><i>kill</i>()</a>-generated signals. These members were deemed not necessary to meet the
requirements of realtime applications and are not specified by IEEE&nbsp;Std&nbsp;1003.1-2001. Neither are they precluded.</p>
<p>The third argument to the signal-catching function, <i>context</i>, is left undefined by IEEE&nbsp;Std&nbsp;1003.1-2001, but is
specified in the interface because it matches existing practice for the SA_SIGINFO flag. It was considered undesirable to require a
separate implementation for SA_SIGINFO for POSIX conformance on implementations that already support the two additional
parameters.</p>
</li>
<li>
<p>The requirement to deliver lower numbered signals in the range SIGRTMIN to SIGRTMAX first, when multiple unblocked signals are
pending, results from several considerations:</p>
<ul>
<li>
<p>A method is required to prioritize event notifications. The signal number was chosen instead of, for instance, associating a
separate priority with each request, because an implementation has to check pending signals at various points and select one for
delivery when more than one is pending. Specifying a selection order is the minimal additional semantic that will achieve
prioritized delivery. If a separate priority were to be associated with queued signals, it would be necessary for an implementation
to search all non-empty, non-blocked signal queues and select from among them the pending signal with the highest priority. This
would significantly increase the cost of and decrease the determinism of signal delivery.</p>
</li>
<li>
<p>Given the specified selection of the lowest numeric unblocked pending signal, preemptive priority signal delivery can be
achieved using signal numbers and signal masks by ensuring that the <i>sa_mask</i> for each signal number blocks all signals with a
higher numeric value (a sketch of this technique follows this list).</p>
<p>For realtime applications that want to use only the newly defined realtime signal numbers without interference from the standard
signals, this can be achieved by blocking all of the standard signals in the process signal mask and in the <i>sa_mask</i>
installed by the signal action for the realtime signal handlers.</p>
</li>
</ul>
<p>IEEE&nbsp;Std&nbsp;1003.1-2001 explicitly leaves unspecified the ordering of signals outside of the range of realtime signals
and the ordering of signals within this range with respect to those outside the range. It was believed that this would unduly
constrain implementations or standards in the future definition of new signals.</p>
</li>
</ul>
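<p>The following minimal sketch illustrates the prioritized delivery technique described above. The function name
<i>install_prioritized</i>() and the empty handler body are hypothetical; the <i>sa_mask</i> is built so that only
lower-numbered (higher priority) realtime signals can preempt the handler:</p>
<blockquote>
<pre>
<tt>#include &lt;signal.h&gt;

static void handler(int sig, siginfo_t *info, void *context)
{
    /* application-specific processing */
}

/* Install a handler for rtsig whose sa_mask blocks all realtime
   signals with a higher numeric value, so that only higher priority
   (lower-numbered) signals preempt it. */
int install_prioritized(int rtsig)
{
    struct sigaction act;
    int s;

    act.sa_sigaction = handler;
    act.sa_flags = SA_SIGINFO;
    sigemptyset(&amp;act.sa_mask);
    for (s = rtsig + 1; s &lt;= SIGRTMAX; s++)
        sigaddset(&amp;act.sa_mask, s);
    return sigaction(rtsig, &amp;act, (struct sigaction *)0);
}
</tt>
</pre>
</blockquote>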
<h5><a name="tag_03_02_04_04"></a>Signal Actions</h5>
<p>Early proposals mentioned SIGCONT as a second exception to the rule that signals are not delivered to stopped processes until
continued. Because IEEE&nbsp;Std&nbsp;1003.1-2001 now specifies that SIGCONT causes the stopped process to continue when it is
generated, delivery of SIGCONT is not prevented because a process is stopped, even without an explicit exception to this rule.</p>
<p>Ignoring a signal by setting the action to SIG_IGN (or SIG_DFL for signals whose default action is to ignore) is not the same as
installing a signal-catching function that simply returns. Invoking such a function will interrupt certain system functions that
block processes (for example, <a href="../functions/wait.html"><i>wait</i>()</a>, <a href=
"../functions/sigsuspend.html"><i>sigsuspend</i>()</a>, <a href="../functions/pause.html"><i>pause</i>()</a>, <a href=
"../functions/read.html"><i>read</i>()</a>, <a href="../functions/write.html"><i>write</i>()</a>) while ignoring a signal has no
such effect on the process.</p>
<p>Historical implementations discard pending signals when the action is set to SIG_IGN. However, they do not always do the same
when the action is set to SIG_DFL and the default action is to ignore the signal. IEEE&nbsp;Std&nbsp;1003.1-2001 requires this for
the sake of consistency and also for completeness, since the only signal this applies to is SIGCHLD, and
IEEE&nbsp;Std&nbsp;1003.1-2001 disallows setting its action to SIG_IGN.</p>
<p>Some implementations (System&nbsp;V, for example) assign different semantics for SIGCLD depending on whether the action is set
to SIG_IGN or SIG_DFL. Since POSIX.1 requires that the default action for SIGCHLD be to ignore the signal, applications that
wish to ignore SIGCHLD should always set the action to SIG_DFL in order to avoid these differing semantics.</p>
<p>Whether or not an implementation allows SIG_IGN as a SIGCHLD disposition to be inherited across a call to one of the <i>exec</i>
family of functions or <a href="../functions/posix_spawn.html"><i>posix_spawn</i>()</a> is explicitly left as unspecified. This
change was made as a result of IEEE PASC Interpretation 1003.1 #132, and permits the implementation to decide between the following
alternatives:</p>
<ul>
<li>
<p>Unconditionally leave SIGCHLD set to SIG_IGN, in which case the implementation would not allow applications that assume
inheritance of SIG_DFL to conform to IEEE&nbsp;Std&nbsp;1003.1-2001 without change. The implementation would, however, retain an
ability to control applications that create child processes but never call on the <i>wait</i> family of functions, potentially
filling up the process table.</p>
</li>
<li>
<p>Unconditionally reset SIGCHLD to SIG_DFL, in which case the implementation would allow applications that assume inheritance of
SIG_DFL to conform. The implementation would, however, lose an ability to control applications that spawn child processes but never
reap them.</p>
</li>
<li>
<p>Provide some mechanism, not specified in IEEE&nbsp;Std&nbsp;1003.1-2001, to control inherited SIGCHLD dispositions.</p>
</li>
</ul>
<p>Some implementations (System&nbsp;V, for example) will deliver a SIGCLD signal immediately when a process establishes a
signal-catching function for SIGCLD when that process has a child that has already terminated. Other implementations, such as 4.3
BSD, do not generate a new SIGCHLD signal in this way. In general, a process should not attempt to alter the signal action for the
SIGCHLD signal while it has any outstanding children. However, it is not always possible for a process to avoid this; for example,
shells sometimes start up processes in pipelines with other processes from the pipeline as children. Processes that cannot ensure
that they have no children when altering the signal action for SIGCHLD thus need to be prepared for, but not depend on, generation
of an immediate SIGCHLD signal.</p>
<p>The default action of the stop signals (SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU) is to stop a process that is executing. If a stop
signal is delivered to a process that is already stopped, it has no effect. In fact, if a stop signal is generated for a stopped
process whose signal mask blocks the signal, the signal will never be delivered to the process since the process must receive a
SIGCONT, which discards all pending stop signals, in order to continue executing.</p>
<p>The SIGCONT signal continues a stopped process even if SIGCONT is blocked (or ignored). However, if a signal-catching routine
has been established for SIGCONT, it will not be entered until SIGCONT is unblocked.</p>
<p>If a process in an orphaned process group stops, it is no longer under the control of a job control shell and hence would not
normally ever be continued. Because of this, orphaned processes that receive terminal-related stop signals (SIGTSTP, SIGTTIN,
SIGTTOU, but not SIGSTOP) must not be allowed to stop. The goal is to prevent stopped processes from languishing forever. (As
SIGSTOP is sent only via <a href="../functions/kill.html"><i>kill</i>()</a>, it is assumed that the process or user sending a
SIGSTOP can send a SIGCONT when desired.) Instead, the system must discard the stop signal. As an extension, it may also deliver
another signal in its place. 4.3 BSD sends a SIGKILL, which is overly effective because SIGKILL is not catchable. Another possible
choice is SIGHUP. 4.3 BSD also does this for orphaned processes (processes whose parent has terminated) rather than for members of
orphaned process groups; this is less desirable because job control shells manage process groups. POSIX.1 also prevents SIGTTIN and
SIGTTOU signals from being generated for processes in orphaned process groups as a direct result of activity on a terminal,
preventing infinite loops when <a href="../functions/read.html"><i>read</i>()</a> and <a href=
"../functions/write.html"><i>write</i>()</a> calls generate signals that are discarded; see <a href=
"xbd_chap11.html#tag_01_11_01_04"><i>Terminal Access Control</i></a> . A similar restriction on the generation of SIGTSTP was
considered, but that would be unnecessary and more difficult to implement due to its asynchronous nature.</p>
<p>Although POSIX.1 requires that signal-catching functions be called with only one argument, there is nothing to prevent
conforming implementations from extending POSIX.1 to pass additional arguments, as long as Strictly Conforming POSIX.1 Applications
continue to compile and execute correctly. Most historical implementations do, in fact, pass additional, signal-specific arguments
to certain signal-catching routines.</p>
<p>There was a proposal to change the declared type of the signal handler to:</p>
<blockquote>
<pre>
<tt>void</tt> <i>func</i> <tt>(int</tt> <i>sig</i><tt>, ...);
</tt>
</pre>
</blockquote>
<p>The usage of ellipses ( <tt>"..."</tt> ) is ISO&nbsp;C standard syntax to indicate a variable number of arguments. Its use was
intended to allow the implementation to pass additional information to the signal handler in a standard manner.</p>
<p>Unfortunately, this construct would require all signal handlers to be defined with this syntax because the ISO&nbsp;C standard
allows implementations to use a different parameter passing mechanism for variable parameter lists than for non-variable parameter
lists. Thus, all existing signal handlers in all existing applications would have to be changed to use the variable syntax in order
to be standard and portable. This is in conflict with the goal of Minimal Changes to Existing Application Code.</p>
<p>When terminating a process from a signal-catching function, processes should be aware of any interpretation that their parent
may make of the status returned by <a href="../functions/wait.html"><i>wait</i>()</a> or <a href=
"../functions/waitpid.html"><i>waitpid</i>()</a>. In particular, a signal-catching function should not call <i>exit</i>(0) or
<i>_exit</i>(0) unless it wants to indicate successful termination. A non-zero argument to <a href=
"../functions/exit.html"><i>exit</i>()</a> or <a href="../functions/_exit.html"><i>_exit</i>()</a> can be used to indicate
unsuccessful termination. Alternatively, the process can use <a href="../functions/kill.html"><i>kill</i>()</a> to send itself a
fatal signal (first ensuring that the signal is set to the default action and not blocked). See also the RATIONALE section of the
<a href="../functions/_exit.html"><i>_exit</i>()</a> function.</p>
<p>The behavior of <i>unsafe</i> functions, as defined by this section, is undefined when they are invoked from signal-catching
functions in certain circumstances. The behavior of reentrant functions, as defined by this section, is as specified by POSIX.1,
regardless of invocation from a signal-catching function. This is the only intended meaning of the statement that reentrant
functions may be used in signal-catching functions without restriction. Applications must still consider all effects of such
functions on such things as data structures, files, and process state. In particular, application writers need to consider the
restrictions on interactions when interrupting <a href="../functions/sleep.html"><i>sleep</i>()</a> (see <a href=
"../functions/sleep.html"><i>sleep</i>()</a>) and interactions among multiple handles for a file description. The fact that any
specific function is listed as reentrant does not necessarily mean that invocation of that function from a signal-catching function
is recommended.</p>
<p>In order to prevent errors arising from interrupting non-reentrant function calls, applications should protect calls to these
functions either by blocking the appropriate signals or through the use of some programmatic semaphore. POSIX.1 does not address
the more general problem of synchronizing access to shared data structures. Note in particular that even the &quot;safe&quot; functions may
modify the global variable <i>errno</i>; the signal-catching function may want to save and restore its value. The same principles
apply to the reentrancy of application routines and asynchronous data access.</p>
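<p>For example, a signal-catching function that preserves <i>errno</i> around an async-signal-safe call might look like the
following sketch:</p>
<blockquote>
<pre>
<tt>#include &lt;errno.h&gt;
#include &lt;unistd.h&gt;

static void handler(int sig)
{
    int saved_errno = errno;     /* write() may modify errno */

    (void) write(STDERR_FILENO, "caught signal\n", 14);
    errno = saved_errno;         /* restore for the interrupted code */
}
</tt>
</pre>
</blockquote>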
<p>Note that <a href="../functions/longjmp.html"><i>longjmp</i>()</a> and <a href=
"../functions/siglongjmp.html"><i>siglongjmp</i>()</a> are not in the list of reentrant functions. This is because the code
executing after <a href="../functions/longjmp.html"><i>longjmp</i>()</a> or <a href=
"../functions/siglongjmp.html"><i>siglongjmp</i>()</a> can call any unsafe functions with the same danger as calling those unsafe
functions directly from the signal handler. Applications that use <a href="../functions/longjmp.html"><i>longjmp</i>()</a> or <a
href="../functions/siglongjmp.html"><i>siglongjmp</i>()</a> out of signal handlers require rigorous protection in order to be
portable. Many of the other functions that are excluded from the list are traditionally implemented using either the C language <a
href="../functions/malloc.html"><i>malloc</i>()</a> or <a href="../functions/free.html"><i>free</i>()</a> functions or the
ISO&nbsp;C standard I/O library, both of which traditionally use data structures in a non-reentrant manner. Because any combination
of different functions using a common data structure can cause reentrancy problems, POSIX.1 does not define the behavior when any
unsafe function is called in a signal handler that interrupts any unsafe function.</p>
<p>The only realtime extension to signal actions is the addition of the additional parameters to the signal-catching function. This
extension has been explained and motivated in the previous section. In making this extension, though, developers of POSIX.1b ran
into issues relating to function prototypes. In response to input from the POSIX.1 standard developers, members were added to the
<b>sigaction</b> structure to specify function prototypes for the newer signal-catching function specified by POSIX.1b. These
members follow changes that are being made to POSIX.1. Note that IEEE&nbsp;Std&nbsp;1003.1-2001 explicitly states that these fields
may overlap so that a union can be defined. This enabled existing implementations of POSIX.1 to maintain binary-compatibility when
these extensions were added.</p>
<p>The <b>siginfo_t</b> structure was adopted for passing the application-defined value to match existing practice, but the
existing practice has no provision for an application-defined value, so this was added. Note that POSIX normally reserves the
&quot;_t&quot; type designation for opaque types. The <b>siginfo_t</b> structure breaks with this convention to follow existing practice
and thus promote portability. Standardization of the existing practice for the other members of this structure may be addressed in
the future.</p>
<p>Although it is not explicitly visible to applications, there are additional semantics for signal actions implied by queued
signals and their interaction with other POSIX.1b realtime functions. Specifically:</p>
<ul>
<li>
<p>It is not necessary to queue signals whose action is SIG_IGN.</p>
</li>
<li>
<p>For implementations that support POSIX.1b timers, some interaction with the timer functions at signal delivery is implied to
manage the timer overrun count.</p>
</li>
</ul>
<h5><a name="tag_03_02_04_05"></a>Signal Effects on Other Functions</h5>
<p>The most common behavior of an interrupted function after a signal-catching function returns is for the interrupted function to
give an [EINTR] error unless the SA_RESTART flag is in effect for the signal. However, there are a number of specific exceptions,
including <a href="../functions/sleep.html"><i>sleep</i>()</a> and certain situations with <a href=
"../functions/read.html"><i>read</i>()</a> and <a href="../functions/write.html"><i>write</i>()</a>.</p>
<p>The historical implementations of many functions defined by IEEE&nbsp;Std&nbsp;1003.1-2001 are not interruptible, but delay
delivery of signals generated during their execution until after they complete. This is never a problem for functions that are
guaranteed to complete in a short (imperceptible to a human) period of time. It is normally those functions that can suspend a
process indefinitely or for long periods of time (for example, <a href="../functions/wait.html"><i>wait</i>()</a>, <a href=
"../functions/pause.html"><i>pause</i>()</a>, <a href="../functions/sigsuspend.html"><i>sigsuspend</i>()</a>, <a href=
"../functions/sleep.html"><i>sleep</i>()</a>, or <a href="../functions/read.html"><i>read</i>()</a>/ <a href=
"../functions/write.html"><i>write</i>()</a> on a slow device like a terminal) that are interruptible. This permits applications to
respond to interactive signals or to set timeouts on calls to most such functions with <a href=
"../functions/alarm.html"><i>alarm</i>()</a>. Therefore, implementations should generally make such functions (including ones
defined as extensions) interruptible.</p>
<p>Functions not mentioned explicitly as interruptible may be so on some implementations, possibly as an extension where the
function gives an [EINTR] error. There are several functions (for example, <a href="../functions/getpid.html"><i>getpid</i>()</a>,
<a href="../functions/getuid.html"><i>getuid</i>()</a>) that are specified as never returning an error, which can thus never be
extended in this way.</p>
<p>If a signal-catching function returns while the SA_RESTART flag is in effect, an interrupted function is restarted at the point
it was interrupted. Conforming applications cannot make assumptions about the internal behavior of interrupted functions, even if
the functions are async-signal-safe. For example, suppose the <a href="../functions/read.html"><i>read</i>()</a> function is
interrupted with SA_RESTART in effect, the signal-catching function closes the file descriptor being read from and returns, and the
<a href="../functions/read.html"><i>read</i>()</a> function is then restarted; in this case the application cannot assume that the
<a href="../functions/read.html"><i>read</i>()</a> function will give an [EBADF] error, since <a href=
"../functions/read.html"><i>read</i>()</a> might have checked the file descriptor for validity before being interrupted.</p>
<h4><a name="tag_03_02_05"></a>Standard I/O Streams</h4>
<h5><a name="tag_03_02_05_01"></a>Interaction of File Descriptors and Standard I/O Streams</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_03_02_05_02"></a>Stream Orientation and Encoding Rules</h5>
<p>There is no additional rationale provided for this section.</p>
<h4><a name="tag_03_02_06"></a>STREAMS</h4>
<p>STREAMS are introduced into IEEE&nbsp;Std&nbsp;1003.1-2001 as part of the alignment with the Single UNIX Specification, but
marked as an option in recognition that not all systems may wish to implement the facility. The option within
IEEE&nbsp;Std&nbsp;1003.1-2001 is denoted by the XSR margin marker. The standard developers made this option independent of the XSI
option.</p>
<p>STREAMS are a method of implementing network services and other character-based input/output mechanisms, with the STREAM being a
full-duplex connection between a process and a device. STREAMS provides direct access to protocol modules, and optional protocol
modules can be interposed between the process-end of the STREAM and the device-driver at the device-end of the STREAM. Pipes can be
implemented using the STREAMS mechanism, so they can provide process-to-process as well as process-to-device communications.</p>
<p>This section introduces STREAMS I/O, the message types used to control them, an overview of the priority mechanism, and the
interfaces used to access them.</p>
<h5><a name="tag_03_02_06_01"></a>Accessing STREAMS</h5>
<p>There is no additional rationale provided for this section.</p>
<h4><a name="tag_03_02_07"></a>XSI Interprocess Communication</h4>
<p>There are two forms of IPC supported as options in IEEE&nbsp;Std&nbsp;1003.1-2001. The traditional System&nbsp;V IPC routines
derived from the SVID-that is, the <i>msg*</i>(),
<i>sem*</i>(), and <i>shm*</i>() interfaces-are mandatory on
XSI-conformant systems. Thus, all XSI-conformant systems provide the same mechanisms for manipulating messages, shared memory, and
semaphores.</p>
<p>In addition, the POSIX Realtime Extension provides an alternate set of routines for those systems supporting the appropriate
options.</p>
<p>The application writer is presented with a choice: the System&nbsp;V interfaces or the POSIX interfaces (loosely derived from
the Berkeley interfaces). The XSI profile prefers the System&nbsp;V interfaces, but the POSIX interfaces may be more suitable for
realtime or other performance-sensitive applications.</p>
<h5><a name="tag_03_02_07_01"></a>IPC General Information</h5>
<p>General information that is shared by all three mechanisms is described in this section. The common permissions mechanism is
briefly introduced, describing the mode bits, and how they are used to determine whether or not a process has access to read or
write/alter the appropriate instance of one of the IPC mechanisms. All other relevant information is contained in the reference
pages themselves.</p>
<p>The semaphore type of IPC allows processes to communicate through the exchange of semaphore values. A semaphore is a positive
integer. Since many applications require the use of more than one semaphore, XSI-conformant systems have the ability to create sets
or arrays of semaphores.</p>
<p>Calls to support semaphores include:</p>
<blockquote><a href="../functions/semctl.html"><i>semctl</i>()</a>, <a href="../functions/semget.html"><i>semget</i>()</a>, <a
href="../functions/semop.html"><i>semop</i>()</a></blockquote>
<p>Semaphore sets are created by using the <a href="../functions/semget.html"><i>semget</i>()</a> function.</p>
<p>The message type of IPC allows processes to communicate through the exchange of data stored in buffers. This data is transmitted
between processes in discrete portions known as messages.</p>
<p>Calls to support message queues include:</p>
<blockquote><a href="../functions/msgctl.html"><i>msgctl</i>()</a>, <a href="../functions/msgget.html"><i>msgget</i>()</a>, <a
href="../functions/msgrcv.html"><i>msgrcv</i>()</a>, <a href="../functions/msgsnd.html"><i>msgsnd</i>()</a></blockquote>
<p>The shared memory type of IPC allows two or more processes to share memory and consequently the data contained therein. This is
done by allowing processes to set up access to a common memory address space. This sharing of memory provides a fast means of
exchange of data between processes.</p>
<p>Calls to support shared memory include:</p>
<blockquote><a href="../functions/shmctl.html"><i>shmctl</i>()</a>, <a href="../functions/shmdt.html"><i>shmdt</i>()</a>, <a href=
"../functions/shmget.html"><i>shmget</i>()</a></blockquote>
<p>The <a href="../functions/ftok.html"><i>ftok</i>()</a> interface is also provided.</p>
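<p>A minimal sketch of the semaphore interfaces follows, guarding a critical section. The key pathname
<tt>"/tmp/app.key"</tt> and the project identifier are hypothetical, and a real application would initialize the semaphore
value only upon first creation:</p>
<blockquote>
<pre>
<tt>#include &lt;sys/types.h&gt;
#include &lt;sys/ipc.h&gt;
#include &lt;sys/sem.h&gt;

int example(void)
{
    union semun { int val; struct semid_ds *buf; unsigned short *array; } arg;
    struct sembuf op;
    key_t key;
    int semid;

    key = ftok("/tmp/app.key", 'a');
    if (key == (key_t)-1)
        return -1;
    semid = semget(key, 1, IPC_CREAT | 0600);
    if (semid == -1)
        return -1;
    arg.val = 1;                 /* initial value: unlocked */
    semctl(semid, 0, SETVAL, arg);
    op.sem_num = 0;
    op.sem_flg = 0;
    op.sem_op = -1;              /* wait: decrement, blocking at zero */
    if (semop(semid, &amp;op, 1) == -1)
        return -1;
    /* ... critical section ... */
    op.sem_op = 1;               /* post: increment */
    return semop(semid, &amp;op, 1);
}
</tt>
</pre>
</blockquote>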
<h4><a name="tag_03_02_08"></a>Realtime</h4>
<h5><a name="tag_03_02_08_01"></a>Advisory Information</h5>
<p>POSIX.1b contains an Informative Annex with proposed interfaces for &quot;realtime files&quot;. These interfaces could determine groups
of the exact parameters required to do &quot;direct I/O&quot; or &quot;extents&quot;. These interfaces were objected to by a significant portion of
the balloting group as too complex. A conforming application had little chance of correctly navigating the large parameter space to
match its desires to the system. In addition, they only applied to a new type of file (realtime files) and they told the
implementation exactly what to do as opposed to advising the implementation on application behavior and letting it optimize for the
system the (portable) application was running on. For example, it was not clear how a system that had a disk array should set its
parameters.</p>
<p>There seemed to be several overall goals:</p>
<ul>
<li>
<p>Optimizing sequential access</p>
</li>
<li>
<p>Optimizing caching behavior</p>
</li>
<li>
<p>Optimizing I/O data transfer</p>
</li>
<li>
<p>Preallocation</p>
</li>
</ul>
<p>The advisory interfaces, <a href="../functions/posix_fadvise.html"><i>posix_fadvise</i>()</a> and <a href=
"../functions/posix_madvise.html"><i>posix_madvise</i>()</a>, satisfy the first two goals. The POSIX_FADV_SEQUENTIAL and
POSIX_MADV_SEQUENTIAL advice tells the implementation to expect serial access. Typically the system will prefetch the next several
serial accesses in order to overlap I/O. It may also free previously accessed serial data if memory is tight. If the application is
not doing serial access it can use POSIX_FADV_WILLNEED and POSIX_MADV_WILLNEED to accomplish I/O overlap, as required. When the
application advises POSIX_FADV_RANDOM or POSIX_MADV_RANDOM behavior, the implementation usually tries to fetch a minimum amount of
data with each request and it does not expect much locality. POSIX_FADV_DONTNEED and POSIX_MADV_DONTNEED allow the system to free
up caching resources as the data will not be required in the near future.</p>
<p>POSIX_FADV_NOREUSE tells the system that caching the specified data is not optimal. For file I/O, the transfer should go
directly to the user buffer instead of being cached internally by the implementation. To portably perform direct disk I/O on all
systems, the application must perform its I/O transfers according to the following rules:</p>
<ol>
<li>
<p>The user buffer should be aligned according to the {POSIX_REC_XFER_ALIGN} <a href=
"../functions/pathconf.html"><i>pathconf</i>()</a> variable.</p>
</li>
<li>
<p>The number of bytes transferred in an I/O operation should be a multiple of the {POSIX_ALLOC_SIZE_MIN} <a href=
"../functions/pathconf.html"><i>pathconf</i>()</a> variable.</p>
</li>
<li>
<p>The offset into the file at the start of an I/O operation should be a multiple of the {POSIX_ALLOC_SIZE_MIN} <a href=
"../functions/pathconf.html"><i>pathconf</i>()</a> variable.</p>
</li>
<li>
<p>The application should ensure that all threads which open a given file specify POSIX_FADV_NOREUSE to be sure that there is no
unexpected interaction between threads using buffered I/O and threads using direct I/O to the same file.</p>
</li>
</ol>
<p>In some cases, a user buffer must be properly aligned in order to be transferred directly to/from the device. The
{POSIX_REC_XFER_ALIGN} <a href="../functions/pathconf.html"><i>pathconf</i>()</a> variable tells the application the proper
alignment.</p>
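<p>A minimal sketch of querying these variables and obtaining a suitably aligned transfer buffer follows; the function name
is hypothetical, and it assumes the reported alignment is acceptable to <a href=
"../functions/posix_memalign.html"><i>posix_memalign</i>()</a>:</p>
<blockquote>
<pre>
<tt>#include &lt;stdlib.h&gt;
#include &lt;unistd.h&gt;

/* Return a buffer aligned for direct transfers on fd, sized at the
   minimum allocation unit, or a null pointer on failure. */
void *direct_io_buffer(int fd)
{
    long align = fpathconf(fd, _PC_REC_XFER_ALIGN);
    long size = fpathconf(fd, _PC_ALLOC_SIZE_MIN);
    void *buf;

    if (align &lt;= 0 || size &lt;= 0)
        return (void *)0;
    if (posix_memalign(&amp;buf, (size_t)align, (size_t)size) != 0)
        return (void *)0;
    return buf;
}
</tt>
</pre>
</blockquote>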
<p>The preallocation goal is met by the space control function, <a href=
"../functions/posix_fallocate.html"><i>posix_fallocate</i>()</a>. The application can use <a href=
"../functions/posix_fallocate.html"><i>posix_fallocate</i>()</a> to guarantee no [ENOSPC] errors and to improve performance by
prepaying any overhead required for block allocation.</p>
<p>Implementations may use information conveyed by a previous <a href="../functions/posix_fadvise.html"><i>posix_fadvise</i>()</a>
call to influence the manner in which allocation is performed. For example, if an application did the following calls:</p>
<blockquote>
<pre>
<tt>fd = open("file");
posix_fadvise(fd, offset, len, POSIX_FADV_SEQUENTIAL);
posix_fallocate(fd, len, size);
</tt>
</pre>
</blockquote>
<p>an implementation might allocate the file contiguously on disk.</p>
<p>Finally, the <a href="../functions/pathconf.html"><i>pathconf</i>()</a> variables {POSIX_REC_MIN_XFER_SIZE},
{POSIX_REC_MAX_XFER_SIZE}, and {POSIX_REC_INCR_XFER_SIZE} tell the application a range of transfer sizes that are recommended for
best I/O performance.</p>
<p>Where bounded response time is required, the vendor can supply the appropriate settings of the advisories to achieve a
guaranteed performance level.</p>
<p>The interfaces meet the goals while allowing applications using regular files to take advantage of performance optimizations.
The interfaces tell the implementation expected application behavior which the implementation can use to optimize performance on a
particular system with a particular dynamic load.</p>
<p>The <a href="../functions/posix_memalign.html"><i>posix_memalign</i>()</a> function was added to allow for the allocation of
specifically aligned buffers; for example, for {POSIX_REC_XFER_ALIGN}.</p>
<p>The working group also considered the alternative of adding a function which would return an aligned pointer to memory within a
user-supplied buffer. This was not considered to be the best method, because it potentially wastes large amounts of memory when
buffers need to be aligned on large alignment boundaries.</p>
<h5><a name="tag_03_02_08_02"></a>Message Passing</h5>
<p>This section provides the rationale for the definition of the message passing interface in IEEE&nbsp;Std&nbsp;1003.1-2001. This
is presented in terms of the objectives, models, and requirements imposed upon this interface.</p>
<ul>
<li>
<p>Objectives</p>
<p>Many applications, including both realtime and database applications, require a means of passing arbitrary amounts of data
between cooperating processes comprising the overall application on one or more processors. Many conventional interfaces for
interprocess communication are insufficient for realtime applications in that efficient and deterministic data passing methods
cannot be implemented. This has prompted the definition of message passing interfaces providing these facilities:</p>
<ul>
<li>
<p>Open a message queue.</p>
</li>
<li>
<p>Send a message to a message queue.</p>
</li>
<li>
<p>Receive a message from a queue, either synchronously or asynchronously.</p>
</li>
<li>
<p>Alter message queue attributes for flow and resource control.</p>
</li>
</ul>
<p>It is assumed that an application may consist of multiple cooperating processes and that these processes may wish to communicate
and coordinate their activities. The message passing facility described in IEEE&nbsp;Std&nbsp;1003.1-2001 allows processes to
communicate through system-wide queues. These message queues are accessed through names that may be pathnames. A message queue can
be opened for use by multiple sending and/or multiple receiving processes. A minimal usage sketch appears at the end of this
section.</p>
</li>
<li>
<p>Background on Embedded Applications</p>
<p>Interprocess communication utilizing message passing is a key facility for the construction of deterministic, high-performance
realtime applications. The facility is present in all realtime systems and is the framework upon which the application is
constructed. The performance of the facility is usually a direct indication of the performance of the resulting application.</p>
<p>Realtime applications, especially for embedded systems, are typically designed around the performance constraints imposed by the
message passing mechanisms. Applications for embedded systems are typically very tightly constrained. Application writers expect to
design and control the entire system. In order to minimize system costs, the writer will attempt to use all resources to their
utmost and minimize the requirement to add additional memory or processors.</p>
<p>The embedded applications usually share address spaces and only a simple message passing mechanism is required. The application
can readily access common data incurring only mutual-exclusion overheads. The models desired are the simplest possible with the
application building higher-level facilities only when needed.</p>
</li>
<li>
<p>Requirements</p>
<p>The following requirements determined the features of the message passing facilities defined in
IEEE&nbsp;Std&nbsp;1003.1-2001:</p>
<ul>
<li>
<p>Naming of Message Queues</p>
<p>The mechanism for gaining access to a message queue is a pathname evaluated in a context that is allowed to be a file system
name space, or it can be independent of any file system. This is a specific attempt to allow implementations based on either method
in order to address both embedded systems and to also allow implementation in larger systems.</p>
<p>The interface of <a href="../functions/mq_open.html"><i>mq_open</i>()</a> is defined to allow but not require the access control
and name conflicts resulting from utilizing a file system for name resolution. All required behavior is specified for the access
control case. Yet a conforming implementation, such as an embedded system kernel, may define that there are no distinctions between
users and may define that all processes have all access privileges.</p>
</li>
<li>
<p>Embedded System Naming</p>
<p>Embedded systems need to be able to utilize independent name spaces for accessing the various system objects. They typically do
not have a file system, precluding its utilization as a common name resolution mechanism. The modularity of an embedded system
limits the connections between separate mechanisms that can be allowed.</p>
<p>Embedded systems typically do not have any access protection. Since the system does not support the mixing of applications from
different areas, and usually does not even have the concept of an authorization entity, access control is not useful.</p>
</li>
<li>
<p>Large System Naming</p>
<p>On systems with more functionality, the name resolution must support the ability to use the file system as the name resolution
mechanism/object storage medium and to have control over access to the objects. Utilizing the pathname space can result in further
errors when the names conflict with other objects.</p>
</li>
<li>
<p>Fixed Size of Messages</p>
<p>The interfaces impose a fixed upper bound on the size of messages that can be sent to a specific message queue. The size is set
on an individual queue basis and cannot be changed dynamically.</p>
<p>The purpose of the fixed size is to increase the ability of the system to optimize the implementation of <a href=
"../functions/mq_send.html"><i>mq_send</i>()</a> and <a href="../functions/mq_receive.html"><i>mq_receive</i>()</a>. With fixed
sizes of messages and fixed numbers of messages, specific message blocks can be pre-allocated. This eliminates a significant amount
of checking for errors and boundary conditions. Additionally, an implementation can optimize data copying to maximize performance.
Finally, with a restricted range of message sizes, an implementation is better able to provide deterministic operations.</p>
</li>
<li>
<p>Prioritization of Messages</p>
<p>Message prioritization allows the application to determine the order in which messages are received. Prioritization of messages
is a key facility that is provided by most realtime kernels and is heavily utilized by the applications. The major purpose of
having priorities in message queues is to avoid priority inversions in the message system, where a high-priority message is delayed
behind one or more lower-priority messages. This allows the applications to be designed so that they do not need to be interrupted
in order to change the flow of control when exceptional conditions occur. The prioritization does add additional overhead to the
message operations in those cases where it is actually used, but a clever implementation can optimize for the FIFO case to make that more
efficient.</p>
</li>
<li>
<p>Asynchronous Notification</p>
<p>The interface supports the ability to have a task asynchronously notified of the availability of a message on the queue. The
purpose of this facility is to allow the task to perform other functions and yet still be notified that a message has become
available on the queue.</p>
<p>To understand the requirement for this function, it is useful to understand two models of application design: a single task
performing multiple functions and multiple tasks performing a single function. Each of these models has advantages.</p>
<p>Asynchronous notification is required to build the model of a single task performing multiple operations. This model typically
results from either the expectation that interruption is less expensive than utilizing a separate task or from the growth of the
application to include additional functions.</p>
</li>
</ul>
</li>
</ul>
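<p>The following minimal sketch exercises the facilities described above; the queue name <tt>"/appq"</tt> and the attribute
values are hypothetical, and the attributes take effect only when the queue is created:</p>
<blockquote>
<pre>
<tt>#include &lt;mqueue.h&gt;
#include &lt;fcntl.h&gt;

int example(void)
{
    struct mq_attr attr;
    char buf[64];
    unsigned prio;
    mqd_t mqd;

    attr.mq_flags = 0;
    attr.mq_maxmsg = 8;           /* fixed number of messages */
    attr.mq_msgsize = sizeof buf; /* fixed upper bound on message size */
    mqd = mq_open("/appq", O_RDWR | O_CREAT, 0600, &amp;attr);
    if (mqd == (mqd_t)-1)
        return -1;
    if (mq_send(mqd, "hello", 6, 1) == -1)             /* priority 1 */
        return -1;
    if (mq_receive(mqd, buf, sizeof buf, &amp;prio) == -1)
        return -1;
    return mq_close(mqd);
}
</tt>
</pre>
</blockquote>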
<h5><a name="tag_03_02_08_03"></a>Semaphores</h5>
<p>Semaphores are a high-performance process synchronization mechanism. Semaphores are named by null-terminated strings of
characters.</p>
<p>A semaphore is created using the <a href="../functions/sem_init.html"><i>sem_init</i>()</a> function or the <a href=
"../functions/sem_open.html"><i>sem_open</i>()</a> function with the O_CREAT flag set in <i>oflag</i>.</p>
<p>To use a semaphore, a process has to first initialize the semaphore or inherit an open descriptor for the semaphore via <a href=
"../functions/fork.html"><i>fork</i>()</a>.</p>
<p>A semaphore preserves its state when the last reference is closed. For example, if a semaphore has a value of 13 when the last
reference is closed, it will have a value of 13 when it is next opened.</p>
<p>When a semaphore is created, an initial state for the semaphore has to be provided. This value is a non-negative integer.
Negative values are not possible since they indicate the presence of blocked processes. The persistence of any of these objects
across a system crash or a system reboot is undefined. Conforming applications must not depend on any sort of persistence across a
system reboot or a system crash.</p>
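<p>A minimal sketch of creating and using a named semaphore follows; the name <tt>"/appsem"</tt> is hypothetical:</p>
<blockquote>
<pre>
<tt>#include &lt;semaphore.h&gt;
#include &lt;fcntl.h&gt;

int example(void)
{
    /* Create (if necessary) with an initial value of 1 (unlocked). */
    sem_t *sem = sem_open("/appsem", O_CREAT, 0600, 1);

    if (sem == SEM_FAILED)
        return -1;
    sem_wait(sem);               /* lock */
    /* ... critical section ... */
    sem_post(sem);               /* unlock */
    return sem_close(sem);
}
</tt>
</pre>
</blockquote>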
<ul>
<li>
<p>Models and Requirements</p>
<p>A realtime system requires synchronization and communication between the processes comprising the overall application. An
efficient and reliable synchronization mechanism has to be provided in a realtime system that will allow more than one schedulable
process mutually-exclusive access to the same resource. This synchronization mechanism has to allow for the optimal implementation
of synchronization or systems implementors will define other, more cost-effective methods.</p>
<p>At issue are the methods whereby multiple processes (tasks) can be designed and implemented to work together in order to perform
a single function. This requires interprocess communication and synchronization. A semaphore mechanism is the lowest level of
synchronization that can be provided by an operating system.</p>
<p>A semaphore is defined as an object that has an integral value and a set of blocked processes associated with it. If the value
is positive or zero, then the set of blocked processes is empty; otherwise, the size of the set is equal to the absolute value of
the semaphore value. The value of the semaphore can be incremented or decremented by any process with access to the semaphore and
must be done as an indivisible operation. When a semaphore value is less than or equal to zero, any process that attempts to lock
it again will block or be informed that it is not possible to perform the operation.</p>
<p>A semaphore may be used to guard access to any resource accessible by more than one schedulable task in the system. It is a
global entity and not associated with any particular process. As such, a method of obtaining access to the semaphore has to be
provided by the operating system. A process that wants access to a critical resource (section) has to wait on the semaphore that
guards that resource. When the semaphore is locked on behalf of a process, it knows that it can utilize the resource without
interference by any other cooperating process in the system. When the process finishes its operation on the resource, leaving it in
a well-defined state, it posts the semaphore, indicating that some other process may now obtain the resource associated with that
semaphore.</p>
<p>In this section, mutexes and condition variables are specified as the synchronization mechanisms between threads.</p>
<p>These primitives are typically used for synchronizing threads that share memory in a single process. However, this section
provides an option allowing the use of these synchronization interfaces and objects between processes that share memory, regardless
of the method for sharing memory.</p>
<p>Much experience with semaphores shows that there are two distinct uses of synchronization: locking, which is typically of short
duration; and waiting, which is typically of long or unbounded duration. These distinct usages map directly onto mutexes and
condition variables, respectively.</p>
<p>Semaphores are provided in IEEE&nbsp;Std&nbsp;1003.1-2001 primarily to provide a means of synchronization for processes; these
processes may or may not share memory. Mutexes and condition variables are specified as synchronization mechanisms between threads;
these threads always share (some) memory. Both are synchronization paradigms that have been in widespread use for a number of
years. Each set of primitives is particularly well matched to certain problems.</p>
<p>With respect to binary semaphores, experience has shown that condition variables and mutexes are easier to use for many
synchronization problems than binary semaphores. The primary reason for this is the explicit appearance of a Boolean predicate that
specifies when the condition wait is satisfied. This Boolean predicate terminates a loop, including the call to <a href=
"../functions/pthread_cond_wait.html"><i>pthread_cond_wait</i>()</a>. As a result, extra wakeups are benign since the predicate
governs whether the thread will actually proceed past the condition wait. With stateful primitives, such as binary semaphores, the
wakeup in itself typically means that the wait is satisfied. The burden of ensuring correctness for such waits is thus placed on
<i>all</i> signalers of the semaphore rather than on an <i>explicitly coded</i> Boolean predicate located at the condition wait.
Experience has shown that the latter creates a major improvement in safety and ease-of-use.</p>
<p>Counting semaphores are well matched to dealing with producer/consumer problems, including those that might exist between
threads of different processes, or between a signal handler and a thread. In the former case, there may be little or no memory
shared by the processes; in the latter case, one is not communicating between co-equal threads, but between a thread and an
interrupt-like entity. It is for these reasons that IEEE&nbsp;Std&nbsp;1003.1-2001 allows semaphores to be used by threads.</p>
<p>Mutexes and condition variables have been effectively used with and without priority inheritance, priority ceiling, and other
attributes to synchronize threads that share memory. The efficiency of their implementation is comparable to or better than that of
other synchronization primitives that are sometimes harder to use (for example, binary semaphores). Furthermore, there is at least
one known implementation of Ada tasking that uses these primitives. Mutexes and condition variables together constitute an
appropriate, sufficient, and complete set of inter-thread synchronization primitives.</p>
<p>Efficient multi-threaded applications require high-performance synchronization primitives. Considerations of efficiency and
generality require a small set of primitives upon which more sophisticated synchronization functions can be built.</p>
</li>
<li>
<p>Standardization Issues</p>
<p>It is possible to implement very high-performance semaphores using test-and-set instructions on shared memory locations. The
library routines that implement such a high-performance interface have to properly ensure that a <a href=
"../functions/sem_wait.html"><i>sem_wait</i>()</a> or <a href="../functions/sem_trywait.html"><i>sem_trywait</i>()</a> operation
that cannot be performed will issue a blocking semaphore system call or properly report the condition to the application. The same
interface to the application program would be provided by a high-performance implementation.</p>
</li>
</ul>
<h5><a name="tag_03_02_08_04"></a>Realtime Signals</h5>
<h5><a name="tag_03_02_08_05"></a>Realtime Signals Extension</h5>
<p>This portion of the rationale presents models, requirements, and standardization issues relevant to the Realtime Signals
Extension. This extension provides the capability required to support reliable, deterministic, asynchronous notification of events.
While a new mechanism, unencumbered by the historical usage and semantics of POSIX.1 signals, might allow for a more efficient
implementation, the application requirements for event notification can be met with a small number of extensions to signals.
Therefore, a minimal set of extensions to signals to support the application requirements is specified.</p>
<p>The realtime signal extensions specified in this section are used by other realtime functions requiring asynchronous
notification:</p>
<ul>
<li>
<p>Models</p>
<p>The model supported is one of multiple cooperating processes, each of which handles multiple asynchronous external events.
Events represent occurrences that are generated as the result of some activity in the system. Examples of occurrences that can
constitute an event include:</p>
<ul>
<li>
<p>Completion of an asynchronous I/O request</p>
</li>
<li>
<p>Expiration of a POSIX.1b timer</p>
</li>
<li>
<p>Arrival of an interprocess message</p>
</li>
<li>
<p>Generation of a user-defined event</p>
</li>
</ul>
<p>Processing of these events may occur synchronously via polling for event notifications or asynchronously via a software
interrupt mechanism. Existing practice for this model is well established for traditional proprietary realtime operating systems,
realtime executives, and realtime extended POSIX-like systems.</p>
<p>A contrasting model is that of &quot;cooperating sequential processes&quot; where each process handles a single priority of events via
polling. Each process blocks while waiting for events, and each process depends on the preemptive, priority-based process
scheduling mechanism to arbitrate between events of different priority that need to be processed concurrently. Existing practice
for this model is also well established for small realtime executives that typically execute in an unprotected physical address
space, but it is just emerging in the context of a fuller function operating system with multiple virtual address spaces.</p>
<p>It could be argued that the cooperating sequential process model, and the facilities supported by the POSIX Threads Extension
obviate a software interrupt model. But, even with the cooperating sequential process model, the need has been recognized for a
software interrupt model to handle exceptional conditions and process aborting, so the mechanism must be supported in any case.
Furthermore, it is not the purview of IEEE&nbsp;Std&nbsp;1003.1-2001 to attempt to convince realtime practitioners that their
current application models based on software interrupts are &quot;broken&quot; and should be replaced by the cooperating sequential process
model. Rather, it is the charter of IEEE&nbsp;Std&nbsp;1003.1-2001 to provide standard extensions to mechanisms that support
existing realtime practice.</p>
</li>
<li>
<p>Requirements</p>
<p>This section discusses the following realtime application requirements for asynchronous event notification; a minimal
usage sketch appears at the end of this section:</p>
<ul>
<li>
<p>Reliable delivery of asynchronous event notification</p>
<p>The events notification mechanism guarantees delivery of an event notification. Asynchronous operations (such as asynchronous
I/O and timers) that complete significantly after they are invoked have to guarantee that delivery of the event notification can
occur at the time of completion.</p>
</li>
<li>
<p>Prioritized handling of asynchronous event notifications</p>
<p>The events notification mechanism supports the assigning of a user function as an event notification handler. Furthermore, the
mechanism supports the preemption of an event handler function by a higher priority event notification and supports the selection
of the highest priority pending event notification when multiple notifications (of different priority) are pending
simultaneously.</p>
<p>The model here is based on hardware interrupts. Asynchronous event handling allows the application to ensure that time-critical
events are immediately processed when delivered, without the indeterminism of being at a random location within a polling loop. Use
of handler priority allows the specification of how handlers are interrupted by other higher priority handlers.</p>
</li>
<li>
<p>Differentiation between multiple occurrences of event notifications of the same type</p>
<p>The events notification mechanism passes an application-defined value to the event handler function. This value can be used for
a variety of purposes, such as enabling the application to identify which of several possible events of the same type (for example,
timer expirations) has occurred.</p>
</li>
<li>
<p>Polled reception of asynchronous event notifications</p>
<p>The events notification mechanism supports blocking and non-blocking polls for asynchronous event notification.</p>
<p>The polled mode of operation is often preferred over the interrupt mode by those practitioners accustomed to this model.
Providing support for this model facilitates the porting of applications based on this model to POSIX.1b conforming systems.</p>
</li>
<li>
<p>Deterministic response to asynchronous event notifications</p>
<p>The events notification mechanism does not preclude implementations that provide deterministic event dispatch latency and
minimizes the number of system calls needed to use the event facilities during realtime processing.</p>
</li>
</ul>
</li>
<li>
<p>Rationale for Extension</p>
<p>POSIX.1 signals have many of the characteristics necessary to support the asynchronous handling of event notifications, and the
Realtime Signals Extension addresses the following deficiencies in the POSIX.1 signal mechanism:</p>
<ul>
<li>
<p>Signals do not support reliable delivery of event notification. Subsequent occurrences of a pending signal are not guaranteed to
be delivered.</p>
</li>
<li>
<p>Signals do not support prioritized delivery of event notifications. The order of signal delivery when multiple unblocked signals
are pending is undefined.</p>
</li>
<li>
<p>Signals do not support the differentiation between multiple signals of the same type.</p>
</li>
</ul>
</li>
</ul>
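<p>A minimal sketch of queuing a realtime signal carrying an application-defined value and receiving it by a blocking poll
follows; the choice of SIGRTMIN and the value 42 are arbitrary:</p>
<blockquote>
<pre>
<tt>#include &lt;signal.h&gt;
#include &lt;unistd.h&gt;

int example(void)
{
    sigset_t set;
    siginfo_t info;
    union sigval val;

    sigemptyset(&amp;set);
    sigaddset(&amp;set, SIGRTMIN);
    sigprocmask(SIG_BLOCK, &amp;set, (sigset_t *)0); /* keep it pending */

    val.sival_int = 42;               /* application-defined value */
    if (sigqueue(getpid(), SIGRTMIN, val) == -1)
        return -1;
    if (sigwaitinfo(&amp;set, &amp;info) == -1)  /* blocking polled reception */
        return -1;
    return info.si_value.sival_int;   /* differentiates occurrences */
}
</tt>
</pre>
</blockquote>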
<h5><a name="tag_03_02_08_06"></a>Asynchronous I/O</h5>
<p>Many applications need to interact with the I/O subsystem in an asynchronous manner. The asynchronous I/O mechanism provides the
ability to overlap application processing and I/O operations initiated by the application. The asynchronous I/O mechanism allows a
single process to perform I/O simultaneously to a single file multiple times or to multiple files multiple times.</p>
<h5><a name="tag_03_02_08_07"></a>Overview</h5>
<p>Asynchronous I/O operations proceed in logical parallel with the processing done by the application after the asynchronous I/O
has been initiated. Other than this difference, asynchronous I/O behaves similarly to normal I/O using <a href=
"../functions/read.html"><i>read</i>()</a>, <a href="../functions/write.html"><i>write</i>()</a>, <a href=
"../functions/lseek.html"><i>lseek</i>()</a>, and <a href="../functions/fsync.html"><i>fsync</i>()</a>. The effect of issuing an
asynchronous I/O request is as if a separate thread of execution were to perform atomically the implied <a href=
"../functions/lseek.html"><i>lseek</i>()</a> operation, if any, and then the requested I/O operation (either <a href=
"../functions/read.html"><i>read</i>()</a>, <a href="../functions/write.html"><i>write</i>()</a>, or <a href=
"../functions/fsync.html"><i>fsync</i>()</a>). There is no seek implied with a call to <a href=
"../functions/aio_fsync.html"><i>aio_fsync</i>()</a>. Concurrent asynchronous operations and synchronous operations applied to the
same file update the file as if the I/O operations had proceeded serially.</p>
<p>When asynchronous I/O completes, a signal can be delivered to the application to indicate the completion of the I/O. This signal
can be used to indicate that buffers and control blocks used for asynchronous I/O can be reused. Signal delivery is not required
for an asynchronous operation and may be turned off on a per-operation basis by the application. Signals may also be synchronously
polled using <a href="../functions/aio_suspend.html"><i>aio_suspend</i>()</a>, <a href=
"../functions/sigtimedwait.html"><i>sigtimedwait</i>()</a>, or <a href=
"../functions/sigwaitinfo.html"><i>sigwaitinfo</i>()</a>.</p>
<p>Normal I/O has a return value and an error status associated with it. Asynchronous I/O returns a value and an error status when
the operation is first submitted, but that only relates to whether the operation was successfully queued up for servicing. The I/O
operation itself also has a return status and an error value. To allow the application to retrieve the return status and the error
value, functions are provided that, given the address of an asynchronous I/O control block, yield the return and error status
associated with the operation. Until an asynchronous I/O operation is done, its error status is [EINPROGRESS]. Thus, an application
can poll for completion of an asynchronous I/O operation by waiting for the error status to become equal to a value other than
[EINPROGRESS]. The return status of an asynchronous I/O operation is undefined so long as the error status is equal to
[EINPROGRESS].</p>
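<p>A minimal sketch of submitting an asynchronous read and polling for its completion in this manner follows; a real
application would perform useful work instead of spinning:</p>
<blockquote>
<pre>
<tt>#include &lt;sys/types.h&gt;
#include &lt;aio.h&gt;
#include &lt;errno.h&gt;
#include &lt;string.h&gt;

ssize_t async_read(int fd, void *buf, size_t len, off_t off)
{
    struct aiocb cb;

    memset(&amp;cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = off;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE; /* no completion signal */

    if (aio_read(&amp;cb) == -1)
        return -1;               /* the request could not be queued */
    while (aio_error(&amp;cb) == EINPROGRESS)
        ;                        /* poll until the operation is done */
    return aio_return(&amp;cb);      /* return status of the I/O itself */
}
</tt>
</pre>
</blockquote>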
<p>Storage for asynchronous operation return and error status may be limited. Submission of asynchronous I/O operations may fail if
this storage is exceeded. When an application retrieves the return status of a given asynchronous operation, therefore, any
system-maintained storage used for this status and the error status may be reclaimed for use by other asynchronous operations.</p>
<p>Asynchronous I/O can be performed on file descriptors that have been enabled for POSIX.1b synchronized I/O. In this case, the
I/O operation still occurs asynchronously, as defined herein; however, the asynchronous operation I/O in this case is not completed
until the I/O has reached either the state of synchronized I/O data integrity completion or synchronized I/O file integrity
completion, depending on the sort of synchronized I/O that is enabled on the file descriptor.</p>
<h5><a name="tag_03_02_08_08"></a>Models</h5>
<p>Three models illustrate the use of asynchronous I/O: a journalization model, a data acquisition model, and a model of the use of
asynchronous I/O in supercomputing applications.</p>
<ul>
<li>
<p>Journalization Model</p>
<p>Many realtime applications perform low-priority journalizing functions. Journalizing requires that logging records be queued for
output without blocking the initiating process.</p>
</li>
<li>
<p>Data Acquisition Model</p>
<p>A data acquisition process may also serve as a model. The process has two or more channels delivering intermittent data that
must be read within a certain time. The process issues one asynchronous read on each channel. When one of the channels needs data
collection, the process reads the data and posts it through an asynchronous write to secondary memory for future processing.</p>
</li>
<li>
<p>Supercomputing Model</p>
<p>The supercomputing community has used asynchronous I/O much like that specified in POSIX.1 for many years. This community
requires the ability to perform multiple I/O operations to multiple devices with a minimal number of entries to &quot;the system&quot;;
each entry to &quot;the system&quot; provokes a major delay in operations when compared to the normal progress made by the application.
This existing practice motivated the use of combined <a href="../functions/lseek.html"><i>lseek</i>()</a> and <a href=
"../functions/read.html"><i>read</i>()</a> or <a href="../functions/write.html"><i>write</i>()</a> calls, as well as the <a href=
"../functions/lio_listio.html"><i>lio_listio</i>()</a> call. Another common practice is to disable signal notification for I/O
completion, and simply poll for I/O completion at some interval by which the I/O should be completed. Likewise, interfaces like <a
href="../functions/aio_cancel.html"><i>aio_cancel</i>()</a> have been in successful commercial use for many years. Note also that
an underlying implementation of asynchronous I/O will require the ability, at least internally, to cancel outstanding asynchronous
I/O, at least when the process exits. (Consider an asynchronous read from a terminal, when the process intends to exit
immediately.)</p>
</li>
</ul>
<h5><a name="tag_03_02_08_09"></a>Requirements</h5>
<p>Asynchronous input and output for realtime implementations have these requirements:</p>
<ul>
<li>
<p>The ability to queue multiple asynchronous read and write operations to a single open instance. Both sequential and random
access should be supported.</p>
</li>
<li>
<p>The ability to queue asynchronous read and write operations to multiple open instances.</p>
</li>
<li>
<p>The ability to obtain completion status information by polling and/or asynchronous event notification.</p>
</li>
<li>
<p>Asynchronous event notification on asynchronous I/O completion is optional.</p>
</li>
<li>
<p>It has to be possible for the application to associate the event with the <i>aiocbp</i> for the operation that generated the
event.</p>
</li>
<li>
<p>The ability to cancel queued requests.</p>
</li>
<li>
<p>The ability to wait upon asynchronous I/O completion in conjunction with other types of events.</p>
</li>
<li>
<p>The ability to accept an <a href="../functions/aio_read.html"><i>aio_read</i>()</a> and an <a href=
"../functions/aio_cancel.html"><i>aio_cancel</i>()</a> for a device that accepts a <a href=
"../functions/read.html"><i>read</i>()</a>, and the ability to accept an <a href=
"../functions/aio_write.html"><i>aio_write</i>()</a> and an <a href="../functions/aio_cancel.html"><i>aio_cancel</i>()</a> for a
device that accepts a <a href="../functions/write.html"><i>write</i>()</a>. This does not imply that the operation is
asynchronous.</p>
</li>
</ul>
<h5><a name="tag_03_02_08_10"></a>Standardization Issues</h5>
<p>The following issues are addressed by the standardization of asynchronous I/O:</p>
<ul>
<li>
<p>Rationale for New Interface</p>
<p>Non-blocking I/O does not satisfy the needs of either realtime or high-performance computing models; these models require that a
process overlap program execution and I/O processing. Realtime applications will often make use of direct I/O to or from the
address space of the process, or require synchronized (unbuffered) I/O; they also require the ability to overlap this I/O with
other computation. In addition, asynchronous I/O allows an application to keep a device busy at all times, possibly achieving
greater throughput. Supercomputing and database architectures will often have specialized hardware that can provide true asynchrony
underlying the logical asynchrony provided by this interface. In addition, asynchronous I/O should be supported by all types of
files and devices in the same manner.</p>
</li>
<li>
<p>Effect of Buffering</p>
<p>If asynchronous I/O is performed on a file that is buffered prior to being actually written to the device, it is possible that
asynchronous I/O will offer no performance advantage over normal I/O; the cycles <i>stolen</i> to perform the asynchronous I/O will
be taken away from the running process and the I/O will occur at interrupt time. This potential lack of gain in performance in no
way obviates the need for asynchronous I/O by realtime applications, which very often will use specialized hardware support,
multiple processors, and/or unbuffered, synchronized I/O.</p>
</li>
</ul>
<h5><a name="tag_03_02_08_11"></a>Memory Management</h5>
<p>All memory management and shared memory definitions are located in the <a href=
"../basedefs/sys/mman.h.html"><i>&lt;sys/mman.h&gt;</i></a> header. This is for alignment with historical practice.</p>
<h5><a name="tag_03_02_08_12"></a>Memory Locking Functions</h5>
<p>This portion of the rationale presents models, requirements, and standardization issues relevant to process memory locking.</p>
<ul>
<li>
<p>Models</p>
<p>Realtime systems that conform to IEEE&nbsp;Std&nbsp;1003.1-2001 are expected (and desired) to be supported on systems with
demand-paged virtual memory management, non-paged swapping memory management, and physical memory systems with no memory management
hardware. The general case, however, is the demand-paged, virtual memory system with each POSIX process running in a virtual
address space. Note that this includes architectures where each process resides in its own virtual address space and architectures
where the address space of each process is only a portion of a larger global virtual address space.</p>
<p>The concept of memory locking is introduced to eliminate the indeterminacy introduced by paging and swapping, and to support an
upper bound on the time required to access the memory mapped into the address space of a process. Ideally, this upper bound will be
the same as the time required for the processor to access &quot;main memory&quot;, including any address translation and cache miss
overheads. But some implementations-primarily on mainframes-will not actually force locked pages to be loaded and held resident in
main memory. Rather, they will handle locked pages so that accesses to these pages will meet the performance metrics for locked
process memory in the implementation. Also, although it is not, for example, the intention that this interface, as specified, be
used to lock process memory into &quot;cache&quot;, it is conceivable that an implementation could support a large static RAM memory and
define this as &quot;main memory&quot; and use a large[r] dynamic RAM as &quot;backing store&quot;. These interfaces could then be interpreted as
supporting the locking of process memory into the static RAM. Support for multiple levels of backing store would require extensions
to these interfaces.</p>
<p>Implementations may also use memory locking to guarantee a fixed translation between virtual and physical addresses where such
is beneficial to improving determinacy for direct-to/from-process input/output. IEEE&nbsp;Std&nbsp;1003.1-2001 does not guarantee
to the application that the virtual-to-physical address translations, if such exist, are fixed, because such behavior would not be
implementable on all architectures on which implementations of IEEE&nbsp;Std&nbsp;1003.1-2001 are expected. But
IEEE&nbsp;Std&nbsp;1003.1-2001 does mandate that an implementation define, for the benefit of potential users, whether or not
locking guarantees fixed translations.</p>
<p>Memory locking is defined with respect to the address space of a process. Only the pages mapped into the address space of a
process may be locked by the process, and when the pages are no longer mapped into the address space-for whatever reason-the locks
established with respect to that address space are removed. Shared memory areas warrant special mention, as they may be mapped into
more than one address space or mapped more than once into the address space of a process; locks may be established on pages within
these areas with respect to several of these mappings. In such a case, the lock state of the underlying physical pages is the
logical OR of the lock state with respect to each of the mappings. Only when all such locks have been removed are the shared pages
considered unlocked.</p>
<p>In recognition of the page granularity of Memory Management Units (MMU), and in order to support locking of ranges of address
space, memory locking is defined in terms of &quot;page&quot; granularity. That is, for the interfaces that support an address and size
specification for the region to be locked, the address must be on a page boundary, and all pages mapped by the specified range are
locked, if valid. This means that the length is implicitly rounded up to a multiple of the page size. The page size is
implementation-defined and is available to applications as a compile-time symbolic constant or at runtime via <a href=
"../functions/sysconf.html"><i>sysconf</i>()</a>.</p>
<p>A &quot;real memory&quot; POSIX.1b implementation that has no MMU could elect not to support these interfaces, returning [ENOSYS]. But
an application could easily interpret this as meaning that the implementation would unconditionally page or swap the application
when such is not the case. It is the intention of IEEE&nbsp;Std&nbsp;1003.1-2001 that such a system could define these interfaces
as &quot;NO-OPs&quot;, returning success without actually performing any function except for mandated argument checking.</p>
</li>
<li>
<p>Requirements</p>
<p>For realtime applications, memory locking is generally considered to be required as part of application initialization. This
locking is performed after an application has been loaded (that is, <i>exec</i>'d) and the program remains locked for its entire
lifetime. But to support applications that undergo major mode changes where, in one mode, locking is required, but in another it is
not, the specified interfaces allow repeated locking and unlocking of memory within the lifetime of a process.</p>
<p>When a realtime application locks its address space, it should not be necessary for the application to then &quot;touch&quot; all of the
pages in the address space to guarantee that they are resident or else suffer potential paging delays the first time the page is
referenced. Thus, IEEE&nbsp;Std&nbsp;1003.1-2001 requires that the pages locked by the specified interfaces be resident when the
locking functions return successfully.</p>
<p>Many architectures support system-managed stacks that grow automatically when the current extent of the stack is exceeded. A
realtime application has a requirement to be able to &quot;preallocate&quot; sufficient stack space and lock it down so that it will not
suffer page faults to grow the stack during critical realtime operation. There was no consensus on a portable way to specify how
much stack space is needed, so IEEE&nbsp;Std&nbsp;1003.1-2001 supports no specific interface for preallocating stack space. But an
application can portably lock down a specific amount of stack space by specifying MCL_FUTURE in a call to <a href=
"../functions/mlockall.html"><i>mlockall</i>()</a> and then calling a dummy function that declares an automatic array of the
desired size.</p>
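<p>A sketch of this technique follows; the reservation size and the helper function are illustrative only:</p>
<blockquote>
<pre>
<tt>#include &lt;sys/mman.h&gt;

#define STACK_RESERVE (64 * 1024)   /* application-specific estimate */

static void
reserve_stack(void)
{
    volatile char guard[STACK_RESERVE];

    guard[STACK_RESERVE - 1] = 0;   /* force the deepest stack page to exist */
}

/* ... during initialization ... */
if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1)
    /* handle error */ ;
reserve_stack();                    /* stack pages are now resident and locked */
</tt>
</pre>
</blockquote>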
<p>Memory locking for realtime applications is also generally considered to be an &quot;all or nothing&quot; proposition. That is, the
entire process, or none, is locked down. But, for applications that have well-defined sections that need to be locked and others
that do not, IEEE&nbsp;Std&nbsp;1003.1-2001 supports an optional set of interfaces to lock or unlock a range of process addresses.
Reasons for locking down a specific range include:</p>
<ul>
<li>
<p>An asynchronous event handler function that must respond to external events in a deterministic manner such that page faults
cannot be tolerated</p>
</li>
<li>
<p>An input/output &quot;buffer&quot; area that is the target for direct-to-process I/O, where the overhead of implicit locking and
unlocking for each I/O call cannot be tolerated</p>
</li>
</ul>
<p>Finally, locking is generally viewed as an &quot;application-wide&quot; function. That is, the application is globally aware of which
regions are locked and which are not over time. This is in contrast to a function that is used temporarily within a &quot;third party&quot;
library routine whose function is unknown to the application, and therefore must have no &quot;side effects&quot;. The specified
interfaces, therefore, do not support &quot;lock stacking&quot; or &quot;lock nesting&quot; within a process. But, for pages that are shared
between processes or mapped more than once into a process address space, &quot;lock stacking&quot; is essentially mandated by the
requirement that unlocking of pages that are mapped by more than one process or more than once by the same process does not affect
locks established on the other mappings.</p>
<p>There was some support for &quot;lock stacking&quot; so that locking could be transparently used in functions or opaque modules. But the
consensus was not to burden all implementations with lock stacking (and reference counting), and an implementation option was
proposed. There were strong objections to the option because applications would have to support both options in order to remain
portable. The consensus was to eliminate lock stacking altogether, primarily through overwhelming support for the System&nbsp;V
&quot;m[un]lock[all]&quot; interface on which IEEE&nbsp;Std&nbsp;1003.1-2001 is now based.</p>
<p>Locks are not inherited across <a href="../functions/fork.html"><i>fork</i>()</a>s because some implementations implement <a
href="../functions/fork.html"><i>fork</i>()</a> by creating new address spaces for the child. In such an implementation, requiring
locks to be inherited would lead to new situations in which a fork would fail due to the inability of the system to lock sufficient
memory for both the parent and the child. The consensus was that there was no benefit to such inheritance. Note that this does
not mean that locks are removed when, for instance, a thread is created in the same address space.</p>
<p>Similarly, locks are not inherited across <i>exec</i> because some implementations implement <i>exec</i> by unmapping all of the
pages in the address space (which, by definition, removes the locks on these pages), and maps in pages of the <i>exec</i>'d image.
In such an implementation, requiring locks to be inherited would lead to new situations in which <i>exec</i> would fail. Such a
failure would be very cumbersome to detect in time to report to the calling process, and no appropriate mechanism exists for
informing the <i>exec</i>'d process of its status.</p>
<p>It was determined that, if the newly loaded application required locking, it was the responsibility of that application to
establish the locks. This is also in keeping with the general view that it is the responsibility of the application to be aware of
all locks that are established.</p>
<p>There was one request to allow (not mandate) locks to be inherited across <a href="../functions/fork.html"><i>fork</i>()</a>,
and a request for a flag, MCL_INHERIT, that would specify inheritance of memory locks across <i>exec</i>s. Given the difficulties
raised by this and the general lack of support for the feature in IEEE&nbsp;Std&nbsp;1003.1-2001, it was not added.
IEEE&nbsp;Std&nbsp;1003.1-2001 does not preclude an implementation from providing this feature for administrative purposes, such as
a &quot;run&quot; command that will lock down and execute a specified application. Additionally, the rationale for the objection equated <a
href="../functions/fork.html"><i>fork</i>()</a> with creating a thread in the address space. IEEE&nbsp;Std&nbsp;1003.1-2001 does
not mandate releasing locks when creating additional threads in an existing process.</p>
</li>
<li>
<p>Standardization Issues</p>
<p>One goal of IEEE&nbsp;Std&nbsp;1003.1-2001 is to define a set of primitives, based to the extent possible on existing industry
practice, that provides the necessary functionality for realtime applications, with consideration for the needs of other
application domains where such needs were identified.</p>
<p>The Memory Locking option is required by many realtime applications to tune performance. Such a facility is accomplished by
placing constraints on the virtual memory system to limit paging of the process, or of critical sections of the process. This
facility should not be used by most non-realtime applications.</p>
<p>Optional features provided in IEEE&nbsp;Std&nbsp;1003.1-2001 allow applications to lock selected address ranges with the caveat
that the process is responsible for being aware of the page granularity of locking and the unnested nature of the locks.</p>
</li>
</ul>
<h5><a name="tag_03_02_08_13"></a>Mapped Files Functions</h5>
<p>The Memory Mapped Files option provides a mechanism that allows a process to access files by directly incorporating file data
into its address space. Once a file is &quot;mapped&quot; into a process address space, the data can be manipulated by instructions as
memory. The use of mapped files can significantly reduce I/O data movement since file data does not have to be copied into process
data buffers as in <a href="../functions/read.html"><i>read</i>()</a> and <a href="../functions/write.html"><i>write</i>()</a>. If
more than one process maps a file, its contents are shared among them. This provides a low overhead mechanism by which processes
can synchronize and communicate.</p>
<ul>
<li>
<p>Historical Perspective</p>
<p>Realtime applications have historically been implemented using a collection of cooperating processes or tasks. In early systems,
these processes ran on bare hardware (that is, without an operating system) with no memory relocation or protection. The
application paradigms that arose from this environment involve the sharing of data between the processes.</p>
<p>When realtime systems were implemented on top of vendor-supplied operating systems, the paradigm of direct access to data by
multiple processes, and its performance benefits, were still deemed necessary. As a result, operating systems that claim to support
realtime applications must support the shared memory paradigm.</p>
<p>Additionally, a number of realtime systems provide the ability to map specific sections of the physical address space into the
address space of a process. This ability is required if an application is to obtain direct access to memory locations that have
specific properties (for example, refresh buffers or display devices, dual ported memory locations, DMA target locations). The use
of this ability is common enough to warrant some degree of standardization of its interface. This ability overlaps the general
paradigm of shared memory in that, in both instances, common global objects are made addressable by individual processes or
tasks.</p>
<p>Finally, a number of systems also provide the ability to map process addresses to files. This provides both a general means of
sharing persistent objects and a way of using files that optimizes memory and swapping space usage.</p>
<p>Simple shared memory is clearly a special case of the more general file mapping capability. In addition, there is relatively
widespread agreement and implementation of the file mapping interface. In these systems, many different types of objects can be
mapped (for example, files, memory, devices, and so on) using the same mapping interfaces. This approach both minimizes interface
proliferation and maximizes the generality of programs using the mapping interfaces.</p>
</li>
<li>
<p>Memory Mapped Files Usage</p>
<p>A memory object can be concurrently mapped into the address space of one or more processes. The <a href=
"../functions/mmap.html"><i>mmap</i>()</a> and <a href="../functions/munmap.html"><i>munmap</i>()</a> functions allow a process to
manipulate its address space by mapping portions of memory objects into it and removing them from it. When multiple processes map
the same memory object, they can share access to the underlying data. Implementations may restrict the size and alignment of
mappings to be on <i>page</i>-size boundaries. The page size, in bytes, is the value of the system-configurable variable
{PAGESIZE}, typically accessed by calling <a href="../functions/sysconf.html"><i>sysconf</i>()</a> with a <i>name</i> argument of
_SC_PAGESIZE. If an implementation has no restrictions on size or alignment, it may specify a 1-byte page size.</p>
<p>To map memory, a process first opens a memory object. The <a href="../functions/ftruncate.html"><i>ftruncate</i>()</a> function
can be used to contract or extend the size of the memory object even when the object is currently mapped. If the memory object is
extended, the contents of the extended areas are zeros.</p>
<p>After opening a memory object, the application maps the object into its address space using the <a href=
"../functions/mmap.html"><i>mmap</i>()</a> function call. Once a mapping has been established, it remains mapped until unmapped
with <a href="../functions/munmap.html"><i>munmap</i>()</a>, even if the memory object is closed. The <a href=
"../functions/mprotect.html"><i>mprotect</i>()</a> function can be used to change the memory protections initially established by
<a href="../functions/mmap.html"><i>mmap</i>()</a>.</p>
<p>A <a href="../functions/close.html"><i>close</i>()</a> of the file descriptor, while invalidating the file descriptor itself,
does not unmap any mappings established for the memory object. The address space, including all mapped regions, is inherited on <a
href="../functions/fork.html"><i>fork</i>()</a>. The entire address space is unmapped on process termination or by successful calls
to any of the <i>exec</i> family of functions.</p>
<p>The <a href="../functions/msync.html"><i>msync</i>()</a> function is used to force mapped file data to permanent storage.</p>
</li>
<li>
<p>Effects on Other Functions</p>
<p>When the Memory Mapped Files option is supported, the operation of the <a href="../functions/open.html"><i>open</i>()</a>, <a
href="../functions/creat.html"><i>creat</i>()</a>, and <a href="../functions/unlink.html"><i>unlink</i>()</a> functions are a
natural result of using the file system name space to map the global names for memory objects.</p>
<p>The <a href="../functions/ftruncate.html"><i>ftruncate</i>()</a> function can be used to set the length of a sharable memory
object.</p>
<p>The meaning of <a href="../functions/stat.html"><i>stat</i>()</a> fields other than the size and protection information is
undefined on implementations where memory objects are not implemented using regular files. When regular files are used, the times
reflect when the implementation updated the file image of the data, not when a process updated the data in memory.</p>
<p>The operations of <a href="../functions/fdopen.html"><i>fdopen</i>()</a>, <a href="../functions/write.html"><i>write</i>()</a>,
<a href="../functions/read.html"><i>read</i>()</a>, and <a href="../functions/lseek.html"><i>lseek</i>()</a> were made unspecified
for objects opened with <a href="../functions/shm_open.html"><i>shm_open</i>()</a>, so that implementations that did not implement
memory objects as regular files would not have to support the operation of these functions on shared memory objects.</p>
<p>The behavior of memory objects with respect to <a href="../functions/close.html"><i>close</i>()</a>, <a href=
"../functions/dup.html"><i>dup</i>()</a>, <a href="../functions/dup2.html"><i>dup2</i>()</a>, <a href=
"../functions/open.html"><i>open</i>()</a>, <a href="../functions/close.html"><i>close</i>()</a>, <a href=
"../functions/fork.html"><i>fork</i>()</a>, <a href="../functions/_exit.html"><i>_exit</i>()</a>, and the <i>exec</i> family of
functions is the same as the behavior of the existing practice of the <a href="../functions/mmap.html"><i>mmap</i>()</a>
function.</p>
<p>A memory object can still be referenced after a close. That is, any mappings made to the file are still in effect, and reads and
writes that are made to those mappings are still valid and are shared with other processes that have the same mapping. Likewise,
the memory object can still be used if any references remain after its name(s) have been deleted. Any references that remain after
a close must not appear to the application as file descriptors.</p>
<p>This is existing practice for <a href="../functions/mmap.html"><i>mmap</i>()</a> and <a href=
"../functions/close.html"><i>close</i>()</a>. In addition, there are already mappings present (text, data, stack) that do not have
open file descriptors. The text mapping in particular is considered a reference to the file containing the text. The desire was to
treat all mappings by the process uniformly. Also, many modern implementations use <a href=
"../functions/mmap.html"><i>mmap</i>()</a> to implement shared libraries, and it would not be desirable to keep file descriptors
for each of the many libraries an application can use. It was felt there were many other existing programs that used this behavior
to free a file descriptor, and thus IEEE&nbsp;Std&nbsp;1003.1-2001 could not forbid it and still claim to be using existing
practice.</p>
<p>For implementations that implement memory objects using memory only, memory objects will retain the memory allocated to the file
after the last close and will use that same memory on the next open. Note that closing the memory object is not the same as
deleting the name, since the memory object is still defined in the memory object name space.</p>
<p>The locks of <a href="../functions/fcntl.html"><i>fcntl</i>()</a> do not block any read or write operation, including read or
write access to shared memory or mapped files. In addition, implementations that only support shared memory objects should not be
required to implement record locks. The reference to <a href="../functions/fcntl.html"><i>fcntl</i>()</a> is added to make this
point explicit. The other <a href="../functions/fcntl.html"><i>fcntl</i>()</a> commands are useful with shared memory
objects.</p>
<p>The size of pages that mapping hardware may be able to support may be a configurable value, or it may change based on hardware
implementations. The addition of the _SC_PAGESIZE parameter to the <a href="../functions/sysconf.html"><i>sysconf</i>()</a>
function is provided for determining the mapping page size at runtime.</p>
</li>
</ul>
<h5><a name="tag_03_02_08_14"></a>Shared Memory Functions</h5>
<p>Implementations may support the Shared Memory Objects option without supporting a general Memory Mapped Files option. Shared
memory objects are named regions of storage that may be independent of the file system and can be mapped into the address space of
one or more processes to allow them to share the associated memory.</p>
<ul>
<li>
<p>Requirements</p>
<p>Shared memory is used to share data among several processes, each potentially running at different priority levels, responding
to different inputs, or performing separate tasks. Shared memory does not simply provide common access to data; it provides the
fastest possible communication between the processes. With one memory write operation, a process can pass information
to as many processes as have the memory region mapped.</p>
<p>As a result, shared memory provides a mechanism that can be used for all other interprocess communication facilities. It may
also be used by an application for implementing more sophisticated mechanisms than semaphores and message queues.</p>
<p>The need for a shared memory interface is obvious for virtual memory systems, where the operating system is directly preventing
processes from accessing each other's data. However, in unprotected systems, such as those found in some embedded controllers, a
shared memory interface is needed to provide a portable mechanism to allocate a region of memory to be shared and then to
communicate the address of that region to other processes.</p>
<p>This, then, provides the minimum functionality that a shared memory interface must have in order to support realtime
applications: to allocate and name an object to be mapped into memory for potential sharing ( <a href=
"../functions/open.html"><i>open</i>()</a> or <a href="../functions/shm_open.html"><i>shm_open</i>()</a>), and to make the memory
object available within the address space of a process ( <a href="../functions/mmap.html"><i>mmap</i>()</a>). To complete the
interface, a mechanism to release the claim of a process on a shared memory object ( <a href=
"../functions/munmap.html"><i>munmap</i>()</a>) is also needed, as well as a mechanism for deleting the name of a sharable object
that was previously created ( <a href="../functions/unlink.html"><i>unlink</i>()</a> or <a href=
"../functions/shm_unlink.html"><i>shm_unlink</i>()</a>).</p>
<p>After a mapping has been established, an implementation should not have to provide services to maintain that mapping. All memory
writes into that area will appear immediately in the memory mapping of that region by any other processes.</p>
<p>Thus, requirements include:</p>
<ul>
<li>
<p>Support creation of sharable memory objects and the mapping of these objects into the address space of a process.</p>
</li>
<li>
<p>Sharable memory objects should be accessed by global names accessible from all processes.</p>
</li>
<li>
<p>Support the mapping of specific sections of physical address space (such as a memory mapped device) into the address space of a
process. This should not be done by the process specifying the actual address, but again by an implementation-defined global name
(such as a special device name) dedicated to this purpose.</p>
</li>
<li>
<p>Support the mapping of discrete portions of these memory objects.</p>
</li>
<li>
<p>Support for minimum hardware configurations that contain no physical media on which to store shared memory contents
permanently.</p>
</li>
<li>
<p>The ability to preallocate the entire shared memory region so that minimum hardware configurations without virtual memory
support can guarantee contiguous space.</p>
</li>
<li>
<p>The maximizing of performance by not requiring functionality that would involve implementation interaction beyond creating the
shared memory area and returning the mapping.</p>
</li>
</ul>
<p>Note that the above requirements do not preclude:</p>
<ul>
<li>
<p>The sharable memory object from being implemented using actual files on an actual file system.</p>
</li>
<li>
<p>The global name that is accessible from all processes being restricted to a file system area that is dedicated to handling
shared memory.</p>
</li>
<li>
<p>An implementation not providing implementation-defined global names for the purpose of physical address mapping.</p>
</li>
</ul>
</li>
<li>
<p>Shared Memory Objects Usage</p>
<p>If the Shared Memory Objects option is supported, a shared memory object may be created, or opened if it already exists, with
the <a href="../functions/shm_open.html"><i>shm_open</i>()</a> function. If the shared memory object is created, it has a length of
zero. The <a href="../functions/ftruncate.html"><i>ftruncate</i>()</a> function can be used to set the size of the shared memory
object after creation. The <a href="../functions/shm_unlink.html"><i>shm_unlink</i>()</a> function removes the name for a shared
memory object created by <a href="../functions/shm_open.html"><i>shm_open</i>()</a>.</p>
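<p>The following sketch shows this sequence; the object name and the region size are illustrative, and error handling is
elided:</p>
<blockquote>
<pre>
<tt>#include &lt;fcntl.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;unistd.h&gt;

int fd = shm_open("/my_shm", O_CREAT | O_RDWR, 0600);

ftruncate(fd, 4096);                 /* set the size after creation */
void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
close(fd);                           /* the mapping remains valid */
/* ... share data through p with other processes ... */
munmap(p, 4096);
shm_unlink("/my_shm");               /* remove the name */
</tt>
</pre>
</blockquote>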
</li>
<li>
<p>Shared Memory Overview</p>
<p>The shared memory facility defined by IEEE&nbsp;Std&nbsp;1003.1-2001 usually results in memory locations being added to the
address space of the process. The implementation returns the address of the new space to the application by means of a pointer.
This works well in languages like C. However, in languages without pointer types it will not work. In the bindings for such a
language, either a special COMMON section will need to be defined (which is unlikely), or the binding will have to allow existing
structures to be mapped. The implementation will likely have to place restrictions on the size and alignment of such structures or
will have to map a suitable region of the address space of the process into the memory object, and thus into other processes. These
are issues for that particular language binding. For IEEE&nbsp;Std&nbsp;1003.1-2001, however, the practice will not be forbidden,
merely undefined.</p>
<p>Two potentially different name spaces are used for naming objects that may be mapped into process address spaces. When the
Memory Mapped Files option is supported, files may be accessed via <a href="../functions/open.html"><i>open</i>()</a>. When the
Shared Memory Objects option is supported, sharable memory objects that might not be files may be accessed via the <a href=
"../functions/shm_open.html"><i>shm_open</i>()</a> function. These options are not mutually-exclusive.</p>
<p>Some implementations supporting the Shared Memory Objects option may choose to implement the shared memory object name space as
part of the file system name space. There are several reasons for this:</p>
<ul>
<li>
<p>It allows applications to prevent name conflicts by use of the directory structure.</p>
</li>
<li>
<p>It uses an existing mechanism for accessing global objects and prevents the creation of a new mechanism for naming global
objects.</p>
</li>
</ul>
<p>In such implementations, memory objects can be implemented using regular files, if that is what the implementation chooses. The
<a href="../functions/shm_open.html"><i>shm_open</i>()</a> function can be implemented as an <a href=
"../functions/open.html"><i>open</i>()</a> call in a fixed directory followed by a call to <a href=
"../functions/fcntl.html"><i>fcntl</i>()</a> to set FD_CLOEXEC. The <a href="../functions/shm_unlink.html"><i>shm_unlink</i>()</a>
function can be implemented as an <a href="../functions/unlink.html"><i>unlink</i>()</a> call.</p>
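<p>A sketch of such an implementation follows; the fixed directory <b>/shm</b> is purely illustrative:</p>
<blockquote>
<pre>
<tt>#include &lt;fcntl.h&gt;
#include &lt;limits.h&gt;
#include &lt;stdio.h&gt;
#include &lt;sys/stat.h&gt;

int
shm_open(const char *name, int oflag, mode_t mode)
{
    char path[PATH_MAX];
    int fd;

    (void) snprintf(path, sizeof path, "/shm/%s", name);
    fd = open(path, oflag, mode);
    if (fd != -1)
        (void) fcntl(fd, F_SETFD, FD_CLOEXEC);
    return fd;
}
</tt>
</pre>
</blockquote>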
<p>On the other hand, it is also expected that small embedded systems that support the Shared Memory Objects option may wish to
implement shared memory without having any file systems present. In this case, the implementations may choose to use a simple
string valued name space for shared memory regions. The <a href="../functions/shm_open.html"><i>shm_open</i>()</a> function permits
either type of implementation.</p>
<p>Some implementations have hardware that supports protection of mapped data from certain classes of access and some do not.
Systems that supply this functionality can support the Memory Protection option.</p>
<p>Some implementations restrict size, alignment, and protections to be on <i>page</i>-size boundaries. If an implementation has no
restrictions on size or alignment, it may specify a 1-byte page size. Applications on implementations that do support larger pages
must be cognizant of the page size since this is the alignment and protection boundary.</p>
<p>Simple embedded implementations may have a 1-byte page size and only support the Shared Memory Objects option. This provides
simple shared memory between processes without requiring mapping hardware.</p>
<p>IEEE&nbsp;Std&nbsp;1003.1-2001 specifically allows a memory object to remain referenced after a close because that is existing
practice for the <a href="../functions/mmap.html"><i>mmap</i>()</a> function.</p>
</li>
</ul>
<h5><a name="tag_03_02_08_15"></a>Typed Memory Functions</h5>
<p>Implementations may support the Typed Memory Objects option without supporting either the Shared Memory option or the Memory
Mapped Files option. Typed memory objects are pools of specialized storage, different from the main memory resource normally used
by a processor to hold code and data, that can be mapped into the address space of one or more processes.</p>
<ul>
<li>
<p>Model</p>
<p>Realtime systems conforming to one of the POSIX.13 realtime profiles are expected (and desired) to be supported on systems with
more than one type or pool of memory (for example, SRAM, DRAM, ROM, EPROM, EEPROM), where each type or pool of memory may be
accessible by one or more processors via one or more busses (ports). Memory mapped files, shared memory objects, and the
language-specific storage allocation operators ( <a href="../functions/malloc.html"><i>malloc</i>()</a> for the ISO&nbsp;C
standard, <i>new</i> for ISO Ada) fail to provide application program interfaces versatile enough to allow applications to control
their utilization of such diverse memory resources. The typed memory interfaces <a href=
"../functions/posix_typed_mem_open.html"><i>posix_typed_mem_open</i>()</a>, <a href=
"../functions/posix_mem_offset.html"><i>posix_mem_offset</i>()</a>, <a href=
"../functions/posix_typed_mem_get_info.html"><i>posix_typed_mem_get_info</i>()</a>, <a href=
"../functions/mmap.html"><i>mmap</i>()</a>, and <a href="../functions/munmap.html"><i>munmap</i>()</a> defined herein support the
model of typed memory described below.</p>
<p>For purposes of this model, a system comprises several processors (for example, P<sub><small>1</small></sub> and
P<sub><small>2</small></sub>), several physical memory pools (for example, M<sub><small>1</small></sub>,
M<sub><small>2</small></sub>, M<sub><small>2a</small></sub>, M<sub><small>2b</small></sub>, M<sub><small>3</small></sub>,
M<sub><small>4</small></sub>, and M<sub><small>5</small></sub>), and several busses or &quot;ports&quot; (for example,
B<sub><small>1</small></sub>, B<sub><small>2</small></sub>, B<sub><small>3</small></sub>, and B<sub><small>4</small></sub>)
interconnecting the various processors and memory pools in some system-specific way. Notice that some memory pools may be contained
in others (for example, M<sub><small>2a</small></sub> and M<sub><small>2b</small></sub> are contained in
M<sub><small>2</small></sub>).</p>
<p><a href="#tagfcjh_1">Example of a System with Typed Memory</a> shows an example of such a model. In a system like this, an
application should be able to perform the following operations:</p>
<dl compact>
<dt></dt>
<dd><img src=".././Figures/b-1.gif"></dd>
</dl>
<center><b><a name="tagfcjh_1"></a> Figure: Example of a System with Typed Memory</b></center>
<ul>
<li>
<p>Typed Memory Allocation</p>
<p>An application should be able to allocate memory dynamically from the desired pool using the desired bus, and map it into a
process' address space. For example, processor P<sub><small>1</small></sub> can allocate some portion of memory pool
M<sub><small>1</small></sub> through port B<sub><small>1</small></sub>, treating all unmapped subareas of
M<sub><small>1</small></sub> as a heap-storage resource from which memory may be allocated. This portion of memory is mapped into
the process' address space, and subsequently deallocated when unmapped from all processes.</p>
</li>
<li>
<p>Using the Same Storage Region from Different Busses</p>
<p>An application process with a mapped region of storage that is accessed from one bus should be able to map that same storage
area at another address (subject to page size restrictions detailed in <a href="../functions/mmap.html"><i>mmap</i>()</a>), to
allow it to be accessed from another bus. For example, processor P<sub><small>1</small></sub> may wish to access the same region of
memory pool M<sub><small>2b</small></sub> both through ports B<sub><small>1</small></sub> and B<sub><small>2</small></sub>.</p>
</li>
<li>
<p>Sharing Typed Memory Regions</p>
<p>Several application processes running on the same or different processors may wish to share a particular region of a typed
memory pool. Each process or processor may wish to access this region through different busses. For example, processor
P<sub><small>1</small></sub> may want to share a region of memory pool M<sub><small>4</small></sub> with processor
P<sub><small>2</small></sub>, and they may be required to use busses B<sub><small>2</small></sub> and B<sub><small>3</small></sub>,
respectively, to minimize bus contention. A problem arises here when a process allocates and maps a portion of fragmented memory
and then wants to share this region of memory with another process, either in the same processor or different processors. The
solution adopted is to allow the first process to find out the memory map (offsets and lengths) of all the different fragments of
memory that were mapped into its address space, by repeatedly calling <a href=
"../functions/posix_mem_offset.html"><i>posix_mem_offset</i>()</a>. Then, this process can pass the offsets and lengths obtained to
the second process, which can then map the same memory fragments into its address space.</p>
</li>
<li>
<p>Contiguous Allocation</p>
<p>The problem of finding the memory map of the different fragments of the memory pool that were mapped into logically contiguous
addresses of a given process can be solved by requesting contiguous allocation. For example, a process in
P<sub><small>1</small></sub> can allocate 10 Kbytes of physically contiguous memory from
M<sub><small>3</small></sub>-B<sub><small>1</small></sub>, and obtain the offset (within pool M<sub><small>3</small></sub>) of this
block of memory. Then, it can pass this offset (and the length) to a process in P<sub><small>2</small></sub> using some
interprocess communication mechanism. The second process can map the same block of memory by using the offset transferred and
specifying M<sub><small>3</small></sub>-B<sub><small>2</small></sub>.</p>
</li>
<li>
<p>Unallocated Mapping</p>
<p>Any subarea of a memory pool that is mapped to a process, either as the result of an allocation request or an explicit mapping,
is normally unavailable for allocation. Special processes such as debuggers, however, may need to map large areas of a typed memory
pool, yet leave those areas available for allocation.</p>
</li>
</ul>
<p>Typed memory allocation and mapping has to coexist with storage allocation operators like <a href=
"../functions/malloc.html"><i>malloc</i>()</a>, but systems are free to choose how to implement this coexistence. For example, it
may be system configuration-dependent if all available system memory is made part of one of the typed memory pools or if some part
will be restricted to conventional allocation operators. Equally system configuration-dependent may be the availability of
operators like <a href="../functions/malloc.html"><i>malloc</i>()</a> to allocate storage from certain typed memory pools. It is
not excluded to configure a system such that a given named pool, P<sub><small>1</small></sub>, is in turn split into
non-overlapping named subpools. For example, M<sub><small>1</small></sub>-B<sub><small>1</small></sub>,
M<sub><small>2</small></sub>-B<sub><small>1</small></sub>, and M<sub><small>3</small></sub>-B<sub><small>1</small></sub> could also
be accessed as one common pool M<sub><small>123</small></sub>-B<sub><small>1</small></sub>. A call to <a href=
"../functions/malloc.html"><i>malloc</i>()</a> on P<sub><small>1</small></sub> could work on such a larger pool while full
optimization of memory usage by P<sub><small>1</small></sub> would require typed memory allocation at the subpool level.</p>
</li>
<li>
<p>Existing Practice</p>
<p>OS-9 provides for the naming (numbering) and prioritization of memory types by a system administrator. It then provides APIs to
request memory allocation of typed (colored) memory by number, and to generate a bus address from a mapped memory address
(translate). When requesting colored memory, the user can specify type 0 to signify allocation from the first available type in
priority order.</p>
<p>HP-RT presents interfaces to map different kinds of storage regions that are visible through a VME bus, although it does not
provide allocation operations. It also provides functions to perform address translation between VME addresses and virtual
addresses. It represents a VME-bus unique solution to the general problem.</p>
<p>The PSOS approach is similar (that is, based on a pre-established mapping of bus address ranges to specific memories) with a
concept of segments and regions (regions dynamically allocated from a heap which is a special segment). Therefore, PSOS does not
fully address the general allocation problem either. PSOS does not have a &quot;process&quot;-based model, but more of a
&quot;thread&quot;-only-based model of multi-tasking. So mapping to a process address space is not an issue.</p>
<p>QNX uses the System&nbsp;V approach of opening specially named devices (shared memory segments) and using <a href=
"../functions/mmap.html"><i>mmap</i>()</a> to then gain access from the process. They do not address allocation directly, but once
typed shared memory can be mapped, an &quot;allocation manager&quot; process could be written to handle requests for allocation.</p>
<p>The System&nbsp;V approach also included allocation, implemented by opening yet other special &quot;devices&quot; which allocate, rather
than appearing as a whole memory object.</p>
<p>The Orkid realtime kernel interface definition has operations to manage memory &quot;regions&quot; and &quot;pools&quot;, which are areas of
memory that may reflect the differing physical nature of the memory. Operations to allocate memory from these regions and pools are
also provided.</p>
</li>
<li>
<p>Requirements</p>
<p>Existing practice in SVID-derived UNIX systems relies on functionality similar to <a href=
"../functions/mmap.html"><i>mmap</i>()</a> and its related interfaces to achieve mapping and allocation of typed memory. However,
the issue of sharing typed memory (allocated or mapped) and the complication of multiple ports are not addressed in any consistent
way by existing UNIX system practice. Part of this functionality is existing practice in specialized realtime operating systems. In
order to solidify the capabilities implied by the model above, the following requirements are imposed on the interface:</p>
<ul>
<li>
<p>Identification of Typed Memory Pools and Ports</p>
<p>All processes (running in all processors) in the system are able to identify a particular (system configured) typed memory pool
accessed through a particular (system configured) port by a name. That name is a member of a name space common to all these
processes, but need not be the same name space as that containing ordinary filenames. The association between memory pools/ports
and corresponding names is typically established when the system is configured. The &quot;open&quot; operation for typed memory objects
should be distinct from the <a href="../functions/open.html"><i>open</i>()</a> function, for consistency with other similar
services, but implementable on top of <a href="../functions/open.html"><i>open</i>()</a>. This implies that the handle for a typed
memory object will be a file descriptor.</p>
</li>
<li>
<p>Allocation and Mapping of Typed Memory</p>
<p>Once a typed memory object has been identified by a process, it is possible to both map user-selected subareas of that object
into process address space and to map system-selected (that is, dynamically allocated) subareas of that object, with user-specified
length, into process address space. It is also possible to determine the maximum length of memory allocation that may be requested
from a given typed memory object.</p>
</li>
<li>
<p>Sharing Typed Memory</p>
<p>Two or more processes are able to share portions of typed memory, either user-selected or dynamically allocated. This
requirement applies also to dynamically allocated regions of memory that are composed of several non-contiguous pieces.</p>
</li>
<li>
<p>Contiguous Allocation</p>
<p>For dynamic allocation, it is the user's option whether the system is required to allocate a contiguous subarea within the typed
memory object, or whether it is permitted to allocate discontiguous fragments which appear contiguous in the process mapping.
Contiguous allocation simplifies the process of sharing allocated typed memory, while discontiguous allocation allows for
potentially better recovery of deallocated typed memory.</p>
</li>
<li>
<p>Accessing Typed Memory Through Different Ports</p>
<p>Once a subarea of a typed memory object has been mapped, it is possible to determine the location and length corresponding to a
user-selected portion of that object within the memory pool. This location and length can then be used to remap that portion of
memory for access from another port. If the referenced portion of typed memory was allocated discontiguously, the length thus
determined may be shorter than anticipated, and the user code must adapt to the value returned.</p>
</li>
<li>
<p>Deallocation</p>
<p>When a previously mapped subarea of typed memory is no longer mapped by any process in the system-as a result of a call or calls
to <a href="../functions/munmap.html"><i>munmap</i>()</a>- that subarea becomes potentially reusable for dynamic allocation; actual
reuse of the subarea is a function of the dynamic typed memory allocation policy.</p>
</li>
<li>
<p>Unallocated Mapping</p>
<p>It must be possible to map user-selected subareas of a typed memory object without marking that subarea as unavailable for
allocation. This option is not the default behavior, and requires appropriate privilege.</p>
</li>
</ul>
</li>
<li>
<p>Scenario</p>
<p>The following scenario will serve to clarify the use of the typed memory interfaces.</p>
<p>Process A running on P<sub><small>1</small></sub> (see <a href="#tagfcjh_1">Example of a System with Typed Memory</a> ) wants to
allocate some memory from memory pool M<sub><small>2</small></sub>, and it wants to share this portion of memory with process B
running on P<sub><small>2</small></sub>. Since P<sub><small>2</small></sub> only has access to the lower part of
M<sub><small>2</small></sub>, both processes will use the memory pool named M<sub><small>2b</small></sub> which is the part of
M<sub><small>2</small></sub> that is accessible both from P<sub><small>1</small></sub> and P<sub><small>2</small></sub>. The
operations that both processes need to perform are shown below:</p>
<ul>
<li>
<p>Allocating Typed Memory</p>
<p>Process A calls <a href="../functions/posix_typed_mem_open.html"><i>posix_typed_mem_open</i>()</a> with the name
<b>/typed.m2b-b1</b> and a <i>tflag</i> of POSIX_TYPED_MEM_ALLOCATE to get a file descriptor usable for allocating from pool
M<sub><small>2b</small></sub> accessed through port B<sub><small>1</small></sub>. It then calls <a href=
"../functions/mmap.html"><i>mmap</i>()</a> with this file descriptor requesting a length of 4096 bytes. The system allocates two
discontiguous blocks of sizes 1024 and 3072 bytes within M<sub><small>2b</small></sub>. The <a href=
"../functions/mmap.html"><i>mmap</i>()</a> function returns a pointer to a 4096-byte array in process A's logical address space,
mapping the allocated blocks contiguously. Process A can then utilize the array, and store data in it.</p>
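<p>In code, process A's allocation step might look like the following sketch (error handling elided):</p>
<blockquote>
<pre>
<tt>#include &lt;fcntl.h&gt;
#include &lt;sys/mman.h&gt;

int fd = posix_typed_mem_open("/typed.m2b-b1", O_RDWR,
    POSIX_TYPED_MEM_ALLOCATE);
char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

/* p now addresses 4096 bytes allocated from M2b through port B1 */
</tt>
</pre>
</blockquote>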
</li>
<li>
<p>Determining the Location of the Allocated Blocks</p>
<p>Process A can determine the lengths and offsets (relative to M<sub><small>2b</small></sub>) of the two blocks allocated, by
using the following procedure: First, process A calls <a href="../functions/posix_mem_offset.html"><i>posix_mem_offset</i>()</a>
with the address of the first element of the array and length 4096. Upon return, the offset and length (1024 bytes) of the first
block are returned. A second call to <a href="../functions/posix_mem_offset.html"><i>posix_mem_offset</i>()</a> is then made using
the address of the first element of the array plus 1024 (the length of the first block), and a new length of 4096-1024. If there
were more fragments allocated, this procedure could have been continued within a loop until the offsets and lengths of all the
blocks were obtained. Notice that this relatively complex procedure can be avoided if contiguous allocation is requested (by
opening the typed memory object with the <i>tflag</i> POSIX_TYPED_MEM_ALLOCATE_CONTIG).</p>
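<p>The procedure just described can be expressed as a loop; <i>p</i> is the address returned by <i>mmap</i>(), and the length of
4096 bytes follows the scenario (error handling elided):</p>
<blockquote>
<pre>
<tt>#include &lt;sys/mman.h&gt;

off_t off;
size_t contig;
size_t remaining = 4096;
char *next = p;
int mfd;

while (remaining &gt; 0) {
    if (posix_mem_offset(next, remaining, &amp;off, &amp;contig, &amp;mfd) != 0)
        break;           /* handle error */
    /* record (off, contig) for transmission to process B */
    next += contig;
    remaining -= contig;
}
</tt>
</pre>
</blockquote>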
</li>
<li>
<p>Sharing Data Across Processes</p>
<p>Process A passes the two offset values and lengths obtained from the <a href=
"../functions/posix_mem_offset.html"><i>posix_mem_offset</i>()</a> calls to process B running on P<sub><small>2</small></sub>, via
some form of interprocess communication. Process B can gain access to process A's data by calling <a href=
"../functions/posix_typed_mem_open.html"><i>posix_typed_mem_open</i>()</a> with the name <b>/typed.m2b-b2</b> and a <i>tflag</i> of
zero, then using two <a href="../functions/mmap.html"><i>mmap</i>()</a> calls on the resulting file descriptor to map the two
subareas of that typed memory object to its own address space.</p>
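<p>Process B's side of the exchange might look like the following sketch, where <i>off1</i>, <i>len1</i>, <i>off2</i>, and
<i>len2</i> are the values received from process A (error handling elided):</p>
<blockquote>
<pre>
<tt>#include &lt;fcntl.h&gt;
#include &lt;sys/mman.h&gt;

int fd = posix_typed_mem_open("/typed.m2b-b2", O_RDWR, 0);
void *a1 = mmap(NULL, len1, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off1);
void *a2 = mmap(NULL, len2, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off2);

/* a1 and a2 now address the fragments that process A allocated */
</tt>
</pre>
</blockquote>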
</li>
</ul>
</li>
<li>
<p>Rationale for no <i>mem_alloc</i>() and <i>mem_free</i>()</p>
<p>The standard developers had originally proposed a pair of new flags to <a href="../functions/mmap.html"><i>mmap</i>()</a> which,
when applied to a typed memory object descriptor, would cause <a href="../functions/mmap.html"><i>mmap</i>()</a> to allocate
dynamically from an unallocated and unmapped area of the typed memory object. Deallocation was similarly accomplished through the
use of <a href="../functions/munmap.html"><i>munmap</i>()</a>. This was rejected by the ballot group because it excessively
complicated the (already rather complex) <a href="../functions/mmap.html"><i>mmap</i>()</a> interface and introduced semantics
useful only for typed memory, to a function which must also map shared memory and files. They felt that a memory allocator should
be built on top of <a href="../functions/mmap.html"><i>mmap</i>()</a> instead of being incorporated within the same interface, much
as the ISO&nbsp;C standard libraries build <a href="../functions/malloc.html"><i>malloc</i>()</a> on top of the virtual memory
mapping functions <i>brk</i>() and <i>sbrk</i>(). This would eliminate the complicated semantics involved with unmapping only part
of an allocated block of typed memory.</p>
<p>To attempt to achieve ballot group consensus, typed memory allocation and deallocation was first migrated from <a href=
"../functions/mmap.html"><i>mmap</i>()</a> and <a href="../functions/munmap.html"><i>munmap</i>()</a> to a pair of complementary
functions modeled on the ISO&nbsp;C standard <a href="../functions/malloc.html"><i>malloc</i>()</a> and <a href=
"../functions/free.html"><i>free</i>()</a>. The <i>mem_alloc</i>() function specified explicitly the typed memory object (typed
memory pool/access port) from which allocation takes place, unlike <a href="../functions/malloc.html"><i>malloc</i>()</a> where the
memory pool and port are unspecified. The <i>mem_free</i>() function handled deallocation. These new semantics still met all of the
requirements detailed above without modifying the behavior of <a href="../functions/mmap.html"><i>mmap</i>()</a> except to allow it
to map specified areas of typed memory objects. An implementation would have been free to implement <i>mem_alloc</i>() and
<i>mem_free</i>() over <a href="../functions/mmap.html"><i>mmap</i>()</a>, through <a href=
"../functions/mmap.html"><i>mmap</i>()</a>, or independently but cooperating with <a href=
"../functions/mmap.html"><i>mmap</i>()</a>.</p>
<p>The ballot group was queried to see if this was an acceptable alternative, and while there was some agreement that it achieved
the goal of removing the complicated semantics of allocation from the <a href="../functions/mmap.html"><i>mmap</i>()</a> interface,
several balloters realized that it just created two additional functions that behaved, in great part, like <a href=
"../functions/mmap.html"><i>mmap</i>()</a>. These balloters proposed an alternative which has been implemented here in place of a
separate <i>mem_alloc</i>() and <i>mem_free</i>(). This alternative is based on four specific suggestions:</p>
<ol>
<li>
<p>The <a href="../functions/posix_typed_mem_open.html"><i>posix_typed_mem_open</i>()</a> function should provide a flag which
specifies &quot;allocate on <a href="../functions/mmap.html"><i>mmap</i>()</a>&quot; (otherwise, <a href=
"../functions/mmap.html"><i>mmap</i>()</a> just maps the underlying object). This allows things roughly similar to <b>/dev/zero</b>
<i>versus</i> <b>/dev/swap</b>. Two such flags have been implemented, one of which forces contiguous allocation.</p>
</li>
<li>
<p>The <a href="../functions/posix_mem_offset.html"><i>posix_mem_offset</i>()</a> function is acceptable because it can be applied
usefully to mapped objects in general. It should return the file descriptor of the underlying object.</p>
</li>
<li>
<p>The <i>mem_get_info</i>() function in an earlier draft should be renamed <a href=
"../functions/posix_typed_mem_get_info.html"><i>posix_typed_mem_get_info</i>()</a> because it is not generally applicable to memory
objects. It should probably return the file descriptor's allocation attribute. The renaming of the function has been implemented,
but having it return a piece of information which is readily known by an application without this function has been rejected. Its
whole purpose is to query the typed memory object for attributes that are not user-specified, but determined by the
implementation.</p>
</li>
<li>
<p>There should be no separate <i>mem_alloc</i>() or <i>mem_free</i>() functions. Instead, using <a href=
"../functions/mmap.html"><i>mmap</i>()</a> on a typed memory object opened with an &quot;allocate on <a href=
"../functions/mmap.html"><i>mmap</i>()</a>&quot; flag should be used to force allocation. These are precisely the semantics defined in
the current draft.</p>
</li>
</ol>
</li>
<li>
<p>Rationale for no Typed Memory Access Management</p>
<p>The working group had originally defined an additional interface (and an additional kind of object: typed memory master) to
establish and dissolve mappings to typed memory on behalf of devices or processors which were independent of the operating system
and had no inherent capability to directly establish mappings on their own. This was to have provided functionality similar to
device driver interfaces such as <i>physio</i>() and their underlying bus-specific interfaces (for example, <i>mballoc</i>()) which
serve to set up and break down DMA pathways, and derive mapped addresses for use by hardware devices and processor cards.</p>
<p>The ballot group felt that this was beyond the scope of POSIX.1 and its amendments. Furthermore, the removal of interrupt
handling interfaces from a preceding amendment (the IEEE&nbsp;Std&nbsp;1003.1d-1999) during its balloting process renders these
typed memory access management interfaces an incomplete solution to portable device management from a user process; it would be
possible to initiate a device transfer to/from typed memory, but impossible to handle the transfer-complete interrupt in a portable
way.</p>
<p>To achieve ballot group consensus, all references to typed memory access management capabilities were removed. The concept of
portable interfaces from a device driver to both operating system and hardware is being addressed by the Uniform Driver Interface
(UDI) industry forum, with formal standardization deferred until proof of concept and industry-wide acceptance and
implementation.</p>
</li>
</ul>
<h5><a name="tag_03_02_08_16"></a>Process Scheduling</h5>
<p>IEEE PASC Interpretation 1003.1 #96 has been applied, adding the <a href=
"../functions/pthread_setschedprio.html"><i>pthread_setschedprio</i>()</a> function. This was added since previously there was no
way for a thread to lower its own priority without going to the tail of the threads list for its new priority. This capability is
necessary to bound the duration of priority inversion encountered by a thread.</p>
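<p>For example, a thread might lower its own priority without being placed at the tail of the thread list for the new priority as
in the following sketch; <i>new_prio</i> is illustrative:</p>
<blockquote>
<pre>
<tt>#include &lt;pthread.h&gt;

int err = pthread_setschedprio(pthread_self(), new_prio);

if (err != 0)
    /* handle error */ ;
</tt>
</pre>
</blockquote>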
<p>The following portion of the rationale presents models, requirements, and standardization issues relevant to process scheduling;
see also <a href="#tag_03_02_09_11">Thread Scheduling</a> .</p>
<p>In an operating system supporting multiple concurrent processes, the system determines the order in which processes execute to
meet implementation-defined goals. For time-sharing systems, the goal is to enhance system throughput and promote fairness; the
application is provided with little or no control over this sequencing function. While this is acceptable and desirable behavior in
a time-sharing system, it is inappropriate in a realtime system; realtime applications must specifically control the execution
sequence of their concurrent processes in order to meet externally defined response requirements.</p>
<p>In IEEE&nbsp;Std&nbsp;1003.1-2001, the control over process sequencing is provided using a concept of scheduling policies. These
policies, described in detail in this section, define the behavior of the system whenever processor resources are to be allocated
to competing processes. Only the behavior of the policy is defined; conforming implementations are free to use any mechanism
desired to achieve the described behavior.</p>
<ul>
<li>
<p>Models</p>
<p>In an operating system supporting multiple concurrent processes, the system determines the order in which processes execute and
might force long-running processes to yield to other processes at certain intervals. Typically, the scheduling code is executed
whenever an event occurs that might alter the process to be executed next.</p>
<p>The simplest scheduling strategy is a &quot;first-in, first-out&quot; (FIFO) dispatcher. Whenever a process becomes runnable, it is
placed on the end of a ready list. The process at the front of the ready list is executed until it exits or becomes blocked, at
which point it is removed from the list. This scheduling technique is also known as &quot;run-to-completion&quot; or &quot;run-to-block&quot;.</p>
<p>A natural extension to this scheduling technique is the assignment of a &quot;non-migrating priority&quot; to each process. This policy
differs from strict FIFO scheduling in only one respect: whenever a process becomes runnable, it is placed at the end of the list
of processes runnable at that priority level. When selecting a process to run, the system always selects the first process from the
highest priority queue with a runnable process. Thus, when a process becomes unblocked, it will preempt a running process of lower
priority without otherwise altering the ready list. Further, if a process elects to alter its priority, it is removed from the
ready list and reinserted, using its new priority, according to the policy above.</p>
<p>While the above policy might be considered unfriendly in a time-sharing environment in which multiple users require more
balanced resource allocation, it could be ideal in a realtime environment for several reasons. The most important of these is that
it is deterministic: the highest-priority process is always run and, among processes of equal priority, the process that has been
runnable for the longest time is executed first. Because of this determinism, cooperating processes can implement more complex
scheduling simply by altering their priority. For instance, if processes at a single priority were to reschedule themselves at
fixed time intervals, a time-slice policy would result.</p>
<p>In a dedicated operating system in which all processes are well-behaved realtime applications, non-migrating priority scheduling
is sufficient. However, many existing implementations provide for more complex scheduling policies.</p>
<p>IEEE&nbsp;Std&nbsp;1003.1-2001 specifies a linear scheduling model. In this model, every process in the system has a priority.
The system scheduler always dispatches a process that has the highest (generally the most time-critical) priority among all
runnable processes in the system. As long as there is only one such process, the dispatching policy is trivial. When multiple
processes of equal priority are eligible to run, they are ordered according to a strict run-to-completion (FIFO) policy.</p>
<p>The priority is represented as a positive integer and is inherited from the parent process. For processes running under a fixed
priority scheduling policy, the priority is never altered except by an explicit function call.</p>
<p>It was determined arbitrarily that larger integers correspond to &quot;higher priorities&quot;.</p>
<p>Certain implementations might impose restrictions on the priority ranges to which processes can be assigned. There also can be
restrictions on the set of policies to which processes can be set.</p>
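<p>A portable application can discover these priority ranges at runtime; for example (a sketch):</p>
<blockquote>
<pre>
<tt>#include &lt;sched.h&gt;

/* query the implementation's priority range for the SCHED_FIFO policy */
int prio_min = sched_get_priority_min(SCHED_FIFO);
int prio_max = sched_get_priority_max(SCHED_FIFO);
</tt>
</pre>
</blockquote>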
</li>
<li>
<p>Requirements</p>
<p>Realtime processes require that scheduling be fast and deterministic, and that it guarantees to preempt lower priority
processes.</p>
<p>Thus, given the linear scheduling model, realtime processes require that they be run at a priority that is higher than other
processes. Within this framework, realtime processes are free to yield execution resources to each other in a completely portable
and implementation-independent manner.</p>
<p>As there is a generally perceived requirement for processes at the same priority level to share processor resources more
equitably, a scheduling policy (SCHED_RR) is provided that is intended to offer a timeslice-like facility.
<basefont size="2"></p>
<dl>
<dt><b>Note:</b></dt>
<dd>The following topics assume that low numeric priority implies low scheduling criticality and <i>vice versa</i>.</dd>
</dl>
<basefont size="3"></li>
<li>
<p>Rationale for New Interface</p>
<p>Realtime applications need to be able to determine when processes will run in relation to each other. It must be possible to
guarantee that a critical process will run whenever it is runnable; that is, whenever it wants to run and for as long as it needs.
SCHED_FIFO satisfies this requirement. Additionally, SCHED_RR was defined to meet a realtime requirement for a well-defined
time-sharing policy for processes at the same priority.</p>
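<p>For example (a sketch; the priority offset is arbitrary, and a real application would validate it against the policy's
maximum), a process could request SCHED_FIFO scheduling as follows:</p>
<blockquote>
<pre>
<tt>#include &lt;sched.h&gt;

struct sched_param sp;

sp.sched_priority = sched_get_priority_min(SCHED_FIFO) + 10;
if (sched_setscheduler(0, SCHED_FIFO, &amp;sp) == -1) {  /* 0: calling process */
    /* handle the error; appropriate privileges are typically required */
}
</tt>
</pre>
</blockquote>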
<p>It would be possible to use the BSD <a href="../functions/setpriority.html"><i>setpriority</i>()</a> and <a href=
"../functions/getpriority.html"><i>getpriority</i>()</a> functions by redefining the meaning of the &quot;nice&quot; parameter according to
the scheduling policy currently in use by the process. The System&nbsp;V <a href="../functions/nice.html"><i>nice</i>()</a>
interface was felt to be undesirable for realtime because it specifies an adjustment to the &quot;nice&quot; value, rather than setting it
to an explicit value. Realtime applications will usually want to set priority to an explicit value. Also, System&nbsp;V <a href=
"../functions/nice.html"><i>nice</i>()</a> does not allow for changing the priority of another process.</p>
<p>With the POSIX.1b interfaces, the traditional &quot;nice&quot; value does not affect the SCHED_FIFO or SCHED_RR scheduling policies. If
a &quot;nice&quot; value is supported, it is implementation-defined whether it affects the SCHED_OTHER policy.</p>
<p>An important aspect of IEEE&nbsp;Std&nbsp;1003.1-2001 is the explicit description of the queuing and preemption rules. It is
critical, to achieve deterministic scheduling, that such rules be stated clearly in IEEE&nbsp;Std&nbsp;1003.1-2001.</p>
<p>IEEE&nbsp;Std&nbsp;1003.1-2001 does not address the interaction between priority and swapping. The issues involved with swapping
and virtual memory paging are highly implementation-specific and would be nearly impossible to standardize at this point. The
proposed scheduling paradigm, however, fully describes the scheduling behavior of runnable processes, of which one criterion is
that the working set be resident in memory. Assuming the existence of a portable interface for locking portions of a process in
memory, paging behavior need not affect the scheduling of realtime processes.</p>
<p>IEEE&nbsp;Std&nbsp;1003.1-2001 also does not address the priorities of &quot;system&quot; processes. In general, these processes should
always execute in low-priority ranges to avoid conflict with other realtime processes. Implementations should document the priority
ranges in which system processes run.</p>
<p>The default scheduling policy is not defined. The effect of I/O interrupts and other system processing activities is not
defined. The temporary lending of priority from one process to another (such as for the purpose of freeing resources) by the
system is not addressed. Preemption of resources is not addressed. Restrictions on the ability of a process to affect other
processes beyond a certain level (influence levels) are not addressed.</p>
<p>The rationale used to justify the simple time-quantum scheduler is that it is common practice to depend upon this type of
scheduling to ensure &quot;fair&quot; distribution of processor resources among portions of the application that must interoperate in a
serial fashion. Note that IEEE&nbsp;Std&nbsp;1003.1-2001 is silent with respect to the setting of this time quantum, or whether it
is a system-wide value or a per-process value, although it appears that the prevailing realtime practice is for it to be a
system-wide value.</p>
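<p>Although the setting of the quantum is unspecified, its current value can be retrieved portably; for example (a sketch):</p>
<blockquote>
<pre>
<tt>#include &lt;sched.h&gt;
#include &lt;time.h&gt;

struct timespec quantum;

/* obtain the SCHED_RR time quantum; a pid of 0 denotes the calling process */
if (sched_rr_get_interval(0, &amp;quantum) == 0) {
    /* quantum now holds the execution time limit Q */
}
</tt>
</pre>
</blockquote>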
<p>In a system with <i>N</i> processes at a given priority, all processor-bound, in which the time quantum is equal for all
processes at a specific priority level, the following assumptions are made of such a scheduling policy:</p>
<ol>
<li>
<p>A time quantum <i>Q</i> exists: the current process will retain control of the processor for at least a duration of <i>Q</i>,
and will receive <i>Q</i> of processor time.</p>
</li>
<li>
<p>The <i>N</i>th process at that priority will control a processor within a duration of (<i>N</i>-1) &times; <i>Q</i>.</p>
</li>
</ol>
<p>These assumptions are necessary to provide equal access to the processor and bounded response from the application.</p>
<p>The assumptions hold for the described scheduling policy only if no system overhead, such as interrupt servicing, is present. If
the interrupt servicing load is non-zero, then one of the two assumptions becomes fallacious, based upon how <i>Q</i> is measured
by the system.</p>
<p>If <i>Q</i> is measured by clock time, then the assumption that the process obtains a duration <i>Q</i> processor time is false
if interrupt overhead exists. Indeed, a scenario can be constructed with <i>N</i> processes in which a single process undergoes
complete processor starvation if a peripheral device, such as an analog-to-digital converter, generates significant interrupt
activity periodically with a period of <i>N</i> &times; <i>Q</i>.</p>
<p>If <i>Q</i> is measured as actual processor time, then the assumption that the <i>N</i>th process runs within the duration
(<i>N</i>-1) &times; <i>Q</i> is false.</p>
<p>It should be noted that SCHED_FIFO suffers from interrupt-based delay as well. However, for SCHED_FIFO, the implied response of
the system is &quot;as soon as possible&quot;, so that the interrupt load for this case is a vendor selection and not a compliance
issue.</p>
<p>With this in mind, it is necessary either to complete the definition by including bounds on the interrupt load, or to modify the
assumptions that can be made about the scheduling policy.</p>
<p>Since the motivation of inclusion of the policy is common usage, and since current applications do not enjoy the luxury of
bounded interrupt load, item (2) above is sufficient to express existing application needs and is less restrictive in the standard
definition. No difference in interface is necessary.</p>
<p>In an implementation in which the time quantum is equal for all processes at a specific priority, our assumptions can then be
restated as:</p>
<ul>
<li>
<p>A time quantum <i>Q</i> exists, and a processor-bound process will be rescheduled after a duration of, at most, <i>Q</i>. Time
quantum <i>Q</i> may be defined in either wall clock time or execution time.</p>
</li>
<li>
<p>In general, the <i>N</i>th process of a priority level should wait no longer than (<i>N</i>-1) &times; <i>Q</i> time to
execute, assuming no processes exist at higher priority levels.</p>
</li>
<li>
<p>No process should wait indefinitely.</p>
</li>
</ul>
<p>For implementations supporting per-process time quanta, these assumptions can be readily extended.</p>
</li>
</ul>
<h5><a name="tag_03_02_08_17"></a>Sporadic Server Scheduling Policy</h5>
<p>The sporadic server is a mechanism defined for scheduling aperiodic activities in time-critical realtime systems. This mechanism
reserves a certain bounded amount of execution capacity for processing aperiodic events at a high priority level. Any aperiodic
events that cannot be processed within the bounded amount of execution capacity are executed in the background at a low priority
level. Thus, a certain amount of execution capacity can be guaranteed to be available for processing periodic tasks, even under
burst conditions in the arrival of aperiodic processing requests (that is, a large number of requests in a short time interval).
The sporadic server also simplifies the schedulability analysis of the realtime system, because it allows aperiodic processes or
threads to be treated as if they were periodic. The sporadic server was first described by Sprunt, et al.</p>
<p>The key concept of the sporadic server is to provide and limit a certain amount of computation capacity for processing aperiodic
events at their assigned normal priority, during a time interval called the &quot;replenishment period&quot;. Once the entity controlled by
the sporadic server mechanism is initialized with its period and execution-time budget attributes, it preserves its execution
capacity until an aperiodic request arrives. The request will be serviced (if there are no higher priority activities pending) as
long as there is execution capacity left. If the request is completed, the actual execution time used to service it is subtracted
from the capacity, and a replenishment of this amount of execution time is scheduled to happen one replenishment period after the
arrival of the aperiodic request. If the request is not completed, because there is no execution capacity left, then the aperiodic
process or thread is assigned a lower background priority. For each portion of consumed execution capacity, the execution time used
is replenished after one replenishment period. At the time of replenishment, if the sporadic server was executing at a background
priority level, its priority is elevated to the normal level. Other similar replenishment policies have been defined, but the one
presented here represents a compromise between efficiency and implementation complexity.</p>
<p>The interface that appears in this section defines a new scheduling policy for threads and processes that behaves according to
the rules of the sporadic server mechanism. Scheduling attributes are defined and functions are provided to allow the user to set
and get the parameters that control the scheduling behavior of this mechanism, namely the normal and low priority, the
replenishment period, the maximum number of pending replenishment operations, and the initial execution-time budget.</p>
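<p>For illustration, a thread might be created under this policy as in the following sketch (assuming the Thread Execution
Scheduling and Thread Sporadic Server options; the attribute values and the <i>server_thread</i> start routine are arbitrary
placeholders):</p>
<blockquote>
<pre>
<tt>#include &lt;pthread.h&gt;
#include &lt;sched.h&gt;

pthread_attr_t attr;
struct sched_param sp;
pthread_t tid;

pthread_attr_init(&amp;attr);
pthread_attr_setinheritsched(&amp;attr, PTHREAD_EXPLICIT_SCHED);
pthread_attr_setschedpolicy(&amp;attr, SCHED_SPORADIC);

sp.sched_priority = 20;                       /* normal priority: arbitrary */
sp.sched_ss_low_priority = 5;                 /* background priority: arbitrary */
sp.sched_ss_repl_period.tv_sec = 1;           /* replenishment period */
sp.sched_ss_repl_period.tv_nsec = 0;
sp.sched_ss_init_budget.tv_sec = 0;           /* initial budget: 100 ms */
sp.sched_ss_init_budget.tv_nsec = 100000000;
sp.sched_ss_max_repl = 4;                     /* pending replenishment limit */
pthread_attr_setschedparam(&amp;attr, &amp;sp);

pthread_create(&amp;tid, &amp;attr, server_thread, NULL);
</tt>
</pre>
</blockquote>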
<ul>
<li>
<p>Scheduling Aperiodic Activities</p>
<p>Virtually all realtime applications are required to process aperiodic activities. In many cases, there are tight timing
constraints that the response to the aperiodic events must meet. Usual timing requirements imposed on the response to these events
are:</p>
<ul>
<li>
<p>The effects of an aperiodic activity on the response time of lower priority activities must be controllable and predictable.</p>
</li>
<li>
<p>The system must provide the fastest possible response time to aperiodic events.</p>
</li>
<li>
<p>It must be possible to take advantage of all the available processing bandwidth not needed by time-critical activities to
enhance average-case response times to aperiodic events.</p>
</li>
</ul>
<p>Traditional methods for scheduling aperiodic activities are background processing, polling tasks, and direct event
execution:</p>
<ul>
<li>
<p>Background processing consists of assigning a very low priority to the processing of aperiodic events. It utilizes all the
available bandwidth in the system that has not been consumed by higher priority threads. However, it is very difficult, or
impossible, to meet requirements on average-case response time, because the aperiodic entity has to wait for the execution of all
other entities which have higher priority.</p>
</li>
<li>
<p>Polling consists of creating a periodic process or thread for servicing aperiodic requests. At regular intervals, the polling
entity is started and it services any accumulated pending aperiodic requests. If no aperiodic requests are pending, the polling
entity suspends itself until its next period. Polling allows the aperiodic requests to be processed at a higher priority level. However,
worst and average-case response times of polling entities are a direct function of the polling period, and there is execution
overhead for each polling period, even if no event has arrived. If the deadline of the aperiodic activity is short compared to the
inter-arrival time, the polling frequency must be increased to guarantee meeting the deadline. For this case, the increase in
frequency can dramatically reduce the efficiency of the system and, therefore, its capacity to meet all deadlines. Yet, polling
represents a good way to handle a large class of practical problems because it preserves system predictability, and because the
amortized overhead drops as load increases.</p>
</li>
<li>
<p>Direct event execution consists of executing the aperiodic events at a high fixed-priority level. Typically, the aperiodic event
is processed by an interrupt service routine as soon as it arrives. This technique provides predictable response times for
aperiodic events, but makes the response times of all lower priority activities completely unpredictable under burst arrival
conditions. Therefore, if the density of aperiodic event arrivals is unbounded, it may be a dangerous technique for time-critical
systems. Yet, for those cases in which the physics of the system imposes a bound on the event arrival rate, it is probably the most
efficient technique.</p>
</li>
<li>
<p>The sporadic server scheduling algorithm combines the predictability of the polling approach with the short response times of
the direct event execution. Thus, it allows systems to meet an important class of application requirements that cannot be met by
using the traditional approaches. Multiple sporadic servers with different attributes can be applied to the scheduling of multiple
classes of aperiodic events, each with different kinds of timing requirements, such as individual deadlines, average response
times, and so on. It also has many other interesting applications for realtime, such as scheduling producer/consumer tasks in
time-critical systems, limiting the effects of faults on the estimation of task execution-time requirements, and so on.</p>
</li>
</ul>
</li>
<li>
<p>Existing Practice</p>
<p>The sporadic server has been used in different kinds of applications, including military avionics, robot control systems,
industrial automation systems, and so on. There are examples of many systems that cannot be successfully scheduled using the
classic approaches, such as direct event execution, or polling, and are schedulable using a sporadic server scheduler. The sporadic
server algorithm itself can successfully schedule all systems scheduled with direct event execution or polling.</p>
<p>The sporadic server scheduling policy has been implemented as a commercial product in the run-time system of the Verdix Ada
compiler. There are also many applications that have used a much less efficient application-level sporadic server. These realtime
applications would benefit from a sporadic server scheduler implemented at the scheduler level.</p>
</li>
<li>
<p>Library-Level <i>versus</i> Kernel-Level Implementation</p>
<p>The sporadic server interface described in this section requires the sporadic server policy to be implemented at the same level
as the scheduler. This means that the process sporadic server must be implemented at the kernel level and the thread sporadic
server policy implemented at the same level as the thread scheduler; that is, kernel or library level.</p>
<p>In an earlier interface for the sporadic server, this mechanism was implementable at a different level than the scheduler. This
feature allowed the implementor to choose between an efficient scheduler-level implementation, or a simpler user or library-level
implementation. However, the working group considered that this interface made the use of sporadic servers more complex, and that
library-level implementations would lack some of the important functionality of the sporadic server, namely the limitation of the
actual execution time of aperiodic activities. The working group also felt that the interface described in this chapter does not
preclude library-level implementations of threads intended to provide efficient low-overhead scheduling for those threads that are
not scheduled under the sporadic server policy.</p>
</li>
<li>
<p>Range of Scheduling Priorities</p>
<p>Each of the scheduling policies supported in IEEE&nbsp;Std&nbsp;1003.1-2001 has an associated range of priorities. The priority
ranges for each policy might or might not overlap with the priority ranges of other policies. For time-critical realtime
applications it is usual for periodic and aperiodic activities to be scheduled together in the same processor. Periodic activities
will usually be scheduled using the SCHED_FIFO scheduling policy, while aperiodic activities may be scheduled using SCHED_SPORADIC.
Since the application developer will require complete control over the relative priorities of these activities in order to meet his
timing requirements, it would be desirable for the priority ranges of SCHED_FIFO and SCHED_SPORADIC to overlap completely.
Therefore, although IEEE&nbsp;Std&nbsp;1003.1-2001 does not require any particular relationship between the different priority
ranges, it is recommended that these two ranges should coincide.</p>
</li>
<li>
<p>Dynamically Setting the Sporadic Server Policy</p>
<p>Several members of the working group requested that implementations should not be required to support dynamically setting the
sporadic server scheduling policy for a thread. The reason is that this policy may have a high overhead for library-level
implementations of threads, and if threads are allowed to dynamically set this policy, this overhead can be experienced even if the
thread does not use that policy. By disallowing the dynamic setting of the sporadic server scheduling policy, these implementations
can accomplish efficient scheduling for threads using other policies. If a strictly conforming application needs to use the
sporadic server policy, and is therefore willing to pay the overhead, it must set this policy at the time of thread creation.</p>
</li>
<li>
<p>Limitation of the Number of Pending Replenishments</p>
<p>The number of simultaneously pending replenishment operations must be limited for each sporadic server for two reasons: first,
an unlimited number of replenishment operations would need an unlimited amount of system resources to store all the pending
replenishment operations; second, in some implementations each replenishment operation will represent a source of priority
inversion (just for the duration of the replenishment operation), and thus the maximum number of replenishments must be bounded to
guarantee bounded response times. The number of replenishments is bounded by lowering the priority of the sporadic server to
<i>sched_ss_low_priority</i> when the number of pending replenishments has reached its limit. In this way, no new replenishments
are scheduled until the number of pending replenishments decreases.</p>
<p>In the sporadic server scheduling policy defined in IEEE&nbsp;Std&nbsp;1003.1-2001, the application can specify the maximum
number of pending replenishment operations for a single sporadic server, by setting the value of the <i>sched_ss_max_repl</i>
scheduling parameter. This value must be between one and {SS_REPL_MAX}, which is a maximum limit imposed by the implementation. The
limit {SS_REPL_MAX} must be greater than or equal to {_POSIX_SS_REPL_MAX}, which is defined to be four in
IEEE&nbsp;Std&nbsp;1003.1-2001. The minimum limit of four was chosen so that an application can at least guarantee that four
different aperiodic events can be processed during each interval of length equal to the replenishment period.</p>
</li>
</ul>
<h5><a name="tag_03_02_08_18"></a>Clocks and Timers</h5>
<ul>
<li>
<p>Clocks</p>
<p>IEEE&nbsp;Std&nbsp;1003.1-2001 and the ISO&nbsp;C standard both define functions for obtaining system time. Implicit behind
these functions is a mechanism for measuring passage of time. This specification makes this mechanism explicit and calls it a
clock. The CLOCK_REALTIME clock required by IEEE&nbsp;Std&nbsp;1003.1-2001 is a higher resolution version of the clock that
maintains POSIX.1 system time. This is a &quot;system-wide&quot; clock, in that it is visible to all processes and, were it possible for
multiple processes to all read the clock at the same time, they would see the same value.</p>
<p>An extensible interface was defined, with the ability for implementations to define additional clocks. This was done because of
the observation that many realtime platforms support multiple clocks, and it was desired to fit this model within the standard
interface. But implementation-defined clocks need not represent actual hardware devices, nor are they necessarily system-wide.</p>
</li>
<li>
<p>Timers</p>
<p>Two timer types are required for a system to support realtime applications:</p>
<ol>
<li>
<p>One-shot</p>
<p>A one-shot timer is a timer that is armed with an initial expiration time, either relative to the current time or at an absolute
time (based on some timing base, such as time in seconds and nanoseconds since the Epoch). The timer expires once and then is
disarmed. With the specified facilities, this is accomplished by setting the <i>it_value</i> member of the <i>value</i> argument to
the desired expiration time and the <i>it_interval</i> member to zero.</p>
</li>
<li>
<p>Periodic</p>
<p>A periodic timer is a timer that is armed with an initial expiration time, again either relative or absolute, and a repetition
interval. When the initial expiration occurs, the timer is reloaded with the repetition interval and continues counting. With the
specified facilities, this is accomplished by setting the <i>it_value</i> member of the <i>value</i> argument to the desired
initial expiration time and the <i>it_interval</i> member to the desired repetition interval.</p>
</li>
</ol>
<p>For both of these types of timers, the time of the initial timer expiration can be specified in two ways:</p>
<ol>
<li>
<p>Relative (to the current time)</p>
</li>
<li>
<p>Absolute</p>
</li>
</ol>
</li>
<li>
<p>Examples of Using Realtime Timers</p>
<p>In the diagrams below, <i>S</i> indicates a program schedule, <i>R</i> shows a schedule method request, and <i>E</i> suggests an
internal operating system event.</p>
<ul>
<li>
<p>Periodic Timer: Data Logging</p>
<p>During an experiment, it might be necessary to log realtime data periodically to an internal buffer or to a mass storage device.
With a periodic scheduling method, a logging module can be started automatically at fixed time intervals to log the data.</p>
<p>Program schedule is requested every 10 seconds.</p>
<blockquote>
<pre>
<tt> R S S S S S
----+----+----+----+----+----+----+----+----+----+----+---&gt;
5 10 15 20 25 30 35 40 45 50 55
</tt>
</pre>
</blockquote>
<p>[Time (in Seconds)]</p>
<p>To achieve this type of scheduling using the specified facilities, one would allocate a per-process timer based on clock ID
CLOCK_REALTIME. Then the timer would be armed via a call to <a href="../functions/timer_settime.html"><i>timer_settime</i>()</a>
with the TIMER_ABSTIME flag reset, and with an initial expiration value and a repetition interval of 10 seconds.</p>
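<p>In outline (a sketch; the choice of SIGRTMIN and the omission of error handling are simplifications):</p>
<blockquote>
<pre>
<tt>#include &lt;signal.h&gt;
#include &lt;time.h&gt;

timer_t timerid;
struct sigevent sev;
struct itimerspec its;

sev.sigev_notify = SIGEV_SIGNAL;
sev.sigev_signo = SIGRTMIN;          /* assumed choice of signal */
sev.sigev_value.sival_ptr = &amp;timerid;
timer_create(CLOCK_REALTIME, &amp;sev, &amp;timerid);

its.it_value.tv_sec = 10;            /* first expiration: 10 seconds from now */
its.it_value.tv_nsec = 0;
its.it_interval.tv_sec = 10;         /* then every 10 seconds */
its.it_interval.tv_nsec = 0;
timer_settime(timerid, 0, &amp;its, NULL);  /* flags of 0: TIMER_ABSTIME reset */
</tt>
</pre>
</blockquote>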
</li>
<li>
<p>One-shot Timer (Relative Time): Device Initialization</p>
<p>In an emission test environment, large sample bags are used to capture the exhaust from a vehicle. The exhaust is purged from
these bags before each and every test. With a one-shot timer, a module could initiate the purge function and then suspend itself
for a predetermined period of time while the sample bags are prepared.</p>
<p>Program schedule requested 20 seconds after call is issued.</p>
<blockquote>
<pre>
<tt> R S
----+----+----+----+----+----+----+----+----+----+----+---&gt;
5 10 15 20 25 30 35 40 45 50 55
</tt>
</pre>
</blockquote>
<p>[Time (in Seconds)]</p>
<p>To achieve this type of scheduling using the specified facilities, one would allocate a per-process timer based on clock ID
CLOCK_REALTIME. Then the timer would be armed via a call to <a href="../functions/timer_settime.html"><i>timer_settime</i>()</a>
with the TIMER_ABSTIME flag reset, and with an initial expiration value of 20 seconds and a repetition interval of zero.</p>
<p>Note that if the program wishes merely to suspend itself for the specified interval, it could more easily use <a href=
"../functions/nanosleep.html"><i>nanosleep</i>()</a>.<br>
</p>
</li>
<li>
<p>One-shot Timer (Absolute Time): Data Transmission</p>
<p>The results from an experiment are often moved to a different system within a network for postprocessing or archiving. With an
absolute one-shot timer, a module that moves data from a test-cell computer to a host computer can be automatically scheduled on a
daily basis.</p>
<p>Program schedule requested for 2:30 a.m.</p>
<blockquote>
<pre>
<tt> R S
-----+-----+-----+-----+-----+-----+-----+-----+-----+-----&gt;
23:00 23:30 24:00 00:30 01:00 01:30 02:00 02:30 03:00
</tt>
</pre>
</blockquote>
<p>[Time of Day]</p>
<p>To achieve this type of scheduling using the specified facilities, a per-process timer would be allocated based on clock ID
CLOCK_REALTIME. Then the timer would be armed via a call to <a href="../functions/timer_settime.html"><i>timer_settime</i>()</a>
with the TIMER_ABSTIME flag set, and an initial expiration value equal to 2:30 a.m. of the next day.</p>
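<p>A sketch of this usage (assuming a timer <i>timerid</i> already created on CLOCK_REALTIME as in the earlier sketch; the
computation of &quot;2:30 a.m. of the next day&quot; is illustrative only):</p>
<blockquote>
<pre>
<tt>#include &lt;time.h&gt;

struct itimerspec its;
struct tm t;
time_t now = time(NULL);

localtime_r(&amp;now, &amp;t);               /* local broken-down time */
t.tm_mday += 1;                      /* next day ... */
t.tm_hour = 2;                       /* ... at 2:30:00 a.m. */
t.tm_min = 30;
t.tm_sec = 0;
t.tm_isdst = -1;                     /* let mktime() determine DST */

its.it_value.tv_sec = mktime(&amp;t);    /* absolute expiration time */
its.it_value.tv_nsec = 0;
its.it_interval.tv_sec = 0;          /* one-shot: no repetition */
its.it_interval.tv_nsec = 0;
timer_settime(timerid, TIMER_ABSTIME, &amp;its, NULL);
</tt>
</pre>
</blockquote>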
</li>
<li>
<p>Periodic Timer (Relative Time): Signal Stabilization</p>
<p>Some measurement devices, such as emission analyzers, do not respond instantaneously to an introduced sample. With a periodic
timer with a relative initial expiration time, a module that introduces a sample and records the average response could suspend
itself for a predetermined period of time while the signal is stabilized and then sample at a fixed rate.</p>
<p>Program schedule requested 15 seconds after call is issued and every 2 seconds thereafter.</p>
<blockquote>
<pre>
<tt> R S S S S S S S S S S S S S S S S S S S S
----+----+----+----+----+----+----+----+----+----+----+---&gt;
5 10 15 20 25 30 35 40 45 50 55
</tt>
</pre>
</blockquote>
<p>[Time (in Seconds)]</p>
<p>To achieve this type of scheduling using the specified facilities, one would allocate a per-process timer based on clock ID
CLOCK_REALTIME. Then the timer would be armed via a call to <a href="../functions/timer_settime.html"><i>timer_settime</i>()</a>
with TIMER_ABSTIME flag reset, and with an initial expiration value of 15 seconds and a repetition interval of 2 seconds.</p>
</li>
<li>
<p>Periodic Timer (Absolute Time): Work Shift-related Processing</p>
<p>Resource utilization data is useful when time to perform experiments is being scheduled at a facility. With a periodic timer
with an absolute initial expiration time, a module can be scheduled at the beginning of a work shift to gather resource utilization
data throughout the shift. This data can be used to allocate resources effectively to minimize bottlenecks and delays and maximize
facility throughput.</p>
<p>Program schedule requested for 2:00 a.m. and every 15 minutes thereafter.</p>
<blockquote>
<pre>
<tt> R S S S S S S
-----+-----+-----+-----+-----+-----+-----+-----+-----+-----&gt;
23:00 23:30 24:00 00:30 01:00 01:30 02:00 02:30 03:00
</tt>
</pre>
</blockquote>
<p>[Time of Day]</p>
<p>To achieve this type of scheduling using the specified facilities, one would allocate a per-process timer based on clock ID
CLOCK_REALTIME. Then the timer would be armed via a call to <a href="../functions/timer_settime.html"><i>timer_settime</i>()</a>
with TIMER_ABSTIME flag set, and with an initial expiration value equal to 2:00 a.m. and a repetition interval equal to 15
minutes.</p>
</li>
</ul>
</li>
<li>
<p>Relationship of Timers to Clocks</p>
<p>The relationship between clocks and timers armed with an absolute time is straightforward: a timer expiration signal is
requested when the associated clock reaches or exceeds the specified time. The relationship between clocks and timers armed with a
relative time (an interval) is less obvious, but not unintuitive. In this case, a timer expiration signal is requested when the
specified interval, <i>as measured by the associated clock</i>, has passed. For the required CLOCK_REALTIME clock, this allows
timer expiration signals to be requested at specified &quot;wall clock&quot; times (absolute), or when a specified interval of
&quot;realtime&quot; has passed (relative). For an implementation-defined clock (say, a process virtual time clock), timer
expirations could be requested when the process has used a specified total amount of virtual time (absolute), or when it has used
a specified <i>additional</i> amount of virtual time (relative).</p>
<p>The interfaces also allow flexibility in the implementation of the functions. For example, an implementation could convert all
absolute times to intervals by subtracting the clock value at the time of the call from the requested expiration time and
&quot;counting down&quot; at the supported resolution. Or it could convert all relative times to absolute expiration time by adding in the
clock value at the time of the call and comparing the clock value to the expiration time at the supported resolution. Or it might
even choose to maintain absolute times as absolute and compare them to the clock value at the supported resolution for absolute
timers, and maintain relative times as intervals and count them down at the resolution supported for relative timers. The choice
will be driven by efficiency considerations and the underlying hardware or software clock implementation.</p>
</li>
<li>
<p>Data Definitions for Clocks and Timers</p>
<p>IEEE&nbsp;Std&nbsp;1003.1-2001 uses a time representation capable of supporting nanosecond resolution timers for the following
reasons:</p>
<ul>
<li>
<p>To enable IEEE&nbsp;Std&nbsp;1003.1-2001 to represent those computer systems already using nanosecond or submicrosecond
resolution clocks.</p>
</li>
<li>
<p>To accommodate those per-process timers that might need nanoseconds to specify an absolute value of system-wide clocks, even
though the resolution of the per-process timer may only be milliseconds, or <i>vice versa</i>.</p>
</li>
<li>
<p>Because the number of nanoseconds in a second can be represented in 32 bits.</p>
</li>
</ul>
<p>Time values are represented in the <b>timespec</b> structure. The <i>tv_sec</i> member is of type <b>time_t</b> so that this
member is compatible with time values used by POSIX.1 functions and the ISO&nbsp;C standard. The <i>tv_nsec</i> member is a
<b>signed long</b> in order to simplify and clarify code that decrements or finds differences of time values. Note that because 1
billion (number of nanoseconds per second) is less than half of the value representable by a signed 32-bit value, it is always
possible to add two valid fractional seconds represented as integral nanoseconds without overflowing the signed 32-bit value.</p>
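<p>For example, two time values can be added with a single conditional carry (<i>timespec_add</i> is a hypothetical helper, not an
interface of IEEE&nbsp;Std&nbsp;1003.1-2001):</p>
<blockquote>
<pre>
<tt>#include &lt;time.h&gt;

/* both arguments are assumed normalized (0 &lt;= tv_nsec &lt; 1000000000);
   the intermediate sum of the tv_nsec fields is below 2 billion and
   therefore cannot overflow a signed 32-bit value */
struct timespec timespec_add(struct timespec a, struct timespec b)
{
    struct timespec r;

    r.tv_sec = a.tv_sec + b.tv_sec;
    r.tv_nsec = a.tv_nsec + b.tv_nsec;
    if (r.tv_nsec &gt;= 1000000000L) {
        r.tv_nsec -= 1000000000L;
        r.tv_sec += 1;
    }
    return r;
}
</tt>
</pre>
</blockquote>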
<p>A maximum allowable resolution for the CLOCK_REALTIME clock of 20 ms (1/50 seconds) was chosen to allow line frequency clocks in
European countries to be conforming. 60 Hz clocks in the U.S. will also be conforming, as will finer granularity clocks, although a
Strictly Conforming Application cannot assume a granularity of less than 20 ms (1/50 seconds).</p>
<p>The minimum allowable maximum time allowed for the CLOCK_REALTIME clock and the function <a href=
"../functions/nanosleep.html"><i>nanosleep</i>()</a>, and timers created with <i>clock_id</i>= CLOCK_REALTIME, is determined by the
fact that the <i>tv_sec</i> member is of type <b>time_t</b>.</p>
<p>IEEE&nbsp;Std&nbsp;1003.1-2001 specifies that timer expirations must not be delivered early, and <a href=
"../functions/nanosleep.html"><i>nanosleep</i>()</a> must not return early due to quantization error.
IEEE&nbsp;Std&nbsp;1003.1-2001 discusses the various implementations of <a href="../functions/alarm.html"><i>alarm</i>()</a> in the
rationale and states that implementations that do not allow alarm signals to occur early are the most appropriate, but refrained
from mandating this behavior. Because of the importance of predictability to realtime applications, IEEE&nbsp;Std&nbsp;1003.1-2001
takes a stronger stance.</p>
<p>The developers of IEEE&nbsp;Std&nbsp;1003.1-2001 considered using a time representation that differs from POSIX.1b in the second
32 bits of the 64-bit value. Whereas POSIX.1b defines this field as a fractional second in nanoseconds, the other methodology
defines it as a binary fraction of one second, with the radix point assumed before the most significant bit.</p>
<p>POSIX.1b is a software, source-level standard and most of the benefits of the alternate representation are enjoyed by hardware
implementations of clocks and algorithms. It was felt that mandating this format for POSIX.1b clocks and timers would unnecessarily
burden the application writer with writing, possibly non-portable, multiple precision arithmetic packages to perform conversion
between binary fractions and integral units such as nanoseconds, milliseconds, and so on.</p>
</li>
</ul>
<h5><a name="tag_03_02_08_19"></a>Rationale for the Monotonic Clock</h5>
<p>For those applications that use time services to achieve realtime behavior, changing the value of the clock on which these
services rely may cause erroneous timing behavior. For these applications, it is necessary to have a monotonic clock which cannot
run backwards, and which has a maximum clock jump that is required to be documented by the implementation. Additionally, it is
desirable (but not required by IEEE&nbsp;Std&nbsp;1003.1-2001) that the monotonic clock increases its value uniformly. This clock
should not be affected by changes to the system time; for example, to synchronize the clock with an external source or to account
for leap seconds. Such changes would cause errors in the measurement of time intervals for those time services that use the
absolute value of the clock.</p>
<p>One could argue that by defining the behavior of time services when the value of a clock is changed, deterministic realtime
behavior can be achieved. For example, one could specify that relative time services should be unaffected by changes in the value
of a clock. However, there are time services that are based upon an absolute time, but that are essentially intended as relative
time services. For example, <a href="../functions/pthread_cond_timedwait.html"><i>pthread_cond_timedwait</i>()</a> uses an absolute
time to allow it to wake up after the required interval despite spurious wakeups. Although sometimes the <a href=
"../functions/pthread_cond_timedwait.html"><i>pthread_cond_timedwait</i>()</a> timeouts are absolute in nature, there are many
occasions in which they are relative, and their absolute value is determined from the current time plus a relative time interval.
In this latter case, if the clock changes while the thread is waiting, the wait interval will not be the expected length. If a <a
href="../functions/pthread_cond_timedwait.html"><i>pthread_cond_timedwait</i>()</a> function were created that would take a
relative time, it would not solve the problem because to retain the intended &quot;deadline&quot; a thread would need to compensate for
latency due to the spurious wakeup, and preemption between wakeup and the next wait.</p>
<p>The solution is to create a new monotonic clock, whose value does not change except for the regular ticking of the clock, and
use this clock for implementing the various relative timeouts that appear in the different POSIX interfaces, as well as allow <a
href="../functions/pthread_cond_timedwait.html"><i>pthread_cond_timedwait</i>()</a> to choose this new clock for its timeout. A new
<a href="../functions/clock_nanosleep.html"><i>clock_nanosleep</i>()</a> function is created to allow an application to take
advantage of this newly defined clock. Notice that the monotonic clock may be implemented using the same hardware clock as the
system clock.</p>
<p>Relative timeouts for <a href="../functions/sigtimedwait.html"><i>sigtimedwait</i>()</a> and <a href=
"../functions/aio_suspend.html"><i>aio_suspend</i>()</a> have been redefined to use the monotonic clock, if present. The <a href=
"../functions/alarm.html"><i>alarm</i>()</a> function has not been redefined, because the same effect but with better resolution
can be achieved by creating a timer (for which the appropriate clock may be chosen).</p>
<p>The <a href="../functions/pthread_cond_timedwait.html"><i>pthread_cond_timedwait</i>()</a> function has been treated in a
different way, compared to other functions with absolute timeouts, because it is used to wait for an event, and thus it may have a
deadline, while the other timeouts are generally used as an error recovery mechanism, and for them the use of the monotonic clock
is not so important. Since the desired timeout for the <a href=
"../functions/pthread_cond_timedwait.html"><i>pthread_cond_timedwait</i>()</a> function may either be a relative interval or an
absolute time of day deadline, a new initialization attribute has been created for condition variables to specify the clock that is
used for measuring the timeout in a call to <a href="../functions/pthread_cond_timedwait.html"><i>pthread_cond_timedwait</i>()</a>.
In this way, if a relative timeout is desired, the monotonic clock will be used; if an absolute deadline is required instead, the
CLOCK_REALTIME or another appropriate clock may be used. This capability has not been added to other functions with absolute
timeouts because for those functions the expected use of the timeout is mostly to prevent errors, and not so often to meet precise
deadlines. As a consequence, the complexity of adding this capability is not justified by its perceived application usage.</p>
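<p>For example (a sketch assuming the Clock Selection option; the associated mutex and error handling are elided), a condition
variable can be made to measure its timeout against the monotonic clock:</p>
<blockquote>
<pre>
<tt>#include &lt;pthread.h&gt;
#include &lt;time.h&gt;

pthread_condattr_t cattr;
pthread_cond_t cond;
struct timespec deadline;

pthread_condattr_init(&amp;cattr);
pthread_condattr_setclock(&amp;cattr, CLOCK_MONOTONIC);
pthread_cond_init(&amp;cond, &amp;cattr);

/* an effectively relative 5-second timeout, immune to changes
   in the system time */
clock_gettime(CLOCK_MONOTONIC, &amp;deadline);
deadline.tv_sec += 5;
/* with the associated mutex locked:
   pthread_cond_timedwait(&amp;cond, &amp;mutex, &amp;deadline); */
</tt>
</pre>
</blockquote>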
<p>The <a href="../functions/nanosleep.html"><i>nanosleep</i>()</a> function has not been modified with the introduction of the
monotonic clock. Instead, a new <a href="../functions/clock_nanosleep.html"><i>clock_nanosleep</i>()</a> function has been created,
in which the desired clock may be specified in the function call.</p>
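<p>For instance (a sketch), a periodic activity can sleep to an absolute time on the monotonic clock, avoiding the cumulative
drift that repeated relative sleeps would introduce:</p>
<blockquote>
<pre>
<tt>#include &lt;time.h&gt;

struct timespec next;

clock_gettime(CLOCK_MONOTONIC, &amp;next);
for (;;) {
    next.tv_sec += 1;    /* next wakeup: exactly one second later */
    clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &amp;next, NULL);
    /* ... perform the periodic work ... */
}
</tt>
</pre>
</blockquote>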
<ul>
<li>
<p>History of Resolution Issues</p>
<p>Due to the shift from relative to absolute timeouts in IEEE&nbsp;Std&nbsp;1003.1d-1999, the amendments to the <a href=
"../functions/sem_timedwait.html"><i>sem_timedwait</i>()</a>, <a href=
"../functions/pthread_mutex_timedlock.html"><i>pthread_mutex_timedlock</i>()</a>, <a href=
"../functions/mq_timedreceive.html"><i>mq_timedreceive</i>()</a>, and <a href=
"../functions/mq_timedsend.html"><i>mq_timedsend</i>()</a> functions of that standard have been removed. Those amendments specified
that CLOCK_MONOTONIC would be used for the (relative) timeouts if the Monotonic Clock option was supported.</p>
<p>Having these functions continue to be tied solely to CLOCK_MONOTONIC would not work. Since the absolute value of a time value
obtained from CLOCK_MONOTONIC is unspecified, under the absolute timeouts interface, applications would behave differently
depending on whether the Monotonic Clock option was supported or not (because the absolute value of the clock would have different
meanings in either case).</p>
<p>Two options were considered:</p>
<ol>
<li>
<p>Leave the current behavior unchanged, which specifies the CLOCK_REALTIME clock for these (absolute) timeouts, to allow
portability of applications between implementations supporting or not the Monotonic Clock option.</p>
</li>
<li>
<p>Modify these functions in the way that <a href="../functions/pthread_cond_timedwait.html"><i>pthread_cond_timedwait</i>()</a>
was modified to allow a choice of clock, so that an application could use CLOCK_REALTIME when it is trying to achieve an absolute
timeout and CLOCK_MONOTONIC when it is trying to achieve a relative timeout.</p>
</li>
</ol>
<p>It was decided that the features of CLOCK_MONOTONIC are not as critical to these functions as they are to <a href=
"../functions/pthread_cond_timedwait.html"><i>pthread_cond_timedwait</i>()</a>. The <a href=
"../functions/pthread_cond_timedwait.html"><i>pthread_cond_timedwait</i>()</a> function is given a relative timeout; the timeout
may represent a deadline for an event. When these functions are given relative timeouts, the timeouts are typically for error
recovery purposes and need not be so precise.</p>
<p>Therefore, it was decided that these functions should be tied to CLOCK_REALTIME and not complicated by being given a choice of
clock.</p>
</li>
</ul>
<h5><a name="tag_03_02_08_20"></a>Execution Time Monitoring</h5>
<ul>
<li>
<p>Introduction</p>
<p>The main goals of the execution time monitoring facilities defined in this chapter are to measure the execution time of
processes and threads and to allow an application to establish CPU time limits for these entities.</p>
<p>The analysis phase of time-critical realtime systems often relies on the measurement of execution times of individual threads or
processes to determine whether the timing requirements will be met. Also, performance analysis techniques for soft deadline
realtime systems rely heavily on the determination of these execution times. The execution time monitoring functions provide
application developers with the ability to measure these execution times online and open the possibility of dynamic execution-time
analysis and system reconfiguration, if required.</p>
<p>The second goal, allowing an application to establish execution time limits for individual processes or threads and to detect
when they overrun, increases program robustness by enabling online checking of the execution times.</p>
<p>If errors are detected (possibly because of erroneous program constructs, errors in the analysis phase, or a burst of event
arrivals), online detection and recovery are possible in a portable way. This feature can be extremely important for many
time-critical applications. Other applications require trapping CPU-time errors as a normal way to exit an algorithm; for
instance, some realtime artificial intelligence applications trigger a number of independent inference processes of varying
accuracy and speed, limit how long they can run, and pick the best answer available when time runs out. In many periodic systems,
overrun processes are simply restarted in the next resource period, after necessary end-of-period actions have been taken. This
allows algorithms that are inherently data-dependent to be made predictable.</p>
<p>The interface that appears in this chapter defines a new type of clock, the CPU-time clock, which measures execution time. Each
process or thread can use the clock and timer functions defined in POSIX.1 with these clocks. Functions are also provided to access
the CPU-time clock of other processes or threads to enable remote monitoring of these clocks. Monitoring of threads of other
processes is not supported, since these threads are not visible from outside of their own process with the interfaces defined in
POSIX.1.</p>
</li>
<li>
<p>Execution Time Monitoring Interface</p>
<p>The clock and timer interface defined in POSIX.1 historically only defined one clock, which measures wall-clock time. The
requirements for measuring execution time of processes and threads, and setting limits to their execution time by detecting when
they overrun, can be accomplished with that interface if a new kind of clock is defined. These new clocks measure execution time,
and one is associated with each process and with each thread. The clock functions currently defined in POSIX.1 can be used to read
and set these CPU-time clocks, and timers can be created using these clocks as their timing base. These timers can then be used to
send a signal when some specified execution time has been exceeded. The CPU-time clocks of each process or thread can be accessed
by using the symbols CLOCK_PROCESS_CPUTIME_ID or CLOCK_THREAD_CPUTIME_ID.</p>
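<p>For example (a sketch assuming the Process CPU-Time Clocks option; the signal chosen is arbitrary and error handling is
omitted), a process can arrange to be notified once it has consumed a given execution-time budget:</p>
<blockquote>
<pre>
<tt>#include &lt;signal.h&gt;
#include &lt;time.h&gt;

timer_t cpu_timer;
struct sigevent sev;
struct itimerspec budget;

sev.sigev_notify = SIGEV_SIGNAL;
sev.sigev_signo = SIGRTMIN;                 /* arbitrary choice */
sev.sigev_value.sival_ptr = &amp;cpu_timer;
timer_create(CLOCK_PROCESS_CPUTIME_ID, &amp;sev, &amp;cpu_timer);

budget.it_value.tv_sec = 5;                 /* 5 seconds of CPU time */
budget.it_value.tv_nsec = 0;
budget.it_interval.tv_sec = 0;              /* one-shot overrun detection */
budget.it_interval.tv_nsec = 0;
timer_settime(cpu_timer, 0, &amp;budget, NULL);
</tt>
</pre>
</blockquote>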
<p>The clock and timer interface defined in POSIX.1 and extended with the new kind of CPU-time clock would only allow processes or
threads to access their own CPU-time clocks. However, many realtime systems require the possibility of monitoring the execution
time of processes or threads from independent monitoring entities. In order to allow applications to construct independent
monitoring entities that do not require cooperation from or modification of the monitored entities, two functions have been added:
<a href="../functions/clock_getcpuclockid.html"><i>clock_getcpuclockid</i>()</a>, for accessing CPU-time clocks of other processes,
and <a href="../functions/pthread_getcpuclockid.html"><i>pthread_getcpuclockid</i>()</a>, for accessing CPU-time clocks of other
threads. These functions return the clock identifier associated with the process or thread specified in the call. These clock IDs
can then be used in the rest of the clock function calls.</p>
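<p>A monitoring entity might read the execution time of another process as in this sketch (<i>pid</i> is assumed to identify the
monitored process):</p>
<blockquote>
<pre>
<tt>#include &lt;time.h&gt;

clockid_t cid;
struct timespec used;

if (clock_getcpuclockid(pid, &amp;cid) == 0 &amp;&amp;
        clock_gettime(cid, &amp;used) == 0) {
    /* used now holds the CPU time consumed by process pid */
}
</tt>
</pre>
</blockquote>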
<p>The clocks accessed through these functions could also be used as a timing base for the creation of timers, thereby allowing
independent monitoring entities to limit the CPU time consumed by other entities. However, this possibility would imply additional
complexity and overhead because of the need to maintain a timer queue for each process or thread, to store the different expiration
times associated with timers created by different processes or threads. The working group decided this additional overhead was not
justified by application requirements. Therefore, creation of timers attached to the CPU-time clocks of other processes or threads
has been specified as implementation-defined.</p>
</li>
<li>
<p>Overhead Considerations</p>
<p>The measurement of execution time may introduce additional overhead in the thread scheduling, because of the need to keep track
of the time consumed by each of these entities. In library-level implementations of threads, the efficiency of scheduling could be
somewhat compromised because of the need to make a kernel call, at each context switch, to read the process CPU-time clock.
Consequently, a thread creation attribute called <i>cpu-clock-requirement</i> was defined, to allow threads to disconnect their
respective CPU-time clocks. However, the Ballot Group considered that this attribute itself introduced some overhead, and that in
current implementations it was not worth the effort. Therefore, the attribute was deleted, and thus thread CPU-time clocks are
required for all threads if the Thread CPU-Time Clocks option is supported.</p>
</li>
<li>
<p>Accuracy of CPU-Time Clocks</p>
<p>The mechanism used to measure the execution time of processes and threads is specified in IEEE&nbsp;Std&nbsp;1003.1-2001 as
implementation-defined. The reason for this is that both the underlying hardware and the implementation architecture have a very
strong influence on the accuracy achievable for measuring CPU time. For some implementations, the specification of strict accuracy
requirements would impose very large overheads, or might even be impossible to satisfy.</p>
<p>Since the mechanism for measuring execution time is implementation-defined, realtime applications will be able to take advantage
of accurate implementations using a portable interface. Of course, strictly conforming applications cannot rely on any particular
degree of accuracy, in the same way as they cannot rely on a very accurate measurement of wall clock time. There will always exist
applications whose accuracy or efficiency requirements on the implementation are more rigid than the values defined in
IEEE&nbsp;Std&nbsp;1003.1-2001 or any other standard.</p>
<p>In any case, there is a minimum set of characteristics that realtime applications would expect from most implementations. One
such characteristic is that the sum of all the execution times of all the threads in a process equals the process execution time,
when no CPU-time clocks are disabled. This need not always be the case because implementations may differ in how they account for
time during context switches. Another characteristic is that the sum of the execution times of all processes in a system equals the
number of processors, multiplied by the elapsed time, assuming that no processor is idle during that elapsed time. However, in some
implementations it might not be possible to relate CPU time to elapsed time. For example, in a heterogeneous multi-processor system
in which each processor runs at a different speed, an implementation may choose to define each &quot;second&quot; of CPU time to be a
certain number of &quot;cycles&quot; that a CPU has executed.</p>
</li>
<li>
<p>Existing Practice</p>
<p>Measuring and limiting the execution time of each concurrent activity are common features of most industrial implementations of
realtime systems. Almost all critical realtime systems are currently built upon a cyclic executive. With this approach, a regular
timer interrupt kicks off the next sequence of computations. It also checks that the current sequence has completed. If it has not,
then some error recovery action can be undertaken (or at least an overrun is avoided). Current software engineering principles and
the increasing complexity of software are driving application developers to implement these systems on multi-threaded or
multi-process operating systems. Therefore, if a POSIX operating system is to be used for this type of application, then it must
offer the same level of protection.</p>
<p>Execution time clocks are also common in most UNIX implementations, although these clocks usually have requirements different
from those of realtime applications. The POSIX.1 <a href="../functions/times.html"><i>times</i>()</a> function supports the
measurement of the execution time of the calling process, and its terminated child processes. This execution time is measured in
clock ticks and is supplied as two different values with the user and system execution times, respectively. BSD supports the
function <a href="../functions/getrusage.html"><i>getrusage</i>()</a>, which allows the calling process to get information about
the resources used by itself and/or all of its terminated child processes. The resource usage includes user and system CPU time.
Some UNIX systems have options to specify high resolution (up to one microsecond) CPU-time clocks using the <a href=
"../functions/times.html"><i>times</i>()</a> or the <a href="../functions/getrusage.html"><i>getrusage</i>()</a> functions.</p>
<p>The <a href="../functions/times.html"><i>times</i>()</a> and <a href="../functions/getrusage.html"><i>getrusage</i>()</a>
interfaces do not meet important realtime requirements, such as the possibility of monitoring execution time from a different
process or thread, or the possibility of detecting an execution time overrun. The latter requirement is supported in some UNIX
implementations that are able to send a signal when the execution time of a process has exceeded some specified value. For example,
BSD defines the functions <a href="../functions/getitimer.html"><i>getitimer</i>()</a> and <a href=
"../functions/setitimer.html"><i>setitimer</i>()</a>, which can operate either on a realtime clock (wall-clock), or on virtual-time
or profile-time clocks which measure CPU time in two different ways. These functions do not support access to the execution time of
other processes.</p>
<p>IBM's MVS operating system supports per-process and per-thread execution time clocks. It also supports limiting the execution
time of a given process.</p>
<p>Given all this existing practice, the working group considered that the POSIX.1 clocks and timers interface was appropriate to
meet most of the requirements that realtime applications have for execution time clocks. Functions were added to get the CPU time
clock IDs, and to allow/disallow the thread CPU-time clocks (in order to preserve the efficiency of some implementations of
threads).</p>
</li>
<li>
<p>Clock Constants</p>
<p>The definition of the manifest constants CLOCK_PROCESS_CPUTIME_ID and CLOCK_THREAD_CPUTIME_ID allows processes or threads,
respectively, to access their own execution-time clocks. However, given a process or thread, access to its own execution-time clock
is also possible if the clock ID of this clock is obtained through a call to <a href=
"../functions/clock_getcpuclockid.html"><i>clock_getcpuclockid</i>()</a> or <a href=
"../functions/pthread_getcpuclockid.html"><i>pthread_getcpuclockid</i>()</a>. Therefore, these constants are not necessary and
could be deleted to make the interface simpler. Their existence saves one system call in the first access to the CPU-time clock of
each process or thread. The working group considered this issue and decided to leave the constants in
IEEE&nbsp;Std&nbsp;1003.1-2001 because they are closer to the POSIX.1b use of clock identifiers.</p>
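<p>The saving is visible in the following minimal sketch, in which a thread reads its own execution-time clock in the two
equivalent ways; the first form avoids the call to <a href=
"../functions/pthread_getcpuclockid.html"><i>pthread_getcpuclockid</i>()</a>:</p>
<blockquote>
<pre>
<tt>#include &lt;pthread.h&gt;
#include &lt;time.h&gt;
<br>
void read_own_cpu_clock(void)
{
    struct timespec ts;
    clockid_t cid;
<br>
    /* Direct use of the manifest constant. */
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &amp;ts);
<br>
    /* Equivalent access through the clock ID of the calling thread. */
    pthread_getcpuclockid(pthread_self(), &amp;cid);
    clock_gettime(cid, &amp;ts);
}
</tt>
</pre>
</blockquote>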
</li>
<li>
<p>Library Implementations of Threads</p>
<p>In library implementations of threads, kernel entities and library threads can coexist. In this case, if the CPU-time clocks are
supported, most of the clock and timer functions will need to have two implementations: one in the thread library, and one in the
system calls library. The main difference between these two implementations is that the thread library implementation will have to
deal with clocks and timers that reside in the thread space, while the kernel implementation will operate on timers and clocks that
reside in kernel space. In the library implementation, if the clock ID refers to a clock that resides in the kernel, a kernel call
will have to be made. The correct version of the function can be chosen by specifying the appropriate order for the libraries
during the link process.</p>
</li>
<li>
<p>History of Resolution Issues: Deletion of the <i>enable</i> Attribute</p>
<p>In early proposals, consideration was given to inclusion of an attribute called <i>enable</i> for CPU-time clocks. This would
allow implementations to avoid the overhead of measuring execution time for those processes or threads for which this measurement
was not required. However, this is unnecessary since processes are already required to measure execution time by the POSIX.1 <a
href="../functions/times.html"><i>times</i>()</a> function. Consequently, the <i>enable</i> attribute is not present.</p>
</li>
</ul>
<h5><a name="tag_03_02_08_21"></a>Rationale Relating to Timeouts</h5>
<ul>
<li>
<p>Requirements for Timeouts</p>
<p>Realtime systems which must operate reliably over extended periods without human intervention are characteristic of embedded
applications such as avionics, machine control, and space exploration, as well as more mundane applications such as cable TV,
security systems, and plant automation. A multi-tasking paradigm, in which many independent and/or cooperating software functions
relinquish the processor(s) while waiting for a specific stimulus, resource, condition, or operation completion, is very useful in
producing well engineered programs for such systems. For such systems to be robust and fault-tolerant, expected occurrences that
are unduly delayed or that never occur must be detected so that appropriate recovery actions may be taken. This is difficult if
there is no way for a task to regain control of a processor once it has relinquished control (blocked) awaiting an occurrence
which, perhaps because of corrupted code, hardware malfunction, or latent software bugs, will not happen when expected. Therefore,
the common practice in realtime operating systems is to provide a capability to time out such blocking services. Although there are
several methods to achieve this already defined by POSIX, none are as reliable or efficient as initiating a timeout simultaneously
with initiating a blocking service. This is especially critical in hard-realtime embedded systems because the processors typically
have little time reserve, and allowed fault recovery times are measured in milliseconds rather than seconds.</p>
<p>The working group largely agreed that such timeouts were necessary and ought to become part of IEEE&nbsp;Std&nbsp;1003.1-2001,
particularly vendors of realtime operating systems whose customers had already expressed a strong need for timeouts. There was some
resistance to inclusion of timeouts in IEEE&nbsp;Std&nbsp;1003.1-2001 because the desired effect, fault tolerance, could, in
theory, be achieved using existing facilities and alternative software designs, but there was no compelling evidence that realtime
system designers would embrace such designs at the sacrifice of performance and/or simplicity.</p>
</li>
<li>
<p>Which Services should be Timed Out?</p>
<p>Originally, the working group considered the prospect of providing timeouts on all blocking services, including those currently
existing in POSIX.1, POSIX.1b, and POSIX.1c, and future interfaces to be defined by other working groups, as a general
policy. This was rather quickly rejected because of the scope of such a change, and the fact that many of those services would not
normally be used in a realtime context. More traditional timesharing solutions to timeout would suffice for most of the POSIX.1
interfaces, while others had asynchronous alternatives which, while more complex to utilize, would be adequate for some realtime
and all non-realtime applications.</p>
<p>The list of potential candidates for timeouts was narrowed to the following for further consideration:</p>
<ul>
<li>
<p>POSIX.1b</p>
<ul>
<li>
<p><a href="../functions/sem_wait.html"><i>sem_wait</i>()</a></p>
</li>
<li>
<p><a href="../functions/mq_receive.html"><i>mq_receive</i>()</a></p>
</li>
<li>
<p><a href="../functions/mq_send.html"><i>mq_send</i>()</a></p>
</li>
<li>
<p><a href="../functions/lio_listio.html"><i>lio_listio</i>()</a></p>
</li>
<li>
<p><a href="../functions/aio_suspend.html"><i>aio_suspend</i>()</a></p>
</li>
<li>
<p><a href="../functions/sigwait.html"><i>sigwait</i>()</a> (timeout already implemented by <a href=
"../functions/sigtimedwait.html"><i>sigtimedwait</i>()</a>)</p>
</li>
</ul>
</li>
<li>
<p>POSIX.1c</p>
<ul>
<li>
<p><a href="../functions/pthread_mutex_lock.html"><i>pthread_mutex_lock</i>()</a></p>
</li>
<li>
<p><a href="../functions/pthread_join.html"><i>pthread_join</i>()</a></p>
</li>
<li>
<p><a href="../functions/pthread_cond_wait.html"><i>pthread_cond_wait</i>()</a> (timeout already implemented by <a href=
"../functions/pthread_cond_timedwait.html"><i>pthread_cond_timedwait</i>()</a>)</p>
</li>
</ul>
</li>
<li>
<p>POSIX.1</p>
<ul>
<li>
<p><a href="../functions/read.html"><i>read</i>()</a></p>
</li>
<li>
<p><a href="../functions/write.html"><i>write</i>()</a></p>
</li>
</ul>
</li>
</ul>
<p>After further review by the working group, the <a href="../functions/lio_listio.html"><i>lio_listio</i>()</a>, <a href=
"../functions/read.html"><i>read</i>()</a>, and <a href="../functions/write.html"><i>write</i>()</a> functions (all forms of
blocking synchronous I/O) were eliminated from the list because of the following:</p>
<ul>
<li>
<p>Asynchronous alternatives exist</p>
</li>
<li>
<p>Timeouts can be implemented, albeit non-portably, in device drivers</p>
</li>
<li>
<p>There was a strong desire not to introduce modifications to POSIX.1 interfaces</p>
</li>
</ul>
<p>The working group ultimately rejected <a href="../functions/pthread_join.html"><i>pthread_join</i>()</a> since both that
interface and a timed variant of that interface are non-minimal and may be implemented as a function. See below for a library
implementation of <a href="../functions/pthread_join.html"><i>pthread_join</i>()</a>.</p>
<p>Thus, there was a consensus among the working group members to add timeouts to 4 of the remaining 5 functions (the timeout for
<a href="../functions/aio_suspend.html"><i>aio_suspend</i>()</a> was ultimately added directly to POSIX.1b, while the others were
added by POSIX.1d). However, <a href="../functions/pthread_mutex_lock.html"><i>pthread_mutex_lock</i>()</a> remained
contentious.</p>
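<p>The style of the adopted interfaces is illustrated by the following minimal sketch using <i>sem_timedwait</i>(); the
five-second deadline is illustrative only:</p>
<blockquote>
<pre>
<tt>#include &lt;semaphore.h&gt;
#include &lt;time.h&gt;
#include &lt;errno.h&gt;
<br>
/* Wait for the semaphore, but give up at an absolute deadline
   five seconds from now. */
int wait_with_deadline(sem_t *sem)
{
    struct timespec abstime;
<br>
    clock_gettime(CLOCK_REALTIME, &amp;abstime);
    abstime.tv_sec += 5;
<br>
    if (sem_timedwait(sem, &amp;abstime) == -1)
        return errno;   /* ETIMEDOUT if the deadline passed */
    return 0;
}
</tt>
</pre>
</blockquote>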
<p>Many feel that <a href="../functions/pthread_mutex_lock.html"><i>pthread_mutex_lock</i>()</a> falls into the same class as the
other functions; that is, it is desirable to time out a mutex lock because a mutex may fail to be unlocked due to errant or
corrupted code in a critical section (looping or branching outside of the unlock code), and therefore is equally in need of a
reliable, simple, and efficient timeout. In fact, since mutexes are intended to guard small critical sections, most <a href=
"../functions/pthread_mutex_lock.html"><i>pthread_mutex_lock</i>()</a> calls would be expected to obtain the lock without blocking
or utilizing any kernel service, even in implementations of threads with global contention scope; the timeout alternative need
only be considered after it is determined that the thread must block.</p>
<p>Those opposed to timing out mutexes feel that the very simplicity of the mutex is compromised by adding a timeout semantic, and
that to do so is senseless. They claim that if a timed mutex is really deemed useful by a particular application, then it can be
constructed from the facilities already in POSIX.1b and POSIX.1c. The following two C-language library implementations of mutex
locking with timeout represent the solutions offered (in both implementations, the timeout parameter is specified as absolute time,
not relative time as in the proposed POSIX.1c interfaces).</p>
</li>
<li>
<p>Spinlock Implementation</p>
<blockquote>
<pre>
<tt>#include &lt;pthread.h&gt;
#include &lt;time.h&gt;
#include &lt;errno.h&gt;
<br>
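/* timespec_cmp() is assumed to be an application-supplied helper
   returning a positive value, zero, or a negative value as its first
   argument is later than, equal to, or earlier than its second;
   pthread_yield() is a draft-POSIX.1c name (compare sched_yield()). */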
int pthread_mutex_timedlock(pthread_mutex_t *mutex,
const struct timespec *timeout)
{
struct timespec timenow;
<br>
while (pthread_mutex_trylock(mutex) == EBUSY)
{
clock_gettime(CLOCK_REALTIME, &amp;timenow);
if (timespec_cmp(&amp;timenow,timeout) &gt;= 0)
{
return ETIMEDOUT;
}
pthread_yield();
}
return 0;
}
</tt>
</pre>
</blockquote>
<p>The Spinlock implementation is generally unsuitable for any application using priority-based thread scheduling policies such as
SCHED_FIFO or SCHED_RR, since the mutex could currently be held by a thread of lower priority within the same allocation domain,
but since the waiting thread never blocks, only threads of equal or higher priority will ever run, and the mutex cannot be
unlocked. Setting priority inheritance or priority ceiling protocol on the mutex does not solve this problem, since the priority of
a mutex owning thread is only boosted if higher priority threads are blocked waiting for the mutex; clearly not the case for this
spinlock.</p>
</li>
<li>
<p>Condition Wait Implementation</p>
<blockquote>
<pre>
<tt>#include &lt;pthread.h&gt;
#include &lt;time.h&gt;
#include &lt;errno.h&gt;
<br>
#define TRUE  1
#define FALSE 0
<br>
struct timed_mutex
{
int locked;
pthread_mutex_t mutex;
pthread_cond_t cond;
};
typedef struct timed_mutex timed_mutex_t;
<br>
int timed_mutex_lock(timed_mutex_t *tm,
const struct timespec *timeout)
{
int timedout=FALSE;
int error_status;
<br>
pthread_mutex_lock(&amp;tm-&gt;mutex);
<br>
while (tm-&gt;locked &amp;&amp; !timedout)
{
if ((error_status=pthread_cond_timedwait(&amp;tm-&gt;cond,
&amp;tm-&gt;mutex,
timeout))!=0)
{
if (error_status==ETIMEDOUT) timedout = TRUE;
}
}
<br>
if(timedout)
{
pthread_mutex_unlock(&amp;tm-&gt;mutex);
return ETIMEDOUT;
}
else
{
tm-&gt;locked = TRUE;
pthread_mutex_unlock(&amp;tm-&gt;mutex);
return 0;
}
}
<br>
void timed_mutex_unlock(timed_mutex_t *tm)
{
pthread_mutex_lock(&amp;tm-&gt;mutex); /* for case assignment not atomic */
tm-&gt;locked = FALSE;
pthread_mutex_unlock(&amp;tm-&gt;mutex);
pthread_cond_signal(&amp;tm-&gt;cond);
}
</tt>
</pre>
</blockquote>
<p>The Condition Wait implementation effectively substitutes the <a href=
"../functions/pthread_cond_timedwait.html"><i>pthread_cond_timedwait</i>()</a> function (which is currently timed out) for the
desired <a href="../functions/pthread_mutex_timedlock.html"><i>pthread_mutex_timedlock</i>()</a>. Since waits on condition
variables currently do not include protocols which avoid priority inversion, this method is generally unsuitable for realtime
applications because it does not provide the same priority inversion protection as the untimed <a href=
"../functions/pthread_mutex_lock.html"><i>pthread_mutex_lock</i>()</a>. Also, for any given implementations of the current mutex
and condition variable primitives, this library implementation has a performance cost at least 2.5 times that of the untimed <a
href="../functions/pthread_mutex_lock.html"><i>pthread_mutex_lock</i>()</a> even in the case where the timed mutex is readily
locked without blocking. Even in uniprocessors or where assignment is
atomic, at least an additional <a href="../functions/pthread_cond_signal.html"><i>pthread_cond_signal</i>()</a> is required. <a
href="../functions/pthread_mutex_timedlock.html"><i>pthread_mutex_timedlock</i>()</a> could be implemented at effectively no
performance penalty in this case because the timeout parameters need only be considered after it is determined that the mutex
cannot be locked immediately.</p>
<p>Thus it has not yet been shown that the full semantics of mutex locking with timeout can be efficiently and reliably achieved
using existing interfaces. Even if the existence of an acceptable library implementation were proven, it is difficult to justify
why the interface itself should not be made portable, especially considering approval for the other four timeouts.<br>
</p>
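<p>As standardized, the interface follows the absolute-timeout style of the library implementations above; a minimal usage sketch
(with an illustrative two-second deadline) is:</p>
<blockquote>
<pre>
<tt>#include &lt;pthread.h&gt;
#include &lt;time.h&gt;
#include &lt;errno.h&gt;
<br>
/* Attempt to acquire the mutex, abandoning the attempt at an
   absolute deadline two seconds from now. */
int lock_with_deadline(pthread_mutex_t *mutex)
{
    struct timespec abstime;
<br>
    clock_gettime(CLOCK_REALTIME, &amp;abstime);
    abstime.tv_sec += 2;
<br>
    /* Returns 0 on success or ETIMEDOUT on timeout. */
    return pthread_mutex_timedlock(mutex, &amp;abstime);
}
</tt>
</pre>
</blockquote>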
</li>
<li>
<p>Rationale for Library Implementation of <i>pthread_timedjoin</i>()</p>
<p>Library implementation of <i>pthread_timedjoin</i>():</p>
<blockquote>
<pre>
<tt>/*
* Construct a thread variety entirely from existing functions
* with which a join can be done, allowing the join to time out.
*/
#include &lt;pthread.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;time.h&gt;
#include &lt;errno.h&gt;
<br>
#define TRUE  1
#define FALSE 0
<br>
void timed_thread_exit(void *status); /* used before its definition */
<br>
struct timed_thread {
pthread_t t;
pthread_mutex_t m;
int exiting;
pthread_cond_t exit_c;
void *(*start_routine)(void *arg);
void *arg;
void *status;
};
<br>
typedef struct timed_thread *timed_thread_t;
static pthread_key_t timed_thread_key;
static pthread_once_t timed_thread_once = PTHREAD_ONCE_INIT;
<br>
static void timed_thread_init()
{
pthread_key_create(&amp;timed_thread_key, NULL);
}
<br>
static void *timed_thread_start_routine(void *args)
<br>
/*
* Routine to establish thread-specific data value and run the actual
* thread start routine which was supplied to timed_thread_create().
*/
{
timed_thread_t tt = (timed_thread_t) args;
<br>
pthread_once(&amp;timed_thread_once, timed_thread_init);
pthread_setspecific(timed_thread_key, (void *)tt);
timed_thread_exit((tt-&gt;start_routine)(tt-&gt;arg));
}
<br>
int timed_thread_create(timed_thread_t *ttp, const pthread_attr_t *attr,
void *(*start_routine)(void *), void *arg)
<br>
/*
* Allocate a thread which can be used with timed_thread_join().
*/
{
timed_thread_t tt;
int result;
<br>
tt = (timed_thread_t) malloc(sizeof(struct timed_thread));
if (tt == NULL)
return ENOMEM;
pthread_mutex_init(&amp;tt-&gt;m,NULL);
tt-&gt;exiting = FALSE;
pthread_cond_init(&amp;tt-&gt;exit_c,NULL);
tt-&gt;start_routine = start_routine;
tt-&gt;arg = arg;
tt-&gt;status = NULL;
<br>
if ((result = pthread_create(&amp;tt-&gt;t, attr,
timed_thread_start_routine, (void *)tt)) != 0) {
free(tt);
return result;
}
<br>
pthread_detach(tt-&gt;t);
*ttp = tt; /* return the descriptor to the caller */
return 0;
}
<br>
int timed_thread_join(timed_thread_t tt,
struct timespec *timeout,
void **status)
{
int result;
<br>
pthread_mutex_lock(&amp;tt-&gt;m);
result = 0;
/*
* Wait until the thread announces that it is exiting,
* or until timeout.
*/
while (result == 0 &amp;&amp; ! tt-&gt;exiting) {
result = pthread_cond_timedwait(&amp;tt-&gt;exit_c, &amp;tt-&gt;m, timeout);
}
pthread_mutex_unlock(&amp;tt-&gt;m);
if (result == 0 &amp;&amp; tt-&gt;exiting) {
*status = tt-&gt;status;
free((void *)tt);
return result;
}
return result;
}
<br>
void timed_thread_exit(void *status)
{
timed_thread_t tt;
void *specific;
<br>
if ((specific=pthread_getspecific(timed_thread_key)) == NULL){
/*
* Handle cases which won't happen with correct usage.
*/
pthread_exit(NULL);
}
tt = (timed_thread_t) specific;
pthread_mutex_lock(&amp;tt-&gt;m);
/*
* Tell a joiner that we're exiting.
*/
tt-&gt;status = status;
tt-&gt;exiting = TRUE;
pthread_cond_signal(&amp;tt-&gt;exit_c);
pthread_mutex_unlock(&amp;tt-&gt;m);
/*
* Call pthread_exit() to call destructors and really
* exit the thread.
*/
pthread_exit(NULL);
}
</tt>
</pre>
</blockquote>
<p>The <a href="../functions/pthread_join.html"><i>pthread_join</i>()</a> C-language example shown above demonstrates that it is
possible, using existing pthread facilities, to construct a variety of thread that can be joined, but with the join operation
able to time out. It does this by using a <a href=
"../functions/pthread_cond_timedwait.html"><i>pthread_cond_timedwait</i>()</a> to wait for the thread to exit. A
<b>timed_thread_t</b> descriptor structure is used to pass parameters from the creating thread to the created thread, and from the
exiting thread to the joining thread. This implementation is roughly equivalent to what a normal <a href=
"../functions/pthread_join.html"><i>pthread_join</i>()</a> implementation would do, with the single change being that <a href=
"../functions/pthread_cond_timedwait.html"><i>pthread_cond_timedwait</i>()</a> is used in place of a simple <a href=
"../functions/pthread_cond_wait.html"><i>pthread_cond_wait</i>()</a>.</p>
<p>Since it is possible to implement such a facility entirely from existing pthread interfaces, and with roughly equal efficiency
and complexity to an implementation which would be provided directly by a pthreads implementation, it was the consensus of the
working group members that any <i>pthread_timedjoin</i>() facility would be unnecessary, and should not be provided.</p>
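<p>For completeness, a caller of the example facility above might look as follows (error handling omitted); the names are those of
the example, not standard interfaces, and <i>worker</i>() is a hypothetical start routine:</p>
<blockquote>
<pre>
<tt>timed_thread_t tt;
struct timespec deadline;
void *status;
<br>
timed_thread_create(&amp;tt, NULL, worker, NULL);
clock_gettime(CLOCK_REALTIME, &amp;deadline);
deadline.tv_sec += 10;    /* allow the thread ten seconds */
if (timed_thread_join(tt, &amp;deadline, &amp;status) == ETIMEDOUT)
    ...recovery action
</tt>
</pre>
</blockquote>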
</li>
<li>
<p>Form of the Timeout Interfaces</p>
<p>The working group considered a number of alternative ways to add timeouts to blocking services. At first, a system interface
which would specify a one-shot or persistent timeout to be applied to subsequent blocking services invoked by the calling process
or thread was considered because it allowed all blocking services to be timed out in a uniform manner with a single additional
interface; this was rather quickly rejected because it could easily result in the wrong services being timed out.</p>
<p>It was suggested that a timeout value might be specified as an attribute of the object (semaphore, mutex, message queue, and so
on), but there was no consensus on this, either on a case-by-case basis or for all timeouts.</p>
<p>Looking at the two existing timeouts for blocking services indicates that the working group members favor a separate interface
for the timed version of a function. However, <a href=
"../functions/pthread_cond_timedwait.html"><i>pthread_cond_timedwait</i>()</a> utilizes an absolute timeout value while <a href=
"../functions/sigtimedwait.html"><i>sigtimedwait</i>()</a> uses a relative timeout value. The working group members agreed that
relative timeout values are appropriate where the timeout mechanism's primary use was to deal with an unexpected or error
situation, but they are inappropriate when the timeout must expire at a particular time, or before a specific deadline. For the
timeouts being introduced in IEEE&nbsp;Std&nbsp;1003.1-2001, the working group considered allowing both relative and absolute
timeouts as is done with POSIX.1b timers, but ultimately favored the simpler absolute timeout form.</p>
<p>An absolute time measure can be easily implemented on top of an interface that specifies relative time, by reading the clock,
calculating the difference between the current time and the desired wake-up time, and issuing a relative timeout call. But there is
a race condition with this approach because the thread could be preempted after reading the clock, but before making the timed-out
call; in this case, the thread would be awakened later than it should and, thus, if the wake-up time represented a deadline, it
would miss it.</p>
<p>There is also a race condition when trying to build a relative timeout on top of an interface that specifies absolute timeouts.
In this case, the clock would have to be read to calculate the absolute wake-up time as the sum of the current time plus the
relative timeout interval. In this case, if the thread is preempted after reading the clock but before making the timed-out call,
the thread would be awakened earlier than desired.</p>
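<p>Either conversion is mechanically simple. A minimal sketch of the latter computation, forming the absolute wake-up time as the
sum of the current time and a relative interval (with normalization of the nanosecond field), is:</p>
<blockquote>
<pre>
<tt>#include &lt;time.h&gt;
<br>
/* Convert a relative timeout into the absolute form expected by
   the timed services. */
void make_abstime(const struct timespec *rel, struct timespec *abstime)
{
    clock_gettime(CLOCK_REALTIME, abstime);
    abstime-&gt;tv_sec += rel-&gt;tv_sec;
    abstime-&gt;tv_nsec += rel-&gt;tv_nsec;
    if (abstime-&gt;tv_nsec &gt;= 1000000000L) {
        abstime-&gt;tv_nsec -= 1000000000L;
        abstime-&gt;tv_sec++;
    }
}
</tt>
</pre>
</blockquote>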
<p>But the race condition with the absolute timeouts interface is not as bad as the one that happens with the relative timeout
interface, because there are simple workarounds. For the absolute timeouts interface, if the timing requirement is a deadline, the
deadline can still be met because the thread woke up earlier than the deadline. If the timeout is just used as an error recovery
mechanism, the precision of timing is not really important. If the timing requirement is that between actions A and B a minimum
interval of time must elapse, the absolute timeout interface can be safely used by reading the clock after action A has been
started. It could be argued that, since the call with the absolute timeout is atomic from the application point of view, it is not
possible to read the clock after action A, if this action is part of the timed-out call. But looking at the nature of the calls for
which timeouts are specified (locking a mutex, waiting for a semaphore, waiting for a message, or waiting until there is space in a
message queue), the timeouts that an application would build on these actions would not be triggered by these actions themselves,
but by some other external action. For example, if waiting for a message to arrive at a message queue, and waiting for at least 20
milliseconds, this time interval would start to be counted from some event that would trigger both the action that produces the
message, as well as the action that waits for the message to arrive, and not by the wait-for-message operation itself. In this
case, the workaround proposed above could be used.</p>
<p>For these reasons, the absolute timeout is preferred over the relative timeout interface.</p>
</li>
</ul>
<h4><a name="tag_03_02_09"></a>Threads</h4>
<p>Threads will normally be more expensive than subroutines (or functions, routines, and so on) if specialized hardware support is
not provided. Nevertheless, threads should be sufficiently efficient to encourage their use as a medium- to fine-grained structuring
mechanism for parallelism in an application. Structuring an application using threads then allows it to take immediate advantage of
any underlying parallelism available in the host environment. This means implementors are encouraged to optimize for fast execution
at the possible expense of efficient utilization of storage. For example, a common thread creation technique is to cache
appropriate thread data structures. That is, rather than releasing system resources, the implementation retains these resources and
reuses them when the program next asks to create a new thread. If this reuse of thread resources is to be possible, there has to be
very little unique state associated with each thread, because any such state has to be reset when the thread is reused.</p>
<h5><a name="tag_03_02_09_01"></a>Thread Creation Attributes</h5>
<p>Attributes objects are provided for threads, mutexes, and condition variables as a mechanism to support probable future
standardization in these areas without requiring that the interface itself be changed.</p>
<p>Attributes objects provide clean isolation of the configurable aspects of threads. For example, &quot;stack size&quot; is an important
attribute of a thread, but it cannot be expressed portably. When porting a threaded program, stack sizes often need to be adjusted.
The use of attributes objects can help by allowing the changes to be isolated in a single place, rather than being spread across
every instance of thread creation.</p>
<p>Attributes objects can be used to set up <i>classes</i> of threads with similar attributes; for example, &quot;threads with large
stacks and high priority&quot; or &quot;threads with minimal stacks&quot;. These classes can be defined in a single place and then referenced
wherever threads need to be created. Changes to &quot;class&quot; decisions become straightforward, and detailed analysis of each <a href=
"../functions/pthread_create.html"><i>pthread_create</i>()</a> call is not required.</p>
<p>The attributes objects are defined as opaque types as an aid to extensibility. If these objects had been specified as
structures, adding new attributes would force recompilation of all multi-threaded programs when the attributes objects are
extended; this might not be possible if different program components were supplied by different vendors.</p>
<p>Additionally, opaque attributes objects present opportunities for improving performance. Argument validity can be checked once
when attributes are set, rather than each time a thread is created. Implementations will often need to cache kernel objects that
are expensive to create. Opaque attributes objects provide an efficient mechanism to detect when cached objects become invalid due
to attribute changes.</p>
<p>Because assignment is not necessarily defined on a given opaque type, implementation-defined default values cannot be defined in
a portable way. The solution to this problem is to allow attribute objects to be initialized dynamically by attributes object
initialization functions, so that default values can be supplied automatically by the implementation.</p>
<p>The following proposal was provided as a suggested alternative to the supplied attributes:</p>
<ol>
<li>
<p>Maintain the style of passing a parameter formed by the bitwise-inclusive OR of flags to the initialization routines ( <a href=
"../functions/pthread_create.html"><i>pthread_create</i>()</a>, <a href=
"../functions/pthread_mutex_init.html"><i>pthread_mutex_init</i>()</a>, <a href=
"../functions/pthread_cond_init.html"><i>pthread_cond_init</i>()</a>). The parameter containing the flags should be an opaque type
for extensibility. If no flags are set in the parameter, then the objects are created with default characteristics. An
implementation may specify implementation-defined flag values and associated behavior.</p>
</li>
<li>
<p>If further specialization of mutexes and condition variables is necessary, implementations may specify additional procedures
that operate on the <b>pthread_mutex_t</b> and <b>pthread_cond_t</b> objects (instead of on attributes objects).</p>
</li>
</ol>
<p>The difficulties with this solution are:</p>
<ol>
<li>
<p>A bitmask is not opaque if bits have to be set into bit-vector attributes objects using explicitly-coded bitwise-inclusive OR
operations. If the set of options exceeds an <b>int</b>, application programmers need to know the location of each bit. If bits are
set or read by encapsulation (that is, <i>get*</i>() or
<i>set*</i>() functions), then the bitmask is merely an implementation of attributes objects as
currently defined and should not be exposed to the programmer.</p>
</li>
<li>
<p>Many attributes are not Boolean or very small integral values. For example, scheduling policy may be placed in 3 bits or 4 bits,
but priority requires 5 bits or more, thereby taking up at least 8 bits out of a possible 16 bits on machines with 16-bit integers.
Because of this, the bitmask can only reasonably control whether particular attributes are set or not, and it cannot serve as the
repository of the value itself. The value needs to be specified as a function parameter (which is non-extensible), or by setting a
structure field (which is non-opaque), or by <i>get*</i>() and
<i>set*</i>() functions (making the bitmask a redundant addition to the attributes objects).</p>
</li>
</ol>
<p>Stack size is defined as an optional attribute because the very notion of a stack is inherently machine-dependent. Some
implementations may not be able to change the size of the stack, for example, and others may not need to because stack pages may be
discontiguous and can be allocated and released on demand.</p>
<p>The attribute mechanism has been designed in large measure for extensibility. Future extensions to the attribute mechanism or to
any attributes object defined in IEEE&nbsp;Std&nbsp;1003.1-2001 have to be done with care so as not to affect
binary-compatibility.</p>
<p>Attribute objects, even if allocated by means of dynamic allocation functions such as <a href=
"../functions/malloc.html"><i>malloc</i>()</a>, may have their size fixed at compile time. This means, for example, a <a href=
"../functions/pthread_create.html"><i>pthread_create</i>()</a> in an implementation with extensions to the <b>pthread_attr_t</b>
cannot look beyond the area that the binary application assumes is valid. This suggests that implementations should maintain a size
field in the attributes object, as well as possibly version information, if extensions in different directions (possibly by
different vendors) are to be accommodated.</p>
<h5><a name="tag_03_02_09_02"></a>Thread Implementation Models</h5>
<p>There are various thread implementation models. At one end of the spectrum is the &quot;library-thread model&quot;. In such a model, the
threads of a process are not visible to the operating system kernel, and the threads are not kernel-scheduled entities. The process
is the only kernel-scheduled entity. The process is scheduled onto the processor by the kernel according to the scheduling
attributes of the process. The threads are scheduled onto the single kernel-scheduled entity (the process) by the runtime library
according to the scheduling attributes of the threads. A problem with this model is that it constrains concurrency. Since there is
only one kernel-scheduled entity (namely, the process), only one thread per process can execute at a time. If the thread that is
executing blocks on I/O, then the whole process blocks.</p>
<p>At the other end of the spectrum is the &quot;kernel-thread model&quot;. In this model, all threads are visible to the operating system
kernel. Thus, all threads are kernel-scheduled entities, and all threads can concurrently execute. The threads are scheduled onto
processors by the kernel according to the scheduling attributes of the threads. The drawback to this model is that the creation and
management of the threads entails operating system calls, as opposed to subroutine calls, which makes kernel threads heavier weight
than library threads.</p>
<p>Hybrids of these two models are common. A hybrid model offers the speed of library threads and the concurrency of kernel
threads. In hybrid models, a process has some (relatively small) number of kernel scheduled entities associated with it. It also
has a potentially much larger number of library threads associated with it. Some library threads may be bound to kernel-scheduled
entities, while the other library threads are multiplexed onto the remaining kernel-scheduled entities. There are two levels of
thread scheduling:</p>
<ol>
<li>
<p>The runtime library manages the scheduling of (unbound) library threads onto kernel-scheduled entities.</p>
</li>
<li>
<p>The kernel manages the scheduling of kernel-scheduled entities onto processors.</p>
</li>
</ol>
<p>For this reason, a hybrid model is referred to as a two-level threads scheduling model. In this model, the process can have
multiple concurrently executing threads; specifically, it can have as many concurrently executing threads as it has
kernel-scheduled entities.</p>
<h5><a name="tag_03_02_09_03"></a>Thread-Specific Data</h5>
<p>Many applications require that a certain amount of context be maintained on a per-thread basis across procedure calls. A common
example is a multi-threaded library routine that allocates resources from a common pool and maintains an active resource list for
each thread. The thread-specific data interface provided to meet these needs may be viewed as a two-dimensional array of values
with keys serving as the row index and thread IDs as the column index (although the implementation need not work this way).</p>
<ul>
<li>
<p>Models</p>
<p>Three possible thread-specific data models were considered:</p>
<ol>
<li>
<p>No Explicit Support</p>
<p>A standard thread-specific data interface is not strictly necessary to support applications that require per-thread context. One
could, for example, provide a hash function that converted a <b>pthread_t</b> into an integer value that could then be used to
index into a global array of per-thread data pointers. This hash function, in conjunction with <a href=
"../functions/pthread_self.html"><i>pthread_self</i>()</a>, would be all the interface required to support a mechanism of this
sort. Unfortunately, this technique is cumbersome. It can lead to duplicated code as each set of cooperating modules implements
its own per-thread data management schemes.</p>
</li>
<li>
<p>Single (<b>void</b> *) Pointer</p>
<p>Another technique would be to provide a single word of per-thread storage and a pair of functions to fetch and store the value
of this word. The word could then hold a pointer to a block of per-thread memory. The allocation, partitioning, and general use of
this memory would be entirely up to the application. Although this method is not as problematic as technique 1, it suffers from
interoperability problems. For example, all modules using the per-thread pointer would have to agree on a common usage
protocol.</p>
</li>
<li>
<p>Key/Value Mechanism</p>
<p>This method associates an opaque key (for example, stored in a variable of type <b>pthread_key_t</b>) with each per-thread
datum. These keys play the role of identifiers for per-thread data. This technique is the most generic and avoids the problems
noted above, albeit at the cost of some complexity.</p>
</li>
</ol>
<p>The primary advantage of the third model is its information hiding properties. Modules using this model are free to create and
use their own key(s) independent of all other such usage, whereas the other models require that all modules that use
thread-specific context explicitly cooperate with all other such modules. The data-independence provided by the third model is
worth the additional interface.</p>
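<p>A minimal sketch of the key/value mechanism as used by a single module follows; the module-private key is created once via <a
href="../functions/pthread_once.html"><i>pthread_once</i>()</a>, and each thread's datum is released by a destructor:</p>
<blockquote>
<pre>
<tt>#include &lt;pthread.h&gt;
#include &lt;stdlib.h&gt;
<br>
static pthread_key_t list_key;
static pthread_once_t list_once = PTHREAD_ONCE_INIT;
<br>
static void list_key_create(void)
{
    /* free() serves as the destructor for each thread's datum. */
    pthread_key_create(&amp;list_key, free);
}
<br>
/* Return this thread's private resource list, creating it on
   first use; the fixed size is illustrative only. */
void *get_thread_list(void)
{
    void *list;
<br>
    pthread_once(&amp;list_once, list_key_create);
    if ((list = pthread_getspecific(list_key)) == NULL) {
        list = malloc(128);
        pthread_setspecific(list_key, list);
    }
    return list;
}
</tt>
</pre>
</blockquote>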
</li>
<li>
<p>Requirements</p>
<p>It is important that it be possible to implement the thread-specific data interface without the use of thread private memory. To
do otherwise would increase the weight of each thread, thereby limiting the range of applications for which the threads interfaces
provided by IEEE&nbsp;Std&nbsp;1003.1-2001 are appropriate.</p>
<p>The values that one binds to the key via <a href="../functions/pthread_setspecific.html"><i>pthread_setspecific</i>()</a> may,
in fact, be pointers to shared storage locations available to all threads. It is only the key/value bindings that are maintained on
a per-thread basis, and these can be kept in any portion of the address space that is reserved for use by the calling thread (for
example, on the stack). Thus, no per-thread MMU state is required to implement the interface. On the other hand, there is nothing
in the interface specification to preclude the use of a per-thread MMU state if it is available (for example, the key values
returned by <a href="../functions/pthread_key_create.html"><i>pthread_key_create</i>()</a> could be thread private memory
addresses).</p>
</li>
<li>
<p>Standardization Issues</p>
<p>Thread-specific data is a requirement for a usable thread interface. The binding described in this section provides a portable
thread-specific data mechanism for languages that do not directly support a thread-specific storage class. A binding to
IEEE&nbsp;Std&nbsp;1003.1-2001 for a language that does include such a storage class need not provide this specific interface.</p>
<p>If a language were to include the notion of thread-specific storage, it would be desirable (but <i>not</i> required) to provide
an implementation of the pthreads thread-specific data interface based on the language feature. For example, assume that a compiler
for a C-like language supports a <i>private</i> storage class that provides thread-specific storage. Something similar to the
following macros might be used to effect a compatible implementation:</p>
<blockquote>
<pre>
<tt>#define pthread_key_t private void *
#define pthread_key_create(key) /* no-op */
#define pthread_setspecific(key,value) (key)=(value)
#define pthread_getspecific(key) (key)
</tt>
</pre>
</blockquote>
<basefont size="2">
<dl>
<dt><b>Note:</b></dt>
<dd>For the sake of clarity, this example ignores destructor functions. A correct implementation would have to support them.</dd>
</dl>
<basefont size="3"></li>
</ul>
<h5><a name="tag_03_02_09_04"></a>Barriers</h5>
<ul>
<li>
<p>Background</p>
<p>Barriers are typically used in parallel DO/FOR loops to ensure that all threads have reached a particular stage in a parallel
computation before allowing any to proceed to the next stage. Highly efficient implementation is possible on machines which support
a &quot;Fetch and Add&quot; operation as described in the referenced Almasi and Gottlieb (1989).</p>
<p>The use of return value PTHREAD_BARRIER_SERIAL_THREAD is shown in the following example:</p>
<blockquote>
<pre>
<tt>if ( (status=pthread_barrier_wait(&amp;barrier)) ==
PTHREAD_BARRIER_SERIAL_THREAD) {
...serial section
}
else if (status != 0) {
...error processing
}
status=pthread_barrier_wait(&amp;barrier);
...
</tt>
</pre>
</blockquote>
<p>This behavior allows a serial section of code to be executed by one thread as soon as all threads reach the first barrier. The
second barrier prevents the other threads from proceeding until the serial section being executed by the one thread has
completed.</p>
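<p>The example assumes that <i>barrier</i> was previously initialized for the number of participating threads; a minimal sketch of
that setup (with an illustrative count) is:</p>
<blockquote>
<pre>
<tt>#include &lt;pthread.h&gt;
<br>
#define NTHREADS 8    /* illustrative participant count */
<br>
pthread_barrier_t barrier;
<br>
void setup(void)
{
    /* The count fixes how many threads must call
       pthread_barrier_wait() before any is released. */
    pthread_barrier_init(&amp;barrier, NULL, NTHREADS);
}
</tt>
</pre>
</blockquote>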
<p>Although barriers can be implemented with mutexes and condition variables, the referenced Almasi and Gottlieb (1989) provides
ample illustration that such implementations are significantly less efficient than is possible. While the relative efficiency of
barriers may well vary by implementation, it is important that they be recognized in IEEE&nbsp;Std&nbsp;1003.1-2001 to
facilitate applications portability while providing the necessary freedom to implementors.</p>
</li>
<li>
<p>Lack of Timeout Feature</p>
<p>Alternate versions of most blocking routines have been provided to support watchdog timeouts. No alternate interface of this
sort has been provided for barrier waits for the following reasons:</p>
<ul>
<li>
<p>Multiple threads may use different timeout values, some of which may be indefinite. It is not clear which threads should break
through the barrier with a timeout error if and when these timeouts expire.</p>
</li>
<li>
<p>The barrier may become unusable once a thread breaks out of a <a href=
"../functions/pthread_barrier_wait.html"><i>pthread_barrier_wait</i>()</a> with a timeout error. There is, in general, no way to
guarantee the consistency of a barrier's internal data structures once a thread has timed out of a <a href=
"../functions/pthread_barrier_wait.html"><i>pthread_barrier_wait</i>()</a>. Even the inclusion of a special barrier
reinitialization function would not help much since it is not clear how this function would affect the behavior of threads that
reach the barrier between the original timeout and the call to the reinitialization function.</p>
</li>
</ul>
</li>
</ul>
<h5><a name="tag_03_02_09_05"></a>Spin Locks</h5>
<ul>
<li>
<p>Background</p>
<p>Spin locks represent an extremely low-level synchronization mechanism suitable primarily for use on shared memory
multi-processors. A spin lock is typically an atomically modified Boolean value that is set to one when the lock is held and to zero when
the lock is freed.</p>
<p>When a caller requests a spin lock that is already held, it typically spins in a loop testing whether the lock has become
available. Such spinning wastes processor cycles so the lock should only be held for short durations and not across sleep/block
operations. Callers should unlock spin locks before calling sleep operations.</p>
<p>Spin locks are available on a variety of systems. The functions included in IEEE&nbsp;Std&nbsp;1003.1-2001 are an attempt to
standardize that existing practice.</p>
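<p>A minimal usage sketch, protecting a counter shared by threads of one process, follows:</p>
<blockquote>
<pre>
<tt>#include &lt;pthread.h&gt;
<br>
pthread_spinlock_t lock;
long counter;
<br>
void setup(void)
{
    pthread_spin_init(&amp;lock, PTHREAD_PROCESS_PRIVATE);
}
<br>
void increment(void)
{
    /* Hold the lock only across a short, non-blocking operation. */
    pthread_spin_lock(&amp;lock);
    counter++;
    pthread_spin_unlock(&amp;lock);
}
</tt>
</pre>
</blockquote>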
</li>
<li>
<p>Lack of Timeout Feature</p>
<p>Alternate versions of most blocking routines have been provided to support watchdog timeouts. No alternate interface of this
sort has been provided for spin locks for the following reasons:</p>
<ul>
<li>
<p>It is impossible to determine appropriate timeout intervals for spin locks in a portable manner. The amount of time one can
expect to spend spin-waiting is inversely proportional to the degree of parallelism provided by the system.</p>
<p>It can vary from a few cycles when each competing thread is running on its own processor, to an indefinite amount of time when
all threads are multiplexed on a single processor (which is why spin locking is not advisable on uniprocessors).</p>
</li>
<li>
<p>When used properly, the amount of time the calling thread spends waiting on a spin lock should be considerably less than the
time required to set up a corresponding watchdog timer. Since the primary purpose of spin locks is to provide a low-overhead
synchronization mechanism for multi-processors, the overhead of a timeout mechanism was deemed unacceptable.</p>
</li>
</ul>
<p>It was also suggested that an additional <i>count</i> argument be provided (on the <a href=
"../functions/pthread_spin_lock.html"><i>pthread_spin_lock</i>()</a> call) in <i>lieu</i> of a true timeout so that a spin lock
call could fail gracefully if it was unable to apply the lock after <i>count</i> attempts. This idea was rejected because it is not
existing practice. Furthermore, the same effect can be obtained with <a href=
"../functions/pthread_spin_trylock.html"><i>pthread_spin_trylock</i>()</a>, as illustrated below:</p>
<blockquote>
<pre>
<tt>int n = MAX_SPIN;
<br>
while ( --n &gt;= 0 )
{
if ( !pthread_spin_trylock(...) )
break;
}
if ( n &gt;= 0 )
{
/* Successfully acquired the lock */
}
else
{
/* Unable to acquire the lock */
}
</tt>
</pre>
</blockquote>
</li>
<li>
<p><i>process-shared</i> Attribute</p>
<p>The initialization functions associated with most POSIX synchronization objects (for example, mutexes, barriers, and read-write
locks) take an attributes object with a <i>process-shared</i> attribute that specifies whether or not the object is to be shared
across processes. In the draft corresponding to the first balloting round, two separate initialization functions are provided for
spin locks, however: one for spin locks that were to be shared across processes ( <i>spin_init</i>()), and one for locks that were
only used by multiple threads within a single process ( <a href=
"../functions/pthread_spin_init.html"><i>pthread_spin_init</i>()</a>). This was done so as to keep the overhead associated with
spin waiting to an absolute minimum. However, the balloting group requested that, since the overhead associated to a bit check was
small, spin locks should be consistent with the rest of the synchronization primitives, and thus the <i>process-shared</i>
attribute was introduced for spin locks.</p>
</li>
<li>
<p>Spin Locks <i>versus</i> Mutexes</p>
<p>It has been suggested that mutexes are an adequate synchronization mechanism and spin locks are not necessary. Locking
mechanisms typically must trade off the processor resources consumed while setting up to block the thread and the processor
resources consumed by the thread while it is blocked. Spin locks require very little resources to set up the blocking of a thread.
Existing practice is to simply loop, repeating the atomic locking operation until the lock is available. While the resources
consumed to set up blocking of the thread are low, the thread continues to consume processor resources while it is waiting.</p>
<p>On the other hand, mutexes may be implemented such that the processor resources consumed to block the thread are large relative
to a spin lock. After detecting that the mutex lock is not available, the thread must alter its scheduling state, add itself to a
set of waiting threads, and, when the lock becomes available again, undo all of this before taking over ownership of the mutex.
However, while a thread is blocked by a mutex, no processor resources are consumed.</p>
<p>Therefore, spin locks and mutexes may be implemented to have different characteristics. Spin locks may have lower overall
overhead for very short-term blocking, and mutexes may have lower overall overhead when a thread will be blocked for longer periods
of time. The presence of both interfaces allows implementations with these two different characteristics, both of which may be
useful to a particular application.</p>
<p>It has also been suggested that applications can build their own spin locks from the <a href=
"../functions/pthread_mutex_trylock.html"><i>pthread_mutex_trylock</i>()</a> function:</p>
<blockquote>
<pre>
<tt>while (pthread_mutex_trylock(&amp;mutex));
</tt>
</pre>
</blockquote>
<p>The apparent simplicity of this construct is somewhat deceiving, however. While the actual wait is quite efficient, various
guarantees on the integrity of mutex objects (for example, priority inheritance rules) may add overhead to the successful path of
the trylock operation that is not required of spin locks. One could, of course, add an attribute to the mutex to bypass such
overhead, but the very act of finding and testing this attribute represents more overhead than is found in the typical spin
lock.</p>
<p>The need to hold spin lock overhead to an absolute minimum also makes it impossible to provide guarantees against starvation
similar to those provided for mutexes or read-write locks. The overhead required to implement such guarantees (for example,
disabling preemption before spinning) may well exceed the overhead of the spin wait itself by many orders of magnitude. If a
&quot;safe&quot; spin wait seems desirable, it can always be provided (albeit at some performance cost) via appropriate mutex
attributes.</p>
</li>
</ul>
<h5><a name="tag_03_02_09_06"></a>XSI Supported Functions</h5>
<p>On XSI-conformant systems, the following symbolic constants are always defined:</p>
<blockquote>
<pre>
_POSIX_READER_WRITER_LOCKS
_POSIX_THREAD_ATTR_STACKADDR
_POSIX_THREAD_ATTR_STACKSIZE
_POSIX_THREAD_PROCESS_SHARED
_POSIX_THREADS
</pre>
</blockquote>
<p>Therefore, the following threads functions are always supported:</p>
<table cellpadding="3">
<tr valign="top">
<td align="left">
<p class="tent"><br>
<a href="../functions/pthread_atfork.html"><i>pthread_atfork</i>()</a><br>
<a href="../functions/pthread_attr_destroy.html"><i>pthread_attr_destroy</i>()</a><br>
<a href="../functions/pthread_attr_getdetachstate.html"><i>pthread_attr_getdetachstate</i>()</a><br>
<a href="../functions/pthread_attr_getguardsize.html"><i>pthread_attr_getguardsize</i>()</a><br>
<a href="../functions/pthread_attr_getschedparam.html"><i>pthread_attr_getschedparam</i>()</a><br>
<a href="../functions/pthread_attr_getstack.html"><i>pthread_attr_getstack</i>()</a><br>
<a href="../functions/pthread_attr_getstackaddr.html"><i>pthread_attr_getstackaddr</i>()</a><br>
<a href="../functions/pthread_attr_getstacksize.html"><i>pthread_attr_getstacksize</i>()</a><br>
<a href="../functions/pthread_attr_init.html"><i>pthread_attr_init</i>()</a><br>
<a href="../functions/pthread_attr_setdetachstate.html"><i>pthread_attr_setdetachstate</i>()</a><br>
<a href="../functions/pthread_attr_setguardsize.html"><i>pthread_attr_setguardsize</i>()</a><br>
<a href="../functions/pthread_attr_setschedparam.html"><i>pthread_attr_setschedparam</i>()</a><br>
<a href="../functions/pthread_attr_setstack.html"><i>pthread_attr_setstack</i>()</a><br>
<a href="../functions/pthread_attr_setstackaddr.html"><i>pthread_attr_setstackaddr</i>()</a><br>
<a href="../functions/pthread_attr_setstacksize.html"><i>pthread_attr_setstacksize</i>()</a><br>
<a href="../functions/pthread_cancel.html"><i>pthread_cancel</i>()</a><br>
<a href="../functions/pthread_cleanup_pop.html"><i>pthread_cleanup_pop</i>()</a><br>
<a href="../functions/pthread_cleanup_push.html"><i>pthread_cleanup_push</i>()</a><br>
<a href="../functions/pthread_cond_broadcast.html"><i>pthread_cond_broadcast</i>()</a><br>
<a href="../functions/pthread_cond_destroy.html"><i>pthread_cond_destroy</i>()</a><br>
<a href="../functions/pthread_cond_init.html"><i>pthread_cond_init</i>()</a><br>
<a href="../functions/pthread_cond_signal.html"><i>pthread_cond_signal</i>()</a><br>
<a href="../functions/pthread_cond_timedwait.html"><i>pthread_cond_timedwait</i>()</a><br>
<a href="../functions/pthread_cond_wait.html"><i>pthread_cond_wait</i>()</a><br>
<a href="../functions/pthread_condattr_destroy.html"><i>pthread_condattr_destroy</i>()</a><br>
<a href="../functions/pthread_condattr_getpshared.html"><i>pthread_condattr_getpshared</i>()</a><br>
<a href="../functions/pthread_condattr_init.html"><i>pthread_condattr_init</i>()</a><br>
<a href="../functions/pthread_condattr_setpshared.html"><i>pthread_condattr_setpshared</i>()</a><br>
<a href="../functions/pthread_create.html"><i>pthread_create</i>()</a><br>
<a href="../functions/pthread_detach.html"><i>pthread_detach</i>()</a><br>
<a href="../functions/pthread_equal.html"><i>pthread_equal</i>()</a><br>
<a href="../functions/pthread_exit.html"><i>pthread_exit</i>()</a><br>
<a href="../functions/pthread_getconcurrency.html"><i>pthread_getconcurrency</i>()</a><br>
<a href="../functions/pthread_getspecific.html"><i>pthread_getspecific</i>()</a><br>
<a href="../functions/pthread_join.html"><i>pthread_join</i>()</a><br>
&nbsp;</p>
</td>
<td align="left">
<p class="tent"><br>
<a href="../functions/pthread_key_create.html"><i>pthread_key_create</i>()</a><br>
<a href="../functions/pthread_key_delete.html"><i>pthread_key_delete</i>()</a><br>
<a href="../functions/pthread_kill.html"><i>pthread_kill</i>()</a><br>
<a href="../functions/pthread_mutex_destroy.html"><i>pthread_mutex_destroy</i>()</a><br>
<a href="../functions/pthread_mutex_init.html"><i>pthread_mutex_init</i>()</a><br>
<a href="../functions/pthread_mutex_lock.html"><i>pthread_mutex_lock</i>()</a><br>
<a href="../functions/pthread_mutex_trylock.html"><i>pthread_mutex_trylock</i>()</a><br>
<a href="../functions/pthread_mutex_unlock.html"><i>pthread_mutex_unlock</i>()</a><br>
<a href="../functions/pthread_mutexattr_destroy.html"><i>pthread_mutexattr_destroy</i>()</a><br>
<a href="../functions/pthread_mutexattr_getpshared.html"><i>pthread_mutexattr_getpshared</i>()</a><br>
<a href="../functions/pthread_mutexattr_gettype.html"><i>pthread_mutexattr_gettype</i>()</a><br>
<a href="../functions/pthread_mutexattr_init.html"><i>pthread_mutexattr_init</i>()</a><br>
<a href="../functions/pthread_mutexattr_setpshared.html"><i>pthread_mutexattr_setpshared</i>()</a><br>
<a href="../functions/pthread_mutexattr_settype.html"><i>pthread_mutexattr_settype</i>()</a><br>
<a href="../functions/pthread_once.html"><i>pthread_once</i>()</a><br>
<a href="../functions/pthread_rwlock_destroy.html"><i>pthread_rwlock_destroy</i>()</a><br>
<a href="../functions/pthread_rwlock_init.html"><i>pthread_rwlock_init</i>()</a><br>
<a href="../functions/pthread_rwlock_rdlock.html"><i>pthread_rwlock_rdlock</i>()</a><br>
<a href="../functions/pthread_rwlock_tryrdlock.html"><i>pthread_rwlock_tryrdlock</i>()</a><br>
<a href="../functions/pthread_rwlock_trywrlock.html"><i>pthread_rwlock_trywrlock</i>()</a><br>
<a href="../functions/pthread_rwlock_unlock.html"><i>pthread_rwlock_unlock</i>()</a><br>
<a href="../functions/pthread_rwlock_wrlock.html"><i>pthread_rwlock_wrlock</i>()</a><br>
<a href="../functions/pthread_rwlockattr_destroy.html"><i>pthread_rwlockattr_destroy</i>()</a><br>
<a href="../functions/pthread_rwlockattr_getpshared.html"><i>pthread_rwlockattr_getpshared</i>()</a><br>
<a href="../functions/pthread_rwlockattr_init.html"><i>pthread_rwlockattr_init</i>()</a><br>
<a href="../functions/pthread_rwlockattr_setpshared.html"><i>pthread_rwlockattr_setpshared</i>()</a><br>
<a href="../functions/pthread_self.html"><i>pthread_self</i>()</a><br>
<a href="../functions/pthread_setcancelstate.html"><i>pthread_setcancelstate</i>()</a><br>
<a href="../functions/pthread_setcanceltype.html"><i>pthread_setcanceltype</i>()</a><br>
<a href="../functions/pthread_setconcurrency.html"><i>pthread_setconcurrency</i>()</a><br>
<a href="../functions/pthread_setspecific.html"><i>pthread_setspecific</i>()</a><br>
<a href="../functions/pthread_sigmask.html"><i>pthread_sigmask</i>()</a><br>
<a href="../functions/pthread_testcancel.html"><i>pthread_testcancel</i>()</a><br>
<a href="../functions/sigwait.html"><i>sigwait</i>()</a><br>
&nbsp;</p>
</td>
</tr>
</table>
<p>On XSI-conformant systems, the symbolic constant _POSIX_THREAD_SAFE_FUNCTIONS is always defined. Therefore, the following
functions are always supported:</p>
<table cellpadding="3">
<tr valign="top">
<td align="left">
<p class="tent"><br>
<a href="../functions/asctime_r.html"><i>asctime_r</i>()</a><br>
<a href="../functions/ctime_r.html"><i>ctime_r</i>()</a><br>
<a href="../functions/flockfile.html"><i>flockfile</i>()</a><br>
<a href="../functions/ftrylockfile.html"><i>ftrylockfile</i>()</a><br>
<a href="../functions/funlockfile.html"><i>funlockfile</i>()</a><br>
<a href="../functions/getc_unlocked.html"><i>getc_unlocked</i>()</a><br>
<a href="../functions/getchar_unlocked.html"><i>getchar_unlocked</i>()</a><br>
<a href="../functions/getgrgid_r.html"><i>getgrgid_r</i>()</a><br>
<a href="../functions/getgrnam_r.html"><i>getgrnam_r</i>()</a><br>
<a href="../functions/getpwnam_r.html"><i>getpwnam_r</i>()</a><br>
&nbsp;</p>
</td>
<td align="left">
<p class="tent"><br>
<a href="../functions/getpwuid_r.html"><i>getpwuid_r</i>()</a><br>
<a href="../functions/gmtime_r.html"><i>gmtime_r</i>()</a><br>
<a href="../functions/localtime_r.html"><i>localtime_r</i>()</a><br>
<a href="../functions/putc_unlocked.html"><i>putc_unlocked</i>()</a><br>
<a href="../functions/putchar_unlocked.html"><i>putchar_unlocked</i>()</a><br>
<a href="../functions/rand_r.html"><i>rand_r</i>()</a><br>
<a href="../functions/readdir_r.html"><i>readdir_r</i>()</a><br>
<a href="../functions/strerror_r.html"><i>strerror_r</i>()</a><br>
<a href="../functions/strtok_r.html"><i>strtok_r</i>()</a><br>
&nbsp;</p>
</td>
</tr>
</table>
<p>The following threads functions are only supported on XSI-conformant systems if the Realtime Threads Option Group is
supported:</p>
<table cellpadding="3">
<tr valign="top">
<td align="left">
<p class="tent"><br>
<a href="../functions/pthread_attr_getinheritsched.html"><i>pthread_attr_getinheritsched</i>()</a><br>
<a href="../functions/pthread_attr_getschedpolicy.html"><i>pthread_attr_getschedpolicy</i>()</a><br>
<a href="../functions/pthread_attr_getscope.html"><i>pthread_attr_getscope</i>()</a><br>
<a href="../functions/pthread_attr_setinheritsched.html"><i>pthread_attr_setinheritsched</i>()</a><br>
<a href="../functions/pthread_attr_setschedpolicy.html"><i>pthread_attr_setschedpolicy</i>()</a><br>
<a href="../functions/pthread_attr_setscope.html"><i>pthread_attr_setscope</i>()</a><br>
<a href="../functions/pthread_getschedparam.html"><i>pthread_getschedparam</i>()</a><br>
&nbsp;</p>
</td>
<td align="left">
<p class="tent"><br>
<a href="../functions/pthread_mutex_getprioceiling.html"><i>pthread_mutex_getprioceiling</i>()</a><br>
<a href="../functions/pthread_mutex_setprioceiling.html"><i>pthread_mutex_setprioceiling</i>()</a><br>
<a href="../functions/pthread_mutexattr_getprioceiling.html"><i>pthread_mutexattr_getprioceiling</i>()</a><br>
<a href="../functions/pthread_mutexattr_getprotocol.html"><i>pthread_mutexattr_getprotocol</i>()</a><br>
<a href="../functions/pthread_mutexattr_setprioceiling.html"><i>pthread_mutexattr_setprioceiling</i>()</a><br>
<a href="../functions/pthread_mutexattr_setprotocol.html"><i>pthread_mutexattr_setprotocol</i>()</a><br>
<a href="../functions/pthread_setschedparam.html"><i>pthread_setschedparam</i>()</a><br>
&nbsp;</p>
</td>
</tr>
</table>
<h5><a name="tag_03_02_09_07"></a>XSI Threads Extensions</h5>
<p>The following XSI extensions to POSIX.1c are now supported in IEEE&nbsp;Std&nbsp;1003.1-2001 as part of the alignment with the
Single UNIX Specification:</p>
<ul>
<li>
<p>Extended mutex attribute types</p>
</li>
<li>
<p>Read-write locks and attributes (also introduced by the IEEE&nbsp;Std&nbsp;1003.1j-2000 amendment)</p>
</li>
<li>
<p>Thread concurrency level</p>
</li>
<li>
<p>Thread stack guard size</p>
</li>
<li>
<p>Parallel I/O</p>
</li>
</ul>
<p>A total of 19 new functions were added.</p>
<p>These extensions carefully follow the threads programming model specified in POSIX.1c. As with POSIX.1c, all the new functions
return zero if successful; otherwise, an error number is returned to indicate the error.</p>
<p>The concept of attribute objects was introduced in POSIX.1c to allow implementations to extend IEEE&nbsp;Std&nbsp;1003.1-2001
without changing the existing interfaces. Attribute objects were defined for threads, mutexes, and condition variables. Attribute
objects are defined as implementation-defined opaque types to aid extensibility, and functions are defined to allow attributes to
be set or retrieved. This model has been followed when adding the new <i>type</i> attribute of the mutex attributes object
<b>pthread_mutexattr_t</b> and the new read-write lock attributes object <b>pthread_rwlockattr_t</b>.</p>
<ul>
<li>
<p>Extended Mutex Attributes</p>
<p>POSIX.1c defines a mutex attributes object as an implementation-defined opaque object of type <b>pthread_mutexattr_t</b>, and
specifies a number of attributes which attributes objects must have and a number of functions which manipulate these attributes.
The attributes defined by POSIX.1c (most of them belonging to the thread attributes object) include <i>detachstate</i>,
<i>inheritsched</i>, <i>schedparam</i>, <i>schedpolicy</i>, <i>contentionscope</i>, <i>stackaddr</i>, and <i>stacksize</i>.</p>
<p>The System Interfaces volume of IEEE&nbsp;Std&nbsp;1003.1-2001 specifies another mutex attribute called <i>type</i>. The
<i>type</i> attribute allows applications to specify the behavior of mutex locking operations in situations where POSIX.1c behavior
is undefined. The OSF DCE threads implementation, based on Draft 4 of POSIX.1c, specified a similar attribute. Note that the names
of the attributes have changed somewhat from the OSF DCE threads implementation.</p>
<p>The System Interfaces volume of IEEE&nbsp;Std&nbsp;1003.1-2001 also extends the specification of the following POSIX.1c
functions which manipulate mutexes:</p>
<blockquote>
<pre>
<a href="../functions/pthread_mutex_lock.html"><i>pthread_mutex_lock</i>()</a>
<a href="../functions/pthread_mutex_trylock.html"><i>pthread_mutex_trylock</i>()</a>
<a href="../functions/pthread_mutex_unlock.html"><i>pthread_mutex_unlock</i>()</a>
</pre>
</blockquote>
<p>to take account of the new mutex <i>type</i> attribute and to specify behavior which was declared as undefined in POSIX.1c. How a
calling thread acquires or releases a mutex now depends upon the mutex <i>type</i> attribute.</p>
<p>The <i>type</i> attribute can have the following values:</p>
<dl compact>
<dt>PTHREAD_MUTEX_NORMAL</dt>
<dd><br>
Basic mutex with no specific error checking built in. Does not report a deadlock error.</dd>
<dt>PTHREAD_MUTEX_RECURSIVE</dt>
<dd><br>
Allows any thread to recursively lock a mutex. The mutex must be unlocked an equal number of times to release the mutex.</dd>
<dt>PTHREAD_MUTEX_ERRORCHECK</dt>
<dd><br>
Detects and reports simple usage errors; that is, an attempt to unlock a mutex that is not locked by the calling thread or that is
not locked at all, or an attempt to relock a mutex the thread already owns.</dd>
<dt>PTHREAD_MUTEX_DEFAULT</dt>
<dd><br>
The default mutex type. May be mapped to any of the above mutex types or may be an implementation-defined type.</dd>
</dl>
<p><i>Normal</i> mutexes do not detect deadlock conditions; for example, a thread will hang if it tries to relock a normal mutex
that it already owns. Attempting to unlock a mutex locked by another thread, or unlocking an unlocked mutex, results in undefined
behavior. Normal mutexes will usually be the fastest type of mutex available on a platform but provide the least error
checking.</p>
<p><i>Recursive</i> mutexes are useful for converting old code where it is difficult to establish clear boundaries of
synchronization. A thread can relock a recursive mutex without first unlocking it. The relocking deadlock which can occur with
normal mutexes cannot occur with this type of mutex. However, multiple locks of a recursive mutex require the same number of
unlocks to release the mutex before another thread can acquire the mutex. Furthermore, this type of mutex maintains the concept of
an owner. Thus, a thread attempting to unlock a recursive mutex which another thread has locked returns with an error. A thread
attempting to unlock a recursive mutex that is not locked returns with an error. Never use a recursive mutex with condition
variables because the implicit unlock performed by <a href="../functions/pthread_cond_wait.html"><i>pthread_cond_wait</i>()</a> or
<a href="../functions/pthread_cond_timedwait.html"><i>pthread_cond_timedwait</i>()</a> will not actually release the mutex if it
had been locked multiple times.</p>
<p><i>Errorcheck</i> mutexes provide error checking and are useful primarily as a debugging aid. A thread attempting to relock an
errorcheck mutex without first unlocking it returns with an error. Again, this type of mutex maintains the concept of an owner.
Thus, a thread attempting to unlock an errorcheck mutex which another thread has locked returns with an error. A thread attempting
to unlock an errorcheck mutex that is not locked also returns with an error. It should be noted that errorcheck mutexes will almost
always be much slower than normal mutexes due to the extra state checks performed.</p>
<p>The default mutex type provides implementation-defined error checking. The default mutex may be mapped to one of the other
defined types or may be something entirely different. This enables each vendor to provide the mutex semantics which the vendor
feels will be most useful to their target users. Most vendors will probably choose to make normal mutexes the default so as to give
applications the benefit of the fastest type of mutexes available on their platform. Check your implementation's documentation.</p>
<p>An application developer can use any of the mutex types almost interchangeably as long as the application does not depend upon
the implementation detecting (or failing to detect) any particular errors. Note that a recursive mutex can be used with condition
variable waits as long as the application never recursively locks the mutex.</p>
<p>Two functions are provided for manipulating the <i>type</i> attribute of a mutex attributes object. This attribute is set or
returned in the <i>type</i> parameter of these functions. The <a href=
"../functions/pthread_mutexattr_settype.html"><i>pthread_mutexattr_settype</i>()</a> function is used to set a specific type value
while <a href="../functions/pthread_mutexattr_gettype.html"><i>pthread_mutexattr_gettype</i>()</a> is used to return the type of
the mutex. Setting the <i>type</i> attribute of a mutex attributes object affects only mutexes initialized using that mutex
attributes object. Changing the <i>type</i> attribute does not affect mutexes previously initialized using that mutex attributes
object.</p>
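<p>A non-normative sketch of creating a recursive mutex with these interfaces follows; error checking is omitted for
brevity:</p>
<blockquote>
<pre>
<tt>#include &lt;pthread.h&gt;

pthread_mutex_t m;

/* Sketch: initialize m as a recursive mutex.  Destroying the
   attributes object afterwards does not affect m, as noted above. */
void
init_recursive_mutex(void)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&amp;attr);
    pthread_mutexattr_settype(&amp;attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&amp;m, &amp;attr);
    pthread_mutexattr_destroy(&amp;attr);
}
</tt>
</pre>
</blockquote>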
</li>
<li>
<p>Read-Write Locks and Attributes</p>
<p>The read-write locks introduced here have been harmonized with those in IEEE&nbsp;Std&nbsp;1003.1j-2000; see also <a href=
"#tag_03_02_09_26">Thread Read-Write Locks</a>.</p>
<p>Read-write locks (also known as reader-writer locks) allow a thread to exclusively lock some shared data while updating that
data, or allow any number of threads to have simultaneous read-only access to the data.</p>
<p>Unlike a mutex, a read-write lock distinguishes between reading data and writing data. A mutex excludes all other threads. A
read-write lock allows other threads access to the data, providing no thread is modifying the data. Thus, a read-write lock is less
primitive than either a mutex-condition variable pair or a semaphore.</p>
<p>Application developers should consider using a read-write lock rather than a mutex to protect data that is frequently referenced
but seldom modified. Most threads (readers) will be able to read the data without waiting and will only have to block when some
other thread (a writer) is in the process of modifying the data. Conversely a thread that wants to change the data is forced to
wait until there are no readers. This type of lock is often used to facilitate parallel access to data on multi-processor platforms
or to avoid context switches on single processor platforms where multiple threads access the same data.</p>
<p>If a read-write lock becomes unlocked and there are multiple threads waiting to acquire the write lock, the implementation's
scheduling policy determines which thread acquires the read-write lock for writing. If there are multiple threads blocked on a
read-write lock for both read locks and write locks, it is unspecified whether the readers or a writer acquires the lock first.
However, for performance reasons, implementations often favor writers over readers in order to avoid potential writer
starvation.</p>
<p>A read-write lock object is an implementation-defined opaque object of type <b>pthread_rwlock_t</b> as defined in <a href=
"../basedefs/pthread.h.html"><i>&lt;pthread.h&gt;</i></a>. There are two different sorts of locks associated with a read-write
lock: a read lock and a write lock.</p>
<p>The <a href="../functions/pthread_rwlockattr_init.html"><i>pthread_rwlockattr_init</i>()</a> function initializes a read-write
lock attributes object with the default value for all the attributes defined in the implementation. After a read-write lock
attributes object has been used to initialize one or more read-write locks, changes to the read-write lock attributes object,
including destruction, do not affect previously initialized read-write locks.</p>
<p>Implementations must provide at least the read-write lock attribute <i>process-shared</i>. This attribute can have the following
values:</p>
<dl compact>
<dt>PTHREAD_PROCESS_SHARED</dt>
<dd><br>
Any thread of any process that has access to the memory where the read-write lock resides can manipulate the read-write lock.</dd>
<dt>PTHREAD_PROCESS_PRIVATE</dt>
<dd><br>
Only threads created within the same process as the thread that initialized the read-write lock can manipulate the read-write lock.
This is the default value.</dd>
</dl>
<p>The <a href="../functions/pthread_rwlockattr_setpshared.html"><i>pthread_rwlockattr_setpshared</i>()</a> function is used to set
the <i>process-shared</i> attribute of an initialized read-write lock attributes object while the function <a href=
"../functions/pthread_rwlockattr_getpshared.html"><i>pthread_rwlockattr_getpshared</i>()</a> obtains the current value of the
<i>process-shared</i> attribute.</p>
<p>A read-write lock attributes object is destroyed using the <a href=
"../functions/pthread_rwlockattr_destroy.html"><i>pthread_rwlockattr_destroy</i>()</a> function. The effect of subsequent use of
the read-write lock attributes object is undefined.</p>
<p>A thread creates a read-write lock using the <a href="../functions/pthread_rwlock_init.html"><i>pthread_rwlock_init</i>()</a>
function. The attributes of the read-write lock can be specified by the application developer; otherwise, the default
implementation-defined read-write lock attributes are used if the pointer to the read-write lock attributes object is NULL. In
cases where the default attributes are appropriate, the PTHREAD_RWLOCK_INITIALIZER macro can be used to initialize statically
allocated read-write locks.</p>
<p>A thread which wants to apply a read lock to the read-write lock can use either <a href=
"../functions/pthread_rwlock_rdlock.html"><i>pthread_rwlock_rdlock</i>()</a> or <a href=
"../functions/pthread_rwlock_tryrdlock.html"><i>pthread_rwlock_tryrdlock</i>()</a>. If <a href=
"../functions/pthread_rwlock_rdlock.html"><i>pthread_rwlock_rdlock</i>()</a> is used, the thread acquires a read lock if a writer
does not hold the write lock and there are no writers blocked on the write lock. If a read lock is not acquired, the calling thread
blocks until it can acquire a lock. However, if <a href=
"../functions/pthread_rwlock_tryrdlock.html"><i>pthread_rwlock_tryrdlock</i>()</a> is used, the function returns immediately with
the error [EBUSY] if any thread holds a write lock or there are blocked writers waiting for the write lock.</p>
<p>A thread which wants to apply a write lock to the read-write lock can use either of two functions: <a href=
"../functions/pthread_rwlock_wrlock.html"><i>pthread_rwlock_wrlock</i>()</a> or <a href=
"../functions/pthread_rwlock_trywrlock.html"><i>pthread_rwlock_trywrlock</i>()</a>. If <a href=
"../functions/pthread_rwlock_wrlock.html"><i>pthread_rwlock_wrlock</i>()</a> is used, the thread acquires the write lock if no
other reader or writer threads hold the read-write lock. If the write lock is not acquired, the thread blocks until it can acquire
the write lock. However, if <a href="../functions/pthread_rwlock_trywrlock.html"><i>pthread_rwlock_trywrlock</i>()</a> is used, the
function returns immediately with the error [EBUSY] if any thread is holding either a read or a write lock.</p>
<p>The <a href="../functions/pthread_rwlock_unlock.html"><i>pthread_rwlock_unlock</i>()</a> function is used to unlock a read-write
lock object held by the calling thread. Results are undefined if the read-write lock is not held by the calling thread. If there
are other read locks currently held on the read-write lock object, the read-write lock object remains in the read locked state but
without the current thread as one of its owners. If this function releases the last read lock for this read-write lock object, the
read-write lock object is put in the unlocked state. If this function is called to release a write lock for this read-write
lock object, the read-write lock object is likewise put in the unlocked state.</p>
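<p>The following non-normative sketch shows the common usage pattern; <tt>shared_data</tt> is a placeholder, and error checking
is omitted:</p>
<blockquote>
<pre>
<tt>#include &lt;pthread.h&gt;

pthread_rwlock_t rwlock = PTHREAD_RWLOCK_INITIALIZER;
int shared_data;    /* placeholder for frequently read data */

int
reader(void)
{
    int v;

    pthread_rwlock_rdlock(&amp;rwlock);   /* many readers may hold this */
    v = shared_data;
    pthread_rwlock_unlock(&amp;rwlock);
    return v;
}

void
writer(int v)
{
    pthread_rwlock_wrlock(&amp;rwlock);   /* exclusive access */
    shared_data = v;
    pthread_rwlock_unlock(&amp;rwlock);
}
</tt>
</pre>
</blockquote>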
</li>
<li>
<p>Thread Concurrency Level</p>
<p>On threads implementations that multiplex user threads onto a smaller set of kernel execution entities, the system attempts to
create a reasonable number of kernel execution entities for the application upon application startup.</p>
<p>On some implementations, these kernel entities are retained by user threads that block in the kernel. Other implementations do
not <i>timeslice</i> user threads, so multiple compute-bound user threads cannot share a single kernel thread. On such
implementations, an application may use up all the available kernel execution entities before it runs out of user threads. The
process may be left with user threads capable of doing work for the application but with no way to schedule them.</p>
<p>The <a href="../functions/pthread_setconcurrency.html"><i>pthread_setconcurrency</i>()</a> function enables an application to
request more kernel entities; that is, specify a desired concurrency level. However, this function merely provides a hint to the
implementation. The implementation is free to ignore this request or to provide some other number of kernel entities. If an
implementation does not multiplex user threads onto a smaller number of kernel execution entities, the <a href=
"../functions/pthread_setconcurrency.html"><i>pthread_setconcurrency</i>()</a> function has no effect.</p>
<p>The <a href="../functions/pthread_setconcurrency.html"><i>pthread_setconcurrency</i>()</a> function may also have an effect on
implementations where the kernel mode and user mode schedulers cooperate to ensure that ready user threads are not prevented from
running by other threads blocked in the kernel.</p>
<p>The <a href="../functions/pthread_getconcurrency.html"><i>pthread_getconcurrency</i>()</a> function always returns the value set
by a previous call to <a href="../functions/pthread_setconcurrency.html"><i>pthread_setconcurrency</i>()</a>. However, if <a href=
"../functions/pthread_setconcurrency.html"><i>pthread_setconcurrency</i>()</a> was not previously called, this function returns
zero to indicate that the threads implementation is maintaining the concurrency level.</p>
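<p>A non-normative sketch of supplying the hint; the function name and the caller's choice of level are illustrative only:</p>
<blockquote>
<pre>
<tt>#include &lt;pthread.h&gt;

/* Sketch: hint that the application could use nthreads kernel
   entities; the implementation is free to ignore the request. */
int
hint_concurrency(int nthreads)
{
    int err = pthread_setconcurrency(nthreads);

    /* pthread_getconcurrency() now returns nthreads; it returns
       zero only when no level has been set and the implementation
       is maintaining the level itself. */
    return err;
}
</tt>
</pre>
</blockquote>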
</li>
<li>
<p>Thread Stack Guard Size</p>
<p>DCE threads introduced the concept of a &quot;thread stack guard size&quot;. Most thread implementations add a region of protected
memory to a thread's stack, commonly known as a &quot;guard region&quot;, as a safety measure to prevent stack pointer overflow in one
thread from corrupting the contents of another thread's stack. The default value of the <i>guardsize</i> attribute is
implementation-defined; {PAGESIZE} bytes is typical.</p>
<p>Some application developers may wish to change the stack guard size. When an application creates a large number of threads, the
extra page allocated for each stack may strain system resources. In addition to the extra page of memory, the kernel's memory
manager has to keep track of the different protections on adjoining pages. When this is a problem, the application developer may
request a guard size of 0 bytes to conserve system resources by eliminating stack overflow protection.</p>
<p>Conversely an application that allocates large data structures such as arrays on the stack may wish to increase the default
guard size in order to detect stack overflow. If a thread allocates two pages for a data array, a single guard page provides little
protection against thread stack overflows since the thread can corrupt adjoining memory beyond the guard page.</p>
<p>The System Interfaces volume of IEEE&nbsp;Std&nbsp;1003.1-2001 defines a new attribute of the thread attributes object; that is,
the <i>guardsize</i> attribute, which allows applications to specify the size of the guard region of a thread's stack.</p>
<p>Two functions are provided for manipulating a thread's stack guard size. The <a href=
"../functions/pthread_attr_setguardsize.html"><i>pthread_attr_setguardsize</i>()</a> function sets the thread <i>guardsize</i>
attribute, and the <a href="../functions/pthread_attr_getguardsize.html"><i>pthread_attr_getguardsize</i>()</a> function retrieves
the current value.</p>
<p>An implementation may round up the requested guard size to a multiple of the configurable system variable {PAGESIZE}. In this
case, <a href="../functions/pthread_attr_getguardsize.html"><i>pthread_attr_getguardsize</i>()</a> returns the guard size specified
by the previous <a href="../functions/pthread_attr_setguardsize.html"><i>pthread_attr_setguardsize</i>()</a> function call and not
the rounded up value.</p>
<p>If an application is managing its own thread stacks using the <i>stackaddr</i> attribute, the <i>guardsize</i> attribute is
ignored and no stack overflow protection is provided. In this case, it is the responsibility of the application to manage stack
overflow along with stack allocation.</p>
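<p>A non-normative sketch of requesting a larger guard region for threads that place large arrays on the stack; the four-page
size is an arbitrary example:</p>
<blockquote>
<pre>
<tt>#include &lt;pthread.h&gt;
#include &lt;unistd.h&gt;

/* Sketch: ask for a four-page guard region.  The implementation
   may round the size up; pthread_attr_getguardsize() reports the
   requested size, not the rounded-up value. */
void
set_large_guard(pthread_attr_t *attr)
{
    size_t guard = 4 * (size_t) sysconf(_SC_PAGESIZE);

    pthread_attr_init(attr);
    pthread_attr_setguardsize(attr, guard);
}
</tt>
</pre>
</blockquote>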
</li>
<li>
<p>Parallel I/O</p>
<p>Suppose two or more threads independently issue read requests on the same file. To read specific data from a file, a thread must
first call <a href="../functions/lseek.html"><i>lseek</i>()</a> to seek to the proper offset in the file, and then call <a href=
"../functions/read.html"><i>read</i>()</a> to retrieve the required data. If more than one thread does this at the same time, the
first thread may complete its seek call, but before it gets a chance to issue its read call a second thread may complete its seek
call, resulting in the first thread accessing incorrect data when it issues its read call. One workaround is to lock the file
descriptor while seeking and reading or writing, but this reduces parallelism and adds overhead.</p>
<p>Instead, the System Interfaces volume of IEEE&nbsp;Std&nbsp;1003.1-2001 provides two functions to make seek/read and seek/write
operations atomic. The file descriptor's current offset is unchanged, thus allowing multiple read and write operations to proceed
in parallel. This improves the I/O performance of threaded applications. The <a href="../functions/pread.html"><i>pread</i>()</a>
function is used to do an atomic read of data from a file into a buffer. Conversely, the <a href=
"../functions/pwrite.html"><i>pwrite</i>()</a> function does an atomic write of data from a buffer to a file.</p>
</li>
</ul>
<h5><a name="tag_03_02_09_08"></a>Thread-Safety</h5>
<p>All functions required by IEEE&nbsp;Std&nbsp;1003.1-2001 need to be thread-safe. Implementations have to provide internal
synchronization when necessary in order to achieve this goal. In certain cases (for example, most floating-point
implementations), context switch code may have to manage the writable shared state.</p>
<p>While a read from a pipe of {PIPE_BUF}*2 bytes may not generate a single atomic and thread-safe stream of bytes, it should
generate &quot;several&quot; (individually atomic) thread-safe streams of bytes. Similarly, while reading from a terminal device may not
generate a single atomic and thread-safe stream of bytes, it should generate some finite number of (individually atomic) and
thread-safe streams of bytes. That is, concurrent calls to read for a pipe, FIFO, or terminal device are not allowed to result in
corrupting the stream of bytes or other internal data. However, <a href="../functions/read.html"><i>read</i>()</a>, in these cases,
is not required to return a single contiguous and atomic stream of bytes.</p>
<p>It is not required that all functions provided by IEEE&nbsp;Std&nbsp;1003.1-2001 be either async-cancel-safe or
async-signal-safe.</p>
<p>As it turns out, some functions are inherently not thread-safe; that is, their interface specifications preclude reentrancy. For
example, some functions (such as <a href="../functions/asctime.html"><i>asctime</i>()</a>) return a pointer to a result stored in
memory space allocated by the function on a per-process basis. Such a function is not thread-safe, because its result can be
overwritten by successive invocations. Other functions, while not inherently non-thread-safe, may be implemented in ways that lead
to them not being thread-safe. For example, some functions (such as <a href="../functions/rand.html"><i>rand</i>()</a>) store state
information (such as a seed value, which survives multiple function invocations) in memory space allocated by the function on a
per-process basis. The implementation of such a function is not thread-safe if the implementation fails to synchronize invocations
of the function and thus fails to protect the state information. The problem is that when the state information is not protected,
concurrent invocations can interfere with one another (for example, applications using <a href=
"../functions/rand.html"><i>rand</i>()</a> may see the same seed value).</p>
<p><i>Thread-Safety and Locking of Existing Functions</i></p>
<p>Originally, POSIX.1 was not designed to work in a multi-threaded environment, and some implementations of some existing
functions will not work properly when executed concurrently. To provide routines that will work correctly in an environment with
threads (&quot;thread-safe&quot;), two problems need to be solved:</p>
<ol>
<li>
<p>Routines that maintain or return pointers to static areas internal to the routine (which may now be shared) need to be modified.
The routines <a href="../functions/ttyname.html"><i>ttyname</i>()</a> and <a href=
"../functions/localtime.html"><i>localtime</i>()</a> are examples.</p>
</li>
<li>
<p>Routines that access data space shared by more than one thread need to be modified. The <a href=
"../functions/malloc.html"><i>malloc</i>()</a> function and the <i>stdio</i> family routines are examples.</p>
</li>
</ol>
<p>There are a variety of constraints on these changes. The first is compatibility with the existing versions of these
functions: non-thread-safe functions will continue to be in use for some time, as the original interfaces are used by existing code.
Another is that the new thread-safe versions of these functions represent as small a change as possible over the familiar
interfaces provided by the existing non-thread-safe versions. The new interfaces should be independent of any particular threads
implementation. In particular, they should be thread-safe without depending on explicit thread-specific memory. Finally, there
should be minimal performance penalty due to the changes made to the functions.</p>
<p>It is intended that the list of functions from POSIX.1 that cannot be made thread-safe and for which corrected versions are
provided be complete.</p>
<p><i>Thread-Safety and Locking Solutions</i></p>
<p>Many of the POSIX.1 functions were thread-safe and did not change at all. However, some functions (for example, the math
functions typically found in <b>libm</b>) are not thread-safe because of writable shared global state. For instance, in
IEEE&nbsp;Std&nbsp;754-1985 floating-point implementations, the computation modes and flags are global and shared.</p>
<p>Some functions are not thread-safe because a particular implementation is not reentrant, typically because of a non-essential
use of static storage. These require only a new implementation.</p>
<p>Thread-safe libraries are useful in a wide range of parallel (and asynchronous) programming environments, not just within
pthreads. In order to be used outside the context of pthreads, however, such libraries still have to use some synchronization
method. These could either be independent of the pthread synchronization operations, or they could be a subset of the pthread
interfaces. Either method results in thread-safe library implementations that can be used without the rest of pthreads.</p>
<p>Some functions, such as the <i>stdio</i> family interface and dynamic memory allocation functions such as <a href=
"../functions/malloc.html"><i>malloc</i>()</a>, are inter-dependent routines that share resources (for example, buffers) across
related calls. These require synchronization to work correctly, but they do not require any change to their external (user-visible)
interfaces.</p>
<p>In some cases, such as <a href="../functions/getc.html"><i>getc</i>()</a> and <a href=
"../functions/putc.html"><i>putc</i>()</a>, adding synchronization is likely to create an unacceptable performance impact. In this
case, slower thread-safe synchronized functions are to be provided, but the original, faster (but unsafe) functions (which may be
implemented as macros) are retained under new names. Some additional special-purpose synchronization facilities are necessary for
these macros to be usable in multi-threaded programs. This also requires changes in <a href=
"../basedefs/stdio.h.html"><i>&lt;stdio.h&gt;</i></a>.</p>
<p>The other common reason that functions are unsafe is that they return a pointer to static storage, making the functions
non-thread-safe. This has to be changed, and there are three natural choices:</p>
<ol>
<li>
<p>Return a pointer to thread-specific storage</p>
<p>This could incur a severe performance penalty on those architectures with a costly implementation of the thread-specific data
interface.</p>
<p>A variation on this technique is to use <a href="../functions/malloc.html"><i>malloc</i>()</a> to allocate storage for the
function output and return a pointer to this storage. This technique may also have an undesirable performance impact, however, and
a simplistic implementation requires that the user program explicitly free the storage object when it is no longer needed. This
technique is used by some existing POSIX.1 functions. With careful implementation for infrequently used functions, there may be
little or no performance or storage penalty, and the maintenance of already-standardized interfaces is a significant benefit.</p>
</li>
<li>
<p>Return the actual value computed by the function</p>
<p>This technique can only be used with functions that return pointers to structures; routines that return character strings would
have to wrap their output in an enclosing structure in order to return the output on the stack. There is also a negative
performance impact inherent in this solution in that the output value has to be copied twice before it can be used by the calling
function: once from the called routine's local buffers to the top of the stack, then from the top of the stack to the assignment
target. Finally, many older compilers cannot support this technique due to a historical tendency to use internal static buffers to
deliver the results of structure-valued functions.</p>
</li>
<li>
<p>Have the caller pass the address of a buffer to contain the computed value</p>
<p>The only disadvantage of this approach is that extra arguments have to be provided by the calling program. It represents the
most efficient solution to the problem, however, and, unlike the <a href="../functions/malloc.html"><i>malloc</i>()</a> technique,
it is semantically clear.</p>
</li>
</ol>
<p>There are some routines (often groups of related routines) whose interfaces are inherently non-thread-safe because they
communicate across multiple function invocations by means of static memory locations. The solution is to redesign the calls so that
they are thread-safe, typically by passing the needed data as extra parameters. Unfortunately, this may require major changes to
the interface as well.</p>
<p>A floating-point implementation using IEEE&nbsp;Std&nbsp;754-1985 is a case in point. A less problematic example is the
<i>rand48</i> family of pseudo-random number generators. The functions <a href="../functions/getgrgid.html"><i>getgrgid</i>()</a>,
<a href="../functions/getgrnam.html"><i>getgrnam</i>()</a>, <a href="../functions/getpwnam.html"><i>getpwnam</i>()</a>, and <a
href="../functions/getpwuid.html"><i>getpwuid</i>()</a> are another such case.</p>
<p>The problems with <i>errno</i> are discussed in <a href="#tag_03_02_03_01">Alternative Solutions for Per-Thread errno</a>.</p>
<p>Some functions can be thread-safe or not, depending on their arguments. These include the <a href=
"../functions/tmpnam.html"><i>tmpnam</i>()</a> and <a href="../functions/ctermid.html"><i>ctermid</i>()</a> functions. These
functions have pointers to character strings as arguments. If the pointers are not NULL, the functions store their results in the
character string; however, if the pointers are NULL, the functions store their results in an area that may be static and thus
subject to overwriting by successive calls. These should only be called by multi-thread applications when their arguments are
non-NULL.</p>
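<p>A non-normative sketch of the thread-safe usage, passing a caller-supplied buffer rather than a NULL pointer:</p>
<blockquote>
<pre>
<tt>#include &lt;stdio.h&gt;

/* Sketch: with a non-NULL argument, ctermid() stores its result
   in the caller's buffer and is safe to call from any thread. */
void
show_controlling_terminal(void)
{
    char term[L_ctermid];

    if (ctermid(term)[0] != '\0')
        printf("%s\n", term);
}
</tt>
</pre>
</blockquote>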
<p><i>Asynchronous Safety and Thread-Safety</i></p>
<p>A floating-point implementation has many modes that affect rounding and other aspects of computation. Functions in some math
library implementations may change the computation modes for the duration of a function call. If such a function call is
interrupted by a signal or cancelation, the floating-point state is not required to be protected.</p>
<p>There is a significant cost to make floating-point operations async-cancel-safe or async-signal-safe; accordingly, neither form
of async safety is required.</p>
<p><i>Functions Returning Pointers to Static Storage</i></p>
<p>For those functions that are not thread-safe because they return values in fixed size statically allocated structures, alternate
&quot;_r&quot; forms are provided that pass a pointer to an explicit result structure. Those that return pointers into library-allocated
buffers have forms provided with explicit buffer and length parameters.</p>
<p>For functions that return pointers to library-allocated buffers, it makes sense to provide &quot;_r&quot; versions that allow the
application control over allocation of the storage in which results are returned. This allows the state used by these functions to
be managed on an application-specific basis, supporting per-thread, per-process, or other application-specific sharing
relationships.</p>
<p>Early proposals had provided &quot;_r&quot; versions for functions that returned pointers to variable-size buffers without providing a
means for determining the required buffer size. This would have made using such functions exceedingly clumsy, potentially requiring
iteratively calling them with increasingly larger guesses for the amount of storage required. Hence, <a href=
"../functions/sysconf.html"><i>sysconf</i>()</a> variables have been provided for such functions that return the maximum required
buffer size.</p>
<p>Thus, the rule that has been followed by IEEE&nbsp;Std&nbsp;1003.1-2001 when adapting single-threaded non-thread-safe functions
is as follows: all functions returning pointers to library-allocated storage should have &quot;_r&quot; versions provided, allowing the
application control over the storage allocation. Those with variable-sized return values accept both a buffer address and a length
parameter. The <a href="../functions/sysconf.html"><i>sysconf</i>()</a> variables are provided to supply the appropriate buffer
sizes when required. Implementors are encouraged to apply the same rule when adapting their own existing functions to a pthreads
environment.</p>
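<p>A non-normative sketch of this rule applied to <a href="../functions/getpwnam.html"><i>getpwnam</i>()</a>: the buffer is
sized from the corresponding <a href="../functions/sysconf.html"><i>sysconf</i>()</a> variable and passed to the &quot;_r&quot;
version (the fallback size below is an arbitrary guess):</p>
<blockquote>
<pre>
<tt>#include &lt;pwd.h&gt;
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;unistd.h&gt;

/* Sketch: the strings in pwd point into buf, so buf must remain
   allocated for as long as the result is used. */
void
show_home_dir(const char *name)
{
    long bufsize = sysconf(_SC_GETPW_R_SIZE_MAX);
    struct passwd pwd, *result;
    char *buf;

    if (bufsize == -1)
        bufsize = 1024;        /* no maximum advertised; guess */
    buf = malloc((size_t) bufsize);
    if (buf != NULL &amp;&amp;
        getpwnam_r(name, &amp;pwd, buf, (size_t) bufsize, &amp;result) == 0 &amp;&amp;
        result != NULL)
        printf("%s\n", pwd.pw_dir);
    free(buf);
}
</tt>
</pre>
</blockquote>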
<h5><a name="tag_03_02_09_09"></a>Thread IDs</h5>
<p>Separate applications should communicate through well-defined interfaces and should not depend on each other's implementation.
For example, if a programmer decides to rewrite the <a href="../utilities/sort.html"><i>sort</i></a> utility using multiple
threads, it should be easy to do this so that the interface to the <a href="../utilities/sort.html"><i>sort</i></a> utility does
not change. Consider that if the user causes SIGINT to be generated while the <a href="../utilities/sort.html"><i>sort</i></a>
utility is running, keeping the same interface means that the entire <a href="../utilities/sort.html"><i>sort</i></a> utility is
killed, not just one of its threads. As another example, consider a realtime application that manages a reactor. Such an
application may wish to allow other applications to control the priority at which it watches the control rods. One technique to
accomplish this is to write the ID of the thread watching the control rods into a file and allow other programs to change the
priority of that thread as they see fit. A simpler technique is to have the reactor process accept IPCs (Interprocess Communication
messages) from other processes, telling it at a semantic level what priority the program should assign to watching the control
rods. This allows the programmer greater flexibility in the implementation. For example, the programmer can change the
implementation from having one thread per rod to having one thread watching all of the rods without changing the interface. Having
threads live inside the process means that the implementation of a process is invisible to outside processes (excepting debuggers
and system management tools).</p>
<p>Threads do not provide a protection boundary. Every thread model allows threads to share memory with other threads and
encourages this sharing to be widespread. This means that one thread can wipe out memory that is needed for the correct functioning
of other threads that are sharing its memory. Consequently, providing each thread with its own user and/or group IDs would not
provide a protection boundary between threads sharing memory.</p>
<h5><a name="tag_03_02_09_10"></a>Thread Mutexes</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_03_02_09_11"></a>Thread Scheduling</h5>
<ul>
<li>
<p>Scheduling Implementation Models</p>
<p>The following scheduling implementation models are presented in terms of threads and &quot;kernel entities&quot;. This is to simplify
exposition of the models, and it does not imply that an implementation actually has an identifiable &quot;kernel entity&quot;.</p>
<p>A kernel entity is not defined beyond the fact that it has scheduling attributes that are used to resolve contention with other
kernel entities for execution resources. A kernel entity may be thought of as an envelope that holds a thread or a separate kernel
thread. It is not a conventional process, although it shares with the process the attribute that it has a single thread of control;
it does not necessarily imply an address space, open files, and so on. It is better thought of as a primitive facility upon which
conventional processes and threads may be constructed.</p>
<ul>
<li>
<p>System Thread Scheduling Model</p>
<p>This model consists of one thread per kernel entity. The kernel entity is solely responsible for scheduling thread execution on
one or more processors. This model schedules all threads against all other threads in the system using the scheduling attributes of
the thread.</p>
</li>
<li>
<p>Process Scheduling Model</p>
<p>A generalized process scheduling model consists of two levels of scheduling. A threads library creates a pool of kernel
entities, as required, and schedules threads to run on them using the scheduling attributes of the threads. Typically, the size of
the pool is a function of the simultaneously runnable threads, not the total number of threads. The kernel then schedules the
kernel entities onto processors according to their scheduling attributes, which are managed by the threads library. This two-level
model potentially allows a wide range of mappings between threads and kernel entities.</p>
</li>
</ul>
</li>
<li>
<p>System and Process Scheduling Model Performance</p>
<p>There are a number of important implications on the performance of applications using these scheduling models. The process
scheduling model potentially provides lower overhead for making scheduling decisions, since there is no need to access kernel-level
information or functions and the set of schedulable entities is smaller (only the threads within the process).</p>
<p>On the other hand, since the kernel is also making scheduling decisions regarding the system resources under its control (for
example, CPU(s), I/O devices, memory), decisions that do not take thread scheduling parameters into account can result in
unspecified delays for realtime application threads, causing them to miss maximum response time limits.</p>
</li>
<li>
<p>Rate Monotonic Scheduling</p>
<p>Rate monotonic scheduling was considered, but rejected for standardization in the context of pthreads. A sporadic server policy
is included.</p>
</li>
<li>
<p>Scheduling Options</p>
<p>In IEEE&nbsp;Std&nbsp;1003.1-2001, the basic thread scheduling functions are defined under the Threads option, so that they are
required of all threads implementations. However, there are no specific scheduling policies required by this option to allow for
conforming thread implementations that are not targeted to realtime applications.</p>
<p>Specific standard scheduling policies are defined to be under the Thread Execution Scheduling option, and they are specifically
designed to support realtime applications by providing predictable resource-sharing sequences. The name of this option was chosen
to emphasize that this functionality is defined as appropriate for realtime applications that require simple priority-based
scheduling.</p>
<p>It is recognized that these policies are not necessarily satisfactory for some multi-processor implementations, and work is
ongoing to address a wider range of scheduling behaviors. The interfaces have been chosen to create abundant opportunity for future
scheduling policies to be implemented and standardized based on this interface. In order to standardize a new scheduling policy,
all that is required (from the standpoint of thread scheduling attributes) is to define a new policy name, new members of the
thread attributes object, and functions to set these members when the scheduling policy is equal to the new value.</p>
</li>
</ul>
<h5><a name="tag_03_02_09_12"></a>Scheduling Contention Scope</h5>
<p>In order to accommodate the requirement for realtime response, each thread has a scheduling contention scope attribute. Threads
with a system scheduling contention scope have to be scheduled with respect to all other threads in the system. These threads are
usually bound to a single kernel entity that reflects their scheduling attributes and are directly scheduled by the kernel.</p>
<p>Threads with a process scheduling contention scope need be scheduled only with respect to the other threads in the process.
These threads may be scheduled within the process onto a pool of kernel entities. The implementation is also free to bind these
threads directly to kernel entities and let them be scheduled by the kernel. Process scheduling contention scope allows the
implementation the most flexibility and is the default if both contention scopes are supported and none is specified.</p>
<p>Thus, the choice by implementors to provide one or the other (or both) of these scheduling models is driven by the need of their
supported application domains for worst-case (that is, realtime) response, or average-case (non-realtime) response.</p>
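<p>A non-normative sketch of requesting system contention scope for a thread with a hard response-time requirement;
<tt>watch</tt> is a placeholder start routine:</p>
<blockquote>
<pre>
<tt>#include &lt;pthread.h&gt;

extern void *watch(void *);   /* placeholder start routine */

/* Sketch: create a thread scheduled against all threads in the
   system.  An implementation supporting only one scope may report
   [ENOTSUP] from pthread_attr_setscope(). */
int
start_system_scope_thread(pthread_t *tid)
{
    pthread_attr_t attr;
    int err;

    pthread_attr_init(&amp;attr);
    err = pthread_attr_setscope(&amp;attr, PTHREAD_SCOPE_SYSTEM);
    if (err == 0)
        err = pthread_create(tid, &amp;attr, watch, NULL);
    pthread_attr_destroy(&amp;attr);
    return err;
}
</tt>
</pre>
</blockquote>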
<h5><a name="tag_03_02_09_13"></a>Scheduling Allocation Domain</h5>
<p>The SCHED_FIFO and SCHED_RR scheduling policies take on different characteristics on a multi-processor. Other scheduling
policies are also subject to changed behavior when executed on a multi-processor. The concept of scheduling allocation domain
determines the set of processors on which the threads of an application may run. By considering the application's processor
scheduling allocation domain for its threads, scheduling policies can be defined in terms of their behavior for varying processor
scheduling allocation domain values. It is conceivable that not all scheduling allocation domain sizes make sense for all
scheduling policies on all implementations. The concept of scheduling allocation domain, however, is a useful tool for the
description of multi-processor scheduling policies.</p>
<p>The &quot;process control&quot; approach to scheduling obtains significant performance advantages from dynamic scheduling allocation
domain sizes when it is applicable.</p>
<p>Non-Uniform Memory Access (NUMA) multi-processors may use a system scheduling structure that involves reassignment of threads
among scheduling allocation domains. In NUMA machines, a natural model of scheduling is to match scheduling allocation domains to
clusters of processors. Load balancing in such an environment requires changing the scheduling allocation domain to which a thread
is assigned.</p>
<h5><a name="tag_03_02_09_14"></a>Scheduling Documentation</h5>
<p>Implementation-provided scheduling policies need to be completely documented in order to be useful. This documentation includes
a description of the attributes required for the policy, the scheduling interaction of threads running under this policy and all
other supported policies, and the effects of all possible values for processor scheduling allocation domain. Note that for the
implementor wishing to be minimally-compliant, it is (minimally) acceptable to define the behavior as undefined.</p>
<h5><a name="tag_03_02_09_15"></a>Scheduling Contention Scope Attribute</h5>
<p>The scheduling contention scope defines how threads compete for resources. Within IEEE&nbsp;Std&nbsp;1003.1-2001, scheduling
contention scope is used to describe only how threads are scheduled in relation to one another in the system. That is, either they
are scheduled against all other threads in the system (&quot;system scope&quot;) or only against those threads in the process (&quot;process
scope&quot;). In fact, scheduling contention scope may apply to additional resources, including virtual timers and profiling, which are
not currently considered by IEEE&nbsp;Std&nbsp;1003.1-2001.</p>
<h5><a name="tag_03_02_09_16"></a>Mixed Scopes</h5>
<p>If only one scheduling contention scope is supported, the scheduling decision is straightforward. To perform the processor
scheduling decision in a mixed scope environment, it is necessary to map the scheduling attributes of the thread with process-wide
contention scope to the same attribute space as the thread with system-wide contention scope.</p>
<p>Since a conforming implementation has to support one and may support both scopes, it is useful to discuss the effects of such
choices with respect to example applications. If an implementation supports both scopes, mixing scopes provides a means of better
managing system-level (that is, kernel-level) and library-level resources. In general, threads with system scope will require the
resources of a separate kernel entity in order to guarantee the scheduling semantics. On the other hand, threads with process scope
can share the resources of a kernel entity while maintaining the scheduling semantics.</p>
<p>The application is free to create threads with dedicated kernel resources, and other threads that multiplex kernel resources.
Consider the example of a window server. The server allocates two threads per widget: one thread manages the widget user interface
(including drawing), while the other thread takes any required application action. This allows the widget to be &quot;active&quot; while
the application is computing. A screen image may be built from thousands of widgets. If each of these threads had been created with
system scope, then most of the kernel-level resources might be wasted, since only a few widgets are active at any one time. In
addition, mixed scope is particularly useful in a window server where one thread with high priority and system scope handles the
mouse so that it tracks well. As another example, consider a database server. For each of the hundreds or thousands of clients
supported by a large server, an equivalent number of threads will have to be created. If each of these threads were system scope,
the consequences would be the same as for the window server example above. However, the server could be constructed so that actual
retrieval of data is done by several dedicated threads. Dedicated threads that do work for all clients frequently justify the added
expense of system scope. If it were not permissible to mix system and process threads in the same process, this type of solution
would not be possible.</p>
<h5><a name="tag_03_02_09_17"></a>Dynamic Thread Scheduling Parameters Access</h5>
<p>In many time-constrained applications, there is no need to change the scheduling attributes dynamically during thread or process
execution, since the general use of these attributes is to reflect directly the time constraints of the application. Since these
time constraints are generally imposed to meet higher-level system requirements, such as accuracy or availability, they frequently
should remain unchanged during application execution.</p>
<p>However, there are important situations in which the scheduling attributes should be changed. Generally, this will occur when
external environmental conditions exist in which the time constraints change. Consider, for example, a space vehicle major mode
change, such as the change from ascent to descent mode, or the change from the space environment to the atmospheric environment. In
such cases, the frequency with which many of the sensors or actuators need to be read or written will change, which will
necessitate a priority change. In other cases, even the existence of a time constraint might be temporary, necessitating not just a
priority change, but also a policy change for ongoing threads or processes. For this reason, it is critical that the interface
should provide functions to change the scheduling parameters dynamically, but, as with many of the other realtime functions, it is
important that applications use them properly to avoid the possibility of unnecessarily degrading performance.</p>
<p>In providing functions for dynamically changing the scheduling behavior of threads, there were two options: provide functions to
get and set the individual scheduling parameters of threads, or provide a single interface to get and set all the scheduling
parameters for a given thread simultaneously. Both approaches have merit. Access functions for individual parameters allow simpler
control of thread scheduling for simple thread scheduling parameters. However, a single function for setting all the parameters for
a given scheduling policy is required when first setting that scheduling policy. Since the single all-encompassing functions are
required, it was decided to leave the interface as minimal as possible. Note that simpler functions (such as
<i>pthread_setprio</i>() for threads running under the priority-based schedulers) can be easily defined in terms of the
all-encompassing functions.</p>
<p>If the <a href="../functions/pthread_setschedparam.html"><i>pthread_setschedparam</i>()</a> function executes successfully, it
will have set all of the scheduling parameter values indicated in <i>param</i>; otherwise, none of the scheduling parameters will
have been modified. This is necessary to ensure that the scheduling of this and all other threads continues to be consistent in the
presence of an erroneous scheduling parameter.</p>
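<p>A non-normative sketch of a dynamic priority change on a mode change, assuming the Thread Execution Scheduling option; the
policy and priority values are placeholders chosen by the application:</p>
<blockquote>
<pre>
<tt>#include &lt;pthread.h&gt;
#include &lt;sched.h&gt;

/* Sketch: set the policy and all of its parameters together, as
   discussed above.  On failure, no parameter has been changed. */
int
raise_priority(pthread_t tid, int prio)
{
    struct sched_param param;

    param.sched_priority = prio;
    return pthread_setschedparam(tid, SCHED_FIFO, &amp;param);
}
</tt>
</pre>
</blockquote>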
<p>The [EPERM] error value is included in the list of possible <a href=
"../functions/pthread_setschedparam.html"><i>pthread_setschedparam</i>()</a> error returns as a reflection of the fact that the
ability to change scheduling parameters increases risks to the implementation and application performance if the scheduling
parameters are changed improperly. For this reason, and based on some existing practice, it was felt that some implementations
would probably choose to define specific permissions for changing either a thread's own or another thread's scheduling parameters.
IEEE&nbsp;Std&nbsp;1003.1-2001 does not include portable methods for setting or retrieving permissions, so any such use of
permissions is completely unspecified.</p>
<h5><a name="tag_03_02_09_18"></a>Mutex Initialization Scheduling Attributes</h5>
<p>In a priority-driven environment, a direct use of traditional primitives like mutexes and condition variables can lead to
unbounded priority inversion, where a higher priority thread can be blocked by a lower priority thread, or set of threads, for an
unbounded duration of time. As a result, it becomes impossible to guarantee thread deadlines. Priority inversion can be bounded and
minimized by the use of priority inheritance protocols. This allows thread deadlines to be guaranteed even in the presence of
synchronization requirements.</p>
<p>Two useful but simple members of the family of priority inheritance protocols are the basic priority inheritance protocol and
the priority ceiling protocol emulation. Under the Basic Priority Inheritance protocol (governed by the Thread Priority Inheritance
option), a thread that is blocking higher priority threads executes at the priority of the highest priority thread that it blocks.
This simple mechanism allows priority inversion to be bounded by the duration of critical sections and makes timing analysis
possible.</p>
<p>Under the Priority Ceiling Protocol Emulation protocol (governed by the Thread Priority Protection option), each mutex has a
priority ceiling, usually defined as the priority of the highest priority thread that can lock the mutex. When a thread is
executing inside critical sections, its priority is unconditionally increased to the highest of the priority ceilings of all the
mutexes owned by the thread. This protocol has two very desirable properties in uni-processor systems. First, a thread can be
blocked by a lower priority thread for at most the duration of one single critical section. Furthermore, when the protocol is
correctly used in a single processor, and if threads do not become blocked while owning mutexes, mutual deadlocks are
prevented.</p>
<p>The priority ceiling emulation can be extended to multiple processor environments, in which case the values of the priority
ceilings will be assigned depending on the kind of mutex that is being used: local to only one processor, or global, shared by
several processors. Local priority ceilings will be assigned the usual way, equal to the priority of the highest priority thread
that may lock that mutex. Global priority ceilings will usually be assigned a priority level higher than all the priorities
assigned to any of the threads that reside in the involved processors to avoid the effect called remote blocking.</p>
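<p>A non-normative sketch of initializing a mutex for priority ceiling emulation, assuming the Thread Priority Protection
option; the ceiling value is a placeholder chosen by the application:</p>
<blockquote>
<pre>
<tt>#include &lt;pthread.h&gt;

/* Sketch: the ceiling should be no lower than the priority of the
   highest priority thread that may ever lock the mutex. */
int
init_protected_mutex(pthread_mutex_t *m, int ceiling)
{
    pthread_mutexattr_t attr;
    int err;

    pthread_mutexattr_init(&amp;attr);
    err = pthread_mutexattr_setprotocol(&amp;attr, PTHREAD_PRIO_PROTECT);
    if (err == 0)
        err = pthread_mutexattr_setprioceiling(&amp;attr, ceiling);
    if (err == 0)
        err = pthread_mutex_init(m, &amp;attr);
    pthread_mutexattr_destroy(&amp;attr);
    return err;
}
</tt>
</pre>
</blockquote>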
<h5><a name="tag_03_02_09_19"></a>Change the Priority Ceiling of a Mutex</h5>
<p>In order for the priority protect protocol to exhibit its desired properties of bounding priority inversion and avoidance of
deadlock, it is critical that the ceiling priority of a mutex be the same as the priority of the highest thread that can ever hold
it, or higher. Thus, if the priorities of the threads using such mutexes never change dynamically, there is no need ever to change
the priority ceiling of a mutex.</p>
<p>However, if a major system mode change results in an altered response time requirement for one or more application threads,
their priority has to change to reflect it. It will occasionally be the case that the priority ceilings of mutexes held also need
to change. While changing priority ceilings should generally be avoided, it is important that IEEE&nbsp;Std&nbsp;1003.1-2001
provide these interfaces for those cases in which it is necessary.</p>
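<p>A non-normative sketch of such a change; <a href=
"../functions/pthread_mutex_setprioceiling.html"><i>pthread_mutex_setprioceiling</i>()</a> locks the mutex (or blocks until it
can), changes the ceiling, and releases the mutex:</p>
<blockquote>
<pre>
<tt>#include &lt;pthread.h&gt;

/* Sketch: raise a mutex's ceiling after a mode change has raised
   the priorities of the threads that lock it. */
int
raise_ceiling(pthread_mutex_t *m, int new_ceiling)
{
    int old_ceiling;

    return pthread_mutex_setprioceiling(m, new_ceiling, &amp;old_ceiling);
}
</tt>
</pre>
</blockquote>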
<h5><a name="tag_03_02_09_20"></a>Thread Cancelation</h5>
<p>Many existing threads packages have facilities for canceling an operation or canceling a thread. These facilities are used for
implementing user requests (such as the CANCEL button in a window-based application), for implementing OR parallelism (for example,
telling the other threads to stop working once one thread has found a forced mate in a parallel chess program), or for implementing
the ABORT mechanism in Ada.</p>
<p>POSIX programs traditionally have used the signal mechanism combined with either <a href=
"../functions/longjmp.html"><i>longjmp</i>()</a> or polling to cancel operations. Many POSIX programmers have trouble using these
facilities to solve their problems efficiently in a single-threaded process. With the introduction of threads, these solutions
become even more difficult to use.</p>
<p>The main issues with implementing a cancelation facility are specifying the operation to be canceled, cleanly releasing any
resources allocated to that operation, controlling when the target notices that it has been canceled, and defining the interaction
between asynchronous signals and cancelation.</p>
<h5><a name="tag_03_02_09_21"></a>Specifying the Operation to Cancel</h5>
<p>Consider a thread that calls through five distinct levels of program abstraction and then, inside the lowest-level abstraction,
calls a function that suspends the thread. (An abstraction boundary is a layer at which the client of the abstraction sees only the
service being provided and can remain ignorant of the implementation. Abstractions are often layered, each level of abstraction
being a client of the lower-level abstraction and implementing a higher-level abstraction.) Depending on the semantics of each
abstraction, one could imagine wanting to cancel only the call that causes suspension, only the bottom two levels, or the operation
being done by the entire thread. Canceling operations at a finer grain than the entire thread is difficult because threads are
active and they may be run in parallel on a multi-processor. By the time one thread can make a request to cancel an operation, the
thread performing the operation may have completed that operation and gone on to start another operation whose cancelation is not
desired. Thread IDs are not reused until the thread has exited, and either it was created with the <i>detachstate</i>
attribute set to PTHREAD_CREATE_DETACHED or the <a href="../functions/pthread_join.html"><i>pthread_join</i>()</a> or <a href=
"../functions/pthread_detach.html"><i>pthread_detach</i>()</a> function has been called for that thread. Consequently, a thread
cancelation will never be misdirected when the thread terminates. For these reasons, the canceling of operations is done at the
granularity of the thread. Threads are designed to be inexpensive enough so that a separate thread may be created to perform each
separately cancelable operation; for example, each possibly long running user request.</p>
<p>For cancelation to be used in existing code, cancelation scopes and handlers will have to be established for code that needs to
release resources upon cancelation, so that it follows the programming discipline described in the text.</p>
<h5><a name="tag_03_02_09_22"></a>A Special Signal Versus a Special Interface</h5>
<p>Two different mechanisms were considered for providing the cancelation interfaces. The first was to provide an interface to
direct signals at a thread and then to define a special signal that had the required semantics. The other alternative was to use a
special interface that delivered the correct semantics to the target thread.</p>
<p>The solution using signals produced a number of problems. It required the implementation to provide cancelation in terms of
signals whereas a perfectly valid (and possibly more efficient) implementation could have both layered on a low-level set of
primitives. There were so many exceptions to the special signal (it cannot be sent with <a href=
"../functions/kill.html"><i>kill</i>()</a>, and no other POSIX.1 signal interfaces can be used with it) that it was clearly not a valid signal. Its
semantics on delivery were also completely different from any existing POSIX.1 signal. As such, a special interface that did not
mandate the implementation and did not confuse the semantics of signals and cancelation was felt to be the better solution.</p>
<h5><a name="tag_03_02_09_23"></a>Races Between Cancelation and Resuming Execution</h5>
<p>Due to the nature of cancelation, there is generally no synchronization between the thread requesting the cancelation of a
blocked thread and events that may cause that thread to resume execution. For this reason, and because excess serialization hurts
performance, when both an event that a thread is waiting for has occurred and a cancelation request has been made and cancelation
is enabled, IEEE&nbsp;Std&nbsp;1003.1-2001 explicitly allows the implementation to choose between returning from the blocking call
or acting on the cancelation request.</p>
<h5><a name="tag_03_02_09_24"></a>Interaction of Cancelation with Asynchronous Signals</h5>
<p>A typical use of cancelation is to acquire a lock on some resource and to establish a cancelation cleanup handler for releasing
the resource when and if the thread is canceled.</p>
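<p>A hedged sketch of this idiom follows; the mutex, condition variable, and predicate are illustrative assumptions:</p>
<blockquote>
<pre>
<tt>#include &lt;pthread.h&gt;

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ready = PTHREAD_COND_INITIALIZER;
static int work_available;

static void cleanup(void *arg)
{
    /* The handler argument carries context: the mutex to release */
    pthread_mutex_unlock((pthread_mutex_t *)arg);
}

void wait_for_work(void)
{
    pthread_mutex_lock(&amp;lock);
    pthread_cleanup_push(cleanup, &amp;lock);
    while (work_available == 0)
        pthread_cond_wait(&amp;ready, &amp;lock); /* a cancelation point */
    /* Not canceled: pop the handler and execute it to unlock */
    pthread_cleanup_pop(1);
}
</tt>
</pre>
</blockquote>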
<p>A correct and complete implementation of cancelation in the presence of asynchronous signals requires considerable care. An
implementation has to push a cancelation cleanup handler on the cancelation cleanup stack while maintaining the integrity of the
stack data structure. If an asynchronously-generated signal is posted to the thread during a stack operation, the signal handler
cannot manipulate the cancelation cleanup stack. As a consequence, asynchronous signal handlers may not cancel threads or otherwise
manipulate the cancelation state of a thread. Threads may, of course, be canceled by another thread that used a <a href=
"../functions/sigwait.html"><i>sigwait</i>()</a> function to wait synchronously for an asynchronous signal.</p>
<p>In order for cancelation to function correctly, it is required that asynchronous signal handlers not change the cancelation
state. This requires that some elements of existing practice, such as using <a href=
"../functions/longjmp.html"><i>longjmp</i>()</a> to exit from an asynchronous signal handler implicitly, be prohibited in cases
where the integrity of the cancelation state of the interrupted thread cannot be ensured.</p>
<h5><a name="tag_03_02_09_25"></a>Thread Cancelation Overview</h5>
<ul>
<li>
<p>Cancelability States</p>
<p>The three possible cancelability states (disabled, deferred, and asynchronous) are encoded into two separate bits ((disable,
enable) and (deferred, asynchronous)) to allow them to be changed and restored independently. For instance, short code sequences
that will not block sometimes disable cancelability on entry and restore the previous state upon exit. Likewise, long or unbounded
code sequences containing no convenient explicit cancelation points will sometimes set the cancelability type to asynchronous on
entry and restore the previous value upon exit.</p>
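<p>For example, such sequences might bracket themselves as in the following hedged sketch (the functions shown are illustrative
assumptions):</p>
<blockquote>
<pre>
<tt>#include &lt;pthread.h&gt;

void update_shared_counter(long *counter)
{
    int oldstate, ignored;

    /* Short, non-blocking sequence: disable cancelation on entry */
    pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &amp;oldstate);
    *counter += 1;
    /* Restore whatever cancelability state the caller established */
    pthread_setcancelstate(oldstate, &amp;ignored);
}

void long_computation(void)
{
    int oldtype, ignored;

    /* No convenient cancelation points: allow asynchronous cancelation */
    pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, &amp;oldtype);
    /* ... long or unbounded, async-cancel-safe computation ... */
    pthread_setcanceltype(oldtype, &amp;ignored);
}
</tt>
</pre>
</blockquote>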
</li>
<li>
<p>Cancelation Points</p>
<p>Cancelation points are points inside of certain functions where a thread has to act on any pending cancelation request when
cancelability is enabled, if the function would block. As with checking for signals, operations need only check for pending
cancelation requests when the operation is about to block indefinitely.</p>
<p>The idea was considered of allowing implementations to define whether blocking calls such as <a href=
"../functions/read.html"><i>read</i>()</a> should be cancelation points. It was decided that it would adversely affect the design
of conforming applications if blocking calls were not cancelation points because threads could be left blocked in an uncancelable
state.</p>
<p>There are several important blocking routines that are specifically not made cancelation points:</p>
<ul>
<li>
<p><a href="../functions/pthread_mutex_lock.html"><i>pthread_mutex_lock</i>()</a></p>
<p>If <a href="../functions/pthread_mutex_lock.html"><i>pthread_mutex_lock</i>()</a> were a cancelation point, every routine that
called it would also become a cancelation point (that is, any routine that touched shared state would automatically become a
cancelation point). For example, <a href="../functions/malloc.html"><i>malloc</i>()</a>, <a href=
"../functions/free.html"><i>free</i>()</a>, and <a href="../functions/rand.html"><i>rand</i>()</a> would become cancelation points
under this scheme. Having too many cancelation points makes programming very difficult, leading to either much disabling and
restoring of cancelability or much difficulty in trying to arrange for reliable cleanup at every possible place.</p>
<p>Since <a href="../functions/pthread_mutex_lock.html"><i>pthread_mutex_lock</i>()</a> is not a cancelation point, threads could
result in being blocked uninterruptibly for long periods of time if mutexes were used as a general synchronization mechanism. As
this is normally not acceptable, mutexes should only be used to protect resources that are held for small fixed lengths of time
where not being able to be canceled will not be a problem. Resources that need to be held exclusively for long periods of time
should be protected with condition variables.</p>
</li>
<li>
<p><a href="../functions/pthread_barrier_wait.html"><i>pthread_barrier_wait</i>()</a></p>
<p>Canceling a barrier wait will render a barrier unusable. Similar to a barrier timeout (which the standard developers rejected),
there is no way to guarantee the consistency of a barrier's internal data structures if a barrier wait is canceled.</p>
</li>
<li>
<p><a href="../functions/pthread_spin_lock.html"><i>pthread_spin_lock</i>()</a></p>
<p>As with mutexes, spin locks should only be used to protect resources that are held for small fixed lengths of time where not
being cancelable will not be a problem.</p>
</li>
</ul>
<p>Every library routine should specify whether or not it includes any cancelation points. Typically, only those routines that may
block or compute indefinitely need to include cancelation points.</p>
<p>Correctly coded routines only reach cancelation points after having set up a cancelation cleanup handler to restore invariants
if the thread is canceled at that point. Being cancelable only at specified cancelation points allows programmers to keep track of
actions needed in a cancelation cleanup handler more easily. A thread should only be made asynchronously cancelable when it is not
in the process of acquiring or releasing resources or otherwise in a state from which it would be difficult or impossible to
recover.</p>
</li>
<li>
<p>Thread Cancelation Cleanup Handlers</p>
<p>The cancelation cleanup handlers provide a portable mechanism, easy to implement, for releasing resources and restoring
invariants. They are easier to use than signal handlers because they provide a stack of cancelation cleanup handlers rather than a
single handler, and because they have an argument that can be used to pass context information to the handler.</p>
<p>The alternative to providing these simple cancelation cleanup handlers (whose only use is for cleaning up when a thread is
canceled) is to define a general exception package that could be used for handling and cleaning up after hardware traps and
software-detected errors. This was too far removed from the charter of providing threads to handle asynchrony. However, it is an
explicit goal of IEEE&nbsp;Std&nbsp;1003.1-2001 to be compatible with existing exception facilities and languages having
exceptions.</p>
<p>The interaction of this facility and other procedure-based or language-level exception facilities is unspecified in this version
of IEEE&nbsp;Std&nbsp;1003.1-2001. However, it is intended that it be possible for an implementation to define the relationship
between these cancelation cleanup handlers and Ada, C++, or other language-level exception handling facilities.</p>
<p>It was suggested that the cancelation cleanup handlers should also be called when the process exits or calls the <i>exec</i>
function. This was rejected partly due to the performance problem caused by having to call the cancelation cleanup handlers of
every thread before the operation could continue. The other reason was that the only state expected to be cleaned up by the
cancelation cleanup handlers would be the intraprocess state. Any handlers that are to clean up the interprocess state would be
registered with <a href="../functions/atexit.html"><i>atexit</i>()</a>. There is the orthogonal problem that the <i>exec</i>
functions do not honor the <a href="../functions/atexit.html"><i>atexit</i>()</a> handlers, but resolving this is beyond the scope
of IEEE&nbsp;Std&nbsp;1003.1-2001.<br>
</p>
</li>
<li>
<p>Async-Cancel Safety</p>
<p>A function is said to be async-cancel-safe if it is written in such a way that entering the function with asynchronous
cancelability enabled will not cause any invariants to be violated, even if a cancelation request is delivered at any arbitrary
instruction. Functions that are async-cancel-safe are often written in such a way that they need to acquire no resources for their
operation and the visible variables that they may write are strictly limited.</p>
<p>Any routine that gets a resource as a side effect cannot be made async-cancel-safe (for example, <a href=
"../functions/malloc.html"><i>malloc</i>()</a>). If such a routine were called with asynchronous cancelability enabled, it might
acquire the resource successfully, but as it was returning to the client, it could act on a cancelation request. In such a case,
the application would have no way of knowing whether the resource was acquired or not.</p>
<p>Indeed, because many interesting routines cannot be made async-cancel-safe, most library routines in general are not
async-cancel-safe. Every library routine should specify whether or not it is async-cancel-safe so that programmers know which
routines can be called from code that is asynchronously cancelable.</p>
</li>
</ul>
<h5><a name="tag_03_02_09_26"></a>Thread Read-Write Locks</h5>
<h5><a name="tag_03_02_09_27"></a>Background</h5>
<p>Read-write locks are often used to allow parallel access to data on multi-processors, to avoid context switches on
uni-processors when multiple threads access the same data, and to protect data structures that are frequently accessed (that is,
read) but rarely updated (that is, written). The in-core representation of a file system directory is a good example of such a data
structure. One would like to achieve as much concurrency as possible when searching directories, but limit concurrent access when
adding or deleting files.</p>
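<p>A hedged sketch of that usage follows; the function names are illustrative and the directory representation is elided:</p>
<blockquote>
<pre>
<tt>#include &lt;pthread.h&gt;

static pthread_rwlock_t dir_lock;

void dir_init(void)
{
    pthread_rwlock_init(&amp;dir_lock, NULL); /* default attributes */
}

void dir_search(void)
{
    /* Many searching threads may hold the read lock concurrently */
    pthread_rwlock_rdlock(&amp;dir_lock);
    /* ... search the in-core directory ... */
    pthread_rwlock_unlock(&amp;dir_lock);
}

void dir_add_entry(void)
{
    /* Adding or deleting a file requires exclusive access */
    pthread_rwlock_wrlock(&amp;dir_lock);
    /* ... update the in-core directory ... */
    pthread_rwlock_unlock(&amp;dir_lock);
}
</tt>
</pre>
</blockquote>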
<p>Although read-write locks can be implemented with mutexes and condition variables, such implementations are significantly less
efficient than is possible. Therefore, this synchronization primitive is included in IEEE&nbsp;Std&nbsp;1003.1-2001 for the purpose
of allowing more efficient implementations in multi-processor systems.</p>
<h5><a name="tag_03_02_09_28"></a>Queuing of Waiting Threads</h5>
<p>The <a href="../functions/pthread_rwlock_unlock.html"><i>pthread_rwlock_unlock</i>()</a> function description states that one
writer or one or more readers must acquire the lock if it is no longer held by any thread as a result of the call. However, the
function does not specify which thread(s) acquire the lock, unless the Thread Execution Scheduling option is supported.</p>
<p>The standard developers considered the issue of scheduling with respect to the queuing of threads blocked on a read-write lock.
The question turned out to be whether IEEE&nbsp;Std&nbsp;1003.1-2001 should require priority scheduling of read-write locks for
threads whose execution scheduling policy is priority-based (for example, SCHED_FIFO or SCHED_RR). There are tradeoffs between
priority scheduling, the amount of concurrency achievable among readers, and the prevention of writer and/or reader starvation.</p>
<p>For example, suppose one or more readers hold a read-write lock and the following threads request the lock in the listed
order:</p>
<blockquote>
<pre>
<tt>pthread_rwlock_wrlock() - Low priority thread writer_a
pthread_rwlock_rdlock() - High priority thread reader_a
pthread_rwlock_rdlock() - High priority thread reader_b
pthread_rwlock_rdlock() - High priority thread reader_c
</tt>
</pre>
</blockquote>
<p>When the lock becomes available, should <i>writer_a</i> block the high priority readers? Or, suppose a read-write lock becomes
available and the following are queued:</p>
<blockquote>
<pre>
<tt>pthread_rwlock_rdlock() - Low priority thread reader_a
pthread_rwlock_rdlock() - Low priority thread reader_b
pthread_rwlock_rdlock() - Low priority thread reader_c
pthread_rwlock_wrlock() - Medium priority thread writer_a
pthread_rwlock_rdlock() - High priority thread reader_d
</tt>
</pre>
</blockquote>
<p>If priority scheduling is applied then <i>reader_d</i> would acquire the lock and <i>writer_a</i> would block the remaining
readers. But should the remaining readers also acquire the lock to increase concurrency? The solution adopted takes into account
that when the Thread Execution Scheduling option is supported, high priority threads may in fact starve low priority threads (the
application developer is responsible in this case for designing the system in such a way that this starvation is avoided).
Therefore, IEEE&nbsp;Std&nbsp;1003.1-2001 specifies that high priority readers take precedence over lower priority writers.
However, to prevent writer starvation from threads of the same or lower priority, writers take precedence over readers of the same
or lower priority.</p>
<p>Priority inheritance mechanisms are non-trivial in the context of read-write locks. When a high priority writer is forced to
wait for multiple readers, for example, it is not clear which subset of the readers should inherit the writer's priority.
Furthermore, the internal data structures that record the inheritance must be accessible to all readers, and this implies some sort
of serialization that could negate any gain in parallelism achieved through the use of multiple readers in the first place.
Finally, existing practice does not support the use of priority inheritance for read-write locks. Therefore, no specification of
priority inheritance or priority ceiling is attempted. If reliable priority-scheduled synchronization is absolutely required, it
can always be obtained through the use of mutexes.</p>
<h5><a name="tag_03_02_09_29"></a>Comparison to fcntl() Locks</h5>
<p>The read-write locks and the <a href="../functions/fcntl.html"><i>fcntl</i>()</a> locks in IEEE&nbsp;Std&nbsp;1003.1-2001 share
a common goal: increasing concurrency among readers, thus increasing throughput and decreasing delay.</p>
<p>However, the read-write locks have two features not present in the <a href="../functions/fcntl.html"><i>fcntl</i>()</a> locks.
First, under priority scheduling, read-write locks are granted in priority order. Second, also under priority scheduling, writer
starvation is prevented by giving writers preference over readers of equal or lower priority.</p>
<p>Also, read-write locks can be used in systems lacking a file system, such as those conforming to the minimal realtime system
profile of IEEE&nbsp;Std&nbsp;1003.13-1998.</p>
<h5><a name="tag_03_02_09_30"></a>History of Resolution Issues</h5>
<p>Based upon some balloting objections, early drafts specified the behavior of threads waiting on a read-write lock during the
execution of a signal handler, as if the thread had not called the lock operation. However, this specified behavior would require
implementations to establish internal signal handlers even though this situation would be rare, or never happen for many programs.
This would introduce an unacceptable performance hit in comparison to the little additional functionality gained. Therefore, the
behavior of read-write locks and signals was reverted to its previous mutex-like specification.</p>
<h5><a name="tag_03_02_09_31"></a>Thread Interactions with Regular File Operations</h5>
<p>There is no additional rationale provided for this section.</p>
<h4><a name="tag_03_02_10"></a>Sockets</h4>
<p>The base document for the sockets interfaces in IEEE&nbsp;Std&nbsp;1003.1-2001 is the XNS, Issue 5.2 specification. This was
primarily chosen as it aligns with IPv6. Additional material has been added from IEEE&nbsp;Std&nbsp;1003.1g-2000, notably socket
concepts, raw sockets, the <a href="../functions/pselect.html"><i>pselect</i>()</a> function, the <a href=
"../functions/sockatmark.html"><i>sockatmark</i>()</a> function, and the <a href=
"../basedefs/sys/select.h.html"><i>&lt;sys/select.h&gt;</i></a> header.</p>
<h5><a name="tag_03_02_10_01"></a>Address Families</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_03_02_10_02"></a>Addressing</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_03_02_10_03"></a>Protocols</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_03_02_10_04"></a>Routing</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_03_02_10_05"></a>Interfaces</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_03_02_10_06"></a>Socket Types</h5>
<p>The type <b>socklen_t</b> was invented to cover the range of implementations seen in the field. The intent of <b>socklen_t</b>
is to be the type for all lengths that are naturally bounded in size; that is, that they are the length of a buffer which cannot
sensibly become of massive size: network addresses, host names, string representations of these, ancillary data, control messages,
and socket options are examples. Truly boundless sizes are represented by <b>size_t</b> as in <a href=
"../functions/read.html"><i>read</i>()</a>, <a href="../functions/write.html"><i>write</i>()</a>, and so on.</p>
<p>All <b>socklen_t</b> types were originally (in BSD UNIX) of type <b>int</b>. During the development of
IEEE&nbsp;Std&nbsp;1003.1-2001, it was decided to change all buffer lengths to <b>size_t</b>, which appears at face value to make
sense. When dual mode 32/64-bit systems came along, this choice unnecessarily complicated system interfaces because <b>size_t</b>
(with <b>long</b>) was a different size under ILP32 and LP64 models. Reverting to <b>int</b> would have happened except that some
implementations had already shipped 64-bit-only interfaces. The compromise was a type which could be defined to be any size by the
implementation: <b>socklen_t</b>.</p>
<h5><a name="tag_03_02_10_07"></a>Socket I/O Mode</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_03_02_10_08"></a>Socket Owner</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_03_02_10_09"></a>Socket Queue Limits</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_03_02_10_10"></a>Pending Error</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_03_02_10_11"></a>Socket Receive Queue</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_03_02_10_12"></a>Socket Out-of-Band Data State</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_03_02_10_13"></a>Connection Indication Queue</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_03_02_10_14"></a>Signals</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_03_02_10_15"></a>Asynchronous Errors</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_03_02_10_16"></a>Use of Options</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_03_02_10_17"></a>Use of Sockets for Local UNIX Connections</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_03_02_10_18"></a>Use of Sockets over Internet Protocols</h5>
<p>A raw socket allows privileged users direct access to a protocol; for example, raw access to the IP and ICMP protocols is
possible through raw sockets. Raw sockets are intended for knowledgeable applications that wish to take advantage of some protocol
feature not directly accessible through the other sockets interfaces.</p>
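<p>For instance, a knowledgeable, privileged application might obtain raw access to ICMP over IPv4 with a call such as the
following (a hedged fragment; error handling omitted):</p>
<blockquote>
<pre>
<tt>#include &lt;sys/socket.h&gt;
#include &lt;netinet/in.h&gt;

int open_raw_icmp(void)
{
    /* Requires appropriate privileges */
    return socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);
}
</tt>
</pre>
</blockquote>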
<h5><a name="tag_03_02_10_19"></a>Use of Sockets over Internet Protocols Based on IPv4</h5>
<p>There is no additional rationale provided for this section.</p>
<h5><a name="tag_03_02_10_20"></a>Use of Sockets over Internet Protocols Based on IPv6</h5>
<p>The Open Group Base Resolution bwg2001-012 is applied, clarifying that IPv6 implementations are required to support use of
AF_INET6 sockets over IPv4.</p>
<h4><a name="tag_03_02_11"></a>Tracing</h4>
<p>The organization of the tracing rationale differs from the traditional rationale in that this tracing rationale text is written
against the trace interface as a whole, rather than against the individual components of the trace interface or the normative
section in which those components are defined. Therefore the sections below do not parallel the sections of normative text in
IEEE&nbsp;Std&nbsp;1003.1-2001.</p>
<h5><a name="tag_03_02_11_01"></a>Objectives</h5>
<p>The intended uses of tracing are application-system debugging during system development, as a &quot;flight recorder&quot; for
maintenance of fielded systems, and as a performance measurement tool. In all of these intended uses, the vendor-supplied computer
system and its software are, for this discussion, assumed error-free; the intent being to debug the user-written and/or third-party
application code, and their interactions. Clearly, problems with the vendor-supplied system and its software will be uncovered from
time to time, but this is a byproduct of the primary activity, debugging user code.</p>
<p>Another need for defining a trace interface in POSIX stems from the objective to provide an efficient portable way to perform
benchmarks. Existing practice shows that such interfaces are commonly used in a variety of systems but with little commonality. As
part of the benchmarking needs, two aspects within the trace interface must be considered.</p>
<p>The first, and perhaps more important one, is the qualitative aspect.</p>
<p>The second is the quantitative aspect.</p>
<ul>
<li>
<p>Qualitative Aspect</p>
<p>To better understand this aspect, let us consider an example. Suppose that you want to organize a number of actions to be
performed during the day. Some of these actions are known at the beginning of the day. Some others, which may be more or less
important, will be triggered by reading your mail. During the day you will make some phone calls and synchronously receive some
more information. Finally you will receive asynchronous phone calls that also will trigger actions. If you, or somebody else,
examines your day at work, you, or he, can discover that you have not efficiently organized your work. For instance, relative to
the phone calls you made, would it be preferable to make some of these early in the morning? Or to delay some others until the end
of the day? Relative to the phone calls you have received, you might find that somebody you called in the morning has called you 10
times while you were performing some important work. To examine, afterwards, your day at work, you record in sequence all the trace
events relative to your work. This should give you a chance of organizing your next day at work.</p>
<p>This is the qualitative aspect of the trace interface. The user of a system needs to keep a trace of particular points the
application passes through, so that he can eventually make some changes in the application and/or system configuration, to give the
application a chance of running more efficiently.</p>
</li>
<li>
<p>Quantitative Aspect</p>
<p>This aspect concerns primarily realtime applications, where missed deadlines can be undesirable. Although there are, in
IEEE&nbsp;Std&nbsp;1003.1-2001, some interfaces useful for such applications (timeouts, execution time monitoring, and so on),
there are no APIs to aid in the tuning of a realtime application's behavior ( <b>timespec</b> in timeouts, length of message
queues, duration of driver interrupt service routine, and so on). The tuning of an application needs a means of recording
timestamped important trace events during execution in order to analyze offline, and eventually, to tune some realtime features
(redesign the system with less functionalities, readjust timeouts, redesign driver interrupts, and so on).</p>
</li>
</ul>
<h5><a name="tag_03_02_11_02"></a>Detailed Objectives</h5>
<p>Objectives were defined to build the trace interface and are kept for historical interest. Although some objectives are not
fully respected in this trace interface, the concept of the POSIX trace interface assumes the following points:</p>
<ol>
<li>
<p>It must be possible to trace both system and user trace events concurrently.</p>
</li>
<li>
<p>It must be possible to trace per-process trace events and also to trace system trace events which are unrelated to any
particular process. A per-process trace event is either user-initiated or system-initiated.</p>
</li>
<li>
<p>It must be possible to control tracing on a per-process basis from either inside or outside the process.</p>
</li>
<li>
<p>It must be possible to control tracing on a per-thread basis from inside the enclosing process.</p>
</li>
<li>
<p>Trace points must be controllable by trace event type ID from inside and outside of the process. Multiple trace points can have
the same trace event type ID, and will be controlled jointly.</p>
</li>
<li>
<p>Recording of trace events is dependent on both trace event type ID and the process/thread. Both must be enabled in order to
record trace events. System trace events may or may not be handled differently.</p>
</li>
<li>
<p>The API must not mandate the ability to control tracing for more than one process at the same time.</p>
</li>
<li>
<p>There is no objective for trace control on anything bigger than a process; for example, group or session.</p>
</li>
<li>
<p>Trace propagation and control:</p>
<ol type="a">
<li>
<p>Trace propagation across <a href="../functions/fork.html"><i>fork</i>()</a> is optional; the default is to not trace a child
process.</p>
</li>
<li>
<p>Trace control must span <a href="../functions/pthread_create.html"><i>pthread_create</i>()</a> operations; that is, if a process
is being traced, any thread will be traced as well if this thread allows tracing. The default is to allow tracing.</p>
</li>
</ol>
</li>
<li>
<p>Trace control must not span <i>exec</i> or <a href="../functions/posix_spawn.html"><i>posix_spawn</i>()</a> operations.</p>
</li>
<li>
<p>A triggering API is not required. The triggering API is the ability to command or stop tracing based on the occurrence of a
specific trace event other than a POSIX_TRACE_START trace event or a POSIX_TRACE_STOP trace event.</p>
</li>
<li>
<p>Trace log entries must have timestamps of implementation-defined resolution. Implementations are exhorted to support at least
microsecond resolution. When a trace log entry is retrieved, it must have the timestamp, PC address, PID, and TID of the entity that
generated the trace event.</p>
</li>
<li>
<p>Independently developed code should be able to use trace facilities without coordination and without conflict.</p>
</li>
<li>
<p>Even if the trace points in the trace calls are not unique, the trace log entries (after any processing) must be uniquely
identified as to trace point.</p>
</li>
<li>
<p>There must be a standard API to read the trace stream.</p>
</li>
<li>
<p>The format of the trace stream and the trace log is opaque and unspecified.</p>
</li>
<li>
<p>It must be possible to read a completed trace, if recorded on some suitable non-volatile storage, even subsequent to a power
cycle or subsequent cold boot of the system.</p>
</li>
<li>
<p>Support of analysis of a trace log while it is being formed is implementation-defined.</p>
</li>
<li>
<p>The API must allow the application to write trace stream identification information into the trace stream and to be able to
retrieve it, without it being overwritten by trace entries, even if the trace stream is full.</p>
</li>
<li>
<p>It must be possible to specify the destination of trace data produced by trace events.</p>
</li>
<li>
<p>It must be possible to have different trace streams, and for the tracing enabled by one trace stream to be completely
independent of the tracing of another trace stream.</p>
</li>
<li>
<p>It must be possible to trace events from threads in different CPUs.</p>
</li>
<li>
<p>The API must support one or more trace streams per-system, and one or more trace streams per-process, up to an
implementation-defined set of per-system and per-process maximums.</p>
</li>
<li>
<p>It must be possible to determine the order in which the trace events happened, without necessarily depending on the clock, up to
an implementation-defined time resolution.</p>
</li>
<li>
<p>For performance reasons, the trace event point call(s) must be implementable as a macro (see the ISO&nbsp;POSIX-1:1996 standard,
1.3.4, Statement 2).</p>
</li>
<li>
<p>IEEE&nbsp;Std&nbsp;1003.1-2001 must not define the trace points which a conforming system must implement, except for trace
points used in the control of tracing.</p>
</li>
<li>
<p>The APIs must be thread-safe, and trace points should be lock-free (that is, not require a lock to gain exclusive access to some
resource).</p>
</li>
<li>
<p>The user-provided information associated with a trace event is variable-sized, up to some maximum size.</p>
</li>
<li>
<p>Bounds on record and trace stream sizes:</p>
<ol type="a">
<li>
<p>The API must permit the application to declare the upper bounds on the length of an application data record. The system must
return the limit it used. The limit used may be smaller than requested.</p>
</li>
<li>
<p>The API must permit the application to declare the upper bounds on the size of trace streams. The system must return the limit
it used. The limit used may be different, either larger or smaller, than requested.</p>
</li>
</ol>
</li>
<li>
<p>The API must be able to pass any fundamental data type, and a structured data type composed only of fundamental types. The API
must be able to pass data by reference, given only as an address and a length. Fundamental types are the POSIX.1 types (see the <a
href="../basedefs/sys/types.h.html"><i>&lt;sys/types.h&gt;</i></a> header) plus those defined in the ISO&nbsp;C standard.</p>
</li>
<li>
<p>The API must apply the POSIX notions of ownership and permission to recorded trace data, corresponding to the sources of that
data.</p>
</li>
</ol>
<h5><a name="tag_03_02_11_03"></a>Comments on Objectives</h5>
<basefont size="2">
<dl>
<dt><b>Note:</b></dt>
<dd>In the following comments, numbers in square brackets refer to the above objectives.</dd>
</dl>
<basefont size="3">
<p>It is necessary to be able to obtain a trace stream for a complete activity. Thus there is a requirement to be able to trace
both application and system trace events. A per-process trace event is either user-initiated, like the <a href=
"../functions/write.html"><i>write</i>()</a> function, or system-initiated, like a timer expiration. There is also a need to be
able to trace an entire process' activity even when it has threads in multiple CPUs. To avoid excess trace activity, it is
necessary to be able to control tracing on a trace event type basis.<br>
[Objectives 1,2,5,22]</p>
<p>There is a need to be able to control tracing on a per-process basis, both from inside and outside the process; that is, a
process can start a trace activity on itself or any other process. There is also the perceived need to allow the definition of a
maximum number of trace streams per system.<br>
[Objectives 3,23]</p>
<p>From within a process, it is necessary to be able to control tracing on a per-thread basis. This provides an additional
filtering capability to keep the amount of traced data to a minimum. It also allows for less ambiguity as to the origin of trace
events. It is recognized that thread-level control is only valid from within the process itself. It is also desirable to know the
maximum number of trace streams per process that can be started. The API should not require thread synchronization or mandate
priority inversions that would cause the thread to block. However, the API must be thread-safe.<br>
[Objectives 4,23,24,27]</p>
<p>There was no perceived objective to control tracing on anything larger than a process; for example, a group or session. Also,
the ability to start or stop a trace activity on multiple processes atomically may be very difficult or cumbersome in some
implementations.<br>
[Objectives 6,8]</p>
<p>It is also necessary to be able to control tracing by trace event type identifier, sometimes called a trace hook ID. However,
there is no mandated set of system trace events, since such trace points are implementation-defined. The API must not require from
the operating system facilities that are not standard.<br>
[Objectives 6,26]</p>
<p>Trace control must span <a href="../functions/fork.html"><i>fork</i>()</a> and <a href=
"../functions/pthread_create.html"><i>pthread_create</i>()</a>. If not, there will be no way to ensure that an application's
activity is entirely traced. The newly forked child would not be able to turn on its tracing until after it obtained control after
the fork, and trace control externally would be even more problematic.<br>
[Objective 9]</p>
<p>Since <i>exec</i> and <a href="../functions/posix_spawn.html"><i>posix_spawn</i>()</a> represent a complete change in the
execution of a task (a new program), trace control need not persist over an <i>exec</i> or <a href=
"../functions/posix_spawn.html"><i>posix_spawn</i>()</a>.<br>
[Objective 10]</p>
<p>Where trace activities are started on multiple processes, these trace activities should not interfere with each other.<br>
[Objective 21]</p>
<p>There is no need for a triggering objective, primarily for performance reasons; see also <a href="#tag_03_02_11_32">Rationale on
Triggering</a> , rationale on triggering.<br>
[Objective 11]</p>
<p>It must be possible to determine the origin of each traced event. The process and thread identifiers for each trace event are
needed. Also there was a perceived need for a user-specifiable origin, but it was felt that this would create too much
overhead.<br>
[Objectives 12,14]</p>
<p>An allowance must be made for trace points to come embedded in software components from several different sources and vendors
without requiring coordination.<br>
[Objective 13]</p>
<p>There is a requirement to be able to uniquely identify trace points that may have the same trace stream identifier. This is only
necessary when a trace report is produced.<br>
[Objectives 12,14]</p>
<p>Tracing is a very performance-sensitive activity, and will therefore likely be implemented at a low level within the system.
Hence the interface must not mandate any particular buffering or storage method. Therefore, a standard API is needed to read a
trace stream. Also the interface must not mandate the format of the trace data, and the interface must not assume a trace storage
method. Due to the possibility of a monolithic kernel and the possible presence of multiple processes capable of running trace
activities, the two kinds of trace events may be stored in two separate streams for performance reasons. A mandatory dump
mechanism, common in some existing practice, has been avoided to allow the implementation of this set of functions on small
realtime profiles for which the concept of a file system is not defined. The trace API calls should be implemented as macros.<br>
[Objectives 15,16,25,30]</p>
<p>Since a trace facility is a valuable service tool, the output (or log) of a completed trace stream that is written to permanent
storage must be readable on other systems of the type that produced the trace log. Note that there is no objective to be able to
interpret a trace log that was not successfully completed.<br>
[Objectives 17,18,19]</p>
<p>For trace streams written to permanent storage, a way to specify the destination of the trace stream is needed.<br>
[Objective 20]</p>
<p>There is a requirement to be able to depend on the ordering of trace events up to some implementation-defined time interval. For
example, there is a need to know the time period during which, if trace events are closer together, their ordering is unspecified.
Events that occur within an interval smaller than this resolution may or may not be read back in the correct order.<br>
[Objective 24]</p>
<p>The application should be able to know how much data can be traced. When trace event types can be filtered, the application
should be able to specify the approximate maximum amount of data that will be traced in a trace event so resources can be more
efficiently allocated.<br>
[Objectives 28,29]</p>
<p>Users should not be able to trace data to which they would not normally have access. System trace events corresponding to a
process/thread should be associated with the ownership of that process/thread.<br>
[Objective 31]<br>
</p>
<h5><a name="tag_03_02_11_04"></a>Trace Model</h5>
<h5><a name="tag_03_02_11_05"></a>Introduction</h5>
<p>The model is based on two base entities: the &quot;Trace Stream&quot; and the &quot;Trace Log&quot;, and a recorded unit called the &quot;Trace
Event&quot;. The possibility of using Trace Streams and Trace Logs separately gives two use dimensions and solves both the performance
issue and the full-information system issue. In the case of a trace stream without log, specific information, although reduced in
quantity, is required to be registered, possibly in a small realtime system, with as little overhead as possible; making the Trace
Log an option accommodates such small realtime systems. In the case of a trace stream with log, a considerable amount of complex
application-specific information needs to be collected.</p>
<h5><a name="tag_03_02_11_06"></a>Trace Model Description</h5>
<p>The trace model can be examined for three different subfunctions: Application Instrumentation, Trace Operation Control, and
Trace Analysis.</p>
<dl compact>
<dt></dt>
<dd><img src=".././Figures/b-2.gif"></dd>
</dl>
<center><b><a name="tagfcjh_2"></a> Figure: Trace System Overview: for Offline Analysis</b></center>
<p>Each of these subfunctions requires specific characteristics of the trace mechanism API.</p>
<ul>
<li>
<p>Application Instrumentation</p>
<p>When instrumenting an application, the programmer is not concerned about the future use of the trace events in the trace stream
or the trace log, the full policy of the trace stream, or the eventual pre-filtering of trace events. But he is concerned about the
correct determination of the specific trace event type identifier, regardless of how many independent libraries are used in the
same user application; see <a href="#tagfcjh_2">Trace System Overview: for Offline Analysis</a> and <a href="#tagfcjh_3">Trace
System Overview: for Online Analysis</a> .</p>
<p>This trace API provides the necessary operations to accomplish this subfunction. This is done by providing functions to
associate a programmer-defined name with an implementation-defined trace event type identifier (see the <a href=
"../functions/posix_trace_eventid_open.html"><i>posix_trace_eventid_open</i>()</a> function), and to send this trace event into a
potential trace stream (see the <a href="../functions/posix_trace_event.html"><i>posix_trace_event</i>()</a> function).<br>
</p>
</li>
<li>
<p>Trace Operation Control</p>
<p>When controlling the recording of trace events in a trace stream, the programmer is concerned with the correct initialization of
the trace mechanism (that is, the sizing of the trace stream), the correct retention of trace events in a permanent storage, the
correct dynamic recording of trace events, and so on.</p>
<p>This trace API provides the necessary material to permit this efficiently. This is done by providing functions to initialize a
new trace stream, and optionally a trace log:</p>
<ul>
<li>
<p>Trace Stream Attributes Object Initialization (see <a href=
"../functions/posix_trace_attr_init.html"><i>posix_trace_attr_init</i>()</a>)</p>
</li>
<li>
<p>Functions to Retrieve or Set Information About a Trace Stream (see <a href=
"../functions/posix_trace_attr_getgenversion.html"><i>posix_trace_attr_getgenversion</i>()</a>)</p>
</li>
<li>
<p>Functions to Retrieve or Set the Behavior of a Trace Stream (see <a href=
"../functions/posix_trace_attr_getinherited.html"><i>posix_trace_attr_getinherited</i>()</a>)</p>
</li>
<li>
<p>Functions to Retrieve or Set Trace Stream Size Attributes (see <a href=
"../functions/posix_trace_attr_getmaxusereventsize.html"><i>posix_trace_attr_getmaxusereventsize</i>()</a>)</p>
</li>
<li>
<p>Trace Stream Initialization, Flush, and Shutdown from a Process (see <a href=
"../functions/posix_trace_create.html"><i>posix_trace_create</i>()</a>)</p>
</li>
<li>
<p>Clear Trace Stream and Trace Log (see <a href="../functions/posix_trace_clear.html"><i>posix_trace_clear</i>()</a>)</p>
</li>
</ul>
<p>To select the trace event types that are to be traced:</p>
<ul>
<li>
<p>Manipulate Trace Event Type Identifier (see <a href=
"../functions/posix_trace_trid_eventid_open.html"><i>posix_trace_trid_eventid_open</i>()</a>)</p>
</li>
<li>
<p>Iterate over a Mapping of Trace Event Type (see <a href=
"../functions/posix_trace_eventtypelist_getnext_id.html"><i>posix_trace_eventtypelist_getnext_id</i>()</a>)</p>
</li>
<li>
<p>Manipulate Trace Event Type Sets (see <a href=
"../functions/posix_trace_eventset_empty.html"><i>posix_trace_eventset_empty</i>()</a>)</p>
</li>
<li>
<p>Set Filter of an Initialized Trace Stream (see <a href=
"../functions/posix_trace_set_filter.html"><i>posix_trace_set_filter</i>()</a>)</p>
</li>
</ul>
<p>To control the execution of an active trace stream:</p>
<ul>
<li>
<p>Trace Start and Stop (see <a href="../functions/posix_trace_start.html"><i>posix_trace_start</i>()</a>)</p>
</li>
<li>
<p>Functions to Retrieve the Trace Attributes or Trace Statuses (see <a href=
"../functions/posix_trace_get_attr.html"><i>posix_trace_get_attr</i>()</a>)</p>
</li>
</ul>
<img src=".././Figures/b-3.gif">
<center><b><a name="tagfcjh_3"></a> Figure: Trace System Overview: for Online Analysis</b></center>
<br>
</li>
<li>
<p>Trace Analysis</p>
<p>Once correctly recorded, on permanent storage or not, an ultimate activity consists of the analysis of the recorded information.
If the recorded data is on permanent storage, a specific open operation is required to associate a trace stream to a trace log.</p>
<p>The first intent of the group was to request the presence of a system identification structure in the trace stream attribute.
This was, for the application, to allow some portable way to process the recorded information. However, there is no requirement
that the <b>utsname</b> structure, on which this system identification was based, be portable from one machine to another, so the
contents of the attribute cannot be interpreted correctly by an application conforming to IEEE&nbsp;Std&nbsp;1003.1-2001.</p>
<p>This modification has been incorporated and requests that some unspecified information be recorded in the trace log in order to
fail opening it if the analysis process and the controller process were running in different types of machine, but does not request
that this information be accessible to the application. This modification has implied a modification in the <a href=
"../functions/posix_trace_open.html"><i>posix_trace_open</i>()</a> function error code returns.</p>
<p>This trace API provides functions to:</p>
<ul>
<li>
<p>Extract trace stream identification attributes (see <a href=
"../functions/posix_trace_attr_getgenversion.html"><i>posix_trace_attr_getgenversion</i>()</a>)</p>
</li>
<li>
<p>Extract trace stream behavior attributes (see <a href=
"../functions/posix_trace_attr_getinherited.html"><i>posix_trace_attr_getinherited</i>()</a>)</p>
</li>
<li>
<p>Extract trace event, stream, and log size attributes (see <a href=
"../functions/posix_trace_attr_getmaxusereventsize.html"><i>posix_trace_attr_getmaxusereventsize</i>()</a>)</p>
</li>
<li>
<p>Look up trace event type names (see <a href=
"../functions/posix_trace_eventid_get_name.html"><i>posix_trace_eventid_get_name</i>()</a>)</p>
</li>
<li>
<p>Iterate over trace event type identifiers (see <a href=
"../functions/posix_trace_eventtypelist_getnext_id.html"><i>posix_trace_eventtypelist_getnext_id</i>()</a>)</p>
</li>
<li>
<p>Open, rewind, and close a trace log (see <a href="../functions/posix_trace_open.html"><i>posix_trace_open</i>()</a>)</p>
</li>
<li>
<p>Read trace stream attributes and status (see <a href=
"../functions/posix_trace_get_attr.html"><i>posix_trace_get_attr</i>()</a>)</p>
</li>
<li>
<p>Read trace events (see <a href="../functions/posix_trace_getnext_event.html"><i>posix_trace_getnext_event</i>()</a>)</p>
</li>
</ul>
</li>
</ul>
<p>Due to the following two reasons:</p>
<ol>
<li>
<p>The requirement that the trace system must not add unacceptable overhead to the traced process and so that the trace event point
execution must be fast</p>
</li>
<li>
<p>The traced application does not care about tracing errors</p>
</li>
</ol>
<p>the trace system cannot return any internal error to the application. Internal error conditions can range from unrecoverable
errors that will force the active trace stream to abort, to small errors that can affect the quality of tracing without aborting
the trace stream. The group decided to define a system trace event to report to the analysis process such internal errors. It is
not the intention of IEEE&nbsp;Std&nbsp;1003.1-2001 to require an implementation to report an internal error that corrupts or
terminates tracing operation. The implementor is free to decide which internal documented errors, if any, the trace system is able
to report.<br>
</p>
<h5><a name="tag_03_02_11_07"></a>States of a Trace Stream</h5>
<dl compact>
<dt></dt>
<dd><img src=".././Figures/b-4.gif"></dd>
</dl>
<center><b><a name="tagfcjh_4"></a> Figure: Trace System Overview: States of a Trace Stream</b></center>
<p><a href="#tagfcjh_4">Trace System Overview: States of a Trace Stream</a> shows the different states an active trace stream
passes through. After the <a href="../functions/posix_trace_create.html"><i>posix_trace_create</i>()</a> function call, a trace
stream becomes CREATED and a trace stream is associated for the future collection of trace events. The status of the trace stream
is POSIX_TRACE_SUSPENDED. The state becomes STARTED after a call to the <a href=
"../functions/posix_trace_start.html"><i>posix_trace_start</i>()</a> function, and the status becomes POSIX_TRACE_RUNNING. In this
state, all trace events that are not filtered out will be stored into the trace stream. After a call to <a href=
"../functions/posix_trace_stop.html"><i>posix_trace_stop</i>()</a>, the trace stream becomes STOPPED (and the status
POSIX_TRACE_SUSPENDED). In this state, no new trace events will be recorded in the trace stream, but previously recorded trace
events may continue to be read.</p>
<p>After a call to <a href="../functions/posix_trace_shutdown.html"><i>posix_trace_shutdown</i>()</a>, the trace stream is in the
state COMPLETED. The trace stream no longer exists but, if the Trace Log option is supported, all the information contained in it
has been logged. If a log object has not been associated with the trace stream at the creation, it is the responsibility of the
trace controller process to not shut the trace stream down while trace events remain to be read in the stream.</p>
<h5><a name="tag_03_02_11_08"></a>Tracing All Processes</h5>
<p>Some implementations have a tracing subsystem with the ability to trace all processes. This is useful to debug some types of
device drivers such as those for ATM or X25 adapters. These types of adapters are used by several independent processes, that are
not issued from the same process.</p>
<p>The POSIX trace interface does not define any constant or option to create a trace stream tracing all processes. POSIX.1 does
not prevent this type of implementation and an implementor is free to add this capability. Nevertheless, the trace interface allows
tracing of all the system trace events and all the processes issued from the same process.</p>
<p>If such a tracing system capability has to be implemented, when a trace stream is created, it is recommended that a constant
named POSIX_TRACE_ALLPROC be used instead of the process identifier in the argument of the <a href=
"../functions/posix_trace_create.html"><i>posix_trace_create</i>()</a> or <a href=
"../functions/posix_trace_create_withlog.html"><i>posix_trace_create_withlog</i>()</a> function. A possible value for
POSIX_TRACE_ALLPROC may be -1 instead of a real process identifier.</p>
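<p>Under that recommendation, creating such a trace stream might look like the following fragment; POSIX_TRACE_ALLPROC is not
defined by IEEE&nbsp;Std&nbsp;1003.1-2001 and is shown purely for illustration:</p>
<blockquote>
<pre>
<tt>#include &lt;trace.h&gt;

/* Caution. Error checks omitted; POSIX_TRACE_ALLPROC is hypothetical */
{
    trace_attr_t attr;
    trace_id_t trid;

    posix_trace_attr_init(&amp;attr);
    /* An implementation might define POSIX_TRACE_ALLPROC as (pid_t)-1 */
    posix_trace_create(POSIX_TRACE_ALLPROC, &amp;attr, &amp;trid);
}
</tt>
</pre>
</blockquote>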
<p>The implementor has to be aware that there is some impact on the tracing behavior as defined in the POSIX trace interface. For
example:</p>
<ul>
<li>
<p>If the default value for the inheritance attribute is set to POSIX_TRACE_CLOSE_FOR_CHILD, the implementation has to stop tracing
for the child process.</p>
</li>
<li>
<p>The trace controller which is creating this type of trace stream must have the appropriate privilege to trace all the
processes.</p>
</li>
</ul>
<h5><a name="tag_03_02_11_09"></a>Trace Storage</h5>
<p>The model is based on two types of trace events: system trace events and user-defined trace events. The internal representation
of trace events is implementation-defined, and so the implementor is free to choose the most suitable, practical, and efficient way
to design the internal management of trace events. For the timestamping operation, the model does not impose the CLOCK_REALTIME or
any other clock. The buffering allocation and operation follow the same principle. The implementor is free to use one or more
buffers to record trace events; the interface assumes only a logical trace stream of sequentially recorded trace events. Regarding
flushing of trace events, the interface allows the definition of a trace log object, which typically can be a file. But the group
was also careful to define functions that permit the use of this interface in small realtime systems, which may not have general file
system capabilities. For instance, the three functions <a href=
"../functions/posix_trace_getnext_event.html"><i>posix_trace_getnext_event</i>()</a> (blocking), <a href=
"../functions/posix_trace_timedgetnext_event.html"><i>posix_trace_timedgetnext_event</i>()</a> (blocking with timeout), and <a
href="../functions/posix_trace_trygetnext_event.html"><i>posix_trace_trygetnext_event</i>()</a> (non-blocking) are proposed to read
the recorded trace events.</p>
<p>The policy to be used when the trace stream becomes full also relies on common practice:</p>
<ul>
<li>
<p>For an active trace stream, the POSIX_TRACE_LOOP trace stream policy permits automatic overrun (overwrite of oldest trace
events) while waiting for some user-defined condition to cause tracing to stop. By contrast, the POSIX_TRACE_UNTIL_FULL trace
stream policy requires the system to stop tracing when the trace stream is full. However, if the trace stream that is full is at
least partially emptied by a call to the <a href="../functions/posix_trace_flush.html"><i>posix_trace_flush</i>()</a> function or
by calls to the <a href="../functions/posix_trace_getnext_event.html"><i>posix_trace_getnext_event</i>()</a> function, the trace
system will automatically resume tracing.</p>
<p>If the Trace Log option is supported, the operation of the POSIX_TRACE_FLUSH policy is an extension of the
POSIX_TRACE_UNTIL_FULL policy. The automatic free operation (by flushing to the associated trace log) is added.</p>
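<p>As an illustration, a controller process can select one of these policies through the attributes object before the stream is
created; a minimal hedged fragment (error checks omitted, as in the other examples):</p>
<blockquote>
<pre>
<tt>#include &lt;trace.h&gt;

/* Caution. Error checks omitted */
{
    trace_attr_t attr;

    posix_trace_attr_init(&amp;attr);
    /* Overwrite the oldest trace events when the stream fills */
    posix_trace_attr_setstreamfullpolicy(&amp;attr, POSIX_TRACE_LOOP);
    /* ... then create the trace stream with these attributes ... */
}
</tt>
</pre>
</blockquote>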
</li>
<li>
<p>If a log is associated with the trace stream and this log is a regular file, these policies also apply for the log. One more
policy, POSIX_TRACE_APPEND, is defined to allow indefinite extension of the log. Since the log destination can be any device or
pseudo-device, the implementation may not be able to manipulate the destination as required by IEEE&nbsp;Std&nbsp;1003.1-2001. For
this reason, the behavior of the log full policy may be unspecified depending on the trace log type.</p>
<p>The current trace interface does not define a service to preallocate space for a trace log file, because this space can be
preallocated by means of a call to the <a href="../functions/posix_fallocate.html"><i>posix_fallocate</i>()</a> function. This
function could be called after the file has been opened, but before the trace stream is created. The <a href=
"../functions/posix_fallocate.html"><i>posix_fallocate</i>()</a> function ensures that any required storage for regular file data
is allocated on the file system storage media. If <a href="../functions/posix_fallocate.html"><i>posix_fallocate</i>()</a> returns
successfully, subsequent writes to the specified file data will not fail due to the lack of free space on the file system storage
media. Besides trace events, a trace stream also includes trace attributes and the mapping from trace event names to trace event
type identifiers. The implementor is free to choose how to store the trace attributes and the trace event type map, but must ensure
that this information is not lost when a trace stream overrun occurs.</p>
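<p>A hedged fragment showing such preallocation before the trace stream is created (the log name and size are illustrative; error
checks omitted):</p>
<blockquote>
<pre>
<tt>#include &lt;fcntl.h&gt;
#include &lt;sys/stat.h&gt;

/* Caution. Error checks omitted */
{
    int fd;

    fd = open("/tmp/mytracelog", O_WRONLY|O_CREAT, S_IRUSR|S_IWUSR);
    /* Reserve storage so later flushes cannot fail for lack of space */
    posix_fallocate(fd, 0, 1024*1024);
    /* ... then pass fd to posix_trace_create_withlog() ... */
}
</tt>
</pre>
</blockquote>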
</li>
</ul>
<h5><a name="tag_03_02_11_10"></a>Trace Programming Examples</h5>
<p>Several programming examples are presented to show the code of the different possible subfunctions using a trace subsystem. All
these programs need to include the <a href="../basedefs/trace.h.html"><i>&lt;trace.h&gt;</i></a> header. In the examples shown,
error checking is omitted for simplicity.</p>
<h5><a name="tag_03_02_11_11"></a>Trace Operation Control</h5>
<p>These examples show the creation of a trace stream for another process; one which is already trace instrumented. All the default
trace stream attributes are used to simplify programming in the first example. The second example shows more possibilities.</p>
<h5><a name="tag_03_02_11_12"></a>First Example</h5>
<pre>
<tt>/* Caution. Error checks omitted */
{
trace_attr_t attr;
pid_t pid = traced_process_pid;
int fd;
trace_id_t trid;
<br>
- - - - - -
/* Initialize trace stream attributes */
posix_trace_attr_init(&amp;attr);
/* Open a trace log */
fd=open("/tmp/mytracelog",...);
/*
* Create a new trace associated with a log
* and with default attributes
*/
<br>
posix_trace_create_withlog(pid, &amp;attr, fd, &amp;trid);
<br>
/* Trace attribute structure can now be destroyed */
posix_trace_attr_destroy(&amp;attr);
/* Start of trace event recording */
posix_trace_start(trid);
- - - - - -
- - - - - -
/* Duration of tracing */
- - - - - -
- - - - - -
/* Stop and shutdown of trace activity */
posix_trace_shutdown(trid);
- - - - - -
}
</tt>
</pre>
<h5><a name="tag_03_02_11_13"></a>Second Example</h5>
<p>Between the initialization of the trace stream attributes and the creation of the trace stream, these trace stream attributes
may be modified; see <a href="#tag_03_02_11_19">Trace Stream Attribute Manipulation</a> for a specific programming example. Between
the creation and the start of the trace stream, the event filter may be set; after the trace stream is started, the event filter
may be changed. The setting of an event set and the changing of a filter are shown in <a href="#tag_03_02_11_20">Create a Trace Event
Type Set and Change the Trace Event Type Filter</a> .</p>
<pre>
<tt>/* Caution. Error checks omitted */
{
trace_attr_t attr;
pid_t pid = traced_process_pid;
int fd;
trace_id_t trid;
- - - - - -
/* Initialize trace stream attributes */
posix_trace_attr_init(&amp;attr);
/* Attr default may be changed at this place; see example */
- - - - - -
/* Create and open a trace log with R/W user access */
fd=open("/tmp/mytracelog",O_WRONLY|O_CREAT,S_IRUSR|S_IWUSR);
/* Create a new trace associated with a log */
posix_trace_create_withlog(pid, &amp;attr, fd, &amp;trid);
/*
* If the Trace Filter option is supported
* trace event type filter default may be changed at this place;
* see example about changing the trace event type filter
*/
posix_trace_start(trid);
- - - - - -
<br>
/*
* If you have an uninteresting part of the application
* you can stop temporarily.
*
* posix_trace_stop(trid);
* - - - - - -
* - - - - - -
* posix_trace_start(trid);
*/
- - - - - -
/*
* If the Trace Filter option is supported
* the current trace event type filter can be changed
* at any time (see example about how to set
* a trace event type filter)
*/
- - - - - -
<br>
/* Stop the recording of trace events */
posix_trace_stop(trid);
/* Shutdown the trace stream */
posix_trace_shutdown(trid);
/*
* Destroy trace stream attributes; attr structure may have
* been used during tracing to fetch the attributes
*/
posix_trace_attr_destroy(&amp;attr);
- - - - - -
}
</tt>
</pre>
<h5><a name="tag_03_02_11_14"></a>Application Instrumentation</h5>
<p>This example shows an instrumented application. The code is included in a block of instructions, perhaps a function from a
library. Possibly in an initialization part of the instrumented application, two user trace event names are mapped to two trace
event type identifiers (function <a href="../functions/posix_trace_eventid_open.html"><i>posix_trace_eventid_open</i>()</a>). Then
two trace points are programmed.</p>
<pre>
<tt>/* Caution. Error checks omitted */
{
trace_event_id_t eventid1, eventid2;
- - - - - -
/* Initialization of two trace event type ids */
posix_trace_eventid_open("my_first_event",&amp;eventid1);
posix_trace_eventid_open("my_second_event",&amp;eventid2);
- - - - - -
- - - - - -
- - - - - -
/* Trace point */
posix_trace_event(eventid1,NULL,0);
- - - - - -
/* Trace point */
posix_trace_event(eventid2,NULL,0);
- - - - - -
}
</tt>
</pre>
<h5><a name="tag_03_02_11_15"></a>Trace Analyzer</h5>
<p>This example shows the manipulation of a trace log resulting from the dumping of a completed trace stream. All the default
attributes are used to simplify programming, and data associated with a trace event is not shown in the first example. The second
example shows more possibilities.</p>
<h5><a name="tag_03_02_11_16"></a>First Example</h5>
<pre>
<tt>/* Caution. Error checks omitted */
{
int fd;
trace_id_t trid;
struct posix_trace_event_info trace_event;
char trace_event_name[TRACE_EVENT_NAME_MAX];
int return_value;
size_t returndatasize;
int lost_event_number;
<br>
- - - - - -
<br>
/* Open an existing trace log */
fd=open("/tmp/tracelog", O_RDONLY);
/* Open a trace stream on the open log */
posix_trace_open(fd, &amp;trid);
/* Read a trace event */
posix_trace_getnext_event(trid, &amp;trace_event,
NULL, 0, &amp;returndatasize,&amp;return_value);
<br>
/* Read and print all trace event names out in a loop */
while (return_value == 0)
{
/*
* Get the name of the trace event associated
* with trid trace ID
*/
posix_trace_eventid_get_name(trid, trace_event.posix_event_id,
trace_event_name);
/* Print the trace event name out */
printf("%s\n",trace_event_name);
/* Read a trace event */
posix_trace_getnext_event(trid, &amp;trace_event,
NULL, 0, &amp;returndatasize,&amp;return_value);
}
<br>
/* Close the trace stream */
posix_trace_close(trid);
/* Close the trace log */
close(fd);
}
</tt>
</pre>
<h5><a name="tag_03_02_11_17"></a>Second Example</h5>
<p>The complete example includes the two other examples in <a href="#tag_03_02_11_21">Retrieve Information from a Trace Log</a> and
in <a href="#tag_03_02_11_22">Retrieve the List of Trace Event Types Used in a Trace Log</a> . For example, the <i>maxdatasize</i>
variable is set in <a href="#tag_03_02_11_22">Retrieve the List of Trace Event Types Used in a Trace Log</a> .</p>
<pre>
<tt>/* Caution. Error checks omitted */
{
int fd;
trace_id_t trid;
struct posix_trace_event_info trace_event;
char trace_event_name[TRACE_EVENT_NAME_MAX];
char * data;
size_t maxdatasize=1024, returndatasize;
int return_value;
- - - - - -
<br>
/* Open an existing trace log */
fd=open("/tmp/tracelog", O_RDONLY);
/* Open a trace stream on the open log */
posix_trace_open(fd, &amp;trid);
/*
* Retrieve information about the trace stream which
* was dumped in this trace log (see example)
*/
- - - - - -
<br>
/* Allocate a buffer for trace event data */
data=(char *)malloc(maxdatasize);
/*
* Retrieve the list of trace events used in this
* trace log (see example)
*/
- - - - - -
<br>
/* Read and print all trace event names and data out in a loop */
while (1)
{
posix_trace_getnext_event(trid, &amp;trace_event,
data, maxdatasize, &amp;returndatasize,&amp;return_value);
if (return_value != 0) break;
/*
* Get the name of the trace event type associated
* with trid trace ID
*/
posix_trace_eventid_get_name(trid, trace_event.posix_event_id,
trace_event_name);
{
int i;
<br>
/* Print the trace event name out */
printf("%s: ", trace_event_name);
/* Print the trace event data out */
for (i=0; i&lt;returndatasize; i++) printf("%02X",
(unsigned char)data[i]);
printf("\n");
}
}
<br>
/* Close the trace stream */
posix_trace_close(trid);
/* The buffer data is deallocated */
free(data);
/* Now the file can be closed */
close(fd);
}
</tt>
</pre>
<h5><a name="tag_03_02_11_18"></a>Several Programming Manipulations</h5>
<p>The following examples show some typical sets of operations needed in some contexts.</p>
<h5><a name="tag_03_02_11_19"></a>Trace Stream Attribute Manipulation</h5>
<p>This example shows the manipulation of a trace stream attribute object in order to change the default value provided by a
previous <a href="../functions/posix_trace_attr_init.html"><i>posix_trace_attr_init</i>()</a> call.</p>
<pre>
<tt>/* Caution. Error checks omitted */
{
trace_attr_t attr;
size_t logsize=100000;
- - - - - -
/* Initialize trace stream attributes */
posix_trace_attr_init(&amp;attr);
/* Set the trace name in the attributes structure */
posix_trace_attr_setname(&amp;attr, "my_trace");
/* Set the trace full policy */
posix_trace_attr_setstreamfullpolicy(&amp;attr, POSIX_TRACE_LOOP);
/* Set the trace log size */
posix_trace_attr_setlogsize(&amp;attr, logsize);
- - - - - -
}
</tt>
</pre>
<h5><a name="tag_03_02_11_20"></a>Create a Trace Event Type Set and Change the Trace Event Type Filter</h5>
<p>This example is valid only if the Trace Event Filter option is supported. This example shows the manipulation of a trace event
type set in order to change the trace event type filter for an existing active trace stream, which may be just-created, running, or
suspended. Some sets of trace event types are well-known, such as the set of trace event types not associated with a process, some
trace event types are just-built trace event types for this trace stream; one trace event type is the predefined trace event error
type which is deleted from the trace event type set.</p>
<pre>
<tt>/* Caution. Error checks omitted */
{
trace_id_t trid = existing_trace;
trace_event_set_t set;
trace_event_id_t trace_event1, trace_event2;
- - - - - -
/* Initialize to an empty set of trace event types */
/* (not strictly required because posix_trace_eventset_fill() */
/* will ignore the prior contents of the event set.) */
posix_trace_eventset_empty(&amp;set);
/*
* Fill the set with all system trace events
* not associated with a process
*/
posix_trace_eventset_fill(&amp;set, POSIX_TRACE_WOPID_EVENTS);
<br>
/*
* Get the trace event type identifier of the known trace event name
* my_first_event for the trid trace stream
*/
posix_trace_trid_eventid_open(trid, "my_first_event", &amp;trace_event1);
/* Add this trace event type identifier to the set */
posix_trace_eventset_add(trace_event1, &amp;set);
/*
* Get the trace event type identifier of the known trace event name
* my_second_event for the trid trace stream
*/
<br>
posix_trace_trid_eventid_open(trid, "my_second_event", &amp;trace_event2);
/* Add this trace event type identifier to the set */
posix_trace_eventset_add(trace_event2, &amp;set);
- - - - - -
/* Delete the system trace event POSIX_TRACE_ERROR from the set */
posix_trace_eventset_del(POSIX_TRACE_ERROR, &amp;set);
- - - - - -
<br>
/* Modify the trace stream filter making it equal to the new set */
posix_trace_set_filter(trid, &amp;set, POSIX_TRACE_SET_EVENTSET);
- - - - - -
/*
* Now trace_event1, trace_event2, and all system trace event types
* not associated with a process, except for the POSIX_TRACE_ERROR
* system trace event type, are filtered out of (not recorded in) the
* existing trace stream.
*/
}
</tt>
</pre>
<h5><a name="tag_03_02_11_21"></a>Retrieve Information from a Trace Log</h5>
<p>This example shows how to extract information from a trace log, the dump of a trace stream. This code:</p>
<ul>
<li>
<p>Asks if the trace stream has lost trace events</p>
</li>
<li>
<p>Extracts the information about the version of the trace subsystem which generated this trace log</p>
</li>
<li>
<p>Retrieves the maximum size of trace event data; this may be used to dynamically allocate an array for extracting trace event
data from the trace log without overflow</p>
</li>
</ul>
<pre>
<tt>/* Caution. Error checks omitted */
{
struct posix_trace_status_info statusinfo;
trace_attr_t attr;
trace_id_t trid = existing_trace;
size_t maxdatasize;
char genversion[TRACE_NAME_MAX];
- - - - - -
/* Get the trace stream status */
posix_trace_get_status(trid, &amp;statusinfo);
/* Detect an overrun condition */
if (statusinfo.posix_stream_overrun_status == POSIX_TRACE_OVERRUN)
printf("trace events have been lost\n");
<br>
/* Get attributes from the trid trace stream */
posix_trace_get_attr(trid, &amp;attr);
/* Get the trace generation version from the attributes */
posix_trace_attr_getgenversion(&amp;attr, genversion);
/* Print the trace generation version out */
printf("Information about Trace Generator:%s\n",genversion);
<br>
/* Get the trace event max data size from the attributes */
posix_trace_attr_getmaxdatasize(&amp;attr, &amp;maxdatasize);
/* Print the trace event max data size out */
printf("Maximum size of associated data:%d\n",maxdatasize);
/* Destroy the trace stream attributes */
posix_trace_attr_destroy(&amp;attr);
}
</tt>
</pre>
<h5><a name="tag_03_02_11_22"></a>Retrieve the List of Trace Event Types Used in a Trace Log</h5>
<p>This example shows the retrieval of a trace stream's trace event type list. This operation may be very useful if you are
interested only in tracking the type of trace events in a trace log.</p>
<pre>
<tt>/* Caution. Error checks omitted */
{
trace_id_t trid = existing_trace;
trace_event_id_t event_id;
char event_name[TRACE_EVENT_NAME_MAX];
int return_value;
- - - - - -
<br>
/*
* In a loop print all existing trace event names out
* for the trid trace stream
*/
while (1)
{
posix_trace_eventtypelist_getnext_id(trid, &amp;event_id,
&amp;return_value);
if (return_value != 0) break;
/*
* Get the name of the trace event associated
* with trid trace ID
*/
posix_trace_eventid_get_name(trid, event_id, event_name);
/* Print the name out */
printf("%s\n", event_name);
}
}
</tt>
</pre>
<br>
<h5><a name="tag_03_02_11_23"></a>Rationale on Trace for Debugging</h5>
<dl compact>
<dt></dt>
<dd><img src=".././Figures/b-5.gif"></dd>
</dl>
<center><b><a name="tagfcjh_5"></a> Figure: Trace Another Process</b></center>
<p>Among the different possibilities offered by the trace interface defined in IEEE&nbsp;Std&nbsp;1003.1-2001, the debugging of an
application is the most interesting one. Typical operations in the controlling debugger process are to filter trace event types, to
get trace events from the trace stream, to stop the trace stream when the debugged process is executing uninteresting code, to
start the trace stream when some interesting point is reached, and so on. The interface defined in IEEE&nbsp;Std&nbsp;1003.1-2001
should define all the necessary base functions to allow this dynamic debug handling.</p>
<p><a href="#tagfcjh_5">Trace Another Process</a> shows an example in which the trace stream is created after the call to the <a
href="../functions/fork.html"><i>fork</i>()</a> function. If the user does not want to lose trace events, some synchronization
mechanism (represented in the figure) may be needed before calling the <i>exec</i> function, to give the parent a chance to create
the trace stream before the child begins the execution of its trace points.</p>
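<p>The following minimal sketch illustrates one such synchronization mechanism, using a pipe. It is only an illustration of the
figure, not a mechanism required by IEEE&nbsp;Std&nbsp;1003.1-2001; the traced program path <tt>/bin/traced_app</tt> and the
pipe-based handshake are assumptions of the example.</p>
<pre>
<tt>/* Caution. Error checks omitted */
{
trace_attr_t attr;
trace_id_t trid;
pid_t pid;
int sync_fd[2];
char c;
<br>
posix_trace_attr_init(&amp;attr);
pipe(sync_fd);
pid = fork();
if (pid == 0)
{
/* Child: wait until the parent has created the trace stream */
read(sync_fd[0], &amp;c, 1);
/* The traced program path is an assumption of this sketch */
execl("/bin/traced_app", "traced_app", (char *)0);
}
/* Parent: create and start the trace stream, then release the child */
posix_trace_create(pid, &amp;attr, &amp;trid);
posix_trace_start(trid);
write(sync_fd[1], "x", 1);
posix_trace_attr_destroy(&amp;attr);
}
</tt>
</pre>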
<h5><a name="tag_03_02_11_24"></a>Rationale on Trace Event Type Name Space</h5>
<p>At first, the working group was in favor of the representation of a trace event type by an integer ( <i>event_name</i>). It
seems that existing practice shows the weakness of such a representation. The collision of trace event types is the main problem
that cannot be simply resolved using this sort of representation. Suppose, for example, that a third party designs an instrumented
library. The user does not have the source of this library and wants to trace his application which uses in some part the
third-party library. There is no means for him to know what are the trace event types used in the instrumented library so he has
some chance of duplicating some of them and thus to obtain a contaminated tracing of his application.</p>
<dl compact>
<dt></dt>
<dd><img src=".././Figures/b-6.gif"></dd>
</dl>
<center><b><a name="tagfcjh_6"></a> Figure: Trace Name Space Overview: With Third-Party Library</b></center>
<p>There are requirements to allow program images containing pieces from various vendors to be traced without also requiring those
of any other vendors to coordinate their uses of the trace facility, and especially the naming of their various trace event types
and trace point IDs. The chosen solution is to provide a very large name space, large enough so that the individual vendors can
give their trace types and tracepoint IDs sufficiently long and descriptive names making the occurrence of collisions quite
unlikely. The probability of collision is thus made sufficiently low so that the problem may, as a practical matter, be ignored. By
requirement, the consequence of collisions will be a slight ambiguity in the trace streams; tracing will continue in spite of
collisions and ambiguities. &quot;The show must go on&quot;. The <i>posix_prog_address</i> member of the <b>posix_trace_event_info</b>
structure is used to allow trace streams to be unambiguously interpreted, despite the fact that trace event types and trace event
names need not be unique.</p>
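<p>As an illustration of this point, a trace analyzer can print the trace point address next to each trace event name, so that two
colliding names remain distinguishable. The sketch below is only illustrative; it assumes an already opened trace stream
<i>trid</i>.</p>
<pre>
<tt>/* Caution. Error checks omitted */
{
trace_id_t trid = existing_trace;
struct posix_trace_event_info trace_event;
char name[TRACE_EVENT_NAME_MAX];
size_t returndatasize;
int return_value;
<br>
while (1)
{
posix_trace_getnext_event(trid, &amp;trace_event,
NULL, 0, &amp;returndatasize, &amp;return_value);
if (return_value != 0) break;
/* Get the name; the trace point address disambiguates collisions */
posix_trace_eventid_get_name(trid, trace_event.posix_event_id, name);
printf("%s at %p\n", name, trace_event.posix_prog_address);
}
}
</tt>
</pre>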
<p>The <a href="../functions/posix_trace_eventid_open.html"><i>posix_trace_eventid_open</i>()</a> function is required to allow the
instrumented third-party library to get a valid trace event type identifier for its trace event names. This operation is, somehow,
an allocation, and the group was aware of proposing some deallocation mechanism which the instrumented application could use to
recover the resources used by a trace event type identifier. This would have given the instrumented application the benefit of
being capable of reusing a possible minimum set of trace event type identifiers, but also the inconvenience to have, possibly in
the same trace stream, one trace event type identifier identifying two different trace event types. After some discussions the
group decided to not define such a function which would make this API thicker for little benefit, the user having always the
possibility of adding identification information in the <i>data</i> member of the trace event structure.</p>
<p>The set of the trace event type identifiers the controlling process wants to filter out is initialized in the trace mechanism
using the function <a href="../functions/posix_trace_set_filter.html"><i>posix_trace_set_filter</i>()</a>, setting the arguments
according to the definitions explained in <a href="../functions/posix_trace_set_filter.html"><i>posix_trace_set_filter</i>()</a>.
This operation can be done statically (when the trace is in the STOPPED state) or dynamically (when the trace is in the STARTED
state). The preparation of the filter is normally done using the function defined in <a href=
"../functions/posix_trace_eventtypelist_getnext_id.html"><i>posix_trace_eventtypelist_getnext_id</i>()</a> and eventually the
function <a href="../functions/posix_trace_eventtypelist_rewind.html"><i>posix_trace_eventtypelist_rewind</i>()</a> in order to
know (before the recording) the list of the potential set of trace event types that can be recorded. In the case of an active trace
stream, this list may not be exhaustive. Actually, the target process may not have yet called the function <a href=
"../functions/posix_trace_eventid_open.html"><i>posix_trace_eventid_open</i>()</a>. But it is a common practice, for a controlling
process, to prepare the filtering of a future trace stream before its start. Therefore the user must have a way to get the trace
event type identifier corresponding to a well-known trace event name before its future association by the pre-cited function. This
is done by calling the <a href="../functions/posix_trace_trid_eventid_open.html"><i>posix_trace_trid_eventid_open</i>()</a>
function, given the trace stream identifier and the trace name, and described hereafter. Because this trace event type identifier
is associated with a trace stream identifier, where a unique process has initialized two or more traces, the implementation is
expected to return the same trace event type identifier for successive calls to <a href=
"../functions/posix_trace_trid_eventid_open.html"><i>posix_trace_trid_eventid_open</i>()</a> with different trace stream
identifiers. The <a href="../functions/posix_trace_eventid_get_name.html"><i>posix_trace_eventid_get_name</i>()</a> function is
used by the controller process to identify, by the name, the trace event type returned by a call to the <a href=
"../functions/posix_trace_eventtypelist_getnext_id.html"><i>posix_trace_eventtypelist_getnext_id</i>()</a> function.</p>
<p>Afterwards, the set of trace event types is constructed using the functions defined in <a href=
"../functions/posix_trace_eventset_empty.html"><i>posix_trace_eventset_empty</i>()</a>, <a href=
"../functions/posix_trace_eventset_fill.html"><i>posix_trace_eventset_fill</i>()</a>, <a href=
"../functions/posix_trace_eventset_add.html"><i>posix_trace_eventset_add</i>()</a>, and <a href=
"../functions/posix_trace_eventset_del.html"><i>posix_trace_eventset_del</i>()</a>.</p>
<p>A set of functions is provided devoted to the manipulation of the trace event type identifier and names for an active trace
stream. All these functions require the trace stream identifier argument as the first parameter. The opacity of the trace event
type identifier implies that the user cannot associate directly its well-known trace event name with the system-associated trace
event type identifier.</p>
<p>The <a href="../functions/posix_trace_trid_eventid_open.html"><i>posix_trace_trid_eventid_open</i>()</a> function allows the
application to get the system trace event type identifier back from the system, given its well-known trace event name. This
function is useful only when a controlling process needs to specify specific events to be filtered.</p>
<p>The <a href="../functions/posix_trace_eventid_get_name.html"><i>posix_trace_eventid_get_name</i>()</a> function allows the
application to obtain a trace event name given its trace event type identifier. One possible use of this function is to identify
the type of a trace event retrieved from the trace stream, and print it. The easiest way to implement this requirement is to use a
single trace event type map for all the processes whose maps are required to be identical. A more difficult way is to attempt to
keep multiple maps identical at every call to <a href=
"../functions/posix_trace_eventid_open.html"><i>posix_trace_eventid_open</i>()</a> and <a href=
"../functions/posix_trace_trid_eventid_open.html"><i>posix_trace_trid_eventid_open</i>()</a>.</p>
<h5><a name="tag_03_02_11_25"></a>Rationale on Trace Events Type Filtering</h5>
<p>The most basic rationale for runtime and pre-registration filtering (selection/rejection) of trace event types is to prevent
choking of the trace collection facility, and/or overloading of the computer system. Any worthwhile trace facility can bring even
the largest computer to its knees. Otherwise, everything would be recorded and filtered after the fact; it would be much simpler,
but impractical.</p>
<p>To achieve debugging, measurement, or whatever the purpose of tracing, the filtering of trace event types is an important part
of trace analysis. Because the trace events are put into a trace stream and probably logged afterwards into a file,
different levels of filtering (that is, rejection of trace event types) are possible.</p>
<h5><a name="tag_03_02_11_26"></a>Filtering of Trace Event Types Before Tracing</h5>
<p>This function, represented by the <a href="../functions/posix_trace_set_filter.html"><i>posix_trace_set_filter</i>()</a>
function in IEEE&nbsp;Std&nbsp;1003.1-2001 (see <a href=
"../functions/posix_trace_set_filter.html"><i>posix_trace_set_filter</i>()</a>), selects, before or during tracing, the set of
trace event types to be filtered out. It should be possible also (as OSF suggested in their ETAP trace specifications) to select
the kernel trace event types to be traced in a system-wide fashion. These two functionalities are called the pre-filtering of trace
event types.</p>
<p>The restriction on the actual type used for the <b>trace_event_set_t</b> type is intended to guarantee that these objects can
always be assigned, have their address taken, and be passed by value as parameters. It is not intended that this type be a
structure including pointers to other data structures, as that could impact the portability of applications performing such
operations. A reasonable implementation could be a structure containing an array of integer types.</p>
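<p>A minimal sketch of such a representation follows; the word count and the use of <b>unsigned long</b> are purely illustrative
assumptions, since the actual representation is implementation-defined.</p>
<pre>
<tt>/* One possible (purely illustrative) implementation: a structure
* containing an array of integer types, one bit per trace event
* type. It contains no pointers, so objects of this type can be
* assigned and passed by value. */
#define _TRACE_EVENT_SET_WORDS 8 /* assumed size */
typedef struct {
unsigned long _bits[_TRACE_EVENT_SET_WORDS];
} trace_event_set_t;
</tt>
</pre>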
<h5><a name="tag_03_02_11_27"></a>Filtering of Trace Event Types at Runtime</h5>
<p>It is possible to build this functionality using the <a href=
"../functions/posix_trace_set_filter.html"><i>posix_trace_set_filter</i>()</a> function. A privileged process or a privileged
thread can get trace events from the trace stream of another process or thread, and thus specify the type of trace events to record
into a file, using implementation-defined methods and interfaces. This functionality, called inline filtering of trace event types,
is used for runtime analysis of trace streams.</p>
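<p>A minimal sketch of such inline filtering follows. It assumes an active trace stream created without a log, an output file
descriptor <i>out_fd</i> opened by the analyzer, and a hypothetical application predicate <i>interesting</i>() that selects the
trace events to keep; none of these are specified by IEEE&nbsp;Std&nbsp;1003.1-2001.</p>
<pre>
<tt>/* Caution. Error checks omitted */
{
trace_id_t trid = existing_trace; /* active stream, no log */
int out_fd = record_file_fd;      /* assumed output file */
struct posix_trace_event_info trace_event;
char data[1024];
size_t returndatasize;
int return_value;
<br>
while (1)
{
posix_trace_getnext_event(trid, &amp;trace_event,
data, sizeof data, &amp;returndatasize, &amp;return_value);
if (return_value != 0) break;
/* interesting() is a hypothetical application predicate */
if (interesting(&amp;trace_event))
{
write(out_fd, &amp;trace_event, sizeof trace_event);
write(out_fd, data, returndatasize);
}
}
}
</tt>
</pre>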
<h5><a name="tag_03_02_11_28"></a>Post-Mortem Filtering of Trace Event Types</h5>
<p>The word &quot;post-mortem&quot; is used here to indicate that some unanticipated situation occurs during execution that does not permit
a pre or inline filtering of trace events and that it is necessary to record all trace event types to have a chance to discover the
problem afterwards. When the program stops, all the trace events recorded previously can be analyzed in order to find the solution.
This functionality could be named the post-filtering of trace event types.</p>
<h5><a name="tag_03_02_11_29"></a>Discussions about Trace Event Type-Filtering</h5>
<p>After long discussions with the parties involved in the process of defining the trace interface, it seems that the sensitivity
to the filtering problem is different, but everybody agrees that the level of the overhead introduced during the tracing operation
depends on the filtering method elected. If the time that it takes the trace event to be recorded can be neglected, the overhead
introduced by the filtering process can be classified as follows:</p>
<dl compact>
<dt>Pre-filtering</dt>
<dd>System and process/thread-level overhead</dd>
<dt>Inline-filtering</dt>
<dd>Process/thread-level overhead</dd>
<dt>Post-filtering</dt>
<dd>No overhead; done offline</dd>
</dl>
<p>The pre-filtering could be named &quot;critical realtime&quot; filtering in the sense that the filtering of trace event type is
manageable at the user level so the user can lower to a minimum the filtering overhead at some user selected level of priority for
the inline filtering, or delay the filtering to after execution for the post-filtering. The counterpart of this solution is that
the size of the trace stream must be sufficient to record all the trace events. The advantage of the pre-filtering is that the
utilization of the trace stream is optimized.</p>
<p>Only pre-filtering is defined by IEEE&nbsp;Std&nbsp;1003.1-2001. However, great care must be taken in specifying pre-filtering,
so that it does not impose unacceptable overhead. Moreover, it is necessary to isolate all the functionality relative to the
pre-filtering.</p>
<p>The result of this rationale is to define a new option, the Trace Event Filter option, not necessarily implemented in small
realtime systems, where system overhead is minimized to the extent possible.</p>
<h5><a name="tag_03_02_11_30"></a>Tracing, pthread API</h5>
<p>The objective to be able to control tracing for individual threads may be in conflict with the efficiency expected in threads
with a <i>contentionscope</i> attribute of PTHREAD_SCOPE_PROCESS. For these threads, context switches from one thread that has
tracing enabled to another thread that has tracing disabled may require a kernel call to inform the kernel whether it has to trace
system events executed by that thread or not. For this reason, it was proposed that the ability to enable or disable tracing for
PTHREAD_SCOPE_PROCESS threads be made optional, through the introduction of a Trace Scope Process option. A trace implementation
which did not implement the Trace Scope Process option would not honor the tracing-state attribute of a thread with
PTHREAD_SCOPE_PROCESS; it would, however, honor the tracing-state attribute of a thread with PTHREAD_SCOPE_SYSTEM. This proposal
was rejected as:</p>
<ol>
<li>
<p>Removing desired functionality (per-thread trace control)</p>
</li>
<li>
<p>Introducing counter-intuitive behavior for the tracing-state attribute</p>
</li>
<li>
<p>Mixing logically orthogonal ideas (thread scheduling and thread tracing)<br>
[Objective 4]</p>
</li>
</ol>
<p>Finally, to solve this complex issue, this API does not provide <i>pthread_gettracingstate</i>(),
<i>pthread_settracingstate</i>(), <i>pthread_attr_gettracingstate</i>(), and <i>pthread_attr_settracingstate</i>() interfaces.
These interfaces force the thread implementation to add to the weight of the thread and cause a revision of the threads libraries,
just to support tracing. Worse yet, <a href="../functions/posix_trace_event.html"><i>posix_trace_event</i>()</a> must always test
this per-thread variable even in the common case where it is not used at all. Per-thread tracing is easy to implement using
existing interfaces where necessary; see the following example.</p>
<h5><a name="tag_03_02_11_31"></a>Example</h5>
<pre>
<tt>/* Caution. Error checks omitted */
static pthread_key_t my_key;
static trace_event_id_t my_event_id;
static pthread_once_t my_once = PTHREAD_ONCE_INIT;
<br>
void my_init(void)
{
(void) pthread_key_create(&amp;my_key, NULL);
(void) posix_trace_eventid_open("my", &amp;my_event_id);
}
<br>
int get_trace_flag(void)
{
pthread_once(&amp;my_once, my_init);
return (pthread_getspecific(my_key) != NULL);
}
<br>
void set_trace_flag(int f)
{
pthread_once(&amp;my_once, my_init);
pthread_setspecific(my_key, f? &amp;my_event_id: NULL);
}
<br>
void fn(void)
{
if (get_trace_flag())
posix_trace_event(my_event_id, NULL, 0); /* no event data */
}
</tt>
</pre>
<p>The above example does not implement third-party state setting.</p>
<p>Lastly, per-thread tracing works poorly for threads with PTHREAD_SCOPE_PROCESS contention scope. These &quot;library&quot; threads have
minimal interaction with the kernel and would have to explicitly set the attributes whenever they are context switched to a new
kernel thread in order to trace system events. Such state was explicitly avoided in POSIX threads to keep PTHREAD_SCOPE_PROCESS
threads lightweight.</p>
<p>The reason that keeping PTHREAD_SCOPE_PROCESS threads lightweight is important is that such threads can be used not just for
simple multi-processors but also for co-routine style programming (such as discrete event simulation) without inventing a new
threads paradigm. Adding extra runtime cost to thread context switches will make using POSIX threads less attractive in these
situations.</p>
<h5><a name="tag_03_02_11_32"></a>Rationale on Triggering</h5>
<p>The ability to start or stop tracing based on the occurrence of specific trace event types has been proposed as a parallel to
similar functionality appearing in logic analyzers. Such triggering, in order to be very useful, should be based not only on the
trace event type, but on trace event-specific data, including tests of user-specified fields for matching or threshold values.</p>
<p>Such a facility is unnecessary where the buffering of the stream is not a constraint, since such checks can be performed offline
during post-mortem analysis.</p>
<p>For example, a large system could incorporate a daemon utility to collect the trace records from memory buffers and spool them
to secondary storage for later analysis. In the instances where resources are truly limited, such as embedded applications, the
incorporation of application code to test the circumstances of a trace event and call the trace point only if needed is
usually straightforward.</p>
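<p>Such a conditional trace point might look like the following sketch; the <i>queue_length</i> variable, the QUEUE_ALARM_LEVEL
threshold, and the <i>queue_alarm_id</i> event are illustrative assumptions.</p>
<pre>
<tt>/* Caution. Error checks omitted */
{
/* queue_length, QUEUE_ALARM_LEVEL, and queue_alarm_id are assumed
* to be application-defined; the trace point fires only if needed */
if (queue_length > QUEUE_ALARM_LEVEL)
posix_trace_event(queue_alarm_id,
&amp;queue_length, sizeof queue_length);
}
</tt>
</pre>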
<p>For performance reasons, the <a href="../functions/posix_trace_event.html"><i>posix_trace_event</i>()</a> function should be
implemented using a macro, so if the trace is inactive, the trace event point calls are latent code and must cost no more than a
scalar test.</p>
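<p>One way an implementation might achieve this is sketched below; the internal names <i>__trace_active</i> and
<i>__posix_trace_event</i>() are invented for illustration and are not part of IEEE&nbsp;Std&nbsp;1003.1-2001.</p>
<pre>
<tt>/* Hypothetical &lt;trace.h&gt; fragment: while tracing is inactive,
* a trace point costs no more than a scalar test */
extern int __trace_active;
extern void __posix_trace_event(trace_event_id_t, const void *, size_t);
#define posix_trace_event(id, ptr, len) \
((void)(__trace_active ? (__posix_trace_event((id),(ptr),(len)),0) : 0))
</tt>
</pre>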
<p>The API proposed in IEEE&nbsp;Std&nbsp;1003.1-2001 does not include any triggering functionality.</p>
<h5><a name="tag_03_02_11_33"></a>Rationale on Timestamp Clock</h5>
<p>It has been suggested that the tracing mechanism should include the possibility of specifying the clock to be used in
timestamping the trace events. When application trace events must be correlated to remote trace events, such a facility could
provide a global time reference not available from a local clock. Further, the application may be driven by timers based on a clock
different from that used for the timestamp, and the correlation of the trace to those untraced timer activities could be an
important part of the analysis of the application.</p>
<p>However, the tracing mechanism needs to be fast and just the provision of such an option can materially affect its performance.
Leaving aside the performance costs of reading some clocks, this notion is also ill-defined when kernel trace events are to be
traced by two applications making use of different tracing clocks. This can even happen within a single application where different
parts of the application are served by different clocks. Another complication can occur when a clock is maintained strictly at the
user level and is unavailable at the kernel level.</p>
<p>It is felt that the benefits of a selectable trace clock do not match its costs. Applications that wish to correlate clocks
other than the default tracing clock can include trace events with sample values of those other clocks, allowing correlation of
timestamps from the various independent clocks. In any case, such a technique would be required when applications are sensitive to
multiple clocks.</p>
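<p>A sketch of this technique follows; the event name "clock_sample" and the choice of CLOCK_REALTIME as the clock to be correlated
are illustrative assumptions.</p>
<pre>
<tt>/* Caution. Error checks omitted */
{
trace_event_id_t clock_sample_id;
struct timespec ts;
<br>
posix_trace_eventid_open("clock_sample", &amp;clock_sample_id);
/* Record the other clock's value as trace event data; the trace
* mechanism adds its own timestamp, allowing offline correlation */
clock_gettime(CLOCK_REALTIME, &amp;ts);
posix_trace_event(clock_sample_id, &amp;ts, sizeof ts);
}
</tt>
</pre>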
<h5><a name="tag_03_02_11_34"></a>Rationale on Different Overrun Conditions</h5>
<p>The analysis of the dynamic behavior of the trace mechanism shows that different overrun conditions may occur. The API must
provide a means to manage such conditions in a portable way.</p>
<h5><a name="tag_03_02_11_35"></a>Overrun in Trace Streams Initialized with POSIX_TRACE_LOOP Policy</h5>
<p>In this case, the user of the trace mechanism is interested in using the trace stream with POSIX_TRACE_LOOP policy to record
trace events continuously, but ideally without losing any trace events. The online analyzer process must get the trace events at a
mean speed equivalent to the recording speed. Should the trace stream become full, a trace stream overrun occurs. This condition is
detected by getting the status of the active trace stream (function <a href=
"../functions/posix_trace_get_status.html"><i>posix_trace_get_status</i>()</a>) and looking at the member
<i>posix_stream_overrun_status</i> of the read <b>posix_stream_status</b> structure. In addition, two predefined trace event types
are defined:</p>
<ol>
<li>
<p>The beginning of a trace overflow, to locate the beginning of an overflow when reading a trace stream</p>
</li>
<li>
<p>The end of a trace overflow, to locate the end of an overflow, when reading a trace stream</p>
</li>
</ol>
<p>As a timestamp is associated with these predefined trace events, it is possible to know the duration of the overflow.</p>
<h5><a name="tag_03_02_11_36"></a>Overrun in Dumping Trace Streams into Trace Logs</h5>
<p>The user lets the trace mechanism dump the trace stream initialized with POSIX_TRACE_FLUSH policy automatically into a trace
log. If the dump operation is slower than the recording of trace events, the trace stream can overrun. This condition is detected
by getting the status of the active trace stream (function <a href=
"../functions/posix_trace_get_status.html"><i>posix_trace_get_status</i>()</a>) and looking at the member
<i>posix_log_overrun_status</i> of the read <b>posix_stream_status</b> structure. This overrun indicates that the trace mechanism
is not able to operate in this mode at this speed. It is the responsibility of the user to modify one of the trace parameters (the
stream size or the trace event type filter, for instance) to avoid such overrun conditions, if overruns are to be prevented. The
same already predefined trace event types (see <a href="#tag_03_02_11_35">Overrun in Trace Streams Initialized with
POSIX_TRACE_LOOP Policy</a> ) are used to detect and to know the duration of an overflow.</p>
<h5><a name="tag_03_02_11_37"></a>Reading an Active Trace Stream</h5>
<p>Although this trace API allows one to read an active trace stream with log while it is tracing, this feature can lead to false
overflow origin interpretation: the trace log or the reader of the trace stream. Reading from an active trace stream with log is
thus non-portable, and has been left unspecified.</p>
<h4><a name="tag_03_02_12"></a>Data Types</h4>
<p>The requirement that additional types defined in this section end in &quot;_t&quot; was prompted by the problem of name space pollution.
It is difficult to define a type (where that type is not one defined by IEEE&nbsp;Std&nbsp;1003.1-2001) in one header file and use
it in another without adding symbols to the name space of the program. To allow implementors to provide their own types, all
conforming applications are required to avoid symbols ending in &quot;_t&quot;, which permits the implementor to provide additional types.
Because a major use of types is in the definition of structure members, which can (and in many cases must) be added to the
structures defined in IEEE&nbsp;Std&nbsp;1003.1-2001, the need for additional types is compelling.</p>
<p>The types, such as <b>ushort</b> and <b>ulong</b>, which are in common usage, are not defined in IEEE&nbsp;Std&nbsp;1003.1-2001
(although <b>ushort_t</b> would be permitted as an extension). They can be added to <a href=
"../basedefs/sys/types.h.html"><i>&lt;sys/types.h&gt;</i></a> using a feature test macro (see <a href="#tag_03_02_02_01">POSIX.1
Symbols</a> ). A suggested symbol for these is _SYSIII. Similarly, the types like <b>u_short</b> would probably be best controlled
by _BSD.</p>
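<p>Such an arrangement might look like the following fragment of <a href=
"../basedefs/sys/types.h.html"><i>&lt;sys/types.h&gt;</i></a>; the feature test macro names follow the suggestions above and are not
required by IEEE&nbsp;Std&nbsp;1003.1-2001.</p>
<pre>
<tt>/* Hypothetical &lt;sys/types.h&gt; fragment */
#ifdef _SYSIII
typedef unsigned short ushort; /* SVID-style types */
typedef unsigned long ulong;
#endif
#ifdef _BSD
typedef unsigned short u_short; /* BSD-style types */
typedef unsigned long u_long;
#endif
</tt>
</pre>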
<p>Some of these symbols may appear in other headers; see <a href="#tag_03_02_02_04">The Name Space</a> .</p>
<dl compact>
<dt><b>dev_t</b></dt>
<dd>This type may be made large enough to accommodate host-locality considerations of networked systems.
<p>This type must be arithmetic. Earlier proposals allowed this to be non-arithmetic (such as a structure) and provided a
<i>samefile</i>() function for comparison.</p>
</dd>
<dt><b>gid_t</b></dt>
<dd>Some implementations had separated <b>gid_t</b> from <b>uid_t</b> before POSIX.1 was completed. It would be difficult for them
to coalesce them when it was unnecessary. Additionally, it is quite possible that user IDs might be different from group IDs
because the user ID might wish to span a heterogeneous network, where the group ID might not.
<p>For current implementations, the cost of having a separate <b>gid_t</b> will be only lexical.</p>
</dd>
<dt><b>mode_t</b></dt>
<dd>This type was chosen so that implementations could choose the appropriate integer type, and for compatibility with the
ISO&nbsp;C standard. 4.3 BSD uses <b>unsigned short</b> and the SVID uses <b>ushort</b>, which is the same. Historically, only the
low-order sixteen bits are significant.</dd>
<dt><b>nlink_t</b></dt>
<dd>This type was introduced in place of <b>short</b> for <i>st_nlink</i> (see the <a href=
"../basedefs/sys/stat.h.html"><i>&lt;sys/stat.h&gt;</i></a> header) in response to an objection that <b>short</b> was too
small.</dd>
<dt><b>off_t</b></dt>
<dd>This type is used only in <a href="../functions/lseek.html"><i>lseek</i>()</a>, <a href=
"../functions/fcntl.html"><i>fcntl</i>()</a>, and <a href="../basedefs/sys/stat.h.html"><i>&lt;sys/stat.h&gt;</i></a>. Many
implementations would have difficulties if it were defined as anything other than <b>long</b>. Requiring an integer type limits the
capabilities of <a href="../functions/lseek.html"><i>lseek</i>()</a> to four gigabytes. The ISO&nbsp;C standard supplies routines
that use larger types; see <a href="../functions/fgetpos.html"><i>fgetpos</i>()</a> and <a href=
"../functions/fsetpos.html"><i>fsetpos</i>()</a>. XSI-conformant systems provide the <a href=
"../functions/fseeko.html"><i>fseeko</i>()</a> and <a href="../functions/ftello.html"><i>ftello</i>()</a> functions that use larger
types.</dd>
<dt><b>pid_t</b></dt>
<dd>The inclusion of this symbol was controversial because it is tied to the issue of the representation of a process ID as a
number. From the point of view of a conforming application, process IDs should be &quot;magic cookies&quot;<a href=
"#tag_foot_1"><sup><small>1</small></sup></a> that are produced by calls such as <a href=
"../functions/fork.html"><i>fork</i>()</a>, used by calls such as <a href="../functions/waitpid.html"><i>waitpid</i>()</a> or <a
href="../functions/kill.html"><i>kill</i>()</a>, and not otherwise analyzed (except that the sign is used as a flag for certain
operations).
<p>The concept of a {PID_MAX} value interacted with this in early proposals. Treating process IDs as an opaque type both removes
the requirement for {PID_MAX} and allows systems to be more flexible in providing process IDs that span a large range of values, or
a small one.</p>
<p>Since the values in <b>uid_t</b>, <b>gid_t</b>, and <b>pid_t</b> will be numbers generally, and potentially both large in
magnitude and sparse, applications that are based on arrays of objects of this type are unlikely to be fully portable in any case.
Solutions that treat them as magic cookies will be portable.</p>
<p>{CHILD_MAX} precludes the possibility of a &quot;toy implementation&quot;, where there would only be one process.</p>
</dd>
<dt><b>ssize_t</b></dt>
<dd>This is intended to be a signed analog of <b>size_t</b>. The wording is such that an implementation may either choose to use a
longer type or simply to use the signed version of the type that underlies <b>size_t</b>. All functions that return <b>ssize_t</b>
( <a href="../functions/read.html"><i>read</i>()</a> and <a href="../functions/write.html"><i>write</i>()</a>) describe as
&quot;implementation-defined&quot; the result of an input exceeding {SSIZE_MAX}. It is recognized that some implementations might have
<b>int</b>s that are smaller than <b>size_t</b>. A conforming application would be constrained not to perform I/O in pieces larger
than {SSIZE_MAX}, but a conforming application using extensions would be able to use the full range if the implementation provided
an extended range, while still having a single type-compatible interface.
<p>The symbols <b>size_t</b> and <b>ssize_t</b> are also required in <a href=
"../basedefs/unistd.h.html"><i>&lt;unistd.h&gt;</i></a> to minimize the changes needed for calls to <a href=
"../functions/read.html"><i>read</i>()</a> and <a href="../functions/write.html"><i>write</i>()</a>. Implementors are reminded that
it must be possible to include both <a href="../basedefs/sys/types.h.html"><i>&lt;sys/types.h&gt;</i></a> and <a href=
"../basedefs/unistd.h.html"><i>&lt;unistd.h&gt;</i></a> in the same program (in either order) without error.</p>
</dd>
<dt><b>uid_t</b></dt>
<dd>Before the addition of this type, the data types used to represent these values varied throughout early proposals. The <a href=
"../basedefs/sys/stat.h.html"><i>&lt;sys/stat.h&gt;</i></a> header defined these values as type <b>short</b>, the
<i>&lt;passwd.h&gt;</i> file (now <a href="../basedefs/pwd.h.html"><i>&lt;pwd.h&gt;</i></a> and <a href=
"../basedefs/grp.h.html"><i>&lt;grp.h&gt;</i></a>) used an <b>int</b>, and <a href="../functions/getuid.html"><i>getuid</i>()</a>
returned an <b>int</b>. In response to a strong objection to the inconsistent definitions, all the types were switched to
<b>uid_t</b>.
<p>In practice, those historical implementations that use varying types of this sort can typedef <b>uid_t</b> to <b>short</b> with
no serious consequences.</p>
<p>The problem associated with this change concerns object compatibility after structure size changes. Since most implementations
will define <b>uid_t</b> as a short, the only substantive change will be a reduction in the size of the <b>passwd</b> structure.
Consequently, implementations with an overriding concern for object compatibility can pad the structure back to its current size.
For that reason, this problem was not considered critical enough to warrant the addition of a separate type to POSIX.1.</p>
<p>The types <b>uid_t</b> and <b>gid_t</b> are magic cookies. There is no {UID_MAX} defined by POSIX.1, and no structure imposed on
<b>uid_t</b> and <b>gid_t</b> other than that they be positive arithmetic types. (In fact, they could be <b>unsigned char</b>.)
There is no maximum or minimum specified for the number of distinct user or group IDs.</p>
</dd>
</dl>
<hr>
<h4><a name="tag_03_02_13"></a>Footnotes</h4>
<dl compact>
<dt><a name="tag_foot_1">1.</a></dt>
<dd>An historical term meaning: &quot;An opaque object, or token, of determinate size, whose significance is known only to the entity
which created it. An entity receiving such a token from the generating entity may only make such use of the `cookie' as is defined
and permitted by the supplying entity.&quot;</dd>
</dl>
<hr size="2" noshade>
<center><font size="2"><!--footer start-->
UNIX &reg; is a registered Trademark of The Open Group.<br>
POSIX &reg; is a registered Trademark of The IEEE.<br>
[ <a href="../mindex.html">Main Index</a> | <a href="../basedefs/contents.html">XBD</a> | <a href=
"../utilities/contents.html">XCU</a> | <a href="../functions/contents.html">XSH</a> | <a href="../xrat/contents.html">XRAT</a>
]</font></center>
<!--footer end-->
<hr size="2" noshade>
</body>
</html>