<html>
<head>
<title>How to optimize for the Pentium family of microprocessors</title>
<style><!--
body{font-family:arial,sans-serif;color:black;background-color:#F0FFE0}
p{font-family:arial,sans-serif}
pre{font-family:courier,monospace}
kbd{font-family:courier,monospace;font-weight:500}
h1{font-size:300%;font-weight:700;text-align:center}
h2{font-size:150%;font-weight:600;text-align:left;padding-top:1em}
h3{font-size:110%;font-weight:600;text-align:left}
h4{font-size:100%;font-weight:500;text-align:left;text-decoration:underline}
td.a2{font-size:150%;font-weight:600}
td.a3{font-size:110%;font-weight:600}
td.a4{font-size:100%;font-weight:600}
--></style>
</head>

<body text="#000000" bgcolor="#F0FFE0" link="#0000E0" vlink="#E000E0" alink="#FF0000">
<center><h1>How to optimize for the Pentium<br>
family of microprocessors</h1>
<p>
<small>Copyright © 1996, 2000 by Agner Fog. Last modified 2000-03-14.</small>
</center><p>
<h2>Contents</h2>
<ol>
<li><a href="#1">Introduction</a>
<li><a href="#2">Literature</a>
<li><a href="#3">Calling assembly functions from high level language</a>
<li><a href="#4">Debugging and verifying</a>
<li><a href="#5">Memory model</a>
<li><a href="#6">Alignment</a>
<li><a href="#7">Cache</a>
<li><a href="#8">First time versus repeated execution</a>
<li><a href="#9">Address generation interlock (PPlain and PMMX)</a>
<li><a href="#10">Pairing integer instructions (PPlain and PMMX)</a>
<ol><li><a href="#10_1">Perfect pairing</a>
<li><a href="#10_2">Imperfect pairing</a>
</ol>
<li><a href="#11">Splitting complex instructions into simpler ones (PPlain and PMMX)</a>
<li><a href="#12">Prefixes (PPlain and PMMX)</a>
<li><a href="#13">Overview of PPro, PII and PIII pipeline</a>
<li><a href="#14">Instruction decoding (PPro, PII and PIII)</a>
<li><a href="#15">Instruction fetch (PPro, PII and PIII)</a>
<li><a href="#16">Register renaming (PPro, PII and PIII)</a>
<ol><li><a href="#16_1">Eliminating dependencies</a>
<li><a href="#16_2">Register read stalls</a></ol>
<li><a href="#17">Out of order execution (PPro, PII and PIII)</a>
<li><a href="#18">Retirement (PPro, PII and PIII)</a>
<li><a href="#19">Partial stalls (PPro, PII and PIII)</a>
<ol><li><a href="#19_1">Partial register stalls</a>
<li><a href="#19_2">Partial flags stalls</a>
<li><a href="#19_3">Flags stalls after shifts and rotates</a>
<li><a href="#19_4">Partial memory stalls</a></ol>
<li><a href="#20">Dependency chains (PPro, PII and PIII)</a>
<li><a href="#21">Searching for bottlenecks (PPro, PII and PIII)</a>
<li><a href="#22">Jumps and branches (all processors)</a>
<ol><li><a href="#22_1">Branch prediction in PPlain</a>
<li><a href="#22_2">Branch prediction in PMMX, PPro, PII and PIII</a>
<li><a href="#22_3">Avoiding jumps (all processors)</a>
<li><a href="#22_4">Avoiding conditional jumps by using flags (all processors)</a>
<li><a href="#22_5">Replacing conditional jumps by conditional moves (PPro, PII and PIII)</a></ol>
<li><a href="#23">Reducing code size (all processors)</a>
<li><a href="#24">Scheduling floating point code (PPlain and PMMX)</a>
<li><a href="#25">Loop optimization (all processors)</a>
<ol><li><a href="#25_1">Loops in PPlain and PMMX</a>
<li><a href="#25_2">Loops in PPro, PII and PIII</a></ol>
<li><a href="#26">Problematic Instructions</a>
<ol><li><a href="#26_1">XCHG (all processors)</a>
<li><a href="#26_2">Rotates through carry (all processors)</a>
<li><a href="#26_3">String instructions (all processors)</a>
<li><a href="#26_4">Bit test (all processors)</a>
<li><a href="#26_5">Integer multiplication (all processors)</a>
<li><a href="#26_6">WAIT instruction (all processors)</a>
<li><a href="#26_7">FCOM + FSTSW AX (all processors)</a>
<li><a href="#26_8">FPREM (all processors)</a>
<li><a href="#26_9">FRNDINT (all processors)</a>
<li><a href="#26_10">FSCALE and exponential function (all processors)</a>
<li><a href="#26_11">FPTAN (all processors)</a>
<li><a href="#26_12">FSQRT (PIII)</a>
<li><a href="#26_13">MOV [MEM], ACCUM (PPlain and PMMX)</a>
<li><a href="#26_14">TEST instruction (PPlain and PMMX)</a>
<li><a href="#26_15">Bit scan (PPlain and PMMX)</a>
<li><a href="#26_16">FLDCW (PPro, PII and PIII)</a>
</ol>
<li><a href="#27">Special topics</a>
<ol><li><a href="#27_1">LEA instruction (all processors)</a>
<li><a href="#27_2">Division (all processors)</a>
<li><a href="#27_3">Freeing floating point registers (all processors)</a>
<li><a href="#27_4">Transitions between floating point and MMX instructions (PMMX, PII and PIII)</a>
<li><a href="#27_5">Converting from floating point to integer (all processors)</a>
<li><a href="#27_6">Using integer instructions to do floating point operations (all processors)</a>
<li><a href="#27_7">Using floating point instructions to do integer operations (PPlain and PMMX)</a>
<li><a href="#27_8">Moving blocks of data (all processors)</a>
<li><a href="#27_9">Self-modifying code (all processors)</a>
<li><a href="#27_10">Detecting processor type (all processors)</a>
</ol>
<li><a href="#28">List of instruction timings for PPlain and PMMX</a>
<ol><li><a href="#28_1">Integer instructions</a>
<li><a href="#28_2">Floating point instructions</a>
<li><a href="#28_3">MMX instructions (PMMX)</a></ol>
<li><a href="#29">List of instruction timings and micro-op breakdown for PPro, PII and PIII</a>
<ol><li><a href="#29_1">Integer instructions</a>
<li><a href="#29_2">Floating point instructions</a>
<li><a href="#29_3">MMX instructions (PII and PIII)</a>
<li><a href="#29_4">XMM instructions (PIII)</a>
</ol>
<li><a href="#30">Testing speed</a>
<li><a href="#31">Comparison of the different microprocessors</a>
</ol>
<h2><a name="1">1</a>. Introduction</h2>
This manual describes in detail how to write optimized assembly language
code, with particular focus on the Pentium® family of microprocessors.
<p>
Most of the information herein is based on my own research. Many people have
sent me useful information and corrections for this manual, and I keep
updating it whenever I have important new information. This manual is
therefore more accurate, detailed and comprehensive than any other
source of information, and it contains many details not found anywhere else.
In many cases this information will enable you to calculate exactly how many
clock cycles a piece of code will take. I do not claim, though, that all
information in this manual is exact: some timings can be difficult or
impossible to measure exactly, and I do not have access to the inside information
on technical implementations that the writers of Intel manuals have.
<p>
The following versions of Pentium processors are discussed in this manual:<p>
<table border=1>
<tr><td class="a3">abbreviation</td><td class="a3">name</td></tr>
<tr><td>PPlain</td><td>plain old Pentium (without MMX)</td></tr>
<tr><td>PMMX</td><td>Pentium with MMX</td></tr>
<tr><td>PPro</td><td>Pentium Pro</td></tr>
<tr><td>PII</td><td>Pentium II (including Celeron and Xeon)</td></tr>
<tr><td>PIII</td><td>Pentium III (including variants)</td></tr>
</table>
<p>
The assembly language syntax used in this manual is MASM 5.10 syntax.
There is no official standard for X86 assembly language, but this is the
closest you can get to a de facto standard, since most assemblers have a
MASM 5.10 compatible mode. (I do not recommend using MASM version 5.10 itself, though,
because it has a serious bug in 32 bit mode. Use TASM or a later version of MASM.)
<p>
Some of the remarks in this manual may seem like a criticism of Intel. This should not be
taken to mean that other brands are better. The Pentium family of microprocessors
may be faster than any compatible competing brand, better documented, and with better
testability features. For these reasons, no competing brand has been subjected to the same
level of independent research by me or by anybody else.
<p>
Programming in assembly language is much more difficult than programming in a high
level language. Making bugs is very easy, and finding them is very difficult. Now you have been
warned! It is assumed that the reader is already experienced in assembly programming. If not, then
please read some books on the subject and get some programming experience before you
begin to do complicated optimizations.
<p>
The hardware design of the PPlain and PMMX chips has many features which are
optimized specifically for some commonly used instructions or instruction combinations,
rather than using general optimization methods. Consequently, the rules for optimizing
software for this design are complicated and have many exceptions, but the possible gain
in performance may be substantial. The PPro, PII and PIII processors have a very
different design, where the processor takes care of much of the optimization work by
executing instructions out of order; but the more complicated design of these processors
generates many potential bottlenecks, so there may be a lot to gain
by optimizing manually for these processors as well.
<p>
Before you start to convert your code to assembly, make sure that your algorithm is optimal.
Often you can improve a piece of code much more by improving the algorithm than by
converting it to assembly code.
<p>
Next, you have to identify the critical parts of your program. Often more than 99% of the
CPU time is spent in the innermost loop of a program. In this case you should optimize only
this loop and leave everything else in high level language. Some assembly programmers
waste a lot of energy optimizing the wrong parts of their programs, the only significant effect
of their effort being that the programs become more difficult to debug and maintain!
<p>
If it is not obvious where the critical parts of your program are, then you may use a profiler to
find them. If it turns out that the bottleneck is disk access, then you may modify your
program to make disk access sequential in order to improve disk caching, rather than
turning to assembly programming. If the bottleneck is graphics output, then you may look for
a way of reducing the number of calls to graphics procedures.
<p>
Some high level language compilers offer relatively good optimization for specific
processors, but further optimization by hand can usually do much better.
<p>
Please don't send your programming questions to me. I am not going to do your homework
for you!
<p>
Good luck with your hunt for nanoseconds!
<p>
<h2><a name="2">2</a>. Literature</h2>
A lot of useful literature and tutorials can be downloaded for free from Intel's www site or
acquired in print or on CD-ROM. It is recommended that you study this literature in order to
get acquainted with the microprocessor architecture. However, the documents from Intel are
not always accurate - the tutorials especially have many errors (evidently, they haven't
tested their own examples).
<p>
I will not give the URLs here because the file locations change very often. You can find the
documents you need by using the search facilities at
<a href="http://developer.intel.com" target="external">developer.intel.com</a> or follow the
links from <a href="http://www.agner.org/assem">www.agner.org/assem</a>
<p>
Some documents are in .PDF format. If you don't have software for viewing or printing .PDF
files, then you may download the Acrobat file reader from <a href="http://www.adobe.com" target="external">www.adobe.com</a>
<p>
The use of MMX and XMM (SIMD) instructions for optimizing specific applications is described in several
application notes. The instruction set is described in various manuals and tutorials.
<p>
VTUNE is a software tool from Intel for optimizing code. I have not tested it and can
therefore not give any evaluation of it here.
<p>
Many sources other than Intel also have useful information. These sources are listed in
the FAQ for the newsgroup comp.lang.asm.x86. For other internet resources, follow the
links from <a href="http://www.agner.org/assem">www.agner.org/assem</a>
<p>
<h2><a name="3">3</a>. Calling assembly functions from high level language</h2>
You can either use inline assembly or code a subroutine entirely in assembly language and
link it into your project. If you choose the latter option, then it is recommended that you use
a compiler which is capable of translating high level code directly to assembly. This assures
that you get the function calling method right. Most C++ compilers can do this.
<p>
The methods for function calling and name mangling can be quite complicated. There are
many different calling conventions, and the different brands of compilers are not compatible
in this respect. If you are calling assembly language subroutines from C++, then the best
method in terms of consistency and compatibility is to declare your functions <kbd>extern "C"</kbd>
and <kbd>_cdecl</kbd>. The assembly code must then have the function name prefixed by an
underscore (<kbd>_</kbd>) and be assembled with case sensitivity on externals (option -mx).
<p>
If you need to make overloaded functions, overloaded operators, member
functions, and other C++ specialties, then you have to code it in C++ first and
make your compiler translate it to assembly in order to get the right linking
information and calling method. These details are different for different brands
of compilers. If you want an assembly function with any other calling method
than <kbd>extern "C"</kbd> and <kbd>_cdecl</kbd> to be callable from code
compiled with different compilers, then you need to give it one public name
for each compiler. For example, an overloaded square function:
<pre>; int square (int x);
SQUARE_I        PROC    NEAR            ; integer square function
@square$qi      LABEL   NEAR            ; link name for Borland compiler
?square@@YAHH@Z LABEL   NEAR            ; link name for Microsoft compiler
_square__Fi     LABEL   NEAR            ; link name for Gnu compiler
        PUBLIC  @square$qi, ?square@@YAHH@Z, _square__Fi
        MOV     EAX, [ESP+4]
        IMUL    EAX
        RET
SQUARE_I        ENDP

; double square (double x);
SQUARE_D        PROC    NEAR            ; double precision float square function
@square$qd      LABEL   NEAR            ; link name for Borland compiler
?square@@YANN@Z LABEL   NEAR            ; link name for Microsoft compiler
_square__Fd     LABEL   NEAR            ; link name for Gnu compiler
        PUBLIC  @square$qd, ?square@@YANN@Z, _square__Fd
        FLD     QWORD PTR [ESP+4]
        FMUL    ST(0), ST(0)
        RET
SQUARE_D        ENDP</pre>
<p>
The way of transferring parameters depends on the calling convention:<p>
<table border=1 cellpadding=1 cellspacing=1><tr>
<td class="a3"> calling convention </td>
<td class="a3"> parameter order on stack </td>
<td class="a3"> parameters removed by </td></tr>
<tr><td> _cdecl </td><td> first par. at low address </td><td> caller </td></tr>
<tr><td> _stdcall </td><td> first par. at low address </td><td> subroutine </td></tr>
<tr><td> _fastcall </td><td> compiler specific </td><td> subroutine </td></tr>
<tr><td> _pascal </td><td> first par. at high address </td><td> subroutine </td></tr>
</table>
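<p>
As a rough C-side sketch of the <kbd>extern "C"</kbd>/<kbd>_cdecl</kbd> interface described above (the implementation below is only a stand-in for the assembly routine, which would be linked in under the name <kbd>_square</kbd>):

```c
#include <assert.h>

/* C-side stand-in for the assembly SQUARE_I routine shown above.
   With _cdecl, the caller pushes the argument (first parameter at the
   lowest address) and removes it again after the call; the assembly
   version reads the argument from [ESP+4] and returns the result in
   EAX, which corresponds to this function's return value. */
int square(int x)
{
    return x * x;
}
```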
<p>
<u>Register usage in 16 bit mode DOS or Windows, C or C++:</u><br>
16-bit return value in <kbd>AX</kbd>, 32-bit return value in <kbd>DX:AX</kbd>,
floating point return value in <kbd>ST(0)</kbd>. Registers <kbd>AX, BX, CX,
DX, ES</kbd> and arithmetic flags may be changed by the procedure; all other
registers must be saved and restored. A procedure can rely on <kbd>SI, DI, BP, DS</kbd>
and <kbd>SS</kbd> being unchanged across a call to another procedure.
<p>
<u>Register usage in 32 bit Windows, C++ and other programming languages:</u><br>
Integer return value in <kbd>EAX</kbd>, floating point return value in <kbd>ST(0)</kbd>.
Registers <kbd>EAX, ECX</kbd> and <kbd>EDX</kbd> (but not <kbd>EBX</kbd>) may be changed by the procedure; all other
registers must be saved and restored. Segment registers cannot be changed, not even
temporarily. <kbd>CS, DS, ES,</kbd> and <kbd>SS</kbd> all point to the flat segment group. <kbd>FS</kbd> is used by the
operating system. <kbd>GS</kbd> is unused, but reserved. Flags may be changed by the procedure
with the following restrictions: the direction flag is 0 by default; the direction flag may be
set temporarily, but must be cleared before any call or return; the interrupt flag cannot be
cleared. The floating point register stack is empty at the entry of a procedure and must be
empty at return, except for <kbd>ST(0)</kbd> if it is used for the return value. MMX registers may be
changed by the procedure, but must then be cleared by <kbd>EMMS</kbd> before returning and before calling any
other procedure that may use floating point registers. All XMM registers may be modified
by procedures. Rules for passing parameters and return values in XMM registers
are described in Intel's application note AP 589. A procedure can rely on
<kbd>EBX, ESI, EDI, EBP</kbd> and all segment registers being unchanged across
a call to another procedure.
<p>
<h2><a name="4">4</a>. Debugging and verifying</h2>
Debugging assembly code can be quite hard and frustrating, as you probably already have
discovered. I would recommend that you start by writing the piece of code you want to
optimize as a subroutine in a high level language. Next, write a test program that will test
your subroutine thoroughly. Make sure the test program goes into all branches and
boundary cases.
<p>
When your high level language subroutine works with your test program, then you are ready
to translate the code to assembly language.
<p>
Now you can start to optimize. Each time you have made a modification, you should run it
on the test program to see if it works correctly.
Number all your versions and save them so that you can go back and test them again in
case you discover an error that the test program didn't catch (such as writing to a wrong
address).
<p>
Test the speed of the most critical part of your program with the method described in
chapter <a href="#30">30</a> or with a test program. If the code is significantly slower than expected, then the
most probable causes are: cache misses (chapter <a href="#7">7</a>), misaligned
operands (chapter <a href="#6">6</a>), first
time penalty (chapter <a href="#8">8</a>), branch mispredictions
(chapter <a href="#22">22</a>), instruction fetch problems
(chapter <a href="#15">15</a>), register read stalls (chapter <a href="#16">16</a>),
or long dependency chains (chapter <a href="#20">20</a>).
<p>
Highly optimized code tends to be very difficult to read and understand for others, and even
for yourself when you get back to it after some time. In order to make it possible to maintain
the code, it is important that you organize it into small logical units (procedures or macros)
with a well-defined interface and appropriate comments. The more complicated the code is
to read, the more important good documentation is.
<p>
<h2><a name="5">5</a>. Memory model</h2>
The Pentiums are designed primarily for 32 bit code, and the performance is
inferior on 16 bit code. Segmenting your code and data also degrades performance
significantly, so you should generally prefer 32 bit flat mode, and an operating
system which supports this mode. The code examples shown in this
manual assume a 32 bit flat memory model, unless otherwise specified.
<p>
<h2><a name="6">6</a>. Alignment</h2>
All data in RAM should be aligned to addresses divisible by 2, 4, 8, or 16 according to this
scheme:
<table border=1 cellpadding=1 cellspacing=1>
<tr><td></td><td colspan=2 align=center class="a3">alignment</td></tr>
<tr><td align=center class="a3"> operand size </td>
<td align=center class="a3"> PPlain and PMMX </td>
<td align=center class="a3"> PPro, PII and PIII </td></tr>
<tr><td> 1 (byte) </td><td align=center>1</td><td align=center>1</td></tr>
<tr><td> 2 (word) </td><td align=center>2</td><td align=center>2</td></tr>
<tr><td> 4 (dword) </td><td align=center>4</td><td align=center>4</td></tr>
<tr><td> 6 (fword) </td><td align=center>4</td><td align=center>8</td></tr>
<tr><td> 8 (qword) </td><td align=center>8</td><td align=center>8</td></tr>
<tr><td> 10 (tbyte) </td><td align=center>8</td><td align=center>16</td></tr>
<tr><td> 16 (oword) </td><td align=center>n.a.</td><td align=center>16</td></tr>
</table>
<p>
On PPlain and PMMX, misaligned data will take at least 3 clock cycles extra to access if a 4
byte boundary is crossed. The penalty is higher when a cache line boundary is crossed.
<p>
On PPro, PII and PIII, misaligned data will cost you 6-12 clocks extra when a
cache line boundary is crossed. Misaligned operands smaller than 16 bytes that
do not cross a 32 byte boundary give no penalty.
<p>
Aligning data by 8 or 16 on a dword size stack may be a problem. A common method is to set up
an aligned frame pointer. A function with aligned local data may look like this:
<pre>_FuncWithAlign PROC NEAR
        PUSH    EBP                             ; prolog code
        MOV     EBP, ESP
        AND     EBP, -8                         ; align frame pointer by 8
        FLD     DWORD PTR [ESP+8]               ; function parameter
        SUB     ESP, LocalSpace + 4             ; allocate local space
        FSTP    QWORD PTR [EBP-LocalSpace]      ; store something in aligned space
        ...
        ADD     ESP, LocalSpace + 4             ; epilog code. restore ESP
        POP     EBP                             ; (AGI stall on PPlain/PMMX)
        RET
_FuncWithAlign ENDP</pre>
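<p>
The <kbd>AND EBP, -8</kbd> instruction in the prolog above rounds the frame pointer down to the nearest multiple of 8. The same mask arithmetic can be sketched in C (the function name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Round an address down to the nearest multiple of 8, exactly as
   AND EBP, -8 does: -8 is ...11111000 in two's complement, so the
   AND clears the three lowest address bits. */
uintptr_t align_down8(uintptr_t addr)
{
    return addr & (uintptr_t)-8;
}
```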
<p>
While aligning data is always important, aligning code is not necessary on the PPlain and
PMMX. Principles for aligning code on PPro, PII and PIII are explained in chapter <a href="#15">15</a>.
<p>
<h2><a name="7">7</a>. Cache</h2>
The PPlain and PPro have 8 kb of on-chip cache (level one cache) for code, and 8 kb for
data. The PMMX, PII and PIII have 16 kb for code and 16 kb for data. Data in the level 1 cache
can be read or written in just one clock cycle, whereas a cache miss may cost many
clock cycles. It is therefore important that you understand how the cache works in order to
use it most efficiently.
<p>
The data cache consists of 256 or 512 lines of 32 bytes each. Each time you read a data
item which is not cached, the processor will read an entire cache line from memory. The
cache lines are always aligned to a physical address divisible by 32. When you have read a
byte at an address divisible by 32, then the next 31 bytes can be read or written at almost
no extra cost. You can take advantage of this by arranging data items which are used near
each other together into aligned blocks of 32 bytes of memory. If, for example, you have a
loop which accesses two arrays, then you may interleave the two arrays into one array of
structures, so that data which are used together are also stored together.
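<p>
The interleaving idea can be sketched in C (the array and structure names are hypothetical): instead of two parallel arrays, which may map to conflicting cache lines, a single array of structures keeps each pair of elements together in the same cache line:

```c
#include <assert.h>
#include <stddef.h>

/* Two parallel arrays accessed in the same loop...
      float a[1000], b[1000];
   ...replaced by one array of structures, so that a and b values
   used together also sit next to each other in memory: */
struct pair { float a, b; };
static struct pair ab[1000];
```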
<p>
If the size of an array or other data structure is a multiple of 32 bytes, then you should
preferably align it by 32.
<p>
The cache is set-associative. This means that a cache line cannot be assigned to an
arbitrary memory address. Each cache line has a 7-bit set-value which must match bits 5
through 11 of the physical RAM address (bits 0-4 define the 32 bytes within a cache line).
The PPlain and PPro have two cache lines for each of the 128 set-values, so there are two
possible cache lines to assign to any RAM address. The PMMX, PII and PIII have four.
<p>
The consequence of this is that the cache can hold no more than two or four different data
blocks which have the same value in bits 5-11 of the address. You can determine if two
addresses have the same set-value by the following method: strip off the lower 5 bits of
each address to get a value divisible by 32. If the difference between the two truncated
addresses is a multiple of 4096 (=1000H), then the addresses have the same set-value.
<p>
Let me illustrate this by the following piece of code, where ESI holds an address divisible by
32:

<pre>AGAIN:  MOV     EAX, [ESI]
        MOV     EBX, [ESI + 13*4096 + 4]
        MOV     ECX, [ESI + 20*4096 + 28]
        DEC     EDX
        JNZ     AGAIN</pre>
<p>
The three addresses used here all have the same set-value because the differences
between the truncated addresses are multiples of 4096. This loop will perform very poorly on
the PPlain and PPro. At the time you read <kbd>ECX</kbd> there is no free cache
line with the proper set-value, so the processor takes the least recently used
of the two possible cache lines - that is, the one which was used for <kbd>EAX</kbd> - and
fills it with the data from <kbd>[ESI+20*4096]</kbd> to
<kbd>[ESI+20*4096+31]</kbd> and reads <kbd>ECX</kbd>.
Next, when reading <kbd>EAX</kbd>, you find that the cache
line that held the value for <kbd>EAX</kbd> has now been discarded, so you
take the least recently used
line, which is the one holding the <kbd>EBX</kbd> value, and so on.
You have nothing but cache misses, and the loop takes something like 60 clock
cycles. If the third line is changed to:
<pre>        MOV     ECX, [ESI + 20*4096 + 32]</pre>
<p>
then we have crossed a 32 byte boundary, so that we do not have the same set-value as in
the first two lines, and there will be no problem assigning a cache line to each of the three
addresses. The loop now takes only 3 clock cycles (except for the first time) - a very
considerable improvement! As already mentioned, the PMMX, PII and PIII have 4-way caches,
so that you have four cache lines with the same set-value. (Some Intel documents
erroneously say that the PII cache is 2-way.)
<p>
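The set-value rule can be checked with a few lines of C. This sketch (the function name is hypothetical) models the 128 sets of the 2-way PPlain/PPro data cache: bits 0-4 of the address select the byte within the line, and bits 5-11 select the set:

```c
#include <assert.h>
#include <stdint.h>

/* Cache set selected by an address: bits 5-11, i.e. 128 sets of
   32-byte lines. Addresses whose 32-byte-truncated difference is a
   multiple of 4096 compete for the same set. */
unsigned set_value(uintptr_t addr)
{
    return (unsigned)((addr >> 5) & 0x7F);
}
```

With ESI = 0x100000 (divisible by 32), the three addresses in the loop above all give the same set-value, while the modified third address does not.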
|
|
It may be very difficult to determine if your data addresses have the same set-values,
|
|
especially if they are scattered around in different segments. The best thing you can do to
|
|
avoid problems of this kind is to keep all data used in the critical part or your program within
|
|
one contiguous block not bigger than the cache, or two contiguous blocks no bigger than
|
|
half that size (for example one block for static data and another block for data on the stack).
|
|
This will make sure that your cache lines are used optimally.
|
|
<p>
|
|
If the critical part of your code accesses big data structures or random data addresses, then
|
|
you may want to keep all frequently used variables (counters, pointers, control variables,
|
|
etc.) within a single contiguous block of max 4 kbytes so that you have a complete set of
|
|
cache lines free for accessing random data. Since you probably need stack space anyway
|
|
for subroutine parameters and return addresses, the best thing is to copy all frequently
|
|
used static data to dynamic variables on the stack, and copy them back again outside the
|
|
critical loop if they have been changed.
|
|
<p>
|
|
Reading a data item which is not in the level one cache causes an entire cache line to be
|
|
filled from the level two cache, which takes approximately 200 ns (that is 20 clocks on a
|
|
100 MHz system or 40 clocks on a 200 MHz system), but the bytes you ask for first are
|
|
available already after 50-100 ns. If the data item is not in the level two cache either, then
|
|
you will get a delay of something like 200-300 ns. This delay will be somewhat longer if you
|
|
cross a DRAM page boundary. (The size of a DRAM page is 1 kb for 4 and 8 MB 72 pin
|
|
RAM modules, and 2 kb for 16 and 32 MB modules).
|
|
<p>
|
|
When reading big blocks of data from memory, the speed is limited by the time it takes to fill
|
|
cache lines. You can sometimes improve speed by reading data in a non-sequential order:
|
|
before you finish reading data from one cache line start reading the first item from the next
|
|
cache line. This method can increase reading speed by 20 - 40% when reading from main
|
|
memory or level 2 cache on PPlain and PMMX, and from level 2 cache on PPro, PII and PIII. A
|
|
disadvantage of this method is of course that the program code becomes extremely clumsy
|
|
and complicated. For further information on this trick see <a href="http://www.intelligentfirm.com" target="external">www.intelligentfirm.com</a>.
|
|
<p>
|
|
When you write to an address which is not in the level 1 cache, then the value will go right
|
|
through to the level 2 cache or to the RAM (depending on how the level 2 cache is set up)
|
|
on the PPlain and PMMX. This takes approximately 100 ns. If you write eight or more times
|
|
to the same 32 byte block of memory without also reading from it, and the block is not in the
|
|
level one cache, then it may be advantageous to make a dummy read from the block first to
|
|
load it into a cache line. All subsequent writes to the same block will then go to the cache
|
|
instead, which takes only one clock cycle. On PPlain and PMMX, there is sometimes a
|
|
small penalty for writing repeatedly to the same address without reading in between.
|
|
<p>
|
|
On PPro, PII and PIII, a write miss will normally load a cache line, but it is possible to setup an
|
|
area of memory to perform differently, for example video RAM (See Pentium Pro Family
|
|
Developer's Manual, vol. 3: Operating System Writer's Guide").
|
|
<p>
|
|
Other ways of speeding up memory reads and writes are discussed in chapter
|
|
<a href="#27_8">27.8</a> below.
|
|
<p>
|
|
The PPlain and PPro have two write buffers, PMMX, PII and PIII have four. On the PMMX, PII and
|
|
PIII you may have up to four unfinished writes to uncached memory without delaying the
|
|
subsequent instructions. Each write buffer can handle operands up to 64 bits wide.
|
|
<p>
|
|
Temporary data may conveniently be stored on the stack because the stack area is very
|
|
likely to be in the cache. However, you should be aware of the alignment problems
|
|
if your data elements are bigger than the stack word size.
<p>
If the life ranges of two data structures do not overlap, then they may share
the same RAM area to increase cache efficiency. This is consistent with the
common practice of allocating space for temporary variables on the stack.
<p>
Storing temporary data in registers is of course even more efficient. Since registers are a
scarce resource you may want to use <kbd>[ESP]</kbd> rather than <kbd>[EBP]</kbd> for addressing data on the
stack, in order to free <kbd>EBP</kbd> for other purposes. Just don't forget that the value of <kbd>ESP</kbd>
changes every time you do a <kbd>PUSH</kbd> or <kbd>POP</kbd>.
(You cannot use <kbd>ESP</kbd> under 16-bit Windows
because the timer interrupt will modify the high word of <kbd>ESP</kbd> at unpredictable places in your
code.)
<p>
There is a separate cache for code, which is similar to the data cache. The size of the code
cache is 8 kb on the PPlain and PPro and 16 kb on the PMMX, PII and PIII. It is important that the
critical part of your code (the innermost loops) fits in the code cache. Frequently used pieces
of code, or routines which are used together, should preferably be stored near each other.
Seldom used branches or procedures should be placed at the bottom of your code or
somewhere else.
<p>
<h2><a name="8">8</a>. First time versus repeated execution</h2>
A piece of code usually takes much more time the first time it is executed than when it is
repeated. The reasons are the following:
<ol>
<li>Loading the code from RAM into the cache takes longer than executing it.
<li>Any data accessed by the code has to be loaded into the cache, which may take much
more time than executing the instructions. When the code is repeated, the data are
more likely to be in the cache.
<li>Jump instructions will not be in the branch target buffer the first time they execute, and
are therefore less likely to be predicted correctly. See chapter <a href="#22">22</a>.
<li>In the PPlain, decoding the code is a bottleneck. If it takes one clock cycle to determine
the length of an instruction, then it is not possible to decode two instructions per clock
cycle, because the processor doesn't know where the second instruction begins. The
PPlain solves this problem by remembering the length of any instruction which has
remained in the cache since the last time it was executed. As a consequence of this, a set
of instructions will not pair in the PPlain the first time they are executed, unless the first
of the two instructions is only one byte long. The PMMX, PPro, PII and PIII have no penalty
on first time decoding.
</ol><p>
For these four reasons, a piece of code inside a loop will generally take much more time the
first time it executes than the subsequent times.
<p>
If you have a big loop which doesn't fit into the code cache then you will get penalties all the
time because it doesn't run from the cache. You should therefore try to reorganize the loop
to make it fit into the cache.
<p>
If you have very many jumps, calls, and branches inside a loop, then you may get the
penalty of branch target buffer misses repeatedly.
<p>
Likewise, if a loop repeatedly accesses a data structure too big for the data cache, then you
will get the penalty of data cache misses all the time.
<p>
<h2><a name="9">9</a>. Address generation interlock (PPlain and PMMX)</h2>
It takes one clock cycle to calculate the address needed by an instruction which accesses
memory. Normally, this calculation is done at a separate stage in the pipeline while the
preceding instruction or instruction pair is executing. But if the address depends on the
result of an instruction executing in the preceding clock cycle, then you have to wait an
extra clock cycle for the address to be calculated. This is called an AGI stall.
Example:<br>
<kbd>ADD EBX,4 / MOV EAX,[EBX] ; AGI stall</kbd><br>
The stall in this example can be removed by putting some other instructions in between
<kbd>ADD EBX,4</kbd> and <kbd>MOV EAX,[EBX]</kbd> or by rewriting the code to:
<kbd>MOV EAX,[EBX+4] / ADD EBX,4</kbd>
<p>
You can also get an AGI stall with instructions which use <kbd>ESP</kbd> implicitly
for addressing, such
as <kbd>PUSH, POP, CALL,</kbd> and <kbd>RET</kbd>, if <kbd>ESP</kbd> has been
changed in the preceding clock cycle by
instructions such as <kbd>MOV, ADD,</kbd> or <kbd>SUB</kbd>.
The PPlain and PMMX have special circuitry to
predict the value of <kbd>ESP</kbd> after a stack operation so that you do not
get an AGI delay after
changing <kbd>ESP</kbd> with <kbd>PUSH, POP,</kbd> or <kbd>CALL</kbd>.
You can get an AGI stall after <kbd>RET</kbd> only if it has an
immediate operand to add to <kbd>ESP</kbd>.
<p>
Examples:
<pre>ADD ESP,4 / POP ESI                ; AGI stall
POP EAX / POP ESI                  ; no stall, pair
MOV ESP,EBP / RET                  ; AGI stall
CALL L1 / L1: MOV EAX,[ESP+8]     ; no stall
RET / POP EAX                      ; no stall
RET 8 / POP EAX                    ; AGI stall</pre>
<p>
The <kbd>LEA</kbd> instruction is also subject to an AGI stall if it uses a
base or index register which
has been changed in the preceding clock cycle. Example:
<pre>INC ESI / LEA EAX,[EBX+4*ESI]      ; AGI stall</pre>
<p>
PPro, PII and PIII have no AGI stalls for memory reads and <kbd>LEA</kbd>, but they do have
AGI stalls for memory writes. This is not very significant unless the subsequent
code has to wait for the write to finish.<p>
<h2><a name="10">10</a>. Pairing integer instructions (PPlain and PMMX)</h2>
<h3><a name="10_1">10.1 Perfect pairing</a></h3>
The PPlain and PMMX have two pipelines for executing instructions, called the U-pipe and
the V-pipe. Under certain conditions it is possible to execute two instructions
simultaneously, one in the U-pipe and one in the V-pipe. This can almost double the speed.
It is therefore advantageous to reorder your instructions to make them pair.
<p>
The following instructions are pairable in either pipe:
<ul><li><kbd>MOV</kbd> register, memory, or immediate into register or memory
<li><kbd>PUSH</kbd> register or immediate, <kbd>POP</kbd> register
<li><kbd>LEA, NOP</kbd>
<li><kbd>INC, DEC, ADD, SUB, CMP, AND, OR, XOR,</kbd>
<li>and some forms of <kbd>TEST</kbd> (see chapter <a href="#26_14">26.14</a>)
</ul>
The following instructions are pairable in the U-pipe only:
<ul><li><kbd>ADC, SBB</kbd>
<li><kbd>SHR, SAR, SHL, SAL</kbd> with immediate count
<li><kbd>ROR, ROL, RCR, RCL</kbd> with an immediate count of 1
</ul>
The following instructions can execute in either pipe but are only pairable
when in the V-pipe:
<ul><li>near call
<li>short and near jump
<li>short and near conditional jump.
</ul>
All other integer instructions can execute in the U-pipe only, and are not pairable.
<p>
Two consecutive instructions will pair when the following conditions are met:
<p>
<u>1.</u> The first instruction is pairable in the U-pipe and the second
instruction is pairable in the V-pipe.
<p>
<u>2.</u> The second instruction does not read or write a register which the first
instruction writes to.<br>
Examples:
<pre> MOV EAX, EBX / MOV ECX, EAX   ; read after write, do not pair
 MOV EAX, 1   / MOV EAX, 2     ; write after write, do not pair
 MOV EBX, EAX / MOV EAX, 2     ; write after read, pair OK
 MOV EBX, EAX / MOV ECX, EAX   ; read after read, pair OK
 MOV EBX, EAX / INC EAX        ; read and write after read, pair OK</pre>
<p>
<u>3.</u> In rule 2 partial registers are treated as full registers. Example:
<pre> MOV AL, BL / MOV AH, 0</pre><p>
These instructions write to different parts of the same register and therefore do not pair.
<p>
<u>4.</u> Two instructions which both write to parts of the flags register can pair despite rules 2 and
3. Example:
<pre> SHR EAX, 4 / INC EBX          ; pair OK</pre>
<p>
<u>5.</u> An instruction which writes to the flags can pair with a conditional jump despite rule 2.
Example:
<pre> CMP EAX, 2 / JA LabelBigger   ; pair OK</pre>
<p>
<u>6.</u> The following instruction combinations can pair despite the fact that they both modify the
stack pointer:
<pre> PUSH + PUSH, PUSH + CALL, POP + POP</pre>
<p>
<u><a name="10-7">7.</a></u> There are restrictions on the pairing of instructions with prefixes.
There are several types of prefixes:
<ul>
<li>instructions addressing a non-default segment have a segment prefix.
<li>instructions using 16 bit data in 32 bit mode, or 32 bit data in 16 bit mode, have an
operand size prefix.
<li>instructions using 32 bit base or index registers in 16 bit mode have an address size
prefix.
<li>repeated string instructions have a repeat prefix.
<li>locked instructions have a <kbd>LOCK</kbd> prefix.
<li>many instructions which were not implemented on the 8086 processor have a two byte
opcode where the first byte is <kbd>0FH</kbd>. The <kbd>0FH</kbd> byte behaves
as a prefix on the PPlain, but
not on the other versions. The most common instructions with <kbd>0FH</kbd>
prefix are: <kbd>MOVZX,
MOVSX, PUSH FS, POP FS, PUSH GS, POP GS, LFS, LGS, LSS, SETcc, BT,
BTC, BTR, BTS, BSF, BSR, SHLD, SHRD,</kbd> and <kbd>IMUL</kbd> with two operands and no
immediate operand.
</ul>
<p>
On the PPlain, a prefixed instruction can only execute in the U-pipe, except for conditional
near jumps.
<p>
On the PMMX, instructions with operand size, address size, or <kbd>0FH</kbd>
prefix can execute in
either pipe, whereas instructions with segment, repeat, or lock prefix can
only execute in the U-pipe.
<p>
<u>8.</u> An instruction which has both a displacement and immediate data is not pairable on the
PPlain and only pairable in the U-pipe on the PMMX:
<pre> MOV DWORD PTR DS:[1000], 0    ; not pairable or only in U-pipe
 CMP BYTE PTR [EBX+8], 1       ; not pairable or only in U-pipe
 CMP BYTE PTR [EBX], 1         ; pairable
 CMP BYTE PTR [EBX+8], AL      ; pairable</pre><p>
(Another problem with instructions which have both a displacement and
immediate data on the PMMX is that such instructions may be longer
than 7 bytes, which means that only one instruction can be decoded
per clock cycle, as explained in chapter <a href="#12">12</a>.)
<p>
<u>9.</u> Both instructions must be preloaded and decoded. This is explained in
chapter <a href="#8">8</a>.
<p>
<u>10.</u> There are special pairing rules for MMX instructions on the PMMX:
<ul>
<li>MMX shift, pack or unpack instructions can execute in either pipe but cannot pair with
other MMX shift, pack or unpack instructions.
<li>MMX multiply instructions can execute in either pipe but cannot pair with other MMX
multiply instructions. They take 3 clock cycles and the last 2 clock cycles can overlap
with subsequent instructions in the same way as floating point instructions can (see
chapter <a href="#24">24</a>).
<li>an MMX instruction which accesses memory or integer registers can execute only in the
U-pipe and cannot pair with a non-MMX instruction.
</ul>
<p>
<h3><a name="10_2">10.2 Imperfect pairing</a></h3>
There are situations where the two instructions in a pair will not execute simultaneously, or
will only partially overlap in time. They should still be considered a pair, though, because the
first instruction executes in the U-pipe, and the second in the V-pipe. No subsequent
instruction can start to execute before both instructions in the imperfect pair have finished.
<p>
Imperfect pairing will happen in the following cases:
<p>
<u>1.</u> If the second instruction suffers an AGI stall (see chapter <a href="#9">9</a>).
<p>
<u>2.</u> Two instructions cannot access the same DWORD of memory simultaneously.
The following examples assume that <kbd>ESI</kbd> is divisible by 4:<br>
<kbd> MOV AL, [ESI] / MOV BL, [ESI+1]</kbd><br>
The two operands are within the same DWORD, so they cannot execute
simultaneously. The pair takes 2 clock cycles.<br>
<kbd> MOV AL, [ESI+3] / MOV BL, [ESI+4]</kbd><br>
Here the two operands are on each side of a DWORD boundary, so they
pair perfectly, and take only one clock cycle.
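<p>
Rule 2 can be checked mechanically: two accesses collide when their addresses fall in the same aligned 4-byte block, i.e. when the addresses agree in all bits above the lowest two. A small sketch (the function name is mine; Python integers stand in for addresses):

```python
# Two memory operands fall in the same aligned DWORD exactly when
# their addresses are equal after discarding the lowest two bits.
def same_dword(addr1, addr2):
    return addr1 >> 2 == addr2 >> 2

ESI = 1000  # divisible by 4, as the examples above assume

collides = same_dword(ESI, ESI + 1)          # MOV AL,[ESI] / MOV BL,[ESI+1]
pairs_ok = not same_dword(ESI + 3, ESI + 4)  # opposite sides of a boundary
```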
<p>
<u>3.</u> Rule 2 is extended to the case where bits 2-4 are the same in the two addresses (cache
bank conflict). For DWORD addresses this means that the difference between the two
addresses should not be divisible by 32. Examples:
<pre> MOV [ESI], EAX / MOV [ESI+32000], EBX ; imperfect pairing
 MOV [ESI], EAX / MOV [ESI+32004], EBX ; perfect pairing</pre>
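<p>
The extended rule can be sketched the same way: two addresses select the same cache bank when bits 2-4 are equal. This accounts for the examples above, since 32000 is divisible by 32 while 32004 is not (the helper names are mine):

```python
# Addresses conflict in the cache banks when they agree in bits 2-4
# (the same-DWORD case is already covered by rule 2 above).
def bank_bits(addr):
    return (addr >> 2) & 7      # extract bits 2-4

def bank_conflict(addr1, addr2):
    return bank_bits(addr1) == bank_bits(addr2)
```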
<p>
Pairable integer instructions which do not access memory take one clock cycle to
execute, except for mispredicted jumps. <kbd>MOV</kbd> instructions to or from
memory also take
only one clock cycle if the data area is in the cache and properly aligned. There is no
speed penalty for using complex addressing modes such as scaled index registers.
<p>
A pairable integer instruction which reads from memory, does some calculation, and
stores the result in a register or flags, takes 2 clock cycles (read/modify instructions).
<p>
A pairable integer instruction which reads from memory, does some calculation, and
writes the result back to memory, takes 3 clock cycles (read/modify/write
instructions).
<p>
<u>4.</u> If a read/modify/write instruction is paired with a read/modify or read/modify/write
instruction, then they will pair imperfectly.
<p>
The number of clock cycles used is given in the following table:
<table border=1 cellpadding=1 cellspacing=1><tr>
<td align="center" class="a3">
First instruction</td>
<td colspan=3 align="center" class="a3">Second instruction</td></tr>
<tr><td> </td><td align=center> MOV or register only
</td><td align=center> read/modify </td>
<td align=center> read/modify/write </td></tr>
<tr><td align=center> MOV or register only </td>
<td align=center> 1 </td>
<td align=center> 2 </td>
<td align=center> 3 </td></tr>
<tr><td align=center> read/modify </td>
<td align=center> 2 </td>
<td align=center> 2 </td>
<td align=center> 3 </td></tr>
<tr><td align=center> read/modify/write </td>
<td align=center> 3 </td>
<td align=center> 4 </td>
<td align=center> 5 </td></tr>
</table>
<p>Example:<br><kbd> ADD [mem1], EAX / ADD EBX, [mem2] ; 4 clock cycles<br> ADD EBX, [mem2] / ADD [mem1], EAX ; 3 clock cycles</kbd>
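<p>
The table can be transcribed directly into a lookup, which is convenient when hand-scheduling pairs. A sketch (the class labels and dictionary name are mine; "mov" stands for "MOV or register only"):

```python
# Clock cycles for an imperfect pair, by instruction class, transcribed
# from the table above: "mov" = MOV or register-only, "rm" = read/modify,
# "rmw" = read/modify/write.
PAIR_CYCLES = {
    ("mov", "mov"): 1, ("mov", "rm"): 2, ("mov", "rmw"): 3,
    ("rm",  "mov"): 2, ("rm",  "rm"): 2, ("rm",  "rmw"): 3,
    ("rmw", "mov"): 3, ("rmw", "rm"): 4, ("rmw", "rmw"): 5,
}

# ADD [mem1],EAX is read/modify/write, ADD EBX,[mem2] is read/modify,
# so the order of the two instructions decides between 4 and 3 cycles.
```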
<p>
<u>5.</u> When two paired instructions both take extra time due to cache misses, misalignment,
or jump misprediction, then the pair will take more time than each instruction alone, but less
than the sum of the two.
<p>
<u>6.</u> A pairable floating point instruction followed by <kbd>FXCH</kbd>
will make imperfect pairing if the
next instruction is not a floating point instruction.
<p>
In order to avoid imperfect pairing you have to know which instructions go into the U-pipe,
and which into the V-pipe. You can find this out by looking backwards in your code and
searching for instructions which are unpairable, pairable only in one of the pipes, or cannot pair
due to one of the rules above.
<p>
Imperfect pairing can often be avoided by reordering your instructions.
Example:
<br>
<pre>L1: MOV EAX,[ESI]
    MOV EBX,[ESI]
    INC ECX</pre><p>
Here the two <kbd>MOV</kbd> instructions form an imperfect pair because they both access the same
memory location, and the sequence takes 3 clock cycles. You can improve the code by
reordering the instructions so that <kbd>INC ECX</kbd> pairs with one of the
<kbd>MOV</kbd> instructions.
<pre>L2: MOV EAX,OFFSET A
    XOR EBX,EBX
    INC EBX
    MOV ECX,[EAX]
    JMP L1</pre><p>
The pair <kbd>INC EBX / MOV ECX,[EAX]</kbd> is imperfect because the latter
instruction has an
AGI stall. The sequence takes 4 clocks. If you insert a <kbd>NOP</kbd> or any other instruction so that
<kbd>MOV ECX,[EAX]</kbd> pairs with <kbd>JMP L1</kbd> instead, then the sequence takes only 3 clocks.
<p>
<a name="imperfectpush">
The next example is in 16 bit mode, assuming that <kbd>SP</kbd> is divisible by 4:</a>
<pre>L3: PUSH AX
    PUSH BX
    PUSH CX
    PUSH DX
    CALL FUNC</pre><p>
Here the <kbd>PUSH</kbd> instructions form two imperfect pairs, because both operands in each pair go
into the same DWORD of memory. <kbd>PUSH BX</kbd> could possibly pair perfectly
with <kbd>PUSH CX</kbd>
(because they go on each side of a DWORD boundary) but it doesn't because it has already
been paired with <kbd>PUSH AX</kbd>. The sequence therefore takes 5 clocks. If you insert a <kbd>NOP</kbd> or
any other instruction so that <kbd>PUSH BX</kbd> pairs with <kbd>PUSH CX</kbd>,
and <kbd>PUSH DX</kbd> with <kbd>CALL FUNC</kbd>,
then the sequence will take only 3 clocks. Another way to solve the problem is to make sure
that <kbd>SP</kbd> is not divisible by 4. Knowing whether <kbd>SP</kbd> is
divisible by 4 or not in 16 bit mode can
be difficult, so the best way to avoid this problem is to use 32 bit mode.
<p>
<h2><a name="11">11</a>. Splitting complex instructions into simpler ones (PPlain and PMMX)</h2>
You may split up read/modify and read/modify/write instructions to improve pairing.
Example:<br>
<kbd> ADD [mem1],EAX / ADD [mem2],EBX ; 5 clock cycles</kbd><br>
This code may be split up into a sequence which takes only 3 clock cycles:
<pre> MOV ECX,[mem1] / MOV EDX,[mem2] / ADD ECX,EAX / ADD EDX,EBX
 MOV [mem1],ECX / MOV [mem2],EDX</pre><p>
Likewise you may split up non-pairable instructions into pairable instructions:
<pre> PUSH [mem1]
 PUSH [mem2]                   ; non-pairable</pre><p>
Split up into:
<pre> MOV EAX,[mem1]
 MOV EBX,[mem2]
 PUSH EAX
 PUSH EBX                      ; everything pairs</pre><p>
Other examples of non-pairable instructions which may be split up into simpler pairable
instructions:<br>
<kbd>CDQ</kbd> split into: <kbd>MOV EDX,EAX / SAR EDX,31</kbd><br>
<kbd>NOT EAX</kbd> change to <kbd>XOR EAX,-1</kbd><br>
<kbd>NEG EAX</kbd> split into <kbd>XOR EAX,-1 / INC EAX</kbd><br>
<kbd>MOVZX EAX,BYTE PTR [mem]</kbd> split into <kbd> XOR EAX,EAX / MOV AL,BYTE PTR [mem]</kbd><br>
<kbd>JECXZ</kbd> split into <kbd>TEST ECX,ECX / JZ</kbd><br>
<kbd>LOOP</kbd> split into <kbd>DEC ECX / JNZ</kbd><br>
<kbd>XLAT</kbd> change to <kbd>MOV AL,[EBX+EAX]</kbd>
<p>
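<p>
The arithmetic splits above rest on two's complement identities which are easy to verify, for example with 32-bit masking (the helper names are mine; Python integers stand in for register values):

```python
MASK = 0xFFFFFFFF  # width of a 32-bit register

def sar32(x, count):
    """Arithmetic (sign-preserving) right shift of a 32-bit value."""
    if x & 0x80000000:
        x -= 1 << 32
    return (x >> count) & MASK

for eax in (0, 1, 5, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF):
    # NOT EAX gives the same result as XOR EAX,-1
    assert (~eax) & MASK == eax ^ MASK
    # NEG EAX gives the same result as XOR EAX,-1 / INC EAX
    assert (-eax) & MASK == ((eax ^ MASK) + 1) & MASK
    # CDQ fills EDX with the sign of EAX, as does MOV EDX,EAX / SAR EDX,31
    edx = MASK if eax & 0x80000000 else 0
    assert sar32(eax, 31) == edx
```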
If splitting instructions doesn't improve speed, then you may keep the complex or
non-pairable instructions in order to reduce code size.
<p>
Splitting instructions is not needed on the PPro, PII and PIII, except when the split instructions
generate fewer uops.
<p>
<h2><a name="12">12</a>. Prefixes (PPlain and PMMX)</h2>
An instruction with one or more prefixes may not be able to execute in the V-pipe (see
<a href="#10-7">chapter 10, sect. 7</a>), and it may take more than one clock cycle to decode.
<p>
On the PPlain, the decoding delay is one clock cycle for each prefix except for the <kbd>0FH</kbd>
prefix of conditional near jumps.
<p>
The PMMX has no decoding delay for the <kbd>0FH</kbd> prefix.
Segment and repeat prefixes take one
clock extra to decode. Address and operand size prefixes take two clocks extra to decode.
The PMMX can decode two instructions per clock cycle if the first instruction has a
segment or repeat prefix or no prefix, and the second instruction has no prefix. Instructions
with address or operand size prefixes can only decode alone on the PMMX. Instructions with
more than one prefix take one clock extra for each prefix.
<p>
Address size prefixes can be avoided by using 32 bit mode. Segment prefixes can be
avoided in 32 bit mode by using a flat memory model. Operand size prefixes can be avoided
in 32 bit mode by using only 8 bit and 32 bit integers.
<p>
Where prefixes are unavoidable, the decoding delay may be masked if a preceding
instruction takes more than one clock cycle to execute. The rule for the PPlain is that any
instruction which takes N clock cycles to execute (not to decode) can 'overshadow' the
decoding delay of N-1 prefixes in the next two (sometimes three) instructions or instruction
pairs. In other words, each extra clock cycle that an instruction takes to execute can be
used to decode one prefix in a later instruction. This shadowing effect even extends across
a predicted branch. Any instruction which takes more than one clock cycle to execute, and
any instruction which is delayed because of an AGI stall, cache miss, misalignment, or any
other reason except decoding delay and branch misprediction, has a shadowing effect.
<p>
The PMMX has a similar shadowing effect, but the mechanism is different. Decoded
instructions are stored in a transparent first-in-first-out (FIFO) buffer, which can hold up to
four instructions. As long as there are instructions in the FIFO buffer you get no delay.
When the buffer is empty then instructions are executed as soon as they are decoded. The
buffer is filled when instructions are decoded faster than they are executed, i.e. when you
have unpaired or multi-cycle instructions. The FIFO buffer is emptied when instructions
execute faster than they are decoded, i.e. when you have decoding delays due to prefixes.
The FIFO buffer is empty after a mispredicted branch. The FIFO buffer can receive two
instructions per clock cycle provided that the second instruction is without prefixes and
none of the instructions are longer than 7 bytes. The two execution pipelines (U and V) can
each receive one instruction per clock cycle from the FIFO buffer.
<p>
Examples: <br>
<kbd> CLD / REP MOVSD</kbd><br>
The <kbd>CLD</kbd> instruction takes two clock cycles and can therefore overshadow the decoding
delay of the <kbd>REP</kbd> prefix. The code would take one clock cycle more if the <kbd>CLD</kbd> instruction were
placed far from the <kbd>REP MOVSD.</kbd>
<br>
<kbd> CMP DWORD PTR [EBX],0 / MOV EAX,0 / SETNZ AL</kbd><br>
The <kbd>CMP</kbd> instruction takes two clock cycles here because it is a read/modify instruction. The
<kbd>0FH</kbd> prefix of the <kbd>SETNZ</kbd> instruction is decoded during the second clock cycle of the <kbd>CMP </kbd>
instruction, so that the decoding delay is hidden on the PPlain (the PMMX has no
decoding delay for the <kbd>0FH</kbd> prefix).
<p>
Prefix penalties on the PPro, PII and PIII are described in chapter <a href="#14">14</a>.
<p>
<h2><a name="13">13</a>. Overview of PPro, PII and PIII pipeline</h2>
The architecture of the PPro, PII and PIII microprocessors is well explained and illustrated in
various manuals and tutorials from Intel. It is recommended that you study this material in
order to get an understanding of how these microprocessors work. I will describe the
structure briefly here with particular focus on those elements that are important for
optimizing code.
<p>
Instruction codes are fetched from the code cache in aligned 16-byte chunks into a double
buffer that can hold two 16-byte chunks. The code is passed on from the double buffer to
the decoders in blocks which I will call ifetch blocks (instruction fetch blocks). The ifetch
blocks are usually 16 bytes long, but not aligned. The purpose of the double buffer is to
make it possible to decode an instruction that crosses a 16-byte boundary (i.e. an address
divisible by 16).
<p>
The ifetch block goes to the instruction length decoder, which determines where each
instruction begins and ends, and next to the instruction decoders. There are three decoders
so that you can decode up to three instructions in each clock cycle. A group of up to three
instructions that are decoded in the same clock cycle is called a decode group.
<p>
The decoders translate instructions into micro-operations, abbreviated uops. Simple
instructions generate only one uop, while more complex instructions may generate several
uops. For example, the instruction <kbd>ADD EAX,[MEM]</kbd> is decoded into two uops: one for
reading the source operand from memory, and one for doing the addition. The purpose of
splitting instructions into uops is to make the handling later in the system more effective.
<p>
The three decoders are called D0, D1, and D2. D0 can handle all instructions, while D1
and D2 can handle only simple instructions that generate one uop.
<p>
The uops from the decoders go via a short queue to the register allocation table (RAT). The
execution of uops works on temporary registers which are later written to the permanent
registers <kbd>EAX, EBX</kbd>, etc. The purpose of the RAT is to tell the uops which temporary
registers to use, and to allow register renaming (see later).
<p>
After the RAT, the uops go to the reorder buffer (ROB). The purpose of the ROB is to
enable out-of-order execution. A uop stays in the ROB until the operands it
needs are available. If an operand for one uop is delayed because a previous uop that
generates the operand is not finished yet, then the ROB may find another uop later in the
queue that can be executed in the meantime in order to save time.
<p>
The uops that are ready for execution are sent to the execution units, which are clustered
around five ports: ports 0 and 1 can handle arithmetic operations, jumps, etc. Port 2 takes
care of all reads from memory, port 3 calculates addresses for memory writes, and port 4
does memory writes.
<p>
When an instruction has been executed, it is marked in the ROB as ready to retire. It then goes to
the retirement station. Here the contents of the temporary registers used by the uops are
written to the permanent registers. While uops can be executed out of order, they must be
retired in order.
<p>
In the following chapters, I will describe in detail how to optimize the throughput of each step
in the pipeline.
<p>
<h2><a name="14">14</a>. Instruction decoding (PPro, PII and PIII)</h2>
I am describing instruction decoding before instruction fetching here because you need to
know how the decoders work in order to understand the possible delays in instruction
fetching.
<p>
The decoders can handle three instructions per clock cycle, but only when certain
conditions are met. Decoder D0 can handle any instruction that generates up to 4 uops in a
single clock cycle. Decoders D1 and D2 can handle only instructions that generate 1 uop,
and these instructions can be no more than 8 bytes long.
<p>
To summarize the rules for decoding two or three instructions in the same clock cycle:
<ul>
<li>The first instruction (D0) generates no more than 4 uops,
<li>The second and third instructions generate no more than 1 uop each,
<li>The second and third instructions are no more than 8 bytes long each,
<li>The instructions must be contained within the same 16-byte ifetch block (see next
chapter).
</ul>
There is no limit to the length of the instruction in D0 (despite Intel manuals saying
otherwise), as long as the three instructions fit into one 16-byte ifetch block.
<p>
An instruction that generates more than 4 uops takes two or more clock cycles to decode,
and no other instructions can decode in parallel.
<p>
It follows from the rules above that the decoders can produce a maximum of 6 uops per
clock cycle if the first instruction in each decode group generates 4 uops and the next two
generate 1 uop each. The minimum production is 2 uops per clock cycle, which you get
when all instructions generate 2 uops each, so that D1 and D2 are never used.
<p>
For maximum throughput, it is recommended that you order your instructions according to
the 4-1-1 pattern: instructions that generate 2 to 4 uops can be interspersed with two
simple 1-uop instructions for free, in the sense that they do not add to the decoding time.
Example:
<pre>MOV EBX, [MEM1]  ; 1 uop  (D0)
INC EBX          ; 1 uop  (D1)
ADD EAX, [MEM2]  ; 2 uops (D0)
ADD [MEM3], EAX  ; 4 uops (D0)</pre><p>
This takes 3 clock cycles to decode. You can save one clock cycle by reordering the
instructions into two decode groups:
<pre>ADD EAX, [MEM2]  ; 2 uops (D0)
MOV EBX, [MEM1]  ; 1 uop  (D1)
INC EBX          ; 1 uop  (D2)
ADD [MEM3], EAX  ; 4 uops (D0)</pre><p>
The decoders now generate 8 uops in two clock cycles, which is probably satisfactory.
Later stages in the pipeline can handle only 3 uops per clock cycle, so with a decoding rate
higher than this you can assume that decoding is not a bottleneck. However, complications
in the fetch mechanism can delay decoding as described in the next chapter, so to be safe
you may want to aim at a decoding rate higher than 3 uops per clock cycle.
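<p>
The grouping rules are mechanical enough to simulate. The sketch below is my own simplification (ifetch-block limits and the 8-byte rule are ignored, and instructions of more than 4 uops are modelled only roughly); it counts the decode cycles for a sequence of uop counts:

```python
# Greedy decode-group simulation for the PPro/PII/PIII decoders:
# D0 takes any instruction of up to 4 uops; D1 and D2 each take one
# additional 1-uop instruction from the sequence.  Instructions of
# more than 4 uops decode alone over several cycles (roughly modelled).
def decode_cycles(uop_counts):
    cycles = 0
    i = 0
    while i < len(uop_counts):
        first = uop_counts[i]
        i += 1
        if first > 4:
            cycles += (first + 3) // 4   # rough: multi-cycle, decodes alone
            continue
        cycles += 1
        slots = 2                        # D1 and D2
        while slots and i < len(uop_counts) and uop_counts[i] == 1:
            slots -= 1
            i += 1
    return cycles
```

With this model, the first ordering in the example above (uop counts 1, 1, 2, 4) decodes in 3 cycles, and the reordered version (2, 1, 1, 4) in 2 cycles.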
<p>
You can see how many uops each instruction generates in the tables in chapter <a href="#29">29</a>.
<p>
Instruction prefixes can also incur penalties in the decoders. Instructions can have several
kinds of prefixes:
<ul>
<li>An operand size prefix is needed when you have a 16-bit operand in a 32-bit
environment or vice versa. (Except for instructions that can only have one operand size,
such as <kbd>FNSTSW AX</kbd>). An operand size prefix gives a penalty of a few clocks if the
instruction has an immediate operand of 16 or 32 bits because the length of the operand
is changed by the prefix. Examples:
<pre> ADD BX, 9               ; no penalty because immediate operand is 8 bits
 MOV WORD PTR [MEM16], 9 ; penalty because operand is 16 bits </pre>
The last instruction should be changed to:
<pre> MOV EAX, 9
 MOV WORD PTR [MEM16], AX ; no penalty because no immediate</pre>
<li>An address size prefix is used when you use 32-bit addressing in 16-bit mode or vice
versa. This is seldom needed and should generally be avoided. The address size prefix
gives a penalty whenever you have an explicit memory operand (even when there is no
displacement) because the interpretation of the r/m bits in the instruction code is
changed by the prefix. Instructions with only implicit memory operands, such as string
instructions, have no penalty with an address size prefix.
<li>Segment prefixes are used when you address data in a non-default data segment.
Segment prefixes give no penalty on the PPro, PII and PIII.
<li>Repeat prefixes and lock prefixes give no penalty in the decoders.
<li>There is always a penalty if you have more than one prefix. This penalty is usually one
clock per prefix.
</ul>
<p>
<h2><a name="15">15</a>. Instruction fetch (PPro, PII and PIII)</h2>
The code is fetched in aligned 16-byte chunks from the code cache and placed in the
double buffer, which is called so because it can contain two such chunks. The code is then
taken from the double buffer and fed to the decoders in blocks which are usually 16 bytes
long, but not necessarily aligned by 16. I will call these blocks ifetch blocks (instruction fetch
blocks). If an ifetch block crosses a 16-byte boundary in the code then it needs data from
both chunks in the double buffer. So the purpose of the double buffer is to allow instruction
fetching across 16-byte boundaries.
<p>
The double buffer can fetch one 16-byte chunk per clock cycle and can generate one
ifetch block per clock cycle. The ifetch blocks are usually 16 bytes long, but can be shorter
if there is a predicted jump in the block. (See chapter <a href="#22">22</a> about jump prediction).
<p>
Unfortunately, the double buffer is not big enough for handling fetches around jumps
without delay. If the ifetch block that contains the jump instruction crosses a 16-byte
boundary then the double buffer needs to keep two consecutive aligned 16-byte chunks of
code in order to generate it. If the first instruction after the jump crosses a 16-byte
boundary, then the double buffer needs to load two new 16-byte chunks of code before a
valid ifetch block can be generated. This means that, in the worst case, the decoding of the
first instruction after a jump can be delayed for two clock cycles. You get one penalty for a
16-byte boundary in the ifetch block containing the jump instruction, and one penalty for a
16-byte boundary in the first instruction after the jump. You can get a bonus if you have more
than one decode group in the ifetch block that contains the jump because this gives the
double buffer extra time to fetch one or two 16-byte chunks of code in advance for the
instructions after the jump. The bonuses can compensate for the penalties according to the
table below. If the double buffer has fetched only one 16-byte chunk of code after the jump,
then the first ifetch block after the jump will be identical to this chunk, that is, aligned to a
16-byte boundary. In other words, the first ifetch block after the jump will not begin at the
first instruction, but at the nearest preceding address divisible by 16. If the double buffer
has had time to load two 16-byte chunks, then the new ifetch block can cross a 16-byte
boundary and begin at the first instruction after the jump. These rules are summarized in
the following <a name="ifetchtable">table:</a><p>
<table border=1 cellpadding=1 cellspacing=1><tr><td class="a3">
Number of<br>decode groups<br>in ifetch-block<br>containing jump</td>
<td class="a3">16-byte<br>boundary in this<br>ifetch-block</td>
<td class="a3">16-byte<br>boundary in first<br>instruction after<br>jump</td>
<td class="a3"><br><br>decoder delay</td>
<td class="a3">alignment of first<br>ifetch after jump</td></tr>
<tr><td align="center">1</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">by 16</td></tr>
<tr><td align="center">1</td>
<td align="center">0</td>
<td align="center">1</td>
<td align="center">1</td>
<td align="center">to instruction</td></tr>
<tr><td align="center">1</td>
<td align="center">1</td>
<td align="center">0</td>
<td align="center">1</td>
<td align="center">by 16</td></tr>
<tr><td align="center">1</td>
<td align="center">1</td>
<td align="center">1</td>
<td align="center">2</td>
<td align="center">to instruction</td></tr>
<tr><td align="center">2</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">to instruction</td></tr>
<tr><td align="center">2</td>
<td align="center">0</td>
<td align="center">1</td>
<td align="center">0</td>
<td align="center">to instruction</td></tr>
<tr><td align="center">2</td>
<td align="center">1</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">by 16</td></tr>
<tr><td align="center">2</td>
<td align="center">1</td>
<td align="center">1</td>
<td align="center">1</td>
<td align="center">to instruction</td></tr>
<tr><td align="center">3 or more</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">to instruction</td></tr>
<tr><td align="center">3 or more</td>
<td align="center">0</td>
<td align="center">1</td>
<td align="center">0</td>
<td align="center">to instruction</td></tr>
<tr><td align="center">3 or more</td>
<td align="center">1</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">to instruction</td></tr>
<tr><td align="center">3 or more</td>
<td align="center">1</td>
<td align="center">1</td>
<td align="center">0</td>
<td align="center">to instruction</td></tr>
</table>
<p>Jumps delay the fetching so that a loop always takes at least two clock cycles more per
iteration than the number of 16-byte boundaries in the loop.
<p>
A further problem with the instruction fetch mechanism is that a new ifetch block is not
generated until the previous one is exhausted. Each ifetch block can contain several
decode groups. If a 16-byte ifetch block ends with an unfinished instruction, then the
next ifetch block will begin at the beginning of that instruction. The first instruction in an
ifetch block always goes to decoder D0, and the next two instructions go to D1 and D2, if
possible. The consequence of this is that D1 and D2 are used less than optimally. If the
code is structured according to the recommended 4-1-1 pattern, and an instruction intended
to go into D1 or D2 happens to be the first instruction in an ifetch block, then that instruction
has to go into D0, with the result that one clock cycle is wasted.
This is probably a hardware design flaw. At least it is suboptimal design.
The consequence of this problem is that the time it takes to decode a piece of
code can vary considerably depending on where the first ifetch block begins.
<p>
If decoding speed is critical and you want to avoid these problems, then you have to know
where each ifetch block begins. This is quite a tedious job. First you need to make your
code segment paragraph-aligned in order to know where the 16-byte boundaries are. Then
you have to look at the output listing from your assembler to see how long each instruction
is. (It is recommended that you study how instructions are coded so that you can predict the
lengths of the instructions.) If you know where one ifetch block begins, then you can find
where the next ifetch block begins in the following way: Make the block 16 bytes long. If it
ends at an instruction boundary, then the next block will begin there. If it ends with an
unfinished instruction, then the next block will begin at the beginning of this instruction.
(Only the lengths of the instructions count here; it doesn't matter how many uops they
generate or what they do). This way you can work your way all through the code and mark
where each ifetch block begins. The only problem is knowing where to start. If you know
where one ifetch block is, then you can find all the subsequent ones, but you have to know
where the first one begins. Here are some guidelines:
<ul>
<li>The first ifetch block after a jump, call, or return can begin either at the first instruction or
at the nearest preceding 16-byte boundary, according to the table above. If you align
the first instruction to begin at a 16-byte boundary, then you can be sure that the first
ifetch block begins here. You may want to align important subroutine entries and loop
entries by 16 for this purpose.

<li>If the combined length of two consecutive instructions is more than 16 bytes, then you
can be certain that the second one doesn't fit into the same ifetch block as the first one,
and consequently you will always have an ifetch block beginning at the second
instruction. You can use this as a starting point for finding where subsequent ifetch
blocks begin.

<li>The first ifetch block after a branch misprediction begins at a 16-byte boundary. As
explained in chapter <a href="#22_2">22.2</a>, a loop that repeats more than 5 times will always have a
misprediction when it exits. The first ifetch block after such a loop will therefore begin at
the nearest preceding 16-byte boundary.

<li>Other serializing events also cause the next ifetch block to start at a 16-byte boundary.
Such events include interrupts, exceptions, self-modifying code, and serializing
instructions such as <kbd>CPUID, IN,</kbd> and <kbd>OUT</kbd>.
</ul>
<p>

I am sure you want an example now:

<pre>
address   instruction             length  uops  expected decoder
----------------------------------------------------------------------
1000h     MOV ECX, 1000              5     1      D0
1005h LL: MOV [ESI], EAX             2     2      D0
1007h     MOV [MEM], 0              10     2      D0
1011h     LEA EBX, [EAX+200]         6     1      D1
1017h     MOV BYTE PTR [ESI], 0      3     2      D0
101Ah     BSR EDX, EAX               3     2      D0
101Dh     MOV BYTE PTR [ESI+1],0     4     2      D0
1021h     DEC ECX                    1     1      D1
1022h     JNZ LL                     2     1      D2</pre>
<p>
Let's assume that the first ifetch block begins at address 1000h and ends at 1010h. This is
before the end of the <kbd>MOV [MEM],0</kbd> instruction, so the next ifetch block will begin at 1007h
and end at 1017h. This is at an instruction boundary, so the third ifetch block will begin at
1017h and cover the rest of the loop. The number of clock cycles it takes to decode this is
the number of D0 instructions, which is 5 per iteration of the LL loop. The last ifetch block
contained three decode groups covering the last five instructions, and it has one 16-byte
boundary (1020h). Looking at the table above, we find that the first ifetch block after the
jump will begin at the first instruction after the jump, that is the <kbd>LL</kbd>
label at 1005h, and end at
1015h. This is before the end of the <kbd>LEA</kbd> instruction, so the next ifetch block will go from
1011h to 1021h, and the last one from 1021h covering the rest. Now the <kbd>LEA</kbd> instruction
and the <kbd>DEC</kbd> instruction both fall at the beginning of an ifetch block, which forces them to go
into D0. We now have 7 instructions in D0, and the loop takes 7 clocks to decode in the
second iteration. The last ifetch block contains only one decode group
(<kbd>DEC ECX / JNZ LL</kbd>) and has no 16-byte boundary. According to the table, the next ifetch block after the
jump will begin at a 16-byte boundary, which is 1000h. This will give us the same situation
as in the first iteration, and you will see that the loop takes alternately 5 and 7 clock cycles
to decode. Since there are no other bottlenecks, the complete loop will take 6000 clocks to
run 1000 iterations. If the starting address had been different so that you had a 16-byte
boundary in the first or the last instruction of the loop, then it would take 8000 clocks. If you
reorder the loop so that no D1 or D2 instructions fall at the beginning of an ifetch block, then
you can make it take only 5000 clocks.
<p>
The example above was deliberately constructed so that fetch and decoding is the only
bottleneck. The easiest way to avoid this problem is to structure your code to generate
much more than 3 uops per clock cycle, so that decoding will not be a bottleneck despite the
penalties described here. In small loops this may not be possible, and then you have to find
out how to optimize the instruction fetch and decoding.
<p>
One thing you can do is to change the starting address of your procedure in order to avoid
16-byte boundaries where you don't want them. Remember to make your code segment
paragraph-aligned so that you know where the boundaries are.
<p>
If you insert an <kbd>ALIGN 16</kbd> directive before the loop entry, then the assembler will put in
<kbd>NOP</kbd>'s and other filler instructions up to the nearest 16-byte boundary. Most assemblers use
the instruction <kbd>XCHG EBX,EBX</kbd> as a 2-byte filler (the so-called 2-byte <kbd>NOP</kbd>). Whoever got
this idea, it was a bad one, because this instruction takes more time than two <kbd>NOP</kbd>'s on most
processors! If the loop executes many times, then whatever is outside the loop is
unimportant in terms of speed, and you don't have to care about the suboptimal filler
instructions. But if the time taken by the fillers is important, then you may
select the filler instructions manually.
You may as well use filler instructions that do something useful, such as refreshing
a register in order to avoid register read stalls (see chapter <a href="#16_2">16.2</a>).
For example, if you are using register <kbd>EBP</kbd> for addressing but seldom
write to it, then you may use <kbd>MOV EBP,EBP</kbd> or <kbd>ADD EBP, 0</kbd> as
filler in order to reduce the possibilities of register read stalls.
If you have nothing useful to do, you may use <kbd>FXCH ST(0)</kbd> as a good filler
because it doesn't put any load on the execution ports, provided that <kbd>ST(0)</kbd>
contains a valid floating point value.
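<p>
To illustrate, a loop entry could be padded by hand like this (a hypothetical
fragment; the label name is arbitrary, and the filler choices assume that
<kbd>EBP</kbd> is used for addressing and that <kbd>ST(0)</kbd> holds a valid
floating point value):
<pre>        MOV EBP, EBP             ; 2-byte filler, also refreshes EBP
        DB 3Eh                   ; DS: prefix makes the next filler 1 byte longer
        MOV EBP, EBP             ; 3 bytes including the prefix
        FXCH ST(0)               ; 2-byte filler, no load on the execution ports
LOOPENTRY:                       ; now at the desired 16-byte boundary
        ...</pre><p>
Choose the number and lengths of the fillers so that they add up to exactly the
padding needed to reach the next 16-byte boundary.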
<p>
Another possible remedy is to reorder your instructions in order to get the ifetch boundaries
where they don't hurt. This can be quite a difficult puzzle, and it is not always possible to find
a satisfactory solution.
<p>
Yet another possibility is to manipulate instruction lengths. Sometimes you can substitute
one instruction with another one with a different length. Many instructions can be coded in
different versions with different lengths. The assembler always chooses the shortest
possible version of an instruction, but it is often possible to hard-code a longer version. For
example, <kbd>DEC ECX</kbd> is one byte long, <kbd>SUB ECX,1</kbd> is 3 bytes, and you can code a 6-byte
version with a long immediate operand using this trick:
<pre>        SUB ECX, 9999
        ORG $-4
        DD 1</pre><p>
Instructions with a memory operand can be made one byte longer with a SIB byte, but the
easiest way of making an instruction one byte longer is to add a <kbd>DS:</kbd>
segment prefix (<kbd>DB 3Eh</kbd>).
The microprocessors generally accept redundant and meaningless prefixes (except
<kbd>LOCK</kbd>) as long as the instruction length does not exceed 15 bytes. Even instructions without
a memory operand can have a segment prefix. So if you want the <kbd>DEC ECX</kbd> instruction to be
2 bytes long, write:
<pre>        DB 3Eh
        DEC ECX</pre><p>
Remember that you get a penalty in the decoder if an instruction has more than one prefix.
It is possible that instructions with meaningless prefixes - especially repeat and
lock prefixes - will be used in future processors for new instructions when there are no
more vacant instruction codes, but I would consider it safe to use a segment prefix
with any instruction.
<p>
With these methods it will usually be possible to put the ifetch boundaries where you want
them, although it can be a tedious puzzle.
<p>
<h2><a name="16">16</a>. Register renaming (PPro, PII and PIII)</h2>
<h3><a name="16_1">16.1 Eliminating dependencies</a></h3>
Register renaming is an advanced technique used by these microprocessors to remove
dependencies between different parts of the code. Example:
<pre>        MOV EAX, [MEM1]
        IMUL EAX, 6
        MOV [MEM2], EAX
        MOV EAX, [MEM3]
        INC EAX
        MOV [MEM4], EAX</pre><p>
Here the last three instructions are independent of the first three in the sense that they don't
need any result from the first three instructions. To optimize this on earlier processors, you
would have to use a different register instead of <kbd>EAX</kbd> in the last three instructions and
reorder the instructions so that the last three instructions could execute in parallel with the
first three instructions. The PPro, PII and PIII processors do this for you automatically. They
assign a new temporary register for <kbd>EAX</kbd> every time you write to it. Thereby the
<kbd>MOV EAX,[MEM3]</kbd> instruction becomes independent of the preceding instructions.
With out-of-order execution it is likely to finish the move to <kbd>[MEM4]</kbd> before the slow
<kbd>IMUL</kbd> instruction is finished.
<p>
Register renaming goes fully automatically. A new temporary register is assigned as an
alias for the permanent register every time an instruction writes to this register. An
instruction that both reads and writes a register also causes renaming. For example, the
<kbd>INC EAX</kbd> instruction above uses one temporary register for input and another temporary
register for output. This does not remove any dependency, of course, but it has some
significance for subsequent register reads, as I will explain later.
<p>
All general purpose registers, the stack pointer, the flags, floating point registers,
MMX registers, XMM registers and segment registers can be renamed.
The control words and the floating point status word cannot be renamed, and this is the
reason why the use of these registers is slow. There are 40 universal temporary
registers, so it is unlikely that you will run out of temporary registers.
<p>
A common way of setting a register to zero is <kbd>XOR EAX,EAX</kbd> or
<kbd>SUB EAX,EAX</kbd>. These
instructions are not recognized as independent of the previous value of the register. If you
want to remove the dependency on slow preceding instructions, then use
<kbd>MOV EAX,0.</kbd>
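<p>
The difference can be sketched like this (hypothetical code; the surrounding
instructions are just for illustration):
<pre>        DIV EBX                  ; slow instruction, writes EAX and EDX
        XOR EAX, EAX             ; reads EAX, so it must wait for the DIV

        DIV EBX
        MOV EAX, 0               ; only writes EAX, independent of the DIV</pre><p>
In the first variant the zeroing instruction is stuck in the dependency chain of
the division; in the second it can execute immediately.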
<p>
Register renaming is controlled by the register alias table (RAT) and the reorder buffer
(ROB). The uops from the decoders go to the RAT via a queue, and then to the ROB and
the reservation station. The RAT can handle only 3 uops per clock cycle. This means that
the overall throughput of the microprocessor can never exceed 3 uops per clock cycle on
average.
<p>
There is no practical limit to the number of renamings. The RAT can rename three registers
per clock cycle, and it can even rename the same register three times in one clock cycle.
<p>
<h3><a name="16_2">16.2</a> Register read stalls</h3>
But there is another limitation, which may be quite serious: you
can only read two different permanent register names per clock cycle. This
limitation applies to all registers used by an instruction except those registers
that the instruction only writes to.
Example:
<pre>        MOV [EDI + ESI], EAX
        MOV EBX, [ESP + EBP]</pre><p>
The first instruction generates two uops: one that reads <kbd>EAX</kbd>
and one that reads <kbd>EDI</kbd> and
<kbd>ESI</kbd>. The second instruction generates one uop that reads
<kbd>ESP</kbd> and <kbd>EBP</kbd>. <kbd>EBX</kbd> does not
count as a read because it is only written to by the instruction. Let's assume that these
three uops go through the RAT together. I will use the word triplet for a group of three
consecutive uops that go through the RAT together. Since the RAT can handle only two
permanent register reads per clock cycle, and we need five register reads, our triplet will be
delayed for two extra clock cycles before it comes to the reservation station. With 3 or 4
register reads in the triplet it would be delayed by one clock cycle.
<p>
The same register can be read more than once in the same triplet without adding to the
count. If the instructions above are changed to:
<pre>        MOV [EDI + ESI], EDI
        MOV EBX, [EDI + EDI]</pre><p>
then you will need only two register reads (<kbd>EDI</kbd> and <kbd>ESI</kbd>) and the triplet will not be delayed.
<p>
A register that is going to be written to by a pending uop is stored in the ROB, so that it can
be read for free until it is written back, which takes at least 3 clock cycles,
and usually more. Write-back is the end of the execution stage, where the value
becomes available. In other words, you can read any number of registers in the
RAT without stall if their values are not yet available from the execution units.
The reason for this is that when a value becomes available, it is immediately
written directly to any subsequent ROB entries that need it. But if the value
has already been written back to a temporary or permanent register when a
subsequent uop that needs it goes into the RAT, then the value has to be read
from the register file, which has only two read ports. There are three pipeline
stages from the RAT to the execution unit, so you can be certain that a register
written to in one uop-triplet can be read for free in at least the next three
triplets. If the writeback is delayed by reordering, slow instructions,
dependency chains, cache misses, or by any other kind of stall, then the
register can be read for free further down the instruction stream.
<p>
Example:
<pre>        MOV EAX, EBX
        SUB ECX, EAX
        INC EBX
        MOV EDX, [EAX]
        ADD ESI, EBX
        ADD EDI, ECX</pre><p>
These 6 instructions generate 1 uop each. Let's assume that the first 3 uops go through the
RAT together. These 3 uops read register <kbd>EBX</kbd>, <kbd>ECX</kbd>, and <kbd>EAX</kbd>. But since we are writing to
<kbd>EAX</kbd> before reading it, the read is free and we get no stall. The next three uops read <kbd>EAX</kbd>,
<kbd>ESI</kbd>, <kbd>EBX</kbd>, <kbd>EDI</kbd>, and <kbd>ECX</kbd>. Since <kbd>EAX</kbd>, <kbd>EBX</kbd> and <kbd>ECX</kbd> have all been modified in the
preceding triplet and not yet written back, they can be read for free, so that only <kbd>ESI</kbd>
and <kbd>EDI</kbd> count, and we get no stall in the second triplet either. If the
<kbd>SUB ECX,EAX</kbd>
instruction in the first triplet is changed to <kbd>CMP ECX,EAX</kbd>, then <kbd>ECX</kbd> is not written to, and we
will get a stall in the second triplet for reading <kbd>ESI</kbd>, <kbd>EDI</kbd> and <kbd>ECX</kbd>. Similarly, if the
<kbd>INC EBX</kbd> instruction in the first triplet is changed to <kbd>NOP</kbd> or something else, then we will get a stall
in the second triplet for reading <kbd>ESI</kbd>, <kbd>EBX</kbd> and <kbd>EDI</kbd>.
<p>
No uop can read more than two registers. Therefore, all instructions that need
to read more than two registers are split up into two or more uops.
<p>
To count the number of register reads, you have to include all registers which are read by
the instruction. This includes integer registers, the flags register, the stack pointer,
floating point registers and MMX registers.
An XMM register counts as two registers,
except when only part of it is used, as e.g. in <kbd>ADDSS</kbd> and <kbd>MOVHLPS</kbd>.
Segment registers and the instruction pointer do not count.
For example, in <kbd>SETZ AL</kbd> you count the flags
register but not <kbd>AL</kbd>. <kbd>ADD EBX,ECX</kbd> counts both <kbd>EBX</kbd> and <kbd>ECX</kbd>, but not the flags, because
they are written to only. <kbd>PUSH EAX</kbd> reads <kbd>EAX</kbd> and the
stack pointer and then writes to the stack pointer.
<p>
The <kbd>FXCH</kbd> instruction is a special case. It works by renaming, but doesn't read any values,
so that it doesn't count in the rules for register read stalls. An <kbd>FXCH</kbd> instruction behaves
like 1 uop that neither reads nor writes any registers with regard to the rules for register read
stalls.
<p>
Don't confuse uop triplets with decode groups. A decode group can generate from 1 to 6
uops, and even if the decode group has three instructions and generates three uops, there
is no guarantee that the three uops will go into the RAT together.
<p>
The queue between the decoders and the RAT is so short (10 uops) that you cannot
assume that register read stalls do not stall the decoders, or that fluctuations
in decoder throughput do not stall the RAT.
<p>
It is very difficult to predict which uops go through the RAT together unless the queue is
empty, and for optimized code the queue should be empty only after mispredicted branches.
Several uops generated by the same instruction do not necessarily go through the RAT
together; the uops are simply taken consecutively from the queue, three at a time. The
sequence is not broken by a predicted jump: uops before and after the jump can go through
the RAT together. Only a mispredicted jump will discard the queue and start over again, so
that the next three uops are sure to go into the RAT together.
<p>
If three consecutive uops read more than two different registers, then you would of course
prefer that they do not go through the RAT together. The probability that they do is one
third. The penalty of reading three or four written-back registers in one triplet of uops is one
clock cycle. You can think of the one clock delay as equivalent to the load of three more
uops through the RAT. With the probability of 1/3 of the three uops going into the RAT
together, the average penalty will be the equivalent of 3/3 = 1 uop. To calculate the average
time it will take for a piece of code to go through the RAT, add the number of potential
register read stalls to the number of uops and divide by three. You can see that it doesn't
pay to remove the stall by putting in an extra instruction unless you know for sure which
uops go into the RAT together, or you can prevent more than one potential register read stall
by one extra instruction.
<p>
In situations where you aim at a throughput of 3 uops per clock, the limit of two permanent
register reads per clock cycle may be a problematic bottleneck to handle. Possible ways to
remove register read stalls are:
<ul><li>keep uops that read the same register close together, so that they are likely to go into the
same triplet.
<li>keep uops that read different registers spaced, so that they cannot go into the same
triplet.
<li>place uops that read a register no more than 3 - 4 triplets after an instruction that writes
to or modifies this register, to make sure it hasn't been written back before it is read (it
doesn't matter if you have a jump between, as long as it is predicted). If you have reason
to expect the register write to be delayed for whatever reason, then you can safely read
the register somewhat further down the instruction stream.
<li>use absolute addresses instead of pointers in order to reduce the number of register
reads.
<li>you may rename a register in a triplet where it doesn't cause a stall, in order to prevent a
read stall for this register in one or more later triplets. Example:
<kbd>MOV ESP,ESP / ... / MOV EAX,[ESP+8]</kbd>.
This method costs an extra uop and therefore doesn't pay unless the expected average
number of read stalls prevented is more than 1/3.
</ul>
<p>
For instructions that generate more than one uop, you may want to know the order of the
uops generated by the instruction in order to make a precise analysis of the possibility of
register read stalls. I have therefore listed the most common cases below.
<p>
<u>Writes to memory</u><br>
A memory write generates two uops. The first one (to port 4) is a store operation, reading
the register to store. The second uop (port 3) calculates the memory address, reading any
pointer registers. Examples:<br>
<kbd>MOV [EDI], EAX</kbd><br>
First uop reads <kbd>EAX</kbd>, second uop reads <kbd>EDI</kbd>.<br>
<kbd>FSTP QWORD PTR [EBX+8*ECX]</kbd><br>
First uop reads <kbd>ST(0)</kbd>, second uop reads <kbd>EBX</kbd> and <kbd>ECX.</kbd>
<p>
<u>Read and modify</u><br>
An instruction that reads a memory operand and modifies a register by some arithmetic or
logical operation generates two uops. The first one (port 2) is a memory load instruction
reading any pointer registers, the second uop is an arithmetic instruction (port 0 or 1)
reading and writing to the destination register and possibly writing to the flags.
Example:<br>
<kbd>ADD EAX, [ESI+20]</kbd><br>
First uop reads <kbd>ESI,</kbd> second uop reads <kbd>EAX</kbd> and writes <kbd>EAX</kbd> and flags.
<p>
<u>Read/modify/write</u><br>
A read/modify/write instruction generates four uops. The first uop (port 2) reads any pointer
registers, the second uop (port 0 or 1) reads and writes to any source register and possibly
writes to the flags, the third uop (port 4) reads only the temporary result, which doesn't count
here, the fourth uop (port 3) reads any pointer registers again. Since the first and the fourth
uop cannot go into the RAT together, you cannot take advantage of the fact that they read
the same pointer registers. Example:<br>
<kbd>OR [ESI+EDI], EAX</kbd><br>
The first uop reads <kbd>ESI</kbd> and <kbd>EDI</kbd>, the second uop reads <kbd>EAX</kbd> and writes <kbd>EAX</kbd> and the
flags, the third uop reads only the temporary result, the fourth uop reads <kbd>ESI</kbd> and <kbd>EDI</kbd> again. No
matter how these uops go into the RAT, you can be sure that the uop that reads <kbd>EAX</kbd> goes
together with one of the uops that read <kbd>ESI</kbd> and <kbd>EDI</kbd>. A register read stall is therefore
inevitable for this instruction unless one of the registers has been modified recently.
<p>
<u>Push register</u><br>
A push register instruction generates 3 uops. The first one (port 4) is a store instruction,
reading the register. The second uop (port 3) generates the address, reading the stack
pointer. The third uop (port 0 or 1) subtracts the word size from the stack pointer, reading
and modifying the stack pointer.
<p>
<u>Pop register</u><br>
A pop register instruction generates 2 uops. The first uop (port 2) loads the value, reading
the stack pointer and writing to the register. The second uop (port 0 or 1) adjusts the stack
pointer, reading and modifying the stack pointer.
<p>
<u>Call</u><br>
A near call generates 4 uops (port 1, 4, 3, 01). The first two uops read only the instruction
pointer, which doesn't count because it cannot be renamed. The third uop reads the stack
pointer. The last uop reads and modifies the stack pointer.
<p>
<u>Return</u><br>
A near return generates 4 uops (port 2, 01, 01, 1). The first uop reads the stack pointer.
The third uop reads and modifies the stack pointer.
<p>
An example of how to avoid a register read stall is given in example 2.6.
<p>
<h2><a name="17">17</a>. Out of order execution (PPro, PII and PIII)</h2>
The reorder buffer (ROB) can hold 40 uops. Each uop waits in the ROB until all its
operands are ready and there is a vacant execution unit for it. This makes out-of-order
execution possible. If one part of the code is delayed because of a cache miss, then it won't
delay later parts of the code if they are independent of the delayed operations.
<p>
Writes to memory cannot execute out of order relative to other writes. There are four write
buffers, so if you expect many cache misses on writes, or you are writing to uncached
memory, then it is recommended that you schedule four writes at a time and make sure the
processor has something else to do before you give it the next four writes. Memory reads
and other instructions can execute out of order, except <kbd>IN, OUT</kbd> and serializing
instructions.
<p>
If your code writes to a memory address and soon after reads from the same address, then
the read may by mistake be executed before the write, because the ROB doesn't know the
memory addresses at the time of reordering. This error is detected when the write address
is calculated, and then the read operation (which was executed speculatively)
has to be re-done. The penalty for this is approximately 3 clocks. The only way to avoid this penalty is to
make sure the execution unit has other things to do between a write and a subsequent read
from the same memory address.
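<p>
A sketch of the situation (hypothetical code):
<pre>        MOV [ESI], EAX           ; write; address not yet known to the ROB
        MOV EBX, [ESI]           ; read of the same address may be executed
                                 ; speculatively before the write and then
                                 ; re-done: approximately 3 clocks penalty</pre><p>
Inserting independent instructions between the write and the read gives the
execution units something else to do while the conflict is resolved.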
|
|
<p>
|
|
There are several execution units clustered around five ports. Port 0 and 1 are for
arithmetic operations etc. Simple move, arithmetic and logic operations can go to either port 0 or 1,
whichever is vacant first. Port 0 also handles multiplication, division, integer shifts and
rotates, and floating point operations. Port 1 also handles jumps and some MMX
and XMM operations. Port 2 handles all reads from memory and a few string and XMM operations, port 3 calculates addresses for memory
writes, and port 4 executes all memory write operations. In chapter <a href="#29">29</a> you'll find a complete
list of the uops generated by code instructions with an indication of which ports they go to.
Note that all memory write operations require two uops, one for port 3 and one for port 4,
while memory read operations use only one uop (port 2).
<p>
In most cases each port can receive one new uop per clock cycle. This means that you can
execute up to 5 uops in the same clock cycle if they go to five different ports, but since
there is a limit of 3 uops per clock earlier in the pipeline you will never execute more than 3
uops per clock on average.
<p>
You must make sure that no execution port receives more than one third of the uops if you
want to maintain a throughput of 3 uops per clock. Use the table of uops in chapter
<a href="#29">29</a> and
count how many uops go to each port. If port 0 and 1 are saturated while port 2 is free then
you can improve your code by replacing some <kbd>MOV register,register</kbd>
or <kbd>MOV register,immediate</kbd> instructions with
<kbd>MOV register,memory</kbd> in order to move some
of the load from port 0 and 1 to port 2.
<p>
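As a sketch of this technique (the variable name <kbd>CONST10</kbd> is hypothetical; it is assumed to hold the constant 10 in cached memory), a constant needed repeatedly can be loaded from memory so that the uop goes to port 2 instead of port 0 or 1:
<pre> MOV EAX, 10 ; uop goes to port 0 or 1
 MOV EAX, [CONST10] ; uop goes to port 2 instead</pre><p>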
Most uops take only one clock cycle to execute, but multiplications, divisions, and many
floating point operations take more:
<p>
Floating point addition and subtraction takes 3 clocks, but the execution unit is fully
pipelined so that it can receive a new <kbd>FADD</kbd> or <kbd>FSUB</kbd>
in every clock cycle before the
preceding ones are finished (provided, of course, that they are independent).
<p>
Integer multiplication takes 4 clocks, floating point multiplication 5, and
MMX multiplication 3 clocks. Integer and MMX multiplication is pipelined so
that it can receive a new instruction every clock cycle. Floating point
multiplication is partially pipelined: The execution unit can receive a new
<kbd>FMUL</kbd> instruction two clocks after the preceding one, so that the
maximum throughput is one <kbd>FMUL</kbd> per two clock cycles. The holes
between the <kbd>FMUL</kbd>'s cannot be filled by integer multiplications
because they use the same circuitry. XMM additions and multiplications take
3 and 4 clocks respectively, and are fully pipelined. But since each logical XMM
register is implemented as two physical 64-bit registers, you need two uops for a
packed XMM operation, and the throughput will then be one arithmetic XMM
instruction every two clock cycles. XMM add and multiply instructions can execute
in parallel because they don't use the same execution port.
<p>
Integer and floating point division takes up to 39 clocks and is not pipelined. This means
that the execution unit cannot begin a new division until the previous division is finished.
The same applies to square root and transcendental functions.
<p>
Jump instructions, calls, and returns are also not fully pipelined. You cannot execute a new
jump in the first clock cycle after a preceding jump. So the maximum throughput for jumps,
calls, and returns is one for every two clocks.
<p>
You should, of course, avoid instructions that generate many uops.
The <kbd>LOOP XX</kbd>
instruction, for example, should be replaced by <kbd>DEC ECX / JNZ XX</kbd>.
<p>
If you have consecutive <kbd>POP</kbd> instructions then you may break them up to reduce the
number of uops:
<pre>POP ECX / POP EBX / POP EAX ; can be changed to:
MOV ECX,[ESP] / MOV EBX,[ESP+4] / MOV EAX,[ESP+8] / ADD ESP,12</pre>
The former code generates 6 uops, the latter generates only 4 and decodes faster.
Doing the same with <kbd>PUSH</kbd> instructions is less advantageous because the split-up code is likely to
generate register read stalls unless you have other instructions to put in between or the
registers have been renamed recently. Doing it with <kbd>CALL</kbd> and <kbd>RET</kbd>
instructions will
interfere with prediction in the return stack buffer. Note also that the
<kbd>ADD ESP</kbd> instruction can cause an AGI stall in earlier processors.
<p>
<h2><a name="18">18</a>. Retirement (PPro, PII and PIII)</h2>
Retirement is a process where the temporary registers used by the uops are copied into the
permanent registers <kbd>EAX, EBX</kbd>, etc.
When a uop has been executed it is marked in the ROB as ready to retire.
<p>
The retirement station can handle three uops per clock cycle. This may not seem like a
problem because the throughput is already limited to 3 uops per clock in the RAT. But
retirement may still be a bottleneck for two reasons. Firstly, instructions must retire in order.
If a uop is executed out of order then it cannot retire before all preceding uops in the order
have retired. And the second limitation is that taken jumps must retire in the first of the three
slots in the retirement station. Just as decoders D1 and D2 can be idle if the next instruction
only fits into D0, the last two slots in the retirement station can be idle if the next uop to
retire is a taken jump. This is significant if you have a small loop where the number of uops
in the loop is not divisible by three.
<p>
All uops stay in the reorder buffer (ROB) until they retire. The ROB can hold 40 uops. This
sets a limit to the number of instructions that can execute during the long delay of a division
or other slow operation. Before the division is finished the ROB will be filled up with
executed uops waiting to retire. Only when the division is finished and retired can the
subsequent uops begin to retire, because retirement takes place in order.
<p>
In case of speculative execution of predicted branches (see chapter <a href="#22">22</a>) the speculatively
executed uops cannot retire until it is certain that the prediction was correct. If the prediction
turns out to be wrong then the speculatively executed uops are discarded without
retirement.
<p>
The following instructions cannot execute speculatively: memory writes,
<kbd>IN, OUT</kbd>, and serializing instructions.
<p>
<h2><a name="19">19</a>. Partial stalls (PPro, PII and PIII)</h2>
<h3><a name="19_1">19.1 Partial register stalls</a></h3>
Partial register stall is a problem that occurs when you write to part of a 32
bit register and later read from the whole register or a bigger part of it.
Example:
<pre> MOV AL, BYTE PTR [MEM8]
 MOV EBX, EAX ; partial register stall</pre><p>
This gives a delay of 5-6 clocks. The reason is that a temporary register has been assigned
to <kbd>AL</kbd> (to make it independent of <kbd>AH</kbd>).
The execution unit has to wait until the write to <kbd>AL</kbd> has
retired before it is possible to combine the value from <kbd>AL</kbd> with the
value of the rest of <kbd>EAX</kbd>.
The stall can be avoided by changing the code to:
<pre> MOVZX EBX, BYTE PTR [MEM8]
 AND EAX, 0FFFFFF00h
 OR EBX, EAX</pre>
<p>
Of course you can also avoid the partial stalls by putting in other instructions after the write
to the partial register so that it has time to retire before you read from the full register.
<p>
You should be aware of partial stalls whenever you mix different data sizes (8, 16, and 32
bits):
<pre> MOV BH, 0
 ADD BX, AX ; stall
 INC EBX ; stall</pre><p>
You don't get a stall when reading a partial register after writing to the full register, or a
bigger part of it:
<pre> MOV EAX, [MEM32]
 ADD BL, AL ; no stall
 ADD BH, AH ; no stall
 MOV CX, AX ; no stall
 MOV DX, BX ; stall</pre><p>
The easiest way to avoid partial register stalls is to always use full registers
and use <kbd>MOVZX</kbd> or <kbd>MOVSX</kbd> when reading from smaller memory
operands. These instructions are fast on the
PPro, PII and PIII, but slow on earlier processors. Therefore, a compromise is
offered when you
want your code to perform reasonably well on all processors. The replacement
for <kbd>MOVZX EAX,BYTE PTR [M8]</kbd> looks like this:
<pre> XOR EAX, EAX
 MOV AL, BYTE PTR [M8]</pre><p>
The PPro, PII and PIII processors make a special case out of this combination
to avoid a partial
register stall when later reading from <kbd>EAX</kbd>.
The trick is that a register is tagged as empty
when it is <kbd>XOR</kbd>'ed with itself. The processor remembers that the upper 24 bits of
<kbd>EAX</kbd> are zero, so that a partial stall can be avoided. This mechanism
works only on certain combinations:
<pre> XOR EAX, EAX
 MOV AL, 3
 MOV EBX, EAX ; no stall

 XOR AH, AH
 MOV AL, 3
 MOV BX, AX ; no stall

 XOR EAX, EAX
 MOV AH, 3
 MOV EBX, EAX ; stall

 SUB EBX, EBX
 MOV BL, DL
 MOV ECX, EBX ; no stall

 MOV EBX, 0
 MOV BL, DL
 MOV ECX, EBX ; stall

 MOV BL, DL
 XOR EBX, EBX ; no stall</pre><p>
Setting a register to zero by subtracting it from itself works the same as the
<kbd>XOR</kbd>, but setting it to zero with the <kbd>MOV</kbd> instruction
doesn't prevent the stall.
<p>
You can set the <kbd>XOR</kbd> outside a loop:
<pre> XOR EAX, EAX
 MOV ECX, 100
LL: MOV AL, [ESI]
 MOV [EDI], EAX ; no stall
 INC ESI
 ADD EDI, 4
 DEC ECX
 JNZ LL</pre><p>
The processor remembers that the upper 24 bits of <kbd>EAX</kbd> are zero as
long as you don't get
an interrupt, misprediction, or other serializing event.
<p>
You should remember to neutralize any partial register you have used before calling a
subroutine that might push the full register:
<pre> ADD BL, AL
 MOV [MEM8], BL
 XOR EBX, EBX ; neutralize BL
 CALL _HighLevelFunction</pre><p>
Most high level language procedures push <kbd>EBX</kbd> at the start of the
procedure which would generate a partial register stall in the example above
if you hadn't neutralized <kbd>BL</kbd>.
<p>
Setting a register to zero with the <kbd>XOR</kbd> method doesn't break its dependency on earlier
instructions:
<pre> DIV EBX
 MOV [MEM], EAX
 MOV EAX, 0 ; break dependency
 XOR EAX, EAX ; prevent partial register stall
 MOV AL, CL
 ADD EBX, EAX</pre><p>
Setting <kbd>EAX</kbd> to zero twice here seems redundant, but without
the <kbd>MOV EAX,0</kbd> the last
instructions would have to wait for the slow <kbd>DIV</kbd> to finish, and
without <kbd>XOR EAX,EAX</kbd> you
would have a partial register stall.
<p>
The <kbd>FNSTSW AX</kbd> instruction is special: in 32 bit mode it behaves as
if writing to the entire <kbd>EAX</kbd>. In fact, it does something like this
in 32 bit mode:<br>
<kbd> AND EAX,0FFFF0000h / FNSTSW TEMP / OR EAX,TEMP</kbd><br>
hence, you don't get a partial register stall when reading <kbd>EAX</kbd>
after this instruction in 32 bit mode:
<pre> FNSTSW AX / MOV EBX,EAX ; stall only if 16 bit mode
 MOV AX,0 / FNSTSW AX ; stall only if 32 bit mode</pre>
<p>
<h3><a name="19_2">19.2 Partial flags stalls</a></h3>
The flags register can also cause partial register stalls:
<pre> CMP EAX, EBX
 INC ECX
 JBE XX ; partial flags stall</pre><p>
The <kbd>JBE</kbd> instruction reads both the carry flag and the zero flag.
Since the <kbd>INC</kbd> instruction
changes the zero flag, but not the carry flag, the <kbd>JBE</kbd> instruction has to wait for the two
preceding instructions to retire before it can combine the carry flag from the <kbd>CMP</kbd> instruction
and the zero flag from the <kbd>INC</kbd> instruction. This situation is likely to be a bug rather than an
intended combination of flags. To correct it change <kbd>INC ECX</kbd> to <kbd>ADD ECX,1</kbd>.
A similar
bug that causes a partial flags stall is <kbd>SAHF / JL XX</kbd>. The <kbd>JL</kbd>
instruction tests the sign
flag and the overflow flag, but <kbd>SAHF</kbd> doesn't change the overflow flag.
To correct it, change
<kbd>JL XX</kbd> to <kbd>JS XX</kbd>.
<p>
Unexpectedly (and contrary to what Intel manuals say) you also get a partial flags stall after
an instruction that modifies some of the flag bits when reading only unmodified flag bits:
<pre> CMP EAX, EBX
 INC ECX
 JC XX ; partial flags stall</pre><p>
but not when reading only modified bits:
<pre> CMP EAX, EBX
 INC ECX
 JE XX ; no stall</pre><p>
Partial flags stalls are likely to occur on instructions that read many or
all flags bits, i.e. <kbd>LAHF, PUSHF, PUSHFD</kbd>. The following instructions cause
partial flags stalls when followed by <kbd>LAHF</kbd> or <kbd>PUSHF(D)</kbd>:
<kbd>INC, DEC, TEST</kbd>, bit tests, bit scan, <kbd>CLC, STC, CMC, CLD, STD, CLI, STI, MUL,
IMUL</kbd>, and all shifts and rotates.
The following instructions do not cause partial flags stalls:
<kbd>AND, OR, XOR, ADD, ADC, SUB, SBB, CMP, NEG</kbd>.
It is strange that <kbd>TEST</kbd> and <kbd>AND</kbd> behave differently while, by definition, they
do exactly the same thing to the flags. You may use a <kbd>SETcc</kbd>
instruction instead of <kbd>LAHF</kbd>
or <kbd>PUSHF(D)</kbd> for storing the value of a flag in order to avoid a stall.
<p>
Examples:
<pre> INC EAX / PUSHFD ; stall
 ADD EAX,1 / PUSHFD ; no stall

 SHR EAX,1 / PUSHFD ; stall
 SHR EAX,1 / OR EAX,EAX / PUSHFD ; no stall

 TEST EBX,EBX / LAHF ; stall
 AND EBX,EBX / LAHF ; no stall
 TEST EBX,EBX / SETZ AL ; no stall

 CLC / SETZ AL ; stall
 CLD / SETZ AL ; no stall</pre><p>
The penalty for partial flags stalls is approximately 4 clocks.
<p>
<h3><a name="19_3">19.3 Flags stalls after shifts and rotates</a></h3>
You can get a stall resembling the partial flags stall when reading any flag
bit after a shift or rotate, except for shifts and rotates by one (short form):
<pre> SHR EAX,1 / JZ XX ; no stall
 SHR EAX,2 / JZ XX ; stall
 SHR EAX,2 / OR EAX,EAX / JZ XX ; no stall

 SHR EAX,5 / JC XX ; stall
 SHR EAX,4 / SHR EAX,1 / JC XX ; no stall

 SHR EAX,CL / JZ XX ; stall, even if CL = 1
 SHRD EAX,EBX,1 / JZ XX ; stall
 ROL EBX,8 / JC XX ; stall</pre><p>
The penalty for these stalls is approximately 4 clocks.
<p>
<h3><a name="19_4">19.4 Partial memory stalls</a></h3>
A partial memory stall is somewhat analogous to a partial register stall. It occurs when you
mix data sizes for the same memory address:
<pre> MOV BYTE PTR [ESI], AL
 MOV EBX, DWORD PTR [ESI] ; partial memory stall</pre><p>
Here you get a stall because the processor has to combine the byte written from <kbd>AL</kbd> with the
next three bytes, which were in memory before, to get the four bytes needed for
reading into <kbd>EBX</kbd>. The penalty is approximately 7-8 clocks.
<p>
Unlike the partial register stalls, you also get a partial memory stall when you write a bigger
operand to memory and then read part of it, if the smaller part doesn't start at the same
address:
<pre> MOV DWORD PTR [ESI], EAX
 MOV BL, BYTE PTR [ESI] ; no stall
 MOV BH, BYTE PTR [ESI+1] ; stall</pre><p>
You can avoid this stall by changing the last line to <kbd>MOV BH,AH</kbd>,
but such a solution is not
possible in a situation like this:
<pre> FISTP QWORD PTR [EDI]
 MOV EAX, DWORD PTR [EDI]
 MOV EDX, DWORD PTR [EDI+4] ; stall</pre><p>
Interestingly, you can also get a partial memory stall when writing and reading completely
different addresses if they happen to have the same set-value in different cache banks:
<pre> MOV BYTE PTR [ESI], AL
 MOV EBX, DWORD PTR [ESI+4092] ; no stall
 MOV ECX, DWORD PTR [ESI+4096] ; stall</pre>
<p>
<h2><a name="20">20</a>. Dependency chains (PPro, PII and PIII)</h2>
A series of instructions where each instruction depends on the result of the preceding one
is called a dependency chain. Long dependency chains should be avoided, if possible,
because they prevent out-of-order and parallel execution.
<p>
Example:
<pre> MOV EAX, [MEM1]
 ADD EAX, [MEM2]
 ADD EAX, [MEM3]
 ADD EAX, [MEM4]
 MOV [MEM5], EAX</pre><p>
In this example, the <kbd>ADD</kbd> instructions generate 2 uops each, one for reading from memory
(port 2), and one for adding (port 0 or 1). The read uops can execute out of order, while the
add uops must wait for the previous uops to finish. This dependency chain does not take
very long to execute, because each addition adds only 1 clock to the execution time. But if
you have slow instructions like multiplications, or even worse: divisions, then you should
definitely do something to break the dependency chain. The way to do this is to use multiple
accumulators:
<pre> MOV EAX, [MEM1] ; start first chain
|
|
MOV EBX, [MEM2] ; start other chain in different accumulator
|
|
IMUL EAX, [MEM3]
|
|
IMUL EBX, [MEM4]
|
|
IMUL EAX, EBX ; join chains in the end
|
|
MOV [MEM5], EAX</pre><p>
|
|
Here, the second <kbd>IMUL</kbd> instruction can start before the first one is finished.
|
|
Since the <kbd>IMUL</kbd> instruction has a delay of 4 clocks and is fully pipelined, you
|
|
may have up to 4 accumulators.
|
|
<p>
|
|
Division is not pipelined so you cannot do the same with chained divisions,
but you can of course multiply all the divisors and do only one division in
the end.
<p>
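A sketch of this (the memory labels <kbd>X, B, C, RESULT</kbd> are hypothetical); X/B/C is computed with a single division, mathematically equivalent apart from rounding:
<pre> FLD [X]
 FLD [B]
 FMUL [C] ; B*C
 FDIV ; X / (B*C) instead of (X/B)/C
 FSTP [RESULT]</pre><p>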
Floating point instructions have a longer delay than integer instructions, so
you should definitely break up long dependency chains with floating point
instructions:
<pre> FLD [MEM1] ; start first chain
 FLD [MEM2] ; start second chain in different accumulator
 FADD [MEM3]
 FXCH
 FADD [MEM4]
 FXCH
 FADD [MEM5]
 FADD ; join chains in the end
 FSTP [MEM6]</pre><p>
You need a lot of <kbd>FXCH</kbd> instructions for this, but don't worry: they
are cheap. <kbd>FXCH</kbd>
instructions are resolved in the RAT by register renaming so they don't
put any load on the execution ports. An <kbd>FXCH</kbd> does count as 1 uop in the RAT,
ROB, and retirement station, though.
<p>
If the dependency chain is long you may need three accumulators:
<pre> FLD [MEM1] ; start first chain
 FLD [MEM2] ; start second chain
 FLD [MEM3] ; start third chain
 FADD [MEM4] ; third chain
 FXCH ST(1)
 FADD [MEM5] ; second chain
 FXCH ST(2)
 FADD [MEM6] ; first chain
 FXCH ST(1)
 FADD [MEM7] ; third chain
 FXCH ST(2)
 FADD [MEM8] ; second chain
 FXCH ST(1)
 FADD ; join first and third chain
 FADD ; join with second chain
 FSTP [MEM9]</pre><p>
Avoid storing intermediate data in memory and reading them immediately afterwards:
<pre> MOV [TEMP], EAX
 MOV EBX, [TEMP]</pre><p>
There is a penalty for attempting to read from a memory address before a previous write to
that address is finished. In the example above, change the last instruction to
<kbd>MOV EBX,EAX</kbd>
or put some other instructions in between.
<p>
There is one situation where you cannot avoid storing intermediate data in memory, and
that is when transferring data from an integer register to a floating point register, or vice
versa. For example:
<pre> MOV EAX, [MEM1]
 ADD EAX, [MEM2]
 MOV [TEMP], EAX
 FILD [TEMP]</pre><p>
If you don't have anything to put in between the write to <kbd>TEMP</kbd> and the
read from <kbd>TEMP</kbd>, then
you may consider using a floating point register instead of <kbd>EAX</kbd>:
<pre> FILD [MEM1]
 FIADD [MEM2]</pre><p>
Consecutive jumps, calls, or returns may also be considered dependency chains. The
throughput for these instructions is one jump per two clock cycles. It is therefore
recommended that you give the microprocessor something else to do between the jumps.
<p>
<h2><a name="21">21</a>. Searching for bottlenecks (PPro, PII and PIII)</h2>
When optimizing code for these processors, it is important to analyze where the
bottlenecks are. Spending time on optimizing away one bottleneck doesn't make sense if
there is another bottleneck which is narrower.
<p>
If you expect code cache misses then you should restructure your code to keep the most
used parts of code together.
<p>
If you expect many data cache misses then forget about everything else and concentrate
on how to restructure your data to reduce the number of cache misses (chapter <a href="#7">7</a>), and
avoid long dependency chains after a data read cache miss (chapter <a href="#20">20</a>).
<p>
If you have many divisions then try to reduce them (chapter <a href="#27_2">27.2</a>) and make sure the
processor has something else to do during the divisions.
<p>
Dependency chains tend to hamper out-of-order execution (chapter <a href="#20">20</a>). Try to break long
dependency chains, especially if they contain slow instructions such as multiplication,
division, and floating point instructions.
<p>
If you have many jumps, calls, or returns, and especially if the jumps are poorly predictable,
then see whether some of them can be avoided. Replace conditional jumps with conditional moves
if possible, and replace small procedures with macros (chapter <a href="#22_3">22.3</a>).
<p>
If you are mixing different data sizes (8, 16, and 32 bit integers) then look out for partial
stalls. If you use <kbd>PUSHF</kbd> or <kbd>LAHF</kbd> instructions then look out for partial flags stalls. Avoid
testing flags after shifts or rotates by more than 1 (chapter <a href="#19">19</a>).
<p>
If you aim at a throughput of 3 uops per clock cycle then be aware of possible delays in
instruction fetch and decoding (chapters <a href="#14">14</a> and <a href="#15">15</a>), especially in small loops.
<p>
The limit of two permanent register reads per clock cycle may reduce your throughput to
less than 3 uops per clock cycle (chapter <a href="#16_2">16.2</a>). This is likely to happen if you often read
registers more than 4 clock cycles after they last were modified. This may, for example,
happen if you often use pointers for addressing your data but seldom modify the pointers.
<p>
A throughput of 3 uops per clock requires that no execution port gets more than one third of
the uops (chapter <a href="#17">17</a>).
<p>
The retirement station can handle 3 uops per clock, but may be slightly less effective for
taken jumps (chapter <a href="#18">18</a>).
<p>
<h2><a name="22">22</a>. Jumps and branches (all processors)</h2>
The Pentium family of processors attempts to predict where a jump will go to, and whether a
conditional jump will be taken or fall through. If the prediction is correct, then it can save a
considerable amount of time by loading the subsequent instructions into the pipeline and
start decoding them before the jump is executed. If the prediction turns out to be wrong,
then the pipeline has to be flushed, which will cost a penalty depending on the length of the
pipeline.
<p>
The predictions are based on a Branch Target Buffer (BTB) which stores the history for
each branch or jump instruction and makes predictions based on the prior history of
executions of each instruction. The BTB is organized like a set-associative cache where
new entries are allocated according to a pseudo-random replacement method.
<p>
When optimizing code, it is important to minimize the number of misprediction penalties.
This requires a good understanding of how the jump prediction works.
<p>
The branch prediction mechanisms are not described adequately in Intel manuals or
anywhere else. I am therefore giving a very detailed description here. This information is
based on my own research (with the help of Karki Jitendra Bahadur for the PPlain).
<p>
In the following, I will use the term 'control transfer instruction' for any instruction which can
change the instruction pointer, including conditional and unconditional, direct and indirect,
near and far, jumps, calls, and returns. All these instructions use prediction.
<p>
<h3><a name="22_1">22.1 Branch prediction in PPlain</a></h3>
The branch prediction mechanism for the PPlain is very different from the other three
processors. Information found in Intel documents and elsewhere on this subject is directly
misleading, and following the advice given in such documents is likely to lead to
sub-optimal code.
<p>
The PPlain has a branch target buffer (BTB), which can hold information for up to 256 jump
instructions. The BTB is organized like a 4-way set-associative cache with 64 entries per
way. This means that the BTB can hold no more than 4 entries with the same set value.
Unlike the data cache, the BTB uses a pseudo random replacement algorithm, which
means that a new entry will not necessarily displace the least recently used entry of the
same set-value. How the set-value is calculated will be explained later. Each BTB entry
stores the address of the jump target and a prediction state, which can have four different
values:
<p>
state 0: "strongly not taken" <br>
state 1: "weakly not taken" <br>
state 2: "weakly taken" <br>
state 3: "strongly taken"<p>
A branch instruction is predicted to jump when in state 2 or 3, and to fall through when in
state 0 or 1. The state transition works like a two-bit counter, so that the state is
incremented when the branch is taken, and decremented when it falls through. The counter
saturates, rather than wrapping around, so that it does not decrement beyond 0 or increment
beyond 3. Ideally, this would provide a reasonably good prediction, because a branch
instruction would have to deviate twice from what it does most of the time, before the
prediction changes.
<p>
However, this mechanism has been compromised by the fact that state 0 also means
'unused BTB entry'. So a BTB entry in state 0 is the same as no BTB entry. This makes
sense, because a branch instruction is predicted to fall through if it has no BTB entry. This
improves the utilization of the BTB, because a branch instruction which is seldom taken will
most of the time not take up any BTB entry.
<p>
Now, if a jumping instruction has no BTB entry, then a new BTB entry will be generated,
and this new entry will always be set to state 3. This means that it is impossible to go from
state 0 to state 1 (except for a very special case discussed later). From state 0 you can only
go to state 3, if the branch is taken. If the branch falls through, then it will stay out of the
BTB.
<p>
This is a serious design flaw. By throwing state 0 entries out of the BTB and always setting
new entries to state 3, the designers apparently have given priority to minimizing the first
time penalty for unconditional jumps and branches often taken, and ignored that this
seriously compromises the basic idea behind the mechanism and reduces the performance
in small innermost loops. The consequence of this flaw is that a branch instruction which
falls through most of the time will have up to three times as many mispredictions as a
branch instruction which is taken most of the time. (Apparently, Intel engineers
were unaware of this flaw until I published my findings.)
<p>
You may take this asymmetry into account by organizing your branches so that they are
taken more often than not. Consider for example this if-then-else construction:
<pre> TEST EAX,EAX
 JZ A
 <branch 1>
 JMP E
A: <branch 2>
E:</pre>
<p>
If branch 1 is executed more often than branch 2, and branch 2 is seldom executed twice in
succession, then you can reduce the number of branch mispredictions by up to a factor of 3
by swapping the two branches so that the branch instruction will jump more often than fall
through:
<pre> TEST EAX,EAX
 JNZ A
 <branch 2>
 JMP E
A: <branch 1>
E:</pre><p>
(This is contrary to the recommendations in Intel's manuals and tutorials).
<p>
There may be reasons to put the most often executed branch first, however:
<ol>
<li>Putting seldom executed branches away in the bottom of your code can improve code
cache utilization.
<li>A branch instruction seldom taken will stay out of the BTB most of the time, possibly
improving BTB utilization.
<li>The branch instruction will be predicted as not taken if it has been flushed out of the
BTB by other branch instructions.
<li>The asymmetry in branch prediction only exists on the PPlain.
</ol>
<p>
These considerations have little weight, however, for small critical loops, so I would still
recommend organizing branches with a skewed distribution so that the branch instruction is
taken more often than not, unless branch 2 is executed so seldom that misprediction
doesn't matter.
<p>
Likewise, you should preferably organize loops with the testing branch instruction at the
bottom, as in this example:
<pre> MOV ECX, [N]
L: MOV [EDI],EAX
 ADD EDI,4
 DEC ECX
 JNZ L</pre><p>
If N is high, then the JNZ instruction here will be taken more often than not, and never fall
through twice in succession.
<p>
Consider the situation where a branch is taken every second time. The first time it jumps
the BTB entry will go into state 3, and will then alternate between state 2 and 3. It is
predicted to jump all the time, which gives 50% mispredictions. <a name="worstpred">Assume now that it deviates
from this regular pattern and falls through an extra time</a>. The jump pattern is:<pre>
01010100101010101010101, where 0 means nojump, and 1 means jump.
       ^</pre>
The extra nojump is indicated with a <kbd>^</kbd> above. After this incident, the BTB entry will alternate
between state 1 and 2, which gives 100% mispredictions. It will continue in this unfortunate
mode until there is another deviation from the 0101 pattern. This is the worst case for this
branch prediction mechanism.
<p>
<h4>22.1.2 BTB is looking ahead (PPlain)</h4>
The BTB mechanism is counting instruction pairs, rather than single instructions, so you
have to know how instructions are pairing in order to analyze where a BTB entry is stored.
The BTB entry for any control transfer instruction is attached to the address of the U-pipe
instruction in the preceding instruction pair. (An unpaired instruction counts as one pair.)
Example:
<pre> SHR EAX,1
 MOV EBX,[ESI]
 CMP EAX,EBX
 JB L</pre><p>
Here <kbd>SHR</kbd> pairs with <kbd>MOV</kbd>, and <kbd>CMP</kbd> pairs with
<kbd>JB</kbd>. The BTB entry for <kbd>JB L</kbd> is thus
attached to the address of the <kbd>SHR EAX,1</kbd> instruction. When this BTB entry is met, and if it
is in state 2 or 3, then the Pentium will read the target address from the BTB entry, and load
the instructions following L into the pipeline. This happens before the branch instruction has
been decoded, so the Pentium relies solely on the information in the BTB when doing this.
<p>
You may remember that instructions are seldom pairing the first time they are executed
(see chapter <a href="#8">8</a>). If the instructions above are not pairing, then the BTB entry should be
attached to the address of the <kbd>CMP</kbd> instruction, and this entry would be wrong on the next
execution, when instructions are pairing. However, in most cases the PPlain is smart
enough not to make a BTB entry when there is an unused pairing opportunity, so you don't
get a BTB entry until the second execution, and hence you won't get a prediction until the
third execution. (In the rare case where every second instruction is a single-byte
instruction, you may get a BTB entry on the first execution which becomes invalid in the
second execution, but since the instruction it is attached to will then go to the V-pipe, it is
ignored and gives no penalty. A BTB entry is only read if it is attached to the address of a
U-pipe instruction.)
<p>
A BTB entry is identified by its set-value which is equal to bits 0-5 of the address it is
attached to. Bits 6-31 are then stored in the BTB as a tag. Addresses which are spaced a
multiple of 64 bytes apart will have the same set-value. You can have no more than four
BTB entries with the same set-value. If you want to check whether your jump instructions
contend for the same BTB entries, then you have to compare bits 0-5 of the addresses of
the U-pipe instructions in the preceding instruction pairs. This is very tedious, and I have
never heard of anybody doing so. There are no tools available to do this job for you.
<p>
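The tedious bit-comparison described above is easy to mechanize. The following Python sketch is hypothetical (no such tool exists, and <kbd>btb_set</kbd> and <kbd>contending</kbd> are made-up names); it only illustrates the set-value arithmetic, with arbitrary example addresses:

```python
# Hypothetical sketch of the check described above: on the PPlain,
# bits 0-5 of the address select the BTB set, and at most four
# entries can share one set.

def btb_set(address):
    # bits 0-5 of the address give the set-value
    return address & 0x3F

def contending(addresses):
    # group the addresses (of the U-pipe instructions in the preceding
    # instruction pairs) by set-value; report sets holding more than four
    sets = {}
    for a in addresses:
        sets.setdefault(btb_set(a), []).append(a)
    return {s: v for s, v in sets.items() if len(v) > 4}

# five addresses spaced a multiple of 64 bytes apart land in the same set:
print(contending([0x0100, 0x0140, 0x0180, 0x01C0, 0x0200, 0x0203]))
```

Feeding it the U-pipe instruction addresses from an assembler list file would reveal any set with more than four contenders.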
<h4>22.1.3 Consecutive branches (PPlain)</h4>
When a jump is mispredicted, then the pipeline gets flushed. If the next instruction pair
executed also contains a control transfer instruction, then the PPlain won't load its target
because it cannot load a new target while the pipeline is being flushed. The result is that the
second jump instruction is predicted to fall through regardless of the state of its BTB entry.
Therefore, if the second jump is also taken, then you will get another penalty. The state of
the BTB entry for the second jump instruction does get correctly updated, though. If you
have a long chain of control transfer instructions, and the first jump in the chain is
mispredicted, then the pipeline will get flushed all the time, and you will get nothing but
mispredictions until you meet an instruction pair which does not jump. The most extreme
case of this is a loop which jumps to itself: It will get a misprediction penalty for each
iteration.
<p>
This is not the only problem with consecutive control transfer instructions. Another problem
is that you can have another branch instruction between a BTB entry and the control
transfer instruction it belongs to. If the first branch instruction jumps to somewhere else,
then strange things may happen. Consider this example:
<pre>    SHR  EAX,1
    MOV  EBX,[ESI]
    CMP  EAX,EBX
    JB   L1
    JMP  L2

L1: MOV  EAX,EBX
    INC  EBX</pre><p>
When <kbd>JB L1</kbd> falls through, you will get a BTB entry for
<kbd>JMP L2</kbd> attached to the
address of <kbd>CMP EAX,EBX</kbd>. But what will happen when <kbd>JB L1</kbd>
later is taken? At the time
when the BTB entry for <kbd>JMP L2</kbd> is read, the processor doesn't know that the next
instruction pair does not contain a jump instruction, so it will actually predict the instruction
pair <kbd>MOV EAX,EBX / INC EBX</kbd> to jump to <kbd>L2</kbd>.
The penalty for predicting non-jump
instructions to jump is 3 clock cycles. The BTB entry for <kbd>JMP L2</kbd> will get its state
decremented, because it is applied to something which doesn't jump. If we keep going to
<kbd>L1</kbd>, then the BTB entry for <kbd>JMP L2</kbd> will be decremented to state 1 and 0, so that the
problem will disappear until the next time <kbd>JMP L2</kbd> is executed.
<p>
The penalty for predicting the non-jumping instructions to jump only occurs when the jump
to <kbd>L1</kbd> is predicted. If <kbd>JB L1</kbd> jumps but was mispredicted,
then the pipeline gets
flushed and we won't get the false <kbd>L2</kbd> target loaded, so in this case we will not see the
penalty of predicting the non-jumping instructions to jump, but we do get the BTB entry for
<kbd>JMP L2</kbd> decremented.
<p>
Suppose, now, that we replace the <kbd>INC EBX</kbd> instruction above with another jump
instruction. This third jump instruction will then use the same BTB entry as
<kbd>JMP L2</kbd> with
the possible penalty of predicting a wrong target (unless it happens to also
have <kbd>L2</kbd> as target).
<p>
To summarize, consecutive jumps can lead to the following problems:
<ul>
<li>failure to load a jump target when the pipeline is being flushed by a preceding
mispredicted jump.
<li>a BTB entry being mis-applied to non-jumping instructions and predicting them to jump.
<li>a second consequence of the above is that a mis-applied BTB entry will get its state
decremented, possibly leading to a later misprediction of the jump it belongs to. Even
unconditional jumps can be predicted to fall through for this reason.
<li>two jump instructions may share the same BTB entry, leading to the prediction of a
wrong target.
</ul>
<p>
All this mess may give you a lot of penalties, so you should definitely avoid having an
instruction pair containing a jump immediately after another poorly predictable control
transfer instruction or its target.
<p>
It is time for another illustrative example:
<pre>    CALL P
    TEST EAX,EAX
    JZ   L2
L1: MOV  [EDI],EBX
    ADD  EDI,4
    DEC  EAX
    JNZ  L1
L2: CALL P</pre><p>
This looks like a quite nice and normal piece of code: A function call, a loop which is
bypassed when the count is zero, and another function call. How many problems can you
spot in this program?
<p>
First, we may note that the function <kbd>P</kbd> is called alternately from two different locations.
This means that the target for the return from <kbd>P</kbd> will be changing all the time. Consequently,
the return from <kbd>P</kbd> will always be mispredicted.
<p>
Assume, now, that <kbd>EAX</kbd> is zero. The jump to <kbd>L2</kbd> will not have its target loaded because the
mispredicted return caused a pipeline flush. Next, the second <kbd>CALL P</kbd> will also fail to have
its target loaded because <kbd>JZ L2</kbd> caused a pipeline flush. Here we have the situation where
a chain of consecutive jumps makes the pipeline flush repeatedly because the first jump
was mispredicted. The BTB entry for <kbd>JZ L2</kbd> is stored at the address of <kbd>P</kbd>'s return
instruction. This BTB entry will now be mis-applied to whatever comes after the second
<kbd>CALL P</kbd>, but that doesn't give a penalty because the pipeline is flushed by the mispredicted
second return.
<p>
Now, let's see what happens if <kbd>EAX</kbd> has a nonzero value the next time:
<kbd>JZ L2</kbd> is always
predicted to fall through because of the flush. The second <kbd>CALL P</kbd>
has a BTB entry at the
address of <kbd>TEST EAX,EAX</kbd>. This entry will be mis-applied to the
<kbd>MOV/ADD</kbd> pair, predicting
it to jump to <kbd>P</kbd>. This causes a flush which prevents <kbd>JNZ L1</kbd>
from loading its target. If we
have been here before, then the second <kbd>CALL P</kbd> will have another BTB entry at the
address of <kbd>DEC EAX</kbd>. On the second and third iteration of the loop, this entry will also be
mis-applied to the <kbd>MOV/ADD</kbd> pair, until it has had its state decremented to 1 or 0. This will
not cause a penalty on the second iteration because the flush from <kbd>JNZ L1</kbd> prevents it
from loading its false target, but on the third iteration it will. The subsequent iterations of the
loop have no penalties, but when it exits, <kbd>JNZ L1</kbd> is mispredicted. The flush would now
prevent <kbd>CALL P</kbd> from loading its target, were it not for the fact that the BTB entry for
<kbd>CALL P</kbd> has already been destroyed by being mis-applied several times.
<p>
We can improve this code by putting in some <kbd>NOP</kbd>'s to separate all consecutive jumps:
<pre>    CALL P
    TEST EAX,EAX
    NOP
    JZ   L2
L1: MOV  [EDI],EBX
    ADD  EDI,4
    DEC  EAX
    JNZ  L1
L2: NOP
    NOP
    CALL P</pre><p>
The extra <kbd>NOP</kbd>'s cost 2 clock cycles, but they save much more.
Furthermore, <kbd>JZ L2</kbd> is now
moved to the U-pipe which reduces its penalty from 4 to 3 when mispredicted. The only
problem that remains is that the returns from <kbd>P</kbd> are always mispredicted. This problem can
only be solved by replacing the call to <kbd>P</kbd> by an inline macro (if you have enough code
cache).
<p>
The lesson to learn from this example is that you should always look carefully for
consecutive jumps and see if you can save time by inserting some <kbd>NOP</kbd>'s. You should be
particularly aware of those situations where misprediction is unavoidable, such as loop exits
and returns from procedures which are called from varying locations. If you have something
useful to put in instead of the <kbd>NOP</kbd>'s, then you should of course do so.
<p>
Multiway branches (case statements) may be implemented either as a tree of branch
instructions or as a list of jump addresses. If you choose to use a tree of branch
instructions, then you have to include some <kbd>NOP</kbd>'s or other instructions to separate the
consecutive branches. A list of jump addresses may therefore be a better solution on the
PPlain. The list of jump addresses should be placed in the data segment. Never put data in
the code segment!
<p>
<h4>22.1.4 Tight loops (PPlain)</h4>
In a small loop you will often access the same BTB entry repeatedly at short intervals.
This never causes a stall. Rather than waiting for a BTB entry to be updated, the PPlain
somehow bypasses the pipeline and gets the resulting state from the last jump before it has
been written to the BTB. This mechanism is almost transparent to the user, but it does in
some cases have funny effects: You can see a branch prediction going from state 0 to state
1, rather than to state 3, if the zero has not yet been written to the BTB. This happens if the
loop has no more than four instruction pairs. In loops with only two instruction pairs you
may sometimes have state 0 for two consecutive iterations without going out of the BTB. In
such small loops it also happens in rare cases that the prediction uses the state resulting
from two iterations ago, rather than from the last iteration. These funny effects will usually
not have any negative effects on performance.
<p>
<h3><a name="22_2">22.2 Branch prediction in PMMX, PPro, PII and PIII</a></h3>
<h4>22.2.1 BTB organization (PMMX, PPro, PII and PIII)</h4>
The branch target buffer (BTB) of the PMMX has 256 entries organized as 16 ways * 16
sets. Each entry is identified by bits 2-31 of the address of the last byte of the control
transfer instruction it belongs to. Bits 2-5 define the set, and bits 6-31 are stored in the BTB
as a tag. Control transfer instructions which are spaced a multiple of 64 bytes apart have the
same set-value and may therefore occasionally push each other out of the BTB. Since there are 16
ways per set, this won't happen too often.
<p>
The branch target buffer (BTB) of the PPro, PII and PIII has 512 entries organized as 16 ways *
32 sets. Each entry is identified by bits 4-31 of the address of the last byte of the control
transfer instruction it belongs to. Bits 4-8 define the set, and all bits are stored in the BTB as
a tag. Control transfer instructions which are spaced a multiple of 512 bytes apart have the
same set-value and may therefore occasionally push each other out of the BTB. Since there are 16
ways per set, this won't happen too often.
<p>
The PPro, PII and PIII allocate a BTB entry to any control transfer instruction the first time it is
executed. The PMMX allocates it the first time it jumps. A branch instruction which never
jumps will stay out of the BTB on the PMMX. As soon as it has jumped once, it will stay in
the BTB, even if it never jumps again.
<p>
An entry may be pushed out of the BTB when another control transfer instruction with the
same set-value needs a BTB entry.
<p>
<h4>22.2.2 Misprediction penalty (PMMX, PPro, PII and PIII)</h4>
In the PMMX, the penalty for misprediction of a conditional jump is 4 clocks in the U-pipe,
and 5 clocks if it is executed in the V-pipe. For all other control transfer instructions it is 4
clocks.
<p>
In the PPro, PII and PIII, the misprediction penalty is very high due to the long pipeline. A
misprediction usually costs between 10 and 20 clock cycles. It is therefore very important to
be aware of poorly predictable branches when running on PPro, PII and PIII.
<p>
<h4>22.2.3 Pattern recognition for conditional jumps (PMMX, PPro, PII and PIII)</h4>
These processors have an advanced pattern recognition mechanism which will correctly
predict a branch instruction which, for example, is taken every fourth time and falls through
the other three times. In fact, they can predict any repetitive pattern of jumps and nojumps
with a period of up to five, and many patterns with higher periods.
<p>
The mechanism is a so-called "two-level adaptive branch prediction scheme", invented by
T.-Y. Yeh and Y. N. Patt. It is based on the same kind of two-bit counters as described
above for the PPlain (but without the asymmetry flaw). The counter is incremented when
the jump is taken and decremented when not taken. There is no wrap-around when
counting up from 3 or down from 0. A branch instruction is predicted to be taken when the
corresponding counter is in state 2 or 3, and to fall through when in state 0 or 1. An
impressive improvement is now obtained by having sixteen such counters for each BTB
entry. It selects one of these sixteen counters based on the history of the branch instruction
for the last four executions. If, for example, the branch instruction jumps once and then falls
through three times, then you have the history bits 1000 (1=jump, 0=nojump). This will
make it use counter number 8 (1000 binary = 8) for predicting the next time and update
counter 8 afterwards.
<p>
If the sequence 1000 is always followed by a 1, then counter number 8 will soon end up in
its highest state (state 3) so that it will always predict a 1000 sequence to be followed by a
1. It will take two deviations from this pattern to change the prediction. The repetitive pattern
100010001000 will have counter 8 in state 3, and counter 1, 2 and 4 in state 0. The other
twelve counters will be unused.
<p>
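Since this is a hardware mechanism, code can only model it. The following Python sketch (the function name and the warm-up convention are my own) implements the counters-plus-history scheme just described and reproduces its behaviour on the example patterns:

```python
# Model of the two-level adaptive scheme described above: the last four
# outcomes select one of sixteen two-bit saturating counters, and a
# counter in state 2 or 3 predicts "jump".

def mispredictions_per_period(pattern, repetitions=20):
    """Run a repeating pattern ('1'=jump, '0'=nojump) through the model
    and return mispredictions per period after a warm-up phase."""
    counters = [0] * 16    # sixteen two-bit counters for one BTB entry
    history = 0            # last four outcomes as a 4-bit number
    warmup = repetitions // 2
    wrong = 0
    for rep in range(repetitions):
        for bit in pattern:
            outcome = int(bit)
            predicted = 1 if counters[history] >= 2 else 0
            if rep >= warmup and predicted != outcome:
                wrong += 1
            # saturating update of the selected counter
            if outcome:
                counters[history] = min(3, counters[history] + 1)
            else:
                counters[history] = max(0, counters[history] - 1)
            # shift the outcome into the history
            history = ((history << 1) | outcome) & 0xF
    return wrong / (repetitions - warmup)

print(mispredictions_per_period("1000"))     # taken every fourth time: 0.0
print(mispredictions_per_period("000001"))   # taken every sixth time: 1.0
```

The "1000" pattern is learned perfectly, while "000001" (see section 22.2.6 below) keeps one misprediction per period in this model.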
<h4>22.2.4 Perfectly predicted patterns (PMMX, PPro, PII and PIII)</h4>
A repetitive branch pattern is predicted perfectly by this mechanism if
every 4-bit sub-sequence in the period is unique.
Below is a list of repetitive branch patterns which are predicted perfectly:<p>
<table border=1 cellpadding=1 cellspacing=1>
<tr><td class="a3"> period </td>
<td class="a3"> perfectly predicted patterns </td></tr>
<tr><td>1-5</td><td>all</td></tr>
<tr><td>6</td><td>000011, 000101, 000111, 001011</td></tr>
<tr><td>7</td><td>0000101, 0000111, 0001011</td></tr>
<tr><td>8</td><td>00001011, 00001111, 00010011, 00010111, 00101101</td></tr>
<tr><td>9</td><td>000010011, 000010111, 000100111, 000101101</td></tr>
<tr><td>10</td><td>0000100111, 0000101101, 0000101111, 0000110111, 0001010011, 0001011101</td></tr>
<tr><td>11</td><td>00001001111, 00001010011, 00001011101, 00010100111</td></tr>
<tr><td>12</td><td>000010100111, 000010111101, 000011010111, 000100110111, 000100111011</td></tr>
<tr><td>13</td><td>0000100110111, 0000100111011, 0000101001111</td></tr>
<tr><td>14</td><td>00001001101111, 00001001111011, 00010011010111, 00010011101011, 00010110011101, 00010110100111</td></tr>
<tr><td>15</td><td>000010011010111, 000010011101011, 000010100110111,
000010100111011, 000010110011101, 000010110100111,
000010111010011, 000011010010111</td></tr>
<tr><td>16</td><td>0000100110101111, 0000100111101011, 0000101100111101,
0000101101001111</td></tr>
</table>
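The criterion stated above can be checked mechanically. A Python sketch (<kbd>perfectly_predicted</kbd> is a made-up helper name) that tests whether every 4-bit sub-sequence of a period is unique:

```python
# Sketch of the criterion above: a repeating pattern is predicted
# perfectly when every 4-bit sub-sequence in the period is unique.

def perfectly_predicted(pattern):
    cyclic = pattern * 4                  # let the 4-bit window wrap around
    windows = [cyclic[i:i + 4] for i in range(len(pattern))]
    return len(set(windows)) == len(windows)

print(perfectly_predicted("000111"))    # listed in the table -> True
print(perfectly_predicted("000001"))    # not listed          -> False
```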
<p>When reading this table, you should be aware that if a pattern is predicted correctly, then the
same pattern reversed (read backwards) is also predicted correctly, as well as the same
pattern with all bits inverted. Example:<br>
In the table we find the pattern: 0001011.<br>
Reversing this pattern gives: 1101000.<br>
Inverting all bits gives: 1110100.<br>
Both reversing and inverting: 0010111.<br>
These four patterns are all recognizable. Rotating the pattern one place to the left gives:
0010110. This is of course not a new pattern, only a phase-shifted version of the same
pattern. All patterns which can be derived from one of the patterns in the table by reversing,
inverting and rotating are also recognizable. For reasons of brevity, these are not listed.
<p>
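The derived forms in the example above can be generated mechanically; a short Python sketch (<kbd>variants</kbd> is a made-up name):

```python
# Generate the equivalent forms of a table pattern described above:
# the pattern itself, reversed, inverted, and reversed-and-inverted.

def variants(pattern):
    inverted = pattern.translate(str.maketrans("01", "10"))
    return {pattern, pattern[::-1], inverted, inverted[::-1]}

print(sorted(variants("0001011")))
# -> ['0001011', '0010111', '1101000', '1110100']
```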
It takes two periods for the pattern recognition mechanism to learn a regular repetitive
pattern after the BTB entry has been allocated. The pattern of mispredictions in the learning
period is not reproducible. This is probably because the BTB entry contained something
prior to allocation. Since BTB entries are allocated according to a random scheme, there is
little chance of predicting what happens during the initial learning period.
<p>
<h4>22.2.5 Handling deviations from a regular pattern (PMMX, PPro, PII and PIII)</h4>
The branch prediction mechanism is also extremely good at handling 'almost regular'
patterns, or deviations from the regular pattern. Not only does it learn what the regular
pattern looks like. It also learns what deviations from the regular pattern look like. If
deviations are always of the same type, then it will remember what comes after the irregular
event, and the deviation will cost only one misprediction.
<p>
Example:
<pre>
0001110001110001110001011100011100011100010111000
                      ^                   ^</pre>
In this sequence, a 0 means nojump, a 1 means jump. The mechanism learns that the
repeated sequence is 000111. The first irregularity is an unexpected 0, which I have
marked with a <kbd>^</kbd>. After this 0 the next three jumps may be mispredicted, because it hasn't
learned what comes after 0010, 0101, and 1011. After one or two irregularities of the same
kind it has learned that after 0010 comes a 1, after 0101 comes a 1, and after 1011 comes a 1.
This means that after at most two irregularities of the same kind, it has learned to handle
this kind of irregularity with only one misprediction.
<p>
The prediction mechanism is also very effective when alternating between two different
regular patterns. If, for example, we have the pattern 000111 (with period 6) repeated many
times, then the pattern 01 (period 2) many times, and then return to the 000111 pattern,
then the mechanism doesn't have to relearn the 000111 pattern, because the counters
used in the 000111 sequence have been left untouched during the 01 sequence. After a
few alternations between the two patterns, it has also learned to handle the changes of
pattern with only one misprediction for each time the pattern is switched.
<p>
<h4>22.2.6 Patterns which are not predicted perfectly (PMMX, PPro, PII and PIII)</h4>
The simplest branch pattern which cannot be predicted perfectly is a branch which is taken
on every 6'th execution. The pattern is:
<pre>000001000001000001
    ^^    ^^    ^^
    ab    ab    ab</pre>
The sequence 0000 is alternately followed by a 0, in the positions marked a above, and
by a 1, in the positions marked b. This affects counter number 0 which will count up and
down all the time. If counter 0 happens to start in state 0 or 1, then it will alternate between
state 0 and 1. This will lead to a misprediction in position b. If counter 0 happens to start in
state 3, then it will alternate between state 2 and 3 which will cause a misprediction in
position a. The worst case is when it starts in state 2. It will alternate between state 1 and 2
with the unfortunate consequence that we get a misprediction both in position a and b. (This
is analogous to the worst case for the PPlain explained <a href="#worstpred">above</a>). Which of these four
situations we will get depends on the history of the BTB entry prior to allocation to this
branch. This is beyond our control because of the random allocation method.
<p>
In principle, it is possible to avoid the worst case situation where we have two mispredictions
per cycle by giving it an initial branch sequence which is specially designed for putting the
counter in the desired state. Such an approach cannot be recommended, however,
because of the considerable extra code complexity required, and because whatever
information we have put into the counter is likely to be lost during the next timer interrupt or
task switch.
<p>
<h4>22.2.7 Completely random patterns (PMMX, PPro, PII and PIII)</h4>
The powerful capability of pattern recognition has a minor drawback in the case of
completely random sequences with no regularities.
<p>
The following table lists the experimental fraction of mispredictions for a completely random
sequence of jumps and nojumps:<p>
<table border=1 cellpadding=1 cellspacing=1>
<tr><td align="center" class="a3"> fraction of jumps/nojumps </td>
<td align="center" class="a3"> fraction of mispredictions </td></tr>
<tr><td align="center">0.001/0.999</td><td align="center">0.001001</td></tr>
<tr><td align="center">0.01/0.99</td><td align="center">0.0101</td></tr>
<tr><td align="center">0.05/0.95</td><td align="center">0.0525</td></tr>
<tr><td align="center">0.10/0.90</td><td align="center">0.110</td></tr>
<tr><td align="center">0.15/0.85</td><td align="center">0.171</td></tr>
<tr><td align="center">0.20/0.80</td><td align="center">0.235</td></tr>
<tr><td align="center">0.25/0.75</td><td align="center">0.300</td></tr>
<tr><td align="center">0.30/0.70</td><td align="center">0.362</td></tr>
<tr><td align="center">0.35/0.65</td><td align="center">0.418</td></tr>
<tr><td align="center">0.40/0.60</td><td align="center">0.462</td></tr>
<tr><td align="center">0.45/0.55</td><td align="center">0.490</td></tr>
<tr><td align="center">0.50/0.50</td><td align="center">0.500</td></tr>
</table>
<p>
The fraction of mispredictions is slightly higher than it would be without pattern recognition
because the processor keeps trying to find repeated patterns in a sequence which has no
regularities.
<p>
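A software model of the counter scheme from section 22.2.3 lands close to the measured numbers in the table. This Python sketch (function name, iteration count and seed are arbitrary choices of mine, and the model ignores BTB allocation effects) feeds a random sequence through sixteen history-selected two-bit counters:

```python
import random

# Run a completely random branch sequence through the two-level
# predictor model described above and measure the misprediction fraction.

def random_misprediction_fraction(p_jump, n=100000, seed=1):
    rng = random.Random(seed)
    counters = [0] * 16    # sixteen two-bit counters, selected by history
    history = 0
    wrong = 0
    for _ in range(n):
        outcome = 1 if rng.random() < p_jump else 0
        predicted = 1 if counters[history] >= 2 else 0
        wrong += predicted != outcome
        # saturating counter update, then shift the outcome into history
        if outcome:
            counters[history] = min(3, counters[history] + 1)
        else:
            counters[history] = max(0, counters[history] - 1)
        history = ((history << 1) | outcome) & 0xF
    return wrong / n

# with 25% jumps the model comes out near the 0.300 in the table,
# noticeably above the 0.25 that always predicting the common case gives:
print(round(random_misprediction_fraction(0.25), 3))
```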
<h4>22.2.8 Tight loops (PMMX)</h4>
The branch prediction is not reliable in tiny loops where the pattern recognition mechanism
doesn't have time to update its data before the next branch is met. This means that simple
patterns, which would normally be predicted perfectly, are not recognized. Incidentally,
some patterns which normally would not be recognized, are predicted perfectly in tight
loops. For example, a loop which always repeats 6 times would have the branch pattern
111110 for the branch instruction at the bottom of the loop. This pattern would normally
have one or two mispredictions per iteration, but in a tight loop it has none. The same
applies to a loop which repeats 7 times. Most other repeat counts are predicted more poorly in
tight loops than normally. This means that a loop which iterates 6 or 7 times should
preferably be tight, whereas other loops should preferably not be tight. You may unroll a
loop if necessary to make it less tight.
<p>
To find out whether a loop will behave as 'tight' on the PMMX you may use the following
rule of thumb: Count the number of instructions in the loop. If the number is 6 or less, then
the loop will behave as tight. If you have more than 7 instructions, then you can be
reasonably sure that the pattern recognition functions normally. Strangely enough, it doesn't
matter how many clock cycles each instruction takes, whether it has stalls, or whether it is
paired or not. Complex integer instructions do not count. A loop can have lots of complex
integer instructions and still behave as a tight loop. A complex integer instruction
is a non-pairable integer instruction which always takes more than one clock cycle. Complex floating
point instructions and MMX instructions still count as one. Note that this rule of thumb is
heuristic and not completely reliable. In important cases you may want to do your own
testing. You can use performance monitor counter number 35H for the PMMX to count
branch mispredictions. Test results may not be completely deterministic, because branch
predictions may depend on the history of the BTB entry prior to allocation.
<p>
Tight loops on PPro, PII and PIII are predicted normally, and take a minimum of two clock cycles per
iteration.
<p>
<h4>22.2.9 Indirect jumps and calls (PMMX, PPro, PII and PIII)</h4>
There is no pattern recognition for indirect jumps and calls, and the BTB can remember no
more than one target for an indirect jump. It is simply predicted to go to the same target as it
did last time.
<p>
<h4>22.2.10 JECXZ and LOOP (PMMX)</h4>
There is no pattern recognition for these two instructions in the PMMX. They are simply
predicted to go the same way as last time they were executed. These two instructions
should be avoided in time-critical code for PMMX. (In PPro, PII and PIII they are predicted using
pattern recognition, but the <kbd>LOOP</kbd> instruction is still inferior to <kbd>DEC ECX / JNZ</kbd>).
<p>
<h4>22.2.11 Returns (PMMX, PPro, PII and PIII)</h4>
The PMMX, PPro, PII and PIII processors have a Return Stack Buffer (RSB) which is used for
predicting return instructions. The RSB works as a First-In-Last-Out buffer. Each time a
call instruction is executed, the corresponding return address is pushed into the RSB. And
each time a return instruction is executed, a return address is pulled out of the RSB and
used for prediction of the return. This mechanism makes sure that return instructions are
correctly predicted when the same subroutine is called from several different locations.
<p>
In order to make sure this mechanism works correctly, you must make sure that all calls
and returns are matched. Never jump out of a subroutine without a return and never use a
return as an indirect jump if speed is critical.
<p>
The RSB can hold four entries in the PMMX, sixteen in the PPro, PII and PIII. In the case where
the RSB is empty, the return instruction is predicted in the same way as an indirect jump,
i.e. it is expected to go to the same target as it did last time.
<p>
On the PMMX, when subroutines are nested deeper than four levels then the innermost
four levels use the RSB, whereas all subsequent returns from the outer levels use the
simpler prediction mechanism as long as there are no new calls. A return instruction which
uses the RSB still occupies a BTB entry. Four entries in the RSB of the PMMX don't
sound like much, but it is probably sufficient. Subroutine nesting deeper than four levels is
certainly not unusual, but only the innermost levels matter in terms of speed, except
possibly for recursive procedures.
<p>
On the PPro, PII and PIII, when subroutines are nested deeper than sixteen levels then the
innermost 16 levels use the RSB, whereas all subsequent returns from the outer levels are
mispredicted. Recursive subroutines should therefore not go deeper than 16 levels.
<p>
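The PPro/PII/PIII behaviour just described can be illustrated with a small model. This Python sketch assumes, for illustration only, that the oldest entry is simply lost when the RSB overflows (<kbd>mispredicted_returns</kbd> is a made-up name):

```python
# Model of the return stack buffer behaviour described above, under the
# assumption that the oldest entry is lost on overflow.

def mispredicted_returns(depth, rsb_size=16):
    """Count mispredicted returns for a call chain nested `depth` deep."""
    rsb = []
    for call in range(depth):           # nested calls push return addresses
        if len(rsb) == rsb_size:
            rsb.pop(0)                  # overflow: oldest entry is lost
        rsb.append(call)
    wrong = 0
    for ret in reversed(range(depth)):  # unwind: pop one prediction per return
        predicted = rsb.pop() if rsb else None
        wrong += predicted != ret
    return wrong

print(mispredicted_returns(20))   # nested 20 deep: the 4 outermost returns miss -> 4
print(mispredicted_returns(10))   # within 16 levels: all predicted -> 0
```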
<h4>22.2.12 Static prediction in PMMX</h4>
A control transfer instruction which has not been seen before or which is not in the BTB is
always predicted to fall through on the PMMX. It doesn't matter whether it goes forwards or
backwards.
<p>
A branch instruction will not get a BTB entry if it always falls through. As soon as it is taken
once, it will get into the BTB and stay there no matter how many times it falls through. A
control transfer instruction can only go out of the BTB when it is pushed out by another
control transfer instruction which steals its BTB entry.
<p>
Any control transfer instruction which jumps to the address immediately following itself will
not get a BTB entry. Example: <pre>
    JMP SHORT LL
LL:</pre><p>
This instruction will never get a BTB entry and therefore always have a misprediction
penalty.
<p>
<h4>22.2.13 Static prediction in PPro, PII and PIII</h4>
On PPro, PII and PIII, a control transfer instruction which has not been seen before or which is
not in the BTB is predicted to fall through if it goes forwards, and to be taken if it goes
backwards (e.g. a loop). Static prediction takes longer than dynamic prediction on
these processors.
<p>
If your code is unlikely to be cached then it is preferable to have the most frequently
executed branch fall through in order to improve prefetching.
<p>
<h4>22.2.14 Close jumps (PMMX)</h4>
On the PMMX, there is a risk that two control transfer instructions will share the same BTB
entry if they are too close to each other. The obvious result is that they will always be
mispredicted.
<p>
The BTB entry for a control transfer instruction is identified by bits 2-31 of the address of
the last byte in the instruction. If two control transfer instructions are so close together that
they differ only in bits 0-1 of the address, then we have the problem of a shared BTB entry.
Example:
<pre>    CALL P
    JNC  SHORT L</pre><p>
If the last byte of the <kbd>CALL</kbd> instruction and the last byte of the <kbd>JNC</kbd> instruction lie within the
same dword of memory, then we have the penalty. You have to look at the output list file
from the assembler to see whether the two addresses are separated by a DWORD boundary
or not. (A DWORD boundary is an address divisible by 4).
<p>
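The dword check described above is a one-line bit comparison; a Python sketch (<kbd>share_btb_entry</kbd> is a made-up name, and the example addresses are arbitrary):

```python
# Sketch of the check described above: two control transfer instructions
# share a PMMX BTB entry when the addresses of their last bytes agree in
# all bits except 0-1, i.e. lie within the same dword.

def share_btb_entry(last_byte_addr_1, last_byte_addr_2):
    return last_byte_addr_1 >> 2 == last_byte_addr_2 >> 2

print(share_btb_entry(0x1001, 0x1003))   # same dword -> True
print(share_btb_entry(0x1003, 0x1004))   # dword boundary between -> False
```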
There are various ways to solve this problem: <br>
1. Move the code sequence a little up or down in memory so that you get a dword
boundary between the two addresses.<br>
2. Change the short jump to a near jump (with 4 bytes displacement) so that the end of the
instruction is moved further down. There is no way you can force the assembler to use
anything but the shortest form of an instruction so you have to hard-code the near
branch if you choose this solution.<br>
3. Put in some instruction between the <kbd>CALL</kbd> and the <kbd>JNC</kbd> instructions. This is the easiest
method, and the only method if you don't know where DWORD boundaries are because
your segment is not dword aligned or because the code keeps moving up and down as
you make changes in the preceding code:<br>
<pre>    CALL P
    MOV  EAX,EAX   ; two bytes filler to be safe
    JNC  SHORT L</pre><p>
If you want to avoid problems on the PPlain too, then put in two <kbd>NOP</kbd>'s instead to prevent
pairing (see section 22.1.3 above).
<p>
The <kbd>RET</kbd> instruction is particularly prone to this problem because it is only one byte long:
|
|
<pre> JNZ NEXT
|
|
RET</pre><p>
|
|
Here you may need up to three bytes of fillers:
|
|
<pre> JNZ NEXT
|
|
NOP
|
|
MOV EAX,EAX
|
|
RET</pre>
|
|
<p>
<h4>22.2.15 Consecutive calls or returns (PMMX)</h4>
There is a penalty when the first instruction pair following the target label of a call contains
another call instruction or if a return follows immediately after another return. Example:
<pre>FUNC1   PROC  NEAR
        NOP              ; avoid call after call
        NOP
        CALL  FUNC2
        CALL  FUNC3
        NOP              ; avoid return after return
        RET
FUNC1   ENDP</pre><p>
Two <kbd>NOP</kbd>'s are required before <kbd>CALL FUNC2</kbd> because a single <kbd>NOP</kbd> would pair with the
<kbd>CALL</kbd>. One <kbd>NOP</kbd> is enough before the <kbd>RET</kbd> because
<kbd>RET</kbd> is unpairable. No <kbd>NOP</kbd>'s are required
between the two <kbd>CALL</kbd> instructions because there is no penalty for call after return. (On the
PPlain you would need two <kbd>NOP</kbd>'s here too).
<p>
The penalty for chained calls only occurs when the same subroutines are called from more
than one location (probably because the RSB needs updating). Chained returns always
have a penalty. There is sometimes a small stall for a jump after a call, but no penalty for
return after call; call after return; jump, call, or return after jump; or jump after return.
<p>
<h4>22.2.16 Chained jumps (PPro, PII and PIII)</h4>
A jump, call, or return cannot be executed in the first clock cycle after a previous jump, call,
or return. Therefore, chained jumps will take two clock cycles for each jump, and you may
want to make sure that the processor has something else to do in parallel. For the same
reason, a loop will take at least two clock cycles per iteration on these processors.
<p>
<h4>22.2.17 Designing for branch predictability (PMMX, PPro, PII and PIII)</h4>
Multiway branches (switch/case statements) are implemented either as an indirect jump
using a list of jump addresses, or as a tree of branch instructions. Since indirect jumps are
poorly predicted, the latter method may be preferred if easily predicted patterns can be
expected and you have enough BTB entries. In case you decide to use the former method,
then it is recommended that you put the list of jump addresses in the data segment.
<p>
You may want to reorganize your code so that branch patterns which are not predicted
perfectly can be replaced by other patterns which are. Consider, for example, a loop which
always executes 20 times. The conditional jump at the bottom of the loop is taken 19 times
and falls through every 20th time. This pattern is regular, but not recognized by the pattern
recognition mechanism, so the fall-through is always mispredicted. You may make two
nested loops of four and five iterations, or unroll the loop by four and let it execute 5 times,
in order to have only recognizable patterns. Such complicated schemes are only worth the
extra code on the PPro, PII and PIII processors, where mispredictions are very expensive. For higher
loop counts there is no reason to do anything about the single misprediction.
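The restructuring described above can be sketched in a high-level language; the loop body still runs 20 times, but each conditional jump now follows a short pattern the predictor can recognize (the <kbd>body</kbd> function is a hypothetical placeholder):

```c
static int calls = 0;
static void body(void) { calls++; }  /* hypothetical loop body */

/* Instead of: for (i = 0; i < 20; i++) body();
   use two nested loops of four and five iterations, so that each
   branch repeats with a period short enough to be recognized. */
void run_restructured(void)
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 5; j++)
            body();
}
```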
<p>
<h3><a name="22_3">22.3. Avoiding jumps (all processors)</a></h3>
There can be many reasons why you may want to reduce the number of jumps, calls and
returns:
<ul>
<li>jump mispredictions are very expensive,
<li>there are various penalties for consecutive or chained jumps, depending on the
processor,
<li>jump instructions may push one another out of the branch target buffer because of the
random replacement algorithm,
<li>a return takes 2 clocks on PPlain and PMMX; calls and returns generate 4 uops on
PPro, PII and PIII,
<li>on PPro, PII and PIII, instruction fetch may be delayed after a jump
(chapter <a href="#15">15</a>), and
retirement may be slightly less effective for taken jumps than for other uops
(chapter <a href="#18">18</a>).
</ul>
<p>
Calls and returns can be avoided by replacing small procedures with inline macros.
And in many cases it is possible to reduce the number of jumps by restructuring
your code. For example, a jump to a jump should be replaced by a jump to the final
target. In some cases this is even possible with conditional jumps if the condition
is the same or is known. A jump to a return can be replaced by a return. If you want
to eliminate a return to a return, then you should not manipulate the stack pointer
because that would interfere with the prediction mechanism of the return stack buffer.
Instead, you can replace the preceding call with a jump. For example
<kbd>CALL PRO1 / RET</kbd> can be replaced by <kbd>JMP PRO1</kbd> if
<kbd>PRO1</kbd> ends with the same kind of <kbd>RET</kbd>.
<p>
You may also eliminate a jump by duplicating the code jumped to. This can be
useful if you have a two-way branch inside a loop or before a return. Example:
<pre>A:  CMP  [EAX+4*EDX],ECX
    JE   B
    CALL X
    JMP  C
B:  CALL Y
C:  INC  EDX
    JNZ  A
    MOV  ESP, EBP
    POP  EBP
    RET</pre>
The jump to <kbd>C</kbd> may be eliminated by duplicating the loop epilog:
<pre>A:  CMP  [EAX+4*EDX],ECX
    JE   B
    CALL X
    INC  EDX
    JNZ  A
    JMP  D
B:  CALL Y
C:  INC  EDX
    JNZ  A
D:  MOV  ESP, EBP
    POP  EBP
    RET</pre>
The most often executed branch should come first here. The jump to <kbd>D</kbd>
is outside the loop and therefore less critical. If this jump is executed so often
that it needs optimizing too, then replace it with the three instructions following
<kbd>D</kbd>.
<p>
<h3><a name="22_4">22.4. Avoiding conditional jumps by using flags (all processors)</a></h3>
The most important jumps to eliminate are conditional jumps, especially if
they are poorly predictable. Sometimes it is possible to obtain the same effect
as a branch by ingenious manipulation of bits and flags. For example you may
calculate the absolute value of a signed number without branching:
<pre>   CDQ
   XOR EAX,EDX
   SUB EAX,EDX</pre>
(On PPlain and PMMX, use <kbd>MOV EDX,EAX / SAR EDX,31</kbd> instead of <kbd>CDQ</kbd>).
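The same sign-mask trick reads naturally in C, assuming (as the assembly does) that a right shift of a signed integer is arithmetic; the function name is my own:

```c
/* Branch-free absolute value, the C analogue of CDQ / XOR / SUB.
   Assumes >> on a signed int is an arithmetic shift, as on x86. */
int iabs(int x)
{
    int mask = x >> 31;        /* 0 if x >= 0, -1 if x < 0 (like CDQ) */
    return (x ^ mask) - mask;  /* XOR EAX,EDX / SUB EAX,EDX */
}
```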
<p>
The carry flag is particularly useful for this kind of trick:<br>
Setting carry if a value is zero: <kbd>CMP [VALUE],1</kbd><br>
Setting carry if a value is not zero: <kbd>XOR EAX,EAX / CMP EAX,[VALUE]</kbd><br>
Incrementing a counter if carry: <kbd>ADC EAX,0</kbd><br>
Setting a bit for each time the carry is set: <kbd>RCL EAX,1</kbd><br>
Generating a bit mask if carry is set: <kbd>SBB EAX,EAX</kbd><br>
Setting a bit on an arbitrary condition: <kbd>SETcond AL</kbd><br>
Setting all bits on an arbitrary condition: <kbd>XOR EAX,EAX / SETNcond AL / DEC EAX</kbd><br>
(remember to reverse the condition in the last example)
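As an illustration, the <kbd>CMP / ADC</kbd> idiom above counts matches without a conditional jump; in C the same idea is simply adding the truth value of the condition (a sketch, function name my own):

```c
/* Branch-free counting: add the condition itself instead of jumping.
   A compiler can turn this into CMP / ADC or SETcc sequences. */
int count_zeros(const int *a, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        count += (a[i] == 0);  /* 1 when zero, 0 otherwise */
    return count;
}
```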
<p>
The following example finds the minimum of two unsigned numbers: if (b < a) a = b;
<pre>   SUB EBX,EAX
   SBB ECX,ECX
   AND ECX,EBX
   ADD EAX,ECX</pre><p>
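A C rendering of the same mask trick; the borrow produced by <kbd>SUB</kbd> becomes an all-ones mask via <kbd>SBB ECX,ECX</kbd> (a sketch, function name my own):

```c
/* Branch-free unsigned minimum, mirroring SUB / SBB / AND / ADD:
   diff = b - a (wraps around), mask is all ones exactly when b < a. */
unsigned umin(unsigned a, unsigned b)
{
    unsigned diff = b - a;              /* SUB EBX,EAX */
    unsigned mask = (b < a) ? ~0u : 0u; /* SBB ECX,ECX */
    return a + (diff & mask);           /* AND ECX,EBX / ADD EAX,ECX */
}
```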
The next example chooses between two numbers: if (a != 0) a = b; else a = c;
<pre>   CMP EAX,1
   SBB EAX,EAX
   XOR ECX,EBX
   AND EAX,ECX
   XOR EAX,EBX</pre><p>
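Step by step in C: <kbd>CMP EAX,1 / SBB EAX,EAX</kbd> builds a mask that is all ones exactly when a is zero, and the two <kbd>XOR</kbd>'s select b or c through it (a sketch, function name my own):

```c
/* Branch-free select: returns b if a != 0, otherwise c. */
int select_bc(unsigned a, int b, int c)
{
    int mask = (a == 0) ? -1 : 0;  /* CMP EAX,1 / SBB EAX,EAX */
    return ((c ^ b) & mask) ^ b;   /* XOR ECX,EBX / AND / XOR */
}
```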
Whether or not such tricks are worth the extra code depends on how predictable a
conditional jump would be, whether the extra pairing or scheduling opportunities of the
branch-free code can be utilized, and whether there are other jumps following immediately
after which could suffer the penalties of consecutive jumps.
<p>
<h3><a name="22_5">22.5. Replacing conditional jumps by conditional moves (PPro, PII and PIII)</a></h3>
The PPro, PII and PIII processors have conditional move instructions intended specifically for
avoiding branches because branch misprediction is very time-consuming on these
processors. There are conditional move instructions for both integer and floating point
registers. For code that will run only on these processors you may replace poorly
predictable branches with conditional moves whenever possible. If you want your code to
run on all processors then you may make two versions of the most critical parts of the code,
one for processors that support conditional move instructions and one for those that don't
(see chapter <a href="#27_10">27.10</a> for how to detect if conditional moves are supported).
<p>
The misprediction penalty for a branch may be so high that it is advantageous to replace it
with conditional moves even when it costs several extra instructions. But a conditional move
instruction has the disadvantage that it makes dependency chains longer: it waits for three
operands to be ready - the condition flag and the two move operands - even though only
one of the move operands is needed. You have to consider if any of these
three operands are likely to be delayed by dependency chains or cache misses. If the
condition flag is available long before the move operands then you may as well use a
branch, because a possible branch misprediction could be resolved while waiting for the
move operands. In situations where you have to wait long for a move operand that may not
be needed after all, the branch will be faster than the conditional move despite a possible
misprediction penalty. The opposite situation is when the condition flag is delayed while
both move operands are available early. In this situation the conditional move is preferred
over the branch if misprediction is likely.
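In compiled code, a simple ternary expression is the usual way to invite the compiler to emit <kbd>CMOVcc</kbd> rather than a branch; whether it actually does depends on the compiler and the target processor. A sketch (function name my own):

```c
/* A data-dependent select with no predictable pattern: a compiler
   targeting PPro or later will typically use CMOVcc here. */
int clamp_low(int x, int lo)
{
    return (x < lo) ? lo : x;
}
```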
<p>
<h2><a name="23">23</a>. Reducing code size (all processors)</h2>
As explained in chapter <a href="#7">7</a>, the code cache is 8 or 16 kb. If you have problems keeping the
critical parts of your code within the code cache, then you may consider reducing the size of
your code.
<p>
32 bit code is usually bigger than 16 bit code because addresses and data constants take 4
bytes in 32 bit code and only 2 bytes in 16 bit code. However, 16 bit code has other
penalties such as prefixes and problems with accessing adjacent words simultaneously
(see chapter 10.2 <a href="#imperfectpush">above</a>). Some other methods for reducing the size of your code are
discussed below.
<p>
Both jump addresses, data addresses, and data constants take less space if they can be
expressed as a sign-extended byte, i.e. if they are within the interval from -128 to +127.
<p>
For jump addresses this means that short jumps take two bytes of code, whereas jumps
beyond 127 bytes take 5 bytes if unconditional and 6 bytes if conditional.
<p>
Likewise, data addresses take less space if they can be expressed as a pointer and a
displacement between -128 and +127.
Example:<br>
<kbd>  MOV EBX,DS:[100000] / ADD EBX,DS:[100004]  ; 12 bytes</kbd><br>
Reduce to:<br>
<kbd>  MOV EAX,100000 / MOV EBX,[EAX] / ADD EBX,[EAX+4] ; 10 bytes</kbd>
<p>
The advantage of using a pointer obviously increases if you use it many times. Storing data
on the stack and using <kbd>EBP</kbd> or <kbd>ESP</kbd> as pointer will thus make your code smaller than if you
use static memory locations and absolute addresses, provided of course that your data are
within +/-127 bytes of the pointer. Using <kbd>PUSH</kbd> and <kbd>POP</kbd> to write and read temporary data is
even shorter.
<p>
Data constants may also take less space if they are between -128 and +127. Most
instructions with immediate operands have a short form where the operand is a
sign-extended single byte. Examples:
<pre>   PUSH 200      ; 5 bytes
   PUSH 100      ; 2 bytes

   ADD EBX,128   ; 6 bytes
   SUB EBX,-128  ; 3 bytes</pre><p>
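The rule behind these examples can be stated as a one-line predicate: a constant qualifies for the short encoding exactly when it lies in the sign-extended byte range. A trivial C sketch (function name my own):

```c
/* An immediate fits the short (sign-extended byte) encoding
   iff it lies in -128..+127, which is why ADD EBX,128 is long
   but SUB EBX,-128 is short. */
int fits_imm8(long c)
{
    return c >= -128 && c <= 127;
}
```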
The most important instruction with an immediate operand which doesn't have such a short
form is <kbd>MOV</kbd>.<br>
Examples:
<pre>   MOV EAX, 0             ; 5 bytes</pre>
May be changed to:<br>
<pre>   XOR EAX,EAX            ; 2 bytes</pre>
And
<pre>   MOV EAX, 1             ; 5 bytes</pre>
May be changed to:
<pre>   XOR EAX,EAX / INC EAX  ; 3 bytes</pre>
or:
<pre>   PUSH 1 / POP EAX       ; 3 bytes</pre>
And
<pre>   MOV EAX, -1            ; 5 bytes</pre>
May be changed to:
<pre>   OR EAX, -1             ; 3 bytes</pre>
<p>
If the same address or constant is used more than once then you may load it
into a register. A <kbd>MOV</kbd>
with a 4-byte immediate operand may sometimes be replaced by an arithmetic
instruction if the value of the register before the <kbd>MOV</kbd> is known. Example:
<pre>   MOV [mem1],200           ; 10 bytes
   MOV [mem2],200           ; 10 bytes
   MOV [mem3],201           ; 10 bytes
   MOV EAX,100              ; 5 bytes
   MOV EBX,150              ; 5 bytes</pre><p>
Assuming that <kbd>mem1</kbd> and <kbd>mem3</kbd> are both within -128/+127
bytes of <kbd>mem2</kbd>, this may be changed to:
<pre>   MOV EBX, OFFSET mem2     ; 5 bytes
   MOV EAX,200              ; 5 bytes
   MOV [EBX+mem1-mem2],EAX  ; 3 bytes
   MOV [EBX],EAX            ; 2 bytes
   INC EAX                  ; 1 byte
   MOV [EBX+mem3-mem2],EAX  ; 3 bytes
   SUB EAX,101              ; 3 bytes
   LEA EBX,[EAX+50]         ; 3 bytes</pre><p>
Be aware of the AGI stall in the <kbd>LEA</kbd> instruction (for PPlain and PMMX).
<p>
You may also consider that different instructions have different lengths. The following
instructions take only one byte and are therefore very attractive:
<kbd>PUSH reg</kbd>, <kbd>POP reg, INC reg32, DEC reg32</kbd>.<br>
<kbd>INC</kbd> and <kbd>DEC</kbd> with 8 bit registers take 2 bytes, so
<kbd>INC EAX</kbd> is shorter than <kbd>INC AL</kbd>.
<p>
<kbd>XCHG EAX,reg</kbd> is also a single-byte instruction and thus takes less space
than <kbd>MOV EAX,reg</kbd>, but it is slower.
<p>
Some instructions take one byte less when they use the accumulator than when they use
any other register. <br>
Examples:
<pre>   MOV EAX,DS:[100000]  is smaller than  MOV EBX,DS:[100000]
   ADD EAX,1000         is smaller than  ADD EBX,1000</pre><p>
Instructions with pointers take one byte less when they have only a base pointer
(not <kbd>ESP</kbd>)
and a displacement than when they have a scaled index register, or both base pointer and
index register, or <kbd>ESP</kbd> as base pointer. <br>
Examples:
<pre>   MOV EAX,[array][EBX]  is smaller than  MOV EAX,[array][EBX*4]
   MOV EAX,[EBP+12]      is smaller than  MOV EAX,[ESP+12]</pre><p>
Instructions with <kbd>EBP</kbd> as base pointer and no displacement and no index take one byte more
than with other registers:
<pre>   MOV EAX,[EBX]    is smaller than  MOV EAX,[EBP], but
   MOV EAX,[EBX+4]  is same size as  MOV EAX,[EBP+4].</pre><p>
Instructions with a scaled index pointer and no base pointer must have a four byte
displacement, even when it is 0:
<pre>   LEA EAX,[EBX+EBX]  is shorter than  LEA EAX,[2*EBX].</pre>
<p>
<h2><a name="24">24</a>. Scheduling floating point code (PPlain and PMMX)</h2>
Floating point instructions cannot pair the way integer instructions can, except for one
special case, defined by the following rules:
<ul>
<li>the first instruction (executing in the U-pipe) must be <kbd>FLD, FADD, FSUB, FMUL,
FDIV, FCOM, FCHS,</kbd> or <kbd>FABS</kbd>,
<li>the second instruction (in the V-pipe) must be <kbd>FXCH</kbd>,
<li>the instruction following the <kbd>FXCH</kbd> must be a floating point instruction, otherwise the
<kbd>FXCH</kbd> will pair imperfectly and take an extra clock cycle.
</ul>
This special pairing is important, as will be explained shortly.
<p>
While floating point instructions in general cannot be paired, many can be pipelined, i.e. one
instruction can begin before the previous instruction has finished. Example:
<pre>   FADD ST(1),ST(0)   ; clock cycle 1-3
   FADD ST(2),ST(0)   ; clock cycle 2-4
   FADD ST(3),ST(0)   ; clock cycle 3-5
   FADD ST(4),ST(0)   ; clock cycle 4-6</pre><p>
Obviously, two instructions cannot overlap if the second instruction needs the result of the
first. Since almost all floating point instructions involve the top of stack register,
<kbd>ST(0)</kbd>, there are seemingly not very many possibilities for making an instruction independent of the
result of previous instructions. The solution to this problem is register renaming. The <kbd>FXCH</kbd>
instruction does not in reality swap the contents of two registers, it only swaps their names.
Instructions which push or pop the register stack also work by renaming. Floating point
register renaming has been highly optimized on the Pentiums so that a register may be
renamed while in use. Register renaming never causes stalls - it is even possible to rename
a register more than once in the same clock cycle, as for example when you pair <kbd>FLD</kbd> or
<kbd>FCOMPP</kbd> with <kbd>FXCH</kbd>.
<p>
By the proper use of <kbd>FXCH</kbd> instructions you may obtain a lot of overlapping in your floating
point code. Example:
<pre>   FLD  [a1]     ; clock cycle 1
   FADD [a2]     ; clock cycle 2-4
   FLD  [b1]     ; clock cycle 3
   FADD [b2]     ; clock cycle 4-6
   FLD  [c1]     ; clock cycle 5
   FADD [c2]     ; clock cycle 6-8
   FXCH ST(2)    ; clock cycle 6
   FADD [a3]     ; clock cycle 7-9
   FXCH ST(1)    ; clock cycle 7
   FADD [b3]     ; clock cycle 8-10
   FXCH ST(2)    ; clock cycle 8
   FADD [c3]     ; clock cycle 9-11
   FXCH ST(1)    ; clock cycle 9
   FADD [a4]     ; clock cycle 10-12
   FXCH ST(2)    ; clock cycle 10
   FADD [b4]     ; clock cycle 11-13
   FXCH ST(1)    ; clock cycle 11
   FADD [c4]     ; clock cycle 12-14
   FXCH ST(2)    ; clock cycle 12</pre>
In the above example we are interleaving three independent threads. Each <kbd>FADD</kbd> takes 3
clock cycles, and we can start a new <kbd>FADD</kbd> in each clock cycle. When we have started an
<kbd>FADD</kbd> in the '<kbd>a</kbd>' thread we have time to start two new <kbd>FADD</kbd>
instructions in the '<kbd>b</kbd>' and '<kbd>c</kbd>'
threads before returning to the '<kbd>a</kbd>' thread, so every third
<kbd>FADD</kbd> belongs to the same thread.
We are using <kbd>FXCH</kbd> instructions every time to get the register that belongs to the desired
thread into <kbd>ST(0)</kbd>. As you can see in the example above, this generates a regular pattern,
but note well that the <kbd>FXCH</kbd> instructions repeat with a period of two while the threads have a
period of three. This can be quite confusing, so you have to 'play computer' in order to know
which registers are where.
<p>
All versions of the instructions <kbd>FADD, FSUB, FMUL,</kbd> and
<kbd>FILD</kbd> take 3 clock cycles and are
able to overlap, so that these instructions may be scheduled using the method described
above. Using a memory operand does not take more time than a register operand if the
memory operand is in the level 1 cache and properly aligned.
<p>
By now you must be used to rules having exceptions, and the overlapping rule is no
exception: You cannot start an <kbd>FMUL</kbd> instruction one clock cycle
after another <kbd>FMUL</kbd>
instruction, because the <kbd>FMUL</kbd> circuitry is not perfectly pipelined.
It is recommended that you
put another instruction in between two <kbd>FMUL</kbd>'s. Example:
<pre>   FLD  [a1]     ; clock cycle 1
   FLD  [b1]     ; clock cycle 2
   FLD  [c1]     ; clock cycle 3
   FXCH ST(2)    ; clock cycle 3
   FMUL [a2]     ; clock cycle 4-6
   FXCH          ; clock cycle 4
   FMUL [b2]     ; clock cycle 5-7 (stall)
   FXCH ST(2)    ; clock cycle 5
   FMUL [c2]     ; clock cycle 7-9 (stall)
   FXCH          ; clock cycle 7
   FSTP [a3]     ; clock cycle 8-9
   FXCH          ; clock cycle 10 (unpaired)
   FSTP [b3]     ; clock cycle 11-12
   FSTP [c3]     ; clock cycle 13-14</pre>
Here you have a stall before <kbd>FMUL [b2]</kbd> and before <kbd>FMUL [c2]</kbd>
because another <kbd>FMUL</kbd>
started in the preceding clock cycle. You can improve this code by putting
<kbd>FLD</kbd> instructions in between the <kbd>FMUL</kbd>'s:
<pre>   FLD  [a1]     ; clock cycle 1
   FMUL [a2]     ; clock cycle 2-4
   FLD  [b1]     ; clock cycle 3
   FMUL [b2]     ; clock cycle 4-6
   FLD  [c1]     ; clock cycle 5
   FMUL [c2]     ; clock cycle 6-8
   FXCH ST(2)    ; clock cycle 6
   FSTP [a3]     ; clock cycle 7-8
   FSTP [b3]     ; clock cycle 9-10
   FSTP [c3]     ; clock cycle 11-12</pre><p>
In other cases you may put <kbd>FADD, FSUB</kbd>, or anything else in
between <kbd>FMUL</kbd>'s to avoid the stalls.
<p>
Overlapping floating point instructions requires of course that you have some independent
threads that you can interleave. If you have only one big formula to execute, then you may
compute parts of the formula in parallel to achieve overlapping. If, for example, you want to
add six numbers, then you may split the operations into two threads with three numbers in
each, and add the two threads in the end:
<pre>   FLD  [a]      ; clock cycle 1
   FADD [b]      ; clock cycle 2-4
   FLD  [c]      ; clock cycle 3
   FADD [d]      ; clock cycle 4-6
   FXCH          ; clock cycle 4
   FADD [e]      ; clock cycle 5-7
   FXCH          ; clock cycle 5
   FADD [f]      ; clock cycle 7-9 (stall)
   FADD          ; clock cycle 10-12 (stall)</pre><p>
Here we have a one clock stall before <kbd>FADD [f]</kbd> because it is waiting
for the result of <kbd>FADD [d]</kbd>, and a two clock stall before the last
<kbd>FADD</kbd> because it is waiting for the result of
<kbd>FADD [f]</kbd>. The latter stall can be hidden by filling in some integer
instructions, but the first stall cannot, because an integer instruction at
this place would make the <kbd>FXCH</kbd> pair imperfectly.
<p>
The first stall can be avoided by having three threads rather than two, but
that would cost an extra <kbd>FLD</kbd>, so we do not save anything by having
three threads rather than two unless there are at least eight numbers to add.
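The same split into independent partial sums can be written in a high-level language; compilers rarely do this on their own because floating point addition is not associative, so the reordering must be made explicit. A sketch for six numbers (function name my own):

```c
/* Sum six numbers as two independent threads of three, then combine,
   mirroring the two-thread FADD schedule above. */
double sum6(const double *x)
{
    double s0 = x[0] + x[1] + x[2];  /* thread 'a' */
    double s1 = x[3] + x[4] + x[5];  /* thread 'b' */
    return s0 + s1;                  /* the final FADD */
}
```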
<p>
Not all floating point instructions can overlap. And some floating point
instructions can overlap more subsequent integer instructions than subsequent
floating point instructions. The <kbd>FDIV</kbd> instruction, for example,
takes 39 clock cycles. All but the first clock cycle can
overlap with integer instructions, but only the last two clock cycles can
overlap with floating point instructions. Example:
<pre>   FDIV          ; clock cycle 1-39  (U-pipe)
   FXCH          ; clock cycle 1-2   (V-pipe, imperfect pairing)
   SHR EAX,1     ; clock cycle 3     (U-pipe)
   INC EBX       ; clock cycle 3     (V-pipe)
   CMC           ; clock cycle 4-5   (non-pairable)
   FADD [x]      ; clock cycle 38-40 (U-pipe, waiting while FPU busy)
   FXCH          ; clock cycle 38    (V-pipe)
   FMUL [y]      ; clock cycle 40-42 (U-pipe, waiting for result of FDIV)</pre>
The first <kbd>FXCH</kbd> pairs with the <kbd>FDIV</kbd>, but takes an extra
clock cycle because it is not followed by a floating point instruction.
The <kbd>SHR / INC</kbd> pair starts before the <kbd>FDIV</kbd> is finished, but
has to wait for the <kbd>FXCH</kbd> to finish. The <kbd>FADD</kbd> has to wait
till clock 38 because new floating point instructions can only execute during
the last two clock cycles of the <kbd>FDIV</kbd>. The second <kbd>FXCH</kbd>
pairs with the <kbd>FADD</kbd>. The <kbd>FMUL</kbd> has to wait for the <kbd>FDIV</kbd>
to finish because it uses the result of the division.
<p>
If you have nothing else to put in after a floating point instruction with a
large integer overlap, such as <kbd>FDIV</kbd> or <kbd>FSQRT</kbd>, then you
may put in a dummy read from an address which you expect to need later in
the program to make sure it is in the level one cache. Example:
<pre>   FDIV QWORD PTR [EBX]
   CMP  [ESI],ESI
   FMUL QWORD PTR [ESI]</pre><p>
Here we use the integer overlap to pre-load the value at <kbd>[ESI]</kbd> into
the cache while the <kbd>FDIV</kbd> is being computed (we don't care what
the result of the <kbd>CMP</kbd> is).
<p>
Chapter <a href="#28">28</a> gives a complete listing of floating point instructions, and
what they can pair or overlap with.
<p>
There is no penalty for using a memory operand on floating point instructions
because the arithmetic unit is one step later in the pipeline than the read
unit. The tradeoff of this comes when you store floating point data to memory.
The <kbd>FST</kbd> or <kbd>FSTP</kbd> instruction with a memory
operand takes two clock cycles in the execution stage, but it needs the data one clock
earlier, so you will get a one clock stall if the value to store is not ready one clock cycle in
advance. This is analogous to an AGI stall. Example:
<pre>   FLD  [a1]     ; clock cycle 1
   FADD [a2]     ; clock cycle 2-4
   FLD  [b1]     ; clock cycle 3
   FADD [b2]     ; clock cycle 4-6
   FXCH          ; clock cycle 4
   FSTP [a3]     ; clock cycle 6-7
   FSTP [b3]     ; clock cycle 8-9</pre><p>
The <kbd>FSTP [a3]</kbd> stalls for one clock cycle because the result of
<kbd>FADD [a2]</kbd> is not ready
in the preceding clock cycle. In many cases you cannot hide this type of stall without
scheduling your floating point code into four threads or putting some integer instructions in
between. The two clock cycles in the execution stage of the <kbd>FST(P)</kbd> instruction cannot pair
or overlap with any subsequent instructions.
<p>
Instructions with integer operands such as <kbd>FIADD, FISUB, FIMUL, FIDIV, FICOM</kbd> may
be split up into simpler operations in order to improve overlapping. Example:
<pre>   FILD  [a]     ; clock cycle 1-3
   FIMUL [b]     ; clock cycle 4-9</pre><p>
Split up into:
<pre>   FILD  [a]     ; clock cycle 1-3
   FILD  [b]     ; clock cycle 2-4
   FMUL          ; clock cycle 5-7</pre><p>
In this example, you save two clocks by overlapping the two <kbd>FILD</kbd> instructions.
<p>
<h2><a name="25">25</a>. Loop optimization (all processors)</h2>
When analyzing a program you often find that most of the time consumption lies in the
innermost loop. The way to improve the speed is to carefully optimize the most
time-consuming loop using assembly language. The rest of the program may be left in
high-level language.
<p>
In all the following examples it is assumed that all data are in the level 1 cache. If the speed
is limited by cache misses then there is no reason to optimize the instructions. Rather, you
should concentrate on organizing your data in a way that minimizes cache misses (see
chapter <a href="#7">7</a>).
<p>
<h3><a name="25_1">25.1. Loops in PPlain and PMMX</a></h3>
A loop generally contains a counter controlling how many times to iterate, and often array
access reading or writing one array element for each iteration. I have chosen as an example
a procedure which reads integers from an array, changes the sign of each integer, and stores
the results in another array.
<p>
The C code for this procedure would be:
<pre>void ChangeSign (int * A, int * B, int N) {
  int i;
  for (i=0; i<N; i++) B[i] = -A[i];}</pre>
<p>
Translating to assembly, we might write the procedure like this:
<p>
<h4>Example 1.1:</h4>
<pre>_ChangeSign PROC NEAR
        PUSH    ESI
        PUSH    EDI
A       EQU     DWORD PTR [ESP+12]
B       EQU     DWORD PTR [ESP+16]
N       EQU     DWORD PTR [ESP+20]
        MOV     ECX, [N]
        JECXZ   L2
        MOV     ESI, [A]
        MOV     EDI, [B]
        CLD
L1:     LODSD
        NEG     EAX
        STOSD
        LOOP    L1
L2:     POP     EDI
        POP     ESI
        RET     ; (no extra pop if _cdecl calling convention)
_ChangeSign ENDP</pre>
This looks like a nice solution, but it is not optimal because it uses slow non-pairable
instructions. It takes 11 clock cycles per iteration if all data are in the level one cache.
<p>
<h4>Using pairable instructions only (PPlain and PMMX)</h4>
<h4>Example 1.2:</h4>
<pre>     MOV  ECX, [N]
     MOV  ESI, [A]
     TEST ECX, ECX
     JZ   SHORT L2
     MOV  EDI, [B]
L1:  MOV  EAX, [ESI]    ; u
     XOR  EBX, EBX      ; v (pairs)
     ADD  ESI, 4        ; u
     SUB  EBX, EAX      ; v (pairs)
     MOV  [EDI], EBX    ; u
     ADD  EDI, 4        ; v (pairs)
     DEC  ECX           ; u
     JNZ  L1            ; v (pairs)
L2:</pre>
Here we have used pairable instructions only, and scheduled the instructions so that
everything pairs. It now takes only 4 clock cycles per iteration. We could have obtained the
same speed without splitting the <kbd>NEG</kbd> instruction, but the other unpairable instructions
should be split up.
<p>
<h4>Using the same register for counter and index</h4>
<h4>Example 1.3:</h4>
<pre>     MOV  ESI, [A]
     MOV  EDI, [B]
     MOV  ECX, [N]
     XOR  EDX, EDX
     TEST ECX, ECX
     JZ   SHORT L2
L1:  MOV  EAX, [ESI+4*EDX]  ; u
     NEG  EAX               ; u
     MOV  [EDI+4*EDX], EAX  ; u
     INC  EDX               ; v (pairs)
     CMP  EDX, ECX          ; u
     JB   L1                ; v (pairs)
L2:</pre><p>
Using the same register for counter and index gives us fewer instructions in the body of the
loop, but it still takes 4 clocks because we have two unpaired instructions.
<p>
<h4>Letting the counter end at zero (PPlain and PMMX)</h4>
We want to get rid of the <kbd>CMP</kbd> instruction in example 1.3 by letting the counter end at zero
and using the zero flag for detecting when we are finished, as we did in example 1.2. One way
to do this would be to execute the loop backwards, taking the last array elements first.
However, data caches are optimized for accessing data forwards, not backwards, so if
cache misses are likely, then you should rather start the counter at -N and count through
negative values up to zero. The base registers should then point to the end of the arrays
rather than the beginning:
<p>
<h4>Example 1.4:</h4>
<pre>     MOV  ESI, [A]
     MOV  EAX, [N]
     MOV  EDI, [B]
     XOR  ECX, ECX
     LEA  ESI, [ESI+4*EAX]  ; point to end of array A
     SUB  ECX, EAX          ; -N
     LEA  EDI, [EDI+4*EAX]  ; point to end of array B
     JZ   SHORT L2
L1:  MOV  EAX, [ESI+4*ECX]  ; u
     NEG  EAX               ; u
     MOV  [EDI+4*ECX], EAX  ; u
     INC  ECX               ; v (pairs)
     JNZ  L1                ; u
L2:</pre>
We are now down to five instructions in the loop body, but it still takes 4 clocks because of
poor pairing. (If the addresses and sizes of the arrays are constants we may save two
registers by substituting <kbd>A+SIZE A</kbd> for <kbd>ESI</kbd>
and <kbd>B+SIZE B</kbd> for <kbd>EDI</kbd>). Now let's see how we
can improve pairing.
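In C, the counting-through-negative-indices scheme of example 1.4 looks as follows; this is only a sketch (the function name is my own), computing the same result as the original ChangeSign:

```c
/* ChangeSign with the counter running from -N up to zero and the
   pointers set to the ends of the arrays, as in example 1.4. */
void ChangeSignNeg(const int *A, int *B, int N)
{
    const int *endA = A + N;          /* LEA ESI,[ESI+4*EAX] */
    int *endB = B + N;                /* LEA EDI,[EDI+4*EAX] */
    for (int i = -N; i != 0; i++)     /* INC ECX / JNZ L1 */
        endB[i] = -endA[i];
}
```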
<p>
<h4>Pairing calculations with loop overhead (PPlain and PMMX)</h4>
We may want to improve pairing by intermingling calculations with the loop control
instructions. If we want to put something in between <kbd>INC ECX</kbd>
and <kbd>JNZ L1</kbd>, it has to be
something that doesn't affect the zero flag. The <kbd>MOV [EDI+4*ECX],EBX</kbd>
instruction after <kbd>INC ECX</kbd> would generate an AGI delay, so we have
to be more ingenious:
<p>
<h4>Example 1.5:</h4>
<pre>     MOV  EAX, [N]
     XOR  ECX, ECX
     SHL  EAX, 2            ; 4 * N
     JZ   SHORT L3
     MOV  ESI, [A]
     MOV  EDI, [B]
     SUB  ECX, EAX          ; - 4 * N
     ADD  ESI, EAX          ; point to end of array A
     ADD  EDI, EAX          ; point to end of array B
     JMP  SHORT L2
L1:  MOV  [EDI+ECX-4], EAX  ; u
L2:  MOV  EAX, [ESI+ECX]    ; v (pairs)
     XOR  EAX, -1           ; u
     ADD  ECX, 4            ; v (pairs)
     INC  EAX               ; u
     JNC  L1                ; v (pairs)
     MOV  [EDI+ECX-4], EAX
L3:</pre>
I have used a different way to calculate the negative of <kbd>EAX</kbd> here:
inverting all bits and adding one. The reason why I am using this method is
that I can use a dirty trick with the
<kbd>INC</kbd> instruction: <kbd>INC</kbd> doesn't change the carry flag,
whereas <kbd>ADD</kbd> does. I am using <kbd>ADD</kbd>
rather than <kbd>INC</kbd> to increment my loop counter, and testing the carry
flag rather than the zero
flag. It is then possible to put the <kbd>INC EAX</kbd> in between without
affecting the carry flag. You
may think that we could have used <kbd>LEA EAX,[EAX+1]</kbd> here instead of
<kbd>INC EAX</kbd> - at least
that doesn't change any flags - but the <kbd>LEA</kbd> instruction would have
an AGI stall, so that's not
the best solution. Note that the trick with the <kbd>INC</kbd> instruction
not changing the carry flag is useful only on PPlain and PMMX, but will
cause a partial flags stall on PPro, PII and PIII.
<p>
I have obtained perfect pairing here, and the loop now takes only 3 clock cycles.
Whether you want to increment the loop counter by 1 (as in example 1.4) or by 4
(as in example 1.5) is a matter of taste; it makes no difference in loop timing.
<p>
|
|
<h4>Overlapping the end of one operation with the beginning of the next (PPlain and PMMX)</h4>
The method used in example 1.5 is not very generally applicable, so we may look for other
methods of improving pairing opportunities. One way is to reorganize the loop so that the
end of one operation overlaps with the beginning of the next. I will call this convoluting the
loop. A convoluted loop has an unfinished operation at the end of each loop iteration which
will be finished in the next run. Actually, example 1.5 did pair the last <kbd>MOV</kbd> of one iteration
with the first <kbd>MOV</kbd> of the next, but we want to explore this method further:
<p>
<h4>Example 1.6:</h4>
<pre>     MOV ESI, [A]
     MOV EAX, [N]
     MOV EDI, [B]
     XOR ECX, ECX
     LEA ESI, [ESI+4*EAX]     ; point to end of array A
     SUB ECX, EAX             ; -N
     LEA EDI, [EDI+4*EAX]     ; point to end of array B
     JZ SHORT L3
     XOR EBX, EBX
     MOV EAX, [ESI+4*ECX]
     INC ECX
     JZ SHORT L2
L1:  SUB EBX, EAX             ; u
     MOV EAX, [ESI+4*ECX]     ; v (pairs)
     MOV [EDI+4*ECX-4], EBX   ; u
     INC ECX                  ; v (pairs)
     MOV EBX, 0               ; u
     JNZ L1                   ; v (pairs)
L2:  SUB EBX, EAX
     MOV [EDI+4*ECX-4], EBX
L3:</pre><p>
Here we begin reading the second value before we have stored the first, and
this of course improves pairing opportunities. The <kbd>MOV EBX,0</kbd>
instruction has been put in between <kbd>INC ECX</kbd> and <kbd>JNZ L1</kbd>
not to improve pairing, but to avoid an AGI stall.
<p>
<h4>Rolling out a loop (PPlain and PMMX)</h4>
The most generally applicable way to improve pairing opportunities is to do two operations
for each run and do half as many runs. This is called rolling out a loop:
<p>
<h4>Example 1.7:</h4>
<pre>     MOV ESI, [A]
     MOV EAX, [N]
     MOV EDI, [B]
     XOR ECX, ECX
     LEA ESI, [ESI+4*EAX]     ; point to end of array A
     SUB ECX, EAX             ; -N
     LEA EDI, [EDI+4*EAX]     ; point to end of array B
     JZ SHORT L2
     TEST AL,1                ; test if N is odd
     JZ SHORT L1
     MOV EAX, [ESI+4*ECX]     ; N is odd. do the odd one
     NEG EAX
     MOV [EDI+4*ECX], EAX
     INC ECX                  ; make counter even
     JZ SHORT L2              ; N = 1
L1:  MOV EAX, [ESI+4*ECX]     ; u
     MOV EBX, [ESI+4*ECX+4]   ; v (pairs)
     NEG EAX                  ; u
     NEG EBX                  ; u
     MOV [EDI+4*ECX], EAX     ; u
     MOV [EDI+4*ECX+4], EBX   ; v (pairs)
     ADD ECX, 2               ; u
     JNZ L1                   ; v (pairs)
L2:</pre>
<p>
Now we are doing two operations in parallel, which gives the best pairing opportunities. We
have to test if <kbd>N</kbd> is odd and, if so, do one operation outside the loop, because the loop can
only do an even number of operations.
<p>
The loop has an AGI stall at the first <kbd>MOV</kbd> instruction because
<kbd>ECX</kbd> has been incremented in
the preceding clock cycle. The loop therefore takes 6 clock cycles for two operations.
<p>
<h4>Reorganizing a loop to remove AGI stall (PPlain and PMMX)</h4>
<h4>Example 1.8:</h4>
<pre>     MOV ESI, [A]
     MOV EAX, [N]
     MOV EDI, [B]
     XOR ECX, ECX
     LEA ESI, [ESI+4*EAX]     ; point to end of array A
     SUB ECX, EAX             ; -N
     LEA EDI, [EDI+4*EAX]     ; point to end of array B
     JZ SHORT L3
     TEST AL,1                ; test if N is odd
     JZ SHORT L2
     MOV EAX, [ESI+4*ECX]     ; N is odd. do the odd one
     NEG EAX                  ; no pairing opportunity
     INC ECX                  ; make counter even
     MOV [EDI+4*ECX-4], EAX
     JNZ SHORT L2
     NOP                      ; add NOP's if JNZ L2 not predictable
     NOP
     JMP SHORT L3             ; N = 1
L1:  NEG EAX                  ; u
     NEG EBX                  ; u
     MOV [EDI+4*ECX-8], EAX   ; u
     MOV [EDI+4*ECX-4], EBX   ; v (pairs)
L2:  MOV EAX, [ESI+4*ECX]     ; u
     MOV EBX, [ESI+4*ECX+4]   ; v (pairs)
     ADD ECX, 2               ; u
     JNZ L1                   ; v (pairs)
     NEG EAX
     NEG EBX
     MOV [EDI+4*ECX-8], EAX
     MOV [EDI+4*ECX-4], EBX
L3:</pre>
<p>
The trick is to find a pair of instructions that do not use the loop counter as index, and
reorganize the loop so that the counter is incremented in the preceding clock cycle. We are
now down to 5 clock cycles for two operations, which is close to the best possible.
<p>
If data caching is critical, then you may improve the speed further by
interleaving the <kbd>A</kbd> and <kbd>B</kbd> arrays into one structured array
so that each <kbd>B[i]</kbd> comes immediately after the
corresponding <kbd>A[i]</kbd>. If the structured array is aligned by at least
8, then <kbd>B[i]</kbd> will always be
in the same cache line as <kbd>A[i]</kbd>, so you will never have a cache
miss when writing <kbd>B[i]</kbd>.
This may of course have a tradeoff in other parts of the program, so you
have to weigh the costs against the benefits.
<p>
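The interleaving idea can be sketched in C. The <kbd>Pair</kbd> struct and the function name below are mine, not from the example; they merely show the layout where each <kbd>B[i]</kbd> sits next to its <kbd>A[i]</kbd>:
<pre>
```c
#include <stdint.h>

/* Interleave A[i] and B[i] in one structured array so that B[i]
   shares a cache line with A[i]. Aligning each 8-byte pair by 8
   keeps both members in the same cache line. */
typedef struct { int32_t a; int32_t b; } Pair;

void change_sign(Pair * p, int n) {
    int i;
    for (i = 0; i < n; i++)
        p[i].b = -p[i].a;   /* the write lands in the line just read */
}
```
</pre>
<p>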
<h4>Rolling out by more than 2 (PPlain and PMMX)</h4>
You may think of doing more than two operations per iteration in order to reduce the loop
overhead per operation. But since the loop overhead in most cases can be reduced to only
one clock cycle per iteration, rolling out the loop by 4 rather than by 2 would only save
1/4 clock cycle per operation, which is hardly worth the effort. Only if the loop overhead
cannot be reduced to one clock cycle, and if N is very big, should you think of unrolling by 4.
<p>
The drawbacks of excessive loop unrolling are:
<ol>
<li>You need to calculate N MODULO R, where R is the unrolling factor, and do N
MODULO R operations before or after the main loop in order to make the remaining
number of operations divisible by R. This takes a lot of extra code and poorly predictable
branches. And the loop body of course also becomes bigger.
<li>A piece of code usually takes much more time the first time it executes, and the penalty
of first-time execution is bigger the more code you have, especially if N is small.
<li>Excessive code size makes the utilization of the code cache less effective.
</ol>
<p>
<h4>Handling multiple 8 or 16 bit operands simultaneously in 32 bit registers (PPlain and PMMX)</h4>
If you need to manipulate arrays of 8 or 16 bit operands, then there is a problem with
unrolled loops because you may not be able to pair two memory access operations. For
example, <kbd>MOV AL,[ESI] / MOV BL,[ESI+1]</kbd> will not pair if the two operands are within
the same dword of memory. But there may be a much smarter method, namely to handle
four bytes at a time in the same 32 bit register.
<p>
The following example adds 2 to all elements of an array of bytes:
<p>
<h4>Example 1.9:</h4>
<pre>     MOV ESI, [A]             ; address of byte array
     MOV ECX, [N]             ; number of elements in byte array
     TEST ECX, ECX            ; test if N is 0
     JZ SHORT L2
     MOV EAX, [ESI]           ; read first four bytes
L1:  MOV EBX, EAX             ; copy into EBX
     AND EAX, 7F7F7F7FH       ; get lower 7 bits of each byte in EAX
     XOR EBX, EAX             ; get the highest bit of each byte
     ADD EAX, 02020202H       ; add desired value to all four bytes
     XOR EBX, EAX             ; combine bits again
     MOV EAX, [ESI+4]         ; read next four bytes
     MOV [ESI], EBX           ; store result
     ADD ESI, 4               ; increment pointer
     SUB ECX, 4               ; decrement loop counter
     JA L1                    ; loop
L2:</pre>
This loop takes 5 clock cycles for every 4 bytes. The array should of course be aligned by
4. If the number of elements in the array is not divisible by four, then you may pad it in the
end with a few extra bytes to make the length divisible by four. This loop will always read
past the end of the array, so you should make sure the array is not placed at the end of a
segment, to avoid a general protection error.
<p>
Note that I have masked out the highest bit of each byte to avoid a possible carry from
each byte into the next when adding. I am using <kbd>XOR</kbd> rather than
<kbd>ADD</kbd> when putting in the high bit again to avoid carry.
<p>
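The masking trick is easier to see in C. This is only a sketch of the idea from example 1.9; the function name and the use of <kbd>uint32_t</kbd> are my own:
<pre>
```c
#include <stdint.h>

/* Add 2 to each of the four bytes packed in x, using the same
   carry-masking trick as example 1.9: clear the top bit of every
   byte, add, then restore the top bits with XOR, so that no carry
   can propagate from one byte into the next. */
uint32_t swar_add2(uint32_t x) {
    uint32_t low7 = x & 0x7F7F7F7Fu;    /* lower 7 bits of each byte */
    uint32_t high = x ^ low7;           /* the top bit of each byte  */
    return (low7 + 0x02020202u) ^ high; /* add, then combine again   */
}
```
</pre>
<p>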
The <kbd>ADD ESI,4</kbd> instruction could have been avoided by using the loop counter as index,
as in example 1.4. However, this would give an odd number of instructions in the loop
body, so there would be one unpaired instruction and the loop would still take 5 clocks.
Making the branch instruction unpaired would save one clock after the last operation when
the branch is mispredicted, but we would have to spend an extra clock cycle in the prolog
code to set up a pointer to the end of the array and calculate -N, so the two methods are
exactly equally fast. The method presented here is the simplest and shortest.
<p>
The next example finds the length of a zero-terminated string by searching
for the first byte of zero. It is faster than using <kbd>REP SCASB</kbd>:
<p>
<h4><a name="1-10">Example 1.10:</a></h4>
<pre>STRLEN PROC NEAR
     MOV EAX,[ESP+4]          ; get pointer
     MOV EDX,7
     ADD EDX,EAX              ; pointer+7 used in the end
     PUSH EBX
     MOV EBX,[EAX]            ; read first 4 bytes
     ADD EAX,4                ; increment pointer
L1:  LEA ECX,[EBX-01010101H]  ; subtract 1 from each byte
     XOR EBX,-1               ; invert all bytes
     AND ECX,EBX              ; and these two
     MOV EBX,[EAX]            ; read next 4 bytes
     ADD EAX,4                ; increment pointer
     AND ECX,80808080H        ; test all sign bits
     JZ L1                    ; no zero bytes, continue loop
     TEST ECX,00008080H       ; test first two bytes
     JNZ SHORT L2
     SHR ECX,16               ; not in the first 2 bytes
     ADD EAX,2
L2:  SHL CL,1                 ; use carry flag to avoid a branch
     POP EBX
     SBB EAX,EDX              ; compute length
     RET
STRLEN ENDP</pre>
Again we have used the method of overlapping the end of one operation with the beginning
of the next to improve pairing. I have not unrolled the loop because it is likely to repeat
relatively few times. The string should of course be aligned by 4. The code will always read
past the end of the string, so the string should not be placed at the end of a segment.
<p>
The loop body has an odd number of instructions, so one is unpaired. Making the
branch instruction unpaired rather than one of the other instructions has the advantage that
it saves 1 clock cycle when the branch is mispredicted.
<p>
The <kbd>TEST ECX,00008080H</kbd> instruction is non-pairable. You could use the pairable
instruction <kbd>OR CH,CL</kbd> here instead, but then you would have to put
in a <kbd>NOP</kbd> or something to avoid the penalties of consecutive branches.
Another problem with <kbd>OR CH,CL</kbd> is that it
would cause a partial register stall on PPro, PII and PIII. So I have chosen to keep the
unpairable <kbd>TEST</kbd> instruction.
<p>
Handling 4 bytes simultaneously can be quite difficult. The code uses a formula which
generates a nonzero value for a byte if, and only if, the byte is zero. This makes it possible
to test all four bytes in one operation. This algorithm involves the subtraction of 1 from all
bytes (in the <kbd>LEA</kbd> instruction). I have not masked out the highest bit of each byte before
subtracting, as I did in the previous example, so the subtraction may generate a borrow to
the next byte, but only if it is zero, and this is exactly the situation where we don't care what
the next byte is, because we are searching forwards for the first zero. If we were searching
backwards, then we would have to re-read the dword after detecting a zero, and then test all
four bytes to find the last zero, or use <kbd>BSWAP</kbd> to reverse the order of the bytes.
<p>
If you want to search for a byte value other than zero, then you may <kbd>XOR</kbd> all four bytes
with the value you are searching for, and then use the method above to search for zero.
<p>
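The zero-byte formula computed by the <kbd>LEA/XOR/AND</kbd> sequence can be written out in C. This is a sketch of the formula only; the function name is mine:
<pre>
```c
#include <stdint.h>

/* Returns nonzero if and only if at least one of the four bytes
   packed in x is zero, using the formula from example 1.10:
   (x - 01010101H) AND NOT x AND 80808080H
   A byte's sign bit survives the AND only when that byte was zero. */
uint32_t has_zero_byte(uint32_t x) {
    return (x - 0x01010101u) & ~x & 0x80808080u;
}
```
</pre>
<p>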
<h4>Loops with MMX operations (PMMX)</h4>
Handling multiple operands in the same register is easier on the MMX processors because
they have special instructions and special 64 bit registers for exactly this purpose.
<p>
Returning to the problem of adding two to all bytes in an array, we may take advantage of
the MMX instructions:
<h4>Example 1.11:</h4>
<pre>.data
ALIGN 8
ADDENTS DQ 0202020202020202h     ; specify byte to add eight times
A       DD ?                     ; address of byte array
N       DD ?                     ; number of iterations

.code
     MOV ESI, [A]
     MOV ECX, [N]
     MOVQ MM2, [ADDENTS]
     JMP SHORT L2
     ; top of loop
L1:  MOVQ [ESI-8], MM0           ; store result
L2:  MOVQ MM0, MM2               ; load addents
     PADDB MM0, [ESI]            ; add eight bytes in one operation
     ADD ESI, 8
     DEC ECX
     JNZ L1
     MOVQ [ESI-8], MM0           ; store last result
     EMMS</pre>
The store instruction is moved to after the loop control instructions in order to avoid a store stall.
<p>
This loop takes 4 clocks because the <kbd>PADDB</kbd> instruction doesn't pair
with <kbd>ADD ESI,8</kbd>. (An MMX instruction with memory access cannot pair
with a non-MMX instruction or with another MMX instruction with memory access.)
We could get rid of <kbd>ADD ESI,8</kbd> by using <kbd>ECX</kbd> as index,
but that would give an AGI stall.
<p>
Since the loop overhead is considerable, we might want to unroll the loop:
<p>
<h4>Example 1.12:</h4>
<pre>.data
ALIGN 8
ADDENTS DQ 0202020202020202h     ; specify byte to add eight times
A       DD ?                     ; address of byte array
N       DD ?                     ; number of iterations

.code
     MOVQ MM2, [ADDENTS]
     MOV ESI, [A]
     MOV ECX, [N]
     MOVQ MM0, MM2
     MOVQ MM1, MM2
L3:  PADDB MM0, [ESI]
     PADDB MM1, [ESI+8]
     MOVQ [ESI], MM0
     MOVQ MM0, MM2
     MOVQ [ESI+8], MM1
     MOVQ MM1, MM2
     ADD ESI, 16
     DEC ECX
     JNZ L3
     EMMS</pre>
This unrolled loop takes 6 clocks per iteration for adding 16 bytes.
The <kbd>PADDB</kbd> instructions are not paired. The two threads are
interleaved to avoid a store stall.
<p>
Using the MMX instructions has a high penalty if you are using floating point
instructions shortly afterwards, so there may still be situations where you
want to use 32 bit registers as in example 1.9.
<p>
<h4>Loops with floating point operations (PPlain and PMMX)</h4>
The methods of optimizing floating point loops are basically the same as for integer loops,
although the floating point instructions are overlapping rather than pairing.
<p>
Consider the C language code:
<pre>     int i, n;  double * X;  double * Y;  double DA;
     for (i=0; i&lt;n; i++)  Y[i] = Y[i] - DA * X[i];</pre>
This piece of code (called DAXPY) has been studied extensively because it is the key to
solving linear equations.
<p>
<h4>Example 1.13:</h4>
<pre>DSIZE = 8                                 ; data size
     MOV EAX, [N]                         ; number of elements
     MOV ESI, [X]                         ; pointer to X
     MOV EDI, [Y]                         ; pointer to Y
     XOR ECX, ECX
     LEA ESI, [ESI+DSIZE*EAX]             ; point to end of X
     SUB ECX, EAX                         ; -N
     LEA EDI, [EDI+DSIZE*EAX]             ; point to end of Y
     JZ SHORT L3                          ; test for N = 0
     FLD DSIZE PTR [DA]
     FMUL DSIZE PTR [ESI+DSIZE*ECX]       ; DA * X[0]
     JMP SHORT L2                         ; jump into loop
L1:  FLD DSIZE PTR [DA]
     FMUL DSIZE PTR [ESI+DSIZE*ECX]       ; DA * X[i]
     FXCH                                 ; get old result
     FSTP DSIZE PTR [EDI+DSIZE*ECX-DSIZE] ; store Y[i]
L2:  FSUBR DSIZE PTR [EDI+DSIZE*ECX]      ; subtract from Y[i]
     INC ECX                              ; increment index
     JNZ L1                               ; loop
     FSTP DSIZE PTR [EDI+DSIZE*ECX-DSIZE] ; store last result
L3:</pre>
Here we are using the same methods as in example 1.6: using the loop counter as index
register and counting through negative values up to zero. The end of one operation
overlaps with the beginning of the next.
<p>
The interleaving of floating point operations works perfectly here:
the 2 clock stall between <kbd>FMUL</kbd> and <kbd>FSUBR</kbd> is filled
with the <kbd>FSTP</kbd> of the previous result. The 3 clock stall between
<kbd>FSUBR</kbd> and <kbd>FSTP</kbd> is filled with the loop overhead and the
first two instructions of the next operation. An AGI stall has been avoided
by reading the only parameter that doesn't depend on the index in the first clock cycle after the index has been incremented.
<p>
This solution takes 6 clock cycles per operation, which is better than the
unrolled solution published by Intel!
<p>
<h4>Unrolling floating point loops (PPlain and PMMX)</h4>
<a name="unrollby3">The DAXPY loop unrolled by 3 is quite complicated:</a>
<h4>Example 1.14:</h4>
<pre>DSIZE = 8                          ; data size
IF DSIZE EQ 4
SHIFTCOUNT = 2
ELSE
SHIFTCOUNT = 3
ENDIF

     MOV EAX, [N]                  ; number of elements
     MOV ECX, 3*DSIZE              ; counter bias
     SHL EAX, SHIFTCOUNT           ; DSIZE*N
     JZ L4                         ; N = 0
     MOV ESI, [X]                  ; pointer to X
     SUB ECX, EAX                  ; (3-N)*DSIZE
     MOV EDI, [Y]                  ; pointer to Y
     SUB ESI, ECX                  ; end of pointer - bias
     SUB EDI, ECX
     TEST ECX, ECX
     FLD DSIZE PTR [ESI+ECX]       ; first X
     JNS SHORT L2                  ; less than 4 operations
L1:                                ; main loop
     FMUL DSIZE PTR [DA]
     FLD DSIZE PTR [ESI+ECX+DSIZE]
     FMUL DSIZE PTR [DA]
     FXCH
     FSUBR DSIZE PTR [EDI+ECX]
     FXCH
     FLD DSIZE PTR [ESI+ECX+2*DSIZE]
     FMUL DSIZE PTR [DA]
     FXCH
     FSUBR DSIZE PTR [EDI+ECX+DSIZE]
     FXCH ST(2)
     FSTP DSIZE PTR [EDI+ECX]
     FSUBR DSIZE PTR [EDI+ECX+2*DSIZE]
     FXCH
     FSTP DSIZE PTR [EDI+ECX+DSIZE]
     FLD DSIZE PTR [ESI+ECX+3*DSIZE]
     FXCH
     FSTP DSIZE PTR [EDI+ECX+2*DSIZE]
     ADD ECX, 3*DSIZE
     JS L1                         ; loop
L2:  FMUL DSIZE PTR [DA]           ; finish leftover operation
     FSUBR DSIZE PTR [EDI+ECX]
     SUB ECX, 2*DSIZE              ; change pointer bias
     JZ SHORT L3                   ; finished
     FLD DSIZE PTR [DA]            ; start next operation
     FMUL DSIZE PTR [ESI+ECX+3*DSIZE]
     FXCH
     FSTP DSIZE PTR [EDI+ECX+2*DSIZE]
     FSUBR DSIZE PTR [EDI+ECX+3*DSIZE]
     ADD ECX, 1*DSIZE
     JZ SHORT L3                   ; finished
     FLD DSIZE PTR [DA]
     FMUL DSIZE PTR [ESI+ECX+3*DSIZE]
     FXCH
     FSTP DSIZE PTR [EDI+ECX+2*DSIZE]
     FSUBR DSIZE PTR [EDI+ECX+3*DSIZE]
     ADD ECX, 1*DSIZE
L3:  FSTP DSIZE PTR [EDI+ECX+2*DSIZE]
L4:</pre>
The reason why I am showing you how to unroll a loop by 3 is not to recommend
it, but to warn you how difficult it is! Be prepared to spend a considerable amount of time debugging
and verifying your code when doing something like this. There are several problems to take
care of: In most cases, you cannot remove all stalls from a floating point loop unrolled by
less than 4 unless you convolute it (i.e. there are unfinished operations at the end of each
run which are being finished in the next run). The last <kbd>FLD</kbd> in the main loop above is the
beginning of the first operation in the next run. It would be tempting here to make a solution
which reads past the end of the array and then discards the extra value in the end, as in
examples 1.9 and 1.10, but that is not recommended in floating point loops because the
reading of the extra value might generate a denormal operand exception in case the
memory position after the array doesn't contain a valid floating point number. To avoid this,
we have to do at least one more operation after the main loop.
<p>
The number of operations to do outside an unrolled loop would normally be N MODULO R,
where N is the number of operations, and R is the unrolling factor. But in the case of a
convoluted loop, we have to do one more, i.e. (N-1) MODULO R + 1, for the
abovementioned reason.
<p>
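A quick sanity check of that formula in C (the function name is mine): the result is always between 1 and R, and the remaining N minus leftover operations are divisible by R.
<pre>
```c
/* Number of operations done outside a convoluted loop unrolled
   by R: at least one, and enough to make the rest divisible by R. */
int leftover(int n, int r) {
    return (n - 1) % r + 1;
}
```
</pre>
<p>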
Normally, we would prefer to do the extra operations before the main loop, but here we have
to do them afterwards, for two reasons: One reason is to take care of the leftover operand
from the convolution. The other reason is that calculating the number of extra operations
requires a division if R is not a power of 2, and a division is time consuming. Doing the extra
operations after the loop saves the division.
<p>
The next problem is to calculate how to bias the loop counter so that it will change sign at
the right time, and adjust the base pointers so as to compensate for this bias. Finally, you
have to make sure the leftover operand from the convolution is handled correctly for all
values of N.
<p>
The epilog code doing 1-3 operations could have been implemented as a separate loop, but
that would cost an extra branch misprediction, so the solution above is faster.
<p>
<p>Now that I have scared you by demonstrating how difficult it is to unroll by 3, I will show you
that it is much easier to unroll by 4:
<p>
<h4>Example 1.15:</h4>
<pre>DSIZE = 8                              ; data size
     MOV EAX, [N]                      ; number of elements
     MOV ESI, [X]                      ; pointer to X
     MOV EDI, [Y]                      ; pointer to Y
     XOR ECX, ECX
     LEA ESI, [ESI+DSIZE*EAX]          ; point to end of X
     SUB ECX, EAX                      ; -N
     LEA EDI, [EDI+DSIZE*EAX]          ; point to end of Y
     TEST AL,1                         ; test if N is odd
     JZ SHORT L1
     FLD DSIZE PTR [DA]                ; do the odd operation
     FMUL DSIZE PTR [ESI+DSIZE*ECX]
     FSUBR DSIZE PTR [EDI+DSIZE*ECX]
     INC ECX                           ; adjust counter
     FSTP DSIZE PTR [EDI+DSIZE*ECX-DSIZE]
L1:  TEST AL,2                         ; test for possibly 2 more operations
     JZ L2
     FLD DSIZE PTR [DA]                ; N MOD 4 = 2 or 3. Do two more
     FMUL DSIZE PTR [ESI+DSIZE*ECX]
     FLD DSIZE PTR [DA]
     FMUL DSIZE PTR [ESI+DSIZE*ECX+DSIZE]
     FXCH
     FSUBR DSIZE PTR [EDI+DSIZE*ECX]
     FXCH
     FSUBR DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
     FXCH
     FSTP DSIZE PTR [EDI+DSIZE*ECX]
     FSTP DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
     ADD ECX, 2                        ; counter is now divisible by 4
L2:  TEST ECX, ECX
     JZ L4                             ; no more operations
L3:                                    ; main loop:
     FLD DSIZE PTR [DA]
     FLD DSIZE PTR [ESI+DSIZE*ECX]
     FMUL ST,ST(1)
     FLD DSIZE PTR [ESI+DSIZE*ECX+DSIZE]
     FMUL ST,ST(2)
     FLD DSIZE PTR [ESI+DSIZE*ECX+2*DSIZE]
     FMUL ST,ST(3)
     FXCH ST(2)
     FSUBR DSIZE PTR [EDI+DSIZE*ECX]
     FXCH ST(3)
     FMUL DSIZE PTR [ESI+DSIZE*ECX+3*DSIZE]
     FXCH
     FSUBR DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
     FXCH ST(2)
     FSUBR DSIZE PTR [EDI+DSIZE*ECX+2*DSIZE]
     FXCH
     FSUBR DSIZE PTR [EDI+DSIZE*ECX+3*DSIZE]
     FXCH ST(3)
     FSTP DSIZE PTR [EDI+DSIZE*ECX]
     FSTP DSIZE PTR [EDI+DSIZE*ECX+2*DSIZE]
     FSTP DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
     FSTP DSIZE PTR [EDI+DSIZE*ECX+3*DSIZE]
     ADD ECX, 4                        ; increment index by 4
     JNZ L3                            ; loop
L4:</pre>
<p>
It is usually quite easy to find a stall-free solution when unrolling by 4, and there is no need
for convolution. The number of extra operations to do outside the main loop is N MODULO
4, which can be calculated easily without division, simply by testing the two lowest bits in N.
The extra operations are done before the main loop rather than after, to make the handling
of the loop counter simpler.
<p>
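In C terms, the two <kbd>TEST</kbd> instructions simply read the two lowest bits of N. A sketch (the function name is mine):
<pre>
```c
/* N MODULO 4 without a division: the remainder is just the two
   lowest bits of N, which is what TEST AL,1 and TEST AL,2 read
   in example 1.15. */
unsigned mod4(unsigned n) {
    unsigned extra = 0;
    if (n & 1) extra += 1;   /* one odd operation        */
    if (n & 2) extra += 2;   /* two more operations      */
    return extra;            /* always equal to n % 4    */
}
```
</pre>
<p>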
The tradeoff of loop unrolling is that the extra operations outside the loop are slower due to
incomplete overlapping and possible branch mispredictions, and the first-time penalty is
higher because of increased code size.
<p>
As a general recommendation, I would say that if N is big, or if convoluting the loop without
unrolling cannot remove enough stalls, then you should unroll critical integer loops by 2 and
floating point loops by 4.
<p>
<h3><a name="25_2">25.2 Loops in PPro, PII and PIII</a></h3>
In the previous chapter (<a href="#25_1">25.1</a>) I explained how to use convolution and loop unrolling in order
to improve pairing on PPlain and PMMX. On the PPro, PII and PIII there is no reason to do this,
thanks to the out-of-order execution mechanism. But there are other quite difficult problems
to take care of, most importantly ifetch boundaries and register read stalls.
<p>
I have chosen the same example as in chapter <a href="#25_1">25.1</a>
for the previous microprocessors: a procedure which reads integers from an
array, changes the sign of each integer, and stores the results in another array.
<p>
A C language code for this procedure would be:
<pre>void ChangeSign (int * A, int * B, int N) {
     int i;
     for (i=0; i&lt;N; i++)  B[i] = -A[i];}</pre>
Translating to assembly, we might write the procedure like this:
<h4>Example 2.1:</h4>
<pre>_ChangeSign PROC NEAR
     PUSH ESI
     PUSH EDI
A    EQU DWORD PTR [ESP+12]
B    EQU DWORD PTR [ESP+16]
N    EQU DWORD PTR [ESP+20]

     MOV ECX, [N]
     JECXZ L2
     MOV ESI, [A]
     MOV EDI, [B]
     CLD
L1:  LODSD
     NEG EAX
     STOSD
     LOOP L1
L2:  POP EDI
     POP ESI
     RET
_ChangeSign ENDP</pre>
This looks like a nice solution, but it is not optimal, because it uses the non-optimal
instructions <kbd>LOOP, LODSD</kbd> and <kbd>STOSD</kbd> that generate many uops.
It takes 6-7 clock cycles per iteration if all data are in the level-one cache.
Avoiding these instructions, we get:
<h4>Example 2.2:</h4>
<pre>     MOV ECX, [N]
     JECXZ L2
     MOV ESI, [A]
     MOV EDI, [B]
     ALIGN 16
L1:  MOV EAX, [ESI]       ; len=2, p2rESIwEAX
     ADD ESI, 4           ; len=3, p01rwESIwF
     NEG EAX              ; len=2, p01rwEAXwF
     MOV [EDI], EAX       ; len=2, p4rEAX, p3rEDI
     ADD EDI, 4           ; len=3, p01rwEDIwF
     DEC ECX              ; len=1, p01rwECXwF
     JNZ L1               ; len=2, p1rF
L2:</pre>
The comments are interpreted as follows: the <kbd>MOV EAX,[ESI]</kbd>
instruction is 2 bytes long, and it generates one uop for port 2 that reads
<kbd>ESI</kbd> and writes to (renames) <kbd>EAX</kbd>. This
information is needed for analyzing the possible bottlenecks.
<p>
Let's first analyze the instruction decoding (chapter <a href="#14">14</a>): One of the
instructions generates 2 uops (<kbd>MOV [EDI],EAX</kbd>).
This instruction must go into decoder D0. There are three
decode groups in the loop, so it can decode in 3 clock cycles.
<p>
Next, let's look at the instruction fetch (chapter <a href="#15">15</a>): If an ifetch boundary prevents the first
three instructions from decoding together, then there will be three decode groups in the last
ifetch block, so that the next iteration will have the ifetch block starting at the first instruction,
where we want it, and we will get a delay only in the first iteration. A worse situation would
be a 16-byte boundary and an ifetch boundary in one of the last three instructions.
According to the <a href="#ifetchtable">ifetch table</a>, this will generate a delay of 1 clock and cause the next
iteration to have its first ifetch block aligned by 16, so that the problem continues through all
iterations. The result is a fetch time of 4 clocks per iteration rather than 3. There are two
ways to prevent this situation: the first method is to control where the ifetch blocks lie on the
first iteration; the second method is to control where the 16-byte boundaries are. The latter
method is the easiest. Since the entire loop has only 15 bytes of code, you can avoid any
16-byte boundary by aligning the loop entry by 16, as shown above. This will put the entire
loop into a single ifetch block, so that no further analysis of instruction fetching is needed.
<p>
The third problem to look at is register read stalls (chapter <a href="#16">16</a>). No register is read in this
loop without being written to at least a few clock cycles before, so there can be no register
read stalls.
<p>
The fourth analysis is execution (chapter <a href="#17">17</a>). Counting the uops for the different ports we
get:<br>
port 0 or 1: 4 uops<br>
port 1 only: 1 uop<br>
port 2: 1 uop<br>
port 3: 1 uop<br>
port 4: 1 uop<br>
Assuming that the uops that can go to either port 0 or 1 are distributed optimally, the
execution time will be 2.5 clocks per iteration.
<p>
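This port counting can be summarized in C. The sketch below is my own rough model of the rule used in the text (uops that can go to port 0 or 1 are spread over those two ports; every other port handles one uop per clock), not an exact simulator:
<pre>
```c
/* Rough port-bottleneck estimate: the execution time is limited by
   the busiest resource. Uops marked "port 0 or 1" share two ports;
   each of ports 1-4 executes at most one uop per clock. */
double exec_clocks(int p01, int p1, int p2, int p3, int p4) {
    double t = (p01 + p1) / 2.0;   /* ports 0 and 1 together */
    if (p1 > t) t = p1;
    if (p2 > t) t = p2;
    if (p3 > t) t = p3;
    if (p4 > t) t = p4;
    return t;
}
```
</pre>
For the uop counts above, <kbd>exec_clocks(4,1,1,1,1)</kbd> gives the 2.5 clocks stated in the text.
<p>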
The last analysis is retirement (chapter <a href="#18">18</a>). Since the number of uops in the loop is not
divisible by 3, the retirement slots will not be used optimally when the jump has to retire in
the first slot. The time needed for retirement is the number of uops divided by 3, and
rounded up to the nearest integer. This gives 3 clocks for retirement.
<p>
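The retirement rule stated above is just a ceiling division; a one-line sketch in C (the function name is mine, and this models only the stated rule, not the jump-in-first-slot constraint):
<pre>
```c
/* Retirement time in clocks: uops retire three per clock, so round
   the uop count up to the nearest multiple of 3. */
int retire_clocks(int uops) {
    return (uops + 2) / 3;
}
```
</pre>
With the 8 uops of example 2.2 this gives the 3 clocks found above.
<p>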
In conclusion, the loop above can execute in 3 clocks per iteration if the loop entry is
aligned by 16. I am assuming that the conditional jump is predicted every time except on the
exit of the loop (chapter <a href="#22_2">22.2</a>).
<p>
<h4>Using the same register for counter and index and letting the counter end at zero (PPro, PII and PIII)</h4>
<h4><a name="2-3">Example 2.3:</a></h4>
<pre>     MOV ECX, [N]
     MOV ESI, [A]
     MOV EDI, [B]
     LEA ESI, [ESI+4*ECX]     ; point to end of array A
     LEA EDI, [EDI+4*ECX]     ; point to end of array B
     NEG ECX                  ; -N
     JZ SHORT L2
     ALIGN 16
L1:  MOV EAX, [ESI+4*ECX]     ; len=3, p2rESIrECXwEAX
     NEG EAX                  ; len=2, p01rwEAXwF
     MOV [EDI+4*ECX], EAX     ; len=3, p4rEAX, p3rEDIrECX
     INC ECX                  ; len=1, p01rwECXwF
     JNZ L1                   ; len=2, p1rF
L2:</pre>
Here we have reduced the number of uops to 6 by using the same register as counter and
index. The base pointers point to the end of the arrays so that the index can count up
through negative values to zero.
<p>
Decoding: There are two decode groups in the loop, so it will decode in 2 clocks.
<p>
Instruction fetch: A loop always takes at least one clock cycle more than the number of
16-byte blocks. Since there are only 11 bytes of code in the loop, it is possible to have it all
in one ifetch block. By aligning the loop entry by 16 we can make sure that we don't get
more than one 16-byte block, so that it is possible to fetch in 2 clocks.
<p>
Register read stalls: The <kbd>ESI</kbd> and <kbd>EDI</kbd> registers are read, but not modified inside the loop.
They will therefore be counted as permanent register reads, but not in the same triplet.
Registers <kbd>EAX, ECX</kbd>, and the flags are modified inside the loop and read
before they are written back, so they will cause no permanent register reads.
The conclusion is that there are no register read stalls.
<p>
Execution:<br>
port 0 or 1: 2 uops<br>
port 1: 1 uop<br>
port 2: 1 uop<br>
port 3: 1 uop<br>
port 4: 1 uop<br>
Execution time: 1.5 clocks.
<p>
Retirement:<br>
6 uops = 2 clocks.
<p>
Conclusion: this loop takes only 2 clock cycles per iteration.
<p>
If you use absolute addresses instead of <kbd>ESI</kbd> and <kbd>EDI</kbd>, then the loop will take 3 clocks
because it cannot be contained in a single 16-byte block.
<p>
<h4>Unrolling a loop (PPro, PII and PIII)</h4>
Doing more than one operation in each run, and doing correspondingly fewer runs, is called
loop unrolling. In previous processors you would unroll loops to get parallel execution by
pairing (chapter <a href="#25_1">25.1</a>). On PPro, PII and PIII this is not needed, because the out-of-order
execution mechanism takes care of that. There is no need to use two different registers
either, because register renaming takes care of this. The purpose of unrolling here is to
reduce the loop overhead per iteration.
<p>
The following example is the same as example 2.2, but unrolled by 2, which means that
you do two operations per iteration and half as many iterations.
<h4>Example 2.4:</h4>
<pre>     MOV ECX, [N]
     MOV ESI, [A]
     MOV EDI, [B]
     SHR ECX, 1           ; N/2
     JNC SHORT L1         ; test if N was odd
     MOV EAX, [ESI]       ; do the odd one first
     ADD ESI, 4
     NEG EAX
     MOV [EDI], EAX
     ADD EDI, 4
L1:  JECXZ L3

     ALIGN 16
L2:  MOV EAX, [ESI]       ; len=2, p2rESIwEAX
     NEG EAX              ; len=2, p01rwEAXwF
     MOV [EDI], EAX       ; len=2, p4rEAX, p3rEDI
     MOV EAX, [ESI+4]     ; len=3, p2rESIwEAX
     NEG EAX              ; len=2, p01rwEAXwF
     MOV [EDI+4], EAX     ; len=3, p4rEAX, p3rEDI
     ADD ESI, 8           ; len=3, p01rwESIwF
     ADD EDI, 8           ; len=3, p01rwEDIwF
     DEC ECX              ; len=1, p01rwECXwF
     JNZ L2               ; len=2, p1rF
L3:</pre>
In example 2.2 the loop overhead (i.e. adjusting pointers and counter, and jumping back)
|
|
was 4 uops and the 'real job' was 4 uops. When unrolling the loop by two you do the 'real
|
|
job' twice and the overhead once, so you get 12 uops in all. This reduces the overhead from
|
|
50% to 33% of the uops. Since the unrolled loop can do only an even number of operations
|
|
you have to check if N is odd and if so do one operation outside the loop.
|
|
<p>
|
|
Analyzing instruction fetching in this loop, we find that a new ifetch block begins at the
|
|
<kbd>ADD ESI,8</kbd> instruction, forcing it into decoder D0. This makes the loop decode in 5 clock cycles
|
|
and not 4 as we wanted. We can solve this problem by coding the preceding instruction in a
|
|
longer version. Change <kbd>MOV [EDI+4],EAX </kbd> to:
|
|
<pre> MOV [EDI+9999],EAX ; make instruction with long displacement
|
|
ORG $-4
|
|
DD 4 ; rewrite displacement to 4</pre>
|
|
This will force a new ifetch block to begin at the long <kbd>MOV [EDI+4],EAX</kbd>
|
|
instruction, so that decoding time is now down at 4 clocks. The rest of the
|
|
pipeline can handle 3 uops per clock so that the expected execution time is 4
|
|
clocks per iteration, or 2 clocks per operation.
|
|
<p>
|
|
Testing this solution shows that it actually takes a little longer. My measurements showed
|
|
approximately 4.5 clocks per iteration. This is probably due to a sub-optimal reordering of
|
|
the uops. Possibly, the ROB doesn't find the optimal execution order for the uops but
|
|
submits them in a less than optimal order. This problem was not predicted, and only testing
|
|
can reveal such a problem. We may help the ROB by doing some of the reordering
|
|
manually:
|
|
<h4>Example 2.5:</h4>
|
|
<pre>ALIGN 16
|
|
L2: MOV EAX, [ESI] ; len=2, p2rESIwEAX
|
|
MOV EBX, [ESI+4] ; len=3, p2rESIwEBX
|
|
NEG EAX ; len=2, p01rwEAXwF
|
|
MOV [EDI], EAX ; len=2, p4rEAX, p3rEDI
|
|
ADD ESI, 8 ; len=3, p01rwESIwF
|
|
NEG EBX ; len=2, p01rwEBXwF
|
|
MOV [EDI+4], EBX ; len=3, p4rEBX, p3rEDI
|
|
ADD EDI, 8 ; len=3, p01rwEDIwF
|
|
DEC ECX ; len=1, p01rwECXwF
|
|
JNZ L2 ; len=2, p1rF
|
|
L3:</pre>
|
|
The loop now executes in 4 clocks per iteration. This solution also solves the problem with
|
|
instruction fetch blocks. The cost is that we need an extra register because we cannot take
|
|
advantage of register renaming.
|
|
<p>
|
|
<h4>Unrolling by more than 2</h4>
|
|
Loop unrolling is recommended when the loop overhead constitutes a high proportion of the
|
|
total execution time. In example 2.3 the overhead is only 2 uops, so the gain by unrolling is
|
|
small, but I will show you how to unroll it anyway, just for the exercise.
|
|
<p>
|
|
The 'real job' is 4 uops and the overhead 2. Unrolling by two we get 2*4+2 = 10 uops. The
|
|
retirement time will be 10/3, rounded up to an integer, that is 4 clock cycles. This calculation
|
|
shows that nothing is gained by unrolling this by two. Unrolling by four we get:
|
|
<h4>Example 2.6:</h4>
|
|
<pre> MOV ECX, [N]
|
|
SHL ECX, 2 ; number of bytes to handle
|
|
MOV ESI, [A]
|
|
MOV EDI, [B]
|
|
ADD ESI, ECX ; point to end of array A
|
|
ADD EDI, ECX ; point to end of array B
|
|
NEG ECX ; -4*N
|
|
TEST ECX, 4 ; test if N is odd
|
|
JZ SHORT L1
|
|
MOV EAX, [ESI+ECX] ; N is odd. do the odd one
|
|
NEG EAX
|
|
MOV [EDI+ECX], EAX
|
|
ADD ECX, 4
|
|
L1: TEST ECX, 8 ; test if N/2 is odd
|
|
JZ SHORT L2
|
|
MOV EAX, [ESI+ECX] ; N/2 is odd. do two extra
|
|
NEG EAX
|
|
MOV [EDI+ECX], EAX
|
|
MOV EAX, [ESI+ECX+4]
|
|
NEG EAX
|
|
MOV [EDI+ECX+4], EAX
|
|
ADD ECX, 8
|
|
L2: JECXZ SHORT L4
|
|
|
|
ALIGN 16
|
|
L3: MOV EAX, [ESI+ECX] ; len=3, p2rESIrECXwEAX
|
|
NEG EAX ; len=2, p01rwEAXwF
|
|
MOV [EDI+ECX], EAX ; len=3, p4rEAX, p3rEDIrECX
|
|
MOV EAX, [ESI+ECX+4] ; len=4, p2rESIrECXwEAX
|
|
NEG EAX ; len=2, p01rwEAXwF
|
|
MOV [EDI+ECX+4], EAX ; len=4, p4rEAX, p3rEDIrECX
|
|
MOV EAX, [ESI+ECX+8] ; len=4, p2rESIrECXwEAX
|
|
      MOV EBX, [ESI+ECX+12] ; len=4, p2rESIrECXwEBX
|
|
NEG EAX ; len=2, p01rwEAXwF
|
|
MOV [EDI+ECX+8], EAX ; len=4, p4rEAX, p3rEDIrECX
|
|
      NEG EBX               ; len=2, p01rwEBXwF
|
|
      MOV [EDI+ECX+12], EBX ; len=4, p4rEBX, p3rEDIrECX
|
|
ADD ECX, 16 ; len=3, p01rwECXwF
|
|
JS L3 ; len=2, p1rF
|
|
L4:</pre>
|
|
The ifetch blocks are where we want them. Decode time is 6 clocks.
|
|
<p>
|
|
Register read stalls are a problem here because <kbd>ECX</kbd> has retired near the end of the loop
|
|
and we need to read all of <kbd>ESI, EDI,</kbd> and <kbd>ECX</kbd>. The instructions have been reordered in
|
|
order to avoid reading <kbd>ESI</kbd> near the bottom, so that a register read stall is avoided. In
|
|
other words, the reason for reordering instructions and using an extra register (<kbd>EBX</kbd>) is not the
|
|
same as in the previous example.
|
|
<p>
|
|
There are 12 uops and the loop executes in 6 clocks per iteration, or 1.5 clocks per
|
|
operation.
|
|
<p>
|
|
It may be tempting to unroll loops by a high factor in order to get the maximum speed. But
|
|
since the loop overhead in most cases can be reduced to something like one clock cycle
|
|
per iteration then unrolling the loop by 4 rather than by 2 would save only 1/4 clock cycle
|
|
per operation which is hardly worth the effort. Only if the loop overhead is high compared to
|
|
the rest of the loop and N is very big should you think of unrolling by 4. Unrolling by more
|
|
than 4 does not make sense.
|
|
<p>
|
|
The drawbacks of excessive loop unrolling are:
|
|
<ol>
|
|
<li>You need to calculate N MODULO R, where R is the unrolling factor, and do N
|
|
MODULO R operations before or after the main loop in order to make the remaining
|
|
number of operations divisible by R. This takes a lot of extra code and poorly predictable
|
|
branches. And the loop body of course also becomes bigger.
|
|
<li>A piece of code usually takes much more time the first time it executes, and the penalty
|
|
of first time execution is bigger the more code you have, especially if N is small.
|
|
<li>Excessive code size makes the utilization of the code cache less effective.
|
|
</ol>
|
|
<p>
|
|
Using an unrolling factor which is not a power of 2 makes the calculation of N MODULO R
|
|
quite difficult, and is generally not recommended unless N is known to be divisible by R.
|
|
<a href="#unrollby3">Example 1.14</a> shows how to unroll by 3.
|
|
<p>
|
|
<h4>Handling multiple 8 or 16 bit operands simultaneously in 32 bit registers (PPro, PII and PIII)</h4>
|
|
It is sometimes possible to handle four bytes at a time in the same 32 bit register. The
|
|
following example adds 2 to all elements of an array of bytes:
|
|
<h4><a name="2-7">Example 2.7:</a></h4>
|
|
<pre> MOV ESI, [A] ; address of byte array
|
|
MOV ECX, [N] ; number of elements in byte array
|
|
JECXZ L2
|
|
ALIGN 16
|
|
DB 7 DUP (90H) ; 7 NOP's for controlling alignment
|
|
|
|
L1: MOV EAX, [ESI] ; read four bytes
|
|
MOV EBX, EAX ; copy into EBX
|
|
AND EAX, 7F7F7F7FH ; get lower 7 bits of each byte in EAX
|
|
XOR EBX, EAX ; get the highest bit of each byte
|
|
ADD EAX, 02020202H ; add desired value to all four bytes
|
|
XOR EBX, EAX ; combine bits again
|
|
MOV [ESI], EBX ; store result
|
|
ADD ESI, 4 ; increment pointer
|
|
SUB ECX, 4 ; decrement loop counter
|
|
JA L1 ; loop
|
|
L2:</pre>
|
|
Note that I have masked out the highest bit of each byte to avoid a possible carry from each
|
|
byte into the next one when adding. I am using <kbd>XOR</kbd> rather than
|
|
<kbd>ADD</kbd> when putting in the high bit again to avoid carry.
|
|
The array should of course be aligned by 4.
|
|
<p>
|
|
This loop should ideally take 4 clocks per iteration, but it takes somewhat
|
|
more due to the dependency chain and difficult reordering. On PII and PIII
|
|
you can do the same more effectively using MMX registers.
|
|
<p>
|
|
The next example finds the length of a zero-terminated string by searching for the first byte
|
|
of zero. It is much faster than using <kbd>REPNE SCASB</kbd>:
|
|
<h4><a name="2-8">Example 2.8:</a></h4>
|
|
<pre>_strlen PROC NEAR
|
|
PUSH EBX
|
|
MOV EAX,[ESP+8] ; get pointer to string
|
|
LEA EDX,[EAX+3] ; pointer+3 used in the end
|
|
L1: MOV EBX,[EAX] ; read first 4 bytes
|
|
ADD EAX,4 ; increment pointer
|
|
LEA ECX,[EBX-01010101H] ; subtract 1 from each byte
|
|
NOT EBX ; invert all bytes
|
|
AND ECX,EBX ; and these two
|
|
AND ECX,80808080H ; test all sign bits
|
|
JZ L1 ; no zero bytes, continue loop
|
|
MOV EBX,ECX
|
|
SHR EBX,16
|
|
TEST ECX,00008080H ; test first two bytes
|
|
CMOVZ ECX,EBX ; shift right if not in the first 2 bytes
|
|
LEA EBX,[EAX+2]
|
|
CMOVZ EAX,EBX
|
|
SHL CL,1 ; use carry flag to avoid branch
|
|
SBB EAX,EDX ; compute length
|
|
POP EBX
|
|
RET
|
|
_strlen ENDP</pre>
|
|
This loop takes 3 clocks for each iteration testing 4 bytes. The string should of course be
|
|
aligned by 4. The code may read past the end of the string, so the string should not be
|
|
placed at the end of a segment.
|
|
<p>
|
|
Handling 4 bytes simultaneously can be quite difficult. This code uses a formula which
|
|
generates a nonzero value for a byte if, and only if, the byte is zero. This makes it possible
|
|
to test all four bytes in one operation. This algorithm involves the subtraction of 1 from all
|
|
bytes (in the <kbd>LEA ECX</kbd> instruction). I have not masked out the highest bit of each byte
|
|
before subtracting, as I did in example <a href="#2-7">2.7</a>, so the subtraction may generate a borrow to the
|
|
next byte, but only if it is zero, and this is exactly the situation where we don't care what the
|
|
next byte is, because we are searching forwards for the first zero. If we were searching
|
|
backwards then we would have to re-read the dword after detecting a zero, and then test all
|
|
four bytes to find the last zero, or use <kbd>BSWAP</kbd> to reverse the order of the bytes. If you want
|
|
to search for a byte value other than zero, then you may <kbd>XOR</kbd> all four bytes with the value
|
|
you are searching for, and then use the method above to search for zero.
|
|
<p>
|
|
<h4>Loops with MMX instructions (PII and PIII)</h4>
|
|
Using MMX instructions we can compare 8 bytes in one operation:
|
|
<h4><a name="2-9">Example 2.9:</a></h4>
|
|
<pre>_strlen PROC NEAR
|
|
PUSH EBX
|
|
MOV EAX,[ESP+8]
|
|
LEA EDX,[EAX+7]
|
|
PXOR MM0,MM0
|
|
L1: MOVQ MM1,[EAX] ; len=3 p2rEAXwMM1
|
|
ADD EAX,8 ; len=3 p01rEAX
|
|
PCMPEQB MM1,MM0 ; len=3 p01rMM0rMM1
|
|
MOVD EBX,MM1 ; len=3 p01rMM1wEBX
|
|
PSRLQ MM1,32 ; len=4 p1rMM1
|
|
MOVD ECX,MM1 ; len=3 p01rMM1wECX
|
|
OR ECX,EBX ; len=2 p01rECXrEBXwF
|
|
JZ L1 ; len=2 p1rF
|
|
MOVD ECX,MM1
|
|
TEST EBX,EBX
|
|
CMOVZ EBX,ECX
|
|
LEA ECX,[EAX+4]
|
|
CMOVZ EAX,ECX
|
|
MOV ECX,EBX
|
|
SHR ECX,16
|
|
TEST BX,BX
|
|
CMOVZ EBX,ECX
|
|
LEA ECX,[EAX+2]
|
|
CMOVZ EAX,ECX
|
|
SHR BL,1
|
|
SBB EAX,EDX
|
|
EMMS
|
|
POP EBX
|
|
RET
|
|
_strlen ENDP</pre>
|
|
This loop has 7 uops for port 0 and 1 which gives an average execution time of 3.5 clocks
|
|
per iteration. The measured time is 3.8 clocks which shows that the ROB handles the
|
|
situation reasonably well despite a dependency chain that is 6 uops long. Testing 8 bytes in
|
|
less than 4 clocks is very much faster than <kbd>REPNE SCASB</kbd>.
|
|
<p>
|
|
<h4>Loops with floating point instructions (PPro, PII and PIII)</h4>
|
|
The methods for optimizing floating point loops are basically the same as for integer loops,
|
|
but you should be more aware of dependency chains because of the long latencies of
|
|
instruction execution.
|
|
<p>
|
|
Consider the C language code:
|
|
<pre> int i, n; double * X; double * Y; double DA;
|
|
for (i=0; i<n; i++) Y[i] = Y[i] - DA * X[i];</pre>
|
|
This piece of code (called DAXPY) has been studied extensively because it is the key to
|
|
solving linear equations.
|
|
<h4>Example 2.10:</h4>
|
|
<pre>DSIZE = 8 ; data size (4 or 8)
|
|
MOV ECX, [N] ; number of elements
|
|
MOV ESI, [X] ; pointer to X
|
|
MOV EDI, [Y] ; pointer to Y
|
|
JECXZ L2 ; test for N = 0
|
|
FLD DSIZE PTR [DA] ; load DA outside loop
|
|
ALIGN 16
|
|
DB 2 DUP (90H) ; 2 NOP's for alignment
|
|
L1: FLD DSIZE PTR [ESI] ; len=3 p2rESIwST0
|
|
ADD ESI,DSIZE ; len=3 p01rESI
|
|
FMUL ST,ST(1) ; len=2 p0rST0rST1
|
|
FSUBR DSIZE PTR [EDI] ; len=3 p2rEDI, p0rST0
|
|
FSTP DSIZE PTR [EDI] ; len=3 p4rST0, p3rEDI
|
|
ADD EDI,DSIZE ; len=3 p01rEDI
|
|
DEC ECX ; len=1 p01rECXwF
|
|
JNZ L1 ; len=2 p1rF
|
|
FSTP ST ; discard DA
|
|
L2:</pre>
|
|
The dependency chain is 10 clock cycles long, but the loop takes only 4 clocks per iteration
|
|
because it can begin a new operation before the previous one is finished. The purpose of
|
|
the alignment is to prevent a 16-byte boundary in the last ifetch block.
|
|
<p>
|
|
<h4>Example 2.11:</h4>
|
|
<pre>DSIZE = 8 ; data size (4 or 8)
|
|
MOV ECX, [N] ; number of elements
|
|
MOV ESI, [X] ; pointer to X
|
|
MOV EDI, [Y] ; pointer to Y
|
|
LEA ESI, [ESI+DSIZE*ECX] ; point to end of array
|
|
LEA EDI, [EDI+DSIZE*ECX] ; point to end of array
|
|
NEG ECX ; -N
|
|
JZ SHORT L2 ; test for N = 0
|
|
FLD DSIZE PTR [DA] ; load DA outside loop
|
|
ALIGN 16
|
|
L1: FLD DSIZE PTR [ESI+DSIZE*ECX] ; len=3 p2rESIrECXwST0
|
|
FMUL ST,ST(1) ; len=2 p0rST0rST1
|
|
FSUBR DSIZE PTR [EDI+DSIZE*ECX] ; len=3 p2rEDIrECX, p0rST0
|
|
FSTP DSIZE PTR [EDI+DSIZE*ECX] ; len=3 p4rST0, p3rEDIrECX
|
|
INC ECX ; len=1 p01rECXwF
|
|
JNZ L1 ; len=2 p1rF
|
|
FSTP ST ; discard DA
|
|
L2:</pre>
|
|
Here we have used the same trick as in example <a href="#2-3">2.3</a>. Ideally, this loop should take 3
|
|
clocks, but measurements say approximately 3.5 clocks due to the long dependency chain.
|
|
Unrolling the loop doesn't save much.
|
|
<p>
|
|
<h4>Loops with XMM instructions (PIII)</h4>
|
|
The XMM instructions on the PIII allow you to operate on four single precision
|
|
floating point numbers in parallel. The operands must be aligned by 16.
|
|
<p>
|
|
The DAXPY algorithm is not very suited for XMM instructions because the precision
|
|
is poor, it may not be possible to align the operands by 16, and you need some
|
|
extra code if the number of operations is not a multiple of four. I am showing
|
|
the code here anyway, just to give an example of a loop with XMM instructions:
|
|
<h4>Example 2.12:</h4>
|
|
<pre> MOV ECX, [N] ; number of elements
|
|
MOV ESI, [X] ; pointer to X
|
|
MOV EDI, [Y] ; pointer to Y
|
|
SHL ECX, 2
|
|
ADD ESI, ECX ; point to end of X
|
|
ADD EDI, ECX ; point to end of Y
|
|
NEG ECX ; -4*N
|
|
MOV EAX, [DA] ; load DA outside loop
|
|
XOR EAX, 80000000H ; change sign of DA
|
|
PUSH EAX
|
|
MOVSS XMM1, [ESP] ; -DA
|
|
ADD ESP, 4
|
|
SHUFPS XMM1, XMM1, 0 ; copy -DA to all four positions
|
|
CMP ECX, -16
|
|
JG L2
|
|
L1: MOVAPS XMM0, [ESI+ECX] ; len=4 2*p2rESIrECXwXMM0
|
|
ADD ECX, 16 ; len=3 p01rwECXwF
|
|
MULPS XMM0, XMM1 ; len=3 2*p0rXMM0rXMM1
|
|
CMP ECX, -16 ; len=3 p01rECXwF
|
|
ADDPS XMM0, [EDI+ECX-16] ; len=5 2*p2rEDIrECX, 2*p1rXMM0
|
|
MOVAPS [EDI+ECX-16], XMM0 ; len=5 2*p4rXMM0, 2*p3rEDIrECX
|
|
JNG L1 ; len=2 p1rF
|
|
L2: JECXZ L4 ; check if finished
|
|
MOVAPS XMM0, [ESI+ECX] ; 1-3 operations missing, do 4 more
|
|
MULPS XMM0, XMM1
|
|
ADDPS XMM0, [EDI+ECX]
|
|
CMP ECX, -8
|
|
JG L3
|
|
MOVLPS [EDI+ECX], XMM0 ; store two more results
|
|
ADD ECX, 8
|
|
MOVHLPS XMM0, XMM0
|
|
L3: JECXZ L4
|
|
MOVSS [EDI+ECX], XMM0 ; store one more result
|
|
L4:</pre>
|
|
The <kbd>L1</kbd> loop takes 5-6 clocks for 4 operations.
|
|
The instructions modifying <kbd>ECX</kbd> have been placed before and after the
|
|
<kbd>MULPS XMM0, XMM1</kbd> instruction in order to avoid a register read port stall
|
|
generated by the reading of the two parts of the <kbd>XMM1</kbd> register
|
|
together with <kbd>ESI</kbd> or <kbd>EDI</kbd> in the RAT. The extra code after
|
|
<kbd>L2</kbd> takes care of the situation where N is not divisible by 4.
|
|
Note that this code may read past the end of X and Y. This may delay the last
|
|
operation if the extra memory positions read do not contain normal floating
|
|
point numbers. If possible, put in some dummy extra data to make the number
|
|
of operations divisible by 4 and leave out the extra code after <kbd>L2</kbd>.
|
|
<p>
|
|
<h2><a name="26">26</a>. Problematic Instructions</h2>
|
|
<h3><a name="26_1">26.1 XCHG (all processors)</a></h3>
|
|
The <kbd>XCHG register,[memory]</kbd> instruction is dangerous. This instruction always has
|
|
an implicit <kbd>LOCK</kbd> prefix which prevents it from using the cache. This instruction is therefore
|
|
very time consuming, and should always be avoided.
|
|
<p>
|
|
<h3><a name="26_2">26.2 Rotates through carry (all processors)</a></h3>
|
|
<kbd>RCR</kbd> and <kbd>RCL</kbd> with a count different from one are slow and should be avoided.
|
|
<p>
|
|
<h3><a name="26_3">26.3 String instructions (all processors)</a></h3>
|
|
String instructions without a repeat prefix are too slow and should be replaced by simpler
|
|
instructions. The same applies to <kbd>LOOP</kbd> on all processors and to
|
|
<kbd>JECXZ</kbd> on PPlain and PMMX.
|
|
<p>
|
|
<kbd>REP MOVSD</kbd> and <kbd>REP STOSD</kbd> are quite fast if the repeat
|
|
count is not too small. Always use the DWORD version if possible, and make
|
|
sure that both source and destination are aligned by 8.
|
|
<p>
|
|
Some other methods of moving data are faster under certain conditions. See
|
|
chapter <a href="#27_8">27.8</a> for details.
|
|
<p>
|
|
Note that while the <kbd>REP MOVS</kbd> instruction writes a word to the destination, it reads the next
|
|
word from the source in the same clock cycle. You can have a cache bank conflict if bits 2-4
|
|
are the same in these two addresses. In other words, you will get a penalty of one clock
|
|
extra per iteration if <kbd>ESI+(wordsize)-EDI</kbd> is divisible by 32. The easiest way to avoid
|
|
cache bank conflicts is to use the DWORD version and align both source and destination by
|
|
8. Never use <kbd>MOVSB</kbd> or <kbd>MOVSW</kbd> in optimized code, not even in 16 bit mode.
|
|
<p>
|
|
<kbd>REP MOVS</kbd> and <kbd>REP STOS</kbd> can perform very fast by moving an entire cache line at a time
|
|
on PPro, PII and PIII. This happens only when the following conditions are met:
|
|
<ul>
|
|
<li>both source and destination must be aligned by 8
|
|
<li>direction must be forward (direction flag cleared)
|
|
<li>the count (<kbd>ECX</kbd>) must be greater than or equal to 64
|
|
<li>the difference between <kbd>EDI</kbd> and <kbd>ESI</kbd> must be numerically greater than or equal to 32
|
|
<li>the memory type for both source and destination must be either writeback or
|
|
write-combining (you can normally assume this).
|
|
</ul><p>
|
|
Under these conditions the number of uops issued is approximately 215+2*<kbd>ECX</kbd> for
|
|
<kbd>REP MOVSD</kbd> and 185+1.5*<kbd>ECX</kbd> for <kbd>REP STOSD,</kbd>
|
|
giving a speed of approximately 5 bytes per
|
|
clock cycle for both instructions, which is almost 3 times as fast as when the above
|
|
conditions are not met.
|
|
<p>
|
|
The byte and word versions also benefit from this fast mode, but they are less effective than
|
|
the dword versions.
|
|
<p>
|
|
<kbd>REP STOSD</kbd> is optimal under the same conditions as <kbd>REP MOVSD</kbd>.
|
|
<p>
|
|
<kbd>REP LODS, REP SCAS,</kbd> and <kbd>REP CMPS</kbd> are not optimal, and
|
|
may be replaced by loops. See example <a href="#1-10">1.10</a>, <a href="#2-8">2.8</a>
|
|
and <a href="#2-9">2.9</a> for alternatives to <kbd>REPNE SCASB. REP CMPS</kbd>
|
|
may suffer cache bank conflicts if bits 2-4 are the same in <kbd>ESI</kbd> and
|
|
<kbd>EDI</kbd>.
|
|
<p>
|
|
<h3><a name="26_4">26.4 Bit test (all processors)</a></h3>
|
|
<kbd>BT, BTC, BTR</kbd>, and <kbd>BTS</kbd> instructions should preferably be replaced by instructions like
|
|
<kbd>TEST, AND, OR, XOR</kbd>, or shifts on PPlain and PMMX. On PPro, PII and PIII, bit tests with a
|
|
memory operand should be avoided.
|
|
<p>
|
|
<h3><a name="26_5">26.5 Integer multiplication (all processors)</a></h3>
|
|
An integer multiplication takes approximately 9 clock cycles on PPlain and PMMX and 4 on
|
|
PPro, PII and PIII. It is therefore often advantageous to replace a multiplication by a constant
|
|
with a combination of other instructions such as <kbd>SHL, ADD, SUB</kbd>,
|
|
and <kbd>LEA</kbd>. Example:<br>
|
|
<kbd>IMUL EAX,10</kbd><br>
|
|
can be replaced with<br>
|
|
<kbd>MOV EBX,EAX / ADD EAX,EAX / SHL EBX,3 / ADD EAX,EBX</kbd><br>
|
|
or<br>
|
|
<kbd>LEA EAX,[EAX+4*EAX] / ADD EAX,EAX</kbd>
|
|
<p>
|
|
Floating point multiplication is faster than integer multiplication on PPlain and PMMX, but
|
|
the time spent on converting integers to float and converting the product back again is
|
|
usually more than the time saved by using floating point multiplication, except when the
|
|
number of conversions is low compared with the number of multiplications. MMX
|
|
multiplication is fast, but is only available with 16-bit operands.
|
|
<p>
|
|
<h3><a name="26_6">26.6 WAIT instruction (all processors)</a></h3>
|
|
You can often increase speed by omitting the <kbd>WAIT</kbd> instruction.
|
|
The <kbd>WAIT</kbd> instruction has three functions:
|
|
<p>
|
|
<u>a.</u> The old 8087 processor requires a <kbd>WAIT</kbd> before every
|
|
floating point instruction to make sure the coprocessor is ready to receive it.
|
|
<p>
|
|
<u>b.</u> <kbd>WAIT</kbd> is used for coordinating memory access between the floating point unit and the
|
|
integer unit. Examples:
|
|
<pre><u>b.1.</u> FISTP [mem32]
|
|
WAIT ; wait for FPU to write before..
|
|
MOV EAX,[mem32] ; reading the result with the integer unit
|
|
|
|
<u>b.2.</u> FILD [mem32]
|
|
WAIT ; wait for FPU to read value..
|
|
MOV [mem32],EAX ; before overwriting it with integer unit
|
|
|
|
<u>b.3.</u> FLD QWORD PTR [ESP]
|
|
WAIT ; prevent an accidental interrupt from..
|
|
ADD ESP,8 ; overwriting value on stack</pre>
|
|
<p>
|
|
<u>c.</u> <kbd>WAIT</kbd> is sometimes used to check for exceptions. It will generate an interrupt if an
|
|
unmasked exception bit in the floating point status word has been set by a preceding
|
|
floating point instruction.
|
|
<p>
|
|
<u>Regarding a:</u><br>
|
|
The function in point a is never needed on any other processors than the old 8087. Unless
|
|
you want your code to be compatible with the 8087 you should tell your assembler not to
|
|
put in these <kbd>WAIT</kbd>'s by specifying a higher processor. An 8087 floating point emulator also
|
|
inserts <kbd>WAIT</kbd> instructions. You should therefore tell your assembler not to generate
|
|
emulation code unless you need it.
|
|
<p>
|
|
<u>Regarding b:</u><br>
|
|
<kbd>WAIT</kbd> instructions to coordinate memory access are definitely needed on the 8087 and
|
|
80287 but not on the Pentiums. It is not quite clear whether it is needed on the 80387 and
|
|
80486. I have made several tests on these Intel processors and not been able to provoke
|
|
any error by omitting the <kbd>WAIT</kbd> on any 32 bit Intel processor, although Intel manuals say that
|
|
the <kbd>WAIT</kbd> is needed for this purpose except after <kbd>FNSTSW</kbd>
|
|
and <kbd>FNSTCW</kbd>. Omitting <kbd>WAIT</kbd>
|
|
instructions for coordinating memory access is not 100 % safe, even when writing 32 bit
|
|
code, because the code may be able to run on the very rare combination of a 80386 main
|
|
processor with a 287 coprocessor, which requires the <kbd>WAIT</kbd>. Also, I have no information on
|
|
non-Intel processors, and I have not tested all possible hardware and software
|
|
combinations, so there may be other situations where the <kbd>WAIT</kbd> is needed.
|
|
<p>
|
|
If you want to be certain that your code will work on any 32 bit processor (including
|
|
non-Intel processors) then I would recommend that you include the <kbd>WAIT</kbd> here in order to be
|
|
safe.
|
|
<p>
|
|
<u>Regarding c:</u><br>
|
|
The assembler automatically inserts a <kbd>WAIT</kbd> for this purpose before the following
|
|
instructions: <kbd>FCLEX, FINIT, FSAVE, FSTCW, FSTENV, FSTSW</kbd>. You can omit the <kbd>WAIT</kbd>
|
|
by writing <kbd>FNCLEX</kbd>, etc. My tests show that the <kbd>WAIT</kbd> is unnecessary in most cases
|
|
because these instructions without <kbd>WAIT</kbd> will still generate an interrupt on exceptions except
|
|
for <kbd>FNCLEX</kbd> and <kbd>FNINIT</kbd> on the 80387. (There is some inconsistency about whether the
|
|
<kbd>IRET</kbd> from the interrupt points to the <kbd>FN..</kbd> instruction or to the next instruction).
|
|
<p>
|
|
Almost all other floating point instructions will also generate an interrupt if a previous floating
|
|
point instruction has set an unmasked exception bit, so the exception is likely to be detected
|
|
sooner or later anyway. You may insert a <kbd>WAIT</kbd> after the last floating point instruction in your
|
|
program to be sure to catch all exceptions.
|
|
<p>
|
|
You may still need the <kbd>WAIT</kbd> if you want to know exactly where an exception occurred in
|
|
order to be able to recover from the situation. Consider, for example, the code under b.3
|
|
above: If you want to be able to recover from an exception generated by the <kbd>FLD</kbd> here, then
|
|
you need the <kbd>WAIT</kbd> because an interrupt after <kbd>ADD ESP,8</kbd> would overwrite the value to load.
|
|
<kbd>FNOP</kbd> may be faster than <kbd>WAIT</kbd> and serve the same purpose.
|
|
<p>
|
|
<h3><a name="26_7">26.7 FCOM + FSTSW AX (all processors)</a></h3>
|
|
The <kbd>FNSTSW</kbd> instruction is very slow on all processors. The PPro, PII and PIII
|
|
processors have
|
|
<kbd>FCOMI</kbd> instructions to avoid the slow <kbd>FNSTSW</kbd>.
|
|
Using <kbd>FCOMI</kbd> instead of the common
|
|
sequence <kbd>FCOM / FNSTSW AX / SAHF</kbd> will save you 8 clock cycles. You should
|
|
therefore use <kbd>FCOMI</kbd> to avoid <kbd>FNSTSW</kbd> wherever possible, even in cases where it costs
|
|
some extra code.
|
|
<p>
|
|
On processors without <kbd>FCOMI</kbd> instructions, the usual way of doing floating point
|
|
comparisons is:
|
|
<pre> FLD [a]
|
|
FCOMP [b]
|
|
FSTSW AX
|
|
SAHF
|
|
JB ASmallerThanB</pre>
|
|
You may improve this code by using <kbd>FNSTSW AX</kbd> rather than
|
|
<kbd>FSTSW AX</kbd> and test <kbd>AH</kbd>
|
|
directly rather than using the non-pairable <kbd>SAHF</kbd>
|
|
(TASM version 3.0 has a bug with the <kbd>FNSTSW AX</kbd> instruction):
|
|
<pre> FLD [a]
|
|
FCOMP [b]
|
|
FNSTSW AX
|
|
SHR AH,1
|
|
JC ASmallerThanB</pre>
|
|
<p>
|
|
Testing for zero or equality:
|
|
<pre> FTST
|
|
FNSTSW AX
|
|
AND AH,40H
|
|
JNZ IsZero ; (the zero flag is inverted!)</pre>
|
|
<p>
|
|
Test if greater:
|
|
<pre> FLD [a]
|
|
FCOMP [b]
|
|
FNSTSW AX
|
|
AND AH,41H
|
|
JZ AGreaterThanB</pre>
|
|
<p>
|
|
Do not use <kbd>TEST AH,41H</kbd> as it is not pairable on PPlain and PMMX.
|
|
<p>
|
|
On the PPlain and PMMX, the <kbd>FNSTSW</kbd> instruction takes 2 clocks, but it is delayed for an
|
|
additional 4 clocks after any floating point instruction because it is waiting for the status
|
|
word to retire from the pipeline. This delay comes even after <kbd>FNOP</kbd>
|
|
which cannot change the status word, but not after integer instructions.
|
|
You can fill the latency between <kbd>FCOM</kbd> and
|
|
<kbd>FNSTSW</kbd> with integer instructions taking up to four clock cycles.
|
|
A paired <kbd>FXCH</kbd> immediately
|
|
after <kbd>FCOM</kbd> doesn't delay the <kbd>FNSTSW</kbd>, not even if the pairing is imperfect:
|
|
<pre> FCOM ; clock 1
|
|
FXCH ; clock 1-2 (imperfect pairing)
|
|
INC DWORD PTR [EBX] ; clock 3-5
|
|
FNSTSW AX ; clock 6-7</pre>
|
|
<p>
|
|
You may want to use <kbd>FCOM</kbd> rather than <kbd>FTST</kbd>
|
|
here because <kbd>FTST</kbd> is not pairable.
|
|
Remember to include the <kbd>N</kbd> in <kbd>FNSTSW</kbd>. <kbd>FSTSW</kbd>
|
|
(without <kbd>N</kbd>) has a <kbd>WAIT</kbd> prefix which delays
|
|
it further.
|
|
<p>
|
|
It is sometimes faster to use integer instructions for comparing floating point values, as
|
|
described in chapter <a href="#27_6">27.6</a>.
|
|
<p>
|
|
<h3><a name="26_8">26.8 FPREM (all processors)</a></h3>
|
|
The <kbd>FPREM</kbd> and <kbd>FPREM1</kbd> instructions are slow on all
|
|
processors. You may replace them with the following algorithm: multiply by
|
|
the reciprocal divisor, get the fractional part by subtracting
|
|
the truncated value, then multiply by the divisor.
|
|
(see chapter <a href="#27_5">27.5</a> on how to truncate)
|
|
<p>
|
|
Some documents say that these instructions may give incomplete reductions and
|
|
that it is therefore necessary to repeat the <kbd>FPREM</kbd> or
|
|
<kbd>FPREM1</kbd> instruction until the reduction is complete.
|
|
I have tested this on several processors beginning with the old 8087 and I have
|
|
found no situation where a repetition of the <kbd>FPREM</kbd> or <kbd>FPREM1</kbd>
|
|
was needed.
|
|
<p>
|
|
<h3><a name="26_9">26.9 FRNDINT (all processors)</a></h3>
|
|
This instruction is slow on all processors. Replace it by:
|
|
<pre>
|
|
FISTP QWORD PTR [TEMP]
|
|
FILD QWORD PTR [TEMP]</pre>
|
|
This code is faster despite a possible penalty for attempting to read from
|
|
<kbd>[TEMP]</kbd> before the write is finished. It is recommended to put
|
|
other instructions in between in order to avoid
|
|
this penalty. See chapter <a href="#27_5">27.5</a> on how to truncate.
|
|
<p>
|
|
<h3><a name="26_10">26.10 FSCALE and exponential function (all processors)</a></h3>
|
|
<kbd>FSCALE</kbd> is slow on all processors. Computing integer powers
|
|
of 2 can be done much faster by inserting the desired power in the exponent
|
|
field of the floating point number.
|
|
To calculate 2<sup>N</sup>, where N is a signed integer, select from the examples below the one that
|
|
fits your range of N:
|
|
<p>
|
|
For |N| < 2<sup>7</sup>-1 you can use single precision:
|
|
<pre> MOV EAX, [N]
|
|
SHL EAX, 23
|
|
ADD EAX, 3F800000H
|
|
MOV DWORD PTR [TEMP], EAX
|
|
FLD DWORD PTR [TEMP]</pre>
|
|
<p>
|
|
For |N| < 2<sup>10</sup>-1 you can use double precision:
|
|
<pre> MOV EAX, [N]
|
|
SHL EAX, 20
|
|
ADD EAX, 3FF00000H
|
|
MOV DWORD PTR [TEMP], 0
|
|
MOV DWORD PTR [TEMP+4], EAX
|
|
FLD QWORD PTR [TEMP]</pre>
|
|
<p>
|
|
For |N| < 2<sup>14</sup>-1 use long double precision:
|
|
<pre> MOV EAX, [N]
|
|
ADD EAX, 00003FFFH
|
|
MOV DWORD PTR [TEMP], 0
|
|
MOV DWORD PTR [TEMP+4], 80000000H
|
|
MOV DWORD PTR [TEMP+8], EAX
|
|
FLD TBYTE PTR [TEMP]</pre>
|
|
<p>
|
|
<kbd>FSCALE</kbd> is often used in the calculation of exponential functions. The following code shows
|
|
an exponential function without the slow <kbd>FRNDINT</kbd> and <kbd>FSCALE</kbd> instructions:
|
|
<p>
|
|
<pre>; extern "C" long double _cdecl exp (double x);
|
|
_exp PROC NEAR
|
|
PUBLIC _exp
|
|
FLDL2E
|
|
FLD QWORD PTR [ESP+4] ; x
|
|
FMUL ; z = x*log2(e)
|
|
FIST DWORD PTR [ESP+4] ; round(z)
|
|
SUB ESP, 12
|
|
MOV DWORD PTR [ESP], 0
|
|
MOV DWORD PTR [ESP+4], 80000000H
|
|
FISUB DWORD PTR [ESP+16] ; z - round(z)
|
|
MOV EAX, [ESP+16]
|
|
ADD EAX,3FFFH
|
|
MOV [ESP+8],EAX
|
|
JLE SHORT UNDERFLOW
|
|
CMP EAX,8000H
|
|
JGE SHORT OVERFLOW
|
|
F2XM1
|
|
FLD1
|
|
FADD ; 2^(z-round(z))
|
|
FLD TBYTE PTR [ESP] ; 2^(round(z))
|
|
ADD ESP,12
|
|
FMUL ; 2^z = e^x
|
|
RET
|
|
|
|
UNDERFLOW:
|
|
FSTP ST
|
|
FLDZ ; return 0
|
|
ADD ESP,12
|
|
RET
|
|
|
|
OVERFLOW:
|
|
PUSH 07F800000H ; +infinity
|
|
FSTP ST
|
|
FLD DWORD PTR [ESP] ; return infinity
|
|
ADD ESP,16
|
|
RET
|
|
|
|
_exp ENDP</pre>
|
|
<p>
|
|
<h3><a name="26_11">26.11 FPTAN (all processors)</a></h3>
According to the manuals, <kbd>FPTAN</kbd> returns two values X and Y and
leaves it to the programmer to divide Y by X to get the result, but in
fact it always returns 1 in X so you can save the division. My tests show that
on all 32 bit Intel processors with floating point unit or coprocessor,
<kbd>FPTAN</kbd> always returns 1 in X regardless of the argument. If you want to
be absolutely sure that your code will run correctly on all processors, then
you may test if X is 1, which is faster than dividing by X. The Y value may
be very high, but never infinity, so you don't have to test if Y contains a
valid number if you know that the argument is valid.
<p>
<h3><a name="26_12">26.12 FSQRT (PIII)</a></h3>
A fast way of calculating an approximate squareroot on the PIII is to multiply
the reciprocal squareroot of x by x:<br>
SQRT(x) = x * RSQRT(x)<br>
The instruction <kbd>RSQRTSS</kbd> or <kbd>RSQRTPS</kbd> gives the reciprocal
squareroot with a precision of 12 bits. You can improve the precision to 23 bits
by using the Newton-Raphson formula described in Intel's application note AP-803:<br>
x<sub>0</sub> = <kbd>RSQRTSS</kbd>(a)<br>
x<sub>1</sub> = 0.5 * x<sub>0</sub> * (3 - (a * x<sub>0</sub>) * x<sub>0</sub>)<br>
where x<sub>0</sub> is the first approximation to the reciprocal squareroot of
a, and x<sub>1</sub> is a better approximation. The order of evaluation is
important. You must use this formula before multiplying with a to get the squareroot.
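The refinement step can be tried out in plain C. This sketch stands in a deliberately coarse starting guess for the 12-bit hardware seed, so only the Newton-Raphson formula itself comes from the text; <kbd>rsqrt_refine</kbd> is an illustrative name:
<p>
<pre>#include &lt;assert.h&gt;
#include &lt;math.h&gt;

/* One Newton-Raphson step for the reciprocal square root of a:
   x1 = 0.5 * x0 * (3 - (a * x0) * x0), evaluated in this order. */
float rsqrt_refine(float a, float x0)
{
    return 0.5f * x0 * (3.0f - (a * x0) * x0);
}</pre>
Starting from x0 = 0.7 as a rough guess for 1/sqrt(2), one step brings the error from about 7e-3 down to about 1e-4.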
<p>
<h3><a name="26_13">26.13 MOV [MEM], ACCUM (PPlain and PMMX)</a></h3>
The instructions <kbd>MOV [mem],AL MOV [mem],AX MOV [mem],EAX</kbd>
are treated by the pairing circuitry as if they were writing to the accumulator.
Thus the following instructions do not pair:
<pre>        MOV [mydata], EAX
        MOV EBX, EAX</pre>
<p>
This problem occurs only with the short form of the <kbd>MOV</kbd>
instruction, which cannot have a base or index register, and which can only
have the accumulator as source. You can avoid the problem by using another
register, by reordering your instructions, by using a pointer, or by
hard-coding the general form of the <kbd>MOV</kbd> instruction.
<p>
In 32 bit mode you can write the general form of <kbd>MOV [mem],EAX</kbd>:
<pre>        DB 89H, 05H
        DD OFFSET DS:mem</pre>
<p>
In 16 bit mode you can write the general form of <kbd>MOV [mem],AX</kbd>:
<pre>        DB 89H, 06H
        DW OFFSET DS:mem</pre>
<p>
To use <kbd>AL</kbd> instead of <kbd>(E)AX</kbd>, you replace <kbd>89H</kbd>
with <kbd>88H</kbd>.
<p>
This flaw has not been fixed in the PMMX.
<p>
<h3><a name="26_14">26.14 TEST instruction (PPlain and PMMX)</a></h3>
The <kbd>TEST</kbd> instruction with an immediate operand is only
pairable if the destination is <kbd>AL</kbd>, <kbd>AX</kbd>, or <kbd>EAX</kbd>.
<p>
<kbd>TEST register,register</kbd>
and <kbd>TEST register,memory</kbd> are always pairable.
<p>
Examples:
<pre>        TEST ECX,ECX                ; pairable
        TEST [mem],EBX              ; pairable
        TEST EDX,256                ; not pairable
        TEST DWORD PTR [EBX],8000H  ; not pairable</pre><p>
To make it pairable, use any of the following methods:
<pre>        MOV EAX,[EBX]  / TEST EAX,8000H
        MOV EDX,[EBX]  / AND EDX,8000H
        MOV AL,[EBX+1] / TEST AL,80H
        MOV AL,[EBX+1] / TEST AL,AL   ; (result in sign flag)</pre>
(The reason for this non-pairability is probably that the first byte of the
2-byte instruction is the same as for some other non-pairable instructions,
and the processor cannot afford to check the second byte too when determining
pairability.)
<p>
<h3><a name="26_15">26.15 Bit scan (PPlain and PMMX)</a></h3>
<kbd>BSF</kbd> and <kbd>BSR</kbd> are the poorest optimized instructions on
the PPlain and PMMX, taking approximately 11 + 2*n clock cycles, where n is
the number of zeros skipped.
<p>
The following code emulates <kbd>BSR ECX,EAX</kbd>:
<pre>        TEST    EAX,EAX
        JZ      SHORT BS1
        MOV     DWORD PTR [TEMP],EAX
        MOV     DWORD PTR [TEMP+4],0
        FILD    QWORD PTR [TEMP]
        FSTP    QWORD PTR [TEMP]
        WAIT    ; WAIT only needed for compatibility with old 80287 processor
        MOV     ECX, DWORD PTR [TEMP+4]
        SHR     ECX,20          ; isolate exponent
        SUB     ECX,3FFH        ; adjust
        TEST    EAX,EAX         ; clear zero flag
BS1:</pre>
<p>
The following code emulates <kbd>BSF ECX,EAX</kbd>:
<pre>        TEST    EAX,EAX
        JZ      SHORT BS2
        XOR     ECX,ECX
        MOV     DWORD PTR [TEMP+4],ECX
        SUB     ECX,EAX
        AND     EAX,ECX
        MOV     DWORD PTR [TEMP],EAX
        FILD    QWORD PTR [TEMP]
        FSTP    QWORD PTR [TEMP]
        WAIT    ; WAIT only needed for compatibility with old 80287 processor
        MOV     ECX, DWORD PTR [TEMP+4]
        SHR     ECX,20
        SUB     ECX,3FFH
        TEST    EAX,EAX         ; clear zero flag
BS2:</pre>
<p>
These emulation codes should not be used on the PPro, PII and PIII, where the
bit scan instructions take only 1 or 2 clocks, and where the emulation codes
shown above have two partial memory stalls.
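The exponent trick behind the emulation can be reproduced in C, with the caveat that a compiler performs the int-to-double conversion in its own way; <kbd>bsr_emulate</kbd> is an illustrative name, not part of the original code:
<p>
<pre>#include &lt;stdint.h&gt;
#include &lt;string.h&gt;
#include &lt;assert.h&gt;

/* Emulate BSR: index of the highest set bit of a nonzero x, found by
   converting x to double (the FILD/FSTP pair above) and reading the
   exponent field of the result. */
int bsr_emulate(uint32_t x)
{
    double d = (double)x;              /* exact: any uint32 fits in a double */
    uint64_t bits;
    memcpy(&amp;bits, &amp;d, sizeof bits);
    return (int)(bits &gt;&gt; 52) - 0x3FF;  /* isolate exponent, subtract bias */
}</pre>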
<p>
<h3><a name="26_16">26.16 FLDCW (PPro, PII and PIII)</a></h3>
The PPro, PII and PIII have a serious stall after the
<kbd>FLDCW</kbd> instruction if followed by any floating
point instruction which reads the control word (which almost all floating
point instructions do).
<p>
When C or C++ code is compiled it often generates a lot of
<kbd>FLDCW</kbd> instructions because conversion of floating point numbers
to integers is done with truncation while other floating
point instructions use rounding. After translation to assembly, you can
improve this code by using rounding instead of truncation where possible,
or by moving the <kbd>FLDCW</kbd> out of a loop
where truncation is needed inside the loop.
<p>
See chapter <a href="#27_5">27.5</a> on how to convert floating point numbers
to integers without changing the control word.
<p>
<h2><a name="27">27</a>. Special topics</h2>
<h3><a name="27_1">27.1 LEA instruction (all processors)</a></h3>
The <kbd>LEA</kbd> instruction is useful for many purposes because it can do
a shift, two additions, and a move in just one instruction taking one clock cycle.
Example:<br>
<kbd>LEA EAX,[EBX+8*ECX-1000]</kbd><br>
is much faster than<br>
<kbd>MOV EAX,ECX / SHL EAX,3 / ADD EAX,EBX / SUB EAX,1000</kbd><br>
The <kbd>LEA</kbd> instruction can also be used to do an add or shift without
changing the flags. The source and destination need not have the same word size,
so <kbd>LEA EAX,[BX]</kbd> is a possible
replacement for <kbd>MOVZX EAX,BX</kbd>, although suboptimal on most processors.
<p>
You must be aware, however, that the <kbd>LEA</kbd> instruction will suffer
an AGI stall on the PPlain and PMMX if it uses a base or index register
which has been written to in the preceding clock cycle.
<p>
Since the <kbd>LEA</kbd> instruction is pairable in the v-pipe on PPlain and
PMMX and shift instructions are not, you may use <kbd>LEA</kbd> as
a substitute for a <kbd>SHL</kbd> by 1, 2, or 3 if you want the
instruction to execute in the v-pipe.
<p>
The 32 bit processors have no documented addressing mode with a scaled index register
and nothing else, so an instruction like <kbd>LEA EAX,[EAX*2]</kbd>
is actually coded as <kbd>LEA EAX,[EAX*2+00000000]</kbd>
with an immediate displacement of 4 bytes. You may reduce the
instruction size by instead writing <kbd>LEA EAX,[EAX+EAX]</kbd> or even
better <kbd>ADD EAX,EAX</kbd>.
The latter code cannot have an AGI delay in PPlain and PMMX. If you happen to have a register
which is zero (like a loop counter after a loop), then you may use it
as a base register to reduce the code size:
<p>
<pre>LEA EAX,[EBX*4]      ; 7 bytes
LEA EAX,[ECX+EBX*4]  ; 3 bytes</pre>
<p>
<h3><a name="27_2">27.2 Division (all processors)</a></h3>
Division is quite time consuming. On PPro, PII and PIII an integer division
takes 19, 23, or 39 clocks for byte, word, and dword divisors respectively.
On PPlain and PMMX an unsigned integer division takes approximately the same,
while a signed integer division takes somewhat more. It is therefore
preferable to use the smallest operand size possible that won't generate an
overflow, even if it costs an operand size prefix, and use unsigned
division if possible.
<p>
<h4>Integer division by a constant (all processors)</h4>
Integer division by a power of two can be done by shifting right. Dividing an
unsigned integer by 2<sup>N</sup>:
<pre>        SHR EAX, N</pre>
Dividing a signed integer by 2<sup>N</sup>:
<pre>        CDQ
        AND EDX, (1 SHL N) - 1   ; or SHR EDX, 32-N
        ADD EAX, EDX
        SAR EAX, N</pre>
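The signed sequence can be sanity-checked in C. This sketch assumes arithmetic right shifts of negative integers (true on mainstream compilers) and uses an invented name, <kbd>sdiv_pow2</kbd>:
<p>
<pre>#include &lt;stdint.h&gt;
#include &lt;assert.h&gt;

/* Signed division by 2^n with truncation toward zero: add 2^n - 1 to
   negative dividends before the arithmetic shift, which is exactly what
   the CDQ / AND / ADD / SAR sequence does. Assumes &gt;&gt; on a negative
   int is an arithmetic shift. */
int32_t sdiv_pow2(int32_t x, int n)
{
    int32_t bias = (int32_t)((uint32_t)(x &gt;&gt; 31) &amp; ((1u &lt;&lt; n) - 1)); /* CDQ + AND */
    return (x + bias) &gt;&gt; n;                                          /* ADD + SAR */
}</pre>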
The <kbd>SHR</kbd> alternative is shorter than the <kbd>AND</kbd> if N > 7,
but can only go to execution port 0 (or u-pipe), whereas <kbd>AND</kbd> can
go to either port 0 or 1 (u or v-pipe).
<p>
Dividing by a constant can be done by multiplying with the reciprocal.
To calculate the unsigned integer division q = x / d, you first calculate
the reciprocal of the divisor, f = 2<sup>r</sup> / d, where r defines the position of the binary
decimal point (radix point). Then multiply x with f and shift
right r positions. The maximum value of r is 32+b, where b is the number of binary digits in d
minus 1. (b is the highest integer for which 2<sup>b</sup> <= d). Use r = 32+b to cover the maximum
range for the value of the dividend x.
<p>
This method needs some refinement in order to compensate for rounding errors.
The following algorithm will give you the correct result for unsigned integer
division with truncation, i.e. the same result as the <kbd>DIV</kbd>
instruction gives (thanks to Terje Mathisen who invented this method):
<pre>
b = (the number of significant bits in d) - 1
r = 32 + b
f = 2<sup>r</sup> / d
If f is an integer then d is a power of 2: goto case A.
If f is not an integer, then check if the fractional part of f is < 0.5
If the fractional part of f < 0.5: goto case B.
If the fractional part of f > 0.5: goto case C.

case A: (d = 2<sup>b</sup>)
result = x SHR b

case B: (fractional part of f < 0.5)
round f down to nearest integer
result = ((x+1) * f) SHR r

case C: (fractional part of f > 0.5)
round f up to nearest integer
result = (x * f) SHR r
</pre>
<p>
Example:<br>
Assume that you want to divide by 5.<br>
5 = 00000101b.<br>
b = (number of significant binary digits) - 1 = 2<br>
r = 32+2 = 34<br>
f = 2<sup>34</sup> / 5 = 3435973836.8 = 0CCCCCCCC.CCC...(hexadecimal)<br>
The fractional part is greater than a half: use case C.<br>
Round f up to 0CCCCCCCDh.
<p>
The following code divides <kbd>EAX</kbd> by 5 and returns the result in <kbd>EDX</kbd>:
<pre>        MOV EDX,0CCCCCCCDh
        MUL EDX
        SHR EDX,2</pre>
<p>
After the multiplication, <kbd>EDX</kbd> contains the product shifted right
32 places. Since r = 34 you have to shift 2 more places to get the result.
To divide by 10 you just change the last line to <kbd>SHR EDX,3</kbd>.
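The constant can be cross-checked in C, with 64-bit arithmetic standing in for the 32x32 -> 64 bit <kbd>MUL</kbd>; <kbd>div5</kbd> is an illustrative name:
<p>
<pre>#include &lt;stdint.h&gt;
#include &lt;assert.h&gt;

/* Case C division by 5: multiply by f = 0CCCCCCCDh (2^34/5 rounded up),
   take the high dword of the 64-bit product (what MUL leaves in EDX),
   then shift the remaining r - 32 = 2 places. */
uint32_t div5(uint32_t x)
{
    uint64_t product = (uint64_t)x * 0xCCCCCCCDu;  /* MUL EDX */
    return (uint32_t)(product &gt;&gt; 34);              /* high dword, then SHR EDX,2 */
}</pre>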
<p>
In case B you would have:
<pre>        INC EAX
        MOV EDX,f
        MUL EDX
        SHR EDX,b</pre>
<p>
This code works for all values of x except 0FFFFFFFFH which gives zero because of
overflow in the <kbd>INC</kbd> instruction. If x = 0FFFFFFFFH is possible, then change the code to:
<pre>        MOV EDX,f
        ADD EAX,1
        JC DOVERFL
        MUL EDX
DOVERFL:SHR EDX,b</pre>
<p>
If the value of x is limited, then you may use a lower value of r, i.e.
fewer digits. There can be several reasons to use a lower value of r:
<ul>
<li>you may set r = 32 to avoid the <kbd>SHR EDX,b</kbd> in the end.
<li>you may set r = 16+b and use a multiplication instruction that
gives a 32 bit result rather
than 64 bits. This will free the <kbd>EDX</kbd> register:
<pre>        IMUL EAX,0CCCDh / SHR EAX,18</pre>
<li>you may choose a value of r that gives case C rather than case
B in order to avoid the <kbd>INC EAX</kbd> instruction
</ul>
<p>
The maximum value for x in these cases is at least 2<sup>r-b</sup>,
sometimes higher. You have to do a systematic test if you want to know the
exact maximum value of x for which your code works correctly.
<p>
You may want to replace the slow multiplication instruction with faster instructions as
explained in chapter <a href="#26_5">26.5</a>.
<p>
The following example divides <kbd>EAX</kbd> by 10 and returns the result
in <kbd>EAX</kbd>. I have chosen r=17 rather than 19 because it happens
to give a code, which is easier to optimize, and covers
the same range for x. f = 2<sup>17</sup> / 10 = 3333h, case B: q = (x+1)*3333h:
<pre>        LEA EBX,[EAX+2*EAX+3]
        LEA ECX,[EAX+2*EAX+3]
        SHL EBX,4
        MOV EAX,ECX
        SHL ECX,8
        ADD EAX,EBX
        SHL EBX,8
        ADD EAX,ECX
        ADD EAX,EBX
        SHR EAX,17</pre>
<p>
A systematic test shows that this code works correctly for all x < 10004H.
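The shift-and-add expansion can be verified in C; the factor 3333h is decomposed as 3 * (1 + 16 + 256 + 4096), matching the <kbd>LEA</kbd>/<kbd>SHL</kbd>/<kbd>ADD</kbd> sequence. <kbd>div10</kbd> is an illustrative name:
<p>
<pre>#include &lt;stdint.h&gt;
#include &lt;assert.h&gt;

/* Case B division by 10 with r = 17: q = ((x+1) * 3333h) &gt;&gt; 17,
   built from shifts and adds. 3333h = 3 * (1 + 16 + 256 + 4096). */
uint32_t div10(uint32_t x)
{
    uint32_t b = 3 * (x + 1);                          /* the LEAs: 3*(x+1) */
    uint32_t q = b + (b &lt;&lt; 4) + (b &lt;&lt; 8) + (b &lt;&lt; 12);  /* times 1111h */
    return q &gt;&gt; 17;
}</pre>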
<p>
<h4>Repeated integer division by the same value (all processors)</h4>
If the divisor is not known at assembly time, but you are dividing
repeatedly with the same divisor, then you may use the same method as above.
The code has to distinguish between
case A, B and C and calculate f before doing the divisions.
<p>
The code that follows shows how to do multiple divisions with the same divisor (unsigned
division with truncation). First call <kbd>SET_DIVISOR</kbd> to specify the
divisor and calculate the
reciprocal, then call <kbd>DIVIDE_FIXED</kbd> for each value to divide by the same divisor.
<pre>
.data

RECIPROCAL_DIVISOR DD ?   ; rounded reciprocal divisor
CORRECTION         DD ?   ; case A: -1, case B: 1, case C: 0
BSHIFT             DD ?   ; number of bits in divisor - 1

.code

SET_DIVISOR  PROC NEAR    ; divisor in EAX
        PUSH    EBX
        MOV     EBX,EAX
        BSR     ECX,EAX   ; b = number of bits in divisor - 1
        MOV     EDX,1
        JZ      ERROR     ; error: divisor is zero
        SHL     EDX,CL    ; 2^b
        MOV     [BSHIFT],ECX              ; save b
        CMP     EAX,EDX
        MOV     EAX,0
        JE      SHORT CASE_A              ; divisor is a power of 2
        DIV     EBX                       ; 2^(32+b) / d
        SHR     EBX,1                     ; divisor / 2
        XOR     ECX,ECX
        CMP     EDX,EBX                   ; compare remainder with divisor/2
        SETBE   CL                        ; 1 if case B
        MOV     [CORRECTION],ECX          ; correction for rounding errors
        XOR     ECX,1
        ADD     EAX,ECX                   ; add 1 if case C
        MOV     [RECIPROCAL_DIVISOR],EAX  ; rounded reciprocal divisor
        POP     EBX
        RET
CASE_A: MOV     [CORRECTION],-1           ; remember that we have case A
        POP     EBX
        RET
SET_DIVISOR  ENDP

DIVIDE_FIXED PROC NEAR    ; dividend in EAX, result in EAX
        MOV     EDX,[CORRECTION]
        MOV     ECX,[BSHIFT]
        TEST    EDX,EDX
        JS      SHORT DSHIFT              ; divisor is power of 2
        ADD     EAX,EDX                   ; correct for rounding error
        JC      SHORT DOVERFL             ; correct for overflow
        MUL     [RECIPROCAL_DIVISOR]      ; multiply with reciprocal divisor
        MOV     EAX,EDX
DSHIFT: SHR     EAX,CL                    ; adjust for number of bits
        RET
DOVERFL:MOV     EAX,[RECIPROCAL_DIVISOR]  ; dividend = 0FFFFFFFFH
        SHR     EAX,CL                    ; do division by shifting
        RET
DIVIDE_FIXED ENDP</pre>
This code gives the same result as the <kbd>DIV</kbd> instruction for
0 <= x < 2<sup>32</sup>, 0 < d < 2<sup>32</sup>.<br>
Note: The line <kbd>JC DOVERFL</kbd> and its target are not needed if
you are certain that x < 0FFFFFFFFH.
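A C transcription of the two routines can be tested against the <kbd>/</kbd> operator. It follows the algorithm rather than the register allocation, and the names <kbd>set_divisor</kbd> and <kbd>divide_fixed</kbd> simply mirror the assembly labels:
<p>
<pre>#include &lt;stdint.h&gt;
#include &lt;assert.h&gt;

static uint32_t reciprocal, correction, bshift;   /* mirrors the .data block */

/* SET_DIVISOR: precompute b, the rounded reciprocal f, and the case tag. */
void set_divisor(uint32_t d)
{
    uint32_t b = 31;
    while (!(d &gt;&gt; b)) b--;                    /* BSR: highest set bit */
    bshift = b;
    if (d == (1u &lt;&lt; b)) {                     /* case A: power of 2 */
        correction = (uint32_t)-1;
        return;
    }
    uint64_t n = (uint64_t)1 &lt;&lt; (32 + b);     /* 2^(32+b) */
    correction = (uint32_t)(n % d) &lt;= d / 2;  /* 1 if case B, 0 if case C */
    reciprocal = (uint32_t)(n / d) + (correction ^ 1);  /* round up in case C */
}

/* DIVIDE_FIXED: divide x by the divisor set above. */
uint32_t divide_fixed(uint32_t x)
{
    if (correction == (uint32_t)-1) return x &gt;&gt; bshift;  /* power of 2 */
    uint64_t sum = (uint64_t)x + correction;             /* ADD EAX,EDX */
    if (sum &gt;&gt; 32) return reciprocal &gt;&gt; bshift;          /* JC DOVERFL */
    return (uint32_t)(((uint32_t)sum * (uint64_t)reciprocal) &gt;&gt; 32) &gt;&gt; bshift;
}</pre>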
<p>
If powers of 2 occur so seldom that it is not worth optimizing for them,
then you may leave out the jump to <kbd>DSHIFT</kbd> and instead do a
multiplication with <kbd>CORRECTION</kbd> = 0 for case A.
<p>
If the divisor is changed so often that the procedure <kbd>SET_DIVISOR</kbd> needs optimizing, then you may
replace the <kbd>BSR</kbd> instruction with the code given in chapter
<a href="#26_15">26.15</a> for the PPlain and PMMX processors.
<p>
<h4>Floating point division (all processors)</h4>
Floating point division takes 38 or 39 clock cycles for the highest precision.
You can save time by specifying a lower precision in the floating point
control word (on PPlain and PMMX, only <kbd>FDIV</kbd> and <kbd>FIDIV</kbd> are faster at low
precision; on PPro, PII and PIII, this also applies
to <kbd>FSQRT</kbd>. No other instructions can be speeded up this way).
<p>
<h4><a name="paralleldiv">Parallel division (PPlain and PMMX)</a></h4>
On PPlain and PMMX, it is possible to do a floating point division and an integer division in
parallel to save time. On PPro, PII and PIII this is not possible, because integer division and
floating point division use the same circuitry.<br>
Example: A = A1 / A2; B = B1 / B2
<pre>        FILD    [B1]
        FILD    [B2]
        MOV     EAX, [A1]
        MOV     EBX, [A2]
        CDQ
        FDIV
        DIV     EBX
        FISTP   [B]
        MOV     [A], EAX</pre><p>
(make sure you set the floating point control word to the desired rounding method)
<p>
<h4>Using reciprocal instruction for fast division (PIII)</h4>
On PIII you can use the fast reciprocal instruction <kbd>RCPSS</kbd> or
<kbd>RCPPS</kbd> on the divisor and then multiply with the dividend. However,
the precision is only 12 bits. You can increase the precision to 23 bits by
using the Newton-Raphson method described in Intel's application note AP-803:<br>
x<sub>0</sub> = <kbd>RCPSS</kbd>(d)<br>
x<sub>1</sub> = x<sub>0</sub> * (2 - d * x<sub>0</sub>)
= 2*x<sub>0</sub> - d * x<sub>0</sub> * x<sub>0</sub><br>
where x<sub>0</sub> is the first approximation to the reciprocal of the divisor, d,
and x<sub>1</sub> is a better approximation. You must use this formula before
multiplying with the dividend:
<pre>        MOVAPS  XMM1, [DIVISORS]   ; load divisors
        RCPPS   XMM0, XMM1         ; approximate reciprocal
        MULPS   XMM1, XMM0         ; Newton-Raphson formula
        MULPS   XMM1, XMM0
        ADDPS   XMM0, XMM0
        SUBPS   XMM0, XMM1
        MULPS   XMM0, [DIVIDENDS]  ; results in XMM0</pre>
This makes four divisions in 18 clock cycles with a precision of 23 bits.
Increasing the precision further by repeating the Newton-Raphson formula
in the floating point registers is possible, but not very advantageous.
<p>
If you want to use this method for integer divisions then you have to check for
rounding errors. The following code makes four divisions with truncation on packed
word size integers in approximately 42 clock cycles. It gives exact results for
0 <= dividend < 7FFFFH and 0 < divisor <= 7FFFFH:
<pre>        MOVQ    MM1, [DIVISORS]    ; load four divisors
        MOVQ    MM2, [DIVIDENDS]   ; load four dividends
        PUNPCKHWD MM4, MM1         ; unpack divisors to DWORDs
        PSRAD   MM4, 16
        PUNPCKLWD MM3, MM1
        PSRAD   MM3, 16
        CVTPI2PS XMM1, MM4         ; convert divisors to float, upper two operands
        MOVLHPS XMM1, XMM1
        CVTPI2PS XMM1, MM3         ; convert lower two operands
        PUNPCKHWD MM4, MM2         ; unpack dividends to DWORDs
        PSRAD   MM4, 16
        PUNPCKLWD MM3, MM2
        PSRAD   MM3, 16
        CVTPI2PS XMM2, MM4         ; convert dividends to float, upper two operands
        MOVLHPS XMM2, XMM2
        CVTPI2PS XMM2, MM3         ; convert lower two operands
        RCPPS   XMM0, XMM1         ; approximate reciprocal of divisors
        MULPS   XMM1, XMM0         ; improve precision with Newton-Raphson method
        PCMPEQW MM4, MM4           ; make four integer 1's in the meantime
        PSRLW   MM4, 15
        MULPS   XMM1, XMM0
        ADDPS   XMM0, XMM0
        SUBPS   XMM0, XMM1         ; reciprocal divisors with 23 bit precision
        MULPS   XMM0, XMM2         ; multiply with dividends
        CVTTPS2PI MM0, XMM0        ; truncate lower two results
        MOVHLPS XMM0, XMM0
        CVTTPS2PI MM3, XMM0        ; truncate upper two results
        PACKSSDW MM0, MM3          ; pack the four results into MM0
        MOVQ    MM3, MM1           ; multiply results with divisors...
        PMULLW  MM3, MM0           ; to check for rounding errors
        PADDSW  MM0, MM4           ; add 1 to compensate for later subtraction
        PADDSW  MM3, MM1           ; add divisor. this should be > dividend
        PCMPGTW MM3, MM2           ; check if too small
        PADDSW  MM0, MM3           ; subtract 1 if not too small
        MOVQ    [QUOTIENTS], MM0   ; save the four results</pre>
This code checks if the result is too small and makes the appropriate
correction. It is not necessary to check if the result is too big.
<p>
<h4>Avoiding divisions (all processors)</h4>
Obviously, you should always try to minimize the number of divisions. Floating point division
with a constant or repeated division with the same value should of course be done by
multiplying with the reciprocal. But there are many other situations where you can reduce
the number of divisions. For example:
if (A/B > C)... can be rewritten as if (A > B*C)... when B is positive, and the opposite when
B is negative.
<p>
A/B + C/D can be rewritten as (A*D + C*B) / (B*D)
<p>
If you are using integer division, then you should be aware that the rounding errors may be
different when you rewrite the formulas.
<p>
<h3><a name="27_3">27.3 Freeing floating point registers (all processors)</a></h3>
You have to free all used floating point registers before exiting a subroutine,
except for any register used for the result.
<p>
The fastest way of freeing one register is <kbd>FSTP ST</kbd>.
The fastest way of freeing two registers is <kbd>FCOMPP</kbd> on PPlain and PMMX;
on PPro, PII and PIII you may use either
<kbd>FCOMPP</kbd> or two times <kbd>FSTP ST</kbd>, whichever fits best into
the decoding sequence.
<p>
It is not recommended to use <kbd>FFREE</kbd>.
<p>
<h3><a name="27_4">27.4 Transitions between floating point and MMX instructions (PMMX, PII and PIII)</a></h3>
You must issue an <kbd>EMMS</kbd> instruction after your last MMX instruction if there
is a possibility that floating point code follows later.
<p>
On PMMX there is a high penalty for switching between floating point and MMX
instructions. The first floating point instruction after an
<kbd>EMMS</kbd> takes approximately 58 clocks extra, and the first MMX instruction
after a floating point instruction takes approximately 38 clocks extra.
<p>
On PII and PIII there is no such penalty. The delay after <kbd>EMMS</kbd>
can be hidden by putting in integer
instructions between <kbd>EMMS</kbd> and the first floating point instruction.
<p>
<h3><a name="27_5">27.5 Converting from floating point to integer (All processors)</a></h3>
All conversions from floating point to integer, and vice versa, must go via a memory
location:
<pre>        FISTP   DWORD PTR [TEMP]
        MOV     EAX, [TEMP]</pre><p>
On PPro, PII and PIII, this code is likely to have a penalty for attempting to
read from <kbd>[TEMP]</kbd> before the write to <kbd>[TEMP]</kbd> is finished
because the <kbd>FIST</kbd> instruction is slow (see chapter <a href="#17">17</a>).
It doesn't help to put in a <kbd>WAIT</kbd> (see chapter <a href="#26_6">26.6</a>).
It is recommended that you put in other instructions between the write to
<kbd>[TEMP]</kbd> and the read from <kbd>[TEMP]</kbd> if possible in
order to avoid this penalty. This applies to all the examples that follow.
<p>
The specifications for the C and C++ languages require that conversion
from floating point
numbers to integers use truncation rather than rounding. The method used by most C
libraries is to change the floating point control word to indicate truncation before using an
<kbd>FISTP</kbd> instruction and changing it back again afterwards. This method is very slow on all
processors. On PPro, PII and PIII, the floating point control word cannot be renamed, so all
subsequent floating point instructions must wait for the <kbd>FLDCW</kbd> instruction to retire.
<p>
Whenever you have a conversion from floating point to integer in C or C++, you should
think of whether you can use rounding to nearest integer instead of truncation. If your
standard library doesn't have a fast round function then make your own using the code
examples listed below.
<p>
If you need truncation inside a loop then you should change the control word only outside
the loop if the rest of the floating point instructions in the loop can work correctly in
truncation mode.
<p>
You may use various tricks for truncating without changing the control word, as illustrated in
the examples below. These examples presume that the control word is set to default, i.e.
rounding to nearest or even.
<p>
<h4>Rounding to nearest or even</h4>
<pre>; extern "C" int round (double x);
_round  PROC    NEAR
        PUBLIC  _round
        FLD     QWORD PTR [ESP+4]
        FISTP   DWORD PTR [ESP+4]
        MOV     EAX, DWORD PTR [ESP+4]
        RET
_round  ENDP</pre>
<p>
<h4>Truncation towards zero</h4>
<pre>; extern "C" int truncate (double x);
_truncate PROC  NEAR
        PUBLIC  _truncate
        FLD     QWORD PTR [ESP+4]   ; x
        SUB     ESP, 12             ; space for local variables
        FIST    DWORD PTR [ESP]     ; rounded value
        FST     DWORD PTR [ESP+4]   ; float value
        FISUB   DWORD PTR [ESP]     ; subtract rounded value
        FSTP    DWORD PTR [ESP+8]   ; difference
        POP     EAX                 ; rounded value
        POP     ECX                 ; float value
        POP     EDX                 ; difference (float)
        TEST    ECX, ECX            ; test sign of x
        JS      SHORT NEGATIVE
        ADD     EDX, 7FFFFFFFH      ; produce carry if difference < -0
        SBB     EAX, 0              ; subtract 1 if x-round(x) < -0
        RET
NEGATIVE:
        XOR     ECX, ECX
        TEST    EDX, EDX
        SETG    CL                  ; 1 if difference > 0
        ADD     EAX, ECX            ; add 1 if x-round(x) > 0
        RET
_truncate ENDP</pre>
<p>
<h4>Truncation towards minus infinity</h4>
<pre>; extern "C" int ifloor (double x);
_ifloor PROC    NEAR
        PUBLIC  _ifloor
        FLD     QWORD PTR [ESP+4]   ; x
        SUB     ESP, 8              ; space for local variables
        FIST    DWORD PTR [ESP]     ; rounded value
        FISUB   DWORD PTR [ESP]     ; subtract rounded value
        FSTP    DWORD PTR [ESP+4]   ; difference
        POP     EAX                 ; rounded value
        POP     EDX                 ; difference (float)
        ADD     EDX, 7FFFFFFFH      ; produce carry if difference < -0
        SBB     EAX, 0              ; subtract 1 if x-round(x) < -0
        RET
_ifloor ENDP</pre>
<p>
These procedures work for -2<sup>31</sup> < x < 2<sup>31</sup>-1.
They do not check for overflow or NAN's.
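The ifloor logic can be expressed in C with <kbd>nearbyint</kbd>, which rounds in the current (default round-to-nearest-even) mode just as <kbd>FIST</kbd> does; a hedged sketch of the round-then-adjust idea, not the library's floor:
<p>
<pre>#include &lt;assert.h&gt;
#include &lt;math.h&gt;

/* Floor via round-to-nearest: round x, then subtract 1 when the
   difference x - round(x) is negative, as the ADD/SBB carry trick does. */
int ifloor(double x)
{
    int r = (int)nearbyint(x);         /* rounded value, default rounding mode */
    return r - (x - (double)r &lt; 0.0);  /* subtract 1 if x - round(x) &lt; 0 */
}</pre>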
<p>
The PIII has instructions for truncation of single precision floating point
numbers: <kbd>CVTTSS2SI</kbd> and <kbd>CVTTPS2PI</kbd>. These instructions
are very useful if the single precision is satisfactory, but if you are converting
a float with higher precision to single precision in order to use these truncation
instructions then you have the problem that the number may be rounded up in the
conversion to single precision.
<p>
<h4>Alternative to FISTP instruction (PPlain and PMMX)</h4>
<p>
Converting a floating point number to integer is normally done like this:
<pre>        FISTP   DWORD PTR [TEMP]
        MOV     EAX, [TEMP]</pre>
<p>
An alternative method is:
<pre>.DATA
ALIGN 8
TEMP    DQ      ?
MAGIC   DD      59C00000H   ; f.p. representation of 2^51 + 2^52

.CODE
        FADD    [MAGIC]
        FSTP    QWORD PTR [TEMP]
        MOV     EAX, DWORD PTR [TEMP]</pre>
<p>
Adding the 'magic number' of 2<sup>51</sup> + 2<sup>52</sup> has the effect
that any integer between -2<sup>31</sup> and +2<sup>31</sup>
will be aligned in the lower 32 bits when storing as a double precision floating
point number. The result is the same as you get with <kbd>FISTP</kbd> for all rounding methods except
truncation towards zero. The result is different from <kbd>FISTP</kbd> if the control word specifies
truncation or in case of overflow. You may need a <kbd>WAIT</kbd> instruction for
compatibility with the old 80287 processor, see chapter <a href="#26_6">26.6</a>.
<p>
This method is not faster than using <kbd>FISTP</kbd>, but it gives better
scheduling opportunities on
PPlain and PMMX because there is a 3 clock void between <kbd>FADD</kbd>
and <kbd>FSTP</kbd> which may be
filled with other instructions. You may multiply or divide the number by a
power of 2 in the same operation by doing the opposite to the magic number.
You may also add a constant by
adding it to the magic number, which then has to be double precision.
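The magic-number conversion can be demonstrated in C under the assumptions of little-endian IEEE doubles and the default rounding mode; <kbd>round_magic</kbd> is an invented name, and the volatile keeps the compiler from folding the addition at a different precision:
<p>
<pre>#include &lt;stdint.h&gt;
#include &lt;string.h&gt;
#include &lt;assert.h&gt;

/* Round a double to int32 by adding 2^51 + 2^52: the addition aligns the
   integer part of x in the lower 32 bits of the stored double. */
int32_t round_magic(double x)
{
    volatile double d = x + 6755399441055744.0;  /* 2^51 + 2^52 */
    double stored = d;                           /* FSTP QWORD PTR [TEMP] */
    uint64_t bits;
    memcpy(&amp;bits, &amp;stored, sizeof bits);
    return (int32_t)bits;                        /* MOV EAX, low dword */
}</pre>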
<p>
<h3><a name="27_6">27.6 Using integer instructions to do floating point operations (all processors)</a></h3>
Integer instructions are generally faster than floating point instructions, so it is often
advantageous to use integer instructions for doing simple floating point operations. The
most obvious example is moving data. Example:<br>
<kbd> FLD QWORD PTR [ESI] / FSTP QWORD PTR [EDI]</kbd><br>
Change to:<br>
<kbd> MOV EAX,[ESI] / MOV EBX,[ESI+4] / MOV [EDI],EAX / MOV [EDI+4],EBX</kbd><br>
<p>
<h4>Testing if a floating point value is zero:</h4>
The floating point value of zero is usually represented as 32 or 64 bits of zero, but there is a
pitfall here: the sign bit may be set! Minus zero is regarded as a valid floating point number,
and the processor may actually generate a zero with the sign bit set if for example
multiplying a negative number with zero. So if you want to test if a floating point number is
zero, you should not test the sign bit. Example:<br>
<kbd> FLD DWORD PTR [EBX] / FTST / FNSTSW AX / AND AH,40H / JNZ IsZero</kbd><br>
Use integer instructions instead, and shift out the sign bit:<br>
<kbd> MOV EAX,[EBX] / ADD EAX,EAX / JZ IsZero</kbd><br>
If the floating point number is double precision (QWORD) then you only have to
test bits 32-62. If they are zero, then the lower half will also be zero if it is a normal floating point number.
<p>
<h4>Testing if negative:</h4>
A floating point number is negative if the sign bit is set and at least one other bit is set.
Example:<br>
<kbd> MOV EAX,[NumberToTest] / CMP EAX,80000000H / JA IsNegative</kbd>
<p>
<h4>Manipulating the sign bit:</h4>
You can change the sign of a floating point number simply by flipping the
sign bit. Example:<br>
<kbd> XOR BYTE PTR [a] + (TYPE a) - 1, 80H</kbd>
<p>
Likewise you may get the absolute value of a floating point number by simply ANDing out
the sign bit.
<p>
<h4>Comparing numbers:</h4>
Floating point numbers are stored in a unique format which allows you to use integer
instructions for comparing floating point numbers, except for the sign bit. If you are certain
that two floating point numbers both are normal and positive then you may simply compare
them as integers. Example:<br>
<kbd> FLD [a] / FCOMP [b] / FNSTSW AX / AND AH,1 / JNZ ASmallerThanB</kbd><br>
Change to:<br>
<kbd> MOV EAX,[a] / MOV EBX,[b] / CMP EAX,EBX / JB ASmallerThanB</kbd><br>
This method only works if the two numbers have the same precision and you are certain
that none of the numbers have the sign bit set.
<p>
If negative numbers are possible, then you have to convert the negative numbers to
2-complement, and do a signed compare:
<pre>        MOV     EAX, [a]
        MOV     EBX, [b]
        MOV     ECX, EAX
        MOV     EDX, EBX
        SAR     ECX, 31         ; copy sign bit
        AND     EAX, 7FFFFFFFH  ; remove sign bit
        SAR     EDX, 31
        AND     EBX, 7FFFFFFFH
        XOR     EAX, ECX        ; make 2-complement if sign bit was set
        XOR     EBX, EDX
        SUB     EAX, ECX
        SUB     EBX, EDX
        CMP     EAX, EBX
        JL      ASmallerThanB   ; signed comparison</pre>
This method works for all normal floating point numbers, including -0.
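The same XOR/SUB mapping can be checked in C against ordinary float comparisons; this sketch assumes IEEE 754 single precision, and <kbd>float_key</kbd> is an invented name:
<p>
<pre>#include &lt;stdint.h&gt;
#include &lt;string.h&gt;
#include &lt;assert.h&gt;

/* Map the sign-magnitude bit pattern of a float to a two's complement
   integer whose ordering matches the float ordering (the XOR/SUB trick). */
int32_t float_key(float f)
{
    int32_t i;
    memcpy(&amp;i, &amp;f, sizeof i);
    int32_t sign = i &gt;&gt; 31;                   /* SAR ECX,31: copy sign bit */
    return ((i &amp; 0x7FFFFFFF) ^ sign) - sign;  /* negate magnitude if sign set */
}</pre>
a < b holds for two normal floats exactly when float_key(a) < float_key(b), and -0 maps to the same key as +0.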
|
|
<p>
<h3><a name="27_7">27.7 Using floating point instructions to do integer operations (PPlain and PMMX)</a></h3>
<h4>Integer multiplication (PPlain and PMMX)</h4>
Floating point multiplication is faster than integer multiplication on the PPlain and PMMX,
but the price for converting integer factors to float and converting the result back to integer
is high, so floating point multiplication is only advantageous if the number of conversions
needed is low compared with the number of multiplications. (It may be tempting to use
denormal floating point operands to save some of the conversions here, but the handling of
denormals is very slow, so this is not a good idea!)
<p>
On the PMMX, MMX multiplication instructions are faster than integer multiplication, and
can be pipelined to a throughput of one multiplication per clock cycle, so this may be the
best solution for doing fast multiplication on the PMMX, if you can live with 16 bit precision.
<p>
Integer multiplication is faster than floating point on PPro, PII and PIII.
<p>
<h4>Integer division (PPlain and PMMX)</h4>
Floating point division is not faster than integer division, but you can do other integer
operations (including integer division, but not integer multiplication) while the floating point
unit is working on the division (See example <a href="#paralleldiv">above</a>).
<p>
<h4>Converting binary to decimal numbers (all processors)</h4>
Using the <kbd>FBSTP</kbd> instruction is a simple and convenient way of converting a binary number
to decimal, although not necessarily the fastest method.
<p>
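For reference, <kbd>FBSTP</kbd> stores its result as packed BCD: two decimal digits per byte, least significant digits first. The sketch below (plain Python, illustrative only, not taken from the original text) builds the same packed-BCD encoding for a small non-negative integer:

```python
def to_packed_bcd(n, nbytes=9):
    # pack decimal digits of n two per byte, low digits first,
    # as FBSTP does for the 18 digits of its m80 result
    out = bytearray(nbytes)
    for i in range(nbytes):
        n, lo = divmod(n, 10)   # next decimal digit (low nibble)
        n, hi = divmod(n, 10)   # next decimal digit (high nibble)
        out[i] = (hi << 4) | lo
    return bytes(out)
```

For example, 1234 packs as the bytes 34H, 12H followed by zeros.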
<h3><a name="27_8">27.8 Moving blocks of data (all processors)</a></h3>
There are several ways of moving blocks of data. The most common method is
<kbd>REP MOVSD</kbd>, but under certain conditions other methods are faster.
<p>
On PPlain and PMMX it is faster to move 8 bytes at a time using floating
point registers if the destination is not in the cache:
<pre>TOP:    FILD   QWORD PTR [ESI]
        FILD   QWORD PTR [ESI+8]
        FXCH
        FISTP  QWORD PTR [EDI]
        FISTP  QWORD PTR [EDI+8]
        ADD    ESI, 16
        ADD    EDI, 16
        DEC    ECX
        JNZ    TOP</pre>
<p>
The source and destination should of course be aligned by 8. The extra time used by the
slow <kbd>FILD</kbd> and <kbd>FISTP</kbd> instructions is compensated for by the fact that you only have to do
half as many write operations. Note that this method is only advantageous on the PPlain
and PMMX and only if the destination is not in the level 1 cache. You cannot use
<kbd>FLD</kbd> and <kbd>FSTP</kbd> (without <kbd>I</kbd>) on arbitrary bit patterns because denormal numbers
are handled slowly and certain bit patterns are not preserved unchanged.
<p>
On the PMMX processor it is faster to use MMX instructions to move eight bytes
at a time if the destination is not in the cache:
<pre>TOP:    MOVQ   MM0,[ESI]
        MOVQ   [EDI],MM0
        ADD    ESI,8
        ADD    EDI,8
        DEC    ECX
        JNZ    TOP</pre>
<p>
There is no need to unroll this loop or optimize it further if cache misses are expected,
because memory access is the bottleneck here, not instruction execution.
<p>
On PPro, PII and PIII processors the <kbd>REP MOVSD</kbd> instruction is particularly
fast when the following conditions are met (see chapter <a href="#26_3">26.3</a>):
<ul>
<li>both source and destination must be aligned by 8
<li>direction must be forward (direction flag cleared)
<li>the count (<kbd>ECX</kbd>) must be greater than or equal to 64
<li>the difference between <kbd>EDI</kbd> and <kbd>ESI</kbd> must be numerically greater than or equal to 32
<li>the memory type for both source and destination must be either writeback or
write-combining (you can normally assume this).
</ul>
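A dispatcher that chooses between <kbd>REP MOVSD</kbd> and a fallback routine might test these conditions first. This sketch (plain Python, hypothetical function name; the memory-type condition is not modelled) encodes the checks listed above:

```python
def fast_rep_movsd_ok(src, dst, dword_count, direction_up=True):
    # conditions under which REP MOVSD runs in its fast mode
    # on PPro, PII and PIII (memory type assumed writeback)
    return (src % 8 == 0 and dst % 8 == 0   # both aligned by 8
            and direction_up                # direction flag cleared
            and dword_count >= 64           # ECX >= 64
            and abs(dst - src) >= 32)       # |EDI - ESI| >= 32
```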
<p>
On the PII it is faster to use MMX registers if the above conditions are not met
and the destination is likely to be in the level 1 cache. The loop may be unrolled
by two, and the source and destination should of course be aligned by 8.
<p>
On the PIII the fastest way of moving data is to use the <kbd>MOVAPS</kbd>
instruction if the above conditions are not met or if the destination is in
the level 1 or level 2 cache:
<pre>        SUB    EDI, ESI
TOP:    MOVAPS XMM0, [ESI]
        MOVAPS [ESI+EDI], XMM0
        ADD    ESI, 16
        DEC    ECX
        JNZ    TOP</pre>
Unlike <kbd>FLD</kbd>, <kbd>MOVAPS</kbd> can handle any bit pattern without
problems. Remember that source and destination must be aligned by 16.
<p>
If the number of bytes to move is not divisible by 16 then you may round up
to the nearest number divisible by 16 and put some extra space at the end of
the destination buffer to receive the superfluous bytes. If this is not possible
then you have to move the remaining bytes by other methods.
<p>
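Rounding a byte count up to the next multiple of 16 is a simple mask operation; the sketch below (plain Python) shows the usual formula:

```python
def round_up16(n):
    # round n up to the nearest multiple of 16:
    # add 15, then clear the low four bits
    return (n + 15) & ~15
```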
On the PIII you also have the option of writing directly to RAM memory without
involving the cache by using the <kbd>MOVNTQ</kbd> or <kbd>MOVNTPS</kbd>
instruction. This can be useful if you don't want the destination to go into
a cache. <kbd>MOVNTPS</kbd> is only slightly faster than <kbd>MOVNTQ</kbd>.
<p>
<h3><a name="27_9">27.9 Self-modifying code (All processors)</a></h3>
The penalty for executing a piece of code immediately after modifying it is approximately 19
clocks for PPlain, 31 for PMMX, and 150-300 for PPro, PII and PIII. The 80486 and earlier
processors require a jump between the modifying and the modified code in order to flush
the code cache.
<p>
To get permission to modify code in a protected operating system you need to call special
system functions: In 16-bit Windows call ChangeSelector, in 32-bit Windows call
VirtualProtect and FlushInstructionCache (or put the code in a data segment).
<p>
Self-modifying code is not considered good programming practice, but it may be justified if
the gain in speed is considerable.
<p>
<h3><a name="27_10">27.10 Detecting processor type (All processors)</a></h3>
I think it is fairly obvious by now that what is optimal for one microprocessor may not be
optimal for another. You may make the most critical parts of your program in different
versions, each optimized for a specific microprocessor, and select the desired version at
run time after detecting which microprocessor the program is running on. If you are using
instructions that are not supported by all microprocessors (e.g. conditional
moves, <kbd>FCOMI</kbd>, MMX and XMM instructions) then you must first check if the program is running on a microprocessor
that supports these instructions. The subroutine below checks the type of microprocessor
and the features supported.
<p>
<pre>; define CPUID instruction if not known by assembler:
CPUID   MACRO
        DB     0FH, 0A2H
        ENDM

; C++ prototype:
; extern "C" long int DetectProcessor (void);

; return value:
; bits 8-11 = family (5 for PPlain and PMMX, 6 for PPro, PII and PIII)
; bit 0 = floating point instructions supported
; bit 15 = conditional move and FCOMI instructions supported
; bit 23 = MMX instructions supported
; bit 25 = XMM instructions supported

_DetectProcessor PROC NEAR
        PUBLIC _DetectProcessor
        PUSH   EBX
        PUSH   ESI
        PUSH   EDI
        PUSH   EBP
; detect if CPUID instruction supported by microprocessor:
        PUSHFD
        POP    EAX
        MOV    EBX, EAX
        XOR    EAX, 1 SHL 21   ; check if CPUID bit can toggle
        PUSH   EAX
        POPFD
        PUSHFD
        POP    EAX
        XOR    EAX, EBX
        AND    EAX, 1 SHL 21
        JZ     SHORT DPEND     ; CPUID instruction not supported
        XOR    EAX, EAX
        CPUID                  ; get number of CPUID functions
        TEST   EAX, EAX
        JZ     SHORT DPEND     ; CPUID function 1 not supported
        MOV    EAX, 1
        CPUID                  ; get family and features
        AND    EAX, 000000F00H ; family
        AND    EDX, 0FFFFF0FFH ; features flags
        OR     EAX, EDX        ; combine bits
DPEND:  POP    EBP
        POP    EDI
        POP    ESI
        POP    EBX
        RET
_DetectProcessor ENDP</pre>
<p>
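The bit layout of the combined return value can be decoded as follows. This sketch (plain Python, hypothetical helper name) extracts the family number and the three feature flags from a value in the format described in the comments of the routine above:

```python
def decode_features(v):
    # decode the return value of _DetectProcessor:
    # bits 8-11 = family, bit 15 = CMOV/FCOMI, bit 23 = MMX, bit 25 = XMM
    return {
        'family': (v >> 8) & 0xF,
        'cmov':   bool(v & (1 << 15)),
        'mmx':    bool(v & (1 << 23)),
        'xmm':    bool(v & (1 << 25)),
    }
```

A PIII would typically return family 6 with all three feature bits set; a PPlain returns family 5 with none of them.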
Note that some operating systems do not allow XMM instructions.
Information on how to check for operating system support of XMM instructions can
be found in Intel's application note AP-900: "Identifying support for Streaming
SIMD Extensions in the Processor and Operating System".
More information on microprocessor identification can be found in Intel's
application note AP-485: "Intel Processor Identification and the CPUID Instruction".
<p>
To code the conditional move, MMX, XMM instructions etc. on an assembler that doesn't have
these instructions, use the macros at <a href="http://www.agner.org/assem/macros.zip">www.agner.org/assem/macros.zip</a>
<p>
<h2><a name="28">28</a>. List of instruction timings for PPlain and PMMX</h2>
<h3><a name="28_1">28.1 Integer instructions</a></h3>
<b>Explanations:</b><br>
<u>Operands:</u><br>
r = register, m = memory, i = immediate data, sr = segment register<br>
m32 = 32 bit memory operand, etc.
<p>
<u>Clock cycles:</u><br>
The numbers are minimum values. Cache misses, misalignment, and exceptions may
increase the clock counts considerably.
<p>
<u>Pairability:</u><br>
u = pairable in u-pipe, v = pairable in v-pipe, uv = pairable in either pipe,
np = not pairable.
<p>

<table border=1 cellpadding=4 cellspacing=1>
<tr><td class="a3"> Instruction </td>
<td class="a3"> Operands </td>
<td class="a3"> Clock cycles </td>
<td class="a3"> Pairability </td></tr>
<tr><td>NOP</td><td> </td><td>1</td><td>uv</td></tr>
<tr><td>MOV</td><td>r/m, r/m/i</td><td>1</td><td>uv</td></tr>
<tr><td>MOV</td><td>r/m, sr</td><td>1</td><td>np</td></tr>
<tr><td>MOV</td><td>sr , r/m</td><td>>= 2 b)</td><td>np</td></tr>
<tr><td>MOV</td><td>m , accum</td><td>1</td><td>uv h)</td></tr>
<tr><td>XCHG</td><td>(E)AX, r</td><td>2</td><td>np</td></tr>
<tr><td>XCHG</td><td>r , r</td><td>3</td><td>np</td></tr>
<tr><td>XCHG</td><td>r , m</td><td>>15</td><td>np</td></tr>
<tr><td>XLAT</td><td> </td><td>4</td><td>np</td></tr>
<tr><td>PUSH</td><td>r/i</td><td>1</td><td>uv</td></tr>
<tr><td>POP</td><td>r</td><td>1</td><td>uv</td></tr>
<tr><td>PUSH</td><td>m</td><td>2</td><td>np</td></tr>
<tr><td>POP</td><td>m</td><td>3</td><td>np</td></tr>
<tr><td>PUSH</td><td>sr</td><td>1 b)</td><td>np</td></tr>
<tr><td>POP</td><td>sr</td><td>>= 3 b)</td><td>np</td></tr>
<tr><td>PUSHF</td><td> </td><td>3-5</td><td>np</td></tr>
<tr><td>POPF</td><td> </td><td>4-6</td><td>np</td></tr>
<tr><td>PUSHA POPA</td><td> </td><td>5-9 i)</td><td>np</td></tr>
<tr><td>PUSHAD POPAD</td><td> </td><td>5</td><td>np</td></tr>
<tr><td>LAHF SAHF</td><td> </td><td>2</td><td>np</td></tr>
<tr><td>MOVSX MOVZX</td><td>r , r/m</td><td>3 a)</td><td>np</td></tr>
<tr><td>LEA</td><td>r , m</td><td>1</td><td>uv</td></tr>
<tr><td>LDS LES LFS LGS LSS</td><td>m</td><td>4 c)</td><td>np</td></tr>
<tr><td>ADD SUB AND OR XOR</td><td>r , r/i</td><td>1</td><td>uv</td></tr>
<tr><td>ADD SUB AND OR XOR</td><td>r , m</td><td>2</td><td>uv</td></tr>
<tr><td>ADD SUB AND OR XOR</td><td>m , r/i</td><td>3</td><td>uv</td></tr>
<tr><td>ADC SBB</td><td>r , r/i</td><td>1</td><td>u</td></tr>
<tr><td>ADC SBB</td><td>r , m</td><td>2</td><td>u</td></tr>
<tr><td>ADC SBB</td><td>m , r/i</td><td>3</td><td>u</td></tr>
<tr><td>CMP</td><td>r , r/i</td><td>1</td><td>uv</td></tr>
<tr><td>CMP</td><td>m , r/i</td><td>2</td><td>uv</td></tr>
<tr><td>TEST</td><td>r , r</td><td>1</td><td>uv</td></tr>
<tr><td>TEST</td><td>m , r</td><td>2</td><td>uv</td></tr>
<tr><td>TEST</td><td>r , i</td><td>1</td><td>f)</td></tr>
<tr><td>TEST</td><td>m , i</td><td>2</td><td>np</td></tr>
<tr><td>INC DEC</td><td>r</td><td>1</td><td>uv</td></tr>
<tr><td>INC DEC</td><td>m</td><td>3</td><td>uv</td></tr>
<tr><td>NEG NOT</td><td>r/m</td><td>1/3</td><td>np</td></tr>
<tr><td>MUL IMUL</td><td>r8/r16/m8/m16</td><td>11</td><td>np</td></tr>
<tr><td>MUL IMUL</td><td>all other versions</td><td>9 d)</td><td>np</td></tr>
<tr><td>DIV</td><td>r8/m8</td><td>17</td><td>np</td></tr>
<tr><td>DIV</td><td>r16/m16</td><td>25</td><td>np</td></tr>
<tr><td>DIV</td><td>r32/m32</td><td>41</td><td>np</td></tr>
<tr><td>IDIV</td><td>r8/m8</td><td>22</td><td>np</td></tr>
<tr><td>IDIV</td><td>r16/m16</td><td>30</td><td>np</td></tr>
<tr><td>IDIV</td><td>r32/m32</td><td>46</td><td>np</td></tr>
<tr><td>CBW CWDE</td><td> </td><td>3</td><td>np</td></tr>
<tr><td>CWD CDQ</td><td> </td><td>2</td><td>np</td></tr>
<tr><td>SHR SHL SAR SAL</td><td>r , i</td><td>1</td><td>u</td></tr>
<tr><td>SHR SHL SAR SAL</td><td>m , i</td><td>3</td><td>u</td></tr>
<tr><td>SHR SHL SAR SAL</td><td>r/m, CL</td><td>4/5</td><td>np</td></tr>
<tr><td>ROR ROL RCR RCL</td><td>r/m, 1</td><td>1/3</td><td>u</td></tr>
<tr><td>ROR ROL</td><td>r/m, i(&lt;&gt;1)</td><td>1/3</td><td>np</td></tr>
<tr><td>ROR ROL</td><td>r/m, CL</td><td>4/5</td><td>np</td></tr>
<tr><td>RCR RCL</td><td>r/m, i(&lt;&gt;1)</td><td>8/10</td><td>np</td></tr>
<tr><td>RCR RCL</td><td>r/m, CL</td><td>7/9</td><td>np</td></tr>
<tr><td>SHLD SHRD</td><td>r, i/CL</td><td>4 a)</td><td>np</td></tr>
<tr><td>SHLD SHRD</td><td>m, i/CL</td><td>5 a)</td><td>np</td></tr>
<tr><td>BT</td><td>r, r/i</td><td>4 a)</td><td>np</td></tr>
<tr><td>BT</td><td>m, i</td><td>4 a)</td><td>np</td></tr>
<tr><td>BT</td><td>m, r</td><td>9 a)</td><td>np</td></tr>
<tr><td>BTR BTS BTC</td><td>r, r/i</td><td>7 a)</td><td>np</td></tr>
<tr><td>BTR BTS BTC</td><td>m, i</td><td>8 a)</td><td>np</td></tr>
<tr><td>BTR BTS BTC</td><td>m, r</td><td>14 a)</td><td>np</td></tr>
<tr><td>BSF BSR</td><td>r , r/m</td><td>7-73 a)</td><td>np</td></tr>
<tr><td>SETcc</td><td>r/m</td><td>1/2 a)</td><td>np</td></tr>
<tr><td>JMP CALL</td><td>short/near</td><td>1 e)</td><td>v</td></tr>
<tr><td>JMP CALL</td><td>far</td><td>>= 3 e)</td><td>np</td></tr>
<tr><td>conditional jump</td><td>short/near</td><td>1/4/5/6 e)</td><td>v</td></tr>
<tr><td>CALL JMP</td><td>r/m</td><td>2/5 e)</td><td>np</td></tr>
<tr><td>RETN</td><td> </td><td>2/5 e)</td><td>np</td></tr>
<tr><td>RETN</td><td>i</td><td>3/6 e)</td><td>np</td></tr>
<tr><td>RETF</td><td> </td><td>4/7 e)</td><td>np</td></tr>
<tr><td>RETF</td><td>i</td><td>5/8 e)</td><td>np</td></tr>
<tr><td>J(E)CXZ</td><td>short</td><td>4-11 e)</td><td>np</td></tr>
<tr><td>LOOP</td><td>short</td><td>5-10 e)</td><td>np</td></tr>
<tr><td>BOUND</td><td>r , m</td><td>8</td><td>np</td></tr>
<tr><td>CLC STC CMC CLD STD</td><td> </td><td>2</td><td>np</td></tr>
<tr><td>CLI STI</td><td> </td><td>6-9</td><td>np</td></tr>
<tr><td>LODS</td><td> </td><td>2</td><td>np</td></tr>
<tr><td>REP LODS</td><td> </td><td>7+3*n g)</td><td>np</td></tr>
<tr><td>STOS</td><td> </td><td>3</td><td>np</td></tr>
<tr><td>REP STOS</td><td> </td><td>10+n g)</td><td>np</td></tr>
<tr><td>MOVS</td><td> </td><td>4</td><td>np</td></tr>
<tr><td>REP MOVS</td><td> </td><td>12+n g)</td><td>np</td></tr>
<tr><td>SCAS</td><td> </td><td>4</td><td>np</td></tr>
<tr><td>REP(N)E SCAS</td><td> </td><td>9+4*n g)</td><td>np</td></tr>
<tr><td>CMPS</td><td> </td><td>5</td><td>np</td></tr>
<tr><td>REP(N)E CMPS</td><td> </td><td>8+4*n g)</td><td>np</td></tr>
<tr><td>BSWAP</td><td> </td><td>1 a)</td><td>np</td></tr>
<tr><td>CPUID</td><td> </td><td>13-16 a)</td><td>np</td></tr>
<tr><td>RDTSC</td><td> </td><td>6-13 a) j)</td><td>np</td></tr>
</table>
<p>
<b>Notes:</b><br>
a) this instruction has a <kbd>0FH</kbd> prefix which takes one clock cycle extra to
decode on a PPlain unless preceded by a multicycle instruction (see
chapter <a href="#12">12</a>).<br>
b) versions with <kbd>FS</kbd> and <kbd>GS</kbd> have a <kbd>0FH</kbd>
prefix; see note a.<br>
c) versions with <kbd>SS, FS</kbd>, and <kbd>GS</kbd> have a <kbd>0FH</kbd> prefix; see note a.<br>
d) versions with two operands and no immediate have a <kbd>0FH</kbd> prefix; see note a.<br>
e) see chapter <a href="#22">22</a><br>
f) only pairable if register is accumulator; see chapter <a href="#26_14">26.14</a>.<br>
g) add one clock cycle for decoding the repeat prefix unless preceded by a
multicycle instruction (such as <kbd>CLD</kbd>; see chapter <a href="#12">12</a>).<br>
h) pairs as if it were writing to the accumulator; see chapter <a href="#26_14">26.14</a>.<br>
i) 9 if <kbd>SP</kbd> divisible by 4. <a href="#imperfectpush">See 10.2</a><br>
j) on PPlain: 6 in privileged or real mode, 11 in non-privileged mode, error in
virtual mode. On PMMX: 8 and 13 clocks respectively.<br>
<p>
<h3><a name="28_2">28.2 Floating point instructions</a></h3>
<p>
<b>Explanations:</b><br>
<u>Operands:</u><br>
r = register, m = memory, m32 = 32 bit memory operand, etc.
<p>
<u>Clock cycles:</u><br>
The numbers are minimum values. Cache misses, misalignment, denormal operands, and
exceptions may increase the clock counts considerably.
<p>
<u>Pairability:</u><br>
+ = pairable with <kbd>FXCH</kbd>, np = not pairable with <kbd>FXCH</kbd>.
<p>
<u>i-ov:</u><br>
Overlap with integer instructions. i-ov = 4 means that the last four clock cycles can overlap
with subsequent integer instructions.
<p>
<u>fp-ov:</u><br>
Overlap with floating point instructions. fp-ov = 2 means that the last two clock cycles can
overlap with subsequent floating point instructions.
(<kbd>WAIT</kbd> is considered a floating point instruction here)<p>

<table border=1 cellpadding=4 cellspacing=1>
<tr><td class="a3"> Instruction </td>
<td class="a3"> Operand </td>
<td class="a3"> Clock cycles </td>
<td class="a3"> Pairability </td>
<td class="a3"> i-ov </td>
<td class="a3"> fp-ov </td></tr>
<tr><td>FLD</td><td>r/m32/m64</td><td>1</td><td>+</td><td>0</td><td>0</td></tr>
<tr><td>FLD</td><td>m80</td><td>3</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FBLD</td><td>m80</td><td>48-58</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FST(P)</td><td>r</td><td>1</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FST(P)</td><td>m32/m64</td><td>2 m)</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FST(P)</td><td>m80</td><td>3 m)</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FBSTP</td><td>m80</td><td>148-154</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FILD</td><td>m</td><td>3</td><td>np</td><td>2</td><td>2</td></tr>
<tr><td>FIST(P)</td><td>m</td><td>6</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FLDZ FLD1</td><td> </td><td>2</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FLDPI FLDL2E etc.</td><td> </td><td>5 s)</td><td>np</td><td>2</td><td>2</td></tr>
<tr><td>FNSTSW</td><td>AX/m16</td><td>6 q)</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FLDCW</td><td>m16</td><td>8</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FNSTCW</td><td>m16</td><td>2</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FADD(P)</td><td>r/m</td><td>3</td><td>+</td><td>2</td><td>2</td></tr>
<tr><td>FSUB(R)(P)</td><td>r/m</td><td>3</td><td>+</td><td>2</td><td>2</td></tr>
<tr><td>FMUL(P)</td><td>r/m</td><td>3</td><td>+</td><td>2</td><td>2 n)</td></tr>
<tr><td>FDIV(R)(P)</td><td>r/m</td><td>19/33/39 p)</td><td>+</td><td>38 o)</td><td>2</td></tr>
<tr><td>FCHS FABS</td><td> </td><td>1</td><td>+</td><td>0</td><td>0</td></tr>
<tr><td>FCOM(P)(P) FUCOM</td><td>r/m</td><td>1</td><td>+</td><td>0</td><td>0</td></tr>
<tr><td>FIADD FISUB(R)</td><td>m</td><td>6</td><td>np</td><td>2</td><td>2</td></tr>
<tr><td>FIMUL</td><td>m</td><td>6</td><td>np</td><td>2</td><td>2</td></tr>
<tr><td>FIDIV(R)</td><td>m</td><td>22/36/42 p)</td><td>np</td><td>38 o)</td><td>2</td></tr>
<tr><td>FICOM</td><td>m</td><td>4</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FTST</td><td> </td><td>1</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FXAM</td><td> </td><td>17-21</td><td>np</td><td>4</td><td>0</td></tr>
<tr><td>FPREM</td><td> </td><td>16-64</td><td>np</td><td>2</td><td>2</td></tr>
<tr><td>FPREM1</td><td> </td><td>20-70</td><td>np</td><td>2</td><td>2</td></tr>
<tr><td>FRNDINT</td><td> </td><td>9-20</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FSCALE</td><td> </td><td>20-32</td><td>np</td><td>5</td><td>0</td></tr>
<tr><td>FXTRACT</td><td> </td><td>12-66</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FSQRT</td><td> </td><td>70</td><td>np</td><td>69 o)</td><td>2</td></tr>
<tr><td>FSIN FCOS</td><td> </td><td>65-100 r)</td><td>np</td><td>2</td><td>2</td></tr>
<tr><td>FSINCOS</td><td> </td><td>89-112 r)</td><td>np</td><td>2</td><td>2</td></tr>
<tr><td>F2XM1</td><td> </td><td>53-59 r)</td><td>np</td><td>2</td><td>2</td></tr>
<tr><td>FYL2X</td><td> </td><td>103 r)</td><td>np</td><td>2</td><td>2</td></tr>
<tr><td>FYL2XP1</td><td> </td><td>105 r)</td><td>np</td><td>2</td><td>2</td></tr>
<tr><td>FPTAN</td><td> </td><td>120-147 r)</td><td>np</td><td>36 o)</td><td>0</td></tr>
<tr><td>FPATAN</td><td> </td><td>112-134 r)</td><td>np</td><td>2</td><td>2</td></tr>
<tr><td>FNOP</td><td> </td><td>1</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FXCH</td><td>r</td><td>1</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FINCSTP FDECSTP</td><td> </td><td>2</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FFREE</td><td>r</td><td>2</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FNCLEX</td><td> </td><td>6-9</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FNINIT</td><td> </td><td>12-22</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FNSAVE</td><td>m</td><td>124-300</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FRSTOR</td><td>m</td><td>70-95</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>WAIT</td><td> </td><td>1</td><td>np</td><td>0</td><td>0</td></tr>
</table>
<p>
<b>Notes:</b><br>
m) The value to store is needed one clock cycle in advance.<br>
n) 1 if the overlapping instruction is also an <kbd>FMUL</kbd>.<br>
o) Cannot overlap integer multiplication instructions.<br>
p) <kbd>FDIV</kbd> takes 19, 33, or 39 clock cycles for 24, 53, and 64 bit precision
respectively. <kbd>FIDIV</kbd> takes 3 clocks more. The precision is defined by bits
8-9 of the floating point control word.<br>
q) The first 4 clock cycles can overlap with preceding integer instructions.
See chapter <a href="#26_7">26.7</a>.<br>
r) clock counts are typical. Trivial cases may be faster, extreme cases may
be slower.<br>
s) may be up to 3 clocks more when output needed for <kbd>FST</kbd>,
<kbd>FCHS</kbd>, or <kbd>FABS</kbd>.
<p>
<h3><a name="28_3">28.3 MMX instructions (PMMX)</a></h3>
<p>
A list of MMX instruction timings is not needed because they all take one clock cycle,
except the MMX multiply instructions which take 3. MMX multiply instructions can be
overlapped and pipelined to yield a throughput of one multiplication per clock cycle.
<p>
The <kbd>EMMS</kbd> instruction takes only one clock cycle, but the first floating point instruction after
an <kbd>EMMS</kbd> takes approximately 58 clocks extra, and the first MMX instruction after a floating
point instruction takes approximately 38 clocks extra. There is no penalty for an MMX
instruction after <kbd>EMMS</kbd> on the PMMX (but a possible small penalty on the PII and PIII).
<p>
There is no penalty for using a memory operand in an MMX instruction because the MMX
arithmetic unit is one step later in the pipeline than the load unit. But the penalty comes
when you store data from an MMX register to memory or to a 32 bit register: The data have
to be ready one clock cycle in advance. This is analogous to the floating point store
instructions.
<p>
All MMX instructions except <kbd>EMMS</kbd> are pairable in either pipe. Pairing rules for MMX
instructions are described in chapter <a href="#10">10</a>.
<p>
<h2><a name="29">29</a>. List of instruction timings and micro-op breakdown for PPro, PII and PIII</h2>
<b>Explanations:</b><br>
<u>Operands:</u><br>
r = register, m = memory, i = immediate data, sr = segment register,
m32 = 32 bit memory operand, etc.
<p>
<u>Micro-ops:</u><br>
The number of micro-ops that the instruction generates for each execution port.<br>
p0: port 0: ALU, etc.<br>
p1: port 1: ALU, jumps<br>
p01: instructions that can go to either port 0 or 1, whichever is vacant first.<br>
p2: port 2: load data, etc.<br>
p3: port 3: address generation for store<br>
p4: port 4: store data
<p>
<u>Delay:</u><br>
This is the delay that the instruction generates in a dependency chain.
(This is not the same as the time spent in the execution unit. Values may be
inaccurate in situations where they cannot be measured exactly, especially with
memory operands).
The numbers are minimum values. Cache misses, misalignment, and exceptions
may increase the clock counts considerably. Floating point operands are
presumed to be normal numbers. Denormal numbers, NANs and infinity increase
the delays by 50-150 clocks, except in XMM move, shuffle and boolean instructions.
Floating point overflow, underflow, denormal or NAN results give a similar delay.
<p>
<u>Throughput:</u><br>
The maximum throughput for several instructions of the same kind. For example, a
throughput of 1/2 for <kbd>FMUL</kbd> means that a new <kbd>FMUL</kbd> instruction can start executing
every 2 clock cycles.<p>

<table border="1" cellpadding="4" cellspacing="1">
<tr>
<td colspan="10" class="a2"><a name="29_1">29.1 Integer instructions</a></td>
</tr>
<tr>
<td class="a3">Instruction</td>
<td class="a3">Operands</td>
<td colspan="6" class="a3">micro-ops</td>
<td class="a3">delay</td>
<td class="a3">throughput</td>
</tr>
<tr><td> </td><td> </td><td class="a3">p0</td><td class="a3">p1</td>
<td class="a3">p01</td><td class="a3">p2</td><td class="a3">p3</td><td class="a3">p4</td>
<td> </td><td> </td></tr>
<tr><td>NOP</td><td> </td><td> </td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>MOV</td><td>r,r/i</td><td> </td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>MOV</td><td>r,m</td><td> </td><td> </td><td> </td>
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>MOV</td><td>m,r/i</td><td> </td><td> </td><td> </td>
<td> </td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>MOV</td><td>r,sr</td><td> </td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>MOV</td><td>m,sr</td><td> </td><td> </td><td>1</td>
<td> </td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>MOV</td><td>sr,r</td><td colspan="3">8</td>
<td> </td><td> </td><td> </td><td>5</td><td> </td></tr>
<tr><td>MOV</td><td>sr,m</td><td colspan="3">7</td>
<td>1</td><td> </td><td> </td><td>8</td><td> </td></tr>
<tr><td>MOVSX MOVZX</td><td>r,r</td><td> </td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>MOVSX MOVZX</td><td>r,m</td><td> </td><td> </td><td> </td>
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>CMOVcc</td><td>r,r</td><td>1</td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>CMOVcc</td><td>r,m</td><td>1</td><td> </td><td>1</td>
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>XCHG</td><td>r,r</td><td> </td><td> </td><td>3</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>XCHG</td><td>r,m</td><td> </td><td> </td><td>4</td>
<td>1</td><td>1</td><td>1</td><td>high b)</td><td> </td></tr>
<tr><td>XLAT</td><td> </td><td> </td><td> </td><td>1</td>
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>PUSH</td><td>r/i</td><td> </td><td> </td><td>1</td>
<td> </td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>POP</td><td>r</td><td> </td><td> </td><td>1</td>
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>POP</td><td>(E)SP</td><td> </td><td> </td><td>2</td>
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>PUSH</td><td>m</td><td> </td><td> </td><td>1</td>
<td>1</td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>POP</td><td>m</td><td> </td><td> </td><td>5</td>
<td>1</td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>PUSH</td><td>sr</td><td> </td><td> </td><td>2</td>
<td> </td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>POP</td><td>sr</td><td> </td><td> </td><td>8</td>
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>PUSHF(D)</td><td> </td><td>3</td><td> </td><td>11</td>
<td> </td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>POPF(D)</td><td> </td><td>10</td><td> </td><td>6</td>
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>PUSHA(D)</td><td> </td><td> </td><td> </td><td>2</td>
<td> </td><td>8</td><td>8</td><td> </td><td> </td></tr>
<tr><td>POPA(D)</td><td> </td><td> </td><td> </td><td>2</td>
<td>8</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>LAHF SAHF</td><td> </td><td> </td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>LEA</td><td>r,m</td><td>1</td><td> </td><td> </td>
<td> </td><td> </td><td> </td><td>1 c)</td><td> </td></tr>
<tr><td>LDS LES LFS LGS LSS</td><td>m</td><td> </td><td> </td><td>8</td>
<td>3</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>ADD SUB AND OR XOR</td><td>r,r/i</td><td> </td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>ADD SUB AND OR XOR</td><td>r,m</td><td> </td><td> </td><td>1</td>
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>ADD SUB AND OR XOR</td><td>m,r/i</td><td> </td><td> </td><td>1</td>
<td>1</td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>ADC SBB</td><td>r,r/i</td><td> </td><td> </td><td>2</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>ADC SBB</td><td>r,m</td><td> </td><td> </td><td>2</td>
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>ADC SBB</td><td>m,r/i</td><td> </td><td> </td><td>3</td>
<td>1</td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>CMP TEST</td><td>r,r/i</td><td> </td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>CMP TEST</td><td>m,r/i</td><td> </td><td> </td><td>1</td>
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>INC DEC NEG NOT</td><td>r</td><td> </td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>INC DEC NEG NOT</td><td>m</td><td> </td><td> </td><td>1</td>
<td>1</td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>AAS DAA DAS</td><td> </td><td> </td><td>1</td><td> </td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>AAD</td><td> </td><td>1</td><td> </td><td>2</td>
<td> </td><td> </td><td> </td><td>4</td><td> </td></tr>
<tr><td>AAM</td><td> </td><td>1</td><td>1</td><td>2</td>
<td> </td><td> </td><td> </td><td>15</td><td> </td></tr>
<tr><td>MUL IMUL</td><td>r,(r),(i)</td><td>1</td><td> </td><td> </td>
<td> </td><td> </td><td> </td><td>4</td><td>1/1</td></tr>
<tr><td>MUL IMUL</td><td>(r),m</td><td>1</td><td> </td><td> </td>
<td>1</td><td> </td><td> </td><td>4</td><td>1/1</td></tr>
<tr><td>DIV IDIV</td><td>r8</td><td>2</td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td>19</td><td>1/12</td></tr>
<tr><td>DIV IDIV</td><td>r16</td><td>3</td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td>23</td><td>1/21</td></tr>
<tr><td>DIV IDIV</td><td>r32</td><td>3</td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td>39</td><td>1/37</td></tr>
<tr><td>DIV IDIV</td><td>m8</td><td>2</td><td> </td><td>1</td>
<td>1</td><td> </td><td> </td><td>19</td><td>1/12</td></tr>
<tr><td>DIV IDIV</td><td>m16</td><td>2</td><td> </td><td>1</td>
<td>1</td><td> </td><td> </td><td>23</td><td>1/21</td></tr>
<tr><td>DIV IDIV</td><td>m32</td><td>2</td><td> </td><td>1</td>
<td>1</td><td> </td><td> </td><td>39</td><td>1/37</td></tr>
<tr><td>CBW CWDE</td><td> </td><td> </td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>CWD CDQ</td><td> </td><td>1</td><td> </td><td> </td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>SHR SHL SAR ROR ROL</td><td>r,i/CL</td><td>1</td><td> </td><td> </td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>SHR SHL SAR ROR ROL</td><td>m,i/CL</td><td>1</td><td> </td><td> </td>
<td>1</td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>RCR RCL</td><td>r,1</td><td>1</td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>RCR RCL</td><td>r8,i/CL</td><td>4</td><td> </td><td>4</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>RCR RCL</td><td>r16/32,i/CL</td><td>3</td><td> </td><td>3</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>RCR RCL</td><td>m,1</td><td>1</td><td> </td><td>2</td>
<td>1</td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>RCR RCL</td><td>m8,i/CL</td><td>4</td><td> </td><td>3</td>
<td>1</td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>RCR RCL</td><td>m16/32,i/CL</td><td>4</td><td> </td><td>2</td>
<td>1</td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>SHLD SHRD</td><td>r,r,i/CL</td><td>2</td><td> </td><td> </td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>SHLD SHRD</td><td>m,r,i/CL</td><td>2</td><td> </td><td>1</td>
|
|
<td>1</td><td>1</td><td>1</td><td> </td><td> </td></tr>
|
|
<tr><td>BT</td><td>r,r/i</td><td> </td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>BT</td><td>m,r/i</td><td>1</td><td> </td><td>6</td>
|
|
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>BTR BTS BTC</td><td>r,r/i</td><td> </td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>BTR BTS BTC</td><td>m,r/i</td><td>1</td><td> </td><td>6</td>
|
|
<td>1</td><td>1</td><td>1</td><td> </td><td> </td></tr>
|
|
<tr><td>BSF BSR</td><td>r,r</td><td> </td><td>1</td><td>1</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>BSF BSR</td><td>r,m</td><td> </td><td>1</td><td>1</td>
|
|
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>SETcc</td><td>r</td><td> </td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>SETcc</td><td>m</td><td> </td><td> </td><td>1</td>
|
|
<td> </td><td>1</td><td>1</td><td> </td><td> </td></tr>
|
|
<tr><td>JMP</td><td>short/near</td><td> </td><td>1</td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td>1/2</td></tr>
|
|
<tr><td>JMP</td><td>far</td><td colspan="3">21</td>
|
|
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>JMP</td><td>r</td><td> </td><td>1</td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td>1/2</td></tr>
|
|
<tr><td>JMP</td><td>m(near)</td><td> </td><td>1</td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td> </td><td>1/2</td></tr>
|
|
<tr><td>JMP</td><td>m(far)</td><td colspan="3">21</td>
|
|
<td>2</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>conditional jump</td><td>short/near</td><td> </td><td>1</td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td>1/2</td></tr>
|
|
<tr><td>CALL</td><td>near</td><td> </td><td>1</td><td>1</td>
|
|
<td> </td><td>1</td><td>1</td><td> </td><td>1/2</td></tr>
|
|
<tr><td>CALL</td><td>far</td><td colspan="3">28</td>
|
|
<td>1</td><td>2</td><td>2</td><td> </td><td> </td></tr>
|
|
<tr><td>CALL</td><td>r</td><td> </td><td>1</td><td>2</td>
|
|
<td> </td><td>1</td><td>1</td><td> </td><td>1/2</td></tr>
|
|
<tr><td>CALL</td><td>m(near)</td><td> </td><td>1</td><td>4</td>
|
|
<td>1</td><td>1</td><td>1</td><td> </td><td>1/2</td></tr>
|
|
<tr><td>CALL</td><td>m (far)</td><td colspan="3">28</td>
|
|
<td>2</td><td>2</td><td>2</td><td> </td><td> </td></tr>
|
|
<tr><td>RETN</td><td> </td><td> </td><td>1</td><td>2</td>
|
|
<td>1</td><td> </td><td> </td><td> </td><td>1/2</td></tr>
|
|
<tr><td>RETN</td><td>i</td><td> </td><td>1</td><td>3</td>
|
|
<td>1</td><td> </td><td> </td><td> </td><td>1/2</td></tr>
|
|
<tr><td>RETF</td><td> </td><td colspan="3">23</td>
|
|
<td>3</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>RETF</td><td>i</td><td colspan="3">23</td><td>3</td>
|
|
<td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>J(E)CXZ</td><td>short</td><td> </td><td>1</td><td>1</td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>LOOP</td><td>short</td><td>2</td><td>1</td><td>8</td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>LOOP(N)E</td><td>short</td><td>2</td><td>1</td><td>8</td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>ENTER</td><td>i,0</td><td> </td><td> </td><td>12</td><td> </td>
|
|
<td>1</td><td>1</td><td> </td><td> </td></tr>
|
|
<tr><td>ENTER</td><td>a,b</td><td colspan="3">ca. 18+4b</td><td> </td>
|
|
<td>b-1</td><td>2b</td><td> </td><td> </td></tr>
|
|
<tr><td>LEAVE</td><td> </td><td> </td><td> </td><td>2</td>
|
|
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>BOUND</td><td>r,m</td><td>7</td><td> </td><td>6</td>
|
|
<td>2</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>CLC STC CMC</td><td> </td><td> </td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>CLD STD</td><td> </td><td> </td><td> </td><td>4</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>CLI</td><td> </td><td colspan="3">9</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>STI</td><td> </td><td colspan="3">17</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>INTO</td><td> </td><td> </td><td> </td><td>5</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>LODS</td><td> </td><td> </td><td> </td><td> </td>
|
|
<td>2</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>REP LODS</td><td> </td><td> </td><td> </td><td colspan="2">10+6n</td>
|
|
<td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>STOS</td><td> </td><td> </td><td> </td><td> </td>
|
|
<td>1</td><td>1</td><td>1</td><td> </td><td> </td></tr>
|
|
<tr><td>REP STOS</td><td> </td><td> </td><td> </td>
|
|
<td colspan="4">ca. 5n a)</td><td> </td><td> </td></tr>
|
|
<tr><td>MOVS</td><td> </td><td> </td><td> </td><td>1</td><td>3</td>
|
|
<td>1</td><td>1</td><td> </td><td> </td></tr>
|
|
<tr><td>REP MOVS</td><td> </td><td> </td><td> </td><td colspan="4">ca. 6n a)</td>
|
|
<td> </td><td> </td></tr>
|
|
<tr><td>SCAS</td><td> </td><td> </td><td> </td><td>1</td>
|
|
<td>2</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>REP(N)E SCAS</td><td> </td><td> </td><td> </td><td colspan="2">12+7n</td>
|
|
<td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>CMPS</td><td> </td><td> </td><td> </td><td>4</td>
|
|
<td>2</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>REP(N)E CMPS</td><td> </td><td> </td><td> </td>
|
|
<td colspan="2">12+9n</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>BSWAP</td><td> </td><td>1</td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>CPUID</td><td> </td><td colspan="3">23-48</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td></tr>
|
|
<tr><td>RDTSC</td><td> </td><td colspan="3">13</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>IN</td><td> </td><td colspan="3">18</td>
|
|
<td> </td><td> </td><td> </td><td>>300</td><td> </td></tr>
|
|
<tr><td>OUT</td><td> </td><td colspan="3">18</td>
|
|
<td> </td><td> </td><td> </td><td>>300</td><td> </td></tr>
|
|
<tr><td>PREFETCHNTA d)</td><td>m</td><td> </td>
|
|
<td> </td><td> </td><td> 1</td><td> </td><td>
|
|
</td><td> </td><td> </td></tr>
|
|
<tr><td>PREFETCHT0 d)</td><td>m</td><td> </td><td> </td><td> </td><td> 1</td><td> </td>
|
|
<td> </td><td> </td><td> </td></tr>
|
|
<tr><td>PREFETCHT1 d)</td><td>m</td><td> </td><td> </td>
|
|
<td> </td><td> 1</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>PREFETCHT2 d)</td><td>m</td><td> </td><td> </td><td> </td><td> 1</td><td> </td><td>
|
|
</td><td> </td><td> </td></tr>
|
|
<tr><td>SFENCE d)</td><td> </td><td> </td><td> </td><td> </td><td> </td><td> 1</td><td>
|
|
1</td><td> </td><td>1/6</td></tr>
|
|
</table><p>

<b>Notes:</b><br>
a) faster under certain conditions: see chapter <a href="#26_3">26.3</a>.<br>
b) see chapter <a href="#26_1">26.1</a>.<br>
c) 3 if constant without base or index register.<br>
d) PIII only.
<p>

<table border="1" cellpadding="4" cellspacing="1">
|
|
<tr>
|
|
<td colspan="10" class="a2"><a name="29_2">29.2 Floating point instructions</a></td>
|
|
</tr>
|
|
<tr>
|
|
<td class="a3">Instruction</td>
|
|
<td class="a3">Operands</td>
|
|
<td colspan="6" align="center" class="a3">micro-ops</td>
|
|
<td class="a3">delay</td>
|
|
<td class="a3">throughput</td>
|
|
</tr>
|
|
<tr>
|
|
<td> </td>
|
|
<td> </td>
|
|
<td class="a4">p0</td>
|
|
<td class="a4">p1</td>
|
|
<td class="a4">p01</td>
|
|
<td class="a4">p2</td>
|
|
<td class="a4">p3</td>
|
|
<td class="a4">p4</td>
|
|
<td> </td>
|
|
<td> </td>
|
|
</tr>
|
|
<tr><td>FLD</td><td>r</td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FLD</td><td>m32/64</td><td> </td><td> </td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td>1</td><td> </td></tr>
|
|
<tr><td>FLD</td><td>m80</td><td>2</td><td> </td><td> </td>
|
|
<td>2</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FBLD</td><td>m80</td><td>38</td><td> </td><td> </td>
|
|
<td>2</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FST(P)</td><td>r</td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FST(P)</td><td>m32/m64</td><td> </td><td> </td><td> </td>
|
|
<td> </td><td>1</td><td>1</td><td>1</td><td> </td></tr>
|
|
<tr><td>FSTP</td><td>m80</td><td>2</td><td> </td><td> </td>
|
|
<td> </td><td>2</td><td>2</td><td> </td><td> </td></tr>
|
|
<tr><td>FBSTP</td><td>m80</td><td>165</td><td> </td><td> </td>
|
|
<td> </td><td>2</td><td>2</td><td> </td><td> </td></tr>
|
|
<tr><td>FXCH</td><td>r</td><td> </td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>0</td><td>3/1 f)</td></tr>
|
|
<tr><td>FILD</td><td>m</td><td>3</td><td> </td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td>5</td><td> </td></tr>
|
|
<tr><td>FIST(P)</td><td>m</td><td>2</td><td> </td><td> </td>
|
|
<td> </td><td>1</td><td>1</td><td>5</td><td> </td></tr>
|
|
<tr><td>FLDZ</td><td> </td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td colspan="2">FLD1 FLDPI FLDL2E etc.</td><td>2</td><td> </td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FCMOVcc</td><td>r</td><td>2</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>2</td><td> </td></tr>
|
|
<tr><td>FNSTSW</td><td>AX</td><td>3</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>7</td><td> </td></tr>
|
|
<tr><td>FNSTSW</td><td>m16</td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td>1</td><td>1</td><td> </td><td> </td></tr>
|
|
<tr><td>FLDCW</td><td>m16</td><td>1</td><td> </td><td>1</td>
|
|
<td>1</td><td> </td><td> </td><td>10</td><td> </td></tr>
|
|
<tr><td>FNSTCW</td><td>m16</td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td>1</td><td>1</td><td> </td><td> </td></tr>
|
|
<tr><td>FADD(P) FSUB(R)(P)</td><td>r</td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>3</td><td>1/1</td></tr>
|
|
<tr><td>FADD(P) FSUB(R)(P)</td><td>m</td><td>1</td><td> </td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td>3-4</td><td>1/1</td></tr>
|
|
<tr><td>FMUL(P)</td><td>r</td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>5</td><td>1/2 g)</td></tr>
|
|
<tr><td>FMUL(P)</td><td>m</td><td>1</td><td> </td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td>5-6</td><td>1/2 g)</td></tr>
|
|
<tr><td>FDIV(R)(P)</td><td>r</td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>38 h)</td><td>1/37</td></tr>
|
|
<tr><td>FDIV(R)(P)</td><td>m</td><td>1</td><td> </td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td>38 h)</td><td>1/37</td></tr>
|
|
<tr><td>FABS</td><td> </td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FCHS</td><td> </td><td>3</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>2</td><td> </td></tr>
|
|
<tr><td>FCOM(P) FUCOM</td><td>r</td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>1</td><td> </td></tr>
|
|
<tr><td>FCOM(P) FUCOM</td><td>m</td><td>1</td><td> </td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td>1</td><td> </td></tr>
|
|
<tr><td>FCOMPP FUCOMPP</td><td> </td><td>1</td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td> </td><td>1</td><td> </td></tr>
|
|
<tr><td>FCOMI(P) FUCOMI(P)</td><td>r</td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>1</td><td> </td></tr>
|
|
<tr><td>FCOMI(P) FUCOMI(P)</td><td>m</td><td>1</td><td> </td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td>1</td><td> </td></tr>
|
|
<tr><td>FIADD FISUB(R)</td><td>m</td><td>6</td><td> </td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FIMUL</td><td>m</td><td>6</td><td> </td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FIDIV(R)</td><td>m</td><td>6</td><td> </td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FICOM(P)</td><td>m</td><td>6</td><td> </td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FTST</td><td> </td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>1</td><td> </td></tr>
|
|
<tr><td>FXAM</td><td> </td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>2</td><td> </td></tr>
|
|
<tr><td>FPREM</td><td> </td><td>23</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FPREM1</td><td> </td><td>33</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FRNDINT</td><td> </td><td>30</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FSCALE</td><td> </td><td>56</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FXTRACT</td><td> </td><td>15</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FSQRT</td><td> </td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>69</td><td>e,i)</td></tr>
|
|
<tr><td>FSIN FCOS</td><td> </td><td colspan="3">17-97</td><td> </td><td> </td>
|
|
<td> </td><td>27-103</td><td>e)</td></tr>
|
|
<tr><td>FSINCOS</td><td> </td><td colspan="3">18-110</td><td> </td><td> </td>
|
|
<td> </td><td>29-130</td><td>e)</td></tr>
|
|
<tr><td>F2XM1</td><td> </td><td colspan="3">17-48</td><td> </td><td> </td>
|
|
<td> </td><td>66</td><td>e)</td></tr>
|
|
<tr><td>FYL2X</td><td> </td><td colspan="3">36-54</td><td> </td>
|
|
<td> </td><td> </td><td>103</td><td>e)</td></tr>
|
|
<tr><td>FYL2XP1</td><td> </td><td colspan="3">31-53</td><td> </td>
|
|
<td> </td><td> </td><td>98-107</td><td>e)</td></tr>
|
|
<tr><td>FPTAN</td><td> </td><td colspan="3">21-102</td><td> </td>
|
|
<td> </td><td> </td><td>13-143</td><td>e)</td></tr>
|
|
<tr><td>FPATAN</td><td> </td><td colspan="3">25-86</td><td> </td>
|
|
<td> </td><td> </td><td>44-143</td><td>e)</td></tr>
|
|
<tr><td>FNOP</td><td> </td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FINCSTP FDECSTP</td><td> </td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FFREE</td><td>r</td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FFREEP</td><td>r</td><td>2</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FNCLEX</td><td> </td><td> </td><td> </td><td>3</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FNINIT</td><td> </td><td colspan="3">13</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FNSAVE</td><td> </td><td colspan="3">141</td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FRSTOR</td><td> </td><td colspan="3">72</td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>WAIT</td><td> </td><td> </td><td> </td><td>2</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
</table>

<b>Notes:</b><br>
e) not pipelined<br>
f) <kbd>FXCH</kbd> generates 1 micro-op that is resolved by register renaming without
going to any port.<br>
g) <kbd>FMUL</kbd> uses the same circuitry as integer multiplication. Therefore, the
combined throughput of mixed floating point and integer multiplications is
1 <kbd>FMUL</kbd> + 1 <kbd>IMUL</kbd> per 3 clock cycles.<br>
h) <kbd>FDIV</kbd> delay depends on the precision specified in the control word:
64-bit precision gives delay 38, 53-bit precision gives delay 32, and 24-bit
precision gives delay 18. Division by a power of 2 takes 9 clocks.
Throughput is 1/(delay-1).<br>
i) faster for lower precision.
<p>

<table border=1 cellpadding=4 cellspacing=1><tr>
|
|
<td colspan="10" class="a2"><a name="29_3">29.3 MMX instructions (PII and PIII)</a></td></tr>
|
|
<tr><td class="a3">Instruction</td>
|
|
<td class="a3">Operands</td>
|
|
<td colspan="6" align="center" class="a3">micro-ops</td>
|
|
<td class="a3">delay</td>
|
|
<td class="a3">throughput</td></tr>
|
|
<tr><td> </td><td> </td>
|
|
<td class="a4">p0</td><td class="a4">p1</td><td class="a4">p01</td>
|
|
<td class="a4">p2</td><td class="a4">p3</td><td class="a4">p4</td>
|
|
<td> </td><td> </td></tr>
|
|
<tr><td>MOVD MOVQ</td><td>r,r</td>
|
|
<td> </td><td> </td><td>1</td><td> </td><td>
|
|
</td><td> </td><td> </td><td>2/1</td></tr>
|
|
<tr><td>MOVD MOVQ</td><td>r64,m32/64</td>
|
|
<td> </td><td> </td><td> </td><td>1</td><td>
|
|
</td><td> </td><td> </td><td>1/1</td></tr>
|
|
<tr><td>MOVD MOVQ</td><td>m32/64,r64</td><td>
|
|
</td><td> </td><td> </td><td> </td><td>1</td><td>
|
|
1</td><td> </td><td>1/1</td></tr>
|
|
<tr><td>PADD PSUB PCMP</td><td>r64,r64</td>
|
|
<td> </td><td> </td><td>1</td><td> </td><td>
|
|
</td><td> </td><td> </td><td>1/1</td></tr>
|
|
<tr><td>PADD PSUB PCMP</td><td>r64,m64</td>
|
|
<td> </td><td> </td><td>1</td><td>1</td><td>
|
|
</td><td> </td><td> </td><td>1/1</td></tr>
|
|
<tr><td>PMUL PMADD</td><td>r64,r64</td>
|
|
<td>1</td><td> </td><td> </td><td> </td><td>
|
|
</td><td> </td><td>3</td><td>1/1</td></tr>
|
|
<tr><td>PMUL PMADD</td><td>r64,m64</td>
|
|
<td>1</td><td> </td><td> </td><td>1</td><td>
|
|
</td><td> </td><td>3</td><td>1/1</td></tr>
|
|
<tr><td>PAND PANDN POR <br>PXOR</td>
|
|
<td>r64,r64</td><td> </td><td> </td><td>1</td><td>
|
|
</td><td> </td><td> </td><td> </td><td>2/1</td></tr>
|
|
<tr><td>PAND PANDN POR<br>PXOR</td><td>r64,m64</td>
|
|
<td> </td><td> </td><td>1</td><td>1</td><td>
|
|
</td><td> </td><td> </td><td>1/1</td></tr>
|
|
<tr><td>PSRA PSRL PSLL</td><td>r64,r64/i</td>
|
|
<td> </td><td>1</td><td> </td><td> </td><td>
|
|
</td><td> </td><td> </td><td>1/1</td></tr>
|
|
<tr><td>PSRA PSRL PSLL</td><td>r64,m64</td>
|
|
<td> </td><td>1</td><td> </td><td>1</td><td>
|
|
</td><td> </td><td> </td><td>1/1</td></tr>
|
|
<tr><td>PACK PUNPCK</td><td>r64,r64</td>
|
|
<td> </td><td>1</td><td> </td><td> </td><td>
|
|
</td><td> </td><td> </td><td>1/1</td></tr>
|
|
<tr><td>PACK PUNPCK</td><td>r64,m64</td>
|
|
<td> </td><td>1</td><td> </td><td>1</td><td>
|
|
</td><td> </td><td> </td><td>1/1</td></tr>
|
|
<tr><td>EMMS</td><td> </td><td colspan="3">11</td><td> </td><td>
|
|
</td><td> </td><td>6 k)</td><td> </td></tr>
|
|
<tr><td>MASKMOVQ d)</td><td>r64,r64</td><td> </td><td> </td><td> 1</td><td> </td>
|
|
<td> 1</td><td> 1</td><td>2-8</td><td>1/30-1/2</td></tr>
|
|
<tr><td>PMOVMSKB d)</td><td>r32,r64</td><td> </td><td> 1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> 1</td><td> 1/1</td></tr>
|
|
<tr><td>MOVNTQ d)</td><td>m64,r64</td><td> </td><td> </td><td> </td><td> </td>
|
|
<td> 1</td><td> 1</td><td> </td><td>1/30-1/1</td></tr>
|
|
<tr><td>PSHUFW d)</td><td>r64,r64,i</td><td> </td><td> 1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> 1</td><td> 1/1</td></tr>
|
|
<tr><td>PSHUFW d)</td><td>r64,m64,i</td><td> </td><td> 1</td><td> </td><td> 1</td>
|
|
<td> </td><td> </td><td> 2</td><td> 1/1</td></tr>
|
|
<tr><td>PEXTRW d)</td><td>r32,r64,i</td><td> </td><td> 1</td><td> 1</td><td> </td>
|
|
<td> </td><td> </td><td> 2</td><td> 1/1</td></tr>
|
|
<tr><td>PINSRW d)</td><td>r64,r32,i</td><td> </td><td> 1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> 1</td><td> 1/1</td></tr>
|
|
<tr><td>PINSRW d)</td><td>r64,m16,i</td><td> </td><td> 1</td><td> </td><td> 1</td>
|
|
<td> </td><td> </td><td> 2</td><td> 1/1</td></tr>
|
|
<tr><td>PAVGB PAVGW d)</td><td>r64,r64</td><td> </td><td> </td><td> 1</td><td> </td>
|
|
<td> </td><td> </td><td> 1</td><td> 2/1</td></tr>
|
|
<tr><td>PAVGB PAVGW d)</td><td>r64,m64</td><td> </td><td> </td><td> 1</td><td> 1</td>
|
|
<td> </td><td> </td><td> 2</td><td> 1/1</td></tr>
|
|
<tr><td>PMINUB PMAXUB PMINSW PMAXSW d)</td><td>r64,r64</td><td> </td><td> </td><td> 1</td><td> </td>
|
|
<td> </td><td> </td><td> 1</td><td> 2/1</td></tr>
|
|
<tr><td>PMINUB PMAXUB PMINSW PMAXSW d)</td><td>r64,m64</td><td> </td><td> </td><td> 1</td><td> 1</td>
|
|
<td> </td><td> </td><td> 2</td><td> 1/1</td></tr>
|
|
<tr><td>PMULHUW d)</td><td>r64,r64</td><td> 1</td><td> </td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> 3</td><td> 1/1</td></tr>
|
|
<tr><td>PMULHUW d)</td><td>r64,m64</td><td> 1</td><td> </td><td> </td><td> 1</td>
|
|
<td> </td><td> </td><td> 4</td><td> 1/1</td></tr>
|
|
<tr><td>PSADBW d)</td><td>r64,r64</td><td> 2</td><td> </td><td> 1</td><td> </td>
|
|
<td> </td><td> </td><td> 5</td><td> 1/2</td></tr>
|
|
<tr><td>PSADBW d)</td><td>r64,m64</td><td> 2</td><td> </td><td> 1</td><td> 1</td>
|
|
<td> </td><td> </td><td> 6</td><td> 1/2</td></tr>
|
|
</table>

<b>Notes:</b><br>
d) PIII only.<br>
k) you may hide the delay by inserting other instructions between <kbd>EMMS</kbd> and any
subsequent floating point instruction.
<p>

<table border="1" cellpadding="4" cellspacing="1"><tr>
|
|
<td colspan="10" class="a2"><a name="29_4">29.4 XMM instructions (PIII)</a></td></tr>
|
|
<tr><td class="a3">Instruction</td>
|
|
<td class="a3">Operands</td>
|
|
<td colspan="6" align="center" class="a3">micro-ops</td>
|
|
<td class="a3">delay</td>
|
|
<td class="a3">throughput</td></tr>
|
|
<tr><td> </td><td> </td>
|
|
<td class="a4"> p0 </td><td class="a4"> p1 </td>
|
|
<td class="a4"> p01 </td><td class="a4"> p2 </td>
|
|
<td class="a4"> p3 </td><td class="a4"> p4 </td>
|
|
<td> </td><td> </td></tr>
|
|
<tr><td>MOVAPS</td><td>r128,r128</td><td> </td><td> </td><td>2</td><td> </td><td>
|
|
</td><td> </td><td>1</td><td>1/1</td></tr>
|
|
<tr><td>MOVAPS</td><td>r128,m128</td>
|
|
<td> </td><td> </td><td> </td><td>2</td><td>
|
|
</td><td> </td><td>2</td><td>1/2</td></tr>
|
|
<tr><td>MOVAPS</td><td>m128,r128</td><td> </td><td> </td><td> </td><td> </td>
|
|
<td>2</td><td>2</td><td>3</td><td>1/2</td></tr>
|
|
<tr><td>MOVUPS</td><td>r128,m128</td><td> </td><td> </td><td> </td><td>4</td>
|
|
<td> </td><td> </td><td>2</td><td>1/4</td></tr>
|
|
<tr><td>MOVUPS</td><td>m128,r128</td><td> </td><td>1</td><td> </td><td> </td><td>4</td>
|
|
<td>4</td><td>3</td><td>1/4</td></tr>
|
|
<tr><td>MOVSS</td><td>r128,r128</td><td> </td><td> </td><td>1</td><td> </td>
|
|
<td> </td><td> </td><td>1</td><td>1/1</td></tr>
|
|
<tr><td>MOVSS</td><td>r128,m32</td><td> </td><td> </td><td>1</td><td>1</td>
|
|
<td> </td><td> </td><td>1</td><td>1/1</td></tr>
|
|
<tr><td>MOVSS</td><td>m32,r128</td><td> </td><td> </td><td> </td><td> </td><td>1</td>
|
|
<td>1</td><td>1</td><td>1/1</td></tr>
|
|
<tr><td>MOVHPS MOVLPS</td><td>r128,m64</td><td> </td><td> </td><td>1</td><td> </td>
|
|
<td> </td><td> </td><td>1</td><td>1/1</td></tr>
|
|
<tr><td>MOVHPS MOVLPS</td><td>m64,r128</td><td> </td><td> </td><td> </td><td> </td>
|
|
<td>1</td><td>1</td><td>1</td><td>1/1</td></tr>
|
|
<tr><td>MOVLHPS MOVHLPS</td><td>r128,r128</td><td> </td><td> </td><td>1</td><td> </td>
|
|
<td> </td><td> </td><td>1</td><td>1/1</td></tr>
|
|
<tr><td>MOVMSKPS</td><td>r32,r128</td><td>1</td><td> </td><td> </td><td> </td>
|
|
<td> </td><td> </td><td>1</td><td>1/1</td></tr>
|
|
<tr><td>MOVNTPS</td><td>m128,r128</td><td> </td><td> </td><td> </td>
|
|
<td> </td><td> 2</td><td> 2</td><td> </td><td>1/15-1/2</td></tr>
|
|
<tr><td>CVTPI2PS</td><td>r128,r64</td><td> </td><td>2</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td>3</td><td>1/1</td></tr>
|
|
<tr><td>CVTPI2PS</td><td>r128,m64</td><td> </td><td>2</td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td>4</td><td>1/2</td></tr>
|
|
<tr><td>CVTPS2PI CVTTPS2PI</td><td>r64,r128</td><td> </td><td>2</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td>3</td><td>1/1</td></tr>
|
|
<tr><td>CVTPS2PI</td><td>r64,m128</td><td> </td><td>1</td><td> </td><td>2</td>
|
|
<td> </td><td> </td><td>4</td><td>1/1</td></tr>
|
|
<tr><td>CVTSI2SS</td><td>r128,r32</td><td> </td><td>2</td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td>4</td><td>1/2</td></tr>
|
|
<tr><td>CVTSI2SS</td><td>r128,m32</td><td> </td><td>2</td><td> </td><td>2</td>
|
|
<td> </td><td> </td><td>5</td><td>1/2</td></tr>
|
|
<tr><td>CVTSS2SI CVTTSS2SI</td><td>r32,r128</td><td> </td><td>1</td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td>3</td><td>1/1</td></tr>
|
|
<tr><td>CVTSS2SI</td><td>r32,m128</td><td> </td><td>1</td><td> </td>
|
|
<td>2</td><td> </td><td> </td><td>4</td><td>1/2</td></tr>
|
|
<tr><td>ADDPS SUBPS</td><td>r128,r128</td><td> </td><td>2</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td>3</td><td>1/2</td></tr>
|
|
<tr><td>ADDPS SUBPS</td><td>r128,m128</td><td> </td><td>2</td><td> </td><td>2</td>
|
|
<td> </td><td> </td><td>3</td><td>1/2</td></tr>
|
|
<tr><td>ADDSS SUBSS</td><td>r128,r128</td><td> </td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td>3</td><td>1/1</td></tr>
|
|
<tr><td>ADDSS SUBSS</td><td>r128,m32</td><td> </td><td>1</td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td>3</td><td>1/1</td></tr>
|
|
<tr><td>MULPS</td><td>r128,r128</td><td>2</td><td> </td><td> </td><td> </td>
|
|
<td> </td><td> </td><td>4</td><td>1/2</td></tr>
|
|
<tr><td>MULPS</td><td>r128,m128</td><td>2</td><td> </td><td> </td><td>2</td>
|
|
<td> </td><td> </td><td>4</td><td>1/2</td></tr>
|
|
<tr><td>MULSS</td><td>r128,r128</td><td>1</td><td> </td><td> </td><td> </td>
|
|
<td> </td><td> </td><td>4</td><td>1/1</td></tr>
|
|
<tr><td>MULSS</td><td>r128,m32</td><td>1</td><td> </td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td>4</td><td>1/1</td></tr>
|
|
<tr><td>DIVPS</td><td>r128,r128</td><td>2</td><td> </td><td> </td><td> </td>
|
|
<td> </td><td> </td><td>48</td><td>1/34</td></tr>
|
|
<tr><td>DIVPS</td><td>r128,m128</td><td>2</td><td> </td><td> </td><td>2</td>
|
|
<td> </td><td> </td><td>48</td><td>1/34</td></tr>
|
|
<tr><td>DIVSS</td><td>r128,r128</td><td>1</td><td> </td><td> </td><td> </td>
|
|
<td> </td><td> </td><td>18</td><td>1/17</td></tr>
|
|
<tr><td>DIVSS</td><td>r128,m32</td><td>1</td><td> </td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td>18</td><td>1/17</td></tr>
|
|
<tr><td>ANDPS ANDNPS ORPS XORPS</td><td>r128,r128</td><td> </td><td>2</td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>2</td><td>1/2</td></tr>
|
|
<tr><td>ANDPS ANDNPS ORPS XORPS</td><td>r128,m128</td><td> </td><td>2</td><td> </td>
|
|
<td>2</td><td> </td><td> </td><td>2</td><td>1/2</td></tr>
|
|
<tr><td>MAXPS MINPS</td><td>r128,r128</td><td> </td><td>2</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td>3</td><td>1/2</td> </tr>
|
|
<tr><td>MAXPS MINPS</td><td>r128,m128</td><td> </td><td>2</td><td> </td><td>2</td>
|
|
<td> </td><td> </td><td>3</td><td>1/2</td></tr>
|
|
<tr><td>MAXSS MINSS</td><td>r128,r128</td><td> </td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td>3</td><td>1/1</td></tr>
|
|
<tr><td>MAXSS MINSS</td><td>r128,m32</td><td> </td><td>1</td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td>3</td><td>1/1</td></tr>
|
|
<tr><td>CMPccPS</td><td>r128,r128</td><td> </td><td>2</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td>3</td><td>1/2</td></tr>
|
|
<tr><td>CMPccPS</td><td>r128,m128</td><td> </td><td>2</td><td> </td><td>2</td>
|
|
<td> </td><td> </td><td>3</td><td>1/2</td></tr>
|
|
<tr><td>CMPccSS</td><td>r128,r128</td><td> </td><td>1</td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td>3</td><td>1/1</td></tr>
|
|
<tr><td>CMPccSS</td><td>r128,m32</td><td> </td><td>1</td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td>3</td><td>1/1</td></tr>
|
|
<tr><td>COMISS UCOMISS</td><td>r128,r128</td><td> </td><td>1</td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>1</td><td>1/1</td></tr>
|
|
<tr><td>COMISS UCOMISS</td><td>r128,m32</td><td> </td><td>1</td>
|
|
<td> </td><td>1</td><td> </td><td> </td><td>1</td><td>1/1</td></tr>
|
|
<tr><td>SQRTPS</td><td>r128,r128</td><td>2</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>56</td><td>1/56</td></tr>
|
|
<tr><td>SQRTPS</td><td>r128,m128</td><td>2</td><td> </td><td> </td><td>2</td><td> </td><td> </td><td>57</td><td>1/56</td></tr>
<tr><td>SQRTSS</td><td>r128,r128</td><td>2</td><td> </td><td> </td><td> </td><td> </td><td> </td><td>30</td><td>1/28</td></tr>
<tr><td>SQRTSS</td><td>r128,m32</td><td>2</td><td> </td><td> </td><td>1</td><td> </td><td> </td><td>31</td><td>1/28</td></tr>
<tr><td>RSQRTPS</td><td>r128,r128</td><td>2</td><td> </td><td> </td><td> </td><td> </td><td> </td><td>2</td><td>1/2</td></tr>
<tr><td>RSQRTPS</td><td>r128,m128</td><td>2</td><td> </td><td> </td><td>2</td><td> </td><td> </td><td>3</td><td>1/2</td></tr>
<tr><td>RSQRTSS</td><td>r128,r128</td><td>1</td><td> </td><td> </td><td> </td><td> </td><td> </td><td>1</td><td>1/1</td></tr>
<tr><td>RSQRTSS</td><td>r128,m32</td><td>1</td><td> </td><td> </td><td>1</td><td> </td><td> </td><td>2</td><td>1/1</td></tr>
<tr><td>RCPPS</td><td>r128,r128</td><td>2</td><td> </td><td> </td><td> </td><td> </td><td> </td><td>2</td><td>1/2</td></tr>
<tr><td>RCPPS</td><td>r128,m128</td><td>2</td><td> </td><td> </td><td>2</td><td> </td><td> </td><td>3</td><td>1/2</td></tr>
<tr><td>RCPSS</td><td>r128,r128</td><td>1</td><td> </td><td> </td><td> </td><td> </td><td> </td><td>1</td><td>1/1</td></tr>
<tr><td>RCPSS</td><td>r128,m32</td><td>1</td><td> </td><td> </td><td>1</td><td> </td><td> </td><td>2</td><td>1/1</td></tr>
<tr><td>SHUFPS</td><td>r128,r128,i</td><td> </td><td>2</td><td>1</td><td> </td><td> </td><td> </td><td>2</td><td>1/2</td></tr>
<tr><td>SHUFPS</td><td>r128,m128,i</td><td> </td><td>2</td><td> </td><td>2</td><td> </td><td> </td><td>2</td><td>1/2</td></tr>
<tr><td>UNPCKHPS UNPCKLPS</td><td>r128,r128</td><td> </td><td>2</td><td>2</td><td> </td><td> </td><td> </td><td>3</td><td>1/2</td></tr>
<tr><td>UNPCKHPS UNPCKLPS</td><td>r128,m128</td><td> </td><td>2</td><td> </td><td>2</td><td> </td><td> </td><td>3</td><td>1/2</td></tr>
<tr><td>LDMXCSR</td><td>m32</td><td colspan="3">11</td><td> </td><td> </td><td> </td><td>15</td><td>1/15</td></tr>
<tr><td>STMXCSR</td><td>m32</td><td colspan="3">6</td><td> </td><td> </td><td> </td><td>7</td><td>1/9</td></tr>
<tr><td>FXSAVE</td><td>m4096</td><td colspan="3">116</td><td> </td><td> </td><td> </td><td>62</td><td> </td></tr>
<tr><td>FXRSTOR</td><td>m4096</td><td colspan="3">89</td><td> </td><td> </td><td> </td><td>68</td><td> </td></tr>
</table>
<p>
<h2><a name="30">30</a>. Testing speed</h2>
The Pentium family of processors has an internal 64 bit clock counter which can be read
into <kbd>EDX:EAX</kbd> using the instruction <kbd>RDTSC</kbd>
(read time stamp counter). This is very useful for
testing exactly how many clock cycles a piece of code takes.
<p>
The program below measures the number of clock cycles a piece of code
takes. It executes the code to test 10 times and stores the 10 clock counts.
The program can be used in both 16 and 32 bit mode on the PPlain and PMMX:
<pre>;************ Test program for PPlain and PMMX: ********************
|
|
|
|
ITER EQU 10 ; number of iterations
|
|
OVERHEAD EQU 15 ; 15 for PPlain, 17 for PMMX
|
|
|
|
RDTSC MACRO ; define RDTSC instruction
|
|
DB 0FH,31H
|
|
ENDM
|
|
;************ Data segment: ********************
|
|
.DATA ; data segment
|
|
ALIGN 4
|
|
COUNTER DD 0 ; loop counter
|
|
TICS DD 0 ; temporary storage of clock
|
|
RESULTLIST DD ITER DUP (0) ; list of test results
|
|
;************ Code segment: ********************
|
|
.CODE ; code segment
|
|
BEGIN: MOV [COUNTER],0 ; reset loop counter
|
|
TESTLOOP: ; test loop
|
|
;************ Do any initializations here: ********************
|
|
FINIT
|
|
;************ End of initializations ********************
|
|
RDTSC ; read clock counter
|
|
MOV [TICS],EAX ; save count
|
|
CLD ; non-pairable filler
|
|
REPT 8
|
|
NOP ; eight NOP's to avoid shadowing effect
|
|
ENDM
|
|
|
|
;************ Put instructions to test here: ********************
|
|
FLDPI ; this is only an example
|
|
FSQRT
|
|
RCR EBX,10
|
|
FSTP ST
|
|
;***************** End of instructions to test ********************
|
|
|
|
CLC ; non-pairable filler with shadow
|
|
RDTSC ; read counter again
|
|
SUB EAX,[TICS] ; compute difference
|
|
SUB EAX,OVERHEAD ; subtract clocks used by fillers etc.
|
|
MOV EDX,[COUNTER] ; loop counter
|
|
MOV [RESULTLIST][EDX],EAX ; store result in table
|
|
ADD EDX,TYPE RESULTLIST ; increment counter
|
|
MOV [COUNTER],EDX ; store counter
|
|
CMP EDX,ITER * (TYPE RESULTLIST)
|
|
JB TESTLOOP ; repeat ITER times
|
|
|
|
; insert here code to read out the values in RESULTLIST</pre>
|
|
<p>
The 'filler' instructions before and after the piece of code to test are
included in order to get consistent results on the PPlain. The <kbd>CLD</kbd>
is a non-pairable instruction which has been inserted to make sure that the pairing
is the same the first time as the subsequent times. The
eight <kbd>NOP</kbd> instructions are inserted to prevent any prefixes in the
code to test from being decoded in the shadow of the preceding instructions on
the PPlain. Single byte instructions are used here to obtain the same pairing
the first time as the subsequent times. The <kbd>CLC</kbd> after the
code to test is a non-pairable instruction which has a shadow under which
the <kbd>0FH</kbd> prefix of the <kbd>RDTSC</kbd> can be decoded, so that
it is independent of any shadowing effect from the code
to test on the PPlain.
<p>
On the PMMX you may want to insert <kbd>XOR EAX,EAX / CPUID</kbd>
before the instructions to test if you want the FIFO instruction buffer
to be empty, or some time-consuming instruction
(e.g. <kbd>CLI</kbd> or <kbd>AAD</kbd>) if you want the FIFO buffer to
be full (<kbd>CPUID</kbd> has no shadow under which
prefixes of subsequent instructions can decode).
<p>
On the PPro, PII and PIII you have to insert <kbd>XOR EAX,EAX / CPUID</kbd>
before and after each <kbd>RDTSC</kbd> to prevent it from executing
in parallel with anything else, and remove the filler
instructions. (<kbd>CPUID</kbd> is a serializing instruction, which means
that it flushes the pipeline and waits for all pending operations to
finish before proceeding. This is useful for testing purposes.)
<p>
The <kbd>RDTSC</kbd> instruction cannot execute in virtual mode on the
PPlain and PMMX, so if you are running DOS programs you must run
in real mode. (Press F8 while booting and select
"safe mode command prompt only" or "bypass startup files").
<p>
The complete test program is available from <a href="http://www.agner.org/assem/">www.agner.org/assem/</a>.
<p>
The Pentium processors have special performance monitor counters which can count
events such as cache misses, misalignments, various stalls, etc. Details about how to use the
performance monitor counters are not covered by this manual but can be found in
"Intel Architecture Software Developer's Manual", vol. 3, Appendix A.
<p>
<h2><a name="31">31</a>. Comparison of the different microprocessors</h2>
The following table summarizes some important differences between the microprocessors in
the Pentium family:
<p>

<table border=1 cellpadding=4 cellspacing=1>
<tr><td> </td>
<td class="a3"> PPlain </td>
<td class="a3"> PMMX </td>
<td class="a3"> PPro </td>
<td class="a3"> PII </td>
<td class="a3"> PIII </td></tr>
<tr><td>code cache, kb</td><td>8</td><td>16</td><td>8</td><td>16</td><td>16</td></tr>
<tr><td>data cache, kb</td><td>8</td><td>16</td><td>8</td><td>16</td><td>16</td></tr>
<tr><td>built in level 2 cache, kb</td><td>0</td><td>0</td><td>256</td><td>512 *)</td><td>512 *)</td></tr>
<tr><td>MMX instructions</td><td>no</td><td>yes</td><td>no</td><td>yes</td><td>yes</td></tr>
<tr><td>XMM instructions</td><td>no</td><td>no</td><td>no</td><td>no</td><td>yes</td></tr>
<tr><td>conditional move instructions</td><td>no</td><td>no</td><td>yes</td><td>yes</td><td>yes</td></tr>
<tr><td>out of order execution</td><td>no</td><td>no</td><td>yes</td><td>yes</td><td>yes</td></tr>
<tr><td>branch prediction</td><td>poor</td><td>good</td><td>good</td><td>good</td><td>good</td></tr>
<tr><td>branch target buffer entries</td><td>256</td><td>256</td><td>512</td><td>512</td><td>512</td></tr>
<tr><td>return stack buffer size</td><td>0</td><td>4</td><td>16</td><td>16</td><td>16</td></tr>
<tr><td>branch misprediction penalty</td><td>3-4</td><td>4-5</td><td>10-20</td><td>10-20</td><td>10-20</td></tr>
<tr><td>partial register stall</td><td>0</td><td>0</td><td>5</td><td>5</td><td>5</td></tr>
<tr><td>FMUL latency</td><td>3</td><td>3</td><td>5</td><td>5</td><td>5</td></tr>
<tr><td>FMUL throughput</td><td>1/2</td><td>1/2</td><td>1/2</td><td>1/2</td><td>1/2</td></tr>
<tr><td>IMUL latency</td><td>9</td><td>9</td><td>4</td><td>4</td><td>4</td></tr>
<tr><td>IMUL throughput</td><td>1/9</td><td>1/9</td><td>1/1</td><td>1/1</td><td>1/1</td></tr>
</table>
<p>*) Celeron: 0-128, Xeon: 512 or more, many other variants available.
On some versions the level 2 cache runs at half speed.
<p>
<u>Comments to the table:</u><br>
Code cache size is important if the critical part of your program is not limited to a small
memory space.
<p>
Data cache size is important for all programs that handle more than small amounts of data
in the critical part.
<p>
MMX and XMM instructions are useful for programs that handle massively parallel
data, such as sound and image processing. In other applications it may not be
possible to take advantage of the MMX and XMM instructions.
<p>
Conditional move instructions are useful for avoiding poorly predictable conditional
jumps.
<p>
Out of order execution improves performance, especially on non-optimized code. It includes
automatic instruction reordering and register renaming.
<p>
Processors with a good branch prediction method can predict simple repetitive patterns.
Good branch prediction is most important if the branch misprediction penalty is high.
<p>
A return stack buffer improves prediction of return instructions when a subroutine is called
alternately from different locations.
<p>
Partial register stalls make handling of mixed data sizes (8, 16, 32 bit) more difficult.
<p>
The latency of a multiplication instruction is the time it takes in a dependency chain. A
throughput of 1/2 means that the execution can be pipelined so that a new multiplication
can begin every second clock cycle. This defines the speed for handling parallel data.
<p>
Most of the optimizations described in this document have little or no negative effects on
other microprocessors, including non-Intel processors, but there are some problems to be
aware of.
<p>
Scheduling floating point code for the PPlain and PMMX often requires a lot of
extra <kbd>FXCH</kbd> instructions. This will slow down execution on older microprocessors, but not on the
Pentium family and advanced non-Intel processors.
<p>
Taking advantage of the MMX instructions in the PMMX, PII and PIII processors, or of the
conditional moves in the PPro, PII and PIII, will create problems if you want your code to be
compatible with earlier microprocessors. The solution may be to write several versions of
your code, each optimized for a particular processor. Your program should detect which
processor it is running on and select the appropriate version of code
(chapter <a href="#27_10">27.10</a>).
<p>
</body>
</html>