<html>
<head>
<title>How to optimize for the Pentium family of microprocessors</title>
<style><!--
body{font-family:arial,sans-serif;color:black;background-color:#F0FFE0}
p{font-family:arial,sans-serif}
pre{font-family:courier,monospace}
kbd{font-family:courier,monospace;font-weight:500}
h1{font-size:300%;font-weight:700;text-align:center}
h2{font-size:150%;font-weight:600;text-align:left;padding-top:1em}
h3{font-size:110%;font-weight:600;text-align:left}
h4{font-size:100%;font-weight:500;text-align:left;text-decoration:underline}
td.a2{font-size:150%;font-weight:600}
td.a3{font-size:110%;font-weight:600}
td.a4{font-size:100%;font-weight:600}
--></style>
</head>

<body text="#000000" bgcolor="#F0FFE0" link="#0000E0" vlink="#E000E0" alink="#FF0000">
<center><h1>How to optimize for the Pentium<br>
family of microprocessors</h1>
<p>
<small>Copyright © 1996, 2000 by Agner Fog. Last modified 2000-03-14.</small>
</center><p>
<h2>Contents</h2>
<ol>
<li><a href="#1">Introduction</a>
<li><a href="#2">Literature</a>
<li><a href="#3">Calling assembly functions from high level language</a>
<li><a href="#4">Debugging and verifying</a>
<li><a href="#5">Memory model</a>
<li><a href="#6">Alignment</a>
<li><a href="#7">Cache</a>
<li><a href="#8">First time versus repeated execution</a>
<li><a href="#9">Address generation interlock (PPlain and PMMX)</a>
<li><a href="#10">Pairing integer instructions (PPlain and PMMX)</a>
<ol><li><a href="#10_1">Perfect pairing</a>
<li><a href="#10_2">Imperfect pairing</a>
</ol>
<li><a href="#11">Splitting complex instructions into simpler ones (PPlain and PMMX)</a>
<li><a href="#12">Prefixes (PPlain and PMMX)</a>
<li><a href="#13">Overview of PPro, PII and PIII pipeline</a>
<li><a href="#14">Instruction decoding (PPro, PII and PIII)</a>
<li><a href="#15">Instruction fetch (PPro, PII and PIII)</a>
<li><a href="#16">Register renaming (PPro, PII and PIII)</a>
<ol><li><a href="#16_1">Eliminating dependencies</a>
<li><a href="#16_2">Register read stalls</a></ol>
<li><a href="#17">Out of order execution (PPro, PII and PIII)</a>
<li><a href="#18">Retirement (PPro, PII and PIII)</a>
<li><a href="#19">Partial stalls (PPro, PII and PIII)</a>
<ol><li><a href="#19_1">Partial register stalls</a>
<li><a href="#19_2">Partial flags stalls</a>
<li><a href="#19_3">Flags stalls after shifts and rotates</a>
<li><a href="#19_4">Partial memory stalls</a></ol>
<li><a href="#20">Dependency chains (PPro, PII and PIII)</a>
<li><a href="#21">Searching for bottlenecks (PPro, PII and PIII)</a>
<li><a href="#22">Jumps and branches (all processors)</a>
<ol><li><a href="#22_1">Branch prediction in PPlain</a>
<li><a href="#22_2">Branch prediction in PMMX, PPro, PII and PIII</a>
<li><a href="#22_3">Avoiding jumps (all processors)</a>
<li><a href="#22_4">Avoiding conditional jumps by using flags (all processors)</a>
<li><a href="#22_5">Replacing conditional jumps by conditional moves (PPro, PII and PIII)</a></ol>
<li><a href="#23">Reducing code size (all processors)</a>
<li><a href="#24">Scheduling floating point code (PPlain and PMMX)</a>
<li><a href="#25">Loop optimization (all processors)</a>
<ol><li><a href="#25_1">Loops in PPlain and PMMX</a>
<li><a href="#25_2">Loops in PPro, PII and PIII</a></ol>
<li><a href="#26">Problematic Instructions</a>
<ol><li><a href="#26_1">XCHG (all processors)</a>
<li><a href="#26_2">Rotates through carry (all processors)</a>
<li><a href="#26_3">String instructions (all processors)</a>
<li><a href="#26_4">Bit test (all processors)</a>
<li><a href="#26_5">Integer multiplication (all processors)</a>
<li><a href="#26_6">WAIT instruction (all processors)</a>
<li><a href="#26_7">FCOM + FSTSW AX (all processors)</a>
<li><a href="#26_8">FPREM (all processors)</a>
<li><a href="#26_9">FRNDINT (all processors)</a>
<li><a href="#26_10">FSCALE and exponential function (all processors)</a>
<li><a href="#26_11">FPTAN (all processors)</a>
<li><a href="#26_12">FSQRT (PIII)</a>
<li><a href="#26_13">MOV [MEM], ACCUM (PPlain and PMMX)</a>
<li><a href="#26_14">TEST instruction (PPlain and PMMX)</a>
<li><a href="#26_15">Bit scan (PPlain and PMMX)</a>
<li><a href="#26_16">FLDCW (PPro, PII and PIII)</a>
</ol>
<li><a href="#27">Special topics</a>
<ol><li><a href="#27_1">LEA instruction (all processors)</a>
<li><a href="#27_2">Division (all processors)</a>
<li><a href="#27_3">Freeing floating point registers (all processors)</a>
<li><a href="#27_4">Transitions between floating point and MMX instructions (PMMX, PII and PIII)</a>
<li><a href="#27_5">Converting from floating point to integer (all processors)</a>
<li><a href="#27_6">Using integer instructions to do floating point operations (all processors)</a>
<li><a href="#27_7">Using floating point instructions to do integer operations (PPlain and PMMX)</a>
<li><a href="#27_8">Moving blocks of data (all processors)</a>
<li><a href="#27_9">Self-modifying code (all processors)</a>
<li><a href="#27_10">Detecting processor type (all processors)</a>
</ol>
<li><a href="#28">List of instruction timings for PPlain and PMMX</a>
<ol><li><a href="#28_1">Integer instructions</a>
<li><a href="#28_2">Floating point instructions</a>
<li><a href="#28_3">MMX instructions (PMMX)</a></ol>
<li><a href="#29">List of instruction timings and micro-op breakdown for PPro, PII and PIII</a>
<ol><li><a href="#29_1">Integer instructions</a>
<li><a href="#29_2">Floating point instructions</a>
<li><a href="#29_3">MMX instructions (PII and PIII)</a>
<li><a href="#29_4">XMM instructions (PIII)</a>
</ol>
<li><a href="#30">Testing speed</a>
<li><a href="#31">Comparison of the different microprocessors</a>
</ol>
<h2><a name="1">1</a>. Introduction</h2>
This manual describes in detail how to write optimized assembly language
code, with particular focus on the Pentium® family of microprocessors.
<p>
Most of the information herein is based on my own research. Many people have
sent me useful information and corrections for this manual, and I keep
updating it whenever I have important new information. This manual is
therefore more accurate, detailed and comprehensive than any other
source of information, and it contains many details not found anywhere else.
In many cases this information will enable you to calculate exactly how many
clock cycles a piece of code will take. I do not claim, though, that all
information in this manual is exact: some timings can be difficult or
impossible to measure exactly, and I do not have access to the inside information
on technical implementations that the writers of Intel manuals have.
<p>
The following versions of Pentium processors are discussed in this manual:<p>
<table border=1>
<tr><td class="a3">abbreviation</td><td class="a3">name</td></tr>
<tr><td>PPlain</td><td>plain old Pentium (without MMX)</td></tr>
<tr><td>PMMX</td><td>Pentium with MMX</td></tr>
<tr><td>PPro</td><td>Pentium Pro</td></tr>
<tr><td>PII</td><td>Pentium II (including Celeron and Xeon)</td></tr>
<tr><td>PIII</td><td>Pentium III (including variants)</td></tr>
</table>
<p>
The assembly language syntax used in this manual is MASM 5.10 syntax.
There is no official standard for X86 assembly language, but this is the
closest you can get to a de facto standard, since most assemblers have a
MASM 5.10 compatible mode. (I do not recommend using MASM version 5.10 itself, though,
because it has a serious bug in 32 bit mode. Use TASM or a later version of MASM.)
<p>
Some of the remarks in this manual may seem like a criticism of Intel. This should not be
taken to mean that other brands are better. The Pentium family of microprocessors
may be faster than any compatible competing brand, better documented, and with better
testability features. For these reasons, no competing brand has been subjected to the same
level of independent research by me or by anybody else.
<p>
Programming in assembly language is much more difficult than programming in a high
level language. Making bugs is very easy, and finding them is very difficult. Now you have been
warned! It is assumed that the reader is already experienced in assembly programming. If not, then
please read some books on the subject and get some programming experience before you
begin to do complicated optimizations.
<p>
The hardware design of the PPlain and PMMX chips has many features which are
optimized specifically for some commonly used instructions or instruction combinations,
rather than using general optimization methods. Consequently, the rules for optimizing
software for this design are complicated and have many exceptions, but the possible gain
in performance may be substantial. The PPro, PII and PIII processors have a very
different design, where the processor takes care of much of the optimization work by
executing instructions out of order; but the more complicated design of these processors
generates many potential bottlenecks, so there may be a lot to gain
by optimizing manually for these processors as well.
<p>
Before you start to convert your code to assembly, make sure that your algorithm is optimal.
Often you can improve a piece of code much more by improving the algorithm than by
converting it to assembly code.
<p>
Next, you have to identify the critical parts of your program. Often more than 99% of the
CPU time is spent in the innermost loop of a program. In this case you should optimize only
this loop and leave everything else in high level language. Some assembly programmers
waste a lot of energy optimizing the wrong parts of their programs, the only significant effect
of their effort being that the programs become more difficult to debug and maintain!
<p>
If it is not obvious where the critical parts of your program are, then you may use a profiler to
find them. If it turns out that the bottleneck is disk access, then you may modify your
program to make disk access sequential in order to improve disk caching, rather than
turning to assembly programming. If the bottleneck is graphics output, then you may look for
a way of reducing the number of calls to graphics procedures.
<p>
Some high level language compilers offer relatively good optimization for specific
processors, but further optimization by hand can usually do much better.
<p>
Please don't send your programming questions to me. I am not going to do your homework
for you!
<p>
Good luck with your hunt for nanoseconds!
<p>
<h2><a name="2">2</a>. Literature</h2>
A lot of useful literature and tutorials can be downloaded for free from Intel's www site or
acquired in print or on CD-ROM. It is recommended that you study this literature in order to
get acquainted with the microprocessor architecture. However, the documents from Intel are
not always accurate - the tutorials especially have many errors (evidently, they haven't
tested their own examples).
<p>
I will not give the URLs here because the file locations change very often. You can find the
documents you need by using the search facilities at
<a href="http://developer.intel.com" target="external">developer.intel.com</a> or follow the
links from <a href="http://www.agner.org/assem">www.agner.org/assem</a>
<p>
Some documents are in .PDF format. If you don't have software for viewing or printing .PDF
files, then you may download the Acrobat file reader from <a href="http://www.adobe.com" target="external">www.adobe.com</a>
<p>
The use of MMX and XMM (SIMD) instructions for optimizing specific applications is described in several
application notes. The instruction set is described in various manuals and tutorials.
<p>
VTUNE is a software tool from Intel for optimizing code. I have not tested it and can
therefore not give any evaluation of it here.
<p>
Many sources other than Intel also have useful information. These sources are listed in
the FAQ for the newsgroup comp.lang.asm.x86. For other internet resources, follow the
links from <a href="http://www.agner.org/assem">www.agner.org/assem</a>
<p>
<h2><a name="3">3</a>. Calling assembly functions from high level language</h2>
You can either use inline assembly or code a subroutine entirely in assembly language and
link it into your project. If you choose the latter option, then it is recommended that you use
a compiler which is capable of translating high level code directly to assembly. This assures
that you get the function calling method right. Most C++ compilers can do this.
<p>
The methods for function calling and name mangling can be quite complicated. There are
many different calling conventions, and the different brands of compilers are not compatible
in this respect. If you are calling assembly language subroutines from C++, then the best
method in terms of consistency and compatibility is to declare your functions <kbd>extern "C"</kbd>
and <kbd>_cdecl</kbd>. The assembly code must then have the function name prefixed by an
underscore (<kbd>_</kbd>) and be assembled with case sensitivity on externals (option -mx).
<p>
If you need to make overloaded functions, overloaded operators, member
functions, and other C++ specialties, then you have to code it in C++ first and
make your compiler translate it to assembly in order to get the right linking
information and calling method. These details are different for different brands
of compilers. If you want an assembly function with any other calling method
than <kbd>extern "C"</kbd> and <kbd>_cdecl</kbd> to be callable from code
compiled with different compilers, then you need to give it one public name
for each compiler. For example, an overloaded square function:
<pre>; int square (int x);
SQUARE_I        PROC    NEAR            ; integer square function
@square$qi      LABEL   NEAR            ; link name for Borland compiler
?square@@YAHH@Z LABEL   NEAR            ; link name for Microsoft compiler
_square__Fi     LABEL   NEAR            ; link name for Gnu compiler
        PUBLIC  @square$qi, ?square@@YAHH@Z, _square__Fi
        MOV     EAX, [ESP+4]
        IMUL    EAX
        RET
SQUARE_I        ENDP

; double square (double x);
SQUARE_D        PROC    NEAR            ; double precision float square function
@square$qd      LABEL   NEAR            ; link name for Borland compiler
?square@@YANN@Z LABEL   NEAR            ; link name for Microsoft compiler
_square__Fd     LABEL   NEAR            ; link name for Gnu compiler
        PUBLIC  @square$qd, ?square@@YANN@Z, _square__Fd
        FLD     QWORD PTR [ESP+4]
        FMUL    ST(0), ST(0)
        RET
SQUARE_D        ENDP</pre>
<p>
The way of transferring parameters depends on the calling convention:<p>
<table border=1 cellpadding=1 cellspacing=1><tr>
<td class="a3"> calling convention </td>
<td class="a3"> parameter order on stack </td>
<td class="a3"> parameters removed by </td></tr>
<tr><td> _cdecl </td><td> first par. at low address </td><td> caller </td></tr>
<tr><td> _stdcall </td><td> first par. at low address </td><td> subroutine </td></tr>
<tr><td> _fastcall </td><td> compiler specific </td><td> subroutine </td></tr>
<tr><td> _pascal </td><td> first par. at high address </td><td> subroutine </td></tr>
</table>
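<p>
As a rough C-side sketch of the <kbd>extern "C"</kbd>/<kbd>_cdecl</kbd> interface described above (the implementation below is only a stand-in for the assembly routine, which would be linked in under the name <kbd>_square</kbd>):

```c
#include <assert.h>

/* C-side stand-in for the assembly SQUARE_I routine shown above.
   With _cdecl, the caller pushes the argument (first parameter at the
   lowest address) and removes it again after the call; the assembly
   version reads the argument from [ESP+4] and returns the result in
   EAX, which corresponds to this function's return value. */
int square(int x)
{
    return x * x;
}
```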
<p>
<u>Register usage in 16 bit mode DOS or Windows, C or C++:</u><br>
16-bit return value in <kbd>AX</kbd>, 32-bit return value in <kbd>DX:AX</kbd>,
floating point return value in <kbd>ST(0)</kbd>. Registers <kbd>AX, BX, CX,
DX, ES</kbd> and arithmetic flags may be changed by the procedure; all other
registers must be saved and restored. A procedure can rely on <kbd>SI, DI, BP, DS</kbd>
and <kbd>SS</kbd> being unchanged across a call to another procedure.
<p>
<u>Register usage in 32 bit Windows, C++ and other programming languages:</u><br>
Integer return value in <kbd>EAX</kbd>, floating point return value in <kbd>ST(0)</kbd>.
Registers <kbd>EAX, ECX</kbd> and <kbd>EDX</kbd> (but not <kbd>EBX</kbd>) may be changed by the procedure; all other
registers must be saved and restored. Segment registers cannot be changed, not even
temporarily. <kbd>CS, DS, ES,</kbd> and <kbd>SS</kbd> all point to the flat segment group. <kbd>FS</kbd> is used by the
operating system. <kbd>GS</kbd> is unused, but reserved. Flags may be changed by the procedure
with the following restrictions: the direction flag is 0 by default; the direction flag may be
set temporarily, but must be cleared before any call or return; the interrupt flag cannot be
cleared. The floating point register stack is empty at the entry of a procedure and must be
empty at return, except for <kbd>ST(0)</kbd> if it is used for the return value. MMX registers may be
changed by the procedure, but must then be cleared by <kbd>EMMS</kbd> before returning and before calling any
other procedure that may use floating point registers. All XMM registers may be modified
by procedures. Rules for passing parameters and return values in XMM registers
are described in Intel's application note AP 589. A procedure can rely on
<kbd>EBX, ESI, EDI, EBP</kbd> and all segment registers being unchanged across
a call to another procedure.
<p>
<h2><a name="4">4</a>. Debugging and verifying</h2>
Debugging assembly code can be quite hard and frustrating, as you probably already have
discovered. I would recommend that you start by writing the piece of code you want to
optimize as a subroutine in a high level language. Next, write a test program that will test
your subroutine thoroughly. Make sure the test program goes into all branches and
boundary cases.
<p>
When your high level language subroutine works with your test program, then you are ready
to translate the code to assembly language.
<p>
Now you can start to optimize. Each time you have made a modification, you should run it
on the test program to see if it works correctly.
Number all your versions and save them so that you can go back and test them again in
case you discover an error that the test program didn't catch (such as writing to a wrong
address).
<p>
Test the speed of the most critical part of your program with the method described in
chapter <a href="#30">30</a> or with a test program. If the code is significantly slower than expected, then the
most probable causes are: cache misses (chapter <a href="#7">7</a>), misaligned
operands (chapter <a href="#6">6</a>), first
time penalty (chapter <a href="#8">8</a>), branch mispredictions
(chapter <a href="#22">22</a>), instruction fetch problems
(chapter <a href="#15">15</a>), register read stalls (chapter <a href="#16">16</a>),
or long dependency chains (chapter <a href="#20">20</a>).
<p>
Highly optimized code tends to be very difficult to read and understand for others, and even
for yourself when you get back to it after some time. In order to make it possible to maintain
the code, it is important that you organize it into small logical units (procedures or macros)
with a well-defined interface and appropriate comments. The more complicated the code is
to read, the more important good documentation is.
<p>
<h2><a name="5">5</a>. Memory model</h2>
The Pentiums are designed primarily for 32 bit code, and the performance is
inferior on 16 bit code. Segmenting your code and data also degrades performance
significantly, so you should generally prefer 32 bit flat mode, and an operating
system which supports this mode. The code examples shown in this
manual assume a 32 bit flat memory model, unless otherwise specified.
<p>
<h2><a name="6">6</a>. Alignment</h2>
All data in RAM should be aligned to addresses divisible by 2, 4, 8, or 16 according to this
scheme:
<table border=1 cellpadding=1 cellspacing=1>
<tr><td></td><td colspan=2 align=center class="a3">alignment</td></tr>
<tr><td align=center class="a3"> operand size </td>
<td align=center class="a3"> PPlain and PMMX </td>
<td align=center class="a3"> PPro, PII and PIII </td></tr>
<tr><td> 1 (byte) </td><td align=center>1</td><td align=center>1</td></tr>
<tr><td> 2 (word) </td><td align=center>2</td><td align=center>2</td></tr>
<tr><td> 4 (dword) </td><td align=center>4</td><td align=center>4</td></tr>
<tr><td> 6 (fword) </td><td align=center>4</td><td align=center>8</td></tr>
<tr><td> 8 (qword) </td><td align=center>8</td><td align=center>8</td></tr>
<tr><td> 10 (tbyte) </td><td align=center>8</td><td align=center>16</td></tr>
<tr><td> 16 (oword) </td><td align=center>n.a.</td><td align=center>16</td></tr>
</table>
<p>
On PPlain and PMMX, misaligned data will take at least 3 clock cycles extra to access if a 4
byte boundary is crossed. The penalty is higher when a cache line boundary is crossed.
<p>
On PPro, PII and PIII, misaligned data will cost you 6-12 clocks extra when a
cache line boundary is crossed. Misaligned operands smaller than 16 bytes that
do not cross a 32 byte boundary give no penalty.
<p>
Aligning data by 8 or 16 on a dword size stack may be a problem. A common method is to set up
an aligned frame pointer. A function with aligned local data may look like this:
<pre>_FuncWithAlign PROC NEAR
        PUSH    EBP                             ; prolog code
        MOV     EBP, ESP
        AND     EBP, -8                         ; align frame pointer by 8
        FLD     DWORD PTR [ESP+8]               ; function parameter
        SUB     ESP, LocalSpace + 4             ; allocate local space
        FSTP    QWORD PTR [EBP-LocalSpace]      ; store something in aligned space
        ...
        ADD     ESP, LocalSpace + 4             ; epilog code. restore ESP
        POP     EBP                             ; (AGI stall on PPlain/PMMX)
        RET
_FuncWithAlign ENDP</pre>
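<p>
The <kbd>AND EBP, -8</kbd> instruction in the prolog above rounds the frame pointer down to the nearest multiple of 8. The same mask arithmetic can be sketched in C (the function name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Round an address down to the nearest multiple of 8, exactly as
   AND EBP, -8 does: -8 is ...11111000 in two's complement, so the
   AND clears the three lowest address bits. */
uintptr_t align_down8(uintptr_t addr)
{
    return addr & (uintptr_t)-8;
}
```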
<p>
While aligning data is always important, aligning code is not necessary on the PPlain and
PMMX. Principles for aligning code on PPro, PII and PIII are explained in chapter <a href="#15">15</a>.
<p>
<h2><a name="7">7</a>. Cache</h2>
The PPlain and PPro have 8 kb of on-chip cache (level one cache) for code, and 8 kb for
data. The PMMX, PII and PIII have 16 kb for code and 16 kb for data. Data in the level 1 cache
can be read or written in just one clock cycle, whereas a cache miss may cost many
clock cycles. It is therefore important that you understand how the cache works in order to
use it most efficiently.
<p>
The data cache consists of 256 or 512 lines of 32 bytes each. Each time you read a data
item which is not cached, the processor will read an entire cache line from memory. The
cache lines are always aligned to a physical address divisible by 32. When you have read a
byte at an address divisible by 32, then the next 31 bytes can be read or written at almost
no extra cost. You can take advantage of this by arranging data items which are used near
each other together into aligned blocks of 32 bytes of memory. If, for example, you have a
loop which accesses two arrays, then you may interleave the two arrays into one array of
structures, so that data which are used together are also stored together.
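<p>
The interleaving idea can be sketched in C (the array and structure names are hypothetical): instead of two parallel arrays, which may map to conflicting cache lines, a single array of structures keeps each pair of elements together in the same cache line:

```c
#include <assert.h>
#include <stddef.h>

/* Two parallel arrays accessed in the same loop...
      float a[1000], b[1000];
   ...replaced by one array of structures, so that a and b values
   used together also sit next to each other in memory: */
struct pair { float a, b; };
static struct pair ab[1000];
```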
<p>
If the size of an array or other data structure is a multiple of 32 bytes, then you should
preferably align it by 32.
<p>
The cache is set-associative. This means that a cache line cannot be assigned to an
arbitrary memory address. Each cache line has a 7-bit set-value which must match bits 5
through 11 of the physical RAM address (bits 0-4 define the 32 bytes within a cache line).
The PPlain and PPro have two cache lines for each of the 128 set-values, so there are two
possible cache lines to assign to any RAM address. The PMMX, PII and PIII have four.
<p>
The consequence of this is that the cache can hold no more than two or four different data
blocks which have the same value in bits 5-11 of the address. You can determine if two
addresses have the same set-value by the following method: strip off the lower 5 bits of
each address to get a value divisible by 32. If the difference between the two truncated
addresses is a multiple of 4096 (=1000H), then the addresses have the same set-value.
<p>
Let me illustrate this by the following piece of code, where ESI holds an address divisible by
32:

<pre>AGAIN:  MOV     EAX, [ESI]
        MOV     EBX, [ESI + 13*4096 + 4]
        MOV     ECX, [ESI + 20*4096 + 28]
        DEC     EDX
        JNZ     AGAIN</pre>
<p>
The three addresses used here all have the same set-value because the differences
between the truncated addresses are multiples of 4096. This loop will perform very poorly on
the PPlain and PPro. At the time you read <kbd>ECX</kbd> there is no free cache
line with the proper set-value, so the processor takes the least recently used
of the two possible cache lines - that is, the one which was used for <kbd>EAX</kbd> - and
fills it with the data from <kbd>[ESI+20*4096]</kbd> to
<kbd>[ESI+20*4096+31]</kbd> and reads <kbd>ECX</kbd>.
Next, when reading <kbd>EAX</kbd>, you find that the cache
line that held the value for <kbd>EAX</kbd> has now been discarded, so you
take the least recently used
line, which is the one holding the <kbd>EBX</kbd> value, and so on.
You have nothing but cache misses, and the loop takes something like 60 clock
cycles. If the third line is changed to:
<pre>        MOV     ECX, [ESI + 20*4096 + 32]</pre>
<p>
then we have crossed a 32 byte boundary, so that we do not have the same set-value as in
the first two lines, and there will be no problem assigning a cache line to each of the three
addresses. The loop now takes only 3 clock cycles (except for the first time) - a very
considerable improvement! As already mentioned, the PMMX, PII and PIII have 4-way caches,
so that you have four cache lines with the same set-value. (Some Intel documents
erroneously say that the PII cache is 2-way.)
<p>
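The set-value rule can be checked with a few lines of C. This sketch (the function name is hypothetical) models the 128 sets of the 2-way PPlain/PPro data cache: bits 0-4 of the address select the byte within the line, and bits 5-11 select the set:

```c
#include <assert.h>
#include <stdint.h>

/* Cache set selected by an address: bits 5-11, i.e. 128 sets of
   32-byte lines. Addresses whose 32-byte-truncated difference is a
   multiple of 4096 compete for the same set. */
unsigned set_value(uintptr_t addr)
{
    return (unsigned)((addr >> 5) & 0x7F);
}
```

With ESI = 0x100000 (divisible by 32), the three addresses in the loop above all give the same set-value, while the modified third address does not.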
|
|
It may be very difficult to determine if your data addresses have the same set-values,
|
|
especially if they are scattered around in different segments. The best thing you can do to
|
|
avoid problems of this kind is to keep all data used in the critical part or your program within
|
|
one contiguous block not bigger than the cache, or two contiguous blocks no bigger than
|
|
half that size (for example one block for static data and another block for data on the stack).
|
|
This will make sure that your cache lines are used optimally.
|
|
<p>
|
|
If the critical part of your code accesses big data structures or random data addresses, then
|
|
you may want to keep all frequently used variables (counters, pointers, control variables,
|
|
etc.) within a single contiguous block of max 4 kbytes so that you have a complete set of
|
|
cache lines free for accessing random data. Since you probably need stack space anyway
|
|
for subroutine parameters and return addresses, the best thing is to copy all frequently
|
|
used static data to dynamic variables on the stack, and copy them back again outside the
|
|
critical loop if they have been changed.
|
|
<p>
|
|
Reading a data item which is not in the level one cache causes an entire cache line to be
|
|
filled from the level two cache, which takes approximately 200 ns (that is 20 clocks on a
|
|
100 MHz system or 40 clocks on a 200 MHz system), but the bytes you ask for first are
|
|
available already after 50-100 ns. If the data item is not in the level two cache either, then
|
|
you will get a delay of something like 200-300 ns. This delay will be somewhat longer if you
|
|
cross a DRAM page boundary. (The size of a DRAM page is 1 kb for 4 and 8 MB 72 pin
|
|
RAM modules, and 2 kb for 16 and 32 MB modules).
|
|
<p>
|
|
When reading big blocks of data from memory, the speed is limited by the time it takes to fill
|
|
cache lines. You can sometimes improve speed by reading data in a non-sequential order:
|
|
before you finish reading data from one cache line start reading the first item from the next
|
|
cache line. This method can increase reading speed by 20 - 40% when reading from main
|
|
memory or level 2 cache on PPlain and PMMX, and from level 2 cache on PPro, PII and PIII. A
|
|
disadvantage of this method is of course that the program code becomes extremely clumsy
|
|
and complicated. For further information on this trick see <a href="http://www.intelligentfirm.com" target="external">www.intelligentfirm.com</a>.
|
|
<p>
|
|
When you write to an address which is not in the level 1 cache, then the value will go right
|
|
through to the level 2 cache or to the RAM (depending on how the level 2 cache is set up)
|
|
on the PPlain and PMMX. This takes approximately 100 ns. If you write eight or more times
|
|
to the same 32 byte block of memory without also reading from it, and the block is not in the
|
|
level one cache, then it may be advantageous to make a dummy read from the block first to
|
|
load it into a cache line. All subsequent writes to the same block will then go to the cache
|
|
instead, which takes only one clock cycle. On PPlain and PMMX, there is sometimes a
|
|
small penalty for writing repeatedly to the same address without reading in between.
|
|
<p>
|
|
On PPro, PII and PIII, a write miss will normally load a cache line, but it is possible to setup an
|
|
area of memory to perform differently, for example video RAM (See Pentium Pro Family
|
|
Developer's Manual, vol. 3: Operating System Writer's Guide").
|
|
<p>
|
|
Other ways of speeding up memory reads and writes are discussed in chapter
|
|
<a href="#27_8">27.8</a> below.
|
|
<p>
|
|
The PPlain and PPro have two write buffers, PMMX, PII and PIII have four. On the PMMX, PII and
|
|
PIII you may have up to four unfinished writes to uncached memory without delaying the
|
|
subsequent instructions. Each write buffer can handle operands up to 64 bits wide.
|
|
<p>
|
|
Temporary data may conveniently be stored on the stack because the stack area is very
|
|
likely to be in the cache. However, you should be aware of the alignment problems
|
|
if your data elements are bigger than the stack word size.
<p>
If the life ranges of two data structures do not overlap, then they may share
the same RAM area to increase cache efficiency. This is consistent with the
common practice of allocating space for temporary variables on the stack.
<p>
Storing temporary data in registers is of course even more efficient. Since registers are a
scarce resource you may want to use <kbd>[ESP]</kbd> rather than <kbd>[EBP]</kbd> for addressing data on the
stack, in order to free <kbd>EBP</kbd> for other purposes. Just don't forget that the value of <kbd>ESP</kbd>
changes every time you do a <kbd>PUSH</kbd> or <kbd>POP</kbd>.
(You cannot use <kbd>ESP</kbd> under 16-bit Windows
because the timer interrupt will modify the high word of <kbd>ESP</kbd> at unpredictable places in your
code.)
<p>
There is a separate cache for code, which is similar to the data cache. The size of the code
cache is 8 kb on the PPlain and PPro and 16 kb on the PMMX, PII and PIII. It is important that the
critical part of your code (the innermost loops) fits in the code cache. Frequently used pieces
of code, or routines which are used together, should preferably be stored near each other.
Seldom used branches or procedures should be placed at the bottom of your code or
somewhere else.
<p>
<h2><a name="8">8</a>. First time versus repeated execution</h2>
A piece of code usually takes much more time the first time it is executed than when it is
repeated. The reasons are the following:
<ol>
<li>Loading the code from RAM into the cache takes longer than executing it.
<li>Any data accessed by the code has to be loaded into the cache, which may take much
more time than executing the instructions. When the code is repeated, the data are
more likely to be in the cache.
<li>Jump instructions will not be in the branch target buffer the first time they execute, and
are therefore less likely to be predicted correctly. See chapter <a href="#22">22</a>.
<li>In the PPlain, decoding the code is a bottleneck. If it takes one clock cycle to determine
the length of an instruction, then it is not possible to decode two instructions per clock
cycle, because the processor doesn't know where the second instruction begins. The
PPlain solves this problem by remembering the length of any instruction which has
remained in the cache since the last time it was executed. As a consequence of this, a set
of instructions will not pair in the PPlain the first time they are executed, unless the first
of the two instructions is only one byte long. The PMMX, PPro, PII and PIII have no penalty
on first time decoding.
</ol><p>
For these four reasons, a piece of code inside a loop will generally take much more time the
first time it executes than the subsequent times.
<p>
If you have a big loop which doesn't fit into the code cache then you will get penalties all the
time because it doesn't run from the cache. You should therefore try to reorganize the loop
to make it fit into the cache.
<p>
If you have very many jumps, calls, and branches inside a loop, then you may get the
penalty of branch target buffer misses repeatedly.
<p>
Likewise, if a loop repeatedly accesses a data structure too big for the data cache, then you
will get the penalty of data cache misses all the time.
<p>
<h2><a name="9">9</a>. Address generation interlock (PPlain and PMMX)</h2>
It takes one clock cycle to calculate the address needed by an instruction which accesses
memory. Normally, this calculation is done at a separate stage in the pipeline while the
preceding instruction or instruction pair is executing. But if the address depends on the
result of an instruction executing in the preceding clock cycle, then you have to wait an
extra clock cycle for the address to be calculated. This is called an AGI stall.
Example:<br>
<kbd>ADD EBX,4 / MOV EAX,[EBX] ; AGI stall</kbd><br>
The stall in this example can be removed by putting some other instructions in between
<kbd>ADD EBX,4</kbd> and <kbd>MOV EAX,[EBX]</kbd> or by rewriting the code to:
<kbd>MOV EAX,[EBX+4] / ADD EBX,4</kbd>
<p>
You can also get an AGI stall with instructions which use <kbd>ESP</kbd> implicitly
for addressing, such
as <kbd>PUSH, POP, CALL,</kbd> and <kbd>RET</kbd>, if <kbd>ESP</kbd> has been
changed in the preceding clock cycle by
instructions such as <kbd>MOV, ADD,</kbd> or <kbd>SUB</kbd>.
The PPlain and PMMX have special circuitry to
predict the value of <kbd>ESP</kbd> after a stack operation so that you do not
get an AGI delay after
changing <kbd>ESP</kbd> with <kbd>PUSH, POP,</kbd> or <kbd>CALL</kbd>.
You can get an AGI stall after <kbd>RET</kbd> only if it has an
immediate operand to add to <kbd>ESP</kbd>.
<p>
Examples:
<pre>ADD ESP,4 / POP ESI                ; AGI stall
POP EAX / POP ESI                  ; no stall, pair
MOV ESP,EBP / RET                  ; AGI stall
CALL L1 / L1: MOV EAX,[ESP+8]     ; no stall
RET / POP EAX                      ; no stall
RET 8 / POP EAX                    ; AGI stall</pre>
<p>
The <kbd>LEA</kbd> instruction is also subject to an AGI stall if it uses a
base or index register which
has been changed in the preceding clock cycle. Example:
<pre>INC ESI / LEA EAX,[EBX+4*ESI]      ; AGI stall</pre>
<p>
PPro, PII and PIII have no AGI stalls for memory reads and <kbd>LEA</kbd>, but they do have
AGI stalls for memory writes. This is not very significant unless the subsequent
code has to wait for the write to finish.<p>
<h2><a name="10">10</a>. Pairing integer instructions (PPlain and PMMX)</h2>
<h3><a name="10_1">10.1 Perfect pairing</a></h3>
The PPlain and PMMX have two pipelines for executing instructions, called the U-pipe and
the V-pipe. Under certain conditions it is possible to execute two instructions
simultaneously, one in the U-pipe and one in the V-pipe. This can almost double the speed.
It is therefore advantageous to reorder your instructions to make them pair.
<p>
The following instructions are pairable in either pipe:
<ul><li><kbd>MOV</kbd> register, memory, or immediate into register or memory
<li><kbd>PUSH</kbd> register or immediate, <kbd>POP</kbd> register
<li><kbd>LEA, NOP</kbd>
<li><kbd>INC, DEC, ADD, SUB, CMP, AND, OR, XOR,</kbd>
<li>and some forms of <kbd>TEST</kbd> (see chapter <a href="#26_14">26.14</a>)
</ul>
The following instructions are pairable in the U-pipe only:
<ul><li><kbd>ADC, SBB</kbd>
<li><kbd>SHR, SAR, SHL, SAL</kbd> with immediate count
<li><kbd>ROR, ROL, RCR, RCL</kbd> with an immediate count of 1
</ul>
The following instructions can execute in either pipe but are only pairable
when in the V-pipe:
<ul><li>near call
<li>short and near jump
<li>short and near conditional jump.
</ul>
All other integer instructions can execute in the U-pipe only, and are not pairable.
<p>
Two consecutive instructions will pair when the following conditions are met:
<p>
<u>1.</u> The first instruction is pairable in the U-pipe and the second
instruction is pairable in the V-pipe.
<p>
<u>2.</u> The second instruction does not read or write a register which the first
instruction writes to.<br>
Examples:
<pre> MOV EAX, EBX / MOV ECX, EAX   ; read after write, do not pair
 MOV EAX, 1   / MOV EAX, 2     ; write after write, do not pair
 MOV EBX, EAX / MOV EAX, 2     ; write after read, pair OK
 MOV EBX, EAX / MOV ECX, EAX   ; read after read, pair OK
 MOV EBX, EAX / INC EAX        ; read and write after read, pair OK</pre>
<p>
<u>3.</u> In rule 2 partial registers are treated as full registers. Example:
<pre> MOV AL, BL / MOV AH, 0</pre><p>
These instructions write to different parts of the same register and therefore do not pair.
<p>
<u>4.</u> Two instructions which both write to parts of the flags register can pair despite rules 2 and
3. Example:
<pre> SHR EAX, 4 / INC EBX          ; pair OK</pre>
<p>
<u>5.</u> An instruction which writes to the flags can pair with a conditional jump despite rule 2.
Example:
<pre> CMP EAX, 2 / JA LabelBigger   ; pair OK</pre>
<p>
<u>6.</u> The following instruction combinations can pair despite the fact that they both modify the
stack pointer:
<pre> PUSH + PUSH, PUSH + CALL, POP + POP</pre>
<p>
<u><a name="10-7">7.</a></u> There are restrictions on the pairing of instructions with prefixes.
There are several types of prefixes:
<ul>
<li>instructions addressing a non-default segment have a segment prefix.
<li>instructions using 16 bit data in 32 bit mode, or 32 bit data in 16 bit mode, have an
operand size prefix.
<li>instructions using 32 bit base or index registers in 16 bit mode have an address size
prefix.
<li>repeated string instructions have a repeat prefix.
<li>locked instructions have a <kbd>LOCK</kbd> prefix.
<li>many instructions which were not implemented on the 8086 processor have a two byte
opcode where the first byte is <kbd>0FH</kbd>. The <kbd>0FH</kbd> byte behaves
as a prefix on the PPlain, but
not on the other versions. The most common instructions with <kbd>0FH</kbd>
prefix are: <kbd>MOVZX,
MOVSX, PUSH FS, POP FS, PUSH GS, POP GS, LFS, LGS, LSS, SETcc, BT,
BTC, BTR, BTS, BSF, BSR, SHLD, SHRD,</kbd> and <kbd>IMUL</kbd> with two operands and no
immediate operand.
</ul>
<p>
On the PPlain, a prefixed instruction can only execute in the U-pipe, except for conditional
near jumps.
<p>
On the PMMX, instructions with operand size, address size, or <kbd>0FH</kbd>
prefix can execute in
either pipe, whereas instructions with segment, repeat, or lock prefix can
only execute in the U-pipe.
<p>
<u>8.</u> An instruction which has both a displacement and immediate data is not pairable on the
PPlain and only pairable in the U-pipe on the PMMX:
<pre> MOV DWORD PTR DS:[1000], 0    ; not pairable or only in U-pipe
 CMP BYTE PTR [EBX+8], 1       ; not pairable or only in U-pipe
 CMP BYTE PTR [EBX], 1         ; pairable
 CMP BYTE PTR [EBX+8], AL      ; pairable</pre><p>
(Another problem with instructions which have both a displacement and
immediate data on the PMMX is that such instructions may be longer
than 7 bytes, which means that only one instruction can be decoded
per clock cycle, as explained in chapter <a href="#12">12</a>.)
<p>
<u>9.</u> Both instructions must be preloaded and decoded. This is explained in
chapter <a href="#8">8</a>.
<p>
<u>10.</u> There are special pairing rules for MMX instructions on the PMMX:
<ul>
<li>MMX shift, pack or unpack instructions can execute in either pipe but cannot pair with
other MMX shift, pack or unpack instructions.
<li>MMX multiply instructions can execute in either pipe but cannot pair with other MMX
multiply instructions. They take 3 clock cycles and the last 2 clock cycles can overlap
with subsequent instructions in the same way as floating point instructions can (see
chapter <a href="#24">24</a>).
<li>an MMX instruction which accesses memory or integer registers can execute only in the
U-pipe and cannot pair with a non-MMX instruction.
</ul>
<p>
<h3><a name="10_2">10.2 Imperfect pairing</a></h3>
There are situations where the two instructions in a pair will not execute simultaneously, or
will only partially overlap in time. They should still be considered a pair, though, because the
first instruction executes in the U-pipe, and the second in the V-pipe. No subsequent
instruction can start to execute before both instructions in the imperfect pair have finished.
<p>
Imperfect pairing will happen in the following cases:
<p>
<u>1.</u> If the second instruction suffers an AGI stall (see chapter <a href="#9">9</a>).
<p>
<u>2.</u> Two instructions cannot access the same DWORD of memory simultaneously.
The following examples assume that <kbd>ESI</kbd> is divisible by 4:<br>
<kbd> MOV AL, [ESI] / MOV BL, [ESI+1]</kbd><br>
The two operands are within the same DWORD, so they cannot execute
simultaneously. The pair takes 2 clock cycles.<br>
<kbd> MOV AL, [ESI+3] / MOV BL, [ESI+4]</kbd><br>
Here the two operands are on each side of a DWORD boundary, so they
pair perfectly, and take only one clock cycle.
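<p>
Rule 2 can be checked mechanically: two accesses collide when their addresses fall in the same aligned 4-byte block, i.e. when the addresses agree in all bits above the lowest two. A small sketch (the function name is mine; Python integers stand in for addresses):

```python
# Two memory operands fall in the same aligned DWORD exactly when
# their addresses are equal after discarding the lowest two bits.
def same_dword(addr1, addr2):
    return addr1 >> 2 == addr2 >> 2

ESI = 1000  # divisible by 4, as the examples above assume

collides = same_dword(ESI, ESI + 1)          # MOV AL,[ESI] / MOV BL,[ESI+1]
pairs_ok = not same_dword(ESI + 3, ESI + 4)  # opposite sides of a boundary
```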
<p>
<u>3.</u> Rule 2 is extended to the case where bits 2-4 are the same in the two addresses (cache
bank conflict). For DWORD addresses this means that the difference between the two
addresses should not be divisible by 32. Examples:
<pre> MOV [ESI], EAX / MOV [ESI+32000], EBX ; imperfect pairing
 MOV [ESI], EAX / MOV [ESI+32004], EBX ; perfect pairing</pre>
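<p>
The extended rule can be sketched the same way: two addresses select the same cache bank when bits 2-4 are equal. This accounts for the examples above, since 32000 is divisible by 32 while 32004 is not (the helper names are mine):

```python
# Addresses conflict in the cache banks when they agree in bits 2-4
# (the same-DWORD case is already covered by rule 2 above).
def bank_bits(addr):
    return (addr >> 2) & 7      # extract bits 2-4

def bank_conflict(addr1, addr2):
    return bank_bits(addr1) == bank_bits(addr2)
```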
<p>
Pairable integer instructions which do not access memory take one clock cycle to
execute, except for mispredicted jumps. <kbd>MOV</kbd> instructions to or from
memory also take
only one clock cycle if the data area is in the cache and properly aligned. There is no
speed penalty for using complex addressing modes such as scaled index registers.
<p>
A pairable integer instruction which reads from memory, does some calculation, and
stores the result in a register or flags, takes 2 clock cycles (read/modify instructions).
<p>
A pairable integer instruction which reads from memory, does some calculation, and
writes the result back to memory, takes 3 clock cycles (read/modify/write
instructions).
<p>
<u>4.</u> If a read/modify/write instruction is paired with a read/modify or read/modify/write
instruction, then they will pair imperfectly.
<p>
The number of clock cycles used is given in the following table:
<table border=1 cellpadding=1 cellspacing=1><tr>
<td align="center" class="a3">
First instruction</td>
<td colspan=3 align="center" class="a3">Second instruction</td></tr>
<tr><td> </td><td align=center> MOV or register only
</td><td align=center> read/modify </td>
<td align=center> read/modify/write </td></tr>
<tr><td align=center> MOV or register only </td>
<td align=center> 1 </td>
<td align=center> 2 </td>
<td align=center> 3 </td></tr>
<tr><td align=center> read/modify </td>
<td align=center> 2 </td>
<td align=center> 2 </td>
<td align=center> 3 </td></tr>
<tr><td align=center> read/modify/write </td>
<td align=center> 3 </td>
<td align=center> 4 </td>
<td align=center> 5 </td></tr>
</table>
<p>Example:<br><kbd> ADD [mem1], EAX / ADD EBX, [mem2] ; 4 clock cycles<br> ADD EBX, [mem2] / ADD [mem1], EAX ; 3 clock cycles</kbd>
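<p>
The table can be transcribed directly into a lookup, which is convenient when hand-scheduling pairs. A sketch (the class labels and dictionary name are mine; "mov" stands for "MOV or register only"):

```python
# Clock cycles for an imperfect pair, by instruction class, transcribed
# from the table above: "mov" = MOV or register-only, "rm" = read/modify,
# "rmw" = read/modify/write.
PAIR_CYCLES = {
    ("mov", "mov"): 1, ("mov", "rm"): 2, ("mov", "rmw"): 3,
    ("rm",  "mov"): 2, ("rm",  "rm"): 2, ("rm",  "rmw"): 3,
    ("rmw", "mov"): 3, ("rmw", "rm"): 4, ("rmw", "rmw"): 5,
}

# ADD [mem1],EAX is read/modify/write, ADD EBX,[mem2] is read/modify,
# so the order of the two instructions decides between 4 and 3 cycles.
```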
<p>
<u>5.</u> When two paired instructions both take extra time due to cache misses, misalignment,
or jump misprediction, then the pair will take more time than each instruction alone, but less
than the sum of the two.
<p>
<u>6.</u> A pairable floating point instruction followed by <kbd>FXCH</kbd>
will make imperfect pairing if the
next instruction is not a floating point instruction.
<p>
In order to avoid imperfect pairing you have to know which instructions go into the U-pipe,
and which into the V-pipe. You can find this out by looking backwards in your code and
searching for instructions which are unpairable, pairable only in one of the pipes, or cannot pair
due to one of the rules above.
<p>
Imperfect pairing can often be avoided by reordering your instructions.
Example:
<br>
<pre>L1: MOV EAX,[ESI]
    MOV EBX,[ESI]
    INC ECX</pre><p>
Here the two <kbd>MOV</kbd> instructions form an imperfect pair because they both access the same
memory location, and the sequence takes 3 clock cycles. You can improve the code by
reordering the instructions so that <kbd>INC ECX</kbd> pairs with one of the
<kbd>MOV</kbd> instructions.
<pre>L2: MOV EAX,OFFSET A
    XOR EBX,EBX
    INC EBX
    MOV ECX,[EAX]
    JMP L1</pre><p>
The pair <kbd>INC EBX / MOV ECX,[EAX]</kbd> is imperfect because the latter
instruction has an
AGI stall. The sequence takes 4 clocks. If you insert a <kbd>NOP</kbd> or any other instruction so that
<kbd>MOV ECX,[EAX]</kbd> pairs with <kbd>JMP L1</kbd> instead, then the sequence takes only 3 clocks.
<p>
<a name="imperfectpush">
The next example is in 16 bit mode, assuming that <kbd>SP</kbd> is divisible by 4:</a>
<pre>L3: PUSH AX
    PUSH BX
    PUSH CX
    PUSH DX
    CALL FUNC</pre><p>
Here the <kbd>PUSH</kbd> instructions form two imperfect pairs, because both operands in each pair go
into the same DWORD of memory. <kbd>PUSH BX</kbd> could possibly pair perfectly
with <kbd>PUSH CX</kbd>
(because they go on each side of a DWORD boundary) but it doesn't because it has already
been paired with <kbd>PUSH AX</kbd>. The sequence therefore takes 5 clocks. If you insert a <kbd>NOP</kbd> or
any other instruction so that <kbd>PUSH BX</kbd> pairs with <kbd>PUSH CX</kbd>,
and <kbd>PUSH DX</kbd> with <kbd>CALL FUNC</kbd>,
then the sequence will take only 3 clocks. Another way to solve the problem is to make sure
that <kbd>SP</kbd> is not divisible by 4. Knowing whether <kbd>SP</kbd> is
divisible by 4 or not in 16 bit mode can
be difficult, so the best way to avoid this problem is to use 32 bit mode.
<p>
<h2><a name="11">11</a>. Splitting complex instructions into simpler ones (PPlain and PMMX)</h2>
You may split up read/modify and read/modify/write instructions to improve pairing.
Example:<br>
<kbd> ADD [mem1],EAX / ADD [mem2],EBX ; 5 clock cycles</kbd><br>
This code may be split up into a sequence which takes only 3 clock cycles:
<pre> MOV ECX,[mem1] / MOV EDX,[mem2] / ADD ECX,EAX / ADD EDX,EBX
 MOV [mem1],ECX / MOV [mem2],EDX</pre><p>
Likewise you may split up non-pairable instructions into pairable instructions:
<pre> PUSH [mem1]
 PUSH [mem2]                   ; non-pairable</pre><p>
Split up into:
<pre> MOV EAX,[mem1]
 MOV EBX,[mem2]
 PUSH EAX
 PUSH EBX                      ; everything pairs</pre><p>
Other examples of non-pairable instructions which may be split up into simpler pairable
instructions:<br>
<kbd>CDQ</kbd> split into: <kbd>MOV EDX,EAX / SAR EDX,31</kbd><br>
<kbd>NOT EAX</kbd> change to <kbd>XOR EAX,-1</kbd><br>
<kbd>NEG EAX</kbd> split into <kbd>XOR EAX,-1 / INC EAX</kbd><br>
<kbd>MOVZX EAX,BYTE PTR [mem]</kbd> split into <kbd> XOR EAX,EAX / MOV AL,BYTE PTR [mem]</kbd><br>
<kbd>JECXZ</kbd> split into <kbd>TEST ECX,ECX / JZ</kbd><br>
<kbd>LOOP</kbd> split into <kbd>DEC ECX / JNZ</kbd><br>
<kbd>XLAT</kbd> change to <kbd>MOV AL,[EBX+EAX]</kbd>
<p>
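<p>
The arithmetic splits above rest on two's complement identities which are easy to verify, for example with 32-bit masking (the helper names are mine; Python integers stand in for register values):

```python
MASK = 0xFFFFFFFF  # width of a 32-bit register

def sar32(x, count):
    """Arithmetic (sign-preserving) right shift of a 32-bit value."""
    if x & 0x80000000:
        x -= 1 << 32
    return (x >> count) & MASK

for eax in (0, 1, 5, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF):
    # NOT EAX gives the same result as XOR EAX,-1
    assert (~eax) & MASK == eax ^ MASK
    # NEG EAX gives the same result as XOR EAX,-1 / INC EAX
    assert (-eax) & MASK == ((eax ^ MASK) + 1) & MASK
    # CDQ fills EDX with the sign of EAX, as does MOV EDX,EAX / SAR EDX,31
    edx = MASK if eax & 0x80000000 else 0
    assert sar32(eax, 31) == edx
```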
If splitting instructions doesn't improve speed, then you may keep the complex or
non-pairable instructions in order to reduce code size.
<p>
Splitting instructions is not needed on the PPro, PII and PIII, except when the split instructions
generate fewer uops.
<p>
<h2><a name="12">12</a>. Prefixes (PPlain and PMMX)</h2>
An instruction with one or more prefixes may not be able to execute in the V-pipe (see
<a href="#10-7">chapter 10, sect. 7</a>), and it may take more than one clock cycle to decode.
<p>
On the PPlain, the decoding delay is one clock cycle for each prefix except for the <kbd>0FH</kbd>
prefix of conditional near jumps.
<p>
The PMMX has no decoding delay for the <kbd>0FH</kbd> prefix.
Segment and repeat prefixes take one
clock extra to decode. Address and operand size prefixes take two clocks extra to decode.
The PMMX can decode two instructions per clock cycle if the first instruction has a
segment or repeat prefix or no prefix, and the second instruction has no prefix. Instructions
with address or operand size prefixes can only decode alone on the PMMX. Instructions with
more than one prefix take one clock extra for each prefix.
<p>
Address size prefixes can be avoided by using 32 bit mode. Segment prefixes can be
avoided in 32 bit mode by using a flat memory model. Operand size prefixes can be avoided
in 32 bit mode by using only 8 bit and 32 bit integers.
<p>
Where prefixes are unavoidable, the decoding delay may be masked if a preceding
instruction takes more than one clock cycle to execute. The rule for the PPlain is that any
instruction which takes N clock cycles to execute (not to decode) can 'overshadow' the
decoding delay of N-1 prefixes in the next two (sometimes three) instructions or instruction
pairs. In other words, each extra clock cycle that an instruction takes to execute can be
used to decode one prefix in a later instruction. This shadowing effect even extends across
a predicted branch. Any instruction which takes more than one clock cycle to execute, and
any instruction which is delayed because of an AGI stall, cache miss, misalignment, or any
other reason except decoding delay and branch misprediction, has a shadowing effect.
<p>
The PMMX has a similar shadowing effect, but the mechanism is different. Decoded
instructions are stored in a transparent first-in-first-out (FIFO) buffer, which can hold up to
four instructions. As long as there are instructions in the FIFO buffer you get no delay.
When the buffer is empty then instructions are executed as soon as they are decoded. The
buffer is filled when instructions are decoded faster than they are executed, i.e. when you
have unpaired or multi-cycle instructions. The FIFO buffer is emptied when instructions
execute faster than they are decoded, i.e. when you have decoding delays due to prefixes.
The FIFO buffer is empty after a mispredicted branch. The FIFO buffer can receive two
instructions per clock cycle provided that the second instruction is without prefixes and
none of the instructions are longer than 7 bytes. The two execution pipelines (U and V) can
each receive one instruction per clock cycle from the FIFO buffer.
<p>
Examples: <br>
<kbd> CLD / REP MOVSD</kbd><br>
The <kbd>CLD</kbd> instruction takes two clock cycles and can therefore overshadow the decoding
delay of the <kbd>REP</kbd> prefix. The code would take one clock cycle more if the <kbd>CLD</kbd> instruction were
placed far from the <kbd>REP MOVSD.</kbd>
<br>
<kbd> CMP DWORD PTR [EBX],0 / MOV EAX,0 / SETNZ AL</kbd><br>
The <kbd>CMP</kbd> instruction takes two clock cycles here because it is a read/modify instruction. The
<kbd>0FH</kbd> prefix of the <kbd>SETNZ</kbd> instruction is decoded during the second clock cycle of the <kbd>CMP </kbd>
instruction, so that the decoding delay is hidden on the PPlain (the PMMX has no
decoding delay for the <kbd>0FH</kbd> prefix).
<p>
Prefix penalties on the PPro, PII and PIII are described in chapter <a href="#14">14</a>.
<p>
<h2><a name="13">13</a>. Overview of PPro, PII and PIII pipeline</h2>
The architecture of the PPro, PII and PIII microprocessors is well explained and illustrated in
various manuals and tutorials from Intel. It is recommended that you study this material in
order to get an understanding of how these microprocessors work. I will describe the
structure briefly here with particular focus on those elements that are important for
optimizing code.
<p>
Instruction codes are fetched from the code cache in aligned 16-byte chunks into a double
buffer that can hold two 16-byte chunks. The code is passed on from the double buffer to
the decoders in blocks which I will call ifetch blocks (instruction fetch blocks). The ifetch
blocks are usually 16 bytes long, but not aligned. The purpose of the double buffer is to
make it possible to decode an instruction that crosses a 16-byte boundary (i.e. an address
divisible by 16).
<p>
The ifetch block goes to the instruction length decoder, which determines where each
instruction begins and ends, and next to the instruction decoders. There are three decoders
so that you can decode up to three instructions in each clock cycle. A group of up to three
instructions that are decoded in the same clock cycle is called a decode group.
<p>
The decoders translate instructions into micro-operations, abbreviated uops. Simple
instructions generate only one uop, while more complex instructions may generate several
uops. For example, the instruction <kbd>ADD EAX,[MEM]</kbd> is decoded into two uops: one for
reading the source operand from memory, and one for doing the addition. The purpose of
splitting instructions into uops is to make the handling later in the system more effective.
<p>
The three decoders are called D0, D1, and D2. D0 can handle all instructions, while D1
and D2 can handle only simple instructions that generate one uop.
<p>
The uops from the decoders go via a short queue to the register allocation table (RAT). The
execution of uops works on temporary registers which are later written to the permanent
registers <kbd>EAX, EBX</kbd>, etc. The purpose of the RAT is to tell the uops which temporary
registers to use, and to allow register renaming (see later).
<p>
After the RAT, the uops go to the reorder buffer (ROB). The purpose of the ROB is to
enable out-of-order execution. A uop stays in the ROB until the operands it
needs are available. If an operand for one uop is delayed because a previous uop that
generates the operand is not finished yet, then the ROB may find another uop later in the
queue that can be executed in the meantime in order to save time.
<p>
The uops that are ready for execution are sent to the execution units, which are clustered
around five ports: ports 0 and 1 can handle arithmetic operations, jumps, etc. Port 2 takes
care of all reads from memory, port 3 calculates addresses for memory writes, and port 4
does memory writes.
<p>
When an instruction has been executed, it is marked in the ROB as ready to retire. It then goes to
the retirement station. Here the contents of the temporary registers used by the uops are
written to the permanent registers. While uops can be executed out of order, they must be
retired in order.
<p>
In the following chapters, I will describe in detail how to optimize the throughput of each step
in the pipeline.
<p>
<h2><a name="14">14</a>. Instruction decoding (PPro, PII and PIII)</h2>
I am describing instruction decoding before instruction fetching here because you need to
know how the decoders work in order to understand the possible delays in instruction
fetching.
<p>
The decoders can handle three instructions per clock cycle, but only when certain
conditions are met. Decoder D0 can handle any instruction that generates up to 4 uops in a
single clock cycle. Decoders D1 and D2 can handle only instructions that generate 1 uop,
and these instructions can be no more than 8 bytes long.
<p>
To summarize the rules for decoding two or three instructions in the same clock cycle:
<ul>
<li>The first instruction (D0) generates no more than 4 uops,
<li>The second and third instructions generate no more than 1 uop each,
<li>The second and third instructions are no more than 8 bytes long each,
<li>The instructions must be contained within the same 16-byte ifetch block (see next
chapter).
</ul>
There is no limit to the length of the instruction in D0 (despite Intel manuals saying
otherwise), as long as the three instructions fit into one 16-byte ifetch block.
<p>
An instruction that generates more than 4 uops takes two or more clock cycles to decode,
and no other instructions can decode in parallel.
<p>
It follows from the rules above that the decoders can produce a maximum of 6 uops per
clock cycle if the first instruction in each decode group generates 4 uops and the next two
generate 1 uop each. The minimum production is 2 uops per clock cycle, which you get
when all instructions generate 2 uops each, so that D1 and D2 are never used.
<p>
For maximum throughput, it is recommended that you order your instructions according to
the 4-1-1 pattern: instructions that generate 2 to 4 uops can be interspersed with two
simple 1-uop instructions for free, in the sense that they do not add to the decoding time.
Example:
<pre>MOV EBX, [MEM1]  ; 1 uop  (D0)
INC EBX          ; 1 uop  (D1)
ADD EAX, [MEM2]  ; 2 uops (D0)
ADD [MEM3], EAX  ; 4 uops (D0)</pre><p>
This takes 3 clock cycles to decode. You can save one clock cycle by reordering the
instructions into two decode groups:
<pre>ADD EAX, [MEM2]  ; 2 uops (D0)
MOV EBX, [MEM1]  ; 1 uop  (D1)
INC EBX          ; 1 uop  (D2)
ADD [MEM3], EAX  ; 4 uops (D0)</pre><p>
The decoders now generate 8 uops in two clock cycles, which is probably satisfactory.
Later stages in the pipeline can handle only 3 uops per clock cycle, so with a decoding rate
higher than this you can assume that decoding is not a bottleneck. However, complications
in the fetch mechanism can delay decoding as described in the next chapter, so to be safe
you may want to aim at a decoding rate higher than 3 uops per clock cycle.
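<p>
The grouping rules are mechanical enough to simulate. The sketch below is my own simplification (ifetch-block limits and the 8-byte rule are ignored, and instructions of more than 4 uops are modelled only roughly); it counts the decode cycles for a sequence of uop counts:

```python
# Greedy decode-group simulation for the PPro/PII/PIII decoders:
# D0 takes any instruction of up to 4 uops; D1 and D2 each take one
# additional 1-uop instruction from the sequence.  Instructions of
# more than 4 uops decode alone over several cycles (roughly modelled).
def decode_cycles(uop_counts):
    cycles = 0
    i = 0
    while i < len(uop_counts):
        first = uop_counts[i]
        i += 1
        if first > 4:
            cycles += (first + 3) // 4   # rough: multi-cycle, decodes alone
            continue
        cycles += 1
        slots = 2                        # D1 and D2
        while slots and i < len(uop_counts) and uop_counts[i] == 1:
            slots -= 1
            i += 1
    return cycles
```

With this model, the first ordering in the example above (uop counts 1, 1, 2, 4) decodes in 3 cycles, and the reordered version (2, 1, 1, 4) in 2 cycles.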
<p>
You can see how many uops each instruction generates in the tables in chapter <a href="#29">29</a>.
<p>
Instruction prefixes can also incur penalties in the decoders. Instructions can have several
kinds of prefixes:
<ul>
<li>An operand size prefix is needed when you have a 16-bit operand in a 32-bit
environment or vice versa. (Except for instructions that can only have one operand size,
such as <kbd>FNSTSW AX</kbd>). An operand size prefix gives a penalty of a few clocks if the
instruction has an immediate operand of 16 or 32 bits because the length of the operand
is changed by the prefix. Examples:
<pre> ADD BX, 9               ; no penalty because immediate operand is 8 bits
 MOV WORD PTR [MEM16], 9 ; penalty because operand is 16 bits </pre>
The last instruction should be changed to:
<pre> MOV EAX, 9
 MOV WORD PTR [MEM16], AX ; no penalty because no immediate</pre>
<li>An address size prefix is used when you use 32-bit addressing in 16-bit mode or vice
versa. This is seldom needed and should generally be avoided. The address size prefix
gives a penalty whenever you have an explicit memory operand (even when there is no
displacement) because the interpretation of the r/m bits in the instruction code is
changed by the prefix. Instructions with only implicit memory operands, such as string
instructions, have no penalty with an address size prefix.
<li>Segment prefixes are used when you address data in a non-default data segment.
Segment prefixes give no penalty on the PPro, PII and PIII.
<li>Repeat prefixes and lock prefixes give no penalty in the decoders.
<li>There is always a penalty if you have more than one prefix. This penalty is usually one
clock per prefix.
</ul>
<p>
<h2><a name="15">15</a>. Instruction fetch (PPro, PII and PIII)</h2>
The code is fetched in aligned 16-byte chunks from the code cache and placed in the
double buffer, which is called so because it can contain two such chunks. The code is then
taken from the double buffer and fed to the decoders in blocks which are usually 16 bytes
long, but not necessarily aligned by 16. I will call these blocks ifetch blocks (instruction fetch
blocks). If an ifetch block crosses a 16-byte boundary in the code then it needs data from
both chunks in the double buffer. So the purpose of the double buffer is to allow instruction
fetching across 16-byte boundaries.
<p>
The double buffer can fetch one 16-byte chunk per clock cycle and can generate one
ifetch block per clock cycle. The ifetch blocks are usually 16 bytes long, but can be shorter
if there is a predicted jump in the block. (See chapter <a href="#22">22</a> about jump prediction).
<p>
Unfortunately, the double buffer is not big enough for handling fetches around jumps
without delay. If the ifetch block that contains the jump instruction crosses a 16-byte
boundary then the double buffer needs to keep two consecutive aligned 16-byte chunks of
code in order to generate it. If the first instruction after the jump crosses a 16-byte
boundary, then the double buffer needs to load two new 16-byte chunks of code before a
valid ifetch block can be generated. This means that, in the worst case, the decoding of the
first instruction after a jump can be delayed for two clock cycles. You get one penalty for a
16-byte boundary in the ifetch block containing the jump instruction, and one penalty for a
16-byte boundary in the first instruction after the jump. You can get a bonus if you have more
than one decode group in the ifetch block that contains the jump because this gives the
double buffer extra time to fetch one or two 16-byte chunks of code in advance for the
instructions after the jump. The bonuses can compensate for the penalties according to the
table below. If the double buffer has fetched only one 16-byte chunk of code after the jump,
then the first ifetch block after the jump will be identical to this chunk, that is, aligned to a
16-byte boundary. In other words, the first ifetch block after the jump will not begin at the
first instruction, but at the nearest preceding address divisible by 16. If the double buffer
has had time to load two 16-byte chunks, then the new ifetch block can cross a 16-byte
boundary and begin at the first instruction after the jump. These rules are summarized in
the following <a name="ifetchtable">table:</a><p>
<table border=1 cellpadding=1 cellspacing=1><tr><td class="a3">
Number of<br>decode groups<br>in ifetch-block<br>containing jump</td>
<td class="a3">16-byte<br>boundary in this<br>ifetch-block</td>
<td class="a3">16-byte<br>boundary in first<br>instruction after<br>jump</td>
<td class="a3"><br><br>decoder delay</td>
<td class="a3">alignment of first<br>ifetch after jump</td></tr>
<tr><td align="center">1</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">by 16</td></tr>
<tr><td align="center">1</td>
<td align="center">0</td>
<td align="center">1</td>
<td align="center">1</td>
<td align="center">to instruction</td></tr>
<tr><td align="center">1</td>
<td align="center">1</td>
<td align="center">0</td>
<td align="center">1</td>
<td align="center">by 16</td></tr>
<tr><td align="center">1</td>
<td align="center">1</td>
<td align="center">1</td>
<td align="center">2</td>
<td align="center">to instruction</td></tr>
<tr><td align="center">2</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">to instruction</td></tr>
<tr><td align="center">2</td>
<td align="center">0</td>
<td align="center">1</td>
<td align="center">0</td>
<td align="center">to instruction</td></tr>
<tr><td align="center">2</td>
<td align="center">1</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">by 16</td></tr>
<tr><td align="center">2</td>
<td align="center">1</td>
<td align="center">1</td>
<td align="center">1</td>
<td align="center">to instruction</td></tr>
<tr><td align="center">3 or more</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">to instruction</td></tr>
<tr><td align="center">3 or more</td>
<td align="center">0</td>
<td align="center">1</td>
<td align="center">0</td>
<td align="center">to instruction</td></tr>
<tr><td align="center">3 or more</td>
<td align="center">1</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">to instruction</td></tr>
<tr><td align="center">3 or more</td>
<td align="center">1</td>
<td align="center">1</td>
<td align="center">0</td>
<td align="center">to instruction</td></tr>
</table>
<p>Jumps delay the fetching so that a loop always takes at least two clock cycles more per
iteration than the number of 16-byte boundaries in the loop.
<p>
A further problem with the instruction fetch mechanism is that a new ifetch block is not
generated until the previous one is exhausted. Each ifetch block can contain several
decode groups. If a 16-byte ifetch block ends with an unfinished instruction, then the
next ifetch block will begin at the beginning of that instruction. The first instruction in an
ifetch block always goes to decoder D0, and the next two instructions go to D1 and D2, if
possible. The consequence of this is that D1 and D2 are used less than optimally. If the
code is structured according to the recommended 4-1-1 pattern, and an instruction intended
to go into D1 or D2 happens to be the first instruction in an ifetch block, then that instruction
has to go into D0, with the result that one clock cycle is wasted.
This is probably a hardware design flaw. At least it is suboptimal design.
The consequence of this problem is that the time it takes to decode a piece of
code can vary considerably depending on where the first ifetch block begins.
<p>
If decoding speed is critical and you want to avoid these problems, then you have to know
where each ifetch block begins. This is quite a tedious job. First you need to make your
code segment paragraph-aligned in order to know where the 16-byte boundaries are. Then
you have to look at the output listing from your assembler to see how long each instruction
is. (It is recommended that you study how instructions are coded so that you can predict the
lengths of the instructions.) If you know where one ifetch block begins, then you can find
where the next ifetch block begins in the following way: Make the block 16 bytes long. If it
ends at an instruction boundary, then the next block will begin there. If it ends with an
unfinished instruction, then the next block will begin at the beginning of this instruction.
(Only the lengths of the instructions count here; it doesn't matter how many uops they
generate or what they do). This way you can work your way all through the code and mark
where each ifetch block begins. The only problem is knowing where to start. If you know
where one ifetch block is, then you can find all the subsequent ones, but you have to know
where the first one begins. Here are some guidelines:
<ul>
<li>The first ifetch block after a jump, call, or return can begin either at the first instruction or
at the nearest preceding 16-byte boundary, according to the table above. If you align
the first instruction to begin at a 16-byte boundary, then you can be sure that the first
ifetch block begins here. You may want to align important subroutine entries and loop
entries by 16 for this purpose.

<li>If the combined length of two consecutive instructions is more than 16 bytes, then you
can be certain that the second one doesn't fit into the same ifetch block as the first one,
and consequently you will always have an ifetch block beginning at the second
instruction. You can use this as a starting point for finding where subsequent ifetch
blocks begin.

<li>The first ifetch block after a branch misprediction begins at a 16-byte boundary. As
explained in chapter <a href="#22_2">22.2</a>, a loop that repeats more than 5 times will always have a
misprediction when it exits. The first ifetch block after such a loop will therefore begin at
the nearest preceding 16-byte boundary.

<li>Other serializing events also cause the next ifetch block to start at a 16-byte boundary.
Such events include interrupts, exceptions, self-modifying code, and serializing
instructions such as <kbd>CPUID, IN,</kbd> and <kbd>OUT</kbd>.
</ul>
<p>

I am sure you want an example now:

<pre>
address   instruction             length  uops  expected decoder
----------------------------------------------------------------------
1000h     MOV ECX, 1000              5     1      D0
1005h LL: MOV [ESI], EAX             2     2      D0
1007h     MOV [MEM], 0              10     2      D0
1011h     LEA EBX, [EAX+200]         6     1      D1
1017h     MOV BYTE PTR [ESI], 0      3     2      D0
101Ah     BSR EDX, EAX               3     2      D0
101Dh     MOV BYTE PTR [ESI+1],0     4     2      D0
1021h     DEC ECX                    1     1      D1
1022h     JNZ LL                     2     1      D2</pre>
<p>
Let's assume that the first ifetch block begins at address 1000h and ends at 1010h. This is
before the end of the <kbd>MOV [MEM],0</kbd> instruction, so the next ifetch block will begin at 1007h
and end at 1017h. This is at an instruction boundary, so the third ifetch block will begin at
1017h and cover the rest of the loop. The number of clock cycles it takes to decode this is
the number of D0 instructions, which is 5 per iteration of the LL loop. The last ifetch block
contained three decode groups covering the last five instructions, and it has one 16-byte
boundary (1020h). Looking at the table above, we find that the first ifetch block after the
jump will begin at the first instruction after the jump, that is the <kbd>LL</kbd>
label at 1005h, and end at
1015h. This is before the end of the <kbd>LEA</kbd> instruction, so the next ifetch block will go from
1011h to 1021h, and the last one from 1021h covering the rest. Now the <kbd>LEA</kbd> instruction
and the <kbd>DEC</kbd> instruction both fall at the beginning of an ifetch block, which forces them to go
into D0. We now have 7 instructions in D0, and the loop takes 7 clocks to decode in the
second iteration. The last ifetch block contains only one decode group
(<kbd>DEC ECX / JNZ LL</kbd>) and has no 16-byte boundary. According to the table, the next ifetch block after the
jump will begin at a 16-byte boundary, which is 1000h. This will give us the same situation
as in the first iteration, and you will see that the loop takes alternately 5 and 7 clock cycles
to decode. Since there are no other bottlenecks, the complete loop will take 6000 clocks to
run 1000 iterations. If the starting address had been different so that you had a 16-byte
boundary in the first or the last instruction of the loop, then it would take 8000 clocks. If you
reorder the loop so that no D1 or D2 instructions fall at the beginning of an ifetch block, then
you can make it take only 5000 clocks.
<p>
The example above was deliberately constructed so that fetch and decoding is the only
bottleneck. The easiest way to avoid this problem is to structure your code to generate
much more than 3 uops per clock cycle, so that decoding will not be a bottleneck despite the
penalties described here. In small loops this may not be possible, and then you have to find
out how to optimize the instruction fetch and decoding.
<p>
One thing you can do is to change the starting address of your procedure in order to avoid
16-byte boundaries where you don't want them. Remember to make your code segment
paragraph-aligned so that you know where the boundaries are.
<p>
If you insert an <kbd>ALIGN 16</kbd> directive before the loop entry, then the assembler will put in
<kbd>NOP</kbd>'s and other filler instructions up to the nearest 16-byte boundary. Most assemblers use
the instruction <kbd>XCHG EBX,EBX</kbd> as a 2-byte filler (the so-called 2-byte <kbd>NOP</kbd>). Whoever got
this idea, it was a bad one, because this instruction takes more time than two <kbd>NOP</kbd>'s on most
processors! If the loop executes many times, then whatever is outside the loop is
unimportant in terms of speed, and you don't have to care about the suboptimal filler
instructions. But if the time taken by the fillers is important, then you may
select the filler instructions manually.
You may as well use filler instructions that do something useful, such as refreshing
a register in order to avoid register read stalls (see chapter <a href="#16_2">16.2</a>).
For example, if you are using register <kbd>EBP</kbd> for addressing but seldom
write to it, then you may use <kbd>MOV EBP,EBP</kbd> or <kbd>ADD EBP, 0</kbd> as
filler in order to reduce the possibilities of register read stalls.
If you have nothing useful to do, you may use <kbd>FXCH ST(0)</kbd> as a good filler
because it doesn't put any load on the execution ports, provided that <kbd>ST(0)</kbd>
contains a valid floating point value.
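<p>
To illustrate, a loop entry could be padded by hand like this (a hypothetical
fragment; the label name is arbitrary, and the filler choices assume that
<kbd>EBP</kbd> is used for addressing and that <kbd>ST(0)</kbd> holds a valid
floating point value):
<pre>        MOV EBP, EBP             ; 2-byte filler, also refreshes EBP
        DB 3Eh                   ; DS: prefix makes the next filler 1 byte longer
        MOV EBP, EBP             ; 3 bytes including the prefix
        FXCH ST(0)               ; 2-byte filler, no load on the execution ports
LOOPENTRY:                       ; now at the desired 16-byte boundary
        ...</pre><p>
Choose the number and lengths of the fillers so that they add up to exactly the
padding needed to reach the next 16-byte boundary.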
<p>
Another possible remedy is to reorder your instructions in order to get the ifetch boundaries
where they don't hurt. This can be quite a difficult puzzle, and it is not always possible to find
a satisfactory solution.
<p>
Yet another possibility is to manipulate instruction lengths. Sometimes you can substitute
one instruction with another one with a different length. Many instructions can be coded in
different versions with different lengths. The assembler always chooses the shortest
possible version of an instruction, but it is often possible to hard-code a longer version. For
example, <kbd>DEC ECX</kbd> is one byte long, <kbd>SUB ECX,1</kbd> is 3 bytes, and you can code a 6-byte
version with a long immediate operand using this trick:
<pre>        SUB ECX, 9999
        ORG $-4
        DD 1</pre><p>
Instructions with a memory operand can be made one byte longer with a SIB byte, but the
easiest way of making an instruction one byte longer is to add a <kbd>DS:</kbd>
segment prefix (<kbd>DB 3Eh</kbd>).
The microprocessors generally accept redundant and meaningless prefixes (except
<kbd>LOCK</kbd>) as long as the instruction length does not exceed 15 bytes. Even instructions without
a memory operand can have a segment prefix. So if you want the <kbd>DEC ECX</kbd> instruction to be
2 bytes long, write:
<pre>        DB 3Eh
        DEC ECX</pre><p>
Remember that you get a penalty in the decoder if an instruction has more than one prefix.
It is possible that instructions with meaningless prefixes - especially repeat and
lock prefixes - will be used in future processors for new instructions when there are no
more vacant instruction codes, but I would consider it safe to use a segment prefix
with any instruction.
<p>
With these methods it will usually be possible to put the ifetch boundaries where you want
them, although it can be a tedious puzzle.
<p>
<h2><a name="16">16</a>. Register renaming (PPro, PII and PIII)</h2>
<h3><a name="16_1">16.1 Eliminating dependencies</a></h3>
Register renaming is an advanced technique used by these microprocessors to remove
dependencies between different parts of the code. Example:
<pre>        MOV EAX, [MEM1]
        IMUL EAX, 6
        MOV [MEM2], EAX
        MOV EAX, [MEM3]
        INC EAX
        MOV [MEM4], EAX</pre><p>
Here the last three instructions are independent of the first three in the sense that they don't
need any result from the first three instructions. To optimize this on earlier processors, you
would have to use a different register instead of <kbd>EAX</kbd> in the last three instructions and
reorder the instructions so that the last three instructions could execute in parallel with the
first three instructions. The PPro, PII and PIII processors do this for you automatically. They
assign a new temporary register for <kbd>EAX</kbd> every time you write to it. Thereby the
<kbd>MOV EAX,[MEM3]</kbd> instruction becomes independent of the preceding instructions.
With out-of-order execution it is likely to finish the move to <kbd>[MEM4]</kbd> before the slow
<kbd>IMUL</kbd> instruction is finished.
<p>
Register renaming goes fully automatically. A new temporary register is assigned as an
alias for the permanent register every time an instruction writes to this register. An
instruction that both reads and writes a register also causes renaming. For example, the
<kbd>INC EAX</kbd> instruction above uses one temporary register for input and another temporary
register for output. This does not remove any dependency, of course, but it has some
significance for subsequent register reads, as I will explain later.
<p>
All general purpose registers, the stack pointer, the flags, floating point registers,
MMX registers, XMM registers and segment registers can be renamed.
The control words and the floating point status word cannot be renamed, and this is the
reason why the use of these registers is slow. There are 40 universal temporary
registers, so it is unlikely that you will run out of temporary registers.
<p>
A common way of setting a register to zero is <kbd>XOR EAX,EAX</kbd> or
<kbd>SUB EAX,EAX</kbd>. These
instructions are not recognized as independent of the previous value of the register. If you
want to remove the dependency on slow preceding instructions, then use
<kbd>MOV EAX,0.</kbd>
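<p>
The difference can be sketched like this (hypothetical code; the surrounding
instructions are just for illustration):
<pre>        DIV EBX                  ; slow instruction, writes EAX and EDX
        XOR EAX, EAX             ; reads EAX, so it must wait for the DIV

        DIV EBX
        MOV EAX, 0               ; only writes EAX, independent of the DIV</pre><p>
In the first variant the zeroing instruction is stuck in the dependency chain of
the division; in the second it can execute immediately.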
<p>
Register renaming is controlled by the register alias table (RAT) and the reorder buffer
(ROB). The uops from the decoders go to the RAT via a queue, and then to the ROB and
the reservation station. The RAT can handle only 3 uops per clock cycle. This means that
the overall throughput of the microprocessor can never exceed 3 uops per clock cycle on
average.
<p>
There is no practical limit to the number of renamings. The RAT can rename three registers
per clock cycle, and it can even rename the same register three times in one clock cycle.
<p>
<h3><a name="16_2">16.2</a> Register read stalls</h3>
But there is another limitation, which may be quite serious: you
can only read two different permanent register names per clock cycle. This
limitation applies to all registers used by an instruction except those registers
that the instruction only writes to.
Example:
<pre>        MOV [EDI + ESI], EAX
        MOV EBX, [ESP + EBP]</pre><p>
The first instruction generates two uops: one that reads <kbd>EAX</kbd>
and one that reads <kbd>EDI</kbd> and
<kbd>ESI</kbd>. The second instruction generates one uop that reads
<kbd>ESP</kbd> and <kbd>EBP</kbd>. <kbd>EBX</kbd> does not
count as a read because it is only written to by the instruction. Let's assume that these
three uops go through the RAT together. I will use the word triplet for a group of three
consecutive uops that go through the RAT together. Since the RAT can handle only two
permanent register reads per clock cycle, and we need five register reads, our triplet will be
delayed for two extra clock cycles before it comes to the reservation station. With 3 or 4
register reads in the triplet it would be delayed by one clock cycle.
<p>
The same register can be read more than once in the same triplet without adding to the
count. If the instructions above are changed to:
<pre>        MOV [EDI + ESI], EDI
        MOV EBX, [EDI + EDI]</pre><p>
then you will need only two register reads (<kbd>EDI</kbd> and <kbd>ESI</kbd>) and the triplet will not be delayed.
<p>
A register that is going to be written to by a pending uop is stored in the ROB, so that it can
be read for free until it is written back, which takes at least 3 clock cycles,
and usually more. Write-back is the end of the execution stage, where the value
becomes available. In other words, you can read any number of registers in the
RAT without stall if their values are not yet available from the execution units.
The reason for this is that when a value becomes available, it is immediately
written directly to any subsequent ROB entries that need it. But if the value
has already been written back to a temporary or permanent register when a
subsequent uop that needs it goes into the RAT, then the value has to be read
from the register file, which has only two read ports. There are three pipeline
stages from the RAT to the execution unit, so you can be certain that a register
written to in one uop-triplet can be read for free in at least the next three
triplets. If the writeback is delayed by reordering, slow instructions,
dependency chains, cache misses, or by any other kind of stall, then the
register can be read for free further down the instruction stream.
<p>
Example:
<pre>        MOV EAX, EBX
        SUB ECX, EAX
        INC EBX
        MOV EDX, [EAX]
        ADD ESI, EBX
        ADD EDI, ECX</pre><p>
These 6 instructions generate 1 uop each. Let's assume that the first 3 uops go through the
RAT together. These 3 uops read register <kbd>EBX</kbd>, <kbd>ECX</kbd>, and <kbd>EAX</kbd>. But since we are writing to
<kbd>EAX</kbd> before reading it, the read is free and we get no stall. The next three uops read <kbd>EAX</kbd>,
<kbd>ESI</kbd>, <kbd>EBX</kbd>, <kbd>EDI</kbd>, and <kbd>ECX</kbd>. Since <kbd>EAX</kbd>, <kbd>EBX</kbd> and <kbd>ECX</kbd> have all been modified in the
preceding triplet and not yet written back, they can be read for free, so that only <kbd>ESI</kbd>
and <kbd>EDI</kbd> count, and we get no stall in the second triplet either. If the
<kbd>SUB ECX,EAX</kbd>
instruction in the first triplet is changed to <kbd>CMP ECX,EAX</kbd>, then <kbd>ECX</kbd> is not written to, and we
will get a stall in the second triplet for reading <kbd>ESI</kbd>, <kbd>EDI</kbd> and <kbd>ECX</kbd>. Similarly, if the
<kbd>INC EBX</kbd> instruction in the first triplet is changed to <kbd>NOP</kbd> or something else, then we will get a stall
in the second triplet for reading <kbd>ESI</kbd>, <kbd>EBX</kbd> and <kbd>EDI</kbd>.
<p>
No uop can read more than two registers. Therefore, all instructions that need
to read more than two registers are split up into two or more uops.
<p>
To count the number of register reads, you have to include all registers which are read by
the instruction. This includes integer registers, the flags register, the stack pointer,
floating point registers and MMX registers.
An XMM register counts as two registers,
except when only part of it is used, as e.g. in <kbd>ADDSS</kbd> and <kbd>MOVHLPS</kbd>.
Segment registers and the instruction pointer do not count.
For example, in <kbd>SETZ AL</kbd> you count the flags
register but not <kbd>AL</kbd>. <kbd>ADD EBX,ECX</kbd> counts both <kbd>EBX</kbd> and <kbd>ECX</kbd>, but not the flags, because
they are written to only. <kbd>PUSH EAX</kbd> reads <kbd>EAX</kbd> and the
stack pointer and then writes to the stack pointer.
<p>
The <kbd>FXCH</kbd> instruction is a special case. It works by renaming, but doesn't read any values,
so that it doesn't count in the rules for register read stalls. An <kbd>FXCH</kbd> instruction behaves
like 1 uop that neither reads nor writes any registers with regard to the rules for register read
stalls.
<p>
Don't confuse uop triplets with decode groups. A decode group can generate from 1 to 6
uops, and even if the decode group has three instructions and generates three uops, there
is no guarantee that the three uops will go into the RAT together.
<p>
The queue between the decoders and the RAT is so short (10 uops) that you cannot
assume that register read stalls do not stall the decoders, or that fluctuations
in decoder throughput do not stall the RAT.
<p>
It is very difficult to predict which uops go through the RAT together unless the queue is
empty, and for optimized code the queue should be empty only after mispredicted branches.
Several uops generated by the same instruction do not necessarily go through the RAT
together; the uops are simply taken consecutively from the queue, three at a time. The
sequence is not broken by a predicted jump: uops before and after the jump can go through
the RAT together. Only a mispredicted jump will discard the queue and start over again, so
that the next three uops are sure to go into the RAT together.
<p>
If three consecutive uops read more than two different registers, then you would of course
prefer that they do not go through the RAT together. The probability that they do is one
third. The penalty of reading three or four written-back registers in one triplet of uops is one
clock cycle. You can think of the one clock delay as equivalent to the load of three more
uops through the RAT. With the probability of 1/3 of the three uops going into the RAT
together, the average penalty will be the equivalent of 3/3 = 1 uop. To calculate the average
time it will take for a piece of code to go through the RAT, add the number of potential
register read stalls to the number of uops and divide by three. You can see that it doesn't
pay to remove the stall by putting in an extra instruction unless you know for sure which
uops go into the RAT together, or you can prevent more than one potential register read stall
by one extra instruction.
<p>
In situations where you aim at a throughput of 3 uops per clock, the limit of two permanent
register reads per clock cycle may be a problematic bottleneck to handle. Possible ways to
remove register read stalls are:
<ul><li>keep uops that read the same register close together, so that they are likely to go into the
same triplet.
<li>keep uops that read different registers spaced, so that they cannot go into the same
triplet.
<li>place uops that read a register no more than 3 - 4 triplets after an instruction that writes
to or modifies this register, to make sure it hasn't been written back before it is read (it
doesn't matter if you have a jump between, as long as it is predicted). If you have reason
to expect the register write to be delayed for whatever reason, then you can safely read
the register somewhat further down the instruction stream.
<li>use absolute addresses instead of pointers in order to reduce the number of register
reads.
<li>you may rename a register in a triplet where it doesn't cause a stall, in order to prevent a
read stall for this register in one or more later triplets. Example:
<kbd>MOV ESP,ESP / ... / MOV EAX,[ESP+8]</kbd>.
This method costs an extra uop and therefore doesn't pay unless the expected average
number of read stalls prevented is more than 1/3.
</ul>
<p>
For instructions that generate more than one uop, you may want to know the order of the
uops generated by the instruction in order to make a precise analysis of the possibility of
register read stalls. I have therefore listed the most common cases below.
<p>
<u>Writes to memory</u><br>
A memory write generates two uops. The first one (to port 4) is a store operation, reading
the register to store. The second uop (port 3) calculates the memory address, reading any
pointer registers. Examples:<br>
<kbd>MOV [EDI], EAX</kbd><br>
First uop reads <kbd>EAX</kbd>, second uop reads <kbd>EDI</kbd>.<br>
<kbd>FSTP QWORD PTR [EBX+8*ECX]</kbd><br>
First uop reads <kbd>ST(0)</kbd>, second uop reads <kbd>EBX</kbd> and <kbd>ECX.</kbd>
<p>
<u>Read and modify</u><br>
An instruction that reads a memory operand and modifies a register by some arithmetic or
logical operation generates two uops. The first one (port 2) is a memory load instruction
reading any pointer registers, the second uop is an arithmetic instruction (port 0 or 1)
reading and writing to the destination register and possibly writing to the flags.
Example:<br>
<kbd>ADD EAX, [ESI+20]</kbd><br>
First uop reads <kbd>ESI,</kbd> second uop reads <kbd>EAX</kbd> and writes <kbd>EAX</kbd> and flags.
<p>
<u>Read/modify/write</u><br>
A read/modify/write instruction generates four uops. The first uop (port 2) reads any pointer
registers, the second uop (port 0 or 1) reads and writes to any source register and possibly
writes to the flags, the third uop (port 4) reads only the temporary result, which doesn't count
here, the fourth uop (port 3) reads any pointer registers again. Since the first and the fourth
uop cannot go into the RAT together, you cannot take advantage of the fact that they read
the same pointer registers. Example:<br>
<kbd>OR [ESI+EDI], EAX</kbd><br>
The first uop reads <kbd>ESI</kbd> and <kbd>EDI</kbd>, the second uop reads <kbd>EAX</kbd> and writes <kbd>EAX</kbd> and the
flags, the third uop reads only the temporary result, the fourth uop reads <kbd>ESI</kbd> and <kbd>EDI</kbd> again. No
matter how these uops go into the RAT, you can be sure that the uop that reads <kbd>EAX</kbd> goes
together with one of the uops that read <kbd>ESI</kbd> and <kbd>EDI</kbd>. A register read stall is therefore
inevitable for this instruction unless one of the registers has been modified recently.
<p>
<u>Push register</u><br>
A push register instruction generates 3 uops. The first one (port 4) is a store instruction,
reading the register. The second uop (port 3) generates the address, reading the stack
pointer. The third uop (port 0 or 1) subtracts the word size from the stack pointer, reading
and modifying the stack pointer.
<p>
<u>Pop register</u><br>
A pop register instruction generates 2 uops. The first uop (port 2) loads the value, reading
the stack pointer and writing to the register. The second uop (port 0 or 1) adjusts the stack
pointer, reading and modifying the stack pointer.
<p>
<u>Call</u><br>
A near call generates 4 uops (port 1, 4, 3, 01). The first two uops read only the instruction
pointer, which doesn't count because it cannot be renamed. The third uop reads the stack
pointer. The last uop reads and modifies the stack pointer.
<p>
<u>Return</u><br>
A near return generates 4 uops (port 2, 01, 01, 1). The first uop reads the stack pointer.
The third uop reads and modifies the stack pointer.
<p>
An example of how to avoid a register read stall is given in example 2.6.
<p>
<h2><a name="17">17</a>. Out of order execution (PPro, PII and PIII)</h2>
The reorder buffer (ROB) can hold 40 uops. Each uop waits in the ROB until all its
operands are ready and there is a vacant execution unit for it. This makes out-of-order
execution possible. If one part of the code is delayed because of a cache miss, then it won't
delay later parts of the code if they are independent of the delayed operations.
<p>
Writes to memory cannot execute out of order relative to other writes. There are four write
buffers, so if you expect many cache misses on writes, or you are writing to uncached
memory, then it is recommended that you schedule four writes at a time and make sure the
processor has something else to do before you give it the next four writes. Memory reads
and other instructions can execute out of order, except <kbd>IN, OUT</kbd> and serializing
instructions.
<p>
If your code writes to a memory address and soon after reads from the same address, then
the read may by mistake be executed before the write, because the ROB doesn't know the
memory addresses at the time of reordering. This error is detected when the write address
is calculated, and then the read operation (which was executed speculatively)
has to be re-done. The penalty for this is approximately 3 clocks. The only way to avoid this penalty is to
make sure the execution unit has other things to do between a write and a subsequent read
from the same memory address.
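<p>
A sketch of the situation (hypothetical code):
<pre>        MOV [ESI], EAX           ; write; address not yet known to the ROB
        MOV EBX, [ESI]           ; read of the same address may be executed
                                 ; speculatively before the write and then
                                 ; re-done: approximately 3 clocks penalty</pre><p>
Inserting independent instructions between the write and the read gives the
execution units something else to do while the conflict is resolved.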
|
|
<p>
|
|
There are several execution units clustered around five ports. Port 0 and 1 are for
arithmetic operations etc. Simple move, arithmetic and logic operations can go to either port 0 or 1,
whichever is vacant first. Port 0 also handles multiplication, division, integer shifts and
rotates, and floating point operations. Port 1 also handles jumps and some MMX
and XMM operations. Port 2 handles all reads from memory and a few string and XMM operations, port 3 calculates addresses for memory
writes, and port 4 executes all memory write operations. In chapter <a href="#29">29</a> you'll find a complete
list of the uops generated by code instructions with an indication of which ports they go to.
Note that all memory write operations require two uops, one for port 3 and one for port 4,
while memory read operations use only one uop (port 2).
<p>
In most cases each port can receive one new uop per clock cycle. This means that you can
execute up to 5 uops in the same clock cycle if they go to five different ports, but since
there is a limit of 3 uops per clock earlier in the pipeline you will never execute more than 3
uops per clock on average.
<p>
You must make sure that no execution port receives more than one third of the uops if you
want to maintain a throughput of 3 uops per clock. Use the table of uops in chapter
<a href="#29">29</a> and
count how many uops go to each port. If port 0 and 1 are saturated while port 2 is free then
you can improve your code by replacing some <kbd>MOV register,register</kbd>
or <kbd>MOV register,immediate</kbd> instructions with
<kbd>MOV register,memory</kbd> in order to move some
of the load from port 0 and 1 to port 2.
<p>
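As a sketch of this technique (the variable name <kbd>CONST10</kbd> is hypothetical; it is assumed to hold the constant 10 in cached memory), a constant needed repeatedly can be loaded from memory so that the uop goes to port 2 instead of port 0 or 1:
<pre> MOV EAX, 10 ; uop goes to port 0 or 1
 MOV EAX, [CONST10] ; uop goes to port 2 instead</pre><p>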
Most uops take only one clock cycle to execute, but multiplications, divisions, and many
floating point operations take more:
<p>
Floating point addition and subtraction takes 3 clocks, but the execution unit is fully
pipelined so that it can receive a new <kbd>FADD</kbd> or <kbd>FSUB</kbd>
in every clock cycle before the
preceding ones are finished (provided, of course, that they are independent).
<p>
Integer multiplication takes 4 clocks, floating point multiplication 5, and
MMX multiplication 3 clocks. Integer and MMX multiplication is pipelined so
that it can receive a new instruction every clock cycle. Floating point
multiplication is partially pipelined: The execution unit can receive a new
<kbd>FMUL</kbd> instruction two clocks after the preceding one, so that the
maximum throughput is one <kbd>FMUL</kbd> per two clock cycles. The holes
between the <kbd>FMUL</kbd>'s cannot be filled by integer multiplications
because they use the same circuitry. XMM additions and multiplications take
3 and 4 clocks respectively, and are fully pipelined. But since each logical XMM
register is implemented as two physical 64-bit registers, you need two uops for a
packed XMM operation, and the throughput will then be one arithmetic XMM
instruction every two clock cycles. XMM add and multiply instructions can execute
in parallel because they don't use the same execution port.
<p>
Integer and floating point division takes up to 39 clocks and is not pipelined. This means
that the execution unit cannot begin a new division until the previous division is finished.
The same applies to square root and transcendental functions.
<p>
Jump instructions, calls, and returns are also not fully pipelined. You cannot execute a new
jump in the first clock cycle after a preceding jump. So the maximum throughput for jumps,
calls, and returns is one for every two clocks.
<p>
You should, of course, avoid instructions that generate many uops.
The <kbd>LOOP XX</kbd>
instruction, for example, should be replaced by <kbd>DEC ECX / JNZ XX</kbd>.
<p>
If you have consecutive <kbd>POP</kbd> instructions then you may break them up to reduce the
number of uops:
<pre>POP ECX / POP EBX / POP EAX ; can be changed to:
MOV ECX,[ESP] / MOV EBX,[ESP+4] / MOV EAX,[ESP+8] / ADD ESP,12</pre>
The former code generates 6 uops, the latter generates only 4 and decodes faster.
Doing the same with <kbd>PUSH</kbd> instructions is less advantageous because the split-up code is likely to
generate register read stalls unless you have other instructions to put in between or the
registers have been renamed recently. Doing it with <kbd>CALL</kbd> and <kbd>RET</kbd>
instructions will
interfere with prediction in the return stack buffer. Note also that the
<kbd>ADD ESP</kbd> instruction can cause an AGI stall in earlier processors.
<p>
<h2><a name="18">18</a>. Retirement (PPro, PII and PIII)</h2>
Retirement is a process where the temporary registers used by the uops are copied into the
permanent registers <kbd>EAX, EBX</kbd>, etc.
When a uop has been executed it is marked in the ROB as ready to retire.
<p>
The retirement station can handle three uops per clock cycle. This may not seem like a
problem because the throughput is already limited to 3 uops per clock in the RAT. But
retirement may still be a bottleneck for two reasons. Firstly, instructions must retire in order.
If a uop is executed out of order then it cannot retire before all preceding uops in the order
have retired. And the second limitation is that taken jumps must retire in the first of the three
slots in the retirement station. Just as decoders D1 and D2 can be idle if the next instruction
only fits into D0, the last two slots in the retirement station can be idle if the next uop to
retire is a taken jump. This is significant if you have a small loop where the number of uops
in the loop is not divisible by three.
<p>
All uops stay in the reorder buffer (ROB) until they retire. The ROB can hold 40 uops. This
sets a limit to the number of instructions that can execute during the long delay of a division
or other slow operation. Before the division is finished the ROB will be filled up with
executed uops waiting to retire. Only when the division is finished and retired can the
subsequent uops begin to retire, because retirement takes place in order.
<p>
In case of speculative execution of predicted branches (see chapter <a href="#22">22</a>) the speculatively
executed uops cannot retire until it is certain that the prediction was correct. If the prediction
turns out to be wrong then the speculatively executed uops are discarded without
retirement.
<p>
The following instructions cannot execute speculatively: memory writes,
<kbd>IN, OUT</kbd>, and serializing instructions.
<p>
<h2><a name="19">19</a>. Partial stalls (PPro, PII and PIII)</h2>
<h3><a name="19_1">19.1 Partial register stalls</a></h3>
Partial register stall is a problem that occurs when you write to part of a 32
bit register and later read from the whole register or a bigger part of it.
Example:
<pre> MOV AL, BYTE PTR [MEM8]
 MOV EBX, EAX ; partial register stall</pre><p>
This gives a delay of 5-6 clocks. The reason is that a temporary register has been assigned
to <kbd>AL</kbd> (to make it independent of <kbd>AH</kbd>).
The execution unit has to wait until the write to <kbd>AL</kbd> has
retired before it is possible to combine the value from <kbd>AL</kbd> with the
value of the rest of <kbd>EAX</kbd>.
The stall can be avoided by changing the code to:
<pre> MOVZX EBX, BYTE PTR [MEM8]
 AND EAX, 0FFFFFF00h
 OR EBX, EAX</pre>
<p>
Of course you can also avoid the partial stalls by putting in other instructions after the write
to the partial register so that it has time to retire before you read from the full register.
<p>
You should be aware of partial stalls whenever you mix different data sizes (8, 16, and 32
bits):
<pre> MOV BH, 0
 ADD BX, AX ; stall
 INC EBX ; stall</pre><p>
You don't get a stall when reading a partial register after writing to the full register, or a
bigger part of it:
<pre> MOV EAX, [MEM32]
 ADD BL, AL ; no stall
 ADD BH, AH ; no stall
 MOV CX, AX ; no stall
 MOV DX, BX ; stall</pre><p>
The easiest way to avoid partial register stalls is to always use full registers
and use <kbd>MOVZX</kbd> or <kbd>MOVSX</kbd> when reading from smaller memory
operands. These instructions are fast on the
PPro, PII and PIII, but slow on earlier processors. Therefore, a compromise is
offered when you
want your code to perform reasonably well on all processors. The replacement
for <kbd>MOVZX EAX,BYTE PTR [M8]</kbd> looks like this:
<pre> XOR EAX, EAX
 MOV AL, BYTE PTR [M8]</pre><p>
The PPro, PII and PIII processors make a special case out of this combination
to avoid a partial
register stall when later reading from <kbd>EAX</kbd>.
The trick is that a register is tagged as empty
when it is <kbd>XOR</kbd>'ed with itself. The processor remembers that the upper 24 bits of
<kbd>EAX</kbd> are zero, so that a partial stall can be avoided. This mechanism
works only on certain combinations:
<pre> XOR EAX, EAX
 MOV AL, 3
 MOV EBX, EAX ; no stall

 XOR AH, AH
 MOV AL, 3
 MOV BX, AX ; no stall

 XOR EAX, EAX
 MOV AH, 3
 MOV EBX, EAX ; stall

 SUB EBX, EBX
 MOV BL, DL
 MOV ECX, EBX ; no stall

 MOV EBX, 0
 MOV BL, DL
 MOV ECX, EBX ; stall

 MOV BL, DL
 XOR EBX, EBX ; no stall</pre><p>
Setting a register to zero by subtracting it from itself works the same as the
<kbd>XOR</kbd>, but setting it to zero with the <kbd>MOV</kbd> instruction
doesn't prevent the stall.
<p>
You can set the <kbd>XOR</kbd> outside a loop:
<pre> XOR EAX, EAX
 MOV ECX, 100
LL: MOV AL, [ESI]
 MOV [EDI], EAX ; no stall
 INC ESI
 ADD EDI, 4
 DEC ECX
 JNZ LL</pre><p>
The processor remembers that the upper 24 bits of <kbd>EAX</kbd> are zero as
long as you don't get
an interrupt, misprediction, or other serializing event.
<p>
You should remember to neutralize any partial register you have used before calling a
subroutine that might push the full register:
<pre> ADD BL, AL
 MOV [MEM8], BL
 XOR EBX, EBX ; neutralize BL
 CALL _HighLevelFunction</pre><p>
Most high level language procedures push <kbd>EBX</kbd> at the start of the
procedure which would generate a partial register stall in the example above
if you hadn't neutralized <kbd>BL</kbd>.
<p>
Setting a register to zero with the <kbd>XOR</kbd> method doesn't break its dependency on earlier
instructions:
<pre> DIV EBX
 MOV [MEM], EAX
 MOV EAX, 0 ; break dependency
 XOR EAX, EAX ; prevent partial register stall
 MOV AL, CL
 ADD EBX, EAX</pre><p>
Setting <kbd>EAX</kbd> to zero twice here seems redundant, but without
the <kbd>MOV EAX,0</kbd> the last
instructions would have to wait for the slow <kbd>DIV</kbd> to finish, and
without <kbd>XOR EAX,EAX</kbd> you
would have a partial register stall.
<p>
The <kbd>FNSTSW AX</kbd> instruction is special: in 32 bit mode it behaves as
if writing to the entire <kbd>EAX</kbd>. In fact, it does something like this
in 32 bit mode:<br>
<kbd> AND EAX,0FFFF0000h / FNSTSW TEMP / OR EAX,TEMP</kbd><br>
hence, you don't get a partial register stall when reading <kbd>EAX</kbd>
after this instruction in 32 bit mode:
<pre> FNSTSW AX / MOV EBX,EAX ; stall only if 16 bit mode
 MOV AX,0 / FNSTSW AX ; stall only if 32 bit mode</pre>
<p>
<h3><a name="19_2">19.2 Partial flags stalls</a></h3>
The flags register can also cause partial register stalls:
<pre> CMP EAX, EBX
 INC ECX
 JBE XX ; partial flags stall</pre><p>
The <kbd>JBE</kbd> instruction reads both the carry flag and the zero flag.
Since the <kbd>INC</kbd> instruction
changes the zero flag, but not the carry flag, the <kbd>JBE</kbd> instruction has to wait for the two
preceding instructions to retire before it can combine the carry flag from the <kbd>CMP</kbd> instruction
and the zero flag from the <kbd>INC</kbd> instruction. This situation is likely to be a bug rather than an
intended combination of flags. To correct it change <kbd>INC ECX</kbd> to <kbd>ADD ECX,1</kbd>.
A similar
bug that causes a partial flags stall is <kbd>SAHF / JL XX</kbd>. The <kbd>JL</kbd>
instruction tests the sign
flag and the overflow flag, but <kbd>SAHF</kbd> doesn't change the overflow flag.
To correct it, change
<kbd>JL XX</kbd> to <kbd>JS XX</kbd>.
<p>
Unexpectedly (and contrary to what Intel manuals say) you also get a partial flags stall after
an instruction that modifies some of the flag bits when reading only unmodified flag bits:
<pre> CMP EAX, EBX
 INC ECX
 JC XX ; partial flags stall</pre><p>
but not when reading only modified bits:
<pre> CMP EAX, EBX
 INC ECX
 JE XX ; no stall</pre><p>
Partial flags stalls are likely to occur on instructions that read many or
all flags bits, i.e. <kbd>LAHF, PUSHF, PUSHFD</kbd>. The following instructions cause
partial flags stalls when followed by <kbd>LAHF</kbd> or <kbd>PUSHF(D)</kbd>:
<kbd>INC, DEC, TEST</kbd>, bit tests, bit scan, <kbd>CLC, STC, CMC, CLD, STD, CLI, STI, MUL,
IMUL</kbd>, and all shifts and rotates.
The following instructions do not cause partial flags stalls:
<kbd>AND, OR, XOR, ADD, ADC, SUB, SBB, CMP, NEG</kbd>.
It is strange that <kbd>TEST</kbd> and <kbd>AND</kbd> behave differently while, by definition, they
do exactly the same thing to the flags. You may use a <kbd>SETcc</kbd>
instruction instead of <kbd>LAHF</kbd>
or <kbd>PUSHF(D)</kbd> for storing the value of a flag in order to avoid a stall.
<p>
Examples:
<pre> INC EAX / PUSHFD ; stall
 ADD EAX,1 / PUSHFD ; no stall

 SHR EAX,1 / PUSHFD ; stall
 SHR EAX,1 / OR EAX,EAX / PUSHFD ; no stall

 TEST EBX,EBX / LAHF ; stall
 AND EBX,EBX / LAHF ; no stall
 TEST EBX,EBX / SETZ AL ; no stall

 CLC / SETZ AL ; stall
 CLD / SETZ AL ; no stall</pre><p>
The penalty for partial flags stalls is approximately 4 clocks.
<p>
<h3><a name="19_3">19.3 Flags stalls after shifts and rotates</a></h3>
You can get a stall resembling the partial flags stall when reading any flag
bit after a shift or rotate, except for shifts and rotates by one (short form):
<pre> SHR EAX,1 / JZ XX ; no stall
 SHR EAX,2 / JZ XX ; stall
 SHR EAX,2 / OR EAX,EAX / JZ XX ; no stall

 SHR EAX,5 / JC XX ; stall
 SHR EAX,4 / SHR EAX,1 / JC XX ; no stall

 SHR EAX,CL / JZ XX ; stall, even if CL = 1
 SHRD EAX,EBX,1 / JZ XX ; stall
 ROL EBX,8 / JC XX ; stall</pre><p>
The penalty for these stalls is approximately 4 clocks.
<p>
<h3><a name="19_4">19.4 Partial memory stalls</a></h3>
A partial memory stall is somewhat analogous to a partial register stall. It occurs when you
mix data sizes for the same memory address:
<pre> MOV BYTE PTR [ESI], AL
 MOV EBX, DWORD PTR [ESI] ; partial memory stall</pre><p>
Here you get a stall because the processor has to combine the byte written from <kbd>AL</kbd> with the
next three bytes, which were in memory before, to get the four bytes needed for
reading into <kbd>EBX</kbd>. The penalty is approximately 7-8 clocks.
<p>
Unlike the partial register stalls, you also get a partial memory stall when you write a bigger
operand to memory and then read part of it, if the smaller part doesn't start at the same
address:
<pre> MOV DWORD PTR [ESI], EAX
 MOV BL, BYTE PTR [ESI] ; no stall
 MOV BH, BYTE PTR [ESI+1] ; stall</pre><p>
You can avoid this stall by changing the last line to <kbd>MOV BH,AH</kbd>,
but such a solution is not
possible in a situation like this:
<pre> FISTP QWORD PTR [EDI]
 MOV EAX, DWORD PTR [EDI]
 MOV EDX, DWORD PTR [EDI+4] ; stall</pre><p>
Interestingly, you can also get a partial memory stall when writing and reading completely
different addresses if they happen to have the same set-value in different cache banks:
<pre> MOV BYTE PTR [ESI], AL
 MOV EBX, DWORD PTR [ESI+4092] ; no stall
 MOV ECX, DWORD PTR [ESI+4096] ; stall</pre>
<p>
<h2><a name="20">20</a>. Dependency chains (PPro, PII and PIII)</h2>
A series of instructions where each instruction depends on the result of the preceding one
is called a dependency chain. Long dependency chains should be avoided, if possible,
because they prevent out-of-order and parallel execution.
<p>
Example:
<pre> MOV EAX, [MEM1]
 ADD EAX, [MEM2]
 ADD EAX, [MEM3]
 ADD EAX, [MEM4]
 MOV [MEM5], EAX</pre><p>
In this example, the <kbd>ADD</kbd> instructions generate 2 uops each, one for reading from memory
(port 2), and one for adding (port 0 or 1). The read uops can execute out of order, while the
add uops must wait for the previous uops to finish. This dependency chain does not take
very long to execute, because each addition adds only 1 clock to the execution time. But if
you have slow instructions like multiplications, or even worse: divisions, then you should
definitely do something to break the dependency chain. The way to do this is to use multiple
accumulators:
<pre> MOV EAX, [MEM1] ; start first chain
|
|
MOV EBX, [MEM2] ; start other chain in different accumulator
|
|
IMUL EAX, [MEM3]
|
|
IMUL EBX, [MEM4]
|
|
IMUL EAX, EBX ; join chains in the end
|
|
MOV [MEM5], EAX</pre><p>
|
|
Here, the second <kbd>IMUL</kbd> instruction can start before the first one is finished.
|
|
Since the <kbd>IMUL</kbd> instruction has a delay of 4 clocks and is fully pipelined, you
|
|
may have up to 4 accumulators.
|
|
<p>
|
|
Division is not pipelined so you cannot do the same with chained divisions,
but you can of course multiply all the divisors and do only one division in
the end.
<p>
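A sketch of this (the memory labels <kbd>X, B, C, RESULT</kbd> are hypothetical); X/B/C is computed with a single division, mathematically equivalent apart from rounding:
<pre> FLD [X]
 FLD [B]
 FMUL [C] ; B*C
 FDIV ; X / (B*C) instead of (X/B)/C
 FSTP [RESULT]</pre><p>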
Floating point instructions have a longer delay than integer instructions, so
you should definitely break up long dependency chains with floating point
instructions:
<pre> FLD [MEM1] ; start first chain
 FLD [MEM2] ; start second chain in different accumulator
 FADD [MEM3]
 FXCH
 FADD [MEM4]
 FXCH
 FADD [MEM5]
 FADD ; join chains in the end
 FSTP [MEM6]</pre><p>
You need a lot of <kbd>FXCH</kbd> instructions for this, but don't worry: they
are cheap. <kbd>FXCH</kbd>
instructions are resolved in the RAT by register renaming so they don't
put any load on the execution ports. An <kbd>FXCH</kbd> does count as 1 uop in the RAT,
ROB, and retirement station, though.
<p>
If the dependency chain is long you may need three accumulators:
<pre> FLD [MEM1] ; start first chain
 FLD [MEM2] ; start second chain
 FLD [MEM3] ; start third chain
 FADD [MEM4] ; third chain
 FXCH ST(1)
 FADD [MEM5] ; second chain
 FXCH ST(2)
 FADD [MEM6] ; first chain
 FXCH ST(1)
 FADD [MEM7] ; third chain
 FXCH ST(2)
 FADD [MEM8] ; second chain
 FXCH ST(1)
 FADD ; join first and third chain
 FADD ; join with second chain
 FSTP [MEM9]</pre><p>
Avoid storing intermediate data in memory and reading them immediately afterwards:
<pre> MOV [TEMP], EAX
 MOV EBX, [TEMP]</pre><p>
There is a penalty for attempting to read from a memory address before a previous write to
that address is finished. In the example above, change the last instruction to
<kbd>MOV EBX,EAX</kbd>
or put some other instructions in between.
<p>
There is one situation where you cannot avoid storing intermediate data in memory, and
that is when transferring data from an integer register to a floating point register, or vice
versa. For example:
<pre> MOV EAX, [MEM1]
 ADD EAX, [MEM2]
 MOV [TEMP], EAX
 FILD [TEMP]</pre><p>
If you don't have anything to put in between the write to <kbd>TEMP</kbd> and the
read from <kbd>TEMP</kbd>, then
you may consider using a floating point register instead of <kbd>EAX</kbd>:
<pre> FILD [MEM1]
 FIADD [MEM2]</pre><p>
Consecutive jumps, calls, or returns may also be considered dependency chains. The
throughput for these instructions is one jump per two clock cycles. It is therefore
recommended that you give the microprocessor something else to do between the jumps.
<p>
<h2><a name="21">21</a>. Searching for bottlenecks (PPro, PII and PIII)</h2>
When optimizing code for these processors, it is important to analyze where the
bottlenecks are. Spending time on optimizing away one bottleneck doesn't make sense if
there is another bottleneck which is narrower.
<p>
If you expect code cache misses then you should restructure your code to keep the most
used parts of code together.
<p>
If you expect many data cache misses then forget about everything else and concentrate
on how to restructure your data to reduce the number of cache misses (chapter <a href="#7">7</a>), and
avoid long dependency chains after a data read cache miss (chapter <a href="#20">20</a>).
<p>
If you have many divisions then try to reduce them (chapter <a href="#27_2">27.2</a>) and make sure the
processor has something else to do during the divisions.
<p>
Dependency chains tend to hamper out-of-order execution (chapter <a href="#20">20</a>). Try to break long
dependency chains, especially if they contain slow instructions such as multiplication,
division, and floating point instructions.
<p>
If you have many jumps, calls, or returns, and especially if the jumps are poorly predictable,
then see whether some of them can be avoided. Replace conditional jumps with conditional moves
if possible, and replace small procedures with macros (chapter <a href="#22_3">22.3</a>).
<p>
If you are mixing different data sizes (8, 16, and 32 bit integers) then look out for partial
stalls. If you use <kbd>PUSHF</kbd> or <kbd>LAHF</kbd> instructions then look out for partial flags stalls. Avoid
testing flags after shifts or rotates by more than 1 (chapter <a href="#19">19</a>).
<p>
If you aim at a throughput of 3 uops per clock cycle then be aware of possible delays in
instruction fetch and decoding (chapters <a href="#14">14</a> and <a href="#15">15</a>), especially in small loops.
<p>
The limit of two permanent register reads per clock cycle may reduce your throughput to
less than 3 uops per clock cycle (chapter <a href="#16_2">16.2</a>). This is likely to happen if you often read
registers more than 4 clock cycles after they last were modified. This may, for example,
happen if you often use pointers for addressing your data but seldom modify the pointers.
<p>
A throughput of 3 uops per clock requires that no execution port gets more than one third of
the uops (chapter <a href="#17">17</a>).
<p>
The retirement station can handle 3 uops per clock, but may be slightly less effective for
taken jumps (chapter <a href="#18">18</a>).
<p>
<h2><a name="22">22</a>. Jumps and branches (all processors)</h2>
The Pentium family of processors attempts to predict where a jump will go to, and whether a
conditional jump will be taken or fall through. If the prediction is correct, then it can save a
considerable amount of time by loading the subsequent instructions into the pipeline and
start decoding them before the jump is executed. If the prediction turns out to be wrong,
then the pipeline has to be flushed, which will cost a penalty depending on the length of the
pipeline.
<p>
The predictions are based on a Branch Target Buffer (BTB) which stores the history for
each branch or jump instruction and makes predictions based on the prior history of
executions of each instruction. The BTB is organized like a set-associative cache where
new entries are allocated according to a pseudo-random replacement method.
<p>
When optimizing code, it is important to minimize the number of misprediction penalties.
This requires a good understanding of how the jump prediction works.
<p>
The branch prediction mechanisms are not described adequately in Intel manuals or
anywhere else. I am therefore giving a very detailed description here. This information is
based on my own research (with the help of Karki Jitendra Bahadur for the PPlain).
<p>
In the following, I will use the term 'control transfer instruction' for any instruction which can
change the instruction pointer, including conditional and unconditional, direct and indirect,
near and far, jumps, calls, and returns. All these instructions use prediction.
<p>
<h3><a name="22_1">22.1 Branch prediction in PPlain</a></h3>
The branch prediction mechanism for the PPlain is very different from the other three
processors. Information found in Intel documents and elsewhere on this subject is directly
misleading, and following the advice given in such documents is likely to lead to
sub-optimal code.
<p>
The PPlain has a branch target buffer (BTB), which can hold information for up to 256 jump
instructions. The BTB is organized like a 4-way set-associative cache with 64 entries per
way. This means that the BTB can hold no more than 4 entries with the same set value.
Unlike the data cache, the BTB uses a pseudo random replacement algorithm, which
means that a new entry will not necessarily displace the least recently used entry of the
same set-value. How the set-value is calculated will be explained later. Each BTB entry
stores the address of the jump target and a prediction state, which can have four different
values:
<p>
state 0: "strongly not taken" <br>
state 1: "weakly not taken" <br>
state 2: "weakly taken" <br>
state 3: "strongly taken"<p>
A branch instruction is predicted to jump when in state 2 or 3, and to fall through when in
state 0 or 1. The state transition works like a two-bit counter, so that the state is
incremented when the branch is taken, and decremented when it falls through. The counter
saturates, rather than wrapping around, so that it does not decrement beyond 0 or increment
beyond 3. Ideally, this would provide a reasonably good prediction, because a branch
instruction would have to deviate twice from what it does most of the time, before the
prediction changes.
<p>
However, this mechanism has been compromised by the fact that state 0 also means
'unused BTB entry'. So a BTB entry in state 0 is the same as no BTB entry. This makes
sense, because a branch instruction is predicted to fall through if it has no BTB entry. This
improves the utilization of the BTB, because a branch instruction which is seldom taken will
most of the time not take up any BTB entry.
<p>
Now, if a jumping instruction has no BTB entry, then a new BTB entry will be generated,
and this new entry will always be set to state 3. This means that it is impossible to go from
state 0 to state 1 (except for a very special case discussed later). From state 0 you can only
go to state 3, if the branch is taken. If the branch falls through, then it will stay out of the
BTB.
<p>
This is a serious design flaw. By throwing state 0 entries out of the BTB and always setting
new entries to state 3, the designers apparently have given priority to minimizing the first
time penalty for unconditional jumps and branches often taken, and ignored that this
seriously compromises the basic idea behind the mechanism and reduces the performance
in small innermost loops. The consequence of this flaw is that a branch instruction which
falls through most of the time will have up to three times as many mispredictions as a
branch instruction which is taken most of the time. (Apparently, Intel engineers
were unaware of this flaw until I published my findings.)
<p>
You may take this asymmetry into account by organizing your branches so that they are
taken more often than not. Consider for example this if-then-else construction:
<pre> TEST EAX,EAX
 JZ A
 <branch 1>
 JMP E
A: <branch 2>
E:</pre>
<p>
If branch 1 is executed more often than branch 2, and branch 2 is seldom executed twice in
succession, then you can reduce the number of branch mispredictions by up to a factor of 3
by swapping the two branches so that the branch instruction will jump more often than fall
through:
<pre> TEST EAX,EAX
 JNZ A
 <branch 2>
 JMP E
A: <branch 1>
E:</pre><p>
(This is contrary to the recommendations in Intel's manuals and tutorials).
<p>
There may be reasons to put the most often executed branch first, however:
<ol>
<li>Putting seldom executed branches away in the bottom of your code can improve code
cache utilization.
<li>A branch instruction seldom taken will stay out of the BTB most of the time, possibly
improving BTB utilization.
<li>The branch instruction will be predicted as not taken if it has been flushed out of the
BTB by other branch instructions.
<li>The asymmetry in branch prediction only exists on the PPlain.
</ol>
<p>
These considerations have little weight, however, for small critical loops, so I would still
recommend organizing branches with a skewed distribution so that the branch instruction is
taken more often than not, unless branch 2 is executed so seldom that misprediction
doesn't matter.
<p>
Likewise, you should preferably organize loops with the testing branch instruction at the
bottom, as in this example:
<pre> MOV ECX, [N]
L: MOV [EDI],EAX
 ADD EDI,4
 DEC ECX
 JNZ L</pre><p>
If N is high, then the JNZ instruction here will be taken more often than not, and never fall
through twice in succession.
<p>
Consider the situation where a branch is taken every second time. The first time it jumps
the BTB entry will go into state 3, and will then alternate between state 2 and 3. It is
predicted to jump all the time, which gives 50% mispredictions. <a name="worstpred">Assume now that it deviates
from this regular pattern and falls through an extra time</a>. The jump pattern is:<pre>
01010100101010101010101, where 0 means nojump, and 1 means jump.
       ^</pre>
The extra nojump is indicated with a <kbd>^</kbd> above. After this incident, the BTB entry will alternate
between state 1 and 2, which gives 100% mispredictions. It will continue in this unfortunate
mode until there is another deviation from the 0101 pattern. This is the worst case for this
branch prediction mechanism.
<p>
<h4>22.1.2 BTB is looking ahead (PPlain)</h4>
The BTB mechanism is counting instruction pairs, rather than single instructions, so you
have to know how instructions are pairing in order to analyze where a BTB entry is stored.
The BTB entry for any control transfer instruction is attached to the address of the U-pipe
instruction in the preceding instruction pair. (An unpaired instruction counts as one pair.)
Example:
<pre> SHR EAX,1
 MOV EBX,[ESI]
 CMP EAX,EBX
 JB L</pre><p>
Here <kbd>SHR</kbd> pairs with <kbd>MOV</kbd>, and <kbd>CMP</kbd> pairs with
<kbd>JB</kbd>. The BTB entry for <kbd>JB L</kbd> is thus
attached to the address of the <kbd>SHR EAX,1</kbd> instruction. When this BTB entry is met, and if it
is in state 2 or 3, then the Pentium will read the target address from the BTB entry, and load
the instructions following L into the pipeline. This happens before the branch instruction has
been decoded, so the Pentium relies solely on the information in the BTB when doing this.
<p>
You may remember that instructions are seldom pairing the first time they are executed
(see chapter <a href="#8">8</a>). If the instructions above are not pairing, then the BTB entry should be
attached to the address of the <kbd>CMP</kbd> instruction, and this entry would be wrong on the next
execution, when instructions are pairing. However, in most cases the PPlain is smart
enough not to make a BTB entry when there is an unused pairing opportunity, so you don't
get a BTB entry until the second execution, and hence you won't get a prediction until the
third execution. (In the rare case where every second instruction is a single-byte
instruction, you may get a BTB entry on the first execution which becomes invalid in the
second execution, but since the instruction it is attached to will then go to the V-pipe, it is
ignored and gives no penalty. A BTB entry is only read if it is attached to the address of a
U-pipe instruction.)
<p>
A BTB entry is identified by its set-value which is equal to bits 0-5 of the address it is
attached to. Bits 6-31 are then stored in the BTB as a tag. Addresses which are spaced a
multiple of 64 bytes apart will have the same set-value. You can have no more than four
BTB entries with the same set-value. If you want to check whether your jump instructions
contend for the same BTB entries, then you have to compare bits 0-5 of the addresses of
the U-pipe instructions in the preceding instruction pairs. This is very tedious, and I have
never heard of anybody doing so. There are no tools available to do this job for you.
<p>
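The tedious bit-comparison described above is easy to mechanize. The following Python sketch is hypothetical (no such tool exists, and <kbd>btb_set</kbd> and <kbd>contending</kbd> are made-up names); it only illustrates the set-value arithmetic, with arbitrary example addresses:

```python
# Hypothetical sketch of the check described above: on the PPlain,
# bits 0-5 of the address select the BTB set, and at most four
# entries can share one set.

def btb_set(address):
    # bits 0-5 of the address give the set-value
    return address & 0x3F

def contending(addresses):
    # group the addresses (of the U-pipe instructions in the preceding
    # instruction pairs) by set-value; report sets holding more than four
    sets = {}
    for a in addresses:
        sets.setdefault(btb_set(a), []).append(a)
    return {s: v for s, v in sets.items() if len(v) > 4}

# five addresses spaced a multiple of 64 bytes apart land in the same set:
print(contending([0x0100, 0x0140, 0x0180, 0x01C0, 0x0200, 0x0203]))
```

Feeding it the U-pipe instruction addresses from an assembler list file would reveal any set with more than four contenders.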
<h4>22.1.3 Consecutive branches (PPlain)</h4>
When a jump is mispredicted, then the pipeline gets flushed. If the next instruction pair
executed also contains a control transfer instruction, then the PPlain won't load its target
because it cannot load a new target while the pipeline is being flushed. The result is that the
second jump instruction is predicted to fall through regardless of the state of its BTB entry.
Therefore, if the second jump is also taken, then you will get another penalty. The state of
the BTB entry for the second jump instruction does get correctly updated, though. If you
have a long chain of control transfer instructions, and the first jump in the chain is
mispredicted, then the pipeline will get flushed all the time, and you will get nothing but
mispredictions until you meet an instruction pair which does not jump. The most extreme
case of this is a loop which jumps to itself: It will get a misprediction penalty for each
iteration.
<p>
This is not the only problem with consecutive control transfer instructions. Another problem
is that you can have another branch instruction between a BTB entry and the control
transfer instruction it belongs to. If the first branch instruction jumps to somewhere else,
then strange things may happen. Consider this example:
<pre>    SHR  EAX,1
    MOV  EBX,[ESI]
    CMP  EAX,EBX
    JB   L1
    JMP  L2

L1: MOV  EAX,EBX
    INC  EBX</pre><p>
When <kbd>JB L1</kbd> falls through, you will get a BTB entry for
<kbd>JMP L2</kbd> attached to the
address of <kbd>CMP EAX,EBX</kbd>. But what will happen when <kbd>JB L1</kbd>
later is taken? At the time
when the BTB entry for <kbd>JMP L2</kbd> is read, the processor doesn't know that the next
instruction pair does not contain a jump instruction, so it will actually predict the instruction
pair <kbd>MOV EAX,EBX / INC EBX</kbd> to jump to <kbd>L2</kbd>.
The penalty for predicting non-jump
instructions to jump is 3 clock cycles. The BTB entry for <kbd>JMP L2</kbd> will get its state
decremented, because it is applied to something which doesn't jump. If we keep going to
<kbd>L1</kbd>, then the BTB entry for <kbd>JMP L2</kbd> will be decremented to state 1 and 0, so that the
problem will disappear until the next time <kbd>JMP L2</kbd> is executed.
<p>
The penalty for predicting the non-jumping instructions to jump only occurs when the jump
to <kbd>L1</kbd> is predicted. If <kbd>JB L1</kbd> jumps but was mispredicted,
then the pipeline gets
flushed and we won't get the false <kbd>L2</kbd> target loaded, so in this case we will not see the
penalty of predicting the non-jumping instructions to jump, but we do get the BTB entry for
<kbd>JMP L2</kbd> decremented.
<p>
Suppose, now, that we replace the <kbd>INC EBX</kbd> instruction above with another jump
instruction. This third jump instruction will then use the same BTB entry as
<kbd>JMP L2</kbd> with
the possible penalty of predicting a wrong target (unless it happens to also
have <kbd>L2</kbd> as target).
<p>
To summarize, consecutive jumps can lead to the following problems:
<ul>
<li>failure to load a jump target when the pipeline is being flushed by a preceding
mispredicted jump.
<li>a BTB entry being mis-applied to non-jumping instructions and predicting them to jump.
<li>a second consequence of the above is that a mis-applied BTB entry will get its state
decremented, possibly leading to a later misprediction of the jump it belongs to. Even
unconditional jumps can be predicted to fall through for this reason.
<li>two jump instructions may share the same BTB entry, leading to the prediction of a
wrong target.
</ul>
<p>
All this mess may give you a lot of penalties, so you should definitely avoid having an
instruction pair containing a jump immediately after another poorly predictable control
transfer instruction or its target.
<p>
It is time for another illustrative example:
<pre>    CALL P
    TEST EAX,EAX
    JZ   L2
L1: MOV  [EDI],EBX
    ADD  EDI,4
    DEC  EAX
    JNZ  L1
L2: CALL P</pre><p>
This looks like a quite nice and normal piece of code: A function call, a loop which is
bypassed when the count is zero, and another function call. How many problems can you
spot in this program?
<p>
First, we may note that the function <kbd>P</kbd> is called alternately from two different locations.
This means that the target for the return from <kbd>P</kbd> will be changing all the time. Consequently,
the return from <kbd>P</kbd> will always be mispredicted.
<p>
Assume, now, that <kbd>EAX</kbd> is zero. The jump to <kbd>L2</kbd> will not have its target loaded because the
mispredicted return caused a pipeline flush. Next, the second <kbd>CALL P</kbd> will also fail to have
its target loaded because <kbd>JZ L2</kbd> caused a pipeline flush. Here we have the situation where
a chain of consecutive jumps makes the pipeline flush repeatedly because the first jump
was mispredicted. The BTB entry for <kbd>JZ L2</kbd> is stored at the address of <kbd>P</kbd>'s return
instruction. This BTB entry will now be mis-applied to whatever comes after the second
<kbd>CALL P</kbd>, but that doesn't give a penalty because the pipeline is flushed by the mispredicted
second return.
<p>
Now, let's see what happens if <kbd>EAX</kbd> has a nonzero value the next time:
<kbd>JZ L2</kbd> is always
predicted to fall through because of the flush. The second <kbd>CALL P</kbd>
has a BTB entry at the
address of <kbd>TEST EAX,EAX</kbd>. This entry will be mis-applied to the
<kbd>MOV/ADD</kbd> pair, predicting
it to jump to <kbd>P</kbd>. This causes a flush which prevents <kbd>JNZ L1</kbd>
from loading its target. If we
have been here before, then the second <kbd>CALL P</kbd> will have another BTB entry at the
address of <kbd>DEC EAX</kbd>. On the second and third iteration of the loop, this entry will also be
mis-applied to the <kbd>MOV/ADD</kbd> pair, until it has had its state decremented to 1 or 0. This will
not cause a penalty on the second iteration because the flush from <kbd>JNZ L1</kbd> prevents it
from loading its false target, but on the third iteration it will. The subsequent iterations of the
loop have no penalties, but when it exits, <kbd>JNZ L1</kbd> is mispredicted. The flush would now
prevent <kbd>CALL P</kbd> from loading its target, were it not for the fact that the BTB entry for
<kbd>CALL P</kbd> has already been destroyed by being mis-applied several times.
<p>
We can improve this code by putting in some <kbd>NOP</kbd>'s to separate all consecutive jumps:
<pre>    CALL P
    TEST EAX,EAX
    NOP
    JZ   L2
L1: MOV  [EDI],EBX
    ADD  EDI,4
    DEC  EAX
    JNZ  L1
L2: NOP
    NOP
    CALL P</pre><p>
The extra <kbd>NOP</kbd>'s cost 2 clock cycles, but they save much more.
Furthermore, <kbd>JZ L2</kbd> is now
moved to the U-pipe which reduces its penalty from 4 to 3 when mispredicted. The only
problem that remains is that the returns from <kbd>P</kbd> are always mispredicted. This problem can
only be solved by replacing the call to <kbd>P</kbd> by an inline macro (if you have enough code
cache).
<p>
The lesson to learn from this example is that you should always look carefully for
consecutive jumps and see if you can save time by inserting some <kbd>NOP</kbd>'s. You should be
particularly aware of those situations where misprediction is unavoidable, such as loop exits
and returns from procedures which are called from varying locations. If you have something
useful to put in instead of the <kbd>NOP</kbd>'s, then you should of course do so.
<p>
Multiway branches (case statements) may be implemented either as a tree of branch
instructions or as a list of jump addresses. If you choose to use a tree of branch
instructions, then you have to include some <kbd>NOP</kbd>'s or other instructions to separate the
consecutive branches. A list of jump addresses may therefore be a better solution on the
PPlain. The list of jump addresses should be placed in the data segment. Never put data in
the code segment!
<p>
<h4>22.1.4 Tight loops (PPlain)</h4>
In a small loop you will often access the same BTB entry repeatedly at short intervals.
This never causes a stall. Rather than waiting for a BTB entry to be updated, the PPlain
somehow bypasses the pipeline and gets the resulting state from the last jump before it has
been written to the BTB. This mechanism is almost transparent to the user, but it does in
some cases have funny effects: You can see a branch prediction going from state 0 to state
1, rather than to state 3, if the zero has not yet been written to the BTB. This happens if the
loop has no more than four instruction pairs. In loops with only two instruction pairs you
may sometimes have state 0 for two consecutive iterations without going out of the BTB. In
such small loops it also happens in rare cases that the prediction uses the state resulting
from two iterations ago, rather than from the last iteration. These funny effects will usually
not have any negative effects on performance.
<p>
<h3><a name="22_2">22.2 Branch prediction in PMMX, PPro, PII and PIII</a></h3>
<h4>22.2.1 BTB organization (PMMX, PPro, PII and PIII)</h4>
The branch target buffer (BTB) of the PMMX has 256 entries organized as 16 ways * 16
sets. Each entry is identified by bits 2-31 of the address of the last byte of the control
transfer instruction it belongs to. Bits 2-5 define the set, and bits 6-31 are stored in the BTB
as a tag. Control transfer instructions which are spaced a multiple of 64 bytes apart have the
same set-value and may therefore occasionally push each other out of the BTB. Since there are 16
ways per set, this won't happen too often.
<p>
The branch target buffer (BTB) of the PPro, PII and PIII has 512 entries organized as 16 ways *
32 sets. Each entry is identified by bits 4-31 of the address of the last byte of the control
transfer instruction it belongs to. Bits 4-8 define the set, and all bits are stored in the BTB as
a tag. Control transfer instructions which are spaced a multiple of 512 bytes apart have the
same set-value and may therefore occasionally push each other out of the BTB. Since there are 16
ways per set, this won't happen too often.
<p>
The PPro, PII and PIII allocate a BTB entry to any control transfer instruction the first time it is
executed. The PMMX allocates it the first time it jumps. A branch instruction which never
jumps will stay out of the BTB on the PMMX. As soon as it has jumped once, it will stay in
the BTB, even if it never jumps again.
<p>
An entry may be pushed out of the BTB when another control transfer instruction with the
same set-value needs a BTB entry.
<p>
<h4>22.2.2 Misprediction penalty (PMMX, PPro, PII and PIII)</h4>
In the PMMX, the penalty for misprediction of a conditional jump is 4 clocks in the U-pipe,
and 5 clocks if it is executed in the V-pipe. For all other control transfer instructions it is 4
clocks.
<p>
In the PPro, PII and PIII, the misprediction penalty is very high due to the long pipeline. A
misprediction usually costs between 10 and 20 clock cycles. It is therefore very important to
be aware of poorly predictable branches when running on PPro, PII and PIII.
<p>
<h4>22.2.3 Pattern recognition for conditional jumps (PMMX, PPro, PII and PIII)</h4>
These processors have an advanced pattern recognition mechanism which will correctly
predict a branch instruction which, for example, is taken every fourth time and falls through
the other three times. In fact, they can predict any repetitive pattern of jumps and nojumps
with a period of up to five, and many patterns with higher periods.
<p>
The mechanism is a so-called "two-level adaptive branch prediction scheme", invented by
T.-Y. Yeh and Y. N. Patt. It is based on the same kind of two-bit counters as described
above for the PPlain (but without the asymmetry flaw). The counter is incremented when
the jump is taken and decremented when not taken. There is no wrap-around when
counting up from 3 or down from 0. A branch instruction is predicted to be taken when the
corresponding counter is in state 2 or 3, and to fall through when in state 0 or 1. An
impressive improvement is now obtained by having sixteen such counters for each BTB
entry. It selects one of these sixteen counters based on the history of the branch instruction
for the last four executions. If, for example, the branch instruction jumps once and then falls
through three times, then you have the history bits 1000 (1=jump, 0=nojump). This will
make it use counter number 8 (1000 binary = 8) for predicting the next time and update
counter 8 afterwards.
<p>
If the sequence 1000 is always followed by a 1, then counter number 8 will soon end up in
its highest state (state 3) so that it will always predict a 1000 sequence to be followed by a
1. It will take two deviations from this pattern to change the prediction. The repetitive pattern
100010001000 will have counter 8 in state 3, and counter 1, 2 and 4 in state 0. The other
twelve counters will be unused.
<p>
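Since this is a hardware mechanism, code can only model it. The following Python sketch (the function name and the warm-up convention are my own) implements the counters-plus-history scheme just described and reproduces its behaviour on the example patterns:

```python
# Model of the two-level adaptive scheme described above: the last four
# outcomes select one of sixteen two-bit saturating counters, and a
# counter in state 2 or 3 predicts "jump".

def mispredictions_per_period(pattern, repetitions=20):
    """Run a repeating pattern ('1'=jump, '0'=nojump) through the model
    and return mispredictions per period after a warm-up phase."""
    counters = [0] * 16    # sixteen two-bit counters for one BTB entry
    history = 0            # last four outcomes as a 4-bit number
    warmup = repetitions // 2
    wrong = 0
    for rep in range(repetitions):
        for bit in pattern:
            outcome = int(bit)
            predicted = 1 if counters[history] >= 2 else 0
            if rep >= warmup and predicted != outcome:
                wrong += 1
            # saturating update of the selected counter
            if outcome:
                counters[history] = min(3, counters[history] + 1)
            else:
                counters[history] = max(0, counters[history] - 1)
            # shift the outcome into the history
            history = ((history << 1) | outcome) & 0xF
    return wrong / (repetitions - warmup)

print(mispredictions_per_period("1000"))     # taken every fourth time: 0.0
print(mispredictions_per_period("000001"))   # taken every sixth time: 1.0
```

The "1000" pattern is learned perfectly, while "000001" (see section 22.2.6 below) keeps one misprediction per period in this model.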
<h4>22.2.4 Perfectly predicted patterns (PMMX, PPro, PII and PIII)</h4>
A repetitive branch pattern is predicted perfectly by this mechanism if
every 4-bit sub-sequence in the period is unique.
Below is a list of repetitive branch patterns which are predicted perfectly:<p>
<table border=1 cellpadding=1 cellspacing=1>
<tr><td class="a3"> period </td>
<td class="a3"> perfectly predicted patterns </td></tr>
<tr><td>1-5</td><td>all</td></tr>
<tr><td>6</td><td>000011, 000101, 000111, 001011</td></tr>
<tr><td>7</td><td>0000101, 0000111, 0001011</td></tr>
<tr><td>8</td><td>00001011, 00001111, 00010011, 00010111, 00101101</td></tr>
<tr><td>9</td><td>000010011, 000010111, 000100111, 000101101</td></tr>
<tr><td>10</td><td>0000100111, 0000101101, 0000101111, 0000110111, 0001010011, 0001011101</td></tr>
<tr><td>11</td><td>00001001111, 00001010011, 00001011101, 00010100111</td></tr>
<tr><td>12</td><td>000010100111, 000010111101, 000011010111, 000100110111, 000100111011</td></tr>
<tr><td>13</td><td>0000100110111, 0000100111011, 0000101001111</td></tr>
<tr><td>14</td><td>00001001101111, 00001001111011, 00010011010111, 00010011101011, 00010110011101, 00010110100111</td></tr>
<tr><td>15</td><td>000010011010111, 000010011101011, 000010100110111,
000010100111011, 000010110011101, 000010110100111,
000010111010011, 000011010010111</td></tr>
<tr><td>16</td><td>0000100110101111, 0000100111101011, 0000101100111101,
0000101101001111</td></tr>
</table>
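The criterion stated above can be checked mechanically. A Python sketch (<kbd>perfectly_predicted</kbd> is a made-up helper name) that tests whether every 4-bit sub-sequence of a period is unique:

```python
# Sketch of the criterion above: a repeating pattern is predicted
# perfectly when every 4-bit sub-sequence in the period is unique.

def perfectly_predicted(pattern):
    cyclic = pattern * 4                  # let the 4-bit window wrap around
    windows = [cyclic[i:i + 4] for i in range(len(pattern))]
    return len(set(windows)) == len(windows)

print(perfectly_predicted("000111"))    # listed in the table -> True
print(perfectly_predicted("000001"))    # not listed          -> False
```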
<p>When reading this table, you should be aware that if a pattern is predicted correctly, then the
same pattern reversed (read backwards) is also predicted correctly, as well as the same
pattern with all bits inverted. Example:<br>
In the table we find the pattern: 0001011.<br>
Reversing this pattern gives: 1101000.<br>
Inverting all bits gives: 1110100.<br>
Both reversing and inverting: 0010111.<br>
These four patterns are all recognizable. Rotating the pattern one place to the left gives:
0010110. This is of course not a new pattern, only a phase-shifted version of the same
pattern. All patterns which can be derived from one of the patterns in the table by reversing,
inverting and rotating are also recognizable. For reasons of brevity, these are not listed.
<p>
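The derived forms in the example above can be generated mechanically; a short Python sketch (<kbd>variants</kbd> is a made-up name):

```python
# Generate the equivalent forms of a table pattern described above:
# the pattern itself, reversed, inverted, and reversed-and-inverted.

def variants(pattern):
    inverted = pattern.translate(str.maketrans("01", "10"))
    return {pattern, pattern[::-1], inverted, inverted[::-1]}

print(sorted(variants("0001011")))
# -> ['0001011', '0010111', '1101000', '1110100']
```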
It takes two periods for the pattern recognition mechanism to learn a regular repetitive
pattern after the BTB entry has been allocated. The pattern of mispredictions in the learning
period is not reproducible. This is probably because the BTB entry contained something
prior to allocation. Since BTB entries are allocated according to a random scheme, there is
little chance of predicting what happens during the initial learning period.
<p>
<h4>22.2.5 Handling deviations from a regular pattern (PMMX, PPro, PII and PIII)</h4>
The branch prediction mechanism is also extremely good at handling 'almost regular'
patterns, or deviations from the regular pattern. Not only does it learn what the regular
pattern looks like. It also learns what deviations from the regular pattern look like. If
deviations are always of the same type, then it will remember what comes after the irregular
event, and the deviation will cost only one misprediction.
<p>
Example:
<pre>
0001110001110001110001011100011100011100010111000
                      ^                   ^</pre>
In this sequence, a 0 means nojump, a 1 means jump. The mechanism learns that the
repeated sequence is 000111. The first irregularity is an unexpected 0, which I have
marked with a <kbd>^</kbd>. After this 0 the next three jumps may be mispredicted, because it hasn't
learned what comes after 0010, 0101, and 1011. After one or two irregularities of the same
kind it has learned that after 0010 comes a 1, after 0101 comes a 1, and after 1011 comes a 1.
This means that after at most two irregularities of the same kind, it has learned to handle
this kind of irregularity with only one misprediction.
<p>
The prediction mechanism is also very effective when alternating between two different
regular patterns. If, for example, we have the pattern 000111 (with period 6) repeated many
times, then the pattern 01 (period 2) many times, and then return to the 000111 pattern,
then the mechanism doesn't have to relearn the 000111 pattern, because the counters
used in the 000111 sequence have been left untouched during the 01 sequence. After a
few alternations between the two patterns, it has also learned to handle the changes of
pattern with only one misprediction for each time the pattern is switched.
<p>
<h4>22.2.6 Patterns which are not predicted perfectly (PMMX, PPro, PII and PIII)</h4>
The simplest branch pattern which cannot be predicted perfectly is a branch which is taken
on every 6'th execution. The pattern is:
<pre>000001000001000001
    ^^    ^^    ^^
    ab    ab    ab</pre>
The sequence 0000 is alternately followed by a 0, in the positions marked a above, and
by a 1, in the positions marked b. This affects counter number 0 which will count up and
down all the time. If counter 0 happens to start in state 0 or 1, then it will alternate between
state 0 and 1. This will lead to a misprediction in position b. If counter 0 happens to start in
state 3, then it will alternate between state 2 and 3 which will cause a misprediction in
position a. The worst case is when it starts in state 2. It will alternate between state 1 and 2
with the unfortunate consequence that we get a misprediction both in position a and b. (This
is analogous to the worst case for the PPlain explained <a href="#worstpred">above</a>). Which of these four
situations we will get depends on the history of the BTB entry prior to allocation to this
branch. This is beyond our control because of the random allocation method.
<p>
In principle, it is possible to avoid the worst case situation where we have two mispredictions
per cycle by giving it an initial branch sequence which is specially designed for putting the
counter in the desired state. Such an approach cannot be recommended, however,
because of the considerable extra code complexity required, and because whatever
information we have put into the counter is likely to be lost during the next timer interrupt or
task switch.
<p>
<h4>22.2.7 Completely random patterns (PMMX, PPro, PII and PIII)</h4>
The powerful capability of pattern recognition has a minor drawback in the case of
completely random sequences with no regularities.
<p>
The following table lists the experimental fraction of mispredictions for a completely random
sequence of jumps and nojumps:<p>
<table border=1 cellpadding=1 cellspacing=1>
<tr><td align="center" class="a3"> fraction of jumps/nojumps </td>
<td align="center" class="a3"> fraction of mispredictions </td></tr>
<tr><td align="center">0.001/0.999</td><td align="center">0.001001</td></tr>
<tr><td align="center">0.01/0.99</td><td align="center">0.0101</td></tr>
<tr><td align="center">0.05/0.95</td><td align="center">0.0525</td></tr>
<tr><td align="center">0.10/0.90</td><td align="center">0.110</td></tr>
<tr><td align="center">0.15/0.85</td><td align="center">0.171</td></tr>
<tr><td align="center">0.20/0.80</td><td align="center">0.235</td></tr>
<tr><td align="center">0.25/0.75</td><td align="center">0.300</td></tr>
<tr><td align="center">0.30/0.70</td><td align="center">0.362</td></tr>
<tr><td align="center">0.35/0.65</td><td align="center">0.418</td></tr>
<tr><td align="center">0.40/0.60</td><td align="center">0.462</td></tr>
<tr><td align="center">0.45/0.55</td><td align="center">0.490</td></tr>
<tr><td align="center">0.50/0.50</td><td align="center">0.500</td></tr>
</table>
<p>
The fraction of mispredictions is slightly higher than it would be without pattern recognition
because the processor keeps trying to find repeated patterns in a sequence which has no
regularities.
<p>
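A software model of the counter scheme from section 22.2.3 lands close to the measured numbers in the table. This Python sketch (function name, iteration count and seed are arbitrary choices of mine, and the model ignores BTB allocation effects) feeds a random sequence through sixteen history-selected two-bit counters:

```python
import random

# Run a completely random branch sequence through the two-level
# predictor model described above and measure the misprediction fraction.

def random_misprediction_fraction(p_jump, n=100000, seed=1):
    rng = random.Random(seed)
    counters = [0] * 16    # sixteen two-bit counters, selected by history
    history = 0
    wrong = 0
    for _ in range(n):
        outcome = 1 if rng.random() < p_jump else 0
        predicted = 1 if counters[history] >= 2 else 0
        wrong += predicted != outcome
        # saturating counter update, then shift the outcome into history
        if outcome:
            counters[history] = min(3, counters[history] + 1)
        else:
            counters[history] = max(0, counters[history] - 1)
        history = ((history << 1) | outcome) & 0xF
    return wrong / n

# with 25% jumps the model comes out near the 0.300 in the table,
# noticeably above the 0.25 that always predicting the common case gives:
print(round(random_misprediction_fraction(0.25), 3))
```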
<h4>22.2.8 Tight loops (PMMX)</h4>
The branch prediction is not reliable in tiny loops where the pattern recognition mechanism
doesn't have time to update its data before the next branch is met. This means that simple
patterns, which would normally be predicted perfectly, are not recognized. Incidentally,
some patterns which normally would not be recognized, are predicted perfectly in tight
loops. For example, a loop which always repeats 6 times would have the branch pattern
111110 for the branch instruction at the bottom of the loop. This pattern would normally
have one or two mispredictions per iteration, but in a tight loop it has none. The same
applies to a loop which repeats 7 times. Most other repeat counts are predicted more poorly in
tight loops than normally. This means that a loop which iterates 6 or 7 times should
preferably be tight, whereas other loops should preferably not be tight. You may unroll a
loop if necessary to make it less tight.
<p>
To find out whether a loop will behave as 'tight' on the PMMX you may use the following
rule of thumb: Count the number of instructions in the loop. If the number is 6 or less, then
the loop will behave as tight. If you have more than 7 instructions, then you can be
reasonably sure that the pattern recognition functions normally. Strangely enough, it doesn't
matter how many clock cycles each instruction takes, whether it has stalls, or whether it is
paired or not. Complex integer instructions do not count. A loop can have lots of complex
integer instructions and still behave as a tight loop. A complex integer instruction
is a non-pairable integer instruction which always takes more than one clock cycle. Complex floating
point instructions and MMX instructions still count as one. Note that this rule of thumb is
heuristic and not completely reliable. In important cases you may want to do your own
testing. You can use performance monitor counter number 35H for the PMMX to count
branch mispredictions. Test results may not be completely deterministic, because branch
predictions may depend on the history of the BTB entry prior to allocation.
<p>
Tight loops on PPro, PII and PIII are predicted normally, and take a minimum of two clock cycles per
iteration.
<p>
<h4>22.2.9 Indirect jumps and calls (PMMX, PPro, PII and PIII)</h4>
There is no pattern recognition for indirect jumps and calls, and the BTB can remember no
more than one target for an indirect jump. It is simply predicted to go to the same target as it
did last time.
<p>
<h4>22.2.10 JECXZ and LOOP (PMMX)</h4>
There is no pattern recognition for these two instructions in the PMMX. They are simply
predicted to go the same way as last time they were executed. These two instructions
should be avoided in time-critical code for PMMX. (In PPro, PII and PIII they are predicted using
pattern recognition, but the <kbd>LOOP</kbd> instruction is still inferior to <kbd>DEC ECX / JNZ</kbd>).
<p>
<h4>22.2.11 Returns (PMMX, PPro, PII and PIII)</h4>
The PMMX, PPro, PII and PIII processors have a Return Stack Buffer (RSB) which is used for
predicting return instructions. The RSB works as a First-In-Last-Out buffer. Each time a
call instruction is executed, the corresponding return address is pushed into the RSB. And
each time a return instruction is executed, a return address is pulled out of the RSB and
used for prediction of the return. This mechanism makes sure that return instructions are
correctly predicted when the same subroutine is called from several different locations.
<p>
In order to make sure this mechanism works correctly, you must make sure that all calls
and returns are matched. Never jump out of a subroutine without a return and never use a
return as an indirect jump if speed is critical.
<p>
The RSB can hold four entries in the PMMX, sixteen in the PPro, PII and PIII. In the case where
the RSB is empty, the return instruction is predicted in the same way as an indirect jump,
i.e. it is expected to go to the same target as it did last time.
<p>
On the PMMX, when subroutines are nested deeper than four levels then the innermost
four levels use the RSB, whereas all subsequent returns from the outer levels use the
simpler prediction mechanism as long as there are no new calls. A return instruction which
uses the RSB still occupies a BTB entry. Four entries in the RSB of the PMMX don't
sound like much, but it is probably sufficient. Subroutine nesting deeper than four levels is
certainly not unusual, but only the innermost levels matter in terms of speed, except
possibly for recursive procedures.
<p>
On the PPro, PII and PIII, when subroutines are nested deeper than sixteen levels then the
innermost 16 levels use the RSB, whereas all subsequent returns from the outer levels are
mispredicted. Recursive subroutines should therefore not go deeper than 16 levels.
<p>
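The PPro/PII/PIII behaviour just described can be illustrated with a small model. This Python sketch assumes, for illustration only, that the oldest entry is simply lost when the RSB overflows (<kbd>mispredicted_returns</kbd> is a made-up name):

```python
# Model of the return stack buffer behaviour described above, under the
# assumption that the oldest entry is lost on overflow.

def mispredicted_returns(depth, rsb_size=16):
    """Count mispredicted returns for a call chain nested `depth` deep."""
    rsb = []
    for call in range(depth):           # nested calls push return addresses
        if len(rsb) == rsb_size:
            rsb.pop(0)                  # overflow: oldest entry is lost
        rsb.append(call)
    wrong = 0
    for ret in reversed(range(depth)):  # unwind: pop one prediction per return
        predicted = rsb.pop() if rsb else None
        wrong += predicted != ret
    return wrong

print(mispredicted_returns(20))   # nested 20 deep: the 4 outermost returns miss -> 4
print(mispredicted_returns(10))   # within 16 levels: all predicted -> 0
```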
<h4>22.2.12 Static prediction in PMMX</h4>
A control transfer instruction which has not been seen before or which is not in the BTB is
always predicted to fall through on the PMMX. It doesn't matter whether it goes forwards or
backwards.
<p>
A branch instruction will not get a BTB entry if it always falls through. As soon as it is taken
once, it will get into the BTB and stay there no matter how many times it falls through. A
control transfer instruction can only go out of the BTB when it is pushed out by another
control transfer instruction which steals its BTB entry.
<p>
Any control transfer instruction which jumps to the address immediately following itself will
not get a BTB entry. Example: <pre>
    JMP SHORT LL
LL:</pre><p>
This instruction will never get a BTB entry and therefore always have a misprediction
penalty.
<p>
<h4>22.2.13 Static prediction in PPro, PII and PIII</h4>
On PPro, PII and PIII, a control transfer instruction which has not been seen before or which is
not in the BTB is predicted to fall through if it goes forwards, and to be taken if it goes
backwards (e.g. a loop). Static prediction takes longer than dynamic prediction on
these processors.
<p>
If your code is unlikely to be cached then it is preferable to have the most frequently
executed branch fall through in order to improve prefetching.
<p>
<h4>22.2.14 Close jumps (PMMX)</h4>
On the PMMX, there is a risk that two control transfer instructions will share the same BTB
entry if they are too close to each other. The obvious result is that they will always be
mispredicted.
<p>
The BTB entry for a control transfer instruction is identified by bits 2-31 of the address of
the last byte in the instruction. If two control transfer instructions are so close together that
they differ only in bits 0-1 of the address, then we have the problem of a shared BTB entry.
Example:
<pre>    CALL P
    JNC  SHORT L</pre><p>
If the last byte of the <kbd>CALL</kbd> instruction and the last byte of the <kbd>JNC</kbd> instruction lie within the
same dword of memory, then we have the penalty. You have to look at the output list file
from the assembler to see whether the two addresses are separated by a DWORD boundary
or not. (A DWORD boundary is an address divisible by 4).
<p>
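The dword check described above is a one-line bit comparison; a Python sketch (<kbd>share_btb_entry</kbd> is a made-up name, and the example addresses are arbitrary):

```python
# Sketch of the check described above: two control transfer instructions
# share a PMMX BTB entry when the addresses of their last bytes agree in
# all bits except 0-1, i.e. lie within the same dword.

def share_btb_entry(last_byte_addr_1, last_byte_addr_2):
    return last_byte_addr_1 >> 2 == last_byte_addr_2 >> 2

print(share_btb_entry(0x1001, 0x1003))   # same dword -> True
print(share_btb_entry(0x1003, 0x1004))   # dword boundary between -> False
```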
There are various ways to solve this problem: <br>
1. Move the code sequence a little up or down in memory so that you get a dword
boundary between the two addresses.<br>
2. Change the short jump to a near jump (with 4 bytes displacement) so that the end of the
instruction is moved further down. There is no way you can force the assembler to use
anything but the shortest form of an instruction so you have to hard-code the near
branch if you choose this solution.<br>
3. Put in some instruction between the <kbd>CALL</kbd> and the <kbd>JNC</kbd> instructions. This is the easiest
method, and the only method if you don't know where DWORD boundaries are because
your segment is not dword aligned or because the code keeps moving up and down as
you make changes in the preceding code:<br>
<pre>    CALL P
    MOV  EAX,EAX   ; two bytes filler to be safe
    JNC  SHORT L</pre><p>
If you want to avoid problems on the PPlain too, then put in two <kbd>NOP</kbd>'s instead to prevent
pairing (see section 22.1.3 above).
<p>
The <kbd>RET</kbd> instruction is particularly prone to this problem because it is only one byte long:
|
|
<pre> JNZ NEXT
|
|
RET</pre><p>
|
|
Here you may need up to three bytes of fillers:
|
|
<pre> JNZ NEXT
|
|
NOP
|
|
MOV EAX,EAX
|
|
RET</pre>
|
|
<p>
<h4>22.2.15 Consecutive calls or returns (PMMX)</h4>
There is a penalty when the first instruction pair following the target label of a call contains
another call instruction or if a return follows immediately after another return. Example:
<pre>FUNC1   PROC  NEAR
        NOP              ; avoid call after call
        NOP
        CALL  FUNC2
        CALL  FUNC3
        NOP              ; avoid return after return
        RET
FUNC1   ENDP</pre><p>
Two <kbd>NOP</kbd>'s are required before <kbd>CALL FUNC2</kbd> because a single <kbd>NOP</kbd> would pair with the
<kbd>CALL</kbd>. One <kbd>NOP</kbd> is enough before the <kbd>RET</kbd> because
<kbd>RET</kbd> is unpairable. No <kbd>NOP</kbd>'s are required
between the two <kbd>CALL</kbd> instructions because there is no penalty for call after return. (On the
PPlain you would need two <kbd>NOP</kbd>'s here too).
<p>
The penalty for chained calls only occurs when the same subroutines are called from more
than one location (probably because the RSB needs updating). Chained returns always
have a penalty. There is sometimes a small stall for a jump after a call, but no penalty for
return after call; call after return; jump, call, or return after jump; or jump after return.
<p>
<h4>22.2.16 Chained jumps (PPro, PII and PIII)</h4>
A jump, call, or return cannot be executed in the first clock cycle after a previous jump, call,
or return. Therefore, chained jumps will take two clock cycles for each jump, and you may
want to make sure that the processor has something else to do in parallel. For the same
reason, a loop will take at least two clock cycles per iteration on these processors.
<p>
<h4>22.2.17 Designing for branch predictability (PMMX, PPro, PII and PIII)</h4>
Multiway branches (switch/case statements) are implemented either as an indirect jump
using a list of jump addresses, or as a tree of branch instructions. Since indirect jumps are
poorly predicted, the latter method may be preferred if easily predicted patterns can be
expected and you have enough BTB entries. In case you decide to use the former method,
then it is recommended that you put the list of jump addresses in the data segment.
<p>
You may want to reorganize your code so that branch patterns which are not predicted
perfectly can be replaced by other patterns which are. Consider, for example, a loop which
always executes 20 times. The conditional jump at the bottom of the loop is taken 19 times
and falls through every 20th time. This pattern is regular, but not recognized by the pattern
recognition mechanism, so the fall-through is always mispredicted. You may make two
nested loops of four and five iterations, or unroll the loop by four and let it execute 5 times,
in order to have only recognizable patterns. Such complicated schemes are only worth the
extra code on the PPro, PII and PIII processors, where mispredictions are very expensive. For higher
loop counts there is no reason to do anything about the single misprediction.
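The restructuring described above can be sketched in a high-level language; the loop body still runs 20 times, but each conditional jump now follows a short pattern the predictor can recognize (the <kbd>body</kbd> function is a hypothetical placeholder):

```c
static int calls = 0;
static void body(void) { calls++; }  /* hypothetical loop body */

/* Instead of: for (i = 0; i < 20; i++) body();
   use two nested loops of four and five iterations, so that each
   branch repeats with a period short enough to be recognized. */
void run_restructured(void)
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 5; j++)
            body();
}
```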
<p>
<h3><a name="22_3">22.3. Avoiding jumps (all processors)</a></h3>
There can be many reasons why you may want to reduce the number of jumps, calls and
returns:
<ul>
<li>jump mispredictions are very expensive,
<li>there are various penalties for consecutive or chained jumps, depending on the
processor,
<li>jump instructions may push one another out of the branch target buffer because of the
random replacement algorithm,
<li>a return takes 2 clocks on PPlain and PMMX; calls and returns generate 4 uops on
PPro, PII and PIII,
<li>on PPro, PII and PIII, instruction fetch may be delayed after a jump
(chapter <a href="#15">15</a>), and
retirement may be slightly less effective for taken jumps than for other uops
(chapter <a href="#18">18</a>).
</ul>
<p>
Calls and returns can be avoided by replacing small procedures with inline macros.
And in many cases it is possible to reduce the number of jumps by restructuring
your code. For example, a jump to a jump should be replaced by a jump to the final
target. In some cases this is even possible with conditional jumps if the condition
is the same or is known. A jump to a return can be replaced by a return. If you want
to eliminate a return to a return, then you should not manipulate the stack pointer
because that would interfere with the prediction mechanism of the return stack buffer.
Instead, you can replace the preceding call with a jump. For example
<kbd>CALL PRO1 / RET</kbd> can be replaced by <kbd>JMP PRO1</kbd> if
<kbd>PRO1</kbd> ends with the same kind of <kbd>RET</kbd>.
<p>
You may also eliminate a jump by duplicating the code jumped to. This can be
useful if you have a two-way branch inside a loop or before a return. Example:
<pre>A:  CMP  [EAX+4*EDX],ECX
    JE   B
    CALL X
    JMP  C
B:  CALL Y
C:  INC  EDX
    JNZ  A
    MOV  ESP, EBP
    POP  EBP
    RET</pre>
The jump to <kbd>C</kbd> may be eliminated by duplicating the loop epilog:
<pre>A:  CMP  [EAX+4*EDX],ECX
    JE   B
    CALL X
    INC  EDX
    JNZ  A
    JMP  D
B:  CALL Y
C:  INC  EDX
    JNZ  A
D:  MOV  ESP, EBP
    POP  EBP
    RET</pre>
The most often executed branch should come first here. The jump to <kbd>D</kbd>
is outside the loop and therefore less critical. If this jump is executed so often
that it needs optimizing too, then replace it with the three instructions following
<kbd>D</kbd>.
<p>
<h3><a name="22_4">22.4. Avoiding conditional jumps by using flags (all processors)</a></h3>
The most important jumps to eliminate are conditional jumps, especially if
they are poorly predictable. Sometimes it is possible to obtain the same effect
as a branch by ingenious manipulation of bits and flags. For example you may
calculate the absolute value of a signed number without branching:
<pre>   CDQ
   XOR EAX,EDX
   SUB EAX,EDX</pre>
(On PPlain and PMMX, use <kbd>MOV EDX,EAX / SAR EDX,31</kbd> instead of <kbd>CDQ</kbd>).
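The same sign-mask trick reads naturally in C, assuming (as the assembly does) that a right shift of a signed integer is arithmetic; the function name is my own:

```c
/* Branch-free absolute value, the C analogue of CDQ / XOR / SUB.
   Assumes >> on a signed int is an arithmetic shift, as on x86. */
int iabs(int x)
{
    int mask = x >> 31;        /* 0 if x >= 0, -1 if x < 0 (like CDQ) */
    return (x ^ mask) - mask;  /* XOR EAX,EDX / SUB EAX,EDX */
}
```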
<p>
The carry flag is particularly useful for this kind of trick:<br>
Setting carry if a value is zero: <kbd>CMP [VALUE],1</kbd><br>
Setting carry if a value is not zero: <kbd>XOR EAX,EAX / CMP EAX,[VALUE]</kbd><br>
Incrementing a counter if carry: <kbd>ADC EAX,0</kbd><br>
Setting a bit for each time the carry is set: <kbd>RCL EAX,1</kbd><br>
Generating a bit mask if carry is set: <kbd>SBB EAX,EAX</kbd><br>
Setting a bit on an arbitrary condition: <kbd>SETcond AL</kbd><br>
Setting all bits on an arbitrary condition: <kbd>XOR EAX,EAX / SETNcond AL / DEC EAX</kbd><br>
(remember to reverse the condition in the last example)
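As an illustration, the <kbd>CMP / ADC</kbd> idiom above counts matches without a conditional jump; in C the same idea is simply adding the truth value of the condition (a sketch, function name my own):

```c
/* Branch-free counting: add the condition itself instead of jumping.
   A compiler can turn this into CMP / ADC or SETcc sequences. */
int count_zeros(const int *a, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        count += (a[i] == 0);  /* 1 when zero, 0 otherwise */
    return count;
}
```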
<p>
The following example finds the minimum of two unsigned numbers: if (b < a) a = b;
<pre>   SUB EBX,EAX
   SBB ECX,ECX
   AND ECX,EBX
   ADD EAX,ECX</pre><p>
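A C rendering of the same mask trick; the borrow produced by <kbd>SUB</kbd> becomes an all-ones mask via <kbd>SBB ECX,ECX</kbd> (a sketch, function name my own):

```c
/* Branch-free unsigned minimum, mirroring SUB / SBB / AND / ADD:
   diff = b - a (wraps around), mask is all ones exactly when b < a. */
unsigned umin(unsigned a, unsigned b)
{
    unsigned diff = b - a;              /* SUB EBX,EAX */
    unsigned mask = (b < a) ? ~0u : 0u; /* SBB ECX,ECX */
    return a + (diff & mask);           /* AND ECX,EBX / ADD EAX,ECX */
}
```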
The next example chooses between two numbers: if (a != 0) a = b; else a = c;
<pre>   CMP EAX,1
   SBB EAX,EAX
   XOR ECX,EBX
   AND EAX,ECX
   XOR EAX,EBX</pre><p>
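Step by step in C: <kbd>CMP EAX,1 / SBB EAX,EAX</kbd> builds a mask that is all ones exactly when a is zero, and the two <kbd>XOR</kbd>'s select b or c through it (a sketch, function name my own):

```c
/* Branch-free select: returns b if a != 0, otherwise c. */
int select_bc(unsigned a, int b, int c)
{
    int mask = (a == 0) ? -1 : 0;  /* CMP EAX,1 / SBB EAX,EAX */
    return ((c ^ b) & mask) ^ b;   /* XOR ECX,EBX / AND / XOR */
}
```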
Whether or not such tricks are worth the extra code depends on how predictable a
conditional jump would be, whether the extra pairing or scheduling opportunities of the
branch-free code can be utilized, and whether there are other jumps following immediately
after which could suffer the penalties of consecutive jumps.
<p>
<h3><a name="22_5">22.5. Replacing conditional jumps by conditional moves (PPro, PII and PIII)</a></h3>
The PPro, PII and PIII processors have conditional move instructions intended specifically for
avoiding branches because branch misprediction is very time-consuming on these
processors. There are conditional move instructions for both integer and floating point
registers. For code that will run only on these processors you may replace poorly
predictable branches with conditional moves whenever possible. If you want your code to
run on all processors then you may make two versions of the most critical parts of the code,
one for processors that support conditional move instructions and one for those that don't
(see chapter <a href="#27_10">27.10</a> for how to detect if conditional moves are supported).
<p>
The misprediction penalty for a branch may be so high that it is advantageous to replace it
with conditional moves even when it costs several extra instructions. But a conditional move
instruction has the disadvantage that it makes dependency chains longer: it waits for three
operands to be ready - the condition flag and the two move operands - even though only
one of the move operands is needed. You have to consider if any of these
three operands are likely to be delayed by dependency chains or cache misses. If the
condition flag is available long before the move operands then you may as well use a
branch, because a possible branch misprediction could be resolved while waiting for the
move operands. In situations where you have to wait long for a move operand that may not
be needed after all, the branch will be faster than the conditional move despite a possible
misprediction penalty. The opposite situation is when the condition flag is delayed while
both move operands are available early. In this situation the conditional move is preferred
over the branch if misprediction is likely.
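In compiled code, a simple ternary expression is the usual way to invite the compiler to emit <kbd>CMOVcc</kbd> rather than a branch; whether it actually does depends on the compiler and the target processor. A sketch (function name my own):

```c
/* A data-dependent select with no predictable pattern: a compiler
   targeting PPro or later will typically use CMOVcc here. */
int clamp_low(int x, int lo)
{
    return (x < lo) ? lo : x;
}
```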
<p>
<h2><a name="23">23</a>. Reducing code size (all processors)</h2>
As explained in chapter <a href="#7">7</a>, the code cache is 8 or 16 kb. If you have problems keeping the
critical parts of your code within the code cache, then you may consider reducing the size of
your code.
<p>
32 bit code is usually bigger than 16 bit code because addresses and data constants take 4
bytes in 32 bit code and only 2 bytes in 16 bit code. However, 16 bit code has other
penalties such as prefixes and problems with accessing adjacent words simultaneously
(see chapter 10.2 <a href="#imperfectpush">above</a>). Some other methods for reducing the size of your code are
discussed below.
<p>
Both jump addresses, data addresses, and data constants take less space if they can be
expressed as a sign-extended byte, i.e. if they are within the interval from -128 to +127.
<p>
For jump addresses this means that short jumps take two bytes of code, whereas jumps
beyond 127 bytes take 5 bytes if unconditional and 6 bytes if conditional.
<p>
Likewise, data addresses take less space if they can be expressed as a pointer and a
displacement between -128 and +127.
Example:<br>
<kbd>  MOV EBX,DS:[100000] / ADD EBX,DS:[100004]  ; 12 bytes</kbd><br>
Reduce to:<br>
<kbd>  MOV EAX,100000 / MOV EBX,[EAX] / ADD EBX,[EAX+4] ; 10 bytes</kbd>
<p>
The advantage of using a pointer obviously increases if you use it many times. Storing data
on the stack and using <kbd>EBP</kbd> or <kbd>ESP</kbd> as pointer will thus make your code smaller than if you
use static memory locations and absolute addresses, provided of course that your data are
within +/-127 bytes of the pointer. Using <kbd>PUSH</kbd> and <kbd>POP</kbd> to write and read temporary data is
even shorter.
<p>
Data constants may also take less space if they are between -128 and +127. Most
instructions with immediate operands have a short form where the operand is a
sign-extended single byte. Examples:
<pre>   PUSH 200      ; 5 bytes
   PUSH 100      ; 2 bytes

   ADD EBX,128   ; 6 bytes
   SUB EBX,-128  ; 3 bytes</pre><p>
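The rule behind these examples can be stated as a one-line predicate: a constant qualifies for the short encoding exactly when it lies in the sign-extended byte range. A trivial C sketch (function name my own):

```c
/* An immediate fits the short (sign-extended byte) encoding
   iff it lies in -128..+127, which is why ADD EBX,128 is long
   but SUB EBX,-128 is short. */
int fits_imm8(long c)
{
    return c >= -128 && c <= 127;
}
```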
The most important instruction with an immediate operand which doesn't have such a short
form is <kbd>MOV</kbd>.<br>
Examples:
<pre>   MOV EAX, 0             ; 5 bytes</pre>
May be changed to:<br>
<pre>   XOR EAX,EAX            ; 2 bytes</pre>
And
<pre>   MOV EAX, 1             ; 5 bytes</pre>
May be changed to:
<pre>   XOR EAX,EAX / INC EAX  ; 3 bytes</pre>
or:
<pre>   PUSH 1 / POP EAX       ; 3 bytes</pre>
And
<pre>   MOV EAX, -1            ; 5 bytes</pre>
May be changed to:
<pre>   OR EAX, -1             ; 3 bytes</pre>
<p>
If the same address or constant is used more than once then you may load it
into a register. A <kbd>MOV</kbd>
with a 4-byte immediate operand may sometimes be replaced by an arithmetic
instruction if the value of the register before the <kbd>MOV</kbd> is known. Example:
<pre>   MOV [mem1],200           ; 10 bytes
   MOV [mem2],200           ; 10 bytes
   MOV [mem3],201           ; 10 bytes
   MOV EAX,100              ; 5 bytes
   MOV EBX,150              ; 5 bytes</pre><p>
Assuming that <kbd>mem1</kbd> and <kbd>mem3</kbd> are both within -128/+127
bytes of <kbd>mem2</kbd>, this may be changed to:
<pre>   MOV EBX, OFFSET mem2     ; 5 bytes
   MOV EAX,200              ; 5 bytes
   MOV [EBX+mem1-mem2],EAX  ; 3 bytes
   MOV [EBX],EAX            ; 2 bytes
   INC EAX                  ; 1 byte
   MOV [EBX+mem3-mem2],EAX  ; 3 bytes
   SUB EAX,101              ; 3 bytes
   LEA EBX,[EAX+50]         ; 3 bytes</pre><p>
Be aware of the AGI stall in the <kbd>LEA</kbd> instruction (for PPlain and PMMX).
<p>
You may also consider that different instructions have different lengths. The following
instructions take only one byte and are therefore very attractive:
<kbd>PUSH reg</kbd>, <kbd>POP reg, INC reg32, DEC reg32</kbd>.<br>
<kbd>INC</kbd> and <kbd>DEC</kbd> with 8 bit registers take 2 bytes, so
<kbd>INC EAX</kbd> is shorter than <kbd>INC AL</kbd>.
<p>
<kbd>XCHG EAX,reg</kbd> is also a single-byte instruction and thus takes less space
than <kbd>MOV EAX,reg</kbd>, but it is slower.
<p>
Some instructions take one byte less when they use the accumulator than when they use
any other register. <br>
Examples:
<pre>   MOV EAX,DS:[100000]  is smaller than  MOV EBX,DS:[100000]
   ADD EAX,1000         is smaller than  ADD EBX,1000</pre><p>
Instructions with pointers take one byte less when they have only a base pointer
(not <kbd>ESP</kbd>)
and a displacement than when they have a scaled index register, or both base pointer and
index register, or <kbd>ESP</kbd> as base pointer. <br>
Examples:
<pre>   MOV EAX,[array][EBX]  is smaller than  MOV EAX,[array][EBX*4]
   MOV EAX,[EBP+12]      is smaller than  MOV EAX,[ESP+12]</pre><p>
Instructions with <kbd>EBP</kbd> as base pointer and no displacement and no index take one byte more
than with other registers:
<pre>   MOV EAX,[EBX]    is smaller than  MOV EAX,[EBP], but
   MOV EAX,[EBX+4]  is same size as  MOV EAX,[EBP+4].</pre><p>
Instructions with a scaled index pointer and no base pointer must have a four byte
displacement, even when it is 0:
<pre>   LEA EAX,[EBX+EBX]  is shorter than  LEA EAX,[2*EBX].</pre>
<p>
<h2><a name="24">24</a>. Scheduling floating point code (PPlain and PMMX)</h2>
Floating point instructions cannot pair the way integer instructions can, except for one
special case, defined by the following rules:
<ul>
<li>the first instruction (executing in the U-pipe) must be <kbd>FLD, FADD, FSUB, FMUL,
FDIV, FCOM, FCHS,</kbd> or <kbd>FABS</kbd>,
<li>the second instruction (in the V-pipe) must be <kbd>FXCH</kbd>,
<li>the instruction following the <kbd>FXCH</kbd> must be a floating point instruction, otherwise the
<kbd>FXCH</kbd> will pair imperfectly and take an extra clock cycle.
</ul>
This special pairing is important, as will be explained shortly.
<p>
While floating point instructions in general cannot be paired, many can be pipelined, i.e. one
instruction can begin before the previous instruction has finished. Example:
<pre>   FADD ST(1),ST(0)   ; clock cycle 1-3
   FADD ST(2),ST(0)   ; clock cycle 2-4
   FADD ST(3),ST(0)   ; clock cycle 3-5
   FADD ST(4),ST(0)   ; clock cycle 4-6</pre><p>
Obviously, two instructions cannot overlap if the second instruction needs the result of the
first. Since almost all floating point instructions involve the top of stack register,
<kbd>ST(0)</kbd>, there are seemingly not very many possibilities for making an instruction independent of the
result of previous instructions. The solution to this problem is register renaming. The <kbd>FXCH</kbd>
instruction does not in reality swap the contents of two registers, it only swaps their names.
Instructions which push or pop the register stack also work by renaming. Floating point
register renaming has been highly optimized on the Pentiums so that a register may be
renamed while in use. Register renaming never causes stalls - it is even possible to rename
a register more than once in the same clock cycle, as for example when you pair <kbd>FLD</kbd> or
<kbd>FCOMPP</kbd> with <kbd>FXCH</kbd>.
<p>
By the proper use of <kbd>FXCH</kbd> instructions you may obtain a lot of overlapping in your floating
point code. Example:
<pre>   FLD  [a1]     ; clock cycle 1
   FADD [a2]     ; clock cycle 2-4
   FLD  [b1]     ; clock cycle 3
   FADD [b2]     ; clock cycle 4-6
   FLD  [c1]     ; clock cycle 5
   FADD [c2]     ; clock cycle 6-8
   FXCH ST(2)    ; clock cycle 6
   FADD [a3]     ; clock cycle 7-9
   FXCH ST(1)    ; clock cycle 7
   FADD [b3]     ; clock cycle 8-10
   FXCH ST(2)    ; clock cycle 8
   FADD [c3]     ; clock cycle 9-11
   FXCH ST(1)    ; clock cycle 9
   FADD [a4]     ; clock cycle 10-12
   FXCH ST(2)    ; clock cycle 10
   FADD [b4]     ; clock cycle 11-13
   FXCH ST(1)    ; clock cycle 11
   FADD [c4]     ; clock cycle 12-14
   FXCH ST(2)    ; clock cycle 12</pre>
In the above example we are interleaving three independent threads. Each <kbd>FADD</kbd> takes 3
clock cycles, and we can start a new <kbd>FADD</kbd> in each clock cycle. When we have started an
<kbd>FADD</kbd> in the '<kbd>a</kbd>' thread we have time to start two new <kbd>FADD</kbd>
instructions in the '<kbd>b</kbd>' and '<kbd>c</kbd>'
threads before returning to the '<kbd>a</kbd>' thread, so every third
<kbd>FADD</kbd> belongs to the same thread.
We are using <kbd>FXCH</kbd> instructions every time to get the register that belongs to the desired
thread into <kbd>ST(0)</kbd>. As you can see in the example above, this generates a regular pattern,
but note well that the <kbd>FXCH</kbd> instructions repeat with a period of two while the threads have a
period of three. This can be quite confusing, so you have to 'play computer' in order to know
which registers are where.
<p>
All versions of the instructions <kbd>FADD, FSUB, FMUL,</kbd> and
<kbd>FILD</kbd> take 3 clock cycles and are
able to overlap, so that these instructions may be scheduled using the method described
above. Using a memory operand does not take more time than a register operand if the
memory operand is in the level 1 cache and properly aligned.
<p>
By now you must be used to rules having exceptions, and the overlapping rule is no
exception: You cannot start an <kbd>FMUL</kbd> instruction one clock cycle
after another <kbd>FMUL</kbd>
instruction, because the <kbd>FMUL</kbd> circuitry is not perfectly pipelined.
It is recommended that you
put another instruction in between two <kbd>FMUL</kbd>'s. Example:
<pre>   FLD  [a1]     ; clock cycle 1
   FLD  [b1]     ; clock cycle 2
   FLD  [c1]     ; clock cycle 3
   FXCH ST(2)    ; clock cycle 3
   FMUL [a2]     ; clock cycle 4-6
   FXCH          ; clock cycle 4
   FMUL [b2]     ; clock cycle 5-7 (stall)
   FXCH ST(2)    ; clock cycle 5
   FMUL [c2]     ; clock cycle 7-9 (stall)
   FXCH          ; clock cycle 7
   FSTP [a3]     ; clock cycle 8-9
   FXCH          ; clock cycle 10 (unpaired)
   FSTP [b3]     ; clock cycle 11-12
   FSTP [c3]     ; clock cycle 13-14</pre>
Here you have a stall before <kbd>FMUL [b2]</kbd> and before <kbd>FMUL [c2]</kbd>
because another <kbd>FMUL</kbd>
started in the preceding clock cycle. You can improve this code by putting
<kbd>FLD</kbd> instructions in between the <kbd>FMUL</kbd>'s:
<pre>   FLD  [a1]     ; clock cycle 1
   FMUL [a2]     ; clock cycle 2-4
   FLD  [b1]     ; clock cycle 3
   FMUL [b2]     ; clock cycle 4-6
   FLD  [c1]     ; clock cycle 5
   FMUL [c2]     ; clock cycle 6-8
   FXCH ST(2)    ; clock cycle 6
   FSTP [a3]     ; clock cycle 7-8
   FSTP [b3]     ; clock cycle 9-10
   FSTP [c3]     ; clock cycle 11-12</pre><p>
In other cases you may put <kbd>FADD, FSUB</kbd>, or anything else in
between <kbd>FMUL</kbd>'s to avoid the stalls.
<p>
Overlapping floating point instructions requires of course that you have some independent
threads that you can interleave. If you have only one big formula to execute, then you may
compute parts of the formula in parallel to achieve overlapping. If, for example, you want to
add six numbers, then you may split the operations into two threads with three numbers in
each, and add the two threads in the end:
<pre>   FLD  [a]      ; clock cycle 1
   FADD [b]      ; clock cycle 2-4
   FLD  [c]      ; clock cycle 3
   FADD [d]      ; clock cycle 4-6
   FXCH          ; clock cycle 4
   FADD [e]      ; clock cycle 5-7
   FXCH          ; clock cycle 5
   FADD [f]      ; clock cycle 7-9 (stall)
   FADD          ; clock cycle 10-12 (stall)</pre><p>
Here we have a one clock stall before <kbd>FADD [f]</kbd> because it is waiting
for the result of <kbd>FADD [d]</kbd>, and a two clock stall before the last
<kbd>FADD</kbd> because it is waiting for the result of
<kbd>FADD [f]</kbd>. The latter stall can be hidden by filling in some integer
instructions, but the first stall cannot, because an integer instruction at
this place would make the <kbd>FXCH</kbd> pair imperfectly.
<p>
The first stall can be avoided by having three threads rather than two, but
that would cost an extra <kbd>FLD</kbd>, so we do not save anything by having
three threads rather than two unless there are at least eight numbers to add.
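The same split into independent partial sums can be written in a high-level language; compilers rarely do this on their own because floating point addition is not associative, so the reordering must be made explicit. A sketch for six numbers (function name my own):

```c
/* Sum six numbers as two independent threads of three, then combine,
   mirroring the two-thread FADD schedule above. */
double sum6(const double *x)
{
    double s0 = x[0] + x[1] + x[2];  /* thread 'a' */
    double s1 = x[3] + x[4] + x[5];  /* thread 'b' */
    return s0 + s1;                  /* the final FADD */
}
```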
<p>
Not all floating point instructions can overlap. And some floating point
instructions can overlap more subsequent integer instructions than subsequent
floating point instructions. The <kbd>FDIV</kbd> instruction, for example,
takes 39 clock cycles. All but the first clock cycle can
overlap with integer instructions, but only the last two clock cycles can
overlap with floating point instructions. Example:
<pre>   FDIV          ; clock cycle 1-39  (U-pipe)
   FXCH          ; clock cycle 1-2   (V-pipe, imperfect pairing)
   SHR EAX,1     ; clock cycle 3     (U-pipe)
   INC EBX       ; clock cycle 3     (V-pipe)
   CMC           ; clock cycle 4-5   (non-pairable)
   FADD [x]      ; clock cycle 38-40 (U-pipe, waiting while FPU busy)
   FXCH          ; clock cycle 38    (V-pipe)
   FMUL [y]      ; clock cycle 40-42 (U-pipe, waiting for result of FDIV)</pre>
The first <kbd>FXCH</kbd> pairs with the <kbd>FDIV</kbd>, but takes an extra
clock cycle because it is not followed by a floating point instruction.
The <kbd>SHR / INC</kbd> pair starts before the <kbd>FDIV</kbd> is finished, but
has to wait for the <kbd>FXCH</kbd> to finish. The <kbd>FADD</kbd> has to wait
till clock 38 because new floating point instructions can only execute during
the last two clock cycles of the <kbd>FDIV</kbd>. The second <kbd>FXCH</kbd>
pairs with the <kbd>FADD</kbd>. The <kbd>FMUL</kbd> has to wait for the <kbd>FDIV</kbd>
to finish because it uses the result of the division.
<p>
If you have nothing else to put in after a floating point instruction with a
large integer overlap, such as <kbd>FDIV</kbd> or <kbd>FSQRT</kbd>, then you
may put in a dummy read from an address which you expect to need later in
the program to make sure it is in the level one cache. Example:
<pre>   FDIV QWORD PTR [EBX]
   CMP  [ESI],ESI
   FMUL QWORD PTR [ESI]</pre><p>
Here we use the integer overlap to pre-load the value at <kbd>[ESI]</kbd> into
the cache while the <kbd>FDIV</kbd> is being computed (we don't care what
the result of the <kbd>CMP</kbd> is).
<p>
Chapter <a href="#28">28</a> gives a complete listing of floating point instructions, and
what they can pair or overlap with.
<p>
There is no penalty for using a memory operand on floating point instructions
because the arithmetic unit is one step later in the pipeline than the read
unit. The tradeoff of this comes when you store floating point data to memory.
The <kbd>FST</kbd> or <kbd>FSTP</kbd> instruction with a memory
operand takes two clock cycles in the execution stage, but it needs the data one clock
earlier, so you will get a one clock stall if the value to store is not ready one clock cycle in
advance. This is analogous to an AGI stall. Example:
<pre>   FLD  [a1]     ; clock cycle 1
   FADD [a2]     ; clock cycle 2-4
   FLD  [b1]     ; clock cycle 3
   FADD [b2]     ; clock cycle 4-6
   FXCH          ; clock cycle 4
   FSTP [a3]     ; clock cycle 6-7
   FSTP [b3]     ; clock cycle 8-9</pre><p>
The <kbd>FSTP [a3]</kbd> stalls for one clock cycle because the result of
<kbd>FADD [a2]</kbd> is not ready
in the preceding clock cycle. In many cases you cannot hide this type of stall without
scheduling your floating point code into four threads or putting some integer instructions in
between. The two clock cycles in the execution stage of the <kbd>FST(P)</kbd> instruction cannot pair
or overlap with any subsequent instructions.
<p>
Instructions with integer operands such as <kbd>FIADD, FISUB, FIMUL, FIDIV, FICOM</kbd> may
be split up into simpler operations in order to improve overlapping. Example:
<pre>   FILD  [a]     ; clock cycle 1-3
   FIMUL [b]     ; clock cycle 4-9</pre><p>
Split up into:
<pre>   FILD  [a]     ; clock cycle 1-3
   FILD  [b]     ; clock cycle 2-4
   FMUL          ; clock cycle 5-7</pre><p>
In this example, you save two clocks by overlapping the two <kbd>FILD</kbd> instructions.
<p>
<h2><a name="25">25</a>. Loop optimization (all processors)</h2>
When analyzing a program you often find that most of the time consumption lies in the
innermost loop. The way to improve the speed is to carefully optimize the most
time-consuming loop using assembly language. The rest of the program may be left in
high-level language.
<p>
In all the following examples it is assumed that all data are in the level 1 cache. If the speed
is limited by cache misses then there is no reason to optimize the instructions. Rather, you
should concentrate on organizing your data in a way that minimizes cache misses (see
chapter <a href="#7">7</a>).
<p>
<h3><a name="25_1">25.1. Loops in PPlain and PMMX</a></h3>
A loop generally contains a counter controlling how many times to iterate, and often array
access reading or writing one array element for each iteration. I have chosen as an example
a procedure which reads integers from an array, changes the sign of each integer, and stores
the results in another array.
<p>
The C code for this procedure would be:
<pre>void ChangeSign (int * A, int * B, int N) {
  int i;
  for (i=0; i<N; i++) B[i] = -A[i];}</pre>
<p>
Translating to assembly, we might write the procedure like this:
<p>
<h4>Example 1.1:</h4>
<pre>_ChangeSign PROC NEAR
        PUSH    ESI
        PUSH    EDI
A       EQU     DWORD PTR [ESP+12]
B       EQU     DWORD PTR [ESP+16]
N       EQU     DWORD PTR [ESP+20]
        MOV     ECX, [N]
        JECXZ   L2
        MOV     ESI, [A]
        MOV     EDI, [B]
        CLD
L1:     LODSD
        NEG     EAX
        STOSD
        LOOP    L1
L2:     POP     EDI
        POP     ESI
        RET     ; (no extra pop if _cdecl calling convention)
_ChangeSign ENDP</pre>
This looks like a nice solution, but it is not optimal because it uses slow non-pairable
instructions. It takes 11 clock cycles per iteration if all data are in the level one cache.
<p>
<h4>Using pairable instructions only (PPlain and PMMX)</h4>
<h4>Example 1.2:</h4>
<pre>     MOV  ECX, [N]
     MOV  ESI, [A]
     TEST ECX, ECX
     JZ   SHORT L2
     MOV  EDI, [B]
L1:  MOV  EAX, [ESI]    ; u
     XOR  EBX, EBX      ; v (pairs)
     ADD  ESI, 4        ; u
     SUB  EBX, EAX      ; v (pairs)
     MOV  [EDI], EBX    ; u
     ADD  EDI, 4        ; v (pairs)
     DEC  ECX           ; u
     JNZ  L1            ; v (pairs)
L2:</pre>
Here we have used pairable instructions only, and scheduled the instructions so that
everything pairs. It now takes only 4 clock cycles per iteration. We could have obtained the
same speed without splitting the <kbd>NEG</kbd> instruction, but the other unpairable instructions
should be split up.
<p>
<h4>Using the same register for counter and index</h4>
<h4>Example 1.3:</h4>
<pre>     MOV  ESI, [A]
     MOV  EDI, [B]
     MOV  ECX, [N]
     XOR  EDX, EDX
     TEST ECX, ECX
     JZ   SHORT L2
L1:  MOV  EAX, [ESI+4*EDX]  ; u
     NEG  EAX               ; u
     MOV  [EDI+4*EDX], EAX  ; u
     INC  EDX               ; v (pairs)
     CMP  EDX, ECX          ; u
     JB   L1                ; v (pairs)
L2:</pre><p>
Using the same register for counter and index gives us fewer instructions in the body of the
loop, but it still takes 4 clocks because we have two unpaired instructions.
<p>
<h4>Letting the counter end at zero (PPlain and PMMX)</h4>
We want to get rid of the <kbd>CMP</kbd> instruction in example 1.3 by letting the counter end at zero
and using the zero flag for detecting when we are finished, as we did in example 1.2. One way
to do this would be to execute the loop backwards, taking the last array elements first.
However, data caches are optimized for accessing data forwards, not backwards, so if
cache misses are likely, then you should rather start the counter at -N and count through
negative values up to zero. The base registers should then point to the end of the arrays
rather than the beginning:
<p>
<h4>Example 1.4:</h4>
<pre>     MOV  ESI, [A]
     MOV  EAX, [N]
     MOV  EDI, [B]
     XOR  ECX, ECX
     LEA  ESI, [ESI+4*EAX]  ; point to end of array A
     SUB  ECX, EAX          ; -N
     LEA  EDI, [EDI+4*EAX]  ; point to end of array B
     JZ   SHORT L2
L1:  MOV  EAX, [ESI+4*ECX]  ; u
     NEG  EAX               ; u
     MOV  [EDI+4*ECX], EAX  ; u
     INC  ECX               ; v (pairs)
     JNZ  L1                ; u
L2:</pre>
We are now down to five instructions in the loop body, but it still takes 4 clocks because of
poor pairing. (If the addresses and sizes of the arrays are constants we may save two
registers by substituting <kbd>A+SIZE A</kbd> for <kbd>ESI</kbd>
and <kbd>B+SIZE B</kbd> for <kbd>EDI</kbd>). Now let's see how we
can improve pairing.
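In C, the counting-through-negative-indices scheme of example 1.4 looks as follows; this is only a sketch (the function name is my own), computing the same result as the original ChangeSign:

```c
/* ChangeSign with the counter running from -N up to zero and the
   pointers set to the ends of the arrays, as in example 1.4. */
void ChangeSignNeg(const int *A, int *B, int N)
{
    const int *endA = A + N;          /* LEA ESI,[ESI+4*EAX] */
    int *endB = B + N;                /* LEA EDI,[EDI+4*EAX] */
    for (int i = -N; i != 0; i++)     /* INC ECX / JNZ L1 */
        endB[i] = -endA[i];
}
```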
<p>
<h4>Pairing calculations with loop overhead (PPlain and PMMX)</h4>
We may want to improve pairing by intermingling calculations with the loop control
instructions. If we want to put something in between <kbd>INC ECX</kbd>
and <kbd>JNZ L1</kbd>, it has to be
something that doesn't affect the zero flag. The <kbd>MOV [EDI+4*ECX],EBX</kbd>
instruction after <kbd>INC ECX</kbd> would generate an AGI delay, so we have
to be more ingenious:
<p>
<h4>Example 1.5:</h4>
<pre>     MOV  EAX, [N]
     XOR  ECX, ECX
     SHL  EAX, 2            ; 4 * N
     JZ   SHORT L3
     MOV  ESI, [A]
     MOV  EDI, [B]
     SUB  ECX, EAX          ; - 4 * N
     ADD  ESI, EAX          ; point to end of array A
     ADD  EDI, EAX          ; point to end of array B
     JMP  SHORT L2
L1:  MOV  [EDI+ECX-4], EAX  ; u
L2:  MOV  EAX, [ESI+ECX]    ; v (pairs)
     XOR  EAX, -1           ; u
     ADD  ECX, 4            ; v (pairs)
     INC  EAX               ; u
     JNC  L1                ; v (pairs)
     MOV  [EDI+ECX-4], EAX
L3:</pre>
I have used a different way to calculate the negative of <kbd>EAX</kbd> here:
inverting all bits and adding one. The reason why I am using this method is
that I can use a dirty trick with the
<kbd>INC</kbd> instruction: <kbd>INC</kbd> doesn't change the carry flag,
whereas <kbd>ADD</kbd> does. I am using <kbd>ADD</kbd>
rather than <kbd>INC</kbd> to increment my loop counter, and testing the carry
flag rather than the zero
flag. It is then possible to put the <kbd>INC EAX</kbd> in between without
affecting the carry flag. You
may think that we could have used <kbd>LEA EAX,[EAX+1]</kbd> here instead of
<kbd>INC EAX</kbd> - at least
that doesn't change any flags - but the <kbd>LEA</kbd> instruction would have
an AGI stall, so that's not
the best solution. Note that the trick with the <kbd>INC</kbd> instruction
not changing the carry flag is useful only on PPlain and PMMX, but will
cause a partial flags stall on PPro, PII and PIII.
<p>
I have obtained perfect pairing here, and the loop now takes only 3 clock cycles.
Whether you want to increment the loop counter by 1 (as in example 1.4) or by 4
(as in example 1.5) is a matter of taste; it makes no difference in loop timing.
<p>
|
|
<h4>Overlapping the end of one operation with the beginning of the next (PPlain and PMMX)</h4>
The method used in example 1.5 is not very generally applicable, so we may look for other
methods of improving pairing opportunities. One way is to reorganize the loop so that the
end of one operation overlaps with the beginning of the next. I will call this convoluting the
loop. A convoluted loop has an unfinished operation at the end of each loop iteration which
will be finished in the next run. Actually, example 1.5 did pair the last <kbd>MOV</kbd> of one iteration
with the first <kbd>MOV</kbd> of the next, but we want to explore this method further:
<p>
<h4>Example 1.6:</h4>
<pre>     MOV ESI, [A]
     MOV EAX, [N]
     MOV EDI, [B]
     XOR ECX, ECX
     LEA ESI, [ESI+4*EAX]     ; point to end of array A
     SUB ECX, EAX             ; -N
     LEA EDI, [EDI+4*EAX]     ; point to end of array B
     JZ SHORT L3
     XOR EBX, EBX
     MOV EAX, [ESI+4*ECX]
     INC ECX
     JZ SHORT L2
L1:  SUB EBX, EAX             ; u
     MOV EAX, [ESI+4*ECX]     ; v (pairs)
     MOV [EDI+4*ECX-4], EBX   ; u
     INC ECX                  ; v (pairs)
     MOV EBX, 0               ; u
     JNZ L1                   ; v (pairs)
L2:  SUB EBX, EAX
     MOV [EDI+4*ECX-4], EBX
L3:</pre><p>
Here we begin reading the second value before we have stored the first, and
this of course improves pairing opportunities. The <kbd>MOV EBX,0</kbd>
instruction has been put in between <kbd>INC ECX</kbd> and <kbd>JNZ L1</kbd>
not to improve pairing, but to avoid an AGI stall.
<p>
<h4>Rolling out a loop (PPlain and PMMX)</h4>
The most generally applicable way to improve pairing opportunities is to do two operations
for each run and do half as many runs. This is called rolling out a loop:
<p>
<h4>Example 1.7:</h4>
<pre>     MOV ESI, [A]
     MOV EAX, [N]
     MOV EDI, [B]
     XOR ECX, ECX
     LEA ESI, [ESI+4*EAX]     ; point to end of array A
     SUB ECX, EAX             ; -N
     LEA EDI, [EDI+4*EAX]     ; point to end of array B
     JZ SHORT L2
     TEST AL,1                ; test if N is odd
     JZ SHORT L1
     MOV EAX, [ESI+4*ECX]     ; N is odd. do the odd one
     NEG EAX
     MOV [EDI+4*ECX], EAX
     INC ECX                  ; make counter even
     JZ SHORT L2              ; N = 1
L1:  MOV EAX, [ESI+4*ECX]     ; u
     MOV EBX, [ESI+4*ECX+4]   ; v (pairs)
     NEG EAX                  ; u
     NEG EBX                  ; u
     MOV [EDI+4*ECX], EAX     ; u
     MOV [EDI+4*ECX+4], EBX   ; v (pairs)
     ADD ECX, 2               ; u
     JNZ L1                   ; v (pairs)
L2:</pre>
<p>
Now we are doing two operations in parallel, which gives the best pairing opportunities. We
have to test if <kbd>N</kbd> is odd and, if so, do one operation outside the loop, because the loop can
only do an even number of operations.
<p>
The loop has an AGI stall at the first <kbd>MOV</kbd> instruction because
<kbd>ECX</kbd> has been incremented in
the preceding clock cycle. The loop therefore takes 6 clock cycles for two operations.
<p>
<h4>Reorganizing a loop to remove AGI stall (PPlain and PMMX)</h4>
<h4>Example 1.8:</h4>
<pre>     MOV ESI, [A]
     MOV EAX, [N]
     MOV EDI, [B]
     XOR ECX, ECX
     LEA ESI, [ESI+4*EAX]     ; point to end of array A
     SUB ECX, EAX             ; -N
     LEA EDI, [EDI+4*EAX]     ; point to end of array B
     JZ SHORT L3
     TEST AL,1                ; test if N is odd
     JZ SHORT L2
     MOV EAX, [ESI+4*ECX]     ; N is odd. do the odd one
     NEG EAX                  ; no pairing opportunity
     INC ECX                  ; make counter even
     MOV [EDI+4*ECX-4], EAX
     JNZ SHORT L2
     NOP                      ; add NOP's if JNZ L2 not predictable
     NOP
     JMP SHORT L3             ; N = 1
L1:  NEG EAX                  ; u
     NEG EBX                  ; u
     MOV [EDI+4*ECX-8], EAX   ; u
     MOV [EDI+4*ECX-4], EBX   ; v (pairs)
L2:  MOV EAX, [ESI+4*ECX]     ; u
     MOV EBX, [ESI+4*ECX+4]   ; v (pairs)
     ADD ECX, 2               ; u
     JNZ L1                   ; v (pairs)
     NEG EAX
     NEG EBX
     MOV [EDI+4*ECX-8], EAX
     MOV [EDI+4*ECX-4], EBX
L3:</pre>
<p>
The trick is to find a pair of instructions that do not use the loop counter as index, and
reorganize the loop so that the counter is incremented in the preceding clock cycle. We are
now down to 5 clock cycles for two operations, which is close to the best possible.
<p>
If data caching is critical, then you may improve the speed further by
interleaving the <kbd>A</kbd> and <kbd>B</kbd> arrays into one structured array
so that each <kbd>B[i]</kbd> comes immediately after the
corresponding <kbd>A[i]</kbd>. If the structured array is aligned by at least
8, then <kbd>B[i]</kbd> will always be
in the same cache line as <kbd>A[i]</kbd>, so you will never have a cache
miss when writing <kbd>B[i]</kbd>.
This may of course have a tradeoff in other parts of the program, so you
have to weigh the costs against the benefits.
<p>
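The interleaving idea can be sketched in C. The <kbd>Pair</kbd> struct and the function name below are mine, not from the example; they merely show the layout where each <kbd>B[i]</kbd> sits next to its <kbd>A[i]</kbd>:
<pre>
```c
#include <stdint.h>

/* Interleave A[i] and B[i] in one structured array so that B[i]
   shares a cache line with A[i]. Aligning each 8-byte pair by 8
   keeps both members in the same cache line. */
typedef struct { int32_t a; int32_t b; } Pair;

void change_sign(Pair * p, int n) {
    int i;
    for (i = 0; i < n; i++)
        p[i].b = -p[i].a;   /* the write lands in the line just read */
}
```
</pre>
<p>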
<h4>Rolling out by more than 2 (PPlain and PMMX)</h4>
You may think of doing more than two operations per iteration in order to reduce the loop
overhead per operation. But since the loop overhead in most cases can be reduced to only
one clock cycle per iteration, rolling out the loop by 4 rather than by 2 would only save
1/4 clock cycle per operation, which is hardly worth the effort. Only if the loop overhead
cannot be reduced to one clock cycle, and if N is very big, should you think of unrolling by 4.
<p>
The drawbacks of excessive loop unrolling are:
<ol>
<li>You need to calculate N MODULO R, where R is the unrolling factor, and do N
MODULO R operations before or after the main loop in order to make the remaining
number of operations divisible by R. This takes a lot of extra code and poorly predictable
branches. And the loop body of course also becomes bigger.
<li>A piece of code usually takes much more time the first time it executes, and the penalty
of first-time execution is bigger the more code you have, especially if N is small.
<li>Excessive code size makes the utilization of the code cache less effective.
</ol>
<p>
<h4>Handling multiple 8 or 16 bit operands simultaneously in 32 bit registers (PPlain and PMMX)</h4>
If you need to manipulate arrays of 8 or 16 bit operands, then there is a problem with
unrolled loops because you may not be able to pair two memory access operations. For
example, <kbd>MOV AL,[ESI] / MOV BL,[ESI+1]</kbd> will not pair if the two operands are within
the same dword of memory. But there may be a much smarter method, namely to handle
four bytes at a time in the same 32 bit register.
<p>
The following example adds 2 to all elements of an array of bytes:
<p>
<h4>Example 1.9:</h4>
<pre>     MOV ESI, [A]             ; address of byte array
     MOV ECX, [N]             ; number of elements in byte array
     TEST ECX, ECX            ; test if N is 0
     JZ SHORT L2
     MOV EAX, [ESI]           ; read first four bytes
L1:  MOV EBX, EAX             ; copy into EBX
     AND EAX, 7F7F7F7FH       ; get lower 7 bits of each byte in EAX
     XOR EBX, EAX             ; get the highest bit of each byte
     ADD EAX, 02020202H       ; add desired value to all four bytes
     XOR EBX, EAX             ; combine bits again
     MOV EAX, [ESI+4]         ; read next four bytes
     MOV [ESI], EBX           ; store result
     ADD ESI, 4               ; increment pointer
     SUB ECX, 4               ; decrement loop counter
     JA L1                    ; loop
L2:</pre>
This loop takes 5 clock cycles for every 4 bytes. The array should of course be aligned by
4. If the number of elements in the array is not divisible by four, then you may pad it in the
end with a few extra bytes to make the length divisible by four. This loop will always read
past the end of the array, so you should make sure the array is not placed at the end of a
segment, to avoid a general protection error.
<p>
Note that I have masked out the highest bit of each byte to avoid a possible carry from
each byte into the next when adding. I am using <kbd>XOR</kbd> rather than
<kbd>ADD</kbd> when putting in the high bit again to avoid carry.
<p>
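The masking trick is easier to see in C. This is only a sketch of the idea from example 1.9; the function name and the use of <kbd>uint32_t</kbd> are my own:
<pre>
```c
#include <stdint.h>

/* Add 2 to each of the four bytes packed in x, using the same
   carry-masking trick as example 1.9: clear the top bit of every
   byte, add, then restore the top bits with XOR, so that no carry
   can propagate from one byte into the next. */
uint32_t swar_add2(uint32_t x) {
    uint32_t low7 = x & 0x7F7F7F7Fu;    /* lower 7 bits of each byte */
    uint32_t high = x ^ low7;           /* the top bit of each byte  */
    return (low7 + 0x02020202u) ^ high; /* add, then combine again   */
}
```
</pre>
<p>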
The <kbd>ADD ESI,4</kbd> instruction could have been avoided by using the loop counter as index,
as in example 1.4. However, this would give an odd number of instructions in the loop
body, so there would be one unpaired instruction and the loop would still take 5 clocks.
Making the branch instruction unpaired would save one clock after the last operation when
the branch is mispredicted, but we would have to spend an extra clock cycle in the prolog
code to set up a pointer to the end of the array and calculate -N, so the two methods are
exactly equally fast. The method presented here is the simplest and shortest.
<p>
The next example finds the length of a zero-terminated string by searching
for the first byte of zero. It is faster than using <kbd>REP SCASB</kbd>:
<p>
<h4><a name="1-10">Example 1.10:</a></h4>
<pre>STRLEN PROC NEAR
     MOV EAX,[ESP+4]          ; get pointer
     MOV EDX,7
     ADD EDX,EAX              ; pointer+7 used in the end
     PUSH EBX
     MOV EBX,[EAX]            ; read first 4 bytes
     ADD EAX,4                ; increment pointer
L1:  LEA ECX,[EBX-01010101H]  ; subtract 1 from each byte
     XOR EBX,-1               ; invert all bytes
     AND ECX,EBX              ; and these two
     MOV EBX,[EAX]            ; read next 4 bytes
     ADD EAX,4                ; increment pointer
     AND ECX,80808080H        ; test all sign bits
     JZ L1                    ; no zero bytes, continue loop
     TEST ECX,00008080H       ; test first two bytes
     JNZ SHORT L2
     SHR ECX,16               ; not in the first 2 bytes
     ADD EAX,2
L2:  SHL CL,1                 ; use carry flag to avoid a branch
     POP EBX
     SBB EAX,EDX              ; compute length
     RET
STRLEN ENDP</pre>
Again we have used the method of overlapping the end of one operation with the beginning
of the next to improve pairing. I have not unrolled the loop because it is likely to repeat
relatively few times. The string should of course be aligned by 4. The code will always read
past the end of the string, so the string should not be placed at the end of a segment.
<p>
The loop body has an odd number of instructions, so one is unpaired. Making the
branch instruction unpaired rather than one of the other instructions has the advantage that
it saves 1 clock cycle when the branch is mispredicted.
<p>
The <kbd>TEST ECX,00008080H</kbd> instruction is non-pairable. You could use the pairable
instruction <kbd>OR CH,CL</kbd> here instead, but then you would have to put
in a <kbd>NOP</kbd> or something to avoid the penalties of consecutive branches.
Another problem with <kbd>OR CH,CL</kbd> is that it
would cause a partial register stall on PPro, PII and PIII. So I have chosen to keep the
unpairable <kbd>TEST</kbd> instruction.
<p>
Handling 4 bytes simultaneously can be quite difficult. The code uses a formula which
generates a nonzero value for a byte if, and only if, the byte is zero. This makes it possible
to test all four bytes in one operation. This algorithm involves the subtraction of 1 from all
bytes (in the <kbd>LEA</kbd> instruction). I have not masked out the highest bit of each byte before
subtracting, as I did in the previous example, so the subtraction may generate a borrow to
the next byte, but only if it is zero, and this is exactly the situation where we don't care what
the next byte is, because we are searching forwards for the first zero. If we were searching
backwards, then we would have to re-read the dword after detecting a zero, and then test all
four bytes to find the last zero, or use <kbd>BSWAP</kbd> to reverse the order of the bytes.
<p>
If you want to search for a byte value other than zero, then you may <kbd>XOR</kbd> all four bytes
with the value you are searching for, and then use the method above to search for zero.
<p>
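The zero-byte formula computed by the <kbd>LEA/XOR/AND</kbd> sequence can be written out in C. This is a sketch of the formula only; the function name is mine:
<pre>
```c
#include <stdint.h>

/* Returns nonzero if and only if at least one of the four bytes
   packed in x is zero, using the formula from example 1.10:
   (x - 01010101H) AND NOT x AND 80808080H
   A byte's sign bit survives the AND only when that byte was zero. */
uint32_t has_zero_byte(uint32_t x) {
    return (x - 0x01010101u) & ~x & 0x80808080u;
}
```
</pre>
<p>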
<h4>Loops with MMX operations (PMMX)</h4>
Handling multiple operands in the same register is easier on the MMX processors because
they have special instructions and special 64 bit registers for exactly this purpose.
<p>
Returning to the problem of adding two to all bytes in an array, we may take advantage of
the MMX instructions:
<h4>Example 1.11:</h4>
<pre>.data
ALIGN 8
ADDENTS DQ 0202020202020202h     ; specify byte to add eight times
A       DD ?                     ; address of byte array
N       DD ?                     ; number of iterations

.code
     MOV ESI, [A]
     MOV ECX, [N]
     MOVQ MM2, [ADDENTS]
     JMP SHORT L2
     ; top of loop
L1:  MOVQ [ESI-8], MM0           ; store result
L2:  MOVQ MM0, MM2               ; load addents
     PADDB MM0, [ESI]            ; add eight bytes in one operation
     ADD ESI, 8
     DEC ECX
     JNZ L1
     MOVQ [ESI-8], MM0           ; store last result
     EMMS</pre>
The store instruction is moved to after the loop control instructions in order to avoid a store stall.
<p>
This loop takes 4 clocks because the <kbd>PADDB</kbd> instruction doesn't pair
with <kbd>ADD ESI,8</kbd>. (An MMX instruction with memory access cannot pair
with a non-MMX instruction or with another MMX instruction with memory access.)
We could get rid of <kbd>ADD ESI,8</kbd> by using <kbd>ECX</kbd> as index,
but that would give an AGI stall.
<p>
Since the loop overhead is considerable, we might want to unroll the loop:
<p>
<h4>Example 1.12:</h4>
<pre>.data
ALIGN 8
ADDENTS DQ 0202020202020202h     ; specify byte to add eight times
A       DD ?                     ; address of byte array
N       DD ?                     ; number of iterations

.code
     MOVQ MM2, [ADDENTS]
     MOV ESI, [A]
     MOV ECX, [N]
     MOVQ MM0, MM2
     MOVQ MM1, MM2
L3:  PADDB MM0, [ESI]
     PADDB MM1, [ESI+8]
     MOVQ [ESI], MM0
     MOVQ MM0, MM2
     MOVQ [ESI+8], MM1
     MOVQ MM1, MM2
     ADD ESI, 16
     DEC ECX
     JNZ L3
     EMMS</pre>
This unrolled loop takes 6 clocks per iteration for adding 16 bytes.
The <kbd>PADDB</kbd> instructions are not paired. The two threads are
interleaved to avoid a store stall.
<p>
Using the MMX instructions has a high penalty if you are using floating point
instructions shortly afterwards, so there may still be situations where you
want to use 32 bit registers as in example 1.9.
<p>
<h4>Loops with floating point operations (PPlain and PMMX)</h4>
The methods of optimizing floating point loops are basically the same as for integer loops,
although the floating point instructions are overlapping rather than pairing.
<p>
Consider the C language code:
<pre>     int i, n;  double * X;  double * Y;  double DA;
     for (i=0; i&lt;n; i++)  Y[i] = Y[i] - DA * X[i];</pre>
This piece of code (called DAXPY) has been studied extensively because it is the key to
solving linear equations.
<p>
<h4>Example 1.13:</h4>
<pre>DSIZE = 8                                 ; data size
     MOV EAX, [N]                         ; number of elements
     MOV ESI, [X]                         ; pointer to X
     MOV EDI, [Y]                         ; pointer to Y
     XOR ECX, ECX
     LEA ESI, [ESI+DSIZE*EAX]             ; point to end of X
     SUB ECX, EAX                         ; -N
     LEA EDI, [EDI+DSIZE*EAX]             ; point to end of Y
     JZ SHORT L3                          ; test for N = 0
     FLD DSIZE PTR [DA]
     FMUL DSIZE PTR [ESI+DSIZE*ECX]       ; DA * X[0]
     JMP SHORT L2                         ; jump into loop
L1:  FLD DSIZE PTR [DA]
     FMUL DSIZE PTR [ESI+DSIZE*ECX]       ; DA * X[i]
     FXCH                                 ; get old result
     FSTP DSIZE PTR [EDI+DSIZE*ECX-DSIZE] ; store Y[i]
L2:  FSUBR DSIZE PTR [EDI+DSIZE*ECX]      ; subtract from Y[i]
     INC ECX                              ; increment index
     JNZ L1                               ; loop
     FSTP DSIZE PTR [EDI+DSIZE*ECX-DSIZE] ; store last result
L3:</pre>
Here we are using the same methods as in example 1.6: using the loop counter as index
register and counting through negative values up to zero. The end of one operation
overlaps with the beginning of the next.
<p>
The interleaving of floating point operations works perfectly here:
the 2 clock stall between <kbd>FMUL</kbd> and <kbd>FSUBR</kbd> is filled
with the <kbd>FSTP</kbd> of the previous result. The 3 clock stall between
<kbd>FSUBR</kbd> and <kbd>FSTP</kbd> is filled with the loop overhead and the
first two instructions of the next operation. An AGI stall has been avoided
by reading the only parameter that doesn't depend on the index in the first clock cycle after the index has been incremented.
<p>
This solution takes 6 clock cycles per operation, which is better than the
unrolled solution published by Intel!
<p>
<h4>Unrolling floating point loops (PPlain and PMMX)</h4>
<a name="unrollby3">The DAXPY loop unrolled by 3 is quite complicated:</a>
<h4>Example 1.14:</h4>
<pre>DSIZE = 8                          ; data size
IF DSIZE EQ 4
SHIFTCOUNT = 2
ELSE
SHIFTCOUNT = 3
ENDIF

     MOV EAX, [N]                  ; number of elements
     MOV ECX, 3*DSIZE              ; counter bias
     SHL EAX, SHIFTCOUNT           ; DSIZE*N
     JZ L4                         ; N = 0
     MOV ESI, [X]                  ; pointer to X
     SUB ECX, EAX                  ; (3-N)*DSIZE
     MOV EDI, [Y]                  ; pointer to Y
     SUB ESI, ECX                  ; end of pointer - bias
     SUB EDI, ECX
     TEST ECX, ECX
     FLD DSIZE PTR [ESI+ECX]       ; first X
     JNS SHORT L2                  ; less than 4 operations
L1:                                ; main loop
     FMUL DSIZE PTR [DA]
     FLD DSIZE PTR [ESI+ECX+DSIZE]
     FMUL DSIZE PTR [DA]
     FXCH
     FSUBR DSIZE PTR [EDI+ECX]
     FXCH
     FLD DSIZE PTR [ESI+ECX+2*DSIZE]
     FMUL DSIZE PTR [DA]
     FXCH
     FSUBR DSIZE PTR [EDI+ECX+DSIZE]
     FXCH ST(2)
     FSTP DSIZE PTR [EDI+ECX]
     FSUBR DSIZE PTR [EDI+ECX+2*DSIZE]
     FXCH
     FSTP DSIZE PTR [EDI+ECX+DSIZE]
     FLD DSIZE PTR [ESI+ECX+3*DSIZE]
     FXCH
     FSTP DSIZE PTR [EDI+ECX+2*DSIZE]
     ADD ECX, 3*DSIZE
     JS L1                         ; loop
L2:  FMUL DSIZE PTR [DA]           ; finish leftover operation
     FSUBR DSIZE PTR [EDI+ECX]
     SUB ECX, 2*DSIZE              ; change pointer bias
     JZ SHORT L3                   ; finished
     FLD DSIZE PTR [DA]            ; start next operation
     FMUL DSIZE PTR [ESI+ECX+3*DSIZE]
     FXCH
     FSTP DSIZE PTR [EDI+ECX+2*DSIZE]
     FSUBR DSIZE PTR [EDI+ECX+3*DSIZE]
     ADD ECX, 1*DSIZE
     JZ SHORT L3                   ; finished
     FLD DSIZE PTR [DA]
     FMUL DSIZE PTR [ESI+ECX+3*DSIZE]
     FXCH
     FSTP DSIZE PTR [EDI+ECX+2*DSIZE]
     FSUBR DSIZE PTR [EDI+ECX+3*DSIZE]
     ADD ECX, 1*DSIZE
L3:  FSTP DSIZE PTR [EDI+ECX+2*DSIZE]
L4:</pre>
The reason why I am showing you how to unroll a loop by 3 is not to recommend
it, but to warn you how difficult it is! Be prepared to spend a considerable amount of time debugging
and verifying your code when doing something like this. There are several problems to take
care of: In most cases, you cannot remove all stalls from a floating point loop unrolled by
less than 4 unless you convolute it (i.e. there are unfinished operations at the end of each
run which are being finished in the next run). The last <kbd>FLD</kbd> in the main loop above is the
beginning of the first operation in the next run. It would be tempting here to make a solution
which reads past the end of the array and then discards the extra value in the end, as in
examples 1.9 and 1.10, but that is not recommended in floating point loops because the
reading of the extra value might generate a denormal operand exception in case the
memory position after the array doesn't contain a valid floating point number. To avoid this,
we have to do at least one more operation after the main loop.
<p>
The number of operations to do outside an unrolled loop would normally be N MODULO R,
where N is the number of operations, and R is the unrolling factor. But in the case of a
convoluted loop, we have to do one more, i.e. (N-1) MODULO R + 1, for the
abovementioned reason.
<p>
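A quick sanity check of that formula in C (the function name is mine): the result is always between 1 and R, and the remaining N minus leftover operations are divisible by R.
<pre>
```c
/* Number of operations done outside a convoluted loop unrolled
   by R: at least one, and enough to make the rest divisible by R. */
int leftover(int n, int r) {
    return (n - 1) % r + 1;
}
```
</pre>
<p>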
Normally, we would prefer to do the extra operations before the main loop, but here we have
to do them afterwards, for two reasons: One reason is to take care of the leftover operand
from the convolution. The other reason is that calculating the number of extra operations
requires a division if R is not a power of 2, and a division is time consuming. Doing the extra
operations after the loop saves the division.
<p>
The next problem is to calculate how to bias the loop counter so that it will change sign at
the right time, and adjust the base pointers so as to compensate for this bias. Finally, you
have to make sure the leftover operand from the convolution is handled correctly for all
values of N.
<p>
The epilog code doing 1-3 operations could have been implemented as a separate loop, but
that would cost an extra branch misprediction, so the solution above is faster.
<p>
<p>Now that I have scared you by demonstrating how difficult it is to unroll by 3, I will show you
that it is much easier to unroll by 4:
<p>
<h4>Example 1.15:</h4>
<pre>DSIZE = 8                              ; data size
     MOV EAX, [N]                      ; number of elements
     MOV ESI, [X]                      ; pointer to X
     MOV EDI, [Y]                      ; pointer to Y
     XOR ECX, ECX
     LEA ESI, [ESI+DSIZE*EAX]          ; point to end of X
     SUB ECX, EAX                      ; -N
     LEA EDI, [EDI+DSIZE*EAX]          ; point to end of Y
     TEST AL,1                         ; test if N is odd
     JZ SHORT L1
     FLD DSIZE PTR [DA]                ; do the odd operation
     FMUL DSIZE PTR [ESI+DSIZE*ECX]
     FSUBR DSIZE PTR [EDI+DSIZE*ECX]
     INC ECX                           ; adjust counter
     FSTP DSIZE PTR [EDI+DSIZE*ECX-DSIZE]
L1:  TEST AL,2                         ; test for possibly 2 more operations
     JZ L2
     FLD DSIZE PTR [DA]                ; N MOD 4 = 2 or 3. Do two more
     FMUL DSIZE PTR [ESI+DSIZE*ECX]
     FLD DSIZE PTR [DA]
     FMUL DSIZE PTR [ESI+DSIZE*ECX+DSIZE]
     FXCH
     FSUBR DSIZE PTR [EDI+DSIZE*ECX]
     FXCH
     FSUBR DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
     FXCH
     FSTP DSIZE PTR [EDI+DSIZE*ECX]
     FSTP DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
     ADD ECX, 2                        ; counter is now divisible by 4
L2:  TEST ECX, ECX
     JZ L4                             ; no more operations
L3:                                    ; main loop:
     FLD DSIZE PTR [DA]
     FLD DSIZE PTR [ESI+DSIZE*ECX]
     FMUL ST,ST(1)
     FLD DSIZE PTR [ESI+DSIZE*ECX+DSIZE]
     FMUL ST,ST(2)
     FLD DSIZE PTR [ESI+DSIZE*ECX+2*DSIZE]
     FMUL ST,ST(3)
     FXCH ST(2)
     FSUBR DSIZE PTR [EDI+DSIZE*ECX]
     FXCH ST(3)
     FMUL DSIZE PTR [ESI+DSIZE*ECX+3*DSIZE]
     FXCH
     FSUBR DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
     FXCH ST(2)
     FSUBR DSIZE PTR [EDI+DSIZE*ECX+2*DSIZE]
     FXCH
     FSUBR DSIZE PTR [EDI+DSIZE*ECX+3*DSIZE]
     FXCH ST(3)
     FSTP DSIZE PTR [EDI+DSIZE*ECX]
     FSTP DSIZE PTR [EDI+DSIZE*ECX+2*DSIZE]
     FSTP DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
     FSTP DSIZE PTR [EDI+DSIZE*ECX+3*DSIZE]
     ADD ECX, 4                        ; increment index by 4
     JNZ L3                            ; loop
L4:</pre>
<p>
It is usually quite easy to find a stall-free solution when unrolling by 4, and there is no need
for convolution. The number of extra operations to do outside the main loop is N MODULO
4, which can be calculated easily without division, simply by testing the two lowest bits in N.
The extra operations are done before the main loop rather than after, to make the handling
of the loop counter simpler.
<p>
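In C terms, the two <kbd>TEST</kbd> instructions simply read the two lowest bits of N. A sketch (the function name is mine):
<pre>
```c
/* N MODULO 4 without a division: the remainder is just the two
   lowest bits of N, which is what TEST AL,1 and TEST AL,2 read
   in example 1.15. */
unsigned mod4(unsigned n) {
    unsigned extra = 0;
    if (n & 1) extra += 1;   /* one odd operation        */
    if (n & 2) extra += 2;   /* two more operations      */
    return extra;            /* always equal to n % 4    */
}
```
</pre>
<p>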
The tradeoff of loop unrolling is that the extra operations outside the loop are slower due to
incomplete overlapping and possible branch mispredictions, and the first-time penalty is
higher because of increased code size.
<p>
As a general recommendation, I would say that if N is big, or if convoluting the loop without
unrolling cannot remove enough stalls, then you should unroll critical integer loops by 2 and
floating point loops by 4.
<p>
<h3><a name="25_2">25.2 Loops in PPro, PII and PIII</a></h3>
In the previous chapter (<a href="#25_1">25.1</a>) I explained how to use convolution and loop unrolling in order
to improve pairing on PPlain and PMMX. On the PPro, PII and PIII there is no reason to do this,
thanks to the out-of-order execution mechanism. But there are other quite difficult problems
to take care of, most importantly ifetch boundaries and register read stalls.
<p>
I have chosen the same example as in chapter <a href="#25_1">25.1</a>
for the previous microprocessors: a procedure which reads integers from an
array, changes the sign of each integer, and stores the results in another array.
<p>
A C language code for this procedure would be:
<pre>void ChangeSign (int * A, int * B, int N) {
     int i;
     for (i=0; i&lt;N; i++)  B[i] = -A[i];}</pre>
Translating to assembly, we might write the procedure like this:
<h4>Example 2.1:</h4>
<pre>_ChangeSign PROC NEAR
     PUSH ESI
     PUSH EDI
A    EQU DWORD PTR [ESP+12]
B    EQU DWORD PTR [ESP+16]
N    EQU DWORD PTR [ESP+20]

     MOV ECX, [N]
     JECXZ L2
     MOV ESI, [A]
     MOV EDI, [B]
     CLD
L1:  LODSD
     NEG EAX
     STOSD
     LOOP L1
L2:  POP EDI
     POP ESI
     RET
_ChangeSign ENDP</pre>
This looks like a nice solution, but it is not optimal, because it uses the non-optimal
instructions <kbd>LOOP, LODSD</kbd> and <kbd>STOSD</kbd> that generate many uops.
It takes 6-7 clock cycles per iteration if all data are in the level-one cache.
Avoiding these instructions, we get:
<h4>Example 2.2:</h4>
<pre>     MOV ECX, [N]
     JECXZ L2
     MOV ESI, [A]
     MOV EDI, [B]
     ALIGN 16
L1:  MOV EAX, [ESI]       ; len=2, p2rESIwEAX
     ADD ESI, 4           ; len=3, p01rwESIwF
     NEG EAX              ; len=2, p01rwEAXwF
     MOV [EDI], EAX       ; len=2, p4rEAX, p3rEDI
     ADD EDI, 4           ; len=3, p01rwEDIwF
     DEC ECX              ; len=1, p01rwECXwF
     JNZ L1               ; len=2, p1rF
L2:</pre>
The comments are interpreted as follows: the <kbd>MOV EAX,[ESI]</kbd>
instruction is 2 bytes long, and it generates one uop for port 2 that reads
<kbd>ESI</kbd> and writes to (renames) <kbd>EAX</kbd>. This
information is needed for analyzing the possible bottlenecks.
<p>
Let's first analyze the instruction decoding (chapter <a href="#14">14</a>): One of the
instructions generates 2 uops (<kbd>MOV [EDI],EAX</kbd>).
This instruction must go into decoder D0. There are three
decode groups in the loop, so it can decode in 3 clock cycles.
<p>
Next, let's look at the instruction fetch (chapter <a href="#15">15</a>): If an ifetch boundary prevents the first
three instructions from decoding together, then there will be three decode groups in the last
ifetch block, so that the next iteration will have the ifetch block starting at the first instruction,
where we want it, and we will get a delay only in the first iteration. A worse situation would
be a 16-byte boundary and an ifetch boundary in one of the last three instructions.
According to the <a href="#ifetchtable">ifetch table</a>, this will generate a delay of 1 clock and cause the next
iteration to have its first ifetch block aligned by 16, so that the problem continues through all
iterations. The result is a fetch time of 4 clocks per iteration rather than 3. There are two
ways to prevent this situation: the first method is to control where the ifetch blocks lie on the
first iteration; the second method is to control where the 16-byte boundaries are. The latter
method is the easiest. Since the entire loop has only 15 bytes of code, you can avoid any
16-byte boundary by aligning the loop entry by 16, as shown above. This will put the entire
loop into a single ifetch block, so that no further analysis of instruction fetching is needed.
<p>
The third problem to look at is register read stalls (chapter <a href="#16">16</a>). No register is read in this
loop without being written to at least a few clock cycles before, so there can be no register
read stalls.
<p>
The fourth analysis is execution (chapter <a href="#17">17</a>). Counting the uops for the different ports we
get:<br>
port 0 or 1: 4 uops<br>
port 1 only: 1 uop<br>
port 2: 1 uop<br>
port 3: 1 uop<br>
port 4: 1 uop<br>
Assuming that the uops that can go to either port 0 or 1 are distributed optimally, the
execution time will be 2.5 clocks per iteration.
<p>
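This port counting can be summarized in C. The sketch below is my own rough model of the rule used in the text (uops that can go to port 0 or 1 are spread over those two ports; every other port handles one uop per clock), not an exact simulator:
<pre>
```c
/* Rough port-bottleneck estimate: the execution time is limited by
   the busiest resource. Uops marked "port 0 or 1" share two ports;
   each of ports 1-4 executes at most one uop per clock. */
double exec_clocks(int p01, int p1, int p2, int p3, int p4) {
    double t = (p01 + p1) / 2.0;   /* ports 0 and 1 together */
    if (p1 > t) t = p1;
    if (p2 > t) t = p2;
    if (p3 > t) t = p3;
    if (p4 > t) t = p4;
    return t;
}
```
</pre>
For the uop counts above, <kbd>exec_clocks(4,1,1,1,1)</kbd> gives the 2.5 clocks stated in the text.
<p>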
The last analysis is retirement (chapter <a href="#18">18</a>). Since the number of uops in the loop is not
divisible by 3, the retirement slots will not be used optimally when the jump has to retire in
the first slot. The time needed for retirement is the number of uops divided by 3, and
rounded up to the nearest integer. This gives 3 clocks for retirement.
<p>
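The retirement rule stated above is just a ceiling division; a one-line sketch in C (the function name is mine, and this models only the stated rule, not the jump-in-first-slot constraint):
<pre>
```c
/* Retirement time in clocks: uops retire three per clock, so round
   the uop count up to the nearest multiple of 3. */
int retire_clocks(int uops) {
    return (uops + 2) / 3;
}
```
</pre>
With the 8 uops of example 2.2 this gives the 3 clocks found above.
<p>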
In conclusion, the loop above can execute in 3 clocks per iteration if the loop entry is
aligned by 16. I am assuming that the conditional jump is predicted every time except on the
exit of the loop (chapter <a href="#22_2">22.2</a>).
<p>
<h4>Using the same register for counter and index and letting the counter end at zero (PPro, PII and PIII)</h4>
<h4><a name="2-3">Example 2.3:</a></h4>
<pre>     MOV ECX, [N]
     MOV ESI, [A]
     MOV EDI, [B]
     LEA ESI, [ESI+4*ECX]     ; point to end of array A
     LEA EDI, [EDI+4*ECX]     ; point to end of array B
     NEG ECX                  ; -N
     JZ SHORT L2
     ALIGN 16
L1:  MOV EAX, [ESI+4*ECX]     ; len=3, p2rESIrECXwEAX
     NEG EAX                  ; len=2, p01rwEAXwF
     MOV [EDI+4*ECX], EAX     ; len=3, p4rEAX, p3rEDIrECX
     INC ECX                  ; len=1, p01rwECXwF
     JNZ L1                   ; len=2, p1rF
L2:</pre>
Here we have reduced the number of uops to 6 by using the same register as counter and
index. The base pointers point to the end of the arrays so that the index can count up
through negative values to zero.
<p>
Decoding: There are two decode groups in the loop, so it will decode in 2 clocks.
<p>
Instruction fetch: A loop always takes at least one clock cycle more than the number of
16-byte blocks. Since there are only 11 bytes of code in the loop, it is possible to have it all
in one ifetch block. By aligning the loop entry by 16 we can make sure that we don't get
more than one 16-byte block, so that it is possible to fetch in 2 clocks.
<p>
Register read stalls: The <kbd>ESI</kbd> and <kbd>EDI</kbd> registers are read, but not modified inside the loop.
They will therefore be counted as permanent register reads, but not in the same triplet.
Registers <kbd>EAX, ECX</kbd>, and the flags are modified inside the loop and read
before they are written back, so they will cause no permanent register reads.
The conclusion is that there are no register read stalls.
<p>
Execution:<br>
port 0 or 1: 2 uops<br>
port 1: 1 uop<br>
port 2: 1 uop<br>
port 3: 1 uop<br>
port 4: 1 uop<br>
Execution time: 1.5 clocks.
<p>
Retirement:<br>
6 uops = 2 clocks.
<p>
Conclusion: this loop takes only 2 clock cycles per iteration.
<p>
If you use absolute addresses instead of <kbd>ESI</kbd> and <kbd>EDI</kbd>, then the loop will take 3 clocks
because it cannot be contained in a single 16-byte block.
<p>
<h4>Unrolling a loop (PPro, PII and PIII)</h4>
Doing more than one operation in each run, and doing correspondingly fewer runs, is called
loop unrolling. In previous processors you would unroll loops to get parallel execution by
pairing (chapter <a href="#25_1">25.1</a>). On PPro, PII and PIII this is not needed, because the out-of-order
execution mechanism takes care of that. There is no need to use two different registers
either, because register renaming takes care of this. The purpose of unrolling here is to
reduce the loop overhead per iteration.
<p>
The following example is the same as example 2.2, but unrolled by 2, which means that
you do two operations per iteration and half as many iterations.
<h4>Example 2.4:</h4>
<pre>     MOV ECX, [N]
     MOV ESI, [A]
     MOV EDI, [B]
     SHR ECX, 1           ; N/2
     JNC SHORT L1         ; test if N was odd
     MOV EAX, [ESI]       ; do the odd one first
     ADD ESI, 4
     NEG EAX
     MOV [EDI], EAX
     ADD EDI, 4
L1:  JECXZ L3

     ALIGN 16
L2:  MOV EAX, [ESI]       ; len=2, p2rESIwEAX
     NEG EAX              ; len=2, p01rwEAXwF
     MOV [EDI], EAX       ; len=2, p4rEAX, p3rEDI
     MOV EAX, [ESI+4]     ; len=3, p2rESIwEAX
     NEG EAX              ; len=2, p01rwEAXwF
     MOV [EDI+4], EAX     ; len=3, p4rEAX, p3rEDI
     ADD ESI, 8           ; len=3, p01rwESIwF
     ADD EDI, 8           ; len=3, p01rwEDIwF
     DEC ECX              ; len=1, p01rwECXwF
     JNZ L2               ; len=2, p1rF
L3:</pre>
In example 2.2 the loop overhead (i.e. adjusting pointers and counter, and jumping back)
|
|
was 4 uops and the 'real job' was 4 uops. When unrolling the loop by two you do the 'real
|
|
job' twice and the overhead once, so you get 12 uops in all. This reduces the overhead from
|
|
50% to 33% of the uops. Since the unrolled loop can do only an even number of operations
|
|
you have to check if N is odd and if so do one operation outside the loop.
|
|
<p>
|
|
Analyzing instruction fetching in this loop, we find that a new ifetch block begins at the
|
|
<kbd>ADD ESI,8</kbd> instruction, forcing it into decoder D0. This makes the loop decode in 5 clock cycles
|
|
and not 4 as we wanted. We can solve this problem by coding the preceding instruction in a
|
|
longer version. Change <kbd>MOV [EDI+4],EAX </kbd> to:
|
|
<pre> MOV [EDI+9999],EAX ; make instruction with long displacement
|
|
ORG $-4
|
|
DD 4 ; rewrite displacement to 4</pre>
|
|
This will force a new ifetch block to begin at the long <kbd>MOV [EDI+4],EAX</kbd>
|
|
instruction, so that decoding time is now down at 4 clocks. The rest of the
|
|
pipeline can handle 3 uops per clock so that the expected execution time is 4
|
|
clocks per iteration, or 2 clocks per operation.
|
|
<p>
|
|
Testing this solution shows that it actually takes a little longer. My measurements showed
|
|
approximately 4.5 clocks per iteration. This is probably due to a sub-optimal reordering of
|
|
the uops. Possibly, the ROB doesn't find the optimal execution order for the uops but
|
|
submits them in a less than optimal order. This problem was not predicted, and only testing
|
|
can reveal such a problem. We may help the ROB by doing some of the reordering
|
|
manually:
|
|
<h4>Example 2.5:</h4>
|
|
<pre>ALIGN 16
|
|
L2: MOV EAX, [ESI] ; len=2, p2rESIwEAX
|
|
MOV EBX, [ESI+4] ; len=3, p2rESIwEBX
|
|
NEG EAX ; len=2, p01rwEAXwF
|
|
MOV [EDI], EAX ; len=2, p4rEAX, p3rEDI
|
|
ADD ESI, 8 ; len=3, p01rwESIwF
|
|
NEG EBX ; len=2, p01rwEBXwF
|
|
MOV [EDI+4], EBX ; len=3, p4rEBX, p3rEDI
|
|
ADD EDI, 8 ; len=3, p01rwEDIwF
|
|
DEC ECX ; len=1, p01rwECXwF
|
|
JNZ L2 ; len=2, p1rF
|
|
L3:</pre>
|
|
The loop now executes in 4 clocks per iteration. This solution also solves the problem with
|
|
instruction fetch blocks. The cost is that we need an extra register because we cannot take
|
|
advantage of register renaming.
|
|
<p>
|
|
<h4>Unrolling by more than 2</h4>
|
|
Loop unrolling is recommended when the loop overhead constitutes a high proportion of the
|
|
total execution time. In example 2.3 the overhead is only 2 uops, so the gain by unrolling is
|
|
small, but I will show you how to unroll it anyway, just for the exercise.
|
|
<p>
|
|
The 'real job' is 4 uops and the overhead 2. Unrolling by two we get 2*4+2 = 10 uops. The
|
|
retirement time will be 10/3, rounded up to an integer, that is 4 clock cycles. This calculation
|
|
shows that nothing is gained by unrolling this by two. Unrolling by four we get:
|
|
<h4>Example 2.6:</h4>
|
|
<pre> MOV ECX, [N]
|
|
SHL ECX, 2 ; number of bytes to handle
|
|
MOV ESI, [A]
|
|
MOV EDI, [B]
|
|
ADD ESI, ECX ; point to end of array A
|
|
ADD EDI, ECX ; point to end of array B
|
|
NEG ECX ; -4*N
|
|
TEST ECX, 4 ; test if N is odd
|
|
JZ SHORT L1
|
|
MOV EAX, [ESI+ECX] ; N is odd. do the odd one
|
|
NEG EAX
|
|
MOV [EDI+ECX], EAX
|
|
ADD ECX, 4
|
|
L1: TEST ECX, 8 ; test if N/2 is odd
|
|
JZ SHORT L2
|
|
MOV EAX, [ESI+ECX] ; N/2 is odd. do two extra
|
|
NEG EAX
|
|
MOV [EDI+ECX], EAX
|
|
MOV EAX, [ESI+ECX+4]
|
|
NEG EAX
|
|
MOV [EDI+ECX+4], EAX
|
|
ADD ECX, 8
|
|
L2: JECXZ SHORT L4
|
|
|
|
ALIGN 16
|
|
L3: MOV EAX, [ESI+ECX] ; len=3, p2rESIrECXwEAX
|
|
NEG EAX ; len=2, p01rwEAXwF
|
|
MOV [EDI+ECX], EAX ; len=3, p4rEAX, p3rEDIrECX
|
|
MOV EAX, [ESI+ECX+4] ; len=4, p2rESIrECXwEAX
|
|
NEG EAX ; len=2, p01rwEAXwF
|
|
MOV [EDI+ECX+4], EAX ; len=4, p4rEAX, p3rEDIrECX
|
|
MOV EAX, [ESI+ECX+8] ; len=4, p2rESIrECXwEAX
|
|
      MOV EBX, [ESI+ECX+12] ; len=4, p2rESIrECXwEBX
|
|
NEG EAX ; len=2, p01rwEAXwF
|
|
MOV [EDI+ECX+8], EAX ; len=4, p4rEAX, p3rEDIrECX
|
|
      NEG EBX               ; len=2, p01rwEBXwF
|
|
      MOV [EDI+ECX+12], EBX ; len=4, p4rEBX, p3rEDIrECX
|
|
ADD ECX, 16 ; len=3, p01rwECXwF
|
|
JS L3 ; len=2, p1rF
|
|
L4:</pre>
|
|
The ifetch blocks are where we want them. Decode time is 6 clocks.
|
|
<p>
|
|
Register read stalls are a problem here because <kbd>ECX</kbd> has retired near the end of the loop
|
|
and we need to read all of <kbd>ESI, EDI,</kbd> and <kbd>ECX</kbd>. The instructions have been reordered in
|
|
order to avoid reading <kbd>ESI</kbd> near the bottom, so that a register read stall is avoided. In
|
|
other words, the reason for reordering instructions and using an extra register (<kbd>EBX</kbd>) is not the
|
|
same as in the previous example.
|
|
<p>
|
|
There are 12 uops and the loop executes in 6 clocks per iteration, or 1.5 clocks per
|
|
operation.
|
|
<p>
|
|
It may be tempting to unroll loops by a high factor in order to get the maximum speed. But
|
|
since the loop overhead in most cases can be reduced to something like one clock cycle
|
|
per iteration then unrolling the loop by 4 rather than by 2 would save only 1/4 clock cycle
|
|
per operation which is hardly worth the effort. Only if the loop overhead is high compared to
|
|
the rest of the loop and N is very big should you think of unrolling by 4. Unrolling by more
|
|
than 4 does not make sense.
|
|
<p>
|
|
The drawbacks of excessive loop unrolling are:
|
|
<ol>
|
|
<li>You need to calculate N MODULO R, where R is the unrolling factor, and do N
|
|
MODULO R operations before or after the main loop in order to make the remaining
|
|
number of operations divisible by R. This takes a lot of extra code and poorly predictable
|
|
branches. And the loop body of course also becomes bigger.
|
|
<li>A piece of code usually takes much more time the first time it executes, and the penalty
|
|
of first time execution is bigger the more code you have, especially if N is small.
|
|
<li>Excessive code size makes the utilization of the code cache less effective.
|
|
</ol>
|
|
<p>
|
|
Using an unrolling factor which is not a power of 2 makes the calculation of N MODULO R
|
|
quite difficult, and is generally not recommended unless N is known to be divisible by R.
|
|
<a href="#unrollby3">Example 1.14</a> shows how to unroll by 3.
|
|
<p>
|
|
<h4>Handling multiple 8 or 16 bit operands simultaneously in 32 bit registers (PPro, PII and PIII)</h4>
|
|
It is sometimes possible to handle four bytes at a time in the same 32 bit register. The
|
|
following example adds 2 to all elements of an array of bytes:
|
|
<h4><a name="2-7">Example 2.7:</a></h4>
|
|
<pre> MOV ESI, [A] ; address of byte array
|
|
MOV ECX, [N] ; number of elements in byte array
|
|
JECXZ L2
|
|
ALIGN 16
|
|
DB 7 DUP (90H) ; 7 NOP's for controlling alignment
|
|
|
|
L1: MOV EAX, [ESI] ; read four bytes
|
|
MOV EBX, EAX ; copy into EBX
|
|
AND EAX, 7F7F7F7FH ; get lower 7 bits of each byte in EAX
|
|
XOR EBX, EAX ; get the highest bit of each byte
|
|
ADD EAX, 02020202H ; add desired value to all four bytes
|
|
XOR EBX, EAX ; combine bits again
|
|
MOV [ESI], EBX ; store result
|
|
ADD ESI, 4 ; increment pointer
|
|
SUB ECX, 4 ; decrement loop counter
|
|
JA L1 ; loop
|
|
L2:</pre>
|
|
Note that I have masked out the highest bit of each byte to avoid a possible carry from each
|
|
byte into the next one when adding. I am using <kbd>XOR</kbd> rather than
|
|
<kbd>ADD</kbd> when putting in the high bit again to avoid carry.
|
|
The array should of course be aligned by 4.
|
|
<p>
|
|
This loop should ideally take 4 clocks per iteration, but it takes somewhat
|
|
more due to the dependency chain and difficult reordering. On PII and PIII
|
|
you can do the same more effectively using MMX registers.
|
|
<p>
|
|
The next example finds the length of a zero-terminated string by searching for the first byte
|
|
of zero. It is much faster than using <kbd>REPNE SCASB</kbd>:
|
|
<h4><a name="2-8">Example 2.8:</a></h4>
|
|
<pre>_strlen PROC NEAR
|
|
PUSH EBX
|
|
MOV EAX,[ESP+8] ; get pointer to string
|
|
LEA EDX,[EAX+3] ; pointer+3 used in the end
|
|
L1: MOV EBX,[EAX] ; read first 4 bytes
|
|
ADD EAX,4 ; increment pointer
|
|
LEA ECX,[EBX-01010101H] ; subtract 1 from each byte
|
|
NOT EBX ; invert all bytes
|
|
AND ECX,EBX ; and these two
|
|
AND ECX,80808080H ; test all sign bits
|
|
JZ L1 ; no zero bytes, continue loop
|
|
MOV EBX,ECX
|
|
SHR EBX,16
|
|
TEST ECX,00008080H ; test first two bytes
|
|
CMOVZ ECX,EBX ; shift right if not in the first 2 bytes
|
|
LEA EBX,[EAX+2]
|
|
CMOVZ EAX,EBX
|
|
SHL CL,1 ; use carry flag to avoid branch
|
|
SBB EAX,EDX ; compute length
|
|
POP EBX
|
|
RET
|
|
_strlen ENDP</pre>
|
|
This loop takes 3 clocks for each iteration testing 4 bytes. The string should of course be
|
|
aligned by 4. The code may read past the end of the string, so the string should not be
|
|
placed at the end of a segment.
|
|
<p>
|
|
Handling 4 bytes simultaneously can be quite difficult. This code uses a formula which
|
|
generates a nonzero value for a byte if, and only if, the byte is zero. This makes it possible
|
|
to test all four bytes in one operation. This algorithm involves the subtraction of 1 from all
|
|
bytes (in the <kbd>LEA ECX</kbd> instruction). I have not masked out the highest bit of each byte
|
|
before subtracting, as I did in example <a href="#2-7">2.7</a>, so the subtraction may generate a borrow to the
|
|
next byte, but only if it is zero, and this is exactly the situation where we don't care what the
|
|
next byte is, because we are searching forwards for the first zero. If we were searching
|
|
backwards then we would have to re-read the dword after detecting a zero, and then test all
|
|
four bytes to find the last zero, or use <kbd>BSWAP</kbd> to reverse the order of the bytes. If you want
|
|
to search for a byte value other than zero, then you may <kbd>XOR</kbd> all four bytes with the value
|
|
you are searching for, and then use the method above to search for zero.
|
|
<p>
|
|
<h4>Loops with MMX instructions (PII and PIII)</h4>
|
|
Using MMX instructions we can compare 8 bytes in one operation:
|
|
<h4><a name="2-9">Example 2.9:</a></h4>
|
|
<pre>_strlen PROC NEAR
|
|
PUSH EBX
|
|
MOV EAX,[ESP+8]
|
|
LEA EDX,[EAX+7]
|
|
PXOR MM0,MM0
|
|
L1: MOVQ MM1,[EAX] ; len=3 p2rEAXwMM1
|
|
ADD EAX,8 ; len=3 p01rEAX
|
|
PCMPEQB MM1,MM0 ; len=3 p01rMM0rMM1
|
|
MOVD EBX,MM1 ; len=3 p01rMM1wEBX
|
|
PSRLQ MM1,32 ; len=4 p1rMM1
|
|
MOVD ECX,MM1 ; len=3 p01rMM1wECX
|
|
OR ECX,EBX ; len=2 p01rECXrEBXwF
|
|
JZ L1 ; len=2 p1rF
|
|
MOVD ECX,MM1
|
|
TEST EBX,EBX
|
|
CMOVZ EBX,ECX
|
|
LEA ECX,[EAX+4]
|
|
CMOVZ EAX,ECX
|
|
MOV ECX,EBX
|
|
SHR ECX,16
|
|
TEST BX,BX
|
|
CMOVZ EBX,ECX
|
|
LEA ECX,[EAX+2]
|
|
CMOVZ EAX,ECX
|
|
SHR BL,1
|
|
SBB EAX,EDX
|
|
EMMS
|
|
POP EBX
|
|
RET
|
|
_strlen ENDP</pre>
|
|
This loop has 7 uops for port 0 and 1 which gives an average execution time of 3.5 clocks
|
|
per iteration. The measured time is 3.8 clocks which shows that the ROB handles the
|
|
situation reasonably well despite a dependency chain that is 6 uops long. Testing 8 bytes in
|
|
less than 4 clocks is very much faster than <kbd>REPNE SCASB</kbd>.
|
|
<p>
|
|
<h4>Loops with floating point instructions (PPro, PII and PIII)</h4>
|
|
The methods for optimizing floating point loops are basically the same as for integer loops,
|
|
but you should be more aware of dependency chains because of the long latencies of
|
|
instruction execution.
|
|
<p>
|
|
Consider the C language code:
|
|
<pre> int i, n; double * X; double * Y; double DA;
|
|
for (i=0; i<n; i++) Y[i] = Y[i] - DA * X[i];</pre>
|
|
This piece of code (called DAXPY) has been studied extensively because it is the key to
|
|
solving linear equations.
|
|
<h4>Example 2.10:</h4>
|
|
<pre>DSIZE = 8 ; data size (4 or 8)
|
|
MOV ECX, [N] ; number of elements
|
|
MOV ESI, [X] ; pointer to X
|
|
MOV EDI, [Y] ; pointer to Y
|
|
JECXZ L2 ; test for N = 0
|
|
FLD DSIZE PTR [DA] ; load DA outside loop
|
|
ALIGN 16
|
|
DB 2 DUP (90H) ; 2 NOP's for alignment
|
|
L1: FLD DSIZE PTR [ESI] ; len=3 p2rESIwST0
|
|
ADD ESI,DSIZE ; len=3 p01rESI
|
|
FMUL ST,ST(1) ; len=2 p0rST0rST1
|
|
FSUBR DSIZE PTR [EDI] ; len=3 p2rEDI, p0rST0
|
|
FSTP DSIZE PTR [EDI] ; len=3 p4rST0, p3rEDI
|
|
ADD EDI,DSIZE ; len=3 p01rEDI
|
|
DEC ECX ; len=1 p01rECXwF
|
|
JNZ L1 ; len=2 p1rF
|
|
FSTP ST ; discard DA
|
|
L2:</pre>
|
|
The dependency chain is 10 clock cycles long, but the loop takes only 4 clocks per iteration
|
|
because it can begin a new operation before the previous one is finished. The purpose of
|
|
the alignment is to prevent a 16-byte boundary in the last ifetch block.
|
|
<p>
|
|
<h4>Example 2.11:</h4>
|
|
<pre>DSIZE = 8 ; data size (4 or 8)
|
|
MOV ECX, [N] ; number of elements
|
|
MOV ESI, [X] ; pointer to X
|
|
MOV EDI, [Y] ; pointer to Y
|
|
LEA ESI, [ESI+DSIZE*ECX] ; point to end of array
|
|
LEA EDI, [EDI+DSIZE*ECX] ; point to end of array
|
|
NEG ECX ; -N
|
|
JZ SHORT L2 ; test for N = 0
|
|
FLD DSIZE PTR [DA] ; load DA outside loop
|
|
ALIGN 16
|
|
L1: FLD DSIZE PTR [ESI+DSIZE*ECX] ; len=3 p2rESIrECXwST0
|
|
FMUL ST,ST(1) ; len=2 p0rST0rST1
|
|
FSUBR DSIZE PTR [EDI+DSIZE*ECX] ; len=3 p2rEDIrECX, p0rST0
|
|
FSTP DSIZE PTR [EDI+DSIZE*ECX] ; len=3 p4rST0, p3rEDIrECX
|
|
INC ECX ; len=1 p01rECXwF
|
|
JNZ L1 ; len=2 p1rF
|
|
FSTP ST ; discard DA
|
|
L2:</pre>
|
|
Here we have used the same trick as in example <a href="#2-3">2.3</a>. Ideally, this loop should take 3
|
|
clocks, but measurements say approximately 3.5 clocks due to the long dependency chain.
|
|
Unrolling the loop doesn't save much.
|
|
<p>
|
|
<h4>Loops with XMM instructions (PIII)</h4>
|
|
The XMM instructions on the PIII allow you to operate on four single precision
|
|
floating point numbers in parallel. The operands must be aligned by 16.
|
|
<p>
|
|
The DAXPY algorithm is not very suited for XMM instructions because the precision
|
|
is poor, it may not be possible to align the operands by 16, and you need some
|
|
extra code if the number of operations is not a multiple of four. I am showing
|
|
the code here anyway, just to give an example of a loop with XMM instructions:
|
|
<h4>Example 2.12:</h4>
|
|
<pre> MOV ECX, [N] ; number of elements
|
|
MOV ESI, [X] ; pointer to X
|
|
MOV EDI, [Y] ; pointer to Y
|
|
SHL ECX, 2
|
|
ADD ESI, ECX ; point to end of X
|
|
ADD EDI, ECX ; point to end of Y
|
|
NEG ECX ; -4*N
|
|
MOV EAX, [DA] ; load DA outside loop
|
|
XOR EAX, 80000000H ; change sign of DA
|
|
PUSH EAX
|
|
MOVSS XMM1, [ESP] ; -DA
|
|
ADD ESP, 4
|
|
SHUFPS XMM1, XMM1, 0 ; copy -DA to all four positions
|
|
CMP ECX, -16
|
|
JG L2
|
|
L1: MOVAPS XMM0, [ESI+ECX] ; len=4 2*p2rESIrECXwXMM0
|
|
ADD ECX, 16 ; len=3 p01rwECXwF
|
|
MULPS XMM0, XMM1 ; len=3 2*p0rXMM0rXMM1
|
|
CMP ECX, -16 ; len=3 p01rECXwF
|
|
ADDPS XMM0, [EDI+ECX-16] ; len=5 2*p2rEDIrECX, 2*p1rXMM0
|
|
MOVAPS [EDI+ECX-16], XMM0 ; len=5 2*p4rXMM0, 2*p3rEDIrECX
|
|
JNG L1 ; len=2 p1rF
|
|
L2: JECXZ L4 ; check if finished
|
|
MOVAPS XMM0, [ESI+ECX] ; 1-3 operations missing, do 4 more
|
|
MULPS XMM0, XMM1
|
|
ADDPS XMM0, [EDI+ECX]
|
|
CMP ECX, -8
|
|
JG L3
|
|
MOVLPS [EDI+ECX], XMM0 ; store two more results
|
|
ADD ECX, 8
|
|
MOVHLPS XMM0, XMM0
|
|
L3: JECXZ L4
|
|
MOVSS [EDI+ECX], XMM0 ; store one more result
|
|
L4:</pre>
|
|
The <kbd>L1</kbd> loop takes 5-6 clocks for 4 operations.
|
|
The instructions modifying <kbd>ECX</kbd> have been placed before and after the
|
|
<kbd>MULPS XMM0, XMM1</kbd> instruction in order to avoid a register read port stall
|
|
generated by the reading of the two parts of the <kbd>XMM1</kbd> register
|
|
together with <kbd>ESI</kbd> or <kbd>EDI</kbd> in the RAT. The extra code after
|
|
<kbd>L2</kbd> takes care of the situation where N is not divisible by 4.
|
|
Note that this code may read past the end of X and Y. This may delay the last
|
|
operation if the extra memory positions read do not contain normal floating
|
|
point numbers. If possible, put in some dummy extra data to make the number
|
|
of operations divisible by 4 and leave out the extra code after <kbd>L2</kbd>.
|
|
<p>
|
|
<h2><a name="26">26</a>. Problematic Instructions</h2>
|
|
<h3><a name="26_1">26.1 XCHG (all processors)</a></h3>
|
|
The <kbd>XCHG register,[memory]</kbd> instruction is dangerous. This instruction always has
|
|
an implicit <kbd>LOCK</kbd> prefix which prevents it from using the cache. This instruction is therefore
|
|
very time consuming, and should always be avoided.
|
|
<p>
|
|
<h3><a name="26_2">26.2 Rotates through carry (all processors)</a></h3>
|
|
<kbd>RCR</kbd> and <kbd>RCL</kbd> with a count different from one are slow and should be avoided.
|
|
<p>
|
|
<h3><a name="26_3">26.3 String instructions (all processors)</a></h3>
|
|
String instructions without a repeat prefix are too slow and should be replaced by simpler
|
|
instructions. The same applies to <kbd>LOOP</kbd> on all processors and to
|
|
<kbd>JECXZ</kbd> on PPlain and PMMX.
|
|
<p>
|
|
<kbd>REP MOVSD</kbd> and <kbd>REP STOSD</kbd> are quite fast if the repeat
|
|
count is not too small. Always use the DWORD version if possible, and make
|
|
sure that both source and destination are aligned by 8.
|
|
<p>
|
|
Some other methods of moving data are faster under certain conditions. See
|
|
chapter <a href="#27_8">27.8</a> for details.
|
|
<p>
|
|
Note that while the <kbd>REP MOVS</kbd> instruction writes a word to the destination, it reads the next
|
|
word from the source in the same clock cycle. You can have a cache bank conflict if bits 2-4
|
|
are the same in these two addresses. In other words, you will get a penalty of one clock
|
|
extra per iteration if <kbd>ESI+(wordsize)-EDI</kbd> is divisible by 32. The easiest way to avoid
|
|
cache bank conflicts is to use the DWORD version and align both source and destination by
|
|
8. Never use <kbd>MOVSB</kbd> or <kbd>MOVSW</kbd> in optimized code, not even in 16 bit mode.
|
|
<p>
|
|
<kbd>REP MOVS</kbd> and <kbd>REP STOS</kbd> can perform very fast by moving an entire cache line at a time
|
|
on PPro, PII and PIII. This happens only when the following conditions are met:
|
|
<ul>
|
|
<li>both source and destination must be aligned by 8
|
|
<li>direction must be forward (direction flag cleared)
|
|
<li>the count (<kbd>ECX</kbd>) must be greater than or equal to 64
|
|
<li>the difference between <kbd>EDI</kbd> and <kbd>ESI</kbd> must be numerically greater than or equal to 32
|
|
<li>the memory type for both source and destination must be either writeback or
|
|
write-combining (you can normally assume this).
|
|
</ul><p>
|
|
Under these conditions the number of uops issued is approximately 215+2*<kbd>ECX</kbd> for
|
|
<kbd>REP MOVSD</kbd> and 185+1.5*<kbd>ECX</kbd> for <kbd>REP STOSD,</kbd>
|
|
giving a speed of approximately 5 bytes per
|
|
clock cycle for both instructions, which is almost 3 times as fast as when the above
|
|
conditions are not met.
|
|
<p>
|
|
The byte and word versions also benefit from this fast mode, but they are less effective than
|
|
the dword versions.
|
|
<p>
|
|
<kbd>REP STOSD</kbd> is optimal under the same conditions as <kbd>REP MOVSD</kbd>.
|
|
<p>
|
|
<kbd>REP LODS, REP SCAS,</kbd> and <kbd>REP CMPS</kbd> are not optimal, and
|
|
may be replaced by loops. See example <a href="#1-10">1.10</a>, <a href="#2-8">2.8</a>
|
|
and <a href="#2-9">2.9</a> for alternatives to <kbd>REPNE SCASB. REP CMPS</kbd>
|
|
may suffer cache bank conflicts if bits 2-4 are the same in <kbd>ESI</kbd> and
|
|
<kbd>EDI</kbd>.
|
|
<p>
|
|
<h3><a name="26_4">26.4 Bit test (all processors)</a></h3>
|
|
<kbd>BT, BTC, BTR</kbd>, and <kbd>BTS</kbd> instructions should preferably be replaced by instructions like
|
|
<kbd>TEST, AND, OR, XOR</kbd>, or shifts on PPlain and PMMX. On PPro, PII and PIII, bit tests with a
|
|
memory operand should be avoided.
|
|
<p>
|
|
<h3><a name="26_5">26.5 Integer multiplication (all processors)</a></h3>
|
|
An integer multiplication takes approximately 9 clock cycles on PPlain and PMMX and 4 on
|
|
PPro, PII and PIII. It is therefore often advantageous to replace a multiplication by a constant
|
|
with a combination of other instructions such as <kbd>SHL, ADD, SUB</kbd>,
|
|
and <kbd>LEA</kbd>. Example:<br>
|
|
<kbd>IMUL EAX,10</kbd><br>
|
|
can be replaced with<br>
|
|
<kbd>MOV EBX,EAX / ADD EAX,EAX / SHL EBX,3 / ADD EAX,EBX</kbd><br>
|
|
or<br>
|
|
<kbd>LEA EAX,[EAX+4*EAX] / ADD EAX,EAX</kbd>
|
|
<p>
|
|
Floating point multiplication is faster than integer multiplication on PPlain and PMMX, but
|
|
the time spent on converting integers to float and converting the product back again is
|
|
usually more than the time saved by using floating point multiplication, except when the
|
|
number of conversions is low compared with the number of multiplications. MMX
|
|
multiplication is fast, but is only available with 16-bit operands.
|
|
<p>
|
|
<h3><a name="26_6">26.6 WAIT instruction (all processors)</a></h3>
|
|
You can often increase speed by omitting the <kbd>WAIT</kbd> instruction.
|
|
The <kbd>WAIT</kbd> instruction has three functions:
|
|
<p>
|
|
<u>a.</u> The old 8087 processor requires a <kbd>WAIT</kbd> before every
|
|
floating point instruction to make sure the coprocessor is ready to receive it.
|
|
<p>
|
|
<u>b.</u> <kbd>WAIT</kbd> is used for coordinating memory access between the floating point unit and the
|
|
integer unit. Examples:
|
|
<pre><u>b.1.</u> FISTP [mem32]
|
|
WAIT ; wait for FPU to write before..
|
|
MOV EAX,[mem32] ; reading the result with the integer unit
|
|
|
|
<u>b.2.</u> FILD [mem32]
|
|
WAIT ; wait for FPU to read value..
|
|
MOV [mem32],EAX ; before overwriting it with integer unit
|
|
|
|
<u>b.3.</u> FLD QWORD PTR [ESP]
|
|
WAIT ; prevent an accidental interrupt from..
|
|
ADD ESP,8 ; overwriting value on stack</pre>
|
|
<p>
|
|
<u>c.</u> <kbd>WAIT</kbd> is sometimes used to check for exceptions. It will generate an interrupt if an
|
|
unmasked exception bit in the floating point status word has been set by a preceding
|
|
floating point instruction.
|
|
<p>
|
|
<u>Regarding a:</u><br>
|
|
The function in point a is never needed on any other processors than the old 8087. Unless
|
|
you want your code to be compatible with the 8087 you should tell your assembler not to
|
|
put in these <kbd>WAIT</kbd>'s by specifying a higher processor. An 8087 floating point emulator also
|
|
inserts <kbd>WAIT</kbd> instructions. You should therefore tell your assembler not to generate
|
|
emulation code unless you need it.
|
|
<p>
|
|
<u>Regarding b:</u><br>
|
|
<kbd>WAIT</kbd> instructions to coordinate memory access are definitely needed on the 8087 and
|
|
80287 but not on the Pentiums. It is not quite clear whether it is needed on the 80387 and
|
|
80486. I have made several tests on these Intel processors and not been able to provoke
|
|
any error by omitting the <kbd>WAIT</kbd> on any 32 bit Intel processor, although Intel manuals say that
|
|
the <kbd>WAIT</kbd> is needed for this purpose except after <kbd>FNSTSW</kbd>
|
|
and <kbd>FNSTCW</kbd>. Omitting <kbd>WAIT</kbd>
|
|
instructions for coordinating memory access is not 100 % safe, even when writing 32 bit
|
|
code, because the code may be able to run on the very rare combination of a 80386 main
|
|
processor with a 287 coprocessor, which requires the <kbd>WAIT</kbd>. Also, I have no information on
|
|
non-Intel processors, and I have not tested all possible hardware and software
|
|
combinations, so there may be other situations where the <kbd>WAIT</kbd> is needed.
|
|
<p>
|
|
If you want to be certain that your code will work on any 32 bit processor (including
|
|
non-Intel processors) then I would recommend that you include the <kbd>WAIT</kbd> here in order to be
|
|
safe.
|
|
<p>
|
|
<u>Regarding c:</u><br>
|
|
The assembler automatically inserts a <kbd>WAIT</kbd> for this purpose before the following
|
|
instructions: <kbd>FCLEX, FINIT, FSAVE, FSTCW, FSTENV, FSTSW</kbd>. You can omit the <kbd>WAIT</kbd>
|
|
by writing <kbd>FNCLEX</kbd>, etc. My tests show that the <kbd>WAIT</kbd> is unnecessary in most cases
|
|
because these instructions without <kbd>WAIT</kbd> will still generate an interrupt on exceptions except
|
|
for <kbd>FNCLEX</kbd> and <kbd>FNINIT</kbd> on the 80387. (There is some inconsistency about whether the
|
|
<kbd>IRET</kbd> from the interrupt points to the <kbd>FN..</kbd> instruction or to the next instruction).
|
|
<p>
|
|
Almost all other floating point instructions will also generate an interrupt if a previous floating
|
|
point instruction has set an unmasked exception bit, so the exception is likely to be detected
|
|
sooner or later anyway. You may insert a <kbd>WAIT</kbd> after the last floating point instruction in your
|
|
program to be sure to catch all exceptions.
|
|
<p>
|
|
You may still need the <kbd>WAIT</kbd> if you want to know exactly where an exception occurred in
|
|
order to be able to recover from the situation. Consider, for example, the code under b.3
|
|
above: If you want to be able to recover from an exception generated by the <kbd>FLD</kbd> here, then
|
|
you need the <kbd>WAIT</kbd> because an interrupt after <kbd>ADD ESP,8</kbd> would overwrite the value to load.
|
|
<kbd>FNOP</kbd> may be faster than <kbd>WAIT</kbd> and serve the same purpose.
|
|
<p>
|
|
<h3><a name="26_7">26.7 FCOM + FSTSW AX (all processors)</a></h3>
|
|
The <kbd>FNSTSW</kbd> instruction is very slow on all processors. The PPro, PII and PIII
|
|
processors have
|
|
<kbd>FCOMI</kbd> instructions to avoid the slow <kbd>FNSTSW</kbd>.
|
|
Using <kbd>FCOMI</kbd> instead of the common
|
|
sequence <kbd>FCOM / FNSTSW AX / SAHF</kbd> will save you 8 clock cycles. You should
|
|
therefore use <kbd>FCOMI</kbd> to avoid <kbd>FNSTSW</kbd> wherever possible, even in cases where it costs
|
|
some extra code.
|
|
<p>
|
|
On processors without <kbd>FCOMI</kbd> instructions, the usual way of doing floating point
|
|
comparisons is:
|
|
<pre> FLD [a]
|
|
FCOMP [b]
|
|
FSTSW AX
|
|
SAHF
|
|
JB ASmallerThanB</pre>
|
|
You may improve this code by using <kbd>FNSTSW AX</kbd> rather than
|
|
<kbd>FSTSW AX</kbd> and test <kbd>AH</kbd>
|
|
directly rather than using the non-pairable <kbd>SAHF</kbd>
|
|
(TASM version 3.0 has a bug with the <kbd>FNSTSW AX</kbd> instruction):
|
|
<pre> FLD [a]
|
|
FCOMP [b]
|
|
FNSTSW AX
|
|
SHR AH,1
|
|
JC ASmallerThanB</pre>
|
|
<p>
|
|
Testing for zero or equality:
|
|
<pre> FTST
|
|
FNSTSW AX
|
|
AND AH,40H
|
|
JNZ IsZero ; (the zero flag is inverted!)</pre>
|
|
<p>
|
|
Test if greater:
|
|
<pre> FLD [a]
|
|
FCOMP [b]
|
|
FNSTSW AX
|
|
AND AH,41H
|
|
JZ AGreaterThanB</pre>
|
|
<p>
|
|
Do not use <kbd>TEST AH,41H</kbd> as it is not pairable on PPlain and PMMX.
|
|
<p>
|
|
On the PPlain and PMMX, the <kbd>FNSTSW</kbd> instruction takes 2 clocks, but it is delayed for an
|
|
additional 4 clocks after any floating point instruction because it is waiting for the status
|
|
word to retire from the pipeline. This delay comes even after <kbd>FNOP</kbd>
|
|
which cannot change the status word, but not after integer instructions.
|
|
You can fill the latency between <kbd>FCOM</kbd> and
|
|
<kbd>FNSTSW</kbd> with integer instructions taking up to four clock cycles.
|
|
A paired <kbd>FXCH</kbd> immediately
|
|
after <kbd>FCOM</kbd> doesn't delay the <kbd>FNSTSW</kbd>, not even if the pairing is imperfect:
|
|
<pre> FCOM ; clock 1
|
|
FXCH ; clock 1-2 (imperfect pairing)
|
|
INC DWORD PTR [EBX] ; clock 3-5
|
|
FNSTSW AX ; clock 6-7</pre>
|
|
<p>
|
|
You may want to use <kbd>FCOM</kbd> rather than <kbd>FTST</kbd>
|
|
here because <kbd>FTST</kbd> is not pairable.
|
|
Remember to include the <kbd>N</kbd> in <kbd>FNSTSW</kbd>. <kbd>FSTSW</kbd>
|
|
(without <kbd>N</kbd>) has a <kbd>WAIT</kbd> prefix which delays
|
|
it further.
|
|
<p>
|
|
It is sometimes faster to use integer instructions for comparing floating point values, as
|
|
described in chapter <a href="#27_6">27.6</a>.
|
|
<p>
|
|
<h3><a name="26_8">26.8 FPREM (all processors)</a></h3>
|
|
The <kbd>FPREM</kbd> and <kbd>FPREM1</kbd> instructions are slow on all
|
|
processors. You may replace them with the following algorithm: multiply by
|
|
the reciprocal divisor, get the fractional part by subtracting
|
|
the truncated value, then multiply by the divisor.
|
|
(see chapter <a href="#27_5">27.5</a> on how to truncate)
|
|
<p>
|
|
Some documents say that these instructions may give incomplete reductions and
|
|
that it is therefore necessary to repeat the <kbd>FPREM</kbd> or
|
|
<kbd>FPREM1</kbd> instruction until the reduction is complete.
|
|
I have tested this on several processors beginning with the old 8087 and I have
|
|
found no situation where a repetition of the <kbd>FPREM</kbd> or <kbd>FPREM1</kbd>
|
|
was needed.
|
|
<p>
|
|
<h3><a name="26_9">26.9 FRNDINT (all processors)</a></h3>
|
|
This instruction is slow on all processors. Replace it by:
|
|
<pre>
|
|
FISTP QWORD PTR [TEMP]
|
|
FILD QWORD PTR [TEMP]</pre>
|
|
This code is faster despite a possible penalty for attempting to read from
|
|
<kbd>[TEMP]</kbd> before the write is finished. It is recommended to put
|
|
other instructions in between in order to avoid
|
|
this penalty. See chapter <a href="#27_5">27.5</a> on how to truncate.
|
|
<p>
|
|
<h3><a name="26_10">26.10 FSCALE and exponential function (all processors)</a></h3>
|
|
<kbd>FSCALE</kbd> is slow on all processors. Computing integer powers
|
|
of 2 can be done much faster by inserting the desired power in the exponent
|
|
field of the floating point number.
|
|
To calculate 2<sup>N</sup>, where N is a signed integer, select from the examples below the one that
|
|
fits your range of N:
|
|
<p>
|
|
For |N| < 2<sup>7</sup>-1 you can use single precision:
|
|
<pre> MOV EAX, [N]
|
|
SHL EAX, 23
|
|
ADD EAX, 3F800000H
|
|
MOV DWORD PTR [TEMP], EAX
|
|
FLD DWORD PTR [TEMP]</pre>
|
|
<p>
|
|
For |N| < 2<sup>10</sup>-1 you can use double precision:
|
|
<pre> MOV EAX, [N]
|
|
SHL EAX, 20
|
|
ADD EAX, 3FF00000H
|
|
MOV DWORD PTR [TEMP], 0
|
|
MOV DWORD PTR [TEMP+4], EAX
|
|
FLD QWORD PTR [TEMP]</pre>
|
|
<p>
|
|
For |N| < 2<sup>14</sup>-1 use long double precision:
|
|
<pre> MOV EAX, [N]
|
|
ADD EAX, 00003FFFH
|
|
MOV DWORD PTR [TEMP], 0
|
|
MOV DWORD PTR [TEMP+4], 80000000H
|
|
MOV DWORD PTR [TEMP+8], EAX
|
|
FLD TBYTE PTR [TEMP]</pre>
|
|
<p>
|
|
<kbd>FSCALE</kbd> is often used in the calculation of exponential functions. The following code shows
|
|
an exponential function without the slow <kbd>FRNDINT</kbd> and <kbd>FSCALE</kbd> instructions:
|
|
<p>
|
|
<pre>; extern "C" long double _cdecl exp (double x);
|
|
_exp PROC NEAR
|
|
PUBLIC _exp
|
|
FLDL2E
|
|
FLD QWORD PTR [ESP+4] ; x
|
|
FMUL ; z = x*log2(e)
|
|
FIST DWORD PTR [ESP+4] ; round(z)
|
|
SUB ESP, 12
|
|
MOV DWORD PTR [ESP], 0
|
|
MOV DWORD PTR [ESP+4], 80000000H
|
|
FISUB DWORD PTR [ESP+16] ; z - round(z)
|
|
MOV EAX, [ESP+16]
|
|
ADD EAX,3FFFH
|
|
MOV [ESP+8],EAX
|
|
JLE SHORT UNDERFLOW
|
|
CMP EAX,8000H
|
|
JGE SHORT OVERFLOW
|
|
F2XM1
|
|
FLD1
|
|
FADD ; 2^(z-round(z))
|
|
FLD TBYTE PTR [ESP] ; 2^(round(z))
|
|
ADD ESP,12
|
|
FMUL ; 2^z = e^x
|
|
RET
|
|
|
|
UNDERFLOW:
|
|
FSTP ST
|
|
FLDZ ; return 0
|
|
ADD ESP,12
|
|
RET
|
|
|
|
OVERFLOW:
|
|
PUSH 07F800000H ; +infinity
|
|
FSTP ST
|
|
FLD DWORD PTR [ESP] ; return infinity
|
|
ADD ESP,16
|
|
RET
|
|
|
|
_exp ENDP</pre>
|
|
<p>
|
|
<h3><a name="26_11">26.11 FPTAN (all processors)</a></h3>
According to the manuals, <kbd>FPTAN</kbd> returns two values X and Y and
leaves it to the programmer to divide Y by X to get the result, but in
fact it always returns 1 in X so you can save the division. My tests show that
on all 32 bit Intel processors with floating point unit or coprocessor,
<kbd>FPTAN</kbd> always returns 1 in X regardless of the argument. If you want to
be absolutely sure that your code will run correctly on all processors, then
you may test if X is 1, which is faster than dividing by X. The Y value may
be very high, but never infinity, so you don't have to test if Y contains a
valid number if you know that the argument is valid.
<p>
<h3><a name="26_12">26.12 FSQRT (PIII)</a></h3>
A fast way of calculating an approximate squareroot on the PIII is to multiply
the reciprocal squareroot of x by x:<br>
SQRT(x) = x * RSQRT(x)<br>
The instruction <kbd>RSQRTSS</kbd> or <kbd>RSQRTPS</kbd> gives the reciprocal
squareroot with a precision of 12 bits. You can improve the precision to 23 bits
by using the Newton-Raphson formula described in Intel's application note AP-803:<br>
x<sub>0</sub> = <kbd>RSQRTSS</kbd>(a)<br>
x<sub>1</sub> = 0.5 * x<sub>0</sub> * (3 - (a * x<sub>0</sub>) * x<sub>0</sub>)<br>
where x<sub>0</sub> is the first approximation to the reciprocal squareroot of
a, and x<sub>1</sub> is a better approximation. The order of evaluation is
important. You must use this formula before multiplying with a to get the squareroot.
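The refinement step can be tried out in plain C. This sketch stands in a deliberately coarse starting guess for the 12-bit hardware seed, so only the Newton-Raphson formula itself comes from the text; <kbd>rsqrt_refine</kbd> is an illustrative name:
<p>
<pre>#include &lt;assert.h&gt;
#include &lt;math.h&gt;

/* One Newton-Raphson step for the reciprocal square root of a:
   x1 = 0.5 * x0 * (3 - (a * x0) * x0), evaluated in this order. */
float rsqrt_refine(float a, float x0)
{
    return 0.5f * x0 * (3.0f - (a * x0) * x0);
}</pre>
Starting from x0 = 0.7 as a rough guess for 1/sqrt(2), one step brings the error from about 7e-3 down to about 1e-4.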
<p>
<h3><a name="26_13">26.13 MOV [MEM], ACCUM (PPlain and PMMX)</a></h3>
The instructions <kbd>MOV [mem],AL MOV [mem],AX MOV [mem],EAX</kbd>
are treated by the pairing circuitry as if they were writing to the accumulator.
Thus the following instructions do not pair:
<pre>        MOV [mydata], EAX
        MOV EBX, EAX</pre>
<p>
This problem occurs only with the short form of the <kbd>MOV</kbd>
instruction, which cannot have a base or index register, and which can only
have the accumulator as source. You can avoid the problem by using another
register, by reordering your instructions, by using a pointer, or by
hard-coding the general form of the <kbd>MOV</kbd> instruction.
<p>
In 32 bit mode you can write the general form of <kbd>MOV [mem],EAX</kbd>:
<pre>        DB 89H, 05H
        DD OFFSET DS:mem</pre>
<p>
In 16 bit mode you can write the general form of <kbd>MOV [mem],AX</kbd>:
<pre>        DB 89H, 06H
        DW OFFSET DS:mem</pre>
<p>
To use <kbd>AL</kbd> instead of <kbd>(E)AX</kbd>, you replace <kbd>89H</kbd>
with <kbd>88H</kbd>.
<p>
This flaw has not been fixed in the PMMX.
<p>
<h3><a name="26_14">26.14 TEST instruction (PPlain and PMMX)</a></h3>
The <kbd>TEST</kbd> instruction with an immediate operand is only
pairable if the destination is <kbd>AL</kbd>, <kbd>AX</kbd>, or <kbd>EAX</kbd>.
<p>
<kbd>TEST register,register</kbd>
and <kbd>TEST register,memory</kbd> are always pairable.
<p>
Examples:
<pre>        TEST ECX,ECX                ; pairable
        TEST [mem],EBX              ; pairable
        TEST EDX,256                ; not pairable
        TEST DWORD PTR [EBX],8000H  ; not pairable</pre><p>
To make it pairable, use any of the following methods:
<pre>        MOV EAX,[EBX]  / TEST EAX,8000H
        MOV EDX,[EBX]  / AND EDX,8000H
        MOV AL,[EBX+1] / TEST AL,80H
        MOV AL,[EBX+1] / TEST AL,AL   ; (result in sign flag)</pre>
(The reason for this non-pairability is probably that the first byte of the
2-byte instruction is the same as for some other non-pairable instructions,
and the processor cannot afford to check the second byte too when determining
pairability.)
<p>
<h3><a name="26_15">26.15 Bit scan (PPlain and PMMX)</a></h3>
<kbd>BSF</kbd> and <kbd>BSR</kbd> are the poorest optimized instructions on
the PPlain and PMMX, taking approximately 11 + 2*n clock cycles, where n is
the number of zeros skipped.
<p>
The following code emulates <kbd>BSR ECX,EAX</kbd>:
<pre>        TEST    EAX,EAX
        JZ      SHORT BS1
        MOV     DWORD PTR [TEMP],EAX
        MOV     DWORD PTR [TEMP+4],0
        FILD    QWORD PTR [TEMP]
        FSTP    QWORD PTR [TEMP]
        WAIT    ; WAIT only needed for compatibility with old 80287 processor
        MOV     ECX, DWORD PTR [TEMP+4]
        SHR     ECX,20          ; isolate exponent
        SUB     ECX,3FFH        ; adjust
        TEST    EAX,EAX         ; clear zero flag
BS1:</pre>
<p>
The following code emulates <kbd>BSF ECX,EAX</kbd>:
<pre>        TEST    EAX,EAX
        JZ      SHORT BS2
        XOR     ECX,ECX
        MOV     DWORD PTR [TEMP+4],ECX
        SUB     ECX,EAX
        AND     EAX,ECX
        MOV     DWORD PTR [TEMP],EAX
        FILD    QWORD PTR [TEMP]
        FSTP    QWORD PTR [TEMP]
        WAIT    ; WAIT only needed for compatibility with old 80287 processor
        MOV     ECX, DWORD PTR [TEMP+4]
        SHR     ECX,20
        SUB     ECX,3FFH
        TEST    EAX,EAX         ; clear zero flag
BS2:</pre>
<p>
These emulation codes should not be used on the PPro, PII and PIII, where the
bit scan instructions take only 1 or 2 clocks, and where the emulation codes
shown above have two partial memory stalls.
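The exponent trick behind the emulation can be reproduced in C, with the caveat that a compiler performs the int-to-double conversion in its own way; <kbd>bsr_emulate</kbd> is an illustrative name, not part of the original code:
<p>
<pre>#include &lt;stdint.h&gt;
#include &lt;string.h&gt;
#include &lt;assert.h&gt;

/* Emulate BSR: index of the highest set bit of a nonzero x, found by
   converting x to double (the FILD/FSTP pair above) and reading the
   exponent field of the result. */
int bsr_emulate(uint32_t x)
{
    double d = (double)x;              /* exact: any uint32 fits in a double */
    uint64_t bits;
    memcpy(&amp;bits, &amp;d, sizeof bits);
    return (int)(bits &gt;&gt; 52) - 0x3FF;  /* isolate exponent, subtract bias */
}</pre>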
<p>
<h3><a name="26_16">26.16 FLDCW (PPro, PII and PIII)</a></h3>
The PPro, PII and PIII have a serious stall after the
<kbd>FLDCW</kbd> instruction if followed by any floating
point instruction which reads the control word (which almost all floating
point instructions do).
<p>
When C or C++ code is compiled it often generates a lot of
<kbd>FLDCW</kbd> instructions because conversion of floating point numbers
to integers is done with truncation while other floating
point instructions use rounding. After translation to assembly, you can
improve this code by using rounding instead of truncation where possible,
or by moving the <kbd>FLDCW</kbd> out of a loop
where truncation is needed inside the loop.
<p>
See chapter <a href="#27_5">27.5</a> on how to convert floating point numbers
to integers without changing the control word.
<p>
<h2><a name="27">27</a>. Special topics</h2>
<h3><a name="27_1">27.1 LEA instruction (all processors)</a></h3>
The <kbd>LEA</kbd> instruction is useful for many purposes because it can do
a shift, two additions, and a move in just one instruction taking one clock cycle.
Example:<br>
<kbd>LEA EAX,[EBX+8*ECX-1000]</kbd><br>
is much faster than<br>
<kbd>MOV EAX,ECX / SHL EAX,3 / ADD EAX,EBX / SUB EAX,1000</kbd><br>
The <kbd>LEA</kbd> instruction can also be used to do an add or shift without
changing the flags. The source and destination need not have the same word size,
so <kbd>LEA EAX,[BX]</kbd> is a possible
replacement for <kbd>MOVZX EAX,BX</kbd>, although suboptimal on most processors.
<p>
You must be aware, however, that the <kbd>LEA</kbd> instruction will suffer
an AGI stall on the PPlain and PMMX if it uses a base or index register
which has been written to in the preceding clock cycle.
<p>
Since the <kbd>LEA</kbd> instruction is pairable in the v-pipe on PPlain and
PMMX and shift instructions are not, you may use <kbd>LEA</kbd> as
a substitute for a <kbd>SHL</kbd> by 1, 2, or 3 if you want the
instruction to execute in the v-pipe.
<p>
The 32 bit processors have no documented addressing mode with a scaled index register
and nothing else, so an instruction like <kbd>LEA EAX,[EAX*2]</kbd>
is actually coded as <kbd>LEA EAX,[EAX*2+00000000]</kbd>
with an immediate displacement of 4 bytes. You may reduce the
instruction size by instead writing <kbd>LEA EAX,[EAX+EAX]</kbd> or even
better <kbd>ADD EAX,EAX</kbd>.
The latter code cannot have an AGI delay in PPlain and PMMX. If you happen to have a register
which is zero (like a loop counter after a loop), then you may use it
as a base register to reduce the code size:
<p>
<pre>LEA EAX,[EBX*4]      ; 7 bytes
LEA EAX,[ECX+EBX*4]  ; 3 bytes</pre>
<p>
<h3><a name="27_2">27.2 Division (all processors)</a></h3>
Division is quite time consuming. On PPro, PII and PIII an integer division
takes 19, 23, or 39 clocks for byte, word, and dword divisors respectively.
On PPlain and PMMX an unsigned integer division takes approximately the same,
while a signed integer division takes somewhat more. It is therefore
preferable to use the smallest operand size possible that won't generate an
overflow, even if it costs an operand size prefix, and use unsigned
division if possible.
<p>
<h4>Integer division by a constant (all processors)</h4>
Integer division by a power of two can be done by shifting right. Dividing an
unsigned integer by 2<sup>N</sup>:
<pre>        SHR EAX, N</pre>
Dividing a signed integer by 2<sup>N</sup>:
<pre>        CDQ
        AND EDX, (1 SHL N) - 1   ; or SHR EDX, 32-N
        ADD EAX, EDX
        SAR EAX, N</pre>
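The signed sequence can be sanity-checked in C. This sketch assumes arithmetic right shifts of negative integers (true on mainstream compilers) and uses an invented name, <kbd>sdiv_pow2</kbd>:
<p>
<pre>#include &lt;stdint.h&gt;
#include &lt;assert.h&gt;

/* Signed division by 2^n with truncation toward zero: add 2^n - 1 to
   negative dividends before the arithmetic shift, which is exactly what
   the CDQ / AND / ADD / SAR sequence does. Assumes &gt;&gt; on a negative
   int is an arithmetic shift. */
int32_t sdiv_pow2(int32_t x, int n)
{
    int32_t bias = (int32_t)((uint32_t)(x &gt;&gt; 31) &amp; ((1u &lt;&lt; n) - 1)); /* CDQ + AND */
    return (x + bias) &gt;&gt; n;                                          /* ADD + SAR */
}</pre>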
The <kbd>SHR</kbd> alternative is shorter than the <kbd>AND</kbd> if N > 7,
but can only go to execution port 0 (or u-pipe), whereas <kbd>AND</kbd> can
go to either port 0 or 1 (u or v-pipe).
<p>
Dividing by a constant can be done by multiplying with the reciprocal.
To calculate the unsigned integer division q = x / d, you first calculate
the reciprocal of the divisor, f = 2<sup>r</sup> / d, where r defines the position of the binary
decimal point (radix point). Then multiply x with f and shift
right r positions. The maximum value of r is 32+b, where b is the number of binary digits in d
minus 1. (b is the highest integer for which 2<sup>b</sup> <= d). Use r = 32+b to cover the maximum
range for the value of the dividend x.
<p>
This method needs some refinement in order to compensate for rounding errors.
The following algorithm will give you the correct result for unsigned integer
division with truncation, i.e. the same result as the <kbd>DIV</kbd>
instruction gives (thanks to Terje Mathisen who invented this method):
<pre>
b = (the number of significant bits in d) - 1
r = 32 + b
f = 2<sup>r</sup> / d
If f is an integer then d is a power of 2: goto case A.
If f is not an integer, then check if the fractional part of f is < 0.5
If the fractional part of f < 0.5: goto case B.
If the fractional part of f > 0.5: goto case C.

case A: (d = 2<sup>b</sup>)
result = x SHR b

case B: (fractional part of f < 0.5)
round f down to nearest integer
result = ((x+1) * f) SHR r

case C: (fractional part of f > 0.5)
round f up to nearest integer
result = (x * f) SHR r
</pre>
<p>
Example:<br>
Assume that you want to divide by 5.<br>
5 = 00000101b.<br>
b = (number of significant binary digits) - 1 = 2<br>
r = 32+2 = 34<br>
f = 2<sup>34</sup> / 5 = 3435973836.8 = 0CCCCCCCC.CCC...(hexadecimal)<br>
The fractional part is greater than a half: use case C.<br>
Round f up to 0CCCCCCCDh.
<p>
The following code divides <kbd>EAX</kbd> by 5 and returns the result in <kbd>EDX</kbd>:
<pre>        MOV EDX,0CCCCCCCDh
        MUL EDX
        SHR EDX,2</pre>
<p>
After the multiplication, <kbd>EDX</kbd> contains the product shifted right
32 places. Since r = 34 you have to shift 2 more places to get the result.
To divide by 10 you just change the last line to <kbd>SHR EDX,3</kbd>.
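The constant can be cross-checked in C, with 64-bit arithmetic standing in for the 32x32 -> 64 bit <kbd>MUL</kbd>; <kbd>div5</kbd> is an illustrative name:
<p>
<pre>#include &lt;stdint.h&gt;
#include &lt;assert.h&gt;

/* Case C division by 5: multiply by f = 0CCCCCCCDh (2^34/5 rounded up),
   take the high dword of the 64-bit product (what MUL leaves in EDX),
   then shift the remaining r - 32 = 2 places. */
uint32_t div5(uint32_t x)
{
    uint64_t product = (uint64_t)x * 0xCCCCCCCDu;  /* MUL EDX */
    return (uint32_t)(product &gt;&gt; 34);              /* high dword, then SHR EDX,2 */
}</pre>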
<p>
In case B you would have:
<pre>        INC EAX
        MOV EDX,f
        MUL EDX
        SHR EDX,b</pre>
<p>
This code works for all values of x except 0FFFFFFFFH which gives zero because of
overflow in the <kbd>INC</kbd> instruction. If x = 0FFFFFFFFH is possible, then change the code to:
<pre>        MOV EDX,f
        ADD EAX,1
        JC DOVERFL
        MUL EDX
DOVERFL:SHR EDX,b</pre>
<p>
If the value of x is limited, then you may use a lower value of r, i.e.
fewer digits. There can be several reasons to use a lower value of r:
<ul>
<li>you may set r = 32 to avoid the <kbd>SHR EDX,b</kbd> in the end.
<li>you may set r = 16+b and use a multiplication instruction that
gives a 32 bit result rather
than 64 bits. This will free the <kbd>EDX</kbd> register:
<pre>        IMUL EAX,0CCCDh / SHR EAX,18</pre>
<li>you may choose a value of r that gives case C rather than case
B in order to avoid the <kbd>INC EAX</kbd> instruction
</ul>
<p>
The maximum value for x in these cases is at least 2<sup>r-b</sup>,
sometimes higher. You have to do a systematic test if you want to know the
exact maximum value of x for which your code works correctly.
<p>
You may want to replace the slow multiplication instruction with faster instructions as
explained in chapter <a href="#26_5">26.5</a>.
<p>
The following example divides <kbd>EAX</kbd> by 10 and returns the result
in <kbd>EAX</kbd>. I have chosen r=17 rather than 19 because it happens
to give a code, which is easier to optimize, and covers
the same range for x. f = 2<sup>17</sup> / 10 = 3333h, case B: q = (x+1)*3333h:
<pre>        LEA EBX,[EAX+2*EAX+3]
        LEA ECX,[EAX+2*EAX+3]
        SHL EBX,4
        MOV EAX,ECX
        SHL ECX,8
        ADD EAX,EBX
        SHL EBX,8
        ADD EAX,ECX
        ADD EAX,EBX
        SHR EAX,17</pre>
<p>
A systematic test shows that this code works correctly for all x < 10004H.
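The shift-and-add expansion can be verified in C; the factor 3333h is decomposed as 3 * (1 + 16 + 256 + 4096), matching the <kbd>LEA</kbd>/<kbd>SHL</kbd>/<kbd>ADD</kbd> sequence. <kbd>div10</kbd> is an illustrative name:
<p>
<pre>#include &lt;stdint.h&gt;
#include &lt;assert.h&gt;

/* Case B division by 10 with r = 17: q = ((x+1) * 3333h) &gt;&gt; 17,
   built from shifts and adds. 3333h = 3 * (1 + 16 + 256 + 4096). */
uint32_t div10(uint32_t x)
{
    uint32_t b = 3 * (x + 1);                          /* the LEAs: 3*(x+1) */
    uint32_t q = b + (b &lt;&lt; 4) + (b &lt;&lt; 8) + (b &lt;&lt; 12);  /* times 1111h */
    return q &gt;&gt; 17;
}</pre>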
<p>
<h4>Repeated integer division by the same value (all processors)</h4>
If the divisor is not known at assembly time, but you are dividing
repeatedly with the same divisor, then you may use the same method as above.
The code has to distinguish between
case A, B and C and calculate f before doing the divisions.
<p>
The code that follows shows how to do multiple divisions with the same divisor (unsigned
division with truncation). First call <kbd>SET_DIVISOR</kbd> to specify the
divisor and calculate the
reciprocal, then call <kbd>DIVIDE_FIXED</kbd> for each value to divide by the same divisor.
<pre>
.data

RECIPROCAL_DIVISOR DD ?   ; rounded reciprocal divisor
CORRECTION         DD ?   ; case A: -1, case B: 1, case C: 0
BSHIFT             DD ?   ; number of bits in divisor - 1

.code

SET_DIVISOR  PROC NEAR    ; divisor in EAX
        PUSH    EBX
        MOV     EBX,EAX
        BSR     ECX,EAX   ; b = number of bits in divisor - 1
        MOV     EDX,1
        JZ      ERROR     ; error: divisor is zero
        SHL     EDX,CL    ; 2^b
        MOV     [BSHIFT],ECX              ; save b
        CMP     EAX,EDX
        MOV     EAX,0
        JE      SHORT CASE_A              ; divisor is a power of 2
        DIV     EBX                       ; 2^(32+b) / d
        SHR     EBX,1                     ; divisor / 2
        XOR     ECX,ECX
        CMP     EDX,EBX                   ; compare remainder with divisor/2
        SETBE   CL                        ; 1 if case B
        MOV     [CORRECTION],ECX          ; correction for rounding errors
        XOR     ECX,1
        ADD     EAX,ECX                   ; add 1 if case C
        MOV     [RECIPROCAL_DIVISOR],EAX  ; rounded reciprocal divisor
        POP     EBX
        RET
CASE_A: MOV     [CORRECTION],-1           ; remember that we have case A
        POP     EBX
        RET
SET_DIVISOR  ENDP

DIVIDE_FIXED PROC NEAR    ; dividend in EAX, result in EAX
        MOV     EDX,[CORRECTION]
        MOV     ECX,[BSHIFT]
        TEST    EDX,EDX
        JS      SHORT DSHIFT              ; divisor is power of 2
        ADD     EAX,EDX                   ; correct for rounding error
        JC      SHORT DOVERFL             ; correct for overflow
        MUL     [RECIPROCAL_DIVISOR]      ; multiply with reciprocal divisor
        MOV     EAX,EDX
DSHIFT: SHR     EAX,CL                    ; adjust for number of bits
        RET
DOVERFL:MOV     EAX,[RECIPROCAL_DIVISOR]  ; dividend = 0FFFFFFFFH
        SHR     EAX,CL                    ; do division by shifting
        RET
DIVIDE_FIXED ENDP</pre>
This code gives the same result as the <kbd>DIV</kbd> instruction for
0 <= x < 2<sup>32</sup>, 0 < d < 2<sup>32</sup>.<br>
Note: The line <kbd>JC DOVERFL</kbd> and its target are not needed if
you are certain that x < 0FFFFFFFFH.
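A C transcription of the two routines can be tested against the <kbd>/</kbd> operator. It follows the algorithm rather than the register allocation, and the names <kbd>set_divisor</kbd> and <kbd>divide_fixed</kbd> simply mirror the assembly labels:
<p>
<pre>#include &lt;stdint.h&gt;
#include &lt;assert.h&gt;

static uint32_t reciprocal, correction, bshift;   /* mirrors the .data block */

/* SET_DIVISOR: precompute b, the rounded reciprocal f, and the case tag. */
void set_divisor(uint32_t d)
{
    uint32_t b = 31;
    while (!(d &gt;&gt; b)) b--;                    /* BSR: highest set bit */
    bshift = b;
    if (d == (1u &lt;&lt; b)) {                     /* case A: power of 2 */
        correction = (uint32_t)-1;
        return;
    }
    uint64_t n = (uint64_t)1 &lt;&lt; (32 + b);     /* 2^(32+b) */
    correction = (uint32_t)(n % d) &lt;= d / 2;  /* 1 if case B, 0 if case C */
    reciprocal = (uint32_t)(n / d) + (correction ^ 1);  /* round up in case C */
}

/* DIVIDE_FIXED: divide x by the divisor set above. */
uint32_t divide_fixed(uint32_t x)
{
    if (correction == (uint32_t)-1) return x &gt;&gt; bshift;  /* power of 2 */
    uint64_t sum = (uint64_t)x + correction;             /* ADD EAX,EDX */
    if (sum &gt;&gt; 32) return reciprocal &gt;&gt; bshift;          /* JC DOVERFL */
    return (uint32_t)(((uint32_t)sum * (uint64_t)reciprocal) &gt;&gt; 32) &gt;&gt; bshift;
}</pre>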
<p>
If powers of 2 occur so seldom that it is not worth optimizing for them,
then you may leave out the jump to <kbd>DSHIFT</kbd> and instead do a
multiplication with <kbd>CORRECTION</kbd> = 0 for case A.
<p>
If the divisor is changed so often that the procedure <kbd>SET_DIVISOR</kbd> needs optimizing, then you may
replace the <kbd>BSR</kbd> instruction with the code given in chapter
<a href="#26_15">26.15</a> for the PPlain and PMMX processors.
<p>
<h4>Floating point division (all processors)</h4>
Floating point division takes 38 or 39 clock cycles for the highest precision.
You can save time by specifying a lower precision in the floating point
control word (on PPlain and PMMX, only <kbd>FDIV</kbd> and <kbd>FIDIV</kbd> are faster at low
precision; on PPro, PII and PIII, this also applies
to <kbd>FSQRT</kbd>. No other instructions can be speeded up this way).
<p>
<h4><a name="paralleldiv">Parallel division (PPlain and PMMX)</a></h4>
On PPlain and PMMX, it is possible to do a floating point division and an integer division in
parallel to save time. On PPro, PII and PIII this is not possible, because integer division and
floating point division use the same circuitry.<br>
Example: A = A1 / A2; B = B1 / B2
<pre>        FILD    [B1]
        FILD    [B2]
        MOV     EAX, [A1]
        MOV     EBX, [A2]
        CDQ
        FDIV
        DIV     EBX
        FISTP   [B]
        MOV     [A], EAX</pre><p>
(make sure you set the floating point control word to the desired rounding method)
<p>
<h4>Using reciprocal instruction for fast division (PIII)</h4>
On PIII you can use the fast reciprocal instruction <kbd>RCPSS</kbd> or
<kbd>RCPPS</kbd> on the divisor and then multiply with the dividend. However,
the precision is only 12 bits. You can increase the precision to 23 bits by
using the Newton-Raphson method described in Intel's application note AP-803:<br>
x<sub>0</sub> = <kbd>RCPSS</kbd>(d)<br>
x<sub>1</sub> = x<sub>0</sub> * (2 - d * x<sub>0</sub>)
= 2*x<sub>0</sub> - d * x<sub>0</sub> * x<sub>0</sub><br>
where x<sub>0</sub> is the first approximation to the reciprocal of the divisor, d,
and x<sub>1</sub> is a better approximation. You must use this formula before
multiplying with the dividend:
<pre>        MOVAPS  XMM1, [DIVISORS]   ; load divisors
        RCPPS   XMM0, XMM1         ; approximate reciprocal
        MULPS   XMM1, XMM0         ; Newton-Raphson formula
        MULPS   XMM1, XMM0
        ADDPS   XMM0, XMM0
        SUBPS   XMM0, XMM1
        MULPS   XMM0, [DIVIDENDS]  ; results in XMM0</pre>
This makes four divisions in 18 clock cycles with a precision of 23 bits.
Increasing the precision further by repeating the Newton-Raphson formula
in the floating point registers is possible, but not very advantageous.
<p>
If you want to use this method for integer divisions then you have to check for
rounding errors. The following code makes four divisions with truncation on packed
word size integers in approximately 42 clock cycles. It gives exact results for
0 <= dividend < 7FFFFH and 0 < divisor <= 7FFFFH:
<pre>        MOVQ    MM1, [DIVISORS]    ; load four divisors
        MOVQ    MM2, [DIVIDENDS]   ; load four dividends
        PUNPCKHWD MM4, MM1         ; unpack divisors to DWORDs
        PSRAD   MM4, 16
        PUNPCKLWD MM3, MM1
        PSRAD   MM3, 16
        CVTPI2PS XMM1, MM4         ; convert divisors to float, upper two operands
        MOVLHPS XMM1, XMM1
        CVTPI2PS XMM1, MM3         ; convert lower two operands
        PUNPCKHWD MM4, MM2         ; unpack dividends to DWORDs
        PSRAD   MM4, 16
        PUNPCKLWD MM3, MM2
        PSRAD   MM3, 16
        CVTPI2PS XMM2, MM4         ; convert dividends to float, upper two operands
        MOVLHPS XMM2, XMM2
        CVTPI2PS XMM2, MM3         ; convert lower two operands
        RCPPS   XMM0, XMM1         ; approximate reciprocal of divisors
        MULPS   XMM1, XMM0         ; improve precision with Newton-Raphson method
        PCMPEQW MM4, MM4           ; make four integer 1's in the meantime
        PSRLW   MM4, 15
        MULPS   XMM1, XMM0
        ADDPS   XMM0, XMM0
        SUBPS   XMM0, XMM1         ; reciprocal divisors with 23 bit precision
        MULPS   XMM0, XMM2         ; multiply with dividends
        CVTTPS2PI MM0, XMM0        ; truncate lower two results
        MOVHLPS XMM0, XMM0
        CVTTPS2PI MM3, XMM0        ; truncate upper two results
        PACKSSDW MM0, MM3          ; pack the four results into MM0
        MOVQ    MM3, MM1           ; multiply results with divisors...
        PMULLW  MM3, MM0           ; to check for rounding errors
        PADDSW  MM0, MM4           ; add 1 to compensate for later subtraction
        PADDSW  MM3, MM1           ; add divisor. this should be > dividend
        PCMPGTW MM3, MM2           ; check if too small
        PADDSW  MM0, MM3           ; subtract 1 if not too small
        MOVQ    [QUOTIENTS], MM0   ; save the four results</pre>
This code checks if the result is too small and makes the appropriate
correction. It is not necessary to check if the result is too big.
<p>
<h4>Avoiding divisions (all processors)</h4>
Obviously, you should always try to minimize the number of divisions. Floating point division
with a constant or repeated division with the same value should of course be done by
multiplying with the reciprocal. But there are many other situations where you can reduce
the number of divisions. For example:
if (A/B > C)... can be rewritten as if (A > B*C)... when B is positive, and the opposite when
B is negative.
<p>
A/B + C/D can be rewritten as (A*D + C*B) / (B*D)
<p>
If you are using integer division, then you should be aware that the rounding errors may be
different when you rewrite the formulas.
<p>
<h3><a name="27_3">27.3 Freeing floating point registers (all processors)</a></h3>
You have to free all used floating point registers before exiting a subroutine,
except for any register used for the result.
<p>
The fastest way of freeing one register is <kbd>FSTP ST</kbd>.
The fastest way of freeing two registers is <kbd>FCOMPP</kbd> on PPlain and PMMX;
on PPro, PII and PIII you may use either
<kbd>FCOMPP</kbd> or two times <kbd>FSTP ST</kbd>, whichever fits best into
the decoding sequence.
<p>
It is not recommended to use <kbd>FFREE</kbd>.
<p>
<h3><a name="27_4">27.4 Transitions between floating point and MMX instructions (PMMX, PII and PIII)</a></h3>
You must issue an <kbd>EMMS</kbd> instruction after your last MMX instruction if there
is a possibility that floating point code follows later.
<p>
On PMMX there is a high penalty for switching between floating point and MMX
instructions. The first floating point instruction after an
<kbd>EMMS</kbd> takes approximately 58 clocks extra, and the first MMX instruction
after a floating point instruction takes approximately 38 clocks extra.
<p>
On PII and PIII there is no such penalty. The delay after <kbd>EMMS</kbd>
can be hidden by putting in integer
instructions between <kbd>EMMS</kbd> and the first floating point instruction.
<p>
<h3><a name="27_5">27.5 Converting from floating point to integer (All processors)</a></h3>
All conversions from floating point to integer, and vice versa, must go via a memory
location:
<pre>        FISTP   DWORD PTR [TEMP]
        MOV     EAX, [TEMP]</pre><p>
On PPro, PII and PIII, this code is likely to have a penalty for attempting to
read from <kbd>[TEMP]</kbd> before the write to <kbd>[TEMP]</kbd> is finished
because the <kbd>FIST</kbd> instruction is slow (see chapter <a href="#17">17</a>).
It doesn't help to put in a <kbd>WAIT</kbd> (see chapter <a href="#26_6">26.6</a>).
It is recommended that you put in other instructions between the write to
<kbd>[TEMP]</kbd> and the read from <kbd>[TEMP]</kbd> if possible in
order to avoid this penalty. This applies to all the examples that follow.
<p>
The specifications for the C and C++ languages require that conversion
from floating point
numbers to integers use truncation rather than rounding. The method used by most C
libraries is to change the floating point control word to indicate truncation before using an
<kbd>FISTP</kbd> instruction and changing it back again afterwards. This method is very slow on all
processors. On PPro, PII and PIII, the floating point control word cannot be renamed, so all
subsequent floating point instructions must wait for the <kbd>FLDCW</kbd> instruction to retire.
<p>
Whenever you have a conversion from floating point to integer in C or C++, you should
think of whether you can use rounding to nearest integer instead of truncation. If your
standard library doesn't have a fast round function then make your own using the code
examples listed below.
<p>
If you need truncation inside a loop then you should change the control word only outside
the loop if the rest of the floating point instructions in the loop can work correctly in
truncation mode.
<p>
You may use various tricks for truncating without changing the control word, as illustrated in
the examples below. These examples presume that the control word is set to default, i.e.
rounding to nearest or even.
<p>
<h4>Rounding to nearest or even</h4>
<pre>; extern "C" int round (double x);
_round  PROC    NEAR
        PUBLIC  _round
        FLD     QWORD PTR [ESP+4]
        FISTP   DWORD PTR [ESP+4]
        MOV     EAX, DWORD PTR [ESP+4]
        RET
_round  ENDP</pre>
<p>
<h4>Truncation towards zero</h4>
<pre>; extern "C" int truncate (double x);
_truncate PROC  NEAR
        PUBLIC  _truncate
        FLD     QWORD PTR [ESP+4]   ; x
        SUB     ESP, 12             ; space for local variables
        FIST    DWORD PTR [ESP]     ; rounded value
        FST     DWORD PTR [ESP+4]   ; float value
        FISUB   DWORD PTR [ESP]     ; subtract rounded value
        FSTP    DWORD PTR [ESP+8]   ; difference
        POP     EAX                 ; rounded value
        POP     ECX                 ; float value
        POP     EDX                 ; difference (float)
        TEST    ECX, ECX            ; test sign of x
        JS      SHORT NEGATIVE
        ADD     EDX, 7FFFFFFFH      ; produce carry if difference < -0
        SBB     EAX, 0              ; subtract 1 if x-round(x) < -0
        RET
NEGATIVE:
        XOR     ECX, ECX
        TEST    EDX, EDX
        SETG    CL                  ; 1 if difference > 0
        ADD     EAX, ECX            ; add 1 if x-round(x) > 0
        RET
_truncate ENDP</pre>
<p>
<h4>Truncation towards minus infinity</h4>
<pre>; extern "C" int ifloor (double x);
_ifloor PROC    NEAR
        PUBLIC  _ifloor
        FLD     QWORD PTR [ESP+4]   ; x
        SUB     ESP, 8              ; space for local variables
        FIST    DWORD PTR [ESP]     ; rounded value
        FISUB   DWORD PTR [ESP]     ; subtract rounded value
        FSTP    DWORD PTR [ESP+4]   ; difference
        POP     EAX                 ; rounded value
        POP     EDX                 ; difference (float)
        ADD     EDX, 7FFFFFFFH      ; produce carry if difference < -0
        SBB     EAX, 0              ; subtract 1 if x-round(x) < -0
        RET
_ifloor ENDP</pre>
<p>
These procedures work for -2<sup>31</sup> < x < 2<sup>31</sup>-1.
They do not check for overflow or NAN's.
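The ifloor logic can be expressed in C with <kbd>nearbyint</kbd>, which rounds in the current (default round-to-nearest-even) mode just as <kbd>FIST</kbd> does; a hedged sketch of the round-then-adjust idea, not the library's floor:
<p>
<pre>#include &lt;assert.h&gt;
#include &lt;math.h&gt;

/* Floor via round-to-nearest: round x, then subtract 1 when the
   difference x - round(x) is negative, as the ADD/SBB carry trick does. */
int ifloor(double x)
{
    int r = (int)nearbyint(x);         /* rounded value, default rounding mode */
    return r - (x - (double)r &lt; 0.0);  /* subtract 1 if x - round(x) &lt; 0 */
}</pre>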
<p>
The PIII has instructions for truncation of single precision floating point
numbers: <kbd>CVTTSS2SI</kbd> and <kbd>CVTTPS2PI</kbd>. These instructions
are very useful if the single precision is satisfactory, but if you are converting
a float with higher precision to single precision in order to use these truncation
instructions then you have the problem that the number may be rounded up in the
conversion to single precision.
<p>
<h4>Alternative to FISTP instruction (PPlain and PMMX)</h4>
<p>
Converting a floating point number to integer is normally done like this:
<pre>        FISTP   DWORD PTR [TEMP]
        MOV     EAX, [TEMP]</pre>
<p>
An alternative method is:
<pre>.DATA
ALIGN 8
TEMP    DQ      ?
MAGIC   DD      59C00000H   ; f.p. representation of 2^51 + 2^52

.CODE
        FADD    [MAGIC]
        FSTP    QWORD PTR [TEMP]
        MOV     EAX, DWORD PTR [TEMP]</pre>
<p>
Adding the 'magic number' of 2<sup>51</sup> + 2<sup>52</sup> has the effect
that any integer between -2<sup>31</sup> and +2<sup>31</sup>
will be aligned in the lower 32 bits when storing as a double precision floating
point number. The result is the same as you get with <kbd>FISTP</kbd> for all rounding methods except
truncation towards zero. The result is different from <kbd>FISTP</kbd> if the control word specifies
truncation or in case of overflow. You may need a <kbd>WAIT</kbd> instruction for
compatibility with the old 80287 processor, see chapter <a href="#26_6">26.6</a>.
<p>
This method is not faster than using <kbd>FISTP</kbd>, but it gives better
scheduling opportunities on
PPlain and PMMX because there is a 3 clock void between <kbd>FADD</kbd>
and <kbd>FSTP</kbd> which may be
filled with other instructions. You may multiply or divide the number by a
power of 2 in the same operation by doing the opposite to the magic number.
You may also add a constant by
adding it to the magic number, which then has to be double precision.
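The magic-number conversion can be demonstrated in C under the assumptions of little-endian IEEE doubles and the default rounding mode; <kbd>round_magic</kbd> is an invented name, and the volatile keeps the compiler from folding the addition at a different precision:
<p>
<pre>#include &lt;stdint.h&gt;
#include &lt;string.h&gt;
#include &lt;assert.h&gt;

/* Round a double to int32 by adding 2^51 + 2^52: the addition aligns the
   integer part of x in the lower 32 bits of the stored double. */
int32_t round_magic(double x)
{
    volatile double d = x + 6755399441055744.0;  /* 2^51 + 2^52 */
    double stored = d;                           /* FSTP QWORD PTR [TEMP] */
    uint64_t bits;
    memcpy(&amp;bits, &amp;stored, sizeof bits);
    return (int32_t)bits;                        /* MOV EAX, low dword */
}</pre>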
<p>
<h3><a name="27_6">27.6 Using integer instructions to do floating point operations (all processors)</a></h3>
Integer instructions are generally faster than floating point instructions, so it is often
advantageous to use integer instructions for doing simple floating point operations. The
most obvious example is moving data. Example:<br>
<kbd> FLD QWORD PTR [ESI] / FSTP QWORD PTR [EDI]</kbd><br>
Change to:<br>
<kbd> MOV EAX,[ESI] / MOV EBX,[ESI+4] / MOV [EDI],EAX / MOV [EDI+4],EBX</kbd><br>
<p>
<h4>Testing if a floating point value is zero:</h4>
The floating point value of zero is usually represented as 32 or 64 bits of zero, but there is a
pitfall here: the sign bit may be set! Minus zero is regarded as a valid floating point number,
and the processor may actually generate a zero with the sign bit set if for example
multiplying a negative number with zero. So if you want to test if a floating point number is
zero, you should not test the sign bit. Example:<br>
<kbd> FLD DWORD PTR [EBX] / FTST / FNSTSW AX / AND AH,40H / JNZ IsZero</kbd><br>
Use integer instructions instead, and shift out the sign bit:<br>
<kbd> MOV EAX,[EBX] / ADD EAX,EAX / JZ IsZero</kbd><br>
If the floating point number is double precision (QWORD) then you only have to
test bits 32-62. If they are zero, then the lower half will also be zero if it is a normal floating point number.
<p>
<h4>Testing if negative:</h4>
A floating point number is negative if the sign bit is set and at least one other bit is set.
Example:<br>
<kbd> MOV EAX,[NumberToTest] / CMP EAX,80000000H / JA IsNegative</kbd>
<p>
<h4>Manipulating the sign bit:</h4>
You can change the sign of a floating point number simply by flipping the
sign bit. Example:<br>
<kbd> XOR BYTE PTR [a] + (TYPE a) - 1, 80H</kbd>
<p>
Likewise you may get the absolute value of a floating point number by simply ANDing out
the sign bit.
<p>
<h4>Comparing numbers:</h4>
Floating point numbers are stored in a unique format which allows you to use integer
instructions for comparing floating point numbers, except for the sign bit. If you are certain
that two floating point numbers both are normal and positive then you may simply compare
them as integers. Example:<br>
<kbd> FLD [a] / FCOMP [b] / FNSTSW AX / AND AH,1 / JNZ ASmallerThanB</kbd><br>
Change to:<br>
<kbd> MOV EAX,[a] / MOV EBX,[b] / CMP EAX,EBX / JB ASmallerThanB</kbd><br>
This method only works if the two numbers have the same precision and you are certain
that none of the numbers have the sign bit set.
<p>
If negative numbers are possible, then you have to convert the negative numbers to
2-complement, and do a signed compare:
<pre>        MOV     EAX, [a]
        MOV     EBX, [b]
        MOV     ECX, EAX
        MOV     EDX, EBX
        SAR     ECX, 31         ; copy sign bit
        AND     EAX, 7FFFFFFFH  ; remove sign bit
        SAR     EDX, 31
        AND     EBX, 7FFFFFFFH
        XOR     EAX, ECX        ; make 2-complement if sign bit was set
        XOR     EBX, EDX
        SUB     EAX, ECX
        SUB     EBX, EDX
        CMP     EAX, EBX
        JL      ASmallerThanB   ; signed comparison</pre>
This method works for all normal floating point numbers, including -0.
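The same XOR/SUB mapping can be checked in C against ordinary float comparisons; this sketch assumes IEEE 754 single precision, and <kbd>float_key</kbd> is an invented name:
<p>
<pre>#include &lt;stdint.h&gt;
#include &lt;string.h&gt;
#include &lt;assert.h&gt;

/* Map the sign-magnitude bit pattern of a float to a two's complement
   integer whose ordering matches the float ordering (the XOR/SUB trick). */
int32_t float_key(float f)
{
    int32_t i;
    memcpy(&amp;i, &amp;f, sizeof i);
    int32_t sign = i &gt;&gt; 31;                   /* SAR ECX,31: copy sign bit */
    return ((i &amp; 0x7FFFFFFF) ^ sign) - sign;  /* negate magnitude if sign set */
}</pre>
a < b holds for two normal floats exactly when float_key(a) < float_key(b), and -0 maps to the same key as +0.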
|
|
<p>
<h3><a name="27_7">27.7 Using floating point instructions to do integer operations (PPlain and PMMX)</a></h3>
<h4>Integer multiplication (PPlain and PMMX)</h4>
Floating point multiplication is faster than integer multiplication on the PPlain and PMMX,
but the price for converting integer factors to float and converting the result back to integer
is high, so floating point multiplication is only advantageous if the number of conversions
needed is low compared with the number of multiplications. (It may be tempting to use
denormal floating point operands to save some of the conversions here, but the handling of
denormals is very slow, so this is not a good idea!)
<p>
On the PMMX, MMX multiplication instructions are faster than integer multiplication, and
can be pipelined to a throughput of one multiplication per clock cycle, so this may be the
best solution for doing fast multiplication on the PMMX, if you can live with 16 bit precision.
<p>
Integer multiplication is faster than floating point on PPro, PII and PIII.
<p>
<h4>Integer division (PPlain and PMMX)</h4>
Floating point division is not faster than integer division, but you can do other integer
operations (including integer division, but not integer multiplication) while the floating point
unit is working on the division (See example <a href="#paralleldiv">above</a>).
<p>
<h4>Converting binary to decimal numbers (all processors)</h4>
Using the <kbd>FBSTP</kbd> instruction is a simple and convenient way of converting a binary number
to decimal, although not necessarily the fastest method.
<p>
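For reference, <kbd>FBSTP</kbd> stores its result as packed BCD: two decimal digits per byte, least significant digits first. The sketch below (plain Python, illustrative only, not taken from the original text) builds the same packed-BCD encoding for a small non-negative integer:

```python
def to_packed_bcd(n, nbytes=9):
    # pack decimal digits of n two per byte, low digits first,
    # as FBSTP does for the 18 digits of its m80 result
    out = bytearray(nbytes)
    for i in range(nbytes):
        n, lo = divmod(n, 10)   # next decimal digit (low nibble)
        n, hi = divmod(n, 10)   # next decimal digit (high nibble)
        out[i] = (hi << 4) | lo
    return bytes(out)
```

For example, 1234 packs as the bytes 34H, 12H followed by zeros.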
<h3><a name="27_8">27.8 Moving blocks of data (all processors)</a></h3>
There are several ways of moving blocks of data. The most common method is
<kbd>REP MOVSD</kbd>, but under certain conditions other methods are faster.
<p>
On PPlain and PMMX it is faster to move 8 bytes at a time using floating
point registers if the destination is not in the cache:
<pre>TOP:    FILD   QWORD PTR [ESI]
        FILD   QWORD PTR [ESI+8]
        FXCH
        FISTP  QWORD PTR [EDI]
        FISTP  QWORD PTR [EDI+8]
        ADD    ESI, 16
        ADD    EDI, 16
        DEC    ECX
        JNZ    TOP</pre>
<p>
The source and destination should of course be aligned by 8. The extra time used by the
slow <kbd>FILD</kbd> and <kbd>FISTP</kbd> instructions is compensated for by the fact that you only have to do
half as many write operations. Note that this method is only advantageous on the PPlain
and PMMX and only if the destination is not in the level 1 cache. You cannot use
<kbd>FLD</kbd> and <kbd>FSTP</kbd> (without <kbd>I</kbd>) on arbitrary bit patterns because denormal numbers
are handled slowly and certain bit patterns are not preserved unchanged.
<p>
On the PMMX processor it is faster to use MMX instructions to move eight bytes
at a time if the destination is not in the cache:
<pre>TOP:    MOVQ   MM0,[ESI]
        MOVQ   [EDI],MM0
        ADD    ESI,8
        ADD    EDI,8
        DEC    ECX
        JNZ    TOP</pre>
<p>
There is no need to unroll this loop or optimize it further if cache misses are expected,
because memory access is the bottleneck here, not instruction execution.
<p>
On PPro, PII and PIII processors the <kbd>REP MOVSD</kbd> instruction is particularly
fast when the following conditions are met (see chapter <a href="#26_3">26.3</a>):
<ul>
<li>both source and destination must be aligned by 8
<li>direction must be forward (direction flag cleared)
<li>the count (<kbd>ECX</kbd>) must be greater than or equal to 64
<li>the difference between <kbd>EDI</kbd> and <kbd>ESI</kbd> must be numerically greater than or equal to 32
<li>the memory type for both source and destination must be either writeback or
write-combining (you can normally assume this).
</ul>
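A dispatcher that chooses between <kbd>REP MOVSD</kbd> and a fallback routine might test these conditions first. This sketch (plain Python, hypothetical function name; the memory-type condition is not modelled) encodes the checks listed above:

```python
def fast_rep_movsd_ok(src, dst, dword_count, direction_up=True):
    # conditions under which REP MOVSD runs in its fast mode
    # on PPro, PII and PIII (memory type assumed writeback)
    return (src % 8 == 0 and dst % 8 == 0   # both aligned by 8
            and direction_up                # direction flag cleared
            and dword_count >= 64           # ECX >= 64
            and abs(dst - src) >= 32)       # |EDI - ESI| >= 32
```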
<p>
On the PII it is faster to use MMX registers if the above conditions are not met
and the destination is likely to be in the level 1 cache. The loop may be unrolled
by two, and the source and destination should of course be aligned by 8.
<p>
On the PIII the fastest way of moving data is to use the <kbd>MOVAPS</kbd>
instruction if the above conditions are not met or if the destination is in
the level 1 or level 2 cache:
<pre>        SUB    EDI, ESI
TOP:    MOVAPS XMM0, [ESI]
        MOVAPS [ESI+EDI], XMM0
        ADD    ESI, 16
        DEC    ECX
        JNZ    TOP</pre>
Unlike <kbd>FLD</kbd>, <kbd>MOVAPS</kbd> can handle any bit pattern without
problems. Remember that source and destination must be aligned by 16.
<p>
If the number of bytes to move is not divisible by 16 then you may round up
to the nearest number divisible by 16 and put some extra space at the end of
the destination buffer to receive the superfluous bytes. If this is not possible
then you have to move the remaining bytes by other methods.
<p>
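Rounding a byte count up to the next multiple of 16 is a simple mask operation; the sketch below (plain Python) shows the usual formula:

```python
def round_up16(n):
    # round n up to the nearest multiple of 16:
    # add 15, then clear the low four bits
    return (n + 15) & ~15
```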
On the PIII you also have the option of writing directly to RAM memory without
involving the cache by using the <kbd>MOVNTQ</kbd> or <kbd>MOVNTPS</kbd>
instruction. This can be useful if you don't want the destination to go into
a cache. <kbd>MOVNTPS</kbd> is only slightly faster than <kbd>MOVNTQ</kbd>.
<p>
<h3><a name="27_9">27.9 Self-modifying code (All processors)</a></h3>
The penalty for executing a piece of code immediately after modifying it is approximately 19
clocks for PPlain, 31 for PMMX, and 150-300 for PPro, PII and PIII. The 80486 and earlier
processors require a jump between the modifying and the modified code in order to flush
the code cache.
<p>
To get permission to modify code in a protected operating system you need to call special
system functions: In 16-bit Windows call ChangeSelector, in 32-bit Windows call
VirtualProtect and FlushInstructionCache (or put the code in a data segment).
<p>
Self-modifying code is not considered good programming practice, but it may be justified if
the gain in speed is considerable.
<p>
<h3><a name="27_10">27.10 Detecting processor type (All processors)</a></h3>
I think it is fairly obvious by now that what is optimal for one microprocessor may not be
optimal for another. You may make the most critical parts of your program in different
versions, each optimized for a specific microprocessor, and select the desired version at
run time after detecting which microprocessor the program is running on. If you are using
instructions that are not supported by all microprocessors (e.g. conditional
moves, <kbd>FCOMI</kbd>, MMX and XMM instructions) then you must first check if the program is running on a microprocessor
that supports these instructions. The subroutine below checks the type of microprocessor
and the features supported.
<p>
<pre>; define CPUID instruction if not known by assembler:
CPUID   MACRO
        DB     0FH, 0A2H
        ENDM

; C++ prototype:
; extern "C" long int DetectProcessor (void);

; return value:
; bits 8-11 = family (5 for PPlain and PMMX, 6 for PPro, PII and PIII)
; bit 0 = floating point instructions supported
; bit 15 = conditional move and FCOMI instructions supported
; bit 23 = MMX instructions supported
; bit 25 = XMM instructions supported

_DetectProcessor PROC NEAR
        PUBLIC _DetectProcessor
        PUSH   EBX
        PUSH   ESI
        PUSH   EDI
        PUSH   EBP
; detect if CPUID instruction supported by microprocessor:
        PUSHFD
        POP    EAX
        MOV    EBX, EAX
        XOR    EAX, 1 SHL 21   ; check if CPUID bit can toggle
        PUSH   EAX
        POPFD
        PUSHFD
        POP    EAX
        XOR    EAX, EBX
        AND    EAX, 1 SHL 21
        JZ     SHORT DPEND     ; CPUID instruction not supported
        XOR    EAX, EAX
        CPUID                  ; get number of CPUID functions
        TEST   EAX, EAX
        JZ     SHORT DPEND     ; CPUID function 1 not supported
        MOV    EAX, 1
        CPUID                  ; get family and features
        AND    EAX, 000000F00H ; family
        AND    EDX, 0FFFFF0FFH ; features flags
        OR     EAX, EDX        ; combine bits
DPEND:  POP    EBP
        POP    EDI
        POP    ESI
        POP    EBX
        RET
_DetectProcessor ENDP</pre>
<p>
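The bit layout of the combined return value can be decoded as follows. This sketch (plain Python, hypothetical helper name) extracts the family number and the three feature flags from a value in the format described in the comments of the routine above:

```python
def decode_features(v):
    # decode the return value of _DetectProcessor:
    # bits 8-11 = family, bit 15 = CMOV/FCOMI, bit 23 = MMX, bit 25 = XMM
    return {
        'family': (v >> 8) & 0xF,
        'cmov':   bool(v & (1 << 15)),
        'mmx':    bool(v & (1 << 23)),
        'xmm':    bool(v & (1 << 25)),
    }
```

A PIII would typically return family 6 with all three feature bits set; a PPlain returns family 5 with none of them.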
Note that some operating systems do not allow XMM instructions.
Information on how to check for operating system support of XMM instructions can
be found in Intel's application note AP-900: "Identifying support for Streaming
SIMD Extensions in the Processor and Operating System".
More information on microprocessor identification can be found in Intel's
application note AP-485: "Intel Processor Identification and the CPUID Instruction".
<p>
To code the conditional move, MMX, XMM instructions etc. on an assembler that doesn't have
these instructions, use the macros at <a href="http://www.agner.org/assem/macros.zip">www.agner.org/assem/macros.zip</a>
<p>
<h2><a name="28">28</a>. List of instruction timings for PPlain and PMMX</h2>
<h3><a name="28_1">28.1 Integer instructions</a></h3>
<b>Explanations:</b><br>
<u>Operands:</u><br>
r = register, m = memory, i = immediate data, sr = segment register<br>
m32 = 32 bit memory operand, etc.
<p>
<u>Clock cycles:</u><br>
The numbers are minimum values. Cache misses, misalignment, and exceptions may
increase the clock counts considerably.
<p>
<u>Pairability:</u><br>
u = pairable in u-pipe, v = pairable in v-pipe, uv = pairable in either pipe,
np = not pairable.
<p>

<table border=1 cellpadding=4 cellspacing=1>
<tr><td class="a3"> Instruction </td>
<td class="a3"> Operands </td>
<td class="a3"> Clock cycles </td>
<td class="a3"> Pairability </td></tr>
<tr><td>NOP</td><td> </td><td>1</td><td>uv</td></tr>
<tr><td>MOV</td><td>r/m, r/m/i</td><td>1</td><td>uv</td></tr>
<tr><td>MOV</td><td>r/m, sr</td><td>1</td><td>np</td></tr>
<tr><td>MOV</td><td>sr , r/m</td><td>>= 2 b)</td><td>np</td></tr>
<tr><td>MOV</td><td>m , accum</td><td>1</td><td>uv h)</td></tr>
<tr><td>XCHG</td><td>(E)AX, r</td><td>2</td><td>np</td></tr>
<tr><td>XCHG</td><td>r , r</td><td>3</td><td>np</td></tr>
<tr><td>XCHG</td><td>r , m</td><td>>15</td><td>np</td></tr>
<tr><td>XLAT</td><td> </td><td>4</td><td>np</td></tr>
<tr><td>PUSH</td><td>r/i</td><td>1</td><td>uv</td></tr>
<tr><td>POP</td><td>r</td><td>1</td><td>uv</td></tr>
<tr><td>PUSH</td><td>m</td><td>2</td><td>np</td></tr>
<tr><td>POP</td><td>m</td><td>3</td><td>np</td></tr>
<tr><td>PUSH</td><td>sr</td><td>1 b)</td><td>np</td></tr>
<tr><td>POP</td><td>sr</td><td>>= 3 b)</td><td>np</td></tr>
<tr><td>PUSHF</td><td> </td><td>3-5</td><td>np</td></tr>
<tr><td>POPF</td><td> </td><td>4-6</td><td>np</td></tr>
<tr><td>PUSHA POPA</td><td> </td><td>5-9 i)</td><td>np</td></tr>
<tr><td>PUSHAD POPAD</td><td> </td><td>5</td><td>np</td></tr>
<tr><td>LAHF SAHF</td><td> </td><td>2</td><td>np</td></tr>
<tr><td>MOVSX MOVZX</td><td>r , r/m</td><td>3 a)</td><td>np</td></tr>
<tr><td>LEA</td><td>r , m</td><td>1</td><td>uv</td></tr>
<tr><td>LDS LES LFS LGS LSS</td><td>m</td><td>4 c)</td><td>np</td></tr>
<tr><td>ADD SUB AND OR XOR</td><td>r , r/i</td><td>1</td><td>uv</td></tr>
<tr><td>ADD SUB AND OR XOR</td><td>r , m</td><td>2</td><td>uv</td></tr>
<tr><td>ADD SUB AND OR XOR</td><td>m , r/i</td><td>3</td><td>uv</td></tr>
<tr><td>ADC SBB</td><td>r , r/i</td><td>1</td><td>u</td></tr>
<tr><td>ADC SBB</td><td>r , m</td><td>2</td><td>u</td></tr>
<tr><td>ADC SBB</td><td>m , r/i</td><td>3</td><td>u</td></tr>
<tr><td>CMP</td><td>r , r/i</td><td>1</td><td>uv</td></tr>
<tr><td>CMP</td><td>m , r/i</td><td>2</td><td>uv</td></tr>
<tr><td>TEST</td><td>r , r</td><td>1</td><td>uv</td></tr>
<tr><td>TEST</td><td>m , r</td><td>2</td><td>uv</td></tr>
<tr><td>TEST</td><td>r , i</td><td>1</td><td>f)</td></tr>
<tr><td>TEST</td><td>m , i</td><td>2</td><td>np</td></tr>
<tr><td>INC DEC</td><td>r</td><td>1</td><td>uv</td></tr>
<tr><td>INC DEC</td><td>m</td><td>3</td><td>uv</td></tr>
<tr><td>NEG NOT</td><td>r/m</td><td>1/3</td><td>np</td></tr>
<tr><td>MUL IMUL</td><td>r8/r16/m8/m16</td><td>11</td><td>np</td></tr>
<tr><td>MUL IMUL</td><td>all other versions</td><td>9 d)</td><td>np</td></tr>
<tr><td>DIV</td><td>r8/m8</td><td>17</td><td>np</td></tr>
<tr><td>DIV</td><td>r16/m16</td><td>25</td><td>np</td></tr>
<tr><td>DIV</td><td>r32/m32</td><td>41</td><td>np</td></tr>
<tr><td>IDIV</td><td>r8/m8</td><td>22</td><td>np</td></tr>
<tr><td>IDIV</td><td>r16/m16</td><td>30</td><td>np</td></tr>
<tr><td>IDIV</td><td>r32/m32</td><td>46</td><td>np</td></tr>
<tr><td>CBW CWDE</td><td> </td><td>3</td><td>np</td></tr>
<tr><td>CWD CDQ</td><td> </td><td>2</td><td>np</td></tr>
<tr><td>SHR SHL SAR SAL</td><td>r , i</td><td>1</td><td>u</td></tr>
<tr><td>SHR SHL SAR SAL</td><td>m , i</td><td>3</td><td>u</td></tr>
<tr><td>SHR SHL SAR SAL</td><td>r/m, CL</td><td>4/5</td><td>np</td></tr>
<tr><td>ROR ROL RCR RCL</td><td>r/m, 1</td><td>1/3</td><td>u</td></tr>
<tr><td>ROR ROL</td><td>r/m, i(&lt;&gt;1)</td><td>1/3</td><td>np</td></tr>
<tr><td>ROR ROL</td><td>r/m, CL</td><td>4/5</td><td>np</td></tr>
<tr><td>RCR RCL</td><td>r/m, i(&lt;&gt;1)</td><td>8/10</td><td>np</td></tr>
<tr><td>RCR RCL</td><td>r/m, CL</td><td>7/9</td><td>np</td></tr>
<tr><td>SHLD SHRD</td><td>r, i/CL</td><td>4 a)</td><td>np</td></tr>
<tr><td>SHLD SHRD</td><td>m, i/CL</td><td>5 a)</td><td>np</td></tr>
<tr><td>BT</td><td>r, r/i</td><td>4 a)</td><td>np</td></tr>
<tr><td>BT</td><td>m, i</td><td>4 a)</td><td>np</td></tr>
<tr><td>BT</td><td>m, r</td><td>9 a)</td><td>np</td></tr>
<tr><td>BTR BTS BTC</td><td>r, r/i</td><td>7 a)</td><td>np</td></tr>
<tr><td>BTR BTS BTC</td><td>m, i</td><td>8 a)</td><td>np</td></tr>
<tr><td>BTR BTS BTC</td><td>m, r</td><td>14 a)</td><td>np</td></tr>
<tr><td>BSF BSR</td><td>r , r/m</td><td>7-73 a)</td><td>np</td></tr>
<tr><td>SETcc</td><td>r/m</td><td>1/2 a)</td><td>np</td></tr>
<tr><td>JMP CALL</td><td>short/near</td><td>1 e)</td><td>v</td></tr>
<tr><td>JMP CALL</td><td>far</td><td>>= 3 e)</td><td>np</td></tr>
<tr><td>conditional jump</td><td>short/near</td><td>1/4/5/6 e)</td><td>v</td></tr>
<tr><td>CALL JMP</td><td>r/m</td><td>2/5 e)</td><td>np</td></tr>
<tr><td>RETN</td><td> </td><td>2/5 e)</td><td>np</td></tr>
<tr><td>RETN</td><td>i</td><td>3/6 e)</td><td>np</td></tr>
<tr><td>RETF</td><td> </td><td>4/7 e)</td><td>np</td></tr>
<tr><td>RETF</td><td>i</td><td>5/8 e)</td><td>np</td></tr>
<tr><td>J(E)CXZ</td><td>short</td><td>4-11 e)</td><td>np</td></tr>
<tr><td>LOOP</td><td>short</td><td>5-10 e)</td><td>np</td></tr>
<tr><td>BOUND</td><td>r , m</td><td>8</td><td>np</td></tr>
<tr><td>CLC STC CMC CLD STD</td><td> </td><td>2</td><td>np</td></tr>
<tr><td>CLI STI</td><td> </td><td>6-9</td><td>np</td></tr>
<tr><td>LODS</td><td> </td><td>2</td><td>np</td></tr>
<tr><td>REP LODS</td><td> </td><td>7+3*n g)</td><td>np</td></tr>
<tr><td>STOS</td><td> </td><td>3</td><td>np</td></tr>
<tr><td>REP STOS</td><td> </td><td>10+n g)</td><td>np</td></tr>
<tr><td>MOVS</td><td> </td><td>4</td><td>np</td></tr>
<tr><td>REP MOVS</td><td> </td><td>12+n g)</td><td>np</td></tr>
<tr><td>SCAS</td><td> </td><td>4</td><td>np</td></tr>
<tr><td>REP(N)E SCAS</td><td> </td><td>9+4*n g)</td><td>np</td></tr>
<tr><td>CMPS</td><td> </td><td>5</td><td>np</td></tr>
<tr><td>REP(N)E CMPS</td><td> </td><td>8+4*n g)</td><td>np</td></tr>
<tr><td>BSWAP</td><td> </td><td>1 a)</td><td>np</td></tr>
<tr><td>CPUID</td><td> </td><td>13-16 a)</td><td>np</td></tr>
<tr><td>RDTSC</td><td> </td><td>6-13 a) j)</td><td>np</td></tr>
</table>
<p>
<b>Notes:</b><br>
a) this instruction has a <kbd>0FH</kbd> prefix which takes one clock cycle extra to
decode on a PPlain unless preceded by a multicycle instruction (see
chapter <a href="#12">12</a>).<br>
b) versions with <kbd>FS</kbd> and <kbd>GS</kbd> have a <kbd>0FH</kbd>
prefix; see note a.<br>
c) versions with <kbd>SS, FS</kbd>, and <kbd>GS</kbd> have a <kbd>0FH</kbd> prefix; see note a.<br>
d) versions with two operands and no immediate have a <kbd>0FH</kbd> prefix; see note a.<br>
e) see chapter <a href="#22">22</a><br>
f) only pairable if register is accumulator; see chapter <a href="#26_14">26.14</a>.<br>
g) add one clock cycle for decoding the repeat prefix unless preceded by a
multicycle instruction (such as <kbd>CLD</kbd>; see chapter <a href="#12">12</a>).<br>
h) pairs as if it were writing to the accumulator; see chapter <a href="#26_14">26.14</a>.<br>
i) 9 if <kbd>SP</kbd> divisible by 4. <a href="#imperfectpush">See 10.2</a><br>
j) on PPlain: 6 in privileged or real mode, 11 in non-privileged mode, error in
virtual mode. On PMMX: 8 and 13 clocks respectively.<br>
<p>
<h3><a name="28_2">28.2 Floating point instructions</a></h3>
<p>
<b>Explanations:</b><br>
<u>Operands:</u><br>
r = register, m = memory, m32 = 32 bit memory operand, etc.
<p>
<u>Clock cycles:</u><br>
The numbers are minimum values. Cache misses, misalignment, denormal operands, and
exceptions may increase the clock counts considerably.
<p>
<u>Pairability:</u><br>
+ = pairable with <kbd>FXCH</kbd>, np = not pairable with <kbd>FXCH</kbd>.
<p>
<u>i-ov:</u><br>
Overlap with integer instructions. i-ov = 4 means that the last four clock cycles can overlap
with subsequent integer instructions.
<p>
<u>fp-ov:</u><br>
Overlap with floating point instructions. fp-ov = 2 means that the last two clock cycles can
overlap with subsequent floating point instructions.
(<kbd>WAIT</kbd> is considered a floating point instruction here)<p>

<table border=1 cellpadding=4 cellspacing=1>
<tr><td class="a3"> Instruction </td>
<td class="a3"> Operand </td>
<td class="a3"> Clock cycles </td>
<td class="a3"> Pairability </td>
<td class="a3"> i-ov </td>
<td class="a3"> fp-ov </td></tr>
<tr><td>FLD</td><td>r/m32/m64</td><td>1</td><td>+</td><td>0</td><td>0</td></tr>
<tr><td>FLD</td><td>m80</td><td>3</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FBLD</td><td>m80</td><td>48-58</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FST(P)</td><td>r</td><td>1</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FST(P)</td><td>m32/m64</td><td>2 m)</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FST(P)</td><td>m80</td><td>3 m)</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FBSTP</td><td>m80</td><td>148-154</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FILD</td><td>m</td><td>3</td><td>np</td><td>2</td><td>2</td></tr>
<tr><td>FIST(P)</td><td>m</td><td>6</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FLDZ FLD1</td><td> </td><td>2</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FLDPI FLDL2E etc.</td><td> </td><td>5 s)</td><td>np</td><td>2</td><td>2</td></tr>
<tr><td>FNSTSW</td><td>AX/m16</td><td>6 q)</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FLDCW</td><td>m16</td><td>8</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FNSTCW</td><td>m16</td><td>2</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FADD(P)</td><td>r/m</td><td>3</td><td>+</td><td>2</td><td>2</td></tr>
<tr><td>FSUB(R)(P)</td><td>r/m</td><td>3</td><td>+</td><td>2</td><td>2</td></tr>
<tr><td>FMUL(P)</td><td>r/m</td><td>3</td><td>+</td><td>2</td><td>2 n)</td></tr>
<tr><td>FDIV(R)(P)</td><td>r/m</td><td>19/33/39 p)</td><td>+</td><td>38 o)</td><td>2</td></tr>
<tr><td>FCHS FABS</td><td> </td><td>1</td><td>+</td><td>0</td><td>0</td></tr>
<tr><td>FCOM(P)(P) FUCOM</td><td>r/m</td><td>1</td><td>+</td><td>0</td><td>0</td></tr>
<tr><td>FIADD FISUB(R)</td><td>m</td><td>6</td><td>np</td><td>2</td><td>2</td></tr>
<tr><td>FIMUL</td><td>m</td><td>6</td><td>np</td><td>2</td><td>2</td></tr>
<tr><td>FIDIV(R)</td><td>m</td><td>22/36/42 p)</td><td>np</td><td>38 o)</td><td>2</td></tr>
<tr><td>FICOM</td><td>m</td><td>4</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FTST</td><td> </td><td>1</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FXAM</td><td> </td><td>17-21</td><td>np</td><td>4</td><td>0</td></tr>
<tr><td>FPREM</td><td> </td><td>16-64</td><td>np</td><td>2</td><td>2</td></tr>
<tr><td>FPREM1</td><td> </td><td>20-70</td><td>np</td><td>2</td><td>2</td></tr>
<tr><td>FRNDINT</td><td> </td><td>9-20</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FSCALE</td><td> </td><td>20-32</td><td>np</td><td>5</td><td>0</td></tr>
<tr><td>FXTRACT</td><td> </td><td>12-66</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FSQRT</td><td> </td><td>70</td><td>np</td><td>69 o)</td><td>2</td></tr>
<tr><td>FSIN FCOS</td><td> </td><td>65-100 r)</td><td>np</td><td>2</td><td>2</td></tr>
<tr><td>FSINCOS</td><td> </td><td>89-112 r)</td><td>np</td><td>2</td><td>2</td></tr>
<tr><td>F2XM1</td><td> </td><td>53-59 r)</td><td>np</td><td>2</td><td>2</td></tr>
<tr><td>FYL2X</td><td> </td><td>103 r)</td><td>np</td><td>2</td><td>2</td></tr>
<tr><td>FYL2XP1</td><td> </td><td>105 r)</td><td>np</td><td>2</td><td>2</td></tr>
<tr><td>FPTAN</td><td> </td><td>120-147 r)</td><td>np</td><td>36 o)</td><td>0</td></tr>
<tr><td>FPATAN</td><td> </td><td>112-134 r)</td><td>np</td><td>2</td><td>2</td></tr>
<tr><td>FNOP</td><td> </td><td>1</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FXCH</td><td>r</td><td>1</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FINCSTP FDECSTP</td><td> </td><td>2</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FFREE</td><td>r</td><td>2</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FNCLEX</td><td> </td><td>6-9</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FNINIT</td><td> </td><td>12-22</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FNSAVE</td><td>m</td><td>124-300</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>FRSTOR</td><td>m</td><td>70-95</td><td>np</td><td>0</td><td>0</td></tr>
<tr><td>WAIT</td><td> </td><td>1</td><td>np</td><td>0</td><td>0</td></tr>
</table>
<p>
<b>Notes:</b><br>
m) The value to store is needed one clock cycle in advance.<br>
n) 1 if the overlapping instruction is also an <kbd>FMUL</kbd>.<br>
o) Cannot overlap integer multiplication instructions.<br>
p) <kbd>FDIV</kbd> takes 19, 33, or 39 clock cycles for 24, 53, and 64 bit precision
respectively. <kbd>FIDIV</kbd> takes 3 clocks more. The precision is defined by bits
8-9 of the floating point control word.<br>
q) The first 4 clock cycles can overlap with preceding integer instructions.
See chapter <a href="#26_7">26.7</a>.<br>
r) clock counts are typical. Trivial cases may be faster, extreme cases may
be slower.<br>
s) may be up to 3 clocks more when output needed for <kbd>FST</kbd>,
<kbd>FCHS</kbd>, or <kbd>FABS</kbd>.
<p>
<h3><a name="28_3">28.3 MMX instructions (PMMX)</a></h3>
<p>
A list of MMX instruction timings is not needed because they all take one clock cycle,
except the MMX multiply instructions which take 3. MMX multiply instructions can be
overlapped and pipelined to yield a throughput of one multiplication per clock cycle.
<p>
The <kbd>EMMS</kbd> instruction takes only one clock cycle, but the first floating point instruction after
an <kbd>EMMS</kbd> takes approximately 58 clocks extra, and the first MMX instruction after a floating
point instruction takes approximately 38 clocks extra. There is no penalty for an MMX
instruction after <kbd>EMMS</kbd> on the PMMX (but a possible small penalty on the PII and PIII).
<p>
There is no penalty for using a memory operand in an MMX instruction because the MMX
arithmetic unit is one step later in the pipeline than the load unit. But the penalty comes
when you store data from an MMX register to memory or to a 32 bit register: The data have
to be ready one clock cycle in advance. This is analogous to the floating point store
instructions.
<p>
All MMX instructions except <kbd>EMMS</kbd> are pairable in either pipe. Pairing rules for MMX
instructions are described in chapter <a href="#10">10</a>.
<p>
<h2><a name="29">29</a>. List of instruction timings and micro-op breakdown for PPro, PII and PIII</h2>
<b>Explanations:</b><br>
<u>Operands:</u><br>
r = register, m = memory, i = immediate data, sr = segment register,
m32 = 32 bit memory operand, etc.
<p>
<u>Micro-ops:</u><br>
The number of micro-ops that the instruction generates for each execution port.<br>
p0: port 0: ALU, etc.<br>
p1: port 1: ALU, jumps<br>
p01: instructions that can go to either port 0 or 1, whichever is vacant first.<br>
p2: port 2: load data, etc.<br>
p3: port 3: address generation for store<br>
p4: port 4: store data
<p>
<u>Delay:</u><br>
This is the delay that the instruction generates in a dependency chain.
(This is not the same as the time spent in the execution unit. Values may be
inaccurate in situations where they cannot be measured exactly, especially with
memory operands).
The numbers are minimum values. Cache misses, misalignment, and exceptions
may increase the clock counts considerably. Floating point operands are
presumed to be normal numbers. Denormal numbers, NANs and infinity increase
the delays by 50-150 clocks, except in XMM move, shuffle and boolean instructions.
Floating point overflow, underflow, denormal or NAN results give a similar delay.
<p>
<u>Throughput:</u><br>
The maximum throughput for several instructions of the same kind. For example, a
throughput of 1/2 for <kbd>FMUL</kbd> means that a new <kbd>FMUL</kbd> instruction can start executing
every 2 clock cycles.<p>

<table border="1" cellpadding="4" cellspacing="1">
<tr>
<td colspan="10" class="a2"><a name="29_1">29.1 Integer instructions</a></td>
</tr>
<tr>
<td class="a3">Instruction</td>
<td class="a3">Operands</td>
<td colspan="6" class="a3">micro-ops</td>
<td class="a3">delay</td>
<td class="a3">throughput</td>
</tr>
<tr><td> </td><td> </td><td class="a3">p0</td><td class="a3">p1</td>
<td class="a3">p01</td><td class="a3">p2</td><td class="a3">p3</td><td class="a3">p4</td>
<td> </td><td> </td></tr>
<tr><td>NOP</td><td> </td><td> </td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>MOV</td><td>r,r/i</td><td> </td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>MOV</td><td>r,m</td><td> </td><td> </td><td> </td>
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>MOV</td><td>m,r/i</td><td> </td><td> </td><td> </td>
<td> </td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>MOV</td><td>r,sr</td><td> </td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>MOV</td><td>m,sr</td><td> </td><td> </td><td>1</td>
<td> </td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>MOV</td><td>sr,r</td><td colspan="3">8</td>
<td> </td><td> </td><td> </td><td>5</td><td> </td></tr>
<tr><td>MOV</td><td>sr,m</td><td colspan="3">7</td>
<td>1</td><td> </td><td> </td><td>8</td><td> </td></tr>
<tr><td>MOVSX MOVZX</td><td>r,r</td><td> </td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>MOVSX MOVZX</td><td>r,m</td><td> </td><td> </td><td> </td>
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>CMOVcc</td><td>r,r</td><td>1</td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>CMOVcc</td><td>r,m</td><td>1</td><td> </td><td>1</td>
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>XCHG</td><td>r,r</td><td> </td><td> </td><td>3</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>XCHG</td><td>r,m</td><td> </td><td> </td><td>4</td>
<td>1</td><td>1</td><td>1</td><td>high b)</td><td> </td></tr>
<tr><td>XLAT</td><td> </td><td> </td><td> </td><td>1</td>
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>PUSH</td><td>r/i</td><td> </td><td> </td><td>1</td>
<td> </td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>POP</td><td>r</td><td> </td><td> </td><td>1</td>
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>POP</td><td>(E)SP</td><td> </td><td> </td><td>2</td>
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>PUSH</td><td>m</td><td> </td><td> </td><td>1</td>
<td>1</td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>POP</td><td>m</td><td> </td><td> </td><td>5</td>
<td>1</td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>PUSH</td><td>sr</td><td> </td><td> </td><td>2</td>
<td> </td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>POP</td><td>sr</td><td> </td><td> </td><td>8</td>
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>PUSHF(D)</td><td> </td><td>3</td><td> </td><td>11</td>
<td> </td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>POPF(D)</td><td> </td><td>10</td><td> </td><td>6</td>
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>PUSHA(D)</td><td> </td><td> </td><td> </td><td>2</td>
<td> </td><td>8</td><td>8</td><td> </td><td> </td></tr>
<tr><td>POPA(D)</td><td> </td><td> </td><td> </td><td>2</td>
<td>8</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>LAHF SAHF</td><td> </td><td> </td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>LEA</td><td>r,m</td><td>1</td><td> </td><td> </td>
<td> </td><td> </td><td> </td><td>1 c)</td><td> </td></tr>
<tr><td>LDS LES LFS LGS LSS</td><td>m</td><td> </td><td> </td><td>8</td>
<td>3</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>ADD SUB AND OR XOR</td><td>r,r/i</td><td> </td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>ADD SUB AND OR XOR</td><td>r,m</td><td> </td><td> </td><td>1</td>
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>ADD SUB AND OR XOR</td><td>m,r/i</td><td> </td><td> </td><td>1</td>
<td>1</td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>ADC SBB</td><td>r,r/i</td><td> </td><td> </td><td>2</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>ADC SBB</td><td>r,m</td><td> </td><td> </td><td>2</td>
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>ADC SBB</td><td>m,r/i</td><td> </td><td> </td><td>3</td>
<td>1</td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>CMP TEST</td><td>r,r/i</td><td> </td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>CMP TEST</td><td>m,r/i</td><td> </td><td> </td><td>1</td>
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>INC DEC NEG NOT</td><td>r</td><td> </td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>INC DEC NEG NOT</td><td>m</td><td> </td><td> </td><td>1</td>
<td>1</td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>AAS DAA DAS</td><td> </td><td> </td><td>1</td><td> </td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>AAD</td><td> </td><td>1</td><td> </td><td>2</td>
<td> </td><td> </td><td> </td><td>4</td><td> </td></tr>
<tr><td>AAM</td><td> </td><td>1</td><td>1</td><td>2</td>
<td> </td><td> </td><td> </td><td>15</td><td> </td></tr>
<tr><td>MUL IMUL</td><td>r,(r),(i)</td><td>1</td><td> </td><td> </td>
<td> </td><td> </td><td> </td><td>4</td><td>1/1</td></tr>
<tr><td>MUL IMUL</td><td>(r),m</td><td>1</td><td> </td><td> </td>
<td>1</td><td> </td><td> </td><td>4</td><td>1/1</td></tr>
<tr><td>DIV IDIV</td><td>r8</td><td>2</td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td>19</td><td>1/12</td></tr>
<tr><td>DIV IDIV</td><td>r16</td><td>3</td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td>23</td><td>1/21</td></tr>
<tr><td>DIV IDIV</td><td>r32</td><td>3</td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td>39</td><td>1/37</td></tr>
<tr><td>DIV IDIV</td><td>m8</td><td>2</td><td> </td><td>1</td>
<td>1</td><td> </td><td> </td><td>19</td><td>1/12</td></tr>
<tr><td>DIV IDIV</td><td>m16</td><td>2</td><td> </td><td>1</td>
<td>1</td><td> </td><td> </td><td>23</td><td>1/21</td></tr>
<tr><td>DIV IDIV</td><td>m32</td><td>2</td><td> </td><td>1</td>
<td>1</td><td> </td><td> </td><td>39</td><td>1/37</td></tr>
<tr><td>CBW CWDE</td><td> </td><td> </td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>CWD CDQ</td><td> </td><td>1</td><td> </td><td> </td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>SHR SHL SAR ROR ROL</td><td>r,i/CL</td><td>1</td><td> </td><td> </td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>SHR SHL SAR ROR ROL</td><td>m,i/CL</td><td>1</td><td> </td><td> </td>
<td>1</td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>RCR RCL</td><td>r,1</td><td>1</td><td> </td><td>1</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>RCR RCL</td><td>r8,i/CL</td><td>4</td><td> </td><td>4</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>RCR RCL</td><td>r16/32,i/CL</td><td>3</td><td> </td><td>3</td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
<tr><td>RCR RCL</td><td>m,1</td><td>1</td><td> </td><td>2</td>
<td>1</td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>RCR RCL</td><td>m8,i/CL</td><td>4</td><td> </td><td>3</td>
<td>1</td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>RCR RCL</td><td>m16/32,i/CL</td><td>4</td><td> </td><td>2</td>
<td>1</td><td>1</td><td>1</td><td> </td><td> </td></tr>
<tr><td>SHLD SHRD</td><td>r,r,i/CL</td><td>2</td><td> </td><td> </td>
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>SHLD SHRD</td><td>m,r,i/CL</td><td>2</td><td> </td><td>1</td>
|
|
<td>1</td><td>1</td><td>1</td><td> </td><td> </td></tr>
|
|
<tr><td>BT</td><td>r,r/i</td><td> </td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>BT</td><td>m,r/i</td><td>1</td><td> </td><td>6</td>
|
|
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>BTR BTS BTC</td><td>r,r/i</td><td> </td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>BTR BTS BTC</td><td>m,r/i</td><td>1</td><td> </td><td>6</td>
|
|
<td>1</td><td>1</td><td>1</td><td> </td><td> </td></tr>
|
|
<tr><td>BSF BSR</td><td>r,r</td><td> </td><td>1</td><td>1</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>BSF BSR</td><td>r,m</td><td> </td><td>1</td><td>1</td>
|
|
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>SETcc</td><td>r</td><td> </td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>SETcc</td><td>m</td><td> </td><td> </td><td>1</td>
|
|
<td> </td><td>1</td><td>1</td><td> </td><td> </td></tr>
|
|
<tr><td>JMP</td><td>short/near</td><td> </td><td>1</td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td>1/2</td></tr>
|
|
<tr><td>JMP</td><td>far</td><td colspan="3">21</td>
|
|
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>JMP</td><td>r</td><td> </td><td>1</td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td>1/2</td></tr>
|
|
<tr><td>JMP</td><td>m(near)</td><td> </td><td>1</td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td> </td><td>1/2</td></tr>
|
|
<tr><td>JMP</td><td>m(far)</td><td colspan="3">21</td>
|
|
<td>2</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>conditional jump</td><td>short/near</td><td> </td><td>1</td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td>1/2</td></tr>
|
|
<tr><td>CALL</td><td>near</td><td> </td><td>1</td><td>1</td>
|
|
<td> </td><td>1</td><td>1</td><td> </td><td>1/2</td></tr>
|
|
<tr><td>CALL</td><td>far</td><td colspan="3">28</td>
|
|
<td>1</td><td>2</td><td>2</td><td> </td><td> </td></tr>
|
|
<tr><td>CALL</td><td>r</td><td> </td><td>1</td><td>2</td>
|
|
<td> </td><td>1</td><td>1</td><td> </td><td>1/2</td></tr>
|
|
<tr><td>CALL</td><td>m(near)</td><td> </td><td>1</td><td>4</td>
|
|
<td>1</td><td>1</td><td>1</td><td> </td><td>1/2</td></tr>
|
|
<tr><td>CALL</td><td>m (far)</td><td colspan="3">28</td>
|
|
<td>2</td><td>2</td><td>2</td><td> </td><td> </td></tr>
|
|
<tr><td>RETN</td><td> </td><td> </td><td>1</td><td>2</td>
|
|
<td>1</td><td> </td><td> </td><td> </td><td>1/2</td></tr>
|
|
<tr><td>RETN</td><td>i</td><td> </td><td>1</td><td>3</td>
|
|
<td>1</td><td> </td><td> </td><td> </td><td>1/2</td></tr>
|
|
<tr><td>RETF</td><td> </td><td colspan="3">23</td>
|
|
<td>3</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>RETF</td><td>i</td><td colspan="3">23</td><td>3</td>
|
|
<td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>J(E)CXZ</td><td>short</td><td> </td><td>1</td><td>1</td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>LOOP</td><td>short</td><td>2</td><td>1</td><td>8</td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>LOOP(N)E</td><td>short</td><td>2</td><td>1</td><td>8</td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>ENTER</td><td>i,0</td><td> </td><td> </td><td>12</td><td> </td>
|
|
<td>1</td><td>1</td><td> </td><td> </td></tr>
|
|
<tr><td>ENTER</td><td>a,b</td><td colspan="3">ca. 18+4b</td><td> </td>
|
|
<td>b-1</td><td>2b</td><td> </td><td> </td></tr>
|
|
<tr><td>LEAVE</td><td> </td><td> </td><td> </td><td>2</td>
|
|
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>BOUND</td><td>r,m</td><td>7</td><td> </td><td>6</td>
|
|
<td>2</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>CLC STC CMC</td><td> </td><td> </td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>CLD STD</td><td> </td><td> </td><td> </td><td>4</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>CLI</td><td> </td><td colspan="3">9</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>STI</td><td> </td><td colspan="3">17</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>INTO</td><td> </td><td> </td><td> </td><td>5</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>LODS</td><td> </td><td> </td><td> </td><td> </td>
|
|
<td>2</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>REP LODS</td><td> </td><td> </td><td> </td><td colspan="2">10+6n</td>
|
|
<td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>STOS</td><td> </td><td> </td><td> </td><td> </td>
|
|
<td>1</td><td>1</td><td>1</td><td> </td><td> </td></tr>
|
|
<tr><td>REP STOS</td><td> </td><td> </td><td> </td>
|
|
<td colspan="4">ca. 5n a)</td><td> </td><td> </td></tr>
|
|
<tr><td>MOVS</td><td> </td><td> </td><td> </td><td>1</td><td>3</td>
|
|
<td>1</td><td>1</td><td> </td><td> </td></tr>
|
|
<tr><td>REP MOVS</td><td> </td><td> </td><td> </td><td colspan="4">ca. 6n a)</td>
|
|
<td> </td><td> </td></tr>
|
|
<tr><td>SCAS</td><td> </td><td> </td><td> </td><td>1</td>
|
|
<td>2</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>REP(N)E SCAS</td><td> </td><td> </td><td> </td><td colspan="2">12+7n</td>
|
|
<td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>CMPS</td><td> </td><td> </td><td> </td><td>4</td>
|
|
<td>2</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>REP(N)E CMPS</td><td> </td><td> </td><td> </td>
|
|
<td colspan="2">12+9n</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>BSWAP</td><td> </td><td>1</td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>CPUID</td><td> </td><td colspan="3">23-48</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td></tr>
|
|
<tr><td>RDTSC</td><td> </td><td colspan="3">13</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>IN</td><td> </td><td colspan="3">18</td>
|
|
<td> </td><td> </td><td> </td><td>>300</td><td> </td></tr>
|
|
<tr><td>OUT</td><td> </td><td colspan="3">18</td>
|
|
<td> </td><td> </td><td> </td><td>>300</td><td> </td></tr>
|
|
<tr><td>PREFETCHNTA d)</td><td>m</td><td> </td>
|
|
<td> </td><td> </td><td> 1</td><td> </td><td>
|
|
</td><td> </td><td> </td></tr>
|
|
<tr><td>PREFETCHT0 d)</td><td>m</td><td> </td><td> </td><td> </td><td> 1</td><td> </td>
|
|
<td> </td><td> </td><td> </td></tr>
|
|
<tr><td>PREFETCHT1 d)</td><td>m</td><td> </td><td> </td>
|
|
<td> </td><td> 1</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>PREFETCHT2 d)</td><td>m</td><td> </td><td> </td><td> </td><td> 1</td><td> </td><td>
|
|
</td><td> </td><td> </td></tr>
|
|
<tr><td>SFENCE d)</td><td> </td><td> </td><td> </td><td> </td><td> </td><td> 1</td><td>
|
|
1</td><td> </td><td>1/6</td></tr>
|
|
</table><p>

<b>Notes:</b><br>
a) faster under certain conditions: see chapter <a href="#26_3">26.3</a>.<br>
b) see chapter <a href="#26_1">26.1</a>.<br>
c) 3 if constant without base or index register.<br>
d) PIII only.
<p>

<table border="1" cellpadding="4" cellspacing="1">
|
|
<tr>
|
|
<td colspan="10" class="a2"><a name="29_2">29.2 Floating point instructions</a></td>
|
|
</tr>
|
|
<tr>
|
|
<td class="a3">Instruction</td>
|
|
<td class="a3">Operands</td>
|
|
<td colspan="6" align="center" class="a3">micro-ops</td>
|
|
<td class="a3">delay</td>
|
|
<td class="a3">throughput</td>
|
|
</tr>
|
|
<tr>
|
|
<td> </td>
|
|
<td> </td>
|
|
<td class="a4">p0</td>
|
|
<td class="a4">p1</td>
|
|
<td class="a4">p01</td>
|
|
<td class="a4">p2</td>
|
|
<td class="a4">p3</td>
|
|
<td class="a4">p4</td>
|
|
<td> </td>
|
|
<td> </td>
|
|
</tr>
|
|
<tr><td>FLD</td><td>r</td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FLD</td><td>m32/64</td><td> </td><td> </td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td>1</td><td> </td></tr>
|
|
<tr><td>FLD</td><td>m80</td><td>2</td><td> </td><td> </td>
|
|
<td>2</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FBLD</td><td>m80</td><td>38</td><td> </td><td> </td>
|
|
<td>2</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FST(P)</td><td>r</td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FST(P)</td><td>m32/m64</td><td> </td><td> </td><td> </td>
|
|
<td> </td><td>1</td><td>1</td><td>1</td><td> </td></tr>
|
|
<tr><td>FSTP</td><td>m80</td><td>2</td><td> </td><td> </td>
|
|
<td> </td><td>2</td><td>2</td><td> </td><td> </td></tr>
|
|
<tr><td>FBSTP</td><td>m80</td><td>165</td><td> </td><td> </td>
|
|
<td> </td><td>2</td><td>2</td><td> </td><td> </td></tr>
|
|
<tr><td>FXCH</td><td>r</td><td> </td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>0</td><td>3/1 f)</td></tr>
|
|
<tr><td>FILD</td><td>m</td><td>3</td><td> </td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td>5</td><td> </td></tr>
|
|
<tr><td>FIST(P)</td><td>m</td><td>2</td><td> </td><td> </td>
|
|
<td> </td><td>1</td><td>1</td><td>5</td><td> </td></tr>
|
|
<tr><td>FLDZ</td><td> </td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td colspan="2">FLD1 FLDPI FLDL2E etc.</td><td>2</td><td> </td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FCMOVcc</td><td>r</td><td>2</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>2</td><td> </td></tr>
|
|
<tr><td>FNSTSW</td><td>AX</td><td>3</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>7</td><td> </td></tr>
|
|
<tr><td>FNSTSW</td><td>m16</td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td>1</td><td>1</td><td> </td><td> </td></tr>
|
|
<tr><td>FLDCW</td><td>m16</td><td>1</td><td> </td><td>1</td>
|
|
<td>1</td><td> </td><td> </td><td>10</td><td> </td></tr>
|
|
<tr><td>FNSTCW</td><td>m16</td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td>1</td><td>1</td><td> </td><td> </td></tr>
|
|
<tr><td>FADD(P) FSUB(R)(P)</td><td>r</td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>3</td><td>1/1</td></tr>
|
|
<tr><td>FADD(P) FSUB(R)(P)</td><td>m</td><td>1</td><td> </td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td>3-4</td><td>1/1</td></tr>
|
|
<tr><td>FMUL(P)</td><td>r</td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>5</td><td>1/2 g)</td></tr>
|
|
<tr><td>FMUL(P)</td><td>m</td><td>1</td><td> </td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td>5-6</td><td>1/2 g)</td></tr>
|
|
<tr><td>FDIV(R)(P)</td><td>r</td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>38 h)</td><td>1/37</td></tr>
|
|
<tr><td>FDIV(R)(P)</td><td>m</td><td>1</td><td> </td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td>38 h)</td><td>1/37</td></tr>
|
|
<tr><td>FABS</td><td> </td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FCHS</td><td> </td><td>3</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>2</td><td> </td></tr>
|
|
<tr><td>FCOM(P) FUCOM</td><td>r</td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>1</td><td> </td></tr>
|
|
<tr><td>FCOM(P) FUCOM</td><td>m</td><td>1</td><td> </td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td>1</td><td> </td></tr>
|
|
<tr><td>FCOMPP FUCOMPP</td><td> </td><td>1</td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td> </td><td>1</td><td> </td></tr>
|
|
<tr><td>FCOMI(P) FUCOMI(P)</td><td>r</td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>1</td><td> </td></tr>
|
|
<tr><td>FCOMI(P) FUCOMI(P)</td><td>m</td><td>1</td><td> </td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td>1</td><td> </td></tr>
|
|
<tr><td>FIADD FISUB(R)</td><td>m</td><td>6</td><td> </td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FIMUL</td><td>m</td><td>6</td><td> </td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FIDIV(R)</td><td>m</td><td>6</td><td> </td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FICOM(P)</td><td>m</td><td>6</td><td> </td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FTST</td><td> </td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>1</td><td> </td></tr>
|
|
<tr><td>FXAM</td><td> </td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>2</td><td> </td></tr>
|
|
<tr><td>FPREM</td><td> </td><td>23</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FPREM1</td><td> </td><td>33</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FRNDINT</td><td> </td><td>30</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FSCALE</td><td> </td><td>56</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FXTRACT</td><td> </td><td>15</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FSQRT</td><td> </td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>69</td><td>e,i)</td></tr>
|
|
<tr><td>FSIN FCOS</td><td> </td><td colspan="3">17-97</td><td> </td><td> </td>
|
|
<td> </td><td>27-103</td><td>e)</td></tr>
|
|
<tr><td>FSINCOS</td><td> </td><td colspan="3">18-110</td><td> </td><td> </td>
|
|
<td> </td><td>29-130</td><td>e)</td></tr>
|
|
<tr><td>F2XM1</td><td> </td><td colspan="3">17-48</td><td> </td><td> </td>
|
|
<td> </td><td>66</td><td>e)</td></tr>
|
|
<tr><td>FYL2X</td><td> </td><td colspan="3">36-54</td><td> </td>
|
|
<td> </td><td> </td><td>103</td><td>e)</td></tr>
|
|
<tr><td>FYL2XP1</td><td> </td><td colspan="3">31-53</td><td> </td>
|
|
<td> </td><td> </td><td>98-107</td><td>e)</td></tr>
|
|
<tr><td>FPTAN</td><td> </td><td colspan="3">21-102</td><td> </td>
|
|
<td> </td><td> </td><td>13-143</td><td>e)</td></tr>
|
|
<tr><td>FPATAN</td><td> </td><td colspan="3">25-86</td><td> </td>
|
|
<td> </td><td> </td><td>44-143</td><td>e)</td></tr>
|
|
<tr><td>FNOP</td><td> </td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FINCSTP FDECSTP</td><td> </td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FFREE</td><td>r</td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FFREEP</td><td>r</td><td>2</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FNCLEX</td><td> </td><td> </td><td> </td><td>3</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FNINIT</td><td> </td><td colspan="3">13</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FNSAVE</td><td> </td><td colspan="3">141</td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>FRSTOR</td><td> </td><td colspan="3">72</td><td> </td>
|
|
<td> </td><td> </td><td> </td><td> </td></tr>
|
|
<tr><td>WAIT</td><td> </td><td> </td><td> </td><td>2</td>
|
|
<td> </td><td> </td><td> </td><td> </td><td> </td></tr>
|
|
</table>

<b>Notes:</b><br>
e) not pipelined<br>
f) <kbd>FXCH</kbd> generates 1 micro-op that is resolved by register renaming without
going to any port.<br>
g) <kbd>FMUL</kbd> uses the same circuitry as integer multiplication. Therefore, the
combined throughput of mixed floating point and integer multiplications is
1 <kbd>FMUL</kbd> + 1 <kbd>IMUL</kbd> per 3 clock cycles.<br>
h) <kbd>FDIV</kbd> delay depends on the precision specified in the control word:
64-bit precision gives delay 38, 53-bit precision gives delay 32, and 24-bit
precision gives delay 18. Division by a power of 2 takes 9 clocks.
Throughput is 1/(delay-1).<br>
i) faster for lower precision.
<p>

<table border=1 cellpadding=4 cellspacing=1><tr>
|
|
<td colspan="10" class="a2"><a name="29_3">29.3 MMX instructions (PII and PIII)</a></td></tr>
|
|
<tr><td class="a3">Instruction</td>
|
|
<td class="a3">Operands</td>
|
|
<td colspan="6" align="center" class="a3">micro-ops</td>
|
|
<td class="a3">delay</td>
|
|
<td class="a3">throughput</td></tr>
|
|
<tr><td> </td><td> </td>
|
|
<td class="a4">p0</td><td class="a4">p1</td><td class="a4">p01</td>
|
|
<td class="a4">p2</td><td class="a4">p3</td><td class="a4">p4</td>
|
|
<td> </td><td> </td></tr>
|
|
<tr><td>MOVD MOVQ</td><td>r,r</td>
|
|
<td> </td><td> </td><td>1</td><td> </td><td>
|
|
</td><td> </td><td> </td><td>2/1</td></tr>
|
|
<tr><td>MOVD MOVQ</td><td>r64,m32/64</td>
|
|
<td> </td><td> </td><td> </td><td>1</td><td>
|
|
</td><td> </td><td> </td><td>1/1</td></tr>
|
|
<tr><td>MOVD MOVQ</td><td>m32/64,r64</td><td>
|
|
</td><td> </td><td> </td><td> </td><td>1</td><td>
|
|
1</td><td> </td><td>1/1</td></tr>
|
|
<tr><td>PADD PSUB PCMP</td><td>r64,r64</td>
|
|
<td> </td><td> </td><td>1</td><td> </td><td>
|
|
</td><td> </td><td> </td><td>1/1</td></tr>
|
|
<tr><td>PADD PSUB PCMP</td><td>r64,m64</td>
|
|
<td> </td><td> </td><td>1</td><td>1</td><td>
|
|
</td><td> </td><td> </td><td>1/1</td></tr>
|
|
<tr><td>PMUL PMADD</td><td>r64,r64</td>
|
|
<td>1</td><td> </td><td> </td><td> </td><td>
|
|
</td><td> </td><td>3</td><td>1/1</td></tr>
|
|
<tr><td>PMUL PMADD</td><td>r64,m64</td>
|
|
<td>1</td><td> </td><td> </td><td>1</td><td>
|
|
</td><td> </td><td>3</td><td>1/1</td></tr>
|
|
<tr><td>PAND PANDN POR <br>PXOR</td>
|
|
<td>r64,r64</td><td> </td><td> </td><td>1</td><td>
|
|
</td><td> </td><td> </td><td> </td><td>2/1</td></tr>
|
|
<tr><td>PAND PANDN POR<br>PXOR</td><td>r64,m64</td>
|
|
<td> </td><td> </td><td>1</td><td>1</td><td>
|
|
</td><td> </td><td> </td><td>1/1</td></tr>
|
|
<tr><td>PSRA PSRL PSLL</td><td>r64,r64/i</td>
|
|
<td> </td><td>1</td><td> </td><td> </td><td>
|
|
</td><td> </td><td> </td><td>1/1</td></tr>
|
|
<tr><td>PSRA PSRL PSLL</td><td>r64,m64</td>
|
|
<td> </td><td>1</td><td> </td><td>1</td><td>
|
|
</td><td> </td><td> </td><td>1/1</td></tr>
|
|
<tr><td>PACK PUNPCK</td><td>r64,r64</td>
|
|
<td> </td><td>1</td><td> </td><td> </td><td>
|
|
</td><td> </td><td> </td><td>1/1</td></tr>
|
|
<tr><td>PACK PUNPCK</td><td>r64,m64</td>
|
|
<td> </td><td>1</td><td> </td><td>1</td><td>
|
|
</td><td> </td><td> </td><td>1/1</td></tr>
|
|
<tr><td>EMMS</td><td> </td><td colspan="3">11</td><td> </td><td>
|
|
</td><td> </td><td>6 k)</td><td> </td></tr>
|
|
<tr><td>MASKMOVQ d)</td><td>r64,r64</td><td> </td><td> </td><td> 1</td><td> </td>
|
|
<td> 1</td><td> 1</td><td>2-8</td><td>1/30-1/2</td></tr>
|
|
<tr><td>PMOVMSKB d)</td><td>r32,r64</td><td> </td><td> 1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> 1</td><td> 1/1</td></tr>
|
|
<tr><td>MOVNTQ d)</td><td>m64,r64</td><td> </td><td> </td><td> </td><td> </td>
|
|
<td> 1</td><td> 1</td><td> </td><td>1/30-1/1</td></tr>
|
|
<tr><td>PSHUFW d)</td><td>r64,r64,i</td><td> </td><td> 1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> 1</td><td> 1/1</td></tr>
|
|
<tr><td>PSHUFW d)</td><td>r64,m64,i</td><td> </td><td> 1</td><td> </td><td> 1</td>
|
|
<td> </td><td> </td><td> 2</td><td> 1/1</td></tr>
|
|
<tr><td>PEXTRW d)</td><td>r32,r64,i</td><td> </td><td> 1</td><td> 1</td><td> </td>
|
|
<td> </td><td> </td><td> 2</td><td> 1/1</td></tr>
|
|
<tr><td>PINSRW d)</td><td>r64,r32,i</td><td> </td><td> 1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> 1</td><td> 1/1</td></tr>
|
|
<tr><td>PINSRW d)</td><td>r64,m16,i</td><td> </td><td> 1</td><td> </td><td> 1</td>
|
|
<td> </td><td> </td><td> 2</td><td> 1/1</td></tr>
|
|
<tr><td>PAVGB PAVGW d)</td><td>r64,r64</td><td> </td><td> </td><td> 1</td><td> </td>
|
|
<td> </td><td> </td><td> 1</td><td> 2/1</td></tr>
|
|
<tr><td>PAVGB PAVGW d)</td><td>r64,m64</td><td> </td><td> </td><td> 1</td><td> 1</td>
|
|
<td> </td><td> </td><td> 2</td><td> 1/1</td></tr>
|
|
<tr><td>PMINUB PMAXUB PMINSW PMAXSW d)</td><td>r64,r64</td><td> </td><td> </td><td> 1</td><td> </td>
|
|
<td> </td><td> </td><td> 1</td><td> 2/1</td></tr>
|
|
<tr><td>PMINUB PMAXUB PMINSW PMAXSW d)</td><td>r64,m64</td><td> </td><td> </td><td> 1</td><td> 1</td>
|
|
<td> </td><td> </td><td> 2</td><td> 1/1</td></tr>
|
|
<tr><td>PMULHUW d)</td><td>r64,r64</td><td> 1</td><td> </td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> 3</td><td> 1/1</td></tr>
|
|
<tr><td>PMULHUW d)</td><td>r64,m64</td><td> 1</td><td> </td><td> </td><td> 1</td>
|
|
<td> </td><td> </td><td> 4</td><td> 1/1</td></tr>
|
|
<tr><td>PSADBW d)</td><td>r64,r64</td><td> 2</td><td> </td><td> 1</td><td> </td>
|
|
<td> </td><td> </td><td> 5</td><td> 1/2</td></tr>
|
|
<tr><td>PSADBW d)</td><td>r64,m64</td><td> 2</td><td> </td><td> 1</td><td> 1</td>
|
|
<td> </td><td> </td><td> 6</td><td> 1/2</td></tr>
|
|
</table>

<b>Notes:</b><br>
d) PIII only.<br>
k) you may hide the delay by inserting other instructions between <kbd>EMMS</kbd> and any
subsequent floating point instruction.
<p>

<table border="1" cellpadding="4" cellspacing="1"><tr>
|
|
<td colspan="10" class="a2"><a name="29_4">29.4 XMM instructions (PIII)</a></td></tr>
|
|
<tr><td class="a3">Instruction</td>
|
|
<td class="a3">Operands</td>
|
|
<td colspan="6" align="center" class="a3">micro-ops</td>
|
|
<td class="a3">delay</td>
|
|
<td class="a3">throughput</td></tr>
|
|
<tr><td> </td><td> </td>
|
|
<td class="a4"> p0 </td><td class="a4"> p1 </td>
|
|
<td class="a4"> p01 </td><td class="a4"> p2 </td>
|
|
<td class="a4"> p3 </td><td class="a4"> p4 </td>
|
|
<td> </td><td> </td></tr>
|
|
<tr><td>MOVAPS</td><td>r128,r128</td><td> </td><td> </td><td>2</td><td> </td><td>
|
|
</td><td> </td><td>1</td><td>1/1</td></tr>
|
|
<tr><td>MOVAPS</td><td>r128,m128</td>
|
|
<td> </td><td> </td><td> </td><td>2</td><td>
|
|
</td><td> </td><td>2</td><td>1/2</td></tr>
|
|
<tr><td>MOVAPS</td><td>m128,r128</td><td> </td><td> </td><td> </td><td> </td>
|
|
<td>2</td><td>2</td><td>3</td><td>1/2</td></tr>
|
|
<tr><td>MOVUPS</td><td>r128,m128</td><td> </td><td> </td><td> </td><td>4</td>
|
|
<td> </td><td> </td><td>2</td><td>1/4</td></tr>
|
|
<tr><td>MOVUPS</td><td>m128,r128</td><td> </td><td>1</td><td> </td><td> </td><td>4</td>
|
|
<td>4</td><td>3</td><td>1/4</td></tr>
|
|
<tr><td>MOVSS</td><td>r128,r128</td><td> </td><td> </td><td>1</td><td> </td>
|
|
<td> </td><td> </td><td>1</td><td>1/1</td></tr>
|
|
<tr><td>MOVSS</td><td>r128,m32</td><td> </td><td> </td><td>1</td><td>1</td>
|
|
<td> </td><td> </td><td>1</td><td>1/1</td></tr>
|
|
<tr><td>MOVSS</td><td>m32,r128</td><td> </td><td> </td><td> </td><td> </td><td>1</td>
|
|
<td>1</td><td>1</td><td>1/1</td></tr>
|
|
<tr><td>MOVHPS MOVLPS</td><td>r128,m64</td><td> </td><td> </td><td>1</td><td> </td>
|
|
<td> </td><td> </td><td>1</td><td>1/1</td></tr>
|
|
<tr><td>MOVHPS MOVLPS</td><td>m64,r128</td><td> </td><td> </td><td> </td><td> </td>
|
|
<td>1</td><td>1</td><td>1</td><td>1/1</td></tr>
|
|
<tr><td>MOVLHPS MOVHLPS</td><td>r128,r128</td><td> </td><td> </td><td>1</td><td> </td>
|
|
<td> </td><td> </td><td>1</td><td>1/1</td></tr>
|
|
<tr><td>MOVMSKPS</td><td>r32,r128</td><td>1</td><td> </td><td> </td><td> </td>
|
|
<td> </td><td> </td><td>1</td><td>1/1</td></tr>
|
|
<tr><td>MOVNTPS</td><td>m128,r128</td><td> </td><td> </td><td> </td>
|
|
<td> </td><td> 2</td><td> 2</td><td> </td><td>1/15-1/2</td></tr>
|
|
<tr><td>CVTPI2PS</td><td>r128,r64</td><td> </td><td>2</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td>3</td><td>1/1</td></tr>
|
|
<tr><td>CVTPI2PS</td><td>r128,m64</td><td> </td><td>2</td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td>4</td><td>1/2</td></tr>
|
|
<tr><td>CVTPS2PI CVTTPS2PI</td><td>r64,r128</td><td> </td><td>2</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td>3</td><td>1/1</td></tr>
|
|
<tr><td>CVTPS2PI</td><td>r64,m128</td><td> </td><td>1</td><td> </td><td>2</td>
|
|
<td> </td><td> </td><td>4</td><td>1/1</td></tr>
|
|
<tr><td>CVTSI2SS</td><td>r128,r32</td><td> </td><td>2</td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td>4</td><td>1/2</td></tr>
|
|
<tr><td>CVTSI2SS</td><td>r128,m32</td><td> </td><td>2</td><td> </td><td>2</td>
|
|
<td> </td><td> </td><td>5</td><td>1/2</td></tr>
|
|
<tr><td>CVTSS2SI CVTTSS2SI</td><td>r32,r128</td><td> </td><td>1</td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td>3</td><td>1/1</td></tr>
|
|
<tr><td>CVTSS2SI</td><td>r32,m128</td><td> </td><td>1</td><td> </td>
|
|
<td>2</td><td> </td><td> </td><td>4</td><td>1/2</td></tr>
|
|
<tr><td>ADDPS SUBPS</td><td>r128,r128</td><td> </td><td>2</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td>3</td><td>1/2</td></tr>
|
|
<tr><td>ADDPS SUBPS</td><td>r128,m128</td><td> </td><td>2</td><td> </td><td>2</td>
|
|
<td> </td><td> </td><td>3</td><td>1/2</td></tr>
|
|
<tr><td>ADDSS SUBSS</td><td>r128,r128</td><td> </td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td>3</td><td>1/1</td></tr>
|
|
<tr><td>ADDSS SUBSS</td><td>r128,m32</td><td> </td><td>1</td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td>3</td><td>1/1</td></tr>
|
|
<tr><td>MULPS</td><td>r128,r128</td><td>2</td><td> </td><td> </td><td> </td>
|
|
<td> </td><td> </td><td>4</td><td>1/2</td></tr>
|
|
<tr><td>MULPS</td><td>r128,m128</td><td>2</td><td> </td><td> </td><td>2</td>
|
|
<td> </td><td> </td><td>4</td><td>1/2</td></tr>
|
|
<tr><td>MULSS</td><td>r128,r128</td><td>1</td><td> </td><td> </td><td> </td>
|
|
<td> </td><td> </td><td>4</td><td>1/1</td></tr>
|
|
<tr><td>MULSS</td><td>r128,m32</td><td>1</td><td> </td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td>4</td><td>1/1</td></tr>
|
|
<tr><td>DIVPS</td><td>r128,r128</td><td>2</td><td> </td><td> </td><td> </td>
|
|
<td> </td><td> </td><td>48</td><td>1/34</td></tr>
|
|
<tr><td>DIVPS</td><td>r128,m128</td><td>2</td><td> </td><td> </td><td>2</td>
|
|
<td> </td><td> </td><td>48</td><td>1/34</td></tr>
|
|
<tr><td>DIVSS</td><td>r128,r128</td><td>1</td><td> </td><td> </td><td> </td>
|
|
<td> </td><td> </td><td>18</td><td>1/17</td></tr>
|
|
<tr><td>DIVSS</td><td>r128,m32</td><td>1</td><td> </td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td>18</td><td>1/17</td></tr>
|
|
<tr><td>ANDPS ANDNPS ORPS XORPS</td><td>r128,r128</td><td> </td><td>2</td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>2</td><td>1/2</td></tr>
|
|
<tr><td>ANDPS ANDNPS ORPS XORPS</td><td>r128,m128</td><td> </td><td>2</td><td> </td>
|
|
<td>2</td><td> </td><td> </td><td>2</td><td>1/2</td></tr>
|
|
<tr><td>MAXPS MINPS</td><td>r128,r128</td><td> </td><td>2</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td>3</td><td>1/2</td> </tr>
|
|
<tr><td>MAXPS MINPS</td><td>r128,m128</td><td> </td><td>2</td><td> </td><td>2</td>
|
|
<td> </td><td> </td><td>3</td><td>1/2</td></tr>
|
|
<tr><td>MAXSS MINSS</td><td>r128,r128</td><td> </td><td>1</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td>3</td><td>1/1</td></tr>
|
|
<tr><td>MAXSS MINSS</td><td>r128,m32</td><td> </td><td>1</td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td>3</td><td>1/1</td></tr>
|
|
<tr><td>CMPccPS</td><td>r128,r128</td><td> </td><td>2</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td>3</td><td>1/2</td></tr>
|
|
<tr><td>CMPccPS</td><td>r128,m128</td><td> </td><td>2</td><td> </td><td>2</td>
|
|
<td> </td><td> </td><td>3</td><td>1/2</td></tr>
|
|
<tr><td>CMPccSS</td><td>r128,r128</td><td> </td><td>1</td><td> </td><td>1</td>
|
|
<td> </td><td> </td><td>3</td><td>1/1</td></tr>
|
|
<tr><td>CMPccSS</td><td>r128,m32</td><td> </td><td>1</td><td> </td>
|
|
<td>1</td><td> </td><td> </td><td>3</td><td>1/1</td></tr>
|
|
<tr><td>COMISS UCOMISS</td><td>r128,r128</td><td> </td><td>1</td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>1</td><td>1/1</td></tr>
|
|
<tr><td>COMISS UCOMISS</td><td>r128,m32</td><td> </td><td>1</td>
|
|
<td> </td><td>1</td><td> </td><td> </td><td>1</td><td>1/1</td></tr>
|
|
<tr><td>SQRTPS</td><td>r128,r128</td><td>2</td><td> </td><td> </td>
|
|
<td> </td><td> </td><td> </td><td>56</td><td>1/56</td></tr>
|
|
<tr><td>SQRTPS</td><td>r128,m128</td><td>2</td><td> </td><td> </td><td>2</td><td> </td><td> </td><td>57</td><td>1/56</td></tr>
<tr><td>SQRTSS</td><td>r128,r128</td><td>2</td><td> </td><td> </td><td> </td><td> </td><td> </td><td>30</td><td>1/28</td></tr>
<tr><td>SQRTSS</td><td>r128,m32</td><td>2</td><td> </td><td> </td><td>1</td><td> </td><td> </td><td>31</td><td>1/28</td></tr>
<tr><td>RSQRTPS</td><td>r128,r128</td><td>2</td><td> </td><td> </td><td> </td><td> </td><td> </td><td>2</td><td>1/2</td></tr>
<tr><td>RSQRTPS</td><td>r128,m128</td><td>2</td><td> </td><td> </td><td>2</td><td> </td><td> </td><td>3</td><td>1/2</td></tr>
<tr><td>RSQRTSS</td><td>r128,r128</td><td>1</td><td> </td><td> </td><td> </td><td> </td><td> </td><td>1</td><td>1/1</td></tr>
<tr><td>RSQRTSS</td><td>r128,m32</td><td>1</td><td> </td><td> </td><td>1</td><td> </td><td> </td><td>2</td><td>1/1</td></tr>
<tr><td>RCPPS</td><td>r128,r128</td><td>2</td><td> </td><td> </td><td> </td><td> </td><td> </td><td>2</td><td>1/2</td></tr>
<tr><td>RCPPS</td><td>r128,m128</td><td>2</td><td> </td><td> </td><td>2</td><td> </td><td> </td><td>3</td><td>1/2</td></tr>
<tr><td>RCPSS</td><td>r128,r128</td><td>1</td><td> </td><td> </td><td> </td><td> </td><td> </td><td>1</td><td>1/1</td></tr>
<tr><td>RCPSS</td><td>r128,m32</td><td>1</td><td> </td><td> </td><td>1</td><td> </td><td> </td><td>2</td><td>1/1</td></tr>
<tr><td>SHUFPS</td><td>r128,r128,i</td><td> </td><td>2</td><td>1</td><td> </td><td> </td><td> </td><td>2</td><td>1/2</td></tr>
<tr><td>SHUFPS</td><td>r128,m128,i</td><td> </td><td>2</td><td> </td><td>2</td><td> </td><td> </td><td>2</td><td>1/2</td></tr>
<tr><td>UNPCKHPS UNPCKLPS</td><td>r128,r128</td><td> </td><td>2</td><td>2</td><td> </td><td> </td><td> </td><td>3</td><td>1/2</td></tr>
<tr><td>UNPCKHPS UNPCKLPS</td><td>r128,m128</td><td> </td><td>2</td><td> </td><td>2</td><td> </td><td> </td><td>3</td><td>1/2</td></tr>
<tr><td>LDMXCSR</td><td>m32</td><td colspan="3">11</td><td> </td><td> </td><td> </td><td>15</td><td>1/15</td></tr>
<tr><td>STMXCSR</td><td>m32</td><td colspan="3">6</td><td> </td><td> </td><td> </td><td>7</td><td>1/9</td></tr>
<tr><td>FXSAVE</td><td>m4096</td><td colspan="3">116</td><td> </td><td> </td><td> </td><td>62</td><td> </td></tr>
<tr><td>FXRSTOR</td><td>m4096</td><td colspan="3">89</td><td> </td><td> </td><td> </td><td>68</td><td> </td></tr>
</table>
<p>
<h2><a name="30">30</a>. Testing speed</h2>
The Pentium family of processors has an internal 64 bit clock counter which can be read
into <kbd>EDX:EAX</kbd> using the instruction <kbd>RDTSC</kbd>
(read time stamp counter). This is very useful for
testing exactly how many clock cycles a piece of code takes.
<p>
The program below measures the number of clock cycles a piece of code
takes. It executes the code to test 10 times and stores the 10 clock counts.
The program can be used in both 16 and 32 bit mode on the PPlain and PMMX:
<pre>;************ Test program for PPlain and PMMX: ********************
|
|
|
|
ITER EQU 10 ; number of iterations
|
|
OVERHEAD EQU 15 ; 15 for PPlain, 17 for PMMX
|
|
|
|
RDTSC MACRO ; define RDTSC instruction
|
|
DB 0FH,31H
|
|
ENDM
|
|
;************ Data segment: ********************
|
|
.DATA ; data segment
|
|
ALIGN 4
|
|
COUNTER DD 0 ; loop counter
|
|
TICS DD 0 ; temporary storage of clock
|
|
RESULTLIST DD ITER DUP (0) ; list of test results
|
|
;************ Code segment: ********************
|
|
.CODE ; code segment
|
|
BEGIN: MOV [COUNTER],0 ; reset loop counter
|
|
TESTLOOP: ; test loop
|
|
;************ Do any initializations here: ********************
|
|
FINIT
|
|
;************ End of initializations ********************
|
|
RDTSC ; read clock counter
|
|
MOV [TICS],EAX ; save count
|
|
CLD ; non-pairable filler
|
|
REPT 8
|
|
NOP ; eight NOP's to avoid shadowing effect
|
|
ENDM
|
|
|
|
;************ Put instructions to test here: ********************
|
|
FLDPI ; this is only an example
|
|
FSQRT
|
|
RCR EBX,10
|
|
FSTP ST
|
|
;***************** End of instructions to test ********************
|
|
|
|
CLC ; non-pairable filler with shadow
|
|
RDTSC ; read counter again
|
|
SUB EAX,[TICS] ; compute difference
|
|
SUB EAX,OVERHEAD ; subtract clocks used by fillers etc.
|
|
MOV EDX,[COUNTER] ; loop counter
|
|
MOV [RESULTLIST][EDX],EAX ; store result in table
|
|
ADD EDX,TYPE RESULTLIST ; increment counter
|
|
MOV [COUNTER],EDX ; store counter
|
|
CMP EDX,ITER * (TYPE RESULTLIST)
|
|
JB TESTLOOP ; repeat ITER times
|
|
|
|
; insert here code to read out the values in RESULTLIST</pre>
|
|
<p>
The 'filler' instructions before and after the piece of code to test are
included in order to get consistent results on the PPlain. The <kbd>CLD</kbd>
is a non-pairable instruction which has been inserted to make sure that the pairing
is the same the first time as the subsequent times. The
eight <kbd>NOP</kbd> instructions are inserted to prevent any prefixes in the
code to test from being decoded in the shadow of the preceding instructions on
the PPlain. Single byte instructions are used here to obtain the same pairing
the first time as the subsequent times. The <kbd>CLC</kbd> after the
code to test is a non-pairable instruction which has a shadow under which
the <kbd>0FH</kbd> prefix of the <kbd>RDTSC</kbd> can be decoded, so that
it is independent of any shadowing effect from the code
to test on the PPlain.
<p>
On the PMMX you may want to insert <kbd>XOR EAX,EAX / CPUID</kbd>
before the instructions to test if you want the FIFO instruction buffer
to be empty, or some time-consuming instruction
(e.g. <kbd>CLI</kbd> or <kbd>AAD</kbd>) if you want the FIFO buffer to
be full (<kbd>CPUID</kbd> has no shadow under which
prefixes of subsequent instructions can decode).
<p>
On the PPro, PII and PIII you have to insert <kbd>XOR EAX,EAX / CPUID</kbd>
before and after each <kbd>RDTSC</kbd> to prevent it from executing
in parallel with anything else, and remove the filler
instructions. (<kbd>CPUID</kbd> is a serializing instruction, which means
that it flushes the pipeline and waits for all pending operations to
finish before proceeding. This is useful for testing purposes.)
<p>
The <kbd>RDTSC</kbd> instruction cannot execute in virtual mode on the
PPlain and PMMX, so if you are running DOS programs you must run
in real mode. (Press F8 while booting and select
"safe mode command prompt only" or "bypass startup files").
<p>
The complete test program is available from <a href="http://www.agner.org/assem/">www.agner.org/assem/</a>.
<p>
The Pentium processors have special performance monitor counters which can count
events such as cache misses, misalignments, various stalls, etc. Details about how to use the
performance monitor counters are not covered by this manual but can be found in
"Intel Architecture Software Developer's Manual", vol. 3, Appendix A.
<p>
<h2><a name="31">31</a>. Comparison of the different microprocessors</h2>
The following table summarizes some important differences between the microprocessors in
the Pentium family:
<p>

<table border=1 cellpadding=4 cellspacing=1>
<tr><td> </td>
<td class="a3"> PPlain </td>
<td class="a3"> PMMX </td>
<td class="a3"> PPro </td>
<td class="a3"> PII </td>
<td class="a3"> PIII </td></tr>
<tr><td>code cache, kb</td><td>8</td><td>16</td><td>8</td><td>16</td><td>16</td></tr>
<tr><td>data cache, kb</td><td>8</td><td>16</td><td>8</td><td>16</td><td>16</td></tr>
<tr><td>built in level 2 cache, kb</td><td>0</td><td>0</td><td>256</td><td>512 *)</td><td>512 *)</td></tr>
<tr><td>MMX instructions</td><td>no</td><td>yes</td><td>no</td><td>yes</td><td>yes</td></tr>
<tr><td>XMM instructions</td><td>no</td><td>no</td><td>no</td><td>no</td><td>yes</td></tr>
<tr><td>conditional move instructions</td><td>no</td><td>no</td><td>yes</td><td>yes</td><td>yes</td></tr>
<tr><td>out of order execution</td><td>no</td><td>no</td><td>yes</td><td>yes</td><td>yes</td></tr>
<tr><td>branch prediction</td><td>poor</td><td>good</td><td>good</td><td>good</td><td>good</td></tr>
<tr><td>branch target buffer entries</td><td>256</td><td>256</td><td>512</td><td>512</td><td>512</td></tr>
<tr><td>return stack buffer size</td><td>0</td><td>4</td><td>16</td><td>16</td><td>16</td></tr>
<tr><td>branch misprediction penalty</td><td>3-4</td><td>4-5</td><td>10-20</td><td>10-20</td><td>10-20</td></tr>
<tr><td>partial register stall</td><td>0</td><td>0</td><td>5</td><td>5</td><td>5</td></tr>
<tr><td>FMUL latency</td><td>3</td><td>3</td><td>5</td><td>5</td><td>5</td></tr>
<tr><td>FMUL throughput</td><td>1/2</td><td>1/2</td><td>1/2</td><td>1/2</td><td>1/2</td></tr>
<tr><td>IMUL latency</td><td>9</td><td>9</td><td>4</td><td>4</td><td>4</td></tr>
<tr><td>IMUL throughput</td><td>1/9</td><td>1/9</td><td>1/1</td><td>1/1</td><td>1/1</td></tr>
</table>
<p>*) Celeron: 0-128, Xeon: 512 or more, many other variants available.
On some versions the level 2 cache runs at half speed.
<p>
<u>Comments to the table:</u><br>
Code cache size is important if the critical part of your program is not limited to a small
memory space.
<p>
Data cache size is important for all programs that handle more than small amounts of data
in the critical part.
<p>
MMX and XMM instructions are useful for programs that handle massively parallel
data, such as sound and image processing. In other applications it may not be
possible to take advantage of the MMX and XMM instructions.
<p>
Conditional move instructions are useful for avoiding poorly predictable conditional
jumps.
<p>
Out of order execution improves performance, especially on non-optimized code. It includes
automatic instruction reordering and register renaming.
<p>
Processors with a good branch prediction method can predict simple repetitive patterns.
Good branch prediction is most important if the branch misprediction penalty is high.
<p>
A return stack buffer improves prediction of return instructions when a subroutine is called
alternately from different locations.
<p>
Partial register stalls make handling of mixed data sizes (8, 16, 32 bit) more difficult.
<p>
The latency of a multiplication instruction is the time it takes in a dependency chain. A
throughput of 1/2 means that the execution can be pipelined so that a new multiplication
can begin every second clock cycle. This defines the speed for handling parallel data.
<p>
Most of the optimizations described in this document have little or no negative effects on
other microprocessors, including non-Intel processors, but there are some problems to be
aware of.
<p>
Scheduling floating point code for the PPlain and PMMX often requires a lot of
extra <kbd>FXCH</kbd> instructions. This will slow down execution on older microprocessors, but not on the
Pentium family and advanced non-Intel processors.
<p>
Taking advantage of the MMX instructions in the PMMX, PII and PIII processors, or of the
conditional moves in the PPro, PII and PIII, will create problems if you want your code to be
compatible with earlier microprocessors. The solution may be to write several versions of
your code, each optimized for a particular processor. Your program should detect which
processor it is running on and select the appropriate version of code
(chapter <a href="#27_10">27.10</a>).
<p>
</body>
</html>