<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>OS/RC: Design Elements of the FreeBSD VM System</title>
</head>

<body bgcolor="#ffffff">

<p align=right>Back to the <a href="..">OS/RC</a></p>

<h2>Design Elements of the FreeBSD VM System</h2>
<h3>By Matthew Dillon <a href="mailto:dillon@apollo.backplane.com">dillon@apollo.backplane.com</a></h3>
<p class="Normal">
The title is really just a fancy way of saying that I am going to attempt to
describe the whole VM enchilada, hopefully in a way that everyone can follow.
For the last year I have concentrated on a number of major kernel subsystems
within FreeBSD, with the VM and Swap subsystems being the most interesting and
NFS being 'a necessary chore'.  I rewrote only small portions of the code.  In
the VM arena the only major rewrite I have done is to the swap subsystem.
Most of my work was cleanup and maintenance, with only moderate code rewriting
and no major algorithmic adjustments within the VM subsystem.  The bulk of the
VM subsystem's theoretical base remains unchanged, and a lot of the credit for
the modernization effort in the last few years belongs to John Dyson and David
Greenman.  Not being a historian like Kirk, I will not attempt to tag all the
various features with people's names, since I will invariably get it wrong.
</P>
<p class="Normal">
Before moving along to the actual design, let's spend a little time on the
necessity of maintaining and modernizing any long-living codebase.  In the
programming world, algorithms tend to be more important than code, and it is
precisely due to BSD's academic roots that a great deal of attention was paid
to algorithm design from the beginning.  More attention paid to the design
generally leads to a clean and flexible codebase that can be fairly easily
modified, extended, or replaced over time.  While BSD is considered an 'old'
operating system by some people, those of us who work on it tend to view it
more as a 'mature' codebase which has had various components modified,
extended, or replaced with modern code.  It has evolved, and FreeBSD is at
the bleeding edge no matter how old some of the code might be.  This is an
important distinction to make and one that is unfortunately lost to many
people.  The biggest error a programmer can make is to not learn from
history, and this is precisely the error that many other modern operating
systems have made.  NT is the best example of this, and the consequences have
been dire.  Linux also makes this mistake to some degree -- enough that we
BSD folk can make small jokes about it every once in a while, anyway (grin).
Linux's problem is simply one of a lack of experience and history to compare
ideas against, a problem that is easily and rapidly being addressed by the
Linux community in the same way it has been addressed in the BSD community --
by continuous code development.  The NT folk, on the other hand, repeatedly
make the same mistakes solved by UNIX decades ago and then spend years fixing
them.  Over and over again.  They have a severe case of 'not invented here'
and 'we are always right because our marketing department says so'.  I have
little tolerance for anyone who cannot learn from history.
</P>
<p class="Normal">
Much of the apparent complexity of the FreeBSD design, especially in the
VM/Swap subsystem, is a direct result of having to solve serious performance
issues that occur under various conditions.  These issues are not due to bad
algorithmic design but instead arise from environmental factors.  In any
direct comparison between platforms, these issues become most apparent when
system resources begin to get stressed.  As I describe FreeBSD's VM/Swap
subsystem, the reader should always keep two points in mind.  First, the most
important aspect of performance design is what is known as "Optimizing the
Critical Path".  It is often the case that performance optimizations add a
little bloat to the code in order to make the critical path perform better.
Second, a solid, generalized design outperforms a heavily-optimized design
over the long run.  While a generalized design may end up being slower than a
heavily-optimized design when they are first implemented, the generalized
design tends to be easier to adapt to changing conditions and the
heavily-optimized design winds up having to be thrown away.  Any codebase
that will survive and be maintainable for years must therefore be designed
properly from the beginning even if it costs some performance.  Twenty years
ago people were still arguing that programming in assembly was better than
programming in a high-level language because it produced code that was ten
times as fast.  Today, the fallibility of that argument is obvious -- as are
the parallels to algorithmic design and code generalization.
</P>
<p class="Normal">
<strong>VM Objects</strong>
</P>

<p class="Normal">
The best way to begin describing the FreeBSD VM system is to look at it from
the perspective of a user-level process.  Each user process sees a single,
private, contiguous VM address space containing several types of memory
objects.  These objects have various characteristics.  Program code and
program data are effectively a single memory-mapped file (the binary file
being run), but program code is read-only while program data is
copy-on-write.  Program BSS is just memory allocated and filled with zeros on
demand, called demand zero page fill.  Arbitrary files can be memory-mapped
into the address space as well, which is how the shared library mechanism
works.  Such mappings can require modifications to remain private to the
process making them.  The fork system call adds an entirely new dimension to
the VM management problem on top of the complexity already given.
</P>
<p class="Normal">
A program binary data page (which is a basic copy-on-write page) illustrates
the complexity.  A program binary contains a preinitialized data section
which is initially mapped directly from the program file.  When a program is
loaded into a process's VM space, this area is initially memory-mapped and
backed by the program binary itself, allowing the VM system to free/reuse the
page and later load it back in from the binary.  The moment a process
modifies this data, however, the VM system must make a private copy of the
page for that process.  Since the private copy has been modified, the VM
system may no longer free it, because there is no longer any way to restore
it later on.
</P>
<p class="Normal">
You will notice immediately that what was originally a simple file mapping
has become much more complex.  Data may be modified on a page-by-page basis
whereas the file mapping encompasses many pages at once.  The complexity
further increases when a process forks.  When a process forks, the result is
two processes -- each with its own private address space, including any
modifications made by the original process prior to the call to fork().  It
would be silly for the VM system to make a complete copy of the data at the
time of the fork() because it is quite possible that at least one of the two
processes will only need to read from that page from then on, allowing the
original page to continue to be used.  What was a private page is made
copy-on-write again, since each process (parent and child) expects its own
personal post-fork modifications to remain private to itself and not affect
the other.
</P>
<p class="Normal">
FreeBSD manages all of this with a layered VM Object model.  The original
binary program file winds up being the lowest VM Object layer.  A
copy-on-write layer is pushed on top of that to hold those pages which had to
be copied from the original file.  If the program modifies a data page
belonging to the original file, the VM system takes a fault and makes a copy
of the page in the higher layer.  When a process forks, additional VM Object
layers are pushed on.
</P>

<p class="Normal">
This might make a little more sense with a fairly basic example.  A fork() is
a common operation for any *BSD system, so this example will consider a
program that starts up and forks.  When the process starts, the VM system
creates an object layer, let's call this A:
</P>

<center><img src="fig1.gif"></center>

<p class="Normal">
A represents the file--pages may be paged in and out of the file's physical
media as necessary.  Paging in from the disk is reasonable for a program, but
we really don't want to page back out and overwrite the executable.  The VM
system therefore creates a second layer, B, that will be physically backed by
swap space:
</P>

<center><img src="fig2.gif"></center>

<p class="Normal">
On the first write to a page after this, a new page is created in B, and its
contents are initialized from A.  All pages in B can be paged in or out to a
swap device.  When the program forks, the VM system creates two new object
layers--C1 for the parent, and C2 for the child--that rest on top of B:
</P>

<center><img src="fig3.gif"></center>

<p class="Normal">
In this case, let's say a page in B is modified by the original parent
process.  The process will take a copy-on-write fault and duplicate the page
in C1, leaving the original page in B untouched.  Now, let's say the same
page in B is modified by the child process.  The process will take a
copy-on-write fault and duplicate the page in C2.  The original page in B is
now completely hidden since both C1 and C2 have a copy (and B could
theoretically be destroyed if it does not represent a 'real' file).  However,
this sort of optimization is not trivial to make because it is so
fine-grained.  FreeBSD does not make this optimization.
</P>

<p class="Normal">
Now, suppose (as is often the case) that the child process does an exec().
Its current address space is usually replaced by a new address space
representing a new file.  In this case, the C2 layer is destroyed:
</P>

<center><img src="fig4.gif"></center>

<p class="Normal">
In this case, the number of children of B drops to one, and all accesses to B
now go through C1.  This means that B and C1 can be collapsed together.  Any
pages in B that also exist in C1 are deleted from B during the collapse.
Thus, even though the optimization in the previous step could not be made, we
can recover the dead pages when either of the processes exits or execs.
</P>
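<p class="Normal">
A small user-space sketch may help make the layering concrete.  The
structures and names below are illustrative stand-ins, not the kernel's
vm_object machinery; they only model the walk-down-the-chain lookup and the
copy-up on a write fault described above:
</P>

<pre>
/*
 * Toy model of layered VM objects and copy-on-write.  Illustrative
 * only -- these are not the kernel's vm_object/vm_page structures.
 */
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;

#define NPAGES 8

struct sim_object {
    struct sim_object *backing;   /* next layer down, e.g. B -> A           */
    char *pages[NPAGES];          /* NULL means "not present in this layer" */
};

/* Walk the chain from the top until some layer holds the page. */
static char *lookup_page(struct sim_object *top, int idx)
{
    for (struct sim_object *o = top; o != NULL; o = o->backing)
        if (o->pages[idx] != NULL)
            return o->pages[idx];
    return NULL;                  /* would be a zero-fill or I/O fault */
}

/* Write fault: copy the page into the top layer if it is not there yet. */
static void write_fault(struct sim_object *top, int idx, const char *data)
{
    if (top->pages[idx] == NULL) {
        char *src = lookup_page(top->backing, idx);
        top->pages[idx] = malloc(64);
        strcpy(top->pages[idx], src ? src : "");   /* the COW copy            */
    }
    strcpy(top->pages[idx], data);                 /* modify the private copy */
}

int main(void)
{
    struct sim_object A  = { NULL, { "page from the binary" } };
    struct sim_object B  = { &amp;A, { NULL } };     /* swap-backed layer    */
    struct sim_object C1 = { &amp;B, { NULL } };     /* parent, after fork() */
    struct sim_object C2 = { &amp;B, { NULL } };     /* child, after fork()  */

    write_fault(&amp;C1, 0, "parent's private copy");    /* COW into C1      */
    printf("parent sees: %s\n", lookup_page(&amp;C1, 0));
    printf("child  sees: %s\n", lookup_page(&amp;C2, 0)); /* still A's page  */
    return 0;
}
</pre>

<p class="Normal">
Running it, the parent sees its private copy while the child still sees the
page supplied by the lowest layer -- exactly the situation in the third
figure before the child writes anything.
</P>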
<p class="Normal">
This model creates a number of potential problems.  The first is that you can
wind up with a relatively deep stack of layered VM Objects which can cost
scanning time and memory when you take a fault.  Deep layering can occur when
processes fork and then fork again (either parent or child).  The second
problem is that you can wind up with dead, inaccessible pages deep in the
stack of VM Objects.  In our last example, if both the parent and child
processes modify the same page, they both get their own private copies of the
page and the original page in B is no longer accessible by anyone.  In
principle that page in B could be freed, but, as noted above, FreeBSD does
not make that fine-grained optimization.
</P>
<p class="Normal">
FreeBSD solves the deep layering problem with a special optimization called
the "All Shadowed Case".  This case occurs if either C1 or C2 takes
sufficient COW faults to completely shadow all pages in B.  Let's say that C1
achieves this.  C1 can now bypass B entirely, so rather than having C1->B->A
and C2->B->A we now have C1->A and C2->B->A.  But look what also happened --
now B has only one reference (C2), so we can collapse B and C2 together.  The
end result is that B is deleted entirely and we have C1->A and C2->A.  It is
often the case that B will contain a large number of pages and neither C1 nor
C2 will be able to completely overshadow it.  If we fork again and create a
set of D layers, however, it is much more likely that one of the D layers
will eventually be able to completely overshadow the much smaller dataset
represented by C1 or C2.  The same optimization will work at any point in the
graph, and the grand result of this is that even on a heavily forked machine
VM Object stacks tend to not get much deeper than 4.  This is true of both
the parent and the children and true whether the parent is doing the forking
or whether the children cascade forks.
</P>
<p class="Normal">
The dead page problem still exists in the case where C1 or C2 do not
completely overshadow B.  Due to our other optimizations this case does not
represent much of a problem and we simply allow the pages to be dead.  If the
system runs low on memory it will swap them out, eating a little swap, but
that's it.
</P>
<p class="Normal">
The advantage to the VM Object model is that fork() is extremely fast, since
no real data copying need take place.  The disadvantage is that you can build
a relatively complex VM Object layering that slows page fault handling down a
little, and you spend memory managing the VM Object structures.  The
optimizations FreeBSD makes prove to reduce the problems enough that they can
be ignored, leaving no real disadvantage.
</P>
<p class="Normal">
<strong>SWAP Layers</strong>
</P>

<p class="Normal">
Private data pages are initially either copy-on-write or zero-fill pages.
When a change, and therefore a copy, is made, the original backing object
(usually a file) can no longer be used to save a copy of the page when the VM
system needs to reuse it for other purposes.  This is where SWAP comes in.
SWAP is allocated to create backing store for memory that does not otherwise
have it.  FreeBSD allocates the swap management structure for a VM Object
only when it is actually needed.  However, the swap management structure has
had problems historically.
</P>
<p class="Normal">
Under FreeBSD 3.x, the swap management structure preallocates an array that
encompasses the entire object requiring swap backing store -- even if only a
few pages of that object are swap-backed.  This creates a kernel memory
fragmentation problem when large objects are mapped, or processes with large
resident set sizes (RSS) fork.  Also, in order to keep track of swap space, a
'list of holes' is kept in kernel memory, and this tends to get severely
fragmented as well.  Since the 'list of holes' is a linear list, the swap
allocation and freeing performance is a non-optimal O(n)-per-page.  It also
requires kernel memory allocations to take place during the swap freeing
process, and that creates low memory deadlock problems.  The problem is
further exacerbated by holes created due to the interleaving algorithm.
Also, the swap block map can become fragmented fairly easily, resulting in
non-contiguous allocations.  Kernel memory must also be allocated on the fly
for additional swap management structures when a swapout occurs.  It is
evident that there was plenty of room for improvement.
</P>
<p class="Normal">
For FreeBSD 4.x, I completely rewrote the swap subsystem.  With this rewrite,
swap management structures are allocated through a hash table rather than a
linear array, giving them a fixed allocation size and much finer granularity.
Rather than using a linearly linked list to keep track of swap space
reservations, it now uses a bitmap of swap blocks arranged in a radix tree
structure with free-space hinting in the radix node structures.  This
effectively makes swap allocation and freeing an O(1) operation.  The entire
radix tree bitmap is also preallocated in order to avoid having to allocate
kernel memory during critical low memory swapping operations.  After all, the
system tends to swap when it is low on memory, so we should avoid allocating
kernel memory at such times in order to avoid potential deadlocks.  Finally,
to reduce fragmentation the radix tree is capable of allocating large
contiguous chunks at once, skipping over smaller fragmented chunks.  I did
not take the final step of having an 'allocating hint pointer' that would
trundle through a portion of swap as allocations were made in order to
further guarantee contiguous allocations or at least locality of reference,
but I ensured that such an addition could be made.
</P>
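<p class="Normal">
To give a flavor of the idea (not the actual kernel blist code), here is a
hedged sketch of a preallocated block bitmap with a one-level summary acting
as the free-space hint, so that finding a free swap block never degenerates
into a linear scan:
</P>

<pre>
/*
 * Sketch of the 4.x idea: a preallocated bitmap of swap blocks plus a
 * summary word acting as a free-space hint.  The real kernel uses a
 * full radix tree; this toy collapses it to two levels.
 */
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

#define NBLOCKS (64 * 32)             /* 2048 swap blocks                    */
#define NWORDS  (NBLOCKS / 32)

static uint32_t leaf[NWORDS];         /* bit set: block allocated            */
static uint64_t full;                 /* bit i set: word i completely full   */

static int swp_alloc(void)
{
    if (full == ~0ULL)
        return -1;                            /* swap space exhausted        */
    int w = __builtin_ctzll(~full);           /* hint: first word with space */
    int b = __builtin_ctz(~leaf[w]);          /* first free block in it      */
    leaf[w] |= 1u &lt;&lt; b;
    if (leaf[w] == 0xffffffffu)
        full |= 1ULL &lt;&lt; w;
    return w * 32 + b;
}

static void swp_free(int blk)
{
    leaf[blk / 32] &amp;= ~(1u &lt;&lt; (blk % 32));
    full &amp;= ~(1ULL &lt;&lt; (blk / 32));
}

int main(void)
{
    int a = swp_alloc(), b = swp_alloc();
    printf("allocated swap blocks %d and %d\n", a, b);
    swp_free(a);
    printf("freed %d, next allocation returns %d\n", a, swp_alloc());
    return 0;
}
</pre>

<p class="Normal">
Because everything is preallocated, nothing in the allocation or freeing path
ever calls a memory allocator -- which is the property that matters when the
system is paging out under memory pressure.
</P>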
<p class="Normal">
<strong>When To Free a Page</strong>
</P>

<p class="Normal">
Since the VM system uses all available memory for disk caching, there are
usually very few truly free pages.  The VM system depends on being able to
properly choose pages which are not in use to reuse for new allocations.
Selecting the optimal pages to free is possibly the single most important
function any VM system can perform because if it makes a poor selection, the
VM system may be forced to unnecessarily retrieve pages from disk, seriously
degrading system performance.
</P>

<p class="Normal">
How much overhead are we willing to suffer in the critical path to avoid
freeing the wrong page?  Each wrong choice we make will cost us hundreds of
thousands of CPU cycles and a noticeable stall of the affected processes, so
we are willing to endure a significant amount of overhead in order to be sure
that the right page is chosen.  This is why FreeBSD tends to outperform other
systems when memory resources become stressed.
</P>
<table border="0" cellspacing="5" cellpadding="5">
<tr>
<td width="65%" valign="top">
<p class="Normal">
The free page determination algorithm is built upon a history of the use of
memory pages.  To acquire this history, the system takes advantage of a
page-used bit feature that most hardware page tables have.
</p>
<p class="Normal">
In any case, the page-used bit is cleared and at some later point the VM
system comes across the page again and sees that the page-used bit has been
set.  This indicates that the page is still being actively used.  If the bit
is still clear, it is an indication that the page is not being actively used.
By testing this bit periodically, a use history (in the form of a counter)
for the physical page is developed.  When the VM system later needs to free
up some pages, checking this history becomes the cornerstone of determining
the best candidate page to reuse.
</p>
</td>
<td width="35%" valign="top" bgcolor="#dadada"><font size="-1">
<center><strong>What if the hardware has no page-used bit?</strong></center><br>
<p class="Normal">
For those platforms that do not have this feature, the system actually
emulates a page-used bit.  It unmaps or protects a page, forcing a page fault
if the page is accessed again.  When the page fault is taken, the system
simply marks the page as having been used and unprotects the page so that it
may be used.  While taking such page faults just to determine if a page is
being used appears to be an expensive proposition, it is much less expensive
than reusing the page for some other purpose only to find that a process
needs it back and then having to go to disk.
</P></font>
</td>
</tr>
</table>
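<p class="Normal">
A toy version of that scan, with invented names and tuning, shows how such a
counter separates a busy page from an idle one over a few passes:
</P>

<pre>
/*
 * Sketch of building a use history from the page-used (referenced) bit.
 * The counter mirrors the idea of a per-page activity count; the names
 * and increments here are illustrative, not the kernel's tuning.
 */
#include &lt;stdio.h&gt;

struct page_sim {
    int referenced;     /* the hardware page-used bit                  */
    int act_count;      /* accumulated use history                     */
};

static void scan(struct page_sim *p)
{
    if (p->referenced)
        p->act_count++;         /* touched since the last scan         */
    else if (p->act_count > 0)
        p->act_count--;         /* cooling off                         */
    p->referenced = 0;          /* clear the bit for the next interval */
}

int main(void)
{
    struct page_sim busy = { 0, 0 }, idle = { 0, 0 };

    for (int pass = 0; pass &lt; 5; pass++) {
        busy.referenced = 1;    /* this page is used in every interval */
        scan(&amp;busy);
        scan(&amp;idle);
    }
    /* The page with the lower count is the better candidate to reuse. */
    printf("busy act_count=%d, idle act_count=%d\n",
           busy.act_count, idle.act_count);
    return 0;
}
</pre>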
<p class="Normal">
FreeBSD makes use of several page queues to further refine the selection of
pages to reuse as well as to determine when dirty pages must be flushed to
their backing store.  Since page tables are dynamic entities under FreeBSD,
it costs virtually nothing to unmap a page from the address space of any
processes using it.  When a page candidate has been chosen based on the
page-use counter, this is precisely what is done.  The system must make a
distinction between clean pages which can theoretically be freed up at any
time, and dirty pages which must first be written to their backing store
before being reusable.  When a page candidate has been found it is moved to
the inactive queue if it is dirty, or the cache queue if it is clean.  A
separate algorithm based on the dirty-to-clean page ratio determines when
dirty pages in the inactive queue must be flushed to disk.  Once this is
accomplished, the flushed pages are moved from the inactive queue to the
cache queue.  At this point, pages in the cache queue can still be
reactivated by a VM fault at relatively low cost.  However, pages in the
cache queue are considered to be 'immediately freeable' and will be reused in
an LRU (least-recently used) fashion when the system needs to allocate new
memory.
</P>
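<p class="Normal">
The queue transitions just described amount to a small state machine.  The
sketch below is only a restatement of the prose, using the queue names from
the article:
</P>

<pre>
/* Toy restatement of the queue transitions described above. */
#include &lt;stdio.h&gt;

enum queue { ACTIVE, INACTIVE, CACHE, FREE };
static const char *qname[] = { "active", "inactive", "cache", "free" };

/* A page chosen for reuse is unmapped and moved off the active queue:
 * dirty pages must be laundered first, clean pages are reusable as-is. */
static enum queue deactivate(int dirty)
{
    return dirty ? INACTIVE : CACHE;
}

int main(void)
{
    enum queue q = deactivate(1);                 /* a dirty candidate    */
    printf("dirty page       -> %s queue\n", qname[q]);
    q = CACHE;                                    /* after its pageout    */
    printf("after the flush  -> %s queue\n", qname[q]);
    printf("clean page       -> %s queue\n", qname[deactivate(0)]);
    return 0;
}
</pre>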
<p class="Normal">
It is important to note that the FreeBSD VM system attempts to separate clean
and dirty pages for the express reason of avoiding unnecessary flushes of
dirty pages (which eat I/O bandwidth); nor does it move pages between the
various page queues gratuitously when the memory subsystem is not being
stressed.  This is why you will see some systems with very low cache queue
counts and high active queue counts when doing a 'systat -vm' command.  As
the VM system becomes more stressed, it makes a greater effort to maintain
the various page queues at the levels determined to be the most effective.
An urban myth has circulated for years that Linux did a better job avoiding
swapouts than FreeBSD, but this in fact is not true.  What was actually
occurring was that FreeBSD was proactively paging out unused pages in order
to make room for more disk cache while Linux was keeping unused pages in core
and leaving less memory available for cache and process pages.  I don't know
whether this is still true today.
</P>
<p class="Normal">
<strong>Pre-Faulting and Zeroing Optimizations</strong>
</P>

<p class="Normal">
Taking a VM fault is not expensive if the underlying page is already in core
and can simply be mapped into the process, but it can become expensive if you
take a whole lot of them on a regular basis.  A good example of this is
running a program such as 'ls' or 'ps' over and over again.  If the program
binary is mapped into memory but not mapped into the page table, then all the
pages that will be accessed by the program will have to be faulted in every
time the program is run.  This is unnecessary when the pages in question are
already in the VM Cache, so FreeBSD will attempt to pre-populate a process's
page tables with those pages that are already in the VM Cache.  One thing
that FreeBSD does not yet do is pre-copy-on-write certain pages on exec.  For
example, if you run the /bin/ls program while running 'vmstat 1' you will
notice that it always takes a certain number of page faults, even when you
run it over and over again.  These are zero-fill faults, not program code
faults (which were pre-faulted in already).  Pre-copying pages on exec or
fork is an area that could use more study.
</P>
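<p class="Normal">
A toy model (purely illustrative; the kernel obviously does not work on
arrays of ints) shows why pre-population helps: pages already resident in the
VM cache are entered into the page table up front, so only the genuinely
absent pages fault:
</P>

<pre>
/* Toy model of exec-time pre-faulting of already-resident pages. */
#include &lt;stdio.h&gt;

#define NPAGES 8

static int resident[NPAGES]   = { 1, 1, 1, 0, 1, 0, 0, 0 }; /* in VM cache */
static int page_table[NPAGES];                              /* 1 = mapped  */

static int run_program(void)
{
    int faults = 0;
    for (int i = 0; i &lt; NPAGES; i++)      /* the program touches every page */
        if (!page_table[i]) {
            faults++;                     /* take a fault, then map it      */
            page_table[i] = 1;
        }
    return faults;
}

int main(void)
{
    /* Pre-populate the page table with the pages already in the cache. */
    for (int i = 0; i &lt; NPAGES; i++)
        if (resident[i])
            page_table[i] = 1;

    printf("faults taken on this run: %d\n", run_program()); /* only the 4
                                                                missing pages */
    return 0;
}
</pre>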
<p class="Normal">
A large percentage of page faults that occur are zero-fill faults.  You can
usually see this by observing the 'vmstat -s' output.  These occur when a
process accesses pages in its BSS area.  The BSS area is expected to be
initially zero but the VM system does not bother to allocate any memory at
all until the process actually accesses it.  When a fault occurs the VM
system must not only allocate a new page, it must zero it as well.  To
optimize the zeroing operation the VM system has the ability to pre-zero
pages and mark them as such, and to request pre-zeroed pages when zero-fill
faults occur.  The pre-zeroing occurs whenever the CPU is idle but the number
of pages the system pre-zeros is limited in order to avoid blowing away the
memory caches.  This is an excellent example of adding complexity to the VM
system in order to optimize the critical path.
</P>
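<p class="Normal">
The shape of that optimization, with invented names and a user-space stand-in
for a page, looks something like this: zero pages when the CPU has nothing
better to do, bounded by a small pool, and consume them on the zero-fill
fault path:
</P>

<pre>
/*
 * Sketch of idle-time pre-zeroing.  Illustrative only: a bounded pool
 * of already-zeroed "pages" so a zero-fill fault can often skip the
 * zeroing on the critical path.
 */
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;

#define PAGE_SIZE   4096
#define PREZERO_MAX 4        /* bounded so we do not blow away the caches */

static void *prezeroed[PREZERO_MAX];
static int   nprezero;

/* Run when the CPU would otherwise be idle. */
static void idle_prezero(void)
{
    while (nprezero &lt; PREZERO_MAX)
        prezeroed[nprezero++] = calloc(1, PAGE_SIZE);
}

/* Zero-fill fault: prefer a pre-zeroed page, fall back to zeroing now. */
static void *zero_fill_fault(void)
{
    if (nprezero > 0)
        return prezeroed[--nprezero];     /* fast path: no zeroing here */
    return calloc(1, PAGE_SIZE);          /* slow path                  */
}

int main(void)
{
    idle_prezero();
    char *p = zero_fill_fault();
    printf("got a %s page\n", p[0] == 0 ? "zeroed" : "dirty");
    free(p);
    return 0;
}
</pre>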
<p class="Normal">
<strong>Page Table Optimizations</strong>
</P>

<p class="Normal">
The page table optimizations make up the most contentious part of the FreeBSD
VM design, and they have shown some strain with the advent of serious use of
mmap().  I think this is actually a feature of most BSDs, though I am not
sure when it was first introduced.  There are two major optimizations.  The
first is that hardware page tables do not contain persistent state but
instead can be thrown away at any time with only a minor amount of management
overhead.  The second is that every active page table entry in the system has
a governing pv_entry structure which is tied into the vm_page structure.
FreeBSD can simply iterate through those mappings that are known to exist
while Linux must check all page tables that *might* contain a specific
mapping to see if it does, which can incur O(n^2) overhead in certain
situations.  It is because of this that FreeBSD tends to make better choices
on which pages to reuse or swap when memory is stressed, giving it better
performance under load.  However, FreeBSD requires kernel tuning to
accommodate large-shared-address-space situations such as those that can
occur in a news system because it may run out of pv_entry structures.
</P>
<p class="Normal">
Both Linux and FreeBSD need work in this area.  FreeBSD is trying to maximize
the advantage of a potentially sparse active-mapping model (not all processes
need to map all pages of a shared library, for example), whereas Linux is
trying to simplify its algorithms.  FreeBSD generally has the performance
advantage here at the cost of wasting a little extra memory, but FreeBSD
breaks down in the case where a large file is massively shared across
hundreds of processes.  Linux, on the other hand, breaks down in the case
where many processes are sparsely-mapping the same shared library and also
runs non-optimally when trying to determine whether a page can be reused or
not.
</P>
<p class="Normal">
<strong>Page Coloring</strong>
</P>

<p class="Normal">
We'll end with the page coloring optimizations.  Page coloring is a
performance optimization designed to ensure that accesses to contiguous pages
in virtual memory make the best use of the processor cache.  In ancient times
(i.e. 10+ years ago) processor caches tended to map virtual memory rather
than physical memory.  This led to a huge number of problems including having
to clear the cache on every context switch in some cases, and problems with
data aliasing in the cache.  Modern processor caches map physical memory
precisely to solve those problems.  This means that two side-by-side pages in
a process's address space may not correspond to two side-by-side pages in the
cache.  In fact, if you aren't careful, side-by-side pages in virtual memory
could wind up using the same page in the processor cache -- leading to
cacheable data being thrown away prematurely and reducing CPU performance.
This is true even with multi-way set-associative caches (though the effect is
mitigated somewhat).
</P>
<p class="Normal">
FreeBSD's memory allocation code implements page coloring optimizations,
which means that the memory allocation code will attempt to locate free pages
that are contiguous from the point of view of the cache.  For example, if
page 16 of physical memory is assigned to page 0 of a process's virtual
memory and the cache can hold 4 pages, the page coloring code will not assign
page 20 of physical memory to page 1 of a process's virtual memory.  It
would, instead, assign page 21 of physical memory.  The page coloring code
attempts to avoid assigning page 20 because this maps over the same cache
memory as page 16 and would result in non-optimal caching.  This code adds a
significant amount of complexity to the VM memory allocation subsystem as you
can well imagine, but the result is well worth the effort.  Page coloring
makes VM memory as deterministic as physical memory with regard to cache
performance.
</P>
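<p class="Normal">
The arithmetic behind the example is simply the physical page number taken
modulo the number of pages the cache can hold.  A two-line check
(illustrative only) confirms why page 21, not page 20, is the right neighbor
for a virtual page that follows one backed by physical page 16:
</P>

<pre>
/* The color of a physical page is its page number modulo the number of
 * page-sized bins in the cache (4 in the example above).                */
#include &lt;stdio.h&gt;

#define NCOLORS 4

static int color(int phys_page) { return phys_page % NCOLORS; }

int main(void)
{
    /* Physical page 16 backs virtual page 0, so it has color 0.          */
    /* Virtual page 1 should get color 1: page 21 fits, page 20 clashes.  */
    printf("page 16 color=%d, page 20 color=%d, page 21 color=%d\n",
           color(16), color(20), color(21));
    return 0;
}
</pre>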
<p class="Normal">
<strong>Conclusion</strong>
</P>

<p class="Normal">
Virtual memory in modern operating systems must address a number of different
issues efficiently and for many different usage patterns.  The modular and
algorithmic approach that BSD has historically taken allows us to study and
understand the current implementation as well as relatively cleanly replace
large sections of the code.  There have been a number of improvements to the
FreeBSD VM system in the last several years, and work is ongoing.
</P>
<hr>

<h2>A Bonus Question and Answer Session by Allen Briggs <a href="mailto:briggs@ninthwonder.com">&lt;briggs@ninthwonder.com&gt;</a></h2>

<p class="Normal">
Q: What is "the interleaving algorithm" that you refer to in your listing of
the ills of the FreeBSD 3.x swap arrangements?
</P>
<blockquote>
<p class="Normal">
A: FreeBSD uses a fixed swap interleave which defaults to 4.  This means that
FreeBSD reserves space for four swap areas even if you only have one, two, or
three.  Since swap is interleaved, the linear address space representing the
'four swap areas' will be fragmented if you don't actually have four swap
areas.  For example, if you have two swap areas, A and B, FreeBSD's address
space representation for that swap space will be interleaved in blocks of 16
pages:
</P>

<p class="Normal">
A B C D A B C D A B C D A B C D
</P>
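<p class="Normal">
The mapping itself is just integer arithmetic.  A short sketch, using the
16-page chunk size from the answer above, shows how linear block numbers
stripe across the four areas and why C and D become holes when only A and B
exist:
</P>

<pre>
/* Sketch of the fixed interleave: linear swap blocks stripe across the
 * (up to) four configured swap areas in 16-page chunks.                 */
#include &lt;stdio.h&gt;

#define SWAP_INTERLEAVE 4     /* A, B, C, D */
#define CHUNK_PAGES     16

int main(void)
{
    for (int blk = 0; blk &lt; 128; blk += CHUNK_PAGES) {
        int dev = (blk / CHUNK_PAGES) % SWAP_INTERLEAVE;
        printf("blocks %3d-%3d -> area %c%s\n",
               blk, blk + CHUNK_PAGES - 1, 'A' + dev,
               dev >= 2 ? "  (a hole if only A and B exist)" : "");
    }
    return 0;
}
</pre>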
<p class="Normal">
FreeBSD 3.x uses a 'sequential list of free regions' approach to accounting
for the free swap areas.  The idea is that large blocks of free linear space
can be represented with a single list node (kern/subr_rlist.c).  But due to
the fragmentation, the sequential list winds up being insanely fragmented.
In the above example, completely unused swap will have A and B shown as
'free' and C and D shown as 'all allocated'.  Each A-B sequence requires a
list node to account for it because C and D are holes, so the list node
cannot be combined with the next A-B sequence.
</P>

<p class="Normal">
Why do we interleave our swap space instead of just tacking swap areas onto
the end and doing something fancier?  Because it's a whole lot easier to
allocate linear swaths of an address space and have the result automatically
be interleaved across multiple disks than it is to try to put that
sophistication elsewhere.
</P>
<p class="Normal">
The fragmentation causes other problems.  Being a linear list under 3.x, and
having such a huge amount of inherent fragmentation, allocating and freeing
swap winds up being an O(N) algorithm instead of an O(1) algorithm.  Combine
this with other factors (heavy swapping) and you start getting into O(N^2)
and O(N^3) levels of overhead, which is bad.  The 3.x system may also need to
allocate KVM during a swap operation to create a new list node, which can
lead to a deadlock if the system is trying to page out pages in a low-memory
situation.
</P>

<p class="Normal">
Under 4.x we do not use a sequential list.  Instead we use a radix tree and
bitmaps of swap blocks rather than ranged list nodes.  We take the hit of
preallocating all the bitmaps required for the entire swap area up front, but
it winds up wasting less memory due to the use of a bitmap (one bit per
block) instead of a linked list of nodes.  The use of a radix tree instead of
a sequential list gives us nearly O(1) performance no matter how fragmented
the tree becomes.
</P>
</blockquote>
<p class="Normal">
Q: I don't get the following:
</P>

<p class="Normal">
It is important to note that the FreeBSD VM system attempts to separate clean
and dirty pages for the express reason of avoiding unnecessary flushes of
dirty pages (which eat I/O bandwidth); nor does it move pages between the
various page queues gratuitously when the memory subsystem is not being
stressed.  This is why you will see some systems with very low cache queue
counts and high active queue counts when doing a 'systat -vm' command.
</P>

<p class="Normal">
Q: How is the separation of clean and dirty (inactive) pages related to the
situation where you see low cache queue counts and high active queue counts
in 'systat -vm'?  Do the systat stats roll the active and dirty pages
together for the active queue count?
</P>

<blockquote>
<p class="Normal">
A: Yes, that is confusing.  The relationship is "goal" versus "reality".  Our
goal is to separate the pages but the reality is that if we are not in a
memory crunch, we don't really have to.
</P>

<p class="Normal">
What this means is that FreeBSD will not try very hard to separate out dirty
pages (inactive queue) from clean pages (cache queue) when the system is not
being stressed, nor will it try to deactivate pages (active queue -> inactive
queue) when the system is not being stressed, even if they aren't being used.
</P>
</blockquote>
<p class="Normal">
Q: In the /bin/ls / 'vmstat 1' example, wouldn't some of the page faults be
data page faults (COW from executable file to private page)?  I.e., I would
expect the page faults to be some zero-fill and some program data.  Or are
you implying that FreeBSD does do pre-COW for the program data?
</P>

<blockquote>
<p class="Normal">
A: A COW fault can be either zero-fill or program-data.  The mechanism is the
same either way because the backing program-data is almost certainly already
in the cache.  I am indeed lumping the two together.  FreeBSD does not
pre-COW program data or zero-fill, but it *does* pre-map pages that exist in
its cache.
</P>
</blockquote>
<p class="Normal">
Q: In your section on page table optimizations, can you give a little more
detail about pv_entry and vm_page (or should vm_page be vm_pmap -- as in 4.4,
cf. pp. 180-181 of McKusick, Bostic, Karels, Quarterman)?  Specifically, what
kind of operation/reaction would require scanning the mappings?
</P>

<p class="Normal">
How does Linux do in the case where FreeBSD breaks down (sharing a large file
mapping over many processes)?
</P>
<blockquote>
<p class="Normal">
A: A vm_page represents an (object,index#) tuple.  A pv_entry represents a
hardware page table entry (pte).  If you have five processes sharing the same
physical page, and three of those processes' page tables actually map the
page, that page will be represented by a single vm_page structure and three
pv_entry structures.
</P>

<p class="Normal">
pv_entry structures only represent pages mapped by the MMU (one pv_entry
represents one pte).  This means that when we need to remove all hardware
references to a vm_page (in order to reuse the page for something else, page
it out, clear it, dirty it, and so forth) we can simply scan the linked list
of pv_entry structures associated with that vm_page to remove or modify the
pte's from their page tables.
</P>
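<p class="Normal">
In miniature, the relationship looks like the sketch below.  These are
invented stand-in structures, not the kernel's definitions, but the walk is
the point: removing all mappings touches only the pte's that actually exist:
</P>

<pre>
/*
 * Miniature sketch of the vm_page / pv_entry relationship: one vm_page
 * per physical page, one pv_entry per page table that actually maps it.
 * Invented structures, for illustration only.
 */
#include &lt;stdio.h&gt;

struct pv_entry {
    int pmap_id;               /* which process's page tables          */
    unsigned long va;          /* where the page is mapped             */
    struct pv_entry *next;     /* next mapping of the same vm_page     */
};

struct vm_page {
    struct pv_entry *pv_list;  /* all known mappings of this page      */
};

/* Remove every hardware mapping of the page: walk only the known pte's. */
static void remove_all_mappings(struct vm_page *m)
{
    for (struct pv_entry *pv = m->pv_list; pv != NULL; pv = pv->next)
        printf("invalidate pte: pmap %d, va %#lx\n", pv->pmap_id, pv->va);
    m->pv_list = NULL;         /* (freeing of the pv_entry's not shown) */
}

int main(void)
{
    /* Five processes share the page, but only three actually map it. */
    struct pv_entry c = { 3, 0x20000, NULL };
    struct pv_entry b = { 2, 0x20000, &amp;c };
    struct pv_entry a = { 1, 0x20000, &amp;b };
    struct vm_page  page = { &amp;a };

    remove_all_mappings(&amp;page);  /* cost scales with mappings, not processes */
    return 0;
}
</pre>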
<p class="Normal">
Under Linux there is no such linked list.  In order to remove all the
hardware page table mappings for a vm_page, Linux must index into every VM
object that *might* have mapped the page.  For example, if you have 50
processes all mapping the same shared library and want to get rid of page X
in that library, you need to index into the page table for each of those 50
processes even if only 10 of them have actually mapped the page.  So Linux is
trading off the simplicity of its design against performance.  Many VM
algorithms which are O(1) or O(small N) under FreeBSD wind up being O(N),
O(N^2), or worse under Linux.  Since the pte's representing a particular page
in an object tend to be at the same offset in all the page tables they are
mapped in, reducing the number of accesses into the page tables at the same
pte offset will often avoid blowing away the L1 cache line for that offset,
which can lead to better performance.
</P>
<p class="Normal">
FreeBSD has added complexity (the pv_entry scheme) in order to increase
performance (to limit page table accesses to *only* those pte's that need to
be modified).
</P>

<p class="Normal">
But FreeBSD has a scaling problem that Linux does not in that there are a
limited number of pv_entry structures, and this causes problems when you have
massive sharing of data.  In this case you may run out of pv_entry structures
even though there is plenty of free memory available.  This can be fixed
easily enough by bumping up the number of pv_entry structures in the kernel
config, but we really need to find a better way to do it.
</P>

<p class="Normal">
In regard to the memory overhead of a page table versus the pv_entry scheme:
Linux uses 'permanent' page tables that are not thrown away, but does not
need a pv_entry for each potentially mapped pte.  FreeBSD uses 'throw away'
page tables but adds in a pv_entry structure for each actually-mapped pte.  I
think memory utilization winds up being about the same, giving FreeBSD an
algorithmic advantage with its ability to throw away page tables at will with
very low overhead.
</P>
</blockquote>
<p class="Normal">
Q: Finally, in the page coloring section, it might help to have a little more
description of what you mean here.  I didn't quite follow it.
</P>
<blockquote>
<p class="Normal">
A: Do you know how an L1 hardware memory cache works?  I'll explain: Consider
a machine with 16MB of main memory but only 128K of L1 cache.  Generally the
way this cache works is that each 128K block of main memory uses the *same*
128K of cache.  If you access offset 0 in main memory and then offset 128K in
main memory, you can wind up throwing away the cached data you read from
offset 0!
</P>

<p class="Normal">
Now, I am simplifying things greatly.  What I just described is what is
called a 'direct mapped' hardware memory cache.  Most modern caches are what
are called 2-way-set-associative or 4-way-set-associative caches.  The set
associativity allows you to access up to N different memory regions that
overlap the same cache memory without destroying the previously cached data.
But only N.
</P>
<p class="Normal">
So if I have a 4-way set associative cache I can access offset 0, offset
128K, 256K and offset 384K and still be able to access offset 0 again and
have it come from the L1 cache.  If I then access offset 512K, however, one
of the four previously cached data objects will be thrown away by the cache.
</P>
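<p class="Normal">
Stated as arithmetic, two offsets compete for the same cache lines when they
are equal modulo the cache size, and an N-way cache lets N such competitors
coexist.  A few lines are enough to check the example above:
</P>

<pre>
/* Offsets that are equal modulo the cache size compete for the same
 * cache lines; an N-way set-associative cache lets N of them coexist.   */
#include &lt;stdio.h&gt;

#define CACHE_SIZE (128 * 1024)
#define WAYS       4

int main(void)
{
    unsigned long offs[] = { 0, 128 * 1024, 256 * 1024, 384 * 1024, 512 * 1024 };
    int competing = 0;

    for (int i = 0; i &lt; 5; i++)
        if (offs[i] % CACHE_SIZE == offs[0] % CACHE_SIZE)
            competing++;

    printf("%d offsets map to the same lines; only %d can stay cached\n",
           competing, WAYS);
    return 0;
}
</pre>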
<p class="Normal">
It is extremely important... EXTREMELY important for most of a processor's
memory accesses to be able to come from the L1 cache, because the L1 cache
operates at the processor frequency.  The moment you have an L1 cache miss
and have to go to the L2 cache or to main memory, the processor will stall
and potentially sit twiddling its fingers for *hundreds* of instructions'
worth of time waiting for a read from main memory to complete.  Main memory
(the dynamic RAM you stuff into a computer) is S.L.O.W., capitalized and
boldfaced, when compared to the speed of a modern processor core.
</P>
<p class="Normal">
Ok, so now onto page coloring: All modern memory caches are what are known as
*physical* caches.  They cache physical memory addresses, not virtual memory
addresses.  This allows the cache to be left alone across a process context
switch, which is very important.
</P>
<p class="Normal">
But in the UNIX world you are dealing with virtual address spaces, not
physical address spaces.  Any program you write will see the virtual address
space given to it.  The actual *physical* pages underlying that virtual
address space are not necessarily physically contiguous!  In fact, you might
have two pages that are side by side in a process's address space which wind
up being at offset 0 and offset 128K in *physical* memory.
</P>
<p class="Normal">
A program normally assumes that two side-by-side pages will be optimally
cached.  That is, that you can access data objects in both pages without
having them blow away each other's cache entry.  But this is only true if the
physical pages underlying the virtual address space are contiguous (insofar
as the cache is concerned).
</P>
<p class="Normal">
This is what page coloring does.  Instead of assigning *random* physical
pages to virtual addresses, which may result in non-optimal cache
performance, page coloring assigns *reasonably-contiguous* physical pages to
virtual addresses.  Thus programs can be written under the assumption that
the characteristics of the underlying hardware cache are the same for their
virtual address space as they would be if the program had been run directly
in a physical address space.
</P>
<p class="Normal">
Note that I say 'reasonably' contiguous rather than simply 'contiguous'.
From the point of view of a 128K direct mapped cache, the physical address 0
is the same as the physical address 128K.  So two side-by-side pages in your
virtual address space may wind up being offset 128K and offset 132K in
physical memory, but could also easily be offset 128K and offset 4K in
physical memory and still retain the same cache performance characteristics.
So page-coloring does *NOT* have to assign truly contiguous pages of physical
memory to contiguous pages of virtual memory; it just needs to make sure it
assigns contiguous pages from the point of view of cache performance and
operation.
</P>
<p class="Normal">
Oops, that was a bit longer explanation than I intended.
</P>

<p class="Normal">
-Matt
</P>
</blockquote>
<hr noshade color="#dadada"><br>
<font class="Small">Author maintains all copyrights on this article.<br></font>
<p align=right>Back to the <a href="..">OS/RC</a></p>

</body>
</html>