add directory study

This commit is contained in:
gohigh
2024-02-19 00:25:23 -05:00
parent b1306b38b1
commit f3774e2f8c
4001 changed files with 2285787 additions and 0 deletions

@@ -0,0 +1,870 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>OS/RC: Design Elements of the FreeBSD VM System</title>
</head>
<body bgcolor="#ffffff">
<p align=right>Back to the <a href="..">OS/RC</a></p>
<h2>Design Elements of the FreeBSD VM System</h2>
<h3>By Matthew Dillon <a href="mailto:dillon@apollo.backplane.com">dillon@apollo.backplane.com</a></h3>
</P>
<p font class="Normal">
The title is really just a fancy way of saying that I am going
to attempt to describe the whole VM enchilada, hopefully in a
way that everyone can follow. For the last year I have
concentrated on a number of major kernel subsystems within
FreeBSD, with the VM and Swap subsystems being the most
interesting and NFS being 'a necessary chore'. I rewrote only
small portions of the code. In the VM arena the only major
rewrite I have done is to the swap subsystem. Most of my work
was cleanup and maintenance, with only moderate code rewriting
and no major algorithmic adjustments within the VM subsystem.
The bulk of the VM subsystem's theoretical base remains unchanged
and a lot of the credit for the modernization effort in the
last few years belongs to John Dyson and David Greenman. Not
being a historian like Kirk I will not attempt to tag all the
various features with people's names, since I will invariably
get it wrong.
</P>
<p font class="Normal">
Before moving along to the actual design let's spend a little
time on the necessity of maintaining and modernizing any
long-living codebase. In the programming world, algorithms
tend to be more important than code and it is precisely due to
BSD's academic roots that a great deal of attention was paid
to algorithm design from the beginning. More attention paid
to the design generally leads to a clean and flexible codebase
that can be fairly easily modified, extended, or replaced over
time. While BSD is considered an 'old' operating system by
some people, those of us who work on it tend to view it more
as a 'mature' codebase which has various components modified,
extended, or replaced with modern code. It has evolved, and
FreeBSD is at the bleeding edge no matter how old some of the
code might be. This is an important distinction to make and
one that is unfortunately lost to many people. The biggest
error a programmer can make is to not learn from history, and
this is precisely the error that many other modern operating
systems have made. NT is the best example of this, and the
consequences have been dire. Linux also makes this mistake to
some degree -- enough that we BSD folk can make small jokes
about it every once in a while, anyway (grin). Linux's problem
is simply one of a lack of experience and history to compare
ideas against, a problem that is easily and rapidly being
addressed by the Linux community in the same way it has been
addressed in the BSD community -- by continuous code development.
The NT folk, on the other hand, repeatedly make the same mistakes
solved by UNIX decades ago and then spend years fixing them.
Over and over again. They have a severe case of 'not designed
here' and 'we are always right because our marketing department
says so'. I have little tolerance for anyone who cannot learn
from history.
</P>
<p font class="Normal">
Much of the apparent complexity of the FreeBSD design, especially
in the VM/Swap subsystem, is a direct result of having to solve
serious performance issues that occur under various conditions.
These issues are not due to bad algorithmic design but instead
arise from environmental factors. In any direct comparison
between platforms, these issues become most apparent when system
resources begin to get stressed. As I describe FreeBSD's
VM/Swap subsystem the reader should always keep two points in
mind. First, the most important aspect of performance design
is what is known as "Optimizing the Critical Path". It is
often the case that performance optimizations add a little
bloat to the code in order to make the critical path perform
better. Second, a solid, generalized design outperforms a
heavily-optimized design over the long run. While a generalized
design may end up being slower than a heavily-optimized design
when they are first implemented, the generalized design tends
to be easier to adapt to changing conditions and the
heavily-optimized design winds up having to be thrown away.
Any codebase that will survive and be maintainable for years
must therefore be designed properly from the beginning even if
it costs some performance. Twenty years ago people were still
arguing that programming in assembly was better than programming
in a high-level language because it produced code that was ten
times as fast. Today, the fallibility of that argument is
obvious -- as are the parallels to algorithmic design and code
generalization.
</P>
<p font class="Normal">
<strong>VM Objects</strong>
</P>
<p font class="Normal">
The best way to begin describing the FreeBSD VM system is to
look at it from the perspective of a user-level process. Each
user process sees a single, private, contiguous VM address
space containing several types of memory objects. These objects
have various characteristics. Program code and program data
are effectively a single memory-mapped file (the binary file
being run), but program code is read-only while program data
is copy-on-write. Program BSS is just memory allocated and
filled with zeros on demand, called demand zero page fill.
Arbitrary files can be memory-mapped into the address space as
well, which is how the shared library mechanism works. Such
mappings can require modifications to remain private to the
process making them. The fork system call adds an entirely
new dimension to the VM management problem on top of the
complexity already given.
</P>
<p font class="Normal">
A program binary data page (which is a basic copy-on-write
page) illustrates the complexity. A program binary contains
a preinitialized data section which is initially mapped directly
from the program file. When a program is loaded into a process's
VM space, this area is initially memory-mapped and backed by
the program binary itself, allowing the VM system to free/reuse
the page and later load it back in from the binary. The moment
a process modifies this data, however, the VM system must make
a private copy of the page for that process. Since the private
copy has been modified, the VM system may no longer free it,
because there is no longer any way to restore it later on.
</P>
<p font class="Normal">
You will notice immediately that what was originally a simple
file mapping has become much more complex. Data may be modified
on a page-by-page basis whereas the file mapping encompasses
many pages at once. The complexity further increases when a
process forks. When a process forks, the result is two processes
-- each with their own private address spaces, including any
modifications made by the original process prior to the call
to fork(). It would be silly for the VM system to make a
complete copy of the data at the time of the fork() because it
is quite possible that at least one of the two processes will
only need to read from that page from then on, allowing the
original page to continue to be used. What was a private page
is made copy-on-write again, since each process (parent and
child) expects their own personal post-fork modifications to
remain private to themselves and not affect the other.
</P>
<p font class="Normal">
FreeBSD manages all of this with a layered VM Object model.
The original binary program file winds up being the lowest VM
Object layer. A copy-on-write layer is pushed on top of that
to hold those pages which had to be copied from the original
file. If the program modifies a data page belonging to the
original file the VM system takes a fault and makes a copy of
the page in the higher layer. When a process forks, additional
VM Object layers are pushed on.
This might make a little more sense with a fairly basic example.
A fork() is a common operation for any *BSD system, so this
example will consider a program that starts up, and forks.
When the process starts, the VM system creates an object layer,
let's call this A:
<center><img src="fig1.gif"></center>
<p font class="Normal">
A represents the file--pages may be paged in and out of the file's
physical media as necessary. Paging in from the disk is reasonable
for a program, but we really don't want to page back out and
overwrite the executable. The VM system therefore creates a second
layer, B, that will be physically backed by swap space:
<center><img src="fig2.gif"></center>
<p font class="Normal">
On the first write to a page after this, a new page is created in
B, and its contents are initialized from A. All pages in B can be
paged in or out to a swap device. When the program forks, the
VM system creates two new object layers--C1 for the parent, and C2
for the child--that rest on top of B:
<center><img src="fig3.gif"></center>
<p font class="Normal">
In this case, let's say a page in B is modified by the original
parent process. The process will take a copy-on-write fault and
duplicate the page in C1, leaving the original page in B untouched.
Now, let's say the same page in B is modified by the child process.
The process will take a copy-on-write fault and duplicate the page
in C2. The original page in B is now completely hidden since both
C1 and C2 have a copy, and B could theoretically be destroyed if it
does not represent a 'real' file. However, this sort of
optimization is not trivial to make because it is so fine-grained.
FreeBSD does not make this optimization.
Now, suppose (as is often the case) that the child process does an
exec(). Its current address space is usually replaced by a new
address space representing a new file. In this case, the C2 layer
is destroyed:
<center><img src="fig4.gif"></center>
<p font class="Normal">
In this case, the number of children of B drops to one, and all
accesses to B now go through C1. This means that B and C1 can
be collapsed together. Any pages in B that also exist in C1 are
deleted from B during the collapse. Thus, even though the
optimization in the previous step could not be made, we can
recover the dead pages when either of the processes exit or exec().
</P>
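<p font class="Normal">
To make the layering concrete, here is a minimal userspace sketch of the
shadow-chain lookup and copy-on-write step described above. The structure
and function names (vm_object, write_fault) are illustrative only and do
not match FreeBSD's actual data structures.
</P>
<pre>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NPAGES    8

struct vm_object {
    struct vm_object *backing;      /* next lower layer (C -> B -> A) */
    char *pages[NPAGES];            /* NULL if this layer has no copy */
};

/* Walk the chain from the top and return the topmost copy of the page. */
static char *lookup(struct vm_object *top, int idx, struct vm_object **owner)
{
    for (struct vm_object *o = top; o != NULL; o = o->backing)
        if (o->pages[idx]) { *owner = o; return o->pages[idx]; }
    *owner = NULL;
    return NULL;
}

/* On a write fault, copy the page into the top layer if it lives below. */
static char *write_fault(struct vm_object *top, int idx)
{
    struct vm_object *owner;
    char *src = lookup(top, idx, &owner);
    if (owner == top)
        return src;                     /* already private: nothing to do */
    char *copy = calloc(1, PAGE_SIZE);  /* zero-fill if no layer has it */
    if (src)
        memcpy(copy, src, PAGE_SIZE);   /* shadow the lower layer's page */
    top->pages[idx] = copy;
    return copy;
}
</pre>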
<p font class="Normal">
This model creates a number of potential problems. The first
is that you can wind up with a relatively deep stack of layered
VM Objects which can cost scanning time and memory when you
take a fault. Deep layering can occur when processes
fork and then fork again (either parent or child).
The second problem is that you can wind up with dead,
inaccessible pages deep in the stack of VM Objects. In our
last example if both the parent and child processes modify the
same page, they both get their own private copies of the page
and the original page in B is no longer accessible by anyone.
That page in B can be freed.
</P>
<p font class="Normal">
FreeBSD solves the deep layering problem with a special optimization
called the "All Shadowed Case". This case occurs if either C1 or C2
take sufficient COW faults to completely shadow all pages in B. Let's
say that C1 achieves this. C1 can now bypass B entirely, so rather
than have C1->B->A and C2->B->A we now have C1->A and C2->B->A. But
look what also happened -- now B has only one reference (C2), so we
can collapse B and C2 together. The end result is that B is deleted
entirely and we have C1->A and C2->A. It is often the case that B
will contain a large number of pages and neither C1 nor C2 will be
able to completely overshadow it. If we fork again and create a set
of D layers, however, it is much more likely that one of the D layers
will eventually be able to completely overshadow the much smaller dataset
represented by C1 or C2. The same optimization will work at any point
in the graph and the grand result of this is that even on a heavily forked
machine VM Object stacks tend to not get much deeper than 4. This is
true of both the parent and the children and true whether the parent is
doing the forking or whether the children cascade forks.
</P>
<p font class="Normal">
The dead page problem still exists in the case where C1 or C2 do not
completely overshadow B. Due to our other optimizations this case does
not represent much of a problem and we simply allow the pages to be dead.
If the system runs low on memory it will swap them out, eating a little
swap, but that's it.
</P>
<p font class="Normal">
The advantage to the VM Object model is that fork() is extremely
fast, since no real data copying need take place. The disadvantage
is that you can build a relatively complex VM Object layering
that slows page fault handling down a little, and you spend
memory managing the VM Object structures. The optimizations
FreeBSD makes prove to reduce the problems enough that they
can be ignored, leaving no real disadvantage.
</P>
<p font class="Normal">
<strong>SWAP Layers</strong>
</P>
<p font class="Normal">
Private data pages are initially either copy-on-write or
zero-fill pages. When a change, and therefore a copy, is made,
the original backing object (usually a file) can no longer be
used to save a copy of the page when the VM system needs to
reuse it for other purposes. This is where SWAP comes in.
SWAP is allocated to create backing store for memory that does
not otherwise have it. FreeBSD allocates the swap management
structure for a VM Object only when it is actually needed.
However, the swap management structure has had problems
historically.
</P>
<p font class="Normal">
Under FreeBSD 3.x the swap management structure preallocates
an array that encompasses the entire object requiring swap
backing store -- even if only a few pages of that object are
swap-backed. This creates a kernel memory fragmentation problem
when large objects are mapped, or processes with large resident set
sizes (RSS) fork. Also, in order to keep track of swap space, a
'list of holes' is kept in kernel memory, and this tends to
get severely fragmented as well. Since the 'list of holes' is
a linear list, the swap allocation and freeing performance is
a non-optimal O(n)-per-page. It also requires kernel memory
allocations to take place during the swap freeing process, and
that creates low memory deadlock problems. The problem is
further exacerbated by holes created due to the interleaving
algorithm. Also, the swap block map can become fragmented fairly
easily resulting in non-contiguous allocations. Kernel memory
must also be allocated on the fly for additional swap management
structures when a swapout occurs. It is evident that there
was plenty of room for improvement.
</P>
<p font class="Normal">
For FreeBSD 4.x, I completely rewrote the swap subsystem. With
this rewrite, swap management structures are allocated through
a hash table rather than a linear array giving them a fixed
allocation size and much finer granularity. Rather than using
a linearly linked list to keep track of swap space reservations,
it now uses a bitmap of swap blocks arranged in a radix tree
structure with free-space hinting in the radix node structures.
This effectively makes swap allocation and freeing an O(1)
operation. The entire radix tree bitmap is also preallocated
in order to avoid having to allocate kernel memory during
critical low memory swapping operations. After all, the system
tends to swap when it is low on memory so we should avoid
allocating kernel memory at such times in order to avoid
potential deadlocks. Finally, to reduce fragmentation the
radix tree is capable of allocating large contiguous chunks at
once, skipping over smaller fragmented chunks. I did not take
the final step of having an 'allocating hint pointer' that
would trundle through a portion of swap as allocations were
made in order to further guarantee contiguous allocations or
at least locality of reference, but I ensured that such an
addition could be made.
</P>
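<p font class="Normal">
The following is a deliberately simplified sketch of the idea behind the
new allocator: preallocated bitmaps of swap blocks with a free-space hint
kept above them so that full regions can be skipped without scanning.
It is a flat, two-level illustration, not the actual radix-tree code.
</P>
<pre>
#include <stdint.h>

#define BLOCKS_PER_LEAF 32
#define NLEAVES         64          /* 2048 swap blocks, for the sketch */

static uint32_t leaf[NLEAVES];      /* bit set = swap block allocated */
static int      leaf_free[NLEAVES]; /* hint: free blocks under each leaf */

static void swap_init(void)
{
    for (int i = 0; i < NLEAVES; i++)
        leaf_free[i] = BLOCKS_PER_LEAF;
}

/* Allocate one swap block.  The hint lets us skip full leaves without
 * touching their bitmaps, which is what keeps the operation cheap. */
static int swap_alloc(void)
{
    for (int i = 0; i < NLEAVES; i++) {
        if (leaf_free[i] == 0)
            continue;               /* hint says nothing free here */
        for (int b = 0; b < BLOCKS_PER_LEAF; b++) {
            if (!(leaf[i] & (1u << b))) {
                leaf[i] |= 1u << b;
                leaf_free[i]--;
                return i * BLOCKS_PER_LEAF + b;
            }
        }
    }
    return -1;                      /* swap exhausted */
}

static void swap_free_block(int blk)
{
    int i = blk / BLOCKS_PER_LEAF, b = blk % BLOCKS_PER_LEAF;
    leaf[i] &= ~(1u << b);
    leaf_free[i]++;
}
</pre>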
<p font class="Normal">
<strong>When To Free a Page</strong>
</P>
<p font class="Normal">
Since the VM system uses all available memory for disk caching,
there are usually very few truly-free pages. The VM system
depends on being able to properly choose pages which are not
in use to reuse for new allocations. Selecting the optimal
pages to free is possibly the single-most important function
any VM system can perform because if it makes a poor selection,
the VM system may be forced to unnecessarily retrieve pages
from disk, seriously degrading system performance.
</P>
<p font class="Normal">
How much overhead are we willing to suffer in the critical path
to avoid freeing the wrong page? Each wrong choice we make
will cost us hundreds of thousands of CPU cycles and a noticeable
stall of the affected processes, so we are willing to endure
a significant amount of overhead in order to be sure that the
right page is chosen. This is why FreeBSD tends to outperform
other systems when memory resources become stressed.
</P>
<p font class="Normal">
<table border="0" cellspacing="5" cellpadding="5">
<tr>
<td width="65%" align="top"><p font class="Normal">
The free page determination algorithm is built upon a history
of the use of memory pages. To acquire this history, the system
takes advantage of a page-used bit feature that most hardware
page tables have. </p>
<p font class="Normal"> In any
case, the page-used bit is cleared and at some later point the
VM system comes across the page again and sees that the page-used
bit has been set. This indicates that the page is still being
actively used. If the bit is still clear it is an indication
that the page is not being actively used. By testing this bit
periodically, a use history (in the form of a counter) for the
physical page is developed. When the VM system later needs to
free up some pages, checking this history becomes the cornerstone
of determining the best candidate page to reuse.
</p>
</td>
<td width="35%" align="top" bgcolor="#dadada"><font size="-1"><center><strong>What if the hardware
has no page-used bit?</strong></center><br>
<p font class="Normal">For those platforms that do not have
this feature, the system actually emulates a page-used bit.
It unmaps or protects a page, forcing a page fault if the page
is accessed again. When the page fault is taken, the system
simply marks the page as having been used and unprotects the
page so that it may be used. While taking such page faults
just to determine if a page is being used appears to be an
expensive proposition, it is much less expensive than reusing
the page for some other purpose only to find that a process
needs it back and then have to go to disk.</P></font>
</td>
</tr>
</table>
</P>
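<p font class="Normal">
A rough sketch of that scanning step follows. The field names (referenced,
act_count) are illustrative; the point is simply that the page-used bit is
sampled and cleared on each pass and folded into a per-page counter.
</P>
<pre>
#define ACT_MAX 64

struct page {
    int referenced;   /* stands in for the hardware page-used bit */
    int act_count;    /* accumulated use history */
};

/* Called periodically for each in-use page by the page scanner. */
static void scan_page(struct page *p)
{
    if (p->referenced) {
        p->referenced = 0;          /* clear the bit for the next pass */
        if (p->act_count < ACT_MAX)
            p->act_count++;         /* recently used: raise its score */
    } else if (p->act_count > 0) {
        p->act_count--;             /* idle: let the history decay */
    }
    /* Pages whose act_count reaches zero are the best reclaim candidates. */
}
</pre>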
<p font class="Normal">
FreeBSD makes use of several page queues to further refine the
selection of pages to reuse as well as to determine when dirty
pages must be flushed to their backing store. Since page tables
are dynamic entities under FreeBSD, it costs virtually nothing
to unmap a page from the address space of any processes using
it. When a page candidate has been chosen based on the page-use
counter, this is precisely what is done. The system must make
a distinction between clean pages which can theoretically be
freed up at any time, and dirty pages which must first be
written to their backing store before being reusable. When a
page candidate has been found it is moved to the inactive queue
if it is dirty, or the cache queue if it is clean. A separate
algorithm based on the dirty-to-clean page ratio determines
when dirty pages in the inactive queue must be flushed to disk.
Once this is accomplished, the flushed pages are moved from
the inactive queue to the cache queue. At this point, pages
in the cache queue can still be reactivated by a VM fault at
relatively low cost. However, pages in the cache queue are
considered to be 'immediately freeable' and will be reused in
an LRU (least-recently used) fashion when the system needs to
allocate new memory.
</P>
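<p font class="Normal">
In outline, the queue transitions just described look something like the
sketch below (a simplification; the real pageout code also handles wiring,
busy pages, laundering limits, and much more).
</P>
<pre>
enum queue { ACTIVE, INACTIVE, CACHE, FREE };

struct vpage { enum queue q; int dirty; };

/* A candidate chosen by the use-history scan is unmapped and then parked
 * on the inactive queue if dirty, or the cache queue if already clean. */
static void deactivate(struct vpage *p)
{
    /* pmap-level unmapping of the page would happen here */
    p->q = p->dirty ? INACTIVE : CACHE;
}

/* Called once the dirty page has been written to its backing store. */
static void laundered(struct vpage *p)
{
    p->dirty = 0;
    p->q = CACHE;                   /* now 'immediately freeable' */
}
</pre>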
<p font class="Normal">
It is important to note that the FreeBSD VM system attempts to
separate clean and dirty pages for the express reason of avoiding
unnecessary flushes of dirty pages (which eats I/O bandwidth),
nor does it move pages between the various page queues gratuitously
when the memory subsystem is not being stressed. This is why
you will see some systems with very low cache queue counts and
high active queue counts when doing a 'systat -vm' command.
As the VM system becomes more stressed, it makes a greater
effort to maintain the various page queues at the levels
determined to be the most effective. An urban myth has circulated
for years that Linux did a better job avoiding swapouts than
FreeBSD, but this in fact is not true. What was actually
occurring was that FreeBSD was proactively paging out unused
pages in order to make room for more disk cache while Linux
was keeping unused pages in core and leaving less memory
available for cache and process pages. I don't know whether
this is still true today.
</P>
<p font class="Normal">
<strong>Pre-Faulting and Zeroing Optimizations</strong>
</P>
<p font class="Normal">
Taking a VM fault is not expensive if the underlying page is
already in core and can simply be mapped into the process, but
it can become expensive if you take a whole lot of them on a
regular basis. A good example of this is running a program
such as 'ls' or 'ps' over and over again. If the program binary
is mapped into memory but not mapped into the page table, then
all the pages that will be accessed by the program will have
to be faulted in every time the program is run. This is
unnecessary when the pages in question are already in the VM
Cache, so FreeBSD will attempt to pre-populate a process's page
tables with those pages that are already in the VM Cache. One
thing that FreeBSD does not yet do is pre-copy-on-write certain
pages on exec. For example, if you run the /bin/ls program
while running 'vmstat 1' you will notice that it always takes
a certain number of page faults, even when you run it over and
over again. These are zero-fill faults, not program code faults
(which were pre-faulted in already). Pre-copying pages on exec
or fork is an area that could use more study.
</P>
<p font class="Normal">
A large percentage of page faults that occur are zero-fill
faults. You can usually see this by observing the 'vmstat -s'
output. These occur when a process accesses pages in its BSS
area. The BSS area is expected to be initially zero but the
VM system does not bother to allocate any memory at all until
the process actually accesses it. When a fault occurs the VM
system must not only allocate a new page, it must zero it as
well. To optimize the zeroing operation the VM system has the
ability to pre-zero pages and mark them as such, and to request
pre-zeroed pages when zero-fill faults occur. The pre-zeroing
occurs whenever the CPU is idle but the number of pages the
system pre-zeros is limited in order to avoid blowing away the
memory caches. This is an excellent example of adding complexity
to the VM system in order to optimize the critical path.
</P>
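<p font class="Normal">
A toy model of the pre-zeroing idea, with a bound on how much the idle
loop zeros per pass, might look like this (the names and the limit are
invented for illustration):
</P>
<pre>
#include <string.h>

#define PAGE_SIZE     4096
#define NFREE         128
#define PREZERO_LIMIT 16            /* illustrative cap per idle pass */

static struct { char data[PAGE_SIZE]; int zeroed; } freep[NFREE];
static int nfree = NFREE;

/* Run from the idle loop: zero a bounded number of free pages and mark
 * them so a later zero-fill fault can skip the work. */
static void idle_prezero(void)
{
    int done = 0;
    for (int i = 0; i < nfree && done < PREZERO_LIMIT; i++) {
        if (!freep[i].zeroed) {
            memset(freep[i].data, 0, PAGE_SIZE);
            freep[i].zeroed = 1;
            done++;
        }
    }
}

/* A zero-fill fault takes a free page and only zeroes it if the idle
 * loop has not already done so. */
static void *zero_fill_fault(void)
{
    if (nfree == 0)
        return 0;                   /* out of pages: pageout must run */
    int i = --nfree;
    if (!freep[i].zeroed)
        memset(freep[i].data, 0, PAGE_SIZE);
    freep[i].zeroed = 0;            /* the process will dirty it now */
    return freep[i].data;
}
</pre>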
<p font class="Normal">
<strong>Page Table Optimizations</strong>
</P>
<p font class="Normal">
The page table optimizations make up the most contentious part
of the FreeBSD VM design and they have shown some strain with
the advent of serious use of mmap(). I think this is actually
a feature of most BSDs though I am not sure when it was first
introduced. There are two major optimizations. The first is
that hardware page tables do not contain persistent state but
instead can be thrown away at any time with only a minor amount
of management overhead. The second is that every active page
table entry in the system has a governing pv_entry structure
which is tied into the vm_page structure. FreeBSD can simply
iterate through those mappings that are known to exist while
Linux must check all page tables that *might* contain a specific
mapping to see if it does, which can achieve O(n^2) overhead
in certain situations. It is because of this that FreeBSD
tends to make better choices on which pages to reuse or swap
when memory is stressed, giving it better performance under
load. However, FreeBSD requires kernel tuning to accommodate
large-shared-address-space situations such as those that can
occur in a news system because it may run out of pv_entry
structures.
</P>
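<p font class="Normal">
The pv_entry idea reduces to a linked list hanging off each physical page,
so that removing every mapping of a page touches only the page table
entries that actually exist. A minimal sketch, with illustrative types
rather than FreeBSD's real ones:
</P>
<pre>
#include <stddef.h>

typedef unsigned long pte_t;
#define PTE_VALID 0x1

struct pv_entry {
    pte_t           *pte;           /* the page-table slot mapping this page */
    struct pv_entry *next;
};

struct vm_page {
    struct pv_entry *pv_list;       /* every active mapping of this page */
};

/* Invalidate every mapping of one physical page: the cost is the number
 * of mappings, not the number of page tables in the system. */
static void page_remove_all(struct vm_page *m)
{
    for (struct pv_entry *pv = m->pv_list; pv != NULL; pv = pv->next)
        *pv->pte &= ~PTE_VALID;     /* a real pmap would also flush the TLB */
    m->pv_list = NULL;              /* the pv_entry structures would be freed */
}
</pre>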
<p font class="Normal">
Both Linux and FreeBSD need work in this area. FreeBSD is
trying to maximize the advantage of a potentially sparse
active-mapping model (not all processes need to map all pages
of a shared library, for example), whereas Linux is trying to
simplify its algorithms. FreeBSD generally has the performance
advantage here at the cost of wasting a little extra memory,
but FreeBSD breaks down in the case where a large file is
massively shared across hundreds of processes. Linux, on the
other hand, breaks down in the case where many processes are
sparsely-mapping the same shared library and also runs non-optimally
when trying to determine whether a page can be reused or not.
</P>
<p font class="Normal">
<strong>Page Coloring</strong>
</P>
<p font class="Normal">
We'll end with the page coloring optimizations. Page coloring
is a performance optimization designed to ensure that accesses
to contiguous pages in virtual memory make the best use of the
processor cache. In ancient times (i.e. 10+ years ago) processor
caches tended to map virtual memory rather than physical memory.
This led to a huge number of problems including having to clear
the cache on every context switch in some cases, and problems
with data aliasing in the cache. Modern processor caches map
physical memory precisely to solve those problems. This means
that two side-by-side pages in a process's address space may
not correspond to two side-by-side pages in the cache. In
fact, if you aren't careful side-by-side pages in virtual memory
could wind up using the same page in the processor cache --
leading to cacheable data being thrown away prematurely and
reducing CPU performance. This is true even with multi-way
set-associative caches (though the effect is mitigated somewhat).
</P>
<p font class="Normal">
FreeBSD's memory allocation code implements page coloring
optimizations, which means that the memory allocation code will
attempt to locate free pages that are contiguous from the point
of view of the cache. For example, if page 16 of physical
memory is assigned to page 0 of a process's virtual memory and
the cache can hold 4 pages, the page coloring code will not
assign page 20 of physical memory to page 1 of a process's
virtual memory. It would, instead, assign page 21 of physical
memory. The page coloring code attempts to avoid assigning
page 20 because this maps over the same cache memory as page
16 and would result in non-optimal caching. This code adds a
significant amount of complexity to the VM memory allocation
subsystem as you can well imagine, but the result is well worth
the effort. Page Coloring makes VM memory as deterministic as
physical memory in regards to cache performance.
</P>
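<p font class="Normal">
In code, the selection rule from the example above amounts to comparing
colors -- the page number modulo the number of pages the cache covers --
and preferring a free physical page whose color matches the virtual page.
This sketch uses the 4-page cache from the example; the function names are
invented for illustration.
</P>
<pre>
#define CACHE_PAGES 4                   /* the 4-page cache from the example */

static int page_color(unsigned long pgno)
{
    return (int)(pgno % CACHE_PAGES);
}

/* Pick a free physical page for virtual page vpgno, preferring a matching
 * color; fall back to any free page if none of that color is available.
 * With this rule, virtual page 1 gets physical page 21 rather than 20. */
static long alloc_colored(const int *is_free, long npages, unsigned long vpgno)
{
    int  want = page_color(vpgno);
    long fallback = -1;
    for (long p = 0; p < npages; p++) {
        if (!is_free[p])
            continue;
        if (page_color(p) == want)
            return p;
        if (fallback < 0)
            fallback = p;
    }
    return fallback;
}
</pre>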
<p font class="Normal">
<strong>Conclusion</strong>
</P>
<p font class="Normal">
Virtual memory in modern operating systems must address a number
of different issues efficiently and for many different usage
patterns. The modular and algorithmic approach that BSD has
historically taken allows us to study and understand the current
implementation as well as relatively cleanly replace large
sections of the code. There have been a number of improvements
to the FreeBSD VM system in the last several years, and work
is ongoing.
</P>
<hr>
<p font class="Normal">
<h2>A Bonus Question and Answer Session by Allen Briggs </h2><a href="mailto:briggs@ninthwonder.com">&#60;briggs@ninthwonder.com&#62;</a>
</P>
<p font class="Normal">
Q: What is "the interleaving algorithm" that you refer to in your listing
of the ills of the FreeBSD 3.x swap arrangements?
</P>
<blockquote>
<p font class="Normal">
A: FreeBSD uses a fixed swap interleave which defaults to 4. This means
that FreeBSD reserves space for four swap areas even if you only have one,
two, or three. Since swap is interleaved the linear address space
representing the 'four swap areas' will be fragmented if you don't actually
have four swap areas. For example, if you have two swap areas A and B,
FreeBSD's linear address space representation of swap will be
interleaved in blocks of 16 pages:
</P>
<p font class="Normal">
A B C D A B C D A B C D A B C D
</P>
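<p font class="Normal">
Arithmetically, the interleave maps a linear swap block number onto one of
the four slots in rotation, 16 pages at a time; with fewer than four real
devices the remaining slots are permanent holes. A small sketch (names
invented for illustration):
</P>
<pre>
#define SWAP_INTERLEAVE 4
#define STRIPE_PAGES    16

struct swap_loc { int device; long offset; };

static struct swap_loc swap_map(long blkno, int ndevices)
{
    struct swap_loc loc;
    long stripe = blkno / STRIPE_PAGES;

    loc.device = (int)(stripe % SWAP_INTERLEAVE);     /* 0=A, 1=B, 2=C, 3=D */
    loc.offset = (stripe / SWAP_INTERLEAVE) * STRIPE_PAGES
               + blkno % STRIPE_PAGES;
    if (loc.device >= ndevices)
        loc.device = -1;                              /* a hole: no such device */
    return loc;
}
</pre>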
<p font class="Normal">
FreeBSD 3.x uses a 'sequential list of free regions' approach to accounting
for the free swap areas. The idea is that large blocks of free linear
space can be represented with a single list node (kern/subr_rlist.c).
But due to the fragmentation the sequential list winds up being insanely
fragmented. In the above example, completely unused swap will have A and
B shown as 'free' and C and D shown as 'all allocated'. Each A-B sequence
requires its own list node because C and D are holes, so the list
node cannot be combined with the next A-B sequence.
</P>
<p font class="Normal">
Why do we interleave our swap space instead of just tacking swap areas onto
the end and do something fancier? Because it's a whole lot easier to
allocate linear swaths of an address space and have the result
automatically be interleaved across multiple disks than it is to try to
put that sophistication elsewhere.
</P>
<p font class="Normal">
The fragmentation causes other problems. Because the free list under
3.x is linear and has such a huge amount of inherent fragmentation,
allocating and freeing swap winds up being an O(N) algorithm instead of
an O(1) algorithm. Combine that with other factors (heavy swapping) and you
start getting into O(N^2) and O(N^3) levels of overhead, which is bad.
The 3.x system may also need to allocate KVM during a swap operation to
create a new list node which can lead to a deadlock if the system is
trying to pageout pages in a low-memory situation.
</P>
<p font class="Normal">
Under 4.x we do not use a sequential list. Instead we use a radix tree
and bitmaps of swap blocks rather than ranged list nodes. We take the
hit of preallocating all the bitmaps required for the entire swap
area up front but it winds up wasting less memory due to the use of
a bitmap (one bit per block) instead of a linked list of nodes. The
use of a radix tree instead of a sequential list gives us nearly O(1)
performance no matter how fragmented the tree becomes.
</P>
</blockquote>
<p font class="Normal">
Q: I don't get the following:
</P>
<p font class="Normal">
It is important to note that the FreeBSD VM system attempts to separate
clean and dirty pages for the express reason of avoiding unnecessary
flushes of dirty pages (which eats I/O bandwidth), nor does it move
pages between the various page queues gratuitously when the memory subsystem
is not being stressed. This is why you will see some systems with
very low cache queue counts and high active queue counts when doing a
'systat -vm' command.
</P>
<p font class="Normal">
Q: How is the separation of clean and dirty (inactive) pages related to the
situation where you see low cache queue counts and high active queue counts
in 'systat -vm'? Do the systat stats roll the active and dirty pages
together for the active queue count?
</P>
<blockquote>
<p font class="Normal">
A: Yes, that is confusing. The relationship is "goal" versus "reality".
Our goal is to separate the pages but the reality is that if we are not
in a memory crunch, we don't really have to.
</P>
<p font class="Normal">
What this means is that FreeBSD will not try very hard to separate out
dirty pages (inactive queue) from clean pages (cache queue) when the
system is not being stressed, nor will it try to deactivate pages
(active queue -&#62; inactive queue) when the system is not being stressed,
even if they aren't being used.
</P>
</blockquote>
<p font class="Normal">
Q: In the /bin/ls / 'vmstat 1' example, wouldn't some of the page faults be
data page faults (COW from executable file to private page)? I.e., I
would expect the page faults to be some zero-fill and some program data.
Or are you implying that FreeBSD does do pre-COW for the program data?
</P>
<blockquote>
<p font class="Normal">
A: A COW fault can be either zero-fill or program-data. The mechanism
is the same either way because the backing program-data is almost
certainly already in the cache. I am indeed lumping the two together.
FreeBSD does not pre-COW program data or zero-fill, but it *does*
pre-map pages that exist in its cache.
</P>
</blockquote>
<p font class="Normal">
Q: In your section on page table optimizations, can you give a little more
detail about pv_entry and vm_page (or should vm_page be vm_pmap -- as
in 4.4, cf. pp. 180-181 of McKusick, Bostic, Karels, Quarterman)?
Specifically, what kind of operation/reaction would require scanning the
mappings?
</P>
<p font class="Normal">
How does Linux do in the case where FreeBSD breaks down (sharing a large
file mapping over many processes)?
</P>
<blockquote>
<p font class="Normal">
A: A vm_page represents an (object,index#) tuple. A pv_entry represents
a hardware page table entry (pte). If you have five processes sharing
the same physical page, and three of those processes' page tables
actually map the page, that page will be represented by a single
vm_page structure and three pv_entry structures.
</P>
<p font class="Normal">
pv_entry structures only represent pages mapped by the MMU (one
pv_entry represents one pte). This means that when we need to remove
all hardware references to a vm_page (in order to reuse the page for
something else, page it out, clear it, dirty it, and so forth) we can
simply scan the linked list of pv_entry's associated with that vm_page
to remove or modify the pte's from their page tables.
</P>
<p font class="Normal">
Under Linux there is no such linked list. In order to remove all the
hardware page table mappings for a vm_page linux must index into every
VM object that *might* have mapped the page. For example, if you have
50 processes all mapping the same shared library and want to get rid of
page X in that library, you need to index into the page table for
each of those 50 processes even if only 10 of them have actually mapped
the page. So Linux is trading off the simplicity of its design against
performance. Many VM algorithms which are O(1) or O(small N) under FreeBSD
wind up being O(N), O(N^2), or worse under Linux. Since the pte's
representing a particular page in an object tend to be at the same
offset in all the page tables they are mapped in, reducing the number
of accesses into the page tables at the same pte offset will often avoid
blowing away the L1 cache line for that offset, which can lead to better
performance.
</P>
<p font class="Normal">
FreeBSD has added complexity (the pv_entry scheme) in order to increase
performance (to limit page table accesses to *only* those pte's that need
to be modified).
</P>
<p font class="Normal">
But FreeBSD has a scaling problem that Linux does not in that there are
a limited number of pv_entry structures and this causes problems when you
have massive sharing of data. In this case you may run out of pv_entry
structures even though there is plenty of free memory available. This
can be fixed easily enough by bumping up the number of pv_entry structures
in the kernel config, but we really need to find a better way to do it.
</P>
<p font class="Normal">
In regards to the memory overhead of a page table versus the pv_entry
scheme: Linux uses 'permanent' page tables that are not thrown away, but
does not need a pv_entry for each potentially mapped pte. FreeBSD uses
'throw away' page tables but adds in a pv_entry structure for each
actually-mapped pte. I think memory utilization winds up being about
the same, giving FreeBSD an algorithmic advantage with its ability to
throw away page tables at will with very low overhead.
</P>
</blockquote>
<p font class="Normal">
Q: Finally, in the page coloring section, it might help to have a little
more description of what you mean here. I didn't quite follow it.
</P>
<blockquote>
<p font class="Normal">
A: Do you know how an L1 hardware memory cache works? I'll explain:
Consider a machine with 16MB of main memory but only 128K of L1 cache.
Generally the way this cache works is that each 128K block of main memory
uses the *same* 128K of cache. If you access offset 0 in main memory
and then offset 128K in main memory you can wind up throwing
away the cached data you read from offset 0!
</P>
<p font class="Normal">
Now, I am simplifying things greatly. What I just described is what
is called a 'direct mapped' hardware memory cache. Most modern caches
are what are called 2-way-set-associative or 4-way-set-associative
caches. The set-associativity allows you to access up to N different
memory regions that overlap the same cache memory without destroying
the previously cached data. But only N.
</P>
<p font class="Normal">
So if I have a 4-way set associative cache I can access offset 0,
offset 128K, 256K and offset 384K and still be able to access offset 0
again and have it come from the L1 cache. If I then access offset 512K,
however, one of the four previously cached data objects will be thrown
away by the cache.
</P>
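<p font class="Normal">
The arithmetic behind that example is just a modulo. For a 128K
direct-mapped cache every physical address competes for the slot at
(address % 128K), so offsets 0 and 128K collide while 128K and 132K do
not; an N-way cache tolerates N such collisions per slot before evicting
something. A quick illustration:
</P>
<pre>
#include <stdio.h>

#define CACHE_SIZE (128u * 1024u)      /* the 128K cache from the example */

/* In a direct-mapped cache every address lands at (addr % CACHE_SIZE). */
static unsigned long cache_slot(unsigned long addr)
{
    return addr % CACHE_SIZE;
}

int main(void)
{
    printf("offset 0    -> slot %lu\n", cache_slot(0));
    printf("offset 128K -> slot %lu\n", cache_slot(128 * 1024)); /* collides with 0 */
    printf("offset 132K -> slot %lu\n", cache_slot(132 * 1024)); /* 4K: no collision */
    return 0;
}
</pre>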
<p font class="Normal">
It is extremely important... EXTREMELY important for most of a processor's
memory accesses to be able to come from the L1 cache, because the L1
cache operates at the processor frequency. The moment you have an L1
cache miss and have to go to the L2 cache or to main memory, the processor
will stall and potentially sit twiddling its fingers for *hundreds* of
instructions worth of time waiting for a read from main memory to complete.
Main memory (the dynamic ram you stuff into a computer) is S.L.O.W.,
capitalized and boldfaced, when compared to the speed of a modern
processor core.
</P>
<p font class="Normal">
Ok, so now onto page coloring: All modern memory caches are what are
known as *physical* caches. They cache physical memory addresses, not
virtual memory addresses. This allows the cache to be left alone across
a process context switch, which is very important.
</P>
<p font class="Normal">
But in the UNIX world you are dealing with virtual address spaces, not
physical address spaces. Any program you write will see the virtual
address space given to it. The actual *physical* pages underlying that
virtual address space are not necessarily physically contiguous! In
fact, you might have two pages that are side by side in a process's
address space which wind up being at offset 0 and offset 128K in
*physical* memory.
</P>
<p font class="Normal">
A program normally assumes that two side-by-side pages will be optimally
cached. That is, that you can access data objects in both pages without
having them blow away each other's cache entry. But this is only true
if the physical pages underlying the virtual address space are contiguous
(insofar as the cache is concerned).
</P>
<p font class="Normal">
This is what Page coloring does. Instead of assigning *random* physical
pages to virtual addresses, which may result in non-optimal cache
performance, Page coloring assigns *reasonably-contiguous* physical pages
to virtual addresses. Thus programs can be written under the assumption
that the characteristics of the underlying hardware cache are the same
for their virtual address space as they would be if the program had been
run directly in a physical address space.
</P>
<p font class="Normal">
Note that I say 'reasonably' contiguous rather than simply 'contiguous'.
From the point of view of a 128K direct mapped cache, the physical
address 0 is the same as the physical address 128K. So two side-by-side
pages in your virtual address space may wind up being offset 128K and
offset 132K in physical memory, but could also easily be offset 128K
and offset 4K in physical memory and still retain the same cache
performance characteristics. So page-coloring does *NOT* have to
assign truly contiguous pages of physical memory to contiguous pages
of virtual memory, it just needs to make sure it assigns contiguous
pages from the point of view of cache performance and operation.
</P>
<p font class="Normal">
Oops, that was a bit longer explanation than I intended.
</P>
<p font class="Normal">
-Matt
</P>
</blockquote>
<hr noshade color="#dadada"><br>
<font class="Small">Author maintains all copyrights on this article.<br></font>
<p align=right>Back to the <a href="..">OS/RC</a></p>
</body>
</html>

@@ -0,0 +1,401 @@
<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>Bringing SMP to Your UP Operating System</title>
<meta http-equiv="Cache-Control" content="must-revalidate">
</head>
<body bgcolor="#ffffff" vlink="#0000ff">
<table width=600 align=center>
<tr><td>
<center>
<h1>Bringing SMP to Your UP Operating System</h1>
Sidney Cammeresi
</center>
<br><br><br>
<blockquote>
<h3>Overview</h3>
<ol>
<li><a href="#preface">Preface</a></li>
<li><a href="#terms">Terminology</a></li>
<li><a href="#config">MP Configuration Tables</a></li>
<li><a href="#apic">Using Local APICs</a></li>
<li><a href="#booting">Booting Application Processors</a></li>
<li><a href="#protmode">Switching from Real Mode to Protected Mode</a></li>
<li><a href="#ipis">Interprocessor Interrupts</a></li>
<li><a href="#other">Other Considerations<br><br></a></li>
<li>Scheduling</li>
<li>Using the I/O APIC<br><br></li>
<!--
<li><a href="#scheduling">Scheduling</a></li>
<li><a href="#ioapic">Using the I/O APIC</a></li>
-->
<li><a href="#references">References</a></li>
<li><a href="#revsions">Revision History</a></li>
</ol>
</blockquote>
<br><br><br>
<h3>
<a name="preface">Preface</a></h3>
This tutorial is intended as a supplement to the
<a href="http://www.acm.uiuc.edu/sigops/roll_your_own">SigOps OS
Tutorial</a> to teach the fundamentals of symmetric multiprocessing
using Intel MP compliant hardware. Knowledge of the concepts and
implementations of basic operating system parts such as managing virtual
memory and multitasking are assumed and will not be discussed except as
they relate to multiprocessing. Knowledge equivalent to an intermediate
or advanced computer architecture college course will be helpful in
understanding scheduling issues, but is not required.
<p>
This tutorial is not intended to be a complete explanation of how to
implement an SMP-capable operating system, nor as a replacement for
Intel's documentation. Rather it is designed to
give an overview of the things I learned in writing SMP support for
<a href="http://www.frotz.net/openblt">OpenBLT</a>, a freely
redistributable microkernel-based operating system under the BSD
licence. Particularly, some tedious hardware
aspects will not be discussed in detail when the reader could just as
easily read official Intel documentation. The interested reader should
refer to the references for more detailed information. For code examples,
the reader should refer to the source code of
<a href="http://www.frotz.net/openblt">OpenBLT</a> or
<a href="http://www.freebsd.org">FreeBSD</a>. The Linux kernel source
code might be helpful, although it is under the GPL.
<p>
This tutorial is a work in progress. If you see an error or something
that needs clarification, please <a href="mailto:cammeres@uiuc.edu">e-mail
me</a>.
<p>
<h3>
<a name="terms">Terminology</a>
</h3>
<dl>
<dt>AP
<dd>application processor. A processor that is not the BSP. All APs are
in a halted state when the BIOS first gives control to the operating
system.
<dt>APIC
<dd>Advanced Programmable Interrupt Controller. Either a local APIC or
an I/O APIC. It is attached to the APIC bus.
<dt>APIC bus
<dd>A special non-architectural bus on which the APICs in the system
send messages.
<dt>BSP
<dd>bootstrap processor. The processor which is given control after the
BIOS finishes its POST.
<dt>I/O APIC
<dd>A special APIC for receiving and distributing interrupts from external
devices which is backward compatible with the PIC. There is generally
only one per computer.
<dt>IPI
<dd>interprocessor interrupt. A special interrupt sent from one processor
to another; the originating processor programs its APIC with a target
(or logical target) ID and an interrupt vector.
<dt>Local APIC
<dd>an APIC built in to the processor. It is responsible for dispatching
interrupts sent over the APIC bus to its processor core and sending
interrupts to other processors over the APIC bus.
<dt>MP
<dd>Intel's MultiProcessor Specification, a standard which defines how
SMP hardware should be presented to the operating system and how the
operating system should interact with this hardware.
<dt>serialisation
<dd>The act of executing a certain instruction which causes the processor
to pause to
retire all instructions currently being executed before proceeding
to the next instruction in the stream. For example, before switching
to protected mode, the processor must retire all instructions that
began executing in real mode before beginning any in protected mode.
<dt>SMP
<dd>symmetric multiprocessing. Using multiple processors which share
the same physical memory in the same computer at the same time. You are
probably reading this tutorial with the hope that your operating
system will become SMP-capable.
<dt>UP
<dd>uniprocessor. Your operating system to date is a UP operating system.
</dl>
<h3>
<a name="config">MP Detection and Configuration</a>
</h3>
When the system first starts, the BIOS detects the hardware installed in
the system using electrical means and then creates structures to describe
this hardware to the operating system. There are two such tables.
The first is the MP Floating Pointer Structure, which is required.
The second is the MP Configuration Table, which is optional. If the
configuration table does not exist, the operating system should set up
the default configuration indicated in the floating pointer structure.
Some data in the tables is in ASCII. Strings are padded with spaces
and are not null-terminated.
<p>
First, you need to find the floating pointer structure. According to
the spec, it can be in one of four places: (1) in the first kilobyte
of the extended BIOS data area, (2) the last kilobyte of base memory,
(3) the top of physical memory, or (4) the BIOS read-only memory space
between 0xe0000 and 0xfffff. You need to search these areas for the
four-byte signature "_MP_" which denotes the start of the floating
pointer structure. Absence of this structure indicates that the system
is not MP compliant. At this point your operating system can either halt,
or it can fall back into a UP setup.
<p>
You should checksum the structure to make sure it has not been corrupted.
There is not much of interest in the floating pointer structure, unless
your system does not have a configuration table. In this case, you will
need to get the number of the default configuration your system adheres to
and set up the system for SMP using those parameters. Otherwise, you will
need to get the address of the configuration table and begin parsing that.
<p>
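As a rough illustration, the search and checksum might look like the
following C sketch. It assumes the BIOS areas are identity-mapped (true
early in a flat-mapped kernel, not in a user program), and the structure
layout is abridged from the MP specification.
<pre>
#include <stdint.h>
#include <string.h>

struct mp_float {                 /* MP Floating Pointer Structure, abridged */
    char     signature[4];        /* "_MP_" */
    uint32_t config_addr;         /* physical address of the config table */
    uint8_t  length;              /* length in 16-byte units */
    uint8_t  spec_rev;
    uint8_t  checksum;            /* all bytes of the structure sum to zero */
    uint8_t  features[5];
};

static int checksum_ok(const uint8_t *p, unsigned len)
{
    uint8_t sum = 0;
    while (len--)
        sum += *p++;
    return sum == 0;
}

/* Scan a candidate region on 16-byte boundaries for the "_MP_" signature. */
static struct mp_float *mp_search(uintptr_t start, uintptr_t end)
{
    for (uintptr_t a = start; a + sizeof(struct mp_float) <= end; a += 16) {
        struct mp_float *f = (struct mp_float *)a;
        if (memcmp(f->signature, "_MP_", 4) == 0 &&
            checksum_ok((const uint8_t *)f, f->length * 16u))
            return f;
    }
    return 0;                     /* not found: fall back to UP, or halt */
}

/* Typical call sites: the last KB of base memory and the BIOS ROM area,
 * e.g. mp_search(0x9fc00, 0xa0000) and mp_search(0xe0000, 0x100000). */
</pre>
<p>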
The configuration table is divided into three parts: a header, a base
section, and an extended section. The header begins with the four-byte
signature "PCMP", although you do not have to search for it. Once you
find it, checksum it. At this point, you can print the OEM and product
ID strings in the configuration table if you want. You should get the
address of the local APIC from this and store it. Then, proceed to
parse the base section.
<p>
The base section consists of a set of entries that describe either
processors, system busses, I/O APICs, I/O interrupt assignments, or
local interrupt assignments. All entries are eight bytes in length,
save processor entries which are twenty bytes. The first byte of each
entry denotes the type of the entry. Look through each entry. You will
probably want to generate quite a few OS-specific data structures here.
In particular, you will want to note the APIC ID of each processor in
the system, its version, and its type as well as the address of the
system's I/O APIC.
<p>
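Concretely, walking the base section is a matter of stepping through the
entries by size and switching on the type byte; the sketch below uses
abridged, illustrative structures rather than full MP-spec definitions.
<pre>
#include <stdint.h>

enum { MP_ENTRY_CPU = 0, MP_ENTRY_BUS = 1, MP_ENTRY_IOAPIC = 2,
       MP_ENTRY_IOINT = 3, MP_ENTRY_LINT = 4 };

struct mp_cpu_entry {             /* abridged: 20 bytes in the real table */
    uint8_t type;                 /* 0 for a processor entry */
    uint8_t apic_id;
    uint8_t apic_version;
    uint8_t flags;                /* bit 0: enabled, bit 1: this is the BSP */
    /* CPU signature and feature flags follow in the real entry */
};

static void parse_base(const uint8_t *p, uint16_t entry_count)
{
    for (uint16_t i = 0; i < entry_count; i++) {
        if (*p == MP_ENTRY_CPU) {
            const struct mp_cpu_entry *c = (const struct mp_cpu_entry *)p;
            /* record c->apic_id, c->apic_version, BSP/AP role, ... */
            (void)c;
            p += 20;              /* processor entries are 20 bytes */
        } else {
            /* bus, I/O APIC and interrupt entries: note what you need */
            p += 8;               /* all other entries are 8 bytes */
        }
    }
}
</pre>
<p>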
<h3>
<a name="apic">Using Local APICs</a>
</h3>
MP systems have a special bus to which all APICs in the system are
connected. This bus is one of the ways the processors can communicate
with one another (the other, of course, is shared memory). APICs (both
local and I/O) are memory mapped devices. The default location for
the local APIC is at 0xfee00000 in physical memory. The local APIC
will appear in the same place for each processor, but each processor
will reference its own APIC; the APIC intercepts memory references to
its registers, and those references will not generate bus cycles on
some systems. Since APICs are mapped in high memory, the APs will have
to switch to protected mode before they can initialise their local APICs.
If you like, you can map the APIC to a different address using the paging
unit, but be sure to disable caching in the page table entry since some
registers can change between accesses. For this reason, pointers to
APIC registers should be volatile. To initialise the BSP's local APIC,
set the enable bit in the spurious interrupt vector register and set
the error interrupt vector in the local vector table.
<p>
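In C, the initialisation described above comes down to two volatile writes
to the memory-mapped register block. This sketch uses the default base
address and the register offsets from Intel's documentation; the vector
numbers are up to you.
<pre>
#include <stdint.h>

#define LAPIC_BASE       0xfee00000u
#define LAPIC_SVR        0x0f0u      /* spurious interrupt vector register */
#define LAPIC_LVT_ERROR  0x370u      /* error entry in the local vector table */
#define LAPIC_SVR_ENABLE 0x100u      /* APIC software enable bit */

/* Volatile, because APIC registers can change between accesses. */
static volatile uint32_t *lapic_reg(uint32_t off)
{
    return (volatile uint32_t *)(uintptr_t)(LAPIC_BASE + off);
}

static void lapic_init(uint8_t spurious_vec, uint8_t error_vec)
{
    *lapic_reg(LAPIC_SVR)       = LAPIC_SVR_ENABLE | spurious_vec;
    *lapic_reg(LAPIC_LVT_ERROR) = error_vec;
}
</pre>
<p>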
<h3>
<a name="booting">Booting Application Processors</a>
</h3>
Once you have detected the processors in the system, set up your local
APIC, and verified that you can communicate with it (hint: read the
APIC version register), it's time to boot the APs. N.B. that it is good
practise to not try to boot the BSP here. That would be bad.
<p>
Since the APs will wake up in real mode, everything they need to get
started should be in low memory (below 0x100000 or one megabyte).
First, set the BIOS shutdown code by writing 0xa to CMOS location 0xf. Then,
grab a page of memory for the AP's stack. You will also need space
to store the `trampoline' code, i.e. the code the processor executes
after waking up to switch to protected mode and jump to the kernel.
You can either use the same page of code for each processor or store
the code at the bottom of the processor's stack. Note that the start of
the code must be at a page-aligned address. Copy the code there, then
set the warm reset vector at address 40:67 to the start of this code.
Next, you should reset a bit in the kernel which the processor will use
to signal that it has booted and finished initialisation and clear any
APIC error by writing a zero to the error status register. If you need
to pass any parameters or data to the AP, now would be a good time to
set that up. For example, since OpenBLT's kernel runs in high memory,
I have to pass the address of the page directory in memory so that the
AP can load it and enable paging before calling the kernel.
<p>
Now you can actually boot the processor. The procedure consists of
sending a sequence of interrupts to the processor. The incremental effect
of each is undefined, but at the end of the sequence, the processor will
be booted. First send an INIT IPI. Assert the INIT signal by writing
the target processor's APIC ID to the high word of the interrupt command
register. Then write to the low word with the bits set to enable the INIT
delivery mode, level triggered, and assert the interrupt. Deassert INIT
by repeating the procedure with the assert bit reset. Now, wait 10 ms.
Use of the APIC timer is suggested.
<p>
If the local APIC is not an 82489dx, you need to send two STARTUP IPIs.
Clear APIC errors, set the target APIC ID in the ICR, then send the
interrupt by writing to the low word of the ICR with bits set for STARTUP
delivery mode and with the code vector in the low byte. The code vector
is the physical page number at which the processor should start executing,
i.e. the start of your trampoline code. Wait 200 ms, then check the low
word of the ICR to make sure bit 16 is reset to indicate the interrupt
was dispatched before sending the second STARTUP. After sending it,
spin and wait for the AP to set its ready bit in memory. You may want
to set a timeout of 5 seconds, after which you assume the processor did
not wake up.
<p>
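Put together, the INIT/STARTUP dance looks roughly like the sketch below.
The register offsets and command bits follow Intel's ICR layout; delay_ms()
and the trampoline page are assumed to exist elsewhere, and error handling
and timeouts are omitted.
<pre>
#include <stdint.h>

#define LAPIC_BASE   0xfee00000u
#define LAPIC_ESR    0x280u          /* error status register */
#define LAPIC_ICR_LO 0x300u
#define LAPIC_ICR_HI 0x310u

#define ICR_INIT     0x00000500u     /* INIT delivery mode */
#define ICR_STARTUP  0x00000600u     /* STARTUP delivery mode */
#define ICR_ASSERT   0x00004000u
#define ICR_LEVEL    0x00008000u
#define ICR_PENDING  0x00001000u     /* delivery status: still being sent */

static volatile uint32_t *reg(uint32_t off)
{
    return (volatile uint32_t *)(uintptr_t)(LAPIC_BASE + off);
}

extern void delay_ms(int ms);        /* assumed APIC-timer based delay */

/* start_page: physical page number of the trampoline code (addr >> 12). */
static void start_ap(uint8_t apic_id, uint8_t start_page)
{
    *reg(LAPIC_ICR_HI) = (uint32_t)apic_id << 24;
    *reg(LAPIC_ICR_LO) = ICR_INIT | ICR_LEVEL | ICR_ASSERT;   /* assert INIT */
    *reg(LAPIC_ICR_HI) = (uint32_t)apic_id << 24;
    *reg(LAPIC_ICR_LO) = ICR_INIT | ICR_LEVEL;                /* deassert INIT */
    delay_ms(10);

    for (int i = 0; i < 2; i++) {            /* two STARTUP IPIs */
        *reg(LAPIC_ESR)    = 0;              /* clear any APIC errors */
        *reg(LAPIC_ICR_HI) = (uint32_t)apic_id << 24;
        *reg(LAPIC_ICR_LO) = ICR_STARTUP | start_page;
        delay_ms(200);                       /* wait, as described above */
        while (*reg(LAPIC_ICR_LO) & ICR_PENDING)
            ;                                /* spin until it is dispatched */
    }
    /* The caller then spins (with a timeout) on the AP's ready flag. */
}
</pre>
<p>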
<h3>
<a name="protmode">Switching from Real Mode to Protected Mode</a>
</h3>
Provided you did everything right above, the processor at some point woke
up in real mode and started executing the code you told it to. First,
execute a "cli" instruction to turn off interrupts, just in case. Now,
begin the switch to protected mode. Load an appropriate value into GDTR.
This can point either to the actual GDT or, in my case, a temporary GDT.
If you need to activate paging, load the address of a page directory
into cr3. Then set bit zero in cr0 to enable protected mode as well as
bit 31 if you need to enable paging to get into the kernel. Then do an
ljmp to the kernel text segment with an offset that points to the next
instruction to serialise the processor. Now that you're in protected
mode, load appropriate descriptors into the segment registers, then
execute a "cld", which is reportedly what gcc expects. Then, jump to
the starting address of your kernel.
<p>
Don't reference any symbols in this code since it
will be running at an address for which it was not linked;
all memory references must be absolute. Since your kernel is above
one megabyte in memory, you can't access any global variables in real
mode. Also be careful in specifying your offset address for the ljmp
instruction, and do specify the address of the start of your kernel,
not a symbol in the instruction that goes into the kernel.
Jumping to a symbol doesn't seem to work. For details,
see OpenBLT's kernel/trampoline.S.
<p>
Debugging this part is really not too bad. What you have to do is
establish some communication space in low memory, then have the AP write
bytes to that memory to explain what it is doing and print these out on
the BSP.
<p>
<h3>
<a name="ipis">Interprocessor Interrupts</a>
</h3>
IPIs are used to maintain synchronisation between the processors.
For example, if a kernel page table entry changes, both processors must
either flush their TLBs or invalidate that particular page table entry.
Whichever processor changed the mapping knows to do this automatically,
but the other processor does not; therefore, the processor which changed
the mapping must send an IPI to the other processor to tell it to flush
its TLB or invalidate the page table entry.
<p>
Using the local APIC, you can send interrupts to all processors, all
processors but the one sending the interrupt, or a specific processor
or logical address as well as self-interrupts. To send an IPI, write
the destination APIC ID, if needed, into the high word of the ICR, then
write the low word of ICR with the destination shorthand and interrupt
vector set to send the IPI. Be sure to wrap these functions in spinlocks.
You might want to turn off interrupts as well while sending IPIs.
<p>
<h3>
<a name="other">Other Considerations</a>
</h3>
One thing to note is that semaphores (a.k.a. spinlocks) may need to
be done differently under SMP. Consider a scenario where semaphores
are procured with a ``bts'' instruction. If both processors hit that
instruction at the same time while the semaphore is reset, they might
both think they have acquired it. For this reason, you would need to
use a ``lock'' prefix on that instruction to lock the system bus and
maintain synchronisation.
<p>
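For the x86 this comes down to using a locked read-modify-write instruction
to take the semaphore. The sketch below uses GCC-style inline assembly; it
is a minimal illustration, not a complete locking primitive (no back-off,
no interrupt handling).
<pre>
typedef volatile unsigned int spinlock_t;

static inline void spin_lock(spinlock_t *l)
{
    unsigned char taken;
    do {
        /* "lock" makes the bit test-and-set atomic across processors,
         * so two CPUs cannot both see the semaphore as free. */
        __asm__ __volatile__("lock; btsl $0, %0; setc %1"
                             : "+m"(*l), "=q"(taken)
                             :
                             : "cc", "memory");
    } while (taken);                 /* spin until we set the bit ourselves */
}

static inline void spin_unlock(spinlock_t *l)
{
    __asm__ __volatile__("movl $0, %0" : "=m"(*l) : : "memory");
}
</pre>
<p>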
<h3>
<a name="scheduling">Scheduling</a>
</h3>
Not yet.
<h3>
<a name="ioapic">Using the I/O APIC</a>
</h3>
Not yet.
<h3>
<a name="references">References</a>
</h3>
All of these documents are available from Intel's developer web site at
<a href="http://developer.intel.com">developer.intel.com</a>. Supposedly,
by request, Intel will also send you printed documentation by post.
<ul>
<li>82378ZB System I/O and 82379AB System I/O APIC
<ul>
<li>&sect;3.7: APIC Registers</li>
</ul>
</li>
<li>MultiProcessor Specification, Version 1.4
<ul>
<li>&sect;4: MP Configuration Table</li>
<li>&sect;B: Operating System Programming Guidelines</li>
</ul>
</li>
<li>Intel Architecture Software Developer's Manual, Volume 2: Instruction
Set Reference</li>
<li>Intel Architecture Software Developer's Manual, Volume 3: System
Programming Guide
<ul>
<li>&sect;7.1: Locked Atomic Operations</li>
<li>&sect;7.4: The APIC</li>
<li>&sect;8.7: Software Initialization for Protected Mode</li>
<li>&sect;8.8: Mode Switching</li>
<li>&sect;B: MP Bootup Sequence</li>
</ul>
</li>
</ul>
<h3>
<a name="revisions">Revision History</a>
</h3>
<pre>
07 Nov 1998 - initial version, most sections filled out.
</pre>
<br clear=all>
<hr>
<i>Bringing SMP to Your UP Operating System</i> is Copyright &copy; 1998 by
Sidney Cammeresi in its entirety. All rights reserved.<p>
Permission is granted to make verbatim copies of this tutorial for
non-commercial use provided this notice remains intact on all copies.
</td></tr>
</table>
<pre>$Id: smp.html 1.3 Thu, 01 Jul 1999 10:51:51 -0500 sac $</pre>
</body>
</html>

Binary file not shown.

Binary file not shown.

View File

@@ -0,0 +1,7 @@
<html>
<head>
<meta http-equiv="refresh" content="0;url=/Linux.old/sabre/os/articles">
</head>
<body lang="zh-CN">
</body>
</html>

View File

@@ -0,0 +1,280 @@
<HTML>
<HEAD>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
<TITLE>Write Your Own Operating System</TITLE>
</HEAD>
<BODY>
<hr color="#0000FF" noshade>
<p>
<div align="center">
<h1>Write Your Own Operating System [FAQ]</h1>
<h5>by <A href="mailto:dfiber@mega-tokyo.com?subject=OS-FAQ">Stuart 'Dark Fiber' George</A> &lt;dfiber@mega-tokyo.com&gt;</h5>
<h5>Last Updated: Tuesday 23rd October 2000</h5>
</div>
</P>
<P title="" align="CENTER">Download a .zip'ed version of the os-faq <A href="ftp://ftp.mega-tokyo.com/pub/operating-systems/os-faq.zip">here</A></P>
<hr color="#0000FF" noshade>
<p>
<UL>
<LI>Introductions, Overview
<UL>
<LI><A href="os-faq-intro.html#introduction">Introduction</a>
<LI><A href="os-faq-intro.html#getting_started">Getting Started</a>
<LI><A href="os-faq-intro.html#what_bits_cant_i_make_in_c">What bits can't I make in C</a>
<LI><A href="os-faq-intro.html#what_order_should_i_make_things_in">What order should I make things in</a>
</UL>
<LI>Kernel Questions
<UL>
<LI><A href="os-faq-kernel.html#load_kernel">How do I make a kernel file I can load</a>
<LI><A href="os-faq-kernel.html#32bit_files">Help! When I load my kernel my machine resets</a>
<LI><A href="os-faq-boot.html#easier_load">Is there an easier way to boot my kernel</a>
</UL>
<LI>Boot Loaders and Boot Menus
<UL>
<LI><A href="os-faq-boot.html#boot_loadmenu">Tell me about bootloaders and bootmenus</a>
<li><a href="os-faq-bsmbr.html#boot_sector">Boot Sectors</a>
<li><a href="os-faq-bsmbr.html#mbr">Master Boot Record</a>
<LI>GRUB
<UL>
<LI><A href="os-faq-grub.html#whats_grub">What is GRUB</a>
<LI><A href="os-faq-grub.html#get_grub">Where can I get GRUB</a>
<LI><A href="os-faq-grub.html#grub_aout">GRUB and DJGPP A.OUT</a>
<LI><A href="os-faq-grub.html#grub_nasm">GRUB and NASM ELF files</a>
<LI><A href="os-faq-grub.html#grub_watcom">GRUB and Watcom ELF files</a>
</UL>
<LI>LILO
<UL>
<LI><A href="os-faq-lilo.html#lilo">What is LILO</a>
<LI>Where can I get LILO
</UL>
<li>XOSL
<li>System Commander
<li>Boot Magic
</UL>
<LI>COMPILERS
<Ul>
<LI>DJGPP
<UL>
<LI><A href="os-faq-elf.html#elf_files">Can DJGPP output ELF files</a>
</UL>
<!-- TODO -->
<LI>Watcom C/C++
<LI>Visual C/C++
</UL>
<LI>Hardware
<UL>
<LI>CPU
<UL>
<LI><A href="os-faq-v86.html#whats_v8086">What is v8086 mode?</a>
<LI><A href="os-faq-v86.html#detect_v86">How do I detect v8086 mode?</a>
<li>AMD K6
<ul>
<LI><A href="os-faq-cpu-amdk6.html#k6_writeback">AMD K6 WriteBack Optimisations</a>
</ul>
</UL>
<LI>Memory
<UL>
<LI>The A20
<UL>
<LI><A href="os-faq-memory.html#what_is_a20">What is the A20 line?</a>
<LI><A href="os-faq-memory.html#access_my_memory">Why cant I access all my memory?</a>
<LI><A href="os-faq-memory.html#enable_a20">How do I enable the A20 line?</a>
</UL>
<LI>Memory Sizing
<UL>
<LI><A href="os-faq-memory.html#determine_memory">How do I determine the amount of RAM?</a>
<LI><A href="os-faq-memory.html#determine_memory_bios">How do I determine the amount of RAM with the BIOS?</a>
<LI><A href="os-faq-memory.html#determine_memory_probe">How do I determine the amount of RAM with direct probing?</a>
</UL>
</UL>
<LI>IRQs and Exceptions, PIC, NMI, APIC, OPIC
<UL>
<LI><A href="os-faq-pics.html#irq_exception">How do I know if an IRQ or exception is firing?</a>
<LI><A href="os-faq-pics.html#what_pic">What is the PIC?</a>
<LI><A href="os-faq-pics.html#remap_pic">Can I remap the PIC?</a>
<LI><A href="os-faq-pics.html#nmi">So whats the NMI then?</a>
<LI><A href="os-faq-pics.html#apic">Tell me about APIC</a>
<LI><A href="os-faq-pics.html#opic">Tell me about OPIC</a>
</UL>
<LI>Interrupt Service Routines (ISRs)
<UL>
<LI><A href="os-faq-isr.html#isr">Whats an ISR?</a>
<LI><A href="os-faq-isr.html#normal_v_isr">Whats the difference between an ISR and a normal routine?</a>
<LI><A href="os-faq-isr.html#gcc_isr">So how do I do an ISR with GCC?</a>
</UL>
<LI>Video
<UL>
<LI><A href="os-faq-console.html#text_mode">How do I output text to the screen in protected mode?</a>
<LI><A href="os-faq-console.html#detect_text_screen">How do I detect if I have a colour or monochrome monitor?</a>
<LI><A href="os-faq-console.html#moving_cursor">How do I move the cursor when I print?</a>
</UL>
<LI>Plug and Play
<UL>
<LI><A href="os-faq-pnp.html#prog_pnp">Where can I find programming info on PNP?</a>
<LI><A href="os-faq-pnp.html#pnp_pmode">I heard you can do PNP calls with the BIOS in Protected Mode?</a>
</UL>
<LI>PCI
<UL>
<LI><A href="os-faq-pci.html#prog_pci">Where can I find programming info on PCI?</a>
<LI><A href="os-faq-pci.html#pci_pmode">I heard you can do PCI calls with the BIOS in Protected Mode?</a>
</UL>
</UL>
<LI>C Programming
<UL>
<LI><A href="os-faq-libc.html#no_printf">Where did my printf go?</a>
<LI><A href="os-faq-libc.html#libc">Whats this LIBC thing?</a>
<LI><A href="os-faq-libc.html#existing_libc">What C libraries exist for me to use?</a>
</UL>
<LI>C++ Programming
<UL>
<LI><A href="os-faq-cpp.html#start">Doing a kernel in C++</a>
<LI><A href="os-faq-cpp.html#rtti">Aiyah! Whats RTTI? (Run Time Type Info)</a>
<LI><A href="os-faq-cpp.html#disable_rtti">How do I disable RTTI in GCC?</a>
<LI><A href="os-faq-cpp.html#new_delete">Can I use NEW and DELETE in my kernel?</a>
</UL>
<LI>Linkers
<UL>
<LI><a href="os-faq-linker.html#linkers">Linker Info!</a>
<LI><a href="os-faq-linker.html#linkers_jloc">JLoc</a>
<LI><a href="os-faq-linker.html#linkers_alink">ALink</a>
<LI><a href="os-faq-linker.html#linkers_ld">LD (GNU)</a>
<LI><a href="os-faq-linker.html#linkers_tlink">TLink / TLink32 (Borland)</a>
<LI><a href="os-faq-linker.html#linkers_link">Link / NLink (Microsoft)</a>
<LI><a href="os-faq-linker.html#linkers_val">VAL</a>
<LI><a href="os-faq-linker.html#linkers_wlink">WLink (Watcom)</a>
<LI><a href="os-faq-linker.html#linkers_comp">A Comparison</a>
</UL>
<LI>Executable File Types
<UL>
<LI><A href="os-faq-exec.html#exec_files">Executable Files</a>
<LI><A href="os-faq-exec.html#exec_exe">EXE (dos &quot;MZ&quot;)</a>
<LI><A href="os-faq-exec.html#exec_ne">EXE (win16 &quot;NE&quot;)</a>
<LI><A href="os-faq-exec.html#exec_le">EXE (OS/2 &quot;LE/LX&quot;)</a>
<LI><A href="os-faq-exec.html#exec_pe">EXE (Win32 &quot;PE&quot;)</a>
<LI><A href="os-faq-exec.html#exec_elf">ELF</a>
<LI><A href="os-faq-exec.html#exec_coff">COFF</a>
<LI><A href="os-faq-exec.html#exec_aout">A.OUT</a>
</UL>
<LI>Filesystems
<UL>
<LI><A href="os-faq-fs.html#file_systems">Tell me about filesystems</a>
<LI><A href="os-faq-fs.html#fs_fat">FAT</a>
<LI><A href="os-faq-fs.html#fs_vfat">VFAT</a>
<LI><A href="os-faq-fs.html#fs_fat32">FAT32</a>
<LI><A href="os-faq-fs.html#fs_hpfs">HPFS (High Performance File System)</a>
<LI><A href="os-faq-fs.html#fs_ntfs">NTFS (New Technology File System)</a>
<LI><A href="os-faq-fs.html#fs_ext2fs">ext2fs (2nd extended file system)</a>
<LI><A href="os-faq-fs.html#fs_befs">BeFS</a>
<LI><A href="os-faq-fs.html#fs_ffs_amiga">FFS (Amiga)</a>
<LI><A href="os-faq-fs.html#fs_ffs_bsd">FFS (BSD)</a>
<LI><A href="os-faq-fs.html#fs_nfs">NFS</a>
<LI><A href="os-faq-fs.html#fs_afs">AFS</a>
<LI><a href="os-faq-fs.html#fs_rfs">RFS</a>
<LI><A href="os-faq-fs.html#fs_xfs">XFS</a>
</UL>
<LI>Resources
<UL>
<LI>Books
<ul>
<LI><A href="os-faq-books.html#books">Reference Books</a>
<LI><A href="os-faq-books.html#book_0">The Indispensable PC Hardware Book</A>
<LI><A href="os-faq-books.html#book_1">Operating System Concepts</A>
<LI><A href="os-faq-books.html#book_2">Operating Systems : Design and Implementation</A>
<LI><A href="os-faq-books.html#book_3">Operating Systems : Internals and Design Principals</A>
<LI><a href="os-faq-books.html#book_4">Distributed Operating Systems</a></td>
<LI><a href="os-faq-books.html#book_5">Inside Windows NT, Second Edition</a></td>
<LI><a href="os-faq-books.html#book_6">Lion's Commentary on UNIX sixth edition, with source code</a></td>
<LI><a href="os-faq-books.html#book_7">UNIX Internals: The New Frontiers</a></td>
</ul>
<LI><A href="os-faq-links.html#small_free_kernels">Some small kernels with source</a>
<LI><A href="os-faq-acronyms.html#acronyms">Chip Numbers, Acronyms and Things</A>
</UL>
<LI>Third Party Tools
<UL>
<LI><a href="os-faq-3rd.html#vmware">VMWare PC Emulator</a>
<LI><a href="os-faq-3rd.html#bochs">Bochs (i386) PC emulator</a>
<LI><a href="os-faq-3rd.html#mtools">MTools (DOS disk image tools)</a>
<LI><a href="os-faq-3rd.html#simics">SimICS (SunSparc Simulator)</a>
</UL>
<LI>Contributors
<UL>
<LI><A href="os-faq-contributors.html#contributors">Who helped with the FAQ</a>
</UL>
<LI>Todo
<UL>
<LI><A href="os-faq-todo.html#todo">The TODO list</a>
</UL>
</UL>
<!-- *************** DHTML Outline (end) ***************** -->
<hr color="#0000FF" noshade>
<p>
What's New!
<ul>
<li>Trying to add more material on various C/C++ compilers
<li>Added VMWare to the tools list
<li>Removed the link to the free Intel Developer CD's (offer is no longer valid)
<li>More info on some boot managers (XOSL, System Commander, Boot Magic, etc.)
<li>Fixed those nasty link colours
</ul>
<p>
<hr color="#0000FF" noshade>
<P>
<TABLE border="0" cellspacing="1" cellpadding="10" align="CENTER">
<CAPTION>The OS-FAQ is a member of the OS Web Ring
</CAPTION>
<TR>
<TD><A href="http://www.webring.org/cgi-bin/webring?ring=os&id=31&next" target="_top">Next</A>
</TD>
<TD><A href="http://www.webring.org/cgi-bin/webring?ring=os&id=31&skip" target="_top">Skip Next</A>
</TD>
<TD><A href="http://www.webring.org/cgi-bin/webring?ring=os&id=31&next5" target="_top">Next 5</A><BR>
</TD>
<TD><A href="http://www.webring.org/cgi-bin/webring?ring=os&id=31&list" target="_top">List Sites</A>
</TD>
</TR>
</TABLE></P>
</BODY>
</HTML>

Binary file not shown.

Binary file not shown.