add directory study

This commit is contained in:
gohigh
2024-02-19 00:25:23 -05:00
parent b1306b38b1
commit f3774e2f8c
4001 changed files with 2285787 additions and 0 deletions

@@ -0,0 +1,870 @@
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>OS/RC: Design Elements of the FreeBSD VM System</title>
</head>
<body bgcolor="#ffffff">
<p align=right>Back to the <a href="..">OS/RC</a></p>
<h2>Design Elements of the FreeBSD VM System</h2>
<h3>By Matthew Dillon <a href="mailto:dillon@apollo.backplane.com">dillon@apollo.backplane.com</a></h3>
</P>
<p font class="Normal">
The title is really just a fancy way of saying that I am going
to attempt to describe the whole VM enchilada, hopefully in a
way that everyone can follow. For the last year I have
concentrated on a number of major kernel subsystems within
FreeBSD, with the VM and Swap subsystems being the most
interesting and NFS being 'a necessary chore'. I rewrote only
small portions of the code. In the VM arena the only major
rewrite I have done is to the swap subsystem. Most of my work
was cleanup and maintenance, with only moderate code rewriting
and no major algorithmic adjustments within the VM subsystem.
The bulk of the VM subsystem's theoretical base remains unchanged
and a lot of the credit for the modernization effort in the
last few years belongs to John Dyson and David Greenman. Not
being a historian like Kirk I will not attempt to tag all the
various features with people's names, since I will invariably
get it wrong.
</P>
<p font class="Normal">
Before moving along to the actual design let's spend a little
time on the necessity of maintaining and modernizing any
long-living codebase. In the programming world, algorithms
tend to be more important than code and it is precisely due to
BSD's academic roots that a great deal of attention was paid
to algorithm design from the beginning. More attention paid
to the design generally leads to a clean and flexible codebase
that can be fairly easily modified, extended, or replaced over
time. While BSD is considered an 'old' operating system by
some people, those of us who work on it tend to view it more
as a 'mature' codebase which has various components modified,
extended, or replaced with modern code. It has evolved, and
FreeBSD is at the bleeding edge no matter how old some of the
code might be. This is an important distinction to make and
one that is unfortunately lost to many people. The biggest
error a programmer can make is to not learn from history, and
this is precisely the error that many other modern operating
systems have made. NT is the best example of this, and the
consequences have been dire. Linux also makes this mistake to
some degree -- enough that we BSD folk can make small jokes
about it every once in a while, anyway (grin). Linux's problem
is simply one of a lack of experience and history to compare
ideas against, a problem that is easily and rapidly being
addressed by the Linux community in the same way it has been
addressed in the BSD community -- by continuous code development.
The NT folk, on the other hand, repeatedly make the same mistakes
solved by UNIX decades ago and then spend years fixing them.
Over and over again. They have a severe case of 'not designed
here' and 'we are always right because our marketing department
says so'. I have little tolerance for anyone who cannot learn
from history.
</P>
<p font class="Normal">
Much of the apparent complexity of the FreeBSD design, especially
in the VM/Swap subsystem, is a direct result of having to solve
serious performance issues that occur under various conditions.
These issues are not due to bad algorithmic design but instead
arise from environmental factors. In any direct comparison
between platforms, these issues become most apparent when system
resources begin to get stressed. As I describe FreeBSD's
VM/Swap subsystem the reader should always keep two points in
mind. First, the most important aspect of performance design
is what is known as "Optimizing the Critical Path". It is
often the case that performance optimizations add a little
bloat to the code in order to make the critical path perform
better. Second, a solid, generalized design outperforms a
heavily-optimized design over the long run. While a generalized
design may end up being slower than a heavily-optimized design
when they are first implemented, the generalized design tends
to be easier to adapt to changing conditions and the
heavily-optimized design winds up having to be thrown away.
Any codebase that will survive and be maintainable for years
must therefore be designed properly from the beginning even if
it costs some performance. Twenty years ago people were still
arguing that programming in assembly was better than programming
in a high-level language because it produced code that was ten
times as fast. Today, the fallibility of that argument is
obvious -- as are the parallels to algorithmic design and code
generalization.
</P>
<p font class="Normal">
<strong>VM Objects</strong>
</P>
<p font class="Normal">
The best way to begin describing the FreeBSD VM system is to
look at it from the perspective of a user-level process. Each
user process sees a single, private, contiguous VM address
space containing several types of memory objects. These objects
have various characteristics. Program code and program data
are effectively a single memory-mapped file (the binary file
being run), but program code is read-only while program data
is copy-on-write. Program BSS is just memory allocated and
filled with zeros on demand, called demand zero page fill.
Arbitrary files can be memory-mapped into the address space as
well, which is how the shared library mechanism works. Such
mappings can require modifications to remain private to the
process making them. The fork system call adds an entirely
new dimension to the VM management problem on top of the
complexity already given.
</P>
<p font class="Normal">
A program binary data page (which is a basic copy-on-write
page) illustrates the complexity. A program binary contains
a preinitialized data section which is initially mapped directly
from the program file. When a program is loaded into a process's
VM space, this area is initially memory-mapped and backed by
the program binary itself, allowing the VM system to free/reuse
the page and later load it back in from the binary. The moment
a process modifies this data, however, the VM system must make
a private copy of the page for that process. Since the private
copy has been modified, the VM system may no longer free it,
because there is no longer any way to restore it later on.
</P>
<p font class="Normal">
You will notice immediately that what was originally a simple
file mapping has become much more complex. Data may be modified
on a page-by-page basis whereas the file mapping encompasses
many pages at once. The complexity further increases when a
process forks. When a process forks, the result is two processes
-- each with their own private address spaces, including any
modifications made by the original process prior to the call
to fork(). It would be silly for the VM system to make a
complete copy of the data at the time of the fork() because it
is quite possible that at least one of the two processes will
only need to read from that page from then on, allowing the
original page to continue to be used. What was a private page
is made copy-on-write again, since each process (parent and
child) expects their own personal post-fork modifications to
remain private to themselves and not affect the other.
</P>
<p font class="Normal">
FreeBSD manages all of this with a layered VM Object model.
The original binary program file winds up being the lowest VM
Object layer. A copy-on-write layer is pushed on top of that
to hold those pages which had to be copied from the original
file. If the program modifies a data page belonging to the
original file the VM system takes a fault and makes a copy of
the page in the higher layer. When a process forks, additional
VM Object layers are pushed on.
This might make a little more sense with a fairly basic example.
A fork() is a common operation for any *BSD system, so this
example will consider a program that starts up, and forks.
When the process starts, the VM system creates an object layer,
let's call this A:
<center><img src="fig1.gif"></center>
<p font class="Normal">
A represents the file--pages may be paged in and out of the file's
physical media as necessary. Paging in from the disk is reasonable
for a program, but we really don't want to page back out and
overwrite the executable. The VM system therefore creates a second
layer, B, that will be physically backed by swap space:
<center><img src="fig2.gif"></center>
<p font class="Normal">
On the first write to a page after this, a new page is created in
B, and its contents are initialized from A. All pages in B can be
paged in or out to a swap device. When the program forks, the
VM system creates two new object layers--C1 for the parent, and C2
for the child--that rest on top of B:
<center><img src="fig3.gif"></center>
<p font class="Normal">
In this case, let's say a page in B is modified by the original
parent process. The process will take a copy-on-write fault and
duplicate the page in C1, leaving the original page in B untouched.
Now, let's say the same page in B is modified by the child process.
The process will take a copy-on-write fault and duplicate the page
in C2. The original page in B is now completely hidden since both
C1 and C2 have a copy, and B could theoretically be destroyed if it
does not represent a 'real' file. However, this sort of
optimization is not trivial to make because it is so fine-grained.
FreeBSD does not make this optimization.
Now, suppose (as is often the case) that the child process does an
exec(). Its current address space is usually replaced by a new
address space representing a new file. In this case, the C2 layer
is destroyed:
<center><img src="fig4.gif"></center>
<p font class="Normal">
In this case, the number of children of B drops to one, and all
accesses to B now go through C1. This means that B and C1 can
be collapsed together. Any pages in B that also exist in C1 are
deleted from B during the collapse. Thus, even though the
optimization in the previous step could not be made, we can
recover the dead pages when either of the processes exit or exec().
</P>
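<p font class="Normal">
To make the layering concrete, here is a minimal userspace sketch of the
shadow-chain lookup and copy-on-write step described above. The structure
and function names (vm_object, write_fault) are illustrative only and do
not match FreeBSD's actual data structures.
</P>
<pre>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NPAGES    8

struct vm_object {
    struct vm_object *backing;      /* next lower layer (C -> B -> A) */
    char *pages[NPAGES];            /* NULL if this layer has no copy */
};

/* Walk the chain from the top and return the topmost copy of the page. */
static char *lookup(struct vm_object *top, int idx, struct vm_object **owner)
{
    for (struct vm_object *o = top; o != NULL; o = o->backing)
        if (o->pages[idx]) { *owner = o; return o->pages[idx]; }
    *owner = NULL;
    return NULL;
}

/* On a write fault, copy the page into the top layer if it lives below. */
static char *write_fault(struct vm_object *top, int idx)
{
    struct vm_object *owner;
    char *src = lookup(top, idx, &owner);
    if (owner == top)
        return src;                     /* already private: nothing to do */
    char *copy = calloc(1, PAGE_SIZE);  /* zero-fill if no layer has it */
    if (src)
        memcpy(copy, src, PAGE_SIZE);   /* shadow the lower layer's page */
    top->pages[idx] = copy;
    return copy;
}
</pre>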
<p font class="Normal">
This model creates a number of potential problems. The first
is that you can wind up with a relatively deep stack of layered
VM Objects which can cost scanning time and memory when you
take a fault. Deep layering can occur when processes
fork and then fork again (either parent or child).
The second problem is that you can wind up with dead,
inaccessible pages deep in the stack of VM Objects. In our
last example if both the parent and child processes modify the
same page, they both get their own private copies of the page
and the original page in B is no longer accessible by anyone.
That page in B can be freed.
</P>
<p font class="Normal">
FreeBSD solves the deep layering problem with a special optimization
called the "All Shadowed Case". This case occurs if either C1 or C2
take sufficient COW faults to completely shadow all pages in B. Let's
say that C1 achieves this. C1 can now bypass B entirely, so rather
than have C1->B->A and C2->B->A we now have C1->A and C2->B->A. But
look what also happened -- now B has only one reference (C2), so we
can collapse B and C2 together. The end result is that B is deleted
entirely and we have C1->A and C2->A. It is often the case that B
will contain a large number of pages and neither C1 nor C2 will be
able to completely overshadow it. If we fork again and create a set
of D layers, however, it is much more likely that one of the D layers
will eventually be able to completely overshadow the much smaller dataset
represented by C1 or C2. The same optimization will work at any point
in the graph and the grand result of this is that even on a heavily forked
machine VM Object stacks tend to not get much deeper than 4. This is
true of both the parent and the children and true whether the parent is
doing the forking or whether the children cascade forks.
</P>
<p font class="Normal">
The dead page problem still exists in the case where C1 or C2 do not
completely overshadow B. Due to our other optimizations this case does
not represent much of a problem and we simply allow the pages to be dead.
If the system runs low on memory it will swap them out, eating a little
swap, but that's it.
</P>
<p font class="Normal">
The advantage to the VM Object model is that fork() is extremely
fast, since no real data copying need take place. The disadvantage
is that you can build a relatively complex VM Object layering
that slows page fault handling down a little, and you spend
memory managing the VM Object structures. The optimizations
FreeBSD makes prove to reduce the problems enough that they
can be ignored, leaving no real disadvantage.
</P>
<p font class="Normal">
<strong>SWAP Layers</strong>
</P>
<p font class="Normal">
Private data pages are initially either copy-on-write or
zero-fill pages. When a change, and therefore a copy, is made,
the original backing object (usually a file) can no longer be
used to save a copy of the page when the VM system needs to
reuse it for other purposes. This is where SWAP comes in.
SWAP is allocated to create backing store for memory that does
not otherwise have it. FreeBSD allocates the swap management
structure for a VM Object only when it is actually needed.
However, the swap management structure has had problems
historically.
</P>
<p font class="Normal">
Under FreeBSD 3.x the swap management structure preallocates
an array that encompasses the entire object requiring swap
backing store -- even if only a few pages of that object are
swap-backed. This creates a kernel memory fragmentation problem
when large objects are mapped, or processes with large resident set
sizes (RSS) fork. Also, in order to keep track of swap space, a
'list of holes' is kept in kernel memory, and this tends to
get severely fragmented as well. Since the 'list of holes' is
a linear list, the swap allocation and freeing performance is
a non-optimal O(n)-per-page. It also requires kernel memory
allocations to take place during the swap freeing process, and
that creates low memory deadlock problems. The problem is
further exacerbated by holes created due to the interleaving
algorithm. Also, the swap block map can become fragmented fairly
easily resulting in non-contiguous allocations. Kernel memory
must also be allocated on the fly for additional swap management
structures when a swapout occurs. It is evident that there
was plenty of room for improvement.
</P>
<p font class="Normal">
For FreeBSD 4.x, I completely rewrote the swap subsystem. With
this rewrite, swap management structures are allocated through
a hash table rather than a linear array giving them a fixed
allocation size and much finer granularity. Rather than using
a linearly linked list to keep track of swap space reservations,
it now uses a bitmap of swap blocks arranged in a radix tree
structure with free-space hinting in the radix node structures.
This effectively makes swap allocation and freeing an O(1)
operation. The entire radix tree bitmap is also preallocated
in order to avoid having to allocate kernel memory during
critical low memory swapping operations. After all, the system
tends to swap when it is low on memory so we should avoid
allocating kernel memory at such times in order to avoid
potential deadlocks. Finally, to reduce fragmentation the
radix tree is capable of allocating large contiguous chunks at
once, skipping over smaller fragmented chunks. I did not take
the final step of having an 'allocating hint pointer' that
would trundle through a portion of swap as allocations were
made in order to further guarantee contiguous allocations or
at least locality of reference, but I ensured that such an
addition could be made.
</P>
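<p font class="Normal">
The following is a deliberately simplified sketch of the idea behind the
new allocator: preallocated bitmaps of swap blocks with a free-space hint
kept above them so that full regions can be skipped without scanning.
It is a flat, two-level illustration, not the actual radix-tree code.
</P>
<pre>
#include <stdint.h>

#define BLOCKS_PER_LEAF 32
#define NLEAVES         64          /* 2048 swap blocks, for the sketch */

static uint32_t leaf[NLEAVES];      /* bit set = swap block allocated */
static int      leaf_free[NLEAVES]; /* hint: free blocks under each leaf */

static void swap_init(void)
{
    for (int i = 0; i < NLEAVES; i++)
        leaf_free[i] = BLOCKS_PER_LEAF;
}

/* Allocate one swap block.  The hint lets us skip full leaves without
 * touching their bitmaps, which is what keeps the operation cheap. */
static int swap_alloc(void)
{
    for (int i = 0; i < NLEAVES; i++) {
        if (leaf_free[i] == 0)
            continue;               /* hint says nothing free here */
        for (int b = 0; b < BLOCKS_PER_LEAF; b++) {
            if (!(leaf[i] & (1u << b))) {
                leaf[i] |= 1u << b;
                leaf_free[i]--;
                return i * BLOCKS_PER_LEAF + b;
            }
        }
    }
    return -1;                      /* swap exhausted */
}

static void swap_free_block(int blk)
{
    int i = blk / BLOCKS_PER_LEAF, b = blk % BLOCKS_PER_LEAF;
    leaf[i] &= ~(1u << b);
    leaf_free[i]++;
}
</pre>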
<p font class="Normal">
<strong>When To Free a Page</strong>
</P>
<p font class="Normal">
Since the VM system uses all available memory for disk caching,
there are usually very few truly-free pages. The VM system
depends on being able to properly choose pages which are not
in use to reuse for new allocations. Selecting the optimal
pages to free is possibly the single-most important function
any VM system can perform because if it makes a poor selection,
the VM system may be forced to unnecessarily retrieve pages
from disk, seriously degrading system performance.
</P>
<p font class="Normal">
How much overhead are we willing to suffer in the critical path
to avoid freeing the wrong page? Each wrong choice we make
will cost us hundreds of thousands of CPU cycles and a noticeable
stall of the affected processes, so we are willing to endure
a significant amount of overhead in order to be sure that the
right page is chosen. This is why FreeBSD tends to outperform
other systems when memory resources become stressed.
</P>
<p font class="Normal">
<table border="0" cellspacing="5" cellpadding="5">
<tr>
<td width="65%" align="top"><p font class="Normal">
The free page determination algorithm is built upon a history
of the use of memory pages. To acquire this history, the system
takes advantage of a page-used bit feature that most hardware
page tables have. </p>
<p font class="Normal"> In any
case, the page-used bit is cleared and at some later point the
VM system comes across the page again and sees that the page-used
bit has been set. This indicates that the page is still being
actively used. If the bit is still clear it is an indication
that the page is not being actively used. By testing this bit
periodically, a use history (in the form of a counter) for the
physical page is developed. When the VM system later needs to
free up some pages, checking this history becomes the cornerstone
of determining the best candidate page to reuse.
</p>
</td>
<td width="35%" align="top" bgcolor="#dadada"><font size="-1"><center><strong>What if the hardware
has no page-used bit?</strong></center><br>
<p font class="Normal">For those platforms that do not have
this feature, the system actually emulates a page-used bit.
It unmaps or protects a page, forcing a page fault if the page
is accessed again. When the page fault is taken, the system
simply marks the page as having been used and unprotects the
page so that it may be used. While taking such page faults
just to determine if a page is being used appears to be an
expensive proposition, it is much less expensive than reusing
the page for some other purpose only to find that a process
needs it back and then have to go to disk.</P></font>
</td>
</tr>
</table>
</P>
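<p font class="Normal">
A rough sketch of that scanning step follows. The field names (referenced,
act_count) are illustrative; the point is simply that the page-used bit is
sampled and cleared on each pass and folded into a per-page counter.
</P>
<pre>
#define ACT_MAX 64

struct page {
    int referenced;   /* stands in for the hardware page-used bit */
    int act_count;    /* accumulated use history */
};

/* Called periodically for each in-use page by the page scanner. */
static void scan_page(struct page *p)
{
    if (p->referenced) {
        p->referenced = 0;          /* clear the bit for the next pass */
        if (p->act_count < ACT_MAX)
            p->act_count++;         /* recently used: raise its score */
    } else if (p->act_count > 0) {
        p->act_count--;             /* idle: let the history decay */
    }
    /* Pages whose act_count reaches zero are the best reclaim candidates. */
}
</pre>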
<p font class="Normal">
FreeBSD makes use of several page queues to further refine the
selection of pages to reuse as well as to determine when dirty
pages must be flushed to their backing store. Since page tables
are dynamic entities under FreeBSD, it costs virtually nothing
to unmap a page from the address space of any processes using
it. When a page candidate has been chosen based on the page-use
counter, this is precisely what is done. The system must make
a distinction between clean pages which can theoretically be
freed up at any time, and dirty pages which must first be
written to their backing store before being reusable. When a
page candidate has been found it is moved to the inactive queue
if it is dirty, or the cache queue if it is clean. A separate
algorithm based on the dirty-to-clean page ratio determines
when dirty pages in the inactive queue must be flushed to disk.
Once this is accomplished, the flushed pages are moved from
the inactive queue to the cache queue. At this point, pages
in the cache queue can still be reactivated by a VM fault at
relatively low cost. However, pages in the cache queue are
considered to be 'immediately freeable' and will be reused in
an LRU (least-recently used) fashion when the system needs to
allocate new memory.
</P>
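<p font class="Normal">
In outline, the queue transitions just described look something like the
sketch below (a simplification; the real pageout code also handles wiring,
busy pages, laundering limits, and much more).
</P>
<pre>
enum queue { ACTIVE, INACTIVE, CACHE, FREE };

struct vpage { enum queue q; int dirty; };

/* A candidate chosen by the use-history scan is unmapped and then parked
 * on the inactive queue if dirty, or the cache queue if already clean. */
static void deactivate(struct vpage *p)
{
    /* pmap-level unmapping of the page would happen here */
    p->q = p->dirty ? INACTIVE : CACHE;
}

/* Called once the dirty page has been written to its backing store. */
static void laundered(struct vpage *p)
{
    p->dirty = 0;
    p->q = CACHE;                   /* now 'immediately freeable' */
}
</pre>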
<p font class="Normal">
It is important to note that the FreeBSD VM system attempts to
separate clean and dirty pages for the express reason of avoiding
unnecessary flushes of dirty pages (which eats I/O bandwidth),
nor does it move pages between the various page queues gratuitously
when the memory subsystem is not being stressed. This is why
you will see some systems with very low cache queue counts and
high active queue counts when doing a 'systat -vm' command.
As the VM system becomes more stressed, it makes a greater
effort to maintain the various page queues at the levels
determined to be the most effective. An urban myth has circulated
for years that Linux did a better job avoiding swapouts than
FreeBSD, but this in fact is not true. What was actually
occurring was that FreeBSD was proactively paging out unused
pages in order to make room for more disk cache while Linux
was keeping unused pages in core and leaving less memory
available for cache and process pages. I don't know whether
this is still true today.
</P>
<p font class="Normal">
<strong>Pre-Faulting and Zeroing Optimizations</strong>
</P>
<p font class="Normal">
Taking a VM fault is not expensive if the underlying page is
already in core and can simply be mapped into the process, but
it can become expensive if you take a whole lot of them on a
regular basis. A good example of this is running a program
such as 'ls' or 'ps' over and over again. If the program binary
is mapped into memory but not mapped into the page table, then
all the pages that will be accessed by the program will have
to be faulted in every time the program is run. This is
unnecessary when the pages in question are already in the VM
Cache, so FreeBSD will attempt to pre-populate a process's page
tables with those pages that are already in the VM Cache. One
thing that FreeBSD does not yet do is pre-copy-on-write certain
pages on exec. For example, if you run the /bin/ls program
while running 'vmstat 1' you will notice that it always takes
a certain number of page faults, even when you run it over and
over again. These are zero-fill faults, not program code faults
(which were pre-faulted in already). Pre-copying pages on exec
or fork is an area that could use more study.
</P>
<p font class="Normal">
A large percentage of page faults that occur are zero-fill
faults. You can usually see this by observing the 'vmstat -s'
output. These occur when a process accesses pages in its BSS
area. The BSS area is expected to be initially zero but the
VM system does not bother to allocate any memory at all until
the process actually accesses it. When a fault occurs the VM
system must not only allocate a new page, it must zero it as
well. To optimize the zeroing operation the VM system has the
ability to pre-zero pages and mark them as such, and to request
pre-zeroed pages when zero-fill faults occur. The pre-zeroing
occurs whenever the CPU is idle but the number of pages the
system pre-zeros is limited in order to avoid blowing away the
memory caches. This is an excellent example of adding complexity
to the VM system in order to optimize the critical path.
</P>
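<p font class="Normal">
A toy model of the pre-zeroing idea, with a bound on how much the idle
loop zeros per pass, might look like this (the names and the limit are
invented for illustration):
</P>
<pre>
#include <string.h>

#define PAGE_SIZE     4096
#define NFREE         128
#define PREZERO_LIMIT 16            /* illustrative cap per idle pass */

static struct { char data[PAGE_SIZE]; int zeroed; } freep[NFREE];
static int nfree = NFREE;

/* Run from the idle loop: zero a bounded number of free pages and mark
 * them so a later zero-fill fault can skip the work. */
static void idle_prezero(void)
{
    int done = 0;
    for (int i = 0; i < nfree && done < PREZERO_LIMIT; i++) {
        if (!freep[i].zeroed) {
            memset(freep[i].data, 0, PAGE_SIZE);
            freep[i].zeroed = 1;
            done++;
        }
    }
}

/* A zero-fill fault takes a free page and only zeroes it if the idle
 * loop has not already done so. */
static void *zero_fill_fault(void)
{
    if (nfree == 0)
        return 0;                   /* out of pages: pageout must run */
    int i = --nfree;
    if (!freep[i].zeroed)
        memset(freep[i].data, 0, PAGE_SIZE);
    freep[i].zeroed = 0;            /* the process will dirty it now */
    return freep[i].data;
}
</pre>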
<p font class="Normal">
<strong>Page Table Optimizations</strong>
</P>
<p font class="Normal">
The page table optimizations make up the most contentious part
of the FreeBSD VM design and they have shown some strain with
the advent of serious use of mmap(). I think this is actually
a feature of most BSDs though I am not sure when it was first
introduced. There are two major optimizations. The first is
that hardware page tables do not contain persistent state but
instead can be thrown away at any time with only a minor amount
of management overhead. The second is that every active page
table entry in the system has a governing pv_entry structure
which is tied into the vm_page structure. FreeBSD can simply
iterate through those mappings that are known to exist while
Linux must check all page tables that *might* contain a specific
mapping to see if it does, which can achieve O(n^2) overhead
in certain situations. It is because of this that FreeBSD
tends to make better choices on which pages to reuse or swap
when memory is stressed, giving it better performance under
load. However, FreeBSD requires kernel tuning to accommodate
large-shared-address-space situations such as those that can
occur in a news system because it may run out of pv_entry
structures.
</P>
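<p font class="Normal">
The pv_entry idea reduces to a linked list hanging off each physical page,
so that removing every mapping of a page touches only the page table
entries that actually exist. A minimal sketch, with illustrative types
rather than FreeBSD's real ones:
</P>
<pre>
#include <stddef.h>

typedef unsigned long pte_t;
#define PTE_VALID 0x1

struct pv_entry {
    pte_t           *pte;           /* the page-table slot mapping this page */
    struct pv_entry *next;
};

struct vm_page {
    struct pv_entry *pv_list;       /* every active mapping of this page */
};

/* Invalidate every mapping of one physical page: the cost is the number
 * of mappings, not the number of page tables in the system. */
static void page_remove_all(struct vm_page *m)
{
    for (struct pv_entry *pv = m->pv_list; pv != NULL; pv = pv->next)
        *pv->pte &= ~PTE_VALID;     /* a real pmap would also flush the TLB */
    m->pv_list = NULL;              /* the pv_entry structures would be freed */
}
</pre>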
<p font class="Normal">
Both Linux and FreeBSD need work in this area. FreeBSD is
trying to maximize the advantage of a potentially sparse
active-mapping model (not all processes need to map all pages
of a shared library, for example), whereas Linux is trying to
simplify its algorithms. FreeBSD generally has the performance
advantage here at the cost of wasting a little extra memory,
but FreeBSD breaks down in the case where a large file is
massively shared across hundreds of processes. Linux, on the
other hand, breaks down in the case where many processes are
sparsely-mapping the same shared library and also runs non-optimally
when trying to determine whether a page can be reused or not.
</P>
<p font class="Normal">
<strong>Page Coloring</strong>
</P>
<p font class="Normal">
We'll end with the page coloring optimizations. Page coloring
is a performance optimization designed to ensure that accesses
to contiguous pages in virtual memory make the best use of the
processor cache. In ancient times (i.e. 10+ years ago) processor
caches tended to map virtual memory rather than physical memory.
This led to a huge number of problems including having to clear
the cache on every context switch in some cases, and problems
with data aliasing in the cache. Modern processor caches map
physical memory precisely to solve those problems. This means
that two side-by-side pages in a process's address space may
not correspond to two side-by-side pages in the cache. In
fact, if you aren't careful side-by-side pages in virtual memory
could wind up using the same page in the processor cache --
leading to cacheable data being thrown away prematurely and
reducing CPU performance. This is true even with multi-way
set-associative caches (though the effect is mitigated somewhat).
</P>
<p font class="Normal">
FreeBSD's memory allocation code implements page coloring
optimizations, which means that the memory allocation code will
attempt to locate free pages that are contiguous from the point
of view of the cache. For example, if page 16 of physical
memory is assigned to page 0 of a process's virtual memory and
the cache can hold 4 pages, the page coloring code will not
assign page 20 of physical memory to page 1 of a process's
virtual memory. It would, instead, assign page 21 of physical
memory. The page coloring code attempts to avoid assigning
page 20 because this maps over the same cache memory as page
16 and would result in non-optimal caching. This code adds a
significant amount of complexity to the VM memory allocation
subsystem as you can well imagine, but the result is well worth
the effort. Page Coloring makes VM memory as deterministic as
physical memory in regards to cache performance.
</P>
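<p font class="Normal">
In code, the selection rule from the example above amounts to comparing
colors -- the page number modulo the number of pages the cache covers --
and preferring a free physical page whose color matches the virtual page.
This sketch uses the 4-page cache from the example; the function names are
invented for illustration.
</P>
<pre>
#define CACHE_PAGES 4                   /* the 4-page cache from the example */

static int page_color(unsigned long pgno)
{
    return (int)(pgno % CACHE_PAGES);
}

/* Pick a free physical page for virtual page vpgno, preferring a matching
 * color; fall back to any free page if none of that color is available.
 * With this rule, virtual page 1 gets physical page 21 rather than 20. */
static long alloc_colored(const int *is_free, long npages, unsigned long vpgno)
{
    int  want = page_color(vpgno);
    long fallback = -1;
    for (long p = 0; p < npages; p++) {
        if (!is_free[p])
            continue;
        if (page_color(p) == want)
            return p;
        if (fallback < 0)
            fallback = p;
    }
    return fallback;
}
</pre>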
<p font class="Normal">
<strong>Conclusion</strong>
</P>
<p font class="Normal">
Virtual memory in modern operating systems must address a number
of different issues efficiently and for many different usage
patterns. The modular and algorithmic approach that BSD has
historically taken allows us to study and understand the current
implementation as well as relatively cleanly replace large
sections of the code. There have been a number of improvements
to the FreeBSD VM system in the last several years, and work
is ongoing.
</P>
<hr>
<p font class="Normal">
<h2>A Bonus Question and Answer Session by Allen Briggs </h2><a href="mailto:briggs@ninthwonder.com">&#60;briggs@ninthwonder.com&#62;</a>
</P>
<p font class="Normal">
Q: What is "the interleaving algorithm" that you refer to in your listing
of the ills of the FreeBSD 3.x swap arrangements?
</P>
<blockquote>
<p font class="Normal">
A: FreeBSD uses a fixed swap interleave which defaults to 4. This means
that FreeBSD reserves space for four swap areas even if you only have one,
two, or three. Since swap is interleaved the linear address space
representing the 'four swap areas' will be fragmented if you don't actually
have four swap areas. For example, if you have two swap areas A and B,
FreeBSD's linear address space representation of swap will be
interleaved in blocks of 16 pages:
</P>
<p font class="Normal">
A B C D A B C D A B C D A B C D
</P>
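<p font class="Normal">
Arithmetically, the interleave maps a linear swap block number onto one of
the four slots in rotation, 16 pages at a time; with fewer than four real
devices the remaining slots are permanent holes. A small sketch (names
invented for illustration):
</P>
<pre>
#define SWAP_INTERLEAVE 4
#define STRIPE_PAGES    16

struct swap_loc { int device; long offset; };

static struct swap_loc swap_map(long blkno, int ndevices)
{
    struct swap_loc loc;
    long stripe = blkno / STRIPE_PAGES;

    loc.device = (int)(stripe % SWAP_INTERLEAVE);     /* 0=A, 1=B, 2=C, 3=D */
    loc.offset = (stripe / SWAP_INTERLEAVE) * STRIPE_PAGES
               + blkno % STRIPE_PAGES;
    if (loc.device >= ndevices)
        loc.device = -1;                              /* a hole: no such device */
    return loc;
}
</pre>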
<p font class="Normal">
FreeBSD 3.x uses a 'sequential list of free regions' approach to accounting
for the free swap areas. The idea is that large blocks of free linear
space can be represented with a single list node (kern/subr_rlist.c).
But due to the fragmentation the sequential list winds up being insanely
fragmented. In the above example, completely unused swap will have A and
B shown as 'free' and C and D shown as 'all allocated'. Each A-B sequence
requires its own list node because C and D are holes, so the list
node cannot be combined with the next A-B sequence.
</P>
<p font class="Normal">
Why do we interleave our swap space instead of just tacking swap areas onto
the end and do something fancier? Because it's a whole lot easier to
allocate linear swaths of an address space and have the result
automatically be interleaved across multiple disks than it is to try to
put that sophistication elsewhere.
</P>
<p font class="Normal">
The fragmentation causes other problems. Because the free list under
3.x is linear and has such a huge amount of inherent fragmentation,
allocating and freeing swap winds up being an O(N) algorithm instead of
an O(1) algorithm. Combine that with other factors (heavy swapping) and you
start getting into O(N^2) and O(N^3) levels of overhead, which is bad.
The 3.x system may also need to allocate KVM during a swap operation to
create a new list node which can lead to a deadlock if the system is
trying to pageout pages in a low-memory situation.
</P>
<p font class="Normal">
Under 4.x we do not use a sequential list. Instead we use a radix tree
and bitmaps of swap blocks rather than ranged list nodes. We take the
hit of preallocating all the bitmaps required for the entire swap
area up front but it winds up wasting less memory due to the use of
a bitmap (one bit per block) instead of a linked list of nodes. The
use of a radix tree instead of a sequential list gives us nearly O(1)
performance no matter how fragmented the tree becomes.
</P>
</blockquote>
<p font class="Normal">
Q: I don't get the following:
</P>
<p font class="Normal">
It is important to note that the FreeBSD VM system attempts to separate
clean and dirty pages for the express reason of avoiding unnecessary
flushes of dirty pages (which eats I/O bandwidth), nor does it move
pages between the various page queues gratuitously when the memory subsystem
is not being stressed. This is why you will see some systems with
very low cache queue counts and high active queue counts when doing a
'systat -vm' command.
</P>
<p font class="Normal">
Q: How is the separation of clean and dirty (inactive) pages related to the
situation where you see low cache queue counts and high active queue counts
in 'systat -vm'? Do the systat stats roll the active and dirty pages
together for the active queue count?
</P>
<blockquote>
<p font class="Normal">
A: Yes, that is confusing. The relationship is "goal" versus "reality".
Our goal is to separate the pages but the reality is that if we are not
in a memory crunch, we don't really have to.
</P>
<p font class="Normal">
What this means is that FreeBSD will not try very hard to separate out
dirty pages (inactive queue) from clean pages (cache queue) when the
system is not being stressed, nor will it try to deactivate pages
(active queue -&#62; inactive queue) when the system is not being stressed,
even if they aren't being used.
</P>
</blockquote>
<p font class="Normal">
Q: In the /bin/ls / 'vmstat 1' example, wouldn't some of the page faults be
data page faults (COW from executable file to private page)? I.e., I
would expect the page faults to be some zero-fill and some program data.
Or are you implying that FreeBSD does do pre-COW for the program data?
</P>
<blockquote>
<p font class="Normal">
A: A COW fault can be either zero-fill or program-data. The mechanism
is the same either way because the backing program-data is almost
certainly already in the cache. I am indeed lumping the two together.
FreeBSD does not pre-COW program data or zero-fill, but it *does*
pre-map pages that exist in its cache.
</P>
</blockquote>
<p font class="Normal">
Q: In your section on page table optimizations, can you give a little more
detail about pv_entry and vm_page (or should vm_page be vm_pmap -- as
in 4.4, cf. pp. 180-181 of McKusick, Bostic, Karels, Quarterman)?
Specifically, what kind of operation/reaction would require scanning the
mappings?
</P>
<p font class="Normal">
How does Linux do in the case where FreeBSD breaks down (sharing a large
file mapping over many processes)?
</P>
<blockquote>
<p font class="Normal">
A: A vm_page represents an (object,index#) tuple. A pv_entry represents
a hardware page table entry (pte). If you have five processes sharing
the same physical page, and three of those processes' page tables
actually map the page, that page will be represented by a single
vm_page structure and three pv_entry structures.
</P>
<p font class="Normal">
pv_entry structures only represent pages mapped by the MMU (one
pv_entry represents one pte). This means that when we need to remove
all hardware references to a vm_page (in order to reuse the page for
something else, page it out, clear it, dirty it, and so forth) we can
simply scan the linked list of pv_entry's associated with that vm_page
to remove or modify the pte's from their page tables.
</P>
<p font class="Normal">
Under Linux there is no such linked list. In order to remove all the
hardware page table mappings for a vm_page linux must index into every
VM object that *might* have mapped the page. For example, if you have
50 processes all mapping the same shared library and want to get rid of
page X in that library, you need to index into the page table for
each of those 50 processes even if only 10 of them have actually mapped
the page. So Linux is trading off the simplicity of its design against
performance. Many VM algorithms which are O(1) or O(small N) under FreeBSD
wind up being O(N), O(N^2), or worse under Linux. Since the pte's
representing a particular page in an object tend to be at the same
offset in all the page tables they are mapped in, reducing the number
of accesses into the page tables at the same pte offset will often avoid
blowing away the L1 cache line for that offset, which can lead to better
performance.
</P>
<p font class="Normal">
FreeBSD has added complexity (the pv_entry scheme) in order to increase
performance (to limit page table accesses to *only* those pte's that need
to be modified).
</P>
<p font class="Normal">
But FreeBSD has a scaling problem that Linux does not in that there are
a limited number of pv_entry structures and this causes problems when you
have massive sharing of data. In this case you may run out of pv_entry
structures even though there is plenty of free memory available. This
can be fixed easily enough by bumping up the number of pv_entry structures
in the kernel config, but we really need to find a better way to do it.
</P>
<p font class="Normal">
In regards to the memory overhead of a page table versus the pv_entry
scheme: Linux uses 'permanent' page tables that are not thrown away, but
does not need a pv_entry for each potentially mapped pte. FreeBSD uses
'throw away' page tables but adds in a pv_entry structure for each
actually-mapped pte. I think memory utilization winds up being about
the same, giving FreeBSD an algorithmic advantage with its ability to
throw away page tables at will with very low overhead.
</P>
</blockquote>
<p font class="Normal">
Q: Finally, in the page coloring section, it might help to have a little
more description of what you mean here. I didn't quite follow it.
</P>
<blockquote>
<p font class="Normal">
A: Do you know how an L1 hardware memory cache works? I'll explain:
Consider a machine with 16MB of main memory but only 128K of L1 cache.
Generally the way this cache works is that each 128K block of main memory
uses the *same* 128K of cache. If you access offset 0 in main memory
and then offset 128K in main memory you can wind up throwing
away the cached data you read from offset 0!
</P>
<p font class="Normal">
Now, I am simplifying things greatly. What I just described is what
is called a 'direct mapped' hardware memory cache. Most modern caches
are what are called 2-way-set-associative or 4-way-set-associative
caches. The set-associativity allows you to access up to N different
memory regions that overlap the same cache memory without destroying
the previously cached data. But only N.
</P>
<p font class="Normal">
So if I have a 4-way set associative cache I can access offset 0,
offset 128K, 256K and offset 384K and still be able to access offset 0
again and have it come from the L1 cache. If I then access offset 512K,
however, one of the four previously cached data objects will be thrown
away by the cache.
</P>
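<p font class="Normal">
The arithmetic behind that example is just a modulo. For a 128K
direct-mapped cache every physical address competes for the slot at
(address % 128K), so offsets 0 and 128K collide while 128K and 132K do
not; an N-way cache tolerates N such collisions per slot before evicting
something. A quick illustration:
</P>
<pre>
#include <stdio.h>

#define CACHE_SIZE (128u * 1024u)      /* the 128K cache from the example */

/* In a direct-mapped cache every address lands at (addr % CACHE_SIZE). */
static unsigned long cache_slot(unsigned long addr)
{
    return addr % CACHE_SIZE;
}

int main(void)
{
    printf("offset 0    -> slot %lu\n", cache_slot(0));
    printf("offset 128K -> slot %lu\n", cache_slot(128 * 1024)); /* collides with 0 */
    printf("offset 132K -> slot %lu\n", cache_slot(132 * 1024)); /* 4K: no collision */
    return 0;
}
</pre>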
<p font class="Normal">
It is extremely important... EXTREMELY important for most of a processor's
memory accesses to be able to come from the L1 cache, because the L1
cache operates at the processor frequency. The moment you have an L1
cache miss and have to go to the L2 cache or to main memory, the processor
will stall and potentially sit twiddling its fingers for *hundreds* of
instructions worth of time waiting for a read from main memory to complete.
Main memory (the dynamic ram you stuff into a computer) is S.L.O.W.,
capitalized and boldfaced, when compared to the speed of a modern
processor core.
</P>
<p font class="Normal">
Ok, so now onto page coloring: All modern memory caches are what are
known as *physical* caches. They cache physical memory addresses, not
virtual memory addresses. This allows the cache to be left alone across
a process context switch, which is very important.
</P>
<p font class="Normal">
But in the UNIX world you are dealing with virtual address spaces, not
physical address spaces. Any program you write will see the virtual
address space given to it. The actual *physical* pages underlying that
virtual address space are not necessarily physically contiguous! In
fact, you might have two pages that are side by side in a process's
address space which wind up being at offset 0 and offset 128K in
*physical* memory.
</P>
<p font class="Normal">
A program normally assumes that two side-by-side pages will be optimally
cached. That is, that you can access data objects in both pages without
having them blow away each other's cache entry. But this is only true
if the physical pages underlying the virtual address space are contiguous
(insofar as the cache is concerned).
</P>
<p font class="Normal">
This is what Page coloring does. Instead of assigning *random* physical
pages to virtual addresses, which may result in non-optimal cache
performance, Page coloring assigns *reasonably-contiguous* physical pages
to virtual addresses. Thus programs can be written under the assumption
that the characteristics of the underlying hardware cache are the same
for their virtual address space as they would be if the program had been
run directly in a physical address space.
</P>
<p font class="Normal">
Note that I say 'reasonably' contiguous rather than simply 'contiguous'.
From the point of view of a 128K direct mapped cache, the physical
address 0 is the same as the physical address 128K. So two side-by-side
pages in your virtual address space may wind up being offset 128K and
offset 132K in physical memory, but could also easily be offset 128K
and offset 4K in physical memory and still retain the same cache
performance characteristics. So page-coloring does *NOT* have to
assign truly contiguous pages of physical memory to contiguous pages
of virtual memory, it just needs to make sure it assigns contiguous
pages from the point of view of cache performance and operation.
</P>
<p font class="Normal">
Oops, that was a bit longer explanation than I intended.
</P>
<p font class="Normal">
-Matt
</P>
</blockquote>
<hr noshade color="#dadada"><br>
<font class="Small">Author maintains all copyrights on this article.<br></font>
<p align=right>Back to the <a href="..">OS/RC</a></p>
</body>
</html>

@@ -0,0 +1,401 @@
<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>Bringing SMP to Your UP Operating System</title>
<meta http-equiv="Cache-Control" content="must-revalidate">
</head>
<body bgcolor="#ffffff" vlink="#0000ff">
<table width=600 align=center>
<tr><td>
<center>
<h1>Bringing SMP to Your UP Operating System</h1>
Sidney Cammeresi
</center>
<br><br><br>
<blockquote>
<h3>Overview</h3>
<ol>
<li><a href="#preface">Preface</a></li>
<li><a href="#terms">Terminology</a></li>
<li><a href="#config">MP Configuration Tables</a></li>
<li><a href="#apic">Using Local APICs</a></li>
<li><a href="#booting">Booting Application Processors</a></li>
<li><a href="#protmode">Switching from Real Mode to Protected Mode</a></li>
<li><a href="#ipis">Interprocessor Interrupts</a></li>
<li><a href="#other">Other Considerations<br><br></a></li>
<li>Scheduling</li>
<li>Using the I/O APIC<br><br></li>
<!--
<li><a href="#scheduling">Scheduling</a></li>
<li><a href="#ioapic">Using the I/O APIC</a></li>
-->
<li><a href="#references">References</a></li>
<li><a href="#revsions">Revision History</a></li>
</ol>
</blockquote>
<br><br><br>
<h3>
<a name="preface">Preface</a></h3>
This tutorial is intended as a supplement to the
<a href="http://www.acm.uiuc.edu/sigops/roll_your_own">SigOps OS
Tutorial</a> to teach the fundamentals of symmetric multiprocessing
using Intel MP compliant hardware. Knowledge of the concepts and
implementations of basic operating system parts such as managing virtual
memory and multitasking are assumed and will not be discussed except as
they relate to multiprocessing. Knowledge equivalent to an intermediate
or advanced computer architecture college course will be helpful in
understanding scheduling issues, but is not required.
<p>
This tutorial is not intended to be a complete explanation of how to
implement an SMP-capable operating system, nor as a replacement for
Intel's documentation. Rather it is designed to
give an overview of the things I learned in writing SMP support for
<a href="http://www.frotz.net/openblt">OpenBLT</a>, a freely
redistributable microkernel-based operating system under the BSD
licence. Particularly, some tedious hardware
aspects will not be discussed in detail when the reader could just as
easily read official Intel documentation. The interested reader should
refer to the references for more detailed information. For code examples,
the reader should refer to the source code of
<a href="http://www.frotz.net/openblt">OpenBLT</a> or
<a href="http://www.freebsd.org">FreeBSD</a>. The Linux kernel source
code might be helpful, although it is under the GPL.
<p>
This tutorial is a work in progress. If you see an error or something
that needs clarification, please <a href="mailto:cammeres@uiuc.edu">e-mail
me</a>.
<p>
<h3>
<a name="terms">Terminology</a>
</h3>
<dl>
<dt>AP
<dd>application processor. A processor that is not the BSP. All APs are
in a halted state when the BIOS first gives control to the operating
system.
<dt>APIC
<dd>Advanced Programmable Interrupt Controller. Either a local APIC or
an I/O APIC. It is attached to the APIC bus.
<dt>APIC bus
<dd>A special non-architectural bus on which the APICs in the system
send messages.
<dt>BSP
<dd>bootstrap processor. The processor which is given control after the
BIOS finishes its POST.
<dt>I/O APIC
<dd>A special APIC for receiving and distributing interrupts from external
devices which is backward compatible with the PIC. There is generally
only one per computer.
<dt>IPI
<dd>interprocessor interrupt. A special interrupt sent from one processor
to another; the originating processor programs its APIC with a target
(or logical target) ID and an interrupt vector.
<dt>Local APIC
<dd>an APIC built in to the processor. It is responsible for dispatching
interrupts sent over the APIC bus to its processor core and sending
interrupts to other processors over the APIC bus.
<dt>MP
<dd>Intel's MultiProcessor Specification, a standard which defines how
SMP hardware should be presented to the operating system and how the
operating system should interact with this hardware.
<dt>serialisation
<dd>The act of executing a certain instruction which causes the processor
to pause to
retire all instructions currently being executed before proceeding
to the next instruction in the stream. For example, before switching
to protected mode, the processor must retire all instructions that
began executing in real mode before beginning any in protected mode.
<dt>SMP
<dd>symmetric multiprocessing. Using multiple processors which share
the same physical memory in the same computer at the same time. You are
probably reading this tutorial with the hope that your operating
system will become SMP-capable.
<dt>UP
<dd>uniprocessor. Your operating system to date is a UP operating system.
</dl>
<h3>
<a name="config">MP Detection and Configuration</a>
</h3>
When the system first starts, the BIOS detects the hardware installed in
the system using electrical means and then creates structures to describe
this hardware to the operating system. There are two such tables.
The first is the MP Floating Pointer Structure, which is required.
The second is the MP Configuration Table, which is optional. If the
configuration table does not exist, the operating system should set up
the default configuration indicated in the floating pointer structure.
Some data in the tables is in ASCII. Strings are padded with spaces
and are not null-terminated.
<p>
First, you need to find the floating pointer structure. According to
the spec, it can be in one of four places: (1) in the first kilobyte
of the extended BIOS data area, (2) the last kilobyte of base memory,
(3) the top of physical memory, or (4) the BIOS read-only memory space
between 0xe0000 and 0xfffff. You need to search these areas for the
four-byte signature "_MP_" which denotes the start of the floating
pointer structure. Absence of this structure indicates that the system
is not MP compliant. At this point your operating system can either halt,
or it can fall back into a UP setup.
<p>
You should checksum the structure to make sure it has not been corrupted.
There is not much of interest in the floating pointer structure, unless
your system does not have a configuration table. In this case, you will
need to get the number of the default configuration your system adheres to
and set up the system for SMP using those parameters. Otherwise, you will
need to get the address of the configuration table and begin parsing that.
<p>
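As a rough illustration, the search and checksum might look like the
following C sketch. It assumes the BIOS areas are identity-mapped (true
early in a flat-mapped kernel, not in a user program), and the structure
layout is abridged from the MP specification.
<pre>
#include <stdint.h>
#include <string.h>

struct mp_float {                 /* MP Floating Pointer Structure, abridged */
    char     signature[4];        /* "_MP_" */
    uint32_t config_addr;         /* physical address of the config table */
    uint8_t  length;              /* length in 16-byte units */
    uint8_t  spec_rev;
    uint8_t  checksum;            /* all bytes of the structure sum to zero */
    uint8_t  features[5];
};

static int checksum_ok(const uint8_t *p, unsigned len)
{
    uint8_t sum = 0;
    while (len--)
        sum += *p++;
    return sum == 0;
}

/* Scan a candidate region on 16-byte boundaries for the "_MP_" signature. */
static struct mp_float *mp_search(uintptr_t start, uintptr_t end)
{
    for (uintptr_t a = start; a + sizeof(struct mp_float) <= end; a += 16) {
        struct mp_float *f = (struct mp_float *)a;
        if (memcmp(f->signature, "_MP_", 4) == 0 &&
            checksum_ok((const uint8_t *)f, f->length * 16u))
            return f;
    }
    return 0;                     /* not found: fall back to UP, or halt */
}

/* Typical call sites: the last KB of base memory and the BIOS ROM area,
 * e.g. mp_search(0x9fc00, 0xa0000) and mp_search(0xe0000, 0x100000). */
</pre>
<p>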
The configuration table is divided into three parts: a header, a base
section, and an extended section. The header begins with the four-byte
signature "PCMP", although you do not have to search for it. Once you
find it, checksum it. At this point, you can print the OEM and product
ID strings in the configuration table if you want. You should get the
address of the local APIC from this and store it. Then, proceed to
parse the base section.
<p>
The base section consists of a set of entries that describe either
processors, system busses, I/O APICs, I/O interrupt assignments, or
local interrupt assignments. All entries are eight bytes in length,
save processor entries which are twenty bytes. The first byte of each
entry denotes the type of the entry. Look through each entry. You will
probably want to generate quite a few OS-specific data structures here.
In particular, you will want to note the APIC ID of each processor in
the system, its version, and its type as well as the address of the
system's I/O APIC.
<p>
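Concretely, walking the base section is a matter of stepping through the
entries by size and switching on the type byte; the sketch below uses
abridged, illustrative structures rather than full MP-spec definitions.
<pre>
#include <stdint.h>

enum { MP_ENTRY_CPU = 0, MP_ENTRY_BUS = 1, MP_ENTRY_IOAPIC = 2,
       MP_ENTRY_IOINT = 3, MP_ENTRY_LINT = 4 };

struct mp_cpu_entry {             /* abridged: 20 bytes in the real table */
    uint8_t type;                 /* 0 for a processor entry */
    uint8_t apic_id;
    uint8_t apic_version;
    uint8_t flags;                /* bit 0: enabled, bit 1: this is the BSP */
    /* CPU signature and feature flags follow in the real entry */
};

static void parse_base(const uint8_t *p, uint16_t entry_count)
{
    for (uint16_t i = 0; i < entry_count; i++) {
        if (*p == MP_ENTRY_CPU) {
            const struct mp_cpu_entry *c = (const struct mp_cpu_entry *)p;
            /* record c->apic_id, c->apic_version, BSP/AP role, ... */
            (void)c;
            p += 20;              /* processor entries are 20 bytes */
        } else {
            /* bus, I/O APIC and interrupt entries: note what you need */
            p += 8;               /* all other entries are 8 bytes */
        }
    }
}
</pre>
<p>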
<h3>
<a name="apic">Using Local APICs</a>
</h3>
MP systems have a special bus to which all APICs in the system are
connected. This bus is one of the ways the processors can communicate
with one another (the other, of course, is shared memory). APICs (both
local and I/O) are memory mapped devices. The default location for
the local APIC is at 0xfee00000 in physical memory. The local APIC
will appear in the same place for each processor, but each processor
will reference its own APIC; the APIC intercepts memory references to
its registers, and those references will not generate bus cycles on
some systems. Since APICs are mapped in high memory, the APs will have
to switch to protected mode before they can initialise their local APICs.
If you like, you can map the APIC to a different address using the paging
unit, but be sure to disable caching in the page table entry since some
registers can change between accesses. For this reason, pointers to
APIC registers should be volatile. To initialise the BSP's local APIC,
set the enable bit in the spurious interrupt vector register and set
the error interrupt vector in the local vector table.
<p>
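In C, the initialisation described above comes down to two volatile writes
to the memory-mapped register block. This sketch uses the default base
address and the register offsets from Intel's documentation; the vector
numbers are up to you.
<pre>
#include <stdint.h>

#define LAPIC_BASE       0xfee00000u
#define LAPIC_SVR        0x0f0u      /* spurious interrupt vector register */
#define LAPIC_LVT_ERROR  0x370u      /* error entry in the local vector table */
#define LAPIC_SVR_ENABLE 0x100u      /* APIC software enable bit */

/* Volatile, because APIC registers can change between accesses. */
static volatile uint32_t *lapic_reg(uint32_t off)
{
    return (volatile uint32_t *)(uintptr_t)(LAPIC_BASE + off);
}

static void lapic_init(uint8_t spurious_vec, uint8_t error_vec)
{
    *lapic_reg(LAPIC_SVR)       = LAPIC_SVR_ENABLE | spurious_vec;
    *lapic_reg(LAPIC_LVT_ERROR) = error_vec;
}
</pre>
<p>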
<h3>
<a name="booting">Booting Application Processors</a>
</h3>
Once you have detected the processors in the system, set up your local
APIC, and verified that you can communicate with it (hint: read the
APIC version register), it's time to boot the APs. N.B. that it is good
practise to not try to boot the BSP here. That would be bad.
<p>
Since the APs will wake up in real mode, everything they need to get
started should be in low memory (below 0x100000 or one megabyte).
First, set the BIOS shutdown code by writing 0xa to CMOS location 0xf. Then,
grab a page of memory for the AP's stack. You will also need space
to store the `trampoline' code, i.e. the code the processor executes
after waking up to switch to protected mode and jump to the kernel.
You can either use the same page of code for each processor or store
the code at the bottom of the processor's stack. Note that the start of
the code must be at a page-aligned address. Copy the code there, then
set the warm reset vector at address 40:67 to the start of this code.
Next, you should reset a bit in the kernel which the processor will use
to signal that it has booted and finished initialisation and clear any
APIC error by writing a zero to the error status register. If you need
to pass any parameters or data to the AP, now would be a good time to
set that up. For example, since OpenBLT's kernel runs in high memory,
I have to pass the address of the page directory in memory so that the
AP can load it and enable paging before calling the kernel.
<p>
Now you can actually boot the processor. The procedure consists of
sending a sequence of interrupts to the processor. The incremental effect
of each is undefined, but at the end of the sequence, the processor will
be booted. First send an INIT IPI. Assert the INIT signal by writing
the target processor's APIC ID to the high word of the interrupt command
register. Then write to the low word with the bits set to enable the INIT
delivery mode, level triggered, and assert the interrupt. Deassert INIT
by repeating the procedure with the assert bit reset. Now, wait 10 ms.
Use of the APIC timer is suggested.
<p>
If the local APIC is not an 82489dx, you need to send two STARTUP IPIs.
Clear APIC errors, set the target APIC ID in the ICR, then send the
interrupt by writing to the low word of the ICR with bits set for STARTUP
delivery mode and with the code vector in the low byte. The code vector
is the physical page number at which the processor should start executing,
i.e. the start of your trampoline code. Wait 200 ms, then check the low
word of the ICR to make sure bit 16 is reset to indicate the interrupt
was dispatched before sending the second STARTUP. After sending it,
spin and wait for the AP to set its ready bit in memory. You may want
to set a timeout of 5 seconds, after which you assume the processor did
not wake up.
<p>
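Put together, the INIT/STARTUP dance looks roughly like the sketch below.
The register offsets and command bits follow Intel's ICR layout; delay_ms()
and the trampoline page are assumed to exist elsewhere, and error handling
and timeouts are omitted.
<pre>
#include <stdint.h>

#define LAPIC_BASE   0xfee00000u
#define LAPIC_ESR    0x280u          /* error status register */
#define LAPIC_ICR_LO 0x300u
#define LAPIC_ICR_HI 0x310u

#define ICR_INIT     0x00000500u     /* INIT delivery mode */
#define ICR_STARTUP  0x00000600u     /* STARTUP delivery mode */
#define ICR_ASSERT   0x00004000u
#define ICR_LEVEL    0x00008000u
#define ICR_PENDING  0x00001000u     /* delivery status: still being sent */

static volatile uint32_t *reg(uint32_t off)
{
    return (volatile uint32_t *)(uintptr_t)(LAPIC_BASE + off);
}

extern void delay_ms(int ms);        /* assumed APIC-timer based delay */

/* start_page: physical page number of the trampoline code (addr >> 12). */
static void start_ap(uint8_t apic_id, uint8_t start_page)
{
    *reg(LAPIC_ICR_HI) = (uint32_t)apic_id << 24;
    *reg(LAPIC_ICR_LO) = ICR_INIT | ICR_LEVEL | ICR_ASSERT;   /* assert INIT */
    *reg(LAPIC_ICR_HI) = (uint32_t)apic_id << 24;
    *reg(LAPIC_ICR_LO) = ICR_INIT | ICR_LEVEL;                /* deassert INIT */
    delay_ms(10);

    for (int i = 0; i < 2; i++) {            /* two STARTUP IPIs */
        *reg(LAPIC_ESR)    = 0;              /* clear any APIC errors */
        *reg(LAPIC_ICR_HI) = (uint32_t)apic_id << 24;
        *reg(LAPIC_ICR_LO) = ICR_STARTUP | start_page;
        delay_ms(200);                       /* wait, as described above */
        while (*reg(LAPIC_ICR_LO) & ICR_PENDING)
            ;                                /* spin until it is dispatched */
    }
    /* The caller then spins (with a timeout) on the AP's ready flag. */
}
</pre>
<p>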
<h3>
<a name="protmode">Switching from Real Mode to Protected Mode</a>
</h3>
Provided you did everything right above, the processor at some point woke
up in real mode and started executing the code you told it to. First,
execute a "cli" instruction to turn off interrupts, just in case. Now,
begin the switch to protected mode. Load an appropriate value into GDTR.
This can point either to the actual GDT or, in my case, a temporary GDT.
If you need to activate paging, load the address of a page directory
into cr3. Then set bit zero in cr0 to enable protected mode as well as
bit 31 if you need to enable paging to get into the kernel. Then do an
ljmp to the kernel text segment with an offset that points to the next
instruction to serialise the processor. Now that you're in protected
mode, load appropriate descriptors into the segment registers, then
execute a "cld", which is reportedly what gcc expects. Then, jump to
the starting address of your kernel.
<p>
Don't reference any symbols in this code since it
will be running at an address for which it was not linked;
all memory references must be absolute. Since your kernel is above
one megabyte in memory, you can't access any global variables in real
mode. Also be careful in specifying your offset address for the ljmp
instruction, and do specify the address of the start of your kernel,
not a symbol in the instruction that goes into the kernel.
Jumping to a symbol doesn't seem to work. For details,
see OpenBLT's kernel/trampoline.S.
<p>
Debugging this part is really not too bad. What you have to do is
establish some communication space in low memory, then have the AP write
bytes to that memory to explain what it is doing and print these out on
the BSP.
<p>
<h3>
<a name="ipis">Interprocessor Interrupts</a>
</h3>
IPIs are used to maintain synchronisation between the processors.
For example, if a kernel page table entry changes, both processors must
either flush their TLBs or invalidate that particular page table entry.
Whichever processor changed the mapping knows to do this automatically,
but the other processor does not; therefore, the processor which changed
the mapping must send an IPI to the other processor to tell it to flush
its TLB or invalidate the page table entry.
<p>
Using the local APIC, you can send interrupts to all processors, all
processors but the one sending the interrupt, or a specific processor
or logical address as well as self-interrupts. To send an IPI, write
the destination APIC ID, if needed, into the high word of the ICR, then
write the low word of ICR with the destination shorthand and interrupt
vector set to send the IPI. Be sure to wrap these functions in spinlocks.
You might want to turn off interrupts as well while sending IPIs.
<p>
<h3>
<a name="other">Other Considerations</a>
</h3>
One thing to note is that semaphores (a.k.a. spinlocks) may need to
be done differently under SMP. Consider a scenario where semaphores
are procured with a ``bts'' instruction. If both processors hit that
instruction at the same time while the semaphore is reset, they might
both think they have acquired it. For this reason, you would need to
use a ``lock'' prefix on that instruction to lock the system bus and
maintain synchronisation.
<p>
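For the x86 this comes down to using a locked read-modify-write instruction
to take the semaphore. The sketch below uses GCC-style inline assembly; it
is a minimal illustration, not a complete locking primitive (no back-off,
no interrupt handling).
<pre>
typedef volatile unsigned int spinlock_t;

static inline void spin_lock(spinlock_t *l)
{
    unsigned char taken;
    do {
        /* "lock" makes the bit test-and-set atomic across processors,
         * so two CPUs cannot both see the semaphore as free. */
        __asm__ __volatile__("lock; btsl $0, %0; setc %1"
                             : "+m"(*l), "=q"(taken)
                             :
                             : "cc", "memory");
    } while (taken);                 /* spin until we set the bit ourselves */
}

static inline void spin_unlock(spinlock_t *l)
{
    __asm__ __volatile__("movl $0, %0" : "=m"(*l) : : "memory");
}
</pre>
<p>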
<h3>
<a name="scheduling">Scheduling</a>
</h3>
Not yet.
<h3>
<a name="ioapic">Using the I/O APIC</a>
</h3>
Not yet.
<h3>
<a name="references">References</a>
</h3>
All of these documents are available from Intel's developer web site at
<a href="http://developer.intel.com">developer.intel.com</a>. Supposedly,
by request, Intel will also send you printed documentation by post.
<ul>
<li>82378ZB System I/O and 82379AB System I/O APIC
<ul>
<li>&sect;3.7: APIC Registers</li>
</ul>
</li>
<li>MultiProcessor Specification, Version 1.4
<ul>
<li>&sect;4: MP Configuration Table</li>
<li>&sect;B: Operating System Programming Guidelines</li>
</ul>
</li>
<li>Intel Architecture Software Developer's Manual, Volume 2: Instruction
Set Reference</li>
<li>Intel Architecture Software Developer's Manual, Volume 3: System
Programming Guide
<ul>
<li>&sect;7.1: Locked Atomic Operations</li>
<li>&sect;7.4: The APIC</li>
<li>&sect;8.7: Software Initialization for Protected Mode</li>
<li>&sect;8.8: Mode Switching</li>
<li>&sect;B: MP Bootup Sequence</li>
</ul>
</li>
</ul>
<h3>
<a name="revisions">Revision History</a>
</h3>
<pre>
07 Nov 1998 - initial version, most sections filled out.
</pre>
<br clear=all>
<hr>
<i>Bringing SMP to Your UP Operating System</i> is Copyright &copy; 1998 by
Sidney Cammeresi in its entirety. All rights reserved.<p>
Permission is granted to make verbatim copies of this tutorial for
non-commercial use provided this notice remains intact on all copies.
</td></tr>
</table>
<pre>$Id: smp.html 1.3 Thu, 01 Jul 1999 10:51:51 -0500 sac $</pre>
</body>
</html>

Binary file not shown.

Binary file not shown.

View File

@@ -0,0 +1,7 @@
<html>
<head>
<meta http-equiv="refresh" content="0;url=/Linux.old/sabre/os/articles">
</head>
<body lang="zh-CN">
</body>
</html>

View File

@@ -0,0 +1,280 @@
<HTML>
<HEAD>
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
<TITLE>Write Your Own Operating System</TITLE>
</HEAD>
<BODY>
<hr color="#0000FF" noshade>
<p>
<div align="center">
<h1>Write Your Own Operating System [FAQ]</h1>
<h5>by <A href="mailto:dfiber@mega-tokyo.com?subject=OS-FAQ">Stuart 'Dark Fiber' George</A> &lt;dfiber@mega-tokyo.com&gt;</h5>
<h5>Last Updated: Tuesday 23rd October 2000</h5>
</div>
</P>
<P title="" align="CENTER">Download a .zip'ed version of the os-faq <A href="ftp://ftp.mega-tokyo.com/pub/operating-systems/os-faq.zip">here</A></P>
<hr color="#0000FF" noshade>
<p>
<UL>
<LI>Introductions, Overview
<UL>
<LI><A href="os-faq-intro.html#introduction">Introduction</a>
<LI><A href="os-faq-intro.html#getting_started">Getting Started</a>
<LI><A href="os-faq-intro.html#what_bits_cant_i_make_in_c">What bits can't I make in C</a>
<LI><A href="os-faq-intro.html#what_order_should_i_make_things_in">What order should I make things in</a>
</UL>
<LI>Kernel Questions
<UL>
<LI><A href="os-faq-kernel.html#load_kernel">How do I make a kernel file I can load</a>
<LI><A href="os-faq-kernel.html#32bit_files">Help! When I load my kernel my machine resets</a>
<LI><A href="os-faq-boot.html#easier_load">Is there an easier way to boot my kernel</a>
</UL>
<LI>Boot Loaders and Boot Menus
<UL>
<LI><A href="os-faq-boot.html#boot_loadmenu">Tell me about bootloaders and bootmenus</a>
<li><a href="os-faq-bsmbr.html#boot_sector">Boot Sectors</a>
<li><a href="os-faq-bsmbr.html#mbr">Master Boot Record</a>
<LI>GRUB
<UL>
<LI><A href="os-faq-grub.html#whats_grub">What is GRUB</a>
<LI><A href="os-faq-grub.html#get_grub">Where can I get GRUB</a>
<LI><A href="os-faq-grub.html#grub_aout">GRUB and DJGPP A.OUT</a>
<LI><A href="os-faq-grub.html#grub_nasm">GRUB and NASM ELF files</a>
<LI><A href="os-faq-grub.html#grub_watcom">GRUB and Watcom ELF files</a>
</UL>
<LI>LILO
<UL>
<LI><A href="os-faq-lilo.html#lilo">What is LILO</a>
<LI>Where can I get LILO
</UL>
<li>XOSL
<li>System Commander
<li>Boot Magic
</UL>
<LI>COMPILERS
<Ul>
<LI>DJGPP
<UL>
<LI><A href="os-faq-elf.html#elf_files">Can DJGPP output ELF files</a>
</UL>
<!-- TODO -->
<LI>Watcom C/C++
<LI>Visual C/C++
</UL>
<LI>Hardware
<UL>
<LI>CPU
<UL>
<LI><A href="os-faq-v86.html#whats_v8086">What is v8086 mode?</a>
<LI><A href="os-faq-v86.html#detect_v86">How do I detect v8086 mode?</a>
<li>AMD K6
<ul>
<LI><A href="os-faq-cpu-amdk6.html#k6_writeback">AMD K6 WriteBack Optimisations</a>
</ul>
</UL>
<LI>Memory
<UL>
<LI>The A20
<UL>
<LI><A href="os-faq-memory.html#what_is_a20">What is the A20 line?</a>
<LI><A href="os-faq-memory.html#access_my_memory">Why cant I access all my memory?</a>
<LI><A href="os-faq-memory.html#enable_a20">How do I enable the A20 line?</a>
</UL>
<LI>Memory Sizing
<UL>
<LI><A href="os-faq-memory.html#determine_memory">How do I determine the amount of RAM?</a>
<LI><A href="os-faq-memory.html#determine_memory_bios">How do I determine the amount of RAM with the BIOS?</a>
<LI><A href="os-faq-memory.html#determine_memory_probe">How do I determine the amount of RAM with direct probing?</a>
</UL>
</UL>
<LI>IRQs and Exceptions, PIC, NMI, APIC, OPIC
<UL>
<LI><A href="os-faq-pics.html#irq_exception">How do I know if an IRQ or exception is firing?</a>
<LI><A href="os-faq-pics.html#what_pic">What is the PIC?</a>
<LI><A href="os-faq-pics.html#remap_pic">Can I remap the PIC?</a>
<LI><A href="os-faq-pics.html#nmi">So whats the NMI then?</a>
<LI><A href="os-faq-pics.html#apic">Tell me about APIC</a>
<LI><A href="os-faq-pics.html#opic">Tell me about OPIC</a>
</UL>
<LI>Interrupt Service Routines (ISRs)
<UL>
<LI><A href="os-faq-isr.html#isr">Whats an ISR?</a>
<LI><A href="os-faq-isr.html#normal_v_isr">Whats the difference between an ISR and a normal routine?</a>
<LI><A href="os-faq-isr.html#gcc_isr">So how do I do an ISR with GCC?</a>
</UL>
<LI>Video
<UL>
<LI><A href="os-faq-console.html#text_mode">How do I output text to the screen in protected mode?</a>
<LI><A href="os-faq-console.html#detect_text_screen">How do I detect if I have a colour or monochrome monitor?</a>
<LI><A href="os-faq-console.html#moving_cursor">How do I move the cursor when I print?</a>
</UL>
<LI>Plug and Play
<UL>
<LI><A href="os-faq-pnp.html#prog_pnp">Where can I find programming info on PNP?</a>
<LI><A href="os-faq-pnp.html#pnp_pmode">I heard you can do PNP calls with the BIOS in Protected Mode?</a>
</UL>
<LI>PCI
<UL>
<LI><A href="os-faq-pci.html#prog_pci">Where can I find programming info on PCI?</a>
<LI><A href="os-faq-pci.html#pci_pmode">I heard you can do PCI calls with the BIOS in Protected Mode?</a>
</UL>
</UL>
<LI>C Programming
<UL>
<LI><A href="os-faq-libc.html#no_printf">Where did my printf go?</a>
<LI><A href="os-faq-libc.html#libc">Whats this LIBC thing?</a>
<LI><A href="os-faq-libc.html#existing_libc">What C libraries exist for me to use?</a>
</UL>
<LI>C++ Programming
<UL>
<LI><A href="os-faq-cpp.html#start">Doing a kernel in C++</a>
<LI><A href="os-faq-cpp.html#rtti">Aiyah! Whats RTTI? (Run Time Type Info)</a>
<LI><A href="os-faq-cpp.html#disable_rtti">How do I disable RTTI in GCC?</a>
<LI><A href="os-faq-cpp.html#new_delete">Can I use NEW and DELETE in my kernel?</a>
</UL>
<LI>Linkers
<UL>
<LI><a href="os-faq-linker.html#linkers">Linker Info!</a>
<LI><a href="os-faq-linker.html#linkers_jloc">JLoc</a>
<LI><a href="os-faq-linker.html#linkers_alink">ALink</a>
<LI><a href="os-faq-linker.html#linkers_ld">LD (GNU)</a>
<LI><a href="os-faq-linker.html#linkers_tlink">TLink / TLink32 (Borland)</a>
<LI><a href="os-faq-linker.html#linkers_link">Link / NLink (Microsoft)</a>
<LI><a href="os-faq-linker.html#linkers_val">VAL</a>
<LI><a href="os-faq-linker.html#linkers_wlink">WLink (Watcom)</a>
<LI><a href="os-faq-linker.html#linkers_comp">A Comparison</a>
</UL>
<LI>Executable File Types
<UL>
<LI><A href="os-faq-exec.html#exec_files">Executable Files</a>
<LI><A href="os-faq-exec.html#exec_exe">EXE (dos &quot;MZ&quot;)</a>
<LI><A href="os-faq-exec.html#exec_ne">EXE (win16 &quot;NE&quot;)</a>
<LI><A href="os-faq-exec.html#exec_le">EXE (OS/2 &quot;LE/LX&quot;)</a>
<LI><A href="os-faq-exec.html#exec_pe">EXE (Win32 &quot;PE&quot;)</a>
<LI><A href="os-faq-exec.html#exec_elf">ELF</a>
<LI><A href="os-faq-exec.html#exec_coff">COFF</a>
<LI><A href="os-faq-exec.html#exec_aout">A.OUT</a>
</UL>
<LI>Filesystems
<UL>
<LI><A href="os-faq-fs.html#file_systems">Tell me about filesystems</a>
<LI><A href="os-faq-fs.html#fs_fat">FAT</a>
<LI><A href="os-faq-fs.html#fs_vfat">VFAT</a>
<LI><A href="os-faq-fs.html#fs_fat32">FAT32</a>
<LI><A href="os-faq-fs.html#fs_hpfs">HPFS (High Performance File System)</a>
<LI><A href="os-faq-fs.html#fs_ntfs">NTFS (New Technology File System)</a>
<LI><A href="os-faq-fs.html#fs_ext2fs">ext2fs (2nd extended file system)</a>
<LI><A href="os-faq-fs.html#fs_befs">BeFS</a>
<LI><A href="os-faq-fs.html#fs_ffs_amiga">FFS (Amiga)</a>
<LI><A href="os-faq-fs.html#fs_ffs_bsd">FFS (BSD)</a>
<LI><A href="os-faq-fs.html#fs_nfs">NFS</a>
<LI><A href="os-faq-fs.html#fs_afs">AFS</a>
<LI><a href="os-faq-fs.html#fs_rfs">RFS</a>
<LI><A href="os-faq-fs.html#fs_xfs">XFS</a>
</UL>
<LI>Resources
<UL>
<LI>Books
<ul>
<LI><A href="os-faq-books.html#books">Reference Books</a>
<LI><A href="os-faq-books.html#book_0">The Indispensable PC Hardware Book</A>
<LI><A href="os-faq-books.html#book_1">Operating System Concepts</A>
<LI><A href="os-faq-books.html#book_2">Operating Systems : Design and Implementation</A>
<LI><A href="os-faq-books.html#book_3">Operating Systems : Internals and Design Principals</A>
<LI><a href="os-faq-books.html#book_4">Distributed Operating Systems</a></td>
<LI><a href="os-faq-books.html#book_5">Inside Windows NT, Second Edition</a></td>
<LI><a href="os-faq-books.html#book_6">Lion's Commentary on UNIX sixth edition, with source code</a></td>
<LI><a href="os-faq-books.html#book_7">UNIX Internals: The New Frontiers</a></td>
</ul>
<LI><A href="os-faq-links.html#small_free_kernels">Some small kernels with source</a>
<LI><A href="os-faq-acronyms.html#acronyms">Chip Numbers, Acronyms and Things</A>
</UL>
<LI>Third Party Tools
<UL>
<LI><a href="os-faq-3rd.html#vmware">VMWare PC Emulator</a>
<LI><a href="os-faq-3rd.html#bochs">Bochs (i386) PC emulator</a>
<LI><a href="os-faq-3rd.html#mtools">MTools (DOS disk image tools)</a>
<LI><a href="os-faq-3rd.html#simics">SimICS (SunSparc Simulator)</a>
</UL>
<LI>Contributors
<UL>
<LI><A href="os-faq-contributors.html#contributors">Who helped with the FAQ</a>
</UL>
<LI>Todo
<UL>
<LI><A href="os-faq-todo.html#todo">The TODO list</a>
</UL>
</UL>
<!-- *************** DHTML Outline (end) ***************** -->
<hr color="#0000FF" noshade>
<p>
What's New!
<ul>
<li>Trying to add more material on various C/C++ compilers
<li>Added VMWare to the tools list
<li>Removed the link to the free Intel Developer CD's (offer is no longer valid)
<li>More info on some boot managers (XOSL, System Commander, Boot Magic, etc.)
<li>Fixed those nasty link colours
</ul>
<p>
<hr color="#0000FF" noshade>
<P>
<TABLE border="0" cellspacing="1" cellpadding="10" align="CENTER">
<CAPTION>The OS-FAQ is a member of the OS Web Ring
</CAPTION>
<TR>
<TD><A href="http://www.webring.org/cgi-bin/webring?ring=os&id=31&next" target="_top">Next</A>
</TD>
<TD><A href="http://www.webring.org/cgi-bin/webring?ring=os&id=31&skip" target="_top">Skip Next</A>
</TD>
<TD><A href="http://www.webring.org/cgi-bin/webring?ring=os&id=31&next5" target="_top">Next 5</A><BR>
</TD>
<TD><A href="http://www.webring.org/cgi-bin/webring?ring=os&id=31&list" target="_top">List Sites</A>
</TD>
</TR>
</TABLE></P>
</BODY>
</HTML>

Binary file not shown.

Binary file not shown.