# xv6 System Call Background

To be able to implement this project, you'll have to understand a little bit
about how xv6 implements system calls. As you recall from the [OS
book](http://www.ostep.org/), a system call is a protected transfer of control
from an application (running in *user mode*) to the OS (running in *kernel
mode*). The general approach, which we refer to as *limited direct execution*
(*LDE*), enables the kernel to maintain control of the machine while generally
letting user applications run efficiently and without kernel intervention.

We'll specifically trace what happens in the code in order to understand a
*system call*. System calls allow the operating system to run code on behalf
of user requests, but in a protected manner: the hardware jumps into the
kernel (in a very specific and restricted way) and simultaneously raises its
privilege level, so that the OS can perform certain restricted operations.

## System Call Overview

Before delving into the details, we first provide an overview of the entire
process. The problem we are trying to solve is simple: how can we build a
system such that the OS is allowed access to all of the resources of the
machine (including access to special instructions, to physical memory,
and to any devices) while user programs are only able to do so in a restricted
manner?

The way we achieve this goal is with hardware support. The hardware must
explicitly have a notion of privilege built into it, and thus be able to
distinguish when the OS is running from when a typical user application is
running.

## Getting Into The Kernel: A Trap

The first step in a system call begins at user level with an application. The
application that wishes to make a system call (such as **read()**) calls the
relevant library routine. However, all the library version of the system call
does is place the proper arguments in the relevant registers and issue some
kind of **trap** instruction, as we see in an expanded version of **usys.S**.
(Some macros are used to define these functions so as to make life
easier for the kernel developer; the example shows the macro expanded to the
actual assembly code.)

```
.globl read;
read:
  movl $6, %eax;
  int $64;
  ret
```
File: **usys.S**

Here we can see that the **read()** library function actually doesn't do much
at all; it moves the value 6 into the register **%eax** and issues the x86
trap instruction, which is confusingly called **int** (short for *interrupt*).
The value in **%eax** is going to be used by the kernel to *vector* to the
right system call, i.e., it determines which system call is being invoked. The
**int** instruction takes one argument (here it is 64), which tells the
hardware which trap type this is. In xv6, trap 64 is used to handle system
calls. Any other arguments which are passed to the system call are passed on
the stack.

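To make the point about stack-passed arguments concrete, here is a minimal user-space sketch of how a kernel could locate one of them. It assumes the usual 32-bit x86 calling convention, in which the word at the user stack pointer is the return address pushed by the call into the library stub and the arguments sit just above it. The name `fetch_arg` and the simulated memory buffer are hypothetical; this is not xv6 code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fetch the n-th 32-bit argument from a simulated
   user address space.
   mem/mem_size: the simulated memory; esp: byte offset of the user stack
   pointer at the time of the trap. The word at esp is assumed to be the
   return address, so argument n is read from esp + 4 + 4*n.
   Returns 0 on success, -1 if the read would fall outside the memory. */
int fetch_arg(const uint8_t *mem, size_t mem_size, uint32_t esp,
              int n, int32_t *out)
{
  assert(mem != NULL && out != NULL);
  if (n < 0)
    return -1;

  uint64_t addr = (uint64_t)esp + 4 + 4 * (uint64_t)n;
  if (addr + 4 > mem_size)
    return -1;                /* never trust a user-supplied pointer */

  memcpy(out, mem + addr, 4); /* copy the 32-bit argument out */
  return 0;
}
```

For a call such as `read(fd, buf, n)`, asking for argument 0 would yield `fd` and asking for argument 2 would yield `n`. The bounds check stands in for the validation a real kernel must perform before dereferencing anything derived from a user stack pointer.
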
## Kernel Side: Trap Tables

Once the **int** instruction is executed, the hardware takes over and does a
bunch of work on behalf of the caller. One important thing the hardware does
is to raise the *privilege level* of the CPU to kernel mode; on x86 this
usually means moving from a *CPL* *(Current Privilege Level)* of 3 (the level
at which user applications run) to CPL 0 (in which the kernel runs). Yes,
there are a couple of in-between privilege levels, but most systems do not
make use of these.

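As a rough illustration of where the privilege level lives, the sketch below models an x86 segment selector, whose low two bits hold a privilege level; for the selector currently loaded in **%cs**, those two bits are the CPL. The helper names and the `PL_*` constants are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the two privilege levels discussed in the text. */
enum { PL_KERNEL = 0, PL_USER = 3 };

/* Build a selector from a descriptor-table index and a privilege level:
   the index occupies the upper 13 bits, the privilege level the low 2
   (bit 2, the table indicator, is left at 0 here). */
uint16_t make_selector(uint16_t index, unsigned pl)
{
  assert(pl <= 3);
  return (uint16_t)((index << 3) | pl);
}

/* Extract the privilege level stored in a selector's low two bits. */
unsigned selector_pl(uint16_t sel)
{
  return sel & 3u;
}
```

With these helpers, `make_selector(1, PL_KERNEL)` gives 8, a selector whose privilege bits are 0, while any selector built with `PL_USER` reports level 3.
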
The second important thing the hardware does is to transfer control to the
*trap vectors* of the system. To enable the hardware to know what code to run
when a particular trap occurs, the OS, when booting, must make sure to inform
the hardware of the location of the code to run when such traps take
place. This is done in **main.c** as follows:

```c
int
mainc(void)
{
  ...
  tvinit(); // trap vectors initialized here
  ...
}
```
File: **main.c**

The routine **tvinit()** is the relevant one here. Peeking inside of it, we
see:

```c
void tvinit(void)
{
  int i;

  for(i = 0; i < 256; i++)
    SETGATE(idt[i], 0, SEG_KCODE<<3, vectors[i], 0);

  // this is the line we care about...
  SETGATE(idt[T_SYSCALL], 1, SEG_KCODE<<3, vectors[T_SYSCALL], DPL_USER);

  initlock(&tickslock, "time");
}
```
File: **trap.c**

The **SETGATE()** macro is the relevant code here. It is used to set the
**idt** array to point to the proper code to execute when various traps and
interrupts occur. For system calls, the single **SETGATE()** call (which
comes after the loop) is the one we're interested in. Here is what the macro
does (as well as the gate descriptor it sets):

```c
// Gate descriptors for interrupts and traps
struct gatedesc {
  uint off_15_0 : 16;   // low 16 bits of offset in segment
  uint cs : 16;         // code segment selector
  uint args : 5;        // # args, 0 for interrupt/trap gates
  uint rsv1 : 3;        // reserved(should be zero I guess)
  uint type : 4;        // type(STS_{TG,IG32,TG32})
  uint s : 1;           // must be 0 (system)
  uint dpl : 2;         // descriptor(meaning new) privilege level
  uint p : 1;           // Present
  uint off_31_16 : 16;  // high bits of offset in segment
};

// Set up a normal interrupt/trap gate descriptor.
// - istrap: 1 for a trap (= exception) gate, 0 for an interrupt gate.
//   interrupt gate clears FL_IF, trap gate leaves FL_IF alone
// - sel: Code segment selector for interrupt/trap handler
// - off: Offset in code segment for interrupt/trap handler
// - dpl: Descriptor Privilege Level -
//        the privilege level required for software to invoke
//        this interrupt/trap gate explicitly using an int instruction.
#define SETGATE(gate, istrap, sel, off, d)                \
{                                                         \
  (gate).off_15_0 = (uint) (off) & 0xffff;                \
  (gate).cs = (sel);                                      \
  (gate).args = 0;                                        \
  (gate).rsv1 = 0;                                        \
  (gate).type = (istrap) ? STS_TG32 : STS_IG32;           \
  (gate).s = 0;                                           \
  (gate).dpl = (d);                                       \
  (gate).p = 1;                                           \
  (gate).off_31_16 = (uint) (off) >> 16;                  \
}
```
File: **mmu.h**

As you can see from the code, all the **SETGATE()** macro does is set the
values of an in-memory data structure. Most important is the **off**
parameter, which tells the hardware where the trap handling code is. In the
initialization code, the value **vectors[T_SYSCALL]** is passed in; thus,
whatever the **vectors** array points to will be the code to run when a system
call takes place. Also note the **DPL_USER** argument: as the comment above
the macro explains, it is what permits user-mode code to invoke this
particular gate with an **int** instruction. There are other details (which
are important too); consult the [x86 hardware architecture
manuals](http://www.intel.com/products/processor/manuals) (particularly
Volumes 3a and 3b) for more information.

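The way **SETGATE()** stores the handler address can be tried out in ordinary user-space C. The sketch below is a simplified stand-in, not the xv6 structure: it models only the two offset fields, and the names `gate_offset`, `split_offset`, `join_offset`, and `check_round_trip` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for struct gatedesc: only the two halves of the
   handler offset are modeled. */
struct gate_offset {
  uint16_t off_15_0;   /* low 16 bits of the handler address */
  uint16_t off_31_16;  /* high 16 bits of the handler address */
};

/* Split a 32-bit handler address the same way SETGATE() does. */
struct gate_offset split_offset(uint32_t off)
{
  struct gate_offset g;
  g.off_15_0 = (uint16_t)(off & 0xffff);
  g.off_31_16 = (uint16_t)(off >> 16);
  return g;
}

/* Reassemble the full address from the two halves, which is what the
   hardware must do when it looks up the handler for a trap. */
uint32_t join_offset(struct gate_offset g)
{
  return ((uint32_t)g.off_31_16 << 16) | g.off_15_0;
}

/* Self-check: splitting and rejoining must give back the same address. */
void check_round_trip(uint32_t off)
{
  assert(join_offset(split_offset(off)) == off);
}
```

The split exists because the descriptor format places other fields (the selector, type, and privilege bits) between the two halves of the offset, so the address cannot be stored as one contiguous 32-bit value.
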
Note, however, that we still have not informed the hardware of this
information, but rather filled in a data structure. The hardware is actually
informed a little later in the boot sequence; in xv6, it happens in the routine
**mpmain()** in the file **main.c**, which calls **idtinit()** in **trap.c**,
which calls **lidt()** in the include file **x86.h**:

```c
// main.c
static void
mpmain(void)
{
  idtinit();
  ...
}

// trap.c
void
idtinit(void)
{
  lidt(idt, sizeof(idt));
}

// x86.h
static inline void
lidt(struct gatedesc *p, int size)
{
  volatile ushort pd[3];

  pd[0] = size-1;         // limit: table size in bytes, minus one
  pd[1] = (uint)p;        // low 16 bits of the table's address
  pd[2] = (uint)p >> 16;  // high 16 bits of the table's address

  asm volatile("lidt (%0)" : : "r" (pd));
}
```

Here, you can see how (eventually) a single assembly instruction is executed
to tell the hardware where to find the *interrupt descriptor table (IDT)* in
memory. Note this is done in **mpmain()** because each processor in the system
must be told where the table is (they all use the same one, of course).
Finally, after executing this instruction (which is only possible when the
kernel is running, in privileged mode), we are ready to think about what
happens when a user application invokes a system call.

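The three-word operand that **lidt()** builds can be modeled in user space. This is a minimal sketch under the assumption of a 32-bit table address; the function name `pack_idt_pseudo_descriptor` is hypothetical and the code does not execute the privileged instruction itself.

```c
#include <assert.h>
#include <stdint.h>

/* Fill the three 16-bit words handed to the lidt instruction: a 16-bit
   limit (table size in bytes, minus one) followed by the 32-bit base
   address of the table, low half first. */
void pack_idt_pseudo_descriptor(uint32_t base, uint32_t size_bytes,
                                uint16_t pd[3])
{
  /* The limit must fit in 16 bits, so the table is at most 64 KiB. */
  assert(size_bytes >= 1 && size_bytes <= 65536);

  pd[0] = (uint16_t)(size_bytes - 1);
  pd[1] = (uint16_t)(base & 0xffff);
  pd[2] = (uint16_t)(base >> 16);
}
```

For a table of 256 gate descriptors at 8 bytes each, the size is 2048 bytes and the stored limit is therefore 2047.
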
```c
struct trapframe {
  // registers as pushed by pusha
  uint edi;
  uint esi;
  uint ebp;
  uint oesp;      // useless & ignored
  uint ebx;
  uint edx;
  uint ecx;
  uint eax;

  // rest of trap frame
  ushort es;
  ushort padding1;
  ushort ds;
  ushort padding2;
  uint trapno;

  // below here defined by x86 hardware
  uint err;
  uint eip;
  ushort cs;
  ushort padding3;
  uint eflags;

  // below here only when crossing rings, such as from user to kernel
  uint esp;
  ushort ss;
  ushort padding4;
};
```
File: **x86.h**

## From Low-level To The C Trap Handler

The OS has carefully set up its trap handlers, and thus we are ready to see
what happens on the OS side once an application issues a system call via the
**int** instruction. Before any OS code is run, the hardware must perform a
number of tasks. The first things it does are those tasks which are
difficult or impossible for the software to do itself, including saving the
current PC (IP or EIP in Intel terminology) onto the stack, as well as a
number of other registers such as the **eflags** register (which contains the
status of the CPU while the program was running), the stack pointer, and
so forth. One can see what the hardware is expected to save by looking at the
**trapframe** structure as defined in **x86.h**.

As you can see from the bottom of the trapframe structure, some pieces of the
trap frame are filled in by the hardware (up to the **err** field); the rest
will be saved by the OS. The first OS code that is run is **vector64()**
as found in **vectors.S** (which is automatically generated by the script
**vectors.pl**).

```
.globl vector64
vector64:
  pushl $64
  jmp alltraps
```
File: **vectors.S** (generated by **vectors.pl**)

This code pushes the trap number onto the stack (filling in the **trapno**
field of the trap frame) and then jumps to **alltraps()** to do most of the
saving of context into the trap frame.

```
  # vectors.S sends all traps here.
.globl alltraps
alltraps:
  # Build trap frame.
  pushl %ds
  pushl %es
  pushal

  # Set up data segments.
  movl $SEG_KDATA_SEL, %eax
  movw %ax,%ds
  movw %ax,%es

  # Call trap(tf), where tf=%esp
  pushl %esp
  call trap
  addl $4, %esp
```
File: **trapasm.S**

The code in **alltraps()** pushes a few more segment registers (not described
here, yet) onto the stack before pushing the remaining general purpose
registers onto the trap frame via a **pushal** instruction. Then, the OS
changes the data segment (**%ds**) and extra segment (**%es**) registers so
that it can access its own (kernel) memory. Finally, the C trap handler is
called, with the current stack pointer (which now points at the trap frame)
as its argument.

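One way to convince yourself that these pushes line up with **struct trapframe** is to compute the field offsets in ordinary C. The sketch below copies the structure shown earlier, substituting fixed-width types for xv6's `uint` and `ushort`; the helper functions are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field order as the trapframe above; the stack grows downward,
   so the fields pushed last (by pushal) sit at the lowest offsets. */
struct trapframe {
  uint32_t edi, esi, ebp, oesp, ebx, edx, ecx, eax; /* pushal */
  uint16_t es, padding1;                            /* pushl %es */
  uint16_t ds, padding2;                            /* pushl %ds */
  uint32_t trapno;                                  /* pushl $64 */
  uint32_t err;
  uint32_t eip;
  uint16_t cs, padding3;
  uint32_t eflags;
  uint32_t esp;
  uint16_t ss, padding4;
};

/* Bytes occupied by trapno and err, which sit between the saved
   segment registers and the saved eip. */
size_t trapno_and_err_bytes(void)
{
  return offsetof(struct trapframe, eip) - offsetof(struct trapframe, trapno);
}

/* Bytes from eip to the end of the frame: eip, cs, eflags, esp, ss. */
size_t eip_through_ss_bytes(void)
{
  return sizeof(struct trapframe) - offsetof(struct trapframe, eip);
}

/* Self-check of the layout described in the text. */
void check_layout(void)
{
  assert(offsetof(struct trapframe, eax) == 28);    /* last pushal slot */
  assert(offsetof(struct trapframe, trapno) == 40);
  assert(sizeof(struct trapframe) == 68);
}
```

Because every field is either a 32-bit word or a 16-bit value followed by 16 bits of padding, each push of a 32-bit stack slot corresponds to exactly one row of the structure, and a pointer to the top of the stack after **pushal** can be treated as a pointer to a complete trap frame.
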
## The C Trap Handler

Once done with the low-level details of setting up the trap frame, the
low-level assembly code calls up into a generic C trap handler called
**trap()**, which is passed a pointer to the trap frame. This trap handler is
called upon all types of interrupts and traps, and thus checks the trap number
field of the trap frame (**trapno**) to determine what to do. The first check
is for the system call trap number (**T_SYSCALL**, or 64 as defined somewhat
arbitrarily in **traps.h**), which then handles the system call, as you see
here:

```c
void
trap(struct trapframe *tf)
{
  if(tf->trapno == T_SYSCALL){
    if(cp->killed)
      exit();
    cp->tf = tf;
    syscall();
    if(cp->killed)
      exit();
    return;
  }
  ... // continues
}
```
File: **trap.c**

The code isn't too complicated. It checks if the current process (that made
the system call) has been killed; if so, it simply exits and cleans up the
process (and thus does not proceed with the system call). Otherwise, it
records the trap frame pointer in the process structure (**cp->tf**) and
calls **syscall()** to actually perform the system call; more details on that
below. Finally, it checks whether the process has been killed again before
returning. Note that we'll follow the return path below in more detail.

```c
static int (*syscalls[])(void) = {
[SYS_chdir]   sys_chdir,
[SYS_close]   sys_close,
[SYS_dup]     sys_dup,
[SYS_exec]    sys_exec,
[SYS_exit]    sys_exit,
[SYS_fork]    sys_fork,
[SYS_fstat]   sys_fstat,
[SYS_getpid]  sys_getpid,
[SYS_kill]    sys_kill,
[SYS_link]    sys_link,
[SYS_mkdir]   sys_mkdir,
[SYS_mknod]   sys_mknod,
[SYS_open]    sys_open,
[SYS_pipe]    sys_pipe,
[SYS_read]    sys_read,
[SYS_sbrk]    sys_sbrk,
[SYS_sleep]   sys_sleep,
[SYS_unlink]  sys_unlink,
[SYS_wait]    sys_wait,
[SYS_write]   sys_write,
};

void
syscall(void)
{
  int num;

  num = cp->tf->eax;
  if(num >= 0 && num < NELEM(syscalls) && syscalls[num])
    cp->tf->eax = syscalls[num]();
  else {
    cprintf("%d %s: unknown sys call %d\n",
            cp->pid, cp->name, num);
    cp->tf->eax = -1;
  }
}
```
File: **syscall.c**

## Vectoring To The System Call

Once we finally get to the **syscall()** routine in **syscall.c**, not much
work is left to do (see above). The system call number has been passed to us
in the register **%eax**, and now we unpack that number from the trap frame
and use it to call the appropriate routine as defined in the system call table
**syscalls[]**. Pretty much all operating systems have a table similar to this
to define the various system calls they support. After carefully checking that
the system call number is in bounds (and that the table actually has an entry
for it), the pointed-to routine is called to handle the call. For example, if
the system call **read()** was called by the user, the routine **sys_read()**
will be invoked here. The return value, you might note, is stored in the
**eax** field of the trap frame, so that it ends up in **%eax** when control
passes back to the user.

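The same table-driven dispatch can be tried out in a user-space program. The following is a minimal sketch, not xv6 code: every name prefixed with `demo_` or `DEMO_`, along with the call numbers and return values, is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical call numbers; slot 0 is deliberately left unused. */
enum { DEMO_SYS_getpid = 1, DEMO_SYS_uptime = 2 };

/* Minimal stand-in for a trap frame: only the eax slot is modeled. */
struct demo_trapframe { int eax; };

static int demo_getpid(void) { return 42; }
static int demo_uptime(void) { return 1000; }

/* Designated initializers leave every unlisted slot as a null pointer. */
static int (*demo_syscalls[])(void) = {
  [DEMO_SYS_getpid] = demo_getpid,
  [DEMO_SYS_uptime] = demo_uptime,
};

#define DEMO_NELEM(x) (sizeof(x) / sizeof((x)[0]))

/* Read the call number from eax, dispatch through the table, and write
   the handler's result (or -1 for an unknown call) back into eax. */
void demo_syscall(struct demo_trapframe *tf)
{
  assert(tf != NULL);
  int num = tf->eax;

  if (num >= 0 && (size_t)num < DEMO_NELEM(demo_syscalls) && demo_syscalls[num])
    tf->eax = demo_syscalls[num]();
  else
    tf->eax = -1;
}
```

All three conditions in the `if` matter: the number comes straight from user code, so it may be negative, past the end of the table, or an index whose slot was never filled in. Reusing the `eax` slot for both the call number and the result mirrors what the text describes for the real trap frame.
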
## The Return Path

The return path is pretty easy. First, the system call returns an integer
value, which the code in **syscall()** grabs and places into the **eax**
field of the trap frame. The code then returns into **trap()**, which simply
returns to where it was called from in the assembly trap handler.

```
  # Return falls through to trapret...
.globl trapret
trapret:
  popal
  popl %es
  popl %ds
  addl $0x8, %esp  # trapno and errcode
  iret
```
File: **trapasm.S**

This return code doesn't do too much, just making sure to pop the relevant
values off the stack to restore the context of the running process. Finally,
one more special instruction is executed: **iret**, or the
**return-from-trap** instruction. This instruction is similar to a return from
a procedure call, but simultaneously lowers the privilege level back to user
mode and jumps back to the instruction immediately following the **int**
instruction that invoked the system call, restoring the state that the
hardware saved into the trap frame. At this point, the user stub for
**read()** (as seen in the **usys.S** code) resumes, and it just uses a normal
return-from-procedure-call instruction (**ret**) in order to return to the
caller.

## Summary

We have seen the path in and out of the kernel on a system call. As you can
tell, it is much more complex than a simple procedure call, and requires a
careful protocol on the part of the OS and hardware to ensure that application
state is properly saved and restored on entry and return. As always, the
concept is easy; with operating systems, the devil is always in the details.