362 lines
20 KiB
Markdown
362 lines
20 KiB
Markdown
|
|
# Overview
|
|
|
|
In this assignment, you will be developing a concurrent web server. To
|
|
simplify this project, we are providing you with the code for a non-concurrent
|
|
(but working) web server. This basic web server operates with only a single
|
|
thread; it will be your job to make the web server multi-threaded so that it
|
|
can handle multiple requests at the same time.
|
|
|
|
The goals of this project are:
|
|
- To learn the basic architecture of a simple web server
|
|
- To learn how to add concurrency to a non-concurrent system
|
|
- To learn how to read and modify an existing code base effectively
|
|
|
|
Useful reading from [OSTEP](http://ostep.org) includes:
|
|
- [Intro to threads](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-intro.pdf)
|
|
- [Using locks](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-intro.pdf)
|
|
- [Producer-consumer relationships](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-cv.pdf)
|
|
- [Server concurrency architecture](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-events.pdf)
|
|
|
|
# HTTP Background
|
|
|
|
Before describing what you will be implementing in this project, we will
|
|
provide a very brief overview of how a classic web server works, and the HTTP
|
|
protocol (version 1.0) used to communicate with it; although web browsers and
|
|
servers have [evolved a lot over the
|
|
years](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Evolution_of_HTTP),
|
|
the old versions still work and give you a good start in understanding how
|
|
things work. Our goal in providing you with a basic web server is that you can
|
|
be shielded from learning all of the details of network connections and the
|
|
HTTP protocol needed to do the project; however, the network code has been
|
|
greatly simplified and is fairly understandable should you choose to to study
|
|
it.
|
|
|
|
Classic web browsers and web servers interact using a text-based protocol
|
|
called **HTTP** (**Hypertext Transfer Protocol**). A web browser opens a
|
|
connection to a web server and requests some content with HTTP. The web server
|
|
responds with the requested content and closes the connection. The browser
|
|
reads the content and displays it on the screen.
|
|
|
|
HTTP is built on top of the **TCP/IP** protocol suite provided by the
|
|
operating system. Together, TPC and IP ensure that messages are routed to
|
|
their correct destination, get from source to destination reliably in the face
|
|
of failure, and do not overly congest the network by sending too many messages
|
|
at once, among other features. To learn more about networks, take a networking
|
|
class (or many!), or read [this free book](https://book.systemsapproach.org).
|
|
|
|
Each piece of content on the web server is associated with a file in the
|
|
server's file system. The simplest is *static* content, in which a client
|
|
sends a request just to read a specific file from the server. Slightly more
|
|
complex is *dynamic* content, in which a client requests that an executable
|
|
file be run on the web server and its output returned to the client.
|
|
Each file has a unique name known as a **URL** (**Universal Resource
|
|
Locator**).
|
|
|
|
As a simple example, let's say the client browser wants to fetch static
|
|
content (i.e., just some file) from a web server running on some machine. The
|
|
client might then type in the following URL to the browser:
|
|
`http://www.cs.wisc.edu/index.html`. This URL identifies that the HTTP
|
|
protocol is to be used, and that an HTML file in the root directory (`/`) of
|
|
the web server called `index.html` on the host machine `www.cs.wisc.edu`
|
|
should be fetched.
|
|
|
|
The web server is not just uniquely identified by which machine it is running
|
|
on but also the **port** it is listening for connections upon. Ports are a
|
|
communication abstraction that allow multiple (possibly independent) network
|
|
communications to happen concurrently upon a machine; for example, the web
|
|
server might be receiving an HTTP request upon port 80 while a mail server is
|
|
sending email out using port 25. By default, web servers are expected to run
|
|
on port 80 (the well-known HTTP port number), but sometimes (as in this
|
|
project), a different port number will be used. To fetch a file from a web
|
|
server running at a different port number (say 8000), specify the port number
|
|
directly in the URL, e.g., `http://www.cs.wisc.edu:8000/index.html`.
|
|
|
|
URLs for executable files (i.e., dynamic content) can include program
|
|
arguments after the file name. For example, to just run a program (`test.cgi`)
|
|
without any arguments, the client might use the URL
|
|
`http://www.cs.wisc.edu/test.cgi`. To specify more arguments, the `?` and `&`
|
|
characters are used, with the `?` character to separate the file name from the
|
|
arguments and the `& character to separate each argument from the others. For
|
|
example, `http://www.cs.wisc.edu/test.cgi?x=10&y=20` can be used to send
|
|
multiple arguments `x` and `y` and their respective values to the program
|
|
`test.cgi`. The program being run is called a **CGI program** (short for
|
|
[Common Gateway
|
|
Interface](https://en.wikipedia.org/wiki/Common_Gateway_Interface); yes, this
|
|
is a terrible name); the arguments are passed into the program as part of the
|
|
[`QUERY_STRING`](https://en.wikipedia.org/wiki/Query_string) environment
|
|
variable, which the program can then parse to access these arguments.
|
|
|
|
# The HTTP Request
|
|
|
|
When a client (e.g., a browser) wants to fetch a file from a machine, the
|
|
process starts by sending a machine a message. But what exactly is in the body
|
|
of that message? These *request contents*, and the subsequent *reply
|
|
contents*, are specified precisely by the HTTP protocol.
|
|
|
|
Let's start with the request contents, sent from the web browser to the
|
|
server. This HTTP request consists of a request line, followed by zero or more
|
|
request headers, and finally an empty text line. A request line has the form:
|
|
`method uri version`. The `method` is usually `GET`, which tells the web
|
|
server that the client simply wants to read the specified file; however, other
|
|
methods exist (e.g., `POST`). The `uri` is the file name, and perhaps optional
|
|
arguments (in the case of dynamic content). Finally, the `version` indicates
|
|
the version of the HTTP protocol that the web client is using (e.g.,
|
|
HTTP/1.0).
|
|
|
|
The HTTP response (from the server to the browser) is similar; it consists of
|
|
a response line, zero or more response headers, an empty text line, and
|
|
finally the interesting part, the response body. A response line has the form
|
|
version `status message`. The `status` is a three-digit positive integer that
|
|
indicates the state of the request; some common states are `200` for `OK`,
|
|
`403` for `Forbidden` (i.e., the client can't access that file), and `404` for
|
|
`File Not Found` (the famous error). Two important lines in the header are
|
|
`Content-Type`, which tells the client the type of the content in the response
|
|
body (e.g., HTML or gif or otherwise) and `Content-Length`, which indicates
|
|
the file size in bytes.
|
|
|
|
For this project, you don't really need to know this information about HTTP
|
|
unless you want to understand the details of the code we have given you. You
|
|
will not need to modify any of the procedures in the web server that deal with
|
|
the HTTP protocol or network connections. However, it's always good to learn
|
|
more, isn't it?
|
|
|
|
# A Basic Web Server
|
|
|
|
The code for the web server is available in this repository. You can compile
|
|
the files herein by simply typing `make`. Compile and run this basic web
|
|
server before making any changes to it! `make clean` removes .o files and
|
|
executables and lets you do a clean build.
|
|
|
|
When you run this basic web server, you need to specify the port number that
|
|
it will listen on; ports below number 1024 are *reserved* (see the list
|
|
[here](https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml))
|
|
so you should specify port numbers that are greater than 1023 to avoid this
|
|
reserved range; the max is 65535. Be wary: if running on a shared machine, you
|
|
could conflict with others and thus have your server fail to bind to the
|
|
desired port. If this happens, try a different number!
|
|
|
|
When you then connect your web browser to this server, make sure that
|
|
you specify this same port. For example, assume that you are running on
|
|
`bumble21.cs.wisc.edu` and use port number 8003; copy your favorite HTML file
|
|
to the directory that you start the web server from. Then, to view this file
|
|
from a web browser (running on the same or a different machine), use the url
|
|
`bumble21.cs.wisc.edu:8003/favorite.html`. If you run the client and web
|
|
server on the same machine, you can just use the hostname `localhost` as a
|
|
convenience, e.g., `localhost:8003/favorite.html`.
|
|
|
|
To make the project a bit easier, we are providing you with a minimal web
|
|
server, consisting of only a few hundred lines of C code. As a result, the
|
|
server is limited in its functionality; it does not handle any HTTP requests
|
|
other than `GET`, understands only a few content types, and supports only the
|
|
`QUERY_STRING` environment variable for CGI programs. This web server is also
|
|
not very robust; for example, if a web client closes its connection to the
|
|
server, it may trip an assertion in the server causing it to exit. We do not
|
|
expect you to fix these problems (though you can, if you like, you know, for
|
|
fun).
|
|
|
|
Helper functions are provided to simplify error checking. A wrapper calls the
|
|
desired function and immediately terminate if an error occurs. The wrappers
|
|
are found in the file `io-helper.h`); more about this below. One should
|
|
always check error codes, even if all you do in response is exit; dropping
|
|
errors silently is **BAD C PROGRAMMING** and should be avoided at all costs.
|
|
|
|
# Finally: Some New Functionality!
|
|
|
|
In this project, you will be adding three key pieces of functionality to the
|
|
basic web server. First, you make the web server multi-threaded. Second, you
|
|
will implement different scheduling policies so that requests are serviced in
|
|
different orders. Third, you will add statistics to measure how the web server
|
|
is performing. You will also be modifying how the web server is invoked so
|
|
that it can handle new input parameters (e.g., the number of threads to
|
|
create).
|
|
|
|
## Part 1: Multi-threaded
|
|
|
|
The basic web server that we provided has a single thread of
|
|
control. Single-threaded web servers suffer from a fundamental performance
|
|
problem in that only a single HTTP request can be serviced at a time. Thus,
|
|
every other client that is accessing this web server must wait until the
|
|
current http request has finished; this is especially a problem if the current
|
|
HTTP request is a long-running CGI program or is resident only on disk (i.e.,
|
|
is not in memory). Thus, the most important extension that you will be adding
|
|
is to make the basic web server multi-threaded.
|
|
|
|
The simplest approach to building a multi-threaded server is to spawn a new
|
|
thread for every new http request. The OS will then schedule these threads
|
|
according to its own policy. The advantage of creating these threads is that
|
|
now short requests will not need to wait for a long request to complete;
|
|
further, when one thread is blocked (i.e., waiting for disk I/O to finish) the
|
|
other threads can continue to handle other requests. However, the drawback of
|
|
the one-thread-per-request approach is that the web server pays the overhead
|
|
of creating a new thread on every request.
|
|
|
|
Therefore, the generally preferred approach for a multi-threaded server is to
|
|
create a fixed-size *pool* of worker threads when the web server is first
|
|
started. With the pool-of-threads approach, each thread is blocked until there
|
|
is an http request for it to handle. Therefore, if there are more worker
|
|
threads than active requests, then some of the threads will be blocked,
|
|
waiting for new HTTP requests to arrive; if there are more requests than
|
|
worker threads, then those requests will need to be buffered until there is a
|
|
ready thread.
|
|
|
|
In your implementation, you must have a master thread that begins by creating
|
|
a pool of worker threads, the number of which is specified on the command
|
|
line. Your master thread is then responsible for accepting new HTTP
|
|
connections over the network and placing the descriptor for this connection
|
|
into a fixed-size buffer; in your basic implementation, the master thread
|
|
should not read from this connection. The number of elements in the buffer is
|
|
also specified on the command line. Note that the existing web server has a
|
|
single thread that accepts a connection and then immediately handles the
|
|
connection; in your web server, this thread should place the connection
|
|
descriptor into a fixed-size buffer and return to accepting more connections.
|
|
|
|
Each worker thread is able to handle both static and dynamic requests. A
|
|
worker thread wakes when there is an HTTP request in the queue; when there are
|
|
multiple HTTP requests available, which request is handled depends upon the
|
|
scheduling policy, described below. Once the worker thread wakes, it performs
|
|
the read on the network descriptor, obtains the specified content (by either
|
|
reading the static file or executing the CGI process), and then returns the
|
|
content to the client by writing to the descriptor. The worker thread then
|
|
waits for another HTTP request.
|
|
|
|
Note that the master thread and the worker threads are in a producer-consumer
|
|
relationship and require that their accesses to the shared buffer be
|
|
synchronized. Specifically, the master thread must block and wait if the
|
|
buffer is full; a worker thread must wait if the buffer is empty. In this
|
|
project, you are required to use condition variables. Note: if your
|
|
implementation performs any busy-waiting (or spin-waiting) instead, you will
|
|
be heavily penalized.
|
|
|
|
Side note: do not be confused by the fact that the basic web server we provide
|
|
forks a new process for each CGI process that it runs. Although, in a very
|
|
limited sense, the web server does use multiple processes, it never handles
|
|
more than a single request at a time; the parent process in the web server
|
|
explicitly waits for the child CGI process to complete before continuing and
|
|
accepting more HTTP requests. When making your server multi-threaded, you
|
|
should not modify this section of the code.
|
|
|
|
## Part 2: Scheduling Policies
|
|
|
|
In this project, you will implement a number of different scheduling
|
|
policies. Note that when your web server has multiple worker threads running
|
|
(the number of which is specified on the command line), you will not have any
|
|
control over which thread is actually scheduled at any given time by the
|
|
OS. Your role in scheduling is to determine which HTTP request should be
|
|
handled by each of the waiting worker threads in your web server.
|
|
|
|
The scheduling policy is determined by a command line argument when the web
|
|
server is started and are as follows:
|
|
|
|
- **First-in-First-out (FIFO)**: When a worker thread wakes, it handles the
|
|
first request (i.e., the oldest request) in the buffer. Note that the HTTP
|
|
requests will not necessarily finish in FIFO order; the order in which the
|
|
requests complete will depend upon how the OS schedules the active threads.
|
|
|
|
- ** Smallest File First (SFF)**: When a worker thread wakes, it handles the
|
|
request for the smallest file. This policy approximates Shortest Job First to
|
|
the extent that the size of the file is a good prediction of how long it takes
|
|
to service that request. Requests for static and dynamic content may be
|
|
intermixed, depending upon the sizes of those files. Note that this algorithm
|
|
can lead to the starvation of requests for large files. You will also note
|
|
that the SFF policy requires that something be known about each request (e.g.,
|
|
the size of the file) before the requests can be scheduled. Thus, to support
|
|
this scheduling policy, you will need to do some initial processing of the
|
|
request (hint: using `stat()` on the filename) outside of the worker threads;
|
|
you will probably want the master thread to perform this work, which requires
|
|
that it read from the network descriptor.
|
|
|
|
## Security
|
|
|
|
Running a networked server can be dangerous, especially if you are not
|
|
careful. Thus, security is something you should consider carefully when
|
|
creating a web server. One thing you should always make sure to do is not
|
|
leave your server running beyond testing, thus leaving open a potential
|
|
backdoor into files in your system.
|
|
|
|
Your system should also make sure to constrain file requests to stay within
|
|
the sub-tree of the file system hierarchy, rooted at the base working
|
|
directory that the server starts in. You must take steps to ensure that
|
|
pathnames that are passed in do not refer to files outside of this sub-tree.
|
|
One simple (perhaps overly conservative) way to do this is to reject any
|
|
pathname with `..` in it, thus avoiding any traversals up the file system
|
|
tree. More sophisticated solutions could use `chroot()` or Linux containers,
|
|
but perhaps those are beyond the scope of the project.
|
|
|
|
## Command-line Parameters
|
|
|
|
Your C program must be invoked exactly as follows:
|
|
|
|
```sh
|
|
prompt> ./wserver [-d <basedir>] [-p <portnum>] [-t <threads>] [-b <buffers>] [-s <schedalg>]
|
|
```
|
|
|
|
The command line arguments to your web server are to be interpreted as
|
|
follows.
|
|
|
|
- **basedir**: this is the root directory from which the web server should
|
|
operate. The server should try to ensure that file accesses do not access
|
|
files above this directory in the file-system hierarchy. Default: current
|
|
working directory (e.g., `.`).
|
|
- **portnum**: the port number that the web server should listen on; the basic web
|
|
server already handles this argument. Default: 10000.
|
|
- **threads**: the number of worker threads that should be created within the web
|
|
server. Must be a positive integer. Default: 1.
|
|
- **buffers**: the number of request connections that can be accepted at one
|
|
time. Must be a positive integer. Note that it is not an error for more or
|
|
less threads to be created than buffers. Default: 1.
|
|
- **schedalg**: the scheduling algorithm to be performed. Must be one of FIFO
|
|
or SFF. Default: FIFO.
|
|
|
|
For example, you could run your program as:
|
|
```
|
|
prompt> server -d . -p 8003 -t 8 -b 16 -s SFF
|
|
```
|
|
|
|
In this case, your web server will listen to port 8003, create 8 worker threads for
|
|
handling HTTP requests, allocate 16 buffers for connections that are currently
|
|
in progress (or waiting), and use SFF scheduling for arriving requests.
|
|
|
|
# Source Code Overview
|
|
|
|
We recommend understanding how the code that we gave you works. We provide
|
|
the following files:
|
|
|
|
- **wserver.c:** Contains main() for the web server and the basic serving loop.
|
|
- **request.c:** Performs most of the work for handling requests in the basic
|
|
web server. Start at `request_handle()` and work through the logic from
|
|
there.
|
|
- **io_helper.h:** Contains wrapper functions for the system calls invoked by
|
|
the basic web server and client. The convention is to add `_or_die` to an
|
|
existing call to provide a version that either succeeds or exits. For
|
|
example, the `open()` system call is used to open a file, but can fail for a
|
|
number of reasons. The wrapper, `open_or_die()`, either successfully opens a
|
|
file or exists upon failure.
|
|
- **wclient.c:** Contains main() and the support routines for the very simple
|
|
web client. To test your server, you may want to change this code so that it
|
|
can send simultaneous requests to your server. By launching `wclient`
|
|
multiple times, you can test how your server handles concurrent requests.
|
|
- **spin.c:** A simple CGI program. Basically, it spins for a fixed amount
|
|
of time, which you may useful in testing various aspects of your server.
|
|
- **Makefile:** We also provide you with a sample Makefile that creates
|
|
`wserver`, `wclient`, and `spin.cgi`. You can type make to create all of
|
|
these programs. You can type make clean to remove the object files and the
|
|
executables. You can type make server to create just the server program,
|
|
etc. As you create new files, you will need to add them to the Makefile.
|
|
|
|
The best way to learn about the code is to compile it and run it. Run the
|
|
server we gave you with your preferred web browser. Run this server with the
|
|
client code we gave you. You can even have the client code we gave you contact
|
|
any other server that speaks HTTP. Make small changes to the server code
|
|
(e.g., have it print out more debugging information) to see if you understand
|
|
how it works.
|
|
|
|
## Additional Useful Reading
|
|
|
|
We anticipate that you will find the following routines useful for creating
|
|
and synchronizing threads: `pthread_create()`, `pthread_mutex_init()`,
|
|
`pthread_mutex_lock()`, `pthread_mutex_unlock()`, `pthread_cond_init()`,
|
|
`pthread_cond_wait()`, `pthread_cond_signal()`. To find information on these
|
|
library routines, read the man pages (RTFM).
|
|
|