init web server docs; some details still left out
This commit is contained in:
361
concurrency-webserver/README.md
Normal file
361
concurrency-webserver/README.md
Normal file
@@ -0,0 +1,361 @@
|
||||
|
||||
# Overview
|
||||
|
||||
In this assignment, you will be developing a concurrent web server. To
|
||||
simplify this project, we are providing you with the code for a non-concurrent
|
||||
(but working) web server. This basic web server operates with only a single
|
||||
thread; it will be your job to make the web server multi-threaded so that it
|
||||
can handle multiple requests at the same time.
|
||||
|
||||
The goals of this project are:
|
||||
- To learn the basic architecture of a simple web server
|
||||
- To learn how to add concurrency to a non-concurrent system
|
||||
- To learn how to read and modify an existing code base effectively
|
||||
|
||||
Useful reading from [OSTEP](http://ostep.org) includes:
|
||||
- [Intro to threads](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-intro.pdf)
|
||||
- [Using locks](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-intro.pdf)
|
||||
- [Producer-consumer relationships](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-cv.pdf)
|
||||
- [Server concurrency architecture](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-events.pdf)
|
||||
|
||||
# HTTP Background
|
||||
|
||||
Before describing what you will be implementing in this project, we will
|
||||
provide a very brief overview of how a classic web server works, and the HTTP
|
||||
protocol (version 1.0) used to communicate with it; although web browsers and
|
||||
servers have [evolved a lot over the
|
||||
years](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Evolution_of_HTTP),
|
||||
the old versions still work and give you a good start in understanding how
|
||||
things work. Our goal in providing you with a basic web server is that you can
|
||||
be shielded from learning all of the details of network connections and the
|
||||
HTTP protocol needed to do the project; however, the network code has been
|
||||
greatly simplified and is fairly understandable should you choose to to study
|
||||
it.
|
||||
|
||||
Classic web browsers and web servers interact using a text-based protocol
|
||||
called **HTTP** (**Hypertext Transfer Protocol**). A web browser opens a
|
||||
connection to a web server and requests some content with HTTP. The web server
|
||||
responds with the requested content and closes the connection. The browser
|
||||
reads the content and displays it on the screen.
|
||||
|
||||
HTTP is built on top of the **TCP/IP** protocol suite provided by the
|
||||
operating system. Together, TPC and IP ensure that messages are routed to
|
||||
their correct destination, get from source to destination reliably in the face
|
||||
of failure, and do not overly congest the network by sending too many messages
|
||||
at once, among other features. To learn more about networks, take a networking
|
||||
class (or many!), or read [this free book](https://book.systemsapproach.org).
|
||||
|
||||
Each piece of content on the web server is associated with a file in the
|
||||
server's file system. The simplest is *static* content, in which a client
|
||||
sends a request just to read a specific file from the server. Slightly more
|
||||
complex is *dynamic* content, in which a client requests that an executable
|
||||
file be run on the web server and its output returned to the client.
|
||||
Each file has a unique name known as a **URL** (**Universal Resource
|
||||
Locator**).
|
||||
|
||||
As a simple example, let's say the client browser wants to fetch static
|
||||
content (i.e., just some file) from a web server running on some machine. The
|
||||
client might then type in the following URL to the browser:
|
||||
`http://www.cs.wisc.edu/index.html`. This URL identifies that the HTTP
|
||||
protocol is to be used, and that an HTML file in the root directory (`/`) of
|
||||
the web server called `index.html` on the host machine `www.cs.wisc.edu`
|
||||
should be fetched.
|
||||
|
||||
The web server is not just uniquely identified by which machine it is running
|
||||
on but also the **port** it is listening for connections upon. Ports are a
|
||||
communication abstraction that allow multiple (possibly independent) network
|
||||
communications to happen concurrently upon a machine; for example, the web
|
||||
server might be receiving an HTTP request upon port 80 while a mail server is
|
||||
sending email out using port 25. By default, web servers are expected to run
|
||||
on port 80 (the well-known HTTP port number), but sometimes (as in this
|
||||
project), a different port number will be used. To fetch a file from a web
|
||||
server running at a different port number (say 8000), specify the port number
|
||||
directly in the URL, e.g., `http://www.cs.wisc.edu:8000/index.html`.
|
||||
|
||||
URLs for executable files (i.e., dynamic content) can include program
|
||||
arguments after the file name. For example, to just run a program (`test.cgi`)
|
||||
without any arguments, the client might use the URL
|
||||
`http://www.cs.wisc.edu/test.cgi`. To specify more arguments, the `?` and `&`
|
||||
characters are used, with the `?` character to separate the file name from the
|
||||
arguments and the `& character to separate each argument from the others. For
|
||||
example, `http://www.cs.wisc.edu/test.cgi?x=10&y=20` can be used to send
|
||||
multiple arguments `x` and `y` and their respective values to the program
|
||||
`test.cgi`. The program being run is called a **CGI program** (short for
|
||||
[Common Gateway
|
||||
Interface](https://en.wikipedia.org/wiki/Common_Gateway_Interface); yes, this
|
||||
is a terrible name); the arguments are passed into the program as part of the
|
||||
[`QUERY_STRING`](https://en.wikipedia.org/wiki/Query_string) environment
|
||||
variable, which the program can then parse to access these arguments.
|
||||
|
||||
# The HTTP Request
|
||||
|
||||
When a client (e.g., a browser) wants to fetch a file from a machine, the
|
||||
process starts by sending a machine a message. But what exactly is in the body
|
||||
of that message? These *request contents*, and the subsequent *reply
|
||||
contents*, are specified precisely by the HTTP protocol.
|
||||
|
||||
Let's start with the request contents, sent from the web browser to the
|
||||
server. This HTTP request consists of a request line, followed by zero or more
|
||||
request headers, and finally an empty text line. A request line has the form:
|
||||
`method uri version`. The `method` is usually `GET`, which tells the web
|
||||
server that the client simply wants to read the specified file; however, other
|
||||
methods exist (e.g., `POST`). The `uri` is the file name, and perhaps optional
|
||||
arguments (in the case of dynamic content). Finally, the `version` indicates
|
||||
the version of the HTTP protocol that the web client is using (e.g.,
|
||||
HTTP/1.0).
|
||||
|
||||
The HTTP response (from the server to the browser) is similar; it consists of
|
||||
a response line, zero or more response headers, an empty text line, and
|
||||
finally the interesting part, the response body. A response line has the form
|
||||
version `status message`. The `status` is a three-digit positive integer that
|
||||
indicates the state of the request; some common states are `200` for `OK`,
|
||||
`403` for `Forbidden` (i.e., the client can't access that file), and `404` for
|
||||
`File Not Found` (the famous error). Two important lines in the header are
|
||||
`Content-Type`, which tells the client the type of the content in the response
|
||||
body (e.g., HTML or gif or otherwise) and `Content-Length`, which indicates
|
||||
the file size in bytes.
|
||||
|
||||
For this project, you don't really need to know this information about HTTP
|
||||
unless you want to understand the details of the code we have given you. You
|
||||
will not need to modify any of the procedures in the web server that deal with
|
||||
the HTTP protocol or network connections. However, it's always good to learn
|
||||
more, isn't it?
|
||||
|
||||
# A Basic Web Server
|
||||
|
||||
The code for the web server is available in this repository. You can compile
|
||||
the files herein by simply typing `make`. Compile and run this basic web
|
||||
server before making any changes to it! `make clean` removes .o files and
|
||||
executables and lets you do a clean build.
|
||||
|
||||
When you run this basic web server, you need to specify the port number that
|
||||
it will listen on; ports below number 1024 are *reserved* (see the list
|
||||
[here](https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml))
|
||||
so you should specify port numbers that are greater than 1023 to avoid this
|
||||
reserved range; the max is 65535. Be wary: if running on a shared machine, you
|
||||
could conflict with others and thus have your server fail to bind to the
|
||||
desired port. If this happens, try a different number!
|
||||
|
||||
When you then connect your web browser to this server, make sure that
|
||||
you specify this same port. For example, assume that you are running on
|
||||
`bumble21.cs.wisc.edu` and use port number 8003; copy your favorite HTML file
|
||||
to the directory that you start the web server from. Then, to view this file
|
||||
from a web browser (running on the same or a different machine), use the url
|
||||
`bumble21.cs.wisc.edu:8003/favorite.html`. If you run the client and web
|
||||
server on the same machine, you can just use the hostname `localhost` as a
|
||||
convenience, e.g., `localhost:8003/favorite.html`.
|
||||
|
||||
To make the project a bit easier, we are providing you with a minimal web
|
||||
server, consisting of only a few hundred lines of C code. As a result, the
|
||||
server is limited in its functionality; it does not handle any HTTP requests
|
||||
other than `GET`, understands only a few content types, and supports only the
|
||||
`QUERY_STRING` environment variable for CGI programs. This web server is also
|
||||
not very robust; for example, if a web client closes its connection to the
|
||||
server, it may trip an assertion in the server causing it to exit. We do not
|
||||
expect you to fix these problems (though you can, if you like, you know, for
|
||||
fun).
|
||||
|
||||
Helper functions are provided to simplify error checking. A wrapper calls the
|
||||
desired function and immediately terminate if an error occurs. The wrappers
|
||||
are found in the file `io-helper.h`); more about this below. One should
|
||||
always check error codes, even if all you do in response is exit; dropping
|
||||
errors silently is **BAD C PROGRAMMING** and should be avoided at all costs.
|
||||
|
||||
# Finally: Some New Functionality!
|
||||
|
||||
In this project, you will be adding three key pieces of functionality to the
|
||||
basic web server. First, you make the web server multi-threaded. Second, you
|
||||
will implement different scheduling policies so that requests are serviced in
|
||||
different orders. Third, you will add statistics to measure how the web server
|
||||
is performing. You will also be modifying how the web server is invoked so
|
||||
that it can handle new input parameters (e.g., the number of threads to
|
||||
create).
|
||||
|
||||
## Part 1: Multi-threaded
|
||||
|
||||
The basic web server that we provided has a single thread of
|
||||
control. Single-threaded web servers suffer from a fundamental performance
|
||||
problem in that only a single HTTP request can be serviced at a time. Thus,
|
||||
every other client that is accessing this web server must wait until the
|
||||
current http request has finished; this is especially a problem if the current
|
||||
HTTP request is a long-running CGI program or is resident only on disk (i.e.,
|
||||
is not in memory). Thus, the most important extension that you will be adding
|
||||
is to make the basic web server multi-threaded.
|
||||
|
||||
The simplest approach to building a multi-threaded server is to spawn a new
|
||||
thread for every new http request. The OS will then schedule these threads
|
||||
according to its own policy. The advantage of creating these threads is that
|
||||
now short requests will not need to wait for a long request to complete;
|
||||
further, when one thread is blocked (i.e., waiting for disk I/O to finish) the
|
||||
other threads can continue to handle other requests. However, the drawback of
|
||||
the one-thread-per-request approach is that the web server pays the overhead
|
||||
of creating a new thread on every request.
|
||||
|
||||
Therefore, the generally preferred approach for a multi-threaded server is to
|
||||
create a fixed-size *pool* of worker threads when the web server is first
|
||||
started. With the pool-of-threads approach, each thread is blocked until there
|
||||
is an http request for it to handle. Therefore, if there are more worker
|
||||
threads than active requests, then some of the threads will be blocked,
|
||||
waiting for new HTTP requests to arrive; if there are more requests than
|
||||
worker threads, then those requests will need to be buffered until there is a
|
||||
ready thread.
|
||||
|
||||
In your implementation, you must have a master thread that begins by creating
|
||||
a pool of worker threads, the number of which is specified on the command
|
||||
line. Your master thread is then responsible for accepting new HTTP
|
||||
connections over the network and placing the descriptor for this connection
|
||||
into a fixed-size buffer; in your basic implementation, the master thread
|
||||
should not read from this connection. The number of elements in the buffer is
|
||||
also specified on the command line. Note that the existing web server has a
|
||||
single thread that accepts a connection and then immediately handles the
|
||||
connection; in your web server, this thread should place the connection
|
||||
descriptor into a fixed-size buffer and return to accepting more connections.
|
||||
|
||||
Each worker thread is able to handle both static and dynamic requests. A
|
||||
worker thread wakes when there is an HTTP request in the queue; when there are
|
||||
multiple HTTP requests available, which request is handled depends upon the
|
||||
scheduling policy, described below. Once the worker thread wakes, it performs
|
||||
the read on the network descriptor, obtains the specified content (by either
|
||||
reading the static file or executing the CGI process), and then returns the
|
||||
content to the client by writing to the descriptor. The worker thread then
|
||||
waits for another HTTP request.
|
||||
|
||||
Note that the master thread and the worker threads are in a producer-consumer
|
||||
relationship and require that their accesses to the shared buffer be
|
||||
synchronized. Specifically, the master thread must block and wait if the
|
||||
buffer is full; a worker thread must wait if the buffer is empty. In this
|
||||
project, you are required to use condition variables. Note: if your
|
||||
implementation performs any busy-waiting (or spin-waiting) instead, you will
|
||||
be heavily penalized.
|
||||
|
||||
Side note: do not be confused by the fact that the basic web server we provide
|
||||
forks a new process for each CGI process that it runs. Although, in a very
|
||||
limited sense, the web server does use multiple processes, it never handles
|
||||
more than a single request at a time; the parent process in the web server
|
||||
explicitly waits for the child CGI process to complete before continuing and
|
||||
accepting more HTTP requests. When making your server multi-threaded, you
|
||||
should not modify this section of the code.
|
||||
|
||||
## Part 2: Scheduling Policies
|
||||
|
||||
In this project, you will implement a number of different scheduling
|
||||
policies. Note that when your web server has multiple worker threads running
|
||||
(the number of which is specified on the command line), you will not have any
|
||||
control over which thread is actually scheduled at any given time by the
|
||||
OS. Your role in scheduling is to determine which HTTP request should be
|
||||
handled by each of the waiting worker threads in your web server.
|
||||
|
||||
The scheduling policy is determined by a command line argument when the web
|
||||
server is started and are as follows:
|
||||
|
||||
- **First-in-First-out (FIFO)**: When a worker thread wakes, it handles the
|
||||
first request (i.e., the oldest request) in the buffer. Note that the HTTP
|
||||
requests will not necessarily finish in FIFO order; the order in which the
|
||||
requests complete will depend upon how the OS schedules the active threads.
|
||||
|
||||
- ** Smallest File First (SFF)**: When a worker thread wakes, it handles the
|
||||
request for the smallest file. This policy approximates Shortest Job First to
|
||||
the extent that the size of the file is a good prediction of how long it takes
|
||||
to service that request. Requests for static and dynamic content may be
|
||||
intermixed, depending upon the sizes of those files. Note that this algorithm
|
||||
can lead to the starvation of requests for large files. You will also note
|
||||
that the SFF policy requires that something be known about each request (e.g.,
|
||||
the size of the file) before the requests can be scheduled. Thus, to support
|
||||
this scheduling policy, you will need to do some initial processing of the
|
||||
request (hint: using `stat()` on the filename) outside of the worker threads;
|
||||
you will probably want the master thread to perform this work, which requires
|
||||
that it read from the network descriptor.
|
||||
|
||||
## Security
|
||||
|
||||
Running a networked server can be dangerous, especially if you are not
|
||||
careful. Thus, security is something you should consider carefully when
|
||||
creating a web server. One thing you should always make sure to do is not
|
||||
leave your server running beyond testing, thus leaving open a potential
|
||||
backdoor into files in your system.
|
||||
|
||||
Your system should also make sure to constrain file requests to stay within
|
||||
the sub-tree of the file system hierarchy, rooted at the base working
|
||||
directory that the server starts in. You must take steps to ensure that
|
||||
pathnames that are passed in do not refer to files outside of this sub-tree.
|
||||
One simple (perhaps overly conservative) way to do this is to reject any
|
||||
pathname with `..` in it, thus avoiding any traversals up the file system
|
||||
tree. More sophisticated solutions could use `chroot()` or Linux containers,
|
||||
but perhaps those are beyond the scope of the project.
|
||||
|
||||
## Command-line Parameters
|
||||
|
||||
Your C program must be invoked exactly as follows:
|
||||
|
||||
```sh
|
||||
prompt> ./wserver [-d <basedir>] [-p <portnum>] [-t <threads>] [-b <buffers>] [-s <schedalg>]
|
||||
```
|
||||
|
||||
The command line arguments to your web server are to be interpreted as
|
||||
follows.
|
||||
|
||||
- **basedir**: this is the root directory from which the web server should
|
||||
operate. The server should try to ensure that file accesses do not access
|
||||
files above this directory in the file-system hierarchy. Default: current
|
||||
working directory (e.g., `.`).
|
||||
- **portnum**: the port number that the web server should listen on; the basic web
|
||||
server already handles this argument. Default: 10000.
|
||||
- **threads**: the number of worker threads that should be created within the web
|
||||
server. Must be a positive integer. Default: 1.
|
||||
- **buffers**: the number of request connections that can be accepted at one
|
||||
time. Must be a positive integer. Note that it is not an error for more or
|
||||
less threads to be created than buffers. Default: 1.
|
||||
- **schedalg**: the scheduling algorithm to be performed. Must be one of FIFO
|
||||
or SFF. Default: FIFO.
|
||||
|
||||
For example, you could run your program as:
|
||||
```
|
||||
prompt> server -d . -p 8003 -t 8 -b 16 -s SFF
|
||||
```
|
||||
|
||||
In this case, your web server will listen to port 8003, create 8 worker threads for
|
||||
handling HTTP requests, allocate 16 buffers for connections that are currently
|
||||
in progress (or waiting), and use SFF scheduling for arriving requests.
|
||||
|
||||
# Source Code Overview
|
||||
|
||||
We recommend understanding how the code that we gave you works. We provide
|
||||
the following files:
|
||||
|
||||
- **wserver.c:** Contains main() for the web server and the basic serving loop.
|
||||
- **request.c:** Performs most of the work for handling requests in the basic
|
||||
web server. Start at `request_handle()` and work through the logic from
|
||||
there.
|
||||
- **io_helper.h:** Contains wrapper functions for the system calls invoked by
|
||||
the basic web server and client. The convention is to add `_or_die` to an
|
||||
existing call to provide a version that either succeeds or exits. For
|
||||
example, the `open()` system call is used to open a file, but can fail for a
|
||||
number of reasons. The wrapper, `open_or_die()`, either successfully opens a
|
||||
file or exists upon failure.
|
||||
- **wclient.c:** Contains main() and the support routines for the very simple
|
||||
web client. To test your server, you may want to change this code so that it
|
||||
can send simultaneous requests to your server. By launching `wclient`
|
||||
multiple times, you can test how your server handles concurrent requests.
|
||||
- **spin.c:** A simple CGI program. Basically, it spins for a fixed amount
|
||||
of time, which you may useful in testing various aspects of your server.
|
||||
- **Makefile:** We also provide you with a sample Makefile that creates
|
||||
`wserver`, `wclient`, and `spin.cgi`. You can type make to create all of
|
||||
these programs. You can type make clean to remove the object files and the
|
||||
executables. You can type make server to create just the server program,
|
||||
etc. As you create new files, you will need to add them to the Makefile.
|
||||
|
||||
The best way to learn about the code is to compile it and run it. Run the
|
||||
server we gave you with your preferred web browser. Run this server with the
|
||||
client code we gave you. You can even have the client code we gave you contact
|
||||
any other server that speaks HTTP. Make small changes to the server code
|
||||
(e.g., have it print out more debugging information) to see if you understand
|
||||
how it works.
|
||||
|
||||
## Additional Useful Reading
|
||||
|
||||
We anticipate that you will find the following routines useful for creating
|
||||
and synchronizing threads: `pthread_create()`, `pthread_mutex_init()`,
|
||||
`pthread_mutex_lock()`, `pthread_mutex_unlock()`, `pthread_cond_init()`,
|
||||
`pthread_cond_wait()`, `pthread_cond_signal()`. To find information on these
|
||||
library routines, read the man pages (RTFM).
|
||||
|
||||
Reference in New Issue
Block a user