From bac60948f46ef844e175785fdf48ff5c7e8eccb9 Mon Sep 17 00:00:00 2001
From: Remzi Arpaci-Dusseau
Date: Thu, 28 Feb 2019 13:30:48 -0600
Subject: [PATCH] init web server docs; some details still left out
---
 concurrency-webserver/README.md | 361 ++++++++++++++++++++++++++++++++
 1 file changed, 361 insertions(+)
 create mode 100644 concurrency-webserver/README.md
diff --git a/concurrency-webserver/README.md b/concurrency-webserver/README.md
new file mode 100644
index 0000000..725edb0
--- /dev/null
+++ b/concurrency-webserver/README.md
@@ -0,0 +1,361 @@
+
+# Overview
+
+In this assignment, you will be developing a concurrent web server. To
+simplify this project, we are providing you with the code for a non-concurrent
+(but working) web server. This basic web server operates with only a single
+thread; it will be your job to make the web server multi-threaded so that it
+can handle multiple requests at the same time.
+
+The goals of this project are:
+- To learn the basic architecture of a simple web server
+- To learn how to add concurrency to a non-concurrent system
+- To learn how to read and modify an existing code base effectively
+
+Useful reading from [OSTEP](http://ostep.org) includes:
+- [Intro to threads](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-intro.pdf)
+- [Using locks](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-locks.pdf)
+- [Producer-consumer relationships](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-cv.pdf)
+- [Server concurrency architecture](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-events.pdf)
+
+# HTTP Background
+
+Before describing what you will be implementing in this project, we will
+provide a very brief overview of how a classic web server works, and the HTTP
+protocol (version 1.0) used to communicate with it; although web browsers and
+servers have [evolved a lot over the
+years](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Evolution_of_HTTP),
+the old versions still work and give you a good start in
+understanding how
+things work. Our goal in providing you with a basic web server is that you can
+be shielded from learning all of the details of network connections and the
+HTTP protocol needed to do the project; however, the network code has been
+greatly simplified and is fairly understandable should you choose to study
+it.
+
+Classic web browsers and web servers interact using a text-based protocol
+called **HTTP** (**Hypertext Transfer Protocol**). A web browser opens a
+connection to a web server and requests some content with HTTP. The web server
+responds with the requested content and closes the connection. The browser
+reads the content and displays it on the screen.
+
+HTTP is built on top of the **TCP/IP** protocol suite provided by the
+operating system. Together, TCP and IP ensure that messages are routed to
+their correct destination, get from source to destination reliably in the face
+of failure, and do not overly congest the network by sending too many messages
+at once, among other features. To learn more about networks, take a networking
+class (or many!), or read [this free book](https://book.systemsapproach.org).
+
+Each piece of content on the web server is associated with a file in the
+server's file system. The simplest is *static* content, in which a client
+sends a request just to read a specific file from the server. Slightly more
+complex is *dynamic* content, in which a client requests that an executable
+file be run on the web server and its output returned to the client.
+Each file has a unique name known as a **URL** (**Uniform Resource
+Locator**).
+
+As a simple example, let's say the client browser wants to fetch static
+content (i.e., just some file) from a web server running on some machine. The
+client might then type in the following URL to the browser:
+`http://www.cs.wisc.edu/index.html`.
+This URL identifies that the HTTP
+protocol is to be used, and that an HTML file in the root directory (`/`) of
+the web server called `index.html` on the host machine `www.cs.wisc.edu`
+should be fetched.
+
+A web server is identified not just by the machine it is running on but also
+by the **port** on which it listens for connections. Ports are a
+communication abstraction that allows multiple (possibly independent) network
+communications to happen concurrently upon a machine; for example, the web
+server might be receiving an HTTP request upon port 80 while a mail server is
+sending email out using port 25. By default, web servers are expected to run
+on port 80 (the well-known HTTP port number), but sometimes (as in this
+project), a different port number will be used. To fetch a file from a web
+server running at a different port number (say 8000), specify the port number
+directly in the URL, e.g., `http://www.cs.wisc.edu:8000/index.html`.
+
+URLs for executable files (i.e., dynamic content) can include program
+arguments after the file name. For example, to just run a program (`test.cgi`)
+without any arguments, the client might use the URL
+`http://www.cs.wisc.edu/test.cgi`. To specify arguments, the `?` and `&`
+characters are used, with the `?` character separating the file name from the
+arguments and the `&` character separating each argument from the others. For
+example, `http://www.cs.wisc.edu/test.cgi?x=10&y=20` can be used to send
+multiple arguments `x` and `y` and their respective values to the program
+`test.cgi`. The program being run is called a **CGI program** (short for
+[Common Gateway
+Interface](https://en.wikipedia.org/wiki/Common_Gateway_Interface); yes, this
+is a terrible name); the arguments are passed into the program as part of the
+[`QUERY_STRING`](https://en.wikipedia.org/wiki/Query_string) environment
+variable, which the program can then parse to access these arguments.
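To make the `QUERY_STRING` mechanism concrete, here is a minimal sketch in C of how a CGI program might pull one argument out of a string such as `x=10&y=20`. The helper name `query_get` is hypothetical and not part of the provided code, and the sketch assumes the values are not URL-encoded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of the provided code): look up `key` in a
 * query string such as "x=10&y=20" and copy its value into `out`.
 * Returns 0 on success, -1 if the key is absent or the value does not fit. */
int query_get(const char *query, const char *key, char *out, size_t outlen) {
    assert(query != NULL && key != NULL && out != NULL);
    size_t keylen = strlen(key);
    const char *p = query;
    while (*p != '\0') {
        /* Each argument runs up to the next '&' (or the end of the string). */
        const char *end = strchr(p, '&');
        if (end == NULL)
            end = p + strlen(p);
        /* Match "key=" at the start of this argument. */
        if ((size_t)(end - p) > keylen &&
            strncmp(p, key, keylen) == 0 && p[keylen] == '=') {
            size_t vlen = (size_t)(end - p) - keylen - 1;
            if (vlen >= outlen)
                return -1;
            memcpy(out, p + keylen + 1, vlen);
            out[vlen] = '\0';
            return 0;
        }
        /* Skip past the '&' separator, or stop at the terminating NUL. */
        p = (*end == '&') ? end + 1 : end;
    }
    return -1;
}
```

A CGI program could pass the result of `getenv("QUERY_STRING")` (after checking it for `NULL`) as the `query` argument.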
+
+# The HTTP Request
+
+When a client (e.g., a browser) wants to fetch a file from a machine, the
+process starts by sending that machine a message. But what exactly is in the
+body of that message? These *request contents*, and the subsequent *reply
+contents*, are specified precisely by the HTTP protocol.
+
+Let's start with the request contents, sent from the web browser to the
+server. This HTTP request consists of a request line, followed by zero or more
+request headers, and finally an empty text line. A request line has the form:
+`method uri version`. The `method` is usually `GET`, which tells the web
+server that the client simply wants to read the specified file; however, other
+methods exist (e.g., `POST`). The `uri` is the file name, and perhaps optional
+arguments (in the case of dynamic content). Finally, the `version` indicates
+the version of the HTTP protocol that the web client is using (e.g.,
+HTTP/1.0).
+
+The HTTP response (from the server to the browser) is similar; it consists of
+a response line, zero or more response headers, an empty text line, and
+finally the interesting part, the response body. A response line has the form
+`version status message`. The `status` is a three-digit positive integer that
+indicates the state of the request; some common states are `200` for `OK`,
+`403` for `Forbidden` (i.e., the client can't access that file), and `404` for
+`File Not Found` (the famous error). Two important lines in the header are
+`Content-Type`, which tells the client the type of the content in the response
+body (e.g., HTML or gif or otherwise) and `Content-Length`, which indicates
+the file size in bytes.
+
+For this project, you don't really need to know this information about HTTP
+unless you want to understand the details of the code we have given you. You
+will not need to modify any of the procedures in the web server that deal with
+the HTTP protocol or network connections. However, it's always good to learn
+more, isn't it?
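The message formats above can be sketched in C as follows. Both helper names (`parse_request_line`, `format_response_header`) and the `FIELD_MAX` limit are assumptions made for illustration; the provided `request.c` handles these details in its own way. The sketch relies on the fact that HTTP header lines end with `\r\n`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define FIELD_MAX 256  /* assumed field size for this sketch */

/* Hypothetical helper: split a request line of the form "method uri version"
 * into three fields. Each output buffer must hold at least FIELD_MAX bytes.
 * Returns 0 on success, -1 if fewer than three fields are present. */
int parse_request_line(const char *line, char *method, char *uri, char *version) {
    assert(line != NULL);
    /* %255s reads at most 255 non-whitespace characters, leaving room for
     * the terminating NUL; the trailing "\r\n" counts as whitespace. */
    if (sscanf(line, "%255s %255s %255s", method, uri, version) != 3)
        return -1;
    return 0;
}

/* Hypothetical helper: write a response line, the two headers named in the
 * text, and the empty line that ends the header section.
 * Returns the number of bytes written, or -1 if `buf` is too small. */
int format_response_header(char *buf, size_t buflen, int status,
                           const char *message, const char *content_type,
                           long content_length) {
    int n = snprintf(buf, buflen,
                     "HTTP/1.0 %d %s\r\n"
                     "Content-Type: %s\r\n"
                     "Content-Length: %ld\r\n"
                     "\r\n",
                     status, message, content_type, content_length);
    return (n < 0 || (size_t)n >= buflen) ? -1 : n;
}
```

For example, parsing `GET /index.html HTTP/1.0` yields the method `GET`, the URI `/index.html`, and the version `HTTP/1.0`; the response body would follow immediately after the header produced by the second helper.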
+
+# A Basic Web Server
+
+The code for the web server is available in this repository. You can compile
+the files herein by simply typing `make`. Compile and run this basic web
+server before making any changes to it! `make clean` removes `.o` files and
+executables and lets you do a clean build.
+
+When you run this basic web server, you need to specify the port number that
+it will listen on; ports below number 1024 are *reserved* (see the list
+[here](https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml))
+so you should specify port numbers that are greater than 1023 to avoid this
+reserved range; the max is 65535. Be wary: if running on a shared machine, you
+could conflict with others and thus have your server fail to bind to the
+desired port. If this happens, try a different number!
+
+When you then connect your web browser to this server, make sure that
+you specify this same port. For example, assume that you are running on
+`bumble21.cs.wisc.edu` and use port number 8003; copy your favorite HTML file
+to the directory that you start the web server from. Then, to view this file
+from a web browser (running on the same or a different machine), use the URL
+`bumble21.cs.wisc.edu:8003/favorite.html`. If you run the client and web
+server on the same machine, you can just use the hostname `localhost` as a
+convenience, e.g., `localhost:8003/favorite.html`.
+
+To make the project a bit easier, we are providing you with a minimal web
+server, consisting of only a few hundred lines of C code. As a result, the
+server is limited in its functionality; it does not handle any HTTP requests
+other than `GET`, understands only a few content types, and supports only the
+`QUERY_STRING` environment variable for CGI programs. This web server is also
+not very robust; for example, if a web client closes its connection to the
+server, it may trip an assertion in the server causing it to exit.
+We do not
+expect you to fix these problems (though you can, if you like, you know, for
+fun).
+
+Helper functions are provided to simplify error checking. A wrapper calls the
+desired function and immediately terminates the program if an error occurs.
+The wrappers are found in the file `io_helper.h`; more about this below. One
+should always check error codes, even if all you do in response is exit;
+dropping errors silently is **BAD C PROGRAMMING** and should be avoided at all
+costs.
+
+# Finally: Some New Functionality!
+
+In this project, you will be adding three key pieces of functionality to the
+basic web server. First, you will make the web server multi-threaded. Second,
+you will implement different scheduling policies so that requests are serviced
+in different orders. Third, you will add statistics to measure how the web
+server is performing. You will also be modifying how the web server is invoked
+so that it can handle new input parameters (e.g., the number of threads to
+create).
+
+## Part 1: Multi-threaded
+
+The basic web server that we provided has a single thread of
+control. Single-threaded web servers suffer from a fundamental performance
+problem in that only a single HTTP request can be serviced at a time. Thus,
+every other client that is accessing this web server must wait until the
+current HTTP request has finished; this is especially a problem if the current
+HTTP request is a long-running CGI program or is resident only on disk (i.e.,
+is not in memory). Thus, the most important extension that you will be adding
+is to make the basic web server multi-threaded.
+
+The simplest approach to building a multi-threaded server is to spawn a new
+thread for every new HTTP request. The OS will then schedule these threads
+according to its own policy.
+The advantage of creating these threads is that
+now short requests will not need to wait for a long request to complete;
+further, when one thread is blocked (e.g., waiting for disk I/O to finish) the
+other threads can continue to handle other requests. However, the drawback of
+the one-thread-per-request approach is that the web server pays the overhead
+of creating a new thread on every request.
+
+Therefore, the generally preferred approach for a multi-threaded server is to
+create a fixed-size *pool* of worker threads when the web server is first
+started. With the pool-of-threads approach, each thread is blocked until there
+is an HTTP request for it to handle. Therefore, if there are more worker
+threads than active requests, then some of the threads will be blocked,
+waiting for new HTTP requests to arrive; if there are more requests than
+worker threads, then those requests will need to be buffered until there is a
+ready thread.
+
+In your implementation, you must have a master thread that begins by creating
+a pool of worker threads, the number of which is specified on the command
+line. Your master thread is then responsible for accepting new HTTP
+connections over the network and placing the descriptor for this connection
+into a fixed-size buffer; in your basic implementation, the master thread
+should not read from this connection. The number of elements in the buffer is
+also specified on the command line. Note that the existing web server has a
+single thread that accepts a connection and then immediately handles the
+connection; in your web server, this thread should place the connection
+descriptor into a fixed-size buffer and return to accepting more connections.
+
+Each worker thread is able to handle both static and dynamic requests. A
+worker thread wakes when there is an HTTP request in the queue; when there are
+multiple HTTP requests available, which request is handled depends upon the
+scheduling policy, described below.
+Once the worker thread wakes, it performs
+the read on the network descriptor, obtains the specified content (by either
+reading the static file or executing the CGI process), and then returns the
+content to the client by writing to the descriptor. The worker thread then
+waits for another HTTP request.
+
+Note that the master thread and the worker threads are in a producer-consumer
+relationship and require that their accesses to the shared buffer be
+synchronized. Specifically, the master thread must block and wait if the
+buffer is full; a worker thread must wait if the buffer is empty. In this
+project, you are required to use condition variables. Note: if your
+implementation performs any busy-waiting (or spin-waiting) instead, you will
+be heavily penalized.
+
+Side note: do not be confused by the fact that the basic web server we provide
+forks a new process for each CGI process that it runs. Although, in a very
+limited sense, the web server does use multiple processes, it never handles
+more than a single request at a time; the parent process in the web server
+explicitly waits for the child CGI process to complete before continuing and
+accepting more HTTP requests. When making your server multi-threaded, you
+should not modify this section of the code.
+
+## Part 2: Scheduling Policies
+
+In this project, you will implement a number of different scheduling
+policies. Note that when your web server has multiple worker threads running
+(the number of which is specified on the command line), you will not have any
+control over which thread is actually scheduled at any given time by the
+OS. Your role in scheduling is to determine which HTTP request should be
+handled by each of the waiting worker threads in your web server.
+
+The scheduling policy is determined by a command-line argument when the web
+server is started; the policies are as follows:
+
+- **First-in-First-out (FIFO)**: When a worker thread wakes, it handles the
+first request (i.e., the oldest request) in the buffer. Note that the HTTP
+requests will not necessarily finish in FIFO order; the order in which the
+requests complete will depend upon how the OS schedules the active threads.
+
+- **Smallest File First (SFF)**: When a worker thread wakes, it handles the
+request for the smallest file. This policy approximates Shortest Job First to
+the extent that the size of the file is a good prediction of how long it takes
+to service that request. Requests for static and dynamic content may be
+intermixed, depending upon the sizes of those files. Note that this algorithm
+can lead to the starvation of requests for large files. You will also note
+that the SFF policy requires that something be known about each request (e.g.,
+the size of the file) before the requests can be scheduled. Thus, to support
+this scheduling policy, you will need to do some initial processing of the
+request (hint: using `stat()` on the filename) outside of the worker threads;
+you will probably want the master thread to perform this work, which requires
+that it read from the network descriptor.
+
+## Security
+
+Running a networked server can be dangerous, especially if you are not
+careful. Thus, security is something you should consider carefully when
+creating a web server. One thing you should always make sure to do is not
+leave your server running beyond testing, thus leaving open a potential
+backdoor into files in your system.
+
+Your system should also make sure to constrain file requests to stay within
+the sub-tree of the file system hierarchy, rooted at the base working
+directory that the server starts in. You must take steps to ensure that
+pathnames that are passed in do not refer to files outside of this sub-tree.
+One simple (perhaps overly conservative) way to do this is to reject any
+pathname with `..` in it, thus avoiding any traversals up the file system
+tree. More sophisticated solutions could use `chroot()` or Linux containers,
+but perhaps those are beyond the scope of the project.
+
+## Command-line Parameters
+
+Your C program must be invoked exactly as follows:
+
+```sh
+prompt> ./wserver [-d <basedir>] [-p <portnum>] [-t <threads>] [-b <buffers>] [-s <schedalg>]
+```
+
+The command line arguments to your web server are to be interpreted as
+follows.
+
+- **basedir**: this is the root directory from which the web server should
+  operate. The server should try to ensure that file accesses do not access
+  files above this directory in the file-system hierarchy. Default: current
+  working directory (i.e., `.`).
+- **portnum**: the port number that the web server should listen on; the basic web
+  server already handles this argument. Default: 10000.
+- **threads**: the number of worker threads that should be created within the web
+  server. Must be a positive integer. Default: 1.
+- **buffers**: the number of request connections that can be accepted at one
+  time. Must be a positive integer. Note that it is not an error for more or
+  fewer threads to be created than buffers. Default: 1.
+- **schedalg**: the scheduling algorithm to be performed. Must be one of FIFO
+  or SFF. Default: FIFO.
+
+For example, you could run your program as:
+```
+prompt> ./wserver -d . -p 8003 -t 8 -b 16 -s SFF
+```
+
+In this case, your web server will listen on port 8003, create 8 worker threads for
+handling HTTP requests, allocate 16 buffers for connections that are currently
+in progress (or waiting), and use SFF scheduling for arriving requests.
+
+# Source Code Overview
+
+We recommend understanding how the code that we gave you works. We provide
+the following files:
+
+- **wserver.c:** Contains main() for the web server and the basic serving loop.
+- **request.c:** Performs most of the work for handling requests in the basic
+  web server. Start at `request_handle()` and work through the logic from
+  there.
+- **io_helper.h:** Contains wrapper functions for the system calls invoked by
+  the basic web server and client. The convention is to add `_or_die` to an
+  existing call to provide a version that either succeeds or exits. For
+  example, the `open()` system call is used to open a file, but can fail for a
+  number of reasons. The wrapper, `open_or_die()`, either successfully opens a
+  file or exits upon failure.
+- **wclient.c:** Contains main() and the support routines for the very simple
+  web client. To test your server, you may want to change this code so that it
+  can send simultaneous requests to your server. By launching `wclient`
+  multiple times, you can test how your server handles concurrent requests.
+- **spin.c:** A simple CGI program. Basically, it spins for a fixed amount
+  of time, which you may find useful in testing various aspects of your server.
+- **Makefile:** We also provide you with a sample Makefile that creates
+  `wserver`, `wclient`, and `spin.cgi`. You can type `make` to create all of
+  these programs. You can type `make clean` to remove the object files and the
+  executables. You can type make server to create just the server program,
+  etc. As you create new files, you will need to add them to the Makefile.
+
+The best way to learn about the code is to compile it and run it. Run the
+server we gave you with your preferred web browser. Run this server with the
+client code we gave you. You can even have the client code we gave you contact
+any other server that speaks HTTP. Make small changes to the server code
+(e.g., have it print out more debugging information) to see if you understand
+how it works.
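The `_or_die` convention described for `io_helper.h` can be sketched roughly as follows. This is only an illustration of the idea, not the provided implementation (which may be written differently, for instance as macros); the `_sketch` suffix marks both names as hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper in the `_or_die` style: return a valid descriptor,
 * or print the reason for the failure and exit. */
int open_or_die_sketch(const char *path, int flags, mode_t mode) {
    assert(path != NULL);
    int fd = open(path, flags, mode);
    if (fd < 0) {
        perror("open");  /* reports the error recorded in errno */
        exit(1);
    }
    return fd;
}

/* The same idea applied to close(). */
void close_or_die_sketch(int fd) {
    if (close(fd) < 0) {
        perror("close");
        exit(1);
    }
}
```

Callers of such wrappers never need to test the return value for failure, which keeps the main logic short while still never dropping an error silently.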
+
+## Additional Useful Reading
+
+We anticipate that you will find the following routines useful for creating
+and synchronizing threads: `pthread_create()`, `pthread_mutex_init()`,
+`pthread_mutex_lock()`, `pthread_mutex_unlock()`, `pthread_cond_init()`,
+`pthread_cond_wait()`, `pthread_cond_signal()`. To find information on these
+library routines, read the man pages (RTFM).
+
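As one possible illustration of how these routines fit together, the sketch below implements a small FIFO buffer of connection descriptors guarded by a mutex and two condition variables. All names (`conn_buffer_t`, `buffer_init`, `buffer_put`, `buffer_get`, `BUF_CAP`) are hypothetical, and the capacity is fixed at compile time for brevity, whereas the assignment takes it from the command line.

```c
#include <assert.h>
#include <pthread.h>

#define BUF_CAP 16  /* assumed capacity; the real server takes this from -b */

/* Hypothetical bounded buffer of connection descriptors (FIFO order). */
typedef struct {
    int fds[BUF_CAP];
    int head;                  /* index of the oldest entry */
    int count;                 /* number of entries currently stored */
    pthread_mutex_t lock;
    pthread_cond_t not_full;   /* signaled when a slot frees up */
    pthread_cond_t not_empty;  /* signaled when an entry is added */
} conn_buffer_t;

void buffer_init(conn_buffer_t *b) {
    b->head = 0;
    b->count = 0;
    int rc = pthread_mutex_init(&b->lock, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&b->not_full, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&b->not_empty, NULL);
    assert(rc == 0);
}

/* Producer side (master thread): blocks while the buffer is full. */
void buffer_put(conn_buffer_t *b, int fd) {
    pthread_mutex_lock(&b->lock);
    while (b->count == BUF_CAP)  /* re-check the condition after every wakeup */
        pthread_cond_wait(&b->not_full, &b->lock);
    b->fds[(b->head + b->count) % BUF_CAP] = fd;
    b->count++;
    pthread_cond_signal(&b->not_empty);
    pthread_mutex_unlock(&b->lock);
}

/* Consumer side (worker thread): blocks while the buffer is empty. */
int buffer_get(conn_buffer_t *b) {
    pthread_mutex_lock(&b->lock);
    while (b->count == 0)
        pthread_cond_wait(&b->not_empty, &b->lock);
    int fd = b->fds[b->head];
    b->head = (b->head + 1) % BUF_CAP;
    b->count--;
    pthread_cond_signal(&b->not_full);
    pthread_mutex_unlock(&b->lock);
    return fd;
}
```

In a server, the master thread would call `buffer_put()` after accepting a connection, and each worker created with `pthread_create()` would loop on `buffer_get()`. Waiting inside a `while` loop rather than an `if` means a thread always re-checks the buffer state after waking, and no thread ever spins while it waits.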