Files
build-web-application-with-…/en/eBook/03.1.md
2014-05-10 18:05:31 +08:00

10 KiB
Raw Blame History

Web working principles

Every time you open your browsers, type some URLs and press enter, then you will see the beautiful web pages appear on you screen. But do you know what is happening behind this simple action?

Normally, your browser is a client, after you typed URL, it sends requests to DNS server, get IP address of the URL. Then it finds the server in that IP address, asks to setup TCP connections. When the browser finished sending HTTP requests, server starts handling your request packages, then return HTTP response packages to your browser. Finally, the browser renders bodies of the web pages, and disconnects from the server.

Figure 3.1 Processes of users visit a website

A web server also known as a HTTP server, it uses HTTP protocol to communicate with clients. All web browsers can be seen as clients.

We can divide web working principles to following steps:

  • Client uses TCP/IP protocol to connect to server.
  • Client sends HTTP request packages to server.
  • Server returns HTTP response packages to client, if request resources including dynamic scripts, server calls script engine first.
  • Client disconnects from server, starts rendering HTML.

This a simple work flow of HTTP affairs, notice that every time server closes connections after sent data to clients, and waits for next request.

URL and DNS resolution

We are always using URL to access web pages, but do you know how URL works?

The full name of URL is Uniform Resource Locator, this is for describing resources on the internet. Its basic form as follows.

scheme://host[:port#]/path/.../[?query-string][#anchor]
scheme         assign underlying protocol(such as HTTP, HTTPS, ftp)
host           IP or domain name of HTTP server
port#          default port is 80, and you can omit in this case. If you want to use other ports, you must to specify which port. For example, http://www.cnblogs.com:8080/
path           resources path
query-string   data are sent to server
anchor         anchor

DNS is abbreviation of Domain Name System, it's the name system for computer network services, it converts domain name to actual IP addresses, just like a translator.

Figure 3.2 DNS working principles

To understand more about its working principle, let's see detailed DNS resolution process as follows.

  1. After typed domain name www.qq.com in the browser, operating system will check if there is any mapping relationship in the hosts file for this domain name, if so then finished the domain name resolution.
  2. If no mapping relationship in the hosts file, operating system will check if there is any cache in the DNS, if so then finished the domain name resolution.
  3. If no mapping relationship in the hosts and DNS cache, operating system finds the first DNS resolution server in your TCP/IP setting, which is local DNS server at this time. When local DNS server received query, if the domain name that you want to query is contained in the local configuration of regional resources, then gives back results to the client. This DNS resolution is authoritative.
  4. If local DNS server doesn't contain the domain name, and there is a mapping relationship in the cache, local DNS server gives back this result to client. This DNS resolution is not authoritative.
  5. If local DNS server cannot resolve this domain name either by configuration of regional resource or cache, it gets into next step depends on the local DNS server setting. If the local DNS server doesn't enable forward mode, it sends request to root DNS server, then returns the IP of top level DNS server may know this domain name, .com in this case. If the first top level DNS server doesn't know, it sends request to next top level DNS server until the one that knows the domain name. Then the top level DNS server asks next level DNS server for qq.com, then finds the www.qq.com in some servers.
  6. If the local DNS server enabled forward mode, it sends request to upper level DNS server, if the upper level DNS server also doesn't know the domain name, then keep sending request to upper level. Whether local DNS server enables forward mode, server's IP address of domain name returns to local DNS server, and local server sends it to clients.

Figure 3.3 DNS resolution work flow

Recursive query process means the enquirers are changing in the process, and enquirers do not change in Iterative query process.

Now we know clients get IP addresses in the end, so the browsers are communicating with servers through IP addresses.

HTTP protocol

HTTP protocol is the core part of web services. It's important to know what is HTTP protocol before you understand how web works.

HTTP is the protocol that used for communicating between browsers and web servers, it is based on TCP protocol, and usually use port 80 in the web server side. It is a protocol that uses request-response model, clients send request and servers response. According to HTTP protocol, clients always setup a new connection and send a HTTP request to server in every affair. Server is not able to connect to client proactively, or a call back connection. The connection between the client and the server can be closed by any side. For example, you can cancel your download task and HTTP connection. It disconnects from server before you finish downloading.

HTTP protocol is stateless, which means the server has no idea about the relationship between two connections, even though they are both from same client. To solve this problem, web applications use Cookie to maintain sustainable state of connections.

Because HTTP protocol is based on TCP protocol, so all TCP attacks will affect the HTTP communication in your server, such as SYN Flood, DoS and DDoS.

HTTP request package (browser information)

Request packages all have three parts: request line, request header, and body. There is one blank line between header and body.

GET /domains/example/ HTTP/1.1      // request line: request method, URL, protocol and its version
Hostwww.iana.org             // domain name
User-AgentMozilla/5.0 (Windows NT 6.1) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.94 Safari/537.4            // browser information
Accepttext/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8    // mine that clients can accept
Accept-Encodinggzip,deflate,sdch     // stream compression
Accept-CharsetUTF-8,*;q=0.5      // character set in client side
// blank line
// body, request resource arguments (for example, arguments in POST)

We use fiddler to get following request information.

Figure 3.4 Information of GET method caught by fiddler

Figure 3.5 Information of POST method caught by fiddler

We can see that GET method doesn't have request body that POST does.

There are many methods you can use to communicate with servers in HTTP, and GET, POST, PUT, DELETE are the basic 4 methods that we use. A URL represented a resource on the network, so these 4 method means query, change, add and delete operations. GET and POST are most commonly used in HTTP. GET appends data to the URL and uses ? to break up them, uses & between arguments, like EditPosts.aspx?name=test1&id=123456. POST puts data in the request body because URL has length limitation by browsers, so POST can submit much more data than GET method. Also when we submit our user name and password, we don't want this kind of information appear in the URL, so we use POST to keep them invisible.

HTTP response package (server information)

Let's see what information is contained in the response packages.

HTTP/1.1 200 OK                     // status line
Server: nginx/1.0.8                 // web server software and its version in the server machine
Date:Date: Tue, 30 Oct 2012 04:14:25 GMT        // responded time
Content-Type: text/html             // responded data type
Transfer-Encoding: chunked          // it means data were sent in fragments
Connection: keep-alive              // keep connection 
Content-Length: 90                  // length of body
// blank line
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"... // message body

The first line is called status line, it has HTTP version, status code and statue message.

Status code tells the client is HTTP server has expectation response. In HTTP/1.1, we defined 5 kinds of status code.

- 1xx Informational
- 2xx Success
- 3xx Redirection
- 4xx Client Error
- 5xx Server Error

Let's see more examples about response packages, 200 means server responded correctly, 302 means redirection.

Figure 3.6 Full information for visiting a website

HTTP is stateless and Connection: keep-alive

Stateless doesn't means server has no ability to keep a connection, in other words, server doesn't know any relationship between any two requests.

In HTTP/1.1, Keep-alive is used as default, if clients have more requests, they will use the same connection for many different requests.

Notice that Keep-alive cannot keep one connection forever, the software runs in the server has certain time to keep connection, and you can change it.

Request instance

Figure 3.7 All packages for open one web page

We can see the whole process of communication between the client and server by above picture. You may notice that there are many resource files in the list, they are called static files, and Go has specialized processing methods for these files.

This is the most important function of browsers, request for a URL and get data from web servers, then render HTML for good user interface. If it finds some files in the DOM, such as CSS or JS files, browsers will request for these resources from server again, until all the resources finished rendering on your screen.

Reduce HTTP request times is one of the methods that improves speed of loading web pages, which is reducing CSS and JS files, it reduces pressure in the web servers at the same time.