- Python Automation Cookbook
- Jaime Buelta
Downloading web pages
The basic ability to download a web page involves making an HTTP GET request against a URL. This is the fundamental operation of any web browser.
Let's quickly recap the different parts of this operation, as it has three distinct elements:
- Using the HTTP protocol. This deals with the way the request is structured.
- Using the GET method, which is the most common HTTP method. We'll see more in the Accessing web APIs recipe.
- A full URL describing the address of the page, including the server (for example, mypage.com) and the path (for example, /page).
That request will be routed across the internet to the server and processed there, and then a response will be sent back. This response will contain a status code, typically 200 if everything went fine, and a body with the result, which will normally be text containing an HTML page.
Most of this is handled automatically by the HTTP client used to perform the request. We'll see in this recipe how to make a simple request to obtain a web page.
HTTP requests and responses can also contain headers. Headers contain important information about the request itself, such as the total size of the request, the format of the content, the date of the request, and what browser or server is used.
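As a quick aside before the recipe itself (the requests module used here is installed in the next section), headers can also be set explicitly when making a request. This is a minimal sketch; the custom User-Agent value is purely illustrative and not something the recipe requires:
>>> import requests
>>> custom_headers = {'User-Agent': 'my-automation-script/1.0'}  # illustrative value
>>> response = requests.get('http://www.columbia.edu/~fdc/sample.html', headers=custom_headers)
>>> response.request.headers['User-Agent']
'my-automation-script/1.0'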
Getting ready
Using the fantastic requests module, getting web pages is super simple. Install the module:
$ echo "requests==2.23.0" >> requirements.txt
$ source .venv/bin/activate
(.venv) $ pip install -r requirements.txt
We'll download the page at http://www.columbia.edu/~fdc/sample.html because it is a straightforward HTML page that is easy to read in text mode.
How to do it...
- Import the requests module:
>>> import requests
- Make a request to the server using the following URL, which will take a second or two:
>>> url = 'http://www.columbia.edu/~fdc/sample.html'
>>> response = requests.get(url)
- Check the returned object status code:
>>> response.status_code
200
- Check the content of the result:
>>> response.text
'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n<html>\n<head>\n ... FULL BODY ... <!-- close the <html> begun above -->\n'
- Check the outgoing (request) and returned (response) headers:
>>> response.request.headers
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
>>> response.headers
{'Date': 'Fri, 24 Jan 2020 19:04:12 GMT', 'Server': 'Apache', 'Last-Modified': 'Wed, 11 Dec 2019 12:46:44 GMT', 'Accept-Ranges': 'bytes', 'Vary': 'Accept-Encoding,User-Agent', 'Content-Encoding': 'gzip', 'Content-Length': '10127', 'Keep-Alive': 'timeout=15, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html', 'Set-Cookie': 'BIGipServer~CUIT~www.columbia.edu-80-pool=1311259520.20480.0000; expires=Sat, 25-Jan-2020 01:04:12 GMT; path=/; Httponly'}
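The header objects returned by requests behave like dictionaries (and are case-insensitive), so individual values can be read directly. For example, based on the output above:
>>> response.headers['Content-Type']
'text/html'
>>> response.request.headers['User-Agent']
'python-requests/2.22.0'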
How it works...
The operation of requests is very simple: perform the request, using the GET method in this case, over the URL. This returns a result object that can be analyzed. The main elements are the status_code and the body content, which can be presented as text.
The full request can be inspected in the request attribute:
>>> response.request
<PreparedRequest [GET]>
>>> response.request.url
'http://www.columbia.edu/~fdc/sample.html'
The full requests module documentation can be found here: https://requests.readthedocs.io/en/master/.
Over the course of the chapter, we'll be showing more features of the requests library.
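Putting these pieces together, a small helper along the lines of the following sketch downloads a page and fails loudly on any non-success status. The function name and the use of raise_for_status() (a requests shortcut that raises an exception for 4XX/5XX responses) are choices made for this sketch, not part of the recipe:
import requests

def download_page(url):
    # Perform the GET request and raise requests.HTTPError for 4XX/5XX responses
    response = requests.get(url)
    response.raise_for_status()
    # Return the decoded body, normally the text of an HTML page
    return response.text

html = download_page('http://www.columbia.edu/~fdc/sample.html')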
There's more...
All HTTP status codes can be seen at this web page: https://httpstatuses.com/. They are also described in the http.HTTPStatus enum with convenient constant names, such as OK, NOT_FOUND, or FORBIDDEN.
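These constants compare equal to the plain integer codes, so they can be checked directly against status_code. For example, with the response obtained in the recipe:
>>> import http
>>> response.status_code == http.HTTPStatus.OK
True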
The most famous error status code is arguably 404, which is returned when the resource described by a URL is not found. Try it out by doing requests.get('http://www.columbia.edu/invalid').
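An interactive check of that suggestion could look like the following; the exact response depends on the server, but a missing resource should produce a 404:
>>> bad_response = requests.get('http://www.columbia.edu/invalid')
>>> bad_response.status_code
404
>>> bad_response.status_code == http.HTTPStatus.NOT_FOUND
True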
The general structure of the status code is as follows (the snippet after the list shows a quick way to derive the class of a code):
- 1XX – Information on specifics about the protocol.
- 2XX – Success.
- 3XX – Redirection. For example, the URL is no longer valid and is available somewhere else; the new URL should be included in the response.
- 4XX – Client error. There's some error in the information sent to the server (like a bad format) or in the client (for example, authentication is required to access the URL).
- 5XX – Server error. There's an error on the server side; for example, the server might be unavailable or there might be a bug processing the request.
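The class of a code can be obtained by integer division by 100; the dictionary of names below is just a shorthand for this example:
>>> classes = {1: 'Informational', 2: 'Success', 3: 'Redirection', 4: 'Client error', 5: 'Server error'}
>>> classes[response.status_code // 100]
'Success'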
A request can use the HTTPS (secure HTTP) protocol. It is the same as HTTP but ensures that the contents of the request and response are private. requests handles it transparently.
Any website that handles any private information should use HTTPS to ensure that the information has not leaked out. HTTP is vulnerable to someone eavesdropping. Use HTTPS where available.
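Nothing changes in the code when using HTTPS; only the scheme in the URL does. The host below is used purely as a well-known site served over HTTPS and is not part of the recipe:
>>> secure_response = requests.get('https://www.example.com')
>>> secure_response.status_code
200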
See also
- The Installing third-party packages recipe in Chapter 1, Let's Begin Our Automation Journey, to learn the basics of installing external modules.
- The Parsing HTML recipe, later in this chapter, to find out how to treat the information returned from the server.