- Python Automation Cookbook
- Jaime Buelta
Downloading web pages
The basic ability to download a web page involves making an HTTP GET
request against a URL. This is the basic operation of any web browser.
Let's quickly recap the different parts of this operation, as it has three distinct elements:
- Using the HTTP protocol. This deals with the way the request is structured.
- Using the GET method, which is the most common HTTP method. We'll see more in the Accessing web APIs recipe.
- A full URL describing the address of the page, including the server (for example: mypage.com) and the path (for example: /page).
That request will be routed across the internet to the server, which processes it and sends a response back. This response will contain a status code, typically 200 if everything went fine, and a body with the result, which will normally be text with an HTML page.
Most of this is handled automatically by the HTTP client used to perform the request. We'll see in this recipe how to make a simple request to obtain a web page.
HTTP requests and responses can also contain headers. Headers contain important information about the request itself, such as the total size of the request, the format of the content, the date of the request, and what browser or server is used.
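These headers can also be set explicitly on an outgoing request. A minimal sketch, assuming we want to present a custom User-Agent string (the value shown is just an example):
>>> import requests
>>> url = 'http://www.columbia.edu/~fdc/sample.html'
>>> # Extra headers are passed as a dict and merged with the defaults
>>> response = requests.get(url, headers={'User-Agent': 'my-automation-script/1.0'})
>>> response.request.headers['User-Agent']
'my-automation-script/1.0'
>>> response.headers['Content-Type']
'text/html'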
Getting ready
Using the fantastic requests module, getting web pages is super simple. Install the module:
$ echo "requests==2.23.0" >> requirements.txt
$ source .venv/bin/activate
(.venv) $ pip install -r requirements.txt
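To confirm that the pinned version is the one available in the virtual environment, a quick optional check (the command simply prints the installed version):
(.venv) $ python -c "import requests; print(requests.__version__)"
2.23.0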
We'll download the page at http://www.columbia.edu/~fdc/sample.html because it is a straightforward HTML page that is easy to read in text mode.
How to do it...
- Import the requests module:
>>> import requests
- Make a request to the server using the following URL, which will take a second or two:
>>> url = 'http://www.columbia.edu/~fdc/sample.html'
>>> response = requests.get(url)
- Check the status code of the returned object:
>>> response.status_code
200
- Check the content of the result:
>>> response.text
'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n<html>\n<head>\n ... FULL BODY ... <!-- close the <html> begun above -->\n'
- Check the sent (request) and returned (response) headers:
>>> response.request.headers
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
>>> response.headers
{'Date': 'Fri, 24 Jan 2020 19:04:12 GMT', 'Server': 'Apache', 'Last-Modified': 'Wed, 11 Dec 2019 12:46:44 GMT', 'Accept-Ranges': 'bytes', 'Vary': 'Accept-Encoding,User-Agent', 'Content-Encoding': 'gzip', 'Content-Length': '10127', 'Keep-Alive': 'timeout=15, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html', 'Set-Cookie': 'BIGipServer~CUIT~www.columbia.edu-80-pool=1311259520.20480.0000; expires=Sat, 25-Jan-2020 01:04:12 GMT; path=/; Httponly'}
How it works...
The operation of requests is very simple; perform the request, using the GET method in this case, over the URL. This returns a result object that can be analyzed. The main elements are the status_code and the body content, which can be presented as text.
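Putting those elements together, a small helper function is often handy. This is only a sketch; the function name and the decision to return None on any non-200 code are our own choices, not part of the recipe:
>>> import requests
>>> def get_page(url):
...     # Return the body text only when the request succeeded
...     response = requests.get(url)
...     if response.status_code == 200:
...         return response.text
...     return None
...
>>> page = get_page('http://www.columbia.edu/~fdc/sample.html')
>>> page[:14]
'<!DOCTYPE HTML'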
The full request can be inspected in the request attribute:
>>> response.request
<PreparedRequest [GET]>
>>> response.request.url
'http://www.columbia.edu/~fdc/sample.html'
The full requests module documentation can be found here: https://requests.readthedocs.io/en/master/.
Over the course of the chapter, we'll be showing more features of the requests library.
There's more...
All HTTP status codes can be seen at this web page: https://httpstatuses.com/. They are also described in the http.HTTPStatus enum with convenient constant names, such as OK, NOT_FOUND, or FORBIDDEN.
The most famous error status code is arguably 404, which is returned when the resource described by a URL is not found. Try it out by doing requests.get('http://www.columbia.edu/invalid').
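The constants in http.HTTPStatus make such checks more readable than raw numbers. A minimal sketch using the invalid URL mentioned above:
>>> import requests
>>> from http import HTTPStatus
>>> response = requests.get('http://www.columbia.edu/invalid')
>>> response.status_code
404
>>> response.status_code == HTTPStatus.NOT_FOUND
True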
The general structure of the status code is:
- 1XX – Information on specifics about the protocol.
- 2XX – Success.
- 3XX – Redirection. For example: the URL is no longer valid and is available somewhere else. The new URL should be included.
- 4XX – Client error. There's some error in the information sent to the server (like a bad format) or in the client (for example, authentication is required to be able to access the URL).
- 5XX – Server error. There's an error on the server side; for example, the server might be unavailable or there might be a bug processing the request.
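Because the first digit carries the class, it can be recovered with integer division, which is sometimes useful for logging or grouping responses. A small sketch (the labels are just our own wording of the list above):
>>> def status_class(status_code):
...     # The hundreds digit determines the general class of the code
...     classes = {
...         1: 'Information',
...         2: 'Success',
...         3: 'Redirection',
...         4: 'Client error',
...         5: 'Server error',
...     }
...     return classes.get(status_code // 100, 'Unknown')
...
>>> status_class(200)
'Success'
>>> status_class(404)
'Client error'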
A request can use the HTTPS (secure HTTP) protocol. It is the same as HTTP but ensures that the contents of the request and response are private. requests handles it transparently.
Any website that handles private information should use HTTPS to ensure that the information does not leak out. Plain HTTP is vulnerable to eavesdropping. Use HTTPS where available.
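No extra code is needed to use HTTPS from the client side; the URL scheme is the only difference. A quick sketch (the site below is simply an example of one served over HTTPS):
>>> import requests
>>> # Certificate validation and encryption are handled transparently
>>> response = requests.get('https://www.python.org')
>>> response.status_code
200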
See also
- The Installing third-party packages recipe in Chapter 1, Let's Begin Our Automation Journey, to learn the basics of installing external modules.
- The Parsing HTML recipe, later in this chapter, to find out how to treat the information returned from the server.