- Python Automation Cookbook
- Jaime Buelta
- 2021-06-30 14:52:58
Interacting with forms
Forms are a common element of web pages. They are a way of sending values to a web page, for example, to create a new comment on a blog post or to submit a purchase.
Browsers present forms so you can input values and send them in a single action after pressing the submit or equivalent button. We'll see how to create this action programmatically in this recipe.
Be aware that sending data to a site is normally a more delicate matter than receiving data from it. For example, sending automatic comments to a website is very much the definition of spam. This means that it can be more difficult to automate as it involves considering security measures. Double-check that what you're trying to achieve is a valid, ethical use case.
Getting ready
We'll work against the test server https://httpbin.org/forms/post, which allows us to send a test form and sends back the submitted information.
Note that the URL https://httpbin.org/forms/post renders the form, but internally calls the URL https://httpbin.org/post to send the information. We'll use both URLs during this recipe.
The following is an example form to order a pizza:
Figure 3.2: Rendered form
You can fill the form in manually and see it return the information in JSON format, including extra information such as the browser being used.
The following shows the form as seen in the browser, once filled in:
Figure 3.3: Filled-in form
The following screenshot shows the JSON content that the server returns after the form is submitted:
Figure 3.4: Returned JSON content
We need to analyze the HTML to see the accepted data for the form. The source code is as follows:
Figure 3.5: Source code
Check the names of the inputs: custname, custtel, custemail, size (a radio option), topping (a multiselection checkbox), delivery (time), and comments.
How to do it...
- Import the requests, BeautifulSoup, and re modules:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> import re
- Retrieve the form page, parse it, and print the input fields. Check that the posting URL is /post (not /forms/post):
>>> response = requests.get('https://httpbin.org/forms/post')
>>> page = BeautifulSoup(response.text, 'html.parser')
>>> form = page.find('form')
>>> form.get('action')
'/post'
>>> {field.get('name') for field in form.find_all(re.compile('input|textarea'))}
{'delivery', 'topping', 'size', 'custemail', 'comments', 'custtel', 'custname'}
Note that textarea is a valid form input defined in HTML, so it is searched for alongside input.
- Prepare the data to be posted as a dictionary. Check that the values are as defined in the form:
>>> data = {'custname': "Sean O'Connell", 'custtel': '123-456-789', 'custemail': 'sean@oconnell.ie', 'size': 'small', 'topping': ['bacon', 'onion'], 'delivery': '20:30', 'comments': ''}
- Post the values and check that the response is the same as returned in the browser:
>>> response = requests.post('https://httpbin.org/post', data)
>>> response
<Response [200]>
>>> response.json()
{'args': {}, 'data': '', 'files': {},
 'form': {'comments': '', 'custemail': 'sean@oconnell.ie', 'custname': "Sean O'Connell",
          'custtel': '123-456-789', 'delivery': '20:30', 'size': 'small',
          'topping': ['bacon', 'onion']},
 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close',
             'Content-Length': '140', 'Content-Type': 'application/x-www-form-urlencoded',
             'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.22.0'},
 'json': None, 'origin': '89.100.17.159', 'url': 'https://httpbin.org/post'}
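The check on the posting URL in the steps above can be sketched without a network connection. The following is a minimal illustration using only the standard library instead of BeautifulSoup; the HTML snippet is a trimmed, hypothetical stand-in for the form page (not the real httpbin source), and the FormInspector class is a name introduced here for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Trimmed, hypothetical stand-in for the form page; not the real httpbin source.
SAMPLE_HTML = """
<form method="post" action="/post">
  <input name="custname">
  <input type="radio" name="size" value="small">
  <input type="radio" name="size" value="medium">
  <input type="checkbox" name="topping" value="bacon">
  <textarea name="comments"></textarea>
  <button>Submit order</button>
</form>
"""


class FormInspector(HTMLParser):
    """Collect the first form's action and every named field in the document."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form' and self.action is None:
            self.action = attrs.get('action', '')
        elif tag in ('input', 'textarea', 'select') and 'name' in attrs:
            self.fields.add(attrs['name'])


inspector = FormInspector()
inspector.feed(SAMPLE_HTML)

# Resolve the relative action against the URL the form was downloaded from.
post_url = urljoin('https://httpbin.org/forms/post', inspector.action)
print(post_url)
print(sorted(inspector.fields))
```

Resolving the action with urljoin avoids hardcoding the posting URL, which helps if the form's action is relative, as assumed here.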
How it works...
The requests module directly encodes and sends data in the configured format. By default, it sends POST data in the application/x-www-form-urlencoded format.
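To see what that encoding looks like, the body can be approximated with the standard library. This is a sketch, assuming that urlencode with doseq=True produces the same result as requests does for this data; it uses the dictionary from the recipe.

```python
from urllib.parse import urlencode

data = {
    'custname': "Sean O'Connell",
    'custtel': '123-456-789',
    'custemail': 'sean@oconnell.ie',
    'size': 'small',
    'topping': ['bacon', 'onion'],
    'delivery': '20:30',
    'comments': '',
}

# doseq=True expands the list under 'topping' into one key=value pair per item
body = urlencode(data, doseq=True)
print(body)
print(len(body))
```

Each field becomes a key=value pair joined by &, with special characters percent-encoded, and the multiselection topping appears once per chosen value. The printed length should be 140, which agrees with the Content-Length header in the response shown in the recipe.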
Compare the action of requests with the Accessing web APIs recipe, where the data is explicitly sent in JSON format using the argument json. This means that the Content-Type is application/json instead of application/x-www-form-urlencoded.
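The difference can be inspected without sending anything by preparing the request first. This is a small sketch that uses a shortened version of the data dictionary for brevity.

```python
import requests

url = 'https://httpbin.org/post'
data = {'custname': "Sean O'Connell", 'size': 'small'}

# prepare() builds the request, including headers and body, without sending it
as_form = requests.Request('POST', url, data=data).prepare()
as_json = requests.Request('POST', url, json=data).prepare()

print(as_form.headers['Content-Type'])
print(as_json.headers['Content-Type'])

# The bodies differ too: key=value pairs versus a JSON document
print(as_form.body)
print(as_json.body)
```

Which argument to use depends on what the server expects: an HTML form like the one in this recipe expects the data argument.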
The key aspect here is to respect the format of the form and the values it accepts. Incorrect values can produce an error, typically a 400 error, indicating a problem with the client.
There's more...
Other than following the format of forms and inputting valid values, the main difficulty when working with forms is the range of measures that sites use to prevent spam and abusive behavior.
Sites will often require that you have downloaded a form before submitting it, as a protection against duplicate submissions or Cross-Site Request Forgery (CSRF).
CSRF means producing a malicious call from one page to a different one, taking advantage of the fact that your browser is authenticated. It is a serious problem. For example, you might think you are visiting a site about adorable puppies that, in fact, takes advantage of you being logged in to your bank's page to perform financial operations on your behalf, such as transferring your savings to a distant account. Here is a good description of CSRF: https://stackoverflow.com/a/33829607. New techniques in browsers help with these CSRF issues by default.
A common protection is a token embedded in the form that must be sent back with the submission. To obtain the specific token, you need to first download the form, as shown in the recipe, read the value of the CSRF token from it, and include that value when you submit. Note that the token can have different names; this is just an example:
>>> form.find(attrs={'name': 'token'}).get('value')
'ABCEDF12345'
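The full cycle of reading the token and sending it back can be sketched as follows. This is a minimal illustration using only the standard library on a hypothetical form page; the field name token, its value, and the HiddenFields class are examples, and a real site will use its own names.

```python
from html.parser import HTMLParser

# Hypothetical form page; the field name 'token' and its value are only examples.
SAMPLE_HTML = """
<form method="post" action="/post">
  <input type="hidden" name="token" value="ABCEDF12345">
  <input name="custname">
</form>
"""


class HiddenFields(HTMLParser):
    """Collect the name/value pairs of hidden inputs."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('type') == 'hidden' and 'name' in attrs:
            self.hidden[attrs['name']] = attrs.get('value', '')


parser = HiddenFields()
parser.feed(SAMPLE_HTML)

# Merge the hidden values into the payload so the token is sent back unchanged
payload = {'custname': "Sean O'Connell", **parser.hidden}
print(payload)
```

In a real script, the form would typically be downloaded and submitted through the same requests.Session, since sites often pair the token with a cookie that has to accompany the POST.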
See also
- The Downloading web pages recipe, earlier in this chapter, to learn the basics of requesting web pages.
- The Parsing HTML recipe, earlier in this chapter, to follow up on structuring the returned information from the server.