Title: Python Automation Cookbook
Author: Jaime Buelta
Chapter word count: 163
Last updated: 2024-12-21 01:38:32

3
Building Your First Web Scraping Application

The internet, and the World Wide Web (WWW) in particular, is probably the most prominent source of information today. Most of that information is retrievable through HTTP, which was originally invented to share pages of hypertext (hence the name HyperText Transfer Protocol) and which gave rise to the WWW.

In HTTP, a client sends a request to a server and the server returns a response. This exchange happens each time we request a web page, so it should be familiar to almost everyone. But we can also perform these operations programmatically to retrieve and process information automatically. Python's standard library includes an HTTP client, but the fantastic requests module makes obtaining web pages much easier. In this chapter, we will see how.
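As a rough illustration of what retrieving a page programmatically looks like, here is a minimal sketch that uses only the standard library. So that it runs without network access, it serves a tiny page from a local server and then fetches it; the names PAGE, DemoHandler, fetch, and run_demo are hypothetical helpers invented for this illustration and are not part of the chapter's recipes.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener

# Hypothetical page content, used only for this illustration.
PAGE = b"<html><body><h1>Hello, scraper!</h1></body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with one fixed HTML page."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch(url):
    """Send a GET request and return (status code, content type, body text)."""
    # An empty ProxyHandler ignores proxy settings from the environment,
    # so the local demo server is reached directly.
    opener = build_opener(ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        body = response.read().decode("utf-8")
        return response.status, response.headers.get_content_type(), body


def run_demo():
    """Start the local server, fetch its page once, then shut it down."""
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return fetch(f"http://127.0.0.1:{server.server_port}/")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, content_type, body = run_demo()
    print(status, content_type)
    print(body)
```

The fetch function is the part that matters: it sends a request and reads the status code, the content type, and the body from the response. With the requests module installed, the same step would typically be written as requests.get(url), reading status_code and text from the returned object.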

In this chapter, we'll cover the following recipes:

Downloading web pages
Parsing HTML
Crawling the web
Subscribing to feeds
Accessing web APIs
Interacting with forms
Using Selenium for advanced interaction
Accessing password-protected pages
Speeding up web scraping

Let's start with the basics of how to programmatically obtain an existing web page.
