Parsing HTML
Downloading raw text or a binary file is a good starting point, but the main language of the web is HTML.
HTML is a structured language, defining different parts of a document such as headings and paragraphs. It is also hierarchical, with elements containing sub-elements. The ability to parse raw text into a structured document is basically the ability to extract information automatically from a web page. For example, some text can be relevant if it's enclosed in certain HTML elements, such as a div of a particular class, or if it appears after a heading h3 tag.
Getting ready
We'll use the excellent Beautiful Soup module to parse HTML text into a memory object that can be analyzed. We need to use the latest version of the beautifulsoup4 package to be compatible with Python 3. Add the package to your requirements.txt file and install the dependencies in the virtual environment:
$ echo "beautifulsoup4==4.8.2" >> requirements.txt
$ pip install -r requirements.txt
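As a quick sanity check (not part of the original steps), you can confirm which version got installed; the bs4 module exposes it as __version__:
$ python -c "import bs4; print(bs4.__version__)"
4.8.2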
How to do it...
1. Import BeautifulSoup and requests:
>>> import requests
>>> from bs4 import BeautifulSoup
2. Set up the URL of the page to download and retrieve it:
>>> URL = 'http://www.columbia.edu/~fdc/sample.html'
>>> response = requests.get(URL)
>>> response
<Response [200]>
3. Parse the downloaded page:
>>> page = BeautifulSoup(response.text, 'html.parser')
4. Obtain the title of the page. See that it is the same as what's displayed in the browser:
>>> page.title
<title>Sample Web Page</title>
>>> page.title.string
'Sample Web Page'
5. Find all the h3 elements in the page, to determine the existing sections:
>>> page.find_all('h3')
[<h3><a name="contents">CONTENTS</a></h3>, <h3><a name="basics">1. Creating a Web Page</a></h3>, <h3><a name="syntax">2. HTML Syntax</a></h3>, <h3><a name="chars">3. Special Characters</a></h3>, <h3><a name="convert">4. Converting Plain Text to HTML</a></h3>, <h3><a name="effects">5. Effects</a></h3>, <h3><a name="lists">6. Lists</a></h3>, <h3><a name="links">7. Links</a></h3>, <h3><a name="tables">8. Tables</a></h3>, <h3><a name="install">9. Installing Your Web Page on the Internet</a></h3>, <h3><a name="more">10. Where to go from here</a></h3>]
6. Extract the text of the Special Characters section. Stop when you reach the next h3 tag:
>>> link_section = page.find('h3', attrs={'id': 'chars'})
>>> section = []
>>> for element in link_section.next_elements:
...     if element.name == 'h3':
...         break
...     section.append(element.string or '')
...
>>> result = ''.join(section)
>>> result
'3. Special Characters\n\nHTML special "character entities" start with ampersand (&&) and\nend with semicolon (;;), like "€€" = "€". The\never-popular "no-break space" is . There are special\nentity names for accented Latin letters and other West European special\ncharacters such as:\n\n\n\n\n\nää\na-umlaut\n\xa0ä\xa0\n\n\nÄÄ\nA-umlaut \n\xa0Ä\xa0\n\n\náá\na-acute \n\xa0á\xa0\n\n\nàà\na-grave \n\xa0à\xa0\n\n\nññ\nn-tilde \n\xa0ñ\xa0\n\n\nßß\nGerman double-s\n\xa0ß\xa0\n\n\nþþ\nIcelandic thorn \n\xa0þ\xa0\n\xa0þ\xa0\n\n\n\n\nExamples:\n\n\nFor SpanishSpanish you would need:\nÁÁ (Á),\náá (á),\nÉÉ (É),\néé (é),\nÍÍ (Í),\níí (í),\nÓÓ (Ó),\nóó (ó),\nÚÚ (ú),\núú (ú),\nÑÑ (Ñ),\nññ (ñ);\n¿¿ (¿);\n¡¡ (¡).\nExample: Añorarán = AñoraránAñorarán.\n\n\nFor GermanGerman you would need:\nÄÄ (Ä),\nää (ä),\nÖÖ (Ö),\nöö (ö),\nÜÜ (ü),\nüü (ü),\nßß (ß).\nExample: Grüße aus Köln = Grüße aus KölnGrüße aus Köln.\n\n\n\nCLICK HERECLICK HERE\nfor a complete list. When the page encoding is\nUTF-8UTF-8, which is\nrecommended, you can also enter any character at all, Roman,\nCyrillic, Arabic, Hebrew, Greek. Japanese,\netc, either as numeric entities or (if you have a way to type them) directly\nfrom the keyboard.\n\n\n\nAnd remember: if you want to\ninclude <<, &&,\nor >> literally in text to be displayed, you have\nto write <<,\n&&, >>, respectively.\n\n\n\n\n'
Notice that all the raw text is displayed, without the enclosing HTML tags. Notice as well that some fragments appear duplicated: .next_elements visits both an inline tag and the string inside it, and element.string returns the same text for each, so text wrapped in inline tags gets appended twice.
How it works...
The first step is to download the page. Then, the raw text can be parsed, as in step 3. The resulting page object contains the parsed information.
The html.parser parser is the default one, but for certain operations it can have problems. For example, for big pages it can be slow, and it can have issues rendering highly dynamic web pages. You can use other parsers, such as lxml, which is much faster, or html5lib, which will be closer to how a browser operates. They are external modules that will need to be added to the requirements.txt file.
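Swapping parsers only requires changing the second argument to BeautifulSoup. For example, assuming lxml has been added to requirements.txt and installed:
>>> page = BeautifulSoup(response.text, 'lxml')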
BeautifulSoup allows us to search for HTML elements. It can search for the first occurrence of an HTML element with .find() or return a list with .find_all(). In step 6, it searched for an h3 tag that had a particular attribute, id set to chars. After that, it kept iterating through .next_elements until it found the next h3 tag, which marks the end of the section.
The text of each element is extracted and finally composed into a single string. Note the or '' that avoids storing None, which is returned when an element has no text.
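A minimal, self-contained sketch of the same pattern, using a small invented HTML snippet instead of the downloaded page, shows both behaviors: the <p> tag with mixed content returns None for .string (stored as ''), and the text inside the inline <i> tag is appended twice:
>>> snippet = '<h3 id="a">One</h3><p>some <i>text</i></p><h3 id="b">Two</h3>'
>>> mini = BeautifulSoup(snippet, 'html.parser')
>>> start = mini.find('h3', attrs={'id': 'a'})
>>> parts = []
>>> for element in start.next_elements:
...     if element.name == 'h3':
...         break
...     parts.append(element.string or '')
...
>>> ''.join(parts)
'Onesome texttext'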
HTML is highly versatile and can have multiple structures. The case presented in this recipe is typical, but other ways of dividing sections are possible, such as grouping related sections inside a big <div> tag or other elements, or even using raw text. Some experimentation will be required until you find the specific process to extract the juicy bits of a web page. Don't be afraid to try!
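For instance, here is a minimal sketch of extracting text from a hypothetical layout that groups a section inside a <div> (the tag and class names here are invented for illustration):
>>> html = '<div class="section"><h3>Title</h3><p>Some body text.</p></div>'
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find('div', attrs={'class': 'section'}).get_text()
'TitleSome body text.'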
There's more...
Regexes can be used as input in the .find() and .find_all() methods. For example, this search uses the h2 and h3 tags (note that re needs to be imported first):
>>> import re
>>> page.find_all(re.compile('^h(2|3)'))
[<h2>Sample Web Page</h2>, <h3 id="contents">CONTENTS</h3>, <h3 id="basics">1. Creating a Web Page</h3>, <h3 id="syntax">2. HTML Syntax</h3>, <h3 id="chars">3. Special Characters</h3>, <h3 id="convert">4. Converting Plain Text to HTML</h3>, <h3 id="effects">5. Effects</h3>, <h3 id="lists">6. Lists</h3>, <h3 id="links">7. Links</h3>, <h3 id="tables">8. Tables</h3>, <h3 id="viewing">9. Viewing Your Web Page</h3>, <h3 id="install">10. Installing Your Web Page on the Internet</h3>, <h3 id="more">11. Where to go from here</h3>]
Another useful find parameter is class_, which allows searching by CSS class. This will be shown later in the book.
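As a quick preview (using an invented HTML snippet, since this parameter isn't covered until later):
>>> demo = BeautifulSoup('<p class="note">hi</p><p>bye</p>', 'html.parser')
>>> demo.find_all('p', class_='note')
[<p class="note">hi</p>]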
The full Beautiful Soup documentation can be found here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/.
See also
- The Installing third-party packages recipe in Chapter 1, Let's Begin Our Automation Journey, to learn about installing external modules.
- The Downloading web pages recipe, earlier in this chapter, to learn the basics of requesting web pages.