Parsing HTML
Downloading raw text or a binary file is a good starting point, but the main language of the web is HTML.
HTML is a structured language, defining different parts of a document such as headings and paragraphs. It is also hierarchical, with elements containing sub-elements. The ability to parse raw text into a structured document is basically the ability to extract information automatically from a web page. For example, some text can be relevant if it's enclosed in certain HTML elements, such as a div of a particular class, or if it appears after a heading h3 tag.
Getting ready
We'll use the excellent Beautiful Soup module to parse HTML text into a memory object that can be analyzed. We need to use the latest version of the beautifulsoup4 package to be compatible with Python 3. Add the package to your requirements.txt file and install the dependencies in the virtual environment:
$ echo "beautifulsoup4==4.8.2" >> requirements.txt
$ pip install -r requirements.txt
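As a quick sanity check (not part of the original steps), you can confirm which version got installed; the bs4 module exposes it as __version__:
$ python -c "import bs4; print(bs4.__version__)"
4.8.2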
How to do it...
1. Import BeautifulSoup and requests:
>>> import requests
>>> from bs4 import BeautifulSoup
2. Set up the URL of the page to download and retrieve it:
>>> URL = 'http://www.columbia.edu/~fdc/sample.html'
>>> response = requests.get(URL)
>>> response
<Response [200]>
3. Parse the downloaded page:
>>> page = BeautifulSoup(response.text, 'html.parser')
4. Obtain the title of the page. See that it is the same as what's displayed in the browser:
>>> page.title
<title>Sample Web Page</title>
>>> page.title.string
'Sample Web Page'
5. Find all the h3 elements in the page, to determine the existing sections:
>>> page.find_all('h3')
[<h3><a name="contents">CONTENTS</a></h3>, <h3><a name="basics">1. Creating a Web Page</a></h3>, <h3><a name="syntax">2. HTML Syntax</a></h3>, <h3><a name="chars">3. Special Characters</a></h3>, <h3><a name="convert">4. Converting Plain Text to HTML</a></h3>, <h3><a name="effects">5. Effects</a></h3>, <h3><a name="lists">6. Lists</a></h3>, <h3><a name="links">7. Links</a></h3>, <h3><a name="tables">8. Tables</a></h3>, <h3><a name="install">9. Installing Your Web Page on the Internet</a></h3>, <h3><a name="more">10. Where to go from here</a></h3>]
6. Extract the text of the Special Characters section. Stop when you reach the next h3 tag:
>>> link_section = page.find('h3', attrs={'id': 'chars'})
>>> section = []
>>> for element in link_section.next_elements:
...     if element.name == 'h3':
...         break
...     section.append(element.string or '')
...
>>> result = ''.join(section)
>>> result
'3. Special Characters\n\nHTML special "character entities" start with ampersand (&&) and\nend with semicolon (;;), like "€€" = "€". The\never-popular "no-break space" is . There are special\nentity names for accented Latin letters and other West European special\ncharacters such as:\n\n\n\n\n\nää\na-umlaut\n\xa0ä\xa0\n\n\nÄÄ\nA-umlaut \n\xa0Ä\xa0\n\n\náá\na-acute \n\xa0á\xa0\n\n\nàà\na-grave \n\xa0à\xa0\n\n\nññ\nn-tilde \n\xa0ñ\xa0\n\n\nßß\nGerman double-s\n\xa0ß\xa0\n\n\nþþ\nIcelandic thorn \n\xa0þ\xa0\n\xa0þ\xa0\n\n\n\n\nExamples:\n\n\nFor SpanishSpanish you would need:\nÁÁ (Á),\náá (á),\nÉÉ (É),\néé (é),\nÍÍ (Í),\níí (í),\nÓÓ (Ó),\nóó (ó),\nÚÚ (ú),\núú (ú),\nÑÑ (Ñ),\nññ (ñ);\n¿¿ (¿);\n¡¡ (¡).\nExample: Añorarán = AñoraránAñorarán.\n\n\nFor GermanGerman you would need:\nÄÄ (Ä),\nää (ä),\nÖÖ (Ö),\nöö (ö),\nÜÜ (ü),\nüü (ü),\nßß (ß).\nExample: Grüße aus Köln = Grüße aus KölnGrüße aus Köln.\n\n\n\nCLICK HERECLICK HERE\nfor a complete list. When the page encoding is\nUTF-8UTF-8, which is\nrecommended, you can also enter any character at all, Roman,\nCyrillic, Arabic, Hebrew, Greek. Japanese,\netc, either as numeric entities or (if you have a way to type them) directly\nfrom the keyboard.\n\n\n\nAnd remember: if you want to\ninclude <<, &&,\nor >> literally in text to be displayed, you have\nto write <<,\n&&, >>, respectively.\n\n\n\n\n'
Notice that all the raw text is displayed, without the enclosing HTML tags. Notice as well that some fragments appear duplicated: .next_elements visits both an inline tag and the string inside it, and element.string returns the same text for each, so text wrapped in inline tags gets appended twice.
How it works...
The first step is to download the page. Then, the raw text can be parsed, as in step 3. The resulting page object contains the parsed information.
The html.parser parser is the default one, but for certain operations it can have problems. For example, for big pages it can be slow, and it can have issues rendering highly dynamic web pages. You can use other parsers, such as lxml, which is much faster, or html5lib, which will be closer to how a browser operates. They are external modules that will need to be added to the requirements.txt file.
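Swapping parsers only requires changing the second argument to BeautifulSoup. For example, assuming lxml has been added to requirements.txt and installed:
>>> page = BeautifulSoup(response.text, 'lxml')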
BeautifulSoup allows us to search for HTML elements. It can search for the first occurrence of an HTML element with .find() or return a list with .find_all(). In step 6, it searched for an h3 tag that had a particular attribute, id set to chars. After that, it kept iterating through .next_elements until it found the next h3 tag, which marks the end of the section.
The text of each element is extracted and finally composed into a single string. Note the or '' that avoids storing None, which is returned when an element has no text.
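A minimal, self-contained sketch of the same pattern, using a small invented HTML snippet instead of the downloaded page, shows both behaviors: the <p> tag with mixed content returns None for .string (stored as ''), and the text inside the inline <i> tag is appended twice:
>>> snippet = '<h3 id="a">One</h3><p>some <i>text</i></p><h3 id="b">Two</h3>'
>>> mini = BeautifulSoup(snippet, 'html.parser')
>>> start = mini.find('h3', attrs={'id': 'a'})
>>> parts = []
>>> for element in start.next_elements:
...     if element.name == 'h3':
...         break
...     parts.append(element.string or '')
...
>>> ''.join(parts)
'Onesome texttext'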
HTML is highly versatile and can have multiple structures. The case presented in this recipe is typical, but other ways of dividing sections are possible, such as grouping related sections inside a big <div> tag or other elements, or even using raw text. Some experimentation will be required until you find the specific process to extract the juicy bits of a web page. Don't be afraid to try!
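For instance, here is a minimal sketch of extracting text from a hypothetical layout that groups a section inside a <div> (the tag and class names here are invented for illustration):
>>> html = '<div class="section"><h3>Title</h3><p>Some body text.</p></div>'
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find('div', attrs={'class': 'section'}).get_text()
'TitleSome body text.'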
There's more...
Regexes can be used as input in the .find() and .find_all() methods. For example, this search uses the h2 and h3 tags (note that re needs to be imported first):
>>> import re
>>> page.find_all(re.compile('^h(2|3)'))
[<h2>Sample Web Page</h2>, <h3 id="contents">CONTENTS</h3>, <h3 id="basics">1. Creating a Web Page</h3>, <h3 id="syntax">2. HTML Syntax</h3>, <h3 id="chars">3. Special Characters</h3>, <h3 id="convert">4. Converting Plain Text to HTML</h3>, <h3 id="effects">5. Effects</h3>, <h3 id="lists">6. Lists</h3>, <h3 id="links">7. Links</h3>, <h3 id="tables">8. Tables</h3>, <h3 id="viewing">9. Viewing Your Web Page</h3>, <h3 id="install">10. Installing Your Web Page on the Internet</h3>, <h3 id="more">11. Where to go from here</h3>]
Another useful find parameter is class_, which allows searching by CSS class. This will be shown later in the book.
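As a quick preview (using an invented HTML snippet, since this parameter isn't covered until later):
>>> demo = BeautifulSoup('<p class="note">hi</p><p>bye</p>', 'html.parser')
>>> demo.find_all('p', class_='note')
[<p class="note">hi</p>]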
The full Beautiful Soup documentation can be found here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/.
See also
- The Installing third-party packages recipe in Chapter 1, Let's Begin Our Automation Journey, to learn about installing external modules.
- The Downloading web pages recipe, earlier in this chapter, to learn the basics of requesting web pages.