Book: Python Web Scraping Cookbook
Author: Michael Heydt
Updated: 2025-02-26 12:46:23

How to do it...

We will look at using urlopen and requests to handle HTML encoded in UTF-8. The two libraries handle this differently, so let's examine each in turn. Let's start by importing urlopen from urllib.request, loading the page, and examining some of the content.

In [8]: from urllib.request import urlopen
   ...: page = urlopen("http://localhost:8080/unicode.html")
   ...: content = page.read()
   ...: content[840:1280]
   ...:
Out[8]: b'><strong>Cyrillic</strong> &nbsp; U+0400 \xe2\x80\x93 U+04FF &nbsp; (1024\xe2\x80\x931279)</p>\n <table class="unicode">\n <tbody>\n <tr valign="top">\n <td width="50">&nbsp;</td>\n <td class="b" width="50">\xd0\x89</td>\n <td class="b" width="50">\xd0\xa9</td>\n <td class="b" width="50">\xd1\x89</td>\n <td class="b" width="50">\xd3\x83</td>\n </tr>\n </tbody>\n </table>\n\n '

Note that urlopen returns raw bytes, so each Cyrillic character appears as an escaped multi-byte UTF-8 sequence, such as \xd0\x89.

To rectify this, we can decode the bytes from UTF-8 into a Python string using the str constructor:

In [9]: str(content, "utf-8")[837:1270]
Out[9]: '<strong>Cyrillic</strong> &nbsp; U+0400 – U+04FF &nbsp; (1024–1279)</p>\n <table class="unicode">\n <tbody>\n <tr valign="top">\n <td width="50">&nbsp;</td>\n <td class="b" width="50">Љ</td>\n <td class="b" width="50">Щ</td>\n <td class="b" width="50">щ</td>\n <td class="b" width="50">Ӄ</td>\n </tr>\n </tbody>\n </table>\n\n '

Note that the output now displays the characters properly.
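The decoding step can also take the codec from the response headers rather than hard-coding "utf-8". The following is a minimal sketch, not the book's method: decode_body is a hypothetical helper, and the Message object stands in for page.headers (which urlopen provides as an email.message.Message subclass) so that the example runs without a server.

```python
from email.message import Message


def decode_body(raw: bytes, headers: Message, default: str = "utf-8") -> str:
    """Decode raw bytes using the charset declared in the headers, if any.

    Hypothetical helper for illustration; falls back to `default` when the
    Content-Type header carries no charset parameter.
    """
    charset = headers.get_content_charset() or default
    # errors="replace" avoids an exception on stray bytes; strict mode is
    # equally valid if you would rather fail loudly.
    return raw.decode(charset, errors="replace")


# Stand-in for page.headers from urlopen (assumed header value).
headers = Message()
headers["Content-Type"] = "text/html; charset=utf-8"

# One table cell from the page, as raw bytes.
raw = b'<td class="b" width="50">\xd0\x89</td>'
text = decode_body(raw, headers)

print(text)
print(len(raw), len(text))
```

The decoded string is one element shorter than the byte string because the two bytes \xd0\x89 collapse into a single character. This is also why the slice offsets used on the bytes and on the decoded string differ.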

We can skip this extra step by using requests, which decodes the response body for us and exposes it through the text property:

In [10]: import requests
    ...: response = requests.get("http://localhost:8080/unicode.html")
    ...: response.text[837:1270]
    ...:
Out[10]: '<strong>Cyrillic</strong> &nbsp; U+0400 – U+04FF &nbsp; (1024–1279)</p>\n <table class="unicode">\n <tbody>\n <tr valign="top">\n <td width="50">&nbsp;</td>\n <td class="b" width="50">Љ</td>\n <td class="b" width="50">Щ</td>\n <td class="b" width="50">щ</td>\n <td class="b" width="50">Ӄ</td>\n </tr>\n </tbody>\n </table>\n\n '
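requests picks the codec for text from the response's Content-Type header, so the result is only correct when the server declares the right charset. As a hedged illustration, the sketch below uses only the standard library to simulate what happens if a client assumes ISO-8859-1 for a UTF-8 body. The scenario is an assumption, not something observed from the server in this recipe. With requests itself, the usual remedy is to set response.encoding = "utf-8" before reading response.text.

```python
# Raw UTF-8 bytes for the four Cyrillic letters shown in the table.
raw = "ЉЩщӃ".encode("utf-8")

# What a client produces if it wrongly assumes ISO-8859-1 (simulated case).
garbled = raw.decode("iso-8859-1")

# Decoding with the correct codec recovers the letters.
correct = raw.decode("utf-8")

# ISO-8859-1 maps every byte to exactly one character, so the mistake is
# reversible: re-encode to get the original bytes, then decode as UTF-8.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(len(raw), len(garbled), len(correct))
print(repaired == correct)
```

Each letter occupies two bytes in UTF-8, so the wrongly decoded string is twice as long as the correct one. That length mismatch is a quick sign that the wrong codec was applied.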