How to do it...

We will look at how urlopen and requests each handle HTML encoded in UTF-8, since the two libraries treat it differently. Let's start by importing urlopen from urllib.request, loading the page, and examining some of the content.

In [8]: from urllib.request import urlopen
...: page = urlopen("http://localhost:8080/unicode.html")
...: content = page.read()
...: content[840:1280]
...:
Out[8]: b'><strong>Cyrillic</strong> &nbsp; U+0400 \xe2\x80\x93 U+04FF &nbsp; (1024\xe2\x80\x931279)</p>\n <table class="unicode">\n <tbody>\n <tr valign="top">\n <td width="50">&nbsp;</td>\n <td class="b" width="50">\xd0\x89</td>\n <td class="b" width="50">\xd0\xa9</td>\n <td class="b" width="50">\xd1\x89</td>\n <td class="b" width="50">\xd3\x83</td>\n </tr>\n </tbody>\n </table>\n\n '
Note that read() returns a bytes object, so the Cyrillic characters appear as raw multi-byte UTF-8 sequences in \x escape notation, such as \xd0\x89.

To rectify this, we can decode the bytes from UTF-8 into a Python string by passing the encoding to the str constructor:

In [9]: str(content, "utf-8")[837:1270]
Out[9]: '<strong>Cyrillic</strong> &nbsp; U+0400 – U+04FF &nbsp; (1024–1279)</p>\n <table class="unicode">\n <tbody>\n <tr valign="top">\n <td width="50">&nbsp;</td>\n <td class="b" width="50">Љ</td>\n <td class="b" width="50">Щ</td>\n <td class="b" width="50">щ</td>\n <td class="b" width="50">Ӄ</td>\n </tr>\n </tbody>\n </table>\n\n '
Note that the output now shows the characters decoded properly.
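The decoding step can also be written with bytes.decode, which is equivalent to passing the encoding to str. The following is a minimal sketch rather than part of the recipe: decode_body is a hypothetical helper name, and the sample bytes are copied from the Out[8] listing. It also shows why slice offsets differ between the raw and decoded content: indexes into bytes count bytes, while indexes into a string count characters.

```python
from typing import Optional


def decode_body(raw: bytes, charset: Optional[str] = None) -> str:
    """Decode response bytes, assuming UTF-8 when no charset is supplied.

    decode_body is a hypothetical helper used only for illustration.
    """
    return raw.decode(charset or "utf-8")


# Bytes taken from the Out[8] listing: an en dash and the Cyrillic letter Lje.
sample = b"U+0400 \xe2\x80\x93 U+04FF \xd0\x89"
text = decode_body(sample)

print(text)
# The en dash takes 3 bytes and the Cyrillic letter 2, but each is 1 character.
print(len(sample), len(text))  # 20 17
```

With urlopen, the charset argument could come from page.headers.get_content_charset(), which returns None when the server does not declare one; in that case the sketch falls back to UTF-8.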

We can skip this extra step by using requests, which decodes the body for us through the text property of the response.

In [10]: import requests
...: response = requests.get("http://localhost:8080/unicode.html")
...: response.text[837:1270]
...:
Out[10]: '<strong>Cyrillic</strong> &nbsp; U+0400 – U+04FF &nbsp; (1024–1279)</p>\n <table class="unicode">\n <tbody>\n <tr valign="top">\n <td width="50">&nbsp;</td>\n <td class="b" width="50">Љ</td>\n <td class="b" width="50">Щ</td>\n <td class="b" width="50">щ</td>\n <td class="b" width="50">Ӄ</td>\n </tr>\n </tbody>\n </table>\n\n '
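One caveat is worth sketching. requests chooses the encoding for text from the Content-Type header, and for text responses that declare no charset it assumes ISO-8859-1, which would garble UTF-8 content. The snippet below is a hedged sketch of one way to guard against that: pick_encoding and fetch_text are hypothetical helper names, and treating ISO-8859-1 as "not declared" is a heuristic that would misfire on a server that really serves that encoding.

```python
from typing import Optional


def pick_encoding(declared: Optional[str], detected: Optional[str]) -> str:
    """Choose an encoding for decoding a response body.

    Hypothetical helper: trusts a declared charset unless it is the
    ISO-8859-1 fallback, then tries the detected one, then UTF-8.
    """
    if declared and declared.lower() != "iso-8859-1":
        return declared
    return detected or "utf-8"


def fetch_text(url: str) -> str:
    """Fetch a page with requests and decode it using pick_encoding."""
    import requests  # third-party; imported here so pick_encoding has no dependency

    response = requests.get(url, timeout=10)
    # response.encoding is what requests inferred from the headers;
    # response.apparent_encoding is a guess based on the body bytes.
    response.encoding = pick_encoding(response.encoding, response.apparent_encoding)
    return response.text
```

Calling fetch_text("http://localhost:8080/unicode.html")[837:1270] should then give the same slice as above, assuming the local server from this recipe is running.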