- Python Automation Cookbook
- Jaime Buelta
- 2021-06-30 14:52:56
Subscribing to feeds
RSS is probably the best-kept secret of the internet. Its time in the spotlight was during the 2000s, but it still enables easy subscription to websites. It is present on lots of websites and is incredibly useful.
At its core, RSS is a way of presenting a succession of ordered references (typically articles, but also other elements such as podcast episodes or YouTube publications) and publishing times. This makes for a very natural way of learning what articles are new since the last check, as well as presenting some structured data about them, such as the title and a summary.
In this recipe, we will present the feedparser module and see how to obtain data from an RSS feed.
RSS is not the only available feed format. There's also a format called Atom, but Atom and RSS are more or less the same. feedparser is also capable of parsing Atom, so both formats can be processed in the same way.
Getting ready
We need to add the feedparser dependency to our requirements.txt file and reinstall it:
$ echo "feedparser==5.2.1" >> requirements.txt
$ pip install -r requirements.txt
Feed URLs can be found on almost all pages that deal with publications, including blogs, news, podcasts, and so on. Sometimes they are very easy to find, but sometimes they are a little bit hidden. Search for feed or RSS.
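When the feed link is hidden, it is often still advertised in the page's HTML head through a link element with rel="alternate" and an RSS or Atom type. The following is a minimal sketch of that lookup using only the standard library. The names FeedLinkFinder and find_feeds are hypothetical helpers introduced here for illustration, and not every site follows this convention.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used to advertise RSS and Atom feeds
FEED_TYPES = {'application/rss+xml', 'application/atom+xml'}


class FeedLinkFinder(HTMLParser):
    """Collect the href of every <link rel="alternate"> that advertises a feed."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != 'link':
            return
        attrs = dict(attrs)
        # Attributes without a value are reported as None, hence the "or ''"
        rel = (attrs.get('rel') or '').lower().split()
        link_type = (attrs.get('type') or '').lower()
        if 'alternate' in rel and link_type in FEED_TYPES and attrs.get('href'):
            # Resolve relative links against the page URL
            self.feeds.append(urljoin(self.base_url, attrs['href']))


def find_feeds(html, base_url):
    """Return the feed URLs advertised in an HTML document (hypothetical helper)."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds
```

The HTML can come from any source, for example requests.get(page_url).text, with page_url passed as base_url.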
Most newspapers and news agencies have their RSS feeds divided by themes. For our example, we'll parse the New York Times main page feed, https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml. There are more feeds available on the main feed page: https://archive.nytimes.com/www.nytimes.com/services/xml/rss/index.html.
Please note that the feeds may be subject to terms and conditions of use. In the case of the New York Times, the terms and conditions are described at the end of the main feed page.
Please note that this feed changes quite often, meaning that the linked entries will be different than the examples in this book.
How to do it...
- Import the feedparser module, as well as datetime, delorean, and requests:
>>> import feedparser
>>> import datetime
>>> import delorean
>>> import requests
- Parse the feed (it will be downloaded automatically) and check when it was last updated. Feed information, like the title of the feed, can be obtained in the feed attribute (channel is an alias for it):
>>> rss = feedparser.parse('http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml')
>>> rss.channel.updated
'Friday, 24 Jan 2020 19:42:27 +0000'
- Get the entries published less than 6 hours before the feed was last updated:
>>> time_limit = delorean.parse(rss.channel.updated) - datetime.timedelta(hours=6)
>>> entries = [entry for entry in rss.entries if delorean.parse(entry.published) > time_limit]
- Some of the entries in the feed are older than 6 hours, so the filtered list is shorter than the full list:
>>> len(entries)
28
>>> len(rss.entries)
54
- Retrieve information about the entries, such as the title. The full entry URL is available as link. Explore the available information in this particular feed:
>>> entries[18]['title']
'These People Really Care About Fonts'
>>> entries[18]['link']
'https://www.nytimes.com/2020/01/24/style/typography-font-design.html?emc=rss&partner=rss'
>>> requests.get(entries[18].link)
<Response [200]>
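The time filter used in the steps above can also be sketched without delorean, relying on the published_parsed field that feedparser attaches to entries (a time.struct_time in UTC). This is only an illustrative alternative, not the recipe's own method. The helper names to_utc and recent_entries are hypothetical, and entries without a parsed date are simply skipped.

```python
import calendar
import datetime


def to_utc(struct_time):
    """Convert a UTC time.struct_time (as found in *_parsed fields) to an aware datetime."""
    timestamp = calendar.timegm(struct_time)
    return datetime.datetime.fromtimestamp(timestamp, tz=datetime.timezone.utc)


def recent_entries(entries, reference, hours=6):
    """Return the entries published less than `hours` before `reference`.

    `entries` is any iterable of dict-like entries; `reference` is an
    aware datetime, for example the time the feed was last updated.
    """
    time_limit = reference - datetime.timedelta(hours=hours)
    return [
        entry for entry in entries
        # Skip entries that carry no parsed publication date
        if entry.get('published_parsed')
        and to_utc(entry['published_parsed']) > time_limit
    ]
```

With a parsed feed, and assuming the feed provides an updated_parsed value, a call could look like recent_entries(rss.entries, to_utc(rss.feed.updated_parsed)).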
How it works...
The parsed feed object contains the information of the entries, as well as general information about the feed itself, such as when it was updated. The feed information can be found in the feed attribute:
>>> rss.feed.title
'NYT > Top Stories'
Each of the entries works as a dictionary, so the fields are easy to retrieve. They can also be accessed as attributes, but treating them as keys allows us to get all the available fields:
>>> entries[5].keys()
dict_keys(['title', 'title_detail', 'links', 'link', 'id', 'guidislink', 'media_content', 'summary', 'summary_detail', 'media_credit', 'credit', 'content', 'authors', 'author', 'author_detail', 'published', 'published_parsed', 'tags'])
The basic strategy when dealing with feeds is to parse them and go through the entries, performing a quick check on whether they are interesting or not, for example, by checking the description or summary. If an entry seems worth it, it can be fully downloaded through the link field. Then, to avoid rechecking entries, store the latest publication date and, next time, only check newer entries.
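The strategy described above can be sketched as follows. This is a minimal illustration under some assumptions: "interesting" is reduced to a keyword match on the title and summary, the last publication date is kept as a UTC timestamp in a small JSON file, and the function names (load_last_seen, save_last_seen, check_entries) are hypothetical.

```python
import calendar
import json
import pathlib


def load_last_seen(path):
    """Return the stored timestamp, or 0 if no state has been saved yet."""
    path = pathlib.Path(path)
    if not path.exists():
        return 0
    return json.loads(path.read_text())['last_seen']


def save_last_seen(path, timestamp):
    """Store the timestamp of the newest entry already checked."""
    pathlib.Path(path).write_text(json.dumps({'last_seen': timestamp}))


def check_entries(entries, last_seen, keywords):
    """Return (interesting, newest) for the entries newer than last_seen.

    `interesting` holds the new entries whose title or summary mentions
    any keyword; `newest` is the timestamp to store for the next run.
    """
    interesting = []
    newest = last_seen
    for entry in entries:
        parsed = entry.get('published_parsed')
        if parsed is None:
            continue  # No usable date, so it cannot be compared
        published = calendar.timegm(parsed)
        if published <= last_seen:
            continue  # Already checked in a previous run
        newest = max(newest, published)
        text = (entry.get('title', '') + ' ' + entry.get('summary', '')).lower()
        if any(keyword.lower() in text for keyword in keywords):
            interesting.append(entry)
    return interesting, newest
```

A run would load the stored timestamp, call check_entries with the entries of the parsed feed, download the link of each interesting entry, and finally save the returned timestamp.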
There's more...
The full feedparser documentation can be found here: https://pythonhosted.org/feedparser/.
The information available can differ from feed to feed. In the New York Times example, there's a tags field with tag information, but this is not standard. As a minimum, entries will usually have a title, a description, and a link.
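Because fields vary between feeds, it can help to read entries defensively. The following sketch treats an entry as a dictionary and falls back to defaults for anything missing. The describe_entry name and the chosen defaults are hypothetical, and it assumes each tag is a dictionary with a term key, as feedparser provides.

```python
def describe_entry(entry):
    """Extract a few common fields from a dict-like entry, tolerating missing ones."""
    # Keep only tags that actually carry a term
    tags = [tag.get('term') for tag in entry.get('tags', []) if tag.get('term')]
    return {
        'title': entry.get('title', '(untitled)'),
        'link': entry.get('link'),
        'summary': entry.get('summary', ''),
        'tags': tags,
    }
```

Calling describe_entry on each element of rss.entries gives records with the same shape regardless of which optional fields the feed includes.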
RSS feeds are also a great way of curating your own selection of news sources. There are great feed readers for that.
See also
- The Installing third-party packages recipe in Chapter 1, Let's Begin Our Automation Journey, to learn the basics of installing external modules.
- The Downloading web pages recipe, earlier in this chapter, to learn more about making requests and obtaining remote pages.