- Python Automation Cookbook
- Jaime Buelta
- 2021-06-30 14:53:02
Crawling and searching directories
In this recipe, we'll learn how to scan a directory recursively to get all the files it contains, including those in its subdirectories. The search can match files of a particular kind, such as text files, or every file.
This is usually the first step when working with files: finding out which ones exist.
Getting ready
Let's start by creating a test directory with some files:
$ mkdir dir
$ touch dir/file1.txt
$ touch dir/file2.txt
$ mkdir dir/subdir
$ touch dir/subdir/file3.txt
$ touch dir/subdir/file4.txt
$ touch dir/subdir/file5.pdf
$ touch dir/file6.pdf
All the files are empty; in this recipe we only need to discover them, not read them. Notice that four files have a .txt extension and two have a .pdf extension.
The files are also available in the GitHub repository here: https://github.com/PacktPublishing/Python-Automation-Cookbook-Second-Edition/tree/master/Chapter04/documents/dir.
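Stay in the directory that contains dir (do not enter it) and start the Python interpreter from there. The paths in the outputs of this recipe start with ./dir, which assumes the walk begins in that parent directory.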
How to do it...
- Print all the filenames in the dir directory and subdirectories:
>>> import os
>>> for root, dirs, files in os.walk('.'):
...     for file in files:
...         print(file)
...
file1.txt
file2.txt
file6.pdf
file3.txt
file4.txt
file5.pdf
- Print the full path of the files, joining with the root:
>>> for root, dirs, files in os.walk('.'):
...     for file in files:
...         full_file_path = os.path.join(root, file)
...         print(full_file_path)
...
./dir/file1.txt
./dir/file2.txt
./dir/file6.pdf
./dir/subdir/file3.txt
./dir/subdir/file4.txt
./dir/subdir/file5.pdf
- Print only the .pdf files:
>>> for root, dirs, files in os.walk('.'):
...     for file in files:
...         if file.endswith('.pdf'):
...             full_file_path = os.path.join(root, file)
...             print(full_file_path)
...
./dir/file6.pdf
./dir/subdir/file5.pdf
- Print only the files whose names contain an odd digit:
>>> import re
>>> for root, dirs, files in os.walk('.'):
...     for file in files:
...         if re.search(r'[13579]', file):
...             full_file_path = os.path.join(root, file)
...             print(full_file_path)
...
./dir/file1.txt
./dir/subdir/file3.txt
./dir/subdir/file5.pdf
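The walk-and-filter pattern from the steps above can be wrapped in a small reusable generator. The following is only a sketch: find_files and its parameters are hypothetical names, not part of the recipe, and the temporary tree is built just so the example runs anywhere.

```python
import os
import tempfile


def find_files(base_dir, suffix=''):
    """Yield the full path of every file under base_dir whose name ends with suffix.

    Hypothetical helper; an empty suffix matches every file.
    """
    for root, dirs, files in os.walk(base_dir):
        for file in files:
            if file.endswith(suffix):
                yield os.path.join(root, file)


# Build a throwaway copy of the recipe's test tree.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, 'dir', 'subdir'))
for name in ('dir/file1.txt', 'dir/file2.txt', 'dir/file6.pdf',
             'dir/subdir/file3.txt', 'dir/subdir/file4.txt',
             'dir/subdir/file5.pdf'):
    open(os.path.join(base, *name.split('/')), 'w').close()

# sorted() because os.walk() does not guarantee any particular file order.
pdf_names = sorted(os.path.basename(path) for path in find_files(base, '.pdf'))
print(pdf_names)  # ['file5.pdf', 'file6.pdf']
```

Using a generator means the paths are produced one at a time, so the caller can stop early without walking the whole tree.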
How it works...
os.walk() goes through a whole directory and all the subdirectories under it. For each directory it visits, it yields a tuple with the directory path, the names of the subdirectories directly under it, and the names of the files it contains:
>>> for root, dirs, files in os.walk('.'):
... print(root, dirs, files)
...
. ['dir'] []
./dir ['subdir'] ['file1.txt', 'file2.txt', 'file6.pdf']
./dir/subdir [] ['file3.txt', 'file4.txt', 'file5.pdf']
The os.path.join() function joins path components, such as the base path and the filename, using the path separator of the operating system.
As paths are returned as plain strings, any kind of string filtering can be applied, as in step 3. In step 4, the full power of regular expressions is used to filter.
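Because the filenames are ordinary strings, the two kinds of filter can be combined in a single regular expression. As a rough illustration (the pattern and variable names are made up for this example), the following keeps only the .txt files whose number is odd:

```python
import re

# The filenames of the recipe's test tree, as os.walk() would report them.
names = ['file1.txt', 'file2.txt', 'file3.txt',
         'file4.txt', 'file5.pdf', 'file6.pdf']

# An odd digit immediately followed by the .txt extension at the end of the name.
ODD_TXT = re.compile(r'[13579]\.txt$')

odd_text_files = [name for name in names if ODD_TXT.search(name)]
print(odd_text_files)  # ['file1.txt', 'file3.txt']
```

Compiling the pattern once is a small convenience when the same expression is applied to many filenames.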
In the next recipe, we'll deal with the content of the files, and not just the filename.
There's more...
In this recipe, the returned files are not opened or modified in any way. This operation is read-only. Files can be opened as described in the following recipes.
Be aware that changing the structure of the directory while walking over it may affect the results. If you need to carry out some file maintenance while walking through the tree, like copying or moving a file, it's a good idea to place the copied or moved files in a different directory, outside the tree being walked.
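One cautious way to do maintenance, sketched below under the assumption that the files of interest are the .pdf ones, is to finish the walk first and only then copy the files to a directory outside the tree. The source and destination directories here are temporary ones created for the example.

```python
import os
import shutil
import tempfile

# Throwaway source tree and a separate destination outside of it.
source = tempfile.mkdtemp()
destination = tempfile.mkdtemp()
os.makedirs(os.path.join(source, 'subdir'))
for name in ('file1.txt',
             os.path.join('subdir', 'file3.txt'),
             os.path.join('subdir', 'file5.pdf')):
    open(os.path.join(source, name), 'w').close()

# First pass: only read the tree and remember what to copy.
to_copy = []
for root, dirs, files in os.walk(source):
    for file in files:
        if file.endswith('.pdf'):
            to_copy.append(os.path.join(root, file))

# Second pass: the walk is finished, so copying cannot affect it.
for path in to_copy:
    shutil.copy(path, destination)

copied = sorted(os.listdir(destination))
print(copied)  # ['file5.pdf']
```

Note that shutil.copy() keeps only the base filename in the destination, so files with the same name in different subdirectories would overwrite each other.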
The os.path module has other interesting functions. We talked about .join(), but other included utilities are:
- os.path.abspath(), which returns the absolute path of a file.
- os.path.split(), which splits the path between directory and file:
>>> os.path.split('/a/very/long/path/file.txt')
('/a/very/long/path', 'file.txt')
- os.path.exists(), to return whether a file exists or not on the filesystem.
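The three utilities can be tried together in a short script. This is just an illustrative sketch: the paths are invented, and os.path.isabs() is used only to check the result of abspath() without depending on the current working directory.

```python
import os
import tempfile

# split() separates the directory part from the final component.
directory, filename = os.path.split(os.path.join('dir', 'subdir', 'file5.pdf'))
print(filename)  # file5.pdf

# abspath() resolves a relative path against the current working directory.
absolute = os.path.abspath('file1.txt')
print(os.path.isabs(absolute))  # True

# exists() reports whether the path is present on the filesystem right now.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'file1.txt')
    before = os.path.exists(path)
    open(path, 'w').close()
    after = os.path.exists(path)
print(before, after)  # False True
```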
The full documentation about os.path can be found here: https://docs.python.org/3/library/os.path.html. Another module, pathlib, can be used for higher-level access, in an object-oriented way: https://docs.python.org/3/library/pathlib.html.
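For comparison, a recursive search for .pdf files might look like this with pathlib. It is a minimal sketch, not the recipe's method: it uses Path.rglob() instead of os.walk(), and the temporary tree exists only to make the example runnable.

```python
import tempfile
from pathlib import Path

# Build a small throwaway tree similar to the recipe's test directory.
base = Path(tempfile.mkdtemp())
(base / 'dir' / 'subdir').mkdir(parents=True)
for name in ('dir/file1.txt', 'dir/file6.pdf', 'dir/subdir/file5.pdf'):
    (base / name).touch()

# rglob() searches base and all its subdirectories for names matching the pattern.
pdf_files = sorted(path.relative_to(base).as_posix()
                   for path in base.rglob('*.pdf'))
print(pdf_files)  # ['dir/file6.pdf', 'dir/subdir/file5.pdf']
```

Each result is a Path object, so attributes such as .name, .suffix, and .parent are available without further string manipulation.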
As demonstrated in step 4, multiple ways of filtering can be used. All of the string manipulations and tips shown in Chapter 1, Let's Begin Our Automation Journey, are available.
See also
- The Introducing regular expressions recipe in Chapter 1, Let's Begin Our Automation Journey, to learn how to filter using regular expressions.
- The Reading text files recipe, later in this chapter, to open the found files and read their content.