Book title: Python Automation Cookbook
Author: Jaime Buelta
Last updated: 2024-12-21 01:38:41

Crawling and searching directories

In this recipe, we'll learn how to scan a directory recursively to get all the files it contains, including the files in its subdirectories. We can match files of a particular kind, like text files, or every single one of them.

This is normally a starting operation when dealing with files, to detect all the existing ones.

Getting ready

Let's start by creating a test directory with a few files:

$ mkdir dir
$ touch dir/file1.txt
$ touch dir/file2.txt
$ mkdir dir/subdir
$ touch dir/subdir/file3.txt
$ touch dir/subdir/file4.txt
$ touch dir/subdir/file5.pdf
$ touch dir/file6.pdf

All the files will be empty; we will use them in this recipe only to discover them. Notice there are four files that have a .txt extension, and two that have a .pdf extension.

The files are also available in the GitHub repository here: https://github.com/PacktPublishing/Python-Automation-Cookbook-Second-Edition/tree/master/Chapter04/documents/dir.

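Do not enter the created dir directory. Start the Python interpreter from its parent directory, so that the paths printed in this recipe begin with ./dir.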

How to do it...

Print all the filenames in the dir directory and subdirectories:

>>> import os
>>> for root, dirs, files in os.walk('.'):
...     for file in files:
...         print(file)
...
file1.txt
file2.txt
file6.pdf
file3.txt
file4.txt
file5.pdf

Print the full path of the files, joining with the root:

>>> for root, dirs, files in os.walk('.'):
...     for file in files:
...         full_file_path = os.path.join(root, file)
...         print(full_file_path)
...
./dir/file1.txt
./dir/file2.txt
./dir/file6.pdf
./dir/subdir/file3.txt
./dir/subdir/file4.txt
./dir/subdir/file5.pdf

Print only the .pdf files:

>>> for root, dirs, files in os.walk('.'):
...     for file in files:
...         if file.endswith('.pdf'):
...             full_file_path = os.path.join(root, file)
...             print(full_file_path)
...
./dir/file6.pdf
./dir/subdir/file5.pdf

Print only files whose name contains an odd digit:

>>> import re
>>> for root, dirs, files in os.walk('.'):
...     for file in files:
...         if re.search(r'[13579]', file):
...             full_file_path = os.path.join(root, file)
...             print(full_file_path)
...
./dir/file1.txt
./dir/subdir/file3.txt
./dir/subdir/file5.pdf

How it works...

os.walk() goes through a whole directory and all the subdirectories under it. For each directory it visits, it yields a tuple with the path of that directory, the names of the subdirectories directly under it, and the names of the files it contains:

>>> for root, dirs, files in os.walk('.'):
...     print(root, dirs, files)
...
. ['dir'] []
./dir ['subdir'] ['file1.txt', 'file2.txt', 'file6.pdf']
./dir/subdir [] ['file3.txt', 'file4.txt', 'file5.pdf']

The os.path.join() function joins two or more path components, such as the base path and the filename, using the separator that is correct for the operating system.
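As a small illustration of os.path.join(), the sketch below joins a root and a filename, and shows that more than two components can be joined in one call. The directory and file names are only examples that mirror the test tree of this recipe; nothing is read from disk.

```python
import os

# Build the kind of value that os.walk() returns in `root`.
root = os.path.join('.', 'dir', 'subdir')

# Join the root with a filename, as done in the recipe.
full_file_path = os.path.join(root, 'file3.txt')

# os.path.join() uses the separator of the current platform (os.sep),
# so the result equals the components joined by os.sep.
expected = os.sep.join(['.', 'dir', 'subdir', 'file3.txt'])

print(full_file_path)
print(full_file_path == expected)
```

Using os.path.join() instead of concatenating strings with '/' keeps the code portable between operating systems.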

As paths are returned as plain strings, any kind of string filtering can be applied to them, such as the .endswith('.pdf') check in the third step. The fourth step shows that the full power of regular expressions is also available for filtering.
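The filters above all follow the same pattern, so they could be wrapped in a small helper. The following is only a sketch: find_files() is a hypothetical name, not part of the recipe or of the standard library, and the test tree is recreated in a temporary directory so that the example is self-contained.

```python
import os
import re
import tempfile


def find_files(base, predicate):
    """Return the sorted full paths under base whose filename passes predicate.

    Hypothetical helper for illustration; predicate receives the bare filename.
    """
    matches = []
    for root, dirs, files in os.walk(base):
        for file in files:
            if predicate(file):
                matches.append(os.path.join(root, file))
    return sorted(matches)


# Recreate the test tree of this recipe inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, 'dir')
    os.makedirs(os.path.join(base, 'subdir'))
    for name in ('file1.txt', 'file2.txt', 'file6.pdf',
                 os.path.join('subdir', 'file3.txt'),
                 os.path.join('subdir', 'file4.txt'),
                 os.path.join('subdir', 'file5.pdf')):
        open(os.path.join(base, name), 'w').close()

    # Plain string filter: only .pdf files.
    pdf_files = find_files(base, lambda f: f.endswith('.pdf'))
    # Regular expression filter: filenames containing an odd digit.
    odd_files = find_files(base, lambda f: re.search(r'[13579]', f))

    # Make the paths relative to the temporary directory for display.
    pdf_relative = [os.path.relpath(p, tmp) for p in pdf_files]
    odd_relative = [os.path.relpath(p, tmp) for p in odd_files]
    print(pdf_relative)
    print(odd_relative)
```

Passing the filter as a function keeps the directory traversal in one place, while each search only has to describe which filenames it wants.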

In the next recipe, we'll deal with the content of the files, and not just the filename.

There's more...

In this recipe, the returned files are not opened or modified in any way. This operation is read-only. Files can be opened as described in the following recipes.

Be aware that changing the structure of the directory while walking over it may affect the results. If you need to carry out some file maintenance while walking through the tree, like copying or moving files, it's a good idea to place the copied or moved files in a different directory, outside the tree being walked.
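One possible way to follow this advice is to finish the walk first, recording the paths found, and only then change the filesystem. The sketch below copies the .txt files to a backup directory located outside the walked tree. The backup directory name and the reduced test tree are assumptions made for the example.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, 'dir')
    # Hypothetical destination, deliberately outside of `source`.
    backup = os.path.join(tmp, 'backup')
    os.makedirs(os.path.join(source, 'subdir'))
    os.makedirs(backup)
    for name in ('file1.txt', 'file6.pdf',
                 os.path.join('subdir', 'file3.txt')):
        open(os.path.join(source, name), 'w').close()

    # 1. Walk the tree and only record what was found.
    to_copy = []
    for root, dirs, files in os.walk(source):
        for file in files:
            if file.endswith('.txt'):
                to_copy.append(os.path.join(root, file))

    # 2. Change the filesystem only after the walk has finished.
    for path in to_copy:
        shutil.copy(path, backup)

    copied = sorted(os.listdir(backup))
    print(copied)
```

Because the destination is not under the directory passed to os.walk(), the copies can never be picked up by the walk itself.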

The os.path module has other interesting functions. We talked about .join(), but other included utilities are:

os.path.abspath(), which returns the absolute path of a file.

os.path.split(), which splits the path between directory and file:

>>> os.path.split('/a/very/long/path/file.txt')
('/a/very/long/path', 'file.txt')

os.path.exists(), which returns whether a path (a file or a directory) exists on the filesystem.
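The three utilities listed above can be tried together in a short sketch. It works in a temporary directory so that the results do not depend on the files already present; the file and directory names are only examples.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'file1.txt')

    # The file has not been created yet.
    exists_before = os.path.exists(path)
    open(path, 'w').close()
    exists_after = os.path.exists(path)

    # Split the path into its directory and filename.
    directory, filename = os.path.split(path)
    # os.path.exists() also works for directories.
    dir_exists = os.path.exists(directory)

# os.path.abspath() resolves a relative path against the current
# working directory; the file does not need to exist.
relative = os.path.join('dir', 'file1.txt')
absolute = os.path.abspath(relative)
is_absolute = os.path.isabs(absolute)

print(exists_before, exists_after, dir_exists)
print(filename)
print(is_absolute)
```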

The full documentation about os.path can be found here: https://docs.python.org/3/library/os.path.html. Another module, pathlib, can be used for higher-level access, in an object-oriented way: https://docs.python.org/3/library/pathlib.html.
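As a rough idea of what the pathlib approach looks like, the following sketch finds the .pdf files recursively with Path.rglob(). It is an alternative to the os.walk() loops of this recipe, not the method used in it, and the reduced test tree is created in a temporary directory only for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / 'dir'
    (base / 'subdir').mkdir(parents=True)
    for name in ('file1.txt', 'file6.pdf',
                 'subdir/file3.txt', 'subdir/file5.pdf'):
        (base / name).touch()

    # rglob() searches the directory and all its subdirectories.
    # Paths are made relative to the temporary directory for display.
    pdf_paths = sorted(
        p.relative_to(tmp).as_posix() for p in base.rglob('*.pdf')
    )
    print(pdf_paths)
```

With pathlib, paths are objects rather than strings, so joining uses the / operator and operations such as relative_to() are methods of the path itself.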

As demonstrated in the last step of How to do it, which filters with a regular expression, multiple ways of filtering can be used. All of the string manipulations and tips shown in Chapter 1, Let's Begin Our Automation Journey, are available.
