Crawling and searching directories

In this recipe, we'll learn how to scan a directory recursively to get all the files it contains, including those in subdirectories. We can match only files of a particular kind, such as text files, or every file.

Discovering which files exist is normally the first step when working with files.

Getting ready

Let's start by creating a test directory with some files:

$ mkdir dir
$ touch dir/file1.txt
$ touch dir/file2.txt
$ mkdir dir/subdir
$ touch dir/subdir/file3.txt
$ touch dir/subdir/file4.txt
$ touch dir/subdir/file5.pdf
$ touch dir/file6.pdf

All the files will be empty; we will use them in this recipe only to discover them. Notice there are four files that have a .txt extension, and two that have a .pdf extension.

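Start the Python interpreter from the directory that contains dir, without entering dir itself. The outputs shown in this recipe start with ./dir and assume that this parent directory holds nothing else.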

How to do it...

  1. Print all the filenames in the dir directory and subdirectories:
    >>> import os
    >>> for root, dirs, files in os.walk('.'):
    ...     for file in files:
    ...         print(file)
    ...
    file1.txt
    file2.txt
    file6.pdf
    file3.txt
    file4.txt
    file5.pdf
    
  2. Print the full path of the files, joining with the root:
    >>> for root, dirs, files in os.walk('.'):
    ...     for file in files:
    ...         full_file_path = os.path.join(root, file)
    ...         print(full_file_path)
    ...
    ./dir/file1.txt
    ./dir/file2.txt
    ./dir/file6.pdf
    ./dir/subdir/file3.txt
    ./dir/subdir/file4.txt
    ./dir/subdir/file5.pdf
    
  3. Print only the .pdf files:
    >>> for root, dirs, files in os.walk('.'):
    ...     for file in files:
    ...         if file.endswith('.pdf'):
    ...             full_file_path = os.path.join(root, file)
    ...             print(full_file_path)
    ...
    ./dir/file6.pdf
    ./dir/subdir/file5.pdf
    
  4. Print only files whose names contain an odd digit:
    >>> import re
    >>> for root, dirs, files in os.walk('.'):
    ...     for file in files:
    ...         if re.search(r'[13579]', file):
    ...             full_file_path = os.path.join(root, file)
    ...             print(full_file_path)
    ...
    ./dir/file1.txt
    ./dir/subdir/file3.txt
    ./dir/subdir/file5.pdf
    

How it works...

os.walk() goes through a whole directory and all the subdirectories under it. For each directory it visits, it yields a tuple with the path of that directory, the names of the subdirectories directly under it, and the names of the files it contains:

>>> for root, dirs, files in os.walk('.'):
...     print(root, dirs, files)
...
. ['dir'] []
./dir ['subdir'] ['file1.txt', 'file2.txt', 'file6.pdf']
./dir/subdir [] ['file3.txt', 'file4.txt', 'file5.pdf']

The os.path.join() function joins path components, such as the base path and the filename, using the separator that is correct for the operating system.

Because the filenames are returned as plain strings, any kind of string filtering can be applied, as in step 3. Step 4 uses regular expressions for more flexible matching.
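The walk-and-filter pattern from steps 2 to 4 can be wrapped in a small helper. The following is only a sketch: find_files and its predicate parameter are names we made up for illustration, and the temporary tree simply mirrors the test directory of this recipe so that the code runs anywhere.

```python
import os
import re
import tempfile


def find_files(base, predicate=lambda name: True):
    """Yield the full path of every file under base whose name passes predicate."""
    for root, dirs, files in os.walk(base):
        for name in files:
            if predicate(name):
                yield os.path.join(root, name)


# Build a throwaway copy of the recipe's test tree.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'dir', 'subdir'))
    for rel in ('file1.txt', 'file2.txt', 'file6.pdf',
                'subdir/file3.txt', 'subdir/file4.txt', 'subdir/file5.pdf'):
        open(os.path.join(tmp, 'dir', *rel.split('/')), 'w').close()

    # Same filter as step 3: PDF files only.
    pdfs = sorted(os.path.relpath(p, tmp)
                  for p in find_files(tmp, lambda n: n.endswith('.pdf')))
    # Same filter as step 4: names containing an odd digit.
    odd = sorted(os.path.relpath(p, tmp)
                 for p in find_files(tmp, lambda n: re.search(r'[13579]', n)))
```

Results are sorted here because os.walk() makes no promise about the order in which files are listed.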

In the next recipe, we'll deal with the content of the files, and not just the filename.

There's more...

In this recipe, the returned files are not opened or modified in any way. This operation is read-only. Files can be opened as described in the following recipes.

Be aware that changing the structure of the directory while walking over it may affect the results. If you need to carry out file maintenance while walking through the tree, such as copying or moving files, it's a good idea to write the results to a different directory, outside the tree being walked.
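One way to stay safe, sketched below, is to finish the walk first and only then touch the filesystem. The backup directory name is hypothetical, and the small temporary tree exists only to make the example self-contained.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, 'dir')
    backup = os.path.join(tmp, 'backup')  # hypothetical target, outside 'dir'
    os.makedirs(os.path.join(source, 'subdir'))
    os.makedirs(backup)
    for rel in ('file1.txt', os.path.join('subdir', 'file3.txt')):
        open(os.path.join(source, rel), 'w').close()

    # 1. Finish the walk first, keeping the matches in a list.
    found = [os.path.join(root, name)
             for root, dirs, files in os.walk(source)
             for name in files
             if name.endswith('.txt')]

    # 2. Only then modify the filesystem, writing outside the walked tree.
    for path in found:
        shutil.copy(path, backup)

    copied = sorted(os.listdir(backup))
```

Note that this sketch flattens the tree: two files with the same name in different subdirectories would overwrite each other in backup.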

The os.path module has other interesting functions. We talked about .join(), but other included utilities are:

  • os.path.abspath(), which returns the absolute path of a file.
  • os.path.split(), which splits the path between directory and file:
    >>> os.path.split('/a/very/long/path/file.txt')
    ('/a/very/long/path', 'file.txt')
    
  • os.path.exists(), to return whether a file exists or not on the filesystem.
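As a quick illustration of these three functions together, the following sketch works on a file created in a temporary directory, so it does not depend on the test directory of this recipe:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'file1.txt')

    exists_before = os.path.exists(path)  # nothing has been created yet
    open(path, 'w').close()
    exists_after = os.path.exists(path)   # the empty file is now on disk

    # split() separates the directory part from the final component.
    directory, filename = os.path.split(path)

    # abspath() resolves a relative path against the current working directory.
    is_absolute = os.path.isabs(os.path.abspath('file1.txt'))
```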

The full documentation about os.path can be found here: https://docs.python.org/3/library/os.path.html. Another module, pathlib, can be used for higher-level access, in an object-oriented way: https://docs.python.org/3/library/pathlib.html.
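As a taste of pathlib, the search for .pdf files from step 3 could be sketched with Path.rglob(), which matches a pattern in a directory and all its subdirectories. The temporary tree below is a reduced copy of the recipe's test directory, built only so the example is self-contained.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / 'dir'
    (base / 'subdir').mkdir(parents=True)
    for rel in ('file1.txt', 'file6.pdf', 'subdir/file5.pdf'):
        (base / rel).touch()

    # rglob() searches base and every subdirectory for matching names.
    pdfs = sorted(p.relative_to(tmp).as_posix() for p in base.rglob('*.pdf'))
```

Each result is a Path object rather than a string, so attributes such as .name and .suffix are available without extra calls to os.path.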

As demonstrated in step 4, multiple ways of filtering can be used. All of the string manipulations and tips shown in Chapter 1, Let's Begin Our Automation Journey, are available.
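For instance, plain string methods are enough to match several extensions regardless of case. The filenames in this sketch are made up for illustration; in practice they would come from os.walk():

```python
import os

# Hypothetical filenames, as they might appear in the files list of os.walk().
names = ['file1.txt', 'file2.TXT', 'file5.pdf', 'notes.md']

# str.endswith() accepts a tuple, so several extensions can be matched at once.
docs = [n for n in names if n.lower().endswith(('.txt', '.pdf'))]

# os.path.splitext() separates the extension from the rest of the name.
stems = [os.path.splitext(n)[0] for n in docs]
```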

See also

  • The Introducing regular expressions recipe in Chapter 1, Let's Begin Our Automation Journey, to learn how to filter using regular expressions.
  • The Reading text files recipe, later in this chapter, to open the found files and read their content.