Files

Many of your programs will need to use persistent storage, i.e. files, to read from and/or write to.

Although we have seen how to read from a file already, we will revisit the topic and introduce the write capability.

Streams

In many of today's programming languages, the idea of a stream is used to describe IO (Input/Output) operations. The analogy is to a stream of water which flows downhill.

A stream is nothing more than an abstraction to the file itself. When you use a stream, you are basically interacting with a file.

Input stream

An input stream of course is used to stream input from a device, whether it be the keyboard or in this discussion a file.

Output stream

Similarly an output stream is used to stream output to a device, again be it the screen or a file.

Open modes

A stream can be opened in one of three basic modes:

  1. read - data is read from the file into the stream and then accessed by the program.

  2. write - program writes data to the stream, which is then flushed (written) to the file.

  3. update - stream can be used for both read and write.

File Handle

A file handle is an object that refers to the file. It is the stream abstraction I mentioned above. You use this in your code to access the file.

For example:

open(file, mode='r', encoding=None)

The open() function creates and returns a file object. If the file cannot be opened, an OSError exception is raised.

The optional mode string specifies the mode in which the file is opened. It defaults to 'r' which means open for reading in text mode.

The other available modes are:

'w' opens the file for writing, creating it if necessary and truncating it if it already exists.

'x' is used for exclusive creation: the call fails with a FileExistsError if the file already exists.

'a' opens the file for appending, creating it if it does not exist.

Adding '+' opens the file for updating (both reading and writing). Modes 'w+' and 'w+b' open and truncate the file, whereas modes 'r+' and 'r+b' open the file with no truncation.

The default mode 'r' is a synonym of 'rt' (open for reading in text mode). To read or write a binary file, append a 'b' to the mode, e.g. 'r+b'.
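As a rough illustration of these modes, the sketch below writes to, appends to, and reads back a scratch file, then shows 'x' refusing to overwrite it. The file name demo.txt and the use of a temporary directory are assumptions made only for this example.

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.txt')   # hypothetical file name

    with open(path, 'w') as fout:          # 'w' creates (or truncates) the file
        fout.write('first line\n')

    with open(path, 'a') as fout:          # 'a' adds to the end of the file
        fout.write('second line\n')

    with open(path) as fin:                # default mode 'r' (same as 'rt')
        contents = fin.read()

    try:
        open(path, 'x')                    # 'x' fails because the file exists
    except FileExistsError:
        exclusive_failed = True
    else:
        exclusive_failed = False

print(contents, end='')
print('exclusive creation failed:', exclusive_failed)
```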

Read from a file

Let's explore some options when it comes to reading from a file.

The examples that follow use a text file named rules.txt.

.read() - Read one character at a time.

You can pass a size argument to the .read() method to read one or more characters at a time; for example, .read(1) reads a single character. When the end of the file is reached, the method returns an empty string.
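A minimal sketch of character-by-character reading follows. Since rules.txt itself is not reproduced here, the code first creates a stand-in file with two invented lines; the helper name read_chars() is also hypothetical.

```python
import os
import tempfile

def read_chars(path):
    """Collect the characters of a text file, reading one at a time."""
    chars = []
    with open(path) as fin:
        while True:
            ch = fin.read(1)   # ask for exactly one character
            if ch == '':       # empty string: end of file reached
                break
            chars.append(ch)
    return chars

# Stand-in for rules.txt; the two lines below are invented sample text.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'rules.txt')
    with open(path, 'w') as fout:
        fout.write('Be kind.\nBe brief.\n')
    chars = read_chars(path)

for ch in chars:
    print(ch, end='')
```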

.read() - Read the whole file at once

Called with no argument, the .read() method reads the entire file at once. The whole file is then held in memory as a single string, so make sure the file is small enough to fit in available memory.
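Reading everything in one call might look like the sketch below. The stand-in file contents and the helper name read_all() are assumptions for illustration.

```python
import os
import tempfile

def read_all(path):
    """Return the full contents of a text file as one string."""
    with open(path) as fin:
        return fin.read()      # no size argument: read to end of file

# Stand-in for rules.txt; the contents are invented sample text.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'rules.txt')
    with open(path, 'w') as fout:
        fout.write('Be kind.\nBe brief.\n')
    text = read_all(path)

print(text, end='')
```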

.readline() - read one line at a time

The .readline() method reads a single line returning an empty string when reaching the end of the file.
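One possible loop built on .readline() is sketched here. The file contents are invented, and read_lines() is a hypothetical helper name.

```python
import os
import tempfile

def read_lines(path):
    """Collect the lines of a text file using readline()."""
    lines = []
    with open(path) as fin:
        line = fin.readline()
        while line != '':          # empty string: end of file reached
            lines.append(line)     # each line keeps its trailing '\n'
            line = fin.readline()
    return lines

# Stand-in for rules.txt; the contents are invented sample text.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'rules.txt')
    with open(path, 'w') as fout:
        fout.write('Be kind.\nBe brief.\n')
    lines = read_lines(path)

for line in lines:
    print(line, end='')
```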

Iterate through the file object

This version is the one we have been using in our previous discussions.

The with statement guarantees that the file will be closed for us when the block ends, even if an error occurs, so that is one less thing to worry about.
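Iterating over the file object directly can be sketched as follows. The stand-in file is again invented, and the closed_after variable is there only to show that the file really was closed.

```python
import os
import tempfile

# Stand-in for rules.txt; the contents are invented sample text.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'rules.txt')
    with open(path, 'w') as fout:
        fout.write('Be kind.\nBe brief.\n')

    seen = []
    with open(path) as fin:        # the file is closed when this block exits
        for line in fin:           # the file object yields one line at a time
            print(line, end='')
            seen.append(line)
    closed_after = fin.closed

print('closed:', closed_after)
```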

newline (\n) treatment

Note that in each version the newline character \n is actually read and counted. This may not be the desired option, so you must take care of it if need be.

Notice how in the print() function used above, the end='' argument was added. Remove it and run the code again to see the effect of the newline read in.
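The trailing newline can be inspected without a file at all. This small sketch uses an invented line of text and removes the newline with .rstrip('\n'), which is one common way to take care of it.

```python
line = 'Be kind.\n'              # a line exactly as read from a file

print(len(line))                 # 9: the newline is counted
stripped = line.rstrip('\n')     # drop the trailing newline only
print(len(stripped))             # 8

print(stripped)                  # print() now adds the only newline
```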

Write to a file

Recall that to write to a file you open it in 'w' mode with the open() function.

To write to the file use the .write() method, providing the string to write as its argument.

Let's take a look.

This example will write 10 numbers to the file nums.txt. Since the argument to .write() must be a string, we use the str() function to convert each number first.
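A sketch of such an example is given below. It assumes the ten numbers are 1 through 10, one per line, and it writes nums.txt into a temporary directory so that nothing is left behind; write_numbers() is a hypothetical helper name.

```python
import os
import tempfile

def write_numbers(path, count=10):
    """Write the numbers 1..count to path, one per line."""
    with open(path, 'w') as fout:
        for n in range(1, count + 1):
            fout.write(str(n) + '\n')   # write() accepts only strings

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'nums.txt')
    write_numbers(path)
    with open(path) as fin:             # read the file back to check it
        written = fin.read()

print(written, end='')
```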

Filenames and Paths

Files are saved on your storage device. Such devices are managed by the OS whether MacOS, Linux or Windows.

Usually files are organized into directories or folders, with one such directory acting as the current directory or working directory.

It is important to understand that Python resolves a relative filename against the current working directory, which is often, but not always, the directory containing your source file (.py or .ipynb).

The os module provides functions for working with files and directories.

Get current working directory

To get the current directory use os.getcwd().
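For instance (the path printed depends on where the code is run):

```python
import os

cwd = os.getcwd()    # the current working directory as a string
print(cwd)
```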

Get absolute path

The path above is an absolute path. An absolute path starts with the root directory, / in my case on a Mac.

A relative path in contrast, is given in terms of the current directory. Thus, if I wanted to specify a relative path to the topics directory I would use:

..

where . corresponds to the current directory and .. to its parent.

Now, to obtain the absolute path of a file, use os.path.abspath().
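A short sketch follows. The file rules.txt does not need to exist for this to work, because os.path.abspath() only combines the name with the current working directory.

```python
import os

absolute = os.path.abspath('rules.txt')   # relative name -> absolute path
print(absolute)

parent = os.path.abspath('..')            # absolute path of the parent directory
print(parent)
```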

Check if file or directory exists

To check if a file or directory exists, use os.path.exists().

Check if path is a directory

To check if a path is a directory, use os.path.isdir(). It returns True only if the path exists and is a directory; a False result means the path is either not a directory or does not exist.

Check if path is a file

To check if a path is a file or not use os.path.isfile().

List directory contents

To list the contents of a directory use os.listdir().
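The four functions above can be tried together on a scratch directory. In this sketch the names notes.txt, drafts and missing.txt are invented for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    file_path = os.path.join(tmp, 'notes.txt')    # hypothetical file
    dir_path = os.path.join(tmp, 'drafts')        # hypothetical directory
    missing = os.path.join(tmp, 'missing.txt')    # never created

    open(file_path, 'w').close()                  # create an empty file
    os.mkdir(dir_path)

    results = {}
    for label, path in [('file', file_path), ('dir', dir_path),
                        ('missing', missing)]:
        results[label] = (os.path.exists(path),
                          os.path.isdir(path),
                          os.path.isfile(path))

    entries = sorted(os.listdir(tmp))             # names only, not full paths

for label, (exists, isdir, isfile) in results.items():
    print(f'{label:8} exists={exists} isdir={isdir} isfile={isfile}')
print(entries)
```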

Let's put some of these methods to use with an example.

The following recursive function will walk through each directory and print out the files it finds.

The os.path.join() function joins the filename to the absolute path of its directory, using the correct separator for the OS, to give us the full absolute path to the file itself.
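One way such a recursive function might be written is sketched below. The name walk() and the decision to return the paths as well as print them are assumptions; the scratch files a.txt and sub/b.txt exist only for the demonstration.

```python
import os
import tempfile

def walk(dirname):
    """Print and return the absolute paths of all files under dirname."""
    found = []
    dirname = os.path.abspath(dirname)
    for name in sorted(os.listdir(dirname)):
        path = os.path.join(dirname, name)   # directory + name -> full path
        if os.path.isfile(path):
            print(path)
            found.append(path)
        elif os.path.isdir(path):
            found.extend(walk(path))         # recurse into the subdirectory
    return found

with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, 'sub'))
    for rel in ('a.txt', os.path.join('sub', 'b.txt')):
        open(os.path.join(tmp, rel), 'w').close()
    found = walk(tmp)
    names = [os.path.basename(p) for p in found]
```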

Note: The os module offers the walk() function and you should take a look at it, as it is more versatile than our version.

Catching Exceptions

When dealing with files there are a lot of things that could go wrong.

FileNotFoundError

Opening a file that does not exist for reading will raise a FileNotFoundError.
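A small sketch of this situation is shown here. The file name no_such_file.txt is invented, and the error is caught only so that the script keeps running and the message can be printed; the try statement itself is discussed further down.

```python
import os
import tempfile

caught = None
with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, 'no_such_file.txt')   # never created
    try:
        fin = open(missing)                 # default mode 'r': file must exist
    except FileNotFoundError as err:
        caught = err
        print('Could not open file:', err)
```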

PermissionError

Attempting to access a file you do not have permission to access will raise a PermissionError.

IsADirectoryError

If you try to open a directory instead of a file, an IsADirectoryError is raised.
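The sketch below passes a directory to open(). On Unix-like systems this raises IsADirectoryError; on some platforms, notably Windows, the same call may raise PermissionError instead, so the example accepts either. That platform difference is an observation about open(), not something stated in these notes.

```python
import tempfile

caught = None
with tempfile.TemporaryDirectory() as tmp:
    try:
        fin = open(tmp)                     # tmp is a directory, not a file
    except (IsADirectoryError, PermissionError) as err:
        caught = err
        print(type(err).__name__, err)
```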

try:except:else statement

Now, you could write a series of if statements to handle such possibilities, but this approach is rarely recommended.

Mixing error handling code via if statements and normal code makes your program hard to read and maintain. A better, more structured approach is to use exception handling using a try statement.

try:
    # statements that may raise exceptions
except some_error as err:
    # handle specific error raised
except:
    # handle all other exceptions
else:
    # statements executed if no exception raised

The basic idea is you try some code that may raise an exception.

Each except some_error as err: clause will list the specific error you want to handle. You may list as many as you want.

The bare except: clause lets you handle (catch) any exception not explicitly handled by the preceding except clauses. It is a catch-all clause, so to speak, and it must come last.

Finally, the else allows you to execute code if no exception was raised.

Let's look at an example.
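As one possible example, the hypothetical function count_lines() below opens a file inside a try statement and counts its lines in the else clause. It uses except OSError as a narrower catch-all than the bare except shown in the template, which is a design choice of this sketch. The scratch file contents are invented.

```python
import os
import tempfile

def count_lines(path):
    """Return the number of lines in path, or None if it cannot be opened."""
    try:
        fin = open(path)
    except FileNotFoundError as err:
        print('File not found:', err)
    except PermissionError as err:
        print('Permission denied:', err)
    except IsADirectoryError as err:
        print('Is a directory:', err)
    except OSError as err:                 # any other I/O related error
        print('Other I/O error:', err)
    else:
        with fin:                          # runs only if open() succeeded
            return sum(1 for _ in fin)
    return None

with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, 'rules.txt')
    with open(good, 'w') as fout:
        fout.write('Be kind.\nBe brief.\n')
    results = [count_lines(good),
               count_lines(os.path.join(tmp, 'missing.txt')),
               count_lines(tmp)]

print(results)
```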

Databases

In addition to using flat files to permanently save your data, you may use a simple database offered by Python.

A database is a file that is organized for storing and retrieving data. It is similar to a dictionary in that a key is mapped to a value, but unlike a dictionary, a database persists on disk after the program ends.

Python provides the dbm module for handling such files, so let's take a look at an example.

I will use the English-to-Spanish map we saw in our dictionary discussion.
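A sketch of what this might look like follows. The three word pairs are assumed, since the original map is not reproduced here, and the database is created in a temporary directory under the invented name eng2sp.

```python
import dbm
import os
import tempfile

# Assumed contents of the English-to-Spanish map.
eng2sp = {'one': 'uno', 'two': 'dos', 'three': 'tres'}

with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, 'eng2sp')

    with dbm.open(db_path, 'c') as db:       # 'c': create if it doesn't exist
        for english, spanish in eng2sp.items():
            db[english] = spanish            # str keys/values are stored as bytes

    with dbm.open(db_path, 'r') as db:       # reopen read-only
        stored = {key: db[key] for key in db.keys()}

print(stored)
```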

Notice how the keys and values are all converted automatically to bytes objects, indicated with the b prefix.

You can still think of them as regular strings for now.

Pickling

A limitation of the dbm module is that the keys and values have to be either strings or bytes. But, what if you want to use another type?

The pickle module can save the day in those circumstances. It will translate almost any type of object into a byte string suitable for storage, and back into an object on retrieval.

dumps() - object to string translation

The pickle.dumps() function (short for "dump string") translates an object into a byte string, i.e. a bytes object, that can be stored in a file or database.
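For example, with an arbitrary list chosen for illustration:

```python
import pickle

t = [1, 2, 3]
s = pickle.dumps(t)     # serialize the list into a bytes object

print(type(s))
print(s)                # raw pickle data, not meant to be read by humans
```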

loads() - string to object translation

Notice how the pickled values are not in a human-readable format. This is where the pickle.loads() function comes in handy: it reconstructs the object from the byte string.
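A round trip might look like this. Note that the reconstructed object has the same value as the original but is a separate object.

```python
import pickle

t1 = [1, 2, 3]
s = pickle.dumps(t1)     # object -> byte string
t2 = pickle.loads(s)     # byte string -> new object

print(t2)
print(t1 == t2)          # True: same value
print(t1 is t2)          # False: a distinct copy
```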

Pipes

Your OS will most likely offer a shell or command window that allows you to interact with it by running commands.

For example, on a Unix-like OS such as macOS, I can get a directory listing by using ls.

To do something similar with Python, i.e. be able to execute such a command from within your Python code, you can use a pipe.

A pipe object represents a running program and behaves much like a regular file.

Let's take a look at how to use it to get a directory listing.
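A sketch using os.popen() is shown below. The helper name list_directory() is hypothetical, and the switch to dir /b on Windows is an assumption added so the sketch is not tied to Unix-like systems. The scratch file hello.txt exists only for the demonstration.

```python
import os
import tempfile

def list_directory(path):
    """Run the OS directory-listing command on path through a pipe."""
    command = 'dir /b' if os.name == 'nt' else 'ls'
    pipe = os.popen(f'{command} "{path}"')   # start the command
    output = pipe.read()                     # read its output like a file
    status = pipe.close()                    # None means it exited normally
    return output, status

with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, 'hello.txt'), 'w').close()
    listing, status = list_directory(tmp)

print(listing, end='')
print('status:', status)
```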

Checksums

Another useful command is md5 (md5sum on most Linux systems), which computes a hash code, or checksum, of a file's contents.

This value, often referred to as a digest, serves as a compact fingerprint of the file. It is not an encrypted copy: the contents cannot be recovered from the digest.

You can then use the digest to detect unwanted alterations to a file.

For instance, you can hash a file and save its digest. When you want to verify the file has not been altered, hash the current file and compare to the original digest. If not the same, then the file was modified.
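The same check can be sketched in pure Python with the standard hashlib module, used here in place of the md5 shell command. The helper name md5_digest() and the scratch file data.bin are invented for illustration.

```python
import hashlib
import os
import tempfile

def md5_digest(path, chunk_size=65536):
    """Return the MD5 digest of a file's contents as a hex string."""
    md5 = hashlib.md5()
    with open(path, 'rb') as fin:            # hash raw bytes, not text
        for chunk in iter(lambda: fin.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data.bin')
    with open(path, 'wb') as fout:
        fout.write(b'hello')
    original = md5_digest(path)              # digest saved for later

    with open(path, 'ab') as fout:           # simulate an unwanted change
        fout.write(b'!')
    altered = md5_digest(path)

print(original)
print('unchanged:', original == altered)
```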

Writing Modules

Each file you have created that contains Python code is of course a module. Now however, we want to see how to import modules you write.

The thing to note here is that if the imported module contains executable statements, then those statements will execute once you import the module.

This is normally not what you want when you import, but rather when you run the module instead. We will see how to handle this distinction.

Here is a module linecount.py that counts how many lines a file has and prints the result.

# File: linecount.py

def lineCount(filename):
    with open(filename) as fin:
        count = 0
        for _ in fin:
            count += 1
    return count

print(lineCount('linecount.py'))

Make sure you create this module first before running the following code.

Now, in order to avoid executing the module when you import it, modify the function call as follows:

if __name__ == '__main__':
    print(lineCount('linecount.py'))

This modification says: If the name of the module is __main__, which will be the case when you execute the module itself, then call the function, else do not.

If you want to use the lineCount() function, import the linecount module and call the function through it.

Since you are importing the module its __name__ will be linecount.
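In an ordinary session, with linecount.py in the current directory, this amounts to import linecount followed by linecount.lineCount('linecount.py'). The sketch below does the same thing in a self-contained way: it writes the guarded version of the module to a temporary directory and puts that directory on sys.path before importing. That setup is an assumption made only so the example can run anywhere.

```python
import importlib
import os
import sys
import tempfile

# The guarded version of linecount.py from the text above.
MODULE_SOURCE = '''\
def lineCount(filename):
    with open(filename) as fin:
        count = 0
        for _ in fin:
            count += 1
    return count

if __name__ == '__main__':
    print(lineCount('linecount.py'))
'''

with tempfile.TemporaryDirectory() as tmp:
    module_path = os.path.join(tmp, 'linecount.py')
    with open(module_path, 'w') as fout:
        fout.write(MODULE_SOURCE)

    sys.path.insert(0, tmp)                  # let import find the module
    try:
        linecount = importlib.import_module('linecount')  # like: import linecount
    finally:
        sys.path.remove(tmp)

    module_name = linecount.__name__         # not '__main__' when imported
    line_total = linecount.lineCount(module_path)

print(module_name)
print(line_total)
```

Nothing is printed at import time, because the guarded call runs only when the module is executed directly.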

The import statement creates a module object, which in our example exposes lineCount, a function, as an attribute.

The first time a module is imported, Python reads and executes its file. The resulting module object is cached, and the name of the module is added to the symbol table (namespace) of the importing module. This table helps Python locate things such as functions, classes, variables, etc.

A note on white space

Since white space (newlines, tabs, spaces) is invisible to the naked eye, it may cause issues that are hard to identify. To help, you can use the repr() function, which returns a string representation in which whitespace characters appear as escape sequences such as \t and \n.
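For example, with an invented string containing a tab and a newline:

```python
s = '1 2\t 3\n 4'

print(s)          # the tab and newline are rendered, so they are hard to spot
print(repr(s))    # the escapes \t and \n are shown explicitly
```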

Practice problems

Create a separate Python source file (.py) in VSC to complete each exercise.


p1: sed.py

Write a function called sed() that takes as arguments a pattern string, a replacement string, and two filenames.

It should read the first file and write the contents into the second file (creating it if necessary). If the pattern string appears anywhere in the file, it should be replaced with the replacement string.

If an error occurs while opening, reading, writing or closing files, your program should catch the exception, print an error message, and exit.


p2: duplicateFile.py

In a large collection of MP3 files, there may be more than one copy of the same song, stored in different directories or with different filenames. The goal of this exercise is to search for duplicates.

  1. Write a program that searches a directory and all of its subdirectories, recursively, and returns a list of complete paths for all files with a given suffix (like .mp3). Hint: os.path provides several useful functions for manipulating file and path names.

  2. To recognize duplicates, you can use md5 to compute a checksum for each file. If two files have the same checksum, they probably have the same contents.
