Reading Files

Prerequisites:

Learning Outcomes:

  • Open files using Python’s built-in functions and extract their contents to variables

  • Use the CSV module to read data from CSV files

Reading Files

One of the common uses of Python in chemistry is to analyse large amounts of data that you might have recorded in a lab and stored in various files (see previous lesson for discussion of file types). Python has numerous built-in capabilities that allow you to read and write these files.

Before you start opening and reading files in Python, make sure:

  • Your file is in the same directory as your code. See previous lesson for explanation on directories.

  • Your file is sensibly named so you can easily find it and not confuse it with another.

  • Your file is in the correct format. For example, make sure values in a .csv file are separated by commas and not spaces or tabs.

Let’s start by opening a simple text file and reading its contents:

molec_file = open("molecule.txt", "r")
contents = molec_file.read()
molec_file.close()
print(contents)

After running the cell above, you should see the contents of the molecule.txt file in the cell output. You can verify the output by checking the file’s contents in a text editor.

Let’s go through step by step.

  • In the first line of code, the file "molecule.txt" is opened using the Python open() function, which here takes two arguments.

    • The first argument is the file name, passed as a string inside quotation marks, "" or ''.

    • The second argument specifies the mode in which the file is opened. The default is "r", read mode, which opens the file as text (the same as "rt"), so everything is read in as a string. Other modes include "w" (write), "a" (append), "x" (create a new file), and "r+" (read and write). You can find the full list on the Python documentation website.

  • Also in the first line, the file object returned by open() is assigned the variable name molec_file. This is then used later on in the code.

  • In the second line, the file, now called molec_file, is read by the built-in method .read(). The full stop indicates that the method belongs to the file object before the dot. The text inside the file is then assigned the variable name contents.

  • In the third line, the file is closed. This is considered good practice. If it is not closed, various issues can start occurring (e.g. file access errors).

  • The final line prints the string that has been placed inside the variable contents.

Reading Files with with

We can also use the with statement to open files. The benefit of this is that it will automatically close the file for us when we are done with it, without using .close(). This is a more “Pythonic” way to handle files and is generally recommended.

Let’s take a look at the same example using the with statement:

with open('molecule.txt', 'r') as molec_file:
    contents = molec_file.read()

print(contents)

As before, we open the molecule.txt file and read its contents. The difference is that we use the with statement to open the file, which automatically closes it when we are done with it (i.e., when we exit the with block).
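
One way to confirm that the with statement really closes the file is to check the file object's closed attribute. The sketch below is only an illustration: it first writes a small stand-in molecule.txt into a temporary folder (writing files is not covered in this lesson, and the temporary folder is an assumption made so the example runs on its own).

```python
import os
import tempfile

# Set-up for illustration only: create a stand-in molecule.txt in a temporary folder.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "molecule.txt")
with open(path, "w") as new_file:
    new_file.write("C10H14O\nC5H8\nC40H56")

with open(path, "r") as molec_file:
    contents = molec_file.read()
    print(molec_file.closed)   # False: still inside the with block

print(molec_file.closed)       # True: the file was closed on leaving the block
print(contents)                # the contents are still available as a string
```

The variable contents survives after the block because it is an ordinary string; only the file object itself has been closed.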

We now have a way to read files in Python, and use their contents as variables in our code.

read() and readline()

By default, the read() method returns the whole text, but you can also specify how many characters you want to return.

molec_file.read(4)

This returns only the first 4 characters of the text.

Another file method is readline(). Each call returns the next line of the file, so the first call returns the first line. For example:

with open('molecule.txt', 'r') as molec_file:
    print(molec_file.readline())

This prints only the first line, which is the molecular formula C10H14O.

If you then wanted the first and the second line, you could add a second readline():

with open('molecule.txt', 'r') as molec_file:
    print(molec_file.readline())
    print(molec_file.readline())

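This prints the first and second molecular formulas. Each line read from the file keeps its trailing new line and print() adds another, so a blank line appears between them.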
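
Both read() and readline() carry on from wherever the previous read stopped. The sketch below illustrates this with a stand-in molecule.txt written to a temporary folder (the set-up lines are an assumption made so the example is self-contained); repr() is used so that the "\n" characters are visible.

```python
import os
import tempfile

# Set-up for illustration only: a stand-in molecule.txt in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "molecule.txt")
with open(path, "w") as new_file:
    new_file.write("C10H14O\nC5H8\nC40H56")

with open(path, "r") as molec_file:
    first_four = molec_file.read(4)       # 'C10H'
    rest_of_line = molec_file.readline()  # carries on from character 5: '14O\n'
    second_line = molec_file.readline()   # 'C5H8\n'

print(repr(first_four))
print(repr(rest_of_line))
print(repr(second_line))
```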


Using for loops to read a file

An easy way to extract information from a file is to use a for loop. The simplest case is to iterate line by line and print each line.

with open("molecule.txt") as molec_file:
    for line in molec_file:
        print(line)

This will print each line of the file (each molecular formula) in turn. We could also take only certain characters by selecting a certain index on the line using square brackets.

with open("molecule.txt") as molec_file:
    for line in molec_file:
        print(line[0])

This prints the first character of each line of the file. Each will be printed on a new line in the output cell.

But how do we make this useful? The most obvious way would be to take these values and add them to a list so we can use it as a variable later in the code. Type out this code and try it:

with open("molecule.txt") as molec_file:
    molec_list = []
    for line in molec_file:
        molec_list.append(line)

print(molec_list)

The output we expect is [C10H14O, C5H8, C40H56], but the output we get is ['C10H14O\n', 'C5H8\n', 'C40H56']. The "\n" indicates a new line, and it appears because in a text file a new line is represented by this character.
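
One possible way to remove the trailing "\n", not otherwise covered in this lesson, is the string method .strip(), which returns a copy of a string without leading or trailing whitespace. A minimal sketch, using a hand-written list in place of the lines read from the file:

```python
# Lines as they come out of the file, written out by hand for illustration.
raw_lines = ["C10H14O\n", "C5H8\n", "C40H56"]

molec_list = []
for line in raw_lines:
    # .strip() returns a copy with leading and trailing whitespace removed
    molec_list.append(line.strip())

print(molec_list)  # ['C10H14O', 'C5H8', 'C40H56']
```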

So how do we split this file in a way that is useful? We need to use delimiters. Revisit the previous lesson for a discussion on these.


Exercise: Print a certain number of lines

For the file “spectrum.dat”, open using a with statement and print the first four lines.

Hint: Use a for loop with the range() function to repeat an action a certain number of times.

Click to view answer
with open("./practice_files/spectrum.dat") as file:
   for i in range(4):
       print(file.readline())

This code prints the first 4 lines of the file. We have used the range() function to only print it 4 times.


Delimiters and .split()

As discussed in the previous lesson, delimiters indicate the separation between items in a file. In a text file containing words, the delimiter would be a space. In a CSV file, the values are separated by a comma, so the delimiter is a comma. We can then extract information between delimiters using the built-in string method .split().

.split() splits the string placed before the dot into a list of items. It takes a string as its argument, which is the delimiter to split on. With no argument, it splits on any whitespace (spaces, tabs, or new lines).

Using our example of the text file containing molecular formulas, we can split on the new line (using the special character “\n”).

with open("./practice_files/molecule.txt") as molec_file:
    contents = molec_file.read()
    contents = contents.split("\n")
print(contents)

Which prints:

['C10H14O', 'C5H8', 'C40H56']

The above code does the following:

  • The with statement opens the file and assigns it the name molec_file. The file is closed automatically at the end of the indented with block.

  • The contents are read.

  • The contents are split. We have specified to split the content along new lines using "\n". In this case it would also work if we left the argument blank, as new lines count as whitespace. However, if there were spaces or tabs within a line, leaving the argument blank would also split on those, so we would have to specify "\n" to split only along new lines.

The output is: ['C10H14O', 'C5H8', 'C40H56'], a list of each item between the delimiter.
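
The difference between .split("\n") and .split() with no argument can be seen in a short sketch. The text below is a made-up stand-in for the contents of a file whose lines contain spaces and which ends with a new line; .splitlines() is a related string method shown here only for comparison.

```python
# Stand-in text for a file whose lines contain spaces and which ends with a new line.
text = "ethanol C2H6O\nwater H2O\n"

by_line = text.split("\n")
by_whitespace = text.split()

print(by_line)            # ['ethanol C2H6O', 'water H2O', '']
print(by_whitespace)      # ['ethanol', 'C2H6O', 'water', 'H2O']
print(text.splitlines())  # ['ethanol C2H6O', 'water H2O']
```

Note the empty string at the end of the first result: a file that ends with a new line produces one extra, empty item when split on "\n".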

In this case, we have extracted and split the entire contents of the file. But what if there were multiple items in each line? Open and have a look at the file ‘alkanes_alkenes_alkynes.csv’. Each column holds the molecular formulae of one homologous series. We could read it using the code above, but each line would come out as a single string with the commas still inside it:

['C2H6,C2H4,C2H2', 'C3H8,C3H6,C3H4', 'C4H10,C4H8,C4H6', 'C5H12,C5H10,C5H8', 'C6H14,C6H12,C6H10', ... ,'C14H30,C14H28,C14H26', 'C15H32,C15H30,C15H28']

Try for yourself.

Splitting a line

Instead of splitting all the contents, we can iterate through line by line and split the line we are on using the delimiter “,”.

with open("./practice_files/alkanes_alkenes_alkynes.csv") as file:
    file = file.read()
    file = file.split("\n")
    for line in file:
        line = line.split(",")
        print(line)

The above code prints each line as a list of its items. However, this isn’t particularly useful. It would be more useful to have a list of alkanes, a list of alkenes, and a list of alkynes. We can do this by appending to empty lists by referencing the position in each line using square brackets.

alkanes = []
alkenes = []
alkynes = []
with open("./practice_files/alkanes_alkenes_alkynes.csv") as file:
    file = file.read()
    file = file.split("\n")
    for line in file:
        line = line.split(",")
        alkanes.append(line[0])
        alkenes.append(line[1])
        alkynes.append(line[2])
print(alkanes)
print(alkenes)
print(alkynes)

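Here, three empty lists are created before the file is opened. Each line is split along the commas, and the first, second, and third items of the line are appended to the alkanes, alkenes, and alkynes lists respectively. After the loop, each list holds one column of the file.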

Be careful - remember that Python is 0-indexed. The first column is index 0, the second column is index 1. In this example, if you call line[3], you will get the index error ‘list index out of range’.
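
The same index error can also appear when a file ends with an empty line, because an empty line splits into a list with only one item. A minimal sketch of one way to guard against this, using a hand-written string in place of the file contents and a hypothetical variable name fields:

```python
# Stand-in for the contents of a three-column file that ends with a new line.
contents = "C2H6,C2H4,C2H2\nC3H8,C3H6,C3H4\n"

alkanes = []
alkenes = []
alkynes = []
for line in contents.split("\n"):
    fields = line.split(",")
    if len(fields) != 3:
        # The final empty line splits to [''], which has no index 1 or 2
        continue
    alkanes.append(fields[0])
    alkenes.append(fields[1])
    alkynes.append(fields[2])

print(alkanes)  # ['C2H6', 'C3H8']
print(alkenes)  # ['C2H4', 'C3H6']
print(alkynes)  # ['C2H2', 'C3H4']
```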


Extracting and converting numerical data

By default, information read from a file is imported as a string. Therefore, if you want to use numbers from your file, you need to convert them to numbers. The easiest way is to use the built-in Python float() function. But be careful! If there are letters or special characters in the string, you will receive an error.

values = []
with open("./practice_files/measurements_1.csv") as file:
    file = file.read()
    file = file.split("\n")
    for line in file:
        values.append(float(line))

print(values)

Each line in the file only contains one number. Therefore, we can convert each line to a float when we append it to our list of values. If we were working with a file that has multiple cells in each row, we would have to convert each individual value we are appending to a float.
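
For a file with several values in each row, the conversion could look something like the sketch below. The numbers are made up for illustration and are held in a string rather than read from a file.

```python
# Stand-in rows, written by hand; real values would come from a file.
contents = "1.02,0.98,1.10\n0.95,1.01,1.07"

rows = []
for line in contents.split("\n"):
    row = []
    for item in line.split(","):
        row.append(float(item))   # convert each cell individually
    rows.append(row)

print(rows)  # [[1.02, 0.98, 1.1], [0.95, 1.01, 1.07]]
```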

Write out this code yourself and check it is working without error.


Exercise: Extract columns of data

The file ‘measurements_2.csv’ contains three columns of background data, corresponding to three different students. Write code that extracts the data from each of the three columns and appends it to three lists. Then write code that finds the average of each set of data.

Hint 1: Once you have split each line, you can call items in a line using square brackets. The first item in a line can be called using line[0], the second using line[1], and so on.

Hint 2: Define a function to find the average to avoid writing out the same code over and over.


Click to view answer
def find_mean(data_list):
   """Find the mean of a list of floats"""
   # Named total rather than sum, to avoid hiding Python's built-in sum() function
   total = 0
   for value in data_list:
       total = total + value
   average = total/len(data_list)
   return average

student_1 = []
student_2 = []
student_3 = []
# Open our file and extract the contents into the lists above
with open("./practice_files/measurements_2.csv") as file:
   contents = file.read()
   contents = contents.split("\n")
   for line in contents:
       line = line.split(",")
       student_1.append(float(line[0]))
       student_2.append(float(line[1]))
       student_3.append(float(line[2]))
# Check we have the correct lists
print(student_1, "\n", student_2, "\n", student_3)

# Print the mean
print(find_mean(student_1))
print(find_mean(student_2))
print(find_mean(student_3))

We have defined our function at the top, which is good practice. Then, we open our file and extract the contents, ensuring values in each list are floats and not strings. Finally, we print the mean of each list using our earlier defined function.


Exercise: Create a function to read a file

Above, we have written code to open a single file and extract the data. If we have many files of data that we want to read, we don’t want to write that out over and over again.

  1. Write a function that can extract columns of data, just like in the previous exercise, and call it on the file “measurements_2.csv” to check it works.

  2. Extension: Create a function that can read a certain column index and return the data from that column in a list.

  3. Advanced extension: Create a function that can take an arbitrary number of column index values for a file and output the data of those columns. (Hint: use nested loops and nested lists.) Hint: Use arbitrary arguments, *args.


Click to view answer

This is just turning our code from the previous exercise into a function. The function takes one argument (the file name), and will always output the first three columns of data as three separate lists.

def read_data(data_file):
   """
   Read a file and output columns of data
   """
   list_1 = []
   list_2 = []
   list_3 = []
   # Open our file and extract the contents into the lists above
   with open(data_file) as file:
       contents = file.read()
       contents = contents.split("\n")
       for line in contents:
           line = line.split(",")
           list_1.append(float(line[0]))
           list_2.append(float(line[1]))
           list_3.append(float(line[2]))
   return list_1, list_2, list_3

# Call our function
student_1, student_2, student_3 = read_data("./practice_files/measurements_2.csv")
# Check we have the correct lists
print(student_1, "\n", student_2, "\n", student_3)

This has many limitations. It will not work if our data has fewer than 3 columns (try with the file “measurements_1.csv” and see what happens). It also won’t return any columns past the first three, so you can’t retrieve the fifth, sixth, or 100th column. The extension question tackles this.


Click to view extension answer

This code is more flexible than the previous one. Now, we can accept files with any number of columns, and we retrieve the data from a certain column, specified with a second argument.

def read_file(data_file, column=0):
    """
    Read a file and output the data from a column.

    Parameters:
        data_file : STRING
            The pathway of a CSV file with no headings.
        column : INTEGER
            The index of the column we wish to extract. 
            An integer that is 0 or greater.
            Default = 0 (the first column)
    
    Returns:
        data : LIST
            The data from a certain column.
    """
    # Extract data into this empty list
    data = []
    with open(data_file) as file:
        file = file.read()
        contents = file.split("\n")   
        for line in contents:
            line = line.split(",")
            data.append(float(line[column]))
    return data

# Column index 1 is the second column, which holds student 2's data
student_2 = read_file("./practice_files/measurements_2.csv", 1)
print(student_2)

But this still has issues. If we want to retrieve the data from multiple columns, we have to call our function multiple times, and each time type out the file path.

In the advanced extension, we instead use an arbitrary number of arguments to retrieve any number of columns from the data.


Click to view advanced answer

In this code, we have written *columns as our arbitrary arguments. As our file only has 3 columns of data, we could put in 0, 1, or 2, as our arbitrary arguments. If we had more columns of data, we could call any number of those columns.

def read_file(data_file, *columns):
    """
    Read a file and output any of the columns.

    Parameters:
        data_file : STRING
            The pathway to access a .csv file with no headers
        *columns : INT
            An arbitrary number of integers 
            Corresponding to the index of a column within data_file
    
    Returns:
        data : LIST
            The nested list of all our data. Each nested list is a column of data points
        column_index : LIST
            A record of which columns our data is from. Can be used to reference
    """
    # A record of which columns we are taking data from
    column_index = []
    # A nested list of the data, where each column is a list within this list
    data = []
    for column in columns:
        column_index.append(column)
    with open(data_file) as file:
        file = file.read()
        contents = file.split("\n")
        for index in column_index:
            temp_list = []   
            for line in contents:
                line = line.split(",")
                temp_list.append(float(line[index]))
            data.append(temp_list)
    return data, column_index

data, index = read_file("./practice_files/measurements_2.csv", 0, 2)
print(data) # For student 1 and student 3
print(index) # Student 1's data is in column 0, and student 3's data is in column 2
print("Student 1: " , data[0]) # This is student 1's data
print("Student 3: " , data[1]) # This is student 3's data

Once we have retrieved the nested list of data, we can separate it out by calling the index location. Since we only have two items in this list, we must call positions 0 and 1, which correspond to the data from Student 1 and Student 3.

We could call this function with any column index, as long as it does not exceed the number of columns. Remember to check you have the right data: if you want Student 2’s data and put the number 2 as an argument, you will get Student 3’s data without realising, as Python will not raise an error. If you want Student 2’s data, you need to write 1 as the argument.

Indexing can get confusing, especially with Python’s 0-indexing system. Keep your variable names consistent and descriptive, and leave sensible comments to help you and others understand what is going on.
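
One possible way to reduce this confusion is to return a dictionary keyed by column index, so that each set of data is looked up with the same number that was passed in. This is only a sketch: the function name read_columns, the file name measurements_demo.csv, and its values are all made up for illustration, and the file is written to a temporary folder so the example is self-contained.

```python
import os
import tempfile

def read_columns(data_file, *columns):
    """Return a dictionary mapping each requested column index to its data."""
    data = {}
    for column in columns:
        data[column] = []
    with open(data_file) as file:
        for line in file:
            fields = line.strip().split(",")
            if fields == [""]:
                continue          # skip blank lines
            for column in columns:
                data[column].append(float(fields[column]))
    return data

# Set-up for illustration only: a small stand-in file with made-up values.
path = os.path.join(tempfile.mkdtemp(), "measurements_demo.csv")
with open(path, "w") as new_file:
    new_file.write("1.0,2.0,3.0\n4.0,5.0,6.0\n")

data = read_columns(path, 0, 2)
print(data[0])  # column 0: [1.0, 4.0]
print(data[2])  # column 2: [3.0, 6.0]
```

With this design, data[2] always means column index 2 of the file, regardless of how many columns were requested.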


Using the CSV module

CSV (Comma Separated Values) files are a common format for storing tabular data, such as data from experiments or simulations. Each line in a CSV file represents a row of data, and each value in the row is separated by a comma (you can easily verify this by opening up a CSV file in a text editor). Python has a built-in module called csv that makes it easy to read (and write) CSV files.

Let’s take a look at how to read a CSV file using the csv module:

import csv

with open('elements.csv') as file:
    csv_reader = csv.reader(file)
    for row in csv_reader:
        print(row)

Here, we first import the built-in csv module to allow us to easily parse CSV files.

Next we open the elements.csv file using the with statement as we have seen before. Note that we are opening the file in read mode without needing to specify it explicitly.

The csv.reader() function takes the file object as an argument and returns a CSV reader object that can be used to iterate over the rows in the CSV file.

Finally, we use a for loop to iterate over the rows in the CSV file and print the contents of each row. The csv_reader object allows us to access each row as a list of values, making it easy to work with the data.
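
One situation where csv.reader behaves differently from a plain .split(",") is when a value itself contains a comma and is wrapped in quotation marks. The sketch below uses io.StringIO, which lets a string be read as if it were an open file; the rows are made up for illustration.

```python
import csv
import io

# Stand-in for a CSV file; io.StringIO lets a string be read like an open file.
# The rows are made up for illustration.
fake_file = io.StringIO('name,formula\n"ethanol, absolute",C2H6O\n')

rows = []
for row in csv.reader(fake_file):
    rows.append(row)

print(rows[1])                                 # ['ethanol, absolute', 'C2H6O']
print('"ethanol, absolute",C2H6O'.split(","))  # ['"ethanol', ' absolute"', 'C2H6O']
```

The reader keeps the quoted value together as one item, whereas splitting on every comma breaks it in two.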

Parsing headers

Data will often come with headers indicating what the data in each column is. If you use the methods above to extract information, you will end up with descriptions in your list of data, which cannot then be converted to floats and will result in errors. You must either skip over these lines, or append them into a useful list.

The most general way of doing this is using if statements. There are a number of strategies you could use. Some (but not all) are listed below. You can get creative with these, there are lots of methods!

Identify a line containing a phrase

Here, if the line is equal to ['nm', 'abs'] (which the first line is), the program will continue, meaning the current iteration of the loop ends and the next one begins. Since the remaining lines contain numbers, the code goes to the else statement and appends to the relevant lists.

You could also test just the first item in the line using if line[0] == "nm":. This might be useful if there are multiple lines beginning with the same thing that you want to extract.

wavelength = []
absorption = []
with open("./practice_files/spectrum.dat") as file:
    contents = file.read()
    contents = contents.split("\n")
    for line in contents:
        line = line.split()
        if line == ['nm', 'abs']:
            continue
        else:
            wavelength.append(float(line[0]))
            absorption.append(float(line[1]))
print("Wavelengths: " , wavelength)
print("Absorptions :" , absorption)

Check if an entry is a number using .isdigit()

In this example, we test the first entry of each line using the string method .isdigit(). This returns True if every character in the string is a digit, and False otherwise. Note that a decimal point or a minus sign is not a digit, so "0.123".isdigit() is False. This check therefore only suits entries written as whole numbers, which we assume here for the wavelengths. Lines whose first entry fails the check, such as the header, are skipped.

Also, if the first entry of a line has been replaced with non-numeric text (removed outliers, etc.), this will filter that line out.

wavelength = []
absorption = []
with open("./practice_files/spectrum.dat") as file:
    contents = file.read()
    contents = contents.split("\n")
    for line in contents:
        line = line.split()
        # Skip empty lines and lines that do not start with a whole number
        if len(line) == 0 or not line[0].isdigit():
            continue
        else:
            wavelength.append(float(line[0]))
            absorption.append(float(line[1]))
print("Wavelengths: " , wavelength)
print("Absorptions :" , absorption)
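
Because .isdigit() only accepts digit characters, strings such as "0.123" return False. A more general check, sketched here with a hypothetical helper called is_number(), is to attempt the conversion and catch the ValueError that float() raises when it fails. The try and except statements may not have been introduced yet, so treat this as a preview rather than the expected approach.

```python
def is_number(text):
    """Return True if text can be converted to a float (hypothetical helper)."""
    try:
        float(text)
        return True
    except ValueError:
        return False

print("240".isdigit())     # True
print("0.123".isdigit())   # False, because of the decimal point
print(is_number("0.123"))  # True
print(is_number("abs"))    # False
```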

enumerate() to skip specific lines

In this example, we enumerate the contents. This way, we can skip over any line we specify. In this case, the first line (index == 0) is skipped. If we also wanted to skip over the second line, we could use an or statement, e.g. if index == 0 or index == 1:.

wavelength = []
absorption = []
with open("./practice_files/spectrum.dat") as file:
    contents = file.read()
    lines = contents.split("\n")
    for index, line in enumerate(lines):
        if index == 0:
            continue
        else:
            # Split the line into its items before indexing it
            line = line.split()
            wavelength.append(float(line[0]))
            absorption.append(float(line[1]))

print("Wavelengths: " , wavelength)
print("Absorptions :" , absorption)

Use length of line

In some cases, the headers are a different length to the rest of the file.

In the XYZ file “hydrogen_atoms.xyz”, the lines containing the data always have 4 elements to them: the atom, its x coordinate, its y coordinate, its z coordinate. All coordinates are given in Angstroms.

If you wanted to, instead of outputting a nested list of coordinates in the format [[x,y,z], [x,y,z]], you could create three lists corresponding to the three axes, holding all the x-, y-, or z-values respectively. You could then use these values for calculating the distance between atoms, or calculating the geometric similarity between two structures (using the RMSD formula for two structures).

atoms = []
H_coords = []

with open("./practice_files/hydrogen_atoms.xyz") as file:
    contents = file.read()
    lines = contents.split("\n")
    for line in lines:
        line = line.split()
        if len(line) != 4:
            continue
        else:
            atoms.append(line[0])
            coord = [float(line[1]), float(line[2]), float(line[3])]
            H_coords.append(coord)

print("H coords: ", H_coords)
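
Separating a nested list of coordinates into one list per axis might look like the sketch below. The coordinates are made up for illustration and are not the values in the file.

```python
# Coordinates in the nested format [[x, y, z], [x, y, z]]; values are made up.
coords = [[0.0, 0.0, 0.0], [1.0, 2.0, 2.0]]

x_values = []
y_values = []
z_values = []
for coord in coords:
    x_values.append(coord[0])
    y_values.append(coord[1])
    z_values.append(coord[2])

print(x_values)  # [0.0, 1.0]
print(y_values)  # [0.0, 2.0]
print(z_values)  # [0.0, 2.0]
```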

Exercise: Extract data from periodic table

Using the file ‘periodic_table.csv’, write code to write each element out in the following format:

Element has the symbol symbol. It has number protons and a mass of mass amu.

For example: ’Hydrogen has the symbol H. It has 1 proton and a mass of 1.008 amu.’

Hint: Use f-strings for more control over your variables and better readability.


Click to view answer

There are many ways to write this. The important thing is that our output does NOT include the headers as a phrase. This answer has done this by skipping the line starting with the string “Name”. If you have achieved the same output a different way, that’s fine! At the moment, it is more important to just solve the problem rather than look for the “best” way to solve it.

with open("./practice_files/periodic_table.csv") as file:
    file = file.read()
    lines = file.split("\n")
    for line in lines:
        line = line.split(",")
        if line[0] == "Name":
            continue
        else:
            print(f"{line[0]} has the symbol {line[1]}. It has {line[2]} protons and a mass of {line[3]} amu.")

Think about how you could extract and use each column. If I wanted to just take the data for phosphorus, how would I achieve this? Could I use enumerate to find the right index?


Exercise: Using the CSV module

Using the CSV module and the file ‘periodic_table.csv’, write code to write each element out in the following format:

Element: element, Symbol: symbol, Atomic Number: number, Atomic Mass: mass.

Hint: Use f-strings for more control over your variables and better readability.


Click to view answer

Using the CSV module results in fewer lines of code.

import csv

with open('./practice_files/periodic_table.csv') as csvfile:
    csv_reader = csv.reader(csvfile)
    next(csv_reader)  # Skip the header row
    for row in csv_reader:
        print(f"Element: {row[0]}, Symbol: {row[1]}, Atomic Number: {row[2]}, Atomic Mass: {row[3]}")

Remember that although the CSV module can simplify things, it is still important to understand how to read and write code without it, as this will give you a better understanding of programming and Python coding.
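
The csv module also provides csv.DictReader, which uses the header row as keys so that values can be looked up by column name instead of by index. A minimal sketch, in which the column names "Name" and "Symbol" are assumptions and the data is held in a string via io.StringIO rather than read from periodic_table.csv:

```python
import csv
import io

# Stand-in for a CSV file with a header row; the column names here are assumptions.
fake_file = io.StringIO("Name,Symbol\nHydrogen,H\nHelium,He\n")

symbols = []
for row in csv.DictReader(fake_file):
    # Each row is a dictionary keyed by the header names
    symbols.append(row["Symbol"])
    print(f"{row['Name']} has the symbol {row['Symbol']}.")

print(symbols)  # ['H', 'He']
```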


Files in a different directory

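If the file you are trying to open is unavoidably in a different directory to the one you are working in, you can still open it by giving the full filepath instead, e.g. r"C:\Users\Tara\Documents\PythonInChemistry\readingfiles\molecule.txt". With a Windows path like this, put an r before the opening quotation mark (making it a raw string) or use forward slashes, because Python otherwise treats a backslash inside a string as the start of a special character such as "\n".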

You could also use a relative filepath using ./ and ../.

  • ./ indicates you want to stay in the same directory. After the forward slash, you can then put in further directory names, ending with your file name.

  • ../ indicates you want to go up one directory from the one you are in. After the forward slash, you can then put in further directory names, again ending with your file name.

If this is confusing, have a look at the previous lessons on file types and view the directory branch diagram.

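Remember: On Windows, forward slashes and backslashes can both be used to separate directories in a filepath, but a backslash inside an ordinary Python string starts a special character (such as "\n"), so forward slashes are the safer choice. On macOS and Linux, only the forward slash separates directories.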
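
The sketch below shows two standard tools for writing filepaths safely: a raw string, and os.path.join() from Python's built-in os module. The Windows path is only the example from the text and does not need to exist on your computer.

```python
import os

# A raw string (prefix r) keeps backslashes literal; the path is only an example.
windows_path = r"C:\Users\Tara\Documents\PythonInChemistry\readingfiles\molecule.txt"
print(windows_path)

# os.path.join builds a relative path using the separator for the current system.
relative_path = os.path.join("practice_files", "molecule.txt")
print(relative_path)

# os.path.exists reports whether a path points at something, without opening it.
print(os.path.exists(relative_path))
```

Checking os.path.exists() before opening a file can help distinguish a wrong filepath from a problem with the file itself.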

Practice

Question 1: Using the file path

With the file “gas_const.txt” in a different directory to your program, write code to open it and print its contents. Remember you can use .. to go up a directory, and . to indicate the directory you are currently in.

# Answer
# With the file one directory down, in the practice_files folder.

with open("./practice_files/gas_const.txt") as file:
    contents = file.read()
    print(contents)
The gas constant is: 8.314 J/K.mol

Question 2: Calculate internuclear distance

Using the coordinates in the file “hydrogen_atoms.xyz”, write a program to calculate the distance between the two nuclei using Pythagoras’ Theorem.

The van der Waals radius is the distance from the centre of an atom within which another atom will experience van der Waals interaction with that atom. For a diatomic, a distance between the two atom centres greater than twice the van der Waals radius will result in no van der Waals interaction.

Given that the van der Waals radius for H is 1.2 Angstroms, add a conditional statement that will check if the two hydrogen atoms in the file will interact or not.

# Answer

def distance(atom_1, atom_2):
    """
    Calculate distance in angstroms between two atoms of coordinates [x,y,z].
    Using Pythagoras' formula for the difference between each coordinate. 
    """
    distance = (atom_1[0]-atom_2[0])**2 + (atom_1[1]-atom_2[1])**2 + (atom_1[2]-atom_2[2])**2
    distance = distance ** (1/2)
    return distance

with open("./practice_files/hydrogen_atoms.xyz") as file:
    contents = file.read()
    contents = contents.split("\n")
    atom_coords = []
    for line in contents:
        line = line.split()
        if len(line) < 4:
            continue
        else:
            temp_coord = [float(line[1]), float(line[2]), float(line[3])]
            atom_coords.append(temp_coord)
H1 = atom_coords[0]
H2 = atom_coords[1]

print(distance(H1, H2))
if distance(H1,H2) < 1.2*2:
    print("These two atoms will interact")
else:
    print("These two atoms will not interact")
3.2055211744738172
These two atoms will not interact

Question 3

Extract the data from “spectrum.dat” and identify which wavelength results in the maximum absorption.

Hint: The built-in Python function max() can find the largest item of a list of numbers.

# Answer

wavelength = []
absorption = []
with open("./practice_files/spectrum.dat") as file:
    contents = file.read()
    contents = contents.split("\n")
    for line in contents:
        line = line.split()
        if line[0] == 'nm':
            continue
        else:
            wavelength.append(float(line[0]))
            absorption.append(float(line[1]))
print("Wavelengths: " , wavelength)
print("Absorptions :" , absorption)

for i, j in zip(wavelength, absorption):
    if j == max(absorption):
        print("Max absorption is: ", j)
        print("Wavelength of the max absorption is: " , i, "nm")
Wavelengths:  [240.0, 250.0, 260.0, 270.0, 280.0, 290.0, 300.0, 310.0]
Absorptions : [0.123, 0.132, 0.346, 0.563, 0.998, 0.377, 0.007, 0.002]
Max absorption is:  0.998
Wavelength of the max absorption is:  280.0 nm
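
An alternative to looping with zip() is to find the position of the maximum with the list method .index() and use that position in the other list. This is just one possible variation; the lists below are copied from the output above, and the variable names max_absorption and position are chosen for illustration.

```python
# Values copied from the output above.
wavelength = [240.0, 250.0, 260.0, 270.0, 280.0, 290.0, 300.0, 310.0]
absorption = [0.123, 0.132, 0.346, 0.563, 0.998, 0.377, 0.007, 0.002]

max_absorption = max(absorption)
position = absorption.index(max_absorption)   # index of the first occurrence
print("Max absorption is: ", max_absorption)
print("Wavelength of the max absorption is: ", wavelength[position], "nm")
```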

Summary

  • Before reading a file in Python, make sure:

    • Your file is in the same directory as your code (or has a valid filepath).

    • Your file is named sensibly.

    • Your file is in the correct format.

  • Open a file using either:

    • file_name = open("file.txt", "r"), followed by contents = file_name.read() and then file_name.close()

    • with open("file.txt", "r") as file_name:

  • Delimiters indicate separation between items in a file. Common delimiters are commas (","), spaces (" "), tabs ("\t"), and new lines ("\n").

  • Split a file along a delimiter using .split(). Inside the bracket place the delimiter as a string.

  • The CSV module allows you to quickly parse CSV files.

  • There are a number of ways to parse headers.

    • Identify whether the line contains a certain phrase, e.g. if line == "time,result": continue

    • Check if an entry is a number using .isdigit().

    • Use enumerate() to skip certain lines.

    • Skip lines of a certain length.

  • Use a relative filepath to reference files in a different directory.