Tech Blog
Python Parsing
General parsing of data files with Python 3
Overview
At Evergreen we regularly undertake projects in which we need to analyse data generated by the bespoke acquisition systems we build for our clients. This could be, for example, data related to the control system of a model wave energy converter.
In this post we will look at some of the tricks we have picked up over the course of many testing campaigns. All of our analysis routines are implemented in Python 3.
Simple example
The easiest approach is simply to use numpy to load the contents of a data file into a matrix from which we can pick out the columns of data we want to use.
Suppose our data acquisition system records the elapsed time and a velocity signal. We would have a tab-separated file that looks like:
Time (s) | velocity (mm/s) |
---|---|
0.0000 | 14.794255 |
0.0078 | 14.930482 |
0.0156 | 15.064894 |
0.0234 | 15.197459 |
0.0312 | 15.328144 |
… | … |
59.9609 | -1.005011 |
59.9688 | -0.905971 |
59.9766 | -0.808469 |
59.9844 | -0.712527 |
59.9922 | -0.618163 |
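If you want to follow along, a file in this format is easy to generate synthetically. The signal below is made up purely for illustration (a decaying oscillation at 128 Hz); the real data comes from the acquisition system:

```python
import numpy as np

# Write a synthetic file in the same tab-separated format:
# 60 s of an arbitrary decaying signal sampled at 128 Hz.
t = np.arange(0, 60, 1 / 128)
v = 15 * np.exp(-t / 30) * np.cos(0.4 * t)

with open('example.txt', 'w') as f:
    f.write('Time (s)\tvelocity (mm/s)\n')
    for ti, vi in zip(t, v):
        f.write(f'{ti:.4f}\t{vi:.6f}\n')
```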
The simplest way to read the file is with the numpy loadtxt function, which creates a matrix with the time in the first column and the velocity in the second column:
import numpy as np
import matplotlib.pyplot as plt
fname = 'example.txt'
data = np.loadtxt(fname, skiprows=1)
plt.figure()
plt.plot(data[:,0], data[:,1])
plt.xlabel('t (s)')
plt.ylabel('vel (mm/s)')
plt.grid()
plt.savefig('img1.png', format='png', dpi=500)
plt.show()
The resulting figure is as expected:
General example
For files with a handful of columns, this simple method is effective. However, numerical indexing of the columns (data[:,0] etc.) has several drawbacks:
1. Our files often have as many as 30 columns, which makes identification by number error-prone.
2. Each time we want to parse a file we have a long list of statements picking out the columns we want to use – a red flag for copy-paste errors.
3. During an experimental series we sometimes add or remove columns: for the first two days of testing column 5 could be velocity, but then column 6 thereafter. Writing code to handle these cases automatically is time-consuming.
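To make the last point concrete, here is the kind of manual bookkeeping that numeric indexing forces on us. The per-day column layouts below are hypothetical, but the pattern is representative:

```python
import numpy as np

# Hypothetical per-day column maps: an extra channel was inserted on
# day 2, shifting velocity from column 5 to column 6.
COLUMN_MAPS = {
    'day1': {'time': 0, 'velocity': 5},
    'day2': {'time': 0, 'velocity': 6},
}

def get_column(data, day, name):
    '''Pick a named column out of a plain matrix.'''
    return data[:, COLUMN_MAPS[day][name]]

data = np.arange(70.0).reshape(10, 7)  # stand-in for a parsed file
vel = get_column(data, 'day2', 'velocity')
```

Every layout change means editing this table by hand – exactly the overhead a header-based scheme removes.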
The best strategy we have found is to use the header line of the file to name each column in parsed output. For example, we would like to write:
plot(data.time, data.velocity)
and not have to worry about where in the file these values actually are. To achieve this, we can use the fromrecords function in numpy along with the loadtxt function we have already seen. As we want to use the header line for identification, it helps to make that line easy to read automatically – our files now look like:
time_s | velocity_mm_s |
---|---|
0.0000 | 14.794255 |
0.0078 | 14.930482 |
0.0156 | 15.064894 |
0.0234 | 15.197459 |
0.0312 | 15.328144 |
… | … |
59.9609 | -1.005011 |
59.9688 | -0.905971 |
59.9766 | -0.808469 |
59.9844 | -0.712527 |
59.9922 | -0.618163 |
We have found that keeping the units in the column names also helps to avoid mistakes, particularly when working between projects that have different conventions. Again, this is something that can cause hard-to-find bugs under the ‘simple’ scheme above.
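Before building the full routine, a minimal sketch of what fromrecords gives us, using two rows copied from the table above (np.rec.fromrecords is the public alias for the same function):

```python
import numpy as np

# Name the columns of a small matrix so they become attributes.
raw = np.array([[0.0000, 14.794255],
                [0.0078, 14.930482]])
rec = np.rec.fromrecords(raw, names=['time_s', 'velocity_mm_s'])
print(rec.time_s)         # first column, accessed by name
print(rec.velocity_mm_s)  # second column, accessed by name
```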
To parse the file we start by reading and storing the header line, then load the contents and associate each column with its title:
def parse_file(fname, print_names=False):
    '''Parse a file using the header line as the column names.
    Inputs:
        fname - filename to parse
        print_names - print the column headers
    Outputs:
        data - file contents as a record array
    '''
    # Read the header line into column names
    with open(fname) as f:
        cols = f.readline().rstrip().split('\t')
    if print_names:
        for c in cols:
            print(c)
    # Parse the data in the file
    raw = np.loadtxt(fname, skiprows=1)
    # Associate each column with its title
    data = np.rec.fromrecords(raw, names=cols)
    return data
The print_names argument is there so we can see what our column names are without having to open the original file; it is optional to reduce unnecessary console output once our analysis scripts are established. You can, of course, manipulate the column titles in parse_file to suit the output you want.
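For example, a small helper can turn the original human-readable headers ('Time (s)') into attribute-safe names, so the files would not even need renamed columns. The regex convention here is our own, just one possibility:

```python
import re

def normalise(header):
    '''Turn a raw header such as 'Time (s)' into an attribute-safe
    name such as 'time_s'. The naming convention is ours, not numpy's.'''
    name = re.sub(r'[^0-9a-z]+', '_', header.strip().lower())
    return name.strip('_')

print(normalise('Time (s)'))         # time_s
print(normalise('velocity (mm/s)'))  # velocity_mm_s
```

Applying this to each entry of cols in parse_file would let the same routine handle both header styles.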
Pulling everything together, we have
fname = 'general_example.txt'
data = parse_file(fname, print_names=True)
plt.figure()
plt.plot(data.time_s, data.velocity_mm_s)
plt.xlabel('t (s)')
plt.ylabel('velocity (mm/s)')
plt.grid()
plt.show()
which gives the same result as before.
Summing up
In this post, we have taken advantage of routines already implemented in numpy to make dealing with our data files easier. At Evergreen we have found the general technique robust, and it has certainly saved us time tracing bugs related to identifying the correct data within large files.