
(Image credit: Rebecca Winzenried)
Ok, so following on from the previous two posts, I’ll assume you’re generating data in Python, saving it to disk with the pickle module, and loading it up again elsewhere so you can visualise and explore the data, generate some plots with matplotlib, or whatever.
So, how do you structure the data in such a way that it’s easy to find exactly what you want, quickly and simply?
My favoured method is pretty simple: essentially it boils down to using a dictionary to represent each datapoint. We simply map the field names to their values. As a trivial example:
data = []
data.append({'sample_mean' : 3.5, 'sample_std_dev' : 1.2})
data.append({'sample_mean' : 2.4, 'sample_std_dev' : 2.1})
Of course, if you were really only recording two parameters for each datapoint, it wouldn’t be so hard to store them as tuples and simply remember which came first and which came second. The dictionary method really comes into its own when recording tens of parameters per datapoint.
This has a bunch of advantages compared to carefully ordered lists, binary formats, or other more primitive methods:
- Stores mixed data types very easily.
- No serialization versioning worries – either a key is present, or it’s not.
- Trivial to change / extend the data structure.
- Easy to dig around in old data and immediately see what’s what – just print the dictionary and get a list of key-value pairs for each datapoint.
- It’s also trivial to convert a list of dictionaries to a table using csv.DictWriter.
- Assuming you use a list of dictionaries to represent datapoints, it’s extremely easy to filter and group the data (a la database queries) using itertools (more on this later).
- Can be used to represent hierarchical data structures if necessary: just make one of your dictionary values a list, or another dictionary.
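As a quick illustration of the csv point above, here’s a minimal sketch (Python 3) that writes the two example datapoints out as a table. The helper name dicts_to_csv is made up for this example, and it writes to an in-memory io.StringIO buffer rather than a real file purely to keep things self-contained.

```python
import csv
import io

# Two datapoints in the list-of-dictionaries layout used above.
data = [
    {'sample_mean': 3.5, 'sample_std_dev': 1.2},
    {'sample_mean': 2.4, 'sample_std_dev': 2.1},
]


def dicts_to_csv(datapoints, fieldnames):
    """Return the datapoints as CSV text, one row per dictionary.

    Hypothetical helper for illustration; fieldnames fixes the column order.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()          # first row: the dictionary keys
    writer.writerows(datapoints)  # one row per datapoint
    return buffer.getvalue()


csv_text = dicts_to_csv(data, ['sample_mean', 'sample_std_dev'])
print(csv_text)
```

To write to disk instead, you would pass a file opened with newline='' to csv.DictWriter in place of the buffer.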
However, there is one major downside – you have to ensure non-clashing, preferably human-readable, keys. This tends to result in long, unwieldy keys which are a pain to remember and type out – and since the keys are just strings, you don’t even get any help from autocompletion environments like IPython or PyDev. Fortunately, there’s a really simple hack to make this a non-issue.
The solution is simple: store your key-strings as variables, in an additional module which is easily imported when saving and loading the data (so the keys match up). An additional trick is to enclose your keys in namespaces using classes to group them together – this also makes it really easy to name keys in ways which don’t clash. As a trivial example, you might have something like this:
""" keys.py - A place to keep my data serialization keys """
class generation_params:
    mean = 'true_mean'
    std_dev = 'true_standard_deviation'
class sample:
    mean = 'sample_mean'
    std_dev = 'sample_std_dev'
So now my data exploration code goes something like this:
import keys
import pickle
# Open the file in a with-block so it is closed again after loading.
with open('my_datafile.dat', 'rb') as f:
    data = pickle.load(f)
first_datapoint = data[0]
print(first_datapoint)
# {'sample_mean': 3.5, 'sample_std_dev': 1.2}
print(first_datapoint[keys.sample.mean])
# 3.5
Note that as soon as you type the name of a class of keys (e.g. keys.sample) you can use autocompletion to list your possible options, so you no longer have to remember the exact spelling of dozens of different parameter names.
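For completeness, the saving side might look something like the sketch below (Python 3). This is an illustration under assumptions rather than the actual generation code: make_datapoint and the random test values are invented for the example, the sample class is repeated inline as a stand-in for keys.sample so the snippet runs on its own, and the file goes into a temporary directory.

```python
import os
import pickle
import random
import statistics
import tempfile


class sample:
    """Stand-in for keys.sample, repeated here so the sketch is self-contained."""
    mean = 'sample_mean'
    std_dev = 'sample_std_dev'


def make_datapoint(values):
    """Summarise a list of numbers as one dictionary datapoint (hypothetical helper)."""
    return {
        sample.mean: statistics.mean(values),
        sample.std_dev: statistics.stdev(values),
    }


rng = random.Random(0)  # fixed seed so repeated runs produce the same data
data = [make_datapoint([rng.gauss(3.0, 1.5) for _ in range(20)])
        for _ in range(5)]

# Save with the same key variables that the loading code will use.
path = os.path.join(tempfile.mkdtemp(), 'my_datafile.dat')
with open(path, 'wb') as f:
    pickle.dump(data, f)

# Round trip: load it back and look a value up by key variable.
with open(path, 'rb') as f:
    reloaded = pickle.load(f)
print(reloaded[0][sample.mean])
```

Because both sides refer to sample.mean rather than typing out 'sample_mean', a typo in a key shows up as an AttributeError straight away instead of a silently missing entry.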
In fact, we can go one better. If you look at the class definitions, you’ll see that there’s actually a lot of duplication of data – we encode almost all the information about what a key means when we name it and place it in a class namespace, so typing out the string we’re going to assign to it is often just a waste of time (and violates the DRY principle). Fortunately, Python has some magic that can fix this for us. Here’s an example of the code I actually use when defining my keys:
""" keys2.py - Now with smarter key generation """
def _generate_default_keys(someclass):
    """Sets the values of class attributes according to their name.
    NB. ignores any attributes with values other than None.
    """
    # list() takes a snapshot, so the class is not modified mid-iteration.
    for att_name, att_val in list(vars(someclass).items()):
        if att_name[:2] != "__" and att_val is None:
            setattr(someclass, att_name, "".join([someclass.__name__, ".", att_name]))
class generation_params:
    mean = None
    std_dev = None
    special_key = "non_standard_key_value"
_generate_default_keys(generation_params)
class sample:
    mean = None
    std_dev = None
_generate_default_keys(sample)
#
# etc...
#
print(sample.mean)
# sample.mean
print(type(sample.mean))
# <class 'str'>  (<type 'str'> on Python 2)
print(generation_params.special_key)
# non_standard_key_value
What’s going on here? We’re making use of introspection: code examining an object’s attributes at runtime. In this case a function looks at a class’s innards and modifies them according to what it sees. The class mangling function breaks down as follows:
- vars(someclass) returns a dictionary of the variables belonging to someclass, and their values.
- The att_name[:2] != "__" test: there are a lot of internal variables associated with a class that you don’t want to mess with, but they’re all prefixed with “__”, so it’s easy to avoid them.
- The test for None: we can imagine special cases where you do want to set the key string manually, so we only mangle attributes with a value of None.
- setattr(someclass, att_name, "".join([someclass.__name__, ".", att_name])) sets the values of those variables by looking at the name of their class and the variable name, then joining them with a ‘.’.
- Finally, I’ve prefixed the class mangler function name with a ‘_’. By convention, this means the function isn’t intended for use outside that module. In practice, it means that autocompleting editors should only bring it up at the bottom of the list when you look for variables inside the ‘keys’ module.
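One possible variation, if you’d rather not repeat the function call after every class: the same logic can be written as a class decorator. This is only a sketch of an alternative (Python 3), not the code described above; the name generate_default_keys without the underscore is an assumption for the example.

```python
def generate_default_keys(someclass):
    """Class decorator: replace every attribute left as None with 'ClassName.attribute'."""
    # Snapshot the items so the class is not modified while we iterate over it.
    for att_name, att_val in list(vars(someclass).items()):
        if not att_name.startswith("__") and att_val is None:
            setattr(someclass, att_name, ".".join([someclass.__name__, att_name]))
    return someclass  # a decorator must hand the class back


@generate_default_keys
class sample:
    mean = None
    std_dev = None
    special_key = "non_standard_key_value"


print(sample.mean)         # sample.mean
print(sample.special_key)  # non_standard_key_value
```

The behaviour is the same as calling the function by hand; the decorator just keeps the key generation attached to the class definition so it can’t be forgotten.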
And we’re done! Python endlessly amazes me in its succinct capabilities.
This framework was invaluable when I was rapidly iterating on some complicated simulations for my PhD, varying different things and trying to cross-compare a lot of different variables.
As mentioned, this is just a storage medium – you’ll still need to pick out those datapoints you actually want to plot. Next time, I’ll show you how to use itertools to filter your list of dictionaries, resulting in most of the functionality you’d get from a database, with none of the overheads of setting one up.