HPC Cluster monitoring via multiple htop instances

In a slightly unexpected sideline of my job, I’ve found myself doing a bit of HPC / HTC mini-cluster sysadmin work. While mostly a bit tedious, occasionally I find a new trick or tool for speeding things along, which can be pleasantly satisfying (on that note, go check out Ansible).

For cluster-usage monitoring I’ve been using Ganglia, which is reasonably simple to get running (sudo apt-get install ganglia-monitor on the nodes, together with the updated web interface detailed here). But while it provides a great overview, sometimes you need to drill down and quickly see who’s running what, on which node. For this, htop (sudo apt-get install htop) is a great tool with minimal footprint in terms of CPU overhead and bandwidth. I’ve put together a little script that fires up multiple instances of htop by opening one terminal per node with an ‘ssh-and-run’ combo. It works very nicely, so I thought I’d share:

#!/bin/bash
# Default ssh targets, used when no arguments are given.
ALL_HOSTS=(node1 node2 node3)

if [ $# -eq 0 ]; then
    TARGETS=("${ALL_HOSTS[@]}")
else
    TARGETS=("$@")
fi

# Open one terminal per target, each running htop over ssh.
for HOST in "${TARGETS[@]}"; do
    gnome-terminal -t "$HOST" -e "ssh $HOST -t htop" &
done

Run with no arguments, this will open up three child terminals, one for each node listed in ALL_HOSTS. Alternatively, the user may supply a manual list of ssh targets.
The tricky bit is the population and expansion of bash variable arrays – see here for details: http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_10_02.html .
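
For comparison, here is a rough Python sketch of the same launcher, which sidesteps the bash array syntax altogether. It reuses the gnome-terminal and ssh flags from the script above. The names build_commands and launch are my own placeholders, and I'm assuming gnome-terminal and ssh are available on the machine you run it from.

```python
import subprocess

ALL_HOSTS = ["node1", "node2", "node3"]  # placeholder node names


def build_commands(targets):
    """Return one gnome-terminal argument list per ssh target."""
    return [
        ["gnome-terminal", "-t", host, "-e", "ssh {0} -t htop".format(host)]
        for host in targets
    ]


def launch(targets=None):
    """Open one terminal per target, falling back to ALL_HOSTS if none are given."""
    for command in build_commands(targets or ALL_HOSTS):
        subprocess.Popen(command)  # returns immediately, like '&' in bash


# Example: call launch() for all nodes, or launch(["node2"]) for just one.
```

Because each command is a list, every element reaches gnome-terminal as exactly one argument, so there is no quoting to get wrong.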

Fix your Dell touchpad in Linux

Just testing a shiny new Dell Precision M4700, which is great… except the trackpad isn’t recognized in Xubuntu 12.04, resulting in no multi-touch – it’s amazing how clunky resorting to scroll bars seems now. Anyway, it turns out Dell are not thoroughly supporting the ‘Alps’ trackpads used on their latest models, which is a shame, as it results in reduced functionality out of the box for a raft of their laptops (see here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/606238/ ).

Fortunately, there is a fix (tested on Xubuntu 12.04):

Download psmouse-alps-1.3: http://www.dahetral.com/public-download/psmouse-alps-1.3.tbz/view
Then (using sudo / root):

  • Extract the folder psmouse-alps-1.3 to /usr/src
  • cd to /usr/src
  • dkms add psmouse-alps-1.3
  • dkms autoinstall
  • rmmod psmouse && modprobe psmouse

Bam! Working multitouch! :-)

A Jabref journal abbreviations list for astronomers

Time for a quick diversion. Today, a very minor contribution to making astronomers’ lives easier.

If you’re writing any kind of scientific paper in LaTeX, it will save you time in the long run if you use a reference manager. My tool of choice is Jabref.

Many journals require you to submit references with abbreviated journal names, the style of which may vary from journal to journal.

Fortunately, Jabref provides an easy means of toggling your Bibtex catalogue journal entries between full and abbreviated form (Tools -> Abbreviate journal names).

It’s also possible to use your own custom list of journal abbreviations (Options -> Manage journal abbreviations). Since I couldn’t find any ready made lists of abbreviations for astronomers, I created this repository. Currently I’ve only provided an incomplete list of abbreviations in the MNRAS style, but this will grow over time, hopefully with the help of fellow astronomers. Fork me!

What to put in your pickle? Using dictionaries and smart keys to make handling your data easier.

Pickles!
(Image credit: Rebecca Winzenried)

Ok, so following on from the previous two posts, I’ll assume you’re generating data in Python, saving it to disk with the pickle module, and loading it up again elsewhere so you can visualise and explore the data, generate some plots with matplotlib, or whatever.

So, how do you structure the data in such a way that it’s easy to find exactly what you want, quickly and simply?

My favoured method is pretty simple – essentially it boils down to using a dictionary to represent each datapoint. We simply map the field names to their values, e.g. (as a trivial example):

data = []
data.append({'sample_mean' : 3.5, 'sample_std_dev' : 1.2})
data.append({'sample_mean' : 2.4, 'sample_std_dev' : 2.1})

Of course, if you were really only recording two parameters for each datapoint, it wouldn’t be so hard to store them as tuples and simply remember which came first and which came second. The dictionary method really comes into its own when recording tens of parameters per datapoint.

This has a bunch of advantages compared to carefully ordered lists, binary formats, or other more primitive methods:

  • Stores mixed data types very easily.
  • No serialization versioning worries – either a key is present, or it’s not.
  • Trivial to change / extend the data structure.
  • Easy to dig around in old data and immediately see what’s what – just print the dictionary and get a list of key-value pairs for each datapoint.
  • It’s also trivial to convert a list of dictionaries to a table using csv.DictWriter.
  • Assuming you use a list of dictionaries to represent datapoints, it’s extremely easy to filter and group the data (a la database queries) using itertools (more on this later).
  • Can be used to represent hierarchical data structures if necessary – just make one of your dictionary values a list, or another dictionary
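
To make the table-conversion point concrete, here is a minimal sketch using csv.DictWriter from the standard library. It is written for Python 3 and writes to an in-memory buffer rather than a real file, purely so it can be run anywhere. The datapoints are the trivial ones from above.

```python
import csv
import io

data = [
    {'sample_mean': 3.5, 'sample_std_dev': 1.2},
    {'sample_mean': 2.4, 'sample_std_dev': 2.1},
]

# Fix the column order explicitly; a key missing from a row is written as a blank cell.
fieldnames = ['sample_mean', 'sample_std_dev']

# StringIO stands in for a real file opened with open(..., 'w', newline='').
table = io.StringIO()
writer = csv.DictWriter(table, fieldnames=fieldnames)
writer.writeheader()    # first row: the dictionary keys
writer.writerows(data)  # one row per datapoint

print(table.getvalue())
```

The header row comes straight from the keys, so the resulting file documents itself in the same way the dictionaries do.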

However, there is one major downside – you have to ensure non-clashing, preferably human-readable, keys. This tends to result in long, unwieldy keys which are a pain to remember and type out – and since the keys are just strings, you don’t even get any help from autocompletion environments like IPython or PyDev. Fortunately, there’s a really simple hack to make this a non-issue.

The solution is simple: store your key-strings as variables, in an additional module which is easily imported when saving and loading the data (so the keys match up). An additional trick is to enclose your keys in namespaces using classes to group them together – this also makes it really easy to name keys in ways which don’t clash. As a trivial example, you might have something like this:

""" keys.py - A place to keep my data serialization keys """

class generation_params:
    mean = 'true_mean'
    std_dev = 'true_standard_deviation'

class sample:
    mean = 'sample_mean'
    std_dev = 'sample_std_dev'

So now my data exploration code goes something like this:

import pickle

import keys

# Load the list of datapoint dictionaries saved earlier.
with open('my_datafile.dat', 'rb') as datafile:
    data = pickle.load(datafile)
first_datapoint = data[0]

print(first_datapoint)
# {'sample_mean': 3.5, 'sample_std_dev': 1.2}

print(first_datapoint[keys.sample.mean])
# 3.5

Note that as soon as you type the name of a class of keys (e.g. keys.sample) you can use autocompletion to list your possible options, so you no longer have to remember the exact spelling of dozens of different parameter names.
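
For completeness, here is a sketch of what the saving side might look like when it uses the same key variables. To keep it self-contained, the sample class is defined inline (in practice it would be imported from keys.py), and the data is pickled to a string in memory rather than to a file.

```python
import pickle


class sample:  # in practice this lives in keys.py and is imported
    mean = 'sample_mean'
    std_dev = 'sample_std_dev'


# Build each datapoint using the key variables rather than raw strings.
data = [
    {sample.mean: 3.5, sample.std_dev: 1.2},
    {sample.mean: 2.4, sample.std_dev: 2.1},
]

# dumps/loads keep the sketch in memory; pickle.dump/load with a file work the same way.
restored = pickle.loads(pickle.dumps(data))

print(restored[0][sample.mean])
```

Since both sides look up the same variables, a typo in a key name shows up as an AttributeError straight away, rather than as a silently mismatched string.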

In fact, we can go one better. If you look at the class definitions, you’ll see that there’s actually a lot of duplication of data – we encode almost all the information about what a key means when we name it and place it in a class namespace, so typing out the string we’re going to assign to it is often just a waste of time (and violates the DRY principle). Fortunately, Python has some magic that can fix this for us. Here’s an example of the code I actually use when defining my keys:

""" keys2.py - Now with smarter key generation """
def _generate_default_keys(someclass):
    """Sets the values of class attributes according to their name.

       NB. ignores any attributes with values other than None.
    """
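    # Iterate over a copy of the items (works on Python 2 and 3), since setattr modifies the class as we go.
    for att_name, att_val in list(vars(someclass).items()):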
        if(att_name[:2]!="__" and att_val==None):
            setattr(someclass, att_name, "".join([someclass.__name__,".",att_name]))

class generation_params:
    mean = None
    std_dev = None
    special_key = "non_standard_key_value"
_generate_default_keys(generation_params)

class sample:
    mean = None
    std_dev = None
_generate_default_keys(sample)

#
# etc...
#

print(sample.mean)
# sample.mean
print(type(sample.mean))
# <type 'str'>  (<class 'str'> under Python 3)

print(generation_params.special_key)
# non_standard_key_value

What’s going on here? We’re making use of introspection, which means a class looking at its own innards and modifying them according to what it sees. The class mangling function breaks down as follows:

  • vars(someclass) returns a dictionary of variables belonging to someclass, and their values.
  • if (att_name[:2]!="__"): There are a lot of internal variables associated with a class you don’t want to mess with.
    But they’re all prefixed with “__”, so it’s easy to avoid them.
  • if (att_val==None): We can imagine special cases where you do want to set the variable name manually.
    So we’ll only mangle those with a value of None.
  • setattr(someclass, att_name, "".join([someclass.__name__,".",att_name])): Sets the values of those variables by looking at the name of their class, and the variable name, then joining them with a ‘.’.
  • Finally, I’ve prefixed the class mangler function name with a ‘_’. By convention, this means the function isn’t intended for use outside that module. In practice, it means that autocompleting editors should only bring it up at the bottom of the list when you look for variables inside the ‘keys’ module.
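
One possible variation, which is not what I actually use: the same logic can be packaged as a class decorator, so each class doesn't need a separate call after it. The name with_default_keys is made up for this sketch.

```python
def with_default_keys(someclass):
    """Class decorator: fill in any attribute left as None with 'ClassName.attribute'."""
    # Iterate over a copy of the items, since setattr modifies the class as we go.
    for att_name, att_val in list(vars(someclass).items()):
        if not att_name.startswith("__") and att_val is None:
            setattr(someclass, att_name, someclass.__name__ + "." + att_name)
    return someclass  # a decorator must hand the class back


@with_default_keys
class sample:
    mean = None
    std_dev = None


@with_default_keys
class generation_params:
    mean = None
    special_key = "non_standard_key_value"  # explicit values are left alone


# The generated keys are ordinary strings, so they work as dictionary keys.
datapoint = {sample.mean: 3.5, generation_params.mean: 3.0}
print(sorted(datapoint))
```

The class name is part of each generated key, so sample.mean and generation_params.mean cannot clash.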

And we’re done! Python endlessly amazes me in its succinct capabilities.

This framework was invaluable when I was rapidly iterating on some complicated simulations for my PhD, varying different things and trying to cross-compare a lot of different variables.

As mentioned, this is just a storage medium – you’ll still need to pick out those datapoints you actually want to plot. Next time, I’ll show you how to use itertools to filter your list of dictionaries, resulting in most of the functionality you’d get from a database, with none of the overheads of setting one up.

How Python makes serialization trivial.

In my last post, I explained a little about serialization in the context of small-to-medium scientific datasets, and the inherent problems that await unwary grad students.

In this post I’ll just give a quick reminder of how to serialize objects using Python – if you already know about `pickling,’ skip along to the next post.

Among Python’s many excellent libraries is one dedicated to object serialization: `pickle.’ There are a few things about this library that make it preferable by far to those I’ve tried in other languages. Of course, it takes care of all the basics such as handling tricky text and arbitrary precision real numbers, as you’d expect from a proper library. But what really makes it stand out is the fact that it has zero code overhead, in regard to defining data structures ahead of time (this is often a tedious, error prone task in other languages or with home-grown methods). Thanks to Python’s clever internals (i.e. introspection), you don’t have to tell the save or load routines ahead of time what to expect. Just open up a file and save your data, in whatever form you like best (within reason).
For example (taken from the python docs):

import pickle
#Create a dictionary of varied data
data1 = {'a': [1, 2.0, 3, 4+6j],
         'b': ('string', u'Unicode string'),
         'c': None}
simple_list = [1, 2, 3] #And a basic list

output = open('data.pkl', 'wb')  #Open a file for writing

pickle.dump(data1, output)   #`Save' the dictionary
pickle.dump(simple_list, output) #And the list
output.close() #Close the file when you're done with it.

To load the data back into our working environment, we simply do:

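import pickle
pkl_file = open('data.pkl', 'rb')
data1 = pickle.load(pkl_file) #Objects come back in the order they were saved
simple_list = pickle.load(pkl_file)
pkl_file.close() #Close the file when you're done with it.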

It’s that simple. While this doesn’t make for a particularly ground-breaking blog post (bear with me), it does mean that in about 6 lines of serialization-related code we’re already way ahead of most alternative solutions.
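
A small variation you may prefer: bundle everything into a single dictionary, so there is one dump and one load and no need to remember the order things were saved in. The sketch below also uses with blocks, so the files get closed automatically. The temporary directory and the file name bundle.pkl are just there to make the example runnable anywhere.

```python
import os
import pickle
import tempfile

data1 = {'a': [1, 2.0, 3, 4+6j],
         'b': ('string', u'Unicode string'),
         'c': None}
simple_list = [1, 2, 3]

# One container object means a single dump and a single load.
bundle = {'data1': data1, 'simple_list': simple_list}

path = os.path.join(tempfile.mkdtemp(), 'bundle.pkl')  # throwaway location

with open(path, 'wb') as output:  # the file is closed when the block ends
    pickle.dump(bundle, output)

with open(path, 'rb') as pkl_file:
    restored = pickle.load(pkl_file)

print(restored['simple_list'])
```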

Ok, so that’s the basics out of the way. Next time, how to use this to your advantage when analysing complex data…

Serialization: Doing it right.

Serialization prior to network transport via bus
(Image credit: hktang.)

In this post I’ll explain a little bit about a common problem for all computational research scientists, and some common pitfalls you want to avoid. Next time I’ll talk about a simple way of handling the problem using Python, which works really well for me.

Suppose you have some reasonably complex data. You might only have a few datasets, with a handful of datapoints in each, but suppose each datapoint then has a dozen different features – maybe many different instrument parameters, perhaps different numbers quantifying your result in different ways. So you have enough datapoints that you don’t want to be shuffling your data around by hand – it has to be machine readable somehow.

Usually, your experiment, analysis or simulation is computationally expensive and time consuming, so you won’t want to repeat it if you don’t have to. On the other hand, you will want to plot different combinations of parameters again and again as you explore the data, pretty up your plots, and so on. So here we have a fairly well defined problem – we want an easy way of recording many different parameters, in various different groupings depending upon the problem. That way, we can save all the information we can possibly think of when performing the analysis, and have it at our fingertips when we’re visualizing the data. This is generally referred to by computer scientists as serialization.

(Note that there’s a second reason for doing this: reproducibility. You should really keep a backed-up copy of all sensible data accompanying your results, and make it available with your paper as supplementary materials, for other researchers. Of course, if it’s simulated or heavily analysed data you should also publish your code, but that’s a post for another day).

Typical methods of serialization and their failings include:

  • Outputting everything to a text file in CSV format (cf data records since time began).
  • This is very simple to do, and has the advantage of being compatible with just about any language (especially if you’re loading it into Python with the csv module). However, if your data is anything but simple integers, you will quickly hit problems, for example escaping characters, rounding issues, etc.
    Then there’s the problem that this is effectively a flat database, which is to say it cannot encode a hierarchy in the data. So, for example, if you have a bunch of datapoints recorded with instrument configuration A, and another bunch with instrument configuration B, then you either have to record separate tables for your instrument configurations and the resulting data, or you have a table tens of columns wide with the instrument parameters duplicated for every datapoint in a set. This quickly becomes a pain as your data grows.

  • Using library routines – or (shudder) writing your own – to store data in a binary format (e.g. FITS tables) (cf. astronomy, mid-80s to present).
  • This has fewer issues with escaping and rounding, and may even allow for hierarchical groupings, but it’s often not very user friendly. Unlike a CSV file you can’t just open the data up in a spreadsheet program; you must use exactly the right software to decode the information. It can also be a pain to write in a flexible or reusable manner. What if you realise halfway through your experiments that you want to record additional parameters? Congratulations, you’ve just hit the problem of versioning, and your simple serialization problem just got a whole lot more complex.

  • Database management systems (cf. big business, data specialists, 1960s to present).
  • DBMS don’t really fit into the category of `serialization methods,’ but I’ll mention them for completeness. DBMS are what you use if you have a really, seriously large dataset, or you’re handling concurrent access requests, or any one of many situations where they’re invaluable. But for most research scientists, the overheads of setting one up, finding the relevant interface library for your language, and learning SQL, mean DBMS are overkill and unfit for purpose. Not to mention the versioning / flexibility issues, and the fact that data in a DBMS can be a pain to move around or back up.

  • Use a dedicated data serialization format, e.g. XML, JSON (cf. the internet, 1996 to present; a small subset of astronomers in the past five years).
  • XML and JSON are both standards for encoding data in a flexible, hierarchical manner. Both are text-based, so even if you lose the arcane code you used to output your data, you need nothing more than a text editor to go back and check your crucial values. If you’re programming in Java, C, or C++, and you want something better than the options outlined above, then JSON is likely a good bet for you (XML has an ugly syntax necessitated by a bunch of features you won’t need). Google for a good library with usage examples and get cracking. Be warned though, you’ll probably still have to put some effort into telling the JSON library just how to organize your information.
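
To illustrate the hierarchy point, here is a small sketch using the json module from Python's standard library (chosen only to keep the example short; the same layout applies from any language). Each instrument configuration is stored once, next to the datapoints recorded with it, rather than being duplicated on every row as in a flat CSV table. All field names and values are invented for the example.

```python
import json

# Hypothetical layout: one entry per instrument configuration,
# each holding the datapoints that were recorded with it.
datasets = [
    {"config": {"name": "A", "gain": 2.0, "filter": "red"},
     "datapoints": [{"flux": 1.25, "error": 0.1}, {"flux": 1.5, "error": 0.2}]},
    {"config": {"name": "B", "gain": 4.0, "filter": "blue"},
     "datapoints": [{"flux": 0.75, "error": 0.05}]},
]

# Encode to indented, human-readable text, then decode it again.
text = json.dumps(datasets, indent=2, sort_keys=True)
restored = json.loads(text)

print(restored[1]["config"]["filter"])
```

The encoded text can be opened in any editor, which is the "nothing more than a text editor" property described above.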

So, is there a better way? Yes! Python! (As usual). See next post.

Mini gotcha of the day: python subprocess, C++ and argument parsing

Mini Python gotcha of the day, due to naive usage of subprocess.Popen():

Trying to run a legacy analysis C++ program, I called it with a long list of arguments and flags (BTW I use getopt_pp for C++ parsing, it’s great). The code went something like this:

import subprocess

foo_prog = "myprog"
command = [foo_prog]
command.append("-f myflags")  # flag and its value as ONE list element

print(" ".join(command))
# myprog -f myflags

p = subprocess.Popen(command)

…which crashes pretty promptly.
What’s wrong with this picture?
Well, subprocess.Popen et al. are specifically designed to pass each list element to the program as a single argument, so that problematic characters (spaces in filenames, quotation marks etc.) won’t unintentionally split the argument in two.
As a result, the “-f myflags” gets passed as a single string, whereas the C++ program expects multiple strings (the flag identifier followed by the flags themselves).
Result? Errors complaining about command line options not being supplied, much wailing and gnashing of teeth.

Solution: Supply the flag and the parameters as two separate strings.
 
command.extend(["-f", "myflags"])

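Or, revert to the simpler ‘pass everything to shell’ mode, handing the shell a single command string (on POSIX systems, passing a list with shell=True runs only its first element as the command):

p = subprocess.Popen(" ".join(command), shell=True)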
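
If you want to see the difference for yourself, here is a minimal sketch that doesn't need the C++ program at all. As a stand-in child process it uses a one-line Python script that just prints the arguments it was given.

```python
import subprocess
import sys

# Stand-in child program: a one-line Python script that prints the arguments it received.
child = [sys.executable, "-c", "import sys; print(sys.argv[1:])"]

# One list element -> the child sees a single argument containing a space.
joined = subprocess.check_output(child + ["-f myflags"]).decode().strip()

# Two list elements -> the child sees the flag and its value separately.
split = subprocess.check_output(child + ["-f", "myflags"]).decode().strip()

print(joined)
print(split)
```

The first call reports one argument and the second reports two, which is exactly the difference the C++ option parser was tripping over.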

Hopefully this will save someone else 20 minutes of frustration.

blog = []

NB: In this blog, we shall mostly be subscribing to PEP 8.
