Fast, Easy, Cheap: Pick One

Just some other blog about computers and programming

Using the Go Regexp Package

When I first looked at the regexp package for Go, I was a bit confused. In particular, the part of the documentation that read:

There are 16 methods of Regexp that match a regular expression and identify the matched text. Their names are matched by this regular expression: Find(All)?(String)?(Submatch)?(Index)?

was a bit bewildering. Why are there 16 methods for performing regular expression matching? Coming from Python, where there are just match() and search(), I didn’t immediately understand the justification.

After reflecting a bit on the design of the Go language, I think I now understand why the package API is the way it is.

Firstly, Go doesn’t support optional function arguments, so you need to define differently named functions that accept different numbers of parameters. Hence all the variants such as Find(), FindAll(), FindAllSubmatch(), etc.

Secondly, because Go is statically typed and there is no support for function overloading, you must also define a variant of the function for each type it must support. The regexp package supports both the []byte and string types, hence the variants such as Find() and FindString().

Basic Matching

In the vast majority of cases my programs tend to use precompiled regexps; rarely do I have to construct a regexp at runtime. In the regexp package the best way to do this is with the MustCompile() function. Unlike Compile(), it panics instead of returning an error if the expression is invalid, which is what lets you create a regexp and assign it to a var at the package level.

Here’s an example:

package main

import (
  "fmt"
  "regexp"
)

var digitsRegexp = regexp.MustCompile(`\d+`)

func main() {
  someString := "1000abcd123"

  // Find just the leftmost match
  fmt.Println(digitsRegexp.FindString(someString))

  // Find all the matches (-1 means no limit)
  fmt.Println(digitsRegexp.FindAllString(someString, -1))
}

Try it out

A few things to note:

  • I’ve used backticks (`...`) instead of quotes ("...") around my regexp literal; this avoids having to escape backslashes.
  • I’m using the FindString() method because my input is a string and not a []byte.

Submatches

Many times regular expressions are much more complicated than this and benefit from the use of capturing groups, or subexpressions as they are called in the regexp package.

Subexpressions are handled by the *Submatch() series of methods. Instead of returning a single match string, these methods return a []string indexed by the match group position. The 0th item of the slice corresponds to the entire match.

For example:

package main

import (
  "fmt"
  "regexp"
)

var digitsRegexp = regexp.MustCompile(`(\d+)\D+(\d+)`)

func main() {
  someString := "1000abcd123"
  fmt.Println(digitsRegexp.FindStringSubmatch(someString))
}

Try it out

Named capturing groups

Once a regular expression becomes more complicated, it’s useful to be able to document the purpose of the matching groups. Fortunately the regexp package supports named capturing groups, much like Python. A named capturing group is created with the (?P<name>re) syntax:

package main

import (
  "fmt"
  "regexp"
)

var myExp = regexp.MustCompile(`(?P<first>\d+)\.(\d+)\.(?P<second>\d+)`)

func main() {
  fmt.Printf("%+v", myExp.FindStringSubmatch("1234.5678.9"))
}

Try it out

The names of the capturing groups can be retrieved via the SubexpNames() method, and their indexes within the slice match the corresponding indexes of the slice returned by FindStringSubmatch(). Capturing groups without a name, such as the middle one in the example expression, will simply have an empty string as their name.

Using this knowledge, it is possible to define a custom Regexp type which allows you to return your regular expression match as a map keyed by the subexpression name:

package main

import (
  "fmt"
  "regexp"
)

// embed regexp.Regexp in a new type so we can extend it
type myRegexp struct {
  *regexp.Regexp
}

// add a new method to our new regular expression type
func (r *myRegexp) FindStringSubmatchMap(s string) map[string]string {
  captures := make(map[string]string)

  match := r.FindStringSubmatch(s)
  if match == nil {
      return captures
  }

  for i, name := range r.SubexpNames() {
      // Ignore the whole regexp match and unnamed groups
      if i == 0 || name == "" {
          continue
      }
      captures[name] = match[i]
  }
  return captures
}

// an example regular expression
var myExp = myRegexp{regexp.MustCompile(`(?P<first>\d+)\.(\d+)\.(?P<second>\d+)`)}


func main() {
  fmt.Printf("%+v", myExp.FindStringSubmatchMap("1234.5678.9"))
}

You can run the code on the Go playground

This particular example ignores capturing groups without names, but they could conceivably be returned via a second return value or under special names in the map.

This post has just scratched the surface of the capabilities of the regexp package, but hopefully it illustrates some of its usage and gives you some ideas for how it can be extended.

Back to the Future

I finally ditched WordPress and moved to a more “modern” blogging platform, namely Octopress.

Now I’m writing posts in Vim and keeping them version controlled in a git repo. Much better.

Quickly Open the Source of a Python Module From the Command-line

Add this to your .bashrc:

pysource() {
    $EDITOR $(python -c "import ${1}; import inspect; print(inspect.getsourcefile(${1}))")
}

After reloading your .bashrc, just run something like: $ pysource logging

Your favorite editor should pop up and you can browse the module source. If you want to open a module from an existing editor window, there are probably plugins for that.

Trac Math Plugins Galore!

I’ve decided to put my Trac plugin hackery knowledge to use. This past week I became the official maintainer of the TracMathPlugin. This is a plugin we use on our Trac installation at work which lets people enter math equations in the wiki using LaTeX, and then renders them as PNGs for display. The plugin had been languishing for the last several years without a maintainer, and there were some open bugs which were quite easy to fix.

While this is cool, and somewhat useful, PNG output always seemed like a hack to me. Fortunately, I recently discovered MathJax. I’m not entirely clear on how it works, but it seems to implement TeX (or some subset of it) in JavaScript and then renders it client-side using web fonts. The output is truly beautiful. Modern browsers are awesome.

So of course, the next natural thing to do was to integrate MathJax into Trac. A few hours of hacking later, I had created TracMathJaxPlugin. It does what the name implies: it allows you to use MathJax within the Trac wiki and other wiki-enabled components like the ticket system.

The feature set is currently quite basic, but hopefully once I get it installed at work and some more people begin to use it I will flesh it out a bit more.

GridEngine TotD: Making an Advance Reservation for Resources Which Aren’t Currently Available

So you’re trying to set up an advance reservation for someone, something fairly simple. However, because the cluster is busy, you get an error:

qrsub -u bob -N openmpi_test -pe openmpi 8 -l h_vmem=1G -d 500:0:0
error: no suitable queues

This is because for an advance reservation to be successfully submitted, GridEngine must be able to guarantee that the requested resources will be available as soon as the reservation starts. Since we didn’t supply a start time, and the resources aren’t currently available, the request fails. One option is to figure out the walltimes of the currently running jobs, particularly those which are using the resources we want to reserve, and submit the reservation to start after they complete. For example, if we knew GridEngine could guarantee the resources would be available 10 hours from now, we’d use something like (assuming it was currently 08:08 on October 21st):

qrsub -u bob -N openmpi_test -pe openmpi 8 -l h_vmem=1G -d 500:0:0 -a 10211808

But what if we don’t know when the resources will be available, or more likely, we don’t care? We just want the reservation to be made as soon as possible. Well, the solution is to submit the reservation from a job which requires the same resources. The job won’t run until the resources are available, and since we can set it up with a known walltime, we can have it submit the reservation to start just after it completes. The most difficult bit is calculating the value for the -a switch. For that, I use a little bit of Python. Here’s my script:

#!/bin/bash
#$ -pe openmpi 8
#$ -N openmpi_test_submit
#$ -l h_vmem=1G,h_rt=0:5:0
#$ -m e
#$ -M kamil@example.com,bob@example.com
after=$( python -c 'import datetime; print((datetime.datetime.now() + datetime.timedelta(minutes=10)).strftime("%y%m%d%H%M"))' )
qrsub -u bob -N openmpi_test -pe openmpi 8 -l h_vmem=1G -d 500:0:0 -a $after

When submitted, the script will sit in the queue until the resources are available and then run with a walltime of 5 minutes. It will submit the advance reservation to start 10 minutes in the future. Upon completion it will email both myself and the user for whom the reservation is being created. Maybe a little convoluted, but it works.

Two ways I think GridEngine could improve the reservation system to support this use case:

  1. Allow relative values for the -a flag. For example, -a +0:10:0 to start the reservation 10 minutes into the future.

  2. Even better, allow reservations to be queued via a flag to qrsub. This would eliminate the need for such a script entirely.

New OOM Killer Implementation in Linux 2.6.36

As reported on Kernel Newbies, there is an almost complete rewrite of the Out Of Memory (OOM) killer algorithm in the recently released 2.6.36 kernel.

The LWN article “Another OOM killer rewrite” has a detailed explanation of the changes.

The addition of new heuristics for fork bomb detection and changes to child process detection are a welcome change, but I wonder about the modifications to the calculation of the badness score. It appears that process run times are no longer considered, only the overall percentage of total memory used. For a multi-job batch processing system, as implemented on a compute cluster, this is less than ideal.

Consider the following scenario:

  1. Job A, a long-running high-memory job, has been humming along on a node for quite some time

  2. Job B begins to run on the node

  3. Job B allocates a bunch of memory, and causes the OOM killer to be triggered

  4. Job A gets killed because it’s using more memory

In the old OOM killer implementation, Job A would be safe because its badness score would be considerably lower than that of Job B by virtue of it having run for a longer period. In the new implementation, it’s selected for killing simply because it’s using more memory.

As a cluster administrator, this is not what I want. In most cases the desired policy is that existing jobs should take precedence over newly executed ones. The old implementation gets this right pretty much by default, but the new one does not.

Speaking of defaults, another modification that comes along with the new badness heuristic is the new oom_score_adj variable in /proc, which is meant to replace the oom_adj variable. The new oom_score_adj is simply an absolute value that linearly adjusts the badness score. The oom_adj variable in the previous version acted as a bit-shift of the badness score, which I think was more useful for our purposes.
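
To see why the bit-shift matters, here’s a toy Python comparison (the numbers are invented; the real badness calculation is more involved):

badness = 100           # some hypothetical badness score

# old behavior: oom_adj=+10 bit-shifts the score (multiplicative)
print(badness << 10)    # 102400, several orders of magnitude larger

# new behavior: oom_score_adj=+10 is simply added (linear)
print(badness + 10)     # 110, barely a nudge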

For example, in our cluster we would set the oom_adj value of the node shepherd processes (pbs_mom or sge_execd) to +10 or so. This would ensure that the processes controlling the jobs, and more importantly their children, which are the actual jobs, would be considered for killing with a score several orders of magnitude higher than the system processes which actually keep the nodes alive.
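
As a rough illustration, here’s a minimal sketch of that tweak; /proc/<pid>/oom_adj is the real kernel interface, but the pidof-based lookup is just for illustration:

import subprocess

# Raise oom_adj for every sge_execd process on the node. Children inherit
# the value, so job processes become the preferred OOM victims.
for pid in subprocess.check_output(["pidof", "sge_execd"]).split():
    with open("/proc/%s/oom_adj" % pid.decode(), "w") as f:
        f.write("10\n")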

Combined with run time calculations, this simple tweak meant that the first processes to be considered for killing would be those launched by the most recently executed job, exactly what we want. With the new algorithm it’s still unclear to me how to achieve the same behavior.

It seems that the new implementation is designed more with desktop usage in mind. It makes the assumption that it’s more desirable to reclaim as much memory as possible by killing the fewest processes.

While the invocation of the OOM killer by the kernel can typically be avoided through careful tuning of your resource management / scheduling system, the facilities to enforce the limits are not perfect. Most resource managers implement memory “quotas” through a combination of the system’s ulimits facility and resource tracking by the shepherd process.

The major limitation of the ulimits approach is that it applies on a per-process basis. If a job is granted a limit of 1 GB of memory, nothing stops it from spawning multiple processes, each of which consumes up to the limit. This is one reason why a second level of enforcement is performed by the shepherd process. In GridEngine this is implemented by assigning a unique job-specific extended group ID to the job processes, and then tallying up the memory usage of all processes with that ID. This catches the case where multiple processes use more memory than requested, but to keep overhead down the usage is only polled every few minutes. A job could still allocate and dirty more than its share of memory between polls and cause the OOM killer to be triggered.
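
For the curious, here’s a hedged sketch of what that accounting might look like; the extended group ID value is a made-up example and the parsing is simplified, but the /proc fields are real:

import glob
import re

JOB_GID = "20001"  # hypothetical extended group ID assigned to the job

def job_memory_kb(gid=JOB_GID):
    # Sum VmRSS over all processes whose supplementary groups include gid.
    total = 0
    for path in glob.glob("/proc/[0-9]*/status"):
        try:
            with open(path) as f:
                fields = dict(line.split(":", 1) for line in f if ":" in line)
        except IOError:
            continue  # process exited while we were scanning
        if gid in fields.get("Groups", "").split():
            m = re.match(r"\s*(\d+)\s*kB", fields.get("VmRSS", ""))
            if m:
                total += int(m.group(1))
    return total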

Since it seems the OOM killer can never be avoided entirely, it’s important that the algorithm is tunable enough that an administrator can control its behavior with a high degree of certainty. The new implementation in 2.6.36 seems to take some steps backwards in that respect.

Speeding Up SQLAlchemy Collections With Innerjoin

Recently I’ve been working on a project wherein we rewrote a body of code whose architecture was… well, less than ideal. We took a mess of an object model and built something more elegant and structured. The new code design was a huge win, but we were left with one major problem: in the process of the redesign we had added another table to our database. Fetching all our objects now required a join, and for some reason this was taking 4 hours for our 300k item dataset instead of the previous 5 minutes. So our code was much nicer, but a lot slower. Not exactly impressive. After some digging, I found the following:

In SQLAlchemy 0.5 the default (and, it seems, only) type of join for an eagerly loaded relation is a LEFT OUTER JOIN. While this is great for cases where you may have a null relation, that can’t happen in our data model. Furthermore, it appears the SQLite implementation of this type of join is excruciatingly slow. For our purposes a simple and fast inner JOIN would be sufficient.

Having identified the problem, I set about trying to figure out how to convince SQLAlchemy’s ORM layer to issue a standard JOIN. I investigated all sorts of solutions, including writing custom select queries to load collections, but nothing really seemed as clean as I would have liked.

Someone on #sqlalchemy on irc.freenode.org pointed me to a new parameter of the relation() function in 0.6: innerjoin.

So I changed the mapper code which previously looked like:

mapper(Entry, TABLES['entries'], properties={'models': relation(Model, backref='entry')})

to:

mapper(Entry, TABLES['entries'], properties={'models': relation(Model, backref='entry', innerjoin=True, lazy=False)})

and SQLAlchemy dutifully started putting out JOINs instead of the previously inefficient SQL code!

This tiny change decreased the execution time of the code on our sample database from around 4 hours down to 2 minutes. A 120x speedup! Great news for me. Overall the execution of the new code is still not quite as fast as I think it could be, and I’ll be doing some profiling to investigate exactly why.

Eclipse, Python, and File Extensions

So I’ve recently been giving Eclipse (and PyDev) a try for developing Python. Despite my initial hesitation about switching away from Vim, I’ve found it to be quite productive. My text editing speed is not nearly as good as in Vim due to the less capable text editor, but I think the IDE capabilities more than make up for it by drastically shortening my save-run cycle. I’m aware of Eclim, the Vim interface to Eclipse, but have thus far been unable to get it to work on my system.

One other major sticking point for me is that we tend to organize our projects so that the main modules are in the <packagename>/ directory while the scripts users interact with are in a bin/ directory.

The problem is that the scripts do not have an extension, and apparently Eclipse can only associate file types with an extension. There’s literally no other way to tell it that these files are also Python scripts. There’s even a bug that’s been open since 2004 regarding the issue!

So there are two possible solutions:

  1. Rename all my scripts to have a .py extension

  2. Manually open each one by right-clicking and selecting “Open with PyDev”

For now I’m going to opt for option 2, since we just standardized on having user scripts without a .py extension. I’ll see how that works out and revisit the topic later.

Making the Alt Key Work in X11 on OS X

It seems something is broken about the default keyboard map in X11 on OS X. I discovered this while trying to use an X-forwarded version of Eclipse from our cluster and couldn’t get any of the shortcuts that use the Alt key to work. Fortunately I found a nice fix in the Inkscape forum.

In short:

  1. Enable “Use the system keyboard layout” in your X11 preferences.

  2. Open an xterm and run xmodmap -pke > ~/.Xmodmap

  3. Edit the file. Find the lines that say:

    keycode 66 = Mode_switch
    keycode 69 = Mode_switch

    and change them to:

    keycode 66 = Meta_L
    keycode 69 = Meta_R

  4. Disable “Use the system keyboard layout” and “Emulate three button mouse” in your X11 preferences.

  5. Restart X11

Now your Alt key should work fine in your X11 apps.

Easy Context Managers With Contextlib

Do you ever use functions in Python that you wish had a context manager, but don’t? Are you too lazy to write one? Then use the contextlib module!

I often need to work with gzipped files, but unfortunately gzip.open() doesn’t work as a context manager. Fortunately contextlib has a convenient closing() function that makes converting any file-like object to a context manager trivial. For example, previously where I had code like:

import gzip

# ...

f = gzip.open(path)
try:
    do_stuff(f)
    # ...
finally:
    f.close()

I can now replace it with:

import gzip
from contextlib import closing

# ...

with closing(gzip.open(path)) as f:
    do_stuff(f)
    # ...

Apart from the closing() function, contextlib includes a few other convenient context manager building functions:

contextlib.contextmanager(func) can be used as a decorator to turn any function with a single yield into a context manager.
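
Here’s a minimal sketch (tag() is an invented example, not part of contextlib):

from contextlib import contextmanager

@contextmanager
def tag(name):
    print("<%s>" % name)   # runs on entering the with block
    yield
    print("</%s>" % name)  # runs on leaving the with block

with tag("p"):
    print("hello")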

contextlib.nested(m1, m2, ...) allows you to nest multiple context managers, but has been superseded in Python 2.7 by the new multi-manager with syntax.
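
For example, instead of nested() you can now write (the file names are just placeholders):

with open("input.txt") as src, open("output.txt", "w") as dst:
    dst.write(src.read())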

contextlib is available as of Python 2.5.