Fast, Easy, Cheap: Pick One

Just some other blog about computers and programming

Using the Go Regexp Package

When I first looked at the regexp package for Go, I was a bit confused. In particular the part of the documentation that read:

There are 16 methods of Regexp that match a regular expression and identify the matched text. Their names are matched by this regular expression: Find(All)?(String)?(Submatch)?(Index)?

was a bit bewildering. Why are there 16 methods for performing regular expression matching? Coming from Python, where there are just match() and search(), I didn’t immediately understand the justification.

After reflecting a bit on the design of the Go language, I think I now understand why the package API is the way it is.

Firstly, Go doesn’t support optional function arguments, so you need to define differently named functions that accept a different number of parameters. Hence all the variants of the function such as Find(), FindAll(), FindAllSubmatch() etc.

Secondly, because Go is statically typed and has no support for function overloading, you must also define a variant of the function for each type it must support. The regexp package supports both the []byte and string types, hence variants such as Find() and FindString().
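
To make the naming scheme concrete, here is a small sketch that runs a few of the variants over the same input. The pattern and the variants() helper are names I made up for illustration; the methods themselves are the ones from the regexp package.

```go
package main

import (
	"fmt"
	"regexp"
)

// words is a hypothetical pattern used only for this illustration.
var words = regexp.MustCompile(`[a-z]+`)

// variants calls four of the sixteen methods on the same input so that
// their argument and return types can be compared side by side.
func variants(input string) (first []byte, firstStr string, loc []int, all []string) {
	first = words.Find([]byte(input))    // []byte in, []byte out
	firstStr = words.FindString(input)   // string in, string out
	loc = words.FindStringIndex(input)   // byte offsets of the leftmost match
	all = words.FindAllString(input, -1) // every match; -1 means no limit
	return
}

func main() {
	first, firstStr, loc, all := variants("go is fun")
	// prints: go go [0 2] [go is fun]
	fmt.Printf("%s %s %v %v\n", first, firstStr, loc, all)
}
```

Each optional part of the method name changes either the input type (String), the number of results (All), or what is reported for each match (Index, Submatch).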

Basic Matching

In the vast majority of cases my programs use precompiled regexps; rarely do I have to construct a regexp at runtime. In the regexp package the best way to do this is with the MustCompile() function, which panics if the pattern fails to compile instead of returning an error. That makes it convenient to create a regexp and assign it to a var at the package level.

Here’s an example:

package main

import (
  "fmt"
  "regexp"
)

var digitsRegexp = regexp.MustCompile(`\d+`)

func main() {
  someString := "1000abcd123"

  // Find just the leftmost
  fmt.Println(digitsRegexp.FindString(someString))

  // Find all (-1) the matches
  fmt.Println(digitsRegexp.FindAllString(someString, -1))
}


A few things to note:

  • I’ve used backticks (`...`) instead of quotes ("...") around my regexp literal to avoid having to escape backslashes.
  • I’m using the FindString() method because my input is a string and not a []byte.
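
For the rarer case where the pattern is only known at runtime, the package also provides regexp.Compile(), which returns an error rather than panicking. A minimal sketch, where compileUserPattern() is a hypothetical helper of my own and not part of the package:

```go
package main

import (
	"fmt"
	"regexp"
)

// compileUserPattern is a hypothetical helper: it compiles a pattern that is
// only known at runtime and reports a bad pattern as an error, not a panic.
func compileUserPattern(pattern string) (*regexp.Regexp, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, fmt.Errorf("invalid pattern %q: %w", pattern, err)
	}
	return re, nil
}

func main() {
	// A valid pattern compiles and can be used like any other Regexp.
	if re, err := compileUserPattern(`\d+`); err == nil {
		fmt.Println(re.FindString("1000abcd123"))
	}

	// An unbalanced parenthesis is rejected with an error.
	if _, err := compileUserPattern(`(\d+`); err != nil {
		fmt.Println("rejected:", err)
	}
}
```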

Submatches

Regular expressions are often more complicated than this and benefit from the use of capturing groups, or subexpressions as they are called in the regexp package.

Subexpressions are handled by the *Submatch() series of methods. Instead of returning a single match string, the string variants of these methods return a []string which is indexed by the match group position. The 0th item of the slice corresponds to the entire match.

For example:

package main

import (
  "fmt"
  "regexp"
)

var digitsRegexp = regexp.MustCompile(`(\d+)\D+(\d+)`)

func main() {
  someString := "1000abcd123"
  fmt.Println(digitsRegexp.FindStringSubmatch(someString))
}

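
The All variant combines with Submatch as well: FindAllStringSubmatch() returns one []string per match. A short sketch, assuming a made-up key=value pattern (pairRegexp and pairs() are illustrative names only):

```go
package main

import (
	"fmt"
	"regexp"
)

// pairRegexp is a hypothetical pattern: a key of letters, "=", a value of digits.
var pairRegexp = regexp.MustCompile(`([a-z]+)=(\d+)`)

// pairs returns one []string per match; element 0 is the whole match and
// elements 1 and 2 are the two capturing groups.
func pairs(input string) [][]string {
	return pairRegexp.FindAllStringSubmatch(input, -1)
}

func main() {
	for _, m := range pairs("a=1, b=22, c=333") {
		// first line printed: whole="a=1" key="a" value="1"
		fmt.Printf("whole=%q key=%q value=%q\n", m[0], m[1], m[2])
	}
}
```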

Named capturing groups

Once a regular expression becomes more complicated, it’s useful to be able to document the purpose of the matching groups. Fortunately the regexp package supports named capturing groups, much like Python. A named capturing group is created with the (?P<name>re) syntax:

package main

import (
  "fmt"
  "regexp"
)

var myExp = regexp.MustCompile(`(?P<first>\d+)\.(\d+)\.(?P<second>\d+)`)

func main() {
  fmt.Printf("%+v", myExp.FindStringSubmatch("1234.5678.9"))
}


The names of the capturing groups can be retrieved via the SubexpNames() method, and their index within that slice matches the corresponding index of the slice returned by FindStringSubmatch(). Capturing groups without a name, such as the middle one in the example expression, simply have an empty string as their name.
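
To see how the two slices line up, the following sketch prints each index with its group name and captured text. The versionExp variable and the describe() helper are my own illustrative names, and the pattern is modelled on the example expression.

```go
package main

import (
	"fmt"
	"regexp"
)

// versionExp has two named groups around one unnamed group.
var versionExp = regexp.MustCompile(`(?P<first>\d+)\.(\d+)\.(?P<second>\d+)`)

// describe pairs every group index and name with the text it captured.
// Index 0 is the whole match and always has an empty name.
func describe(s string) []string {
	match := versionExp.FindStringSubmatch(s)
	if match == nil {
		return nil
	}
	var out []string
	for i, name := range versionExp.SubexpNames() {
		out = append(out, fmt.Sprintf("%d %q = %q", i, name, match[i]))
	}
	return out
}

func main() {
	for _, line := range describe("1234.5678.9") {
		fmt.Println(line)
	}
	// prints:
	// 0 "" = "1234.5678.9"
	// 1 "first" = "1234"
	// 2 "" = "5678"
	// 3 "second" = "9"
}
```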

Using this knowledge, it is possible to define a custom Regexp type which allows you to return your regular expression match as a map keyed by the subexpression name:

package main

import (
  "fmt"
  "regexp"
)

// embed regexp.Regexp in a new type so we can extend it
type myRegexp struct {
  *regexp.Regexp
}

// add a new method to our new regular expression type
func (r *myRegexp) FindStringSubmatchMap(s string) map[string]string {
  captures := make(map[string]string)

  match := r.FindStringSubmatch(s)
  if match == nil {
    return captures
  }

  for i, name := range r.SubexpNames() {
    // Ignore the whole regexp match and unnamed groups
    if i == 0 || name == "" {
      continue
    }

    captures[name] = match[i]
  }
  return captures
}

// an example regular expression
var myExp = myRegexp{regexp.MustCompile(`(?P<first>\d+)\.(\d+)\.(?P<second>\d+)`)}

func main() {
  fmt.Printf("%+v", myExp.FindStringSubmatchMap("1234.5678.9"))
}

You can run the code on the Go playground

This particular example ignores capturing groups without names, but they could be returned as a second return value or via special names in the map.
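
As a rough sketch of the second-return-value idea, the function below puts named groups in a map and collects unnamed groups, in order, in a slice. splitCaptures() and exampleExp are hypothetical names of my own, and I’ve written it as a plain function rather than a method to keep the example short.

```go
package main

import (
	"fmt"
	"regexp"
)

// exampleExp is a sample expression with one unnamed group in the middle.
var exampleExp = regexp.MustCompile(`(?P<first>\d+)\.(\d+)\.(?P<second>\d+)`)

// splitCaptures is a hypothetical helper, not part of the regexp package.
// Named groups go into the map; unnamed groups are returned in order
// as a second value.
func splitCaptures(re *regexp.Regexp, s string) (map[string]string, []string) {
	named := make(map[string]string)
	var unnamed []string

	match := re.FindStringSubmatch(s)
	if match == nil {
		return named, unnamed
	}

	for i, name := range re.SubexpNames() {
		switch {
		case i == 0:
			// index 0 is the whole match, so skip it
		case name == "":
			unnamed = append(unnamed, match[i])
		default:
			named[name] = match[i]
		}
	}
	return named, unnamed
}

func main() {
	named, unnamed := splitCaptures(exampleExp, "1234.5678.9")
	// prints: map[first:1234 second:9] [5678]
	fmt.Println(named, unnamed)
}
```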

This post has just scratched the surface of the capabilities of the regexp package but hopefully it’s illustratative of some of the usage and gives you some ideas for how it can be extended.