When I first looked at the regexp package for Go, I was a bit confused. In particular, the part of the documentation that read:
There are 16 methods of Regexp that match a regular expression and identify the matched text. Their names are matched by this regular expression: Find(All)?(String)?(Submatch)?(Index)?
was a bit bewildering. Why are there 16 methods for performing regular
expression matching? Coming from Python, where there are match() and search(),
I didn’t immediately understand the justification.
After reflecting a bit on the design of the Go language, I think I now understand why the package API is the way it is.
Firstly, Go doesn’t support optional function arguments, so you need to
define differently named functions that accept a different number of parameters.
Hence all the variants of the function such as Find(), FindAll(),
FindAllSubmatch() etc.
Secondly, because Go is statically typed and there is no support for function
overloading, you must also define a variant of the function for each type it must
support. The regexp package supports both the []byte and string types, hence
variants such as Find() and FindString().
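To make the naming scheme concrete, here is a minimal sketch that calls four of the sixteen methods on the same input. The digits pattern and the sample input are my own illustrative assumptions.

```go
package main

import (
	"fmt"
	"regexp"
)

// digits is a hypothetical pattern matching one or more decimal digits.
var digits = regexp.MustCompile(`[0-9]+`)

func main() {
	input := "order 66 shipped in 3 days"

	// Find: first match, []byte in and []byte out.
	fmt.Printf("%s\n", digits.Find([]byte(input))) // prints: 66

	// FindString: the same search, string in and string out.
	fmt.Println(digits.FindString(input)) // prints: 66

	// FindAllString: every match. The second argument caps the number
	// of matches; a negative value means no limit.
	fmt.Println(digits.FindAllString(input, -1)) // prints: [66 3]

	// FindStringIndex: start and end byte offsets of the first match.
	fmt.Println(digits.FindStringIndex(input)) // prints: [6 8]
}
```

Note that the All variants take an extra count argument, which is exactly the sort of parameter that might have been optional in a language with default arguments.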
Basic Matching
In the vast majority of cases my programs tend to use precompiled regexps.
Rarely do I have to construct a regexp at runtime. In the regexp package the
best way to do this is with the MustCompile() function. Because it returns only
the compiled *Regexp and panics on an invalid expression rather than returning
an error, it lets you create a regexp and assign it to a var at the package level.
Here’s an example:
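A minimal sketch of a precompiled regexp, assuming a hypothetical date pattern and helper (datePattern and firstDate are names I made up for illustration):

```go
package main

import (
	"fmt"
	"regexp"
)

// datePattern is compiled once, when the package is initialised.
// MustCompile panics if the expression is invalid, so a typo in the
// pattern shows up at startup. The pattern itself is a hypothetical example.
var datePattern = regexp.MustCompile(`\d{4}-\d{2}-\d{2}`)

// firstDate returns the first date-like substring of s,
// or the empty string if there is none.
func firstDate(s string) string {
	return datePattern.FindString(s)
}

func main() {
	fmt.Println(firstDate("released on 2013-05-13, patched later")) // prints: 2013-05-13
	fmt.Printf("%q\n", firstDate("no date here"))                   // prints: ""
}
```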
A few things to note:
- I’ve used backticks (`...`) instead of quotes ("...") around my regexp literal to avoid having to escape backslashes.
- I’m using the FindString() method because my input is a string and not a []byte.
Submatches
Regular expressions are often much more complicated than this and benefit from capturing groups, or subexpressions as they are called in the regexp package.
Subexpressions are handled by the *Submatch() series of methods. Instead of
returning a single match string, a method such as FindStringSubmatch() returns
a []string which is indexed by the match group position. The 0th item of the
slice corresponds to the entire match.
For example:
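A minimal sketch of submatch handling, assuming a hypothetical key=value pattern (pairPattern and splitPair are illustrative names, not part of the regexp package):

```go
package main

import (
	"fmt"
	"regexp"
)

// pairPattern is a hypothetical pattern for a key=value pair.
// Group 1 captures the key and group 2 captures the value.
var pairPattern = regexp.MustCompile(`(\w+)=(\w+)`)

// splitPair returns the key and value of the first key=value pair in s.
// ok is false when s contains no such pair.
func splitPair(s string) (key, value string, ok bool) {
	m := pairPattern.FindStringSubmatch(s)
	if m == nil {
		return "", "", false
	}
	// m[0] is the entire match; m[1] and m[2] are the two groups.
	return m[1], m[2], true
}

func main() {
	fmt.Printf("%q\n", pairPattern.FindStringSubmatch("set colour=blue now"))
	// prints: ["colour=blue" "colour" "blue"]

	key, value, ok := splitPair("set colour=blue now")
	fmt.Println(key, value, ok) // prints: colour blue true
}
```

FindStringSubmatch() returns nil when nothing matches, so it is worth checking for that before indexing into the slice.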
Named capturing groups
Once a regular expression begins to become more complicated, it’s useful to be
able to document the purpose of the matching groups. Fortunately the regexp
package supports named capturing groups, much like Python. A named capturing
group is created with the (?P<name>re) syntax:
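A minimal sketch, assuming a hypothetical three-part name pattern in which the first and last groups are named and the middle one is not:

```go
package main

import (
	"fmt"
	"regexp"
)

// nameRe is a hypothetical pattern with two named groups ("first" and
// "last") surrounding one unnamed group.
var nameRe = regexp.MustCompile(`(?P<first>\w+)\s(\w+)\s(?P<last>\w+)`)

func main() {
	// Index 0 stands for the entire match and never has a name.
	names := nameRe.SubexpNames()
	fmt.Printf("%q\n", names) // prints: ["" "first" "" "last"]

	match := nameRe.FindStringSubmatch("Alan Mathison Turing")
	if match == nil {
		fmt.Println("no match")
		return
	}
	// names and match share the same indexing.
	for i, name := range names {
		fmt.Printf("%d name=%q text=%q\n", i, name, match[i])
	}
}
```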
The names of the capturing groups can be retrieved via the SubexpNames()
method, and their index within the slice will match the corresponding index of
the slice returned by FindStringSubmatch(). Capturing groups without a name,
such as the middle one in the example expression, will simply have an empty
string.
Using this knowledge it is possible to define a custom Regexp type which allows you to return your regular expression match as a map keyed by the subexpression name:
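A minimal sketch of such a type. The names namedRegexp and FindStringSubmatchMap are my own assumptions for illustration and are not part of the regexp package:

```go
package main

import (
	"fmt"
	"regexp"
)

// namedRegexp is a hypothetical wrapper. Embedding *regexp.Regexp keeps
// all of the standard methods available on the new type.
type namedRegexp struct {
	*regexp.Regexp
}

// FindStringSubmatchMap returns the named groups of the first match in s,
// keyed by group name. It returns nil when s does not match.
func (r namedRegexp) FindStringSubmatchMap(s string) map[string]string {
	match := r.FindStringSubmatch(s)
	if match == nil {
		return nil
	}
	captures := make(map[string]string)
	for i, name := range r.SubexpNames() {
		if name == "" {
			// Skips the entire match (index 0) and any unnamed groups.
			continue
		}
		captures[name] = match[i]
	}
	return captures
}

// nameRe is a hypothetical pattern with two named groups and one unnamed group.
var nameRe = namedRegexp{regexp.MustCompile(`(?P<first>\w+)\s(\w+)\s(?P<last>\w+)`)}

func main() {
	m := nameRe.FindStringSubmatchMap("Alan Mathison Turing")
	fmt.Println(m["first"], m["last"]) // prints: Alan Turing
	fmt.Println(len(m))                // prints: 2
}
```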
You can run the code on the Go playground
This particular example ignores capturing groups without names, but they could be returned as a second return value or via special names in the map.
This post has just scratched the surface of the capabilities of the regexp package, but hopefully it’s illustrative of some of the usage and gives you some ideas for how it can be extended.