The Exorcism of Regular Expressions

Ralph Ritoch

In all of my years programming I have never run into a demon more powerful and scary than the regular expression.  I still cringe every time I run into one of these beasts in code.  I am not talking about the fluffy regular expressions that you may use to verify that your input is all letters, or numbers. I’m talking about the unwieldy regular expressions that to even the trained eye look like a random sequence of characters. The practice of using giant, undocumented, regular expressions is all too common, so let us explore some techniques to exorcise these demons by documenting them.

Java Example:

Pattern.compile("(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])");

You may have seen some horror movies that suggest the only way to exorcise a demon is to discover it’s name, and that is the first thing we need to do. While regular expressions seem similar between different programming languages and libraries there are actually subtle differences which can make the same regular expression behave differently depending on the library being used and the options which are passed to the regular expression engine.  To discover the name of a demon, you would start with a book of demon names. To discover the named parts used in a regular expression, you need to locate and read the documentation of the regular expression library you are using.

The first step in an exorcism of regular expressions is to document the location of your book of names.  I don’t think I have ever seen it done before, but documenting the location of the regular expression documentation you are using can be very helpful. If you have used the wrong documentation, or the documentation becomes outdated, another developer may realize this and be able to correct bugs by comparing the documentation you used, against the documentation for the system currently being applied.

Example:

/* See: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html */

Pattern.compile("(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])");

Now that wasn’t very hard now was it? You may have just saved future developers from having headaches and sleepless nights. At this point your demon regular expression should be quivering in fear as you have successfully contained it in a trap but it is not yet time to let your guard down.  The documentation may be incomplete, it may contain undocumented features, and it may contain undocumented bugs. Your work has just begun!

We must next begin dissecting our regular expression. To do this we will copy our regular expression into a block comment and isolate all of the groupings which are in parenthesis.

1. We will put each parenthesis on it’s own line, increasing the indentation after opening parenthesis ‘(‘ and decreasing the indentation before closing parenthesis.

2. Once that is completed, go back over the code to make repeated sections more clear. Groups and sequences are repeated with either the '+', '*' being included after the group or sequence. We will pull the +/* character following a closing parenthesis onto the line with the parenthesis, and we will divide our repeated sequences onto their own lines. Be sure to skip any '+' or '*' character that is in a character class which are grouped in square brackets. These character classes can be nested.  The procedure to follow is to start counting at the number 1 when you see a open bracket, increase your count for each opening bracket ‘[‘ that isn’t preceded by a odd number of forward slashes, and decrease your count for each closing bracket ‘]’. When your count gets back to zero you are outside of the character class and any +/* you run  into are repeaters. Repeaters can also be defined by curly braces with one or two numbers in them, such as {1} or {1,2}, these will also need to be treated the same as you treat the '+' or '*' repeaters.

3. For this case we want to continue this process by separating out optional sequences, and separating out non-capturing groups. The non-capturing groups are indicated by the sequence “?:” at the beginning of a group. We will move the non-capturing sequence into the line with the opening parenthesis defining our groups, and we will separate out our optional sequences which end with a ‘?’. The ‘?’ character can be used in a character class so skip any character classes defined by nested square brackets as we did in the previous step.

Note: Don’t be fooled by escaped characters. If you see a odd number of forward slashes before the parenthesis, '*' or '+', it is to be treated as a normal character and not as a group or repeater.

See: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

/* See: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html */

/*

(?:

[a-z0-9!#$%&'*+/=?^_`{|}~-]+

(?:

\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+

)*

|”

(?:

[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f]

)*

)

@

(?:

(?:

[a-z0-9]

(?:

[a-z0-9-]*

[a-z0-9]

)?

\.

)+

[a-z0-9]

(?:

[a-z0-9-]*

[a-z0-9]

)?

|\[

(?:

(?:

25[0-5]|2[0-4][0-9]|[01]?

[0-9][0-9]?

)

\.

){3}

(?:

25[0-5]|2[0-4][0-9]|[01]?

[0-9][0-9]?

|[a-z0-9-]*

[a-z0-9]:

(?:

[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f]

)+

)

\]

)

*/

Pattern.compile(“(?:[a-z0-9!#$%&’*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&’*+/=?^_`{|}~-]+)*|”(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*”)@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])”);

Finally we need to name our character classes and our groups. We do this by replacing our character sequences with a name, and documenting the name outside of our dissection, leaving all repeaters and optional specifiers in place. Lastly we extract each of our groups and name them, starting at the deepest groups which don’t contain any subgroups, and working our way out of all groups. Coming up with names can be a difficult task, to simplify this start with a basic naming scheme, such as cc_# and g_#, increasing the number for each new character class and group. Any sequence of characters that aren’t part of a character class should be placed in quotes. These quoted parts can be added to your named character classes by adding spaces before or after the quoted parts. This makes it easier to separate your names from parts of the regular expression. Once every group and character class have a name you can go back and replace the names with names that are more meaningful. Any groups or sequences that don’t have an obvious meaningful name don’t need to be renamed. Finally we can establish a name for our entire regular expression. In this case I called it regex.

// See: http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

/*
cc_user_char = [a-z0-9!#$%&'*+/=?^_`{|}~-]
cc_2 = [\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]
cc_3 = [\x01-\x09\x0b\x0c\x0e-\x7f]
cc_alphanum = [a-z0-9]
cc_alphanum_or_dash = [a-z0-9-]
cc_zero_to_five = [0-5]
cc_zero_to_four = [0-4]
cc_number = [0-9]
cc_9 = [01]
cc_10 = [\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]
g_user_suffix = (?: '\.' cc_user_char+ )
g_2 = (?: c_2 | '\\' c_3)
g_3 = (?: cc_alphanum_or_dash* cc_alphanum)
g_4 = (?: cc_alphanum_or_dash* cc_alphanum)
g_nbyte = (?: '25' cc_zero_to_five | '2' cc_zero_to_four cc_number | cc_9? cc_number cc_number?)
g_6 = (?: cc_10 | '\\' cc_3)
g_user = (?: cc_user_char+ g_user_suffix* |'"' g_2* '"')
g_domain_prefix = (?: cc_alphanum g_3? '\.')
g_lead_ip_part = (?: g_nbyte '\.')
g_ip_suffix = (?: '25' cc_zero_to_four | '2' cc_zero_to_four cc_number | cc_9 ? cc_alphanum cc_number? | cc_alphanum_or_dash* cc_alphanum ':' g_6+)
g_domain = (?: g_domain_prefix+ cc_4 g_4? | '\[' g_lead_ip_part{3} g_ip_suffix '\]')
regex = g_user '@' g_domain
*/
Pattern.compile("(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])");

It is now much easier to see what this code does. It parses an email address. As soon as the x @ y pattern emerged it became obvious. If you are familiar with the RFC’s for email address than you can clearly see this regular expression is incomplete. It doesn’t handle comments, and it doesn’t handle the normal format, “Name” <user@host>. Now that you have exorcized this daemon you have complete control of it, and can use it to do your bidding!