String Formatting with Regex

My latest favorite trick with regular expressions is to shortcut string formatting. We’ve all written some code like this:

if (/* string is not formatted a certain way */)
{
    /* make it formatted that way */
}

Now, there’s nothing wrong with that code, but for simple examples you could do it all in one step with a regular expression!

Here are a few examples:

// remove "www." from a domain if one exists
// domain.com     -> domain.com
// www.domain.com -> domain.com
Regex.Replace(input, @"^(?:www\.)?(.+)$", "$1");

// format phone number
// 1234567890       -> 123-456-7890
// (123) 456-7890   -> 123-456-7890
// (123) 456 - 7890 -> 123-456-7890
Regex.Replace(input, @"^\(?(\d{3})\)?\s*(\d{3})\s*-?\s*(\d{4})$", "$1-$2-$3");
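
For comparison, here is a rough JavaScript sketch of the same two replacements using String.prototype.replace. The helper names stripWww and formatPhone are made up for this illustration, and the dot after "www" is escaped so that it only matches a literal period.

```javascript
// Hypothetical helpers; the names are illustrative, not from the C# snippets above.

// Remove a leading "www." if one exists.
function stripWww(domain) {
    return domain.replace(/^(?:www\.)?(.+)$/, "$1");
}

// Normalize a 10-digit phone number to the 123-456-7890 layout.
function formatPhone(phone) {
    return phone.replace(/^\(?(\d{3})\)?\s*(\d{3})\s*-?\s*(\d{4})$/, "$1-$2-$3");
}

console.log(stripWww("www.domain.com"));      // domain.com
console.log(formatPhone("(123) 456 - 7890")); // 123-456-7890
```

Input that doesn't match the pattern comes back unchanged, which is what lets the replace stand in for the "if it isn't formatted, format it" check.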

Yay, regular expressions!

Validate Time Entry with JavaScript

I was cleaning up a web form that had a textbox for the user to enter a time value. The thing I don’t love about using a textbox to capture a time value is that there’s no validation. The user might enter a bad value and not realize it, and I’d rather let them know right away than display a message after they try to submit the form.

Surely there’s something we can do with JavaScript and regular expressions to create an intuitive experience for the user, right?

Format Checkin’ Regular Expression

The first thing we’re going to need is a regular expression that can be used to determine if an entry is valid or not. I decided to use a pair: one for standard time and one for military time.

function validateTime(time) {
    if (!time) {
        return false;
    }
    var military = /^\s*([01]?\d|2[0-3]):[0-5]\d\s*$/i;
    var standard = /^\s*(0?\d|1[0-2]):[0-5]\d(\s+(AM|PM))?\s*$/i;
    // test() returns a strict boolean instead of a match array or null
    return military.test(time) || standard.test(time);
}
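
To get a feel for what the pair of patterns accepts, here is a small self-contained check. It copies the two expressions into a function called isValidTime (a name I made up for this sketch) and runs a handful of sample entries through it.

```javascript
// Hypothetical copy of the validator, returning a strict boolean.
function isValidTime(time) {
    if (!time) {
        return false;
    }
    var military = /^\s*([01]?\d|2[0-3]):[0-5]\d\s*$/;
    var standard = /^\s*(0?\d|1[0-2]):[0-5]\d(\s+(AM|PM))?\s*$/i;
    return military.test(time) || standard.test(time);
}

// Sample entries chosen for illustration only.
var samples = ["13:45", "9:05 PM", "9:05PM", "24:00", "12:60", ""];
samples.forEach(function (s) {
    console.log(JSON.stringify(s) + " -> " + isValidTime(s));
});
// "13:45" and "9:05 PM" pass; the other four are rejected.
```

Note that "9:05PM" fails because the standard pattern requires at least one space before AM or PM.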

Make Red When Invalid

Now that we have a way to determine if an entry is valid, we need to decide how to give that feedback to the user. My first thought was to use the input control’s keyup event to check the value and make the text red if it doesn’t match.

<input type="text" class="warnIfInvalid" />
$(function () {
    $('.warnIfInvalid').on('keyup', function () {
        $(this).css('color', 'black');
        if (!validateTime($(this).val())) {
            $(this).css('color', 'red');
        }
    });
});

Change to Default When Invalid

The color feedback is nice, but what if our field is a required value? If the user doesn’t enter anything, there is nothing to let them know they did something wrong. So, my second idea was to use the input control’s blur event to force a default value if the user enters a blank or invalid value.

<input type="text" class="required" value="12:00 AM" />
$(function () {
    $('.required').on('blur', function () {
        if (!validateTime($(this).val())) {
            $(this).val('12:00 AM');
        }
    });
});
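
If you want to exercise the fallback logic without a browser, one option is to pull it into a plain function. This is only a sketch: timeOrDefault is a hypothetical helper, and it carries its own copy of the two patterns so it can run on its own.

```javascript
// Hypothetical helper: return the entry when it looks like a valid time,
// otherwise return the supplied fallback.
function timeOrDefault(value, fallback) {
    var military = /^\s*([01]?\d|2[0-3]):[0-5]\d\s*$/;
    var standard = /^\s*(0?\d|1[0-2]):[0-5]\d(\s+(AM|PM))?\s*$/i;
    var valid = !!value && (military.test(value) || standard.test(value));
    return valid ? value : fallback;
}

console.log(timeOrDefault("7:30 pm", "12:00 AM")); // 7:30 pm
console.log(timeOrDefault("", "12:00 AM"));        // 12:00 AM
console.log(timeOrDefault("7:3", "12:00 AM"));     // 12:00 AM
```

The blur handler would then only need to write the returned value back into the textbox.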

Do Both!

I didn’t like simply changing the user’s value to a default value without letting them know that I’m about to do that. For example, my regular expression won’t match a standard time that doesn’t have a space between the minutes and AM/PM. We can combine both techniques described above to give the user feedback as they type but change their bad input to a default if they enter something invalid. (Note that I manually trigger the keyup event after changing the invalid value to my default value.)

<input type="text" class="required warnIfInvalid" value="12:00 AM" />
$(function () {
    $('.required').on('blur', function () {
        if (!validateTime($(this).val())) {
            $(this).val('12:00 AM');
            $(this).keyup();
        }
    });
    $('.warnIfInvalid').on('keyup', function () {
        $(this).css('color', 'black');
        if (!validateTime($(this).val())) {
            $(this).css('color', 'red');
        }
    });
});

Live example can be found here: http://jsfiddle.net/adamprescott/Q9b6d/

Parse Camel Case into Words

I was working on a small project that had a list of camel case strings that I wanted to display to users. Displaying values as camel case feels dirty, though, so I wanted to pretty it up by parsing the strings into words. Sounds like a job for regular expressions!

After a few tries, this is what I settled on:

var x = Regex.Replace(value, @"([A-Z][^A-Z])", " $1");
x = Regex.Replace(x, "([a-z])([A-Z])", "$1 $2");
x = x.Trim();

Here are my test cases and the results:

Input:
  HelloWorld
  SuperMB
  SMBros
  OneTWOThree

Results:
  Hello World
  Super MB
  SM Bros
  One TWO Three

Ahh, just what I was hoping for. Thanks again, regular expressions!
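
The same two-pass idea carries over to JavaScript more or less unchanged. Here is a sketch, with splitCamelCase as a made-up helper name, run against the same four test cases.

```javascript
// Hypothetical helper mirroring the two Regex.Replace calls and the Trim.
function splitCamelCase(value) {
    return value
        // pass 1: space before a capital that is followed by a non-capital
        .replace(/([A-Z][^A-Z])/g, " $1")
        // pass 2: space between a lowercase letter and a capital
        .replace(/([a-z])([A-Z])/g, "$1 $2")
        .trim();
}

["HelloWorld", "SuperMB", "SMBros", "OneTWOThree"].forEach(function (s) {
    console.log(s + " -> " + splitCamelCase(s));
});
```

The first pass handles the start of ordinary words, and the second pass catches runs of capitals such as "MB" that the first pass skips.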

Group Strings Using LINQ and Regular Expressions

I was working on a problem yesterday where I needed to combine strings that were the same except for one part. Here’s a simplified version of the problem:

Input Array:
"Adam likes apples."
"Adam likes bananas."

Desired Output:
"Adam likes apples and bananas."

It was a no-brainer to use regular expressions to do the matching and parsing, but I couldn’t figure out immediately how to use them to accomplish my goal. I decided to use LINQ’s ToLookup method to create groups of matching items, and then loop through the groups to implement my combine logic.

The first step is to define a regular expression that lets me do two things. It needs to let me create a group “key,” and it needs to let me extract the data part that I’m ultimately trying to combine. For the simple example above, I can use the following pattern:

^(Adam likes )(.*)\.$

I can create the lookup using the regular expression like so:

var input = new[] 
    {
        "Adam likes apples.",
        "Adam likes bananas.",
    };
var regex = new Regex(@"^(Adam likes )(.*)\.$");
var lookup = input.ToLookup(x => regex.Replace(x, "$1"), x => x);

The final step is to loop through the lookup’s keys and do processing on the groups:

foreach (var key in lookup.Select(x => x.Key).ToList())
{
    if (lookup[key].Count() > 1)
    {
        var items = string.Join(
            " and ",
            lookup[key].Select(x => regex.Replace(x, "$2")).ToArray());
        
        var output = regex.Replace(
            lookup[key].First(),
            string.Format("$1{0}.", items));
        Console.WriteLine(output);
    }
    else
    {
        Console.WriteLine(lookup[key].First());
    }
}
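
For anyone without LINQ at hand, the key-then-combine idea can be approximated in JavaScript with a Map standing in for the lookup. This is a rough analogue under my own assumptions, not a port of ToLookup, and combineLikes is a hypothetical name.

```javascript
// Group lines by the first capture group, then merge the second capture
// group of every line that shares a key.
function combineLikes(lines) {
    var pattern = /^(Adam likes )(.*)\.$/;
    var groups = new Map(); // key -> array of original lines

    lines.forEach(function (line) {
        // A line that does not match is returned unchanged, so it becomes its own key.
        var key = line.replace(pattern, "$1");
        if (!groups.has(key)) {
            groups.set(key, []);
        }
        groups.get(key).push(line);
    });

    var output = [];
    groups.forEach(function (members) {
        if (members.length > 1) {
            var items = members
                .map(function (line) { return line.replace(pattern, "$2"); })
                .join(" and ");
            // A function replacement avoids "$" in the items being read as a group reference.
            output.push(members[0].replace(pattern, function (match, prefix) {
                return prefix + items + ".";
            }));
        } else {
            output.push(members[0]);
        }
    });
    return output;
}

console.log(combineLikes(["Adam likes apples.", "Adam likes bananas."]));
// [ 'Adam likes apples and bananas.' ]
```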

Here’s another example to illustrate how this might be useful:

static void Main(string[] args)
{
    var input = new[] 
        {
            "Adam ate 3 apples.",
            "Adam ate 1 apple.",
            "Adam ate 1 banana.",
            "Adam ate 1 banana.",
            "Adam ate 1 orange.",
        };
    var regex = new Regex(@"^(Adam ate)\s+(\d+)\s+(.*?)s?\.$");
    var lookup = input.ToLookup(x => regex.Replace(x, "$1$3"), x => x);
    foreach (var key in lookup.Select(x => x.Key).ToList())
    {
        if (lookup[key].Count() > 1)
        {
            int sum = 0;
            foreach (var item in lookup[key])
            {
                sum += int.Parse(regex.Replace(item, "$2"));
            }

            var target = regex.Replace(lookup[key].First(), "$3");
            if (sum > 1)
            {
                target += "s";
            }
            var output = regex.Replace(
                lookup[key].First(),
                string.Format("$1 {0} {1}.", sum, target));
            Console.WriteLine(output);
        }
        else
        {
            Console.WriteLine(lookup[key].First());
        }
    }
    Console.ReadLine();
}

// Output:
//   Adam ate 4 apples.
//   Adam ate 2 bananas.
//   Adam ate 1 orange.

Writing Maintainable Regular Expressions

If you’ve worked with regular expressions at all, you know it’s easy for them to become quite unruly. It can be hard to decipher a regular expression as you’re working on it, when you know everything you’re trying to accomplish. Imagine how hard it will be for the poor guy who has to do maintenance on that thing later!

There are a few things you can do to make it better for everybody in the long run.

Write Unit Tests

Unit tests are PERFECT for any code that uses regular expressions because you can write a test for each different scenario that you’re trying to match. You don’t have to worry about accidentally breaking something that you had working previously because the tests will regression test everything as you go.
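
As a minimal sketch of what that can look like, here is a table of cases run against a small field-and-quoted-value pattern. The pattern, the case list, and the findFailures helper are all illustrative assumptions, and in a real project the cases would live in whatever test framework you already use.

```javascript
// Hypothetical pattern under test: a field name, "=", and a value in quotes.
var fieldPattern = /(\w+)\s*=\s*(".*?")/;

// Each case pairs an input with whether the pattern should match it.
var cases = [
    { input: 'foo = "bar"', matches: true },
    { input: 'foo="bar"',   matches: true },
    { input: 'foo = bar',   matches: false },
    { input: 'foo : "bar"', matches: false }
];

// Return the cases whose actual result differs from the expected one.
function findFailures(pattern, testCases) {
    return testCases.filter(function (c) {
        return pattern.test(c.input) !== c.matches;
    });
}

var failures = findFailures(fieldPattern, cases);
console.log(failures.length === 0 ? "all cases pass" : failures);
```

Adding a new scenario is then a one-line change to the table, and every earlier scenario gets rechecked for free.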

Include Samples

I like to include samples in the code to make it as obvious as possible what’s going on to anybody looking at the code. I don’t want developers to have to mentally process a regular expression unless they’re there to work on the regular expression itself. I like to provide simple examples like this:

// matches a field and value in quotes
// matches
//   foo = "bar"
//   foo="bar"
// doesn't match
//   foo = bar
//   foo : "bar"
var pattern = @"(\w+)\s*=\s*("".*?"")";

Include Comments in the Pattern

Another trick you can do is to include comments in the regular expression itself by using #. This can be a helpful development tool, too, because it allows you to write out what you’re trying to match in isolated chunks. Note that you’ll need to use the IgnorePatternWhitespace option for this technique to work.

var pattern = @"(
    (?:"".*?"") # anything between quotes (?: -> not-captured)
    |           # or
    \S+         # one or more non-whitespace characters
)"; 
Regex re = new Regex(pattern, RegexOptions.IgnorePatternWhitespace);
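
JavaScript’s built-in RegExp has no equivalent free-spacing option that I know of, so a different technique is needed there. One workaround (a swapped-in approach, not the same feature) is to build the pattern from an array of string pieces and put ordinary code comments next to each piece. The tokenPattern name is just for this sketch.

```javascript
// Assemble the pattern from commented pieces; the array join stands in for
// the inline # comments that IgnorePatternWhitespace allows in .NET.
var tokenPattern = new RegExp(
    "(" + [
        '(?:".*?")', // anything between quotes (?: -> not captured)
        "\\S+"       // one or more non-whitespace characters
    ].join("|") + ")",
    "g"
);

console.log('say "hello world" now'.match(tokenPattern));
// [ 'say', '"hello world"', 'now' ]
```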

I really, really like regular expressions, but they can definitely be maintenance land mines. So, when you use them, do future developers a solid and use tips like these to make them as maintainable as possible.

Six Ways to Parse and Reformat Using Regular Expressions

The other day, I was consulted by a colleague on a regular expression. For those of you that know me, this is one of my favorite consultations, so I was thrilled to help him. He was doing a simple parse-and-reformat. It warmed my insides to know that he identified this as a perfect regular expression scenario and implemented it that way. It was a functional solution, but I felt that it could be simplified and made more maintainable.

I’ll venture to say that the most straightforward way to do a regular expression parse-and-reformat for a developer that’s not familiar with regular expressions (You call yourself a developer..!?) is by creating a Match object and reformatting it.

1. Using a Match object

var date = "4/18/2013";
var regex = new Regex(@"^(\d+)/(\d+)/(\d+)$");

var match = regex.Match(date);
var result = string.Format("{0}-{1}-{2}", 
	match.Groups[3], 
	match.Groups[1], 
	match.Groups[2]);

Console.WriteLine(result);

You can accomplish the same task without creating a Match object by using the Replace method. There is a version that accepts a MatchEvaluator (which can be a lambda expression), so you can basically take the previous solution and plug it in.

2. Using a MatchEvaluator

var date = "4/18/2013";
var regex = new Regex(@"^(\d+)/(\d+)/(\d+)$");

var result = regex.Replace(date, 
	m => string.Format("{0}-{1}-{2}", 
		m.Groups[3], 
		m.Groups[1], 
		m.Groups[2]));

Console.WriteLine(result);

That’s a little bit better, but it’s still a little verbose. There’s another overload of the Replace method that accepts a replacement string. This allows you to skip the Match object altogether, and it results in a nice, tidy solution.

3. Using a replacement string

var date = "4/18/2013";
var regex = new Regex(@"^(\d+)/(\d+)/(\d+)$");

var result = regex.Replace(date, "${3}-${1}-${2}");

Console.WriteLine(result);

I have two problems with all three of these solutions, though. First, they use hard-coded indexes to access the capture groups. If another developer comes along and modifies the regular expression by adding another capture group, it could unintentionally affect the reformatting logic. The second issue I have is that it’s hard to understand the intent of the code. I have to read and process the regular expression and its capture groups in order to determine what the code is trying to do. These two issues add up to poor maintainability.
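
To make the first problem concrete, here is a small JavaScript sketch. The "edited" pattern is a hypothetical change I invented for illustration: someone adds an optional capture group at the front, and the untouched replacement string quietly starts producing the wrong result.

```javascript
var date = "4/18/2013";
var replacement = "$3-$1-$2"; // written against the original group numbering

// Original pattern: three capture groups (month, day, year).
var original = /^(\d+)\/(\d+)\/(\d+)$/;

// Hypothetical later edit: an optional group for a leading zero is added,
// which shifts every existing group index up by one.
var edited = /^(0)?(\d+)\/(\d+)\/(\d+)$/;

console.log(date.replace(original, replacement)); // 2013-4-18
console.log(date.replace(edited, replacement));   // 18--4
```

Nothing throws and nothing warns; the output is simply wrong, which is what makes index-based references risky to maintain.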

Don’t worry, though. Regular expressions have a built-in mechanism for naming capture groups. By modifying the regular expression, you can now reference the capture groups by name instead of index. It makes the regular expression itself a little noisier, but the rest of the code becomes much more readable and maintainable. Way better!

4. Using a Match object with named capture groups

var date = "4/18/2013";
var regex = new Regex(
	@"^(?<month>\d+)/(?<day>\d+)/(?<year>\d+)$");

var match = regex.Match(date);
var result = string.Format("{0}-{1}-{2}", 
	match.Groups["year"], 
	match.Groups["month"], 
	match.Groups["day"]);

Console.WriteLine(result);

5. Using a MatchEvaluator with named capture groups

var date = "4/18/2013";
var regex = new Regex(
	@"^(?<month>\d+)/(?<day>\d+)/(?<year>\d+)$");

var result = regex.Replace(date, 
	m => string.Format("{0}-{1}-{2}", 
		m.Groups["year"], 
		m.Groups["month"], 
		m.Groups["day"]));

Console.WriteLine(result);

6. Using a replacement string with named capture groups

var date = "4/18/2013";
var regex = new Regex(
	@"^(?<month>\d+)/(?<day>\d+)/(?<year>\d+)$");

var result = regex.Replace(date, "${year}-${month}-${day}");

Console.WriteLine(result);
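
Named groups aren’t unique to .NET. As a rough JavaScript counterpart, here is the replacement-string version and the match-object version side by side. The function names are made up for this sketch, and it assumes the input is in month/day/year order.

```javascript
// Assumption: input dates are month/day/year, e.g. "4/18/2013".
var datePattern = /^(?<month>\d+)\/(?<day>\d+)\/(?<year>\d+)$/;

// Replacement string with named references ($<name> in JavaScript).
function reformatWithString(date) {
    return date.replace(datePattern, "$<year>-$<month>-$<day>");
}

// Match object with named groups; returns null when the input does not match.
function reformatWithMatch(date) {
    var m = datePattern.exec(date);
    if (!m) {
        return null;
    }
    return m.groups.year + "-" + m.groups.month + "-" + m.groups.day;
}

console.log(reformatWithString("4/18/2013")); // 2013-4-18
console.log(reformatWithMatch("4/18/2013"));  // 2013-4-18
```

Either way, adding another capture group to the pattern later won’t disturb the references, because nothing depends on position.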

Renumber Enums with Regular Expressions

We had a widely used assembly with an enumeration that did not have explicitly assigned values. It was being released from multiple branches, which caused problems. In an effort to keep the enumerations synchronized across projects, explicit values were added. The problem is that the values started at 1, whereas the implicit counter starts at 0. The solution is simple: renumber ’em to start at 0. Sounds like a job for regular expressions!

I was really hoping that I could do this using regular expressions in VS2012’s find & replace, but I just couldn’t find a way to implement the necessary arithmetic. After floundering for 15 minutes or so, I decided to just write a simple script in LINQPad. Here’s what I came up with, and it works fantastically.

var filename = @"C:\source\MehType.cs";

var contents = string.Empty;
using (var fs = new FileStream(filename, FileMode.Open))
{
    using (var sr = new StreamReader(fs))
    {
        contents = sr.ReadToEnd();
    }
}

var regex = new Regex(@"(.*?= )(\d+)");
foreach (Match match in regex.Matches(contents))
{
    var num = int.Parse(match.Groups[2].Value);
    contents = contents.Replace(
        match.Value, match.Result("${1}" + --num));
}

using (var fs = new FileStream(filename, FileMode.Create))
{
    using (var sw = new StreamWriter(fs))
    {
        sw.Write(contents);
        sw.Flush();
    }
}

The result is that this…

public enum MehType
{
    Erhmm = 1,
    Glurgh = 2,
    Mfhh = 3
}

…becomes this…

public enum MehType
{
    Erhmm = 0,
    Glurgh = 1,
    Mfhh = 2
}
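
The arithmetic that find & replace couldn’t do is easy once the replacement is a function. Here is a JavaScript sketch of just the renumbering step, leaving out the file reading and writing. The renumberFromZero name is hypothetical, and the pattern is a simplified variant that only captures the "= " prefix and the number.

```javascript
// Hypothetical helper: subtract one from every "= <number>" assignment.
function renumberFromZero(contents) {
    return contents.replace(/(= )(\d+)/g, function (match, prefix, num) {
        return prefix + (parseInt(num, 10) - 1);
    });
}

var source = [
    "public enum MehType",
    "{",
    "    Erhmm = 1,",
    "    Glurgh = 2,",
    "    Mfhh = 3",
    "}"
].join("\n");

console.log(renumberFromZero(source));
```

Because each match is replaced in place, one member’s text can’t accidentally overwrite another’s.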