String Formatting with Regex

My latest favorite trick with regular expressions is to shortcut string formatting. We’ve all written some code like this:

if (/* string is not formatted a certain way */)
{
    /* make it formatted that way */
}

Now, there’s nothing wrong with that code, but for simple examples you could do it all in one step with a regular expression!

Here are a few examples:

// remove "www." from a domain if one exists
// domain.com     -> domain.com
// www.domain.com -> domain.com
Regex.Replace(input, @"^(?:www\.)?(.+)$", "$1");

// format phone number
// 1234567890       -> 123-456-7890
// (123) 456-7890   -> 123-456-7890
// (123) 456 - 7890 -> 123-456-7890
Regex.Replace(input, @"^\(?(\d{3})\)?\s*(\d{3})\s*-?\s*(\d{4})$", "$1-$2-$3");
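
For comparison, here is a minimal JavaScript sketch of the same two replacements using String.prototype.replace. The helper names stripWww and formatPhone are hypothetical and exist only for illustration. Note that the dot after "www" is escaped so it only matches a literal period.

```javascript
// Hypothetical helper names; the patterns mirror the C# examples above.

// Remove a leading "www." if present; any other input is returned unchanged.
function stripWww(domain) {
  return domain.replace(/^(?:www\.)?(.+)$/, '$1');
}

// Normalize a 10-digit US phone number to the form 123-456-7890.
// Input that does not match the pattern is returned unchanged.
function formatPhone(phone) {
  return phone.replace(/^\(?(\d{3})\)?\s*(\d{3})\s*-?\s*(\d{4})$/, '$1-$2-$3');
}

console.log(stripWww('www.domain.com'));
console.log(formatPhone('(123) 456 - 7890'));
```

Because replace leaves non-matching input alone, both helpers are safe to call on values that are already formatted.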

Yay, regular expressions!

Validate Time Entry with JavaScript

I was cleaning up a web form that had a textbox for the user to enter a time value. The thing I don’t love about using a textbox to capture a time value is that there’s no validation. The user might enter a bad value and not realize it, and I’d rather let them know right away than display a message after they try to submit the form.

Surely there’s something we can do with JavaScript and regular expressions to create an intuitive experience for the user, right?

Format Checkin’ Regular Expression

The first thing we’re going to need is a regular expression that can be used to determine if an entry is valid or not. I decided to use a pair: one for standard time and one for military time.

function validateTime(time) {
    if (!time) {
        return false;
    }
    var military = /^\s*([01]?\d|2[0-3]):[0-5]\d\s*$/i;
    var standard = /^\s*(0?\d|1[0-2]):[0-5]\d(\s+(AM|PM))?\s*$/i;
    return time.match(military) || time.match(standard);
}
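
To see which entries the pair of patterns accepts, here is a small self-contained sketch. It repeats the function so it can run on its own, and it uses RegExp.test so the result is always a boolean; that change is my own variation, not part of the original function. The sample inputs are made up for illustration.

```javascript
// Same patterns as validateTime above, written with RegExp.test so the
// return value is always true or false.
function validateTime(time) {
  if (!time) {
    return false;
  }
  const military = /^\s*([01]?\d|2[0-3]):[0-5]\d\s*$/;
  const standard = /^\s*(0?\d|1[0-2]):[0-5]\d(\s+(AM|PM))?\s*$/i;
  return military.test(time) || standard.test(time);
}

// Made-up sample entries: a mix of valid and invalid values.
const samples = ['13:45', '1:30 pm', '1:30PM', '24:00', '12:60', ''];
for (const s of samples) {
  console.log(JSON.stringify(s), validateTime(s));
}
```

Note that '1:30PM' is rejected because the standard pattern requires whitespace before AM/PM.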

Make Red When Invalid

Now that we have a way to determine if an entry is valid, we need to decide how to give that feedback to the user. My first thought was to use the input control’s keyup event to check the value and make the text red if it doesn’t match.

<input type="text" class="warnIfInvalid" />
$(function () {
    $('.warnIfInvalid').on('keyup', function () {
        $(this).css('color', 'black');
        if (!validateTime($(this).val())) {
            $(this).css('color', 'red');
        }
    });
});

Change to Default When Invalid

The color feedback is nice, but what if our field is a required value? If the user doesn’t enter anything, there is nothing to let them know they did something wrong. So, my second idea was to use the input control’s blur event to force a default value if the user enters a blank or invalid value.

<input type="text" class="required" value="12:00 AM" />
$(function () {
    $('.required').on('blur', function () {
        if (!validateTime($(this).val())) {
            $(this).val('12:00 AM');
        }
    });
});

Do Both!

I didn’t like simply changing the user’s value to a default value without letting them know that I’m about to do that. For example, my regular expression won’t match a standard time that doesn’t have a space between the minutes and AM/PM. We can combine both techniques described above to give the user feedback as they type but change their bad input to a default if they enter something invalid. (Note that I manually trigger the keyup event after changing the invalid value to my default value.)

<input type="text" class="required warnIfInvalid" value="12:00 AM" />
$(function () {
    $('.required').on('blur', function () {
        if (!validateTime($(this).val())) {
            $(this).val('12:00 AM');
            $(this).keyup();
        }
    });
    $('.warnIfInvalid').on('keyup', function () {
        $(this).css('color', 'black');
        if (!validateTime($(this).val())) {
            $(this).css('color', 'red');
        }
    });
});

Live example can be found here: http://jsfiddle.net/adamprescott/Q9b6d/

Parse Camel Case into Words

I was working on a small project that had a list of camel case strings that I wanted to display to users. Displaying values as camel case feels dirty, though, so I wanted to pretty it up by parsing the strings into words. Sounds like a job for regular expressions!

After a few tries, this is what I settled on:

var x = Regex.Replace(value, @"([A-Z][^A-Z])", " $1");
x = Regex.Replace(x, "([a-z])([A-Z])", "$1 $2");
x = x.Trim();

Here are my test cases and the results:

Input:
  HelloWorld
  SuperMB
  SMBros
  OneTWOThree

Results:
  Hello World
  Super MB
  SM Bros
  One TWO Three
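
The same two-pass idea can be sketched in JavaScript. The helper name splitCamelCase is hypothetical; the two patterns are the ones from the C# snippet, with the g flag added so every occurrence is replaced.

```javascript
// Hypothetical helper; the two replacements mirror the C# version above.
function splitCamelCase(value) {
  return value
    .replace(/([A-Z][^A-Z])/g, ' $1')     // space before a capital followed by a non-capital
    .replace(/([a-z])([A-Z])/g, '$1 $2')  // space between a lowercase letter and a capital
    .trim();                              // drop the leading space added by the first pass
}

for (const input of ['HelloWorld', 'SuperMB', 'SMBros', 'OneTWOThree']) {
  console.log(input, '->', splitCamelCase(input));
}
```

The first pass handles word starts such as "World" and "Bros"; the second catches the boundary in front of an all-caps run such as "MB" or "TWO".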

Ahh, just what I was hoping for. Thanks again, regular expressions!

Group Strings Using LINQ and Regular Expressions

I was working on a problem yesterday where I needed to combine strings that were the same except for one part. Here’s a simplified version of the problem:

Input Array:
"Adam likes apples."
"Adam likes bananas."

Desired Output:
"Adam likes apples and bananas."

It was a no-brainer to use regular expressions to do the matching and parsing, but I couldn’t immediately figure out how to use them to accomplish my goal. I decided to use LINQ’s ToLookup method to create groups of matching items, and then loop through the groups to implement my combine logic.

The first step is to define a regular expression that lets me do two things. It needs to let me create a group “key,” and it needs to let me extract the data part that I’m ultimately trying to combine. For the simple example above, I can use the following pattern:

^(Adam likes )(.*)\.$

I can create the lookup using the regular expression like so:

var input = new[] 
    {
        "Adam likes apples.",
        "Adam likes bananas.",
    };
var regex = new Regex(@"^(Adam likes )(.*)\.$");
var lookup = input.ToLookup(x => regex.Replace(x, "$1"), x => x);

The final step is to loop through the lookup’s keys and do processing on the groups:

foreach (var key in lookup.Select(x => x.Key).ToList())
{
    if (lookup[key].Count() > 1)
    {
        var items = string.Join(
            " and ",
            lookup[key].Select(x => regex.Replace(x, "$2")).ToArray());
        
        var output = regex.Replace(
            lookup[key].First(),
            string.Format("$1{0}.", items));
        Console.WriteLine(output);
    }
    else
    {
        Console.WriteLine(lookup[key].First());
    }
}
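
For readers outside .NET, here is a rough JavaScript sketch of the same group-then-combine idea. A Map stands in for the lookup (it is not LINQ's ToLookup), and the function name combineLikes is hypothetical.

```javascript
// Capture 1 is the group key, capture 2 is the part to combine.
const likesPattern = /^(Adam likes )(.*)\.$/;

// Hypothetical helper: groups lines by key, then joins each group's items.
function combineLikes(lines) {
  const groups = new Map(); // key -> list of captured items (null if no match)
  for (const line of lines) {
    const m = line.match(likesPattern);
    const key = m ? m[1] : line;   // a non-matching line is its own key
    const item = m ? m[2] : null;
    if (!groups.has(key)) {
      groups.set(key, []);
    }
    groups.get(key).push(item);
  }

  const output = [];
  for (const [key, items] of groups) {
    // Non-matching lines are passed through once, unchanged.
    output.push(items[0] === null ? key : key + items.join(' and ') + '.');
  }
  return output;
}

console.log(combineLikes(['Adam likes apples.', 'Adam likes bananas.']));
```

A Map keeps insertion order, so the combined lines come out in the order their keys were first seen.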

Here’s another example to illustrate how this might be useful:

static void Main(string[] args)
{
    var input = new[] 
        {
            "Adam ate 3 apples.",
            "Adam ate 1 apple.",
            "Adam ate 1 banana.",
            "Adam ate 1 banana.",
            "Adam ate 1 orange.",
        };
    var regex = new Regex(@"^(Adam ate)\s+(\d+)\s+(.*?)s?\.$");
    var lookup = input.ToLookup(x => regex.Replace(x, "$1$3"), x => x);
    foreach (var key in lookup.Select(x => x.Key).ToList())
    {
        if (lookup[key].Count() > 1)
        {
            int sum = 0;
            foreach (var item in lookup[key])
            {
                sum += int.Parse(regex.Replace(item, "$2"));
            }

            var target = regex.Replace(lookup[key].First(), "$3");
            if (sum > 1)
            {
                target += "s";
            }
            var output = regex.Replace(
                lookup[key].First(),
                string.Format("$1 {0} {1}.", sum, target));
            Console.WriteLine(output);
        }
        else
        {
            Console.WriteLine(lookup[key].First());
        }
    }
    Console.ReadLine();
}

// Output:
//   Adam ate 4 apples.
//   Adam ate 2 bananas.
//   Adam ate 1 orange.

Writing Maintainable Regular Expressions

If you’ve worked with regular expressions at all, you know it’s easy for them to become quite unruly. It can be hard to decipher a regular expression as you’re working on it, when you know everything you’re trying to accomplish. Imagine how hard it will be for the poor guy who has to do maintenance on that thing later!

There are a few things you can do to make it better for everybody in the long run.

Write Unit Tests

Unit tests are PERFECT for any code that uses regular expressions because you can write a test for each different scenario that you’re trying to match. You don’t have to worry about accidentally breaking something that you had working previously because the tests will regression test everything as you go.
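
As a rough illustration, a table of cases keeps each scenario to one line. The sketch below is JavaScript with no test framework; the pattern and the helper name runCases are assumptions made up for this example.

```javascript
// Illustrative pattern (an assumption for this sketch): a field name,
// an equals sign, and a double-quoted value.
const fieldPattern = /(\w+)\s*=\s*(".*?")/;

// One entry per scenario; adding a case is a one-line change.
const cases = [
  { input: 'foo = "bar"', shouldMatch: true },
  { input: 'foo="bar"',   shouldMatch: true },
  { input: 'foo = bar',   shouldMatch: false },
  { input: 'foo : "bar"', shouldMatch: false },
];

// Returns the inputs whose actual result differs from the expectation.
function runCases(pattern, table) {
  return table
    .filter(c => pattern.test(c.input) !== c.shouldMatch)
    .map(c => c.input);
}

const failures = runCases(fieldPattern, cases);
console.log(failures.length === 0 ? 'all cases pass' : failures);
```

In a real project each row would typically become a named test in whatever framework the team already uses.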

Include Samples

I like to include samples in the code to make it as obvious as possible what’s going on to anybody looking at the code. I don’t want developers to have to mentally process a regular expression unless they’re there to work on the regular expression itself. I like to provide simple examples like this:

// matches a field and value in quotes
// matches
//   foo = "bar"
//   foo="bar"
// doesn't match
//   foo = bar
//   foo : "bar"
var pattern = @"(\w+)\s*=\s*("".*?"")";

Include Comments in the Pattern

Another trick you can do is to include comments in the regular expression itself by using #. This can be a helpful development tool, too, because it allows you to write out what you’re trying to match in isolated chunks. Note that you’ll need to use the IgnorePatternWhitespace option for this technique to work.

var pattern = @"(
    (?:"".*?"") # anything between quotes (?: -> not-captured)
    |           # or
    \S+         # one or more non-whitespace characters
)"; 
Regex re = new Regex(pattern, RegexOptions.IgnorePatternWhitespace);
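
JavaScript has no equivalent of IgnorePatternWhitespace, so inline # comments are not available there. One rough analogue, and a different technique from the one above, is to assemble the pattern from string pieces that each carry an ordinary code comment. The names pieces and tokenPattern are hypothetical.

```javascript
// Each piece carries its own comment; String.raw keeps backslashes literal.
const pieces = [
  String.raw`(`,
  String.raw`(?:".*?")`,   // anything between quotes (?: -> not captured)
  String.raw`|`,           // or
  String.raw`\S+`,         // one or more non-whitespace characters
  String.raw`)`,
];

// The g flag makes String.prototype.match return every match.
const tokenPattern = new RegExp(pieces.join(''), 'g');

console.log('say "hello world" now'.match(tokenPattern));
```

The trade-off is that the pattern is no longer visible as a single string, so a sample or two in a comment is still worth including.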

I really, really like regular expressions, but they can definitely be maintenance land mines. So, when you use them, do future developers a solid and use tips like these to make them as maintainable as possible.

Six Ways to Parse and Reformat Using Regular Expressions

The other day, I was consulted by a colleague on a regular expression. For those of you that know me, this is one of my favorite consultations, so I was thrilled to help him. He was doing a simple parse-and-reformat. It warmed my insides to know that he identified this as a perfect regular expression scenario and implemented it that way. It was a functional solution, but I felt that it could be simpler and more maintainable.

I’ll venture to say that the most straightforward way to do a regular expression parse-and-reformat for a developer that’s not familiar with regular expressions (You call yourself a developer..!?) is by creating a Match object and reformatting it.

1. Using a Match object

var date = "4/18/2013";
var regex = new Regex(@"^(\d+)/(\d+)/(\d+)$");

var match = regex.Match(date);
var result = string.Format("{0}-{1}-{2}", 
	match.Groups[3], 
	match.Groups[1], 
	match.Groups[2]);

Console.WriteLine(result);

You can accomplish the same task without creating a Match object by using the Replace method. There is a version that accepts a MatchEvaluator–which can be a lambda expression–so you can basically take the previous solution and plug it in.

2. Using a MatchEvaluator

var date = "4/18/2013";
var regex = new Regex(@"^(\d+)/(\d+)/(\d+)$");

var result = regex.Replace(date, 
	m => string.Format("{0}-{1}-{2}", 
		m.Groups[3], 
		m.Groups[1], 
		m.Groups[2]));

Console.WriteLine(result);

That’s a little bit better, but it’s still a little verbose. There’s another overload of the Replace method that accepts a replacement string. This allows you to skip the Match object altogether, and it results in a nice, tidy solution.

3. Using a replacement string

var date = "4/18/2013";
var regex = new Regex(@"^(\d+)/(\d+)/(\d+)$");

var result = regex.Replace(date, "${3}-${1}-${2}");

Console.WriteLine(result);

I have two problems with all three of these solutions, though. First, they use hard-coded indexes to access the capture groups. If another developer comes along and modifies the regular expression by adding another capture group, it could unintentionally affect the reformatting logic. The second issue I have is that it’s hard to understand the intent of the code. I have to read and process the regular expression and its capture groups in order to determine what the code is trying to do. These two issues add up to poor maintainability.

Don’t worry, though. Regular expressions have a built-in mechanism for naming capture groups. By modifying the regular expression, you can now reference the capture groups by name instead of index. It makes the regular expression itself a little noisier, but the rest of the code becomes much more readable and maintainable. Way better!

4. Using a Match object with named capture groups

var date = "4/18/2013";
var regex = new Regex(
	@"^(?<month>\d+)/(?<day>\d+)/(?<year>\d+)$");

var match = regex.Match(date);
var result = string.Format("{0}-{1}-{2}", 
	match.Groups["year"], 
	match.Groups["month"], 
	match.Groups["day"]);

Console.WriteLine(result);

5. Using a MatchEvaluator with named capture groups

var date = "4/18/2013";
var regex = new Regex(
	@"^(?<month>\d+)/(?<day>\d+)/(?<year>\d+)$");

var result = regex.Replace(date, 
	m => string.Format("{0}-{1}-{2}", 
		m.Groups["year"], 
		m.Groups["month"], 
		m.Groups["day"]));

Console.WriteLine(result);

6. Using a replacement string with named capture groups

var date = "4/18/2013";
var regex = new Regex(
	@"^(?<month>\d+)/(?<day>\d+)/(?<year>\d+)$");

var result = regex.Replace(date, "${year}-${month}-${day}");

Console.WriteLine(result);
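
Named capture groups are not unique to .NET. As a hedged sketch, here is how the replacement-string and evaluator styles might look in JavaScript, where a named reference is written $<name> rather than ${name}. The function names are hypothetical, the input is assumed to be month/day/year, and the zero padding in the second function is an extra added for illustration.

```javascript
// Assumes month/day/year input such as "4/18/2013".
const datePattern = /^(?<month>\d+)\/(?<day>\d+)\/(?<year>\d+)$/;

// Replacement string with named references.
function toYearFirst(date) {
  return date.replace(datePattern, '$<year>-$<month>-$<day>');
}

// Replacement callback; when the pattern has named groups, the last
// argument passed to the callback is the groups object.
function toYearFirstPadded(date) {
  return date.replace(datePattern, (...args) => {
    const { year, month, day } = args[args.length - 1];
    return `${year}-${month.padStart(2, '0')}-${day.padStart(2, '0')}`;
  });
}

console.log(toYearFirst('4/18/2013'));
console.log(toYearFirstPadded('4/18/2013'));
```

The callback form is the one to reach for when the output needs logic, such as padding, that a replacement string cannot express.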

Renumber Enums with Regular Expressions

We had a widely-used assembly whose enumeration did not have explicitly assigned values. It was being released from multiple branches, and that was causing problems. In an effort to keep the enumerations synchronized across projects, explicit values were added. The problem is that the values started at 1, whereas the implicit counter starts at 0. The solution is simple: renumber ’em to start at 0. Sounds like a job for regular expressions!

I was really hoping that I could do this using regular expressions in VS2012’s find & replace, but I just couldn’t find a way to implement the necessary arithmetic. After floundering for 15 minutes or so, I decided to just write a simple script in LINQPad. Here’s what I came up with, and it works fantastically.

var filename = @"C:\source\MehType.cs";

var contents = string.Empty;
using (var fs = new FileStream(filename, FileMode.Open))
{
    using (var sr = new StreamReader(fs))
    {
        contents = sr.ReadToEnd();
    }
}

var regex = new Regex(@"(.*?= )(\d+)");
foreach (Match match in regex.Matches(contents))
{
    var num = int.Parse(match.Groups[2].Value);
    contents = contents.Replace(
        match.Value, match.Result("${1}" + --num));
}

using (var fs = new FileStream(filename, FileMode.Create))
{
    using (var sw = new StreamWriter(fs))
    {
        sw.Write(contents);
        sw.Flush();
    }
}

The result is that this…

public enum MehType
{
    Erhmm = 1,
    Glurgh = 2,
    Mfhh = 3
}

…becomes this…

public enum MehType
{
    Erhmm = 0,
    Glurgh = 1,
    Mfhh = 2
}
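
The arithmetic that plain find & replace cannot do is exactly what a replacement callback is for. Here is a minimal JavaScript sketch of the same renumbering, working on a string instead of a file. The helper name renumberFromZero and the slightly stricter pattern are assumptions for illustration.

```javascript
// Hypothetical helper: subtracts 1 from every "Name = <number>" assignment
// that starts a line.
function renumberFromZero(source) {
  return source.replace(
    /^(\s*\w+\s*=\s*)(\d+)/gm,
    (match, prefix, value) => prefix + (Number(value) - 1)
  );
}

const before = [
  'public enum MehType',
  '{',
  '    Erhmm = 1,',
  '    Glurgh = 2,',
  '    Mfhh = 3',
  '}',
].join('\n');

console.log(renumberFromZero(before));
```

Because the callback receives each match separately, all the assignments are rewritten in a single pass.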

Regular Expression Searching in Visual Studio 2012

One of the recurring themes throughout the enhancements made in Visual Studio 2012 is improved searching. There are search boxes everywhere: in the title bar, in Solution Explorer, in Test Explorer… Everywhere!

In addition to making searches more accessible, they’ve also improved and simplified support for regular expressions. Regular expression searching existed in earlier versions of Visual Studio, but it used its own custom syntax, which made it difficult and unintuitive to use.

In Visual Studio 2012, regular expression searching is extremely easy–relatively speaking–to use and very intuitive. When you press CTRL+F to bring up the search window, there’s a regular expression toggle button. Click the button, and your search criteria will be interpreted as a regular expression.

That’s all good and well, but what’s really cool is that you can use capture groups for finding and replacing. How many times have you had to go through a file making the same change based on a pattern over and over again?

Here’s an example of finding/replacing with regular expressions and capture groups:

Find:    var(?<x>dog|cat)
Replace: varSuper${x}

Will replace "vardog", "varcat", "varDOG"
With "varSuperdog", "varSupercat", "varSuperDOG"
But not replace "var dog" or "dog"
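
To try the same find/replace outside the IDE, here is a small JavaScript sketch. It assumes a case-insensitive search, which the "varDOG" case above implies, and it uses JavaScript's $<x> syntax for the named reference where Visual Studio uses ${x}. The sample text is made up.

```javascript
// The i flag approximates a search with case matching turned off;
// the g flag replaces every occurrence.
const find = /var(?<x>dog|cat)/gi;
const replaceWith = 'varSuper$<x>';

// Made-up sample text containing matching and non-matching names.
const sample = 'vardog varcat varDOG var dog dog';

console.log(sample.replace(find, replaceWith));
```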

How cool is that!?

Read more about using regular expressions in Visual Studio here.

Regular Expression Capture Groups in C#

We all know regular expressions are great for matching patterns and parsing strings, but if you really want to take your Regex game to the next level, spend some time looking at capture groups. Using capture groups–more specifically, named capture groups–makes it easier to manipulate results and replacements, and it also has a fortunate side-effect of improving readability.

To create a named capture group in a .NET regular expression, use the syntax “(?<Name>pattern)”. The name acts like an inline comment and lets you reference the group as “${Name}” in Result and Replace statements.

Let’s look at an example that uses Replace. Social Security numbers are a sensitive piece of information that is often masked when displaying details. Let’s use a regular expression to hide part of the number.

var ssn = "123-45-6789";
var re = new Regex(@"\d{3}-\d{2}-(?<lastFour>\d{4})");
var masked = re.Replace(ssn, "xxx-xx-${lastFour}");
Console.WriteLine("{0} -> {1}", ssn, masked);
123-45-6789 -> xxx-xx-6789

Another common scenario is to extract a piece of data from a string. Here’s another quick example that extracts the month and year from a date string (great for grouping items in a reporting scenario!).

var date = "01/15/2012";
var re = new Regex(@"(?<month>\d{1,2})/(?<day>\d{1,2})/(?<year>\d{4})");
var monthYear = re.Match(date).Result("${year}-${month}");
Console.WriteLine("{0} -> {1}", date, monthYear);
01/15/2012 -> 2012-01
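
The same two scenarios can be sketched in JavaScript, which also supports named groups. The function names maskSsn and monthYear are hypothetical; returning null for input that does not match is a choice made for this sketch.

```javascript
// Mask everything except the last four digits.
function maskSsn(ssn) {
  return ssn.replace(/\d{3}-\d{2}-(?<lastFour>\d{4})/, 'xxx-xx-$<lastFour>');
}

// Extract "year-month" from a month/day/year string, or null if it does not match.
function monthYear(date) {
  const m = date.match(/(?<month>\d{1,2})\/(?<day>\d{1,2})\/(?<year>\d{4})/);
  return m ? `${m.groups.year}-${m.groups.month}` : null;
}

console.log(maskSsn('123-45-6789'));
console.log(monthYear('01/15/2012'));
```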

Parse US Street Addresses with Regular Expression in VB6

I have a confession to make: this article was actually implemented in VB6 originally. The meat of it is the same, but here’s the VB6 edition in case it’s of interest to anybody.

Enjoy!

Private Sub Form_Load()
    
    Call ClearControls
    
End Sub

Private Sub ClearControls()
    uxHouseNumber = ""
    uxStreetPrefix = ""
    uxStreetName = ""
    uxStreetType = ""
    uxStreetSuffix = ""
    uxApt = ""
    uxAdditionalInfo = ""
End Sub

'---------------------------------------------------------------------------------------
' Procedure : ConstructRegex
' Purpose   : Returns a regular expression pattern for parsing US street addresses
'---------------------------------------------------------------------------------------
'
Private Function ConstructRegex() As String
    
'    ConstructRegex = "^" & _                           -> begin string
'        "(\d+)" & _                                    -> 1 or more digits
'        "(\s+(?:" & GetStreetPrefixes() & "))?" & _    -> whitespace + valid prefix (optional)
'        "(\s+.*?)" & _                                 -> whitespace + one or more characters
'        "(?:" & _                                      -> group (optional) {
'            "(\s+(?:" & GetStreetTypes() & "))" & _    ->   whitespace + valid street type
'            "(\s+(?:" & GetStreetSuffixes() & "))?" & _->   whitespace + valid street suffix (optional)
'            "(\s+.*)?" & _                             ->   whitespace + anything else (optional)
'        ")?" & _                                       -> }
'        "$"                                            -> end string

    ConstructRegex = "^" & _
        "(\d+)" & _
        "(\s+(?:" & GetStreetPrefixes() & "))?" & _
        "(\s+.*?)" & _
        "(?:" & _
            "(\s+(?:" & GetStreetTypes() & "))" & _
            "(\s+(?:" & GetStreetSuffixes() & "))?" & _
            "(\s+.*)?" & _
        ")?" & _
        "$"


End Function

'---------------------------------------------------------------------------------------
' Procedure : GetStreetPrefixes
' Purpose   : Returns a pipe-delimited list of valid street prefixes
'---------------------------------------------------------------------------------------
'
Private Function GetStreetPrefixes() As String
    
    GetStreetPrefixes = "TE|NW|HW|RD|E|MA|EI|NO|AU|SE|GR|OL|W|MM|OM|SW|ME|HA|JO|OV|S|OH|NE|K|N"
    
End Function

'---------------------------------------------------------------------------------------
' Procedure : GetStreetTypes
' Purpose   : Returns a pipe-delimited list of valid street types
'---------------------------------------------------------------------------------------
'
Private Function GetStreetTypes() As String
    
    GetStreetTypes = "TE|STCT|DR|SPGS|PARK|GRV|CRK|XING|BR|PINE|CTS|TRL|VI|RD|PIKE|MA|LO|TER|UN|CIR|WALK|CO|RUN|FRD|LDG|ML|AVE|NO|PA|SQ|BLVD|VLGS|VLY|GR|LN|HOUSE|VLG|OL|STA|CH|ROW|EXT|JC|BLDG|FLD|CT|HTS|MOTEL|PKWY|COOP|ACRES|ESTS|SCH|HL|CORD|ST|CLB|FLDS|PT|STPL|MDWS|APTS|ME|LOOP|SMT|RDG|UNIV|PLZ|MDW|EXPY|WALL|TR|FLS|HBR|TRFY|BCH|CRST|CI|PKY|OV|RNCH|CV|DIV|WA|S|WAY|I|CTR|VIS|PL|ANX|BL|ST TER|DM|STHY|RR|MNR"
    
End Function

'---------------------------------------------------------------------------------------
' Procedure : GetStreetSuffixes
' Purpose   : Returns a pipe-delimited list of valid street suffixes
'---------------------------------------------------------------------------------------
'
Private Function GetStreetSuffixes() As String
    
    GetStreetSuffixes = "NW|E|SE|W|SW|S|NE|N"
    
End Function

'---------------------------------------------------------------------------------------
' Procedure : uxAddress_Change
' Purpose   : Parses user input and displays components to user
'---------------------------------------------------------------------------------------
'
Private Sub uxAddress_Change()
    
    Dim strInput As String
    Dim re As RegExp
    Dim mc As MatchCollection
    Dim ma As Match
    
    Call ClearControls
    
    strInput = UCase$(uxAddress.Text)
    
    Set re = New RegExp
    re.Pattern = ConstructRegex()
    re.Global = True
    
    If re.Test(strInput) Then
        Set mc = re.Execute(strInput)
        Set ma = mc(0)
        
        uxHouseNumber = Trim$(ma.SubMatches(0))
        uxStreetPrefix = Trim$(ma.SubMatches(1))
        uxStreetName = Trim$(ma.SubMatches(2))
        uxStreetType = Trim$(ma.SubMatches(3))
        uxStreetSuffix = Trim$(ma.SubMatches(4))
        uxApt = Trim$(ma.SubMatches(5))
        
        Set ma = Nothing
        Set mc = Nothing
    Else
        uxStreetName = strInput
    End If
    
    Set re = Nothing
    
End Sub
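
For anyone who wants to experiment with the pattern without a VB6 form, here is a hedged JavaScript sketch of the same structure. The prefix, type, and suffix lists are heavily abbreviated for illustration, and the function and field names are hypothetical.

```javascript
// Abbreviated lists, for illustration only (the VB6 lists above are much longer).
const PREFIXES = 'NE|NW|SE|SW|N|S|E|W';
const TYPES = 'BLVD|AVE|ST|RD|DR|LN|CT';
const SUFFIXES = 'NE|NW|SE|SW|N|S|E|W';

const addressPattern = new RegExp(
  '^' +
  '(\\d+)' +                          // house number: one or more digits
  '(\\s+(?:' + PREFIXES + '))?' +     // whitespace + direction prefix (optional)
  '(\\s+.*?)' +                       // whitespace + street name (lazy)
  '(?:' +                             // optional group {
    '(\\s+(?:' + TYPES + '))' +       //   whitespace + street type
    '(\\s+(?:' + SUFFIXES + '))?' +   //   whitespace + direction suffix (optional)
    '(\\s+.*)?' +                     //   whitespace + anything else (optional)
  ')?' +                              // }
  '$'
);

// Returns the parsed components, or null when the input does not start
// with a house number followed by a street name.
function parseAddress(input) {
  const m = addressPattern.exec(input.trim().toUpperCase());
  if (!m) {
    return null;
  }
  const part = i => (m[i] || '').trim(); // unmatched groups become ''
  return {
    houseNumber: part(1),
    prefix: part(2),
    name: part(3),
    type: part(4),
    suffix: part(5),
    extra: part(6),
  };
}

console.log(parseAddress('123 n main st apt 4'));
```

As in the VB6 version, the street name is matched lazily so that a recognized street type and anything after it can be split off when present.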