Group Strings Using LINQ and Regular Expressions

I was working on a problem yesterday where I needed to combine strings that were the same except for one part. Here’s a simplified version of the problem:

Input Array:
"Adam likes apples."
"Adam likes bananas."

Desired Output:
"Adam likes apples and bananas."

It was a no-brainer to use regular expressions to do the matching and parsing, but I couldn’t immediately figure out how to use them to accomplish my goal. I decided to use LINQ’s ToLookup method to create groups of matching items, and then loop through the groups to implement my combine logic.

The first step is to define a regular expression that lets me do two things. It needs to let me create a group “key,” and it needs to let me extract the data part that I’m ultimately trying to combine. For the simple example above, I can use the following pattern:

^(Adam likes )(.*)\.$

I can create the lookup using the regular expression like so:

var input = new[] 
    {
        "Adam likes apples.",
        "Adam likes bananas.",
    };
var regex = new Regex(@"^(Adam likes )(.*)\.$");
var lookup = input.ToLookup(x => regex.Replace(x, "$1"), x => x);
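
To make the two roles of the pattern concrete, here is a minimal sketch of the key and data extraction pulled out into helpers. The class and helper names (LookupSketch, KeyOf, DataOf) and the extra "Bob likes pears." input are my own illustrations, not part of the original code. One detail worth knowing: Regex.Replace returns the input unchanged when the pattern does not match, so a non-matching string simply becomes its own key and lands in a group of one.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookupSketch
{
    // Same pattern as above: group 1 is the shared prefix, group 2 the variable part.
    private static readonly Regex Pattern = new Regex(@"^(Adam likes )(.*)\.$");

    // Hypothetical helper: the grouping key for one input string.
    // Regex.Replace returns the input unchanged when nothing matches,
    // so a non-matching string ends up as its own key.
    public static string KeyOf(string s)
    {
        return Pattern.Replace(s, "$1");
    }

    // Hypothetical helper: the variable part, read from the match directly.
    // Returns null when the string does not match the pattern.
    public static string DataOf(string s)
    {
        var m = Pattern.Match(s);
        return m.Success ? m.Groups[2].Value : null;
    }

    public static void Main()
    {
        // "Bob likes pears." is an assumed extra input that does not match the pattern.
        var input = new[] { "Adam likes apples.", "Adam likes bananas.", "Bob likes pears." };
        var lookup = input.ToLookup(x => KeyOf(x), x => x);

        // Each element of a lookup is a group that exposes its Key and its items.
        foreach (var group in lookup)
        {
            Console.WriteLine("[{0}] -> {1} item(s)", group.Key, group.Count());
        }
    }
}
```

With these inputs the two "Adam likes" strings share the key "Adam likes " and the third string forms its own single-item group.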

The final step is to loop through the lookup’s keys and do processing on the groups:

foreach (var key in lookup.Select(x => x.Key).ToList())
{
    if (lookup[key].Count() > 1)
    {
        var items = string.Join(
            " and ",
            lookup[key].Select(x => regex.Replace(x, "$2")).ToArray());
        
        var output = regex.Replace(
            lookup[key].First(),
            string.Format("$1{0}.", items));
        Console.WriteLine(output);
    }
    else
    {
        Console.WriteLine(lookup[key].First());
    }
}

Here’s another example to illustrate how this might be useful:

// Requires: using System; using System.Linq; using System.Text.RegularExpressions;
static void Main(string[] args)
{
    var input = new[] 
        {
            "Adam ate 3 apples.",
            "Adam ate 1 apple.",
            "Adam ate 1 banana.",
            "Adam ate 1 banana.",
            "Adam ate 1 orange.",
        };
    var regex = new Regex(@"^(Adam ate)\s+(\d+)\s+(.*?)s?\.$");
    var lookup = input.ToLookup(x => regex.Replace(x, "$1$3"), x => x);
    foreach (var key in lookup.Select(x => x.Key).ToList())
    {
        if (lookup[key].Count() > 1)
        {
            int sum = 0;
            foreach (var item in lookup[key])
            {
                sum += int.Parse(regex.Replace(item, "$2"));
            }

            var target = regex.Replace(lookup[key].First(), "$3");
            if (sum > 1)
            {
                target += "s";
            }
            var output = regex.Replace(
                lookup[key].First(),
                string.Format("$1 {0} {1}.", sum, target));
            Console.WriteLine(output);
        }
        else
        {
            Console.WriteLine(lookup[key].First());
        }
    }
    Console.ReadLine();
}

// Output:
//   Adam ate 4 apples.
//   Adam ate 2 bananas.
//   Adam ate 1 orange.

Author: Adam Prescott

I'm enthusiastic and passionate about creating intuitive, great-looking software. I strive to find the simplest solutions to complex problems, and I embrace agile principles and test-driven development.

3 thoughts on “Group Strings Using LINQ and Regular Expressions”

  1. Great! A powerful usage of LINQ and regex 🙂 Maybe you can explain a bit of your regex-fu?

    In the regex in your second example, why does the third parenthesized group “(.*?)” have a question mark? I understand the * means ‘zero or more’; wouldn’t that already cover everything the question mark adds?

    I loaded up the code and saw that removing the question mark and playing a bit with the input causes the lookup to contain separate keys for ‘apple’ and ‘apples’, ‘banana’ and ‘bananas’, etc. It’s apparently necessary for this to work, but I don’t understand why 😦

    1. Hey, Dan! Great question.

      A question mark that follows an asterisk–or a plus–makes it “lazy.” By default, * and + are “greedy.”

      Here’s an example…
      Let’s say “apples.” is the text left to match at the end of the sentence. With “(.*)s?\.$”, the “.*” capture group greedily takes the entire word “apples” and the optional “s?” matches nothing. Changing the pattern to “(.*?)s?\.$” makes the “.*?” take as little as it can while still letting the rest of the pattern match, so the captured text becomes just “apple” and the optional “s?” picks up the trailing “s”. Make sense!?

      You can read a different explanation of greediness here:
      http://www.regular-expressions.info/repeat.html
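
To see the difference concretely, here is a minimal sketch that runs the article's pattern next to a greedy variant and compares what the third group captures. The class name GreedyVsLazy, the Noun helper, and the greedy variant itself are assumptions added for illustration; only the lazy pattern comes from the post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // The article's pattern, with a lazy third group.
    public static readonly Regex Lazy = new Regex(@"^(Adam ate)\s+(\d+)\s+(.*?)s?\.$");

    // Assumed variant for comparison: identical except the third group is greedy.
    public static readonly Regex Greedy = new Regex(@"^(Adam ate)\s+(\d+)\s+(.*)s?\.$");

    // Hypothetical helper: text captured by group 3, or null when there is no match.
    public static string Noun(Regex regex, string sentence)
    {
        var m = regex.Match(sentence);
        return m.Success ? m.Groups[3].Value : null;
    }

    public static void Main()
    {
        foreach (var s in new[] { "Adam ate 3 apples.", "Adam ate 1 apple." })
        {
            Console.WriteLine("{0} lazy: {1} greedy: {2}", s, Noun(Lazy, s), Noun(Greedy, s));
        }
    }
}
```

The lazy pattern captures "apple" for both sentences, so they share one lookup key. The greedy variant captures "apples" for the first and "apple" for the second, which is why removing the question mark splits them into separate groups.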
