I was working on a problem yesterday where I needed to combine strings that were the same except for one part. Here’s a simplified version of the problem:
Input Array: "Adam likes apples." "Adam likes bananas."
Desired Output: "Adam likes apples and bananas."
It was a no-brainer to use regular expressions to do the matching and parsing, but I couldn’t immediately figure out how to use them to accomplish my goal. I decided to use LINQ’s ToLookup method to create groups of matching items, and then loop through the groups to implement my combine logic.
The first step is to define a regular expression that lets me do two things. It needs to let me create a group “key,” and it needs to let me extract the data part that I’m ultimately trying to combine. For the simple example above, I can use the following pattern:
^(Adam likes )(.*)\.$
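To make the two roles of the pattern concrete, here is a minimal sketch that pulls both capture groups out of one sentence. The class name PatternDemo and the helper Split are hypothetical names used only for illustration; they are not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    // Hypothetical helper: returns (key, data) for a sentence,
    // or null when the pattern does not match.
    public static Tuple<string, string> Split(string sentence)
    {
        var match = Regex.Match(sentence, @"^(Adam likes )(.*)\.$");
        if (!match.Success)
        {
            return null;
        }
        // Group 1 is the grouping key, group 2 is the data to combine.
        return Tuple.Create(match.Groups[1].Value, match.Groups[2].Value);
    }

    public static void Main()
    {
        var parts = Split("Adam likes apples.");
        Console.WriteLine("key:  [{0}]", parts.Item1);  // key:  [Adam likes ]
        Console.WriteLine("data: [{0}]", parts.Item2);  // data: [apples]
    }
}
```

The first group keeps its trailing space, so the key can be glued straight back onto the combined data.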
I can create the lookup using the regular expression like so:
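    // Requires: using System.Linq; using System.Text.RegularExpressions;
    var input = new[]
    {
        "Adam likes apples.",
        "Adam likes bananas.",
    };
    var regex = new Regex(@"^(Adam likes )(.*)\.$");
    // Key each line by its first capture group; keep the whole line as the value.
    var lookup = input.ToLookup(x => regex.Replace(x, "$1"), x => x);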
The final step is to loop through the lookup’s keys and do processing on the groups:
    foreach (var key in lookup.Select(x => x.Key).ToList())
    {
        if (lookup[key].Count() > 1)
        {
            var items = string.Join(
                " and ",
                lookup[key].Select(x => regex.Replace(x, "$2")).ToArray());
            var output = regex.Replace(
                lookup[key].First(),
                string.Format("$1{0}.", items));
            Console.WriteLine(output);
        }
        else
        {
            Console.WriteLine(lookup[key].First());
        }
    }
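For anyone who wants to paste something into a console project, here is a self-contained sketch of the same idea wrapped in a method. It is a variation rather than the exact code above: it uses GroupBy and Regex.Match in place of ToLookup and Regex.Replace, and the names CombineDemo, KeyOf, and Combine are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CombineDemo
{
    private static readonly Regex Pattern = new Regex(@"^(Adam likes )(.*)\.$");

    // Hypothetical helper: lines that do not match fall back to the whole line as their key.
    private static string KeyOf(string line)
    {
        var match = Pattern.Match(line);
        return match.Success ? match.Groups[1].Value : line;
    }

    public static List<string> Combine(IEnumerable<string> input)
    {
        var results = new List<string>();
        foreach (var group in input.GroupBy(KeyOf))
        {
            var lines = group.ToList();
            if (lines.Count > 1 && lines.All(x => Pattern.IsMatch(x)))
            {
                // Join the data parts and put the shared key back in front.
                var items = lines.Select(x => Pattern.Match(x).Groups[2].Value);
                results.Add(group.Key + string.Join(" and ", items) + ".");
            }
            else
            {
                // Nothing to combine: pass the lines through unchanged.
                results.AddRange(lines);
            }
        }
        return results;
    }

    public static void Main()
    {
        var input = new[] { "Adam likes apples.", "Adam likes bananas.", "Eve likes pears." };
        foreach (var line in Combine(input))
        {
            Console.WriteLine(line);
        }
        // Output:
        // Adam likes apples and bananas.
        // Eve likes pears.
    }
}
```

Building the output by concatenation instead of a replacement string also sidesteps any surprises if the data itself happens to contain a "$".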
Here’s another example to illustrate how this might be useful:
    static void Main(string[] args)
    {
        var input = new[]
        {
            "Adam ate 3 apples.",
            "Adam ate 1 apple.",
            "Adam ate 1 banana.",
            "Adam ate 1 banana.",
            "Adam ate 1 orange.",
        };
        var regex = new Regex(@"^(Adam ate)\s+(\d+)\s+(.*?)s?\.$");
        var lookup = input.ToLookup(x => regex.Replace(x, "$1$3"), x => x);
        foreach (var key in lookup.Select(x => x.Key).ToList())
        {
            if (lookup[key].Count() > 1)
            {
                int sum = 0;
                foreach (var item in lookup[key])
                {
                    sum += int.Parse(regex.Replace(item, "$2"));
                }
                var target = regex.Replace(lookup[key].First(), "$3");
                if (sum > 1)
                {
                    target += "s";
                }
                var output = regex.Replace(
                    lookup[key].First(),
                    string.Format("$1 {0} {1}.", sum, target));
                Console.WriteLine(output);
            }
            else
            {
                Console.WriteLine(lookup[key].First());
            }
        }
        Console.ReadLine();
    }
    // Output:
    // Adam ate 4 apples.
    // Adam ate 2 bananas.
    // Adam ate 1 orange.
Great! A powerful usage of LINQ and regex :) Maybe you can explain a bit of your regex-fu?
In the regex in your second example, why does the third parenthesized phrase “(.*?)” have a question mark? I understand the * means ‘zero or more’, so wouldn’t that already cover every case the question mark afterwards could add?
I loaded up the code and saw that removing the question mark and playing a bit with the input causes the lookup to contain keys for both ‘apple’ and ‘apples’, ‘banana’ and ‘bananas’, etc. It’s apparently necessary for this to work, but I don’t understand why :(
Hey, Dan! Great question.
A question mark that follows an asterisk (or a plus) makes it “lazy.” By default, * and + are “greedy.”
Here’s an example…
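Let’s say we have “apples.” as the string we are investigating. The regular expression “^(.*)s?\.$” will match, but the “(.*)” capture group will greedily take the entire word “apples” and leave nothing for the “s?”. Modifying the regular expression to “^(.*?)s?\.$” causes the group to take as little as possible while still letting the whole pattern match, so the captured text becomes just “apple” and the optional “s?” picks up the rest. Note that the anchors matter: without something after it that has to match, a lazy group is happy to capture nothing at all. Make sense!?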
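If you want to see the difference for yourself, here is a small sketch that prints what the first capture group grabs in each case. The class name GreedyVsLazyDemo and the helper Capture are hypothetical and exist only for this demo.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Hypothetical helper: returns the text captured by group 1 of the first match.
    public static string Capture(string pattern, string input)
    {
        return Regex.Match(input, pattern).Groups[1].Value;
    }

    public static void Main()
    {
        // Greedy: the group swallows the trailing "s".
        Console.WriteLine("[{0}]", Capture(@"^(.*)s?\.$", "apples."));   // [apples]
        // Lazy: the group stops early and "s?" takes the "s".
        Console.WriteLine("[{0}]", Capture(@"^(.*?)s?\.$", "apples."));  // [apple]
        // Lazy with a singular word: same capture, so both forms share one key.
        Console.WriteLine("[{0}]", Capture(@"^(.*?)s?\.$", "apple."));   // [apple]
        // Lazy with no anchors: the shortest possible match is the empty string.
        Console.WriteLine("[{0}]", Capture(@"(.*?)s?", "apples"));       // []
    }
}
```

The second and third lines are the reason the lookup ends up with a single “apple” key for both the singular and the plural input.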
You can read a different explanation on greediness here:
http://www.regular-expressions.info/repeat.html
Ah! How subtle. Thanks!