I was working on a problem yesterday where I needed to combine strings that were the same except for one part. Here’s a simplified version of the problem:
Input Array:
"Adam likes apples."
"Adam likes bananas."
Desired Output:
"Adam likes apples and bananas."
It was a no-brainer to use regular expressions to do the matching and parsing, but I couldn’t immediately figure out how to use them to accomplish my goal. I decided to use LINQ’s ToLookup method to create groups of matching items, and then loop through the groups to implement my combine logic.
The first step is to define a regular expression that lets me do two things. It needs to let me create a group “key,” and it needs to let me extract the data part that I’m ultimately trying to combine. For the simple example above, I can use the following pattern:
^(Adam likes )(.*)\.$
I can create the lookup using the regular expression like so:
var input = new[]
{
    "Adam likes apples.",
    "Adam likes bananas.",
};
var regex = new Regex(@"^(Adam likes )(.*)\.$");
var lookup = input.ToLookup(x => regex.Replace(x, "$1"), x => x);
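To see what the lookup actually holds, it can help to print each key with its members. The following is only a small sketch: the class name LookupDemo and the helper Describe are made up for illustration and are not part of the original code. One detail worth knowing is that Regex.Replace returns the input unchanged when the pattern does not match, so a non-matching string ends up as its own key.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookupDemo
{
    // Hypothetical helper: renders each lookup group as "[key] -> item | item".
    public static List<string> Describe(IEnumerable<string> input, Regex regex)
    {
        // Same grouping as above: the key is whatever capture group 1 matched.
        var lookup = input.ToLookup(x => regex.Replace(x, "$1"), x => x);
        return lookup
            .Select(g => string.Format("[{0}] -> {1}", g.Key, string.Join(" | ", g)))
            .ToList();
    }

    public static void Main()
    {
        var input = new[] { "Adam likes apples.", "Adam likes bananas.", "Something else" };
        var regex = new Regex(@"^(Adam likes )(.*)\.$");
        foreach (var line in Describe(input, regex))
        {
            Console.WriteLine(line);
        }
    }
}
```

Here the two "Adam likes" strings share the key "Adam likes ", while "Something else" does not match and forms a group of its own.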
The final step is to loop through the lookup’s keys and do processing on the groups:
foreach (var key in lookup.Select(x => x.Key).ToList())
{
    if (lookup[key].Count() > 1)
    {
        // Pull the variable part ($2) out of every item in the group.
        var items = string.Join(
            " and ",
            lookup[key].Select(x => regex.Replace(x, "$2")).ToArray());
        // ${1} rather than $1, so an item that starts with a digit
        // is not read as part of the group number.
        var output = regex.Replace(
            lookup[key].First(),
            string.Format("${{1}}{0}.", items));
        Console.WriteLine(output);
    }
    else
    {
        Console.WriteLine(lookup[key].First());
    }
}
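As a variation, the same grouping can be done with Regex.Match and its Groups collection instead of replacement strings, which avoids having "$" in the data interpreted as a substitution. This is a sketch under assumptions rather than the approach used above: CombineSketch and Combine are hypothetical names, and it assumes the pattern has exactly the shape shown earlier (group 1 is the key, group 2 is the variable part, followed by a period). Unlike the loop above, it emits non-matching lines first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CombineSketch
{
    // Hypothetical helper: combines lines that share capture group 1.
    public static List<string> Combine(IEnumerable<string> input, Regex regex)
    {
        var results = new List<string>();
        var parsed = input
            .Select(line => new { Line = line, Match = regex.Match(line) })
            .ToList();

        // Lines that do not match the pattern pass through unchanged.
        results.AddRange(parsed.Where(p => !p.Match.Success).Select(p => p.Line));

        // Matching lines are grouped by group 1 and their group 2 values joined.
        foreach (var group in parsed.Where(p => p.Match.Success)
                                    .GroupBy(p => p.Match.Groups[1].Value))
        {
            var parts = group.Select(p => p.Match.Groups[2].Value);
            results.Add(group.Key + string.Join(" and ", parts) + ".");
        }
        return results;
    }

    public static void Main()
    {
        var regex = new Regex(@"^(Adam likes )(.*)\.$");
        var input = new[] { "Adam likes apples.", "Adam likes bananas." };
        foreach (var line in Combine(input, regex))
        {
            Console.WriteLine(line);
        }
    }
}
```

A group with a single member is rebuilt as key + part + ".", which reproduces the original line for a pattern of this shape.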
Here’s another example to illustrate how this might be useful:
static void Main(string[] args)
{
    var input = new[]
    {
        "Adam ate 3 apples.",
        "Adam ate 1 apple.",
        "Adam ate 1 banana.",
        "Adam ate 1 banana.",
        "Adam ate 1 orange.",
    };
    var regex = new Regex(@"^(Adam ate)\s+(\d+)\s+(.*?)s?\.$");
    var lookup = input.ToLookup(x => regex.Replace(x, "$1$3"), x => x);
    foreach (var key in lookup.Select(x => x.Key).ToList())
    {
        if (lookup[key].Count() > 1)
        {
            int sum = 0;
            foreach (var item in lookup[key])
            {
                sum += int.Parse(regex.Replace(item, "$2"));
            }
            var target = regex.Replace(lookup[key].First(), "$3");
            if (sum > 1)
            {
                target += "s";
            }
            var output = regex.Replace(
                lookup[key].First(),
                string.Format("$1 {0} {1}.", sum, target));
            Console.WriteLine(output);
        }
        else
        {
            Console.WriteLine(lookup[key].First());
        }
    }
    Console.ReadLine();
}
// Output:
// Adam ate 4 apples.
// Adam ate 2 bananas.
// Adam ate 1 orange.
Great! A powerful usage of LINQ and regex 🙂 Maybe you can explain a bit of your regex-fu?
In the regex in your second example, why does the third parenthesized phrase “(.*?)” have a question mark? I understand that * means ‘zero or more’; wouldn’t that already cover every case the question mark adds?
I loaded up the code and saw that removing the question mark and playing a bit with the input causes the lookup to contain keys for both ‘apple’ and ‘apples’, ‘banana’ and ‘bananas’, etc. It’s apparently necessary for this to work, but I don’t understand why 😦
Hey, Dan! Great question.
A question mark that follows an asterisk (or a plus) makes it “lazy.” By default, * and + are “greedy.”
Here’s an example…
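Let’s say we have “apples” as the string we are investigating. The regular expression “(.*)s?” will match, but the “.*” capture group will greedily take the entire word, leaving nothing for the optional “s?”. Modifying the regular expression to “(.*?)s?” causes the “.*?” to take as little as possible while still letting the rest of the pattern match. In the full pattern, the “\.$” that follows forces the group to grow until only an optional “s” and the period remain, so the captured text becomes just “apple” and the “s?” picks up the trailing “s”. Make sense?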
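If you want to try the difference yourself, here is a small sketch. The class name GreedyVsLazy and the helper Capture are made up for this illustration, and the patterns are anchored with “\.$” the way the one in the post is.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazy
{
    // Hypothetical helper: returns what capture group 1 matched.
    public static string Capture(string pattern, string text)
    {
        return Regex.Match(text, pattern).Groups[1].Value;
    }

    public static void Main()
    {
        // Greedy: the group keeps the trailing "s".
        Console.WriteLine(Capture(@"^(.*)s?\.$", "apples."));   // apples
        // Lazy: the group stops early and "s?" takes the "s".
        Console.WriteLine(Capture(@"^(.*?)s?\.$", "apples."));  // apple
        // Lazy on a singular word: the capture is the same.
        Console.WriteLine(Capture(@"^(.*?)s?\.$", "apple."));   // apple
    }
}
```

With the lazy version, “apple.” and “apples.” both yield “apple”, which is why they land under the same lookup key.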
You can read a different explanation on greediness here:
http://www.regular-expressions.info/repeat.html
Ah! How subtle. Thanks!