Simple .DistinctBy Extension

LINQ’s Distinct extension has largely been a disappointment to me. Sure, it’s nice when I’m working with a collection of integers, but more often than not, I’m working with a collection of objects and don’t have an IEqualityComparer<TSource> available to me. I know I could just create one, but I just want to use a lambda like just about everything else I do with LINQ!

To the internet, right? I learned I could use the following trick to get what I want:

collection
  .GroupBy(x => x.key)
  .Select(x => x.First());

Works like a charm, but I got tired of dot-GroupBy-dot-Select-ing and adding a comment about what I was doing for future maintainers, and I think it’s a lot better to just chuck it into an extension method.

public static IEnumerable<TSource> DistinctBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)
{
    // Returns null when source is null; otherwise keeps the first item seen for each key.
    return
        source
            ?.GroupBy(keySelector)
            .Select(grp => grp.First());
}

Ahh, nice! Alternatively, you could get this functionality by adding MoreLINQ to your project. On a neat side note, you can also cherry-pick which MoreLINQ functionality you want by installing individual packages.
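Here's a quick sketch of the extension in use. The Employee type and the sample data are made up for illustration, and I've named the method DistinctByKey (my own choice, not part of the original) so it can't clash with the built-in Enumerable.DistinctBy that ships with .NET 6 and later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type, used only for this illustration.
public class Employee
{
    public string Name { get; set; }
    public string Department { get; set; }
}

public static class DistinctByExtensions
{
    // Same GroupBy/First approach as the extension method above.
    // The name DistinctByKey is an assumption made to avoid a clash
    // with Enumerable.DistinctBy on .NET 6 and later.
    public static IEnumerable<TSource> DistinctByKey<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector)
    {
        return source
            ?.GroupBy(keySelector)
            .Select(grp => grp.First());
    }
}

public static class DistinctByDemo
{
    public static List<string> FirstNamePerDepartment()
    {
        var employees = new[]
        {
            new Employee { Name = "Ann", Department = "Sales" },
            new Employee { Name = "Bob", Department = "Sales" },
            new Employee { Name = "Cat", Department = "IT" },
        };

        // GroupBy yields groups in order of each key's first appearance,
        // so the first employee seen for each department is kept.
        return employees
            .DistinctByKey(e => e.Department)
            .Select(e => e.Name)
            .ToList(); // Ann, Cat
    }
}
```

Because the method uses ?. on the source, calling it on a null collection returns null rather than throwing, so callers may still want a null check before enumerating.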

Group Strings Using LINQ and Regular Expressions

I was working on a problem yesterday where I needed to combine strings that were the same except for one part. Here’s a simplified version of the problem:

Input Array:
"Adam likes apples."
"Adam likes bananas."

Desired Output:
"Adam likes apples and bananas."

It was a no-brainer to use regular expressions to do the matching and parsing, but I couldn’t immediately figure out how to use them to accomplish my goal. I decided to use LINQ’s ToLookup method to create groups of matching items, and then loop through the groups to implement my combine logic.

The first step is to define a regular expression that lets me do two things. It needs to let me create a group “key,” and it needs to let me extract the data part that I’m ultimately trying to combine. For the simple example above, I can use the following pattern:

^(Adam likes )(.*)\.$

I can create the lookup using the regular expression like so:

var input = new[] 
    {
        "Adam likes apples.",
        "Adam likes bananas.",
    };
var regex = new Regex(@"^(Adam likes )(.*)\.$");
var lookup = input.ToLookup(x => regex.Replace(x, "$1"), x => x);

The final step is to loop through the lookup’s keys and do processing on the groups:

foreach (var key in lookup.Select(x => x.Key).ToList())
{
    if (lookup[key].Count() > 1)
    {
        var items = string.Join(
            " and ",
            lookup[key].Select(x => regex.Replace(x, "$2")).ToArray());
        
        var output = regex.Replace(
            lookup[key].First(),
            string.Format("$1{0}.", items));
        Console.WriteLine(output);
    }
    else
    {
        Console.WriteLine(lookup[key].First());
    }
}
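For comparison, here's a minimal sketch of the same combine step written as a single query. It is a variation on the loop above, not a replacement: it swaps regex.Replace for Regex.Match and reads the capture groups directly, and it uses GroupBy instead of ToLookup. The SentenceCombiner class and Combine method names are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SentenceCombiner
{
    // Same pattern as above: group 1 is the shared prefix (the key),
    // group 2 is the part to combine.
    private static readonly Regex Pattern = new Regex(@"^(Adam likes )(.*)\.$");

    public static List<string> Combine(IEnumerable<string> input)
    {
        return input
            .Select(line => new { Line = line, Match = Pattern.Match(line) })
            // Lines that do not match the pattern are grouped under their own text.
            .GroupBy(x => x.Match.Success ? x.Match.Groups[1].Value : x.Line)
            .Select(g => g.Count() > 1 && g.All(x => x.Match.Success)
                // Several matching lines share a prefix: join their data parts.
                ? g.Key + string.Join(" and ", g.Select(x => x.Match.Groups[2].Value)) + "."
                // Otherwise pass the first line through unchanged.
                : g.First().Line)
            .ToList();
    }
}
```

Reading the groups from a Match also sidesteps a quirk of building replacement strings like "$1" + items: if the combined text ever started with a digit, the replacement parser could read it as part of the group number.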

Here’s another example to illustrate how this might be useful:

static void Main(string[] args)
{
    var input = new[] 
        {
            "Adam ate 3 apples.",
            "Adam ate 1 apple.",
            "Adam ate 1 banana.",
            "Adam ate 1 banana.",
            "Adam ate 1 orange.",
        };
    var regex = new Regex(@"^(Adam ate)\s+(\d+)\s+(.*?)s?\.$");
    var lookup = input.ToLookup(x => regex.Replace(x, "$1$3"), x => x);
    foreach (var key in lookup.Select(x => x.Key).ToList())
    {
        if (lookup[key].Count() > 1)
        {
            int sum = 0;
            foreach (var item in lookup[key])
            {
                sum += int.Parse(regex.Replace(item, "$2"));
            }

            var target = regex.Replace(lookup[key].First(), "$3");
            if (sum > 1)
            {
                target += "s";
            }
            var output = regex.Replace(
                lookup[key].First(),
                string.Format("$1 {0} {1}.", sum, target));
            Console.WriteLine(output);
        }
        else
        {
            Console.WriteLine(lookup[key].First());
        }
    }
    Console.ReadLine();
}

// Output:
//   Adam ate 4 apples.
//   Adam ate 2 bananas.
//   Adam ate 1 orange.

Find Duplicate Database Entries with LINQPad

A co-worker and I were chatting about a code problem he was having that was likely due to duplicate entries in a database table. He thought that records in the table were unique based on two columns, but that didn’t seem to agree with what was happening in code. He wanted to write a SQL query to identify any duplicates, but he didn’t know how to do it. He doesn’t write a lot of SQL and wasn’t comfortable with it. He uses LINQ every day, though, so I suggested he do it through one of my favorite tools: LINQPad.

Regardless of whether you’re doing it in SQL or LINQPad, the approach is the same: group by the fields and filter to show only groups with more than one item.

So, let’s do that in LINQPad. Here are the quick-steps to get you caught up to the point of writing your query:

  1. Open LINQPad
  2. Configure connection
  3. Configure query to use the connection
  4. Write query

There are two things to keep in mind when writing queries in LINQPad. First, the table names will be pluralized. “SomeObject” becomes “SomeObjects.” Second, field names will be case-sensitive and always start with a capital letter. Now let’s get to business…

SomeObjects.GroupBy(x => x.FirstColumn + "|" + x.SecondColumn)
    .Where(x => x.Count() > 1)

Boom! That’s all there is to it. I concatenate the columns to create a group key, and filter the results to only show groups with more than one item.
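One caveat with a concatenated key is that the separator can appear in the data: ("a|b", "c") and ("a", "b|c") both produce "a|b|c" and would be reported as duplicates of each other. A possible alternative, sketched here against an in-memory list, is to group by an anonymous type, which compares its properties by value. The SomeObject class below is a hypothetical stand-in for the table class LINQPad generates, and I haven't verified how every LINQ provider translates this grouping to SQL.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type standing in for a LINQPad-generated table class.
public class SomeObject
{
    public string FirstColumn { get; set; }
    public string SecondColumn { get; set; }
}

public static class DuplicateFinder
{
    public static List<string> FindDuplicateKeys(IEnumerable<SomeObject> rows)
    {
        return rows
            // Anonymous types are equal when all of their properties are equal,
            // so no separator character is needed.
            .GroupBy(x => new { x.FirstColumn, x.SecondColumn })
            .Where(g => g.Count() > 1)
            .Select(g => g.Key.FirstColumn + ", " + g.Key.SecondColumn
                + " (" + g.Count() + " rows)")
            .ToList();
    }
}
```

With rows ("a|b", "c"), ("a", "b|c"), ("x", "y"), and ("x", "y"), only the "x, y" pair is reported.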

Collection Lookups

Yesterday, I was discussing a method with a co-worker where I suggested we loop through a collection of records and, for each record, do another retrieval-by-ID via LINQ. He brought up that this would probably be done more efficiently by creating a dictionary before the loop and retrieving from the dictionary instead of repeatedly executing the LINQ query. So I decided to do some research.

Firstly, I learned about two new LINQ methods: ToDictionary and ToLookup. Lookups and dictionaries serve a similar purpose, but the primary distinction is that a lookup will allow duplicate keys. Check out this article for a quick comparison of the two structures.
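To make the duplicate-key difference concrete, here's a small sketch. The class and method names are made up for this example, and it assumes every word is non-empty. ToDictionary throws an ArgumentException when two items produce the same key, while ToLookup collects them under that key and returns an empty sequence for a key it has never seen.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LookupVsDictionary
{
    // Counts the words filed under the given initial letter.
    public static int CountInLookup(IEnumerable<string> words, char key)
    {
        ILookup<char, string> byInitial = words.ToLookup(w => w[0]);
        // Indexing a lookup with a missing key yields an empty sequence,
        // not an exception.
        return byInitial[key].Count();
    }

    // Reports whether ToDictionary rejects the input because of duplicate keys.
    public static bool ToDictionaryThrowsOnDuplicates(IEnumerable<string> words)
    {
        try
        {
            words.ToDictionary(w => w[0]);
            return false;
        }
        catch (ArgumentException)
        {
            return true;
        }
    }
}
```

For the words "apple", "avocado", and "banana", the lookup holds two items under 'a' and none under 'z', and ToDictionary throws because 'a' appears twice.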

With my new tools in hand, I wanted to compare the performance. I first came up with a test. I created a collection of simple objects that had an ID and then looped through and retrieved each item by ID. Here’s what the test looks like:

void Main()
{
	var iterations = 10000;
	var list = new List<Human>();
	for (int i = 0; i < iterations; i++)
	{
		list.Add(new Human(i));
	}
	
	var timesToAvg = 100;
	
	Console.WriteLine("Avg of .Where search: {0} ms", 
		AverageIt((l, i) => TestWhere(l, i), list, iterations, timesToAvg));
	
	Console.WriteLine("Avg of for-built Dictionary search: {0} ms", 
		AverageIt((l, i) => TestDictionary(l, i), list, iterations, timesToAvg));
		
	Console.WriteLine("Avg of LINQ-built Dictionary search: {0} ms", 
		AverageIt((l, i) => TestToDictionary(l, i), list, iterations, timesToAvg));
		
	Console.WriteLine("Avg of Lookup search: {0} ms", 
		AverageIt((l, i) => TestLookup(l, i), list, iterations, timesToAvg));
}

decimal AverageIt(Action<List<Human>, int> action, List<Human> list, int iterations, int timesToAvg)
{
	var sw = new Stopwatch();
	
	decimal sum = 0;
	for (int i = 0; i < timesToAvg; i++)
	{
		sw.Reset();
		sw.Start();
		action(list, iterations);
		sw.Stop();
		sum += sw.ElapsedMilliseconds;
	}
	return sum / timesToAvg;
}

class Human
{
	public int id;
	
	public Human(int id)
	{
		this.id = id;
	}
}

Then, I wrote a method for each algorithm I wanted to test: using .Where, using a manually-built dictionary, using a ToDictionary-built dictionary, and using a lookup. Here are the methods I wrote for each of the algorithms:

void TestWhere(List<Human> list, int iterations)
{	
	for (int i = 0; i < iterations; i++)
	{
		var h = list.Where(x => x.id == i).FirstOrDefault();
	}
}

void TestDictionary(List<Human> list, int iterations)
{
	var dict = new Dictionary<int, Human>();
	foreach (var h in list)
	{
		dict.Add(h.id, h);
	}
	for (int i = 0; i < iterations; i++)
	{
		var h = dict[i];
	}
}

void TestToDictionary(List<Human> list, int iterations)
{
	var dict = list.ToDictionary(x => x.id);
	for (int i = 0; i < iterations; i++)
	{
		var h = dict[i];
	}
}

void TestLookup(List<Human> list, int iterations)
{
	var lookup = list.ToLookup(
		x => x.id,
		x => x);
	for (int i = 0; i < iterations; i++)
	{
		var h = lookup[i];
	}
}

Here are the results:

Avg of .Where search: 987.89 ms
Avg of for-built Dictionary search: 1.85 ms
Avg of LINQ-built Dictionary search: 1.67 ms
Avg of Lookup search: 2.14 ms

I would say that the results are what I expected in terms of what performed best. I was surprised by just how poorly the .Where queries performed, though: they were awful! Note that the dictionary and lookup timings include the cost of building the structure, and they still come out hundreds of times faster. One note about the manually-built dictionary versus the one produced by LINQ’s ToDictionary method: in repeated tests, the better performing method was inconsistent, leading me to believe that there is no significant benefit or disadvantage to using one or the other. I’ll likely stick with ToDictionary in the future due to its brevity, though.

These results suggest that a dictionary is the best choice for lookups when key uniqueness is guaranteed. If the key is not unique or its uniqueness is questionable, a lookup should be used instead. Never do what I wanted to do, though, and use a .Where as an inner-loop lookup retrieval mechanism.
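In practice, the pattern my co-worker suggested might look something like the sketch below: build the dictionary once before the loop, then use TryGetValue inside it so a missing ID doesn't throw. The Customer, Order, and OrderReport types are hypothetical and exist only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration only.
public class Customer
{
    public int Id;
    public string Name;
}

public class Order
{
    public int CustomerId;
    public int Quantity;
}

public static class OrderReport
{
    public static List<string> Describe(
        IEnumerable<Order> orders,
        IEnumerable<Customer> customers)
    {
        // Build the index once, outside the loop.
        var customersById = customers.ToDictionary(c => c.Id);

        var lines = new List<string>();
        foreach (var order in orders)
        {
            // Hash lookup per order instead of scanning the customer list each time.
            Customer customer;
            var name = customersById.TryGetValue(order.CustomerId, out customer)
                ? customer.Name
                : "(unknown)";
            lines.Add(name + ": " + order.Quantity);
        }
        return lines;
    }
}
```

If customer IDs might repeat, ToDictionary would throw here, which is the case where ToLookup is the safer choice.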

12/10/2012 Update:
A co-worker pointed out that I don’t need to chain Where and FirstOrDefault. Instead, I can just use FirstOrDefault with a lambda. So I added this to the test app to see how it compared. Surprisingly, this seems to consistently run slower than using Where in conjunction with FirstOrDefault!

void TestFirstOrDefault(List<Human> list, int iterations)
{	
	for (int i = 0; i < iterations; i++)
	{
		var h = list.FirstOrDefault(x => x.id == i);
	}
}

We also agreed that there should be a for-each loop as a base comparison, so I added that as well.

void TestForEach(List<Human> list, int iterations)
{
	for (int i = 0; i < iterations; i++)
	{
		foreach (var x in list)
		{
			if (i == x.id)
			{
				break;
			}
		}
	}
}

Here are the full results with the two new algorithms:

Avg of ForEach search: 741.05 ms
Avg of .Where search: 980.13 ms
Avg of .FirstOrDefault search: 1189.01 ms
Avg of for-built Dictionary search: 1.57 ms
Avg of LINQ-built Dictionary search: 1.57 ms
Avg of Lookup search: 1.74 ms

**********
Complete code:

void Main()
{
	var iterations = 10000;
	var list = new List<Human>();
	for (int i = 0; i < iterations; i++)
	{
		list.Add(new Human(i));
	}
	
	var timesToAvg = 100;
	
	Console.WriteLine("Avg of ForEach search: {0} ms", 
		AverageIt((l, i) => TestForEach(l, i), list, iterations, timesToAvg));
	
	Console.WriteLine("Avg of .Where search: {0} ms", 
		AverageIt((l, i) => TestWhere(l, i), list, iterations, timesToAvg));
		
	Console.WriteLine("Avg of .FirstOrDefault search: {0} ms", 
		AverageIt((l, i) => TestFirstOrDefault(l, i), list, iterations, timesToAvg));
	
	Console.WriteLine("Avg of for-built Dictionary search: {0} ms", 
		AverageIt((l, i) => TestDictionary(l, i), list, iterations, timesToAvg));
		
	Console.WriteLine("Avg of LINQ-built Dictionary search: {0} ms", 
		AverageIt((l, i) => TestToDictionary(l, i), list, iterations, timesToAvg));
		
	Console.WriteLine("Avg of Lookup search: {0} ms", 
		AverageIt((l, i) => TestLookup(l, i), list, iterations, timesToAvg));
}

decimal AverageIt(Action<List<Human>, int> action, List<Human> list, int iterations, int timesToAvg)
{
	var sw = new Stopwatch();
	
	decimal sum = 0;
	for (int i = 0; i < timesToAvg; i++)
	{
		sw.Reset();
		sw.Start();
		action(list, iterations);
		sw.Stop();
		sum += sw.ElapsedMilliseconds;
	}
	return sum / timesToAvg;
}

class Human
{
	public int id;
	
	public Human(int id)
	{
		this.id = id;
	}
}

void TestForEach(List<Human> list, int iterations)
{
	for (int i = 0; i < iterations; i++)
	{
		foreach (var x in list)
		{
			if (i == x.id)
			{
				break;
			}
		}
	}
}

void TestWhere(List<Human> list, int iterations)
{	
	for (int i = 0; i < iterations; i++)
	{
		var h = list.Where(x => x.id == i).FirstOrDefault();
	}
}

void TestFirstOrDefault(List<Human> list, int iterations)
{	
	for (int i = 0; i < iterations; i++)
	{
		var h = list.FirstOrDefault(x => x.id == i);
	}
}

void TestDictionary(List<Human> list, int iterations)
{
	var dict = new Dictionary<int, Human>();
	foreach (var h in list)
	{
		dict.Add(h.id, h);
	}
	for (int i = 0; i < iterations; i++)
	{
		var h = dict[i];
	}
}

void TestToDictionary(List<Human> list, int iterations)
{
	var dict = list.ToDictionary(x => x.id);
	for (int i = 0; i < iterations; i++)
	{
		var h = dict[i];
	}
}

void TestLookup(List<Human> list, int iterations)
{
	var lookup = list.ToLookup(
		x => x.id,
		x => x);
	for (int i = 0; i < iterations; i++)
	{
		var h = lookup[i];
	}
}

Joins in LINQ

The scenario: you have two related collections of objects, and you need to smush ’em together into a collection of combined records. It’s easy to do with LINQ’s Join method, but Join can seem a little intimidating. Just check out its declaration:

// yikes!
public static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>(
	this IEnumerable<TOuter> outer,
	IEnumerable<TInner> inner,
	Func<TOuter, TKey> outerKeySelector,
	Func<TInner, TKey> innerKeySelector,
	Func<TOuter, TInner, TResult> resultSelector
)

It’s really not so bad, though. Here’s the breakdown:

  • “this IEnumerable<TOuter> outer”: what you’re joining from
  • “IEnumerable<TInner> inner”: what you’re joining to
  • “Func<TOuter, TKey> outerKeySelector”: an expression for how to match the ‘from’ records
  • “Func<TInner, TKey> innerKeySelector”: an expression for how to match the ‘to’ records
  • “Func<TOuter, TInner, TResult> resultSelector”: an expression for the joined result

Still sounds rough? Let’s look at an easy example:

class Person
{
	public string Name;
	public string Occupation;
}

class Job
{
	public string Name;
	public decimal Salary;
}

void Main()
{
	var people = new[]
	{
		new Person { Name = "Adam", Occupation = "Blogger" },
		new Person { Name = "Joe", Occupation = "Teacher" },
		new Person { Name = "Hilary", Occupation = "Actress" }
	};
	var jobs = new[]
	{
		new Job { Name = "Blogger", Salary = 0.0m },
		new Job { Name = "Teacher", Salary = 100.0m },
		new Job { Name = "Actress", Salary = 5000.0m }
	};

	var salaryByPerson = people.Join(
		jobs,
		p => p.Occupation,
		j => j.Name,
		(p,j) => new { Person = p.Name, Salary = j.Salary });

	foreach (var sbp in salaryByPerson)
	{
		Console.WriteLine("Person: {0}, Salary: {1}",
			sbp.Person,
			sbp.Salary.ToString("c"));
	}
}

/* Output
Person: Adam, Salary: $0.00
Person: Joe, Salary: $100.00
Person: Hilary, Salary: $5,000.00
*/

The Join in the above example is equivalent to SQL like this:

SELECT p.Name AS Person, j.Salary
FROM people p
JOIN jobs j ON p.Occupation=j.Name

Now you’ve got it, right? Yea!
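If the SQL version reads more naturally to you, the same join can also be written in LINQ's query syntax, which the compiler turns into a call to Join. Here's a minimal sketch; the JoinQuerySyntax class is made up for this example, and it returns key/value pairs instead of an anonymous type so the result can leave the method.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Person
{
    public string Name;
    public string Occupation;
}

public class Job
{
    public string Name;
    public decimal Salary;
}

public static class JoinQuerySyntax
{
    public static List<KeyValuePair<string, decimal>> SalaryByPerson(
        IEnumerable<Person> people,
        IEnumerable<Job> jobs)
    {
        // "join ... on ... equals ..." maps onto the outer and inner key selectors;
        // the select clause plays the role of the result selector.
        var query =
            from p in people
            join j in jobs on p.Occupation equals j.Name
            select new KeyValuePair<string, decimal>(p.Name, j.Salary);

        return query.ToList();
    }
}
```

Like the SQL JOIN above, this is an inner join: a person whose occupation has no matching job is left out of the result.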

Add Additional References in LINQPad

LINQPad is one of my favorite development tools. I use it all the time to do quick tests to verify thoughts or discussions that I’m having with peers. It’s terrific for building snippets for email and doing quick what-is-this-really-doing checks. (And it’s perfectly free!)

One of the not-so-obvious things that I’ve run into with LINQPad is adding references. I looked through all the menus looking for some sort of “Add References” option but found nothing!

While slightly less obvious than I would’ve liked, adding references is very easy to do: just press F4. This is the keyboard shortcut to Query Properties (which can also be found in the menus), where you can add additional assembly references or import namespaces.

I loves me some LINQPad!

Making Enumerable Collections LINQ-Queryable

LINQ is one of the greatest things that’s happened to Windows programming. I love it, and I use it all the time.

Occasionally, you’ll run into an enumerable collection class that can’t be queried with LINQ because it only implements the non-generic IEnumerable, so its items aren’t associated with a type. This can be easily overcome by using the Enumerable.Cast<T>() method.

Here’s a quick example from MSDN:

System.Collections.ArrayList fruits = new System.Collections.ArrayList();
fruits.Add("apple");
fruits.Add("mango");

IEnumerable<string> query =
	fruits.Cast<string>().Select(fruit => fruit);

foreach (string fruit in query)
{
	Console.WriteLine(fruit);
}

This is a great technique to use instead of settling and using a for-each loop (ick!).
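One thing to watch for: Cast<T>() assumes every item really is a T and throws an InvalidCastException during enumeration if one isn't. When a collection might hold mixed types, Enumerable.OfType<T>() filters to the matching items instead. Here's a small sketch; the NonGenericDemo class and its method names are made up for this example.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public static class NonGenericDemo
{
    // Keeps only the items that are strings and skips everything else.
    public static List<string> StringsOnly(IEnumerable items)
    {
        return items.OfType<string>().ToList();
    }

    // Reports whether Cast<string>() fails on the given items.
    public static bool CastFails(IEnumerable items)
    {
        try
        {
            // ToList forces enumeration, which is when the cast happens.
            items.Cast<string>().ToList();
            return false;
        }
        catch (InvalidCastException)
        {
            return true;
        }
    }
}
```

For an ArrayList containing "apple", 42, and "mango", StringsOnly returns the two strings, while Cast<string>() fails when it reaches the integer.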

Manipulating Objects in a Collection with LINQ

I was chatting with a co-worker about using LINQ to replace for-each loops. More specifically, we were discussing how to modify the properties of items in a collection or a subset of the collection. I didn’t know how to do it immediately, but I worked on it a bit and found a pretty cool way to do just that.

My first thought was that you could just put your logic into a Select method, modify the objects, and then return a dummy value that would be ignored. Something like this:

// this does not work!
values.Select(x =>
{
    x.Name = x.Name.ToUpper();
    return true;
});

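This did not work, though! The reason is deferred execution: Select only builds a lazy query, and its lambda never runs unless something enumerates the result. I tried a few different approaches and found a way that does work. It feels less hacky, too, since I’m not returning that meaningless dummy value.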
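Here's a minimal sketch that shows the laziness at work. The Person type and the DeferredExecutionDemo class are stand-ins made up for this example. The Select lambda does nothing until the query is enumerated, which is why the names stay unchanged until ToList is called.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical type for illustration only.
public class Person
{
    public string Name { get; set; }
}

public static class DeferredExecutionDemo
{
    // Returns the name before and after the query is enumerated.
    public static string[] Run()
    {
        var values = new List<Person> { new Person { Name = "adam" } };

        // Building the query does not run the lambda.
        var query = values.Select(x =>
        {
            x.Name = x.Name.ToUpper();
            return true;
        });
        var before = values[0].Name; // still "adam"

        // Enumerating the query (here via ToList) finally runs the lambda.
        query.ToList();
        var after = values[0].Name; // now "ADAM"

        return new[] { before, after };
    }
}
```

Relying on a query for its side effects is easy to get wrong for exactly this reason, so a plain foreach loop (or List<T>.ForEach) is usually the more predictable way to modify items.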

Here’s the solution that I came up with:

values.Aggregate(null as Person, (av, e) =>
{
    e.Name = e.Name.ToUpper();
    return av ?? e;
});

If you only want to manipulate a subset of the collection, you can insert a Where method before your aggregate, like this:

values.Where(x => x.Name.Equals("Blah"))
    .Aggregate(null as Person, (av, e) =>
    {
        e.Name = e.Name.ToUpper();
        return av ?? e;
    });

You can read more about the Aggregate method here.