A lot of thought went into the implementation of the set algorithms in LINQ: Distinct
, Except
, Intersect
and Union
. They guarantee that the items are returned in the same order that they appear in before calling the method. So, these two expressions return the items in the same order:
var authors1 = books.Select(b => b.Author).OrderBy(a => a).Distinct();
var authors2 = books.Select(b => b.Author).Distinct().OrderBy(a => a);
The implementation of Distinct
would look something like this:
public IEnumerable<T> Distinct<T>(this IEnumerable<T> source)
{
HashSet<T> set = new HashSet<T>();
foreach (T item in source)
{
if (set.Add(item))
{
yield return item;
}
}
}
The implementation of Except
would look like this:
public static IEnumerable<T> Except<T>(this IEnumerable<T> source1, IEnumerable<T> source2)
{
HashSet<T> set = new HashSet<T>(source2);
foreach (T item in source1)
{
if (set.Add(item))
{
yield return item;
}
}
}
The implementation of Intersect
would look something like this:
public static IEnumerable<T> Intersect<T>(this IEnumerable<T> source1, IEnumerable<T> source2)
{
HashSet<T> set = new HashSet<T>(source2);
HashSet<T> found = new HashSet<T>();
foreach (T item in source1)
{
if (set.Contains(item) && found.Add(item))
{
yield return item;
}
}
}
I added a second set to remove duplicates from the list.
Finally, Union
is a little more complex, but basically the same thing:
public static IEnumerable<T> Union<T>(this IEnumerable<T> source1, IEnumerable<T> source2)
{
HashSet<T> set = new HashSet<T>();
foreach (T item in source1)
{
if (set.Add(item))
{
yield return item;
}
}
foreach (T item in source2)
{
if (set.Add(item))
{
yield return item;
}
}
}
So, the only operation not supported by LINQ is symmetrical difference or SymmetricExcept
. I have been playing around with creating a “stable” version of this algorithm and out of pure curiosity am wondering if there is a better implementation. The most straight-forward implementation is to just call Except
twice:
public static IEnumerable<T> SymmetricExcept<T>(this IEnumerable<T> source1, IEnumerable<T> source2)
{
var except1 = source1.Except(source2);
var except2 = source2.Except(source1);
return except1.Concat(except2);
}
Although, this requires going through both lists twice. So I wrote a version that only requires going through the second list twice:
public static IEnumerable<T> SymmetricExcept<T>(this IEnumerable<T> source1, IEnumerable<T> source2)
{
HashSet<T> set = new HashSet<T>(source2);
HashSet<T> found = new HashSet<T>();
foreach (T item in source1)
{
if (!set.Contains(item) && found.Add(item))
{
yield return item;
}
}
foreach (T item in source2)
{
if (found.Add(item))
{
yield return item;
}
}
}
I haven’t verified the correctness of these algorithms. There is some ambiguity about how LINQ handles duplicates. I am curious if there is a more efficient way to do SymmetricExcept
, something better than O(m + 2n)
.
7