Finding substrings using culture-sensitive comparisons

Motivation

We will start this article with a bold statement: The string.IndexOf functionality for finding substrings in the .NET Framework is broken. We don’t mean that its results are incorrect, but that they are incomplete; and this design deficiency frequently misleads consumers into writing string manipulation logic that produces incorrect results or throws unexpected exceptions when applied to strings from the wild.

Concretely, string.IndexOf fails to indicate the length of the matched substring, implicitly causing consumers to make the mistaken assumption that the match will always have the same length as the argument that was specified to be searched for. This is not the case: Culture-sensitive comparisons can result in a match that has a different length from the search string. Notwithstanding, there is no public API in the .NET Framework Class Library that can retrieve the actual length of such matches. If you’ve ever written any string manipulation logic that involves substrings, there’s a reasonable chance that your code suffers from this bug.

Illustrative examples

Suppose that you need to write a method that removes the first occurrence of a substring from a specified string, as requested in this Stack Overflow question. Let’s assume that your method should perform a word (case-sensitive and culture-sensitive) search using the current culture – as is, after all, the default behavior for the string.IndexOf(string) method overload. The accepted (and highly upvoted) answer presents logic similar to the below:

public static string RemoveFirst(string source, string value)
{
    int index = source.IndexOf(value);
    return index < 0 ? source : source.Remove(index, value.Length);
}

This answer works fine for ASCII strings, and for a restricted subset of other Unicode characters; however, it will break for Unicode strings at large:

CultureInfo.CurrentCulture = new CultureInfo("en-US");
var a1 = RemoveFirst("abcde", "cd");             // correct: "abe"
var b1 = RemoveFirst("ani­mal", "animal");        // incorrect: "l" instead of ""
var c1 = RemoveFirst("cåt", "å");                // incorrect: "c̊t" instead of "ct"
var d1 = RemoveFirst("café", "é");               // ArgumentOutOfRangeException
var e1 = RemoveFirst("encyclopædia", "aedia");   // ArgumentOutOfRangeException

Due to the subtlety of the bug, it is not readily obvious why the above behavior occurs. We will rewrite the strings by substituting Unicode escape sequences for the problematic characters:

CultureInfo.CurrentCulture = new CultureInfo("en-US");
var a2 = RemoveFirst("abcde", "cd");                            // correct: "abe"
var b2 = RemoveFirst("ani\u00ADmal", "animal");                 // incorrect: "l" instead of ""
var c2 = RemoveFirst("c\u0061\u030At", "\u00E5");               // incorrect: "c\u030At" instead of "ct"
var d2 = RemoveFirst("caf\u00E9", "\u0065\u0301");              // ArgumentOutOfRangeException
var e2 = RemoveFirst("encyclop\u00E6dia", "\u0061\u0065dia");   // ArgumentOutOfRangeException

We shall now examine the cause of each of these errors.

Ignorable characters

var b1 = RemoveFirst("ani­mal", "animal");        // incorrect: "l" instead of ""
var b2 = RemoveFirst("ani\u00ADmal", "animal");

The example above, whose strings are incidentally taken from the MSDN documentation for IndexOf, contains a SOFT HYPHEN (U+00AD) in the source string.

Character sets include ignorable characters, which are characters that are not considered when performing a linguistic or culture-sensitive comparison. In a culture-sensitive search [not Ordinal or OrdinalIgnoreCase], if [the search string] contains an ignorable character, the result is equivalent to searching with that character removed.

[…] On the .NET Framework 4 or later, because the soft hyphen is an ignorable character, a culture-sensitive search returns the same value that it would return if the soft hyphen were not included in the search string.

Consequently, the search string "animal" is matched by the source string "ani\u00ADmal" in its entirety. However, "animal" has 6 characters, whilst "ani\u00ADmal" has 7 characters due to the soft hyphen. Removing 6 characters from the source string would erroneously leave its last character in place.

Combining character sequences

var c1 = RemoveFirst("cåt", "å");                // incorrect: "c̊t" instead of "ct"
var c2 = RemoveFirst("c\u0061\u030At", "\u00E5"); 

Unicode permits diacritics (such as accents) to be represented through combining characters. Such characters are combined with the base character that precedes them, giving rise to combining character sequences. The source string in the above example (also taken from the MSDN examples for IndexOf) contains LATIN SMALL LETTER A (U+0061: a) followed by COMBINING RING ABOVE (U+030A: ◌̊), resulting in the combining character sequence "å".

Several popular combining character sequences can also be represented as a precomposed character. The search string in the above example consists of LATIN SMALL LETTER A WITH RING ABOVE (U+00E5: å). Assuming support from your browser’s layout engine and font, you will probably see the two representations, "å" and "å", as identical, despite the difference in their underlying representations ("\u0061\u030A" versus "\u00E5"). However, the combining character sequence is represented using two characters – "å".Length is 2 – whilst the precomposed character only has one – "å".Length is 1.

Since the search string consists of just the precomposed character, the RemoveFirst method will only remove one character from the matched substring in the source string; namely, '\u0061' ('a'). The combining character '\u030A' is left in place, and will be combined with its new predecessor, the base character 'c', to inadvertently produce the new combining character sequence "c̊".

var d1 = RemoveFirst("café", "é");               // ArgumentOutOfRangeException
var d2 = RemoveFirst("caf\u00E9", "\u0065\u0301");

The next example demonstrates the same issue in the opposite direction. The source string ends with the precomposed character LATIN SMALL LETTER E WITH ACUTE (U+00E9: é), whilst the search string consists of LATIN SMALL LETTER E (U+0065: e) followed by COMBINING ACUTE ACCENT (U+0301: ◌́). These two will match – "caf\u00E9".IndexOf("\u0065\u0301") gives the zero-based index position 3. However, attempting to remove two characters – the length of the search string – from this index position in the source string will cause it to run past its end, throwing ArgumentOutOfRangeException.

Ligatures

In orthography, a multigraph is a sequence of letters that behaves as a single unit under a given language. Multigraphs are commonly associated with phonemes, which represent a distinct sound. The most popular category of multigraph is the digraph, which is constituted of a pair of letters. The English language has several multigraphs, such as ⟨sh⟩ (like in “ship” and “wish”) and ⟨ee⟩ (like in “meet”).

Multigraphs sometimes give rise to a fused graphical representation of their consistent letters; these are known as ligatures. There are two ligatures that are occasionally used in English: æ, formed from the letters ‘a’ and ‘e’; and œ, formed from the letters ‘o’ and ‘e’. The former was popularized by its appearance in the title of the Encyclopædia Britannica.

When performing linguistic string comparisons, ligatures should be treated as equal to the sequence of letters from which they are formed. For example, in English, "æ" is equal to "ae", and "œ" is equal to "oe".

var e1 = RemoveFirst("encyclopædia", "aedia");   // ArgumentOutOfRangeException
var e2 = RemoveFirst("encyclop\u00E6dia", "\u0061\u0065dia");

Consequently, "encyclopædia".IndexOf("aedia") returns 8, indicating that the search string was matched. However, the length of the matched substring, "ædia", is 4, whilst the length of the search value, "aedia", is 5. The source string has no further characters after the matched substring. Therefore, attempting to remove 5 characters from index 8 in the source string will throw ArgumentOutOfRangeException.

Strategies

There is no public API in the .NET Framework Class Library that can perform such a substring search, and it is unfortunately not trivial to build one. This problem was even raised by Jon Skeet in his Stack Overflow question, How can I perform a culture-sensitive “starts-with” operation from the middle of a string?

We shall first discuss some potential approaches, then present our preferred solution. Our goal will be to define a Find extension method for strings that has the following signature:

public static bool Find(
    this string source, 
    string searchValue, 
    StringComparison comparisonType, 
    out int matchIndex, 
    out int matchLength)
{
    // ...
}

Given such a method, a correct implementation of RemoveFirst would become trivial:

public static string RemoveFirst(string source, string value)
{
    int index, length;
    if (source.Find(value, StringComparison.CurrentCulture, out index, out length))
        return source.Remove(index, length);

    return source;
}

Note how we are using the length result returned by Find, rather than the original value.Length.

Restricting to ordinal comparisons

The easy way out would be to restrict our string operation to only support ordinal comparisons. This is the route taken by the .NET Framework Class Library for a number of methods on the String class, including Contains(string), Replace(string, string), and Split(string[], StringSplitOptions). These methods can only perform an ordinal (case-sensitive and culture-insensitive) search, with no overloads to support a case-insensitive or culture-sensitive search. (It is easy to define a Contains extension method that takes a StringComparison parameter and passes it to IndexOf(String, StringComparison); however, such a shortcut cannot be taken for Replace or Split, which would also require the length.)

This design shortcoming of the .NET Framework Class Library is jarring because of the inconsistency it introduces in the framework’s public API. Most string operations do support linguistic comparisons, and enact CurrentCulture behavior by default (consider CompareTo, IndexOf, StartsWith, EndsWith, …); whilst others are restricted to ordinal comparisons.

Per the StringComparison documentation:

An operation that uses ordinal sort rules performs a comparison based on the numeric value (Unicode code point) of each Char in the string. An ordinal comparison is fast but culture-insensitive. When you use ordinal sort rules to sort strings, [the Unicode character] U+xxxx comes before U+yyyy if the value of xxxx is numerically less than yyyy.

Under ordinal comparisons, two strings would be equal if and only if they consist of identical sequences of Unicode code points; in other words, their binary representations are the same. Ignorable characters would not be ignored; precomposed characters would not match combining character sequences; and ligatures would not match multigraphs of their letters. A favorable consequence of this restriction is that equal strings would always have the same length, so it becomes safe to assume that the matched substring would have the same length as the search string.

We could implement our RemoveFirst method to always perform ordinal searches by specifying StringComparison.Ordinal as argument to IndexOf:

public static string RemoveFirst(string source, string value)
{
    int index = source.IndexOf(value, StringComparison.Ordinal);
    return index < 0 ? source : source.Remove(index, value.Length);
}

Under this implementation, the search strings in our problematic examples would not be matched by the source strings at all. This ensures that they do not produce blatantly incorrect results, nor throw ArgumentOutOfRangeException. Rather, the source string is returned intact as the result of the RemoveFirst call:

var b1 = RemoveFirst("ani­mal", "animal");        // result: "ani­mal"
var c1 = RemoveFirst("cåt", "å");                // result: "cåt"
var d1 = RemoveFirst("café", "é");               // result: "café"
var e1 = RemoveFirst("encyclopædia", "aedia");   // result: "encyclopædia"

Such a restriction is not acceptable if one needs to process user strings. Best Practices for Using Strings in the .NET Framework on MSDN recommends using local linguistic comparisons – CurrentCulture or CurrentCultureIgnoreCase – for “most user input” (as well as “data displayed to the user”).

P/Invoke calls

The .NET Framework Class Library internally implements IndexOf by making P/Invoke calls to native libraries for linguistic comparisons. The source code for CompareInfo.IndexOf shows that it internally calls InternalFindNLSStringEx, which is an external method declaration for the native FindNLSStringEx.

The FindNLSStringEx function implemented by Windows defines an out parameter, pcchFound, which is documented as follows:

Pointer to a buffer containing the length of the string that the function finds. The string can be either longer or shorter than the search string.

Note that the value of pcchFound is often identical to the value provided in cchValue [the length of the search string], but can differ [if]:

  • The strings are equivalent, but have different lengths. For example, "A" plus "Combining Ring" (U+0041 U+030A) is equivalent to the "A Ring" (U+00C5).

Unfortunately, this pcchFound parameter is disregarded by the external method calls from the .NET Framework.

One way of determining the length of the matched substring would be to bypass the .NET API altogether and call the aforementioned Windows function directly through P/Invoke, retrieving the value that it writes to pcchFound. An example of such an implementation is given under the Stack Overflow answer by David Ewen.

This is the fastest approach because it only needs to run native code, and determines the length of the match within the same operation that locates its starting character position. This approach also involves the least implementation complexity; we just need to reuse a Windows function.

However, this approach suffers from the drawback that affects P/Invoke calls in general: It sacrifices portability. P/Invoke relies on the existence of specific native libraries, which might not be available on other platforms, such as Linux. Therefore, it is not suitable for libraries that are intended to be platform-agnostic.

Unicode normalization

An alternative approach involves subjecting the source and search strings to Unicode normalization through the string.Normalize(NormalizationForm) method before comparison, as suggested in the Stack Overflow answer by Esailija.

However, this approach is undesirable since the returned results would only apply to the normalized forms of the source and search strings, requiring the original strings to be discarded and replaced by their normalized forms for all subsequent processing and storage.

Furthermore, Unicode normalization would not always yield results consistent with culture-sensitive comparisons in .NET (such as Compare or Equals with StringComparison.CurrentCulture). As mentioned in the Unicode Normalization Forms annex, Form C and Form D only support canonical mappings, such as between precomposed characters and combining character sequences – for example, "é" and "e\u0301". However, the said forms do not perform compatibility mappings, as is required for ligatures. For example, "æ" is not decomposed to "ae", nor "ffi" to "ffi", despite that the said ligatures are considered to be equal to their corresponding character sequences under the English (United States) culture. Form KC and Form KD handle compatibility mappings, and can decompose some ligatures, such as "ffi", but miss others, such as "æ". (A Stack Overflow answer mentions that “Unicode 6.2 doesn't appear to contain a normative mapping from Æ to AE.”) The issue is made worse by the discrepancies between cultures – "æ" is equal to "ae" under English (United States), but not under Danish (Denmark), as discussed under the MSDN documentation for String comparison and sorting. Thus, normalization (to any form) would not give results that are consistent with CurrentCulture semantics.

Text elements

Yet another alternative involves iterating over the strings as a sequence of text elements, rather than UTF-16 code units, using the StringInfo.GetNextTextElement method, as presented in the Stack Overflow answer by Mårten Wikström. Results would be similar to those obtained from Unicode normalization: canonical mappings are honored, but compatibility mappings are not.

Iterative equality comparisons

The final approach, which we shall be implementing, involves iteratively testing for the length of the substring until a match with the search string is found.

Such an implementation would first call IndexOf to get the starting position of the match, then iteratively attempt to identify its length. It begins with the most likely case (hot path) of the matched substring having the same length as the search string, verifying this through a substring equality comparison. If not equal, it would attempt to decrement and increment the length of the substring by one character each time, performing an equality comparison against each such substring, until a match is confirmed.

The approach of iterating over the substring’s length is proposed in the Stack Overflow answer by Ken Kin, and endorsed by usr:

I have solved a similar problem once like this (search-string highlighting in HTML). I did it similarly. You can tune the loop and search strategy in a way that makes it completed very quickly by checking the likely cases first. The nice thing about this is that it seems to be totally correct and no Unicode details leak into your code.

Implementation

For a polished implementation of our proposed approach, refer to the Find method within the StringExtensions class, available as part of our DogmaMix.Core library on GitHub. We shall be presenting a simplified version in the discussion below.

Our implementation builds upon the string.IndexOf(string, StringComparison) method from the .NET Framework Class Library, but extends it to also return the length of the match, allowing string manipulation operations to subsequently be performed correctly.

Our method defines two out parameters, for returning both the zero-based starting character position and the length (in characters) of the match.

public static bool Find(
    this string source, 
    string searchValue, 
    StringComparison comparisonType, 
    out int matchIndex, 
    out int matchLength)
{
    // argument validation

    matchIndex = source.IndexOf(searchValue, comparisonType);
    if (matchIndex == -1)
    {
        matchLength = -1;
        return false;
    }

    matchLength = FindMatchLength(source, searchValue, comparisonType, matchIndex);
    return true;
}

FindMatchLength

After determining the starting character position of the match through IndexOf, we proceed to determine its length through our FindMatchLength helper method:

private static int FindMatchLength(string source, string searchValue, StringComparison comparisonType, int matchIndex)
{
    int matchLengthMaximum = source.Length - matchIndex;
    int matchLengthInitial = Math.Min(searchValue.Length, matchLengthMaximum);

    // Hot path: match length is same as specified search string
    if (Substring.Compare(source, matchIndex, matchLengthInitial, searchValue, 0, searchValue.Length, comparisonType) == 0)
        return matchLengthInitial;

    int matchLengthDecrementing = matchLengthInitial - 1;
    int matchLengthIncrementing = matchLengthInitial + 1;

    while (matchLengthDecrementing >= 0 || matchLengthIncrementing <= matchLengthMaximum)
    {
        if (matchLengthDecrementing >= 0)
        {
            if (Substring.Compare(source, matchIndex, matchLengthDecrementing, searchValue, 0, searchValue.Length, comparisonType) == 0)
                return matchLengthDecrementing;

            matchLengthDecrementing--;
        }

        if (matchLengthIncrementing <= matchLengthMaximum)
        {
            if (Substring.Compare(source, matchIndex, matchLengthIncrementing, searchValue, 0, searchValue.Length, comparisonType) == 0)
                return matchLengthIncrementing;

            matchLengthIncrementing++;
        }
    }

    // Should never happen
    return -1;
}

Substring.Compare

The FindMatchLength helper method, in turn, relies on another utility method we define, Substring.Compare, which compares substrings from two specified strings using the specified comparison rules, and returns an integer that indicates their relative position in the sort order.

The .NET Framework Class Library provides a series of string.Compare method overloads – such as string.Compare(string, int, string, int, int, StringComparison) – that accept parameters for specifying the starting position and length of the substrings to compare from the source strings. However, in another case of poor design, .NET fails to provide an overload that permits the lengths of the two respective substrings to be specified separately, presumably having also fallen victim to the fallacy that equal strings must have the same length.

Our Substring.Compare method is similar to the aforementioned string.Compare overload, but allows different lengths to be specified for the two substrings. It is implemented by calling the CompareInfo.Compare method – which fortunately accepts separate length parameters – on the appropriate CompareInfo instance with the appropriate CompareOptions value for each known value of StringComparison. For performance, substring instantiation is avoided, working with the start indexes and lengths instead.

public static class Substring
{
    public static int Compare(string strA, int indexA, int lengthA, string strB, int indexB, int lengthB, StringComparison comparisonType)
    { 
        // argument validation and null handling

        switch (comparisonType)
        {
            case StringComparison.CurrentCulture:
                return CultureInfo.CurrentCulture.CompareInfo.Compare(strA, indexA, lengthA, strB, indexB, lengthB, CompareOptions.None);

            case StringComparison.CurrentCultureIgnoreCase:
                return CultureInfo.CurrentCulture.CompareInfo.Compare(strA, indexA, lengthA, strB, indexB, lengthB, CompareOptions.IgnoreCase);

            case StringComparison.InvariantCulture:
                return CultureInfo.InvariantCulture.CompareInfo.Compare(strA, indexA, lengthA, strB, indexB, lengthB, CompareOptions.None);

            case StringComparison.InvariantCultureIgnoreCase:
                return CultureInfo.InvariantCulture.CompareInfo.Compare(strA, indexA, lengthA, strB, indexB, lengthB, CompareOptions.IgnoreCase);

            case StringComparison.Ordinal:
                return CultureInfo.InvariantCulture.CompareInfo.Compare(strA, indexA, lengthA, strB, indexB, lengthB, CompareOptions.Ordinal);

            case StringComparison.OrdinalIgnoreCase:
                return CultureInfo.InvariantCulture.CompareInfo.Compare(strA, indexA, lengthA, strB, indexB, lengthB, CompareOptions.OrdinalIgnoreCase);

            default:
                throw new ArgumentException("The string comparison type is not supported.", nameof(comparisonType));
        }
    }
}

Tests

Our implementation correctly handles all the above-discussed cases, including ignorable characters, combining character sequences, and ligatures.

var a1 = RemoveFirst("abcde", "cd");             // correct: "abe"
var b1 = RemoveFirst("ani­mal", "animal");        // correct: ""
var c1 = RemoveFirst("cåt", "å");                // correct: "ct"
var d1 = RemoveFirst("café", "é");               // correct: "caf"
var e1 = RemoveFirst("encyclopædia", "aedia");   // correct: "encyclop"

Known issue

Our method suffers from one known issue: When the search string is matched by multiple substrings at a given starting position in the source string, which one gets returned is undefined. Under the current implementation, the substring whose length is closest to the search string’s is preferred; however, this behavior would not always be the expected one. We believe that this issue only occurs for matches that end with ignorable characters; the inclusion or exclusion of such characters in the result is undefined.

Consider the following examples containing a soft hyphen, \u00AD:

var f = RemoveFirst("aethe\u00ADreal", "aethe");   // result: "\u00ADreal"
var g = RemoveFirst("æthe\u00ADreal", "aethe");    // result: "real"

In the first case, the search string is identical to the beginning of the source string, so the full remainder of the source string, including the soft hyphen, is returned. In the second case, the search string is one character longer than its truest match in the source string, "æthe", due to the ligature in the latter. However, the substring having the search string’s length, "æthe\u00AD", would still be a match, since the last character of such a substring would be the soft hyphen, which is ignorable. Therefore, the soft hyphen inadvertently gets included in the match.

This issue is arguably benign under the semantics of string comparisons if one assumes that the inclusion or exclusion of ignorable characters is insignificant either way. However, as shown above, it may lead to inconsistent results when the ignorable characters are considered.


TryFastCount: Optimizing sequence operations for known lengths

Representative problem

When implementing operations over IEnumerable<T> sequences, one would sometimes benefit from knowing the length of the supplied sequence upfront. Consider the source code for the implementation of the SequenceEqual extension method in .NET Framework 4.6.1. A reformatted and commented version is given below:

public static bool SequenceEqual<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    // null checks

    if (comparer == null)
        comparer = EqualityComparer<TSource>.Default;

    // Get enumerators for iterating through the two sequences.
    using (IEnumerator<TSource> e1 = first.GetEnumerator())
    using (IEnumerator<TSource> e2 = second.GetEnumerator())
    {
        // Advance the first enumerator to its next element, until the end of its sequence.
        while (e1.MoveNext())
        {
            // Advance the second enumerator to its next element.
            // If it passes the end of its sequence, then it contained less elements than the first.
            if (!e2.MoveNext())
                return false;

            // If an unequal pair of elements is encountered,
            // then the sequences are not equal.
            if (!comparer.Equals(e1.Current, e2.Current))
                return false;
        }

        // The first enumerator has passed the end of its sequence.

        // If the second enumerator does not immediately pass the end of its sequence too,
        // then it contained more elements than the first.
        if (e2.MoveNext())
            return false;
    }

    // All elements are equal, and both enumerators passed the end of their sequences together.
    return true;
}

Sequences of different lengths should always cause the method to return false. However, there is no check done at the beginning of the method to return immediately for such cases – the two sequences are still iterated over, and their elements compared, potentially right up to the end of the shorter sequence. This is wasteful, especially in cases where sequences share long common prefixes, such as:

var x = new []{ 42, 98, 23, 56, 12, 76, 23, 51, 32 };
var y = new []{ 42, 98, 23, 56, 12, 76, 23, 53, 32, 87, 76, 21 };
var b = x.SequenceEqual(y);              // ↑ would still compare elements up to here,
                                         //   despite inequality of arrays being readily
                                         //   apparent due to different lengths

Attempt #1: Count() checks

One might be tempted to resolve this inefficiency by introducing an upfront check using the Count extension method:

public static bool SequenceEqual<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    // null checks

    if (first.Count() != second.Count())
        return false;

    // ...
}

However, this is not recommended, since it might cause each sequence to be enumerated twice: once to determine its number of elements, and once to iterate over (and compare) its elements. As mentioned in our previous article, this could have unintended consequences on sequences that rely on deferred execution, such as IQueryable<T> instances that would issue and execute a query on the underlying data source (typically database) each time they are enumerated.

Attempt #2: Materialization

To avoid the repeated enumeration, one might opt to materialize both sequences as in-memory collections, similar to what LINQ does for non-streaming operators.

public static bool SequenceEqual<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    // null checks

    var firstList = first.ToList();
    var secondList = second.ToList();

    if (firstList.Count != secondList.Count)
        return false;

    // ...
}

However, this is also not recommendable. In most real-world scenarios, an unequal pair of elements would be encountered after a few iterations for the majority of sequence pairs, meaning that the rest of the elements were materialized for nothing.

Counting selectively

We have established that sequences involving deferred execution should not be counted or materialized in advance. However, what about sequences that are already in-memory collections? Such collections would usually store their number of elements as part of their data structure, allowing it to be retrieved instantaneously, without requiring any enumeration. This information could be used for optimizing operations over sequences that can take shortcuts when the lengths are known, but would otherwise need to measure the lengths as part of the same enumeration performed by the operation itself.

For example, we could optimize the SequenceEqual method implementation for ICollection<T> inputs, which expose a Count property:

public static bool SequenceEqual<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    // null checks

    var firstCollection = first as ICollection<TSource>;
    var secondCollection = second as ICollection<TSource>;

    if (firstCollection != null &&
        secondCollection != null &&
        firstCollection.Count != secondCollection.Count)
    {
        return false;
    }

    // ...
}

The ICollection<T> interface technically cannot guarantee that all the types that implement it will provide an efficient implementation of the Count property. However, by convention, it should be safe to assume that this will be the case. The MSDN design guidelines for Choosing Between Properties and Methods provide the following advice:

Do use a property, rather than a method, if the value of the property is stored in the process memory and the property would just provide access to the value.

Do use a method, rather than a property, [if] the operation is orders of magnitude slower than a field set would be.

Most of the collection types in the .NET Framework Class Library guarantee that retrieving the value of the Count property is an O(1) operation – see List<T>.Count, Collection<T>.Count, HashSet<T>.Count, Dictionary<TKey, TValue>.Count, ….

The SequenceEqual implementation above addresses our performance concerns. However, there are collection types other than ICollection<T> that can provide a fast count. For example, non-generic collections, such as ArrayList, implement the ICollection interface, which also provides a Count property. It would become unwieldy to manually attempt to cast our sequences to each of these types whenever we require a fast count. Rather, we should promote code reuse by providing a utility method for this purpose.

TryFastCount

Our TryFastCount extension method is intended to determine the number of elements in a sequence, if possible, without enumerating it. We do not want to fall back to enumerating over the sequence if a fast count is not possible; we already have the Count extension method in the .NET Framework for that. Per the convention for Try… methods, the return value of TryFastCount indicates whether it succeeded, and its out parameter, count, contains the actual result on success. Specifically, when TryFastCount returns true, the count argument would contain the number of elements in the input sequence. When it returns false, the count argument should be ignored.

public static class EnumerableExtensions
{
    public static bool TryFastCount<TSource>(this IEnumerable<TSource> source, out int count)
    {
        if (source == null)
            throw new ArgumentNullException(nameof(source));

        var collection = source as ICollection<TSource>;
        if (collection != null)
        {
            count = collection.Count;
            return true;
        }

        var collectionNonGeneric = source as ICollection;
        if (collectionNonGeneric != null)
        {
            count = collectionNonGeneric.Count;
            return true;
        }

        var readOnlyCollection = source as IReadOnlyCollection<TSource>;
        if (readOnlyCollection != null)
        {
            count = readOnlyCollection.Count;
            return true;
        }

        var str = source as string;
        if (str != null)
        {
            count = str.Length;
            return true;
        }

        // possibly other type cast attempts

        count = -1;
        return false;
    }
}

The implementation above handles the base interfaces that cover the majority of collection types: ICollection<T>, ICollection, and IReadOnlyCollection<T> (introduced in .NET Framework 4.5). It also handles strings, which can be treated as an IEnumerable<char> collection providing a fast Length property. A benefit of centralizing this logic in such an extension method is that it can be easily extended to support new collection types, whose fast counts would then be automatically available to all sequence operations that call it. This implementation is available in our EnumerableExtensions class as part of our DogmaMix.Core library on GitHub.

Armed with this functionality, we can now optimize the SequenceEqual method to perform an initial count check only if such can be done efficiently, without enumerating the sequences. If not, SequenceEqual would proceed with iterating over the elements, performing the count check as part of the same enumeration.

public static bool SequenceEqual<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    // null checks

    int firstCount, secondCount;
    if (first.TryFastCount(out firstCount) &&
        second.TryFastCount(out secondCount) &&
        firstCount != secondCount)
    {
        return false;
    }

    // ...
}

SequenceComparer: Sorting sequences of sequences

When one talks about sorting, one would usually be thinking of sequences of elements, where each element is a single unit that may be atomically compared against other elements. However, there are plenty of scenarios where the elements to be compared are themselves sequences of another element type.

  • The most popular example would be strings. When performing ordinal comparisons, each string may treated as a sequence of characters, to be compared against the other string’s sequence of characters.
  • Strings encoded as UTF-8 would be represented as a byte array, which also serves as a valid sequence for performing ordinal comparisons. (Results may differ from ordinal comparisons on the source strings when supplementary characters are involved, as discussed later.)
  • Version may be treated as a sequence of up to four integers: Major, Minor, Build, Revision.
  • Arbitrary-precision data structures, such as BigInteger, are internally represented as a sequence of digits under some numeric base.
  • rowversion in SQL Server maps to an eight-element byte[] in .NET.

All aforementioned sequences have one common aspect: They may be compared against other sequences of the same type through an element-by-element comparison of their contents. There are three main variants of this rule, determining how sequences of different lengths are treated.

Sequence comparisons

Lexicographical order

The lexicographical order is a generalization of the alphabetical order, such that it may be applied to any type of sequences containing comparable elements. To compare a given pair of sequences, one would iteratively compare pairs of elements at corresponding indexes across the two sequences, starting from the beginning of the sequences, and proceeding until the first element inequality is found. The result of the sequence comparison corresponds to the result of the first unequal pair of elements. In other words, a sequence of elements { a1, a2, …, am } is less than another sequence { b1, b2, …, bn } if and only if, at the first index i where ai and bi differ, ai is less than bi.

If all elements are equal, then the two sequences are considered to be equal, provided that they are of the same length. When comparing sequences of different lengths, the aforementioned element-by-element check still applies. However, if all elements of the shorter sequence are equal to their counterparts in the longer sequence (meaning that the shorter sequence is a prefix of the longer sequence), then the shorter sequence is considered to be less than the longer sequence. Nominally, the empty sequence is considered to be less than any other sequence.

As an example, consider a comparison between the UTF-8 encoded decimal representations of two strings:

indexes 0 1 2 3 4 5 6 7 8 9
"operator" 111 112 101 114 97 116 111 114
"operations" 111 112 101 114 97 116 105 111 110 115
comparisons = = = = = = >

The first six elements of the two sequences are equal, and therefore do not determine the comparison result. At index 6, the element of the first sequence, 111, is greater than the element of the second sequence, 105. Therefore, overall, the first sequence is considered to be greater than the second sequence.

Consider another example:

indexes 0 1 2 3 4 5 6 7
"opera" 111 112 101 114 97
"operator" 111 112 101 114 97 116 111 114
comparisons = = = = = <

The first five elements of the two sequences are equal, and do not determine the comparison result. However, the first sequence only consists of the said five elements, whilst the second sequence contains more elements. Thus, the first sequence is a prefix of the second sequence, and is considered to be less than it.

A simplified implementation of such a comparer is given below:

public class LexicographicalComparer<TElement> : Comparer<IList<TElement>>
{
    public override int Compare(IList<TElement> x, IList<TElement> y)
    {
        for (int i = 0; i < Math.Min(x.Count, y.Count); i++)
        {
            int elementComparison = Comparer<TElement>.Default.Compare(x[i], y[i]);
            if (elementComparison != 0)
                return elementComparison;
        }

        return x.Count.CompareTo(y.Count);
    }
}

Shortlex order

The lexicographical comparer gives precedence to the element-wise comparisons, and only considers the lengths of the sequences when all elements, up to the length of the shorter sequence, are equal. This order, whilst well-suited for strings, is not applicable for sequences whose elements assume a positional notation, such as big integers. In such cases, the lengths of the sequences take precedence – a shorter sequence is considered to be less than a longer sequence, irrespective of their contents. The element-by-element comparison would only occur for sequences of the same length. This variant of the lexicographical order is known as shortlex.

For example, imagine a big integer implementation that internally stores its values as a List<byte>, with each element representing a digit under base-256, and no leading zeros allowed. The number 992,397,253,847 would be stored as { 231, 15, 124, 92, 215 }, since its value is equal to 231 × 2564 + 15 × 2563 + 124 × 2562 + 92 × 2561 + 215 × 2560. (In practice, most big integer implementations would store their elements in reverse order, with the most significant digit stored last. This is not done here for simplicity.)

For values of the same length, we could use an algorithm similar to the one shown above, determining the order of the big integers through an element-by-element comparison of their digits, up to the first inequality:

indexes 0 1 2 3 4
992,397,253,847 231 15 124 92 215
992,401,385,498 231 15 187 104 26
comparisons = = <

However, such comparisons would not be applicable for sequences of different lengths. Consider another pair of big integers:

indexes 0 1 2 3 4
992,397,253,847 231 15 124 92 215
16,368,696 249 196 56

Although the first digit of the first sequence, 231, is less than that of the second sequence, 249, this is irrelevant, since its place value is greater: 231 × 2564 is greater than 249 × 2562. This applies as a general rule – the place value of the most significant digit of a shorter sequence will always be less than the place value of the most significant digit of a longer sequence. Thus, under shortlex, a shorter sequence is always less than a longer sequence.

We can tweak the simplified code from the previous section to give precedence to the sequence lengths:

public class ShortlexComparer<TElement> : Comparer<IList<TElement>>
{
    public override int Compare(IList<TElement> x, IList<TElement> y)
    {
        var lengthComparison = x.Count.CompareTo(y.Count);
        if (lengthComparison != 0)
            return lengthComparison;

        for (int i = 0; i < x.Count; i++)
        {
            int elementComparison = Comparer<TElement>.Default.Compare(x[i], y[i]);
            if (elementComparison != 0)
                return elementComparison;
        }

        return 0;
    }
}

Same length only

There may be other scenarios where sequences of different lengths are not considered to be comparable. For example, we earlier stated that rowversion from SQL Server is represented as byte[8] in .NET. When comparing such arrays, we always expect them to have the same length – a length other than 8 would likely be indicative of some programming error. To handle such scenarios, we can define another comparer that is similar to shortlex, but throws an exception when it encounters a pair of sequences with different lengths:

public class SameLengthComparer<TElement> : Comparer<IList<TElement>>
{
    public override int Compare(IList<TElement> x, IList<TElement> y)
    {
        var lengthComparison = x.Count.CompareTo(y.Count);
        if (lengthComparison != 0)
            throw new ArgumentException("The sequences to be compared do not have the same number of elements.");

        for (int i = 0; i < x.Count; i++)
        {
            int elementComparison = Comparer<TElement>.Default.Compare(x[i], y[i]);
            if (elementComparison != 0)
                return elementComparison;
        }

        return 0;
    }
}

StructuralComparer

The .NET Framework Class Library provides the StructuralComparer property in the StructuralComparisons static class:

A predefined object that is used to perform a structural comparison of two collection objects.

StructuralComparer relies on the IStructuralComparable interface, whose implementations provide an element-by-element comparison similar to the one we are aiming for. However, it suffers from a number of issues:

  • In .NET Framework 4.6.1, IStructuralComparable is only implemented by arrays and by tuples. We would like to support all IEnumerable<T> sequences.
  • StructuralComparer and IStructuralComparable are non-generic, meaning that boxing and unboxing would be required when comparing elements of value types. We would like to define a generic type parameter, TElement, for the elements.
  • StructuralComparer does not accept a custom comparer for element-level comparisons, but just calls the Compare(object, object) method on the non-generic Comparer.Default instance. One could cast an array to the IStructuralComparable interface and directly call its CompareTo(Object, IComparer) method, which accepts a non-generic IComparer instance that it uses on the element-level comparisons; however, such a structural comparison would not itself be presentable as an IComparer instance (unless manually wrapped by an outer comparer type). We would like to provide overloads in our comparer that can accept a custom IComparer<TElement> inner comparer for performing element-level comparisons.
  • The IStructuralComparable.CompareTo implementation for arrays throws an ArgumentException for arrays of different lengths. We would like to support lexicographical and shortlex orders for such sequences.

Implementation

We shall now proceed to demonstrate an implementation of the three sequence comparers discussed above: lexicographical, shortlex, and same-length. A complete implementation is available as part of our DogmaMix.Core library on GitHub, mostly under the SequenceComparer<TElement> class.

StringComparison enumeration

The three sequence comparison variants share a lot of common behavior, making it appropriate to employ some form of code reuse. We could design an inheritance hierarchy with a base class that provides the element-level comparison logic and relies on an abstract method, to be overridden by derived classes, for handling sequence lengths. However, we decided against this approach. The nature of the sequence length comparison needs to influence not just the result, but also the control flow, of the base class – Shortlex and SameLength compare lengths at the beginning, whilst Lexicographical only does so after the element-level comparisons. This introduces a design complexity that would not be justified just for the sake of achieving polymorphism over the few lines of code that such length comparisons involve.

Instead, we shall stick with a single comparer class, and define an enumeration to represent the sequence comparison variants. The role of this enumeration is similar to that of the StringComparison enumeration for strings.

public enum SequenceComparison
{
    Lexicographical,
    Shortlex,
    SameLength,
}

ListComparer<TElement>

Before we present our actual comparer implementation, let us start with a simplification: a comparer that can compare IList<T> sequences, rather than IEnumerable<T> sequences. IList<T> simplifies matters because it allows us to immediately determine the lengths of the two sequences through its Count property, as well as retrieve elements efficiently by index, avoiding the need for creating enumerators.

We create an IComparer<IList<TElement>> implementation for our sequence comparisons. Note that the generic type of the IComparer<T> interface implementation is IList<TElement>, indicating that our class is a comparer for lists (including arrays) of our element type. We define a constructor that can accept the SequenceComparison enumeration value and the IComparer<TElement> inner comparer to use for element-level comparisons. If the comparer argument is omitted, then the Comparer<TElement>.Default comparer for the element type is used.

public class ListComparer<TElement> : Comparer<IList<TElement>>
{
    private readonly SequenceComparison _comparisonType;
    private readonly IComparer<TElement> _elementComparer;

    public ListComparer(
        SequenceComparison comparisonType = SequenceComparison.Lexicographical,
        IComparer<TElement> elementComparer = null)
    {
        _comparisonType = comparisonType;
        _elementComparer = elementComparer ?? Comparer<TElement>.Default;
    }

    // ...
}

Finally, our implementation of the Compare method is a combination of the logic we presented in the earlier sections. The difference between the three SequenceComparison variants concerns how sequences of different lengths are initially treated: Shortlex immediately returns the non-zero length comparison result, whilst SameLength immediately throws an ArgumentException. An element-by-element comparison ensues. As soon as an unequal pair of elements is encountered, the element comparison result is returned. If all elements are equal (up to the end of the shorter or both sequences), the method returns the length comparison result – at this point guaranteed to be zero for Shortlex and SameLength, but potentially any value for Lexicographical.

public override int Compare(IList<TElement> x, IList<TElement> y)
{
    int lengthComparison = x.Count.CompareTo(y.Count);
    if (lengthComparison != 0)
    {
        if (_comparisonType == SequenceComparison.Shortlex)
            return lengthComparison;
        if (_comparisonType == SequenceComparison.SameLength)
            throw new ArgumentException("The sequences to be compared do not have the same number of elements.");
    }

    for (int i = 0; i < Math.Min(x.Count, y.Count); i++)
    {
        int elementComparison = _elementComparer.Compare(x[i], y[i]);
        if (elementComparison != 0)
            return elementComparison;
    }

    return lengthComparison;
}

SequenceComparer<TElement>

We shall now enhance our implementation to handle IEnumerable<T> sequences in general. The fields and constructor remain the same:

public class SequenceComparer<TElement> : Comparer<IEnumerable<TElement>>
{
    private readonly SequenceComparison _comparisonType;
    private readonly IComparer<TElement> _elementComparer;

    public SequenceComparer(
        SequenceComparison comparisonType = SequenceComparison.Lexicographical,
        IComparer<TElement> elementComparer = null)
    {
        _comparisonType = comparisonType;
        _elementComparer = elementComparer ?? Comparer<TElement>.Default;
    }

    // ...
}

Next comes the Compare method.

Doing without Count

This time, we cannot rely on a Count property, since the IEnumerable<T> interface does not provide one. There exists a Count(IEnumerable<TSource>) extension method, which could technically be used. However, it is generally not considered good practice to enumerate a specified IEnumerable<T> sequence multiple times within a general-purpose library method. IEnumerable<T> can represent any kind of sequence, including IQueryable<T> instances for LINQ to Entities queries whose enumeration would involve issuing and executing the query on the underlying data source (typically a database). Multiple enumerations of such queries can lead to performance pitfalls as well as inconsistent results – the state of the underlying data source might change between one enumeration and the next. In fact, ReSharper gives an error for such occurrences: “Possible multiple enumeration of IEnumerable.” Therefore, we will need to integrate the sequence length checks as part of our main iteration logic.

One could argue that, in this case, the concern is merely pedantic – one would rarely need to perform sequence comparisons of IQueryable<T> instances; and the sequences would need to be enumerated multiple times anyway if such a set were to be sorted. This is a fair point; if conceded, one could oblige consumers to first materialize the sequences to lists in memory, using a call such as:

var lists = sequences.Select(s => s.ToList()).ToList();

The formerly presented ListComparer<TElement> could then be used for comparisons of such lists.

Notwithstanding, we will push ahead with an IComparer<IEnumerable<TElement>> implementation. We do not want to constrain our comparer to only support lists just because of their convenient Count property. Per the MSDN Guidelines for Collections:

✓ DO use the least-specialized type possible as a parameter type. Most members taking collections as parameters use the IEnumerable<T> interface.

X AVOID using ICollection<T> or ICollection as a parameter just to access the Count property. Instead, consider using IEnumerable<T> or IEnumerable and dynamically checking whether the object implements ICollection<T> or ICollection.

We will therefore proceed with accepting IEnumerable<TElement> sequences as inputs, and discuss how to optimize for in-memory collection types later.

Advancing enumerators in sync

To iterate over the sequences, we will use enumerators. For an example of how to use these over a pair of sequences, one may refer to the source code for the SequenceEqual extension method.

public override int Compare(IEnumerable<TElement> x, IEnumerable<TElement> y)
{
    // Get enumerators to iterate over both sequences in sync.
    using (IEnumerator<TElement> xEnumerator = x.GetEnumerator())
    using (IEnumerator<TElement> yEnumerator = y.GetEnumerator())
    {
        // ...
    }
}

Each enumerator may be advanced to the next element of its sequence by calling its MoveNext method. This method returns true if the enumerator was successfully advanced to the next element; or false if the enumerator has passed the end of its sequence.

In our case, we need to advance both enumerators in sync. We could compare the return values from their respective MoveNext calls to determine whether one sequence has ended before the other:

public override int Compare(IEnumerable<TElement> x, IEnumerable<TElement> y)
{
    // Get enumerators to iterate over both sequences in sync.
    using (IEnumerator<TElement> xEnumerator = x.GetEnumerator())
    using (IEnumerator<TElement> yEnumerator = y.GetEnumerator())
    {
        // Advance both enumerators to their next element, until at least one passes the end of its sequence.
        bool xMove, yMove;
        while ((xMove = xEnumerator.MoveNext()) &&
               (yMove = yEnumerator.MoveNext()))
        {
            // Compare the current pair of elements across the two sequences.
        }

        // Determine the relative length of the two sequences based on the final values of xMove and yMove.
    }
}

To improve our design, we will encapsulate this logic for enumerating through a pair of sequences in sync into a dedicated class, EnumeratorPair<T>. This class builds on the semantics of the MoveNext method from the IEnumerator<T> interface. Each MoveNext call returns true if both enumerators were successfully advanced to the next element; or false if one or both enumerators have passed the end of their sequence. Once MoveNext returns false, consumers may read the LengthComparison property, whose value indicates the relative length of the enumerated sequences:

Value Meaning
Less than zero The first enumerator passed the end of its sequence before the second enumerator. Therefore, the first sequence is shorter than the second.
Zero Both enumerators passed the end of their sequence during the same iteration. Therefore, the two sequences have the same length.
Greater than zero The second enumerator passed the end of its sequence before the first enumerator. Therefore, the first sequence is longer than the second.

A bare-bones implementation is given below; refer to EnumeratorPair<T> for a proper one.

public class EnumeratorPair<T> : IDisposable
{
    private readonly IEnumerator<T> _enumerator1;
    private readonly IEnumerator<T> _enumerator2;

    private bool _passedEnd;
    private int _lengthComparison;

    public EnumeratorPair(IEnumerable<T> sequence1, IEnumerable<T> sequence2)
    {
        _enumerator1 = sequence1.GetEnumerator();
        _enumerator2 = sequence2.GetEnumerator();
    }

    public T Current1 => _enumerator1.Current;
    public T Current2 => _enumerator2.Current;
    public int LengthComparison => _lengthComparison;

    public bool MoveNext()
    {
        bool move1 = _enumerator1.MoveNext();
        bool move2 = _enumerator2.MoveNext();

        if (!_passedEnd && (!move1 || !move2))
        {
            _passedEnd = true;
            _lengthComparison = move1.CompareTo(move2);
        }

        return move1 && move2;
    }

    public void Dispose()
    {
        _enumerator1.Dispose();
        _enumerator2.Dispose();
    }
}

Implementing Lexicographical

By leveraging this helper class, we can come up with a neat implementation for Lexicographical. We use a while loop to iterate over the elements of both sequences in sync. As soon as we encounter an unequal pair of elements, we return the element comparison result. Otherwise, we keep iterating until the end of one or both sequences is reached, in which case, we return the length comparison result indicating their relative length.

public override int Compare(IEnumerable<TElement> x, IEnumerable<TElement> y)
{
    // Create an enumerator pair to iterate over both sequences in sync.
    using (var enumeratorPair = new EnumeratorPair<TElement>(x, y))
    {
        // Advance both enumerators to their next element, until at least one passes the end of its sequence.
        while (enumeratorPair.MoveNext())
        {
            // Compare the current pair of elements across the two sequences,
            // seeking element inequality.
            int elementComparison = _elementComparer.Compare(enumeratorPair.Current1, enumeratorPair.Current2);
            if (elementComparison != 0)
                return elementComparison;
        }

        // Return the length comparison result.
        return enumeratorPair.LengthComparison;
    }
}

Adding Shortlex and SameLength

We shall now enhance this implementation to also handle Shortlex and SameLength. Since the length of the sequences cannot be known in advance, we proceed with the element-by-element comparison, like in Lexicographical. However, when the first unequal pair of elements is encountered, the Shortlex and SameLength comparisons would continue advancing the enumerators of the two sequences (without comparing further elements) until the end of one or both is reached. If the end of both sequences is reached simultaneously (during the same iteration), then it may be concluded that they have the same length, and the element comparison result is returned by the Compare method. If one sequence is found to have ended before the other, then Shortlex would return the length comparison result, whilst SameLength would throw an ArgumentException. This logic for handling sequences of different lengths also needs to be applied in cases where one or both sequences end without having encountered an unequal pair of elements, so we place it after the outer while loop, and jump to it through a break statement.

public override int Compare(IEnumerable<TElement> x, IEnumerable<TElement> y)
{
    // Create an enumerator pair to iterate over both sequences in sync.
    using (var enumeratorPair = new EnumeratorPair<TElement>(x, y))
    {
        // Advance both enumerators to their next element,
        // until at least one passes the end of its sequence.
        while (enumeratorPair.MoveNext())
        {
            // Compare the current pair of elements across the two sequences,
            // seeking element inequality.
            int elementComparison = _elementComparer.Compare(enumeratorPair.Current1, enumeratorPair.Current2);
            if (elementComparison != 0)
            {
                // If the current sequence comparison is Shortlex or SameLength, 
                // then length inequality needs to be given precedence over element inequality.
                if (_comparisonType == SequenceComparison.Shortlex ||
                    _comparisonType == SequenceComparison.SameLength)
                {
                    // Advance both enumerators, ignoring the elements,
                    // until at least one passes the end of its sequence. 
                    while (enumeratorPair.MoveNext())
                        ;

                    // If one sequence was shorter than the other, then use the length comparison result.
                    // Such logic is implemented beyond the outer while loop.
                    if (enumeratorPair.LengthComparison != 0)
                        break;
                }

                // If the current sequence comparison is Lexicographical, 
                // or the sequences have been found to share the same length,
                // then return the non-zero element comparison result.
                return elementComparison;
            }
        }

        // This point is only reached if at least one sequence ended without finding any unequal pair of elements;
        // or an unequal pair of elements was found, but the unequal length of the sequences dominates the result.

        // If the current sequence comparison is SameLength,
        // then throw exception for sequences of different lengths.
        if (_comparisonType == SequenceComparison.SameLength &&
            enumeratorPair.LengthComparison != 0)
            throw new ArgumentException("The sequences to be compared do not have the same number of elements.");

        // Return the length comparison result.
        return enumeratorPair.LengthComparison;
    }
}

Optimizing for in-memory collections

In an earlier section, we mentioned that the absence of a Count property on the IEnumerable<T> interface prevents us from performing sequence length comparisons at the beginning of our implementation. This is not an issue for Lexicographical, where the element-level comparisons dominate the result. However, it can lead to useless element comparisons (and iterations) for Shortlex and SameLength if the sequences are subsequently found to have had different lengths. Consider the example below:

var comparer = new SequenceComparer<int>(SequenceComparison.Shortlex);
var result = comparer.Compare(
    new int[] { 1, 2, 3, 4, 5, 6, 7 },
    new int[] { 1, 2, 3, 6, 7, 8, 9, 10 });

Under shortlex, it should be obvious that the first sequence is less than the second, due to it having a smaller number of elements; there is no need to read the element values. Such information is readily available to our comparer, since arrays expose a fast Length property (which is also returned by their ICollection.Count and ICollection<T>.Count explicit interface member implementations). Yet our implementation so far would still need to compare the first four elements of the sequences, and skip over the subsequent three element, before realizing that the sequences’ lengths are different, making the element comparisons inapplicable.

Rather than always suffer this wasteful enumeration, we can optimize for in-memory collections by attempting to determine their number of elements efficiently at the beginning of the comparison, and only proceed with the elements’ enumeration if found to have the same length. For sequences that rely on deferred execution and cannot provide this information upfront, we will fall back on our former behavior of performing the length comparison as part of the main enumeration.

This topic is discussed in more detail, and a sample implementation provided, under the article for our TryFastCount extension method. We will now show how this method could be leveraged to optimize our comparer implementation:

public override int Compare(IEnumerable<TElement> x, IEnumerable<TElement> y)
{
    // Optimization for in-memory collections:
    // If the current sequence comparison is Shortlex or SameLength, 
    // then attempt to perform a length comparison without enumerating the sequences.
    if (_comparisonType == SequenceComparison.Shortlex ||
        _comparisonType == SequenceComparison.SameLength)
    {
        int xCount, yCount;
        if (x.TryFastCount(out xCount) &&
            y.TryFastCount(out yCount))
        {
            int countComparison = xCount.CompareTo(yCount);
            if (countComparison != 0)
            {
                // Shortlex returns immediately for sequences of different lengths,
                // whilst SameLength throws immediately for such sequences.
                if (_comparisonType == SequenceComparison.Shortlex)
                    return countComparison;
                if (_comparisonType == SequenceComparison.SameLength)
                    throw new ArgumentException("The sequences to be compared do not have the same number of elements.");
            }
        }
    }

    // rest of implementation
}

Default comparers

We can define static properties as shorthand for the default sequence comparers that use Comparer<TElement>.Default for the element-level comparisons:

public static SequenceComparer<TElement> Lexicographical { get; } = 
    new SequenceComparer<TElement>(SequenceComparison.Lexicographical);

public static SequenceComparer<TElement> Shortlex { get; } =
    new SequenceComparer<TElement>(SequenceComparison.Shortlex);

public static SequenceComparer<TElement> SameLength { get; } =
    new SequenceComparer<TElement>(SequenceComparison.SameLength);

Factory method

Rather than exposing the constructor of our SequenceComparer<TElement> type as public, we could make it protected internal, and instead provide a factory method for creating sequence comparers. This would allow us to change the class design of our comparer – for example, to introduce an inheritance hierarchy – without breaking any consumers.

public static class SequenceComparer
{
    public static SequenceComparer<TElement> Create<TElement>(
        SequenceComparison comparisonType = SequenceComparison.Lexicographical,
        IComparer<TElement> elementComparer = null)
    {
        return new SequenceComparer<TElement>(comparisonType, elementComparer);
    }
}

public class SequenceComparer<TElement> : ComparerBase<IEnumerable<TElement>>
{
    protected internal SequenceComparer(
        SequenceComparison comparisonType = SequenceComparison.Lexicographical,
        IComparer<TElement> elementComparer = null)
    {
        _comparisonType = comparisonType;
        _elementComparer = elementComparer ?? Comparer<TElement>.Default;
    }

    // ...
}

Examples

The example below shows how our SequenceComparer<TElement> may be used to sort a collection of sequences of strings. We demonstrate the behavior of Lexicographical and Shortlex using the default inner comparer for strings, as well as using the ordinal comparer. The default string comparer performs a culture-sensitive linguistic comparison. Under the "en-US" culture, ñ sorts before t, so "Añasco" sorts before "Athens". The ordinal comparison, on the other hand, sorts ñ after t.

var sequences = new string[][]
{
    new string[] { "Paris", "Añasco", "Athens", "New York" },
    new string[] { "Madrid", "Paris", "Añasco" },
    new string[] { },
    new string[] { "Añasco", "Madrid" },
    new string[] { "Paris", "Añasco", "Athens" },
    new string[] { "Paris", "Athens" },
    new string[] { "Madrid", "Paris", "Athens", "New York" },
    new string[] { "Paris", "Athens", "Añasco" },
    new string[] { "Athens" },
    new string[] { "Athens", "Madrid", "Añasco" },
    new string[] { "Paris", "Añasco", "Athens", "Madrid", "New York" },
    new string[] { "Madrid", "Añasco" },
    new string[] { "Paris", "Añasco" },
};

CultureInfo.CurrentCulture = CultureInfo.GetCultureInfo("en-US");

// Lexicographical order, with element-level string comparisons being linguistic under "en-US" culture:
var a = sequences.OrderBy(s => s, SequenceComparer<string>.Lexicographical);
// Result:
// {  }
// { "Añasco", "Madrid" }
// { "Athens" }
// { "Athens", "Madrid", "Añasco" }
// { "Madrid", "Añasco" }
// { "Madrid", "Paris", "Añasco" }
// { "Madrid", "Paris", "Athens", "New York" }
// { "Paris", "Añasco" }
// { "Paris", "Añasco", "Athens" }
// { "Paris", "Añasco", "Athens", "Madrid", "New York" }
// { "Paris", "Añasco", "Athens", "New York" }
// { "Paris", "Athens" }
// { "Paris", "Athens", "Añasco" }

// Lexicographical order, with element-level string comparisons being ordinal:
var b = sequences.OrderBy(s => s, SequenceComparer.Create(SequenceComparison.Lexicographical, StringComparer.Ordinal));
// Result:
// {  }
// { "Athens" }
// { "Athens", "Madrid", "Añasco" }
// { "Añasco", "Madrid" }
// { "Madrid", "Añasco" }
// { "Madrid", "Paris", "Athens", "New York" }
// { "Madrid", "Paris", "Añasco" }
// { "Paris", "Athens" }
// { "Paris", "Athens", "Añasco" }
// { "Paris", "Añasco" }
// { "Paris", "Añasco", "Athens" }
// { "Paris", "Añasco", "Athens", "Madrid", "New York" }
// { "Paris", "Añasco", "Athens", "New York" }

// Shortlex order, with element-level string comparisons being linguistic under "en-US" culture:
var c = sequences.OrderBy(s => s, SequenceComparer<string>.Shortlex);
// Result:
// {  }
// { "Athens" }
// { "Añasco", "Madrid" }
// { "Madrid", "Añasco" }
// { "Paris", "Añasco" }
// { "Paris", "Athens" }
// { "Athens", "Madrid", "Añasco" }
// { "Madrid", "Paris", "Añasco" }
// { "Paris", "Añasco", "Athens" }
// { "Paris", "Athens", "Añasco" }
// { "Madrid", "Paris", "Athens", "New York" }
// { "Paris", "Añasco", "Athens", "New York" }
// { "Paris", "Añasco", "Athens", "Madrid", "New York" }

// Shortlex order, with element-level string comparisons being ordinal:
var d = sequences.OrderBy(s => s, SequenceComparer.Create(SequenceComparison.Shortlex, StringComparer.Ordinal));
// Result:
// {  }
// { "Athens" }
// { "Añasco", "Madrid" }
// { "Madrid", "Añasco" }
// { "Paris", "Athens" }
// { "Paris", "Añasco" }
// { "Athens", "Madrid", "Añasco" }
// { "Madrid", "Paris", "Añasco" }
// { "Paris", "Athens", "Añasco" }
// { "Paris", "Añasco", "Athens" }
// { "Madrid", "Paris", "Athens", "New York" }
// { "Paris", "Añasco", "Athens", "New York" }
// { "Paris", "Añasco", "Athens", "Madrid", "New York" }

Further considerations

Equality comparisons

The SequenceComparer<TElement> class is only intended for sort-order comparisons, not for equality comparisons; it intentionally does not implement the IEqualityComparer<T> interface for IEnumerable<TElement>. The justification given in our previous article for KeyComparer<TSource, TKey> also applies here: Comparer<TElement>.Default and EqualityComparer<TElement>.Default may have inconsistent semantics for a given element type TElement, as is the case for strings.

Furthermore, the differentiation between the lexicographical and shortlex orders does not apply to sequence equality comparisons: Sequences of different lengths are always simply unequal. The above-discussed optimization for checking the lengths of in-memory collections upfront (before enumerating their elements) would apply for such comparisons, and is presented as the representative problem in our TryFastCount article.

If you wish to perform equality comparisons on sequences, consider using our SequenceEqualityComparer<TElement> class instead.

Use for implementing IComparable<T>

In our previous article, we explained how the IComparable<T> interface is intended to be implemented by the type T itself for providing the inherent sort order, whilst IComparer<T> is implemented by other types to provide alternate sort orders. However, most collection types, including List<T>, do not implement IComparable<T>. If one defines a custom type that contains or derives from such a collection, then SequenceComparer<TElement> could prove useful in its implementation of the IComparable<T> interface for the inherent sort order.

Utf8String

For example, a structure that stores UTF-8 encoded representations of strings as byte arrays could implement the IComparable<T>.CompareTo method by applying the SequenceComparer<byte>.Lexicographical comparer to the said byte arrays. This would provide results almost identical to ordinal comparisons on the source strings.

public struct Utf8String : IComparable<Utf8String>
{
    private readonly byte[] _data;

    public Utf8String(string source)
    {
        _data = Encoding.UTF8.GetBytes(source);
    }

    public override string ToString()
    {
        return Encoding.UTF8.GetString(_data);
    }

    public int CompareTo(Utf8String other)
    {
        return SequenceComparer<byte>.Lexicographical.Compare(_data, other._data);
    }
}

Code point order

There is one scenario where the comparison results of Utf8String above would differ from ordinal comparisons on the source strings; namely, when the source strings contain a mix of surrogate pairs and BMP characters with a code point of U+E000 or higher. UTF-8 does not have the notion of surrogate pairs, so supplementary characters (U+10000 and beyond) would always sort after the BMP characters (U+0000–​U+FFFF); this is known as “code point order”. UTF-16, on the other hand, encodes supplementary characters as surrogate pairs (U+D800U+DFFF), which would hence sort before the subset of BMP characters having code points U+E000U+FFFF. Since the String class in the .NET Framework is based on UTF-16, its ordinal comparisons would exhibit the latter sort order. This discrepancy between UTF-8 and UTF-16 is mentioned in the core specification of the Unicode Standard, under the Binary Order section.


KeyComparer: Bringing the expressive power of Select to IComparer<T>

A frequent requirement in object-oriented programming is to sort a sequence of elements according to a projected key. This key typically corresponds to a property of the element type, although it could also be a projection of any aspect of its state, including functions over fields, properties, and methods. For example, suppose that we have the following class:

public partial class Person
{
    public string Ssn { get; set; }
    public DateTime DateOfBirth { get; set; }
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

…and an array of its instances:

var persons = new []
{
    new Person { Ssn = "324-00-3015", DateOfBirth = new DateTime(1970, 12, 17), FirstName = "Tod",     LastName = "Temme" },
    new Person { Ssn = "548-00-1592", DateOfBirth = new DateTime(1968, 03, 13), FirstName = "Lucia",   LastName = "Armstrong" },
    new Person { Ssn = "129-00-7416", DateOfBirth = new DateTime(1982, 09, 02), FirstName = "Spencer", LastName = "Weaver" },
    new Person { Ssn = "831-00-6391", DateOfBirth = new DateTime(1974, 04, 30), FirstName = "Celia",   LastName = "Potter" },
    new Person { Ssn = "714-00-6502", DateOfBirth = new DateTime(1966, 11, 19), FirstName = "Powell",  LastName = "Beck" },
};

We might have scenarios where we want such a collection to be sorted by SSN, by date of birth, or by a concatenation of the first and last name.

Sorting in the .NET Framework

The .NET Framework Class Library exposes various sorting facilities: Array.Sort, List<T>.Sort, Enumerable.OrderBy, ParallelEnumerable.OrderBy, …. It also defines a number of sorted collection types that maintain their elements in sort order, such as SortedSet<T>, SortedList<TKey, TValue>, and SortedDictionary<TKey, TValue>. These collections allow elements to be inserted and removed more efficiently than if one were to sort the entire list each time. For example, SortSet<T> operations generally take O (log n) time, whilst a full sort would take O (n log n). The aforementioned three sorted collection types do not allow duplicate elements, although it is not inconceivable to implement a collection type that does.

IComparable<T>

Whilst the above types provide the implementation for the sorting algorithm or data structure, they by default rely on the element type to provide a means of determining the sort order for any given pair of elements. This is achieved through the IComparable<T> interface, which is to be implemented by types that have an inherent sort order. Various types have such an inherent order:

  • integers sort by the numerical order (-43, 2, 81, 122)
  • strings sort by the alphabetical order ("cat", "caterpillar", "mouse")
  • versions sort by the components (2.9.2, 2.10.0, 3.0.0)

The interface has a non-generic counterpart, IComparable, which should also be implemented by types alongside IComparable<T>, but only intended for use as a fallback by consumers who cannot avail of generics.

The IComparable<T> interface defines just one method to be implemented: CompareTo(T). When called on an instance of type T, with an argument specifying another instance of the same type, the CompareTo method should return an integer value as follows:

Value Meaning
Less than zero This instance precedes the specified other in the sort order.
Zero This instance occurs in the same position in the sort order as the specified other.
Greater than zero This instance follows the specified other in the sort order.

Whilst the overly permissive return values seem to be a minor infringement of Postel’s law, they provide for efficient implementations by numeric types. Consider the internal implementation for the Byte value type, which is a simple subtraction:

public partial struct Byte : IComparable<Byte>
{
    public int CompareTo(Byte value)
    {
        return this - value;
    }
}

This insight serves as a convenient way of remembering the method’s semantics: x.CompareTo(y) returns a value of the same sign as if one computed x - y for a numeric type.

In custom types, the IComparable<T> interface is commonly implemented by forwarding the CompareTo calls to a key property of the type, or some other projection of its state. For example, if we decide that the SSN is the natural key of our Person class, we could write:

public partial class Person : IComparable<Person>
{
    public int CompareTo(Person other)
    {
        // handle nulls

        return Comparer<string>.Default.Compare(Ssn, other.Ssn);
    }
}

We could consequently sort our Person[] array by SSN by calling Array.Sort<T>(T[]):

Array.Sort(persons);

IComparer<T>

A type may have more than one candidate sort order. The most popular case is the String class, which has six variants of the alphabetical order for performing sort-order comparisons: ordinal comparisons, linguistic comparisons under the invariant culture, linguistic comparisons under the current culture, and the case-insensitive counterparts of the aforementioned three. Only one sort order can be used as the default when implementing IComparable<T> – for strings, it’s case-sensitive linguistic comparisons under the current culture.

The other sort orders can be implemented through another interface that .NET provides: IComparer<T>. Whilst IComparable<T> is intrinsic to (and implemented by) the element type, IComparer<T> is extrinsic, intended to be implemented by a separate type. Its Compare(T, T) method takes two instances of the element type, x and y, and returns an integer value with semantics similar to those discussed earlier:

Value Meaning
Less than zero x is less than y.
Zero x equals y.
Greater than zero x is greater than y.

Rather than implementing the IComparer<T> interface directly, it is recommended to derive from the Comparer<T> abstract class. Comparer<T> provides an explicit interface implementation of the IComparer non-generic interface that forwards Compare(object, object) calls to the Compare(T, T) overridden method. Comparer<T> also provides a Default static property, which compares objects of type T using their implementation of the IComparable<T> generic interface (or, as fallback, the IComparable non-generic interface).

Most of the sorting facilities in the .NET Framework Class Library have a method or constructor overload that accepts such an IComparer<T> instance. For example, if we occasionally wanted to sort our Person[] array by date of birth, we could implement a custom comparer that provides this sort order:

public class PersonDateOfBirthComparer : Comparer<Person>
{
    public new static PersonDateOfBirthComparer Default { get; } = new PersonDateOfBirthComparer();

    public override int Compare(Person x, Person y)
    {
        // handle nulls

        return Comparer<DateTime>.Default.Compare(x.DateOfBirth, y.DateOfBirth);
    }
}

We could then sort our array using the Array.Sort<T>(T[], IComparer<T>) method overload:

Array.Sort(persons, PersonDateOfBirthComparer.Default);

Similarly, if we wanted to sort by full name, using a case-insensitive linguistic comparison under the current culture, we could define another comparer:

public class PersonFullNameComparer : Comparer<Person>
{
    public new static PersonFullNameComparer Default { get; } = new PersonFullNameComparer();

    public override int Compare(Person x, Person y)
    {
        // handle nulls

        return StringComparer.CurrentCultureIgnoreCase.Compare(
            $"{x.FirstName} {x.LastName}",
            $"{y.FirstName} {y.LastName}");
    }
}

However, this quickly gets unwieldy. What if we wanted to sort by full name using an ordinal or case-sensitive comparison? What if we introduced a new property to sort by? Ten new properties? Keep in mind that the // handle nulls comments above expand to several lines of code in each implementation:

if (object.ReferenceEquals(x, y))
    return 0;
if (x == null)
    return -1;
if (y == null)
    return 1;

LINQ

With LINQ, one-off key-based sorts can be easily achieved through the OrderBy<TSource, TKey> extension method, which accepts a function delegate that projects the key to sort by. It even offers an overload that accepts an IComparer<TKey> instance to use on the projected keys.

var bySsn         = persons.OrderBy(p => p.Ssn);
var byDateOfBirth = persons.OrderBy(p => p.DateOfBirth);
var byFullName    = persons.OrderBy(p => $"{p.FirstName} {p.LastName}", StringComparer.CurrentCultureIgnoreCase);

Sorted Collections

However, most of the rest of the .NET Framework Class Library still relies on the notion of comparers, especially in the case of sorted collection types, such as SortedSet<T>, which typically expose constructor overloads that accept an IComparer<T> argument. If we wanted to maintain a sorted collection of our Person instances under non-default sort orders, we would have to implement an IComparer<T> for each such sort order.

A possible workaround would be to use the SortedList<TKey, TValue> or SortedDictionary<TKey, TValue> structures instead, which introduce the notion of keys. However, these structures treat keys as extrinsic to the values. They have no functionality for projecting keys from values, requiring consumers to do so manually for each call that modifies their contents (for example, Add(TKey, TValue), Remove(TKey), …). This is not desirable for scenarios where keys are always projected consistently – the key projection function should be encapsulated within the data structure, much like SortedSet<T> does when combined with our custom comparers.

Proposed solution: KeyComparer<TSource, TKey>

Our proposed solution is to provide a KeyComparer<TSource, TKey> class that facilitates the creation of IComparer<TSource> instances, which could then be used with sorted collection types such as SortedSet<TSource>. Rather than having to define a new class implementing IComparer<TSource> for each key projection, consumers would only need to supply their key projection function, typically as a lambda expression, to our factory method.

Consider the example below, where a SortedSet<Person> collection is initialized with a comparer that sorts elements by their SSN:

var bySsn = new SortedSet<Person>(persons, KeyComparer.Create((Person p) => p.Ssn));

SortedSet<T> does not allow duplicate elements, so it should not be used for keys that are not guaranteed to be unique. For non-unique keys, we can use a third-party sorted collection type that permits such duplicates, such as the OrderedBag<T> collection from Wintellect’s Power Collections for .NET. Pairs of elements considered to be equal (namely, for which Compare(x, y) returns 0) would either preserve their original order or be sorted arbitrarily, depending on whether the underlying sorting algorithm is stable.

var byDateOfBirth = new OrderedBag<Person>(persons, KeyComparer.Create((Person p) => p.DateOfBirth));
var byFullName    = new OrderedBag<Person>(persons, KeyComparer.Create((Person p) => $"{p.FirstName} {p.LastName}", 
                                                                       StringComparer.Ordinal));

Just like OrderBy<TSource, TKey>, the KeyComparer.Create<TSource, TKey> factory method can accept an IComparer<TKey> inner comparer to be applied to the projected keys. In the last example above, an ordinal comparison is used for the full names. If such an inner comparer is not specified, Comparer<TKey>.Default is used.

Subsequent element operations on the sorted collections are straightforward. Each collection would internally take care of extracting the elements’ keys through the specified KeyComparer<TSource, TKey>, and use it to find the correct location for new or existing elements:

var person = new Person { Ssn = "301-00-1582", DateOfBirth = new DateTime(1984, 11, 01), FirstName = "Teddy", LastName = "Wake" };
bySsn.Add(person);
bySsn.Remove(person);

Contrast this with the explicit keys needed to be specified by consumers of SortedDictionary<TKey, TValue> (or SortedList<TKey, TValue>) for each operation on the collection. The lack of encapsulation of the key projection logic makes such operations more error-prone, since the consumer needs to ensure that the correct key is manually supplied each time:

var bySsnDict = new SortedDictionary<string, Person>(persons.ToDictionary(p => p.Ssn));
bySsnDict.Add(person.Ssn, person);
bySsnDict.Remove(person.Ssn);

As mentioned at the beginning, the key can be any function over the element’s state. For example, we could sort a sequence of numbers by their absolute (unsigned) value:

var nums = new [] { 47, -32, -54, 18, 62, -71, 58 };
var byAbs = new SortedSet<int>(nums, KeyComparer.Create((int n) => Math.Abs(n)));
// result: 18, -32, 47, -54, 58, 62, -71

Implementation

Below is a simplified implementation of this comparer class:

public class KeyComparer<TSource, TKey> : Comparer<TSource>
{
    private readonly Func<TSource, TKey> _keySelector;
    private readonly IComparer<TKey> _innerComparer;

    protected internal KeyComparer(
        Func<TSource, TKey> keySelector, 
        IComparer<TKey> innerComparer = null)
    {
        _keySelector = keySelector;
        _innerComparer = innerComparer ?? Comparer<TKey>.Default;
    }

    public override int Compare(TSource x, TSource y)
    {
        if (object.ReferenceEquals(x, y))
            return 0;
        if (x == null)
            return -1;
        if (y == null)
            return 1;

        TKey xKey = _keySelector(x);
        TKey yKey = _keySelector(y);
        return _innerComparer.Compare(xKey, yKey);
    }
}

Instances of the comparer should be created through the following factory:

public static class KeyComparer
{
    public static KeyComparer<TSource, TKey> Create<TSource, TKey>(
        Func<TSource, TKey> keySelector, 
        IComparer<TKey> innerComparer = null)
    {
        return new KeyComparer<TSource, TKey>(keySelector, innerComparer);
    }
}

A polished implementation of the KeyComparer<TSource, TKey> class and its factory – refactored, XML-documented, and unit-tested – is available as part of our DogmaMix.Core library on GitHub.

Further considerations

Equality comparisons

The KeyComparer<TSource, TKey> class intentionally refrains from implementing the IEqualityComparer<TSource> interface, for reasons given hereunder.

Implementations of IEqualityComparer<T> provide a custom equality comparison, intended for use in scenarios where one needs to enforce uniqueness (or grouping) of elements, but not a sort order. For example, the HashSet<T> collection type has constructors that accept an IEqualityComparer<T> argument to use when comparing values in the set. The EqualityComparer<T> class has a Default static property, which compares objects of type T using their implementation of the IEquatable<T> interface (or, as fallback, their overrides of the Equals(object) and GetHashCode() methods).

If an element type implements both IComparable<T> and IEquatable<T> interfaces, one might be tempted to assume, for any given pair of values x and y, that x.CompareTo(y) returning 0 would be semantically equivalent to x.Equals(y) returning true. However, this is not necessarily the case. In fact, the expectation is broken by the String class in the .NET Framework Class Library, which implements IComparable<String> to perform a culture-sensitive linguistic comparison under the current culture, but implements IEquatable<String> to perform an ordinal comparison. Since Comparer<T>.Default and EqualityComparer<T>.Default respectively rely on the IComparable<T> and IEquatable<T> implementations of type T, they would also exhibit these inconsistencies, leading to anomalies if used interchangeably.

// Evaluates to 0, indicating equality:
int a = Comparer<string>.Default.Compare("café", "cafe\u0301");

// Evaluates to false, indicating inequality:
bool b = EqualityComparer<string>.Default.Equals("café", "cafe\u0301");

Our KeyComparer<TSource, TKey> class does not oblige consumers to provide an IComparer<TKey> instance for comparing the keys projected from the elements. Indeed, for most key types, consumers would omit this argument from the factory method call, and rely on KeyComparer<TSource, TKey> to internally use Comparer<TKey>.Default. For KeyComparer<TSource, TKey> to also support equality comparisons, it would also need to internally use EqualityComparer<TKey>.Default, leading to the inconsistencies discussed above.

The .NET Framework Class Library redresses the inconsistency for String by providing (and recommending) the StringComparer class, whose instances have consistent semantics:

For string comparisons, the StringComparer class is recommended over Comparer<String>. Properties of the StringComparer class return predefined instances that perform string comparisons with different combinations of culture-sensitivity and case-sensitivity. The case-sensitivity and culture-sensitivity are consistent among the members of the same StringComparer instance.

Unfortunately, there is no interface in the .NET Framework to mark comparer types that implement both IComparer<T> and IEqualityComparer<T> consistently (nor for element types that implement both IComparable<T> and IEquatable<T> consistently). The only way of implementing a KeyComparer<TSource, TKey> class that can support both sort-order comparisons and equality comparisons would be to offload the responsibility for providing a consistent IComparer<TKey> and IEqualityComparer<TKey> pair to the consumers, hindering the flexibility of the class. Rather than take this approach, we provide a separate comparer for key equality comparisons through our KeyEqualityComparer<TSource, TKey> class.

Comparer<T>.Create

The .NET Framework 4.5 introduces a Comparer<T>.Create static method. This method serves a purpose that bears some similarity to our KeyComparer<TSource, TKey> class, in that it allows custom comparers to be created through function delegates (typically lambda expressions), rather than consumers having to implement a new comparer type each time. However, the delegate type differs – our KeyComparer.Create<TSource, TKey> factory method takes Func<TSource, TKey> (just like OrderBy<TSource, TKey> in LINQ), whilst Comparer<T>.Create takes Comparison<T>. The Comparison<T> delegate is equivalent to Func<T, T, int>, and is intended to represent a method that compares two objects of the same type.

Comparer<T>.Create is more flexible than our KeyComparer<TSource, TKey>, allowing consumers to define comparisons that are more complex than mere key projections. On the other hand, it requires more work to use correctly – consumers need to manually handle null checks, project the keys from both objects, and perform the key comparison using the appropriate inner comparer. For example, the full-name comparison discussed earlier could be implemented as follows:

var fullNameComparer = Comparer<Person>.Create((x, y) =>
{
    if (object.ReferenceEquals(x, y))
        return 0;
    if (x == null)
        return -1;
    if (y == null)
        return 1;

    return StringComparer.CurrentCulture.Compare(
        $"{x.FirstName} {x.LastName}",
        $"{y.FirstName} {y.LastName}");
});

Contrast this with the simpler implementation under KeyComparer<TSource, TKey>:

var fullNameComparer = KeyComparer.Create(
    (Person p) => $"{p.FirstName} {p.LastName}", 
    StringComparer.CurrentCulture);

Thus, whilst Comparer<T>.Create obviates the need to define new comparer types, it still requires consumers to implement all the comparison logic, leaving our KeyComparer<TSource, TKey> preferable for key comparisons.