SequenceComparer: Sorting sequences of sequences

When one talks about sorting, one would usually be thinking of sequences of elements, where each element is a single unit that may be atomically compared against other elements. However, there are plenty of scenarios where the elements to be compared are themselves sequences of another element type.

  • The most popular example would be strings. When performing ordinal comparisons, each string may treated as a sequence of characters, to be compared against the other string’s sequence of characters.
  • Strings encoded as UTF-8 would be represented as a byte array, which also serves as a valid sequence for performing ordinal comparisons. (Results may differ from ordinal comparisons on the source strings when supplementary characters are involved, as discussed later.)
  • Version may be treated as a sequence of up to four integers: Major, Minor, Build, Revision.
  • Arbitrary-precision data structures, such as BigInteger, are internally represented as a sequence of digits under some numeric base.
  • rowversion in SQL Server maps to an eight-element byte[] in .NET.

All aforementioned sequences have one common aspect: They may be compared against other sequences of the same type through an element-by-element comparison of their contents. There are three main variants of this rule, determining how sequences of different lengths are treated.

Sequence comparisons

Lexicographical order

The lexicographical order is a generalization of the alphabetical order, such that it may be applied to any type of sequences containing comparable elements. To compare a given pair of sequences, one would iteratively compare pairs of elements at corresponding indexes across the two sequences, starting from the beginning of the sequences, and proceeding until the first element inequality is found. The result of the sequence comparison corresponds to the result of the first unequal pair of elements. In other words, a sequence of elements { a1, a2, …, am } is less than another sequence { b1, b2, …, bn } if and only if, at the first index i where ai and bi differ, ai is less than bi.

If all elements are equal, then the two sequences are considered to be equal, provided that they are of the same length. When comparing sequences of different lengths, the aforementioned element-by-element check still applies. However, if all elements of the shorter sequence are equal to their counterparts in the longer sequence (meaning that the shorter sequence is a prefix of the longer sequence), then the shorter sequence is considered to be less than the longer sequence. Nominally, the empty sequence is considered to be less than any other sequence.

As an example, consider a comparison between the UTF-8 encoded decimal representations of two strings:

indexes 0 1 2 3 4 5 6 7 8 9
"operator" 111 112 101 114 97 116 111 114
"operations" 111 112 101 114 97 116 105 111 110 115
comparisons = = = = = = >

The first six elements of the two sequences are equal, and therefore do not determine the comparison result. At index 6, the element of the first sequence, 111, is greater than the element of the second sequence, 105. Therefore, overall, the first sequence is considered to be greater than the second sequence.

Consider another example:

indexes 0 1 2 3 4 5 6 7
"opera" 111 112 101 114 97
"operator" 111 112 101 114 97 116 111 114
comparisons = = = = = <

The first five elements of the two sequences are equal, and do not determine the comparison result. However, the first sequence only consists of the said five elements, whilst the second sequence contains more elements. Thus, the first sequence is a prefix of the second sequence, and is considered to be less than it.

A simplified implementation of such a comparer is given below:

public class LexicographicalComparer<TElement> : Comparer<IList<TElement>>
{
    public override int Compare(IList<TElement> x, IList<TElement> y)
    {
        for (int i = 0; i < Math.Min(x.Count, y.Count); i++)
        {
            int elementComparison = Comparer<TElement>.Default.Compare(x[i], y[i]);
            if (elementComparison != 0)
                return elementComparison;
        }

        return x.Count.CompareTo(y.Count);
    }
}

Shortlex order

The lexicographical comparer gives precedence to the element-wise comparisons, and only considers the lengths of the sequences when all elements, up to the length of the shorter sequence, are equal. This order, whilst well-suited for strings, is not applicable for sequences whose elements assume a positional notation, such as big integers. In such cases, the lengths of the sequences take precedence – a shorter sequence is considered to be less than a longer sequence, irrespective of their contents. The element-by-element comparison would only occur for sequences of the same length. This variant of the lexicographical order is known as shortlex.

For example, imagine a big integer implementation that internally stores its values as a List<byte>, with each element representing a digit under base-256, and no leading zeros allowed. The number 992,397,253,847 would be stored as { 231, 15, 124, 92, 215 }, since its value is equal to 231 × 2564 + 15 × 2563 + 124 × 2562 + 92 × 2561 + 215 × 2560. (In practice, most big integer implementations would store their elements in reverse order, with the most significant digit stored last. This is not done here for simplicity.)

For values of the same length, we could use an algorithm similar to the one shown above, determining the order of the big integers through an element-by-element comparison of their digits, up to the first inequality:

indexes 0 1 2 3 4
992,397,253,847 231 15 124 92 215
992,401,385,498 231 15 187 104 26
comparisons = = <

However, such comparisons would not be applicable for sequences of different lengths. Consider another pair of big integers:

indexes 0 1 2 3 4
992,397,253,847 231 15 124 92 215
16,368,696 249 196 56

Although the first digit of the first sequence, 231, is less than that of the second sequence, 249, this is irrelevant, since its place value is greater: 231 × 2564 is greater than 249 × 2562. This applies as a general rule – the place value of the most significant digit of a shorter sequence will always be less than the place value of the most significant digit of a longer sequence. Thus, under shortlex, a shorter sequence is always less than a longer sequence.

We can tweak the simplified code from the previous section to give precedence to the sequence lengths:

public class ShortlexComparer<TElement> : Comparer<IList<TElement>>
{
    public override int Compare(IList<TElement> x, IList<TElement> y)
    {
        var lengthComparison = x.Count.CompareTo(y.Count);
        if (lengthComparison != 0)
            return lengthComparison;

        for (int i = 0; i < x.Count; i++)
        {
            int elementComparison = Comparer<TElement>.Default.Compare(x[i], y[i]);
            if (elementComparison != 0)
                return elementComparison;
        }

        return 0;
    }
}

Same length only

There may be other scenarios where sequences of different lengths are not considered to be comparable. For example, we earlier stated that rowversion from SQL Server is represented as byte[8] in .NET. When comparing such arrays, we always expect them to have the same length – a length other than 8 would likely be indicative of some programming error. To handle such scenarios, we can define another comparer that is similar to shortlex, but throws an exception when it encounters a pair of sequences with different lengths:

public class SameLengthComparer<TElement> : Comparer<IList<TElement>>
{
    public override int Compare(IList<TElement> x, IList<TElement> y)
    {
        var lengthComparison = x.Count.CompareTo(y.Count);
        if (lengthComparison != 0)
            throw new ArgumentException("The sequences to be compared do not have the same number of elements.");

        for (int i = 0; i < x.Count; i++)
        {
            int elementComparison = Comparer<TElement>.Default.Compare(x[i], y[i]);
            if (elementComparison != 0)
                return elementComparison;
        }

        return 0;
    }
}

StructuralComparer

The .NET Framework Class Library provides the StructuralComparer property in the StructuralComparisons static class:

A predefined object that is used to perform a structural comparison of two collection objects.

StructuralComparer relies on the IStructuralComparable interface, whose implementations provide an element-by-element comparison similar to the one we are aiming for. However, it suffers from a number of issues:

  • In .NET Framework 4.6.1, IStructuralComparable is only implemented by arrays and by tuples. We would like to support all IEnumerable<T> sequences.
  • StructuralComparer and IStructuralComparable are non-generic, meaning that boxing and unboxing would be required when comparing elements of value types. We would like to define a generic type parameter, TElement, for the elements.
  • StructuralComparer does not accept a custom comparer for element-level comparisons, but just calls the Compare(object, object) method on the non-generic Comparer.Default instance. One could cast an array to the IStructuralComparable interface and directly call its CompareTo(Object, IComparer) method, which accepts a non-generic IComparer instance that it uses on the element-level comparisons; however, such a structural comparison would not itself be presentable as an IComparer instance (unless manually wrapped by an outer comparer type). We would like to provide overloads in our comparer that can accept a custom IComparer<TElement> inner comparer for performing element-level comparisons.
  • The IStructuralComparable.CompareTo implementation for arrays throws an ArgumentException for arrays of different lengths. We would like to support lexicographical and shortlex orders for such sequences.

Implementation

We shall now proceed to demonstrate an implementation of the three sequence comparers discussed above: lexicographical, shortlex, and same-length. A complete implementation is available as part of our DogmaMix.Core library on GitHub, mostly under the SequenceComparer<TElement> class.

StringComparison enumeration

The three sequence comparison variants share a lot of common behavior, making it appropriate to employ some form of code reuse. We could design an inheritance hierarchy with a base class that provides the element-level comparison logic and relies on an abstract method, to be overridden by derived classes, for handling sequence lengths. However, we decided against this approach. The nature of the sequence length comparison needs to influence not just the result, but also the control flow, of the base class – Shortlex and SameLength compare lengths at the beginning, whilst Lexicographical only does so after the element-level comparisons. This introduces a design complexity that would not be justified just for the sake of achieving polymorphism over the few lines of code that such length comparisons involve.

Instead, we shall stick with a single comparer class, and define an enumeration to represent the sequence comparison variants. The role of this enumeration is similar to that of the StringComparison enumeration for strings.

public enum SequenceComparison
{
    Lexicographical,
    Shortlex,
    SameLength,
}

ListComparer<TElement>

Before we present our actual comparer implementation, let us start with a simplification: a comparer that can compare IList<T> sequences, rather than IEnumerable<T> sequences. IList<T> simplifies matters because it allows us to immediately determine the lengths of the two sequences through its Count property, as well as retrieve elements efficiently by index, avoiding the need for creating enumerators.

We create an IComparer<IList<TElement>> implementation for our sequence comparisons. Note that the generic type of the IComparer<T> interface implementation is IList<TElement>, indicating that our class is a comparer for lists (including arrays) of our element type. We define a constructor that can accept the SequenceComparison enumeration value and the IComparer<TElement> inner comparer to use for element-level comparisons. If the comparer argument is omitted, then the Comparer<TElement>.Default comparer for the element type is used.

public class ListComparer<TElement> : Comparer<IList<TElement>>
{
    private readonly SequenceComparison _comparisonType;
    private readonly IComparer<TElement> _elementComparer;

    public ListComparer(
        SequenceComparison comparisonType = SequenceComparison.Lexicographical,
        IComparer<TElement> elementComparer = null)
    {
        _comparisonType = comparisonType;
        _elementComparer = elementComparer ?? Comparer<TElement>.Default;
    }

    // ...
}

Finally, our implementation of the Compare method is a combination of the logic we presented in the earlier sections. The difference between the three SequenceComparison variants concerns how sequences of different lengths are initially treated: Shortlex immediately returns the non-zero length comparison result, whilst SameLength immediately throws an ArgumentException. An element-by-element comparison ensues. As soon as an unequal pair of elements is encountered, the element comparison result is returned. If all elements are equal (up to the end of the shorter or both sequences), the method returns the length comparison result – at this point guaranteed to be zero for Shortlex and SameLength, but potentially any value for Lexicographical.

public override int Compare(IList<TElement> x, IList<TElement> y)
{
    int lengthComparison = x.Count.CompareTo(y.Count);
    if (lengthComparison != 0)
    {
        if (_comparisonType == SequenceComparison.Shortlex)
            return lengthComparison;
        if (_comparisonType == SequenceComparison.SameLength)
            throw new ArgumentException("The sequences to be compared do not have the same number of elements.");
    }

    for (int i = 0; i < Math.Min(x.Count, y.Count); i++)
    {
        int elementComparison = _elementComparer.Compare(x[i], y[i]);
        if (elementComparison != 0)
            return elementComparison;
    }

    return lengthComparison;
}

SequenceComparer<TElement>

We shall now enhance our implementation to handle IEnumerable<T> sequences in general. The fields and constructor remain the same:

public class SequenceComparer<TElement> : Comparer<IEnumerable<TElement>>
{
    private readonly SequenceComparison _comparisonType;
    private readonly IComparer<TElement> _elementComparer;

    public SequenceComparer(
        SequenceComparison comparisonType = SequenceComparison.Lexicographical,
        IComparer<TElement> elementComparer = null)
    {
        _comparisonType = comparisonType;
        _elementComparer = elementComparer ?? Comparer<TElement>.Default;
    }

    // ...
}

Next comes the Compare method.

Doing without Count

This time, we cannot rely on a Count property, since the IEnumerable<T> interface does not provide one. There exists a Count(IEnumerable<TSource>) extension method, which could technically be used. However, it is generally not considered good practice to enumerate a specified IEnumerable<T> sequence multiple times within a general-purpose library method. IEnumerable<T> can represent any kind of sequence, including IQueryable<T> instances for LINQ to Entities queries whose enumeration would involve issuing and executing the query on the underlying data source (typically a database). Multiple enumerations of such queries can lead to performance pitfalls as well as inconsistent results – the state of the underlying data source might change between one enumeration and the next. In fact, ReSharper gives an error for such occurrences: “Possible multiple enumeration of IEnumerable.” Therefore, we will need to integrate the sequence length checks as part of our main iteration logic.

One could argue that, in this case, the concern is merely pedantic – one would rarely need to perform sequence comparisons of IQueryable<T> instances; and the sequences would need to be enumerated multiple times anyway if such a set were to be sorted. This is a fair point; if conceded, one could oblige consumers to first materialize the sequences to lists in memory, using a call such as:

var lists = sequences.Select(s => s.ToList()).ToList();

The formerly presented ListComparer<TElement> could then be used for comparisons of such lists.

Notwithstanding, we will push ahead with an IComparer<IEnumerable<TElement>> implementation. We do not want to constrain our comparer to only support lists just because of their convenient Count property. Per the MSDN Guidelines for Collections:

✓ DO use the least-specialized type possible as a parameter type. Most members taking collections as parameters use the IEnumerable<T> interface.

X AVOID using ICollection<T> or ICollection as a parameter just to access the Count property. Instead, consider using IEnumerable<T> or IEnumerable and dynamically checking whether the object implements ICollection<T> or ICollection.

We will therefore proceed with accepting IEnumerable<TElement> sequences as inputs, and discuss how to optimize for in-memory collection types later.

Advancing enumerators in sync

To iterate over the sequences, we will use enumerators. For an example of how to use these over a pair of sequences, one may refer to the source code for the SequenceEqual extension method.

public override int Compare(IEnumerable<TElement> x, IEnumerable<TElement> y)
{
    // Get enumerators to iterate over both sequences in sync.
    using (IEnumerator<TElement> xEnumerator = x.GetEnumerator())
    using (IEnumerator<TElement> yEnumerator = y.GetEnumerator())
    {
        // ...
    }
}

Each enumerator may be advanced to the next element of its sequence by calling its MoveNext method. This method returns true if the enumerator was successfully advanced to the next element; or false if the enumerator has passed the end of its sequence.

In our case, we need to advance both enumerators in sync. We could compare the return values from their respective MoveNext calls to determine whether one sequence has ended before the other:

public override int Compare(IEnumerable<TElement> x, IEnumerable<TElement> y)
{
    // Get enumerators to iterate over both sequences in sync.
    using (IEnumerator<TElement> xEnumerator = x.GetEnumerator())
    using (IEnumerator<TElement> yEnumerator = y.GetEnumerator())
    {
        // Advance both enumerators to their next element, until at least one passes the end of its sequence.
        bool xMove, yMove;
        while ((xMove = xEnumerator.MoveNext()) &&
               (yMove = yEnumerator.MoveNext()))
        {
            // Compare the current pair of elements across the two sequences.
        }

        // Determine the relative length of the two sequences based on the final values of xMove and yMove.
    }
}

To improve our design, we will encapsulate this logic for enumerating through a pair of sequences in sync into a dedicated class, EnumeratorPair<T>. This class builds on the semantics of the MoveNext method from the IEnumerator<T> interface. Each MoveNext call returns true if both enumerators were successfully advanced to the next element; or false if one or both enumerators have passed the end of their sequence. Once MoveNext returns false, consumers may read the LengthComparison property, whose value indicates the relative length of the enumerated sequences:

Value Meaning
Less than zero The first enumerator passed the end of its sequence before the second enumerator. Therefore, the first sequence is shorter than the second.
Zero Both enumerators passed the end of their sequence during the same iteration. Therefore, the two sequences have the same length.
Greater than zero The second enumerator passed the end of its sequence before the first enumerator. Therefore, the first sequence is longer than the second.

A bare-bones implementation is given below; refer to EnumeratorPair<T> for a proper one.

public class EnumeratorPair<T> : IDisposable
{
    private readonly IEnumerator<T> _enumerator1;
    private readonly IEnumerator<T> _enumerator2;

    private bool _passedEnd;
    private int _lengthComparison;

    public EnumeratorPair(IEnumerable<T> sequence1, IEnumerable<T> sequence2)
    {
        _enumerator1 = sequence1.GetEnumerator();
        _enumerator2 = sequence2.GetEnumerator();
    }

    public T Current1 => _enumerator1.Current;
    public T Current2 => _enumerator2.Current;
    public int LengthComparison => _lengthComparison;

    public bool MoveNext()
    {
        bool move1 = _enumerator1.MoveNext();
        bool move2 = _enumerator2.MoveNext();

        if (!_passedEnd && (!move1 || !move2))
        {
            _passedEnd = true;
            _lengthComparison = move1.CompareTo(move2);
        }

        return move1 && move2;
    }

    public void Dispose()
    {
        _enumerator1.Dispose();
        _enumerator2.Dispose();
    }
}

Implementing Lexicographical

By leveraging this helper class, we can come up with a neat implementation for Lexicographical. We use a while loop to iterate over the elements of both sequences in sync. As soon as we encounter an unequal pair of elements, we return the element comparison result. Otherwise, we keep iterating until the end of one or both sequences is reached, in which case, we return the length comparison result indicating their relative length.

public override int Compare(IEnumerable<TElement> x, IEnumerable<TElement> y)
{
    // Create an enumerator pair to iterate over both sequences in sync.
    using (var enumeratorPair = new EnumeratorPair<TElement>(x, y))
    {
        // Advance both enumerators to their next element, until at least one passes the end of its sequence.
        while (enumeratorPair.MoveNext())
        {
            // Compare the current pair of elements across the two sequences,
            // seeking element inequality.
            int elementComparison = _elementComparer.Compare(enumeratorPair.Current1, enumeratorPair.Current2);
            if (elementComparison != 0)
                return elementComparison;
        }

        // Return the length comparison result.
        return enumeratorPair.LengthComparison;
    }
}

Adding Shortlex and SameLength

We shall now enhance this implementation to also handle Shortlex and SameLength. Since the length of the sequences cannot be known in advance, we proceed with the element-by-element comparison, like in Lexicographical. However, when the first unequal pair of elements is encountered, the Shortlex and SameLength comparisons would continue advancing the enumerators of the two sequences (without comparing further elements) until the end of one or both is reached. If the end of both sequences is reached simultaneously (during the same iteration), then it may be concluded that they have the same length, and the element comparison result is returned by the Compare method. If one sequence is found to have ended before the other, then Shortlex would return the length comparison result, whilst SameLength would throw an ArgumentException. This logic for handling sequences of different lengths also needs to be applied in cases where one or both sequences end without having encountered an unequal pair of elements, so we place it after the outer while loop, and jump to it through a break statement.

public override int Compare(IEnumerable<TElement> x, IEnumerable<TElement> y)
{
    // Create an enumerator pair to iterate over both sequences in sync.
    using (var enumeratorPair = new EnumeratorPair<TElement>(x, y))
    {
        // Advance both enumerators to their next element,
        // until at least one passes the end of its sequence.
        while (enumeratorPair.MoveNext())
        {
            // Compare the current pair of elements across the two sequences,
            // seeking element inequality.
            int elementComparison = _elementComparer.Compare(enumeratorPair.Current1, enumeratorPair.Current2);
            if (elementComparison != 0)
            {
                // If the current sequence comparison is Shortlex or SameLength, 
                // then length inequality needs to be given precedence over element inequality.
                if (_comparisonType == SequenceComparison.Shortlex ||
                    _comparisonType == SequenceComparison.SameLength)
                {
                    // Advance both enumerators, ignoring the elements,
                    // until at least one passes the end of its sequence. 
                    while (enumeratorPair.MoveNext())
                        ;

                    // If one sequence was shorter than the other, then use the length comparison result.
                    // Such logic is implemented beyond the outer while loop.
                    if (enumeratorPair.LengthComparison != 0)
                        break;
                }

                // If the current sequence comparison is Lexicographical, 
                // or the sequences have been found to share the same length,
                // then return the non-zero element comparison result.
                return elementComparison;
            }
        }

        // This point is only reached if at least one sequence ended without finding any unequal pair of elements;
        // or an unequal pair of elements was found, but the unequal length of the sequences dominates the result.

        // If the current sequence comparison is SameLength,
        // then throw exception for sequences of different lengths.
        if (_comparisonType == SequenceComparison.SameLength &&
            enumeratorPair.LengthComparison != 0)
            throw new ArgumentException("The sequences to be compared do not have the same number of elements.");

        // Return the length comparison result.
        return enumeratorPair.LengthComparison;
    }
}

Optimizing for in-memory collections

In an earlier section, we mentioned that the absence of a Count property on the IEnumerable<T> interface prevents us from performing sequence length comparisons at the beginning of our implementation. This is not an issue for Lexicographical, where the element-level comparisons dominate the result. However, it can lead to useless element comparisons (and iterations) for Shortlex and SameLength if the sequences are subsequently found to have had different lengths. Consider the example below:

var comparer = new SequenceComparer<int>(SequenceComparison.Shortlex);
var result = comparer.Compare(
    new int[] { 1, 2, 3, 4, 5, 6, 7 },
    new int[] { 1, 2, 3, 6, 7, 8, 9, 10 });

Under shortlex, it should be obvious that the first sequence is less than the second, due to it having a smaller number of elements; there is no need to read the element values. Such information is readily available to our comparer, since arrays expose a fast Length property (which is also returned by their ICollection.Count and ICollection<T>.Count explicit interface member implementations). Yet our implementation so far would still need to compare the first four elements of the sequences, and skip over the subsequent three elements, before realizing that the sequences’ lengths are different, making the element comparisons inapplicable.

Rather than always suffer this wasteful enumeration, we can optimize for in-memory collections by attempting to determine their number of elements efficiently at the beginning of the comparison, and only proceed with the elements’ enumeration if found to have the same length. For sequences that rely on deferred execution and cannot provide this information upfront, we will fall back on our former behavior of performing the length comparison as part of the main enumeration.

This topic is discussed in more detail, and a sample implementation provided, under the article for our TryFastCount extension method. We will now show how this method could be leveraged to optimize our comparer implementation:

public override int Compare(IEnumerable<TElement> x, IEnumerable<TElement> y)
{
    // Optimization for in-memory collections:
    // If the current sequence comparison is Shortlex or SameLength, 
    // then attempt to perform a length comparison without enumerating the sequences.
    if (_comparisonType == SequenceComparison.Shortlex ||
        _comparisonType == SequenceComparison.SameLength)
    {
        int xCount, yCount;
        if (x.TryFastCount(out xCount) &&
            y.TryFastCount(out yCount))
        {
            int countComparison = xCount.CompareTo(yCount);
            if (countComparison != 0)
            {
                // Shortlex returns immediately for sequences of different lengths,
                // whilst SameLength throws immediately for such sequences.
                if (_comparisonType == SequenceComparison.Shortlex)
                    return countComparison;
                if (_comparisonType == SequenceComparison.SameLength)
                    throw new ArgumentException("The sequences to be compared do not have the same number of elements.");
            }
        }
    }

    // rest of implementation
}

Default comparers

We can define static properties as shorthand for the default sequence comparers that use Comparer<TElement>.Default for the element-level comparisons:

public static SequenceComparer<TElement> Lexicographical { get; } = 
    new SequenceComparer<TElement>(SequenceComparison.Lexicographical);

public static SequenceComparer<TElement> Shortlex { get; } =
    new SequenceComparer<TElement>(SequenceComparison.Shortlex);

public static SequenceComparer<TElement> SameLength { get; } =
    new SequenceComparer<TElement>(SequenceComparison.SameLength);

Factory method

Rather than exposing the constructor of our SequenceComparer<TElement> type as public, we could make it protected internal, and instead provide a factory method for creating sequence comparers. This would allow us to change the class design of our comparer – for example, to introduce an inheritance hierarchy – without breaking any consumers.

public static class SequenceComparer
{
    public static SequenceComparer<TElement> Create<TElement>(
        SequenceComparison comparisonType = SequenceComparison.Lexicographical,
        IComparer<TElement> elementComparer = null)
    {
        return new SequenceComparer<TElement>(comparisonType, elementComparer);
    }
}

public class SequenceComparer<TElement> : ComparerBase<IEnumerable<TElement>>
{
    protected internal SequenceComparer(
        SequenceComparison comparisonType = SequenceComparison.Lexicographical,
        IComparer<TElement> elementComparer = null)
    {
        _comparisonType = comparisonType;
        _elementComparer = elementComparer ?? Comparer<TElement>.Default;
    }

    // ...
}

Examples

The example below shows how our SequenceComparer<TElement> may be used to sort a collection of sequences of strings. We demonstrate the behavior of Lexicographical and Shortlex using the default inner comparer for strings, as well as using the ordinal comparer. The default string comparer performs a culture-sensitive linguistic comparison. Under the "en-US" culture, ñ sorts before t, so "Añasco" sorts before "Athens". The ordinal comparison, on the other hand, sorts ñ after t.

var sequences = new string[][]
{
    new string[] { "Paris", "Añasco", "Athens", "New York" },
    new string[] { "Madrid", "Paris", "Añasco" },
    new string[] { },
    new string[] { "Añasco", "Madrid" },
    new string[] { "Paris", "Añasco", "Athens" },
    new string[] { "Paris", "Athens" },
    new string[] { "Madrid", "Paris", "Athens", "New York" },
    new string[] { "Paris", "Athens", "Añasco" },
    new string[] { "Athens" },
    new string[] { "Athens", "Madrid", "Añasco" },
    new string[] { "Paris", "Añasco", "Athens", "Madrid", "New York" },
    new string[] { "Madrid", "Añasco" },
    new string[] { "Paris", "Añasco" },
};

CultureInfo.CurrentCulture = CultureInfo.GetCultureInfo("en-US");

// Lexicographical order, with element-level string comparisons being linguistic under "en-US" culture:
var a = sequences.OrderBy(s => s, SequenceComparer<string>.Lexicographical);
// Result:
// {  }
// { "Añasco", "Madrid" }
// { "Athens" }
// { "Athens", "Madrid", "Añasco" }
// { "Madrid", "Añasco" }
// { "Madrid", "Paris", "Añasco" }
// { "Madrid", "Paris", "Athens", "New York" }
// { "Paris", "Añasco" }
// { "Paris", "Añasco", "Athens" }
// { "Paris", "Añasco", "Athens", "Madrid", "New York" }
// { "Paris", "Añasco", "Athens", "New York" }
// { "Paris", "Athens" }
// { "Paris", "Athens", "Añasco" }

// Lexicographical order, with element-level string comparisons being ordinal:
var b = sequences.OrderBy(s => s, SequenceComparer.Create(SequenceComparison.Lexicographical, StringComparer.Ordinal));
// Result:
// {  }
// { "Athens" }
// { "Athens", "Madrid", "Añasco" }
// { "Añasco", "Madrid" }
// { "Madrid", "Añasco" }
// { "Madrid", "Paris", "Athens", "New York" }
// { "Madrid", "Paris", "Añasco" }
// { "Paris", "Athens" }
// { "Paris", "Athens", "Añasco" }
// { "Paris", "Añasco" }
// { "Paris", "Añasco", "Athens" }
// { "Paris", "Añasco", "Athens", "Madrid", "New York" }
// { "Paris", "Añasco", "Athens", "New York" }

// Shortlex order, with element-level string comparisons being linguistic under "en-US" culture:
var c = sequences.OrderBy(s => s, SequenceComparer<string>.Shortlex);
// Result:
// {  }
// { "Athens" }
// { "Añasco", "Madrid" }
// { "Madrid", "Añasco" }
// { "Paris", "Añasco" }
// { "Paris", "Athens" }
// { "Athens", "Madrid", "Añasco" }
// { "Madrid", "Paris", "Añasco" }
// { "Paris", "Añasco", "Athens" }
// { "Paris", "Athens", "Añasco" }
// { "Madrid", "Paris", "Athens", "New York" }
// { "Paris", "Añasco", "Athens", "New York" }
// { "Paris", "Añasco", "Athens", "Madrid", "New York" }

// Shortlex order, with element-level string comparisons being ordinal:
var d = sequences.OrderBy(s => s, SequenceComparer.Create(SequenceComparison.Shortlex, StringComparer.Ordinal));
// Result:
// {  }
// { "Athens" }
// { "Añasco", "Madrid" }
// { "Madrid", "Añasco" }
// { "Paris", "Athens" }
// { "Paris", "Añasco" }
// { "Athens", "Madrid", "Añasco" }
// { "Madrid", "Paris", "Añasco" }
// { "Paris", "Athens", "Añasco" }
// { "Paris", "Añasco", "Athens" }
// { "Madrid", "Paris", "Athens", "New York" }
// { "Paris", "Añasco", "Athens", "New York" }
// { "Paris", "Añasco", "Athens", "Madrid", "New York" }

Further considerations

Equality comparisons

The SequenceComparer<TElement> class is only intended for sort-order comparisons, not for equality comparisons; it intentionally does not implement the IEqualityComparer<T> interface for IEnumerable<TElement>. The justification given in our previous article for KeyComparer<TSource, TKey> also applies here: Comparer<TElement>.Default and EqualityComparer<TElement>.Default may have inconsistent semantics for a given element type TElement, as is the case for strings.

Furthermore, the differentiation between the lexicographical and shortlex orders does not apply to sequence equality comparisons: Sequences of different lengths are always simply unequal. The above-discussed optimization for checking the lengths of in-memory collections upfront (before enumerating their elements) would apply for such comparisons, and is presented as the representative problem in our TryFastCount article.

If you wish to perform equality comparisons on sequences, consider using our SequenceEqualityComparer<TElement> class instead.

Use for implementing IComparable<T>

In our previous article, we explained how the IComparable<T> interface is intended to be implemented by the type T itself for providing the inherent sort order, whilst IComparer<T> is implemented by other types to provide alternate sort orders. However, most collection types, including List<T>, do not implement IComparable<T>. If one defines a custom type that contains or derives from such a collection, then SequenceComparer<TElement> could prove useful in its implementation of the IComparable<T> interface for the inherent sort order.

Utf8String

For example, a structure that stores UTF-8 encoded representations of strings as byte arrays could implement the IComparable<T>.CompareTo method by applying the SequenceComparer<byte>.Lexicographical comparer to the said byte arrays. This would provide results almost identical to ordinal comparisons on the source strings.

public struct Utf8String : IComparable<Utf8String>
{
    private readonly byte[] _data;

    public Utf8String(string source)
    {
        _data = Encoding.UTF8.GetBytes(source);
    }

    public override string ToString()
    {
        return Encoding.UTF8.GetString(_data);
    }

    public int CompareTo(Utf8String other)
    {
        return SequenceComparer<byte>.Lexicographical.Compare(_data, other._data);
    }
}

Code point order

There is one scenario where the comparison results of Utf8String above would differ from ordinal comparisons on the source strings; namely, when the source strings contain a mix of surrogate pairs and BMP characters with a code point of U+E000 or higher. UTF-8 does not have the notion of surrogate pairs, so supplementary characters (U+10000 and beyond) would always sort after the BMP characters (U+0000–​U+FFFF); this is known as “code point order”. UTF-16, on the other hand, encodes supplementary characters as surrogate pairs (U+D800U+DFFF), which would hence sort before the subset of BMP characters having code points U+E000U+FFFF. Since the String class in the .NET Framework is based on UTF-16, its ordinal comparisons would exhibit the latter sort order. This discrepancy between UTF-8 and UTF-16 is mentioned in the core specification of the Unicode Standard, under the Binary Order section.