When one talks about sorting, one would usually be thinking of sequences of elements,
where each element is a single unit that may be atomically compared against other elements.
However, there are plenty of scenarios where the elements to be compared are themselves sequences
of another element type.
- The most popular example would be strings. When performing ordinal comparisons, each string may treated
as a sequence of characters, to be compared against the other string’s sequence of characters.
- Strings encoded as UTF-8 would be represented as a
byte
array,
which also serves as a valid sequence for performing ordinal comparisons.
(Results may differ from ordinal comparisons on the source strings when
supplementary characters are involved,
as discussed later.)
Version
may be treated
as a sequence of up to four integers: Major
, Minor
, Build
, Revision
.
- Arbitrary-precision data structures,
such as
BigInteger
,
are internally represented as a sequence of digits under some numeric base.
rowversion
in SQL Server maps to an
eight-element byte[]
in .NET.
All aforementioned sequences have one common aspect: They may be compared against other sequences of the same type
through an element-by-element comparison of their contents. There are three main variants of this rule,
determining how sequences of different lengths are treated.
Sequence comparisons
Lexicographical order
The lexicographical order is a generalization of
the alphabetical order,
such that it may be applied to any type of sequences containing comparable elements.
To compare a given pair of sequences,
one would iteratively compare pairs of elements at corresponding indexes across the two sequences,
starting from the beginning of the sequences, and proceeding until the first element inequality is found.
The result of the sequence comparison corresponds to the result of the first unequal pair of elements.
In other words, a sequence of elements { a1, a2, …, am }
is less than another sequence { b1, b2, …, bn }
if and only if, at the first index i where ai and bi differ,
ai is less than bi.
If all elements are equal, then the two sequences are considered to be equal,
provided that they are of the same length.
When comparing sequences of different lengths, the aforementioned element-by-element check still applies.
However, if all elements of the shorter sequence are equal to their counterparts in the longer sequence
(meaning that the shorter sequence is a prefix of the longer sequence),
then the shorter sequence is considered to be less than the longer sequence.
Nominally, the empty sequence is considered to be less than any other sequence.
As an example, consider a comparison between the UTF-8 encoded decimal representations of two strings:
indexes |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
"operator" |
111 |
112 |
101 |
114 |
97 |
116 |
111 |
114 |
|
|
"operations" |
111 |
112 |
101 |
114 |
97 |
116 |
105 |
111 |
110 |
115 |
comparisons |
= |
= |
= |
= |
= |
= |
> |
|
|
|
The first six elements of the two sequences are equal, and therefore do not determine the comparison result.
At index 6, the element of the first sequence, 111, is greater than the element of the second sequence, 105.
Therefore, overall, the first sequence is considered to be greater than the second sequence.
Consider another example:
indexes |
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
"opera" |
111 |
112 |
101 |
114 |
97 |
|
|
|
"operator" |
111 |
112 |
101 |
114 |
97 |
116 |
111 |
114 |
comparisons |
= |
= |
= |
= |
= |
< |
|
|
The first five elements of the two sequences are equal, and do not determine the comparison result.
However, the first sequence only consists of the said five elements, whilst the second sequence contains more elements.
Thus, the first sequence is a prefix of the second sequence, and is considered to be less than it.
A simplified implementation of such a comparer is given below:
public class LexicographicalComparer<TElement> : Comparer<IList<TElement>>
{
public override int Compare(IList<TElement> x, IList<TElement> y)
{
for (int i = 0; i < Math.Min(x.Count, y.Count); i++)
{
int elementComparison = Comparer<TElement>.Default.Compare(x[i], y[i]);
if (elementComparison != 0)
return elementComparison;
}
return x.Count.CompareTo(y.Count);
}
}
Shortlex order
The lexicographical comparer gives precedence to the element-wise comparisons, and only considers
the lengths of the sequences when all elements, up to the length of the shorter sequence, are equal.
This order, whilst well-suited for strings, is not applicable for sequences whose elements assume a
positional notation, such as
big integers.
In such cases, the lengths of the sequences take precedence –
a shorter sequence is considered to be less than a longer sequence, irrespective of their contents.
The element-by-element comparison would only occur for sequences of the same length.
This variant of the lexicographical order is known as shortlex.
For example, imagine a big integer implementation that internally stores its values as a List<byte>
,
with each element representing a digit under base-256, and no leading zeros allowed.
The number 992,397,253,847 would be stored as { 231, 15, 124, 92, 215 }
,
since its value is equal to
231 × 2564 +
15 × 2563 +
124 × 2562 +
92 × 2561 +
215 × 2560.
(In practice, most big integer implementations would store their elements in reverse order,
with the most significant digit stored last. This is not done here for simplicity.)
For values of the same length, we could use an algorithm similar to the one shown above,
determining the order of the big integers through an element-by-element comparison of their digits,
up to the first inequality:
indexes |
0 |
1 |
2 |
3 |
4 |
992,397,253,847 |
231 |
15 |
124 |
92 |
215 |
992,401,385,498 |
231 |
15 |
187 |
104 |
26 |
comparisons |
= |
= |
< |
|
|
However, such comparisons would not be applicable for sequences of different lengths.
Consider another pair of big integers:
indexes |
0 |
1 |
2 |
3 |
4 |
992,397,253,847 |
231 |
15 |
124 |
92 |
215 |
16,368,696 |
249 |
196 |
56 |
|
|
Although the first digit of the first sequence, 231, is less than that of the second sequence, 249,
this is irrelevant, since its place value is greater:
231 × 2564 is greater than 249 × 2562.
This applies as a general rule – the place value of the most significant digit of a shorter sequence
will always be less than the place value of the most significant digit of a longer sequence.
Thus, under shortlex, a shorter sequence is always less than a longer sequence.
We can tweak the simplified code from the previous section to give precedence to the sequence lengths:
public class ShortlexComparer<TElement> : Comparer<IList<TElement>>
{
public override int Compare(IList<TElement> x, IList<TElement> y)
{
var lengthComparison = x.Count.CompareTo(y.Count);
if (lengthComparison != 0)
return lengthComparison;
for (int i = 0; i < x.Count; i++)
{
int elementComparison = Comparer<TElement>.Default.Compare(x[i], y[i]);
if (elementComparison != 0)
return elementComparison;
}
return 0;
}
}
Same length only
There may be other scenarios where sequences of different lengths are not considered to be comparable.
For example, we earlier stated that rowversion
from SQL Server is represented as byte[8]
in .NET.
When comparing such arrays, we always expect them to have the same length –
a length other than 8 would likely be indicative of some programming error.
To handle such scenarios, we can define another comparer that is similar to shortlex,
but throws an exception when it encounters a pair of sequences with different lengths:
public class SameLengthComparer<TElement> : Comparer<IList<TElement>>
{
public override int Compare(IList<TElement> x, IList<TElement> y)
{
var lengthComparison = x.Count.CompareTo(y.Count);
if (lengthComparison != 0)
throw new ArgumentException("The sequences to be compared do not have the same number of elements.");
for (int i = 0; i < x.Count; i++)
{
int elementComparison = Comparer<TElement>.Default.Compare(x[i], y[i]);
if (elementComparison != 0)
return elementComparison;
}
return 0;
}
}
StructuralComparer
The .NET Framework Class Library provides the
StructuralComparer
property in the
StructuralComparisons
static class:
A predefined object that is used to perform a structural comparison of two collection objects.
StructuralComparer
relies on the
IStructuralComparable
interface, whose implementations provide an element-by-element comparison similar to the one we are aiming for.
However, it suffers from a number of issues:
- In .NET Framework 4.6.1,
IStructuralComparable
is only implemented
by arrays and by tuples.
We would like to support all IEnumerable<T>
sequences.
StructuralComparer
and IStructuralComparable
are non-generic,
meaning that boxing and unboxing would be required when comparing elements of value types.
We would like to define a generic type parameter, TElement
, for the elements.
StructuralComparer
does not accept a custom comparer for element-level comparisons, but just calls the
Compare(object, object)
method on the
non-generic Comparer.Default
instance.
One could cast an array to the IStructuralComparable
interface and directly call its
CompareTo(Object, IComparer)
method,
which accepts a non-generic IComparer
instance that it uses on the element-level comparisons;
however, such a structural comparison would not itself be presentable as an IComparer
instance
(unless manually wrapped by an outer comparer type).
We would like to provide overloads in our comparer that can accept a custom
IComparer<TElement>
inner comparer
for performing element-level comparisons.
- The
IStructuralComparable.CompareTo
implementation for arrays throws an ArgumentException
for arrays of different lengths.
We would like to support lexicographical and shortlex orders for such sequences.
Implementation
We shall now proceed to demonstrate an implementation of the three sequence comparers discussed above:
lexicographical, shortlex, and same-length.
A complete implementation is available as part of our
DogmaMix.Core
library on GitHub, mostly under the
SequenceComparer<TElement>
class.
StringComparison
enumeration
The three sequence comparison variants share a lot of common behavior,
making it appropriate to employ some form of code reuse.
We could design an inheritance hierarchy with a base class that provides the element-level comparison logic
and relies on an abstract method, to be overridden by derived classes, for handling sequence lengths.
However, we decided against this approach.
The nature of the sequence length comparison needs to influence not just the result,
but also the control flow, of the base class –
Shortlex
and SameLength
compare lengths at the beginning,
whilst Lexicographical
only does so after the element-level comparisons.
This introduces a design complexity that would not be justified just for the sake of
achieving polymorphism over the few lines of code that such length comparisons involve.
Instead, we shall stick with a single comparer class, and define an enumeration to represent the sequence comparison variants.
The role of this enumeration is similar to that of the
StringComparison
enumeration for strings.
public enum SequenceComparison
{
Lexicographical,
Shortlex,
SameLength,
}
ListComparer<TElement>
Before we present our actual comparer implementation, let us start with a simplification:
a comparer that can compare IList<T>
sequences,
rather than IEnumerable<T>
sequences.
IList<T>
simplifies matters because it allows us to immediately determine the lengths of the two sequences
through its Count
property,
as well as retrieve elements efficiently by index,
avoiding the need for creating enumerators.
We create an IComparer<IList<TElement>>
implementation for our sequence comparisons.
Note that the generic type of the IComparer<T>
interface implementation is IList<TElement>
,
indicating that our class is a comparer for lists (including arrays) of our element type.
We define a constructor that can accept the SequenceComparison
enumeration value
and the IComparer<TElement>
inner comparer to use for element-level comparisons.
If the comparer argument is omitted, then the Comparer<TElement>.Default
comparer for the element type is used.
public class ListComparer<TElement> : Comparer<IList<TElement>>
{
private readonly SequenceComparison _comparisonType;
private readonly IComparer<TElement> _elementComparer;
public ListComparer(
SequenceComparison comparisonType = SequenceComparison.Lexicographical,
IComparer<TElement> elementComparer = null)
{
_comparisonType = comparisonType;
_elementComparer = elementComparer ?? Comparer<TElement>.Default;
}
// ...
}
Finally, our implementation of the Compare
method is a combination of the logic we presented in the earlier sections.
The difference between the three SequenceComparison
variants concerns
how sequences of different lengths are initially treated:
Shortlex
immediately returns the non-zero length comparison result,
whilst SameLength
immediately throws an ArgumentException
.
An element-by-element comparison ensues.
As soon as an unequal pair of elements is encountered, the element comparison result is returned.
If all elements are equal (up to the end of the shorter or both sequences),
the method returns the length comparison result –
at this point guaranteed to be zero for Shortlex
and SameLength
,
but potentially any value for Lexicographical
.
public override int Compare(IList<TElement> x, IList<TElement> y)
{
int lengthComparison = x.Count.CompareTo(y.Count);
if (lengthComparison != 0)
{
if (_comparisonType == SequenceComparison.Shortlex)
return lengthComparison;
if (_comparisonType == SequenceComparison.SameLength)
throw new ArgumentException("The sequences to be compared do not have the same number of elements.");
}
for (int i = 0; i < Math.Min(x.Count, y.Count); i++)
{
int elementComparison = _elementComparer.Compare(x[i], y[i]);
if (elementComparison != 0)
return elementComparison;
}
return lengthComparison;
}
SequenceComparer<TElement>
We shall now enhance our implementation to handle
IEnumerable<T>
sequences in general.
The fields and constructor remain the same:
public class SequenceComparer<TElement> : Comparer<IEnumerable<TElement>>
{
private readonly SequenceComparison _comparisonType;
private readonly IComparer<TElement> _elementComparer;
public SequenceComparer(
SequenceComparison comparisonType = SequenceComparison.Lexicographical,
IComparer<TElement> elementComparer = null)
{
_comparisonType = comparisonType;
_elementComparer = elementComparer ?? Comparer<TElement>.Default;
}
// ...
}
Next comes the Compare
method.
Doing without Count
This time, we cannot rely on a Count
property, since the IEnumerable<T>
interface does not provide one.
There exists a Count(IEnumerable<TSource>)
extension method, which could technically be used.
However, it is generally not considered good practice to enumerate a specified IEnumerable<T>
sequence
multiple times within a general-purpose library method.
IEnumerable<T>
can represent any kind of sequence, including
IQueryable<T>
instances for
LINQ to Entities queries
whose enumeration would involve issuing and executing the query on the underlying data source (typically a database).
Multiple enumerations of such queries can lead to performance pitfalls as well as inconsistent results –
the state of the underlying data source might change between one enumeration and the next.
In fact, ReSharper gives an error for such occurrences: “Possible multiple enumeration of IEnumerable.”
Therefore, we will need to integrate the sequence length checks as part of our main iteration logic.
One could argue that, in this case, the concern is merely pedantic –
one would rarely need to perform sequence comparisons of IQueryable<T>
instances;
and the sequences would need to be enumerated multiple times anyway if such a set were to be sorted.
This is a fair point; if conceded, one could oblige consumers to first materialize the sequences to lists in memory,
using a call such as:
var lists = sequences.Select(s => s.ToList()).ToList();
The formerly presented ListComparer<TElement>
could then be used for comparisons of such lists.
Notwithstanding, we will push ahead with an IComparer<IEnumerable<TElement>>
implementation.
We do not want to constrain our comparer to only support lists just because of their convenient Count
property.
Per the Microsoft Learn Guidelines for Collections:
✓ DO use the least-specialized type possible as a parameter type.
Most members taking collections as parameters use the IEnumerable<T>
interface.
X AVOID using ICollection<T>
or ICollection
as a parameter just to access the Count
property.
Instead, consider using IEnumerable<T>
or IEnumerable
and dynamically checking whether
the object implements ICollection<T>
or ICollection
.
We will therefore proceed with accepting IEnumerable<TElement>
sequences as inputs,
and discuss how to optimize for in-memory collection types later.
Advancing enumerators in sync
To iterate over the sequences, we will use enumerators.
For an example of how to use these over a pair of sequences, one may refer to the
source code
for the SequenceEqual
extension method.
public override int Compare(IEnumerable<TElement> x, IEnumerable<TElement> y)
{
// Get enumerators to iterate over both sequences in sync.
using (IEnumerator<TElement> xEnumerator = x.GetEnumerator())
using (IEnumerator<TElement> yEnumerator = y.GetEnumerator())
{
// ...
}
}
Each enumerator may be advanced to the next element of its sequence by calling its
MoveNext
method.
This method returns true
if the enumerator was successfully advanced to the next element;
or false
if the enumerator has passed the end of its sequence.
In our case, we need to advance both enumerators in sync.
We could compare the return values from their respective MoveNext
calls
to determine whether one sequence has ended before the other:
public override int Compare(IEnumerable<TElement> x, IEnumerable<TElement> y)
{
// Get enumerators to iterate over both sequences in sync.
using (IEnumerator<TElement> xEnumerator = x.GetEnumerator())
using (IEnumerator<TElement> yEnumerator = y.GetEnumerator())
{
// Advance both enumerators to their next element, until at least one passes the end of its sequence.
bool xMove, yMove;
while ((xMove = xEnumerator.MoveNext()) &&
(yMove = yEnumerator.MoveNext()))
{
// Compare the current pair of elements across the two sequences.
}
// Determine the relative length of the two sequences based on the final values of xMove and yMove.
}
}
To improve our design, we will encapsulate this logic for enumerating through a pair of sequences
in sync into a dedicated class, EnumeratorPair<T>
.
This class builds on the semantics of the MoveNext
method from the IEnumerator<T>
interface.
Each MoveNext
call returns true
if both enumerators were successfully advanced to the next element;
or false
if one or both enumerators have passed the end of their sequence.
Once MoveNext
returns false
, consumers may read the LengthComparison
property,
whose value indicates the relative length of the enumerated sequences:
Value |
Meaning |
Less than zero |
The first enumerator passed the end of its sequence before the second enumerator. Therefore, the first sequence is shorter than the second. |
Zero |
Both enumerators passed the end of their sequence during the same iteration. Therefore, the two sequences have the same length. |
Greater than zero |
The second enumerator passed the end of its sequence before the first enumerator. Therefore, the first sequence is longer than the second. |
A bare-bones implementation is given below; refer to
EnumeratorPair<T>
for a proper one.
public class EnumeratorPair<T> : IDisposable
{
private readonly IEnumerator<T> _enumerator1;
private readonly IEnumerator<T> _enumerator2;
private bool _passedEnd;
private int _lengthComparison;
public EnumeratorPair(IEnumerable<T> sequence1, IEnumerable<T> sequence2)
{
_enumerator1 = sequence1.GetEnumerator();
_enumerator2 = sequence2.GetEnumerator();
}
public T Current1 => _enumerator1.Current;
public T Current2 => _enumerator2.Current;
public int LengthComparison => _lengthComparison;
public bool MoveNext()
{
bool move1 = _enumerator1.MoveNext();
bool move2 = _enumerator2.MoveNext();
if (!_passedEnd && (!move1 || !move2))
{
_passedEnd = true;
_lengthComparison = move1.CompareTo(move2);
}
return move1 && move2;
}
public void Dispose()
{
_enumerator1.Dispose();
_enumerator2.Dispose();
}
}
Implementing Lexicographical
By leveraging this helper class, we can come up with a neat implementation for Lexicographical
.
We use a while
loop to iterate over the elements of both sequences in sync.
As soon as we encounter an unequal pair of elements, we return the element comparison result.
Otherwise, we keep iterating until the end of one or both sequences is reached,
in which case, we return the length comparison result indicating their relative length.
public override int Compare(IEnumerable<TElement> x, IEnumerable<TElement> y)
{
// Create an enumerator pair to iterate over both sequences in sync.
using (var enumeratorPair = new EnumeratorPair<TElement>(x, y))
{
// Advance both enumerators to their next element, until at least one passes the end of its sequence.
while (enumeratorPair.MoveNext())
{
// Compare the current pair of elements across the two sequences,
// seeking element inequality.
int elementComparison = _elementComparer.Compare(enumeratorPair.Current1, enumeratorPair.Current2);
if (elementComparison != 0)
return elementComparison;
}
// Return the length comparison result.
return enumeratorPair.LengthComparison;
}
}
Adding Shortlex
and SameLength
We shall now enhance this implementation to also handle Shortlex
and SameLength
.
Since the length of the sequences cannot be known in advance, we proceed
with the element-by-element comparison, like in Lexicographical
.
However, when the first unequal pair of elements is encountered, the Shortlex
and SameLength
comparisons
would continue advancing the enumerators of the two sequences (without comparing further elements)
until the end of one or both is reached.
If the end of both sequences is reached simultaneously (during the same iteration),
then it may be concluded that they have the same length,
and the element comparison result is returned by the Compare
method.
If one sequence is found to have ended before the other,
then Shortlex
would return the length comparison result,
whilst SameLength
would throw an ArgumentException
.
This logic for handling sequences of different lengths also needs to be applied in cases where
one or both sequences end without having encountered an unequal pair of elements,
so we place it after the outer while
loop, and jump to it through a break
statement.
public override int Compare(IEnumerable<TElement> x, IEnumerable<TElement> y)
{
// Create an enumerator pair to iterate over both sequences in sync.
using (var enumeratorPair = new EnumeratorPair<TElement>(x, y))
{
// Advance both enumerators to their next element,
// until at least one passes the end of its sequence.
while (enumeratorPair.MoveNext())
{
// Compare the current pair of elements across the two sequences,
// seeking element inequality.
int elementComparison = _elementComparer.Compare(enumeratorPair.Current1, enumeratorPair.Current2);
if (elementComparison != 0)
{
// If the current sequence comparison is Shortlex or SameLength,
// then length inequality needs to be given precedence over element inequality.
if (_comparisonType == SequenceComparison.Shortlex ||
_comparisonType == SequenceComparison.SameLength)
{
// Advance both enumerators, ignoring the elements,
// until at least one passes the end of its sequence.
while (enumeratorPair.MoveNext())
;
// If one sequence was shorter than the other, then use the length comparison result.
// Such logic is implemented beyond the outer while loop.
if (enumeratorPair.LengthComparison != 0)
break;
}
// If the current sequence comparison is Lexicographical,
// or the sequences have been found to share the same length,
// then return the non-zero element comparison result.
return elementComparison;
}
}
// This point is only reached if at least one sequence ended without finding any unequal pair of elements;
// or an unequal pair of elements was found, but the unequal length of the sequences dominates the result.
// If the current sequence comparison is SameLength,
// then throw exception for sequences of different lengths.
if (_comparisonType == SequenceComparison.SameLength &&
enumeratorPair.LengthComparison != 0)
throw new ArgumentException("The sequences to be compared do not have the same number of elements.");
// Return the length comparison result.
return enumeratorPair.LengthComparison;
}
}
Optimizing for in-memory collections
In an earlier section, we mentioned that the absence of a Count
property
on the IEnumerable<T>
interface prevents us from performing sequence length comparisons
at the beginning of our implementation.
This is not an issue for Lexicographical
, where the element-level comparisons dominate the result.
However, it can lead to useless element comparisons (and iterations) for Shortlex
and SameLength
if the sequences are subsequently found to have had different lengths. Consider the example below:
var comparer = new SequenceComparer<int>(SequenceComparison.Shortlex);
var result = comparer.Compare(
new int[] { 1, 2, 3, 4, 5, 6, 7 },
new int[] { 1, 2, 3, 6, 7, 8, 9, 10 });
Under shortlex, it should be obvious that the first sequence is less than the second,
due to it having a smaller number of elements; there is no need to read the element values.
Such information is readily available to our comparer, since arrays expose a fast
Length
property
(which is also returned by their
ICollection.Count
and
ICollection<T>.Count
explicit interface member implementations).
Yet our implementation so far would still need to compare the first four elements of the sequences,
and skip over the subsequent three elements, before realizing that the sequences’ lengths are different,
making the element comparisons inapplicable.
Rather than always suffer this wasteful enumeration, we can optimize for in-memory collections
by attempting to determine their number of elements efficiently at the beginning of the comparison,
and only proceed with the elements’ enumeration if found to have the same length.
For sequences that rely on deferred execution and cannot provide this information upfront,
we will fall back on our former behavior of performing the length comparison
as part of the main enumeration.
This topic is discussed in more detail, and a sample implementation provided,
under the article for our TryFastCount
extension method.
We will now show how this method could be leveraged to optimize our comparer implementation:
public override int Compare(IEnumerable<TElement> x, IEnumerable<TElement> y)
{
// Optimization for in-memory collections:
// If the current sequence comparison is Shortlex or SameLength,
// then attempt to perform a length comparison without enumerating the sequences.
if (_comparisonType == SequenceComparison.Shortlex ||
_comparisonType == SequenceComparison.SameLength)
{
int xCount, yCount;
if (x.TryFastCount(out xCount) &&
y.TryFastCount(out yCount))
{
int countComparison = xCount.CompareTo(yCount);
if (countComparison != 0)
{
// Shortlex returns immediately for sequences of different lengths,
// whilst SameLength throws immediately for such sequences.
if (_comparisonType == SequenceComparison.Shortlex)
return countComparison;
if (_comparisonType == SequenceComparison.SameLength)
throw new ArgumentException("The sequences to be compared do not have the same number of elements.");
}
}
}
// rest of implementation
}
Default comparers
We can define static properties as shorthand for the default sequence comparers
that use Comparer<TElement>.Default
for the element-level comparisons:
public static SequenceComparer<TElement> Lexicographical { get; } =
new SequenceComparer<TElement>(SequenceComparison.Lexicographical);
public static SequenceComparer<TElement> Shortlex { get; } =
new SequenceComparer<TElement>(SequenceComparison.Shortlex);
public static SequenceComparer<TElement> SameLength { get; } =
new SequenceComparer<TElement>(SequenceComparison.SameLength);
Factory method
Rather than exposing the constructor of our SequenceComparer<TElement>
type as public
,
we could make it protected internal
, and instead provide a factory method for creating sequence comparers.
This would allow us to change the class design of our comparer –
for example, to introduce an inheritance hierarchy –
without breaking any consumers.
public static class SequenceComparer
{
public static SequenceComparer<TElement> Create<TElement>(
SequenceComparison comparisonType = SequenceComparison.Lexicographical,
IComparer<TElement> elementComparer = null)
{
return new SequenceComparer<TElement>(comparisonType, elementComparer);
}
}
public class SequenceComparer<TElement> : ComparerBase<IEnumerable<TElement>>
{
protected internal SequenceComparer(
SequenceComparison comparisonType = SequenceComparison.Lexicographical,
IComparer<TElement> elementComparer = null)
{
_comparisonType = comparisonType;
_elementComparer = elementComparer ?? Comparer<TElement>.Default;
}
// ...
}
Examples
The example below shows how our SequenceComparer<TElement>
may be used to sort a collection of sequences of strings.
We demonstrate the behavior of Lexicographical
and Shortlex
using the default inner comparer for strings,
as well as using the ordinal comparer.
The default string comparer performs a culture-sensitive linguistic comparison.
Under the "en-US" culture, ñ
sorts before t
, so "Añasco"
sorts before "Athens"
.
The ordinal comparison, on the other hand, sorts ñ
after t
.
var sequences = new string[][]
{
new string[] { "Paris", "Añasco", "Athens", "New York" },
new string[] { "Madrid", "Paris", "Añasco" },
new string[] { },
new string[] { "Añasco", "Madrid" },
new string[] { "Paris", "Añasco", "Athens" },
new string[] { "Paris", "Athens" },
new string[] { "Madrid", "Paris", "Athens", "New York" },
new string[] { "Paris", "Athens", "Añasco" },
new string[] { "Athens" },
new string[] { "Athens", "Madrid", "Añasco" },
new string[] { "Paris", "Añasco", "Athens", "Madrid", "New York" },
new string[] { "Madrid", "Añasco" },
new string[] { "Paris", "Añasco" },
};
CultureInfo.CurrentCulture = CultureInfo.GetCultureInfo("en-US");
// Lexicographical order, with element-level string comparisons being linguistic under "en-US" culture:
var a = sequences.OrderBy(s => s, SequenceComparer<string>.Lexicographical);
// Result:
// { }
// { "Añasco", "Madrid" }
// { "Athens" }
// { "Athens", "Madrid", "Añasco" }
// { "Madrid", "Añasco" }
// { "Madrid", "Paris", "Añasco" }
// { "Madrid", "Paris", "Athens", "New York" }
// { "Paris", "Añasco" }
// { "Paris", "Añasco", "Athens" }
// { "Paris", "Añasco", "Athens", "Madrid", "New York" }
// { "Paris", "Añasco", "Athens", "New York" }
// { "Paris", "Athens" }
// { "Paris", "Athens", "Añasco" }
// Lexicographical order, with element-level string comparisons being ordinal:
var b = sequences.OrderBy(s => s, SequenceComparer.Create(SequenceComparison.Lexicographical, StringComparer.Ordinal));
// Result:
// { }
// { "Athens" }
// { "Athens", "Madrid", "Añasco" }
// { "Añasco", "Madrid" }
// { "Madrid", "Añasco" }
// { "Madrid", "Paris", "Athens", "New York" }
// { "Madrid", "Paris", "Añasco" }
// { "Paris", "Athens" }
// { "Paris", "Athens", "Añasco" }
// { "Paris", "Añasco" }
// { "Paris", "Añasco", "Athens" }
// { "Paris", "Añasco", "Athens", "Madrid", "New York" }
// { "Paris", "Añasco", "Athens", "New York" }
// Shortlex order, with element-level string comparisons being linguistic under "en-US" culture:
var c = sequences.OrderBy(s => s, SequenceComparer<string>.Shortlex);
// Result:
// { }
// { "Athens" }
// { "Añasco", "Madrid" }
// { "Madrid", "Añasco" }
// { "Paris", "Añasco" }
// { "Paris", "Athens" }
// { "Athens", "Madrid", "Añasco" }
// { "Madrid", "Paris", "Añasco" }
// { "Paris", "Añasco", "Athens" }
// { "Paris", "Athens", "Añasco" }
// { "Madrid", "Paris", "Athens", "New York" }
// { "Paris", "Añasco", "Athens", "New York" }
// { "Paris", "Añasco", "Athens", "Madrid", "New York" }
// Shortlex order, with element-level string comparisons being ordinal:
var d = sequences.OrderBy(s => s, SequenceComparer.Create(SequenceComparison.Shortlex, StringComparer.Ordinal));
// Result:
// { }
// { "Athens" }
// { "Añasco", "Madrid" }
// { "Madrid", "Añasco" }
// { "Paris", "Athens" }
// { "Paris", "Añasco" }
// { "Athens", "Madrid", "Añasco" }
// { "Madrid", "Paris", "Añasco" }
// { "Paris", "Athens", "Añasco" }
// { "Paris", "Añasco", "Athens" }
// { "Madrid", "Paris", "Athens", "New York" }
// { "Paris", "Añasco", "Athens", "New York" }
// { "Paris", "Añasco", "Athens", "Madrid", "New York" }
Further considerations
Equality comparisons
The SequenceComparer<TElement>
class is only intended for sort-order comparisons, not for equality comparisons;
it intentionally does not implement the
IEqualityComparer<T>
interface for IEnumerable<TElement>
.
The justification given in our previous article for
KeyComparer<TSource, TKey>
also applies here:
Comparer<TElement>.Default
and
EqualityComparer<TElement>.Default
may have inconsistent semantics for a given element type TElement
,
as is the case for strings.
Furthermore, the differentiation between the lexicographical and shortlex orders does not apply to
sequence equality comparisons: Sequences of different lengths are always simply unequal.
The above-discussed optimization for checking the lengths of in-memory collections
upfront (before enumerating their elements) would apply for such comparisons, and is presented as the representative problem
in our TryFastCount
article.
If you wish to perform equality comparisons on sequences, consider using our
SequenceEqualityComparer<TElement>
class instead.
Use for implementing IComparable<T>
In our previous article, we explained how the IComparable<T>
interface
is intended to be implemented by the type T
itself for providing the
inherent sort order, whilst IComparer<T>
is implemented
by other types to provide alternate sort orders.
However, most collection types, including List<T>
, do not implement IComparable<T>
.
If one defines a custom type that contains or derives from such a collection,
then SequenceComparer<TElement>
could prove useful in its implementation of the
IComparable<T>
interface for the inherent sort order.
Utf8String
For example, a structure that stores UTF-8 encoded representations of strings as byte arrays could implement
the IComparable<T>.CompareTo
method by applying the SequenceComparer<byte>.Lexicographical
comparer
to the said byte arrays.
This would provide results almost identical to ordinal comparisons on the source strings.
public struct Utf8String : IComparable<Utf8String>
{
private readonly byte[] _data;
public Utf8String(string source)
{
_data = Encoding.UTF8.GetBytes(source);
}
public override string ToString()
{
return Encoding.UTF8.GetString(_data);
}
public int CompareTo(Utf8String other)
{
return SequenceComparer<byte>.Lexicographical.Compare(_data, other._data);
}
}
Code point order
There is one scenario where the comparison results of Utf8String
above would differ from
ordinal comparisons on the source strings; namely, when the source strings contain a mix of
surrogate pairs and
BMP
characters with a code point of U+E000
or higher.
UTF-8 does not have the notion of surrogate pairs, so supplementary characters (U+10000
and beyond)
would always sort after the BMP characters (U+0000
–U+FFFF
); this is known as “code point order”.
UTF-16, on the other hand, encodes supplementary characters as surrogate pairs (U+D800
–U+DFFF
),
which would hence sort before the subset of BMP characters having code points U+E000
–U+FFFF
.
Since the String
class in the .NET Framework is based on UTF-16,
its ordinal comparisons would exhibit the latter sort order.
This discrepancy between UTF-8 and UTF-16 is mentioned in the core specification of the Unicode Standard,
under the Binary Order section.