TryFastCount
: Optimizing sequence operations for known lengths
Author: Douglas Williams Created: Modified: Tags: C#, .NET, collections
Representative problem
When implementing operations over
IEnumerable<T>
sequences,
one would sometimes benefit from knowing the length of the supplied sequence upfront.
Consider the source code
for the implementation of the
SequenceEqual
extension method
in .NET Framework 4.6.1.
A reformatted and commented version is given below:
public static bool SequenceEqual<TSource>(
this IEnumerable<TSource> first,
IEnumerable<TSource> second,
IEqualityComparer<TSource> comparer)
{
// null checks
if (comparer == null)
comparer = EqualityComparer<TSource>.Default;
// Get enumerators for iterating through the two sequences.
using (IEnumerator<TSource> e1 = first.GetEnumerator())
using (IEnumerator<TSource> e2 = second.GetEnumerator())
{
// Advance the first enumerator to its next element, until the end of its sequence.
while (e1.MoveNext())
{
// Advance the second enumerator to its next element.
// If it passes the end of its sequence, then it contained less elements than the first.
if (!e2.MoveNext())
return false;
// If an unequal pair of elements is encountered,
// then the sequences are not equal.
if (!comparer.Equals(e1.Current, e2.Current))
return false;
}
// The first enumerator has passed the end of its sequence.
// If the second enumerator does not immediately pass the end of its sequence too,
// then it contained more elements than the first.
if (e2.MoveNext())
return false;
}
// All elements are equal, and both enumerators passed the end of their sequences together.
return true;
}
Sequences of different lengths should always cause the method to return false
.
However, there is no check done at the beginning of the method to return immediately for such cases –
the two sequences are still iterated over, and their elements compared,
potentially right up to the end of the shorter sequence.
This is wasteful, especially in cases where sequences share long common prefixes, such as:
var x = new [] { 42, 98, 23, 56, 12, 76, 23, 51, 32 };
var y = new [] { 42, 98, 23, 56, 12, 76, 23, 53, 32, 87, 76, 21 };
var b = x.SequenceEqual(y); // ↑ would still compare elements up to here,
// despite inequality of arrays being readily
// apparent due to different lengths
Attempt #1: Count()
checks
One might be tempted to resolve this inefficiency by introducing an upfront check using the
Count
extension method:
public static bool SequenceEqual<TSource>(
this IEnumerable<TSource> first,
IEnumerable<TSource> second,
IEqualityComparer<TSource> comparer)
{
// null checks
if (first.Count() != second.Count())
return false;
// ...
}
However, this is not recommended, since it might cause each sequence to be enumerated twice:
once to determine its number of elements, and once to iterate over (and compare) its elements.
As mentioned in our previous article,
this could have unintended consequences on sequences that rely on deferred execution,
such as IQueryable<T>
instances
that would issue and execute a query on the underlying data source (typically database)
each time they are enumerated.
Attempt #2: Materialization
To avoid the repeated enumeration, one might opt to materialize both sequences as in-memory collections, similar to what LINQ does for non-streaming operators.
public static bool SequenceEqual<TSource>(
this IEnumerable<TSource> first,
IEnumerable<TSource> second,
IEqualityComparer<TSource> comparer)
{
// null checks
var firstList = first.ToList();
var secondList = second.ToList();
if (firstList.Count != secondList.Count)
return false;
// ...
}
However, this is also not recommendable. In most real-world scenarios, an unequal pair of elements would be encountered after a few iterations for the majority of sequence pairs, meaning that the rest of the elements were materialized for nothing.
Counting selectively
We have established that sequences involving deferred execution should not be counted or materialized in advance. However, what about sequences that are already in-memory collections? Such collections would usually store their number of elements as part of their data structure, allowing it to be retrieved instantaneously, without requiring any enumeration. This information could be used for optimizing operations over sequences that can take shortcuts when the lengths are known, but would otherwise need to measure the lengths as part of the same enumeration performed by the operation itself.
For example, we could optimize the SequenceEqual
method implementation for
ICollection<T>
inputs,
which expose a Count
property:
public static bool SequenceEqual<TSource>(
this IEnumerable<TSource> first,
IEnumerable<TSource> second,
IEqualityComparer<TSource> comparer)
{
// null checks
var firstCollection = first as ICollection<TSource>;
var secondCollection = second as ICollection<TSource>;
if (firstCollection != null &&
secondCollection != null &&
firstCollection.Count != secondCollection.Count)
{
return false;
}
// ...
}
The ICollection<T>
interface technically cannot guarantee that all the types that implement it
will provide an efficient implementation of the Count
property.
However, by convention, it should be safe to assume that this will be the case.
The Microsoft Learn design guidelines for
Choosing Between Properties and Methods
provide the following advice:
Do use a property, rather than a method, if the value of the property is stored in the process memory and the property would just provide access to the value.
Do use a method, rather than a property, [if] the operation is orders of magnitude slower than a field set would be.
Most of the collection types in the .NET Framework Class Library guarantee that
retrieving the value of the Count
property is an O(1) operation – see
List<T>.Count
,
Collection<T>.Count
,
HashSet<T>.Count
,
Dictionary<TKey, TValue>.Count
, ….
The SequenceEqual
implementation above addresses our performance concerns.
However, there are collection types other than ICollection<T>
that can provide a fast count.
For example, non-generic collections, such as
ArrayList
, implement the
ICollection
interface,
which also provides a
Count
property.
It would become unwieldy to manually attempt to cast our sequences to each of these types
whenever we require a fast count.
Rather, we should promote code reuse by providing a utility method for this purpose.
TryFastCount
Our TryFastCount
extension method is intended to determine the number of elements in a sequence,
if possible, without enumerating it.
We do not want to fall back to enumerating over the sequence if a fast count is not possible;
we already have the Count
extension method
in the .NET Framework for that.
Per the convention for Try…
methods, the return value of TryFastCount
indicates whether it succeeded,
and its out
parameter, count
, contains the actual result on success.
Specifically, when TryFastCount
returns true
, the count
argument would contain the number of elements
in the input sequence. When it returns false
, the count
argument should be ignored.
public static class EnumerableExtensions
{
public static bool TryFastCount<TSource>(this IEnumerable<TSource> source, out int count)
{
if (source == null)
throw new ArgumentNullException(nameof(source));
var collection = source as ICollection<TSource>;
if (collection != null)
{
count = collection.Count;
return true;
}
var collectionNonGeneric = source as ICollection;
if (collectionNonGeneric != null)
{
count = collectionNonGeneric.Count;
return true;
}
var readOnlyCollection = source as IReadOnlyCollection<TSource>;
if (readOnlyCollection != null)
{
count = readOnlyCollection.Count;
return true;
}
var str = source as string;
if (str != null)
{
count = str.Length;
return true;
}
// possibly other type cast attempts
count = -1;
return false;
}
}
The implementation above handles the base interfaces that cover the majority of collection types:
ICollection<T>
, ICollection
, and
IReadOnlyCollection<T>
(introduced in .NET Framework 4.5).
It also handles strings, which can be treated as an IEnumerable<char>
collection
providing a fast Length
property.
A benefit of centralizing this logic in such an extension method is that it can be easily extended to support
new collection types, whose fast counts would then be automatically available to all sequence operations that call it.
This implementation is available in our
EnumerableExtensions
class as part of our
DogmaMix.Core
library on GitHub.
Armed with this functionality, we can now optimize the SequenceEqual
method to perform an initial count check
only if such can be done efficiently, without enumerating the sequences.
If not, SequenceEqual
would proceed with iterating over the elements,
performing the count check as part of the same enumeration.
public static bool SequenceEqual<TSource>(
this IEnumerable<TSource> first,
IEnumerable<TSource> second,
IEqualityComparer<TSource> comparer)
{
// null checks
int firstCount, secondCount;
if (first.TryFastCount(out firstCount) &&
second.TryFastCount(out secondCount) &&
firstCount != secondCount)
{
return false;
}
// ...
}