TryFastCount: Optimizing sequence operations for known lengths

Representative problem

When implementing operations over IEnumerable<T> sequences, one would sometimes benefit from knowing the length of the supplied sequence upfront. Consider the source code for the implementation of the SequenceEqual extension method in .NET Framework 4.6.1. A reformatted and commented version is given below:

public static bool SequenceEqual<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    // null checks

    if (comparer == null)
        comparer = EqualityComparer<TSource>.Default;

    // Get enumerators for iterating through the two sequences.
    using (IEnumerator<TSource> e1 = first.GetEnumerator())
    using (IEnumerator<TSource> e2 = second.GetEnumerator())
    {
        // Advance the first enumerator to its next element, until the end of its sequence.
        while (e1.MoveNext())
        {
            // Advance the second enumerator to its next element.
            // If it passes the end of its sequence, then it contained less elements than the first.
            if (!e2.MoveNext())
                return false;

            // If an unequal pair of elements is encountered,
            // then the sequences are not equal.
            if (!comparer.Equals(e1.Current, e2.Current))
                return false;
        }

        // The first enumerator has passed the end of its sequence.

        // If the second enumerator does not immediately pass the end of its sequence too,
        // then it contained more elements than the first.
        if (e2.MoveNext())
            return false;
    }

    // All elements are equal, and both enumerators passed the end of their sequences together.
    return true;
}

Sequences of different lengths should always cause the method to return false. However, there is no check done at the beginning of the method to return immediately for such cases – the two sequences are still iterated over, and their elements compared, potentially right up to the end of the shorter sequence. This is wasteful, especially in cases where sequences share long common prefixes, such as:

var x = new []{ 42, 98, 23, 56, 12, 76, 23, 51, 32 };
var y = new []{ 42, 98, 23, 56, 12, 76, 23, 53, 32, 87, 76, 21 };
var b = x.SequenceEqual(y);              // ↑ would still compare elements up to here,
                                         //   despite inequality of arrays being readily
                                         //   apparent due to different lengths

Attempt #1: Count() checks

One might be tempted to resolve this inefficiency by introducing an upfront check using the Count extension method:

public static bool SequenceEqual<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    // null checks

    if (first.Count() != second.Count())
        return false;

    // ...
}

However, this is not recommended, since it might cause each sequence to be enumerated twice: once to determine its number of elements, and once to iterate over (and compare) its elements. As mentioned in our previous article, this could have unintended consequences on sequences that rely on deferred execution, such as IQueryable<T> instances that would issue and execute a query on the underlying data source (typically database) each time they are enumerated.

Attempt #2: Materialization

To avoid the repeated enumeration, one might opt to materialize both sequences as in-memory collections, similar to what LINQ does for non-streaming operators.

public static bool SequenceEqual<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    // null checks

    var firstList = first.ToList();
    var secondList = second.ToList();

    if (firstList.Count != secondList.Count)
        return false;

    // ...
}

However, this is also not recommendable. In most real-world scenarios, an unequal pair of elements would be encountered after a few iterations for the majority of sequence pairs, meaning that the rest of the elements were materialized for nothing.

Counting selectively

We have established that sequences involving deferred execution should not be counted or materialized in advance. However, what about sequences that are already in-memory collections? Such collections would usually store their number of elements as part of their data structure, allowing it to be retrieved instantaneously, without requiring any enumeration. This information could be used for optimizing operations over sequences that can take shortcuts when the lengths are known, but would otherwise need to measure the lengths as part of the same enumeration performed by the operation itself.

For example, we could optimize the SequenceEqual method implementation for ICollection<T> inputs, which expose a Count property:

public static bool SequenceEqual<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    // null checks

    var firstCollection = first as ICollection<TSource>;
    var secondCollection = second as ICollection<TSource>;

    if (firstCollection != null &&
        secondCollection != null &&
        firstCollection.Count != secondCollection.Count)
    {
        return false;
    }

    // ...
}

The ICollection<T> interface technically cannot guarantee that all the types that implement it will provide an efficient implementation of the Count property. However, by convention, it should be safe to assume that this will be the case. The MSDN design guidelines for Choosing Between Properties and Methods provide the following advice:

Do use a property, rather than a method, if the value of the property is stored in the process memory and the property would just provide access to the value.

Do use a method, rather than a property, [if] the operation is orders of magnitude slower than a field set would be.

Most of the collection types in the .NET Framework Class Library guarantee that retrieving the value of the Count property is an O(1) operation – see List<T>.Count, Collection<T>.Count, HashSet<T>.Count, Dictionary<TKey, TValue>.Count, ….

The SequenceEqual implementation above addresses our performance concerns. However, there are collection types other than ICollection<T> that can provide a fast count. For example, non-generic collections, such as ArrayList, implement the ICollection interface, which also provides a Count property. It would become unwieldy to manually attempt to cast our sequences to each of these types whenever we require a fast count. Rather, we should promote code reuse by providing a utility method for this purpose.

TryFastCount

Our TryFastCount extension method is intended to determine the number of elements in a sequence, if possible, without enumerating it. We do not want to fall back to enumerating over the sequence if a fast count is not possible; we already have the Count extension method in the .NET Framework for that. Per the convention for Try… methods, the return value of TryFastCount indicates whether it succeeded, and its out parameter, count, contains the actual result on success. Specifically, when TryFastCount returns true, the count argument would contain the number of elements in the input sequence. When it returns false, the count argument should be ignored.

public static class EnumerableExtensions
{
    public static bool TryFastCount<TSource>(this IEnumerable<TSource> source, out int count)
    {
        if (source == null)
            throw new ArgumentNullException(nameof(source));

        var collection = source as ICollection<TSource>;
        if (collection != null)
        {
            count = collection.Count;
            return true;
        }

        var collectionNonGeneric = source as ICollection;
        if (collectionNonGeneric != null)
        {
            count = collectionNonGeneric.Count;
            return true;
        }

        var readOnlyCollection = source as IReadOnlyCollection<TSource>;
        if (readOnlyCollection != null)
        {
            count = readOnlyCollection.Count;
            return true;
        }

        var str = source as string;
        if (str != null)
        {
            count = str.Length;
            return true;
        }

        // possibly other type cast attempts

        count = -1;
        return false;
    }
}

The implementation above handles the base interfaces that cover the majority of collection types: ICollection<T>, ICollection, and IReadOnlyCollection<T> (introduced in .NET Framework 4.5). It also handles strings, which can be treated as an IEnumerable<char> collection providing a fast Length property. A benefit of centralizing this logic in such an extension method is that it can be easily extended to support new collection types, whose fast counts would then be automatically available to all sequence operations that call it. This implementation is available in our EnumerableExtensions class as part of our DogmaMix.Core library on GitHub.

Armed with this functionality, we can now optimize the SequenceEqual method to perform an initial count check only if such can be done efficiently, without enumerating the sequences. If not, SequenceEqual would proceed with iterating over the elements, performing the count check as part of the same enumeration.

public static bool SequenceEqual<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    // null checks

    int firstCount, secondCount;
    if (first.TryFastCount(out firstCount) &&
        second.TryFastCount(out secondCount) &&
        firstCount != secondCount)
    {
        return false;
    }

    // ...
}