Finding substrings using culture-sensitive comparisons

Motivation

We will start this article with a bold statement: The string.IndexOf functionality for finding substrings in the .NET Framework is broken. We don’t mean that its results are incorrect, but that they are incomplete; and this design deficiency frequently misleads consumers into writing string manipulation logic that produces incorrect results or throws unexpected exceptions when applied to strings from the wild.

Concretely, string.IndexOf fails to indicate the length of the matched substring, implicitly leading consumers to assume that the match will always have the same length as the search string. This is not the case: Culture-sensitive comparisons can result in a match that has a different length from the search string. Yet there is no public API in the .NET Framework Class Library that can retrieve the actual length of such matches. If you’ve ever written any string manipulation logic that involves substrings, there’s a reasonable chance that your code suffers from this bug.

Illustrative examples

Suppose that you need to write a method that removes the first occurrence of a substring from a specified string, as requested in this Stack Overflow question. Let’s assume that your method should perform a word (case-sensitive and culture-sensitive) search using the current culture – as is, after all, the default behavior for the string.IndexOf(string) method overload. The accepted (and highly upvoted) answer presents logic similar to the below:

public static string RemoveFirst(string source, string value)
{
    int index = source.IndexOf(value);
    return index < 0 ? source : source.Remove(index, value.Length);
}

This answer works fine for ASCII strings, and for a restricted subset of other Unicode characters; however, it will break for Unicode strings at large:

CultureInfo.CurrentCulture = new CultureInfo("en-US");
var a1 = RemoveFirst("abcde", "cd");             // correct: "abe"
var b1 = RemoveFirst("ani­mal", "animal");        // incorrect: "l" instead of ""
var c1 = RemoveFirst("cåt", "å");                // incorrect: "c̊t" instead of "ct"
var d1 = RemoveFirst("café", "é");               // ArgumentOutOfRangeException
var e1 = RemoveFirst("encyclopædia", "aedia");   // ArgumentOutOfRangeException

Due to the subtlety of the bug, it is not readily obvious why the above behavior occurs. We will rewrite the strings by substituting Unicode escape sequences for the problematic characters:

CultureInfo.CurrentCulture = new CultureInfo("en-US");
var a2 = RemoveFirst("abcde", "cd");                            // correct: "abe"
var b2 = RemoveFirst("ani\u00ADmal", "animal");                 // incorrect: "l" instead of ""
var c2 = RemoveFirst("c\u0061\u030At", "\u00E5");               // incorrect: "c\u030At" instead of "ct"
var d2 = RemoveFirst("caf\u00E9", "\u0065\u0301");              // ArgumentOutOfRangeException
var e2 = RemoveFirst("encyclop\u00E6dia", "\u0061\u0065dia");   // ArgumentOutOfRangeException

We shall now examine the cause of each of these errors.

Ignorable characters

var b1 = RemoveFirst("ani­mal", "animal");        // incorrect: "l" instead of ""
var b2 = RemoveFirst("ani\u00ADmal", "animal");

The example above, whose strings are incidentally taken from the MSDN documentation for IndexOf, contains a SOFT HYPHEN (U+00AD) in the source string.

Character sets include ignorable characters, which are characters that are not considered when performing a linguistic or culture-sensitive comparison. In a culture-sensitive search [not Ordinal or OrdinalIgnoreCase], if [the search string] contains an ignorable character, the result is equivalent to searching with that character removed.

[…] On the .NET Framework 4 or later, because the soft hyphen is an ignorable character, a culture-sensitive search returns the same value that it would return if the soft hyphen were not included in the search string.

Consequently, the search string "animal" matches the entire source string "ani\u00ADmal". However, "animal" has 6 characters, whilst "ani\u00ADmal" has 7 characters due to the soft hyphen. Removing 6 characters from the source string would erroneously leave its last character, "l", in place.
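
To make the length mismatch concrete, here is what IndexOf reports for this pair of strings (a sketch assuming the .NET Framework 4 or later and the en-US culture, as in the earlier examples):

CultureInfo.CurrentCulture = new CultureInfo("en-US");

string source = "ani\u00ADmal";       // 7 characters, including the soft hyphen
string value = "animal";              // 6 characters

int index = source.IndexOf(value);    // 0 – the culture-sensitive search ignores the soft hyphen
// source.Remove(index, value.Length) removes only "ani\u00ADma", leaving the trailing "l" in place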

Combining character sequences

var c1 = RemoveFirst("cåt", "å");                // incorrect: "c̊t" instead of "ct"
var c2 = RemoveFirst("c\u0061\u030At", "\u00E5"); 

Unicode permits diacritics (such as accents) to be represented through combining characters. Such characters are combined with the base character that precedes them, giving rise to combining character sequences. The source string in the above example (also taken from the MSDN examples for IndexOf) contains LATIN SMALL LETTER A (U+0061: a) followed by COMBINING RING ABOVE (U+030A: ◌̊), resulting in the combining character sequence "å".

Several popular combining character sequences can also be represented as a precomposed character. The search string in the above example consists of LATIN SMALL LETTER A WITH RING ABOVE (U+00E5: å). Assuming support from your browser’s layout engine and font, you will probably see the two representations, "å" and "å", as identical, despite the difference in their underlying representations ("\u0061\u030A" versus "\u00E5"). However, the combining character sequence is represented using two characters – "å".Length is 2 – whilst the precomposed character only has one – "å".Length is 1.

Since the search string consists of just the precomposed character, the RemoveFirst method will only remove one character from the matched substring in the source string; namely, '\u0061' ('a'). The combining character '\u030A' is left in place, and will be combined with its new predecessor, the base character 'c', to inadvertently produce the new combining character sequence "c̊".
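
The length discrepancy can be observed directly by comparing the two representations (a sketch assuming the en-US culture):

CultureInfo.CurrentCulture = new CultureInfo("en-US");

string combining = "\u0061\u030A";     // combining character sequence "å"; Length is 2
string precomposed = "\u00E5";         // precomposed character "å"; Length is 1

var equal = string.Equals(combining, precomposed, StringComparison.CurrentCulture);   // true
var index = "c\u0061\u030At".IndexOf(precomposed);                                    // 1
// "c\u0061\u030At".Remove(1, precomposed.Length) removes only '\u0061', yielding "c\u030At" ("c̊t")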

var d1 = RemoveFirst("café", "é");               // ArgumentOutOfRangeException
var d2 = RemoveFirst("caf\u00E9", "\u0065\u0301");

The next example demonstrates the same issue in the opposite direction. The source string ends with the precomposed character LATIN SMALL LETTER E WITH ACUTE (U+00E9: é), whilst the search string consists of LATIN SMALL LETTER E (U+0065: e) followed by COMBINING ACUTE ACCENT (U+0301: ◌́). These two will match – "caf\u00E9".IndexOf("\u0065\u0301") gives the zero-based index position 3. However, attempting to remove two characters – the length of the search string – from this index position in the source string will cause it to run past its end, throwing ArgumentOutOfRangeException.
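
Again, the index reported by IndexOf is correct; it is the assumed match length that causes the failure:

int index = "caf\u00E9".IndexOf("\u0065\u0301");   // 3 – the precomposed é matches the combining sequence
// "caf\u00E9".Length is 4, so Remove(3, 2) would reach past the end of the string,
// throwing ArgumentOutOfRangeException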

Ligatures

In orthography, a multigraph is a sequence of letters that behaves as a single unit in a given language. Multigraphs are commonly associated with phonemes, each of which represents a distinct sound. The most common category of multigraph is the digraph, which consists of a pair of letters. The English language has several multigraphs, such as ⟨sh⟩ (as in “ship” and “wish”) and ⟨ee⟩ (as in “meet”).

Multigraphs sometimes give rise to a fused graphical representation of their constituent letters; these are known as ligatures. There are two ligatures that are occasionally used in English: æ, formed from the letters ‘a’ and ‘e’; and œ, formed from the letters ‘o’ and ‘e’. The former was popularized by its appearance in the title of the Encyclopædia Britannica.

When performing linguistic string comparisons, ligatures should be treated as equal to the sequence of letters from which they are formed. For example, in English, "æ" is equal to "ae", and "œ" is equal to "oe".
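
In .NET, this equivalence is reflected directly by culture-sensitive comparisons (a sketch assuming the en-US culture):

CultureInfo.CurrentCulture = new CultureInfo("en-US");

var x = string.Equals("\u00E6", "ae", StringComparison.CurrentCulture);   // true – æ matches "ae"
var y = string.Equals("\u0153", "oe", StringComparison.CurrentCulture);   // true – œ matches "oe"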

var e1 = RemoveFirst("encyclopædia", "aedia");   // ArgumentOutOfRangeException
var e2 = RemoveFirst("encyclop\u00E6dia", "\u0061\u0065dia");

Consequently, "encyclopædia".IndexOf("aedia") returns 8, indicating that the search string was matched. However, the length of the matched substring, "ædia", is 4, whilst the length of the search value, "aedia", is 5. The source string has no further characters after the matched substring. Therefore, attempting to remove 5 characters from index 8 in the source string will throw ArgumentOutOfRangeException.

Strategies

There is no public API in the .NET Framework Class Library that can perform such a substring search, and it is unfortunately not trivial to build one. This problem was even raised by Jon Skeet in his Stack Overflow question, How can I perform a culture-sensitive “starts-with” operation from the middle of a string?

We shall first discuss some potential approaches, then present our preferred solution. Our goal will be to define a Find extension method for strings that has the following signature:

public static bool Find(
    this string source, 
    string searchValue, 
    StringComparison comparisonType, 
    out int matchIndex, 
    out int matchLength)
{
    // ...
}

Given such a method, a correct implementation of RemoveFirst would become trivial:

public static string RemoveFirst(string source, string value)
{
    int index, length;
    if (source.Find(value, StringComparison.CurrentCulture, out index, out length))
        return source.Remove(index, length);

    return source;
}

Note how we are using the length result returned by Find, rather than the original value.Length.

Restricting to ordinal comparisons

The easy way out would be to restrict our string operation to only support ordinal comparisons. This is the route taken by the .NET Framework Class Library for a number of methods on the String class, including Contains(string), Replace(string, string), and Split(string[], StringSplitOptions). These methods can only perform an ordinal (case-sensitive and culture-insensitive) search, with no overloads to support a case-insensitive or culture-sensitive search. (It is easy to define a Contains extension method that takes a StringComparison parameter and passes it to IndexOf(String, StringComparison); however, such a shortcut cannot be taken for Replace or Split, which would also require the length.)
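
For illustration, such a Contains extension method could be defined along the following lines:

public static class StringExtensions
{
    public static bool Contains(this string source, string value, StringComparison comparisonType)
    {
        // argument validation

        return source.IndexOf(value, comparisonType) >= 0;
    }
}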

This design shortcoming of the .NET Framework Class Library is jarring because of the inconsistency it introduces in the framework’s public API. Most string operations do support linguistic comparisons, and default to CurrentCulture behavior (consider CompareTo, IndexOf, StartsWith, EndsWith, …), whilst others are restricted to ordinal comparisons.

Per the StringComparison documentation:

An operation that uses ordinal sort rules performs a comparison based on the numeric value (Unicode code point) of each Char in the string. An ordinal comparison is fast but culture-insensitive. When you use ordinal sort rules to sort strings, [the Unicode character] U+xxxx comes before U+yyyy if the value of xxxx is numerically less than yyyy.

Under ordinal comparisons, two strings would be equal if and only if they consist of identical sequences of Unicode code points; in other words, their binary representations are the same. Ignorable characters would not be ignored; precomposed characters would not match combining character sequences; and ligatures would not match multigraphs of their letters. A favorable consequence of this restriction is that equal strings would always have the same length, so it becomes safe to assume that the matched substring would have the same length as the search string.
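
The difference is easy to observe by re-running some of the earlier comparisons with StringComparison.Ordinal (a sketch assuming the en-US culture for the culture-sensitive calls):

CultureInfo.CurrentCulture = new CultureInfo("en-US");

var c = string.Equals("\u00E5", "\u0061\u030A", StringComparison.CurrentCulture);   // true  – canonically equivalent
var o = string.Equals("\u00E5", "\u0061\u030A", StringComparison.Ordinal);          // false – different code points
var i = "ani\u00ADmal".IndexOf("animal", StringComparison.Ordinal);                 // -1    – soft hyphen not ignored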

We could implement our RemoveFirst method to always perform ordinal searches by specifying StringComparison.Ordinal as argument to IndexOf:

public static string RemoveFirst(string source, string value)
{
    int index = source.IndexOf(value, StringComparison.Ordinal);
    return index < 0 ? source : source.Remove(index, value.Length);
}

Under this implementation, the search strings in our problematic examples would not be matched by the source strings at all. This ensures that they do not produce blatantly incorrect results, nor throw ArgumentOutOfRangeException. Rather, the source string is returned intact as the result of the RemoveFirst call:

var b1 = RemoveFirst("ani­mal", "animal");        // result: "ani­mal"
var c1 = RemoveFirst("cåt", "å");                // result: "cåt"
var d1 = RemoveFirst("café", "é");               // result: "café"
var e1 = RemoveFirst("encyclopædia", "aedia");   // result: "encyclopædia"

Such a restriction is not acceptable if one needs to process user strings. Best Practices for Using Strings in the .NET Framework on MSDN recommends using local linguistic comparisons – CurrentCulture or CurrentCultureIgnoreCase – for “most user input” (as well as “data displayed to the user”).

P/Invoke calls

The .NET Framework Class Library internally implements IndexOf by making P/Invoke calls to native libraries for linguistic comparisons. The source code for CompareInfo.IndexOf shows that it internally calls InternalFindNLSStringEx, which is an external method declaration for the native FindNLSStringEx.

The FindNLSStringEx function implemented by Windows defines an out parameter, pcchFound, which is documented as follows:

Pointer to a buffer containing the length of the string that the function finds. The string can be either longer or shorter than the search string.

Note that the value of pcchFound is often identical to the value provided in cchValue [the length of the search string], but can differ [if]:

  • The strings are equivalent, but have different lengths. For example, "A" plus "Combining Ring" (U+0041 U+030A) is equivalent to the "A Ring" (U+00C5).

Unfortunately, this pcchFound parameter is disregarded by the external method calls from the .NET Framework.

One way of determining the length of the matched substring would be to bypass the .NET API altogether and call the aforementioned Windows function directly through P/Invoke, retrieving the value that it writes to pcchFound. An example of such an implementation is given under the Stack Overflow answer by David Ewen.
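
For illustration only, below is a minimal sketch of such a P/Invoke declaration; the declaration and helper here are ours rather than reproduced from that answer, the FIND_FROMSTART flag value is taken from the Windows SDK headers, and error handling and the remaining flag options are omitted:

using System;
using System.Runtime.InteropServices;

internal static class NativeMethods
{
    private const uint FIND_FROMSTART = 0x00400000;   // search forward, starting from the beginning of the source string

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern int FindNLSStringEx(
        string lpLocaleName,
        uint dwFindNLSStringFlags,
        string lpStringSource,
        int cchSource,
        string lpStringValue,
        int cchValue,
        out int pcchFound,
        IntPtr lpVersionInformation,
        IntPtr lpReserved,
        IntPtr sortHandle);

    // Returns the zero-based index of the first match, or -1 if no match is found;
    // the length of the matched substring – which may differ from value.Length – is written to matchLength.
    internal static int FindNlsString(string localeName, string source, string value, out int matchLength)
    {
        return FindNLSStringEx(localeName, FIND_FROMSTART, source, source.Length,
                               value, value.Length, out matchLength, IntPtr.Zero, IntPtr.Zero, IntPtr.Zero);
    }
}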

This is the fastest approach because it only needs to run native code, and determines the length of the match within the same operation that locates its starting character position. This approach also involves the least implementation complexity; we just need to reuse a Windows function.

However, this approach suffers from the drawback that affects P/Invoke calls in general: It sacrifices portability. P/Invoke relies on the existence of specific native libraries, which might not be available on other platforms, such as Linux. Therefore, it is not suitable for libraries that are intended to be platform-agnostic.

Unicode normalization

An alternative approach involves subjecting the source and search strings to Unicode normalization through the string.Normalize(NormalizationForm) method before comparison, as suggested in the Stack Overflow answer by Esailija.

However, this approach is undesirable since the returned results would only apply to the normalized forms of the source and search strings, requiring the original strings to be discarded and replaced by their normalized forms for all subsequent processing and storage.

Furthermore, Unicode normalization would not always yield results consistent with culture-sensitive comparisons in .NET (such as Compare or Equals with StringComparison.CurrentCulture). As mentioned in the Unicode Normalization Forms annex, Form C and Form D only support canonical mappings, such as between precomposed characters and combining character sequences – for example, "é" and "e\u0301". However, these forms do not perform compatibility mappings, as would be required for ligatures. For example, "æ" is not decomposed to "ae", nor "ffi" to "ffi", even though these ligatures are considered equal to their corresponding character sequences under the English (United States) culture. Form KC and Form KD handle compatibility mappings and can decompose some ligatures, such as "ffi", but miss others, such as "æ". (A Stack Overflow answer mentions that “Unicode 6.2 doesn't appear to contain a normative mapping from Æ to AE.”) The issue is made worse by discrepancies between cultures – "æ" is equal to "ae" under English (United States), but not under Danish (Denmark), as discussed under the MSDN documentation for String comparison and sorting. Thus, normalization (to any form) would not give results that are consistent with CurrentCulture semantics.
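
The asymmetry is easy to demonstrate (NormalizationForm is defined in the System.Text namespace):

// Canonical mappings are handled:
var d = "caf\u00E9".Normalize(NormalizationForm.FormD);    // "cafe\u0301" – the precomposed é is decomposed

// Compatibility mappings for these ligatures are not:
var ae = "\u00E6".Normalize(NormalizationForm.FormKD);     // still "\u00E6" – no mapping to "ae"
var ffi = "\uFB03".Normalize(NormalizationForm.FormKD);    // "ffi" – the ffi ligature does decompose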

Text elements

Yet another alternative involves iterating over the strings as a sequence of text elements, rather than UTF-16 code units, using the StringInfo.GetNextTextElement method, as presented in the Stack Overflow answer by Mårten Wikström. Results would be similar to those obtained from Unicode normalization: canonical mappings are honored, but compatibility mappings are not.
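
A brief sketch of iterating over text elements with StringInfo (defined in the System.Globalization namespace):

string source = "c\u0061\u030At";
for (int index = 0; index < source.Length; )
{
    string element = StringInfo.GetNextTextElement(source, index);
    Console.WriteLine(element);     // "c", then "a\u030A" (the combining character sequence), then "t"
    index += element.Length;
}
// The combining character sequence comes back as a single text element, so it can be matched
// against the precomposed "\u00E5"; the ligature "\u00E6", however, is also a single text element,
// offering no way to match the two-letter sequence "ae".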

Iterative equality comparisons

The final approach, which we shall be implementing, involves iteratively testing for the length of the substring until a match with the search string is found.

Such an implementation would first call IndexOf to get the starting position of the match, and then iteratively attempt to identify its length. It begins with the most likely case (the hot path) of the matched substring having the same length as the search string, verifying this through a substring equality comparison. If the two are not equal, it alternately decrements and increments the candidate length by one character at a time, performing an equality comparison against each such substring, until a match is confirmed.

The approach of iterating over the substring’s length is proposed in the Stack Overflow answer by Ken Kin, and endorsed by usr:

I have solved a similar problem once like this (search-string highlighting in HTML). I did it similarly. You can tune the loop and search strategy in a way that makes it completed very quickly by checking the likely cases first. The nice thing about this is that it seems to be totally correct and no Unicode details leak into your code.

Implementation

For a polished implementation of our proposed approach, refer to the Find method within the StringExtensions class, available as part of our DogmaMix.Core library on GitHub. We shall be presenting a simplified version in the discussion below.

Our implementation builds upon the string.IndexOf(string, StringComparison) method from the .NET Framework Class Library, but extends it to also return the length of the match, allowing string manipulation operations to subsequently be performed correctly.

Our method defines two out parameters, for returning both the zero-based starting character position and the length (in characters) of the match.

public static bool Find(
    this string source, 
    string searchValue, 
    StringComparison comparisonType, 
    out int matchIndex, 
    out int matchLength)
{
    // argument validation

    matchIndex = source.IndexOf(searchValue, comparisonType);
    if (matchIndex == -1)
    {
        matchLength = -1;
        return false;
    }

    matchLength = FindMatchLength(source, searchValue, comparisonType, matchIndex);
    return true;
}

FindMatchLength

After determining the starting character position of the match through IndexOf, we proceed to determine its length through our FindMatchLength helper method:

private static int FindMatchLength(string source, string searchValue, StringComparison comparisonType, int matchIndex)
{
    int matchLengthMaximum = source.Length - matchIndex;
    int matchLengthInitial = Math.Min(searchValue.Length, matchLengthMaximum);

    // Hot path: match length is same as specified search string
    if (Substring.Compare(source, matchIndex, matchLengthInitial, searchValue, 0, searchValue.Length, comparisonType) == 0)
        return matchLengthInitial;

    int matchLengthDecrementing = matchLengthInitial - 1;
    int matchLengthIncrementing = matchLengthInitial + 1;

    while (matchLengthDecrementing >= 0 || matchLengthIncrementing <= matchLengthMaximum)
    {
        if (matchLengthDecrementing >= 0)
        {
            if (Substring.Compare(source, matchIndex, matchLengthDecrementing, searchValue, 0, searchValue.Length, comparisonType) == 0)
                return matchLengthDecrementing;

            matchLengthDecrementing--;
        }

        if (matchLengthIncrementing <= matchLengthMaximum)
        {
            if (Substring.Compare(source, matchIndex, matchLengthIncrementing, searchValue, 0, searchValue.Length, comparisonType) == 0)
                return matchLengthIncrementing;

            matchLengthIncrementing++;
        }
    }

    // Should never happen
    return -1;
}

Substring.Compare

The FindMatchLength helper method, in turn, relies on another utility method we define, Substring.Compare, which compares substrings from two specified strings using the specified comparison rules, and returns an integer that indicates their relative position in the sort order.

The .NET Framework Class Library provides a series of string.Compare method overloads – such as string.Compare(string, int, string, int, int, StringComparison) – that accept parameters for specifying the starting position and length of the substrings to compare from the source strings. However, in another case of poor design, .NET fails to provide an overload that permits the lengths of the two respective substrings to be specified separately, presumably having also fallen victim to the fallacy that equal strings must have the same length.

Our Substring.Compare method is similar to the aforementioned string.Compare overload, but allows different lengths to be specified for the two substrings. It is implemented by calling the CompareInfo.Compare method – which fortunately accepts separate length parameters – on the appropriate CompareInfo instance with the appropriate CompareOptions value for each known value of StringComparison. For performance, substring instantiation is avoided, working with the start indexes and lengths instead.

public static class Substring
{
    public static int Compare(string strA, int indexA, int lengthA, string strB, int indexB, int lengthB, StringComparison comparisonType)
    { 
        // argument validation and null handling

        switch (comparisonType)
        {
            case StringComparison.CurrentCulture:
                return CultureInfo.CurrentCulture.CompareInfo.Compare(strA, indexA, lengthA, strB, indexB, lengthB, CompareOptions.None);

            case StringComparison.CurrentCultureIgnoreCase:
                return CultureInfo.CurrentCulture.CompareInfo.Compare(strA, indexA, lengthA, strB, indexB, lengthB, CompareOptions.IgnoreCase);

            case StringComparison.InvariantCulture:
                return CultureInfo.InvariantCulture.CompareInfo.Compare(strA, indexA, lengthA, strB, indexB, lengthB, CompareOptions.None);

            case StringComparison.InvariantCultureIgnoreCase:
                return CultureInfo.InvariantCulture.CompareInfo.Compare(strA, indexA, lengthA, strB, indexB, lengthB, CompareOptions.IgnoreCase);

            case StringComparison.Ordinal:
                return CultureInfo.InvariantCulture.CompareInfo.Compare(strA, indexA, lengthA, strB, indexB, lengthB, CompareOptions.Ordinal);

            case StringComparison.OrdinalIgnoreCase:
                return CultureInfo.InvariantCulture.CompareInfo.Compare(strA, indexA, lengthA, strB, indexB, lengthB, CompareOptions.OrdinalIgnoreCase);

            default:
                throw new ArgumentException("The string comparison type is not supported.", nameof(comparisonType));
        }
    }
}
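
For example, for the encyclopædia case discussed earlier, Substring.Compare can confirm the 4-character match against the 5-character search string:

CultureInfo.CurrentCulture = new CultureInfo("en-US");

// Compare the 4-character substring "ædia" (starting at index 8 of the source string)
// against the full 5-character search string "aedia":
int result = Substring.Compare("encyclop\u00E6dia", 8, 4, "aedia", 0, 5, StringComparison.CurrentCulture);
// result is 0 – the substrings compare as equal despite their different lengths,
// which is what allows FindMatchLength to identify 4 as the length of the match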

Tests

Our implementation correctly handles all the above-discussed cases, including ignorable characters, combining character sequences, and ligatures.

var a1 = RemoveFirst("abcde", "cd");             // correct: "abe"
var b1 = RemoveFirst("ani­mal", "animal");        // correct: ""
var c1 = RemoveFirst("cåt", "å");                // correct: "ct"
var d1 = RemoveFirst("café", "é");               // correct: "caf"
var e1 = RemoveFirst("encyclopædia", "aedia");   // correct: "encyclop"

Known issue

Our method suffers from one known issue: When the search string is matched by multiple substrings at a given starting position in the source string, which one gets returned is undefined. Under the current implementation, the substring whose length is closest to the search string’s is preferred; however, this behavior would not always be the expected one. We believe that this issue only occurs for matches that end with ignorable characters; the inclusion or exclusion of such characters in the result is undefined.

Consider the following examples containing a soft hyphen, \u00AD:

var f = RemoveFirst("aethe\u00ADreal", "aethe");   // result: "\u00ADreal"
var g = RemoveFirst("æthe\u00ADreal", "aethe");    // result: "real"

In the first case, the search string is identical to the beginning of the source string, so the full remainder of the source string, including the soft hyphen, is returned. In the second case, the search string is one character longer than its tightest match in the source string, "æthe", due to the ligature in the latter. However, the substring with the search string’s length, "æthe\u00AD", is still a match, since its last character is the soft hyphen, which is ignorable. Therefore, the soft hyphen inadvertently gets included in the match.

This issue is arguably benign under the semantics of string comparisons if one assumes that the inclusion or exclusion of ignorable characters is insignificant either way. However, as shown above, it may lead to inconsistent results when the ignorable characters are considered.