Finding substrings using culture-sensitive comparisons
Author: Douglas Williams | Tags: C#, .NET, strings
Motivation
We will start this article with a bold statement:

The string.IndexOf functionality for finding substrings in the .NET Framework is broken.

We don’t mean that its results are incorrect, but that they are incomplete; and this design deficiency frequently misleads consumers into writing string manipulation logic that produces incorrect results or throws unexpected exceptions when applied to strings from the wild. Concretely, string.IndexOf fails to indicate the length of the matched substring, implicitly causing consumers to make the mistaken assumption that the match will always have the same length as the search string. This is not the case: culture-sensitive comparisons can result in a match whose length differs from that of the search string. Notwithstanding, there is no public API in the .NET Framework Class Library that can retrieve the actual length of such matches. If you’ve ever written any string manipulation logic that involves substrings, there’s a reasonable chance that your code suffers from this bug.
Illustrative examples
Suppose that you need to write a method that removes the first occurrence of a substring from a specified string, as requested in this Stack Overflow question. Let’s assume that your method should perform a word (case-sensitive and culture-sensitive) search using the current culture – as is, after all, the default behavior for the string.IndexOf(string) method overload. The accepted (and highly upvoted) answer presents logic similar to the below:
public static string RemoveFirst(string source, string value)
{
int index = source.IndexOf(value);
return index < 0 ? source : source.Remove(index, value.Length);
}
This answer works fine for ASCII strings, and for a restricted subset of other Unicode characters; however, it will break for Unicode strings at large:
CultureInfo.CurrentCulture = new CultureInfo("en-US");
var a1 = RemoveFirst("abcde", "cd"); // correct: "abe"
var b1 = RemoveFirst("animal", "animal"); // incorrect: "l" instead of ""
var c1 = RemoveFirst("cåt", "å"); // incorrect: "c̊t" instead of "ct"
var d1 = RemoveFirst("café", "é"); // ArgumentOutOfRangeException
var e1 = RemoveFirst("encyclopædia", "aedia"); // ArgumentOutOfRangeException
Due to the subtlety of the bug, it is not readily obvious why the above behavior occurs. We will rewrite the strings by substituting Unicode escape sequences for the problematic characters:
CultureInfo.CurrentCulture = new CultureInfo("en-US");
var a2 = RemoveFirst("abcde", "cd"); // correct: "abe"
var b2 = RemoveFirst("ani\u00ADmal", "animal"); // incorrect: "l" instead of ""
var c2 = RemoveFirst("c\u0061\u030At", "\u00E5"); // incorrect: "c\u030At" instead of "ct"
var d2 = RemoveFirst("caf\u00E9", "\u0065\u0301"); // ArgumentOutOfRangeException
var e2 = RemoveFirst("encyclop\u00E6dia", "\u0061\u0065dia"); // ArgumentOutOfRangeException
We shall now examine the cause of each of these errors.
Ignorable characters
var b1 = RemoveFirst("animal", "animal"); // incorrect: "l" instead of ""
var b2 = RemoveFirst("ani\u00ADmal", "animal");
The example above, whose strings are incidentally taken from the Microsoft Learn documentation for IndexOf, contains a SOFT HYPHEN (U+00AD) in the source string.

Character sets include ignorable characters, which are characters that are not considered when performing a linguistic or culture-sensitive comparison. In a culture-sensitive search [not Ordinal or OrdinalIgnoreCase], if [the search string] contains an ignorable character, the result is equivalent to searching with that character removed. […] On the .NET Framework 4 or later, because the soft hyphen is an ignorable character, a culture-sensitive search returns the same value that it would return if the soft hyphen were not included in the search string.
Consequently, the search string "animal" is matched by the source string "ani\u00ADmal" in its entirety. However, "animal" has 6 characters, whilst "ani\u00ADmal" has 7 characters due to the soft hyphen. Removing 6 characters from the source string would erroneously leave its last character in place.
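The mismatch can be observed directly. Here is a minimal sketch; the results in the comments assume the en-US culture on the .NET Framework, consistent with the examples above:

CultureInfo.CurrentCulture = new CultureInfo("en-US");
// The culture-sensitive search ignores the soft hyphen and reports a match at index 0...
int index = "ani\u00ADmal".IndexOf("animal");    // 0
// ...but the matched text spans 7 characters, not the 6 characters of the search string.
int sourceLength = "ani\u00ADmal".Length;        // 7
int searchLength = "animal".Length;              // 6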
Combining character sequences
var c1 = RemoveFirst("cåt", "å"); // incorrect: "c̊t" instead of "ct"
var c2 = RemoveFirst("c\u0061\u030At", "\u00E5");
Unicode permits diacritics (such as accents) to be represented through combining characters. Such characters are combined with the base character that precedes them, giving rise to combining character sequences. The source string in the above example (also taken from the Microsoft Learn examples for IndexOf) contains LATIN SMALL LETTER A (U+0061: a) followed by COMBINING RING ABOVE (U+030A: ◌̊), resulting in the combining character sequence "å".

Several popular combining character sequences can also be represented as a precomposed character. The search string in the above example consists of LATIN SMALL LETTER A WITH RING ABOVE (U+00E5: å). Assuming support from your browser’s layout engine and font, you will probably see the two representations, "å" and "å", as identical, despite the difference in their underlying representations ("\u0061\u030A" versus "\u00E5"). However, the combining character sequence is represented using two characters – "å".Length is 2 – whilst the precomposed character only has one – "å".Length is 1.

Since the search string consists of just the precomposed character, the RemoveFirst method will only remove one character from the matched substring in the source string; namely, '\u0061' ('a'). The combining character '\u030A' is left in place, and will be combined with its new predecessor, the base character 'c', to inadvertently produce the new combining character sequence "c̊".
var d1 = RemoveFirst("café", "é"); // ArgumentOutOfRangeException
var d2 = RemoveFirst("caf\u00E9", "\u0065\u0301");
The next example demonstrates the same issue in the opposite direction. The source string ends with the precomposed character LATIN SMALL LETTER E WITH ACUTE (U+00E9: é), whilst the search string consists of LATIN SMALL LETTER E (U+0065: e) followed by COMBINING ACUTE ACCENT (U+0301: ◌́). These two will match – "caf\u00E9".IndexOf("\u0065\u0301") gives the zero-based index position 3. However, attempting to remove two characters – the length of the search string – from this index position in the source string will cause it to run past its end, throwing ArgumentOutOfRangeException.
Ligatures
In orthography, a multigraph is a sequence of letters that behaves as a single unit in a given language. Multigraphs are commonly associated with phonemes, which represent distinct sounds. The most common category of multigraph is the digraph, which consists of a pair of letters. The English language has several multigraphs, such as ⟨sh⟩ (as in “ship” and “wish”) and ⟨ee⟩ (as in “meet”).

Multigraphs sometimes give rise to a fused graphical representation of their constituent letters; these are known as ligatures. There are two ligatures that are occasionally used in English: æ, formed from the letters ‘a’ and ‘e’; and œ, formed from the letters ‘o’ and ‘e’. The former was popularized by its appearance in the title of the Encyclopædia Britannica.
When performing linguistic string comparisons, ligatures should be treated as equal to the sequence of letters from which they are formed. For example, in English, "æ" is equal to "ae", and "œ" is equal to "oe".
var e1 = RemoveFirst("encyclopædia", "aedia"); // ArgumentOutOfRangeException
var e2 = RemoveFirst("encyclop\u00E6dia", "\u0061\u0065dia");
Consequently, "encyclopædia".IndexOf("aedia")
returns 8, indicating that the search string was matched.
However, the length of the matched substring, "ædia"
, is 4,
whilst the length of the search value, "aedia"
, is 5.
The source string has no further characters after the matched substring.
Therefore, attempting to remove 5 characters from index 8 in the source string
will throw ArgumentOutOfRangeException
.
Strategies
There is no public API in the .NET Framework Class Library that can perform such a substring search, and it is unfortunately not trivial to build one. This problem was even raised by Jon Skeet in his Stack Overflow question, How can I perform a culture-sensitive “starts-with” operation from the middle of a string?
We shall first discuss some potential approaches, then present our preferred solution.
Our goal will be to define a Find extension method for strings that has the following signature:
public static bool Find(
this string source,
string searchValue,
StringComparison comparisonType,
out int matchIndex,
out int matchLength)
{
// ...
}
Given such a method, a correct implementation of RemoveFirst would become trivial:
public static string RemoveFirst(string source, string value)
{
int index, length;
if (source.Find(value, StringComparison.CurrentCulture, out index, out length))
return source.Remove(index, length);
return source;
}
Note how we are using the length result returned by Find, rather than the original value.Length.
Restricting to ordinal comparisons
The easy way out would be to restrict our string operation to only support ordinal comparisons. This is the route taken by the .NET Framework Class Library for a number of methods on the String class, including Contains(string), Replace(string, string), and Split(string[], StringSplitOptions). These methods can only perform an ordinal (case-sensitive and culture-insensitive) search, with no overloads to support a case-insensitive or culture-sensitive search. (It is easy to define a Contains extension method that takes a StringComparison parameter and passes it to IndexOf(String, StringComparison); however, such a shortcut cannot be taken for Replace or Split, which would also require the length of the match.)
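For illustration, such a Contains extension method might look like the following sketch (the class name and argument checks are our own):

public static class StringContainsExtensions
{
    // Sketch of a Contains overload that forwards the comparison type to IndexOf.
    // This works because Contains only needs the starting position of the match,
    // not its length.
    public static bool Contains(this string source, string value, StringComparison comparisonType)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (value == null) throw new ArgumentNullException(nameof(value));
        return source.IndexOf(value, comparisonType) >= 0;
    }
}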
This design shortcoming of the .NET Framework Class Library is jarring because of the inconsistency it introduces in the framework’s public API. Most string operations do support linguistic comparisons, and enact CurrentCulture behavior by default (consider CompareTo, IndexOf, StartsWith, EndsWith, …); whilst others are restricted to ordinal comparisons.

Per the StringComparison documentation:

An operation that uses ordinal sort rules performs a comparison based on the numeric value (Unicode code point) of each Char in the string. An ordinal comparison is fast but culture-insensitive. When you use ordinal sort rules to sort strings, [the Unicode character] U+xxxx comes before U+yyyy if the value of xxxx is numerically less than yyyy.
Under ordinal comparisons, two strings would be equal if and only if they consist of identical sequences of Unicode code points; in other words, their binary representations are the same. Ignorable characters would not be ignored; precomposed characters would not match combining character sequences; and ligatures would not match multigraphs of their letters. A favorable consequence of this restriction is that equal strings would always have the same length, so it becomes safe to assume that the matched substring would have the same length as the search string.
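As a quick illustration, drawing on the examples from earlier (the results in the comments assume the en-US culture on the .NET Framework):

CultureInfo.CurrentCulture = new CultureInfo("en-US");
// The culture-sensitive search matches the ligature "æ" against "ae"...
var i1 = "encyclop\u00E6dia".IndexOf("aedia", StringComparison.CurrentCulture);   // 8
// ...whilst the ordinal search requires identical code points, so it finds no match.
var i2 = "encyclop\u00E6dia".IndexOf("aedia", StringComparison.Ordinal);          // -1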
We could implement our RemoveFirst method to always perform ordinal searches by specifying StringComparison.Ordinal as argument to IndexOf:
public static string RemoveFirst(string source, string value)
{
int index = source.IndexOf(value, StringComparison.Ordinal);
return index < 0 ? source : source.Remove(index, value.Length);
}
Under this implementation, the search strings in our problematic examples would not be matched by the source strings at all. This ensures that they do not produce blatantly incorrect results, nor throw ArgumentOutOfRangeException. Rather, the source string is returned intact as the result of the RemoveFirst call:
var b1 = RemoveFirst("animal", "animal"); // result: "animal"
var c1 = RemoveFirst("cåt", "å"); // result: "cåt"
var d1 = RemoveFirst("café", "é"); // result: "café"
var e1 = RemoveFirst("encyclopædia", "aedia"); // result: "encyclopædia"
Such a restriction is not acceptable if one needs to process user strings. Best practices for comparing strings in .NET on Microsoft Learn recommends using local linguistic comparisons – CurrentCulture or CurrentCultureIgnoreCase – for “most user input” (as well as “data displayed to the user”).
P/Invoke calls
The .NET Framework Class Library internally implements IndexOf by making P/Invoke calls to native libraries for linguistic comparisons. The source code for CompareInfo.IndexOf shows that it internally calls InternalFindNLSStringEx, which is an external method declaration for the native FindNLSStringEx. The FindNLSStringEx function implemented by Windows defines an out parameter, pcchFound, which is documented as follows:

Pointer to a buffer containing the length of the string that the function finds. The string can be either longer or shorter than the search string. Note that the value of pcchFound is often identical to the value provided in cchValue [the length of the search string], but can differ [if]:
- The strings are equivalent, but have different lengths. For example, "A" plus "Combining Ring" (U+0041 U+030A) is equivalent to the "A Ring" (U+00C5).
Unfortunately, this pcchFound parameter is disregarded by the external method calls from the .NET Framework. One way of determining the length of the matched substring would be to bypass the .NET API altogether and call the aforementioned Windows function directly through P/Invoke, retrieving the value that it writes to pcchFound. An example of such an implementation is given under the Stack Overflow answer by David Ewen.
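A minimal sketch of such a P/Invoke declaration is shown below; the flag value and marshalling details are our assumptions, so consult the FindNLSStringEx documentation and the linked answer before relying on them:

using System;
using System.Runtime.InteropServices;

internal static class NativeMethods
{
    // FIND_FROMSTART: search forward from the beginning of the source string.
    internal const uint FIND_FROMSTART = 0x00400000;

    // Returns the zero-based starting index of the match, or -1 on failure;
    // writes the length (in characters) of the matched substring to pcchFound.
    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    internal static extern int FindNLSStringEx(
        string lpLocaleName,
        uint dwFindNLSStringFlags,
        string lpStringSource,
        int cchSource,
        string lpStringValue,
        int cchValue,
        out int pcchFound,
        IntPtr lpVersionInformation,
        IntPtr lpReserved,
        IntPtr sortHandle);
}

// Hypothetical usage (lengths passed explicitly rather than relying on null termination):
// int start = NativeMethods.FindNLSStringEx("en-US", NativeMethods.FIND_FROMSTART,
//     source, source.Length, searchValue, searchValue.Length,
//     out int matchLength, IntPtr.Zero, IntPtr.Zero, IntPtr.Zero);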
This is the fastest approach because it only needs to run native code, and determines the length of the match within the same operation that locates its starting character position. This approach also involves the least implementation complexity; we just need to reuse a Windows function.
However, this approach suffers from the drawback that affects P/Invoke calls in general: It sacrifices portability. P/Invoke relies on the existence of specific native libraries, which might not be available on other platforms, such as Linux. Therefore, it is not suitable for libraries that are intended to be platform-agnostic.
Unicode normalization
An alternative approach involves subjecting the source and search strings to Unicode normalization through the string.Normalize(NormalizationForm) method before comparison, as suggested in the Stack Overflow answer by Esailija.
However, this approach is undesirable since the returned results would only apply to the normalized forms of the source and search strings, requiring the original strings to be discarded and replaced by their normalized forms for all subsequent processing and storage.
Furthermore, Unicode normalization would not always yield results consistent with culture-sensitive comparisons in .NET (such as Compare or Equals with StringComparison.CurrentCulture). As mentioned in the Unicode Normalization Forms annex, Form C and Form D only support canonical mappings, such as between precomposed characters and combining character sequences – for example, "é" and "e\u0301". However, the said forms do not perform compatibility mappings, as is required for ligatures. For example, "æ" is not decomposed to "ae", nor "ﬃ" (U+FB03) to "ffi", even though these ligatures are considered to be equal to their corresponding character sequences under the English (United States) culture. Form KC and Form KD handle compatibility mappings, and can decompose some ligatures, such as "ﬃ", but miss others, such as "æ".
(A Stack Overflow answer mentions that “Unicode 6.2 doesn't appear to contain a normative mapping from Æ to AE.”)

The issue is made worse by the discrepancies between cultures – "æ" is equal to "ae" under English (United States), but not under Danish (Denmark), as discussed under the Microsoft Learn documentation for String comparison and sorting. Thus, normalization (to any form) would not give results that are consistent with CurrentCulture semantics.
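The following sketch illustrates the gap; the results in the comments assume the en-US culture on the .NET Framework (NLS-based comparisons), and may differ under ICU-based versions of .NET:

CultureInfo.CurrentCulture = new CultureInfo("en-US");

// Canonical mapping: Form D decomposes the precomposed character.
var n1 = "caf\u00E9".Normalize(NormalizationForm.FormD);    // "cafe\u0301"

// Compatibility mapping: Form KD decomposes the "ﬃ" ligature...
var n2 = "\uFB03".Normalize(NormalizationForm.FormKD);      // "ffi"

// ...but leaves "æ" intact, even though the culture-sensitive comparison
// treats it as equal to "ae".
var n3 = "\u00E6".Normalize(NormalizationForm.FormKD);      // "æ"
var n4 = string.Compare("\u00E6", "ae", StringComparison.CurrentCulture);   // 0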
Text elements
Yet another alternative involves iterating over the strings as a sequence of text elements, rather than UTF-16 code units, using the StringInfo.GetNextTextElement method, as presented in the Stack Overflow answer by Mårten Wikström. Results would be similar to those obtained from Unicode normalization: canonical mappings are honored, but compatibility mappings are not.
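For reference, a text-element iteration groups each combining character sequence into a single element; a minimal sketch using the related StringInfo.GetTextElementEnumerator method (not the implementation from the linked answer):

// Enumerate "cåt" ("c" + "a" + COMBINING RING ABOVE + "t") as text elements.
var enumerator = StringInfo.GetTextElementEnumerator("c\u0061\u030At");
while (enumerator.MoveNext())
    Console.WriteLine(enumerator.GetTextElement());   // "c", then "a\u030A", then "t"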
Iterative equality comparisons
The final approach, which we shall be implementing, involves iteratively testing for the length of the substring until a match with the search string is found. Such an implementation would first call IndexOf to get the starting position of the match, then iteratively attempt to identify its length. It begins with the most likely case (hot path) of the matched substring having the same length as the search string, verifying this through a substring equality comparison. If not equal, it would attempt to decrement and increment the length of the substring by one character each time, performing an equality comparison against each such substring, until a match is confirmed.

The approach of iterating over the substring’s length is proposed in the Stack Overflow answer by Ken Kin, and endorsed by usr:
I have solved a similar problem once like this (search-string highlighting in HTML). I did it similarly. You can tune the loop and search strategy in a way that makes it completed very quickly by checking the likely cases first. The nice thing about this is that it seems to be totally correct and no Unicode details leak into your code.
Implementation
For a polished implementation of our proposed approach, refer to the Find method within the StringExtensions class, available as part of our DogmaMix.Core library on GitHub. We shall be presenting a simplified version in the discussion below.

Our implementation builds upon the string.IndexOf(string, StringComparison) method from the .NET Framework Class Library, but extends it to also return the length of the match, allowing string manipulation operations to subsequently be performed correctly. Our method defines two out parameters, for returning both the zero-based starting character position and the length (in characters) of the match.
public static bool Find(
this string source,
string searchValue,
StringComparison comparisonType,
out int matchIndex,
out int matchLength)
{
// argument validation
matchIndex = source.IndexOf(searchValue, comparisonType);
if (matchIndex == -1)
{
matchLength = -1;
return false;
}
matchLength = FindMatchLength(source, searchValue, comparisonType, matchIndex);
return true;
}
FindMatchLength
After determining the starting character position of the match through IndexOf, we proceed to determine its length through our FindMatchLength helper method:
private static int FindMatchLength(string source, string searchValue, StringComparison comparisonType, int matchIndex)
{
int matchLengthMaximum = source.Length - matchIndex;
int matchLengthInitial = Math.Min(searchValue.Length, matchLengthMaximum);
// Hot path: match length is same as specified search string
if (Substring.Compare(source, matchIndex, matchLengthInitial, searchValue, 0, searchValue.Length, comparisonType) == 0)
return matchLengthInitial;
int matchLengthDecrementing = matchLengthInitial - 1;
int matchLengthIncrementing = matchLengthInitial + 1;
while (matchLengthDecrementing >= 0 || matchLengthIncrementing <= matchLengthMaximum)
{
if (matchLengthDecrementing >= 0)
{
if (Substring.Compare(source, matchIndex, matchLengthDecrementing, searchValue, 0, searchValue.Length, comparisonType) == 0)
return matchLengthDecrementing;
matchLengthDecrementing--;
}
if (matchLengthIncrementing <= matchLengthMaximum)
{
if (Substring.Compare(source, matchIndex, matchLengthIncrementing, searchValue, 0, searchValue.Length, comparisonType) == 0)
return matchLengthIncrementing;
matchLengthIncrementing++;
}
}
// Should never happen
return -1;
}
Substring.Compare
The FindMatchLength helper method, in turn, relies on another utility method we define, Substring.Compare, which compares substrings from two specified strings using the specified comparison rules, and returns an integer that indicates their relative position in the sort order.

The .NET Framework Class Library provides a series of string.Compare method overloads – such as string.Compare(string, int, string, int, int, StringComparison) – that accept parameters for specifying the starting position and length of the substrings to compare from the source strings. However, in another case of poor design, .NET fails to provide an overload that permits the lengths of the two respective substrings to be specified separately, presumably having also fallen victim to the fallacy that equal strings must have the same length.

Our Substring.Compare method is similar to the aforementioned string.Compare overload, but allows different lengths to be specified for the two substrings. It is implemented by calling the CompareInfo.Compare method – which fortunately accepts separate length parameters – on the appropriate CompareInfo instance with the appropriate CompareOptions value for each known value of StringComparison. For performance, substring instantiation is avoided, working with the start indexes and lengths instead.
public static class Substring
{
public static int Compare(string strA, int indexA, int lengthA, string strB, int indexB, int lengthB, StringComparison comparisonType)
{
// argument validation and null handling
switch (comparisonType)
{
case StringComparison.CurrentCulture:
return CultureInfo.CurrentCulture.CompareInfo.Compare(strA, indexA, lengthA, strB, indexB, lengthB, CompareOptions.None);
case StringComparison.CurrentCultureIgnoreCase:
return CultureInfo.CurrentCulture.CompareInfo.Compare(strA, indexA, lengthA, strB, indexB, lengthB, CompareOptions.IgnoreCase);
case StringComparison.InvariantCulture:
return CultureInfo.InvariantCulture.CompareInfo.Compare(strA, indexA, lengthA, strB, indexB, lengthB, CompareOptions.None);
case StringComparison.InvariantCultureIgnoreCase:
return CultureInfo.InvariantCulture.CompareInfo.Compare(strA, indexA, lengthA, strB, indexB, lengthB, CompareOptions.IgnoreCase);
case StringComparison.Ordinal:
return CultureInfo.InvariantCulture.CompareInfo.Compare(strA, indexA, lengthA, strB, indexB, lengthB, CompareOptions.Ordinal);
case StringComparison.OrdinalIgnoreCase:
return CultureInfo.InvariantCulture.CompareInfo.Compare(strA, indexA, lengthA, strB, indexB, lengthB, CompareOptions.OrdinalIgnoreCase);
default:
throw new ArgumentException("The string comparison type is not supported.", nameof(comparisonType));
}
}
}
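As a usage illustration, the matched substring "ædia" (4 characters) can be compared against the full search string "aedia" (5 characters) without allocating intermediate substrings; the result commented below reflects the en-US behavior described earlier in this article, and is what allows FindMatchLength to confirm a match of a different length:

var cmp = Substring.Compare("encyclop\u00E6dia", 8, 4, "aedia", 0, 5, StringComparison.CurrentCulture);   // 0 under en-US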
Tests
Our implementation correctly handles all the above-discussed cases, including ignorable characters, combining character sequences, and ligatures.
var a1 = RemoveFirst("abcde", "cd"); // correct: "abe"
var b1 = RemoveFirst("animal", "animal"); // correct: ""
var c1 = RemoveFirst("cåt", "å"); // correct: "ct"
var d1 = RemoveFirst("café", "é"); // correct: "caf"
var e1 = RemoveFirst("encyclopædia", "aedia"); // correct: "encyclop"
Known issue
Our method suffers from one known issue: When the search string is matched by multiple substrings at a given starting position in the source string, which one gets returned is undefined. Under the current implementation, the substring whose length is closest to the search string’s is preferred; however, this behavior would not always be the expected one. We believe that this issue only occurs for matches that end with ignorable characters; the inclusion or exclusion of such characters in the result is undefined.
Consider the following examples containing a soft hyphen, \u00AD:
var f = RemoveFirst("aethe\u00ADreal", "aethe"); // result: "\u00ADreal"
var g = RemoveFirst("æthe\u00ADreal", "aethe"); // result: "real"
In the first case, the search string is identical to the beginning of the source string, so the full remainder of the source string, including the soft hyphen, is returned. In the second case, the search string is one character longer than its truest match in the source string, "æthe", due to the ligature in the latter. However, the substring having the search string’s length, "æthe\u00AD", would still be a match, since the last character of such a substring would be the soft hyphen, which is ignorable. Therefore, the soft hyphen inadvertently gets included in the match.
This issue is arguably benign under the semantics of string comparisons if one assumes that the inclusion or exclusion of ignorable characters is insignificant either way. However, as shown above, it may lead to inconsistent results when the ignorable characters are considered.