Locale-sensitivity is almost always a bug. In the interests of making it easy to write correct code, and to improve performance, strtolower() and related functions should only convert byte values in the ASCII range, as if the locale were “C”.
Prior to PHP 8.0, PHP's locale was set from the environment. When a user installs Linux, the installer asks which language they want the system to use. The user might not fully appreciate the consequences of this decision. It not only sets the user interface language for built-in commands, it also pervasively changes how string handling in the C library works. For example, a user selecting “Turkish” when installing Linux would find that applications calling toupper('i') would obtain the dotted capital I (U+0130, “İ”).
In an era of standardized text-based protocols, natural language is a minority application for case conversion. But even if the user did want natural language case conversion, they would be unlikely to achieve success with strtolower(). This is because it processes the string one byte at a time, feeding each byte to the C library's tolower(). If the input is UTF-8, by far the most popular modern choice, strtolower() will mangle the string, typically producing invalid UTF-8 as output.
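The mangling can be illustrated without installing any locale. The following is a rough simulation, not PHP's actual implementation: latin1_tolower_bytes() is a hypothetical stand-in for what the C library's tolower() would do under an ISO 8859-1 locale, applied one byte at a time to UTF-8 input.

```php
<?php
// Hypothetical stand-in for tolower() under an ISO 8859-1 locale (an assumption
// for illustration): A-Z and the accented capitals 0xC0-0xDE, except 0xD7
// (the multiplication sign), are shifted up by 32.
function latin1_tolower_bytes(string $s): string
{
    $out = '';
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $b = ord($s[$i]);
        $isAsciiUpper = $b >= 0x41 && $b <= 0x5A;
        $isLatin1Upper = $b >= 0xC0 && $b <= 0xDE && $b !== 0xD7;
        $out .= chr(($isAsciiUpper || $isLatin1Upper) ? $b + 0x20 : $b);
    }
    return $out;
}

function is_valid_utf8(string $s): bool
{
    // With the /u modifier, PCRE rejects subjects that are not valid UTF-8.
    return preg_match('//u', $s) === 1;
}

$input = "\u{00C4}rger";                  // "Ärger", UTF-8 bytes C3 84 72 67 65 72
$output = latin1_tolower_bytes($input);   // lead byte C3 becomes E3

var_dump(bin2hex($input));
var_dump(bin2hex($output));
var_dump(is_valid_utf8($input));
var_dump(is_valid_utf8($output));
```

The lead byte 0xC3 is treated as the Latin-1 letter “Ã” and lowered to 0xE3, which UTF-8 decoders read as the start of a three-byte sequence, so the result no longer validates.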
PHP 8.0 stopped respecting the locale environment variables. So the locale is always “C” unless the user explicitly calls setlocale(). This means that the bulk of the backwards-incompatible change is already behind us. Any applications depending on the system locale to do case conversion of legacy 8-bit character sets would have been broken by PHP 8.0.
To take an example familiar to the author, MediaWiki calls setlocale() with a configurable locale, partly as a workaround for strict locale-sensitive character encoding in escapeshellarg(), and partly due to a misunderstanding of how locales work and how to select them.
Locale sensitivity in escapeshellarg() is defensible, since PHP is interfacing with a locale-dependent shell. In general, locale sensitivity can be used to interface with locale-sensitive libraries and shell commands.
Locale sensitivity is not useful for natural language processing in new code. We have the intl and mbstring extensions for that.
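As a small sketch of the alternative, assuming the mbstring extension is loaded (it is optional, hence the guard), mb_strtolower() with an explicit encoding converts non-ASCII letters regardless of the process locale:

```php
<?php
// Only meaningful when mbstring is available; the guard is part of the sketch.
if (extension_loaded('mbstring')) {
    $upper = "\u{00C4}RGER";                    // "ÄRGER" encoded as UTF-8
    $lower = mb_strtolower($upper, 'UTF-8');    // encoding is passed explicitly

    var_dump($lower === "\u{00E4}rger");        // compares against "ärger"
    var_dump(mb_check_encoding($lower, 'UTF-8'));
}
```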
PHP libraries distributed with Packagist or PEAR cannot assume a particular locale. Setting the locale temporarily is discouraged by the PHP manual, because the locale is a true global and will influence other threads in a multithreaded SAPI. So libraries have a choice of either reimplementing these core string functions, or just calling them and hoping.
ASCII conversion to upper case is here defined as conversion of byte values in the range a-z to corresponding values in the range A-Z, by subtracting 32 from each byte value.
ASCII conversion to lower case is similarly defined as adding 32 to byte values in the A-Z range.
ASCII case conversion is 8-bit clean. Byte values greater than or equal to 128 are not modified, so if a string is encoded as UTF-8 or with an ISO 8859 character set, non-ASCII character values are preserved.
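The definitions above can be sketched in userland PHP. The function names ascii_toupper() and ascii_tolower() are hypothetical and chosen only for this illustration; the real conversion happens in C inside the engine.

```php
<?php
// Illustrative userland versions of the definitions above (hypothetical names).
function ascii_toupper(string $s): string
{
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $b = ord($s[$i]);
        if ($b >= 0x61 && $b <= 0x7A) {     // a-z
            $s[$i] = chr($b - 32);
        }
    }
    return $s;
}

function ascii_tolower(string $s): string
{
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $b = ord($s[$i]);
        if ($b >= 0x41 && $b <= 0x5A) {     // A-Z
            $s[$i] = chr($b + 32);
        }
    }
    return $s;
}

// Bytes >= 128 pass through untouched, so UTF-8 sequences stay intact.
$mixed = "Stra\u{00DF}e \u{00C4}BC";        // "Straße ÄBC" encoded as UTF-8
var_dump(ascii_toupper($mixed) === "STRA\u{00DF}E \u{00C4}BC");
var_dump(ascii_tolower($mixed) === "stra\u{00DF}e \u{00C4}bc");
```

Note that “ß” and “Ä” are left alone in both directions, which is exactly the 8-bit clean behaviour described above.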
Case folding is the conversion of input text to some standard case for the purposes of case-insensitive comparison.
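For illustration, ASCII case folding for comparison might look like the following. Both helper names are hypothetical, and lower case is picked as the standard case arbitrarily.

```php
<?php
// Hypothetical helpers: fold both operands to lower case, then compare bytes.
function ascii_fold(string $s): string
{
    return strtr($s, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');
}

function ascii_case_equals(string $a, string $b): bool
{
    return ascii_fold($a) === ascii_fold($b);
}

var_dump(ascii_case_equals('Content-Type', 'content-type'));
// Non-ASCII letters are compared byte for byte, so "Ä" and "ä" differ.
var_dump(ascii_case_equals("\u{00C4}", "\u{00E4}"));
```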
The following PHP string functions will do ASCII case conversion:
Also:
Note that strcasecmp(), strncasecmp() and substr_compare() with $case_insensitive = true were already using ASCII case conversion.
ASCII case conversion is identical to case conversion with the “C” locale. So these changes have no effect unless setlocale() was called.
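A quick way to observe this equivalence is to select the “C” locale explicitly and pass in bytes outside the ASCII range. The sample string is an arbitrary choice for this sketch.

```php
<?php
// Force the "C" locale so the result does not depend on earlier setlocale() calls.
setlocale(LC_ALL, 'C');

// 0xC4 is "Ä" in ISO 8859-1; "\u{00C9}" is "É" encoded as UTF-8 (bytes C3 89).
$input = "\xC4" . "BC \u{00C9}";
$lower = strtolower($input);

// Only the ASCII letters B and C change; every byte >= 128 is preserved.
var_dump(bin2hex($input));
var_dump(bin2hex($lower));
```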
I considered having a global mode, with backwards-compatible behavior by default. Applications would opt in to locale-insensitive processing, for example with str_use_ascii_case(true). However:
It is not possible for strtolower() to raise a deprecation warning depending on its input, because there is no way to tell whether a given case transformation was intended by the caller.
I considered introducing ctype_tolower() and ctype_toupper(), which would do locale-sensitive case conversion like the old strtolower() and strtoupper(), but Nikita suggested that we may want to make the ctype extension generally be locale-independent, which would make these functions redundant.
I didn't include strnatcasecmp() and natcasesort() in this RFC, because they also use isdigit() and isspace(). They could be migrated in future.
There are about 50 direct callers of tolower() and toupper() which I haven't migrated.
Voting period: 2021-11-25 to 2021-12-09.