PHP RFC: Locale-independent case conversion

Introduction

Locale sensitivity is almost always a bug. To make correct code easy to write, and to improve performance, strtolower() and related functions should only convert byte values in the ASCII range, as if the locale were “C”.

What is locale sensitivity?

Prior to PHP 8.0, PHP's locale was set from the environment. When a user installs Linux, the installer asks which language to use. The user might not fully appreciate the consequences of this decision: it not only sets the user interface language for built-in commands, it also pervasively changes how string handling in the C library works. For example, a user selecting “Turkish” when installing Linux would find that applications calling toupper('i') would obtain the dotted capital I (U+0130, “İ”).

In an era of standardized text-based protocols, natural language is a minority application for case conversion. But even if the user did want natural language case conversion, they would be unlikely to achieve success with strtolower(). This is because it processes the string one byte at a time, feeding each byte to the C library's tolower(). If the input is UTF-8, by far the most popular modern choice, strtolower() will mangle the string, typically producing invalid UTF-8 as output.
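
The mangling can be reproduced without depending on which locales are installed. The following is a minimal sketch: latin1_tolower_bytes() is a hypothetical helper, not part of PHP, that mimics what a byte-at-a-time tolower() does in an ISO 8859-1 locale. The input is assumed to be the UTF-8 encoding of “Ä”.

```php
<?php
// Hypothetical helper for illustration only: emulates per-byte tolower()
// as it would behave in an ISO 8859-1 locale.
function latin1_tolower_bytes(string $s): string
{
    $out = '';
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $b = ord($s[$i]);
        $isAsciiUpper = $b >= 0x41 && $b <= 0x5A;
        // 0xC0-0xDE are upper case letters in ISO 8859-1,
        // except 0xD7 (the multiplication sign).
        $isLatin1Upper = $b >= 0xC0 && $b <= 0xDE && $b !== 0xD7;
        $out .= chr(($isAsciiUpper || $isLatin1Upper) ? $b + 0x20 : $b);
    }
    return $out;
}

$input  = "\xC3\x84";                     // "Ä" encoded as UTF-8
$output = latin1_tolower_bytes($input);   // 0xC3 is treated as "Ã" and lowered

var_dump(bin2hex($output));                 // string(4) "e384"
var_dump(preg_match('//u', $input) === 1);  // bool(true): valid UTF-8
var_dump(preg_match('//u', $output) === 1); // bool(false): truncated sequence
```

The lead byte 0xC3 becomes 0xE3, which announces a three-byte sequence that never arrives, so the output is no longer valid UTF-8.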

PHP 8.0 stopped respecting the locale environment variables, so the locale is always “C” unless the user explicitly calls setlocale(). This means that the bulk of the backwards-incompatible change is already behind us. Any applications depending on the system locale to do case conversion of legacy 8-bit character sets would have been broken by PHP 8.0.
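
To check which locale is in effect for a given process, setlocale() can be asked to report the current setting without changing it. This is a small sketch; the value printed depends on the platform and on whether any code has already called setlocale().

```php
<?php
// Passing "0" as the locale makes setlocale() return the current setting
// for the category without modifying it.
$current = setlocale(LC_CTYPE, '0');

// On PHP 8.0 and later this is expected to be "C" unless something in the
// process has already called setlocale().
var_dump($current);
```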

Why do applications call setlocale()?

To take an example familiar to the author, MediaWiki calls setlocale() with a configurable locale, partly as a workaround for strict locale-sensitive character encoding in escapeshellarg(), and partly due to a misunderstanding of how locales work and how to select them.

Locale sensitivity in escapeshellarg() is defensible, since PHP is interfacing with a locale-dependent shell. In general, locale sensitivity can be used to interface with locale-sensitive libraries and shell commands.

Locale sensitivity is not useful for natural language processing in new code. We have the intl and mbstring extensions for that.

PHP libraries distributed with Packagist or PEAR cannot assume a particular locale. Setting the locale temporarily is discouraged by the PHP manual, because the locale is a true global and will influence other threads in a multithreaded SAPI. So libraries have a choice of either reimplementing these core string functions, or just calling them and hoping.

What is ASCII case conversion?

ASCII conversion to upper case is here defined as conversion of byte values in the range a-z to corresponding values in the range A-Z, by subtracting 32 from each byte value.

ASCII conversion to lower case is similarly defined as adding 32 to byte values in the A-Z range.

ASCII case conversion is 8-bit clean. Byte values greater than or equal to 128 are not modified, so if a string is encoded as UTF-8 or with an ISO 8859 character set, non-ASCII character values are preserved.
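
The definitions above can be sketched in userland PHP as follows. The function names ascii_strtoupper() and ascii_strtolower() are hypothetical and chosen for illustration; they are not the proposed implementation, which lives in C.

```php
<?php
// Hypothetical helpers for illustration only; not part of PHP.

// Upper case: subtract 32 from bytes in the range 'a'..'z' (0x61..0x7A).
function ascii_strtoupper(string $s): string
{
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $b = ord($s[$i]);
        if ($b >= 0x61 && $b <= 0x7A) {
            $s[$i] = chr($b - 32);
        }
    }
    return $s;
}

// Lower case: add 32 to bytes in the range 'A'..'Z' (0x41..0x5A).
function ascii_strtolower(string $s): string
{
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $b = ord($s[$i]);
        if ($b >= 0x41 && $b <= 0x5A) {
            $s[$i] = chr($b + 32);
        }
    }
    return $s;
}

// "héllo" in UTF-8: the two bytes of "é" (0xC3 0xA9) pass through untouched.
var_dump(bin2hex(ascii_strtoupper("h\xC3\xA9llo"))); // string(12) "48c3a94c4c4f"
var_dump(ascii_strtolower("Content-TYPE"));          // string(12) "content-type"
```

Because bytes of 128 and above are never touched, the output is valid UTF-8 whenever the input is.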

Case folding is the conversion of input text to some standard case for the purposes of case-insensitive comparison.

Proposal

Main changes

The following PHP string functions will do ASCII case conversion:

  • strtolower
  • strtoupper
  • stristr
  • stripos
  • strripos
  • lcfirst
  • ucfirst
  • ucwords
  • str_ireplace

Also:

  • In arsort(), asort(), krsort(), ksort(), rsort(): SORT_FLAG_CASE will mean sorting by ASCII case folding.
  • array_change_key_case will do ASCII case folding.
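
As an illustration of the array functions listed above, the following sketch shows the results in the “C” locale, which under this proposal become the results in every locale. The sample data is made up for the example.

```php
<?php
// Keys are folded to lower case; only A-Z are affected.
$headers = ['Content-Type' => 'text/html', 'X-Request-ID' => 'abc'];
$folded  = array_change_key_case($headers, CASE_LOWER);
var_dump(array_keys($folded)); // "content-type", "x-request-id"

// SORT_FLAG_CASE compares strings after case folding.
$words = ['banana', 'apple', 'Cherry'];
sort($words, SORT_STRING | SORT_FLAG_CASE);
var_dump($words); // "apple", "banana", "Cherry"

// Without the flag, the comparison is byte-wise and "C" (0x43) sorts first.
$plain = ['banana', 'apple', 'Cherry'];
sort($plain, SORT_STRING);
var_dump($plain); // "Cherry", "apple", "banana"
```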

Note that strcasecmp(), strncasecmp() and substr_compare() with $case_insensitive = true were already using ASCII case conversion.
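
The existing behavior of these comparison functions gives a preview of what the rest of the string functions would do. In this short sketch, the non-ASCII strings are written as UTF-8 byte escapes so that the result does not depend on the encoding of the source file.

```php
<?php
// ASCII letters compare equal regardless of case.
var_dump(strcasecmp('HELLO', 'hello') === 0);                       // bool(true)
var_dump(strncasecmp('Content-Type', 'CONTENT-length', 8) === 0);   // bool(true)
var_dump(substr_compare('Hello World', 'WORLD', 6, 5, true) === 0); // bool(true)

// "Ä" (0xC3 0x84) and "ä" (0xC3 0xA4) differ only in a non-ASCII byte,
// so they are not considered equal.
var_dump(strcasecmp("\xC3\x84", "\xC3\xA4") === 0);                 // bool(false)
```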

php_strtolower() and php_strtoupper() are the internal C API equivalents of strtolower() and strtoupper(). After reviewing the callers of these functions in the core tree, I decided that they should also be part of this change. They will henceforth do ASCII case conversion.

For consistency, I also made the case comparison functions in zend_operators.c do ASCII case conversion, specifically string_compare_function_ex, string_case_compare_function, zend_binary_zval_strcasecmp and zend_binary_zval_strncasecmp.

ASCII case conversion is identical to case conversion with the “C” locale. So these changes have no effect unless setlocale() was called.

Consequent changes

The flow-on effects of the change to the behavior of php_strtolower() and php_strtoupper() are a microcosm of the damaging and inappropriate uses locale-sensitive case conversion has been put to:

  • strip_tags(): tags will be matched against $allowed_tags by ASCII case-insensitive search. For example, currently, if $allowed_tags is ['DIV'], and the locale is Turkish, <div> would be stripped. With this change, <div> will be allowed.
  • grapheme_stripos() and grapheme_strripos() currently have a locale-sensitive “fast” path when the input is ASCII. This will become locale-independent.
  • ldap_get_entries(): The documentation states “The attribute index is converted to lowercase”. This will become ASCII lower case.
  • mb_send_mail(): Headers are gathered and indexed with case folding. This change will fix a FIXME comment in the code by using ASCII case conversion for header name comparisons.
  • oci_pconnect(): Case folding of parameters when looking for an existing connection will become locale-independent.
  • PDO DBLIB: ASCII will be used when stringifying UNIQUE column values and converting them to uppercase.
  • SoapClient: function names will be indexed by the ASCII lowercase name, consistent with normal Zend methods.
  • get_meta_tags(): The manual states that property names are converted to lower case -- this will become ASCII lower case.
  • http stream wrapper: HTTP headers will be matched by the ASCII lower case name.
  • phpinfo(): Anchor names contain the lower-case version of the extension name. This will become ASCII lower case.
  • xml_parser_set_option(): XML_OPTION_CASE_FOLDING will become ASCII case folding.
  • Stream protocol names will be matched by ASCII case insensitivity.
  • PHP manual docref URLs will be constructed by ASCII case conversion of the class and function.
  • rfc1867.c: When processing the POST request body, “boundary” will be matched by ASCII case insensitivity, although I note that case-insensitive matching is apparently not supported by the spec.
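
To make the strip_tags() item concrete, the following sketch shows the result in the “C” locale, which is the result this change would guarantee in every locale. It assumes PHP 7.4 or later, where $allowed_tags may be given as an array of tag names; the HTML fragment is made up for the example.

```php
<?php
// The allowed tag is given in upper case; the input uses lower case.
$html   = '<div>kept</div><b>stripped</b>';
$result = strip_tags($html, ['DIV']);

// In the "C" locale, "DIV" is folded to "div", so the <div> element
// survives and only the <b> tags are removed.
var_dump($result); // string(23) "<div>kept</div>stripped"
```

With a Turkish locale set through setlocale(), the current locale-sensitive code could fold “I” to a dotless “ı”, so the folded allowed tag would no longer match “div”.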

The consequences of the changes to zend_operators.c are:

  • unregister_tick_function(): Named tick functions will be identified by ASCII case folding.

New functions

I am proposing that locale-sensitive case conversion be provided by functions called ctype_tolower() and ctype_toupper(). Effectively, strtolower() will be renamed to ctype_tolower() and strtoupper() will be renamed to ctype_toupper(). My reasons are:

  • tolower() and toupper() are in ctype.h, so it fits with ctype's theme of providing access to ctype.h functions.
  • The limitations of the implementation are shared by the other ctype functions and so are less likely to be surprising.
  • The result is consistent with ctype_islower() and ctype_isupper().
  • It's easy to do, and maybe someone will want them.

Some statements in the manual about what the ctype extension is for will have to be updated.

For completeness, I have added a family of upper case functions to zend_operators.c, by analogy with the existing lower case functions. Most of the new functions are currently not called.

Alternatives considered

I considered having a global mode, with backwards-compatible behavior by default. Applications would opt in to locale-insensitive processing, for example with str_use_ascii_case(true). However:

  • Most developers are unaware of the bugs they are introducing by using locale sensitivity. An opt-in feature would delay the implicit rectification of these bugs.
  • My impression is that applications using locale-sensitive case conversion on purpose are rare to nonexistent. I would like to hear from anyone who is actually doing this.
  • A global mode is awkward for libraries and for large scale development in general.
  • A global mode would prevent constant propagation through the affected functions.
  • A global mode would add complexity to the code and documentation.
  • We already have a global mode in the form of setlocale().
  • In-tree extensions clearly benefit from an unconditional change to the internal API.

It is not possible for strtolower() to raise a deprecation warning depending on its input, because there is no way to tell whether a given case transformation was intended by the caller.

Future Scope

I didn't include strnatcasecmp() and natcasesort() in this RFC, because they also use isdigit() and isspace(), and because they are intended for natural language processing. They could be migrated in future.

There are about 50 direct callers of tolower() and toupper() which I haven't migrated. They are similar in flavor to the php_strtolower() callers.

Proposed Voting Choices

The introduction of ctype_tolower() and ctype_toupper() can be a separate vote, if they seem controversial during the discussion stage.

rfc/strtolower-ascii.1632376404.txt.gz · Last modified: 2021/09/23 05:53 by tstarling