rfc:strtolower-ascii

Differences

This shows you the differences between two versions of the page.

rfc:strtolower-ascii [2021/09/22 12:45] – add implementation link tstarling
rfc:strtolower-ascii [2021/12/10 21:52] (current) – status: accepted tstarling
Line 1: Line 1:
 ====== PHP RFC: Locale-independent case conversion ======
-  * Version: 0.9
+  * Version: 1.2
   * Date: 2021-09-22
   * Author: Tim Starling <tstarling@wikimedia.org>
-  * Status: Draft
+  * Status: Accepted
   * Target version: PHP 8.2
   * Implementation: https://github.com/php/php-src/pull/7506
Line 15: Line 15:
 Prior to PHP 8.0, PHP's locale was set from the environment. When a user installs Linux, it asks what language you want it to be in. The user might not fully appreciate the consequences of this decision. It not only sets the user interface language for built-in commands, it also pervasively changes how string handling in the C library works. For example, a user selecting "Turkish" when installing Linux would find that applications calling toupper('i') would obtain the dotted capital I (U+0130, "İ").

-In an era of network connectivity and standardized text-based protocols, natural language is a minority application for case conversion. But even if the user did want natural language case conversion, they would be unlikely to achieve success with strtolower(). This is because it processes the string one byte at a time, feeding each byte to the C library's tolower(). If the input is UTF-8, by far the most popular modern choice, strtolower() will mangle the string, typically producing invalid UTF-8 as output.
+In an era of standardized text-based protocols, natural language is a minority application for case conversion. But even if the user did want natural language case conversion, they would be unlikely to achieve success with strtolower(). This is because it processes the string one byte at a time, feeding each byte to the C library's tolower(). If the input is UTF-8, by far the most popular modern choice, strtolower() will mangle the string, typically producing invalid UTF-8 as output.

 PHP 8.0 stopped respecting the locale environment variables. So the locale is always "C" unless the user explicitly calls setlocale(). This means that the bulk of the backwards-incompatible change is already behind us. Any applications depending on the system locale to do case conversion of legacy 8-bit character sets would have been broken by PHP 8.0.
Line 37: Line 37:
 ASCII case conversion is 8-bit clean. Byte values greater than or equal to 128 are not modified, so if a string is encoded as UTF-8 or with an ISO 8859 character set, non-ASCII character values are preserved.
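
The 8-bit-clean behaviour described above can be sketched in C roughly as follows. The helper name ascii_strtolower() is hypothetical and this is not the php-src implementation; it only illustrates that bytes outside A-Z pass through untouched, so multi-byte UTF-8 sequences survive.

```c
#include <stddef.h>

/* Hypothetical helper, not the php-src implementation: lowercase ASCII
 * letters in place and leave every other byte, including all byte
 * values >= 128, unchanged. */
static void ascii_strtolower(char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (c >= 'A' && c <= 'Z') {
            s[i] = (char)(c + ('a' - 'A'));
        }
    }
}
```

Because only the range A-Z is touched, the result does not depend on setlocale() at all.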
  
-===== Proposal =====
+Case folding is the conversion of input text to some standard case for the purposes of case-insensitive comparison.

-==== Main changes ====
+===== Proposal =====

 The following PHP string functions will do ASCII case conversion:
Line 53: Line 53:
   * str_ireplace
  
-Note that strcasecmp(), strncasecmp() and substr_compare() with $case_insensitive = true were already using ASCII case conversion.
+Also:

-==== Internal API changes ====
+  * In arsort(), asort(), krsort(), ksort(), rsort(): SORT_FLAG_CASE will mean sorting by ASCII case folding.
+  * array_change_key_case() will do ASCII case folding.
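
As an illustration of what sorting by ASCII case folding means, here is a minimal sketch in C. The names ascii_fold(), ascii_casecmp() and ascii_casecmp_qsort() are hypothetical and are not taken from php-src; the sketch assumes plain byte-wise comparison after folding A-Z to a-z.

```c
#include <stdlib.h>

/* Hypothetical helper: fold one byte value to ASCII lower case. */
static int ascii_fold(int c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Hypothetical comparator: orders NUL-terminated strings by their
 * ASCII-folded bytes. Returns <0, 0 or >0 like strcmp(). */
static int ascii_casecmp(const char *a, const char *b)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;
    while (*pa && ascii_fold(*pa) == ascii_fold(*pb)) {
        pa++;
        pb++;
    }
    return ascii_fold(*pa) - ascii_fold(*pb);
}

/* Adapter so the comparator can be passed to qsort() on an array of
 * char pointers. */
static int ascii_casecmp_qsort(const void *x, const void *y)
{
    return ascii_casecmp(*(const char *const *)x, *(const char *const *)y);
}
```

Non-ASCII bytes are compared by value without any folding, so the ordering is the same whatever locale is active.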
  
-php_strtolower() and php_strtoupper() are the internal C API equivalent of strtoupper() and strtolower(). After reviewing the callers of these functions in the core tree, I decided that they should also be part of this change. They will henceforth do ASCII case conversion.
+Note that strcasecmp(), strncasecmp() and substr_compare() with $case_insensitive = true were already using ASCII case conversion.

-The flow-on effects of this change are a microcosm of the damaging and inappropriate uses locale-sensitive case conversion has been put to:
+ASCII case conversion is identical to case conversion with the "C" locale. So these changes have no effect unless setlocale() was called.
-
-  * strip_tags(): tags will be matched against $allowed_tags by ASCII case-insensitive search. For example, if $allowed_tags is ['DIV'], and the locale is Turkish, this change means "div" will be allowed.
-  * grapheme_stripos() and grapheme_strripos() have a locale-sensitive "fast" path when the input is ASCII.
-  * ldap_get_entries(): The documentation states "The attribute index is converted to lowercase". This will now be ASCII lowercase.
-  * mb_send_mail(): Headers are gathered and indexed with case folding. Our change fixes a %%FIXME%% comment in the code wishing for locale-insensitive case conversion.
-  * oci_pconnect(): Case folding of parameters when looking for an existing connection will become locale-independent.
-  * PDO DBLIB: ASCII will be used when stringifying UNIQUE column values and converting them to uppercase.
-  * SoapClient: function names will be indexed by the ASCII lowercase name, consistent with normal Zend methods.
-  * get_meta_tags(): The manual states that property names are converted to lower case -- this becomes ASCII lower case.
-  * http stream wrapper: HTTP headers will be matched by the ASCII lower case name.
-  * phpinfo(): Anchor names contain the lower-case version of the extension name. This will now be ASCII lower case.
-  * xml_parser_set_option(): XML_OPTION_CASE_FOLDING will become ASCII case folding.
-  * Stream protocol names will now be matched by ASCII case insensitivity.
-  * PHP manual docref URLs will be constructed by ASCII case conversion of the class and function.
-  * rfc1867.c: When processing the POST request body, "boundary" will be matched by ASCII case insensitivity. Although I note that case insensitive matching is apparently not supported by the spec.
-
-For completeness, I have introduced a family of upper case functions to zend_operators.c by analogy with the lower case functions, most of which are currently not called.
-
-==== New functions ====
-
-I am proposing that locale-sensitive case conversion be provided by functions called ctype_tolower() and ctype_toupper(). Effectively, strtolower() will be renamed to ctype_tolower() and strtoupper() will be renamed to ctype_toupper(). My reasons are:
-
-  * tolower() and toupper() are in ctype.h, so it fits with ctype's theme of providing access to ctype.h functions.
-  * The limitations of the implementation are shared by the other ctype functions and so are less likely to be surprising.
-  * The result is consistent with ctype_islower() and ctype_isupper().
-  * It's easy to do, and maybe someone will want them.
-
-Some statements in the manual about what the ctype extension is for will have to be updated.
-
-===== Backward Incompatible Changes =====
-
-In summary, applications calling setlocale() may see changes in how case conversion is done. In the vast majority of cases, the changes will be beneficial. If any user depends on locale-sensitive 8-bit case conversion, they will have to migrate to ctype_tolower() and ctype_toupper().
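
The new revision's statement that ASCII case conversion is identical to case conversion with the "C" locale can be checked with a small C sketch such as the one below. The names ascii_tolower(), ascii_toupper() and count_c_locale_mismatches() are hypothetical; the sketch assumes a C library whose "C" locale follows the ISO C rules for tolower() and toupper().

```c
#include <ctype.h>
#include <locale.h>

/* Hypothetical helpers: locale-independent ASCII case mapping. */
static int ascii_tolower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

static int ascii_toupper(int c)
{
    return (c >= 'a' && c <= 'z') ? c - ('a' - 'A') : c;
}

/* Counts the byte values 0..255 for which the C library's tolower()
 * or toupper(), under the "C" locale, disagree with the ASCII
 * mappings above. */
static int count_c_locale_mismatches(void)
{
    int mismatches = 0;
    setlocale(LC_CTYPE, "C");
    for (int c = 0; c < 256; c++) {
        if (tolower(c) != ascii_tolower(c)) {
            mismatches++;
        }
        if (toupper(c) != ascii_toupper(c)) {
            mismatches++;
        }
    }
    return mismatches;
}
```

A non-zero count would only be expected after switching LC_CTYPE to a locale with an 8-bit character set, which is the case the RFC changes.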
  
 ===== Alternatives considered =====
Line 106: Line 75:
  
 It is not possible for strtolower() to raise a deprecation warning depending on its input, because there is no way to tell whether a given case transformation was intended by the caller.
+
+I considered introducing ctype_tolower() and ctype_toupper(), which would do locale-sensitive case conversion like the old strtolower() and strtoupper(), but Nikita suggested that we may want to make the ctype extension generally be locale-independent, which would make these functions redundant.

 ===== Future Scope =====

-This RFC is part of a program of reducing locale dependence in PHP.
+I didn't include strnatcasecmp() and natcasesort() in this RFC, because they also use isdigit() and isspace(). They could be migrated in future.
+
+There are about 50 direct callers of tolower() and toupper() which I haven't migrated.

-===== Proposed Voting Choices =====
+===== Voting =====

-I would consider making the introduction of ctype_tolower() and ctype_toupper() be optional. But if that seems uncontroversial during the discussion phase, we can just have a yes/no vote.
+Voting period: 2021-11-25 to 2021-12-09.

+<doodle title="Use locale-independent case conversion for string functions as proposed?" auth="tstarling" voteType="single" closed="true">
+   * Yes
+   * No
+</doodle>
  
rfc/strtolower-ascii.1632314751.txt.gz · Last modified: 2021/09/22 12:45 by tstarling