rfc:strtolower-ascii

Differences

This shows you the differences between two versions of the page.

rfc:strtolower-ascii [2021/09/23 01:03] – More changes for consistency tstarling
rfc:strtolower-ascii [2021/12/10 21:52] (current) – status: accepted tstarling
Line 1: Line 1:
 ====== PHP RFC: Locale-independent case conversion ====== ====== PHP RFC: Locale-independent case conversion ======
-  * Version: 0.9+  * Version: 1.2
   * Date: 2021-09-22   * Date: 2021-09-22
   * Author: Tim Starling <tstarling@wikimedia.org>   * Author: Tim Starling <tstarling@wikimedia.org>
-  * Status: Draft+  * Status: Accepted
   * Target version: PHP 8.2   * Target version: PHP 8.2
   * Implementation: https://github.com/php/php-src/pull/7506   * Implementation: https://github.com/php/php-src/pull/7506
Line 40: Line 40:
  
 ===== Proposal ===== ===== Proposal =====
- 
-==== Main changes ==== 
  
 The following PHP string functions will do ASCII case conversion: The following PHP string functions will do ASCII case conversion:
Line 58: Line 56:
  
   * In arsort(), asort(), krsort(), ksort(), rsort(): SORT_FLAG_CASE will mean sorting by ASCII case folding.   * In arsort(), asort(), krsort(), ksort(), rsort(): SORT_FLAG_CASE will mean sorting by ASCII case folding.
-  * array_change_key_case will do ASCII case folding.+  * array_change_key_case() will do ASCII case folding.
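
As a rough illustration of what "sorting by ASCII case folding" means at the C level, the following sketch folds only the bytes A-Z before comparing, so the result does not depend on setlocale(). The names ascii_fold, ascii_casecmp and cmp_entries are hypothetical helpers written for this example; they are not taken from php-src.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: fold one byte to lower case, touching only A-Z. */
static int ascii_fold(unsigned char c) {
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Hypothetical helper: compare two C strings after ASCII case folding.
 * Bytes outside A-Z (for example 0xC4 and 0xE4) are compared unchanged. */
static int ascii_casecmp(const char *a, const char *b) {
    assert(a != NULL && b != NULL);
    for (;; a++, b++) {
        int ca = ascii_fold((unsigned char)*a);
        int cb = ascii_fold((unsigned char)*b);
        if (ca != cb) {
            return ca < cb ? -1 : 1;
        }
        if (ca == 0) {
            return 0; /* both strings ended at the same position */
        }
    }
}

/* qsort() adapter for an array of `const char *` entries. */
static int cmp_entries(const void *x, const void *y) {
    return ascii_casecmp(*(const char *const *)x, *(const char *const *)y);
}
```

Passing cmp_entries to qsort() orders "banana", "Cherry", "apple" as "apple", "banana", "Cherry" in any locale, because no call in the comparison consults the locale.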
  
 Note that strcasecmp(), strncasecmp() and substr_compare() with $case_insensitive = true were already using ASCII case conversion. Note that strcasecmp(), strncasecmp() and substr_compare() with $case_insensitive = true were already using ASCII case conversion.
  
-php_strtolower() and php_strtoupper() are the internal C API equivalent of strtoupper() and strtolower(). After reviewing the callers of these functions in the core tree, I decided that they should also be part of this change. They will henceforth do ASCII case conversion.+ASCII case conversion is identical to case conversion with the "C" locale. So these changes have no effect unless setlocale() was called.
- +
-For consistency, I also made the case comparison functions in zend_operators.c do ASCII case conversion, specifically string_compare_function_ex, string_case_compare_function, zend_binary_zval_strcasecmp and zend_binary_zval_strncasecmp. +
- +
-==== Consequent changes ==== +
- +
-The flow-on effects of the change to the behavior of php_strtolower() and php_strtoupper() are a microcosm of the damaging and inappropriate uses locale-sensitive case conversion has been put to: +
- +
-  * strip_tags(): tags will be matched against $allowed_tags by ASCII case-insensitive search. For example, currently, if $allowed_tags is ['DIV'], and the locale is Turkish, %%<div>%% would be stripped. With this change, %%<div>%% will be allowed. +
-  * grapheme_stripos() and grapheme_strripos() currently have a locale-sensitive "fast" path when the input is ASCII. This will become locale-independent. +
-  * ldap_get_entries(): The documentation states "The attribute index is converted to lowercase". This will become ASCII lower case. +
-  * mb_send_mail(): Headers are gathered and indexed with case folding. This change will fix a %%FIXME%% comment in the code by using ASCII case conversion for header name comparisons. +
-  * oci_pconnect(): Case folding of parameters when looking for an existing connection will become locale-independent. +
-  * PDO DBLIB: ASCII will be used when stringifying UNIQUE column values and converting them to uppercase. +
-  * SoapClient: function names will be indexed by the ASCII lowercase name, consistent with normal Zend methods. +
-  * get_meta_tags(): The manual states that property names are converted to lower case -- this will become ASCII lower case. +
-  * http stream wrapper: HTTP headers will be matched by the ASCII lower case name. +
-  * phpinfo(): Anchor names contain the lower-case version of the extension name. This will become ASCII lower case. +
-  * xml_parser_set_option(): XML_OPTION_CASE_FOLDING will become ASCII case folding. +
-  * Stream protocol names will be matched by ASCII case insensitivity. +
-  * PHP manual docref URLs will be constructed by ASCII case conversion of the class and function. +
-  * rfc1867.c: When processing the POST request body, "boundary" will be matched by ASCII case insensitivity. Although I note that case insensitive matching is apparently not supported by the spec. +
- +
-The consequences of the changes to zend_operators.c are: +
- +
-  * unregister_tick_function(): Named tick functions will be identified by ASCII case folding. +
- +
-==== New functions ==== +
- +
-I am proposing that locale-sensitive case conversion be provided by functions called ctype_tolower() and ctype_toupper(). Effectively, strtolower() will be renamed to ctype_tolower() and strtoupper() will be renamed to ctype_toupper(). My reasons are: +
- +
-  * tolower() and toupper() are in ctype.h, so it fits with ctype's theme of providing access to ctype.h functions. +
-  * The limitations of the implementation are shared by the other ctype functions and so are less likely to be surprising. +
-  * The result is consistent with ctype_islower() and ctype_isupper(). +
-  * It's easy to do, and maybe someone will want them. +
- +
-Some statements in the manual about what the ctype extension is for will have to be updated. +
- +
-For completeness, I have introduced a family of upper case functions to zend_operators.c by analogy with the lower case functions, most of which are currently not called.+
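
The statement in the newer revision that ASCII case conversion is identical to case conversion with the "C" locale can be checked with a small C sketch. It assumes a hypothetical ascii_tolower() helper (not the php-src function) and compares it against the standard tolower() for every byte value after selecting the "C" locale.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <locale.h>

/* Hypothetical helper: lower-case a single byte, touching only A-Z. */
static unsigned char ascii_tolower(unsigned char c) {
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Aborts via assert() if tolower() and ascii_tolower() disagree for any
 * byte value while the "C" locale is selected for LC_CTYPE. */
static void check_c_locale_equivalence(void) {
    setlocale(LC_CTYPE, "C");
    for (int c = 0; c <= UCHAR_MAX; c++) {
        assert(tolower(c) == ascii_tolower((unsigned char)c));
    }
}
```

Under another locale the loop may fail for bytes above 0x7F, or for letters such as "I" in a Turkish locale, which is the locale dependence the RFC removes from the string functions.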
  
 ===== Alternatives considered ===== ===== Alternatives considered =====
Line 115: Line 75:
  
 It is not possible for strtolower() to raise a deprecation warning depending on its input, because there is no way to tell whether a given case transformation was intended by the caller. It is not possible for strtolower() to raise a deprecation warning depending on its input, because there is no way to tell whether a given case transformation was intended by the caller.
 +
 +I considered introducing ctype_tolower() and ctype_toupper(), which would do locale-sensitive case conversion like the old strtolower() and strtoupper(), but Nikita suggested that we may want to make the ctype extension generally be locale-independent, which would make these functions redundant.
  
 ===== Future Scope ===== ===== Future Scope =====
  
-I didn't include strnatcasecmp() and natcasesort() in this RFC, because they also use isdigit() and isspace(), and because they are intended for natural language processing. They could be migrated in future.+I didn't include strnatcasecmp() and natcasesort() in this RFC, because they also use isdigit() and isspace(). They could be migrated in future.
  
-There are about 50 direct callers of tolower() and toupper() which I haven't migrated. They are similar in flavor to the php_strtolower() callers.+There are about 50 direct callers of tolower() and toupper() which I haven't migrated.
  
-===== Proposed Voting Choices =====+===== Voting =====
  
-I would consider making the introduction of ctype_tolower() and ctype_toupper() be optional. But if that seems uncontroversial during the discussion phase, we can just have a yes/no vote.+Voting period: 2021-11-25 to 2021-12-09.
  
 +<doodle title="Use locale-independent case conversion for string functions as proposed?" auth="tstarling" voteType="single" closed="true">
 +   * Yes
 +   * No
 +</doodle>
  
rfc/strtolower-ascii.1632359016.txt.gz · Last modified: 2021/09/23 01:03 by tstarling