rfc:strtolower-ascii

Differences

This shows you the differences between two versions of the page.

rfc:strtolower-ascii [2021/09/23 06:21] tstarling
rfc:strtolower-ascii [2021/12/10 21:52] (current) – status: accepted tstarling
Line 1: Line 1:
 ====== PHP RFC: Locale-independent case conversion ======
-  * Version: 1.0
+  * Version: 1.2
   * Date: 2021-09-22
   * Author: Tim Starling <tstarling@wikimedia.org>
-  * Status: Under Discussion
+  * Status: Accepted
   * Target version: PHP 8.2
   * Implementation: https://github.com/php/php-src/pull/7506
Line 40: Line 40:
  
 ===== Proposal =====
- 
-==== Main changes ==== 
  
 The following PHP string functions will do ASCII case conversion:
Line 61: Line 59:
  
 Note that strcasecmp(), strncasecmp() and substr_compare() with $case_insensitive = true were already using ASCII case conversion.
- 
-php_strtolower() and php_strtoupper() are the internal C API equivalents of strtolower() and strtoupper(). After reviewing the callers of these functions in the core tree, I decided that they should also be part of this change. They will henceforth do ASCII case conversion.
- 
-For consistency, I also made the case comparison functions in zend_operators.c do ASCII case conversion, specifically string_compare_function_ex, string_case_compare_function, zend_binary_zval_strcasecmp and zend_binary_zval_strncasecmp. 
  
 ASCII case conversion is identical to case conversion with the "C" locale. So these changes have no effect unless setlocale() was called.
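The equivalence with the "C" locale can be illustrated with a minimal C sketch. The helper names below (ascii_tolower_byte, ascii_strtolower, ascii_matches_c_locale) are hypothetical and are not taken from the php-src implementation. The sketch assumes an ASCII-compatible execution character set. It lowercases only the bytes A-Z and then checks that tolower() under the "C" locale agrees for every byte value.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not the php-src code): lowercase only the bytes
 * 'A'..'Z'. Every other byte, including values >= 0x80, is returned as-is.
 * Assumes an ASCII-compatible character set. */
static unsigned char ascii_tolower_byte(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Lowercase a byte buffer in place without consulting the current locale. */
static void ascii_strtolower(char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        s[i] = (char)ascii_tolower_byte((unsigned char)s[i]);
    }
}

/* Return 1 if tolower() under the "C" locale gives the same result as the
 * ASCII helper for every unsigned char value, 0 otherwise.
 * Note: this switches LC_CTYPE to "C" and does not restore it. */
static int ascii_matches_c_locale(void)
{
    setlocale(LC_CTYPE, "C");
    for (int c = 0; c < 256; c++) {
        if (tolower(c) != (int)ascii_tolower_byte((unsigned char)c)) {
            return 0;
        }
    }
    return 1;
}
```

Because ascii_strtolower() never calls tolower(), its result does not depend on any earlier setlocale() call. That locale independence is the property the proposal gives the listed string functions.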
- 
-==== Consequent changes ==== 
- 
-The flow-on effects of the change to the behavior of php_strtolower() and php_strtoupper() are a microcosm of the damaging and inappropriate uses locale-sensitive case conversion has been put to: 
- 
-  * strip_tags(): tags will be matched against $allowed_tags by ASCII case-insensitive search. For example, currently, if $allowed_tags is '<div>', and the locale is Turkish, %%<DIV>%% would be stripped. With this change, %%<DIV>%% will be allowed. 
-  * grapheme_stripos() and grapheme_strripos() currently have a locale-sensitive "fast" path when the input is ASCII. This will become locale-independent. 
-  * ldap_get_entries(): The documentation states "The attribute index is converted to lowercase". This will become ASCII lower case. 
-  * mb_send_mail(): Headers are gathered and indexed with case folding. This change will fix a %%FIXME%% comment in the code by using ASCII case conversion for header name comparisons. 
-  * oci_pconnect(): Case folding of parameters when looking for an existing connection will become locale-independent. 
-  * PDO DBLIB: ASCII will be used when stringifying UNIQUE column values and converting them to uppercase. 
-  * SoapClient: function names will be indexed by the ASCII lowercase name, consistent with normal Zend methods. 
-  * get_meta_tags(): The manual states that property names are converted to lower case -- this will become ASCII lower case. 
-  * http stream wrapper: HTTP headers will be matched by the ASCII lower case name. 
-  * phpinfo(): Anchor names contain the lower-case version of the extension name. This will become ASCII lower case. 
-  * xml_parser_set_option(): XML_OPTION_CASE_FOLDING will become ASCII case folding. 
-  * Stream protocol names will be matched by ASCII case insensitivity. 
-  * PHP manual docref URLs will be constructed by ASCII case conversion of the class and function. 
-  * rfc1867.c: When processing the POST request body, "boundary" will be matched by ASCII case insensitivity, although I note that case-insensitive matching is apparently not supported by the spec.
- 
-The consequences of the changes to zend_operators.c are: 
- 
-  * unregister_tick_function(): Named tick functions will be identified by ASCII case folding. 
- 
-==== New functions ==== 
- 
-I am proposing that locale-sensitive case conversion be provided by functions called ctype_tolower() and ctype_toupper(). Effectively, strtolower() will be renamed to ctype_tolower() and strtoupper() will be renamed to ctype_toupper(). My reasons are: 
- 
-  * tolower() and toupper() are in ctype.h, so it fits with ctype's theme of providing access to ctype.h functions. 
-  * The limitations of the implementation are shared by the other ctype functions and so are less likely to be surprising. 
-  * The result is consistent with ctype_islower() and ctype_isupper(). 
-  * It's easy to do, and maybe someone will want them. 
- 
-Some statements in the manual about what the ctype extension is for will have to be updated. 
- 
-For completeness, I have introduced a family of upper case functions to zend_operators.c by analogy with the lower case functions, most of which are currently not called. 
  
 ===== Alternatives considered =====
Line 117: Line 75:
  
 It is not possible for strtolower() to raise a deprecation warning depending on its input, because there is no way to tell whether a given case transformation was intended by the caller.
 +
 +I considered introducing ctype_tolower() and ctype_toupper(), which would do locale-sensitive case conversion like the old strtolower() and strtoupper(), but Nikita suggested that we may want to make the ctype extension generally be locale-independent, which would make these functions redundant.
  
 ===== Future Scope =====
  
-I didn't include strnatcasecmp() and natcasesort() in this RFC, because they also use isdigit() and isspace(), and because they are intended for natural language processing. They could be migrated in future.
+I didn't include strnatcasecmp() and natcasesort() in this RFC, because they also use isdigit() and isspace(). They could be migrated in future.
  
-There are about 50 direct callers of tolower() and toupper() which I haven't migrated. They are similar in flavor to the php_strtolower() callers.
+There are about 50 direct callers of tolower() and toupper() which I haven't migrated.
  
-===== Proposed Voting Choices =====
+===== Voting =====
  
-The introduction of ctype_tolower() and ctype_toupper() can be a separate vote, if they seem controversial during the discussion stage.
+Voting period: 2021-11-25 to 2021-12-09.
  
 +<doodle title="Use locale-independent case conversion for string functions as proposed?" auth="tstarling" voteType="single" closed="true">
 +   * Yes
 +   * No
 +</doodle>
  
rfc/strtolower-ascii.1632378090.txt.gz · Last modified: 2021/09/23 06:21 by tstarling