rfc:remove_utf8_decode_and_utf8_encode

PHP RFC: Deprecate and Remove utf8_encode and utf8_decode

Introduction

The built-in functions utf8_encode and utf8_decode convert strings encoded in ISO-8859-1 (“Latin 1”) to and from UTF-8, respectively. While this is sometimes a useful feature, they are commonly misunderstood, for three reasons:

  • Their names suggest a more general use than their implementation actually allows.
  • The Latin 1 encoding is commonly confused with other encodings, particularly Windows Code Page 1252.
  • The lack of error messages means that incorrect use is not easy to spot.

This RFC takes the view that their inclusion under the current name does more harm than good, and that removing them will encourage users to find more appropriate functions for their use cases.

Proposal

  • In PHP 8.2, all uses of utf8_encode and utf8_decode will raise a standard E_DEPRECATED diagnostic (“Function utf8_encode() is deprecated” / “Function utf8_decode() is deprecated”).
  • In PHP 9.0, the utf8_encode and utf8_decode functions will be removed from PHP.
  • Documentation and deprecation messages will encourage users to check that their usage is correct, and recommend mb_convert_encoding as the primary replacement, with UConverter::transcode and iconv also listed as possibilities.

Historical note

The functions were originally added as internal functions in the XML extension, and were exposed to userland in 2006.

They remained part of that extension (and thus technically optional) until Andrea Faulds moved them to ext/standard in PHP 7.2. At the same time, she reworded the documentation page which previously consisted mostly of a long explanation of UTF-8, and little explanation of the functions themselves.

Problems with the current functions

Poor naming

Character encoding issues are often poorly understood, and users will often look for a “quick fix” that just makes their UTF-8 “work properly”. The names “utf8_encode” and “utf8_decode” suggest functions that will do exactly that, and these functions are frequently used in functions called things like “fix_utf8” or “ensure_utf8”.

While the language can never protect users from all misunderstanding, it is unhelpful to include functions whose functionality could not be guessed without looking at the manual.

Confusion around Latin-1 encoding

ISO-8859-1, or Latin 1, is an 8-bit ASCII-compatible encoding standardised in 1985. It is notably the basis for the first 256 code points of Unicode.

There are two closely related encodings:

  • ISO-8859-15 (“Latin 9”) is an official replacement standard, replacing some printable characters with others that were deemed more useful. For instance, the “universal currency symbol” (¤, U+00A4) is replaced with the Euro symbol (€, U+20AC) at position 0xA4.
  • Windows Code Page 1252 is a proprietary encoding developed by Microsoft which adds additional printable characters in place of the rarely used “C1 control characters”. For instance, the Euro sign is placed at position 0x80.

All three encodings specify all 256 possible 8-bit values, so any sequence of bytes is a valid string in all three. However, the “C1 control characters” are effectively unused, so a string labelled “Latin 1” but containing values in the range 0x80 to 0x9F is often assumed to actually be in Windows Code Page 1252. The WHATWG HTML specification specifies that browsers should treat Latin 1 as a synonym for Windows 1252.

The PHP utf8_encode and utf8_decode functions do not handle the additional characters of Windows 1252, which may be surprising to users whose “Latin 1” text displays these characters in browsers.

Error handling

By nature and design, neither function raises any errors:

  • Both Latin 1 and UTF-8 are designed to be “ASCII compatible”, so bytes up to 0x7F are always left unchanged.
  • Any byte is valid in Latin 1 and has an unambiguous mapping to a Unicode code point, so utf8_encode has no error conditions.
  • The vast majority of Unicode code points do not have a mapping to Latin 1; utf8_decode handles these by substituting a '?' (0x3F)
  • Many byte sequences do not form a valid UTF-8 string; utf8_decode handles these by silently inserting a '?' (0x3F)

This lack of feedback to the user compounds the above problems, because incorrect uses of both functions can easily go unnoticed.

Usage

Note: this survey was carried out in April 2021; some of the specific examples may no longer be valid, but the general pattern is likely to remain.

A survey of the top 1000 packages by popularity on Packagist found 37 mentioning one or both of these functions. These uses can be roughly categorised as follows (some packages have uses in more than one category):

  • 4 libraries clearly using them correctly
  • 20 using them without clear understanding
  • 1 using on the output of strftime, which may be correct
  • 7 using utf8_decode to count codepoints in a UTF-8 string
  • 1 using them as “armour” (explained below)
  • 3 using them in context where they will do nothing
  • 2 providing polyfill implementations of the functions (patchwork/utf8 and symfony/polyfill-php72)
  • 4 providing stubs for static analysis

For character conversion

The correct use of these functions is to convert specifically between Latin 1 and UTF-8. This can be used as a fallback if other extensions are unavailable only if the source/target encoding is in fact Latin 1. Of the libraries analysed, only 4 clearly incorporate or document this condition.

The far more common case is to use utf8_encode for all non-UTF-8 inputs, implicitly assuming that anything other than UTF-8 is Latin 1. While this assumption may be valid in some cases, context often suggests it was simply not considered. Some clear misuses:

  • Use as a fallback from calling mb_convert_encoding with no source parameter, which is not equivalent because it uses the global “internal encoding” setting (e.g. phing/phing, sebastian/phpcpd)
  • Treating UTF-8 as the default encoding, but falling back to utf8_encode anyway, e.g. pdepend/pdepend

On output of strftime

The strftime function formats dates and times according to the currently selected locale. These locales are system-dependent, but many systems have European locales using Latin 1 encoding. If UTF-8 output is required, using utf8_encode(strftime(...)) will give the correct result for these locales.

This is used in nesbot/carbon and suggested in this Stack Overflow answer.

For counting code points

If $string is a valid UTF-8 string, strlen(utf8_decode($string)) can be used to count the number of code points it contains. This works because any unmappable code point is replaced with the single byte '?' in the output.

Although convenient, this is mostly used as a fallback for more specific functions, and a pure PHP implementation is also possible, as discussed below.

As "armour" for a binary value

Passing any string of bytes to utf8_encode produces a valid UTF-8 string; and the original bytes can be recovered using utf8_decode. This makes it possible to “armour” arbitrary binary data for transmission or storage as UTF-8 strings, similar to how Base64 or quoted printable encoding are used where ASCII is required.

It's likely that users discover this through trial-and-error, rather than understanding why it works. Examples include cache/adapter-common and two contributors to the php-internals list.

Doing nothing

Some of the clearest misuses occur when running either function on text which is guaranteed to be ASCII, so will be returned unchanged. For instance:

  • aws/aws-sdk-php calls utf8_encode on the output of sha1(), which formats its output in hexadecimal
  • ccampbell/chromephp and monolog/monolog call utf8_encode on the output of json_encode, whose default mode encodes all non-ASCII characters as \u.... escape strings.

Detecting UTF-8

An answer on Stack Overflow with 17 upvotes suggests this incredibly broken function:

function isUTF8($string) {
    return (utf8_encode(utf8_decode($string)) == $string);
}

This will return true for any ASCII string, and any UTF-8 string which contains only code points below U+00FF. For any other UTF-8 string, it will return false.

Throwing the kitchen sink at it

It is easy to find examples online of using utf8_encode and utf8_decode as part of a brute force attempt to fix problems that aren't understood. Here are a few found on Stack Overflow:

Alternatives to Removed Functionality

Removing these functions will break some code that is operating correctly. However, replacement is straight-forward in most cases.

For character conversion

PHP currently has three supported extensions which provide character encoding facilities, which can be used as approximate replacements:

  • ext/mbstring: $utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1'); and $latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
  • ext/intl: $utf8 = UConverter::transcode($latin1, 'UTF8', 'ISO-8859-1'); and $latin1 = UConverter::transcode($utf8, 'ISO-8859-1', 'UTF8');
  • ext/iconv: $utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1); and $latin1 = iconv('UTF-8', 'ISO-8859-1', $utf8);

These vary slightly in the options available, particularly around invalid and unmappable UTF-8 input. The 'to_subst' option to Uconverter::transcode allows the closest match to utf8_decode, e.g. $latin1 = UConverter::transcode($utf8, 'ISO-8859-1', 'UTF8', ['to_subst' => '?']);

Of these three extensions, ext/mbstring is probably the most commonly installed. It is already required by 65 of the 1000 most popular packages on Packagist, including phpunit/phpunit and laravel/framework; another 35 require symfony/polyfill-mbstring. It is also listed as a requirement for Drupal and phpBB, and recommended for WooCommerce.

By contrast, ext/iconv is required by only 6 (plus 4 via symfony/polyfill-iconv), and ext/intl by only 2. It also has an implementation entirely contained in the php-src git repository, whereas ext/intl and ext/iconv rely on external libraries, with ext/iconv particularly prone to platform-specific differences.

The possibility has been raised of making ext/mbstring non-optional in future, although that is out of scope for the current discussion.

An exact replacement is also straight-forward to implement in pure PHP, as long as performance is not critical. Examples are available in patchwork/utf8 and symfony/polyfill-php72. The same approach could be used for a standalone function for Windows-1252 or any other single-byte encoding.

In order to give users a clear message, it seems sensible to recommend mb_convert_encoding as the primary replacement for the removed functions.

For code point counting

Alternatives in bundled extensions:

  • $count = mb_strlen($string, 'UTF-8'); (from ext/mbstring)
  • $count = iconv_strlen($string, 'UTF-8'); (from ext/iconv)

Because of UTF-8's “self-synchronizing” design, code points can be counted without fully decoding the string, by counting bytes in the range 0x00 to 0x7F (ASCII) or 0xC2 to 0xF4 (leading bytes of a multi-byte sequence).

Examples of pure PHP implementations can be found in dompdf/dompdf, masterminds/html5, patchwork/utf8, and symfony/polyfill-iconv.

As "armour" for a binary value

If the exact functionality needs to be retained, any of the character conversion functions above will work fine.

If the requirement is just for safe transport of binary data, a more standard mechanism such as base64_encode / base64_decode should be preferred.

On output of strftime

The strftime function itself is now deprecated.

Where it is used, most systems now include variant locales which use UTF-8, so setlocale(LC_ALL, 'fr_FR.UTF8'); echo strftime("%A, %d %B %Y"); will have the same result as setlocale(LC_ALL, 'fr_FR'); echo utf8_encode(strftime("%A, %d %B %Y"));

Proposed PHP Version(s)

Deprecation in 8.2, removal in 9.0

RFC Impact

To Existing Extensions

The internal functions will be moved back to ext/xml, but no longer exposed as userland functions.

Unaffected PHP Functionality

The use of these functions internally within the ext/xml extension has not been examined, and will not be changed.

Patches and Tests

References

Rejected Features

Adding improved replacements

It would be possible to add new functions, under clearer names, with improved functionality; for instance:

  • Raise an error on invalid UTF-8
  • Handle Windows 1252 encoding rather than Latin 1

However, the functions would remain awkwardly narrow in their applicability; given there are several more general-purpose functions already officially bundled, it would seem arbitrary to include this specific feature today.

Changing name only

An alternative approach would be to introduce aliases, such as “latin1_to_utf8” and “utf8_to_latin1” without changing the existing functionality, then deprecate the old names.

This has the advantage of giving a single clear replacement for correct use of the existing functions. Again, if they did not already exist, it is unlikely we would add such narrow functions; users are better served by discovering existing general-purpose encoding functions.

Adding functionality to the existing functions

Conversely, we could add additional features to these functions without renaming them. For instance, by changing their signatures to utf8_encode(string $string, string $source_encoding = “ISO-8859-1”) and utf8_decode(string $string, string $destination_encoding = “ISO-8859-1”), respectively. This parameter could later be made mandatory, making the function's purpose clearer.

However, this would require either implementing a significant amount of extra code, or wrapping one of the existing functions, all of which are in optional extensions. It would also require several versions before the benefit could be realised: first, add the parameter; in a later version, raise a deprecation if the new parameter is not passed; finally, make the parameter mandatory. Users would then still need to check and update every use of the functions, which would be a similar effort to switching to a new function.

Vote

Should utf8_encode and utf8_decode be deprecated in 8.2 and removed in 9.0?
Real name Yes No
asgrim (asgrim)  
ashnazg (ashnazg)  
bmajdak (bmajdak)  
bwoebi (bwoebi)  
cmb (cmb)  
colinodell (colinodell)  
crell (crell)  
derick (derick)  
dharman (dharman)  
galvao (galvao)  
geekcom (geekcom)  
girgias (girgias)  
heiglandreas (heiglandreas)  
imsop (imsop)  
kocsismate (kocsismate)  
marandall (marandall)  
mbeccati (mbeccati)  
mcmic (mcmic)  
nikic (nikic)  
nohn (nohn)  
ocramius (ocramius)  
pajoye (pajoye)  
petk (petk)  
reywob (reywob)  
santiagolizardo (santiagolizardo)  
sergey (sergey)  
stas (stas)  
svpernova09 (svpernova09)  
thekid (thekid)  
theodorejb (theodorejb)  
theseer (theseer)  
timwolla (timwolla)  
trowski (trowski)  
twosee (twosee)  
villfa (villfa)  
Final result: 33 2
This poll has been closed.

Voting started 2022-04-05 18:40 UTC, and will run for two weeks, closing 2022-04-19 18:40 UTC

Changelog

  • v1.0 (2022-02-20) Initial version sent for discussion
  • v1.1 (2022-03-04) Made a stronger recommendation of mb_convert_encoding as a replacement (see “Alternatives to Removed Functionality”)
  • v1.2 (2022-04-04) Dropped proposed custom wording for deprecation message. The nuances can be better expressed in the manual.
rfc/remove_utf8_decode_and_utf8_encode.txt · Last modified: 2022/06/21 09:42 by cmb