The html_entity_decode() function is a convenient way to replace HTML character references in a string with their replacement values.
Unfortunately, there are several issues with this function:
ENT_HTML5
it's still wrong.

Regardless, this is still a useful utility serving a very common need. PHP would benefit from a function that decodes HTML text properly with safe defaults, and in a way that educates callers on what they are doing instead of confusing them.
    $src = decode_html( HtmlContext::Attribute, $src_attribute );
    if ( str_starts_with( $src, 'javascript:' ) ) {
        return null;
    }
HTML defines 2,231 named character references, such as &hellip;. Of these, html_entity_decode() is aware of 1,511. It is entirely unaware of 614 of them, while the remaining 106 are missed because some named character references are not required to end in a semicolon.
The list of named character references is published alongside the HTML living specification as entities.json. The specification states:
This list is static and will not be expanded or changed in the future.
For the sake of reliability and security, an HTML text decoder should decode only the specified character references, and all of them. This involves decoding the 106 names that don't require a trailing semicolon.
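The semicolon gap can be observed directly with the existing function. The following is a minimal sketch; `legacy_decode()` is a hypothetical wrapper introduced only for this illustration, and the behavior shown is what current PHP 8 releases are understood to do.

```php
<?php
/**
 * Hypothetical wrapper around the existing decoder, using the HTML5 flag set.
 */
function legacy_decode( string $text ): string {
    return html_entity_decode( $text, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'UTF-8' );
}

// A semicolon-terminated reference is decoded to U+00AC (NOT SIGN).
var_dump( "\u{AC}" === legacy_decode( '&not;' ) );

// Without the semicolon the text is left untouched, although an HTML
// parser would decode "&not" when it appears in a text node.
var_dump( '&not' === legacy_decode( '&not' ) );
```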
The 106 named character references that don't require a trailing semicolon are governed by special rules. In an attribute value they are decoded only when followed by an unambiguous character, meaning one that is not matched by the PCRE pattern [A-Za-z0-9=], while in a text node they may appear anywhere. This legacy behavior was added to the specification to account for misuse in URL attribute values where the ampersand is both a character reference introducer and a query string argument separator.
    $html = '<a href="/?search=creatures&not=ogre">Search for creatures &not=ogre</a>';
    echo parser_get_href( $html ); // /?search=creatures&not=ogre
    echo parser_get_text( $html ); // Search for creatures ¬=ogre
The end of an attribute's value is not an ambiguous follower, meaning that these references are also decoded when missing their semicolon and found at the end of an attribute.
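The follower check for attribute values can be sketched as below. `should_decode_legacy_reference()` is a hypothetical helper written for this illustration, not part of the proposal; it assumes the caller has already matched one of the semicolon-less names and knows the byte offset just past it.

```php
<?php
/**
 * Hypothetical helper: decides whether a semicolon-less named character
 * reference inside an attribute value should be decoded.
 *
 * @param string $value Raw (still encoded) attribute value.
 * @param int    $end   Byte offset just past the matched reference name.
 */
function should_decode_legacy_reference( string $value, int $end ): bool {
    // The end of the attribute value is not an ambiguous follower.
    if ( $end >= strlen( $value ) ) {
        return true;
    }

    // ASCII alphanumerics and "=" are ambiguous followers: leave the text as-is.
    return 1 !== preg_match( '/^[A-Za-z0-9=]/', $value[ $end ] );
}

$value = '/?search=creatures&not=ogre';
$end   = strpos( $value, '&not' ) + strlen( '&not' );
var_dump( should_decode_legacy_reference( $value, $end ) ); // bool(false)
```

In a text node no such check applies, which is why the same `&not=ogre` sequence decodes there.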
HTML specifies not only normative behavior but also how to handle failure modes (for example, when a numeric character reference refers to an invalid Unicode code point). It does not allow parsers to handle these errors differently from one another. Notably, most invalid numeric character references decode to the replacement character U+FFFD �, not to plaintext.
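As a rough illustration of that failure handling, the sketch below maps a parsed numeric value to the code point that should be emitted. The function name is hypothetical, and the C1 remapping step is deliberately left out of this sketch.

```php
<?php
/**
 * Hypothetical helper: resolves the numeric value of a character reference
 * to the code point an HTML parser would emit for the invalid cases.
 */
function resolve_numeric_code_point( int $code_point ): int {
    $replacement = 0xFFFD;

    // NULL is never emitted from a numeric character reference.
    if ( 0 === $code_point ) {
        return $replacement;
    }

    // Values beyond the Unicode range are replaced.
    if ( $code_point > 0x10FFFF ) {
        return $replacement;
    }

    // Surrogate halves are not valid scalar values and are replaced.
    if ( $code_point >= 0xD800 && $code_point <= 0xDFFF ) {
        return $replacement;
    }

    return $code_point;
}

printf( "U+%04X\n", resolve_numeric_code_point( 0xD800 ) ); // U+FFFD
```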
Windows-1252 differs from ISO-8859-1 (latin1) in that it remaps 27 of the 32 C1 control characters. Since Windows computers used this encoding as the default and these C1 control bytes appeared around the internet, HTML interprets them as if they were Windows-1252 even if the declared encoding is ISO-8859-1.
Furthermore, whenever a numeric character reference refers to one of the code points in this range, it ought to be remapped as if it were Windows-1252, even if the document encoding is UTF-8. Because html_entity_decode() historically hasn't handled these, mojibake for smart quotes, typographic dashes, some European characters, and more has proliferated across the web.
| | Proper Decode for ISO-8859-1 | Proper Decode for UTF-8 | Actual Decode (ISO-8859-1 and UTF-8) |
|---|---|---|---|
| html_entity_decode( $text, $html5flags, $encoding ) | | | |
| “\x93&x93;” | ““ | �“ | �“ |
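The remapping of numeric character references in the C1 range can be sketched as a simple lookup. The table below follows the one in the HTML specification; the function name is hypothetical and exists only for this illustration.

```php
<?php
/**
 * Hypothetical helper: remaps a C1 control code point from a numeric
 * character reference to its Windows-1252 counterpart.
 */
function remap_c1_code_point( int $code_point ): int {
    // 27 of the 32 C1 controls have a Windows-1252 replacement.
    static $map = array(
        0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
        0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
        0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
        0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
        0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
        0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
        0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
    );

    // Unmapped values (0x81, 0x8D, 0x8F, 0x90, 0x9D) pass through unchanged.
    return $map[ $code_point ] ?? $code_point;
}

// &#x93; should yield a left double quotation mark.
printf( "U+%04X\n", remap_c1_code_point( 0x93 ) ); // U+201C
```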
There are a few minor rules related to where in an HTML document a span of text is found.

- NULL bytes are handled differently depending on the context; in some contexts they are replaced with U+FFFD �.
- Character references are not decoded inside elements whose content is raw text (IFRAME, NOEMBED, NOFRAMES, SCRIPT, STYLE, or XMP).
- A newline immediately following a LISTING, PRE, or TEXTAREA opening tag is ignored to allow for friendlier HTML syntax formatting. This includes newlines encoded as character references, such as &#x0A;.

Properly handling text encoding in web applications can be an incredibly complicated task. One confusing aspect is sorting through all of the possible ways a string may be encoded.
- $_POST values appear encoded based on the submitting HTML page's <meta> tags, the default encoding for the system submitting the page, and the user-preference override for the submitting page, unless accept-charset=utf8 is set as an attribute on the submitting form.
- $_GET values come percent-encoded as bytes from an unspecified character set.

The only commonality in handling text encodings is that in a web application there are usually many possible sources of data being read from and written to together, and the running PHP code has limited control over the data's character encoding.
To this end, UTF-8 is the one universal standard recommended for interchange of data. Code attempting to properly handle mixed character encodings may:
UTF-8-only functions encourage a many-to-one-to-many architecture, while exposing an $encoding parameter to lower-level string functions encourages a many-to-many-to-many design where accounting mistakes might lead to security exploits and/or data corruption.
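One way to picture the many-to-one-to-many architecture is to convert at the boundaries and keep everything in between UTF-8-only. The sketch below assumes the mbstring extension is available; `to_utf8()` and `from_utf8()` are hypothetical helper names, not part of this proposal.

```php
<?php
/**
 * Hypothetical boundary helper: converts incoming bytes to UTF-8 once.
 */
function to_utf8( string $bytes, string $source_encoding ): string {
    return mb_convert_encoding( $bytes, 'UTF-8', $source_encoding );
}

/**
 * Hypothetical boundary helper: converts UTF-8 text to the output encoding once.
 */
function from_utf8( string $text, string $target_encoding ): string {
    return mb_convert_encoding( $text, $target_encoding, 'UTF-8' );
}

// Many inputs: each source is normalized as it enters the application.
$from_latin1 = to_utf8( "caf\xE9", 'ISO-8859-1' );

// One internal representation: all processing assumes UTF-8.
$processed = mb_strtoupper( $from_latin1, 'UTF-8' );

// Many outputs: convert only when writing to a non-UTF-8 destination.
$for_legacy_system = from_utf8( $processed, 'ISO-8859-1' );
```

With this shape, no function in the middle of the pipeline needs an $encoding parameter at all.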
This proposal introduces the decode_html() function, an updated interface that leaves behind the legacy challenges of properly using html_entity_decode().
    /**
     * Decodes raw UTF-8 HTML text sourced from an HTML attribute value or text node.
     *
     * Convert input HTML to UTF-8 if not already encoded. Returns UTF-8 encoded text.
     * Do not send the contents of a SCRIPT or STYLE element to this function.
     *
     * Example:
     *
     *     "Cats & ¬dogs etc…" === decode_html( HtmlContext::Text, "Cats &amp; &notdogs etc&hellip;" );
     *     "/search?q=cat&not=dog" === decode_html( HtmlContext::Attribute, "/search?q=cat&not=dog" );
     *
     * @param HtmlContext $context The provided HTML should be the full contents of an
     *                             HTML attribute, or a text node not containing other HTML.
     * @return string Decoded form of input HTML with character references
     *                replaced by their UTF-8 substitutions.
     */
    function decode_html( HtmlContext $context, string $html ): string {}
In brief:
A new enum specifies supported HTML contexts. For the most part the enum specifies three internal properties:
While these could be handled via three boolean flags, that would require developers to understand the nuances of the different situations in which they apply. By focusing the API on the kinds of situations developers work in, developers are relieved of the burden of knowing the internal details of HTML parsing. For this reason there is overlap in the Script, Style, and Comment contexts, because the parsing rules are identical.
    enum HtmlContext {
        // A complete attribute value, single-quoted, double-quoted, or unquoted.
        case Attribute;

        // DATA content between tags: normal HTML text inside a BODY element.
        case BodyText;

        // Like BodyText, but found inside an SVG or MathML element where NULL bytes are treated differently.
        case ForeignText;

        // Text content inside of a SCRIPT element; nothing is escaped other than NULL bytes.
        case Script;

        // Identical to Script but left as a convenience/education tool.
        case Style;

        // Identical to Script but left as a convenience/education tool.
        case Comment;
    }
A few more contexts could exist, namely for elements where text content is not allowed (IFRAME, NOEMBED, and NOFRAMES) and the deprecated XMP element. For the sake of avoiding unnecessary complexity they are left out of this proposal, but their inclusion would not bloat the implementation in any way.
There is no need for backward-incompatible changes, as this RFC proposes a new interface. A note should be added to html_entity_decode() pointing to decode_html() for more accurate results.
next PHP 8.x
As a new built-in function, decode_html() provides an efficient way to accurately decode HTML text, replacing character references and normalizing input according to the HTML5 rules.
[NEEDS FURTHER ANALYSIS]
A new HtmlContext enum indicates the provenance of an HTML string, whether from an attribute or from text content.

- HtmlContext::Attribute - refers to an HTML attribute. Used in the proposed function to indicate that the input string value was an HTML attribute value.
- HtmlContext::Text - refers to an HTML text node/data/markup. Used in the proposed function to indicate that the input string value was an HTML text node.
[NEEDS FURTHER ANALYSIS]
This RFC does not propose changing any existing PHP functions, classes, or other interfaces. Namely, it leaves unaddressed the use of other HTML parsing facilities.
DOMDocument is plagued by a number of issues related to its legacy as an XML parser attempting to parse HTML documents, which XML parsers cannot do. These issues are addressed in PHP RFC: DOM HTML5 parsing and serialization.
With the introduction of HTMLDocument it's finally possible in PHP to properly decode HTML text content, if the extension is provided, but doing so requires extra setup that is not correct by default and is easy to misuse.
    function decode_attribute( $value ) {
        // Escape single quotes so the value cannot terminate the attribute early.
        $value = str_replace( "'", '&#39;', $value );
        $dom   = \Dom\HTMLDocument::createFromString( "<!DOCTYPE html><meta charset=utf8><div attribute='{$value}'></div>" );
        return $dom->getElementsByTagName('div')->item(0)->getAttribute('attribute');
    }

    function decode_text( $value ) {
        // Escape only "<" so no tags are created; escaping "&" as well
        // would double-encode the character references meant to be decoded.
        $value = str_replace( '<', '&lt;', $value );
        $dom   = \Dom\HTMLDocument::createFromString( "<!DOCTYPE html><meta charset=utf8><div>{$value}</div>" );
        return $dom->getElementsByTagName('div')->item(0)->textContent;
    }
These snippets demonstrate how cumbersome it can be to properly set up the text decoding domain, and shortcuts are frequently taken because of this.
It may be possible to reuse the logic inside the lexbor parser to provide this functionality; however, there may be significant challenges to doing so:

- It is unclear whether lexbor text parsing can be easily extracted from the generalized HTML tree-building algorithm, and whether reuse would imply a significant runtime overhead compared with a custom text parser.
This proposal entertains a related function, decode_html_ref( HtmlContext $context, string $html, int $offset, &$matched_byte_length = null ): ?string, which examines a string at a given offset and returns a replacement character if there is a character reference at the given offset, otherwise NULL. This function is a useful component in building a variety of features that allow for partial decoding of an HTML string.
For example, a suite of functions could be provided to efficiently perform string operations in the encoded domain:
html_attribute_starts_with( string $haystack, string $needle ) provides an efficient way to detect a prefix for an encoded string that doesn't require decoding and storing in memory what could be megabytes of content. This is useful for sanitization and security-related functions examining HTML attribute values.
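To make the idea of working in the encoded domain concrete, here is a user-space sketch. Both functions are hypothetical stand-ins: `sketch_decode_ref()` imitates the shape of the proposed decode_html_ref() but omits the context argument, handles only semicolon-terminated references, knows only five names, and skips the C1 remapping. It also assumes the mbstring extension is available.

```php
<?php
/**
 * Hypothetical, heavily simplified stand-in for decode_html_ref().
 */
function sketch_decode_ref( string $html, int $offset, ?int &$matched_byte_length = null ): ?string {
    // Tiny illustrative table; the real list has 2,231 entries.
    $named = array( 'amp' => '&', 'lt' => '<', 'gt' => '>', 'quot' => '"', 'colon' => ':' );

    $pattern = '/\G&(?:#[xX]([0-9A-Fa-f]{1,6})|#([0-9]{1,7})|([A-Za-z]+));/';
    if ( 1 !== preg_match( $pattern, $html, $m, PREG_UNMATCHED_AS_NULL, $offset ) ) {
        return null;
    }

    $hex  = $m[1] ?? null;
    $name = $m[3] ?? null;

    if ( null !== $name ) {
        if ( ! isset( $named[ $name ] ) ) {
            return null; // Unknown to this sketch: treat as plain text.
        }
        $decoded = $named[ $name ];
    } else {
        $code_point = null !== $hex ? (int) hexdec( $hex ) : (int) $m[2];
        $char       = 0 === $code_point ? false : mb_chr( $code_point, 'UTF-8' );
        $decoded    = false === $char ? "\u{FFFD}" : $char;
    }

    $matched_byte_length = strlen( $m[0] );
    return $decoded;
}

/**
 * Hypothetical prefix check that decodes only as far as the needle requires.
 */
function sketch_attribute_starts_with( string $haystack, string $needle ): bool {
    $at      = 0; // Byte offset into the encoded haystack.
    $matched = 0; // Bytes of the needle matched so far.

    while ( $matched < strlen( $needle ) ) {
        if ( $at >= strlen( $haystack ) ) {
            return false;
        }

        $length  = 1;
        $decoded = '&' === $haystack[ $at ] ? sketch_decode_ref( $haystack, $at, $length ) : null;
        if ( null === $decoded ) {
            $decoded = $haystack[ $at ];
            $length  = 1;
        }

        if ( 0 !== substr_compare( $needle, $decoded, $matched, strlen( $decoded ) ) ) {
            return false;
        }

        $matched += strlen( $decoded );
        $at      += $length;
    }

    return true;
}

var_dump( sketch_attribute_starts_with( 'javascript&colon;alert(1)', 'javascript:' ) ); // bool(true)
```

The loop stops as soon as the needle is matched or ruled out, so the rest of a very large attribute value is never decoded or copied.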
Because of the need to reliably, securely, and conveniently decode HTML text, a custom decoder was implemented in WordPress in user-space PHP code. The algorithm has been ported from that PHP code into PHP's C source for evaluation as part of this RFC in #14927.
While this patch proposes a working HTML decoder meant for inclusion in PHP itself, it represents one of many possible implementations of a new interface. The interface is the most important part of this proposal, as it's the function signature and behavior that dictate whether this new method would lead to more reliable parsing on the web or not.
There are two viable approaches to add this functionality into PHP:
lexbor into the language (vs. being an extension) and write functions that use its existing parsing mechanisms.

Ultimately it's the function interface and behavior this RFC proposes, leaving the implementation free as long as it conforms to the HTML specification. The associated PR with its implementation is a port of the design built into WordPress in PHP.
This proposal was originally discussed on the PHP internals mailing list. https://news-web.php.net/php.internals/124326