PHP RFC: HTML5 Text Decoder

Introduction

The html_entity_decode() function is a convenient way to replace HTML character references in a string with their replacement values. Unfortunately, there are several issues with this function:

Regardless, this is still a useful utility serving a very common need, and PHP would benefit from a function that can decode HTML text properly with safe defaults, and in a way that educates callers on what they are doing instead of confusing them.

$src = decode_html( HtmlContext::Attribute, $src_attribute );
if ( str_starts_with( $src, 'javascript:' ) ) {
	return null;
}

HTML text parsing nuances

Named character references

HTML defines 2,231 named character references, such as …. Of these, html_entity_decode() is aware of 1,511. It is entirely unaware of 614 of them, while the other 106 missing names are the legacy forms that are not required to end in a semicolon.

The list of named character references is published alongside the HTML5 living specification as entities.json. It states:

For the sake of reliability and security, an HTML text decoder should decode only the specified character references, and all of them. This involves decoding the 106 names that don't require a trailing semicolon.

Ambiguous ampersand

The 106 named character references that don't require a trailing semicolon are governed by special rules. In an attribute value they are decoded only when they are not followed by an ambiguous character (ambiguous characters are ASCII alphanumerics and the equals sign, matched by the PCRE pattern ^[A-Za-z0-9=]), while in a text node they are decoded wherever they appear. This legacy behavior was added to the specification to account for misuse in URL attribute values, where the ampersand is both a character reference introducer and a query string argument separator.

$html = '<a href="/?search=creatures&not=ogre">Search for creatures &not=ogre</a>';
echo parser_get_href( $html ); // /?search=creatures&not=ogre
echo parser_get_text( $html ); // Search for creatures ¬=ogre

The end of an attribute's value is not an ambiguous follower, meaning that these references are also decoded when missing their semicolon and found at the end of an attribute.
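
As a rough illustration of the rule above, the following sketch decides whether a semicolon-less reference at a given offset should be decoded. The function name, the constant, and the three-entry table are hypothetical and exist only for this example. A real decoder would carry all 106 legacy names and prefer the longest matching name.

```php
<?php
// Hypothetical excerpt: three of the names that may omit the semicolon.
const SKETCH_LEGACY_NAMES = array(
	'amp'  => '&',
	'copy' => "\u{A9}",
	'not'  => "\u{AC}",
);

/**
 * Hypothetical helper, not part of this proposal. $at must point at an "&".
 * Returns the replacement text, or null when nothing should be decoded.
 */
function sketch_decode_legacy_ref( string $text, int $at, bool $in_attribute ): ?string {
	foreach ( SKETCH_LEGACY_NAMES as $name => $replacement ) {
		if ( "&{$name}" !== substr( $text, $at, strlen( $name ) + 1 ) ) {
			continue;
		}

		// Everything after the matched name; empty at the end of the input.
		$rest = substr( $text, $at + strlen( $name ) + 1 );

		// In an attribute, an alphanumeric or "=" follower is ambiguous,
		// so the text is left as-is. The end of the value is not ambiguous.
		if ( $in_attribute && 1 === preg_match( '/^[A-Za-z0-9=]/', $rest ) ) {
			return null;
		}

		return $replacement;
	}

	return null;
}

$href = '/?search=creatures&not=ogre';
$text = 'Search for creatures &not=ogre';

var_dump( sketch_decode_legacy_ref( $href, strpos( $href, '&' ), true ) );  // NULL
var_dump( sketch_decode_legacy_ref( $text, strpos( $text, '&' ), false ) ); // string(2) "¬"
var_dump( sketch_decode_legacy_ref( 'a&copy', 1, true ) );                  // string(2) "©"
```

The sketch reports only whether to decode. It does not report how many bytes the reference spans, which a full decoder would also need.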

Invalid numeric character references

HTML not only specifies normative behavior, but it also specifies how to handle failure modes (for example, if a numeric character reference references an invalid Unicode character). It does not give allowance for handling errors differently per parser. Notably, most invalid numeric character references decode to the replacement character U+FFFD, not to plaintext.
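
A minimal sketch of the U+FFFD cases follows. The function name is hypothetical, and the sketch covers only the null code point, surrogate halves, and values beyond the Unicode range. Windows-1252 remapping and the parsing of the digits themselves are out of scope here.

```php
<?php
/**
 * Hypothetical helper, not part of this proposal: maps the code point named by
 * a numeric character reference to the code point an HTML parser would emit.
 */
function sketch_resolve_code_point( int $code_point ): int {
	// U+0000 and anything past U+10FFFF become U+FFFD.
	if ( 0 === $code_point || $code_point > 0x10FFFF ) {
		return 0xFFFD;
	}

	// Surrogate halves are not valid scalar values and also become U+FFFD.
	if ( $code_point >= 0xD800 && $code_point <= 0xDFFF ) {
		return 0xFFFD;
	}

	return $code_point;
}

printf( "U+%04X\n", sketch_resolve_code_point( 0x0 ) );      // U+FFFD
printf( "U+%04X\n", sketch_resolve_code_point( 0xD83D ) );   // U+FFFD
printf( "U+%04X\n", sketch_resolve_code_point( 0x110000 ) ); // U+FFFD
printf( "U+%04X\n", sketch_resolve_code_point( 0x2026 ) );   // U+2026
```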

Concessions for Windows-1252 encodings

Windows-1252 differs from ISO-8859-1 (latin1) in that it remaps 27 of the 32 C1 control characters. Since Windows computers used this encoding as the default and these C1 control bytes appeared around the internet, HTML interprets them as if they were Windows-1252 even if the declared encoding is ISO-8859-1.

Furthermore, whenever a numeric character reference refers to one of the code points in this range, it ought to be remapped as if it were Windows-1252, even if the document encoding is UTF-8. Because html_entity_decode() historically hasn't handled these, mojibake for smart quotes, typographic dashes, some European characters, and more has proliferated across the web.

| Input | Proper decode for ISO-8859-1 | Proper decode for UTF-8 | Actual decode (ISO-8859-1 and UTF-8) from html_entity_decode( $text, $html5flags, $encoding ) |
| "\x93&#x93;" | ““ | �“ | �&#x93; |
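
The remapping for numeric character references can be sketched as a lookup table. The names below are hypothetical. The 27 entries follow the Windows-1252 assignments for 0x80 through 0x9F, and the five unassigned positions (0x81, 0x8D, 0x8F, 0x90, 0x9D) pass through unchanged in this sketch.

```php
<?php
// Hypothetical table: C1 code points and their Windows-1252 interpretations.
const SKETCH_C1_REMAP = array(
	0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
	0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
	0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
	0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
	0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
	0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
	0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
);

/**
 * Hypothetical helper, not part of this proposal: returns the code point a
 * numeric character reference in the C1 range should decode to.
 */
function sketch_remap_c1( int $code_point ): int {
	return SKETCH_C1_REMAP[ $code_point ] ?? $code_point;
}

printf( "U+%04X\n", sketch_remap_c1( 0x93 ) ); // U+201C (left double quotation mark)
printf( "U+%04X\n", sketch_remap_c1( 0x81 ) ); // U+0081 (unassigned in Windows-1252)
printf( "U+%04X\n", sketch_remap_c1( 0x41 ) ); // U+0041 (outside the C1 range)
```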

Context-based parsing

There are a few minor rules related to where in an HTML document a span of text is found.

Character encodings and UTF-8

Properly handling text encoding in web applications can be an incredibly complicated task. One confusing aspect is sorting through all of the possible ways a string may be encoded.

The only commonality in handling text encodings is that in a web application there are usually many possible sources of data being read from and written to together, and the running PHP code has limited control over the data's character encoding.

To this end, UTF-8 is the one universal standard recommended for interchange of data. Code attempting to properly handle mixed character encodings may:

UTF-8-only functions encourage a many-to-one-to-many architecture, whereas exposing an $encoding parameter on lower-level string functions encourages a many-to-many-to-many design in which accounting mistakes can lead to security exploits, data corruption, or both.
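
One way to picture the many-to-one-to-many architecture is to convert every input to UTF-8 at the boundary and let all later code assume UTF-8. The sketch below assumes the mbstring extension is loaded. The helper name and the sample inputs are hypothetical.

```php
<?php
/**
 * Hypothetical boundary helper: converts bytes from a known source encoding
 * into UTF-8 so downstream code handles a single encoding.
 */
function sketch_to_utf8( string $bytes, string $source_encoding ): string {
	return mb_convert_encoding( $bytes, 'UTF-8', $source_encoding );
}

// Two inputs arriving in different legacy encodings.
$from_legacy_db = sketch_to_utf8( "caf\xE9", 'ISO-8859-1' );
$from_old_form  = sketch_to_utf8( "\x93hi\x94", 'Windows-1252' );

// From here on every string operation sees UTF-8 only.
$combined = $from_legacy_db . ' ' . $from_old_form;

var_dump( mb_check_encoding( $combined, 'UTF-8' ) );   // bool(true)
var_dump( "\u{201C}hi\u{201D}" === $from_old_form );   // bool(true)
```

Converting back to a legacy encoding, where one is still required, would happen once more at the output boundary.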

Proposal

This proposal introduces the decode_html() function, an updated interface that leaves behind the legacy challenges of properly using html_entity_decode().

/**
 * Decodes raw UTF-8 HTML text sourced from an HTML attribute value or text node.
 *
 * Convert input HTML to UTF-8 if not already encoded. Returns UTF-8 encoded text.
 * Do not send the contents of a SCRIPT or STYLE element to this function.
 *
 * Example:
 *
 *     "Cats & ¬dogs etc…"     === decode_html( HtmlContext::BodyText, "Cats &amp &notdogs etc&hellip;" );
 *     "/search?q=cat&not=dog" === decode_html( HtmlContext::Attribute, "/search?q=cat&not=dog" );
 *
 * @param HtmlContext $context Where the text was found, such as an attribute
 *                             value or a text node.
 * @param string      $html    The full contents of an HTML attribute, or a text
 *                             node not containing other HTML.
 * @return string Decoded form of input HTML with character references
 *                replaced by their UTF-8 substitutions.
 */
function decode_html( HtmlContext $context, string $html ): string {}

In brief:

A new enum specifies supported HTML contexts. For the most part the enum specifies three internal properties:

While these could be handled via three boolean flags, that would require developers to understand the nuances of the different situations where each one applies. Focusing the API on the kinds of situations developers work in relieves them of the burden of knowing the internal details of HTML parsing. For this reason the Script, Style, and Comment contexts overlap: their parsing rules are identical.

enum HtmlContext {
	// A complete attribute value, single-quoted, double-quoted, or unquoted.
	case Attribute;
 
	// DATA content between tags: normal HTML text inside a BODY element.
	case BodyText;
 
	// Like BodyText, but found inside an SVG or MathML element where NULL bytes are treated differently.
	case ForeignText;
 
	// Text content inside of a SCRIPT element; nothing is escaped other than NULL bytes.
	case Script;
 
	// Identical to Script but left as a convenience/education tool.
	case Style;
 
	// Identical to Script but left as a convenience/education tool.
	case Comment;
}
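
To illustrate how a caller might choose a context, the sketch below maps an element name to a context value. It mirrors the proposed enum under a hypothetical name, SketchHtmlContext, so that it runs on PHP 8.1+ without the proposed function. The helper and its rule that foreign content takes precedence are simplifying assumptions, not part of this proposal.

```php
<?php
// Hypothetical stand-in for the proposed HtmlContext enum.
enum SketchHtmlContext {
	case Attribute;
	case BodyText;
	case ForeignText;
	case Script;
	case Style;
	case Comment;
}

/**
 * Hypothetical helper: picks the context for text found directly inside the
 * named element. Assumes the caller already knows whether the element sits
 * inside SVG or MathML.
 */
function sketch_context_for_text_in( string $tag_name, bool $in_foreign_content = false ): SketchHtmlContext {
	if ( $in_foreign_content ) {
		return SketchHtmlContext::ForeignText;
	}

	return match ( strtoupper( $tag_name ) ) {
		'SCRIPT' => SketchHtmlContext::Script,
		'STYLE'  => SketchHtmlContext::Style,
		default  => SketchHtmlContext::BodyText,
	};
}

echo sketch_context_for_text_in( 'script' )->name, "\n";     // Script
echo sketch_context_for_text_in( 'p' )->name, "\n";          // BodyText
echo sketch_context_for_text_in( 'text', true )->name, "\n"; // ForeignText
```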

A few more contexts could exist, namely for elements where text content is not allowed (IFRAME, NOEMBED, and NOFRAMES) and the deprecated XMP element. For the sake of avoiding unnecessary complexity they are left out of this proposal, but their inclusion would not bloat the implementation in any way.

Backward Incompatible Changes

There is no need for backward incompatible changes, as this proposes a new interface. A note should be added to html_entity_decode() pointing to decode_html() instead for more accurate results.

Proposed PHP Version(s)

next PHP 8.x

RFC Impact

As a new built-in function, decode_html() provides an efficient way to accurately decode HTML text, replacing character references and normalizing input according to the HTML5 rules.

To Opcache

[NEEDS FURTHER ANALYSIS]

New Enums

A new HtmlContext enum indicates the provenance of an HTML string, whether from an attribute or from text content.

Open Issues

[NEEDS FURTHER ANALYSIS]

Unaffected PHP Functionality

This RFC does not propose changing any existing PHP functions, classes, or other interfaces. Namely, it leaves unaddressed the use of other HTML parsing facilities.

Why not use DOMDocument or HTMLDocument?

DOMDocument is plagued by a number of issues related to its legacy as an XML parser attempting to parse HTML documents, which XML parsers cannot do. These issues are addressed in PHP RFC: DOM HTML5 parsing and serialization.

With the introduction of HTMLDocument it's finally possible in PHP to properly decode HTML text content, provided the extension is available, but doing so requires extra setup that is not correct by default and is easy to misuse.

function decode_attribute( $value ) {
	$value = str_replace( "'", '&apos;', $value );
	$dom = \Dom\HTMLDocument::createFromString( "<!DOCTYPE html><meta charset=utf8><div attribute='{$value}'></div>" );
	return $dom->getElementsByTagName('div')->item(0)->getAttribute('attribute');
}
 
function decode_text( $value ) {
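	// Escape only "<" so the text cannot open a tag. Escaping "&" as well
	// would double-encode the character references and prevent decoding.
	$value = str_replace( '<', '&lt;', $value );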
	$dom = \Dom\HTMLDocument::createFromString( "<!DOCTYPE html><meta charset=utf8><div>{$value}</div>" );
	return $dom->getElementsByTagName('div')->item(0)->textContent;
}

These snippets demonstrate how cumbersome it can be to properly set up the text decoding domain, and shortcuts are frequently taken because of this.

A note on reuse of the recent lexbor additions

It may be possible to reuse the logic inside the lexbor parser to provide this functionality; however, there may be significant challenges in doing so:

Future Scope

This proposal entertains a related function, decode_html_ref( HtmlContext $context, string $html, int $offset, &$matched_byte_length = null ): ?string which examines a string at a given offset and returns a replacement character if there is a character reference at the given offset, otherwise NULL. This function is a useful component in building a variety of features that allow for partial decoding of an HTML string.
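
To make the shape of such a function concrete, here is a user-space sketch with a similar signature. It handles numeric character references only. Both function names are hypothetical, the context parameter is omitted, and named references and Windows-1252 remapping are left out for brevity.

```php
<?php
// Hypothetical helper: encodes a Unicode scalar value as UTF-8 bytes.
function sketch_utf8_chr( int $cp ): string {
	if ( $cp < 0x80 ) {
		return chr( $cp );
	}
	if ( $cp < 0x800 ) {
		return chr( 0xC0 | ( $cp >> 6 ) ) . chr( 0x80 | ( $cp & 0x3F ) );
	}
	if ( $cp < 0x10000 ) {
		return chr( 0xE0 | ( $cp >> 12 ) )
			. chr( 0x80 | ( ( $cp >> 6 ) & 0x3F ) )
			. chr( 0x80 | ( $cp & 0x3F ) );
	}
	return chr( 0xF0 | ( $cp >> 18 ) )
		. chr( 0x80 | ( ( $cp >> 12 ) & 0x3F ) )
		. chr( 0x80 | ( ( $cp >> 6 ) & 0x3F ) )
		. chr( 0x80 | ( $cp & 0x3F ) );
}

/**
 * Hypothetical sketch: returns the replacement for a numeric character
 * reference starting exactly at $offset, or null if there is none.
 */
function sketch_decode_numeric_ref( string $html, int $offset, &$matched_byte_length = null ): ?string {
	// \G anchors the match at $offset; the trailing semicolon is optional.
	if ( 1 !== preg_match( '/\G&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));?/', $html, $m, 0, $offset ) ) {
		return null;
	}

	$matched_byte_length = strlen( $m[0] );
	$is_hex              = '' !== $m[1];
	$digits              = ltrim( $is_hex ? $m[1] : $m[2], '0' );

	// More than seven significant digits always exceeds U+10FFFF.
	$cp = strlen( $digits ) > 7 ? 0x110000 : intval( $digits, $is_hex ? 16 : 10 );

	if ( 0 === $cp || $cp > 0x10FFFF || ( $cp >= 0xD800 && $cp <= 0xDFFF ) ) {
		return "\u{FFFD}";
	}

	return sketch_utf8_chr( $cp );
}

$html = 'Fish &#38; chips&#x2026;';
var_dump( sketch_decode_numeric_ref( $html, strpos( $html, '&' ), $length ) ); // string(1) "&"
var_dump( $length );                                                           // int(5)
var_dump( sketch_decode_numeric_ref( $html, 0 ) );                             // NULL
```

Reporting the matched byte length lets a caller step through a string and decode it piece by piece without ever building the fully decoded text.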

For example, a suite of functions could be provided to efficiently perform string operations in the encoded domain:

Proposed Voting Choices


Patches and Tests

Because of the need to decode HTML text reliably, securely, and conveniently, a custom decoder was implemented in WordPress in user-space PHP code. For the sake of evaluating this RFC, that algorithm has been ported from PHP into PHP's C code in #14927.

While this patch proposes a working HTML decoder meant for inclusion in PHP itself, it represents one of many possible implementations of a new interface. The interface is the most important part of this proposal, as it's the function signature and behavior that dictate whether this new function would lead to more reliable parsing on the web or not.

Implementation

There are two viable approaches to add this functionality into PHP:

Ultimately it's the function interface and behavior this RFC proposes, leaving the implementation free as long as it conforms to the HTML specification. The associated PR with its implementation is a port of the design built into WordPress in PHP.

https://github.com/php/php-src/pull/14927

References

This proposal was originally discussed on the PHP internals mailing list. https://news-web.php.net/php.internals/124326

Rejected Features