PHP RFC: HTML5 Text Decoder

Version: 0.1
Date: 2024-08-15
Author: Dennis Snell, dennis.snell@automattic.com
Status: Draft
First Published at: http://wiki.php.net/rfc/decode_html

Introduction

The html_entity_decode() function is a convenient way to replace HTML character references in a string with their replacement values. Unfortunately, there are several issues with this function:

its default arguments are almost always wrong.
it's hard to know how to call it properly.
when called correctly for ENT_HTML5, it's still wrong.
its interface doesn't provide a way to properly decode HTML text.

Regardless, this is still a useful utility serving a very common need, and PHP would benefit from a function that can decode HTML text properly with safe defaults, and in a way that educates callers on what they are doing instead of confusing them.

$src = decode_html( HtmlContext::Attribute, $src_attribute );
if ( str_starts_with( $src, 'javascript:' ) ) {
	return null;
}

HTML text parsing nuances

Named character references

HTML defines 2,231 named character references, such as &hellip;. Of these, html_entity_decode() is aware of 1,511. It is entirely unaware of 614 of them, while the remaining 106 are missing because html_entity_decode() does not recognize the named character references that are not required to end in a semicolon.

The list of named character references is published alongside the HTML5 living specification as entities.json. The specification states:

This list is static and will not be expanded or changed in the future.

For the sake of reliability and security, an HTML text decoder should decode only the specified character references, and all of them. This involves decoding the 106 names that don't require a trailing semicolon.
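The semicolon gap can be observed directly with the existing function. The snippet below is a minimal sketch, assuming a stock PHP 8 build; the flags are passed explicitly because the defaults do not select HTML5.

```php
<?php
// Explicit flags: the default flag set targets HTML 4.01, not HTML5.
$flags = ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5;

// A reference that ends in a semicolon is decoded as expected.
$with_semicolon = html_entity_decode( 'etc&hellip;', $flags, 'UTF-8' );

// HTML decodes `&amp` and `&not` in a text node even without the
// semicolon, but html_entity_decode() leaves both untouched.
$without_semicolon = html_entity_decode( 'Cats &amp &notdogs', $flags, 'UTF-8' );

var_dump( $with_semicolon === "etc\u{2026}" );
var_dump( $without_semicolon === 'Cats &amp &notdogs' );
```

Both comparisons hold because html_entity_decode() only matches a named reference when it is terminated by a semicolon.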

Ambiguous ampersand

The 106 named character references that don't require a trailing semicolon are governed by special rules. In an attribute value they are decoded only when followed by an unambiguous character (unambiguous characters are matched by the PCRE pattern [^A-Za-z0-9=]), while in a text node they may appear anywhere. This legacy behavior was added to the specification to account for misuse in URL attribute values, where the ampersand is both a character reference introducer and a query string argument separator.

$html = '<a href="/?search=creatures&not=ogre">Search for creatures &not=ogre</a>';
echo parser_get_href( $html ); // /?search=creatures&not=ogre
echo parser_get_text( $html ); // Search for creatures ¬=ogre

The end of an attribute's value is not an ambiguous follower, meaning that these references are also decoded when missing their semicolon and found at the end of an attribute.
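The attribute-only rule can be sketched as a small predicate. This is an illustration under stated assumptions, not the proposed implementation: sketch_decodes_in_attribute() is a hypothetical helper, and it assumes the caller has already matched a semicolon-less reference name and knows the byte offset just past it.

```php
<?php
/**
 * Sketch of the attribute-only rule for a named character reference
 * that was matched without its trailing semicolon (hypothetical helper).
 *
 * @param string $html    Full attribute value.
 * @param int    $ref_end Byte offset just past the matched reference name.
 * @return bool Whether the semicolon-less reference should be decoded.
 */
function sketch_decodes_in_attribute( string $html, int $ref_end ): bool {
	// The end of the attribute value is not an ambiguous follower.
	if ( $ref_end >= strlen( $html ) ) {
		return true;
	}

	// ASCII alphanumerics and "=" are ambiguous: the text is left as-is.
	return 1 !== preg_match( '/^[A-Za-z0-9=]/', $html[ $ref_end ] );
}

$href = '/?search=creatures&not=ogre';
$end  = strpos( $href, '&not' ) + strlen( '&not' );
var_dump( sketch_decodes_in_attribute( $href, $end ) ); // "=" follows, so not decoded.
```

In a text node no such check is needed, because the semicolon-less names are decoded wherever they appear.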

Invalid numeric character references

HTML not only specifies normative behavior, but it also specifies how to handle failure modes (for example, if a numeric character reference references an invalid Unicode character). It does not give allowance for handling errors differently per parser. Notably, most invalid numeric character references decode to the replacement character U+FFFD �, not to plaintext.

Concessions for Windows-1252 encodings

Windows-1252 differs from ISO-8859-1 (latin1) in that it remaps 27 of the 32 C1 control characters. Since Windows computers used this encoding as the default and these C1 control bytes appeared around the internet, HTML interprets them as if they were Windows-1252 even if the declared encoding is ISO-8859-1.

Furthermore, whenever a numeric character reference refers to one of the code points in this range, it ought to be remapped as if it were Windows-1252, even if the document encoding is UTF-8. Because html_entity_decode() historically hasn't handled these, mojibake proliferates across the web for smart quotes, typographic dashes, some European characters, and more.

Results of `html_entity_decode( $text, $html5flags, $encoding )`:

Input	Proper Decode for `ISO-8859-1`	Proper Decode for `UTF-8`	Actual Decode (ISO-8859-1 and UTF-8)
`"\x93&#x93;"`	`““`	`�“`	`�`
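The error handling and the Windows-1252 remapping for numeric character references can be sketched together. The function below is a minimal illustration, assuming the digits have already been parsed into an integer; sketch_numeric_ref_to_utf8() is a hypothetical name, and the remapping table follows the one in the HTML specification.

```php
<?php
/**
 * Sketch: map the number from a numeric character reference to UTF-8.
 * Hypothetical helper, not the proposed implementation.
 */
function sketch_numeric_ref_to_utf8( int $code_point ): string {
	// C1 controls that HTML remaps as if they were Windows-1252 bytes.
	static $windows_1252 = [
		0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
		0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
		0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
		0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
		0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
		0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
		0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
	];
	$code_point = $windows_1252[ $code_point ] ?? $code_point;

	// NULL, surrogate halves, and values past U+10FFFF become U+FFFD.
	if (
		$code_point <= 0 ||
		$code_point > 0x10FFFF ||
		( $code_point >= 0xD800 && $code_point <= 0xDFFF )
	) {
		$code_point = 0xFFFD;
	}

	// Encode as UTF-8 by hand so the sketch needs no extensions.
	if ( $code_point < 0x80 ) {
		return chr( $code_point );
	}
	if ( $code_point < 0x800 ) {
		return chr( 0xC0 | ( $code_point >> 6 ) )
			. chr( 0x80 | ( $code_point & 0x3F ) );
	}
	if ( $code_point < 0x10000 ) {
		return chr( 0xE0 | ( $code_point >> 12 ) )
			. chr( 0x80 | ( ( $code_point >> 6 ) & 0x3F ) )
			. chr( 0x80 | ( $code_point & 0x3F ) );
	}
	return chr( 0xF0 | ( $code_point >> 18 ) )
		. chr( 0x80 | ( ( $code_point >> 12 ) & 0x3F ) )
		. chr( 0x80 | ( ( $code_point >> 6 ) & 0x3F ) )
		. chr( 0x80 | ( $code_point & 0x3F ) );
}

var_dump( sketch_numeric_ref_to_utf8( 0x93 ) === "\u{201C}" );
```

The table has 27 entries, matching the 27 remapped C1 controls; the five unmapped C1 code points pass through unchanged.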

Context-based parsing

There are a few minor rules related to where in an HTML document a span of text is found.

NULL bytes are removed when found inside text nodes deep inside a BODY element, but if found within an SVG or MathML context, they are replaced with the replacement character U+FFFD �.
No character references are decoded when inside comments, doctype declarations, CDATA sections (which are themselves only found within SVG or MathML elements), or within any of the elements which cannot contain other markup (IFRAME, NOEMBED, NOFRAMES, SCRIPT, STYLE, or XMP).
The first newline following a LISTING, PRE, or TEXTAREA opening tag is ignored to allow for friendlier HTML syntax formatting. This includes newlines that are encoded as character references, such as &#x0A;.

Character encodings and UTF-8

Properly handling text encoding in web applications can be an incredibly complicated task. One confusing aspect is sorting through all of the possible ways a string may be encoded.

PHP maintains its own default encoding.
Text read from a file is likely to be in a different encoding than the PHP character set, unless both are UTF-8.
$_POST values appear encoded based on the submitting HTML page's <meta> tags, the default encoding for the system submitting the page, and the user-preference override for the submitting page, unless accept-charset=utf8 is set as an attribute on the submitting form.
$_GET values come percent-encoded as bytes from an unspecified character set.
Database query results come in whatever character set is active for the current connection.
Source code is almost always written in UTF-8, and its string literals get mixed with other text in string functions.

The only commonality in handling text encodings is that in a web application there are usually many possible sources of data being read from and written to together, and the running PHP code has limited control over the data's character encoding.

To this end, UTF-8 is the one universal standard recommended for interchange of data. Code attempting to properly handle mixed character encodings may:

Track the encoding of every string variable, which is often unavailable, and combine or split texts carefully, keeping track of encoding boundaries.
Convert everything into UTF-8 at the edge of the system so that internally, all text is identical in this regard.

UTF-8-only functions encourage a many-to-one-to-many architecture, while exposing an $encoding parameter on lower-level string functions encourages a many-to-many-to-many design where accounting mistakes might lead to security exploits, data corruption, or both.
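Converting at the edge might look like the following. This is a minimal sketch, assuming the mbstring extension is loaded and that the caller knows (or has decided on) the source encoding; sketch_to_utf8() is a hypothetical helper, not part of this proposal.

```php
<?php
/**
 * Sketch: normalize incoming bytes to UTF-8 at the edge of the system.
 * Hypothetical helper; assumes the mbstring extension is available.
 */
function sketch_to_utf8( string $bytes, string $source_encoding ): string {
	if ( 'UTF-8' === strtoupper( $source_encoding ) ) {
		// Input claimed to be UTF-8 is validated rather than trusted.
		if ( ! mb_check_encoding( $bytes, 'UTF-8' ) ) {
			throw new InvalidArgumentException( 'Input is not valid UTF-8.' );
		}
		return $bytes;
	}

	// Everything else is converted once, so internal code sees only UTF-8.
	return mb_convert_encoding( $bytes, 'UTF-8', $source_encoding );
}

var_dump( sketch_to_utf8( "caf\xE9", 'ISO-8859-1' ) === "caf\u{E9}" );
```

After this step, every string handed to lower-level functions shares one encoding, which is the many-to-one half of the architecture described above.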

Proposal

This proposal introduces the decode_html() function, an updated interface that leaves behind the legacy challenges of properly using html_entity_decode().

/**
 * Decodes raw UTF-8 HTML text sourced from an HTML attribute value or text node.
 *
 * Convert the input HTML to UTF-8 before calling if it is in another encoding. Returns UTF-8 encoded text.
 * Do not send the contents of a SCRIPT or STYLE element to this function.
 *
 * Example:
 *
 *     "Cats & ¬dogs etc…"     === decode_html( HtmlContext::BodyText, "Cats &amp &notdogs etc&hellip;" );
 *     "/search?q=cat&not=dog" === decode_html( HtmlContext::Attribute, "/search?q=cat&not=dog" );
 *
 * @param HtmlContext $context Where the HTML was found, such as an attribute value or a text node.
 * @param string      $html    The full contents of an HTML attribute, or a text
 *                             node not containing other HTML.
 * @return string Decoded form of input HTML with character references
 *                replaced by their UTF-8 substitutions.
 */
function decode_html( HtmlContext $context, string $html ): string {}

In brief:

All input to this function must be UTF-8.
The replaced values for character references are UTF-8.
The list of named character references is non-configurable.
Calling code must indicate where the input came from, whether from an attribute value or “data,” or somewhere else special.
The passed input is assumed to be the entire contents of the attribute value or text node - not a truncation thereof.

A new enum specifies supported HTML contexts. For the most part the enum specifies three internal properties:

Are character references decoded?
Are ambiguous ampersand references interpreted?
Are NULL bytes replaced or removed?

While these could be handled via three boolean flags, that would require developers to understand the nuances of the different situations where each applies. By focusing the API on the kinds of situations developers work in, developers are relieved of the burden of knowing the internal details of HTML parsing. For this reason there is overlap in the Script, Style, and Comment contexts, because their parsing rules are identical.

enum HtmlContext {
	// A complete attribute value, single-quoted, double-quoted, or unquoted.
	case Attribute;
 
	// DATA content between tags: normal HTML text inside a BODY element.
	case BodyText;
 
	// Like BodyText, but found inside an SVG or MathML element where NULL bytes are treated differently.
	case ForeignText;
 
	// Text content inside of a SCRIPT element; nothing is escaped other than NULL bytes.
	case Script;
 
	// Identical to Script but left as a convenience/education tool.
	case Style;
 
	// Identical to Script but left as a convenience/education tool.
	case Comment;
}

A few more contexts could exist, namely for elements where text content is not allowed (IFRAME, NOEMBED, and NOFRAMES) and the deprecated XMP element. For the sake of avoiding unnecessary complexity they are left out of this proposal, but their inclusion would not bloat the implementation in any way.
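One way to picture the three internal properties is a userland enum with one method per question. This is only a sketch: SketchHtmlContext and its method names are hypothetical, and the NULL-byte answer for Attribute (replaced rather than removed) comes from a reading of the HTML tokenizer rules rather than from this RFC.

```php
<?php
// Hypothetical userland stand-in for the proposed enum; the name and
// methods are illustrative and not part of the proposal.
enum SketchHtmlContext {
	case Attribute;
	case BodyText;
	case ForeignText;
	case Script;
	case Style;
	case Comment;

	// Are character references decoded?
	public function decodesReferences(): bool {
		return match ( $this ) {
			self::Attribute, self::BodyText, self::ForeignText => true,
			self::Script, self::Style, self::Comment           => false,
		};
	}

	// Does the ambiguous-ampersand rule apply to semicolon-less references?
	public function hasAmbiguousAmpersand(): bool {
		return self::Attribute === $this;
	}

	// Are NULL bytes removed (true) or replaced with U+FFFD (false)?
	// Assumption: only body text removes them.
	public function removesNullBytes(): bool {
		return self::BodyText === $this;
	}
}

var_dump( SketchHtmlContext::Script->decodesReferences() );
```

Script, Style, and Comment return the same answers from every method, which is the overlap the enum accepts in exchange for a clearer call site.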

Backward Incompatible Changes

There is no need for backward incompatible changes, as this proposes a new interface. A note should be added to html_entity_decode() pointing to decode_html() instead for more accurate results.

Proposed PHP Version(s)

next PHP 8.x

RFC Impact

As a new built-in function, decode_html() provides an efficient way to accurately decode HTML text, replacing character references and normalizing input according to the HTML5 rules.

To Opcache

[NEEDS FURTHER ANALYSIS]

New Enums

A new HtmlContext enum indicates the provenance of an HTML string, whether from an attribute or from text content.

HtmlContext::Attribute - refers to an HTML attribute. Used in the proposed function to indicate that the input string value was an HTML attribute value.
HtmlContext::BodyText - refers to an HTML text node/data/markup. Used in the proposed function to indicate that the input string value was an HTML text node.

Open Issues

[NEEDS FURTHER ANALYSIS]

Unaffected PHP Functionality

This RFC does not propose changing any existing PHP functions, classes, or other interfaces. Namely, it leaves unaddressed the use of other HTML parsing facilities.

Why not use DOMDocument or HTMLDocument?

DOMDocument is plagued by a number of issues related to its legacy as an XML parser attempting to parse HTML documents, which XML parsers cannot do. These issues are addressed in PHP RFC: DOM HTML5 parsing and serialization.

With the introduction of HTMLDocument it's finally possible in PHP to properly decode HTML text content, if the extension is available, but doing so requires extra setup that isn't right by default and is easy to misuse.

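// Decodes an attribute value by parsing it inside a throwaway document.
function decode_attribute( $value ) {
	// Escape single quotes so the value cannot end the attribute early.
	$value = str_replace( "'", '&apos;', $value );
	$dom = \Dom\HTMLDocument::createFromString( "<!DOCTYPE html><meta charset=utf8><div attribute='{$value}'></div>" );
	return $dom->getElementsByTagName('div')->item(0)->getAttribute('attribute');
}
 
// Decodes a text node by parsing it inside a throwaway document.
function decode_text( $value ) {
	// Escape only "<" so the value cannot open a tag. Escaping "&" as well,
	// as htmlspecialchars() does, would prevent any references from decoding.
	$value = str_replace( '<', '&lt;', $value );
	$dom = \Dom\HTMLDocument::createFromString( "<!DOCTYPE html><meta charset=utf8><div>{$value}</div>" );
	return $dom->getElementsByTagName('div')->item(0)->textContent;
}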

These snippets demonstrate how cumbersome it can be to properly set up the text decoding domain, and shortcuts are frequently taken because of this.

A note on reuse of the recent lexbor additions

It may be possible to reuse the logic inside the lexbor parser to provide this functionality. However, there may be some significant challenges to doing so:

It's unclear to the RFC author if the lexbor text parsing can be easily extracted from the generalized HTML tree-building algorithm and if reuse would imply a significant runtime overhead, vs. the use of a custom text parser.
It would be important that PHP itself ships with this new built-in function so that it's readily and simply available to application developers. Relying on having an extension installed would be a major obstacle to appealing to folks wanting to eliminate corruption and security issues while decoding HTML text.

Future Scope

This proposal entertains a related function, decode_html_ref( HtmlContext $context, string $html, int $offset, &$matched_byte_length = null ): ?string which examines a string at a given offset and returns a replacement character if there is a character reference at the given offset, otherwise NULL. This function is a useful component in building a variety of features that allow for partial decoding of an HTML string.

For example, a suite of functions could be provided to efficiently perform string operations in the encoded domain:

html_attribute_starts_with( string $haystack, string $needle ) provides an efficient way to detect a prefix for an encoded string that doesn't require decoding and storing in memory what could be megabytes of content. This is useful for sanitization and security-related functions examining HTML attribute values.
There are useful visualizations which highlight character references within HTML, such as for syntax-highlighters. With only a full decoding function it's not possible to detect where the character references are found, but with this new system this becomes trivial.
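A prefix check in the encoded domain could be built on a single-reference decoder roughly as follows. This is a sketch under loud assumptions: sketch_decode_ref() is a toy stand-in for the proposed decode_html_ref() that knows only two named references and ignores the context argument, and sketch_attribute_starts_with() is a hypothetical name for the idea described above.

```php
<?php
/**
 * Toy stand-in for the proposed decode_html_ref(). It knows only two
 * named references and is NOT a conforming decoder.
 */
function sketch_decode_ref( string $html, int $offset, ?int &$matched_byte_length = null ): ?string {
	foreach ( [ '&colon;' => ':', '&amp;' => '&' ] as $name => $value ) {
		if ( substr( $html, $offset, strlen( $name ) ) === $name ) {
			$matched_byte_length = strlen( $name );
			return $value;
		}
	}
	return null;
}

/**
 * Sketch: does the decoded form of $haystack start with $needle?
 * Decodes one reference at a time and stops at the first mismatch,
 * so the full decoded string is never built.
 */
function sketch_attribute_starts_with( string $haystack, string $needle ): bool {
	$at      = 0; // Byte offset into the still-encoded haystack.
	$matched = 0; // Number of needle bytes matched so far.

	while ( $matched < strlen( $needle ) ) {
		if ( $at >= strlen( $haystack ) ) {
			return false; // Haystack ended before the needle did.
		}

		// Decode at most one character reference; otherwise take one raw byte.
		$length = 1;
		$chunk  = '&' === $haystack[ $at ] ? sketch_decode_ref( $haystack, $at, $length ) : null;
		$chunk  = $chunk ?? $haystack[ $at ];

		// Compare only as many bytes as the needle still needs.
		$compare = min( strlen( $chunk ), strlen( $needle ) - $matched );
		if ( 0 !== strncmp( $chunk, substr( $needle, $matched ), $compare ) ) {
			return false;
		}

		$matched += $compare;
		$at      += $length;
	}

	return true;
}

var_dump( sketch_attribute_starts_with( 'javascript&colon;alert(1)', 'javascript:' ) );
```

The loop reads at most a needle's worth of decoded bytes, so the cost does not grow with the size of the attribute value.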

Proposed Voting Choices

Include these so readers know where you are heading and can discuss the proposed voting options.

Patches and Tests

Because of the need to reliably, securely, and conveniently decode HTML text, a custom decoder was implemented in WordPress in user-space PHP code. For the sake of evaluating this RFC, the algorithm has been ported from PHP into PHP's C code in #14927.

While this patch proposes a working HTML decoder meant for inclusion in PHP itself, it represents one of many possible implementations of a new interface. The interface is the most important part of this proposal, as it's the function signature and behavior which dictates whether this new method would lead to more reliable parsing on the web or not.

Implementation

There are two viable approaches to add this functionality into PHP:

Write a spec-compliant and performant parser in C and maintain it internally.
Incorporate lexbor into the language (vs. being an extension) and write functions that use its existing parsing mechanisms.

Ultimately it's the function interface and behavior this RFC proposes, leaving the implementation free as long as it conforms to the HTML specification. The associated PR with its implementation is a port of the design built into WordPress in PHP.

Numeric character references are handled in a straightforward manner, except that a few optimizations take place before converting digits to numbers so that the parser can skip that stage if it's known that the parsed number would be too large. This also protects against certain kinds of overflow and denial-of-service attacks.
Named character references are split into groups which are keyed by their first two characters. For each group, a string contains a sequence of the rest of the character reference name and its substitution value, with leading bytes indicating length of the reference name and replacement. Lookup involves a couple of indirect memory branches and then proceeds in a way attempting to maximize cache locality.
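The grouped lookup for named character references can be illustrated in user-space PHP. The following is a sketch of the general idea only, with a tiny hand-made table; the helper names, the packing layout, and the longest-first ordering within a group are assumptions of this example and are not taken from the patch.

```php
<?php
// Packs one entry: length-prefixed name remainder, then length-prefixed replacement.
function sketch_pack( string $rest, string $replacement ): string {
	return chr( strlen( $rest ) ) . $rest . chr( strlen( $replacement ) ) . $replacement;
}

// Tiny illustrative table keyed by the first two characters of the name.
// Within a group, longer names come first so the longest match wins.
$sketch_groups = [
	'am' => sketch_pack( 'p;', '&' ) . sketch_pack( 'p', '&' ),
	'no' => sketch_pack( 'tin;', "\u{2209}" ) . sketch_pack( 't;', "\u{AC}" ) . sketch_pack( 't', "\u{AC}" ),
];

/**
 * Looks up a named reference whose name starts at $offset (just past "&").
 * Returns the replacement and sets $name_length, or returns null.
 */
function sketch_lookup( array $groups, string $html, int $offset, ?int &$name_length = null ): ?string {
	$group = $groups[ substr( $html, $offset, 2 ) ] ?? null;
	if ( null === $group ) {
		return null;
	}

	$at  = 0;
	$end = strlen( $group );
	while ( $at < $end ) {
		// Read the length-prefixed remainder of the name.
		$rest_len  = ord( $group[ $at ] );
		$rest      = substr( $group, $at + 1, $rest_len );
		$value_at  = $at + 1 + $rest_len;
		$value_len = ord( $group[ $value_at ] );

		if ( substr( $html, $offset + 2, $rest_len ) === $rest ) {
			$name_length = 2 + $rest_len;
			return substr( $group, $value_at + 1, $value_len );
		}

		// Skip past this entry's replacement to the next entry.
		$at = $value_at + 1 + $value_len;
	}

	return null;
}

var_dump( sketch_lookup( $sketch_groups, '&notin; x', 1, $length ) === "\u{2209}" );
```

Each group is one contiguous string, so a lookup touches a single array slot and then scans adjacent bytes, which is the cache-friendly access pattern the description aims for.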

https://github.com/php/php-src/pull/14927

References

This proposal was originally discussed on the PHP internals mailing list. https://news-web.php.net/php.internals/124326
