rfc:decode_html
Differences
This shows you the differences between two versions of the page.
rfc:decode_html [2024/08/19 22:24] – dmsnell | rfc:decode_html [2024/09/06 18:58] (current) – Fix list bullet syntax. dmsnell | ||
---|---|---|---|
Line 17: | Line 17: | ||
Regardless, this is still a useful utility serving a very common need, and PHP would benefit from a function that can decode HTML text properly with safe defaults, and in a way that educates callers on what they are doing instead of confusing them. | Regardless, this is still a useful utility serving a very common need, and PHP would benefit from a function that can decode HTML text properly with safe defaults, and in a way that educates callers on what they are doing instead of confusing them. | ||
+ | |||
+ | <code php> | ||
+ | $src = decode_html( HtmlContext:: | ||
+ | if ( str_starts_with( $src, ' | ||
+ | return null; | ||
+ | } | ||
+ | </ | ||
==== HTML text parsing nuances ==== | ==== HTML text parsing nuances ==== | ||
Line 97: | Line 104: | ||
* Example: | * Example: | ||
* | * | ||
- | | + | |
| | ||
* | * | ||
* @param HtmlContext $context The provided HTML should be the full contents of an | * @param HtmlContext $context The provided HTML should be the full contents of an | ||
- | | + | |
- | * @return string Decoded form of input HTML. | + | * @return string Decoded form of input HTML with character references |
+ | | ||
*/ | */ | ||
function decode_html( HtmlContext $context, string $html ): string {} | function decode_html( HtmlContext $context, string $html ): string {} | ||
Line 112: | Line 120: | ||
* The replaced values for character references are UTF-8. | * The replaced values for character references are UTF-8. | ||
* The list of named character references is non-configurable. | * The list of named character references is non-configurable. | ||
- | * Calling code //must// indicate | + | * Calling code //must// indicate |
* The passed input is assumed to be the entire contents of the attribute value or text node - not a truncation thereof. | * The passed input is assumed to be the entire contents of the attribute value or text node - not a truncation thereof. | ||
- | This RFC does not propose solving | + | A new enum specifies supported HTML contexts. For the most part the enum specifies three internal properties: |
+ | |||
+ | * Are character references decoded? | ||
+ | * Are ambiguous ampersand references interpreted? | ||
+ | * Are NULL bytes replaced or removed? | ||
+ | |||
+ | While these could be handled via three boolean flags, that would require developers to understand the nuances involved in the different situations where they apply. By focusing the API on the kinds of situations developers work in, the burden is removed to know the internal details | ||
+ | |||
+ | <code php> | ||
+ | enum HtmlContext { | ||
+ | // A complete attribute value, single-quoted, double-quoted, or unquoted. | ||
+ | case Attribute; | ||
+ | |||
+ | // DATA content between tags: normal HTML text inside a BODY element. | ||
+ | case BodyText; | ||
+ | |||
+ | // Like BodyText, but found inside an SVG or MathML element where NULL bytes are treated differently. | ||
+ | case ForeignText; | ||
+ | |||
+ | // Text content inside of a SCRIPT element; nothing is escaped other than NULL bytes. | ||
+ | case Script; | ||
+ | |||
+ | // Identical to Script but left as a convenience/ | ||
+ | case Style; | ||
+ | |||
+ | // Identical to Script but left as a convenience/ | ||
+ | case Comment; | ||
+ | } | ||
+ | </ | ||
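The three properties listed above can be pictured as methods on an enum. The following is only a minimal sketch: the name `SketchHtmlContext`, the method names, the helper `sketch_handle_null_bytes()`, and the per-case answers are all assumptions drawn from the HTML parsing rules rather than from the RFC text, and `Style` and `Comment` are omitted because the RFC describes them as identical to `Script`.

```php
<?php
// Hypothetical sketch: the enum name, method names, and per-case answers are
// assumptions based on the HTML parsing rules, not taken from the RFC.
enum SketchHtmlContext
{
    case Attribute;
    case BodyText;
    case ForeignText;
    case Script;

    // Are character references decoded at all?
    public function decodesCharacterReferences(): bool
    {
        return $this !== self::Script;
    }

    // Taken here to mean: is a semicolon-less named reference decoded even
    // when "=" or an ASCII alphanumeric follows it? Attribute values leave
    // such text untouched, while text between tags decodes it.
    public function decodesAmbiguousAmpersands(): bool
    {
        return $this === self::BodyText || $this === self::ForeignText;
    }

    // Are NULL bytes removed (true) or replaced with U+FFFD (false)?
    public function removesNullBytes(): bool
    {
        return $this === self::BodyText;
    }
}

// Applies only the NULL-byte rule of the given context to $text.
function sketch_handle_null_bytes(SketchHtmlContext $context, string $text): string
{
    $substitute = $context->removesNullBytes() ? '' : "\u{FFFD}";
    return str_replace("\0", $substitute, $text);
}
```

A caller never sees these flags; it only names the context, which is the design choice the paragraph above argues for.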
+ | |||
+ | A few more contexts //could// exist, namely for elements where text content is not allowed ('' | ||
===== Backward Incompatible Changes ===== | ===== Backward Incompatible Changes ===== | ||
Line 133: | Line 171: | ||
'' | '' | ||
- | ==== New Constants | + | ==== New Enums ==== |
- | | + | A new '' |
- | * '' | + | |
+ | | ||
+ | * '' | ||
===== Open Issues ===== | ===== Open Issues ===== | ||
Line 178: | Line 218: | ||
===== Future Scope ===== | ===== Future Scope ===== | ||
- | This proposal entertains a related function, '' | + | This proposal entertains a related function, '' |
For example, a suite of functions could be provided to efficiently perform string operations in the encoded domain: | For example, a suite of functions could be provided to efficiently perform string operations in the encoded domain: | ||
Line 196: | Line 236: | ||
===== Implementation ===== | ===== Implementation ===== | ||
- | '' | + | There are two viable approaches to add this functionality into PHP: |
+ | |||
+ | * Write a spec-compliant and performant parser in C and maintain it internally. | ||
+ | * Incorporate | ||
+ | |||
+ | Ultimately it's the function interface and behavior this RFC proposes, leaving the implementation free as long as it conforms to the HTML specification. The associated PR with its implementation is a port of the design built into WordPress in PHP. | ||
+ | |||
+ | * Numeric character references are handled in a straightforward manner, except that a few optimizations take place before converting digits to numbers, so that the parser can skip that stage when it's known that the parsed number would be too large. This also protects against certain kinds of overflow and denial-of-service attacks. | ||
+ | * Named character references are split into groups which are keyed by their first two characters. For each group, a string contains a sequence of the rest of the character reference name and its substitution value, with leading bytes indicating length of the reference name and replacement. Lookup involves a couple of indirect memory branches and then proceeds in a way attempting to maximize cache locality. | ||
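The two strategies in the list above can be sketched in plain PHP. This is an illustration under stated assumptions, not the code from the associated PR: every function and variable name is hypothetical, the table holds only a handful of references, the digit-count limits (6 hex, 7 decimal) are one possible form of the "too large" shortcut, and the HTML rule that remaps 0x80 through 0x9F via Windows-1252 is left out.

```php
<?php
// Hypothetical sketch; none of these names come from the RFC or its PR.

// Decodes the digits of a numeric character reference into UTF-8.
// Assumes $digits was already validated as decimal or hex digits.
function sketch_decode_numeric(string $digits, bool $is_hex): string
{
    $significant = ltrim($digits, '0');
    // U+10FFFF needs at most 6 hex or 7 decimal digits; longer input is
    // rejected before conversion, so it can never overflow an integer.
    if (strlen($significant) > ($is_hex ? 6 : 7)) {
        return "\u{FFFD}";
    }
    $cp = $significant === '' ? 0 : intval($significant, $is_hex ? 16 : 10);
    $is_surrogate = $cp >= 0xD800 && $cp <= 0xDFFF;
    if ($cp === 0 || $cp > 0x10FFFF || $is_surrogate) {
        return "\u{FFFD}";
    }
    // Encode the code point as UTF-8 by hand to avoid needing mbstring.
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12)) . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18)) . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
}

// Packs [rest-of-name, replacement] pairs as length-prefixed byte runs.
function sketch_pack_group(array $pairs): string
{
    $packed = '';
    foreach ($pairs as [$rest, $value]) {
        $packed .= chr(strlen($rest)) . $rest . chr(strlen($value)) . $value;
    }
    return $packed;
}

// Looks up the named reference starting at the "&" found at byte $at.
// Returns [replacement, bytes consumed] for the longest match, or null.
function sketch_lookup_named(array $groups, string $text, int $at): ?array
{
    // The group key is the two bytes that follow the ampersand.
    $key = substr($text, $at + 1, 2);
    if (!isset($groups[$key])) {
        return null;
    }
    $packed = $groups[$key];
    $best = null;
    for ($i = 0, $end = strlen($packed); $i < $end;) {
        $name_len = ord($packed[$i]);
        $name = substr($packed, $i + 1, $name_len);
        $value_len = ord($packed[$i + 1 + $name_len]);
        $value = substr($packed, $i + 2 + $name_len, $value_len);
        $i += 2 + $name_len + $value_len;
        $matches = substr($text, $at + 3, $name_len) === $name;
        if ($matches && ($best === null || 3 + $name_len > $best[1])) {
            $best = [$value, 3 + $name_len];
        }
    }
    return $best;
}

// Tiny illustrative table; the real list of named references is far larger.
$sketch_groups = [
    'am' => sketch_pack_group([['p;', '&'], ['p', '&']]),
    'no' => sketch_pack_group([['t;', "\u{AC}"], ['tin;', "\u{2209}"], ['t', "\u{AC}"]]),
];
```

Keeping each group in one contiguous string means a lookup touches the group table once and then scans adjacent bytes, which is the cache-locality idea the second bullet describes.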
+ | |||
+ | https:// | ||
===== References ===== | ===== References ===== |
rfc/decode_html.1724106298.txt.gz · Last modified: 2024/08/19 22:24 by dmsnell