rfc:decode_html

Differences

This shows you the differences between two versions of the page.

rfc:decode_html [2024/08/16 00:56] – Typo with monospace syntax dmsnell rfc:decode_html [2024/09/06 18:58] (current) – Fix list bullet syntax. dmsnell
Line 17: Line 17:
  
 Regardless, this is still a useful utility serving a very common need, and PHP would benefit from a function that can decode HTML text properly with safe defaults, and in a way that educates callers on what they are doing instead of confusing them. Regardless, this is still a useful utility serving a very common need, and PHP would benefit from a function that can decode HTML text properly with safe defaults, and in a way that educates callers on what they are doing instead of confusing them.
 +
 +<code php>
 +$src = decode_html( HtmlContext::Attribute, $src_attribute );
 +if ( str_starts_with( $src, 'javascript:' ) ) {
 + return null;
 +}
 +</code>
  
 ==== HTML text parsing nuances ==== ==== HTML text parsing nuances ====
Line 44: Line 51:
 === Invalid numeric character references === === Invalid numeric character references ===
  
-HTML not only specifies normative behavior, but it also specifies how to handle failure modes. For example, if a numeric character reference references an invalid Unicode character. It does not give allowance for handling errors differently per parser. Notably, most invalid numeric character references decode to the replacement character U+FFFD ''�'', //not// to plaintext.+HTML not only specifies normative behavior, but it also specifies how to handle failure modes (for example, if a numeric character reference references an invalid Unicode character). It does not give allowance for handling errors differently per parser. Notably, most invalid numeric character references decode to the replacement character U+FFFD ''�'', //not// to plaintext. 
 + 
 +=== Concessions for Windows-1252 encodings === 
 + 
 +''Windows-1252'' differs from ''ISO-8859-1'' (''latin1'') in that it remaps 27 of the 32 C1 control characters. Since Windows computers used this encoding as the default and these C1 control bytes appeared around the internet, HTML interprets them as if they were ''Windows-1252'' even if the declared encoding is ''ISO-8859-1''. 
 + 
 +Furthermore, whenever a numeric character reference refers to one of the code points in this range, it ought to be remapped as if it were Windows-1252, even if the document encoding is UTF-8. Because ''html_entity_decode()'' historically hasn't handled these, the web is rife with mojibake for smart quotes, typographic dashes, some European characters, and more. 
 + 
 +|                                           ^ Proper Decode\\ for ''ISO-8859-1'' ^ Proper Decode\\ for ''UTF-8'' ^ Actual Decode\\ (ISO-8859-1 //and// UTF-8) ^ 
 +^ ''html_entity_decode( $text, $html5flags, $encoding )'' |||| 
 +^ ''"\x93&#x93;"'' | ''““'' | ''�“'' | ''�&#x93;'' |
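
The remapping described above can be sketched in userland PHP. This is only an illustration under stated assumptions: the constant ''C1_TO_WINDOWS_1252'' and the function ''remap_c1_code_point()'' are hypothetical names that are not part of this proposal, and the table follows the numeric character reference replacements listed in the HTML specification.

<code php>
<?php
// Hypothetical lookup table (not part of the proposal): the 27 C1 code
// points that HTML remaps as if they were Windows-1252 bytes.
const C1_TO_WINDOWS_1252 = [
    0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
    0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
    0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
    0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
    0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
    0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
    0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
];

// Hypothetical helper: returns the code point a numeric character
// reference should decode to. The five unmapped C1 controls and every
// code point outside the C1 range pass through unchanged.
function remap_c1_code_point( int $code_point ): int {
    if ( array_key_exists( $code_point, C1_TO_WINDOWS_1252 ) ) {
        return C1_TO_WINDOWS_1252[ $code_point ];
    }

    return $code_point;
}

// "&#x93;" refers to U+0093 but should decode as a left double quotation mark.
printf( "U+%04X\n", remap_c1_code_point( 0x93 ) ); // prints "U+201C"
</code>

The lookup works on code points rather than bytes, so it applies regardless of the document encoding.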
  
 === Context-based parsing === === Context-based parsing ===
Line 56: Line 73:
 ==== Character encodings and UTF-8 ==== ==== Character encodings and UTF-8 ====
  
-Properly handling text encoding in web applications can be an incredibly complicated task. One confusing aspect is sorting through which encoding context is important:+Properly handling text encoding in web applications can be an incredibly complicated task. One confusing aspect is sorting through all of the possible ways a string may be encoded.
  
   * PHP maintains its own default encoding.   * PHP maintains its own default encoding.
Line 63: Line 80:
   * ''$_GET'' values come percent-encoded as bytes from an unspecified character set.   * ''$_GET'' values come percent-encoded as bytes from an unspecified character set.
   * Database query results come in whatever character set is active for the current connection.   * Database query results come in whatever character set is active for the current connection.
 +  * Source code is almost always written as UTF-8 bytes, and its string literals get mixed into text functions.
  
 The only commonality in handling text encodings is that in a web application there are usually many possible sources of data being read from and written to together, and the running PHP code has limited control over the data's character encoding. The only commonality in handling text encodings is that in a web application there are usually many possible sources of data being read from and written to together, and the running PHP code has limited control over the data's character encoding.
Line 86: Line 104:
  * Example:  * Example:
  *  *
-     "Cats & dogs etc…" === html_decode( HTML_TEXT, "Cats &amp dogs etc&hellip;" );+     "Cats & ¬dogs etc…"     === decode_html( HtmlContext::Text, "Cats &amp &notdogs etc&hellip;" ); 
 +     "/search?q=cat&not=dog" === decode_html( HtmlContext::Attribute, "/search?q=cat&not=dog" );
  *  *
- * @param int $context HTML_ATTRIBUTE if sourced from within an HTML attribute, or + * @param HtmlContext $context The provided HTML should be the full contents of an 
-                     HTML_TEXT if sourced from a text node. +                             HTML attribute, or a text node not containing other HTML
- * @return string Decoded form of input HTML.+ * @return string Decoded form of input HTML with character references 
 +                replaced by their UTF-8 substitutions.
  */  */
-function decode_html( int $context, string $html ): string {}+function decode_html( HtmlContext $context, string $html ): string {}
 </code> </code>
  
Line 100: Line 120:
   * The replaced values for character references are UTF-8.   * The replaced values for character references are UTF-8.
   * The list of named character references is non-configurable.   * The list of named character references is non-configurable.
-  * Calling code //must// indicate whether the text originates from an attribute value or "data" (which is text content outside of tags and other syntax).+  * Calling code //must// indicate where the input came from, whether from an attribute value or "data," or somewhere else special.
   * The passed input is assumed to be the entire contents of the attribute value or text node - not a truncation thereof.   * The passed input is assumed to be the entire contents of the attribute value or text node - not a truncation thereof.
  
-This RFC does not propose solving the problem of HTML tag-dependent decoding, for example, inside a ''SCRIPT'' element. For cases where HTML character references should not be decoded, this function leaves the responsibility with the caller.+A new enum specifies supported HTML contexts. For the most part the enum specifies three internal properties: 
 + 
 +  * Are character references decoded? 
 +  * Are ambiguous ampersand references interpreted? 
 +  * Are NULL bytes replaced or removed? 
 + 
 +While these could be handled via three boolean flags, that would require developers to understand the nuances involved in the different situations where they apply. By focusing the API on the kinds of situations developers work in, developers are relieved of the burden of knowing the internal details of HTML parsing. For this reason there is overlap in the ''Script'', ''Style'', and ''Comment'' contexts, because their parsing rules are identical. 
 + 
 +<code php> 
 +enum HtmlContext { 
 + // A complete attribute value, single-quoted, double-quoted, or unquoted. 
 + case Attribute; 
 + 
 + // DATA content between tags: normal HTML text inside a BODY element. 
 + case BodyText; 
 + 
 + // Like BodyText, but found inside an SVG or MathML element where NULL bytes are treated differently. 
 + case ForeignText; 
 + 
 + // Text content inside of a SCRIPT element; nothing is escaped other than NULL bytes. 
 + case Script; 
 + 
 + // Identical to Script but left as a convenience/education tool. 
 + case Style; 
 + 
 + // Identical to Script but left as a convenience/education tool. 
 + case Comment; 
 +}
 +</code> 
 + 
 +A few more contexts //could// exist, namely for elements where text content is not allowed (''IFRAME'', ''NOEMBED'', and ''NOFRAMES'') and the deprecated ''XMP'' element. For the sake of avoiding unnecessary complexity they are left out of this proposal, but their inclusion would not bloat the implementation in any way.
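
As a rough illustration of how one of the three internal properties could be derived from the context, the sketch below answers the first question ("Are character references decoded?") for each case. The enum is declared here only so the example runs on PHP 8.1 or newer; the helper ''decodes_character_references()'' is a hypothetical name and is not part of this proposal.

<code php>
<?php
// Mirrors the enum proposed above so that the sketch is self-contained.
enum HtmlContext {
    case Attribute;
    case BodyText;
    case ForeignText;
    case Script;
    case Style;
    case Comment;
}

// Hypothetical helper: character references are decoded in attribute
// values and in text nodes, but not in SCRIPT, STYLE, or comment content.
function decodes_character_references( HtmlContext $context ): bool {
    return match ( $context ) {
        HtmlContext::Attribute,
        HtmlContext::BodyText,
        HtmlContext::ForeignText => true,

        HtmlContext::Script,
        HtmlContext::Style,
        HtmlContext::Comment     => false,
    };
}

var_dump( decodes_character_references( HtmlContext::Attribute ) ); // bool(true)
var_dump( decodes_character_references( HtmlContext::Script ) );    // bool(false)
</code>

Because ''match'' is exhaustive over the enum cases, adding a new context without deciding its behavior would fail loudly instead of silently falling through.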
  
 ===== Backward Incompatible Changes ===== ===== Backward Incompatible Changes =====
Line 121: Line 171:
 ''[NEEDS FURTHER ANALYSIS]'' ''[NEEDS FURTHER ANALYSIS]''
  
-==== New Constants ====+==== New Enums ====
  
-  * ''HTML_ATTRIBUTE'' - refers to an HTML attribute. Used in the proposed function to indicate that the input string value was an HTML attribute value. +A new ''HtmlContext'' enum indicates the provenance of an HTML string, whether from an attribute or from text content. 
-  * ''HTML_TEXT'' - refers to an HTML text node/data/markup. Used in the proposed function to indicate that the input string value was an HTML text node.+ 
 +  * ''HtmlContext::Attribute'' - refers to an HTML attribute. Used in the proposed function to indicate that the input string value was an HTML attribute value. 
 +  * ''HtmlContext::Text'' - refers to an HTML text node/data/markup. Used in the proposed function to indicate that the input string value was an HTML text node.
  
 ===== Open Issues ===== ===== Open Issues =====
Line 166: Line 218:
 ===== Future Scope ===== ===== Future Scope =====
  
-This proposal entertains a related function, ''decode_html_ref( int $context, string $html, int $offset, &$matched_byte_length = null ): ?string'' which examines a string at a given offset and returns a replacement character if there is a character reference at the given offset, otherwise ''NULL''. This function is a useful component in building a variety of features that allow for partial decoding of an HTML string.+This proposal entertains a related function, ''decode_html_ref( HtmlContext $context, string $html, int $offset, &$matched_byte_length = null ): ?string'' which examines a string at a given offset and returns a replacement character if there is a character reference at the given offset, otherwise ''NULL''. This function is a useful component in building a variety of features that allow for partial decoding of an HTML string.
  
 For example, a suite of functions could be provided to efficiently perform string operations in the encoded domain: For example, a suite of functions could be provided to efficiently perform string operations in the encoded domain:
Line 184: Line 236:
 ===== Implementation ===== ===== Implementation =====
  
-''[YET TO COME]''+There are two viable approaches to add this functionality into PHP: 
 + 
 +  * Write a spec-compliant and performant parser in C and maintain it internally. 
 +  * Incorporate ''lexbor'' into the language (vs. being an extension) and write functions that use its existing parsing mechanisms. 
 + 
 +Ultimately, this RFC proposes the function interface and behavior, leaving the implementation free as long as it conforms to the HTML specification. The associated PR is a port of the design built into WordPress in PHP. 
 + 
 +  * Numeric character references are handled in a straightforward manner, except that a few optimizations take place before converting digits to numbers, so the parser can skip that stage when the parsed number is known to be too large. This also protects against certain kinds of overflow and denial-of-service attacks. 
 +  * Named character references are split into groups which are keyed by their first two characters. For each group, a string contains a sequence of the rest of the character reference name and its substitution value, with leading bytes indicating length of the reference name and replacement. Lookup involves a couple of indirect memory branches and then proceeds in a way attempting to maximize cache locality. 
 + 
 +https://github.com/php/php-src/pull/14927
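
The digit-count shortcut for numeric character references can be sketched as follows. This is not the code from the linked PR; it is a minimal userland illustration, and ''numeric_reference_code_point()'' is a hypothetical name. It assumes the caller has already isolated a run of valid digits for the given base.

<code php>
<?php
// Hypothetical helper: returns the referenced code point, or null when the
// value lies beyond U+10FFFF (the caller would then substitute U+FFFD).
function numeric_reference_code_point( string $digits, int $base ): ?int {
    // Leading zeros never change the value, so ignore them before counting.
    $significant = ltrim( $digits, '0' );

    // U+10FFFF is written with 6 hexadecimal or 7 decimal digits. Anything
    // longer is out of range, so the conversion step can be skipped entirely,
    // which also avoids integer overflow on absurdly long inputs.
    $max_digits = 16 === $base ? 6 : 7;
    if ( strlen( $significant ) > $max_digits ) {
        return null;
    }

    $code_point = intval( $significant, $base );

    return $code_point > 0x10FFFF ? null : $code_point;
}

var_dump( numeric_reference_code_point( '0000000093', 16 ) );            // int(147)
var_dump( numeric_reference_code_point( str_repeat( '9', 5000 ), 10 ) ); // NULL
</code>

The length check bounds the work done per reference, no matter how many digits an attacker supplies.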
  
 ===== References ===== ===== References =====
rfc/decode_html.1723769788.txt.gz · Last modified: 2024/08/16 00:56 by dmsnell