rfc:xml_option_parse_huge

Differences

This shows you the differences between two versions of the page.

rfc:xml_option_parse_huge [2023/09/21 20:46] nielsdos
rfc:xml_option_parse_huge [2023/10/22 15:53] (current) – implemented nielsdos
Line 1: Line 1:
-====== PHP RFC: PHP_XML_OPTION_PARSE_HUGE ====== +====== PHP RFC: XML_OPTION_PARSE_HUGE ====== 
-  * Version: 0.9+  * Version: 0.9.1
   * Date: 2023-09-21   * Date: 2023-09-21
   * Author: Niels Dossche, nielsdos@php.net   * Author: Niels Dossche, nielsdos@php.net
-  * Status: Draft+  * Status: Implemented 
 +  * Implementation: https://github.com/php/php-src/commit/98b08c52db01609249ab2816ff25852a3cc0ad81
   * First Published at: https://wiki.php.net/rfc/xml_option_parse_huge   * First Published at: https://wiki.php.net/rfc/xml_option_parse_huge
  
 ===== Introduction ===== ===== Introduction =====
  
-ext/xml allows the user to parse XML in an event-driven way (SAX). The user can register callbacks to be called when certain nodes are encountered while parsing. In a sense, this is a streaming parsing model.+ext/xml allows the user to parse XML in an event-driven way (SAX). The user can register callbacks to be called when certain nodes are encountered while parsing. In a sense, this is a streaming parsing model: the user's callbacks are invoked while parsing is still happening.
 This RFC attempts to address a feature request on the old bugtracker: https://bugs.php.net/bug.php?id=68325. This RFC attempts to address a feature request on the old bugtracker: https://bugs.php.net/bug.php?id=68325.
  
Line 15: Line 16:
 First, it's important to note that the ext/xml extension can work with two different XML parsers: either libexpat or libxml2, with libxml2 being the more commonly used of the two. First, it's important to note that the ext/xml extension can work with two different XML parsers: either libexpat or libxml2, with libxml2 being the more commonly used of the two.
  
-Now, let's get to the heart of the issue:+Now, let's get to the actual issue:
  
-Starting with libxml2 version 2.7.6, parsing large input data is no longer allowed by default; it must be explicitly enabled. This change was made to prevent potential denial-of-service attacks. However, this modification unintentionally disrupted a legitimate use-case involving the <php>xml_parse</php> and <php>xml_parse_into_struct</php> functions. Attempting to parse large documents using these methods now results in a parsing error.+Starting with libxml2 version 2.7.0 (https://github.com/GNOME/libxml2/commit/8915c150b5630178b0f9e83f0d911090095b58a1), parsing large input data (*) is no longer allowed by default; it must be explicitly enabled. This change was made to prevent potential denial-of-service attacks. However, this modification unintentionally disrupted a legitimate use-case involving the <php>xml_parse</php> and <php>xml_parse_into_struct</php> functions. Attempting to parse large documents using these methods now results in a parsing error.
 There is a workaround for <php>xml_parse</php>: parse in chunks. This is cumbersome if the data is already in memory, because you have to split the data into chunks yourself. Ironically, this increases memory usage rather than limiting it. The workaround does not apply to <php>xml_parse_into_struct</php>, which cannot be used with chunked parsing. There is a workaround for <php>xml_parse</php>: parse in chunks. This is cumbersome if the data is already in memory, because you have to split the data into chunks yourself. Ironically, this increases memory usage rather than limiting it. The workaround does not apply to <php>xml_parse_into_struct</php>, which cannot be used with chunked parsing.
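To make the chunking workaround concrete, here is a minimal sketch. The helper name `parseInChunks` and the chunk sizes are assumptions made for illustration; only `xml_parse()` with its final-chunk flag comes from ext/xml. Note the `substr()` copy per chunk, which is the extra memory cost mentioned above.

```php
<?php
// Hypothetical helper (not part of ext/xml): feeds an in-memory string to
// xml_parse() in fixed-size chunks and flags the last chunk as final.
function parseInChunks(XMLParser $parser, string $xml, int $chunkSize = 1_000_000): bool
{
    $length = strlen($xml);
    if ($length === 0) {
        // Nothing to split; let the parser report the empty document itself.
        return xml_parse($parser, '', true) === 1;
    }
    for ($offset = 0; $offset < $length; $offset += $chunkSize) {
        $isFinal = $offset + $chunkSize >= $length;
        // substr() copies the chunk, so memory usage goes up, not down.
        if (xml_parse($parser, substr($xml, $offset, $chunkSize), $isFinal) !== 1) {
            return false;
        }
    }
    return true;
}

$elements = 0;
$parser = xml_parser_create();
xml_set_element_handler(
    $parser,
    function ($parser, string $name, array $attrs) use (&$elements): void {
        $elements++; // count start tags
    },
    function ($parser, string $name): void {
        // nothing to do on end tags in this sketch
    }
);

// A tiny chunk size is used here only to force several xml_parse() calls.
$ok = parseInChunks($parser, '<root><item/><item/></root>', 8);
```

This only helps with <php>xml_parse</php>; there is no equivalent for <php>xml_parse_into_struct</php>.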
  
-This proposal aims to solve these issues by introducing a new option.+This proposal aims to solve this issue by introducing a new parser option.
 + 
 +(*) What counts as "large" is defined in [[https://github.com/GNOME/libxml2/blob/fc26934eb0b8f66dab262465226ec14eac7cb3e8/include/libxml/parserInternals.h#L42|parserInternals.h]] in libxml2, and could be changed by patching and recompiling libxml2. Currently the limit is a document size of 10MB (not MiB), and there is also a [[https://github.com/GNOME/libxml2/blob/fc26934eb0b8f66dab262465226ec14eac7cb3e8/include/libxml/parserInternals.h#L61|maximum name length]] of 50K characters. Note that these limits can change depending on configuration and version.
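As a rough illustration of the 10MB figure quoted above, the following sketch checks whether an in-memory document is larger than that threshold. The constant and function names are hypothetical, and the value is only the default mentioned in the footnote; a differently configured or patched libxml2 may behave differently, so treat this as a heuristic rather than a guarantee.

```php
<?php
// Assumed default taken from the footnote above: 10 MB, i.e. 10,000,000 bytes.
// For comparison, 10 MiB would be 10 * 1024 * 1024 = 10,485,760 bytes.
const ASSUMED_LARGE_DOCUMENT_BYTES = 10_000_000;

// Hypothetical helper: returns true if the input is bigger than the assumed
// default threshold and may therefore be rejected without huge parsing.
function mayExceedDefaultLimit(string $xml): bool
{
    return strlen($xml) > ASSUMED_LARGE_DOCUMENT_BYTES;
}

$small = '<root/>';
$needsHugeOption = mayExceedDefaultLimit($small);
```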
  
 ===== Proposal ===== ===== Proposal =====
  
-It's possible to set parser options via <php>xml_set_parser_option</php>. The idea is to add a new parser option that takes effect when libxml2 is used. The boolean option will be called <php>XML_OPTION_PARSE_HUGE</php>, so that will be a new integer constant added to the global namespace. The default value is <php>false</php> because that's the behaviour right now, and therefore will still protect against denial of service attacks in case of untrusted documents.+It's possible to set parser options via <php>xml_parser_set_option</php>. The idea is to add a new parser option that takes effect when libxml2 is used. Enabling this (boolean) option will allow parsing large documents. The option will be called <php>XML_OPTION_PARSE_HUGE</php>, so that will be a new integer constant added to the global namespace. The default value of the option is <php>false</php> because that's the behaviour right now, and therefore the denial-of-service prevention will still be active by default (which is useful for untrusted data).
  
 Internally, this option will pass XML_PARSE_HUGE to libxml2, allowing large documents to be parsed without resulting in a parse error. Internally, this option will pass XML_PARSE_HUGE to libxml2, allowing large documents to be parsed without resulting in a parse error.
 If libexpat is used, this option will do nothing as libexpat does not block loading large documents anyway. If libexpat is used, this option will do nothing as libexpat does not block loading large documents anyway.
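Code that has to run on PHP versions both with and without this option could guard on the constant's existence. The sketch below assumes a hypothetical helper named `enableHugeParsing`; it is not part of the proposal, only one possible way to use it.

```php
<?php
// Hypothetical helper: enables huge parsing when the running PHP version
// provides XML_OPTION_PARSE_HUGE, and reports whether it did so.
function enableHugeParsing(XMLParser $parser): bool
{
    if (!defined('XML_OPTION_PARSE_HUGE')) {
        // Older PHP version: the option does not exist, keep the defaults.
        return false;
    }
    return xml_parser_set_option($parser, XML_OPTION_PARSE_HUGE, true);
}

$parser = xml_parser_create();
$enabled = enableHugeParsing($parser);
```

When the helper returns false, the chunked parsing workaround remains the fallback for <php>xml_parse</php>.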
 +
 +It's worth noting that for extensions like SimpleXML and DOM extensions, you can run into the same problem. However, there you //do// have the option <php>LIBXML_PARSEHUGE</php> already to work around this issue. The constant <php>XML_OPTION_PARSE_HUGE</php> would be the ext/xml equivalent for <php>LIBXML_PARSEHUGE</php>.
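For comparison, this is roughly how <php>LIBXML_PARSEHUGE</php> is passed in SimpleXML and DOM today. The XML input is a made-up placeholder; a real use case would involve a document large enough to hit the libxml2 limits.

```php
<?php
// Placeholder input for illustration only.
$xml = '<root><item id="1"/></root>';

// SimpleXML: libxml options are the third argument.
$sx = simplexml_load_string($xml, SimpleXMLElement::class, LIBXML_PARSEHUGE);

// DOM: libxml options are the second argument of loadXML().
$dom = new DOMDocument();
$loaded = $dom->loadXML($xml, LIBXML_PARSEHUGE);
```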
 +
 +==== Example Usage ====
 +
 +<PHP>
 +function startElement($parser, $name, $attrs) {
 +    // Do something interesting
 +}
 +function endElement($parser, $name) {
 +    // Do something interesting
 +}
 +$parser = xml_parser_create();
 +xml_parser_set_option($parser, XML_OPTION_PARSE_HUGE, true); // Changing this to false, or not executing this line, will cause the parsing to error out on large inputs
 +xml_set_element_handler($parser, "startElement", "endElement");
 +// Add more handlers
 +$success = xml_parse($parser, $my_long_xml_input_already_in_memory);
 +</PHP>
 +
 +If you try to change the huge parsing option while parsing is busy, e.g. in one of the callback handlers, an <php>Error</php> exception will be raised. That's because it is a programming error to do so, not an expected failure.
 +Example:
 +
 +<PHP>
 +<?php
 +function startElement($parser, $name, $attrs) {
 +    xml_parser_set_option($parser, XML_OPTION_PARSE_HUGE, false);
 +}
 +function endElement($parser, $name) {
 +    // Do something interesting
 +}
 +$parser = xml_parser_create();
 +xml_parser_set_option($parser, XML_OPTION_PARSE_HUGE, true);
 +xml_set_element_handler($parser, "startElement", "endElement");
 +// Add more handlers
 +$success = xml_parse($parser, "<xml></xml>");
 +</PHP>
 +
 +Results in:
 +Fatal error: Uncaught Error: Cannot change option XML_OPTION_PARSE_HUGE while parsing in example.php:3
 +
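Since this is a regular <php>Error</php>, it can be observed with try/catch, for instance in a test. The sketch below is a variation of the example above under two assumptions: the constant is guarded with `defined()` so the script also runs on PHP versions without the option, and the variable `$message` is introduced only for illustration. In application code the right fix is to not change the option during parsing, rather than to catch the error.

```php
<?php
function startElement($parser, $name, $attrs) {
    // Programming error: the option is changed while parsing is in progress.
    xml_parser_set_option($parser, XML_OPTION_PARSE_HUGE, false);
}
function endElement($parser, $name) {
    // Nothing to do in this sketch
}

$message = null;
if (defined('XML_OPTION_PARSE_HUGE')) {
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_PARSE_HUGE, true);
    xml_set_element_handler($parser, "startElement", "endElement");
    try {
        xml_parse($parser, "<xml></xml>", true);
    } catch (Error $e) {
        // Keep the message for inspection instead of letting it go uncaught.
        $message = $e->getMessage();
    }
}
```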
  
 ===== Backward Incompatible Changes ===== ===== Backward Incompatible Changes =====
  
-No BC breaks unless the user defined a global constant XML_OPTION_PARSE_HUGE themselves.+No BC breaks unless the user defined a global constant <php>XML_OPTION_PARSE_HUGE</php> themselves.
  
 ===== Proposed PHP Version(s) ===== ===== Proposed PHP Version(s) =====
Line 72: Line 115:
 ===== Proposed Voting Choices ===== ===== Proposed Voting Choices =====
  
-One primary vote (requires 2/3 majority): add PHP_XML_OPTION_PARSE_HUGE?+One primary vote (requires 2/3 majority): add XML_OPTION_PARSE_HUGE parsing option? 
 + 
 +<doodle title="Add XML_OPTION_PARSE_HUGE parsing option" auth="nielsdos" voteType="single" closed="true" closeon="2023-10-21T21:10:00+02:00"> 
 +   * Yes 
 +   * No 
 +</doodle>
  
 ===== Patches and Tests ===== ===== Patches and Tests =====
Line 79: Line 127:
  
 ===== Implementation ===== ===== Implementation =====
-After the project is implemented, this section should contain  + 
-  - the version(s) it was merged into +Merged into 8.4: https://github.com/php/php-src/commit/98b08c52db01609249ab2816ff25852a3cc0ad81 
-  a link to the git commit(s) + 
-  - a link to the PHP manual entry for the feature +===== Changelog ===== 
-  - a link to the language specification section (if any)+ 
 +* 0.9.1: Fixed libxml2 version, clarified limit, added code sample, linked to equivalent constant 
 +* 0.9.0: First version under discussion
  
 ===== References ===== ===== References =====
rfc/xml_option_parse_huge.1695329202.txt.gz · Last modified: 2023/09/21 20:46 by nielsdos