ext/xml allows the user to parse XML in an event-driven way (SAX). The user can register callbacks to be called when certain nodes are encountered while parsing. In a sense, this is a streaming parsing model: the user's callbacks are invoked while parsing is still happening. This RFC attempts to address a feature request on the old bugtracker: https://bugs.php.net/bug.php?id=68325.
Let's break it down more clearly:
First, it's important to note that the ext/xml extension can work with two different XML parsers: either libexpat or libxml2, with libxml2 being the more commonly used of the two.
Now, let's get to the actual issue:
Starting with libxml2 version 2.7.0 (https://github.com/GNOME/libxml2/commit/8915c150b5630178b0f9e83f0d911090095b58a1), parsing large input data (*) is no longer allowed by default; it must be explicitly enabled. This change was made to prevent potential denial-of-service attacks. However, this modification unintentionally disrupted a legitimate use-case involving the xml_parse and xml_parse_into_struct functions. Attempting to parse large documents using these functions now results in a parsing error.
There is a workaround for xml_parse: parsing in chunks. This is cumbersome when the data is already in memory, because you have to split the data into chunks yourself. Ironically, this increases memory usage instead of limiting it. The workaround is not available for xml_parse_into_struct, because that function cannot be used with chunked parsing.
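The chunked workaround for xml_parse can be sketched roughly as follows. This is an illustration only: parseInChunks is a hypothetical helper name, and the chunk size and sample document are arbitrary choices made for the example. It relies on the existing third argument of xml_parse, which marks the final chunk.

```php
<?php
// Hypothetical helper: feeds an in-memory string to xml_parse() piece by piece.
function parseInChunks(XMLParser $parser, string $xml, int $chunkSize = 1048576): bool
{
    $length = strlen($xml);
    $offset = 0;
    do {
        // substr() copies each piece; this is the extra memory cost of the workaround.
        $chunk = substr($xml, $offset, $chunkSize);
        $offset += $chunkSize;
        $isFinal = $offset >= $length;
        // xml_parse() returns 1 on success and 0 on failure.
        if (xml_parse($parser, $chunk, $isFinal) !== 1) {
            return false;
        }
    } while (!$isFinal);

    return true;
}

$parser = xml_parser_create();
$elementCount = 0;
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$elementCount) {
        $elementCount++; // Count every start tag seen so far
    },
    function ($parser, $name) {
        // Nothing to do on end tags in this sketch
    }
);

// Small stand-in document; a real case would be well above the libxml2 limit.
$xml = '<root>' . str_repeat('<item>x</item>', 1000) . '</root>';
$ok = parseInChunks($parser, $xml, 4096);
```

The handlers are still invoked while chunks are being fed, so the streaming behaviour is unchanged; only the caller has to do the splitting.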
This proposal aims to solve this issue by introducing a new parser option.
(*) What counts as large is defined in parserInternals.h in libxml2, and could potentially be changed by patching and recompiling libxml2. Currently the limit is a document size of 10MB (not MiB), and there is also a maximum name length of 50K characters. Note that these limits can vary depending on configuration and version.
It's possible to set parser options via xml_parser_set_option. The idea is to add a new parser option that takes effect when libxml2 is used. Enabling this (boolean) option will allow parsing large documents. The option will be called XML_OPTION_PARSE_HUGE, so that will be a new integer constant added to the global namespace. The default value of the option is false because that's the behaviour right now, and therefore the denial-of-service prevention will still be active by default (which is useful for untrusted data).
Internally, this option will pass XML_PARSE_HUGE to libxml2, allowing large documents to be parsed without resulting in a parse error. If libexpat is used, this option will do nothing as libexpat does not block loading large documents anyway.
It's worth noting that extensions like SimpleXML and DOM can run into the same problem. However, there you already have the option LIBXML_PARSEHUGE to work around this issue. The constant XML_OPTION_PARSE_HUGE would be the ext/xml equivalent of LIBXML_PARSEHUGE.
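For comparison, this is roughly how the existing LIBXML_PARSEHUGE flag is passed in DOM and SimpleXML today. The tiny document is a stand-in chosen for the example; the flag only matters for inputs that exceed the libxml2 limits.

```php
<?php
// Stand-in for a very large document that is already in memory.
$xml = '<root><item>example</item></root>';

// DOM: the flag goes into the options bitmask of loadXML().
$doc = new DOMDocument();
$domOk = $doc->loadXML($xml, LIBXML_PARSEHUGE);

// SimpleXML: the flag goes into the third (options) argument.
$sxe = simplexml_load_string($xml, SimpleXMLElement::class, LIBXML_PARSEHUGE);
```

As with the proposed ext/xml option, passing this flag lifts the libxml2 safeguards, so it is best reserved for trusted input.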
function startElement($parser, $name, $attrs) {
    // Do something interesting
}

function endElement($parser, $name) {
    // Do something interesting
}

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_PARSE_HUGE, true); // Changing this to false, or not executing this line, will cause the parsing to error out on large inputs
xml_set_element_handler($parser, "startElement", "endElement");
// Add more handlers
$success = xml_parse($parser, $my_long_xml_input_already_in_memory);
If you try to change the huge parsing option while parsing is in progress, e.g. in one of the callback handlers, an Error exception will be raised. That's because it is a programming error to do so, not an expected failure.
Example:
<?php
function startElement($parser, $name, $attrs) {
    xml_parser_set_option($parser, XML_OPTION_PARSE_HUGE, false);
}

function endElement($parser, $name) {
    // Do something interesting
}

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_PARSE_HUGE, true);
xml_set_element_handler($parser, "startElement", "endElement");
// Add more handlers
$success = xml_parse($parser, "<xml></xml>");
Results in: Fatal error: Uncaught Error: Cannot change option XML_OPTION_PARSE_HUGE while parsing in example.php:3
No BC breaks unless the user defined a global constant XML_OPTION_PARSE_HUGE themselves.
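Code that also has to run on PHP versions without this option could guard the call, for example as in the sketch below. This is an assumption about how callers might adapt, not part of the proposal itself, and $hugeEnabled is just an illustrative variable name.

```php
<?php
$parser = xml_parser_create();

// Only touch the option when the running PHP version defines the constant.
$hugeEnabled = false;
if (defined('XML_OPTION_PARSE_HUGE')) {
    $hugeEnabled = xml_parser_set_option($parser, XML_OPTION_PARSE_HUGE, true);
}

// On versions without the constant, large inputs still need the chunked
// xml_parse() workaround, since the libxml2 limit stays in effect.
```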
Next PHP 8.x.
No changes.
It only impacts ext/xml.
No changes.
Adds a single integer constant to the global namespace: XML_OPTION_PARSE_HUGE ( = 5). Intended to be used only inside ext/xml.
No changes.
None yet.
Everything outside of ext/xml.
None yet.
One primary vote (requires 2/3 majority): add XML_OPTION_PARSE_HUGE parsing option?
Implementation: https://github.com/php/php-src/pull/12256
* 0.9.1: Fixed libxml2 version, clarified limit, added code sample, linked to equivalent constant
* 0.9.0: First version under discussion