The DOM extension in PHP is used to parse, query, and manipulate XML/HTML documents. The DOM extension is supposed to follow the DOM specification. Originally this was the DOM Core Level 3 specification, but nowadays, that specification has evolved into the current "Living Specification" maintained by WHATWG. Unfortunately, there are many bugs in PHP's DOM extension. Most of those bugs are related to namespace and attribute handling. This leads to people trying to work around those bugs by relying on more bugs, or on undocumented side-effects of incorrect behaviour, leading to even more issues in the end. Furthermore, some of these bugs may have security implications. Note that the bugs are not HTML-exclusive, but also apply to XML documents.
Some of these bugs are caused because the method or property was implemented incorrectly back in the day, or because the original DOM 3 specification used to be unclear. A smaller part of this is because the specification has made breaking changes when HTML 5 first came along and the specification creators had to unify what browsers implemented into a single specification that everyone agreed on.
It's not possible to “just fix” these bugs because people actually rely on these bugs. They are also often unaware that what they're doing is actually incorrect or causes the internal document state to be inconsistent. We therefore have to fix this in a backwards-compatible way: i.e. a hard requirement is that all code written for the current DOM extension keeps working without requiring changes. In summary, the core challenge lies in the fact that two decades of buggy behavior have become deeply ingrained in the system.
It is clear that any behavioural fix must come in an opt-in manner. Fortunately, the HTML 5 RFC that landed in PHP 8.4-dev gives us a unique opportunity to do so!
To recap, that RFC has introduced 3 new classes in a new DOM namespace: DOM\Document, DOM\XMLDocument, and DOM\HTMLDocument. When utilising these new classes, HTML 5 parsing and serialization will be used. The old DOM classes are unaffected. As this RFC landed in the development cycle of PHP 8.4, there are no users yet. I propose that when the new classes are used, then the DOM extension will opt-into spec-compliance behaviour and the bugs are resolved. When you are using the (old) DOMDocument class, the old implementations will be used. This means that backwards compatibility is kept.
This document also includes a bug list highlighting issues within the scope of this RFC. Bugs are categorized into two categories: type issues and behavioral issues.
Solving behavioural issues is possible in an opt-in way, but type issues are more difficult to solve. In particular, changing types is backwards incompatible, especially considering that the DOM classes can be extended by userland classes.
Therefore, I propose the following BC solution.
The HTML 5 RFC added class aliases such as DOMNode
-> DOM\Node
etc.
I propose to make them real classes instead of aliases. This way we can leave the old classes untouched while fixing the types in the new classes, even though most of the structure remains the same. This also means DOMDocument
will no longer inherit from DOM\Document
. This inheritance was introduced in the HTML 5 RFC, but would no longer be possible due to type divergences. Note that a lot of code is still shared internally between these classes, so this is very doable from PHP's internal point-of-view.
The disadvantage of doing this is that it becomes more difficult for userland code to support the “old DOM” and “new DOM” classes simultaneously. They'd have to use type unions now. Then again, that's probably fine because their semantic behaviour can differ quite a bit.
The DOM spec has no explicit support for things like DOM entities and DOM notation nodes. I'd prefer to keep the support in because otherwise it makes working with XML more difficult, especially in combination with other extensions. The spec authors have removed some support for these things to simplify the specification but there's no reason why we should drop support for these things that don't have an alternative.
Migration will unfortunately not be trivial for everyone because many users rely on behaviour that was not intended by spec, or rely on bugs. Furthermore, there are userland libraries that currently work with the old classes and won't be able to support the old and new classes at the same time easily because of type and behaviour differences. So I expect the migration of the ecosystem to take a long time. There's a high chance that the migration of all old code won't happen until PHP versions older than 8.4 are EOL.
Therefore, a conscious decision is to not deprecate the old DOM classes anytime soon. New code can use the new DOM classes, while old code can keep using the old classes and migrate at their own pace.
I do propose however to add a note to the DOM documentation that the usage of the new classes is encouraged.
The old DOM classes and new DOM classes are internal-data-structure-compatible with each other, so it will be possible to import “old DOM nodes” into the “new DOM”. This should also help library developers migrate: they could write an adapter layer or adapter helper functions. This would, for quite a few libraries, reduce the requirement to maintain two different versions of the same library. Of course there are also libraries that expose the DOM classes that are used, they would have a harder time pulling off an adapter interface and there might not be an easy solution for this except to place the burden on the user of such a library.
To accomplish this I propose to add a method to DOM\Document:
class DOM\Document ... { public function importLegacyNode(DOMNode $node, bool $deep = false): DOM\Node; }
The reason to keep this as a separate method is to not pollute the existing importNode method $node
argument.
This method can throw if an unsupported node is imported (e.g. a document node itself), just like importNode already does.
A previous iteration of this proposal also proposed the adopt{Legacy,Modern}Node methods. This could create two different representations (DOM\Node
and DOMNode
) of the same node at the same time. This can cause weird issues because new DOM and old DOM make different assumptions. To prevent issues, I dropped this from the proposal.
A previous iteration included the importModernNode method that was added to DOMDocument. Upon trying to implement this, I found that it was too difficult to make it work correctly due to limitations in the import code implementation. In particular, in old DOM, namespaces must always be attached to an element. But when importing a node with namespaced attributes, this could sometimes lose the namespace of those attributes because at that point the cloned subtree is not attached to the document yet. While it's probably possible to fix this for most cases, there will always be cases where this causes issues. As such, I rather not provide this functionality than provide it in a half-working/half-broken state. The reverse direction, importLegacyNode, does not suffer from this problem because we have our own namespace handling code for new DOM.
To proactively prevent as many implementation issues as possible, I tried to test every edge case I found in the DOM spec.
WHATWG (the working group maintaining the DOM spec) also has a repository full of tests. It's called WPT (Web Platform Test). I ported a subset of these tests from Javascript to PHP and those ported tests all pass. This increases the confidence that the implementation is correct. Note that I only ported a subset because porting is very time consuming and mentally draining, even with automation.
To ensure that the old DOM classes still work, I rely on the PHP test suite, and I have also run the PHPUnit tests of real-world DOM-utilising libraries. I have tested veewee's XML library, Mensbeam library, some SimpleSAML libraries
I will be using the currently aliased names for the DOM classes in this document.
$nodeValue
(https://dom.spec.whatwg.org/#dom-node-nodevalue)$nodeName
(https://dom.spec.whatwg.org/#dom-node-nodename)$textContent
(https://dom.spec.whatwg.org/#dom-node-textcontent)isDefaultNamespace(string $namespace)
(https://dom.spec.whatwg.org/#dom-node-isdefaultnamespace)?string
instead of string
. The NULL value is treated the same as the empty string.lookupNamespaceURI(?string $prefix)
(https://dom.spec.whatwg.org/#dom-node-lookupnamespaceuri)lookupPrefix(string $namespace)
(https://dom.spec.whatwg.org/#dom-node-lookupprefix)?string
instead of string
. The NULL value is treated the same as the empty string.replaceChild(DOMNode $node, DOMNode $child)
(https://dom.spec.whatwg.org/#dom-node-replacechild)appendChild(DOMNode $node)
(https://dom.spec.whatwg.org/#dom-node-appendchild)insertBefore(DOMNode $node, ?DOMNode $child = null)
(https://dom.spec.whatwg.org/#dom-node-insertbefore)normalize()
(https://dom.spec.whatwg.org/#dom-node-normalize)This can be fixed unconditionally in the master branch.
cloneNode()
(https://dom.spec.whatwg.org/#concept-node-clone)splitText(int $offset)
(https://dom.spec.whatwg.org/#dom-text-splittext)For all the methods in this interface, the pre-insertion validity checking is incomplete. Source: https://dom.spec.whatwg.org/#concept-node-ensure-pre-insertion-validity
createAttribute(string $localName)
(https://dom.spec.whatwg.org/#dom-document-createattribute)createAttributeNS(?string $namespace, string $qualifiedName)
(https://dom.spec.whatwg.org/#dom-document-createattributens)createCDATASection(string $data)
(https://dom.spec.whatwg.org/#dom-document-createcdatasection)createProcessingInstruction(string $target, string $data = "")
(https://dom.spec.whatwg.org/#dom-document-createprocessinginstruction)createElement(string $localName, string $value = "")
(https://dom.spec.whatwg.org/#dom-document-createelement)createElementNS(?string $namespace, string $qualifiedName, string $value = "")
(https://dom.spec.whatwg.org/#internal-createelementns-steps)createElementNS("http://www.w3.org/2000/xmlns/", "svg")
.getElementsByTagName(string $qualifiedName)
(https://dom.spec.whatwg.org/#concept-getelementsbytagname)getElementsByTagNameNS(?string $namespace, string $localName)
(https://dom.spec.whatwg.org/#concept-getelementsbytagnamens)importNode(DOMNode $node, bool $deep = false)
(https://dom.spec.whatwg.org/#dom-document-importnode)getAttributeNode(string $qualifiedName)
(https://dom.spec.whatwg.org/#dom-element-getattributenode)getAttributeNodeNS(?string $namespace, string $localName)
(https://dom.spec.whatwg.org/#dom-element-getattributenodens)getAttribute(string $qualifiedName)
(https://dom.spec.whatwg.org/#dom-element-getattribute)getAttributeNS(?string $namespace, string $localName)
(https://dom.spec.whatwg.org/#dom-element-getattributens)hasAttribute(string $qualifiedName)
(https://dom.spec.whatwg.org/#dom-element-hasattribute)hasAttributeNS(?string $namespace, string $qualifiedName)
(https://dom.spec.whatwg.org/#dom-element-hasattributens)removeAttribute(string $qualifiedName)
(https://dom.spec.whatwg.org/#dom-element-removeattribute)removeAttributeNS(?string $namespace, string $localName)
(https://dom.spec.whatwg.org/#dom-element-removeattributens)setAttribute(string $qualifiedName, string $value)
(https://dom.spec.whatwg.org/#dom-element-setattribute)setAttributeNS(?string $namespace, string $qualifiedName, string $value)
(https://dom.spec.whatwg.org/#dom-element-setattributens)setAttributeNode(DOMAttr $attr)
(https://dom.spec.whatwg.org/#dom-element-setattributenode)setAttributeNodeNS(DOMAttr $attr)
(https://dom.spec.whatwg.org/#dom-element-setattributenodens)insertAdjacent{Text,Element}
(https://dom.spec.whatwg.org/#dom-element-insertadjacenttext and https://dom.spec.whatwg.org/#dom-element-insertadjacentelement)Has the same bugs as DOM\Element::getAttribute.
According to spec, the methods that operate on strings expect unsigned integer arguments instead of signed integer arguments.
This means for example that -1 must be treated as 2**32-1.
This allows you to do things like: $text->substringData(1, -1)
to get the string inside $text
excluding the first character.
This currently isn't the case and will become possible by this proposal.
$element->setAttribute("id")
does not work properly in combination with getElementById).The class hierarchy w.r.t. textual nodes is supposed to be:
However in the current implementation, the ProcessingInstruction class extends Node instead of CharacterData. Also CharacterData is a class instead of an interface in the current implementation, but that's because interfaces cannot contain properties in PHP.
There is one other minor bugs that can't easily be fixed without breaking BC, so I include it here too:
Both bugsnet and GitHub contain bug reports that are consequences of spec compliance issues. By implementing this proposal, the following reports will be closed as fixed:
Although this list looks small, the impact of this proposal is huge. It will fix a lot of issues that are not in this list.
Here are 3 examples of namespace bugs that are not solvable without this proposal. This should make it even clearer why the fixes have to be opt-in.
Try it out: https://3v4l.org/8aqgO
The expected serialization is
<outer xmlns="urn:a"><inner xmlns=""/></outer>
because the inner element was created using createElement, which puts the element in no namespace. Therefore, the xmlns=“” attribute is necessary. Unfortunately, the 3v4l snippet lacks the xmlns=“” attribute in the output. Therefore, if you were to reparse the output from the 3v4l snippet, then the inner element will suddenly become part of the urn:a namespace, which is incorrect. Fixing this would drastically change the behaviour of namespaces, and experience tells me that a lot of people don't know that this is wrong.
This is related to https://bugs.php.net/bug.php?id=81468, but note however that the expectation in that bug report is wrong because of the misunderstanding I explained above about how createElement works.
For the lack of a better term, shifting namespaces means that the prefix of the namespace changes on certain operations on the DOM tree. This is wrong because the prefix and namespace URI must always be kept as-is.
There are many ways to encounter this, but I recently received a report that looked something like this: https://3v4l.org/NSDmO
The element shouldn't have gotten the xsd
prefix, because now the XML schema definition is no longer valid as the type is still “string”.
The element should've just been put into the document as-is.
Here's another example of a similar bug: https://bugs.php.net/bug.php?id=47847
While it's possible in theory to invent ad-hoc solutions for this, this is dangerous. A general solution is impossible without breaking existing code, hence this proposal to fix these bugs in an opt-in way.
From https://bugs.php.net/bug.php?id=47530
The namespace prefixes should be kept as-is when a node gets imported. Instead, in some cases a default:
prefix is created. This is a side-effect of a libxml2-API misuse by PHP. It is not fixable because its fix has side-effects that break other applications.
Let's discuss some alternatives to this RFC.
People have implemented userland DOM libraries on top of the existing DOM extension. However, even userland solutions can't fully work around issues caused by PHP's DOM extension. This is because those libraries still have to work with broken methods. I often receive bug reports from developers of such libraries regarding functionality they're using that doesn't interact well because they're (in)directly relying on bugs and hacks, or the underlying DOM method has an unfixable bug. Again, those underlying bugs cannot be fixed because they would break BC. The real solution is to provide a BC-preserving fix at PHP's side.
I basically copy-pasted this from my HTML 5 RFC.
One might wonder why we don't just create an entirely new DOM extension, based on another library, with HTML5 support. There are a couple of reasons:
There are no BC breaks for the reasons given in the introduction. The spec-compliance is opt-in.
PHP 8.4.
First and third-party extensions are unaffected because the internal data structures and APIs remain the same. Of course, the DOM extension itself is heavily affected. When using opt-in spec-compliance, the DOM extension (and other extensions using the same document tree) will get additional performance improvements due to the reworked namespace management.
To clarify, even the API for XSLTProcessor
and simplexml_import_dom
does not need changes. That's because the argument types use object deliberately. Classes can register themselves as “XML nodes” with the libxml extension, so the use case of extending the supported XML classes even with third party extensions is already supported without causing BC breaks.
None right now.
When this RFC lands, it will become much easier to add new features to the DOM extension. Preferably, I will only add new features to the new classes and keep the old classes as-is. An example of a new feature I have worked on based on the development branch of this RFC is native CSS selector support: https://github.com/nielsdos/php-src/pull/82
One primary vote with 2/3 majority to accept this proposal as a whole.
Voting started on 2024-02-13 and will end on 2024-02-27.
Merged into PHP 8.4: https://github.com/php/php-src/commit/14b6c981c374fc183d7b2eae20b0712bb356d160