rfc:opt_in_dom_spec_compliance

PHP RFC: Opt-in DOM spec-compliance

Introduction

The DOM extension in PHP is used to parse, query, and manipulate XML/HTML documents. The DOM extension is supposed to follow the DOM specification. Originally this was the DOM Core Level 3 specification, but nowadays, that specification has evolved into the current "Living Specification" maintained by WHATWG. Unfortunately, there are many bugs in PHP's DOM extension. Most of those bugs are related to namespace and attribute handling. This leads to people trying to work around those bugs by relying on more bugs, or on undocumented side-effects of incorrect behaviour, leading to even more issues in the end. Furthermore, some of these bugs may have security implications. Note that the bugs are not HTML-exclusive, but also apply to XML documents.

Some of these bugs are caused because the method or property was implemented incorrectly back in the day, or because the original DOM 3 specification used to be unclear. A smaller part of this is because the specification has made breaking changes when HTML 5 first came along and the specification creators had to unify what browsers implemented into a single specification that everyone agreed on.

It's not possible to “just fix” these bugs because people actually rely on these bugs. They are also often unaware that what they're doing is actually incorrect or causes the internal document state to be inconsistent. We therefore have to fix this in a backwards-compatible way: i.e. a hard requirement is that all code written for the current DOM extension keeps working without requiring changes. In summary, the core challenge lies in the fact that two decades of buggy behavior have become deeply ingrained in the system.

Proposal

It is clear that any behavioural fix must come in an opt-in manner. Fortunately, the HTML 5 RFC that landed in PHP 8.4-dev gives us a unique opportunity to do so!

To recap, that RFC has introduced 3 new classes in a new DOM namespace: DOM\Document, DOM\XMLDocument, and DOM\HTMLDocument. When utilising these new classes, HTML 5 parsing and serialization will be used. The old DOM classes are unaffected. As this RFC landed in the development cycle of PHP 8.4, there are no users yet. I propose that when the new classes are used, then the DOM extension will opt-into spec-compliance behaviour and the bugs are resolved. When you are using the (old) DOMDocument class, the old implementations will be used. This means that backwards compatibility is kept.

This document also includes a bug list highlighting issues within the scope of this RFC. Bugs are categorized into two categories: type issues and behavioral issues.

  1. Type issues involve incorrect property, return value, or method argument types, such as a non-nullable string property currently returning an empty string instead of NULL as specified.
  2. Behavioral issues are about incorrectly implemented operation semantics, and the majority of bugs addressed by this RFC fall into this category.

Solving type issues

Solving behavioural issues is possible in an opt-in way, but type issues are more difficult to solve. In particular, changing types is backwards incompatible, especially considering that the DOM classes can be extended by userland classes.

Therefore, I propose the following BC solution. The HTML 5 RFC added class aliases such as DOMNode -> DOM\Node etc. I propose to make them real classes instead of aliases. This way we can leave the old classes untouched while fixing the types in the new classes, even though most of the structure remains the same. This also means DOMDocument will no longer inherit from DOM\Document. This inheritance was introduced in the HTML 5 RFC, but would no longer be possible due to type divergences. Note that a lot of code is still shared internally between these classes, so this is very doable from PHP's internal point-of-view.

The disadvantage of doing this is that it becomes more difficult for userland code to support the “old DOM” and “new DOM” classes simultaneously. They'd have to use type unions now. Then again, that's probably fine because their semantic behaviour can differ quite a bit.

Let's not damage XML support

The DOM spec has no explicit support for things like DOM entities and DOM notation nodes. I'd prefer to keep the support in because otherwise it makes working with XML more difficult, especially in combination with other extensions. The spec authors have removed some support for these things to simplify the specification but there's no reason why we should drop support for these things that don't have an alternative.

Migration

Migration will unfortunately not be trivial for everyone because many users rely on behaviour that was not intended by spec, or rely on bugs. Furthermore, there are userland libraries that currently work with the old classes and won't be able to support the old and new classes at the same time easily because of type and behaviour differences. So I expect the migration of the ecosystem to take a long time. There's a high chance that the migration of all old code won't happen until PHP versions older than 8.4 are EOL.

Therefore, a conscious decision is to not deprecate the old DOM classes anytime soon. New code can use the new DOM classes, while old code can keep using the old classes and migrate at their own pace.

I do propose however to add a note to the DOM documentation that the usage of the new classes is encouraged.

Adapters

The old DOM classes and new DOM classes are internal-data-structure-compatible with each other, so it will be possible to import “old DOM nodes” into the “new DOM”. This should also help library developers migrate: they could write an adapter layer or adapter helper functions. This would, for quite a few libraries, reduce the requirement to maintain two different versions of the same library. Of course there are also libraries that expose the DOM classes that are used, they would have a harder time pulling off an adapter interface and there might not be an easy solution for this except to place the burden on the user of such a library.

To accomplish this I propose to add a method to DOM\Document:

class DOM\Document ... {
    public function importLegacyNode(DOMNode $node, bool $deep = false): DOM\Node;
}

The reason to keep this as a separate method is to not pollute the existing importNode method $node argument.

This method can throw if an unsupported node is imported (e.g. a document node itself), just like importNode already does.

A previous iteration of this proposal also proposed the adopt{Legacy,Modern}Node methods. This could create two different representations (DOM\Node and DOMNode) of the same node at the same time. This can cause weird issues because new DOM and old DOM make different assumptions. To prevent issues, I dropped this from the proposal.

A previous iteration included the importModernNode method that was added to DOMDocument. Upon trying to implement this, I found that it was too difficult to make it work correctly due to limitations in the import code implementation. In particular, in old DOM, namespaces must always be attached to an element. But when importing a node with namespaced attributes, this could sometimes lose the namespace of those attributes because at that point the cloned subtree is not attached to the document yet. While it's probably possible to fix this for most cases, there will always be cases where this causes issues. As such, I rather not provide this functionality than provide it in a half-working/half-broken state. The reverse direction, importLegacyNode, does not suffer from this problem because we have our own namespace handling code for new DOM.

Testing

To proactively prevent as many implementation issues as possible, I tried to test every edge case I found in the DOM spec.

WHATWG (the working group maintaining the DOM spec) also has a repository full of tests. It's called WPT (Web Platform Test). I ported a subset of these tests from Javascript to PHP and those ported tests all pass. This increases the confidence that the implementation is correct. Note that I only ported a subset because porting is very time consuming and mentally draining, even with automation.

To ensure that the old DOM classes still work, I rely on the PHP test suite, and I have also run the PHPUnit tests of real-world DOM-utilising libraries. I have tested veewee's XML library, Mensbeam library, some SimpleSAML libraries

Bug list

I will be using the currently aliased names for the DOM classes in this document.

DOM\Node class (and its subclasses)

Properties

Methods

This can be fixed unconditionally in the master branch.

DOM\Attr class

Properties

DOM\Text class

Methods

DOM\ChildNode and DOM\ParentNode interface

For all the methods in this interface, the pre-insertion validity checking is incomplete. Source: https://dom.spec.whatwg.org/#concept-node-ensure-pre-insertion-validity

  • Step 4 is missing: Should throw a hierarchy request DOMException when the node to insert isn't a DocumentFragment, DocumentType, Element, or CharacterData.
  • Step 5 (first part) is missing: Text nodes may not be inserted if the parent is a document, should result in a hierarchy request DOMException.
  • Step 5 (second part) is missing: Doctype nodes may only be inserted if the parent is a document, otherwise should result in a hierarchy request DOMException.
  • Missing all the validation of step 6.
  • Merges adjacent text nodes while it shouldn't.
  • Handles the special case of passing a single node incorrectly with regards to error handling.

DOM\Document class

Properties

  • $documentURI is supposed to be a URI, but for local files it doesn't prefix the path with the file scheme.
  • $strictErrorChecking property should only exist on the legacy DOMDocument class, exceptions instead of warnings are the default in the modern-day DOM spec.

Methods

DOM\Element class

Properties

Methods

  • getAttributeNode(string $qualifiedName) (https://dom.spec.whatwg.org/#dom-element-getattributenode)
    • Should lowercase the qualified name when working with HTML namespace in an HTML document.
    • Should return NULL instead of false when the attribute doesn't exist.
    • Doesn't correctly match the qualified name of an attribute.
    • Can return a DOMNameSpaceNode instead of an actual attribute (see general issues).
  • getAttributeNodeNS(?string $namespace, string $localName) (https://dom.spec.whatwg.org/#dom-element-getattributenodens)
    • As a side note: this method _does_ return NULL instead of false, unlike getAttributeNode!
    • Can return a DOMNameSpaceNode instead of an actual attribute (see general issues).
    • Treats the empty string and NULL $namespace different, while they are actually the same thing.
  • getAttribute(string $qualifiedName) (https://dom.spec.whatwg.org/#dom-element-getattribute)
    • Should lowercase the qualified name when working with HTML namespace in an HTML document.
    • Should return NULL instead of the empty string when the attribute doesn't exist because it would be impossible to differentiate between a non-existent attribute and an empty attribute otherwise.
    • Internally works with DOMNameSpaceNode (see general issues).
  • getAttributeNS(?string $namespace, string $localName) (https://dom.spec.whatwg.org/#dom-element-getattributens)
    • Should return NULL instead of the empty string instead when the attribute doesn't exist.
    • Internally works with DOMNameSpaceNode (see general issues).
    • Treats the empty string and NULL $namespace different, while they are actually the same thing.
  • hasAttribute(string $qualifiedName) (https://dom.spec.whatwg.org/#dom-element-hasattribute)
    • Should lowercase the qualified name when working with HTML namespace in an HTML document.
    • Doesn't correctly match the qualified name of an attribute.
    • Internally works with DOMNameSpaceNode (see general issues).
  • hasAttributeNS(?string $namespace, string $qualifiedName) (https://dom.spec.whatwg.org/#dom-element-hasattributens)
    • Internally works with DOMNameSpaceNode (see general issues).
    • Treats the empty string and NULL $namespace different, while they are actually the same thing.
  • removeAttribute(string $qualifiedName) (https://dom.spec.whatwg.org/#dom-element-removeattribute)
    • Should lowercase the qualified name when working with HTML namespace in an HTML document.
    • Internally works with DOMNameSpaceNode (see general issues).
    • Shouldn't return anything according to spec, but it returns a bool indicating failure or success.
  • removeAttributeNS(?string $namespace, string $localName) (https://dom.spec.whatwg.org/#dom-element-removeattributens)
    • Internally works with DOMNameSpaceNode (see general issues).
    • Shouldn't return anything according to spec, but it returns a bool indicating failure or success.
    • Treats the empty string and NULL $namespace different, while they are actually the same thing.
  • setAttribute(string $qualifiedName, string $value) (https://dom.spec.whatwg.org/#dom-element-setattribute)
    • Internally works with DOMNameSpaceNode (see general issues).
    • Should lowercase the qualified name when working with HTML namespace in an HTML document.
  • setAttributeNS(?string $namespace, string $qualifiedName, string $value) (https://dom.spec.whatwg.org/#dom-element-setattributens)
    • Internally works with DOMNameSpaceNode (see general issues).
    • Violates the namespace well-formedness constraints (see spec step “validate and extract”).
    • Treats the empty string and NULL $namespace different, while they are actually the same thing.
    • Behaves incorrectly when namespaced attributes are provided. To solve this, this is nowadays an alias for setAttributeNodeNS.
    • Should't throw WRONG_DOCUMENT_ERR.
    • Should throw INUSE_ATTRIBUTE_ERR when the attribute is already attached to another element.
    • Due to other implementation issues of internal methods, these can violate the hierarchy constraints.

DOM\NamedNodeMap class

Has the same bugs as DOM\Element::getAttribute.

According to spec, the methods that operate on strings expect unsigned integer arguments instead of signed integer arguments. This means for example that -1 must be treated as 2**32-1. This allows you to do things like: $text->substringData(1, -1) to get the string inside $text excluding the first character. This currently isn't the case and will become possible by this proposal.

General issues

  • The rules surrounding the HTML namespace are not respected.
  • Runtime performance issues with namespaces.
  • Namespace serialization is incorrect when xmlns attributes exist, or when the namespace of an element is the empty namespace (in some cases).
  • There is a DOMNameSpaceNode class where namespace declarations are sometimes treated as attributes and sometimes are not. This causes all sorts of inconsistencies. In modern-day DOM spec the internal namespace information is not exposed, but when you see an xmlns declaration they are attributes and can be manipulated properly like attributes. This also causes issues when there is internal namespace information and an xmlns attribute. The DOMNameSpaceNode class is also lacking in features because they try to be like attributes but are not due to implementation problems. Explicit xmlns attributes exist that are just attributes and have no influence on the namespace declaration, they are only there to help serialization.
  • Namespace reconciliation can shift nodes between namespaces, which is incorrect.
  • The ID attribute is not always respected because the current DOM implementation still has the setAttributeId legacy behaviour. (e.g. $element->setAttribute("id") does not work properly in combination with getElementById).
  • The XML serialization is incorrect in some cases (related to namespace prefix conflicts and the empty namespace).

Class hierarchy

The class hierarchy w.r.t. textual nodes is supposed to be:

  • CharacterData extends Node (Actually an interface)
    • Text extends CharacterData
      • CDATASection extends Text
    • ProcessingInstruction extends CharacterData
    • Comment extends CharacterData

However in the current implementation, the ProcessingInstruction class extends Node instead of CharacterData. Also CharacterData is a class instead of an interface in the current implementation, but that's because interfaces cannot contain properties in PHP.

General typing issues

  • As listed above, there are a lot of places where the implementation uses “string” but should actually use “?string”.
  • DOMNameSpaceNode will never be possibly returned in the spec compliant implementation, so that return type becomes useless.
  • There are a lot of return types of the form “T|false” because the current implementation can return false on error instead of throwing an exception if “strictErrorChecking” is false. This is a legacy DOM feature that is no longer supported in the modern-day DOM spec. For new classes, the return type would become “T” instead of “T|false”.

Other non-spec bugs

There is one other minor bugs that can't easily be fixed without breaking BC, so I include it here too:

  • Constructors are not called for custom DOM classes registered by registerNodeClass, but destructors are: https://3v4l.org/S4jOY. As the DOM spec dictates that Node is not directly instantiable, this will be fixed simply by disallowing the __construct function declaration.

Bug reports

Both bugsnet and GitHub contain bug reports that are consequences of spec compliance issues. By implementing this proposal, the following reports will be closed as fixed:

Although this list looks small, the impact of this proposal is huge. It will fix a lot of issues that are not in this list.

Namespace bug examples

Here are 3 examples of namespace bugs that are not solvable without this proposal. This should make it even clearer why the fixes have to be opt-in.

xmlns=""

Try it out: https://3v4l.org/8aqgO

The expected serialization is

<outer xmlns="urn:a"><inner xmlns=""/></outer>

because the inner element was created using createElement, which puts the element in no namespace. Therefore, the xmlns=“” attribute is necessary. Unfortunately, the 3v4l snippet lacks the xmlns=“” attribute in the output. Therefore, if you were to reparse the output from the 3v4l snippet, then the inner element will suddenly become part of the urn:a namespace, which is incorrect. Fixing this would drastically change the behaviour of namespaces, and experience tells me that a lot of people don't know that this is wrong.

This is related to https://bugs.php.net/bug.php?id=81468, but note however that the expectation in that bug report is wrong because of the misunderstanding I explained above about how createElement works.

Shifting

For the lack of a better term, shifting namespaces means that the prefix of the namespace changes on certain operations on the DOM tree. This is wrong because the prefix and namespace URI must always be kept as-is.

There are many ways to encounter this, but I recently received a report that looked something like this: https://3v4l.org/NSDmO

The element shouldn't have gotten the xsd prefix, because now the XML schema definition is no longer valid as the type is still “string”. The element should've just been put into the document as-is.

Here's another example of a similar bug: https://bugs.php.net/bug.php?id=47847

While it's possible in theory to invent ad-hoc solutions for this, this is dangerous. A general solution is impossible without breaking existing code, hence this proposal to fix these bugs in an opt-in way.

Importing

From https://bugs.php.net/bug.php?id=47530

The namespace prefixes should be kept as-is when a node gets imported. Instead, in some cases a default: prefix is created. This is a side-effect of a libxml2-API misuse by PHP. It is not fixable because its fix has side-effects that break other applications.

Alternatives

Let's discuss some alternatives to this RFC.

Userland solutions

People have implemented userland DOM libraries on top of the existing DOM extension. However, even userland solutions can't fully work around issues caused by PHP's DOM extension. This is because those libraries still have to work with broken methods. I often receive bug reports from developers of such libraries regarding functionality they're using that doesn't interact well because they're (in)directly relying on bugs and hacks, or the underlying DOM method has an unfixable bug. Again, those underlying bugs cannot be fixed because they would break BC. The real solution is to provide a BC-preserving fix at PHP's side.

An entirely new DOM extension

I basically copy-pasted this from my HTML 5 RFC.

One might wonder why we don't just create an entirely new DOM extension, based on another library, with HTML5 support. There are a couple of reasons:

  • Interoperability problems with other extensions (both within php-src and third-party).
  • Interoperability issues with userland code. Right now you can still import nodes from the “old DOM” to the “new DOM”.
  • Additional maintenance work and complexity. A spec-compliance “mode” can share almost all code while a new extension cannot.
  • I don't have time to build this.

Backward Incompatible Changes

There are no BC breaks for the reasons given in the introduction. The spec-compliance is opt-in.

Proposed PHP Version(s)

PHP 8.4.

RFC Impact

To Existing Extensions

First and third-party extensions are unaffected because the internal data structures and APIs remain the same. Of course, the DOM extension itself is heavily affected. When using opt-in spec-compliance, the DOM extension (and other extensions using the same document tree) will get additional performance improvements due to the reworked namespace management.

To clarify, even the API for XSLTProcessor and simplexml_import_dom does not need changes. That's because the argument types use object deliberately. Classes can register themselves as “XML nodes” with the libxml extension, so the use case of extending the supported XML classes even with third party extensions is already supported without causing BC breaks.

Open Issues

None right now.

Future Scope

When this RFC lands, it will become much easier to add new features to the DOM extension. Preferably, I will only add new features to the new classes and keep the old classes as-is. An example of a new feature I have worked on based on the development branch of this RFC is native CSS selector support: https://github.com/nielsdos/php-src/pull/82

Proposed Voting Choices

One primary vote with 2/3 majority to accept this proposal as a whole.

Voting started on 2024-02-13 and will end on 2024-02-27.

Accept Opt-in DOM spec-compliance RFC?
Real name Yes No
ashnazg (ashnazg)  
beberlei (beberlei)  
crell (crell)  
devnexen (devnexen)  
galvao (galvao)  
girgias (girgias)  
kocsismate (kocsismate)  
nielsdos (nielsdos)  
saki (saki)  
sebastian (sebastian)  
sergey (sergey)  
theodorejb (theodorejb)  
timwolla (timwolla)  
weierophinney (weierophinney)  
Final result: 14 0
This poll has been closed.

Patches and Tests

Implementation

References

Changelog

  • 0.1.5: Drop importModernNode
  • 0.1.4: Clarify other spec bugs
  • 0.1.3: Update migration
  • 0.1.2: Mention namespace performance improvement.
  • 0.1.1: Fix typos and clarify some details. No semantic changes.
  • 0.1: Initial version under discussion
rfc/opt_in_dom_spec_compliance.txt · Last modified: 2024/03/09 21:23 by nielsdos