PHP RFC: DOM HTML5 parsing and serialization

Introduction

PHP's DOM extension supports loading HTML documents using the methods \DOMDocument::loadHTML and \DOMDocument::loadHTMLFile. This uses libxml2's HTML parser under the hood to parse these documents into a libxml2 document tree that ext/dom uses. Unfortunately, this parser only supports HTML up to version 4.01. This is a problem because HTML5 has become the de facto standard for websites over the past decade. Introducing HTML5 parsing to PHP's DOM implementation is crucial for modernizing and enhancing PHP's capabilities in handling modern web content.

Using loadHTML(File) to load HTML5 content results in multiple parsing errors and incorrect document trees. These issues arise from changes in parsing rules between HTML4 and HTML5. Notably, the current parser does not recognize semantic HTML5 tags (e.g. main, article, section) as valid tags. There are also problems with certain element nestings that are not allowed in HTML4 but are allowed in HTML5, which cause incorrect document trees. Another concern highlighted in PHP's bug tracker is the handling of closing tags within script contexts. With the common practice of embedding HTML within JavaScript, HTML4 parsers encounter problems with closing tags within JavaScript literals. Consequently, parsing through loadHTML(File) leads to incorrect document trees. There are many more such issues. Not being able to parse HTML5 properly is one of the major pain points of our DOM extension.

There's an open issue at the libxml2 bugtracker to add HTML5 parsing support: https://gitlab.gnome.org/GNOME/libxml2/-/issues/211. However, it seems like this won't happen anytime soon. Furthermore, there are also problems with saving (also known as serializing) HTML5 documents due to subtle rule differences between HTML4 and HTML5. This RFC proposes a practically backwards compatible solution to deal with these problems. To solve the parsing issue, we will leverage an alternative HTML5 parser to create the libxml2 document tree. This parser seamlessly integrates with the DOM extension, ensuring compatibility for all existing code and third-party extensions. To solve the serialization issue, an implementation for the HTML5 serialization algorithm will also be added. The new functionality will be available via a new class.

Proposal

The most important requirement is that the new class must integrate seamlessly with the DOM extension. This means that using it must be a simple drop-in replacement. You will still be able to use all the existing APIs to manipulate and traverse DOM documents and nodes.

This proposal introduces the DOM\HTMLDocument class. The reason we introduce a new class instead of replacing the methods of the existing class is to ensure full backwards compatibility. There are applications that work with legacy HTML4 documents, and want the HTML4 behaviour. By keeping the \DOMDocument class, nothing changes for existing code. Code that wants HTML5 functionality can use the DOM\HTMLDocument class.

What does the class hierarchy look like, and how does it interact with \DOMDocument? We'll add a common abstract base class DOM\Document (name taken from the DOM spec and the JavaScript world). DOM\Document contains the properties and abstract methods common to both HTML and XML documents. Examples of what it includes/excludes:

  • includes: firstElementChild, lastElementChild, ...
  • excludes: xmlStandalone, xmlVersion, validate(), ...

Then we'll have two subclasses: DOM\HTMLDocument (a previous version of this RFC named this DOM\HTML5Document) and DOM\XMLDocument. \DOMDocument will also use DOM\Document as a base class to make it interchangeable with the new classes. We're only adding XMLDocument for completeness and API parity. It's a drop-in replacement for \DOMDocument, and behaves exactly the same. The difference is that the API is on par with HTMLDocument, and the construction is designed to be more misuse-resistant. \DOMDocument will NOT change, and remains for the foreseeable future.

Introducing a new class also opens the door to tackle some oddities in how DOM documents are constructed. In particular, the properties set by \DOMDocument's constructor are overridden by its load methods, which is surprising. That's even mentioned in the second top comment on https://www.php.net/manual/en/domdocument.loadxml.php. Furthermore, the XML version argument of the constructor is useless for HTML5 documents. While we cannot change the behaviour of \DOMDocument, we can choose a sane behaviour for DOM\HTMLDocument and DOM\XMLDocument. So instead of mirroring the broken API, we'll use factory methods. Factory methods are essentially a way to implement multiple named constructors. As it's unclear what a default constructor should be for DOM\Document derivatives, we chose to only have named constructors and disable the public constructor by making it private. This should also make the code more readable and less surprising, as the factory method's name tells us exactly what the behaviour is.

To put it in PHP code:

namespace DOM {
	// The base abstract document class
	abstract class Document extends \DOM\Node implements \DOM\ParentNode {
		/* all properties and methods that are common and sensible for both XML & HTML documents */
	}

	final class XMLDocument extends Document {
		/* insert specific XML methods and properties (e.g. xmlVersion, validate(), ...) here */

		// Private: instances are only created through the named constructors below
		private function __construct() {}

		public static function createEmpty(string $version = "1.0", string $encoding = "UTF-8"): XMLDocument {}
		public static function createFromFile(string $path, int $options = 0, ?string $override_encoding = null): XMLDocument {}
		public static function createFromString(string $source, int $options = 0, ?string $override_encoding = null): XMLDocument {}
	}

	final class HTMLDocument extends Document {
		/* insert specific HTML methods and properties here */

		// Private: instances are only created through the named constructors below
		private function __construct() {}

		public static function createEmpty(string $encoding = "UTF-8"): HTMLDocument {}
		public static function createFromFile(string $path, int $options = 0, ?string $override_encoding = null): HTMLDocument {}
		public static function createFromString(string $source, int $options = 0, ?string $override_encoding = null): HTMLDocument {}
	}
}

namespace {
	class DOMDocument extends DOM\Document {
		/* Keep methods, properties, and constructor the same as they are now */
	}
}

The override_encoding parameter is optional. It is used to override the implicit encoding detection routines as determined by the HTML parser spec. This can be useful when the document is downloaded manually (e.g. using Guzzle).
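As a rough usage sketch of the factory methods above: the snippet assumes a PHP build in which the DOM\HTMLDocument class proposed by this RFC exists, and therefore checks for the class first. The helper name parseHtml5 and the sample markup are illustrative assumptions and not part of the proposal.

```php
<?php
// Sketch only: relies on the DOM\HTMLDocument factory methods proposed in this RFC.
// parseHtml5() is a hypothetical helper name used for illustration.
function parseHtml5(string $html, ?string $overrideEncoding = null): ?\DOM\Document
{
    if (!class_exists('DOM\HTMLDocument')) {
        return null; // the proposed class is not available in this PHP build
    }

    // Named constructor: parse from a string, no options, optional encoding override.
    return \DOM\HTMLDocument::createFromString($html, 0, $overrideEncoding);
}

$html = '<!doctype html><html><head><title>Demo</title></head><body><main>Hi</main></body></html>';

// Pass the encoding explicitly, e.g. when it is already known from how the document was downloaded.
$doc = parseHtml5($html, 'UTF-8');

if ($doc !== null) {
    echo $doc->saveHTML(), "\n";
}
```

Because the return type is the shared base class, calling code written this way would not need to care which concrete document class it receives.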

We'll have the existing DOM classes in the global namespace and our three new classes in the (new) DOM namespace. This is awkward. I propose to solve this by creating namespace aliases for the existing DOM classes, the constants, and the single function. This would improve consistency, and in the distant future it may allow a complete transition to the namespaced variants. This means, for example, that there will be an alias DOM\Element for DOMElement, an alias DOM\Entity for DOMEntity, etc. The exception will be DOMException, which is aliased to DOM\DOMException because that's the official name, and because importing it under a shorter name would make it easy to confuse with the global Exception class (see also https://github.com/php/php-src/pull/9071#issuecomment-1193162754). There is a single function, dom_import_simplexml, which can get the alias DOM\import_simplexml. Similarly, the constants would lose their DOM_ prefix in the namespaced version, e.g. DOM\INDEX_SIZE_ERR will be an alias for DOM_INDEX_SIZE_ERR. For constants that begin with XML_ I propose to keep the prefix.

The options argument

Just like the load methods of \DOMDocument, their HTML5 counterparts also take an optional options argument. The options for the load methods change the way the parser behaves. The only three libxml options that will have an effect for the new methods are LIBXML_HTML_NOIMPLIED, LIBXML_COMPACT, and LIBXML_NOERROR. Here's an overview of the other options that are unimplemented and the reason why:

Option | Reasoning
LIBXML_BIGLINES, LIBXML_PARSEHUGE | Not needed, this always works for the new methods.
LIBXML_DTDATTR, LIBXML_DTDLOAD, LIBXML_DTDVALID | There is only one valid DTD for HTML5, these options don't make sense.
LIBXML_HTML_NODEFDTD | Not needed, this is the default HTML5 behaviour.
LIBXML_NOBLANKS | This doesn't remove blank nodes in all cases. There's rules that libxml2 follows based on whether the element accepts #PCDATA, and based on the position of the element. As HTML5 is not based on XML, there is no concept of #PCDATA. Hence, it is unclear what the right behaviour should be.
LIBXML_NOCDATA, LIBXML_NOEMPTYTAG, LIBXML_NOENT, LIBXML_NSCLEAN, LIBXML_XINCLUDE, LIBXML_SCHEMA_CREATE | This is only valid in XML, the concept doesn't exist in HTML5.
LIBXML_NONET | Not needed, the new methods never access the network.
LIBXML_NOWARNING | Not needed, only errors are reported, there's no concept of a warning because this is not a conformance checker.
LIBXML_PEDANTIC | Error reporting follows the spec, no custom error levels are available.

Furthermore, we also implement a custom option DOM\NO_DEFAULT_NS that avoids putting a default namespace on the HTML/SVG/MathML elements. This is done to ease migration and to make everything compatible with non-namespace-aware DOM tools. Something very similar exists in masterminds/html5-php, and this option is also used in Symfony's CSS Selector package.

Passing invalid options will result in a ValueError being thrown.
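The following sketch illustrates what handling such an error could look like. It assumes a PHP build that ships the proposed DOM\HTMLDocument class and that LIBXML_NOENT is rejected as described above; the helper name tryUnsupportedOption is hypothetical, and the exact exception message is not specified by this RFC.

```php
<?php
// Sketch only: assumes the DOM\HTMLDocument class proposed in this RFC.
// tryUnsupportedOption() is a hypothetical helper name used for illustration.
function tryUnsupportedOption(): ?string
{
    if (!class_exists('DOM\HTMLDocument')) {
        return null; // the proposed class is not available in this PHP build
    }

    try {
        // LIBXML_NOENT is listed above as an option without meaning for HTML5.
        \DOM\HTMLDocument::createFromString('<!doctype html><p>Hi</p>', LIBXML_NOENT);
    } catch (\ValueError $e) {
        return $e->getMessage();
    }

    return null; // no exception was thrown
}

$message = tryUnsupportedOption();
if ($message !== null) {
    echo "Rejected: ", $message, "\n";
}
```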

Additional background info

The DOM extension supports both XML and HTML documents. It's built heavily upon libxml2's APIs and data structures, just like all XML-related PHP extensions within php-src. This is great for interoperability (e.g. with simplexml and xsl). Third-party extensions also use libxml2 APIs. For example, the xmldiff PECL extension peeks into the internals of DOMNode to grab the libxml2 data structures and compare them. It is not possible to switch away from the libxml2 library as the underlying basis for the DOM extension because that would cause a major BC break.

Approach

Parsing an HTML document via an HTML parser results in a document tree. The tree consists of HTML nodes. These nodes are structs on the heap created by the parser. In order to integrate an alternative parser into our DOM extension, these nodes need to be converted into libxml2 nodes. The resulting tree, after conversion, is then used in the DOM extension, just as if it had come from libxml2's parser.

The conversion is fairly straightforward. We perform a depth-first traversal on the tree, checking the node type and creating the corresponding libxml2 node. The traversal is performed using iteration instead of recursion to prevent stack overflows with deep trees. After this process is done, we throw away the old tree and are left with only the libxml2 tree.
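To illustrate the idea of an iterative depth-first conversion, here is a small userland sketch. It is not the actual implementation, which operates on Lexbor and libxml2 structures in C. The nested arrays stand in for the parser's nodes, and the function name convertTree and the array layout are assumptions made purely for this example.

```php
<?php
// Hypothetical parser output: nested arrays standing in for the parser's own node structs.
$parsed = ['name' => 'html', 'children' => [
    ['name' => 'head', 'children' => []],
    ['name' => 'body', 'children' => [
        ['name' => 'p', 'text' => 'Hello', 'children' => []],
    ]],
]];

// Converts the array tree into a DOMDocument tree using an explicit stack instead of recursion.
function convertTree(array $root): DOMDocument
{
    $doc = new DOMDocument();

    // Each work item pairs a source node with the already-created target parent.
    $stack = [[$root, $doc]];

    while ($stack) {
        [$source, $parent] = array_pop($stack);

        // Create the corresponding target node and attach it to its parent.
        $element = $doc->createElement($source['name']);
        if (isset($source['text'])) {
            $element->appendChild($doc->createTextNode($source['text']));
        }
        $parent->appendChild($element);

        // Push children in reverse so they are popped, and appended, in document order.
        foreach (array_reverse($source['children']) as $child) {
            $stack[] = [$child, $element];
        }
    }

    return $doc;
}

$doc = convertTree($parsed);
echo $doc->saveXML($doc->documentElement), "\n";
```

The explicit stack lives on the heap, so the nesting depth of the input is bounded by available memory rather than by the call stack.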

For serializing, I wrote code implementing the HTML5 serialization algorithm using libxml2 nodes. I could've also developed a method of converting a libxml2 tree back to the original type of tree that the parser produced, but that's more complicated to implement and likely has slower performance.

Choosing an HTML5 parser

We have to choose a suitable HTML5 parser. It should be spec-compliant, heavily tested, and fast. I propose to use Lexbor. According to its README, it satisfies our requirements. Furthermore, people already made bindings for Elixir, Crystal, Python, D, and Ruby. This shows that it has been used in practice in other serious projects.

It is fully written in C99. That's ideal, because PHP also uses the C99 standard. One small complication is that this library is not available in the package managers of most distros. Therefore, I propose to bundle it with PHP. This also gives us the freedom to incorporate a patch to expose the line and column numbers of HTML nodes, such that the error messages are richer and the DOMNode::getLineNo() function will work properly. Bundling a library with PHP is not unprecedented: PHP already bundles e.g. pcre2lib, libgd, libmagic, ...

Lexbor also supports overriding the allocation routines. Therefore, we can make it work with PHP's memory limit, which is currently not done with libxml2.

Alternative HTML5 parsers considered

Lexbor is one of several HTML5 parsers available. During my investigation, I considered two alternatives:

  • Gumbo: https://github.com/google/gumbo-parser.
    A relatively well-known HTML5 parser developed by Google in C.
    Unfortunately, it has been unmaintained since 2016, as indicated in its README, making it unsuitable for use.
  • html5ever: https://github.com/servo/html5ever.
    This is Servo's HTML5 parser, written in Rust.
    I have implemented a proof-of-concept conversion from html5ever to libxml2, and a proof-of-concept integration with PHP on my fork.

    I decided to not go with this option for a few reasons.
    * Firstly, while writing it in Rust would enhance memory safety (especially for untrusted documents), introducing Rust as an additional dependency for PHP adds extra complexity. PHP's default-enabled extensions can currently be built using only C, but if we go this route this would change.
    * Secondly, the implementation is incomplete, primarily the lack of character encoding support is problematic: it currently only supports UTF-8 documents. Moreover, logic for character encoding meta tags is absent.
    * Lastly, observing the commit activity raises doubts about the ongoing activity of this project.

Considering these factors, I opted against using the above two. Lexbor emerged as the better choice after this investigation.

A note on conformance checkers

I want to emphasize that the HTML5 parser is not a conformance checker. Conformance checkers check for rules beyond the parsing rules. Browsers, and the proposed class, only perform the parsing rule checks. An example of something that's fine for an HTML5 parser, but not fine for a conformance checker, is the following document:

<!doctype html><html><head></head><body></body></html>

This is perfectly valid for a parser. Our implementation won't report any errors. Conformance checkers, however, will report the lack of a title element (amongst some other minor things).

Error handling

When parsing a document, potential parse errors may occur. With the load methods of \DOMDocument, a parser error results in an E_WARNING by default. However, you can use libxml_use_internal_errors(true) to store the errors inside an array. In this case, no warning will be generated and the parse errors may be inspected using libxml_get_errors() and libxml_get_last_error().
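A minimal sketch of this existing mechanism with \DOMDocument is shown below. The helper name collectParseErrors and the sample markup are illustrative assumptions. Which errors are reported depends on the libxml2 version PHP is linked against, so no specific output is shown.

```php
<?php
// collectParseErrors() is a hypothetical helper name used for illustration.
// Returns the LibXMLError objects produced while parsing $html with \DOMDocument.
function collectParseErrors(string $html): array
{
    // Store errors internally instead of emitting E_WARNING, remembering the previous setting.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $errors = libxml_get_errors();

    // Clean up so that later callers see an empty error buffer and the original setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $errors;
}

foreach (collectParseErrors('<p>unclosed <b>text</p><unknowntag></unknowntag>') as $error) {
    printf("line %d, column %d: %s\n", $error->line, $error->column, trim($error->message));
}
```

Restoring the previous setting keeps the helper from changing error behaviour for unrelated code in the same request.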

The naming of these methods is a bit unfortunate because it leaks implementation details. Users shouldn't have to care that it's actually libxml2 under the hood producing these errors. The reality is that these error methods have become synonymous with “handling errors in \DOMDocument / SimpleXML / ...”. To offer a seamless HTML5 drop-in, my current implementation follows the same error handling as described above. That means, by default we will emit an E_WARNING. If libxml_use_internal_errors(true) is used then the errors will be stored, and can be retrieved in the same way as described above. This may seem unconventional since the errors originate from Lexbor rather than libxml2. However, we have good reasons to do so.

The alternative would be to introduce methods specific to getting the errors from the HTML5 parser. However, I do not believe that's a good idea because:

  1. The developers utilising these new parsing methods don't necessarily know that it uses Lexbor. So they expect the error handling behaviour to be the same as the existing methods.
  2. The proposed approach makes it easier to use as a drop-in replacement.
  3. If libxml2 ever introduces its own HTML5 parser, we can drop Lexbor and nothing changes for the end user w.r.t. error handling.

External entity loader

XML supports something called “external entities”. This will load data from an external source into the current document (if enabled). Because you might want to customise the external entity handling, there's a libxml_set_external_entity_loader(?callable $resolver_function) function to set up a custom “resolver”. This “resolver” returns either a path, a stream resource, or null. In the former two cases, the entity will be loaded from the path or stream. In the latter case, the loading will be blocked.

This interacts a bit surprisingly with the existing loadHTMLFile method. You can observe this here: https://3v4l.org/rJTTc. The loadHTMLFile method considers loading the file also as loading an external entity, hence the “resolver” is invoked.

There's a (deprecated) similar function libxml_disable_entity_loader(bool $disable) that completely disables loading external entities. This function has been perceived as broken by the community due to it blocking loading anything that's not coming from a string. See https://github.com/php/php-src/pull/5867 for more details. I don't know how the community perceives the interaction between loadHTMLFile and libxml_set_external_entity_loader.

Unlike XML, HTML5 does not have a concept of external entities. The question I have is whether libxml_set_external_entity_loader should affect the new class's parser in the same way as it does for the existing class. The advantage would be consistency, but I don't know if this is what the community wants. I'm leaving this for a secondary vote for the community to decide on.

Interoperability between \DOMDocument and DOM\HTMLDocument

DOM\HTMLDocument and \DOMDocument are both subclasses of DOM\Document. Therefore, if you want to use both interchangeably, you can use the parent class as a type declaration. Since most of the API, except construction, is similar, this shouldn't cause interoperability problems.

However, what if you're using a library that returns a (non-HTML5) \DOMDocument but you'd like a DOM\HTMLDocument (or vice versa)? You can solve this issue by using the DOM\Document::importNode or DOM\Document::adoptNode methods.
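The sketch below shows the importNode pattern with two \DOMDocument instances, since that is what released PHP versions provide. Under this RFC the same call would be available through DOM\Document, so the target could be a DOM\HTMLDocument instead. The sample markup is an assumption for illustration.

```php
<?php
// Source document, e.g. returned by a library.
$source = new DOMDocument();
$source->loadXML('<root><item id="1">text</item></root>');

// Target document that should receive a copy of the node.
$target = new DOMDocument();
$target->loadXML('<container/>');

$item = $source->getElementsByTagName('item')->item(0);

// importNode() creates a copy owned by $target; true requests a deep copy of the subtree.
$copy = $target->importNode($item, true);
$target->documentElement->appendChild($copy);

echo $target->saveXML($target->documentElement), "\n";
```

importNode leaves the original node in the source document, whereas adoptNode moves the node itself into the target document.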

Parsing benchmarks

You might wonder about the performance impact of the tree conversion. In particular, how does the performance of DOM\HTMLDocument::createFromString compare with the performance of \DOMDocument::loadHTML? Note that the latter method doesn't follow the HTML5 rules, but it does give an indication of the performance.

Relevant scripts can be found at https://gist.github.com/nielsdos/5b59de15b4f1572b2147980eb0687df3.

Experimental setup

I downloaded the homepages of the top 50 websites as listed by similarweb, excluding blank pages and NSFW pages. After removing 6 NSFW sites and one blank page (microsoftonline.com), 43 websites remain. I created a PHP script that invokes each parser 300 times. I ran the experiment on an i7-4790 with 16GiB RAM.
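For readers who want to try something similar, a timing loop along these lines could be used. This is a simplified sketch and not the script from the gist. The function name timeParser, the tiny sample document, and the use of hrtime() are assumptions, and only the existing \DOMDocument parser is timed here.

```php
<?php
// timeParser() is a hypothetical helper: runs $parse on $html $iterations times
// and returns the elapsed wall-clock time in seconds.
function timeParser(callable $parse, string $html, int $iterations = 300): float
{
    $start = hrtime(true); // monotonic clock, in nanoseconds

    for ($i = 0; $i < $iterations; $i++) {
        $parse($html);
    }

    return (hrtime(true) - $start) / 1e9;
}

// Stand-in input; the experiment described above used downloaded homepages instead.
$html = '<!doctype html><html><head><title>t</title></head><body><p>Hello</p></body></html>';

$legacy = static function (string $html): void {
    $doc = new DOMDocument();
    // Suppress parser diagnostics so that reporting them does not distort the timing.
    $doc->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
};

printf("DOMDocument: %.4f s for 300 runs\n", timeParser($legacy, $html));
```

A second closure calling the proposed DOM\HTMLDocument factory method could be passed to the same function to compare both parsers on identical input.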

Results

The following graph shows the results. The blue bar shows the parse time in seconds for \DOMDocument, and the orange bar does so for DOM\HTMLDocument. Lower is better. The black vertical line indicates the minimum and maximum measured times for each bar. First of all, some measurements on the far left are very low. That's because those sites primarily generate their content using JavaScript. Hence, there are not many HTML nodes in the document. Some sites also show a geo-blocked page, so these pages are rather simple and will be parsed quickly. Second, we can see that DOM\HTMLDocument is usually on par with or faster than \DOMDocument's parser, despite having to do a conversion. When it is slower, it's not by much.

Based on this limited experiment, I conclude that the performance is acceptable.

Impact on binary size

Incorporating any library will increase the binary size of the DOM extension. The Lexbor library is fairly big. Some of the library is not actually used. I've manually ripped out the big parts of the CSS parser with a patch. However, diving into each source file and ripping out functions that are not used is time-consuming and difficult. Furthermore, this would make syncing upstream changes also more difficult.

Inspecting the dom.so shared library using the size command yields the following results:

before/after | text | data
before this patch | 174.78 KiB | 15.18 KiB
after this patch | 2966.81 KiB | 553.44 KiB

The large data section is due to the large lookup tables for text encoding handling: Lexbor supports a lot of text encodings. The HTML5 parser spec requires quite a few character encodings to be supported by a compliant parser. This also has some influence on the text section, but another big part of it is simply all the parsing logic.

Naming

The names are in accordance with the DOM specification.

The class is inside a new namespace called DOM. This follows the policy of the accepted Namespaces in bundled PHP extensions RFC. The capitalization of the namespace and class names follows the guidelines written in the Class Naming RFC.

There's currently a discussion on the mailing list about changing the above-linked policy: https://externals.io/message/120959. The casing rules are flexible with respect to the outcome of that potential future RFC. As this RFC is introduced in the 8.4 development cycle, there's still freedom to change the naming after this RFC is hypothetically accepted.

Completely alternative solutions

This section will list alternative solutions that I considered, but rejected.

Alternative DOM extension

One might wonder why we don't just create an entirely new DOM extension, based on another library, with HTML5 support. There are a couple of reasons:

  1. Interoperability problems with other extensions (both within php-src and third-party).
  2. Fragmentation of userland.
  3. Additional maintenance work and complexity.
  4. I don't have time to build this.

Rolling our own HTML5 parser

Instead of using an external library/dependency, why don't we make our own parser? There are a couple of reasons:

  1. It's complex
  2. It requires a lot of testing. Using a library that's been used by many others (as listed before) reduces the chance of bugs.
  3. It takes more maintenance effort to build our own, fix our bugs, and keep up with potential spec changes than relying on a library.
  4. Time constraints

Backward Incompatible Changes

This RFC adds three new classes, and new aliases. The existing \DOMDocument class remains as-is. DOMNode::ownerDocument gets its type changed from ?DOMDocument to ?DOM\Document. Similarly, DOMXPath::document gets its type changed from \DOMDocument to DOM\Document, and the constructor now receives DOM\Document instead of \DOMDocument. The constructor change is not a BC break, because constructors do not participate in LSP checks. As PHP's type checks happen at runtime instead of statically, this shouldn't affect assignments. Overriding the changed property in a child class of \DOMNode or \DOMXPath would cause a compile error. However, overriding properties is useless in PHP anyway, so this is only a minor break. Therefore, this feature is almost purely opt-in.

Proposed PHP Version(s)

Next PHP 8.x. At the time of writing this is PHP 8.4.

RFC Impact

To SAPIs

None.

To Existing Extensions

Only ext/dom is affected.

To Opcache

No impact.

New Constants

None.

php.ini Defaults

None.

Open Issues

None yet.

Unaffected PHP Functionality

Everything outside of ext/dom is unaffected.

Future Scope

This section details areas where the feature might be improved in future, but that are not currently proposed in this RFC.

The Lexbor library also includes functionality outside of HTML parsing that we do not use right now.

  1. It contains a CSS selector parser, that transforms the expression into a list of actions we must follow to find the elements. This could make implementing querySelector(All) easier.
  2. It contains a WHATWG-compliant URL parser, which might be useful for extending PHP's URL parsing capabilities.
  3. There are more performance optimization and possibly size reduction opportunities. I've already upstreamed work for reducing size.
  4. The new class could be a way to opt in to spec-compliant behaviour. This is out of scope for this RFC though.

Proposed Voting Choices

There is 1 primary vote, and there is 1 secondary vote:

  1. Whether the proposed classes and namespace aliases should be introduced. This requires 2/3 majority.
  2. Whether DOM\HTMLDocument::createFromFile should respect the resolver set by libxml_set_external_entity_loader. This requires 50% majority.

Patches and Tests

This does not yet include the external entity loader support. I want to wait until we have the results of the secondary vote before I spend time coding this part.

Implementation

After the project is implemented, this section should contain

  1. the version(s) it was merged into
  2. a link to the git commit(s)
  3. a link to the PHP manual entry for the feature
  4. a link to the language specification section (if any)

Rejected Features

None yet.

Changelog

  • 0.6.4: Add optional arguments $override_encoding to the factory methods.
  • 0.6.3: Fixed typo: fromEmpty -> createEmpty. There was a single place with this typo.
  • 0.6.2: Fixed some missing leading backslashes...
  • 0.6.1: Use FQN names, fixed a reference to an old name, and fixed typos
  • 0.6.0: mark classes as final, update method names, clarification about named constructor, list \DOMXPath modification..
  • 0.5.3: The options argument was discussed in the text but missing in the signature, this is now fixed.
  • 0.5.2: Clarification about \DOMDocument being kept as-is.
  • 0.5.1: Clarification about purpose of XMLDocument.
  • 0.5.0: Add a common base class DOM\Document, make DOM\HTMLDocument into DOM\HTMLDocument extending DOM\Document, add DOM\XMLDocument, add factory methods. See revision history and internals mail for full changelog.
  • 0.4.0: Initial version placed under discussion
rfc/domdocument_html5_parser.1696005691.txt.gz · Last modified: 2023/09/29 16:41 by nielsdos