This is an old revision of the document!


PHP RFC: DOM HTML5 parsing and serialization

Introduction

PHP's DOM extension supports loading HTML documents using the methods DOMDocument::loadHTML and DOMDocument::loadHTMLFile. This uses libxml2's HTML parser under the hood to parse these documents into a libxml2 document tree that ext/dom uses. Unfortunately, this parser only supports HTML up to version 4.01. This is a problem because HTML5 has become the de facto standard for websites over the past decade. Introducing HTML5 parsing to PHP's DOM implementation is crucial for modernizing and enhancing PHP's capabilities in handling modern web content.

Using loadHTML(File) to load HTML5 content results in multiple parsing errors and incorrect document trees. These issues arise from changes in parsing rules between HTML4 and HTML5. Notably, the current parser does not recognize semantic HTML5 tags (e.g., main, article, section, ...) as valid tags. There are also problems with certain element nestings that are not allowed in HTML4 but are allowed in HTML5, which leads to incorrect document trees. Another concern highlighted in PHP's bug tracker is the handling of closing tags within script contexts. With the common practice of embedding HTML within JavaScript, HTML4 parsers encounter problems with closing tags within JavaScript literals. Consequently, parsing through loadHTML(File) leads to incorrect document trees. The list of issues goes on and on. Not being able to parse HTML5 properly is one of the major pain points of our DOM extension.

There's an open issue at the libxml2 bugtracker to add HTML5 parsing support: https://gitlab.gnome.org/GNOME/libxml2/-/issues/211. However, it seems like this won't happen anytime soon. Furthermore, there are also problems with saving (also known as serializing) HTML5 documents due to subtle rule differences between HTML4 and HTML5. This RFC proposes a fully backwards compatible solution to deal with these problems. To solve the parsing issue, we will leverage an alternative HTML5 parser to create the libxml2 document tree. This parser seamlessly integrates with the DOM extension, ensuring compatibility for all existing code and third-party extensions. To solve the serialization issue, an implementation for the HTML5 serialization algorithm will also be added. The new functionality will be available via a DOM\HTML5Document class that extends DOMDocument and overrides the loadHTML(File) and saveHTML(File) methods.

Proposal

The most important requirement is that the new class must integrate seamlessly with the DOM extension. This means that using it must be a simple drop-in replacement. You will still be able to use all the existing APIs to manipulate and traverse DOM documents and nodes.

This proposal introduces the DOM\HTML5Document class that extends the DOMDocument class. The reason we introduce a new class instead of replacing the methods of the existing class is to ensure full backwards compatibility. There are applications that work with legacy HTML4 documents, and want the HTML4 behaviour. By keeping the DOMDocument class, nothing changes for existing code. Code that wants HTML5 functionality can use the DOM\HTML5Document class.

The class overrides the following methods with an HTML5 implementation:

  1. loadHTML
  2. loadHTMLFile
  3. saveHTML
  4. saveHTMLFile

Due to implementation technicalities, it also overrides the load and loadXML methods, but those behaviours don't change.
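A minimal sketch of what the drop-in usage could look like. DOM\HTML5Document is the class proposed by this RFC and is assumed not to exist in the PHP build running the code, so the sketch falls back to DOMDocument when the class is unavailable. The input string is made up for illustration.

```php
<?php
// DOM\HTML5Document is the class proposed in this RFC (an assumption here);
// fall back to the existing HTML4-based parser when it is not available.
$class = class_exists('DOM\HTML5Document') ? 'DOM\HTML5Document' : 'DOMDocument';
$doc = new $class();

// LIBXML_NOERROR silences parser diagnostics; the legacy parser reports
// errors for HTML5 tags such as <main>.
$doc->loadHTML(
    '<!doctype html><html><body><main><p>Hello</p></main></body></html>',
    LIBXML_NOERROR
);

// The existing traversal APIs work the same way regardless of which parser
// built the tree.
$paragraph = $doc->getElementsByTagName('p')->item(0);
echo $paragraph->textContent, "\n";
```

Only the class name changes between the two code paths; the calls that load and traverse the document are the existing DOMDocument methods.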

The options argument

Just like the load methods of DOMDocument, their HTML5 counterparts also take an optional options argument. The options for the load methods change the way the parser behaves. The only three libxml options that will have an effect for the new methods are LIBXML_HTML_NOIMPLIED, LIBXML_COMPACT, and LIBXML_NOERROR. Here's an overview of the other options that are unimplemented and the reason why:

Option | Reasoning
LIBXML_BIGLINES, LIBXML_PARSEHUGE | Not needed, this always works for the new methods.
LIBXML_DTDATTR, LIBXML_DTDLOAD, LIBXML_DTDVALID | There is only one valid DTD for HTML5, so these options don't make sense.
LIBXML_HTML_NODEFDTD | Not needed, this is the default HTML5 behaviour.
LIBXML_NOBLANKS | This doesn't remove blank nodes in all cases. There are rules that libxml2 follows based on whether the element accepts #PCDATA, and based on the position of the element. As HTML5 is not based on XML, there is no concept of #PCDATA. Hence, it is unclear what the right behaviour should be.
LIBXML_NOCDATA, LIBXML_NOEMPTYTAG, LIBXML_NOENT, LIBXML_NSCLEAN, LIBXML_XINCLUDE, LIBXML_SCHEMA_CREATE | This is only valid in XML, the concept doesn't exist in HTML5.
LIBXML_NONET | Not needed, the new methods never access the network.
LIBXML_NOWARNING | Not needed, only errors are reported. There's no concept of a warning because this is not a conformance checker.
LIBXML_PEDANTIC | Error reporting follows the spec, no custom error levels are available.

Furthermore, we also implement a custom option DOM\NO_DEFAULT_NS that avoids putting a default namespace on the HTML/SVG/MathML elements. This is done to ease migration and to make everything compatible with non-namespace aware DOM tools. Something very similar exists in masterminds/html5-php and this option is also used in Symfony's CSS Selector package.

Passing invalid options will throw a ValueError.
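As a rough illustration of how the options bitmask is passed, the sketch below uses the existing DOMDocument::loadHTML with two of the options that would remain supported. The fragment is made up; under this proposal the same call shape would apply to DOM\HTML5Document::loadHTML, which is an assumption since that class is only proposed here.

```php
<?php
$doc = new DOMDocument();

// Options are combined with bitwise OR:
// - LIBXML_HTML_NOIMPLIED: don't add implied <html>/<body> wrapper elements.
// - LIBXML_NOERROR: don't report parse errors.
$doc->loadHTML(
    '<div><p>fragment</p></div>',
    LIBXML_HTML_NOIMPLIED | LIBXML_NOERROR
);

// With the implied wrappers disabled, the fragment's own root element
// becomes the document element.
echo $doc->documentElement->nodeName, "\n";
```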

Additional background info

The DOM extension supports both XML and HTML documents. It's built heavily upon libxml2's APIs and data structures, just like all XML-related PHP extensions within php-src. This is great for interoperability (e.g. with simplexml and xsl). Third-party extensions also use libxml2 APIs. For example, the xmldiff PECL extension peeks into the internals of DOMDocument to grab the libxml2 data structures and compare them. It is not possible to switch away from the libxml2 library as the underlying basis for the DOM extension because that would cause a major BC break.

Approach

Parsing an HTML document via an HTML parser results in a document tree. The tree consists of HTML nodes. These nodes are structs on the heap created by the parser. In order to integrate an alternative parser into our DOM extension, these nodes need to be converted into libxml2 nodes. The resulting tree, after conversion, is then used in the DOM extension, just as if it had come from libxml2's parser.

The conversion is fairly straightforward. We perform a depth-first traversal on the tree, checking the node type and creating the corresponding libxml2 node. The traversal is performed using iteration instead of recursion to prevent stack overflows with deep trees. After this process is done, we throw away the old tree and are left with only the libxml2 tree.
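To illustrate the idea of an iterative depth-first conversion, here is a userland sketch that copies one DOM tree into another using an explicit stack. It is not the actual implementation, which is written in C and converts the parser's nodes to libxml2 nodes; the function name convertTree is hypothetical, and only element, text, and comment nodes are handled.

```php
<?php
// Illustration only: copy $source's tree into a fresh document without
// recursion, dispatching on the node type as the conversion step does.
function convertTree(DOMDocument $source): DOMDocument
{
    $target = new DOMDocument();

    // Each stack entry pairs a source node with the already-created target
    // parent that will receive its copy. Children are pushed in reverse so
    // that the first child is popped (and appended) first.
    $stack = [];
    for ($child = $source->lastChild; $child !== null; $child = $child->previousSibling) {
        $stack[] = [$child, $target];
    }

    while ($stack !== []) {
        [$node, $parent] = array_pop($stack);

        switch ($node->nodeType) {
            case XML_ELEMENT_NODE:
                $copy = $target->createElement($node->nodeName);
                foreach ($node->attributes as $attribute) {
                    $copy->setAttribute($attribute->nodeName, $attribute->nodeValue);
                }
                break;
            case XML_TEXT_NODE:
                $copy = $target->createTextNode($node->nodeValue);
                break;
            case XML_COMMENT_NODE:
                $copy = $target->createComment($node->nodeValue);
                break;
            default:
                // Other node types (doctype, ...) are skipped in this sketch.
                continue 2;
        }

        $parent->appendChild($copy);

        for ($child = $node->lastChild; $child !== null; $child = $child->previousSibling) {
            $stack[] = [$child, $copy];
        }
    }

    return $target;
}

$source = new DOMDocument();
$source->loadXML('<a x="1"><b>t</b><!--c--><d/></a>');
$converted = convertTree($source);
echo $converted->saveXML($converted->documentElement), "\n";
```

Because the pending work lives in a heap-allocated array rather than on the call stack, the nesting depth of the input doesn't affect stack usage.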

For serializing, I wrote code implementing the HTML5 serialization algorithm using libxml2 nodes. I could've also developed a method of converting a libxml2 tree back to the original type of tree that the parser produced, but that's more complicated to implement and likely has slower performance.

Choosing an HTML5 parser

We have to choose a suitable HTML5 parser. It should be spec-compliant, heavily tested, and fast. I propose to use Lexbor. According to its README, it satisfies our requirements. Furthermore, people already made bindings for Elixir, Crystal, Python, D, and Ruby. This shows that it has been used in practice in other serious projects.

It is fully written in C99. That's ideal, because PHP is also using the C99 standard. One small complication is that this library is not available in the package managers of most distros. Therefore, I propose to bundle it with PHP. This also gives us the freedom to incorporate a patch to expose the line and column numbers of HTML nodes such that the error messages are richer and the DOMNode::getLineNo() function will work properly. Bundling a library with PHP is not unprecedented: PHP already bundles e.g. pcre2lib, libgd, libmagic, ...

Lexbor also supports overriding the allocation routines. Therefore, we can make it work with PHP's memory limit, something that is currently not done with libxml2.

Alternative HTML5 parsers considered

Lexbor is one of several HTML5 parsers available. During my investigation, I considered two alternatives:

  • Gumbo: https://github.com/google/gumbo-parser.
    A relatively well-known HTML5 parser developed by Google in C.
    Unfortunately, it has been unmaintained since 2016, as indicated in its README, making it unsuitable for use.
  • html5ever: https://github.com/servo/html5ever.
    This is Servo's HTML5 parser, written in Rust.
    I have implemented a proof-of-concept conversion from html5ever to libxml2, and a proof-of-concept integration with PHP on my fork.

    I decided to not go with this option for a few reasons.
    * Firstly, while writing it in Rust would enhance memory safety (especially for untrusted documents), introducing Rust as an additional dependency for PHP adds extra complexity. PHP's default-enabled extensions can currently be built using only C, but if we go this route this would change.
    * Secondly, the implementation is incomplete, primarily the lack of character encoding support is problematic: it currently only supports UTF-8 documents. Moreover, logic for character encoding meta tags is absent.
    * Lastly, observing the commit activity raises doubts about the ongoing activity of this project.

Considering these factors, I opted against using the above two. Lexbor emerged as the better choice after this investigation.

A note on conformance checkers

I want to emphasize that the HTML5 parser is not a conformance checker. Conformance checkers check for additional rules on top of the parsing rules. Browsers, and the proposed class, only perform the parsing rule checks. An example of something that's fine for an HTML5 parser, but not fine for a conformance checker, is the following document:

<!doctype html><html><head></head><body></body></html>

This is perfectly valid for a parser. Our implementation won't report any errors. Conformance checkers, however, will report the lack of a title element (amongst some other minor things).

Error handling

When parsing a document, potential parse errors may occur. With the load methods of DOMDocument, a parser error results in an E_WARNING by default. However, you can use libxml_use_internal_errors(true) to store the errors inside an array. In this case, no warning will be generated and the parse errors may be inspected using libxml_get_errors() and libxml_get_last_error().
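The existing error-collection flow described above can be sketched as follows. It uses only the current DOMDocument and libxml functions; the input string is made up, and which messages (if any) are reported depends on the libxml2 version in use.

```php
<?php
// Collect parse errors internally instead of emitting E_WARNINGs.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML('<!doctype html><html><body><main>Hi</main></body></html>');

// Each entry is a LibXMLError object with line, column and message properties.
foreach (libxml_get_errors() as $error) {
    printf("line %d, column %d: %s\n", $error->line, $error->column, trim($error->message));
}

// Empty the error buffer and restore the previous error-handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```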

The naming of these methods is a bit unfortunate because it leaks implementation details. Users shouldn't have to care that it's actually libxml2 under the hood producing these errors. The reality is that these error methods have become synonymous with “handling errors in DOMDocument / SimpleXML / ...”. To offer a seamless HTML5 drop-in, my current implementation follows the same error handling as described above. That means, by default we will emit an E_WARNING. If libxml_use_internal_errors(true) is used then the errors will be stored, and can be retrieved in the same way as described above. This may seem unconventional since the errors originate from Lexbor rather than libxml2. However, we have good reasons to do so.

The alternative would be to introduce methods specific to getting the errors from the HTML5 parser. However, I do not believe that's a good idea because:

  1. The developers utilising these new parsing methods don't necessarily know that it uses Lexbor. So they expect the error handling behaviour to be the same as the existing methods.
  2. The proposed approach makes it easier to use as a drop-in replacement.
  3. If libxml2 ever introduces its own HTML5 parser, we can drop Lexbor and nothing changes for the end user w.r.t. error handling.

External entity loader

XML supports something called “external entities”. This will load data from an external source into the current document (if enabled). Because you might want to customise the external entity handling, there's a libxml_set_external_entity_loader(?callable $resolver_function) function to set up a custom “resolver”. This “resolver” returns either a path, a stream resource, or null. In the former two cases, the entity will be loaded from the path or stream. In the latter case, the loading will be blocked.

This interacts a bit surprisingly with the existing loadHTMLFile method. You can observe this here: https://3v4l.org/rJTTc. The loadHTMLFile method considers loading the file also as loading an external entity, hence the “resolver” is invoked.

There's a (deprecated) similar function libxml_disable_entity_loader(bool $disable) that completely disables loading external entities. This function has been perceived as broken by the community due to it blocking loading anything that's not coming from a string. See https://github.com/php/php-src/pull/5867 for more details. I don't know how the community perceives the interaction between loadHTMLFile and libxml_set_external_entity_loader.

Unlike XML, HTML5 does not have a concept of external entities. The question I have is whether libxml_set_external_entity_loader should affect the new class's loadHTMLFile in the same way as it does for the existing class. The advantage would be consistency, but I don't know if this is what the community wants. I'm leaving this for a secondary vote for the community to decide on.

Interoperability between DOMDocument and DOM\HTML5Document

Because DOM\HTML5Document is a subclass of DOMDocument, all methods accepting a DOMDocument also accept a DOM\HTML5Document. These functions can transparently work on HTML5 documents. If you want to restrict your code to only accept HTML5 documents, you can use the stricter DOM\HTML5Document type hint.

However, what if you're using a library that returns a (non-HTML5) DOMDocument but you'd like a DOM\HTML5Document (or vice versa)? You can solve this issue by using the DOMDocument::importNode or DOMDocument::adoptNode methods.
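A minimal sketch of moving nodes between documents with importNode. Both documents are plain DOMDocument instances here so that the code runs on current PHP versions; under this proposal the target could instead be a DOM\HTML5Document (an assumption, since that class is only proposed here). The markup is made up for illustration.

```php
<?php
// A document as it might be returned by a library using the legacy parser.
$legacy = new DOMDocument();
$legacy->loadHTML('<html><body><p id="greeting">Hello</p></body></html>');

// The document we actually want to work with.
$target = new DOMDocument();

// importNode() creates a deep copy owned by $target; the copy still has to
// be attached to the tree explicitly.
$copy = $target->importNode($legacy->documentElement, true);
$target->appendChild($copy);

echo $target->getElementsByTagName('p')->item(0)->textContent, "\n";
```

importNode leaves the source document untouched, whereas adoptNode moves the node out of its original document.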

Parsing benchmarks

You might wonder about the performance impact of the tree conversion. In particular, how does the performance of DOM\HTML5Document::loadHTML compare with the performance of DOMDocument::loadHTML? Note that the latter method doesn't follow the HTML5 rules, but it does give an indication about the performance.

Relevant scripts can be found at https://gist.github.com/nielsdos/5b59de15b4f1572b2147980eb0687df3.

Experimental setup

I downloaded the homepages of the top 50 websites (excluding blank pages and NSFW pages) as listed according to similarweb. This means 43 websites remain: 6 NSFW sites, and one blank page (microsoftonline.com) were removed. I created a PHP script that invokes each parser 300 times. I ran the experiment on an i7-4790 with 16GiB RAM.
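The shape of such a measurement loop might look like the sketch below. This is not the script from the gist linked above: the helper name timeParser and the placeholder input are hypothetical, and it only times the existing DOMDocument parser.

```php
<?php
// Hypothetical helper: run $parse on $html $iterations times and return the
// elapsed wall-clock time in seconds.
function timeParser(callable $parse, string $html, int $iterations): float
{
    $start = hrtime(true); // monotonic clock, in nanoseconds
    for ($i = 0; $i < $iterations; $i++) {
        $parse($html);
    }
    return (hrtime(true) - $start) / 1e9;
}

// Placeholder input; the real experiment used downloaded homepages.
$html = '<!doctype html><html><body><p>placeholder page</p></body></html>';

$seconds = timeParser(
    function (string $html): void {
        $doc = new DOMDocument();
        $doc->loadHTML($html, LIBXML_NOERROR);
    },
    $html,
    300
);

printf("DOMDocument: %.4f s for 300 parses\n", $seconds);
```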

Results

The following graph shows the results. The blue bar shows the parse time in seconds for DOMDocument, and the orange bar does so for DOM\HTML5Document. Lower is better. The black vertical line indicates the minimum and maximum measured times for each bar. First of all, some measurements on the far left are very low. That's because those sites primarily generate their content using JavaScript. Hence, there are not many HTML nodes in the document. Some sites also show a geo-blocked page, so these pages are rather simple and will be parsed quickly. Second, we can see that DOM\HTML5Document is usually on par with or faster than DOMDocument's parser, despite having to do a conversion. When it is slower, it's not by much.

Based on this limited experiment, I conclude that the performance is acceptable.

Impact on binary size

Incorporating any library will increase the binary size of the DOM extension. The Lexbor library is fairly big. Some of the library is not actually used. I've manually ripped out the big parts of the CSS parser with a patch. However, diving into each source file and ripping out functions that are not used is time-consuming and difficult. Furthermore, this would make syncing upstream changes also more difficult.

Inspecting the dom.so shared library using the size command yields the following results:

before/after | text | data
before this patch | 174.78 KiB | 15.18 KiB
after this patch | 2966.81 KiB | 553.44 KiB

The large data section is due to the large lookup tables for text encoding handling: Lexbor supports a lot of text encodings. The HTML5 parser spec requires quite a few character encodings to be supported by a compliant parser. This also has some influence on the text section, but another big part of it is simply all the parsing logic.

Naming

I'm open to discussion about the name. I chose to use the HTML5 name because this is widely recognized as meaning “modern HTML technology”. See also https://html.spec.whatwg.org/multipage/introduction.html#is-this-html5. The name may still not be that great because you can still load XML documents with it.

The class is inside a new namespace called DOM. This follows the policy of the accepted Namespaces in bundled PHP extensions RFC. The capitalization of the namespace and class names follows the guidelines written in the Class Naming RFC.

There's currently a discussion on the mailing list about changing the above-linked policy: https://externals.io/message/120959. The casing rules are flexible with respect to the outcome of that potential future RFC. As this RFC is introduced in the 8.4 development cycle, there's still freedom to change the naming after this RFC is hypothetically accepted.

This paragraph introduces the second primary vote of this RFC. We'll have DOM classes in the global namespace and a single class (i.e. HTML5Document) in the (new) DOM namespace. This is awkward. I propose to solve this by creating namespace aliases for the existing DOM classes, the constants, and the single function. This would improve consistency and may, in the distant future, allow a complete transition to the namespaced variants. This means for example that there will be an alias DOM\Document for DOMDocument, an alias DOM\Entity for DOMEntity, etc. There is a single function dom_import_simplexml, which can get an alias as DOM\import_simplexml. Similarly, the constants would lose their DOM_ prefix in the namespace version, e.g. DOM\INDEX_SIZE_ERR will be an alias for DOM_INDEX_SIZE_ERR. For constants that begin with XML_ I propose to keep the prefix.

Completely alternative solutions

This section will list alternative solutions that I considered, but rejected.

Alternative DOM extension

One might wonder why we don't just create an entirely new DOM extension, based on another library, with HTML5 support. There are a couple of reasons:

  1. Interoperability problems with other extensions (both within php-src and third-party).
  2. Fragmentation of userland.
  3. Additional maintenance work and complexity.
  4. I don't have time to build this.

Rolling our own HTML5 parser

Instead of using an external library/dependency, why don't we make our own parser? There are a couple of reasons:

  1. It's complex
  2. It requires a lot of testing. Using a library that's been used by many others (like listed before), reduces the chance of bugs.
  3. It takes more maintenance effort to build our own, fix our bugs, and keep up with potential spec changes than relying on a library.
  4. Time constraints

Backward Incompatible Changes

This RFC adds a new class, but leaves the existing DOMDocument class as-is. Therefore, this feature is purely opt-in, and there is no BC break.

Proposed PHP Version(s)

Next PHP 8.x. At the time of writing this is PHP 8.4.

RFC Impact

To SAPIs

None.

To Existing Extensions

Only ext/dom is affected.

To Opcache

No impact.

New Constants

DOM\NO_DEFAULT_NS, the custom parser option described in “The options argument”.

php.ini Defaults

None.

Open Issues

None yet.

Unaffected PHP Functionality

Everything outside of ext/dom is unaffected.

Future Scope

This section details areas where the feature might be improved in future, but that are not currently proposed in this RFC.

The Lexbor library also includes functionality outside of HTML parsing that we do not use right now.

  1. It contains a CSS selector parser, that transforms the expression into a list of actions we must follow to find the elements. This could make implementing querySelector(All) easier.
  2. It contains a WHATWG-compliant URL parser, which might be useful for extending PHP's URL parsing capabilities.
  3. There are more performance optimization and possibly size reduction opportunities. I've already upstreamed work for reducing size.
  4. The new class could be a way to opt in to spec-compliant behaviour. This is out of scope for this RFC though.

Proposed Voting Choices

There are 2 primary votes, and there is 1 secondary vote:

  1. Whether DOM\HTML5Document should be introduced. This requires a 2/3 majority.
  2. Whether to create namespace aliases for existing DOM classes into a DOM namespace. This requires a 2/3 majority.
  3. Whether DOM\HTML5Document::loadHTMLFile should respect the resolver set by libxml_set_external_entity_loader. This requires a simple majority (more than 50%).

Patches and Tests

  1. Pull request: https://github.com/nielsdos/php-src/pull/32 (TODO: move this to php-src)

This does not yet include the external entity loader support. I want to wait until we have the results of the secondary vote before I spend time coding this part.

Implementation

After the project is implemented, this section should contain

  1. the version(s) it was merged into
  2. a link to the git commit(s)
  3. a link to the PHP manual entry for the feature
  4. a link to the language specification section (if any)

Rejected Features

None yet.

Changelog

  • 0.x.y: Initial version placed under discussion
rfc/domdocument_html5_parser.1693683048.txt.gz · Last modified: 2023/09/02 19:30 by nielsdos