rfc:ast_based_parsing_compilation_process

Request for Comments: Moving to an AST-based parsing/compilation process

Introduction

Currently PHP uses a single-pass compilation process, i.e. the parser directly invokes opcode compilation routines. Most other languages on the other hand use an intermediary structure to separate those two phases: The parser only emits an abstract syntax tree (AST), which is then used by a separate compiler to emit instructions. The use of an AST decouples the two phases and as such allows for greater flexibility and deeper analysis.

This RFC outlines why PHP should consider moving to an AST-based compilation process.

Advantages

Potential for more advanced optimizations

An abstract syntax tree gives us the ability to apply more advanced optimizations. It is usually not possible to apply compile-time optimizations with a single-pass compiler because not enough information is available at the time the instructions are emitted. With an abstract syntax tree on the other hand a lot more information is easily available, which can be used to compile faster opcodes.

The main issue here is that PHP does not cache the compiled opcodes, so they are regenerated on every request. As an optimizing compiler increases compile times it may be that the additional compilation overhead is actually larger than the resources saved in execution. Adding advanced optimizations in the compilation process would probably require the inclusion of some kind of opcode-cache directly into PHP. Or alternatively the optimizations would be done by a separate extension (which could be employed together with APC etc).

Furthermore one should point out that the concept of “advanced optimizations” is rather fuzzy. If you look at projects like HipHop, you'll find that the performance numbers don't actually show a vast improvement over the native Zend implementation. There just isn't *that* much potential for optimization in a dynamic language (unless you go down the JIT road).

Generally I'd say that the potential for optimizations is the least important reason to move to an AST (even though it is probably the first thing that comes to mind). It's just not clear enough how much it will actually bring (or cost).

Elimination of various quirks

Currently there are various quirks in the emitted opcodes which can be attributed to the use of a single-pass compiler. Some examples:

All these can be eliminated when an AST is used.

Removal of ugly hacks / reducing complexity of the compiler

In order to support single-pass compilation ugly hacks have to be used both in the parser and in the compiler. One example is that already emitted opcodes sometimes have to be overridden or adjusted at a later time (and I'm not talking about pass_two here). Generally all code related to variable access (where “variable” means, variables, properties, functions, methods etc) is very complex and hard to understand. The immense complexity in this area is also part of the reason why the dereferencing syntaxes in PHP are so inconsistent.

Using an AST the compiler can be written in a much cleaner way. This would make the code in this area easier to understand, easier to maintain and also more friendly to new contributors.

Decoupling syntax decisions from technical issues

With the current single-pass compiler some things are very hard / near impossible to implement. This actively influences syntax decisions.

A few examples of syntax that is currently not possible, but would be possible with a syntax tree:

  • Array destructuring using something like [$a, $b, $c] = $array instead of a dedicated list() syntax. This is common in other languages, but not possible in PHP.
  • List comprehensions / generator expressions where the result expression comes first, e.g. [x * x for x in list] in Python. In PHP only the reverse syntax is possible: [foreach ($list as $x) yield $x * $x]
  • C#-style expression trees (which form the basis for LINQ)

Apart from larger syntax limitations the current system commonly also affects smaller syntax decisions. One example here are the strange parentheses requirements for the yield expression. Those requirements exist solely for technical reasons and would not be required with an AST-generating parser.

Better error messages

Currently many things are directly enforced in the grammar which should really be checked during compilation (or a completely separate pass). E.g. if you try to initialize a class property with a non-static value, you'll get a rather unintelligible parse error message, instead of something like Cannot initialize property with non-static value. (And then you obviously go to StackOverflow, ask the question for the five hundredth time and annoy the heck out of me!)

Disadvantages

The main disadvantage of generating an AST is (quite obviously) that it slows down compilation and requires more memory. At this point it is hard to estimate how much impact it will have in this respect.

Summary

Splitting up parsing and compilation in two phases comes with numerous advantages, the most important ones being reducing the complexity of the compiler (and grammar) as well as reducing the influence of technical issues on language design (syntax in particular).

But: Moving to an AST-based compilation process would basically require a (nearly) complete rewrite of the parser and the compiler. It would be a hell lot of work. It would be nice to get some help there ;)

rfc/ast_based_parsing_compilation_process.txt · Last modified: 2013/08/24 21:13 by nikic