rfc:abstract_syntax_tree

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revisionPrevious revision
Next revision
Previous revision
rfc:abstract_syntax_tree [2014/07/31 15:20] nikicrfc:abstract_syntax_tree [2017/09/22 13:28] (current) – external edit 127.0.0.1
Line 2: Line 2:
   * Date: 2014-07-28   * Date: 2014-07-28
   * Author: Nikita Popov <nikic@php.net>   * Author: Nikita Popov <nikic@php.net>
-  * Status: Draft +  * Status: Implemented (in PHP 7) 
-  * TargetingPHP.next+  * Discussionhttp://markmail.org/message/br4ixewsnqitrx3n
  
 ===== Introduction ===== ===== Introduction =====
Line 17: Line 17:
 ==== More maintainable parser and compiler ==== ==== More maintainable parser and compiler ====
  
-In the new AST-based implementation the compiler is fully decoupled from the parser, which leads to a code quality and maintainability improvement. In the following some examples of how this improves our code are discussed:+In the new AST-based implementation the compiler is fully decoupled from the parser, which leads to a code quality and maintainability improvement. In the following some examples of such improvements are discussed:
  
   * The parser no longer needs to define separate productions in cases where the same syntax requires different compilation. For example static scalar expressions no longer need to redefine all basic operations and can reuse the normal ''expr'' production.   * The parser no longer needs to define separate productions in cases where the same syntax requires different compilation. For example static scalar expressions no longer need to redefine all basic operations and can reuse the normal ''expr'' production.
-  * The parser needs far fewer mid-rule semantic actions. Now mid-rule reduction is only used to back up doc comments, whereas previously the use was ubiquitous. Apart from code quality concerns this is beneficial because mid-rule actions force the parser to reduce earlier, i.e. the parser is allowed to inspect a smaller number of tokens in order to decide which rule should be reduced. This limits the syntax we can implement. +  * The parser needs far fewer mid-rule semantic actions. Now mid-rule reduction is only used to back up doc comments, whereas previously the use was ubiquitous. Apart from code quality concernsthis is beneficial because mid-rule actions force the parser to reduce earlier, i.e. the parser is allowed to inspect a smaller number of tokens in order to decide which rule should be reduced. This limits the syntax we can implement. 
-  * Implementations of control flow structures were usually spread across multiple functions called as mid-rule actions. Jump instruction opnums were backed up into arbitrary znodes, which usually results in very hard to follow code (like ''%%CG(active_op_array)->opcodes[close_bracket_token->u.op.opline_num].op2.opline_num = get_next_op_number(CG(active_op_array))%%''). The opnums were stored both directly from the parser and from the compiler, sometimes updated across multiple rules and as such hard to follow (a particular nice example being try/catch handling)+  * Implementations of control flow structures were usually spread across multiple functions called as mid-rule actions. Jump instruction opnums were backed up into arbitrary znodes (either from the parser or from the compiler), which usually results in very hard to follow code. When reading compiler code you always have to wonder about questions like the following: To what does ''%%close_bracket_token->u.op.opline_num%%'' refer? Where was this opline emitted and what opcode does it have? What does it mean if the ''op2.opline_num'' of that opline is changed? 
-  * Variables were implemented through a backpatch list (and stack), into which oplines necessary for ''BP_VAR_W'' fetches were inserted. Afterwards the oplines were modified or removed depending on the fetch mode that was eventually chosen. The AST-based implementation can directly compile using the correct fetch mode. +  * Variables were previously implemented through a backpatch list (and stack), into which oplines necessary for ''BP_VAR_W'' fetches were inserted. Afterwards the oplines were modified or removed depending on the fetch mode that was eventually chosen. The AST-based implementation can directly compile using the correct fetch mode. 
-  * We also no longer need to backpatch in a number of other places, e.g. during list compilation.+  * We also no longer need to backpatch in a number of other places, e.g. during ''list()'' compilation.
  
 ==== Decoupling syntax decisions from technical issues ==== ==== Decoupling syntax decisions from technical issues ====
Line 36: Line 36:
 This choice was made purely due to technical restrictions and is removed in the AST implementation. This choice was made purely due to technical restrictions and is removed in the AST implementation.
  
-Additionally the current compiler architecture prevents use from implementing some types of syntax altogether. Examples include:+Additionally the current compiler architecture prevents us from implementing some types of syntax altogether. Examples include:
  
   * Array destructuring using ''[$a, $b, $c] = $array'' instead of a dedicated ''list()'' syntax. This is common in other languages, but not possible in PHP.   * Array destructuring using ''[$a, $b, $c] = $array'' instead of a dedicated ''list()'' syntax. This is common in other languages, but not possible in PHP.
Line 48: Line 48:
 The abstract syntax tree has little impact on runtime performance or memory usage. While the AST does allow us to generate slightly better and smaller instruction sequences in some cases (e.g. for ''for'' loops), the practical impact is not worth mentioning. The abstract syntax tree has little impact on runtime performance or memory usage. While the AST does allow us to generate slightly better and smaller instruction sequences in some cases (e.g. for ''for'' loops), the practical impact is not worth mentioning.
  
-The introduction of an AST does however impact the performance and memory usage of the compilation process itself. It should be emphasized that this difference is only relevant when no opcode cache is in use. If an opcode cache is used then each file is only compiled once, as such any difference does not have a practical impact.+The introduction of an AST does however impact the performance and memory usage of the compilation process itself. It should be emphasized that this difference is //only relevant when no opcode cache is in use//. If an opcode cache is used then each file is only compiled once, as such any difference does not have a practical impact.
  
 The script used to measure the following numbers is available as a [[https://gist.github.com/nikic/289b0c7538b46c2220bc|Gist]]. Tests were performed on three files, with different sizes. The "small" one has about 100 lines of code, the "medium" one about 700 and the "large" one about 2800. The script used to measure the following numbers is available as a [[https://gist.github.com/nikic/289b0c7538b46c2220bc|Gist]]. Tests were performed on three files, with different sizes. The "small" one has about 100 lines of code, the "medium" one about 700 and the "large" one about 2800.
Line 78: Line 78:
 The introduction of the AST comes with minor changes to syntax and behavior, which are listed in the following: The introduction of the AST comes with minor changes to syntax and behavior, which are listed in the following:
  
-==== ''yield'' does not require parentheses ====+==== yield does not require parentheses ====
  
 This is the only syntax related change and was already mentioned previously. ''yield'' in expression use no longer needs parentheses, so all of the following are valid: This is the only syntax related change and was already mentioned previously. ''yield'' in expression use no longer needs parentheses, so all of the following are valid:
Line 94: Line 94:
 Similarly ''byRef(func())'' and ''%%byRef((func()))%%'' will now both throw a strict-standards notice if ''byRef'' takes its parameter by-reference, but ''func'' does not return by reference. Similarly ''byRef(func())'' and ''%%byRef((func()))%%'' will now both throw a strict-standards notice if ''byRef'' takes its parameter by-reference, but ''func'' does not return by reference.
  
-==== Order of ''list()'' assignments ====+==== Changes to list() ==== 
 + 
 +> **Note**: The behavior of ''list($a, $b) = $a'' described below no longer applies. After this RFC was accepted ''list()'' assignments that contain the same variable on the left- and right-hand side have been special cased to ensure the right-hand side always evaluates first. This means that ''list($a, $b) $a'' continues working as expected.
  
 ''list()'' currently assigns variables right-to-left, the AST implementation will assign them left-to-right instead: ''list()'' currently assigns variables right-to-left, the AST implementation will assign them left-to-right instead:
Line 106: Line 108:
 </code> </code>
  
-Furthermore ''list()'' now only accesses every offset once.+Another example where the assignment order is relevant is if both the left and right side of the list assignment use the same variable:
  
-==== Order of evaluation of assignments ====+<code php> 
 +$a [1, 2]; 
 +list($a, $b) $a;
  
-For assignments the right-hand side is now evaluated before the left-hand side:+// OLD: $a = 1, $b = 2 
 +// NEW: $a = 1, $b = null + "Undefined index 1" 
 + 
 +$b = [1, 2]; 
 +list($a, $b) = $b; 
 +// OLD: $a = null + "Undefined index 0", $b = 2 
 +// NEW: $a = 1, $b = 2 
 +</code> 
 + 
 +''list()'' will now access every offset only once:
  
 <code php> <code php>
-$i = 1; +list(list($a, $b)) = $array; 
-$array[$i++] = $i+++ 
-// OLD: [1 => 2+// OLD: 
-// NEW: [=1]+$b = $array[0][1]; 
 +$a $array[0][0]; 
 + 
 +// NEW: 
 +$_tmp = $array[0]; 
 +$a $_tmp[0]; 
 +$b = $_tmp[1];
 </code> </code>
  
-Note that order of evaluation is *explicitly documented to be undefined* and also not consistentFor example, in the following example both the old implementation and the AST implementation evaluate the right-hand side first:+The only visible change this has for most purposes is that an "Undefined index" notice is only thrown once per index, not many times. 
 + 
 +Empty ''list()''s are now always disallowed. Previously they were only forbidden in some places:
  
 <code php> <code php>
-$i 1+list() $a          // INVALID 
-$array[$i] = $i+++list($b, list()) = $a// INVALID 
-// OLD + NEW: [2 => 1]+foreach ($a as list()) // INVALID (was also invalid previously)
 </code> </code>
  
-===== Patch =====+==== Auto-vivification order for by-reference assignments ====
  
-The AST implementation can be found on GitHub: https://github.com/nikic/php-src/tree/ast+> **Note**: The auto-vivification order for reference assignments has been restored to the old behavior in PHP 7.1. The reason for this is that we found hard to avoid memory safety issues with the new order.
  
-The branch already includes the Uniform Variable Syntax RFCas it was a necessary prerequisite for the implementation.+While by-reference assignments are (CVs notwithstanding) evaluated left-to-rightauto-vivification currently occurs right-to-left. In the AST implementation this will happen left-to-right instead:
  
-The implementation has everything ported, but probably still has some bugs and needs some cleanup :)+<code php> 
 +$obj = new stdClass; 
 +$obj->a = &$obj->b; 
 +$obj->b = 1; 
 +var_dump($obj); 
 + 
 +// OLD: 
 +object(stdClass)#1 (2) { 
 +  ["b"]=> 
 +  &int(1) 
 +  ["a"]=> 
 +  &int(1) 
 +
 + 
 +// NEW: 
 +object(stdClass)#1 (2) { 
 +  ["a"]=> 
 +  &int(1) 
 +  ["b"]=> 
 +  &int(1) 
 +
 +</code> 
 + 
 +Note: The order can easily be changed, but the old behavior looks like a bug to me, so I decided to keep the new behavior. 
 + 
 +==== Directly calling __clone is allowed ==== 
 + 
 +Doing calls like ''%%$obj->__clone()%%'' is now allowed. This was the only magic method that had a compile-time check preventing some calls to it, which doesn't make sense. If we allow all other magic methods to be called, there's no reason to forbid this one.
  
 ===== Implementation ===== ===== Implementation =====
Line 441: Line 489:
  
 The ''zend_begin_loop'' and ''zend_end_loop'' functions store information for break/continue and try/catch. The ''zend_begin_loop'' and ''zend_end_loop'' functions store information for break/continue and try/catch.
 +
 +===== Additional possibilities (not implemented) =====
 +
 +The generated AST can be exposed to userland via an extension, for use by static analysers. This should be relatively easy to implement and we might even want to provide this as a bundled extension (like ext/tokenizer).
 +
 +More interestingly, we could allow extensions to hook into the compilation process (the current AST implementation does not provide hooks, but they can be added if we want them). This would allow extensions to implement some types of "language features".
 +
 +As an example, this is roughly how an implementation of the [[rfc:ifsetor|ifsetor RFC]] //could// look like using an AST hook:
 +
 +<code c>
 +/* Works by rewriting ifsetor($foo, 'bar') to isset($foo) ? $foo : 'bar' */
 +void ext_ifsetor_hook(zend_ast **ast_ptr TSRMLS_DC) {
 +    zend_ast *ast = *ast_ptr;
 +    
 +    if (ast->kind == ZEND_AST_CALL && ast->child[0]->kind == ZEND_AST_ZVAL) {
 +        zend_string *name = zval_get_string(zend_ast_get_zval(ast->child[0]));
 +        zend_ast_list *args = zend_ast_get_list(ast->child[1]);
 +        
 +        if (zend_str_equals_literal_ci(name, "ifsetor")
 +            && args->children == 2 && !zend_args_contain_unpack(args)
 +        ) {
 +            if (!zend_is_variable(args->child[0])) {
 +                zend_error_noreturn(E_COMPILE_ERROR, "First argument of ifsetor "
 +                    "must be a variable");
 +            }
 +            
 +            /* Note: One would need a function for adding refs to args->child[0] here,
 +             * as it is used two times - as written here it won't work correctly. */
 +            *ast_ptr = zend_ast_create(ZEND_AST_CONDITIONAL,
 +                zend_ast_create(ZEND_AST_ISSET, args->child[0]),
 +                args->child[0],
 +                args->child[1]
 +            );
 +        }
 +        
 +        STR_RELEASE(name);
 +    }
 +}
 +</code>
 +
 +I don't know how useful this is and how many things can be implemented in this way, but I think it's worth considering.
 +
 +An additional possibility is to drop the keywords for ''isset'' and ''empty'' and just compile them as special function calls (using similar checks as the code above). Maybe other keywords can be dropped as well.
 +
 +===== Patch =====
 +
 +The AST implementation can be found at https://github.com/nikic/php-src/tree/ast. Some quick links to the most important files:
 +
 +  * ''[[https://github.com/nikic/php-src/blob/ast/Zend/zend_language_parser.y|zend_language_parser.y]]''
 +  * ''[[https://github.com/nikic/php-src/blob/ast/Zend/zend_ast.h|zend_ast.h]]''
 +  * ''[[https://github.com/nikic/php-src/blob/ast/Zend/zend_ast.c|zend_ast.c]]''
 +  * ''[[https://github.com/nikic/php-src/blob/ast/Zend/zend_compile.c|zend_compile.c]]'' (The code relevant to the AST begins somewhere around line 3200, everything before that is largely untouched.)
 +
 +//The branch already includes the Uniform Variable Syntax RFC//, as it was a necessary prerequisite for the implementation.
 +
 +The implementation has everything ported, but probably still has some bugs and needs some cleanup :)
 +
 +===== Vote =====
 +The vote started on 2014-08-18 and ended on 2014-08-25. The necessary 2/3 majority was reached, as such the RFC is accepted.
 +
 +<doodle title="Use AST implementation in PHP 7?" auth="nikic" voteType="single" closed="true">
 +   * Yes
 +   * No
 +</doodle>
rfc/abstract_syntax_tree.1406820048.txt.gz · Last modified: 2017/09/22 13:28 (external edit)