
PHP RFC: Parser Extension API

Introduction

This RFC proposes a userland parser extension API that provides access to the low-level Abstract Syntax Tree (AST) parser.

The RFC consists of two parts:

  • a parsing API that provides an AST for a given string of source code
  • an extension API that allows registering custom PHP hooks from userland to modify the Abstract Syntax Tree before it is compiled into opcodes.

Parsing API proposal

Previous versions of PHP do not provide an API for accessing the Abstract Syntax Tree of a given piece of code, because no AST existed at the engine level. There is only the tokenizer extension, whose token_get_all() function provides information about lexical tokens. However, this token stream cannot be used easily: PHP's grammar is complex, so any consumer has to reimplement that grammar in PHP.
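To illustrate the limitation, here is a minimal sketch that prints the flat token stream returned by the existing tokenizer extension for a one-line assignment. It relies only on token_get_all() and token_name(), both provided by the tokenizer extension; the sample code string is an arbitrary choice. The result is a linear list without nesting, so any structure has to be reconstructed by a userland parser.

```php
<?php
// Sample source; any valid PHP snippet would do.
$code = '<?php $var = 42;';

$names = [];
foreach (token_get_all($code) as $token) {
    if (is_array($token)) {
        // Array tokens have the shape [id, text, line].
        $names[] = token_name($token[0]);
    } else {
        // Single-character tokens such as "=" and ";" are plain strings.
        $names[] = $token;
    }
}

// Prints the token names in source order, separated by spaces.
echo implode(' ', $names), PHP_EOL;
```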

The latest version of PHP now includes a powerful AST-based implementation of the compiler, which is fully decoupled from the parser. This design improves code quality and maintainability in the engine. Information about the Abstract Syntax Tree can be useful on the userland side too, so I propose a parsing API for building an AST, in addition to the existing tokenizer extension.

Why is an AST needed on the userland side?

Currently, several libraries provide a high-level API for accessing information about source code. These include PHP-Parser (https://github.com/nikic/PHP-Parser), PHP-Token-Reflection (https://github.com/Andrewsville/PHP-Token-Reflection), Doctrine Annotations and other tools. Information about the structure of source code is also used by existing QA tools that perform static analysis and rely heavily on the tokenizer extension and custom parsers. A system API for parsing could simplify these tools and make them faster and more reliable.

Parser API

The structural unit of the Abstract Syntax Tree is a single node that holds information about a concrete element:

<?php
namespace Php\Parser;
 
class Node
{
    public $kind;
    public $flags;
    public $lineNumber;
    public $value; 
 
    /**
     * @var Node[]|array List of child nodes
     */
    public $children;
 
    /**
     * Returns the text representation of current node
     *
     * @return string
     */
    public function dump()
 
    /**
     * Returns a user-friendly name of node kind, e.g. "AST_ASSIGN" 
     * @return string
     */
    public function getKindName()
 
    /**
     * Whether the current node uses flags
     * @return bool
     */
    public function isUsingFlags()
}

The `kind` property specifies the type of the node. It is an integer value that corresponds to one of the AST_* constants, for example AST_STMT_LIST. The getKindName() method of the node returns the string name of this integer kind.

The `flags` property contains node-specific flags. It is always defined, but for most nodes it is zero. The isUsingFlags() method can be used to determine whether a node has a meaningful flags value.

The `lineNumber` property specifies the starting line number of the node. The `children` property contains an array of child nodes.
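As an illustration of how these properties might be consumed, the sketch below walks a tree recursively and counts nodes of one kind. Since `Php\Parser\Node` is only proposed here, `SketchNode` is a hypothetical stand-in with an assumed constructor, and kinds are written as strings for readability although the proposal describes them as integers. The tree is built by hand for the statement `$var = 42;`.

```php
<?php
// Stand-in for the proposed Php\Parser\Node; only the properties described
// above are modelled, and the constructor is an assumption of this sketch.
final class SketchNode
{
    public $kind;
    public $flags = 0;
    public $lineNumber;
    public $children;

    public function __construct(string $kind, int $lineNumber, array $children = [])
    {
        $this->kind = $kind;
        $this->lineNumber = $lineNumber;
        $this->children = $children;
    }
}

/**
 * Counts how many nodes of the given kind appear in the tree.
 * Children may be nodes or scalar leaves (names, literals),
 * so each child is type-checked before recursing.
 */
function countKind(SketchNode $node, string $kind): int
{
    $count = ($node->kind === $kind) ? 1 : 0;
    foreach ($node->children as $child) {
        if ($child instanceof SketchNode) {
            $count += countKind($child, $kind);
        }
    }
    return $count;
}

// Hand-built tree for `$var = 42;`.
$tree = new SketchNode('AST_STMT_LIST', 1, [
    new SketchNode('AST_ASSIGN', 1, [
        new SketchNode('AST_VAR', 1, ['var']),
        42,
    ]),
]);

echo countKind($tree, 'AST_VAR'), PHP_EOL; // prints 1
```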

To obtain the AST for a piece of code, the `Php\Parser\Engine` class is used:

<?php
namespace Php\Parser;
 
final class Engine
{
     /**
      * Parses given code and returns an AST for it
      *
      * @return Node
      */
     public static function parse($phpCode): Node
}

The static Engine::parse() method accepts a source code string (which is parsed in INITIAL mode, i.e. it should generally include an opening PHP tag) and returns an abstract syntax tree consisting of Node objects.

Here is an example of getting an AST for simple code:

<?php
use Php\Parser\Engine as ParserEngine;
 
$code = <<<'EOC'
<?php
$var = 42;
EOC;
 
$astTree = ParserEngine::parse($code);
echo $astTree->dump(); 
 
// Output:
AST_STMT_LIST @ 1 {
    0: AST_ASSIGN @ 1 {
        0: AST_VAR @ 1 {
            0: "var"
        }
        1: 42
    }
}

This AST can then be used by custom parser extensions, QA static analysis tools, source code rewriting tools and much more.

Note that this part was originally implemented and described by Nikita Popov as the experimental php-ast extension (https://github.com/nikic/php-ast), so it can be used as a starting point for this RFC.

Parser Extension API

The second part of this RFC proposes an API for building userland parser extensions. Userland extensions would be allowed to hook into the compilation process. This would let them implement some kinds of language features, for example Design-by-Contract verification, Aspect-Oriented Programming, analysis of annotation metadata and much more.

A userland parser extension is described by an extension interface with a single `process` method, which accepts one argument, the top-level AST node, and can modify it.

<?php
namespace Php\Parser;
 
interface ExtensionInterface {
 
    /**
     * Receives a top-level node of AST and can transform it
     */
    public static function process(Node $node);
}

Each extension can be registered with or unregistered from the parser engine class by calling the appropriate methods:

<?php
namespace Php\Parser;
 
class Engine {
 
    /** 
     * @var array|string[] List of parser extension classes
     */
    private static $extensions;
 
    /**
     * Register an extension class in the parser
     * @param string $extensionClassName Name of the extension class
     */
    public static function registerExtension($extensionClassName)
 
    /**
     * Unregister an extension class from the parser
     * @param string $extensionClassName Name of the extension class
     */
    public static function unregisterExtension($extensionClassName)
 
    /**
     * Returns a list of currently registered extensions
     * @return string[]|array List of registered extensions
     */
    public static function getRegisteredExtensions()
} 
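The intended registry behaviour can be approximated in plain userland code. The sketch below is not the proposed engine API: `SketchEngine` is a hypothetical stand-in, and both the duplicate handling and the preservation of registration order are assumptions that the RFC does not specify.

```php
<?php
// Userland stand-in for the proposed Engine registry; not the real API.
final class SketchEngine
{
    /** @var string[] Registered extension class names, in registration order */
    private static $extensions = [];

    public static function registerExtension(string $extensionClassName): void
    {
        // Assumption: registering the same class twice has no effect.
        if (!in_array($extensionClassName, self::$extensions, true)) {
            self::$extensions[] = $extensionClassName;
        }
    }

    public static function unregisterExtension(string $extensionClassName): void
    {
        // array_values() re-indexes the list after the removal.
        self::$extensions = array_values(array_filter(
            self::$extensions,
            function (string $name) use ($extensionClassName): bool {
                return $name !== $extensionClassName;
            }
        ));
    }

    public static function getRegisteredExtensions(): array
    {
        return self::$extensions;
    }
}

SketchEngine::registerExtension('FirstExtension');
SketchEngine::registerExtension('SecondExtension');
SketchEngine::registerExtension('FirstExtension'); // ignored as a duplicate
SketchEngine::unregisterExtension('FirstExtension');

echo implode(', ', SketchEngine::getRegisteredExtensions()), PHP_EOL; // prints SecondExtension
```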

Here is a pseudo-code example of how this could be used to implement the Design-by-Contract paradigm:

<?php
 
use Php\Parser\Node;
use Php\Parser\Engine as ParserEngine;
use Php\Parser\ExtensionInterface;
 
class DbcParserExtension implements ExtensionInterface
{
    public static function process(Node $node)
    {
        // prepare an AST to insert; it can come from an annotation or anything else
        $astToInsert = ParserEngine::parse('<?php assert("$this->value > 0");');
 
        // node visitor, that will traverse the AST for specific nodes
        $methodNodeVisitor = new NodeVisitor($node, Node::AST_METHOD);
        $methodNodeVisitor->visit(function (Node $node) use ($astToInsert) {
            // Insert our AST code before original method statements
            $node->children = array_merge($astToInsert->children, $node->children);
        });
    }
} 
 
// Registration of extension
ParserEngine::registerExtension(DbcParserExtension::class);
 
// Now every include/eval/create_function/etc will trigger our hook
include 'SomeClass.php';
 
// We can also parse code directly with the parser; the hook will be called too:
ParserEngine::parse(file_get_contents('SomeClass.php'));
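The `NodeVisitor` class used in the example above is not defined by this RFC. One way it might work is sketched below, purely as an assumption: `DemoNode` and `DemoNodeVisitor` are hypothetical names, kinds are plain strings, and the visitor simply calls the callback for every node of the requested kind in depth-first order.

```php
<?php
// Stand-in node type; the real Php\Parser\Node is only proposed in this RFC.
final class DemoNode
{
    public $kind;
    public $children;

    public function __construct(string $kind, array $children = [])
    {
        $this->kind = $kind;
        $this->children = $children;
    }
}

// Hypothetical visitor matching the usage in the example above:
// constructed with a root node and a kind, then given a callback.
final class DemoNodeVisitor
{
    private $root;
    private $kind;

    public function __construct(DemoNode $root, string $kind)
    {
        $this->root = $root;
        $this->kind = $kind;
    }

    public function visit(callable $callback): void
    {
        $this->walk($this->root, $callback);
    }

    private function walk(DemoNode $node, callable $callback): void
    {
        // Snapshot the children first so that nodes inserted by the
        // callback are not visited themselves.
        $children = $node->children;
        if ($node->kind === $this->kind) {
            $callback($node);
        }
        foreach ($children as $child) {
            if ($child instanceof DemoNode) {
                $this->walk($child, $callback);
            }
        }
    }
}

$tree = new DemoNode('AST_CLASS', [
    new DemoNode('AST_METHOD', [new DemoNode('AST_RETURN')]),
    new DemoNode('AST_METHOD', []),
]);
$check = new DemoNode('AST_CALL', ['assert']);

$visitor = new DemoNodeVisitor($tree, 'AST_METHOD');
$visitor->visit(function (DemoNode $method) use ($check) {
    // Prepend the check, as the DbC example does with array_merge().
    $method->children = array_merge([$check], $method->children);
});

echo count($tree->children[0]->children), PHP_EOL; // prints 2
```

Taking a snapshot of the children before invoking the callback is a design choice of this sketch: it keeps inserted nodes from being visited during the same pass.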

General flow of compiling the source code and limitations

The current flow (as of PHP 7) of running PHP code can be represented as follows:

Source Code > Tokenizer > AST > Opcodes > Execution

The first step is lexing (or tokenization) of the source code into separate tokens. The parser then generates an Abstract Syntax Tree from the token stream and the PHP grammar. This AST is used to produce concrete opcodes for each node. More details are available in the Abstract Syntax Tree RFC.

After this RFC is implemented, the general flow will change as follows:

Source Code > Tokenizer > AST > Parser Extension > Opcodes > Execution

Note that parser extension hooks run before opcodes are generated, so with an opcode cache enabled the hooks will typically be called only once per file. This is a limitation of parser extensions: they receive the AST for a file only once, so dynamic AST transformations are not possible, because afterwards the cached opcodes are fetched directly by file name:

Source Code > Opcodes > Execution
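The effect of the cache on hook invocation can be shown with a toy model. Nothing below exists in PHP: `ToyCompiler`, its fake "opcodes" strings and its cache keyed by file name are all assumptions made to illustrate the two flows, not a description of how opcache works internally.

```php
<?php
// Toy model of the flows above; none of these names exist in PHP.
final class ToyCompiler
{
    /** @var array<string, string> Cached "opcodes" keyed by file name */
    private $cache = [];
    /** @var callable[] Hooks standing in for parser extensions */
    private $hooks = [];
    /** @var int Number of hook invocations, kept for demonstration */
    public $hookCalls = 0;

    public function addHook(callable $hook): void
    {
        $this->hooks[] = $hook;
    }

    public function compile(string $fileName, string $source): string
    {
        // Cache hit: Source Code > Opcodes, hooks are skipped entirely.
        if (isset($this->cache[$fileName])) {
            return $this->cache[$fileName];
        }
        // Cache miss: "parse", run every hook on the AST, then "emit opcodes".
        $ast = ['source' => $source];
        foreach ($this->hooks as $hook) {
            $ast = $hook($ast);
            $this->hookCalls++;
        }
        return $this->cache[$fileName] = 'opcodes(' . $ast['source'] . ')';
    }
}

$compiler = new ToyCompiler();
$compiler->addHook(function (array $ast): array {
    $ast['source'] = 'checked ' . $ast['source'];
    return $ast;
});

$first = $compiler->compile('SomeClass.php', 'class SomeClass {}');
$second = $compiler->compile('SomeClass.php', 'class SomeClass {}');

echo $compiler->hookCalls, PHP_EOL; // prints 1: the second call was served from the cache
```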

Impact on performance

Without registered parser extensions there is no impact on runtime performance, because no extra steps are required to compile source code into opcodes. Each registered parser extension will have a small impact on runtime performance, because userland hooks must be invoked after parsing each file or each `eval` construct. However, with an opcode cache enabled, this is done only once per file; afterwards the cached opcodes are used without further calls to the userland extensions, so the difference should have no practical impact.

Backward Incompatible Changes

No changes.

Proposed PHP Version(s)

Target: PHP7.x

RFC Impact

To SAPIs

No impact on SAPIs.

To Existing Extensions

Existing extensions are not affected.

To Opcache

This RFC does not affect opcache logic, because it provides an API for AST manipulation, which is an earlier step in the execution of source code. However, using opcache is highly recommended with parser extensions, to avoid unnecessary extension calls for unmodified files.

New Constants

The `Php\Parser\Node` class will contain several constants describing the different kinds of nodes, their names and flags.

php.ini Defaults

None currently.

Open Issues

  1. The right way to register parser extensions needs to be clarified and chosen (http://news.php.net/php.internals/82951, http://news.php.net/php.internals/82958)
  2. Should the `NodeVisitor` class be included in the parser RFC?
  3. Should the vote require a 2/3 majority or a simple 50%+1 majority?

Future Scope

An implementation of this RFC could later be used to build an API for annotations (metadata) that returns values as AST nodes, as well as a Design-by-Contract handler that operates on the AST.

Proposed Voting Choices

Not decided yet.

Patches and Tests

No patch is available at the moment.

Implementation

No information yet.

References

Rejected Features

None

rfc/parser-extension-api.1424673377.txt.gz · Last modified: 2017/09/22 13:28 (external edit)