
PHP RFC: Object-based token_get_all() alternative

Date: 2020-02-13
Author: Nikita Popov nikic@php.net
Status: Under Discussion
Target Version: PHP 8.0
Implementation: https://github.com/php/php-src/pull/5176

Introduction

The token_get_all() function currently returns tokens either as a single-character string, or an array with a token ID, token text and line number. This RFC proposes to add a token_get_all() alternative which returns an array of objects instead. This reduces memory usage and makes code operating on tokens more readable.
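To illustrate the current situation, the following minimal sketch normalizes the mixed output of token_get_all() by hand. The helper name normalize_tokens() is hypothetical and exists only for this example; it is not part of the proposal.

```php
<?php
// Hypothetical helper (not part of the RFC): turn the mixed string/array
// output of token_get_all() into a uniform list of [id, text, line] triples.
function normalize_tokens(string $code): array {
    $result = [];
    $line = 1;
    foreach (token_get_all($code) as $token) {
        if (is_array($token)) {
            // Array token: [T_* id, text, starting line].
            [$id, $text, $line] = $token;
        } else {
            // Single-char token: a plain string that carries no line number,
            // so the line tracked so far is reused.
            $id = ord($token);
            $text = $token;
        }
        $result[] = [$id, $text, $line];
        // Advance the tracked line past any newlines inside this token.
        $line += substr_count($text, "\n");
    }
    return $result;
}
```

Every consumer of token_get_all() needs some variation of the is_array() branch above, and has to track line numbers for single-char tokens on its own.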

Proposal

A new PhpToken class is introduced with the following properties and methods:

class PhpToken {
    /** One of the T_* constants, or an integer < 256 representing a single-char token. */
    public int $id;
    /** The textual content of the token. */
    public string $text;
    /** The starting line number (1-based) of the token. */
    public int $line;
    /** The starting position (0-based) in the tokenized string. */
    public int $pos;
 
    /**
     * Same as token_get_all(), but returning array of PhpToken.
     * @return PhpToken[]
     */
    public static function getAll(string $code, int $flags = 0): array;
 
    public function __construct(int $id, string $text, int $line = -1, int $pos = -1);
}

The PhpToken::getAll() method is the replacement for token_get_all(). It returns an array of PhpToken objects instead of a mix of strings and arrays.

It should be emphasized that all tokens are returned as objects, including single-char tokens. While this uses more memory than returning them as strings, experience has shown that the current string/array mix is very inconvenient to work with.

Returning an array of objects has the following advantages over the current approach:

The representation of tokens is uniform: it is not necessary to continuously check whether an array or string token is being used.
The consuming code is cleaner, because $token->text is easier to understand than $token[1] and friends.
The token stores the position in the file, so that consumers don't have to compute and store it separately.

Finally (and this is the real motivation here), the tokens take up significantly less memory, and are faster to construct as well. On a large file:

Default:
    Memory Usage: 14.0MiB
    Time: 0.43s (for 100 tokenizations)
TOKEN_AS_OBJECT:
    Memory Usage: 8.0MiB
    Time: 0.32s (for 100 tokenizations)

Open Questions

Additional methods

There are a few useful helper methods that could be added to the PhpToken class. Three suggestions are given as PHP code below. The is() method is a useful helper, variations of which will be found in many libraries processing token streams. isIgnorable() helps the particularly common case of skipping whitespace-like tokens. getTokenName() avoids going through token_name() for debug output.

class PhpToken {
    /** Whether the token has the given ID, the given text,
     *  or has an ID/text part of the given array. */
    public function is($kind): bool {
        if (is_array($kind)) {
            foreach ($kind as $singleKind) {
                if (is_string($singleKind)) {
                    if ($this->text === $singleKind) {
                        return true;
                    }
                } else if (is_int($singleKind)) {
                    if ($this->id === $singleKind) {
                        return true;
                    }
                } else {
                    throw new TypeError("Kind array must have elements of type int or string");
                }
            }
            return false;
        } else if (is_string($kind)) {
            return $this->text === $kind;
        } else if (is_int($kind)) {
            return $this->id === $kind;
        } else {
            throw new TypeError("Kind must be of type int, string or array");
        }
    }
 
    /** Whether this token would be ignored by the PHP parser. */
    public function isIgnorable(): bool {
        return $this->is([
            T_WHITESPACE,
            T_COMMENT,
            T_DOC_COMMENT,
            T_OPEN_TAG,
        ]);
    }
 
    /** Get the name of the token. */
    public function getTokenName(): string {
        if ($this->id < 256) {
            return chr($this->id);
        } else {
            return token_name($this->id);
        }
    }
}
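For comparison, this is roughly what consumers have to write today to skip the tokens covered by isIgnorable(). The function name significant_token_texts() is hypothetical and used only for this sketch.

```php
<?php
// Hypothetical helper (not part of the RFC): collect the text of every
// token that the PHP parser would not ignore, using token_get_all().
function significant_token_texts(string $code): array {
    // Same token IDs as in the isIgnorable() suggestion.
    $ignorable = [T_WHITESPACE, T_COMMENT, T_DOC_COMMENT, T_OPEN_TAG];
    $texts = [];
    foreach (token_get_all($code) as $token) {
        if (is_array($token)) {
            if (in_array($token[0], $ignorable, true)) {
                continue;
            }
            $texts[] = $token[1];
        } else {
            // Single-char string tokens are never in the ignorable set.
            $texts[] = $token;
        }
    }
    return $texts;
}
```

With the suggested helper, the loop body would reduce to a single check of $token->isIgnorable().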

Allowing extension of the class

If the class is extended, should MyPhpToken::getAll() return an array of MyPhpToken? How does this interact with constructors?
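One possible answer, shown here purely as a userland sketch and not as a decision of this RFC, is to instantiate tokens through late static binding. The class names SketchToken and MySketchToken are hypothetical.

```php
<?php
// Hypothetical base class: getAll() creates instances of the class it
// was called on by using "new static" (late static binding).
class SketchToken {
    public int $id;
    public string $text;
    public int $line;
    public int $pos;

    public function __construct(int $id, string $text, int $line = -1, int $pos = -1) {
        $this->id = $id;
        $this->text = $text;
        $this->line = $line;
        $this->pos = $pos;
    }

    /** @return static[] */
    public static function getAll(string $code): array {
        $tokens = [];
        $line = 1;
        $pos = 0;
        foreach (token_get_all($code) as $token) {
            if (is_array($token)) {
                [$id, $text, $line] = $token;
            } else {
                $id = ord($token);
                $text = $token;
            }
            // Calls the constructor of the called class, so a child class
            // must keep a compatible constructor signature.
            $tokens[] = new static($id, $text, $line, $pos);
            $line += substr_count($text, "\n");
            $pos += strlen($text);
        }
        return $tokens;
    }
}

// Hypothetical child class adding a helper of its own.
class MySketchToken extends SketchToken {
    public function isVariable(): bool {
        return $this->id === T_VARIABLE;
    }
}
```

In this sketch MySketchToken::getAll() returns MySketchToken instances. The constructor question remains visible: a child class that changes the constructor signature would break the new static call, so a native implementation would have to either require a compatible constructor or bypass it.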

Rejected Features

Lazy token stream

PhpToken::getAll() returns an array of tokens. It has been suggested that it could return an iterator instead. This would reduce memory usage if it is sufficient to inspect tokens one-by-one for a given use-case.

This is not supported by the current proposal, because the current PHP lexer doesn't support this in an efficient manner. A full lexer state backup and restore would have to be performed for each token. Even if support for an iterator is added in the future, the ability to directly create an array should still be retained, as this will always be more efficient than going through an iterator (for the use-cases that do need a full token array).
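An iterator interface can already be layered on top of the array in userland, as the following sketch shows. The function name iterate_tokens() is hypothetical. Note that this is not the lazy lexing discussed above: the full token array is still created first, so only the consumer interface changes, not the peak memory usage.

```php
<?php
// Hypothetical helper (not part of the RFC): expose tokens one at a
// time through a generator. token_get_all() still builds the complete
// array before the first token is yielded.
function iterate_tokens(string $code): Generator {
    yield from token_get_all($code);
}
```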

Backward Incompatible Changes

There are no backwards compatibility breaks, apart from the new PhpToken class name, which can conflict with an existing userland class of the same name.

Vote

Yes / No.