Differences

This shows you the differences between two versions of the page.

--- rfc:token_as_object [2020/02/13 10:28] – nikic
+++ rfc:token_as_object [2020/11/12 13:33] (current) – nikic
@@ Line 1: / Line 1: @@
-====== PHP RFC: token_get_all() TOKEN_AS_OBJECT mode ======
+====== PHP RFC: Object-based token_get_all() alternative ======
   * Date: 2020-02-13
   * Author: Nikita Popov <nikic@php.net>
-  * Status: Under Discussion
+  * Status: Implemented
   * Target Version: PHP 8.0
   * Implementation: https://github.com/php/php-src/pull/5176
@@ Line 8: / Line 8: @@
 ===== Introduction =====
-The ''token_get_all()'' function currently returns tokens either as a single-character string, or an array with a token ID, token text and line number. This RFC proposes to add a token_get_all() mode which returns an object instead. This reduces memory usage and makes code operating on tokens more readable.
+The ''token_get_all()'' function currently returns tokens either as a single-character string, or an array with a token ID, token text and line number. This RFC proposes to add a token_get_all() alternative which returns an array of objects instead. This reduces memory usage and makes code operating on tokens more readable.
+> **Note:** PhpToken::getAll() has been renamed to PhpToken::tokenize() prior to the PHP 8.0 release. The RFC text still refers to PhpToken::getAll().
 ===== Proposal =====
-''token_get_all()'' accepts a new ''TOKEN_AS_OBJECT'' flag (which can be combined with the existing ''TOKEN_PARSE'' flag as well). If this flag is set, the return value is of type ''PhpToken[]'' declared as follows:
+A new ''PhpToken'' class is introduced with the following properties and methods:
 <PHP>
@@ Line 24: / Line 26: @@
     /** The starting position (0-based) in the tokenized string. */
     public int $pos;
+    /**
+     * Same as token_get_all(), but returning array of PhpToken.
+     * @return static[]
+     */
+    public static function getAll(string $code, int $flags = 0): array;
+    final public function __construct(int $id, string $text, int $line = -1, int $pos = -1);
+    /** Get the name of the token. */
+    public function getTokenName(): ?string;
+    /**
+     * Whether the token has the given ID, the given text,
+     * or has an ID/text part of the given array.
+     *
+     * @param int|string|array $kind
+     */
+    public function is($kind): bool;
+    /** Whether this token would be ignored by the PHP parser. */
+    public function isIgnorable(): bool;
 }
 </PHP>
+The ''PhpToken::getAll()'' method replaces ''token_get_all()'' and returns an array of ''PhpToken'' objects instead of a mix of strings and arrays.
 It should be emphasized that **all** tokens are returned as objects, including single-char tokens. While this uses more memory than returning them as strings, experience has shown that the current string/array mix is very inconvenient to work with.
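A minimal sketch of the difference described above, assuming PHP 8.0 or later, where the method shipped as ''PhpToken::tokenize()'' rather than ''PhpToken::getAll()''. The sample source string is only an illustration.

```php
<?php
// Requires PHP 8.0+ with the tokenizer extension.
$code = '<?php echo 1;';

// Legacy API: array entries for most tokens, plain strings for single-char tokens.
$legacy = token_get_all($code);

// Object API: every entry is a PhpToken, including the trailing ';'.
$tokens = PhpToken::tokenize($code);

foreach ($tokens as $token) {
    // getTokenName() may return null for unknown IDs, hence the fallback.
    printf(
        "%-13s %-10s line %d, pos %d\n",
        $token->getTokenName() ?? 'unknown',
        var_export($token->text, true),
        $token->line,
        $token->pos
    );
}

$last = $tokens[count($tokens) - 1];
```

With the legacy API the final '';'' is a bare string, so consumers must branch on the entry type. With the object API the same token exposes ''id'', ''text'', ''line'' and ''pos'' like every other token.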
@@ Line 35: / Line 61: @@
   * The token stores the position in the file, so that consumers don't have to compute and store it separately.
-Finally (and this is the real motivation here), the tokens take up signficiantly less memory, and are faster to construct as well. On a large file:
+Finally, the tokens take up significantly less memory, and are faster to construct as well. On a large file:
 <code>
@@ Line 46: / Line 72: @@
 </code>
-===== Open Questions =====
+==== Extensibility ====
-==== Construction method ====
+The ''PhpToken::getAll()'' method returns ''static[]'', so the class can be extended seamlessly:
-The RFC currently proposes a new ''TOKEN_AS_OBJECT'' flag to the existing ''token_get_all()'' function. Nicolas Grekas suggested to use a new function or method instead, to make this functionality more easily polyfillable.
+<PHP>
+class MyPhpToken extends PhpToken {
+    public function getLowerText() {
+        return strtolower($this->text);
+    }
+}
-To make the functionality self-contained, a new static method on ''PhpToken'' could be added:
+$tokens = MyPhpToken::getAll($code);
+var_dump($tokens[0] instanceof MyPhpToken); // true
+$tokens[0]->getLowerText(); // works
+</PHP>
+To guarantee a well-defined construction behavior, the ''PhpToken'' constructor is final and cannot be overridden by child classes. This matches the extension approach of the ''SimpleXMLElement'' class.
+==== Additional methods ====
+The ''PhpToken'' class provides a few additional methods, whose behavior is specified by the reference implementations given below.
 <PHP>
-class PhpToken {
+public function getTokenName(): ?string {
-    /** @return PhpToken[] */
+    if ($this->id < 256) {
-    public function getAll(string $code, int $flags = 0);
+        return chr($this->id);
+    } elseif ('UNKNOWN' !== $name = token_name($this->id)) {
+        return $name;
+    } else {
+        return null;
+    }
 }
 </PHP>
-==== Additional methods ====
+''getTokenName()'' is mainly useful for debugging purposes. For single-char tokens with IDs below 256, it returns the extended ASCII character corresponding to the ID. For known tokens, it returns the same result as ''token_name()''. For unknown tokens, it returns null.
-There are a few useful helper methods that could be added to the ''PhpToken'' class. Three suggestions are given as PHP code below. The ''is()'' method is a useful helper, variations of which will be found in many libraries processing token streams. ''isIgnorable()'' helps the particularly common case of skipping whitespace-like tokens. ''getTokenName()'' avoids going through ''token_name()'' for debug output.
+It should be noted that tokens that are not known to PHP are commonly used, for example when emulating lexer behavior from future PHP versions. In this case custom token IDs are used, so they should be handled gracefully.
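The three cases of ''getTokenName()'' can be sketched as follows, assuming PHP 8.0 or later. The constant ''T_HYPOTHETICAL_FUTURE'' and its value are invented for this illustration and stand in for a token ID emulated from a future PHP version.

```php
<?php
// Requires PHP 8.0+ with the tokenizer extension.
// Made-up ID, far outside the range PHP assigns to its own tokens.
const T_HYPOTHETICAL_FUTURE = 100000;

$semicolon = new PhpToken(ord(';'), ';');                 // single-char token, ID below 256
$function  = new PhpToken(T_FUNCTION, 'function');        // token known to PHP
$emulated  = new PhpToken(T_HYPOTHETICAL_FUTURE, 'soon'); // token unknown to PHP

var_dump($semicolon->getTokenName());
var_dump($function->getTokenName());
var_dump($emulated->getTokenName());

// A debug label that degrades gracefully for unknown IDs.
$label = $emulated->getTokenName() ?? "unknown token #{$emulated->id}";
```

Code that prints token names for debugging should expect the null case rather than assume a string.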
 <PHP>
-class PhpToken {
+public function is($kind): bool {
-    /** Whether the token has the given ID, the given text,
+    if (is_array($kind)) {
-     *  or has an ID/text part of the given array. */
+        foreach ($kind as $singleKind) {
-    public function is($kind): bool {
+            if (is_string($singleKind)) {
-        if (is_array($kind)) {
+                if ($this->text === $singleKind) {
-            foreach ($kind as $singleKind) {
+                    return true;
-                if (is_string($singleKind)) {
+                }
-                    if ($this->text === $singleKind) {
+            } else if (is_int($singleKind)) {
-                        return true;
+                if ($this->id === $singleKind) {
-                    }
+                    return true;
-                } else if (is_int($singleKind)) {
-                    if ($this->id === $singleKind) {
-                        return true;
-                    }
-                } else {
-                    throw new TypeError("Kind array must have elements of type int or string");
                 }
+            } else {
+                throw new TypeError("Kind array must have elements of type int or string");
             }
-            return false;
-        } else if (is_string($kind)) {
-            return $this->text === $kind;
-        } else if (is_int($kind)) {
-            return $this->id === $kind;
-        } else {
-            throw new TypeError("Kind must be of type int, string or array");
         }
+        return false;
+    } else if (is_string($kind)) {
+        return $this->text === $kind;
+    } else if (is_int($kind)) {
+        return $this->id === $kind;
+    } else {
+        throw new TypeError("Kind must be of type int, string or array");
     }
+}
+</PHP>
-    /** Whether this token would be ignored by the PHP parser. */
+The ''is()'' method allows checking for certain tokens, while abstracting over whether it is a single-char token ''%%$token->is(';')%%'', a multi-char token ''%%$token->is(T_FUNCTION)%%'', or whether multiple tokens are allowed ''%%$token->is([T_CLASS, T_TRAIT, T_INTERFACE])%%''.
-    public function isIgnorable(): bool {
-        return $this->is([
-            T_WHITESPACE,
-            T_COMMENT,
-            T_DOC_COMMENT,
-            T_OPEN_TAG,
-        ]);
-    }
-    /** Get the name of the token. */
+While non-generic code can easily check the appropriate property, such as ''%%$token->text == ';'%%'' or ''%%$token->id == T_FUNCTION%%'', token stream implementations commonly need to be generic over different token kinds and need to support specification of multiple token kinds. For example:
-    public function getTokenName(): string {
-        if ($this->id < 256) {
+<PHP>
-            return chr($this->id);
+// An example, NOT part of the PhpToken interface.
-        } else {
+public function findRight($pos, $findTokenKind) {
-            return token_name($this->id);
+    $tokens = $this->tokens;
+    for ($count = \count($tokens); $pos < $count; $pos++) {
+        if ($tokens[$pos]->is($findTokenKind)) {
+            return $pos;
         }
     }
+    return -1;
 }
 </PHP>
+These kinds of search/skip/check APIs benefit from having an efficient native implementation of ''is()''.
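As a rough illustration of the argument forms ''is()'' accepts, the following sketch checks one token against an ID, a text, and mixed arrays. It assumes PHP 8.0 or later and uses the shipped ''PhpToken::tokenize()'' name. The sample source string is arbitrary.

```php
<?php
// Requires PHP 8.0+ with the tokenizer extension.
$tokens = PhpToken::tokenize('<?php function f() {}');

// Index 0 is the open tag, index 1 is the "function" keyword.
$keyword = $tokens[1];

$byId      = $keyword->is(T_FUNCTION);                       // match on token ID
$byText    = $keyword->is('function');                       // match on token text
$byAnyKind = $keyword->is([T_CLASS, 'function']);            // array mixing IDs and texts
$noMatch   = $keyword->is([T_CLASS, T_TRAIT, T_INTERFACE]);  // none of the listed kinds

var_dump($byId, $byText, $byAnyKind, $noMatch);
```

Because strings are compared against ''text'' and integers against ''id'', a caller can pass whichever form is most convenient without knowing whether the token is single-char.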
+<PHP>
+public function isIgnorable(): bool {
+    return $this->is([
+        T_WHITESPACE,
+        T_COMMENT,
+        T_DOC_COMMENT,
+        T_OPEN_TAG,
+    ]);
+}
+</PHP>
+As a special case, it is very common that whitespace and comments need to be skipped during token processing. The ''isIgnorable()'' method determines whether a token is ignored by the PHP parser.
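One way this might be used is to drop ignorable tokens before further processing. The sketch below assumes PHP 8.0 or later and the shipped ''PhpToken::tokenize()'' name. The variable names are illustrative only.

```php
<?php
// Requires PHP 8.0+ with the tokenizer extension.
$code = '<?php /* c */ echo 1;';

// Keep only the tokens the parser would actually see.
$significant = array_values(array_filter(
    PhpToken::tokenize($code),
    fn (PhpToken $token): bool => !$token->isIgnorable()
));

// Collect the remaining token texts for inspection.
$texts = array_map(fn (PhpToken $token): string => $token->text, $significant);
var_dump($texts);
```

The open tag, the comment and both whitespace tokens are filtered out, leaving ''echo'', ''1'' and '';''.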
+===== Rejected Features =====
+==== Lazy token stream ====
+''PhpToken::getAll()'' returns an array of tokens. It has been suggested that it could return an iterator instead. This would reduce memory usage if it is sufficient to inspect tokens one-by-one for a given use-case.
+This is not supported by the current proposal, because the current PHP lexer doesn't support this in an efficient manner. A full lexer state backup and restore would have to be performed for each token. Even if support for an iterator is added in the future, the ability to directly create an array should still be retained, as this will always be more efficient than going through an iterator (for the use-cases that do need a full token array).
 ===== Backward Incompatible Changes =====
-There are no backwards compatibility breaks, apart from the new constant name and the new class name.
+There are no backwards compatibility breaks, apart from the new class name.
 ===== Vote =====
-Yes / No.
+Voting opened 2020-03-06 and closes 2020-03-20.
+<doodle title="Add object-based token_get_all() alternative?" auth="nikic" voteType="single" closed="true">
+   * Yes
+   * No
+</doodle>
