PHP RFC: token_get_all() flag to return consistent elements


token_get_all() returns an array of tokens where each element is either a single-character string (for single-character tokens) or an array containing the token's ID, text content, and line number. For example, token_get_all("<?php ;") returns:

Array (
  [0] => Array (
    [0] => int(374)
    [1] => string(6)"<?php "
    [2] => int(1)
  )
  [1] => string(1)";"
)

This makes writing tools that use the scanner awkward, and it hides scanner information: array tokens store their line number in sub-element [2], but single-character tokens carry no line number at all.
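The awkwardness can be seen in a minimal sketch of a consumer of the current output. describe_tokens() is a hypothetical helper written for this illustration, not part of the tokenizer extension; every loop over the tokens needs the same is_array() branch.

```php
<?php
// Sketch of the boilerplate needed with the current mixed output.
// describe_tokens() is a hypothetical helper, not a tokenizer function.
function describe_tokens(string $source): array
{
    $described = [];
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            // Array form: [0] => ID, [1] => text, [2] => line number.
            [$id, $text, $line] = $token;
            $described[] = token_name($id) . ' on line ' . $line . ': ' . $text;
        } else {
            // String form: only the character itself is available.
            $described[] = 'char (line unknown): ' . $token;
        }
    }
    return $described;
}

$result = describe_tokens('<?php ;');
```

Here $result holds one description per token; the second entry has no line number to report because the scanner returned only the string ";".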


This proposal aims to normalize the output of token_get_all() (when requested) by always using associative arrays as the elements of the output. For example, token_get_all("<?php ;", TOKEN_ASSOC) would output:

Array (
  [0] => Array (
    [id] => int(374)
    [text] => string(6)"<?php "
    [line] => int(1)
  )
  [1] => Array (
    [id] => int(59)  // 59 == ord(';')
    [text] => string(1)";"
    [line] => int(1)
  )
)

Note the use of a new constant, TOKEN_ASSOC, passed via the flags parameter introduced in PHP 7.0.
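Until such a flag exists, the proposed shape can be approximated in userland. The following is only a sketch: token_get_all_assoc() is a hypothetical name, and the line number of a single-character token is inferred by counting "\n" in the preceding tokens, which is an assumption that ignores lone "\r" line endings.

```php
<?php
// Userland approximation of the proposed format. token_get_all_assoc() is a
// hypothetical name; the RFC proposes a TOKEN_ASSOC flag instead.
function token_get_all_assoc(string $source): array
{
    $tokens = [];
    $line = 1; // line on which the next token is assumed to start
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            // Trust the scanner's own line number when it is provided.
            [$id, $text, $line] = $token;
        } else {
            // Single-character token: the ID is the character's ordinal,
            // and the line is the one reached after the previous token.
            $id = ord($token);
            $text = $token;
        }
        $tokens[] = ['id' => $id, 'text' => $text, 'line' => $line];
        // Advance past newlines inside this token (assumes "\n" or "\r\n").
        $line += substr_count($text, "\n");
    }
    return $tokens;
}

$simple = token_get_all_assoc('<?php ;');
$multiline = token_get_all_assoc("<?php\necho 1\n;");
```

Every element can then be read through the same three keys, with no is_array() check in the calling code.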

Additional changes

To reduce boilerplate in code that uses token_get_all(), the token_name() function will be updated so that token_name($element['id']) is always a valid call. That is, for a single-character token, whose ID is the character's ordinal, token_name() will return the character itself.

In terms of pseudo-code:

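function token_name($id) {
  // IDs below 256 are the ordinals of single-character tokens
  if ($id < 256) {
    return chr($id);
  }
  // current_token_name() stands for the existing token_name() behaviour
  return current_token_name($id);
}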
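A runnable userland counterpart of this pseudo-code might look like the sketch below. token_name_compat() is a hypothetical wrapper name chosen for the illustration; the RFC proposes changing the built-in token_name() itself.

```php
<?php
// token_name_compat() is a hypothetical userland wrapper, not a tokenizer
// function; it mirrors the behaviour proposed for token_name().
function token_name_compat(int $id): string
{
    // IDs below 256 are the ordinals of single-character tokens.
    if ($id < 256) {
        return chr($id);
    }
    // Everything else is a named token known to the existing function.
    return token_name($id);
}

$semicolon = token_name_compat(ord(';'));
$openTag = token_name_compat(T_OPEN_TAG);
```

With this wrapper, $semicolon is the character ";" and $openTag is the string "T_OPEN_TAG", so callers can name any token ID without branching on its kind.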

New Constants

TOKEN_ASSOC - When present in the flags argument, token_get_all() will use the new format.

Future Scope

Possibly add further fields such as character position, tokenizer state, etc.

Proposed Voting Choices

Introduce TOKEN_ASSOC and new scanner output format? 50% majority required

rfc/token-get-always-tokens.txt · Last modified: 2017/09/22 13:28 by