rfc:on_demand_name_mangling

PHP RFC: On-demand Name Mangling

Introduction

PHP marshals external key-value pairs into super-globals by mangling some disallowed characters to underscores:

  • .” and “ ” become an “_
  • If there is no “]”, then the first “[” also becomes an “_” -- all other left brackets remain intact
  • Other characters are left as is

As seen here, both for both $_ENV and $_GET:

# the shell environment variable "a.b" becomes "a_b" inside $_ENV
$ /usr/bin/env "a.b=foo" php -r 'echo $_ENV["a_b"];'
foo

# a "[" also mangles to an underscore
$ /usr/bin/env "a[b=foo" php -r 'echo $_ENV["a_b"];'
foo

# same mangling rules for $_REQUEST
# Note how $ is ignored
$ cat mangle.phpt
--TEST--
How does $_REQUEST handle HTML form variables with unusual names?
--GET--
a.b=dot&a$b=dollar&a%20b=space&a[b=bracket
--FILE--
<?php
print_r($_GET);
?>
--EXPECTF--
Array
(
    [a_b] => bracket
    [a$b] => dollar
)
$ pear run-tests --cgi=/usr/bin/php-cgi mangle.phpt
Running 1 tests
PASS How does $_REQUEST handle HTML form variables with unusual names?[mangle.phpt]
TOTAL TIME: 00:00
1 PASSED TESTS
0 SKIPPED TESTS

Mangling has the undesirable consequence that many external variables may map to one PHP variable. For example, three separate HTML form elements named a.b, a_b and a[b will all resolve to a_b in the corresponding super-global, with the value from a[b winning (because it was last). This leads to user confusion and userland work arounds, not to mention bug reports: #34882 and #42055 for example.

Automatic name mangling supported register_globals and its kin like import_request_variables(), but those features ended in August 2014. Name mangling isn't required for super-global marshaling, because the associative array nature of super-globals can accommodate any string variable name. So do we need automatic name mangling? Consider this hypothetical new test:

--TEST--
Name mangling logic removed from engine, placed in polyfill
--GET--
a.b=dot&a_b=underscore&a$b=dollar&a%20b=space&a[b=bracket
--FILE--
<?php
print_r(get_defined_vars());
php_mangle_superglobals();
print_r(get_defined_vars());
?>
--EXPECTF--
Array
(
    [_GET] => Array
        (
            [a.b] => dot
            [a_b] => underscore
            [a$b] => dollar
            [a b] => space
            [a[b] => bracket
        )
)
Array
(
    [_GET] => Array
        (
            [a_b] => bracket
            [a$b] => dollar
        )
)

In this new implementation, the engine no longer mangles marshaled superglobals at startup. Instead, the ability to mangle names has moved to an optional, userland-provided polyfill function php_mangle_superglobals().

In the example above, an a_b key was externally supplied. The call to php_mangle_superglobals clobbered the original value of a_b with the value of the last seen mangle-equivalent key (a[b).

Importantly, the user made this mangling happen: the engine did not do it automatically.

The polyfill algorithm is simple:

  • find all superglobal keys that violate the PHP unquoted variable name regex 1)
  • for each, create a new mangled key linked to the corresponding value

Applications requiring name mangling may call the polyfill during their bootstrap phase to emulate prior engine behavior.

Proposal

This RFC proposes to remove automatic name mangling, with backward compatibility maintained through a userspace polyfill function that mangles super-globals on-demand:

  • Upon acceptance:
    • Update documentation that name mangling is deprecated and will be removed in 8.0
    • Release a userland polyfill that implements the historic mangling behavior
    • Polyfill shall be available via composer (but not PEAR)
  • Next major release (currently 8.0):
    • Remove all name mangling code in super-global marshaling functions

Discussion

These questions were raised in the mailing list discussion.

Should a notice be raised if the engine mangles a superglobal?

Before version 1.3, this RFC proposed raising an E_DEPRECATED message (once per startup) when the engine mangled a name, so that developers were made aware of future changes. However, Rouven Weßling asked:

If I have a well behaved application that doesn’t rely on name mangling or have included the polyfill, how can I prevent a log message from being emitted when a user appends (unused) parameters to the query string that require mangling?

and Nikita Popov commented:

Even if it's only a single deprecation warning instead of multiple, it's still a deprecation warning that I, as the application author, have absolutely no control over. For me, a deprecation warning indicates that there is some code I must change to make that warning *go away*.
Sure, it's informative. But it's enough to be informative about this *once*, rather than every time a user makes an odd-ish request.

Given that (a) an application could get spammed by malicious users2), and (b) that documentation suffices to notify users of this change, then the RFC changed as of 1.3 to only document the removal of name mangling as of the next major version.

Should an INI configuration control mangling?

Nikita Popov suggested (and Stanislav Malyshev seconded) a counter-proposal to use an INI setting:

I would favor the introduction of a new ini setting. E.g. mangle_names=0 disables name mangling, while mangle_names=1 throws a deprecation warning on startup and enables name mangling. mangle_names=0 should be the default. That is essentially disable name mangling, but leave an escape hatch for those people who rely on it (for whatever reason).

An INI setting to disable mangling must be engine-wide (e.g., PHP_INI_SYSTEM or PHP_INI_PERDIR) as its historical effect occurs before userland code runs. Engine-wide settings are tricky because they force conditions across all instances of PHP running in a given SAPI process. In a hosted environment where many unrelated sites share the same engine configuration, it's possible that one site might require mangling while another site requires no-mangling. These two sites could not co-exist unless the site operator allows per directory configuration, which they may not. Thus, an INI setting would introduce operational problems for some definable sub-set of users.

It's still possible to provide an “escape hatch” for applications requiring name mangling: the polyfill described earlier. Applications need only include the polyfill code and add it to their bootstrapping. The polyfill would be available via Composer, and the polyfill would populate all the mangled variables as before.

The polyfill approach is considered superior to the INI approach for three reasons:

  • Userland can maintain BC independent of system INI settings (which they may not control)
  • The engine is completely cleaned of all mangling behavior (which means less code to fuss over)
  • No additional weight of configuration values (which is a complaint point for many)

Should ''extract()'' automatically mangle names?

Early versions of this proposal (< v1.2) proposed using extract to mangle names. Rowan Collins and others pointed out this was an unnecessary complication: preg_match could also accomplish the goal. Thus, all references to extract in this RFC have been removed.

However, extract() should have the option to emit mangled names with a new constant (EXTR_MANGLE). extract() should also be fixed to export variables with any variable name, because they are all technically valid with the quoted variable syntax (${'foo.bar'}). These will be handled as function fixes and not with this RFC.

Backward Incompatible Changes

This proposal introduces backward incompatible changes: any userland code relying on mangled names would have to either (a) change to using original external variable names or (b) re-mangle the super-globals with a polyfill.

The polyfill could be accomplished with code like:

function php_mangle_name($name) {
    $name = preg_replace('/[^a-zA-Z0-9_\x7f-\xff]/', '_', $name);
    return preg_replace('/^[0-9]/', '_', $name);
}
function php_mangle_superglobals() {
    if (version_compare(PHP_VERSION, '8.0.0', '<')) {
        return;
    }
    foreach ($_ENV as $var => &$val) {
        $mangled = php_mangle_name($var);
        if ($mangled !== $var) {
            $_ENV[$mangled] =& $val;
        }
    }
    // similar loops for $_GET, $_POST
    // similar logic for $_COOKIE and $_FILES
}

To reduce the burden on userland, this polyfill library could be made available via Composer:

$ composer require php/mangle-superglobals ^1.0
$ cat app/bootstrap.php
<?php
require __DIR__ . '/vendor/autoload.php';

php_mangle_superglobals();

// ...

Proposed PHP Version(s)

PHP 8.0.

RFC Impact

To SAPIs

No impact.

To Existing Extensions

No impact.

To Opcache

No impact.

New Constants

None.

php.ini Defaults

None.

Open Issues

None.

Proposed Voting Choices

A simple yes/no voting option with a 2/3 majority required: “Remove name mangling in PHP 8.0?”

Patches and Tests

None yet. Implementations will follow vote.

Implementation

TODO: After the project is implemented, this section should contain

  1. the version(s) it was merged to
  2. a link to the git commit(s)
  3. a link to the PHP manual entry for the feature

Rejected Features

None so far.

1)
Unquoted variable names must match the regex [a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*
2)
The max_input_vars configuration option behaves similarly with the once-per-startup deprecation message proposed prior to version 1.3. The difference is the max_input_vars message could be squelched by increasing the limit, whereas the proposed mangling message could never be squelched by user code
rfc/on_demand_name_mangling.txt · Last modified: 2019/07/16 12:25 by bishop