rfc:implicit_move_optimisation

This is an old revision of the document!


PHP RFC: Implicit Move Optimisation

Introduction

PHP currently contains two data types which are copy-on-write (CoW): string and arrays. When you pass a variable containing a CoW type as an argument to a function, you don't copy the data. Instead, you increment the reference count of the data by one. The lowest possible reference count for a variable passed to a function is therefore 2: one reference is the variable, and the other is the function argument. Let's look at a toy code example:

function my_function(array $my_array) {
    // Remember: $my_array has a reference count of 2:
    //   one reference in $array (main), and one in $my_array (my_function).
    $my_array[] = 3;
    return $my_array;
}
 
function main() {
    $array = [0, 1, 2];
    $array = my_function($array);
    var_dump($array);
}
 
main();

The code example shows an array being passed to “my_function”, which modifies the array and returns the modified result. However, due to the copy-on-write (CoW) mechanism, the variable “$my_array” is copied within “my_function” because the array's reference count is greater than 1 and it is modified. In this example, the copy is unnecessary since the result always overwrites the original array. The purpose of this RFC is to propose an optimisation for avoiding such unnecessary copies.

While this example may not highlight the significance of the copy cost, there are scenarios where it does matter, such as processing large data batches (e.g., ETL-like processes) or complex functions. The reason for creating a separate “main()” function will become clear later in the proposal.

A similar case where this optimisation is beneficial is for variables which are passed to functions and then never used again the caller.

Proposal

This optimisation will be implemented as an addition to the existing optimisation pipeline as an SSA-based optimisation. I call this optimisation an “implicit move”. The “implicit” refers to the fact it happens without the programming having to do anything. The “move” refers to the fact that the lifetime and data is moved into the called function. This is inspired by Rust's and C++'s move semantics.

TODO: how does it work

Constraints

There are situations where this optimisation cannot be applied. Specifically, we cannot allow the user code to observe the unsetting of the variable.

The optimisation is not applied if the caller uses one of the following:

  • Variadic arguments
  • “func_get_args” etc.
  • Variable variables
  • “compact”, “extract” etc.
  • try-catch

Furthermore, the optimisation is also not applied on the pseudo-main function of PHP, because of globals (in fact, no SSA-based optimisations take place in that case because of those complications). It is also not applied on arguments in the caller. This is because it could influence backtraces from exceptions or influence the result of “debug_get_backtrace”.

Examples

Some functions already avoid a copy if the argument array has a reference count of 1. One example of a builtin function that already does this is “array_merge”. So the following code will become really fast with this optimisation:

function array_merge_loop($n) {
    $a = range(1, 3);
    for ($i = 0; $i < $n; ++$i) {
        $a = array_merge($a, [1, 2, 3]);
    }
}

For 30000 iterations without the optimisation, this takes 3.405s on my machine, with the optimisation this takes only 0.002s. This is only an artificial micro-benchmark for a single function of course, but this shows that the idea works.

Other examples of builtin functions that is optimised for a reference count of 1 are “array_replace”, “array_unique”, and “array_intersect”. More functions can follow in the future.

Alternatives

You might wonder why we cannot just use references instead to prevent copies. It is my opinion that references are less ergonomic for developers. Furthermore, builtin functions such as “array_merge”, “array_replace”, etc., don't take a reference so those functions wouldn't be able to avoid a copy.

For ergonomics, take a look at the following code example:

function my_function(array &$my_array) {
    $my_array[] = 3;
    var_dump($my_array); // Do something interesting
}
 
function example_ref() {
    $array = [0, 1, 2];
    my_function($array); // No copy, I want to change the array
}
 
function example_copy() {
    $array = [0, 1, 2];
    $array_copy = array_merge($array, []); // copy the array
    my_function($array_copy); // Operates on the copy
    var_dump($array); // Should be the original
}

In the function “example_ref”, I want to modify “$array” in-place, while in “example_copy” I want the original array to stay the same. With references I have to explicitly copy the array if I want to keep the original. If we instead have an optimisation that can detect when it can modify in-place, then we can avoid references and the code becomes simpler (while still avoiding a copy):

function my_function(array $my_array) {
    $my_array[] = 3;
    var_dump($my_array); // Do something interesting
}
 
function example_ref() {
    $array = [0, 1, 2];
    my_function($array); // No copy, I want to change the array
}
 
function example_copy() {
    $array = [0, 1, 2];
    my_function($array); // Operates on the copy
    var_dump($array); // Should be the original
}

Risks

Every optimisation that gets added to PHP has the risk of introducing breakages. In particular, the type inference needs to be changed to reflect the move of a variable. Type inference is heavily used in the JIT engine, so there's a concern that this optimisation might cause bugs in the JIT. I haven't seen such bugs yet, and I would be surprised if there are because the type inference only differs if the SSA variable is never used again.

The SSA construction change does introduce a compile-time performance decrease for WordPress because the optimiser performs extra work. However, the optimisation improves the run-time performance of WordPress slightly, which is what actually matters.

Backward Incompatible Changes

There's a subtle BC break. Consider the following code:

class Foo {
    public function __destruct() {
        var_dump(1);
    }
}
 
function test(array $x) {
    unset($x[0]);
}
 
function main() {
    $x = [new Foo];
    test($x);
    var_dump(2);
}
 
main();

Without this optimisation this will first output 2, and then 1. But with the optimisation this will output 1 first, and then 2. This is because with the optimisation the array “$x” is destroyed after the call to “test” finishes as its lifetime was transferred to “test”. It is hard to predict what the impact will be on real-world applications. In my opinion executing code with side-effects which needs to happen in a specific order is dangerous as-is, because even in current PHP no such guarantees are made if objects have cycles for example.

Proposed PHP Version(s)

Next PHP 8.x (which is PHP 8.3 at the time of writing).

RFC Impact

To SAPIs

None.

To Existing Extensions

None.

To Opcache

This proposal implements an extra optimisation in the DFA pass in the optimiser. There need to be changes to SCCP, DFG, type inference, and SSA construction to account for the behaviour change in SEND_VAR. The SEND_VAR opcode now redefines op1 because it can set its type to UNDEF. The type inference needs to reflect that as well.

New Constants

None.

php.ini Defaults

None.

Open Issues

Make sure there are no open issues when the vote starts!

Unaffected PHP Functionality

Everything that's not the optimiser.

Future Scope

This proposal depends on functions being able to operate on data in-place. Support for this has already been added in 8.3 for the “array_merge”, “array_replace”, “array_unique”, and “array_intersect”. More functions (and especially string functions) may follow.

Proposed Voting Choices

One primary vote with a simple yes/no option.

Patches and Tests

Implementation

After the project is implemented, this section should contain

  1. the version(s) it was merged into
  2. a link to the git commit(s)
  3. a link to the PHP manual entry for the feature
  4. a link to the language specification section (if any)

References

Links to external references, discussions or RFCs

Rejected Features

Keep this updated with features that were discussed on the mail lists.

rfc/implicit_move_optimisation.1684001813.txt.gz · Last modified: 2023/05/13 18:16 by nielsdos