PHP RFC: Structs
- Date: 2024-04-16
- Author: Ilija Tovilo, tovilo.ilija@gmail.com
- Status: Draft
- Target Version: PHP 8.x
- Implementation: https://github.com/php/php-src/pull/13800
Proposal
This RFC proposes to add structs to PHP, which are classes with value semantics.
struct Position {
    public function __construct(
        public $x,
        public $y,
    ) {}
}

$p1 = new Position(1, 2);
$p2 = $p1;
$p2->x++;
var_dump($p1 === $p2); // false

$p2->x--;
var_dump($p1 === $p2); // true
Data transfer objects
The problem
Classes are commonly used to model data in PHP. Such classes have many names (data transfer objects, plain old PHP objects, records, etc.). Using a class allows the developer to describe the shape of the data, thus documenting it and improving the developer experience in IDEs compared to arrays.
Using classes for data comes with one significant downside: Objects are passed by reference, rather than by value. When dealing with mutable data, this makes it very easy to shoot yourself in the foot by exposing mutations to places that don't expect to see them.
Consider the following example:
class Position {
    public function __construct(
        public $x,
        public $y,
    ) {}
}

function createShapes() {
    // Use same position for both shapes
    $pos = new Position(10, 20);
    $circle = new Circle(position: $pos, radius: 10);
    $square = new Square(position: $pos, side: 20);
    return [$circle, $square];
}

$shapes = createShapes();

// Apply gravity
foreach ($shapes as $shape) {
    /* We're not physicists. :P */
    $shape->position->y--;
    var_dump($shape->position);
}
// Position(10, 19), Position(10, 18)??
Since both shapes are created with the same position, createShapes() tries to be resourceful and uses the same Position instance for both shapes. Unfortunately, the code applying gravity (referred to as applyGravity() below) is not aware of this optimization and applies its change to the same object twice.
What's the solution? position needs to be copied, but where? We can either copy it in createShapes() so that each shape has its own distinct position, or we can copy it in applyGravity(), assuming that position may be referenced from somewhere else. For the latter case, we may mark Position as readonly to get some guarantees that we get it right. Which of these two approaches is better depends on how many positions can be shared, and how often they change. Unfortunately, either can lead to useless copies.
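As a minimal sketch of the first option in today's PHP, createShapes() can clone the position so that each shape owns its own instance. The Circle and Square classes are assumptions made for this illustration only. They simply store their constructor arguments, and the RFC does not define them.

```php
<?php

class Position {
    public function __construct(
        public $x,
        public $y,
    ) {}
}

// Hypothetical shape classes, assumed only for this illustration.
class Circle {
    public function __construct(
        public Position $position,
        public $radius,
    ) {}
}

class Square {
    public function __construct(
        public Position $position,
        public $side,
    ) {}
}

function createShapes() {
    $pos = new Position(10, 20);
    $circle = new Circle(position: $pos, radius: 10);
    // Explicit copy: the square gets its own Position instance.
    $square = new Square(position: clone $pos, side: 20);
    return [$circle, $square];
}

$shapes = createShapes();

// Apply gravity
foreach ($shapes as $shape) {
    $shape->position->y--;
}

var_dump($shapes[0]->position->y); // int(19)
var_dump($shapes[1]->position->y); // int(19)
```

The clone is made whether or not any position is ever modified afterwards, which is exactly the kind of useless copy mentioned above.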
The solution
Like arrays, strings and other value types, structs are conceptually copied when assigned to a variable, or when passed to a function.
With this description, let's reconsider the createShapes() from above.
struct Position { ... }

function createShapes() {
    // Use same position for both shapes
    $pos = new Position(10, 20);
    $circle = new Circle(position: $pos, radius: 10);
    $square = new Square(position: $pos, side: 20);
    return [$circle, $square];
}
Conceptually, $circle->position and $square->position are distinct objects at the end of this function. applyGravity() can no longer influence multiple references to position. This completely avoids the “spooky action at a distance” problem.
At first glance, this doesn't seem like it would avoid useless copies. In reality, it works somewhat differently, but the details are not too important for now. They will be explained in more detail in the CoW chapter.
Growable data structures
The problem
The same problem exists, and is in fact greatly exacerbated, for internal, growable data structures such as vectors, stacks, queues, etc. that desire to provide APIs immune to action at a distance.
// Pseudo-code for an internal class
class Vector {
    public $storage = <malloced>;

    public function append($element) {
        $clone = clone $this; // including storage
        $clone->storage->append($element);
        return $clone;
    }
}

// Userland
$vector = new Vector();
for ($i = 0; $i < 1000; $i++) {
    $vector = $vector->append($i);
}
Not only will this loop create a copy of the vector object on each iteration, but it will also copy its entire storage. With this approach, the time complexity of a single insert becomes O(n), where n is the current size of the vector. For m inserts, it becomes O(m*n), which is catastrophic. Looking at the code above, it becomes evident that $vector is not referenced from anywhere else. It is thus completely unnecessary to copy it.
And when it is shared, we only need a single copy, rather than a copy for each insertion.
function appendAndPrint($vector) {
    $vector = $vector->append(2); // This copy may be necessary, because $vector may still be referenced in the caller.
    $vector = $vector->append(3); // This copy is always unnecessary.
    var_dump($vector); // [1, 2, 3]
}

$vector = new Vector();
$vector = $vector->append(1); // This copy is also unnecessary.
appendAndPrint($vector);
var_dump($vector); // [1]
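The cost of the copying append() can be made visible with a rough userland model. This is only a sketch: CopyingVector and its $copiedElements counter are hypothetical names, and the counter merely models the storage copy that an internal implementation would perform.

```php
<?php

// Hypothetical userland stand-in for the internal Vector from the pseudo-code.
final class CopyingVector {
    // Counts how many elements were copied across all append() calls.
    public static int $copiedElements = 0;

    private array $storage = [];

    public function append($element): self {
        $clone = clone $this;
        // Model the eager storage copy: every existing element is duplicated.
        self::$copiedElements += count($this->storage);
        $clone->storage[] = $element;
        return $clone;
    }

    public function count(): int {
        return count($this->storage);
    }
}

$vector = new CopyingVector();
for ($i = 0; $i < 1000; $i++) {
    $vector = $vector->append($i);
}

var_dump($vector->count());               // int(1000)
var_dump(CopyingVector::$copiedElements); // int(499500)
```

The 1000 appends copy 0 + 1 + ... + 999 = 499500 elements in total, even though $vector is never shared with anything else.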
The solution
As a reminder, structs are conceptually copied when assigned to a variable, or when passed to a function. When appendAndPrint() is called, $vector is effectively already copied. Just like with arrays, the user doesn't need to think about creating explicit copies. The engine does it for you.
function appendAndPrint($vector) {
    $vector->append!(2);
    $vector->append!(3);
    var_dump($vector); // [1, 2, 3]
}

$vector = new Vector();
$vector->append!(1);
appendAndPrint($vector);
var_dump($vector); // [1]
Mind the ! in append!(). It denotes that the method call will mutate the struct, which makes every modification very explicit. It also has some technical benefits, which will be explained later.
One of the primary motivators of this RFC is to enable the possibility of introducing internal data structures, such as lists (e.g. Vector from php-ds) as a faster and stricter alternative to arrays, without introducing many of the pitfalls some other languages suffer from by making them reference types.
CoW 🐄
But wait:
What's the solution?
position needs to be copied, but where? We can either copy it in createShapes() so that each shape has its own distinct position ... Unfortunately, either can lead to useless copies. ...
Like arrays, strings and other value types, structs are conceptually copied when assigned to a variable, or when passed to a function.
This RFC, minutes ago
This solution doesn't sound like it would solve the presented problem. You may assume that structs come with the same slowdown as creating a copy for each assignment of an object. However, structs have a cool trick up their sleeves: Copy-on-write, or CoW for short. CoW is already used for both arrays and strings, so this is not a new concept to the PHP engine. PHP tracks the reference count for each allocation such as objects, arrays and strings. When value types are modified, PHP checks if the reference count is >1, and if so, it copies the element before performing a modification.
// Named printValue() because print is a reserved language construct.
function printValue($value) {
    var_dump($value);
}

function appendAndPrint($value) {
    $value[] = 'baz';
    var_dump($value);
}

printValue(['foo', 'bar']);
appendAndPrint(['foo', 'bar']);

$array = ['foo', 'bar'];
printValue($array);
appendAndPrint($array);
Note: This code ignores the fact that array literals are constant, for simplicity.
With the rules described above, the only line performing potential copies is $value[] = 'baz';, since it performs a modification of the array. The copy is also avoided unless $value is referenced from somewhere else, which is only the case when passing the local variable $array to appendAndPrint().
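The observable side of this behavior can be checked in today's PHP. The sketch below only shows the semantics, namely that the caller's array is unaffected by the write in the callee. The moment at which the engine physically copies the array is an implementation detail that this code does not reveal. appendAndReturn() is a name chosen for this illustration.

```php
<?php

function appendAndReturn(array $value): array {
    // The write separates the callee's array from the caller's one.
    $value[] = 'baz';
    return $value;
}

$array = ['foo', 'bar'];
$result = appendAndReturn($array);

var_dump($array === ['foo', 'bar']);         // bool(true)
var_dump($result === ['foo', 'bar', 'baz']); // bool(true)
```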
This is already how arrays work today. Structs follow the exact same principle.
// Named printValue() because print is a reserved language construct.
function printValue($value) {
    var_dump($value);
}

function modifyAndPrint($value) {
    $value->x++;
    var_dump($value);
}

printValue(new Position(1, 2));
modifyAndPrint(new Position(1, 2));

$pos = new Position(1, 2);
printValue($pos);
modifyAndPrint($pos);
Only one implicit copy happens, namely in modifyAndPrint() when $value is still referenced as $pos from the caller.
Equality/Identity
A struct's identity is not dictated by its “pointer” or object ID, as with normal objects. Instead, value types are considered identical if they contain the same data. As such, the identity operator === is adjusted for structs so that two structs are identical if:
- The objects are instances of the same struct.
- All of their properties are identical (===).
- If the objects contain dynamic properties, the dynamic properties must have the same order.
This adjustment pertains not only to the === operator itself, but also to language features that use it. For example, match will compare the expression and match arms using === semantics.
The semantics for equality (==) for structs remain the same as for normal classes. That is, the objects must be of the same struct, and all properties must be equal (==), including dynamic properties. However, the order of dynamic properties is irrelevant.
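To make the difference between the two operators concrete, the following sketch approximates the proposed === rules with a userland helper in today's PHP. structIdentical() is a hypothetical function, not part of the RFC. It only considers public properties, and nested objects are still compared by object identity, so it is an approximation rather than the proposed semantics.

```php
<?php

class Position {
    public function __construct(
        public $x,
        public $y,
    ) {}
}

// Hypothetical helper, not part of the RFC: approximates the proposed ===
// for objects that only have public properties.
function structIdentical(object $a, object $b): bool {
    // Same class, and the same properties with identical values in the same order.
    return get_class($a) === get_class($b)
        && get_object_vars($a) === get_object_vars($b);
}

$p1 = new Position(1, 2);
$p2 = new Position(1, 2);

var_dump($p1 === $p2);               // bool(false): distinct object identities
var_dump($p1 == $p2);                // bool(true)
var_dump(structIdentical($p1, $p2)); // bool(true)

$p2->x = '1';
var_dump($p1 == $p2);                // bool(true): '1' == 1
var_dump(structIdentical($p1, $p2)); // bool(false): '1' !== 1
```

The helper relies on array identity, which requires the same key/value pairs in the same order and of the same types. That mirrors the ordering rule for dynamic properties listed above.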
Method calls
Self-mutating methods of structs pose an interesting problem. The promise of value types is that a value doesn't change, unless an explicit modification is made to this same variable. For example, consider a BigNum implementation:
struct BigNum {
    public function __construct(
        // Int is not very useful, it's just for demonstration purposes. :)
        public int $value,
    ) {}

    public function double() {
        $this->value *= 2;
    }
}

$bigNum1 = new BigNum(1);
$bigNum2 = $bigNum1;
$bigNum2->double();
var_dump($bigNum1); // 1
var_dump($bigNum2); // 2
To properly support this, we need an indication at both the call site and the callee that the method will mutate the variable.
struct BigNum {
    // ...

    public mutating function double() {
        $this->value *= 2;
    }
}

// ...
$bigNum2->double!();
The call-site notation is technically necessary, for reasons we'll not get into here. But it also has the nice side-effect of making it immediately clear that the method mutates the variable.
// $vector is modified, indicated by !.
$vector->sort!();

// $vector is not modified, indicated by the lack of !.
$sortedVector = $vector->sorted();
Only mutating methods can and must be called using the !() syntax. Calling mutating methods with (), or non-mutating methods with !() results in a runtime error.
Similarly, classes trying to implement mutating methods will cause a compile error.
TODO: Check if we can enforce mutating at compile-time, anytime $this is fetched with RW (assignments, calling of mutating methods, fetching references).
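The distinction between a mutating sort!() and a non-mutating sorted() has a rough analogue for arrays in today's PHP, sketched below. The built-in sort() takes its argument by reference and modifies it in place, while the sorted() helper, a hypothetical function written for this illustration, works on its own copy. Unlike the proposed ! marker, nothing at the call site of sort() indicates the mutation.

```php
<?php

// Hypothetical helper for illustration: returns a sorted copy.
function sorted(array $values): array {
    sort($values); // Sorts the callee's own copy of the array.
    return $values;
}

$numbers = [3, 1, 2];

$sortedNumbers = sorted($numbers);
var_dump($numbers === [3, 1, 2]);       // bool(true): the original is untouched
var_dump($sortedNumbers === [1, 2, 3]); // bool(true)

sort($numbers); // Takes $numbers by reference and modifies it in place.
var_dump($numbers === [1, 2, 3]);       // bool(true)
```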
References
Value types are great, because they avoid surprising mutations from elsewhere. However, sometimes you really do want to modify a value from elsewhere. This can easily be done by passing the struct by-reference.
function double(&$bigNum) {
    $bigNum->value *= 2;
}

$bigNum = new BigNum(1);
double($bigNum);
var_dump($bigNum); // 2

// or just by using reference variables
$bigNumRef = &$bigNum;
$bigNumRef->value *= 2;
var_dump($bigNum); // 4
This behavior is exactly equivalent to the one for arrays.
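For comparison, here is the array counterpart in today's PHP, as a minimal sketch. doubleAll() is a name chosen for this illustration.

```php
<?php

function doubleAll(array &$values) {
    foreach ($values as $key => $value) {
        // Writes go through the reference and reach the caller's array.
        $values[$key] = $value * 2;
    }
}

$numbers = [1, 2, 3];
doubleAll($numbers);
var_dump($numbers === [2, 4, 6]); // bool(true)

// or just by using reference variables
$numbersRef = &$numbers;
$numbersRef[] = 8;
var_dump($numbers === [2, 4, 6, 8]); // bool(true)
```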
Readonly / interior mutability
readonly prevents mutability for arrays. For example:
class Vector {
    public function __construct(
        public readonly array $values,
    ) {}
}

$vector = new Vector([[1], [2], [3]]);
$vector->values[0][0] *= 2; // Error: Cannot modify readonly property Vector::$values
While $vector->values[0][0] *= 2; does not write to the values property itself, the nested write is considered a mutation of values. The same is not true for objects.
class Vector {
    public function __construct(
        public readonly array $values,
    ) {}
}

class BigNum {
    public function __construct(
        public int $value,
    ) {}
}

$vector = new Vector([new BigNum(1)]);
$vector->values[0]->value *= 2;
TODO: This is actually broken currently.
This modification is not considered mutating, because the object may change from some other place anyway. Structs behave closer to arrays, so interior mutation is not allowed.
// This throws if BigNum is a struct.
$vector->values[0]->value *= 2;
Reflection
As described in the CoW section, a single struct instance may be shared by two separate variables. However, modifying the struct through one of them should not affect the other. ReflectionProperty::setValue() would break this promise.
$bigNum1 = new BigNum(1);
$bigNum2 = $bigNum1;
$reflection = new ReflectionProperty(BigNum::class, 'value');
$reflection->setValue($bigNum2, 2);

// Desired behavior
var_dump($bigNum1, $bigNum2); // 1, 2
For this to work properly, ReflectionProperty::setValue() would need to accept a reference for the $objectOrValue parameter. That is because internal functions are assumed not to mutate struct objects when they are accepted by value, because the copy could not be written back to the original variable. Making $objectOrValue by-reference would break existing code where $objectOrValue is a temporary value (e.g. the result of a function call). There's also the special @prefer-ref annotation that is only available for internal functions. If the value can be passed by reference, it is. Otherwise, it is passed by value. This solution works well, but breaks userland overrides of ReflectionProperty::setValue() with no possibility of mitigation, because @prefer-ref is not available in userland.
For this reason, I have opted to throw when passing a struct object to ReflectionProperty::setValue() for the time being.
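For contrast, this is how ReflectionProperty::setValue() behaves for a regular class in today's PHP, where BigNum is declared with class rather than struct. The write is visible through both variables, which is the effect that must not happen for a shared struct instance.

```php
<?php

class BigNum {
    public function __construct(
        public int $value,
    ) {}
}

$bigNum1 = new BigNum(1);
$bigNum2 = $bigNum1; // Both variables refer to the same object.

$reflection = new ReflectionProperty(BigNum::class, 'value');
$reflection->setValue($bigNum2, 2);

var_dump($bigNum1->value); // int(2)
var_dump($bigNum2->value); // int(2)
```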
Inheritance
Inheritance is currently not allowed for structs. Structs are mainly targeted at data modelling, which should prefer composition over inheritance. There are currently no known technical issues with inheritance for structs, but we may want to be cautious when introducing it, and carefully consider the plethora of subtle semantic nuances.
Implementing interfaces is allowed, however. Interface methods may be mutating, which will be enforced when implementing the interface method. Interfaces that declare mutating methods can only be implemented by structs, not by classes.
Hashing
SplObjectStorage allows using objects as keys. For structs, these semantics are not too useful, because the object ID changes unpredictably. Instead, the lookup should be based on the object's properties. However, as hashing is a complicated topic, this will be postponed to a separate RFC. For now, using struct objects as keys in SplObjectStorage or WeakMap is not allowed.
Move semantics
There are still some cases where useless copies occur.
function doubled($bigNum) {
    $bigNum->value *= 2;
    return $bigNum;
}

$bigNum = new BigNum(1);
$bigNum = doubled($bigNum);
In this case, copying $bigNum before passing it to doubled is actually unnecessary, as it is immediately overwritten anyway. The ownership of $bigNum could thus be “moved” to doubled(). Knowing when exactly this is safe is tough, because it depends on whether doubled() can throw exceptions, and whether $bigNum is the sole reference to the struct object before the function call.
One could implement such move semantics by hand.
function move(&$value) {
    $moved = $value;
    $value = null;
    return $moved;
}

$bigNum = new BigNum(1);
$bigNum = doubled(move($bigNum));
Essentially, this code sets $bigNum to null before passing the value to doubled(), making doubled() the sole owner of the value. However, if doubled() fails for one reason or another, the value of $bigNum is lost.
There were some attempts to implement implicit move semantics, namely https://github.com/php/php-src/pull/11166. We may try to pursue this further.
Performance
Assignment to a property now needs to check whether the object is a struct object, and then clone it. This change was necessary in various code paths. In my benchmarks, this led to a small slowdown of +0.07%, whether you use structs or not. The benchmark was performed on Symfony Demo, with Opcache.
Backwards incompatible changes
struct needs to become a keyword in this RFC. However, struct will only be considered a keyword when it is followed by another identifier, excluding extends and implements. This is the same approach used for the enum RFC, and thus completely avoids backwards incompatible changes.
There are no other backwards incompatible changes.
Vote
Voting starts xxxx-xx-xx and ends xxxx-xx-xx.
As this is a language change, a 2/3 majority is required.