rfc:vector

PHP RFC: final class Vector

Introduction

PHP's native array type is rare among programming language in that it is used as an associative map of values, but also needs to support lists of values. In order to support both use cases, additional memory is needed to track keys (around twice as much as is needed to just store the values, for non-reference counted values) .

It would be useful to have an efficient variable-length container in the standard library for the following reasons:

  1. To save memory in applications or libraries that may need to store many lists of values or run for long periods of time
    (in modules identified as potentially exceeding memory limits)
    (both in userland and in native code written in php-src/PECLs)
  2. To provide a better alternative to ArrayObject and SplFixedArray for use cases that require variable sized collections (For lists of values) that can be passed by value to be modified.
  3. To give users the option of stronger runtime guarantees that property, parameter, or return values really contain a list of values without gaps, that array modifications don't introduce gaps or invalid keys, that values in the collection aren't references, etc.

Proposal

This proposes to add the class final class Vector to PHP.

Similarly to vectors in other languages, this is backed by a memory-efficient representation (raw C array of values with a size and capacity) and provides constant amortized-time push/pop operations.

Similarly to ArrayObject, the $x[] = $value shorthand for appending to the Vector is supported in addition to ArrayAccess functionality.

final class Vector implements IteratorAggregate, Countable, JsonSerializable, ArrayAccess
{
    /**
     * Construct a Vector from an iterable.
     *
     * When $preserveKeys is false, the values will be reindexed without gaps starting from 0
     * When $preserveKeys is true, any gaps in the keys of the iterable will be filled in with null,
     * and negative indices or non-integer indices will be rejected and cause an Exception.
     */
    public function __construct(iterable $iterator = [], bool $preserveKeys = true) {}
    public function getIterator(): InternalIterator {}
    public function count(): int {}
    public function capacity(): int {}
    public function shrinkToFit(): void {}
    public function clear(): void {}
    public function setSize(int $size): void {}
 
    public function __serialize(): array {}
    public function __unserialize(array $data): void {}
    public static function __set_state(array $array): Vector {}
 
    public function push(mixed $value): void {}
    public function pop(): mixed {}
 
    public function toArray(): array {}
    // Strictly typed, unlike offsetGet/offsetSet
    public function valueAt(int $offset): mixed {}
    public function setValueAt(int $offset, mixed $value): void {}
 
    public function offsetGet(mixed $offset): mixed {}
    public function offsetExists(mixed $offset): bool {}
    public function offsetSet(mixed $offset, mixed $value): void {}
    // Throws because unset and null are different things, unlike SplFixedArray
    public function offsetUnset(mixed $offset): void {}
 
    public function indexOf(mixed $value): int|false {}
    public function contains(mixed $value): bool {}
 
    public function map(callable $callback): Vector {}
    /**
     * Returns the subset of elements of the Vector satisfying the predicate.
     *
     * If the value returned by the callback is truthy
     * (e.g. true, non-zero number, non-empty array, truthy object, etc.),
     * this is treated as satisfying the predicate.
     *
     * @param null|callable(mixed):mixed $callback
     */
    public function filter(?callable $callback = null): Vector {}
 
    public function jsonSerialize(): array {}
}

Implementation Choices

Global Namespace

This maintains consistency with the namespace used for general-purpose collections already in the SPL (as well as relatively recent additions such as WeakReference and WeakMap). Other recent additions to PHP such as ReflectionIntersectionType in PHP 8.1 have also continued to use the global namespace when adding classes with functionality related to other classes.

Lack of Name Prefix

  1. Short names are more convenient to remember/use.
  2. Possible future additions such as a Deque/Queue based on a efficient C array representation rather than a linked list would conflict with existing Spl names such as SplQueue, SplStack, etc.
  3. There is already an addition to the spl without a prefix - ArrayObject. Because array was already a type its name could not reasonably be any shorter.

Accepting an iterable

This is similar to the way the existing classes ArrayObject::__construct and SplFixedArray::fromArray.

End users may be surprised if integer keys are not the same as the ones passed in by default (e.g. if keys were unset or inserted out of order), which is why $preserve_keys = true is the default.

Unlike SplFixedArray, this is not a fixed size, which is why this accepts an iterable instead of a size. Accepting a mix of different types (iterable|int) is not done because it would make code harder to reason about when types are missing or inaccurate.

setSize can be used to create a vector of a certain size after instantiating an empty vector. Array library helpers such as array_fill or range may also be useful.

Final Class

If this were extensible, this would have the following drawbacks

  1. Not have as strong guarantees to readers of code (or even opcache, if optimizations were added targeting opcache) that elements were actually a vector or that certain methods would/wouldn't throw certain exceptions.
  2. Require more memory and runtime checks to check if this was the original class or a subclass.
  3. Be more likely to have discovered or undiscovered bugs due to userland extensions of Vector

push/pop

This is consistent with the name used for array_push()/array_pop()

Other naming choices were chosen to be consistent with existing functionality in SplFixedArray/ArrayObject where reasonable.

Backward Incompatible Changes

The class name \Vector is now used by PHP, and it will be a compilation error to declare classlikes of the same name in the global namespace since the class already exists.

Proposed PHP Version(s)

8.2

RFC Impact

To Opcache

None

Unaffected PHP Functionality

PHP's type system remains unchanged (e.g. array) - final class Vector is a class and instances are ordinary objects.

Benchmarks

This is a contrived benchmark for estimating the performance of building/reading variable-sized arrays of different sizes, when the final size would be unknown on 64-bit PHP (it is known here).

Read time is counted separately from create+destroy time. This is a total over all iterations, and the instrumentation adds to the time needed.

SplFixedArray doesn't have a push method (conceptually, it hasn't made sense for a data structure described as a fixed size array), and this benchmark would be faster if it did. SplStack uses foreach for read benchmarking because the random access of SplStack is O(n) (linear time) in a linked list.

Vector is faster than the other object data structures currently available in the SPL.

Results for php 8.2.0-dev debug=false with opcache enabled=true
 
Appending to array:         n=       1 iterations= 8000000 memory=     376 bytes, create+destroy time=0.645 read time = 0.308 result=0
Appending to Vector:        n=       1 iterations= 8000000 memory=     128 bytes, create+destroy time=1.003 read time = 0.355 result=0
Appending to SplStack:      n=       1 iterations= 8000000 memory=     184 bytes, create+destroy time=1.737 read time = 0.742 result=0
Appending to SplFixedArray: n=       1 iterations= 8000000 memory=      80 bytes, create+destroy time=1.810 read time = 0.428 result=0
 
 
Appending to array:         n=       4 iterations= 2000000 memory=     376 bytes, create+destroy time=0.222 read time = 0.114 result=12000000
Appending to Vector:        n=       4 iterations= 2000000 memory=     128 bytes, create+destroy time=0.323 read time = 0.164 result=12000000
Appending to SplStack:      n=       4 iterations= 2000000 memory=     280 bytes, create+destroy time=0.739 read time = 0.301 result=12000000
Appending to SplFixedArray: n=       4 iterations= 2000000 memory=     128 bytes, create+destroy time=1.164 read time = 0.233 result=12000000
 
 
Appending to array:         n=       8 iterations= 1000000 memory=     376 bytes, create+destroy time=0.154 read time = 0.084 result=28000000
Appending to Vector:        n=       8 iterations= 1000000 memory=     192 bytes, create+destroy time=0.227 read time = 0.148 result=28000000
Appending to SplStack:      n=       8 iterations= 1000000 memory=     408 bytes, create+destroy time=0.530 read time = 0.240 result=28000000
Appending to SplFixedArray: n=       8 iterations= 1000000 memory=     192 bytes, create+destroy time=1.026 read time = 0.205 result=28000000
 
 
Appending to array:         n= 1048576 iterations=      20 memory=33558608 bytes, create+destroy time=0.699 read time = 0.151 result=10995105792000
Appending to Vector:        n= 1048576 iterations=      20 memory=16777304 bytes, create+destroy time=0.483 read time = 0.271 result=10995105792000
Appending to SplStack:      n= 1048576 iterations=      20 memory=33554584 bytes, create+destroy time=0.865 read time = 0.410 result=10995105792000
Appending to SplFixedArray: n= 1048576 iterations=      20 memory=16777304 bytes, create+destroy time=2.431 read time = 0.404 result=10995105792000
<?php
 
function bench_array(int $n, int $iterations) {
    $totalReadTime = 0.0;
    $startTime = hrtime(true);
    $total = 0;
    for ($j = 0; $j < $iterations; $j++) {
        $startMemory = memory_get_usage();
        $values = [];
        for ($i = 0; $i < $n; $i++) {
            $values[] = $i;
        }
        $startReadTime = hrtime(true);
        for ($i = 0; $i < $n; $i++) {
            $total += $values[$i];
        }
        $endReadTime = hrtime(true);
        $totalReadTime += $endReadTime - $startReadTime;
 
        $endMemory = memory_get_usage();
        unset($values);
    }
    $endTime = hrtime(true);
 
    $totalTime = ($endTime - $startTime) / 1000000000;
    $totalReadTimeSeconds = $totalReadTime / 1000000000;
    printf("Appending to array:         n=%8d iterations=%8d memory=%8d bytes, create+destroy time=%.3f read time = %.3f result=%d\n",
        $n, $iterations, $endMemory - $startMemory, $totalTime - $totalReadTimeSeconds, $totalReadTimeSeconds, $total);
}
function bench_vector(int $n, int $iterations) {
    $startTime = hrtime(true);
    $totalReadTime = 0.0;
    $total = 0;
    for ($j = 0; $j < $iterations; $j++) {
        $startMemory = memory_get_usage();
        $values = new Vector();
        for ($i = 0; $i < $n; $i++) {
            $values[] = $i;
        }
 
        $startReadTime = hrtime(true);
        for ($i = 0; $i < $n; $i++) {
            $total += $values[$i];
        }
        $endReadTime = hrtime(true);
        $totalReadTime += $endReadTime - $startReadTime;
 
        $endMemory = memory_get_usage();
        unset($values);
    }
    $endTime = hrtime(true);
    $totalTime = ($endTime - $startTime) / 1000000000;
    $totalReadTimeSeconds = $totalReadTime / 1000000000;
    printf("Appending to Vector:        n=%8d iterations=%8d memory=%8d bytes, create+destroy time=%.3f read time = %.3f result=%d\n",
        $n, $iterations, $endMemory - $startMemory, $totalTime - $totalReadTimeSeconds, $totalReadTimeSeconds, $total);
}
// SplStack is a subclass of SplDoublyLinkedList, so it is a linked list that takes more memory than needed.
// Access to values in the middle of the SplStack is also less efficient.
function bench_spl_stack(int $n, int $iterations) {
    $startTime = hrtime(true);
    $totalReadTime = 0.0;
    $total = 0;
    for ($j = 0; $j < $iterations; $j++) {
        $startMemory = memory_get_usage();
        $values = new SplStack();
        for ($i = 0; $i < $n; $i++) {
            $values->push($i);
        }
        $startReadTime = hrtime(true);
        // Random access is linear time in a linked list, so use foreach instead
        foreach ($values as $value) {
            $total += $value;
        }
        $endReadTime = hrtime(true);
        $totalReadTime += $endReadTime - $startReadTime;
        $endMemory = memory_get_usage();
        unset($values);
    }
    $endTime = hrtime(true);
    $totalTime = ($endTime - $startTime) / 1000000000;
    $totalReadTimeSeconds = $totalReadTime / 1000000000;
    printf("Appending to SplStack:      n=%8d iterations=%8d memory=%8d bytes, create+destroy time=%.3f read time = %.3f result=%d\n",
        $n, $iterations, $endMemory - $startMemory, $totalTime - $totalReadTimeSeconds, $totalReadTimeSeconds, $total);
}
function bench_spl_fixed_array(int $n, int $iterations) {
    $startTime = hrtime(true);
    $totalReadTime = 0.0;
    $total = 0;
    for ($j = 0; $j < $iterations; $j++) {
        $startMemory = memory_get_usage();
        $values = new SplFixedArray();
        for ($i = 0; $i < $n; $i++) {
            // Imitate how push() would be implemented in a situation
            // where the number of elements wasn't actually known ahead of time.
            // erealloc() tends to extend the existing array when possible.
            $size = $values->getSize();
            $values->setSize($size + 1);
            $values->offsetSet($size, $i);
        }
        $startReadTime = hrtime(true);
        for ($i = 0; $i < $n; $i++) {
            $total += $values[$i];
        }
        $endReadTime = hrtime(true);
        $totalReadTime += $endReadTime - $startReadTime;
        $endMemory = memory_get_usage();
        unset($values);
    }
    $endTime = hrtime(true);
    $totalTime = ($endTime - $startTime) / 1000000000;
    $totalReadTimeSeconds = $totalReadTime / 1000000000;
    printf("Appending to SplFixedArray: n=%8d iterations=%8d memory=%8d bytes, create+destroy time=%.3f read time = %.3f result=%d\n\n",
        $n, $iterations, $endMemory - $startMemory, $totalTime - $totalReadTimeSeconds, $totalReadTimeSeconds, $total);
}
$n = 2**20;
$iterations = 10;
$sizes = [
    [1, 8000000],
    [4, 2000000],
    [8, 1000000],
    [2**20, 20],
];
printf(
    "Results for php %s debug=%s with opcache enabled=%s\n\n",
    PHP_VERSION,
    PHP_DEBUG ? 'true' : 'false',
    json_encode(function_exists('opcache_get_status') && (opcache_get_status(false)['opcache_enabled'] ?? false))
);
 
foreach ($sizes as [$n, $iterations]) {
    bench_array($n, $iterations);
    bench_vector($n, $iterations);
    bench_spl_stack($n, $iterations);
    bench_spl_fixed_array($n, $iterations);
    echo "\n";
}

Future Scope

If \Vector is added, there would be plenty of time for myself or others to add additional methods before PHP 8.2's feature freeze (probably in July 2022)

Additional data structures from https://github.com/TysonAndre/pecl-teds that are general purpose (such as \Deque or future additions) may be possible as well.

Proposed Voting Choices

Yes/No vote, requiring a 2/3 majority

References

- https://www.npopov.com/2014/12/22/PHPs-new-hashtable-implementation.html - https://github.com/TysonAndre/pecl-teds (implementations of multiple data structures, including Teds\Vector, based originally on the SplFixedArray documentation) - https://externals.io/message/112639#112641

Rejected Features

Why not use php-ds/ext-ds instead?

  1. No matter how useful or popular a PECL is, datastructures available in PHP's core will have much, much wider adoption in applications and libraries that are available in PECLs, allowing those applications and libraries to write faster and/or more memory efficient code.
  2. End users can make much stronger assumptions about the backwards compatibility and long-term availability of data structures that are included in core.
  3. The php-ds maintainers do not plan to merge the extension into php-src, and believe php-ds should coexist with new functionality being added in a separate namespace instead (see quote and later clarifications for full context)
  4. Opcache may be able to make stronger optimizations of internal classes found in php-src than any third party PECL. (e.g. because Deque::push() or Vector::push() would never throw or emit notices, it may be possible to optimize it to be even faster than appending to an array in the Opcache JIT compiler)

Perceived issues and uncertainties about php-ds distribution plans

This has been asked about multiple times in threads on unrelated proposals (https://externals.io/message/112639#112641 and https://externals.io/message/93301#93301 years ago) throughout the years, but the maintainer of php-ds had a long term goal of developing the separately from php's release cycle (and was still focusing on the PECL when I'd asked on the GitHub issue in the link in September 2020).

To quote the maintainer on the GitHub issue on php-ds/ext-ds I'd opened the last time someone suggested using php-ds (emphasis on the below quote mine)

My long-term intention has been to not merge this extension into php-src. I would like to see it become available as a default extension at the distribution level. Unfortunately I have no influence or understanding of that process. Having an independent release and development cycle is a good thing, in my opinion.

If those plans change, I would like to hold off until a 2.0 release - I've learnt a lot over the last 4 years and would like to revisit some of the design decisions I made then, such as a significant reduction of the interfaces or perhaps more interfaces with greater specificity. Functions like map, filter, reduce can be delegated to other libraries that operate on iterable instead of having these as first-class members of the interface. There is a 2.0 branch with some ideas but I haven't looked at that in a while.

I have been working on a research project to design persistent data structures for immutability, so there is a lot of work that I have set for myself for this project over the next 6 months or so. I have no intention to push for distribution changes in the short-term but I am open to the suggestion.

> Do you mean OS distribution level (Windows, Ubuntu, CentOS, HomeBrew for mac, etc.?)
He meant distribution with PHP core (on all platforms where PHP is available)

Whichever is more viable - simply not merged into core, but distributed and enabled by default alongside it.0

There have been no proposals from the maintainer themselves so far to add php-ds to core or distribute it alongside core in any form. That was just what the maintainer mentioned as a long term plan.

The model of distributing an extension separately from core has never been done before, and even if approved would raise multiple concerns:

  • I personally doubt having it developed separately from php's release cycle would be accepted by voters (e.g. if unpopular decisions couldn't be voted against or vetoed, or if RFCs passed by the community for additions of datastructures (or additions of methods to datastructures) could be overturned by the php-ds maintainers)
  • This may limit what features could be added by the community: For example, introducing the map() or filter() functionality to a Vector if the php-ds maintainers removed that function in a simplified 2.0.
  • I'm not certain how backwards compatibility would be handled in that model, e.g. if the maintainers of ext-ds wanted to drop support for a method after it was released.
  • This may cause delays in publishing php releases, e.g. if the maintainers were unable to quickly review patches for crashes, incompatibilities or compile errors introduced in new php versions, etc.
  • and other concerns (e.g. API debates such as https://externals.io/message/93301#93301)

With php-ds itself getting merged anytime soon (if the maintainers continue to plan to distribute php-ds that way) seeming unlikely to me, I decided to start independently working on efficient data structure implementations. I don't see dragging it in (against the maintainer's wishes) as a viable option for many, many, many reasons. But having efficient datastructures in PHP's core is still useful.

The timeline for php-ds 2.0 is also something I am uncertain about.

Additionally, while there may be some uses for immutable datastructures, I would believe there are more uses for mutable datastructures, especially for programmers with imperative programming backgrounds such as C/C++, and would propose these mutable datastructures regardless of those plans. Having these mutable datastructures in core is still useful to immutable programmers and functional programmers, because it provides another tool to write the internal, private implementation details in a memory-efficient way.

  • EDIT: I misread the maintainer's response as being about the project php-ds 2.0 - I'm now pretty sure the “research project to design persistent data structures for immutability” is a different project from ext-ds and possibly in a different programming language.
    \\(Leaving in this comment in because immutable datastructures were brought up by others in the RFC discussion)

While PECL development outside of php has its benefits for development and ability to make new features available in older php releases, it's less likely that application and library authors will start making use of those data structures because many users won't have any given PECL already installed. (though php-ds also publishes a polyfill, it would not have the cpu and memory savings, and add its own overhead)

Additionally, users (and organizations using PHP) can often make stronger assumptions on backwards compatibility and long-term availability of functionality that is merged into PHP's core.

So the choice of feature set, some names, signatures, and internal implementation details are different, because this is reimplementing a common datastructure found in different forms in many languages. It's definitely a mature project, but I personally feel like reimplementing this (without referring to the php-ds source code and without copying the entire api as-is) is the best choice to add efficient data structures to core while respecting the maintainer's work on the php-ds project and their wish to maintain control over the php-ds project.

As a result, I've been working on implementing data structures such as Deque based on php-src's data structure implementations (mostly SplFixedArray and ArrayObject) instead (and based on my past PECL/RFC experience, e.g. with runkit7/igbinary)

Minor differences in API design goals

Traditionally, PHP has been a very batteries included language. Existing functionality such as strings and arrays have very large standard libraries. This makes it easier to write code without depending on too many third party composer libraries, and knowledge of the standard library can transfer to any codebase a developer works on.

My hopes for ease of use, readability, speed, and static analysis in future data structures such as Vector are similar to those mentioned by Benjamin Morel in the GitHub issue:

Functions like map, filter, reduce can be delegated to other libraries that operate on iterable instead of having these as first-class members of the interface.

Again, I understand the rationale behind this decision, like reducing duplication and keeping only the core functionality in DS. However, sometimes you have to take into consideration ease of use vs purity of the code.

Ease of use / DX / readability: it seems more logical to me to do:

$map->filter(fn(...) => ...);

Rather than:

Some\filter($map, fn(...) => ...);

Speed: as you said, internal iteration is faster. And speed is one of the selling points of DS vs arrays.

Static analysis: I love the fact that Map::filter() can be strictly typed as returning Map<TKey, TValue> in Psalm, for example. If you rely on a generic filter() function, I'm not sure such strict typing will be easy or even possible.

Thank you for your work on DS anyway, I already use the extension in my closed-source project, in particular Map. I would love to use data structures in my open-source projects, one day! 🤞

Additionally, it may be inconvenient for end users (e.g. new contributors to projects) to remember specifics of multiple libraries or utility classes when working on different codebases, to deal with dependency conflicts after major version upgrades, or to deal with libraries dropping support for older php versions, getting abandoned, etc.

Update: php-ds maintainer response clarifications

On September 24, 2021, the maintainer responded after being asked about current plans for php-ds

Hi everyone, I am happy to see this discussion and I thank you all for taking part. My reservation to merge ds into core has always been because I wanted to make sure we get it right before we do that and the intention behind the mythical v2 was to achieve that, based on learnings from v1 and feedback from the community. I have no personal attachment to this project, I only want what is best for PHP and the community.

I would love to see a dedicated, super-lean vec data structure in core that has native iteration and all the other same internal benefits as arrays. In my opinion, the API should be very minimal and potentially compatible with all the non-assoc array functions. An OO interface can easily be designed around that. I'm imagining something similar to Golang's slices.

As for the future of ds itself, I think these can co-exist and ds can remain external. I've been researching and designing immutable data structures over the last 4 years and I still hope to develop a v2 that simplifies the interfaces and introduces immutable structures. Attempting to implement a suite of structures in core or an OO vector would take a lot of work and might be difficult to reach consensus on with the API. I don't think we should attempt to merge ds into core at any time.

I am currently traveling and have not followed this discussion in detail on the mailing list. I'd be happy to assist in any way I can and will catch up as soon as I am home again this week. Feel free to quote this response on the mailing list as well.

I'm still awaiting some clarifications on how they they were willing to assist before updating the remainder of this RFC.

Additionally, there may be differences in design goals, as noted in the above section.

Adding a native type instead (is_vec)

https://externals.io/message/109760#109812

Yes, sorry – Hack introduced the vec type (with value semantics) in 2016 after they'd experimented first with Vector (object semantics). Use of Vector is now discouraged.

Details here: https://github.com/facebook/hhvm/issues/6451

FB/Hack appears to be in the multi-year process of moving all PHP arrays to one of [vec/dict/keyset]. *That's likely not an option for PHP itself,* but having the option of a vec equivalent (in this proposal “list”) would make sense, I think.

Adding a new type to php as a non-class is a massive undertaking for php-src itself and extension authors. It would not work with a lot of existing code that handled arrays and objects - I expect that is_vec would be a separate check from is_object and is_array, etc. This is part of why PHP 8.1 enum classes are an object type rather than a distinct type

See https://www.npopov.com/2015/05/05/Internal-value-representation-in-PHP-7-part-1.html for how php represents values internally.

That would also require a lot more familiarity than I have with opcache and the JIT assembly compiler, and I expect it would be more controversial due to not working with existing code. For a language such as Hack where feature development is done by one company(Facebook), major language redesigns and breaking changes would much more difficult in PHP, with users/developers from many different backgrounds (and a much stronger focus on backward compatibility). Additionally, adding a class doesn't prevent adding a vec/list in the future - for example, HHVM has both vec and https://docs.hhvm.com/hack/reference/class/HH.Vector/ and HH\Vector remains usable, PHP has both array and ArrayObject, etc.

Also, even if a type vec or array were added, vec and array would be distinct types - a vec couldn't be passed to a parameter that expected an array reference (or returned in a return value), because later adding a string array key (in the parameter or return value) would be a runtime error.

Changelog

0.2: Add php-ds maintainer response, improve documentation, note this is on hold while working on Deque (Double-Ended Queue) RFC

rfc/vector.txt · Last modified: 2021/09/26 16:46 by tandre