rfc:generators

This is an old revision of the document!


Request for Comments: Generators

  • Date: 2012-06-05
  • Author: Niktia Popov nikic@php.net
  • Status: In Draft

Introduction

Generators provide an easy, boilerplate-free way of implementing iterators.

As an example, consider how you would implement the file() function in userland code:

function getLinesFromFile($fileName) {
    if (!$fileHandle = fopen($fileName, 'r')) {
        return;
    }
 
    $lines = [];
    while (false !== $line = fgets($fileHandle)) {
        $lines[] = $line;
    }
 
    fclose($fileHandle);
 
    return $lines;
}
 
$lines = getLinesFromFile($fileName);
foreach ($lines as $line) {
    // do something with $line
}

The main disadvantage of this kind of code is evident: It will read the whole file into a large array. Depending on how big the file is, this can easily hit the memory limit. This is not what you usually want. Instead you want to get the lines one by one. This is what iterators are perfect for.

Sadly implementing iterators requires an insane amount of boilerplate code. E.g. consider this iterator variant of the above function:

class LineIterator implements Iterator {
    protected $fileHandle;
 
    protected $line;
    protected $i;
 
    public function __construct($fileName) {
        if (!$this->fileHandle = fopen($fileName, 'r')) {
            throw new RuntimeException('Couldn\'t open file "' . $fileName . '"');
        }
    }
 
    public function rewind() {
        fseek($this->fileHandle, 0);
        $this->line = fgets($this->fileHandle);
        $this->i = 0;
    }
 
    public function valid() {
        return false !== $this->line;
    }
 
    public function current() {
        return $this->line;
    }
 
    public function key() {
        return $this->i;
    }
 
    public function next() {
        if (false !== $this->line) {
            $this->line = fgets($this->fileHandle);
            $this->i++;
        }
    }
 
    public function __destruct() {
        fclose($this->fileHandle);
    }
}
 
$lines = new LineIterator($fileName);
foreach ($lines as $line) {
    // do something with $line
}

As you can see a very simple piece of code can easily become very complicated when turned into an iterator. Generators solve this problem and allow you to implement iterators in a very straightforward manner:

function getLinesFromFile($fileName) {
    if (!$fileHandle = fopen($fileName, 'r')) {
        return;
    }
 
    while (false !== $line = fgets($fileHandle)) {
        yield $line;
    }
 
    fclose($fileHandle);
}
 
$lines = getLinesFromFile($fileName);
foreach ($lines as $line) {
    // do something with $line
}

The code looks very similar to the array-based implementation. The main difference is that instead of pushing values into an array the values are yielded.

Generators work by passing control back and forth between the generator and the calling code:

When you first call the generator function ($lines = getLinesFromFile($fileName)) the passed argument is bound, but nothing of the code is actually executed. Instead the function directly returns a Generator object. That Generator object implements the Iterator interface and is what is eventually traversed by the foreach loop:

Whenever the Iterator::next() method is called PHP resumes the execution of the generator function until it hits a yield expression. The value of that yield expression is what Iterator::current() then returns.

Generator methods, together with the IteratorAggregate interface, can be used to easily implement traversable classes too:

class Test implements IteratorAggregate {
    protected $data;
 
    public function __construct(array $data) {
        $this->data = $data;
    }
 
    public function getIterator() {
        foreach ($this->data as $key => $value) {
            yield $key => $value;
        }
        // or whatever other traversation logic the class has
    }
}
 
$test = new Test(['foo' => 'bar', 'bar' => 'foo']);
foreach ($test as $k => $v) {
    echo $k, ' => ', $v, "\n";
}

Generators can also be used the other way around, i.e. instead of producing values they can also consume them. When used in this way they are often referred to as enhanced generators, reverse generators or coroutines.

Coroutines are a rather advanced concept, so it very hard to come up with not too contrived an short examples. For an introduction see an example on how to parse streaming XML using coroutines. If you want to know more, I highly recommend checking out a presentation on this subject.

Specification

Recognition of generator functions

Any function which contains a yield statement is automatically a generator function.

The initial implementation required that generator functions are marked with an asterix modifier (function*). This method has the advantage that generators are more explicit and also allows for yield-less coroutines.

The automatic detection was chosen over the asterix modifier for the following reasons:

  • There is an existing generator implementation in HipHop PHP, which uses automatic-detection. Using the asterix modifier would break compatibility.
  • All existing generator implementations in other language (that I know of) also use automatic detection. This includes Python, JavaScript 1.7 and C#. The only exception to this is the generator support as defined by ECMAScript Harmony, but I know no browser that actually implements it in the defined way.
  • The syntax for by-reference yielding looks very ugly: function *&gen()
  • yield-less coroutines are a very narrow use case and are also possible with automatic-detection using a code like if (false) yield;.

Basic behavior

When a generator function is called the execution is suspended immediately after parameter binding and a Generator object is returned.

The Generator object implements the following interface:

final class Generator implements Iterator {
    void  rewind();
    bool  valid();
    mixed current();
    mixed key();
    void  next();
 
    mixed send(mixed $value);
    void close();
}

If the generator is not yet at a yield statement (i.e. was just created and not yet used as an iterator), then any call to rewind, valid, current, key, next or send will resume the generator until the next yield statement is hit.

Consider this example:

function gen() {
    echo 'start';
    yield 'middle';
    echo 'end';
}
 
// Initial call does not output anything
$gen = gen();
 
// Call to current() resumes the generator, thus "start" is echo'd.
// Then the yield expression is hit and the string "middle" is returned
// as the result of current() and then echo'd.
echo $gen->current();
 
// Execution of the generator is resumed again, thus echoing "end"
$gen->next();

A nice side-effect of this behavior is that coroutines do not have to be primed with a next() call before they can be used. (This is required in Python and also the reason why coroutines in Python usually use some kind of decorator that automatically primes the coroutine.)

Apart from the above the Generator methods behave as follows:

  • rewind: Generators are not rewindable, so this is just a no-op. (More in the “Rewinding a generator” section.)
  • valid: Returns false if the generator has been closed, true otherwise. (More in the “Closing a generator” section.)
  • current: Returns whatever was passed to yield or null if nothing was passed or the generator is already closed.
  • key: Returns the yielded key or, if none was specified, an auto-incrementing key or null if the generator is already closed. (More in the “Yielding keys” section.)
  • next: Resumes the generator (unless the generator is already closed).
  • send: Sets the return value of the yield expression and resumes the generator (unless the generator is already closed). (More in the “Sending values” section.)
  • close: Closes the generator. (More in the “Closing a generator” section.)

Yielding keys

The languages that currently implement generators don't have support for yielding keys (only values). This though is just a sideeffect as these languages don't support keys in iterators in general.

In PHP on the other hand keys are explicitly part of the iteration process and it thus does not make sense to not add key-yielding support. The syntax could be analogous to that of foreach loops and array declarations:

yield $key => $value;

The problem with this syntax that it would be ambiguous in array declarations and nested yield expressions:

array(
    yield $key => $value
)
// could be
array(
    (yield $key) => $value
)
// or
array(
    (yield $key => $value)
)
 
yield yield $a => $b;
// could be
yield (yield $a) => $b
// or
yield (yield $a => $b)

Even though this is an absolute edge-case, the grammar still shouldn't be ambiguous in this case. It's not clear how this can be solved. In the current implementation the yield $key => $value syntax is simply implemented as a statement instead of an expression. This obviously is a rather bad solution to the problem and it would be preferable to find some other way to deal with it.

Furthermore generators need to generate keys even if no key was explicitly yielded. In this case it seems reasonable to behave the same as arrays do: Start with the key 0 and always increment by one. If in between an integer key which is larger than the current auto-key is explicitly yielded, then that will be used as the starting point for new auto-keys. All other yielded keys do not affect the auto-key mechanism.

function gen() {
    yield 'a';
    yield 'b';
    yield 'key' => 'c';
    yield 'd';
    yield 10 => 'e';
    yield 'f';
}
 
foreach (gen() as $key => $value) {
    echo $key, ' => ', $value, "\n";
}
 
// outputs:
0 => a
1 => b
key => c
2 => d
10 => e
11 => f

This is the same behavior that arrays have (i.e. if gen() instead simply returned an array with the yielded values the keys would be same). The only difference occurs when the generator yield non-integer, but numeric keys. For arrays they are cast, for generators the are not.

Yield by reference

Generators can also yield by values by reference. To do so the & modifier is added before the function name, just like it is done for return by reference.

This for example allows you to create classes with by-ref iteration behavior (which is something that is completely impossible with normal iterators):

class DataContainer implements IteratorAggregate {
    protected $data;
 
    public function __construct(array $data) {
        $this->data = $data;
    }
 
    public function &getIterator() {
        foreach ($this->data as $key => &$value) {
            yield $key => $value;
        }
    }
}

The class can then be iterated using by-ref foreach:

$dataContainer = new DataContainer([1, 2, 3]);
foreach ($dataContainer as &$value) {
    $value *= -1;
}
 
// $this->data is now [-1, -2, -3]

Only generators specifying the & modifier can be iterated by ref. If you try to iterate a non-ref generator by-ref an E_ERROR is thrown.

Rewinding a generator

Rewinding to some degree goes against the concept of generators, as they are mainly intended as one-time data sources that are not supposed to be iterated another time. On the other hand, most generators probably *are* rewindable and it might make sense to allow it. One could argue though that rewinding a generator is really bad practice (especially if the generator is doing some expensive calculation). Allowing it to rewind would look like it is a cheap operation, just like with arrays. Also rewinding (as in jumping back to the execution context state at the initial call to the generator) can lead to unexpected behavior, e.g. in the following case:

function getSomeStuff(PDOStatement $stmt) {
    foreach ($stmt as $row) {
        yield doSomethingWith($row);
    }
}

Here rewinding would simply result in an empty iterator as the result set is already depleted.

One solution thus could be to allow explicitly marking generators to be rewindable. E.g. one could add a rewindable function, which makes a generator rewindable:

$gen = rewindable(gen());

This function is actually already implementable in userland code (see “Cloning a generator” section.)

Cloning a generator

Generators can be cloned, thus leaving two independent Generator objects with the same state. This behavior can for example be used to create the aforementioned rewindable function:

class RewindableGenerator implements Iterator {
    protected $original;
    protected $current;
 
    public function __construct(Generator $generator) {
        $this->original = $generator;
        $this->current = null;
    }
 
    public function rewind() {
        if ($this->current) { $this->current->close(); }
        $this->current = clone $this->original;
        $this->current->rewind();
    }
 
    public function valid() {
        if (!$this->current) { $this->current = clone $this->original; }
        return $this->current->valid();
    }
 
    public function current() {
        if (!$this->current) { $this->current = clone $this->original; }
        return $this->current->current();
    }
 
    public function key() {
        if (!$this->current) { $this->current = clone $this->original; }
        return $this->current->key();
    }
 
    public function next() {
        if (!$this->current) { $this->current = clone $this->original; }
        $this->current->next();
    }
 
    public function send($value) {
        if (!$this->current) { $this->current = clone $this->original; }
        return $this->current->send($value);
    }
 
    public function close() {
        $this->original->close();
        if ($this->current) {
            $this->current->close();
        }
    }
}
 
function rewindable(Generator $generator) {
    return new RewindableGenerator($generator);
}

It can be then used as follows:

function xrange($start, $end, $step = 1) {
    for ($i = $start; $i <= $end; $i += $step) {
        yield $i;
    }
}
 
$range = rewindable(xrange(0, 5));
foreach ($range as $i) {
    echo $i, "\n";
}
foreach ($range as $i) {
    echo $i, "\n";
}

This will correctly output the 0..5 range twice.

Closing a generator

When a generator is closed it frees the suspended execution context (as well as all other held variables). After it has been closed valid will return false and both current and key will return null.

A generator can be closed in three ways:

  • Explicitly calling the Generator::close method. This can be useful if you want to free used memory after you don't need the generator anymore.
  • Removing all references to the generator object. In this case the generator will be closed as part of the garbage collection process.
  • Reaching a return statement (or the end of the function) in a generator or throwing an exception from it (without catching it inside the generator).

The following resources are destructed while closing a generator:

  • The current execution context (execute_data)
  • Stack arguments for the generator call, and the additional execution context which is used to manage them.
  • The currently active symbol table (or the compiled variables if no symbol table is in use).
  • The current $this object.
  • If the generator is closed during a method call, the object which the method is invoked on (EX(object)).
  • If the generator is closed during a call, the arguments pushed to the stack.
  • Any foreach loop variables which are still alive (taken from brk_cont_array).
  • The current generator key and value

Currently it can happen that temporary variables are not cleaned up properly in edge-case situations. Exceptions are also subject to this problem: https://bugs.php.net/bug.php?id=62210. If that bug could be fixed for exceptions, then it would also be fixed for generators.

A generator cannot be closed while it is running (e.g. if $generator->closed() is called within the generator function). In this case and E_WARNING is thrown and the call is ignored.

In Python generators are closed by throwing a GeneratorExit exception into them. This is *not done* in the PHP implementation.

The exception gives the generator function a chance to clean up any resources it uses:

function gen() {
    $lock = getLock();
    try {
        // do some stuff
    } catch (GeneratorExitException $e) {
        releaseLock($lock);
        throw $e;
    }
    releaseLock($lock);
}

As you can see the exception allows the generator to clean up. Also note that the exception has to be rethrown. This way it is ensured that the generator does not simply resume execution, but really does terminate.

As PHP does not support finally blocks which have strict must-run semantics and there are other simple ways to enforce resource cleanup (e.g. destructors) this behavior is not implemented in PHP.

Patch

A working, but not yet complete implementation can be found at https://github.com/nikic/php-src/tree/addGeneratorsSupport.

Performance

You can find a small micro benchmark at https://gist.github.com/2975796. It compares several ways of iterating ranges:

  • Using generators (xrange)
  • Using iterators (RangeIterator)
  • Using arrays implemented in userland (urange)
  • Using arrays implemented internally (range)

For large ranges generators are consistently faster; about four times faster than an iterator implementation and even 40% faster than the native range implementation.

For small ranges (around one hundred elements) the variance of the results is rather high, but from multiple runs it seems that in this case generators are slightly slower than the native implementation, but still faster than the iterator variant.

The tests were run on a Ubuntu VM, so I'm not exactly sure how representative they are.

Why not just use callback functions?

A question that has come up a few times during discussion: Why not use callback functions, instead of generators? For example the above getLinesFromFile function could be rewritten using a callback:

function processLinesFromFile($fileName, callable $callback) {
    if (!$fileHandle = fopen($fileName, 'r')) {
        return;
    }
 
    while (false !== $line = fgets($fileHandle)) {
        $callback($line);
    }
 
    fclose($fileHandle);
}
 
processLinesFromFile($fileName, function($line) {
    // do something
});

This approach has two main disadvantages:

Firstly, callbacks integrate badly into the existing PHP coding paradigms. Having quadruply-nested closures is something very normal in languages like JavaScript, but rather rare in PHP. Many things in PHP are based on iteration and generators can nicely integrate with this.

A concrete example, which was actually my initial motivation to write the generators patch:

protected function getTests($directory, $fileExtension) {
    $it = new RecursiveDirectoryIterator($directory);
    $it = new RecursiveIteratorIterator($it, RecursiveIteratorIterator::LEAVES_ONLY);
    $it = new RegexIterator($it, '(\.' . preg_quote($fileExtension) . '$)');
 
    $tests = array();
    foreach ($it as $file) {
        // read file
        $fileContents = file_get_contents($file);
 
        // parse sections
        $parts = array_map('trim', explode('-----', $fileContents));
 
        // first part is the name
        $name = array_shift($parts);
 
        // multiple sections possible with always two forming a pair
        foreach (array_chunk($parts, 2) as $chunk) {
            $tests[] = array($name, $chunk[0], $chunk[1]);
        }
    }
 
    return $tests;
}

This is a function which I use to provide test vectors to PHPUnit. I point it to a directory containing test files and then split up those test files into individual tests + expected output. I can then use the result of the function to feed some test function via @dataProvider.

The problem with the above implementation obviously is that I have to read all tests into memory at once (instead of one-by-one).

How can I solve this problem? By turning it into an iterator obviously! But if you look closer, this isn't actually that easy, because I'm adding new tests in a nested loop. So I would have to implement some kind of complex push-back mechanism to solve the problem. And - getting back on topic - I can't use callbacks here either, because I need a traversable for use with @dataProvider. Generators on the other hand solve this problem very elegantly. Actually, all you have to do to turn it into a lazy generator is replace $tests[] = with yield.

The second, more general problem with callbacks is that it's very hard to manage state across calls. The classic example is a lexer + parser system. If you implement the lexer using a callback (i.e. lex(string $sourceCode, callable $tokenConsumer)) you would have to figure out some way to keep state between subsequent calls. You'd have to build some kind of state machine, which can quickly get really ugly, even for simple problems (just look at the hundreds of states that a typical LALR parser has). Again, generators solve this problem elegantly, because they maintain state implicitly, in the execution state.

TODO

  • Decide on whether to implement the Generator::throw method
  • Implement yield* expression and generator return values

Further resources

rfc/generators.1343411655.txt.gz · Last modified: 2017/09/22 13:28 (external edit)