rfc:native_regular_expressions

rfc:native_regular_expressions [2014/08/15 01:13] – Add wiki formatting bishoprfc:native_regular_expressions [2017/09/22 13:28] (current) – external edit 127.0.0.1
Line 1: Line 1:
 ====== PHP RFC: Native Regular Expression ====== ====== PHP RFC: Native Regular Expression ======
-  * Version: 0.1 
   * Date: 2014-08-13   * Date: 2014-08-13
   * Author: Bishop Bettini, bishop@php.net   * Author: Bishop Bettini, bishop@php.net
   * Status: Draft   * Status: Draft
-  * First Published at: http://wiki.php.net/rfc/native_regular_expressions 
  
 FIXME FIXME FIXME FIXME FIXME FIXME
 Jotting my ideas down here.  Move along. Maybe called "first_class_citizens_regex" or something. Jotting my ideas down here.  Move along. Maybe called "first_class_citizens_regex" or something.
 +Consider emulating structure of https://wiki.php.net/rfc/abstract_syntax_tree
 +https://wiki.php.net/rfc/generators
 +
 +
 +===== Introduction =====
 +
 +Regular expressions provide powerful string matching capabilities and play a critical role in most software written in PHP.  For example, GitHub reports [[https://github.com/search?q=language%3Aphp+preg_filter+OR+preg_grep+OR+preg_match+OR+preg_match_all+OR+preg_replace+OR+preg_split&type=Code|10.5 million instances of the ''preg_*()'' family of functions]]((Compared with [[https://github.com/search?q=language%3Aphp+str_replace+OR+explode+OR+strpos&type=Code|16 million instances]] of the ''explode()'', ''strpos()'', and ''str_replace()'' related functions.))((This RFC does not consider the deprecated POSIX regular expressions to be an active part of PHP and any implementation of this RFC will focus solely upon PCRE regular expressions.)).
 +
 +In the current engine, regular expressions are plain old strings:
 +
 +<code php>
 +while (preg_match('/^\s*[^#]/', $line[$i++])) {}
 +</code>
 +
 +The primary disadvantage of the string representation appears when the regular expression itself needs to contain a single quote, a double quote, or the delimiters bracketing the regular expression.  When that happens, the programmer has to make mental shifts to work around the string representation, which can make the regular expression harder to read and maintain.  Example:
 +
 +<code php>
 +// match foo in examples: ="foo"   and  ='foo'  and   = "foo"
 +preg_match_all('(=\s*['."'".'"]([^'."'".'"/]*)['."'".'"])x', $string, $matches);
 +</code>((PCRE wizards might rightly scold me for using that example: it doesn't actually work as described for unbalanced or escaped quotation marks, and a realistic working example would be ''(["'])(?:\\?+.)*?\1'', which needs only one single quote escaped.  I agree, but I chose this example to illustrate a point.))
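
To make the "mental shift" concrete, the following minimal sketch assembles the concatenated pieces from the example above and prints the pattern that PCRE actually receives. The sample input ''$string'' is a hypothetical value chosen only for illustration; it is not part of the RFC.

<code php>
<?php
// The same concatenation as in the example above, assigned to a variable.
$pattern = '(=\s*['."'".'"]([^'."'".'"/]*)['."'".'"])x';

// What the reader has to reconstruct mentally: (=\s*['"]([^'"/]*)['"])x
// The outer parentheses are the delimiters and the trailing x is a modifier.
echo $pattern, "\n";

// Hypothetical sample input covering the three cases named in the comment.
$string = 'a="foo" b=\'bar\' c = "baz"';

preg_match_all($pattern, $string, $matches);

// $matches[1] holds the text captured between the quotation marks.
print_r($matches[1]);
</code>

The printed pattern is noticeably shorter and easier to read than the source that produces it, which is the gap this section describes.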
 +
 +In some other languages, regular expressions are part of the language itself.
 +
 +
 +
 +
 +Another problem with regular expressions buried in plain old strings is that syntax highlighting becomes much more difficult.
 +
 +
 +<code php>
 +function getLinesFromFile($fileName) {
 +    if (!$fileHandle = fopen($fileName, 'r')) {
 +        return;
 +    }
 +    
 +    while (false !== $line = fgets($fileHandle)) {
 +        yield $line;
 +    }
 +    
 +    fclose($fileHandle);
 +}
 +
 +$lines = getLinesFromFile($fileName);
 +foreach ($lines as $line) {
 +    // do something with $line
 +}
 +</code>
 +
 +The code looks very similar to the array-based implementation. The main difference is that instead of pushing
 +values into an array the values are ''yield''ed.
 +
 +Generators work by passing control back and forth between the generator and the calling code:
 +
 +When you first call the generator function (''$lines = getLinesFromFile($fileName)'') the passed argument is bound,
 +but none of the code is actually executed. Instead the function directly returns a ''Generator'' object. That
 +''Generator'' object implements the ''Iterator'' interface and is what is eventually traversed by the ''foreach''
 +loop.
 +
 +Whenever the ''Iterator::next()'' method is called PHP resumes the execution of the generator function until it
 +hits a ''yield'' expression. The value of that ''yield'' expression is what ''Iterator::current()'' then returns.
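
The hand-off described above can be observed by driving a generator manually instead of through ''foreach''. This is a minimal sketch, assuming PHP 5.5 or later; ''countToTwo()'' is a hypothetical function introduced only for this illustration.

<code php>
<?php
// Hypothetical generator used only for this illustration.
function countToTwo() {
    echo "started\n";
    yield 1;
    echo "resumed\n";
    yield 2;
}

$gen = countToTwo();                     // prints nothing: the body has not run yet
$isIterator = $gen instanceof Iterator;  // true, Generator implements Iterator

$first = $gen->current();   // runs the body up to the first yield (prints "started")
$gen->next();               // resumes after the first yield (prints "resumed")
$second = $gen->current();  // value of the second yield
$gen->next();               // resumes again and runs to the end of the function
$finished = !$gen->valid(); // true once the function body has completed
</code>

Note that asking an unstarted generator for ''current()'' also advances it to its first ''yield'', so the first value is available without an explicit ''next()'' call.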
 +
 +Generator methods, together with the ''IteratorAggregate'' interface, can be used to easily implement traversable
 +classes too:
 +
 +<code php>
 +class Test implements IteratorAggregate {
 +    protected $data;
 +    
 +    public function __construct(array $data) {
 +        $this->data = $data;
 +    }
 +    
 +    public function getIterator() {
 +        foreach ($this->data as $key => $value) {
 +            yield $key => $value;
 +        }
 +        // or whatever other traversal logic the class has
 +    }
 +}
 +
 +$test = new Test(['foo' => 'bar', 'bar' => 'foo']);
 +foreach ($test as $k => $v) {
 +    echo $k, ' => ', $v, "\n";
 +}
 +</code>
 +
 +Generators can also be used the other way around, i.e. instead of producing values they can also consume them. When
 +used in this way they are often referred to as enhanced generators, reverse generators or coroutines.
 +
 +Coroutines are a rather advanced concept, so it is very hard to come up with short examples that are not too contrived.
 +For an introduction see an example [[https://gist.github.com/3111288|on how to parse streaming XML using coroutines]].
 +If you want to know more, I highly recommend checking out [[http://www.dabeaz.com/coroutines/Coroutines.pdf|a presentation
 +on this subject]].
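
As a deliberately small (and admittedly contrived) sketch of the consuming direction, the generator below receives values through ''Generator::send()'' rather than producing them. The function name ''collector()'' and the use of an ''ArrayObject'' as the sink are assumptions made for this illustration only.

<code php>
<?php
// Hypothetical consumer: appends every value sent to it to $sink.
function collector(ArrayObject $sink) {
    while (true) {
        $value = yield;        // suspends here; send() resumes it with a value
        $sink->append($value);
    }
}

$sink = new ArrayObject();
$consumer = collector($sink);

// The first send() lets the generator run to its first yield, then delivers the value.
$consumer->send('first');
$consumer->send('second');

print_r($sink->getArrayCopy());
</code>

Each ''send()'' call hands one value to the suspended ''yield'' expression and returns once the generator has suspended again at the next ''yield''.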
 +
 +
 +
 +
 +
  
 New built-in "re" BNF is roughly: New built-in "re" BNF is roughly:
Line 45: Line 141:
 ====== Discussions ====== ====== Discussions ======
   * https://news.ycombinator.com/item?id=7889923   * https://news.ycombinator.com/item?id=7889923
 +  * http://stackoverflow.com/questions/25310999/what-is-the-maximum-length-of-a-regular-expression
  
  
rfc/native_regular_expressions.1408065182.txt.gz · Last modified: 2017/09/22 13:28 (external edit)