rfc:native_regular_expressions

rfc:native_regular_expressions [2014/08/14 01:35] – Jotting my thoughts down. bishoprfc:native_regular_expressions [2017/09/22 13:28] (current) – external edit 127.0.0.1
Line 1: Line 1:
 ====== PHP RFC: Native Regular Expression ====== ====== PHP RFC: Native Regular Expression ======
-  * Version: 0.1 
   * Date: 2014-08-13   * Date: 2014-08-13
   * Author: Bishop Bettini, bishop@php.net   * Author: Bishop Bettini, bishop@php.net
   * Status: Draft   * Status: Draft
-  * First Published at: http://wiki.php.net/rfc/native_regular_expressions 
  
 +FIXME FIXME FIXME
 Jotting my ideas down here.  Move along. Maybe called "first_class_citizens_regex" or something. Jotting my ideas down here.  Move along. Maybe called "first_class_citizens_regex" or something.
 +Consider emulating structure of https://wiki.php.net/rfc/abstract_syntax_tree
 +https://wiki.php.net/rfc/generators
 +
 +
 +===== Introduction =====
 +
 +Regular expressions provide powerful string-matching capabilities and play a critical role in most software written in PHP.  For example, GitHub reports [[https://github.com/search?q=language%3Aphp+preg_filter+OR+preg_grep+OR+preg_match+OR+preg_match_all+OR+preg_replace+OR+preg_split&type=Code|10.5 million instances of the ''preg_*()'' family of functions]]((Compared with [[https://github.com/search?q=language%3Aphp+str_replace+OR+explode+OR+strpos&type=Code|16 million instances]] of the ''explode()'', ''strpos()'', and ''str_replace()'' related functions.))((This RFC does not consider the deprecated POSIX regular expressions to be an active part of PHP, and any implementation of this RFC will focus solely upon PCRE regular expressions.)).
 +
 +In the current engine, regular expressions are plain old strings:
 +
 +<code php>
 +while (preg_match('/^\s*[^#]/', $line[$i++])) {}
 +</code>
 +
 +The primary disadvantage of the string representation appears when the regular expression itself needs to contain a single quote, a double quote, or the delimiters bracketing the regular expression.  When that happens, the programmer has to make mental shifts to work around the string representation, which can make the regular expression harder to read and maintain.  Example:
 +
 +<code php>
 +// match foo in examples: ="foo"   and  ='foo'  and   = "foo"
 +preg_match_all('(=\s*['."'".'"]([^'."'".'"/]*)['."'".'"])x', $string, $matches);
 +</code>((PCRE wizards might rightly scold me for using that example, as it doesn't actually work as described for unbalanced or escaped quotation marks, and a realistic working example would be ''(["'])(?:\\?+.)*?\1'', which needs only one single quote escaped.  I agree, but I generated this example to illustrate a point.))
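
One way to sidestep the quote juggling in the current engine is a nowdoc, which performs no escape processing at all. The following is a minimal sketch using the same pattern as above; the sample input string is made up for illustration.

<code php>
// Nowdoc: the body is taken literally, so neither quote character needs escaping.
$pattern = <<<'RE'
(=\s*['"]([^'"/]*)['"])x
RE;

// Hypothetical input covering the three cases named in the comment above.
$string = '="foo" and =\'bar\' and = "baz"';

preg_match_all($pattern, $string, $matches);
print_r($matches[1]); // the values captured between the quotes
</code>

This eases the quoting, but the delimiters and modifiers still live inside a plain string, so the engine learns nothing about the regex until the ''preg_*()'' call runs.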
 +
 +In some other languages, regular expressions are part of the language itself.
 +
 +
 +
 +
 +Another problem with regular expressions buried in plain old strings is that syntax highlighting becomes much more difficult.
 +
 +
 +<code php>
 +function getLinesFromFile($fileName) {
 +    if (!$fileHandle = fopen($fileName, 'r')) {
 +        return;
 +    }
 +    
 +    while (false !== $line = fgets($fileHandle)) {
 +        yield $line;
 +    }
 +    
 +    fclose($fileHandle);
 +}
 +
 +$lines = getLinesFromFile($fileName);
 +foreach ($lines as $line) {
 +    // do something with $line
 +}
 +</code>
 +
 +The code looks very similar to the array-based implementation. The main difference is that instead of pushing
 +values into an array the values are ''yield''ed.
 +
 +Generators work by passing control back and forth between the generator and the calling code:
 +
 +When you first call the generator function (''$lines = getLinesFromFile($fileName)''), the passed argument is bound,
 +but none of the code is actually executed. Instead, the function directly returns a ''Generator'' object. That
 +''Generator'' object implements the ''Iterator'' interface and is what is eventually traversed by the ''foreach''
 +loop.
 +
 +Whenever the ''Iterator::next()'' method is called, PHP resumes the execution of the generator function until it
 +hits a ''yield'' expression. The value of that ''yield'' expression is what ''Iterator::current()'' then returns.
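
The hand-off described above can be observed by driving a generator manually instead of through ''foreach''. This is only a small sketch; the ''counter()'' function and the ''$log'' array are illustrative names that do not appear elsewhere in this page.

<code php>
// Illustrative generator that records when each part of its body runs.
function counter(array &$log) {
    $log[] = 'started';      // runs only once traversal begins
    yield 1;
    $log[] = 'resumed';      // runs when next() is called
    yield 2;
}

$log = [];
$gen = counter($log);        // argument is bound, body has not executed yet
var_dump($log === []);       // bool(true): nothing has run so far

var_dump($gen->current());   // int(1): runs up to the first yield
$gen->next();                // resumes until the second yield
var_dump($gen->current());   // int(2)
$gen->next();                // runs to the end of the function
var_dump($gen->valid());     // bool(false): the generator is finished
</code>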
 +
 +Generator methods, together with the ''IteratorAggregate'' interface, can be used to easily implement traversable
 +classes too:
 +
 +<code php>
 +class Test implements IteratorAggregate {
 +    protected $data;
 +    
 +    public function __construct(array $data) {
 +        $this->data = $data;
 +    }
 +    
 +    public function getIterator() {
 +        foreach ($this->data as $key => $value) {
 +            yield $key => $value;
 +        }
 +        // or whatever other traversal logic the class has
 +    }
 +}
 +
 +$test = new Test(['foo' => 'bar', 'bar' => 'foo']);
 +foreach ($test as $k => $v) {
 +    echo $k, ' => ', $v, "\n";
 +}
 +</code>
 +
 +Generators can also be used the other way around, i.e. instead of producing values they can also consume them. When
 +used in this way they are often referred to as enhanced generators, reverse generators or coroutines.
 +
 +Coroutines are a rather advanced concept, so it is very hard to come up with short examples that are not too contrived.
 +For an introduction see an example [[https://gist.github.com/3111288|on how to parse streaming XML using coroutines]].
 +If you want to know more, I highly recommend checking out [[http://www.dabeaz.com/coroutines/Coroutines.pdf|a presentation
 +on this subject]].
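
As a deliberately tiny illustration of the consuming direction, the sketch below uses ''Generator::send()'' to push lines into a generator that collects them. The ''lineCollector()'' name, the ''$sink'' array, and the use of ''null'' as a stop signal are all made up for this example.

<code php>
// Illustrative consumer: receives values through send() and stores them.
function lineCollector(array &$sink) {
    while (true) {
        $line = yield;            // receives the value passed to send()
        if ($line === null) {
            return;               // in this sketch, null acts as the stop signal
        }
        $sink[] = strtoupper($line);
    }
}

$sink = [];
$collector = lineCollector($sink);
$collector->send('foo');          // runs to the first yield, then delivers 'foo'
$collector->send('bar');
$collector->send(null);           // ends the generator

print_r($sink);
</code>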
 +
 +
 +
 +
 +
  
 New built-in "re" BNF is roughly: New built-in "re" BNF is roughly:
  
 +<code>
 syntax := re <fence-post> <regex-chars> <fence-post> <regex-modifiers> <semic> syntax := re <fence-post> <regex-chars> <fence-post> <regex-modifiers> <semic>
 fence-post := <any character> fence-post := <any character>
Line 15: Line 113:
 regex-modifiers := whatever is valid for modifiers regex-modifiers := whatever is valid for modifiers
 semic := ';' semic := ';'
 +</code>
  
 Example: Example:
 +<code>
 $regex = re /^\w+$/i $regex = re /^\w+$/i
 preg_match($regex, 'whatever'); preg_match($regex, 'whatever');
 ereg_match($regex, 'whatever'); // wouldn't work... maybe need $regex->test() ereg_match($regex, 'whatever'); // wouldn't work... maybe need $regex->test()
 +</code>
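
For contrast with the proposed literal, a malformed pattern in today's string form only surfaces when a ''preg_*()'' call runs. The sketch below shows a common run-time probe; ''isValidRegex()'' is a hypothetical helper, not an existing PHP function.

<code php>
// Hypothetical helper: returns true if PCRE can compile the pattern.
function isValidRegex($pattern) {
    // preg_match() returns false (and raises a warning) when compilation fails;
    // the @ operator silences that warning for this probe.
    return @preg_match($pattern, '') !== false;
}

var_dump(isValidRegex('/^\w+$/i'));    // bool(true)
var_dump(isValidRegex('/^\w+$/q'));    // bool(false): unknown modifier
var_dump(isValidRegex('/(unclosed/')); // bool(false): missing closing parenthesis
</code>

A native ''re'' literal could, in principle, move this kind of check to compile time.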
  
 +====== Motivation ======
 +  * Regular expressions are integral to modern information processing
 +  * Quoting them inside strings is hard: you have to deal with the quote character as well as the fence post (the delimiter)
 +  * Other languages have regular expressions built in
  
-Motivation
-* Regex are integral to modern info processing
-* Quoting them inside strings is hard: you have the quote character to deal with, plus the fence post
-* Other languages have re built in
 +====== Goals ======
 +  * Reduce effort of code authors to quote regex properly
 +  * Compile time verification of regex (benefit?)
 
-Goals:
-* Reduce effort of code authors to quote regex properly
-* Compile time verification of regex (benefit?)
 +====== Non-goals ======
 +  * Adding a new regex class, with methods like $re->test('whatever')
 
-Non-goals
-* Adding a new regex class, with methods like $re->test('whatever')
 +====== Similar implementations ======
 +  * Javascript: http://mrrena.blogspot.com/2012/07/regular-expressions-in-javascript.html
 +  * Python: https://docs.python.org/3/howto/regex.html
 +  * Comparison: http://hyperpolyglot.org/scripting
 
-Similar implementations:
-* Javascript: http://mrrena.blogspot.com/2012/07/regular-expressions-in-javascript.html
-* Python: https://docs.python.org/3/howto/regex.html
-* Comparison: http://hyperpolyglot.org/scripting
 +====== Discussions ======
 +  * https://news.ycombinator.com/item?id=7889923
 +  * http://stackoverflow.com/questions/25310999/what-is-the-maximum-length-of-a-regular-expression
 
-Discussions:
-* https://news.ycombinator.com/item?id=7889923
  
 +----
  
 This is a suggested template for PHP Request for Comments (RFCs). Change this template to suit your RFC.  Not all RFCs need to be tightly specified.  Not all RFCs need all the sections below. This is a suggested template for PHP Request for Comments (RFCs). Change this template to suit your RFC.  Not all RFCs need to be tightly specified.  Not all RFCs need all the sections below.
rfc/native_regular_expressions.1407980114.txt.gz · Last modified: 2017/09/22 13:28 (external edit)