rfc:is_literal

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revisionPrevious revision
Next revision
Previous revision
Next revisionBoth sides next revision
rfc:is_literal [2020/12/23 19:53] – New examples, with a focus on escaping craigfrancisrfc:is_literal [2021/05/03 19:38] – Don't write off the concat version (inc more perf stats) craigfrancis
Line 1: Line 1:
 ====== PHP RFC: Is Literal Check ====== ====== PHP RFC: Is Literal Check ======
  
-  * Version: 0.2+  * Version: 0.6
   * Date: 2020-03-21   * Date: 2020-03-21
-  * Updated: 2020-12-22+  * Updated: 2021-04-30
   * Author: Craig Francis, craig#at#craigfrancis.co.uk   * Author: Craig Francis, craig#at#craigfrancis.co.uk
   * Status: Draft   * Status: Draft
Line 11: Line 11:
 ===== Introduction ===== ===== Introduction =====
  
-Add an //is_literal()// functionso developers/frameworks can check if given variable is **safe**.+A new function, //is_literal(string $string)//, to identify variables that have been created from programmer defined string.
  
-As in, at runtime, being able to check if a variable has been created by literals, defined within a PHP script, by a trusted developer.+This takes the concept of "taint checking" and makes it simpler and stricter.
  
-This simple check can be used to warn or completely block SQL Injection, Command Line Injection, and many cases of HTML Injection (aka XSS).+It does not allow a variable to be marked as untainted, and it does not allow escaping (important).
  
-===== The Problem =====+For example, take a database library that supports parametrised queries at the driver level, today a programmer could use either of these:
  
-Escaping strings for SQLHTML, Commands, etc is **very** error prone.+<code php> 
 +$db->query('SELECT * FROM users WHERE id = ?'[$_GET['id']]);
  
-The vast majority of programmers should never do this (mistakes will be made).+$db->query('SELECT * FROM users WHERE id = ' $_GET['id']); // INSECURE 
 +</code>
  
-Unsafe values (often user supplied) *must* be kept separate (e.g. parameterised SQL), or be processed by something that understands the context (e.g. HTML Templating Engine).+If the library only accepted a literal SQL string (written by the programmer), and simply rejected the second example (not written as literal), the library can provide an "inherently safe API".
  
-This is primarily for security reasonsbut it also causes data to be damaged (e.gASCII/UTF-8 issues).+This definition of an "inherently safe API" comes from Christoph Kernwho did a talk in 2016 about [[https://www.youtube.com/watch?v=ccfEu-Jj0as|Preventing Security Bugs through Software Design]] (also at [[https://www.usenix.org/conference/usenixsecurity15/symposium-program/presentation/kern|USENIX Security 2015]]), which covers how this is used at Google. The idea is that we "Don't Blame the Developer, Blame the API"; where we need to put the burden on libraries (written once, used by many) to ensure that it's impossible for the developer to make these mistakes.
  
-Take these mistakes:+By adding a way for libraries to check if the strings they receive came from the developer (from trusted PHP source code), it allows the library to check they are being used in a safe way.
  
-  echo "<img src=" . htmlentities($url) . " alt='' />";+===== Why =====
  
-Flawed because the attribute value is not quoted, e.g. //$url = '/ onerror=alert(1)'//+The [[https://owasp.org/www-project-top-ten/|OWASP Top 10]] lists common vulnerabilities sorted by prevalence, exploitability, detectability, and impact. Each ranked out of 3.
  
-  echo "<img src='" . htmlentities($url. "' alt='' />";+**A1: Injection** - common prevalence (2), easy for attackers to detect/exploit (3), severe impact (3).
  
-Flawed because //htmlentities()// does not encode single quotes by defaulte.g. //$url = "/' onerror='alert(1)"//+**A7: XSS** - widespread prevalence (3), easy for attackers to detect/exploit (3), moderate impact (2).
  
-  echo '<a href="' . htmlentities($url. '">Link</a>';+And these two have always been listed: 2003 (A6/A4), 2004 (A6/A4), 2007 (A2/A1), 2010 (A1/A2), 2013 (A1/A3), 2017 (A1/A7).
  
-Flawed because a link can include JavaScripte.g. //$url = 'javascript:alert(1)'//+It'because these mistakes are very easy to makeand hard to identify - is_literal() directly addresses this problem.
  
-  <script> +===== Examples =====
-    var url "<?addslashes($url) ?>"; +
-  </script>+
  
-Flawed because //addslashes()// is not HTML context aware, e.g. //$url = '</script><script>alert(1)</script>'//+The [[https://www.doctrine-project.org/projects/doctrine-orm/en/current/reference/query-builder.html#high-level-api-methods|Doctrine Query Builder]] allows a custom WHERE clause to be provided as a string. This is intended for use with literals and placeholders, but does not protect against this simple mistake:
  
-  echo '<a href="/path/?name=. htmlentities($name) . '">Link</a>';+<code php> 
 +// INSECURE 
 +$qb->select('u') 
 +   ->from('User', 'u') 
 +   ->where('u.id = . $_GET['id']) 
 +</code>
  
-Flawed because //urlencode()// has not been used, e.g. //$name = 'A&B'//+The definition of the //where()// method could check with //is_literal()// and throw an exception, advising the programmer to replace it with a safer use of placeholders:
  
-  <p><?htmlentities($url?></p>+<code php> 
 +$qb->select('u'
 +   ->from('User', 'u'
 +   ->where('u.id :identifier'
 +   ->setParameter('identifier', $_GET['id'])
 +</code>
  
-Flawed because the encoding is not guaranteed to be UTF-8 (or ISO-8859-1 before PHP 5.4)so the value could be corrupted.+Similarly, Twig allows [[https://twig.symfony.com/doc/2.x/recipes.html#loading-a-template-from-a-string|loading a template from a string]]which could allow accidentally skipping the default escaping functionality:
  
-Also flawed because some browsers (e.gIE 11), if the charset isn't defined (header or meta tag), could guess the output as UTF-7, e.g. //$url = '+ADw-script+AD4-alert(1)+ADw-+AC8-script+AD4-'//+<code php> 
 +// INSECURE 
 +echo $twig->createTemplate('<p>Hi ' $_GET['name'. '</p>')->render()
 +</code>
  
-  example.html: +If //createTemplate()// checked with //is_literal()//the programmer could be advised to write this instead:
-      <img src={{ url }} alt='' /+
-   +
-  $loader = new \Twig\Loader\FilesystemLoader('./templates/'); +
-  $twig = new \Twig\Environment($loader, ['autoescape' => 'name'])+
-   +
-  echo $twig->render('example.html'['url' => $url]);+
  
-Flawed because Twig is not context aware (in this case, an unquoted HTML attribute), e.g. //$url = '/ onerror=alert(1)'//+<code php> 
 +echo $twig->createTemplate('<p>Hi {{ name }}</p>')->render(['name' => $_GET['name']]); 
 +</code>
  
-  $sql 'SELECT 1 FROM user WHERE id=' . $mysqli->escape_string($id);+===== Failed Solutions =====
  
-Flawed because the value has not been quoted, e.g. //$id 'id', or '1 OR 1=1'//+==== Education ====
  
-  $sql = 'SELECT 1 FROM user WHERE id="' . $mysqli->escape_string($id) . '"';+Developer training has not worked, it simply does not scale (people start programming every day), and learning about every single issue is difficult.
  
-Flawed if 'sql_mode' includes //NO_BACKSLASH_ESCAPES//, e.g. //$id = '2" or "1"="1'//+Keeping in mind that programmers will frequently do just enough to complete their task (busy), where they often copy/paste a solution to their problem they find online (risky)modify it for their needs (risky), then move on.
  
-  $sql = 'INSERT INTO user (name) VALUES ("' . $mysqli->escape_string($name) . '")';+We cannot keep saying they 'need to be careful', and relying on them to never make a mistake.
  
-Flawed if 'SET NAMES latin1' has been used, and escape_string() uses 'utf8'.+==== Escaping ====
  
-  $parameters = "-f$email"; +Escaping is hard, and error prone.
-   +
-  // $parameters = '-f' escapeshellarg($email); +
-   +
-  mail('a@example.com', 'Subject', 'Message', NULL, $parameters);+
  
-Flawed because it's not possible to safely escape values in //$additional_parameters// for //mail()//, e.g. //$email = 'b@example.com -X/www/example.php'//+We have a list of common [[https://github.com/craigfrancis/php-is-literal-rfc/blob/main/justification.md#common-mistakes|escaping mistakes]].
  
-===== Previous Solutions =====+Developers should use parameterised queries (e.g. SQL), or a well tested library that knows how to escape values based on their context (e.g. HTML).
  
-[[https://github.com/laruence/taint|Taint extension]] by Xinchen Hui, but this approach explicitly allows escaping, which doesn't address the issues listed above.+==== Taint Checking ====
  
-[[https://wiki.php.net/rfc/sql_injection_protection|Automatic SQL Injection Protection]] by Matt Tait, where it was noted:+Some languages implement a "taint flag" which tracks whether values are considered "safe".
  
-  * "unfiltered input can affect way more than only SQL" ([[https://news-web.php.net/php.internals/87355|Pierre Joye]]); +There is a [[https://github.com/laruence/taint|Taint extension for PHP]] by Xinchen Hui, and [[https://wiki.php.net/rfc/taint|a previous RFC proposing it be added to the language]] by Wietse Venema.
-  * this amount of work isn't ideal for "just for one use case" ([[https://news-web.php.net/php.internals/87647|Julien Pauli]]); +
-  * It would have effected every SQL functionsuch as //mysqli_query()//, //$pdo->query()//, //odbc_exec()//, etc (concerns raised by [[https://news-web.php.net/php.internals/87436|Lester Caine]] and [[https://news-web.php.net/php.internals/87650|Anthony Ferrara]]); +
-  * Each of those functions would need bypass for cases where unsafe SQL was intentionally being used (e.g. phpMyAdmin taking SQL from POST data) because some applications intentionally "pass raw, user submitted, SQL" (Ronald Chmara [[https://news-web.php.net/php.internals/87406|1]]/[[https://news-web.php.net/php.internals/87446|2]]).+
  
-I also agree that "SQL injection is almost solved problem [by using] prepared statements" ([[https://news-web.php.net/php.internals/87400|Scott Arciszewski]]), but we still need something to identify mistakes.+These solutions rely on the assumption that the output of an escaping function is safe for particular contextThis sounds reasonable in theory, but the operation of escaping functions, and the context for which their output is safe, are very hard to define. This leads to a feature that is both complex and unreliable.
  
-===== Related JavaScript Implementation =====+This proposal avoids the complexity by addressing a different part of the problem: separating inputs supplied by the programmer, from inputs supplied by the user.
  
-This RFC is taking some ideas from TC39, where a similar idea is being discussed for JavaScript, to support the introduction of Trusted Types.+==== Static Analysis ====
  
-https://github.com/tc39/proposal-array-is-template-object\\ +While I agree with [[https://news-web.php.net/php.internals/109192|Tyson Andre]], it is highly recommended to use Static Analysis.
-https://github.com/mikewest/tc39-proposal-literals+
  
-They are looking at "Distinguishing strings from a trusted developerfrom strings that may be attacker controlled".+But they nearly always focus on other issues (type checkingbasic logic flaws, code formatting, etc).
  
-===== Solution =====+Those that attempt to address injection vulnerabilities, do so via Taint Checking (see above), and are [[https://github.com/vimeo/psalm/commit/2122e4a1756dac68a83ec3f5abfbc60331630781|often incomplete]].
  
-Literals are safe valuesdefined within the PHP scriptsfor example:+For a quick examplepsalm, even in its strictest errorLevel (1), and/or running //--taint-analysis//, will not notice the missing quote marks in this SQLand will incorrectly assume this is perfectly safe:
  
-  $a = 'Example'; +<code php> 
-  is_literal($a); // true +$db new mysqli('...');
-   +
-  $a 'Example ' . $a . ', ' . 5; +
-  is_literal($a); // true +
-   +
-  $a = 'Example ' . $_GET['id']; +
-  is_literal($a); // false +
-   +
-  $a = 'Example ' time(); +
-  is_literal($a); // false +
-   +
-  $a = sprintf('LIMIT %d', 3); +
-  is_literal($a); // false +
-   +
-  $c = count($ids); +
-  $a = 'WHERE id IN (' implode(',', array_fill(0, $c, '?')) . ')'; +
-  is_literal($a); // true, the odd one that involves functions. +
-   +
-  $limit = 10; +
-  $a = 'LIMIT ' . ($limit + 1); +
-  is_literal($a); // false, but might need some discussion.+
  
-This uses a similar definition of [[https://wiki.php.net/rfc/sql_injection_protection#safeconst|SafeConst]from Matt Tait's RFC, but it does not need to accept Integer or FloatingPoint variables as safe (unless it makes the implementation easier), nor should this proposal effect any existing functions.+$id = (string) ($_GET['id'?? 'id'); // Keep the type checker happy.
  
-Thanks to [[https://news-web.php.net/php.internals/87396|Xinchen Hui]], we know the PHP5 Taint extension was complex, but "with PHP7's new zend_string, and string flags, the implementation will become easier".+$db->prepare('SELECT * FROM users WHERE id = ' $db->real_escape_string($id)); 
 +</code>
  
-And thanks to [[https://chat.stackoverflow.com/transcript/message/48927813#48927813|Mark R]], it might be possible to use the fact that "interned strings in PHP have flag"which is there because these "can't be freed".+When psalm comes to taint checking the usage of library (like Doctrine), it assumes all methods are safe, because none of them note the sinks (and even if they did, you're back to escaping being an issue).
  
-Commands can be checked to ensure they are a "programmer supplied constant/static/validated string"and all other unsafe variables are provided separately (as noted by [[https://news-web.php.net/php.internals/87725|Yasuo Ohgaki]]).+But the biggest problem is that Static Analysis is simply not used by most developersespecially those who are new to programming (usage tends to be higher by those writing well tested libraries).
  
-This approach allows all systems/frameworks to decide if they want to **block**, **educate** (via a notice), or **ignore** these issues (to avoid the "don't nanny" concern raised by [[https://news-web.php.net/php.internals/87383|Lester Caine]]).+===== Proposal =====
  
-Unlike the Taint extension, there must **not** be an equivalent //untaint()// function, or support any kind of escaping.+This RFC proposes adding four functions:
  
-==== SolutionSQL Injection ====+  * //is_literal(string $string)bool// to check if a variable represents a value written into the source code or not. 
 +  * //literal_implode(string $glue, array $pieces): string// - implode an array of literals, with a literal. 
 +  * //literal_combine(string $piece, string ...$pieces): string// - allow concatenating literal strings. 
 +  * //literal_sprintf(string $format, string ...$values): string// - a version of sprintf that uses literals.
  
-Database abstractions (e.g. ORMs) will be able to ensure they are provided with strings that are safe.+A literal is defined as a value (string) which has been written by the programmerThe value may be passed between functions, as long as it is not modified in any way.
  
-[[https://www.doctrine-project.org/projects/doctrine-orm/en/2.7/reference/query-builder.html#high-level-api-methods|Doctrine]] could use this to ensure //->where($predicates)// is a literal:+<code php> 
 +is_literal('Example')// true
  
-  $users $queryBuilder +$= 'Hello'; 
-    ->select('u') +$b = 'World';
-    ->from('User', 'u'+
-    ->where('u.id = ' . $_GET['id']) +
-    ->getQuery() +
-    ->getResult(); +
-   +
-  // example.php?id=u.id+
  
-This mistake can be easily identified by:+is_literal($a); // true 
 +is_literal($a . $b); // TBC, details below.
  
-  public function where($predicates+$c = literal_combine($a, $b); 
-  { +is_literal($c); // true
-      if (function_exists('is_literal') && !is_literal($predicates)) { +
-          throw new Exception('->where() can only accept a literal'); +
-      } +
-      ... +
-  }+
  
-[[https://redbeanphp.com/index.php?p=/finding|RedBean]] could check //$sql// is a literal:+is_literal($_GET['id']); // false 
 +is_literal('WHERE id = ' intval($_GET['id'])); // false 
 +is_literal(rand(0, 10)); // false 
 +is_literal(sprintf('LIMIT %d', 3)); /false 
 +</code>
  
-  $users = R::find('user', 'id = ' $_GET['id']);+There is no way to manually mark a string as a literal (i.e. no equivalent to //untaint()//); as soon as the value has been manipulated in any way, it is no longer marked as a literal.
  
-[[http://propelorm.org/Propel/reference/model-criteria.html#relational-api|PropelORM]] could check //$clause// is a literal:+===== Previous Work =====
  
-  $users = UserQuery::create()->where('id = ' . $_GET['id'])->find();+Google uses "compile time constants" in Go, which isn't as good as a run time solution (e.g. the //WHERE IN// issue), but it works, and is used by [[https://blogtitle.github.io/go-safe-html/|go-safe-html]] and [[https://github.com/google/go-safeweb/tree/master/safesql|go-safesql]].
  
-The //is_literal()// function could also be used internally by ORM developers, so they can be sure they have created their SQL strings out of literals. This would avoid mistakes such as the ORDER BY issues in the Zend framework [[https://framework.zend.com/security/advisory/ZF2014-04|1]]/[[https://framework.zend.com/security/advisory/ZF2016-03|2]].+Google also uses [[https://errorprone.info/|Error Prone]] in Java to augment the compiler's type analysis, where [[https://errorprone.info/bugpattern/CompileTimeConstant|@CompileTimeConstant]] ensures method parameters can only use "compile-time constant expressions" (this isn't a complete solution either).
  
-==== SolutionSQL InjectionBasic ====+Perl has a [[https://perldoc.perl.org/perlsec#Taint-mode|Taint Mode]]via the -T flag, where all input is marked as "tainted", and cannot be used by some methods (like commands that modify files), unless you use a regular expression to match and return known-good values (where regular expressions are easy to get wrong).
  
-A simple example:+[[https://github.com/tc39/proposal-array-is-template-object|JavaScript might get isTemplateObject]] to "Distinguishing strings from a trusted developer from strings that may be attacker controlled" (intended to be [[https://github.com/mikewest/tc39-proposal-literals|used with Trusted Types]]).
  
-  $sql = 'SELECT * FROM table WHERE id = ?'; +As noted abovethere is the [[https://github.com/laruence/taint|Taint extension for PHP]by Xinchen Hui.
-   +
-  $result = $db->exec($sql, [$id]);+
  
-Checked in the framework by:+And there is the [[https://wiki.php.net/rfc/sql_injection_protection|Automatic SQL Injection Protection]] RFC by Matt Tait, where this RFC uses a similar concept of the [[https://wiki.php.net/rfc/sql_injection_protection#safeconst|SafeConst]]. When Matt's RFC was being discussed, it was noted:
  
-  class db { +  * "unfiltered input can affect way more than only SQL" ([[https://news-web.php.net/php.internals/87355|Pierre Joye]]); 
-   +  * this amount of work isn't ideal for "just for one use case" ([[https://news-web.php.net/php.internals/87647|Julien Pauli]]); 
-    public function exec($sql, $parameters = []) { +  * It would have effected every SQL function, such as //mysqli_query()////$pdo->query()//, //odbc_exec()//, etc (concerns raised by [[https://news-web.php.net/php.internals/87436|Lester Caine]] and [[https://news-web.php.net/php.internals/87650|Anthony Ferrara]])
-   +  * Each of those functions would need a bypass for cases where unsafe SQL was intentionally being used (e.g. phpMyAdmin taking SQL from POST data) because some applications intentionally "pass raw, user submitted, SQL" (Ronald Chmara [[https://news-web.php.net/php.internals/87406|1]]/[[https://news-web.php.net/php.internals/87446|2]]). 
-      if (!is_literal($sql)) { + 
-        throw new Exception('SQL must be a literal.');+I also agree that "SQL injection is almost a solved problem [by using] prepared statements" ([[https://news-web.php.net/php.internals/87400|Scott Arciszewski]]), and this is where //is_literal()// can be used to check that no mistakes are made when using prepared statements. 
 + 
 +===== Usage ===== 
 + 
 +By libraries: 
 + 
 +<code php> 
 +class db 
 +  protected $level = 2; // Probably should default to 1 at first. 
 +  function literal_check($var) { 
 +    if (function_exists('is_literal') && !is_literal($var)) { 
 +      if ($this->level === 0
 +        // Programmer aware, and is choosing to bypass this check. 
 +      } else if ($this->level === 1) { 
 +        trigger_error('Non-literal detected!', E_USER_WARNING); 
 +      } else 
 +        throw new Exception('Non-literal detected!');
       }       }
-   
-      $statement = $this->pdo->prepare($sql); 
-      $statement->execute($parameters); 
-      return $statement->fetchAll(); 
-   
     }     }
-   
   }   }
 +  function unsafe_disable_injection_protection() {
 +    $this->level = 0;
 +  }
 +  function where($sql, $parameters = []) {
 +    $this->literal_check($sql);
 +    // ...
 +  }
 +}
  
-This also works with string concatenation:+$db->where('id = ?'); // OK 
 +$db->where('id = ' . $_GET['id']); // Exception thrown 
 +</code>
  
-  define('TABLE''example'); +Table and Fields in SQLwhich cannot use parameters; for example //ORDER BY//:
-   +
-  $sql = 'SELECT * FROM ' . TABLE . ' WHERE id = ?'; +
-   +
-    is_literal($sql); // Returns true +
-   +
-  $sql .= ' AND id = ' . $mysqli->escape_string($_GET['id']); +
-   +
-    is_literal($sql); // Returns false+
  
-==== Solution: SQL InjectionORDER BY ====+<code php> 
 +$order_fields 
 +    'name', 
 +    'created', 
 +    'admin', 
 +  ];
  
-To ensure //ORDER BY// can be set via the userbut only use acceptable values:+$order_id = array_search(($_GET['sort'] ?? NULL)$order_fields);
  
-  $order_fields = [ +$sql literal_combine(' ORDER BY '$order_fields[$order_id]); 
-      'name', +</code>
-      'created', +
-      'admin', +
-    ]; +
-   +
-  $order_id array_search(($_GET['sort'] ?? NULL), $order_fields); +
-   +
-  $sql = ' ORDER BY ' $order_fields[$order_id];+
  
-==== Solution: SQL Injection, WHERE IN ====+Undefined number of parameters; for example //WHERE IN//:
  
-Most SQL strings can be a simple concatenations of literal values, but //WHERE IN (?,?,?)// needs to use a variable number of literal placeholders.+<code php> 
 +function where_in_sql($count) { // Should check for 0 
 +  $sql = []; 
 +  for ($k = 0; $k < $count; $k++) { 
 +    $sql[] = '?'; 
 +  } 
 +  return literal_implode(',', $sql); 
 +
 +$sql = literal_combine('WHERE id IN ('where_in_sql(count($ids))')'); 
 +</code>
  
-There needs to be a special case for //array_fill()//+//implode()//, so the //is_literal// state can be preserved, allowing us to create the safe literal string '?,?,?':+===== Considerations =====
  
-  $in_sql implode(',', array_fill(0, count($ids), '?')); +==== Naming ====
-   +
-  $sql 'SELECT * FROM table WHERE id IN (' . $in_sql . ')';+
  
-==== SolutionCLI Injection ====+Literal string is the standard name for strings in source code. See [[https://www.google.com/search?q=what+is+literal+string+in+php|Google]].
  
-Rather than using functions such as:+> A string literal is the notation for representing a string value within the text of a computer program. In PHP, strings can be created with single quotes, double quotes or using the heredoc or the nowdoc syntax...
  
-  * //exec()// +Alternatives suggestions have included //is_from_literal()// from [[https://news-web.php.net/php.internals/109197|Jakob Givoni]]. I think //is_safe_string()// might be asking for trouble. Other terms have included "compile time constants" and "code string".
-  * //shell_exec()// +
-  * //system()// +
-  * //passthru()//+
  
-Frameworks (or PHP) could introduce something similar to //pcntl_exec()//, where arguments are provided separately.+==== Supporting Int/Float/Boolean values====
  
-Or, take a safe literal for the command, and use parameters for the arguments (like SQL does):+When converting to stringthey aren't guaranteed (and often don't) have the exact same value they have in source code.
  
-  $output = parameterised_exec('grep ? /path/to/file | wc -l', [ +For example, //TRUE// and //true// when cast to string give "1".
-      'example', +
-    ]);+
  
-Rough implementation:+It's also a very low value feature, where there might not be space for a flag to be added.
  
-  function parameterised_exec($cmd, $args []) { +==== Supporting Concatenation ====
-   +
-    if (!is_literal($cmd)) { +
-      throw new Exception('The first argument must be a literal'); +
-    } +
-   +
-    $offset 0; +
-    $k 0; +
-    while (($pos strpos($cmd, '?', $offset)) !== false) { +
-      if (!isset($args[$k])) { +
-        throw new Exception('Missing parameter "' . ($k + 1) . '"'); +
-        exit(); +
-      } +
-      $arg escapeshellarg($args[$k]); +
-      $cmd substr($cmd, 0, $pos) . $arg . substr($cmd, ($pos + 1)); +
-      $offset = ($pos + strlen($arg)); +
-      $k++; +
-    } +
-    if (isset($args[$k])) { +
-      throw new Exception('Unused parameter "' . ($k + 1) . '"'); +
-      exit(); +
-    } +
-   +
-    return exec($cmd); +
-   +
-  }+
  
-==== Solution: HTML Injection ====+This is the big question.
  
-Template engines should receive variables separately from the raw HTML.+Máté Kocsis has done some [[https://github.com/craigfrancis/php-is-literal-rfc/blob/main/tests/results/with-concat/kocsismate.pdf|primary testing on supporting string concat]], and found a 0.124% performance hit for the Laravel Demo app, 0.161% for Symfony, and a more severe -3.719% when running this [[https://github.com/kocsismate/php-version-benchmarks/blob/main/app/zend/concat.php#L25|concat test]].
  
-Often the engine will get the HTML from static files (safe):+In my own [[https://github.com/craigfrancis/php-is-literal-rfc/tree/main/tests|simplistic testing]], where I included a basic version that did not support string concat. The [[https://github.com/craigfrancis/php-is-literal-rfc/blob/main/tests/results/with-concat/local.pdf|results]] found:
  
-  $html = file_get_contents('/path/to/template.html');+    Laravel Demo App: +0.30% with, vs +0.18% without concat. 
 +    Symfony Demo App: +0.06% with, vs +0.06% without concat. 
 +    My Concat Test:   +4.36% with, vs +2.23% without concat.
  
-But small snippets of HTML are often easier to define as literal within the PHP script:+In my basic test, I used RAM Disk, and disabled the processors Turbo Boost. With the Demo Apps, I used ///sapi/cgi/php-cgi "-T10"// to get the timings (so would include the compilation), and ///sapi/cli/php// for [[https://github.com/craigfrancis/php-is-literal-rfc/blob/main/tests/001.phpt|My Concat Test]].
  
-  $template_html = ' +There is still a small impact without concat because the //concat_function()// in "zend_operators.cuses //zend_string_extend()// (where the literal flag needs to be removed). And in "zend_vm_def.h", it has a similar versionand supports a quick concat with an empty string, which doesn't create a new variable (x2) and would need it's flag removed as well.
-    <p>Hello <span id="username"></span></p> +
-    <p><a>Website</a></p>';+
  
-Where the variables are supplied separately, in this example I'using XPath:+Technically string concat isn't needed for most libraries, like an ORM or Query Builder, where their methods nearly always take a small literal string. But it would make adoption of is_literal() easier for existing projects that are currently using string concat for their SQL, HTML Templates, etc.
  
-  $values = [ +And supporting runtime concat would make the literal check easier to understandas it would be consistent (e.g. compiler vs runtime concatwhere the compiler can concat two strings to create single literal that has the literal flag set).
-      '//span[@id="username"]' => [ +
-          NULL      => 'Name'// The textContent +
-          'class'   => 'admin', +
-          'data-id' => '123', +
-        ], +
-      '//a' => [ +
-          'href' => 'https://example.com', +
-        ], +
-    ]; +
-   +
-  echo template_parse($template_html, $values);+
  
-The templating engine can then accept and apply the supplied variables for the relevant context.+The non-concat version would use //literal_combine()// or //literal_implode()// as special functions to avoid most of the work during runtime contact. Where Dan Ackroyd notes that these functions would make it easier to identify exactly where mistakes are made, rather than it being picked up at the end of a potentially long script, after multiple string concatenations, e.g.
  
-As a simple example, this can be done with:+<code php> 
 +$sortOrder = 'ASC';
  
-  function template_parse($html, $values) { +// 300 lines of code, or multiple function calls 
-   + 
-    if (!is_literal($html)) { +$sql .= ' ORDER BY name ' . $sortOrder
-      throw new Exception('Invalid Template HTML.')+ 
-    } +// 300 lines of code, or multiple function calls 
-   + 
-    $dom = new DomDocument(); +$db->query($sql); 
-    $dom->loadHTML('<?xml encoding="UTF-8">'$html); +</code
-   + 
-    $xpath = new DOMXPath($dom); +If a developer changed the literal //'ASC'// to //$_GET['order']//, the error raised by //$db->query()// would not be clear where the mistake was made. Whereas using //literal_combine()// highlights exactly where the issue happened: 
-   + 
-    foreach ($values as $query =$attributes) { +<code php> 
-   +$sql literal_combine($sql, ORDER BY name ', $sortOrder); 
-      if (!is_literal($query)) { +</code> 
-        throw new Exception('Invalid Template XPath.'); + 
-      } +==== Performance ==== 
-   + 
-      foreach ($xpath->query($queryas $element{ +TBC 
-        foreach ($attributes as $attribute => $value) { + 
-   +See the section above
-          if (!is_literal($attribute)) { + 
-            throw new Exception('Invalid Template Attribute.'); +==== Values from INI/JSON/YAML ==== 
-          } + 
-   +As noted by [[https://news-web.php.net/php.internals/87667|Dennis Birkholz]], Systems/Frameworks that define certain variables (e.g. table name prefixeswithout the use of a literal (e.g. ini/json/yaml files)might need to make some changes to use this feature (depending on where they use the is_literal check). 
-          if ($attribute) { + 
-            $safe false; +==== Existing String Functions ===
-            if ($attribute == 'href') { + 
-              if (preg_match('/^https?:\/\//', $value)) { +Trying to determine if the //is_literal// flag should be passed through functions like //str_repeat()//, or //substr()// etc is difficult. Having a security feature be difficult to reason aboutgives a much higher chance of mistakes. 
-                $safe = true; // Not "javascript:..." + 
-              } +For any use-case where dynamic strings are required, it would be better to build those strings with an appropriate query builder, or by using //literal_combine()/////literal_implode()//.
-            } else if ($attribute == 'class') { +
-              if (in_array($value, ['admin', 'important'])) { +
-                $safe true; // Only allow specific classes? +
-              } +
-            } else if (preg_match('/^data-[a-z]+$/', $attribute)) { +
-              if (preg_match('/^[a-z0-9 ]+$/i'$value)) { +
-                $safe true; +
-              } +
-            } +
-            if ($safe+
-              $element->setAttribute($attribute$value); +
-            } +
-          } else { +
-            $element->textContent = $value; +
-          } +
-   +
-        } +
-      } +
-   +
-    } +
-   +
-    $html = ''; +
-    $body = $dom->documentElement->firstChild; +
-    if ($body->hasChildNodes()) { +
-      foreach ($body->childNodes as $node+
-        $html .= $dom->saveXML($node); +
-      } +
-    } +
-   +
-    return $html; +
-   +
-  }+
  
 ===== Backward Incompatible Changes ===== ===== Backward Incompatible Changes =====
  
-None+No known BC breaks, except for code-bases that already contain userland functions //is_literal()//, //literal_implode()// or //literal_combine()//.
  
 ===== Proposed PHP Version(s) ===== ===== Proposed PHP Version(s) =====
  
-PHP 8.1?+PHP 8.1
  
 ===== RFC Impact ===== ===== RFC Impact =====
Line 399: Line 322:
 ==== To SAPIs ==== ==== To SAPIs ====
  
-Not sure+None known
  
 ==== To Existing Extensions ==== ==== To Existing Extensions ====
Line 411: Line 334:
 ===== Open Issues ===== ===== Open Issues =====
  
-On [[https://github.com/craigfrancis/php-is-literal-rfc/issues|GitHub]]:+None
  
-  - Would this cause performance issues? Presumably not as bad a type checking. +===== Unaffected PHP Functionality =====
-  - Can //array_fill()//+//implode()// pass though the "is_literal" flag for the "WHERE IN" case? +
-  - Should the function be named //is_from_literal()//? (suggestion from [[https://news-web.php.net/php.internals/109197|Jakob Givoni]]) +
-  - Systems/Frameworks that define certain variables (e.g. table name prefixes) without the use of a literal (e.g. ini/json/yaml files), so they might need to make some changes to use this check, as originally noted by [[https://news-web.php.net/php.internals/87667|Dennis Birkholz]].+
  
-===== Alternatives =====+None known
  
-  - The current Taint Extension (notes above) +===== Future Scope =====
- - Using static analysis (not at runtime), for example [[https://psalm.dev/|psalm]] (thanks [[https://news-web.php.net/php.internals/109192|Tyson Andre]]). But I can't find any which do these checks by default (if they even try), and we can't expect all programmers to use static analysis (especially those who have just stated).+
  
-===== Unaffected PHP Functionality =====+As noted by MarkR, the biggest benefit will come when it can be used by PDO and similar functions (//mysqli_query//, //preg_match//, //exec//, etc). But the basic idea can be used immediately by frameworks and general abstraction libraries, and they can give feedback for future work.
  
-Not sure+**Phase 2** could introduce a way for programmers to specify certain PHP function/method arguments can only accept literals, and/or specific value-objects their project trusts (this idea comes from [[https://web.dev/trusted-types/|Trusted Types]] in JavaScript).
  
-===== Future Scope =====+For example, a project could require the second argument for //pg_query()// only accept literals or their //query_builder// object (which provides a //__toString// method); and that any output (print, echo, readfile, etc) must use the //html_output// object that's returned by their trusted HTML Templating system (using //ob_start()// might be useful here).
  
-Certain functions (//mysqli_query////preg_match//etccould use this information to generate a error/warning/notice.+**Phase 3** could set a default of 'only literals' for all of the relevant PHP function argumentsso developers are given a warningand later prevented (via an exception), when they provide an unsafe value to those functions (they could still specify that unsafe values are allowed, e.g. phpMyAdmin).
  
-PHP could also have mode where output (e.g. //echo '<html>'//is blockedand this can be bypassed (maybe via //ini_set//) when the HTML Templating Engine has created the correctly encoded output.+And, for bit of silliness (Spaß ist verboten), MarkR would like a //is_figurative()// function (functionality to be confirmed).
  
 ===== Proposed Voting Choices ===== ===== Proposed Voting Choices =====
  
-N/A+Accept the RFC. Yes/No
  
 ===== Patches and Tests ===== ===== Patches and Tests =====
  
-volunteer is needed to help with implementation.+N/A
  
 ===== Implementation ===== ===== Implementation =====
 +
 +Dan Ackroyd has [[https://github.com/php/php-src/compare/master...Danack:is_literal_attempt_two|started an implementation]], which uses functions like [[https://github.com/php/php-src/compare/master...Danack:is_literal_attempt_two#diff-2b0486443df74cd919c949f33f895eacf97c34b8490e7554e032e770ab11e4d8R2761|literal_combine()]] to avoid performance concerns.
 +
 +Joe Watkins has [[https://github.com/php/php-src/compare/master...krakjoe:literals|created an implementation]] which supports string concat at runtime.
 +
 +===== References =====
  
 N/A N/A
Line 448: Line 373:
  
 N/A N/A
 +
 +===== Thanks =====
 +
 +  - **Dan Ackroyd**, DanAck, for starting the first implementation (which made this a reality), and followup on the version that uses functions instead of string concat.
 +  - **Joe Watkins**, krakjoe, for finding how to set the literal flag (tricky), and creating the implementation that supports string concat.
 +  - **Máté Kocsis**, mate-kocsis, for setting up and doing the performance testing.
 +  - **Rowan Tommins**, IMSoP, for re-writing this RFC to focus on the key features, and putting it in context of how it can be used by libraries.
 +  - **Nikita Popov**, NikiC, for suggesting where the literal flag could be stored. Initially this was going to be the "GC_PROTECTED flag for strings", which allowed Dan to start the first implementation.
 +  - **Mark Randall**, MarkR, for alternative ideas, and noting that "interned strings in PHP have a flag", which started the conversation on how this could be implemented.
 +  - **Xinchen Hui**, who created the Taint Extension, allowing me to test the idea; and noting how Taint in PHP5 was complex, but "with PHP7's new zend_string, and string flags, the implementation will become easier" [[https://news-web.php.net/php.internals/87396|source]].
  
rfc/is_literal.txt · Last modified: 2022/02/14 00:36 by craigfrancis