This is an old revision of the document!
PHP RFC: Is Literal Check
- Version: 0.5
- Date: 2020-03-21
- Updated: 2021-04-19
- Author: Craig Francis, craig#at#craigfrancis.co.uk
- Status: Draft
- First Published at: https://wiki.php.net/rfc/is_literal
- GitHub Repo: https://github.com/craigfrancis/php-is-literal-rfc
Introduction
This RFC proposes a new function, is_literal(string $string), to help enforce a separation of hard-coded values, from user-supplied data.
This addresses some of the same use cases as “taint flags”, but is both simpler and stricter. It does not address how user data is transmitted or escaped, only whether it has been passed to a particular library function separately from the programmer defined values.
The clearest example is a database library which supports parametrised queries at the driver level, where the programmer could use either of these:
$db->query('SELECT * FROM users WHERE id = ' . $_GET['id']); // INSECURE $db->query('SELECT * FROM users WHERE id = ?', [$_GET['id']]);
By rejecting the SQL that was not written as a literal (first example), the library can provide protection against this incorrect use.
Examples
The Doctrine Query Builder allows custom WHERE clauses to be provided as strings. This is intended for use with literals and placeholders, but does not protect against this simple mistake:
// INSECURE $qb->select('u') ->from('User', 'u') ->where('u.id = ' . $_GET['id'])
The definition of the where() method could check with is_literal() and throw an exception advising the programmer to replace it with a safer use of placeholders:
$qb->select('u') ->from('User', 'u') ->where('u.id = :identifier') ->setParameter('identifier', $_GET['id']);
Similarly, Twig allows loading a template from a string, which could allow accidentally skipping the default escaping functionality:
// INSECURE echo $twig->createTemplate('<p>Hi ' . $_GET['name'] . '</p>')->render();
If createTemplate() checked with is_literal(), the programmer could be advised to write this instead:
echo $twig->createTemplate('<p>Hi {{ name }}</p>')->render(['name' => $_GET['name']]);
Proposal
A literal is defined as a value (string) which has been written by the programmer. The value may be passed between functions, as long as it is not modified in any way other than string concatenation.
is_literal('Example'); // true $a = 'Example'; is_literal($a); // true is_literal(4); // true is_literal(0.3); // true is_literal('a' . 'b'); // true, compiler can concatenate $a = 'A'; $b = $a . ' B ' . 3; is_literal($b); // true, ideally (more details below) is_literal($_GET['id']); // false is_literal(rand(0, 10)); // false is_literal(sprintf('LIMIT %d', 3)); // false, should use parameters
Note that there is no way to manually mark a string as a literal (i.e. no equivalent to untaint()); as soon as the value has been manipulated in any way, it is no longer marked as a literal.
See the justification page as to why it's done this way.
Comparison to Taint Tracking
Some languages implement a “taint flag” which tracks whether values are considered “safe”. There is a Taint extension for PHP by Xinchen Hui, and a previous RFC proposing it be added to the language.
These solutions rely on the assumption that the output of an escaping function is safe for a particular context. This sounds reasonable in theory, but the operation of escaping functions, and the context for which their output is safe, are very hard to define. This leads to a feature that is both complex and unreliable.
This proposal avoids the complexity by addressing a different part of the problem: separating inputs supplied by the programmer, from inputs supplied by the user.
Previous Work
Google uses a similar approach in Go to identify “compile time constants”, Perl has a Taint Mode (but uses regular expressions to un-taint data), and there are discussions about adding it to JavaScript to support Trusted Types.
As noted by Tyson Andre, it might be possible to use static analysis, for example psalm. But I can't find any which do these checks by default, can be incomplete, they are likely to miss things (especially at runtime), and we can't expect all programmers to use static analysis (especially those who are new to programming, who need this more than developers who know the concepts and just make the odd mistake).
And there is the Automatic SQL Injection Protection RFC by Matt Tait, where this RFC uses a similar concept of the SafeConst. When Matt's RFC was being discussed, it was noted:
- “unfiltered input can affect way more than only SQL” (Pierre Joye);
- this amount of work isn't ideal for “just for one use case” (Julien Pauli);
- It would have effected every SQL function, such as mysqli_query(), $pdo->query(), odbc_exec(), etc (concerns raised by Lester Caine and Anthony Ferrara);
I also agree that “SQL injection is almost a solved problem [by using] prepared statements” (Scott Arciszewski), and this is where is_literal() can be used to check that no mistakes are made.
Usage
By libraries:
function literal_check($var) { if (function_exists('is_literal') && !is_literal($var)) { $level = 2; // Get from config, defaults to 1. if ($level === 0) { // Programmer aware, and is choosing to bypass this check. } else if ($level === 1) { trigger_error('Non-literal detected!', E_USER_NOTICE); } else { throw new Exception('Non-literal detected!'); } } } function example($input) { literal_check($input); // ... } example('hello'); // OK example(strtoupper('hello')); // Exception thrown: the result of strtoupper is a new, non-literal string
Table and Fields in SQL, which cannot use parameters; for example ORDER BY:
$order_fields = [ 'name', 'created', 'admin', ]; $order_id = array_search(($_GET['sort'] ?? NULL), $order_fields); $sql = ' ORDER BY ' . $order_fields[$order_id];
Undefined number of parameters; for example WHERE IN:
function where_in_sql($count) { // Should check for 0 $sql = '?'; for ($k = 1; $k < $count; $k++) { $sql .= ',?'; } return $sql; } $sql = 'WHERE id IN (' . where_in_sql(count($ids)) . ')';
Backward Incompatible Changes
None
Proposed PHP Version(s)
PHP 8.1?
RFC Impact
To SAPIs
Not sure
To Existing Extensions
Not sure
To Opcache
Not sure
Open Issues
On GitHub:
- Name it something else? Jakob Givoni suggested is_from_literal().
- Would this cause performance issues? A basic string concat test, just focusing on string concat (worst case scenario), shows a 1.3% increase in processing time (1.341s to 1.358s = +0.017s).
- Systems/Frameworks that define certain variables (e.g. table name prefixes) without the use of a literal (e.g. ini/json/yaml files), they might need to make some changes to use this check, as originally noted by Dennis Birkholz.
Unaffected PHP Functionality
Not sure
Future Scope
As noted by MarkR, the biggest benefit will come when it can be used by PDO and similar functions (mysqli_query, preg_match, exec, etc). But the basic idea can be used immediately by frameworks and general abstraction libraries, and they can give feedback for future work.
Phase 2 could introduce a way for programmers to specify that certain function arguments only accept safe literals, and/or specific value-objects their project trusts (this idea comes from Trusted Types in JavaScript).
For example, a project could require the second argument for pg_query() only accept literals or their query_builder object (which provides a __toString method); and that any output (print, echo, readfile, etc) must use the html_output object that's returned by their trusted HTML Templating system (using ob_start() might be useful here).
Phase 3 could set a default of 'only literals' for all of the relevant PHP function arguments, so developers are given a warning, and later prevented (via an exception), when they provide an unsafe value to those functions (they could still specify that unsafe values are allowed, e.g. phpMyAdmin).
And, for a bit of silliness (Spaß ist verboten), there could be a is_figurative() function, which MarkR seems to really, want
Proposed Voting Choices
N/A
Patches and Tests
N/A
Implementation
Joe Watkins has created an implementation which includes string concat. While the performance impact needs to be considered, this would provide the easiest solution for projects already using string concat for their parameterised SQL.
Dan Ackroyd also started an implementation, which uses functions like literal_combine() to avoid performance concerns.
References
N/A
Rejected Features
N/A
Thanks
- Dan Ackroyd, DanAck, for surprising me with the first implementation, and getting the whole thing started.
- Joe Watkins, krakjoe, for finding how to set the literal flag, and creating the implementation that supports string concat.
- Rowan Tommins, IMSoP, for re-writing this RFC to focus on the key features, and putting it in context of how it can be used by libraries.
- Nikita Popov, NikiC, for suggesting where the literal flag could be stored. Initially this was going to be the GC_PROTECTED flag for strings, which allowed Dan to start the first implementation.
- MarkR, for alternative ideas, and noting that “interned strings in PHP have a flag” source, which started the conversation on how this could be implemented.
- Xinchen Hui, who created the Taint Extension, allowing me to test the idea; and noting how Taint in PHP5 was complex, but “with PHP7's new zend_string, and string flags, the implementation will become easier” source.