PHP RFC: LiteralString

Introduction

Add LiteralString type, and is_literal_string(), to “distinguish strings from a trusted developer, from strings that may be attacker controlled”.

The vast majority of Injection Vulnerabilities involving libraries (e.g. database abstractions) are due to programmers using the library incorrectly. A simple LiteralString check would allow libraries to easily identify these mistakes, without needing to make massive changes.

It also allows developers to easily check their Parameterised Queries.

The LiteralString type has been added to Python 3.11 via PEP 675.

This technique is used at Google (as described in “Building Secure and Reliable Systems”, see Common Security Vulnerabilities, pages 251-255, which shows how “developer-controlled input” prevents these issues in Go); it's used by Facebook developers (ref pyre type-checker); and Christoph Kern discussed it in 2016 with Preventing Security Bugs through Software Design. It was also explained at USENIX Security 2015 and OWASP AppSec US 2021, and is summarised at eiv.dev.

The Problem

Developers often believe Database Abstractions or Parameterised Queries have completely solved Injection and Cross-Site Scripting (XSS) vulnerabilities; and while the situation has improved, mistakes still happen:

// Doctrine
$qb->select('u')
   ->from('User', 'u')
   ->where('u.id = ?1')
   ->setParameter(1, $_GET['id']); // Correct
 
$qb->select('u')
   ->from('User', 'u')
   ->where('u.id = ' . $_GET['id']); // INSECURE, but easier to write/read :-)
 
$qb->select('u')
    ->from('User', 'u')
    ->where($qb->expr()->andX(
        $qb->expr()->eq('u.type_id', $_GET['type']), // INSECURE, 'u.type_id) OR (1 = 1'
        $qb->expr()->isNull('u.deleted'), // Is ignored due to 'OR'
    ));
 
// Laravel
DB::table('user')->whereRaw('CONCAT(name_first, " ", name_last) LIKE ?', $search . '%');
DB::table('user')->whereRaw('CONCAT(name_first, " ", name_last) LIKE "' . $search . '%"'); // INSECURE

See the Additional Examples; tools (e.g. SQL Map, Havij, jSQL) make it easy to exploit these mistakes.

In the latest OWASP Top 10, Injection Vulnerabilities rank as the third highest security risk to web applications (database abstractions have at least helped move them from the top spot, but do not solve the problem).

Year            Injection Position   XSS Position
2021 (Latest)   3                    3
2017            1                    7
2013            1                    3
2010            1                    2
2007            2                    1
2004            6                    4
2003            6                    4

Proposal

A string will be of the LiteralString type if it was defined by the programmer (in source code), or is the result of LiteralString values being concatenated. The LiteralString type is lost when the value is modified.

The following string concatenation functions can return LiteralString values:

  1. str_repeat()
  2. str_pad()
  3. implode()
  4. join()

Namespaces constructed for the programmer by the compiler will be marked as a LiteralString.

Examples

$a = 'Hello';
$b = 'World';
 
is_literal_string('Example'); // true
is_literal_string($a); // true
is_literal_string($_GET['id']); // false

function example1(LiteralString $input) {
  return $input;
}
 
function example2(String $input) {
  if (!is_literal_string($input)) {
    error_log('Log issue, but still continue.');
  }
  return $input;
}
 
example1($a); // OK
example1($a . $b); // OK
example1("Hi $b"); // OK
example1(example1($a)); // OK
 
example1($_GET['id']); // TypeError
example1('/bin/rm -rf ' . $_GET['path']); // TypeError
example1('<img src=' . $_GET['src'] . ' />'); // TypeError
example1('WHERE id = ' . $_GET['id']); // TypeError

Most libraries will probably use something like example2() to test the values they receive, partly for backwards compatibility (they can use function_exists()), but also because it lets them choose how mistakes are handled. For example, I would suggest libraries log warnings by default, with an option to throw exceptions for programmers who are confident their code is ready, or when in development mode. They could also provide a way to disable checks on a per-query basis, or entirely for legacy projects (example).
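As a minimal sketch of that approach, a library might wrap the check as below. The helper name library_check_literal() and its $mode option are hypothetical (they are not part of this RFC); function_exists() means the check is simply skipped on PHP versions without is_literal_string().

```php
<?php

// Hypothetical helper, not part of the RFC: checks a value the way a library might.
// $mode is an assumed option: 'log' (default), 'throw', or 'off'.
function library_check_literal(string $value, string $mode = 'log'): string {
  if ($mode === 'off' || !function_exists('is_literal_string')) {
    return $value; // Checks disabled, or this PHP version does not provide the function.
  }
  if (!is_literal_string($value)) {
    if ($mode === 'throw') {
      throw new InvalidArgumentException('Non-LiteralString value detected.');
    }
    error_log('Non-LiteralString value detected, continuing.');
  }
  return $value;
}

$sql = library_check_literal('SELECT * FROM user WHERE id = ?');
```

The value is always returned unchanged; only the reporting differs, so a project could start with 'log' and move to 'throw' once the warnings stop.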

Libraries could also check their output (e.g. SQL to a database) is still a LiteralString, but this isn't a priority (libraries are rarely the source of Injection Vulnerabilities, it's usually the programmer using them incorrectly).

You can test it at 3v4l.org using the previous “is_literal()” function name.

Considerations

Performance

Máté Kocsis created a PHP benchmark to replicate the old Intel Tests. The results for the implementation found a 0.47% impact with the Symfony demo app, where it did not connect to a database (because the natural variability introduced by a database makes it impossible to measure an impact that small).

String Concatenation

When two LiteralString values are concatenated, the result is also a LiteralString.

It's been suggested that not supporting concatenation might help debugging, with the thought being, in a long complex script, which only checks if a variable is a LiteralString at the end, it's harder to identify the source of the problem. However, over the last year the feedback has been that the usual debug techniques work fine (if anything, programmers want sprintf support); whereas it would be nigh-on-impossible to update every library and all existing code to not use concatenation (e.g. to use a query builder). That said, someone who really wants this strict way of working could use:

function literal_implode($separator, $array) {
  $return = implode($separator, $array);
  if (!is_literal_string($return)) {
    throw new Exception('Non-literal-string detected!');
  }
  return $return;
}
 
function literal_concat(...$a) {
  return literal_implode('', $a);
}

Python and Go support string concatenation as well.

(On a technical note, we did test an implementation that didn't support concatenation, primarily to see if this would help reduce the performance impact even further. However, the PHP engine can sometimes still concatenate values automatically at compile-time (so concatenation appears to work in some contexts), and it didn't make much (if any) difference in regards to performance, because concat_function() in “zend_operators.c” uses zend_string_extend() (which needs to remove the LiteralString flag) and “zend_vm_def.h” does the same; by supporting a quick concat with an empty string (x2), which would need its flag removed as well).

String Splitting

In regards to string splitting, we didn't find any realistic use cases, and security features should try to keep the implementation as simple as possible.

Also, the security considerations are different. Concatenation joins known/fixed units together, whereas if you're starting with a LiteralString, and the program allows the Evil-User to split the string (e.g. setting the length in substr), then they get considerable control over the result (it creates an untrusted modification).

While unlikely to be written by a programmer, we can consider these:

$length = ($_GET['length'] ?? -5);
$url    = substr('https://example.com/js/a.js?v=55', 0, $length);
$html   = substr('<a href="#">#</a>', 0, $length);

If $url was used in a Content-Security-Policy, the query string needs to be removed, but as more of the string is removed, the more resources are allowed (“https:” basically allows resources from anywhere). With the HTML example, moving from the tag content to the attribute can be a problem (while HTML Templating Engines should be fine, unfortunately libraries like Twig are not currently context aware, so you need to change from the default 'html' encoding to 'html_attr' encoding).
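To make the two cases concrete, this small sketch runs substr() with the intended length and with two lengths an attacker might choose (the values 6 and 9 are picked purely for illustration):

```php
<?php

$url  = 'https://example.com/js/a.js?v=55';
$html = '<a href="#">#</a>';

// A length of -5 removes the query string, as the programmer intended.
$csp_ok = substr($url, 0, -5); // 'https://example.com/js/a.js'

// An attacker-chosen length of 6 leaves only the scheme.
$csp_bad = substr($url, 0, 6); // 'https:'

// An attacker-chosen length of 9 ends inside the attribute value.
$html_bad = substr($html, 0, 9); // '<a href="'
```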

Krzysztof Kotowicz has confirmed that, at Google, with “go-safe-html”, string concatenation is allowed, but splitting is explicitly not supported because it “can cause issues”; for example, “arbitrary split position of a HTML string can change the context”.

Frequently Asked Questions

FAQ: WHERE IN

With SQL, you can use WHERE id IN (?,?,?)

User values should be sent to the database separately (with prepared queries), so you should follow the advice from Levi Morrison, PDO Execute, and Drupal Multiple Arguments, and use something like this:

$sql = 'WHERE id IN (' . join(',', array_fill(0, count($ids), '?')) . ')';

Or, you could use concatenation:

$sql = '?';
for ($k = 1; $k < count($ids); $k++) {
  $sql .= ',?';
}

Libraries can also abstract this, e.g. WordPress should support the following in the future (#54042):

$wpdb->prepare('SELECT * FROM table WHERE id IN (%...d)', $ids);
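The placeholder approach could be wrapped in a small helper; a sketch, where in_placeholders() is a hypothetical name and the table and column names are only for illustration. Because it only uses implode() on strings written in the source, the result would be expected to remain a LiteralString under this proposal.

```php
<?php

// Hypothetical helper: returns '?,?,?' when $count is 3.
function in_placeholders(int $count): string {
  if ($count < 1) {
    throw new InvalidArgumentException('At least one value is required.');
  }
  return implode(',', array_fill(0, $count, '?'));
}

$ids = [4, 8, 15];

$sql = 'SELECT * FROM user WHERE id IN (' . in_placeholders(count($ids)) . ')';

// The $ids themselves are sent separately, e.g. as the parameters of a prepared query.
```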

FAQ: Non-Parameterised Values

With Table and Field names in SQL, you cannot use parameters, these must be in the SQL string.

Ideally they would be LiteralStrings anyway (so no change needed); and if they are dependent on user input, in most cases you can (and should) use an array of permitted LiteralString values:

$sort = ($_GET['sort'] ?? NULL);
 
$fields = [
    'name',
    'email',
    'created',
  ];
 
$order_id = array_search($sort, $fields);
 
$sql .= ' ORDER BY ' . $fields[$order_id]; // A LiteralString

Or, you could use:

$fields = [
    'name'    => 'u.full_name',
    'email'   => 'u.email_address',
    'created' => 'DATE(u.created)',
  ];
 
$sql .= ' ORDER BY ' . ($fields[$sort] ?? 'u.full_name'); // A LiteralString

This approach stops the attacker specifying a private field (e.g. telephone_number, where they can determine every user's telephone number by updating their own account, and seeing how that affects the order).
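A sketch of the second form wrapped in a function, to show what happens when a field outside the permitted list is requested. The function name order_by_sql() is hypothetical.

```php
<?php

// Hypothetical helper: maps a user-supplied sort key to a permitted SQL fragment.
function order_by_sql(?string $sort): string {
  $fields = [
      'name'    => 'u.full_name',
      'email'   => 'u.email_address',
      'created' => 'DATE(u.created)',
    ];
  // Unknown or missing keys fall back to the default field.
  return ' ORDER BY ' . ($fields[$sort ?? ''] ?? 'u.full_name');
}

$permitted = order_by_sql('email');            // ' ORDER BY u.email_address'
$rejected  = order_by_sql('telephone_number'); // ' ORDER BY u.full_name'
```

The user's value is only ever used as an array key, so it never becomes part of the SQL string.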

There may be some exceptions, see the next section.

FAQ: Non-LiteralString Values

So what do we do when a non-LiteralString needs to be used?

For example Dennis Birkholz noted that some Systems/Frameworks define some variables (e.g. table name prefixes) without the use of a LiteralString (e.g. ini/json/yaml). And Larry Garfield noted that in Drupal's ORM “the table name itself is user-defined” (not in the PHP script).

These special non-LiteralString values should still be handled separately (and carefully); where the library checks the sensitive inputs (SQL/HTML/CLI/etc) are still LiteralStrings, and accepts any special values separately, where it can safely/consistently use them (e.g. using backtick escaping for identifiers being sent to a MySQL database).

For example, using a separate array of $identifiers:

$sql = '
  SELECT
    u.name
  FROM
    user AS u
  WHERE
    u.type = ?
  ORDER BY
    {field}'; // A LiteralString
 
$parameters = [
    $_GET['type'],
  ];
 
$identifiers = [
    'field' => $_GET['field'],
  ];
 
$results = $db->query($sql, $parameters, $identifiers);

And WordPress 6.2 is scheduled to support (#52506):

$wpdb->prepare('ORDER BY %i', $field);

Or the library could use a Query Builder.
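As a rough sketch of how a library might handle a separate $identifiers array internally, the helper below replaces {name} placeholders with backtick-quoted identifiers. The function name apply_identifiers() is hypothetical, and the quoting assumes MySQL rules (backticks, with any backtick in the value doubled); a real library would first check that $sql is a LiteralString.

```php
<?php

// Hypothetical library helper: replaces {name} placeholders with
// backtick-quoted identifiers (MySQL style, doubling any backticks).
function apply_identifiers(string $sql, array $identifiers): string {
  $replacements = [];
  foreach ($identifiers as $name => $value) {
    $replacements['{' . $name . '}'] = '`' . str_replace('`', '``', (string) $value) . '`';
  }
  // strtr() replaces in a single pass, so an identifier value that
  // contains '{other}' is not treated as another placeholder.
  return strtr($sql, $replacements);
}

$sql = 'SELECT u.name FROM user AS u WHERE u.type = ? ORDER BY {field}';

$safe  = apply_identifiers($sql, ['field' => 'name']);
$weird = apply_identifiers('ORDER BY {field}', ['field' => 'na`me']);
```

The result is no longer a LiteralString, which is fine: the check happens on the library's input, before the identifiers are applied.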

FAQ: Bypassing It

This implementation does not provide an easy way for programmers to mark anything they want as a LiteralString. This is on purpose: we do not want to re-create one of the problems with Taint Checking, by pretending the LiteralString is a flag to say the value is “safe”.

Some libraries may want to support their own way to bypass these checks, e.g. a ValueObject:

class UnsafeSQL {
  private $value = NULL;
  public function __construct($value) {
    $this->value = $value;
  }
  public function __toString() {
    return $this->value;
  }
}
 
function example1(LiteralString|UnsafeSQL $input) {
  return $input;
}
 
function example2($input) {
  if (!is_literal_string($input) && !($input instanceof UnsafeSQL)) {
    error_log('Log issue, but still continue.');
  }
  return $input;
}

We do not pretend there aren't ways around this (e.g. using eval), but in doing so the programmer is clearly choosing to do something wrong. We want to provide safety rails, but there is nothing stopping the programmer from intentionally jumping over them.

FAQ: Integer Values

We wanted to flag integers defined in the source code, in the same way we are doing with strings. Unfortunately it would require a big change to add a literal flag on integers. Changing how integers work internally would have made a big performance impact, and potentially affected every part of PHP (including extensions).

Due to this limitation, we did consider an approach to trust all integers, where Scott Arciszewski suggested the name is_noble(). While this is not as philosophically pure, we continued to explore this possibility because we could not find any way an Injection Vulnerability could be introduced with integers in SQL, HTML, CLI; and other contexts as well (e.g. preg, mail additional_params, XPath query, and even eval). We could not find any character encoding issues either (The closest we could find was EBCDIC, an old IBM character encoding, which encodes the 0-9 characters differently; which anyone using it would need to re-encode either way, and EBCDIC is not supported by PHP). And we could not find any issue with a 64bit PHP server sending a large number to a 32bit database, because the number is being encoded as characters in a string (so that's also fine). However, the feedback received was that while safe from Injection Vulnerabilities, it becomes a more complex concept, one that might cause programmers to assume it is also safe from programmer/logic errors. Ultimately the preference was the simpler approach, that did not allow any integers (which is reinforced with the name LiteralString).

Python and Go do not support integers either.

FAQ: Other Values

Like Integers, it would be hard to support Boolean/Float values; they are also a very low-value feature, and we cannot be sure of the security implications.

For example, the value you put in is not always the same as what you get out:

var_dump((string) true);  // "1"
var_dump((string) false); // ""
var_dump(2.3 * 100);      // 229.99999999999997
 
setlocale(LC_ALL, 'de_DE.UTF-8');
var_dump(sprintf('%.3f', 1.23)); // "1,230"
 // Note the comma, which can be bad for SQL.
 // Pre 8.0 this also happened with string casting.

FAQ: Other Functions

We made the decision to only support 4 functions that concatenate strings.

There are a lot of other candidates; e.g. adding strtoupper() might be reasonable, however we would need to consider the effect of every function and context, making the concept of a LiteralString more complex (e.g. output varying based on the current locale, str_shuffle() creating unpredictable results, etc).

The main request that's come up over the last year is to support sprintf(). While this is reasonable for basic concatenation (e.g. only using “%s”), it gets more complicated when coercing values to a different type, or when using formatting. That said, a future RFC might consider changing this (with the main focus being on the implications/risks).

Python has a longer list of methods that preserve LiteralString, where they found it tricky to decide what should be allowed, and this created a bit of negative feedback (some people want more functions on the list, while others wish these hadn't been included because it moved away from a simple “developer defined string”).

FAQ: The Name

A “Literal String” is the standard name for strings in source code. See Google.

A string literal is the notation for representing a string value within the text of a computer program. In PHP, strings can be created with single quotes, double quotes or using the heredoc or the nowdoc syntax.

LiteralString shows it only accepts strings (not integers, as noted above).

And follows the naming convention of not using underscores for the type/object (e.g. DateTime, DOMDocument, ImageMagick), while using underscores for the is_literal_string() function.

It's also the name chosen for the Python implementation.

FAQ: Extensions

If an extension is found to be already using the flag we're using for LiteralString (unlikely), that's the same as any new flag being introduced into PHP, and will need to be updated in the same way. And by default, flags are off, which is a fail safe situation.

FAQ: Adoption

Existing libraries will probably focus on using is_literal_string(), as it allows them to easily choose how mistakes are handled, and function_exists() makes supporting PHP 8.2 and below very easy.

Psalm (Matthew Brown): 13th June 2021 “I was skeptical about the first draft of this RFC when I saw it last month, but now I see the light (especially with the concat changes)”. Then on the 14th June, “I've just added support for a literal-string type to Psalm: https://psalm.dev/r/9440908f39” (4.8.0)

PHPStan (Ondřej Mirtes): 1st September 2021, has been implemented in 0.12.97.

PhpStorm: 2022.3 recognises the literal-string type (WI-64109).

WordPress: After adding support for escaping field/table names (identifiers) with %i (#52506), and to make IN (?,?,?) easier with %...d (#54042), a LiteralString check will be added to the $query parameter in wpdb::prepare().

Nette (David Grudl): “the literal-string type [is used] with nette/database” (patch).

Doctrine: While not part of the official Doctrine project, the phpstan-doctrine extension adds experimental support via bleedingEdge (will probably use a separate flag in the future).

Propel (Mark Scherer): “given that this would help to more safely work with user input, I think this syntax would really help in Propel.” (example).

RedBean (Gabor de Mooij): “You can list RedBeanPHP as a supporter, we will implement this into the core.” (example).

Alternatives

Static Analysis

Both Psalm and PHPStan have supported the literal-string type since September 2021.

While I want more programmers to use Static Analysis, it's not realistic to expect all PHP programmers to always use these tools, and for all PHP code to be updated so Static Analysis can run the strictest checks, and use no baseline (using the JetBrains surveys; in 2021 only 33% used Static Analysis; and 2022 shows a similar story with 33% (at best) using PHPStan/Psalm/Phan; where the selected programmers for both surveys are 3 times more likely to use Laravel than WordPress).

Also, it can be tricky to get current Static Analysis tools to cover every case. For example, they don't currently support recursive type checking, or get a value-object to conditionally return a type. In contrast, both are easy with the LiteralString type.

Taint Checking

Taint Checking incorrectly assumes the output of an escaping function is “safe” for a particular context. While it sounds reasonable in theory, the operation of escaping functions, and the context for which their output is safe, is very hard to define, and leads to a feature that is both complex and unreliable.

$sql = 'SELECT * FROM users WHERE id = ' . $db->real_escape_string($id); // INSECURE
$html = "<img src=" . htmlentities($url) . " alt='' />"; // INSECURE
$html = "<a href='" . htmlentities($url) . "'>..."; // INSECURE

All three examples would be incorrectly considered “safe” (untainted). The first two need the values to be quoted. In the third example, htmlentities() did not escape single quotes by default before PHP 8.1 (since fixed), and it still does not consider the issue of 'javascript:' URLs.

This is why Psalm notes these Taint Checking Limitations, and suggests using the literal-string type.
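The HTML cases can be reproduced with a short sketch; the attacker-controlled values below are illustrative only:

```php
<?php

$url = 'x onerror=alert(1)'; // Attacker-controlled value.

// None of these characters are encoded by htmlentities(), so the value is
// unchanged, and the missing quotes let it add a second attribute.
$html = "<img src=" . htmlentities($url) . " alt='' />";

// A 'javascript:' URL also passes through htmlentities() unchanged.
$link = "<a href='" . htmlentities('javascript:alert(1)') . "'>...</a>";
```

In both cases the escaping function ran, so a Taint Checker would treat the result as untainted, yet the output still contains attacker-controlled script.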

Abstractions

Libraries currently accept LiteralStrings like the following:

->field_add('LEFT(ref, (LENGTH(ref) - 3))')

But the library has no idea when a programmer does something like:

->field_add('LEFT(ref, (LENGTH(ref) - ' . $_GET['cut'] . '))') // INSECURE

A LiteralString check would easily identify these mistakes; but an alternative approach would be to replace these simple strings with a full abstraction, where every part is either represented by an object, or checked/quoted as appropriate; for example:

->field_add(new Func('LEFT', 'ref', new Calc(new Func('LENGTH', 'ref'), '-', new Value(3))))

The Laravel Query Expressions package does this.

While this does allow for additional checks (e.g. static analysis), it's unlikely many programmers will adopt it, as it's difficult to write (and later read); in the same way developers are more likely to use DOMDocument::loadHTML() rather than add every element via DOMDocument::createElement(), DOMDocument::createAttribute(), etc.

Tagged Templates

In JavaScript, there is a form of Template Literal known as Tagged Templates.

Available since ~2015 (Firefox 34, Chrome 41, NodeJS 4); where libraries should use isTemplateObject (NodeJS can use is-template-object) to ensure the function is called correctly (example).

function example(strings, ...values) {
    if (!isTemplateObject(strings)) {
       throw new Error('Not a Tagged Template');
    }
    return strings[0] + values[0] + strings[1] + values[1] + strings[2];
}
 
var id = 123,
    field = 'name',
    sql = example`WHERE id = ${id} ORDER BY ${field}`; // The Template
 
console.log(sql);

PHP cannot use ` (execute shell command), but could use ``` (which can be tricky for Markdown).

Instead of calling a function directly, PHP could create a TemplateLiteral object, providing methods like getStringParts() and getValues(), so the object can be passed to a library to check and use.

By using a TemplateLiteral object, it would be possible to concatenate with $a = ```{$a} b``` (e.g. to conditionally add SQL/HTML, or help readability); but other forms of concatenation would be up for debate, e.g.

$sql = ```{$sql} AND category = {$category}```;
 
$sql = ```deleted ``` . ($archive ? ```IS NOT NULL``` :  ```IS NULL```); // Maybe?
 
if ($name) {
  $sql .= ``` AND name = {$name}```; // Maybe?
}

Tagged Templates might be a nice feature to have (sometimes they can be easier to read), but assuming a __toString() method is provided, we must also consider mis-use; e.g. in JavaScript, basic Template Literals have made it much easier for developers to create XSS vulnerabilities, where developers often don't think about HTML encoding in this context:

p.innerHTML = `Hi ${name}`; // INSECURE

Consideration would be needed on if/how Tagged Templates could protect functions like mysqli_query(); e.g. only accept if the Tagged Template uses no variables? or could PDO, MySQLi, ODBC, etc provide Value-Objects for Identifiers? In comparison, a LiteralString can simply be accepted, so code that already uses LiteralStrings would not need any modification (see Future Scope for special cases).

Also, considering developers often (incorrectly) believe their Database Abstractions or Parameterised Queries have completely solved Injection Vulnerabilities, it would be very unlikely to get all developers to replace all of their existing LiteralStrings with Tagged Templates (note how few libraries use this in NodeJS).

While changing the quote character is fairly easy, it's tricky to automate, time-consuming, and risky for those without tests (a typical project can easily require thousands of lines of code to be changed). Any escaping functions would still need to be removed (so no advantage there). Variables for Identifiers (e.g. field-name) in SQL Tagged Templates would need to be considered, and developers will need to wait until PHP 8.X is their minimum supported version.

Example / Diff

XHP in Hack / HHVM is similar, where it introduces an XML-like syntax that can be used for HTML templating.

Macros

In Rust it's possible to use procedural macros, e.g.

html_add!("<p>Hello <span>?</span></p>");

Macros are run during compilation (when user values are not present), and can replace the code within the brackets. In this case the macro could check the contents, and if it's considered safe, change the code to call a method provided by the library with “unsafe” in its name. While developers could call the unsafe method directly, they are at least aware they are doing something unsafe, and can be easily found during an audit.

Macros might be a nice feature to have; but it can get complicated for libraries to check the AST; getting developers to replace their existing LiteralStrings to use Macros is unlikely (as noted with Tagged Templates); and without operator overloads (1/2), concatenation would need to be handled within the macro:

- $where_sql .= ' AND deleted IS NULL';
+ $where_sql = sql!($where_sql . ' AND deleted IS NULL');
or
+ sql!($where_sql .= ' AND deleted IS NULL');

Example / Diff

Education

Training simply does not scale, and mistakes still happen.

We cannot expect everyone to have formal training, know everything from Day 1, and consider programming a full time job. We want new programmers, with a variety of experiences, ages, and backgrounds. Everyone should be guided to do the right thing, and notified as soon as they make a mistake (we all make mistakes). We also need to acknowledge that many programmers are busy, do copy/paste code, don't necessarily understand what it does, edit it for their needs, then simply move on to their next task.

Other Programming Languages

Similar concepts implemented in other programming languages:

Python can use the LiteralString type in 3.11 (pyre example, via PEP 675).

Go can use an “un-exported string type”, a technique which is used by go-safe-html.

C++ can use a “consteval annotation”.

Scala can use “String with Singleton”.

Java can use a “@CompileTimeConstant annotation” from Error Prone to ensure method parameters can only use “compile-time constant expressions”.

Rust can use a “procedural macro”, to check the provided value is a literal at compile-time.

Node has the is-template-object polyfill, which checks a tag function was provided a “tagged template literal” (this technique is used in safesql, via template-tag-common). Alternatively Node programmers can use goog.string.Const from Google's Closure Library.

JavaScript is getting isTemplateObject, for “Distinguishing strings from a trusted developer from strings that may be attacker controlled” (intended to be used with Trusted Types).

Perl has a Taint Mode, via the -T flag, where all input is marked as “tainted”, and cannot be used by some methods (like commands that modify files), unless you use a regular expression to match and return known-good values (regular expressions are easy to get wrong).

History

There is a Taint extension for PHP by Xinchen Hui, and a previous RFC proposing it be added to the language by Wietse Venema, but Taint Checking is flawed (see notes above).

And there is the Automatic SQL Injection Protection RFC by Matt Tait (this RFC uses a similar concept of the SafeConst). When Matt's RFC was being discussed, it was noted:

  • “unfiltered input can affect way more than only SQL” (Pierre Joye);
  • this amount of work isn't ideal for “just for one use case” (Julien Pauli);
  • It would have affected every SQL function, such as mysqli_query(), $pdo->query(), odbc_exec(), etc (concerns raised by Lester Caine and Anthony Ferrara);
  • Each of those functions would need a bypass for cases where unsafe SQL was intentionally being used (e.g. phpMyAdmin taking SQL from POST data) because some applications intentionally “pass raw, user submitted, SQL” (Ronald Chmara 1/2).

In 2021 I wrote the is_literal() RFC, where the feedback was:

  • “Ideally we would want to assign a variable to be of 'literal' type.” George P. Banyard (covered by this RFC).
  • “There is good progress in taint analysis” Marco Pivetta (see the flaws noted with Taint Analysis above).
  • “I would like the ecosystem to pick up static analysis more” Marco Pivetta (I do too, but I doubt we can get everyone using it all the time, at the strictest levels, and no baseline).
  • “the concatenation operation is basically kicking the can down the road [...] using a function like concat_literal() [...] provides immediate feedback” George P. Banyard and “[literal_concat() makes] it easy to track down issues where they occur” Dan Ackroyd (I've not found this to be the case, but re-writing all code to not use concatenation is a big change).
  • “I'd prefer proper type or static analysis over adding more functions if that could effectively solve the problem.” (this RFC adds the type, and Static Analysis can do this now).
  • “There just too much debate for me to be comfortable and vote yes.” (ref the discussions about integer support and concatenation just before the 8.1 deadline).
  • “Bad code should better be fixed through better documentation.” (we've tried that, and mistakes still happen).
  • “I think libraries are very unlikely to adopt” (they have, see above).
  • “you can't even trust is_literal() [due to] file_put_contents(“data.php”, ”<?php return $_GET[id];“); $id = require “data.php”;” (I doubt any programmer will do this by accident).
  • “I don't believe we should expect security or maintainability without (all together): proper education + peer reviewing + static analysis.” (these should still happen).
  • “in the real world, you're going to start seeing cases where something is “literal enough”, but doesn't pass the is_literal test” (examples were asked for, but no response).

I also agree with Scott Arciszewski, “SQL injection is almost a solved problem [by using] prepared statements”, where LiteralString identifies when user input is accidentally included in the SQL string.

On a technical note, the implementation avoids situations that could have confused the programmer, by using the Lexer to mark strings as a LiteralString (i.e. earlier in the process):

$one = 1;
$a = 'A' . $one; // false, flag removed because it's being concatenated with an integer.
$b = 'A' . 1; // Was true, as the compiler optimised this to the literal 'A1'.
 
$a = "Hello ";
$b = $a . 2; // Was true, as the 2 was coerced to the string '2' (to optimise the concatenation).
 
$a = implode("-", [1, 2, 3]); // Was true with OPcache, as it could optimise this to the literal '1-2-3'
 
$a = chr(97); // Was true, due to the use of Interned Strings.

Backward Incompatible Changes

No known BC breaks, except for existing code that contains the userland function is_literal_string(), or object LiteralString.

Proposed PHP Version(s)

PHP 8.3

RFC Impact

To SAPIs

None known

To Existing Extensions

None known

To Opcache

None known

Open Issues

Additional testing of the final implementation; including extensions like Swoole or OpenSwoole.

Should eval() be unable to create a LiteralString, or is it too similar to:

$id = ($_GET['id'] ?? NULL);
$file = tempnam(sys_get_temp_dir(), 'literal-string');
file_put_contents($file, '<'.'?php return '.var_export(strval($id),true).';');
$id = require($file);
unlink($file);

Future Scope

1) We might re-look at sprintf() being able to return a LiteralString.

2) We might re-look at LiteralInteger. While this is unlikely, as it would change the zval structure, it might be possible if there is enough demand. It would also need a discussion on what happens with other operations, e.g. integer addition.

3) As noted by MarkR, the biggest benefit will come when this flag can be used by PDO and similar functions (mysqli_query, preg_match, exec, etc).

However, first we need libraries to start checking the relevant inputs are a LiteralString. The library can then do their thing, and apply the appropriate escaping, which can result in a value that no longer has the LiteralString flag set, but is perfectly safe for the native functions.

With a future RFC, we could introduce checks for the native functions. For example, if we use the Trusted Types concept from JavaScript, the libraries could create a stringable ValueObject as their output. These objects can be added to a list of safe objects for the relevant native functions. The native functions could then warn programmers when they do not receive a value with the LiteralString flag, or one of the safe objects. These warnings would not break anything, they just make programmers aware of any mistakes they have made, and we will always need a way of switching them off entirely (e.g. phpMyAdmin).

Voting

Accept the RFC

LiteralString
Real name Yes No
Final result: 0 0
This poll has been closed.

Implementation

Joe Watkins' implementation provides is_literal(), but will need to be updated to support the LiteralString native type, and to rename the function to is_literal_string().

Rejected Features

Thanks

  1. Joe Watkins, krakjoe, for writing the full implementation, including support for concatenation and integers, and helping me through the RFC process.
  2. Máté Kocsis, mate-kocsis, for setting up and doing the performance testing.
  3. Scott Arciszewski, CiPHPerCoder, for checking over the original RFC, and providing text on how we could implement integer support under an is_noble() name.
  4. Dan Ackroyd, DanAck, for starting the first implementation, which made this a reality, providing literal_concat() and literal_implode(), and followup on how it should work.
  5. Xinchen Hui, who created the Taint Extension, allowing me to test the idea; and noting how Taint in PHP5 was complex, but “with PHP7's new zend_string, and string flags, the implementation will become easier” source.
  6. Rowan Francis, for proof-reading, and helping me make an RFC that contains readable English.
  7. Rowan Tommins, IMSoP, for helping with the original RFC, focusing on the key features, and putting it in context of how it can be used by libraries.
  8. Nikita Popov, NikiC, for suggesting where the flag could be stored. Initially this was going to be the “GC_PROTECTED flag for strings”, which allowed Dan to start the first implementation.
  9. Mark Randall, MarkR, for suggestions, and noting that “interned strings in PHP have a flag”, which started the conversation on how this could be implemented.
  10. Sara Golemon, SaraMG, for noting that I'd need to explain how is_literal() is different to the flawed Taint Checking approach, so we don't get “a false sense of security or require far too much escape hatching”.
rfc/literal_string.txt · Last modified: 2023/04/20 12:18 by craigfrancis