rfc:pdo_driver_specific_parsers

This is an old revision of the document!


PHP RFC: PDO Driver specific SQL parsers

Introduction

The PDO extension contains a SQL parser, whose main purpose is to recognise parameter placeholders inside queries (i.e. “?” and “:paramName”), so that it knows how many and what parameters to expect for a query, and pass the information to the PDO driver in use.

This parser had historically been modelled to work with what was the de-facto SQL standard in the PHP ecosystem at that time: MySQL. However, the SQL dialect used by MySQL is different when handling string literals from standard SQL, followed by other database vendors, such as PostgreSQL and SQLite.

Specifically, MySQL treats the backslash character as an escape character:

'This \'word\' is a single quoted'

Whereas the SQL standard uses double single quotes:

'This ''word'' is a single quoted'

When using databases other than MySQL with PDO, valid queries having string literals ending with a backslash character are throwing off the parser and causing apparently inexplicable bugs.

For example:

SELECT 'foo\' AS a, '?' AS b

will make PDO consider “'foo\' AS a, '” to be a string literal and will parse the following “?” as a positional parameter placeholder. In fact, if you pay close attention, even the DocuWiki SQL code formatter is thrown off by this very example.

We have several reports of similar bugs [1], [2], [3], and possibly other duplicates[7].

In a nutshell, the PDO SQL scanner function doesn't need to completely parse the SQL dialect in use. It is sufficient to properly recognise quoted string literals, quoted identifier literals, and comments to skip placeholder detection inside them.

Bonus improvement

The limitation of a global SQL parser in PDO meant that my previous RFC to support the PostgreSQL-only "?" operator[4] had to apply the parser change to all database types. The change had no known side-effects, but I feel this RFC could improve by keeping this peculiarity inside the pdo_pgsql scanner only.

Proposal

The proposal is to:

  1. change the default PDO scanner to expect standard SQL only;
  2. allow drivers to optionally provide a custom scanner function to handle their specific SQL dialect;
  3. re-use the current version of the scanner in pdo_mysql, cleaning it up of the PostgreSQL-specific handling of the “?” operator;
  4. add a pdo_pgsql specific scanner, derived from the new default scanner, but capable of dealing with the “?” operator.

One important thing to mention is that the proposed changes are only targeting the part of PDO that scans the SQL query for parameter placeholders. Quoting literals or using parameters in queries is not going to be affected.

Detailed Proposal

In order to execute the plan, a new member will be appended to:

struct pdo_dbh_methods {
	pdo_dbh_close_func		closer;
	pdo_dbh_prepare_func		preparer;
	pdo_dbh_do_func			doer;
	pdo_dbh_quote_func		quoter;
	// ...
	pdo_dbh_sql_scanner		scanner;
}

Each PDO driver defines already their own struct. Leaving the new member to NULL will make the driver use the default PDO scanner function. Otherwise a pointer to a custom scanner function will override the default when parsing queries. It's really as simple as that.

Research on String Literals, Identifiers, and Comments

MySQL

MySQL by default accepts both backslash escaped quotes and SQL standard. String literals can use single or double quotes. See the documentation (8.0 current is linked here, but 5.7 and 8.3 bahave the same).

The NO_BACKSLASH_ESCAPE SQL mode will disable recognition of the backslash as escape character. If set it will break SQL scanning regardless of this RFC.

The ANSI_QUOTES SQL mode switches from backtick to SQL standard double quotes for identifier literals.

Several comment types supported: -- , #, and /* */ (not nested).

PostgreSQL

Escaping has evolved during the years. Historically accepted “\'”, but started gradually transitioning to the SQL standard around 2005, going from memory. Since 9.1 (2011+) it accepts only single quoted string literals by default according to the SQL standard. See the documentation.

It also accepts escape string literals, e.g.

E'This \'word\' is a single quoted'

I had seen them being used when PostgreSQL started emitting warnings about backslashes in string literals, during the transition phase when magic_quotes_qpc was a thing, but I really see no place for this syntax in modern PHP. In fact this is the only syntax that is compatible with the current PDO SQL scanner, although I wouldn't recommend anybody to change their SQL string literals to this variant.

The behaviour of strings can also be manipulated in multiple ways through configuration variables, such as: standard_conforming_strings, backslash_quote, and escape_string_warning.

The RFC will ensure regular SQL standard string literals are supported on a Postgres instance with default configuration, while the rarer escape string literals will not be supported.

I don't believe trying to add support for escape string literal is worth the worth the effort. To me it would be best to note the new incompatibility in the UPGRADE file, especially considering that configuration variables could cause incompatibilities anyway. (DISCUSSION REQUIRED)

About comments, it follows the standard with -- and /* */ (w/ nested comments allowed)

SQLite

Follows the SQL standard, and requires double single quotes to represent the single quote in a string literal. See documentation.

It will however accept double quoted strings as string literals under some circumstances.

Double quoted identifiers, but also other funny syntaxes.

Almost SQL standard comments: -- and /* */ (not nested).

Oracle

SQL standard string literals, according to the documentation. It also supports alternative quoting, e.g. q'<literal>' and many other variants, which is out of scope for this RFC.

Double quoted identifiers.

Almost SQL standard comments: -- and /* */ (not nested).

SQL Server

SQL standard string literals, according to the documentation.

Depending on the QUOTED_IDENTIFIER setting, double quotes are either used for strings or identifiers.

Almost SQL standard comments: -- and /* */ (non nested).

Historical Background

A few years back I attempted to fix a bug and came up with with a pull request that could be considered a proof of concept for this RFC. The same topic was also brought up by others on internals, but no one had time to follow up with a proper RFC.

Backward Incompatible Changes

No expected BC breaks.

Users having applications that can work with multiple database engines should still be very careful and write portable queries, possibly using the PDO::quote() method when necessary instead of hardcoding strings containing escape characters.

Proposed PHP Version(s)

Next PHP 8.x, hopefully 8.4.

RFC Impact

To SAPIs

No impact

To Existing Extensions

Drivers outside of php-src might have to be modified if they make assumptions about the structure of enum pdo_param_type. They would have to be rebuilt since the PDO_DRIVER_API macro would be updated.

That has historically been allowed/expected in minor versions. The last time it happened was for PHP 7.2 with PHP RFC: Extended String Types For PDO.

To Opcache

No impact to opcache.

New Constants

No new constant.

php.ini Defaults

No php.ini changes

Open Issues

No open issues ATM.

Unaffected PHP Functionality

Anything not related to PDO scanning the SQL query for parameter placeholders.

Future Scope

If necessary, support “exotic” syntaxes in custom scanners for other database drivers.

Proposed Voting Choices

As per the voting RFC a yes/no vote with a 2/3 majority is needed for this proposal to be accepted.

Patches and Tests

PoC pull request. Once discussed, I will start with the actual implementation of the proposal.

References

rfc/pdo_driver_specific_parsers.1713516334.txt.gz · Last modified: 2024/04/19 08:45 by mbeccati