
PHP RFC: Kill proprietary CSV escaping mechanism

Introduction

For many years, we have been receiving bug reports about the strange behavior of the $escape parameter of our CSV writing and reading functions (fputcsv, fgetcsv etc.); the latest was reported today. Apparently, this escaping mechanism does more harm than good.

Although CSV is still a widespread data exchange format, it has never been officially standardized. There is, however, the “informational” RFC 4180, which has no notion of escape characters, but rather defines escaped fields as strings enclosed in double-quotes, where contained double-quotes have to be doubled. While this concept is supported by PHP's implementation ($enclosure), the $escape parameter sometimes interferes, so that fgetcsv() may be unable to correctly parse externally generated CSV, and fputcsv() sometimes generates non-compliant CSV. Even a roundtrip (fgetcsv(fputcsv(…))) may fail.
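The interference on the writing side can be illustrated with a minimal sketch. The helper csv_line() is hypothetical and only wraps fputcsv() on an in-memory stream; the delimiter, enclosure and escape are passed explicitly so that the example does not depend on default values.

```php
<?php
// Hypothetical helper (not part of PHP): write one record with fputcsv()
// to an in-memory stream and return the raw CSV line.
function csv_line(array $fields, string $escape): string
{
    $stream = fopen('php://memory', 'w+');
    fputcsv($stream, $fields, ',', '"', $escape);
    rewind($stream);
    $line = stream_get_contents($stream);
    fclose($stream);
    return $line;
}

// One field consisting of: a, backslash, double-quote, b.
$field = 'a\\"b';

// Backslash as $escape: the double-quote after the backslash is not doubled.
$line = csv_line([$field], '\\');
echo $line; // "a\"b"
```

RFC 4180 would require the contained double-quote to be doubled, so a consumer that follows it would see the enclosed field end right after the backslash.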

While in many cases passing “\0” as the $escape parameter will yield the desired results, this won't work if someone is writing/reading binary CSV files, may cause issues with some non-ASCII-compatible encodings, and is generally to be regarded as a hack.

Proposal

Since some may rely on the current behavior (and maybe explicitly work around it), we cannot simply drop support for the $escape parameter. Instead, the author proposes a stepwise process to preserve backward compatibility (BC) as well as possible:

  1. PHP 7.4: allow passing an empty string as the $escape argument, which deactivates the escaping mechanism
  2. ?: deprecate passing a non-empty string as the $escape argument
  3. PHP 8: change the default value of $escape to an empty string
  4. ?: deprecate passing an explicit $escape argument at all
  5. PHP 9: remove the $escape parameter altogether
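Step 1 could be exercised roughly as follows. This is a sketch that assumes a PHP version in which an empty string is accepted as $escape; the sample record is made up for illustration.

```php
<?php
// Sample record (illustrative only): one field with a backslash before a
// double-quote, one field ending in a backslash, one plain field.
$record = ['a\\"b', 'x\\', 'plain'];

$stream = fopen('php://memory', 'w+');

// An empty string as $escape deactivates the escaping mechanism,
// so contained double-quotes are always doubled on writing.
fputcsv($stream, $record, ',', '"', '');
rewind($stream);

// Read the line back with the same settings (0 = no line length limit).
$read = fgetcsv($stream, 0, ',', '"', '');
fclose($stream);

var_dump($read === $record);
```

With the escaping mechanism deactivated, the backslash is an ordinary character in both directions, so the record is expected to survive the roundtrip unchanged.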

The affected functions are fputcsv(), fgetcsv() and str_getcsv(), and also the ::setCsvControl(), ::getCsvControl(), ::fputcsv(), and ::fgetcsv() methods of SplFileObject, as well as any related functionality that might be introduced during the stepwise process.
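For the SplFileObject methods, the same idea might look like the sketch below. It again assumes a PHP version in which the empty $escape is accepted, and it uses SplTempFileObject only to avoid touching the filesystem.

```php
<?php
// Temporary file object, so the example needs no real file.
$file = new SplTempFileObject();

// Deactivate the escaping mechanism for subsequent CSV calls on this object.
$file->setCsvControl(',', '"', '');

// Illustrative record: one field with double-quotes, one with a backslash.
$file->fputcsv(['say "hi"', 'back\\slash']);

$file->rewind();
$row = $file->fgetcsv();

var_dump($row);
var_dump($file->getCsvControl());
```

Setting the control characters once on the object keeps the individual ::fputcsv() and ::fgetcsv() calls free of explicit CSV arguments.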

To facilitate this, the internal APIs php_fgetcsv() and php_fputcsv() will be adapted accordingly, i.e. their escape_char parameter type will be changed from char to int where -1 will disable the escaping mechanism, and finally this parameter will be removed.

Besides bringing our CSV support more in line with other CSV processors, we also shorten the rather lengthy parameter lists of the respective functions.

Backward Incompatible Changes

See above.

Proposed PHP Version(s)

See above.

New Constants

Temporarily the *internal* macro PHP_CSV_NO_ESCAPE (which expands to -1) will be introduced in file.h.

Open Issues

None, yet.

Future Scope

The CSV reading and writing functionality might be extended to support arbitrary character encodings, or respective alternatives might be introduced in the MBString extension. This is not the subject of this RFC, though.

Proposed Voting Choices

Whether we follow the proposed stepwise process as outlined above, or not. To be accepted, the vote requires a 2/3 majority.

Patches and Tests

A preliminary pull request implementing support for the empty $escape parameter is available.

Implementation

After the project is implemented, this section should contain

  1. the version(s) it was merged into
  2. a link to the git commit(s)
  3. a link to the PHP manual entry for the feature
  4. a link to the language specification section (if any)

References

Rejected Features

None, yet.

rfc/kill-csv-escaping.txt · Last modified: 2018/12/02 15:19 by cmb