====== PHP RFC: Add RFC 4648 compliant data encoding API ======

  * Version: 1.0
  * Date: 2025-06-15
  * Author: Ignace Nyamagana Butera, nyamsprod@gmail.com
  * Status: Draft
  * First Published at: https://wiki.php.net/rfc/data_encoding_api

===== Introduction =====

To improve interoperability between PHP and other programming languages, and to simplify data encoding in PHP, we propose adding to the core language the ability to encode and decode data using the RFC4648 family of algorithms (base16, base32 and base64). PHP currently supports only a limited subset of RFC4648. This RFC aims to provide full support for it and to give developers the encoding algorithms that are currently missing.

==== Downsides of the current approach ====

PHP provides partial support for Base64 via the ''base64_encode'' and ''base64_decode'' functions, but they do not provide:

  * support for the base64 URL alphabet
  * support for removing the padding character during encoding
  * support for constant-time generation of the encoded string.

PHP provides partial support for Base16 via the ''bin2hex'' and ''hex2bin'' functions, but they do not provide:

  * support for a strict decoding mechanism
  * support for constant-time generation of the encoded string.

PHP currently does not provide any Base32 feature. On top of the missing algorithm, the many PHP user-land packages all claim to support Base32 without explicitly stating which variant they implement. The situation becomes critical if your application relies on that encoding to handle data generated by other systems or by other programming languages.

This context makes data encoding in PHP more complex than it should be. The goal of this RFC is to add the encoding/decoding functionality described in RFC4648 to the PHP standard library. The RFC also introduces a native [[https://github.com/paragonie/constant_time_encoding/|constant time encoding implementation]] of the feature to tackle security challenges in the data encoding field. Once implemented, the feature would improve and simplify data encoding usage in PHP while improving interoperability with other programming languages and security within the PHP ecosystem.

===== Proposal =====

A new, always available ''Encoding'' namespace is to be added to the standard library. The namespace would contain classes and functions for encoding and decoding strings or byte sequences. For this purpose, the following internal classes and functions are added:


namespace Encoding {
    class EncodingException extends \Exception
    {
    }

    class UnableToDecodeException extends EncodingException
    {
    }

    enum Base16Alphabet
    {
        case Upper;
        case Lower;
    }

    enum Base32Alphabet
    {
        case Ascii;
        case Hex;
        case Crockford;
        case Z;
    }

    enum Base64Alphabet
    {
        case Standard;
        case SafeUrl;
        case Imap;
    }

    enum PaddingMode
    {
        case AlphabetControlled;
        case StripPadding;
        case PreservePadding;
    }

    enum DecodingMode
    {
        case Lenient;
        case Strict;
    }

    enum TimingMode
    {
        case Unprotected;
        case ConstantTime;
    }
}

The following Base16 functions are added:


namespace Encoding {
    function base16_encode(
        string $decoded,
        Base16Alphabet $alphabet = Base16Alphabet::Upper,
        TimingMode $timingMode = TimingMode::Unprotected,
    ): string;

    /**
     * @throws UnableToDecodeException
     */
    function base16_decode(
        string $encoded,
        DecodingMode $decodingMode = DecodingMode::Strict,
        TimingMode $timingMode = TimingMode::Unprotected,
    ): string;
}
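
As a rough user-land illustration of what the Base16 functions add over ''bin2hex'' and ''hex2bin'', the sketch below produces uppercase output and rejects invalid input with an exception instead of a warning. The function names are hypothetical and not part of the proposal, and treating strict decoding as "uppercase only" is an assumption, since the RFC text does not say how ''base16_decode'' handles letter case.

```php
<?php
// Hypothetical user-land helpers; the names are illustrative only.

function sketch_base16_encode(string $decoded, bool $upper = true): string
{
    // bin2hex() always emits lowercase letters; RFC 4648 specifies uppercase.
    $hex = bin2hex($decoded);

    return $upper ? strtoupper($hex) : $hex;
}

function sketch_base16_decode_strict(string $encoded): string
{
    // Assumption: strict mode accepts only an even number of 0-9 / A-F bytes.
    if (strlen($encoded) % 2 !== 0 || preg_match('/\A[0-9A-F]*\z/', $encoded) !== 1) {
        throw new InvalidArgumentException('Invalid uppercase base16 input.');
    }

    // The input was validated above, so hex2bin() cannot fail here.
    return (string) hex2bin($encoded);
}
```

With these helpers, ''sketch_base16_encode('foobar')'' yields the RFC 4648 test vector ''666F6F626172'', while lowercase or odd-length input to the strict decoder throws.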

The following Base32 functions are added:


namespace Encoding {
    function base32_encode(
        string $decoded,
        Base32Alphabet $alphabet = Base32Alphabet::Ascii,
        PaddingMode $paddingMode = PaddingMode::AlphabetControlled,
        TimingMode $timingMode = TimingMode::Unprotected,
    ): string;

    /**
     * @throws UnableToDecodeException
     */
    function base32_decode(
        string $encoded,
        Base32Alphabet $alphabet = Base32Alphabet::Ascii,
        DecodingMode $decodingMode = DecodingMode::Strict,
        TimingMode $timingMode = TimingMode::Unprotected,
    ): string;
}
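
Because PHP has no native Base32 today, the following minimal user-land sketch shows the kind of bit regrouping that ''base32_encode'' would perform for the RFC4648 standard alphabet. The function name and the boolean ''$pad'' parameter are assumptions made for illustration; the proposed API uses enums instead.

```php
<?php
// Illustrative user-land encoder for the RFC 4648 standard base32 alphabet.
// The function name is hypothetical and not part of the proposed API.

function sketch_base32_encode(string $decoded, bool $pad = true): string
{
    $alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ234567';
    $encoded = '';
    $buffer = 0; // bits read from the input but not yet emitted
    $bits = 0;   // number of valid bits currently held in $buffer

    $length = strlen($decoded);
    for ($i = 0; $i < $length; ++$i) {
        $buffer = ($buffer << 8) | ord($decoded[$i]);
        $bits += 8;
        // Emit one character for every complete 5-bit group.
        while ($bits >= 5) {
            $bits -= 5;
            $encoded .= $alphabet[($buffer >> $bits) & 0x1F];
        }
        // Keep only the leftover bits so $buffer stays small.
        $buffer &= (1 << $bits) - 1;
    }

    if ($bits > 0) {
        // Left-align the remaining bits inside a final 5-bit group.
        $encoded .= $alphabet[($buffer << (5 - $bits)) & 0x1F];
    }

    if ($pad) {
        // RFC 4648 pads the output to a multiple of 8 characters.
        $encoded .= str_repeat('=', (8 - strlen($encoded) % 8) % 8);
    }

    return $encoded;
}
```

The sketch reproduces the RFC 4648 test vectors, for example ''MZXW6YTBOI======'' for ''foobar''.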

The following Base64 functions are added:


namespace Encoding {
    function base64_encode(
        string $decoded,
        Base64Alphabet $alphabet = Base64Alphabet::Standard,
        PaddingMode $paddingMode = PaddingMode::AlphabetControlled,
        TimingMode $timingMode = TimingMode::Unprotected,
    ): string;

    /**
     * @throws UnableToDecodeException
     */
    function base64_decode(
        string $encoded,
        Base64Alphabet $alphabet = Base64Alphabet::Standard,
        DecodingMode $decodingMode = DecodingMode::Strict,
        TimingMode $timingMode = TimingMode::Unprotected,
    ): string;
}
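
The proposed functions do not exist in any released PHP version, so the sketch below only emulates the encoding side in user-land (PHP 8.1+) to show how the enum options could combine. All ''Sketch''-prefixed names are hypothetical stand-ins, the IMAP alphabet is left out, and the rule that ''AlphabetControlled'' keeps padding for ''Standard'' and drops it for ''SafeUrl'' is an assumption rather than something the RFC text states.

```php
<?php
// User-land emulation only; enum and function names are hypothetical.

enum SketchBase64Alphabet
{
    case Standard;
    case SafeUrl;
}

enum SketchPaddingMode
{
    case AlphabetControlled;
    case StripPadding;
    case PreservePadding;
}

function sketch_base64_encode(
    string $decoded,
    SketchBase64Alphabet $alphabet = SketchBase64Alphabet::Standard,
    SketchPaddingMode $paddingMode = SketchPaddingMode::AlphabetControlled,
): string {
    $encoded = base64_encode($decoded);

    if ($alphabet === SketchBase64Alphabet::SafeUrl) {
        // Swap the two characters that differ between the alphabets.
        $encoded = strtr($encoded, '+/', '-_');
    }

    // Assumption: AlphabetControlled pads Standard output and strips SafeUrl output.
    $strip = match ($paddingMode) {
        SketchPaddingMode::StripPadding => true,
        SketchPaddingMode::PreservePadding => false,
        SketchPaddingMode::AlphabetControlled => $alphabet === SketchBase64Alphabet::SafeUrl,
    };

    return $strip ? rtrim($encoded, '=') : $encoded;
}
```

Using enums here keeps call sites readable: ''sketch_base64_encode($data, SketchBase64Alphabet::SafeUrl)'' states its intent without boolean flags.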

==== Function-based API ====

The RFC chooses a function-based API over a class-based API for the following reasons:

  * most PHP scripts use encoding in a one-off fashion; a class-based API would feel overly complicated for a quick encode or decode operation;
  * using functions emphasises that encoding and decoding have no internal state or side effects;
  * creating a class-based API on top of a function-based API, in user-land, is trivial.

The RFC uses enum-based options rather than booleans or arbitrary string values, to improve readability and developer experience when using the new API.

The general signature chosen for each algorithm is the following:

For encoding:


function algo_encode(string $decoded, Enum ...$options): string;

For decoding:


/**
 * @throws UnableToDecodeException
 */
function algo_decode(string $encoded, Enum ...$options): string;

where __algo__ is the name of the underlying encoding algorithm.

When decoding fails, an ''UnableToDecodeException'' is thrown. In lenient mode some tolerance is applied to the encoded string, but decoding still throws an ''UnableToDecodeException'' if the string remains invalid after the lenient corrections have been applied.

==== Parameters ====

=== String Parameters ===

  * **$decoded**: the string to encode;
  * **$encoded**: the string to decode.

==== Options ====

=== Alphabets support ===

== Base16 Alphabets ==





Base16 has a single alphabet, but its letters can be encoded in uppercase or lowercase.
To be compliant with RFC4648, the default value will be ''Base16Alphabet::Upper''.


== Base32 Alphabets ==





Base32 can be used with different alphabets. The most commonly used alphabets will be supported out of the box:

  * ASCII: the RFC4648 Standard alphabet (case sensitive)
  * HEX: the RFC4648 Hexadecimal alphabet (case sensitive)
  * Crockford: [[https://www.crockford.com/base32.html|the Douglas Crockford alphabet]] (case insensitive)
  * Z: the [[https://philzimmermann.com/docs/human-oriented-base-32-encoding.txt|Z-base-32 alphabet]] (case sensitive)

The default value will be ''Base32Alphabet::Ascii''.
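
Since every Base32 alphabet maps the same 32 values to different characters, output produced with one alphabet can be transcoded to another by a character-for-character substitution. The sketch below illustrates this with the four alphabets listed above; the constant and function names are hypothetical, and Crockford's optional check symbols and its decoding aliases (such as ''O'' for ''0'') are deliberately ignored.

```php
<?php
// The four alphabets, indexed by 5-bit value. Names are illustrative only.
const SKETCH_BASE32_ALPHABETS = [
    'ascii'     => 'ABCDEFGHIJKLMNOPQRSTUVWXYZ234567',
    'hex'       => '0123456789ABCDEFGHIJKLMNOPQRSTUV',
    'crockford' => '0123456789ABCDEFGHJKMNPQRSTVWXYZ',
    'z'         => 'ybndrfg8ejkmcpqxot1uwisza345h769',
];

function sketch_base32_transcode(string $encoded, string $from, string $to): string
{
    // strtr() replaces each byte independently, so overlapping alphabets are safe.
    // Padding characters ('=') are not part of any alphabet and pass through unchanged.
    return strtr($encoded, SKETCH_BASE32_ALPHABETS[$from], SKETCH_BASE32_ALPHABETS[$to]);
}
```

For instance, transcoding the RFC 4648 vector ''MZXW6YTBOI======'' from ''ascii'' to ''hex'' gives ''CPNMUOJ1E8======'', which matches the RFC's base32hex vector for ''foobar''.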

== Base64 Alphabets ==




Base64 can be used with different alphabets. The most commonly used alphabets will be supported out of the box. **All Base64 alphabets are case sensitive.**

  * Standard: the RFC4648 Standard alphabet
  * SafeUrl: the RFC4648 URL and filename safe alphabet
  * Imap: the [[https://datatracker.ietf.org/doc/html/rfc3501#section-5.1.3|RFC3501]] IMAP variant

The default value will be ''Base64Alphabet::Standard''.

=== Padding presence during encoding ===




Base32 and Base64 use a padding character. The padding character has a technical role: it ensures that the encoded output represents complete blocks of data and allows the decoder to reconstruct the original binary input unambiguously. To improve readability, however, some alphabets omit it from their encoded output. This option tells the encoding mechanism whether padding must be present at the end of the encoded string.

By default the padding mode is ''PaddingMode::AlphabetControlled'', meaning the padding character will be present only if it is
mandatory for the chosen alphabet.

=== Decoding Mode ===




For all functions, you MUST be able to specify how decoding is performed. By default, **$decodingMode** is ''DecodingMode::Strict'' and the algorithm strictly follows the rules set by the RFC.
You can also set **$decodingMode** to ''DecodingMode::Lenient''. In this mode, several adjustments are applied to the **$encoded** string before the actual decoding process:

  * When applicable, the **$encoded** string is converted into the correct character casing.
  * When applicable, the padding length is corrected to allow correct decoding.

Regardless of the mode:

  * The alphabet is treated as a sequence of byte values without any special treatment for multi-byte UTF-8.
  * The following characters: ''\r'', ''\t'', ''\n'' and the space character are all ignored during the decoding process.
  * The **$encoded** string should be protected against the presence of ''NULL'' bytes.

The lenient process, while available, is deliberately restrictive in order to take into account [[https://datatracker.ietf.org/doc/html/rfc4648.html#section-12|the security considerations covered in section 12 of RFC 4648]].

By default the decoding mode is ''DecodingMode::Strict''.
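
The rules above can be approximated in user-land on top of the native ''base64_decode''. This is a minimal sketch for the standard Base64 alphabet only: the function name and the boolean ''$lenient'' flag are hypothetical, rejecting ''NULL'' bytes outright is one possible reading of "protection", and requiring full padding in strict mode is an assumption.

```php
<?php
// Hypothetical user-land approximation of strict and lenient decoding.

function sketch_base64_decode(string $encoded, bool $lenient = false): string
{
    // Regardless of mode: refuse NUL bytes, then ignore \r, \t, \n and spaces.
    if (str_contains($encoded, "\0")) {
        throw new UnexpectedValueException('NUL byte found in encoded string.');
    }
    $encoded = str_replace(["\r", "\t", "\n", ' '], '', $encoded);

    if ($lenient) {
        // Correct the padding length so that the input forms whole 4-character blocks.
        $encoded = rtrim($encoded, '=');
        $encoded .= str_repeat('=', (4 - strlen($encoded) % 4) % 4);
    } elseif (strlen($encoded) % 4 !== 0) {
        // Assumption: strict mode requires fully padded input. The native
        // base64_decode() accepts missing padding even with $strict = true.
        throw new UnexpectedValueException('Invalid padding length.');
    }

    // The native strict flag rejects characters outside the alphabet.
    $decoded = base64_decode($encoded, true);
    if ($decoded === false) {
        throw new UnexpectedValueException('Unable to decode the string.');
    }

    return $decoded;
}
```

Even in lenient mode the sketch still throws when the corrected string cannot be decoded, mirroring the behaviour described for ''UnableToDecodeException''.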

=== Timing generation mode ===




Sometimes, for security reasons, you MAY want to use a more secure algorithm to avoid leaking information during the
encoding/decoding process. Because using a different algorithm MAY result in a different processing time, an optional enum is proposed to opt in to the changed process. For now, a constant-time algorithm is added alongside the standard process, which does not protect against [[https://blog.ircmaxell.com/2014/11/its-all-about-time.html|timing attacks]]. Depending on the implementation, this option may not be available for every algorithm.

By default the timing mode is ''TimingMode::Unprotected''.
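
To give an idea of what a constant-time variant looks like, the sketch below encodes lowercase hexadecimal without table lookups or data-dependent branches, using arithmetic in the style of the user-land library linked in the introduction. The function name is hypothetical, and whether such code is truly constant-time depends on the PHP runtime and the hardware, so it should be read as an illustration of the technique rather than a security guarantee.

```php
<?php
// Branch-free lowercase hex encoder; the function name is illustrative only.

function sketch_hex_encode_constant_time(string $decoded): string
{
    $hex = '';
    $length = strlen($decoded);

    for ($i = 0; $i < $length; ++$i) {
        $byte = ord($decoded[$i]);
        $high = $byte >> 4;
        $low = $byte & 0x0F;

        // For a nibble n below 10, (n - 10) >> 8 is -1, so the mask yields -39
        // and 87 + n - 39 lands on '0'..'9'. For n from 10 to 15 the shifted
        // term is 0 and 87 + n lands on 'a'..'f'. No branch depends on the data.
        $hex .= pack(
            'CC',
            87 + $high + ((($high - 10) >> 8) & ~38),
            87 + $low + ((($low - 10) >> 8) & ~38)
        );
    }

    return $hex;
}
```

The output is identical to that of ''bin2hex''; only the way each character is derived differs.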

==== Usage examples ====

Using the ''Encoding\base64_encode'' and ''Encoding\base64_decode'' functions




Using the ''Encoding\base16_encode'' and ''Encoding\base16_decode'' functions




==== In other Languages ====

=== Go ===

In its standard library, Go supports [[https://pkg.go.dev/encoding@go1.24.4|all RFC4648 algorithms as well as the ascii85 format]].

=== Python ===

Python has updated its encoding support and now supports [[https://docs.python.org/3/library/base64.html|all RFC4648 algorithms as well as the ascii85 format]]. Python also has extensive support for many base85 alphabets.

=== JavaScript/NodeJs ===

Supports neither base32 nor base85 natively.

=== C# ===

Only supports base64 natively (not base64 URL).

=== Java ===

Only supports base64 natively.

===== Open questions =====

  * Should we allow users to specify their own alphabet for base32?
  * Should we allow users to specify their own padding character where applicable?


===== Backward Incompatible Changes =====

The namespace **''Encoding''** is now reserved.

===== Proposed PHP Version(s) =====

The next minor PHP version (PHP 8.5).

===== RFC Impact =====

==== To SAPIs ====

None.

==== To Existing Extensions ====

None.

==== To Opcache ====

None.


===== Implementation =====

Tim Düsterhus has volunteered to do the implementation, but will check whether or not a constant time implementation is possible for all combinations of options.

===== Future Scope =====

  * Add support for [[https://bitcoinwiki.org/wiki/base58|base58]] used with Bitcoin
  * Add support for [[https://en.wikipedia.org/wiki/Ascii85|ascii85]] used in PDF format and by Git

===== References =====

  * RFC 4648: https://datatracker.ietf.org/doc/html/rfc4648
  * Douglas Crockford base32: https://www.crockford.com/base32.html