
PHP RFC: Add pack()/unpack() support for signed integers with specific endianness

Introduction

This RFC proposes adding support for signed integers with specific endianness to PHP's pack() and unpack() functions. This addresses GitHub issue #17068 and fixes the format letter choices in the current implementation (PR #19368).

Currently, PHP's pack/unpack functions support:

  • Machine-endian signed integers: s, l, q (2, 4, 8 bytes)
  • Machine-endian unsigned integers: S, L, Q (2, 4, 8 bytes)
  • Endian-specific unsigned integers: v/n, V/N, P/J (2, 4, 8 bytes)

However, there is no support for signed integers with specific endianness, forcing developers to use manual workarounds:

<?php
// Current manual approach for a signed little-endian 4-byte integer:
// read the low 16 bits unsigned (v), the third byte unsigned (C) and the
// top byte signed (c), then recombine them.
$unpackToSignedInt = static function (string $v): int {
    $unpacked = unpack('va/Cb/cc', $v);
    return ($unpacked['c'] << 24) | ($unpacked['b'] << 16) | $unpacked['a'];
};

// Proposed approach
$value = unpack('t', $binaryData)[1]; // signed little-endian 4-byte
?>
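
To make the gap concrete, the following minimal sketch shows what the existing format codes return for the two bytes FE FF, which encode -2 as a signed 16-bit little-endian integer. The variable names are illustrative only.

```php
<?php
// Two bytes that encode -2 as a signed 16-bit little-endian integer.
$bytes = "\xFE\xFF";

// 'v' is little-endian but unsigned, so the sign is lost.
$unsigned = unpack('v', $bytes)[1];

// 's' is signed but uses the host byte order, so the result is only
// -2 on little-endian machines.
$machine = unpack('s', $bytes)[1];

var_dump($unsigned, $machine);
```

Neither code gives a portable signed result, which is why workarounds like the closure above are needed today.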

Perl Specification Reference

According to the Perl documentation (https://perldoc.perl.org/functions/pack), Perl handles signed integers with endianness using modifier syntax:

s<   signed 16-bit, little-endian byte order
s>   signed 16-bit, big-endian byte order
l<   signed 32-bit, little-endian byte order  
l>   signed 32-bit, big-endian byte order
q<   signed 64-bit, little-endian byte order
q>   signed 64-bit, big-endian byte order

The Perl documentation states: “Starting with Perl 5.10.0, integer and floating-point formats... may all be followed by the '>' or '<' endianness modifiers to respectively enforce big- or little-endian byte-order.”

Why Perl's Approach Cannot Be Used in PHP

While Perl's specification provides the ideal reference, PHP cannot adopt Perl's exact syntax for several technical reasons:

1. Base Letters Already Taken

PHP already uses the base letters for machine-endian signed integers:

  • s = signed 16-bit (machine endian)
  • l = signed 32-bit (machine endian)
  • q = signed 64-bit (machine endian)

2. Parser Architecture Limitations

Perl uses modifier syntax where endianness indicators (<, >) follow the base format letter. PHP's pack format parser is designed around single-character format codes in switch/case statements, not compound expressions like s< or s>.
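
As a rough illustration of this limitation, the sketch below feeds a Perl-style format string to today's functions. It assumes PHP 8 behaviour, where pack() throws a ValueError for an unknown format code and unpack() treats the text after a code (and optional repeater) as the element name. The exact message text is printed rather than assumed.

```php
<?php
// pack(): the character after 's' is read as a new, unknown format code.
$packError = null;
try {
    pack('s<', 1);
} catch (ValueError $e) {
    $packError = $e->getMessage();
}
var_dump($packError);

// unpack(): everything after the code up to the next '/' is taken as the
// element name, so '<' ends up as an array key instead of a modifier.
$unpacked = unpack('s<', "\x01\x00");
$keys = array_keys($unpacked);
var_dump($keys);
```

If this reading of the parser is right, accepting < and > as modifiers in unpack() would also change the meaning of format strings that are already valid.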

3. Different Design Philosophy

PHP established a pattern of using completely different letters for endian-specific variants, rather than modifiers as in Perl:

  • Unsigned endian-specific: v/n (2-byte), V/N (4-byte), P/J (8-byte)

Current Implementation Problems

The current PR #19368 introduces arbitrary format letters that don't follow any logical pattern:

m/y  for signed 2-byte (little/big endian)
M/Y  for signed 4-byte (little/big endian)  
p/j  for signed 8-byte (little/big endian)

Issues with current choices:

  • No relationship to Perl's base letters (s, l, q)
  • No logical pairing with existing unsigned endian formats
  • Arbitrary selection that doesn't follow PHP's established patterns

Format Letter Analysis

Currently Used Letters (including those added by PR #19368):

Lowercase: a, c, d, e, f, g, h, i, j, l, m, n, p, q, s, v, x, y

Uppercase: A, C, E, G, H, I, J, L, M, N, P, Q, S, V, X, Y, Z

Available Letters:

Lowercase: b, k, o, r, t, u, w, z

Uppercase: B, D, F, K, O, R, T, U, W

Proposed Solution

Replace the current arbitrary letter choices with letters that follow PHP's established conventions and create logical relationships with existing formats:

Proposed Format Letters:

  • w/W for signed 2-byte (little/big endian)
  • t/T for signed 4-byte (little/big endian)
  • r/R for signed 8-byte (little/big endian)

Rationale:

  • Follows the convention of PHP's existing float formats (g/G, e/E): lowercase = little-endian, uppercase = big-endian
  • Systematic approach: Creates consistent pairs rather than arbitrary letter choices
  • Available letters: All proposed letters are currently unused
  • Closest to Perl's intent: While we can't use Perl's exact s/l/q base letters (already taken), these letters provide a systematic alternative

Comparison Tables

Perl vs PHP Approaches:

Perl Specification       Current PR (Wrong)   Proposed Solution
s< (signed 2-byte LE)    m                    w
s> (signed 2-byte BE)    y                    W
l< (signed 4-byte LE)    M                    t
l> (signed 4-byte BE)    Y                    T
q< (signed 8-byte LE)    p                    r
q> (signed 8-byte BE)    j                    R

PHP Format Letter Organization:

Type          2-byte         4-byte         8-byte
Unsigned LE   v              V              P
Unsigned BE   n              N              J
Signed LE     w (proposed)   t (proposed)   r (proposed)
Signed BE     W (proposed)   T (proposed)   R (proposed)
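
The pairing of signed and unsigned rows in the organization table suggests how the proposed codes could be emulated in userland until they exist. The following is a minimal sketch only, assuming a 64-bit PHP build. The function name unpack_signed_sketch() is hypothetical and is not part of PHP or PR #19368.

```php
<?php
/**
 * Hypothetical helper: decode one signed integer using a proposed letter
 * (w/W, t/T, r/R) by delegating to the existing unsigned format code.
 * Assumes 64-bit PHP integers.
 */
function unpack_signed_sketch(string $code, string $bytes): int
{
    // Proposed letter => [existing unsigned code, width in bits].
    $map = [
        'w' => ['v', 16], 'W' => ['n', 16],
        't' => ['V', 32], 'T' => ['N', 32],
        'r' => ['P', 64], 'R' => ['J', 64],
    ];
    if (!isset($map[$code])) {
        throw new ValueError("Unsupported format code: $code");
    }
    [$unsignedCode, $bits] = $map[$code];

    $value = unpack($unsignedCode, $bytes)[1];
    if ($bits === 64) {
        // A 64-bit value already wraps into PHP's signed integer range.
        return $value;
    }
    // Sign-extend: values with the top bit set represent negatives.
    $signBit = 1 << ($bits - 1);
    return $value >= $signBit ? $value - (1 << $bits) : $value;
}

var_dump(unpack_signed_sketch('w', "\xFE\xFF"));         // little-endian -2
var_dump(unpack_signed_sketch('T', "\x80\x00\x00\x00")); // big-endian minimum
```

A native implementation would avoid the lookup and the extra arithmetic, and would keep format strings self-describing.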

Platform Considerations

32-bit Platform Behavior:

On 32-bit platforms, 8-byte format codes (r/R) will throw a ValueError with the message “64-bit format codes are not available for 32-bit versions of PHP”, consistent with existing behavior for q/Q/P/J.

<?php
// On 32-bit platforms
try {
    pack('r', 1);
} catch (ValueError $e) {
    echo $e->getMessage(); // "64-bit format codes are not available..."
}
?>
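
Code that must run on both 32-bit and 64-bit builds could check the integer width up front instead of catching the exception. This is a sketch under that assumption. The helper name supports_64bit_pack_codes() is hypothetical, and the existing q code stands in for the proposed r/R codes, which are not available yet.

```php
<?php
// Hypothetical guard: 64-bit format codes need 8-byte PHP integers.
function supports_64bit_pack_codes(): bool
{
    return PHP_INT_SIZE >= 8;
}

// Only build the 8-byte payload where the platform supports it.
$payload = supports_64bit_pack_codes() ? pack('q', -1) : null;

var_dump($payload === null ? null : strlen($payload));
```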

Backward Incompatible Changes

This change modifies the format letters introduced in PR #19368. Since that PR hasn't been released yet, there are no backward compatibility concerns for existing code.

The proposed letters (w, W, t, T, r, R) are currently unused in PHP's pack/unpack implementation.

Proposed PHP Version(s)

PHP 8.6 (next minor version)

Voting Choices

Add signed integer endianness support to pack()/unpack() with proposed format letters?
Real name       Yes   No
Final result:   0     0
This poll has been closed.

Implementation

The implementation is available in PR #19368, which requires updating the format letters from the current arbitrary choices to the proposed systematic approach outlined in this RFC.

Changes required in the pull request if this RFC is accepted:

  • Replace m with w, y with W
  • Replace M with t, Y with T
  • Replace p with r, j with R

References

rfc/pack-unpack-endianness-signed-integers-support.txt · Last modified: by alexandredaubois