PHP RFC: Add pack()/unpack() support for endianness modifiers on integers

Introduction

This RFC proposes adding support for signed integers with specific endianness to PHP's pack() and unpack() functions using Perl's endianness modifier syntax. This addresses GitHub issue #17068.

Currently, PHP's pack/unpack functions support:

  • Machine-endian signed integers: s, l, q (2, 4, 8 bytes)
  • Machine-endian unsigned integers: S, L, Q (2, 4, 8 bytes)
  • Endian-specific unsigned integers: v/n, V/N, P/J (2, 4, 8 bytes)

However, there is no support for signed integers with specific endianness, forcing developers to use manual workarounds:

<?php
// Current manual approach for signed little-endian 4-byte integer
$unpackToSignedInt = static function (string $v) {
    $unpacked = unpack('va/Cb/cc', $v);
    return ($unpacked['c'] << 24) | ($unpacked['b'] << 16) | $unpacked['a'];
};
 
// Proposed approach with modifiers
$value = unpack('l<', $binaryData)[1]; // signed little-endian 4-byte
?>
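
The manual approach generalises: read the bytes with the unsigned fixed-endian code, then apply a two's-complement sign correction. The following is a minimal sketch of that idea for current PHP versions, assuming a 64-bit build; the helper name unpackSignedInt and its parameters are hypothetical and not part of this RFC.

```php
<?php
// Hypothetical userland helper (not part of this RFC): decodes a signed
// 16- or 32-bit integer with explicit byte order on current PHP versions.
// Assumes a 64-bit build, so the unsigned intermediate value fits in an int.
function unpackSignedInt(string $bytes, int $bits, bool $littleEndian): int
{
    $codes = [
        16 => ['v', 'n'], // [little-endian, big-endian] unsigned codes
        32 => ['V', 'N'],
    ];
    if (!isset($codes[$bits])) {
        throw new InvalidArgumentException("Unsupported width: $bits");
    }
    $code = $codes[$bits][$littleEndian ? 0 : 1];
    $unsigned = unpack($code, $bytes)[1];

    // Two's-complement correction: values with the top bit set are negative.
    $signBit = 1 << ($bits - 1);
    return $unsigned >= $signBit ? $unsigned - (1 << $bits) : $unsigned;
}
```

For example, unpackSignedInt("\xFE\xFF", 16, true) returns -2. Every caller has to carry a helper like this today, which is the boilerplate the modifiers are meant to remove.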

Perl Specification Reference

According to the Perl documentation (https://perldoc.perl.org/functions/pack), Perl handles signed integers with endianness using modifier syntax:

s<   signed 16-bit, little-endian byte order
s>   signed 16-bit, big-endian byte order
l<   signed 32-bit, little-endian byte order
l>   signed 32-bit, big-endian byte order
q<   signed 64-bit, little-endian byte order
q>   signed 64-bit, big-endian byte order
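
The byte layouts these codes describe can be previewed with the existing unsigned codes, because a signed value and its unsigned counterpart share the same two's-complement bit pattern. This is only an illustrative sketch; the value -2 is chosen so that the two byte orders are visibly different.

```php
<?php
// Preview of the byte layouts the modifiers describe, produced with the
// existing unsigned codes. pack() keeps only the low-order bytes of its
// argument, so a negative value yields its two's-complement pattern.
$value = -2;

echo bin2hex(pack('v', $value)), "\n"; // 16-bit little-endian, the layout of s<
echo bin2hex(pack('n', $value)), "\n"; // 16-bit big-endian, the layout of s>
echo bin2hex(pack('V', $value)), "\n"; // 32-bit little-endian, the layout of l<
echo bin2hex(pack('N', $value)), "\n"; // 32-bit big-endian, the layout of l>
```

This prints feff, fffe, feffffff and fffffffe. Packing is therefore already possible with the unsigned letters; the gap is on the unpack() side, where the sign has to be restored by hand.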

Proposed Solution

This RFC proposes adding endianness modifiers (< and >) to PHP's pack/unpack functions, following Perl's established syntax.

Proposed Syntax:

  • s< / s> for signed 2-byte (little/big endian)
  • l< / l> for signed 4-byte (little/big endian)
  • q< / q> for signed 8-byte (little/big endian)
  • S< / S> for unsigned 2-byte (little/big endian)
  • L< / L> for unsigned 4-byte (little/big endian)
  • Q< / Q> for unsigned 8-byte (little/big endian)

Advantages of this approach:

  • Consistency with Perl: Maintains compatibility with Perl's well-established syntax, reducing cognitive load for developers working across languages
  • Intuitive semantics: The < and > symbols visually suggest byte order direction
  • Backward compatibility: Modifiers are opt-in; existing code continues to work unchanged
  • No arbitrary choices: Unlike inventing new format letters, this leverages proven syntax with 15+ years of usage in Perl
  • Minimal implementation: The proof of concept shows that the change is straightforward and requires no rewrite of the format parser

Example Usage:

<?php
// Little-endian signed integers
$data = pack('s<l<q<', -258, -16909060, -72340172838076673);
 
// Big-endian signed integers
$data = pack('s>l>q>', -258, -16909060, -72340172838076673);
 
// Unsigned integers with explicit endianness
$data = pack('S<L>Q<', 258, 16909060, 72340172838076673);
 
// Mixed endianness (little-endian 16-bit, big-endian 32-bit)
$data = pack('s<2l>2', 258, -2, 16909060, -16909060);
 
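// Unpacking with modifiers (each payload is packed with the matching format)
$data = pack('s<l<', -258, -16909060);
[$int16_le, $int32_le] = array_values(unpack('s<a/l<b', $data));
$data = pack('S>L<', 258, 16909060);
[$uint16_be, $uint32_le] = array_values(unpack('S>a/L<b', $data));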
?>

Error Handling:

Using a modifier with an unsupported format letter throws a ValueError, preventing silent failures:

<?php
// Using modifiers with unsupported format letters
pack('a<', 'test'); // ValueError: Endianness modifier '<' is not supported for format code 'a'
pack('Z>', 'test'); // ValueError: Endianness modifier '>' is not supported for format code 'Z'
 
// Using modifiers on formats with inherent endianness
pack('v<', 42); // ValueError: Endianness modifier '<' cannot be applied to format code 'v' which already has inherent endianness
pack('N>', 42); // ValueError: Endianness modifier '>' cannot be applied to format code 'N' which already has inherent endianness
?>
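
The two rejection rules can be sketched in userland as follows. This is an assumption-laden illustration, not the proposed implementation: the function name is hypothetical, the parsing is simplified to pack() format strings (no unpack() field names), and the real check would live in the C implementation.

```php
<?php
// Hypothetical userland check mirroring the proposed rules for pack() format
// strings. The function name and the simplified parsing are assumptions; the
// real check would live inside the C implementation of pack()/unpack().
function assertValidEndianModifiers(string $format): void
{
    $modifiable = 'slqSLQ'; // machine-endian integer codes that accept < or >
    $inherent   = 'vnVNPJ'; // codes whose byte order is already fixed

    for ($i = 0, $len = strlen($format); $i < $len; $i++) {
        $modifier = $format[$i];
        if ($modifier !== '<' && $modifier !== '>') {
            continue;
        }
        // The modifier applies to the format code immediately before it.
        $code = $i > 0 ? $format[$i - 1] : '';
        if ($code !== '' && str_contains($inherent, $code)) {
            throw new ValueError("Endianness modifier '$modifier' cannot be applied to format code '$code' which already has inherent endianness");
        }
        if ($code === '' || !str_contains($modifiable, $code)) {
            throw new ValueError("Endianness modifier '$modifier' is not supported for format code '$code'");
        }
    }
}
```

Calling assertValidEndianModifiers('s<2l>2') returns normally, while 'a<' and 'N>' raise the two messages shown above.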

Modifier Restrictions

Following Perl's design, endianness modifiers are prohibited on format codes that already have inherent endianness. This prevents ambiguity about which endianness takes precedence.

Formats that CANNOT use modifiers:

  • v/n - 2-byte unsigned with inherent endianness (little/big)
  • V/N - 4-byte unsigned with inherent endianness (little/big)
  • P/J - 8-byte unsigned with inherent endianness (little/big)

Perl explicitly prohibits modifiers on inherent-endian formats to avoid conflicts. For example, attempting v< in Perl raises: “'<' allowed only after types sSiIlLqQjJfFdDpP( in pack”.

When to use modifiers vs inherent formats:

<?php
// For SIGNED integers, only modifiers work
pack('s<', -42);  // Signed little-endian 16-bit - no equivalent format exists
pack('l>', -42);  // Signed big-endian 32-bit - no equivalent format exists
 
// For UNSIGNED integers, both work
pack('S<', 42) === pack('v', 42);   // Both: unsigned 2-byte little-endian
pack('S>', 42) === pack('n', 42);   // Both: unsigned 2-byte big-endian
pack('L<', 42) === pack('V', 42);   // Both: unsigned 4-byte little-endian
pack('L>', 42) === pack('N', 42);   // Both: unsigned 4-byte big-endian
?>
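
Because each unsigned modifier form has an exact single-letter equivalent, a format string written with the proposed syntax can be rewritten for current PHP versions. A minimal sketch, assuming pack() format strings only; the function name is hypothetical.

```php
<?php
// Hypothetical shim: rewrites the proposed unsigned modifier codes to the
// existing fixed-endian letters so a format string runs on current PHP.
// Signed codes (s<, l>, ...) have no such equivalent and are left untouched.
function translateUnsignedModifiers(string $format): string
{
    return strtr($format, [
        'S<' => 'v', 'S>' => 'n',
        'L<' => 'V', 'L>' => 'N',
        'Q<' => 'P', 'Q>' => 'J',
    ]);
}

$format = translateUnsignedModifiers('S<L>'); // becomes 'vN'
echo bin2hex(pack($format, 258, 16909060)), "\n";
```

This prints 020101020304: the 16-bit value in little-endian order followed by the 32-bit value in big-endian order.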

Considered Alternatives

Alternative 1: New Format Letters

Initially, new format letters were proposed: w/W (2-byte), t/T (4-byte), r/R (8-byte).

This was rejected because:

  • Needless divergence from Perl: the letters are arbitrary, with no logical relationship to the underlying integer types or to Perl's base letters
  • Unlike directional modifiers (</>), letter pairs don't visually convey endianness

Alternative 2: Creating a New Function

A completely new function for binary packing could be designed with modern syntax.

This was rejected as well because:

  • This RFC aims to complete pack/unpack functionality, not replace it
  • pack/unpack are well-established; adding modifiers is the minimal change needed

Comparison Tables

Perl vs PHP (Proposed):

Perl Specification         Proposed PHP Implementation
s< (signed 2-byte LE)      s<
s> (signed 2-byte BE)      s>
S< (unsigned 2-byte LE)    S<
S> (unsigned 2-byte BE)    S>
l< (signed 4-byte LE)      l<
l> (signed 4-byte BE)      l>
L< (unsigned 4-byte LE)    L<
L> (unsigned 4-byte BE)    L>
q< (signed 8-byte LE)      q<
q> (signed 8-byte BE)      q>
Q< (unsigned 8-byte LE)    Q<
Q> (unsigned 8-byte BE)    Q>

Complete PHP Format Letter Organization:

Type                         2-byte           4-byte           8-byte
Unsigned LE (inherent)       v                V                P
Unsigned BE (inherent)       n                N                J
Unsigned machine-endian      S                L                Q
Unsigned LE (modifier)       S< (proposed)    L< (proposed)    Q< (proposed)
Unsigned BE (modifier)       S> (proposed)    L> (proposed)    Q> (proposed)
Signed machine-endian        s                l                q
Signed LE (modifier)         s< (proposed)    l< (proposed)    q< (proposed)
Signed BE (modifier)         s> (proposed)    l> (proposed)    q> (proposed)

Platform Considerations

32-bit Platform Behavior:

On 32-bit platforms, the 8-byte format codes (q<, q>, Q<, Q>) will throw a ValueError with the message “64-bit format codes are not available for 32-bit versions of PHP”, consistent with existing behavior for q/Q/P/J.

<?php
// On 32-bit platforms
try {
    pack('q<', 1);  // signed 64-bit
} catch (ValueError $e) {
    echo $e->getMessage(); // "64-bit format codes are not available..."
}
 
try {
    pack('Q>', 1);  // unsigned 64-bit
} catch (ValueError $e) {
    echo $e->getMessage(); // "64-bit format codes are not available..."
}
?>
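
Code that must run on both kinds of builds can check the integer width up front instead of catching the error. The sketch below is illustrative only: the function name is hypothetical, and it uses the existing P and J codes to produce the layouts that q< and q> describe.

```php
<?php
// Sketch of a guard for 64-bit codes on current PHP. The function name is
// hypothetical; PHP_INT_SIZE is 8 on 64-bit builds and 4 on 32-bit builds.
function packInt64(int $value, bool $littleEndian): string
{
    if (PHP_INT_SIZE < 8) {
        // Same condition under which q/Q/P/J are rejected today.
        throw new RuntimeException('64-bit integers require a 64-bit build of PHP');
    }
    // P and J write all eight bytes of the integer, so a negative value
    // produces the layout that q< and q> describe.
    return pack($littleEndian ? 'P' : 'J', $value);
}
```

On a 64-bit build, bin2hex(packInt64(-2, true)) gives feffffffffffffff and the big-endian call gives fffffffffffffffe.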

Modifier Applicability:

Endianness modifiers are supported on both signed and unsigned machine-endian integer format codes (s, l, q, S, L, Q). Unsigned integers already have dedicated endian-specific letters (v/n, V/N, P/J), but supporting modifiers on the uppercase letters makes the codes easier to remember and keeps signed and unsigned formats consistent. Using a modifier with any other format code throws a ValueError.

Future Scope

While this RFC covers both signed and unsigned integer modifiers, Perl supports endianness modifiers on additional format types that could be considered in future RFCs. For example, support could be added for the floating-point formats (f, d).

Group Modifiers

Perl's () group syntax allows applying endianness to multiple formats at once:

<?php
// Potential future support:
pack('(sl)<', -42, 4711);  // Both short and long are little-endian
pack('(s<l>)<', ...);      // ERROR: Can't override byte-order within a group
?>
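
One way to think about group semantics is as a rewrite into per-code modifiers. The sketch below is a hypothetical userland preprocessor, not a proposed design: it assumes un-nested groups, no repeat count after the group, and letter-only format codes inside it.

```php
<?php
// Hypothetical preprocessor, not part of this RFC: expands '(codes)<' or
// '(codes)>' into per-code modifiers. Assumes un-nested groups without a
// repeat count after the group.
function expandEndianGroups(string $format): string
{
    return preg_replace_callback(
        '/\(([^()]*)\)([<>])/',
        static function (array $m): string {
            [, $inner, $modifier] = $m;
            if (strpbrk($inner, '<>') !== false) {
                throw new ValueError("Can't override byte order within a group");
            }
            // Append the group's modifier to every format letter.
            return preg_replace('/[a-zA-Z]/', '$0' . $modifier, $inner);
        },
        $format
    );
}
```

With this sketch, expandEndianGroups('(sl)<') returns 's<l<', and a group that already contains a modifier, such as '(s<l>)<', is rejected.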

Backward Incompatible Changes

There are no backward compatibility concerns. The modifier syntax is entirely opt-in:

  • Existing format strings without modifiers continue to work unchanged
  • No existing format codes are removed or altered
  • The < and > characters are not currently used in pack format strings
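
Since current PHP 8 versions reject < and > as unknown format codes with a ValueError rather than accepting them silently, library code could probe for the feature and fall back when it is missing. A minimal sketch under that assumption; the helper name is hypothetical.

```php
<?php
// Hypothetical helper: uses the proposed modifier when the running PHP
// version understands it and falls back to the unsigned code otherwise.
// On PHP 8 versions without this RFC, an unknown format code such as '<'
// makes pack() throw a ValueError.
function packInt16LE(int $value): string
{
    try {
        return pack('s<', $value);
    } catch (ValueError) {
        // 'v' writes the low 16 bits in little-endian order, which is the
        // same byte pattern for signed and unsigned values.
        return pack('v', $value);
    }
}
```

Both branches produce the same bytes, so bin2hex(packInt16LE(-2)) is feff regardless of which one runs.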

Proposed PHP Version(s)

PHP 8.6 (next minor version)

Voting Choices

Real name Yes No Abstain
Final result: 0 0 0
This poll has been closed.

References

rfc/pack-unpack-endianness-signed-integers-support.txt · Last modified: by alexandredaubois