PHP RFC: Working With Substrings

Version: 1
Date: 2023-01-18
Author: Thomas Hruska, cubic@php.net
Status: Under Discussion
First Published at: https://wiki.php.net/RFC/working_with_substrings

Introduction

Zend engine has the policy of performing copy-on-write for every string operation. This means that memory is regularly allocated and deallocated and strings are copied as needed based on reference counting. In general, this is fine because freed memory goes back into the allocation pool and therefore reduces the number of overall actual allocation requests to the system. However, very serious performance issues arise when attempting to process and manipulate strings one byte at a time or work with larger memory buffers (1MB or more). This RFC aims to address the problems by adding a few new functions and adding substring support to a broader selection of existing PHP functions to reduce the number of memory allocations and copy operations that take place in performance critical scenarios.

How serious are the performance issues? When appending one byte at a time under certain circumstances (very large buffers), I can get up to 9MB/sec on an Intel Core i7 with system memory capable of performing transfers up to 23GB/sec. Modifying portions of existing strings inline only peaks at 35MB/sec on the same hardware. Those are not typos. (These numbers are from a real world library - artificial benchmarks peak at 55MB/sec.) In short, PHP's 0.2% of maximum system memory performance while churning 100% of one CPU thread to modify strings inline is very unimpressive. Of course, this won't come as any surprise to the PHP development team nor is it a design flaw of the language itself.

Proposal

This RFC is split into two major subsections: A few brand new userland functions (str_splice(), str_realloc(), fread_mem()) to work with strings inline (i.e. not copy-on-write) and enhancements to a variety of existing PHP functions to add substring support.

Some enhancements being proposed might not technically need substring support but are being proposed for function prototype consistency so that PHP doesn't wind up in a weird place down the road with mismatched prototypes.

Working implementations of all the items being proposed can be found at:

https://github.com/cubiclesoft/php-ext-qolfuncs

str_splice()

int str_splice(string &$dst, int $dst_offset, [ ?int $dst_length = null, string $src = '', int $src_offset = 0, ?int $src_length = null, int $src_repeat = 1, bool $shrink = true, ?int $dst_lastsize = null ])

Removes the string designated by offset and length in the destination and replaces it with the optional source substring.

$dst - A string passed by reference to splice.
$dst_offset - The offset into the destination string to start the splice. Supports negative offsets from the end of the string.
$dst_length - The length of the destination string to replace from the offset. Supports negative lengths from the end of the string and is nullable (Default is null).
$src - A string to insert starting at $dst_offset (Default is '').
$src_offset - The offset into the source string to begin copying. Supports negative offsets from the end of the string (Default is 0).
$src_length - The length of the source string to copy. Supports negative lengths from the end of the string and is nullable (Default is null).
$src_repeat - The number of times to repeat the source string (Default is 1).
$shrink - Whether or not to shrink the destination string (Default is true). When false and the destination is smaller than the input buffer, then this affects the return value.
$dst_lastsize - The last size returned from str_splice(). Nullable (Default is null).

Returns: The size of the destination string.

Unlike substr_replace(), str_splice() can result in zero memory allocations under carefully controlled conditions. The function, when possible, performs inline memory manipulation for up to 10x potential performance gains. Testing str_splice() against IFDS shows a 2x improvement to random write performance and notably simpler code (sequential write and both read types remained roughly the same). Artificial benchmarks (see the Benchmarks subsection below) show an increase of 160 to 9,200 times faster over built-in options for manipulating large, fixed size 20MB buffers. Real world application performance will vary but probably reside more in the “2x to 3x faster” range. Also, the bigger the buffer being worked with, the greater the gain due to fewer copy operations.

The $src_repeat option allows for filling the destination with a repeated substring. For example, writing 4KB of 0x00 to some part of the string.

This function also supports virtual buffers. When $shrink is false, the buffer size is not reduced and the return value reflects the size of the virtual buffer. See str_realloc() for example usage.

The “carefully controlled conditions” mentioned in the first paragraph is stringently defined as follows:

The destination string is NOT interned/immutable.
Has only one (src and dst zend_string are different) or two (src and dst zend_string are the same) references (refcount).
When source and destination are the same string (rare) they must either be the same initial sizes or their initial substring must not overlap.
The destination string is not being resized.

If any condition is not met, the string buffer is dynamically (re)allocated.

Target audience: Users working with large data (1MB or larger). Roughly similar in name/design to array_splice() only it doesn't return the removed string.

Why it should be added to PHP: Better performance and more features than substr_replace(). Notably better performance for equal sized strings where $dst_length == $src_length * $src_repeat. Less userland code required for optimized memory usage. Fundamentally different design from other string functions in PHP.

Benchmarks

C:\php-sdk\phpdev\vs16\x64\php-8.2.1
$ x64\Release_TS\php.exe test2.php
Artificial benchmark (100MB buffer, sequential byte-by-byte writing):

3.0783689022064 sec (32,484,735 bytes/sec)  (Append one byte at a time)
1.7970371246338 sec (55,647,153 bytes/sec)  (Overwrite all bytes sequentially one byte at a time)

Modifying an existing buffer is 1.72 times faster than appending one byte at a time. (The case for str_realloc() + str_splice() = fewer reallocations)

0.0033409595489502 sec (29,931,520,730 bytes/sec)  (Overwrite all bytes with str_splice())

str_splice() is up to 413 times faster than direct byte-by-byte manipulation of a 100MB buffer.

NOTE: Max memory speed on the test hardware is 23GB/sec, so the 29GB/sec above indicates a slightly off calculation due to being so fast that the speed of the underlying memset() call can't be accurately measured. The gain above was manually adjusted from 537 to 413 to more closely reflect reality. Regardless, a single memset() call is obviously much faster.

Artificial benchmark (20MB buffer, 4KB copy, random writing):

Randomly wrote 16,777,216 bytes (767,784 bytes/sec) (substr_replace())
Randomly wrote 134,217,728 bytes (43,894,435 bytes/sec) (direct byte-level manipulation)
Randomly wrote 21,239,955,456 bytes (7,074,957,359 bytes/sec) (str_splice())

str_splice() is up to 9,214 times faster than substr_replace() of a 20MB buffer. str_splice() is up to 161 times faster than direct byte-level manipulation of a 20MB buffer.

These benchmarks are artificial. That is, the associated use-cases are rarely encountered in real world code. As stated earlier, real world application performance will vary but probably reside more in the “2x to 3x faster” neighborhood than seeing these 160x or greater improvements. The point is to show that str_splice() generally and notably improves application performance. How much performance improves is going to depend heavily on the individual use-case.

Code Reduction

str_splice() reduces the amount of code compared to current userland implementations that MAY have similar performance gains. Currently, this is the best possible userland code when working with chunked buffers:

$x = 0;
$y = strlen($data);
 
while ($x < $y)
{
...
 
	// Copy data up to page size.
	$x2 = $this->currpos % $this->pagesize;
	$y2 = strlen($this->pagemap[$pagepos][0]);
	$diff = $y2 - $x2;
	if ($diff <= $y - $x)
	{
		$this->pagemap[$pagepos][0] = ($x2 === 0 ? "" : substr($this->pagemap[$pagepos][0], 0, $x2));
		$this->pagemap[$pagepos][0] .= substr($data, $x, $diff);
 
		$x += $diff;
		$x2 = $y2;
		$this->currpos += $diff;
	}
	else if ($x2 === 0)
	{
		$diff = $y - $x;
		$tempdata = substr($this->pagemap[$pagepos][0], $y - $x);
		$this->pagemap[$pagepos][0] = ($x == 0 ? $data : substr($data, $x));
		$this->pagemap[$pagepos][0] .= $tempdata;
 
		$x = $y;
		$x2 += $diff;
		$this->currpos += $diff;
	}
	else
	{
		// PHP is very slow when appending one byte at a time to a string.
		while ($x2 < $y2 && $x < $y)
		{
			$this->pagemap[$pagepos][0][$x2] = $data[$x];
 
			$x++;
			$x2++;
			$this->currpos++;
		}
	}
 
	if ($y2 < $this->pagesize && $x < $y)
	{
		$size = ($this->pagesize - $y2 < $y - $x ? $this->pagesize - $y2 : $y - $x);
		$this->pagemap[$pagepos][0] .= ($x == 0 && $size == $y ? $data : substr($data, $x, $size));
 
		$x += $size;
		$x2 += $size;
		$this->currpos += $size;
	}
 
...
}

With str_splice(), this can be reduced to:

$x = 0;
$y = strlen($data);
 
while ($x < $y)
{
...
 
	// Copy data up to page size.
	$x2 = $this->currpos % $this->pagesize;
	$y2 = strlen($this->pagemap[$pagepos][0]);
	$diff = $this->pagesize - $x2;
	$size = ($diff <= $y - $x ? $diff : $y - $x);
	$size2 = ($x2 + $size > $y2 ? $y2 - $x2 : $size);
 
	str_splice($this->pagemap[$pagepos][0], $x2, $size2, $data, $x, $size);
 
	$x += $size;
	$x2 += $size;
	$this->currpos += $size;
 
...
}

Performance testing shows both sets of code to be roughly equivalent under most scenarios but the str_splice() variant is up to 2-3 times faster under certain scenarios - specifically, random bulk write operations. The amount of code overall in the str_splice() example is greatly reduced, thus improving code readability.

For comparison, performance testing the functionally equivalent naive implementation (not shown here) that modifies strings one byte at a time is many times slower than the examples above. That is, the naive implementation, while approximately the same amount of code as the str_splice() variant above, runs at around 35MB/sec to 55MB/sec while the two optimized variants above run at around 515MB/sec on the test hardware (i.e. the optimized variants are up to 14 times faster).

str_realloc()

int str_realloc(string &$str, int $size, [ bool $fast = false ])

Reallocates the buffer associated with a string and returns the previous size.

$str - A string to resize.
$size - The new size of the string.
$fast - Whether or not to reallocate the string for the new size when shrinking (Default is false).

Returns: The previous size of the string.

This function can preallocate a buffer or truncate an existing string. Inline modifying one byte at a time is approximately 1.7 to 3.8 times faster than appending one byte at a time to the end of a string (See Benchmarks for str_splice()).

Goes hand-in-hand with str_splice() virtual buffers. For example, preallocating an estimated 1MB buffer, filling the buffer using str_splice(), and then calling str_realloc() to finalize the string.

Target audience: Users working with large data (1MB or more).

Why it should be added to PHP: Better performance. Works in concert with str_splice(). Gives indirect access to zend_string_realloc().

Example Usage

Let's say an application regularly constructs a buffer that averages 750KB in size when completed.

// Allocate buffer.
$str = $template;
$vsize = str_realloc($str, strlen($template) + 1024768);
 
// Replace substrings.
$pos = 0;
while (($pos = strpos($str, "{embed ", $pos)) !== false && $pos < $vsize)
{
	$pos2 = strpos($str, "}", $pos + 7);
	if ($pos2 === false || $pos2 >= $vsize)  $pos2 = $pos + 6;
	else
	{
		$embed = GenerateEmbed(substr($str, $pos + 6, $pos2 - $pos - 6));
 
		$vsize = str_splice($str, $pos, $pos2 - $pos + 1, $embed, 0, null, 1, false, $vsize);
	}
 
	$pos = $pos2 + 1;
}
 
// Finalize the buffer.
str_realloc($str, $vsize, true);

This approach reduces the number of intermediate memory allocations that take place by preallocating extra buffer space in advance. str_splice() tracks the actual buffer size and then the last str_realloc() call truncates the string to the final buffer size without copying the string (when $fast is true, str_realloc() simply sets the string size).

Fast truncation is also useful to avoid copying memory:

// Interned string.
$str = "Truncate ";
 
// Copied.  Non-interned string.
$str .= "meeeee";
 
// Sets the string size ($fast = true).
str_realloc($str, 8, true);
 
// May append to the same string.  Custom Zend engine allocators, etc. make this indeterminate though.
$str .= "d";
 
echo $str;  // Outputs:  Truncated

Another example is to preallocate a buffer before calling fread_mem():

// Preallocate the buffer.
$size = 1024768;
$str = "";
str_realloc($str, $size);
 
// Read until the buffer is filled.
$pos = 0;
while ($pos < $size && !feof($fp))
{
	// Maximum size to attempt to read is automatically determined by the size of the buffer - offset.
	$size2 = fread_mem($fp, $str, $pos);
	if ($size2 === false)  die("Stream read failed.");
 
	$pos += $size2;
}
 
// Truncate the buffer.
str_realloc($str, $pos, true);

For small buffers, the performance gains by avoiding copies is going to be largely negligible. For large buffers, avoiding copies can have notable gains in performance and, of course, reduce RAM usage.

php_adjust_substr_offset_length()

PHPAPI void php_adjust_substr_offset_length(zend_string *str, zend_long *str_offset, zend_long *str_length)

Intakes a zend_string, an offset, and a length and adjusts the offset and length values so that the range is in bounds of the string buffer.

This is not a userland function. It is intended to be a shared function for the functions that follow for calculating the correct bounds-checked offset and length of a substring. str_splice() was written before this function existed.

Perhaps could benefit from zend_always_inline. The name also might need to be changed. How offset and length are adjusted might be inconsistent with other areas of PHP?

explode() with substring

array explode(string separator, string str [, ?int limit = null, ?int str_offset = 0, ?int str_length = null ])

Splits a string on string separator and return array of components. If limit is positive only limit number of components is returned. If limit is negative all components except the last abs(limit) are returned.

$separator - A string containing a separator to split on.
$str - The string to split.
$limit - The number of components to return.
$str_offset - The offset into the string to begin splitting. Supports negative offsets from the end of the string (Default is 0).
$str_length - The length of the string being split. Supports negative lengths from the end of the string and is nullable (Default is null).

Returns: An array containing the split string components.

Extends explode() with string offset and length parameters for the $str parameter. Useful for extracting substrings that need to be split.

Target audience: Users that call explode(“...”, substr($str)).

Why it should be added to PHP: Saves a call to substr(), which would create a temporary copy.

str_split() with substring

array str_split(string str [, ?int split_length = 1, ?int str_offset = 0, ?int str_length = null ])

Convert a string to an array. If split_length is specified, break the string down into chunks each split_length characters long.

$str - A string to split.
$split_length - The chunk length of each entry in the array (Default is 1).
$str_offset - The offset into the string to begin splitting. Supports negative offsets from the end of the string (Default is 0).
$str_length - The length of the string being split. Supports negative lengths from the end of the string and is nullable (Default is null).

Returns: An array containing the split string components.

Extends str_split() with string offset and length parameters for the $str parameter. Useful for extracting substrings that need to be split.

Target audience: Users that call `str_split(substr($str))`.

Why it should be added to PHP: Saves a call to substr() and keeps the str_split() prototype in line with explode().

fread_mem()

int|false fread_mem(resource fp, string &$str, [ int str_offset = 0, ?int length = null ])

Binary-safe inline file read.

$fp - A resource to an open file.
$str - A string to store the read data in.
$str_offset - The offset into the string to begin reading into. Supports negative offsets from the end of the string (Default is 0).
$length - The maximum number of bytes to read. Nullable (Default is null).

Returns: An integer containing the number of bytes read on success, false otherwise.

This function reads data from a stream into the destination string starting at the specified offset. When length is null, the length is automatically determined based on the size of $str minus the offset.

Target audience: All users.

Why it should be added to PHP: It is extremely common to call fread(), check for failure, and then append the returned data to another string in a loop. This eliminates two memory allocations per loop.

Instead of adding this as a new global function, it might be possible to extend the existing fread() function to support multiple function signatures to add equivalent functionality.

fwrite() with substring

int|false fwrite(resource $stream, string $data, ?int $length = null [, ?int offset = 0])

Writes the contents of data to the file stream pointed to by stream.

$stream - A file system pointer resource that is typically created using fopen().
$data - The string that is to be written.
$length - If length is an int, writing will stop after length bytes have been written or the end of data is reached, whichever comes first. Supports negative lengths from the end of the string and is nullable (Default is null).
$offset - The offset into the string to begin reading into. Supports negative offsets from the end of the string (Default is 0).

Returns: The number of bytes written, or false on failure.

Extends fwrite() to support substrings. Useful for efficiently writing partial buffers to non-blocking network streams.

Unfortunately, fwrite() already has a nullable length parameter. This means the new offset parameter will have to be put after the length parameter.

Target audience: All users.

Why it should be added to PHP: It is extremely common to call fwrite(), get back a “write succeeded” response BUT only part of the data was written, and then have to chop up the buffer to be able to call fwrite() again to send the rest of the data. This is extremely inefficient for large buffers.

hash() with substring

string hash(string algo, string data[, bool raw_output = false, array options = [], ?int data_offset = 0, ?int data_length = null])

Generate a hash of a given input string. Returns lowercase hexits by default.

$algo - A string containing a hash algorithm.
$data - The data to hash.
$raw_output - Output raw data when true, lowercase hexits when false (Default is false).
$options - An array of options for the various hashing algorithms. Currently, only the “seed” parameter is supported by the MurmurHash variants. (Default is []).
$data_offset - The offset into the string to begin hashing from. Supports negative offsets from the end of the string (Default is 0).
$data_length - The length of the string to hash. Supports negative lengths from the end of the string and is nullable (Default is null).

Returns: A string containing the result of the hash.

Extends hash() to support substrings. Useful for verifying binary data within a substring such as:

4 bytes size
Data
4 byte CRC-32

Reduces the need to extract data just to check a CRC/hash.

Target audience: Users who work with binary data containing CRCs or hashes.

Why it should be added to PHP: Saves copying data out of a string just to hash it when a few minor pointer adjustments have the same effect. The binary data blob being checked is probably structured in some way and so copying it out into its own string just to verify a CRC wastes time and uses more memory.

hash_hmac() with substring

string hash_hmac(string algo, string data, string key[, bool raw_output = false, ?int data_offset = 0, ?int data_length = null])

Generate a hash of a given input string with a key using HMAC. Returns lowercase hexits by default.

$algo - A string containing a hash algorithm.
$data - The data to hash.
$key - A string containing the HMAC key.
$raw_output - Output raw data when true, lowercase hexits when false (Default is false).
$data_offset - The offset into the string to begin hashing from. Supports negative offsets from the end of the string (Default is 0).
$data_length - The length of the string to hash. Supports negative lengths from the end of the string and is nullable (Default is null).

Returns: A string containing the result of the hash.

Extends hash_hmac() to support substrings. Useful for verifying binary data within a substring.

The new parameters here are less useful than the substring support for hash() but should be added for completeness and consistency.

Target audience: Some users who work with binary data.

Why it should be added to PHP: Saves copying data out of a string just to hash it when a few minor pointer adjustments have the same effect. Should also be added to be consistent with substring support for hash().

Backward Incompatible Changes

Significant care was taken to not introduce any BC breaks. As such, there shouldn't be any BC breaks as a result of these additions and enhancements.

str_splice(), str_realloc(), and fread_mem() will no longer be available as global function names. May break existing userland software that defines global functions with these names. Searching GitHub for those three function names turns up the following results:

str_splice() - 52 results. Appears to mostly be a function name inside a class for DocWiki. No apparent naming conflicts.
str_realloc() - 2 results. Just the test extension (qolfuncs). No apparent naming conflicts.
fread_mem() - 73 results. Name conflict with an identically named global function defined in the FPDF userland library.

Proposed PHP Version(s)

Next PHP 8.x.

RFC Impact

To SAPIs: Will be applied to all PHP environments.
To Existing Extensions: Additions and changes made to ext/standard and ext/hash in the existing .c and .h files.
To Opcache: New global functions (str_splice(), str_realloc(), etc) to be added to the registered opcache function list like all the other registered global functions.
New Constants: No new constants introduced.
php.ini Defaults: No changes to php.ini introduced.

Open Issues

Issue 1 - Suggested alternative name(s) for str_splice():

str_modify() because it doesn't perfectly mirror array_splice()'s prototype. One of the goals of this function is to optimally splice strings inline - ideally without making any unnecessary memory allocations/copies. Also, strings are not arrays so we probably shouldn't expect the prototype or the behavior to precisely match.

Issue 2 - Are there other alternate names for these functions that should be considered?

Issue 3 - Should fread_mem() be merged into fread()?

Issue 4 - hash_hmac() does not currently have an $options array but hash() does. Should hash_hmac() reserve an $options array parameter for future use and/or consistency with hash()? Or is the lack of an $options array an oversight and hash_hmac() should actually be mirroring hash()'s prototype?

Issue 5 - hash_hmac() still uses zend_parse_parameters() while hash() uses macro expansion. Should hash_hmac() be switched over to the macro expansion method for consistency with hash()?

Issue 6 - Should I go back and integrate php_adjust_substr_offset_length() into str_splice()?

Issue 7 - Should php_adjust_substr_offset_length() have zend_always_inline? Should it be a macro instead of a function?

Issue 8 - Is php_adjust_substr_offset_length() internally consistent in how it calculates bounded offset and length with other areas of PHP?

Issue 9 - Should the operations conducted here be their own type in PHP instead of adding new functions and parameters to existing functions? See the discussion thread for details.

Issue 10 - Should we consider the larger body of string functions for inline buffer modification by supporting the syntax $mystr->substr(), $mystr->toUpper(), etc.? This increases the scope quite a bit. See the discussion thread for details.

Future Scope

Maybe there are other functions that could benefit from substring offset/length support. A large number of functions were looked at but only a few seemed like obvious choices.

Proposed Voting Choices

The vote will require 2/3 majority with a vote on a per-function basis.

Patches and Tests

Working implementations of all the items being proposed can currently be found at: