====== PHP ZTS Improvement ======
This document describes an idea for improving ZTS (Zend Thread Safety) builds by reducing the cost of accessing module globals, for example EG(current_execute_data).
The idea came up while evaluating whether JIT for ZTS is worthwhile, but it should also speed up the ZTS interpreter and the PHP ZTS build as a whole.
===== Non ZTS Way =====
Without ZTS, this access takes just 1 CPU instruction and 1 load.
  movl   executor_globals+field_offset(%rip), %eax

                       executor_globals
                      +----------------+
                      | field 0        |
  field_offset ------>| ...            |
                      | field N        |
                      +----------------+
===== Current ZTS Way (PHP-7.3) =====
In a ZTS build, however, the same access requires 6 CPU instructions and 6 loads.
  movq   _tsrm_ls_cache@gottpoff(%rip), %rax
  movslq executor_globals_id(%rip), %rdx
  movq   %fs:(%rax), %rax
  movq   (%rax), %rax
  movq   -8(%rax,%rdx,8), %rdx
  movl   field_offset(%rdx), %eax

                             %fs
                              |
                              v       tsrm_tls_entry
  +---------------------+   +---+   +----------------+
  | _tsrm_ls_cache      |-->| + |-->| storage        |--+
  +---------------------+   +---+   | count          |  |
                                    | thread_id      |  |
                                    | next           |  |
                                    +----------------+  |
                                                        |
  +---------------------+   +---+                       |
  | executor_globals_id |-->| + |<-----------+----------+
  +---------------------+   +---+            |
                              |              |
                              |     void**   v
                              |     +----------------+
                              |     | slot 0         |
                              +---->| ...            |--+
                                    | slot N         |  |
                                    +----------------+  |
                                                        |
                            +---+                       |
        field_offset ------>| + |<-----------+----------+
                            +---+            |
                              |              |
                              |     EG       v
                              |     +----------------+
                              |     | field 0        |
                              +---->| ...            |
                                    | field N        |
                                    +----------------+
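The pointer chain in the diagram can be modelled in plain C. This is only a minimal sketch, not the real TSRM code: every ''demo_'' name is hypothetical, the structure fields simply mirror the labels in the diagram, and a single module is registered. It shows why each access has to walk entry -> storage -> slot before it can reach the field.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the real executor globals structure. */
typedef struct {
    int   some_flag;
    void *current_execute_data;
} demo_executor_globals;

/* Per-thread entry with the fields shown in the diagram. */
typedef struct demo_tls_entry {
    void                 **storage;    /* one pointer per registered module */
    int                    count;
    unsigned long          thread_id;
    struct demo_tls_entry *next;
} demo_tls_entry;

/* Stand-in for _tsrm_ls_cache: each thread sees its own entry. */
static _Thread_local demo_tls_entry *demo_ls_cache;

/* Stand-in for executor_globals_id: assigned at run time, ids start at 1. */
static int demo_executor_globals_id;

/* Register one module and build the calling thread's entry (0 on success). */
static int demo_setup(void)
{
    demo_tls_entry *entry = calloc(1, sizeof *entry);
    if (entry == NULL) return -1;
    entry->count = 1;
    entry->storage = calloc((size_t)entry->count, sizeof *entry->storage);
    if (entry->storage == NULL) { free(entry); return -1; }
    entry->storage[0] = calloc(1, sizeof(demo_executor_globals));
    if (entry->storage[0] == NULL) { free(entry->storage); free(entry); return -1; }
    demo_executor_globals_id = 1;
    demo_ls_cache = entry;
    return 0;
}

/* The chain of dependent loads: entry -> storage -> slot. */
static demo_executor_globals *demo_eg(void)
{
    demo_tls_entry *entry = demo_ls_cache;   /* thread-local load */
    void **storage = entry->storage;         /* load the storage pointer */
    int id = demo_executor_globals_id;       /* load the run-time id */
    assert(id >= 1 && id <= entry->count);
    return storage[id - 1];                  /* load the slot pointer */
}
```

After ''demo_setup()'', a field is read as ''demo_eg()->some_flag'', which adds the final load of the field itself on top of the chain above.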
===== New ZTS Way (PHP-7.4+ or PHP-8) =====
If we flatten all the data structures, we can reduce the access pattern to 4 instructions and 4 loads.
I think it's possible to make these changes at the TSRM level only, without modifying the TSRM source API (or with only minimal modifications).
This would allow the improvement to be targeted at PHP-7.4.
  movq   _tsrm_ls_cache@gottpoff(%rip), %rax
  movslq executor_globals_offset(%rip), %rdx
  movq   %fs:(%rax), %rax
  movq   field_offset(%rax,%rdx), %rax

                             %fs
                              |
                              v       tsrm_tls_entry
  +---------------------+   +---+   +----------------+
  | _tsrm_ls_cache      |-->| + |-->| thread_id      |
  +---------------------+   +---+   | next           |
                              |     | count          |
                              v     +----------------+
  +---------------------+   +---+   | slot 0 size    |
  | executor_globals_id |-->| + |-->+----------------+
  +---------------------+   +---+   | slot 0 field 0 |
                              |     | ...            |
                              v     | ...            |
                            +---+   | ...            |
        field_offset ------>| + |-->| ...            |
                            +---+   | slot 0 field N |
                                    +----------------+
                                    | ...            |
                                    +----------------+
                                    | slot K size    |
                                    +----------------+
                                    | slot K field 0 |
                                    | ...            |
                                    | slot K field M |
                                    +----------------+
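One way to picture the flattened layout is a single per-thread allocation in which every module's globals are stored inline, each preceded by its size. The sketch below is an illustration under assumptions rather than the proposed implementation: all ''demo_'' names, the fixed slot limit and the alignment policy are hypothetical. The offset of each module is computed once at registration, so an access is one thread-local load plus base + offset arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for two modules' globals. */
typedef struct { int some_flag; void *current_execute_data; } demo_executor_globals;
typedef struct { int in_compilation; } demo_compiler_globals;

/* Fixed header; the module slots follow it in the same allocation. */
typedef struct demo_flat_entry {
    unsigned long           thread_id;
    struct demo_flat_entry *next;
    int                     count;
} demo_flat_entry;

#define DEMO_MAX_SLOTS 8
static size_t demo_slot_offset[DEMO_MAX_SLOTS];  /* byte offset of each slot's data */
static size_t demo_slot_size[DEMO_MAX_SLOTS];
static int    demo_slot_count;
static size_t demo_total_size = sizeof(demo_flat_entry);

static _Thread_local demo_flat_entry *demo_ls_cache;  /* stand-in for _tsrm_ls_cache */
static size_t demo_executor_globals_offset;           /* set once at registration */
static size_t demo_compiler_globals_offset;

/* Reserve "size header + data" for one module; returns the data offset. */
static size_t demo_reserve(size_t size)
{
    const size_t align = _Alignof(max_align_t);
    size_t off = demo_total_size + sizeof(size_t);    /* room for the size header */
    off = (off + align - 1) / align * align;          /* keep the data aligned */
    assert(demo_slot_count < DEMO_MAX_SLOTS);
    demo_slot_offset[demo_slot_count] = off;
    demo_slot_size[demo_slot_count] = size;
    demo_slot_count++;
    demo_total_size = off + size;
    return off;
}

/* Allocate one zeroed block for the calling thread and fill in the size headers. */
static int demo_thread_init(void)
{
    char *block = calloc(1, demo_total_size);
    if (block == NULL) return -1;
    for (int i = 0; i < demo_slot_count; i++) {
        memcpy(block + demo_slot_offset[i] - sizeof(size_t),
               &demo_slot_size[i], sizeof(size_t));
    }
    ((demo_flat_entry *)block)->count = demo_slot_count;
    demo_ls_cache = (demo_flat_entry *)block;
    return 0;
}

/* One thread-local load plus base + offset arithmetic; no pointer chasing. */
#define DEMO_G(type, offset) ((type *)((char *)demo_ls_cache + (offset)))
#define DEMO_EG(field) (DEMO_G(demo_executor_globals, demo_executor_globals_offset)->field)
#define DEMO_CG(field) (DEMO_G(demo_compiler_globals, demo_compiler_globals_offset)->field)

/* Register both modules, then build the calling thread's block (0 on success). */
static int demo_setup(void)
{
    demo_executor_globals_offset = demo_reserve(sizeof(demo_executor_globals));
    demo_compiler_globals_offset = demo_reserve(sizeof(demo_compiler_globals));
    return demo_thread_init();
}
```

In this sketch every module is registered before the first thread block is allocated, so all threads share the same layout.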
==== Reserved global id-s (EG, CG, etc) ====
In addition, we may reserve a few slots for frequently used globals such as executor_globals and compiler_globals,
and make "executor_globals_id" and similar ids constants that are precomputed at compile time.
This reduces the access pattern to 3 instructions and 3 loads.
It also eliminates the need for a temporary CPU register.
  movq   _tsrm_ls_cache@gottpoff(%rip), %rax
  movq   %fs:(%rax), %rax
  movq   executor_globals_offset + field_offset(%rax), %rax
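In C terms, reserving slots could amount to giving the frequently used globals fixed positions inside the per-thread structure, so that their offsets become integer constants. The following is a hedged sketch with hypothetical ''demo_'' names; embedding the structures directly in the entry is an assumption made for illustration, not something the proposal prescribes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the reserved globals. */
typedef struct { int some_flag; void *current_execute_data; } demo_executor_globals;
typedef struct { int in_compilation; } demo_compiler_globals;

/* Header followed by reserved slots at fixed positions. */
typedef struct demo_entry {
    unsigned long         thread_id;
    struct demo_entry    *next;
    int                   count;
    demo_executor_globals eg;   /* reserved slot */
    demo_compiler_globals cg;   /* reserved slot */
    /* dynamically registered modules would follow here */
} demo_entry;

/* The offsets are constant expressions, so no run-time id or offset is loaded. */
#define DEMO_EG_OFFSET offsetof(demo_entry, eg)
#define DEMO_CG_OFFSET offsetof(demo_entry, cg)
_Static_assert(DEMO_EG_OFFSET < DEMO_CG_OFFSET, "reserved slots are laid out in order");

static _Thread_local demo_entry *demo_ls_cache;   /* stand-in for _tsrm_ls_cache */

/* Thread-local load, then a single access at base + constant displacement. */
#define DEMO_EG(field) \
    (((demo_executor_globals *)((char *)demo_ls_cache + DEMO_EG_OFFSET))->field)
#define DEMO_CG(field) \
    (((demo_compiler_globals *)((char *)demo_ls_cache + DEMO_CG_OFFSET))->field)

/* Allocate the calling thread's zeroed entry (0 on success). */
static int demo_thread_init(void)
{
    demo_entry *entry = calloc(1, sizeof *entry);
    if (entry == NULL) return -1;
    entry->count = 2;           /* the two reserved slots */
    demo_ls_cache = entry;
    return 0;
}
```

Because ''DEMO_EG_OFFSET'' and the field offset are both known to the compiler, they can be folded into one displacement, which is what removes the need for a temporary register in the listing above.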
==== JIT ====
With JIT we may avoid the "position independent" read of "_tsrm_ls_cache" and reduce the access pattern to 2 instructions and 2 loads.
  movq   %fs:_tsrm_ls_cache, %rax
  movq   executor_globals_offset + field_offset(%rax), %rax
Finally, we may cache the address of "tsrm_tls_entry" in a CPU register and get down to 1 instruction and 1 load.