PHP ZTS Improvement
This document describes an idea for improving ZTS builds by reducing the cost of accessing module globals, for example EG(current_execute_data). The idea came up while analysing whether a JIT for ZTS is expedient, but it should also speed up the ZTS interpreter and the PHP ZTS build as a whole.
Non ZTS Way
Without ZTS this takes just 1 CPU instruction and 1 load.
    movl executor_globals+field_offset(%rip), %eax

    executor_globals
                       +----------------+
                       | field 0        |
    field_offset ----->| ...            |
                       | field N        |
                       +----------------+
Current ZTS Way (PHP-7.3)
However, in a ZTS build the same access requires 6 CPU instructions and 6 loads.
    movq _tsrm_ls_cache@gottpoff(%rip), %rax
    movslq executor_globals_id(%rip), %rdx
    movq %fs:(%rax), %rax
    movq (%rax), %rax
    movq -8(%rax,%rdx,8), %rdx
    movl field_offset(%rdx), %eax

                               %fs
                                |
                                v       tsrm_tls_entry
    +---------------------+   +---+   +----------------+
    | _tsrm_ls_cache      |-->| + |-->| storage        |--+
    +---------------------+   +---+   | count          |  |
                                      | thread_id      |  |
                                      | next           |  |
                                      +----------------+  |
                                                          |
    +---------------------+   +---+                       |
    | executor_globals_id |-->| + |<----------------------+
    +---------------------+   +---+
                                |
                                |       void**
                                v     +----------------+
                                |     | slot 0         |
                                +---->| ...            |--+
                                      | slot N         |  |
                                      +----------------+  |
                                                          |
                              +---+                       |
        field_offset -------->| + |<----------------------+
                              +---+
                                |
                                |       EG
                                v     +----------------+
                                |     | field 0        |
                                +---->| ...            |
                                      | field N        |
                                      +----------------+
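The chain of dependent loads in the listing above can also be read as C. The following is only a simplified model, not the TSRM source: every `model_*` name and the two-field globals struct are hypothetical, and only the field names of the per-thread entry are taken from the diagram.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for executor_globals; the real struct is far larger. */
typedef struct {
    int   some_flag;
    void *current_execute_data;
} model_executor_globals;

/* Per-thread entry with the fields named in the diagram above. */
typedef struct model_tls_entry {
    void                  **storage;    /* one pointer per registered globals block */
    int                     count;      /* number of entries in storage */
    unsigned long           thread_id;
    struct model_tls_entry *next;
} model_tls_entry;

/* Thread-local pointer to the current thread's entry (the %fs-relative read). */
static _Thread_local model_tls_entry *model_ls_cache;

/* Id handed out at module startup; ids are 1-based, hence the "- 1" below. */
static int model_executor_globals_id;

/* Bind an entry to the calling thread and remember the id (setup only). */
static void model_attach(model_tls_entry *entry, int executor_globals_id)
{
    assert(entry != NULL);
    assert(executor_globals_id >= 1 && executor_globals_id <= entry->count);
    model_ls_cache = entry;
    model_executor_globals_id = executor_globals_id;
}

/* The access itself: a chain of dependent loads. */
static void *model_get_current_execute_data(void)
{
    model_tls_entry *entry = model_ls_cache;        /* load thread-local pointer */
    void **storage = entry->storage;                /* load storage table address */
    int id = model_executor_globals_id;             /* load the global id */
    model_executor_globals *eg = storage[id - 1];   /* load the slot pointer */
    return eg->current_execute_data;                /* load the field */
}
```

The C source shows five loads. The sixth one in the assembly is the read of the TLS offset of `_tsrm_ls_cache` from the GOT, which the compiler adds for position-independent code.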
New ZTS Way (PHP-7.4+ or PHP-8)
If we flatten all the data structures, we can reduce the access pattern to 4 instructions and 4 loads. I think it is possible to make these changes at the TSRM level only, without (or with minimal) modification of the TSRM source API. This would allow targeting the improvement at PHP-7.4.
    movq _tsrm_ls_cache@gottpoff(%rip), %rax
    movslq executor_globals_offset(%rip), %rdx
    movq %fs:(%rax), %rax
    movq field_offset(%rax,%rdx), %rax

                               %fs
                                |
                                v       tsrm_tls_entry
    +---------------------+   +---+   +----------------+
    | _tsrm_ls_cache      |-->| + |-->| thread_id      |
    +---------------------+   +---+   | next           |
                                |     | count          |
                                v     +----------------+
    +---------------------+   +---+   | slot 0 size    |
    | executor_globals_id |-->| + |-->+----------------+
    +---------------------+   +---+   | slot 0 field 0 |
                                |     | ...            |
                                V     | ...            |
                              +---+   | ...            |
        field_offset -------->| + |-->| ...            |
                              +---+   | slot 0 field N |
                                      +----------------+
                                      | ...            |
                                      +----------------+
                                      | slot K size    |
                                      +----------------+
                                      | slot K field 0 |
                                      | ...            |
                                      | slot K field M |
                                      +----------------+
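A rough C model of the flattened layout might look as follows. It is a sketch under assumptions, not a proposed TSRM patch: all `flat_*` names are hypothetical, every block is aligned to `max_align_t`, and the per-slot size word shown in the diagram is left out to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for executor_globals. */
typedef struct {
    int   some_flag;
    void *current_execute_data;
} flat_executor_globals;

/* Header of the per-thread block; module globals follow it in the same allocation. */
typedef struct flat_tls_entry {
    unsigned long          thread_id;
    struct flat_tls_entry *next;
    int                    count;
} flat_tls_entry;

/* Round n up to the strictest fundamental alignment. */
#define FLAT_ALIGN(n) \
    (((n) + _Alignof(max_align_t) - 1) & ~(_Alignof(max_align_t) - 1))

/* Bytes needed for one thread so far: the header plus all reserved blocks. */
static size_t flat_total = FLAT_ALIGN(sizeof(flat_tls_entry));

/* Reserve room for one module's globals at startup.
 * Returns the byte offset of that block from the start of the entry. */
static size_t flat_reserve(size_t size)
{
    size_t offset = flat_total;
    flat_total += FLAT_ALIGN(size);
    return offset;
}

/* Allocate the zeroed, flattened block for a new thread. */
static flat_tls_entry *flat_new_thread_entry(void)
{
    flat_tls_entry *entry = calloc(1, flat_total);
    assert(entry != NULL);
    return entry;
}

static _Thread_local flat_tls_entry *flat_ls_cache;  /* per-thread block address */
static size_t flat_executor_globals_offset;          /* set once at startup */

/* Access: one base pointer plus two offsets, with no pointer chasing. */
static void *flat_get_current_execute_data(void)
{
    char *base = (char *)flat_ls_cache;               /* load thread-local pointer */
    flat_executor_globals *eg =
        (flat_executor_globals *)(base + flat_executor_globals_offset); /* load offset */
    return eg->current_execute_data;                  /* load the field */
}
```

Because the globals live inside the per-thread block itself, the `storage` table and the slot pointer from the current scheme disappear, which is where the two saved loads come from.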
Reserved global id-s (EG, CG, etc)
In addition, we may reserve a few slots for the frequently used execute_data, compiler_data, etc., and make “executor_global_id” a constant precomputed at compile time. This reduces the access pattern to 3 instructions and 3 loads, and it also eliminates the need for a temporary CPU register.
    movq _tsrm_ls_cache@gottpoff(%rip), %rax
    movq %fs:(%rax), %rax
    movq executor_globals_offset + field_offset(%rax), %rax
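One way to picture reserved slots in C is to make the frequently used globals plain members of the per-thread block, so their offsets are known to the compiler. This is a minimal sketch under that assumption: the `rsv_*` names, the `RSV_EG` macro and both globals structs are hypothetical and not part of TSRM.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the executor and compiler globals. */
typedef struct {
    int   some_flag;
    void *current_execute_data;
} rsv_executor_globals;

typedef struct {
    int in_compilation;
} rsv_compiler_globals;

/* Per-thread block with reserved slots at fixed positions. */
typedef struct rsv_tls_entry {
    unsigned long         thread_id;
    struct rsv_tls_entry *next;
    int                   count;
    rsv_executor_globals  eg;   /* reserved slot: offset known at compile time */
    rsv_compiler_globals  cg;   /* reserved slot: offset known at compile time */
    /* dynamically registered module globals would follow here */
} rsv_tls_entry;

/* Compile-time constant replacing the run-time load of the offset. */
#define RSV_EG_OFFSET offsetof(rsv_tls_entry, eg)

static_assert(RSV_EG_OFFSET % _Alignof(rsv_executor_globals) == 0,
              "reserved slot must be suitably aligned");

static _Thread_local rsv_tls_entry *rsv_ls_cache;  /* per-thread block address */

/* Field access: load the base, then one load at a constant displacement. */
#define RSV_EG(field) \
    (((rsv_executor_globals *)((char *)rsv_ls_cache + RSV_EG_OFFSET))->field)
```

With this shape, `RSV_EG(current_execute_data)` needs no register for the offset, because `RSV_EG_OFFSET` and the field offset fold into a single displacement.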
JIT
With JIT we may avoid the “position independent” read of “_tsrm_ls_cache” and reduce the access pattern to 2 instructions and 2 loads.
    movq %fs:_tsrm_ls_cache, %rax
    movq executor_globals_offset + field_offset(%rax), %rax
Finally, we may cache the address of “tsrm_tls_entry” in a CPU register and get down to 1 instruction and 1 load.