This document describes an idea for improving ZTS that should reduce the cost of accessing module globals, for example EG(current_execute_data). The idea came up while analysing whether JIT is worthwhile for ZTS, but it should also improve the ZTS interpreter and the PHP ZTS build as a whole.
Without ZTS this takes just 1 CPU instruction and 1 load.
movl executor_globals+field_offset(%rip), %eax
                    executor_globals
                    +----------------+
                    | field 0        |
field_offset ------>| ...            |
                    | field N        |
                    +----------------+
However, in a ZTS build the same access requires 6 CPU instructions and 6 loads.
movq _tsrm_ls_cache@gottpoff(%rip), %rax
movslq executor_globals_id(%rip), %rdx
movq %fs:(%rax), %rax
movq (%rax), %rax
movq -8(%rax,%rdx,8), %rdx
movl field_offset(%rdx), %eax
                           %fs
                            |
                            v     tsrm_tls_entry
+---------------------+   +---+   +----------------+
| _tsrm_ls_cache      |-->| + |-->| storage        |--+
+---------------------+   +---+   | count          |  |
                                  | thread_id      |  |
                                  | next           |  |
                                  +----------------+  |
                                                      |
+---------------------+   +---+                       |
| executor_globals_id |-->| + |<-----------+----------+
+---------------------+   +---+            |
                            |              |
                            |     void**   v
                            |     +----------------+
                            |     | slot 0         |
                            +---->| ...            |--+
                                  | slot N         |  |
                                  +----------------+  |
                                                      |
                          +---+                       |
      field_offset ------>| + |<-----------+----------+
                          +---+            |
                            |              |
                            |     EG       v
                            |     +----------------+
                            |     | field 0        |
                            +---->| ...            |
                                  | field N        |
                                  +----------------+
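The pointer chain in the diagram above can be modelled in plain C. This is only a sketch for illustration: all `chain_*` names are hypothetical, the structures are simplified stand-ins rather than the real TSRM definitions, and C11 `_Thread_local` is assumed to play the role of `_tsrm_ls_cache`.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for executor_globals; the real structure is much larger. */
typedef struct {
    void *current_execute_data;
    int   exit_status;
} chain_executor_globals;

/* Simplified per-thread entry: "storage" is a table of pointers, one per globals block. */
typedef struct chain_tls_entry {
    void **storage;
    int    count;
    struct chain_tls_entry *next;
} chain_tls_entry;

static _Thread_local chain_tls_entry *chain_ls_cache;  /* role of _tsrm_ls_cache */
static int chain_executor_globals_id = 1;              /* 1-based, hence the -8 in the asm */

/* Allocate the entry, the pointer table and one globals block for the calling thread. */
static int chain_thread_init(void)
{
    chain_tls_entry *entry = calloc(1, sizeof *entry);
    if (entry == NULL) return -1;
    entry->count = 1;
    entry->storage = calloc((size_t)entry->count, sizeof *entry->storage);
    if (entry->storage == NULL) { free(entry); return -1; }
    entry->storage[0] = calloc(1, sizeof(chain_executor_globals));
    if (entry->storage[0] == NULL) { free(entry->storage); free(entry); return -1; }
    chain_ls_cache = entry;
    return 0;
}

/* Follow the chain: TLS pointer -> entry -> storage table -> slot selected by id. */
static chain_executor_globals *chain_eg(void)
{
    chain_tls_entry *entry = chain_ls_cache;       /* thread-local load */
    void **storage = entry->storage;               /* load the table pointer */
    return storage[chain_executor_globals_id - 1]; /* load the id, then the slot */
}

/* The final load of the field itself. */
static int chain_read_exit_status(void)
{
    return chain_eg()->exit_status;
}
```

Each dependent dereference in `chain_eg()` and `chain_read_exit_status()` corresponds to one of the loads counted above; a caller would run `chain_thread_init()` once per thread before reading any field.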
If we flatten all the data structures, we can reduce the access pattern to 4 instructions and 4 loads. I think it's possible to make these changes at the TSRM level only, without (or with minimal) modification of the TSRM source API. This would allow targeting the improvement at PHP-7.4.
movq _tsrm_ls_cache@gottpoff(%rip), %rax
movslq executor_globals_offset(%rip), %rdx
movq %fs:(%rax), %rax
movq field_offset(%rax,%rdx), %rax
                           %fs
                            |
                            v     tsrm_tls_entry
+---------------------+   +---+   +----------------+
| _tsrm_ls_cache      |-->| + |-->| thread_id      |
+---------------------+   +---+   | next           |
                            |     | count          |
                            v     +----------------+
+---------------------+   +---+   | slot 0 size    |
| executor_globals_id |-->| + |-->+----------------+
+---------------------+   +---+   | slot 0 field 0 |
                            |     | ...            |
                            V     | ...            |
                          +---+   | ...            |
      field_offset ------>| + |-->| ...            |
                          +---+   | slot 0 field N |
                                  +----------------+
                                  | ...            |
                                  +----------------+
                                  | slot K size    |
                                  +----------------+
                                  | slot K field 0 |
                                  | ...            |
                                  | slot K field M |
                                  +----------------+
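A minimal C sketch of the flattened layout may help here. It assumes one contiguous block per thread, with every globals block registered before any thread is started; the `flat_*` names, the 16-byte alignment and the fixed-size registry are assumptions made for this illustration and are not part of TSRM.

```c
#include <stddef.h>
#include <stdlib.h>

#define FLAT_ALIGN     16   /* assumed alignment for every slot */
#define FLAT_MAX_SLOTS 8    /* arbitrary limit for this sketch */

/* Hypothetical header and globals structures, mirroring the diagram above. */
typedef struct { unsigned long thread_id; void *next; int count; } flat_header;
typedef struct { void *current_execute_data; int exit_status; } flat_executor_globals;

static size_t flat_slot_offset[FLAT_MAX_SLOTS];
static size_t flat_slot_size[FLAT_MAX_SLOTS];
static int    flat_slot_count;
static size_t flat_block_size;             /* 0 until the header has been accounted for */

static _Thread_local char *flat_base;      /* role of _tsrm_ls_cache */
static size_t flat_executor_globals_offset; /* filled in once by flat_reserve() */

static size_t flat_align_up(size_t n)
{
    return (n + FLAT_ALIGN - 1) & ~(size_t)(FLAT_ALIGN - 1);
}

/* Register a globals block; returns its byte offset from the block start (0 on failure). */
static size_t flat_reserve(size_t size)
{
    if (flat_slot_count == FLAT_MAX_SLOTS) return 0;
    if (flat_block_size == 0) flat_block_size = flat_align_up(sizeof(flat_header));
    size_t offset = flat_block_size + FLAT_ALIGN;  /* leave room for the size word */
    flat_slot_offset[flat_slot_count] = offset;
    flat_slot_size[flat_slot_count] = size;
    flat_slot_count++;
    flat_block_size = offset + flat_align_up(size);
    return offset;
}

/* Allocate the whole per-thread block at once and fill in the per-slot size words. */
static int flat_thread_init(void)
{
    if (flat_block_size == 0) flat_block_size = flat_align_up(sizeof(flat_header));
    char *base = calloc(1, flat_block_size);
    if (base == NULL) return -1;
    ((flat_header *)base)->count = flat_slot_count;
    for (int i = 0; i < flat_slot_count; i++)
        *(size_t *)(base + flat_slot_offset[i] - FLAT_ALIGN) = flat_slot_size[i];
    flat_base = base;
    return 0;
}

/* Access is now base + offset + field: no intermediate pointer table to walk. */
static int flat_read_exit_status(void)
{
    flat_executor_globals *eg =
        (flat_executor_globals *)(flat_base + flat_executor_globals_offset);
    return eg->exit_status;
}
```

Compared with the pointer-table layout, the read path needs only the thread-local base and the stored offset before it reaches the field, which is where the two saved loads come from. The cost of this sketch's design is that the block size must be known before threads start, so late registration would need extra handling.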
In addition, we may reserve a few slots for frequently used execute_data, compiler_data, etc., and make “executor_globals_id” a constant precomputed at compile time. This reduces the access pattern to 3 instructions and 3 loads. It also eliminates the need for a temporary CPU register.
movq _tsrm_ls_cache@gottpoff(%rip), %rax
movq %fs:(%rax), %rax
movq executor_globals_offset + field_offset(%rax), %rax
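One way to picture the reserved slots, sketched in C: if the frequently used globals are ordinary members of the per-thread block, their offsets become `offsetof` constants that the compiler can fold into the addressing mode. The `fixed_*` and `FIXED_*` names and the structure contents are hypothetical and chosen only for this example.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the frequently used globals. */
typedef struct { void *current_execute_data; int exit_status; } fixed_executor_globals;
typedef struct { int in_compilation; } fixed_compiler_globals;

/* The reserved slots are plain members, so their offsets are known at compile time. */
typedef struct {
    unsigned long thread_id;
    void *next;
    int count;
    fixed_executor_globals executor;
    fixed_compiler_globals compiler;
} fixed_tls_block;

/* Compile-time constant replacing the run-time load of the slot offset. */
#define FIXED_EXECUTOR_GLOBALS_OFFSET offsetof(fixed_tls_block, executor)

/* Field access as base + constant offset + field offset. */
#define FIXED_EG_FIELD(base, field) \
    (((fixed_executor_globals *)((char *)(base) + FIXED_EXECUTOR_GLOBALS_OFFSET))->field)

static _Thread_local fixed_tls_block *fixed_base;  /* role of _tsrm_ls_cache */

/* Allocate the zero-initialised per-thread block for the calling thread. */
static int fixed_thread_init(void)
{
    fixed_base = calloc(1, sizeof *fixed_base);
    return fixed_base != NULL ? 0 : -1;
}

/* Only the thread-local base and the field itself need to be loaded. */
static int fixed_read_exit_status(void)
{
    return FIXED_EG_FIELD(fixed_base, exit_status);
}
```

Because both offsets are constants, no register is needed to hold the slot offset, which matches the remark above about the temporary register. Globals registered dynamically by extensions would still need the offset-variable scheme.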
With JIT we may avoid the “position independent” read of “_tsrm_ls_cache” and reduce the access pattern to 2 instructions and 2 loads.
movq %fs:_tsrm_ls_cache, %rax
movq executor_globals_offset + field_offset(%rax), %rax
Finally, we may cache the address of “tsrm_tls_entry” in a CPU register and get down to 1 instruction and 1 load.