This describes an idea for improving ZTS builds by reducing the cost of accessing module globals, for example EG(current_execute_data). The idea came up while analysing the feasibility of a JIT for ZTS, but it should also benefit the ZTS interpreter and the ZTS build of PHP as a whole.
Without ZTS this takes just 1 CPU instruction and 1 load.
movl executor_globals+field_offset(%rip), %eax
                      executor_globals
                      +----------------+
                      | field 0        |
  field_offset ------>| ...            |
                      | field N        |
                      +----------------+
However, in a ZTS build the same access requires 6 CPU instructions and 6 loads.
movq _tsrm_ls_cache@gottpoff(%rip), %rax
movslq executor_globals_id(%rip), %rdx
movq %fs:(%rax), %rax
movq (%rax), %rax
movq -8(%rax,%rdx,8), %rdx
movl field_offset(%rdx), %eax
                           %fs
                            |
                            v       tsrm_tls_entry
+---------------------+   +---+   +----------------+
| _tsrm_ls_cache      |-->| + |-->| storage        |--+
+---------------------+   +---+   | count          |  |
                                  | thread_id      |  |
                                  | next           |  |
                                  +----------------+  |
                                                      |
+---------------------+   +---+                       |
| executor_globals_id |-->| + |<-----------+----------+                            
+---------------------+   +---+            |
                            |              |
                            |       void** v
                            |     +----------------+
                            |     | slot 0         |
                            +---->| ...            |--+
                                  | slot N         |  |
                                  +----------------+  |
                                                      |
                          +---+                       |
      field_offset ------>| + |<-----------+----------+
                          +---+            |
                            |              |
                            |       EG     v
                            |     +----------------+
                            |     | field 0        |
                            +---->| ...            |
                                  | field N        |
                                  +----------------+
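The pointer chain in the diagram above can be modelled in C roughly as follows. This is a simplified sketch, not the TSRM source: every `demo_` name is hypothetical, the globals struct is a two-field stand-in, and the per-thread structures are supplied by the caller instead of being allocated.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the real executor globals. */
typedef struct {
    int   field0;
    void *current_execute_data;
} demo_executor_globals;

/* Per-thread entry: "storage" points to a table with one pointer per module. */
typedef struct demo_tls_entry {
    void **storage;
    int    count;
    struct demo_tls_entry *next;
} demo_tls_entry;

/* Plays the role of _tsrm_ls_cache: thread-local pointer to the entry. */
static _Thread_local demo_tls_entry *demo_ls_cache;

/* Plays the role of executor_globals_id; ids are 1-based. */
static int demo_executor_globals_id;

/* Wire up the calling thread's structures (assumption: caller owns the memory). */
void demo_thread_startup(demo_tls_entry *entry, void **slots, int count,
                         demo_executor_globals *eg)
{
    assert(count >= 1);
    demo_executor_globals_id = 1;
    slots[demo_executor_globals_id - 1] = eg;
    entry->storage = slots;
    entry->count = count;
    entry->next = NULL;
    demo_ls_cache = entry;
}

/* Each step below is a dependent memory load, as in the assembly sequence. */
void *demo_current_execute_data(void)
{
    demo_tls_entry *entry = demo_ls_cache;          /* thread-local pointer */
    void **slots = entry->storage;                  /* slot table           */
    demo_executor_globals *eg =
        slots[demo_executor_globals_id - 1];        /* id, then slot        */
    return eg->current_execute_data;                /* the field itself     */
}
```

The `- 1` mirrors the `-8(%rax,%rdx,8)` addressing in the assembly, where the 1-based id is turned into a 0-based index into a table of 8-byte pointers.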
If we flatten all the data structures, we can reduce the access pattern to 4 instructions and 4 loads. I think it's possible to make these changes at the TSRM level only, without (or with minimal) modification of the TSRM source API. This would allow targeting the improvement at PHP-7.4.
movq _tsrm_ls_cache@gottpoff(%rip), %rax
movslq executor_globals_offset(%rip), %rdx
movq %fs:(%rax), %rax
movq field_offset(%rax,%rdx), %rax
                           %fs
                            |
                            v       tsrm_tls_entry
+---------------------+   +---+   +----------------+
| _tsrm_ls_cache      |-->| + |-->| thread_id      |
+---------------------+   +---+   | next           |
                            |     | count          |
                            v     +----------------+
+---------------------+   +---+   | slot 0 size    |
| executor_globals_id |-->| + |-->+----------------+
+---------------------+   +---+   | slot 0 field 0 |
                            |     | ...            |
                            V     | ...            |
                          +---+   | ...            |
      field_offset ------>| + |-->| ...            |
                          +---+   | slot 0 field N |
                                  +----------------+
                                  | ...            |
                                  +----------------+
                                  | slot K size    |
                                  +----------------+
                                  | slot K field 0 |
                                  | ...            |
                                  | slot K field M |
                                  +----------------+
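A minimal C sketch of such a flattened layout is shown here, assuming one contiguous block per thread in which each module's data lives at a byte offset assigned at registration time. All `flat_` names are hypothetical, and the per-slot size fields from the diagram are left out to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the real executor globals. */
typedef struct {
    int   field0;
    void *current_execute_data;
} flat_executor_globals;

/* Header at the start of each thread's block; module data follows it. */
typedef struct flat_tls_entry {
    struct flat_tls_entry *next;
    int count;
} flat_tls_entry;

/* Base address of the calling thread's block. */
static _Thread_local char *flat_ls_cache;

/* Bytes needed per thread so far, and the offset assigned to EG. */
static size_t flat_total_size = sizeof(flat_tls_entry);
static size_t flat_executor_globals_offset;

/* Reserve `size` bytes for a module; returns its byte offset in the block. */
size_t flat_reserve(size_t size)
{
    size_t align = _Alignof(max_align_t);
    size_t offset = (flat_total_size + align - 1) / align * align;
    flat_total_size = offset + size;
    return offset;
}

/* Allocate the zeroed block for the calling thread; returns 0 on failure. */
int flat_thread_startup(void)
{
    flat_ls_cache = calloc(1, flat_total_size);
    return flat_ls_cache != NULL;
}

void flat_thread_shutdown(void)
{
    free(flat_ls_cache);
    flat_ls_cache = NULL;
}

/* One base load plus one offset load, then a direct field access. */
#define FLAT_EG(field) \
    (((flat_executor_globals *)(flat_ls_cache + flat_executor_globals_offset))->field)
```

A module would call `flat_reserve(sizeof(flat_executor_globals))` once at startup and store the result in `flat_executor_globals_offset`. The slot table and its extra pointer dereference disappear, because the offset is added directly to the thread's base address.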
In addition, we may reserve a few slots for frequently used execute_data, compiler_data, etc., and make “executor_globals_id” a constant precomputed at compile time. This reduces the access pattern to 3 instructions and 3 loads. It also eliminates the need for a temporary CPU register.
movq _tsrm_ls_cache@gottpoff(%rip), %rax
movq %fs:(%rax), %rax
movq executor_globals_offset + field_offset(%rax), %rax
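One way to picture reserved slots with compile-time offsets is to embed the frequently used globals directly in the per-thread entry, so that `offsetof` yields the constants. The sketch below assumes that design; the `fixed_` names and the two-field structs are hypothetical, not the proposed TSRM layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the frequently used module globals. */
typedef struct {
    int   field0;
    void *current_execute_data;
} fixed_executor_globals;

typedef struct {
    int field0;
    int in_compilation;
} fixed_compiler_globals;

/* Reserved slots sit at fixed positions right after the header.
 * Dynamically registered modules would follow them (not modelled here). */
typedef struct fixed_tls_entry {
    struct fixed_tls_entry *next;
    int count;
    fixed_executor_globals eg;
    fixed_compiler_globals cg;
} fixed_tls_entry;

/* Thread-local pointer to the calling thread's entry. */
static _Thread_local fixed_tls_entry *fixed_ls_cache;

/* Slot offset plus field offset folded into a single constant. */
enum {
    FIXED_EG_CED_OFFSET = offsetof(fixed_tls_entry, eg)
                        + offsetof(fixed_executor_globals, current_execute_data)
};

/* Register the calling thread's entry (assumption: caller owns the memory). */
void fixed_thread_startup(fixed_tls_entry *entry)
{
    assert(entry != NULL);
    fixed_ls_cache = entry;
}

/* No id or offset variable is loaded: the compiler folds the displacement. */
#define FIXED_EG(field) (fixed_ls_cache->eg.field)
#define FIXED_CG(field) (fixed_ls_cache->cg.field)
```

Because the displacement is a constant, `FIXED_EG(current_execute_data)` needs only the thread-local base pointer, which is why no temporary register is required to hold an offset.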
With JIT, we may avoid the “position independent” read of “_tsrm_ls_cache” and reduce the access pattern to 2 instructions and 2 loads.
movq %fs:_tsrm_ls_cache, %rax
movq executor_globals_offset + field_offset(%rax), %rax
Finally, we may cache the address of “tsrm_tls_entry” in a CPU register and get down to 1 instruction and 1 load.