====== PHP ZTS Improvement ======

This page describes an idea for improving ZTS builds by reducing the cost of accessing module globals, for example EG(current_execute_data). The idea came up while analysing whether JIT is worthwhile for ZTS, but it should also speed up the ZTS interpreter and the PHP ZTS build as a whole.

===== Non ZTS Way =====

Without ZTS, this access takes just 1 CPU instruction and 1 load.


	movl	executor_globals+field_offset(%rip), %eax


                      executor_globals
                      +----------------+
                      | field 0        |
  field_offset ------>| ...            |
                      | field N        |
                      +----------------+
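
For readers who prefer C to assembly, the non-ZTS case can be sketched as follows. The struct and the ''DEMO_EG'' macro are hypothetical, heavily reduced stand-ins for PHP's real executor globals and ''EG()'' accessor; only the shape of the access is meant to match.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for PHP's executor globals. */
typedef struct {
    void *symbol_table;          /* placeholder field */
    void *current_execute_data;  /* the field read in the example */
} demo_executor_globals;

/* Without ZTS there is exactly one instance at a link-time-known address. */
static demo_executor_globals executor_globals;

/* field_offset is a compile-time constant, so it folds into the load. */
static_assert(offsetof(demo_executor_globals, current_execute_data) == sizeof(void *),
              "field offset is known at compile time");

/* Mirrors the shape of the EG() accessor in a non-ZTS build. */
#define DEMO_EG(v) (executor_globals.v)

void *demo_get_current_execute_data(void)
{
    /* Global address + constant offset: a single load. */
    return DEMO_EG(current_execute_data);
}
```

Because both the address of ''executor_globals'' and the field offset are fixed when the binary is linked, the compiler can typically emit a single RIP-relative load like the one above.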

===== Current ZTS Way (PHP-7.3) =====

However, in a ZTS build the same access requires 6 CPU instructions and 6 loads.


	movq	_tsrm_ls_cache@gottpoff(%rip), %rax
	movslq	executor_globals_id(%rip), %rdx
	movq	%fs:(%rax), %rax
	movq	(%rax), %rax
	movq	-8(%rax,%rdx,8), %rdx
	movl	field_offset(%rdx), %eax


                           %fs
                            |
                            v       tsrm_tls_entry
+---------------------+   +---+   +----------------+
| _tsrm_ls_cache      |-->| + |-->| storage        |--+
+---------------------+   +---+   | count          |  |
                                  | thread_id      |  |
                                  | next           |  |
                                  +----------------+  |
                                                      |
+---------------------+   +---+                       |
| executor_globals_id |-->| + |<-----------+----------+                            
+---------------------+   +---+            |
                            |              |
                            |       void** v
                            |     +----------------+
                            |     | slot 0         |
                            +---->| ...            |--+
                                  | slot N         |  |
                                  +----------------+  |
                                                      |
                          +---+                       |
      field_offset ------>| + |<-----------+----------+
                          +---+            |
                            |              |
                            |       EG     v
                            |     +----------------+
                            |     | field 0        |
                            +---->| ...            |
                                  | field N        |
                                  +----------------+
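
The chain of dereferences in the diagram can be approximated in C as below. This is a minimal sketch, not the TSRM source: all ''demo_'' names are hypothetical, the entry struct is reduced to the fields shown in the diagram, and C11 ''_Thread_local'' stands in for the real thread-local cache.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for PHP's executor globals. */
typedef struct {
    void *symbol_table;
    void *current_execute_data;
} demo_executor_globals;

/* Reduced version of the per-thread entry from the diagram. */
typedef struct demo_tls_entry {
    void **storage;                /* table of per-module global blocks */
    int count;                     /* number of valid slots */
    unsigned long thread_id;
    struct demo_tls_entry *next;
} demo_tls_entry;

/* Plays the role of _tsrm_ls_cache: per-thread pointer to the entry. */
static _Thread_local demo_tls_entry *demo_ls_cache;

/* Plays the role of executor_globals_id: 1-based id assigned at startup. */
static int demo_executor_globals_id;

void *demo_get_current_execute_data(void)
{
    demo_tls_entry *entry = demo_ls_cache;          /* TLS lookup */
    assert(entry != NULL);
    assert(demo_executor_globals_id >= 1 &&
           demo_executor_globals_id <= entry->count);

    void **table = entry->storage;                  /* load: slot table */
    demo_executor_globals *eg =
        table[demo_executor_globals_id - 1];        /* load: id, then slot */
    return eg->current_execute_data;                /* load: field */
}
```

Each pointer hop in this function corresponds to one of the dependent loads in the assembly listing, which is why the access cannot be shortened without changing the data layout.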

===== New ZTS Way (PHP-7.4+ or PHP-8) =====

If we flatten all the data structures, we can reduce the access pattern to 4 instructions and 4 loads. I think it's possible to make these changes at the TSRM level only, without (or with minimal) modification of the TSRM source API. This would allow targeting the improvement at PHP-7.4.


	movq	_tsrm_ls_cache@gottpoff(%rip), %rax
	movslq	executor_globals_offset(%rip), %rdx
	movq	%fs:(%rax), %rax
	movq	field_offset(%rax,%rdx), %rax


                           %fs
                            |
                            v       tsrm_tls_entry
+---------------------+   +---+   +----------------+
| _tsrm_ls_cache      |-->| + |-->| thread_id      |
+---------------------+   +---+   | next           |
                            |     | count          |
                            v     +----------------+
+---------------------+   +---+   | slot 0 size    |
| executor_globals_id |-->| + |-->+----------------+
+---------------------+   +---+   | slot 0 field 0 |
                            |     | ...            |
                            V     | ...            |
                          +---+   | ...            |
      field_offset ------>| + |-->| ...            |
                          +---+   | slot 0 field N |
                                  +----------------+
                                  | ...            |
                                  +----------------+
                                  | slot K size    |
                                  +----------------+
                                  | slot K field 0 |
                                  | ...            |
                                  | slot K field M |
                                  +----------------+
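
A possible C rendering of the flattened layout is sketched below. It is an illustration under stated assumptions, not a proposed implementation: all ''demo_'' names are hypothetical, the per-slot size words from the diagram are omitted, and every module is assumed to reserve its space before any thread block is allocated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, reduced stand-in for PHP's executor globals. */
typedef struct {
    void *symbol_table;
    void *current_execute_data;
} demo_executor_globals;

/* Header of the single per-thread block; module globals follow it inline. */
typedef struct demo_tls_entry {
    unsigned long thread_id;
    struct demo_tls_entry *next;
    int count;
} demo_tls_entry;

static _Thread_local demo_tls_entry *demo_ls_cache;

/* Bytes reserved so far in every per-thread block (header included). */
static size_t demo_block_size = sizeof(demo_tls_entry);

/* Byte offset of the executor globals from the start of the block. */
static size_t demo_executor_globals_offset;

/* Reserve `size` bytes for one module; returns its offset in the block. */
size_t demo_reserve_globals(size_t size)
{
    size_t align = _Alignof(max_align_t);
    size_t offset = (demo_block_size + align - 1) / align * align;
    demo_block_size = offset + size;
    return offset;
}

/* Allocate the zeroed flat block for the calling thread. */
demo_tls_entry *demo_thread_startup(void)
{
    demo_tls_entry *entry = calloc(1, demo_block_size);
    if (entry == NULL) {
        abort();
    }
    demo_ls_cache = entry;
    return entry;
}

/* Base pointer + module offset + field offset: no intermediate tables. */
#define DEMO_EG(v) \
    (((demo_executor_globals *)((char *)demo_ls_cache + \
                                demo_executor_globals_offset))->v)

void *demo_get_current_execute_data(void)
{
    return DEMO_EG(current_execute_data);
}
```

Compared with the table-based layout, the slot table and the separately allocated globals block disappear, which removes two dependent loads. The sketch does not address modules that register after threads have started, which a real design would have to handle.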

==== Reserved global id-s (EG, CG, etc) ====

In addition, we may reserve a few slots for frequently used globals such as executor globals (EG) and compiler globals (CG), and turn "executor_globals_id" and the like into constants precomputed at compile time. This reduces the access pattern to 3 instructions and 3 loads. It also eliminates the need for a temporary CPU register.

	
	movq	_tsrm_ls_cache@gottpoff(%rip), %rax
	movq	%fs:(%rax), %rax
	movq	executor_globals_offset + field_offset(%rax), %rax
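
One way to picture reserved slots in C is a fixed struct that places the frequently used globals right after the header, so that their offsets come from ''offsetof''. This is only a sketch under that assumption; the ''demo_'' types and the fixed ordering are hypothetical and not part of TSRM.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced stand-ins for the real globals structures. */
typedef struct {
    void *symbol_table;
    void *current_execute_data;
} demo_executor_globals;

typedef struct {
    void *active_op_array;
} demo_compiler_globals;

typedef struct demo_tls_entry {
    unsigned long thread_id;
    struct demo_tls_entry *next;
    int count;
} demo_tls_entry;

/* Fixed layout: header, then the reserved globals in a fixed order.
 * Dynamically registered module globals would follow this struct. */
typedef struct {
    demo_tls_entry header;
    demo_executor_globals eg;
    demo_compiler_globals cg;
} demo_thread_block;

/* Offsets of the reserved slots are now compile-time constants. */
#define DEMO_EG_OFFSET offsetof(demo_thread_block, eg)
#define DEMO_CG_OFFSET offsetof(demo_thread_block, cg)

static _Thread_local demo_thread_block *demo_ls_cache;

/* Slot offset and field offset fold into one constant displacement. */
#define DEMO_EG(v) (demo_ls_cache->eg.v)

void *demo_get_current_execute_data(void)
{
    return DEMO_EG(current_execute_data);
}
```

Since the slot offset no longer has to be loaded from a variable, the compiler can fold it together with the field offset, which matches the single ''executor_globals_offset + field_offset'' displacement in the listing above.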

==== JIT ====

With JIT we may avoid the "position independent" read of "_tsrm_ls_cache" and reduce the access pattern to 2 instructions and 2 loads.

	
	movq	%fs:_tsrm_ls_cache, %rax
	movq	executor_globals_offset + field_offset(%rax), %rax

Finally, we may cache the address of "tsrm_tls_entry" in a CPU register and get down to 1 instruction and 1 load.
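
Portable C cannot pin a value to a specific CPU register, but the effect of caching the base address can be approximated by loading the thread-local pointer once and passing it along explicitly. The sketch below assumes a hypothetical ''demo_thread_block'' with a single counter field; it illustrates the idea rather than how a JIT would actually allocate the register.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced per-thread block with one global field. */
typedef struct {
    long counter;
} demo_executor_globals;

typedef struct {
    demo_executor_globals eg;
} demo_thread_block;

static _Thread_local demo_thread_block *demo_ls_cache;

/* Receives the already-loaded base pointer, so the field access
 * needs no further thread-local lookup. */
static void demo_bump(demo_thread_block *base)
{
    base->eg.counter++;
}

long demo_run(int iterations)
{
    /* The thread-local lookup is paid once, outside the loop. */
    demo_thread_block *base = demo_ls_cache;
    assert(base != NULL);

    for (int i = 0; i < iterations; i++) {
        demo_bump(base);
    }
    return base->eg.counter;
}
```

While ''base'' stays in a register, each access to a global is a single load or store with a constant displacement, which is the 1 instruction and 1 load case described above.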