PHP ZTS Improvement

This page describes an idea for improving ZTS builds by reducing the cost of accessing module globals, for example EG(current_execute_data). The idea came up while evaluating the feasibility of JIT for ZTS, but it should also speed up the ZTS interpreter and the ZTS build of PHP as a whole.

Non ZTS Way

Without ZTS this takes just 1 CPU instruction and 1 load.

	movl	executor_globals+field_offset(%rip), %eax
                      executor_globals
                      +----------------+
                      | field 0        |
  field_offset ------>| ...            |
                      | field N        |
                      +----------------+

Current ZTS Way (PHP-7.3)

However, in a ZTS build the same access requires 6 CPU instructions and 6 loads.

	movq	_tsrm_ls_cache@gottpoff(%rip), %rax
	movslq	executor_globals_id(%rip), %rdx
	movq	%fs:(%rax), %rax
	movq	(%rax), %rax
	movq	-8(%rax,%rdx,8), %rdx
	movl	field_offset(%rdx), %eax
                           %fs
                            |
                            v       tsrm_tls_entry
+---------------------+   +---+   +----------------+
| _tsrm_ls_cache      |-->| + |-->| storage        |--+
+---------------------+   +---+   | count          |  |
                                  | thread_id      |  |
                                  | next           |  |
                                  +----------------+  |
                                                      |
+---------------------+   +---+                       |
| executor_globals_id |-->| + |<-----------+----------+                            
+---------------------+   +---+            |
                            |              |
                            |       void** v
                            |     +----------------+
                            |     | slot 0         |
                            +---->| ...            |--+
                                  | slot N         |  |
                                  +----------------+  |
                                                      |
                          +---+                       |
      field_offset ------>| + |<-----------+----------+
                          +---+            |
                            |              |
                            |       EG     v
                            |     +----------------+
                            |     | field 0        |
                            +---->| ...            |
                                  | field N        |
                                  +----------------+
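
The chain of dependent loads in the diagram can be written out in C. This is only a minimal sketch of the pointer chasing, not the real TSRM code: every name ending in _sketch, the two-field globals struct and the demo function are hypothetical, and the thread-local variable merely plays the role of _tsrm_ls_cache.

```c
#include <stddef.h>

/* Hypothetical stand-in for the executor globals; one field is enough here. */
typedef struct {
    int   other_field;
    void *current_execute_data;
} eg_sketch;

/* Simplified per-thread entry; field order follows the diagram. */
typedef struct tls_entry_sketch {
    void                   **storage;   /* one pointer per registered globals block */
    int                      count;
    unsigned long            thread_id;
    struct tls_entry_sketch *next;
} tls_entry_sketch;

/* Plays the role of _tsrm_ls_cache: a thread-local pointer to the entry. */
static _Thread_local tls_entry_sketch *tls_cache_sketch;

/* Ids are 1-based, which matches the -8 displacement in the listing. */
static int executor_globals_id_sketch = 1;

/* Each step depends on the result of the previous load. */
static void *read_current_execute_data(void)
{
    tls_entry_sketch *entry = tls_cache_sketch;               /* thread-local load   */
    void **storage = entry->storage;                          /* entry -> slot array */
    eg_sketch *eg = storage[executor_globals_id_sketch - 1];  /* slot -> globals     */
    return eg->current_execute_data;                          /* field load          */
}

/* Builds a one-slot table for the calling thread and reads the field back.
 * Returns 1 when the value read is the one that was stored. */
static int demo_pointer_chasing(void)
{
    static int marker;
    static eg_sketch eg = { 0, &marker };
    static void *slots[1] = { &eg };
    static tls_entry_sketch entry = { slots, 1, 0, NULL };

    tls_cache_sketch = &entry;
    return read_current_execute_data() == (void *)&marker;
}
```

Because the result of each load feeds the address of the next, the CPU cannot overlap them, which is what makes this pattern expensive compared with the non-ZTS case.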

New ZTS Way (PHP-7.4+ or PHP-8)

If we flatten all the data structures, we can reduce the access pattern to 4 instructions and 4 loads. I think it's possible to make these changes at the TSRM level only, without (or with minimal) modification of the TSRM source API. This would allow targeting the improvement at PHP-7.4.

	movq	_tsrm_ls_cache@gottpoff(%rip), %rax
	movslq	executor_globals_offset(%rip), %rdx
	movq	%fs:(%rax), %rax
	movq	field_offset(%rax,%rdx), %rax
                           %fs
                            |
                            v       tsrm_tls_entry
+---------------------+   +---+   +----------------+
| _tsrm_ls_cache      |-->| + |-->| thread_id      |
+---------------------+   +---+   | next           |
                            |     | count          |
                            v     +----------------+
+---------------------+   +---+   | slot 0 size    |
| executor_globals_id |-->| + |-->+----------------+
+---------------------+   +---+   | slot 0 field 0 |
                            |     | ...            |
                            V     | ...            |
                          +---+   | ...            |
      field_offset ------>| + |-->| ...            |
                          +---+   | slot 0 field N |
                                  +----------------+
                                  | ...            |
                                  +----------------+
                                  | slot K size    |
                                  +----------------+
                                  | slot K field 0 |
                                  | ...            |
                                  | slot K field M |
                                  +----------------+
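
A minimal C sketch of the flattened layout, under the assumption that each thread owns one contiguous block and that the offset of a slot is stored in a global when the slot is registered. All names ending in _sketch, the single-slot struct and the demo function are hypothetical and only illustrate the idea.

```c
#include <stddef.h>

/* Hypothetical stand-in for the executor globals. */
typedef struct {
    int   other_field;
    void *current_execute_data;
} eg_sketch;

/* One contiguous per-thread block: header first, then the slots in place. */
typedef struct {
    unsigned long thread_id;
    void         *next;
    int           count;
    size_t        slot0_size;
    eg_sketch     slot0;        /* executor globals stored inline */
} flat_entry_sketch;

/* Plays the role of _tsrm_ls_cache: points at the start of the block. */
static _Thread_local char *flat_cache_sketch;

/* Byte offset of the slot inside the block, filled in at registration. */
static size_t executor_globals_offset_sketch;

static void *read_current_execute_data_flat(void)
{
    char *base = flat_cache_sketch;                 /* thread-local load */
    /* Reaching the slot is pointer arithmetic only; nothing is dereferenced. */
    eg_sketch *eg = (eg_sketch *)(base + executor_globals_offset_sketch);
    return eg->current_execute_data;                /* field load */
}

/* Fills one block for the calling thread and reads the field back.
 * Returns 1 when the value read is the one that was stored. */
static int demo_flat_layout(void)
{
    static int marker;
    static flat_entry_sketch entry;

    entry.count = 1;
    entry.slot0_size = sizeof(eg_sketch);
    entry.slot0.current_execute_data = &marker;

    executor_globals_offset_sketch = offsetof(flat_entry_sketch, slot0);
    flat_cache_sketch = (char *)&entry;
    return read_current_execute_data_flat() == (void *)&marker;
}
```

Compared with the pointer-chasing version, the two intermediate dereferences (entry to slot array, slot array to globals block) are replaced by an addition, so only the offset variable still has to be loaded from memory.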

Reserved global ids (EG, CG, etc.)

In addition, we may reserve a few slots for frequently used globals (executor globals, compiler globals, etc.) and make “executor_globals_id” a constant precomputed at compile time. This reduces the access pattern to 3 instructions and 3 loads. It also eliminates the need for a temporary CPU register.

	movq	_tsrm_ls_cache@gottpoff(%rip), %rax
	movq	%fs:(%rax), %rax
	movq	executor_globals_offset + field_offset(%rax), %rax
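
In C terms, reserving a slot means its offset is known to the compiler and can be folded into the addressing of the field. The following is a rough sketch under that assumption: the struct layout (which leaves out the per-slot size fields for brevity), the names ending in _SKETCH or _sketch and the demo function are hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the executor and compiler globals. */
typedef struct {
    int   other_field;
    void *current_execute_data;
} eg_sketch;

typedef struct {
    int other_field;
} cg_sketch;

/* Per-thread block with reserved slots at fixed positions after the header. */
typedef struct {
    unsigned long thread_id;
    void         *next;
    int           count;
    eg_sketch     eg;   /* reserved slot */
    cg_sketch     cg;   /* reserved slot */
} reserved_entry_sketch;

/* The slot offset is a compile-time constant, not a variable in memory. */
enum { EG_OFFSET_SKETCH = offsetof(reserved_entry_sketch, eg) };

/* Access a field of the reserved executor-globals slot from the block base. */
#define EG_FIELD_SKETCH(base, field) \
    (((eg_sketch *)((char *)(base) + EG_OFFSET_SKETCH))->field)

/* Plays the role of _tsrm_ls_cache. */
static _Thread_local char *reserved_cache_sketch;

static void *read_current_execute_data_reserved(void)
{
    /* One thread-local load, then one load at a constant displacement. */
    return EG_FIELD_SKETCH(reserved_cache_sketch, current_execute_data);
}

/* Returns 1 when the value read is the one that was stored. */
static int demo_reserved_slot(void)
{
    static int marker;
    static reserved_entry_sketch entry;

    entry.eg.current_execute_data = &marker;
    reserved_cache_sketch = (char *)&entry;
    return read_current_execute_data_reserved() == (void *)&marker;
}
```

Since both the slot offset and the field offset are constants here, a compiler can combine them into a single displacement, which corresponds to the executor_globals_offset + field_offset operand in the listing.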

JIT

With JIT we may avoid the “position independent” read of “_tsrm_ls_cache” and reduce the access pattern to 2 instructions and 2 loads.

	movq	%fs:_tsrm_ls_cache, %rax
	movq	executor_globals_offset + field_offset(%rax), %rax

Finally, we may cache the address of “tsrm_tls_entry” in a CPU register and get down to 1 instruction and 1 load.

zts-improvement.txt · Last modified: 2019/02/13 08:17 by dmitry