Request for Comments: Native TLS for globals in ZTS
- Version: 2.0
- Date: 2008-08-24
- Author: Arnaud Le Blanc firstname.lastname@example.org
- Status: Under Discussion
- First Published at: http://marc.info/?l=php-internals&m=121893972814818&w=2
- Initial patch: http://arnaud.lb.s3.amazonaws.com/__thread-tls.patch
- Current patch: http://arnaud.lb.s3.amazonaws.com/__thread-tls-2.patch
Currently ZTS builds are slower than non-ZTS builds. This RFC is about avoiding some of the major overhead of ZTS builds by using native thread local storage.
Currently, the way globals work forces a thread-local-storage pointer to be passed across function calls, which involves some overhead. Also, not all functions receive the pointer as an argument; those that do not must use TSRMLS_FETCH(), which is slow. For instance, emalloc() involves a TSRMLS_FETCH(). Another overhead is accessing globals, which goes through multiple pointers in different locations.
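The two calling styles can be contrasted with a minimal sketch. It is not TSRM code: the `demo_*` names are hypothetical, and the fetch is assumed to be a pthread key lookup, which is only an approximation of what TSRMLS_FETCH() does.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread state, standing in for the engine's globals. */
typedef struct { long alloc_count; } demo_globals;

static pthread_key_t demo_key;
static pthread_once_t demo_once = PTHREAD_ONCE_INIT;

static void demo_make_key(void)
{
    int rc = pthread_key_create(&demo_key, free);
    assert(rc == 0);
    (void)rc;
}

/* Analogue of a fetch: look up this thread's block on every call,
 * creating it lazily the first time. */
static demo_globals *demo_fetch(void)
{
    pthread_once(&demo_once, demo_make_key);
    demo_globals *g = pthread_getspecific(demo_key);
    if (g == NULL) {
        g = calloc(1, sizeof(*g));
        assert(g != NULL);
        pthread_setspecific(demo_key, g);
    }
    return g;
}

/* Style 1: the caller already holds the pointer and passes it along,
 * at the price of an extra argument on every call. */
static long demo_alloc_explicit(demo_globals *g)
{
    return ++g->alloc_count;
}

/* Style 2: no pointer argument, so the function has to fetch it,
 * at the price of a lookup on every call. */
static long demo_alloc_fetch(void)
{
    return ++demo_fetch()->alloc_count;
}
```

Both functions update the same per-thread counter; they differ only in where the cost of locating the thread's data is paid.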
The first proposed patch makes each global a native TLS variable so that accessing them is as simple as global_name->member. This removes the requirement of passing the tls pointer across function calls, so that the two major overheads of ZTS builds are avoided.
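As an illustration of what a native TLS global looks like, here is a small sketch using the GCC `__thread` keyword. The `demo_*` names are hypothetical and the global is a plain struct rather than one of the engine's globals.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical global; with native TLS every thread gets its own copy. */
typedef struct { int counter; } demo_globals;

static __thread demo_globals demo_g;

/* No TLS pointer argument and no fetch: the global is accessed directly. */
static void demo_bump(int n)
{
    for (int i = 0; i < n; i++) {
        demo_g.counter++;
    }
}

/* Thread body: bump this thread's copy and report its final value. */
static void *demo_worker(void *arg)
{
    int *n = arg;
    demo_bump(*n);
    *n = demo_g.counter;
    return NULL;
}

/* Runs demo_bump(n) in a fresh thread; returns that thread's counter,
 * or -1 if the thread could not be created. */
static int demo_run_thread(int n)
{
    pthread_t t;
    int value = n;
    if (pthread_create(&t, NULL, demo_worker, &value) != 0) {
        return -1;
    }
    pthread_join(t, NULL);
    return value;
}
```

A new thread always starts from a zeroed copy, so the value it reports does not depend on what the calling thread did to its own `demo_g`.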
Results for bench.php:
|ZTS patched and static globals||3.8s|
|ZTS patched, static globals, dynamic TLS||4.8s|
So the patch made ZTS builds almost as fast as non-ZTS builds.
Unfortunately, the TLS model used in these tests was the static model, which is restrictive: in particular, it cannot be used in shared libraries that are loaded dynamically (cf. TLS internals below).
Although this patch can still be used with the dynamic model (it actually greatly improves the performance of ZTS-PIC builds), it cannot be used as-is when building non-PIC code (the dynamic model cannot be used with non-PIC code).
The second patch is based on some research I did on various TLS implementations.
In fact, almost all the implementations I tested (Linux, FreeBSD, Solaris) allocate a surplus of TLS memory specifically so that libraries using static TLS can be dlopen()ed. Because this memory is allocated in addition to any TLS memory needed by libraries loaded before program startup, it is always available and is reserved for this case.
Based on that, and as long as it can be tested at configure time, it seems reasonable to expect that we will always have space for at least a single TLS pointer. Unfortunately, Windows was the only implementation I tested that does not allow this at all, so the following will not work on it.
So the second patch uses only one TLS variable, tsrm_ls_cache, which is used to cache tsrm_ls, so that it is not required to pass it across function calls.
Here are the results after having applied only this change:
|ZTS-patched, native TLS enabled||4.6s|
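The idea of caching the per-thread pointer in a single TLS variable can be sketched as follows. This is only an illustration: `demo_ls_cache` plays the role of tsrm_ls_cache, and the other `demo_*` names and the allocation strategy are assumptions, not the patch's code.

```c
#include <stdlib.h>

/* Hypothetical per-thread data block. */
typedef struct { int error_level; } demo_globals;

/* The only native TLS variable: one pointer per thread. */
static __thread void *demo_ls_cache;

/* Must run once in each thread before the accessors are used.
 * Returns 1 on success, 0 if the allocation failed. */
static int demo_thread_init(void)
{
    if (demo_ls_cache == NULL) {
        demo_ls_cache = calloc(1, sizeof(demo_globals));
    }
    return demo_ls_cache != NULL;
}

/* Accessors take no TLS pointer argument; they read the cached one. */
static void demo_set_level(int level)
{
    ((demo_globals *)demo_ls_cache)->error_level = level;
}

static int demo_get_level(void)
{
    return ((demo_globals *)demo_ls_cache)->error_level;
}
```

Because only one pointer-sized TLS slot is needed, this fits in the surplus static TLS space even when the code is loaded with dlopen().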
After that I made some more changes so that accessing a global requires fewer instructions. This mimics the way static TLS works internally; each global is accessed using the following code:
__thread void *tsrm_ls_cache;
(tsrm_ls_cache + global_offset)->member
This needs fewer instructions than the original way of accessing globals:
void ***tsrm_ls;
(*tsrm_ls)[global_id - 1]->element
This change also applies when native TLS is not used, but tsrm_ls then needs to be a void** instead of a void*:
void **tsrm_ls;
(*tsrm_ls + global_offset)->member
|ZTS-patched, native TLS disabled||5.0s|
|ZTS-patched, native TLS enabled||4.2s|
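The difference between the id-based and the offset-based lookups can be made concrete with a small sketch. The struct types, macro names and layout below are hypothetical; they only reproduce the shape of the two access expressions shown above.

```c
#include <stddef.h>

/* Two hypothetical globals. */
typedef struct { int a; } demo_first;
typedef struct { long b; } demo_second;

/* Offset scheme: all globals live in one per-thread block, and each
 * global is identified by its byte offset inside that block. */
typedef struct { demo_first first; demo_second second; } demo_block;
#define DEMO_SECOND_OFFSET offsetof(demo_block, second)
#define DEMO_BY_OFFSET(base, type, off) ((type *)((char *)(base) + (off)))

/* Id scheme: a per-thread table of pointers, indexed by id - 1 and
 * reached through one more level of indirection. */
#define DEMO_BY_ID(ls, type, id) ((type *)(*(ls))[(id) - 1])

/* Reads the same member through both schemes; returns the value if
 * they agree and -1 otherwise. */
static long demo_compare(void)
{
    demo_block block = { { 1 }, { 2 } };
    void *table[2] = { &block.first, &block.second };
    void **table_ptr = table;
    void ***ls = &table_ptr;

    long via_id = DEMO_BY_ID(ls, demo_second, 2)->b;
    long via_offset =
        DEMO_BY_OFFSET(&block, demo_second, DEMO_SECOND_OFFSET)->b;

    return via_id == via_offset ? via_offset : -1;
}
```

The id scheme loads the table pointer, then the table entry, then the member. The offset scheme adds a constant to a base pointer and loads the member.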
Native TLS can be enabled with --with-tsrm-__thread-tls or --with-tsrm-full__thread-tls in the first patch.
For the second patch, the switch is --with-tsrm-native-tls.
On most systems there are two major models of TLS: a static model, which is the faster one, and a dynamic model (and some sub-models). The following briefly explains how they work and what I found in various implementations.
Each TLS block is allocated at a fixed (linker-defined) offset from an address specific to each thread. As this address can be accessed very quickly, this allows very quick access to each TLS block. On most implementations, on IA-32, this thread-specific address is the Thread Control Block, whose address is stored at offset 0 of the segment selected by the %gs register.
The static model requires the memory needed by each TLS variable to be allocated before program startup. This means that the static model cannot be used in shared libraries loaded at runtime.
Linux, Solaris, FreeBSD, Windows.
Linux, Solaris and FreeBSD implementations allocate a fixed amount of surplus memory specifically to allow dynamically loaded libraries to use the static model. Linux allocates 1664 bytes, FreeBSD 64 and Solaris 512. This memory is always allocated in addition to the memory allocated for TLS before program startup, and is always available (it can be used only by dlopen()ed modules using static TLS).
On GCC this model can be selected by using -ftls-model=initial-exec. On SunStudio: -xthreadvar=no%dynamic. For both, this model is the default one when building non-PIC code.
Each TLS block is allocated dynamically when a shared library is loaded. Some data is then stored in the global offset table so that the program knows where to find each TLS block. This model allows libraries to be loaded at runtime but is slower: it involves a function call (internally) and requires position-independent code. However, the implementation used on Linux seems to be very efficient: only the fact that the code has been built as position-independent makes a real difference compared to the static model.
Linux, Solaris, FreeBSD, Windows Vista. The Windows documentation does not give many implementation details, but it seems that Windows Vista allows DLLs to use thread local storage (not tested). Other Windows versions have no equivalent of the dynamic model.
On GCC this model can be selected by using -ftls-model=global-dynamic (GCC's name for the general dynamic model). On SunStudio: -xthreadvar=dynamic. For both, this is the default when building PIC code.
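Besides the command-line switches, GCC also lets the model be chosen per variable with the `tls_model` attribute. The sketch below shows one variable for each model; the variable and function names are hypothetical. Note that GCC spells the dynamic model `global-dynamic`, and that the compiler or linker may still use a more efficient model when it can prove this is safe.

```c
/* Static model: the offset from the thread pointer is fixed when the
 * object is loaded, so access needs no function call. */
static __thread int demo_static_counter
    __attribute__((tls_model("initial-exec")));

/* Dynamic model: the address is resolved at run time, which is what
 * allows the code to live in a library loaded at runtime. */
static __thread int demo_dynamic_counter
    __attribute__((tls_model("global-dynamic")));

/* Each function increments the calling thread's copy and returns it. */
static int demo_bump_static(void)
{
    return ++demo_static_counter;
}

static int demo_bump_dynamic(void)
{
    return ++demo_dynamic_counter;
}
```

Comparing the code generated for the two functions (for example with `gcc -S -fPIC`) is a simple way to see the cost of each model on a given platform.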
- ELF implementation: http://people.redhat.com/drepper/tls.pdf
The current way of declaring a global is as follows:
extern ts_rsrc_id my_global_id; /* declare global in headers */
ts_rsrc_id my_global_id; /* declare global */
ts_allocate_id(&my_global_id, sizeof(type), ctor, dtor); /* allocate global at thread startup */
The new way is:
TSRMG_DH(type, my_global_id); /* declare global in headers */
TSRMG_D(type, my_global_id); /* declare global */
TSRMG_ALLOCATE(my_global_id, sizeof(type), ctor, dtor); /* allocate global at thread startup */
All this is already done by the patch for code in the Zend Engine and in /php-src.
No changes are needed for extensions as long as they use the extension-specific macros for declaring globals (this is the default for extensions created with ext_skel).
Declaring tsrm_ls explicitly must be avoided.
TSRM does some sort of JIT initialization of thread data, relying on the fact that TSRMLS_FETCH() calls ts_resource_ex, which does the initialization if needed. With the patch, however, TSRMLS_FETCH() does nothing at all, so ts_resource_ex must be called explicitly at least once in each thread. The TSRMLS_INIT() macro has been created for this purpose.
As the patch avoids passing tsrm_ls across function calls, #ifdef ZTS is no longer a valid way to check whether tsrm_ls is passed. The new PASS_TSRMLS macro is now defined when tsrm_ls needs to be passed across function calls. For instance, this is needed by ZEND_ATTRIBUTE_FORMAT and in some other places.