PHP RFC: PHP.net Analytics Collection
- Version: 0.9
- Date: 2024-10-28
- Author: Larry Garfield (larry@garfieldtech.com), Roman Pronskiy (roman@pronskiy.com)
- Status: Draft
- First Published at: http://wiki.php.net/rfc/phpnet-analytics
Introduction
The PHP.net website is a critical resource for PHP developers worldwide, providing documentation, news, and updates about the PHP language. It has millions of visits every year. At least, we're pretty sure it does; PHP.net currently lacks any useful analytics beyond rudimentary server logs. That makes it difficult to determine where or how best to invest resources (whether volunteer or paid) in improving the PHP.net experience, particularly the documentation.
Of particular interest, the PHP Foundation is looking into expanding its scope to fund improvements to the documentation. However, there are over 17,000 documentation pages on php.net, and right now no one knows which ones get the most traffic and are most worth investing resources in. To know where to get the best “bang for the buck,” both literally and figuratively, we need better data.
This RFC proposes implementing a self-hosted analytics solution to gather valuable insights into how users interact with the site.
Proposal
The Infrastructure Team, in cooperation with the Foundation, will install the Matomo analytics server on PHP-maintained hardware and add its tracking code to PHP.net. The setup is very similar to Google Analytics, except that it is self-hosted.
Matomo will be configured to avoid saving any Personally Identifiable Information (PII).
The specific code to be placed on php.net can be previewed in this commit.
Data access
The raw data collected by Matomo will be available only to the PHP Infrastructure Team, which may include staff from the PHP Foundation.
Fully anonymized aggregate data (such as most-popular-pages, overall traffic rates, coarse-grained geographic information, etc.) will be made publicly available as feasible.
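As a rough illustration of what a "most-popular-pages" aggregate could look like, the sketch below reduces a list of viewed paths to per-page counts and drops pages below a minimum count. The `top_pages` helper, its `min_views` threshold, and the sample counts are all assumptions made for this example; they are not part of Matomo or of this proposal.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_pages(paths: Iterable[str], limit: int = 10,
              min_views: int = 5) -> List[Tuple[str, int]]:
    """Return the most-viewed paths with their counts.

    Pages with fewer than `min_views` views are left out, so the published
    list contains only totals and says nothing about rarely visited pages.
    """
    counts = Counter(paths)
    popular = [(path, n) for path, n in counts.items() if n >= min_views]
    # Highest count first; ties are broken alphabetically for stable output.
    popular.sort(key=lambda item: (-item[1], item[0]))
    return popular[:limit]


# Made-up sample data: only the path of each view is kept, nothing per-user.
sample = (
    ["/manual/en/function.array-map.php"] * 7
    + ["/manual/en/language.types.php"] * 5
    + ["/manual/en/example-rare-page.php"] * 2
)

for path, views in top_pages(sample):
    print(f"{views:>3}  {path}")
```

Because the input is a bare list of paths, nothing user-specific is available to leak into the published totals.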
Privacy
To acknowledge the presence of an analytics service, the “Logfiles” section of the PHP Privacy Policy page will be replaced with the following:
Analytics
PHP.net collects anonymous user statistics to help improve the site. We do not collect any personally identifiable information, and you may opt out of analytics at any time. Collected analytics are used exclusively by the PHP.net team and PHP Foundation to improve PHP.net. The raw data is never shared with any third party, ever, unless compelled by a valid court order.
Why self-hosted?
One of the chief concerns with any analytics system is tracking by third parties. While Google Analytics and similar services are the most popular, many members of the PHP community are justifiably concerned about what such companies do with the data they collect. For that reason, we believe this is a case where self-hosting is the better option, even if it isn't as feature-rich as some third-party services. Protecting the privacy of our users is of paramount importance.
Why Matomo?
There are many self-hosted analytics packages available on the market. Matomo was selected for a number of reasons.
- It is already in use on the PHP Foundation website, so the team is already familiar with it
- It's Free Software (GPLv3)
- It's written primarily in PHP
- It has a long history on the market (the oldest commit is over 16 years old) and is still in active development, so it should be reliable for a long time to come
- Matomo supports GDPR compliance, including allowing users to opt out of tracking entirely
While there are no doubt other viable options on the market, the above points (particularly the team's existing familiarity with the tool) make it the most straightforward option.
Why a JS tracker?
Matomo can ingest server log files as an alternative to using a JS tracker. However, that approach would be more fragile and would yield inadequate data, for a number of reasons.
- Server logs are more locked down, so fewer people can access them, which means a lower bus factor.
- Automatically ingesting logs, either from the server or our CDN provider, adds more moving parts that can (and likely will) break from time to time.
- Server logs miss requests that a CDN or other proxy serves from its cache. Even CDN logs would miss requests served from another proxy's cache.
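The caching point can be made concrete with a toy model. The sketch below assumes a single CDN that caches each path for a fixed number of seconds; `origin_log_entries`, the TTL value, and the traffic are all invented for illustration and do not describe the actual PHP.net setup.

```python
from typing import Dict, List, Tuple


def origin_log_entries(requests: List[Tuple[float, str]],
                       ttl: float) -> List[Tuple[float, str]]:
    """Return the (timestamp, path) requests that actually reach the origin.

    Models one CDN that caches each path for `ttl` seconds after fetching it.
    """
    expires_at: Dict[str, float] = {}
    logged: List[Tuple[float, str]] = []
    for timestamp, path in sorted(requests):
        if timestamp >= expires_at.get(path, float("-inf")):
            # Cache miss: the CDN fetches from the origin, so the request
            # shows up in the origin server log.
            logged.append((timestamp, path))
            expires_at[path] = timestamp + ttl
        # Cache hit: the CDN answers on its own; the origin never sees it.
    return logged


# Made-up traffic: six views of one page over roughly ten minutes.
visits = [(t, "/manual/en/index.php") for t in (0, 60, 120, 310, 400, 590)]
seen = origin_log_entries(visits, ttl=300)
print(f"{len(visits)} real views, {len(seen)} in the origin log")
# prints "6 real views, 2 in the origin log"
```

In this model the origin log records only the two cache misses, so any page ranking built from it reflects cache behavior as much as readership.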
A client-side tracker also provides far richer data, such as:
- Time-on-page
- Whether users read the whole page or just part of it
- Whether users even saw the comments
- What percentage of users get to the docs through direct links vs the home page
- Whether users view a single page per browser window or navigate through the site, and if the latter, how?
- How much are users using the search function? Is it finding what they want, or is it just a crutch?
- Do people use the translations alone, or do they use both the English site and other languages in tandem?
- Does anyone use multiple translations?
None of that information can be derived from server logs. All of it could be derived from a client-side tracker, without collecting any PII.
Aren't analytics trackers evil?
No. Third-party trackers that uniquely identify individuals across multiple domains and make that data available to other third parties are evil. First-party analytics can provide valuable insights into how users use a website. The safety advantages of this approach are:
- No outside parties see the data, ever.
- No PII is collected, ever.
- We still get useful information about how people use php.net, which allows us to make it better.
Proposed Voting Choices
This is a simple yes-or-no vote to approve this service. A 2/3 majority is required to pass.