
Request for Comments: Choosing a distributed version control system for PHP

  • Version: 1.0
  • Date: 2011-07-30
  • Author: David Soria Parra <dsp at php dot net>
  • Status: Accepted with option Git; see Voting Results (voting ended Sep 7, 2011 12:00 UTC)
  • First Published at: http://wiki.php.net/rfc/dvcs

Introduction

PHP currently uses Subversion (SVN) as its version control system. This has a few drawbacks, some of which a decentralized version control system can solve. This RFC aims to provide information helpful in choosing one of the proposed decentralized version control systems (DVCS).

Current Situation

Subversion is used to host the main PHP repository, as well as several sub-projects such as PEAR and PECL. Access to the repository is controlled through the Karma system, which uses a custom implementation.

Using Subversion has several drawbacks:

  • requires network access (and SVN's network access over HTTP is very slow)
  • slow log/annotate commands
  • large checkout sizes
  • single point of failure
  • painful merging
  • no implicit consistency checks

A decentralized version control system can solve several of these issues:

  • better merging support
  • consistency checks using SHA1 checksums
  • local repository, no network access required to commit
  • network access only for push/pull (and faster network operations)
  • no single point of failure, easy to set up multiple hosting locations
  • advanced features such as rebase to linearise history, bisect to find regression bugs
  • “social coding platforms” enabling developers to easily submit patches.

Decentralized version control systems have some drawbacks:

  • no partial checkout of subdirectories
  • no empty directories, .keep file needed
  • no globally unique incrementing revision numbers; SHA-1 hashes serve as globally unique revision identifiers
  • no svn:externals, no svn:eol-style (solved with extensions)

Overview of Competitors

Git

Git was written by Linus Torvalds as a replacement for BitKeeper, which was used for Linux Kernel development until 2005. Git is used by various large Open Source Projects, including Perl, VLC and Gnome. It is considered the most popular and best known Open Source DVCS.

Git is written in C, Shell, and Perl. It runs under Linux, BSD and Mac OS X. A Windows version based on msys is available through the msysgit project.

URL:          http://git-scm.org
Mailing list: http://vger.kernel.org/vger-lists.html#git
Wiki:         https://git.wiki.kernel.org/
Version:      1.7.6

Mercurial

Mercurial was written by Matt Mackall in 2005. It is used by large Open Source Projects like OpenJDK, Python and Mozilla. It is written in Python, with some modules written in C for performance reasons. Mercurial is available for Linux, BSD, Mac OS X and Windows. Mercurial's command line name is 'hg' - a reference to the symbol of the chemical element Mercury.

URL:          http://mercurial.selenic.com
Mailing list: http://mercurial.selenic.com/wiki/MailingLists
Wiki:         http://mercurial.selenic.com/wiki/
Version:      1.9.1

Bazaar

Bazaar was written by Martin Pool in 2005, when he was commissioned by Canonical to build a DVCS that “open-source hackers would love to use.” It is written in Python, and is supported on Linux, Mac OS X, and Windows. Its command-line name is 'bzr'.

Information on Bazaar in this RFC is incomplete.

URL:          http://bazaar.canonical.com
Mailing list: http://wiki.bazaar.canonical.com/BzrSupport#Mailing%20Lists
Wiki:         http://wiki.bazaar.canonical.com/
Version:      2.3.4

Concepts

While every version control system has its own terminology, some terms are common to all of them. Here is a list of common definitions:

repository

  A collection of revisions, organized into branches

clone

  A complete copy of a branch or repository; also called a checkout

commit

  A recorded revision in a repository

merge

  An application of all the changes and history in one branch to another

pull

  To update a clone from the original branch/repository, whether remote or local

push

  To copy revisions from one repository to another

Revision Model

Git and Mercurial use string representations of SHA-1 checksums to identify a changeset. Both version control systems offer reserved names to access frequently used changesets such as the topmost commit. In both systems a user can specify only a part of the full SHA-1, as long as this part identifies a single changeset.
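
The prefix lookup described above can be tried out in a throwaway Git repository. This is a minimal sketch, not part of the proposed setup: the temporary repository, the file name and the identity are made up for illustration, and the seven-character prefix length is an arbitrary choice.

```shell
#!/bin/sh
# Sketch: expand an abbreviated SHA-1 back to the full changeset id.
# The repository, file and identity below are illustrative assumptions.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Developer"
git config user.email "dev@example.com"
echo "hello" > README
git add README
git commit -q -m "initial commit"

full=$(git rev-parse HEAD)                    # HEAD is the reserved name for the current commit
short=$(printf '%s' "$full" | cut -c1-7)      # keep only the first 7 hex digits
resolved=$(git rev-parse --verify "$short")   # Git expands an unambiguous prefix
echo "$short -> $resolved"
```

If the prefix matched more than one object, `git rev-parse` would report it as ambiguous instead of guessing.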

In addition to global revision numbers, Mercurial offers local revision numbers. They are incrementing integers that can be used to identify a changeset. Multiple repositories of the same project do not necessarily have the same local revisions.

Bazaar uses a more CVS-like model of monotonically increasing revision numbers, with dotted notation for branches.

Branching Model

Git and Mercurial have fundamental differences in their branching model.

Git uses pointers to a changeset to define a branch. Every ancestor of a changeset that is marked that way is part of the branch. If you delete the pointer, the name of the branch is gone, and can only be recovered using the so-called “reflog”, if it's not yet expired. This means that you cannot bring back the name of a branch after a few years.
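
A small experiment can make the pointer model concrete. The following is a sketch under assumptions: the repository, the branch name "experiment" and the identity are invented for illustration. It deletes a branch, looks the lost commit up through the reflog of HEAD, and recreates the pointer.

```shell
#!/bin/sh
# Sketch: a Git branch is only a named pointer to a commit.
# Repository, branch name and identity are illustrative assumptions.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Developer"
git config user.email "dev@example.com"
echo one > file.txt
git add file.txt
git commit -q -m "first"

git checkout -q -b experiment        # create the pointer and switch to it
echo two >> file.txt
git commit -q -a -m "second"
tip=$(git rev-parse HEAD)

git checkout -q -                    # back to the branch we came from
git branch -D experiment >/dev/null  # removes the name, not the commit
if git rev-parse -q --verify refs/heads/experiment >/dev/null; then
    after_delete=present
else
    after_delete=gone
fi

lost=$(git rev-parse 'HEAD@{1}')     # reflog: where HEAD was before the last checkout
git branch experiment "$lost"        # recreate the pointer from the recorded hash
restored=$(git rev-parse experiment)
echo "branch was $after_delete, restored at $restored"
```

Once the reflog entry has expired and the unreachable commit has been garbage-collected, this recovery is no longer possible.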

Mercurial, on the other hand, records the name of the branch in the changeset itself. Once you've committed to a branch, the branch name will stay. You can close a branch, but you cannot remove the branch name without altering history. The drawback of this approach is that branches are not well suited to short-lived test branches, as naming conflicts can occur. Mercurial offers solutions in the form of so-called Bookmarks and Anonymous Branches, which work similarly to Git's branching model.

Workflows

The following section describes typical workflows. Note that not all Subversion workflows translate one-to-one to a DVCS.

Setup

Git
  git config --global user.name "David Soria Parra"
  git config --global user.email "dsp@php.net"
Mercurial

Edit ~/.hgrc

  [ui]
  username = David Soria Parra <dsp@php.net>
Bazaar
  bzr whoami "David Soria Parra <dsp@php.net>"

Checkout and Patch

Git
  $ git clone git://git.php.net/php-src.git
  $ cd php-src
  $ git checkout PHP_5_4
  ... hack Zend/zend.c ...
  $ git commit Zend/zend.c
  $ git push
Mercurial
  $ hg clone http://hg.php.net/php-src
  $ cd php-src
  $ hg update PHP_5_4
  ... hack Zend/zend.c ...
  $ hg commit
  $ hg push
Bazaar
  $ bzr branch sftp://bzr.php.net/php-src
  $ cd php-src
  ... hack Zend/zend.c ...
  $ bzr commit
  $ bzr push

Port patches across branches

Git
  $ git checkout master
  $ git merge PHP_5_4
  or
  $ git checkout master
  $ git cherry-pick a32ba2 # assuming a32ba2 is the commit to port
  
Mercurial
  $ hg update default
  $ hg merge PHP_5_4
  or with the transplant extension enabled
  $ hg update default
  $ hg transplant a32ba2

Releasing a version

Git
  $ git checkout PHP_5_4
  $ git tag --sign v5.4.1
  $ git push origin v5.4.1
Mercurial
  $ hg update PHP_5_4
  $ hg tag v5.4.1
  $ hg push

Backport Patch

Git

To backport a changeset you can use the git cherry-pick feature.

In some circumstances this can lead to duplicated commits that can cause trouble during merges, so backporting a feature is discouraged. Try to apply a patch to the oldest currently maintained branch and merge that branch into the maintained release branches.

  $ git checkout master
  .. hack hack ..
  .. commit rev 3ab3f
  $ git checkout PHP_5_4
  $ git cherry-pick 3ab3f
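
The duplication mentioned above can be observed directly: a cherry-picked changeset carries the same patch but gets a new commit id. The sketch below assumes a throwaway repository with made-up file names and identity; only the branch name PHP_5_4 is taken from this RFC.

```shell
#!/bin/sh
# Sketch: git cherry-pick copies a patch, producing a second commit id.
# Repository contents and identity are illustrative assumptions.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Developer"
git config user.email "dev@example.com"
echo base > zend.c
git add zend.c
git commit -q -m "base"
main=$(git symbolic-ref --short HEAD)   # whatever the default branch is called

git checkout -q -b PHP_5_4              # release branch with a commit of its own
echo notes > NEWS
git add NEWS
git commit -q -m "release notes"

git checkout -q "$main"
echo fix >> zend.c
git commit -q -a -m "fix bug"
orig=$(git rev-parse HEAD)

git checkout -q PHP_5_4
git cherry-pick "$orig" >/dev/null      # backport the fix
copy=$(git rev-parse HEAD)

# git patch-id hashes the diff itself, ignoring the commit metadata
orig_patch=$(git show "$orig" | git patch-id | cut -d' ' -f1)
copy_patch=$(git show "$copy" | git patch-id | cut -d' ' -f1)
echo "commit ids: $orig vs $copy"
echo "patch ids:  $orig_patch vs $copy_patch"
```

The two commit ids differ while the patch ids match, which is why a later merge of the two branches sees the same change twice.
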
Mercurial

To backport a changeset you can use the hg transplant feature from the transplant extension that is shipped with Mercurial.

In some circumstances this can lead to duplicated commits that can cause trouble during merges, so backporting a feature is discouraged. Try to apply a patch to the oldest currently maintained branch and merge that branch into the maintained release branches.

  $ hg update default
  .. hack hack ..
  .. commit rev 3ab3f
  $ hg update PHP_5_4
  $ hg transplant 3ab3f

Moving extension from/to core to/from pecl

We will use separate repositories for PECL and PEAR modules. php-src will be a separate module. We need a mechanism to move extensions from PECL to core and vice versa.

Both Mercurial and Git support subrepositories (called submodules in Git). These are external references to repositories. The advantage of this approach is that it's very easy to add and remove modules by just modifying the external references. The drawback of this approach is that you will not have a combined history log of all subrepositories. Commits across multiple subrepositories will lead to separate commits. Mercurial and Git will not know that these commits are related.

An alternative to this approach is the use of subtree merges. You can merge a repository into a subdirectory of another repository. This way you will end up with the merged history being a full part of the repository's history. The drawback of this approach is that you need in-depth knowledge to perform such merges, or to split the repositories again. A similar approach can be used with Mercurial by using merge and the convert extension.

In our use case it makes more sense to use subtree merges. Moving extensions from or to core doesn't happen frequently, and the overhead in performing the merges and splitting is worth the benefit of having one php-src repository that contains all extensions and their full history.

Git

Git supports subtree merges, which can be used to merge an extension into an ext directory. To get an extension out of core and back into PECL, you will need to use git filter-branch to extract the subdirectory, and then remove the extension from the repository in the regular way with git rm.
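
As an illustration of the first direction (PECL into core), the following sketch performs a subtree merge between two throwaway repositories. The repository names, the extension name "foo" and the identities are assumptions made for the example, and the `--allow-unrelated-histories` flag is only needed (and only known) by Git 2.9 and later.

```shell
#!/bin/sh
# Sketch: subtree-merge a stand-in PECL extension into ext/foo of a stand-in php-src.
# All repository names, paths and identities are illustrative assumptions.
set -e
work=$(mktemp -d)
cd "$work"

git init -q pecl-foo                     # stand-in extension repository
cd pecl-foo
git config user.name "Extension Author"
git config user.email "ext@example.com"
echo "/* foo */" > foo.c
git add foo.c
git commit -q -m "foo: initial import"
cd ..

git init -q php-src                      # stand-in core repository
cd php-src
git config user.name "Core Developer"
git config user.email "core@example.com"
echo "core" > README
git add README
git commit -q -m "php-src: initial commit"

git fetch -q ../pecl-foo HEAD            # bring the extension history in
ext_tip=$(git rev-parse FETCH_HEAD)
# Record the merge but keep our tree untouched for now (Git >= 2.9 flag)
git merge -q -s ours --no-commit --allow-unrelated-histories "$ext_tip"
# Place the extension's files under ext/foo/ in index and working tree
git read-tree --prefix=ext/foo/ -u "$ext_tip"
git commit -q -m "Merge PECL extension foo into ext/foo"

parents=$(git rev-list --parents -n 1 HEAD | wc -w)   # commit id plus its parents
echo "merge commit lists $parents ids (itself and two parents)"
```

For the opposite direction, `git filter-branch --subdirectory-filter ext/foo` rewrites a clone so that the subdirectory becomes the project root; it is not run here because it rewrites history.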

Mercurial

To move an extension into core, you will first need hg convert to create a repository that contains the PECL extension in an ext/ folder. Then you can merge it. If you want to move an extension from core to PECL, you will need to use hg convert to extract the history of the extension from php-src, and then remove the ext/<extension> directory in the regular way with hg remove.

Tools and Platform Support

Operating Systems

Mercurial is available for all major platforms: Linux, BSD, Mac OS X, Windows. All core features are available on supported platforms. The same is true of Bazaar.

Git is available for Linux, BSD and Mac OS X. Windows binaries based on msys are provided by the msysgit project. As early Git versions primarily targeted Linux, some commands can still be slower or even nonexistent on Windows. This situation is slowly improving.

CRLF -> LF

Git supports CRLF to LF conversion. This can be configured using the variables core.autocrlf and core.safecrlf, and through .gitattributes files.
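
A quick way to see the conversion at work is to add a file with CRLF line endings under a .gitattributes rule and inspect what ends up in the index. This is a sketch only: the `*.txt text` pattern and the file names are assumptions for the example, not a proposed policy for php-src.

```shell
#!/bin/sh
# Sketch: .gitattributes normalises CRLF to LF when a file is added.
# The attribute pattern and file names are illustrative assumptions.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
printf '*.txt text\n' > .gitattributes          # treat *.txt as text, normalise on add
printf 'line one\r\nline two\r\n' > notes.txt   # working copy uses CRLF
git add .gitattributes notes.txt                # Git may warn about the conversion

# Count carriage returns in the working tree file and in the staged blob
cr_in_worktree=$(tr -cd '\r' < notes.txt | wc -c)
cr_in_index=$(git cat-file -p :notes.txt | tr -cd '\r' | wc -c)
echo "CRs in working tree: $cr_in_worktree, CRs stored by Git: $cr_in_index"
```

Setting `core.autocrlf` to `input` in a user's configuration has a similar effect for files Git detects as text, without needing an attribute rule.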

Mercurial supports CRLF to LF conversion using the EOL extension.

GUI

Mercurial: Various GUI tools are available: TortoiseHG, HGK, MacHG, Eclipse, Emacs, etc

Git: Various GUI tools are available: TortoiseGit, gitk, git-cola, qgit, Eclipse, Emacs, etc

Web

The PHP Karma System

PHP uses a self-implemented access control system, informally called Karma. The system consists of two parts: authentication and authorization. Authentication relies on PHP's user management, hosted on master.php.net; it determines whether a user is registered, and checks their password. Authorization then determines whether the user is allowed to commit to a particular directory.

Under Subversion, authentication is managed with HTTP Digest authentication, using data from master. Authorization is handled by a custom pre-commit hook which parses the “avail” files.

Authentication

Both Git and Mercurial support SSH and HTTP. We will focus on HTTP, as we cannot manage SSH keys for all users that have access to the PHP repository. HTTP will also enable push and pull through most corporate firewalls.

Git

Git can use HTTP Basic auth to verify whether a user is allowed to log in. We can use HTTP Digest for authentication and an update hook to check whether the user is authorized to push to a given directory.
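
To give an idea of what such an update hook could look like, here is a rough sketch. It is not the actual Karma implementation. The avail file format ("user:path-prefix" per line), its location, and the use of the REMOTE_USER variable for the authenticated name are all assumptions made for this example; the real "avail" files may look different.

```shell
#!/bin/sh
# Sketch of a Git "update" hook doing a Karma-style path check.
# ASSUMPTIONS (not from this RFC): the "user:path-prefix" avail format,
# the file location, and REMOTE_USER holding the authenticated user.

AVAIL_FILE=${AVAIL_FILE:-/etc/karma/avail}   # hypothetical location

# karma_allows USER PATH: succeed if a "USER:prefix" line covers PATH.
karma_allows() {
    while IFS=: read -r user prefix; do
        [ "$user" = "$1" ] || continue
        case "$2" in
            "$prefix"*) return 0 ;;      # PATH starts with an allowed prefix
        esac
    done < "$AVAIL_FILE"
    return 1
}

# check_push OLDREV NEWREV: fail if any changed path is not covered.
# Pushes that create a new branch (all-zero OLDREV) are not handled here.
check_push() {
    git diff --name-only "$1" "$2" | while IFS= read -r path; do
        karma_allows "${REMOTE_USER:-unknown}" "$path" || {
            echo "karma: ${REMOTE_USER:-unknown} may not change $path" >&2
            exit 1                       # leaves the pipeline subshell with an error
        }
    done
}

# Git calls the hook as: update <refname> <oldrev> <newrev>
if [ $# -eq 3 ]; then
    check_push "$2" "$3" || exit 1
fi
```

With an avail file containing the line `dsp:Zend/`, `karma_allows dsp Zend/zend.c` succeeds, while a path outside Zend/ is refused for that user.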

Mercurial

Mercurial can use HTTP basic auth. We can use HTTP Digest for authentication and a hook to check if the user has the necessary access rights. An implementation of the PHP karma system as a Mercurial plugin exists. It can be found here: https://bitbucket.org/segv/php-karma-hook

General Server Layout

      
      user <-- KARMA CHECK --> hg/git.php.net <-- MIRRORING --> github/bitbucket
      

Unique features

Git

Index

Git implements an index (also called the staging area) between the repository and the working directory that keeps track of changed files. Only changes that are tracked in the index are part of the next changeset upon commit. This makes it possible to “stage” just parts of the changes made to the working directory (see: git add -p). The drawback of this approach is that you have to manually stage a change or use the --all switch in git commit. While this is a powerful feature, it can confuse people coming from other version control systems.
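
The effect of the index can be shown with two modified files of which only one is staged. This is a minimal sketch with an invented repository, file names and identity.

```shell
#!/bin/sh
# Sketch: only staged changes become part of the next commit.
# Repository, file names and identity are illustrative assumptions.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example Developer"
git config user.email "dev@example.com"
echo a > a.txt
echo b > b.txt
git add a.txt b.txt
git commit -q -m "base"

echo a2 >> a.txt                           # modify both files ...
echo b2 >> b.txt
git add a.txt                              # ... but stage only one of them
staged=$(git diff --cached --name-only)    # what the next commit will contain
git commit -q -m "change a only"

in_commit=$(git diff-tree --no-commit-id --name-only -r HEAD)
unstaged=$(git diff --name-only)           # still modified, not committed
echo "committed: $in_commit, left in working directory: $unstaged"
```

Running `git commit --all` instead would have staged and committed both modified files in one step.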

Separate Author and Committer

Git separates author and committer, and records both in each changeset. A commit can have a different author and committer. This is useful for PHP, as a patch from the mailing list will be committed with both the original author and the person who committed it, making it easier to identify the original patch author.
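
The following sketch shows how the two identities end up in one commit. The names and addresses are invented for the example; the committer is supplied through Git's GIT_COMMITTER_NAME and GIT_COMMITTER_EMAIL environment variables so that the script does not depend on any existing configuration.

```shell
#!/bin/sh
# Sketch: one commit, two recorded identities.
# Names, addresses and file content are illustrative assumptions.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
echo "patch from the mailing list" > fix.txt
git add fix.txt

# The committer comes from the environment, the author from --author
GIT_COMMITTER_NAME="Core Developer" GIT_COMMITTER_EMAIL="core@example.com" \
    git commit -q --author="Jane Contributor <jane@example.com>" \
               -m "Apply patch from Jane"

author=$(git log -1 --format=%an)      # %an: author name
committer=$(git log -1 --format=%cn)   # %cn: committer name
echo "author: $author, committer: $committer"
```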

Mercurial

Local Revision Numbers

Both Git and Mercurial use SHA1 to identify a changeset and ensure that it is globally unique. The string representation of a SHA1 hash is the revision number. Incrementing integers cannot identify a changeset globally, but they are useful shortcuts in local repositories.

Mercurial uses incrementing integers, similar to SVN revision numbers, on changesets. These can be used on a local repository to identify a revision. Git supports only SHA1.

Bazaar uses CVS-style incrementing integers as revision numbers.

Changeset

  changeset:   75485:61e266b471e4
  branch:      PHP_5_4
  tag:         tip
  parent:      75482:b5e860dc2f05
  user:        sixd@c90b9560-bf6c-de11-be94-00142212c4b1
  date:        Mon Jul 25 17:30:09 2011 +0000
  summary:     Patch r313663 and r313665 to allow PECL builds to work with earlier releases

This changeset has the local revision number 75485 and the global identifier 61e266b471e4.

Revsets and Filesets

Mercurial has a powerful query language to select changesets or files to include in log, diff and similar commands. For example, to search for the last tagged revision that includes a given changeset, you can run the following query:

  hg log -r 'limit(descendants($1) and tagged(), 1)'

Extensions

Mercurial is written in Python, and supports loading and executing extensions written in Python. These extensions can access the internal Mercurial API.

Mercurial extensions are a common way to implement additional features such as rebase, commit signatures, and access control. Mercurial ships with a set of core extensions. A full list of extensions can be found in the Mercurial wiki.

Benchmarks

The tests were done with Git v1.7.6 and Mercurial 1.9 on a Thinkpad X201, Intel i5 M540 @ 2.53 GHz, Samsung 128 GB SSD, 4 GB RAM.

For the php-src repository (not the complete repository):

Benchmark                          Git       Mercurial
Repository size                    120 MB    197 MB (with hg 1.9 generaldelta)
Switching branch trunk → PHP_5_4   0.182s    1.328s
Annotate file (Zend/zend.c)        2.936s    0.745s
Log over the last 1000 commits     0.752s    1.111s

Hosting Infrastructure

The main repository will be hosted at php.net. This will make implementing infrastructure or scripts easier, and gives us full power over our development environment.

Other hosting sites can be used to attract more developers. DVCS make it easy to push a repository to different locations while keeping them all in sync.

Git / Github

The most popular hosting site for Git based projects is github.com. Github encourages people to interact with hosted projects by making it easy to clone a repository and send a “pull request” to the upstream project.

Github has reserved a PHP user for the PHP project, with unlimited public repositories. We can use this account to create a repository and integrate pull requests from Github into PHP.

A typical workflow is, for example:

1. Pull the pull request from github.com
2. Merge locally
3. Push merge to git.php.net
4. Automatic sync between git.php.net <-> github will push the changes
   to github. The pull request will be closed automatically.
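
The first three steps of this workflow can be simulated entirely with local repositories. This is a sketch under assumptions: the directories "canonical.git", "fork" and "dev" merely stand in for git.php.net, a contributor's Github fork and a core developer's clone, and the identities are invented. The automatic mirroring of step 4 is not modelled.

```shell
#!/bin/sh
# Sketch: fetch a contribution, merge it locally, push to the canonical repo.
# All repository names and identities are illustrative stand-ins.
set -e
base=$(mktemp -d)
cd "$base"

git init -q dev                              # core developer's clone
cd dev
git config user.name "Core Developer"
git config user.email "core@example.com"
echo "v1" > file.txt
git add file.txt
git commit -q -m "initial commit"
branch=$(git symbolic-ref --short HEAD)      # name of the default branch
cd ..

git clone -q --bare dev canonical.git        # stand-in for git.php.net
git clone -q canonical.git fork              # stand-in for a contributor's fork
(
    cd fork
    git config user.name "Jane Contributor"
    git config user.email "jane@example.com"
    echo "v2" >> file.txt
    git commit -q -a -m "contributed fix"
)

cd dev
git remote add canonical ../canonical.git
git fetch -q ../fork "$branch"                            # 1. pull the pull request
git merge -q --no-ff -m "Merge pull request" FETCH_HEAD   # 2. merge locally
git push -q canonical "$branch"                           # 3. push to the canonical repo
echo "canonical $branch is now at $(git rev-parse HEAD)"
```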

Subversion Integration

Github supports an SVN bridge. You can check out and commit to a repository using SVN. Advanced Subversion features such as properties do not work.

Mercurial / Bitbucket

Bitbucket is a popular hosting site for Mercurial projects. It is built similarly to Github, although the number of hosted projects is smaller than on Github.

A typical workflow is, for example:

1. Pull the pull request from bitbucket.org
2. Merge locally
3. Push merge to hg.php.net
4. Automatic sync between hg.php.net <-> bitbucket will push the changes
   to bitbucket. The pull request will be closed automatically.

Subversion Integration

Bitbucket supports an SVN bridge. You can check out and commit to a repository using SVN. The Subversion integration is in beta phase. Commits can fail under certain circumstances.

Implementation

The implementation details are handled in subsequent RFCs. Important issues that can influence the decision are outlined in this RFC.

We will not convert the whole SVN repository at once. As Git and Mercurial do not allow checkouts of subdirectories, we will have to split the repository into modules, a move which is already desired. To guarantee minimal downtime and a smooth transition, we will move one module at a time. The first module will be php-src and the karma system. Other modules like systems will follow.

As DVCS does not support cloning of subtrees of a repository, we will split the repository similar to the old CVS structure. References to other repositories can be maintained using the submodules/subrepositories feature as described above. If necessary subtree merges will be done to maintain better history and easier cloning.

A separate implementation RFC will be written for every module, such as phpdoc, php-src, web and so on.

In general, modules will be accessible via HTTP on git/hg.php.net. Every module will be its own repository, e.g. http://hg.php.net/repositories/php-src and http://hg.php.net/repositories/pecl-http. A web interface to browse the code will be available. Note that a combined checkout of PECL modules or similar SVN directories will not be available. If necessary, we can manually create meta repositories that have references to those repositories.

php-src

The first source to be moved to the chosen DVCS will be php-src. Other modules, such as web, pecl and phpdoc will follow later.

PECL, PHPdoc and co

Modules like PECL can use submodules/subrepositories to maintain references to each other. If a PECL module moves to php-src, this will be done either as a submodule/subrepository or using subtree merges to maintain a combined history and ease the cloning process. To ease the process, PECL modules that are maintained outside of php.net should try to use the same VCS as PHP.

Migration RFC

The migration of modules is handled through the following RFCs:

  1. Migration of php-src
  2. Migration of phpdoc
  3. Migration of PECL

more to follow.

Discussion and Recommendation

In this section I'll try to outline some qualitative evaluation based on discussions that I had about this topic with people from the PHP community and with other users of Git and Mercurial.

It seems that Mercurial is considered easier to learn when coming from Subversion. Its extension system keeps the core commands simple, while advanced users can add more features. This extensibility can be useful when it comes to implementing the Karma system, or documentation-related translation synchronization. Mercurial comes with excellent Windows support. It also has very capable HTTP support, making it easy for people to pull and push through proxies. The UI is kept simple, and Mercurial does not expose low-level commands. For people without much knowledge about version control systems, and people coming from Subversion, Mercurial is more suitable than Git.

Git is considered the most used DVCS in the Open Source community. Git does not have a plugin system, but the Karma system can be implemented through push and pull hooks. Git offers HTTP support that can be used to push and pull through a proxy. Newer Git versions offer a smart HTTP protocol which can be considered equal to Mercurial's HTTP support. Due to its history, concepts, and the large set of commands, Git has a steeper learning curve than similar systems. Git is widely used by Open Source projects written in PHP, such as Zend Framework, Symfony 2, phpBB, and xdebug. The most important argument in favor of Git, however, is not Git itself, but Github. It has a large user base, and makes it very easy for people to participate in a project. It is far more popular than Bitbucket.

Migrating history can be easily done in both DVCS. The Karma system can be implemented in both DVCS. Neither DVCS has a show stopper.

My personal opinion is that Mercurial is easier to learn. SVN concepts translate better to Mercurial's concepts, e.g. local incrementing revision numbers. Developers who already know Git won't have much trouble using Mercurial. I think Mercurial would be the better fit. It also gives us more flexibility in case we want to write additional extensions for the phpdoc team or whatever.

Further Readings and References

Changelog

  • 2011-07-30: initial revision
  • 2011-08-07: general cleanup and a small amount of Bazaar information
  • 2011-08-26: add reference to hginit.com
rfc/dvcs.txt · Last modified: 2011/09/07 15:30 by dsp