The SVN migration was completed in July 2009. This document has been retained for historical purposes.

CVS to SVN Migration Path

This is a document describing in detail the steps I (Gwynne) took to convert the PHP repository from CVS to SVN, updated continuously as I go through the process.

I am making use of the cvs2svn command-line Python tool and its basic documentation.

Basic cvs2svn Use

I started from the very beginning with a copy of the entire PHP CVS repository as tarballed by Derick. CVS2SVN runs in a large number (16) of effective “passes” over a CVS repository, enumerated here:

$ ./cvs2svn --help-passes
    1 : CollectRevsPass
    2 : CleanMetadataPass
    3 : CollateSymbolsPass
    4 : FilterSymbolsPass
    5 : SortRevisionSummaryPass
    6 : SortSymbolSummaryPass
    7 : InitializeChangesetsPass
    8 : BreakRevisionChangesetCyclesPass
    9 : RevisionTopologicalSortPass
   10 : BreakSymbolChangesetCyclesPass
   11 : BreakAllChangesetCyclesPass
   12 : TopologicalSortPass
   13 : CreateRevsPass
   14 : SortSymbolsPass
   15 : IndexSymbolsPass
   16 : OutputPass

I have the CVS repository stored in a directory called “realroot” (why not?) and stored the temporary data files on a separate mount point because of drive space issues. My initial commandline, based purely on the documentation, before any kind of testing was this:

./cvs2svn --svnrepos=./svnroot --fs-type=fsfs \
--dry-run --no-cross-branch-commits \
--username=svnconvert \
--cvs-revnums --use-cvs \
--tempdir=/Volumes/External/private/tmp/cvs2svn-tmp ./realroot

The options are:

  • “--svnrepos=./svnroot” - The place to create a new SVN repository.
  • “--fs-type=fsfs” - The type of SVN repository to create. FSFS is the default, but I specified it anyway.
  • “--dry-run” - Test only. Don't actually convert anything. I always do this first.
  • “--no-cross-branch-commits” - Don't make single commits across multiple CVS branches. Makes history a little more consistent at the cost of more revision commits.
  • “--username=svnconvert” - The SVN username to use in log messages when creating the new repository.
  • “--cvs-revnums” - Store the last CVS revision of each file in a property on that file. I thought it'd be useful for history tracking.
  • “--use-cvs” - Use the cvs command instead of internal code or the rcs command to retrieve information from the CVS repository. Slower but more reliable.
  • “--tempdir=/Volumes/External/private/tmp/cvs2svn-tmp” - Put temporary files in my external hard drive's tmp directory. Saved me from having my system hard disk fill up; one of the warnings for “--use-internal-co” (see below) is that it requires considerable disk space.
  • “./realroot” - The path to the CVS repository to convert.

Pass 1

Almost instantly I thought of something - why was I using the much slower --use-cvs, since the only thing it affects is the $Log$ keyword? Derick and Jani confirmed that $Log$ is not, in fact, used in the PHP CVS, so I switched to “--use-internal-co” to use internal code. The resulting speedup of cvs2svn was considerable.

My next issue was with the sheer amount of output cvs2svn spits out, and there was a lot. I added “--quiet” to the commandline to slow down the flooding of my terminal window. By no means did it stop it, but it slowed it.

Pass 1 ran all the way through after that, and died with a number of errors saying things like:

A CVS repository cannot contain both realroot/phpdoc-ja/reference/oci8/functions/OCI-Lob-writeToFile.xml,v and realroot/phpdoc-ja/reference/oci8/functions/Attic/OCI-Lob-writeToFile.xml,v;

I looked this up in the cvs2svn FAQ and found that it's a common minor corruption in CVS repositories due to various forms of repository maintenance. There are several ways to handle this issue, some of which preserve history and some of which destroy it. I chose the most conservative: adding “--retain-conflicting-attic-files” to my commandline. The result is a few extra files in the new SVN repository, but it preserves the maximum amount of history data. I consider that one of my primary concerns during this conversion process.
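To see how widespread the problem is before choosing a strategy, the repository can be scanned for files that exist both in a directory and in that directory's Attic. The following is only a sketch of that idea, not something taken from cvs2svn; the function name find_attic_conflicts is made up for illustration.

```python
import os


def find_attic_conflicts(cvs_root):
    """Return RCS files that exist both in a directory and in its Attic.

    Hypothetical helper: it only compares file names and never reads
    the ,v files themselves.
    """
    conflicts = []
    for dirpath, _dirnames, filenames in os.walk(cvs_root):
        if os.path.basename(dirpath) == "Attic":
            continue  # Attic directories are checked from their parent
        attic = os.path.join(dirpath, "Attic")
        if not os.path.isdir(attic):
            continue
        for name in filenames:
            if name.endswith(",v") and os.path.isfile(os.path.join(attic, name)):
                conflicts.append(os.path.join(dirpath, name))
    return sorted(conflicts)
```

Calling find_attic_conflicts('./realroot') would list each live file that has a twin in the Attic, which is the pair cvs2svn complains about.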

And also, I got a very upsetting error:

ERROR: 'realroot/phpweb/distributions/Attic/php-5.0.0-installer.exe,v' is not a valid ,v file

This seemed odd to me, so I popped the file open in TextMate, my preferred text editor, and found out that about half the file had been sliced neatly off the top. That's definitely not a valid ,v file. I checked whether it was worth trying to repair the file, but Jani didn't think so, since it's in Attic, and I agreed. I moved the file out of the CVS root, and the error disappeared. I didn't delete the file, though, just in case.
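A file truncated at the top can be spotted cheaply, because an RCS ,v file begins with the "head" keyword in its admin section. The sketch below is a rough pre-check under that assumption; it is not how cvs2svn validates files, it will not notice damage further down, and the name looks_like_rcs_file is invented for the example.

```python
def looks_like_rcs_file(path):
    """Cheap sanity check: does the file start with the RCS 'head' keyword?

    Hypothetical helper. It only inspects the first few bytes, so it
    catches a missing header but not corruption later in the file.
    """
    with open(path, "rb") as handle:
        start = handle.read(64)
    return start.lstrip().startswith(b"head")
```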

When pass 1 succeeded, my full commandline was:

./cvs2svn --svnrepos=./svnroot --fs-type=fsfs --dry-run --no-cross-branch-commits --username=svnconvert --cvs-revnums --use-internal-co --quiet --retain-conflicting-attic-files --tempdir=/Volumes/External/private/tmp/cvs2svn-tmp ./realroot

Pass 2

Pass 2 immediately spit out dozens of errors regarding the inability to decode CVS log messages. I wasn't entirely surprised, as some of the committers to CVS don't do it in English, but it did seem a little odd that so many files were failing. I investigated by looking at the documentation for encodings, and found the “--encoding” and “--fallback-encoding” options. Passing multiple “--encoding” options would try each encoding in sequence until one succeeded. If none succeeded, the “--fallback-encoding” would be used in lossy mode. I thought, “Cool! Now, how do I tell it to try all encodings?” There turned out to be no way to do that, and the list of single encodings was very, very daunting. There are hundreds of encodings out there. Then I noticed something: the default encoding list is “ascii”. Nothing else. No UTF-8, no Latin 1, nothing! That wouldn't do, so I added several “--encoding” options for ASCII, UTF-8, UTF-16, Shift-JIS, MacRoman, ISO Latin 1, and EUC-JP. Those struck me as being all the common encodings, and lo and behold, pass 2 spit out no more complaints. My guess was that Latin 1 and UTF-8 covered almost all of the issues. I had also added “--fallback-encoding=latin_1” as a “let's fall back if we have to” measure, but removed it, worried that it was suppressing errors I'd rather have seen. I needn't have worried; even without that option, it worked great.

When pass 2 succeeded, my full commandline was:

./cvs2svn --svnrepos=./svnroot --fs-type=fsfs --dry-run --no-cross-branch-commits --username=svnconvert --cvs-revnums --use-internal-co --quiet --retain-conflicting-attic-files --encoding=ascii --encoding=utf_8 --encoding=utf_16 --encoding=shift_jis --encoding=mac_roman --encoding=latin_1 --encoding=euc_jp --tempdir=/Volumes/External/private/tmp/cvs2svn-tmp ./realroot
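The "try each encoding in sequence, then fall back lossily" behaviour described above can be sketched in a few lines of Python. This is an illustration of the idea, not cvs2svn's actual code; decode_log_message is a hypothetical name, and treating "lossy mode" as replacing undecodable bytes is an assumption.

```python
def decode_log_message(raw, encodings, fallback=None):
    """Decode raw bytes with the first encoding that accepts them.

    If every encoding fails and a fallback is given, decode with the
    fallback and replace undecodable bytes (assumed "lossy mode").
    """
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeError:
            continue  # this encoding rejected the bytes, try the next
    if fallback is not None:
        return raw.decode(fallback, "replace")
    raise ValueError("no configured encoding could decode %r" % (raw,))


# A Latin 1 log message is rejected by ASCII and UTF-8, then accepted.
print(decode_log_message(b"caf\xe9", ["ascii", "utf_8", "latin_1"]))
```

Order matters in a list like this: Latin 1 assigns a character to every possible byte, so it never fails and anything listed after it is never tried.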

Pass 3

Pass 3 died pretty much instantly with the very cryptic message:

----- pass 3 (CollateSymbolsPass) -----
Checking for forced tags with commits...
The following paths are not disjoint:
    Path tags/php4 contains the following other paths: tags/php4/CREDITS
Please fix the above errors and restart CollateSymbolsPass

The first thing I did was a documentation search. Nothing in the docs or the FAQs. Next came a mailing list archive search. Nothing but an error that wasn't really related. I tried Googling the entire Web, but that just gave me a bunch of irrelevant results and a link back to the same mailing list article I'd already found. I found the code that outputs the message in cvs2svn, but that wasn't any use because the text “not disjoint” is nowhere near the code that actually tells me what's going on, especially since I don't read Python! Finally I decided to try a little simple logic. There was no literal path “php4” in the repository, but after all, the error message said “tags/”, didn't it? So I went and checked the php4 modules. Lo and behold, there was a php4 tag, and a php4/CREDITS tag!

Next question was, why in the world did cvs2svn have an issue with that? Well, because tags are directories in SVN, you can't have a tag whose name puts it inside another tag's directory. The tag itself shouldn't exist, but it does, and I had to handle it. I can't really expect to do a cvs tag -d on the repository, so how to tell cvs2svn not to bother with that one useless tag? The answer: the “--exclude” option. I added “--exclude=php4/CREDITS” to the commandline and tried pass 3 again. It worked perfectly.
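The "not disjoint" complaint boils down to one symbol name being a path prefix of another once both become directories. A minimal sketch of that check, with the made-up name find_nested_symbols and no claim to match cvs2svn's own implementation:

```python
def find_nested_symbols(symbols):
    """Return (outer, inner) pairs where inner would live inside outer.

    Hypothetical helper: a name only counts as nested when it starts
    with the other name plus a slash, so 'php4_0' does not clash with
    'php4'.
    """
    names = sorted(set(symbols))
    clashes = []
    for outer in names:
        prefix = outer + "/"
        for inner in names:
            if inner.startswith(prefix):
                clashes.append((outer, inner))
    return clashes


print(find_nested_symbols(["php4", "php4/CREDITS", "php4_0"]))
```

Any name that shows up as the second element of a pair is a candidate for exclusion.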

When pass 3 succeeded, my commandline was:

./cvs2svn --svnrepos=./svnroot --fs-type=fsfs --dry-run --no-cross-branch-commits --username=svnconvert --cvs-revnums --use-internal-co --quiet --retain-conflicting-attic-files --encoding=ascii --encoding=utf_8 --encoding=utf_16 --encoding=shift_jis --encoding=mac_roman --encoding=latin_1 --encoding=euc_jp --exclude=php4/CREDITS --tempdir=/Volumes/External/private/tmp/cvs2svn-tmp ./realroot

Passes 4-8

Pass 4 worked on the first try:

Pass 5 was also a clean sweep:

Pass 6 was just a beautiful thing, went by without a hitch. Not that I expected any of these “sort” passes to be a big deal, but you never know…

Pass 7 worked without problems too, though it was a much slower pass than the last three. I finally started to feel like I was making progress. Pass 7 done, means pass 8 is up! Halfway there!

Pass 8 took something along the lines of an hour to run, but it finally finished without errors… I hope the other phases aren't similarly insane with their timing. I'm considering taking the --quiet flag back out, and I think I will for pass 9. It's nice to know something's happening; I checked my top -u output twice during pass 8 to make sure it hadn't frozen up.

----- pass 4 (FilterSymbolsPass) -----
Filtering out excluded symbols and summarizing items...
----- pass 5 (SortRevisionSummaryPass) -----
Sorting CVS revision summaries...

When passes 4-8 succeeded, my commandline was:

./cvs2svn --svnrepos=./svnroot --fs-type=fsfs --dry-run --no-cross-branch-commits --username=svnconvert --cvs-revnums --use-internal-co --quiet --retain-conflicting-attic-files --encoding=ascii --encoding=utf_8 --encoding=utf_16 --encoding=shift_jis --encoding=mac_roman --encoding=latin_1 --encoding=euc_jp --exclude=php4/CREDITS --tempdir=/Volumes/External/private/tmp/cvs2svn-tmp ./realroot

Pass 9

Woohoo! Taking --quiet out allowed me to get timing data from the script for each pass, and wow was 9 ever a long pass! Have a look:

----- pass 9 (RevisionTopologicalSortPass) -----
Generating CVSRevisions in commit order...
Time for pass9 (RevisionTopologicalSortPass): 4088 seconds.

cvs2svn Statistics:
Total CVS Files:            159442
Total CVS Revisions:        912073
Total CVS Branches:         160368
Total CVS Tags:            1829773
Total Unique Tags:            1705
Total Unique Branches:         258
CVS Repos Size in KB:      4034353
First Revision Date:    Wed Mar 13 10:16:01 1996
Last Revision Date:     Thu Jun 26 08:39:31 2008
Timings (seconds):
1997   pass1    CollectRevsPass
  18   pass2    CleanMetadataPass
   0   pass3    CollateSymbolsPass
 370   pass4    FilterSymbolsPass
   6   pass5    SortRevisionSummaryPass
   5   pass6    SortSymbolSummaryPass
 320   pass7    InitializeChangesetsPass
3435   pass8    BreakRevisionChangesetCyclesPass
4088   pass9    RevisionTopologicalSortPass
4089   total

4088 seconds / 3600 seconds/hour… you don't need a calculator to realize that's around one and an eighth hours. And check out pass 8's timing, almost as bad. Oh well, pass 10 is up next… I was going to think about adding --verbose, but on reflection I don't think I want all that.
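For anyone who does want the calculator, the arithmetic is short. The sketch below converts the per-pass timings from the statistics above into hours; the per-pass figures add up to far more than the "total" line, which presumably covers only the invocation that had just finished (that reading is an assumption).

```python
# Per-pass timings in seconds for passes 1-9, copied from the statistics.
timings = [1997, 18, 0, 370, 6, 5, 320, 3435, 4088]


def to_hours(seconds):
    """Convert a duration in seconds to hours."""
    return seconds / 3600.0


pass9_hours = to_hours(timings[-1])
total_seconds = sum(timings)
print("pass 9:     %.2f hours" % pass9_hours)
print("passes 1-9: %d seconds, %.2f hours" % (total_seconds, to_hours(total_seconds)))
```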

When pass 9 succeeded, my commandline was:

./cvs2svn --svnrepos=./svnroot --fs-type=fsfs --dry-run --no-cross-branch-commits --username=svnconvert --cvs-revnums --use-internal-co --retain-conflicting-attic-files --encoding=ascii --encoding=utf_8 --encoding=utf_16 --encoding=shift_jis --encoding=mac_roman --encoding=latin_1 --encoding=euc_jp --exclude=php4/CREDITS --tempdir=/Volumes/External/private/tmp/cvs2svn-tmp ./realroot

Passes 10-16

10: Pass 10 was what my Warcraft friends would call “easysauce”:

11: Well, pass 11 was faster than 8 and 9, if not as fast as 10…

12: Another one bites the dust!

13: Pass 13 certainly was interesting. Generating all the SVN commits…

14: Nice short easy one.

15: 15 down, 1 to go!

16: Pop the champagne cork!

----- pass 10 (BreakSymbolChangesetCyclesPass) -----
Breaking symbol changeset dependency cycles...
Time for pass10 (BreakSymbolChangesetCyclesPass): 181.9 seconds.
----- pass 11 (BreakAllChangesetCyclesPass) -----
Breaking CVSSymbol dependency loops...
Time for pass11 (BreakAllChangesetCyclesPass): 1039 seconds.
----- pass 12 (TopologicalSortPass) -----
Generating CVSRevisions in commit order...
Time for pass12 (TopologicalSortPass): 255.5 seconds.
Creating Subversion r192256 (commit)
Time for pass13 (CreateRevsPass): 512.2 seconds.
----- pass 14 (SortSymbolsPass) -----
Sorting symbolic name source revisions...
Time for pass14 (SortSymbolsPass): 9.787 seconds.
----- pass 15 (IndexSymbolsPass) -----
Determining offsets for all symbolic names...
Time for pass15 (IndexSymbolsPass): 6.344 seconds.
----- pass 16 (OutputPass) -----
Starting Subversion r192256 / 192256
Time for pass16 (OutputPass): 706.2 seconds.

cvs2svn Statistics:
Total CVS Files:            159442
Total CVS Revisions:        912073
Total CVS Branches:         160368
Total CVS Tags:            1829773
Total Unique Tags:            1705
Total Unique Branches:         258
CVS Repos Size in KB:      4034353
Total SVN Commits:          192256
First Revision Date:    Wed Mar 13 10:16:01 1996
Last Revision Date:     Thu Jun 26 08:39:31 2008
Timings (seconds):
1996.9   pass1    CollectRevsPass
 18.1   pass2    CleanMetadataPass
  0.3   pass3    CollateSymbolsPass
370.0   pass4    FilterSymbolsPass
  6.3   pass5    SortRevisionSummaryPass
  5.3   pass6    SortSymbolSummaryPass
320.0   pass7    InitializeChangesetsPass
3435.3   pass8    BreakRevisionChangesetCyclesPass
4088.5   pass9    RevisionTopologicalSortPass
181.9   pass10   BreakSymbolChangesetCyclesPass
1038.6   pass11   BreakAllChangesetCyclesPass
255.5   pass12   TopologicalSortPass
512.2   pass13   CreateRevsPass
  9.8   pass14   SortSymbolsPass
  6.3   pass15   IndexSymbolsPass
706.2   pass16   OutputPass
706.5   total

Now, to do it all at once without --dry-run. It'll take a while. Wish me luck!

When pass 16 succeeded, my commandline was:

./cvs2svn --svnrepos=./svnroot --fs-type=fsfs --dry-run --no-cross-branch-commits --username=svnconvert --cvs-revnums --use-internal-co --retain-conflicting-attic-files --encoding=ascii --encoding=utf_8 --encoding=utf_16 --encoding=shift_jis --encoding=mac_roman --encoding=latin_1 --encoding=euc_jp --exclude=php4/CREDITS --tempdir=/Volumes/External/private/tmp/cvs2svn-tmp ./realroot

The real run

Well, on the first try of this, I had my temporary files directory on the external drive and the output SVN root on the internal drive. I didn't realize the SVN root was going to be larger than the CVS root, with the result that I woke up with my computer screaming “Your primary system disk is out of space.” The result was a computer that took an hour to get back up to speed, thanks to Darwin's VM manager and its extremely poor handling of low-memory situations. No big deal, really. Not the first time I've done something like that.

So I tried the run again with the output root on the external drive, which has considerably more free space. A few hours later I came back to my machine to find the same thing had happened! I'd neglected to realize that the VM manager in Darwin doesn't know how to make actual use of the space it has, and cvs2svn had used up every byte of drive space by using up every byte of RAM. So I had to free up some space on the system disk. How often do I exhaust 2GB of swap space with 2GB of physical RAM? Not often!

All of this was a lack of recognition of the sheer size of PHP's CVS repository. The resulting SVN repository will have 196,000 commits or more, and take up over 4GB of space, maybe 5. So, I started the run a third time, with significantly more space available for Darwin's poor brain-damaged memory manager and the cvs2svn temporary and output files.

Passes 1 through 15 completed in about three hours' time overall, finally, but pass 16 was something else again. Keep in mind, pass 16 is the “commit to SVN” phase, and we have 192,256 commits to do. Not only that, but cvs2svn invokes first the RCS “co” command and then the “svnadmin” command for every single commit! That means forking some 192256*2+1=384513 processes, on a Darwin system with a maximum of 100,000 PIDs, about 100 of which are already in use by various system processes and applications! That's gonna make my poor OS cycle its PID usage a few times. A good six hours later I was at revision 17088/192256 and realized… this was gonna take a long, long, long time.
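The numbers above can be turned into a rough estimate of just how long "a long, long, long time" is. This is a back-of-the-envelope sketch that assumes the commit rate stays constant, which it may well not.

```python
commits = 192256        # total SVN commits to write in pass 16
pid_limit = 100000      # maximum PIDs on the Darwin system, per the text

# One "co" and one "svnadmin" per commit, plus cvs2svn itself.
processes = commits * 2 + 1
pid_cycles = processes / float(pid_limit)

done = 17088            # revisions committed so far
elapsed_hours = 6.0     # time taken to get there
rate = done / elapsed_hours          # commits per hour
eta_hours = commits / rate           # whole pass at that rate

print("processes forked:  %d" % processes)
print("PID space cycles:  %.1f" % pid_cycles)
print("estimated pass 16: %.1f hours" % eta_hours)
```

At that pace the pass would have needed well over two days.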

Alright, I thought, let's see what we can't do to make this a little faster. I remembered from the docs the --dumpfile option, which drops all the SVN data into a dumpfile that svnadmin can later load more or less all at once. That certainly would save all the svnadmin invocations, right? Right? Well, I wasn't going to waste six hours worth of commits without some kind of proof that this really was gonna cut the time in half like that, so I dove into the cvs2svn source code a second time. Python is a very whitespace-dependent language… it makes me shudder, frankly. I surfed my way through what little I managed to understand of the code and finally figured out… yes, the --dumpfile option writes directly to an on-disk file instead of invoking svnadmin for every commit! Wow, what the hell is the idea of not making this the default? Sure, the dumpfile is gonna be ridiculously huge, like 10GB, but I can handle that! I canceled the hell out of the current run.

But, why have to do passes 1 through 15 over again when it's three hours work that doesn't change in the slightest? That's why I used Control-C to break cvs2svn's run. I added the appropriate “--dumpfile” option, ditched “--svnrepos”, and then tossed in “--pass=16” to make it start from the end, as it were. Since I'd C-C'd, it didn't delete the temp files, and I ended up with pass 16 restarting with the dumpfile output. Very nice, right?

Sure enough, while the thing was still invoking “co” for every revision, at least the useless svnadmin slop was gone. That instantly doubled the speed of pass 16. Well worth the wasted 6 hours; I only wished I'd thought of it sooner. Six more hours later I'd found another disk space error somewhere around revision 50000. Oops… only 8GB on my external drive wasn't enough for both the temporary files and the output file. Did some rummaging about in my filesystem and found about 40GB of old data to delete that I'd been too lazy to clean up before. Then I started pass 16… again.

Whoops. I made the mistake of thinking again. I wondered why in the world the darned thing was invoking “co” so much when I'd told it to use internal code for checkouts. Lo and behold, I checked my commandline and somewhere along the line I'd changed it to --use-rcs because of the disk space problems! Ooooops… no wonder six hours didn't get me all the way through… So I killed the pass… again… and tried to restart it with --use-internal-co. Err… whoops some more. I had to start all 16 passes over to do that! Then again, considering the sheer cost of all that forking, it was probably a save either way, so I started it over. In much less than the three hours, poof, I was at pass 16, and I'd hit revision 60000 within three hours. That was better!

I went to sleep at that point, and when I woke up, it was done. About time! The statistics were all laid out for me, too:

Starting Subversion r192256 / 192256
Time for pass16 (OutputPass): 8455 seconds.

cvs2svn Statistics:
Total CVS Files:            159442
Total CVS Revisions:        912073
Total CVS Branches:         160368
Total CVS Tags:            1829773
Total Unique Tags:            1705
Total Unique Branches:         258
CVS Repos Size in KB:      4034353
Total SVN Commits:          192256
First Revision Date:    Wed Mar 13 10:16:01 1996
Last Revision Date:     Thu Jun 26 08:39:31 2008
Timings (seconds):
 5360   pass1    CollectRevsPass
   21   pass2    CleanMetadataPass
    0   pass3    CollateSymbolsPass
  444   pass4    FilterSymbolsPass
   15   pass5    SortRevisionSummaryPass
    9   pass6    SortSymbolSummaryPass
  391   pass7    InitializeChangesetsPass
 4190   pass8    BreakRevisionChangesetCyclesPass
 5254   pass9    RevisionTopologicalSortPass
  186   pass10   BreakSymbolChangesetCyclesPass
  658   pass11   BreakAllChangesetCyclesPass
  254   pass12   TopologicalSortPass
  524   pass13   CreateRevsPass
    9   pass14   SortSymbolsPass
    7   pass15   IndexSymbolsPass
 8455   pass16   OutputPass
25778   total

Whew! For the lazy, 25778 seconds / 3600 seconds/hour = 7.1606 hours. Not too bad for such a giant CVS repository, all things considered. The resulting dumpfile came out to a whopping 19GB!

$ ls -lah /Volumes/External/svnimport.dump
-rw-r--r--    1 gwynne  admin     19G Jun 29 07:13 svnimport.dump

Importing the dumpfile

The next step: import that giant thing into a new svn repository. The commands? “svnadmin create” and “svnadmin load”. I knew the latter would take a while, so I tossed a “time” at the start of it for curiosity's sake. I also threw a chown in there so Darwin wouldn't whine if I later tried to make the repository publicly accessible:

$ cd /Volumes/External
$ svnadmin create --fs-type=fsfs ./phpsvn
$ sudo chown -R _svn:_svn ./phpsvn
$ sudo time svnadmin load ./phpsvn < ./svnimport.dump

While this was running I got to thinking as I watched the output scroll by. I saw a line like this:

     * adding path : tags/RELEASE_0_5_5/pear/Net_SmartIRC/package.xml ...COPIED... done.

See the problem? That really should have been “pear/tags/RELEASE_0_5_5”. Something in me knew it wasn't going to be as simple as it'd been so far, but I hadn't thought of this little caveat yet. I checked the cvs2svn FAQ and found the question regarding multiple projects in a single repository. Oops… I should've been using an options file all along. Well, live and learn… I let the dumpfile keep running, though; no sense completely wasting all that work, and at least it served as an example of a successful conversion, if not necessarily a correct one.
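The difference between the two layouts is easiest to see on a few concrete paths. The sketch below rewrites a path from the single-project layout into the per-project layout purely as string manipulation; it illustrates the naming only, the function to_project_layout is hypothetical, and the actual fix is to define separate projects in a cvs2svn options file.

```python
def to_project_layout(path):
    """Move the project name in front of trunk/tags/branches.

    Illustration only: 'tags/<tag>/<project>/...' becomes
    '<project>/tags/<tag>/...', and 'trunk/<project>/...' becomes
    '<project>/trunk/...'. Other paths are returned unchanged.
    """
    parts = path.split("/")
    if parts[0] == "trunk" and len(parts) >= 2:
        project, rest = parts[1], parts[2:]
        return "/".join([project, "trunk"] + rest)
    if parts[0] in ("tags", "branches") and len(parts) >= 3:
        kind, symbol, project, rest = parts[0], parts[1], parts[2], parts[3:]
        return "/".join([project, kind, symbol] + rest)
    return path


print(to_project_layout("tags/RELEASE_0_5_5/pear/Net_SmartIRC/package.xml"))
```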

The final piece of that was:

<<< Started new transaction, based on original revision 192256
     * editing path : trunk/pecl/apc/apc_cache.c ... done.

------- Committed revision 192256 >>>

    30674.07 real      4941.17 user      4724.23 sys
$ du -h -d0 ./phpsvn
7.3G	./phpsvn

30674.07 seconds is 8.521 hours. So a bit longer than it took to do the original conversion. That figures. Was it wasted effort? I wasn't sure, but it was lessons learned, and that's never a complete waste.

A cvs2svn options file

Option files in cvs2svn are written in Python. Fortunately, they're also so heavily commented that you don't need to understand the language itself to use them. Being a programmer of a lot of C-ish languages, I was able to at least get a grip on what the code was doing anyway. It didn't matter. The example file was huge, and I mean huge! I'll spare you all the necessity of figuring out the equivalents to the options I'd already specified on the commandline; it was all pretty clear anyway from the file's comments. I added quite a few more little tweaks as I went, though.

  • I used “ctx.revision_recorder = InternalRevisionRecorder(compress=True)” to get compression for what was previously --use-internal-co. Uses a lot less disk space, and solves I/O binding problems.
  • I used “ctx.prune = True” to mimic the effects of “cvs update -P” across the entire SVN repository. I saw no reason not to.
  • I reduced the encodings used for log messages to UTF-8, Latin 1, and ASCII. It seemed simpler.
  • I used “ctx.symbol_info_filename = '/Volumes/External/phpsvn.syminfo.txt'” to get an output of the decisions made about symbols by cvs2svn.
  • I used “ctx.username = 'cvs2svn'”, since that seemed to make more sense than my previous choice of svnconvert.
  • I told cvs2svn to use “EOLStyleFromMimeTypeSetter(),” for auto-props setting, telling it to use MIME type to figure out EOL styles.
  • On advice from Jani and Derick, I used “DefaultEOLStyleSetter('native'),” to tell it to use native line endings instead of binary style for unknown files.
  • I set “ctx.cross_project_commits = True” despite my previous commandline choice, based on comments in the example options file which suggested it made more sense to allow them.
  • I used “changeset_database.use_mmap_for_cvs_item_to_changeset_table = False” for paranoia's sake. A 5% speedup wasn't worth the risk of having to do it multiple times because of my computer exploding with Out of Memory errors.

The next step was to take the list of directories in the CVS root and turn it into a list of stanzas similar to:

        ] + global_symbol_strategy_rules,

Whew. That was going to make for a very long options file where it was very easy to make copypasta errors. I needed to add a little Python code. How does one do foreach (array(blah blah blah) as $item) { /* etc */ } in Python? So I went to a close friend who does know Python.

We came up with this rather handy little bit of code:

cvsrootdir = r'/Users/gwynne/src/cvs2svn-2.1.1/realroot'
fnames = os.listdir(cvsrootdir)
for fname in fnames:
    pathname = os.path.join(cvsrootdir, fname)
    if os.path.isdir(pathname) and fname != r'CVSROOT':
            trunk_path=fname + 'trunk',
            branches_path=fname + 'branches',
            tags_path=fname + 'tags',
                ] + global_symbol_strategy_rules,
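For readers who, like me, don't normally write Python, here is a standalone sketch of the same loop that can be run outside cvs2svn to preview which projects and paths would be generated. It deliberately calls nothing from cvs2svn; the function name project_layouts and the dictionary keys are made up, and joining the module name to 'trunk' with a slash is an assumption about the intended layout.

```python
import os


def project_layouts(cvs_root, skip=("CVSROOT",)):
    """List one trunk/branches/tags layout per top-level CVS module.

    Hypothetical preview helper: plain files and the names in 'skip'
    are ignored, everything else becomes its own project.
    """
    layouts = []
    for name in sorted(os.listdir(cvs_root)):
        pathname = os.path.join(cvs_root, name)
        if name in skip or not os.path.isdir(pathname):
            continue
        layouts.append({
            "name": name,
            "cvs_path": pathname,
            "trunk_path": name + "/trunk",        # assumed '/' separator
            "branches_path": name + "/branches",
            "tags_path": name + "/tags",
        })
    return layouts
```

Printing the result for the real CVS root before starting a multi-hour conversion is a cheap way to catch a wrong path.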

If we'd gotten this right, the result would be a whole set of projects with every CVS module that wasn't defined in modules. That was another can of worms I intended to open once I got this part right.

Well, we'd gotten it mostly right. First attempt came up with this:

Traceback (most recent call last):
  File "./cvs2svn", line 31, in <module>
    main(sys.argv[0], sys.argv[1:])
  File "/Users/gwynne/src/cvs2svn-2.1.1/cvs2svn_lib/", line 47, in main
    run_options = RunOptions(progname, cmd_args, pass_manager)
  File "/Users/gwynne/src/cvs2svn-2.1.1/cvs2svn_lib/", line 259, in __init__
  File "/Users/gwynne/src/cvs2svn-2.1.1/cvs2svn_lib/", line 739, in process_options_file
    execfile(options_filename, g, l)
  File "./phpsvn.options", line 110, in <module>
    fnames = os.listdir(cvsrootdir)
NameError: name 'os' is not defined

The fix was easy: adding “import os” to the top of the options file. Bingo, cvs2svn ran. This was my phpsvn.options file:

# (Be in -*- python -*- mode.)
import re
import os
from cvs2svn_lib.boolean import *
from cvs2svn_lib import config
from cvs2svn_lib import changeset_database
from cvs2svn_lib.common import CVSTextDecoder
from cvs2svn_lib.log import Log
from cvs2svn_lib.project import Project
from cvs2svn_lib.svn_output_option import DumpfileOutputOption
from cvs2svn_lib.svn_output_option import ExistingRepositoryOutputOption
from cvs2svn_lib.svn_output_option import NewRepositoryOutputOption
from cvs2svn_lib.revision_manager import NullRevisionRecorder
from cvs2svn_lib.revision_manager import NullRevisionExcluder
from cvs2svn_lib.rcs_revision_manager import RCSRevisionReader
from cvs2svn_lib.cvs_revision_manager import CVSRevisionReader
from cvs2svn_lib.checkout_internal import InternalRevisionRecorder
from cvs2svn_lib.checkout_internal import InternalRevisionExcluder
from cvs2svn_lib.checkout_internal import InternalRevisionReader
from cvs2svn_lib.symbol_strategy import AllBranchRule
from cvs2svn_lib.symbol_strategy import AllTagRule
from cvs2svn_lib.symbol_strategy import BranchIfCommitsRule
from cvs2svn_lib.symbol_strategy import ExcludeRegexpStrategyRule
from cvs2svn_lib.symbol_strategy import ForceBranchRegexpStrategyRule
from cvs2svn_lib.symbol_strategy import ForceTagRegexpStrategyRule
from cvs2svn_lib.symbol_strategy import HeuristicStrategyRule
from cvs2svn_lib.symbol_strategy import UnambiguousUsageRule
from cvs2svn_lib.symbol_strategy import DefaultBasePathRule
from cvs2svn_lib.symbol_strategy import HeuristicPreferredParentRule
from cvs2svn_lib.symbol_strategy import SymbolHintsFileRule
from cvs2svn_lib.symbol_transform import ReplaceSubstringsSymbolTransform
from cvs2svn_lib.symbol_transform import RegexpSymbolTransform
from cvs2svn_lib.symbol_transform import NormalizePathsSymbolTransform
from cvs2svn_lib.property_setters import AutoPropsPropertySetter
from cvs2svn_lib.property_setters import CVSBinaryFileDefaultMimeTypeSetter
from cvs2svn_lib.property_setters import CVSBinaryFileEOLStyleSetter
from cvs2svn_lib.property_setters import CVSRevisionNumberSetter
from cvs2svn_lib.property_setters import DefaultEOLStyleSetter
from cvs2svn_lib.property_setters import EOLStyleFromMimeTypeSetter
from cvs2svn_lib.property_setters import ExecutablePropertySetter
from cvs2svn_lib.property_setters import KeywordsPropertySetter
from cvs2svn_lib.property_setters import MimeMapper
from cvs2svn_lib.property_setters import SVNBinaryFileKeywordsPropertySetter
Log().log_level = Log.VERBOSE
ctx.output_option = DumpfileOutputOption(
ctx.dry_run = False
ctx.revision_recorder = InternalRevisionRecorder(compress=True)
ctx.revision_excluder = InternalRevisionExcluder()
ctx.revision_reader = InternalRevisionReader(compress=True)
ctx.svnadmin_executable = r'svnadmin'
ctx.sort_executable = r'sort'
ctx.trunk_only = False
ctx.prune = True
ctx.cvs_author_decoder = CVSTextDecoder(
ctx.cvs_log_decoder = CVSTextDecoder(
ctx.cvs_filename_decoder = CVSTextDecoder(
ctx.decode_apple_single = False
ctx.symbol_info_filename = '/Volumes/External/phpsvn.syminfo.txt'
global_symbol_strategy_rules = [
ctx.username = 'cvs2svn'
ctx.tmpdir = r'/Volumes/External/private/tmp/cvs2svn-tmp'
ctx.cross_project_commits = True
ctx.cross_branch_commits = True
ctx.retain_conflicting_attic_files = True
run_options.profiling = False
changeset_database.use_mmap_for_cvs_item_to_changeset_table = False
cvsrootdir = r'/Users/gwynne/src/cvs2svn-2.1.1/realroot'
fnames = os.listdir(cvsrootdir)
for fname in fnames:
    pathname = os.path.join(cvsrootdir, fname)
    if os.path.isdir(pathname) and fname != r'CVSROOT':
            trunk_path=fname + 'trunk',
            branches_path=fname + 'branches',
            tags_path=fname + 'tags',
                ] + global_symbol_strategy_rules,

I was a little bit more careful with my commandline this time; I wanted some log of what was going on while still seeing the progress. So I fell back on the trusty “tee” command:

$ ./cvs2svn --option=./phpsvn.options | tee ./phpsvn.convert.out

Then it was time to watch what happened. Oh boy…

Yep. I knew it wouldn't be that simple:

Pass 1 complete.
Error summary:
ERROR: No RCS files found under '/Users/gwynne/src/cvs2svn-2.1.1/realroot/livingtags'!
Are you absolutely certain you are pointing cvs2svn
at a CVS repository?

ERROR: No RCS files found under '/Users/gwynne/src/cvs2svn-2.1.1/realroot/pear-manual'!
Are you absolutely certain you are pointing cvs2svn
at a CVS repository?

ERROR: No RCS files found under '/Users/gwynne/src/cvs2svn-2.1.1/realroot/phpdoc-ar-only'!
Are you absolutely certain you are pointing cvs2svn
at a CVS repository?

ERROR: No RCS files found under '/Users/gwynne/src/cvs2svn-2.1.1/realroot/phpdoc-he-only'!
Are you absolutely certain you are pointing cvs2svn
at a CVS repository?

ERROR: No RCS files found under '/Users/gwynne/src/cvs2svn-2.1.1/realroot/phpdoc-ro-dir'!
Are you absolutely certain you are pointing cvs2svn
at a CVS repository?

ERROR: No RCS files found under '/Users/gwynne/src/cvs2svn-2.1.1/realroot/phpdoc-ro-only'!
Are you absolutely certain you are pointing cvs2svn
at a CVS repository?

Exited due to fatal error(s).

Well… every single one of those folders was utterly empty. Oh well. I thought it was kinda suspicious that there were directories named after what I knew were pseudo-modules. Bah. I killed the directories and tried again.
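Since pass 1 alone takes half an hour or more, it is worth finding such directories up front. A minimal sketch, assuming that "no RCS files" simply means no file ending in ",v" anywhere under the directory; the name projects_without_rcs_files is invented here.

```python
import os


def projects_without_rcs_files(cvs_root, skip=("CVSROOT",)):
    """Return top-level directories that contain no ,v files at all."""
    empty = []
    for name in sorted(os.listdir(cvs_root)):
        top = os.path.join(cvs_root, name)
        if name in skip or not os.path.isdir(top):
            continue
        has_rcs = any(
            filename.endswith(",v")
            for _dirpath, _dirnames, filenames in os.walk(top)
            for filename in filenames
        )
        if not has_rcs:
            empty.append(name)
    return empty
```

Anything it returns can be moved aside or skipped when generating the project list.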

Clean sweep this time around:

cvs2svn Statistics:
Total CVS Files:            159415
Total CVS Revisions:        909522
Total CVS Branches:         154874
Total CVS Tags:            1835211
Total Unique Tags:            3495
Total Unique Branches:         489
CVS Repos Size in KB:      4032117
Total SVN Commits:          189058
First Revision Date:    Wed Mar 13 10:16:01 1996
Last Revision Date:     Thu Jun 26 08:39:31 2008
Timings (seconds):
 1815   pass1    CollectRevsPass
   15   pass2    CleanMetadataPass
    1   pass3    CollateSymbolsPass
  741   pass4    FilterSymbolsPass
   38   pass5    SortRevisionSummaryPass
   14   pass6    SortSymbolSummaryPass
  364   pass7    InitializeChangesetsPass
 4020   pass8    BreakRevisionChangesetCyclesPass
 4226   pass9    RevisionTopologicalSortPass
  175   pass10   BreakSymbolChangesetCyclesPass
  373   pass11   BreakAllChangesetCyclesPass
  256   pass12   TopologicalSortPass
  592   pass13   CreateRevsPass
   10   pass14   SortSymbolsPass
    9   pass15   IndexSymbolsPass
13529   pass16   OutputPass
26179   total

26179 seconds = 7.272 hours. Not bad, just slightly over the time for the one-project run, most of it in OutputPass. The astute will notice most of the saved time is in BreakRevisionChangesetCyclesPass and RevisionTopologicalSortPass. That makes a lot of sense, since cvs2svn no longer had to break nonsensical dependencies between projects that weren't actually related.
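The arithmetic behind that paragraph can be checked with a short sketch. The figures are copied from the statistics above; the variable names are made up for illustration. The per-pass figures add up to one second less than the reported total, presumably from rounding.

```python
# Per-pass timings in seconds, copied from the statistics above.
pass_seconds = [
    ("CollectRevsPass", 1815),
    ("CleanMetadataPass", 15),
    ("CollateSymbolsPass", 1),
    ("FilterSymbolsPass", 741),
    ("SortRevisionSummaryPass", 38),
    ("SortSymbolSummaryPass", 14),
    ("InitializeChangesetsPass", 364),
    ("BreakRevisionChangesetCyclesPass", 4020),
    ("RevisionTopologicalSortPass", 4226),
    ("BreakSymbolChangesetCyclesPass", 175),
    ("BreakAllChangesetCyclesPass", 373),
    ("TopologicalSortPass", 256),
    ("CreateRevsPass", 592),
    ("SortSymbolsPass", 10),
    ("IndexSymbolsPass", 9),
    ("OutputPass", 13529),
]
reported_total = 26179  # the "total" line printed by cvs2svn

summed = sum(seconds for _name, seconds in pass_seconds)
hours = reported_total / 3600.0

print("sum of passes: %d s (reported total: %d s)" % (summed, reported_total))
print("total: %.3f hours" % hours)

# Show the three most expensive passes and their share of the run.
slowest = sorted(pass_seconds, key=lambda item: item[1], reverse=True)[:3]
for name, seconds in slowest:
    print("%-34s %6d s  %5.1f%%" % (name, seconds, 100.0 * seconds / summed))
```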

$ ls -lah /Volumes/External/phpsvn.*
-rw-r--r--    1 gwynne  admin     19G Jun 30 19:05 phpsvn.dumpfile
-rw-r--r--    1 gwynne  admin    1.9M Jun 30 12:19 phpsvn.syminfo.txt
$ ls -lah /Users/gwynne/src/cvs2svn-2.1.1/phpsvn.convert.out
-rw-r--r--    1 gwynne  staff   189M Jun 30 19:05 phpsvn.convert.out

189MB just for the logfile. Crazy. Oh well. Time to import the dumpfile.

$ svnadmin create --fs-type=fsfs /Volumes/External/phpsvn
$ sudo chown -R _svn:_svn /Volumes/External/phpsvn
$ sudo time svnadmin load /Volumes/External/phpsvn < /Volumes/External/phpsvn.dumpfile | tee ./phpsvn.load.out

And the result:

------- Committed revision 189058 >>>
    44243.87 real      5250.85 user      5715.10 sys

44243.87 seconds = 12.29 hours. That could be because my CPU was under heavy load from other things, though, and I expect that's the exact reason. Oh well. Now we have a half-working repository.

Modules and externals

Well, cvs2svn doesn't handle translating CVSROOT/modules into externals definitions, so now I have to set those up manually. This becomes a bit more complex, because Subversion's externals support is not a drop-in replacement for modules. SVN 1.5 adds sparse checkouts, which ease the burden a little, but not enough. The topic requires discussion on methods of implementation. Please see the RFC I created at svnexternals.

Checking out the repository

After a few days I got tired of waiting for people to comment. I suppose it was reasonable for them to think a proof of concept conversion didn't need mass discussion. For the moment I decided I'd go with merging the ZendEngine2 and Zend modules. Even in CVS it's questionable why the modules were split; in SVN it's completely ridiculous. Would I be able to just drop ZendEngine2's trunk on top of Zend and go *pfft* on ZendEngine2? Oops, no; the histories of the two modules are woven together in very strange ways. They have a lot of version tags in common and a lot not in common. Why, when ZendEngine2 wasn't even in use for PHP 4? Oh well. Out of sight, out of mind. I sure as hell wasn't expecting anyone to try to convert an existing working copy from CVS to SVN after a real repository change, so it was okay to do things that would utterly trash modifications made in local copies, including changing around directory structure. The question becomes one of updating scripts that depend (foolishly) on these directory names.

The first step was to get a checkout to play with:

$ svn checkout svn:// phpsvn-co

Yep. Every single module, with every tag and every branch. I went for a drink. This was gonna take a while…

Of course it wasn't going to be that simple. SVN belched this one out at me:

svn: Failed to add directory 'phpsvn-co/pear/branches/start/Selenium/tests': an unversioned directory of the same name already exists

It was fairly obvious what that meant, at least to me. There was a .svn directory checked into the repository. I confirmed with an svn ls:

$ svn ls svn://

Someone checked an SVN working copy into CVS. That was a pretty strange thing to do, but it's a showstopper when switching to SVN! Fortunately, this wasn't CVS, and I was able to kill the offending directories with one command (I had to use file URLs because svnserve's configured not to allow any write access):

$ sudo svn rm -m "[SVN CONVERSION] Removing .svn directories that break SVN checkout." \
  file:///Volumes/External/phpsvn/pear/branches/start/Selenium/tests/.svn \
  file:///Volumes/External/phpsvn/pear/branches/start/Selenium/tests/events/.svn \
  file:///Volumes/External/phpsvn/pear/branches/start/Selenium/tests/html/.svn \
  file:///Volumes/External/phpsvn/pear/branches/start/Selenium/docs/.svn \
  file:///Volumes/External/phpsvn/pear/branches/start/Selenium/examples/.svn

Committed revision 189059.

I then re-ran the checkout command, knowing SVN would resume from where it left off now that it was able. Or not. It spit the same error back at me again. Rather than assume I'd failed to fix the problem with a logical solution, I thought maybe SVN was trying to check out from the r189058 it had started with still. That made a lot of sense. Easy solution, rm -Rf phpsvn-co and start over. Slower, but a good guarantee that people who try to do the same don't get messed up anyway.

It couldn't be that simple. There were more .svn directories floating about, in pear/branches/start/Testing_Selenium. I got rid of those the same way.
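Rather than discovering stray .svn directories one failed checkout at a time, the CVS tree can be scanned for them up front. This is a minimal sketch under assumptions: the function name and the demo layout are hypothetical, and it only reports the directories, leaving removal to a human. It is written for Python 3.

```python
import os
import tempfile


def find_stray_svn_dirs(cvsroot):
    """Return paths (relative to cvsroot) of every '.svn' directory found."""
    hits = []
    for dirpath, dirnames, _filenames in os.walk(cvsroot):
        if ".svn" in dirnames:
            full = os.path.join(dirpath, ".svn")
            hits.append(os.path.relpath(full, cvsroot))
            # Do not descend into the .svn directory itself.
            dirnames.remove(".svn")
    return sorted(hits)


# Demo on a throwaway tree (hypothetical layout, not the real repository).
with tempfile.TemporaryDirectory() as root:
    for sub in ("tests", os.path.join("tests", "events"), "docs"):
        os.makedirs(os.path.join(root, "pear", "Selenium", sub, ".svn"))
    os.makedirs(os.path.join(root, "pear", "Selenium", "examples"))
    stray = find_stray_svn_dirs(root)
    for path in stray:
        print(path)
```

Run against the real CVS root, the output would be the list of directories to remove from the CVS tree before conversion, or to svn rm afterwards.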

And the next error was nasty…

svn: Failed to add directory 'phpsvn-co/pear/branches/Townnews': a versioned directory of the same name already exists

What? Huh? Why? I ran a ls to find out…

$ svn ls svn://

A case-insensitive filesystem made that impossible to use in a checkout. Ohhhhh boy… It was about this time I looked at the rest of the branches directory in pear and realized something very annoying… Two more lines from that list:


That means every PEAR subdirectory needs its own branches/tags/trunk subdirectory. All the way back to cvs2svn and our options file, then. No point continuing to work with an incorrect repository. Were there any other top-level modules like that? Yep, pecl. I dove into cvs2svn's options file and tweaked it. It took a while to figure out what I was doing in Python, but I finally came up with this code fragment:

def recurse_dir(rootdir, modprefix, exceptions, deepens, run_options, xforms, rules):
    global os, recurse_dir
    fnames = os.listdir(rootdir)
    for fname in fnames:
        pathname = os.path.join(rootdir, fname)
        if os.path.isdir(pathname) and fname not in exceptions and fname not in deepens:
            # An ordinary module: register it as its own project.
            run_options.add_project(
                pathname,
                trunk_path=modprefix + fname + '/trunk',
                branches_path=modprefix + fname + '/branches',
                tags_path=modprefix + fname + '/tags',
                symbol_transforms=xforms,
                symbol_strategy_rules=[] + rules,
                )
        elif os.path.isdir(pathname) and fname in deepens:
            # pear and pecl: give every subdirectory its own trunk/branches/tags.
            recurse_dir(os.path.join(rootdir, fname), fname + '/', [], [], run_options, xforms, rules)
recurse_dir(r'/Users/gwynne/src/cvs2svn-2.1.1/realroot', '', ['CVSROOT'], ['pear', 'pecl'], run_options,
    [ReplaceSubstringsSymbolTransform('\\','/'), NormalizePathsSymbolTransform()], global_symbol_strategy_rules)

And started the cvs2svn run over. As one can see from the code above, it was a real mess dealing with the way cvs2svn calls option files, but I finally managed to fudge it to kill all the NameErrors.
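One plausible source of NameErrors like those is Python's scoping when a file is executed with separate globals and locals dictionaries. The sketch below assumes that is how the options file gets run, which is an assumption about cvs2svn 2.1's loader and not something shown in this document. It is written for Python 3, whereas cvs2svn 2.1 ran on Python 2, and options_source and run_options_file are hypothetical names.

```python
import os

# A stand-in for an options file: a top-level import, a helper, and a call.
options_source = """
import os

def helper():
    return os.sep   # resolved in the globals dict when helper() runs

result = helper()
"""


def run_options_file(source):
    """Execute source with separate globals and locals (assumed loader style)."""
    g, l = {}, {}
    exec(source, g, l)
    return l


# Top-level names land in the locals dict, but functions look names up in
# the globals dict, so helper() cannot see the imported module.
try:
    run_options_file(options_source)
    outcome = "ok"
except NameError as exc:
    outcome = "NameError: %s" % exc
print(outcome)

# Declaring the name global before the import stores it where functions look.
fixed = run_options_file("global os\n" + options_source)
print(fixed["result"] == os.sep)
```

Under that assumption, a global statement is one way to quiet the errors, since it forces the name into the dictionary that function bodies actually consult.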

I knew it wouldn't be that simple, of course. There were some dead directories in both pear and pecl to rm out of the CVS root, but that was easy enough. For the sake of record, they were:


I'll spare you all the pain I went through figuring out some more options for efficiency's sake, since none of it is useful info. Suffice it to say I ended up also changing these options:

ctx.output_option = NewRepositoryOutputOption(
ctx.cross_project_commits = False
ctx.cross_branch_commits = False

Finally, after this set of time statistics, I was ready to move on:

cvs2svn Statistics:
Total CVS Files:            159414
Total CVS Revisions:        909490
Total CVS Branches:         152400
Total CVS Tags:            1837685
Total Unique Tags:           11525
Total Unique Branches:        1012
CVS Repos Size in KB:      4032089
Total SVN Commits:          271574
First Revision Date:    Wed Mar 13 10:16:01 1996
Last Revision Date:     Thu Jun 26 08:39:31 2008
Timings (seconds):
 2478   pass1    CollectRevsPass
   26   pass2    CleanMetadataPass
    4   pass3    CollateSymbolsPass
  553   pass4    FilterSymbolsPass
   10   pass5    SortRevisionSummaryPass
    8   pass6    SortSymbolSummaryPass
  335   pass7    InitializeChangesetsPass
 8325   pass8    BreakRevisionChangesetCyclesPass
 8753   pass9    RevisionTopologicalSortPass
  307   pass10   BreakSymbolChangesetCyclesPass
  524   pass11   BreakAllChangesetCyclesPass
  280   pass12   TopologicalSortPass
  593   pass13   CreateRevsPass
   10   pass14   SortSymbolsPass
   11   pass15   IndexSymbolsPass
34894   pass16   OutputPass
57114   total

That's almost 16 hours, for those keeping score.

Anyway, the next step was to find a case-sensitive filesystem to check out to. Easy! Create a 20GB blank sparse disk image, format it HFS+ journaled case-sensitive and checkout to there.

Well, that didn't work very well. A full checkout of just php-src with all its tags and branches is well past 20G in HFS+. Forget the entire repository. Some of the tags in there are completely ridiculous, and the branching, the naming of the tags is just awful… but I digress. I decided the most obvious thing to do was to work with smaller pieces of the repository. So I picked up a checkout of ZendEngine2 and Zend.

Moving to

It was time to work on a system with slightly more capabilities than mine; I logged into it and started work there. I modified the cvs2svn options file accordingly, set up a blank SVN repository next to the CVS repository, took a snapshot of the CVS repository, and ran cvs2svn over the snapshot. I didn't run into any unexpected issues, which was a pleasant surprise. Next step was to check out each module in the SVN repository to find any problems such as that mentioned above with Selenium. A long and annoying process, but at least it's easy.

Doing some checkouts

Well, what's a cheap way to check out all the SVN modules and see whether there are problems, without overrunning the limited hard drive space of my system? Answer: Shell script! I came up with this little gem:

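# $1 = path to the CVS root (for the module list), $2 = SVN repository URL
dirs=`ls $1`
for dir in $dirs; do
        echo "Processing ${dir}..."
        if [ -d "$1/${dir}" ]; then
                svn co $2/"${dir}" >> ./checkout.log 2>&1
                if [ "$?" -eq 0 ]; then
                        echo "Successful on ${dir}." >> ./checkout.results
                else
                        echo "FAILED ON ${dir}!" >> ./checkout.results
                fi
                # Throw the working copy away again to save disk space.
                rm -Rf ./"${dir}"
        fi
done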

Worked like a charm. Pointed it at the CVS and SVN repositories, and kicked it into gear. A couple hours of tail -f checkout.log scrolling later, I had the following list of failures:

FAILED ON livingtags!
FAILED ON pear-manual!
FAILED ON phpdoc-ar-only!
FAILED ON phpdoc-he-only!
FAILED ON phpdoc-ro-dir!
FAILED ON phpdoc-ro-only!
FAILED ON phpdoc-tr-dir!

Every single one of those except pear was a nonexistent module, empty in CVS and ignored entirely in the SVN conversion. That left the pear module. Sure enough, it was the expected failure in Selenium and Testing_Selenium, from someone who checked .svn directories into CVS for some unknown reason. They were easily removed with a direct svn rm command:

$ sudo -u svn \
svn rm -m "[SVN CONVERSION] Removing .svn directories that break SVN checkout." \
  $SVNROOT/pear/Selenium/branches/shin/.svn \
  $SVNROOT/pear/Selenium/branches/shin/tests/.svn \
  $SVNROOT/pear/Selenium/branches/shin/tests/events/.svn \
  $SVNROOT/pear/Selenium/branches/shin/tests/html/.svn \
  $SVNROOT/pear/Selenium/branches/shin/docs/.svn \
  $SVNROOT/pear/Selenium/branches/shin/examples/.svn \
  $SVNROOT/pear/Selenium/tags/start/tests/.svn \
  $SVNROOT/pear/Selenium/tags/start/tests/events/.svn \
  $SVNROOT/pear/Selenium/tags/start/tests/html/.svn \
  $SVNROOT/pear/Selenium/tags/start/docs/.svn \
  $SVNROOT/pear/Selenium/tags/start/examples/.svn \
  $SVNROOT/pear/Selenium/tags/start/.svn \
  $SVNROOT/pear/Testing_Selenium/branches/shin/.svn \
  $SVNROOT/pear/Testing_Selenium/branches/shin/tests/.svn \
  $SVNROOT/pear/Testing_Selenium/branches/shin/tests/events/.svn \
  $SVNROOT/pear/Testing_Selenium/branches/shin/tests/html/.svn \
  $SVNROOT/pear/Testing_Selenium/branches/shin/docs/.svn \
  $SVNROOT/pear/Testing_Selenium/branches/shin/examples/.svn \
  $SVNROOT/pear/Testing_Selenium/tags/start/.svn \
  $SVNROOT/pear/Testing_Selenium/tags/start/tests/.svn \
  $SVNROOT/pear/Testing_Selenium/tags/start/tests/events/.svn \
  $SVNROOT/pear/Testing_Selenium/tags/start/tests/html/.svn \
  $SVNROOT/pear/Testing_Selenium/tags/start/docs/.svn

Committed revision 279477.


About this time I realized that a lot of things related to SVN would require version control before the repository was ready for use! Things like all the various scripts involved in the conversion itself, all the authorization data, the commit hooks, all the fun stuff. Putting these things into CVS would result in a bit of recursive failure. Putting them into the SVN repository I'd set up would interfere with the conversion, and besides, this was metadata, stuff that belongs in an equivalent to CVSROOT. Solution: A second SVN repository under much more restricted authorization control. I put in a request for a domain name and set up the server's Apache to serve it from a separate repository. Then Wez and a couple others convinced me that was a stupid idea. There wasn't really any reason this stuff couldn't go into CVS, other than my ornery resistance to the older and less useful system.

It was about this time that I had to study Git for another project and began to wonder if maybe it wasn't better than SVN, but I'm just not into the idea of learning an entirely new system and forcing everyone else to do the same. SVN maps 90% onto CVS commands… Git maps more like 40%. SVN is a good midway step to true distributed VCS, and there are plenty of Git/SVN interface tools.

So I set up a CVS module called SVNROOT/, got karma to it, and checked in my options file along with the checkout script above. Almost immediately I got an interesting question:

“Didn't we decide to use PHP instead of Python?”

Yes, we did. And yes, the options file is written in Python. Unfortunately, the way cvs2svn is set up makes this necessary; it includes the options file similarly to a PHP include directive.


Next step: Decide on a repository structure. Ooops… lots of differing opinions on that.

Well, this was getting complicated. It was time to step back and automate some of the process. So I popped open a new PHP file and came up with automation for the svn create, cvs2svn, and svn rm commands already discussed. Then I went back and added some nice command-line-y-ness to it using PEAR's Console_CommandLine (a VERY nice package, kudos to its author(s)!). The script can be viewed at

That done, I looked back at the reorganization mess. It looked like there would in fact be a few separate repositories for things like PEAR and GTK. I needed advice on this one, so I went to the mailing list. They wanted to know, “why separate repositories?” Well, it's a matter of maintenance, really. GTK, PEAR, Zend, they all have their own little quirks in the hook scripts and really it's just simpler and more elegant for them to have their own workspaces to play in rather than all this endless special-casing in the hooks and ACLs.

So I rewrote the conversion script completely to support this premise, and contacted various people to find out what to do with the “miscellaneous” modules scattered all over the place. Turned out most of them either belonged alongside php-src or were just plain defunct! The choice was made not to convert defunct modules, since there is a plan to leave the CVS repository available in some form.

Hook scripts

At a glance it might seem that would be the end of it. But unfortunately, no. There are a lot of administrative tasks done by scripts in CVSROOT, all of which need to be ported to SVN equivalents. I decided it would be astute to make a list of what needed to be ported before actually getting into it! To do that, I grabbed a copy of CVSROOT itself and had a looksee. It turned out the following things needed conversion:

  • Access Control Lists - replaced by the SVN authz database
  • - I couldn't quite figure out what this was for. It seemed to write the name of the committed directory to a file. A little more investigation showed it to be part of the automation
  • cvswrappers - Replaced by SVN's autoprops
  • - Sends the e-mails to various mailing lists when commits happen
  • modules - Replaced by svn:externals and restructuring
  • readers - Replaced by SVN's authz database

Available for the curious

Meanwhile, the converted PHP repository is now available via:

$ svn co

This will check out all the projects in the repository; it's suggested to specify a particular module instead. Don't forget about svn ls!

vcs/cvs2svnconversion.txt · Last modified: 2019/08/26 16:54 by gwynne