Monday, November 17, 2008

Perforce, FishEye, CC.Net, Labels, Oh My!

My employer is a big Perforce shop. Although we've got some CVS repositories still lying around for legacy reasons, over the past few years almost all of our source code has managed to make its way into four Perforce instances (separated for geographical and organizational reasons). We use it for pretty much everything, and we've got our entire software development methodology based around its use.

That includes a number of tools that we've integrated with it:
  • Code review tools (all built in-house)
  • Software release and distribution tools (also all built in-house)
  • Bamboo, as our current continuous integration system
  • FishEye, as our SCM web-based visualization system
  • Jira, as our issue tracking system
Before we moved to Bamboo, we also had three other CI systems hitting it: CruiseControl, CruiseControl.NET, and Hudson (all but one are retired, and that one is being retired shortly).

All of them hit it pretty hard on a regular basis to pull metadata (what happened when) and data (the actual file contents). This is a tale of where it started to go wrong and how we fixed it.

Problem Diagnosis
We noticed that FishEye was behaving pretty badly when it was rescanning a repository. We do this on a regular basis, because FishEye won't automatically rescan changelist descriptions (at least for Perforce), and sometimes we go back and edit changelist descriptions to hook them up to Jira issues or end-user requests, or just to provide more clarity on what someone did in the past. Since FishEye won't pick up those changes (if it's already processed changelist 123456, why would it process it again?), we have to periodically completely rescan our master Perforce server to pull out all changes.

Our master Perforce installation is relatively big for a non-gaming house (I can't give numbers for this in a public forum), and rescanning it was taking several days. In fact, it was taking so long that we couldn't even plan on doing it over a weekend: if we kicked off the process on Friday night after New York went home, it wouldn't be done by the time Hong Kong and Tokyo came in on Monday. This was a problem.

Also, when FishEye was doing this, the whole repository was noticeably slower. Since the rescans started to take place over normal business hours when people were trying to work, this made the problem doubly bad: not only was FishEye not available, it was making Perforce slow for users during the rescan.

So I started diagnosing what was going on, and the step that was taking the longest was processing labels. This alone was taking over a day, and because of the way FishEye does this, and because forking to query Perforce on Solaris 10 is a painful experience, we needed to get this number down. We had way too many labels covering way too many revisions.

Perforce Is Not CVS
The metadata table that holds this data in Perforce (db.label) was absolutely massive: roughly 7GB, or about 60% of our entire Perforce metadata storage. This wasn't good, and it's far from ordinary. When I started investigating, we had over 12000 labels. That's a lot for the number of projects we're hosting and the number of releases we've done, but it turns out that 10000 of them were created by CruiseControl.NET builds.
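To find out which labels dominate, you can group label names by prefix. The following is a minimal sketch, not what we actually ran: it assumes each line of `p4 labels` output looks like `Label <name> <yyyy/mm/dd> '<description>'`, and the label names and the "strip a trailing build number" rule are hypothetical, so adjust them to your own naming scheme.

```python
import re
from collections import Counter

# Assumed shape of one line of `p4 labels` output:
#   Label <name> <yyyy/mm/dd> '<description>'
LABEL_LINE = re.compile(r"^Label (\S+) (\d{4}/\d{2}/\d{2}) '(.*)'\s*$")


def label_prefix(name):
    """Strip a trailing number so 'Fibble-build-30' groups as 'Fibble-build'."""
    return re.sub(r"[-_.]?\d+$", "", name)


def count_label_prefixes(lines):
    """Count labels per prefix, skipping lines that don't look like labels."""
    counts = Counter()
    for line in lines:
        match = LABEL_LINE.match(line)
        if match:
            counts[label_prefix(match.group(1))] += 1
    return counts


# Hypothetical sample data; in practice, feed in the real command output.
sample = [
    "Label Fibble-build-29 2008/10/01 'Created by ccnet.'",
    "Label Fibble-build-30 2008/10/02 'Created by ccnet.'",
    "Label Fibble-1.2.0 2008/09/15 'Release 1.2.0.'",
]

for prefix, count in count_label_prefixes(sample).most_common():
    print(prefix, count)
```

A prefix with thousands of entries is a good hint that a tool, rather than a person, is creating the labels.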

This largely stemmed from a misconception of what labels are good for, and is basically a remnant of CVS-style thinking. In CVS, because revisions of files can be interleaved together, if you want to reference the state of a particular subsection of the repository as of a particular point in time, you have to add a Tag to every revision of every file involved. This actually goes in and adds metadata for that revision to say it's part of BUILD_5 or some such thing.

A Perforce Label is different. Perforce has atomic, monotonically increasing changelist numbers, where each number uniquely identifies the state of every single revision in every single file in the entire server. And I can use them in all types of contexts. In particular, I can pull down the state of a particular project as of a particular changelist number: "Give me the Fibble project as of changelist 12345." This is how Perforce-integrating CI systems work: they query Perforce to say "tell me all the submitted changelists that I haven't seen", and then sync up particular areas as of those changelist numbers. Therefore, a changelist number is the equivalent of a tag applied to every revision of every file in the whole server.
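As a rough illustration of that polling pattern, here is a sketch that only builds the `p4` command lines a CI system might run; it is not taken from any particular CI tool. The depot path `//depot/Fibble/...` and the helper names are hypothetical.

```python
def changes_since_command(depot_path, last_seen):
    """argv for listing submitted changelists newer than last_seen."""
    revision_range = f"{depot_path}@{last_seen + 1},#head"
    return ["p4", "changes", "-s", "submitted", revision_range]


def sync_command(depot_path, changelist):
    """argv for syncing a path to its state as of one changelist."""
    return ["p4", "sync", f"{depot_path}@{changelist}"]


# Hypothetical depot path for the Fibble project.
fibble = "//depot/Fibble/..."

print(" ".join(changes_since_command(fibble, 12344)))
print(" ".join(sync_command(fibble, 12345)))
```

The changelist number is the only piece of state the CI system has to remember between builds, which is why no label is needed.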

A Label, on the other hand, is there for cases where you need to refer to revisions of files across multiple changelists. The key use case here is patching. Let's say that you've released version 1.2.0 of your software, and then you start adding changes to support 1.2.1. But in the meantime, you discover a complete showstopper bug that requires you to put out a special release with only that bug fix in, and not any of the other changes you've got prepared for 1.2.1. Since the 1.2.1 features have already started going in, if you try to pull all the source code as of the point where the critical bug fix went in, you'll get the 1.2.1 changes as well. In this case, you create a Label, and you put in the label all the 1.2.0 revisions, as well as the revisions just for the showstopper fix, but none of the rest of the 1.2.1 changes. This gives you a way to refer to a collection of revisions of files across different changelists.
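The patching case can be modelled with a toy example. This is a simplified sketch, not how Perforce stores anything: the changelist numbers and file names are made up, each changelist just bumps the revision of the files it touches, and it assumes the fix touches no file that the 1.2.1 work has also changed.

```python
def state_at(changes, changelist):
    """Map each file to its head revision as of the given changelist."""
    state = {}
    for number, files in sorted(changes.items()):
        if number > changelist:
            break
        for path in files:
            state[path] = state.get(path, 0) + 1
    return state


def revisions_in(changes, changelist):
    """Map each file touched by one changelist to the revision it created."""
    after = state_at(changes, changelist)
    return {path: after[path] for path in changes[changelist]}


# Hypothetical history: changelist number -> files it modified.
changes = {
    100: ["main.c", "util.c"],  # the 1.2.0 release
    101: ["util.c"],            # a 1.2.1 feature, not wanted in the patch
    102: ["main.c"],            # the showstopper fix
}

# Syncing to changelist 102 drags in the 1.2.1 feature as well.
print(state_at(changes, 102))

# The label is the 1.2.0 state plus only the revisions from the fix.
patch_label = {**state_at(changes, 100), **revisions_in(changes, 102)}
print(patch_label)
```

No single changelist number describes `patch_label`, and that is exactly the situation a Label exists for.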

For every single build of every single project, our CC.Net server was creating a new label containing the state of all the files for that build. (As I didn't install it, I don't know if this was the default or intentional on our part.) But you don't need to do that: since it was pulling everything down as of a particular changelist number, all that accomplished was saying "The revisions for build 30 of the Fibble project are the same as the files as of changelist 12345," which Perforce changelist numbers already give you. So it was unnecessary metadata that was clogging up the server intolerably.

We deleted all of those 10000 labels (the fact that we had already moved all those projects to Bamboo made this a no-brainer, as we were no longer using the CC.Net-provided builds at all). But the size of the db.label table didn't shrink. In fact, it got slightly bigger during that time.

This is because Perforce, as an optimization, assumes that the amount of metadata you store only grows over time, and so it doesn't shrink the table files when records are deleted. The tables were still too big, and now sparse as well: although queries had fewer records to cover, the oversized files were still hurting the OS's file caching.

The solution there is to restore from a checkpoint (a Perforce checkpoint is a text file, gzipped in our case, containing every record from your binary metadata tables in plain text; it acts as your primary backup mechanism for the metadata that Perforce keeps alongside your RCS files). Before we did this, we went through a pruning process for old workspaces that people hadn't deleted (removing several hundred) to get the db.have file down in size as well.
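The checkpoint-and-restore cycle looks roughly like the dry-run plan below. Treat it as a sketch under assumptions rather than our exact procedure: the server root, checkpoint file name, and backup directory are hypothetical, it only prints the commands instead of running them, and you should check the steps against the Perforce administration documentation for your server version before trying them on a real installation.

```python
import shlex


def restore_plan(p4root, checkpoint, backup_dir):
    """Return the shell commands for a checkpoint-and-restore cycle."""
    root = shlex.quote(p4root)
    return [
        # Stop the server so the db.* files are quiescent.
        "p4 admin stop",
        # Write a fresh gzipped checkpoint of the metadata.
        f"p4d -r {root} -z -jc",
        # Move the old, bloated db.* files out of the way (keep them for now).
        f"mv {root}/db.* {shlex.quote(backup_dir)}/",
        # Rebuild compact db.* files from the checkpoint.
        f"p4d -r {root} -z -jr {shlex.quote(checkpoint)}",
    ]


# Hypothetical paths; the checkpoint name comes from the -jc step.
for command in restore_plan("/p4/root", "checkpoint.42.gz", "/p4/old-db"):
    print(command)
```

Keeping the old db.* files in a backup directory until the restored server has been verified gives you a way back if the restore goes wrong.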

After this was done, the size of our db.* tables went from 11GB to 3.0GB (our gzipped checkpoints went from 492MB to 221MB). And the FishEye scans went from 3 days to 5 hours. Job done.
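To put those figures in proportion, here is the arithmetic as a quick sketch, using only the numbers quoted above:

```python
def shrink_percent(before, after):
    """Percentage reduction from before to after, to one decimal place."""
    return round(100 * (before - after) / before, 1)


db_tables = shrink_percent(11.0, 3.0)    # db.* tables, in GB
checkpoints = shrink_percent(492, 221)   # gzipped checkpoints, in MB
scan_speedup = (3 * 24) / 5              # 3 days versus 5 hours

print(db_tables)     # 72.7
print(checkpoints)   # 55.1
print(scan_speedup)  # 14.4
```

So the tables shrank by roughly 73%, the checkpoints by roughly 55%, and the rescan ran about 14 times faster.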

So after all that, what we can draw from this is:
  • Don't label unnecessarily. Labels are relatively expensive in terms of metadata, and you probably don't need them.
  • Shrink your metadata. Remove anything that you no longer need, including old workspaces.
  • When you prune significantly, restore from checkpoint. This is the only way to get Perforce to really shrink down your metadata tables on disk.