The theory here is that forking involves playing around with TLB entries (and the page tables behind them) quite a bit. Since a monstrous full heap will have a lot of 4KB-page entries to contend with, shrinking the number of entries by a factor of 512 (by working with 2MB pages rather than 4KB ones) should limit the amount of time the kernel spends mucking with them.
First of all, make sure you read this: The large memory support page from Sun. Now that we've gotten the formalities out of the way, here's some fun we had.
To begin with, a stock RHEL 5.2 installation has `HugePages_Total` set to 0 (check `cat /proc/meminfo`). No huge pages whatsoever. So you need to bump that up. For my test (a maximum 2GB fully populated heap on a 4GB physical RAM system), we decided to set aside 3GB, which is 1536 2MB pages. The catch is that `echo 1536 > /proc/sys/vm/nr_hugepages` isn't guaranteed to actually do that: the first time we ran it, we ended up with a whopping 2 for `HugePages_Total`. The second time bumped us up to 4. So we went on a process hunt to kill off anything that was holding memory and getting in our way, and got things down pretty small. At that point we were able to get up to 870, which was good enough for my 1GB tests (which showed the major performance degradation anyway), though not for the 2GB test. (Yes, I know you're supposed to do this at startup, but I didn't have that option, so we did what we could.)
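For reference, here's the whole dance as a rough sketch (the 1536 figure is specific to my 4GB box, and as noted above you may well get fewer pages than you ask for):

```bash
# Check the current huge page state (all zeros on a stock RHEL 5.2 box).
grep Huge /proc/meminfo

# Ask the kernel for 1536 x 2MB pages (3GB). This needs root, and the kernel
# will only hand over as many contiguous 2MB chunks as it can actually find,
# so re-check HugePages_Total afterwards -- it may be far lower than requested.
echo 1536 > /proc/sys/vm/nr_hugepages
grep HugePages_Total /proc/meminfo

# The reliable way is to reserve the pages at boot, before memory fragments,
# e.g. by adding this to /etc/sysctl.conf:
#   vm.nr_hugepages = 1536
```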
And so I kicked things off with the `-XX:+UseLargePages` flag. Fail. Every single time I got:

`Java HotSpot(TM) 64-Bit Server VM warning: Failed to reserve shared memory (errno = 12).`

And nothing would run. Well, damn! Turns out those little bits in the support page about this not working for non-privileged users are completely accurate. The warnings all went away when I had someone with sudo rights run the process as root, and all my numbers are from running as root. So just assume that even 1.6.0_10 isn't going to let you allocate any large pages if you're not root.
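So the invocation that actually worked for us looked something like this (`ForkTest` and the heap settings are just stand-ins for my real test class and sizes):

```bash
# Large pages only worked for us when the JVM ran as root (via sudo here).
# -XX:+UseLargePages tells HotSpot to back the heap with the huge pages
# reserved above; without root we just got the errno=12 warning and no run.
# (ForkTest is a placeholder for the actual fork-happy test class.)
sudo java -XX:+UseLargePages -Xms1024m -Xmx1024m ForkTest

# The same test without large pages, runnable as a normal user:
java -Xms1024m -Xmx1024m ForkTest
```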
So he ran things as root (and I re-ran things as non-root without the `-XX:+UseLargePages` flag, since I'd changed the test slightly). Here's the fun comparison:
| Heap Size | Large Pages Time | Normal Pages Time | Speedup |
|---|---|---|---|
| 128MB | 3.589 sec | 13.217 sec | 3.68 times faster |
| 256MB | 3.62 sec | 15.314 sec | 4.23 times faster |
| 512MB | 4.638 sec | 36.692 sec | 7.91 times faster |
| 1024MB | 3.885 sec | 67.062 sec | 17.26 times faster |
Oh, and that jump in speedup between 512MB and 1024MB (where the large-pages run actually got faster while the normal-pages run nearly doubled)? Completely reproducible. I'm not sure precisely what was going on there; I'm going to assume my test case is flawed somehow.
It's happening so quickly at this point that I'm quite suspicious that all I'm really measuring is `/bin/false` process startup and teardown, plus the concurrency inside Java; I don't think I'm testing anything of meaningful precision anymore (see the rough baseline sketch below). Maybe it would show up again at a few million forks or with higher concurrency, but I've achieved an essentially constant amount of time spent forking, so I've gotten out of the heap issue entirely.

So it turns out that you really can make Java fork like crazy on Linux, as long as you're willing to run as root. I don't know why root is required, and my naive googling didn't help. If someone can let me know, I'd really, greatly appreciate it.
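For what it's worth, you can get a rough feel for that pure startup-and-teardown floor without the JVM in the picture at all; here's a crude shell-only baseline (the iteration count is arbitrary):

```bash
#!/bin/bash
# Crude baseline: fork+exec /bin/false 10,000 times with no JVM involved.
# Whatever this costs per iteration is a floor under any Java fork test,
# since every forked /bin/false pays at least this much.
time for i in $(seq 1 10000); do
    /bin/false
done
```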
Did any of this help Solaris x86? Not one whit. Adding the `-XX:+UseLargePages` flag (even though Solaris 10 doesn't require any special configuration to make large pages work) didn't improve performance at all, and Solaris was still twice as slow as Linux without the flag.