Friday, October 31, 2008

A London TechHub? Don't Be Daft

Until Rob Knight gave me a shout-out in a blog entry, I didn't realize that other people had been talking about trying to set up some type of "TechHub", and now apparently they're talking about doing it in London.

People who have seen me in full-on rant mode down t' pub may recognize this as one of my favorite topics, and one on which I completely agree with Rob Knight in that London is perhaps the stupidest place in the UK to do it, bar none. Don't get me wrong: there are definitely technology firms doing quite interesting work in London, and there are a lot of people that I admire working for technology pure-plays here in London. And that's great. But you're never going to have anything near Silicon Valley style concentration of technology startups or firms here.

Micro-Cluster or Macro-Cluster
Your first question is whether you want a micro-cluster or a macro-cluster. I think the current plans for a Tech Hub are of the micro-cluster form: a building or two near each other where you can put lots of tech startups in one area. And what does Mike think you need for this?
  • Great transport links
  • Nearby cheap hotel
  • Super-Fast broadband
  • Hot-desk facilities
Uhm, Mike, do you realize that you've essentially just invented the business model for Regus? Simple fact: none of those things is unique to technology startups whatsoever. And in fact, none of the most successful startups that I know of had such ideal digs right off the bat. Wanna know why? Everybody wants them.

And when everybody wants them, the cost of them goes up. And that means that startups get priced out of being able to move straight into them. My startup started in the cheapest office space we could get, which was a serviced office in an industrial park in San Mateo, with shared T1 access and no static IP and no public transport and right at the intersection of two freeways which were always massively crowded so you had to commute at odd hours. Because it's what we could afford. And we even had a 7-figure seed round under our belts!

Providing prestige office accommodation for startups doesn't make any sense because they'll be priced right out of moving into it.

So where do startups end up clustering by their very nature? Places which have some of the above, but are not Grade A office space for a variety of reasons. Places like Old Street and Hoxton and Shoreditch and Bermondsey. Oh, and look, those places already have quite a few startups naturally gravitating towards them.

Places like the TechHub already exist, in every way you can imagine, it's just that you can't afford them if you're a tech startup.

The Valley Is Bigger Than London
Okay, so let's reconsider as a macro-hub, where we're trying to figure out how to replicate Silicon Valley rather than Palo Alto. That requires scale. Lots of it.

When you talk about The Valley (and not the San Fernando Valley from whence comes pr0n) [my best recollection was that The Valley essentially spread from San Mateo down to southern San Jose and across to include parts of Fremont; if you include San Francisco and Berkeley you're talking about a huge area] you're talking about an area that's about 35 miles across. That's pretty much the entire size of Greater London from Heathrow to Upminster.

One of the advantages for many people there is that you can largely pick where you live relatively independently of where you're going to work. When I was there I worked in San Francisco, San Mateo, Burlingame, Menlo Park, and Cupertino. I lived that entire time within a 1 square mile area. You can easily settle down in Cupertino or Sunnyvale or Mountain View and have the entire Silicon Valley employment system available to you. Admittedly, this is part of the whole car/freeway culture, which sucks royally, but can you say the same for London? No.

London is completely impossible to navigate in over any moderate distance whatsoever in a car. So let's stop that route right now.

Let's look at the public transport map then. In fact, the public transportation system of London is irredeemably designed to do exactly one thing: get people to and from their jobs in the City, and to and from their playgrounds in the West End. And it can't even do that. You want to go from Non-City-Location-A to Non-City-Location-B via public transportation? Oh, it is to laugh. If you can do it at all, it's going to take you an eternity each way. I had a 45-mile door-to-door commute from San Francisco to Cupertino once. Took me 45 minutes each way. My current commute (from Fulham to Bank) takes me longer than that. For what, 5 miles? And I'm following the direction of the transportation system. For me to go play Table Tennis on Sunday, to go about 2.5 miles, takes me 45 minutes each way in a bus. For 2.5 miles.

In this respect, the outer/inner London system acts as a fully centralized Silicon Valley: you can settle down anywhere with a train into central London and just stay there, as you move from City-job to City-job. More on that next.

Staff Up, Yo.
So let's say that you did want to create a macro-cluster of tech firms. I would say that the #1 hurdle you have to face is this: How do you make it relatively easy to get about 20 A-list software engineers in a startup quickly. Let's look at each of those in turn:

Relatively Easy. By this I mean that your hit rate on qualified candidates isn't less than 1% (which is roughly what my hit rate on CVs to interviews was just after the Dot Com Boom ended as everybody who was pretty marginal in the valley all went looking for jobs at the same time).

20. I pick this because I think if you're looking for a team smaller than this, you could probably do that almost anywhere. And I know that modern programming languages and techniques are making it easier to do more with less, but you've still got a requirement for about 20 technical people overall, whether it's hosting, configuration, testing, the works. To get something that's really good in a commercial context still requires a fair number of people.

A-list. Startups aren't amateur hour. Well, they are in Web 2.0 (oh, snap!), but not in the real world. Bring your A game, yo.

In a startup. This point is critical because you have to get people who are willing and able to work for a startup and in a startup environment.

Quickly. Is it going to take you 12-18 months to ramp a team that quickly? Fail.

What's going to block you in every single one of those criteria? The financial industry.

Follow The Money
Where are all those people that you want working in London? The financial services industry. Why? The money. Why are we such money-grubbing whores? Because London's expensive.

When I moved to London (and it had to be London; I wasn't willing to consider anywhere else in the UK) for personal reasons about 5 years ago, I initially thought I'd go work in a startup or in the tech industry, and I met with quite a few VC-backed firms, and had several meetings with a very prominent London-based VC (who works for a very prominent Sand Hill Road firm's outpost in St. James). His advice stuck with me, and it's never been more true. On average, you're always going to earn more in the financial industry with lower variance.

If you play the startup game, over a 10 year period, figure you're going to have one minor pop (acquisition for small money) and one medium pop (acquisition for major money). Assume that you will never be part of a successful IPO (because the chances are so slim that it's laughable). During that period you average £X in annual pay, with those minor and medium pops paying you £Y in total one-off payments.

During that same period in financial services, you'll be earning £A in annual payments, and £B in bonuses every year. In order to make the maths work out, you would have to assume that 10X + Y is more than 10(A + B). It's not. If you're the type of person who's going to be in at the first round of a startup (and thus makes Y meaningful), Y is going to work out at roughly the same as (or even a little more than) 10B. But A is astronomically higher than X.

And the financial industry knows that their competition for top software talent in the UK is the rest of the financial industry, and the technology sector. And they know that by their very nature, most super-strong software engineers would prefer to work in a pure-tech environment where they're not a support structure, but the central pillar of the firm. Therefore, they know that they have to pay them more, and with much lower variance, than they would get in their preferred industry. And they do.

And let's be honest: if someone told you "I'm going to pay you more every year, in fact, I'm even going to pay you enough to own property in Zone 2 in London" versus "I'm going to pay you so little you're going to be renting somewhere 30 minutes from a tube line in Zone 6, but maybe just maybe in 6 years you might do pretty well," it's a pretty easy decision for a lot of people to make.

Which is precisely why Paul Graham likes his serfs to be so young: so that they haven't yet built up any tradition of eating anything other than ramen. When you've been able to afford Nobu, you don't want to go back to nothing better than McDonalds (unless it's on the way home after the pub and you need all the fat and protein you can shove in your face if you're to be functional the next day). You'll do it if you have to, but most people aren't going to choose it.

I guess that's why the vast majority of the people I know here who are of the calibre I worked with in the valley work in finance here in London. And they wouldn't do an early stage startup, because they physically couldn't slash their monthly expenditure to a point that would make the startup pay effective. You might be able to get a few, but 20?

And even if you could get 20, do you think that you could get 400 (20 startups each with 20)? I've been at parties in San Francisco with 400 tech-company software engineers for particular social niches. Do you think you could get 400 A-list software engineers to give up their cushy financial industry salaries and stay living in London on a pittance to work for your startups? Because you'd need at least that to kickstart it.

And this isn't just a London problem. New York isn't a tech industry epicentre for the same reasons. Chicago's interesting in that it has enough financial firms to have spillover effects, but not so big that its gravitational well can't support pure-play technology firms.

Could you just pay startup employees more? Well, how many European VCs do you know who like having their founding team earning into 6 figures sterling before they have a dime of revenue? I didn't think so.

Oh, Noes! The Credit Tsunami!
Oh, wait, the Credit Crisis/Crunch/Quake/Tsunami is going to fix all this by eliminating all the financial industry jobs! My ass.

Bonuses for many are going to temporarily suck. People will be laid off. But I don't know of any strong technologists who have yet been laid off by a financial firm in London (I'm sure there are, but in the transitive closure of people that I know, I don't directly know of any). They're so difficult to hire at the best of times, you want to hold onto them. Don't count on there being such a flood of programming superstars who are given massive severances so they can all start companies at the same time that you can turn back time.

You'll probably see what happened when the Dot-Com bubble burst. You saw all the people who came into the industry (Tech then, Finance now) who really weren't good enough, but who got hired because people needed warm bodies who could do the grunt work (then web coding, now WinForms programming), they'll all go. And a few people who really shouldn't go, but get caught up in a round for some reason or another. And those people will find jobs almost instantly (great Silicon Valley anecdote: I had to delay the CEO of my firm laying me off once because I was on the phone getting a job offer from another firm).

So even if the Credit Implosion results in a different technology employee market in London, you'll just face the problem of identifying the little wheat from the piles of chaff. And that sucks very valuable resources.

If Not Here, Where?
Here's where I (somewhat) agree with Mike and disagree with Rob: Manchester isn't close enough to London. VCs aren't going to want to live in Manchester, and neither are your other financiers and lawyers and all that. They want you to be within a 1 hour drive of their offices on Sand Hill Road in Silicon Valley, which essentially defines where you can have your offices there. And they aren't going to want to get on a plane.

I think you've got an hour-and-a-half to two-hour train journey from central London (which puts you unfortunately in outer-commuter-belt). Places that are interesting:
  • Brighton. Has a similar vibe to it as San Francisco, and similar numbers of "creative types." But it's so hemmed in by geography it might not be easy to grow it. And it's horrid in the summer as all the day-trippers come down and invade it (disclaimer: I used to have a weekend place there).
  • Bristol. Already has a massive HP lab outside of it, good university town, decent weather, fast trains.
  • Oxford/Cambridge. Universities (remember, Palo Alto has Stanford and Berkeley has Berkeley, largely defining why Silicon Valley is where it is) bring young talent who haven't been seduced by London yet.
  • Reading/Bracknell. The area was starting to get that way (with all the outposts of US tech firms out there), except that it's really soul destroying and foul so nobody young and talented would want to live there (remember: you can work in Silicon Valley [similarly soul destroying and horrible], but live in San Francisco quite easily. I had a friend working in Reading and living in London and it almost killed him). Maybe if Crossrail had gone there, but otherwise too grim.
If Manchester had a super-high-speed rail link to central London, then it would be quite interesting, and it would probably actually really benefit the UK as a whole. But given that the UK can't do new infrastructure, I'm not holding my breath.

Where does this leave us? Well, I think essentially the following:
  • London has a geography and transportation system that limits you to locating in central London to draw on enough talent;
  • The desirability of London for so many reasons makes any office space suitable to drawing on a lot of talent ridiculously expensive; and
  • The financial industry will collectively act to ensure that the type of people you need to do a pure-tech startup simply aren't available at the prices that a startup can afford.
Yes, I'd love a technology cluster inside the M25. But I'm realistic.

JAXB 2.x and @XmlElement(required=true)

We've been quite happily using the JAXB reference implementations for a while now, until someone actually evaluated whether the XML that it's generating is valid. Turns out that it's only sometimes valid with respect to your schema.

(FTR, this is going to be a bit tricky since I'm using Blogger and thus don't have the best inline XML-and-code support). (Version notation: This is all valid for both 2.0 and 2.1 versions [up through 2.1.8], but is not valid for 1.x JAXB).

The Problem: Required Strings
Assume you have an XSD complexType which contains two elements:
<xs:element name="foo" type="xs:double"/>
<xs:element name="bar" type="xs:string"/>

In your XJC-generated Java code, both the Foo and Bar properties will be annotated with an @XmlElement(required=true) annotation. In addition, Foo will be declared as a double rather than a Double, so you'll always have a value of some form in the generated class. The problem is with bar.

bar will be declared without any default value whatsoever (unless you're using the JAXB Default Value plugin). Even worse, if you run through a marshal/unmarshal pass on objects or XML that lack a bar element at all, it works entirely fine (even with the default value plugin) and you'll be just fine consuming data without a bar element whatsoever. So you can quite merrily generate XML which doesn't adhere to your XSD, and consume XML which also doesn't. (For the foo element, you'll always have the default value of 0.0 if the XML you're unmarshalling lacks the element.)

In general, I'd consider code which is doing this to be bug-ridden, and in need of fixing: if your application knows that you need a bar, why aren't you adding one to your objects before marshalling them? Conversely, you should always try to make do with whatever crap someone barfs at you over the wire if you can (Postel's Law and all that). But it's still confusing behavior.

Workaround One: Make It An Attribute
If you have control over your schema, making it an XML attribute will solve this problem immediately. They're handled differently by JAXB, so it always works as you'd expect.

Workaround Two: Explicitly Validate
At runtime, for performance reasons, JAXB by default doesn't actually have access to your XSD. All it works with are the (generated|hand-written) Java files with annotations (this is a difference to the way the old XMLBeans worked, where it would pre-process a binary form of your XSD for runtime validation). You can make JAXB perform strict validation, but for performance reasons it won't by default.

The problem is that it means that you have to ship your XSD files with your application and use SchemaFactory out of the javax.xml.validation package to load up the XSD into an in-memory Schema instance, and pass that to your Marshaller and Unmarshaller's setSchema methods. It'll slow you down, but you'll be guaranteed to be valid.
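To make that concrete, here's a rough sketch of the schema-loading half, using nothing but javax.xml.validation so it runs without any JAXB jars on the classpath. The inline XSD, the "thing" root element and the SchemaCheck class name are all made up for illustration; in real life you'd load your shipped XSD file instead of a string.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

public class SchemaCheck {
    // Hypothetical schema: a "thing" element holding the two required
    // elements (foo and bar) from the example above.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='thing'><xs:complexType><xs:sequence>"
      + "<xs:element name='foo' type='xs:double'/>"
      + "<xs:element name='bar' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    /** Loads the XSD into an in-memory Schema instance. */
    public static Schema loadSchema() {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            return factory.newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("Schema did not compile", e);
        }
    }

    /** Returns true if the XML document validates against the schema. */
    public static boolean isValid(Schema schema, String xml) {
        try {
            Validator validator = schema.newValidator();
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // validation errors surface as SAXException
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Schema schema = loadSchema();
        // With JAXB, this same Schema instance is what you would pass to
        // marshaller.setSchema(schema) and unmarshaller.setSchema(schema).
        System.out.println(isValid(schema, "<thing><foo>1.0</foo><bar>x</bar></thing>"));
        System.out.println(isValid(schema, "<thing><foo>1.0</foo></thing>"));
    }
}
```

The second document is the interesting one: it lacks bar, which is exactly the case that slips through when no schema is set.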

A reasonable option at least to me would be to validate in debug cycles (using the assertions system) and turn it off when you know your applications actually all work together happily.

Reader Note: This was written largely because googling this topic never came up with any actual explanation. Therefore, I wanted this to be keyword google friendly. And since I was going to have to document this for other people at my company, I figured I'd document it for the world.

Thursday, October 30, 2008

Linux: A (Less) Terrible Choice For Java Continuous Integration

Where we last left our intrepid developer, he was floating in a sea of Bamboo+Perforce misery, blithely assuming that moving from Solaris 10 to Linux would solve all of his problems. Oh, what a blissful world he would live in! What joy he would have no longer having to deal with Solaris swap space reservation woes!

Since then, though, he's gotten access to two (almost) identical machines, one running Solaris 10 x86, and one running RHEL 5.2. And while he's vindicated, he's nowhere near vindicated enough for his liking.

Executive Summary: Runtime.exec() performance under Linux is superior to that of Solaris 10 x86, but nowhere near as superior as it should be.

Inspect the following micro-benchmark (and apologies that I don't have the nifty code viewing tools that other bloggers do):

import java.util.concurrent.*;

public class ForkTest {
    public static void main(String[] args) throws InterruptedException {
        int nThreads = Integer.parseInt(args[0]);
        int nSlabs = Integer.parseInt(args[1]);
        // Fill the heap with 80MB slabs (0 slabs is the "empty heap" case).
        byte[][] bytes = new byte[nSlabs][];
        for (int i = 0; i < nSlabs; i++) {
            bytes[i] = new byte[80 * 1024 * 1024];
        }
        ExecutorService executor = Executors.newFixedThreadPool(nThreads);
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1000; i++) {
            executor.execute(new Runnable() {
                public void run() {
                    // Fork a trivial sub-process and wait for it to exit.
                    try { Runtime.getRuntime().exec("/bin/false").waitFor(); }
                    catch (Throwable t) { t.printStackTrace(System.err); }
                }
            });
        }
        // Stop accepting new tasks; without this, awaitTermination
        // only returns once the timeout expires.
        executor.shutdown();
        executor.awaitTermination(10L, TimeUnit.MINUTES);
        long end = System.currentTimeMillis();
        double secs = ((double) (end - start)) / 1000.0;
        System.out.println("" + nThreads + " - Forking 1000 times took " + secs + " secs");
    }
}

Essentially what the test is doing is:
  • Creating a fixed size (-Xms and -Xmx set to the same value) heap
  • Allocating some slabs of memory (to fill up the heap) (where I refer to "empty heap" tests, this was set to 0)
  • Creating an ExecutorService with a certain number of threads
  • Running through 1000 tasks, where each task involved running /bin/false in a sub-process and waiting for it to terminate

I felt that this was probably the best way that I could possibly test whether the behavior that I felt was causing Bamboo to perform badly with Perforce repositories would also affect Linux. Turns out I'm half right; Linux will still suck, but suck 50% less.

General Parameters
Both machines were Sun X4100 (non-M2) servers with two dual-core Opterons (one a pair of 275s and one a pair of 285s), and 4GB physical RAM. All tests were done on 1.6.0_10.

Empty Heap Comparison
Here's the first test: Run through everything with no slabs allocated (empty heap) and see how fast we can go. Results in this graph, but the highlight here is that Solaris was very little affected by the size of the heap, but was always slower than Linux, by roughly a factor of 2.

Full Heap Comparison
Next test was to fill up the heaps and then try. Here you can see that the heap size completely determines performance, but Linux is always better (factor of 2 again).

Full/Empty Comparison
Here are just the Linux values plotted out, and it's pretty clear what's going on.

Interesting Observations
Note that the sweet spot here is two threads. No more, no fewer. On Linux, Solaris, empty, full, doesn't matter. You want to fork as fast as you can? Have two threads doing it. Admittedly, these tests were on 2-socket (4 core total) machines, but when I repeated this on Solaris on one of our 8-socket x4600 machines (16 core total), I ended up with the exact same thing: 2 threads was always ideal.

Uninteresting Observations
"Hey, Kirk, you just proved that forking an empty virtual space is faster than a full virtual space! Big whoop! You're such a Java-specific Moran that you forgot all that from your 31337 days!"

Well, no, not really. What I specifically established is that:
  • On neither Linux nor on Solaris are the Sun-provided JVMs using any of the fast-subprocess-spawn operations available to them.
  • This is a really clear win for anyone working on CI systems to nag Sun or the OpenJDK crowd to get changed and fully tested.
  • Linux is still a factor of 2 ahead of Solaris here. I would have hoped a fast-spawn implementation would be a factor of 10, but I'll take a factor of 2 gladly.
  • There is definitely something happening in a fork+exec pattern on Linux which is VM-specific, which means that our suppositions earlier that Linux is going to do optimistic copying aren't panning out to eliminate the costs of a fork with a large amount of allocated memory.

So let's say you have a large, long-running Java server which benefits from having a relatively large heap (like, oh, I dunno, a Continuous Integration server), and you have to shell out constantly because a vendor refuses to support you well (speaking of which, I've actually formally asked Perforce to document the protocol).

If you really want to avoid the whole C++ thing, essentially you should be:
  • Forking to a second JVM instance to run a small Java application.
  • That small Java application should itself shell out to your command-line application (remember: on Linux with a 128m empty heap you can get up to 175 Runtime.exec()/sec, which is not too shabby)
  • Have that small java application just pass stdin/stdout to the parent application.
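Here's roughly what that small Java application might look like. Treat it as a sketch under some loud assumptions: the one-command-per-line protocol on stdin, the "exit=" trailer line, and the ExecRelay name are all invented for this example, and it doesn't feed any stdin through to the child (a real relay for Perforce would need proper framing and argument quoting).

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.util.Arrays;
import java.util.List;

public class ExecRelay {
    /** Copies everything from in to out. */
    static void pump(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        out.flush();
    }

    /** Runs one command, relays its merged stdout/stderr, returns the exit code. */
    public static int run(List<String> command, OutputStream out) {
        try {
            ProcessBuilder pb = new ProcessBuilder(command);
            pb.redirectErrorStream(true);   // fold stderr into stdout
            Process p = pb.start();
            p.getOutputStream().close();    // this sketch sends the child no stdin
            pump(p.getInputStream(), out);
            return p.waitFor();
        } catch (IOException e) {
            throw new RuntimeException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
    }

    /** Convenience wrapper that returns the command's output as a String. */
    public static String capture(String... command) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        run(Arrays.asList(command), out);
        return out.toString();
    }

    // Hypothetical line protocol: the parent JVM writes one
    // whitespace-separated command per line to our stdin.
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.trim().isEmpty()) {
                continue;
            }
            int exit = run(Arrays.asList(line.trim().split("\\s+")), System.out);
            System.out.println("exit=" + exit);
            System.out.flush();
        }
    }
}
```

The parent starts this once with a deliberately tiny heap (something like java -Xmx16m ExecRelay), keeps the pipe open, and pays the expensive big-heap fork exactly one time.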

Yes, this seems completely retarded. I can't believe I'm recommending it. But it would actually work as a consistent approach to the Perforce+Continuous Integration problem.

Thursday, October 23, 2008

Culture of Testing and Continuous Integration

Ran across a pretty good post on how defects can lead to never being able to meet any of your three basic criteria (features, quality, time). You should read it. But the Kirk summary here is that while I'm not surprised, there are still a lot of organizations who haven't gotten Test/Continuous-Integration Infected. I've known for years that it's a key component to long-running development project success (hint: if you're using source-code control, it's long-running), but this is a pretty good summary about why if you always skimp on "Quality", you're actually always going to lose on Features (too busy fighting bugs to add them) and Time (end of the cycle is extended dealing with defects in key functionality).

A few quick and recent anecdotes:
  • That Java5 vs Java6 ThreadPoolExecutor bug? Caught by our continuous integration server testing against a plethora of JVM and OS combinations.
  • In some C++ code, we caught a race condition by running tests over and over under continuous integration (and because our build agents run on ordinary development machines, this exposes them to a variety of configuration and machine load scenarios), and refining the tests until we found the lowest-level part that was failing.
  • We have logic that works with some (intentionally) unreliable middleware that people too often assume is reliable. By having tests that spuriously fail under load (the tests assume more reliability than you can actually achieve on machines that are otherwise loaded), it became a good stick to beat the "best-effort is good enough" crowd internally. Tests periodically failing don't lie: I can say "15% of the time, this test fails, and it's using the API as intended."
I've had some discussions with people about why they don't move to a CI and Test-Infected model, and I've heard a few arguments. I think most of these come down to two factors.

First, starting is hard. If you're on an existing product which you don't actually like the internal code path on, it can be really difficult to start testing. And if your code isn't written to make it easy to test in isolation, all your tests are system-integration tests, which are pretty poor at providing low-level test cases to track the code itself. The problem seems so daunting that you don't even really start at all.

The easiest solution here is the obvious one: start with some system-integration tests, and as you start working on existing parts of the system (either fixing them or replacing them) start writing lower and lower level tests. Use a code coverage tool (like the shill I am, I'm extraordinarily partial to Clover for Java development), see how you're doing, and make sure you're constantly expanding the code paths that you have under testing. And use it interactively: turn tests on and off and compare your coverage to find out how good your low-level tests and system-wide tests really are at testing your system.

Over time, as your test base grows, you end up exactly where you should be: all non-trivial changes to the code result in changes to the test suite to establish that the code is behaving the way you expect, and you're running them constantly to give you positive feedback that your code is working properly.

But this isn't the rant-filled part of the blog. The following is.

The Test-Infected crowd turns good software engineers against testing. Seriously. If you do anything to an extreme, you turn off software engineers who would otherwise listen to you. So just shut the hell up.

[In particular, if you work for a consulting company and are trying to sell development services, please don't bother trying to tell me how to develop software for long-running maintenance. How in the world would you know? By your very nature, you don't do long-running system development or maintenance. Why would I trust you in any way except as a clearing house for ideas you've seen other people doing? Other people who are probably failing and thus hiring you in the first place? Great. You've seen lots of fail.]

First of all, Test-Driven Development is retarded. There's a reason why every example I've ever seen of it is of extremely trivial code: because it's impossible to do well otherwise. When I start working on something that's non-trivial, do I know enough about the internal implementation details or the external contract that it will present to the rest of my system to be able to write a test that I feel has proper coverage of the internal code paths before I start writing it? No. That's part of the crafting of complicated software: you have to start with a concept and then start iteratively refining it until you end up reaching code that works well. If you actually have so much insight into how you're going to implement something that your Test-First code is going to be a reasonable exercise of your code, then your code is so trivial that there's no point writing it in the first place.

Second of all, I have absolutely no patience whatsoever for people who spend their time agonizing over whether something is a Unit, Functional, Integration, System, Performance, Scalability, blah blah blah test. Seriously. The whole thing is complete engineering navel-gazing. Here's a hint: if you have essentially two complete implementations of some logic, one just to make it a "Mock" implementation (but which has to adhere so strongly to an existing contract that you've written it twice), then you've failed and wasted everybody's time and effort doing it. Performance and scalability tests I can see, because you're probably going to run them manually on a periodic basis under a more controlled environment, but spending days agonizing over your mock framework? That's just crazy.

And unfortunately, I get comments like these from other programmers, who see all the Test-Driven Development and "Mock The World" arguments and it turns them right off. They say "if that's what it means to do all this great testing stuff, I want nothing to do with it," and then they produce crappy software. Fail all over.

You want to write some effective tests? Here are some hints:
  • Layer-cake your internal design. Then test up the layer stack. Layer A requires Layer B which requires Layer C? Test all three. Bug only shows up in the tests for Layer A? Then there's a bug in Layer A. It's just that simple.
  • Isolate your major components. Write an interface that you know will be easy to mock (something going to a database? write a quick in-memory representation for it) and then put a façade on the low-level code. (Note that if your layers are done as interfaces, it becomes extraordinarily easy to combine this with layer testing to have "proper" unit testing).
  • Check your coverage. Unless you're running with something like Clover, you're not going to know how well your tests exercise your code paths. So use a code coverage tool to draw a spotlight on the areas that you're not covering particularly well.
  • If you have a big, existing system, write system-wide tests. They're far better than nothing, and they give you a basis for your later refactoring to get to a designed-for-testing scenario.
  • Continuous Integration. Make sure you have a single-step "compile the world" target somewhere (even if it's a shell or batch script), grab Bamboo (just don't run it on Solaris, natch), and start. You'll never go back.
  • Design for Testing. Make sure as you write code that you think "How am I going to test this?" and then write the tests right away (same SVN/Git/P4/CVS submission), changing your logic as required to make it easy to test.
  • Write tests as documentation. Writing some code that someone else is going to use? Create your documentation as a test case. The only documentation. Then if someone says "Hey, you wrote Fibble, how do I use it?" you can point them at the test case, and they can replicate it. If they say "Hey, that's great, but you didn't show me how to use the Fubar functionality," you add a test case that demonstrates that. If your tests start getting so complex that you can't point to any of them as a reasonable real-world demonstration of your code, you need to fix your design. Or add simpler test cases just for demonstration. But once you're doing demos/docs as tests, you can be pretty confident your design "smells" good and if people copy your demo/doc it's always going to work. Because you're testing it constantly.
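A tiny sketch of a couple of those hints together (the interface with a quick in-memory stand-in, the layer stack, and a test that doubles as the documentation). Every name here (AccountStore, Transfers, the account ids and balances) is invented for illustration; it isn't from any real system.

```java
import java.util.HashMap;
import java.util.Map;

public class InMemoryStoreDemo {
    /** Hypothetical interface standing in for something backed by a database. */
    interface AccountStore {
        void save(String id, long balance);
        Long find(String id); // null when the account is unknown
    }

    /** The quick in-memory representation: good enough for tests. */
    static class InMemoryAccountStore implements AccountStore {
        private final Map<String, Long> rows = new HashMap<String, Long>();
        public void save(String id, long balance) { rows.put(id, balance); }
        public Long find(String id) { return rows.get(id); }
    }

    /** The higher layer under test; it only ever sees the interface. */
    static class Transfers {
        private final AccountStore store;
        Transfers(AccountStore store) { this.store = store; }

        boolean transfer(String from, String to, long amount) {
            Long src = store.find(from);
            Long dst = store.find(to);
            if (src == null || dst == null || amount <= 0 || src < amount) {
                return false; // unknown account, bad amount, or insufficient funds
            }
            store.save(from, src - amount);
            store.save(to, dst + amount);
            return true;
        }
    }

    /** Doubles as documentation: this is how Transfers is meant to be used. */
    public static long[] demo() {
        AccountStore store = new InMemoryAccountStore();
        store.save("alice", 100L);
        store.save("bob", 10L);
        Transfers transfers = new Transfers(store);
        transfers.transfer("alice", "bob", 30L);
        return new long[] { store.find("alice"), store.find("bob") };
    }

    /** The failure path gets its own demonstration. */
    public static boolean overdraftRejected() {
        AccountStore store = new InMemoryAccountStore();
        store.save("alice", 5L);
        store.save("bob", 0L);
        return !new Transfers(store).transfer("alice", "bob", 50L);
    }

    public static void main(String[] args) {
        long[] balances = demo();
        System.out.println("alice=" + balances[0] + " bob=" + balances[1]);
        System.out.println("overdraft rejected: " + overdraftRejected());
    }
}
```

If a bug shows up in Transfers but the store's own tests pass, the bug is in Transfers.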
This all becomes second nature over time, but if you ever hear somebody say that you should start by writing your unit test first and pull out your mock-object generator and auto-complete all the methods you haven't even written yet, just poke them in the eye until they go away. They're probably too busy trying to charge your company for a consulting gig to get anything real done anyway.

Tuesday, October 21, 2008

Java5 vs. Java6: ThreadPoolExecutor Difference

If you do thread pooling in Java without calling into* (e.g. you're constructing your own instances of ThreadPoolExecutor) and you're using LinkedBlockingQueue so that you can build up a task list, be aware that ThreadPoolExecutor got completely rewritten in Java 6.

Under Java 5, if you set the core size of the thread pool to 0 threads, the pool won't actually kick off a single thread until the offer method on your BlockingQueue refuses a new element. In the case of LinkedBlockingQueue, that means it won't kick off a thread until you've reached the maximum number of elements in your queue (the number in the constructor that you are obviously setting to something reasonable for your workload), which isn't quite what you would expect.

The workaround is to set your core size to >= 1, or to just use Java 6, which doesn't have this behavior (the execute(Runnable) method was rewritten and has special logic if the pool size is 0 to force it to create a new Thread in that case). In general, though, setting the core size to at least 1 is always a safe thing to do and will work across both Java versions.
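A minimal sketch of that workaround, assuming a hand-built ThreadPoolExecutor over a bounded LinkedBlockingQueue. The maximum size, timeout, and queue capacity are illustrative values, not recommendations, and CorePoolSizeDemo is just a name for the example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class CorePoolSizeDemo {
    /** Runs taskCount trivial tasks and returns how many completed within the timeout. */
    public static int runTasks(int taskCount) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1,                    // core size >= 1: a worker starts on the first execute()
                4,                    // maximum size (illustrative)
                30, TimeUnit.SECONDS, // idle timeout for threads above the core size
                new LinkedBlockingQueue<Runnable>(1000)); // bounded task list (illustrative)

        final AtomicInteger completed = new AtomicInteger();
        final CountDownLatch done = new CountDownLatch(taskCount);
        for (int i = 0; i < taskCount; i++) {
            pool.execute(new Runnable() {
                public void run() {
                    completed.incrementAndGet();
                    done.countDown();
                }
            });
        }
        try {
            // With a core size of 0 on Java 5, these tasks would sit in the queue
            // until it was full; with a core size of 1 they start running at once.
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdown();
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("completed tasks: " + runTasks(10));
    }
}
```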

Yes, this did bite me. And yes, thanks to continuous integration checking JVMs that my coworkers insist on using but which I abandoned years ago, I tracked this down to precisely this problem.

Updated 2009-06-25: Got corrected in a comment that I said SynchronousQueue when I should have said BlockingQueue.

Monday, October 20, 2008

AMQP, Exchanges, and Routing Thread on rabbitmq-discuss

For those of you who have been following my various ramblings on AMQP, specifically on the whole Exchange-Routing semantics (which overflowed into Ben Hood's blog), on Ben and Alexis' request I've started a thread over on rabbitmq-discuss to try to hash things out in a more direct format rather than duelling bloggers. If you think you have something to add, or just want to watch things unfold, I recommend you take a gander over there.

And if nobody takes me up on it, I'll just assume that I've won and the new AMQP spec will have everything I want in it, including a gift certificate to Millie's Cookies. And beer.

Disclaimer: I don't actually use RabbitMQ, my kind hosts just thought it would be as good a place as any for a public MOM-protocol-routing-semantics smackdown.

POSIX Message Queues: Useful, but Limited

Since my previous post on looking for VMS-style Mailbox semantics for more modern operating systems, I've done some research into POSIX Message Queues based on a comment in the entry. I think they're pretty darn cool, but it seems like they might be targeting a different audience than what I think I was looking for.

It also appears that the reason why I've not really seen much (any?) actual code using POSIX or SysV message queues is that support at least for POSIX Message Queues is relatively recent in the Linux kernel timeframe (2.6.6), and thus hasn't really been around for long enough as a system-primitive to find itself worked into much other software.

Disclaimer: It's entirely possible that at this point I'm largely inventing what I'd love from an internal queuing system, and it's quite probable that VMS Mailboxes don't do any of this. From here on, assume this is Kirk's Dream Land rather than "VMS Rulez".

Message Size
The default Linux implementation has the maximum message size set to 8192 bytes. That's not a lot of data to be honest, although it would be more than enough for a bridge between Erlang processes for most data. Unfortunately, this is a kernel option, and so to increase it you need to muck with your runtime kernel parameters (/proc/sys/fs/mqueue/msgsize_max).

Maximum Queue Size
The maximum number of messages you can put into the queue before it starts to block the sender is 10 messages. Again, kernel configurable (/proc/sys/fs/mqueue/msg_max), but 10 messages is pretty darn small.
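If you want to check what your own machine is configured with, a small sketch like the following reads the two kernel parameters named above. It assumes a Linux box with those /proc entries exposed; MqueueLimits, parseLimit, and readLimit are names made up for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class MqueueLimits {
    /** Parses the contents of a /proc-style file holding a single integer. */
    public static long parseLimit(String fileContents) {
        return Long.parseLong(fileContents.trim());
    }

    /** Reads one numeric kernel parameter from the given file. */
    public static long readLimit(Path file) {
        try {
            byte[] bytes = Files.readAllBytes(file);
            return parseLimit(new String(bytes, StandardCharsets.US_ASCII));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = Paths.get("/proc/sys/fs/mqueue");
        if (!Files.isDirectory(dir)) {
            // Not Linux, or the mqueue parameters are not exposed on this system.
            System.out.println(dir + " is not available here");
            return;
        }
        System.out.println("msgsize_max = " + readLimit(dir.resolve("msgsize_max")));
        System.out.println("msg_max     = " + readLimit(dir.resolve("msg_max")));
    }
}
```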

I think these two combine to give a pretty clear indication of the type of message semantics they're going for: very fast, small messages. I could see a pretty clear analogue to most tick data in a financial services environment, or audio packets, or Erlang tuples. But this is really clearly designed for super-latency-critical scenarios, and the 10 message maximum is surprisingly small.

In doing some research into this, I came across this comment on a KernelTrap post, which I think gives some pretty clear insight into what's going on here: this is designed for ultra-fast communications between processes (trying to make sure that everything stays inside the same core and L1 cache if at all possible, I would presume).

This all makes sense once you realize that POSIX message queues aren't persistent at all: they're like a priority-queue channel between currently running processes on the same machine, and they don't appear to survive machine restarts in any way. This is great as a low-level IPC mechanism, but doesn't really seem to apply to more failsafe long-running messaging scenarios.

When the kernel bounces, all messages and queues disappear.

Initial Reaction
After all this, it appears to me that POSIX Message Queues are really intended to be used for IPC at a very low level of interaction within a particular system, rather than for longer processes with more data. I'd love to see them used for an interaction channel through which a disk-based queue system could be built though.

For that reason, I probably wouldn't attempt to bridge them using AMQP: it would break the semantic model of super-low-latency messaging far too much to try to bridge them over anything slower/higher-latency than Infiniband (though that would be an excellent non-MPI way to use IB, particularly if you have RDMA for the message contents). A persistent queue implementation using disk leveraging POSIX Message Queues for control? That would probably be a much better candidate for AMQP bridging to me.

Friday, October 17, 2008

RabbitMQ talk at LShift Last Night

I attended the AMQP+RabbitMQ talk last night at LShift, and it was a pretty good time. Apparently the talk was largely a rehash of the talk that Alexis & company did at Google UK a few weeks ago, and was pretty interesting.

The basic pattern was:
  • Talk about MOM (messaging is good) and AMQP (AMQP is also good)
  • Talk about RabbitMQ (RabbitMQ is good) and Erlang/OTP (they're also good and ideal for implementation of an AMQP broker)
  • Demonstration of some cool stuff you can do tying it all together
To me the part that was new, quite interesting, and well worth the time was the latter half, where they showed how a lot of the features in Erlang are ideally suited to developing an AMQP broker in under 5000 lines of human-written code (a little more is auto-generated from the AMQP spec itself), and some of the neat things that Erlang gives you as a platform for maintenance and monitoring.

Specifically, in case you haven't been looking at it yet, Erlang gives you a natural model to represent a lot of concepts in building large asynchronous systems. Since there's a lot of processing that goes on in message-oriented systems which is by its very nature asynchronous, it's pretty easy to map that onto Erlang in a pretty trivial way, which drastically reduces the amount of code that you need to write yourself.

Since the major concepts you're working with (Exchanges, Queues, Sockets) map directly to low-level Erlang actors^H^H^H^H^H^Hprocesses, Erlang gives you a really easy way to automatically introspect the state of those things, since they're being mapped to processes. This for a Java guy was pretty cool, because essentially you can introspect into the state of a running Erlang system and see in one fell swoop a snapshot of a process which shows you the equivalent of (for Java):
  • The stack trace
  • Values of all local variables across the stack
  • Values of all elements in the heap directly addressable from the bound variables
Doing that even with the new tools available in Java 6 and Terracotta is nowhere near as easy to do, and it was pretty darn powerful to witness.

And then, down t' pub.

One of the most interesting things to me was that quite a few of the people there really weren't part of the converted trying to hear things to make them feel all happy about themselves. Rather, there were a lot of people who aren't currently using MOM systems (or have basically built the core of a MOM themselves, because the basic concepts are so intuitive that if you don't know what's out there, you'll end up writing your own half-baked MOM implementation to do almost any non-trivial asynchronous system) who are eager to look into them now.

That's pretty nice to hear, and I think is one of the major medium-term benefits of AMQP: expansion of the overall MOM space from one dominated by MQSeries which people use because they have to, not because they want to, to one where people actively choose to use MOM because it's seen as a benefit rather than a hurdle. So AMQP really will involve growing the space, rather than cannibalizing it. And that to me is a great thing.

Monday, October 06, 2008

SonicMQ: Constructing New MessagePublishers or not?

Just as a quick follow-on to my previous mini-benchmark, I've now tried the same basic benchmark but publishing each message to its own Topic, and compared creating a new MessageProducer per message against creating an unbound (Session.createProducer(null)) MessageProducer and using that to send all messages (producer.send(destination, message)).

For the lazy and link-click-shy, this is a tiny benchmark of how to publish small messages as fast as possible against SonicMQ 7.5.1 using the 7.5 JMS driver.

There is a difference, but a pretty small one.
  • Creating a single MessageProducer and using producer.send(destination, message) gave me 32352 mps sent.
  • Creating a new MessageProducer for each new message/topic gave me 31847 mps sent.
  • This was definitely reproducible, and I ran each test many times to make sure I got a pretty consistent result.

So there's a minor performance enhancement (roughly 1.6%), but it's pretty small, and judging from network traffic, creating the new MessageProducer isn't actually hitting the wire, so it's entirely a local driver optimization.

SonicMQ New Sessions, Publication, and Performance

I was trying to debug some code today and ended up running a performance benchmark of fast message publishing, and thought it might be useful to the SonicMQ community. Here goes.

You have a system that is publishing small messages to the same topic as fast as it possibly can.

What is the impact of creating new Sessions on the publishing performance?

Please don't consider this a generic SonicMQ benchmark in any form. The code and infrastructure were definitely not set up for a proper benchmark, and I'm sure my employer signed some disclaimer that we wouldn't perform one anyway. This is just an example of how coding practices can affect your performance. Blah Blah Blah don't sue me Blah Blah Blah.

Benchmark Description
  • Create a JMS Connection to your SonicMQ broker.
  • In a tight, single-threaded loop, construct a new BytesMessage, bung 10 bytes of data into it, and send the message.
  • Every N messages (where N is varied), shut down the Session and construct a new Session, Destination, and MessageProducer.

  • SonicMQ broker was running on a 2-socket dual-core (total of 4 cores) Opteron machine running Solaris 10 x86
  • Broker running 7.5.1, 7.5.0 JMS client
  • Client was running Windows XP
  • JDK 1.5.0_13 used on both broker and client
  • 100Mbps available between client and server (broker has 1Gbps available; I'm not allowed to thrash this broker to death during business hours), going through at least 3 switch hops (but no router)
  • All messages sent non-persistent, with no consumers whatsoever
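The shape of the benchmark loop described above can be sketched roughly as follows. This is not the code I actually ran: PublishSession and SessionFactory are hypothetical stand-ins for the JMS Session/Destination/MessageProducer setup, and the no-op factory exists only so that the sketch runs without a broker.

```java
public class SessionChurnBenchmark {
    /** Hypothetical stand-in for a JMS Session plus its MessageProducer. */
    public interface PublishSession {
        void send(byte[] payload);
        void close();
    }

    /** Hypothetical factory; a real run would build the Session, Destination, and MessageProducer here. */
    public interface SessionFactory {
        PublishSession open();
    }

    /** Outcome of one run. */
    public static final class Result {
        public final int sessionsOpened;
        public final double messagesPerSecond;
        Result(int sessionsOpened, double messagesPerSecond) {
            this.sessionsOpened = sessionsOpened;
            this.messagesPerSecond = messagesPerSecond;
        }
    }

    /** Tight single-threaded loop that opens a fresh session every messagesPerSession sends. */
    public static Result run(SessionFactory factory, int totalMessages, int messagesPerSession) {
        int sessionsOpened = 0;
        PublishSession session = null;
        long start = System.nanoTime();
        for (int i = 0; i < totalMessages; i++) {
            if (i % messagesPerSession == 0) {
                if (session != null) {
                    session.close();
                }
                session = factory.open();
                sessionsOpened++;
            }
            session.send(new byte[10]); // a new 10-byte payload for every message
        }
        if (session != null) {
            session.close();
        }
        long elapsedNanos = Math.max(1L, System.nanoTime() - start);
        return new Result(sessionsOpened, totalMessages * 1e9 / elapsedNanos);
    }

    /** No-op factory so the sketch can run without any broker. */
    public static SessionFactory noOpFactory() {
        return new SessionFactory() {
            public PublishSession open() {
                return new PublishSession() {
                    public void send(byte[] payload) { }
                    public void close() { }
                };
            }
        };
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 10, 100, 1000}) {
            Result r = run(noOpFactory(), 100000, n);
            System.out.println("new session every " + n + " messages: "
                    + r.sessionsOpened + " sessions, " + (long) r.messagesPerSecond + " msg/sec");
        }
    }
}
```

With the no-op factory the throughput figures mean nothing; the point is only where the session churn sits relative to the send loop.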


New Session Every # Messages | Msg/sec

More interestingly, if you look at the network utilization graph on the Windows machine acting as the test driver, you can see exactly how the performance ramped up:

There appears to be some definite network latency in setting up the Session, MessageProducer, and Destination, but this disappears beyond the every-1000-messages mark.

Therefore, you can quite clearly conclude that you should avoid creating unnecessary Sessions and just hold them open for a publishing scenario.

Wednesday, October 01, 2008

IPC: Where My Queues At, Yo?

Updated 2008-10-20: Adrian Found Where My Queues At. Unbeknownst to me, they were there (somewhat all along) in POSIX Message Queues, more in the comments. I've changed this to be part of LazyWeb, because I really didn't know that they existed and haven't actually seen anybody using them. M4D props to Adrian.

Updated 2008-10-20 #2: POSIX Message Queues are close, but not quite it. I'll be doing another post on this, but turns out POSIX Message Queues aren't quite what I think mailbox semantics are, and some differences turn out to be key. Also, they've only been around since Linux kernel 2.6.6, which would explain why I've not seen them in more widespread use.

Not that long ago, my firm was interviewing someone from the Czech Republic who had been working in Frankfurt for the Deutsche Boerse, and some interesting stuff came out of that. Not the least interesting thing was that their entire infrastructure was still based on OpenVMS, and it was working extremely well. The secret? Queues. Or, rather, Mailboxes.

What a lot of people know about VMS, aside from the fact that it's Old and Not UNIX, is that it spawned Windows NT (WNT == VMS+1). What they don't realize is that it's actually quite a reasonable operating system that's kinda gotten a bad name because of the WNT thing, and because it's Not UNIX. But I want to address the thing that it's really gotten right, and that's queuing as an intrinsic form of IPC.

As I understood from the interviews, Deutsche Boerse uses this pretty extensively to distribute work: work is pulled from mailboxes, posted to other mailboxes, in a pretty standard modern MOM pattern. The recipients of messages may be on the same machine, or on another machine. They don't know; they don't care. Sounds like what you'd use an MOM system for, right? Only they don't use an MOM system, they use their operating system.

A Mailbox in VMS essentially forms the basis of a fully asynchronous queuing system. Applications can define mailboxes, to which you can publish messages, and the recipient will be notified at some point in the future that a message is available. And because VMS has a lot of clustering facilities, mailboxes can either be on the local machine or on another machine, and because it's all fully guaranteed delivery (within reason, of course), you have full fire-and-forget semantics.
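To make the shape of those semantics concrete, here is a toy, in-process analogy in Java. It is emphatically not the VMS API: the Mailbox class is hypothetical, it lives inside a single JVM, and it offers none of the cross-machine or guaranteed-delivery behavior described above. It only shows the sender posting and moving on while the recipient picks messages up later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class MailboxSketch {
    /** Hypothetical in-process stand-in for a mailbox. */
    public static final class Mailbox {
        private final BlockingQueue<String> messages = new LinkedBlockingQueue<String>();

        /** Fire-and-forget from the sender's point of view: returns immediately. */
        public void post(String message) { messages.add(message); }

        /** Blocks the recipient until a message is available. */
        public String take() throws InterruptedException { return messages.take(); }
    }

    /** Posts every message, lets a recipient thread drain them, and returns what it received. */
    public static List<String> deliver(String... toSend) {
        final Mailbox mailbox = new Mailbox();
        final List<String> received = new CopyOnWriteArrayList<String>();
        final CountDownLatch done = new CountDownLatch(toSend.length);

        Thread recipient = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        received.add(mailbox.take());
                        done.countDown();
                    }
                } catch (InterruptedException e) {
                    // Interrupted by the caller once everything is delivered: just stop.
                }
            }
        });
        recipient.setDaemon(true);
        recipient.start();

        for (String message : toSend) {
            mailbox.post(message); // the sender never waits for the recipient
        }
        try {
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        recipient.interrupt();
        return new ArrayList<String>(received);
    }

    public static void main(String[] args) {
        System.out.println(deliver("order-1", "order-2", "order-3"));
    }
}
```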

Contrast all this with your UNIX forms of IPC: Pipes, Files, Signals. What do the former two have in common? They're either synchronous or near-enough in that they involve polling. Signals may be asynchronous, but they can't really carry a payload: they're just a number. These are enough to form the basis for almost all forms of common IPC, except for fully asynchronous ones. What do you do in a UNIX IPC model if you want to send a message and at some point in the future make sure that something else picks it up? Put it in a spool file? That seems pretty clunky, even today.

Sure, you can expand the UNIX model to involve asynchronous IO, but that's really just an async veneer on top of traditional synchronous IPC. It doesn't change the fact that your basic units of system-system IPC are fundamentally synchronous. This all smells funny.

Given that there are a lot of AMQP people who read my blog apparently, where's the topicality? How about this: why is it necessary that I even have to use something like broker-based MOM when the operating system could realistically do this for me? Why is this all so resolutely in user space when operating systems were doing this decades ago? Why do I need to communicate using sockets (another system-system synchronous IPC system) to another process to do basic message-based IPC?

Why can't I take all the various implementations of this pretty basic concept (fire-and-forget, guaranteed messaging, asynchronous notification, single consumer delivery) and pull them out of their various programming-language (Scala, Erlang) and application-semantic specific models (JMS, WCF), and pull them up a level? Why can't I get this out of my operating system? Why can't we just break out of the UNIX model of decades ago?

Maybe we've accepted that the micro-kernel model of operating systems has really won. And by that, I don't mean the Mach-model of operating systems, but, rather, the model that system-level APIs are pretty much fixed in stone by history, and won't be expanded (yes, they might be expanded in the particulars, but they're not going to be significantly expanded in the general concepts that people are willing to defer to the kernel to do). People do everything they have to in user-space.

But something to me is wanting. I think this really simple form of IPC, which is so insanely beautiful that people are willing to program in actor-based concurrency systems like Erlang just to get it, really should be given a second thought. Maybe not in kernel space, but definitely as a standard that you can assume will be around on any system on which you program.

And the interviewee? A (hopefully) happy employee of my firm. We hire quality when we can find it.