Tuesday, March 24, 2009

Best Effort Delivery Fails At The Worst Time

In an earlier post on this blog, I discussed the two fundamental types of MOM infrastructure: broker-based and distributed. One thing in particular I want to address is the best-effort delivery model inherent in distributed MOM infrastructure.

Background on Best Effort Systems

As a reminder, distributed systems operate as a series of broker instances, each located in the same process, VM, or OS as the code participating in the message network. For example, it might be a library bound into your process space (like 29West), or a daemon hosted on the same machine (like Rendezvous). These distributed components then communicate over an unreliable communications channel (such as Ethernet broadcast or IP multicast). [1]

Conventionally, publishers in these systems have no knowledge of the actual consumers, and thus no access control is possible (how can you enforce AuthN or AuthZ if you're just bit-blasting onto a wire?). They simply fire-and-forget in the most literal sense, since they're sending over a network protocol that has no concept of acknowledgment. So they issue bits onto the wire and hope they reach any recipients. Management of these systems involves setting up your networking infrastructure to handle IP multicast properly, or making sure you have sufficient Ethernet subnet-to-subnet bridges (rvrds if you're working with Rendezvous) so that publishers and consumers can communicate.
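
To make the fire-and-forget point concrete, here's a minimal sketch of what a publisher in one of these systems is conceptually doing at the lowest level: handing a datagram to a multicast group and moving on. This is plain Java UDP multicast rather than the actual Rendezvous or 29West API, and the group address and port are made up for illustration.

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;

    public class FireAndForgetPublisher {
        public static void main(String[] args) throws Exception {
            // Hypothetical multicast group and port, for illustration only.
            InetAddress group = InetAddress.getByName("239.1.2.3");
            int port = 45000;

            byte[] payload = "IBM BID=100.25 ASK=100.27".getBytes("UTF-8");
            DatagramPacket packet = new DatagramPacket(payload, payload.length, group, port);

            DatagramSocket socket = new DatagramSocket();
            // send() returns as soon as the datagram is handed to the OS. There is
            // no acknowledgment, no retry, and no way to know whether anyone
            // received it -- that's the entirety of the delivery guarantee.
            socket.send(packet);
            socket.close();
        }
    }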

The only guarantee you have is one of "best-effort." What this means is that publishers and consumers will do everything reasonable to ensure that messages get from publisher to consumer, without sacrificing speed or adding unnecessary latency into the network. The moment you have a slow consumer, a network glitch, or a publisher going down, that's it: messages get lost or thrown away, and you don't know about it (most of the time; some systems have an optional "you just lost messages" signal you can receive, but in the worst case your client can't tell that it's lost anything at all). That's because in these scenarios, anything you could do to strengthen the delivery semantics would cost you performance. If you don't need it, why pay for it?

This is very different from guaranteed-delivery semantics, where the middleware is responsible for throttling and caching such that messages aren't lost unless something really drastic happens; simply throwing away a message because an internal queue was starting to fill up isn't something you expect from those systems.

And what you get out of this is a lot of raw speed. No TCP stacks getting in the way, no sliding windows or ACK messages to deal with, no central routing brokers. Fast, low-latency messages. Sure, they're unreliable, but do some interesting things with the protocol (sequence numbering plus retransmission requests, for example), and consumers that miss messages can even request them again from the publisher (so best-effort can be pretty-darn-good-indeed effort).
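
For a sense of what "pretty-darn-good-indeed effort" looks like under the hood, here's a rough sketch of the gap-detection idea: the publisher stamps every message with a sequence number, and a consumer that sees a jump in the sequence asks for the missing range to be resent. The Message and RetransmissionRequester interfaces here are hypothetical stand-ins, not any vendor's actual API; real products bury this bookkeeping inside their wire protocols.

    // Hypothetical types for illustration only.
    interface Message {
        long sequenceNumber();
    }

    interface RetransmissionRequester {
        // Ask the publisher to resend the inclusive sequence range [from, to].
        void requestResend(long from, long to);
    }

    class GapDetectingConsumer {
        private final RetransmissionRequester requester;
        private long nextExpected = 1;

        GapDetectingConsumer(RetransmissionRequester requester) {
            this.requester = requester;
        }

        void onMessage(Message msg) {
            long seq = msg.sequenceNumber();
            if (seq > nextExpected) {
                // Gap detected: we never saw [nextExpected, seq - 1].
                requester.requestResend(nextExpected, seq - 1);
            }
            nextExpected = Math.max(nextExpected, seq + 1);
            process(msg);
            // A real implementation would also buffer out-of-order arrivals and
            // de-duplicate retransmissions; this only shows the gap detection.
        }

        private void process(Message msg) {
            // Application-level handling goes here.
        }
    }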

This is seductive, and it's particularly seductive to developers. Setting up a simple Rendezvous (or other low-latency, best-effort delivery system) network is something you can do without your IT staff's involvement. There's no discussion of ports and broker hardware and networking topologies, no involvement with your company's MOM Ops Team, or anything else. Just start banging messages around, and look! It all works! And that's when things get problematic.

Best Effort = Development Win, Production Fail

The trouble is that best-effort systems seem much better than "best effort" in development. In development, you can run for days without dropping a single message, no matter what size it is. In testing, you can run many test iterations without ever seeing a message drop [2]. More importantly, you really can't force a message drop, at least not without implementing a lossy Decorator on top of the MOM interfaces. And that's a problem.
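
That lossy Decorator is worth building, incidentally, if you're stuck testing against one of these systems. Here's a minimal sketch, assuming a hypothetical MessageSender interface standing in for whatever MOM abstraction you actually publish through:

    import java.util.Random;

    // Hypothetical publishing interface standing in for your MOM abstraction.
    interface MessageSender {
        void send(byte[] payload);
    }

    // Decorator that silently drops a configurable fraction of messages, so the
    // "message went missing" code paths actually get exercised in testing.
    class LossyMessageSender implements MessageSender {
        private final MessageSender delegate;
        private final double dropProbability;
        private final Random random = new Random();

        LossyMessageSender(MessageSender delegate, double dropProbability) {
            this.delegate = delegate;
            this.dropProbability = dropProbability;
        }

        public void send(byte[] payload) {
            if (random.nextDouble() < dropProbability) {
                return; // dropped on the floor, just like production will do for you
            }
            delegate.send(payload);
        }
    }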

Developers easily self-justify both of these situations (the uncovered code and the untestable failure mode):

  • Yeah, so 25% of that message-handling code isn't covered by my functional tests. Who cares? Like RuntimeException handling, it's going to be so rare that there's no point forcing the test. If it were going to be more common, the functional and integration tests would already expose it.
  • Why would I spend 40% of my time working on 25% of the code that I can't get to execute naturally? That's ridiculous. I need to get back to making my Traders happy by adding more functionality to the system and sticking more data into Excel. Get back to me when the Big Bad Error Case actually happens.
And why wouldn't you accept those scenarios? If you don't, you're stealing from the company through bad prioritization.

The problem is that development infrastructure in no way represents the production user's infrastructure, particularly in a Financial Services context. And the causes of message drop tend to appear only under serious load, which usually means significant market activity, which usually means it's the absolute worst time to drop a message and have to engage in some form of retry behavior that's thoroughly untested in anger.

Consider these differences:

  • In development, you're the only app running that is using the specified MOM system (because like most developers, you work on one system at a time). In production, your users probably have three different apps all stressing the underlying networking infrastructure for the same MOM system.
  • In development, your networking system is probably under minimal load; aside from accessing your SCM system and streaming MP3s to your colleagues and reading tech blogs and tweeting [3], you're not really taxing it; your production users are pushing their networking hardware as hard as they can (mostly loading and saving 100MB Excel files to network stores, but that still eats bandwidth).
  • In development, you typically have single use cases: either big messages or small ones, but not mixed; your production users have them mixed all around.
  • Production users and systems have a lot more running on them (if they're traders, they'll have Bloomberg and Reuters and 8 Excel processes and 2 trading systems and your apps all on the same machine). Under peak load, everything suffers simultaneously as resources that are otherwise plentiful become quite scarce.
  • In development, if your machine crashes, you have a lot more to worry about than whether that test case ran successfully; in production, machines crash and die all the time and the only thing your users care about is getting themselves up and running as fast as possible. [4]
  • For intra-data-center communications, your development machines are used for coding and compiling and testing; your production machines are churning through messages 24x7 and usually sit behind a much noisier set of switches. That has effects that you can't evaluate, because you're not allowed to play on the production hardware.
As a result, expect at least 500% more packets to be dropped in production than in development. [5] But if you've been properly prioritizing your development, you've never tested this situation. So you're hosed.

Effective Use of Best Effort Systems

So how do you effectively use distributed/low-latency/best-effort MOM systems?
  • Always make sure your application has a non-best-effort delivery mechanism. Most of the problems with best-effort systems only bite you if you insist that your application should have exactly one MOM system, even though only parts of it are actually suited to a distributed MOM infrastructure. If you are ever using a best-effort MOM system, start with a guaranteed-delivery MOM system and downgrade certain traffic accordingly. Never start in the other direction.
  • Don't ever use a best-effort system for a human-interacting environment (though the back-end processors behind a human front-end system are fine). Seriously. You don't need it (humans can't detect latency below 100ms anyway), and you're just going to thrash their PC. Don't do it.
  • Make sure messages are as idempotent as possible. While tick delivery systems based on deltas are extremely useful, try to make each tick interpretable on its own, on the assumption that individual messages will be lost. For example (a minimal sketch follows this list):
    • Base all delta ticks on deltas from start of day, not from last tick (because you might lose the previous tick)
    • When shipping only those fields which have changed, also ship the most important other fields (so if shipping just a change to the Ask Size, also ship the basic Bid/Ask/Trade sextuplet of data)[6]
  • Every time you publish or subscribe using a best-effort system, ask yourself "What happens if this message disappears completely?" If the answer is not "whatever," don't use that system and/or upgrade your semantics.
  • Any time you are using these systems, have a dedicated network completely shut off from other networking uses. If you have multiple major use cases (think: equities and fixed income, or tick delivery and analytics bundles) on the same machine, use completely different hardware for those as well. Combining traffic on the same network interface is the biggest cause of message loss that I've seen. If your networking team can't/won't do this, get a new networking team.
  • Have those network interfaces (one per message context) hooked up to the same physical switching fabric wherever possible to minimize the amount of potential loss in inter-switch communications.
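
To make the delta-tick guideline concrete, here's a minimal sketch of a tick message designed to tolerate the loss of the ticks before it: the change is expressed relative to the start-of-day value, and the core quote fields ride along on every tick even when only one of them changed. The field names are hypothetical, not from any real feed.

    // Hypothetical tick message that a consumer can interpret on its own, no
    // matter how many of the preceding ticks were dropped.
    class QuoteTick {
        final String symbol;
        // Delta from the start-of-day price, NOT from the previous tick, so one
        // lost message doesn't corrupt every price that follows it.
        final double priceDeltaFromOpen;
        // The basic Bid/Ask/Trade sextuplet, repeated even if only one field changed.
        final double bidPrice;
        final long bidSize;
        final double askPrice;
        final long askSize;
        final double tradePrice;
        final long tradeSize;

        QuoteTick(String symbol, double priceDeltaFromOpen,
                  double bidPrice, long bidSize,
                  double askPrice, long askSize,
                  double tradePrice, long tradeSize) {
            this.symbol = symbol;
            this.priceDeltaFromOpen = priceDeltaFromOpen;
            this.bidPrice = bidPrice;
            this.bidSize = bidSize;
            this.askPrice = askPrice;
            this.askSize = askSize;
            this.tradePrice = tradePrice;
            this.tradeSize = tradeSize;
        }

        // The current price can always be rebuilt from the opening price plus
        // this tick alone.
        double currentPrice(double openingPrice) {
            return openingPrice + priceDeltaFromOpen;
        }
    }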

These guidelines essentially amount to the following:

If you think you want to use a distributed MOM system, you probably don't. If you know you do, only use it for server-to-server communications, and make your physical layout match your logical layout as much as possible.

Conclusion

This essay will probably seem pretty down on best-effort systems; that's intentional. I think these systems still have their place, but for most systems their place is back in history, when switches added unnecessary latency, broker-based MOM systems added yet more latency on top, and hardware-based TCP offload engines (TCPoEs) didn't exist. If putting a broker between publisher and consumer cost you 500ms of latency, you'd be stupid to do so if you could avoid it. But that's not the case anymore. Modern brokers can measure their additional latency in microseconds while offering far better guarantees than broadcast/multicast-based best-effort delivery; if you can do that well with a broker in the middle, why sacrifice application semantics to remove it?

That's why I think the future in this space lies with systems like 29West and not with systems like Rendezvous: 29West is adding interesting functionality that you actually want, and better (think: non-publisher-host-based) resilience, while still keeping a non-broker-based infrastructure. And while I still think 29West is a fringe player, Rendezvous is just stalled as a product, and it's not going anywhere. [7]

So the next time you're sitting there coding against Rendezvous and thinking "you know, let's face it, this best-effort thing is a sham; all my messages show up!" just remember: the moment you assume that, it's going to bite you in the ass. And it'll probably happen on the last day of the quarter when volatility is up 25% and your head trader is having a down day. And then you'll wish you had sacrificed those 10us of latency for better-than-best-effort delivery.

Footnotes:

[1]: In JMS terms, imagine an embedded JMS driver which communicates without a broker, just through IP broadcast and multicast with every other JMS driver on the network.
[2]: Even worse, because many distributed tests will launch all components on one machine, you never exercise the network effects that distributed MOM systems entail. This makes sense from a development efficiency perspective, but is particularly problematic in a distributed MOM situation, because the chances of dropping a message fall to pretty much 0 in such a scenario.
[3]: Assuming you don't work for Big Bank B, where all such forms of non-authorized-communications are Strictly Prohibited.
[4]: This is a particular worry if your Best Effort to Better Than Best Effort upgrade is based on disk or other publisher-side mechanisms: what happens when the publisher's machine goes down completely? Or what happens if it's mostly down, but the fastest way for your ops team to resolve the situation is to swap it out before it fully fails?
[5]: No, I'm not going to specify this more fully, because it's so prone to variation based on the runtime environment. Just trying to say that if you drop one message a day in dev, assume at least 5 per endpoint per day. Or more.
[6]: Having periodic snapshot ticks (so adding the sextuplet to the tick that happens every 5 seconds, or manually pumping a snapshot every 5 seconds under low activity) is one alternative here, but what happens if you lose the snapshot tick?
[7]: Except, bizarrely, into the realm of broker-based MOM by virtue of the Solace-powered Tibco Messaging Appliance.