In fact, when I got more immersed in the REST culture, I realized that many of the nuances worth mastering in order to provide a great RESTful interface (caching, timeouts, advanced status codes) exist specifically to limit the number of repolls against the origin service, by improving the ability of upstream caching proxies to handle requests.
But no matter what, there are clearly latency issues in any polling system, particularly one using fixed content-caching timeouts to optimize performance (because that puts the latency entirely under the control of the publishing system that generated the result of the initial GET request). How do we minimize those?
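To make that concrete, here's a back-of-the-envelope sketch of the worst case (the function and numbers are illustrative, not from any spec):

```python
# Worst-case staleness of a polled, cached resource: an update published
# just after a cache refresh can go unseen for a full cache TTL, plus up
# to one full client poll interval (network time ignored here).

def worst_case_staleness(cache_ttl_s, poll_interval_s):
    """Upper bound on how stale a polling client's view can be, in seconds."""
    return cache_ttl_s + poll_interval_s

# e.g. a Cache-Control max-age of 60 s with clients polling every 30 s:
print(worst_case_staleness(60, 30))  # -> 90
```

Note that the publisher's cache TTL dominates: a client can poll as aggressively as it likes and still be served the stale cached copy.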
Pull RSS Is No Substitute
RSS/Atom (note that I'm going to use RSS as a shorthand for the whole RSS/RDF/Atom spectrum of pull-based syndication lists) seems like a reasonable way to minimize the actual content that needs to be polled by batching updates: instead of having clients constantly polling every URL of interest, you batch them up into groups (perhaps one for each resource domain) and provide a single feed for each group. The RSS feed then indicates the state of all the resources in its domain that might have changed in the last period of interest. So now we've reduced hundreds of GETs down to one, but we're still polling.
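As a toy illustration of that batching (the feed contents are invented for the example), one feed fetch yields the short list of resources that actually need re-fetching:

```python
import xml.etree.ElementTree as ET

# A tiny, made-up RSS feed covering three resources in one document.
feed_xml = """<rss version="2.0"><channel>
  <item><link>http://example.com/res/1</link></item>
  <item><link>http://example.com/res/2</link></item>
  <item><link>http://example.com/res/3</link></item>
</channel></rss>"""

# One GET on the feed replaces a GET per resource: only the links listed
# here need to be re-fetched by the client.
updated = [item.findtext("link")
           for item in ET.fromstring(feed_xml).iter("item")]
print(updated)
```

One fetch, many resources covered; but the client still has to come back and ask again, which is the whole problem.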
My old colleague Julian Hyde posted some stuff recently which was precisely along the lines of what we're talking about here (post #1, post #2), in that he's building a live querying system against streams, and finding that the polling nature of RSS is particularly noxious for the type of system they're building. I'm surprised it took him so long. RSS is actually a pretty terrible solution to this problem: it combines polling with extra data (because an RSS feed is generic, it publishes more data than any particular consumer needs) in an inefficient form, and still allows the RSS publisher to set a caching timeout. Wow. Talk about a latency effect!
It gets worse. Let's assume that you're trying to do correlations across feeds. Depending on the cache timeout settings of each individual feed, you might have a latency of 1 second on one feed and 15 minutes on another. How in the world do you do anywhere near real-time correlations on those two feeds? Answer: you can't. At best you can say "hey, dear user, 15 minutes ago something interesting happened." 15 minutes is a pretty short cache timeout for an RSS feed, to be honest, but an eternity to many applications.
The only real advantage that RSS/Atom provides is that because it's based on HTTP, and we have all these skills built up as a community to scale HTTP-based queries (temporary redirects, CDNs, caching reverse proxies), we can scale out millions of RSS queries pretty well. Problem? All those techniques increase latency.
Anyone with any experience with asynchronous message-oriented middleware can see that fundamentally you have a one-to-many pub/sub problem here, and that smells like MOM to me. More specifically, you have a problem where ideally you want a single publisher to be able to target a massive number of subscribers, where each subscriber gets only the updates it wants, without any polling. Wow. That looks like exactly what MOM was designed for: you have a broker which collects new messages, maintains subscription lists indicating which consumers should get which messages, and then routes messages to their consumers.
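The broker pattern just described can be sketched in a few lines (an in-memory toy, not a real MOM broker, with made-up topic names):

```python
from collections import defaultdict

class Broker:
    """Toy in-memory MOM broker: subscription lists per topic, push on publish."""
    def __init__(self):
        self.subs = defaultdict(list)  # topic -> list of subscriber callbacks

    def subscribe(self, topic, callback):
        self.subs[topic].append(callback)

    def publish(self, topic, message):
        # Pushed immediately to interested subscribers only:
        # no polling, no cache timeout in the path.
        for callback in self.subs[topic]:
            callback(message)

broker = Broker()
seen = []
broker.subscribe("orders.updated", seen.append)
broker.publish("orders.updated", "order 42 changed")
broker.publish("inventory.updated", "not seen by this subscriber")
print(seen)  # -> ['order 42 changed']
```

The key property is that latency is now bounded by broker routing time, not by anybody's cache TTL.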
It's so clear that some type of messaging interface is the solution here that people are starting to graft XMPP onto the problem, using it to publish updates to HTTP-provided data. Which would be nice, except that XMPP is a terrible protocol for asynchronous, low-latency MOM infrastructure. (If nothing else, the XML side really will hurt you here: GZipping will shrink XML quite a bit, but at a computational and latency cost; a simple "URL changed at timestamp X" is going to be smaller in binary form than GZipped XML, and will pack/unpack much faster.)
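To put rough numbers on the size claim (the message layout here is invented purely for illustration):

```python
import gzip
import struct

url = "http://example.com/resource/42"
ts = 1700000000  # "URL changed at timestamp X"

# XML form of the notification, and its gzipped form.
xml = f"<update><url>{url}</url><timestamp>{ts}</timestamp></update>".encode()
gz = gzip.compress(xml)

# A naive binary form: 4-byte big-endian timestamp followed by the URL bytes.
binary = struct.pack("!I", ts) + url.encode()

# The binary message is smaller than both the raw and the gzipped XML,
# and needs no compression step on either end.
print(len(xml), len(gz), len(binary))
```

For tiny messages like this, gzip's fixed header and trailer overhead alone can exceed the entire binary encoding, before you even count the CPU time spent compressing.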
But we've already got another solution brewing here: AMQP, which simultaneously provides:
- Binary protocol (for performance)
- Open protocol (for compatibility)
- Open Source implementations
- Built by MOM experts
Combining REST with AMQP
At my previous employer, I actually built this exact system. In short, each resource also had a domain-specific MOM topic to which the REST HTTP system would publish an asynchronous message whenever the resource was updated. Clients would hit the HTTP endpoint to initialize their state, and immediately set up a subscription to updates on the separate MOM system.
Without the MOM component, it's just pure REST. Once you add it, you eliminate resource and RSS polling entirely. The MOM brokers are responsible for making sure that clients get the updates they care about, and the only update latency is whatever the broker introduces. Worked brilliantly: all the benefits of a RESTful architecture, with none of the update latency effects.
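A minimal sketch of that hybrid (class, resource, and topic names are all invented; a real deployment would use an HTTP server and a MOM broker rather than in-process calls):

```python
# Pattern: 1) GET the resource once to initialise local state;
#          2) subscribe to that resource's update topic so later changes
#             are pushed rather than polled.

from collections import defaultdict

class ResourceService:
    """Stands in for the REST endpoint plus its MOM publish hook."""
    def __init__(self):
        self.state = {}
        self.subs = defaultdict(list)

    def get(self, resource):                 # the REST read
        return self.state.get(resource)

    def put(self, resource, value):          # the REST write...
        self.state[resource] = value
        for cb in self.subs[f"updates.{resource}"]:
            cb(value)                        # ...also publishes an update

    def subscribe(self, resource, cb):
        self.subs[f"updates.{resource}"].append(cb)

svc = ResourceService()
svc.put("widget/7", "v1")

local = svc.get("widget/7")                  # initial state via the GET
updates = []
svc.subscribe("widget/7", updates.append)    # then live updates via MOM

svc.put("widget/7", "v2")                    # some other client writes
print(local, updates)  # -> v1 ['v2']
```

The GET gives a consistent starting point; every change after that arrives as a pushed message, so the client never polls.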
We had it easy though: this was all purely internal. We didn't have to operate at Internet Scale, and we didn't have to interoperate with anybody else (no AMQP for us, this was all SonicMQ as the MOM broker).
In order to get this to work as a general principle, we as a community need to do the following:
- Finish AMQP 1.0 (I've been told the specs are nearly ready and then just comes the implementation and compatibility jams);
- Decide on a standard pattern for REST updates (topic naming, content encoding, AMQP broker identification in HTTP response headers);
- Develop easy to use publisher and subscriber libraries (so that services and client packages can easily include this functionality across all the major modern languages);
- Develop AMQP brokers and services that can operate at Internet Scale.
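As a sketch of what the second point might look like, here is one hypothetical header convention for advertising the update stream alongside a REST response — none of these header names are standardized today:

```python
# Hypothetical convention: the HTTP response for a resource carries headers
# telling the client where the corresponding AMQP update stream lives.
headers = {
    "Content-Type": "application/json",
    # Made-up header names -- no such standard exists yet:
    "X-AMQP-Broker": "amqp://broker.example.com:5672",
    "X-AMQP-Exchange": "resource.updates",
    "X-AMQP-Routing-Key": "example.com.widgets.7",
}

def update_subscription(headers):
    """Extract where to subscribe for updates to the resource just fetched."""
    return (headers["X-AMQP-Broker"],
            headers["X-AMQP-Exchange"],
            headers["X-AMQP-Routing-Key"])

print(update_subscription(headers))
```

Whatever the eventual convention looks like, the point is that the HTTP response itself should be enough for a generic client to find and join the update stream, the same way `Location` headers make redirects self-describing.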