Tuesday, March 02, 2010

Whence Fudge? Why Not Just Use/Extend Avro/GPB/Thrift?

Or, In Which I Make The Case For Fudge Existing At All

tl;dr: Fudge is binary for speed and self-describing so you can do interesting things when you don't have schema support at the time of processing.

After I announced the Fudge Messaging 0.2 release, I realized that we had a back-link from @evan on Twitter with quite a pithy characterization of the situation. I've followed up with Evan quite a bit on Twitter, and I think this warrants a more detailed description of why Fudge is different from other encoding systems.

What Do You Transmit On The Wire?

Let's start with a very simple question: if you're trying to transmit data between machines, whether temporally concurrent or not, how do you encode that data? Broadly speaking, you have a number of options:
Hand-Rolled Binary Protocol
Yep, you could directly emit the individual bytes in the appropriate byte order and hand-roll the encoders and decoders, and it'll probably be pretty fast. It'd be a complete and utter waste of development resources, but fast it would undoubtedly be.
Compact Schema-Based Representation
The next option is that you use a system based on a schema definition, and use that schema definition and the underlying encoding system to automatically do your encoding/decoding. Examples of this type of system are Google Protocol Buffers, Thrift, and Avro. This is going to be fast, and an efficient use of development resources, but the messages are useless without access to the schema.
Object Serialization
Every major Object-Oriented language has some built-in serialization system for object graphs, and it's usually dead easy to use. However, they all suck in a number of ways, and don't work at all if you don't have matching code objects on both sides of the communications channel.
Text-Based Representation
You could just say "I don't care about performance at all" and just use a text-based encoding like XML, JSON, or YAML. The messages are at least marginally human readable, and you can do stuff with them without any schema (which may or may not be used at all), but it's going to be slow, slow, slow.
Compact Schema-Free Representation
The final option is to say "I like having my data in a compact binary representation, but I don't want to have to have access to any type of schema in order to work with messages." The most well known example of this is TibrvMsg from Tibco Rendezvous, but it's not a very good implementation, and it's proprietary. This is what we created Fudge to solve.

Hey, Isn't Avro Schema-Free?

No, no it's not. The Avro encoding system requires access to a schema in order to be able to decode messages at all, because all metadata is lost during encoding. However, it has two characteristics which are similar to schema-free encoding systems:
  • It decodes into a schema-free representation. What I mean by this is that if you did have the schema for a message, you can decode it into an arbitrary data structure which isn't tightly bound to the schema (think generic structure vs. schema-generated code).
  • The communications protocols involve out-of-band schema transmission. When you start communicating using RPC Avro semantics, you communicate the schema of the messages that you're going to be transmitting using a JSON-encoded schema representation. But once you've done that handshake, the actual messages are schema-bound.

Thus you can view "Avro the communications protocol" as schema-free, as you don't have to have a compile-time-bound schema to be able to communicate effectively, but "Avro the encoding system" as schema-bound.

Why Is Schema-Free Encoding Useful?

If you're looking at temporally concurrent point-to-point communications (think: RPC over a socket), a schema-bound system is pretty useful. You can squeeze every single unnecessary byte out of the communications traffic, and that's pretty darn useful from an efficiency perspective. But not all communications are temporally concurrent and point-to-point.

Let's consider the other extreme: temporally separate publish/subscribe communications (think: market data ticks, log messages, entity update messages). In that case, you have to make sure that every single point where you want to process the message has access to the schema used at the point of serialization; if you don't have it, all you have is an opaque byte array. Keeping everything in sync over time is a logistical nightmare, particularly when you start logging and replaying messages for debugging and auditing.

Moreover, there's a whole host of things that you can do in between producer and consumer if you have a schema-free encoding:

  • You can visually inspect the contents of messages. Not in a text editor unless you're using a text-based encoding system, but in some type of general purpose tool. This is an unbelievably useful development, debugging, and system support tool.
  • You can do content-based routing on them. It's quite easy using XML, for example, to create XPath rules that route messages through the system ("send messages where /Company/Ticker = AAPL to Node 5; send other messages to Node 7"). You don't have to compile each little rule into native code, you can do it based purely on configuration. Heck, you could even build a custom routing system for ActiveMQ or RabbitMQ and put the functionality right in your message oriented middleware brokers.
  • You can automatically convert them to other representations. Have data in XML but need it in JSON? No problem; you can auto-convert it. Have data in Fudge but need it in XML? No problem; you can auto-convert it.
  • You can store and query them efficiently. You can take any arbitrary document and put it into a semi-structured data store (think: XML Database, MongoDB), and then query it in an efficient way ("find me all documents where /System/Hostname is node-5 and /Log/Type is AUDIT or SECURITY").
  • You can transform the content using configuration. You can take an arbitrary message, and just using configuration (so no more custom code) change the contents to take into consideration renames, schema changes, add stuff, remove stuff, do whatever you want.
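To make the content-based routing point concrete, here's a minimal sketch in Python. It assumes a schema-free message has already been decoded into a generic nested structure (a plain dict here); the rule paths, field names, and node names are hypothetical, taken from the XPath example above, and rules are pure configuration rather than compiled code:

```python
# Hypothetical routing rules: a path into the generic decoded message,
# an expected value, and a destination. No code generation required --
# the rules are plain data, exactly because the message is schema-free.
RULES = [
    (("Company", "Ticker"), "AAPL", "Node 5"),
]
DEFAULT = "Node 7"

def route(message: dict) -> str:
    """Walk each rule's path through the generic message structure
    and return the first matching destination."""
    for path, expected, destination in RULES:
        node = message
        for key in path:
            node = node.get(key, {}) if isinstance(node, dict) else {}
        if node == expected:
            return destination
    return DEFAULT

print(route({"Company": {"Ticker": "AAPL"}}))  # -> Node 5
print(route({"Company": {"Ticker": "MSFT"}}))  # -> Node 7
```

The same generic-structure trick underlies the conversion, querying, and transformation bullets: once the message decodes without a schema, every one of those operations is just a function over a generic data structure.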

The SOA guys have known about these advantages for years, which is why they use XML for everything. The primary flaw there is that XML is, quite frankly, rubbish in every other way.

That's Why We Created Fudge

Most of the network traffic in the OpenGamma code isn't actually point-to-point RPC; it's distributed one-to-(one or more), and we use message oriented middleware extensively for communications. We needed a compact representation where given an arbitrary block of data, we could "do stuff" with it, no matter when it was produced.

None of the compact schema-free systems support this type of operation. If someone can point me to an Open Source system which covers these types of use cases, and it's better than ours, we'll gladly merge projects. But Avro ain't it. Avro's good for what it does, but what it does isn't what we need.

But Your Messages Must Be Huge

The Fudge Encoding Specification was very carefully designed to scale both down and up in terms of functionality. At the most verbose, for each field in a message/stream (of course Fudge supports streaming operations; if it didn't, do you think we could efficiently do all those auto-transformations?), the Fudge overhead consists of:
1-byte Field Prefix
This has the processing directives to say what else is in the stream for that field.
1-byte Type ID
We have to know at the very minimum the type of data being transmitted for that field.
The Field Name
The UTF-8 encoded name ("Bid", "Ask", "System", "UserName", etc.)
A 2-byte Field Ordinal
A numeric code for the field (not the index into the message; just a numeric ID, in the same way the name is a textual ID)

Here's the thing though: only the field prefix and type ID are required. Although right now Fudge-Proto doesn't do so, you could easily use Fudge in a form where you rely on the ordering of fields in the message to determine the contents (which is exactly what Avro does). In that case, you pay a 2-byte penalty per field. Yes, if you're transmitting a lot of very small data fields that's a relatively high penalty, but if you're transmitting 8-byte numbers and text, it's really not a whole heck of a lot.
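The overhead trade-off above can be sketched in a few lines of Python. This is not the real Fudge wire format: the flag-bit values are illustrative placeholders, not the normative spec values, and the real field prefix also carries width information. It just shows how the per-field cost shrinks from prefix + type + ordinal + name down to the 2-byte prefix + type minimum:

```python
import struct

# Illustrative flag bits for the field prefix byte. The real Fudge
# encoding specification defines its own bit layout; these are
# placeholders for demonstration only.
HAS_NAME = 0x08
HAS_ORDINAL = 0x10

def encode_field(type_id, value, name=None, ordinal=None):
    """Encode one field: 1-byte prefix, 1-byte type ID, then an
    optional 2-byte ordinal and an optional length-prefixed UTF-8 name."""
    prefix = 0
    if name is not None:
        prefix |= HAS_NAME
    if ordinal is not None:
        prefix |= HAS_ORDINAL
    out = bytes([prefix, type_id])
    if ordinal is not None:
        out += struct.pack(">H", ordinal)        # 2-byte big-endian ordinal
    if name is not None:
        encoded = name.encode("utf-8")
        out += bytes([len(encoded)]) + encoded   # 1-byte length + UTF-8 name
    return out + value

payload = struct.pack(">d", 101.25)  # an 8-byte double, e.g. a "Bid" price

# Most verbose form: prefix + type + ordinal + name, then the payload.
verbose = encode_field(6, payload, name="Bid", ordinal=1)
# Most terse form: only the prefix + type overhead the text describes.
terse = encode_field(6, payload)

print(len(verbose) - len(payload), len(terse) - len(payload))  # 8 2
```

For an 8-byte double, the terse form costs 2 bytes of overhead against 8 bytes of data, which is the "not a whole heck of a lot" the paragraph refers to.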

You want names so that you can have something like Map&lt;String, Object&gt; in your code, but you don't want to transmit them on the wire? That's what we invented Fudge Taxonomies for: a 2-byte ordinal gives you full name data at runtime, without each message having to include the name.
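The taxonomy idea can be sketched as a shared ordinal-to-name mapping applied at decode time. The field names and ordinal values here are hypothetical, not drawn from any real taxonomy:

```python
# Hypothetical taxonomy: a shared ordinal -> name mapping agreed out of
# band, so each message carries only the 2-byte ordinal on the wire
# while readers recover the full name at runtime.
TAXONOMY = {1: "Bid", 2: "Ask", 3: "Timestamp"}

def resolve(fields):
    """Replace ordinals with names using the taxonomy. Unknown
    ordinals are kept as-is so the message still round-trips."""
    return {TAXONOMY.get(ordinal, ordinal): value
            for ordinal, value in fields}

print(resolve([(1, 101.25), (2, 101.27), (99, "unknown")]))
# -> {'Bid': 101.25, 'Ask': 101.27, 99: 'unknown'}
```

Note that the fallback for unknown ordinals matters: a reader with an older taxonomy still gets a usable generic structure rather than a decode failure.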

So in one encoding specification, we can support very terse metadata as well as very verbose metadata.

Conclusion

So ultimately, we created Fudge to be binary (for speed and efficiency), self-describing (for schema-free operation even when a schema exists), and flexible (so you can use only the bits you need).

Yes, we could have started with an Avro or a GPB or a Thrift, but there's no way that you could have done what we've done without breaking existing implementations. And I think if you started with an Avro and worked down all the use cases we've run into in the past, you'd end up with Fudge in the end anyway; we just skipped a few steps.

The binary schema-bound systems are great and they define things like RPC semantics which Fudge doesn't. But hopefully it's clear that us creating Fudge wasn't just a case of NIH, it's that we had very specific use cases that they can't support naturally.
