What Do You Transmit On The Wire?Let's start with a very simple question: if you're trying to transmit data in between machines, whether temporally concurrent or not, how do you encode said data? Broadly speaking, you have a number of options:
- Hand-Rolled Binary Protocol
- Yep, you could directly emit the individual bytes in the appropriate byte order and hand-roll the encoders and decoders, and it'll probably be pretty fast. It'd be a complete and utter waste of development resources, but fast it would undoubtedly be.
- Compact Schema-Based Representation
- The next option is that you use a system based on a schema definition, and use that schema definition and the underlying encoding system to automatically do your encoding/decoding. Examples of this type of system are Google Protocol Buffers, Thrift, and Avro. This is going to be fast, and an efficient use of development resources, but the messages are useless without access to the schema.
- Object Serialization
- Every major Object-Oriented language has some built-in serialization system for object graphs, and it's usually dead easy to use. However, they all suck in a number of ways, and don't work at all if you don't have matching code objects on both sides of the communications channel.
- Text-Based Representation
- You could just say "I don't care about performance at all" and just use a text-based encoding like XML, JSON, or YAML. The messages are at least marginally human readable, and you can do stuff with them without any schema (which may or may not be used at all), but it's going to be slow, slow, slow.
- Compact Schema-Free Representation
- The final option is to say "I like having my data in a compact binary representation, but I don't want to have to have access to any type of schema in order to work with messages." The most well known example of this is
TibrvMsgfrom Tibco Rendezvous, but it's not a very good implementation, and it's proprietary. This is what we created Fudge to solve.
Hey, Isn't Avro Schema-Free?No, no it's not. The Avro encoding system requires access to a schema in order to be able to decode messages at all, because all metadata is lost during encoding. However, it has two characteristics which are similar to schema-free encoding systems:
- It decodes into a schema-free representation. What I mean by this is that if you did have the schema for a message, you can decode it into an arbitrary data structure which isn't tightly bound to the schema (think generic structure vs. schema-generated code).
- The communications protocols involve out-of-band schema transmission. When you start communicating using RPC Avro semantics, you communicate the schema of the messages that you're going to be transmitting using a JSON-encoded schema representation. But once you've done that handshake, the actual messages are schema-bound.
Why Is Schema-Free Encoding Useful?If you're looking at temporally concurrent point-to-point communications (think: RPC over a socket), a schema-bound system is pretty useful. You can squeeze every single unnecessary byte out of the communications traffic, and that's pretty darn useful from an efficiency perspective. But not all communications are temporally concurrent and point-to-point. Let's consider the other extreme: temporally separate publish/subscribe communications (think: market data ticks, log messages, entity update messages). In that case, you have to make sure that every single point where you want to process the message has access to the schema used at the point of message serialization; if you don't have it, all you have is an opaque byte array. That's a logistical nightmare to keep everything in sync over time, particularly when you start doing logging and replay of messages for debugging and auditing. Moreover, there's a whole host of things that you can do in between producer and consumer if you have a schema-free encoding:
- You can visually inspect the contents of messages. Not in a text editor unless you're using a text-based encoding system, but in some type of general purpose tool. This is an unbelievably useful development, debugging, and system support tool.
- You can do content-based routing on them. It's quite easy using XML, for example, to create XPath rules that route messages through the system ("send messages where
/Company/Ticker= AAPL to Node 5; send other messages to Node 7"). You don't have to compile each little rule into native code, you can do it based purely on configuration. Heck, you could even build a custom routing system for ActiveMQ or RabbitMQ and put the functionality right in your message oriented middleware brokers.
- You can automatically convert them to other representations. Have data in XML but need it in JSON? No problem; you can auto-convert it. Have data in Fudge but need it in XML? No problem; you can auto-convert it.
- You can store and query them efficiently. You can take any arbitrary document and put it into a semi-structured data store (think: XML Database, MongoDB), and then query it in an efficient way ("find me all documents where
- You can transform the content using configuration. You can take an arbitrary message, and just using configuration (so no more custom code) change the contents to take into consideration renames, schema changes, add stuff, remove stuff, do whatever you want.
That's Why We Created FudgeMost of the network traffic in the OpenGamma code isn't actually point-to-point RPC; it's distributed one-to-(one or more), and we use message oriented middleware extensively for communications. We needed a compact representation where given an arbitrary block of data, we could "do stuff" with it, no matter when it was produced. None of the compact schema-free systems support this type of operation. If someone can point me to an Open Source system which covers these types of use cases, and it's better than ours, we'll gladly merge projects. But Avro ain't it. Avro's good for what it does, but what it does isn't what we need.
But Your Messages Must Be HugeThe Fudge Encoding Specification was very carefully designed to scale both down and up in terms of functionality. At the most verbose, for each field in a message/stream (of course Fudge supports streaming operations; if it didn't, do you think we could efficiently do all those auto-transformations?), the Fudge overhead consists of:
- 1-byte Field Prefix
- This has the processing directives to say what else is in the stream for that field.
- 1-byte Type ID
- We have to know at the very minimum the type of data being transmitted for that field.
- The Field Name
- The UTF-8 encoded name ("Bid", "Ask", "System", "UserName", etc.)
- A 2-byte Field Ordinal
- A numeric code for the field (nope, not the index into the message, just a numeric ID like the name is a textual ID)
Map<String, Object>in your code, but you don't want to transmit those on the wire? That's what we invented Fudge Taxonomies for. 2-byte ordinal gives you full name data at runtime, without each message having to include the name. So in one encoding specification, we can support very terse metadata as well as very verbose metadata.