Sunday, April 18, 2010

Twitter For Messaging: Encoding Binary In Unicode

This week at Chirp, Twitter announced Annotations, which is the Twitter-specific way of saying "you can assign arbitrary metadata to individual Twitter updates." After a very quick Twitter (how self-referential) conversation with Alexis from Rabbit Technologies, we agreed that with Annotations, Twitter is essentially trying to build a much more general-purpose pub-sub messaging technology. I wanted to talk about that.

History Repeats: Twitter Is MOM

The use of annotations is extremely familiar to anybody with a background in traditional messaging technologies. In general, publishing a message in a traditional environment requires:
  • A destination. In traditional pub-sub, this is a topic name; in Twitter parlance it's the publisher's twitter handle.
  • Message metadata. In traditional pub-sub, this is a set of properties (typesafe name/value pairs attached to a message); in Twitter parlance it's your annotations (full definition still forthcoming).
  • Message content. In traditional pub-sub, this is in general a byte array (though specs like JMS allow for code-level specifications that ultimately resolve to a byte array); in Twitter parlance this is your 140-character tweet.
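To make the correspondence concrete, here's a minimal sketch of the mapping (the type and field names are invented for illustration; this is not any real Twitter or JMS API):

```python
# A minimal sketch of the traditional-message/tweet mapping; all names are
# hypothetical.
from dataclasses import dataclass

@dataclass
class Message:
    destination: str   # topic name, or the publisher's Twitter handle
    properties: dict   # typesafe name/value pairs, or Twitter annotations
    body: bytes        # byte-array payload, or the 140-character tweet text

msg = Message(
    destination="orders.filled",     # or "@some_publisher"
    properties={"region": "emea"},   # or an annotation key/value
    body=b"\x01\x02\x03",
)
```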

Now it looks like we've got some pretty good equivalencies; every major headline element is covered by Twitter. So how do you adapt to an All Twitter world?

I would say the starting point is the message content. I live in a world of machine-to-machine communication (people are messy). Byte arrays don't obviously map onto character data.

Or do they?

An Historical Diversion: BinHex And Base64

Let's consider a problem that plagued technology professionals back before most Ruby programmers were alive: how do you transmit binary data over the Internet?

You had two options:

  • Use a binary protocol, written from scratch or using something like RPC. This worked, but required endpoints that understood the protocol.
  • Transmit the data over a text-based protocol, like email or Usenet. This gave the greatest compatibility with intermediate systems, but had serious interoperability issues.

The primary interoperability problem with transmitting binary data over text protocols was a pretty simple one: most Internet protocols from the Dawn of Time were written by ignorant Americans and thus only supported 7-bit ASCII. Binary data is inherently 8-bit: you're trying to transmit a byte array, and each byte has 8 bits. How do you fit a square peg (8-bit binary) into a round hole (7-bit ASCII)?

The solution is an encoding like BinHex. The basic idea is that you represent a high-fidelity source dataset (8-bit binary) in a low-fidelity target encoding (7-bit ASCII) by increasing the size of the encoded message to fit the target encoding.

This seems stupid and archaic, but it's still how binary attachments go into email: MIME's Base64 encoding is exactly the same trick.
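You can watch the trick in action with Base64, using nothing but Python's standard library:

```python
# Base64 regroups 8-bit bytes into 6-bit chunks and maps each chunk onto a
# 7-bit-safe ASCII character, growing the payload by roughly a third.
import base64

payload = bytes(range(16))           # 16 arbitrary binary bytes
encoded = base64.b64encode(payload)  # 24 ASCII characters

print(len(payload), len(encoded))    # 16 24
print(encoded)                       # b'AAECAwQFBgcICQoLDA0ODw=='
```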

Enter The Reverse BinHex

At first glance it might not seem that way, but Twitter presents the BinHex problem in reverse. With BinHex encoding you're trying to fit 8-bit bytes into a 7-bit world; with tweets you're trying to fit 8-bit bytes into a "character" world. The only question that's germane is: what is a character?

Twitter is quite clear: A Twitter character is a Unicode code point. If it weren't so, Twitter wouldn't be able to handle localized tweets as well as it does.
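You can check the distinction between code points and bytes in any Unicode-aware language; a quick Python illustration:

```python
# Twitter counts code points, not bytes: a localized tweet "costs" the same
# per character as ASCII, even though its UTF-8 byte length is much larger.
tweet = "こんにちは"                  # 5 code points
print(len(tweet))                    # 5
print(len(tweet.encode("utf-8")))    # 15 bytes in UTF-8
```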

Right, now we're cooking with gas.

A Unicode code point is, in the broadest brushstrokes possible, drawn from one of two distinct sets of planes:

  • The Basic Multilingual Plane, consisting of code points in the numerical region 0x0000-0xFFFF.
  • The Astral Planes, currently allowing code points in the numerical region 0x10000-0x10FFFF.

According to its character count page, Twitter uses UTF-8 as its internal representation. Let's consider two distinct possibilities:

  • Twitter only handles the Basic Multilingual Plane. In that case, one Twitter character covers 16 bits, so in general it can carry two 8-bit bytes.
  • Twitter handles the full theoretic range of Unicode, including the Astral Planes. Ignoring the Supplementary Private Use Area-B plane (plane 16) to make things simpler, planes 0 through 15 give exactly 16 × 65,536 = 2^20 code points, so one Twitter character can carry 20 bits: two and a half 8-bit bytes.
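Here's a back-of-the-envelope sketch of that 20-bit packing (a hypothetical scheme, cheerfully ignoring surrogates, control characters, and everything else a production encoding would have to dodge):

```python
# Hypothetical "reverse BinHex": stuff 20 bits of payload into each Unicode
# code point, drawing from planes 0-15 (2**20 values per character).
def pack_20bit(data: bytes) -> str:
    bits = int.from_bytes(data, "big")
    nchars = (len(data) * 8 + 19) // 20
    return "".join(chr((bits >> shift) & 0xFFFFF)
                   for shift in range((nchars - 1) * 20, -1, -20))

def unpack_20bit(s: str, nbytes: int) -> bytes:
    bits = 0
    for ch in s:
        bits = (bits << 20) | ord(ch)
    return bits.to_bytes(nbytes, "big")

payload = bytes(range(256)) + bytes(94)   # 350 bytes of test data
tweet = pack_20bit(payload)
print(len(tweet))                         # 140 code points: one full tweet
assert unpack_20bit(tweet, len(payload)) == payload
```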

If we ignore all complicating factors, what can we store in a 140-character tweet if we're trying to encode machine-readable byte arrays?

  • 280 bytes if Twitter only supports the Basic Multilingual Plane; or
  • 350 bytes if Twitter supports the vast majority of the Astral Planes.

These don't seem like a lot, but if you allocate a few bytes (Twitter-encoded, of course) for a message sequence number and chaining, and you use a compact binary representation like Avro or FudgeMsg, you can get a lot of data into 280/350 bytes.
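A sketch of what that chaining might look like (the three-byte header layout is invented for illustration, reusing the hypothetical pack_20bit helper from above):

```python
# Hypothetical chaining: prefix each chunk with a 1-byte message id, a 1-byte
# sequence number, and a 1-byte total-chunk count, then pack into code points.
def chunk_message(msg_id: int, payload: bytes, capacity: int = 350) -> list:
    body_size = capacity - 3          # reserve 3 header bytes per chunk
    chunks = [payload[i:i + body_size]
              for i in range(0, len(payload), body_size)]
    return [bytes([msg_id, seq, len(chunks)]) + body
            for seq, body in enumerate(chunks)]

tweets = [pack_20bit(chunk) for chunk in chunk_message(7, bytes(1000))]
print(len(tweets))                    # 3 tweets carry a 1,000-byte payload
```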

So if you use this theoretical reverse-BinHex encoding system to pack byte arrays into Twitter messages (once full Annotations support is released), you get arbitrary metadata for routing decisions, plus a 280- or 350-byte binary payload. That's clearly enough for a lot of uses.

Twitter As The New Machine-to-Machine Cloud Service

Don't be daft. This is entirely a thought experiment about how you could encode Real Data into a Tweet. If you attempt to hook multiple machine processes up through Twitter as a data distribution mechanism, you are a moron.

If you're interested in that type of functionality, you should talk to Rabbit or another Cloud Messaging Provider (I'm sure there will be competition forthcoming). Cloud Messaging makes sense; using Twitter as your Cloud Messaging Provider is completely stupid.

Seriously. There are certainly use cases where you can see machines, people, and other machines communicating over Twitter. But if you're going to the point of converting binary data into arbitrary Unicode code points for transmission over Twitter, you completely fail at asynchronous communication, and should be required to spend at least 6 months doing nothing but implementing sections from EIP (Enterprise Integration Patterns) as penance.