History Repeats: Twitter Is MOM

The use of annotations is extremely familiar to anybody with a background in traditional messaging technologies. In general, publishing a message in a traditional environment requires:
- A destination. In traditional pub-sub, this is a topic name; in Twitter parlance it's the publisher's Twitter handle.
- Message metadata. In traditional pub-sub, this is a set of properties (typesafe name/value pairs attached to a message); in Twitter parlance it's your annotations (full definition still forthcoming).
- Message content. In traditional pub-sub, this is in general a byte array (though specs like JMS allow for code-level specifications that ultimately resolve to a byte array); in Twitter parlance this is your 140-character tweet.
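The three-part mapping above can be sketched as a small data structure. This is only an illustration: the `Message` class, its field names, and the `tweet_as_message` helper are hypothetical, and since the definition of annotations is still forthcoming, representing them as a plain name/value dictionary is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    """A generic pub-sub message, using the three parts listed above."""
    destination: str                                # topic name, or a Twitter handle
    properties: dict = field(default_factory=dict)  # metadata, or annotations (assumed shape)
    content: bytes = b""                            # payload, ultimately a byte array


def tweet_as_message(handle, text, annotations=None):
    """View a tweet as a traditional message (hypothetical helper)."""
    return Message(
        destination=handle,
        properties=dict(annotations or {}),
        content=text.encode("utf-8"),
    )


msg = tweet_as_message("@example", "hello", {"lang": "en"})
print(msg.destination, msg.properties, msg.content)
```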
An Historical Diversion: BinHex64

Let's consider a problem that impacted technology professionals back before most Ruby programmers were alive: how do you transmit binary data over the internet? You had two options:
- Use a binary protocol, written from scratch or using something like RPC. This worked, but required endpoints that understood the protocol.
- Transmit data over a text-based channel, like email or Usenet. This was compatible with the widest range of intermediate systems, but had serious interoperability issues.
Enter The Reverse BinHex

At first glance it might not seem that way, but Twitter presents the BinHex problem in reverse. With BinHex encoding you're trying to fit 8-bit bytes into a 7-bit world; with Twitter tweets you're trying to fit 8-bit bytes into a "character" world. The only question that matters is "what is a character?" Twitter is quite clear: a Twitter character is a Unicode code point. If it weren't so, Twitter wouldn't be able to handle localized tweets as well as it does.

Right, now we're cooking with gas. A Unicode code point is, in the broadest brushstrokes possible, drawn from one of two distinct sets of planes:
- The Basic Multilingual Plane, consisting of code points in the numerical region from
- The Astral Planes, currently allowing code points in the numerical region from
That leaves two possibilities:
- Twitter only handles the Basic Multilingual Plane. If that's the case, in general, one Twitter character can carry two 8-bit bytes.
- Twitter handles the full theoretical range of Unicode, including the Astral Planes. If that's the case, one Twitter character can carry two and a half 8-bit bytes (ignoring the Supplementary Private Use B plane, to make things simpler).
At 140 characters per tweet, the maximum binary payload is therefore:
- 280 bytes if Twitter only supports the Basic Multilingual Plane; or
- 350 bytes if Twitter supports the vast majority of the Astral Planes.
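The arithmetic behind those two figures can be checked with a short sketch. The constant and function names here are hypothetical, and like the text it makes a simplifying assumption: every code point in the range is treated as usable, which glosses over the surrogate range inside the Basic Multilingual Plane. It says nothing about which code points Twitter actually accepts.

```python
TWEET_CHARS = 140            # characters (code points) per tweet
CODE_POINTS_PER_PLANE = 0x10000

# Case 1: plane 0 only (the Basic Multilingual Plane): 2**16 code points.
BMP_CODE_POINTS = CODE_POINTS_PER_PLANE

# Case 2: planes 0-15, skipping Supplementary Private Use B (plane 16),
# which leaves exactly 2**20 code points.
WIDE_CODE_POINTS = 16 * CODE_POINTS_PER_PLANE


def payload_bytes(code_points, chars=TWEET_CHARS):
    """Whole bytes that fit in `chars` characters drawn from `code_points` values.

    Assumes `code_points` is a power of two, so bit_length() - 1 is exactly
    the number of bits one character can carry.
    """
    bits_per_char = code_points.bit_length() - 1
    return chars * bits_per_char // 8


print(payload_bytes(BMP_CODE_POINTS))   # 16 bits per character
print(payload_bytes(WIDE_CODE_POINTS))  # 20 bits per character
```

Sixteen bits per character is two bytes, and twenty bits is two and a half, which is where the 280 and 350 byte figures come from.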