Here at OpenGamma we make considerable use of MongoDB, in particular as a persistent store for data which we either don't want to spend the time on normalizing to an RDBMS Schema, or where we positively value the schema-free design approach taken by MongoDB.
We also make extensive use of the Fudge Messaging project for virtually all of our distributed object management needs (whether using Fudge Proto files, or manually doing the Object/Document encoding ourselves). Luckily, the two work extremely well together.
Because of the way that we defined the Fudge encoding specification, and designed the major Fudge Reference Implementation classes and interfaces, it's extremely easy to map between the worlds of Fudge, JSON, and XML (in fact, Fudge already supports streaming translations to/from XML and JSON). We've actually had support for converting between Fudge objects and the `BasicDBObject` that the MongoDB Java driver uses since Fudge 0.1, and we use it extensively in OpenGamma: anywhere you have a Fudge object, you can seamlessly persist it into a MongoDB database as a document, and load it back directly into Fudge format later on.
So with that in mind, I decided to try some performance tests on some different approaches that you can take to go from a Fudge object to a MongoDB persisted document.
Benchmark Setup
The first dimension of testing is the type of document being persisted. I had two target documents:
- Small Document: This document, intended to represent something like a log file entry, consists of 3 primitive field entries, as well as a single list of 5 integers.
- Large Document: This document, intended to represent a larger concept more appropriate to the OpenGamma system, consists of several hundred fields in a number of sub-documents (sub-`DBObject` in MongoDB, sub-`FudgeFieldContainer` in Fudge), across a number of different types, as well as some large byte array fields.
I considered three different approaches to doing the conversion between the two types of objects:
- MongoDB Native: In this case I just created `BasicDBObject` instances directly and avoided Fudge entirely as a baseline.
- Fudge Converted: Created a Fudge message, and then converted to `BasicDBObject` using the built-in Fudge translation system.
- Fudge Wrapped: This one wasn't built in to Fudge yet (and won't be until I can clean it up and test it properly). I kept a Fudge data structure, and just wrapped it in an implementation of the `DBObject` interface, which delegated all calls to the appropriate call on `FudgeFieldContainer`.
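To make the difference between the converted and wrapped approaches concrete, here is a minimal sketch. It is not the Fudge or MongoDB driver code: `SimpleDocument` is a hypothetical, cut-down stand-in for `DBObject`, and a plain `Map` stands in for the Fudge message (which ignores the fact that a real Fudge message can repeat field names).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical, cut-down document interface; the real DBObject has more methods. */
interface SimpleDocument {
    Object get(String key);
    Set<String> keySet();
}

final class ConversionStyles {
    /** "Converted": copy every field into a fresh map-backed document up front. */
    static SimpleDocument convert(Map<String, Object> source) {
        final Map<String, Object> copy = new LinkedHashMap<>(source);
        return new SimpleDocument() {
            public Object get(String key) { return copy.get(key); }
            public Set<String> keySet() { return copy.keySet(); }
        };
    }

    /** "Wrapped": keep the source structure and forward each call to it. */
    static SimpleDocument wrap(final Map<String, Object> source) {
        return new SimpleDocument() {
            public Object get(String key) { return source.get(key); }
            public Set<String> keySet() { return source.keySet(); }
        };
    }
}
```

The converted form pays for a full copy at construction time, while the wrapped form allocates only the wrapper and pays on each delegated call instead. One visible consequence in this sketch is that a later change to the source shows up through the wrapper but not through the converted copy.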
Additional parameters of interest:
- Used a remote MongoDB server running on Fedora 11 (installed from Yum, `mongo-stable-server-20100512-mongodb_1.fc11.x86_64` RPM) running on a VM with reasonably fast underlying disk.
- Local MongoDB server was 1.4.3 x86_64 running on Fedora 13 on a Core i7 with 8GB of RAM and all storage on an Intel SSD
- MongoDB Java Driver 1.4 (pulled from Github)
- JVM was Sun JDK 1.6.0_20 on Fedora 13 x86_64
Benchmark Results
Test Description | MongoDB Native | Fudge Converted | Fudge Wrapped |
---|---|---|---|
Creation of 1,000,000 Small MongoDB `DBObject`s | 539ms | 1,603ms | 839ms |
Persistence of 1,000,000 Small MongoDB `DBObject`s | 41,188ms | 46,201ms | 92,866ms |
Creation of 100,000 Large MongoDB `DBObject`s | 15,351ms | 23,956ms | 15,785ms |
Persistence of 100,000 Large MongoDB `DBObject`s (remote DB) | 57,207ms | 60,511ms | 56,236ms |
Persistence of 100,000 Large MongoDB `DBObject`s (local DB) | 66,557ms | 74,763ms | 58,816ms |
Results Explanation
The first thing to point out is that for the small `DBObject` case, the particular way in which MongoDB encodes data for transmission on the wire matters a lot. In particular, there's one decision that the driver has made that changes everything: it does a whole lot of random lookups.
A `BasicDBObject` extends from a `LinkedHashMap`, and so doing `object.get(fieldName)` is a very fast operation. However, because Fudge is a multi-map, we don't actually do that in Fudge, and by default we store fields as a list of fields (JSON stores lists as a, well, list; Fudge stores them as repeated values with the same field name). Because this makes point lookups slow, we intentionally do whole-message operations as often as we can, and just iterate over all the fields in the message.
The MongoDB driver code does the same thing, but instead of doing a `for(Entry entry : entrySet())` style of operation, it iterates over the keys and does a separate get operation for each key. In Fudge, this is potentially a linear search through the whole message.
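The cost of that access pattern can be sketched with a toy list-backed message. `FieldListMessage` below is a hypothetical stand-in written for this illustration, not a Fudge class; it simply counts how many field-name comparisons the key-then-get walk performs.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stand-in for a list-backed message; not the Fudge API. */
final class FieldListMessage {
    /** One named field; names are allowed to repeat. */
    static final class Field {
        final String name;
        final Object value;
        Field(String name, Object value) { this.name = name; this.value = value; }
    }

    private final List<Field> fields = new ArrayList<>();
    long comparisons = 0; // name comparisons performed by get()

    void add(String name, Object value) { fields.add(new Field(name, value)); }

    /** Distinct field names in insertion order. */
    Set<String> keySet() {
        Set<String> keys = new LinkedHashSet<>();
        for (Field f : fields) keys.add(f.name);
        return keys;
    }

    /** Point lookup: linear scan that returns the first match, or null. */
    Object get(String name) {
        for (Field f : fields) {
            comparisons++;
            if (f.name.equals(name)) return f.value;
        }
        return null;
    }

    /** Builds a message with distinct names, walks it key by key, and reports the comparisons. */
    static long comparisonsForKeyThenGet(int fieldCount) {
        FieldListMessage msg = new FieldListMessage();
        for (int i = 0; i < fieldCount; i++) msg.add("f" + i, i);
        for (String key : msg.keySet()) msg.get(key);
        return msg.comparisons;
    }
}
```

With n distinct fields the key-then-get walk makes n(n+1)/2 comparisons in this sketch, so a 300-field message costs 45,150 comparisons where a single pass over the field list would visit 300 fields.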
To work around this, in my wrapper object I built up a map where there was only a single value per field. This works well, but the small document case has 1/6 of the fields be a list, making this test thrash in CPU on doing the document conversion (which explains why the small document persistence test is roughly twice as slow with the wrapper as with just rebuilding the objects: 92,866ms against 46,201ms). Yes, I could take this optimization further, but it would be difficult to improve on the combined setup (document construction) and runtime (persistence) performance of just building up a `BasicDBObject`, which is what the Fudge conversion does anyway.
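The single-value-per-field map can be sketched as follows. This is an illustration under assumptions rather than the wrapper code itself: `FieldCollapser`, its flat `names`/`values` input, and the `Repeated` marker type are all hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;

final class FieldCollapser {
    /** Marker list type, so a collapsed run is never confused with a value that is itself a list. */
    static final class Repeated extends ArrayList<Object> {
        private static final long serialVersionUID = 1L;
    }

    /**
     * Collapses a flat field sequence (names[i] pairs with values[i]) into a map
     * with one entry per name. A name seen once keeps its value; a name seen
     * several times gets a Repeated list holding its values in order.
     */
    static Map<String, Object> collapse(String[] names, Object[] values) {
        if (names.length != values.length) {
            throw new IllegalArgumentException("names and values must have the same length");
        }
        Map<String, Object> out = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            Object existing = out.get(names[i]);
            if (!out.containsKey(names[i])) {
                out.put(names[i], values[i]);          // first occurrence
            } else if (existing instanceof Repeated) {
                ((Repeated) existing).add(values[i]);  // third or later occurrence
            } else {
                Repeated run = new Repeated();         // second occurrence: start a list
                run.add(existing);
                run.add(values[i]);
                out.put(names[i], run);
            }
        }
        return out;
    }
}
```

After this pass, point lookups are hash lookups rather than linear scans, but every repeated field costs an extra list allocation and copy, which is the kind of per-document work that can dominate when the documents are tiny.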
The wrapped Fudge object wins every persistence run for the large document test, no matter how many times I run them (and I've done it quite a few times for both local and remote, with all outliers eliminated). Moreover, I actually get faster performance running on a remote DB than on a local DB (which surprised me quite a bit).
The only things that I can conclude from this are:
- `FudgeMsg` limits the data size on insertion into the message (when you do a `msg.add()` operation, not on serialization) for small integral values (if you put in a `long` but it's actually the number 6, Fudge will convert that to a `byte`). However, the `ByteEncoder` which converts values in MongoDB to the wire representation will never do this optimization, and will actually upscale small values to at least a 32-bit boundary. This means that if you put data into a `FudgeMsg` first and then put it into the MongoDB wire encoding, you shrink the size of the message. Given the number of pseudo-random `short`, `int` and `long` values in this message, it's a clean win.
- The object churn for the non-wrapped form (where we construct instances of `BasicDBObject` from a `FudgeFieldContainer`) causes CPU effects that the wrapped form doesn't suffer from.
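The size effect in the first point can be sketched with a little arithmetic. The helper below is an assumption-laden illustration, not the Fudge or driver logic: `IntegerWidths` and its method names are hypothetical, and the rules are only the two described above (narrow to the smallest signed type that fits, then round up to at least 32 bits on the wire).

```java
final class IntegerWidths {
    /** Bytes needed if a value is narrowed to the smallest signed integer type that holds it. */
    static int narrowedBytes(long v) {
        if (v >= Byte.MIN_VALUE && v <= Byte.MAX_VALUE) return 1;
        if (v >= Short.MIN_VALUE && v <= Short.MAX_VALUE) return 2;
        if (v >= Integer.MIN_VALUE && v <= Integer.MAX_VALUE) return 4;
        return 8;
    }

    /** Wire bytes when the declared width is kept but rounded up to at least 32 bits. */
    static int wireBytesDirect(int declaredBytes) {
        return Math.max(4, declaredBytes);
    }

    /** Wire bytes when the value is narrowed first and then rounded up to at least 32 bits. */
    static int wireBytesAfterNarrowing(long v) {
        return Math.max(4, narrowedBytes(v));
    }
}
```

Under these assumptions, a `long` field holding the number 6 takes 8 bytes when encoded directly but 4 bytes when it has been narrowed first, while a `short` or `int` field takes 4 bytes either way. The saving therefore comes from `long` fields whose values happen to fit in 32 bits.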
Conclusion
One of the things that was really pleasant for me in running this test is just how nice it is to take a document model that's designed for efficient binary encoding (Fudge), and persist it extremely quickly into a database that's designed for web-style data (MongoDB). The sum total of the actual persistence code is about 10 lines; I spent far more lines of code building the messages/documents themselves.
The wrapped object form definitely wins in a number of cases. My current code isn't production-quality by any means, but I think it's a useful thing to add to the Fudge arsenal. That being said, I think the real win is to rethink the way in which we get data into MongoDB in the first place.
Given the way the MongoDB Java driver iterates over fields, it seems to me that a far better solution is to cut out the `DBObject` system entirely, and write a Fudge persister that speaks the native MongoDB wire protocol directly and takes advantage of the direct streaming capabilities of the Fudge protocol. When we've done that, we should be going just about as fast and efficiently as we can, and Fudge will have a seamless combination of rich code-level tools, an efficient wire-level protocol for binary serialization, great codecs for working with text encodings like JSON and XML, and a fantastic document/message database facility using MongoDB.