Perl Serializer Benchmark: JSON vs Sereal vs Data::MessagePack vs CBOR

I’m working on optimizing our computations, and I need a fast real-time serializer for storing data in Redis at high speed.

So I have tried several serializers, namely JSON, Sereal and Data::MessagePack, and I would like to share the results.

First of all, a few words about the different serializers.

JSON (JavaScript Object Notation) is available in many languages and natively supported by several servers, like MongoDB and Redis.

Data::MessagePack has the same advantage: the MessagePack format is supported by a lot of systems and servers.

Having server-side support can be excellent for speeding things up: you send a hash of data to the server’s storage, then ask the server to manipulate the data itself, for example incrementing an element in the hash.
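
With Redis, for instance, that kind of in-place manipulation is a single command. A minimal sketch (the key and field names are made up):

    use Redis;

    my $redis = Redis->new;

    # store a field of a hash directly in Redis ...
    $redis->hset('user:42', visits => 10);

    # ... then ask the server to manipulate it in place:
    # increment one element without round-tripping the whole hash
    my $visits = $redis->hincrby('user:42', 'visits', 1);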

Sereal is also really great, but it supports only a few languages and has no NoSQL server support yet.

CBOR is a new one I discovered thanks to your comments, guys. It stands for Concise Binary Object Representation and is specified in RFC 7049: http://tools.ietf.org/search/rfc7049. The encoding/decoding speed and the size are really nice, but there are not yet many implementations of this serializer. It is also good to know that this serializer only works on 64-bit systems.

So let’s start with the benchmark.

Here is the code:
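
A minimal sketch of this kind of benchmark script (the small and large structures here are made-up placeholders, not the exact data behind the numbers below):

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);
    use JSON::XS ();
    use Data::MessagePack;
    use Sereal::Encoder qw(encode_sereal);
    use Sereal::Decoder qw(decode_sereal);
    use CBOR::XS ();

    # placeholder data: the real small/large structures differ
    my %structs = (
        small => { name => 'celogeek', id => 42, tags => [qw(perl redis)] },
        large => { map { ( "key$_" => [ 1 .. 10 ] ) } 1 .. 200 },
    );

    my $json = JSON::XS->new;
    my $mp   = Data::MessagePack->new;
    my $senc = Sereal::Encoder->new;
    my $sdec = Sereal::Decoder->new;
    my $cbor = CBOR::XS->new;

    for my $label (qw(small large)) {
        my $data = $structs{$label};
        print "== $label ==\n";

        # serialized sizes
        printf "size: json=%d msgpack=%d sereal=%d cbor=%d\n",
            length $json->encode($data), length $mp->pack($data),
            length $senc->encode($data), length $cbor->encode($data);

        # encode speed (op/s)
        cmpthese(-2, {
            json      => sub { $json->encode($data) },
            msgpack   => sub { $mp->pack($data) },
            sereal    => sub { encode_sereal($data) },
            sereal_oo => sub { $senc->encode($data) },
            cbor      => sub { $cbor->encode($data) },
        });

        # decode speed (op/s)
        my %blob = (
            json    => $json->encode($data),
            msgpack => $mp->pack($data),
            sereal  => $senc->encode($data),
            cbor    => $cbor->encode($data),
        );
        cmpthese(-2, {
            json      => sub { $json->decode($blob{json}) },
            msgpack   => sub { $mp->unpack($blob{msgpack}) },
            sereal    => sub { decode_sereal($blob{sereal}) },
            sereal_oo => sub { $sdec->decode($blob{sereal}) },
            cbor      => sub { $cbor->decode($blob{cbor}) },
        });
    }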

And here are the results (average of 3 runs) for the small structure:

Serializer         Size (bytes)  Encode (op/s)  Decode (op/s)
CBOR                         30        2669101        1146879
JSON                         42         724345        1058658
Data::MessagePack            30        1559156         688127
Sereal                       38         491519         826427
Sereal OO                    38        1181633        1251141

And for the large structure (average of 3 runs):

Serializer         Size (bytes)  Encode (op/s)  Decode (op/s)
CBOR                       2757          61439          35839
JSON                       3941          43761          21503
Data::MessagePack          2653          60074          26879
Sereal                     2902          58708          33083
Sereal OO                  2902          66166          33083

Conclusion:

First, if you want to use Sereal, use the OO interface. It is really, really fast!

Sereal is really good at decoding, and with the proper options it is as good as Data::MessagePack at encoding. So if you need high speed and a pretty good transport size, Sereal could be the perfect choice.
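
To make the difference concrete, a minimal sketch of the two interfaces (the no_shared_hashkeys option comes from the comments below; see there for caveats):

    use Sereal::Encoder qw(encode_sereal);
    use Sereal::Decoder qw(decode_sereal);

    my $data = { foo => [ 1, 2, 3 ] };

    # functional interface: builds a fresh encoder/decoder on every call
    my $blob = encode_sereal($data);
    my $copy = decode_sereal($blob);

    # OO interface: construct once, reuse for every message (much faster)
    my $enc = Sereal::Encoder->new({ no_shared_hashkeys => 1 });
    my $dec = Sereal::Decoder->new;
    $blob = $enc->encode($data);
    $copy = $dec->decode($blob);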

The bad point with Sereal is the lack of support in other languages and on the server side. For example, we are using Riak, and we are thinking of moving to MessagePack or JSON to be able to use its MapReduce feature. Same thing for Redis: MessagePack is supported in Lua scripts.

If you need good speed and really tiny data to keep in a real-time server, MessagePack seems perfect: tiny, fast, compact, and it can be handled by server-side scripts.
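
Redis, for example, ships a MessagePack codec (cmsgpack) inside its Lua scripting engine, so a script can update a packed value without it ever leaving the server. A sketch (the key and field names are made up, and the key is assumed to already hold a MessagePack-encoded hash):

    use Redis;

    my $redis = Redis->new;

    # Lua executed inside Redis: unpack, increment one field, repack
    my $lua = q{
        local data = cmsgpack.unpack(redis.call('GET', KEYS[1]))
        data.counter = (data.counter or 0) + 1
        redis.call('SET', KEYS[1], cmsgpack.pack(data))
        return data.counter
    };

    my $counter = $redis->eval($lua, 1, 'stats:today');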

Also, the newcomer CBOR is a really nice surprise. The speed is astonishing, and the size is almost as good as Data::MessagePack. The only issue is that the protocol is really new, with almost no implementations yet. But it is really promising. Also, take care to only encode/decode on 64-bit systems (or you may end up with corrupted data).

Thanks for the comments, which let me add more tests to my bench. I will optimize our usage of Sereal and may give MessagePack a try in some situations.

Celogeek

Short URL: http://sck.pm/N4


  • Steffen Mueller

    For benchmarking, you need to use Sereal’s object-oriented interface. It avoids very large amounts of overhead for constructing the encoder/decoder objects and gives a number of memory allocation amortizations. This is particularly important for small data structures like yours. Furthermore, you can gain significant space savings by enabling Snappy compression if you care to.
    I’m surprised MessagePack ends up that much smaller than Sereal in your “big” benchmark. If I get a chance, I’ll dig in to find out why.

    • Steffen Mueller

      So by modifying your first benchmark script to use Sereal’s OO interface, Sereal encoding becomes as fast as MessagePack on my laptop. Decoding via OO is 50% faster than MessagePack. If I also add the “no_shared_hashkeys” option to the Sereal encoder (which doesn’t help with space savings for this benchmark), then Sereal encoding is about 35-40% FASTER than MessagePack.

      For the large data structure that you propose, Sereal is significantly faster at decoding than MessagePack, and almost THREE TIMES FASTER at encoding. Again, on my laptop, and again, using the OO interface and the no_shared_hashkeys option. See the code and results in the gist below.

      https://gist.github.com/tsee/8634922

      • http://blog.celogeek.com/ Celogeek

        Oh my god, man, thanks a million for that!

        That’s why I love doing benchmarks and publishing them: there is always a better solution that comes out!

        The option {no_shared_hashkeys => 1} is not really recommended, except for beating other serializers in benchmarks.

        And yeah, the OO interface is really faster! Which makes sense: keeping an encoder object around and calling it directly means fewer ops, so more speed.

        So I am adding the OO interface of Sereal to the bench right now.

        • Steffen Mueller

          Other than Perl, there are complete Sereal implementations for Go, Objective-C, and Ruby at least. There are implementations, in varying degrees of completeness, for Java, JavaScript, Lua, PHP, and Python. I am not aware of any effort to implement it in Erlang (for Riak).

          A more general point: When you do benchmarks, make sure to at least run them several times to see how much of a variation you get. As I said, on my laptop, Sereal beat MP easily. For you, it seems marginally slower. But your benchmark is quite flawed: At a runtime of less than a second, it’ll be all over the place for repeats. For somewhat more reliable benchmarks, check out dumbbench (on CPAN). If you’re on Linux, especially the --pin-frequency option can help make benchmarks vastly more reproducible.
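
          (For illustration, a Dumbbench run of a single sub looks roughly like this; the name and precision target are arbitrary:)

              use Dumbbench;
              use JSON::XS ();

              my $json = JSON::XS->new;
              my $data = { foo => [ 1 .. 100 ] };

              my $bench = Dumbbench->new(
                  target_rel_precision => 0.005,  # stop at 0.5% precision
                  initial_runs         => 20,
              );
              $bench->add_instances(
                  Dumbbench::Instance::PerlSub->new(
                      name => 'json_encode',
                      code => sub { $json->encode($data) },
                  ),
              );
              $bench->run;
              $bench->report;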

          A colleague pointed out the reason why MP produces smaller output than Sereal on your second benchmark. Both Sereal and MP do a trick to encode small integer numbers into one byte of output. Sereal can do that for -15 to 16. MP can do that for 0 to 127. If you look at your particular benchmark, it’s chosen to encode the numbers 1..1000. That’s going to be virtually all the difference. Now, we could have chosen to encode more small integers in one byte, but we instead decided to pull a similar trick for short strings, small arrays, and small hashes, based on the idea that integers larger than 16, but smaller than 127, aren’t actually THAT common compared to short strings and the major data structures. It’s a trade-off, and your benchmark is particularly unlucky for Sereal. Finally, MP doesn’t have as elaborate a header as Sereal, so on something trivially small, like just encoding a single number, you’ll find that MP stores a few bytes less. This is another deliberate trade-off.
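
          (You can see the effect directly by comparing the encoded size of the 1..1000 array, as a quick sketch:)

              use Sereal::Encoder;
              use Data::MessagePack;

              # per the explanation above: most of these integers take
              # one byte in MessagePack but more in Sereal
              my $nums = [ 1 .. 1000 ];
              printf "sereal:  %d bytes\n",
                  length Sereal::Encoder->new->encode($nums);
              printf "msgpack: %d bytes\n",
                  length Data::MessagePack->new->pack($nums);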

          Finally, if you actually used some of Sereal’s more advanced features, you’d find it beating virtually anything else. For example, common sub structures are only encoded once and come out the same way on the decoding side. MP and JSON both will just create copies needlessly (and sometimes that can even lead to nasty bugs).
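
          (A sketch of that behaviour; Sereal preserves repeated references by default:)

              use Sereal::Encoder;
              use Sereal::Decoder;

              my $shared = [ 1, 2, 3 ];
              my $data   = { a => $shared, b => $shared };  # one array, two refs

              my $copy = Sereal::Decoder->new->decode(
                  Sereal::Encoder->new->encode($data)
              );

              # Sereal restores the sharing, so this prints "shared";
              # a JSON round-trip would produce two independent copies
              print $copy->{a} == $copy->{b} ? "shared\n" : "copied\n";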

          I would suggest that you test serializers with your ACTUAL data instead of some made-up stuff. You’ll probably find that the picture changes substantially.

          • http://blog.celogeek.com/ Celogeek

            With re

          • http://damien.krotkine.com/blog/ dams

            I think that Steffen meant: “use your real data in the benchmark you posted”, not “switch library in your real life system” :) Otherwise you are comparing tomatoes with potatoes.

          • Steffen Mueller

            What dams said is precisely what I meant! :)

          • http://damien.krotkine.com/blog/ dams

            I used the Lua implementation of Sereal when working at the same company as Celogeek, tackling the same issue :) The Lua implementation is robust enough that it works fine with any data structure, but it doesn’t work very well with Perl-specific things, which is good enough for most usage imho. I used it in a specific case where I needed to add elements to an existing Serealized ArrayRef in Redis. It worked, but it was faster to use Redis to concatenate a new Sereal document containing the additional elements, thus not using Lua at all. Then, on retrieval, deserializing from an offset and merging the results into one ArrayRef. That solution was basically beating any other I could come up with.
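
            (A sketch of that append-and-merge pattern; the key name and payload are made up, and it assumes every append is a complete Sereal document:)

                use Redis;
                use Sereal::Encoder qw(encode_sereal);
                use Sereal::Decoder;

                my $redis = Redis->new;
                my $new_element = { id => 1 };  # hypothetical payload

                # write side: append a small Sereal document with the new elements
                $redis->append('events:list', encode_sereal([ $new_element ]));

                # read side: walk the concatenated documents and merge them
                my $blob = $redis->get('events:list');
                my $dec  = Sereal::Decoder->new;
                my ($offset, @merged) = (0);
                while ($offset < length $blob) {
                    push @merged, @{ $dec->decode_with_offset($blob, $offset) };
                    $offset += $dec->bytes_consumed;
                }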

          • http://blog.celogeek.com/ Celogeek

            We try, when we can, to avoid Redis scripting. It can be a real bottleneck (X nodes vs 1 Redis that computes stuff).
            Also, we faced a real issue on Riak: we want to do MapReduce, and JSON is supported pretty well, but Sereal is not. So we cannot do everything we want.

            Nobody is working on an Erlang version? It could be awesome in the end.

          • http://damien.krotkine.com/blog/ dams

            At some point it was on my todo list, but meh…

          • http://blog.celogeek.com/ Celogeek

            :) Well, maybe you will use Riak soon (or it’s already the case), and then maybe you would like to implement the Erlang version of Sereal, which could be excellent for a lot of people :)

  • briandfoy

    I do a similar benchmark in the latest edition of Mastering Perl, but I did it a bit differently. Each of the encoders has a particular strength, so that should be part of the benchmark. For instance, Sereal has a way to avoid keeping duplicate keys and values, which means a benchmark for message size is interesting. Some people might not care how fast the encoder is as long as the data transfer is fast.

  • Felix Ostmann

    You should also try CBOR: https://metacpan.org/pod/CBOR::XS

    There are rumours that this is the fastest of all available serialization modules.

    • http://blog.celogeek.com/ Celogeek

      Oh nice, I’ll try it right now and update the current post. I will update the title too :)
      A quick result: CBOR on the small data is as tiny as Data::MessagePack and twice as fast as any other.
      On the large structure, the size is between Data::MessagePack and Sereal; it encodes a little slower than Sereal and decodes faster.

      I’ll publish the results right away.

      • Steffen Mueller

        Interesting. Remember me saying that CBOR wasn’t faster than Sereal in my benchmarks? I thought that my benchmarks (because I chose the data structure) might be biased. So I ran yours again. Check out the results:

        https://gist.github.com/tsee/8648558

        Sereal beats ANYTHING else by a factor of SEVERAL on encoding the “large” structure in your test. I don’t know what’s going on, to be honest. The only change I made to your script was to change "use 5.18.0" to "use 5.14.0" because that’s what I ran it with.

        • http://blog.celogeek.com/ Celogeek

          If I read your results correctly:

          on the tiny struct, CBOR is faster than Sereal at encoding, but slower at decoding;
          and on the large struct, it is the opposite.

          And Sereal encodes really faster on the large struct (a huge difference).

          I have upgraded our usage of Sereal, using v2 and the correct settings, and I doubled the speed :) Happy with that.

          • Steffen Mueller

            Sereal v2 is not known to be (significantly) faster than version 0.370.

    • Steffen Mueller

      It’s not in any of the benchmarks I did:

      https://github.com/Sereal/Sereal/blob/master/Perl/shared/author_tools/bench.pl

      Specifically again, Sereal beats CBOR handily on encode performance pretty much across the board.

      In case anybody wondered: Decoder performance of various formats is much closer to one another usually since most of the time is spent in the Perl guts, allocating data structures.

    • http://blog.celogeek.com/ Celogeek

      Bench done, and the result is astonishing! Thanks for the tips.

  • Pingback: Benchmarking serializers with Perl (by Vincent) | Weborama Dev Blog