Simple JSON and Mongo speed test between Perl, Ruby and Python


I’m about to try a web framework in a programming language other than Perl. I already have a speed test for Perl with JSON and MongoDB, so I gave it a try in Python and Ruby.

My speed test does this :

  • Read from pipeline (pbzip2)
  • JSON decoding
  • Batch insert into MongoDB

My original file contains 45 million rows, each of which is a JSON string.
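
The read and decode steps can be sketched as follows. This is only a minimal illustration, not the benchmark script itself: it assumes one JSON document per line, uses the core JSON::PP module as a stand-in for JSON::XS, and reads a small in-memory sample (hypothetical data) instead of the pbzip2 pipeline.

```perl
use strict;
use warnings;
use JSON::PP;                # core stand-in; the test itself uses JSON::XS
use Time::HiRes qw(time);

# Decode one JSON document per line and return the number of rows decoded.
sub decode_rows {
    my ($fh) = @_;
    my $json  = JSON::PP->new->utf8;
    my $count = 0;
    while ( my $line = <$fh> ) {
        my $row = $json->decode($line);
        $count++ if ref $row;
    }
    return $count;
}

# Hypothetical sample input; a real run would read STDIN fed by pbzip2.
my $sample = join '', map { qq({"id":$_,"name":"row$_"}\n) } 1 .. 1000;
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";

my $start   = time;
my $rows    = decode_rows($fh);
my $elapsed = time - $start;
printf "decoded %d rows in %.3fs\n", $rows, $elapsed;
```

Passing `\*STDIN` to `decode_rows` instead of the in-memory handle gives the pipeline variant.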

The languages :

  • Perl 5.18 with JSON::XS
  • Ruby 2.1.1 with OJ (Optimized JSON)
  • Python 3.4 with UJSON (Ultra Fast JSON)

The MongoDB engines :

  • TokuMX (MongoDB 2.4)
  • MongoDB 2.6

Speed test summary :

Language | Read from pipeline | Decode JSON | Insert in TokuMX | Batch Insert in MongoDB 2.6
Perl     | 0m6.861s           | 0m8.104s    | 0m5.602s         | 0m4.916s
Ruby     | 0m6.669s           | 0m12.588s   | 0m9.552s         | 0m8.291s
Python   | 0m6.858s           | 0m8.160s    | 0m8.382s         | 0m8.097s

Speed test 1 : Read 1M rows from the pipeline

Perl Code :

Ruby Code :

Python Code :

Speed test 2 : Decoding 500 000 JSON rows

Perl Code :

Ruby Code :

Python Code :

Speed test 3 : Insert 100 000 rows in TokuMX

I run the code twice (for the drop), and report the time.

Perl Code :

Ruby Code :

Python Code :

Perl wins on TokuMX here. Now I try the batch insert with MongoDB 2.6 (real batch insert mode).

Speed test 4 : Batch Insert 100 000 rows in MongoDB 2.6

Perl Code :

Ruby Code :

Python Code :

Again, Perl is even faster with real bulk insert ! The driver automatically chooses the best mode. Ruby has an unordered bulk insert mode in 2.6 (not the same code), and with it the speed is as good as Python’s.


Perl Memory::Stats – Get RSS memory reporting on multiple platforms

It is often nice to be able to dump the current memory usage of a part of your process.

I’m working on Mac OS X, and I get frustrated when a Perl module only works on Linux because it reads files under ‘/proc’ or something like that.

Thanks to IRC users, I discovered Proc::ProcessTable, and decided to create a module that uses it to record memory usage. The good point is that the module works on Linux, Mac OS X and FreeBSD. I’m not sure about Windows.

My module is Memory::Stats. So how does it work ?

You can also add checkpoints:

Enjoy !


Interesting read : Slide Asia YAPC for Memory Use

Perl REST::Client – POST method

I’m working on rewriting Redmine::API to remove Net::HTTP::Spore.

The simple replacement is to handle the request with REST::Client.

My first question was: how can I post params with it ? How can I do the equivalent of the “payload” of Net::HTTP::Spore ?

Well, after googling and trying stuff, here is the result :

Here is the client :

Here is the server :

in ‘lib/rc.pm’ :

in rc.yml :

And then to launch it :

The result is :

So you can encode the params as you do for GET and pass them to the POST method without the question mark “?”.

So for example :

REST::Client has a method to do that for you, but it is meant for the GET method, so you have to remove the question mark ‘?’ :

You can also use URI

A very important header is to properly setup the content type :

{‘Content-type’ => ‘application/x-www-form-urlencoded’}

It indicates how you have encoded your params.
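
As an illustration of what such an encoded body looks like, here is a small sketch using only core Perl. The helpers `url_escape` and `form_encode` and the sample params are hypothetical names made up for this example; they are not part of REST::Client or URI.

```perl
use strict;
use warnings;

# Percent-encode everything except the RFC 3986 unreserved characters.
sub url_escape {
    my ($text) = @_;
    utf8::encode($text) if utf8::is_utf8($text);
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
    return $text;
}

# Build an application/x-www-form-urlencoded body (no leading '?').
sub form_encode {
    my (%params) = @_;
    return join '&',
        map { url_escape($_) . '=' . url_escape( $params{$_} ) }
        sort keys %params;
}

my $body    = form_encode( name => 'John Doe', project => 'redmine&api' );
my %headers = ( 'Content-type' => 'application/x-www-form-urlencoded' );

print "$body\n";    # name=John%20Doe&project=redmine%26api
```

With REST::Client, the body and headers would then be passed as `$client->POST($url, $body, \%headers)`.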

I hope this helps.


Perl IO::Socket server socket strange behaviour

I’m trying to create a server that listens on a port and never answers nor accepts anything.

Here is my small code :

To test it, I run the server :

And then I test the socket timeout with, for instance, a Redis connection with a timeout :

The first time I do this, the Redis client stays locked forever ! I tried a simple IO::Socket client: it seems that the first time IO::Socket listens on a port, it prepares an accept, which leads the IO::Socket client to think that the server has accepted the connection.
But this is not true !
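
A possible explanation, offered here as an assumption and not something the tests in this post confirm: on a listening TCP socket the operating system completes the handshake itself and queues the connection until the program calls accept, so a client can see a successful connect even though the server never accepts. A minimal sketch with core modules only (ephemeral port and timeouts chosen arbitrarily):

```perl
use strict;
use warnings;
use IO::Socket::INET;
use IO::Select;

# Listen on an ephemeral local port, but never call accept().
my $server = IO::Socket::INET->new(
    LocalAddr => '127.0.0.1',
    LocalPort => 0,
    Listen    => 5,
    Proto     => 'tcp',
) or die "cannot listen: $!";
my $port = $server->sockport;

# The connect is attempted against a server that will never accept.
my $client = IO::Socket::INET->new(
    PeerAddr => '127.0.0.1',
    PeerPort => $port,
    Proto    => 'tcp',
    Timeout  => 2,
) or die "cannot connect: $!";

# Nothing ever answers, so the client has to bound its own reads.
my @ready = IO::Select->new($client)->can_read(0.5);
printf "connected: %s, readable after 0.5s: %s\n",
    ( $client->connected ? 'yes' : 'no' ), ( @ready ? 'yes' : 'no' );
```

If this is what happens, a client that only sets a connect timeout will block on its first read, which is why a separate read timeout matters.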

The second time I run the Redis script I get :

And that is the correct message.

I have also tried with the official client :

The first time, I get the console :

And then it is locked forever at the second try :

I will try other clients and maybe create a server in another language, but since IO::Socket has not been fixed for several years and several bugs have been reported, I think we have to live with that.


Perl Benchmark Serializer: JSON vs Sereal vs Data::MessagePack vs CBOR

I’m working on computing optimization and I need a fast real-time message packer to store data in Redis at light speed.

So I have tried several serializers, like JSON, Sereal and Data::MessagePack, and I would like to share the results.

First of all, a few words about the different serializers.

JSON is the JavaScript Object Notation. It is available in many languages and supported natively by different servers, like MongoDB and Redis.

Data::MessagePack has the same advantage: it is supported by a lot of systems and servers.

Having support on the server side can be excellent for speeding up processing. You send a hash of data to the storage of the server, and ask the server to manipulate the data itself, like incrementing an element in the hash.

Sereal is really great too, but it supports only a few languages and has no NoSQL server support yet.

CBOR is a new one I discovered thanks to your comments, guys. It’s the Concise Binary Object Representation, based on this draft : http://tools.ietf.org/search/rfc7049. The encoding / decoding speed and the size are really nice, but there are not yet many implementations of this serializer. It is also good to know that this serializer will only work on 64-bit systems.

So let’s start with the benchmark.

Here is the code :

And here are the results (average of the 3 results) for the small structure :

Serializer        | Serialized Size | Encode Speed (op/s) | Decode Speed (op/s)
CBOR              | 30              | 2669101             | 1146879
JSON              | 42              | 724345              | 1058658
Data::MessagePack | 30              | 1559156             | 688127
Sereal            | 38              | 491519              | 826427
Sereal OO         | 38              | 1181633             | 1251141

And for the large structure (average of the 3 results) :

Serializer        | Serialized Size | Encode Speed (op/s) | Decode Speed (op/s)
CBOR              | 2757            | 61439               | 35839
JSON              | 3941            | 43761               | 21503
Data::MessagePack | 2653            | 60074               | 26879
Sereal            | 2902            | 58708               | 33083
Sereal OO         | 2902            | 66166               | 33083
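
Figures of this kind (serialized size, encode and decode operations per second) can be measured with a small harness. The sketch below is not the benchmark script of this post: it uses the core modules JSON::PP and Storable as stand-ins for the serializers tested here, a hypothetical small structure, and a hypothetical helper `ops_per_second`.

```perl
use strict;
use warnings;
use JSON::PP;
use Storable qw(nfreeze thaw);
use Time::HiRes qw(time);

# Hypothetical small structure; the data used in the post is not shown.
my $data = { id => 42, name => 'test', tags => [ 1, 2, 3 ] };

my $json = JSON::PP->new->canonical;
my %serializers = (
    'JSON::PP' => [ sub { $json->encode( $_[0] ) }, sub { $json->decode( $_[0] ) } ],
    'Storable' => [ sub { nfreeze( $_[0] ) },       sub { thaw( $_[0] ) } ],
);

# Run $code on $arg for roughly $seconds and return operations per second.
sub ops_per_second {
    my ( $code, $arg, $seconds ) = @_;
    my ( $count, $start, $elapsed ) = ( 0, time, 0 );
    while ( $elapsed < $seconds ) {
        $code->($arg) for 1 .. 100;
        $count += 100;
        $elapsed = time - $start;
    }
    return $count / $elapsed;
}

my %size;
for my $name ( sort keys %serializers ) {
    my ( $encode, $decode ) = @{ $serializers{$name} };
    my $packed = $encode->($data);
    $size{$name} = length $packed;
    printf "%-10s size=%4d encode=%8.0f/s decode=%8.0f/s\n",
        $name, $size{$name},
        ops_per_second( $encode, $data,   0.2 ),
        ops_per_second( $decode, $packed, 0.2 );
}
```

Adding another serializer only means adding an encode/decode pair to `%serializers`.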

Conclusion :

First, if you want to use Sereal, use the OO interface. This one is really really fast !

Sereal is really good at decoding and, with the proper options, is as good as Data::MessagePack at encoding. So if you need high speed and a pretty good transport size, Sereal could be the perfect choice.

The bad point with Sereal is the lack of support in other languages and on the server side. For example, we are using Riak and we are thinking of moving to MessagePack or JSON to support the MapReduce method. Same thing for Redis: MessagePack is supported in Lua scripts.

If you need good speed and really tiny data to keep in a real-time server, MessagePack seems perfect: tiny, fast, compact, and it can be expanded in server-side scripts.

Also, the newcomer CBOR is a really nice surprise. The speed is astonishing, and the size is almost as good as Data::MessagePack’s. The only issue is that the protocol is really new, with almost no implementations. But this one is really promising. We also have to take care to only encode/decode on a 64-bit system (or the data could be corrupted).

Thanks for the comments, which let me add more tests to my bench. I will optimize our usage of Sereal and may give MessagePack a try in some situations.


Perl Jedi Plugin Auth

Jedi::Plugin::Auth is an authentication plugin for Jedi.

It handles the authentication for you, saving user info in its database, and returning the full profile in a session when the user identifies himself properly.

The plugin provides :

  • jedi_auth_signin
  • jedi_auth_signout
  • jedi_auth_login
  • jedi_auth_logout
  • jedi_auth_update
  • jedi_auth_users_with_role
  • jedi_auth_users_count
  • jedi_auth_users

So you can add a user, remove it, log the user into the session, log it out, update the user information, list the users with a specific role, get the number of users and get all or part of the users in the auth database.

The default behavior is to create an auth database for you in SQLite in the distribution directory of Jedi::Plugin::Auth, using your app name to complete the final name.

Each of your apps will have its own authentication.

Each user created gets a generated UUID that can be used in your apps to associate the user with his own data.

This is only a plugin, you need to implement the interface and the backend to make it functional.

The documentation can be found here : Jedi::Plugin::Auth

And this documentation hopefully provides enough information to use it properly.

More will come: I’m looking at implementing MySQL storage in your own databases with your own prefix.



Perl Benchmark Cache with Expires and Max Size

I’m dealing with caches at work, and I need a high-performance in-memory cache system. I’m currently using CHI with the Memory backend, and I have noticed that setting values is really slow compared to the others.

So I decided to benchmark all the cache modules I could find on MetaCPAN and try to find the best one.

The best one for me supports expiration and a max size, if possible not both at the same time, with high write and read speed.

The table below reports the read/write speed per second and the memory consumption. It also reports the number of elements found in the cache, which is supposed to be near the max_size setting.

I will set the cache size to 500k and write 600k entries.
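
To make the measured quantities concrete, here is a rough sketch of how write speed, read speed and cache hits can be counted, using a plain hash (the "Pure Hash" baseline). It is not one of the test scripts: the entry count is scaled down and the key and value names are made up.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Scaled-down, hypothetical count; the benchmark writes 600k entries.
my $entries = 60_000;
my %cache;

# Write phase: time how long it takes to set every entry.
my $start = time;
$cache{"key$_"} = "value$_" for 1 .. $entries;
my $write_time = time - $start;

# Read phase: read every key back and count how many are still there.
my $hits = 0;
$start = time;
for my $i ( 1 .. $entries ) {
    $hits++ if defined $cache{"key$i"};
}
my $read_time = time - $start;

printf "write: %.0f/s, read: %.0f/s, cache hit: %d\n",
    $entries / $write_time, $entries / $read_time, $hits;
```

A plain hash never evicts, so every read is a hit; a cache limited to max_size should report a hit count close to that limit.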

First of all, here is a test program with CHI Memory. The other tests are done the same way, slightly adapting the set/get syntax.

And now the results :

Engine                              | Write Speed | Read Speed | Cache hit | Memory usage in KB
Pure Hash                           | 564 181     | 1 523 941  | 600 000   | 203 804
CHI Memory                          | 12 977      | 171 939    | 7 100     | 86 436
CHI RawMemory                       | 21 757      | 118 395    | 499 999   | 490 356
CHI FastMmap (25m file)             | 23 420      | 48 960     | 291 303   | 46 580
Tie Cache LRU Expires               | 141 878     | 169 350    | 500 000   | 420 280
Tie Cache FastMmap (25m file)       | 47 068      | 92 067     | 298 020   | 46 340
Cache LRU                           | 219 534     | 410 147    | 500 000   | 424 108
Cache BDB                           | 39 014      | 58 737     | 600 000   | 588
Cache FastMmap (50m)                | 54 593      | 91 893     | 600 000   | 65 968
Cache Ref CART                      | 59 186      | 225 209    | 500 000   | 347 360
Cache Ref CAR                       | 55 622      | 221 103    | 500 000   | 308 356
Cache Ref FIFO                      | 147 548     | 398 192    | 500 000   | 211 016
Cache Ref LRU                       | 85 039      | 108 123    | 500 000   | 330 576
Redis TCP / String val              | 7 275       | 7 418      | 600 000   | 224
Redis TCP / Sereal Encode Decode    | 6 235       | 6 771      | 600 000   | 224
Redis Socket / String val           | 20 528      | 20 205     | 600 000   | 196
Redis Socket / Sereal Encode Decode | 18 318      | 19 913     | 600 000   | 260
Cache Memcached Fast / Sereal       | 29 174      | 39 490     | 441 472   | 216
Riak Light (pbc)                    | 734         | 923        | 600 000   | 312
Net Riak (pbc)                      | 396         | 519        | 600 000   | 252
MongoDB                             | 2 701       | 2 069      | 600 000   | 5 036

Here are the graphs of the different results :

  • Write / Read
  • Memory usage
  • Read : with a shared DB file (1, 2, 4 processes)
  • Write : with a shared DB file (1, 2, 4 processes)

Conclusion :

About CHI :

CHI Memory seems bad at keeping the cache. I set max_size to 500k and it holds only 13k !

CHI RawMemory is bad on memory consumption. It takes twice the memory needed to hold the data.

CHI FastMmap is great for memory usage: low memory consumption, shared between multiple processes. With 4 processes and heavy load I have seen errors appear.

CHI is usually poor at writing and has no excellent performance at reading.

About TIE Cache :

Tie Cache LRU has nice performance but consumes memory almost like CHI.

The FastMmap implementation is faster than the CHI one.

As for usage, it is like using a simple hash.

About Cache :

Cache LRU has the highest performance of all the cache packages I have tested. The write and read speeds are very, very high.

The memory consumption is still very high, so it cannot be used for every kind of usage. It may be nice for keeping only a maximum amount of data for a compute job.

Cache BDB uses a Berkeley Database to save your data. It costs absolutely no memory ! It could be great for long-running programs with a large amount of data to keep in a local cache. One very big issue is that the database can be corrupted very easily. I ran only 3 processes at the same time and the database got corrupted, and it was impossible to make it work again except by deleting it. Each process needs a different database; the only downside is that it will consume disk space. But usually we have a lot more disk space than memory. I will really consider that one for keeping data in long-running web applications. It is very nice for saving memory and keeping a lot of data in a local cache.

And Cache FastMmap is the fastest local file cache I have tested, and it works perfectly with many processes. This is my best choice !

About Cache Ref :

Cache Ref has a very nice implementation of a memory cache. It costs a little more memory than a simple hash and has very nice performance.

Cache Ref FIFO is very simple, uses just a little more memory than the simple hash, and is great for keeping the last X values. Great performance, and very nice on memory.
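
To show why a FIFO cache stays so close to a simple hash, here is a minimal sketch of the idea with a max size and an expiration time. `TinyFifoCache` is a hypothetical class written for this illustration; it is not Cache::Ref::FIFO and makes no claim about how that module is implemented.

```perl
use strict;
use warnings;

package TinyFifoCache;    # hypothetical name, not a CPAN module

sub new {
    my ( $class, %args ) = @_;
    return bless {
        max_size => $args{max_size},
        expires  => $args{expires},    # lifetime in seconds
        data     => {},                # key => [ value, expiry time ]
        order    => [],                # keys in insertion order
    }, $class;
}

sub set {
    my ( $self, $key, $value ) = @_;
    push @{ $self->{order} }, $key unless exists $self->{data}{$key};
    $self->{data}{$key} = [ $value, time() + $self->{expires} ];

    # Evict the oldest keys once the cache is over its limit.
    while ( @{ $self->{order} } > $self->{max_size} ) {
        delete $self->{data}{ shift @{ $self->{order} } };
    }
    return $value;
}

sub get {
    my ( $self, $key ) = @_;
    my $entry = $self->{data}{$key} or return;

    # An expired entry counts as a miss; FIFO eviction removes it later.
    return $entry->[1] > time() ? $entry->[0] : undef;
}

sub count { return scalar keys %{ $_[0]{data} } }

package main;

my $cache = TinyFifoCache->new( max_size => 3, expires => 60 );
$cache->set( "k$_" => $_ ) for 1 .. 5;

printf "size=%d k1=%s k5=%s\n",
    $cache->count,
    $cache->get('k1') // 'miss',
    $cache->get('k5') // 'miss';    # size=3 k1=miss k5=5
```

The only extra state compared to a plain hash is the array of keys in insertion order, which is consistent with the small memory overhead measured above.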

About Redis :

For local reads / writes, Redis is better over a pure socket. The read speed is almost the same as the write speed. The good point is that it doesn’t cost your process any memory, and all processes can share the memory used by Redis.

It is still more than twice as fast to use a FastMmap file instead of a Redis instance for the same job.

About Memcached:

Memcached is a really good server for caching, and there is an XS module to read/write to it, which gives you better speed than Redis. But Redis has a lot more features. FastMmap is still twice as fast, and easier to set up.

About Riak:

Riak, by default (I haven’t tested with a lot of nodes), is really slow compared to the others, and I’m pretty certain it is not the best choice for local caching.

Note that Riak::Light is almost twice as fast as Net::Riak.

About MongoDB:

MongoDB gives better results on a local instance. I’m not really sure I used the Unix socket; the MongoDB driver is not very explicit about this. But the speed is really low compared to the other caches.

Well, I hope you like the benchmark. Feel free to share. If you have experience with other great modules, please share; I would love to add them to the bench.

You can find the test scripts on GitHub : https://github.com/celogeek/perl-test-caching.

Enjoy !


Perl Jedi Plugin Session

Jedi::Plugin::Session is an extension for Perl Jedi that lets you store sessions for your users.

The plugin will automatically generate and save a UUID for the visiting user, and extend Jedi::Request with the methods : session_get and session_set.

You have no limit (except your memory) on how much data you store for each user. You can only store serializable objects.

The session plugin has 3 engines for now :

I may add a file engine soon, if needed.

The Redis and SQLite engines allow you to share the session between multiple workers of the same app.

Each app has its own session data.

The UUID stored in a cookie in the user’s browser is only a part of the final UUID. If someone steals that cookie, he will get a different session. If you change browsers, you will also have a different session.

Let’s show an example of how to use it :

The session is not necessarily a ‘HASH’. But it is often simpler to save a session this way.

If the user is new, or his session has expired, then the session data will be undef.

If you want to reset the expiration time you can do :

To use another engine, it is simple :

Each engine can be configured with the Jedi configuration.

You can take a look at the possibilities in the BACKENDS doc.

Each engine should work without any configuration. SQLite will store the data in the dist_dir of Jedi::Plugin::Session, and create a session file per app. Redis will use the default port to store the session.

Enjoy !


Perl – Understand List, Comma operator and Array in scalar context

A common misunderstanding in Perl is the difference between :

  • A list : (1, 2, 3)
  • An array : @my_array containing (1, 2, 3)
  • The comma operator ‘,’ (a binary operator): $operation1, $operation2, $operation3

Perl uses the context to favor the practical over the logical. That leads to unexpected behavior if you confuse these concepts.

The 3 contexts are : void context, list context, scalar context. We will focus on the scalar context here.

First of all, what is the difference between a list and an array, and when is each one used ?

When you declare an array :

You can use several functions like ‘push’, ‘scalar’ … on it.

You cannot assign a list to a variable: a list is ephemeral and no operation can be done on it.

You cannot do :

It has no meaning. The push (1,2,3), 4 is parsed as push(1, 2, 3), 4, so push tries to push 2 and 3 onto the constant 1. And that doesn’t work.

Now let’s show the different behaviors of all these concepts :

The comma operator :

Here we have the comma operator ‘,’.

To write it more clearly we can do :

So $x = 4 in that case.

The list :

We use a list in a scalar context. In that case the list will return the last element.

So $x = 6;

The array :

We use an array in a scalar context. In that case the array will return the size of the array.

So $x = 3

Now a bit more tricky, the mix of them :

The operation assigns the array to $x, which in that case means the size of the array, then evaluates the constant 7.

So $x = 3

The array in a list context expands into a list with ‘4, 5, 6’ and then we add 7. That creates a list with 4, 5, 6, 7.
A list returns the last element in a scalar context.

So $x = 7;
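
The scalar-context cases above can be checked in one place. This is a sketch with made-up variable names, assuming an array holding (4, 5, 6); the warnings about discarded values are switched off on purpose.

```perl
use strict;
use warnings;
no warnings 'void';    # the discarded values below are intentional

my @array = ( 4, 5, 6 );
my ( $comma, $list, $count, $mixed, $wrapped );

$comma   = 4, 5, 6;         # parsed as ($comma = 4), 5, 6
$list    = ( 4, 5, 6 );     # scalar context: the last value
$count   = @array;          # array in scalar context: number of elements
$mixed   = @array, 7;       # parsed as ($mixed = @array), 7
$wrapped = ( @array, 7 );   # scalar context: the last value again

print "$comma $list $count $mixed $wrapped\n";    # 4 6 3 3 7
```

The deciding factor is operator precedence: assignment binds tighter than the comma, so without parentheses only the first operand is assigned.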

And take care with the comma operator versus the list :

Here we don’t have a list assigned to an array, but the comma operator.

So @x = (4).

If you use a list :

Each element of the list is added to the array.

So @x = (4, 5, 6).
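
A short sketch of the two array assignments, with hypothetical variable names:

```perl
use strict;
use warnings;
no warnings 'void';    # the discarded constants are intentional

my ( @first, @all );
@first = 4, 5, 6;        # parsed as (@first = 4), 5, 6
@all   = ( 4, 5, 6 );    # the parentheses make the whole list the right-hand side

print scalar(@first), " ", scalar(@all), "\n";    # 1 3
```

With warnings fully enabled, Perl reports the discarded constants in the first assignment, which is a good hint that parentheses are missing.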

If we are in a function, the behavior is different, always favoring the practical over the logical.

The comma expression is transformed into a list before the function returns its result.
When we mix an array and a list, the result is a list, and the last element is returned in scalar context.
When we return an array in a scalar context, the result is the size of the array.
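
These three return cases can be sketched as follows. The function names are invented for the illustration and the array is assumed to hold (4, 5, 6).

```perl
use strict;
use warnings;

my @array = ( 4, 5, 6 );

sub return_commas { return 4, 5, 6 }      # comma expression
sub return_mixed  { return @array, 7 }    # array mixed with a value
sub return_array  { return @array }       # plain array

my $commas = return_commas();    # scalar context: last value
my $mixed  = return_mixed();     # scalar context: last value
my $size   = return_array();     # scalar context: number of elements
my @list   = return_commas();    # list context: all the values

print "$commas $mixed $size @list\n";    # 6 7 3 4 5 6
```

The return expression is evaluated in the context of the call, which is why the same function gives 6 in scalar context and (4, 5, 6) in list context.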

We can replace the array return with :

We cannot generalize the behavior of a function by assuming that a function always returns a list except when we return an array.

Each function has access to the context it is called in and can act differently. We often want a function to return the most practical result.

The only way to know how a function behaves is to read its documentation.

I will show you an example with the “sort_in_place” function :

This function checks the context. “wantarray” (badly named) can be used to get the context.
Here is its value based on the context :

  • equals “1” in a list context.
  • equals the empty string “” in a scalar context. So wantarray() is false but defined wantarray() is true.
  • equals undef in a void context. So wantarray() and defined wantarray() are both false.
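
A tiny sketch of the three values, using a hypothetical helper `context_name` (it is not the “sort_in_place” function) that records the context it was called in:

```perl
use strict;
use warnings;

# Hypothetical helper: report the context it was called in.
my $last_context;

sub context_name {
    my $want = wantarray;
    $last_context = $want ? 'list' : defined $want ? 'scalar' : 'void';
    return $last_context;
}

my @in_list   = context_name();    # list context
my $in_scalar = context_name();    # scalar context
context_name();                    # void context

print "$in_list[0] $in_scalar $last_context\n";    # list scalar void
```

Testing the truth of wantarray first and its definedness second is enough to tell the three contexts apart.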

So :

In list context :

In scalar context :

In void context :

To sum up, you have to take care of how you use context in Perl, and never assume from your experience that a function behaves in a certain way because an array or a list behaves that way. You should always check the documentation of the function you use to know how it behaves in different contexts.

If you have questions or criticism, feel free to comment. I will gladly improve this document based on your comments.

I hope the strange behavior of some functions is clearer to you now.



My Perl and Javascript blog !