#TechAtMoEngage: CPython vs PyPy Performance Benchmarking
As we started our porting journey from Python 2 to Python 3, it was important for us to explore what options we had. The Python 3 upgrade will introduce a completely new ecosystem for us, from builds to deployments to environment setups to code migration and testing. We wanted to take full advantage of it while ensuring that we get the best performance boost when we jump to Python 3.
The Problem
While planning the migration, the very first question we came across was: Python 3, but which version? Our current environments run on Python 2.7.16. In making the jump, we wanted to ensure that we would be set for a few years and would not hit an EOL for at least the next 4 years, while reaping the maximum performance benefits.
With that in mind, we started exploring which versions were available to us. At the time of writing, the latest Python version is 3.9 [source]. We wanted to adopt the most recent stable release and decided to experiment with 3.9.1. The only impediment, however, is that the 3.9 series was released very recently (Oct 2020).
Could it be too soon to adopt a Python version as recent as 3.9? Our major concerns are library support, tooling, etc.
The Options
We started exploring the alternatives and came across multiple Python implementations, each claiming to be better than the others. Source: [https://www.python.org/download/alternatives/]
After going through each of the implementations, we decided to run a performance benchmark on a few of them to see for ourselves how each one fared.
For the sake of performance testing, we picked a comparison between CPython 2.7.16, CPython 3.7.8, CPython 3.8.8, CPython 3.9.1, and PyPy 3.7.8.
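When switching between several interpreters like this, it helps to record which one actually produced each set of timings. A minimal sketch using the standard `platform` module follows; the helper name `describe_interpreter` is hypothetical and not part of our benchmark.

```python
import platform


def describe_interpreter():
    """Return a label such as 'CPython 3.9.1' for the running interpreter."""
    # python_implementation() reports e.g. 'CPython' or 'PyPy';
    # python_version() reports the language version as 'major.minor.patch'.
    return "{} {}".format(
        platform.python_implementation(), platform.python_version()
    )


if __name__ == "__main__":
    print(describe_interpreter())
```

Printing this label alongside every result makes it harder to attribute a timing to the wrong runtime.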
The Setup
For benchmarking purposes, we had to come up with the right experiment to give us a good comparative analysis among these.
We decided to benchmark each of these by serializing and deserializing a complex nested object. This is a CPU-bound operation and would give us good insight into the performance of every version.
At MoEngage, we have a huge object which stores metadata about our customers. This object contains more than 80 fields, with some fields nested up to 3 layers deep. It contains nested lists, nested objects, and all sorts of data types, which made it the perfect candidate for this benchmark.
We dumped more than 1k of these objects into a file and created a benchmark to load (deserialize) and unload (serialize) them. We used the `timeit` module to measure the time taken. We ran the benchmark 5 times and took the average of all the runs. Here is a code snippet indicating the setup:
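A minimal sketch of such a benchmark could look like the following. It assumes JSON as the serialization format and uses small synthetic records in place of the real customer-metadata dump; neither assumption is confirmed by the setup described above, and the function names are hypothetical.

```python
import json
import timeit


def make_sample_objects(count=1000):
    """Build serialized stand-ins for the real dump (hypothetical shape)."""
    sample = {
        "id": 0,
        "name": "customer",
        "tags": ["a", "b", "c"],
        # Nested a few layers deep, loosely mimicking the real object.
        "settings": {"limits": {"daily": [1, 2, 3], "flags": {"beta": True}}},
    }
    objects = []
    for i in range(count):
        record = dict(sample, id=i)
        objects.append(json.dumps(record))
    return objects


def round_trip(serialized_objects):
    """Deserialize and re-serialize every object once."""
    for raw in serialized_objects:
        obj = json.loads(raw)
        json.dumps(obj)


def run_benchmark(serialized_objects, runs=5):
    """Time `runs` full passes with timeit and return the average in seconds."""
    timings = timeit.repeat(
        lambda: round_trip(serialized_objects), repeat=runs, number=1
    )
    return sum(timings) / len(timings)


if __name__ == "__main__":
    data = make_sample_objects()
    print("average seconds per run: {:.4f}".format(run_benchmark(data)))
```

The sketch avoids version-specific syntax so the same file can run unchanged on each interpreter being compared.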
The Result
The results for the total time taken to finish the benchmarks were staggering.
The Analysis
For every CPython version, we noticed each successor performing better than its predecessor.
The major comparison we were looking forward to was between CPython 3.7.9, PyPy 3.7.8, and CPython 3.9.1. PyPy comes with a JIT (just-in-time) compiler for Python and boasts much better performance than standard CPython (at the cost of compromising on the availability of some libraries which require C extensions). As our use cases didn’t need many such libraries, it would have been a decent choice for us considering the performance gains that we were looking at.
PyPy's claims hold up if we compare CPython 3.7.9 with PyPy 3.7.8 (the latest Python version supported in PyPy is 3.7.8) in the above graphs, but at the same time, the newer Python versions (3.8.8 and 3.9.1) seem to have caught up and even beaten PyPy by a decent margin.
The Conclusion
For us, it clearly makes sense to move to CPython 3.9.1 right now, as we expect to support this version for the long term at MoEngage and we don’t have to compromise on the availability of any libraries that rely on C extensions. As for general library support for version 3.9.1, the Python community believes that every new Python version gets good library support within 6 months of launch. Going by that, we are expecting most of our needs to be fulfilled by March/April.
For us, the difference between what we are using today (2.7.16) and what we plan to use looks to be in the range of a 2x performance gain. This could mean up to 50% infrastructure savings for us.
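The arithmetic behind that estimate can be sketched as follows. It assumes the workload is CPU-bound and that capacity scales linearly with per-request speed, which is an idealization; the function name is hypothetical.

```python
def infrastructure_savings(speedup):
    """Fraction of compute no longer needed to serve the same workload.

    If each unit of work takes 1/speedup of the original time, the same
    load needs only 1/speedup of the original capacity.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# A 2x speedup halves the required capacity.
print(infrastructure_savings(2.0))  # 0.5
```

In practice, I/O-bound work and fixed overheads would pull the real savings below this upper bound, hence "up to" 50%.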
These are exciting times for us to be migrating thousands of lines of code and we are looking forward to getting these performance gains.
If you have embarked upon such a journey at your organization, we would be really happy to speak to you and get some more insights into your experiences. Please feel free to comment below to share your thoughts!
We are also looking to expand our tech team. If this excites you, do check our open opportunities and let us know what you think!