2021 Year in Review-Engineering@MoEngage
2021 has been a year of excitement and adrenaline at MoEngage. We started the year with better planning than in previous years, especially in the engineering team. In 2020, we had focused our culture on customer obsession and on delivering more customer value than ever before. Customer obsession helped us set the right direction for 2021, and I couldn’t have asked for a better outcome from the team. I am very proud of the team and what we have achieved together in the last year.
Scale
At MoEngage, we have always been proud of solving challenging problems on a day-to-day basis. Our challenges revolve around building scalable solutions that stand the test of time and scale horizontally. The challenge grew 2.5-fold last year: we began the year serving 400 million monthly active users, and we are currently serving 1 billion. For comparison, that is about one-third the scale of Facebook. It’s exciting to build solutions and products at this scale. We ensured that the teams had absolute ownership and freedom to execute their ideas to solve these problems.
Overhauls
Early in the year, we communicated to our teams that 2021 would be a year of radical transformation in both processes and teams. I would like to talk first about team transformations. We have focused on setting up the leadership team to execute the vision and steer our team members in the right direction. Apart from a couple of areas, we are satisfied with the current team configuration. We have a couple of Directors of Engineering (DoEs), 6 Engineering Managers (EMs), and 4 Technical Architects. Most of them joined the team in 2021, and they have brought a lot of new ideas with them. There is a lot of focus on perfecting day-to-day activities, which ultimately leads to more customer value by delivering quality products on time.
We have also completely overhauled 4 of our products (Flows, Segmentation, Sherpa, Analytics), and each overhaul has taken about a year to complete. We don’t shy away from admitting our mistakes, and we take concrete steps to fix them, even if it takes considerable time to do so. Long-term projects are tough to pull off for any company; they take a lot of discipline, focus, commitment, and, more importantly, hunger from the people executing them.
For the Flows stack (Flows is one of our offerings), we moved away from Angular, overhauled the design, primarily in the UI, and rewrote the entire product in React. UX, performance, and maintainability were the main pain points we wanted to solve, and the team has executed the mission flawlessly. We have rolled out the new Flows to roughly 60% of customers, and we haven’t seen any significant issues. Our customers can now orchestrate their user lifecycles on a single page with ease.
The last year has been a golden year for data science at MoEngage. We have built some essential products by putting the right team in place, and, more importantly, one of our key members has taken over as its leader and has been the heart and soul of the mission. We have built proper bridges connecting data science outcomes directly to the customer-facing applications. A lot of work has gone into building these bridges so that we can easily integrate future solutions, and, needless to say, so that they work at our scale. Our customers can get more insights out of the box and send more intelligent communications. You will hear more about the product details in the 2021 review of MoEngage.
Our segmentation service has redone its stack for the 3rd time, probably the last time for some years to come, and this time we have decoupled our storage and compute layers. We have combined serverless with an in-house solution, with the flexibility to switch between the two stacks when needed, so we ultimately have two different ways to solve the same problem. This lets us balance cost against performance, and the team deserves all the credit for such an ingenious solution. The new stack uses S3 as the data warehouse, AWS Lake Formation to help us with security, and Athena or Presto as query layers. There are other technologies involved, which we will explain in detail in a separate blog. We showcased a 5x improvement in p99 response times for our segmentation service.
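To make the cost vs. performance trade-off concrete, here is a minimal sketch of how a router might pick a query layer per request. It is an illustration only, not our production logic: it assumes Athena is the serverless path and a self-managed Presto cluster is the in-house one, and the `SegmentQuery` type, the `choose_engine` function, and the 10 GiB threshold are all hypothetical.

```python
from dataclasses import dataclass

# Assumed threshold: small scans go to the pay-per-scan serverless engine,
# larger scans go to the fixed-cost in-house cluster. The value is made up.
SERVERLESS_SCAN_LIMIT_BYTES = 10 * 1024**3  # 10 GiB


@dataclass(frozen=True)
class SegmentQuery:
    """Hypothetical description of one segmentation query."""
    sql: str
    estimated_scan_bytes: int
    latency_sensitive: bool = False


def choose_engine(query: SegmentQuery, cluster_healthy: bool = True) -> str:
    """Return "athena" or "presto" for a query (illustrative policy only)."""
    if not cluster_healthy:
        # Fall back to serverless when the in-house cluster is unavailable.
        return "athena"
    if query.latency_sensitive:
        # Assume the always-on cluster gives more predictable latency.
        return "presto"
    if query.estimated_scan_bytes <= SERVERLESS_SCAN_LIMIT_BYTES:
        return "athena"
    return "presto"
```

Because storage (S3) is decoupled from compute, a policy like this can change without moving any data; only the routing decision is affected.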
Our Analytics service and segmentation service now work on the same data warehouse, which removes a lot of inconsistencies. The team has put a lot of effort into migrating the analytics stack from GCP to AWS.
Another team that saw an overhaul is the Security team, showing its presence with many new practices, software, and audits. We invested a lot more time and money last year into Security to give our customers a more secure environment than ever before. The team has made its presence known by achieving the ISO certification and changing many in-house practices. We are confident that we will see much more from the Security team this year.
Process Improvements
Setting up effective processes and helping people adopt them helps close operational gaps in any company, and good processes scale quickly. We have improved our processes in multiple areas, and this has yielded better results for us.
Adding more rigor to SLA tracking is one of the crucial things we introduced this year. We can now fix customer experience issues faster and prevent some problems altogether. Similarly, we are tracking support escalations to the engineering team with equal rigor to ensure that our customers get faster and more accurate responses.
On-call processes have improved dramatically in the last year, with an emphasis on fixing issues properly, even when that takes time away from feature development, and on fixing the documentation where needed. On-call handovers happen at the end of each on-call rotation, and any long-term fixes are moved to the next quarter’s OKRs. We have also set targets to reduce the number of pages for each team and keep a constant focus on tracking the load. NOC engineers were onboarded last year to help our dev teams with some routine actions and to improve our runbooks and playbooks. Our campaigns team has demonstrated an excellent outcome in both SLAs and on-call: every campaign should go out on time and reach all its users, and the team has executed this goal flawlessly.
With so many changes and overhauls happening daily, our QA team has stepped up its practices as well. Bug bashes have worked well for us: all the members of the respective teams come together and unearth as many bugs as possible. We also have an internal bounty program to appreciate the participants’ efforts, and we have gamified the whole process to foster healthy competition and engagement. Our QA team has consistently reduced defect leakage for the last four quarters. We also introduced one-page test plans, which are lean and help develop a clear thought process before testing starts.
Platform improvements
Last year, the platform team had a more formalized setup, bringing Database, Data engineering, Platform Backend, and Platform Frontend under one umbrella. They focused on improving developer efficiency and bringing standardization across the application teams.
Kafka has become an integral part of our tech stack. All the streaming services, both internal and customer-facing, use it. We ported our services to a new version of Kafka last year. That is not easy when some of the services are stateful and their state is stored in Kafka; a lot of rigor and validation is needed, especially when porting production services. The Data Engineering team made Kafka more reliable by setting up a robust alerting system to proactively detect anomalies and plan capacity. Additionally, we broke our monolithic Kafka cluster into smaller Kafka clusters, and we will be speaking more about this in a later blog.
The database team also had a fun year, and I would say most fruitful compared to yesteryears. We were able to solve some perennial issues permanently, going from a place of daily firefighting to working only on long-term projects these days. SLAs have improved from the database team regarding availability and response times. We upgraded our database versions to bring more reliability, security, and performance into our applications. We have beefed up our team to bring automation into everyday activities. I am optimistic that 2022 will bring some exciting outcomes from the database team.
The Platform Backend team had a fantastic year, working on various projects throughout the year. As a SaaS platform, we have a pattern where not every customer uses all our services, and not everyone uses our services in the same way, so one customer may use more infrastructure than another. To understand our infrastructure costs per customer, we have built a platform that makes it easy for our application teams to track the costs of their services split by customer. It is one of the game-changing projects for improving our margins. Python and Java are the major programming languages at MoEngage, and we have been running two separate tracks to support the platform services for these languages. The team has seen a very high return on investment from these two tracks.
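As a rough illustration of what splitting a service’s cost by customer can look like, the sketch below allocates a cost in proportion to each customer’s usage. This is a simplified assumption, not a description of our platform: the function name `split_cost_by_customer`, the idea of a single usage metric per customer, and the sample numbers are all hypothetical.

```python
from typing import Dict


def split_cost_by_customer(
    service_cost: float, usage_by_customer: Dict[str, float]
) -> Dict[str, float]:
    """Allocate a service's cost to customers in proportion to their usage.

    `usage_by_customer` maps a customer id to any usage metric for the
    service (requests, events processed, bytes stored, and so on).
    """
    total_usage = sum(usage_by_customer.values())
    if total_usage <= 0:
        raise ValueError("total usage must be positive to split the cost")
    return {
        customer: service_cost * usage / total_usage
        for customer, usage in usage_by_customer.items()
    }


# Example with made-up numbers: a service costing 100 units, where
# customer "a" generated three times the usage of customer "b".
shares = split_cost_by_customer(100.0, {"a": 3, "b": 1})
```

With these inputs, customer "a" is attributed 75.0 and customer "b" 25.0; summing such shares across services gives a per-customer infrastructure cost.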
The Platform Frontend team is just getting started, and we expect to see a significant improvement in processes and performance for the FE team in 2022. We have already set up the team required for this.
SRE
The SRE team has been an unsung hero at MoEngage. They have been the backbone of almost everything that happens at MoEngage, and I can’t imagine any team that wouldn’t need their expertise.
Kubernetes (k8s) and SRE are synonymous these days, and our team hit significant milestones last year. About 40% of our services are running on k8s now. Many developers have shown interest in moving to k8s, which shows the impact the k8s team has had.
The infrastructure costs team has done a commendable job of keeping infrastructure costs in check. A dedicated focus on margins, daily audits, an approval process for infra launches, and, most importantly, tagging the infrastructure at business and service levels are some of the process improvements that we have made. We have also moved our infrastructure to newer generations for better performance at lower costs: we moved to new EC2 generations, upgraded our disks from gp2 to gp3, looked deeper at storage costs, and implemented intelligent policies to move unused data to cold storage layers.
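A daily tagging audit can be as simple as the sketch below, which flags resources that are missing a business-level or service-level tag. The tag keys `business` and `service`, the function name `find_untagged`, and the sample inventory are assumptions for illustration; in practice the inventory would come from the cloud provider’s APIs.

```python
from typing import Dict, List

# Hypothetical tag keys; the real keys used for cost attribution may differ.
REQUIRED_TAG_KEYS = ("business", "service")


def find_untagged(resources: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each resource id to the required tag keys it is missing.

    `resources` maps a resource id to its tags. A tag with an empty value
    is treated as missing. Fully tagged resources are left out of the result.
    """
    missing: Dict[str, List[str]] = {}
    for resource_id, tags in resources.items():
        absent = [key for key in REQUIRED_TAG_KEYS if not tags.get(key)]
        if absent:
            missing[resource_id] = absent
    return missing


# Example inventory with made-up resource ids and tag values.
inventory = {
    "i-001": {"business": "campaigns", "service": "push-sender"},
    "i-002": {"business": "analytics"},
    "vol-003": {},
}
gaps = find_untagged(inventory)
```

Here `gaps` reports that `i-002` lacks a `service` tag and `vol-003` lacks both, which is the kind of list an audit can hand to the owning teams.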
Automation is another area where the SRE team has improved significantly. Infrastructure as code has been our guideline for about two years now, and we have always tried to avoid repetitive work, at least when launching new infra. We have 3 DCs in production, and last year we started running into parity problems in monitoring and alerting across them. We have implemented a couple of solutions to tackle this: we codified the alarms, and we can now detect alert parity problems in production. We are in better shape than last year to launch a new DC via code rather than manually. We are planning to add two more DCs in 2022, and it will be an exhilarating ride for all the teams at MoEngage.
The SRE team has set up a dedicated on-call and is focusing on improving our staging environment, thereby reducing the time it takes for products to go live.
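Once alarms are codified, checking parity between DCs reduces to a set comparison. The sketch below is one minimal way to do it, assuming every DC is expected to carry the same set of alarm names; the function name `alert_parity_gaps`, the DC names, and the alarm names are hypothetical.

```python
from typing import Dict, Set


def alert_parity_gaps(alarms_by_dc: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Return, per DC, the alarms defined in some other DC but missing here.

    The expected set is the union of alarms across all DCs, so a DC with
    no gaps does not appear in the result.
    """
    expected: Set[str] = set().union(*alarms_by_dc.values())
    gaps: Dict[str, Set[str]] = {}
    for dc, alarms in alarms_by_dc.items():
        missing = expected - alarms
        if missing:
            gaps[dc] = missing
    return gaps


# Example with made-up DC and alarm names.
parity = alert_parity_gaps(
    {
        "dc1": {"kafka-lag", "api-5xx", "disk-usage"},
        "dc2": {"kafka-lag", "api-5xx"},
        "dc3": {"kafka-lag", "api-5xx", "disk-usage"},
    }
)
```

In this example only `dc2` is reported, with `disk-usage` as the missing alarm. Running a check like this before and after launching a new DC makes parity drift visible early.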
Notable Changes
There have been other significant changes from other teams and areas as well. We have built stats as a service, which will make it faster for other teams to integrate stats. Our streams service has reached a scale of 1 billion events per day, and we have simply been scaling it horizontally since its inception. We rewrote our campaign reports service so that our customers can get more value from it. We have taken some of our communications code live in Go, and we will see more of it in 2022.
We saw massive participation and energy from the teams for the last year’s hackathon. A lot of new ideas came to light from this initiative. Top ideas will get implemented and rolled out to production.
Learnings
We learn a lot every year as a team and as individuals, and last year was no exception. We overcame some incidents, and our tradition of filing RCAs and ensuring that whole groups, rather than just individuals, learn from them has been helping us. We now know more than we did last year about how to build better, more reliable, and more scalable systems. We have identified plenty of areas to work on in our tech stack in 2022, and we started planning for 2022 a couple of months early. These are some of the questions we have asked ourselves throughout the year: How do we execute inter-team projects better? What changes are needed to provide more accurate forecasts to the business? What are the team bottlenecks, if any, and what are the process gaps, if any? We have found answers to these questions, but I expect many new questions in 2022.
There has never been a dull day at MoEngage. I am very proud of the team’s achievements in 2021 and couldn’t be more excited for 2022. It’s still day 1 (I know it’s a cliche, but hey, it works), and we look forward to adding a lot more customer value in 2022. Also, we are hiring; check out our open opportunities.