Building a Scalable Notification Service

October 07, 2022

Hey Everyone!

Today we’ll be talking about

How Razorpay Scaled Their Notification Service
- A brief intro to webhooks and using them for sending notifications
- How MySQL became a bottleneck for Razorpay's Notification Service
- Solving the issue by using Kinesis to decouple the system
Tech Snippets
- A great blog explaining how the Stable Diffusion model (AI image generation) works
- A GitHub repo with resources for CTOs (or aspiring CTOs) on architecture, product management, hiring and more
- Lessons from Software Engineering at Google
- Pair Programming Antipatterns

How Razorpay Scaled Their Notification Service

Razorpay is a payments service and neobank that is one of India’s most valuable fintech startups (valued at $7.5 billion). The company powers payments for over 8 million businesses in India and has been growing extremely quickly (payments processed went from $60 billion in 2021 to $90 billion in 2022). With this growth in payment volume and users, transactions on the system have been growing exponentially.

With this immense increase, engineers have to invest a lot of resources in redesigning the system architecture to make it more scalable. One service that had to be redesigned was Razorpay’s Notification service, a platform that handled all the customer notification requirements for SMS, E-Mail and webhooks.

A few challenges popped up with how the company handled webhooks, the most popular way that users get notifications from Razorpay.

We’ll give a brief overview of webhooks, talk about the challenges Razorpay faced, and how they addressed them. You can read the full blog post here.

Brief Explanation of Webhooks

A webhook can be thought of as a “reverse API”. While a traditional REST API is pull, a webhook is push.

For example, let’s say you use Razorpay to process payments for your app and you want to know whenever someone has purchased a monthly subscription.

With a REST API, one way of doing this would be to send a GET request to Razorpay’s servers every few minutes to check whether there have been any new transactions (you are pulling information). However, this results in unnecessary load for both your servers and Razorpay’s.

On the other hand, if you use a webhook, you can set up a route on your backend with logic that you want to execute whenever a transaction is made. Then, you can give Razorpay the URL for this route and the company will send you an HTTP POST request whenever someone purchases a monthly subscription (information is being pushed to you).
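As a rough illustration of the receiving side, here is a minimal sketch of a webhook route using only Python's standard library. The event name `subscription.charged`, the payload shape, the port, and the helper names `process_event` and `serve` are assumptions made for this example, not details taken from Razorpay's documentation. A production handler would also verify that the request really came from the provider, which is omitted here.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def process_event(body: bytes) -> int:
    """Run the business logic for one webhook delivery and return an HTTP status."""
    try:
        event = json.loads(body)
    except ValueError:
        return 400  # payload is not valid JSON
    if not isinstance(event, dict):
        return 400  # this sketch expects a JSON object
    # Hypothetical event name and payload fields, for illustration only.
    if event.get("event") == "subscription.charged":
        print("new subscription payment:", event.get("id"))
    return 200  # any 2xx status acknowledges receipt


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes the sender announced.
        length = int(self.headers.get("Content-Length", 0))
        status = process_event(self.rfile.read(length))
        self.send_response(status)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Start the receiver; the URL of this server is what you would register."""
    HTTPServer(("0.0.0.0", port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the listener. Keeping the business logic in `process_event` makes it easy to test without opening a socket.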

You then respond with a 2xx HTTP status code to acknowledge that you’ve received the webhook. If you don’t respond, then Razorpay will retry the delivery. They’ll continue retrying for 24 hours with exponential backoff.
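To make the retry behavior concrete, here is a small sketch that computes when retries would happen under exponential backoff within a 24-hour window. The 60-second starting delay and the doubling factor are assumptions chosen for illustration; the actual intervals Razorpay uses are not stated here.

```python
def retry_schedule(base_seconds=60.0, factor=2.0, window_seconds=24 * 60 * 60):
    """Return the times (seconds after the first failed delivery) of each retry.

    The delay between attempts grows by `factor` each time, and retries
    stop once the next attempt would fall outside the window.
    """
    schedule = []
    delay = base_seconds
    elapsed = delay
    while elapsed <= window_seconds:
        schedule.append(elapsed)
        delay *= factor      # wait longer before each subsequent attempt
        elapsed += delay
    return schedule


if __name__ == "__main__":
    for attempt, at in enumerate(retry_schedule(), start=1):
        print(f"retry {attempt}: {at / 60:.0f} minutes after the first failure")
```

With these assumed values, the gaps between attempts grow from one minute to several hours, so a receiver that is briefly down gets retried quickly while one that is down for a long time is not hammered.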

Here’s a sample implementation of a webhook in ExpressJS and you can watch this video to see how webhooks are set up with Razorpay. They have many webhooks to alert users on things like successful payments, refunds, disputes, canceled subscriptions and more. You can view their docs here if you’re curious.

Existing Notification Flow

Here’s the existing flow for how Notifications work with Razorpay.

API nodes for the Notification service will receive the request and validate it. After validation, they’ll send the notification message to an AWS SQS queue.
Worker nodes will consume the notification message from the SQS queue and send out the notification (SMS, webhook and e-mail). They will write the result of the execution to a MySQL database and also push the result to Razorpay’s data lake.
Scheduler nodes will check the MySQL databases for any notifications that were not sent out successfully and push them back to the SQS queue to be processed again.
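The three roles above can be sketched in a few lines of Python. This is a toy model under stated assumptions: an in-memory `deque` stands in for the SQS queue, a dict stands in for the MySQL status table, the data lake push is left out, and every name (`Notification`, `api_node`, `worker_node`, `scheduler_node`) is made up for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Notification:
    id: int
    channel: str   # "sms", "email" or "webhook"
    payload: str


queue = deque()    # stand-in for the SQS queue
status_db = {}     # stand-in for MySQL: id -> (notification, status)


def api_node(notification):
    """Validate the request, then enqueue it."""
    if notification.channel not in {"sms", "email", "webhook"}:
        raise ValueError(f"unknown channel: {notification.channel}")
    queue.append(notification)


def worker_node(send):
    """Consume one message, attempt delivery, and record the result."""
    notification = queue.popleft()
    ok = send(notification)
    status_db[notification.id] = (notification, "sent" if ok else "failed")


def scheduler_node():
    """Find failed notifications in the database and push them back to the queue."""
    for notification, status in list(status_db.values()):
        if status == "failed":
            status_db[notification.id] = (notification, "queued")
            queue.append(notification)
```

Note that every worker writes directly to the status table in this design, which is the coupling that later becomes a problem as the number of workers grows.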

This system could handle a load of up to 2,000 transactions per second and regularly served a peak load of 1,000 transactions per second. However, at these levels, performance started degrading and Razorpay could no longer meet its SLAs: P99 latency rose from 2 seconds to 4 seconds (99% of requests were handled within 4 seconds, and the team wanted to get this back down to 2 seconds).
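If P99 is unfamiliar, the sketch below computes it with the nearest-rank method. The sample latencies are invented for illustration, and monitoring tools may use slightly different percentile definitions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Invented data: 980 fast requests and 20 slow ones.
latencies = [0.5] * 980 + [4.0] * 20

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
```

Here the median (`p50`) is 0.5 seconds while `p99` is 4.0 seconds, which shows why tail percentiles are used for SLAs: a small fraction of slow requests is invisible in the median but dominates the P99.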

Challenges when Scaling Up

With the increase in transactions, the Razorpay team encountered a few scalability issues:

Database Bottleneck - Read query performance was getting worse and the database couldn’t scale to meet the required input/output operations per second (IOPS). The team vertically scaled the database from a 2x.large to an 8x.large but this wasn’t a long-term solution considering the pace of growth.
Customer Responses - In order for the webhook to be considered delivered, customers need to respond with a 2xx HTTP status code. If they don’t, Razorpay will retry sending the webhook. Some customers have slow response times for webhooks and this was causing worker nodes to be blocked while waiting for a user response.
Unexpected Increases in Load - Load would increase unexpectedly for certain events/days and this would impact the notification platform.

To address these issues, the Razorpay team decided on several goals:

Add the ability to prioritize notifications
Eliminate the database bottleneck
Manage SLAs for customers who don’t respond promptly to webhooks

Prioritize Incoming Load

Not all of the incoming notification requests were equally critical so engineers decided to create different queues for events of different priority.

P0 queue - all critical events with highest impact on business metrics are pushed here.
P1 queue - the default queue for all notifications other than P0.
P2 queue - all burst events (very high TPS in a short span) go here.

This separated priorities and allowed the team to set up a rate limiter with different configurations for each queue.

All the P0 events had a higher limit than the P1 events and notification requests that breached the rate limit would be sent to the P2 queue.
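One way to picture this routing is the sketch below. It assumes a simple fixed-window rate limiter (the type of limiter Razorpay uses is not specified here), and the class names, limits, and the `critical` flag are all hypothetical.

```python
from collections import deque


class FixedWindowLimiter:
    """Allow at most `limit` events in each one-second window."""

    def __init__(self, limit):
        self.limit = limit
        self.window = None
        self.count = 0

    def allow(self, now):
        window = int(now)
        if window != self.window:      # a new second has started
            self.window, self.count = window, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class PriorityRouter:
    def __init__(self, p0_limit, p1_limit):
        self.queues = {"P0": deque(), "P1": deque(), "P2": deque()}
        self.limiters = {
            "P0": FixedWindowLimiter(p0_limit),
            "P1": FixedWindowLimiter(p1_limit),
        }

    def route(self, event, critical, now):
        """Pick P0 or P1 by criticality; overflow beyond the limit goes to P2."""
        target = "P0" if critical else "P1"
        if not self.limiters[target].allow(now):
            target = "P2"
        self.queues[target].append(event)
        return target
```

For example, with `PriorityRouter(p0_limit=3, p1_limit=2)`, four non-critical events arriving in the same second are routed to P1, P1, P2, P2, so a burst cannot crowd out the default queue.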

Reducing the Database Bottleneck

In the earlier implementation, as the traffic on the system scaled, the worker pods would also autoscale.

The increase in worker nodes would ramp up the input/output operations per second on the database, which would cause the database performance to severely degrade.

To address this, engineers decoupled the system by writing the notification requests to the database asynchronously with AWS Kinesis. Kinesis is a fully managed data streaming service offered by Amazon and it’s very commonly used with real-time big data processing applications.

They added a Kinesis stream between the workers and the database, so the worker nodes write the status of each notification message to Kinesis instead of MySQL. The worker nodes could autoscale when necessary and Kinesis could absorb the increase in write load. Meanwhile, engineers controlled the rate at which data was written from Kinesis to MySQL, so they could keep the database write load relatively constant.
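The core idea, a buffer that absorbs bursts while the database is written to at a capped rate, can be sketched without any AWS dependencies. In this toy model a `deque` stands in for the Kinesis stream and a dict stands in for MySQL; the class name, the notion of a "tick", and the cap of 100 writes per tick are assumptions for illustration.

```python
from collections import deque


class BufferedStatusWriter:
    def __init__(self, max_writes_per_tick):
        self.stream = deque()   # stand-in for the Kinesis stream
        self.db = {}            # stand-in for the MySQL status table
        self.max_writes_per_tick = max_writes_per_tick

    def publish(self, notification_id, status):
        """Called by workers: cheap append, no database access."""
        self.stream.append((notification_id, status))

    def drain_tick(self):
        """Called by the consumer on a fixed schedule: write at most
        `max_writes_per_tick` records to the database, in order."""
        written = 0
        while self.stream and written < self.max_writes_per_tick:
            notification_id, status = self.stream.popleft()
            self.db[notification_id] = status
            written += 1
        return written


writer = BufferedStatusWriter(max_writes_per_tick=100)
for i in range(250):                 # a burst from autoscaled workers
    writer.publish(i, "sent")
per_tick = [writer.drain_tick() for _ in range(4)]
```

The burst of 250 records reaches the database as 100, 100, 50 and then 0 writes per tick. The trade-off is that statuses land in the database slightly later than they would with direct writes.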

Managing Webhooks with Delayed Responses

A key variable with maintaining the latency SLA is the customer’s response time. Razorpay is sending a POST request to the user’s URL and expects a response with a 2xx status code for the webhook delivery to be considered successful.

When the webhook call is made, the worker node is blocked as it waits for the customer to respond. Some customer servers don’t respond quickly and this can affect overall system performance.

To solve this, engineers came up with the concept of Quality of Service for customers: a customer with slow response times will have their webhook notifications handled at a lower priority for the next few minutes. Afterwards, Razorpay will re-check whether the customer’s servers are responding quickly.
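A minimal sketch of this idea follows. The 2-second threshold, the 5-minute penalty window, and all names are assumptions for the example; the actual thresholds and mechanism Razorpay uses are not described here.

```python
class CustomerQoS:
    """Temporarily lower the priority of customers whose endpoints respond slowly."""

    def __init__(self, slow_threshold_s=2.0, penalty_s=300.0):
        self.slow_threshold_s = slow_threshold_s
        self.penalty_s = penalty_s
        self.demoted_until = {}   # customer_id -> time when the penalty ends

    def record_response(self, customer_id, response_time_s, now):
        """Called after each webhook call with the observed response time."""
        if response_time_s > self.slow_threshold_s:
            self.demoted_until[customer_id] = now + self.penalty_s

    def priority(self, customer_id, now):
        """Return the priority to use for this customer's next webhook."""
        until = self.demoted_until.get(customer_id)
        if until is not None and now < until:
            return "low"
        # Penalty expired: the next delivery runs at normal priority and
        # acts as the re-check, since a slow response demotes again.
        self.demoted_until.pop(customer_id, None)
        return "normal"
```

Because the penalty expires on its own, a customer who fixes their endpoint returns to normal priority without any manual intervention.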

Additionally, Razorpay configured alerts that can be sent to the customers so they can check and fix the issues on their end.

Observability

It’s extremely critical that Razorpay’s service meets SLAs and stays highly available. If you’re processing payments for your website with their service, then you’ll probably find outages or severely degraded service to be unacceptable.

To ensure the systems scale well with increased load, Razorpay built a robust system around observability.

They have dashboards and alerts in Grafana to detect any anomalies, monitor the system’s health and analyze their logs. They also use distributed tracing tools to understand the behavior of the various components in their system.

For more details, you can read the full blog post here.

Tech Dive - Load Balancers (Quastor Pro)

In our latest tech dive, we delve into load balancers and talk about the purpose they serve.

We go through layer 4 and layer 7 load balancers and talk about the pros/cons and differences between each.

We also talk about load balancing strategies like Round Robin, Least Connections, Hashing, Consistent Hashing and more. We talk about the strengths and weaknesses of each.

This is part of Quastor Pro, a section with explainers on the backend concepts that we talk about in our Quastor summaries.

You can get 25% off below.

I'd highly recommend using your Learning & Development budget for Quastor Pro.

Tech Snippets

Illustrated Stable Diffusion - Jay Alammar runs a fantastic blog where he gives very easy to understand explanations on the latest developments in machine learning. He wrote a great blog post on how the Stable Diffusion model (for AI image generation) works.

An awesome GitHub repo with resources for CTOs (or aspiring CTOs). The repo contains resources on:
- Software Development Processes (Scrum/Agile, CI/CD, etc.)
- Software Architecture
- Product Management
- Hiring for technical roles

What I learned from Software Engineering at Google - Software Engineering at Google is an awesome book on the engineering culture, processes and tools at Google. Swizec wrote a great blog post where he goes through his takeaways from the book and how he's applying the lessons for software engineering that isn't at Google-scale.

Pair Programming Antipatterns - Pair programming can be an excellent tool for educating junior developers on the codebase. However, there are quite a few anti-patterns you’ll want to avoid. This article gives a great list of some of them for the person leading the pair programming session (the driver) and the person following (the navigator).

Interview Question

You are given two sorted arrays that can have different sizes.

Return the median of the two sorted arrays.

Here’s the question on LeetCode.