How to Build a Scalable Notification Service
Plus, Stable Diffusion explained visually, why you should open source your company, pair programming antipatterns and more.
Hey Everyone!
Today we’ll be talking about
How Razorpay Scaled Their Notification Service
A brief intro to webhooks and using them for sending notifications
How MySQL became a bottleneck for Razorpay's Notification Service
Solving the issue by using Kinesis to decouple the system
Tech Snippets
Stable Diffusion Illustrated (Machine Learning)
Should You Open Source Your Company? (Leadership)
Pair Programming Antipatterns (General)
9 Lessons Learned When Becoming a VP of Engineering (Career Growth)
Algorithms Implemented in Rust (Rust Numba One!!)
How Razorpay Scaled Their Notification Service
Razorpay is one of the biggest fintech companies in India, specializing in payment processing (they’re often referred to as the Stripe of India). They handle both online and local payment processing, offer banking services, provide loans/financing and much more.
They’ve been growing incredibly fast and went from a $1 billion dollar valuation to a $7.5 billion dollar valuation in a single year. Having your stock options 7x in a single year sounds pretty good (although, the $7.5 billion dollar fundraising was in 2021 and they haven’t raised since then... Unfortunately, money printers aren’t going brrrr anymore).
The company handles payments for close to 10 million businesses and processed over $100 billion in total payment volume last year.
The massive growth in payment volume and users is obviously great if you’re an employee there, but it also means huge headaches for the engineering team. Transactions on the system have been growing exponentially, so engineers have to invest a lot of resources in redesigning the system architecture to make it more scalable.
One service that had to be redesigned was Razorpay’s Notification service, a platform that handled all the customer notification requirements for SMS, E-Mail and webhooks.
A few challenges popped up with how the company handled webhooks, the most popular way that users get notifications from Razorpay.
We’ll give a brief overview of webhooks, talk about the challenges Razorpay faced, and how they addressed them. You can read the full blog post here.
Super Brief Explanation of Webhooks
A webhook can be thought of as a “reverse API”. While a traditional REST API is pull-based (you ask the server for data), a webhook is push-based (the server sends you data when an event happens).
For example, let’s say you use Razorpay to process payments for your app and you want to know whenever someone has purchased a monthly subscription.
With a REST API, one way of doing this is with a polling approach. You send a GET request to Razorpay’s servers every few minutes to check whether there are any new transactions. You’ll get a ton of “No transactions” replies and only a few “Transaction processed” replies, so you’re putting a lot of unnecessary load on both your servers and Razorpay’s.
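As a rough sketch of what that polling loop looks like (the endpoint URL and response shape below are made up purely for illustration, not Razorpay’s actual API):

```python
import time
import requests

# Hypothetical endpoint and response shape, just to illustrate the polling pattern
TRANSACTIONS_URL = "https://api.example.com/v1/transactions"

def poll_for_transactions(api_key):
    while True:
        # Ask the server "anything new?" even when nothing has happened
        resp = requests.get(TRANSACTIONS_URL, auth=(api_key, ""), params={"status": "captured"})
        transactions = resp.json().get("items", [])
        if transactions:
            print(f"Transaction processed: {len(transactions)} new payment(s)")
        else:
            print("No transactions")  # the vast majority of replies look like this
        time.sleep(300)  # wait a few minutes and ask again
```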
With a webhook-based architecture, Razorpay can stop getting all those HTTP GET requests from you and all their other users. Instead, when they process a transaction for you, they’ll send you an HTTP request notifying you about the event.
You first set up a route on your backend (yourSite.com/api/payment/success or whatever) and give them the route. Then, Razorpay will send an HTTP POST request to that route whenever someone purchases something from your site.
You’ll obviously have to set up backend logic on your server for that route and process the HTTP POST request (you might update a payments database, send yourself a text message with a money emoji, whatever).
You should also add logic to the route where you respond with a 2xx HTTP status code to Razorpay to acknowledge that you’ve received the webhook. If you don’t respond, then Razorpay will retry the delivery. They’ll continue retrying for 24 hours with exponential backoff.
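Here’s a minimal sketch of what that receiving route could look like, using Flask (the route path matches the example above; the commented-out database update is a hypothetical placeholder):

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/api/payment/success", methods=["POST"])
def handle_payment_webhook():
    event = request.get_json()

    # Do something useful with the event payload, e.g. update your payments
    # database (save_payment here would be your own helper)
    # save_payment(event)

    # Respond with a 2xx status code so Razorpay knows the webhook was
    # received; otherwise it will keep retrying for 24 hours with
    # exponential backoff
    return "", 200
```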
If you’re interested in learning more, check out our deep dive on API paradigms where we talk about Request-Response, Webhooks, WebSockets, GraphQL and more.
Existing Notification Flow
Here’s the existing flow for how Notifications work with Razorpay.
API nodes for the Notification service will receive the request and validate it. After validation, they’ll send the notification message to an AWS SQS queue.
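A simplified sketch of that enqueue step with boto3 (the queue URL and validation rules are placeholders, not Razorpay’s actual setup):

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/notifications"  # placeholder

def enqueue_notification(notification):
    # Validate the incoming request before accepting it (real validation omitted)
    if not notification.get("type") or not notification.get("payload"):
        raise ValueError("invalid notification request")

    # Hand the message off to SQS; worker nodes pick it up asynchronously
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(notification))
```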
Worker nodes will consume the notification message from the SQS queue and send out the notification (SMS, webhook and e-mail). They will write the result of the execution to a MySQL database and also push the result to Razorpay’s data lake (a data store that holds unstructured data).
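The worker side could look roughly like this (the result-persistence helpers are hypothetical stand-ins for the MySQL and data lake writers):

```python
import json
import boto3
import requests

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/notifications"  # placeholder

def worker_loop():
    while True:
        # Long-poll SQS for a batch of notification messages
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            notification = json.loads(msg["Body"])

            # For a webhook, POST the event to the customer's URL and treat
            # any 2xx response as a successful delivery
            r = requests.post(
                notification["webhook_url"], json=notification["payload"], timeout=10
            )
            result = {"id": notification["id"], "success": 200 <= r.status_code < 300}

            # Persist the outcome so failed deliveries can be retried later
            # (write_result_to_mysql / push_to_data_lake are hypothetical helpers)
            # write_result_to_mysql(result)
            # push_to_data_lake(result)

            # Remove the message from the queue once it has been handled
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```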
Scheduler nodes will check the MySQL database for any notifications that were not sent out successfully and push them back to the SQS queue to be processed again.
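And the scheduler’s retry sweep might look something like this (the connection details, table, and column names are assumptions, not Razorpay’s actual schema):

```python
import json
import boto3
import pymysql

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/notifications"  # placeholder

def requeue_failed_notifications():
    # Connection details, table and column names are illustrative only
    conn = pymysql.connect(host="db-host", user="app", password="secret", database="notifications")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT id, payload FROM notification_results WHERE success = 0")
            for notification_id, payload in cur.fetchall():
                # Push the failed notification back onto the queue for another attempt
                sqs.send_message(
                    QueueUrl=QUEUE_URL,
                    MessageBody=json.dumps({"id": notification_id, "payload": json.loads(payload)}),
                )
    finally:
        conn.close()
```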
This system could handle a load of up to 2,000 transactions per second and regularly served a peak load of 1,000 transactions per second.
However, at these levels, system performance started degrading and Razorpay wasn’t able to meet their SLAs, with P99 latency increasing from 2 seconds to 4 seconds (99% of requests were handled within 4 seconds, and the team wanted to get that back down to 2 seconds).
Challenges when Scaling Up
These performance problems when scaling up came down to a few issues…
Database Bottleneck - Read query performance was getting worse and the database couldn’t scale to meet the required input/output operations per second (IOPS). The team vertically scaled the database from a 2xlarge to an 8xlarge instance, but this wouldn’t work in the long term considering the pace of growth. (It’s best to start thinking about a distributed solution while you still have room to vertically scale, rather than frantically trying to shard when you’re already maxed out at AWS’s largest instance sizes.)
Customer Responses - For a webhook to be considered delivered, the customer needs to respond with a 2xx HTTP status code. If they don’t, Razorpay will retry sending the webhook. Some customers had slow response times for webhooks, which left worker nodes blocked while waiting for a response.
Unexpected Increases in Load - Load would increase unexpectedly for certain events/days and this would impact the notification platform.
In order to address these issues, the Razorpay team decided on several goals:
Add the ability to prioritize notifications
Eliminate the database bottleneck
Manage SLAs for customers who don’t respond promptly to webhooks
Prioritize Incoming Load
Not all of the incoming notification requests were equally critical, so engineers decided to create different queues for events of different priorities.
P0 queue - All critical events with the highest impact on business metrics are pushed here.
P1 queue - The default queue for all notifications other than P0.
P2 queue - All burst events (very high TPS in a short span) go here.
This separated priorities and allowed the team to set up a rate limiter with different configurations for each queue.
All the P0 events had a higher limit than the P1 events, and notification requests that breached the rate limit would be sent to the P2 queue.
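As a rough sketch of the routing idea (the limits, queue URLs, and the in-memory rate limiter below are assumptions for illustration, not Razorpay’s implementation):

```python
import time
from collections import defaultdict

# Placeholder queue URLs and made-up per-second limits
QUEUE_URLS = {"P0": "...p0-queue-url...", "P1": "...p1-queue-url...", "P2": "...p2-queue-url..."}
RATE_LIMITS = {"P0": 2000, "P1": 500}

class SimpleRateLimiter:
    """Counts events per priority within the current one-second window."""
    def __init__(self):
        self.window = int(time.time())
        self.counts = defaultdict(int)

    def allow(self, priority, limit):
        now = int(time.time())
        if now != self.window:  # new second, reset the counters
            self.window, self.counts = now, defaultdict(int)
        self.counts[priority] += 1
        return self.counts[priority] <= limit

limiter = SimpleRateLimiter()

def pick_queue(notification):
    priority = "P0" if notification.get("critical") else "P1"
    # Requests that breach their priority's rate limit spill over to the burst queue
    if priority in RATE_LIMITS and not limiter.allow(priority, RATE_LIMITS[priority]):
        priority = "P2"
    return QUEUE_URLS[priority]
```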
Reducing the Database Bottleneck
In the earlier implementation, as the traffic on the system scaled, the worker pods would also autoscale.
The increase in worker nodes would ramp up the input/output operations per second on the database, which would cause the database performance to severely degrade.
To address this, engineers decoupled the system by writing the notification requests to the database asynchronously with AWS Kinesis.
Kinesis is a fully managed data streaming service offered by Amazon and it’s very useful to understand when building real-time big data processing applications (or to look smart in a system design interview).
They added a Kinesis stream between the workers and the database so the worker nodes will write the status for the notification messages to Kinesis instead of MySQL. The worker nodes could autoscale when necessary and Kinesis could handle the increase in write load. However, engineers had control over the rate at which data was written from Kinesis to MySQL, so they could keep the database write load relatively constant.
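A stripped-down sketch of the worker side of that change with boto3 (the stream name and record shape are illustrative):

```python
import json
import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "notification-results"  # placeholder name

def record_result(result):
    # Workers write delivery results to Kinesis instead of MySQL directly,
    # so worker autoscaling no longer translates into database write spikes
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(result).encode("utf-8"),
        PartitionKey=str(result["id"]),
    )
```

A separate consumer then drains the stream and writes to MySQL at a controlled rate, which is what keeps the database write load roughly constant.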
Managing Webhooks with Delayed Responses
A key variable in maintaining the latency SLA is the customer’s response time. Razorpay sends a POST request to the user’s URL and expects a response with a 2xx status code for the webhook delivery to be considered successful.
When the webhook call is made, the worker node is blocked as it waits for the customer to respond. Some customer servers don’t respond quickly and this can affect overall system performance.
To solve this, engineers came up with the concept of Quality of Service for customers: a customer with a delayed response time will have their webhook notifications get decreased priority for the next few minutes. Afterwards, Razorpay re-checks whether the customer’s servers are responding quickly again.
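One way to sketch this Quality of Service idea (the threshold, penalty window, and in-memory store below are all assumptions for illustration):

```python
import time

SLOW_THRESHOLD_SECONDS = 3        # illustrative cutoff for a "slow" customer
PENALTY_WINDOW_SECONDS = 5 * 60   # deprioritize for a few minutes, then re-check

deprioritized_until = {}  # customer_id -> timestamp when normal priority resumes

def record_webhook_response(customer_id, response_time_seconds):
    # If a customer's endpoint responds slowly, demote their webhooks for a while
    if response_time_seconds > SLOW_THRESHOLD_SECONDS:
        deprioritized_until[customer_id] = time.time() + PENALTY_WINDOW_SECONDS

def priority_for(customer_id, default_priority="P1"):
    # Slow customers temporarily get lower priority; once the window expires,
    # their response times are effectively re-evaluated on the next delivery
    if time.time() < deprioritized_until.get(customer_id, 0):
        return "P2"
    return default_priority
```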
Additionally, Razorpay configured alerts that can be sent to the customers so they can check and fix the issues on their end.
Observability
It’s extremely critical that Razorpay’s service meets SLAs and stays highly available. If you’re processing payments for your website with their service, then you’ll probably find outages or severely degraded service to be unacceptable.
To ensure the systems scale well with increased load, Razorpay built a robust system around observability.
They have dashboards and alerts in Grafana to detect any anomalies, monitor the system’s health and analyze their logs. They also use distributed tracing tools to understand the behavior of the various components in their system.
For more details, you can read the full blog post here.