Lessons from Big Tech on Building Resilient Payment Systems
Lessons from Shopify and DoorDash on building Large Scale Payment Systems. Plus, how Discord develops GenAI applications, Redis getting overtaken and more.
Hey Everyone!
Today we’ll be talking about
Lessons from Big Tech on Building Resilient Payment Systems
Implementing Idempotency with Idempotency Keys
Setting low timeouts
Implementing Circuit Breakers
Monitoring and Alerting with the Four Golden Signals
Load Testing with some examples from Big Tech
Tech Snippets
How Discord develops GenAI Applications
Valkey is Rapidly Overtaking Redis
How HubSpot saved Millions of Dollars on Logging
A Self-Hosted alternative to Cloud Services
Tips from Big Tech on Building Large Scale Payments Systems
One of the most fundamental things to get right in an app is the payment system. Issues with the payment system mean lots of lost revenue while you’re getting the problems sorted out. It’s even worse if you’re a company dealing with products in the real world (e-commerce, food delivery, healthcare, etc.).
In fact, in 2019, UberEats had an issue with their payments system that gave everyone in India free food orders for an entire weekend. The issue ended up costing the company millions of dollars (on the bright side, it was probably a great growth hack for getting people to download and sign up for the app).
Gergely Orosz was an engineering manager on the Payments team at Uber during the outage and he has a great video delving into what happened.
If you’d like to learn more about how big tech companies are building resilient payment systems, Shopify Engineering and DoorDash both published fantastic blog posts on lessons they learned regarding payment systems.
In today’s article, we’ll go through both blog posts and discuss some of the tips from Shopify and DoorDash on how to avoid burning millions of dollars of company money from a payment outage.
We’ll talk about concepts like idempotency, circuit breakers, timeouts, monitoring, alerting and more.
If you want to remember all the concepts we discuss in Quastor, you can download 100+ Anki Flash cards (open source, spaced-repetition cards) on everything we’ve discussed. Thanks for supporting Quastor!
Idempotency
An idempotent operation is one that can be performed multiple times without changing the result after the first operation. In other words, if you make the same request multiple times, all the subsequent requests (after the first one) should not do anything significant (change data, create a transaction, etc.).
This is crucial when it comes to payment systems. If you accidentally double-charge a customer, you won’t just make them angry. You’ll also (probably) get a chargeback from the customer’s credit card company. This hurts your reputation with your payment processor (too many chargebacks will lead to your account being suspended or terminated).
Making requests idempotent will minimize the chances of accidentally charging a customer multiple times for the same transaction.
Idempotency Keys
Shopify minimizes the chances of any double-charges by using Idempotency Keys. An idempotency key is a unique identifier that you include with each request to your payment processor. If you make the same payment request multiple times then your payment processor will ignore any subsequent requests with the same idempotency key.
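To make the idea concrete, here is a minimal sketch of how a processor might deduplicate by idempotency key. FakePaymentProcessor, its charge method, and the key format are all hypothetical names made up for illustration; this is not Shopify's code or any real processor's API. The sketch also assumes the processor returns the stored result for a repeated key, which is one common way of "ignoring" a duplicate.

```python
import threading


class FakePaymentProcessor:
    """Hypothetical in-memory processor that deduplicates by idempotency key."""

    def __init__(self):
        self._responses = {}  # idempotency key -> response stored for that key
        self._lock = threading.Lock()
        self.charges_created = 0

    def charge(self, idempotency_key, customer_id, amount_cents):
        with self._lock:
            # A repeated key gets the original response back; no new charge is made.
            if idempotency_key in self._responses:
                return self._responses[idempotency_key]
            self.charges_created += 1
            response = {
                "charge_id": f"ch_{self.charges_created}",
                "customer_id": customer_id,
                "amount_cents": amount_cents,
                "status": "succeeded",
            }
            self._responses[idempotency_key] = response
            return response


processor = FakePaymentProcessor()
first = processor.charge("order-1001", "cust_42", 2599)
# The client never saw the first response (say it timed out), so it retries
# with the same key.
retry = processor.charge("order-1001", "cust_42", 2599)
```

Both calls return the same charge, and only one charge is ever created. A real implementation would persist the keys in a database and expire them after some period.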
To generate idempotency keys, Shopify uses ULIDs (Universally Unique Lexicographically Sortable Identifiers). These are 128-bit identifiers that consist of a 48-bit timestamp followed by 80 bits of random data. Because the timestamp comes first, ULIDs sort by creation time, so they work great with data structures like B-trees for indexing.
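The layout described above can be sketched in a few lines. This is an illustrative generator written from the 48-bit timestamp plus 80-bit randomness layout, not Shopify's implementation; the function name new_ulid is an assumption. It encodes the 128 bits as 26 Crockford base32 characters, which is the usual ULID text form.

```python
import os
import time

# Crockford base32 alphabet (no I, L, O, U); characters are in ascending ASCII order.
CROCKFORD32 = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(timestamp_ms=None):
    """Return a 26-character ULID string: 48-bit ms timestamp + 80 random bits."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    if not 0 <= timestamp_ms < 2**48:
        raise ValueError("timestamp does not fit in 48 bits")
    randomness = int.from_bytes(os.urandom(10), "big")  # 10 bytes = 80 bits
    value = (timestamp_ms << 80) | randomness
    chars = []
    for _ in range(26):  # 26 characters * 5 bits covers the 128-bit value
        chars.append(CROCKFORD32[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


earlier = new_ulid(1_700_000_000_000)
later = new_ulid(1_700_000_000_001)
```

Since the timestamp occupies the most significant bits, a ULID created in a later millisecond always sorts after one created earlier. Two ULIDs from the same millisecond are ordered by their random part in this sketch.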
Timeouts
Another important factor to consider in your payment system is timeouts. A timeout is just a predetermined period of time that your app will wait before an error is triggered. Timeouts prevent the system from waiting indefinitely for a response from another system or device.
The Shopify team recommends setting low timeouts whenever possible with payment systems. When a user tries to submit a payment, you want to tell them as quickly as possible whether there was an error so they can retry.
There’s nothing worse than waiting tens of seconds for your payment to go through. It’s even worse if the user just assumes the payment has already failed and they retry it. If you don’t have proper idempotency then this might result in double payments.
The right timeout values depend on your application, but Shopify gave some useful guidelines that they use:
Database reads/writes - 1 second
Backend Requests - 5 seconds
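Here is a rough sketch of applying those two budgets in code. The helper call_with_timeout and the constant names are hypothetical, and the thread-pool approach is just one way to bound how long the caller waits. In practice you would usually set the timeout on the database driver or HTTP client itself.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Budgets taken from the guidelines above.
DB_TIMEOUT_S = 1.0
BACKEND_TIMEOUT_S = 5.0

_executor = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, *args):
    """Run fn(*args) in a worker thread and stop waiting after timeout_s seconds."""
    future = _executor.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker thread is not killed; we only stop waiting for it.
        future.cancel()  # has an effect only if the call has not started yet
        raise TimeoutError(f"call exceeded {timeout_s}s") from None


def slow_backend():
    """Stand-in for a backend call that takes too long."""
    time.sleep(0.3)
    return "late response"


try:
    call_with_timeout(slow_backend, 0.05)
    outcome = "completed"
except TimeoutError:
    outcome = "timed out"  # tell the user quickly so they can retry
```

Because the slow call may still finish after the caller has given up, a retry can reach the backend twice. That is exactly the case idempotency keys are meant to cover.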
Circuit Breakers
Another vital concept is circuit breakers. A circuit breaker is a mechanism that trips when it detects that a service is degraded (if it’s returning a lot of errors for example). Once a circuit breaker is tripped, it will prevent any additional requests from going to that degraded service.
This helps prevent cascading failures (where the failure of one service leads to the failure of other services) and also stops the system from wasting resources by continuously trying to access a service that is down.
The hardest thing about setting up circuit breakers is making sure they’re properly configured. If the circuit breaker is too sensitive, then it’ll trip unnecessarily and cause traffic to stop flowing to a healthy backend service. On the other hand, if it’s not sensitive enough then it won’t trip when it should and that’ll lead to system instability.
You’ll need to monitor your system closely and adjust your circuit breakers to ensure that they’re providing the desired level of protection without negatively impacting the user experience.
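A bare-bones circuit breaker might look like the sketch below. The class, the parameter names (failure_threshold, reset_timeout_s) and the default values are assumptions for illustration, not taken from the Shopify or DoorDash articles. It trips after a number of consecutive failures, rejects calls while open, and lets a trial request through once a cool-down has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_timeout_s = reset_timeout_s      # cool-down before a trial request
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; request not sent")
            # Cool-down elapsed: fall through and allow one trial request.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # trip, or re-trip after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The two knobs map directly onto the sensitivity trade-off above: a low failure_threshold trips sooner, and a long reset_timeout_s keeps traffic away from the service for longer. This sketch is not thread-safe and counts only consecutive failures; production breakers often use error rates over a time window instead.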
Monitoring and Alerting
In an ideal scenario, you’ll know that there are issues with your payment system before an outage happens. You should have monitoring and alerting set up so that you’re notified of any potential problems.
In the not-so-ideal scenario, you’ll find out that there are issues with your system when you see your app trending on Twitter with a bunch of tweets talking about how people can get free stuff.
To avoid this, you should have robust monitoring and alerting in place for your payment system.
Four Golden Signals
Google’s Site Reliability Engineering handbook lists four signals that you should monitor:
Latency - The time it takes for your system to process a request. You can measure this with median latency, percentiles (P90, P99), average latency, etc.
Traffic - The number of requests your system is receiving. This is typically measured in requests per minute or requests per second.
Errors - The number of requests that are failing. In any system at scale, you’ll always be seeing some amount of errors so you’ll have to figure out a proper threshold that indicates something is truly wrong with the system without constantly triggering false alarms.
Saturation - How much load the system is under relative to its total capacity. This could be measured by CPU/memory usage, bandwidth, etc.
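As a rough illustration, the four signals can be computed from a window of request records like this. The RequestRecord type, the field names, and the choice of CPU as the saturation measure are assumptions made for the example; real systems would normally pull these numbers from a metrics library rather than compute them by hand.

```python
import math
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestRecord:
    latency_ms: float
    ok: bool


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(len(sorted_values) * pct / 100))
    return sorted_values[rank - 1]


def golden_signals(records, window_s, cpu_used, cpu_capacity):
    """Summarize a non-empty window of requests into the four golden signals."""
    latencies = sorted(r.latency_ms for r in records)
    errors = sum(1 for r in records if not r.ok)
    return {
        "latency_p50_ms": median(latencies),
        "latency_p99_ms": percentile(latencies, 99),
        "traffic_rps": len(records) / window_s,    # requests per second
        "error_rate": errors / len(records),       # fraction of failed requests
        "saturation": cpu_used / cpu_capacity,     # fraction of capacity in use
    }


# Ten requests over a 5 second window, one of which failed.
records = [RequestRecord(latency_ms=10.0 * i, ok=(i != 10)) for i in range(1, 11)]
signals = golden_signals(records, window_s=5, cpu_used=3, cpu_capacity=4)
```

An alert would then compare values such as error_rate against a threshold you have tuned for your own system.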
Load Testing
A terrific way to be proactive in ensuring the reliability and stability of your payment system is by doing load testing. This is where you put your system under stress in a controlled environment to see what breaks. By identifying and fixing potential issues before they occur to real users, you can save yourself a lot of headaches (and money!).
Like all e-commerce businesses, Shopify has certain days of the year when they experience very high traffic (Black Friday, Cyber Monday, etc.). To ensure that their system can handle the increased volume, Shopify conducts extensive load tests before these events.
Load testing can be incredibly challenging to implement since you need to accurately simulate real-world conditions. Predicting how customers behave in the real world can be quite difficult. One way that companies do this is through request-replay (they capture a certain percentage of production requests in their API gateway and then replay these requests when load testing).
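The request-replay idea can be sketched as two small functions: one that samples a fraction of requests at capture time and one that replays the captured set against a handler. Everything here (capture_sample, replay, the multiplier parameter, the request shape) is hypothetical and only meant to show the flow. A real setup would replay against a test environment and handle sensitive payment data carefully.

```python
import random


def capture_sample(requests, sample_rate, rng):
    """Keep roughly sample_rate of the requests seen (e.g. at an API gateway)."""
    return [req for req in requests if rng.random() < sample_rate]


def replay(captured, handler, multiplier=1):
    """Send each captured request to handler `multiplier` times and tally outcomes."""
    results = {"sent": 0, "failed": 0}
    for req in captured:
        for _ in range(multiplier):
            results["sent"] += 1
            try:
                handler(req)
            except Exception:
                results["failed"] += 1
    return results


def handler(req):
    """Stand-in for the system under test; rejects non-positive amounts."""
    if req["amount_cents"] <= 0:
        raise ValueError("invalid amount")


production = [{"amount_cents": cents} for cents in (500, 1200, -1, 2599)]
captured = capture_sample(production, sample_rate=1.0, rng=random.Random(0))
results = replay(captured, handler, multiplier=3)
```

Raising the multiplier is a simple way to simulate a traffic spike with a realistic mix of requests.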
Load Testing Examples
Many companies have blogged about their load testing efforts so I’d recommend checking out
These are a couple of lessons from the articles. For more tips, you can check out the Shopify article here and the DoorDash article here.
Tech Snippets