Lessons from Big Tech on Building Resilient Payment Systems

Lessons from Shopify and DoorDash on building Large Scale Payment Systems. Plus, how Discord develops GenAI applications and more.

Hey Everyone!

Today we’ll be talking about

  • Lessons from Big Tech on Building Resilient Payment Systems

    • Implementing Idempotency with Idempotency Keys

    • Setting low timeouts

    • Implementing Circuit Breakers

    • Monitoring and Alerting with the Four Golden Signals

    • Load Testing with some examples from Big Tech

  • Tech Snippets

    • How Discord develops GenAI Applications

    • How HubSpot saved Millions of Dollars on Logging

    • A Self-Hosted alternative to Cloud Services

When you’re looking to build enterprise features for B2B customers, the most crucial decision is whether you should seek an external vendor or spend engineering time building them in-house. Making the wrong decision can cost hundreds of thousands of dollars in spend and lost revenue while adding unnecessary roadblocks for your engineering team.

Single-Sign On and Directory Sync (SCIM) are essential components that almost every startup will eventually add to their application.

WorkOS wrote a terrific ROI guide delving into the build vs. buy tradeoff for SSO and SCIM. It’s a great read if you want to learn about the factors you should consider when making this decision.

WorkOS does a detailed analysis of the investment around

  • Infrastructure - gathering customer requirements, customizing open source libraries, integration testing, reliability work, etc.

  • Feature expansion - developing additional features like domain verification, custom attributes and more.

  • Maintenance - ongoing overhead for managing security incidents, bugs and support tickets.

For a comprehensive dive into build vs. buy, read the full article.

sponsored

Tips from Big Tech on Building Large Scale Payments Systems

One of the most fundamental things to get right in an app is the payment system. Issues with the payment system mean lots of lost revenue while you’re getting the problems sorted out. It’s even worse if you’re a company dealing with products in the real world (e-commerce, food delivery, healthcare, etc.).

In fact, in 2019, UberEats had an issue with their payments system that gave everyone in India free food orders for an entire weekend. The issue ended up costing the company millions of dollars (on the bright side, it was probably a great growth hack for getting people to download and sign up for the app).

Gergely Orosz was an engineering manager on the Payments team at Uber during the outage and he has a great video delving into what happened.

If you’d like to learn more about how big tech companies are building resilient payment systems, Shopify Engineering and DoorDash both published fantastic blog posts on lessons they learned regarding payment systems.

In today’s article, we’ll go through both blog posts and discuss some of the tips from Shopify and DoorDash on how to avoid burning millions of dollars of company money from a payment outage.

We’ll talk about concepts like idempotency, circuit breakers, timeouts, monitoring, alerting and more.

Idempotency

An idempotent operation is one that can be performed multiple times without changing the result after the first operation. In other words, if you make the same request multiple times, all the subsequent requests (after the first one) should not do anything significant (change data, create a transaction, etc.).

This is crucial when it comes to payment systems. If you accidentally double-charge a customer, you won’t just make them angry. You’ll also (probably) get a chargeback from the customer’s credit card company. This hurts your reputation with your payment processor (too many chargebacks will lead to your account being suspended or terminated).

Making requests idempotent will minimize the chances of accidentally charging a customer multiple times for the same transaction.

Idempotency Keys

Shopify minimizes the chances of any double-charges by using Idempotency Keys. An idempotency key is a unique identifier that you include with each request to your payment processor. If you make the same payment request multiple times then your payment processor will ignore any subsequent requests with the same idempotency key.
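Here’s a minimal sketch of how the receiving side might honor an idempotency key. The PaymentProcessor class, its charge method and the in-memory dictionary are all hypothetical names used for illustration (a real processor would persist keys in a durable store), and this is not Shopify’s implementation.

```python
import threading


class PaymentProcessor:
    """Hypothetical in-memory processor that deduplicates by idempotency key."""

    def __init__(self):
        self._results = {}             # idempotency key -> stored charge result
        self._lock = threading.Lock()  # guards the check-then-charge sequence
        self.charges_made = 0

    def charge(self, idempotency_key, customer_id, amount_cents):
        with self._lock:
            # A repeated key returns the stored result instead of charging again.
            if idempotency_key in self._results:
                return self._results[idempotency_key]

            self.charges_made += 1
            result = {
                "charge_id": f"ch_{self.charges_made}",
                "customer_id": customer_id,
                "amount_cents": amount_cents,
            }
            self._results[idempotency_key] = result
            return result
```

If a client retries charge with the same key (after a timeout, for example), it gets back the original charge record and charges_made stays at 1.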

To generate idempotency keys, Shopify uses ULIDs (Universally Unique Lexicographically Sortable Identifiers). These are 128-bit identifiers that consist of a 48-bit timestamp followed by 80 bits of random data. Because the timestamp comes first, ULIDs sort by creation time, so they work great with data structures like B-trees for indexing.
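To make the layout concrete, here’s a rough sketch of generating a ULID by hand. It packs a 48-bit millisecond timestamp and 80 random bits into one integer and encodes it as 26 Crockford Base32 characters, which is the text format the ULID spec describes. The new_ulid function name is made up for this example, and in practice you’d probably reach for an existing ULID library.

```python
import os
import time

# Crockford Base32 alphabet (no I, L, O or U); it is in ascending ASCII order,
# so comparing encoded strings matches comparing the underlying numbers.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def new_ulid(timestamp_ms=None):
    """Return a 26-character ULID string (hypothetical helper for illustration)."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    if not 0 <= timestamp_ms < 2**48:
        raise ValueError("timestamp must fit in 48 bits")

    randomness = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    value = (timestamp_ms << 80) | randomness           # 128 bits in total

    # Emit 26 characters of 5 bits each, least significant group first.
    chars = []
    for _ in range(26):
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))
```

The first 10 characters encode the timestamp and the last 16 encode the random bits, so a ULID created in a later millisecond always sorts after an earlier one.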

Timeouts

Another important factor to consider in your payment system is timeouts. A timeout is just a predetermined period of time that your app will wait before an error is triggered. Timeouts prevent the system from waiting indefinitely for a response from another system or device.

The Shopify team recommends setting low timeouts whenever possible with payment systems. When a user tries to submit a payment, you want to tell them as quickly as possible whether there was an error so they can retry.

Few things are more frustrating than waiting tens of seconds for your payment to go through. It’s even worse if the user assumes the payment has already failed and retries it. If you don’t have proper idempotency, this might result in double payments.

Specific timeout numbers are obviously based on your application, but Shopify gave some useful guidelines that they use

  • Database reads/writes - 1 second

  • Backend Requests - 5 seconds
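As a rough illustration of enforcing those limits, here’s a sketch that wraps a call in a deadline using Python’s concurrent.futures. The call_with_timeout helper and the constant names are assumptions made for this example. In a real system you’d usually pass the timeout straight to your database driver or HTTP client instead, since this wrapper stops waiting but can’t interrupt work that is already running.

```python
import concurrent.futures

# Guideline values quoted above; tune them for your own application.
DB_TIMEOUT_S = 1.0
BACKEND_TIMEOUT_S = 5.0

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, *args):
    """Run fn(*args) and raise TimeoutError if it takes longer than timeout_s."""
    future = _executor.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Stop waiting so the caller can fail fast. cancel() only helps if the
        # task has not started yet; a running task keeps going in its thread.
        future.cancel()
        raise TimeoutError(f"call exceeded {timeout_s}s") from None
```

A caller would then write something like call_with_timeout(fetch_order, DB_TIMEOUT_S, order_id), where fetch_order stands in for your own database read.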

Circuit Breakers

Another vital concept is circuit breakers. A circuit breaker is a mechanism that trips when it detects that a service is degraded (if it’s returning a lot of errors for example). Once a circuit breaker is tripped, it will prevent any additional requests from going to that degraded service.

This helps prevent cascading failures (where the failure of one service leads to the failure of other services) and also stops the system from wasting resources by continuously trying to access a service that is down.

The hardest thing about setting up circuit breakers is making sure they’re properly configured. If the circuit breaker is too sensitive, then it’ll trip unnecessarily and cause traffic to stop flowing to a healthy backend service. On the other hand, if it’s not sensitive enough then it won’t trip when it should and that’ll lead to system instability.

You’ll need to monitor your system closely and adjust your circuit breakers to ensure that they’re providing the desired level of protection without negatively impacting the user experience.
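Here’s a minimal sketch of a circuit breaker to show where those sensitivity knobs live. The CircuitBreaker class, the failure_threshold and reset_timeout_s parameters and the half-open probing behavior are assumptions for illustration, not the design used by Shopify or DoorDash. This version counts consecutive failures, while production libraries often track error rates over a rolling window.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before tripping
        self.reset_timeout_s = reset_timeout_s      # how long to stay open before probing
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half-open"  # allow a trial request through
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("service marked as degraded; request not sent")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Trip on too many consecutive failures, or re-open after a failed probe.
            if self.state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the breaker and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Lowering failure_threshold makes the breaker more sensitive, and raising reset_timeout_s keeps traffic away from the degraded service for longer.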

Monitoring and Alerting

In an ideal scenario, you’ll know that there are issues with your payment system before an outage happens. You should have monitoring and alerting set up so that you’re notified of any potential problems.

In the not-so-ideal scenario, you’ll find out that there are issues with your system when you see your app trending on Twitter with a bunch of tweets talking about how people can get free stuff.

To avoid this, you should have robust monitoring and alerting in place for your payment system.

Four Golden Signals

Google’s Site Reliability Engineering handbook lists four signals that you should monitor

  • Latency - the time it takes for your system to process a request. You can measure this with median latency, percentiles (P90, P99), average latency etc.

  • Traffic - the number of requests your system is receiving. This is typically measured in requests per minute or requests per second.

  • Errors - the number of requests that are failing. In any system at scale, you’ll always be seeing some amount of errors so you’ll have to figure out a proper threshold that indicates something is truly wrong with the system without constantly triggering false alarms.

  • Saturation - how much load the system is under relative to its total capacity. This could be measured by CPU/memory usage, bandwidth, etc.
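To show how the four signals might be computed from raw data, here’s a small sketch that summarizes one window of request records. The Request dataclass, the golden_signals function and the use of CPU as the saturation measure are all assumptions for this example. In practice a metrics system would compute these for you.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_s, cpu_used, cpu_capacity):
    """Summarize one monitoring window into the four golden signals."""
    if not requests:
        raise ValueError("need at least one request in the window")
    latencies = [r.latency_ms for r in requests]
    errors = sum(1 for r in requests if not r.ok)
    return {
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p99_ms": percentile(latencies, 99),
        "traffic_rps": len(requests) / window_s,   # requests per second
        "error_rate": errors / len(requests),      # fraction of failed requests
        "saturation": cpu_used / cpu_capacity,     # fraction of capacity in use
    }
```

An alert would then compare these values against thresholds you pick, such as paging when error_rate stays above a chosen level for several consecutive windows.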

Load Testing

A terrific way to be proactive in ensuring the reliability and stability of your payment system is by doing load testing. This is where you put your system under stress in a controlled environment to see what breaks. By identifying and fixing potential issues before they occur to real users, you can save yourself a lot of headaches (and money!).

Like all e-commerce businesses, Shopify has certain days of the year when they experience very high traffic (Black Friday, Cyber Monday, etc.). To ensure that their system can handle the increased volume, Shopify conducts extensive load tests before these events.

Load testing can be incredibly challenging to implement since you need to accurately simulate real-world conditions. Predicting how customers behave in the real world can be quite difficult. One way that companies do this is through request-replay (they capture a certain percentage of production requests in their API gateway and then replay these requests when load testing).
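Here’s a very rough sketch of the request-replay idea: sample a fraction of captured requests, then replay them concurrently against the system under test and count failures. The sample_requests and replay functions, and the handler callable standing in for your staging environment, are hypothetical names for this example. Real tooling would also preserve request timing and scrub sensitive payment data before replaying.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def sample_requests(requests, sample_rate, seed=0):
    """Keep roughly sample_rate (0.0 to 1.0) of the captured requests."""
    rng = random.Random(seed)
    return [r for r in requests if rng.random() < sample_rate]


def replay(requests, handler, concurrency=8):
    """Send each request to handler concurrently and summarize the results."""

    def timed(req):
        start = time.perf_counter()
        try:
            handler(req)
            ok = True
        except Exception:
            ok = False
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed, requests))

    return {
        "sent": len(results),
        "failures": sum(1 for ok, _ in results if not ok),
        "max_latency_s": max((d for _, d in results), default=0.0),
    }
```

Raising concurrency (or replaying the same sample several times) is one simple way to push the load above what production normally sees.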

Load Testing Examples

Many companies have blogged about their load testing efforts so I’d recommend checking out

These are a couple of lessons from the articles. For more tips, you can check out the Shopify article here and the DoorDash article here.

When you’re building Single Sign-On, one crucial component you’ll need to manage is SAML X.509 certificates. These are essential for verifying the authenticity of SAML requests and responses between your app and your customer’s Identity Providers (Google Workspace, LastPass, OneLogin, etc.)

However, managing these certificates can be tricky:

  • Certificate Expiration - X.509 certificates expire after 1-10 years and can cause outages when they do.

  • Dynamic Metadata - Not all identity providers support dynamic metadata URLs for fetching new certificates. This means you might have to handle it manually.

  • Manual Certificate Renewals - Coordinating certificate renewals with customers can be time-consuming. You should reach out to customers about the renewal 90 days before certificate expiration.

If you’d like to learn the best practices, WorkOS published a fantastic blog post delving into SAML X.509.

They talk about why it’s important, potential issues you’ll come across and how you should mitigate them.

sponsored

Tech Snippets