How Faire Scaled Their Engineering Team
Plus, how to build ChatGPT Plugins, how Amazon does CI/CD, scaling a Notification Service and more.
Today we’ll be talking about
How Faire Scaled Their Engineering Team to Hundreds of Developers
Hiring developers who are customer focused
Implementing code review and testing early on
Tracking KPIs around developer velocity
Factors that make a good KPI
How Razorpay Scaled Their Notification Service
A brief intro to webhooks and using them for sending notifications
How MySQL became a bottleneck for Razorpay's Notification Service
Solving the issue by using Kinesis to decouple the system
How Amazon does CI/CD
Building ChatGPT Plugins
Prioritizing with the Code Quality Pyramid
Career Advice for Engineering Managers
Build the Modular Monolith First
A Dive Into Handling Time Series Data
Paul Dix is the CTO of InfluxDB and he wrote a great paper delving into what time series data is, what its use cases are, and how you can handle it.
As developers and businesses move to instrument more of their servers, applications, network infrastructure and the physical world, time series is becoming the de facto standard for how to think about storing, retrieving and mining this data for real-time and historical insight.
Time series databases are unique because they have to handle an extremely high write throughput. You might have 100,000 IoT devices collecting 10 metrics every second, which will quickly scale to millions of writes per second.
Learn more about why time series databases are the superior choice for the monitoring, metrics, real-time analytics and Internet of Things (IoT)/sensor data use cases by reading the full paper.
How Faire Maintained Engineering Velocity as They Scaled
Faire is an online marketplace that connects small businesses with makers and brands. A wide variety of products are sold on Faire, ranging from home decor to fashion accessories and much more. Businesses can use Faire to order products they see demand for and then sell them to customers.
Faire was started in 2017 and became a unicorn (valued at over $1 billion) in under 2 years. Now, over 600,000 businesses are purchasing wholesale products from Faire’s online marketplace and they’re valued at over $12 billion.
Marcelo Cortes is the co-founder and CTO of Faire and he wrote a great blog post for YCombinator on how they scaled their engineering team to meet this growth.
Here’s a summary
Faire’s engineering team grew from five engineers to hundreds in just a few years. They were able to sustain their pace of engineering execution by adhering to four important elements.
Hiring the right engineers
Building solid long-term foundations
Tracking metrics for decision-making
Keeping teams small and independent
Hiring the Right Engineers
When you have a huge backlog of features/bug fixes/etc that need to be pushed, it can be tempting to hire as quickly as possible.
Instead, the engineering team at Faire resisted this urge and made sure new hires met a high bar.
Specific things they looked for were…
Expertise in Faire’s core technology
They wanted to move extremely fast so they needed engineers who had significant previous experience with what Faire was building. The team had to build a complex payments infrastructure in a couple of weeks which involved integrating with multiple payment processors, asynchronous retries, processing partial refunds and a number of other features. People on the team had previous experience building the same infrastructure for Cash App at Square, so that helped tremendously.
Focused on Delivering Value to Customers
When hiring engineers, Faire looked for people who were amazing technically but were also curious about Faire’s business and were passionate about entrepreneurship.
The CTO would ask interviewees questions like “Give me examples of how you or your team impacted the business”. Their answers showed how well they understood their past company’s business and how their work impacted customers.
A positive signal is when engineering candidates proactively ask about Faire’s business/market.
Having customer-focused engineers made it much easier to shut down projects and move people around. The team was more focused on delivering value for the customer and not wedded to the specific products they were building.
Build Solid Long-Term Foundations
From day one, Faire documented their culture in their engineering handbook. They decided to embrace practices like writing tests and code reviews (contrary to other startups that might solely focus on pushing features as quickly as possible).
Faire found that they operated better with these processes and it made onboarding new developers significantly easier.
Here are four foundational elements Faire focused on.
Being Data Driven
Faire started investing in their data engineering/analytics when they were at just 10 customers. They wanted to ensure that data was a part of product decision-making.
From the start, they set up data pipelines to collect and transfer data into Redshift (their data warehouse).
They trained their team on how to use A/B testing and how to transform product ideas into statistically testable experiments. They also set principles around when to run experiments (and when not to) and when to stop experiments early.
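As an illustration of turning a product idea into a statistically testable experiment, here's a minimal two-proportion z-test for comparing conversion rates between an A/B test's two arms. The post doesn't describe Faire's actual tooling, so the helper and numbers below are purely illustrative:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)            # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# 5.0% vs 5.8% conversion over 10,000 users per arm
z, p = two_proportion_z_test(500, 10_000, 580, 10_000)  # z ≈ 2.50, p ≈ 0.012
```

A pre-registered rule like "only stop once p < 0.05 at the planned sample size" is one way to encode the principle of deciding up front when to stop experiments early.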
Choosing Technology
For picking technologies, they had two criteria
The team should be familiar with the tech
There should be evidence from other companies that the tech is scalable long term
They went with Java for their backend language and MySQL as their database.
Writing Tests
Many startups think they can move faster by not writing tests, but Faire found that the effort spent writing tests had a positive ROI.
They used testing to enforce, validate and document specifications. Within months of code being written, engineers will forget the exact requirements, edge cases, constraints, etc.
Having tests to check these areas helps developers to not fear changing code and unintentionally breaking things.
Faire didn’t focus on 100% code coverage, but they made sure that critical paths, edge cases, important logic, etc. were all well tested.
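As a sketch of what "using tests to document specifications" can look like, here's a hypothetical partial-refund helper with its edge cases pinned down in unit tests. The function and amounts are invented for illustration, not Faire's code:

```python
import unittest

def apply_partial_refund(charge_cents: int, refund_cents: int) -> int:
    """Hypothetical helper: remaining charged amount after a partial refund."""
    if refund_cents < 0 or refund_cents > charge_cents:
        raise ValueError("refund must be between 0 and the charged amount")
    return charge_cents - refund_cents

class PartialRefundTest(unittest.TestCase):
    # Each test records a requirement a future engineer might otherwise forget.
    def test_partial_refund_reduces_balance(self):
        self.assertEqual(apply_partial_refund(1000, 250), 750)

    def test_full_refund_is_allowed(self):
        self.assertEqual(apply_partial_refund(1000, 1000), 0)

    def test_over_refund_is_rejected(self):
        with self.assertRaises(ValueError):
            apply_partial_refund(1000, 1001)

# run with: python -m unittest
```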
Code Reviews
Faire started doing code reviews after hiring their first engineer. These helped ensure quality, prevented mistakes and spread knowledge.
Some best practices they implemented for code reviews are
Be Kind. Use positive phrasing where possible. It can be easy to unintentionally come across as critical.
Don’t block changes from being merged if the issues are minor. Instead, just ask for the change verbally.
Ensure code adheres to your internal style guide. Faire switched from Java to Kotlin in 2019 and they use JetBrains’ coding conventions.
Track Engineering Metrics
In order to maintain engineering velocity, Faire started measuring it with metrics when they had just 20 engineers.
Some metrics they started monitoring are
CI wait time
Defects Resolution Time
And more. They built dashboards to monitor these metrics.
As Faire grew to 100+ engineers, it no longer made sense to track specific engineering velocity metrics across the company.
They moved to a model where each team maintains a set of key performance indicators (KPIs) that are published as a scorecard. These show how successful the team is at maintaining its product areas and parts of the tech stack it owns.
As they rolled this process out, they also identified the factors that make a good KPI.
Clearly Ladders Up to a Top-Level Business Metric
In order to convince other stakeholders to care about a KPI, you need a clear connection to a top-level business metric (revenue, reduction in expenses, increase in customer LTV, etc.). For pager volume, Faire drew the connection like this: high pager volume leads to tired and distracted engineers, which leads to lower code output and fewer features delivered.
Is Independent of other KPIs
You want to express the maximum amount of information with the fewest number of KPIs. Using KPIs that are correlated with each other (or measuring the same underlying thing) means that you’re taking attention away from another KPI that’s measuring some other area.
Is Normalized In a Meaningful Way
If you’re in a high-growth environment, a raw KPI value can be misleading. You want to normalize the values so that they’re easier to compare over time.
Solely looking at the infrastructure cost can be misleading if your product is growing rapidly. It might be alarming to see that infrastructure costs doubled over the past year, but if the number of users tripled then that would be less concerning.
Instead, you might want to normalize infrastructure costs by dividing the total amount spent by the number of users.
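In code, the normalization is a one-liner; the figures below are made up to mirror the cost-doubled/users-tripled example:

```python
def cost_per_user(monthly_infra_cost: float, monthly_active_users: int) -> float:
    """Normalize infrastructure spend by the user base so periods are comparable."""
    return monthly_infra_cost / monthly_active_users

# Cost doubled, but users tripled: the normalized KPI actually improved.
last_year = cost_per_user(100_000, 1_000_000)   # $0.10 per user
this_year = cost_per_user(200_000, 3_000_000)   # ≈ $0.067 per user
```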
This is a short summary of some of the advice Marcelo offered. For more lessons learnt (and additional details) you should check out the full blog post here.
It's About Time. Build on InfluxDB.
Working with large sets of time-stamped data has its challenges.
Fortunately, InfluxDB is a time series platform purpose-built to handle the unique workloads of time series data.
Using InfluxDB, developers can ingest billions of data points in real-time with unbounded cardinality, and store, analyze, and act on that data – all in a single database.
No matter what kind of time series data you’re working with – metrics, events, traces, or logs – InfluxDB Cloud provides a performant, elastic, serverless time series platform with the tools and features developers need. Native SQL compatibility makes it easy to get started with InfluxDB and to scale your solutions.
Companies like IBM, Cisco, and Robinhood all rely heavily on InfluxDB to build and manage responsive backend applications, to power predictive intelligence, and to monitor their systems for insights that they would otherwise miss.
See for yourself by quickly spinning up the platform and trying out InfluxDB Cloud for free.
In addition to our emails, you can also get weekly articles on system design and technical dives on cool tech!
It’s $12 per month and I’d highly recommend using your job’s Learning & Development stipend to pay for it!
How Razorpay Scaled Their Notification Service
Razorpay is a payments service and neobank that is one of India’s most valuable fintech startups (valued at $7.5 billion). The company powers payments for over 8 million businesses in India and has been growing extremely quickly (they went from $60 billion in payments processed in 2021 to $90 billion in 2022). With this growth in payment volume and users, transactions on the system have been growing exponentially.
With this immense increase, engineers have to invest a lot of resources in redesigning the system architecture to make it more scalable. One service that had to be redesigned was Razorpay’s Notification service, a platform that handled all the customer notification requirements for SMS, E-Mail and webhooks.
A few challenges popped up with how the company handled webhooks, the most popular way that users get notifications from Razorpay.
We’ll give a brief overview of webhooks, talk about the challenges Razorpay faced, and how they addressed them. You can read the full blog post here.
Brief Explanation of Webhooks
A webhook can be thought of as a “reverse API”. While a traditional REST API is pull-based, a webhook is push-based.
For example, let’s say you use Razorpay to process payments for your app and you want to know whenever someone has purchased a monthly subscription.
With a REST API, one way of doing this would be to send a GET request to Razorpay’s servers every few minutes to check whether there have been any new transactions (you are pulling information). However, this polling puts unnecessary load on both your servers and Razorpay’s.
On the other hand, if you use a webhook, you can set up a route on your backend with logic that you want to execute whenever a transaction is made. Then, you can give Razorpay the URL for this route and the company will send you an HTTP POST request whenever someone purchases a monthly subscription (information is being pushed to you).
You then respond with a 2xx HTTP status code to acknowledge that you’ve received the webhook. If you don’t respond, then Razorpay will retry the delivery. They’ll continue retrying for 24 hours with exponential backoff.
Here’s a sample implementation of a webhook in ExpressJS and you can watch this video to see how webhooks are set up with Razorpay. They have many webhooks to alert users on things like successful payments, refunds, disputes, canceled subscriptions and more. You can view their docs here if you’re curious.
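On the receiving side, Razorpay signs each webhook's raw body with HMAC-SHA256 using your webhook secret and sends the digest in the X-Razorpay-Signature header, which you should verify before trusting the payload. Here's a minimal, framework-agnostic sketch (the secret and event body are placeholders):

```python
import hashlib
import hmac

def verify_razorpay_signature(raw_body: bytes, signature: str, secret: bytes) -> bool:
    """Recompute the HMAC-SHA256 of the raw body and compare in constant time."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

# In your route handler (any framework): verify, process, then return a 2xx
# quickly so Razorpay marks the delivery successful and stops retrying.
body = b'{"event": "subscription.charged"}'
secret = b"your-webhook-secret"               # placeholder, set in the dashboard
good_sig = hmac.new(secret, body, hashlib.sha256).hexdigest()
assert verify_razorpay_signature(body, good_sig, secret)
assert not verify_razorpay_signature(body, "tampered", secret)
```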
Existing Notification Flow
Here’s the existing flow for how Notifications work with Razorpay.
API nodes for the Notification service will receive the request and validate it. After validation, they’ll send the notification message to an AWS SQS queue.
Worker nodes will consume the notification message from the SQS queue and send out the notification (SMS, webhook and e-mail). They will write the result of the execution to a MySQL database and also push the result to Razorpay’s data lake.
Scheduler nodes will check the MySQL database for any notifications that were not sent out successfully and push them back to the SQS queue to be processed again.
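The three node types can be sketched with in-memory stand-ins (a Queue for SQS, a list for the MySQL results table) to show how the retry loop fits together. This is a toy model of the flow, not Razorpay's code:

```python
from queue import Queue

sqs = Queue()       # stands in for the AWS SQS queue
results_db = []     # stands in for the MySQL results table

def api_node(notification):
    """Validate the request, then enqueue the notification message."""
    if notification.get("to"):
        sqs.put(notification)

def worker_node(send):
    """Consume a message, attempt delivery, record the outcome."""
    msg = sqs.get()
    results_db.append({"msg": msg, "ok": send(msg)})

def scheduler_node():
    """Re-enqueue notifications that were not sent out successfully."""
    for row in [r for r in results_db if not r["ok"]]:
        results_db.remove(row)
        sqs.put(row["msg"])

api_node({"to": "merchant@example.com", "channel": "email"})
worker_node(send=lambda m: False)   # first delivery attempt fails...
scheduler_node()                    # ...so the scheduler pushes it back
worker_node(send=lambda m: True)    # ...and the retry succeeds
```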
This system could handle a load of up to 2,000 transactions per second and regularly served a peak load of 1,000 transactions per second. However, at these levels, the system performance started degrading and Razorpay wasn’t able to meet their SLAs with P99 latency increasing from 2 seconds to 4 seconds (99% of requests were handled within 4 seconds and the team wanted to get this down to 2 seconds).
Challenges when Scaling Up
With the increase in transactions, the Razorpay team encountered a few scalability issues
Database Bottleneck - Read query performance was getting worse and it couldn’t scale to meet the required input/output operations per second (IOPS). The team vertically scaled the database from a 2xlarge to an 8xlarge instance, but this wasn’t a long-term solution considering the pace of growth.
Customer Responses - In order for the webhook to be considered delivered, customers need to respond with a 2xx HTTP status code. If they don’t, Razorpay will retry sending the webhook. Some customers have slow response times for webhooks and this was causing worker nodes to be blocked while waiting for a user response.
Unexpected Increases in Load - Load would increase unexpectedly for certain events/days and this would impact the notification platform.
In order to address these issues, the Razorpay team decided on several goals
Add the ability to prioritize notifications
Eliminate the database bottleneck
Manage SLAs for customers who don’t respond promptly to webhooks
Prioritize Incoming Load
Not all of the incoming notification requests were equally critical so engineers decided to create different queues for events of different priority.
P0 queue - all critical events with the highest impact on business metrics are pushed here.
P1 queue - the default queue for all notifications other than P0.
P2 queue - all burst events (very high TPS in a short span) go here.
This separated priorities and allowed the team to set up a rate limiter with different configurations for each queue.
All the P0 events had a higher limit than the P1 events and notification requests that breached the rate limit would be sent to the P2 queue.
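One common way to implement per-queue limits like this is a token bucket per priority. The rates below are illustrative, not Razorpay's actual configuration:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter."""
    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate, self.capacity = rate_per_sec, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, then try to spend one token.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

limits = {"P0": TokenBucket(2000, 2000), "P1": TokenBucket(500, 500)}

def route(event) -> str:
    queue = "P0" if event["critical"] else "P1"
    # Events that breach their queue's rate limit overflow into the P2 burst queue.
    return queue if limits[queue].allow() else "P2"
```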
Reducing the Database Bottleneck
In the earlier implementation, as the traffic on the system scaled, the worker pods would also autoscale.
The increase in worker nodes would ramp up the input/output operations per second on the database, which would cause the database performance to severely degrade.
To address this, engineers decoupled the system by writing the notification requests to the database asynchronously with AWS Kinesis. Kinesis is a fully managed data streaming service offered by Amazon and it’s very commonly used with real-time big data processing applications.
They added a Kinesis stream between the workers and the database so the worker nodes will write the status for the notification messages to Kinesis instead of MySQL. The worker nodes could autoscale when necessary and Kinesis could handle the increase in write load. However, engineers had control over the rate at which data was written from Kinesis to MySQL, so they could keep the database write load relatively constant.
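The decoupling can be modeled with a buffer between producers and the database: workers append freely, while a consumer drains a bounded batch per cycle so the database write rate stays flat. This is a toy stand-in for the Kinesis stream, not the AWS API:

```python
from collections import deque

stream = deque()   # stands in for the Kinesis stream
db_writes = []     # stands in for MySQL

def worker_write(status_record):
    """Workers append to the stream instead of writing to MySQL directly."""
    stream.append(status_record)

def drain_to_db(max_batch=100):
    """The consumer drains a bounded batch per cycle, keeping the DB write
    load roughly constant no matter how many workers are producing."""
    batch = [stream.popleft() for _ in range(min(max_batch, len(stream)))]
    db_writes.extend(batch)
    return len(batch)

for i in range(250):   # a burst of worker writes
    worker_write({"notification_id": i, "status": "sent"})
drain_to_db()          # the DB sees at most 100 writes this cycle
```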
Managing Webhooks with Delayed Responses
A key variable with maintaining the latency SLA is the customer’s response time. Razorpay is sending a POST request to the user’s URL and expects a response with a 2xx status code for the webhook delivery to be considered successful.
When the webhook call is made, the worker node is blocked as it waits for the customer to respond. Some customer servers don’t respond quickly and this can affect overall system performance.
To solve this, engineers came up with the concept of Quality of Service for customers, where a customer with a delayed response time will have their webhook notifications get decreased priority for the next few minutes. Afterwards, Razorpay will re-check to see if the user’s servers are responding quickly.
Additionally, Razorpay configured alerts that can be sent to the customers so they can check and fix the issues on their end.
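A sketch of the Quality of Service idea, with assumed threshold and penalty-window values (the post doesn't give Razorpay's actual numbers):

```python
import time

SLOW_THRESHOLD_SECS = 5.0    # assumed cutoff for a "delayed" webhook response
PENALTY_WINDOW_SECS = 300    # assumed demotion window before re-checking

demoted_until: dict[str, float] = {}

def record_response(customer_id: str, response_secs: float) -> None:
    """After each webhook call, flag customers whose endpoints responded slowly."""
    if response_secs > SLOW_THRESHOLD_SECS:
        demoted_until[customer_id] = time.monotonic() + PENALTY_WINDOW_SECS

def priority_for(customer_id: str) -> str:
    """Slow responders drop to the low-priority queue until their window expires."""
    if time.monotonic() < demoted_until.get(customer_id, 0.0):
        return "P2"
    return "P1"
```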
It’s extremely critical that Razorpay’s service meets SLAs and stays highly available. If you’re processing payments for your website with their service, then you’ll probably find outages or severely degraded service to be unacceptable.
To ensure the systems scale well with increased load, Razorpay built a robust system around observability.
They have dashboards and alerts in Grafana to detect any anomalies, monitor the system’s health and analyze their logs. They also use distributed tracing tools to understand the behavior of the various components in their system.
For more details, you can read the full blog post here.