How Faire Scaled Their Engineering Team

Plus, how to build ChatGPT Plugins, how Amazon does CI/CD, scaling a Notification Service and more.

March 27, 2023

Hey Everyone!

Today we’ll be talking about

How Faire Scaled Their Engineering Team to Hundreds of Developers
- Hiring developers who are customer focused
- Implementing code review and testing early on
- Tracking KPIs around developer velocity
- Factors that make a good KPI
How Razorpay Scaled Their Notification Service
- A brief intro to webhooks and using them for sending notifications
- How MySQL became a bottleneck for Razorpay's Notification Service
- Solving the issue by using Kinesis to decouple the system
Tech Snippets
- How Amazon does CI/CD
- Building ChatGPT Plugins
- Prioritizing with the Code Quality Pyramid
- Career Advice for Engineering Managers
- Build the Modular Monolith First

How Faire Maintained Engineering Velocity as They Scaled

Faire is an online marketplace that connects small businesses with makers and brands. A wide variety of products are sold on Faire, ranging from home decor to fashion accessories and much more. Businesses can use Faire to order products they see demand for and then sell them to customers.

Faire was started in 2017 and became a unicorn (valued at over $1 billion) in under 2 years. Now, over 600,000 businesses are purchasing wholesale products from Faire’s online marketplace and they’re valued at over $12 billion.

Marcelo Cortes is the co-founder and CTO of Faire and he wrote a great blog post for YCombinator on how they scaled their engineering team to meet this growth.

Here’s a summary

Faire’s engineering team grew from five engineers to hundreds in just a few years. They were able to sustain their pace of engineering execution by adhering to four important elements.

Hiring the right engineers
Building solid long-term foundations
Tracking metrics for decision-making
Keeping teams small and independent

Hiring the Right Engineers

When you have a huge backlog of features, bug fixes and other work waiting to ship, it can be tempting to hire as quickly as possible.

Instead, the engineering team at Faire resisted this urge and made sure new hires met a high bar.

Specific things they looked for were…

Expertise in Faire’s core technology

They wanted to move extremely fast so they needed engineers who had significant previous experience with what Faire was building. The team had to build a complex payments infrastructure in a couple of weeks which involved integrating with multiple payment processors, asynchronous retries, processing partial refunds and a number of other features. People on the team had previous experience building the same infrastructure for Cash App at Square, so that helped tremendously.

Focused on Delivering Value to Customers

When hiring engineers, Faire looked for people who were amazing technically but were also curious about Faire’s business and were passionate about entrepreneurship.

The CTO would ask interviewees questions like “Give me examples of how you or your team impacted the business”. Their answers showed how well they understood their past company’s business and how their work impacted customers.

A positive signal is when engineering candidates proactively ask about Faire’s business/market.

Having customer-focused engineers made it much easier to shut down projects and move people around. The team was more focused on delivering value for the customer and not wedded to the specific products they were building.

Build Solid Long-Term Foundations

From day one, Faire documented their culture in their engineering handbook. They decided to embrace practices like writing tests and code reviews (contrary to other startups that might solely focus on pushing features as quickly as possible).

Faire found that they operated better with these processes and it made onboarding new developers significantly easier.

Here are four foundational elements Faire focused on.

Being Data Driven

Faire started investing in their data engineering/analytics when they had just 10 customers. They wanted to ensure that data was a part of product decision-making.

From the start, they set up data pipelines to collect and transfer data into Redshift (their data warehouse).

They trained their team on how to use A/B testing and how to transform product ideas into statistically testable experiments. They also set principles around when to run experiments (and when not to) and when to stop experiments early.
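To make "statistically testable" a bit more concrete, here's a minimal sketch of one common way to evaluate an A/B test on conversion rates: a two-proportion z-test. This is an illustration, not Faire's actual tooling. The function name and the sample numbers are assumptions, and it assumes both groups have some (but not all) users converting.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a: conversions and users in the control group
    conv_b / n_b: conversions and users in the treatment group
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 10% vs 15% conversion with 1,000 users per group.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be noise. Deciding the sample size and significance threshold before the experiment starts is part of what principles about "when to stop experiments early" are there to protect.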

Choosing Technology

For picking technologies, they had two criteria

The team should be familiar with the tech
There should be evidence from other companies that the tech is scalable long term

They went with Java for their backend language and MySQL as their database.

Writing Tests

Many startups think they can move faster by not writing tests, but Faire found that the effort spent writing tests had a positive ROI.

They used testing to enforce, validate and document specifications. Within months of code being written, engineers will forget the exact requirements, edge cases, constraints, etc.

Having tests that cover these areas lets developers change code without fear of unintentionally breaking things.

Faire didn’t focus on 100% code coverage, but they made sure that critical paths, edge cases, important logic, etc. were all well tested.
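As a rough illustration of tests that "enforce, validate and document specifications", here's a small sketch built around partial refunds (one of the payment features mentioned earlier). The Payment class, the refund function and the rules they encode are hypothetical and are not taken from Faire's codebase.

```python
from dataclasses import dataclass


@dataclass
class Payment:
    amount_cents: int
    refunded_cents: int = 0


def refund(payment, amount_cents):
    """Apply a partial or full refund and return the amount still refundable."""
    if amount_cents <= 0:
        raise ValueError("refund must be positive")
    if payment.refunded_cents + amount_cents > payment.amount_cents:
        raise ValueError("refund exceeds the original payment")
    payment.refunded_cents += amount_cents
    return payment.amount_cents - payment.refunded_cents


# Each test name states one requirement, so the tests double as documentation.
def test_partial_refunds_accumulate():
    payment = Payment(amount_cents=10_000)
    assert refund(payment, 2_500) == 7_500
    assert refund(payment, 7_500) == 0


def test_cannot_refund_more_than_was_paid():
    payment = Payment(amount_cents=10_000)
    refund(payment, 9_000)
    try:
        refund(payment, 2_000)
    except ValueError:
        pass  # expected: only 1,000 cents were left to refund
    else:
        raise AssertionError("over-refund should be rejected")
```

Months later, someone changing refund can read the test names to recover the edge cases, and a test runner such as pytest will flag any change that breaks them.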

Code Reviews

Faire started doing code reviews after hiring their first engineer. These helped ensure quality, prevented mistakes and spread knowledge.

Some best practices they implemented for code reviews are

Be Kind. Use positive phrasing where possible. It can be easy to unintentionally come across as critical.
Don’t block changes from being merged if the issues are minor. Instead, just ask for the change verbally.
Ensure code adheres to your internal style guide. Faire switched from Java to Kotlin in 2019 and they use JetBrains’ coding conventions.

Track Engineering Metrics

To maintain engineering velocity, Faire started tracking it with metrics when they had just 20 engineers.

Some metrics they started monitoring are

CI wait time
Open Defects
Defects Resolution Time
Flaky tests
New Tickets

And more. They built dashboards to monitor these metrics.

As Faire grew to 100+ engineers, it no longer made sense to track specific engineering velocity metrics across the company.

They moved to a model where each team maintains a set of key performance indicators (KPIs) that are published as a scorecard. These show how successful the team is at maintaining its product areas and parts of the tech stack it owns.

As they rolled this process out, they also identified what makes a good KPI.

Here are some factors they found for identifying good KPIs to track.

Clearly Ladders Up to a Top-Level Business Metric

In order to convince other stakeholders to care about a KPI, you need a clear connection to a top-level business metric (revenue, reduction in expenses, increase in customer LTV, etc.). For tracking pager volume, Faire saw the connection as follows: high pager volume leads to tired and distracted engineers, which leads to lower code output and fewer features delivered.

Is Independent of other KPIs

You want to express the maximum amount of information with the fewest number of KPIs. Using KPIs that are correlated with each other (or measuring the same underlying thing) means that you’re taking attention away from another KPI that’s measuring some other area.

Is Normalized In a Meaningful Way

If you’re in a high-growth environment, looking at a raw KPI can be misleading because the base it depends on (such as the number of users) is changing too. You want to normalize the values so that they’re easier to compare over time.

Solely looking at the infrastructure cost can be misleading if your product is growing rapidly. It might be alarming to see that infrastructure costs doubled over the past year, but if the number of users tripled then that would be less concerning.

Instead, you might want to normalize infrastructure costs by dividing the total amount spent by the number of users.
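Here's the arithmetic from the example above as a short sketch. The dollar amounts and user counts are made up for illustration. Only the ratios (costs doubling, users tripling) come from the example.

```python
def cost_per_user(total_cost, users):
    """Normalize total infrastructure spend by the number of users."""
    return total_cost / users


# Hypothetical figures: costs doubled while users tripled.
last_year = cost_per_user(total_cost=100_000, users=50_000)   # 2.00 per user
this_year = cost_per_user(total_cost=200_000, users=150_000)  # about 1.33 per user

change = (this_year - last_year) / last_year
print(f"Cost per user changed by {change:.0%}")
```

The raw cost doubled, but the normalized KPI dropped by a third, which is the more useful signal in this scenario.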

This is a short summary of some of the advice Marcelo offered. For more lessons learnt (and additional details) you should check out the full blog post here.

It's About Time. Build on InfluxDB.

Working with large sets of time-stamped data has its challenges.

Fortunately, InfluxDB is a time series platform purpose-built to handle the unique workloads of time series data.

Using InfluxDB, developers can ingest billions of data points in real-time with unbounded cardinality, and store, analyze, and act on that data – all in a single database.

No matter what kind of time series data you’re working with – metrics, events, traces, or logs – InfluxDB Cloud provides a performant, elastic, serverless time series platform with the tools and features developers need. Native SQL compatibility makes it easy to get started with InfluxDB and to scale your solutions.

Companies like IBM, Cisco, and Robinhood all rely heavily on InfluxDB to build and manage responsive backend applications, to power predictive intelligence, and to monitor their systems for insights that they would otherwise miss.

See for yourself by quickly spinning up the platform and testing out InfluxDB Cloud for free.

sponsored

Tech Snippets

How Amazon does CI/CD

Clare Liguori is a Principal Engineer at Amazon and she wrote a fantastic article delving into Amazon’s system for continuous deployment. She talks about every stage of deployment including code review, testing (unit and integration), validation, one-box deployment, staggered deployments, roll-backs and more.

It’s extremely in-depth and provides a great overview of the different tactics Amazon uses to get code from a developer’s machine to production.

aws.amazon.com/builders-library/automating-safe-hands-off-deployments/?did=ba_card&trk=ba_card

Building ChatGPT Plugins

OpenAI has recently announced the addition of plugins to ChatGPT and GPT-4. This means a huge increase in what these LLMs can do.

Here’s a great video from someone with plugin access. He goes through how it works, what plugins are available, how they’re built and more.

https://youtu.be/hpePPqKxNq8

Prioritizing with the Code Quality Pyramid

Fabian Zeindl is a senior software architect and he wrote a great blog post introducing The Code Quality Pyramid framework.

This divides the qualities of your system into hierarchical layers where the lower layers affect the parts above. Having a performant build system will enhance your ability to improve all the layers above.

Fabian goes through the layers in the pyramid and talks about how to improve them.

www.fabianzeindl.com/posts/the-codequality-pyramid

Advice for Engineering Managers Who Want to Climb the Ladder

Charity Majors is the co-founder and CTO of Honeycomb and she wrote a great post with advice for Engineering Managers on how they can advance in their careers.

The post is centered around becoming an Engineering Director (go from managing ICs to managing managers).

Most opportunities will be at fast-growing startups with at least 100 engineers. You should demonstrate impact beyond your team and be proactive in communicating problems (along with solutions) to your manager/management.

Check out the full post for her advice.

charity.wtf/2022/06/13/advice-for-engineering-managers-who-want-to-climb-the-ladder

Build the Modular Monolith First

Despite the hype, microservices come with a lot of downsides. The extra network calls hurt performance and reliability. They’re harder to debug.

If you’re on a small team, using a modular monolith design pattern gives you good separation of concerns and allows an easier migration to microservices when the time comes.

www.fearofoblivion.com/build-a-modular-monolith-first

Quastor Pro

In addition to our emails, you can also get weekly articles on system design and technical dives on cool tech!

Past articles include

System Design Articles

Tech Dives

Database Concepts

It’s $12 per month and I’d highly recommend using your job’s Learning & Development stipend to pay for it!

How Razorpay Scaled Their Notification Service

Razorpay is a payments service and neobank that is one of India’s most valuable fintech startups (valued at $7.5 billion). The company powers payments for over 8 million businesses in India and has been growing extremely quickly (they went from $60 billion in payments processed in 2021 to $90 billion in 2022). With this growth in payment volume and users, transactions on the system have been growing exponentially.

With this immense increase, engineers had to invest a lot of resources in redesigning the system architecture to make it more scalable. One service that had to be redesigned was Razorpay’s Notification service, a platform that handles all the customer notification requirements for SMS, e-mail and webhooks.

A few challenges popped up with how the company handled webhooks, the most popular way that users get notifications from Razorpay.

We’ll give a brief overview of webhooks, talk about the challenges Razorpay faced, and how they addressed them. You can read the full blog post here.

Brief Explanation of Webhooks

A webhook can be thought of as a “reverse API”. While a traditional REST API is pull, a webhook is push.

For example, let’s say you use Razorpay to process payments for your app and you want to know whenever someone has purchased a monthly subscription.

With a REST API, one way of doing this would be to send a GET request to Razorpay’s servers every few minutes so you can check to see if there have been any transactions (you are pulling information). However, this results in unnecessary load for your computer and Razorpay’s servers.

On the other hand, if you use a webhook, you can set up a route on your backend with logic that you want to execute whenever a transaction is made. Then, you can give Razorpay the URL for this route and the company will send you an HTTP POST request whenever someone purchases a monthly subscription (information is being pushed to you).

You then respond with a 2xx HTTP status code to acknowledge that you’ve received the webhook. If you don’t respond, then Razorpay will retry the delivery. They’ll continue retrying for 24 hours with exponential backoff.

Here’s a sample implementation of a webhook in ExpressJS and you can watch this video to see how webhooks are set up with Razorpay. They have many webhooks to alert users on things like successful payments, refunds, disputes, canceled subscriptions and more. You can view their docs here if you’re curious.
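To tie the acknowledge-and-retry behavior together, here's a minimal Python sketch of both sides of a webhook exchange. It's a simulation, not Razorpay's implementation. The function names, the 60-second base delay and the doubling factor are all assumptions (the source only tells us that retries continue for 24 hours with exponential backoff).

```python
import json

DAY_S = 24 * 60 * 60


def handle_webhook(body):
    """Receiver side: return the HTTP status code to answer a webhook with."""
    try:
        event = json.loads(body)
    except ValueError:
        return 400  # malformed payload
    if not isinstance(event, dict):
        return 400
    # Business logic would run here (e.g. activate the subscription).
    return 200  # any 2xx acknowledges receipt


def retry_delays(base_delay_s=60, factor=2, window_s=DAY_S):
    """Yield the wait before each retry, stopping once the window is used up."""
    elapsed, delay = 0, base_delay_s
    while elapsed + delay <= window_s:
        yield delay
        elapsed += delay
        delay *= factor


def deliver(send, body, sleep=lambda seconds: None):
    """Sender side: return the number of attempts used, or None if never acknowledged."""
    attempts = 1
    if 200 <= send(body) < 300:
        return attempts
    for delay in retry_delays():
        sleep(delay)  # pass time.sleep (or hand off to a scheduler) in real use
        attempts += 1
        if 200 <= send(body) < 300:
            return attempts
    return None
```

With these assumed parameters, the waits double from one minute upward until the next retry would fall outside the 24 hour window. In this sketch, deliver(handle_webhook, '{"event": "payment"}') is acknowledged on the first attempt.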

Existing Notification Flow

Here’s the existing flow for how Notifications work with Razorpay.

API nodes for the Notification service will receive the request and validate it. After validation, they’ll send the notification message to an AWS SQS queue.
Worker nodes will consume the notification message from the SQS queue and send out the notification (SMS, webhook and e-mail). They will write the result of the execution to a MySQL database and also push the result to Razorpay’s data lake.
Scheduler nodes will check the MySQL database for any notifications that were not sent out successfully and push them back to the SQS queue to be processed again.
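The three roles above can be sketched in a few lines of Python. This is an in-memory toy, with a deque standing in for the SQS queue and a dict standing in for the MySQL table. The class name, method names and message fields are hypothetical, and the data lake write is left out.

```python
from collections import deque


class NotificationService:
    def __init__(self, send):
        self.queue = deque()  # stand-in for the SQS queue
        self.status = {}      # stand-in for MySQL: id -> (status, message)
        self.send = send      # delivers one message, returns True on success

    def api_receive(self, msg):
        """API node: validate the request, then enqueue it."""
        if "id" not in msg or msg.get("channel") not in {"sms", "email", "webhook"}:
            raise ValueError("invalid notification request")
        self.queue.append(msg)

    def worker_run(self):
        """Worker node: consume messages, send them, record the outcome."""
        while self.queue:
            msg = self.queue.popleft()
            ok = self.send(msg)
            self.status[msg["id"]] = ("sent" if ok else "failed", msg)

    def scheduler_run(self):
        """Scheduler node: push failed notifications back onto the queue."""
        for status, msg in list(self.status.values()):
            if status == "failed":
                self.status[msg["id"]] = ("queued", msg)
                self.queue.append(msg)
```

Notice that every worker writes to the same table on every send, and the scheduler reads from it too. That shared database is the piece that becomes a bottleneck as the number of workers grows.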

This system could handle up to 2,000 transactions per second and regularly served a peak load of 1,000 transactions per second. However, at these levels, performance started degrading and Razorpay wasn’t able to meet their SLAs: P99 latency rose from 2 seconds to 4 seconds (meaning 99% of requests were handled within 4 seconds, and the team wanted to get this back down to 2 seconds).

Challenges when Scaling Up

With the increase in transactions, the Razorpay team encountered a few scalability issues

Database Bottleneck - Read query performance was getting worse and the database couldn’t scale to meet the required input/output operations per second (IOPS). The team vertically scaled the database from a 2x.large to an 8x.large but this wasn’t a long-term solution considering the pace of growth.
Customer Responses - In order for the webhook to be considered delivered, customers need to respond with a 2xx HTTP status code. If they don’t, Razorpay will retry sending the webhook. Some customers have slow response times for webhooks and this was causing worker nodes to be blocked while waiting for a user response.
Unexpected Increases in Load - Load would increase unexpectedly for certain events/days and this would impact the notification platform.

In order to address these issues, the Razorpay team decided on several goals

Add the ability to prioritize notifications
Eliminate the database bottleneck
Manage SLAs for customers who don’t respond promptly to webhooks

Prioritize Incoming Load

Not all of the incoming notification requests were equally critical so engineers decided to create different queues for events of different priority.

P0 queue - all critical events with highest impact on business metrics are pushed here.
P1 queue - the default queue for all notifications other than P0.
P2 queue - all burst events (very high TPS in a short span) go here.

This separated priorities and allowed the team to set up a rate limiter with different configurations for each queue.

All the P0 events had a higher limit than the P1 events and notification requests that breached the rate limit would be sent to the P2 queue.
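Here's a minimal sketch of that routing logic. The source doesn't say which rate limiting algorithm Razorpay uses, so this assumes a simple fixed-window counter. The class name and the limits are hypothetical.

```python
from collections import defaultdict, deque


class PriorityRouter:
    def __init__(self, limits):
        self.limits = limits            # events allowed per window, e.g. {"P0": 1000, "P1": 500}
        self.counts = defaultdict(int)  # events accepted in the current window
        self.queues = {"P0": deque(), "P1": deque(), "P2": deque()}

    def route(self, event, critical=False):
        """Place the event on a queue and return the queue's name."""
        target = "P0" if critical else "P1"
        if self.counts[target] >= self.limits[target]:
            target = "P2"  # rate limit breached: treat as burst traffic
        else:
            self.counts[target] += 1
        self.queues[target].append(event)
        return target

    def reset_window(self):
        """Call at the start of each rate limiting window."""
        self.counts.clear()


router = PriorityRouter({"P0": 2, "P1": 1})
print(router.route("payment.captured", critical=True))
print(router.route("invoice.sent"))
print(router.route("invoice.sent"))
```

With these toy limits, the second non-critical event in the same window spills over to P2, so a burst from one source can't starve the P0 and P1 queues.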

Reducing the Database Bottleneck

In the earlier implementation, as the traffic on the system scaled, the worker nodes would also autoscale.

The increase in worker nodes would ramp up the input/output operations per second on the database, which would cause the database performance to severely degrade.

To address this, engineers decoupled the system by making the database writes asynchronous with AWS Kinesis. Kinesis is a fully managed data streaming service offered by Amazon and it’s very commonly used with real-time big data processing applications.

They added a Kinesis stream between the workers and the database, so the worker nodes wrote the status of each notification message to Kinesis instead of MySQL. The worker nodes could autoscale when necessary and Kinesis could handle the increase in write load. Engineers, meanwhile, controlled the rate at which data was written from Kinesis to MySQL, so they could keep the database write load relatively constant.
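The idea of putting a stream between bursty writers and a rate-limited database can be sketched like this. It's an in-memory stand-in (a deque instead of Kinesis, a list instead of MySQL) and the class name and per-tick write cap are assumptions for illustration.

```python
from collections import deque


class StatusStream:
    def __init__(self, max_writes_per_tick):
        self.buffer = deque()  # stand-in for the Kinesis stream
        self.db = []           # stand-in for the MySQL table
        self.max_writes_per_tick = max_writes_per_tick

    def put(self, record):
        """Called by worker nodes. Cheap, and never touches the database."""
        self.buffer.append(record)

    def drain_tick(self):
        """Called by the consumer on a fixed schedule. Returns rows written."""
        n = min(self.max_writes_per_tick, len(self.buffer))
        batch = [self.buffer.popleft() for _ in range(n)]
        self.db.extend(batch)
        return len(batch)


stream = StatusStream(max_writes_per_tick=100)
for i in range(250):  # a burst from autoscaled workers
    stream.put({"id": i, "status": "sent"})
print([stream.drain_tick() for _ in range(4)])  # [100, 100, 50, 0]
```

However large the burst from the workers, the database never sees more than the capped number of writes per tick. The trade-off is that statuses land in the database with some delay.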

Managing Webhooks with Delayed Responses

A key variable in maintaining the latency SLA is the customer’s response time. Razorpay sends a POST request to the user’s URL and expects a response with a 2xx status code for the webhook delivery to be considered successful.

When the webhook call is made, the worker node is blocked as it waits for the customer to respond. Some customer servers don’t respond quickly and this can affect overall system performance.

To solve this, engineers came up with the concept of Quality of Service for customers, where a customer with a delayed response time will have their webhook notifications deprioritized for the next few minutes. Afterwards, Razorpay will re-check to see if the user’s servers are responding quickly.
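A rough sketch of how such a Quality of Service check might work is below. The 2 second threshold, the 5 minute penalty and all of the names are assumptions, since the source only says that slow customers get decreased priority for "the next few minutes".

```python
class QualityOfService:
    def __init__(self, slow_threshold_s=2.0, penalty_s=300):
        self.slow_threshold_s = slow_threshold_s
        self.penalty_s = penalty_s
        self.demoted_until = {}  # customer_id -> time when the penalty expires

    def record_response(self, customer_id, response_time_s, now):
        """Call after every webhook attempt with the observed response time."""
        if response_time_s > self.slow_threshold_s:
            self.demoted_until[customer_id] = now + self.penalty_s

    def priority(self, customer_id, now):
        """Return "low" while a customer is penalized, otherwise "normal"."""
        if now < self.demoted_until.get(customer_id, 0):
            return "low"
        return "normal"


qos = QualityOfService()
qos.record_response("merchant_1", response_time_s=5.0, now=0)
print(qos.priority("merchant_1", now=60))   # low
print(qos.priority("merchant_1", now=300))  # normal
```

Once the penalty expires, the customer is back at normal priority, and the next webhook acts as the re-check: another slow response starts a new penalty window.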

Additionally, Razorpay configured alerts that can be sent to the customers so they can check and fix the issues on their end.

Observability

It’s critical that Razorpay’s service meets SLAs and stays highly available. If you’re processing payments for your website with their service, then you’ll probably find outages or severely degraded service to be unacceptable.

To ensure the systems scale well with increased load, Razorpay built a robust system around observability.

They have dashboards and alerts in Grafana to detect any anomalies, monitor the system’s health and analyze their logs. They also use distributed tracing tools to understand the behavior of the various components in their system.

For more details, you can read the full blog post here.