How Shopify Scaled To Billions of Rows of Data Ingestion for Black Friday

Plus, how Google does Code Review and details on Stripe's migration to TypeScript

Hey Everyone!

Today we’ll be talking about

  • How Shopify used Apache Flink and Server-Sent Events to Build their Black Friday Shopping Dashboard

    • Shopify publishes a dashboard for Black Friday that shows live data on total sales, unique shoppers, trending products, etc. across all of Shopify's stores

    • To do this, Shopify has to ingest a massive amount of data from all their stores, transform this data and then stream it to clients.

    • We'll talk about how they incorporated Apache Flink for processing the data and switched over from WebSockets to Server-Sent Events for pushing the data to clients.

  • How Google does Code Review

    • If you check out Google's GitHub repo, they have a bunch of publicly available docs on how they run engineering teams and build products.

    • They published a great page that goes through their guidelines, practices and culture around code review.

    • They talk about areas of the PR that code reviewers are encouraged to focus on, when engineers should do code review, how quickly they should respond and more.

  • Tech Snippets

    • Why Functional Programming Should be the Future of Software Development

    • How Stripe Migrated Millions of Lines of Code to TypeScript

    • How Netflix uses Machine Learning for Fraud Detection

    • Rapidly Solving Sudoku, N-Queens, Pentomino Placement and more with Knuth's Algorithm X

How Shopify Built their Black Friday Dashboard

Shopify is an e-commerce platform that allows retailers to easily create online stores. They have over 1.7 million businesses on the platform and processed over $160 billion in sales in 2021.

The Black Friday/Cyber Monday sale is the biggest event for e-commerce and Shopify runs a real-time dashboard every year showing information on how all the Shopify stores are doing.

The Black Friday/Cyber Monday (BFCM) Live Map shows data on things like

  • Total sales per minute

  • Total number of unique shoppers per minute

  • Trending products

  • Shipping distance and carbon offsets

And more. It gets a lot of traffic and serves as great marketing for the Shopify brand.

Shopify BFCM Live Map 2022 Frontend

To build this dashboard, Shopify has to ingest a massive amount of data from all their stores, transform the data and then update clients with the most up-to-date figures.

Bao Nguyen is a Senior Staff Engineer at Shopify and he wrote a great blog post on how Shopify accomplished this.

Here’s a Summary

Shopify needed to build a dashboard displaying global statistics like total sales per minute, unique shoppers per minute, trending products and more. They wanted this data aggregated from all the Shopify merchants and to be served in real time.

Two key technology choices were

  • Data Pipelines - Using Apache Flink to replace Shopify’s homegrown solution

  • Real-time Communication - Using Server-Sent Events instead of WebSockets

We’ll talk about both.

Data Pipelines

The data for the BFCM map is generated by analyzing the transaction data and metadata from millions of merchants on the Shopify platform.

This data is taken from various Kafka topics and then aggregated and cleaned. It's transformed into the relevant metrics that the BFCM dashboard needs (unique users, total sales, etc.) and then stored in Redis, where it's broadcast to the frontend via Redis Pub/Sub (which lets Redis be used as a message broker with the publish/subscribe messaging pattern).
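
To give a sense of the pub/sub pattern, here's a minimal sketch using the redis-py client. This is not Shopify's code; the channel name and payload are made up for illustration.

```python
# Sketch of the Redis Pub/Sub pattern with redis-py.
# The channel name and payload are hypothetical, not Shopify's.
import json
import redis

r = redis.Redis()

# Publisher side (the data pipeline): push fresh metrics to a channel.
# In practice the publisher and subscriber run in separate processes.
r.publish("bfcm-metrics", json.dumps({"sales_per_minute": 1500000}))

# Subscriber side (the frontend-facing service): listen for updates.
p = r.pubsub()
p.subscribe("bfcm-metrics")
for message in p.listen():
    # listen() also yields subscribe confirmations; only handle real messages
    if message["type"] == "message":
        print(json.loads(message["data"]))
```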

At peak volume, these Kafka topics would process nearly 50,000 messages per second. Shopify was using Cricket, an in-house data streaming service (built with Go), to identify events that were relevant to the BFCM dashboard and clean/process them for Redis.

However, they had issues scaling Cricket to handle all the event volume. The system's latency became too high when processing a large volume of messages per second, and it took minutes for changes to become available to clients.

Shopify has been investing in Apache Flink over the past few years, so they decided to bring it in to do the heavy lifting. Flink is an open-source, highly scalable data processing framework written in Java and Scala. It supports both bulk/batch and stream processing.

With Flink, it’s easy to run on a cluster of machines and distribute/parallelize tasks across servers. You can handle millions of events per second and scale Flink to be run across thousands of machines if necessary.

Shopify changed the system to use Flink for ingesting the data from the different Kafka topics, filtering relevant messages, cleaning and aggregating events and more. Cricket acted instead as a relay layer (an intermediary between Flink and Redis) and handled less-intensive tasks like deduplicating repeated events.
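
As a rough illustration of the kind of per-minute aggregation these jobs perform, here's a toy version in plain Python. This is not Flink code, and the event fields are made up; in a real Flink job this would be a keyed tumbling-window aggregation over the Kafka stream.

```python
# Toy illustration of per-minute tumbling-window aggregation
# (plain Python, not Flink; event fields are hypothetical).
from collections import defaultdict

def sales_per_minute(events):
    """events: iterable of dicts like {"ts": unix_seconds, "amount": dollars}"""
    windows = defaultdict(float)
    for event in events:
        minute = event["ts"] // 60  # bucket each event into a one-minute window
        windows[minute] += event["amount"]
    return dict(windows)

events = [
    {"ts": 0, "amount": 25.0},
    {"ts": 30, "amount": 10.0},
    {"ts": 75, "amount": 40.0},
]
print(sales_per_minute(events))  # {0: 35.0, 1: 40.0}
```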

2021 BFCM live map system diagram

This scaled extremely well and the Flink jobs ran with 100% uptime throughout Black Friday / Cyber Monday without requiring any manual intervention.

Shopify wrote a longform blog post specifically on the Cricket to Flink redesign, which you can read here.

Streaming Data to Clients with Server-Sent Events

The other issue Shopify faced was around streaming changes to clients in real time.

With real-time data streaming, there are three main types of communication models:

  • Push - The client opens a connection to the server and that connection remains open. The server pushes messages and the client waits for those messages. The server will maintain a list of connected clients to push data to.

  • Polling - The client makes a request to the server to ask if there are any updates. The server responds with whether or not there's a message. The client repeats these requests at some configured interval (a minimal sketch follows this list).

  • Long Polling - The client makes a request to the server and this connection is kept open until a response with data is returned. Once the response is returned, the connection is closed and reopened immediately or after a delay.
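
For illustration, a minimal polling client might look like the following. The endpoint and response shape are hypothetical.

```python
# Minimal HTTP polling loop (the endpoint and response shape are hypothetical).
import time
import requests

POLL_INTERVAL_SECONDS = 5  # the configured polling interval

while True:
    resp = requests.get("https://example.com/api/updates")
    body = resp.json()
    if body.get("has_update"):
        print(body["data"])
    time.sleep(POLL_INTERVAL_SECONDS)
```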

To implement real-time data streaming, there are various protocols you can use.

One way is to just send frequent HTTP requests from the client to the server, asking whether there are any updates (polling).

Another is to use WebSockets to establish a connection between the client and server and send updates through that (either push or long polling).

Or you could use server-sent events to stream events to the client (push).

WebSockets to Server-Sent Events

Previously, Shopify used WebSockets to stream data to the client. WebSockets provide a bidirectional communication channel over a single TCP connection. Both the client and server can send messages through the channel.

However, having a bidirectional channel was overkill for Shopify. Instead, they found that Server-Sent Events would be a better option.

Server-Sent Events (SSE) provides a way for the server to push messages to a client over HTTP.

Credits to Gokhan Ayrancioglu for the awesome image

The client will send a GET request to the server specifying that it’s waiting for an event stream with the text/event-stream content type. This will start an open connection that the server can use to send messages to the client.
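
Here's a minimal sketch of an SSE endpoint. Flask is used purely for illustration; this is not Shopify's server.

```python
# Minimal SSE endpoint sketch (Flask is an assumption for illustration;
# this is not Shopify's implementation).
import json
import time
from flask import Flask, Response

app = Flask(__name__)

def event_stream():
    while True:
        payload = json.dumps({"total_sales": 12345})  # placeholder metric
        # SSE messages are plain text: a "data:" line followed by a blank line
        yield f"data: {payload}\n\n"
        time.sleep(1)

@app.route("/stream")
def stream():
    # The text/event-stream content type tells the client to hold the
    # connection open and read events as they arrive
    return Response(event_stream(), mimetype="text/event-stream")

if __name__ == "__main__":
    app.run(threaded=True)  # each open stream occupies a worker thread
```

A browser client could consume this with the built-in EventSource API pointed at the /stream route.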

With SSE, some of the benefits the Shopify team saw were

  • Secure uni-directional push: The connection stream is coming from the server and is read-only which meant a simpler architecture

  • Uses HTTP requests: The team was already familiar with HTTP, which made it easier to implement

  • Automatic Reconnection: If there’s a loss of connection, reconnection is automatically retried

To handle the traffic, they built their SSE server to be horizontally scalable, with a cluster of VMs sitting behind Shopify's NGINX load balancers. This cluster autoscales based on load.

Connections to the clients are maintained by the servers, so they need to know which clients are active (and should receive data). To verify they could handle a high volume of connections, Shopify built a Java application that initiates a configurable number of SSE connections to the server, and ran it on VMs in different regions to simulate the expected number of connections.
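
Shopify's load-test tool was written in Java, but a rough sketch of the same idea in Python might look like this. The URL and connection count are made up.

```python
# Rough sketch of an SSE load-test client (Shopify's tool was Java;
# the URL and connection count here are hypothetical).
import threading
import requests

def hold_sse_connection(url):
    # stream=True keeps the HTTP connection open so we can read the event stream
    with requests.get(url, stream=True,
                      headers={"Accept": "text/event-stream"}) as resp:
        for _line in resp.iter_lines():
            pass  # a real test might record per-event latency here

NUM_CONNECTIONS = 100  # configurable in the real tool
for _ in range(NUM_CONNECTIONS):
    threading.Thread(target=hold_sse_connection,
                     args=("https://example.com/stream",),
                     daemon=True).start()

threading.Event().wait()  # keep the main thread alive while connections stay open
```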

Results

Shopify was able to handle all the traffic with 100% uptime for the BFCM Live Map.

By using Server-Sent Events, they were also able to minimize data latency and deliver data to clients within milliseconds of availability.

Overall, data was visualized on the BFCM Live Map UI within 21 seconds of its creation time.

For more details, you can read the full blog post here.

Tech Snippets

  • Rapidly Solving Sudoku, N-Queens, Pentomino Placement and more with Knuth's Algorithm X - An exact cover problem is where you have a collection of items and some rules/constraints. You need to select a subset of the items that satisfies the constraints without any overlaps. The N Queens problem is an example of this, where you want to place N Queens on a chess board where none of the queens can attack each other. Algorithm X is a method created by Donald Knuth for solving these types of problems. This is a great blog post that delves into Algorithm X and how it works.

  • How Stripe Migrated Millions of Lines of Code to TypeScript - In 2022, Stripe migrated their largest JavaScript codebase (powering the Stripe Dashboard) from Flow to TypeScript. Flow is an optional type system for JavaScript developed at Facebook, and Stripe was an early adopter. Since then, they’ve had some problems with it so they made the decision to migrate to TypeScript. They published a great blog post on their migration strategy, preparation, process and more.

  • Spellbook of Modern JavaScript Web Dev - This is an awesome GitHub repo that aggregates a ton of resources on web dev. It has resources on JavaScript and Node, Web Assembly, Web APIs, CSS and PostCSS, Web Design and some resources on Server Side programming. The resources are structured quite well, so it's a good reference if you're a web developer.

  • Machine Learning for Fraud Detection in Streaming Services - Netflix engineers have to deal with tons of fraud, whether it's people using stolen accounts, pirating content, paying with stolen credit cards, etc. To deal with this, they've developed robust anomaly-detection systems that fall into two categories: rule-based and model-based. Rule-based systems rely on domain experts to hand-code rules, while model-based systems are trained with machine learning algorithms. Netflix published a great blog post diving into both types of models and how they employ them to fight fraud.

  • Why Functional Programming Should Be the Future of Software Development - Charles Scalfani is the CTO of Panoramic Software and he wrote an interesting article for IEEE Spectrum on why functional programming helps reduce fragility in codebases and reduces the amount of time spent on maintaining code. A big cost is that these languages can have a steep learning curve, but Scalfani argues that the benefits make up for this.

Code Reviews at Google

If you check out Google's GitHub repo, they have a bunch of publicly available documentation on how they run engineering teams and build products.

They published a great page that goes through their guidelines, practices and culture around code review.

What to Focus On

Areas that code reviewers are encouraged to focus on are

  • Design - Is the code well designed and appropriate for the system? Does it belong or should it go in some other library/package? Do the interactions of various pieces of the code make sense?

  • Functionality - Does the code behave as the author intended? In other words, are there any bugs that haven’t been caught? Look for edge cases, concurrency bugs, resource leaks, etc.

  • Complexity - Is the code more complicated than it should be? Would another developer be able to easily understand and use this code when they come across it in the future? Is there any over-engineering, where there’s unnecessary functionality?

  • Tests - Ask for unit, integration or end-to-end tests as appropriate for the change. Make sure that the tests are correct, sensible and useful. Remember that tests are also code that has to be maintained, so don’t accept unneeded complexity in tests.

  • Naming - Did the developer pick good names? A good name is long enough to fully communicate what the item is. It should also follow standard conventions around naming.

  • Comments - Are the comments clear and easily understandable? Are they all necessary? Are the comments just band-aids for unnecessary code complexity (perhaps the code should be refactored instead of having the comment)?

  • Style - Google has style guides for all their major languages. If a reviewer is suggesting a style point that isn’t in the style guide, then they should prefix the comment with Nit to let the developer know that it’s a nitpick. Don’t block code based only on a personal style preference.

Speed of Code Reviews

When code reviews are slow, it has a big negative impact on developer velocity and morale. The developer who submitted the pull request loses productivity, and the same delay on every code review means the entire team will be slower to release new features, bug fixes, etc.

One business day is the maximum time it should take to respond to a code review request and pull requests should be small enough to make this easy to accomplish.

Developers should do code reviews during break points in their work, such as after coming back from lunch or returning from a meeting.

Interrupting a developer while he/she is working on a focused task (i.e. coding) is counterproductive due to how long it takes to get back into a flow state.

For more advice on how to write code review comments, handling pushback, navigating pull requests and more, read the full post here.

Interview Question

You are given an array of k linked lists.

Each list is sorted in ascending order.

Merge all the linked lists into one sorted linked list and return it.

Previous Question

As a reminder, here’s our last question

You are given a list of strings called words and a string pattern.

Return a list of the strings inside words that match the pattern.

A word matches the pattern if the letters in the word can be mapped one-to-one to characters in the pattern.

Solution

This question is a bit of an extension of another common interview question, which is to find if two strings are isomorphic.

An isomorphism is a one-to-one mapping (bijection) between two graphs.

In this question, our “graphs” are represented by character strings.

So, we need to find out if a one-to-one mapping can be formed between two strings.

For example, let's look at the strings “abb” and “mee”.

We can form a one-to-one mapping where a maps to m and b maps to e.

For “apple” and “zaapy”, we can form a one-to-one mapping where a maps to z, p maps to a, l maps to p and e maps to y.

Each character in the first string maps to only one character in the second string and vice-versa.

How can we determine whether two strings are isomorphic in O(n) time?

One way is to just iterate through both strings and try to create a mapping from the first string to the second and from the second string to the first.

If creating the mapping fails at any point, then you return False.

Here's a Python 3 sketch implementing this.
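
```python
def is_isomorphic(s, t):
    # A sketch of the two-way mapping check described above.
    # Strings of different lengths can't have a one-to-one mapping.
    if len(s) != len(t):
        return False

    s_to_t, t_to_s = {}, {}
    for a, b in zip(s, t):
        # Fail if a already maps to a different character than b,
        # or b is already mapped from a different character than a
        if s_to_t.get(a, b) != b or t_to_s.get(b, a) != a:
            return False
        s_to_t[a] = b
        t_to_s[b] = a
    return True
```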

Now, to answer the original question, we can iterate through the list of words and check each word to see if it's isomorphic with our pattern.

If it is, we add it to an isomorphicWords array and then return that array after the loop terminates.

Here's a Python 3 sketch.
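
```python
def find_matching_words(words, pattern):
    # Reuses is_isomorphic from above; the function name is just illustrative.
    isomorphicWords = []
    for word in words:
        if is_isomorphic(word, pattern):
            isomorphicWords.append(word)
    return isomorphicWords

print(find_matching_words(["abb", "abc", "mee"], "xyy"))  # ['abb', 'mee']
```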