Redesigning Etsy's Machine Learning Platform

Clean Code's recommendations for Naming. How Tail Call Optimization Works. A primer on Assembly Language. How Etsy redesigned their ML platform and more!

December 31, 2021

Hey Everyone,

Happy New Year!

Today we’ll be talking about

Redesigning Etsy’s Machine Learning Platform
- The ML Platform team builds and maintains the infrastructure that Etsy data scientists use to prototype, train, deploy and maintain ML models at scale.
- Etsy rebuilt their ML Platform to deal with increased scale and to reduce maintenance costs.
- We’ll go over the principles Etsy used when redesigning and the tech stack that Etsy is using for training and prototyping, model serving, workflow orchestration and more.
Clean Code’s recommendations for Naming - Let’s kick the new year off by reviewing some principles from Bob Martin’s Clean Code!
- Use intention-revealing names that tell you why the variable exists
- Use pronounceable names so that you can discuss them easily in a conversation
- Avoid single-letter names and make names that are easily searchable
- And more heuristics on how you should name variables, functions, classes, etc.
Plus, a couple of awesome tech snippets on
- Levels.fyi’s 2021 End of Year Salary Report
- A quick primer on Assembly Language with Dave Plummer
- How Tail Call Optimization Works
- Hyrum’s Law on API Design

We also have a solution to our last coding interview question, plus a new question from Facebook.

Quastor Daily is a free Software Engineering newsletter that sends out FAANG Interview questions (with detailed solutions), Technical Deep Dives and summaries of Engineering Blog Posts.

Clean Code - Naming

When you’re writing code, you’re constantly coming up with names for classes, functions, variables, objects, etc.

An often-overlooked aspect of writing clean code is coming up with good names. Having a great naming scheme makes your code much easier to read.

There are only two hard things in Computer Science: cache invalidation and naming things. ~ Phil Karlton

Here are some rules from Clean Code’s chapter on Naming

Use Intention-Revealing Names

The name of a variable, function or class should tell you why it exists, what it does and how it’s used.

If a name requires a comment, then the name does not properly reveal its intent!

Don’t be afraid of long names! Rely on your editor’s autocomplete.

This is bad

int diff; // time difference between two days in hours

This is better

int timeDiffInHours;

Avoid Disinformation

Don’t let your names leave false clues!

For example, the i and j variables are frequently used as counters inside a loop.

This convention comes from Sigma notation in mathematics.

Therefore, don’t use i or j inside a loop for something that’s not a loop counter!
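
To make this concrete, here’s a small hypothetical sketch (the function names are made up for illustration). In the first version, i looks like a loop counter but actually holds a running total. In the second, i is only ever the counter and the total gets its own name.

```python
def sum_positive_misleading(values):
    i = 0  # reads like a loop counter, but actually holds a running total
    for value in values:
        if value > 0:
            i += value
    return i


def sum_positive(values):
    positive_total = 0
    for i in range(len(values)):  # i is used only as the loop counter
        if values[i] > 0:
            positive_total += values[i]
    return positive_total


print(sum_positive([3, -1, 4]))  # 7
```

Both functions return the same result, but only the second one lets you trust what i means at a glance.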

Make Meaningful Distinctions

Every programmer faces this situation. You’re trying to name a variable but you already have another variable with the same name that’s in the same scope.

What do you do?

Many programmers do the classic number-series naming (a1, a2, a3, a4, etc.)

Don’t do this! Avoid changing the new name in some arbitrary way just to satisfy the compiler.

Instead, put effort into giving the new name a meaningful distinction.
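
As a rough illustration (a hypothetical function, not code from the book), compare number-series argument names with names that say what each argument is for.

```python
def copy_chars_unclear(a1, a2):
    # Number-series names: which list is read and which is written?
    for index in range(len(a1)):
        a2[index] = a1[index]


def copy_chars(source, destination):
    # The names now state each argument's role.
    for index in range(len(source)):
        destination[index] = source[index]


destination = [None] * 3
copy_chars(["a", "b", "c"], destination)
print(destination)  # ['a', 'b', 'c']
```

The two functions behave identically. The only difference is that a caller of copy_chars can tell the argument order without reading the body.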

Use Pronounceable Names

If you can’t pronounce the name, it becomes harder to discuss the variable, function, object, etc. in a conversation.

Therefore, try and make the names pronounceable.

Use Searchable Names

Every now and then, you’ll want to search for a variable in your code.

You can easily grep for a name like “get_customer_activation_date” but if you have a name like “se” then it could get difficult to search for it in a large codebase (too many matches).

Clean Code recommends that you only use single-letter names as local variables inside short methods.

The length of a name should correspond to the size of its scope.

Rules for Class & Method Names

Classes and Objects should have noun or noun phrase names.

Examples - Customer, WikiPage, AddressParser

Methods should have verb or verb phrase names.

Examples - postPayment, deletePage
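
Here’s a minimal Python sketch of both rules together. The class bodies are hypothetical and only there to make the example runnable. The method names are written in snake_case to follow Python convention, so delete_page and post_payment correspond to deletePage and postPayment above.

```python
class WikiPage:  # noun: names a thing
    def __init__(self, title):
        self.title = title
        self.is_deleted = False

    def delete_page(self):  # verb phrase: names an action
        self.is_deleted = True


class Customer:  # noun
    def __init__(self, name, balance_due):
        self.name = name
        self.balance_due = balance_due

    def post_payment(self, amount):  # verb phrase
        self.balance_due -= amount


customer = Customer("Ada", 50)
customer.post_payment(20)
print(customer.balance_due)  # 30
```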

Pick One Word Per Concept

Pick one word for an abstract concept and stick with it!

It’s confusing to have fetch_, retrieve_, get_, etc. all as prefixes for equivalent methods of different classes in a single codebase.

Pick one!

Otherwise, it becomes difficult to keep track of which method prefix goes with which class.
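
A short sketch of what sticking with one word can look like. The two classes here are hypothetical. The point is just that both use the get_ prefix for the same kind of lookup, rather than one using fetch_ and the other retrieve_.

```python
class CustomerStore:
    def __init__(self, customers):
        self._customers = dict(customers)

    def get_customer(self, customer_id):
        return self._customers[customer_id]


class OrderStore:
    def __init__(self, orders):
        self._orders = dict(orders)

    # Same concept as get_customer, so it gets the same prefix.
    def get_order(self, order_id):
        return self._orders[order_id]


customers = CustomerStore({1: "Ada"})
orders = OrderStore({10: "handmade mug"})
print(customers.get_customer(1), orders.get_order(10))  # Ada handmade mug
```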

Tech Snippets

A quick primer on Assembly Language - Dave Plummer is a retired Microsoft engineer. He’s most famous for being the creator of Windows Task Manager and the Space Cadet Pinball port to Windows. In this video, Dave gives a primer on assembly language by going through a clock program for the 6502-based Commodore PET.

Levels.fyi’s 2021 End of Year Salary Report - Levels.fyi is a great site to get compensation information for big tech companies. They just released their End of Year Pay Report for 2021, listing the companies that pay the highest compensation for software engineers.

How Tail Call Optimization Works - Evan Klitzke is a software engineer at Uber who primarily works with C++, Python and Go. He was the creator of the Pyflame Python profiler. He wrote a great blog post explaining how Tail Call Optimization works and some misconceptions that people have around it (for example, TCO is not just used for recursion).

Hyrum’s Law on API Design - When your API reaches scale and has a large number of users, it does not matter what you promise in the contract. All observable behaviors of your system will be depended on by some user somewhere. You should keep this in mind when modifying your API.

Redesigning Etsy’s Machine Learning Platform

Kyle Gallatin and Rob Miles are two software engineers at Etsy working on the Machine Learning Platform team.

They wrote a great article summarizing the design of the ML infrastructure that Etsy uses and the design choices that went into building the system.

Here’s a summary

Etsy is an e-commerce platform that allows users to sell handmade or vintage items. Popular products sold on the site include things like jewelry, clothing, bags, etc.

The website makes extensive use of machine learning models for things like search, recommendations, the ad platform, trust & safety, and more.

The ML Platform team at Etsy develops and maintains the technical infrastructure that Etsy data scientists use to prototype, train and deploy ML models at scale.

Etsy’s first ML platform was built in 2017, when the data science team was much smaller and largely relied on much simpler models. As the platform had to start supporting more complex machine learning projects and new ML frameworks, the maintenance costs started to become too high.

They decided they would build a new version of the ML platform, and would rely on the following principles.

Avoid in-house tooling - Use open-source technologies like TensorFlow and managed ML solutions from platforms like Google Cloud. This way, the data science team can build models quickly without having to rely on the platform team for support.
Embrace self-service - Instead of burdening the data science team with platform-specific abstractions, let the open source and managed tools and technologies speak for themselves. Users of the ML platform can rely on those tools’ well-written documentation, which frees up the ML Platform team to spend less time on support and more on core work.
Toolset Flexibility - TensorFlow is the primary modeling framework; however, users of the platform shouldn’t be limited to a single toolset. They should be able to experiment and deploy models using any ML library.

The design of ML Platform V2

Training and Prototyping

Etsy’s training and prototyping platform largely relies on Google Cloud services like Vertex AI and Dataflow, where the data science team can experiment freely with the ML framework of their choice.

They can use Jupyter Notebooks to quickly iterate while using complex infrastructure and managing large amounts of data.

Massive extract, transform, load (ETL) jobs can be run through Dataflow while complex training jobs can be submitted to Vertex AI for optimization.

Model Serving

Etsy relies on Google Kubernetes Engine (GKE) for the core of their Model Serving system (making inferences in production).

To deploy models, data scientists will create stateless ML microservices that are deployed in Etsy’s Kubernetes cluster.

These microservices will then serve requests from Etsy’s website or mobile app.

The deployments are managed through the Model Management Service, an in-house developed control plane that gives the data science team a simple UI to manage their model deployments.

The Model Management Service violates Etsy’s “avoid in-house tooling” principle, but it was already built out and the ML Platform team found it was still the best tool available.

However, they extended the Model Management Service to support two additional open-source serving frameworks: TensorFlow Serving and Seldon Core.

Workflow Orchestration

In order to keep ML models up-to-date, the ML platform also needs robust pipelines for retraining and deployment.

Etsy relies on Kubeflow and TFX (TensorFlow Extended) pipelines for this. With Google Cloud Platform’s Vertex AI Pipelines, the data science team can develop and test pipelines using either the Kubeflow or TFX SDK, based on their own preference.

This makes it much faster to write, test and validate pipelines.

Outcomes

The ML Platform team estimates that with V2, ML practitioners at Etsy can now go from idea to live ML experiment in half the time it previously took. Launching new model architectures takes days instead of weeks, and data scientists can launch dozens of hyperparameter tuning experiments with a single command.

The biggest challenge around V2 has been encouraging adoption of the new ML platform. Migrating to a new platform requires upfront effort that may not align with the current priorities of the staff at Etsy.

In order to encourage adoption, the ML Platform team provided additional support to early adopters to ease the transition, put a big emphasis on transparency around components and new features, and showcased best practices.

For more details, read the full blog post!

Interview Question

You are given a string s that contains only digits.

Return all possible valid IP addresses that can be obtained from s (without changing the order of the digits in s or removing any of the digits).

Example

Input: "25525511135"

Output: ["255.255.11.135","255.255.111.35"]

Input: "0000"

Output: ["0.0.0.0"]

Input: "010010"

Output: ["0.10.0.10","0.100.1.0"]

Input: "101023"

Output: ["1.0.10.23","1.0.102.3","10.1.0.23","10.10.2.3","101.0.2.3"]

Here’s the question in LeetCode

Previous Solution

As a reminder, here’s our last question

Given two strings s and t, return the number of distinct subsequences of s which equal t.

A subsequence is a new string formed from the original string by deleting some (or none) of the characters without changing the remaining characters’ relative positions.

Example

Input - s = “babgbag”, t = “bag”

Output - 5

Here’s the question in LeetCode.

Solution

We can solve this question with Dynamic Programming.

We’ll first start with the top-down solution.

We’ll have a recursive function called _numDistinct.

The function will take two parameters, sCounter and tCounter.

sCounter represents the current index in s and tCounter represents the current index in t.

Now, we’ll iterate through the characters in s from sCounter to the end of s.

For each character, we’ll check if it’s equal to the character at index tCounter in t.

If it is, then we have a possible matching subsequence and we’ll explore this further by recursively calling _numDistinct on the next character in s and t.

Our base case is if tCounter == len(t). That means we’ve reached the end of t and found a valid subsequence.

If tCounter != len(t) but sCounter == len(s), then that’s our second base case. That means we’ve reached the end of s but we haven’t found a valid subsequence.

In order to avoid repeat computations, we’ll add a memo table as a cache.

Here’s the Python 3 code for the top-down solution.
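
A minimal sketch of the top-down approach described above could look like this. The wrapper name num_distinct_top_down and the memo dictionary keyed on (sCounter, tCounter) are assumptions made for illustration. The recursive helper and its two parameters follow the names used in the walkthrough.

```python
def num_distinct_top_down(s: str, t: str) -> int:
    memo = {}  # (sCounter, tCounter) -> number of matching subsequences

    def _numDistinct(sCounter: int, tCounter: int) -> int:
        # Base case 1: every character of t has been matched.
        if tCounter == len(t):
            return 1
        # Base case 2: s ran out before t was fully matched.
        if sCounter == len(s):
            return 0
        key = (sCounter, tCounter)
        if key in memo:
            return memo[key]
        count = 0
        # Try every position in s (from sCounter on) that matches t[tCounter].
        for i in range(sCounter, len(s)):
            if s[i] == t[tCounter]:
                count += _numDistinct(i + 1, tCounter + 1)
        memo[key] = count
        return count

    return _numDistinct(0, 0)


print(num_distinct_top_down("babgbag", "bag"))  # 5
```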

However, this code will still give you a “time limit exceeded” error on LeetCode.

In order to pass all the test cases, you’ll have to flip the code around to come up with the bottom-up DP solution.

You can do that with a DP table.

Here’s the Python 3 code for the bottom-up solution.
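
One way to write the bottom-up version is sketched below. The function name num_distinct_bottom_up and the table layout are assumptions for illustration: dp[i][j] holds the number of ways to form t[j:] from s[i:], so the answer ends up in dp[0][0].

```python
def num_distinct_bottom_up(s: str, t: str) -> int:
    m, n = len(s), len(t)
    # dp[i][j] = number of distinct subsequences of s[i:] that equal t[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    # An empty remainder of t can always be matched exactly one way.
    for i in range(m + 1):
        dp[i][n] = 1

    # Fill the table from the ends of both strings back to the start.
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            # Option 1: skip s[i].
            dp[i][j] = dp[i + 1][j]
            # Option 2: use s[i] to match t[j], if the characters agree.
            if s[i] == t[j]:
                dp[i][j] += dp[i + 1][j + 1]

    return dp[0][0]


print(num_distinct_bottom_up("babgbag", "bag"))  # 5
```

Each cell is filled in constant time, so this runs in O(len(s) * len(t)) time with no recursion.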
