The Engineering Behind Instagram's Recommendations

How Instagram built an Information Retrieval System to power their Suggested Posts feature.

Hey Everyone!

Today we’ll be talking about

  • An overview of Snapchat's AWS Architecture

    • Saral Jain is the Senior Director of Engineering at Snap Inc and he gave an interview for AWS on what happens in their backend when you send a snap.

    • Snap runs their backend on EKS (Elastic Kubernetes Service) where they use more than 900 EKS clusters and many of the clusters have 1000+ instances.

    • They're heavy users of DynamoDB and have their own database built on top of it for storing metadata related to snaps.

  • How Instagram's Suggested Posts feature works

    • Instagram's Suggested Posts feature is an example of an Information Retrieval system. These systems usually consist of a Candidate Generation step and a Candidate Ranking/Scoring step.

    • Instagram uses account embeddings (based on word embeddings) and co-occurrence based similarity for Candidate Generation

  • Tech Snippets

    • Seinfeld and Software Engineering - Worlds are colliding!!

    • How Monzo deploys to Production over 100 times a day

    • Learn Postgres in your web browser

    • Migrating to an Asynchronous System at Netflix


How Instagram Suggests New Content

Instagram is a social media app with more than 2 billion active users. In order to keep users engaged, the company dedicates a ton of resources to making sure the posts in a user’s feed are relevant, fresh and interesting.

Instagram launched a Suggested Posts feature where they recommend posts a user may enjoy from accounts the user isn’t following and they place those posts in the user’s feed. The goal is to make it easier for users to find new accounts to follow.

Amogh Mahapatra is a machine learning engineer at Meta and he wrote a great blog post on how Instagram implemented this feature.

Here’s a summary

Instagram’s Suggested Posts feature finds photo/video posts that you may like from accounts that you don’t follow. This results in you finding more content you like, following more accounts and spending more time on Instagram.

This feature is an example of the Information Retrieval problem, where you have a large set of documents (Instagram posts) and you want to find certain documents based on a set of criteria.

Information Retrieval systems typically have a two-step design

  1. Candidate Generation - based on the user’s interests, fetch all the candidates that a user could be interested in. In this case, Instagram is looking for all the possible posts from accounts the user doesn’t follow that they may be interested in.

  2. Candidate Selection/Scoring - rank the candidates and select the best subset to show to the user. In this scenario that means looking at the potential posts from the candidate generation stage and selecting the best few posts that will be shown to the user as Suggested Posts in their feed.
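Here's a minimal sketch of that two-step shape. The function name suggest_posts, the two callables and the toy data are all hypothetical stand-ins for illustration, not Instagram's actual code.

```python
from typing import Callable, Iterable, List


def suggest_posts(
    user_id: str,
    generate_candidates: Callable[[str], Iterable[str]],
    score: Callable[[str, str], float],
    k: int = 3,
) -> List[str]:
    """Two-step retrieval: build a broad candidate set, then keep the top k by score."""
    # Step 1: candidate generation (broad and cheap).
    candidates = set(generate_candidates(user_id))
    # Step 2: candidate selection (rank every candidate, keep the best few).
    ranked = sorted(candidates, key=lambda post: score(user_id, post), reverse=True)
    return ranked[:k]


# Toy stand-ins for the two stages (hypothetical data).
toy_candidates = {"alice": ["p1", "p2", "p3", "p4"]}
toy_scores = {"p1": 0.2, "p2": 0.9, "p3": 0.5, "p4": 0.7}

top = suggest_posts(
    "alice",
    generate_candidates=lambda user: toy_candidates.get(user, []),
    score=lambda user, post: toy_scores[post],
    k=2,
)
print(top)  # ['p2', 'p4']
```

The split matters because the scoring step is usually the expensive one, so it only runs on the short list produced by the first step.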

How Instagram does Candidate Generation

The first step is to search for posts that a user may like from accounts the user isn’t following. To do this, Instagram relies on user embeddings and co-occurrence based similarity.

User embeddings are a popular technique in building recommendation systems, where you use a machine learning model to generate a vector that represents a user. The vector is a list of numbers, one per dimension. The ML model chooses these numbers based on the user’s engagement data, so users with similar engagement data get similar vectors. You can then find the accounts most similar to a user by looking for other accounts that are nearby in vector space (using something like cosine similarity).

This is based on word embeddings, where you generate a vector representation of a word based on the meaning and usage of that word. We talked about word embeddings in a previous article on how Dropbox implemented their image search feature.
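To make the "nearby in vector space" idea concrete, here's a small sketch of a cosine similarity lookup. The three-dimensional vectors and account names are made up for illustration; real embeddings are learned by a model and have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(target, embeddings, k=2):
    """Return the k accounts whose vectors are closest to the target's vector."""
    scored = [
        (name, cosine_similarity(embeddings[target], vector))
        for name, vector in embeddings.items()
        if name != target
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scored[:k]]


# Hypothetical embeddings, hand-written for the example.
embeddings = {
    "warriors": [0.9, 0.1, 0.0],
    "lakers": [0.8, 0.2, 0.1],
    "baking_daily": [0.0, 0.1, 0.9],
}
print(most_similar("warriors", embeddings, k=1))  # ['lakers']
```

This brute-force scan compares the target against every other account, which is fine for a demo. At large scale you'd typically reach for an approximate nearest neighbor index instead.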

Instagram also uses a technique called Co-occurrence Similarity, which comes from frequent pattern mining. They look at user data to see what media users are engaging with and look for any co-occurring accounts (accounts that also get engagement from those users). Then, they calculate co-occurrence frequencies of media pairs and use them for Candidate Generation. For example, there may be a lot of users who like posts from the Golden State Warriors and also from the Los Angeles Lakers (two NBA teams). Users who follow one team and not the other might benefit from getting the other team’s posts as Suggested Posts.
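Here's a rough sketch of co-occurrence counting. As a simplification, it counts pairs of accounts rather than pairs of individual media items, and the engagement data, function names and min_count threshold are all assumptions made up for the example.

```python
from collections import Counter
from itertools import combinations


def co_occurrence_counts(engagements):
    """Count how many users engaged with both accounts of each pair."""
    counts = Counter()
    for accounts in engagements.values():
        # Sort so each pair is always stored in the same order.
        for a, b in combinations(sorted(accounts), 2):
            counts[(a, b)] += 1
    return counts


def co_occurring_candidates(user, engagements, counts, min_count=2):
    """Suggest accounts that frequently co-occur with ones the user engages with."""
    seen = engagements[user]
    scores = Counter()
    for (a, b), n in counts.items():
        if n < min_count:
            continue  # ignore pairs that rarely appear together
        if a in seen and b not in seen:
            scores[b] += n
        elif b in seen and a not in seen:
            scores[a] += n
    return [account for account, _ in scores.most_common()]


# Hypothetical engagement data: user -> accounts they engaged with.
engagements = {
    "u1": {"warriors", "lakers"},
    "u2": {"warriors", "lakers", "nba"},
    "u3": {"warriors", "lakers"},
    "u4": {"warriors"},
}
counts = co_occurrence_counts(engagements)
print(counts[("lakers", "warriors")])  # 3
print(co_occurring_candidates("u4", engagements, counts))  # ['lakers']
```

User u4 only engages with the Warriors, but three other users engage with both teams, so the Lakers become a candidate for u4.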

Cold Start Problem

An issue that you’ll frequently see with recommender systems is the Cold Start problem, where the system performs poorly for new users due to a lack of data.

Instagram deals with this in two ways:

  • Popular Media - For extremely new users who don’t follow anyone / haven’t engaged with any content, Instagram will recommend posts that are popular with the general Instagram user base. The recommendation algorithm can then adjust based on the user’s response to those initial posts.

  • Fallback Graph Exploration - If a user hasn’t engaged with any content but follows other accounts, Instagram will generate candidates for them by evaluating their one-hop and two-hop connections. They’ll look at accounts followed by the user and see what posts those accounts liked and use that to generate candidates.
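The two fallbacks can be sketched like this. The data structures (follows, likes, popular_posts) and the tie-breaking rule are assumptions for the example, not details from the original blog post.

```python
from collections import Counter


def cold_start_candidates(user, follows, likes, popular_posts, k=3):
    """Generate candidates for a user who has no engagement history."""
    followed = follows.get(user, set())
    if not followed:
        # Brand new user: fall back to globally popular posts.
        return popular_posts[:k]

    # One hop: accounts the user follows. Two hops: posts those accounts liked.
    counts = Counter()
    for account in followed:
        for post in likes.get(account, set()):
            counts[post] += 1

    # Most-liked first; ties broken alphabetically to keep the output stable.
    ranked = sorted(counts, key=lambda post: (-counts[post], post))
    return ranked[:k]


# Hypothetical data.
follows = {"new_user": set(), "casual_user": {"a", "b"}}
likes = {"a": {"p1", "p2"}, "b": {"p2", "p3"}}
popular_posts = ["hot1", "hot2", "hot3", "hot4"]

print(cold_start_candidates("new_user", follows, likes, popular_posts, k=2))  # ['hot1', 'hot2']
print(cold_start_candidates("casual_user", follows, likes, popular_posts))  # ['p2', 'p1', 'p3']
```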

How Instagram does Candidate Selection

The candidate generation step generates a group of potential Suggested Posts. In the candidate selection stage, machine learning models are used to pick the best few posts that’ll be shown to the user.

To do this, Instagram uses a ton of different data points and various machine learning models. Many of the data points are also generated using ML models.

Some of the data points considered are

  • Probability of user engagement

  • Content quality of the image/video

  • Past Author-Viewer interactions/engagement

  • User embeddings

And much more.
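As a toy illustration of combining data points like these into a single ranking score, here's a sketch that uses a fixed weighted sum. The weights and feature values are invented, and a hand-tuned sum is only a stand-in for the learned models Instagram actually uses.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    post_id: str
    p_engagement: float     # predicted probability the user engages
    content_quality: float  # quality score for the image/video
    author_affinity: float  # strength of past author-viewer interactions


# Hypothetical weights; a real system would learn these from data.
WEIGHTS = {"p_engagement": 0.6, "content_quality": 0.3, "author_affinity": 0.1}


def score(candidate):
    """Weighted sum of the candidate's features."""
    return (
        WEIGHTS["p_engagement"] * candidate.p_engagement
        + WEIGHTS["content_quality"] * candidate.content_quality
        + WEIGHTS["author_affinity"] * candidate.author_affinity
    )


def select_top(candidates, k=2):
    """Keep the k highest scoring posts."""
    ranked = sorted(candidates, key=score, reverse=True)
    return [c.post_id for c in ranked[:k]]


candidates = [
    Candidate("p1", p_engagement=0.9, content_quality=0.2, author_affinity=0.0),
    Candidate("p2", p_engagement=0.5, content_quality=0.9, author_affinity=0.5),
    Candidate("p3", p_engagement=0.1, content_quality=0.5, author_affinity=0.9),
]
print(select_top(candidates))  # ['p2', 'p1']
```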

Some of the models used are

In order to select the best models, hyperparameters, etc., Meta relies on online A/B testing and offline simulations. The offline simulations work by replaying a user’s actions (their likes, comments, shares, etc.) to different models and training them to predict the user’s actions. Then, these engagement prediction models can be used to evaluate candidate ranking models.

Offline simulation can’t replace A/B testing since there are many behavioral dynamics that are too complicated to model, but it provides a higher throughput alternative to quickly evaluate model performance. You can read more about offline simulation at Meta here.

For more details on Instagram’s Suggested Post feature, read the full article here.

Tech Snippets

  • Engineering Festivus - This is a hilarious collection of Seinfeld scenes adapted for software engineering. My favorite one so far is Jerry and George talking about all the new frontend frameworks out there and comparing them. If you're into Seinfeld then you should definitely check them out.

  • How Monzo deploys to Production over 100 times a day - Monzo is an online bank based in the UK with nearly 6 million customers. As a startup, it's extremely important for them to iterate quickly and ship fast. To do so, they've optimized their developer workflow for rapid delivery and built a culture of shipping small, reversible changes. This is a great article that delves into how they've done this.

  • Learn Postgres in your web browser! - This is an awesome tutorial that lets you learn/practice your Postgres skills in the browser. The playground loads sample datasets and then guides you through topics like indexing, partitioning, general maintenance and more.

  • Migrating to an Asynchronous System at Netflix - This is a great talk on migrating a synchronous request-response based system to an asynchronous system with Kafka. It goes into why they migrated the system and some of the challenges the team had to solve around data loss, duplicate processing, latency, etc.

How Snapchat Works

Snapchat is an instant messaging app with over 300 million daily active users. The company uses a multi-cloud strategy relying heavily on AWS and Google Cloud Platform.

Saral Jain is the Senior Director of Engineering at Snap Inc where he leads the Cloud Infrastructure, Data and IT organizations. He gave a great interview on the AWS series This is my Architecture.

He discussed what happens on the backend when a user sends or receives a snap (an image/video) on the app. This is for a video series by AWS, so unfortunately he only talks about the AWS architecture.

Here’s a summary

For their AWS stack, Snap runs their backend on Elastic Kubernetes Service (EKS) and they use more than 900 EKS clusters where many of the clusters have 1000+ instances.

The core services involved in sending and receiving snaps are the

  • Media Delivery Service

  • Core Orchestration Service

  • Friend Graph

  • Snap DB

When a user sends a snap from their mobile device, their phone will talk to Snapchat’s API Gateway.

The Gateway will communicate with the Media Delivery Service to send the picture/video to AWS CloudFront (AWS’ Content Delivery Network) and also persist it in S3. The media will be given a media ID that it can be referenced through.

Once the media has been persisted, Snap’s Orchestration service will query Snapchat’s friend graph to make sure that the sender has the permissions to send the picture/video to the recipient (they should be friends on Snapchat).

If the permissions check passes, the Orchestration service will persist the conversation metadata (including the media ID) into Snap DB. Snap DB is Snapchat’s custom database that is built on top of DynamoDB (a proprietary NoSQL database by AWS). They store nearly 400 terabytes of data in DynamoDB.

The team created their own database as a frontend to DynamoDB to add higher level features for Snap’s specific use cases. Snap deals with a lot of ephemeral data, so they added optimizations for that, along with TTL support and custom transactions, to reduce costs.
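To tie the send path together, here's a small in-memory sketch. The class name, method name and dictionaries are hypothetical stand-ins for the Media Delivery Service, friend graph and Snap DB. None of this reflects Snap's real code or any AWS API.

```python
import uuid


class SnapBackend:
    """In-memory stand-in for the services on the send path."""

    def __init__(self, friends):
        self.friends = friends  # friend graph: user -> set of friends
        self.media_store = {}   # stands in for S3 / CloudFront
        self.snap_db = []       # stands in for Snap DB

    def send_snap(self, sender, recipient, media):
        # 1. Media Delivery Service: persist the media and assign a media ID.
        media_id = str(uuid.uuid4())
        self.media_store[media_id] = media

        # 2. Orchestration service: check the friend graph for permission.
        if recipient not in self.friends.get(sender, set()):
            raise PermissionError(f"{sender} cannot send snaps to {recipient}")

        # 3. Persist the conversation metadata, including the media ID.
        self.snap_db.append(
            {"sender": sender, "recipient": recipient, "media_id": media_id}
        )
        return media_id


backend = SnapBackend(friends={"alice": {"bob"}})
media_id = backend.send_snap("alice", "bob", b"cat picture bytes")
print(backend.snap_db[0]["media_id"] == media_id)  # True
```

The sketch follows the order described in the interview, where the media is persisted before the permission check runs.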

For receiving a snap, the orchestration service will look up a connection ID from ElastiCache to get access to the persistent connection that Snap’s servers have with clients who have the app open.

The service looks at the conversation metadata to get the media ID of the picture/video. The content is retrieved from CloudFront and then sent to the recipient’s device.

If the recipient doesn’t have the app open, then Snapchat relies on Apple Push Notification Service or Firebase Cloud Messaging.
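The receive path can be sketched the same way. The function and parameter names here are hypothetical, with plain dictionaries standing in for ElastiCache and CloudFront and simple callables standing in for the persistent connection and the push notification services.

```python
def deliver_snap(message, connection_cache, media_cdn, send_over_connection, send_push):
    """Deliver a snap over a live connection, or fall back to a push notification."""
    recipient = message["recipient"]

    # Look up the recipient's persistent connection (ElastiCache stand-in).
    connection_id = connection_cache.get(recipient)
    if connection_id is None:
        # App is closed: rely on a push notification instead.
        send_push(recipient)
        return "push_notification"

    # Use the media ID from the conversation metadata to fetch the content.
    media = media_cdn[message["media_id"]]
    send_over_connection(connection_id, media)
    return "delivered"


message = {"sender": "alice", "recipient": "bob", "media_id": "m1"}
media_cdn = {"m1": b"cat picture bytes"}
sent, pushed = [], []

online = deliver_snap(
    message, {"bob": "conn-42"}, media_cdn,
    lambda conn, media: sent.append((conn, media)), pushed.append,
)
offline = deliver_snap(
    message, {}, media_cdn,
    lambda conn, media: sent.append((conn, media)), pushed.append,
)
print(online, offline)  # delivered push_notification
```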

For more details, you can watch the full video here.

Interview Question

You are given an array nums of distinct positive integers.

Return the number of tuples (a, b, c, d) such that

a * b = c * d

where a, b, c and d are elements of nums and

a != b != c != d (all four elements are distinct)

Previous Question

As a reminder, here’s our last question

Given a string s, return the longest palindromic substring in s. A palindromic substring is a substring that is a palindrome.

Solution

The brute force solution is to go through every substring in our string and check each one to see if it's a palindrome.

The number of substrings scales quadratically with the size of the input string (O(n^2)). Checking a string to see if it's a palindrome scales linearly. Therefore, the brute force solution takes O(n^3) time.

Instead, we can go through every character in our string and assume it's the middle of our palindrome. Then, we set two pointers on the left and right of that character and see what's the longest possible palindrome centered at that character. We'll have to check for palindromes of both even and odd length.

Once we've iterated through the entire string, then we can return the longest palindrome.

The time complexity is O(n^2).
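Here's one way to write the expand-around-center approach in Python. The function and helper names are our own choice.

```python
def longest_palindrome(s: str) -> str:
    """Return the longest palindromic substring of s in O(n^2) time."""

    def expand(left, right):
        # Grow outward while the characters on both ends match.
        while left >= 0 and right < len(s) and s[left] == s[right]:
            left -= 1
            right += 1
        # The loop overshoots by one on each side, so step back in.
        return left + 1, right  # half-open range [start, end)

    best_start, best_end = 0, 0
    for i in range(len(s)):
        # Odd length: centered on s[i]. Even length: centered between s[i] and s[i + 1].
        for start, end in (expand(i, i), expand(i, i + 1)):
            if end - start > best_end - best_start:
                best_start, best_end = start, end
    return s[best_start:best_end]


print(longest_palindrome("babad"))  # bab
print(longest_palindrome("cbbd"))   # bb
```

There are 2n - 1 possible centers and each expansion takes O(n) time in the worst case, which gives the O(n^2) bound. Apart from the returned substring, the extra space used is O(1).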