Apple's Perceptual Hashing

A look into Facebook's key-value store called ZippyDB. Also, a look into the technical implementation of Apple's new CSAM policy. Plus, a great YouTube series on building an Operating System.

Hey Everyone,

Today we’ll be talking about

  • ZippyDB, the largest strongly consistent, geographically distributed key-value store at Facebook

    • We’ll talk about its architecture, data model, consistency guarantees, and transactions.

  • Apple’s CSAM (Child Sexual Abuse Material) implementation

    • Apple recently announced a new policy for CSAM.

    • We’ll talk about the technological implementation and the NeuralHash function.

  • A great YouTube series on how to build an Operating System

We also have a solution to our last Coding Interview question on Binary Search Trees and a new interview question from Facebook.

Apple’s CSAM changes

Apple has recently announced new changes to iOS designed around CSAM (Child Sexual Abuse Material).

Apple will be scanning user images for CSAM if the user syncs their photos with iCloud.

If CSAM exists in the images (and the number of occurrences passes a certain threshold), Apple will disable the iCloud account and pass the user’s information to federal/state authorities in that jurisdiction.

Privacy experts have almost uniformly criticized this policy, saying it sets a dangerous precedent. Here’s a detailed argument from the Electronic Frontier Foundation on why Apple’s policy may be dangerous.

We’ll be focusing on the technical implementation of these CSAM protections.

System Overview

Rather than scanning images in the cloud, Apple’s system performs on-device matching using a database of known CSAM image hashes provided by child-safety organizations.

The hashing is done using NeuralHash, a perceptual hashing function.

When a user uploads an image to iCloud Photos, an on-device matching process compares that image’s NeuralHash to a database of known CSAM NeuralHashes (both NeuralHashes are first transformed using elliptic-curve cryptography). The matching process is done through an algorithm called Private Set Intersection.
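Apple hasn’t published code for its Private Set Intersection protocol, and the real construction uses elliptic curves, but the core Diffie-Hellman-style PSI idea can be sketched in a few lines of Python. Everything below (the prime-field group, the sets, the variable names) is an illustrative stand-in, not Apple’s implementation:

```
import hashlib
import random

# Toy Diffie-Hellman-style PSI over a prime field. Purely illustrative;
# Apple's actual protocol uses elliptic curves and stronger guarantees.
P = 2**127 - 1  # a Mersenne prime, standing in for a real group modulus

def hash_to_group(item):
    return int(hashlib.sha256(item.encode()).hexdigest(), 16) % P

def blind(items, secret):
    # Raise each hashed item to a private exponent, hiding the item itself
    return {pow(hash_to_group(x), secret, P) for x in items}

a = random.randrange(2, P)  # device's secret exponent
b = random.randrange(2, P)  # server's secret exponent

device_hashes = {"hash_001", "hash_002", "hash_003"}  # hashes on the device
known_database = {"hash_002", "hash_999"}             # known CSAM hashes

# Each side blinds its own set, then re-blinds the other side's blinded set.
# H(x)^(a*b) is identical on both sides exactly when the underlying items match.
device_double = {pow(v, b, P) for v in blind(device_hashes, a)}
db_double = {pow(v, a, P) for v in blind(known_database, b)}

print(len(device_double & db_double))  # 1 match (hash_002); raw items never exchanged
```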

Apple then uses another technology called Threshold Secret Sharing, which ensures that Apple cannot see how many CSAM matches there are until the iCloud Photos account crosses a certain threshold of known CSAM matches.
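Apple’s exact Threshold Secret Sharing construction isn’t spelled out here, but the classic way to get “nothing is learnable below a threshold” is Shamir’s secret sharing: the secret is the constant term of a random polynomial, and any t points on the polynomial recover it while t-1 points reveal nothing. A minimal sketch, with all parameters invented for illustration:

```
import random

PRIME = 2**127 - 1  # field modulus for the polynomial arithmetic

def make_shares(secret, threshold, num_shares):
    # Random polynomial of degree threshold-1 with the secret as constant term
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(threshold - 1)]
    def eval_poly(x):
        result = 0
        for coeff in reversed(coeffs):  # Horner's method
            result = (result * x + coeff) % PRIME
        return result
    return [(x, eval_poly(x)) for x in range(1, num_shares + 1)]

def reconstruct(shares):
    # Lagrange interpolation at x = 0 recovers the constant term (the secret)
    secret = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret

shares = make_shares(secret=42, threshold=3, num_shares=5)
assert reconstruct(shares[:3]) == 42  # any 3 shares recover the secret
# any 2 shares interpolate to a random-looking value, revealing nothing
```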

At that point, Apple performs a manual review. If the flagged CSAM matches are confirmed, then Apple will send a report to federal or state authorities.

This technology will be included in iOS 15 and iPadOS 15.

Here’s a breakdown of NeuralHash…

NeuralHash

NeuralHash is a perceptual hashing function that can map an image to a numerical hash.

The hash is based on the features of the image rather than the pixel values of the image.

NeuralHash is part of the class of Locality-Sensitive Hashing functions.

Because of this, identical and visually similar images will result in the same hash, and images that are different from one another will result in different hashes.

Images that are slightly cropped or resized will have the same hash as the original.

When you’re using NeuralHash to generate a hash for an image, there are two main steps.

  1. The image is passed into a convolutional neural network to generate an N-dimensional, floating-point descriptor of the image. This neural network is trained through a self-supervised training scheme: images are modified with transformations that keep them perceptually similar to the original, and the CNN is taught to generate descriptors that are close together for the original/transformed pair. The CNN is also taught to generate descriptors that are far apart for images that are not perceptually similar.

  2. The floating-point descriptor is passed through a locality-sensitive hashing scheme to convert the N floating-point numbers to M bits. Descriptors that are close to each other will get the same hash, while descriptors that are far apart will get different hashes (a toy sketch of this step follows below).
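NeuralHash’s exact hashing scheme isn’t reproduced here, but random-hyperplane LSH is the textbook way to turn a float descriptor into bits, and it demonstrates the key property: nearby descriptors disagree on only a few bits. The dimensions and seed below are arbitrary:

```
import numpy as np

rng = np.random.default_rng(seed=0)

N, M = 128, 96  # descriptor dimension, hash length in bits (arbitrary here)
hyperplanes = rng.standard_normal((M, N))  # one random hyperplane per output bit

def lsh_bits(descriptor):
    # The sign of the projection onto each hyperplane gives one bit;
    # descriptors separated by a small angle flip very few bits.
    return (hyperplanes @ descriptor > 0).astype(np.uint8)

original = rng.standard_normal(N)
similar = original + 0.05 * rng.standard_normal(N)  # slightly perturbed descriptor
unrelated = rng.standard_normal(N)

print((lsh_bits(original) != lsh_bits(similar)).sum())    # only a few bits differ
print((lsh_bits(original) != lsh_bits(unrelated)).sum())  # roughly half the bits differ
```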

For a much more detailed breakdown, check out Apple’s full technical summary on CSAM detection.

Tech Snippets

  • A great YouTube series on how to build an Operating System

ZippyDB

Read the full post here.

ZippyDB is the largest strongly consistent, geographically distributed key-value store at Facebook.

It was first deployed in 2013 and is now used for tons of different use cases like counting events, product data, metadata on distributed applications, and more.

It’s popular for both ephemeral and non-ephemeral small key-value data.

Due to the large number of different use cases, ZippyDB offers lots of flexibility for durability, consistency, availability and latency guarantees.

History of ZippyDB

ZippyDB uses RocksDB as the underlying storage engine.

Before ZippyDB, various teams at Facebook used RocksDB directly to manage their data.

However, this resulted in a duplication of effort as each team was solving similar challenges around consistency, fault tolerance, recovery, replication and capacity management.

ZippyDB was built to solve the challenges for all of these teams.

One key decision was to reuse as much existing infrastructure as possible.

Therefore, ZippyDB uses RocksDB as a storage engine and is built on top of Shard Manager (Facebook’s shard management tool) and on top of ZooKeeper (a distributed configuration service).

Architecture

ZippyDB is deployed in units known as Tiers.

Each tier consists of compute and storage resources spread across several geographic areas.

The default tier that is used is the “wildcard” tier, which is the generic multi-tenant tier. This is the preferred tier because of its better utilization of hardware and lower operational overhead.

However, additional dedicated tiers can be brought up if needed (usually due to stricter isolation requirements).

The data belonging to a tier is split into units known as shards, the most basic unit of data management on the server side.

Each shard is replicated across multiple regions, using either Paxos or asynchronous replication to replicate the data.

The replication strategy can be configured based on application durability needs and write/read performance.
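As a rough illustration of those knobs, a per-use-case replication configuration might look something like this (the field names are invented for this sketch; ZippyDB’s actual configuration interface isn’t public):

```
# Invented field names, purely to illustrate the durability/latency trade-offs
# described above; this is not ZippyDB's actual configuration format.
replication_config = {
    "num_paxos_replicas": 3,  # synchronously replicated via Paxos (durable writes)
    "num_followers": 2,       # asynchronously replicated (may lag; cheap reads)
    "regions": ["us-east", "us-west", "eu-west"],
}
```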

Data Model

ZippyDB supports a simple key-value data model with APIs to get, put, and delete keys.

You can iterate over key prefixes and delete a range of keys.

The API is very similar to the API exposed by the underlying RocksDB storage engine.

Applications can also access data on ZippyDB through an ORM layer.
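ZippyDB’s client library isn’t public, so the sketch below is a hypothetical, in-memory stand-in whose method names (put, get, delete, iterate_prefix, delete_range) simply mirror the data model described above:

```
# Hypothetical client illustrating the data model above. The class and method
# names are invented; this is an in-memory stand-in, not Facebook's actual API.

class ZippyDBClient:
    def __init__(self):
        self._store = {}  # stand-in for the remote, replicated tier

    def put(self, key: bytes, value: bytes) -> None:
        self._store[key] = value

    def get(self, key: bytes):
        return self._store.get(key)

    def delete(self, key: bytes) -> None:
        self._store.pop(key, None)

    def iterate_prefix(self, prefix: bytes):
        # Iterate over all keys sharing a prefix, RocksDB-style
        for key in sorted(self._store):
            if key.startswith(prefix):
                yield key, self._store[key]

    def delete_range(self, start: bytes, end: bytes) -> None:
        # Delete all keys in the half-open range [start, end)
        for key in [k for k in self._store if start <= k < end]:
            del self._store[key]

client = ZippyDBClient()
client.put(b"user:1:name", b"alice")
client.put(b"user:1:email", b"alice@example.com")
print(list(client.iterate_prefix(b"user:1:")))  # both keys for user 1
```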

Consistency

ZippyDB provides configurable consistency and durability levels to applications.

Applications can make durability, consistency, and performance trade-offs dynamically on a per-request basis.

For writes, ZippyDB’s default behavior is to persist the data on a majority of replicas’ Paxos logs and write the data to RocksDB on the primary before acknowledging the write to the client.

Some applications cannot tolerate this much latency (and don’t need strong durability and consistency guarantees), so ZippyDB supports a fast-acknowledge mode, where writes are acknowledged as soon as they are enqueued on the primary.
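Here’s a toy sketch contrasting the two acknowledgment modes. The Replica class and function names are stand-ins for illustration, not Facebook’s code:

```
# Toy contrast of ZippyDB's two write-acknowledgment modes; all names invented.

class Replica:
    def __init__(self):
        self.paxos_log = []

    def persist(self, entry):
        self.paxos_log.append(entry)  # stand-in for a durable log write
        return True

def default_write(primary_db, replicas, entry):
    # Persist in a majority of replicas' Paxos logs and apply on the
    # primary's RocksDB before acknowledging the client.
    acks = sum(replica.persist(entry) for replica in replicas)
    if acks > len(replicas) // 2:
        primary_db[entry["key"]] = entry["value"]
        return "acknowledged"
    return "failed"

def fast_ack_write(primary_queue, entry):
    # Acknowledge as soon as the write is enqueued on the primary;
    # replication happens later, trading durability for lower latency.
    primary_queue.append(entry)
    return "acknowledged"

db, queue = {}, []
replicas = [Replica(), Replica(), Replica()]
print(default_write(db, replicas, {"key": "k1", "value": "v1"}))  # acknowledged
print(fast_ack_write(queue, {"key": "k2", "value": "v2"}))        # acknowledged
```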

For reads, the three most popular consistency levels are eventual, read-your-write, and strong.

  • Eventual Consistency - ZippyDB ensures that reads aren’t served by replicas that are lagging behind the primary beyond a certain configurable threshold (heartbeats are used to determine lag). This provides a much stronger consistency guarantee than your typical eventual consistency.

  • Read-Your-Writes Consistency - Clients cache the latest sequence number returned by the server for writes and use that version to run at-or-later queries when reading (see the sketch after this list).

  • Strong Consistency - Clients can see the effects of the most recent writes regardless of where the writes or reads come from. Strong reads are implemented by routing the read request to the primary.
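Here’s a minimal sketch of the read-your-writes mechanism, assuming an invented server interface where writes return a sequence number and reads accept a minimum version:

```
# Minimal read-your-writes sketch; the Server/Client classes are invented.

class Server:
    def __init__(self):
        self.seq = 0
        self.data = {}

    def write(self, key, value):
        self.seq += 1
        self.data[key] = (value, self.seq)
        return self.seq  # sequence number handed back to the client

    def read_at_or_later(self, key, min_seq):
        if self.seq < min_seq:
            raise RuntimeError("replica too stale; retry against a fresher one")
        value, _ = self.data.get(key, (None, 0))
        return value

class Client:
    def __init__(self, server):
        self.server = server
        self.last_seq = 0  # latest sequence number this client has seen

    def put(self, key, value):
        self.last_seq = max(self.last_seq, self.server.write(key, value))

    def get(self, key):
        # At-or-later query: only accept data at least as new as our own writes
        return self.server.read_at_or_later(key, self.last_seq)

server = Server()
client = Client(server)
client.put("profile", "v2")
print(client.get("profile"))  # guaranteed to reflect this client's own write
```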

Transactions and Conditional Writes

ZippyDB supports transactions and conditional writes for use cases that need atomic read-modify-write operations on a set of keys.

Conditional writes are useful for cases where a client wants to atomically modify a set of keys based on some common precondition (such as whether certain keys are present or whether certain values match).

The conditional write API can be more efficient than the transaction API in cases where clients can compute the precondition without requiring a read.
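A hedged sketch of the idea: apply a batch of mutations only if a precondition on the current data holds, all as one atomic step. The API shape below is invented, with a lock standing in for server-side atomicity:

```
import threading

# Invented conditional-write API; the lock stands in for server-side atomicity.

class Store:
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def conditional_write(self, precondition, mutations):
        # precondition: callable(data) -> bool, e.g. "this key is absent"
        # mutations: dict of key -> new value, applied atomically
        with self._lock:
            if not precondition(self._data):
                return False  # precondition failed; nothing is written
            self._data.update(mutations)
            return True

store = Store()
# Only create the record if it doesn't already exist (no prior read needed)
created = store.conditional_write(
    precondition=lambda data: "job:42" not in data,
    mutations={"job:42": "queued"},
)
print(created)  # True the first time, False if retried
```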

Read the full article here.

Interview Question

You are given two non-empty linked lists representing two non-negative integers.

The digits are stored in reverse order, and each of their nodes contains a single digit.

Add the two numbers together and return the sum as a linked list.

You may assume the two numbers do not contain any leading zeros.

Example

Input: list1 = 2 -> 4 -> 3, list2 = 5 -> 6 -> 4

Output: 7 -> 0 -> 8

Explanation: 342 + 465 = 807

Previous Solution

As a reminder, here’s our last question

Implement a BST Iterator class that represents an iterator over the in-order traversal of a Binary Search Tree.

Implement the following methods

  • BSTIterator - constructor. The root of the BST will be passed in as a parameter. The pointer should be initialized to a non-existent number smaller than any number in the BST.

  • hasNext - Returns true if there exists a number in the traversal to the right of the pointer, otherwise returns false.

  • next - Moves the pointer to the right, then returns the number at the pointer.

Solution

We can solve this question by doing an iterative in-order traversal.

We’ll use an array to keep track of our stack and use the variable cur to keep track of our current node.

Constructor

When we instantiate our class, we’ll first create cur and set it to None.

Then, we’ll create an empty array that represents our stack.

The smallest node in a BST is the left-most node in the tree.

Therefore, we’ll keep moving down to the left child until we reach None. While we do this, we add each parent node to our stack.

We do this part in _getLeftMost.

next

When the user calls the next method, we’ll do some basic case analysis.

  • if cur is None and our stack is not empty - then we pop a node off our stack, set cur to it, and return it.

  • if cur is not None and has a right child - then we continue the in-order traversal in the right child’s subtree. We first find the smallest node in that subtree by moving down the left children, adding each parent node to our stack (our _getLeftMost function). We then return the smallest node.

  • if cur is not None and doesn’t have a right child - then we first check whether our stack is empty. If the stack is not empty, we pop a node off the stack and return it. Otherwise, we’ve traversed the entire tree, so we return None.

hasNext

For hasNext, we check whether the length of our stack is greater than 0 or whether the node at cur has a right child.

If either of those is true, we return True.

Otherwise, we return False.

Here’s Python 3 code that follows the steps above. It assumes a simple TreeNode class with val, left, and right attributes.
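```
class TreeNode:
    def __init__(self, val, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

class BSTIterator:
    def __init__(self, root):
        self.cur = None          # pointer starts before the smallest number
        self.stack = []
        self._getLeftMost(root)  # stack the left spine of the tree

    def _getLeftMost(self, node):
        # Move down the left children, adding each parent node to the stack
        while node:
            self.stack.append(node)
            node = node.left

    def next(self):
        # If cur has a right child, its in-order successors live in that
        # subtree, so stack the subtree's left spine first
        if self.cur and self.cur.right:
            self._getLeftMost(self.cur.right)
        if not self.stack:
            return None          # the whole tree has been traversed
        self.cur = self.stack.pop()
        return self.cur.val

    def hasNext(self):
        return len(self.stack) > 0 or bool(self.cur and self.cur.right)

# Example: the BST 2 / (1, 3) yields 1, 2, 3 in order
it = BSTIterator(TreeNode(2, TreeNode(1), TreeNode(3)))
while it.hasNext():
    print(it.next())
```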