How Facebook Keeps Millions of Servers Synced

Plus, how Google uses Design Docs, strategies for managing infrastructure costs and more.

Hey Everyone!

Today we’ll be talking about

  • How Facebook Keeps Millions of Servers Synchronized

    • Why keeping the system clocks of your servers synchronized is crucial

    • Brief Overview of Network Time Protocol (NTP)

    • Network Time Protocol at Meta

    • Brief Overview of Precision Time Protocol (PTP)

    • Switching to PTP and Deploying it at Meta

  • Tech Snippets

    • How Google uses Design Docs

    • Strategies for Managing Infrastructure Costs

    • A List of Resources for Fuzz Testing

    • Static Analysis Tools

How Facebook Keeps Millions of Servers Synced

If you’re running a distributed system, it’s incredibly important to keep the system clocks of the machines synchronized. Even a few seconds of skew between machines can cause a wide range of problems.

You can probably imagine why unsynchronized clocks would be a big issue, but just to beat a dead horse…

  • Data Consistency - You might have data stored across multiple storage nodes for redundancy and performance. When data is updated, this needs to be propagated to all the storage nodes that hold a copy. If the system clocks aren’t synchronized, then one node’s timestamp might conflict with another. This can cause confusion about which update is the most recent, leading to data inconsistency, frustration and anger.

  • Observability - In past articles, we’ve talked a bunch about keeping logs, storing metrics, traces, etc. This is crucial for understanding what’s happening in your distributed system. However, all of these things are useless if your machines don’t have synchronized clocks and the timestamps are all messed up.

  • Network Security - Many cryptographic protocols rely on synchronized clocks for correctness. The Kerberos authentication protocol, for example, uses timestamps to prevent replay attacks (where an attacker intercepts a network message and replays it later). Solely relying on a machine’s internal clock for the time is not a good idea from a security perspective.

  • Event Ordering - Understanding the order in which events occur is obviously very important. You might have one event that debits an account and another that credits it. Processing these transactions in the wrong order at scale will lead to incorrect account balances and unhappy customers (or happy customers who will become very, very unhappy).

And many more reasons.
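To make the data consistency problem concrete, here is a minimal sketch of a "last write wins" conflict going wrong. The Write class, the node names and the 2-second skew are all made up for illustration; they are not taken from Facebook's systems.

```python
from dataclasses import dataclass


@dataclass
class Write:
    node: str
    value: str
    true_time: float   # seconds on a hypothetical perfect reference clock
    clock_skew: float  # how far this node's clock is ahead (+) or behind (-)

    @property
    def timestamp(self) -> float:
        # The timestamp the node actually records, skew included.
        return self.true_time + self.clock_skew


def last_write_wins(writes):
    """Return the write with the largest recorded timestamp."""
    return max(writes, key=lambda w: w.timestamp)


# Node A writes first, but its clock runs 2 seconds fast.
first = Write(node="A", value="old", true_time=100.0, clock_skew=2.0)
# Node B writes one second later with an accurate clock.
second = Write(node="B", value="new", true_time=101.0, clock_skew=0.0)

winner = last_write_wins([first, second])
print(winner.value)  # prints "old": the stale write wins
```

Node B's update really happened later, but node A's fast clock stamps the older write with a larger timestamp, so the stale value is kept.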

Why do computers get unsynchronized

For time keeping, the gold standard is an atomic clock. They have an error rate of ~1 second in a span of 100 million years. However, they’re too expensive to put in every machine.

Instead, computers typically contain quartz clocks. These are far less accurate and can drift by a couple of seconds per day.
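As a rough back-of-the-envelope sketch, here is what that drift looks like as a rate. The 2 seconds per day figure is just the "couple of seconds" from above, and the function names are made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def drift_ppm(drift_seconds_per_day: float) -> float:
    """Express a daily drift as parts per million (microseconds lost per second)."""
    return drift_seconds_per_day / SECONDS_PER_DAY * 1_000_000


def seconds_until_error(drift_seconds_per_day: float, max_error_seconds: float) -> float:
    """How long an uncorrected clock takes to accumulate max_error_seconds of error."""
    rate = drift_seconds_per_day / SECONDS_PER_DAY
    return max_error_seconds / rate


print(round(drift_ppm(2.0), 1))                   # 23.1
print(round(seconds_until_error(2.0, 0.001), 1))  # 43.2
```

Under these assumptions, a clock that is never corrected would be a full millisecond off in well under a minute, which is why machines have to resync continuously rather than once.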

To keep computers synced, we rely on networking protocols like Network Time Protocol (NTP) and Precision Time Protocol (PTP).

Facebook published a fantastic series of blog posts delving into their use of NTP, why they switched to PTP and how they currently keep machines synced.

You can read the full blog posts below

We’ll summarize the articles and give some extra context.

Intro to Network Time Protocol

NTP is one of the oldest protocols that’s still in current use. It’s intended to synchronize all participating computers to within a few milliseconds of UTC.

With NTP, you have clients (devices that need to be synchronized) and NTP servers (which keep track of the time).

Here’s a high level overview of the steps for communication between the two.

  1. The client will send an NTP request packet to the time server. The packet will be stamped with the time from the client (the origin timestamp).

  2. The server stamps the time when the request packet is received (the receive timestamp)

  3. The server stamps the time again when it sends a response packet back to the client (the transmit timestamp)

  4. The client stamps the time when the response packet is received (the destination timestamp)

These timestamps allow the client to account for the roundtrip delay and work out the difference between its internal time and that provided by the server. It adjusts accordingly and synchronizes itself based on multiple requests to the NTP server.
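The arithmetic behind those four timestamps can be sketched in a few lines. This uses the standard NTP offset and delay formulas; the function name and the example numbers are made up for illustration.

```python
def ntp_offset_and_delay(origin, receive, transmit, destination):
    """Estimate clock offset and round-trip delay from one NTP exchange.

    origin      - client clock when the request was sent
    receive     - server clock when the request arrived
    transmit    - server clock when the response was sent
    destination - client clock when the response arrived
    """
    # Total time away from the client, minus the time spent inside the server.
    delay = (destination - origin) - (transmit - receive)
    # Average of the two one-way differences. This assumes the network
    # delay is the same in both directions.
    offset = ((receive - origin) + (transmit - destination)) / 2
    return offset, delay


# Hypothetical exchange: the client is 0.5 s behind the server, each
# network hop takes 20 ms and the server spends 1 ms on the request.
offset, delay = ntp_offset_and_delay(
    origin=1000.000, receive=1000.520, transmit=1000.521, destination=1000.041
)
print(round(offset, 3))  # 0.5
print(round(delay, 3))   # 0.04
```

A positive offset means the client is behind the server and needs to move its clock forward. If the outbound and return paths take different amounts of time, the offset estimate is off by half the difference, which is one reason a single exchange isn't enough.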

NTP Strata

Of course, you can’t have millions of computers all trying to stay synced with a single atomic clock. It’s far too many requests for a single NTP server to handle.

Instead, NTP uses a hierarchy, where the machines in the NTP network are divided into layers called strata.

  • Stratum 0 - atomic clock or GPS receiver

  • Stratum 1 - synced directly with a stratum 0 device

  • Stratum 2 - servers that sync with stratum 1 devices

  • Stratum 3 - servers that sync with stratum 2 devices

And so on until stratum 15. Stratum 16 is used to indicate that a device is unsynchronized.
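One way to picture the strata is as the number of hops between a machine and a reference clock. The sketch below uses a made-up topology (the host names and the upstream_of mapping are hypothetical) and is not how an NTP daemon actually tracks its stratum.

```python
UNSYNCHRONIZED = 16


def stratum(host, upstream_of, reference_clocks):
    """Count the hops from host up to a stratum 0 reference clock."""
    hops = 0
    current = host
    seen = set()
    while current not in reference_clocks:
        # A loop or a missing upstream means there is no path to a reference.
        if current in seen or current not in upstream_of:
            return UNSYNCHRONIZED
        seen.add(current)
        current = upstream_of[current]
        hops += 1
    return min(hops, UNSYNCHRONIZED)


# Hypothetical topology: each host maps to the source it syncs from.
upstream_of = {
    "ntp-a": "gps",
    "ntp-b": "ntp-a",
    "web-1": "ntp-b",
    "orphan": "missing",
}
reference_clocks = {"gps"}

for host in ["gps", "ntp-a", "ntp-b", "web-1", "orphan"]:
    print(host, stratum(host, upstream_of, reference_clocks))
```

Here web-1 ends up at stratum 3, while orphan has no path to a reference clock and is reported as 16.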

A computer may query multiple NTP servers, discard any outliers (in case of faults with the servers) and then average the rest.
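A simplified sketch of that idea is below. Real NTP implementations use more elaborate selection and clustering algorithms; the median-based filter and the 10 ms tolerance here are assumptions picked to keep the example short.

```python
from statistics import mean, median


def combine_offsets(offsets, tolerance=0.010):
    """Drop offsets far from the median, then average the rest (seconds)."""
    mid = median(offsets)
    kept = [o for o in offsets if abs(o - mid) <= tolerance]
    return mean(kept)


# Offsets reported by four hypothetical servers; the last one is faulty.
offsets = [0.0021, 0.0019, 0.0020, 0.7500]
print(round(combine_offsets(offsets), 4))  # 0.002
```

Without the outlier check, the faulty server would drag the average to roughly 0.19 seconds.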

Computers may also query the same NTP server multiple times over the course of a few minutes and then use statistics to reduce random error due to variations in network latency.
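One common way to do this is to trust the sample that had the shortest round trip, since it had the least room for queueing and asymmetry. The sketch below shows only that single idea with made-up numbers; it is similar in spirit to NTP's clock filter but is not a full implementation of it.

```python
def best_sample(samples):
    """Pick the (offset, delay) pair with the smallest round-trip delay."""
    return min(samples, key=lambda sample: sample[1])


# Hypothetical (offset, delay) pairs in seconds from repeated queries
# to the same server.
samples = [
    (0.0031, 0.048),  # slow round trip, likely delayed by congestion
    (0.0020, 0.012),  # fastest round trip
    (0.0026, 0.030),
]

offset, delay = best_sample(samples)
print(offset, delay)  # 0.002 0.012
```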

Here’s a fantastic article that delves into NTP

NTP at Facebook

Facebook’s NTP service was designed in four main layers

  • Stratum 0 - layer of satellites with extremely precise atomic clocks from a GPS system

  • Stratum 1 - Facebook’s atomic clock

  • Stratum 2 - Pool of NTP servers

  • Stratum 3 - Servers configured for larger scale

Credits - Meta’s Engineering Blog

In terms of the process that runs on servers to keep them synchronized, Facebook tested out two time daemons

  • Ntpd - this is used on most Unix-like operating systems and has been stable for many years

  • Chrony - this is newer than Ntpd and has additional features that provide more precise time synchronization. It can theoretically bring precision down to nanoseconds.

Facebook ended up migrating their infrastructure to Chrony and you can read the reasoning here.

However, in late 2022, Facebook switched entirely away from NTP to Precision Time Protocol (PTP).

Precision Time Protocol

PTP was introduced in 2002 as a way to sync clocks more precisely than NTP.

While NTP provides millisecond-level synchronization, PTP networks aim to achieve nanosecond or even picosecond-level precision.

There are many things that can throw off your clock synchronization

  • The response time from the servers can depend on the software/driver/firmware stack

  • The quality of the network routers, switches and network interfaces

  • Small delays when sending the signal

PTP uses hardware timestamping and transparent clocks to better measure this network delay and adjust for it. One downside is that PTP places more load on network hardware.
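Here is a rough sketch of how a PTP client can turn those measurements into a clock offset, using the delay request-response exchange. The function name and all the numbers are made up for illustration, and the two correction values stand in for the time that transparent clocks (switches) report each message spent inside them.

```python
def ptp_offset(t1, t2, t3, t4, sync_correction=0, delay_correction=0):
    """Estimate the client's offset from the time server, in nanoseconds.

    t1 - server clock when the Sync message was sent
    t2 - client clock when the Sync message arrived
    t3 - client clock when the Delay_Req message was sent
    t4 - server clock when the Delay_Req message arrived
    The corrections are the residence times added by transparent clocks.
    """
    mean_path_delay = (
        (t2 - t1) + (t4 - t3) - sync_correction - delay_correction
    ) / 2
    offset = (t2 - t1) - mean_path_delay - sync_correction
    return offset, mean_path_delay


# Hypothetical exchange: the client is 500 ns ahead, the cable delay is
# 1000 ns each way, and a switch holds Sync for 300 ns and Delay_Req
# for 700 ns.
with_tc = ptp_offset(0, 1800, 5000, 6200, sync_correction=300, delay_correction=700)
without_tc = ptp_offset(0, 1800, 5000, 6200)

print(with_tc)     # (500.0, 1000.0)
print(without_tc)  # (300.0, 1500.0)
```

With the switch's residence times subtracted, the client recovers the true 500 ns offset. Without them, the uneven time spent in the switch looks like path delay and the estimate is off by 200 ns.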

Benefits of PTP at Facebook

Switching to PTP gave Facebook quite a few benefits

  1. Higher Precision and Accuracy - PTP allows for precision within nanoseconds whereas NTP has precision within milliseconds.

  2. Better Scalability - NTP systems require frequent check-ins to ensure synchronization, which can slow down the network as the system grows. On the other hand, PTP allows systems to rely on a single source of truth for timing, improving scalability.

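  3. Mitigation of Network Delays and Errors - PTP uses hardware timestamping and transparent clocks to measure network delay more accurately, which significantly reduces the synchronization error that delays introduce.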

For more details on this, you can read the full post here.

Deploying PTP at Facebook

With PTP, Facebook is striving for nanosecond accuracy. The design consists of three main components.

  • PTP Rack

  • The Network

  • The Client

PTP Rack

This houses the hardware and software that serves time to clients.

It consists of

  • GNSS Antenna - Antenna in Facebook’s data centers that receives signals from GNSS (Global Navigation Satellite System) satellites

  • Time Appliance - Dedicated piece of hardware that consists of a GNSS receiver and a miniaturized atomic clock. Users of Time Appliance can keep accurate time, even in the event of GNSS connectivity loss.

PTP Network

Responsible for transmitting the PTP messages from the PTP rack to clients. It uses unicast transmission, which simplifies network design and improves scalability.

PTP Client

You need a PTP client running on your machines to communicate with the PTP network. Meta uses ptp4l, an open source client.

However, they faced some issues with edge cases and certain types of network cards.

For all the details on how Facebook deployed PTP, you can read the full article here.

Tech Snippets