How Facebook Keeps Millions of Servers Synced
Plus, how Google uses Design Docs, strategies for managing infrastructure costs and more.
Hey Everyone!
Today we’ll be talking about
How Facebook Keeps Millions of Servers Synchronized
Why keeping the system clocks of your servers synchronized is crucial
Brief Overview of Network Time Protocol (NTP)
Network Time Protocol at Meta
Brief Overview of Precision Time Protocol (PTP)
Switching to PTP and Deploying it at Meta
Tech Snippets
How Google uses Design Docs
Strategies for Managing Infrastructure Costs
A List of Resources for Fuzz Testing
Static Analysis Tools
How Facebook Keeps Millions of Servers Synced
If you’re running a distributed system, it’s incredibly important to keep the system clocks of the machines synchronized. If the machines are off by a few seconds, this will cause a huge variety of different issues.
You can probably imagine why unsynchronized clocks would be a big issue, but just to beat a dead horse…
Data Consistency - You might have data stored across multiple storage nodes for redundancy and performance. When data is updated, this needs to be propagated to all the storage nodes that hold a copy. If the system clocks aren’t synchronized, then one node’s timestamp might conflict with another. This can cause confusion about which update is the most recent, leading to data inconsistency, frustration and anger.
Observability - In past articles, we’ve talked a bunch about keeping logs, storing metrics, traces, etc. This is crucial for understanding what’s happening in your distributed system. However, all of these things are useless if your machines don’t have synchronized clocks and the timestamps are all messed up.
Network Security - Many cryptographic protocols rely on synchronized clocks for correctness. The Kerberos authentication protocol, for example, uses timestamps to prevent replay attacks (where an attacker intercepts a network message and replays it later). Solely relying on a machine’s internal clock for the time is not a good idea from a security perspective.
Event Ordering - Understanding the order in which events occur is obviously very important. You might have one event that debits an account and another that credits it. Processing these transactions in the wrong order at scale will lead to incorrect account balances and unhappy customers (or happy customers who will become very, very unhappy).
And many more reasons.
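To make the data consistency and event ordering problems concrete, here is a minimal sketch of a "last write wins" rule, where the update with the largest timestamp is kept. The `Write` class, the node names and the 3 second skew are all made up for illustration; real storage systems use more careful conflict resolution.

```python
from dataclasses import dataclass


@dataclass
class Write:
    node: str
    value: str
    true_time: float   # seconds on a perfect reference clock (hypothetical)
    clock_skew: float  # how far this node's clock is ahead (+) or behind (-)

    @property
    def timestamp(self) -> float:
        # The timestamp a node records comes from its own, possibly wrong, clock.
        return self.true_time + self.clock_skew


def last_write_wins(writes: list[Write]) -> Write:
    """Keep the write with the largest recorded timestamp."""
    return max(writes, key=lambda w: w.timestamp)


# Node A writes first, but its clock runs 3 seconds fast.
older = Write(node="A", value="balance=100", true_time=10.0, clock_skew=3.0)
# Node B writes one second later with an accurate clock.
newer = Write(node="B", value="balance=50", true_time=11.0, clock_skew=0.0)

winner = last_write_wins([older, newer])
print(winner.node, winner.value)  # the stale write from node A is kept
```

Node B's update really happened later, but node A's fast clock stamps its write with a larger timestamp, so the older value survives.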
Why Do Computers Get Unsynchronized?
For timekeeping, the gold standard is an atomic clock, which drifts by roughly 1 second over a span of 100 million years. However, atomic clocks are too expensive to put in every machine.
Instead, computers typically contain quartz clocks. These are far less accurate and can drift by a couple of seconds per day.
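To put the two figures on the same scale, here is a quick back-of-the-envelope calculation. It assumes "a couple of seconds per day" means 2 seconds and uses a 365.25-day year; both are assumptions for illustration, not measured values.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def fractional_error(drift_seconds: float, interval_seconds: float) -> float:
    """Clock error expressed as a fraction of the elapsed time."""
    return drift_seconds / interval_seconds


# Quartz: assumed 2 seconds of drift per day.
quartz = fractional_error(2.0, SECONDS_PER_DAY)
# Atomic: roughly 1 second of drift per 100 million years.
atomic = fractional_error(1.0, 100e6 * SECONDS_PER_YEAR)

print(f"quartz: {quartz * 1e6:.1f} parts per million")
print(f"atomic: {atomic:.1e} (fraction of elapsed time)")
print(f"quartz drifts about {quartz / atomic:.1e} times faster")
```

Under these assumptions the quartz clock is off by about 23 parts per million, which is why it needs regular correction from a better time source.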
To keep computers synced, we rely on networking protocols like Network Time Protocol (NTP) and Precision Time Protocol (PTP).
Facebook published a fantastic series of blog posts delving into their use of NTP, why they switched to PTP and how they currently keep machines synced.
You can read the full blog posts below
We’ll summarize the articles and give some extra context.
Intro to Network Time Protocol
NTP is one of the oldest protocols that’s still in use. It’s intended to synchronize all participating computers to within a few milliseconds of UTC.
With NTP, you have clients (devices that need to be synchronized) and NTP servers (which keep track of the time).
Here’s a high level overview of the steps for communication between the two.
The client will send an NTP request packet to the time server. The packet will be stamped with the time from the client (the origin timestamp).
The server stamps the time when the request packet is received (the receive timestamp)
The server stamps the time again when it sends a response packet back to the client (the transmit timestamp)
The client stamps the time when the response packet is received (the destination timestamp)
These timestamps allow the client to account for the roundtrip delay and work out the difference between its internal time and that provided by the server. It adjusts accordingly and synchronizes itself based on multiple requests to the NTP server.
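The four timestamps above are enough to estimate both the round-trip delay and the clock offset. Here is a minimal sketch of that arithmetic; the function name and the sample numbers are made up for illustration, and the calculation assumes the network delay is the same in both directions.

```python
def ntp_offset_and_delay(origin, receive, transmit, destination):
    """Estimate clock offset and round-trip delay from the four NTP timestamps.

    origin:      client clock when the request was sent
    receive:     server clock when the request arrived
    transmit:    server clock when the response was sent
    destination: client clock when the response arrived
    """
    # Time spent on the network: total elapsed time on the client
    # minus the time the server spent processing the request.
    delay = (destination - origin) - (transmit - receive)
    # How far the server clock is ahead of the client clock, assuming
    # the outbound and return trips take equally long.
    offset = ((receive - origin) + (transmit - destination)) / 2
    return offset, delay


# Hypothetical exchange: the client clock is 0.5 s behind the server,
# each network hop takes 20 ms, and the server takes 1 ms to respond.
offset, delay = ntp_offset_and_delay(
    origin=100.000, receive=100.520, transmit=100.521, destination=100.041
)
print(f"offset: {offset:+.3f} s, round-trip delay: {delay:.3f} s")
```

A positive offset means the client is behind and should move its clock forward. If the two directions have different latencies, half of the difference shows up as error in the offset, which is one reason a client takes multiple samples.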
NTP Strata
Of course, you can’t have millions of computers all trying to stay synced with a single atomic clock. It’s far too many requests for a single NTP server to handle.
Instead, NTP works on a hierarchical basis, where the machines in the NTP network are divided into strata and each stratum syncs with the one above it.
Stratum 0 - atomic clock or GPS receiver
Stratum 1 - synced directly with a stratum 0 device
Stratum 2 - servers that sync with stratum 1 devices
Stratum 3 - servers that sync with stratum 2 devices
And so on until stratum 15. Stratum 16 is used to indicate that a device is unsynchronized.
A computer may query multiple NTP servers, discard any outliers (in case of faults with the servers) and then average the rest.
Computers may also query the same NTP server multiple times over the course of a few minutes and then use statistics to reduce random error due to variations in network latency.
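Here is a deliberately simplified sketch of the "discard outliers, then average" idea. The `combine_offsets` function, the 10 ms tolerance and the sample offsets are assumptions for illustration; real NTP clients use more elaborate selection and clustering algorithms.

```python
from statistics import mean, median_low


def combine_offsets(offsets, tolerance=0.010):
    """Drop offsets far from the median, then average what is left.

    offsets:   clock offsets in seconds, one per server (or per sample)
    tolerance: how far from the median a value may be before it is
               treated as an outlier (hypothetical threshold)
    """
    # median_low always returns one of the samples, so at least
    # that sample survives the filter below.
    mid = median_low(offsets)
    kept = [o for o in offsets if abs(o - mid) <= tolerance]
    return mean(kept)


# Hypothetical offsets measured against five servers; one is faulty.
samples = [0.012, 0.010, 0.011, 0.250, 0.013]
print(f"combined offset: {combine_offsets(samples):.4f} s")
```

The faulty server's 250 ms reading is ignored, and the remaining four readings are averaged. The same function works for repeated samples from a single server, where the spread comes from variations in network latency rather than from faulty servers.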
Here’s a fantastic article that delves into NTP
NTP at Facebook
Facebook’s NTP service was designed in four main layers
Stratum 0 - layer of satellites with extremely precise atomic clocks from a GPS system
Stratum 1 - Facebook’s atomic clock
Stratum 2 - Pool of NTP servers
Stratum 3 - Servers configured for larger scale
Credits - Meta’s Engineering Blog
In terms of the process that runs on servers to keep them synchronized, Facebook tested out two time daemons: ntpd and chrony.
Facebook ended up migrating their infrastructure to Chrony and you can read the reasoning here.
However, in late 2022, Facebook switched entirely away from NTP to Precision Time Protocol (PTP).
Precision Time Protocol
PTP was introduced in 2002 as a way to sync clocks more precisely than NTP.
While NTP provides millisecond-level synchronization, PTP networks aim to achieve nanosecond or even picosecond-level precision.
There are many things that can throw off your clock synchronization
The response time from the servers can depend on the software/driver/firmware stack
The quality of the network routers, switches and network interfaces
Small delays when sending the signal
PTP uses hardware timestamping and transparent clocks to better measure this network delay and adjust for it. One downside is that PTP places more load on network hardware.
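To show how a transparent clock helps, here is a rough sketch of the offset calculation in the end-to-end delay mechanism defined by the PTP standard (IEEE 1588). This mechanism is not described in the article itself, so treat it as background; the function name and all the numbers are hypothetical. Each switch acting as a transparent clock adds the time a packet spent queued inside it to a correction field, and the client subtracts that time out.

```python
def ptp_offset(t1, t2, t3, t4, sync_correction=0, delay_req_correction=0):
    """Estimate the client's offset from the time server, in nanoseconds.

    t1: server clock when it sent the Sync message
    t2: client clock when the Sync message arrived
    t3: client clock when it sent the Delay_Req message
    t4: server clock when the Delay_Req message arrived
    *_correction: time each message spent inside switches, as reported
                  by transparent clocks
    """
    # Average one-way delay on the wire, with switch queueing removed.
    mean_path_delay = (
        (t2 - t1) + (t4 - t3) - sync_correction - delay_req_correction
    ) / 2
    # Positive result: the client clock is ahead of the server.
    offset = (t2 - t1) - mean_path_delay - sync_correction
    return offset, mean_path_delay


# Hypothetical exchange: client is 150 ns ahead, the wire delay is
# 400 ns each way, and a switch holds the Sync message for 900 ns
# but the Delay_Req message for only 100 ns.
timestamps = dict(t1=1_000_000, t2=1_001_450, t3=1_002_000, t4=1_002_350)

corrected = ptp_offset(**timestamps, sync_correction=900, delay_req_correction=100)
uncorrected = ptp_offset(**timestamps)
print("with transparent clock corrections:", corrected)
print("without corrections:", uncorrected)
```

With the corrections applied, the sketch recovers the true 150 ns offset. Without them, the uneven queueing in the two directions is mistaken for clock offset and the estimate is several hundred nanoseconds off.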
Benefits of PTP at Facebook
Switching to PTP gave Facebook quite a few benefits
Higher Precision and Accuracy - PTP allows for precision within nanoseconds whereas NTP has precision within milliseconds.
Better Scalability - NTP systems require frequent check-ins to ensure synchronization, which can slow down the network as the system grows. On the other hand, PTP allows systems to rely on a single source of truth for timing, improving scalability.
Mitigation of Network Delays and Errors - PTP’s hardware timestamping and transparent clocks measure network delays more accurately, which significantly reduces the synchronization errors those delays cause.
For more details on this, you can read the full post here.
Deploying PTP at Facebook
With PTP, Facebook is striving for nanosecond accuracy. The design consists of three main components.
PTP Rack
The Network
The Client
PTP Rack
This houses the hardware and software that serves time to clients.
It consists of
GNSS Antenna - Antenna in Facebook’s data centers that communicates with GNSS (Global Navigation Satellite System)
Time Appliance - Dedicated piece of hardware that consists of a GNSS receiver and a miniaturized atomic clock. Users of Time Appliance can keep accurate time, even in the event of GNSS connectivity loss.
PTP Network
Responsible for transmitting the PTP messages from the PTP rack to clients. It uses unicast transmission, which simplifies network design and improves scalability.
PTP Client
You need a PTP client running on your machines to communicate with the PTP network. Meta uses ptp4l, an open source client.
However, they faced some issues with edge cases and certain types of network cards.
For all the details on how Facebook deployed PTP, you can read the full article here.