How Facebook Keeps Millions of Servers Synced
Plus, how Google uses Design Docs, strategies for managing infrastructure costs and more.
Hey Everyone!
Today we’ll be talking about
How Facebook Keeps Millions of Servers Synchronized
Why keeping the system clocks of your servers synchronized is crucial
Brief Overview of Network Time Protocol (NTP)
Network Time Protocol at Meta
Brief Overview of Precision Time Protocol (PTP)
Switching to PTP and Deploying it at Meta
Tech Snippets
How Google uses Design Docs
Strategies for Managing Infrastructure Costs
A List of Resources for Fuzz Testing
Static Analysis Tools
Warrant is an open source startup that helps developers implement extremely precise and fine-grained access controls within their applications. If you’ve ever tried to do this at scale, it can be incredibly challenging because you need it to be low-latency but also highly consistent.
Google was able to solve this problem with Zanzibar, a globally distributed authorization system that they use for YouTube, Google Drive and Google Cloud.
The founders of Warrant used Google Zanzibar as inspiration for their authorization system and have since scaled to serve millions of requests every day for developers around the world.
This week, WorkOS has announced their acquisition of Warrant. They’ll be integrating Warrant’s fine grained authorization service into the WorkOS suite of products to make it easier for customers to build authorization and access controls.
If you’re curious about how Warrant works, you can view the entire codebase for their authorization system in the blog post below.
sponsored
How Facebook Keeps Millions of Servers Synced
If you’re running a distributed system, it’s incredibly important to keep the system clocks of the machines synchronized. If the clocks drift apart by even a few seconds, you’ll run into a wide variety of issues.
You can probably imagine why unsynchronized clocks would be a big issue, but just to beat a dead horse…
Data Consistency - You might have data stored across multiple storage nodes for redundancy and performance. When data is updated, this needs to be propagated to all the storage nodes that hold a copy. If the system clocks aren’t synchronized, then one node’s timestamp might conflict with another. This can cause confusion about which update is the most recent, leading to data inconsistency, frustration and anger.
Observability - In past articles, we’ve talked a bunch about keeping logs, storing metrics, traces, etc. This is crucial for understanding what’s happening in your distributed system. However, all of these things are useless if your machines don’t have synchronized clocks and the timestamps are all messed up.
Network Security - Many cryptographic protocols rely on synchronized clocks for correctness. The Kerberos authentication protocol, for example, uses timestamps to prevent replay attacks (where an attacker intercepts a network message and replays it later). Solely relying on a machine’s internal clock for the time is not a good idea from a security perspective.
Event Ordering - Understanding the order in which events occur is obviously very important. You might have one event that debits an account and another that credits it. Processing these transactions in the wrong order at scale will lead to incorrect account balances and unhappy customers (or happy customers who will become very, very unhappy).
And many more reasons.
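To make the data consistency and event ordering points concrete, here’s a minimal sketch of a "last write wins" conflict resolver picking the wrong update because of clock skew. The `Write` class, the node names and the 3-second skew are all made up for illustration and are not taken from Facebook’s systems.

```python
from dataclasses import dataclass


@dataclass
class Write:
    node: str
    value: str
    true_time: float     # seconds on a perfect reference clock (unknowable in practice)
    clock_offset: float  # how far this node's clock runs ahead (+) or behind (-)

    @property
    def timestamp(self) -> float:
        # The timestamp a node records comes from its own (possibly skewed) clock.
        return self.true_time + self.clock_offset


def last_write_wins(writes):
    """Resolve a conflict by keeping the write with the largest recorded timestamp."""
    return max(writes, key=lambda w: w.timestamp)


# Node A writes first, but its clock runs 3 seconds fast.
first = Write(node="A", value="balance=100", true_time=10.0, clock_offset=3.0)
# Node B writes one second later with an accurate clock.
second = Write(node="B", value="balance=50", true_time=11.0, clock_offset=0.0)

winner = last_write_wins([first, second])
print(winner.node, winner.value)  # A balance=100
```

Node B’s update really happened last, but node A’s fast clock stamps its older write with a later time, so the stale value survives.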
Why do computers get unsynchronized
For timekeeping, the gold standard is the atomic clock, which has an error of roughly 1 second over a span of 100 million years. However, atomic clocks are too expensive to put in every machine.
Instead, computers typically contain quartz clocks. These are far less accurate and can drift by a couple of seconds per day.
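As a rough back-of-the-envelope check, here’s what "a couple of seconds per day" means in parts per million, and how quickly a clock like that would wander past a given tolerance if nothing corrected it. The 2 seconds/day figure and the 1 millisecond tolerance are assumptions picked for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def drift_ppm(drift_seconds_per_day: float) -> float:
    """Express a daily drift as parts per million (microseconds gained or lost per second)."""
    return drift_seconds_per_day / SECONDS_PER_DAY * 1_000_000


def seconds_until_error(tolerance_s: float, drift_seconds_per_day: float) -> float:
    """How long an uncorrected clock takes to accumulate `tolerance_s` of error."""
    return tolerance_s / drift_seconds_per_day * SECONDS_PER_DAY


print(round(drift_ppm(2.0), 1))                   # 23.1
print(round(seconds_until_error(0.001, 2.0), 1))  # 43.2
```

So a clock drifting 2 seconds per day would be off by a millisecond in well under a minute, which is why machines need to resync continuously rather than once in a while.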
To keep computers synced, we rely on networking protocols like Network Time Protocol (NTP) and Precision Time Protocol (PTP).
Facebook published a fantastic series of blog posts delving into their use of NTP, why they switched to PTP and how they currently keep machines synced.
You can read the full blog posts below
We’ll summarize the articles and give some extra context.
Intro to Network Time Protocol
NTP is one of the oldest Internet protocols still in use. It’s intended to synchronize all participating computers to within a few milliseconds of UTC.
With NTP, you have clients (devices that need to be synchronized) and NTP servers (which keep track of the time).
Here’s a high level overview of the steps for communication between the two.
The client will send an NTP request packet to the time server. The packet will be stamped with the time from the client (the origin timestamp).
The server stamps the time when the request packet is received (the receive timestamp)
The server stamps the time again when it sends a response packet back to the client (the transmit timestamp)
The client stamps the time when the response packet is received (the destination timestamp)
These timestamps allow the client to account for the roundtrip delay and work out the difference between its internal time and that provided by the server. It adjusts accordingly and synchronizes itself based on multiple requests to the NTP server.
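Here’s a minimal sketch of how a client can turn those four timestamps into a clock offset and a round-trip delay. It uses the standard NTP formulas, which assume the network path takes about the same time in each direction. The function name and the example numbers are made up for illustration.

```python
def ntp_offset_and_delay(origin, receive, transmit, destination):
    """Estimate clock offset and round-trip delay from the four NTP timestamps.

    origin:      client clock when the request was sent
    receive:     server clock when the request arrived
    transmit:    server clock when the response was sent
    destination: client clock when the response arrived
    """
    # Total time away from the client, minus the time the server spent processing.
    delay = (destination - origin) - (transmit - receive)
    # Average of the two one-way differences; the network latency cancels out
    # if it is the same in both directions (an assumption).
    offset = ((receive - origin) + (transmit - destination)) / 2
    return offset, delay


# Assumed scenario: the client clock is 5 s behind the server, each network hop
# takes 50 ms and the server spends 10 ms handling the request.
offset, delay = ntp_offset_and_delay(
    origin=100.00, receive=105.05, transmit=105.06, destination=100.11
)
print(round(offset, 3), round(delay, 3))  # 5.0 0.1
```

A positive offset means the client is behind the server and needs to move its clock forward. If the path is asymmetric (say, the request is slower than the response), the offset estimate picks up an error of half the difference.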
NTP Strata
Of course, you can’t have millions of computers all trying to stay synced with a single atomic clock. It’s far too many requests for a single NTP server to handle.
Instead, NTP works on a hierarchical basis, where the machines in the NTP network are divided into strata.
Stratum 0 - atomic clock or GPS receiver
Stratum 1 - synced directly with a stratum 0 device
Stratum 2 - servers that sync with stratum 1 devices
Stratum 3 - servers that sync with stratum 2 devices
And so on until stratum 15. Stratum 16 is used to indicate that a device is unsynchronized.
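The numbering rule is simple enough to sketch in a few lines: a machine sits one stratum below whatever it syncs from, and anything that would land past 15 is treated as unsynchronized. The function and constant names here are hypothetical.

```python
MAX_STRATUM = 15     # last valid stratum
UNSYNCHRONIZED = 16  # marker value for "not synchronized"


def stratum_of(upstream_stratum: int) -> int:
    """Return the stratum of a machine that syncs from a source at `upstream_stratum`."""
    if upstream_stratum >= MAX_STRATUM:
        return UNSYNCHRONIZED
    return upstream_stratum + 1


# Walk down a chain starting from a stratum 0 reference clock.
chain = [0]
for _ in range(3):
    chain.append(stratum_of(chain[-1]))

print(chain)           # [0, 1, 2, 3]
print(stratum_of(15))  # 16
```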
A computer may query multiple NTP servers, discard any outliers (in case of faults with the servers) and then average the rest.
Computers may also query the same NTP server multiple times over the course of a few minutes and then use statistics to reduce random error due to variations in network latency.
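Here’s a simplified sketch of both ideas. The first function drops servers whose offset is far from the median and averages the rest. The second picks, from repeated queries to one server, the sample with the lowest round-trip delay, since that one was least disturbed by network queuing. Real NTP implementations use more involved selection and filtering algorithms, so treat the function names, the threshold and the sample values as assumptions.

```python
from statistics import mean, median


def combine_offsets(offsets, max_deviation):
    """Discard offsets far from the median, then average what is left."""
    mid = median(offsets)
    kept = [o for o in offsets if abs(o - mid) <= max_deviation]
    return mean(kept)


def best_sample(samples):
    """From (offset, delay) pairs for one server, keep the lowest-delay sample."""
    return min(samples, key=lambda s: s[1])


# Offsets (in seconds) reported by four servers; the last one is faulty.
server_offsets = [0.012, 0.010, 0.011, 2.500]
print(round(combine_offsets(server_offsets, max_deviation=0.005), 4))  # 0.011

# Repeated (offset, delay) measurements against a single server.
samples = [(0.014, 0.080), (0.011, 0.020), (0.019, 0.150)]
print(best_sample(samples))  # (0.011, 0.02)
```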
Here’s a fantastic article that delves into NTP
NTP at Facebook
Facebook’s NTP service was designed in four main layers
Stratum 0 - layer of satellites with extremely precise atomic clocks from a GPS system
Stratum 1 - Facebook’s atomic clock
Stratum 2 - Pool of NTP servers
Stratum 3 - Servers configured for larger scale
Credits - Meta’s Engineering Blog
In terms of the process that runs on servers to keep them synchronized, Facebook tested out two time daemons: ntpd and chrony.
Facebook ended up migrating their infrastructure to Chrony and you can read the reasoning here.
However, in late 2022, Facebook switched entirely away from NTP to Precision Time Protocol (PTP).
Precision Time Protocol
PTP was introduced in 2002 as a way to sync clocks more precisely than NTP.
While NTP provides millisecond-level synchronization, PTP networks aim to achieve nanosecond or even picosecond-level precision.
There are many things that can throw off your clock synchronization
The response time from the servers can depend on the software/driver/firmware stack
The quality of the network routers, switches and network interfaces
Small delays when sending the signal
PTP uses hardware timestamping and transparent clocks to better measure this network delay and adjust for it. One downside is that PTP places more load on network hardware.
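To show why transparent clocks matter, here’s a minimal sketch of the offset calculation in PTP’s delay request-response exchange. The message names (Sync, Delay_Req) and the correction field come from the IEEE 1588 standard rather than from the article, and the numbers are invented. All values are hardware timestamps in nanoseconds.

```python
def ptp_offset(t1, t2, t3, t4, sync_correction=0, delay_req_correction=0):
    """Return (offset_from_master, mean_path_delay) in nanoseconds.

    t1: master clock when Sync was sent
    t2: client clock when Sync arrived
    t3: client clock when Delay_Req was sent
    t4: master clock when Delay_Req arrived
    The corrections are the time each message spent inside transparent-clock
    switches, as reported by those switches.
    """
    mean_path_delay = (
        (t2 - t1) + (t4 - t3) - sync_correction - delay_req_correction
    ) / 2
    offset_from_master = (t2 - t1) - mean_path_delay - sync_correction
    return offset_from_master, mean_path_delay


# Assumed scenario: the client runs 1000 ns ahead of the master, the wire takes
# 400 ns each way, and a switch holds Sync for 250 ns and Delay_Req for 150 ns.
t1, t2, t3, t4 = 0, 1650, 5000, 4550

print(ptp_offset(t1, t2, t3, t4, sync_correction=250, delay_req_correction=150))
# (1000.0, 400.0)
print(ptp_offset(t1, t2, t3, t4))
# (1050.0, 600.0)
```

With the switch residence times subtracted out, the client recovers the true 1000 ns offset. Without them, the uneven queuing in each direction leaks into the result as a 50 ns error.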
Benefits of PTP at Facebook
Switching to PTP gave Facebook quite a few benefits
Higher Precision and Accuracy - PTP allows for precision within nanoseconds whereas NTP has precision within milliseconds.
Better Scalability - NTP systems require frequent check-ins to ensure synchronization, which can slow down the network as the system grows. On the other hand, PTP allows systems to rely on a single source of truth for timing, improving scalability.
Mitigation of Network Delays and Errors - PTP’s hardware timestamping and transparent clocks significantly reduce the effect that network delays and errors have on synchronization.
For more details on this, you can read the full post here.
Deploying PTP at Facebook
With PTP, Facebook is striving for nanosecond accuracy. The design consists of three main components.
PTP Rack
The Network
The Client
PTP Rack
This houses the hardware and software that serves time to clients.
It consists of
GNSS Antenna - Antenna in Facebook’s data centers that communicates with GNSS (Global Navigation Satellite System)
Time Appliance - Dedicated piece of hardware that consists of a GNSS receiver and a miniaturized atomic clock. Users of Time Appliance can keep accurate time, even in the event of GNSS connectivity loss.
PTP Network
Responsible for transmitting the PTP messages from the PTP rack to clients. It uses unicast transmission, which simplifies network design and improves scalability.
PTP Client
You need a PTP client running on your machines to communicate with the PTP network. Meta uses ptp4l, an open source client.
However, they faced some issues with edge cases and certain types of network cards.
For all the details on how Facebook deployed PTP, you can read the full article here.
One of the most common causes of data breaches and account hijacking is customers using weak or reused passwords.
It doesn't matter how much time you spend on security audits, penetration testing, encryption or whatever. If your customers are setting weak passwords, then your business is at risk.
A crucial way to mitigate this risk is to inform new users about the strength of their passwords as they create them.
To solve this problem, Dropbox created zxcvbn, an open-source library that calculates password strength based on factors like:
Entropy - the library measures how random and unpredictable the password is
Dictionary checks - zxcvbn checks a range of dictionaries (common passwords, names, English words) to see if the password is too similar to an existing word
Pattern recognition - the library uses pattern matching to identify common password elements like dates, repeated characters, sequences, etc.
And more
If you want an easy way to implement user password security in your app, check out AuthKit, an open-source login box that incorporates zxcvbn and other modern authentication best practices to provide a much more secure onboarding experience for new users.
sponsored