How Google stores Exabytes of Data
A look into Google Colossus, a massive distributed file system that stores exabytes of data. Plus, when is short tenure a red flag? How to manage technical quality in a codebase and more.
Today we’ll be talking about
How Google stores Exabytes of Data
Colossus is Google’s Distributed File System and it can scale to tens of thousands of machines and exabytes of storage.
It’s the next generation of Google File System, a distributed file system built in the early 2000s that inspired Hadoop Distributed File System
We’ll talk about some history of Google File System and MapReduce and then go into how Colossus works
Real World Recommendation Systems
Managing Technical Quality in a Codebase
How Dropbox Rewrote Their Web Serving Stack
AWS Builder’s Library - How to Instrument Your System for Visibility
When Is Short Tenure a Red Flag?
How to build a Centralized Monitoring Platform
In a past article, we talked about how crucial Observability is for meeting your availability targets. However, having multiple teams build their own monitoring solutions doesn’t scale and is a waste of effort.
Instead, engineering teams are adopting a centralized approach to monitoring with a Metrics as a Service model.
With this model, different stakeholders like IT Ops, developers, managers, etc. will all have a single location they can easily access for observability data.
InfluxDB wrote a great blog post detailing how you can implement a centralized Observability Platform at your own organization.
Data Collectors - InfluxData’s Telegraf is an open source plugin that easily integrates with virtually any data source
Storage - InfluxDB is a time series database that can scale to over 1 million writes per second.
Visualization - InfluxDB has native dashboarding and visualization tools but also integrates with third-party tools like Grafana.
How Google Stores Exabytes of Data
Google Colossus is a massive distributed storage system that Google uses to store and manage exabytes of data (1 exabyte is 1 million terabytes). Colossus is the next generation of Google File System (GFS), which was first introduced in 2003.
In the early 2000s, Sanjay Ghemawat, Jeff Dean and other Google engineers published two landmark papers in distributed systems: the Google File System paper and the MapReduce paper.
GFS was a distributed file system that could scale to over a thousand nodes and hundreds of terabytes of storage. The system consisted of 3 main components: a master node, Chunk Servers and clients.
The master is responsible for managing the metadata of all files stored in GFS. It would maintain information about the location and status of the files. The Chunk Servers would store the actual data. Each managed a portion of data stored in GFS, and they would replicate the data to other chunk servers to ensure fault tolerance. Then, the GFS client library could be run on other machines that wanted to create/read/delete files from a GFS cluster.
To make the system scalable, Google engineers minimized the number of operations that the master would have to do. If a client wanted to access a file, they would send a query to the master. The master would send the chunk servers that hold the chunks of the file, and the client would download the data from the chunk servers. The master shouldn’t be involved in the minutia of transferring data from chunk servers to clients.
In order to efficiently run computations on the data stored on GFS, Google created MapReduce, a programming model and framework for processing large data sets. It consists of two main functions: the map function and the reduce function.
Map is responsible for processing the input data locally and generating a set of intermediate key-value pairs. The reduce function will then process these intermediate key-value pairs in parallel and combine them to generate the final output. MapReduce is designed to maximize parallel processing while minimizing the amount of network bandwidth needed (run computations locally when possible).
For more details, we wrote past deep dives on MapReduce and Google File System.
These technologies were instrumental in scaling Google and they were also reimplemented by engineers at Yahoo. In 2006, the Yahoo projects were open sourced and became the Apache Hadoop project. Since then, Hadoop has exploded into a massive ecosystem of data engineering projects.
One issue with Google File System was the decision to have a single master node for storing and managing the GFS cluster’s metadata. You can read Section 2.4 Single Master in the GFS paper for an explanation of why Google made the decision of having a single master.
This choice worked well for batch-oriented applications like web crawling and indexing websites. However, it could not meet the latency requirements for applications like YouTube, where you need to serve a video extremely quickly.
Having the master node go down meant the cluster would be unavailable for a couple of seconds (during the automatic failover). This was no bueno for low latency applications.
To deal with this issue (and add other improvements), Google created Colossus, the successor to Google File System.
Dean Hildebrand is a Technical Director at Google and Denis Serenyi is a Tech Lead on the Google Cloud Storage team. They posted a great talk on YouTube delving into Google’s infrastructure and how Colossus works.
Infrastructure at Google
The Google Cloud Platform and all of Google’s products (Search, YouTube, Gmail, etc.) are powered by the same underlying infrastructure.
It consists of three core building blocks
Borg - A cluster management system that serves as the basis for Kubernetes. It launches and manages compute services at Google. Borg runs hundreds of thousands of jobs across many clusters (with each having thousands of machines). For more details, Google published a paper talking about how Borg works.
Spanner - A highly scalable, consistent relational database with support for distributed transactions. Under the hood, Spanner stores its data on Google Colossus. It uses TrueTime, Google’s clock synchronization service to provide ordering and consistency guarantees. For more details, check out the Spanner paper.
Colossus - Google’s successor to Google File System. This is a distributed file system that stores exabytes of data with low latency and high reliability. For more details on how it works, keep reading.
How Colossus Works
Similar to Google File System, Colossus consists of three components
Applications that need to store/retrieve data on Colossus will do so with the client library, which abstracts away all the communication the app needs to do with the Control Plane and D servers. Users of Colossus can select different service tiers based on their latency/availability/reliability requirements. They can also choose their own data encoding based on the performance cost trade-offs they need to make.
The Colossus Control Plane is the biggest improvement compared to Google File System, and it consists of Curators and Custodians.
Curators replace the functionality of the master node, removing the single point of failure. When a client needs a certain file, it will query a curator node. The curator will respond with the locations of the various data servers that are holding that file. The client can then query the data servers directly to download the file.
When creating files, the client can send a request to the Curator node to obtain a lease. The curator node will create the new file on the D servers and send the locations of the servers back to the client. The client can then write directly to the D servers and release the lease when it’s done.
Curators store file system metadata in Google BigTable, a NoSQL database. Because of the distributed nature of the Control Plane, Colossus can now scale up by over 100x the largest Google File System clusters, while delivering lower latency.
Custodians in Colossus are background storage managers, and they handle things like disk space rebalancing, switching data between hot and cold storage, RAID reconstruction, failover and more.
D servers refers to the Chunk Servers in Google File System, and they’re just network attached disks. They store all the data that’s being held in Colossus. As mentioned previously, data flows directly from the D servers to the clients, to minimize the involvement of the control plane. This makes it much easier to scale the amount of data stored in Colossus without having the control plane as a bottleneck.
Colossus abstracts away a lot of the work you have to do in managing data.
Engineers want Colossus to provide the best performance at the cheapest cost. Data in the distributed file system is stored on a mixture of flash and disk. Figuring out the optimal amount of flash memory vs disk space and how to distribute data can mean the difference of tens of millions of dollars. To handle intelligent disk management, engineers looked at how data was accessed in the past.
Newly written data tends to be hotter, so it’s stored in flash memory. Old analytics data tends to be cold, so that’s stored in cheaper disk. Certain data will always be read at specific time intervals, so it’s automatically transferred over to memory so that latency will be low.
Clients don’t have to think about any of this. Colossus manages it for them.
As mentioned, apps have different requirements around consistency, latency, availability, etc.
Colossus provides different service tiers so that applications can choose what they want.
At Google’s scale, machines are failing all the time (it’s inevitable when you have millions of machines).
Colossus steps in to handle things like replicating data across D servers, background recovery and steering reads/writes around failed nodes.
For more details, you can read the full talk here.
How did you like this summary?
Your feedback really helps me improve curation for future emails.
The Best Way to Handle Large Sets of Time-Stamped Data
Working with large sets of time-stamped data has its challenges.
Fortunately, InfluxDB is a time series database purpose-built to handle the unique workloads of time series data.
Using InfluxDB, developers can ingest billions of data points in real-time with unbounded cardinality, and store, analyze, and act on that data – all in a single database.
No matter what kind of time series data you’re working with – metrics, events, traces, or logs – InfluxDB Cloud provides a performant, elastic, serverless time series platform with the tools and features developers need. Native SQL compatibility makes it easy to get started with InfluxDB and to scale your solutions.
Companies like IBM, Cisco, and Robinhood all rely heavily on InfluxDB to build and manage responsive backend applications, to power predictive intelligence, and to monitor their systems for insights that they would otherwise miss.
See for yourself by quickly spinning up the platform and testing it out InfluxDB Cloud for free.