Tech Dive on AWS S3
A tech dive on Object Storage (AWS S3, Azure Blob Storage, Google Cloud Storage). Plus how to avoid SMS fraud attacks, the history behind inheritance and more.
Hey Everyone!
Today we’ll be talking about
Tech Dive on Cloud Object Storage (AWS S3, Azure Blob Store, etc.)
Different Types of Cloud storage (File store, Block store, Object store)
General API of Object stores
Immutability Properties
How Pricing Works
Storage Tiers
Life Cycle Policies
Security Best Practices
Client Side vs. Server Side Encryption
Performance Optimization
New Players with CloudFlare R2 and Tigris
Tech Snippets
If Inheritance is so bad, why does everyone use it?
Mental health in software engineering
Managing technical quality in a codebase
How SMS fraud works and how to guard against it
Tech Dive - Object Storage
One of the most common types of storage you’ll work with is Object Storage. You’ve probably heard of AWS S3, Azure Blob Storage, Google Cloud Storage, etc.
In this dive, we’ll dig into Object Storage and cover the things you should know when working with this storage paradigm.
Types of Storage Systems
One way to differentiate between storage systems is based on how they represent your data.
Some common ones include
File System - your data is stored as files inside folders, arranged in a tree-like hierarchy. Every operating system represents data this way, so hopefully you know what I’m talking about.
Ex. Google Cloud Filestore, Amazon Elastic File System, Azure Files
Blocks - You are given access to individual blocks in your storage system where a block is the smallest unit of storage (in AWS Elastic Block Store, it’s a couple of kilobytes). You can manipulate these individual blocks and write your file across multiple blocks. This offers the highest granularity.
Ex. AWS Elastic Block Store, Google Persistent Disk, Azure Disk Storage
Objects - Each file you want to upload (txt, mp4, png, csv, py, whatever) is treated as an object. Rather than messing around with individual blocks, you read/write/delete whole objects. There’s also no concept of folders or hierarchy. You have a flat namespace where all the objects are on the same level.
Ex. Azure Blob Storage, AWS S3, Google Cloud Storage
So, with Object Storage, an object is just any particular file coupled with some metadata (details about the content, access privileges, etc.)
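Since the namespace is flat, "folders" are just a naming convention: tools display hierarchy by filtering object keys on a shared prefix and a delimiter (this is how S3's ListObjectsV2 behaves with its Prefix and Delimiter parameters). A minimal pure-Python sketch of the idea, with made-up key names:

```python
# Toy illustration: an object store's namespace is just a flat set of keys.
# "Folders" emerge from filtering on a prefix + delimiter.
keys = [
    "logs/2024/01/app.log",
    "logs/2024/02/app.log",
    "images/cat.png",
    "backup.tar.gz",
]

def list_keys(keys, prefix="", delimiter="/"):
    """Return (objects, common_prefixes) under a prefix, like a bucket listing."""
    objects, common_prefixes = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything past the first delimiter gets grouped into a "folder".
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)

print(list_keys(keys))                  # top level: one object, two "folders"
print(list_keys(keys, prefix="logs/"))  # inside the logs/ "folder"
```

The store itself never knows about folders; the client-side listing logic creates the illusion of a hierarchy.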
Using Object Stores
Regardless of which provider you’re using, Object storage systems follow a common architectural pattern
Account - you have your account with the cloud provider.
Containers/Buckets - You create containers/buckets within S3/Blob Store/Cloud Storage. Each bucket has its own namespace and access controls, so object names (keys) must be unique within the same bucket. You could create separate buckets for backup data, multimedia files, logs, etc.
Data Objects - Within each bucket, you save your actual files, referred to as objects. Each object contains the file’s data, metadata and a unique identifier. You can also optionally enable versioning for your files, so you have a v1, v2, v3, etc. As we’ll discuss later, files are immutable in Object stores, so versioning is handy if you want to modify a file and upload the new version.
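As a concrete sketch, here is roughly what uploading an object with user-defined metadata looks like using AWS's boto3 SDK. The bucket name, file name and metadata fields below are hypothetical, and the code assumes boto3 is installed with credentials configured:

```python
def build_key(category: str, filename: str) -> str:
    """Build an object key; a 'folder' prefix is just part of the name."""
    return f"{category}/{filename}"

def upload_with_metadata(bucket: str, local_path: str, key: str) -> None:
    """Upload a local file as an object with user-defined metadata.

    Assumes boto3 (the AWS SDK for Python) is installed and AWS
    credentials are configured; bucket/key names are your choice.
    """
    import boto3  # imported lazily so the sketch stays self-contained

    s3 = boto3.client("s3")
    with open(local_path, "rb") as f:
        s3.put_object(
            Bucket=bucket,
            Key=key,
            Body=f,
            Metadata={"uploaded-by": "newsletter-demo"},  # arbitrary key/value pairs
        )

# Example call (hypothetical names):
#   upload_with_metadata("my-example-bucket", "cat.png",
#                        build_key("multimedia", "cat.png"))
```

The metadata travels with the object and comes back on every GET, which is how the "file plus details about the content" model from above shows up in practice.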
Immutability
Once you upload a file to your object store, that file cannot be changed. If you’d like to modify it, then your only option is to make your changes locally, upload the new file and delete the old version from your object store.
For this reason, Object stores have a versioning system where you can re-upload a file with the same name but give it a different version number.
Re-uploading new versions again and again is obviously inconvenient for large files, but immutability comes with quite a few benefits.
Some of the pros of immutability are
No Locks or Synchronization - you don’t have to worry about someone else modifying the file as you’re downloading it
Efficient Data Distribution - You can split a file up and distribute it across multiple objects/regions without worrying about one of the objects being modified and breaking the entire file.
Simplified Data Management - Having immutable files with different versions simplifies the data lifecycle process. You can set rules to automatically purge any old files with a newer version or to move all new versions to a different bucket.
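The write-once model plus versioning can be sketched with a toy in-memory store: a "modify" is really a new immutable version appended under the same key. This mimics how versioned object stores behave conceptually; it is not any real SDK's API:

```python
class VersionedStore:
    """Toy in-memory object store: objects are immutable; writes append versions."""

    def __init__(self):
        self._versions = {}  # key -> list of immutable payloads (bytes)

    def put(self, key, data):
        """Uploading under an existing key never mutates old data; it
        appends a new version. Returns the new version number."""
        self._versions.setdefault(key, []).append(bytes(data))
        return len(self._versions[key])

    def get(self, key, version=None):
        """Fetch the latest version by default, or a specific older one."""
        versions = self._versions[key]
        return versions[-1] if version is None else versions[version - 1]

store = VersionedStore()
store.put("report.csv", b"v1 contents")
store.put("report.csv", b"v2 contents")    # "modify" = upload a new version
print(store.get("report.csv"))             # latest version
print(store.get("report.csv", version=1))  # old versions remain readable
```

Because no version is ever mutated in place, readers never need locks, and lifecycle rules can safely purge or migrate old versions.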
The core functions across cloud providers are
PUT - Upload an object to the bucket/container.
GET - Retrieve an object from the storage bucket.
DELETE - Delete a data object from storage. This is permanent unless you have versioning enabled and want to restore a previous version.
COPY - Copy a data object to another bucket, or to another object in the same bucket with a different name.
LIST - List the objects in a bucket.
Multipart Upload - If you need to upload a big file, you can break it up into chunks and upload one chunk at a time. Afterwards, the cloud provider will combine the chunks into one data object.
Access Control List Operations - Manage permissions on individual objects, e.g. defining who can read a certain data object.
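The multipart-upload idea — split locally, upload parts, have the provider reassemble them — can be sketched in plain Python. The 5 MB part size mirrors S3's documented minimum part size for multipart uploads, but the chunking logic itself is generic:

```python
PART_SIZE = 5 * 1024 * 1024  # 5 MB, S3's minimum part size for multipart uploads

def split_into_parts(data, part_size=PART_SIZE):
    """Split a payload into fixed-size chunks, like a client preparing a
    multipart upload. The last part may be smaller than part_size."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]

def reassemble(parts):
    """What the provider does on 'complete multipart upload':
    concatenate the uploaded parts back into one object."""
    return b"".join(parts)

payload = b"x" * (12 * 1024 * 1024)  # a 12 MB "file"
parts = split_into_parts(payload)
print(len(parts))                    # 3 parts: 5 MB + 5 MB + 2 MB
assert reassemble(parts) == payload  # lossless round trip
```

In the real APIs each part is uploaded independently (and can be retried independently), which is exactly why multipart upload is the recommended path for large files.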
Pricing
As you might imagine, there are quite a few factors that influence how much you’re charged for storage.
The factors include
Data Storage - you pay for the amount of data you’re storing in gigabytes per month. This changes based on your storage tier.
Data Retrieval and Access - You’re charged for uploading/accessing data from the storage system. Again, these charges depend on your storage tier.
Egress Fees - You’re charged for any data leaving the cloud provider’s network. This could be data transferred to another cloud, an on-prem location, or being delivered to end-users.
Note - Thanks to the European Data Act, you are not charged egress fees when you’re leaving a cloud provider for another cloud/on-prem service. This means that if you’re on AWS and you’re shifting to Azure, you can submit a form telling AWS that you’re exiting their cloud and transferring all your data to Azure, and they’ll waive the egress fees.
API Requests - You’re charged a small fee for every API request. These are typically billed per thousand requests.
Data Management & Additional Services - You’re charged for things like data replication, lifecycle policies, analytics tools (like S3 Storage Class Analysis) and more.
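As a back-of-the-envelope illustration of how these factors combine, here is a minimal cost calculator. The unit prices are made-up placeholders, not any provider's actual rates:

```python
# Illustrative unit prices -- placeholders, NOT real AWS/Azure/GCP rates.
PRICE_PER_GB_MONTH = 0.023     # storage (varies by tier)
PRICE_PER_1K_REQUESTS = 0.005  # API requests, billed per thousand
PRICE_PER_GB_EGRESS = 0.09     # data leaving the provider's network

def monthly_cost(stored_gb, requests, egress_gb):
    """Sum the main object-storage cost components for one month."""
    storage = stored_gb * PRICE_PER_GB_MONTH
    api = (requests / 1000) * PRICE_PER_1K_REQUESTS
    egress = egress_gb * PRICE_PER_GB_EGRESS
    return round(storage + api + egress, 2)

# 500 GB stored, 2 million requests, 50 GB served out of the network
print(monthly_cost(500, 2_000_000, 50))
```

Even with toy numbers, the shape is instructive: storage and egress usually dominate, while per-request fees only matter at very high request volumes.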
Storage Tiers
As we discussed above, all the various cloud providers offer different storage tiers where you can make trade-offs between
Storage Price
Access Price
Latency when accessing data
Each provider names their tiers differently, but at a high level, they can be split up into
Hot Tier - data that is frequently accessed. You pay the highest storage costs but the lowest access costs. Here, you might store transaction data, current versions of ML models, user analytics data from the past 24 hours, etc.
Cool Tier - data that is infrequently accessed. You get lower storage costs than the hot tier but higher access costs. Here, you might store data for inactive users (users who haven’t logged in for 30 days), past analytics data, etc.
Cold Tier - data that is accessed even less frequently than the cool tier. Again, your storage costs will be lower than the cool tier but access costs will be higher. Here, you could store things like old training datasets (not used anymore but could be useful in the future), deprecated application versions, and historical user activity that’s used for occasional analysis.
Archive Tier - This is usually an offline storage tier. You should have flexible latency requirements when accessing this data as it can take hours. Here, you might keep decade-old user data that you need for regulatory/compliance, legacy documentation, etc. Basically just things that could be useful in the future.
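One way to reason about tier selection is purely by access recency. A hypothetical helper along those lines (the thresholds are made up for illustration; real decisions depend on your access patterns and each provider's pricing):

```python
def pick_tier(days_since_last_access):
    """Map how recently data was accessed to a storage tier.
    Thresholds are illustrative, not provider recommendations."""
    if days_since_last_access <= 7:
        return "hot"      # highest storage cost, cheapest/fastest access
    if days_since_last_access <= 30:
        return "cool"     # cheaper storage, pricier access
    if days_since_last_access <= 365:
        return "cold"     # rarely read; storage cheaper still
    return "archive"      # offline tier; retrieval can take hours

print(pick_tier(2), pick_tier(20), pick_tier(200), pick_tier(2000))
```

In practice you rarely write this logic yourself — this is exactly the decision that lifecycle policies (next section) automate for you.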
Life Cycle Policies
You can also set up storage lifecycle policies that will transition objects from one storage class to another. For example, you might have data that you will only access for the first 30 days after creation. After a month, the data will be rarely accessed. You might set up a lifecycle policy to first store the data in S3 Standard, and then transition it to S3 Glacier after 1 month.
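In S3, a rule like that is expressed as a lifecycle configuration. Here is a minimal sketch as the dict you would pass to boto3's put_bucket_lifecycle_configuration; the rule ID, key prefix and bucket name are hypothetical:

```python
# A minimal S3 lifecycle rule: after 30 days, transition matching objects
# to the Glacier storage class. The shape matches what boto3's
# put_bucket_lifecycle_configuration expects; the names are illustrative.
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-old-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},  # only applies to keys under logs/
            "Transitions": [
                {"Days": 30, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

# Applying it would look roughly like:
#   import boto3
#   s3 = boto3.client("s3")
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="my-example-bucket",
#       LifecycleConfiguration=lifecycle_config,
#   )
```

Once the rule is in place, the provider handles the tier transitions automatically — no application code needs to move the data.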
This is the first part of our tech dive on Object Storage.
In the rest of the article (2000+ words), we’ll talk about
Security best practices around resource/user-based policies and ACLs
Client-side vs. Server-side encryption
Performance optimization with CDNs and compression
Newer players like CloudFlare R2 and Tigris
Tech Snippets