Tech Dive on AWS S3

A tech dive on Object Storage (AWS S3, Azure Blob Storage, Google Cloud Storage). Plus how to avoid SMS fraud attacks, the history behind inheritance and more.

Arpan KG
April 12, 2024

Hey Everyone!

Today we’ll be talking about

Tech Dive on Cloud Object Storage (AWS S3, Azure Blob Store, etc.)
- Different Types of Cloud storage (File store, Block store, Object store)
- General API of Object stores
- Immutability Properties
- How Pricing Works
- Storage Tiers
- Life Cycle Policies
- Security Best Practices
- Client Side vs. Server Side Encryption
- Performance Optimization
- New Players with Cloudflare R2 and Tigris
Tech Snippets
- If Inheritance is so bad, why does everyone use it?
- Mental health in software engineering
- Managing technical quality in a codebase
- How SMS fraud works and how to guard against it

Tech Dive - Object Storage

One of the most common types of storage you’ll work with is Object Storage. You’ve probably heard of AWS S3, Azure Blob Storage, Google Cloud Storage, etc.

In this dive, we’ll delve into Object Storage and talk about things you should know when working with this storage paradigm.

Types of Storage Systems

One way to differentiate between storage systems is based on how they represent your data.

Some common ones include

File System - your data is stored in folders and files in a tree-like way. Every operating system represents data in this way, so hopefully you know what I’m talking about.
Ex. Google Cloud Filestore, Amazon Elastic File System, Azure Files
Blocks - You are given access to individual blocks in your storage system where a block is the smallest unit of storage (in AWS Elastic Block Store, it’s a couple of kilobytes). You can manipulate these individual blocks and write your file across multiple blocks. This offers the highest granularity.
Ex. AWS Elastic Block Store, Google Persistent Disk, Azure Disk Storage
Objects - Each file you want to upload (txt, mp4, png, csv, py, whatever) is treated as an object. Rather than messing around with individual blocks, you read, write and delete whole objects. There is also no concept of folders or hierarchy. You have a flat namespace where all the objects are on the same level.
Ex. Azure Blob Storage, AWS S3, Google Cloud Storage

So, with Object Storage, an object is just any particular file coupled with some metadata (details about the content, access privileges, etc.)
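To make the flat namespace concrete, here is a minimal sketch in Python. It assumes a bucket is nothing more than a mapping from key to bytes, and that "folders" are only a naming convention built on a prefix and a delimiter. The names EXAMPLE_BUCKET and list_keys are made up for illustration and are not any provider's API.

```python
from typing import Dict, List, Tuple

# Hypothetical bucket: every object sits at the same level, keyed by a string.
EXAMPLE_BUCKET: Dict[str, bytes] = {
    "logs/2024/04/app.log": b"april",
    "logs/2024/03/app.log": b"march",
    "images/cat.png": b"png-bytes",
    "readme.txt": b"hello",
}


def list_keys(
    bucket: Dict[str, bytes], prefix: str = "", delimiter: str = "/"
) -> Tuple[List[str], List[str]]:
    """Return (keys, common_prefixes) found directly under `prefix`.

    Keys that contain the delimiter after the prefix are rolled up into a
    "common prefix", which is what makes a flat namespace look like folders.
    """
    keys: List[str] = []
    common = set()
    for key in sorted(bucket):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Roll the key up to its first "folder" below the prefix.
            common.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            keys.append(key)
    return keys, sorted(common)
```

Calling list_keys(EXAMPLE_BUCKET, "logs/") returns no direct keys and a single rolled-up prefix, "logs/2024/", even though nothing called "logs" was ever created.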

Using Object Stores

Regardless of which provider you’re using, Object storage systems follow a common architectural pattern

Account - you have your account with the cloud provider.
Containers/Buckets - You create containers/buckets within S3/Blob Store/Cloud Storage. Each bucket will have its own namespace and access controls, so files must have different names within the same bucket. You could create separate buckets for backup data, multimedia files, logs, etc.
Data Objects - Within each bucket, you save your actual files, referred to as objects. Each object will contain the file, metadata and a unique identifier. You can also optionally set up versions for the files, so have a v1, v2, v3, etc. As we’ll later discuss, files are immutable in Object stores, so versioning can be handy if you want to modify a file and upload the new version.

Immutability

Once you upload a file to your object store, that file cannot be changed. If you’d like to modify it, then your only option is to make your changes locally, upload the new file and delete the old version from your object store.

For this reason, Object stores have a versioning system where you can re-upload a file with the same name but give it a different version number.

Re-uploading new versions again and again is obviously inconvenient for large files, but immutability comes with quite a few benefits.

Some of the pros of immutability are

No Locks or Synchronization - you don’t have to worry about someone else modifying the file as you’re downloading it
Efficient Data Distribution - You can split a file up and distribute it across multiple objects/regions without worrying about one of the objects being modified and breaking the entire file.
Simplified Data Management - Having immutable files with different versions simplifies the data lifecycle process. You can set rules to automatically purge any old files with a newer version or to move all new versions to a different bucket.
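Here is a rough sketch of how immutability and versioning fit together, written as a toy in-memory class. VersionedBucket and its methods are hypothetical names, and the integer version numbers are an assumption made for readability; real providers typically assign their own opaque version identifiers.

```python
from typing import Dict, List, Optional


class VersionedBucket:
    """Toy in-memory bucket: a put never modifies data, it adds a version."""

    def __init__(self) -> None:
        # key -> {version number -> immutable bytes}
        self._objects: Dict[str, Dict[int, bytes]] = {}

    def put(self, key: str, data: bytes) -> int:
        """Store `data` as a new version of `key` and return its number."""
        versions = self._objects.setdefault(key, {})
        version = max(versions, default=0) + 1
        versions[version] = bytes(data)  # copy, so callers cannot mutate it
        return version

    def get(self, key: str, version: Optional[int] = None) -> bytes:
        """Return the requested version, or the newest one by default."""
        versions = self._objects[key]  # raises KeyError for unknown keys
        if version is None:
            version = max(versions)
        return versions[version]

    def purge_old_versions(self, key: str, keep: int = 1) -> List[int]:
        """Delete all but the newest `keep` versions; return what was removed."""
        if keep < 1:
            raise ValueError("keep must be at least 1")
        versions = self._objects[key]
        stale = sorted(versions)[:-keep]
        for number in stale:
            del versions[number]
        return stale
```

Because existing versions are never rewritten, a reader holding version 1 is unaffected when someone uploads version 2, which is the "no locks" benefit described above.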

The core functions across cloud providers are

PUT - Upload an object to a bucket/container.
GET - Retrieve an object from the storage bucket.
DELETE - delete a data object from storage. This is permanent unless you have versioning enabled and want to restore a previous version.
COPY - Copy a data object to another bucket, or to a new object in the same bucket under a different name.
LIST - List the objects in a bucket.
Multipart Upload - If you need to upload a big file, you can break it up into chunks and upload one chunk at a time. After, the cloud provider will combine the chunks into one data object.
Access Control List Operations - Manage permissions on individual objects, for example defining who can read a certain data object.
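Multipart upload is the least obvious operation in the list above, so here is a minimal local sketch of the idea: split the data into chunks, upload the chunks in any order, then stitch them together by part number. The helper split_into_parts and the class MultipartUpload are hypothetical and only simulate what a provider does on its side.

```python
from typing import Dict, List


def split_into_parts(data: bytes, part_size: int) -> List[bytes]:
    """Cut `data` into chunks of at most `part_size` bytes."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]


class MultipartUpload:
    """Toy stand-in for a provider-side multipart upload session."""

    def __init__(self, key: str) -> None:
        self.key = key
        self._parts: Dict[int, bytes] = {}  # part number (1-based) -> chunk

    def upload_part(self, part_number: int, chunk: bytes) -> None:
        # Parts can arrive in any order; re-sending a part replaces it.
        self._parts[part_number] = chunk

    def complete(self, expected_parts: int) -> bytes:
        """Combine parts 1..expected_parts into the final object."""
        missing = [
            n for n in range(1, expected_parts + 1) if n not in self._parts
        ]
        if missing:
            raise ValueError(f"missing parts: {missing}")
        return b"".join(self._parts[n] for n in range(1, expected_parts + 1))
```

The upside of this design is that a failed chunk can be retried on its own, and chunks can be sent in parallel, instead of restarting one giant upload.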

Pricing

As you might imagine, there are quite a few factors that influence how much you’re charged for storage.

The factors include

Data Storage - you pay for the amount of data you’re storing in gigabytes per month. This changes based on your storage tier.
Data Retrieval and Access - You’re charged for uploading/accessing data from the storage system. Again, these charges depend on your storage tier.
Egress Fees - You’re charged for any data leaving the cloud provider’s network. This could be data transferred to another cloud, an on-prem location, or being delivered to end-users.

Note - Thanks to the European Data Act, you are not charged egress fees when you’re leaving a cloud provider for another cloud/on-prem service. This means that if you’re on AWS and you’re shifting to Azure, you can submit a form telling AWS that you’re exiting their cloud and transferring all your data to Azure. They’ll waive the egress fees.
API Requests - You’re charged a small fee for any API request. These are typically billed per thousand requests.
Data Management & Additional Services - You’re charged for things like data replication, lifecycle policies, analytics tools (like S3 Storage Class Analysis) and more.
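To see how these factors add up, here is a back-of-the-envelope estimator. Every rate in it is a made-up placeholder chosen for easy arithmetic, not a real price from any provider, and it covers only three of the factors above (storage, API requests and egress). Rates and monthly_cost are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Placeholder prices in dollars. These are NOT real provider prices."""

    storage_per_gb_month: float
    per_1000_requests: float
    egress_per_gb: float


def monthly_cost(
    rates: Rates, stored_gb: float, requests: int, egress_gb: float
) -> float:
    """Sum the storage, request and egress charges for one month."""
    storage = stored_gb * rates.storage_per_gb_month
    api = requests / 1000 * rates.per_1000_requests  # billed per thousand
    egress = egress_gb * rates.egress_per_gb
    return storage + api + egress


# Hypothetical workload: 500 GB stored, 2M requests, 100 GB sent to users.
EXAMPLE_RATES = Rates(
    storage_per_gb_month=0.02, per_1000_requests=0.005, egress_per_gb=0.09
)
EXAMPLE_BILL = monthly_cost(
    EXAMPLE_RATES, stored_gb=500, requests=2_000_000, egress_gb=100
)
```

With these placeholder numbers, requests and egress together cost more than the storage itself, which is why it pays to look at access patterns and not just at how many gigabytes you keep.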

Storage Tiers

As we discussed above, all the various cloud providers offer different storage tiers where you can make trade-offs between

Storage Price
Access Price
Latency when accessing data

Each provider names their tiers differently, but at a high level, they can be split up into

Hot Tier - data that is frequently accessed. You pay the highest storage costs but the lowest access costs. Here, you might store transaction data, current-versions of ML models, user analytics data from the past 24 hours, etc.
Cool Tier - data that is infrequently accessed. You get lower storage costs than the hot tier but higher access costs. Here, you might store user data for inactive users (users who haven’t logged in in 30 days), past analytics data, etc.
Cold Tier - Data that is accessed less frequently than the cool tier. Again, your storage costs will be lower than the cool tier but access costs will be higher. Here, you could store things like old training datasets (aren’t used anymore but could be used in the future), deprecated application versions, historical user activity that is used for occasional analysis.
Archive Tier - This is usually an offline storage tier. You should have flexible latency requirements when accessing this data as it can take hours. Here, you might keep decade-old user data that you need for regulatory/compliance, legacy documentation, etc. Basically just things that could be useful in the future.
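The trade-off between storage price, access price and latency can be sketched as a small selection function. The tier table below is entirely hypothetical: the prices and latencies are invented so the arithmetic is easy to follow, and real tiers also come with rules (such as minimum storage durations) that this ignores.

```python
from typing import Dict, NamedTuple, Optional


class Tier(NamedTuple):
    storage_per_gb_month: float  # dollars, placeholder value
    retrieval_per_gb: float      # dollars, placeholder value
    retrieval_latency_s: float   # rough time to first byte, placeholder


# Hypothetical numbers: storage gets cheaper and access gets pricier
# as you move from hot to archive.
HYPOTHETICAL_TIERS: Dict[str, Tier] = {
    "hot": Tier(0.020, 0.00, 0.1),
    "cool": Tier(0.010, 0.01, 0.1),
    "cold": Tier(0.004, 0.03, 0.1),
    "archive": Tier(0.001, 0.05, 3 * 3600.0),
}


def cheapest_tier(
    stored_gb: float,
    read_gb_per_month: float,
    max_latency_s: Optional[float] = None,
    tiers: Dict[str, Tier] = HYPOTHETICAL_TIERS,
) -> str:
    """Pick the tier with the lowest monthly cost that meets the latency cap."""
    candidates = {
        name: tier
        for name, tier in tiers.items()
        if max_latency_s is None or tier.retrieval_latency_s <= max_latency_s
    }
    if not candidates:
        raise ValueError("no tier satisfies the latency requirement")

    def cost(name: str) -> float:
        tier = candidates[name]
        return (
            stored_gb * tier.storage_per_gb_month
            + read_gb_per_month * tier.retrieval_per_gb
        )

    return min(candidates, key=cost)
```

For 1,000 GB stored, this table picks the hot tier when you read 2,000 GB a month, the cool tier at 500 GB, and the cold tier at 100 GB (with a one-second latency cap). Drop the latency cap and the archive tier wins the 100 GB case.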

Life Cycle Policies

You can also set up storage lifecycle policies that will transition objects from one storage class to another. For example, you might have data that you only access for the first 30 days after creation and rarely touch after that. You could set up a lifecycle policy that first stores the data in S3 Standard and then transitions it to S3 Glacier after 30 days.
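As an illustration, here is roughly what the 30-day rule could look like as an S3-style lifecycle configuration. The dictionary follows the shape boto3 accepts for put_bucket_lifecycle_configuration, but the rule ID and bucket name are made-up examples. The storage_class_at helper is only a local approximation for reasoning about the rule, not how the service evaluates it.

```python
from typing import Any, Dict


def lifecycle_rule(
    rule_id: str, prefix: str, days: int, storage_class: str
) -> Dict[str, Any]:
    """Build one rule that transitions matching objects after `days` days."""
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},  # an empty prefix matches every object
        "Status": "Enabled",
        "Transitions": [{"Days": days, "StorageClass": storage_class}],
    }


def storage_class_at(
    age_days: int, rule: Dict[str, Any], default: str = "STANDARD"
) -> str:
    """Local approximation: which class would an object of this age be in?"""
    current = default
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


# Hypothetical rule ID; "GLACIER" is the S3 storage class name for Glacier.
RULE = lifecycle_rule("archive-after-30-days", "", 30, "GLACIER")
CONFIG = {"Rules": [RULE]}

# With boto3 installed and credentials configured, the config could be
# applied like this (the bucket name is a made-up example):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-example-bucket", LifecycleConfiguration=CONFIG
#   )
```

Once a rule like this is in place, the provider moves the objects for you, so nobody has to remember to run an archival job.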

This is the first part of our tech dive on Object Storage.

In the rest of the article (2000+ words), we’ll talk about

Security best practices around resource/user-based policies and ACLs
Client-side vs. Server-side encryption
Performance optimization with CDNs and compression
Newer players like Cloudflare R2 and Tigris

Read the full article by subscribing to Quastor Pro.

Tech Snippets

If Inheritance is so bad, why does everyone use it?

You’ll frequently hear the principle of “Prefer Composition to Inheritance”

However, why exactly do you hear this? What problems does inheritance lead to in practice?

This is a fantastic article that first talks about why inheritance became popular in the first place. It also explores what’s challenging about inheritance and why we still use it despite its shortcomings.

buttondown.email/hillelwayne/archive/if-inheritance-is-so-bad-why-does-everyone-use-it

How SMS Fraud Works and How to Guard Against It

This is a terrific article on SMS Fraud and why it happens. Some phone numbers are “premium”, meaning that calling or texting them costs the sender some money (tens of cents). The owner of the premium number gets a portion of that fee.

Attackers exploit this to get your service to send that premium number two-factor auth codes, OTPs, etc.

Some suggestions to avoid getting scammed by this include
- Add IP-based rate limiting to the endpoint that sends out SMS messages
- Only send an SMS to a specific phone number a small number of times before blocking that number for a cool-off period
- Identify and block premium phone numbers using libphonenumber
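The first two suggestions can be sketched with one small limiter that is keyed by either an IP address or a phone number. This is a minimal in-memory sketch, not the article's implementation: SmsRateLimiter, its default thresholds and the injectable clock are all assumptions, and a real service would likely keep this state in a shared store.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SmsRateLimiter:
    """Allow a few sends per key per window, then block for a cool-off."""

    def __init__(
        self,
        max_sends: int = 3,
        window_seconds: float = 600.0,
        cooloff_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_sends = max_sends
        self._window = window_seconds
        self._cooloff = cooloff_seconds
        self._clock = clock  # injectable so the limiter can be tested
        self._sent: Dict[str, Deque[float]] = defaultdict(deque)
        self._blocked_until: Dict[str, float] = {}

    def allow(self, key: str) -> bool:
        """Return True if an SMS may be sent for `key` (a phone number or IP)."""
        now = self._clock()
        if now < self._blocked_until.get(key, float("-inf")):
            return False  # still cooling off
        recent = self._sent[key]
        # Forget sends that have fallen out of the window.
        while recent and now - recent[0] >= self._window:
            recent.popleft()
        if len(recent) >= self._max_sends:
            # Too many sends in the window: start the cool-off period.
            self._blocked_until[key] = now + self._cooloff
            recent.clear()
            return False
        recent.append(now)
        return True
```

You would call allow() with the destination number (and again with the caller's IP) before handing the message to your SMS provider, and skip the send if either check returns False.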

Read the full article for the rest of the tips.

technicallythinking.substack.com/p/how-sms-fraud-works-and-how-to-guard-against-it

Mental Health in Software Engineering

Vadim Kravcenko has spent over 5 years as a CTO working at several startups. He wrote a helpful blog post delving into mental health and what tips worked for him.

Some of his tips for engineers and leaders are to

- Remember that not all deadlines are equal
- Say No to anything non-critical in your off-time
- Cut out alcohol and add in exercise

Remember that your greatest asset to the company isn’t the code you write. Instead, it’s you.

vadimkravcenko.com/shorts/mental-health-in-software-engineering

Managing Technical Quality in a Codebase

Will Larson is the CTO of Carta and he wrote a detailed post delving into the steps you should take when improving the technical quality of your codebase.

He walks through everything from finding hot spots that cause immediate problems to adopting best practices and creating technical quality teams at your company.

lethain.com/managing-technical-quality