Tech Dive on AWS S3

A tech dive on Object Storage (AWS S3, Azure Blob Storage, Google Cloud Storage). Plus how to avoid SMS fraud attacks, the history behind inheritance and more.

Hey Everyone!

Today we’ll be talking about

  • Tech Dive on Cloud Object Storage (AWS S3, Azure Blob Storage, etc.)

    • Different Types of Cloud storage (File store, Block store, Object store)

    • General API of Object stores

    • Immutability Properties

    • How Pricing Works

    • Storage Tiers

    • Life Cycle Policies

    • Security Best Practices

    • Client Side vs. Server Side Encryption

    • Performance Optimization

    • New Players with Cloudflare R2 and Tigris

  • Tech Snippets

    • If Inheritance is so bad, why does everyone use it?

    • Mental health in software engineering

    • Managing technical quality in a codebase

    • How SMS fraud works and how to guard against it

If you’re curious about how LLMs like GPT-4, LLaMA and Claude work then you should check out this fantastic course by Brilliant.

It’s fully interactive with animations, hands-on graphics and detailed explanations on things like

  • N-Gram Models

  • Transformers

  • Fine Tuning LLMs

Brilliant is an education platform that has a huge amount of math, data science, computer science and ML content. Their content is structured as bite-sized lessons with tons of interactive animations, graphics and more.

This makes it really easy to build a daily learning habit with the Brilliant app, making you a better problem solver and a faster learner.

With the link below, you can get a 30-day free trial to check it out. You’ll also get a 20% discount when you subscribe.

sponsored

Tech Dive - Object Storage

One of the most common types of storage you’ll work with is Object Storage. You’ve probably heard of AWS S3, Azure Blob Storage, Google Cloud Storage, etc.

In this dive, we’ll delve into Object Storage and talk about things you should know when working with this storage paradigm.

Types of Storage Systems

One way to differentiate between storage systems is based on how they represent your data.

Some common ones include

  • File System - your data is stored in folders and files in a tree-like way. Every operating system represents data in this way, so hopefully you know what I’m talking about.

    Ex. Google Cloud Filestore, Amazon Elastic File System, Azure Files

  • Blocks - You are given access to individual blocks in your storage system where a block is the smallest unit of storage (in AWS Elastic Block Store, it’s a couple of kilobytes). You can manipulate these individual blocks and write your file across multiple blocks. This offers the highest granularity.

    Ex. AWS Elastic Block Store, Google Persistent Disk, Azure Disk Storage

  • Objects - Each file you want to upload (txt, mp4, png, csv, py, whatever) is treated as an object. Rather than messing around with individual blocks, you read, write and delete whole objects. There is also no concept of folders or hierarchy. You have a flat namespace where all the objects are on the same level.

    Ex. Azure Blob Storage, AWS S3, Google Cloud Storage

So, with Object Storage, an object is just any particular file coupled with some metadata (details about the content, access privileges, etc.)
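
To make the flat namespace concrete, here's a minimal Python sketch. It models a bucket as a plain dictionary from key to object; the StoredObject class and list_keys helper are hypothetical names for illustration, not any provider's API. A key like photos/2024/cat.png looks like a path, but the slashes are just characters in the name, and "folders" are emulated by filtering on a prefix.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """An object is the file's bytes plus some metadata."""
    data: bytes
    metadata: dict = field(default_factory=dict)


# A bucket modeled as a flat mapping: key -> object. No directories exist.
bucket = {
    "photos/2024/cat.png": StoredObject(b"...", {"content-type": "image/png"}),
    "photos/2024/dog.png": StoredObject(b"...", {"content-type": "image/png"}),
    "logs/app.log": StoredObject(b"...", {"content-type": "text/plain"}),
}


def list_keys(bucket: dict, prefix: str = "") -> list:
    """Emulate 'listing a folder' by filtering keys on a string prefix."""
    return sorted(key for key in bucket if key.startswith(prefix))


print(list_keys(bucket, prefix="photos/"))
```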

Using Object Stores

Regardless of which provider you’re using, Object storage systems follow a common architectural pattern

  • Account - you have your account with the cloud provider.

  • Containers/Buckets - You create containers/buckets within S3/Blob Storage/Cloud Storage. Each bucket will have its own namespace and access controls, so objects must have different names (keys) within the same bucket. You could create separate buckets for backup data, multimedia files, logs, etc.

  • Data Objects - Within each bucket, you save your actual files, referred to as objects. Each object will contain the file, metadata and a unique identifier. You can also optionally set up versions for the files, so have a v1, v2, v3, etc. As we’ll later discuss, files are immutable in Object stores, so versioning can be handy if you want to modify a file and upload the new version.

Immutability

Once you upload a file to your object store, that file cannot be changed. If you’d like to modify it, then your only option is to make your changes locally, upload the new file and delete the old version from your object store.

For this reason, Object stores have a versioning system where you can re-upload a file with the same name but give it a different version number.

Re-uploading new versions again and again is obviously inconvenient for large files but it comes with quite a few benefits.

Some of the pros of immutability are

  • No Locks or Synchronization - you don’t have to worry about someone else modifying the file as you’re downloading it

  • Efficient Data Distribution - You can split a file up and distribute it across multiple objects/regions without worrying about one of the objects being modified and breaking the entire file.

  • Simplified Data Management - Having immutable files with different versions simplifies the data lifecycle process. You can set rules to automatically purge any old files with a newer version or to move all new versions to a different bucket.
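
Here's a rough sketch of how versioning and immutability fit together. This is a toy in-memory model, not how any provider implements it; VersionedBucket and its methods are hypothetical names. The point is that a put never edits existing data. It only appends a new version, so readers of an older version are never affected.

```python
from typing import Optional


class VersionedBucket:
    """Toy model: every put appends a new immutable version under the key."""

    def __init__(self):
        self._versions = {}  # key -> list of bytes, oldest first

    def put(self, key: str, data: bytes) -> int:
        # bytes are immutable in Python, so a stored version can't be edited.
        self._versions.setdefault(key, []).append(bytes(data))
        return len(self._versions[key])  # version number, starting at 1

    def get(self, key: str, version: Optional[int] = None) -> bytes:
        versions = self._versions.get(key)
        if not versions:
            raise KeyError(key)
        if version is None:
            return versions[-1]  # latest version by default
        if not 1 <= version <= len(versions):
            raise KeyError(f"{key} has no version {version}")
        return versions[version - 1]


bucket = VersionedBucket()
v1 = bucket.put("report.csv", b"a,b\n1,2\n")
v2 = bucket.put("report.csv", b"a,b\n1,3\n")  # "modify" = upload a new version

print(v1, v2)
print(bucket.get("report.csv"))             # latest version
print(bucket.get("report.csv", version=1))  # the old version is untouched
```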

The core API operations across cloud providers are

  • PUT - Upload an object to the bucket/container.

  • GET - Retrieve an object from the storage bucket.

  • DELETE - delete a data object from storage. This is permanent unless you have versioning enabled and want to restore a previous version.

  • COPY - Copy a data object to another bucket or another object in the same bucket with a different name

  • LIST - List the objects in a bucket.

  • Multipart Upload - If you need to upload a big file, you can break it up into chunks and upload one chunk at a time. After, the cloud provider will combine the chunks into one data object.

  • Access Control List Operations - Manage permissions on individual objects, such as defining who can read a certain data object.
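
Multipart upload is the least obvious operation in the list above, so here's a small sketch of the idea. It only simulates the chunking and reassembly locally; split_into_parts and complete_upload are hypothetical helpers, and real providers have their own limits on part size and part count.

```python
def split_into_parts(data: bytes, part_size: int) -> dict:
    """Split data into numbered parts (part numbers start at 1)."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    return {
        index + 1: data[offset:offset + part_size]
        for index, offset in enumerate(range(0, len(data), part_size))
    }


def complete_upload(parts: dict) -> bytes:
    """Combine parts by part number, regardless of the order they arrived in."""
    return b"".join(parts[number] for number in sorted(parts))


payload = b"0123456789" * 5  # 50 bytes standing in for a large file
parts = split_into_parts(payload, part_size=16)

# Parts may be uploaded out of order (or in parallel); simulate that here.
received = {}
for number in reversed(list(parts)):
    received[number] = parts[number]

assembled = complete_upload(received)
print(len(parts), assembled == payload)
```

Because each part carries its own number, a failed part can be retried on its own instead of restarting the whole upload.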

Pricing

As you might imagine, there are quite a few factors that influence how much you’re charged for storage.

The factors include

  • Data Storage - you pay for the amount of data you’re storing in gigabytes per month. This changes based on your storage tier.

  • Data Retrieval and Access - You’re charged for uploading/accessing data from the storage system. Again, these charges depend on your storage tier.

  • Egress Fees - You’re charged for any data leaving the cloud provider’s network. This could be data transferred to another cloud, an on-prem location, or being delivered to end-users.

    Note - Thanks to the European Data Act, you are not charged egress fees when you’re leaving a cloud provider for another cloud/on-prem service. This means that if you’re on AWS and you’re shifting to Azure, you can submit a form telling AWS that you’re exiting their cloud and transferring all your data to Azure. They’ll waive the egress fees.

  • API Requests - You’re charged a small fee for any API request. These are typically billed per thousand requests.

  • Data Management & Additional Services - You’re charged for things like data replication, lifecycle policies, analytics tools (like S3 Storage Class Analysis) and more.
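
To see how these factors add up, here's a back-of-the-envelope estimator. Every price in it is a made-up placeholder (real rates vary by provider, region and tier), and Rates and monthly_cost are hypothetical names. It also leaves out the management and add-on services, since those are billed in provider-specific ways.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical prices in USD; look up real ones for your provider and tier."""
    storage_per_gb_month: float = 0.02
    retrieval_per_gb: float = 0.0
    egress_per_gb: float = 0.09
    per_1000_requests: float = 0.005


def monthly_cost(stored_gb, retrieved_gb, egress_gb, requests, rates=Rates()):
    """Return a per-factor breakdown of one month's bill."""
    breakdown = {
        "storage": stored_gb * rates.storage_per_gb_month,
        "retrieval": retrieved_gb * rates.retrieval_per_gb,
        "egress": egress_gb * rates.egress_per_gb,
        "requests": requests / 1000 * rates.per_1000_requests,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


bill = monthly_cost(stored_gb=500, retrieved_gb=100, egress_gb=50, requests=2_000_000)
for factor, dollars in bill.items():
    print(f"{factor:>10}: ${dollars:.2f}")
```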

Storage Tiers

As we discussed above, all the various cloud providers offer different storage tiers where you can make trade offs between

  • Storage Price

  • Access Price

  • Latency when accessing data

Each provider names their tiers differently, but at a high level, they can be split up into

  • Hot Tier - data that is frequently accessed. You pay the highest storage costs but the lowest access costs. Here, you might store transaction data, current-versions of ML models, user analytics data from the past 24 hours, etc.

  • Cool Tier - data that is infrequently accessed. You get lower storage costs than the hot tier but higher access costs. Here, you might store user data for inactive users (users who haven’t logged in in 30 days), past analytics data, etc.

  • Cold Tier - Data that is accessed less frequently than the cool tier. Again, your storage costs will be lower than the cool tier but access costs will be higher. Here, you could store things like old training datasets (aren’t used anymore but could be used in the future), deprecated application versions, historical user activity that is used for occasional analysis.

  • Archive Tier - This is usually an offline storage tier. You should have flexible latency requirements when accessing this data as it can take hours. Here, you might keep decade-old user data that you need for regulatory/compliance, legacy documentation, etc. Basically just things that could be useful in the future.
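
The trade-off between storage price and access price can be sketched in a few lines. The prices below are invented and only preserve the ordering described above (storage gets cheaper and access gets pricier as you move from hot to archive); TIERS and cheapest_tier are hypothetical names.

```python
# Hypothetical (storage $/GB-month, access $/GB) pairs, not real provider prices.
TIERS = {
    "hot": (0.020, 0.00),
    "cool": (0.010, 0.01),
    "cold": (0.004, 0.03),
    "archive": (0.001, 0.05),
}


def cheapest_tier(stored_gb: float, accessed_gb_per_month: float) -> str:
    """Pick the tier with the lowest combined storage + access cost."""
    def cost(tier: str) -> float:
        storage_price, access_price = TIERS[tier]
        return stored_gb * storage_price + accessed_gb_per_month * access_price

    return min(TIERS, key=cost)


# The same 1000 GB lands in a different tier depending on how often it's read.
for accessed in (5000, 500, 200, 0):
    print(accessed, cheapest_tier(stored_gb=1000, accessed_gb_per_month=accessed))
```

This only compares cost. It ignores latency, so data you need back within seconds shouldn't go to an archive tier even if the math says it's cheapest.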

Life Cycle Policies

You can also set up storage lifecycle policies that will transition objects from one storage class to another. For example, you might have data that you will only access for the first 30 days after creation. After a month, the data will rarely be accessed. You might set up a lifecycle policy to first store the data in S3 Standard, and then transition it to S3 Glacier after 30 days.
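
Here's a sketch of what such a rule can look like. The dictionary follows the shape of an S3 lifecycle configuration (with boto3, this is roughly what you'd pass as LifecycleConfiguration to put_bucket_lifecycle_configuration, but check the current docs before relying on the exact fields). The rule ID and the reports/ prefix are made up, and storage_class_for is a hypothetical helper that just simulates the rule locally.

```python
# Shaped like an S3 lifecycle configuration; the rule ID and prefix are made up.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-after-30-days",
            "Status": "Enabled",
            "Filter": {"Prefix": "reports/"},
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
        }
    ]
}


def storage_class_for(key: str, age_days: int, config: dict,
                      default: str = "STANDARD") -> str:
    """Simulate locally which storage class a rule set would put an object in."""
    current = default
    for rule in config["Rules"]:
        if rule["Status"] != "Enabled":
            continue
        if not key.startswith(rule["Filter"].get("Prefix", "")):
            continue
        # Apply transitions in order of age so the latest matching one wins.
        for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
            if age_days >= transition["Days"]:
                current = transition["StorageClass"]
    return current


print(storage_class_for("reports/q1.csv", 10, lifecycle_configuration))
print(storage_class_for("reports/q1.csv", 45, lifecycle_configuration))
print(storage_class_for("images/logo.png", 45, lifecycle_configuration))
```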

This is the first part of our tech dive on Object Storage.

In the rest of the article (2000+ words), we’ll talk about

  • Security best practices around resource/user-based policies and ACLs

  • Client-side vs. Server-side encryption

  • Performance optimization with CDNs and compression

  • Newer players like Cloudflare R2 and Tigris

Have you ever been curious about the inner workings of popular codecs like H.264 or AV1?

They rely on a huge number of clever techniques and algorithms. One example is interframe coding, where you identify consecutive frames that are similar to each other and then only store the changes between frames rather than the entire frame itself.

If you’d like to learn more techniques like this from areas like video compression, computer memory, GPS, wireless communication and more, then you should check out Brilliant.

They released a course called How Technology Works that delves into all of these topics in an engaging, easy to understand way.

This is just one of hundreds of courses that Brilliant has that cover all topics across software engineering, machine learning, data science, quantitative finance and more.

With the link below, you can get a 30-day free trial to check it out. You’ll also get a 20% discount when you subscribe.

sponsored

Tech Snippets