Tech Dive on Redis

In this article we'll delve into Redis and talk about key characteristics, the history, data structures and more.

Hey Everyone!

Today we'll be talking about

  • Redis Tech Dive - We’ve mentioned Redis a ton in past Quastor articles, so now we’ll be delving into

    • Key Characteristics

    • History of the Database

    • Data Structures Offered

    • Persisting Redis to Disk with RDB and AOF

    • Common Use Cases (Database, Cache, Pub/Sub)

  • Why LinkedIn switched from JSON to Protobuf

    • A brief intro to Rest.li, LinkedIn’s framework for creating web servers and clients

    • Issues they were facing with JSON and alternatives they considered

    • An overview of Protobuf

    • Results from switching to Protobuf

  • Tech Snippets

    • Ask HN: Inherited a PHP codebase that looks like it was written in 2003. What do I do?

    • GitHub Repo with resources for Mid-level developers who want to become senior

    • Summarizing Post Incident Reviews with GPT-4 at Canva

    • Advantages of Traces over Traditional Logging

    • A 12-part course on Generative AI

PostHog is an open-source, all-in-one suite of product and data tools including product analytics, session replay, feature flags, A/B testing, and much more.

Using PostHog, you can do things like:

  • Watch users interacting with your app/website with a session replay and see exactly where they’re dropping off

  • Get detailed product analytics on your conversion funnel, user retention, engagement and more

  • Run A/B tests and surveys to test, track, analyze, and get feedback on new features

PostHog is a fantastic platform to understand if you want to become a more product-minded engineer.

Sign up for free to get immediate access to all the features with generous usage limits.

sponsored

Redis Tech Dive

Redis (stands for Remote Dictionary Server) is an open source key-value store that’s commonly used as a database, cache, message broker, and much more. It’s optimized to be fast and highly scalable while staying versatile enough to support a variety of different use cases.

It was first released in 2009 and has since gained adoption at every major tech company and at many startups.

Some of the characteristics of Redis are

  • In-Memory with Optional Durability - Redis uses RAM as the primary storage for the database’s items. Reading/writing data to RAM can be thousands of times faster than HDD access.

    The obvious downside is that you’ll lose all your data if someone trips over the server’s power cable. To address this, Redis offers optional durability, where you can write the state of your Redis database to disk.

  • Key-Value with Rich Data Structures - Redis is a key value store, where your keys are strings. For the value, Redis offers a variety of data structures like strings, sets, sorted sets, bitmaps, geospatial indexes and more. Each of these data structures comes with specific operations like union/intersection (for sets), geo-search (for geospatial indexes) and more.

  • Easy Horizontal Scalability - There’s obviously a limit to the amount of memory you can stuff into a single machine. Redis Cluster gives you a way to easily run a setup where data is automatically sharded across multiple Redis nodes. 

We’ll delve into each of these in much more depth.
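To give a feel for how that sharding works: Redis Cluster assigns every key to one of 16,384 hash slots by taking CRC16 of the key modulo 16384, and keys can opt into the same slot via a "hash tag" in curly braces. Here’s a minimal sketch of that scheme (hand-rolled for illustration; a real client library does this for you):

```python
# Sketch of Redis Cluster's key-to-slot mapping: slot = CRC16(key) mod 16384.
# If the key contains a non-empty "{...}" hash tag, only the tag is hashed,
# so related keys can be forced onto the same node.

def crc16(data: bytes) -> int:
    # CRC16-CCITT (XMODEM variant): polynomial 0x1021, initial value 0
    crc = 0
    for b in data:
        crc ^= b << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    s = key.find("{")
    if s != -1:
        e = key.find("}", s + 1)
        if e != -1 and e != s + 1:   # non-empty hash tag found
            key = key[s + 1:e]
    return crc16(key.encode()) % 16384

# Both keys hash only "user1", so they land in the same slot (and node)
print(key_slot("{user1}.following") == key_slot("{user1}.followers"))  # True
```

The hash-tag rule is what lets you run multi-key operations in a cluster: keys that share a tag are guaranteed to live on the same shard.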

History of Redis

Redis was created by Salvatore Sanfilippo (also known as antirez) in the late 2000s. At the time, he was building a startup to help website owners get visibility into what their users were doing.

A site owner could use their tool to see what users were reading, what pages they were clicking to, etc. All this data was available in real time.

The issue was with scaling the data store. Salvatore was originally trying to use MySQL but ran into performance issues. MySQL relies on the hard disk for primary storage, so HDD read/write speeds were becoming a bottleneck.

Instead, he created an early prototype of Redis: an in-memory, key value database that relied on RAM instead of HDD. The MVP was just 300 lines long, but supported commands like SET/GET, with strings and lists as data types.

In 2009, it was launched on Hacker News and received the typical HN reaction of “hey, this has already been done”.

By 2010, both Twitter and Instagram had adopted Redis. VMware became the first major sponsor of the project.

As Redis gained adoption, many companies formed to offer consulting services around the database. One of these was Redis Labs. Antirez (Salvatore) ended up joining Redis Labs and the company became the official sponsor of the project.

Antirez maintained a role as the BDFL (Benevolent Dictator For Life) of the project until 2020, when he stepped down. Redis is now governed by a group made up of people from Redis Labs and other core community members.

You can read a much more detailed look at the history of Redis from this fantastic article here.

Database Overview

Redis is a key value store (sorry for repeating this for the 10th time) so the most basic operations are GET/SET for getting the value of a specific key and also setting the value for a key.

Keys are strings but the value can be represented by a variety of different data structures. The core data structures are strings, lists, sets, hashes, sorted sets, streams, geospatial indexes, bitmaps, bitfields and hyperloglog.

If you’re using Redis, it’s crucial to have a solid understanding of the different data structures and their tradeoffs. 

Strings

The most basic data type for your value, it can store any sequence of bytes. The maximum size of a value string is 512 MB.

Commands include SET/GET. There are also commands like INCR that increment the value, as long as it can be parsed as a number.
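To make those semantics concrete, here’s a tiny in-Python model of the string commands (the class name is made up for illustration; this is not a Redis client). Values are stored as bytes, and INCR only succeeds when the stored value parses as an integer:

```python
# Minimal model of Redis string command semantics (illustration only):
# values are byte strings, and INCR fails unless the value is an integer.
class MiniRedisStrings:
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = str(value).encode()
        return "OK"

    def get(self, key):
        return self._data.get(key)   # None if the key doesn't exist

    def incr(self, key):
        raw = self._data.get(key, b"0")   # missing keys start from 0
        try:
            n = int(raw)
        except ValueError:
            raise TypeError("value is not an integer")
        self._data[key] = str(n + 1).encode()
        return n + 1


r = MiniRedisStrings()
r.set("page:views", 41)
print(r.incr("page:views"))  # 42
print(r.get("page:views"))   # b'42'
```

This mirrors why INCR is handy for counters: the read-parse-write happens as one command, so you don’t race with other clients.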

Lists

Redis also supports lists of strings with their List data structure. Lists are implemented internally with Linked Lists, so you can add to the left and right of a Redis List in constant time. However, retrieving elements from a List takes linear time.

Commands include LPUSH to add an element to the left (the front) of the list and RPUSH to add an element to the right (the end) of the list.

You can also do things like get an element at a certain index, get a range of elements, and more.
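You can model those list semantics in plain Python with a deque, which also gives constant-time pushes at both ends (the helper names here are made up for illustration, not real client methods):

```python
from collections import deque

# Model of Redis list semantics: appends at either end are O(1),
# mirroring LPUSH/RPUSH on Redis's linked-list-backed lists.

def lpush(lst, *values):
    for v in values:          # each value goes to the front, one at a time
        lst.appendleft(v)
    return len(lst)

def rpush(lst, *values):
    for v in values:
        lst.append(v)
    return len(lst)

def lrange(lst, start, stop):
    # Like Redis's LRANGE, the stop index is inclusive (-1 means "to the end")
    items = list(lst)
    if stop == -1:
        return items[start:]
    return items[start:stop + 1]

timeline = deque()
rpush(timeline, "post1", "post2")
lpush(timeline, "post0")
print(lrange(timeline, 0, -1))  # ['post0', 'post1', 'post2']
```

Note that pushing multiple values with LPUSH reverses their order relative to the arguments, since each one is prepended in turn — real Redis behaves the same way.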

Bitmaps

In a previous Quastor article, we talked about how Grab scaled their user segmentation platform. One of the key challenges they had was scaling the write throughput.

They needed to store a large set of user IDs (represented as integers) and check if a certain user ID was in a segment. They accomplished this by representing the user IDs in a bitmap, where you do a bitwise operation to check if the bit at the user ID’s index is 0 or 1. Bitmaps are also commonly used as a database index.

Bitmaps can be extremely efficient to work with: single-bit reads and writes take constant time, and the structure needs just one bit per possible ID.

In Redis, Bitmaps are a special application of the string data structure, but each bit in the string is considered a separate value (either a 0 or a 1).

Redis strings can be up to 512 MB in size, so a Bitmap value type can contain up to 2^32 (about 4.3 billion) bits: 512 MB × 1024 KB per MB × 1024 bytes per KB × 8 bits per byte.

You have operations like SETBIT and GETBIT to set/get the value of a bit at a specific offset. You can also use BITCOUNT to count the number of set bits (bits with a value of 1) in the bitmap over a given range.
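Here’s a sketch of those operations on a plain bytearray (the class name is made up for illustration; this models the command semantics rather than talking to Redis). Note that Redis numbers bits from the most significant bit of each byte, which the sketch reproduces:

```python
# Bitmap model on a bytearray: SETBIT/GETBIT-style single-bit access
# plus a BITCOUNT-style population count. Illustration only.
class Bitmap:
    def __init__(self):
        self._buf = bytearray()

    def setbit(self, offset, value):
        byte, bit = divmod(offset, 8)
        if byte >= len(self._buf):                 # grow on demand, zero-filled
            self._buf.extend(b"\x00" * (byte - len(self._buf) + 1))
        old = (self._buf[byte] >> (7 - bit)) & 1   # bit 0 is the MSB, as in Redis
        if value:
            self._buf[byte] |= 1 << (7 - bit)
        else:
            self._buf[byte] &= ~(1 << (7 - bit)) & 0xFF
        return old                                 # SETBIT returns the old bit

    def getbit(self, offset):
        byte, bit = divmod(offset, 8)
        if byte >= len(self._buf):
            return 0                               # unset bits read as 0
        return (self._buf[byte] >> (7 - bit)) & 1

    def bitcount(self):
        return sum(bin(b).count("1") for b in self._buf)


segment = Bitmap()
segment.setbit(43234, 1)          # mark user ID 43234 as in the segment
print(segment.getbit(43234))      # 1
print(segment.getbit(99))         # 0
print(segment.bitcount())         # 1
```

This is the membership-check pattern from the Grab example: one bit per user ID, so checking "is user 43234 in this segment?" is a single constant-time bit read.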

HyperLogLog

HyperLogLog is a probabilistic data structure that lets you keep track of the number of unique elements in a dataset in an extremely efficient, fast way. The downside is HyperLogLog provides an estimated count rather than an exact one (the accuracy is very high, with less than a 1% standard error).

It’s very useful for real-time analytics. Maybe you want to count the number of users who’ve downloaded each image on your website. You might have millions of images, so using an HLL could be an efficient way to keep track of this.

Whenever a user downloads an image for the first time, you use the PFADD function to add them to the HLL. If you want to get the number of unique users who’ve been added to the HLL, you can use the PFCOUNT command.
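To show the core idea behind that accuracy/space trade-off, here’s a toy HyperLogLog with PFADD/PFCOUNT-style methods. This is a sketch of the algorithm, not Redis’s implementation (Redis uses 2^14 registers plus sparse encodings and extra bias corrections), and the class name is made up for illustration:

```python
import hashlib
import math

# Toy HyperLogLog: hash each item, use the low bits to pick a register,
# and track the longest run of leading zeros seen per register. The
# harmonic mean of the registers estimates the distinct count.
class MiniHLL:
    P = 12                 # 2^12 = 4096 registers
    M = 1 << P

    def __init__(self):
        self.registers = [0] * self.M

    def pfadd(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h & (self.M - 1)                     # low P bits pick a register
        rest = h >> self.P                         # remaining 52 bits
        rank = (64 - self.P) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def pfcount(self):
        alpha = 0.7213 / (1 + 1.079 / self.M)
        est = alpha * self.M * self.M / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.M and zeros:          # small-range correction
            est = self.M * math.log(self.M / zeros)
        return round(est)


hll = MiniHLL()
for i in range(10_000):
    hll.pfadd(f"user:{i}")
print(hll.pfcount())  # close to 10,000, using only 4096 small registers
```

The payoff is the memory footprint: the register array is a few kilobytes no matter how many distinct items you add, whereas an exact set of millions of user IDs would take megabytes.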

This is an excerpt from our Tech Dive on Redis for Quastor Pro readers. We also talk about the other Redis data structures (Sets, Hashes, Bitfields, etc.), Redis’s options for persistence (RDB and AOF) and their trade-offs, real world use cases and more.

For detailed tech dives on a huge host of topics in backend and data engineering, check out Quastor Pro.

Readers can typically expense Quastor Pro with their job’s Learning & Development budget. Here’s an email you can send to your manager.


Tech Snippets

Why LinkedIn switched from JSON to Protobuf

LinkedIn has over 900 million members in 200 countries. To serve this traffic, they use a microservices architecture with thousands of backend services. Together, these microservices expose tens of thousands of individual API endpoints across their system.

As you might imagine, this can lead to quite a few headaches if not managed properly (although, properly managing the system will also lead to headaches, so I guess there’s just no winning).

To simplify the process of creating and interacting with these services, LinkedIn built (and open-sourced) Rest.li, a Java framework for writing RESTful clients and servers.

To create a web server with Rest.li, all you have to do is define your data schema and write the business logic for how the data should be manipulated/sent with the different HTTP requests (GET, POST, etc.).

Rest.li will create Java classes that represent your data model with the appropriate getters, setters, etc. It will also use the code you wrote for handling the different HTTP endpoints and spin up a highly scalable web server.

For creating a client, Rest.li handles things like

  • Service Discovery - Translates a URI like d2:// to the proper address - http://myD2service.something.com:9520/.

  • Type Safety - Uses the schema created when building the server to check types for requests/responses.

  • Load Balancing - Balancing request load between the different servers that are running a certain backend service.

  • Common Request Patterns - You can do things like make parallel Scatter-Gather requests, where you get data from all the nodes in a cluster.

and more.

To learn more about Rest.li, you can check out the docs here.

JSON

Since it was created, Rest.li has used JSON as the default serialization format for sending data between clients and servers.

{
  "id": 43234,
  "type": "post",
  "authors": [
    "jerry",
    "tom"
  ]
}

JSON has tons of benefits

  • Human Readable - Makes it much easier to work with than looking at 08 96 01 (binary-encoded message). If something’s not working, you can just log the JSON message.

  • Broad Support - Every programming language has libraries for working with JSON. (I actually tried looking for a language that didn’t and couldn’t find one. Here’s a JSON library for Fortran.)

  • Flexible Schema - The format of your data doesn’t have to be defined in advance and you can dynamically add/remove fields as necessary. However, this flexibility can also be a downside since you don’t have type safety.

  • Huge amount of Tooling - There’s a huge amount of tooling developed for JSON like linters, formatters/beautifiers, logging infrastructure and more.

However, the downside that LinkedIn kept running into was with performance.

With JSON, they faced

  • Increased Network Bandwidth Usage - Plaintext JSON is pretty verbose, and this resulted in large payload sizes. The increased network bandwidth usage was hurting latency and placing excess load on LinkedIn’s backend.

  • Serialization and Deserialization Latency - Serializing and deserializing JSON is relatively slow because of how verbose the messages are. This isn’t an issue for the majority of applications, but at LinkedIn’s volume it was becoming a problem.

To reduce network usage, engineers tried integrating compression algorithms like gzip to reduce payload size. However, this just made the serialization/deserialization latency worse.

Instead, LinkedIn looked at several formats as an alternative to JSON.

They considered

  • Protocol Buffers (Protobuf) - Protobuf is a widely used message-serialization format that encodes your message in binary. It’s very efficient, supported by a wide range of languages and also strongly typed (requires a predefined schema). We’ll talk more about this below.

  • Flatbuffers - A serialization format that was also open-sourced by Google. It’s similar to Protobuf but also offers “zero-copy deserialization”. This means that you don’t need to parse/unpack the message before you access data.

  • MessagePack - Another binary serialization format with wide language support. However, MessagePack doesn’t require a predefined schema so this can cause it to be less safe and less performant than Protobuf.

  • CBOR - A binary serialization format that was inspired by MessagePack. CBOR extends MessagePack and adds some features like distinguishing text strings from byte strings. Like MessagePack, it does not require a predefined schema.

And a couple other formats.

They ran some benchmarks and also looked at factors like language support, community and tooling. Based on their examination, they went with Protobuf.

Overview of Protobuf

Protocol Buffers (Protobuf) is a language-agnostic, binary serialization format created at Google in 2001. Google needed an efficient way to store structured data and send it across the network, write it to disk, etc.

Protobuf is strongly typed. You start by defining how you want your data to be structured in a .proto file.

The proto file for serializing a user object might look like…

syntax = "proto3";

message Person {
  string name = 1;
  int32 id = 2;
  repeated string email = 3;
}

Protobuf supports a huge variety of types including bools, strings, repeated fields (arrays), maps, and more. You can also update your schema later without breaking deployed programs that were compiled against the older format.

Once you define your schema in a .proto file, you use the protobuf compiler (protoc) to compile this to data access classes in your chosen language. You can use these classes to read/write protobuf messages.

Some of the benefits of Protobuf are

  • Smaller Payload - Encoding is much more efficient. {“id”:59} in JSON takes around 10 bytes, assuming no whitespace and UTF-8 encoding. In Protobuf, that message would be 0x08 0x3b (hexadecimal): only 2 bytes.

  • Fast Serialization/Deserialization - Because the payload is much more compact, serializing and deserializing it is also faster. Additionally, knowing what format to expect for each message allows for more optimizations when deserializing.

  • Type Safety - As we discussed, having a schema means that any deviations from this schema are caught at compile time. This leads to a better experience for users and (hopefully) fewer 3 am calls.

  • Language Support - There’s wide language support with tooling for Python, Java, Objective-C, C++, Kotlin, Dart, and more.
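To see where that 2-byte figure comes from, here’s a hand-rolled sketch of Protobuf’s wire encoding for a single integer field (illustration only; real code uses protoc-generated classes). Each field is a varint tag (field number plus wire type) followed by the varint-encoded value:

```python
# Hand-rolled sketch of Protobuf's wire format for one varint field.

def encode_varint(n):
    # Little-endian base-128: 7 payload bits per byte, high bit = "more follows"
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_int32_field(field_number, value):
    tag = (field_number << 3) | 0   # wire type 0 = varint
    return encode_varint(tag) + encode_varint(value)

# The {"id": 59} example, with id as field number 1:
payload = encode_int32_field(1, 59)
print(payload.hex(" "))  # 08 3b  -> 2 bytes, vs ~10 bytes of JSON
```

The same scheme produces the 08 96 01 message mentioned earlier: tag byte 0x08 for field 1, then 150 varint-encoded as 0x96 0x01. Field names never appear on the wire, which is a big part of why payloads shrink.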

Results

Using Protobuf resulted in an increase in throughput for response and request payloads. For large payloads, LinkedIn saw improvements in latency of up to 60%.

They didn’t see any statistically significant degradations when compared to JSON for any of their services.

Here’s the P99 latency comparison chart from benchmarking Protobuf against JSON for servers under heavy load.

For more details, read the full blog post here.

How To Remember Concepts from Quastor Articles

We go through a ton of engineering concepts in the Quastor summaries and tech dives.

In order to help you remember these concepts for your day to day job (or job interviews), I’m creating flash cards that summarize these core concepts from all the Quastor summaries and tech dives.

The flash cards are completely editable and are meant to be integrated in a Spaced-Repetition Program (Anki) so that you can easily remember the concepts forever.

This is an additional perk for Quastor Pro members in addition to the tech dive articles. Thank you for being a subscriber!