How Go Mitigates Supply Chain Attacks
Hey Everyone,
Today we’ll be talking about
A look at Bloom Filters
How Bloom Filters work and a comparison to hash tables
An example use case to reduce database load
Real world use cases with CDNs
How Go mitigates supply chain attacks
A new version of a dependency cannot affect a Go build without deliberate installation by the programmer.
The CheckSum Database maintains a list of cryptographic hashes of all the dependencies to prevent compromised dependencies from spreading malicious code
Go’s principle of “A little copying is better than a little dependency”
Plus, some tech snippets on
Google’s new Language Model with 540 billion parameters
How Slack is revamping their mobile codebase
A sorting algorithm with a time complexity of O(max(n))
How to become an Engineering Manager
On Deck is an awesome startup that takes the MBA experience and virtualizes it.
They’re launching the On Deck Engineering Fellowship, which is a curated community to help aspiring engineering leaders learn managerial skills, meet industry leaders and build a personal network of like-minded peers.
They have two specialized tracks that you can join, depending on where you are in your career:
For Early-career Engineers - Early-career engineers who want to break into management roles
For Engineering Managers - Managers & Directors who want to accelerate their path to senior leadership.
You’ll be placed in a group of highly dedicated and capable developers who share the same goal as you, so this is a great way to build connections that will last a lifetime.
sponsored
Bloom Filters
A bloom filter is a very space-efficient data structure that can quickly (constant time) tell you if an item doesn’t exist in a group of items.
When you’re creating a new username on Twitter and they have to check whether that username is already taken, a bloom filter is a great data structure to use for speeding that up.
How Bloom Filters Work
A bloom filter is like a simpler version of a hash table.
The core data structure in a bloom filter is a bit vector (an array of bits). This bit vector will be used to keep track of items that exist in the set.
When you want to insert an item, you’ll first use a hash function (or multiple hash functions) to hash that item into an integer.
Then, you can mod that integer by the number of slots in your bit vector
integer % num_slots_bit_vector
This will give you a slot in your bit vector for that item. You set that slot’s bit to 1. Then, you can add the item to your database (or whatever storage layer you’re using).
If you want to check if some item exists in your bloom filter, you can repeat the process of hashing and modulus to find the corresponding slot in the bit vector. If the slot in the bit vector is not set to 1, then you immediately know that the item does not exist in the set. You don’t have to query your database (which is a lot more expensive than using the bloom filter).
However, if the slot in the bit vector for that item is set to 1, then that doesn’t tell you for sure that the item exists in the set. You’ll have to do a further check within your database to know for certain.
The bloom filter will only tell you if an item doesn’t exist in the set.
This is because of possible collisions with your hash function (pigeonhole principle). Your bit vector only has a limited number of slots, and the number of possible items is much, much larger than the number of slots.
You’ll eventually run into a scenario where two different items get mapped to the same slot in your bloom filter’s bit vector.
Therefore, bloom filters will tell you if an item is either
possibly in the set
definitely not in the set
False positive matches are possible, but false negatives are not.
Comparing Bloom Filters and Hash Tables
Bloom filters are very similar to hash tables, but a hash table eliminates the possibility of false positive matches through collision resolution.
The hash table solves the collision problem with solutions like open addressing.
The downside of this is that it makes the hash table take up far more space than the bloom filter.
If you want to keep track of every single twitter username in your data structure, a hash table may become too large to store in-memory. In that situation, you’ll want to use a bloom filter.
Example Use Case
Going back to the twitter example, let’s say you’re an engineer at twitter and you’re working on the sign up form for new users.
When a new user tries to create their username, they need to quickly find out if their username has already been taken.
Sending a query to your database every single time a user attempts to create a username can result in unnecessary load on the database.
Therefore, you can use a bloom filter to quickly check if a username is unique.
If the username doesn’t exist in the bloom filter, then you can avoid querying the database.
If the username does exist in the bloom filter, then you can query the database to check if the username is already taken or if the bloom filter match was a false positive.
Real World Use Cases
Databases use bloom filters extensively to reduce disk lookups for non-existent rows/columns. Apache Cassandra, Postgres and Google Big Table are just a few examples of databases that use bloom filters to reduce disk lookups.
The Akamai Content Delivery Network (one of the largest CDNs in the world) uses bloom filters to avoid “one-hit wonders” from being stored in their caches. One-hit-wonders are objects that are requested by users just once. Akamai uses a bloom filter to detect the second request for a web object and then cache it only after the second request.
Quastor is a free Software Engineering newsletter that sends out summaries of technical blog posts, deep dives on interesting tech and FAANG interview questions and solutions.
Tech Snippets
A sorting algorithm with time complexity O(max(n) from r/programminghumor
Pathways Language Model - Google has recently released a paper showcasing their new language model with 540 billion parameters. They evaluated this model on hundreds of language understanding and generation tasks (translation, summarization, question answering, etc.) and found that it achieves state-of-the-art few-shot performance across most tasks.Here’s an example of the language model explaining a joke.
How Slack is revamping their mobile codebase - In the beginning of 2020, Slack engineers realized that the tech debt in their mobile codebase was starting to significantly affect developer productivity.This is a multi-part series of blog posts on how they revamped their codebasePart 1 - Getting rid of the worst tech debt and finishing stalled migrations.Part 2 - Moving code from the main app target to modules and adopting Bazel for the build system for iOS.
How Go Mitigates Supply Chain Attacks
A common attack vector that engineers have to be wary of is supply chain attacks, where an open source dependency that your code relies on is compromised in some way.
This was displayed full blast a few months ago with the Log4Shell, a vulnerability in the popular Java logging framework Log4J.
Filippo Valsorda is in charge of cryptography and security on the Go team at Google. He wrote a great blog post describing the steps Go takes to mitigate the risk of supply chain attacks.
Here’s a summary
All Builds are Locked
There is no way for any new version of a dependency to automatically affect a Go build.
The version of every dependency contributing to a Go build is fully determined by the go.mod file (a text-file listing all the dependencies) in the main module.
The only commands that can change the go.mod file are not expected to be run automatically or in CI, so changes to dependency trees must be made deliberately.
Module Contents are checked against a global database
If an attacker compromises a dependency, they could re-upload an existing version with malware.
To combat this, the go.sum file contains a list of cryptographic hashes of each dependency that contributes to the build.
Go takes this a step further with the Checksum Database. This is a global append-only list of go.sum entries. The database is served and maintained by Google and this ensures that you can check your dependencies to make sure they haven’t been compromised.
Prefer Copying over a Dependency
Go has a culture of rejecting large dependency trees and preferring a bit of copying to adding a new dependency.
High-quality reusable Go modules will sometimes have a label of “zero dependencies”, so Go developers can use that module without taking on other dependencies.
The strongest mitigation to supply chain attacks is always a small dependency tree.
For other security measures Go takes (and for further elaboration), read the full blog post here.
Interview Question
Given a positive integer n, build a square matrix that has elements from 1 to n^2 in spiral order.
Previous Question
As a refresher, here’s the previous question
You are given the head of a linked list.
You are also given an integer n.
Remove the nth node from the end of the linked list and return it’s head.
Example
Input: head = [1, 2, 3, 4, 5], n = 2
Output: [1, 2, 3, 5]
Solution
We can solve this question with a single pass of the linked list.
We do it by having two pointers, a fast pointer and a slow pointer.
The fast pointer will first traverse N nodes in our linked list.
If the fast pointer has already reached the end of the linked list, then that means the Nth node from the end of the linked list is the head node. So we’ll delete the head node and return the linked list.
Otherwise, we’ll start traversing both the slow and fast pointer through our linked list. The fast pointer will be N nodes ahead of the slow pointer.
When the fast pointer reaches the last node of our linked list, the slow pointer will be directly behind the Nth node from the end of the linked list.
We can delete the node in front of the slow pointer and then return our linked list.
Here’s the Python 3 code. What’s the time and space complexity of our solution? Reply back with your answer and we’ll tell you if you’re right/wrong.