How Compilers Work

Hey,

Today we’ll be talking about

  • How Compilers Work - Bob Nystrom wrote an amazing book called Crafting Interpreters, where he goes through how language implementations work under the hood. We’ll be giving a summary of one of the chapters with some commentary from yours truly.

    • The difference between programming languages and programming language implementations

    • Front end, middle end and back end of compilers and a breakdown of what each of these 3 parts do.

  • Plus some tech snippets on

    • Rust for JavaScript developers

    • A twitter thread on how CDNs work by Alex Xu

    • Lambda Calculus in 397 Bytes

    • How JPEG Compression Works

Questions? Please contact me at [email protected].

Quastor is a free Software Engineering newsletter that sends out summaries of technical blog posts, deep dives on interesting tech and FAANG interview questions and solutions.

Language Implementations Explained

Bob Nystrom is a software engineer at Google, where he works on the Dart Programming language.

He wrote an amazing book called Crafting Interpreters, where he walks you through how programming language implementations work. In the book, you’ll build two interpreters, one in Java and another in C.

He published the entire book for free here, but I’d highly suggest you support the author if you have the means to do so.

His section on A Map of the Territory gives a fantastic introduction to programming language implementations, so here’s a summary of that section (with my own commentary added in).

Languages vs. Language Implementations

First of all, it’s important to note that a programming language implementation is different from a programming language.

The programming language refers to the syntax, keywords, etc. The programming language itself is usually designed by a committee and there are some standard documents that describe the language. These documents are usually called the Programming Language Specification.

(Not all languages have a specification. Python, for example, has the Python Language Reference, which is the closest thing to it’s specification.)

The language implementation is the actual software that allows you to run code from that programming language. Typically, an implementation consists of a compiler/interpreter.

A programming language can have many different language implementations, and these implementations can all be quite different from each other. The key factor is that all these implementations must be able to run code that is defined according to the programming language specification.

The most popular implementation for Python is CPython but there’s also PyPy (JIT compiler), Jython (Python running on the JVM), IronPython (Python running on .NET) and many more implementations.

The Architecture of a Language Implementation

The branching paths a language may take over the mountain.

Language implementations are obviously built differently, but there are some general patterns.

A language implementation can be subdivided into the following parts

  • Front end - takes in your source code and turns it into an intermediate representation.

  • Middle end - takes the intermediate representation and applies optimizations to it.

  • Back end - takes the optimized intermediate representation and turns it into machine code or bytecode.

The intermediate representation is a data structure or code that makes the compiler more modular. It allows you to reuse your front end for multiple target platforms and reuse your backend for multiple source languages.

For example, you can write multiple back ends that turn the intermediate representation into machine code for x86, ARM, and other platforms and then reuse the same front end that turns your C code into intermediate representation.

LLVM is designed around a high-level assembly language that is named “intermediate representation” (here’s the language reference for LLVM IR) while CPython uses a data structure called the Control Flow Graph.

We’ll break down all of these concepts further…

Front end

As we said before, the front end is responsible for taking in your source code and turning it into an intermediate representation.

The first part is scanning (also known as lexical analysis). This is where the front end reads your source code and converts it into a series of tokens. A token is a single element of a programming language. It can be a single character, like a {, or it can be a word, like System.out.println.

After scanning and converting your source code into tokens, the next step is parsing.

A parser will take in the flat sequence of tokens and build a tree structure based on the programming language’s grammar.

This tree is usually called an abstract syntax tree (AST). While the parser is creating the abstract syntax tree, it will let you know if there are any syntax errors in your code.

The front end will also handle tasks like binding. This is where every identifier (variable, etc.) gets linked to where it’s defined.

If the language is statically typed, this is where type checking happens and type errors are reported.

In terms of storing this information, language implementations will do this differently.

Some will store it on the abstract syntax tree as attributes. Others will store it in a lookup table called a symbol table.

All this information will be stored in some type of intermediate representation.

There are a couple of well established styles of intermediate representation out there. Examples include

Having a shared intermediate representation helps make your compiler design much more modular.

Middle End

The middle end is responsible for performing various optimizations on the intermediate representation.

These optimizations are independent of the platform that’s being targeted, therefore they’ll speed up your code regardless of what the backend does.

An example of an optimization is constant folding: if some expression always evaluates to the exact same value, then do that evaluation at compile time and replace the code for the expression with the result.

So, replace 

With the result of that expression.

Other examples are removal of unreachable code (reachability analysis) and code that does not affect the program results (dead code elimination).

Back End

The back end is responsible for taking the optimized intermediate representation from the middle end and generating the machine code for the specific CPU architecture of the computer (or generating bytecode).

The back end may perform more analysis, transformations and optimizations that are specific for that CPU architecture.

If the back end produces bytecode, then you’ll also need another compiler for each target architecture that turns that bytecode into machine code.

Or, many runtimes rely on a virtual machine, where a program emulates a hypothetical chip. The bytecode is run on that virtual machine.

An example is the Java Virtual Machine, which runs Java bytecode. You can reuse that backend and write frontends to handle different languages. Python, Kotlin, Clojure and Scala are a few examples of languages that have front ends that can convert that language into Java bytecode.

Quastor is a free Software Engineering newsletter that sends out summaries of technical blog posts, deep dives on interesting tech and FAANG interview questions and solutions.

Tech Snippets

  • From JavaScript to Rust - If you’re a JavaScript developer and are interested in learning Rust, this is a really awesome free book that teaches Rust from the perspective of JavaScript. It maps common JavaScript and Nodejs workflows to the Rust ecosystem.

  • How CDNs Work - This is a great twitter thread by Alex Xu with a ton of illustrations on how CDNs work. Alex is the author of “The System Design Interview - An Insider’s Guide”, a must read if you’re preparing for system design interviews.

  • How JPEG Compression Works - Bilal Himite wrote an awesome series of articles discussing how image processing works. This is one part in his series that covers how JPEGs work.

Here’s some snippets from our last email, in case you missed it!

  • What Makes a Great Engineering Manager - Dave Rensin is a former Engineering Director at Google. In this interview, he talks about what makes a great EM (engineering manager), how to interview for an EM position and tips for people who are transitioning into EM.

  • An awesome Github Repo on resources for CTOs (or aspiring CTOs).The repo contains resources on

    • Software Development Processes (Scrum/Agile, CI/CD, etc.)

    • Software Architecture

    • Product Management

    • Hiring for technical roles

  • Advanced Linux Programming - If you’re a developer for the Linux OS, this is a great textbook to teach you advanced topics around how to write programs that work with features like IPC, multiprocessing, multi-threading, and interaction with hardware devices.The book goes into how you can build applications that are faster and more secure and it also shows peculiarities in the Linux OS including limitations and conventions.

  • How to Review Code as a Junior Developer - This is an awesome blog post on the Pinterest Engineering blog about how junior developers should do code reviews. Many junior devs may think their feedback (especially when given to developers who are more senior) is a waste of time, but that’s not the case!Several benefits from reviewing code as a junior developer are

    • You learn the code base more quickly

    • You build a feedback circle

    • You take more ownership over the codebase

Interview Question

You are given two sorted arrays that can have different sizes.

Return the median of the two sorted arrays.

Previous Solution

As a reminder, here’s our last question

Build a class MyQueue that implements a queue using two stacks.

Your class should implement the functions push, pop, peek, and empty.

Solution

A stack works based on the LIFO principle - last in, first out.

If you push 10 numbers into a stack and then pop out all 10 numbers, you’ll get all 10 numbers back in reverse order.

On the other hand, a queue works based on FIFO - first in, first out.

If you push 10 numbers into a queue and then pop out all 10 numbers, you’ll get all 10 numbers back in the same order.

So, given that we have two stacks, one way of implementing this is to push every element through both stacks.

This will reverse the order twice, which means we’ll get our items back in the same order they were put in (first in, first out).

One stack will be named first, and the other will be named second.

For the push method, we’ll push our items on to the first stack.

Then, when we pop, we’ll first pop everything off the first stack and push it onto the second stack (this is done with our _load function).

After doing that, we can pop an item off our second stack and return it to the user.

Here’s the Python 3 code.

Quastor is a free Software Engineering newsletter that sends out summaries of technical blog posts, deep dives on interesting tech and FAANG interview questions and solutions.