Hey,
Today we’ll be talking about how programming language implementations work.
Questions? Please contact me at [email protected].
Quastor is a free Software Engineering newsletter that sends out summaries of technical blog posts, deep dives on interesting tech and FAANG interview questions and solutions.
Bob Nystrom is a software engineer at Google, where he works on the Dart programming language.
He wrote an amazing book called Crafting Interpreters, where he walks you through how programming language implementations work. In the book, you’ll build two interpreters, one in Java and another in C.
He published the entire book for free here, but I’d highly suggest you support the author if you have the means to do so.
His section on A Map of the Territory gives a fantastic introduction to programming language implementations, so here’s a summary of that section (with my own commentary added in).
First of all, it’s important to note that a programming language implementation is different from a programming language.
The programming language refers to the syntax, keywords, and so on. The language itself is often designed by a committee, and there are standards documents that describe it. These documents are usually called the programming language specification.
(Not all languages have a formal specification. Python, for example, has the Python Language Reference, which is the closest thing to its specification.)
The language implementation is the actual software that allows you to run code from that programming language. Typically, an implementation consists of a compiler/interpreter.
A programming language can have many different language implementations, and these implementations can all be quite different from each other. The key factor is that all these implementations must be able to run code that is defined according to the programming language specification.
The most popular implementation of Python is CPython, but there’s also PyPy (a JIT compiler), Jython (Python running on the JVM), IronPython (Python running on .NET), and many more.
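As an aside (this snippet isn’t from Bob’s post), Python can tell you which implementation is running your code:

    import platform
    import sys

    # Which Python implementation is running this code?
    print(platform.python_implementation())  # e.g. "CPython", "PyPy", "Jython", "IronPython"
    print(sys.implementation.name)           # e.g. "cpython"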
Language implementations are obviously built differently, but there are some general patterns.
A language implementation can be subdivided into three parts: the front end, the middle end (the optimizer), and the back end, with an intermediate representation passed between them.
The intermediate representation is a data structure or code that makes the compiler more modular. It allows you to reuse your front end for multiple target platforms and reuse your back end for multiple source languages.
For example, you can write multiple back ends that turn the intermediate representation into machine code for x86, ARM, and other platforms and then reuse the same front end that turns your C code into intermediate representation.
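To make that concrete, here’s a deliberately tiny sketch of the split. The IR format and the back ends below are made up for illustration; real IRs like LLVM IR are far richer.

    # A made-up, toy IR for the expression "1 + 2", as a front end
    # might produce it. Everything here is illustrative only.
    ir = [("const", 1), ("const", 2), ("add",)]

    def x86_back_end(ir):
        # Map each toy IR instruction to some pretend x86-style assembly.
        lines = []
        for instr in ir:
            if instr[0] == "const":
                lines.append(f"push {instr[1]}")
            elif instr[0] == "add":
                lines.append("pop rbx")
                lines.append("pop rax")
                lines.append("add rax, rbx")
                lines.append("push rax")
        return "\n".join(lines)

    def arm_back_end(ir):
        # A second back end would consume the *same* IR here;
        # the front end doesn't need to change at all.
        raise NotImplementedError

    print(x86_back_end(ir))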
LLVM is designed around a high-level, assembly-like language called LLVM IR (here’s the language reference for LLVM IR), while CPython’s compiler uses a data structure called a control flow graph.
We’ll break down all of these concepts further…
As we said before, the front end is responsible for taking in your source code and turning it into an intermediate representation.
The first part is scanning (also known as lexical analysis). This is where the front end reads your source code and converts it into a series of tokens. A token is a single element of a programming language. It can be a single character, like a {, or several characters long, like a number (123), a string literal ("hi"), or an identifier (println).
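As a quick illustration (not from the book itself), CPython exposes its own scanner through the standard-library tokenize module, so you can print the token stream for a line of Python:

    import io
    import tokenize

    # Scan one line of Python source and print each token's type and text.
    source = "print(1 + 2)\n"
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        print(tokenize.tok_name[tok.type], repr(tok.string))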
After scanning and converting your source code into tokens, the next step is parsing.
A parser will take in the flat sequence of tokens and build a tree structure based on the programming language’s grammar.
This tree is usually called an abstract syntax tree (AST). While the parser is creating the abstract syntax tree, it will let you know if there are any syntax errors in your code.
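Similarly, CPython’s ast module will show you the tree its parser builds for an expression (again, just an illustration):

    import ast

    # Parse an expression and dump the abstract syntax tree.
    tree = ast.parse("1 + 2 * 3", mode="eval")
    print(ast.dump(tree, indent=2))  # the indent argument needs Python 3.9+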
The front end will also handle tasks like binding. This is where every identifier (variable, etc.) gets linked to where it’s defined.
If the language is statically typed, this is where type checking happens and type errors are reported.
Language implementations store this information in different ways.
Some store it as attributes on the abstract syntax tree itself. Others store it in a lookup table called a symbol table.
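CPython even exposes its symbol tables through the standard-library symtable module. A small illustration:

    import symtable

    # Build CPython's symbol table for a small snippet and list the
    # identifiers recorded at module scope and inside the function f.
    code = "x = 1\n\ndef f(y):\n    return x + y\n"
    table = symtable.symtable(code, "<example>", "exec")
    print(table.get_identifiers())              # module scope: x, f
    inner = table.lookup("f").get_namespace()   # the table for f's body
    print(inner.get_identifiers())              # inside f: y, x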
All this information will be stored in some type of intermediate representation.
There are a few well-established styles of intermediate representation out there. Examples include control flow graphs, static single-assignment (SSA) form, continuation-passing style, and three-address code.
Having a shared intermediate representation helps make your compiler design much more modular.
The middle end is responsible for performing various optimizations on the intermediate representation.
These optimizations are independent of the platform that’s being targeted, so they’ll speed up your code regardless of what the back end does.
An example of an optimization is constant folding: if some expression always evaluates to the exact same value, then do that evaluation at compile time and replace the code for the expression with the result.
So, an expression like 60 * 60 * 24 gets replaced with the constant 86400 at compile time.
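You can actually watch CPython do this. Its compiler folds constant expressions before emitting bytecode (the exact opcodes depend on your Python version, but you’ll see the folded constant and no multiply instructions):

    import dis

    # Disassemble a constant expression; CPython has already folded
    # 60 * 60 * 24 into 86400 by the time bytecode is generated.
    dis.dis("60 * 60 * 24")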
Other examples are removal of unreachable code (reachability analysis) and code that does not affect the program results (dead code elimination).
The back end is responsible for taking the optimized intermediate representation from the middle end and generating the machine code for the specific CPU architecture of the computer (or generating bytecode).
The back end may perform further analysis, transformations, and optimizations that are specific to that CPU architecture.
If the back end produces bytecode, then you’ll also need another compiler for each target architecture that turns that bytecode into machine code.
Alternatively, many runtimes rely on a virtual machine: a program that emulates a hypothetical chip. The bytecode runs on that virtual machine.
An example is the Java Virtual Machine, which runs Java bytecode. You can reuse that back end and write front ends for different languages. Kotlin, Clojure, Scala, and Python (via Jython) are a few examples of languages with compilers that produce Java bytecode.
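To make the "program that emulates a hypothetical chip" idea concrete, here’s a toy stack-based virtual machine. It’s purely illustrative; the instruction set is made up, and this is nothing like how the JVM or CPython’s VM is actually built.

    # A toy stack-based virtual machine. Each bytecode instruction is a
    # tuple of an opcode name plus optional arguments.
    def run(bytecode):
        stack = []
        for op, *args in bytecode:
            if op == "PUSH":        # push a constant onto the stack
                stack.append(args[0])
            elif op == "ADD":       # pop two values, push their sum
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif op == "MUL":       # pop two values, push their product
                b, a = stack.pop(), stack.pop()
                stack.append(a * b)
            elif op == "PRINT":     # pop and print the top of the stack
                print(stack.pop())

    # Roughly equivalent to: print(2 + 3 * 4)
    run([("PUSH", 3), ("PUSH", 4), ("MUL",), ("PUSH", 2), ("ADD",), ("PRINT",)])  # prints 14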
Here are some snippets from our last email, in case you missed it!
You are given two sorted arrays that can have different sizes.
Return the median of the two sorted arrays.
As a reminder, here’s our last question
Build a class MyQueue that implements a queue using two stacks.
Your class should implement the functions push, pop, peek, and empty.
Solution
A stack works based on the LIFO principle - last in, first out.
If you push 10 numbers into a stack and then pop out all 10 numbers, you’ll get all 10 numbers back in reverse order.
On the other hand, a queue works based on FIFO - first in, first out.
If you push 10 numbers into a queue and then pop out all 10 numbers, you’ll get all 10 numbers back in the same order.
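A quick demonstration with Python’s built-ins (a list used as a stack, collections.deque used as a queue):

    from collections import deque

    stack = []
    queue = deque()

    for n in range(1, 6):
        stack.append(n)  # push onto the stack
        queue.append(n)  # enqueue into the queue

    print([stack.pop() for _ in range(5)])      # [5, 4, 3, 2, 1] -- reversed (LIFO)
    print([queue.popleft() for _ in range(5)])  # [1, 2, 3, 4, 5] -- same order (FIFO)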
So, given that we have two stacks, one way of implementing this is to push every element through both stacks.
This will reverse the order twice, which means we’ll get our items back in the same order they were put in (first in, first out).
One stack will be named first, and the other will be named second.
For the push method, we’ll push our items on to the first stack.
Then, when we pop, if the second stack is empty, we’ll first pop everything off the first stack and push it onto the second stack (this is done with our _load function).
After doing that, we can pop an item off our second stack and return it to the user.
Here’s a Python 3 sketch of that approach (using plain Python lists as the two stacks):
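    # One way to implement the approach described above,
    # with plain Python lists serving as the two stacks.
    class MyQueue:
        def __init__(self):
            self.first = []    # receives pushed items
            self.second = []   # holds items in pop order

        def push(self, x):
            self.first.append(x)

        def _load(self):
            # Only refill the second stack when it's empty, so items
            # already sitting in it keep their FIFO order.
            if not self.second:
                while self.first:
                    self.second.append(self.first.pop())

        def pop(self):
            self._load()
            return self.second.pop()

        def peek(self):
            self._load()
            return self.second[-1]

        def empty(self):
            return not self.first and not self.second

    # Usage example
    q = MyQueue()
    q.push(1)
    q.push(2)
    print(q.peek())   # 1
    print(q.pop())    # 1
    print(q.empty())  # False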