Hey Everyone,
Today we’ll be talking about Apache Spark.
Questions? Please contact me at [email protected].
Quastor is a free Software Engineering newsletter that sends out deep dives on interesting tech, summaries of technical blog posts, and FAANG interview questions and solutions.
Spark is an open source project that makes it much easier to run computations on large, distributed data. It’s widely used to run datacenter computations and its popularity has been exploding since 2012.
It’s now become one of the most popular open source projects in the big data space and is used by companies like Amazon, Tencent, Shopify, eBay and more.
Before Spark, engineers relied on Hadoop MapReduce to run computations on their data, but there were quite a few issues with that approach.
Spark was introduced as a way to solve those pain points, and it’s quickly evolved into much more.
We’ll talk about why Spark was created, what makes Spark so fast and how it works under the hood.
We’ll start with a brief overview of MapReduce.
In a previous tech dive, we talked about Google MapReduce and how Google was using it to run massive computations to help power Google Search.
MapReduce introduced a new parallel programming paradigm that made it much easier to run computations on massive amounts of distributed data.
Although Google’s implementation of MapReduce was proprietary, it was re-implemented as part of Apache Hadoop.
Hadoop gained widespread popularity as a set of open source tools for companies dealing with massive amounts of data.
Let’s say you have 100 terabytes of data split across 100 different machines. You want to run some computations on this data.
With MapReduce, you express your computation as a map function and a reduce function. You run the map function on each of the 100 machines in parallel.
On each machine, the map function will take in that machine’s chunk of the data and output the results of the map function.
The output will get written to local disk on that machine (or to a nearby machine if there isn’t enough space locally).
Then, the reduce function will take in the output of all the map functions and combine that to give the answer to your computation.
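To make the paradigm concrete, here’s a minimal word-count sketch in plain Python, with ordinary functions standing in for the map, shuffle and reduce phases that Hadoop would actually run across many machines (the function names here are just for illustration):

```python
from collections import defaultdict

# Map phase: each machine runs this over its own chunk of the data,
# emitting a (word, 1) pair for every word it sees.
def map_phase(chunk_of_text):
    for line in chunk_of_text:
        for word in line.split():
            yield (word, 1)

# Shuffle: the framework groups the emitted pairs by key so that
# all the counts for the same word end up at the same reducer.
def shuffle(mapped_pairs):
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped.items()

# Reduce phase: combine all the values for a key into the final answer.
def reduce_phase(word, counts):
    return (word, sum(counts))

if __name__ == "__main__":
    chunk = ["spark makes big data easy", "big data big compute"]
    mapped = map_phase(chunk)
    result = [reduce_phase(word, counts) for word, counts in shuffle(mapped)]
    print(result)  # e.g. [('spark', 1), ('makes', 1), ('big', 3), ...]
```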
The MapReduce framework on Hadoop had some shortcomings that were becoming big issues for engineers. The biggest one was that every MapReduce job writes its intermediate and final results out to disk, so workloads that chain many jobs together or reuse the same dataset over and over spend most of their time on disk and network I/O rather than on the actual computation.
Apache Spark was created as a successor to MapReduce to ease these problems.
The main goal was to create a fast and versatile tool to handle distributed processing of large amounts of data. The tool should be able to handle a variety of different workloads, with a specific emphasis on workloads that reuse a working set of data across multiple operations.
Many common machine learning algorithms will repeatedly apply a function to the same dataset to optimize a parameter (e.g., gradient descent).
Running a series of ad-hoc SQL queries against a dataset to explore it is another example of reusing a working set of data across multiple operations (the SQL queries in this scenario).
Spark is designed to handle these operations with ease.
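Here’s a rough sketch of what that kind of reuse looks like in PySpark. The dataset path and column names below are made up for illustration; the important part is the call to cache(), which tells Spark to keep the working set in memory after the first pass so the repeated queries don’t go back to storage.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("working-set-reuse").getOrCreate()

# Load a dataset once and keep it in memory for the operations below.
# The path and columns are hypothetical -- point this at your own storage layer.
events = spark.read.json("s3://my-bucket/events/")
events.cache()

# The same working set is reused by several independent operations;
# without caching, each one would re-read the data from storage.
print(events.filter(events.status == "error").count())
print(events.filter(events.latency_ms > 500).count())
events.groupBy("country").count().show(10)
```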
Spark is an engine for distributed data processing, so it runs on top of your data storage layer. You can use Spark on top of the Hadoop Distributed File System (HDFS), MongoDB, HBase, Cassandra, Amazon S3, relational databases and a bunch of other storage layers.
In a Spark program, you can transform your data in different ways (filter, map, intersection, union, etc.) and Spark can distribute these operations across multiple computers for parallel processing.
Spark offers nearly 100 high-level, commonly needed data processing operators and you can use Spark with Scala, Java, Python and R.
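As a small illustration, here’s what chaining a few of those operators looks like in PySpark (the data and column names are invented for the example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("operators-demo").getOrCreate()

# Two small in-memory DataFrames standing in for real datasets.
us_users = spark.createDataFrame([("alice", 34), ("bob", 41)], ["name", "age"])
eu_users = spark.createDataFrame([("carol", 29), ("dave", 52)], ["name", "age"])

# union, filter and select are a few of the high-level operators Spark ships with;
# each call returns a new DataFrame rather than modifying the old one.
adults_over_40 = (
    us_users.union(eu_users)
            .filter("age > 40")
            .select("name")
)

adults_over_40.show()  # bob and dave
```

Because each operator returns a new dataset, Spark is free to distribute and optimize the whole chain across the cluster for you.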
Spark also offers libraries on top of the core engine to handle a diverse range of workloads: Spark SQL for structured data, MLlib for machine learning, Structured Streaming for stream processing and GraphX for graph processing.
Spark’s speed comes from two main architectural choices: it keeps intermediate data in memory rather than writing it out to disk between steps, and it evaluates your computations lazily so it can optimize the entire graph of operations before running anything. We’ll see both of these at work below.
As we said before, Spark is a distributed data processing engine that can process huge volumes of data distributed across thousands of machines.
The collection of machines is called a Spark cluster, and the largest Spark cluster is around 8,000 machines. (Note: you can also run Spark on a single machine; if you want, you can download it from the Apache website.)
Spark is based on a leader-worker architecture. In Spark lingo, the leader is called the Spark driver while the worker is called the Spark executor.
A Spark application has a single driver, which functions as the central coordinator. You interact with the driver through your Scala/Python/R/Java code, and you can run the driver on your own machine or on one of the machines in the Spark cluster.
The executors are the worker processes that execute the instructions given to them by the driver. Each executor is a JVM process running on a node in the Spark cluster (you’ll usually have one executor per node).
Executors get assigned tasks that work on the data partitions closest to them in the cluster, which helps reduce network congestion.
When you’re working with a distributed system, you’ll typically use a cluster manager (like Apache Mesos, Kubernetes, Docker Swarm, etc.) to help manage all the nodes in your cluster.
Spark is no different. The Spark driver will work with a cluster manager to orchestrate the Spark Executors. You can configure Spark to use Apache Mesos, Kubernetes, Hadoop YARN or Spark’s built-in cluster manager.
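To make the setup concrete, here’s a hedged sketch of how the driver side of a PySpark application gets configured. The master URL and resource settings below are placeholders: local[4] runs everything in a single process for testing, while on a real cluster you’d point master at your cluster manager (e.g. yarn, a k8s:// URL, or a spark:// URL for the standalone manager).

```python
from pyspark.sql import SparkSession

# The driver is the process running this code. The master URL tells it
# which cluster manager to ask for executors.
spark = (
    SparkSession.builder
    .appName("cluster-demo")
    .master("local[4]")                       # e.g. "yarn" or "spark://host:7077" on a real cluster
    .config("spark.executor.instances", "2")  # how many executor processes to ask for
    .config("spark.executor.memory", "2g")    # memory per executor JVM
    .getOrCreate()
)

print(spark.sparkContext.defaultParallelism)
spark.stop()
```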
When Spark runs your computations on the given datasets, it uses a data structure called a Resilient Distributed Dataset (RDD).
RDDs are the fundamental abstraction for representing data in Spark and they were first introduced in the original Spark paper.
Spark will look at your dataset across all the partitions and create an RDD that represents it. This RDD will then be stored in memory where it will be manipulated through transformations and actions.
The key features of RDDs are resilience (a lost partition can be recomputed from the lineage of transformations that produced it), distribution (the data is partitioned across the nodes of the cluster) and immutability (transformations never modify an RDD in place; they produce a new one).
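A quick sketch of what that looks like from the driver’s point of view in PySpark (the data here is just a small in-memory list for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# parallelize splits the collection into partitions that executors work on independently.
numbers = sc.parallelize(range(10), numSlices=4)

print(numbers.getNumPartitions())  # 4
print(numbers.glom().collect())    # the elements grouped by partition, e.g. [[0, 1], [2, 3, 4], ...]

# Transformations return new RDDs -- the original is never modified.
squares = numbers.map(lambda x: x * x)
print(squares.collect())
```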
As you apply transformations, Spark will not actually execute any computations.
Instead, the Spark driver adds these transformations to a Directed Acyclic Graph (DAG). You can think of this as a flowchart of all the transformations you’re applying to the data.
Once you call an action, then the Spark driver will start computing all the transformations. Within the driver are the DAG Scheduler and the Task Scheduler. These two will manage executing the DAG.
When you call an action, the DAG will go to the DAG scheduler.
The DAG scheduler will divide the DAG into different stages where each stage contains various tasks related to your transformations.
The DAG scheduler runs various optimizations to make sure the stages execute efficiently and redundant computations are eliminated. It then creates the set of stages and passes them to the Task Scheduler.
The Task Scheduler will then coordinate with the Cluster Manager (Apache Mesos, Kubernetes, Hadoop YARN, etc.) to execute all the stages using the machines in your Spark cluster and get the results from the computations.
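Here’s a short sketch of this lazy behaviour in PySpark: nothing runs until the count() at the end, and toDebugString() lets you peek at the lineage of transformations the driver has recorded.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["spark", "is", "lazy", "until", "you", "ask"])

# These transformations only record steps in the DAG -- no work happens yet.
long_words = words.filter(lambda w: len(w) > 3)
upper_words = long_words.map(lambda w: w.upper())

# toDebugString shows the lineage (the recorded chain of transformations).
print(upper_words.toDebugString())

# count() is an action, so the driver now schedules the DAG across the executors.
print(upper_words.count())  # 3  ("spark", "lazy", "until")
```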
Write a function that checks whether an integer is a palindrome.
For example, 191 is a palindrome, as well as 111. 123 is not a palindrome.
Do not convert the integer into a string.
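One common approach, sketched below in Python, is to reverse the digits arithmetically and compare the result with the original number; negative numbers can’t be palindromes since the minus sign only appears at one end.

```python
def is_palindrome(n: int) -> bool:
    # Negative numbers aren't palindromes ("-121" reversed would be "121-").
    if n < 0:
        return False

    original, reversed_digits = n, 0
    while n > 0:
        # Peel off the last digit of n and append it to reversed_digits.
        reversed_digits = reversed_digits * 10 + n % 10
        n //= 10
    return original == reversed_digits


if __name__ == "__main__":
    print(is_palindrome(191))  # True
    print(is_palindrome(111))  # True
    print(is_palindrome(123))  # False
```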