Internals of RDD

Before we dive into the internals of RDDs, we need to know three major components:

– Execution Model

– Shuffle

– Cache

Execution Model

Let's take a small program that gets the distinct names per first letter.

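A minimal sketch of such a program in Scala could look like the following, assuming the names are read one per line from a hypothetical HDFS path hdfs:/names:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DistinctNamesPerLetter {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("distinct-names-per-letter"))

    val result = sc.textFile("hdfs:/names")            // one name per line (hypothetical path)
      .map(name => (name.charAt(0), name))             // key each name by its first letter
      .groupByKey()                                    // gather all names for each letter
      .mapValues(names => names.toSet)                 // keep only the distinct names
      .collect()                                       // action: bring the result to the driver

    result.foreach(println)
    sc.stop()
  }
}
```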

How the program executes depends on the data and the cluster that we are running on.

The execution model can be compared to Pig in Hadoop, which also uses lazy evaluation: Spark adds transformations to a DAG of computation, and only when the driver requests some data does this DAG actually get executed.

One advantage of this is that Spark can make many optimization decisions after it has had a chance to look at the DAG in its entirety. This would not be possible if it executed everything as soon as it got it.

For example, if you executed every transformation eagerly, what would that mean? It would mean materializing that many intermediate datasets in memory, which is evidently not efficient; for one, it will increase your GC costs. (You're really not interested in those intermediate results as such; they're just convenient abstractions while writing the program.) So what you do instead is tell Spark the eventual answer you're interested in, and it figures out the best way to get there.
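As a quick illustration of this laziness (a sketch, reusing the sc from above and a deliberately non-existent path), the transformations return immediately and no data is touched until an action runs:

```scala
// Nothing is read or computed here: these calls only record lineage in the DAG.
val keyed = sc.textFile("hdfs:/no-such-file")          // returns instantly, even for a bad path
  .map(name => (name.charAt(0), name))

// Only the action launches a job; this is also where a missing input path
// would typically surface as an error.
val totals = keyed.groupByKey().mapValues(_.size).count()
```

Overall, the execution breaks down into three steps: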

1. Create a DAG of RDDs to represent the computation

2. Create a logical execution plan for the DAG

3. Schedule and execute individual tasks

With our example, let's see how the RDDs are created.

Creating the final RDD involves a chain of transformations: map, groupByKey, and mapValues, followed by the collect action; each step is sketched below.
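To make that chain explicit, the same pipeline can be written with each intermediate RDD named (a sketch, reusing the hypothetical input from above); every transformation returns a new RDD that remembers its parent:

```scala
val names         = sc.textFile("hdfs:/names")          // backed by a HadoopRDD
val keyed         = names.map(n => (n.charAt(0), n))    // MapPartitionsRDD
val grouped       = keyed.groupByKey()                  // ShuffledRDD, i.e. a shuffle dependency
val distinctNames = grouped.mapValues(_.toSet)          // MapPartitionsRDD
val result        = distinctNames.collect()             // the action that triggers the whole DAG
```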

The next step is to create a logical plan. The main thing to consider here is that operations which can be pipelined, i.e. where each partition can be computed without any data from other partitions, should be grouped together into one stage.

Steps that do depend on other partitions because of a shuffle (which is triggered by groupByKey in our example) have to be separated into different stages.
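One way to see this split without running anything is RDD.toDebugString, which prints the lineage and indents it at shuffle boundaries (continuing from the distinctNames RDD in the sketch above):

```scala
// The printed lineage changes indentation at the groupByKey, i.e. at the
// shuffle boundary where the plan is cut into two stages.
println(distinctNames.toDebugString)
```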

Now that the stages have been laid out, they have to be executed as tasks.

In this step the stages are executed one by one. Each stage is split into multiple tasks depending on the data size and the resources available. A task is a combination of data and computation. Let's say our dataset consists of 4 different files, so it is stored in 4 different blocks in HDFS. When the data is loaded into memory for the first map operation, it is loaded into 4 partitions (by default an RDD gets one partition per HDFS block the data resides in).
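That default is easy to check (a sketch, assuming the same hypothetical 4-block input):

```scala
val names = sc.textFile("hdfs:/names")   // stored as 4 HDFS blocks in this example
println(names.getNumPartitions)          // 4, i.e. one partition per block by default
```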

A task is a bundle of work: in stage 1 of our code, a task loads one HDFS block into memory and performs the map operation on it.

The tasks here are totally independent of each other.

Now let's see how the tasks are executed. For this example, let's say we have one core per machine and the data partitions reside on 3 different machines.

One thing the driver program has to consider is data locality, so that it can launch each task on the node where its data resides. And since the driver program schedules from a single thread, there is a slight delay between starting consecutive tasks, even though they run on different nodes.
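Locality scheduling is automatic, but how long the scheduler waits for a data-local slot before falling back to a less local one is configurable through the standard spark.locality.wait setting (the value below is only illustrative):

```scala
import org.apache.spark.SparkConf

// Wait up to 3 seconds for a free slot on a node that holds the data
// before launching the task somewhere less local.
val conf = new SparkConf()
  .setAppName("locality-demo")
  .set("spark.locality.wait", "3s")
```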

The next step is the shuffle, which redistributes the data between partitions and prepares it for the next stage.

The bottom of the first stage is the map() step, which had four partitions, and the top of the second stage is the groupByKey, which also has four partitions (by default the number of partitions stays the same, which is why it is 4 to 4). Partitioning is done by simply hashing keys into buckets: each key is hashed into one bucket so that when we get to the groupByKey we know where to look for each value.
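In code this is just the default HashPartitioner at work; a small sketch (reusing keyed from the earlier lineage example) showing the shuffle keeping four partitions unless told otherwise:

```scala
val grouped = keyed.groupByKey()          // hash-partitioned; reuses the parent's 4 partitions
println(grouped.partitioner)              // Some(HashPartitioner)
println(grouped.getNumPartitions)         // 4 (when spark.default.parallelism is not set)

// The number of shuffle partitions can also be chosen explicitly:
val grouped8 = keyed.groupByKey(8)
```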