One of the challenges we face when a database has multiple tables that are not related in the usual sense, but join together through differently named keys, is exposing that knowledge to an LLM. This requires either building a 360-degree view of the data by joining the tables into one fact table, or programming the relationships through prompt engineering by telling the LLM how the tables are connected. Below is the sample code that I’ve used to tackle this challenge.
Author: dineshmuruganc@gmail.com
Generative Models
Instead of trying to map 𝑃(𝑌|𝑋=𝑥) as discriminative models do (https://en.wikipedia.org/wiki/Discriminative_model), a generative model tries to find the joint distribution of 𝑋 and 𝑌, 𝑃(𝑋,𝑌), and then uses that distribution to obtain 𝑃(𝑌|𝑋=𝑥) or 𝑃(𝑋|𝑌=𝑦). I know what you are thinking: yes, it is possible to generate data from the model. A famous neural-network method for generating data (used, for example, to test models for self-driving cars) is the GAN (https://en.wikipedia.org/wiki/Generative_adversarial_network), which was derived from the idea of generative models.
I’m not going to explain generative models any better than Wikipedia (https://en.wikipedia.org/wiki/Generative_model) or the thousands of articles on the internet already do. Instead, I’m going to build a generative model and walk through how it captures the relationship between the data and the target variable in a classification problem.
The model I chose here is the Gaussian Naive Bayes (GNB) model. One reason I chose it is that there are good articles on the internet showing how GNB works, but I couldn’t find one that showed how the formulas are actually applied and why these are called generative models. A video that I liked is https://www.youtube.com/watch?v=r1in0YNetG8. Some articles that I liked are https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/ and https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/. The second article explains the formulas but with categorical data; here I use numerical data and the GNB formulas.
import pandas as pd
import numpy as np
I took the data from https://github.com/liquidcarrot/data.pima-indians-diabetes
ds = pd.read_csv("/Users/dineshkumarmurugan/Downloads/pima-indians-diabetes.data.csv",
                 names=["pregnancies","plasmaglucoseconcentration","diastolicbloodpressure","tricepsskinfoldthickness","insulin","bodymassindex","diabetespedigreefunction","age","diabetic"])
ds.head()
| pregnancies | plasmaglucoseconcentration | diastolicbloodpressure | tricepsskinfoldthickness | insulin | bodymassindex | diabetespedigreefunction | age | diabetic |
---|---|---|---|---|---|---|---|---|---
0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
ds.describe()
| pregnancies | plasmaglucoseconcentration | diastolicbloodpressure | tricepsskinfoldthickness | insulin | bodymassindex | diabetespedigreefunction | age | diabetic |
---|---|---|---|---|---|---|---|---|---
count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.884160 | 0.331329 | 11.760232 | 0.476951 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.000000 |
25% | 1.000000 | 99.000000 | 62.000000 | 0.000000 | 0.000000 | 27.300000 | 0.243750 | 24.000000 | 0.000000 |
50% | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 30.500000 | 32.000000 | 0.372500 | 29.000000 | 0.000000 |
75% | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 127.250000 | 36.600000 | 0.626250 | 41.000000 | 1.000000 |
max | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
Of course, when we hear the word Gaussian, mean and variance come with it. Here we treat each feature as independent (the whole idea behind a Naive Bayes model) and calculate the mean and standard deviation of each feature for each class.
def calculate_mean_std(x_var, var):
    return ({"1": np.mean(x_var[x_var["diabetic"] == 1][var]),
             "0": np.mean(x_var[x_var["diabetic"] == 0][var])},
            {"1": np.std(x_var[x_var["diabetic"] == 1][var]),
             "0": np.std(x_var[x_var["diabetic"] == 0][var])})
pregnancies_mean, pregnancies_std = calculate_mean_std(ds, "pregnancies")
plasmaglucoseconcentration_mean, plasmaglucoseconcentration_std = calculate_mean_std(ds, "plasmaglucoseconcentration")
diastolicbloodpressure_mean, diastolicbloodpressure_std = calculate_mean_std(ds, "diastolicbloodpressure")
tricepsskinfoldthickness_mean, tricepsskinfoldthickness_std = calculate_mean_std(ds, "tricepsskinfoldthickness")
insulin_mean, insulin_std = calculate_mean_std(ds, "insulin")
bodymassindex_mean, bodymassindex_std = calculate_mean_std(ds, "bodymassindex")
diabetespedigreefunction_mean, diabetespedigreefunction_std = calculate_mean_std(ds, "diabetespedigreefunction")
age_mean, age_std = calculate_mean_std(ds, "age")
ds["diabetic"].value_counts()
0    500
1    268
Name: diabetic, dtype: int64
Calculating the prior probability of each class:
prob_zero = 500/768
prob_one = 268/768
Calculating the probability for each feature:
%%latex \begin{equation*} P(V|1) = \frac{1} {\sqrt{2 * \pi * \sigma^2}} e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}} \end{equation*}
import math

def prob_one_var_one_class(var_mean, var_std, x):
    return 1 / math.sqrt(2 * math.pi * var_std ** 2) * math.exp(-0.5 * ((x - var_mean) ** 2) / (var_std ** 2))
prob_one_var_one_class(pregnancies_mean["1"],pregnancies_std["1"],6)
0.2659499185599781
prob_one_var_one_class(pregnancies_mean["0"],pregnancies_std["0"],6)
0.09769091102242611
prob_one_var_one_class(pregnancies_mean["1"],pregnancies_std["1"],1)
0.042722843214137975
prob_one_var_one_class(pregnancies_mean["0"],pregnancies_std["0"],1)
0.13657759789740978
Testing the per-feature probabilities against a few values shows they behave as we would expect from the data.
%%latex \begin{equation*} P(D|1) = P(V_1|1) * P(V_2|1) * P(V_3|1) * P(V_4|1) * P(V_5|1) * P(V_6|1) * P(V_7|1) * P(V_8|1) \end{equation*}
%%latex \begin{equation*} P(D|0) = P(V_1|0) * P(V_2|0) * P(V_3|0) * P(V_4|0) * P(V_5|0) * P(V_6|0) * P(V_7|0) * P(V_8|0) \end{equation*}
%%latex \begin{equation*} P(1|D) = \frac{P(D|1) * P(1)} {((P(D|1) * P(1)) + (P(D|0) * P(0)) )} \end{equation*}
%%latex \begin{equation*} P(0|D) = \frac{P(D|0) * P(0)} {((P(D|1) * P(1)) + (P(D|0) * P(0)) )} \end{equation*}
def calculate_overall_prob(x):
    prob_x_given_zero = prob_one_var_one_class(pregnancies_mean["0"], pregnancies_std["0"], x["pregnancies"]) * \
        prob_one_var_one_class(plasmaglucoseconcentration_mean["0"], plasmaglucoseconcentration_std["0"], x["plasmaglucoseconcentration"]) * \
        prob_one_var_one_class(diastolicbloodpressure_mean["0"], diastolicbloodpressure_std["0"], x["diastolicbloodpressure"]) * \
        prob_one_var_one_class(tricepsskinfoldthickness_mean["0"], tricepsskinfoldthickness_std["0"], x["tricepsskinfoldthickness"]) * \
        prob_one_var_one_class(insulin_mean["0"], insulin_std["0"], x["insulin"]) * \
        prob_one_var_one_class(bodymassindex_mean["0"], bodymassindex_std["0"], x["bodymassindex"]) * \
        prob_one_var_one_class(diabetespedigreefunction_mean["0"], diabetespedigreefunction_std["0"], x["diabetespedigreefunction"]) * \
        prob_one_var_one_class(age_mean["0"], age_std["0"], x["age"])
    prob_x_given_one = prob_one_var_one_class(pregnancies_mean["1"], pregnancies_std["1"], x["pregnancies"]) * \
        prob_one_var_one_class(plasmaglucoseconcentration_mean["1"], plasmaglucoseconcentration_std["1"], x["plasmaglucoseconcentration"]) * \
        prob_one_var_one_class(diastolicbloodpressure_mean["1"], diastolicbloodpressure_std["1"], x["diastolicbloodpressure"]) * \
        prob_one_var_one_class(tricepsskinfoldthickness_mean["1"], tricepsskinfoldthickness_std["1"], x["tricepsskinfoldthickness"]) * \
        prob_one_var_one_class(insulin_mean["1"], insulin_std["1"], x["insulin"]) * \
        prob_one_var_one_class(bodymassindex_mean["1"], bodymassindex_std["1"], x["bodymassindex"]) * \
        prob_one_var_one_class(diabetespedigreefunction_mean["1"], diabetespedigreefunction_std["1"], x["diabetespedigreefunction"]) * \
        prob_one_var_one_class(age_mean["1"], age_std["1"], x["age"])
    overall_prob_zero = (prob_x_given_zero * prob_zero) / ((prob_x_given_zero * prob_zero) + (prob_x_given_one * prob_one))
    overall_prob_one = (prob_x_given_one * prob_one) / ((prob_x_given_zero * prob_zero) + (prob_x_given_one * prob_one))
    return overall_prob_zero, overall_prob_one
calculate_overall_prob(ds.loc[2])
(1.869821016215942e-24, 1.0)
calculate_overall_prob(ds.loc[3])
(1.0, 7.518076945772211e-20)
We can see the posterior function applied to rows from the dataset, and it works as expected: the row at index 2 correctly gets a probability of 1 for class one (probabilities lie in the range 0 to 1), and the row at index 3 correctly gets a probability of 1 for class zero.
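As a sanity check (not part of the original walkthrough), the same calculation can be compared against scikit-learn's GaussianNB, which fits the per-class means and variances for us. This is only a sketch, assuming scikit-learn is available:
from sklearn.naive_bayes import GaussianNB
X = ds.drop(columns=["diabetic"])
y = ds["diabetic"]
gnb = GaussianNB()
gnb.fit(X, y)
# posterior probabilities for the two rows inspected above (indexes 2 and 3)
print(gnb.predict_proba(X.iloc[[2, 3]]))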
Probability of getting into a car accident
One day while driving I was thinking to myself: what is the probability of anyone getting into an accident? Hmm, I’m a data scientist, I should be able to get the numbers fairly easily. As soon as I reached my house I started analyzing the problem.
I thought to myself, let’s take the same approach we use on the job. The strategy I use is taken from the book “How to Solve It” by G. Pólya, in which he outlines the steps required to solve any problem, based on a mathematical method.
The first step is to figure out the unknown, get the data, and find the condition, checking whether it is redundant, insufficient, or contradictory. In my problem the unknown is the probability of me getting into an accident. For the data I used the NHTSA report (https://crashstats.nhtsa.dot.gov/Api/Public/ViewPublication/812783), which provides quarterly aggregates of the number of accidents, accidents per 100 million miles driven, total miles driven, and more. The last piece of the first step is the condition, and this made me think: what is the condition here? It could be that I keep driving every other weekend through all four quarters of the year (which I do). Every step here is iterative, so let’s go back and think about what else we are missing. We are missing the data about me, which needs to be translated into numbers. I drive every other weekend, say about 6 weekends in a quarter, which comes to around 4,000 miles per quarter on average.
Now we come to the second step: devise a plan. We need to find the connection between the data and the unknown, and consider auxiliary problems. One of the main action items in step two is to find an analogous problem and see whether it has been solved before, or whether parts of its solution can be reused here. The analogous problem that comes to mind is the insurance industry: they must have solved this, and there are probably solutions on the internet already. But out of personal interest in solving it myself, I decided not to “google” the solution. Apart from looking for an analogous problem, the main action item in step two is to devise the plan itself. Let’s see what type of distribution best describes this data. When there is a count and an interval, the first distribution that comes to mind is the Poisson distribution, which can give the probability of 0 events in a unit of time. This also leads us to the exponential distribution, which can tell us how much time is left until I meet with an accident at any given point in time; its memoryless property makes it ideal for this problem.
Now we go to the third step. Here we just have to find the event rate (an average per quarter) and apply the Poisson formula: the probability of zero events in one quarter is exp(-lambda), where lambda is the rate. I used the average number of accidents per quarter from the data, and the number comes to 0.99, which is the probability of me not getting into an accident.
And finally we are at the fourth step: examine the result, check the argument, and see whether we can obtain the result differently. My intuition says: why not find a population of people similar to me and compute the probability on that? We don’t have that data, so we need to assume a lot. OK, let’s take one quarter’s worth of 100,000,000 miles (in which about 1.1 accidents happen) and say that everyone in this set drives like me, about 4,000 miles in a quarter to see their girlfriend, which gives about 25,000 people. We round the accident count up to 2, since we can’t have 1.1 accidents and, given the importance of the problem, we should round up. So the odds of me getting into an accident come to about 1 in 12,500.
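A back-of-the-envelope version of this reasoning in Python; the rate and mileage figures are rough assumptions taken from the discussion above, not exact values from the NHTSA tables:
import math
rate_per_mile = 1.1 / 100_000_000    # assumed: roughly 1.1 crashes per 100 million vehicle miles
miles_per_quarter = 4_000            # assumed: my driving per quarter
lam = rate_per_mile * miles_per_quarter   # expected crashes per quarter for my driving
print(math.exp(-lam))                     # Poisson probability of zero accidents in a quarter (close to 1)
# Step-four sanity check: 100M miles / 4,000 miles per driver ~ 25,000 similar drivers;
# rounding the 1.1 crashes up to 2 gives odds of about 2 in 25,000, i.e. 1 in 12,500
print(2 / 25_000)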
The conclusion: I know the probability is low, but an accident is still possible.
Meta-analysis
- Introduction:
Meta-analysis is a systematic method for synthesizing the quantitative results of different empirical studies regarding the effect of an independent variable. In our analysis we have data from 7 different studies of corticosteroid therapy in premature labour and its effect on neonatal death. We want to apply a Bayesian meta-analysis to find out whether the treatment had a significant effect.
- Assumptions:
Here the effect we want to measure is the overall mean difference between the treatment group and the control group. The data is already well defined in terms of the PICO (Participants, Interventions, Comparisons, Outcome) framework, which is the first step in a meta-analysis.
The next step in a meta-analysis would be to go through bibliographical databases and read each study to decide whether to include it. Our assumption here is that this step is already completed, and we will use all 7 studies.
We assume that all measures are on the same scale, so effect-size measures across these studies are comparable. There are three possible assumptions about the data: i) no heterogeneity, ii) complete heterogeneity, and iii) exchangeability. We assume the exchangeable model, as it is a compromise between complete homogeneity and complete heterogeneity, and we assume there are no important covariates that might form the basis of a more complex model.
There are two approaches we could use: a fixed-effect or a random-effects model. Since the studies were conducted by different groups, we can assume that not all of them draw from the same population or use the same method of intervention, and we don’t have answers to questions such as what procedure was followed in each study. We assume that differences in samples, design, and conduct introduce statistical heterogeneity between studies beyond what can be attributed to random error within studies. In that case it is more reasonable to use a random-effects model, which assumes the studies are a sample from all possible studies and includes an additional variance component for the variation, or heterogeneity, between studies.
The next step is to look for different kinds of bias in the studies, such as selection bias and publication bias.
Fig 1. Funnel plot

We can use the funnel plot built from the data to see that there is possible publication bias (publication bias is one kind of reporting bias).
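For reference, a funnel plot of this kind can be sketched in a few lines of Python; this is only an illustration (the report's figures were produced separately), and the effect sizes and standard errors below are placeholders, not the study values:
import matplotlib.pyplot as plt
effect_sizes = [0.7, 1.0, 0.9, 0.6, 0.9, 1.1, 0.5]        # per-study log odds ratios (placeholders)
std_errors = [0.25, 0.45, 0.35, 0.30, 0.55, 0.50, 0.40]   # their standard errors (placeholders)
plt.scatter(effect_sizes, std_errors)
plt.gca().invert_yaxis()    # larger (more precise) studies sit at the top
plt.axvline(sum(effect_sizes) / len(effect_sizes), linestyle="--")   # rough pooled estimate
plt.xlabel("Effect size (log odds ratio)")
plt.ylabel("Standard error")
plt.title("Funnel plot")
plt.show()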
- Methods:
- The first level of the hierarchical model assumes Y_j ∼ N(θ_j, σ_j²).
- For the exchangeability model we specify a second level of variation, θ_j ∼ N(μ, τ²), with the θ_j's drawn from a normal distribution with hyperparameters μ and τ², the variation in the population-level effects. Each θ_j is a compromise between the data (Y_j) and the prior.
- The priors are μ ∼ N(0, 10⁵), which translates to a 95% certainty that the true value of μ lies between -1.96×316 and +1.96×316 (316 comes from √10⁵),
- and τ ∼ U(0, 10), a uniform prior on the between-study standard deviation, since it cannot be allowed to be negative.
Y_j (the observed log odds ratio for study j) and its within-study variance σ_j² are computed from each study's 2×2 table as
Y_j = log( a_j / (n_1j - a_j) ) - log( c_j / (n_0j - c_j) ),  σ_j² ≈ 1/a_j + 1/(n_1j - a_j) + 1/c_j + 1/(n_0j - c_j),
where a_j and c_j are the event counts and n_1j and n_0j the group sizes in the two arms (see the code in the appendix).
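As an illustration of this construction (mirroring the appendix code, but not part of the analysis itself), the observed log odds ratio and its approximate variance for a single study's 2×2 table can be computed as follows; the counts here are placeholders:
import math
def log_or_and_var(a, n1, c, n0):
    # a, c: event counts; n1, n0: group sizes in the two arms
    y = math.log(a / (n1 - a)) - math.log(c / (n0 - c))   # log odds ratio
    var = 1 / a + 1 / (n1 - a) + 1 / c + 1 / (n0 - c)     # Woolf variance approximation
    return y, var
y_j, var_j = log_or_and_var(a=10, n1=100, c=20, n0=100)   # placeholder counts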
To understand τ² (tau, the measure of between-study heterogeneity) and σ_j² (sigma, the variance within each study), let's look at the forest plot.

Fig 2. Forest plot.
We can see the point estimate of each study and the weight given to each study in estimating the overall effect. The diamond is the overall effect, and we see that it is heavily influenced by the Auckland study.
The x-axis forms the effect-size scale, plotted at the top of the plot. Each row except the bottom one represents a study's effect-size estimate in the form of a point and a 95% confidence interval. The confidence interval of the combined effect size in Fig 2 does not include zero, i.e., at a confidence level of 95% the p-value is smaller than .05. In traditional terminology, this means that the meta-analytic effect is statistically significant.
- Computation:
The model was fitted in R using the rjags package; the JAGS model code is given in the appendix. The models were run with the following specifications: burn-in = 10000, 2 chains, 10000 iterations per chain, and no thinning. The model converged well for both priors, as judged from the trace plots (see Fig 3).

The second part of the model is a simulation study to evaluate the frequentist Type I error of the method (code is given in the appendix). For each iteration the same configuration is used in the JAGS model (burn-in = 10000, 2 chains, 10000 iterations). A parameter point_est records, from the model summary of each simulated dataset, whether the experiment group is judged different from the control group.
Fig 3. Diagnostic plot for variable OR.
We also look at the DIC to see how well the model fits; lower values of DIC indicate a better fit. The DIC shows a mean deviance of 14.77 with a penalty of 2.932, which suggests a good fit. Gelman and Rubin's convergence diagnostics were used to test convergence, and all values were less than 1.2.
- Result:
Below is the result of the JAGS model. Although there are different outcomes of a meta-analysis we could look at, for our analysis we focus on the point estimate. The parameters d and delta[j] are on the logit (log odds) scale. OR is the odds ratio, on the positive scale; its credible interval stays above 1 (equivalently, the interval for d excludes 0), indicating that there is a treatment effect.
Variable Name | Mean | SD | Naive SE | Time-series SE |
---|---|---|---|---|
OR | 2.4912 | 1.2324 | 0.0087147 | 0.024413 |
d | 0.8289 | 0.3944 | 0.002789 | 0.008488 |
delta[1] | 0.6862 | 0.3697 | 0.0026141 | 0.007632 |
delta[2] | 1.0146 | 0.6033 | 0.0042661 | 0.012662 |
delta[3] | 0.9713 | 0.5153 | 0.003644 | 0.010262 |
delta[4] | 0.6393 | 0.4356 | 0.0030805 | 0.00825 |
delta[5] | 0.8779 | 0.5153 | 0.0036439 | 0.008662 |
delta[6] | 1.0523 | 0.6109 | 0.0043197 | 0.013568 |
delta[7] | 0.5495 | 0.4963 | 0.0035094 | 0.01056 |
prob.OR1 | 0.986 | 0.1175 | 0.0008308 | 0.001212 |
tau | 0.5297 | 0.4567 | 0.0032293 | 0.017661 |
Table 1. Results of the JAGS model.
We can also look at the parameter prob.OR1, which gives the probability of the synthesized estimate being above (or below) some value; here it is the probability that OR > 1, and it is close to 1.
The second part of the result is the simulation study. The proportion of iterations in which the model concludes that there is a difference between the treatment groups is calculated; the value obtained is 0.059, which is the Type I error of this method. This lies around the acceptable range, and it is the researcher's decision whether to accept this score.
- References:
- Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. Bayesian Data Analysis (third edition), Chapter 5, Section 6: Hierarchical modeling applied to a meta-analysis.
- Harrer, M., Cuijpers, P., & Ebert, D. D. (2019). Doing Meta-Analysis in R: A Hands-On Guide. https://bookdown.org/MathiasHarrer/Doing_Meta_Analysis_in_R/.
Appendix
JAGS Code:
library(rmeta)    # assumed source of the cochrane corticosteroid-trials dataset
library(rjags)
data(cochrane)
Nstud <- nrow(cochrane)    # number of studies
Y <- log(cochrane[,4]/(cochrane[,5]- cochrane[,4]))-log(cochrane[,2]/(cochrane[,3]- cochrane[,2]))
V <- sqrt((1/cochrane[,4]) + (1/(cochrane[,5]-cochrane[,4])) + (1/cochrane[,2]) + (1/(cochrane[,3]-cochrane[,2])))
model_string <- textConnection("model {
# Likelihood
for (j in 1:n) {
P[j] <- 1/V[j] # Calculate precision
Y[j] ~ dnorm(delta[j],P[j])
delta[j] ~ dnorm(d,prec)
}
# Priors
d ~ dnorm(0,1.0E-6)
prec <- 1/tau.sq
tau.sq <- tau*tau # tau.sq = between-study variance
tau ~ dunif(0,10) # Uniform on SD
OR<-exp(d) #exponentiate to get back to OR scale
prob.OR1<-step(d) #calc prob of OR > 1
}")
model <- jags.model(model_string, data = list(Y=Y, V=V, n=Nstud), n.chains=2, quiet=TRUE)
update(model, 10000, progress.bar="none")
params <- c("d", "tau", "OR", "prob.OR1", "delta")
samples <- coda.samples(model,
                        variable.names=params,
                        n.iter=10000, progress.bar="none")
Simulation Study:
nsims <- 1000 # Number of simulated datasets
m <- c(rep(0,nsims))
s_d <- c(rep(0,nsims))
t <- c(rep(0,nsims))
point_est <- c(rep(0,nsims))
library(rjags)
Y <- matrix(0, nrow = nsims, ncol = 7)
V <- matrix(0, nrow = nsims, ncol = 7)
for(sim in 1:nsims){
samp1 <- c(rep(0,7))
samp0 <- c(rep(0,7))
for(i in 1:7){
samp1[i] <- rbinom(1,cochrane[i,5],0.1) + 0.5
samp0[i] <- rbinom(1,cochrane[i,3],0.1) + 0.5
}
Y[sim,] <- log(samp1/(cochrane[,5]-samp1))-log(samp0/(cochrane[,3]-samp0))
V[sim,] <- sqrt((1/samp1) + (1/(cochrane[,5]-samp1)) + (1/samp0) + (1/(cochrane[,3]-samp0)))
model_string <- textConnection("model {
# Likelihood
for (j in 1:n) {
P[j] <- 1/V[j] # Calculate precision
Y[j] ~ dnorm(delta[j],P[j])
delta[j] ~ dnorm(d,prec)
}
# Priors
d ~ dnorm(0,1.0E-6)
prec <- 1/tau.sq
tau.sq <- tau*tau # tau.sq = between-study variance
tau ~ dunif(0,10) # Uniform on SD
OR<-exp(d) #exponentiate to get back to OR scale
prob.OR1<-step(d) #calc prob of OR > 1
}")
model <- jags.model(model_string, data = list(Y=Y[sim,], V=V[sim,], n=7), n.chains=2, quiet=TRUE)
update(model, 10000, progress.bar="none")
params <- c("d", "tau", "OR", "prob.OR1")
samples <- coda.samples(model,
                        variable.names=params,
                        n.iter=10000, progress.bar="none")
m[sim] <- summary(samples)[[1]][2,][1]
s_d[sim] <- summary(samples)[[1]][2,][2]
t[sim] <- summary(samples)[[1]][4,][1]
if(m[sim] - s_d[sim] < 0 && m[sim] + s_d[sim] > 0)
{
# When summed up we get Type I error
point_est[sim] = 1
}
}
Study of Middle School Students' Perceptions of Size and Scale
Introduction:
In this study we try to relate students' understanding of size and scale to their achievement in science and mathematics, by providing them extra teaching about size and scale.
Our objective is to find out whether extra teaching on size and scale influences students' understanding of these topics and gets them interested in STEM education. The main evidence supporting our findings comes from the tests designed by the researchers, evaluated with two metrics: comparing academic year-over-year performance of both groups involved in the experiment, and comparing year-over-year differences in progress between the control and experiment groups.
Background:
Students' dreams of a successful career generally start in school. To build a science career they need to choose science subjects, but that depends on how much interest they have developed toward science. So we became interested in students' perception of size and scale, and in how teaching those topics influences their understanding of size and scale and their achievement in science and mathematics.
Design:
The objective is to find out whether extra teaching on size and scale influences students' understanding of those topics. The experiment uses an experimental design technique called a randomized block design, chosen to remove a source of variability in selecting the students.
Students in the experimental group get an extra 150 hours of teaching about size and scale, but they do not receive the exact questions and answers that appear in their exams, while students in the control group do not receive any extra teaching.
Data Collection:
Data were collected from 150 students who entered Grade 6 in 2013 in middle school. Students were randomly assigned to the two groups: 90 students to the experiment group and 64 students to the control group.
The data consist of the scores obtained by students from both groups. Exams were conducted over 3 years, at the beginning and end of each academic year.
Two different types of exams were held. The first consisted of 26 abstract questions about the actual size of an object, and the second consisted of 31 pictured cards that needed to be sorted in relative order.
Two different grading systems were used. In the first, full credit was given to each answer that was correct or close to correct; in the second, full credit was given only to the right answer and half credit to an answer that was close. Data were labeled with student id, year, grading system, and whether the exam was at the beginning or end of the academic year (pre or post).
One problem the researchers failed to address in the data collection is missing values: there are many missing values for students in the control group in years 2 and 3, and the researchers did not specify whether there is a reason for them. This makes imputing the missing values a harder problem, so we cannot use any imputation method, and records with missing values will be dropped during data cleaning.
Hypothesis:
Here the null hypothesis is that extra teaching on size and scale neither improves students' interest in STEM education nor increases their scores on these exams. The alternative hypothesis is that there is a clear improvement in the scores of students in the experiment group, both year by year and across the three years of the experiment.
The researchers are also interested in how students' interest increases year by year; the two grading systems help us understand how close students were to the right answer and how many were confident about the right answer.
Data Analysis:
We will use three different tools to perform the analysis.
Linear Regression:
Linear regression determines the joint effect of independent variables on a dependent variable in the form
Y = β0 + β1X1 + β2X2 + … + βnXn + ε
with the following assumptions:
a) all n observations are independent;
b) E(εj) = 0, the mean of the error is 0;
c) Var(εj) = σ², the variance of the error is constant;
d) Cov(εj, εk) = 0, the error terms are uncorrelated.
Here our dependent variable is the post score, the one achieved at the end of the semester, and the independent variables are the pre score and which group the student belongs to.
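As a rough illustration (not the researchers' actual code), a model of this form could be fitted in Python with statsmodels; the data file and the column names (yr1_pre_std1, yr1_post_std1, Group) are assumptions based on the output shown later:
import pandas as pd
import statsmodels.formula.api as smf
df = pd.read_csv("scs_year1.csv")   # hypothetical file name
# post-semester score regressed on pre-semester score and group membership
model = smf.ols("yr1_post_std1 ~ yr1_pre_std1 + C(Group)", data=df).fit()
print(model.summary())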
Anova:
Compared to linear regression, ANOVA is the better tool for getting insight into a randomized block design. It helps us understand how much variance there is within each group, in our case how much the scores of students in the different groups vary, and it lets us compare the means of the two groups.
The ANOVA model specifies that the mean for a given population, μ_l, is a function of an overall mean μ and a population-specific effect τ_l:
μ_l = μ + τ_l
ANOVA is used to test the null hypothesis that the means of all populations are equal against the alternative that at least one mean differs from the others:
H0: μ_1 = μ_2 = … = μ_g  vs.  H1: μ_k ≠ μ_l for at least one pair (k, l)
A crossed-factor ANOVA is used to determine whether gender or ethnicity plays a role in the score:
y_ijk = m + a_i + b_j(i) + ε_ijk
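Continuing the sketch above (same assumed dataframe df and column names), a crossed-factor ANOVA table like the one reported later could be produced with statsmodels:
import statsmodels.api as sm
import statsmodels.formula.api as smf
fit = smf.ols("yr1_post_std1 ~ yr1_pre_std1 * C(Group) * C(Gender) * C(Ethniticy)", data=df).fit()
print(sm.stats.anova_lm(fit, typ=1))   # sequential (Type I) sums of squares, as in the R output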
Paired T Test:
A research design that can be used to investigate the effects of a single treatment is a paired-comparison study. We use a paired t-test to check whether any difference exists between pre- and post-semester scores. The data used here are the pre- and post-semester scores, and the null hypothesis is that there are no differences between them. Paired-comparison tests are needed to assess the efficacy of a treatment or multiple treatments.
1. Calculate the difference (d_i = y_i − x_i) between the two observations on each pair, making sure you distinguish between positive and negative differences.
2. Calculate the mean difference, d̄.
3. Calculate the standard deviation of the differences, s_d, and use it to calculate the standard error of the mean difference, SE(d̄) = s_d/√n.
4. Calculate the t-statistic, T = d̄ / SE(d̄). Under the null hypothesis, this statistic follows a t-distribution with n − 1 degrees of freedom.
5. Compare your value of T to the t distribution with n − 1 degrees of freedom; this gives the p-value for the paired t-test.
The assumptions of the paired t-test are:
1. The data are continuous (not discrete).
2. The data, i.e., the differences for the matched-pairs, follow a normal probability distribution.
3. The sample of pairs is a simple random sample from its population. Each individual in the population has an equal probability of being selected in the sample.
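A minimal sketch of the paired t-test in Python (scipy), again assuming the column names used in the output below:
from scipy import stats
paired = df[["yr1_pre_std1", "yr1_post_std1"]].dropna()
t_stat, p_value = stats.ttest_rel(paired["yr1_pre_std1"], paired["yr1_post_std1"])
print(t_stat, p_value)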
Summary and discussion:
The p-value from each of these techniques can be used to check the significance of the independent variables' effect on the final score.
In the linear regression:
Dependent variable: yr1_post_std1

| | (1) | (2) | (3) | (4) |
---|---|---|---|---
yr1_pre_std1 | 0.532*** (0.075) | 0.532*** (0.075) | 0.627*** (0.069) | 0.623*** (0.068) |
Group | 1.397 (1.368) | 0.958 (1.335) | 1.069 (1.406) | 0.658 (1.364) |
EthniticyAA | 2.899 (2.123) | 2.137 (2.045) | | |
EthniticyAF | -6.090 (5.814) | -7.328 (5.747) | | |
EthniticyAS | -1.326 (5.753) | -1.913 (5.719) | | |
EthniticyH | -1.284 (2.356) | -1.870 (2.294) | | |
EthniticyO | 6.505 (5.755) | 5.917 (5.722) | | |
EthniticyW | 5.157** (1.984) | 4.480** (1.883) | | |
GenderF | -3.444 (4.847) | | -0.585 (4.792) | |
GenderFF | -1.089 (9.251) | | 3.770 (9.349) | |
GenderM | -5.183 (4.880) | | -2.127 (4.779) | |
Constant | 11.160** (4.846) | 7.657*** (2.105) | 9.192* (4.944) | 8.112*** (1.791) |
Observations | 149 | 149 | 149 | 149 |
R2 | 0.437 | 0.426 | 0.372 | 0.364 |
Adjusted R2 | 0.391 | 0.393 | 0.350 | 0.356 |
Residual Std. Error | 7.825 (df = 137) | 7.812 (df = 140) | 8.086 (df = 143) | 8.050 (df = 146) |
F Statistic | 9.648*** (df = 11; 137) | 12.989*** (df = 8; 140) | 16.934*** (df = 5; 143) | 41.854*** (df = 2; 146) |

Note: *p<0.1; **p<0.05; ***p<0.01 (standard errors in parentheses)
This was done with the SCS dataset for year 1, comparing 4 different models. We can see that only yr1_pre_std1 (the pre-semester score) and EthniticyW have a significant effect on the post-semester score.
Looking at the adjusted R², we can conclude that the linear model does not fit the dataset well.
With ANOVA:
===================================================================
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
——————————————————————-
Sum Sq 5 2,524.234 3,491.530 63.855 156.016 3,051.787 8,387.801
Df 5 29.600 60.073 1 1 6 137
F value 4 13.589 24.184 0.849 0.995 14.425 49.846
Pr(> F) 4 0.199 0.229 0.000 0.015 0.349 0.469
——————————————————————-
===================================================================
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
——————————————————————-
Sum Sq 4 3,144.558 3,821.342 31.406 695.811 4,450.252 8,543.818
Df 4 37.000 68.707 1 1 39.5 140
F value 3 17.861 28.338 0.515 1.510 26.534 50.563
Pr(> F) 3 0.166 0.267 0.000 0.012 0.250 0.474
——————————————————————-
===================================================================
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
——————————————————————-
Sum Sq 4 3,736.445 4,518.546 37.793 93.137 6,422.547 9,349.512
Df 4 37.000 70.673 1 1 38 143
F value 3 28.152 47.768 0.569 0.573 41.944 83.310
Pr(> F) 3 0.362 0.327 0.000 0.224 0.542 0.636
——————————————————————-
====================================================================
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
——————————————————————–
Sum Sq 3 4,962.785 4,739.013 15.085 2,713.629 7,436.635 9,461.097
Df 3 49.333 83.716 1 1 73.5 146
F value 2 41.876 58.892 0.233 21.054 62.697 83.519
Pr(> F) 2 0.315 0.446 0.000 0.158 0.473 0.630
——————————————————————–
We can see that none of the models is significant enough to establish the relationship, which supports the adjusted R² from the linear models.
Crossed factor Anova
Df Sum Sq Mean Sq F value Pr(>F)
yr1_pre_std1 1 5409 5409 84.175 3.76e-15 ***
Group 1 15 15 0.235 0.6290
Gender 3 112 37 0.579 0.6302
Ethniticy 6 962 160 2.494 0.0268 *
yr1_pre_std1:std 1 5 5 0.071 0.7903
yr1_pre_std1:Gender 2 88 44 0.687 0.5054
std:Gender 2 74 37 0.575 0.5644
yr1_pre_std1:Ethniticy 6 164 27 0.427 0.8598
std:Ethniticy 3 108 36 0.558 0.6442
Gender:Ethniticy 3 87 29 0.450 0.7181
yr1_pre_std1:std:Gender 1 42 42 0.650 0.4218
yr1_pre_std1:std:Ethniticy 3 385 128 1.998 0.1187
yr1_pre_std1:Gender:Ethniticy 3 80 27 0.413 0.7442
std:Gender:Ethniticy 3 477 159 2.473 0.0656 .
yr1_pre_std1:std:Gender:Ethniticy 3 3 1 0.015 0.9974
Residuals 107 6876 64
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Again, the crossed-factor ANOVA supports the result from the linear regression: being in the control group or the experiment group does not affect the post-semester score.
We can also perform a paired t-test to check whether there is any significant difference between the pre- and post-semester scores.
Paired t-test
data: stem_ans_na_removed$yr1_pre_std1 and stem_ans_na_removed$yr1_post_std1
t = -1.4021, df = 148, p-value = 0.163
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.4417102 0.4148646
sample estimates:
mean of the differences
-1.013423
Based on the above output we cannot reject the null hypothesis; we have no evidence that the paired population means differ.
Let's now check whether there is a difference between the mean pre- and post-semester scores in the control group.
Paired t-test
data: stem_ans_na_removed_C_group$yr1_pre_std1 and stem_ans_na_removed_C_group$yr1_post_std1
t = -0.1132, df = 56, p-value = 0.9103
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.624080 2.343379
sample estimates:
mean of the differences
-0.1403509
We cannot reject the null hypothesis here either, which in our case means that the extra classes did not make a significant difference in the post-semester score.
To add to all these results, let's plot the interaction plot.

It shows no trend distinguishing the control group from the experiment group.
Concluding remarks:
Our findings suggest that there is no difference between students who took the extra classes and students who did not.
Further work is recommended to understand why there is no significant difference between pre- and post-semester scores; students should have gained some knowledge of size and scale irrespective of the group they belong to.
Cheat sheet for Data Science Pipeline
- Exploratory Data Analysis
- Building new Features
- Dimensionality reduction
- Removing Outliers
- Selecting Validation Metrics & Testing and Validation
- Cross Validation & Hyperparameter tuning
- Feature Selection.
Exploratory Data Analysis:
This is the very first step in the data science process. It is done to understand the dataset better: check its features and shape, validate an initial hypothesis, and get a preliminary idea about the next steps.
E.g.:
df = pd.read_csv("<file-location>")
df.describe()
# describe provides the count and a 5-point summary of the dataset
df.boxplot(return_type='axes')
# boxplot visualizes the data as box plots
df.quantile([0.1, 0.9])
# to get the 10th and 90th quantiles
df.<categorical variable name>.unique()
# to get the unique values of a categorical variable
pd.crosstab(df['<variable name>'], df['<variable name>'])
# crosstab creates a co-occurrence / similarity matrix
Building new features:
In cases where the features and the target variable are not really related, we can modify the input dataset by applying linear or nonlinear transformations to improve the accuracy of the system.
a). Normalize the input feature using Z-scores.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
b). Instead of Z-normalization, which is affected by outliers since it operates on the mean and standard deviation, we can use RobustScaler, which uses the median and the IQR (interquartile range) to scale each feature independently.
from sklearn.preprocessing import RobustScaler
scaler2 = RobustScaler()
df_r_scaled = scaler2.fit_transform(df)
Dimensionality Reduction:
A dataset may contain a large number of features, many of which may be unnecessary. Keeping only the interesting features not only makes the dataset more manageable but also helps the algorithms work better.
a). Covariance/Correlation Matrix
– The correlation matrix (computed below with np.corrcoef) gives us an idea about the pairwise linear correlation between features; we can remove variables that are highly correlated with one another. (Caution: it only captures linear relationships.)
import numpy as np
df_cov = np.corrcoef(df.data.T)
b). Principal Component Analysis
– PCA is a technique that helps define a smaller and more relevant set of features. It restructures the information in the dataset by aggregating as much of it as possible onto the initial vectors produced.
from sklearn.decomposition import PCA
v_pca = PCA(n_components=2)
df_pca2 = v_pca.fit_transform(df.data)
c). Other forms of PCA
- RandomizedPCA (for big data; in recent scikit-learn versions, PCA(svd_solver='randomized'))
- Kernel PCA (see the sketch below)
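A minimal Kernel PCA sketch (the kernel choice is illustrative):
from sklearn.decomposition import KernelPCA
v_kpca = KernelPCA(n_components=2, kernel='rbf')
df_kpca2 = v_kpca.fit_transform(df.data)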
d). Latent Factor Analysis
The overall idea is similar to PCA; it is used when a latent factor or construct is expected to be in the dataset.
from sklearn.decomposition import FactorAnalysis
v_fact = FactorAnalysis(n_components=2)
df_fact2 = v_fact.fit_transform(df.data)
e). Linear Discriminant Analysis
Strictly speaking LDA is a classifier, but it is often used for dimensionality reduction. Since it is a supervised approach, it requires the label set to optimize.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA   # sklearn.lda.LDA in older scikit-learn versions
v_lda = LDA(n_components=2)
df_lda2 = v_lda.fit_transform(df.data,df.target)
# here df.target is the label
f). Latent Semantical Analysis
LSA is applied to text after it has been processed by TfidfVectorizer or CountVectorizer. It applies SVD to the input dataset, producing sets of words usually associated with the same concept. LSA is used when the features are homogeneous.
from sklearn.feature_extraction.text import TfidfVectorizer
tf_vect = TfidfVectorizer()
word_freq = tf_vect.fit_transform(<text dataset>)
from sklearn.decomposition import TruncatedSVD
v_tsvd = TruncatedSVD(n_components=2)
v_tsvd.fit(word_freq)
Removing Outliers.
When a data point deviates markedly from the others in the sample, it is called an outlier.
The cause may be:
- The point represents a rare occurrence
- The point is clearly some kind of a mistake
- The point represents a usual occurrence of another distribution
a). Univariate outlier detection.
Univariate methods are based on EDA and visualization such as box plots.
Rules of thumb to keep in mind when chasing outliers by examining single variables:
- If the absolute Z-score of the point is above 3 standard deviations, the value has to be considered a suspect outlier.
- Values less than the 25th percentile minus 1.5×IQR, or greater than the 75th percentile plus 1.5×IQR (the IQR being the difference between the 75th and 25th percentiles), are suspect outliers.
from sklearn import preprocessing
df_norm = preprocessing.StandardScaler().fit_transform(df)
outlier_row, outlier_column = np.where(np.abs(df_norm)>3)
b). EllipticEnvelope
It is a function that tries to figure out the key parameters of your data's general distribution by assuming that the entire dataset is an expression of an underlying multivariate Gaussian distribution.
from sklearn.covariance import EllipticEnvelope
robust_covariance_set = EllipticEnvelope(contamination=.1).fit(df)
detection = robust_covariance_set.predict(df)
outlier = np.where(detection == -1)[0]
Some more methods include the one-class SVM; a minimal sketch follows.
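A one-class SVM sketch (the nu value, which bounds the fraction of points treated as outliers, is illustrative):
import numpy as np
from sklearn.svm import OneClassSVM
ocsvm = OneClassSVM(nu=0.1, kernel='rbf', gamma='scale')
detection = ocsvm.fit_predict(df)
outlier = np.where(detection == -1)[0]   # -1 marks points flagged as outliers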
Selecting Validation Metrics:
In order to measure the performance of the model we have built, we need a measure, or function, that scores the outcome.
a). Multilabel classification
- Confusion matrix
A table that gives us an idea of what the misclassifications are for each class.
from sklearn.metrics import confusion_matrix
v_cm = confusion_matrix(Y_test, Y_pred)
print(v_cm)
- Accuracy:
It is the proportion of predicted labels that are exactly equal to the real ones.
from sklearn import metrics
print('Accuracy: ', metrics.accuracy_score(Y_test, Y_pred))
- Precision
It counts the number of relevant results in the result set: the number of correct labels among the samples assigned to each class.
print('Precision: ', metrics.precision_score(Y_test, Y_pred))
- Recall
The number of correctly classified labels in a class divided by the total number of labels for that class.
print('Recall: ', metrics.recall_score(Y_test, Y_pred))
- F1 Score:
It is the harmonic mean of precision and recall. It is mostly used when dealing with unbalanced datasets.
print('F1 Score: ', metrics.f1_score(Y_test, Y_pred))
b). Binary Classification.
The main measures used are the ROC (Receiver Operating Characteristic) curve and the AUC (Area Under the Curve).
sklearn.metrics.roc_auc_score()
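For example, assuming a fitted binary classifier clf (a hypothetical name), the AUC is computed from the predicted probabilities of the positive class rather than from hard labels:
from sklearn.metrics import roc_auc_score
probs = clf.predict_proba(X_test)[:, 1]
print('AUC: ', roc_auc_score(Y_test, probs))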
c). Regression.
Many error measures are derived from Euclidean distance.
- MAE (Mean absolute error)
from sklearn.metrics import mean_absolute_error
mean_absolute_error(Y_test,Y_pred)
- MSE (Mean squared error)
from sklearn.metrics import mean_squared_error
mean_squared_error(Y_test,Y_pred)
- R^2 Score
R^2 is also known as coefficient of determination.
sklearn.metrics.r2_score
Testing and Validation:
When we use our entire dataset for model training, in extreme cases the model becomes overtrained, or too complex with respect to the available data. Memorized patterns prevail, and the algorithm becomes unable to predict new observations correctly.
- We can increase the sample size so that it becomes infeasible to store all information.
- We can use simpler machine learning algorithm which is less prone to memorization.
- We can use regularization to penalize extremely complex models.
In many cases fresh data is not available; a good approach then is to divide the initial data into a training set (usually 70 to 80 percent) and use the rest as the test set.
from sklearn.model_selection import train_test_split   # sklearn.cross_validation in older versions
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30)
More advanced methods include changing the random state of the split; random_state is a parameter of the train_test_split function.
Cross Validation:
From the previous step we know that both the train and the test set vary. Unfortunately this brings uncertainty, along with a reduction in the number of examples dedicated to training. A solution is cross-validation, which lets us use the training data for both model optimization and model training.
from sklearn.model_selection import cross_val_score   # sklearn.cross_validation in older versions
cross_val_score(model_h, X_train, Y_train, cv=10, scoring='accuracy')
Hyperparameter optimization:
A machine learning solution is not determined solely by the learning algorithm but also by its hyperparameters.
from sklearn.model_selection import GridSearchCV   # sklearn.grid_search in older versions
search_func = GridSearchCV(estimator=h, param_grid=search_grid, scoring='accuracy', cv=10)
# search_grid is the dict of parameter values for the algorithm to try
Feature Selection:
Irrelevant and redundant features may contribute to a lack of interpretability of the resulting model, to longer training times and, most importantly, to overfitting and poor generalization.
a). Selection based on feature variance.
Remove all features whose variance is small, typically lower than a set threshold.
from sklearn.feature_selection import VarianceThreshold
df_selected = VarianceThreshold(threshold=1.0).fit_transform(df)
b). Univariate Selection
Select single variables that are associated the most with the target variable. Three available tests to base our selection on:
- f_regression – uses F-Test and p-value
- f_classif – uses ANOVA F-test
- Chi2 – uses chi-squared test
from sklearn.feature_selection import chi2, SelectPercentile
v_selec = SelectPercentile(chi2,percentile=0.25).fit(df,Y)
# here df is transformed to binary values (chi2 requires non-negative features)
c). Recursive elimination
The problem with univariate selection is the likelihood of selecting a subset containing redundant information. Recursive elimination can help here; the code below first fits a baseline classifier, and a recursive-elimination sketch follows it.
from sklearn.linear_model import LogisticRegression
v_class = LogisticRegression(random_state=101)
v_class.fit(X_train,Y_train)
print('In-sample accuracy: %0.3f' % v_class.score(X_train, Y_train))
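A sketch of recursive feature elimination with cross-validation, reusing the classifier above (parameters are illustrative):
from sklearn.feature_selection import RFECV
selector = RFECV(estimator=v_class, step=1, cv=10, scoring='accuracy')
selector.fit(X_train, Y_train)
print('Optimal number of features: %d' % selector.n_features_)
print(selector.support_)   # boolean mask of the selected features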
d). Stability and L1-based selection:
Another way to solve the same problem is to use regularization to limit the weights of the coefficients, thus preventing overfitting and selecting the most relevant variables without losing predictive power.
from sklearn.linear_model import LogisticRegression
v_class = LogisticRegression(C=0.1, penalty='l1', solver='liblinear', random_state=101)
v_class.fit(X_train, Y_train)
print('Out-of-sample accuracy: %0.3f' % v_class.score(X_test, Y_test))
Anomaly Detection using Spark
One of the analytic use cases for data collected in a big data platform is finding anomalies. It serves multiple purposes: not only security checks on network data, but also data quality checks.
In an ideal scenario you know how many types of data are processed from a single source. But in our case, let's say we don't know how many different types of events are being processed in the data, or in analytical terms, we don't know how many clusters are within the dataset.
The first step in any anomaly detection system is to identify the number of clusters. Spark has a clustering algorithm called K-means; we can use batch K-means, or streaming K-means if our system is going to be stream based.
To find the right number of clusters we can use the elbow method (https://bl.ocks.org/rpgove/0060ff3b656618e9136b).
Below is the code to get the SSE value for each cluster size.
val rddVec = trainingDataframe.rdd.map(x => Vectors.dense(x.toSeq.map(_.toString.toDouble).toArray))
val clusters = KMeans.train(rddVec, 7, 10, 5, "random", 1234L)
clusters.save(spark.sparkContext, args(0))
val WSSSE = clusters.computeCost(rddVec)
println("Within Set Sum of Squared Errors = " + WSSSE)
Plot the SSE values to figure out where the elbow is. In the example above, 7 is the number of clusters, 10 is the maximum number of iterations, 5 is the number of runs, "random" is the initialization mode, and 1234L is the seed value.
There is another method to get the right number of clusters (it should be used only when building an anomaly detection system, since the number of clusters it picks will not reflect the actual number of event types in the data). In this method we calculate the distance of each element from the centre of its cluster and take the mean of all the distances; the number of clusters with the minimum mean distance is chosen. This tends to reduce the number of outliers or anomalies detected, at the cost of a larger number of clusters.
Below is the code to get the cluster size with this second method.
def distance(a:Vector,b:Vector):Double = {
math.sqrt(a.toArray.zip(b.toArray).map(p=>p._1-p._2).map(d => d * d).sum)
}
def distToCentroid(datum:Vector,model:KMeansModel):Double = {
val cluster = model.predict(datum)
val centroid = model.clusterCenters(cluster)
val d = distance(centroid, datum)
d
}
def clusteringScore(data:RDD[Vector],k:Int):Double= {
val kmeans = new KMeans
kmeans.setK(k)
val model = kmeans.run(data)
val dd:RDD[(Vector,Double)] = data.map(datum => (datum,distToCentroid(datum,model)))
import spark.sqlContext.implicits._
//dd.toDF().show()
val tlist = dd.top(5)(Ordering.by(_._2))
//println(dd.mean())
dd.values.mean()
}
val temp = (1 to 100 by 1).map(k => (k,clusteringScore(rddVec,k)))
temp.foreach(println)
The second step is to pick the threshold, the distance from the centroid beyond which a record is identified as an anomaly. We can calculate the distance for every record and build a histogram with a bucket count of our choice (depending on the data size) that spreads the data out as much as possible; from it we can identify the groups and pick a threshold that tells us where the anomalies start.
The code for the second step is given below.
val data_distance: RDD[Double] = rddVec.map(distToCentroid(_, clusters))
val his = data_distance.histogram(20)
println(his._1.toList + " " + his._2.toList)
The parameter clusters is the model created with the value of k identified in the step above.
We then settle on that threshold and use this same model if we are running a batch system, or supply the threshold and configuration to streaming K-means if we are running a streaming system.
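For completeness, here is a rough sketch of that final flagging step written against the PySpark API rather than Scala (the names rdd_vec and clusters, and the threshold value, are assumptions; the Scala code above remains the reference):
import numpy as np
# collect the centres locally so the distance check can run inside RDD lambdas
centres = [np.array(c) for c in clusters.clusterCenters]
def dist_to_nearest_centre(point):
    p = np.array(point)
    return min(float(np.linalg.norm(p - c)) for c in centres)
threshold = 100.0   # value read off the histogram step above; illustrative
anomalies = rdd_vec.filter(lambda v: dist_to_nearest_centre(v) > threshold)
print(anomalies.count())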
Internals of RDD
Before we dive into the internals of RDDs we need to know three major components:
– Execution Model
– Shuffle
– Cache
Execution Model:
Let's take a small program that gets the distinct names per first letter.
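The original post refers to this small program without showing it; a minimal PySpark sketch of such a job (purely illustrative, assuming an existing SparkContext sc) could look like this:
names = sc.parallelize(["Alice", "Anna", "Bob", "Bill", "Alice", "Carol"])
result = (names
          .map(lambda name: (name[0], name))   # key each name by its first letter
          .groupByKey()                        # shuffle: gather all names per letter
          .mapValues(lambda vs: set(vs))       # keep the distinct names per letter
          .collect())                          # action: triggers execution of the DAG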
The execution of the program depends on the data and the cluster we are running on.
The execution can be compared to Pig on Hadoop, which also uses lazy evaluation: Spark adds the transformations to a DAG of computation, and only when the driver requests some data does this DAG actually get executed.
One advantage of this is that Spark can make many optimization decisions after it has had a chance to look at the DAG in its entirety. This would not be possible if it executed everything as soon as it got it.
For example, if you executed every transformation eagerly, what would that mean? It means you would have to materialize that many intermediate datasets in memory. This is evidently not efficient; for one, it increases your GC costs. (You're really not interested in those intermediate results as such; they are just convenient abstractions while writing the program.) So what you do instead is tell Spark the eventual answer you're interested in, and it figures out the best way to get there.
1. Create DAG of RDDs to represent computation
2. Create logical execution plan for DAG
3. Schedule and execute individual tasks
With our example, let's see how the RDDs are created. Each step involved in creating the final RDD is shown: we have map, groupBy, mapValues and collect. The next step is to create a logical plan.
When creating the logical plan, operations that are pipelined and have no dependency on other partitions should be grouped together, while the steps that depend on other partitions because of a shuffle (triggered by groupBy in our example) have to be separated into different stages.
Now that the stages have been laid out, they have to be executed as tasks.
In this step the stages are executed one by one. Each stage is split into multiple tasks depending on the data size and the resources available; a task is a combination of data and computation. Let's say our dataset consists of 4 different files stored in 4 different HDFS blocks, so when the data is loaded into memory for the first map operation it is loaded into 4 partitions (the default number of partitions of an RDD is the number of HDFS blocks the data resides in).
A task is a bundle of work: in stage 1 of our code, the task is to load an HDFS block into memory and do the map operation.
Each task here is totally independent of the others.
Now let's see how the tasks are executed. For this example, let's say we have one core per machine and the data partitions reside on 3 different machines.
The driver program has to consider data locality so that it can launch each task on the node where the data resides. And since the driver program runs in a single thread, there is a slight delay in starting consecutive tasks, although they are executed on different nodes.
The next step is the shuffle stage, which shuffles the data between partitions and prepares it for the next stage.
The bottom of the first stage is the map() step, which had four partitions, and the top of the second stage is the groupBy, which also has four partitions (by default the number stays the same, which is why it is 4 to 4). Partitioning is done by simply hashing keys into buckets, so that when we get to groupBy we know where to look for each value.