Generative Models

Instead of trying to map P(Y|X=x) directly, as discriminative models (https://en.wikipedia.org/wiki/Discriminative_model) do, a generative model tries to find the joint distribution of X and Y – P(X,Y) – and then uses that distribution to derive P(Y|X=x) or P(X|Y=y). I know what you are thinking: yes, it is possible to generate data from such a model. A famous method used in neural networks to generate data for testing models in self-driving cars is the GAN (https://en.wikipedia.org/wiki/Generative_adversarial_network), which grew out of the idea of generative models.

I'm not going to explain generative models in more depth than Wikipedia (https://en.wikipedia.org/wiki/Generative_model) or the thousands of articles on the internet already do. Instead, I'm going to build a generative model and walk through how it captures the relationship between the data and the target variable in a classification problem.

The model I chose here is the Gaussian Naive Bayes (GNB) model. One reason I chose it is that there are good articles on the internet showing how GNB works, but I couldn't find one that showed how the formulas are actually applied and why these are called generative models. The video I liked is https://www.youtube.com/watch?v=r1in0YNetG8. Some articles I liked are https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/ and https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/. The second article gives the explanation with formulas, but for categorical data. Here I use numerical data and the formulas from GNB.

import pandas as pd
import numpy as np

I took the data from https://github.com/liquidcarrot/data.pima-indians-diabetes

ds = pd.read_csv("/Users/dineshkumarmurugan/Downloads/pima-indians-diabetes.data.csv",names=["pregnancies","plasmaglucoseconcentration","diastolicbloodpressure","tricepsskinfoldthickness","insulin","bodymassindex","diabetespedigreefunction","age","diabetic"])
ds.head()
   pregnancies  plasmaglucoseconcentration  diastolicbloodpressure  tricepsskinfoldthickness  insulin  bodymassindex  diabetespedigreefunction  age  diabetic
0            6                         148                      72                        35        0           33.6                     0.627   50         1
1            1                          85                      66                        29        0           26.6                     0.351   31         0
2            8                         183                      64                         0        0           23.3                     0.672   32         1
3            1                          89                      66                        23       94           28.1                     0.167   21         0
4            0                         137                      40                        35      168           43.1                     2.288   33         1
ds.describe()
       pregnancies  plasmaglucoseconcentration  diastolicbloodpressure  tricepsskinfoldthickness     insulin  bodymassindex  diabetespedigreefunction         age    diabetic
count   768.000000                  768.000000              768.000000                768.000000  768.000000     768.000000                768.000000  768.000000  768.000000
mean      3.845052                  120.894531               69.105469                 20.536458   79.799479      31.992578                  0.471876   33.240885    0.348958
std       3.369578                   31.972618               19.355807                 15.952218  115.244002       7.884160                  0.331329   11.760232    0.476951
min       0.000000                    0.000000                0.000000                  0.000000    0.000000       0.000000                  0.078000   21.000000    0.000000
25%       1.000000                   99.000000               62.000000                  0.000000    0.000000      27.300000                  0.243750   24.000000    0.000000
50%       3.000000                  117.000000               72.000000                 23.000000   30.500000      32.000000                  0.372500   29.000000    0.000000
75%       6.000000                  140.250000               80.000000                 32.000000  127.250000      36.600000                  0.626250   41.000000    1.000000
max      17.000000                  199.000000              122.000000                 99.000000  846.000000      67.100000                  2.420000   81.000000    1.000000

Of course, when we hear the word Gaussian, mean and variance come with it. Here we treat each feature as independent (the whole idea behind the Naive Bayes model) and calculate the mean and standard deviation of each feature, separately for each class.

def calculate_mean_std(x_var, var):
    # Per-class mean and standard deviation of one feature
    # (np.std defaults to the population standard deviation, ddof=0)
    return ({"1": np.mean(x_var[x_var["diabetic"] == 1][var]), "0": np.mean(x_var[x_var["diabetic"] == 0][var])},
            {"1": np.std(x_var[x_var["diabetic"] == 1][var]), "0": np.std(x_var[x_var["diabetic"] == 0][var])})
pregnancies_mean, pregnancies_std = calculate_mean_std(ds, "pregnancies")
plasmaglucoseconcentration_mean, plasmaglucoseconcentration_std = calculate_mean_std(ds, "plasmaglucoseconcentration")
diastolicbloodpressure_mean, diastolicbloodpressure_std = calculate_mean_std(ds, "diastolicbloodpressure")
tricepsskinfoldthickness_mean, tricepsskinfoldthickness_std = calculate_mean_std(ds, "tricepsskinfoldthickness")
insulin_mean, insulin_std = calculate_mean_std(ds, "insulin")
bodymassindex_mean, bodymassindex_std = calculate_mean_std(ds, "bodymassindex")
diabetespedigreefunction_mean, diabetespedigreefunction_std = calculate_mean_std(ds, "diabetespedigreefunction")
age_mean, age_std = calculate_mean_std(ds, "age")
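As a side note, pandas can compute the same per-class statistics for every feature in a single groupby call. A minimal sketch with a toy frame standing in for the real dataset (the real frame has all eight features); note that pandas' `std` defaults to the sample estimate (ddof=1), so we pass ddof=0 to match np.std above:

```python
import pandas as pd

# Toy stand-in for ds with a single feature
demo = pd.DataFrame({"age": [50, 31, 32, 21, 40, 25],
                     "diabetic": [1, 0, 1, 0, 1, 0]})

# One call gives the per-class mean and population std for the feature
stats = demo.groupby("diabetic")["age"].agg(["mean", lambda s: s.std(ddof=0)])
stats.columns = ["mean", "std"]
print(stats)
```

With the real frame you would drop the column selection and aggregate all features at once.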
ds["diabetic"].value_counts()
0    500
1    268
Name: diabetic, dtype: int64

Calculating the prior probability of each class:

prob_zero = 500/768
prob_one = 268/768
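The priors are hardcoded above from the value_counts output; they can equally be computed from the frame itself so the code keeps working if the data changes. A minimal sketch, using a toy frame as a stand-in for the real dataset:

```python
import pandas as pd

# Toy stand-in; with the real ds this reproduces prob_zero = 500/768
demo = pd.DataFrame({"diabetic": [0] * 5 + [1] * 3})

# normalize=True turns the class counts directly into prior probabilities
priors = demo["diabetic"].value_counts(normalize=True)
prob_zero_demo, prob_one_demo = priors[0], priors[1]
print(prob_zero_demo, prob_one_demo)
```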

Calculating the class-conditional probability of each feature value using the Gaussian density:

%%latex 
\begin{equation*}

P(V|1) = \frac{1} {\sqrt{2 * \pi * \sigma^2}} e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}}

\end{equation*}



import math
def prob_one_var_one_class(var_mean, var_stan, x):
    # Gaussian density: (1 / sqrt(2*pi*sigma^2)) * exp(-(x - mu)^2 / (2*sigma^2))
    return 1 / math.sqrt(2 * math.pi * var_stan ** 2) * math.exp(-0.5 * pow(x - var_mean, 2) / var_stan ** 2)
# A patient with 6 pregnancies, evaluated under each class
prob_one_var_one_class(pregnancies_mean["1"], pregnancies_std["1"], 6)
prob_one_var_one_class(pregnancies_mean["0"], pregnancies_std["0"], 6)
# A patient with 1 pregnancy, evaluated under each class
prob_one_var_one_class(pregnancies_mean["1"], pregnancies_std["1"], 1)
prob_one_var_one_class(pregnancies_mean["0"], pregnancies_std["0"], 1)

Evaluating the per-feature densities on sample values shows they resonate with the data: diabetic patients in this dataset average more pregnancies, so a higher pregnancy count yields a higher density under class one, and a lower count under class zero.

%%latex 
\begin{equation*}
P(D|1)   = P(V_1|1) * P(V_2|1) * P(V_3|1) * P(V_4|1) * P(V_5|1) * P(V_6|1) * P(V_7|1) * P(V_8|1)
\end{equation*}


%%latex 
\begin{equation*}
P(D|0)   = P(V_1|0) * P(V_2|0) * P(V_3|0) * P(V_4|0) * P(V_5|0) * P(V_6|0) * P(V_7|0) * P(V_8|0)
\end{equation*}


%%latex 
\begin{equation*}
P(1|D)   = \frac{P(D|1) * P(1)}  {((P(D|1) * P(1)) + (P(D|0) * P(0)) )}
\end{equation*}


%%latex 
\begin{equation*}
P(0|D)   = \frac{P(D|0) * P(0)}  {((P(D|1) * P(1)) + (P(D|0) * P(0)) )}
\end{equation*}


def calculate_overall_prob(x):
    prob_x_given_zero = prob_one_var_one_class(pregnancies_mean["0"],pregnancies_std["0"],x["pregnancies"]) *\
    prob_one_var_one_class(plasmaglucoseconcentration_mean["0"],plasmaglucoseconcentration_std["0"],x["plasmaglucoseconcentration"]) * \
    prob_one_var_one_class(diastolicbloodpressure_mean["0"],diastolicbloodpressure_std["0"],x["diastolicbloodpressure"]) * \
    prob_one_var_one_class(tricepsskinfoldthickness_mean["0"],tricepsskinfoldthickness_std["0"],x["tricepsskinfoldthickness"]) * \
    prob_one_var_one_class(insulin_mean["0"],insulin_std["0"],x["insulin"]) * \
    prob_one_var_one_class(bodymassindex_mean["0"],bodymassindex_std["0"],x["bodymassindex"]) * \
    prob_one_var_one_class(diabetespedigreefunction_mean["0"],diabetespedigreefunction_std["0"],x["diabetespedigreefunction"]) * \
    prob_one_var_one_class(age_mean["0"],age_std["0"],x["age"])
    
    
    prob_x_given_one = prob_one_var_one_class(pregnancies_mean["1"],pregnancies_std["1"],x["pregnancies"]) *\
    prob_one_var_one_class(plasmaglucoseconcentration_mean["1"],plasmaglucoseconcentration_std["1"],x["plasmaglucoseconcentration"]) * \
    prob_one_var_one_class(diastolicbloodpressure_mean["1"],diastolicbloodpressure_std["1"],x["diastolicbloodpressure"]) * \
    prob_one_var_one_class(tricepsskinfoldthickness_mean["1"],tricepsskinfoldthickness_std["1"],x["tricepsskinfoldthickness"]) * \
    prob_one_var_one_class(insulin_mean["1"],insulin_std["1"],x["insulin"]) * \
    prob_one_var_one_class(bodymassindex_mean["1"],bodymassindex_std["1"],x["bodymassindex"]) * \
    prob_one_var_one_class(diabetespedigreefunction_mean["1"],diabetespedigreefunction_std["1"],x["diabetespedigreefunction"]) * \
    prob_one_var_one_class(age_mean["1"],age_std["1"],x["age"])
    
    overall_prob_zero = (prob_x_given_zero * prob_zero) / ((prob_x_given_zero * prob_zero) + (prob_x_given_one * prob_one))
    overall_prob_one = (prob_x_given_one * prob_one) / ((prob_x_given_zero * prob_zero) + (prob_x_given_one * prob_one))

    return overall_prob_zero,overall_prob_one
calculate_overall_prob(ds.loc[2])  # row 2 is diabetic: posterior mass lands on class one
calculate_overall_prob(ds.loc[3])  # row 3 is not: posterior mass lands on class zero
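Notice how small the losing class's numerator is: a product of eight densities, each well below 1, shrinks fast, and with more features it can underflow to zero entirely. A common remedy is to sum log-densities and normalize with log-sum-exp. A minimal sketch with two hypothetical features; the means, standard deviations, and feature values below are illustrative, not the fitted values from above:

```python
import math

def log_gaussian_pdf(mu, sigma, x):
    # Log of the Gaussian density; summing logs avoids multiplying tiny numbers
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - 0.5 * (x - mu) ** 2 / sigma ** 2

# One hypothetical row with two features, and illustrative per-class (mu, sigma)
row = {"f1": 6.0, "f2": 183.0}
params = {"1": {"f1": (4.9, 3.7), "f2": (141.0, 32.0)},
          "0": {"f1": (3.3, 3.0), "f2": (110.0, 26.0)}}
log_prior = {"1": math.log(268 / 768), "0": math.log(500 / 768)}

# Log-numerator of Bayes' rule: log prior + sum of per-feature log-densities
log_post = {c: log_prior[c] + sum(log_gaussian_pdf(*params[c][f], row[f]) for f in row)
            for c in ("1", "0")}

# Log-sum-exp normalization back to probabilities that sum to 1
m = max(log_post.values())
z = sum(math.exp(v - m) for v in log_post.values())
posterior = {c: math.exp(v - m) / z for c, v in log_post.items()}
```

The subtraction of the maximum before exponentiating is what keeps the arithmetic in a safe range.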

We can see the probability function applied to rows from the dataset behaves as expected: the row at index 2 is correctly assigned essentially all of its posterior probability (in the range of 0 to 1) to class one, and the row at index 3 to class zero.
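Finally, the "generative" part of the name: because the model stores P(X|Y) as per-class Gaussians, the same fitted parameters can be run in reverse to synthesize plausible feature values for a chosen class. A minimal sketch with illustrative (not fitted) parameters for two features; with the real mean and std dictionaries computed above, you would plug those in instead:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative per-class (mean, std) per feature; stand-ins for the fitted values
params = {"plasmaglucoseconcentration": {"1": (141.0, 32.0), "0": (110.0, 26.0)},
          "bodymassindex": {"1": (35.1, 6.5), "0": (30.3, 7.7)}}

def generate_patients(cls, n):
    # Draw each feature independently from its class-conditional Gaussian --
    # the same independence assumption the classifier makes
    return {feat: rng.normal(*by_class[cls], size=n)
            for feat, by_class in params.items()}

synthetic_diabetic = generate_patients("1", 5)
```

This is the sense in which GNB is generative: the learned P(X,Y) lets us both classify and sample, which a purely discriminative P(Y|X) model cannot do.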
