Instead of trying to map P(Y|X=x) directly like discriminative models (https://en.wikipedia.org/wiki/Discriminative_model) do, a generative model tries to find the joint distribution of X and Y – P(X,Y) – and then uses that distribution to derive P(Y|X=x) or P(X|Y=y). I know what you are thinking: yes, it is possible to generate data from such a model. A famous method used with neural networks to generate data, for example to test models used in self-driving cars, is the GAN (https://en.wikipedia.org/wiki/Generative_adversarial_network), which grew out of the idea of generative models.
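To make that concrete, here is a tiny, made-up joint distribution over a binary feature X and a binary label Y (purely illustrative, not taken from the diabetes data used below), showing how both conditionals fall out of P(X, Y) by normalizing a row or a column:

import numpy as np

# Hypothetical joint distribution P(X, Y) over a binary feature X (rows)
# and a binary label Y (columns); the four cells sum to 1.
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])

# P(Y | X = 0): take the X = 0 row and renormalize it.
p_y_given_x0 = joint[0] / joint[0].sum()          # -> [0.75, 0.25]

# P(X | Y = 1): take the Y = 1 column and renormalize it.
p_x_given_y1 = joint[:, 1] / joint[:, 1].sum()    # -> [0.2, 0.8]

print(p_y_given_x0, p_x_given_y1)

Once you have the joint distribution, every conditional you might want is just a normalization away; that is the essence of a generative model.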
I’m not going to explain generative models any better than Wikipedia (https://en.wikipedia.org/wiki/Generative_model) or the thousands of articles on the internet already do. Instead, I’m going to build a generative model and walk through how it captures the relationship between the data and the target variable in a classification problem.
The model I chose here is the Gaussian Naive Bayes (GNB) model. One reason I chose it is that there are good articles on the internet showing how GNB works, but I couldn’t find one that showed how its formulas are actually applied and why it is called a generative model. The video I liked is https://www.youtube.com/watch?v=r1in0YNetG8. Some articles I liked are https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/ and https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/. The second article explains the formulas, but with categorical data. Here I use numerical data and the GNB formulas.
import pandas as pd
import numpy as np
I took the data from https://github.com/liquidcarrot/data.pima-indians-diabetes
ds = pd.read_csv("/Users/dineshkumarmurugan/Downloads/pima-indians-diabetes.data.csv",
                 names=["pregnancies","plasmaglucoseconcentration","diastolicbloodpressure",
                        "tricepsskinfoldthickness","insulin","bodymassindex",
                        "diabetespedigreefunction","age","diabetic"])
ds.head()
| | pregnancies | plasmaglucoseconcentration | diastolicbloodpressure | tricepsskinfoldthickness | insulin | bodymassindex | diabetespedigreefunction | age | diabetic |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
ds.describe()
| | pregnancies | plasmaglucoseconcentration | diastolicbloodpressure | tricepsskinfoldthickness | insulin | bodymassindex | diabetespedigreefunction | age | diabetic |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.884160 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 99.000000 | 62.000000 | 0.000000 | 0.000000 | 27.300000 | 0.243750 | 24.000000 | 0.000000 |
| 50% | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 30.500000 | 32.000000 | 0.372500 | 29.000000 | 0.000000 |
| 75% | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 127.250000 | 36.600000 | 0.626250 | 41.000000 | 1.000000 |
| max | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
Of course, when we hear the word Gaussian, mean and variance come with it. Here we treat each feature as independent (which is the whole idea behind the Naive Bayes model) and calculate the mean and standard deviation of each feature for each class.
def calculate_mean_std(x_var, var):
    # Mean and standard deviation of one feature, split by class (diabetic = 1 / 0).
    return ({"1": np.mean(x_var[x_var["diabetic"] == 1][var]),
             "0": np.mean(x_var[x_var["diabetic"] == 0][var])},
            {"1": np.std(x_var[x_var["diabetic"] == 1][var]),
             "0": np.std(x_var[x_var["diabetic"] == 0][var])})
pregnancies_mean, pregnancies_std = calculate_mean_std(ds, "pregnancies")
plasmaglucoseconcentration_mean, plasmaglucoseconcentration_std = calculate_mean_std(ds, "plasmaglucoseconcentration")
diastolicbloodpressure_mean, diastolicbloodpressure_std = calculate_mean_std(ds, "diastolicbloodpressure")
tricepsskinfoldthickness_mean, tricepsskinfoldthickness_std = calculate_mean_std(ds, "tricepsskinfoldthickness")
insulin_mean, insulin_std = calculate_mean_std(ds, "insulin")
bodymassindex_mean, bodymassindex_std = calculate_mean_std(ds, "bodymassindex")
diabetespedigreefunction_mean, diabetespedigreefunction_std = calculate_mean_std(ds, "diabetespedigreefunction")
age_mean, age_std = calculate_mean_std(ds, "age")
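As a side note, the same per-class statistics can be computed in one shot with a pandas groupby. This is only an equivalent sketch of the calculation above and is not used later in the post (np.std in calculate_mean_std uses ddof=0, so ddof=0 is passed here to match):

# Equivalent per-class summary of every feature using pandas.
class_means = ds.groupby("diabetic").mean()
class_stds = ds.groupby("diabetic").std(ddof=0)   # ddof=0 to match np.std above
print(class_means)
print(class_stds)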
ds["diabetic"].value_counts()
0    500
1    268
Name: diabetic, dtype: int64
Calculating the prior probability of each class:
prob_zero = 500/768
prob_one = 268/768
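The same priors can also be derived directly from the label counts instead of hard-coding 500 and 768; a small equivalent sketch:

# Class priors P(0) and P(1) computed from the label counts themselves.
counts = ds["diabetic"].value_counts()
prob_zero = counts.loc[0] / len(ds)   # 500/768 ≈ 0.651
prob_one = counts.loc[1] / len(ds)    # 268/768 ≈ 0.349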
Calculating the probability of each feature value given a class, using the Gaussian probability density function:
\begin{equation*}
P(V|1) = \frac{1}{\sqrt{2 \pi \sigma^2}} \, e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}}
\end{equation*}
import math

def prob_one_var_one_class(var_mean, var_std, x):
    # Gaussian density of a single feature value x, given that class's mean and standard deviation.
    variance = var_std ** 2
    return 1 / math.sqrt(2 * math.pi * variance) * math.exp(-0.5 * pow(x - var_mean, 2) / variance)
prob_one_var_one_class(pregnancies_mean["1"],pregnancies_std["1"],6)
0.2659499185599781
prob_one_var_one_class(pregnancies_mean["0"],pregnancies_std["0"],6)
0.09769091102242611
prob_one_var_one_class(pregnancies_mean["1"],pregnancies_std["1"],1)
0.042722843214137975
prob_one_var_one_class(pregnancies_mean["0"],pregnancies_std["0"],1)
0.13657759789740978
Testing the individual probabilities against a few feature values shows they agree with the data: six pregnancies is more likely under class 1 than under class 0, while one pregnancy is more likely under class 0.
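If SciPy happens to be installed (an assumption; it is not required anywhere else in this post), the hand-rolled density can be sanity-checked against scipy.stats.norm.pdf, which evaluates the same Gaussian formula from a mean and a standard deviation:

from scipy.stats import norm

# norm.pdf(x, loc=mean, scale=std) is the same Gaussian density,
# so it should agree with prob_one_var_one_class for any feature value.
x = 6
manual = prob_one_var_one_class(pregnancies_mean["1"], pregnancies_std["1"], x)
reference = norm.pdf(x, loc=pregnancies_mean["1"], scale=pregnancies_std["1"])
print(manual, reference)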
\begin{equation*}
P(D|1) = P(V_1|1) \cdot P(V_2|1) \cdot P(V_3|1) \cdot P(V_4|1) \cdot P(V_5|1) \cdot P(V_6|1) \cdot P(V_7|1) \cdot P(V_8|1)
\end{equation*}
\begin{equation*}
P(D|0) = P(V_1|0) \cdot P(V_2|0) \cdot P(V_3|0) \cdot P(V_4|0) \cdot P(V_5|0) \cdot P(V_6|0) \cdot P(V_7|0) \cdot P(V_8|0)
\end{equation*}
\begin{equation*}
P(1|D) = \frac{P(D|1) \cdot P(1)}{P(D|1) \cdot P(1) + P(D|0) \cdot P(0)}
\end{equation*}
\begin{equation*}
P(0|D) = \frac{P(D|0) \cdot P(0)}{P(D|1) \cdot P(1) + P(D|0) \cdot P(0)}
\end{equation*}
def calculate_overall_prob(x):
    # Likelihood of the full row x under class 0: product of the per-feature Gaussian densities.
    prob_x_given_zero = prob_one_var_one_class(pregnancies_mean["0"], pregnancies_std["0"], x["pregnancies"]) * \
        prob_one_var_one_class(plasmaglucoseconcentration_mean["0"], plasmaglucoseconcentration_std["0"], x["plasmaglucoseconcentration"]) * \
        prob_one_var_one_class(diastolicbloodpressure_mean["0"], diastolicbloodpressure_std["0"], x["diastolicbloodpressure"]) * \
        prob_one_var_one_class(tricepsskinfoldthickness_mean["0"], tricepsskinfoldthickness_std["0"], x["tricepsskinfoldthickness"]) * \
        prob_one_var_one_class(insulin_mean["0"], insulin_std["0"], x["insulin"]) * \
        prob_one_var_one_class(bodymassindex_mean["0"], bodymassindex_std["0"], x["bodymassindex"]) * \
        prob_one_var_one_class(diabetespedigreefunction_mean["0"], diabetespedigreefunction_std["0"], x["diabetespedigreefunction"]) * \
        prob_one_var_one_class(age_mean["0"], age_std["0"], x["age"])
    # Likelihood of the full row x under class 1.
    prob_x_given_one = prob_one_var_one_class(pregnancies_mean["1"], pregnancies_std["1"], x["pregnancies"]) * \
        prob_one_var_one_class(plasmaglucoseconcentration_mean["1"], plasmaglucoseconcentration_std["1"], x["plasmaglucoseconcentration"]) * \
        prob_one_var_one_class(diastolicbloodpressure_mean["1"], diastolicbloodpressure_std["1"], x["diastolicbloodpressure"]) * \
        prob_one_var_one_class(tricepsskinfoldthickness_mean["1"], tricepsskinfoldthickness_std["1"], x["tricepsskinfoldthickness"]) * \
        prob_one_var_one_class(insulin_mean["1"], insulin_std["1"], x["insulin"]) * \
        prob_one_var_one_class(bodymassindex_mean["1"], bodymassindex_std["1"], x["bodymassindex"]) * \
        prob_one_var_one_class(diabetespedigreefunction_mean["1"], diabetespedigreefunction_std["1"], x["diabetespedigreefunction"]) * \
        prob_one_var_one_class(age_mean["1"], age_std["1"], x["age"])
    # Bayes' rule: posterior for each class, normalised over both classes.
    overall_prob_zero = (prob_x_given_zero * prob_zero) / ((prob_x_given_zero * prob_zero) + (prob_x_given_one * prob_one))
    overall_prob_one = (prob_x_given_one * prob_one) / ((prob_x_given_zero * prob_zero) + (prob_x_given_one * prob_one))
    return overall_prob_zero, overall_prob_one
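One practical caveat before testing it: multiplying eight small densities can underflow to zero in floating point for extreme rows. A common remedy is to add log-densities instead and only exponentiate at the end. Below is a minimal sketch of that idea, reusing the means, standard deviations and priors defined above; the helper names log_gaussian_density and calculate_overall_prob_log are mine and are not used in the rest of the post.

import math

def log_gaussian_density(var_mean, var_std, x):
    # Log of the Gaussian density, computed directly so the density itself never underflows.
    variance = var_std ** 2
    return -0.5 * math.log(2 * math.pi * variance) - 0.5 * (x - var_mean) ** 2 / variance

def calculate_overall_prob_log(x):
    # Same posterior as calculate_overall_prob, but accumulating log-densities per class.
    feature_stats = {
        "pregnancies": (pregnancies_mean, pregnancies_std),
        "plasmaglucoseconcentration": (plasmaglucoseconcentration_mean, plasmaglucoseconcentration_std),
        "diastolicbloodpressure": (diastolicbloodpressure_mean, diastolicbloodpressure_std),
        "tricepsskinfoldthickness": (tricepsskinfoldthickness_mean, tricepsskinfoldthickness_std),
        "insulin": (insulin_mean, insulin_std),
        "bodymassindex": (bodymassindex_mean, bodymassindex_std),
        "diabetespedigreefunction": (diabetespedigreefunction_mean, diabetespedigreefunction_std),
        "age": (age_mean, age_std),
    }
    log_zero, log_one = math.log(prob_zero), math.log(prob_one)
    for feature, (mean, std) in feature_stats.items():
        log_zero += log_gaussian_density(mean["0"], std["0"], x[feature])
        log_one += log_gaussian_density(mean["1"], std["1"], x[feature])
    # Normalise with the log-sum-exp trick: subtract the max before exponentiating.
    m = max(log_zero, log_one)
    unnorm_zero, unnorm_one = math.exp(log_zero - m), math.exp(log_one - m)
    total = unnorm_zero + unnorm_one
    return unnorm_zero / total, unnorm_one / total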
calculate_overall_prob(ds.loc[2])
(1.869821016215942e-24, 1.0)
calculate_overall_prob(ds.loc[3])
(1.0, 7.518076945772211e-20)
We can see that the probability function, applied to rows from the dataset, works as expected: the row at index 2 correctly gets a probability of essentially 1 for class one, and the row at index 3 correctly gets a probability of essentially 1 for class zero, with both outputs staying in the range 0 to 1.
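As a final cross-check, if scikit-learn is available (an assumption; it is not used elsewhere in this post), its GaussianNB class implements this same model, estimating per-class means, variances and priors from the data, so its predict_proba output on the same rows should agree with the posteriors computed above (sklearn adds a tiny variance-smoothing term, so the agreement is close rather than exact):

from sklearn.naive_bayes import GaussianNB

features = ["pregnancies", "plasmaglucoseconcentration", "diastolicbloodpressure",
            "tricepsskinfoldthickness", "insulin", "bodymassindex",
            "diabetespedigreefunction", "age"]

model = GaussianNB()
model.fit(ds[features], ds["diabetic"])

# Columns follow model.classes_, i.e. [P(0|D), P(1|D)], the same order
# as the tuple returned by calculate_overall_prob.
print(model.predict_proba(ds.loc[[2, 3], features]))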