Instead of trying to map P(Y|X=x) directly like discriminative models (https://en.wikipedia.org/wiki/Discriminative_model) do, a generative model tries to find the joint distribution of X and Y – P(X,Y) – and then uses that distribution to derive P(Y|X=x) or P(X|Y=y). I know what you are thinking: yes, it is possible to generate data from such a model. A famous method used with neural networks to generate data, for example to test models used in self-driving cars, is the GAN (https://en.wikipedia.org/wiki/Generative_adversarial_network), which grew out of the idea of generative models.
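To make that concrete, here is a tiny, made-up joint distribution over a binary feature X and a binary label Y (purely illustrative, not taken from the diabetes data used below), showing how both conditionals fall out of P(X, Y) by normalizing a row or a column:

import numpy as np

# Hypothetical joint distribution P(X, Y) over a binary feature X (rows)
# and a binary label Y (columns); the four cells sum to 1.
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])

# P(Y | X = 0): take the X = 0 row and renormalize it.
p_y_given_x0 = joint[0] / joint[0].sum()          # -> [0.75, 0.25]

# P(X | Y = 1): take the Y = 1 column and renormalize it.
p_x_given_y1 = joint[:, 1] / joint[:, 1].sum()    # -> [0.2, 0.8]

print(p_y_given_x0, p_x_given_y1)

Once you have the joint distribution, every conditional you might want is just a normalization away; that is the essence of a generative model.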
I’m not going to explain generative models any better than Wikipedia (https://en.wikipedia.org/wiki/Generative_model) or the thousands of articles on the internet already do. Instead, I’m going to build a generative model and walk through how it captures the relationship between the data and the target variable in a classification problem.
The model I chose here is the Gaussian Naive Bayes (GNB) model. One reason I chose it is that there are good articles on the internet showing how GNB works, but I couldn’t find one that showed how its formulas are actually applied and why it is called a generative model. The video I liked is https://www.youtube.com/watch?v=r1in0YNetG8. Some articles I liked are https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/ and https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/. The second article explains the formulas, but with categorical data. Here I use numerical data and the GNB formulas.
import pandas as pd
import numpy as np
I took the data from https://github.com/liquidcarrot/data.pima-indians-diabetes
ds = pd.read_csv("/Users/dineshkumarmurugan/Downloads/pima-indians-diabetes.data.csv",
                 names=["pregnancies","plasmaglucoseconcentration","diastolicbloodpressure",
                        "tricepsskinfoldthickness","insulin","bodymassindex",
                        "diabetespedigreefunction","age","diabetic"])
ds.head()
| | pregnancies | plasmaglucoseconcentration | diastolicbloodpressure | tricepsskinfoldthickness | insulin | bodymassindex | diabetespedigreefunction | age | diabetic |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
ds.describe()
| | pregnancies | plasmaglucoseconcentration | diastolicbloodpressure | tricepsskinfoldthickness | insulin | bodymassindex | diabetespedigreefunction | age | diabetic |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.884160 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 99.000000 | 62.000000 | 0.000000 | 0.000000 | 27.300000 | 0.243750 | 24.000000 | 0.000000 |
| 50% | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 30.500000 | 32.000000 | 0.372500 | 29.000000 | 0.000000 |
| 75% | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 127.250000 | 36.600000 | 0.626250 | 41.000000 | 1.000000 |
| max | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
Of course, when we hear the word Gaussian, mean and variance come with it. Here we treat each feature as independent (which is the whole idea behind the Naive Bayes model) and calculate the mean and standard deviation of each feature for each class.
def calculate_mean_std(x_var, var):
    # Mean and standard deviation of one feature, split by class (diabetic = 1 / 0).
    return ({"1": np.mean(x_var[x_var["diabetic"] == 1][var]),
             "0": np.mean(x_var[x_var["diabetic"] == 0][var])},
            {"1": np.std(x_var[x_var["diabetic"] == 1][var]),
             "0": np.std(x_var[x_var["diabetic"] == 0][var])})
pregnancies_mean, pregnancies_std = calculate_mean_std(ds, "pregnancies")
plasmaglucoseconcentration_mean, plasmaglucoseconcentration_std = calculate_mean_std(ds, "plasmaglucoseconcentration")
diastolicbloodpressure_mean, diastolicbloodpressure_std = calculate_mean_std(ds, "diastolicbloodpressure")
tricepsskinfoldthickness_mean, tricepsskinfoldthickness_std = calculate_mean_std(ds, "tricepsskinfoldthickness")
insulin_mean, insulin_std = calculate_mean_std(ds, "insulin")
bodymassindex_mean, bodymassindex_std = calculate_mean_std(ds, "bodymassindex")
diabetespedigreefunction_mean, diabetespedigreefunction_std = calculate_mean_std(ds, "diabetespedigreefunction")
age_mean, age_std = calculate_mean_std(ds, "age")
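As a side note, the same per-class statistics can be computed in one shot with a pandas groupby. This is only an equivalent sketch of the calculation above and is not used later in the post (np.std in calculate_mean_std uses ddof=0, so ddof=0 is passed here to match):

# Equivalent per-class summary of every feature using pandas.
class_means = ds.groupby("diabetic").mean()
class_stds = ds.groupby("diabetic").std(ddof=0)   # ddof=0 to match np.std above
print(class_means)
print(class_stds)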
ds["diabetic"].value_counts()
0    500
1    268
Name: diabetic, dtype: int64
Calculating the prior probability of each class:
prob_zero = 500/768
prob_one = 268/768
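The same priors can also be derived directly from the label counts instead of hard-coding 500 and 768; a small equivalent sketch:

# Class priors P(0) and P(1) computed from the label counts themselves.
counts = ds["diabetic"].value_counts()
prob_zero = counts.loc[0] / len(ds)   # 500/768 ≈ 0.651
prob_one = counts.loc[1] / len(ds)    # 268/768 ≈ 0.349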
Calculating the probability of each feature value given a class, using the Gaussian probability density function:
\begin{equation*}
P(V|1) = \frac{1}{\sqrt{2 \pi \sigma^2}} \, e^{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}}
\end{equation*}
import math

def prob_one_var_one_class(var_mean, var_std, x):
    # Gaussian density of a single feature value x, given that class's mean and standard deviation.
    variance = var_std ** 2
    return 1 / math.sqrt(2 * math.pi * variance) * math.exp(-0.5 * pow(x - var_mean, 2) / variance)
prob_one_var_one_class(pregnancies_mean["1"],pregnancies_std["1"],6)
0.2659499185599781
prob_one_var_one_class(pregnancies_mean["0"],pregnancies_std["0"],6)
0.09769091102242611
prob_one_var_one_class(pregnancies_mean["1"],pregnancies_std["1"],1)
0.042722843214137975
prob_one_var_one_class(pregnancies_mean["0"],pregnancies_std["0"],1)
0.13657759789740978
Testing the individual probabilities against a few feature values shows they agree with the data: six pregnancies is more likely under class 1 than under class 0, while one pregnancy is more likely under class 0.
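If SciPy happens to be installed (an assumption; it is not required anywhere else in this post), the hand-rolled density can be sanity-checked against scipy.stats.norm.pdf, which evaluates the same Gaussian formula from a mean and a standard deviation:

from scipy.stats import norm

# norm.pdf(x, loc=mean, scale=std) is the same Gaussian density,
# so it should agree with prob_one_var_one_class for any feature value.
x = 6
manual = prob_one_var_one_class(pregnancies_mean["1"], pregnancies_std["1"], x)
reference = norm.pdf(x, loc=pregnancies_mean["1"], scale=pregnancies_std["1"])
print(manual, reference)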
\begin{equation*}
P(D|1) = P(V_1|1) \cdot P(V_2|1) \cdot P(V_3|1) \cdot P(V_4|1) \cdot P(V_5|1) \cdot P(V_6|1) \cdot P(V_7|1) \cdot P(V_8|1)
\end{equation*}
\begin{equation*}
P(D|0) = P(V_1|0) \cdot P(V_2|0) \cdot P(V_3|0) \cdot P(V_4|0) \cdot P(V_5|0) \cdot P(V_6|0) \cdot P(V_7|0) \cdot P(V_8|0)
\end{equation*}
\begin{equation*}
P(1|D) = \frac{P(D|1) \cdot P(1)}{P(D|1) \cdot P(1) + P(D|0) \cdot P(0)}
\end{equation*}
\begin{equation*}
P(0|D) = \frac{P(D|0) \cdot P(0)}{P(D|1) \cdot P(1) + P(D|0) \cdot P(0)}
\end{equation*}
def calculate_overall_prob(x):
    # Likelihood of the full row x under class 0: product of the per-feature Gaussian densities.
    prob_x_given_zero = prob_one_var_one_class(pregnancies_mean["0"], pregnancies_std["0"], x["pregnancies"]) * \
        prob_one_var_one_class(plasmaglucoseconcentration_mean["0"], plasmaglucoseconcentration_std["0"], x["plasmaglucoseconcentration"]) * \
        prob_one_var_one_class(diastolicbloodpressure_mean["0"], diastolicbloodpressure_std["0"], x["diastolicbloodpressure"]) * \
        prob_one_var_one_class(tricepsskinfoldthickness_mean["0"], tricepsskinfoldthickness_std["0"], x["tricepsskinfoldthickness"]) * \
        prob_one_var_one_class(insulin_mean["0"], insulin_std["0"], x["insulin"]) * \
        prob_one_var_one_class(bodymassindex_mean["0"], bodymassindex_std["0"], x["bodymassindex"]) * \
        prob_one_var_one_class(diabetespedigreefunction_mean["0"], diabetespedigreefunction_std["0"], x["diabetespedigreefunction"]) * \
        prob_one_var_one_class(age_mean["0"], age_std["0"], x["age"])
    # Likelihood of the full row x under class 1.
    prob_x_given_one = prob_one_var_one_class(pregnancies_mean["1"], pregnancies_std["1"], x["pregnancies"]) * \
        prob_one_var_one_class(plasmaglucoseconcentration_mean["1"], plasmaglucoseconcentration_std["1"], x["plasmaglucoseconcentration"]) * \
        prob_one_var_one_class(diastolicbloodpressure_mean["1"], diastolicbloodpressure_std["1"], x["diastolicbloodpressure"]) * \
        prob_one_var_one_class(tricepsskinfoldthickness_mean["1"], tricepsskinfoldthickness_std["1"], x["tricepsskinfoldthickness"]) * \
        prob_one_var_one_class(insulin_mean["1"], insulin_std["1"], x["insulin"]) * \
        prob_one_var_one_class(bodymassindex_mean["1"], bodymassindex_std["1"], x["bodymassindex"]) * \
        prob_one_var_one_class(diabetespedigreefunction_mean["1"], diabetespedigreefunction_std["1"], x["diabetespedigreefunction"]) * \
        prob_one_var_one_class(age_mean["1"], age_std["1"], x["age"])
    # Bayes' rule: posterior for each class, normalised over both classes.
    overall_prob_zero = (prob_x_given_zero * prob_zero) / ((prob_x_given_zero * prob_zero) + (prob_x_given_one * prob_one))
    overall_prob_one = (prob_x_given_one * prob_one) / ((prob_x_given_zero * prob_zero) + (prob_x_given_one * prob_one))
    return overall_prob_zero, overall_prob_one
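One practical caveat before testing it: multiplying eight small densities can underflow to zero in floating point for extreme rows. A common remedy is to add log-densities instead and only exponentiate at the end. Below is a minimal sketch of that idea, reusing the means, standard deviations and priors defined above; the helper names log_gaussian_density and calculate_overall_prob_log are mine and are not used in the rest of the post.

import math

def log_gaussian_density(var_mean, var_std, x):
    # Log of the Gaussian density, computed directly so the density itself never underflows.
    variance = var_std ** 2
    return -0.5 * math.log(2 * math.pi * variance) - 0.5 * (x - var_mean) ** 2 / variance

def calculate_overall_prob_log(x):
    # Same posterior as calculate_overall_prob, but accumulating log-densities per class.
    feature_stats = {
        "pregnancies": (pregnancies_mean, pregnancies_std),
        "plasmaglucoseconcentration": (plasmaglucoseconcentration_mean, plasmaglucoseconcentration_std),
        "diastolicbloodpressure": (diastolicbloodpressure_mean, diastolicbloodpressure_std),
        "tricepsskinfoldthickness": (tricepsskinfoldthickness_mean, tricepsskinfoldthickness_std),
        "insulin": (insulin_mean, insulin_std),
        "bodymassindex": (bodymassindex_mean, bodymassindex_std),
        "diabetespedigreefunction": (diabetespedigreefunction_mean, diabetespedigreefunction_std),
        "age": (age_mean, age_std),
    }
    log_zero, log_one = math.log(prob_zero), math.log(prob_one)
    for feature, (mean, std) in feature_stats.items():
        log_zero += log_gaussian_density(mean["0"], std["0"], x[feature])
        log_one += log_gaussian_density(mean["1"], std["1"], x[feature])
    # Normalise with the log-sum-exp trick: subtract the max before exponentiating.
    m = max(log_zero, log_one)
    unnorm_zero, unnorm_one = math.exp(log_zero - m), math.exp(log_one - m)
    total = unnorm_zero + unnorm_one
    return unnorm_zero / total, unnorm_one / total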
calculate_overall_prob(ds.loc[2])
(1.869821016215942e-24, 1.0)
calculate_overall_prob(ds.loc[3])
(1.0, 7.518076945772211e-20)
We can see that the probability function, applied to rows from the dataset, works as expected: the row at index 2 correctly gets a probability of essentially 1 for class one, and the row at index 3 correctly gets a probability of essentially 1 for class zero, with both outputs staying in the range 0 to 1.
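As a final cross-check, if scikit-learn is available (an assumption; it is not used elsewhere in this post), its GaussianNB class implements this same model, estimating per-class means, variances and priors from the data, so its predict_proba output on the same rows should agree with the posteriors computed above (sklearn adds a tiny variance-smoothing term, so the agreement is close rather than exact):

from sklearn.naive_bayes import GaussianNB

features = ["pregnancies", "plasmaglucoseconcentration", "diastolicbloodpressure",
            "tricepsskinfoldthickness", "insulin", "bodymassindex",
            "diabetespedigreefunction", "age"]

model = GaussianNB()
model.fit(ds[features], ds["diabetic"])

# Columns follow model.classes_, i.e. [P(0|D), P(1|D)], the same order
# as the tuple returned by calculate_overall_prob.
print(model.predict_proba(ds.loc[[2, 3], features]))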