- Exploratory Data Analysis
- Building new Features
- Dimensionality reduction
- Removing Outliers
- Selecting Validation Metrics & Testing and Validation
- Cross Validation & Hyperparameter tuning
- Feature Selection.
Exploratory Data Analysis:
This is the very first step in the data science process. It is done to understand the dataset better, check its features and shape, validate an initial hypothesis, and get a preliminary idea about the next step.
E.g.:
import pandas as pd
df = pd.read_csv("<file-location>")
df.describe()
# describe() gives the count, mean, standard deviation, and five-number summary for each numeric column
df.boxplot(return_type='axes')
# boxplot visualizes the distribution of each numeric column
df.quantile([0.1, 0.9])
# to get the 10th and 90th percentiles
df.<categorical variable name>.unique()
# to get the list of unique values of a categorical variable
pd.crosstab(df['<variable name>'], df['<variable name>'])
# crosstab builds a co-occurrence (contingency) matrix of two categorical variables
Building new features:
When the features and the target variable are not clearly related, modifying the input dataset can help: we can apply linear or nonlinear transformations to the features to improve the accuracy of the model.
a). Normalize the input features using Z-scores (standardization).
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
b). Z-normalization is affected by outliers because it relies on the mean and standard deviation; RobustScaler instead uses the median and the IQR (Inter-Quartile Range) to scale each feature independently.
from sklearn.preprocessing import RobustScaler
scaler2 = RobustScaler()
df_r_scaled = scaler2.fit_transform(df)
Dimensionality Reduction:
A dataset may contain a large number of features, many of which may be unnecessary. Keeping only the interesting features not only makes the dataset more manageable but also helps the algorithms work better.
a). Covariance Matrix
– The correlation matrix (computed here with np.corrcoef) gives an idea of how the features relate to one another; variables that are highly correlated with each other can be removed. (Caution: it captures only linear relations.)
import numpy as np
df_cov = np.corrcoef(df.data.T)
# pairwise correlation coefficients between features (observations are rows, hence the transpose)
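A minimal sketch of how strongly correlated features could then be flagged, using the df_cov matrix computed above and an illustrative cutoff of 0.9:
high_corr = np.argwhere(np.triu(np.abs(df_cov), k=1) > 0.9)
# each row of high_corr is a pair (i, j) of feature indices with |correlation| > 0.9
cols_to_drop = set(j for i, j in high_corr)
# dropping one feature of each highly correlated pair removes redundancy while keeping the information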
b). Principal Component Analysis
– PCA is a technique that helps define a smaller and more relevant set of features. It essentially restructures the information in the dataset by aggregating as much of the variance as possible onto the first components it produces.
from sklearn.decomposition import PCA
v_pca = PCA(n_components=2)
df_pca2 = v_pca.fit_transform(df.data)
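To check how much information the two components retain, the explained variance ratio of the fitted PCA can be inspected:
print(v_pca.explained_variance_ratio_)
# fraction of the total variance captured by each retained component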
c). Other forms of PCA
- Randomized PCA (for big data; in recent scikit-learn versions this is PCA's svd_solver='randomized' option)
- Kernel PCA (for nonlinear structure)
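A minimal sketch of both options (the RBF kernel below is just an illustrative choice):
from sklearn.decomposition import PCA, KernelPCA
# randomized solver: approximate but much faster on large datasets
v_rpca = PCA(n_components=2, svd_solver='randomized')
df_rpca2 = v_rpca.fit_transform(df.data)
# kernel PCA: captures nonlinear structure in the data
v_kpca = KernelPCA(n_components=2, kernel='rbf')
df_kpca2 = v_kpca.fit_transform(df.data)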
d). Latent Factor Analysis
The overall idea is similar to PCA; it is used when a latent factor or construct is expected to underlie the dataset.
from sklearn.decomposition import FactorAnalysis
v_fact = FactorAnalysis(n_components=2)
df_fact2 = v_fact.fit_transform(df.data)
e). Linear Discriminant Analysis
Strictly speaking, LDA is a classifier, but it is often used for dimensionality reduction. Since it is a supervised approach, it also requires the labels to optimize.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
v_lda = LDA(n_components=2)
df_lda2 = v_lda.fit_transform(df.data,df.target)
# here df.target is the label
f). Latent Semantic Analysis
LSA is applied to text after it has been processed by TfidfVectorizer or CountVectorizer. It applies SVD to the input dataset, producing sets of words that are usually associated with the same concept. LSA is used when the features are homogeneous (all of the same kind, e.g. word counts or frequencies).
from sklearn.feature_extraction.text import TfidfVectorizer
tf_vect = TfidfVectorizer()
word_freq = tf_vect.fit_transform(<text dataset>)
from sklearn.decomposition import TruncatedSVD
v_tsvd = TruncatedSVD(n_components=2)
v_tsvd.fit(word_freq)
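A small continuation showing how to use the fitted model:
doc_topics = v_tsvd.transform(word_freq)
# documents projected onto the 2 latent concepts
term_loadings = v_tsvd.components_
# how strongly each word contributes to each latent concept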
Removing Outliers:
When a data point deviates markedly from the others in the sample, it is called an outlier.
Possible causes:
- The point represents a rare occurrence.
- The point is clearly some kind of mistake.
- The point is a usual occurrence of a different distribution (it belongs to another population).
a). Univariate outlier detection.
Univariate methods are based on EDA and visualization, such as boxplots.
Rules of thumb to keep in mind when chasing outliers by examining single variables:
- If the absolute Z-score of a value is greater than 3, the value should be considered a suspect outlier.
- Values below the 25th percentile minus 1.5×IQR, or above the 75th percentile plus 1.5×IQR (the IQR being the difference between the 75th and 25th percentiles), are also suspect (see the sketch after the Z-score code below).
import numpy as np
from sklearn import preprocessing
df_norm = preprocessing.StandardScaler().fit_transform(df)
outlier_row, outlier_column = np.where(np.abs(df_norm) > 3)
# row and column indices of the values whose absolute Z-score exceeds 3
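The code above covers the Z-score rule; a minimal sketch of the IQR rule, assuming df is a numeric pandas DataFrame:
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1   # inter-quartile range per column
outlier_mask = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)
# True wherever a value falls outside the whiskers of the boxplot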
b). EllipticEnvelope
It is a function that tries to figure out the key parameters of the data's general distribution by assuming that the entire dataset is an expression of an underlying multivariate Gaussian distribution.
from sklearn.covariance import EllipticEnvelope
robust_covariance_set = EllipticEnvelope(contamination=.1).fit(df)
detection = robust_covariance_set.predict(df)
outlier = np.where(detection == -1)[0]
# contamination is the expected proportion of outliers; predict returns -1 for outliers
Other methods include the one-class SVM (see the sketch below).
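A rough sketch with OneClassSVM, assuming roughly 10 percent of the points are expected to be outliers:
import numpy as np
from sklearn.svm import OneClassSVM
ocsvm = OneClassSVM(nu=0.1, kernel='rbf', gamma='auto')
detection = ocsvm.fit(df).predict(df)
# as with EllipticEnvelope, -1 marks the points flagged as outliers
outliers = np.where(detection == -1)[0]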
Selecting Validation Metrics:
In order to measure the performance of the model we have built, we need a measure or a function that scores the outcome.
a). Multilabel classification
- Confusion matrix
A table that gives us an idea of what the misclassifications are for each class.
from sklearn.metrics import confusion_matrix
v_cm = confusion_matrix(Y_test, Y_pred)
print(v_cm)
- Accuracy:
The portion of the predicted labels that are exactly equal to the real ones.
from sklearn import metrics
print("Accuracy: ", metrics.accuracy_score(Y_test, Y_pred))
- Precision
The number of relevant results in the result set, i.e. the number of correct labels among all the labels assigned to a class.
print("Precision: ", metrics.precision_score(Y_test, Y_pred, average='weighted'))
# with more than two classes an average= strategy is required; 'weighted' is one common choice
- Recall
The number of correctly classified labels of a class divided by the total number of labels of that class.
print("Recall: ", metrics.recall_score(Y_test, Y_pred, average='weighted'))
- F1 Score:
The harmonic mean of precision and recall. It is mostly used when dealing with unbalanced datasets.
print("F1 Score: ", metrics.f1_score(Y_test, Y_pred, average='weighted'))
b). Binary Classification.
The most commonly used measures are the ROC curve (Receiver Operating Characteristic) and the AUC (Area Under the Curve).
sklearn.metrics.roc_auc_score()
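A minimal usage sketch, assuming a fitted binary classifier clf that exposes predict_proba (clf is just an illustrative name):
from sklearn.metrics import roc_auc_score
Y_score = clf.predict_proba(X_test)[:, 1]
# use the predicted probability of the positive class, not the hard labels
print("ROC AUC: ", roc_auc_score(Y_test, Y_score))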
c). Regression.
Many error measures are derived from Euclidean algebra.
- MAE (Mean absolute error)
from sklearn.metrics import mean_absolute_error
mean_absolute_error(Y_test,Y_pred)
- MSE (Mean squared error)
from sklearn.metrics import mean_squared_error
mean_squared_error(Y_test,Y_pred)
- R^2 Score
R^2 is also known as coefficient of determination.
from sklearn.metrics import r2_score
r2_score(Y_test, Y_pred)
Testing and Validation:
When we provide our entire dataset for model training, in extreme cases the model becomes overtrained, that is, too complex with respect to the available data. A memorized pattern prevails and the algorithm becomes unable to predict new observations correctly.
- We can increase the sample size so that it becomes infeasible for the model to store all the information.
- We can use a simpler machine learning algorithm, which is less prone to memorization.
- We can use regularization to penalize extremely complex models.
In many cases fresh data is not available; a good approach then is to divide the initial data into a training set (usually 70-80 percent) and use the rest as a test set.
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30)
A further refinement is to control the random state of the split; random_state is a parameter of the train_test_split function.
Cross Validation:
From the previous step we know that the results vary with the particular train/test split. Unfortunately, setting data aside for testing also brings uncertainty, along with a reduction of the learning examples dedicated to training. A solution is cross-validation, which lets us use the training data for both model optimization and model training.
from sklearn.model_selection import cross_val_score
cross_val_score(model_h, X_train, Y_train, cv=10, scoring='accuracy')
# model_h is the estimator to evaluate; cv=10 means 10-fold cross-validation
Hyperparameter optimization:
A machine learning model is determined not only by the learning algorithm but also by its hyperparameters.
from sklearn.model_selection import GridSearchCV
search_func = GridSearchCV(estimator=h, param_grid=search_grid, scoring='accuracy', cv=10)
# search_grid is the dict of hyperparameter values for the search to explore
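A self-contained sketch of how the pieces fit together, using an SVC and an illustrative grid (the parameter values below are only examples):
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
h = SVC()
search_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
search_func = GridSearchCV(estimator=h, param_grid=search_grid, scoring='accuracy', cv=10)
search_func.fit(X_train, Y_train)
print(search_func.best_params_, search_func.best_score_)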
Feature Selection:
Irrelevant and redundant features contribute to poor interpretability of the resulting model, to longer training times and, most importantly, to overfitting and poor generalization.
a). Selection based on feature variance.
Remove all features whose variance is smaller than a set threshold.
from sklearn.feature_selection import VarianceThreshold
df_selected = VarianceThreshold(threshold=1.0).fit_transform(df)
b). Univariate Selection
Select the single variables that are most associated with the target variable. Three tests are available to base the selection on:
- f_regression – uses the F-test and p-values (for regression problems)
- f_classif – uses the ANOVA F-test (for classification problems)
- chi2 – uses the chi-squared test (for classification with non-negative features)
from sklearn.feature_selection import chi2, SelectPercentile
v_selec = SelectPercentile(chi2, percentile=25).fit(df, Y)
# keeps the top 25% of features; chi2 requires non-negative (e.g. count or binarized) features
c). Recursive elimination
The problem with univariate selection is the likelihood of selecting a subset containing redundant information. Recursive elimination helps in this case by repeatedly refitting the model and removing the least important features (see the sketch after the baseline below).
from sklearn.linear_model import LogisticRegression
v_class = LogisticRegression(random_state=101)
v_class.fit(X_train,Y_train)
print('In-sample accuracy: %0.3f' % v_class.score(X_train, Y_train))
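The snippet above only establishes a baseline; a minimal sketch of the actual recursive elimination step, using scikit-learn's cross-validated RFECV with the same classifier:
from sklearn.feature_selection import RFECV
selector = RFECV(estimator=v_class, step=1, cv=10, scoring='accuracy')
selector.fit(X_train, Y_train)
print('Optimal number of features: %d' % selector.n_features_)
X_train_reduced = selector.transform(X_train)
# keeps only the features judged useful by the cross-validated elimination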
d). Stability and L1-based selection:
Another way to solve the same problem is to use regularization to limit the weight of the coefficients, thus preventing overfitting and selecting the most relevant variables without losing predictive power.
from sklearn.linear_model import LogisticRegression
v_class = LogisticRegression(C=0.1, penalty='l1', solver='liblinear', random_state=101)
# the liblinear solver supports the L1 penalty
v_class.fit(X_train,Y_train)
print('Out-of-sample accuracy: %0.3f' % v_class.score(X_test, Y_test))
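To actually keep only the features that survive the L1 penalty, the fitted model can be wrapped in SelectFromModel (a sketch reusing v_class from above):
from sklearn.feature_selection import SelectFromModel
sfm = SelectFromModel(v_class, prefit=True)
# features whose L1-penalized coefficients are zero are dropped
X_train_selected = sfm.transform(X_train)
print('Selected %d of %d features' % (X_train_selected.shape[1], X_train.shape[1]))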