Cheat sheet for Data Science Pipeline

  • Exploratory Data Analysis
  • Building new Features
  • Dimensionality reduction
  • Removing Outliers
  • Selecting Validation Metrics & Testing and Validation
  • Cross Validation & Hyperparameter tuning
  • Feature Selection

Exploratory Data Analysis:

This is the very first step in the data science process. It is done to understand the dataset better, check its features and shape, validate an initial hypothesis, and get a preliminary idea about the next step.

E.g.:

import pandas as pd

df = pd.read_csv("<file-location>")

df.describe()

# describe() provides the count, mean, std, and five-number summary of each numeric column

df.boxplot(return_type='axes')

# boxplot() visualizes the distribution of each numeric column as a box plot

df.quantile([0.1,0.9])

# to get the 10th and 90th percentiles

df.<categorical variable name>.unique()

# to get the unique values of a categorical variable

pd.crosstab(df['<variable name>'],df['<variable name>'])

# crosstab creates a co-occurrence (contingency) matrix between two variables

Building new features:

When the features and the target variable are not well related in their raw form, we can modify the input dataset by applying linear or nonlinear transformations to improve the accuracy of the system.

a). Normalize the input features using Z-scores.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

df_scaled = scaler.fit_transform(df)

b). Z-normalization is affected by outliers, since it relies on the mean and standard deviation. Instead, we can use RobustScaler, which uses the median and the IQR (Inter-Quartile Range) to scale each feature independently.

from sklearn.preprocessing import RobustScaler

scaler2 = RobustScaler()

df_r_scaled = scaler2.fit_transform(df)

Dimensionality Reduction:

A dataset may contain a large number of features, many of which are unnecessary. Keeping only the interesting features not only makes the dataset more manageable but also helps the algorithms work better.

a). Covariance Matrix

– The correlation matrix (a normalized covariance matrix) gives us an idea of how the variables relate to each other. We can remove variables that are highly correlated with one another. (Caution: it captures only linear relationships.)

            import numpy as np

            df_corr = np.corrcoef(df.data.T)   # correlation matrix of the features

b). Principal Component Analysis

– PCA is a technique that helps define a smaller and more relevant set of features. It restructures the information in the dataset by aggregating as much of the variance as possible onto the first components produced.

            from sklearn.decomposition import PCA

            v_pca = PCA(n_components=2)

            df_pca2 = v_pca.fit_transform(df.data)

c). Other forms of PCA

  1. Randomized PCA (for big data; now available as PCA(svd_solver='randomized'))
  2. Kernel PCA (for nonlinear structure; both variants are sketched below)
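A minimal sketch of both variants, assuming df.data is the same numeric feature matrix used in the PCA example above (the variable names are illustrative):

            from sklearn.decomposition import PCA, KernelPCA

            # the randomized SVD solver replaces the old RandomizedPCA class
            v_rpca = PCA(n_components=2, svd_solver='randomized')
            df_rpca2 = v_rpca.fit_transform(df.data)

            # kernel PCA captures nonlinear structure (RBF kernel here)
            v_kpca = KernelPCA(n_components=2, kernel='rbf')
            df_kpca2 = v_kpca.fit_transform(df.data)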

d). Latent Factor Analysis

The overall idea is similar to PCA; it is used when a latent factor or construct is expected to be present in the dataset.

            from sklearn.decomposition import FactorAnalysis

            v_fact = FactorAnalysis(n_components=2)

            df_fact2 = v_fact.fit_transform(df.data)

e). Linear Discriminant Analysis

Strictly speaking, LDA is a classifier, but it is often used for dimensionality reduction. Since it is a supervised approach, it requires the labels in order to fit.

            from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

            v_lda = LDA(n_components=2)

            df_lda2 = v_lda.fit_transform(df.data,df.target)

            # here df.target is the label

f). Latent Semantic Analysis

LSA is applied to text after it has been processed by TfidfVectorizer or CountVectorizer. It applies SVD to the input matrix, producing sets of words usually associated with the same concept. LSA is used when the features are homogeneous (all word counts or frequencies).

            from sklearn.feature_extraction.text import TfidfVectorizer

            tf_vect = TfidfVectorizer()

            word_freq = tf_vect.fit_transform(<text dataset>)

            from sklearn.decomposition import TruncatedSVD

            v_tsvd = TruncatedSVD(n_components=2)

            v_tsvd.fit(word_freq)

Removing Outliers:

When a data point deviates markedly from the others in the sample, it is called an outlier.

The cause may be:

  1. The point represents a rare occurrence
  2. The point is clearly some kind of a mistake.
  3. The point is a normal occurrence drawn from a different distribution.

a). Univariate outlier detection.

Univariate methods are based on EDA and visualization, such as boxplots.

Rules of thumb to keep in mind when chasing outliers by examining single variables:

  • If the absolute Z-score of the point is above 3 (i.e., more than 3 standard deviations from the mean), the value should be considered a suspect outlier.
  • Values below the 25th percentile minus 1.5×IQR, or above the 75th percentile plus 1.5×IQR (IQR is the difference between the 75th and 25th percentiles), are also suspect outliers (see the sketch after the Z-score code below).

import numpy as np
from sklearn import preprocessing

df_norm = preprocessing.StandardScaler().fit_transform(df)

outlier_row, outlier_column = np.where(np.abs(df_norm)>3)
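The code above implements the Z-score rule; the IQR rule from the list can be sketched as follows, assuming df holds only numeric columns (the names are illustrative):

q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1

# flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], column by column
outlier_mask = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)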

b). EllipticEnvelope

It is a function that tries to figure out the key parameters of your data's general distribution by assuming that the entire dataset is an expression of an underlying multivariate Gaussian distribution.

            from sklearn.covariance import EllipticEnvelope

            robust_covariance_set = EllipticEnvelope(contamination=.1).fit(df)

            detection = robust_covariance_set.predict(df)

            outlier = np.where(detection==-1)[0]

            # -1 marks the points flagged as outliers

Some other methods include the one-class SVM (sketched below).
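A minimal one-class SVM sketch, assuming df holds only numeric features; nu and gamma are illustrative values, not tuned settings:

            from sklearn.svm import OneClassSVM

            ocsvm = OneClassSVM(nu=0.1, kernel='rbf', gamma='auto')
            detection = ocsvm.fit_predict(df)
            outlier = np.where(detection == -1)[0]   # -1 marks suspected outliers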

Selecting Validation Metrics:

To measure the performance of the model we have built, we need a metric or a function that scores the outcome.

a). Multilabel classification

  1. Confusion matrix

A table that gives us an idea of what the misclassifications are for each class.

                 from sklearn.metrics import confusion_matrix

                 v_cm = confusion_matrix(Y_test,Y_pred)

                 print(v_cm)

  2. Accuracy

It is the proportion of predicted labels that are exactly equal to the real ones.

                 from sklearn import metrics

                 print("Accuracy: ", metrics.accuracy_score(Y_test,Y_pred))

  3. Precision

It counts the number of relevant results in the result set, i.e., the proportion of correct labels among the labels predicted for each class.

print("Precision: ", metrics.precision_score(Y_test,Y_pred,average='macro'))  # pass an average (e.g. 'macro' or 'micro') for multiclass problems

  4. Recall

The number of correctly classified labels for a class divided by the total number of true labels for that class.

print("Recall: ", metrics.recall_score(Y_test,Y_pred,average='macro'))

  5. F1 Score

It is the harmonic mean of precision and recall. It is mostly used when dealing with unbalanced datasets.

print("F1 Score: ", metrics.f1_score(Y_test,Y_pred,average='macro'))

b). Binary Classification.

The mainly used measures are the ROC curve (Receiver Operating Characteristic) and the AUC (Area Under the Curve).

sklearn.metrics.roc_auc_score()
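roc_auc_score expects scores or probabilities for the positive class rather than hard labels; a minimal sketch, where model_h stands for any fitted binary classifier with predict_proba (an illustrative name, not from the original):

from sklearn.metrics import roc_auc_score

Y_prob = model_h.predict_proba(X_test)[:, 1]   # probability of the positive class
print("ROC AUC: ", roc_auc_score(Y_test, Y_prob))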

c). Regression.

Many error measures are derived from Euclidean algebra.

  1. MAE (Mean absolute error)

from sklearn.metrics import mean_absolute_error

mean_absolute_error(Y_test,Y_pred)

  2. MSE (Mean squared error)

from sklearn.metrics import mean_squared_error

mean_squared_error(Y_test,Y_pred)

  3. R^2 Score

R^2 is also known as coefficient of determination.

        from sklearn.metrics import r2_score

        r2_score(Y_test,Y_pred)

Testing and Validation:

When we use our entire dataset for model training, in extreme cases the model becomes overtrained or too complex with respect to the available data. Memorized patterns prevail and the algorithm becomes unfit to correctly predict new observations.

  1. We can increase the sample size so that it becomes infeasible to store all information.
  2. We can use a simpler machine learning algorithm that is less prone to memorization.
  3. We can use regularization to penalize extremely complex models.

In many cases fresh data is not available; a good approach then is to divide the initial data into a training set (usually 70-80 percent) and use the rest as the test set.

from sklearn.model_selection import train_test_split

X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.30)

Advanced methods include changing the random state of the split; random_state is a parameter of the train_test_split function (see the sketch below).
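For example, fixing random_state makes the split reproducible, while changing it produces a different split (a minimal sketch using the current model_selection import):

from sklearn.model_selection import train_test_split

X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.30,random_state=1)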

Cross Validation:

From the previous step we know that results vary with the particular train/test split. Unfortunately, this brings uncertainty along with a reduction in the number of examples dedicated to training. A solution is cross-validation, which lets us use the training data for both model optimization and model training.

           from sklearn.model_selection import cross_val_score

           cross_val_score(model_h,X_train,Y_train,cv=10,scoring='accuracy')

Hyperparameter optimization:

A machine learning model is not determined by the learning algorithm alone but also by its hyperparameters (parameters that are set before training rather than learned from the data).

from sklearn.model_selection import GridSearchCV

search_func = GridSearchCV(estimator=h,param_grid=search_grid,scoring='accuracy',cv=10)

# search_grid is a dict mapping hyperparameter names to the candidate values to try (see the sketch below)
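As a concrete sketch (the estimator and the grid values below are illustrative assumptions, not from the original):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

h = SVC()
search_grid = {'C': [1, 10, 100], 'kernel': ['linear', 'rbf']}

search_func = GridSearchCV(estimator=h,param_grid=search_grid,scoring='accuracy',cv=10)
search_func.fit(X_train,Y_train)

print(search_func.best_params_, search_func.best_score_)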

Feature Selection:

Irrelevant and redundant features may contribute to a lack of interpretability of the resulting model, to longer training times and, most importantly, to overfitting and poor generalization.

a). Selection based on feature variance

Remove all the features whose variance is small, typically lower than a set threshold.

            from sklearn.feature_selection import VarianceThreshold

            df_selected = VarianceThreshold(threshold=1.0).fit_transform(df)

b). Univariate Selection

Select the single variables that are most associated with the target variable. There are three available tests to base our selection on:

  1. f_regression – uses F-Test and p-value
  2. f_classif – uses ANOVA F-test
  3. chi2 – uses the chi-squared test (for non-negative features)

from sklearn.feature_selection import chi2, SelectPercentile

v_selec = SelectPercentile(chi2,percentile=25).fit(df,Y)   # percentile is the percent of features to keep

# chi2 requires non-negative features, e.g., binarized or count data

c). Recursive elimination

The problem with univariate selection is the likelihood of selecting a subset containing redundant information. Recursive elimination could help in this case; the block below first fits a baseline classifier, and a recursive-elimination sketch follows it.

           from sklearn.linear_model import LogisticRegression

            v_class = LogisticRegression(random_state=101)

            v_class.fit(X_train,Y_train)

            print('In-sample accuracy: %0.3f' % v_class.score(X_train,Y_train))
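A minimal recursive-elimination sketch on top of the baseline classifier above, using RFECV (the settings are illustrative):

            from sklearn.feature_selection import RFECV

            selector = RFECV(estimator=v_class, step=1, cv=10, scoring='accuracy')
            selector.fit(X_train, Y_train)

            print('Optimal number of features: %d' % selector.n_features_)
            X_train_reduced = selector.transform(X_train)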

d). Stability and L1-based selection:

Another way to solve the same problem is to use regularization to limit the weights of the coefficients, thus preventing overfitting and selecting the most relevant variables without losing predictive power (a selection sketch follows the code below).

            from sklearn.linear_model import LogisticRegression

            v_class = LogisticRegression(C=0.1,penalty='l1',solver='liblinear',random_state=101)

            v_class.fit(X_train,Y_train)

            print('Out-of-sample accuracy: %0.3f' % v_class.score(X_test,Y_test))
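To actually keep only the features with non-zero L1 coefficients, the fitted model can be wrapped in SelectFromModel; a minimal sketch under the same assumptions:

            from sklearn.feature_selection import SelectFromModel

            sfm = SelectFromModel(v_class, prefit=True)
            X_train_l1 = sfm.transform(X_train)

            print('Selected %d of %d features' % (X_train_l1.shape[1], X_train.shape[1]))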