Credit Card Fraud Prediction: SMOTE and Model Explainability with SHAP Values

Omkar Gawade
4 min read · Sep 27, 2022
Image by Nathan Dumlao on Unsplash

Credit card fraud detection is a common machine learning problem. The catch is that fraudulent cases are a tiny fraction of all transactions, which creates a class imbalance that produces poor models. In this analysis we will walk step by step through solving class imbalance with SMOTE, tuning hyperparameters, and explaining the model with Shapley (SHAP) values.

Dataset:

The dataset for the analysis was downloaded from Kaggle. It contains credit card transactions whose features are components produced by PCA, along with the untransformed Amount and Time columns. The dataset is already clean: there are no nulls, and the features have no major correlation with each other.

Image by author

Class imbalance:

Check for class imbalance in the target variable ‘Class’:

Image by author

Here we can see that class 1 (fraud) makes up only ~0.17% of the data. This seriously distorts metrics like accuracy, which is not a good measure for imbalanced datasets.

Imagine there are 100 data points, 99 labeled “class 0” and 1 labeled “class 1”. A model that always predicts “class 0” scores 99% accuracy, yet that number is misleading: the model has simply learned to predict the majority class and never catches the single positive case.
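That pitfall is easy to demonstrate in a few lines (a toy example with made-up labels, not the Kaggle data):

```python
import numpy as np

# 100 labels: 99 legitimate (0), 1 fraudulent (1)
y_true = np.array([0] * 99 + [1])

# A "model" that always predicts the majority class
y_pred = np.zeros(100, dtype=int)

accuracy = (y_true == y_pred).mean()
print(accuracy)  # 0.99 -- looks great, but the one fraud case was missed

# Recall on the fraud class exposes the problem
fraud_recall = (y_pred[y_true == 1] == 1).mean()
print(fraud_recall)  # 0.0
```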

Create train and test datasets:

Image by author
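Since the original split code is only a screenshot, here is a minimal sketch of the same step. The DataFrame below is a synthetic stand-in for the Kaggle data (the real file has PCA components V1–V28 plus Time, Amount, and the target column ‘Class’); `stratify=y` keeps the fraud ratio the same in both splits:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Kaggle credit card data
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["V1", "V2", "Amount"])
df["Class"] = (rng.random(1000) < 0.02).astype(int)  # ~2% minority class

X = df.drop(columns="Class")
y = df["Class"]

# stratify=y preserves the class ratio in train and test
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())  # similar fraud rates in both splits
```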

Synthetic Minority Over-sampling Technique (SMOTE):

There are various methods for addressing class imbalance, such as random oversampling and random under-sampling; SMOTE is by far one of the most effective.

SMOTE solves the class imbalance problem by creating synthetic samples. Unlike plain oversampling, these are not duplicates: each synthetic point is interpolated between an existing minority-class point and one of its nearest minority-class neighbors.

We will be using the imblearn library to perform SMOTE on the data. Below is the code you’ll need to resample x_train and y_train to overcome the class imbalance.

Image by author

Fitting an XGBoost classifier on the SMOTE-resampled data using RandomizedSearchCV hyperparameter tuning

We will pip install the xgboost library and import XGBClassifier.

Image by author

Hyperparameter tuning: this is the step where we fine-tune the model to get better performance. There are several methods for hyperparameter tuning, such as:

  1. GridSearchCV: exhaustively tests every combination of the specified hyperparameters and returns the best-performing model.
  2. RandomizedSearchCV: samples random combinations of the specified hyperparameters and returns the best-performing model found.
  3. Bayesian optimization: uses the results of previous hyperparameter runs to steer the search toward better performance (e.g., minimizing RMSE in a regression problem or increasing AUC in a binary classification problem).

In our case we will be utilizing RandomizedSearchCV for our hyperparameter tuning. Below are the lists of hyperparameters I specified, along with the code:

Image by author

After the random search finishes, we can grab the hyperparameters of the best estimator:

Image by author

Fitting the XGBoost classifier with the best estimator’s parameters

Image by author

Getting classification metrics for the model

Image by author

SHAP Analysis for Model Explainability:

Shapley values come from cooperative game theory: a feature’s contribution is measured by how much the prediction changes, on average, when that feature is added to every possible subset of the other features. In a nutshell, SHAP values quantify the contribution each feature makes to the prediction.

Below is the code you would need to compute SHAP values:

Image by author

Summary plot for shap values:

The summary plot gives an overall view of which features have the most impact on the prediction, ordered in descending order of importance.

Image by author

We can also get a per-row visualization showing the impact each feature has on pushing the prediction toward 1 or 0.

For example, let’s pick a random row and plot its SHAP waterfall graph:

Image by author

Furthermore, we can get a DataFrame of SHAP values for all rows for further analysis and visualization.

If you enjoyed this story, please give me a follow here and connect with me on LinkedIn: https://www.linkedin.com/in/omkar-gawade/


Omkar Gawade

Sr. Data Scientist @ Sourceday | Data professional with 5+ years of experience in analytics, visualization, and modeling.