Credit Card Fraud

Presentation of the dataset is available on Kaggle

The datasets contains transactions made by credit cards in September 2013 by european cardholders. This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA, the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction Amount, this feature can be used for example-dependant cost-senstive learning. Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.

Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.

The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on http://mlg.ulb.ac.be/BruFence and http://mlg.ulb.ac.be/ARTML

Source: Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.decomposition import PCA, IncrementalPCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import recall_score, confusion_matrix, precision_score, accuracy_score
from sklearn.linear_model import SGDClassifier
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier


%matplotlib inline

Dataset Exploration

First let's explore quickly the dataset. I won't explain all what we have as it is described on above

In [2]:
dataset = pd.read_csv("creditcard.csv")
In [3]:
print(dataset.head())
print(dataset.describe())
print(dataset.info())
   Time        V1        V2        V3        V4        V5        V6        V7  \
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9  ...         V21       V22       V23       V24  \
0  0.098698  0.363787  ...   -0.018307  0.277838 -0.110474  0.066928   
1  0.085102 -0.255425  ...   -0.225775 -0.638672  0.101288 -0.339846   
2  0.247676 -1.514654  ...    0.247998  0.771679  0.909412 -0.689281   
3  0.377436 -1.387024  ...   -0.108300  0.005274 -0.190321 -1.175575   
4 -0.270533  0.817739  ...   -0.009431  0.798278 -0.137458  0.141267   

        V25       V26       V27       V28  Amount  Class  
0  0.128539 -0.189115  0.133558 -0.021053  149.62      0  
1  0.167170  0.125895 -0.008983  0.014724    2.69      0  
2 -0.327642 -0.139097 -0.055353 -0.059752  378.66      0  
3  0.647376 -0.221929  0.062723  0.061458  123.50      0  
4 -0.206010  0.502292  0.219422  0.215153   69.99      0  

[5 rows x 31 columns]
                Time            V1            V2            V3            V4  \
count  284807.000000  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05   
mean    94813.859575  3.919560e-15  5.688174e-16 -8.769071e-15  2.782312e-15   
std     47488.145955  1.958696e+00  1.651309e+00  1.516255e+00  1.415869e+00   
min         0.000000 -5.640751e+01 -7.271573e+01 -4.832559e+01 -5.683171e+00   
25%     54201.500000 -9.203734e-01 -5.985499e-01 -8.903648e-01 -8.486401e-01   
50%     84692.000000  1.810880e-02  6.548556e-02  1.798463e-01 -1.984653e-02   
75%    139320.500000  1.315642e+00  8.037239e-01  1.027196e+00  7.433413e-01   
max    172792.000000  2.454930e+00  2.205773e+01  9.382558e+00  1.687534e+01   

                 V5            V6            V7            V8            V9  \
count  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05   
mean  -1.552563e-15  2.010663e-15 -1.694249e-15 -1.927028e-16 -3.137024e-15   
std    1.380247e+00  1.332271e+00  1.237094e+00  1.194353e+00  1.098632e+00   
min   -1.137433e+02 -2.616051e+01 -4.355724e+01 -7.321672e+01 -1.343407e+01   
25%   -6.915971e-01 -7.682956e-01 -5.540759e-01 -2.086297e-01 -6.430976e-01   
50%   -5.433583e-02 -2.741871e-01  4.010308e-02  2.235804e-02 -5.142873e-02   
75%    6.119264e-01  3.985649e-01  5.704361e-01  3.273459e-01  5.971390e-01   
max    3.480167e+01  7.330163e+01  1.205895e+02  2.000721e+01  1.559499e+01   

           ...                 V21           V22           V23           V24  \
count      ...        2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05   
mean       ...        1.537294e-16  7.959909e-16  5.367590e-16  4.458112e-15   
std        ...        7.345240e-01  7.257016e-01  6.244603e-01  6.056471e-01   
min        ...       -3.483038e+01 -1.093314e+01 -4.480774e+01 -2.836627e+00   
25%        ...       -2.283949e-01 -5.423504e-01 -1.618463e-01 -3.545861e-01   
50%        ...       -2.945017e-02  6.781943e-03 -1.119293e-02  4.097606e-02   
75%        ...        1.863772e-01  5.285536e-01  1.476421e-01  4.395266e-01   
max        ...        2.720284e+01  1.050309e+01  2.252841e+01  4.584549e+00   

                V25           V26           V27           V28         Amount  \
count  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05  284807.000000   
mean   1.453003e-15  1.699104e-15 -3.660161e-16 -1.206049e-16      88.349619   
std    5.212781e-01  4.822270e-01  4.036325e-01  3.300833e-01     250.120109   
min   -1.029540e+01 -2.604551e+00 -2.256568e+01 -1.543008e+01       0.000000   
25%   -3.171451e-01 -3.269839e-01 -7.083953e-02 -5.295979e-02       5.600000   
50%    1.659350e-02 -5.213911e-02  1.342146e-03  1.124383e-02      22.000000   
75%    3.507156e-01  2.409522e-01  9.104512e-02  7.827995e-02      77.165000   
max    7.519589e+00  3.517346e+00  3.161220e+01  3.384781e+01   25691.160000   

               Class  
count  284807.000000  
mean        0.001727  
std         0.041527  
min         0.000000  
25%         0.000000  
50%         0.000000  
75%         0.000000  
max         1.000000  

[8 rows x 31 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 284807 entries, 0 to 284806
Data columns (total 31 columns):
Time      284807 non-null float64
V1        284807 non-null float64
V2        284807 non-null float64
V3        284807 non-null float64
V4        284807 non-null float64
V5        284807 non-null float64
V6        284807 non-null float64
V7        284807 non-null float64
V8        284807 non-null float64
V9        284807 non-null float64
V10       284807 non-null float64
V11       284807 non-null float64
V12       284807 non-null float64
V13       284807 non-null float64
V14       284807 non-null float64
V15       284807 non-null float64
V16       284807 non-null float64
V17       284807 non-null float64
V18       284807 non-null float64
V19       284807 non-null float64
V20       284807 non-null float64
V21       284807 non-null float64
V22       284807 non-null float64
V23       284807 non-null float64
V24       284807 non-null float64
V25       284807 non-null float64
V26       284807 non-null float64
V27       284807 non-null float64
V28       284807 non-null float64
Amount    284807 non-null float64
Class     284807 non-null int64
dtypes: float64(30), int64(1)
memory usage: 67.4 MB
None
In [4]:
dataset['Class'].value_counts()
Out[4]:
0    284315
1       492
Name: Class, dtype: int64

Let's now explore the repartition of the amount of each transactions

In [5]:
plt.hist(dataset['Amount'], bins=50)
plt.show()

As we can imagine mainly all transaction are below 1500 \$. In order to reduce the range of amount we can check how many frauds we have above 3000 \$

In [6]:
dataset[(dataset['Amount'] > 3000) & (dataset['Class']==1)]
Out[6]:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V21 V22 V23 V24 V25 V26 V27 V28 Amount Class

0 rows × 31 columns

Good point, there is none so we can remove transaction with an amoutn above 3000 \$. After we can check the repartition of frauds based on the amount and the cost for the bank as Warranty

In [7]:
dataset = dataset[dataset['Amount'] < 3000]
In [8]:
fraud = dataset[dataset['Class']==1]
plt.hist(fraud['Amount'], bins=50)
plt.show()
In [9]:
bins = 50
Amount_max = 3000

Y = []
C = []
X = list(range(0, Amount_max, bins))
for i in X:
    s = fraud[(fraud['Amount'] > i) & (fraud['Amount'] <= i + bins)]['Amount'].sum()
    Y.append(s)
    if len(C) > 0:
        c = C[-1] + s
    else:
        c = s
    C.append(c)
    print("{} => {} $ - {}".format(i, s, c))

plt.bar(X, Y, width=50)
plt.ylabel('Cost')
plt.title('Cost of Frauds per amount')
plt.show()

plt.plot(X, C)
plt.show()
0 => 2073.9300000000007 $ - 2073.9300000000007
50 => 4945.329999999998 $ - 7019.259999999998
100 => 3646.77 $ - 10666.029999999999
150 => 2496.0499999999997 $ - 13162.079999999998
200 => 2519.32 $ - 15681.399999999998
250 => 2983.95 $ - 18665.35
300 => 4872.54 $ - 23537.89
350 => 2189.02 $ - 25726.91
400 => 870.5699999999999 $ - 26597.48
450 => 2335.59 $ - 28933.07
500 => 2622.46 $ - 31555.53
550 => 1164.38 $ - 32719.91
600 => 2518.13 $ - 35238.04
650 => 667.55 $ - 35905.590000000004
700 => 5063.5199999999995 $ - 40969.11
750 => 1543.19 $ - 42512.3
800 => 2456.7599999999998 $ - 44969.060000000005
850 => 0 $ - 44969.060000000005
900 => 925.31 $ - 45894.37
950 => 996.27 $ - 46890.64
1000 => 0 $ - 46890.64
1050 => 1096.99 $ - 47987.63
1100 => 0 $ - 47987.63
1150 => 0 $ - 47987.63
1200 => 1218.89 $ - 49206.52
1250 => 0 $ - 49206.52
1300 => 1335.0 $ - 50541.52
1350 => 2743.81 $ - 53285.329999999994
1400 => 1402.16 $ - 54687.49
1450 => 0 $ - 54687.49
1500 => 1504.93 $ - 56192.42
1550 => 0 $ - 56192.42
1600 => 0 $ - 56192.42
1650 => 0 $ - 56192.42
1700 => 0 $ - 56192.42
1750 => 0 $ - 56192.42
1800 => 1809.68 $ - 58002.1
1850 => 0 $ - 58002.1
1900 => 0 $ - 58002.1
1950 => 0 $ - 58002.1
2000 => 0 $ - 58002.1
2050 => 0 $ - 58002.1
2100 => 2125.87 $ - 60127.97
2150 => 0 $ - 60127.97
2200 => 0 $ - 60127.97
2250 => 0 $ - 60127.97
2300 => 0 $ - 60127.97
2350 => 0 $ - 60127.97
2400 => 0 $ - 60127.97
2450 => 0 $ - 60127.97
2500 => 0 $ - 60127.97
2550 => 0 $ - 60127.97
2600 => 0 $ - 60127.97
2650 => 0 $ - 60127.97
2700 => 0 $ - 60127.97
2750 => 0 $ - 60127.97
2800 => 0 $ - 60127.97
2850 => 0 $ - 60127.97
2900 => 0 $ - 60127.97
2950 => 0 $ - 60127.97

So we can see that most of frauds are below 500 \$. Nevertheless in term of cost, all fraud below 500 \$ cost 31500 \$ to the bank (50% of the total cost of frauds). If we want to avoid around 90% of fraud costs, we should consider frauds up to 1500 \$

Model simplification

We clearly have a unbalanced dataset as we have only 0.17% of frauds. One good thing to do in such case is to try some Random Undersampling

For now, we only explore the dataset, the dataset will be splitted for evaluation later

In [10]:
n_non_fraud = [100, 1000, 10000, 100000, dataset[dataset["Class"] == 0]["Class"].count()]         # min : 1 - max : 284807-492
n_components = 3
print(n_non_fraud)
[100, 1000, 10000, 100000, 284026]
In [11]:
for sample_size in n_non_fraud:
    a = dataset[dataset["Class"] == 1]                           # we keep all frauds
    b = dataset[dataset["Class"] == 0].sample(sample_size)       # we take "sample_size" non fraud to balance the ratio fraud/non_fraud

    dataset_us = pd.concat([a, b]).sample(frac=1)                   # merge and shuffle both dataset
    
    y = dataset_us["Class"]
    X = dataset_us.drop(["Time", "Class"], axis=1)
    
    X_scale = StandardScaler().fit_transform(X)
    X_proj = PCA(n_components=n_components).fit_transform(X_scale)
    
    plt.scatter(X_proj[:, 0], X_proj[:, 1], s=X_proj[:, 2], c=y)

    plt.xlabel("PCA1")
    plt.ylabel("PCA2")
    plt.title("{}-points".format(sample_size))
    #plt.savefig("{}-points".format(sample_size), dpi=600)
    plt.show()
C:\Anaconda\lib\site-packages\matplotlib\collections.py:877: RuntimeWarning: invalid value encountered in sqrt
  scale = np.sqrt(self._sizes) * dpi / 72.0 * self._factor

With 100 and 1000 non-frauds, we can see that non fraud are packed but some fraud are also grouped. With 10k and there is still some yellow points included in violet ones. With the full dataset, the reduction is useless as we packed all points.

Nevertheless, with 100000 points, we have a nice split in 2 dimensions. We can fix this value to fit the PCA and use it in the full datraset afterward.

In [12]:
# fit the PCA with 100k non-frauds
a = dataset[dataset["Class"] == 1]
b = dataset[dataset["Class"] == 0].sample(100000)

dataset = pd.concat([a, b]).sample(frac=1)

y = dataset["Class"]
X = dataset.drop(["Time", "Class"], axis=1)

X_scale = StandardScaler().fit_transform(dataset)
pca = PCA(n_components=0.95, svd_solver="full")
X_proj = pca.fit(X_scale)

# transform the full dataset with the pca create previously
dataset = pd.read_csv("creditcard.csv")
y = dataset["Class"]
X = dataset.drop(["Time", "Class"], axis=1)

X_scale = StandardScaler().fit_transform(dataset)
X_proj = pca.transform(X_scale)

Setting up a model

Above instead of keeping only the 3 main dimensions, we reduce dimensions until having 5% loss. We can check how many features we have :

In [13]:
print(X_proj.shape)
(284807, 27)

Unfortunately, we drop only 2 additionnal dimensions but it's better than nothing. We can check also that our reduction still allow a nice split.

In [14]:
plt.scatter(X_proj[:, 0], X_proj[:, 1], s=X_proj[:, 2], c=y)

plt.xlabel("PCA1")
plt.ylabel("PCA2")
plt.title("{}-points".format(X_proj.shape[0]))
#plt.savefig("{}-points".format(sample_size), dpi=600)
plt.show()
C:\Anaconda\lib\site-packages\matplotlib\collections.py:877: RuntimeWarning: invalid value encountered in sqrt
  scale = np.sqrt(self._sizes) * dpi / 72.0 * self._factor

For this model, it would be bad to use a standard split as we have an unbalanced dataset (492 frauds for 280k non-frauds). In such case we should definitely go for a StratifiedKFold with let say 5 folds to have around 100 frauds in each fold.

For now we gonna try some classification model and our target won't be the count of good guess. In this exercice it makes no sense as we can easily reach 99.8% as we only have 0.17% fraud in total. A classifier saying non fraud everytime would get 99.8%.

Instead our score will be the number of non detected frauds (False Negative). So we must maximise the Precision

Just as reminder, confusion matrix is :

\begin{vmatrix} Non\_Fraud\_detected\_as\_non\_fraud & Fraud\_detected\_as\_non\_fraud \\ Non\_Fraud\_detected\_as\_Fraud & Fraud\_detected\_as\_Fraud \end{vmatrix}
In [15]:
skf = StratifiedKFold(n_splits=5, shuffle=True)  #shuffle is required to avoid having unbalance folds
sgd_clf = SGDClassifier()
for train_index, test_index in skf.split(X_proj, y):
    clone_clf = clone(sgd_clf)
    X_train, X_test = X_proj[train_index], X_proj[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clone_clf.fit(X_train, y_train)
    y_pred = clone_clf.predict(X_test)
    recall = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    print(recall, precision)
    print(confusion_matrix(y_test, y_pred))
0.878787878788 1.0
[[56863     0]
 [   12    87]]
0.929292929293 0.938775510204
[[56857     6]
 [    7    92]]
0.867346938776 0.894736842105
[[56853    10]
 [   13    85]]
0.795918367347 0.962962962963
[[56860     3]
 [   20    78]]
0.877551020408 0.945054945055
[[56858     5]
 [   12    86]]

So with Stochastic Gradient Descent Classifier, we reach 95% of precision in average which is not so bad. We can try other models

In [16]:
skf = StratifiedKFold(n_splits=5, shuffle=True)  #shuffle is required to avoid having unbalance folds
tree_clf = DecisionTreeClassifier(max_depth=7)
for train_index, test_index in skf.split(X_proj, y):
    clone_clf = clone(tree_clf)
    X_train, X_test = X_proj[train_index], X_proj[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clone_clf.fit(X_train, y_train)
    y_pred = clone_clf.predict(X_test)
    recall = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    print(recall, precision)
    print(confusion_matrix(y_test, y_pred))
1.0 0.942857142857
[[56857     6]
 [    0    99]]
0.989898989899 1.0
[[56863     0]
 [    1    98]]
0.989795918367 0.989795918367
[[56862     1]
 [    1    97]]
0.989795918367 0.989795918367
[[56862     1]
 [    1    97]]
0.989795918367 1.0
[[56863     0]
 [    1    97]]

With as Tree Classifier of depth 5 we have around 96% of precision but with a depth of 7, we nearly miss no fraud AND we nearly have no Non_fraud detected as Fraud so we won't have to check a lot of transactions every day. This model is definitely well suited for this work

In [17]:
# skf = StratifiedKFold(n_splits=5, shuffle=True)  #shuffle is required to avoid having unbalance folds
# svc_clf = SVC(gamma=2, C=1)
# for train_index, test_index in skf.split(X_proj, y):
#     clone_clf = clone(svc_clf)
#     X_train, X_test = X_proj[train_index], X_proj[test_index]
#     y_train, y_test = y[train_index], y[test_index]
#     clone_clf.fit(X_train, y_train)
#     y_pred = clone_clf.predict(X_test)
#     recall = recall_score(y_test, y_pred)
#     precision = precision_score(y_test, y_pred)
#     print(recall, precision)
#     print(confusion_matrix(y_test, y_pred))
In [18]:
skf = StratifiedKFold(n_splits=5, shuffle=True)  #shuffle is required to avoid having unbalance folds
mlp_clf = MLPClassifier(hidden_layer_sizes=(50, 20))
for train_index, test_index in skf.split(X_proj, y):
    clone_clf = clone(mlp_clf)
    X_train, X_test = X_proj[train_index], X_proj[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clone_clf.fit(X_train, y_train)
    y_pred = clone_clf.predict(X_test)
    recall = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    print(recall, precision)
    print(confusion_matrix(y_test, y_pred))
0.989898989899 1.0
[[56863     0]
 [    1    98]]
1.0 0.980198019802
[[56861     2]
 [    0    99]]
0.989795918367 1.0
[[56863     0]
 [    1    97]]
1.0 1.0
[[56863     0]
 [    0    98]]
1.0 1.0
[[56863     0]
 [    0    98]]

The MLPClassifier give slightly better than Tree Classifier. It maybe improve by changing topology but we currently have only 1 fraud non detected and we this that 2 non-fraud are detected as fraud. This is really great because that means we won't have refunds to do to victims and we won't also need employee to check some possible frauds.

Conclusion

By using the MLPClassifier, we can detect nearly all frauds and having nearly no False Positive. There is just a need to prepare all data first in the StandardScaler and in the PCA. Just to finish let's compute the score on the whole dataset

In [19]:
best_model = clone_clf
y_pred = best_model.predict(X_proj)
In [20]:
print("Accuracy score : {}".format(accuracy_score(y, y_pred)))
print("Precision score : {}".format(precision_score(y, y_pred)))
print("Recall score : {}".format(recall_score(y, y_pred)))
print("Confusion Matrix : {}".format(confusion_matrix(y, y_pred)))
Accuracy score : 1.0
Precision score : 1.0
Recall score : 1.0
Confusion Matrix : [[284315      0]
 [     0    492]]