Presentation of the dataset

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
import graphviz

from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.tree import export_graphviz

%matplotlib inline

Let's first import the dataset and take a quick look at its content

In [2]:
dataset_raw = pd.read_csv("HR.csv")
In [3]:
print(dataset_raw.info(), "\n")
print(dataset_raw.head(), "\n")
print(dataset_raw.describe(), "\n")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
satisfaction_level       14999 non-null float64
last_evaluation          14999 non-null float64
number_project           14999 non-null int64
average_montly_hours     14999 non-null int64
time_spend_company       14999 non-null int64
Work_accident            14999 non-null int64
left                     14999 non-null int64
promotion_last_5years    14999 non-null int64
sales                    14999 non-null object
salary                   14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB
None 

   satisfaction_level  last_evaluation  number_project  average_montly_hours  \
0                0.38             0.53               2                   157   
1                0.80             0.86               5                   262   
2                0.11             0.88               7                   272   
3                0.72             0.87               5                   223   
4                0.37             0.52               2                   159   

   time_spend_company  Work_accident  left  promotion_last_5years  sales  \
0                   3              0     1                      0  sales   
1                   6              0     1                      0  sales   
2                   4              0     1                      0  sales   
3                   5              0     1                      0  sales   
4                   3              0     1                      0  sales   

   salary  
0     low  
1  medium  
2  medium  
3     low  
4     low   

       satisfaction_level  last_evaluation  number_project  \
count        14999.000000     14999.000000    14999.000000   
mean             0.612834         0.716102        3.803054   
std              0.248631         0.171169        1.232592   
min              0.090000         0.360000        2.000000   
25%              0.440000         0.560000        3.000000   
50%              0.640000         0.720000        4.000000   
75%              0.820000         0.870000        5.000000   
max              1.000000         1.000000        7.000000   

       average_montly_hours  time_spend_company  Work_accident          left  \
count          14999.000000        14999.000000   14999.000000  14999.000000   
mean             201.050337            3.498233       0.144610      0.238083   
std               49.943099            1.460136       0.351719      0.425924   
min               96.000000            2.000000       0.000000      0.000000   
25%              156.000000            3.000000       0.000000      0.000000   
50%              200.000000            3.000000       0.000000      0.000000   
75%              245.000000            4.000000       0.000000      0.000000   
max              310.000000           10.000000       1.000000      1.000000   

       promotion_last_5years  
count           14999.000000  
mean                0.021268  
std                 0.144281  
min                 0.000000  
25%                 0.000000  
50%                 0.000000  
75%                 0.000000  
max                 1.000000   

In [4]:
print(dataset_raw["left"].sum())
3571

So we know that our dataset contains 14999 employees: 3571 already left and 11428 are still in the company. It's a slightly imbalanced dataset, composed of about 76% of employees still in the company and 24% who left.
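
A quick way to check this class balance directly, as a minimal sketch:

print(dataset_raw["left"].value_counts())                # 11428 stayed, 3571 left
print(dataset_raw["left"].value_counts(normalize=True))  # roughly 0.76 / 0.24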

In terms of features, the dataset is already clean. Only 2 columns contain text, so the preparation will be easy.

Exploration and Preparation

In [5]:
dataset = dataset_raw.copy()
In [6]:
dataset_raw["salary"].unique()
Out[6]:
array(['low', 'medium', 'high'], dtype=object)

So we have 3 values corresponding to the salary ranges. We can convert them directly using a dict. I don't use a LabelEncoder here, so that the smallest value corresponds to the smallest salary.

In [7]:
convert_dict = {"high" : 3, "medium": 2, "low": 1}
dataset = dataset.replace({"salary": convert_dict})
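
To see why a LabelEncoder is a bad fit here: it assigns codes in alphabetical order, which would scramble the salary order. A quick illustration:

le = LabelEncoder()
print(le.fit_transform(["low", "medium", "high"]))  # [1 2 0]: "high" sorts first alphabetically
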
In [8]:
dataset_raw["sales"].unique()
Out[8]:
array(['sales', 'accounting', 'hr', 'technical', 'support', 'management',
       'IT', 'product_mng', 'marketing', 'RandD'], dtype=object)

For the department (the sales column), a LabelEncoder makes less sense because those values are not ordered. Previously with salary, a higher code meant a higher salary; here we cannot say that marketing is "greater than" accounting, for example. In such cases we can use one-hot encoding, either with scikit-learn's OneHotEncoder or with get_dummies from pandas.

In [9]:
dataset = pd.get_dummies(dataset)
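
For reference, a sketch of the scikit-learn equivalent, encoding only the "sales" column (note the keyword is sparse=False on older versions and sparse_output=False on scikit-learn >= 1.2):

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse=False)  # sparse_output=False on recent versions
sales_onehot = ohe.fit_transform(dataset_raw[["sales"]])
print(ohe.categories_)  # the 10 department names, one column each
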

Now our dataset is fully numeric and clean, and we can start to explore it, beginning with the correlation matrix.

In [10]:
f, ax = plt.subplots(figsize=(15, 15))
corr = dataset.corr()
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True),
            square=True, ax=ax, annot=True)
plt.show()

We can see that whether an employee leaves correlates mainly with:

  • the satisfaction: if the employee is bored at work, they will probably look for another job
  • the evaluation: this one is a bit less obvious. With a bad evaluation, you may think you are not made for this job and want to leave, but if it's very high, you may expect to change position
  • the number of projects: with too many projects you can get tired of the position, and without enough you may be bored

Let's quickly explore whether there is a pattern

In [11]:
g = sns.PairGrid(dataset_raw[["satisfaction_level", "last_evaluation", "number_project"]])
g.map(plt.scatter);
In [12]:
satisfaction_left = dataset[dataset["left"] == 1]["satisfaction_level"]
print(satisfaction_left.mean())
satisfaction_remain =  dataset[dataset["left"] == 0]["satisfaction_level"]
print(satisfaction_remain.mean())
0.44009801176140917
0.666809590479516

We can see that the satisfaction of people who left the company is below the average, which makes sense. We can also plot a histogram of the distribution.

In [13]:
x = 100*satisfaction_left
y = 100*satisfaction_remain
nb_bins = 30

fig, ax = plt.subplots()
for a in [x, y]:
    sns.distplot(a, bins=range(0, 100, 100//nb_bins), ax=ax, kde=False)
ax.set_xlim([0, 100])
plt.show()

As expected, people with no satisfaction left, and people with low satisfaction mostly left. We can also see that some highly satisfied employees left, probably because they found a better position elsewhere in the meantime. We can do the same with the last evaluation.

In [14]:
eval_left = dataset[dataset["left"] == 1]["last_evaluation"]
print(eval_left.mean())
eval_remain =  dataset[dataset["left"] == 0]["last_evaluation"]
print(eval_remain.mean())

x = 100*eval_left
y = 100*eval_remain
nb_bins = 30

fig, ax = plt.subplots()
for a in [x, y]:
    sns.distplot(a, bins=range(0, 100, 100//nb_bins), ax=ax, kde=False)
ax.set_xlim([0, 100])
plt.show()
0.7181125735088183
0.7154733986699274

This is also split into 2 buckets: one for people with a "bad" evaluation, where around a third of the people left, and one for talent (evaluation above 0.75) who left anyway. The latter bucket probably overlaps with the satisfied employees who found a better place elsewhere. The counts are below.

In [15]:
talent_left = dataset[(dataset["left"] == 1) & (dataset["last_evaluation"] > 0.75)]["left"].count()
print(talent_left)
satisfied_left = dataset[(dataset["left"] == 1) & (dataset["satisfaction_level"] > 0.75)]["left"].count()
print(satisfied_left)
1869
768

So 1869 people who left had an evaluation score above 0.75, and 768 had a satisfaction above 0.75. We can also check the impact of the number of projects and of the salary.

In [16]:
proj_left = dataset[dataset["left"] == 1]["number_project"]
print(proj_left.mean())
proj_remain = dataset[dataset["left"] == 0]["number_project"]
print(proj_remain.mean())

x = proj_left
y = proj_remain

fig, ax = plt.subplots()
for a in [x, y]:
    sns.distplot(a, bins=range(0, 10, 1), ax=ax, kde=False)
ax.set_xlim([0, 10])
plt.show()
3.8555026603192384
3.786664333216661

We can see that people with few projects often left, probably because they are bored, or are not performing well in the position and work more slowly. There is also a bucket with a high number of projects; leaving there is more likely due to discouragement.

In [17]:
salary_left = dataset[dataset["left"] == 1]["salary"]
print(salary_left.mean())
salary_remain = dataset[dataset["left"] == 0]["salary"]
print(salary_remain.mean())

x = salary_left
y = salary_remain

fig, ax = plt.subplots()
for a in [x, y]:
    sns.distplot(a, ax=ax, kde=False)
plt.show()
1.4147297675721087
1.6509450472523626
In [18]:
promo_left = dataset[dataset["left"] == 1]["promotion_last_5years"]
promo_remain = dataset[dataset["left"] == 0]["promotion_last_5years"]

a = dataset[(dataset["left"] == 1) & (dataset["promotion_last_5years"] == 1)]["promotion_last_5years"].count()
b = dataset[(dataset["left"] == 0) & (dataset["promotion_last_5years"] == 1)]["promotion_last_5years"].count()

print(a/(a+b))

a = dataset[(dataset["left"] == 1) & (dataset["promotion_last_5years"] == 0)]["promotion_last_5years"].count()
b = dataset[(dataset["left"] == 0) & (dataset["promotion_last_5years"] == 0)]["promotion_last_5years"].count()

print(a/(a+b))

x = promo_left
y = promo_remain

fig, ax = plt.subplots()
for a in [x, y]:
    sns.distplot(a, ax=ax, kde=False)
plt.show()
0.0595611285266
0.241961852861

We can see that there are only a few high salaries, and among those employees nearly no one leaves. For low salaries around a third leave, and for medium salaries it's closer to a quarter. Among the people who got a promotion in the last 5 years, only about 6% left; among those who didn't, about 24% left.
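
These rates can also be read off directly with a crosstab, a more compact alternative to the manual boolean masks above:

# leave rate per salary band and per promotion flag (rows sum to 1)
print(pd.crosstab(dataset_raw["salary"], dataset_raw["left"], normalize="index"))
print(pd.crosstab(dataset_raw["promotion_last_5years"], dataset_raw["left"], normalize="index"))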

Now that we know the dataset better, we can set up some models.

Setting up the models

In this section, we are going to apply several classifiers and evaluate them on accuracy (keeping in mind that a dummy classifier can easily reach about 76% accuracy due to the class imbalance). But first of all, we extract a few rows of people remaining in the company for the final evaluation.

In [19]:
# keep aside 50 random employees who stayed, for a final check later
temp = dataset.index[dataset["left"] == 0].tolist()
idx = random.sample(temp, 50)
X_eval = dataset.iloc[idx].drop("left", axis=1)

dataset = dataset.drop(dataset.index[idx])

We can now scale the dataset and prepare our train and test sets.

In [20]:
X = dataset.drop("left", axis=1)
y = dataset["left"]
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
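
Note that X_eval was set aside before scaling, so any later prediction on it has to reuse the fitted scaler rather than refitting it:

# reuse the already-fitted scaler; refitting on X_eval would produce different statistics
X_eval_scaled = scaler.transform(X_eval)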
In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
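
To put the accuracy figures in context, here is a quick sketch of the majority-class baseline mentioned above (DummyClassifier is not among the imports at the top):

from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy="most_frequent")  # always predicts "stays"
dummy.fit(X_train, y_train)
print(dummy.score(X_test, y_test))  # around 0.76, given the class imbalance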

And finally, we can benchmark some classical classifiers.

In [22]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(acc)
print(confusion_matrix(y_test, model.predict(X_test)))
0.976588628763
[[2241   51]
 [  19  679]]
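
The same fit/score pattern extends to the other classifiers imported at the top. A minimal benchmark loop, with default hyperparameters so the scores are only indicative:

for clf in [RandomForestClassifier(), KNeighborsClassifier(), SGDClassifier()]:
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))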

With the DecisionTree, we can also display the decision logic using graphviz. The resulting tree is huge, but anybody could use it to make a prediction by hand.

In [23]:
header = list(dataset.drop("left", axis=1).columns)  # feature names must exclude the target
export_graphviz(model, out_file="mytree.dot", feature_names=header)
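
The next cell loads a PNG that was presumably produced from mytree.dot outside the notebook (e.g. with dot -Tpng mytree.dot -o image.png). Alternatively, the graphviz package imported at the top can render the file inline; a minimal sketch:

with open("mytree.dot") as f:
    tree_graph = graphviz.Source(f.read())
tree_graph  # displays the tree inline in Jupyter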
In [24]:
from IPython.display import Image
Image("image.png")
Out[24]: