Introduction

Being focused on a future project that is taking more time than expected, I'll simply dig into two new interesting libraries this time.

TPOT:

This library automates data preparation and training by using genetic algorithms to preprocess the data, create features and test models with different parameters. It has plenty of benefits, as it does this search in an automated way, but the training takes time and it doesn't consider neural networks. As an introduction, we will try it on a very small and well-known dataset.

Black:

This library rewrites Python files to follow the PEP 8 style rules. For this one, we will just apply it to a simple Python file and look at the outcome.

TPOT (Github)

As mentioned previously, TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming. It automates several steps of a typical pipeline: feature selection, feature preprocessing, feature construction, model selection and hyper-parameter optimization.

To try it, let's test it on a very simple dataset, the Wine dataset shipped with scikit-learn.

In [1]:
import pandas as pd
import numpy as np
from tpot import TPOTClassifier
from sklearn.datasets import load_wine
In [2]:
dataset = load_wine()
dataset.keys()
Out[2]:
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])

The dataset provided by sklearn is not a usual dataset (it is a dictionary-like Bunch object), so let's first build the DataFrames we usually work with.

In [3]:
X = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)

y = pd.Series(data = dataset.target)
y.name = "Label"
y = y.map({i: dataset.target_names[i] for i in range(3)}).to_frame()

y_ohe = pd.get_dummies(dataset.target)
y_ohe.columns = dataset.target_names
In [4]:
X.head()
Out[4]:
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue od280/od315_of_diluted_wines proline
0 14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065.0
1 13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050.0
2 13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185.0
3 14.37 1.95 2.50 16.8 113.0 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480.0
4 13.24 2.59 2.87 21.0 118.0 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735.0
In [5]:
y_ohe.head()
Out[5]:
class_0 class_1 class_2
0 1 0 0
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
In [6]:
y.head()
Out[6]:
Label
0 class_0
1 class_0
2 class_0
3 class_0
4 class_0

Now we can split the features and the labels into a training and a validation set.

In [7]:
from sklearn.model_selection import train_test_split

# X_train, X_test, y_train, y_test, y_train_ohe, y_test_ohe = train_test_split(X, y, y_ohe, test_size=0.25, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(dataset.data, dataset.target, test_size=0.25, random_state=42)

and now using the algorithm is very simple: there are only a few parameters to set. As the dataset is very easy, the population size chosen is small so that there is still a chance to see an improvement over the generations.

In [8]:
tpot = TPOTClassifier(generations=10, population_size=3, verbosity=2, n_jobs=1)

tpot.fit(X_train, y_train)
Warning: xgboost.XGBClassifier is not available and will not be used by TPOT.
Generation 1 - Current best internal CV score: 0.9629426129426129
Generation 2 - Current best internal CV score: 0.9700854700854702
Generation 3 - Current best internal CV score: 0.9700854700854702
Generation 4 - Current best internal CV score: 0.9700854700854702
Generation 5 - Current best internal CV score: 0.9703703703703704
Generation 6 - Current best internal CV score: 0.9703703703703704
Generation 7 - Current best internal CV score: 0.9703703703703704
Generation 8 - Current best internal CV score: 0.9703703703703704
Generation 9 - Current best internal CV score: 0.9703703703703704
Generation 10 - Current best internal CV score: 0.9777777777777779

Best pipeline: RandomForestClassifier(input_matrix, bootstrap=True, criterion=gini, max_features=0.9000000000000001, min_samples_leaf=2, min_samples_split=19, n_estimators=100)
Out[8]:
TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
        disable_update_check=False, early_stop=None, generations=10,
        max_eval_time_mins=5, max_time_mins=None, memory=None,
        mutation_rate=0.9, n_jobs=1, offspring_size=None,
        periodic_checkpoint_folder=None, population_size=3,
        random_state=None, scoring=None, subsample=1.0, use_dask=False,
        verbosity=2, warm_start=False)

We can see the improvement during training, and the best model is provided at the end. We can now test it on the validation set.

In [9]:
tpot.score(X_test, y_test)
Out[9]:
0.9777777777777777

The genetic algorithm succeeded in improving the pre-processing and the modeling to gain one more percent of accuracy. I wanted to try it with DataFrames but it doesn't work (a simple workaround is sketched at the end of this section). I also wanted to pre-process the labels into one-hot encoded vectors, but this is not supported either. To finish with this library, we can export the best pipeline for future training:

In [10]:
tpot.export('tpot_pipeline.py')
Out[10]:
True

and this is the returned pipeline:

In [ ]:
"""

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# NOTE: Make sure that the class is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1).values
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'].values, random_state=None)

# Average CV score on the training set was:0.9777777777777779
exported_pipeline = RandomForestClassifier(bootstrap=True, criterion="gini", max_features=0.9000000000000001, min_samples_leaf=2, min_samples_split=19, n_estimators=100)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)


"""

That's it for this library. It's a public library on GitHub with a fast-growing community, so we can expect new features in the coming months.

Black (Github)

Now let's talk about Black. This is not a library like the previous one: it can be installed with pip like any other package, but it is then used from the command line with:

black path/to/file.py --args
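
For reference, here are a few options that exist (see black --help of your installed version for the full list):

black my_folder/                          # format every .py file found under the folder
black --check path/to/file.py             # only report whether the file would be reformatted (handy for CI)
black --diff path/to/file.py              # print the changes without rewriting the file
black --line-length 100 path/to/file.py   # override the default limit of 88 characters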

Let's try it on the following code, used to train a Deep Q-Network with Prioritized Experience Replay. I modified it so that it no longer follows PEP 8.

In [ ]:
import numpy as np

class SumTree(object):
    """
    This SumTree code is modified version and the original code is from:
    https://github.com/jaara/AI-blog/blob/master/SumTree.py

    Story the data with it priority in tree and data frameworks.
    """
    data_pointer = 0
    def __init__(self, capacity):
        self.capacity = capacity  # for all priority values
        self.tree = np.zeros(2 * capacity - 1)
        # [--------------Parent nodes-------------][-------leaves to recode priority-------]
        #             size: capacity - 1                       size: capacity
        self.data = np.zeros(capacity, dtype=object)  # for all transitions
        # [--------------data frame-------------]
        #             size: capacity

    def add_new_priority(self, p, data):
        
        leaf_idx = self.data_pointer + self.capacity - 1
        self.data[self.data_pointer] = data  # update data_frame
        self.update(leaf_idx, p)  # update tree_frame
        self.data_pointer += 1
        
        if self.data_pointer >= self.capacity:  # replace when exceed the capacity
            self.data_pointer = 0

    def update(self, tree_idx, p):
        
        change = p - self.tree[tree_idx]

        self.tree[tree_idx] = p
        self._propagate_change(tree_idx, change)

    def _propagate_change(self, tree_idx, change):
        """change the sum of priority value in all parent nodes"""
        parent_idx = (tree_idx - 1) // 2
        self.tree[parent_idx] += change
        if parent_idx != 0:
            self._propagate_change(parent_idx, change)

    def get_leaf(self, lower_bound):
        leaf_idx = self._retrieve(lower_bound)  # search the max leaf priority based on the lower_bound
        data_idx = leaf_idx - self.capacity + 1
        return [leaf_idx, self.tree[leaf_idx], self.data[data_idx]]

    def _retrieve(self, lower_bound, parent_idx=0):
        """
        Tree structure and array storage:

        Tree index:
             0         -> storing priority sum
            / \
          1     2
         / \   / \
        3   4 5   6    -> storing priority for transitions

        Array type for storing:
        [0,1,2,3,4,5,6]
        """
        left_child_idx = 2 * parent_idx + 1
        right_child_idx = left_child_idx + 1

        if left_child_idx >= len(self.tree): return parent_idx # end search when no more child
        if self.tree[left_child_idx] == self.tree[right_child_idx]:
            return self._retrieve(lower_bound, np.random.choice([left_child_idx, right_child_idx]))
        if lower_bound <= self.tree[left_child_idx]:  # downward search, always search for a higher priority node
            return self._retrieve(lower_bound, left_child_idx)
        else:
            return self._retrieve(lower_bound - self.tree[left_child_idx], right_child_idx)

    @property
    def root_priority(self):
        return self.tree[0]  # the root

In [ ]:
import numpy as np


class SumTree(object):
    """
    This SumTree code is modified version and the original code is from:
    https://github.com/jaara/AI-blog/blob/master/SumTree.py

    Story the data with it priority in tree and data frameworks.
    """

    data_pointer = 0

    def __init__(self, capacity):
        self.capacity = capacity  # for all priority values
        self.tree = np.zeros(2 * capacity - 1)
        # [--------------Parent nodes-------------][-------leaves to recode priority-------]
        #             size: capacity - 1                       size: capacity
        self.data = np.zeros(capacity, dtype=object)  # for all transitions
        # [--------------data frame-------------]
        #             size: capacity

    def add_new_priority(self, p, data):

        leaf_idx = self.data_pointer + self.capacity - 1
        self.data[self.data_pointer] = data  # update data_frame
        self.update(leaf_idx, p)  # update tree_frame
        self.data_pointer += 1

        if self.data_pointer >= self.capacity:  # replace when exceed the capacity
            self.data_pointer = 0

    def update(self, tree_idx, p):

        change = p - self.tree[tree_idx]

        self.tree[tree_idx] = p
        self._propagate_change(tree_idx, change)

    def _propagate_change(self, tree_idx, change):
        """change the sum of priority value in all parent nodes"""
        parent_idx = (tree_idx - 1) // 2
        self.tree[parent_idx] += change
        if parent_idx != 0:
            self._propagate_change(parent_idx, change)

    def get_leaf(self, lower_bound):
        leaf_idx = self._retrieve(
            lower_bound
        )  # search the max leaf priority based on the lower_bound
        data_idx = leaf_idx - self.capacity + 1
        return [leaf_idx, self.tree[leaf_idx], self.data[data_idx]]

    def _retrieve(self, lower_bound, parent_idx=0):
        """
        Tree structure and array storage:

        Tree index:
             0         -> storing priority sum
            / \
          1     2
         / \   / \
        3   4 5   6    -> storing priority for transitions

        Array type for storing:
        [0,1,2,3,4,5,6]
        """
        left_child_idx = 2 * parent_idx + 1
        right_child_idx = left_child_idx + 1

        if left_child_idx >= len(self.tree):
            return parent_idx  # end search when no more child
        if self.tree[left_child_idx] == self.tree[right_child_idx]:
            return self._retrieve(
                lower_bound, np.random.choice([left_child_idx, right_child_idx])
            )
        if (
            lower_bound <= self.tree[left_child_idx]
        ):  # downward search, always search for a higher priority node
            return self._retrieve(lower_bound, left_child_idx)
        else:
            return self._retrieve(
                lower_bound - self.tree[left_child_idx], right_child_idx
            )

    @property
    def root_priority(self):
        return self.tree[0]  # the root

We can see that it already took care of several PEP 8 points, such as:

  • 2 blank lines before declaring a class
  • 1 blank line before declaring a method
  • no more than 88 characters per line

That last rule sometimes produces strange choices, like

if (
    lower_bound <= self.tree[left_child_idx]
):  # downward search, always search for a higher priority node

instead of

if (lower_bound <= self.tree[left_child_idx]):
    # downward search, always search for a higher priority node

but overall the result is fine. Several of these points are logical and should be applied by every developer anyway, and the tool helps keep all the code in a team with the same layout.
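
If a team wants to share the exact same settings across a project, the configuration can also be stored in the project's pyproject.toml file; a minimal sketch:

[tool.black]
line-length = 88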

Conclusion

This quick notebook presented two simple but useful libraries. The first one is mainly aimed at ML and simplifies some of the preparation steps. It's still a bit restricted as it doesn't work with DataFrames (we have to convert them to NumPy matrices first). It doesn't accept labeled targets (or at least it bugged when doing the scoring on the test set) nor one-hot encoded matrices. This will most probably evolve in the near future, so let's keep an eye on it.

Regarding Black, it's a more general-purpose tool which may be useful in a company to align code style between developers. Even if it applies simple rules, it has the benefit of formatting everything the same way. It is intentionally opinionated and only exposes a few options, mainly the maximum line length; it doesn't touch things like variable or class naming.