In [ ]:

```
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
import numpy as np
from sklearn.decomposition import PCA
```

First, we have to create the model. It will contain an input layer (one feature per pixel), two hidden layers (HL1 and HL2), and an output layer (one unit per class).

By default, the sizes will be 300 units for HL1 and 100 for HL2.

In [ ]:

```
input_features = 28*28
size_hidden_layer_1 = 300
size_hidden_layer_2 = 100
size_output_layer = 10
```

Now, we will create the placeholders for the input and the known targets.

In [ ]:

```
# Creation Graph
X = tf.placeholder(tf.float32, shape=(None, input_features), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')
```

In [ ]:

```
hidden_layer_1 = tf.layers.dense(X, size_hidden_layer_1, name="hidden1", activation=tf.nn.relu)
hidden_layer_2 = tf.layers.dense(hidden_layer_1, size_hidden_layer_2, name="hidden2", activation=tf.nn.relu)
output_layer = tf.layers.dense(hidden_layer_2, size_output_layer, name="output")
```

Unfortunately, this model is not the best one we can build in general, and that is also true for this example. For deep neural networks (which is not really the case here with only two hidden layers), there is a risk of the vanishing/exploding gradient problem, which makes training take longer. But research offers several solutions to avoid it and train the network faster:

Normalization of inputs (called Batch Normalization): the process is to center and scale the data between every layer. This takes computing time but, overall, helps train the NN faster.
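The center-and-scale operation behind Batch Normalization can be sketched in plain NumPy (a simplified version: `gamma` and `beta` are the learnable scale/shift parameters, held fixed here, and the moving averages used at inference time are omitted):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Center and scale each feature over the batch dimension,
    # then apply the learnable scale (gamma) and shift (beta).
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0]])
normalized = batch_norm(batch)
# Each column now has (approximately) zero mean and unit variance.
```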

Tweaking the activation function: there are a lot of activation functions. In this jungle of functions, one variant of ReLU provides the best results overall: the ELU (Exponential Linear Unit). It is a bit slower to compute than Leaky ReLU or plain ReLU but avoids dying cells.
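The two functions are easy to compare side by side: ReLU clips everything below zero (which is how cells "die"), while ELU decays smoothly towards `-alpha` for negative inputs, so the unit keeps a non-zero gradient. A quick NumPy sketch:

```python
import numpy as np

def relu(x):
    # Plain ReLU: negative inputs are clipped to exactly 0.
    return np.maximum(x, 0.0)

def elu(x, alpha=1.0):
    # ELU: identity for positive inputs, alpha*(exp(x)-1) for
    # negative ones, saturating smoothly at -alpha.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))  # negative inputs become 0
print(elu(x))   # negative inputs stay informative (non-zero)
```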

These two improvements are added in the code below, and the differences will be reviewed in the Analysis section.

In [ ]:

```
hidden_layer_1 = tf.layers.dense(X, size_hidden_layer_1, name="hidden1")
bn1 = tf.layers.batch_normalization(hidden_layer_1, training=True, momentum=0.9)
bn1_act = tf.nn.elu(bn1)
hidden_layer_2 = tf.layers.dense(bn1_act, size_hidden_layer_2, name="hidden2")
bn2 = tf.layers.batch_normalization(hidden_layer_2, training=True, momentum=0.9)
bn2_act = tf.nn.elu(bn2)
output_layer = tf.layers.dense(bn2_act, size_output_layer, name="output") # no activation
```

In [ ]:

```
# Cost function
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=output_layer)
loss = tf.reduce_mean(cross_entropy)
```
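What `sparse_softmax_cross_entropy_with_logits` computes is the negative log-likelihood of the true (integer) class under the softmax of the logits. A NumPy sketch of the same computation (numerically stabilized by shifting the logits, as TF does internally):

```python
import numpy as np

def sparse_softmax_cross_entropy(logits, labels):
    # Stable log-softmax: subtract the row max before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-probability of the true class for each example.
    return -log_probs[np.arange(len(labels)), labels]

logits = np.array([[2.0, 1.0, 0.1]])
labels = np.array([0])
loss = sparse_softmax_cross_entropy(logits, labels).mean()
print(loss)  # ~0.417: the true class already has ~66% probability
```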

Now that we have the error, we can backpropagate it through the network to tune the weight matrices.

In [ ]:

```
# Backpropagation
LR = 0.01
optimizer = tf.train.GradientDescentOptimizer(LR).minimize(loss)
```

In [ ]:

```
# evaluation
correct = tf.nn.in_top_k(output_layer, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
```
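`tf.nn.in_top_k` marks a prediction as correct when the true class is among the `k` highest logits; for `k=1` this reduces to an argmax comparison, which we can illustrate in NumPy (a sketch of the semantics, not the TF implementation):

```python
import numpy as np

logits = np.array([[2.0, 5.0, 1.0],   # predicted class: 1
                   [0.3, 0.2, 0.9]])  # predicted class: 2
labels = np.array([1, 0])

# k=1: a prediction is correct when the true label is the argmax.
correct = np.argmax(logits, axis=1) == labels
accuracy = correct.astype(np.float32).mean()
print(accuracy)  # 0.5 (first example right, second wrong)
```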

In [ ]:

```
# Initialization
init = tf.global_variables_initializer()
saver = tf.train.Saver()
```

In [ ]:

```
# Tensorboard info
acc_summary = tf.summary.scalar("Accuracy", accuracy)
file_writter = tf.summary.FileWriter("/saves/summary/BN_elu-{}-{}/".format(size_hidden_layer_1, size_hidden_layer_2), tf.get_default_graph())
```

In [ ]:

```
training_step = True
mnist = input_data.read_data_sets("/data/")
n_epoch = 60
batch_size = 70
training_instances = mnist.train.num_examples
nb_batch = training_instances // batch_size
if input_features < 28*28:
    pca = PCA(n_components=input_features)
    X_train = pca.fit_transform(mnist.train.images)
    X_test = pca.transform(mnist.test.images)
else:
    X_train = mnist.train.images
    X_test = mnist.test.images
```
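The training loop below relies on a `next_batch` helper that is not defined in this notebook. A minimal sketch of what it could look like (shuffling once per epoch is an assumption; the original helper is not shown):

```python
import numpy as np

def next_batch(X, y, batch_size, nb_batch):
    # Yield nb_batch shuffled mini-batches of (X, y).
    indices = np.random.permutation(len(X))
    for i in range(nb_batch):
        batch_idx = indices[i * batch_size:(i + 1) * batch_size]
        yield X[batch_idx], y[batch_idx]
```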

In [ ]:

```
with tf.Session() as sess:
    if not training_step:
        saver.restore(sess, "/saves/my_model_chapter10_2.ckpt")
        print(accuracy.eval(feed_dict={X: X_test, y: mnist.test.labels}))
    else:
        init.run()
        for epoch in range(n_epoch):
            for X_batch, y_batch in next_batch(X_train, mnist.train.labels, batch_size, nb_batch):
                sess.run(optimizer, feed_dict={X: X_batch, y: y_batch})
            accuracy_str = acc_summary.eval(feed_dict={X: X_test, y: mnist.test.labels})
            file_writter.add_summary(accuracy_str, epoch)
            print(epoch, accuracy.eval(feed_dict={X: X_test, y: mnist.test.labels}))
        save_path = saver.save(sess, "/saves/my_model_chapter10_2.ckpt")
    file_writter.close()
```

Several training sessions were run with different configurations (hidden layer sizes, raw vs. PCA-compressed inputs, activation functions, with and without Batch Normalization).

On the first picture below, we have the evolution of the accuracy with the uncompressed inputs and different sizes of hidden layers (bigger than the input, a bit smaller, or much smaller).

We can see that, on average, the trainings end up at the same value (close to 98%). Having bigger layers provides slightly more precision, but the training time increases a lot. If we now compare them against time instead of epochs, we have:

From this point of view, the topology of 300 units in the first hidden layer and 100 in the second looks better (41% faster, at the cost of only 0.15% in precision).

Now we can also compare the same topology with the compressed dataset and the original one:

We can see that training is faster with the compressed one, as the first weight matrix is only 154×Batch_Size instead of 784×Batch_Size. Precision is not really lost either. Unfortunately, in that case, the gain in training time doesn't compensate for the time spent passing the dataset through the PCA function (not timed here). So for such a small model, the PCA doesn't really make sense.
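The 154-component compression used above can be reproduced with scikit-learn's `PCA`: fit the projection on the training set only, then reuse it on the test set. A sketch on synthetic data (random arrays standing in for `mnist.train.images` / `mnist.test.images`, which are assumptions here):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the 784-feature MNIST images.
rng = np.random.RandomState(0)
X_train = rng.rand(200, 784)
X_test = rng.rand(50, 784)

pca = PCA(n_components=154)
X_train_reduced = pca.fit_transform(X_train)  # fit on training data only
X_test_reduced = pca.transform(X_test)        # reuse the same projection

print(X_train_reduced.shape)  # (200, 154)
print(X_test_reduced.shape)   # (50, 154)
```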

We can also compare a different topology for both inputs: a topology of (300, 100) for the compressed dataset and, keeping the same ratio, the raw input on a model of (1568, 500).

Now the difference in precision is not really important (0.3%), and the computation time starts to matter. So we can say that if we use PCA to reduce both the features AND the topology, the initial computation makes sense!

Speaking of topology, we can also compare the accuracy for compressed and raw inputs across several topologies:

We can see that the model adapts pretty well, as the precision doesn't change much between topologies. In such cases, we can think of reducing the sizes to get a "lighter" model.

Now we fix the topology to (300, 100) and the input to the raw images. We can then compare activation functions and also Batch Normalization. Below you have the accuracy as a function of training TIME:

We can see that Batch Normalization with the ELU activation function learns much faster than the others. Unfortunately, it requires more computation time. Nevertheless, if we now look at the accuracy per epoch:

We can see that it really outperforms the other models and is completely trained after 20 epochs (corresponding to 39s). That means it's quicker than the others, which are trained after around 50 seconds.

We can also see that ELU without BN is really worse than the other models. That means BN helps a lot, even if it doubles the computing time (59s vs. 1min57s for the same number of epochs, while the difference in accuracy remains 1%).

We saw with this example the impact of different choices on the model:

- Batch Normalization costs a lot of computing resources but helps train the model much quicker (in epochs, not in time).
- The bigger the NN is, the slower it is. There is also a risk of overfitting, and after a certain size the gain in precision is not really important.
- The activation function has an impact on training time (it requires more or less computation and reaches the "end" of training more or less quickly).

If you want to go deeper, you can also try to play with different optimizers (Momentum Optimizer, Nesterov Accelerated Gradient, Adagrad, RMSProp, Adam). As explained in the book "Hands-On ML with Scikit-Learn and TensorFlow", there is a study from 2017 (link) which explains that we should avoid adaptive optimizers like Adagrad, RMSProp, or Adam because they can generalize poorly. They advise using Nesterov Accelerated Gradient instead.