In this exercise, we are going to build a first neural network with TensorFlow. We will explore several models, tweak their topology and activation function, and check the impact in TensorBoard. As input, we will use the MNIST dataset included in TensorFlow.


In [ ]:
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
import numpy as np
from sklearn.decomposition import PCA

First we have to create the Model. The model will contain :

  • 1 input layer of size "input_features"
  • 2 hidden layers of size "size_hidden_layer_1 & 2"
  • 1 output layer of size 10 (as we want to predict digits from 0 to 9)
  • By default, sizes will be 300 for HL1 and 100 for HL2

    In [ ]:
    input_features = 28*28
    size_hidden_layer_1 = 300
    size_hidden_layer_2 = 100
    size_output_layer = 10

    Now, we will create placeholders for the input and the known target

    In [ ]:
    # Creation Graph
    X = tf.placeholder(tf.float32, shape=(None, input_features), name='X')
    y = tf.placeholder(tf.int32, shape=(None), name='y')

    Now we can create the topology. During the analysis we are going to tweak it to check the impact of some parameters on the learning. The one proposed below is the simplest one, with a ReLU activation function for both hidden layers and no activation for the output layer

    In [ ]:
    hidden_layer_1 = tf.layers.dense(X, size_hidden_layer_1, name="hidden1", activation=tf.nn.relu)
    hidden_layer_2 = tf.layers.dense(hidden_layer_1, size_hidden_layer_2, name="hidden2", activation=tf.nn.relu)
    output_layer = tf.layers.dense(hidden_layer_2, size_output_layer, name="output")  # no activation: raw logits

    Unfortunately, this model is not the best one we can have in general, and this is also true for this example. Deep neural networks (which is not really the case here, with only 2 hidden layers) are usually at risk of the vanishing/exploding gradient problem, which makes the training longer. Research offers several solutions to avoid it and train the network faster.

    • Normalization of the inputs of each layer (called Batch Normalization). The process is to center and scale the data between every pair of layers. This takes computing time but, overall, helps to train the NN faster

    • Tweaking the activation function: there are a lot of activation functions. In this jungle of functions, one variant of the ReLU gives the best results overall: the ELU (Exponential Linear Unit). It is a bit slower to compute than the plain ReLU or the Leaky ReLU, but it avoids dying units.
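To make the difference between these activation functions concrete, here is a minimal sketch in plain Python. The `alpha` values are assumptions chosen for illustration (0.01 is a common choice for the Leaky ReLU, and `alpha=1.0` matches the usual ELU definition); they are not taken from the experiments in this notebook.

```python
import math

def relu(z):
    # Zero for every negative input: the gradient is 0 there,
    # so a unit stuck in that region stops learning
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Small linear slope for negative inputs
    return z if z > 0 else alpha * z

def elu(z, alpha=1.0):
    # Smooth exponential curve for negative inputs, bounded below by -alpha
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

for z in (-2.0, 0.0, 2.0):
    print(z, relu(z), leaky_relu(z), elu(z))
```

For negative inputs the ELU keeps a non-zero gradient, which is why it is less exposed to dying units, at the price of one exponential per evaluation.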

    These two improvements are added in the code below, and the differences are reviewed in the analysis section

    In [ ]:
    hidden_layer_1 = tf.layers.dense(X, size_hidden_layer_1, name="hidden1")
    bn1 = tf.layers.batch_normalization(hidden_layer_1, training=True, momentum=0.9)
    bn1_act = tf.nn.elu(bn1)
    hidden_layer_2 = tf.layers.dense(bn1_act, size_hidden_layer_2, name="hidden2")
    bn2 = tf.layers.batch_normalization(hidden_layer_2, training=True, momentum=0.9)
    bn2_act = tf.nn.elu(bn2)
    output_layer = tf.layers.dense(bn2_act, size_output_layer, name="output") # no activation
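As a rough illustration of what the batch normalization and ELU steps compute on one mini-batch, here is a NumPy sketch. It is a simplification under stated assumptions: `gamma` and `beta` stand for the learned scale and shift and are fixed here, the data is synthetic, and the moving averages that a real batch normalization layer maintains for inference are left out.

```python
import numpy as np

def batch_norm(z, gamma=1.0, beta=0.0, eps=1e-3):
    # Center and scale every feature (column) with the statistics of the
    # current mini-batch, then apply the scale (gamma) and shift (beta)
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    z_hat = (z - mean) / np.sqrt(var + eps)
    return gamma * z_hat + beta

def elu(z, alpha=1.0):
    # np.minimum keeps exp() away from large positive values
    return np.where(z > 0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(70, 4))  # synthetic pre-activations

normalized = batch_norm(batch)
activated = elu(normalized)

print(normalized.mean(axis=0).round(6))  # close to 0 for every feature
print(normalized.std(axis=0).round(3))   # close to 1 for every feature
```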

    During the training, we have to compute the cross entropy between the output predicted by the network and the real output. The objective of the training is to minimize this cross entropy

    In [ ]:
    # Cost function
    cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=output_layer)
    loss = tf.reduce_mean(cross_entropy)
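For readers who want to see what this loss computes, here is a NumPy sketch of a sparse softmax cross entropy. The function name and the toy logits are made up for the illustration; it is not the TensorFlow implementation, only the same formula written out.

```python
import numpy as np

def sparse_softmax_cross_entropy(labels, logits):
    # Subtract the row maximum for numerical stability (softmax is unchanged)
    shifted = logits - logits.max(axis=1, keepdims=True)
    # Log of the softmax probabilities
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick, for every example, the log-probability of its true class
    return -log_probs[np.arange(len(labels)), labels]

logits = np.array([[2.0, 0.5, -1.0],   # fairly confident in class 0
                   [0.0, 0.0, 0.0]])   # no preference between the 3 classes
labels = np.array([0, 2])

per_example = sparse_softmax_cross_entropy(labels, logits)
loss = per_example.mean()
print(per_example, loss)
```

The second example shows the baseline: with identical logits the network assigns probability 1/3 to every class, so its cross entropy is log(3).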

    Now that we have the error, we can backpropagate it through the network to tune the weight matrices

    In [ ]:
    # Backpropagation
    LR = 0.01
    optimizer = tf.train.GradientDescentOptimizer(LR).minimize(loss)
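The update rule applied by a gradient descent optimizer can be sketched on a one-dimensional toy problem. The function `f(w) = (w - 3)^2` and the number of steps are assumptions chosen for the illustration; only the learning rate of 0.01 comes from the cell above.

```python
def gradient_descent(grad, w0, lr=0.01, n_steps=1000):
    # Repeatedly move the parameter against the gradient
    w = w0
    for _ in range(n_steps):
        w = w - lr * grad(w)
    return w

def grad_f(w):
    # Gradient of the toy loss f(w) = (w - 3)^2, minimal at w = 3
    return 2.0 * (w - 3.0)

w_final = gradient_descent(grad_f, w0=0.0)
print(w_final)
```

In the network, the same step is applied to every weight matrix, with the gradients provided by backpropagation.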

    We can also add a node to the graph to compute the accuracy from the output of the network and the expected output. This will be used on the test set to evaluate the graph on unseen data

    In [ ]:
    # evaluation
    correct = tf.nn.in_top_k(output_layer, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
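With `k = 1`, this accuracy node amounts to checking whether the largest logit sits on the true class. A NumPy sketch of the same idea, on made-up logits and ignoring the way ties are handled:

```python
import numpy as np

def accuracy_top1(logits, labels):
    # The predicted class is the index of the largest logit
    predictions = logits.argmax(axis=1)
    # Fraction of examples whose prediction matches the label
    return float((predictions == labels).mean())

logits = np.array([[0.1, 2.0, -1.0],
                   [1.5, 0.2, 0.3],
                   [0.0, 0.1, 0.9],
                   [0.4, 0.3, 0.2]])
labels = np.array([1, 0, 1, 0])

print(accuracy_top1(logits, labels))  # 3 of the 4 predictions are correct
```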

    Now that the model is set, we can initialize it and create a Saver to be able to save and restore it. The results will be explored in TensorBoard

    In [ ]:
    # Initialization
    init = tf.global_variables_initializer()
    saver = tf.train.Saver()

    For the results presented below, I changed the name of the summary directory according to the network topology and the activation function.

    In [ ]:
    # TensorBoard info
    acc_summary = tf.summary.scalar("Accuracy", accuracy)
    file_writter = tf.summary.FileWriter("/saves/summary/BN_elu-{}-{}/".format(size_hidden_layer_1, size_hidden_layer_2), tf.get_default_graph())

    Now we can prepare the training steps. If you want to try reducing the number of dimensions, a PCA is set up to compress the inputs to the expected "input_features" size. This will be tested to check the gain in time and the loss in accuracy

    In [ ]:
    training_step = True
    mnist = input_data.read_data_sets("/data/")
    n_epoch = 60
    batch_size = 70
    training_instances = mnist.train.num_examples
    nb_batch = training_instances // batch_size
    if input_features < 28*28:
        # Compress the images down to input_features components
        pca = PCA(n_components=input_features)
        X_train = pca.fit_transform(mnist.train.images)
        X_test = pca.transform(mnist.test.images)
    else:
        # Keep the raw 28x28 pixels
        X_train = mnist.train.images
        X_test = mnist.test.images
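To choose how many components to keep, one option is to look at the cumulative explained variance. The sketch below does this with a plain NumPy SVD on synthetic data (not MNIST); the helper name and the 95% threshold are assumptions for the illustration. With scikit-learn, passing a float such as `PCA(n_components=0.95)` should give a comparable selection.

```python
import numpy as np

def n_components_for_variance(X, kept_variance=0.95):
    # PCA works on centered data
    X_centered = X - X.mean(axis=0)
    # Squared singular values are proportional to the variance per component
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    explained = singular_values ** 2
    cumulative_ratio = np.cumsum(explained) / explained.sum()
    # First component count whose cumulative ratio reaches the threshold
    return int(np.argmax(cumulative_ratio >= kept_variance) + 1)

# Synthetic data: 20 features that are mixtures of 3 latent factors plus noise
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 20))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 20))

n_kept = n_components_for_variance(X_demo, kept_variance=0.95)
print(n_kept)  # at most 3, since only 3 factors carry real variance
```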

    And launch the training (by changing training_step to False, the latest saved model is loaded instead and you can simply test it on data)

    In [ ]:
    with tf.Session() as sess:
        if not training_step:
            # Test only: load the latest saved model and evaluate it
            saver.restore(sess, "/saves/my_model_chapter10_2.ckpt")
            print(accuracy.eval(feed_dict={X: X_test, y: mnist.test.labels}))
        else:
            sess.run(init)
            for epoch in range(n_epoch):
                # next_batch: helper expected to yield (X_batch, y_batch) mini-batches
                for X_batch, y_batch in next_batch(X_train, mnist.train.labels, batch_size, nb_batch):
                    sess.run(optimizer, feed_dict={X: X_batch, y: y_batch})
                # Log the test accuracy of this epoch for TensorBoard
                accuracy_str = acc_summary.eval(feed_dict={X: X_test, y: mnist.test.labels})
                file_writter.add_summary(accuracy_str, epoch)
                print(epoch, accuracy.eval(feed_dict={X: X_test, y: mnist.test.labels}))
            save_path = saver.save(sess, "/saves/my_model_chapter10_2.ckpt")
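The training loop relies on a `next_batch` helper whose definition does not appear in this notebook. The following is a hypothetical version, assuming it shuffles the training set once per call and yields `nb_batch` mini-batches of `batch_size` examples; the original helper may differ.

```python
import numpy as np

def next_batch(X, y, batch_size, nb_batch):
    # Hypothetical helper: shuffle the indices once, then yield
    # nb_batch consecutive slices of batch_size examples
    indices = np.random.permutation(len(X))
    for i in range(nb_batch):
        batch_idx = indices[i * batch_size:(i + 1) * batch_size]
        yield X[batch_idx], y[batch_idx]

# Small demo on made-up data: 10 examples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)
batches = list(next_batch(X_demo, y_demo, batch_size=3, nb_batch=10 // 3))
for X_b, y_b in batches:
    print(X_b.shape, y_b)
```

Because `nb_batch` comes from an integer division, the examples left over after the last full batch are simply skipped for that epoch.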


    After several training sessions done with :

  • ReLU or ELU activation function
  • Different topologies
  • 2 sizes of input (28x28 features as provided by default, or a compressed version with only the 154 best features, corresponding to 5% of data lost after reconstruction)

    we got the accuracy on the test set for every epoch. We can see :

    On the first picture below, we have the evolution of the accuracy with the uncompressed inputs and different sizes of hidden layers (bigger than the input, a bit smaller or much smaller)

    [Figure: TensorBoard screenshot, accuracy per epoch for the different hidden layer sizes]

    We can see that, on average, the training ends up at the same value (close to 98%). Bigger hidden layers provide slightly more accuracy, but the training time increases a lot. If we now plot the same runs against training time instead of epochs, we have :


    From this point of view, the topology with 300 units in the first hidden layer and 100 in the second looks better (41% faster, for a loss of only 0.15% in accuracy).

    Now we can also compare the same topology with the compressed dataset and the original one :


    We can see that the training is faster with the compressed dataset, as the first matrix is only 154xBatch_Size instead of 784xBatch_Size. The accuracy is barely affected. Unfortunately, in this case, the gain in training time does not compensate for the time spent passing the dataset through the PCA (not timed here). So for such a small model, the PCA does not really make sense.

    We can also compare a different topology for each input: a topology of (300, 100) for the compressed dataset and, keeping the same ratio to the input size, a model of (1568, 500) for the raw input


    Now the difference in accuracy is small (0.3%), while the difference in computation time starts to be significant. So we can say that if we use PCA to reduce both the features AND the topology, the initial PCA computation makes sense!

    Talking about topology, we can also compare the accuracy for compressed and raw inputs on several topologies :



    We can see that the model adapts pretty well, as the accuracy does not change much from one topology to another. In such cases, we can think of reducing the layer sizes to get a "light" model

    Now we fix the topology to (300, 100) and use the raw inputs. We can compare the activation functions and the effect of Batch Normalization. Below you have the accuracy based on training TIME:


    We can see that Batch Normalization with the ELU activation function learns much faster than the other models. Unfortunately, it requires more computation time. Nevertheless, if we now look at the accuracy based on epochs :


    We can see that it clearly outperforms the other models and is completely trained after 20 epochs (which corresponds to 39s). That means it is quicker than the other ones, which are trained after around 50 seconds.

    We can also see that ELU without BN is really worse than the other models. That means BN helps a lot, even if it doubles the computing time (from 59s to 1min57s for the same number of epochs, while the difference in accuracy remains 1%).


    With this example, we saw the impact of different choices on the model.

    • Batch Normalization costs a lot in computing resources but helps a lot to train the model quicker (in epochs, not in time)
    • The bigger the NN is, the slower it is. There is also a risk of overfitting, and beyond a certain size the gain in accuracy is no longer significant.
    • The activation function has an impact on the training time (it requires more or less computation and reaches the "end" of the training more or less quickly)

    If you want to go deeper, you can also try different optimizers (Momentum, Nesterov Accelerated Gradient, AdaGrad, RMSProp, Adam). As explained in the book "Hands-On Machine Learning with Scikit-Learn and TensorFlow", a study from 2017 (link) explains that we should avoid adaptive optimization methods like AdaGrad, RMSProp or Adam because they can generalize poorly. The authors advise using Nesterov Accelerated Gradient instead.
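As a starting point for such experiments, here is a sketch of the Nesterov Accelerated Gradient update on a one-dimensional toy problem. The toy loss, the momentum of 0.9 and the number of steps are assumptions for the illustration. In TensorFlow 1.x, the corresponding optimizer should be obtainable with `tf.train.MomentumOptimizer(learning_rate=LR, momentum=0.9, use_nesterov=True)` in place of the gradient descent optimizer.

```python
def nesterov_descent(grad, w0, lr=0.01, momentum=0.9, n_steps=200):
    w, m = w0, 0.0
    for _ in range(n_steps):
        # The gradient is measured slightly ahead, in the direction of the momentum
        m = momentum * m - lr * grad(w + momentum * m)
        w = w + m
    return w

def grad_toy(w):
    # Gradient of the toy loss (w - 3)^2, minimal at w = 3
    return 2.0 * (w - 3.0)

w_nag = nesterov_descent(grad_toy, w0=0.0)
print(w_nag)
```

The momentum term accumulates past gradients, so the parameter can approach the minimum in fewer steps than with the same learning rate and no momentum.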