TensorFlow 04 : Implement a LeNet-5-like NN to classify notMNIST Images

This post is my solution to Udacity DL Assignment 4, which uses a CNN to classify notMNIST images. Visit here to get the full version of my code.

Data Set
NotMNIST data:

It looks more complex than MNIST. Even so, the LeNet-like ConvNet reached 91.6% test accuracy in my best trial.
The dataset covers 10 classes, the letters A to J.


The download consists of two compressed files, notMNIST_large and notMNIST_small.

“We’ll convert the entire dataset into a 3D array (image index, x, y) of floating point values, normalized to have approximately zero mean and standard deviation ~0.5 to make training easier down the road.”
The normalized images are 28 * 28 pixels.

In the .pickle file:
In this experiment, we use the following sizes:

train_size = 200000
valid_size = 10000
test_size = 10000

Click here to get the dataset and the .pickle file. (You need to run it first.) The data are downloaded and packed automatically.

As input to the ConvNet, each image must be a cube: width * height * channel. RGB images usually have 3 channels, but here we only have grayscale PNGs, so the number is 1. We still need to reshape the images, and the labels as well.

Training set (200000, 28, 28, 1) (200000, 10)
Validation set (10000, 28, 28, 1) (10000, 10)
Test set (10000, 28, 28, 1) (10000, 10)

The data are 4-D tensors (num_images, width, height, channel), and the labels are one-hot tensors (num_images, num_labels).
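The reshaping step can be sketched without any libraries. This is only a minimal pure-Python illustration, not the assignment's code (which works on NumPy arrays). The helper names add_channel_axis and one_hot are hypothetical, and the mapping 0 = A ... 9 = J is an assumption.

```python
def add_channel_axis(images):
    """Turn (num_images, h, w) nested lists into (num_images, h, w, 1)."""
    return [[[[float(pixel)] for pixel in row] for row in image]
            for image in images]


def one_hot(labels, num_labels=10):
    """Map integer labels (assumed 0 = A ... 9 = J) to one-hot rows."""
    return [[1.0 if k == label else 0.0 for k in range(num_labels)]
            for label in labels]


# Two tiny 2 x 2 "images" stand in for the 28 x 28 notMNIST images.
images = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]]
cubes = add_channel_axis(images)
targets = one_hot([0, 9])

# Shape of the data: (num_images, height, width, channel)
print(len(cubes), len(cubes[0]), len(cubes[0][0]), len(cubes[0][0][0]))
# Shape of the labels: (num_images, num_labels)
print(len(targets), len(targets[0]))
```

Each pixel becomes a one-element list, which is all the extra "channel" axis means for grayscale images.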


The network has two convolutional layers, each followed by a pooling layer, then another conv layer and a fully connected hidden layer.
For regularization, dropout is applied between the Conv1 and Pool1 layers. Learning rate decay is added as well.

Placeholder, Constant and Variable
Before building the ConvNet, it helps to define the storage for the inputs and the trainable parameters, mainly the weights and biases.

In TF, a placeholder declares a slot for a tensor that is fed in at run time. You give it a data type and, optionally, a shape.

# Input data.
tf_train_dataset = tf.placeholder(
    tf.float32, shape=(batch_size, image_size, image_size, num_channels))

The shape describes a 4-D tensor with data type float32. The placeholder only reserves the slot and holds no data; the values are supplied later through feed_dict.

tf.constant creates a constant tensor from a given value, here a NumPy array (helpful for data that never changes, such as the validation and test sets):

tf_valid_dataset = tf.constant(valid_dataset)
tf_test_dataset = tf.constant(test_dataset)

We use tf.Variable() to hold the weights and biases. Weights are usually initialized with small random values, which we draw from a truncated normal distribution. Biases are initialized to a constant such as 0.0 or 1.0.

layer1_weights = tf.Variable(tf.truncated_normal( [patch_size, patch_size, num_channels, depth], stddev=0.1))
layer1_biases = tf.Variable(tf.zeros([depth]))
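To see what "truncated" means, here is a rough pure-Python sketch of the idea behind tf.truncated_normal: samples that fall more than two standard deviations from the mean are thrown away and drawn again. The function below is a hypothetical stand-in for illustration, not TF's implementation.

```python
import random


def truncated_normal(count, mean=0.0, stddev=0.1, rng=None):
    """Draw `count` normal samples, re-drawing any beyond 2 stddev from the mean."""
    rng = rng or random.Random()
    values = []
    while len(values) < count:
        sample = rng.gauss(mean, stddev)
        if abs(sample - mean) <= 2.0 * stddev:
            values.append(sample)
    return values


# Same number of values as layer1_weights: 5 * 5 patch, 1 input channel, depth 16.
weights = truncated_normal(5 * 5 * 1 * 16, stddev=0.1, rng=random.Random(0))
# One bias per output channel, all zeros.
biases = [0.0] * 16

print(len(weights), min(weights), max(weights))
```

With stddev=0.1, every weight ends up inside [-0.2, 0.2], so no unit starts with an unusually large weight.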


batch_size = 16
patch_size = 5
depth = 16
num_hidden = 64

Layer1: Conv1 layer
Patch (kernel) size 5 * 5, 1 input channel, 16 output channels.
The input is a batch of shape [16, 28, 28, 1].

conv = tf.nn.conv2d(data, layer1_weights, [1, 1, 1, 1], padding='SAME')
hidden = tf.nn.relu(conv + layer1_biases)

For padding we use “SAME”. The list [1, 1, 1, 1] is the stride: strides[0] steps over the batch (samples), strides[3] steps over the channels (depth), and strides[1] and strides[2] are the vertical and horizontal step sizes. With this combination, the output “image” has exactly the same size as the input, 28 * 28, but the depth changes from 1 to 16.
Then we add a bias tensor. The first conv layer has 16 output channels, and each channel has one bias, so the biases form a 1-D tensor of 16 scalars.
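What conv2d plus bias plus ReLU does for a single input channel and a single output channel can be sketched in plain Python. This is a naive illustration assuming stride 1, a square odd-sized kernel and zero padding at the borders; the function name conv2d_same_relu is made up for this example.

```python
def conv2d_same_relu(image, kernel, bias=0.0):
    """Stride-1 'SAME' convolution of one channel, followed by bias and ReLU."""
    height, width = len(image), len(image[0])
    k = len(kernel)   # square kernel with odd size assumed
    pad = k // 2      # how far the window reaches past the border
    output = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            total = bias
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    # Pixels outside the image count as 0 (zero padding).
                    if 0 <= y < height and 0 <= x < width:
                        total += image[y][x] * kernel[di][dj]
            output[i][j] = max(total, 0.0)  # ReLU
    return output


image = [[1.0] * 4 for _ in range(4)]
kernel = [[1.0] * 3 for _ in range(3)]
feature_map = conv2d_same_relu(image, kernel)
print(len(feature_map), len(feature_map[0]))  # same size as the input
```

The real layer repeats this for each of the 16 output channels, each with its own kernel and its own bias.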

*About padding :
“The difference is on the way of dealing with borders. Same padding allows the sliding window moves to the border, with the exceeded pixels being all 0. But valid padding stops the sliding window when it reaches to the border, which will make the Conv size smaller than the input, normally.

Same padding:
out_height = ceil(float(in_height) / float(strides[1]))
out_width = ceil(float(in_width) / float(strides[2]))

Valid padding:
out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))
out_width = ceil(float(in_width - filter_width + 1) / float(strides[2]))”
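The two formulas are easy to check by hand. A small sketch (the function names are just for illustration) that plugs in the sizes used in this post:

```python
import math


def same_output_size(in_size, stride):
    """Output height or width with SAME padding."""
    return math.ceil(in_size / stride)


def valid_output_size(in_size, filter_size, stride):
    """Output height or width with VALID padding."""
    return math.ceil((in_size - filter_size + 1) / stride)


# Conv1: 28 x 28 input, 5 x 5 patch, stride 1.
print(same_output_size(28, 1))      # SAME keeps 28
print(valid_output_size(28, 5, 1))  # VALID would shrink to 24
# Pooling: stride 2 with SAME padding halves the size.
print(same_output_size(28, 2))      # 14
```

This is why the Conv1 output stays at 28 * 28 with SAME padding, while VALID padding would have cut it to 24 * 24.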


To reduce overfitting, we simply add dropout after the Conv1 layer.

hidden = tf.nn.dropout(hidden, keep_prob)

keep_prob is a scalar tensor: the probability that each element is kept. Dropout zeroes each element with probability 1 - keep_prob and scales the surviving elements by 1 / keep_prob. With keep_prob set to 1.0, as in our experiments, every element is kept and the activations pass through unchanged.
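A rough sketch of what dropout does to a list of activations, assuming the usual "inverted dropout" scaling that tf.nn.dropout applies (kept values are divided by keep_prob). The function is a hypothetical stand-in, not TF code.

```python
import random


def dropout(values, keep_prob, rng=None):
    """Keep each value with probability keep_prob and scale it by 1 / keep_prob."""
    rng = rng or random.Random()
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


activations = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

# keep_prob = 1.0 keeps everything, so the layer is a no-op.
unchanged = dropout(activations, 1.0, rng=random.Random(0))
# keep_prob = 0.5 zeroes roughly half and doubles the rest.
dropped = dropout(activations, 0.5, rng=random.Random(0))

print(unchanged)
print(dropped)
```

The scaling keeps the expected sum of the activations the same, so nothing else in the network has to change when keep_prob is switched to 1.0.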

Pooling can be treated as a form of feature extraction. The stride-1 convolution above produced a deeper stack of channels; now let’s extract features from them.

pool1 = tf.nn.max_pool(hidden, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1],
                       padding='SAME', name='pool1')
norm1 = tf.nn.lrn(pool1, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75)

We use max pooling with a 2 * 2 sliding window, which means we take the maximum of 4 values as the feature of each small patch. With SAME padding and a stride of 2 * 2, we end up with a half-sized image, 14 * 14, at the same depth (16).
tf.nn.lrn is short for tf.nn.local_response_normalization(). LRN normalizes the output before it is sent into the next non-linearity, which helps bring the inputs to ReLU to a common scale.

Note that pooling alone leaves the depth unchanged.
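Max pooling on one channel is simple enough to write out by hand. A minimal sketch, assuming an even height and width (in which case SAME and VALID padding give the same result); max_pool_2x2 is a name made up for this example.

```python
def max_pool_2x2(image):
    """2 x 2 max pooling with stride 2 on one channel (even height and width assumed)."""
    height, width = len(image), len(image[0])
    return [[max(image[i][j], image[i][j + 1],
                 image[i + 1][j], image[i + 1][j + 1])
             for j in range(0, width, 2)]
            for i in range(0, height, 2)]


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
pooled = max_pool_2x2(image)
print(pooled)  # [[6, 8], [14, 16]]
```

Each channel is pooled separately, which is why the width and height are halved while the depth stays at 16.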

Hidden Layer
(Let’s skip the Conv2 and Pooling2 layers. They keep the same strides and padding, so we end up with a smaller size: 7 * 7 * 16 -> width * height * channel.)
Then there is another conv layer; nothing magic here, it keeps the same shape.
The hidden layer has 64 neurons and applies ReLU. The final result is a matrix of logits, with one row per image and one column per class.

hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
result = tf.matmul(hidden, layer4_weights) + layer4_biases
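To keep track of the sizes, here is a small shape walk-through in plain Python. It assumes that reshape flattens each 7 * 7 * 16 cube into one row, so the weight shapes below are my guess at layer3_weights and layer4_weights rather than something copied from the full code.

```python
import math

image_size, depth, num_hidden, num_labels = 28, 16, 64, 10

# Each SAME-padded pooling layer with stride 2 halves width and height.
after_pool1 = math.ceil(image_size / 2)   # 28 -> 14
after_pool2 = math.ceil(after_pool1 / 2)  # 14 -> 7

# Flattening one 7 x 7 x 16 cube gives the input size of the hidden layer.
flat_size = after_pool2 * after_pool2 * depth

layer3_shape = (flat_size, num_hidden)    # assumed shape of layer3_weights
layer4_shape = (num_hidden, num_labels)   # assumed shape of layer4_weights

print(after_pool1, after_pool2, flat_size)
print(layer3_shape, layer4_shape)
```

With a batch of 16 images, the two matmuls then go from (16, 784) to (16, 64) to (16, 10).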

Learning Rate Decay
After the logits are computed, compute the loss. We take the mean of the softmax cross entropy over the batch.

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits))

The cost function we chose is called cross entropy. logits is the raw output of the model, and tf_train_labels is the true result.
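For intuition, the loss can be worked out by hand for a tiny batch. This is a plain-Python sketch of softmax followed by cross entropy and a mean over the batch; it illustrates the formula and is not TF's implementation.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, one_hot_label):
    """Negative log-probability assigned to the true class."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs) if t > 0)


def mean_loss(batch_logits, batch_labels):
    """Average the per-example cross entropy over the batch."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)


batch_logits = [[2.0, 0.5, 0.1], [0.0, 0.0, 0.0]]
batch_labels = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(mean_loss(batch_logits, batch_labels))
```

A confident correct prediction gives a loss near 0, while an undecided model with 3 classes pays log(3) per example.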

In practice, the learning rate is lowered as training progresses. A smaller learning rate late in training helps the model settle into a better solution. If training misbehaves, try a smaller learning rate.

How to select the learning rate?

TF provides a ready-made function for this.

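# Learning rate decay
global_step = tf.Variable(0, trainable=False)
starter_learning_rate = 0.1
learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step,
                                           100000, 0.96, staircase=True)

# Optimizer: get the min of loss.
# Passing global_step makes minimize() increment it on every step;
# without it the step stays at 0 and the learning rate never decays.
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(
    loss, global_step=global_step)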
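The schedule behind tf.train.exponential_decay is just start_rate * decay_rate ^ (global_step / decay_steps), with the exponent rounded down when staircase=True. A plain-Python sketch of that formula (the function here is a stand-in written for this post):

```python
def exponential_decay(start_rate, global_step, decay_steps, decay_rate,
                      staircase=False):
    """Learning rate after `global_step` steps of exponential decay."""
    exponent = global_step / decay_steps
    if staircase:
        exponent = global_step // decay_steps  # drop in discrete steps
    return start_rate * decay_rate ** exponent


for step in (0, 1000, 99999, 100000, 200000):
    print(step, exponential_decay(0.1, step, 100000, 0.96, staircase=True))
```

Note that with decay_steps = 100000 and staircase=True, the rate stays at 0.1 for the first 100000 steps. If training only runs for about 1000 steps, as the log suggests, a much smaller decay_steps would be needed to see any decay at all.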


During training and validation, the dropout keep_prob was 0.5; during testing, we use 1.0. I am still investigating its influence.
Minibatch accuracy: 87.5%
Validation accuracy: 70.2%
Minibatch loss at step 950: 0.063094
Minibatch accuracy: 100.0%
Validation accuracy: 69.5%
Minibatch loss at step 1000: 1.012981
Minibatch accuracy: 75.0%
Validation accuracy: 69.8%
Test accuracy: 91.6%


How do you improve the CNN model? You could use dropout on each layer, add more conv layers, apply learning rate decay, etc. One thing you should think about is how to choose good hyper-parameters.
Considering SGD alone, there are already:
– Initial learning rate
– Learning rate decay
– Momentum
– Batch Size
– Weight initialization

When it comes to ConvNets, you might also think about the patch size. There is another approach called Inception (I think it was proposed with GoogLeNet).


Faced with a conv layer, you could try pooling, or you could use different patch sizes (1*1, 3*3, 5*5). Instead of trying them one by one, let’s use them together, possibly with different weights, and stack their outputs into one very deep (because of the extra channels) new layer.
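The "all together" step is a concatenation along the channel axis. A tiny pure-Python sketch of just that step, assuming every branch uses SAME padding so the widths and heights match; the branch contents are dummy values and concat_channels is a hypothetical helper.

```python
def concat_channels(*branches):
    """Concatenate feature maps of identical height and width along the channel axis."""
    height, width = len(branches[0]), len(branches[0][0])
    return [[sum((branch[i][j] for branch in branches), [])
             for j in range(width)]
            for i in range(height)]


def dummy_branch(height, width, channels, value):
    """Stand-in for the output of one branch (1*1, 3*3, 5*5 conv or pooling)."""
    return [[[value] * channels for _ in range(width)] for _ in range(height)]


branch_1x1 = dummy_branch(2, 2, channels=1, value=1.0)
branch_3x3 = dummy_branch(2, 2, channels=2, value=3.0)
branch_5x5 = dummy_branch(2, 2, channels=3, value=5.0)

stacked = concat_channels(branch_1x1, branch_3x3, branch_5x5)
print(len(stacked), len(stacked[0]), len(stacked[0][0]))  # depth is 1 + 2 + 3
```

The output depth is the sum of the branch depths, which is where the "very deep" new layer comes from.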

How do we choose hyper-params to make models better? All data scientists face the same problem.


Udacity DL Assignment 4.

Published by Irene


5 thoughts on “TensorFlow 04 : Implement a LeNet-5-like NN to classify notMNIST Images”

  1. hey thanks for sharing your work!
    you said “When training and validating, the keep_prob ( of the dropout) was 1.0, when testing, we use 0.5.”
    could you explain to me why you use dropout prob 1 for training and 0.5 for testing?
    I would have done the contrary…

