# TensorFlow 02: Play with MNIST and Google DL Udacity Lectures

Something to say:

You can find the Google Deep Learning lectures here, but I suggest getting some experience first before you go through them. They are short, with a very cute style :D!

I am trying to combine the two: the lectures for theoretical knowledge and TensorFlow for implementation. A good point is that they use the same terms.

With the MNIST dataset, a well-labelled set of handwritten digits for ML, let’s follow the tutorial and use only a few lines to start the TensorFlow journey.

Import dataset

import tensorflow.examples.tutorials.mnist.input_data


I was totally confused about importing the dataset. It seems that you need to download it first, but on Yann’s website (http://yann.lecun.com/exdb/mnist/) four files are available.
If you go to the link of input_data.py and locate the read_data_sets function, you will see that it expects all four. So simply download all of them and put them into a folder named “MNIST_data”. Btw, check the current working path by using:

import os
print(os.getcwd())


Run those two lines: “The downloaded data is split into three parts, 55,000 data points of training data (mnist.train), 10,000 points of test data (mnist.test), and 5,000 points of validation data (mnist.validation). This split is very important: it’s essential in machine learning that we have separate data which we don’t learn from so that we can make sure that what we’ve learned actually generalizes!”

The data has two parts: images (“xs”) and labels (“ys”), available as mnist.train.images and mnist.train.labels.
Each image, flattened into a vector, contains 28 $\times$ 28 = 784 numbers (pixel intensities between 0 and 1).
With 55000 images and 784 numbers in each, mnist.train.images is a tensor with shape [55000, 784].
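
To make the "image to vector" step concrete, here is a tiny sketch in plain Python. The 3 $\times$ 3 image and the helper name `flatten` are made up for illustration; the MNIST loader already hands you the flattened 784-number vectors, so you do not need this in practice.

```python
# A tiny 3x3 stand-in "image" (made-up values); a real MNIST digit is 28x28.
image = [
    [0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.5, 0.0],
]

def flatten(img):
    """Concatenate the rows of a 2-D image into one flat list of pixels."""
    return [pixel for row in img for pixel in row]

vector = flatten(image)
print(len(vector))  # 9 here; 28 * 28 = 784 for a real MNIST image
```

Flattening throws away the 2-D layout of the pixels, which is fine for a simple model like the one in this post.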

“One-hot vectors”: labels range from 0 to 9, so each label becomes a 10-dimensional vector with 9 zeros and a single 1. For example, 2 is represented as [0,0,1,0,0,0,0,0,0,0].
Similarly, mnist.train.labels is a [55000, 10] tensor.
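
A one-hot vector is easy to build by hand. This is only a sketch in plain Python (the function name `one_hot` is my own choice); the dataset loader produces these vectors for you.

```python
def one_hot(label, num_classes=10):
    """Return a list of zeros with a single 1 at index `label`."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec

print(one_hot(2))  # [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
```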

Softmax Regression- The activation function

It is a generalisation of Logistic Regression, and a supervised learning method.
Logistic Regression targets binary classification, while Softmax Regression handles multi-class classification.
It is a probability-based method for recognising a digit, and Softmax Regression is a simple model for that.
Two steps: add up the evidence, then convert the evidence to probabilities.
As with an activation function, we define the evidence for class $i$ like this: $evidence_i = \sum_j W_{i,j} x_j + b_i$
where $W_i$ is the weights and $b_i$ is the bias for class $i$, and $j$ is an index for summing over the pixels in our input image $x$.

So next, convert the evidence to probabilities: $y=softmax(evidence)$
In principle we could define softmax ourselves (depending on the practical use). Here it is a normalised exponential: $softmax(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$

Finally, we get $y=softmax(Wx+b)$
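
The two steps (add up evidence, convert to probabilities) can be sketched in a few lines of plain Python. All numbers below are made up, and `W` is indexed as `W[i][j]` (class, pixel) to match the formula above, which is the transpose of the [784, 10] layout used in the TensorFlow code later. Subtracting the maximum before exponentiating is a common numerical trick I added; it does not change the result.

```python
import math

def evidence(W, x, b):
    """evidence_i = sum_j W[i][j] * x[j] + b[i], one value per class i."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def softmax(values):
    """Exponentiate each value and normalise so the results sum to 1."""
    shift = max(values)  # avoids overflow in exp; the output is unchanged
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Toy numbers (made up): 2 input pixels, 3 classes.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, -1.0]
x = [0.5, 0.5]

y = softmax(evidence(W, x, b))
print(y)       # three probabilities, one per class
print(sum(y))  # they add up to 1 (up to floating-point rounding)
```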

Cross Entropy – The cost function (from the Deep Learning course on Udacity by Google).
D stands for distance: it measures the distance between the predicted probabilities and the one-hot vector, showing “how well we are doing”, i.e. the error. Note that this distance is not symmetric.
It is used to evaluate errors: $H_{y'}(y) = -\sum_i y'_i \log(y_i)$
where $y$ is the predicted probability distribution and $y'$ is the true distribution.
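
Here is a small sketch of the cross-entropy formula in plain Python, with made-up probabilities. The function name and the choice to skip terms where the true probability is 0 (treating 0 · log(·) as 0) are mine. It also hints at why the distance is not symmetric: the log is taken of the predictions only, so swapping the arguments would try to take log(0) of the one-hot vector.

```python
import math

def cross_entropy(y_true, y_pred):
    """H_{y'}(y) = -sum_i y'_i * log(y_i), skipping terms where y'_i is 0."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

y_true = [0, 0, 1]       # one-hot: the correct class is 2
good = [0.1, 0.1, 0.8]   # most of the probability on the correct class
bad = [0.7, 0.2, 0.1]    # most of the probability on a wrong class

print(cross_entropy(y_true, good))  # -log(0.8), about 0.223
print(cross_entropy(y_true, bad))   # -log(0.1), about 2.303
```

The better prediction gets the smaller value, which is exactly what we want from a cost function.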

The loss function is the average cross entropy over the $n$ training examples: $\frac { 1 }{ n } \sum { D(S(wx+b),L) }$. Let’s try to minimize the loss function (now it is a numerical problem), so we use gradient descent (GD). Usually we use Stochastic GD (SGD), which estimates the gradient from a small random batch, so each step is much cheaper and training usually progresses faster.
One way to improve SGD is called Momentum. When the loss surface is flat like a plain, you can stay aggressive and take bigger steps. When the surface is steep and the learning rate is large, the steps can zigzag in an “S” shape instead. For more, please check here – Optimization:SGD . The main idea is: if the steps keep going in the same direction, we use a bigger step; if the steps change direction frequently, we use a smaller step as a trade-off.
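
To see the "bigger steps on a flat plain" idea, here is a toy sketch of the classical momentum update next to plain gradient descent. Everything here is an assumption for illustration: the flat loss f(w) = 0.01 · w², the learning rate 0.1, the momentum coefficient 0.9 and the function names. It is not how TensorFlow implements its optimizers.

```python
def grad(w):
    """Gradient of the flat toy loss f(w) = 0.01 * w**2."""
    return 0.02 * w

def plain_gd(w, lr=0.1, steps=100):
    """Plain gradient descent: every step uses only the current gradient."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def momentum_gd(w, lr=0.1, mu=0.9, steps=100):
    """Classical momentum: the velocity v accumulates past gradient steps."""
    v = 0.0
    for _ in range(steps):
        v = mu * v - lr * grad(w)  # same direction again -> v keeps growing
        w += v
    return w

print(plain_gd(10.0))     # still far from the minimum at w = 0
print(momentum_gd(10.0))  # noticeably closer after the same number of steps
```

Because the gradient keeps pointing the same way on this flat curve, the velocity builds up and the effective step gets larger. If the gradient flipped sign every step, the contributions would partly cancel and the steps would shrink.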

The Whole Training Process: Multinomial Logistic Classification $D(S(wx+b),L)$

First, the input x (the pixels, as one big matrix) goes through a linear model: it is multiplied by the weights and the biases are added. The output y of the linear model is just a set of raw numerical scores. Next, a Softmax function S(y) maps those scores to probabilities. Finally, we measure the error between the computed probabilities and the true labels with the cross-entropy function (a Euclidean Radial Basis Function is another option).
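
The whole chain, linear model, then softmax, then cross entropy, averaged over a batch, fits in one short plain-Python sketch. The tiny batch, the all-zero parameters and all function names are made up for illustration; the weights are indexed `W[class][pixel]`.

```python
import math

def linear(W, x, b):
    """Raw scores Wx + b, one per class."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def softmax(z):
    """S(.): map raw scores to probabilities."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def distance(probs, label):
    """D(S, L): cross entropy between probabilities S and one-hot label L."""
    return -sum(l * math.log(p) for l, p in zip(label, probs) if l > 0)

def average_loss(W, b, xs, labels):
    """(1/n) * sum over the examples of D(S(Wx + b), L)."""
    losses = [distance(softmax(linear(W, x, b)), L) for x, L in zip(xs, labels)]
    return sum(losses) / len(losses)

# Made-up batch: 2 examples, 2 "pixels", 2 classes, all-zero parameters.
W = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]
xs = [[1.0, 0.0], [0.0, 1.0]]
labels = [[1, 0], [0, 1]]

print(average_loss(W, b, xs, labels))  # log(2), about 0.693
```

With all-zero weights the model guesses 50/50 for every input, so the loss is log(2). Training is the search for W and b that push this number down.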

Implementation

We will have a simple neural net with 784 vertices in the input layer and 10 vertices in the output layer, and no hidden layers. The y matrix, which holds the predicted probabilities, has shape [None, 10] (any number of rows, 10 columns), and so does y_, the placeholder for the true labels.
The weight matrix has shape [784, 10], while the bias is a 10-d vector, one bias per unit in the output layer.

My code lives in an IPython notebook, but it can easily be converted to a Python file via:

ipython nbconvert --to python test.ipynb

Here is a full version:


# coding: utf-8

# In:

import tensorflow as tf
import tensorflow.examples.tutorials.mnist.input_data as input_data
import os

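# In:

# Load MNIST from the local folder; one_hot=True returns labels as one-hot vectors
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

# In: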

print(os.getcwd())

# In:

# Implementation starts!

# None means the first dimension can be any size, so x can hold any number of images.
x = tf.placeholder(tf.float32, [None, 784])

# In:

# Init weights and biases (all zeros first) and define the softmax model
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

# first multiply x and w, then add b vector. apply softmax to get probabilities
y = tf.nn.softmax(tf.matmul(x, W) + b)

# In:

# Training: y_ is the placeholder for the correct one-hot labels
y_ = tf.placeholder(tf.float32, [None, 10])

# tf.log computes logarithm of each element.
cross_entropy = -tf.reduce_sum(y_*tf.log(y))

# minimize cross_entropy using the gradient descent algorithm with a learning rate of 0.01.
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)

# In:

init = tf.initialize_all_variables()

# launch the model in a Session, run the initialized operation
sess = tf.Session()
sess.run(init)

# Train for 1000 times!
# batch of 100 at each time
# train_step feeding in the batches data to replace the placeholders
for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

# In:

# evaluation
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))



What does argmax do? tf.argmax(y,1) gives the label that the model thinks is most likely for each input (the predicted y), and tf.argmax(y_,1) gives the correct label.
tf.equal returns a vector of booleans ([T,F,F,T,…]), which is cast to a binary vector ([1,0,0,1,…]); its mean (0.5, 0.75, etc.) is the accuracy. (91.4% accuracy.)
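
The same evaluation logic can be followed by hand in plain Python. This is only a sketch with made-up predictions for four inputs and three classes; the helper names `argmax` and `accuracy` are mine and stand in for the TensorFlow ops above.

```python
def argmax(values):
    """Index of the largest entry (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(predicted_probs, one_hot_labels):
    """Fraction of rows where the predicted label equals the true label."""
    correct = [argmax(p) == argmax(t)                  # [True, False, ...]
               for p, t in zip(predicted_probs, one_hot_labels)]
    as_numbers = [1.0 if c else 0.0 for c in correct]  # [1.0, 0.0, ...]
    return sum(as_numbers) / len(as_numbers)           # the mean

# Made-up predictions (y) and true one-hot labels (y_).
y = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.6, 0.3, 0.1]]
y_ = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [1, 0, 0]]

print(accuracy(y, y_))  # 0.75: three of the four predictions match
```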

In the official tutorial, an advanced version is also provided, which builds a CNN with TensorFlow. I went through the CNN lecture as well, and I hope to post a new blog about CNNs and the advanced code ASAP. Thanks!