I have recently been working on a shared task on image annotation, and came across an interesting NIPS’15 paper from Baidu Research. This post is my study notes on it.
Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering
by Baidu Research
(I like the lexical style of the title)
In general, using deep NNs for image annotation is not a new idea; some groups have already presented such research at workshops, for example at Re-Work (https://www.re-work.co/). As I mentioned in my 4th DL notes, this paper is an impressive one, perhaps mostly at the application level. Its contributions also include a multilingual annotated question dataset.
The paper releases the Freestyle Multilingual Image Question Answering (FM-IQA) dataset and proposes the Multimodal QA (mQA) model, which answers a question about a given image. An example is shown below:
The model has four main components:
1. LSTM: question sentence -> a dense vector representation
2. CNN: extracts the image representation. (Pre-trained, then fixed during training)
3. LSTM: current and previous words in the answer -> dense representations
4. Fusing component: fuses the information from the first three components and predicts the next word in the answer.
Weights are shared between:
the word embedding layers of components 1 and 3 (the two LSTMs themselves keep separate weights);
the word embedding layer and the fully connected SoftMax layer.
FM-IQA is built on the MS COCO dataset, with question-answer pairs added for the images. The questions include AI-related ones.
It contains 158,392 images with 316,193 Chinese question-answer pairs and their corresponding English translations (really huge).
The model is evaluated with a Visual Turing Test using human judges.
The mQA Model
The four components described above are trained jointly, except for the CNN, which stays fixed.
Training part: questions and answers are represented as one-hot vectors. Two tags are added at the beginning and end of each training answer, <BOA> and <EOA>.
Testing part: start from the beginning tag and calculate the probability distribution of the next word in the answer. Beam search is used to keep the top K candidates from the SoftMax layer.
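To make the testing step concrete for myself, here is a minimal beam search sketch. It is not the paper's code: `next_word_probs` is a hypothetical callback standing in for the SoftMax output, `toy_model` is a made-up lookup table, and the tag strings and default sizes are assumptions.

```python
import math

def beam_search(next_word_probs, beam_size=3, max_len=10, boa="<BOA>", eoa="<EOA>"):
    """Return up to beam_size (log_prob, words) answers, best first.

    next_word_probs(words) is a hypothetical stand-in for the model: it maps
    the words generated so far to a dict {next_word: probability}.
    """
    beams = [(0.0, [boa])]  # (cumulative log-probability, words so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, words in beams:
            for word, p in next_word_probs(words).items():
                if p > 0.0:
                    candidates.append((score + math.log(p), words + [word]))
        # Keep only the K most probable partial answers.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, words in candidates[:beam_size]:
            if words[-1] == eoa:
                finished.append((score, words))  # answer is complete
            else:
                beams.append((score, words))     # keep extending
        if not beams:
            break
    finished.sort(key=lambda c: c[0], reverse=True)
    return finished[:beam_size]

def toy_model(words):
    """Made-up next-word table keyed on the last word only."""
    table = {
        "<BOA>": {"sitting": 0.6, "walking": 0.4},
        "sitting": {"<EOA>": 0.9, "down": 0.1},
        "walking": {"<EOA>": 1.0},
        "down": {"<EOA>": 1.0},
    }
    return table[words[-1]]

best = beam_search(toy_model, beam_size=2)
```

With K = 1 this reduces to greedy decoding; a larger K trades computation for a wider search over candidate answers.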
1. LSTM for questions
This component consists of a word embedding layer and an LSTM layer. The one-hot vector of each word goes into the word embedding layer, which maps it into the dense semantic space as a 512-D vector; that vector is the input of the LSTM.
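A small sketch of what the embedding layer does to a one-hot vector: multiplying by the embedding matrix simply selects one row. The vocabulary and the random initial values are invented for illustration; only the 512-D size comes from the notes above.

```python
import random

def one_hot(index, vocab_size):
    """One-hot vector with a single 1.0 at position index."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

def embed(one_hot_vec, embedding):
    """Multiply a one-hot row vector by the embedding matrix (vocab x dim)."""
    dim = len(embedding[0])
    return [sum(one_hot_vec[i] * embedding[i][d] for i in range(len(embedding)))
            for d in range(dim)]

random.seed(0)
vocab = ["what", "is", "the", "cat", "doing"]  # toy vocabulary (assumption)
EMBED_DIM = 512
embedding = [[random.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)] for _ in vocab]

idx = vocab.index("cat")
dense = embed(one_hot(idx, len(vocab)), embedding)
# dense holds the same values as embedding[idx], the row for "cat"
```

In practice the multiplication is replaced by a direct row lookup, which gives the same vector without touching the zero entries.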
2. CNN for images
GoogLeNet is used, with its last SoftMax layer removed.
3. LSTM for answers
The activation of the memory cells for the words in the answer, as well as the word embeddings, are fed into the fusing component to generate the next word in the answer.
The LSTMs for questions and answers are different, meaning that their weights are different, because questions and answers have different “properties” (i.e. grammar). But the word embeddings are the same, because the semantic meaning of a word is the same in both questions and answers.
4. Fusing layer
The inputs are the representation from the CNN, the memory cells from the two LSTMs (Q & A), and the word embedding of the t-th word in the answer. Each input is multiplied by its own weight matrix, and the results are summed and passed through an element-wise non-linear function g(.). The output is f(t), the activation of the fusing layer for the t-th word in the answer.
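The fusing step can be sketched as follows. This is only my reading of the description: `tanh` stands in for g(.), and the weight names, dimensions, and random values are all made up for the example.

```python
import math
import random

def matvec(matrix, vec):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def fuse(image_repr, q_memory, a_memory, a_word_emb, weights, g=math.tanh):
    """f(t) = g(V_I*I + V_Q*r_Q + V_A*r_A(t) + V_w*w(t)), with g element-wise."""
    parts = [
        matvec(weights["image"], image_repr),
        matvec(weights["question"], q_memory),
        matvec(weights["answer"], a_memory),
        matvec(weights["word"], a_word_emb),
    ]
    # Sum the four projections dimension by dimension, then apply g.
    return [g(sum(values)) for values in zip(*parts)]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
FUSED_DIM = 4  # toy sizes, chosen only for this illustration
sizes = {"image": 6, "question": 5, "answer": 5, "word": 3}
weights = {name: random_matrix(FUSED_DIM, n, rng) for name, n in sizes.items()}
inputs = {name: [rng.uniform(-1, 1) for _ in range(n)] for name, n in sizes.items()}

fused = fuse(inputs["image"], inputs["question"], inputs["answer"],
             inputs["word"], weights)
```

Because each input has its own weight matrix, the four sources can have different sizes and still land in the same fused space.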
An intermediate layer is connected after the fusing layer: it maps the dense multimodal representation back to a dense word representation to get the answer. It is followed by a fully connected SoftMax layer, which shares weights with the word embedding layer of the answer (more on this later).
** However, I am not clear about the representation from the CNN; maybe only labels are extracted. It seems that the answer is learned lexically, mainly from the questions and answers: although the image is taken into account, how could the machine know from the image that the cat is sitting? Mainly from the given answer. Suppose the training set has 100000 images with the same question-answer pair, “what is the cat doing?” – “sitting”, and 10 images with the pair “what is the cat doing?” – “walking”, no matter what the images look like. Then, given the question “what is the cat doing?”, the model is more likely, at the lexical and probabilistic level, to answer that the cat is sitting. In fact, though, the authors choose the question data according to some rules; see section 4 of the paper.**
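The worry above can be illustrated with a tiny baseline that never looks at the image. This is my own toy illustration, not something from the paper; the counts are the ones from my example.

```python
from collections import Counter, defaultdict

def train_prior(qa_pairs):
    """Count how often each answer follows each question, ignoring images."""
    counts = defaultdict(Counter)
    for question, answer in qa_pairs:
        counts[question][answer] += 1
    return counts

def answer_from_prior(counts, question):
    """Always return the most frequent training answer for this question."""
    return counts[question].most_common(1)[0][0]

question = "what is the cat doing?"
pairs = [(question, "sitting")] * 100000 + [(question, "walking")] * 10
prior = train_prior(pairs)
guess = answer_from_prior(prior, question)  # "sitting", whatever the image shows
```

A dataset with such a skew would let a purely lexical model score well, which is why the rules for selecting the question data matter.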
Weight Sharing Strategy
The weight matrix of the word embedding layers under the two LSTMs is shared, on the grounds that a single word has the same meaning in questions and answers.
The LSTMs themselves have different weights, because the grammar properties of questions and answers differ.
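A rough sketch of this sharing scheme, with one embedding matrix referenced by both branches and separate LSTM parameters. The sizes, the dictionary layout, and the transposed reuse of the embedding matrix in the SoftMax layer are my assumptions for illustration.

```python
import random

rng = random.Random(0)

def random_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

VOCAB, EMBED, HIDDEN = 6, 4, 3  # toy sizes for illustration only

# One embedding matrix (vocab x embed), referenced by both branches.
shared_embedding = random_matrix(VOCAB, EMBED)

# Each branch gets its own LSTM parameters (4 gates stacked, toy layout).
question_branch = {"embedding": shared_embedding,
                   "lstm": random_matrix(4 * HIDDEN, EMBED + HIDDEN)}
answer_branch = {"embedding": shared_embedding,
                 "lstm": random_matrix(4 * HIDDEN, EMBED + HIDDEN)}

def softmax_logits(word_repr):
    """Score every vocabulary word by reusing the embedding matrix
    (assumed transposed tying between embedding and SoftMax layers)."""
    return [sum(e * h for e, h in zip(row, word_repr)) for row in shared_embedding]

# A (pretend) gradient step on the shared matrix is seen by both branches.
shared_embedding[2][0] += 0.5
logits = softmax_logits([0.1, -0.2, 0.3, 0.0])
```

Sharing the embedding cuts the number of parameters and forces one meaning per word, while the separate LSTM weights let each branch learn its own grammar.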