
NLP 03: Finding Mr. Alignment, IBM Translation Model 1

It is perhaps a little fast to jump straight into MT.
Anyway, this post is fairly superficial: it gives you a view of the basics along with an implementation and a bad result… which leaves you more room to optimize. Btw, you might learn some Chinese here 😛

MT Challenges

Lexical ambiguity: the same word has different meanings depending on context. “Can you do this?” vs. “A can of Coke.”
Word order: usually the order of subject, verb and object. English: SVO; Japanese: SOV.

Transfer-Based Approaches

We can translate a sentence word by word.
For example:
English: I love you.
Mandarin: 我(I, read as “wǒ”)爱(love, “ài”)你(you, “nǐ”)。
We can see a very nice word-by-word alignment that preserves the order. If we want to translate a sentence whose word order differs, we can simply translate all the words first and then reorder them by following some rules. This gives us a transfer-based approach.
3 phases:
1. Analysis: parse the source-language sentence to learn its features.
2. Transfer: convert the source-language parse tree into a target-language parse tree.
3. Generation: convert the target-language parse tree into the output sentence.
The following is an example showing the transfer of an English sentence into Japanese.

[Image: parse tree of the English source sentence]
So first we have the rules (e.g., S -> NP VP) of the source language; then, given a sentence, we can build a parse tree. Then we need to reorder some constituents to match the target-language rules.
[Image: the parse tree reordered to Japanese word order]
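The transfer step can be sketched in a few lines of Python. This is only a toy illustration: the tuple-based parse trees and the single VP-reversal rule are invented for this example, not a real grammar.

```python
def transfer(tree):
    """Recursively reorder a (label, children) parse tree: VP -> [V, NP] becomes [NP, V]."""
    label, children = tree
    if isinstance(children, str):        # leaf node: (POS, word)
        return tree
    children = [transfer(c) for c in children]
    if label == 'VP':                    # English VO order -> Japanese OV order
        children = children[::-1]
    return (label, children)

def leaves(tree):
    """Read the words back off the tree, left to right."""
    label, children = tree
    if isinstance(children, str):
        return [children]
    return [w for c in children for w in leaves(c)]

# "I love you" parsed as S -> NP VP, VP -> V NP
eng = ('S', [('NP', [('PRP', 'I')]),
             ('VP', [('V', 'love'), ('NP', [('PRP', 'you')])])])

print(leaves(transfer(eng)))   # ['I', 'you', 'love'] -- SOV order
```

A real system would of course need a full grammar for the analysis phase and lexical translation in the generation phase; this only shows the reordering idea.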

Statistical MT : The Noisy Channel Model

Consider language pairs. For example, “I” in English is paired with “我” in Mandarin with some probability; “I” in English is paired with “Je” in French with some probability.
So the basic idea is to train the model on such parallel pairs. Let’s follow a translation system from French to English.
We want a model of p(e|f), the conditional probability of an English sentence e given the French sentence f.
We need to think about two components:
p(e): the language model; it could be a trigram model.
p(f|e): the translation model, trained on a parallel corpus of French/English sentence pairs.
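As a concrete (if toy) illustration of the p(e) component, here is a minimal maximum-likelihood trigram model. There is no smoothing, so any unseen trigram gets probability zero; the training sentences are invented.

```python
from collections import defaultdict

class TrigramLM:
    """Maximum-likelihood trigram language model: p(w_i | w_{i-2}, w_{i-1})."""

    def __init__(self):
        self.tri = defaultdict(int)   # trigram counts
        self.bi = defaultdict(int)    # bigram (history) counts

    def train(self, sentences):
        for s in sentences:
            words = ['<s>', '<s>'] + s + ['</s>']
            for i in range(2, len(words)):
                self.tri[tuple(words[i-2:i+1])] += 1
                self.bi[tuple(words[i-2:i])] += 1

    def prob(self, sentence):
        """p(e) as a product of trigram probabilities; 0.0 for unseen histories."""
        p = 1.0
        words = ['<s>', '<s>'] + sentence + ['</s>']
        for i in range(2, len(words)):
            if self.bi[tuple(words[i-2:i])] == 0:
                return 0.0
            p *= self.tri[tuple(words[i-2:i+1])] / self.bi[tuple(words[i-2:i])]
        return p

lm = TrigramLM()
lm.train([['i', 'love', 'you'], ['i', 'love', 'nlp']])
print(lm.prob(['i', 'love', 'you']))   # 0.5: after "i love", half the time comes "you"
```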

Technically, by Bayes' rule, p(e|f)=\frac { p(e)p(f|e) }{ p(f) } .
Also, { argmax }_{ e }p(e|f)={ argmax }_{ e }\frac { p(e)p(f|e) }{ p(f) } ={ argmax }_{ e }p(e)p(f|e), since p(f) does not depend on e.
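The argmax above can be sketched as a tiny scoring loop over candidate sentences. The candidates and all the probabilities below are invented for illustration:

```python
def decode(candidates, f, lm_prob, tm_prob):
    """Return argmax_e p(e) * p(f|e) over a list of candidate English sentences."""
    return max(candidates, key=lambda e: lm_prob(e) * tm_prob(f, e))

# hypothetical scores for the French sentence "je t'aime"
lm = {'i love you': 0.01, 'me love you': 0.0001}
tm = {("je t'aime", 'i love you'): 0.2, ("je t'aime", 'me love you'): 0.3}

best = decode(['i love you', 'me love you'], "je t'aime",
              lambda e: lm[e], lambda f, e: tm[(f, e)])
print(best)   # 'i love you': the language model outweighs the translation model
```

This shows why the two components are complementary: the translation model slightly prefers the ungrammatical candidate, but the language model corrects it.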

More about p(f|e). Note that f here is a generalized notation, standing for any foreign language.
If we translate a sentence from Spanish to English, a few candidates can be proposed.
[Image: candidate translations scored by the translation model alone]
We can add the language model.
[Image: the same candidates rescored with the language model added]
which gives more reasonable results.

Alignments

One challenge for us is that the word pairs are not always easy to find. We call these correspondences alignments. Take an English sentence as the source and a French sentence as the target.

[Image: a possible alignment between an English and a French sentence]

The graph above shows a possible alignment. In the following notation, we use l for the length of the source sentence and m for the length of the target sentence. The alignment a is the sequence of indices telling, for each word in the target sentence, which word in the source sentence it is paired with (the red lines).

So the job becomes: how can we find the most probable alignment during training?
We have models for p(a|e,m) and p(f|a,e,m), where p(f,a|e,m)=p(a|e,m)p(f|a,e,m). Summing over A, the set of all possible alignments, gives the probability of the target sentence given the source sentence and the target length: p(f|e,m)=\sum _{ a\in A }^{ }{ p(a|e,m)p(f|a,e,m) } .
[Image: the definition of p(f,a|e,m)]
For any alignment a, there is a probability given the two sentences and their lengths. We can choose the maximum one.
[Image: { a }^{ * }={ argmax }_{ a }p(a|f,e,m)]
So {a}^{*} is the most probable alignment.
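Under Model 1's independence assumptions, finding a* decomposes word by word: each target word is simply aligned to the source position whose t value is largest. A minimal sketch, with invented t values:

```python
def best_alignment(source, target, t):
    """For each target word f, pick argmax_i t(f | source[i]) over source positions."""
    return [max(range(len(source)), key=lambda i: t.get((f, source[i]), 0.0))
            for f in target]

# invented t(f|e) values for a tiny English -> French example
t = {('le', 'the'): 0.7, ('chien', 'dog'): 0.8, ('chien', 'the'): 0.05}
source = ['NULL', 'the', 'dog']   # position 0 is the special NULL word
target = ['le', 'chien']

print(best_alignment(source, target, t))   # [1, 2]: le -> the, chien -> dog
```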

Translation Probabilities

We need an estimate for p(f|a,e,m): the probability of the output target sentence f, given the alignment a, the target sentence length m and the input source sentence e. In IBM Model 1, it is defined this way:

p(f|a,e,m)\quad =\quad \prod _{ j=1 }^{ m }{ t({ f }_{ j }|{ e }_{ { a }_{ j } }) }
We are going to reuse the example from the alignments section.
[Image: the product of t values for the example alignment]
so the probability is a product of t values.
Simply put, t is a conditional probability. For example, t(Le|the) is the probability of the target word “Le” given the English word “the”: how confident we are that, in this target sentence, “the” translates to “Le”, i.e. that “Le” is generated from the word “the”.
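For a fixed alignment, computing p(f|a,e,m) is then just a product of t lookups. The t values below are invented for illustration:

```python
def translation_prob(source, target, alignment, t):
    """p(f|a,e,m) = prod_j t(f_j | e_{a_j}); alignment[j] indexes into source."""
    p = 1.0
    for j, f in enumerate(target):
        p *= t.get((f, source[alignment[j]]), 0.0)
    return p

t = {('le', 'the'): 0.5, ('chien', 'dog'): 0.4}
source = ['NULL', 'the', 'dog']
print(translation_prob(source, ['le', 'chien'], [1, 2], t))   # 0.5 * 0.4 = 0.2
```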
Here is the generative process of IBM Model 1:
[Image: the generative process of IBM Model 1]

Finding Mr.Alignment

Our main job is to find the alignments using sentences in both the source and target languages. We could ask someone to provide the alignments; however, that is too much work, and they would have to know both languages!
The core problem is to compute t values for possible pairs of target-language and English words. For example, we may find that English “I” always occurs with the French word “Je”.
The following implementation is a simplified version of this algorithm [*]:
[Image: pseudocode of the EM algorithm for estimating t]
Note that we do not calculate c(j|i,l,m), c(i,l,m) or the q values. T is usually set to 10 to 20 iterations. We go over each sentence pair; k indexes the pairs. You can find everything in [&].

Here is the code (also available for download on GitHub).


# coding: utf-8


import sys
from collections import defaultdict
import math,random,re

"""
IBM Model 1: estimate t parameters
EM Algorithm: finding the alignments
"""

class IBM_Model_1(object):
    """
    IBM Model 1: stores counts and the t(f|e) translation parameters.
    """

    def __init__(self):
        # sentence lists, kept fully in memory (not ideal for large corpora)
        self.source = list()
        self.target = list()

        self.source_corpos = defaultdict(int)
        self.target_corpos = defaultdict(int)
        #counts
        self.source_target_occurance = defaultdict(int) # c(e,f)
        self.source_occurance = defaultdict(int) # c (e)
        self.conditioned_occurance = defaultdict(int) # c(j|i,l,m)
        self.joint_occurance = defaultdict(int) # c (i,l,m)
        # translation probabilities
        self.t = defaultdict(float) # t (f|e)
        #self.q = defaultdict(float)
        #delta
        self.delta = defaultdict(float)

    def counter(self,source_file,target_file):
        for line in source_file:
            line = line.strip().split(' ')

            for word in line:
                self.source_corpos[word] +=1
            # optionally keep only words and digits
            #line = [x for x in line if re.match("^[A-Za-z0-9_-]*$", x)]
            line.insert(0,'NULL')
            self.source.append(line)
        for line in target_file:
            line = line.strip().split(' ')
            #line = [x for x in line if re.match("^[A-Za-z0-9_-]*$", x)]
            self.target.append(line)
            for word in line:
                self.target_corpos[word] +=1
        print ('Load done!')
        print ('Found %i words in source'%len(self.source_corpos))
        print ('Found %i words in target'%len(self.target_corpos))
        print (len(self.source))
        print (len(self.target))

    def em(self):

        # all expected counts start at zero (handled by defaultdict);
        # t(f|e) is initialized uniformly over the target vocabulary,
        # the standard Model 1 initialization
        init = 1.0 / len(self.target_corpos)
        for k in range(len(self.source)):
            for i in range(len(self.target[k])):
                for j in range(len(self.source[k])):
                    self.t[(self.target[k][i], self.source[k][j])] = init

        # number of EM iterations; 10 to 20 is typical
        for s in range(1, 11):
            # E step: reset the expected counts at the start of every iteration
            self.source_target_occurance = defaultdict(float)  # c(e,f)
            self.source_occurance = defaultdict(float)         # c(e)
            # iterate over every sentence pair
            for k in range(len(self.source)):
                m = len(self.target[k])
                l = len(self.source[k])
                for i in range(m):
                    # normalization term: sum of t over all source words
                    down = 0.0
                    for j in range(l):
                        down += self.t[(self.target[k][i], self.source[k][j])]
                    for j in range(l):
                        if down == 0.0:
                            delta = 0.0
                        else:
                            delta = self.t[(self.target[k][i], self.source[k][j])] / down
                        self.source_target_occurance[(self.source[k][j], self.target[k][i])] += delta
                        self.source_occurance[self.source[k][j]] += delta

            # M step: t(f|e) = c(e,f) / c(e)
            # (note: t keys are (f, e) while count keys are (e, f))
            for key in list(self.t):
                f, e = key
                if self.source_occurance[e] > 0:
                    self.t[key] = self.source_target_occurance[(e, f)] / self.source_occurance[e]
                else:
                    self.t[key] = 0.0

            print('Iteration', s, 'done')

    def write_t (self, output):

        for k,v in self.t.items():
            if v >0:
                output.write('%s %s %f\n'%(k[0],k[1],v))

        output.close()
        print ('Done!')

    def get_align(self, source, target, output):
        # re-read the sentences to align (this overwrites the training lists)
        self.source = list()
        self.target = list()

        for line in source:
            line = line.strip().split(' ')
            line.insert(0, 'NULL')  # position 0 is the NULL word, as in training
            self.source.append(line)
        for line in target:
            line = line.strip().split(' ')
            self.target.append(line)

        for k in range(len(self.source)):
            src = self.source[k]
            tgt = self.target[k]
            # for each word in the target sentence, pick the source position
            # with the highest t(f|e)
            for (index, f) in enumerate(tgt):
                alignments = defaultdict(float)
                for (indexe, e) in enumerate(src):
                    if self.t[(f, e)] > 0:
                        alignments[indexe] = self.t[(f, e)]
                if len(alignments) > 0:
                    a = max(alignments.items(), key=lambda item: item[1])[0]
                    # columns: sentence number, target word position,
                    # aligned source word position (0 means NULL)
                    output.write('%i %i %i\n' % (k + 1, index + 1, a))
                    print(k + 1, index + 1, a)
        output.close()
        print('Finish Writing!')

def usage():
    print ("""

        Read in both English and Spanish files
    """)

if __name__ == "__main__":

    source_file = open (r'small_corpus.en','r')
    target_file = open (r'small_corpus.es','r')
    output_file = open ('small_output','w')
    output_key = open ('small_keys','w')
    source = open (r'dev.en','r')
    target = open (r'dev.es','r')
    # Initialize the model
    my_model = IBM_Model_1()
    my_model.counter(source_file,target_file)
    my_model.em()
    my_model.write_t(output_file)
    my_model.get_align(source,target,output_key)

My inputs are two corpus files: 100 English sentences and 100 Spanish sentences, in pairs. They are quite small, so the results were not that good. The following shows some lines from the English and Spanish files, respectively.

[Image: sample lines from the English corpus]

[Image: sample lines from the Spanish corpus]

The expected outputs are alignments. You can find a part of them here:
[Image: sample lines of the alignment output]
There are three columns: the first is the index of the sentence in the target corpus, the second is the word index within the target sentence, and the third is the index of the paired word in the source sentence.
I also tested with a larger dataset; however, the result was still not satisfactory. I hope to find out the reasons in the near future.

 


References

Find the Coursera slides here: https://d396qusza40orc.cloudfront.net/nlangp/mt1.pdf
[&] Find the assignment here: https://class.coursera.org/nlangp-001/assignment/view?assignment_id=7
[*] Notes on IBM Models 1 and 2: http://www.cs.columbia.edu/~mcollins/courses/nlp2011/notes/ibm12.pdf
Screenshots of equations are from: https://d396qusza40orc.cloudfront.net/nlangp/mt2.pdf
