It may be a little early to start with machine translation (MT), but here we go.
Anyway, this post only scratches the surface: it gives you a view of the basics, along with an implementation and a rather bad result, which leaves you plenty of chances to optimize. By the way, you might learn some Chinese here 😛
Lexical ambiguity: the same word has different meanings depending on context. “Can you do this?” vs. “A can of Coke.”
Word order: usually the order of subject, verb and object. English: SVO; Japanese: SOV.
We can translate a sentence word by word.
English: I love you.
Mandarin: 我(I, read as “wǒ”)爱(love, “aì”)你(you, “nǐ”)。
Here we see a very nice word-by-word alignment that also keeps the word order. Say we want to translate a sentence but the order is different: we can simply translate all the words first, and then change the order by following some rules. This gives us the transfer-based approach.
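The word-by-word idea can be sketched as a simple dictionary lookup (the toy dictionary below is just for illustration):

```python
# Word-by-word translation with a toy dictionary (entries are illustrative).
toy_dict = {"I": "我", "love": "爱", "you": "你"}

def translate_word_by_word(sentence, dictionary):
    # Look each word up; keep the original word if it is unknown.
    return [dictionary.get(w, w) for w in sentence.split()]

print(translate_word_by_word("I love you", toy_dict))  # ['我', '爱', '你']
```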
1. Analysis: learn features of source language sentences.
2. Transfer: convert source sentence tree to a target-language parse tree.
3. Generation: convert the target-language parse tree to output sentence.
The following is an example, showing the transfer of an English sentence to Japanese.
So first we have the rules (e.g., S -> NP-A VP) from the source language; given a sentence, we can then build a parse tree. Then we reorder some constituents to fit the rules of the target language.
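As a minimal sketch of the transfer step: here parse trees are nested (label, children) tuples, and a single hand-written rule swaps a VP's verb and object for an SOV target. The tree format and the rule are my own simplification, not a full grammar:

```python
# Transfer sketch: reorder a VP's (verb, object) children into
# (object, verb) for an SOV target language such as Japanese.
def transfer_svo_to_sov(tree):
    label, children = tree
    if label == "VP" and len(children) == 2:
        verb, obj = children
        children = [obj, verb]
    # Recurse into subtrees; leave leaf strings untouched.
    return (label, [c if isinstance(c, str) else transfer_svo_to_sov(c)
                    for c in children])

def leaves(tree):
    # Read the sentence back off the tree, left to right.
    label, children = tree
    out = []
    for c in children:
        out.extend([c] if isinstance(c, str) else leaves(c))
    return out

src = ("S", [("NP", ["I"]), ("VP", [("V", ["saw"]), ("NP", ["Mary"])])])
print(leaves(transfer_svo_to_sov(src)))  # ['I', 'Mary', 'saw']
```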
Statistical MT : The Noisy Channel Model
Language pairs. For example, English “I” is paired with Mandarin “我” with some probability, and with French “Je” with another. So the basic idea is to train the model on those parallel pairs. Let's follow a translation system from French to English.
We want a model for p(e | f), the conditional probability of an English sentence e given the French sentence f.
We need to think about two components:
p(e): the language model. It could be a trigram model.
p(f | e): the translation model, trained from a parallel corpus of French/English pairs.
More about p(f | e): note f here is a generalized symbol for the foreign (source) language, not just French.
If we translate a sentence from Spanish to English, several candidates can come up.
We can add the language model, ranking candidates by p(e) × p(f | e), which gives more reasonable results.
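The noisy-channel combination can be sketched like this, with made-up toy probabilities standing in for a real language model and translation model:

```python
import math

# Noisy-channel scoring sketch: pick the English candidate e that
# maximizes p(e) * p(f|e). All probabilities here are made-up toy numbers.
lm = {"I love you": 0.02, "me love you": 0.0001}   # p(e), e.g. from a trigram model
tm = {("I love you", "je t'aime"): 0.3,            # p(f|e), the translation model
      ("me love you", "je t'aime"): 0.4}

def best_translation(f, candidates):
    # argmax_e  log p(e) + log p(f|e)
    return max(candidates, key=lambda e: math.log(lm[e]) + math.log(tm[(e, f)]))

print(best_translation("je t'aime", ["I love you", "me love you"]))  # I love you
```

Note that the translation model alone would have preferred the ungrammatical “me love you”; the language model fixes the ranking.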
One challenge for us is that the word pairs are not always easy to find. We call them alignments. Take an English sentence as the source and a French sentence as the target.
The graph shown above is a possible alignment. In the following notation, we use l for the source sentence length and m for the target sentence length. The alignment a = (a_1, ..., a_m) is the series of indices telling which source word each target word is supposed to be paired with (the red lines).
So the job becomes: how can we find the most probable alignments during training?
We have models for p(f, a | e, m) and p(f | e, m), where p(f | e, m) = Σ_{a ∈ A} p(f, a | e, m). A is the set of all possible alignments. That is, given the source sentence and the target sentence length, the probability of the target sentence is a sum over all possible alignments. From these we can get p(a | f, e, m) = p(f, a | e, m) / Σ_{a' ∈ A} p(f, a' | e, m).
For any alignment a, there is a probability p(a | f, e, m) given the two sentences with their lengths. We can choose the maximum one.
So, a* = argmax_a p(a | f, e, m) is the most probable alignment.
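Under IBM Model 1 the best alignment decomposes word by word, so once we have t values each target word can simply pick its best source word independently. A sketch with toy t values:

```python
# IBM Model 1 alignment sketch: with t(f|e) fixed, the best alignment
# decomposes word by word: a_i = argmax_j t(f_i | e_j).
# The t table below holds toy values for illustration.
t = {("le", "the"): 0.7, ("le", "dog"): 0.05,
     ("chien", "the"): 0.1, ("chien", "dog"): 0.8}

def best_alignment(f_words, e_words):
    # For each target word, pick the source index with the highest t(f|e).
    return [max(range(len(e_words)), key=lambda j: t.get((f, e_words[j]), 0.0))
            for f in f_words]

print(best_alignment(["le", "chien"], ["the", "dog"]))  # [0, 1]
```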
Next we need an estimate for p(f, a | e, m): given alignment a, target sentence length m and input source sentence e, the probability of the output target sentence f. In IBM Model 1 it is defined as
p(f, a | e, m) = (1 / (l + 1)^m) × Π_{i=1..m} t(f_i | e_{a_i}),
where the (l + 1)^m term comes from choosing each alignment variable uniformly (the source sentence has l words plus NULL).
We are going to use the same example as in the alignments section, so the probability is a product of t values.
Simply put, t is a conditional probability. For example, t(Le | the) is the probability of the target word “Le” given the English word “the”, i.e. how confident we are that in this target sentence we can translate “the” to “Le”, or generate “Le” from the word “the”.
Here is the generative process of IBM Model 1:
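That generative story can be sketched as follows, with an illustrative t table: pick each alignment variable uniformly (NULL included), then sample the target word from t(· | e_{a_i}):

```python
import random

# Generative sketch of IBM Model 1: given an English sentence e (with NULL
# at position 0), pick each alignment a_i uniformly over the source
# positions, then sample f_i from t(. | e_{a_i}).
# The t table below holds toy values for illustration.
t = {"the": {"le": 0.9, "la": 0.1},
     "dog": {"chien": 1.0},
     "NULL": {"de": 1.0}}

def generate(e_words, m, rng):
    f = []
    for _ in range(m):
        a = rng.randrange(len(e_words))          # uniform alignment, incl. NULL
        dist = t[e_words[a]]
        words, probs = zip(*dist.items())
        f.append(rng.choices(words, weights=probs)[0])  # f_i ~ t(.|e_{a_i})
    return f

print(generate(["NULL", "the", "dog"], 2, random.Random(0)))
```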
One main job for us is to find the alignments using sentences in both the source and target languages. We could ask someone to provide the alignments by hand, but that would be far too much work, and the annotators would have to know both languages!
The core problem is to compute t values for possible pairs of target-language and English words. For example, we can find that English “I” almost always occurs with French “Je”.
The following implementation is a simplified version of this algorithm [*]:
Note we did not calculate the c(j | i, l, m) and c(i, l, m) values (they are only needed for IBM Model 2). The number of EM iterations S is usually set to 10 to 20. We go over each sentence pair; n is the total number of pairs. You can find everything in [&].
Here is the code. Or go here to download it via GitHub.
# coding: utf-8
import sys
import math, random, re
from collections import defaultdict

"""
IBM Model 1: estimate the t parameters
EM Algorithm: finding the alignments
"""

class IBM_Model_1(object):
    """Stores the corpora, counts and t parameters for IBM Model 1."""

    def __init__(self, n=3):
        # sentences <-- bad solution (everything is kept in memory)
        self.source = list()
        self.target = list()
        self.source_corpos = defaultdict(int)
        self.target_corpos = defaultdict(int)
        # counts
        self.source_target_occurance = defaultdict(float)  # c(e, f)
        self.source_occurance = defaultdict(float)         # c(e)
        self.conditioned_occurance = defaultdict(int)      # c(j | i, l, m), unused here (IBM Model 2)
        self.joint_occurance = defaultdict(int)            # c(i, l, m), unused here (IBM Model 2)
        self.t = defaultdict(float)                        # t(f | e)
        self.delta = defaultdict(float)

    def counter(self, source_file, target_file):
        for line in source_file:
            line = line.strip().split(' ')
            for word in line:
                self.source_corpos[word] += 1
            # every source sentence gets the special NULL word at position 0
            line.insert(0, 'NULL')
            self.source.append(line)
        for line in target_file:
            line = line.strip().split(' ')
            self.target.append(line)
            for word in line:
                self.target_corpos[word] += 1
        print('Load done!')
        print('Found %i words in source' % len(self.source_corpos))
        print('Found %i words in target' % len(self.target_corpos))
        print(len(self.source))
        print(len(self.target))

    def em(self):
        # All counts start at zero (handled by defaultdict);
        # t(f | e) is initialized uniformly over the source vocabulary.
        for k in range(len(self.source)):
            for i in range(len(self.target[k])):
                for j in range(len(self.source[k])):
                    self.t[(self.target[k][i], self.source[k][j])] = \
                        1.0 / len(self.source_corpos)
        S = 10  # number of EM iterations, normally about 10 to 20
        for s in range(1, S + 1):
            # the expected counts are recomputed from scratch each iteration
            self.source_target_occurance.clear()
            self.source_occurance.clear()
            # iterate over every sentence pair
            for k in range(len(self.source)):
                m = len(self.target[k])
                l = len(self.source[k])
                for i in range(m):
                    down = 0.0
                    for j in range(l):
                        down += self.t[(self.target[k][i], self.source[k][j])]
                    for j in range(l):
                        if down == 0.0:
                            self.delta[(k, i, j)] = 0.0
                        else:
                            self.delta[(k, i, j)] = \
                                self.t[(self.target[k][i], self.source[k][j])] / down
                        self.source_target_occurance[(self.source[k][j], self.target[k][i])] += self.delta[(k, i, j)]
                        self.source_occurance[self.source[k][j]] += self.delta[(k, i, j)]
            # update t(f | e) = c(e, f) / c(e)
            for (f, e) in list(self.t.keys()):
                if self.source_occurance[e] > 0:
                    self.t[(f, e)] = self.source_target_occurance[(e, f)] / self.source_occurance[e]
                else:
                    self.t[(f, e)] = 0.0
            print('Now..', s)

    def write_t(self, output):
        for (f, e), v in self.t.items():
            if v > 0:
                output.write('%s %s %f\n' % (f, e, v))
        output.close()
        print('Done!')

    def get_align(self, source, target, output):
        # sentences <-- bad solution
        self.source = list()
        self.target = list()
        for line in source:
            line = line.strip().split(' ')
            self.source.append(line)
        for line in target:
            line = line.strip().split(' ')
            self.target.append(line)
        # for each sentence pair
        for i in range(len(self.source)):
            source_sent = self.source[i]
            target_sent = self.target[i]
            # for each word in the target sentence
            for (index, f) in enumerate(target_sent):
                alignments = defaultdict(float)
                for (indexe, e) in enumerate(source_sent):
                    if self.t.get((f, e), 0.0) > 0:
                        alignments[indexe] = self.t[(f, e)]
                if len(alignments) == 0:
                    a = 0
                else:
                    # find the source position with the highest t(f | e)
                    a = max(alignments.items(), key=lambda kv: kv[1])[0]
                # sentence index, target word index, aligned source word index (1-based)
                output.write('%i %i %i\n' % (i + 1, index + 1, a + 1))
        output.close()
        print('Finish Writing!')

def usage():
    print("""Read in both English and Spanish files""")

if __name__ == "__main__":
    source_file = open(r'small_corpus.en', 'r')
    target_file = open(r'small_corpus.es', 'r')
    output_file = open('small_output', 'w')
    output_key = open('small_keys', 'w')
    source = open(r'dev.en', 'r')
    target = open(r'dev.es', 'r')

    my_model = IBM_Model_1()
    my_model.counter(source_file, target_file)
    my_model.em()
    my_model.write_t(output_file)
    my_model.get_align(source, target, output_key)
My inputs are two corpus files: 100 English sentences and 100 Spanish sentences in pairs. They are quite small, so the results were not that good. The following shows some lines from the English and Spanish files respectively.
The expected outputs are alignments. You can find a part of them here:
There are three columns: the first is the index of the sentence in the target corpus, the second is the word index in the target sentence, and the third is the index of the paired word in the source sentence.
I also tested with a larger dataset, but the result is still not satisfactory. I hope to find out the reasons in the near future.
Find the Coursera slides here: https://d396qusza40orc.cloudfront.net/nlangp/mt1.pdf
[&] Find the assignments here: https://class.coursera.org/nlangp-001/assignment/view?assignment_id=7
[*] IBM Model notes: http://www.cs.columbia.edu/~mcollins/courses/nlp2011/notes/ibm12.pdf
Screenshots of equations are from here: https://d396qusza40orc.cloudfront.net/nlangp/mt2.pdf