NLP 02: A Trigram Hidden Markov Model (Python)

After the basic HMM, let's work on a Trigram HMM directly on text. I will first introduce the model, then give pieces of code for practice.
I am not going to give a full solution, since the course still runs every year; see the references for more.

Model


An example is given below:
“Deep learning is part of a broader family of machine learning methods.”

So here we assign {x}_{i} to the i-th word: {x}_{1}=Deep, {x}_{2}=learning, etc. Here n is the length of the sentence and the index starts from 1. We also assign {x}_{0} and {x}_{-1} a special mark "*" that marks the start of a sentence. We use {y}_{i} to represent the tag of each word, and S is the set of all possible tags. If S is {N,V,D,P}, then each y takes one value from S. Keep in mind that there is also a special "STOP" tag that signals the end of the sentence, so {y}_{n+1}=STOP. Our goal is to find a plausible sequence of labels (y values), e.g., N,V,V,...,STOP. So we need to model p({x}_{1},..,{x}_{n},{y}_{1},..,{y}_{n+1}), the joint probability that the sentence and the tag sequence appear together.
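Written out in full, the trigram HMM factorizes this joint probability into a transition part and an emission part (this is the standard form of the model; the parameters q and e are explained below):

p({x}_{1},..,{x}_{n},{y}_{1},..,{y}_{n+1})=\prod _{ i=1 }^{ n+1 }{ q({y}_{i}|{y}_{i-2},{y}_{i-1}) } \times \prod _{ i=1 }^{ n }{ e({x}_{i}|{y}_{i}) }

with {y}_{0}={y}_{-1}=* and {y}_{n+1}=STOP.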

We take a shorter sentence as an example: “Irene likes iPhone”. We assume the tag sequence is N,V,P,STOP.
So the probability is:
p({x}_{1},..,{x}_{3},{y}_{1},..,{y}_{4})=q(N|*,*)\times q(V|*,N)\times q(P|N,V)\times q(STOP|V,P)\times e(Irene|N)\times e(likes|V)\times e(iPhone|P)

Why is it called a Trigram HMM?
The model looks at three tags at a time, hence "Trigram". The first part on the right-hand side of the equation is a Markov chain: by assumption, the tag of the current word depends only on the previous two tags. The second part is the emission component: the probability that word x is emitted while its tag is y. Some words can act as either a Noun or a Verb, like the word "record", so after learning from a training set you obtain both e("record"|Verb) and e("record"|Noun).
We can think of all the x as A and all the y as B; basically we want to learn the joint distribution p(A,B)=p(B)\times p(A|B). The Markov chain part models the collection B, and the emission part models A conditioned on B.
We use maximum likelihood estimation to compute the emission values: e(x|y)=\frac { Count(x,y) }{ Count(y) }. Count(x,y) is the number of times word x appears in the dataset with tag y, and Count(y) is the number of times tag y appears. For example, to get e("record"|Verb), we count the occurrences of "record" in the role of a Verb and divide by the total number of Verb occurrences.
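For instance, with hypothetical counts Count(record,Verb)=5 and Count(Verb)=50, we would get e("record"|Verb)=\frac { 5 }{ 50 }=0.1.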
Eventually the sequence of tags is what we want, so we take an argmax: \underset { {y}_{1},..,{y}_{n+1} }{ argmax } \quad p({x}_{1},..,{x}_{n},{y}_{1},..,{y}_{n+1}). We first find the maximum probability over tag sequences, and then recover the sequence that achieves it.

Implementation

The original data has one token (a word or a character) per line, and sentences are separated by empty lines. The training set uses the same layout but adds a second column that holds the tag, so each row is an (x,y) pair. The tagging goal is: given test data that provides only x, produce the corresponding (x,y) pairs as output.
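For illustration only, here is a made-up snippet in the described two-column format, using the two tags that appear in our experiments ('O' and 'I-GENE'); an empty line would follow at the end of the sentence:

Comparison O
with O
alkaline I-GENE
phosphatase I-GENE
. O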

Counts

As mentioned before, we need some basic statistics to get counts of appearances. We need two kinds of counts: how many times a word x appears with a certain role (tag) y, and how many times a tag appears after the preceding tag(s). For the second kind there are three types: unigram (looking at {y}_{k} only), bigram (looking at pairs {y}_{k},{y}_{k+1}) and trigram (looking at sequences {y}_{k},{y}_{k+1},{y}_{k+2}).
In the experiments, we need to estimate the trigram parameter:
q({y}_{k}|{y}_{k-2},{y}_{k-1})=\frac { Count({y}_{k-2},{y}_{k-1},{y}_{k}) }{ Count({y}_{k-2},{y}_{k-1}) }.
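As a quick worked example with made-up numbers: if the tag bigram (N,V) appears 20 times in the training data and the tag trigram (N,V,P) appears 5 times, then q(P|N,V)=\frac { 5 }{ 20 }=0.25.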

Rare Words

You might have noticed that if any single probability in the joint product is zero, the whole result becomes zero. What if we meet a new word that never appeared in the training set, or a low-frequency word (e.g., one whose count is lower than 5)? One solution is to separate low-frequency and unseen words from the others: we replace them with a special word "_RARE_" and then compute the statistics as usual. After training, whenever we meet a new word, we treat it as "_RARE_".
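A minimal sketch of this preprocessing step (the function name and the threshold of 5 are illustrative assumptions, not the assignment's code):

from collections import Counter

def replace_rare_words(tagged_sentences, threshold=5):
    # replace words seen fewer than `threshold` times with the token '_RARE_'
    word_counts = Counter(word for sentence in tagged_sentences
                          for word, tag in sentence)
    return [[(word if word_counts[word] >= threshold else '_RARE_', tag)
             for word, tag in sentence]
            for sentence in tagged_sentences]

Here tagged_sentences is assumed to be a list of sentences, each a list of (word, tag) pairs read from the training file.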

Codes on Counting

The code skeleton can be found in [2].
We can use an Hmm class.

from collections import defaultdict
from operator import itemgetter  # used by get_argmax below


class Hmm(object):
    '''
    Stores counts for n-grams and emissions.
    '''

    def __init__(self, n=3):
        self.n = n
        self.emission_counts = defaultdict(int)
        self.ngram_counts = [defaultdict(int) for i in range(self.n)]
        self.all_states = set()
...

A defaultdict(int) stores <key, value> pairs where a missing key gets a default value of 0; the key can be any hashable object. Here we use emission_counts to store <(word, tag), times>, i.e., how many times this word appears with this tag; note that the key is a tuple.
ngram_counts is a list of defaultdict(int) objects that stores the 1- to 3-gram counts. For example, the unigram dict stores entries like <(tag,), times>, and the bigram dict stores <(tag1, tag2), times> (how many times tag1 is followed by tag2). all_states gives you the set S, the collection of all possible tags; in our experiments there are only two tags, {'O', 'I-GENE'}.
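The counting itself is not shown above; as a rough, illustrative sketch (this helper and its exact padding scheme are my own, not the assignment's code), the counts for one tagged sentence could be accumulated like this:

def count_sentence(hmm, tagged_sentence):
    # accumulate emission and tag n-gram counts for one list of (word, tag) pairs
    for word, tag in tagged_sentence:
        hmm.emission_counts[(word, tag)] += 1
        hmm.all_states.add(tag)
    # pad the tag sequence with two '*' at the start and 'STOP' at the end,
    # then count tag unigrams, bigrams and trigrams over the padded sequence
    padded = ['*', '*'] + [tag for _, tag in tagged_sentence] + ['STOP']
    for n in range(1, hmm.n + 1):
        for i in range(len(padded) - n + 1):
            hmm.ngram_counts[n - 1][tuple(padded[i:i + n])] += 1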
Part of the counting output is written to a file, so next time we can read the counts from that file directly.
The simplest gene tagger can be written as {y}^{*}=arg\underset { y }{ max } \quad e(x|y). The following methods of the Hmm class compute the emission values and then produce the full list of tags as the result.


    def get_emission_paras(self):
        # stores <(word, tag), emission value>
        self.emision_parameters = defaultdict(float)
        for wordNtag, counts in self.emission_counts.items():
            tag = wordNtag[1]  # get the tag from the (word, tag) tuple
            tagCount = self.ngram_counts[0][(tag,)]
            emission = counts / tagCount
            self.emision_parameters[wordNtag] = emission

    def get_argmax(self):
        # build, for each word, the list of (tag, emission value) candidates
        self.argmax_list = defaultdict(list)
        self.argmax_value = defaultdict(str)
        for wordNtag, values in self.emision_parameters.items():
            self.argmax_list[wordNtag[0]].append((wordNtag[1], values))
        # for each word, keep the tag with the highest emission value
        for word, values in self.argmax_list.items():
            tag = max(values, key=itemgetter(1))[0]
            self.argmax_value[word] = tag

        print(self.argmax_value)
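A hypothetical way to use these two methods, assuming the counts have already been loaded into an Hmm instance (for example from the count file mentioned above):

hmm = Hmm(3)
# ... fill hmm.emission_counts, hmm.ngram_counts and hmm.all_states from the count file ...
hmm.get_emission_paras()
hmm.get_argmax()  # prints the word -> most likely tag mapping
baseline_tag = hmm.argmax_value.get('phosphatase')  # look up the baseline tag for one (hypothetical) word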

You can also easily get q({y}_{k}|{y}_{k-2},{y}_{k-1})=\frac { Count({y}_{k-2},{y}_{k-1},{y}_{k}) }{ Count({y}_{k-2},{y}_{k-1}) }:

    def get_trigram_param(self):
        # self.trigram and self.bigram hold the trigram and bigram tag counts
        # (i.e. self.ngram_counts[2] and self.ngram_counts[1])
        self.trigram_param = defaultdict(float)
        for tag_tuple, counts in self.trigram.items():
            subtag = tag_tuple[:-1]  # the conditioning bigram (y_{k-2}, y_{k-1})
            tagCount = self.bigram[subtag]
            if tagCount:
                q = counts / tagCount
            else:
                q = 0.
            self.trigram_param[tag_tuple] = q

The Viterbi Algorithm

    #Viterbi Algo
    def viterbi(self,sentence):

        def findSet(index):
            if index in range(1,len(sentence)+1):
                return self.all_states
            elif index == 0 or index == -1:
                return {'*'}
        #inner function to compute pi values--start
        def pi_viterbi(k,u,v,sentence):
            prob = defaultdict(float)
            #initialization
            if k==0 and u == '*' and v == '*':
                return (1.,'*')
            else:
                for w in findSet(k-2):
                    prev = pi_viterbi(k-1,w,u,sentence)[0]
                    # transition q(v|w,u) times emission e(x_k|v)
                    q = self.trigram_param[tuple((w,u,v))]
                    e = self.emision_parameters[tuple((sentence[k-1].lower(),v))]
                    probability = prev * q * e
                    prob[tuple((w,u))] = probability
                max_tuple = max(prob.items(), key=lambda x: x[1])

                #print (max_tuple[1],max_tuple[0][0])
                return max_tuple[1],max_tuple[0][0]

        #inner function to compute pi values--end

        sentence_with_tag = list()
        backpointer = defaultdict(str)
        tags = defaultdict(str)
        k = len(sentence)
        u_glob = ''
        v_glob = ''
        glob = 0.
        for i in range(1,k+1):
            prob = defaultdict(float)
            for u in findSet(i-1):
                for v in findSet(i):
                    value, w = pi_viterbi(i,u,v,sentence)
                    prob[tuple((i,u,v))] = value
                    backpointer[tuple((i,u,v))] = w
            max_tuple = max(prob.items(), key=lambda x: x[1])

            # the backpointer bp(i,u,v) = w was already stored in the inner loop above

            #sentence_with_tag.append(max_tuple[0][-1])
            u_glob = max_tuple[0][-2]
            v_glob = max_tuple[0][-1]
            glob = max_tuple[1]
            print ('Max',max_tuple)
        tags[k-1]=u_glob
        tags[k]=v_glob

        for i in range((k-2),0,-1):
            tag = backpointer[tuple(((i+2),tags[i+1],tags[i+2]))]
            tags[i]=tag

        tag_list=list()
        for i in range(1,len(tags)+1):
            tag_list.append(tags[i])

        #tag list as results
        return tag_list
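Putting the pieces together, a hypothetical end-to-end run (the count-loading step and the trigram/bigram aliases are assumptions layered on top of the snippets above):

hmm = Hmm(3)
# ... fill hmm.emission_counts, hmm.ngram_counts and hmm.all_states from the count file ...
hmm.trigram, hmm.bigram = hmm.ngram_counts[2], hmm.ngram_counts[1]  # aliases used by get_trigram_param
hmm.get_emission_paras()
hmm.get_trigram_param()

sentence = ['comparison', 'with', 'alkaline', 'phosphatase']  # a hypothetical tokenized sentence
print(hmm.viterbi(sentence))  # the predicted tag sequence for the sentence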

Instead of replacing all rare words with '_RARE_', we can also try to improve the model by classifying them into different categories (such as all digits, all capitals, etc.). A simple classifier for rare words using regular expressions:

import re

def classify(word):
    # classify rare words into more precise groups
    # contains a digit
    if re.findall(r'\d', word):
        return 'NUM'
    # consists only of capital letters
    elif re.findall(r'^[A-Z]+$', word):
        return 'ALL_CAP'
    # ends with a capital letter, but is not all capitals
    elif re.findall(r'[a-z]+[A-Z]+$', word):
        return 'LAST_CAP'
    else:
        return 'RARE'
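As a hypothetical follow-up to the earlier '_RARE_' sketch, the classifier can be plugged in so that each low-frequency word is replaced by its class token instead of a single '_RARE_' symbol (again, the function name and threshold are my own):

from collections import Counter

def replace_with_classes(tagged_sentences, threshold=5):
    # like replace_rare_words above, but map each low-frequency word to its class token
    word_counts = Counter(word for sentence in tagged_sentences
                          for word, tag in sentence)
    return [[(word if word_counts[word] >= threshold else classify(word), tag)
             for word, tag in sentence]
            for sentence in tagged_sentences]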

The input is a list of words and the output is a sequence of tags. Here is a comparison of the results:
[Figure: comparison of the tagging results]


References:
[1] A detailed description of the data and the task.
[2] The course videos. If you have finished Week 1 and Week 2, feel free to start the assignment.
