# NLP 02: A Trigram Hidden Markov Model (Python)

After HMMs, let’s work on a Trigram HMM directly on texts.First will introduce the model, then pieces of code for practicing.
But not going to give a full solution as the course is still going every year, find out more in references.

#### Model

An example is given below:
“Deep learning is part of a broader family of machine learning methods.”

So here we assign ${x}_{i}$ to ith word: ${x}_{1}=Deep$,${x}_{2}=learning$,etc. $n$ is the length of sentence, index starts from 1. Also, we assign ${x}_{0}$ and ${x}_{-1}$ to a special mark “*”, to mark the starting of a sentence. We use ${y}_{i}$ to represent the tags to each of them, and $S$ is a collection of all possible tags. If $S$ is {$N,V,D,P$},then $y$ will be one value from S. Keep in mind that there is a “STOP” tag as a remarkable signal at the end of the sentence. So our goal is to give a possible sequence of labels (y values), e.p, N,V,V,…,STOP. So we need to find out $p({x}_{1},..,{x}_{n}, {y}_{1},,{y}_{n+1})$, and the joint possibility means that sentence and the tag sequence “appear together”.

We take a shorter sentence as an example: “Irene likes iPhone”. We assume the tag sequence is N,V,P,STOP.
So the probability is :
$p({ x }_{ 1 },..,{ x }_{ 3 },{ y }_{ 1 },,{ y }_{ 4 })=q(N|*,*)\times q(V|*,N)\times q(P|N,V)\times q(STOP|V,P)\times q(Irene|N)\times q(like|V)\times q(iPhone|P)$

Why called Trigram HMM?
The model looks at three words as a bag at each step (Trigram). In the first part on the right part of the equation, there is a Markov Chain. From the definition, we’ve made an assumption that the tag for the current word, is depending on the previous two words. The second part is an emission component, that is what is the probability of this word x recognized as a tag y. Some words can be treated as Noun or Verb, like the word “record”. So after learning from a training set,you are able to get both $p("record"|Verb), p("record"|Noun)$.
We can think of all x to be A, all y to be B. So basically we want to learn a distribution:$p(A,B)=p(B)\times p(A|B)$. The Markov Chain part takes all collection of B, the emission part takes all of A conditioned on B.
We are going to use likelyhood estimation to calculate emission values: $e(x|y)=\frac { Count(x,y) }{ Count (y) }$. Count(x,y) refers to how many times in the dataset that when word x appears with the tag being y. Count(y) referes to how many times tag y appears. For example, to get $p("record"|Verb)$, we need to find out the number of occurrence that “record” as the role of Verb, then divided by the total number of Verb occurrence number.
Eventually the sequence of tags are what we want to get, so an argmax will be used, $\underset { {y}_{1},..,{y}_{n} }{ argmax } \quad p({x}_{1},..,{x}_{n}, {y}_{1},..,{y}_{n+1})$. First we need to find out a maximum probability of the sequence, then find out the mapping sequence.

#### Implementation

The original data:

In each line, there is one word or a character. There will be empty lines. The training set is shown below:

Where the second colomn is the tags. Each row is a pair of (x,y). The goal for tagging is, given a testing data who provides only x, we need to get an output of (x,y) pairs.

##### Counts

As mentioned before, we need to do a basic statistics to get counts of appearance. Here we need to kinds of counts: a word x appears as a certain role (tag) y, a tag ${y}_{k+2}$ appears after tag(s). Especially for the second one, there types, unigram(looking for ${y}_{k}$ only), bigram(looking for pairs of ${y}_{k},{y}_{k+1}$) and trigram(looking for sequence of ${y}_{k},{y}_{k+1},{y}_{k+2}$).
In the experiments, we need to get a trigram:
$q({y}_{k}|{y}_{k-2},{y}_{k-1})=\frac {Count ({y}_{k-2},{y}_{k-1}{y}_{k})}{Count({y}_{k-2},{y}_{k-1})}$.

##### Rare Words

You might have noticed that in the joint probability, if one of the probability is zero, it will lead to a zero result. What if we have a new word which has never appeared in the training set. Or any low-frequency words, who has a low probability (e.p, Count number is lower than 5). One solution is we can split low-frequency words or new words from the others. So we can replace them with a special word “_RARE_”, then we can do a basic statistics all together. After training, when we meet any new words, we then treat them as “_RARE_”.

##### Codes on Counting

Found codes in [2].
We can use an HMM class.

from collections import defaultdict
class Hmm(object):
'''
Stores counts for n-grams and emissions.
'''

def __init__(self, n=3):

self.n = n
self.emission_counts = defaultdict(int)
self.ngram_counts = [defaultdict(int) for i in xrange(self.n)]
self.all_states = set()
...


The defaultdict(int) allows you to store a <key,value> pair, and the type of the value must be int, while the type of key could be different. Here we use the emission_counts to store <(word,tag), times>, that is how many times this word appears as this tag, note the key is a tuple object.
ngram_counts is a list of defaultdict(int), it stores 1 to 3-gram counts. For example, 1-gram (unigram) counting as a defaultdict, stores something like this <(tag), times>, 2-gram (bigram) is a defaultdict of <(tag1,tag2), times> (how many times does tag1 appears followed by tag2). all_states actually gives you the set of S, the collection of all possible tags, in our experiments, we only have {‘O’,’I-GENE’} two types of tags.
The output of counting (part) is as following:

We can read from the file next time directly to get those counts.
The most simple gene tagger could be illustrate as ${ y }^{ * }=arg\underset { y }{ max } \quad e(x|y)$. The method in HMM class can be used to get emission values. Then get a whole list of tag for training set as the result.


def get_emission_paras(self):
# stores &amp;lt;(word,tag), emission value&amp;gt;
self.emision_parameters = defaultdict(float)
for wordNtag,counts in self.emission_counts.items():
tag = wordNtag[1] #get tag of the tuple
tagCount = self.ngram_counts[0][(tag,)]
emission = counts / tagCount
self.emision_parameters[wordNtag]=emission


 def get_argmax(self):
#buid argmax list
for wordNtag,values in self.emision_parameters.items():
self.argmax_list[wordNtag[0]].append((wordNtag[1],values))
#build argmax real tag
for word,values in self.argmax_list.items():
tag = max(values,key=itemgetter(1))[0]
self.argmax_value[word]=tag

print(self.argmax_value)


You can also easily get $q({y}_{k}|{y}_{k-2},{y}_{k-1})=\frac {Count ({y}_{k-2},{y}_{k-1}{y}_{k})}{Count({y}_{k-2},{y}_{k-1})}$:

def get_trigram_param(self):

self.trigram_param = defaultdict(float)
for tag_tuple,counts in self.trigram.items():
subtag = tuple(tag_tuple[1:]) #get tag of the tuple
tagCount = self.bigram[subtag]
if tagCount:
q = counts / tagCount
else:
q = 0.
self.trigram_param[tag_tuple]=q



The Viterbi Algorithm

    #Viterbi Algo
def viterbi(self,sentence):

def findSet(index):
if index in range(1,len(sentence)+1):
return self.all_states
elif index == 0 or index == -1:
return {'*'}
#stores (word:tag) in this whole sentence
sentence_with_tag = defaultdict(str)

#inner function to commpute pi values--start
def pi_viterbi(k,u,v,sentence):
prob = defaultdict(float)
#initialization
if k==0 and u == '*' and v == '*':
return (1.,'*')
else:
for w in findSet(k-2):
prev = pi_viterbi(k-1,w,u,sentence)[0]
#tuple((w,u,v))
q = self.trigram_param[tuple((w,u,v))]
e = self.emision_parameters[tuple((sentence[k-1].lower(),v))]
probability = prev * q * e
prob [tuple((w,u))] = probability
max_tuple = max(prob.items(), key=lambda x: x[1])

#print (max_tuple[1],max_tuple[0][0])
return max_tuple[1],max_tuple[0][0]

#inner function to compute pi values--end

sentence_with_tag= list()
backpointer=defaultdict(str)
tags = defaultdict(str)
k = len(sentence)
u_glob = ''
v_glob = ''
glob=0.
for i in range(1,k+1):
prob = defaultdict(float)
for u in findSet(i-1):
for v in findSet(i):
value,w=pi_viterbi(i,u,v,sentence)
prob [tuple((i,u,v))] = value
backpointer[tuple((i,u,v))]=w
max_tuple = max(prob.items(), key=lambda x: x[1])

backpointer[tuple((i,max_tuple[0][1],max_tuple[0][-1]))] = max_tuple[0][1] # bp (k,u,v)= tag w

#sentence_with_tag.append(max_tuple[0][-1])
u_glob = max_tuple[0][-2]
v_glob = max_tuple[0][-1]
glob = max_tuple[1]
print ('Max',max_tuple)
tags[k-1]=u_glob
tags[k]=v_glob

for i in range((k-2),0,-1):
tag = backpointer[tuple(((i+2),tags[i+1],tags[i+2]))]
tags[i]=tag

tag_list=list()
for i in range(1,len(tags)+1):
tag_list.append(tags[i])

#tag list as results
return tag_list



Instead of replacing all rare words with ‘_RARE_’, we can also try to improve our model by classifying them into different categories (like alldigits,allcapital,etc). A simple classfier on rare words using regular expressions:

     def classify(word):
# classify rare words into precise groups
# only numbers
if re.findall(r'\d',word):
return 'NUM'
# only capitalized letters
elif re.findall(r'([A-Z]+)$',word): return 'ALL_CAP' # ends with a capitalized letter,not all capital elif re.findall(r'[a-z]+[A-Z]+$',word):
return 'LAST_CAP'
else:
return 'RARE'


The input is a list of words and the outputs is a sequence of tags. Here is a comparison:

References: