After HMMs, let's work on a Trigram HMM directly on texts. First I will introduce the model, then give pieces of code for practice.
I am not going to give a full solution, since the course still runs every year; you can find out more in the references.
Model
An example is given below:
“Deep learning is part of a broader family of machine learning methods.”
So here we assign the $i$-th word: $x_1 = \text{Deep}$, $x_2 = \text{learning}$, etc. $n$ is the length of the sentence, and the index starts from 1. We use $y_i$ to represent the tag of each word, and $S$ is the collection of all possible tags. If $S$ is $\{\text{N}, \text{V}, \text{P}, \dots\}$, then each $y_i$ takes one value from $S$. We also assign $y_{-1}$ and $y_0$ to a special mark “*” that marks the start of a sentence, and keep in mind that there is a “STOP” tag as the signal at the end of the sentence, so $y_{n+1} = \text{STOP}$. Our goal is to give a plausible sequence of labels ($y$ values), e.g., N, V, V, …, STOP. So we need to find out the joint probability $p(x_1, \dots, x_n, y_1, \dots, y_{n+1})$, which means that the sentence and the tag sequence “appear together”. Under the Trigram HMM this joint probability factorizes as

$$p(x_1, \dots, x_n, y_1, \dots, y_{n+1}) = \prod_{i=1}^{n+1} q(y_i \mid y_{i-2}, y_{i-1}) \prod_{i=1}^{n} e(x_i \mid y_i)$$
We take a shorter sentence as an example: “Irene likes iPhone”. We assume the tag sequence is N,V,P,STOP.
So the probability is:

$$p(\text{Irene likes iPhone},\ \text{N V P STOP}) = q(\text{N} \mid *, *)\, q(\text{V} \mid *, \text{N})\, q(\text{P} \mid \text{N}, \text{V})\, q(\text{STOP} \mid \text{V}, \text{P}) \times e(\text{Irene} \mid \text{N})\, e(\text{likes} \mid \text{V})\, e(\text{iPhone} \mid \text{P})$$
Why is it called a Trigram HMM?

The model looks at three tags at a time (a trigram). The first product on the right-hand side of the equation is a Markov chain: we have made the assumption that the tag of the current word depends only on the previous two tags. The second product is the emission component, i.e., the probability that word $x$ is generated given tag $y$. Some words can be treated as either a noun or a verb, like the word “record”. So after learning from a training set, you are able to get both $q(y_i \mid y_{i-2}, y_{i-1})$ and $e(x_i \mid y_i)$.
We can think of all the $x$ as $A$ and all the $y$ as $B$. So basically we want to learn a joint distribution $p(A, B) = p(B)\, p(A \mid B)$: the Markov chain part models $p(B)$, the whole collection of tags, and the emission part models $p(A \mid B)$, the words conditioned on their tags.
We are going to use maximum likelihood estimation to calculate the emission values:

$$e(x \mid y) = \frac{\text{Count}(x, y)}{\text{Count}(y)}$$

Count(x, y) refers to how many times word x appears in the dataset with the tag being y, and Count(y) refers to how many times tag y appears. For example, to get $e(\text{record} \mid \text{V})$, we need to find the number of occurrences of “record” in the role of a verb, divided by the total number of verb occurrences.
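As a concrete illustration with made-up counts: suppose “record” is tagged V 20 times in the training set and the tag V appears 1,000 times in total; then

$$e(\text{record} \mid \text{V}) = \frac{20}{1000} = 0.02$$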
Eventually the sequence of tags is what we want to get, so an argmax will be used:

$$\arg\max_{y_1, \dots, y_{n+1}} p(x_1, \dots, x_n, y_1, \dots, y_{n+1})$$

First we need to find the maximum probability over all tag sequences, then recover the sequence that achieves it.
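To make the argmax concrete, here is a minimal brute-force sketch (my own illustration, not the assignment's code): it enumerates every possible tag sequence for a short sentence and keeps the one with the highest joint probability. The dictionaries q and e, mapping (w, u, v) and (word, tag) to probabilities, are assumptions. This only works for tiny inputs, which is exactly why the Viterbi algorithm described later is needed.

from itertools import product

def brute_force_tag(sentence, tag_set, q, e):
    # sentence: list of words; tag_set: set of tags (without '*' and 'STOP')
    # q[(w, u, v)] = q(v | w, u); e[(word, tag)] = e(word | tag)
    best_seq, best_prob = None, 0.0
    for seq in product(tag_set, repeat=len(sentence)):   # every possible tag sequence
        padded = ('*', '*') + seq + ('STOP',)
        prob = 1.0
        for i in range(2, len(padded)):                  # trigram transition probabilities
            prob *= q.get((padded[i-2], padded[i-1], padded[i]), 0.0)
        for word, tag in zip(sentence, seq):             # emission probabilities
            prob *= e.get((word, tag), 0.0)
        if prob > best_prob:
            best_seq, best_prob = seq, prob
    return best_seq, best_prob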
Implementation
The original data is organized as follows. In each line there is one word (or a single character), and sentences are separated by empty lines. In the training set, each line additionally has a second column holding the tag, so each row is a pair (x, y). The goal of tagging is: given test data that provides only x, produce (x, y) pairs as output.
Counts
As mentioned before, we need to do some basic statistics to get counts of appearances. Here we need two kinds of counts: how many times a word x appears with a certain role (tag) y, and how many times a tag appears after other tag(s). For the second one there are three types: unigram (looking at $y_i$ only), bigram (looking at pairs $(y_{i-1}, y_i)$) and trigram (looking at sequences $(y_{i-2}, y_{i-1}, y_i)$).

In the experiments, we need the trigram parameters, again estimated by maximum likelihood:

$$q(y_i \mid y_{i-2}, y_{i-1}) = \frac{\text{Count}(y_{i-2}, y_{i-1}, y_i)}{\text{Count}(y_{i-2}, y_{i-1})}$$
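For the “Irene likes iPhone” example with tag sequence N, V, P, STOP, the padded tag sequence is *, *, N, V, P, STOP, so the trigrams we count are (*, *, N), (*, N, V), (N, V, P) and (V, P, STOP), and the corresponding bigram denominators are (*, *), (*, N), (N, V) and (V, P).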
Rare Words
You might have noticed that in the joint probability, if any single probability is zero, the whole product becomes zero. What if we meet a new word that has never appeared in the training set? Or any low-frequency word with a small count (e.g., a count lower than 5)? One solution is to separate low-frequency and unseen words from the others: we replace them with a special word “_RARE_”, then do the basic statistics all together. After training, when we meet any new word, we treat it as “_RARE_”.
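A minimal sketch of this preprocessing step (my own illustration, not the course's code; the threshold of 5 and the names are assumptions): count word frequencies over the training pairs and rewrite infrequent words as “_RARE_”.

from collections import Counter

def replace_rare(train_pairs, threshold=5):
    # train_pairs: list of (word, tag) tuples from the training set
    word_counts = Counter(word for word, tag in train_pairs)
    return [(word if word_counts[word] >= threshold else '_RARE_', tag)
            for word, tag in train_pairs]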
Code for Counting

The code can be found in [2]. We can use an Hmm class to store the counts:
from collections import defaultdict

class Hmm(object):
    '''
    Stores counts for n-grams and emissions.
    '''
    def __init__(self, n=3):
        self.n = n
        self.emission_counts = defaultdict(int)
        self.ngram_counts = [defaultdict(int) for i in range(self.n)]
        self.all_states = set()
    ...
A defaultdict(int) stores <key, value> pairs whose values default to 0, so we can increment a count without first checking whether the key exists; the key can be any hashable object. Here we use emission_counts to store <(word, tag), count>, i.e., how many times this word appears with this tag; note that the key is a tuple.

ngram_counts is a list of defaultdict(int) storing the 1- to 3-gram counts. For example, the 1-gram (unigram) dictionary stores entries like <(tag,), count>, and the 2-gram (bigram) dictionary stores <(tag1, tag2), count> (how many times tag1 is followed by tag2). all_states gives you the set S, the collection of all possible tags; in our experiments there are only two tags, {'O', 'I-GENE'}.
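The counting itself is done by the script referenced in [2]; a rough sketch of what it has to do, assuming each training sentence is given as a list of (word, tag) pairs (the function name and details are my own), looks like this:

def count_sentence(hmm, sentence):
    # sentence: list of (word, tag) pairs for one training sentence
    tags = ['*', '*'] + [tag for word, tag in sentence] + ['STOP']
    for word, tag in sentence:
        hmm.emission_counts[(word, tag)] += 1                      # Count(x, y)
        hmm.all_states.add(tag)
    for i in range(2, len(tags)):
        hmm.ngram_counts[0][(tags[i],)] += 1                       # unigram Count(y_i)
        hmm.ngram_counts[1][(tags[i-1], tags[i])] += 1             # bigram Count(y_i-1, y_i)
        hmm.ngram_counts[2][(tags[i-2], tags[i-1], tags[i])] += 1  # trigram Count(y_i-2, y_i-1, y_i)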
The output of counting is written to a file, so next time we can read those counts directly from the file instead of recounting.
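A sketch of reading the counts back, assuming each line of the counts file has the form count WORDTAG tag word for emissions and count n-GRAM tag1 … tagn for n-grams (if your file format differs, adjust the parsing accordingly):

def read_counts(hmm, path):
    # rebuild the count dictionaries from a previously written counts file
    with open(path) as f:
        for line in f:
            parts = line.split()
            count, kind = int(parts[0]), parts[1]
            if kind == 'WORDTAG':
                tag, word = parts[2], parts[3]
                hmm.emission_counts[(word, tag)] = count
                hmm.all_states.add(tag)
            elif kind.endswith('-GRAM'):
                n = int(kind[0])                          # '1-GRAM' -> 1, etc.
                hmm.ngram_counts[n - 1][tuple(parts[2:])] = count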
The simplest gene tagger can be described as $y^*(x) = \arg\max_{y} e(x \mid y)$: for each word, just pick the tag with the highest emission value. The following methods in the Hmm class can be used to get the emission values and then a tag for every word in the data.
def get_emission_paras(self):
    # stores <(word, tag), emission value>
    self.emision_parameters = defaultdict(float)
    for wordNtag, counts in self.emission_counts.items():
        tag = wordNtag[1]                          # get the tag of the (word, tag) tuple
        tagCount = self.ngram_counts[0][(tag,)]
        emission = counts / tagCount
        self.emision_parameters[wordNtag] = emission
from operator import itemgetter

def get_argmax(self):
    # build, for each word, the list of (tag, emission value) candidates
    self.argmax_list = defaultdict(list)
    for wordNtag, values in self.emision_parameters.items():
        self.argmax_list[wordNtag[0]].append((wordNtag[1], values))
    # for each word, keep the tag with the highest emission value
    self.argmax_value = dict()
    for word, values in self.argmax_list.items():
        tag = max(values, key=itemgetter(1))[0]
        self.argmax_value[word] = tag
    print(self.argmax_value)
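A possible way to use these two methods as the baseline tagger (my own usage sketch; hmm is an Hmm instance whose counts are already filled, and unseen words fall back to the '_RARE_' entry):

hmm.get_emission_paras()
hmm.get_argmax()

def baseline_tag(word):
    # look the word up in the argmax table, fall back to the rare-word token
    if word in hmm.argmax_value:
        return hmm.argmax_value[word]
    return hmm.argmax_value.get('_RARE_', 'O')

print(baseline_tag('protein'))   # prints whichever tag has the highest emission value for this word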
You can also easily get the trigram parameters $q(y_i \mid y_{i-2}, y_{i-1})$:
def get_trigram_param(self):
    # q(y_i | y_i-2, y_i-1) = Count(y_i-2, y_i-1, y_i) / Count(y_i-2, y_i-1);
    # self.trigram and self.bigram are the trigram and bigram count dicts
    # (i.e. self.ngram_counts[2] and self.ngram_counts[1])
    self.trigram_param = defaultdict(float)
    for tag_tuple, counts in self.trigram.items():
        subtag = tag_tuple[:2]            # the conditioning bigram (y_i-2, y_i-1)
        tagCount = self.bigram[subtag]
        if tagCount:
            q = counts / tagCount
        else:
            q = 0.
        self.trigram_param[tag_tuple] = q
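For example, after calling hmm.get_trigram_param(), hmm.trigram_param[('O', 'O', 'I-GENE')] would give the estimated value of q(I-GENE | O, O), assuming that trigram of tags occurs in your training data.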
The Viterbi Algorithm
# Viterbi Algorithm
def viterbi(self, sentence):
    def findSet(index):
        if index in range(1, len(sentence) + 1):
            return self.all_states
        elif index == 0 or index == -1:
            return {'*'}

    # stores (word:tag) in this whole sentence
    sentence_with_tag = defaultdict(str)

    # inner function to compute pi values -- start
    def pi_viterbi(k, u, v, sentence):
        prob = defaultdict(float)
        # initialization
        if k == 0 and u == '*' and v == '*':
            return (1., '*')
        else:
            for w in findSet(k - 2):
                prev = pi_viterbi(k - 1, w, u, sentence)[0]
                # tuple((w, u, v))
                q = self.trigram_param[tuple((w, u, v))]
                e = self.emision_parameters[tuple((sentence[k - 1].lower(), v))]
                probability = prev * q * e
                prob[tuple((w, u))] = probability
            max_tuple = max(prob.items(), key=lambda x: x[1])
            # print(max_tuple[1], max_tuple[0][0])
            return max_tuple[1], max_tuple[0][0]
    # inner function to compute pi values -- end

    sentence_with_tag = list()
    backpointer = defaultdict(str)
    tags = defaultdict(str)
    k = len(sentence)
    u_glob = ''
    v_glob = ''
    glob = 0.
    for i in range(1, k + 1):
        prob = defaultdict(float)
        for u in findSet(i - 1):
            for v in findSet(i):
                value, w = pi_viterbi(i, u, v, sentence)
                prob[tuple((i, u, v))] = value
                backpointer[tuple((i, u, v))] = w
        max_tuple = max(prob.items(), key=lambda x: x[1])
        backpointer[tuple((i, max_tuple[0][1], max_tuple[0][-1]))] = max_tuple[0][1]  # bp (k,u,v) = tag w
        # sentence_with_tag.append(max_tuple[0][-1])
        u_glob = max_tuple[0][-2]
        v_glob = max_tuple[0][-1]
        glob = max_tuple[1]
        print('Max', max_tuple)
    tags[k - 1] = u_glob
    tags[k] = v_glob
    for i in range((k - 2), 0, -1):
        tag = backpointer[tuple(((i + 2), tags[i + 1], tags[i + 2]))]
        tags[i] = tag
    tag_list = list()
    for i in range(1, len(tags) + 1):
        tag_list.append(tags[i])
    # tag list as results
    return tag_list
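A usage sketch (my own, assuming the counts are already loaded and the parameters computed): pass the sentence as a list of words and you get back one tag per word. Rare or unseen words should be mapped to “_RARE_” (or their class, see below) before calling viterbi, otherwise their emission values are simply 0. Also note that this recursive pi_viterbi recomputes the same pi values many times; memoizing them, or filling the dynamic-programming table bottom-up, would make it much faster on long sentences.

hmm.get_emission_paras()
hmm.get_trigram_param()

sentence = ['the', 'BRCA1', 'gene']   # a made-up example; preprocess rare/unseen words first
tags = hmm.viterbi(sentence)
for word, tag in zip(sentence, tags):
    print(word, tag)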
Instead of replacing all rare words with '_RARE_', we can also try to improve the model by classifying them into different categories (such as all-digits, all-capitals, etc.). A simple classifier for rare words using regular expressions:
import re

def classify(word):
    # classify rare words into more precise groups
    # contains a digit
    if re.findall(r'\d', word):
        return 'NUM'
    # ends with a run of capital letters (covers all-capital words)
    elif re.findall(r'([A-Z]+)$', word):
        return 'ALL_CAP'
    # lowercase letters followed by capital letters at the end
    elif re.findall(r'[a-z]+[A-Z]+$', word):
        return 'LAST_CAP'
    else:
        return 'RARE'
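To plug this into the earlier preprocessing (a sketch under the same assumptions as the replace_rare example above): instead of mapping every infrequent word to '_RARE_', map it to its class, and apply the same mapping to unseen words at test time.

from collections import Counter

def replace_rare_with_classes(train_pairs, threshold=5):
    # train_pairs: list of (word, tag) tuples from the training set
    word_counts = Counter(word for word, tag in train_pairs)
    return [(word if word_counts[word] >= threshold else classify(word), tag)
            for word, tag in train_pairs]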
The input is a list of words and the output is a sequence of tags; you can then compare the results of the different approaches.
References:
[1] A detailed description of the data and the task.
[2] The course videos. If you have finished Week 1 and Week 2, feel free to start the assignment.
Thanks for your article, very interesting thoughts. We used trigrams in PHP for a few of our projects. You can read about it in my article http://multi-programming.com/blog/trigram-method-in-automatic-spelling-correction. Now it's time to try Python, and your information will be a good start.
Thanks for the blog link! It’s good to see n-gram can work in many interesting applications lol Let’s keep in touch in the near future!