First introduced by Mikolov et al. [1] in 2013, word2vec learns distributed representations of words (word embeddings) with a neural network. It is based on the distributional hypothesis: words that occur in similar contexts (neighboring words) tend to have similar meanings. There are two models: cbow (continuous bag of words), where we use a bag of context words to predict a target word, and skip-gram, where we use one word to predict its neighbors. For more, although it is not highly recommended, have a look at the TensorFlow tutorial here.
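As a rough illustration of the two setups, the sketch below builds the training examples each model would see for one short sentence. The function name, the window size and the toy sentence are made up for this example; it only shows the input/target shapes and is not how word2vec is actually implemented.

```python
def training_pairs(tokens, window=2):
    """Build toy training examples for both word2vec setups.

    cbow:     list of (context_words, target_word)  -> many words predict one
    skipgram: list of (center_word, neighbor_word)  -> one word predicts each neighbor
    """
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if not context:
            continue
        cbow.append((context, target))
        skipgram.extend((target, neighbor) for neighbor in context)
    return cbow, skipgram


cbow, skipgram = training_pairs(["the", "cat", "sat", "down"], window=1)
print(cbow[1])       # (['the', 'sat'], 'cat')
print(skipgram[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```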
Once this idea proved effective and helpful (say, you can easily cluster words and find similar ones in a huge corpus), people began thinking further: is it possible to have a higher-level representation of sentences, paragraphs or even documents?
One idea is to first use the word embeddings to represent each word in a sentence, then apply simple average pooling, so that the generated document vector is the centroid of all its words in the embedding space [2]. The more popular idea follows the way word2vec is trained and learns distributed representations for pieces of text in an unsupervised way [3,4].
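The average pooling idea can be sketched in a few lines. The embeddings here are tiny made-up 2-d vectors, and skipping unknown words is an assumption of this sketch rather than something the cited work prescribes.

```python
def average_pool(tokens, embeddings):
    """Document vector = element-wise mean (centroid) of the known word vectors."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token of the document has an embedding")
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


# toy 2-d embeddings, invented for illustration
toy_embeddings = {"good": [1.0, 0.0], "movie": [0.0, 1.0], "bad": [-1.0, 0.0]}
doc_vector = average_pool(["good", "movie", "unknown"], toy_embeddings)
print(doc_vector)  # [0.5, 0.5]
```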
Similarly, there are two models in doc2vec: dbow and dm.
dbow (distributed bag of words)
It is the simpler model: it ignores word order, and the training stage is quicker. The model uses no local context (neighboring words) when making predictions. The figure below, from the paper [4], shows dbow.
In Gensim, you will code like this:
model = gensim.models.Doc2Vec(documents,dm = 0, alpha=0.1, size= 20, min_alpha=0.025)
Set dm to 0. If you print out the word embeddings at each epoch, you will notice that they are not updating. From the figure above, we may guess that only the paragraph embeddings are updated during backpropagation.
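To make the "no local context" point concrete, here is a minimal sketch of the training examples dbow works with: the document tag alone is the input, and every word of the document is a prediction target. The function name and toy documents are invented for illustration; this is not Gensim's implementation.

```python
def dbow_examples(tagged_docs):
    """Return (doc_tag, target_word) pairs; neighboring words are never used as input."""
    return [(tag, word) for tag, words in tagged_docs for word in words]


toy_docs = [("0", ["suppli", "chain", "risk"]), ("1", ["risk", "measur"])]
examples = dbow_examples(toy_docs)
print(examples)

# Shuffling the words inside a document yields the same set of examples,
# which is what "ignores word order" means here.
shuffled_docs = [("0", ["risk", "suppli", "chain"]), ("1", ["measur", "risk"])]
print(sorted(dbow_examples(shuffled_docs)) == sorted(examples))  # True
```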
dm (distributed memory)
We treat the paragraph as an extra word. Its vector is then concatenated or averaged with the local context word vectors when making predictions. During training, both paragraph and word embeddings are updated. This calls for more computation and complexity.
In Gensim, set dm to 1 (the default):
model = gensim.models.Doc2Vec(documents,dm = 1, alpha=0.1, size= 20, min_alpha=0.025)
If you print out the word embeddings at each epoch, you will notice that they are updating.
In more detail: we treat each document as an extra word; the doc ID/paragraph ID is represented as a one-hot vector; documents are also embedded into a continuous vector space.
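The two steps described for dm can be sketched as follows: a one-hot doc ID picks one row of a document embedding matrix, and that paragraph vector is then averaged or concatenated with the context word vectors. All names and numbers below are toy assumptions made for this sketch, not Gensim internals.

```python
def one_hot(index, size):
    """One-hot vector for a doc ID."""
    return [1.0 if k == index else 0.0 for k in range(size)]


def embed(one_hot_vec, matrix):
    """Multiply a one-hot row vector by an embedding matrix; this selects one row."""
    dim = len(matrix[0])
    return [sum(one_hot_vec[r] * matrix[r][k] for r in range(len(matrix)))
            for k in range(dim)]


def dm_input(doc_vec, context_vecs, concat=False):
    """Combine the paragraph vector with the context word vectors."""
    parts = [doc_vec] + list(context_vecs)
    if concat:
        return [x for vec in parts for x in vec]
    return [sum(vec[k] for vec in parts) / len(parts) for k in range(len(doc_vec))]


doc_matrix = [[1.0, 1.0], [3.0, 5.0]]        # one row per document (toy values)
doc_vec = embed(one_hot(0, 2), doc_matrix)   # row 0 of doc_matrix
context = [[0.0, 2.0], [2.0, 0.0]]           # vectors of two neighboring words

averaged = dm_input(doc_vec, context)
concatenated = dm_input(doc_vec, context, concat=True)
print(averaged)      # [1.0, 1.0]
print(concatenated)  # [1.0, 1.0, 0.0, 2.0, 2.0, 0.0]
```

Averaging keeps the input at the embedding dimension, while concatenation grows with the window size, which is one reason dm costs more than dbow.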
Example with Gensim
Gensim provides lots of models, such as LDA, word2vec and doc2vec. I found that some of the example code in tutorials is based on long, huge projects (like training on the English Wiki corpus, lol), so here I give a few lines of code to show how to start playing with doc2vec.
First, you need a list of txt files to try the simple code on. I have a list of txt files under the folder named docs. There are two .py files in total: load.py for reading and cleaning the data, and doc2vectest.py for running the doc2vec model.
import os
import re

import gensim
from gensim.models.doc2vec import TaggedDocument
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words


def get_doc_list(folder_name):
    # read every txt file under folder_name into one string per file
    doc_list = []
    file_list = [folder_name + '/' + name for name in os.listdir(folder_name) if name.endswith('txt')]
    for file in file_list:
        with open(file, 'r') as f:
            doc_list.append(f.read())
    print('Found %s documents under the dir %s .....' % (len(file_list), folder_name))
    return doc_list


def get_doc(folder_name):
    doc_list = get_doc_list(folder_name)
    tokenizer = RegexpTokenizer(r'\w+')
    en_stop = get_stop_words('en')
    p_stemmer = PorterStemmer()

    taggeddoc = []
    texts = []
    for index, i in enumerate(doc_list):
        # for tagged doc
        wordslist = []
        tagslist = []

        # clean and tokenize document string
        raw = i.lower()
        tokens = tokenizer.tokenize(raw)

        # remove stop words from tokens
        stopped_tokens = [i for i in tokens if not i in en_stop]

        # remove numbers
        number_tokens = [re.sub(r'[\d]', ' ', i) for i in stopped_tokens]
        number_tokens = ' '.join(number_tokens).split()

        # stem tokens
        stemmed_tokens = [p_stemmer.stem(i) for i in number_tokens]

        # remove empty
        length_tokens = [i for i in stemmed_tokens if len(i) > 1]

        # add tokens to list
        texts.append(length_tokens)

        # tags are passed as a list (the form the original comment recommends for later versions);
        # a bare string such as '12' would be read as the two tags '1' and '2'
        td = TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(stemmed_tokens))).split(), [str(index)])
        taggeddoc.append(td)

    return taggeddoc
The method get_doc_list loads all txt files under a directory and returns a list of strings. If you have 4 txt files, the length of the list will be 4.
The method get_doc cleans the strings and then ports the data into a format that doc2vec can use. After we get the long string for each txt file, we preprocess it by tokenizing, removing stop words and numbers, and stemming (that is, if you have “supply” and “supplies”, both are converted to “suppli”). You can change or add more filters in this step.
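If you want to see what each filter does without installing nltk or stop_words, the same steps can be approximated with the standard library alone. This is only a sketch: the stop list is a tiny made-up one, and stemming is left out because it needs a real stemmer.

```python
import re

# Tiny illustrative stop list; the post uses the stop_words package instead.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def clean(text, stop_words=STOP_WORDS):
    """Lowercase, tokenize, drop stop words, strip digits, drop 1-char leftovers."""
    tokens = re.findall(r"\w+", text.lower())
    tokens = [t for t in tokens if t not in stop_words]
    # replace digits by spaces, then re-split so 'b12' becomes 'b' and '3' disappears
    tokens = " ".join(re.sub(r"\d", " ", t) for t in tokens).split()
    return [t for t in tokens if len(t) > 1]


print(clean("The supply of 3 A4 sheets is in room b12"))
# ['supply', 'sheets', 'room']
```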
From the tutorial given by Radim [5], a TaggedDocument (it used to be LabeledSentence) looks like this:
sentence = TaggedDocument(words=[u'some', u'words', u'here'], tags=[u'SENT_1'])
You need to pass in a list of Unicode words and the tags of the document (we agree here that a document is a collection of words). Normally we give one tag to each document, but you can assign more than one. In my experiment, I just give each doc a unique id as its tag. I tried with int but got an error! In Python 3, str.encode(somestringhere) turns the string into bytes, and gensim.utils.to_unicode converts them back to Unicode text. The method get_doc returns a list of TaggedDocument objects.
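One pitfall with tags is worth a small demonstration. The sketch below uses a stand-in namedtuple with the same two fields (words, tags) instead of importing Gensim, and it assumes that the tags field is iterated element by element, so a bare string is split into characters.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument, defined here only for illustration.
TaggedDoc = namedtuple("TaggedDoc", "words tags")

wrong = TaggedDoc(words=["some", "words", "here"], tags="12")
right = TaggedDoc(words=["some", "words", "here"], tags=["12"])

print(list(wrong.tags))  # ['1', '2']  -> two tags, neither is the intended id
print(list(right.tags))  # ['12']      -> one tag
```

With single-digit ids the two forms look the same, so the problem only shows up once you have ten or more documents.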
# written against an older Gensim release; newer releases rename size to vector_size
# and expose the word vectors under model.wv
import gensim
import load

documents = load.get_doc('docs')
print('Data Loading finished')
print(len(documents), type(documents))

# build the model
model = gensim.models.Doc2Vec(documents, dm=0, alpha=0.025, size=20, min_alpha=0.025, min_count=0)

# start training
for epoch in range(200):
    if epoch % 20 == 0:
        print('Now training epoch %s' % epoch)
    model.train(documents)
    # note: with 200 epochs this step eventually makes alpha negative;
    # use a smaller step or fewer epochs
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay

# shows the similar words
print(model.most_similar('suppli'))

# shows the learnt embedding
print(model['suppli'])

# shows the similar docs with id = 2
print(model.docvecs.most_similar(str(2)))
After loading the documents, we can build a doc2vec model. Yes, just one line.
We pass in the documents and assign the hyper-parameters. You can find the full list of methods here [6]. If dm = 0, we are training a dbow model. size = 20 defines the dimension of the doc vectors. If we initialize the model by passing in the documents, we do not need to build the vocabulary ourselves; it is done automatically.
You can then train it for a number of epochs, changing the learning rate (alpha) after each one.
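It is worth checking the learning-rate schedule before running a long loop. The sketch below reproduces the arithmetic of the script (start at 0.025, subtract 0.002 per epoch, 200 epochs) and finds the first epoch with a negative alpha. The linear_decay helper and its end value of 0.0001 are assumptions added for comparison, not part of the original code.

```python
def alpha_schedule(start, step, epochs):
    """Learning rate used at each epoch when `step` is subtracted after every epoch."""
    return [start - step * epoch for epoch in range(epochs)]


def linear_decay(start, end, epochs):
    """Alternative: decay linearly from `start` to `end` so alpha stays positive."""
    if epochs == 1:
        return [start]
    return [start + (end - start) * epoch / (epochs - 1) for epoch in range(epochs)]


alphas = alpha_schedule(start=0.025, step=0.002, epochs=200)
first_negative = next(epoch for epoch, a in enumerate(alphas) if a < 0)
print(first_negative)  # 13

safe = linear_decay(start=0.025, end=0.0001, epochs=200)
print(min(safe) > 0)   # True
```

So with these numbers only the first 13 epochs train with a positive learning rate; either reduce the step or cut the number of epochs.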
After some time, let’s print some results. Do remember that when we train doc2vec, we get word embeddings as well as document similarities, and even label representations!
Here I printed the words most similar to “suppli”:
>>model.most_similar('suppli')
[('gorski', 0.7319533824920654), ('ensur', 0.7222224473953247), ('beyond', 0.718737006187439), ('d', 0.6974059343338013), ('sociedad', 0.6583201885223389), ('particularli', 0.6544623374938965), ('lvg', 0.644609808921814), ('sal', 0.6434764862060547), ('measur', 0.6433277130126953), ('livneh', 0.6426206827163696)]
And let’s see how the word “suppli” is represented (a 20-d vector):
>>model['suppli']
[ 0.00780776 -0.02093589 -0.00954595 0.01870585 -0.0185861 0.0023135 0.00341994 0.00175795 0.01479601 0.01020735 0.02441289 0.01075038 0.00807728 0.0213691 0.01130075 0.01297983 0.01369582 -0.01174711 -0.00518298 -0.00057144]
Then the similarity between the document with ID 2 and the rest:
>>model.docvecs.most_similar(str(2))
[('1', 0.44029974937438965), ('0', -0.044562749564647675), ('3', -0.048865705728530884), ('6', -0.08216284960508347), ('9', -0.15016411244869232), ('5', -0.16429446637630463), ('4', -0.1840556114912033), ('8', -0.21571332216262817), ('7', -0.23153537511825562)]
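The scores returned by most_similar are cosine similarities, which is why they fall between -1 and 1. A minimal pure-Python sketch of that ranking, with made-up 2-d document vectors, looks like this:

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def most_similar(tag, vectors):
    """Rank every other tag by cosine similarity to `tag`, highest first."""
    scores = [(other, cosine(vectors[tag], vec))
              for other, vec in vectors.items() if other != tag]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# toy document vectors, invented for illustration
toy_docvecs = {"0": [1.0, 0.0], "1": [1.0, 1.0], "2": [0.0, 1.0], "3": [0.0, -1.0]}
ranking = most_similar("2", toy_docvecs)
print([tag for tag, _ in ranking])  # ['1', '0', '3']
```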
You can save both word embeddings and document/paragraph embeddings:
When you want to use the model:
# load the word2vec
word2vec = gensim.models.Doc2Vec.load_word2vec_format('save/trained.word2vec')
print(word2vec['good'])

# load the doc2vec
model = gensim.models.Doc2Vec.load('save/trained.model')
docvecs = model.docvecs
# print (docvecs[str(3)])
They will print out the vectors for you.
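For the curious, the word2vec file is easy to inspect by hand. Assuming it was written in the plain-text (non-binary) word2vec layout, the first line holds the vocabulary size and the vector dimension, and every following line holds a word and its values. The two helpers below are hypothetical names written for this sketch; they mimic that layout in memory and are not Gensim functions.

```python
import io


def write_word2vec_text(vectors, handle):
    """Write {word: [floats]} in the plain-text word2vec layout."""
    dim = len(next(iter(vectors.values())))
    handle.write("%d %d\n" % (len(vectors), dim))  # header: vocab size, dimension
    for word, vec in vectors.items():
        handle.write(word + " " + " ".join(repr(x) for x in vec) + "\n")


def read_word2vec_text(handle):
    """Read the layout back into a {word: [floats]} dict."""
    vocab_size, dim = (int(n) for n in handle.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        word, *values = handle.readline().split()
        if len(values) != dim:
            raise ValueError("unexpected number of values for %r" % word)
        vectors[word] = [float(v) for v in values]
    return vectors


buffer = io.StringIO()
write_word2vec_text({"good": [0.1, -0.2], "suppli": [0.3, 0.4]}, buffer)
buffer.seek(0)
restored = read_word2vec_text(buffer)
print(restored["good"])  # [0.1, -0.2]
```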
We can simply take those word embeddings and plot them (as is done for word2vec). Here PCA() is applied first to reduce each word vector to 2 dimensions, and then we take some of the words to plot.
import matplotlib.pyplot as plt
import numpy as np
from sklearn import decomposition


def plotWords():
    # get model, we use w2v only (useModel() is a helper that is not shown in this post)
    w2v, d2v = useModel()

    words_np = []
    # a list of labels (words)
    words_label = []
    for word in w2v.vocab.keys():
        words_np.append(w2v[word])
        words_label.append(word)
    print('Added %s words. Shape %s' % (len(words_np), np.shape(words_np)))

    # project every word vector onto its first two principal components
    pca = decomposition.PCA(n_components=2)
    pca.fit(words_np)
    reduced = pca.transform(words_np)

    # plt.plot(pca.explained_variance_ratio_)
    for index, vec in enumerate(reduced):
        # print ('%s %s'%(words_label[index],vec))
        if index < 100:
            x, y = vec[0], vec[1]
            plt.scatter(x, y)
            plt.annotate(words_label[index], xy=(x, y))
    plt.show()
We will plot something like this: