Posted in Natural Language Processing, Problem Shooting, Python, Theory

Working with ROUGE 1.5.5 Evaluation Metric in Python

If you use the ROUGE evaluation metric for text summarization or machine translation systems, you must have noticed that there are many versions of it. So how do you get it to work with your own system in Python? Which packages are helpful? In this post, I will share some ideas from an engineering point of view (which means I am not going to explain what ROUGE is). I ran into a few issues along the way and eventually solved them. My methods might not be the best, but they worked.

Download ROUGE script

Many papers cite this one when they report results: ROUGE: A Package for Automatic Evaluation of Summaries by Chin-Yew Lin. Although other versions are acceptable, ROUGE 1.5.5 is the one most commonly used in recent work. You need to find the original Perl script and make sure your copy is a good one. People who call it from Python normally do not modify it, and that is how it has become a standard. Check the number of lines (it should be about 3298) before use.
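
The line-count check can also be done from Python. This is only a minimal sketch: count_lines is a helper name I made up, and you need to pass it the path to wherever you saved the Perl script.

```python
from pathlib import Path


def count_lines(path):
    """Return the number of lines in a text file."""
    # errors="replace" keeps the count going even if the file
    # contains bytes that are not valid UTF-8.
    with Path(path).open("r", encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle)
```

Call it as count_lines('/your/path/to/the/script') and compare the result with the expected value of about 3298.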

You may need to download the whole ROUGE-1.5.5 folder from the link. Run the test before the next steps:


If everything goes well, you will see outputs like:

./ -e ../data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -a ROUGE-test.xml > ../sample-output/ROUGE-test-c95-2-1-U-r1000-n4-w1.2-a

It might take a few seconds to run the test cases.

I hit some errors here telling me to install additional Perl dependencies. If you run into problems, follow the error message.

Install Python wrapper

It is natural to choose a Python wrapper, which will help you calculate ROUGE scores from the Perl script. I recommend pyrouge; I have seen papers that used it to compute their ROUGE scores. The official documentation covers installation and usage.

Remember to set your ROUGE path (absolute path to ROUGE-1.5.5 directory, which contains the Perl script), and run a test.

Run with Python code

In my case, I have my system outputs organized as follows:

System output files

I have a reference folder for the original summaries. Each txt file contains a single line for one article, and the file name carries an ID. The decoded folder, where I keep the system outputs, uses the same format. In my case these are machine-generated summaries, and each txt file name also carries an ID that pairs it with its ground-truth summary in the reference folder.
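
Before running the evaluation, it can help to confirm that every decoded file really has a partner in the reference folder. The sketch below is my own helper (pair_by_id is not part of pyrouge) and assumes file names of the form 12_decoded.txt and 12_reference.txt; adjust the two patterns if your names differ.

```python
import os
import re


def pair_by_id(decoded_dir, reference_dir):
    """Match decoded and reference files on the numeric ID in their names.

    Assumes names like '12_decoded.txt' and '12_reference.txt'.
    Returns (paired_ids, ids_missing_reference, ids_missing_decoded).
    """
    decoded_re = re.compile(r"^(\d+)_decoded\.txt$")
    reference_re = re.compile(r"^(\d+)_reference\.txt$")

    def collect_ids(directory, pattern):
        # Gather the ID part of every file name that fits the pattern.
        ids = set()
        for name in os.listdir(directory):
            match = pattern.match(name)
            if match:
                ids.add(match.group(1))
        return ids

    decoded_ids = collect_ids(decoded_dir, decoded_re)
    reference_ids = collect_ids(reference_dir, reference_re)
    paired = sorted(decoded_ids & reference_ids, key=int)
    missing_reference = sorted(decoded_ids - reference_ids, key=int)
    missing_decoded = sorted(reference_ids - decoded_ids, key=int)
    return paired, missing_reference, missing_decoded
```

If either of the two "missing" lists is non-empty, fix the folders first; unpaired files are an easy source of confusing errors later.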

The ID is important when using pyrouge to get ROUGE scores. I changed the name format to match my case, based on the code from the official documentation:

from pyrouge import Rouge155
r = Rouge155()
# set directories
r.system_dir = 'decoded/'
r.model_dir = 'reference/'

# define the patterns
r.system_filename_pattern = r'(\d+)_decoded.txt'
r.model_filename_pattern = '#ID#_reference.txt'

# use default parameters to run the evaluation
output = r.convert_and_evaluate()
output_dict = r.output_to_dict(output)

You will see a lot of log output:

2017-12-18 11:21:36,865 [MainThread ] [INFO ] Writing summaries.
2017-12-18 11:21:36,868 [MainThread ] [INFO ] Processing summaries. Saving

Then, after some processing (the txt files are converted into another format), you will see the command with its default parameters and finally a table of results.

2017-12-18 11:21:36,871 [MainThread ] [INFO ] Running ROUGE with command /Users/ireneli/Tools/pyrouge/tools/ROUGE-1.5.5/ -e /Users/ireneli/Tools/pyrouge/tools/ROUGE-1.5.5/data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -a -m /var/folders/3h/x33pjd7d3k136fh564j3stqr0000gn/T/tmpkr3m965d/rouge_conf.xml
1 ROUGE-1 Average_R: 0.78378 ( 0.78378 - 0.78378)
1 ROUGE-1 Average_P: 0.80556 ( 0.80556 - 0.80556)
1 ROUGE-1 Average_F: 0.79452 ( 0.79452 - 0.79452)
1 ROUGE-2 Average_R: 0.69444 ( 0.69444 - 0.69444)
1 ROUGE-2 Average_P: 0.71429 ( 0.71429 - 0.71429)
1 ROUGE-2 Average_F: 0.70423 ( 0.70423 - 0.70423)
1 ROUGE-3 Average_R: 0.62857 ( 0.62857 - 0.62857)
1 ROUGE-3 Average_P: 0.64706 ( 0.64706 - 0.64706)
1 ROUGE-3 Average_F: 0.63768 ( 0.63768 - 0.63768)
1 ROUGE-4 Average_R: 0.55882 ( 0.55882 - 0.55882)
1 ROUGE-4 Average_P: 0.57576 ( 0.57576 - 0.57576)
1 ROUGE-4 Average_F: 0.56716 ( 0.56716 - 0.56716)
1 ROUGE-L Average_R: 0.78378 ( 0.78378 - 0.78378)
1 ROUGE-L Average_P: 0.80556 ( 0.80556 - 0.80556)
1 ROUGE-L Average_F: 0.79452 ( 0.79452 - 0.79452)
1 ROUGE-W-1.2 Average_R: 0.32228 ( 0.32228 - 0.32228)
1 ROUGE-W-1.2 Average_P: 0.68198 ( 0.68198 - 0.68198)
1 ROUGE-W-1.2 Average_F: 0.43771 ( 0.43771 - 0.43771)
1 ROUGE-S* Average_R: 0.60961 ( 0.60961 - 0.60961)
1 ROUGE-S* Average_P: 0.64444 ( 0.64444 - 0.64444)
1 ROUGE-S* Average_F: 0.62654 ( 0.62654 - 0.62654)
1 ROUGE-SU* Average_R: 0.61966 ( 0.61966 - 0.61966)
1 ROUGE-SU* Average_P: 0.65414 ( 0.65414 - 0.65414)
1 ROUGE-SU* Average_F: 0.63643 ( 0.63643 - 0.63643)

Normally we report ROUGE-2 Average_F and ROUGE-L Average_F scores.
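
If you only have the printed table (for example, saved from a log), the two scores can be pulled out with a small parser. This is a standalone sketch of mine, not pyrouge's own parsing code; parse_rouge_table and ROW_RE are made-up names, and the pattern assumes result lines shaped like the ones in the table above.

```python
import re

# Matches result lines such as:
#   "1 ROUGE-2 Average_F: 0.70423 ( 0.70423 - 0.70423)"
# Only the metric name, the R/P/F kind and the first number are captured.
ROW_RE = re.compile(r"^\S+\s+(ROUGE-\S+)\s+Average_([RPF]):\s+([\d.]+)")


def parse_rouge_table(text):
    """Return {(metric, 'R'|'P'|'F'): score} parsed from the result table."""
    scores = {}
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            metric, kind, value = match.groups()
            scores[(metric, kind)] = float(value)
    return scores


sample = """1 ROUGE-2 Average_F: 0.70423 ( 0.70423 - 0.70423)
1 ROUGE-L Average_F: 0.79452 ( 0.79452 - 0.79452)"""
scores = parse_rouge_table(sample)
print(scores[("ROUGE-2", "F")], scores[("ROUGE-L", "F")])
# 0.70423 0.79452
```
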
Besides, you might want to remove the temp files to free up some space on your machine. In this case, I need to delete the tmpkr3m965d folder (its path can be found in the log output).
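
The cleanup can be scripted too. In this sketch, remove_rouge_temp_dir is a hypothetical helper of mine (not a pyrouge function): you give it the rouge_conf.xml path copied from the log, and it deletes the folder containing that file. The check that the folder lives inside the system temp directory is my own precaution.

```python
import os
import shutil
import tempfile


def remove_rouge_temp_dir(config_path):
    """Delete the temporary folder that holds the given rouge_conf.xml.

    Refuses to delete anything that is not a sub-folder of the
    system temp directory.
    """
    temp_dir = os.path.dirname(os.path.realpath(config_path))
    temp_root = os.path.realpath(tempfile.gettempdir())
    inside_temp = os.path.commonpath([temp_dir, temp_root]) == temp_root
    if not inside_temp or temp_dir == temp_root:
        raise ValueError("not a sub-folder of the temp directory: " + temp_dir)
    shutil.rmtree(temp_dir)
    return temp_dir
```

For the run above, the argument would be the path ending in tmpkr3m965d/rouge_conf.xml shown in the log.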

Troubleshooting: illegal division by zero

I was annoyed by this error:

Now starting ROUGE eval...
Illegal division by zero at /home/lily/zl379/RELEASE-1.5.5/ line 2455.
subprocess.CalledProcessError: Command '['/home/lily/zl379/RELEASE-1.5.5/', '-e', '/home/lily/zl379/RELEASE-1.5.5/data', '-c', '95', '-2', '-1', '-U', '-r', '1000', '-n', '4', '-w', '1.2', '-a', '-m', '/tmp/tmpuu0bqmes/rouge_conf.xml']' returned non-zero exit status 255

So I checked line 2455 in


The error says “illegal division by zero”, which means $$base can be zero at that point. We may infer that some of the txt files are empty, or that they contain something like <b> or other strange characters. After filtering those out, the problem was solved.
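
A rough sketch of that filtering step is shown below. The helper names clean_summary and usable_pairs are mine, and stripping everything that looks like a tag is only one possible way to clean the text; it assumes you have already read each pair of files into strings.

```python
import re

# Anything that looks like a tag, e.g. <b> or </b>.
TAG_RE = re.compile(r"<[^>]+>")


def clean_summary(text):
    """Strip tag-like markup such as <b> and collapse whitespace."""
    without_tags = TAG_RE.sub(" ", text)
    return " ".join(without_tags.split())


def usable_pairs(pairs):
    """Keep (id, decoded, reference) triples whose texts are both
    non-empty after cleaning."""
    kept = []
    for file_id, decoded, reference in pairs:
        decoded = clean_summary(decoded)
        reference = clean_summary(reference)
        if decoded and reference:
            kept.append((file_id, decoded, reference))
    return kept
```

Write only the kept pairs back to the decoded and reference folders, so that no empty file reaches the Perl script.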

Posted in Python

TensorFlow 07: Word Embeddings (2) – Loading Pre-trained Vectors

For a brief introduction to Word2vec, please check this post. In this post, we try to load a pre-trained Word2vec model, which is a huge file containing all the word vectors trained on huge corpora.


Download here. I downloaded the GloVe one: the vocabulary size is 400K and the dimension is 50. It is a smaller one trained on a “global” corpus (from Wikipedia). There are also models trained on Twitter on the page.


The model is formatted as one (word, vector) entry per line, with the fields separated by spaces. Not only words but also punctuation marks such as commas are included in the model.


There is an easy way for you to load the model by reading the vector file. Here I separate the words and vectors, because the words will be fed into vocabulary.

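import numpy as np

filename = 'glove.6B.50d.txt'

def loadGloVe(filename):
    vocab = []  # words (and punctuation marks), one per line of the file
    embd = []   # the matching vectors, as lists of floats
    with open(filename, 'r', encoding='utf-8') as file:
        for line in file:
            row = line.strip().split(' ')
            vocab.append(row[0])
            embd.append([float(value) for value in row[1:]])
    print('Loaded GloVe!')
    return vocab, embd

vocab, embd = loadGloVe(filename)
vocab_size = len(vocab)
embedding_dim = len(embd[0])
embedding = np.asarray(embd, dtype=np.float32)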

The vocab is a list of words or marks. The embedding is a huge 2-D array holding all the word vectors. We set the embedding size to the number of columns of the embedding array.

Embedding Layer

After loading in the vectors, we need to use them to initialize W of the embedding layer in your network.

import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder)

W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
                trainable=False, name="W")
embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
embedding_init = W.assign(embedding_placeholder)

Here W is first created as a Variable initialized with constant zeros. Be careful with the shape, [vocab_size, embedding_dim], which we know only after loading the model. If trainable is set to False, W will not be updated during training; change it to True for a trainable setup. Then an embedding_placeholder is set up to receive the real values (fed in through feed_dict), and finally W is assigned.

After creating a session and initializing the global variables, run the embedding_init operation, feeding in the 2-D array embedding (sess here stands for your session object):

sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})


Suppose you have raw documents. The first thing you need to do is build a vocabulary, which maps each word to an id. TensorFlow uses the following call to look up embeddings:

tf.nn.embedding_lookup(W, input_x)

where W is the huge embedding matrix and input_x is a tensor of ids. In other words, it looks up embeddings by the given ids.
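
To make the lookup concrete, here is a plain-Python illustration of the indexing idea. It is not how TensorFlow implements it; the function below is a toy stand-in of mine that simply returns the rows of the table selected by the ids, keeping the shape of the id list.

```python
def embedding_lookup(table, ids):
    """Return the rows of `table` selected by `ids`.

    `ids` may be a flat list of ints or a nested list (e.g. a batch of
    sentences); the nesting of the result mirrors the nesting of `ids`.
    """
    if isinstance(ids, int):
        return list(table[ids])
    return [embedding_lookup(table, item) for item in ids]


# Toy table: 4 "words", embedding dimension 2.
W = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
batch = [[1, 3], [2, 0]]  # two sentences of two word ids each
print(embedding_lookup(W, batch))
# [[[0.1, 0.2], [0.5, 0.6]], [[0.3, 0.4], [0.0, 0.0]]]
```

A batch of shape [2, 2] becomes a result of shape [2, 2, 2]: every id is replaced by its vector.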

So we should use the pre-trained model's word list when we build the vocabulary, that is, the word-to-id map.

import numpy as np
from tensorflow.contrib import learn  # TensorFlow 1.x only
#init vocab processor
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
#fit the vocab from glove
pretrain = vocab_processor.fit(vocab)
#transform inputs
x = np.array(list(vocab_processor.transform(your_raw_input)))

First, init the vocab processor by passing in max_document_length; by default, shorter sentences are padded with zeros. Then fit the processor on the vocab list to build the word-id map. Finally, use the processor to transform the real raw documents.
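
For readers who cannot use tf.contrib, the same three steps can be sketched in plain Python. This is a simplified stand-in of mine, not the VocabularyProcessor implementation: build_word_ids and transform are made-up names, tokens are split on whitespace only, and id 0 is reserved for padding and unknown words, so words start at id 1 (the embedding matrix would then need a matching extra row at index 0).

```python
def build_word_ids(vocab):
    """Map each word to an id, starting at 1 (0 is padding/unknown)."""
    return {word: index for index, word in enumerate(vocab, start=1)}


def transform(documents, word_to_id, max_document_length):
    """Turn whitespace-tokenised documents into fixed-length id lists."""
    rows = []
    for document in documents:
        # Unknown words fall back to id 0.
        ids = [word_to_id.get(token, 0) for token in document.split()]
        ids = ids[:max_document_length]                 # truncate long ones
        ids += [0] * (max_document_length - len(ids))   # pad short ones
        rows.append(ids)
    return rows


word_to_id = build_word_ids(["the", ",", "cat"])
print(transform(["the cat sat", "cat"], word_to_id, 4))
# [[1, 3, 0, 0], [3, 0, 0, 0]]
```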

Now you are ready to train your own network with pre-trained word vectors!

Posted in Algorithm, Natural Language Processing, Python, Theory

NLP 05: From Word2vec to Doc2vec: a simple example with Gensim



First introduced by Mikolov [1] in 2013, word2vec learns distributed representations (word embeddings) with a neural network. It is based on the distributional hypothesis that words occurring in similar contexts (neighboring words) tend to have similar meanings. There are two models: CBOW (continuous bag of words), where we use a bag of context words to predict a target word, and skip-gram, where we use one word to predict its neighbors. For more, although not highly recommended, have a look at the TensorFlow tutorial here.
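
The difference between the two models is easiest to see in the training pairs they use. The sketch below only builds those pairs from a toy sentence; it does not train anything, and the function names and the window size of 1 are my own choices for illustration.

```python
def skipgram_pairs(tokens, window=1):
    """(center, neighbor) pairs: one word predicts each word around it."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=1):
    """(context_words, target) pairs: the surrounding bag predicts the center."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs


tokens = "the cat sat".split()
print(skipgram_pairs(tokens))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
print(cbow_pairs(tokens))
# [(['cat'], 'the'), (['the', 'sat'], 'cat'), (['cat'], 'sat')]
```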

Posted in Deep Learning, Python, Theory

TensorFlow 05: Understanding Basic Usage

Only recently did I realize that I had missed some basics of TF. I went straight to MNIST when I was learning. I also asked a few people whether they knew any nice tutorials for TF or DL. Well, it is not like other subjects, where you can easily find good ones like Andrew’s ML. But I did find some (see the reference section), though I did not go through every one. If you are interested, check them out yourself. Or you might be happy to share your recommendations.