## Working with ROUGE 1.5.5 Evaluation Metric in Python

If you use the ROUGE evaluation metric for text summarization or machine translation systems, you will have noticed that many versions of it exist. So how do you get it working with your own systems in Python? Which packages help? In this post I will share some ideas from an engineering point of view (which means I am not going to introduce what ROUGE is). I ran into a few issues myself and eventually got them solved. My methods may not be the best ways, but they worked.

Many papers cite this paper when they report results: ROUGE: A Package for Automatic Evaluation of Summaries by Chin-Yew Lin. Although other versions are acceptable, ROUGE 1.5.5 is the version people commonly use these days. You need the original Perl script, ROUGE-1.5.5.pl, and you should make sure your copy is unmodified: people generally do not change it when they use Python, which is how it became a de facto standard. As a sanity check, count the lines before use (there should be roughly 3298).

You may need to download the whole ROUGE-1.5.5 folder from the link. Run the test before the next steps:

```
$ ./runROUGE-test.pl
```

If everything goes well, you will see outputs like:

```
./ROUGE-1.5.5.pl -e ../data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -a ROUGE-test.xml > ../sample-output/ROUGE-test-c95-2-1-U-r1000-n4-w1.2-a
....
```

It might take a few seconds to run the test cases.

At this point I hit some errors complaining about missing Perl modules. If you meet problems here, follow the error messages to install whatever is missing.

### Install Python wrapper

It is natural to choose a Python wrapper, which will compute ROUGE scores for you by calling the Perl script. I recommend pyrouge; I have seen several papers use it to report ROUGE scores. Installation and usage are easy to find in the official documentation.

Remember to set your ROUGE path (the absolute path to the ROUGE-1.5.5 directory, i.e. the one containing the Perl script), and run a test.

### Run with Python codes

In my case, I have my system outputs organized as follows:

The reference folder holds the original (gold) summaries: each txt file contains a single line for one article, and each file name carries that article's ID. The decoded folder uses the same format for the system outputs: machine-generated summaries whose file names carry the same IDs, so each one is paired with its gold summary in the reference folder.
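As a hypothetical sketch of this layout (the IDs and file contents below are made up, not from my data), the paired files could be created like this:

```python
import os

# Hypothetical layout: each article ID appears in both folders,
# with one single-line summary per txt file.
pairs = {
    "001": ("reference summary one .", "decoded summary one ."),
    "002": ("reference summary two .", "decoded summary two ."),
}
os.makedirs("reference", exist_ok=True)
os.makedirs("decoded", exist_ok=True)
for doc_id, (ref, dec) in pairs.items():
    with open(os.path.join("reference", f"{doc_id}_reference.txt"), "w") as f:
        f.write(ref + "\n")
    with open(os.path.join("decoded", f"{doc_id}_decoded.txt"), "w") as f:
        f.write(dec + "\n")
```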

The ID is what pyrouge uses to pair files when computing ROUGE scores. Starting from the code in the official documentation, I changed the file name patterns to match my case:

```
from pyrouge import Rouge155

r = Rouge155()
# set directories
r.system_dir = 'decoded/'
r.model_dir = 'reference/'

# define the file name patterns; (\d+) captures the ID
r.system_filename_pattern = r'(\d+)_decoded.txt'
r.model_filename_pattern = '#ID#_reference.txt'

# use default parameters to run the evaluation
output = r.convert_and_evaluate()
print(output)
output_dict = r.output_to_dict(output)
```

```
2017-12-18 11:21:36,865 [MainThread ] [INFO ] Writing summaries.
2017-12-18 11:21:36,868 [MainThread ] [INFO ] Processing summaries. Saving
...
```

Then, after some processing (the txt files are converted into the format ROUGE expects), you will see the default parameters printed and finally a table of results.

```
2017-12-18 11:21:36,871 [MainThread ] [INFO ] Running ROUGE with command /Users/ireneli/Tools/pyrouge/tools/ROUGE-1.5.5/ROUGE-1.5.5.pl -e /Users/ireneli/Tools/pyrouge/tools/ROUGE-1.5.5/data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -a -m /var/folders/3h/x33pjd7d3k136fh564j3stqr0000gn/T/tmpkr3m965d/rouge_conf.xml
---------------------------------------------
1 ROUGE-1 Average_R: 0.78378 (95%-conf.int. 0.78378 - 0.78378)
1 ROUGE-1 Average_P: 0.80556 (95%-conf.int. 0.80556 - 0.80556)
1 ROUGE-1 Average_F: 0.79452 (95%-conf.int. 0.79452 - 0.79452)
---------------------------------------------
1 ROUGE-2 Average_R: 0.69444 (95%-conf.int. 0.69444 - 0.69444)
1 ROUGE-2 Average_P: 0.71429 (95%-conf.int. 0.71429 - 0.71429)
1 ROUGE-2 Average_F: 0.70423 (95%-conf.int. 0.70423 - 0.70423)
---------------------------------------------
1 ROUGE-3 Average_R: 0.62857 (95%-conf.int. 0.62857 - 0.62857)
1 ROUGE-3 Average_P: 0.64706 (95%-conf.int. 0.64706 - 0.64706)
1 ROUGE-3 Average_F: 0.63768 (95%-conf.int. 0.63768 - 0.63768)
---------------------------------------------
1 ROUGE-4 Average_R: 0.55882 (95%-conf.int. 0.55882 - 0.55882)
1 ROUGE-4 Average_P: 0.57576 (95%-conf.int. 0.57576 - 0.57576)
1 ROUGE-4 Average_F: 0.56716 (95%-conf.int. 0.56716 - 0.56716)
---------------------------------------------
1 ROUGE-L Average_R: 0.78378 (95%-conf.int. 0.78378 - 0.78378)
1 ROUGE-L Average_P: 0.80556 (95%-conf.int. 0.80556 - 0.80556)
1 ROUGE-L Average_F: 0.79452 (95%-conf.int. 0.79452 - 0.79452)
---------------------------------------------
1 ROUGE-W-1.2 Average_R: 0.32228 (95%-conf.int. 0.32228 - 0.32228)
1 ROUGE-W-1.2 Average_P: 0.68198 (95%-conf.int. 0.68198 - 0.68198)
1 ROUGE-W-1.2 Average_F: 0.43771 (95%-conf.int. 0.43771 - 0.43771)
---------------------------------------------
1 ROUGE-S* Average_R: 0.60961 (95%-conf.int. 0.60961 - 0.60961)
1 ROUGE-S* Average_P: 0.64444 (95%-conf.int. 0.64444 - 0.64444)
1 ROUGE-S* Average_F: 0.62654 (95%-conf.int. 0.62654 - 0.62654)
---------------------------------------------
1 ROUGE-SU* Average_R: 0.61966 (95%-conf.int. 0.61966 - 0.61966)
1 ROUGE-SU* Average_P: 0.65414 (95%-conf.int. 0.65414 - 0.65414)
1 ROUGE-SU* Average_F: 0.63643 (95%-conf.int. 0.63643 - 0.63643)
```

Normally we report ROUGE-2 Average_F and ROUGE-L Average_F scores.
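If you prefer to pull those two numbers straight out of the printed table rather than going through `output_to_dict`, a small regular expression over the result lines works; this is my own sketch, matching the line format shown above:

```python
import re

# Matches lines like: "1 ROUGE-2 Average_F: 0.70423 (95%-conf.int. ...)"
SCORE_RE = re.compile(r"(ROUGE-[\w*.-]+) Average_F: ([\d.]+)")

def f_scores(rouge_output):
    """Extract metric name -> Average_F from ROUGE's printed table."""
    return {m.group(1): float(m.group(2)) for m in SCORE_RE.finditer(rouge_output)}

sample = ("1 ROUGE-2 Average_F: 0.70423 (95%-conf.int. 0.70423 - 0.70423)\n"
          "1 ROUGE-L Average_F: 0.79452 (95%-conf.int. 0.79452 - 0.79452)")
print(f_scores(sample))  # {'ROUGE-2': 0.70423, 'ROUGE-L': 0.79452}
```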
Besides, you might want to remove the temporary files to free some space on your machine. In this case I would need to delete the tmpkr3m965d folder (its name can be found in the log output).
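A small sketch of that cleanup (the helper name is mine; the actual folder name has to be read from your own log output):

```python
import os
import shutil
import tempfile

def remove_temp_dir(name, base=None):
    """Delete a leftover pyrouge temp folder, e.g. 'tmpkr3m965d' from the log."""
    base = base or tempfile.gettempdir()
    path = os.path.join(base, name)
    if os.path.isdir(path):
        shutil.rmtree(path)
    return not os.path.exists(path)
```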

### Troubleshooting: illegal division by zero
I was annoyed by this error:

```
Now starting ROUGE eval...
Illegal division by zero at /home/lily/zl379/RELEASE-1.5.5/ROUGE-1.5.5.pl line 2455.
subprocess.CalledProcessError: Command '['/home/lily/zl379/RELEASE-1.5.5/ROUGE-1.5.5.pl', '-e', '/home/lily/zl379/RELEASE-1.5.5/data', '-c', '95', '-2', '-1', '-U', '-r', '1000', '-n', '4', '-w', '1.2', '-a', '-m', '/tmp/tmpuu0bqmes/rouge_conf.xml']' returned non-zero exit status 255
```

So I checked line 2455 in ROUGE-1.5.5.pl:

```
$$score=wlcsWeightInverse($$hit/$$base,$weightFactor);
```

The error says "illegal division by zero", which means `$$base` can be zero here. We can infer that some of the txt files are empty, or that they contain markup such as `<b>` or other strange characters. By filtering them out, I got the problem solved easily.
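A minimal version of that filtering step (the function name and tag-stripping regex are my own; adapt them to whatever junk your files contain) might look like:

```python
import os
import re

TAG_RE = re.compile(r"<[^>]+>")  # strip markup such as <b>...</b>

def clean_summary_files(folder):
    """Strip tags from each txt file in place; return names of files left empty."""
    empty = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".txt"):
            continue
        path = os.path.join(folder, name)
        with open(path) as f:
            text = TAG_RE.sub("", f.read()).strip()
        if text:
            with open(path, "w") as f:
                f.write(text + "\n")
        else:
            empty.append(name)  # inspect or drop these before running ROUGE
    return empty
```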


## NLP 05: From Word2vec to Doc2vec: a simple example with Gensim

#### Introduction

First introduced by Mikolov et al. [1] in 2013, word2vec learns distributed representations (word embeddings) with a neural network. It is based on the distributional hypothesis: words that occur in similar contexts (neighboring words) tend to have similar meanings. There are two models: CBOW (continuous bag of words), where we use a bag of context words to predict a target word, and skip-gram, where we use one word to predict its neighbors. For more, although not highly recommended, have a look at the TensorFlow tutorial here.
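To make the two objectives concrete, here is a dependency-free sketch of my own (not gensim code) that extracts the training pairs each model would use from a toy sentence:

```python
# CBOW: predict the target word from the bag of context words around it.
# Skip-gram: predict each neighboring word from the target word.
def training_pairs(tokens, window=1):
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            cbow.append((context, target))
            skipgram.extend((target, c) for c in context)
    return cbow, skipgram

cbow, sg = training_pairs(["the", "cat", "sat"])
print(cbow)  # [(['cat'], 'the'), (['the', 'sat'], 'cat'), (['cat'], 'sat')]
print(sg)    # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```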


## NLP 04: Log-Linear Models for Tagging Task (Python)

We will focus on POS tagging in this blog.

##### Notations

An HMM gives us a joint probability over tags and words: $p({t}_{[1:n]},{w}_{[1:n]})$. The tags $t$ and words $w$ are in one-to-one correspondence, so the two sequences have the same length.
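As a concrete illustration of that joint probability (the transition and emission numbers below are toy values, not from any trained model), the HMM factorizes it into transition terms $q(t_i \mid t_{i-1})$ and emission terms $e(w_i \mid t_i)$:

```python
# Toy HMM: p(t_1..t_n, w_1..w_n) = prod_i q(t_i | t_{i-1}) * e(w_i | t_i)
q = {("*", "D"): 0.8, ("D", "N"): 0.9}      # transition probabilities ("*" = start)
e = {("the", "D"): 0.6, ("dog", "N"): 0.4}  # emission probabilities

def joint_prob(tags, words):
    """Joint probability of a tag sequence and a word sequence of equal length."""
    p = 1.0
    prev = "*"
    for t, w in zip(tags, words):
        p *= q[(prev, t)] * e[(w, t)]
        prev = t
    return p

print(joint_prob(["D", "N"], ["the", "dog"]))  # 0.8 * 0.6 * 0.9 * 0.4 ≈ 0.1728
```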