Working with ROUGE 1.5.5 Evaluation Metric in Python

If you use the ROUGE evaluation metric for text summarization or machine translation systems, you have probably noticed that there are many versions of it. So how do you get it working with your own systems in Python? Which packages are helpful? In this post, I will share some ideas from an engineering point of view (which means I am not going to explain what ROUGE is). I ran into a few issues myself and eventually got them solved. My methods may not be the best ones, but they worked.

Download ROUGE script

Many papers refer to this paper when they report results: ROUGE: A Package for Automatic Evaluation of Summaries by Chin-Yew Lin. Although other versions are acceptable, ROUGE 1.5.5 is the one most commonly used these days. You need the original Perl script, ROUGE-1.5.5.pl, and you should verify that your copy is unmodified. People normally do not change it when they use Python wrappers, which is how it became a de facto standard. As a sanity check, count the lines of the script (it should be about 3298 lines) before use.

You may need to download the whole ROUGE-1.5.5 folder from the link. Run the test before the next steps:

$./runROUGE-test.pl

If everything goes well, you will see outputs like:

./ROUGE-1.5.5.pl -e ../data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -a ROUGE-test.xml > ../sample-output/ROUGE-test-c95-2-1-U-r1000-n4-w1.2-a
....

It might take a few seconds to run the test cases.

I ran into some errors here saying that I needed to install additional Perl modules. If you hit similar problems, follow the error messages to see what is missing.

Install Python wrapper

It is natural to choose a Python wrapper, which calculates ROUGE scores by calling the Perl script for you. I recommend pyrouge, and I have seen several papers use it to report ROUGE scores. The official documentation covers installation and usage.

Remember to set your ROUGE path (the absolute path to the ROUGE-1.5.5 directory, which contains the Perl script), and run a test.

Run with Python code

In my case, I have my system outputs organized as follows:

[Screenshot: a decoded/ folder and a reference/ folder, each containing one .txt file per article, named with the article ID]

I have a reference folder for the gold-standard summaries. Each .txt file contains a single line for one article, and each file name carries the article's ID. The decoded folder, where I keep the system outputs, follows the same format: the machine-generated summaries are stored in .txt files whose names carry the same IDs, pairing them with their references in the reference folder.

The ID is what pyrouge uses to pair system and reference files when computing ROUGE scores. Starting from the code in the official documentation, I changed the file name patterns to match my case:

from pyrouge import Rouge155
r = Rouge155()
# set directories
r.system_dir = 'decoded/'
r.model_dir = 'reference/'

# define the patterns
r.system_filename_pattern = r'(\d+)_decoded.txt'
r.model_filename_pattern = '#ID#_reference.txt'

# use default parameters to run the evaluation
output = r.convert_and_evaluate()
print(output)
output_dict = r.output_to_dict(output)

You will see a lot of log output:

2017-12-18 11:21:36,865 [MainThread ] [INFO ] Writing summaries.
2017-12-18 11:21:36,868 [MainThread ] [INFO ] Processing summaries. Saving
...

Then, after some processing (pyrouge converts the .txt files into the format ROUGE expects), you will see the default parameters and finally a table of results:

2017-12-18 11:21:36,871 [MainThread ] [INFO ] Running ROUGE with command /Users/ireneli/Tools/pyrouge/tools/ROUGE-1.5.5/ROUGE-1.5.5.pl -e /Users/ireneli/Tools/pyrouge/tools/ROUGE-1.5.5/data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -a -m /var/folders/3h/x33pjd7d3k136fh564j3stqr0000gn/T/tmpkr3m965d/rouge_conf.xml
---------------------------------------------
1 ROUGE-1 Average_R: 0.78378 (95%-conf.int. 0.78378 - 0.78378)
1 ROUGE-1 Average_P: 0.80556 (95%-conf.int. 0.80556 - 0.80556)
1 ROUGE-1 Average_F: 0.79452 (95%-conf.int. 0.79452 - 0.79452)
---------------------------------------------
1 ROUGE-2 Average_R: 0.69444 (95%-conf.int. 0.69444 - 0.69444)
1 ROUGE-2 Average_P: 0.71429 (95%-conf.int. 0.71429 - 0.71429)
1 ROUGE-2 Average_F: 0.70423 (95%-conf.int. 0.70423 - 0.70423)
---------------------------------------------
1 ROUGE-3 Average_R: 0.62857 (95%-conf.int. 0.62857 - 0.62857)
1 ROUGE-3 Average_P: 0.64706 (95%-conf.int. 0.64706 - 0.64706)
1 ROUGE-3 Average_F: 0.63768 (95%-conf.int. 0.63768 - 0.63768)
---------------------------------------------
1 ROUGE-4 Average_R: 0.55882 (95%-conf.int. 0.55882 - 0.55882)
1 ROUGE-4 Average_P: 0.57576 (95%-conf.int. 0.57576 - 0.57576)
1 ROUGE-4 Average_F: 0.56716 (95%-conf.int. 0.56716 - 0.56716)
---------------------------------------------
1 ROUGE-L Average_R: 0.78378 (95%-conf.int. 0.78378 - 0.78378)
1 ROUGE-L Average_P: 0.80556 (95%-conf.int. 0.80556 - 0.80556)
1 ROUGE-L Average_F: 0.79452 (95%-conf.int. 0.79452 - 0.79452)
---------------------------------------------
1 ROUGE-W-1.2 Average_R: 0.32228 (95%-conf.int. 0.32228 - 0.32228)
1 ROUGE-W-1.2 Average_P: 0.68198 (95%-conf.int. 0.68198 - 0.68198)
1 ROUGE-W-1.2 Average_F: 0.43771 (95%-conf.int. 0.43771 - 0.43771)
---------------------------------------------
1 ROUGE-S* Average_R: 0.60961 (95%-conf.int. 0.60961 - 0.60961)
1 ROUGE-S* Average_P: 0.64444 (95%-conf.int. 0.64444 - 0.64444)
1 ROUGE-S* Average_F: 0.62654 (95%-conf.int. 0.62654 - 0.62654)
---------------------------------------------
1 ROUGE-SU* Average_R: 0.61966 (95%-conf.int. 0.61966 - 0.61966)
1 ROUGE-SU* Average_P: 0.65414 (95%-conf.int. 0.65414 - 0.65414)
1 ROUGE-SU* Average_F: 0.63643 (95%-conf.int. 0.63643 - 0.63643)

Normally we report the ROUGE-2 Average_F and ROUGE-L Average_F scores.
You might also want to remove the temporary files to free some space on your machine. In this case, I would need to delete the tmpkr3m965d folder (its path can be found in the log output).
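The `output_to_dict` call in the snippet above already converts the table into a dictionary for you. For completeness, here is a minimal alternative sketch (the function name `extract_f_scores` is my own) that pulls the Average_F score for each ROUGE variant straight out of the printed table:

```python
import re

def extract_f_scores(rouge_output):
    """Map each ROUGE variant to its Average_F score, parsed from
    the text table that ROUGE-1.5.5 prints."""
    scores = {}
    for m in re.finditer(r'(ROUGE-[-\w.*]+) Average_F: ([\d.]+)', rouge_output):
        scores[m.group(1)] = float(m.group(2))
    return scores
```

With the table above, `scores['ROUGE-2']` and `scores['ROUGE-L']` give exactly the two numbers usually reported.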

Troubleshooting: illegal division by zero

I was annoyed by this error:

Now starting ROUGE eval...
Illegal division by zero at /home/lily/zl379/RELEASE-1.5.5/ROUGE-1.5.5.pl line 2455.
subprocess.CalledProcessError: Command '['/home/lily/zl379/RELEASE-1.5.5/ROUGE-1.5.5.pl', '-e', '/home/lily/zl379/RELEASE-1.5.5/data', '-c', '95', '-2', '-1', '-U', '-r', '1000', '-n', '4', '-w', '1.2', '-a', '-m', '/tmp/tmpuu0bqmes/rouge_conf.xml']' returned non-zero exit status 255

So I checked line 2455 in ROUGE-1.5.5.pl:

$$score=wlcsWeightInverse($$hit/$$base,$weightFactor);

The error says "illegal division by zero", which means $$base can be zero at this line. We can infer that some of the .txt files are empty, or that they contain only things like <b> tags or other strange characters. Other problem cases include documents full of HTML tags or, for whatever reason, a document consisting of nothing but a URL; these can also trigger this error. Always make sure you have preprocessed your data before running the ROUGE evaluation. After filtering such files out, the problem was solved.

“Cannot open exception db …”

If you meet this error while running runROUGE-test.pl:
“Cannot open exception db file for reading: data/WordNet-2.0.exc.db”

The common fix is to rebuild the WordNet exception database: go into the data/WordNet-2.0-Exceptions directory, run the bundled buildExeptionDB.pl script to regenerate WordNet-2.0.exc.db, and make sure data/WordNet-2.0.exc.db points at the regenerated file.

Published by Irene
