# Revision Note: 05/26/2005, Chin-Yew LIN
# 1.5.5
# (1) Correct stemming on multi-token BE heads and modifiers.
# Previously, only single-token heads and modifiers were assumed.
# (2) Correct the resampling routine, which ignored the last evaluation
# item in the evaluation list. As a result, the average scores reported
# by ROUGE were based only on the first N-1 evaluation items.
# Thanks to Barry Schiffman at Columbia University for reporting this bug.
# This bug only affects ROUGE-1.5.X. For pre-1.5 ROUGE, it only affects
# the computation of the confidence interval (CI) estimation, i.e. the CI is
# estimated from only the first N-1 evaluation items, but it *does not* affect
# the average scores.
# (3) Change the read_text and read_text_LCS functions to read the exact number
# of words or bytes required by users. Previous versions carried out whitespace
# compression and other string cleanup actions before enforcing the length
# limit.
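# As an illustration, a minimal sketch (hypothetical helper, not the actual
# read_text code) of enforcing a word or byte limit on the raw text without
# any prior whitespace compression or cleanup:
# sub truncate_raw {
#     my ($text, $limit, $unit) = @_;        # $unit is "word" or "byte"
#     return substr($text, 0, $limit) if $unit eq "byte";   # keep the first $limit bytes as-is
#     my @pieces = split /(\s+)/, $text;     # keep whitespace so the text is not cleaned up
#     my ($out, $words) = ("", 0);
#     for my $piece (@pieces) {
#         $words++ if $piece =~ /\S/;
#         last if $words > $limit;
#         $out .= $piece;
#     }
#     return $out;
# }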
# 1.5.4.1
# (1) Minor description change about "-t 0" option.
# 1.5.4
# (1) Add easy evaluation mode for single-reference evaluations with the -z
# option.
# 1.5.3
# (1) Add an option to compute ROUGE scores based on SIMPLE BE format. Given
# a set of peer and model summary files in BE format with appropriate
# options, ROUGE will compute matching scores based on BE lexical
# matches.
# There are 6 options (a minimal matching sketch follows this list):
# 1. H : Head only match. This is similar to unigram match but
# only the BE Head is used in matching. BEs generated by the
# Minipar-based breaker do not include head-only BEs;
# therefore, the score will always be zero. Use the HM or HMR
# options instead.
# 2. HM : Head and modifier match. This is similar to bigram or
# skip bigram, but it is a head-modifier bigram match based on
# the parse result. Only BE triples with a non-NIL modifier are
# included in the matching.
# 3. HMR : Head, modifier, and relation match. This is similar to
# trigram match, but it is a head-modifier-relation trigram
# match based on the parse result. Only BE triples with a non-NIL
# relation are included in the matching.
# 4. HM1 : This is a combination of H and HM. It is similar to unigram +
# bigram or skip bigram with unigram match, but it is a
# head-modifier bigram match based on the parse result.
# In this case, the modifier field in a BE can be "NIL".
# 5. HMR1 : This is a combination of HM and HMR. It is similar to
# trigram match, but it is a head-modifier-relation trigram
# match based on the parse result. In this case, the relation
# field of the BE can be "NIL".
# 6. HMR2 : This is a combination of H, HM, and HMR. It is similar to
# trigram match, but it is a head-modifier-relation trigram
# match based on the parse result. In this case, the modifier and
# relation fields of the BE can both be "NIL".
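# For illustration, a minimal counting sketch under the HM option (hypothetical
# subroutine, not the script's internal code): BE triples are read as
# "head|modifier|relation" strings, triples with a NIL modifier are dropped, and
# each model head-modifier pair can be matched at most as many times as it occurs:
# sub count_hm_matches {
#     my ($peer_bes, $model_bes) = @_;       # array refs of "head|modifier|relation" strings
#     my %model;
#     my $hits = 0;
#     for my $be (@$model_bes) {
#         my ($h, $m, $r) = split /\|/, $be;
#         $model{"$h|$m"}++ unless $m eq "NIL";   # keep head-modifier pairs only
#     }
#     for my $be (@$peer_bes) {
#         my ($h, $m, $r) = split /\|/, $be;
#         next if $m eq "NIL";
#         if ($model{"$h|$m"}) {                  # clipped match: consume one model occurrence
#             $model{"$h|$m"}--;
#             $hits++;
#         }
#     }
#     return $hits;   # recall uses the model pair total; precision uses the peer pair total
# }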
# 1.5.2
# (1) Add an option to compute ROUGE scores by token, using the whole corpus
# as the averaging unit instead of individual sentences. Previous versions of
# ROUGE used sentence (or unit) boundaries to delimit the counting units and took
# the average score over the counting units as the final score.
# Using the whole corpus as one single counting unit can potentially
# improve the reliability of the final score, since it treats each token as
# equally important; the previous approach considers each sentence as
# equally important, which ignores the length of each individual
# sentence (i.e. long sentences contribute the same weight to the final
# score as short sentences).
# +v1.2 provides a choice between these two counting modes so that users can
# choose the one that fits their scenarios.
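# A minimal sketch of the two averaging modes for recall (hypothetical
# subroutines; assumes each unit is summarized as [hit_count, total_model_count]):
# sub sentence_level_recall {                # every sentence (unit) weighted equally
#     my @units = @_;
#     my $sum = 0;
#     $sum += ($_->[1] ? $_->[0] / $_->[1] : 0) for @units;
#     return @units ? $sum / scalar(@units) : 0;
# }
# sub corpus_level_recall {                  # every token weighted equally
#     my @units = @_;
#     my ($hits, $total) = (0, 0);
#     for my $u (@units) { $hits += $u->[0]; $total += $u->[1]; }
#     return $total ? $hits / $total : 0;
# }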
# 1.5.1
# (1) Add a precision-oriented measure and an f-measure to deal with different lengths
# of candidates and references. The relative importance of recall and precision can
# be controlled by the 'alpha' parameter:
# alpha -> 0: recall is more important
# alpha -> 1: precision is more important
# Following Chapter 7 in C.J. van Rijsbergen's "Information Retrieval":
# http://www.dcs.gla.ac.uk/Keith/Chapter.7/Ch.7.html
# F = 1/(alpha * (1/P) + (1 - alpha) * (1/R)) ;;; weighted harmonic mean
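# The formula can be applied directly; a minimal sketch (hypothetical subroutine name):
# sub f_measure {
#     my ($P, $R, $alpha) = @_;              # alpha -> 0 favors recall, alpha -> 1 favors precision
#     return 0 if $P == 0 || $R == 0;        # no matches: F is zero, and we avoid division by zero
#     return 1 / ($alpha * (1 / $P) + (1 - $alpha) * (1 / $R));
# }
# # Example: P = 0.5, R = 0.25, alpha = 0.5 gives F = 1/3.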
# 1.4.2
# (1) Enforce the length limit at the time when the summary text is read. Previously
# (before and including v1.4.1), the length limit was enforced at tokenization time.
# 1.4.1
# (1) Fix potential over-counting in ROUGE-L and ROUGE-W.
# In previous versions (i.e. 1.4 and older), the LCS hit was computed
# by summing the union hit over all model sentences. Each model sentence
# was compared with all peer sentences and the union LCS was marked. The
# length of the union LCS is the hit of that model sentence. The
# final hit is then the sum over all model union LCS hits. This could
# over-count a peer sentence that had already been marked as contributing
# to some other model sentence, resulting in double counting.
# This was seen in evaluations where the ROUGE-L score was higher than ROUGE-1,
# which is not correct.
# ROUGEeval-1.4.1.pl fixes this by adding a clip function to prevent
# double counting.
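# A simplified sketch of the clipping idea (not the actual ROUGEeval-1.4.1.pl code):
# after accumulating per-token LCS hits across model sentences, each token's hit
# count is capped by the number of times it actually appears in the peer summary:
# sub clip_hits {
#     my ($raw_hits, $peer_counts) = @_;     # hash refs: token => count
#     my $clipped = 0;
#     for my $tok (keys %$raw_hits) {
#         my $avail = $peer_counts->{$tok} || 0;
#         $clipped += $raw_hits->{$tok} < $avail ? $raw_hits->{$tok} : $avail;
#     }
#     return $clipped;
# }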
# 1.4
# (1) Remove internal Jackknifing procedure:
# Now the ROUGE script will use all the references listed in the
# <MODELS></MODELS> section in each <EVAL></EVAL> section, and no
# automatic Jackknifing is performed.
# If a Jackknifing procedure is required when comparing human and system
# performance, then users have to set up the procedure in the ROUGE
# evaluation configuration script as follows:
# For example, to evaluate system X with 4 references R1, R2, R3, and R4,
# we do the following computation:
#
# for system: and for comparable human:
# s1 = X vs. R1, R2, R3 h1 = R4 vs. R1, R2, R3
# s2 = X vs. R1, R3, R4 h2 = R2 vs. R1, R3, R4
# s3 = X vs. R1, R2, R4 h3 = R3 vs. R1, R2, R4
# s4 = X vs. R2, R3, R4 h4 = R1 vs. R2, R3, R4
#
# Average system score for X = (s1+s2+s3+s4)/4 and for human = (h1+h2+h3+h4)/4
# Implementation of this in a ROUGE evaluation configuration script is as follows:
# Instead of writing all references in an evaluation section as below:
# <EVAL ID="1">
# ...
# <PEERS>
# <P ID="X">systemX</P>
# </PEERS>
# <MODELS>
# <M ID="1">R1</M>
# <M ID="2">R2</M>
# <M ID="3">R3</M>
# <M ID="4">R4</M>
# </MODELS>
# </EVAL>
# we write the following:
# <EVAL ID="1-1">
# ...
# <PEERS>
# <P ID="X">systemX</P>
# </PEERS>
# <MODELS>
# <M ID="2">R2</M>
# <M ID="3">R3</M>
# <M ID="4">R4</M>
# </MODELS>
# </EVAL>
# <EVAL ID="1-2">
# ...
# <PEERS>
# <P ID="X">systemX</P>
# </PEERS>
# <MODELS>
# <M ID="1">R1</M>
# <M ID="3">R3</M>
# <M ID="4">R4</M>
# </MODELS>
# </EVAL>
# <EVAL ID="1-3">
# ...
# <PEERS>
# <P ID="X">systemX</P>
# </PEERS>
# <MODELS>
# <M ID="1">R1</M>
# <M ID="2">R2</M>
# <M ID="4">R4</M>
# </MODELS>
# </EVAL>
# <EVAL ID="1-4">
# ...
# <PEERS>
# <P ID="X">systemX</P>
# </PEERS>
# <MODELS>
# <M ID="1">R1</M>
# <M ID="2">R2</M>
# <M ID="3">R3</M>
# </MODELS>
# </EVAL>
# In this case, the system and human numbers are comparable.
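# As a convenience, the leave-one-out sections above could be generated rather
# than written by hand; a minimal sketch (hypothetical helper, not part of ROUGE,
# omitting the root and input-format elements shown as "..." above):
# my $peer = "systemX";
# my @refs = ("R1", "R2", "R3", "R4");
# for my $leave (0 .. $#refs) {
#     printf qq{<EVAL ID="1-%d">\n}, $leave + 1;
#     print qq{<PEERS>\n<P ID="X">$peer</P>\n</PEERS>\n<MODELS>\n};
#     for my $j (0 .. $#refs) {
#         next if $j == $leave;              # leave one reference out of this section
#         printf qq{<M ID="%d">%s</M>\n}, $j + 1, $refs[$j];
#     }
#     print qq{</MODELS>\n</EVAL>\n};
# }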
# ROUGE as it is implemented for summarization evaluation is a recall-based metric.
# As we increase the number of references, we increase the number of
# counting units (n-grams, skip-bigrams, or LCSes) in the target pool (i.e.
# the number that ends up in the denominator of any ROUGE formula is larger).
# Therefore, a candidate summary has more chances to hit, but it also has to
# hit more. In the end, this means lower absolute ROUGE scores when more
# references are used, and scores computed with different sets of references should
# not be compared to each other. There is no normalization mechanism in ROUGE
# to properly adjust for differences due to the number of references used.
#
# In the ROUGE implementations before v1.4, when there are N models provided for
# evaluating system X in the ROUGE evaluation script, ROUGE does the
# following:
# (1) s1 = X vs. R2, R3, R4, ..., RN
# (2) s2 = X vs. R1, R3, R4, ..., RN
# (3) s3 = X vs. R1, R2, R4, ..., RN
# (4) s4 = X vs. R1, R2, R3, R5, ..., RN
# (5) ...
# (6) sN = X vs. R1, R2, R3, ..., RN-1
# The final ROUGE score is then computed by taking the average of (s1, s2, s3,
# s4, ..., sN). When we provide only three references for the evaluation of a
# human summarizer, ROUGE does the same thing but uses 2 out of 3
# references, gets three numbers, and then takes the average as the final
# score. Now ROUGE (after v1.4) uses all references without this
# internal Jackknifing procedure. The speed of the evaluation should improve
# significantly, since only one set of computations instead of N sets is
# performed.
# 1.3
# (1) Add skip bigram
# (2) Add an option to specify the number of sampling points (default is 1000)
# 1.2.3
# (1) Correct the environment variable option: -e. Now users can specify the
# environment variable ROUGE_EVAL_HOME using the "-e" option; previously this
# option was not active. Thanks to Zhouyan Li of Concordia University, Canada,
# for pointing this out.
# 1.2.2
# (1) Correct confidence interval calculation for median, maximum, and minimum.
# Line 390.
# 1.2.1
# (1) Add sentence-per-line input format. See files in Verify-SPL for examples.
# (2) Streamline command line arguments.
# (3) Use bootstrap resampling to estimate confidence intervals instead of a t-test
# or z-test, which assume a normal distribution (a resampling sketch follows this list).
# (4) Add LCS (longest common subsequence) evaluation method.
# (5) Add WLCS (weighted longest common subsequence) evaluation method.
# (6) Add length cutoff in bytes.
# (7) Add an option to specify the longest ngram to compute. The default is 4.
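# For item (3), a minimal bootstrap sketch (hypothetical subroutine; percentile
# method, 95% interval): resample the per-evaluation scores with replacement,
# recompute the mean each time, and take the 2.5th and 97.5th percentiles of the
# resampled means:
# sub bootstrap_ci {
#     my ($scores, $n_samples) = @_;         # $scores is an array ref of per-evaluation scores
#     $n_samples ||= 1000;
#     my $n = scalar @$scores;
#     my @means;
#     for (1 .. $n_samples) {
#         my $sum = 0;
#         $sum += $scores->[int(rand($n))] for 1 .. $n;   # resample with replacement
#         push @means, $sum / $n;
#     }
#     @means = sort { $a <=> $b } @means;
#     return ($means[int(0.025 * $n_samples)], $means[int(0.975 * $n_samples)]);
# }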
# 1.2
# (1) Change the zero condition check in subroutine &computeNGramScores when
# computing $gram1Score from
# if($totalGram2Count!=0) to
# if($totalGram1Count!=0)
# Thanks to Ken Litkowski for this bug report.
# The original script would set $gram1Score to zero if there were no
# bigram matches. This should rarely have a significant effect on the final score,
# since (a) there are bigram matches most of the time; (b) the computation
# of $gram1Score uses a Jackknifing procedure. However, it definitely
# did not compute the correct $gram1Score when there were no bigram matches.
# Therefore, users of version 1.1 should definitely upgrade to a newer
# version of the script that does not contain this bug.
# Note: To use this script, two additional data files are needed:
# (1) smart_common_words.txt - contains stopword list from SMART IR engine
# (2) WordNet-1.6.exc.db - WordNet 1.6 exception inflection database
# These two files have to be put in a directory pointed to by the environment
# variable "ROUGE_EVAL_HOME".
# If the environment variable ROUGE_EVAL_HOME does not exist, this script
# will assume it can find these two database files in the current directory.
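# A minimal sketch of that lookup (hypothetical variable names):
# my $data_dir = defined $ENV{"ROUGE_EVAL_HOME"} ? $ENV{"ROUGE_EVAL_HOME"} : ".";
# my $stopwords_file = "$data_dir/smart_common_words.txt";
# my $wordnet_exc_db = "$data_dir/WordNet-1.6.exc.db";
# die "Cannot find $stopwords_file or $wordnet_exc_db\n"
#     unless -e $stopwords_file && -e $wordnet_exc_db;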