# Revision Note: 05/26/2005, Chin-Yew LIN
#              1.5.5
#              (1) Correct stemming on multi-token BE heads and modifiers.
#                  Previously, only single token heads and modifiers were assumed.
#              (2) Correct the resampling routine, which ignored the last evaluation
#                  item in the evaluation list. As a result, the average scores reported
#                  by ROUGE were based only on the first N-1 evaluation items.
#                  Thanks to Barry Schiffman at Columbia University for reporting this bug.
#                  This bug only affects ROUGE-1.5.X. For pre-1.5 ROUGE, it only affects
#                  the computation of the confidence interval (CI) estimate, i.e. the CI was
#                  estimated from only the first N-1 evaluation items, but it *does not*
#                  affect the average scores.
#              (3) Change the read_text and read_text_LCS functions to read the exact
#                  number of words or bytes required by users. Previous versions carried
#                  out whitespace compression and other string cleanup actions before
#                  enforcing the length limit.
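#                  As an illustration of the new behavior, the hedged sketch below trims
#                  raw text to a word or byte budget without normalizing whitespace first;
#                  the subroutine names are made up and are not the script's read_text or
#                  read_text_LCS.
#                    use strict;
#                    use warnings;
#
#                    # Illustrative only; these are not the script's read_text or read_text_LCS.
#                    # Keep only the first $limit whitespace-delimited words of the raw text.
#                    sub first_n_words {
#                        my ($text, $limit) = @_;
#                        my @words = split /\s+/, $text;
#                        @words = @words[0 .. $limit - 1] if @words > $limit;
#                        return join ' ', @words;
#                    }
#
#                    # Keep only the first $limit bytes of the raw text.
#                    sub first_n_bytes {
#                        my ($text, $limit) = @_;
#                        return length($text) > $limit ? substr($text, 0, $limit) : $text;
#                    }
#
#                    my $raw = "the  quick   brown fox jumps over the lazy dog";
#                    print first_n_words($raw, 5), "\n";   # "the quick brown fox jumps"
#                    print first_n_bytes($raw, 9), "\n";   # "the  quic"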
#              1.5.4.1
#              (1) Minor description change about "-t 0" option.
#              1.5.4
#              (1) Add an easy evaluation mode for single-reference evaluations with the
#                  -z option.
#              1.5.3
#              (1) Add an option to compute ROUGE scores based on the SIMPLE BE format.
#                  Given a set of peer and model summary files in BE format with the
#                  appropriate options, ROUGE will compute matching scores based on BE
#                  lexical matches.
#                  There are 6 options:
#                  1. H    : Head-only match. This is similar to unigram match, but
#                            only the BE head is used in matching. BEs generated by
#                            the Minipar-based breaker do not include head-only BEs;
#                            therefore, the score will always be zero. Use the HM or
#                            HMR options instead.
#                  2. HM   : Head and modifier match. This is similar to bigram or
#                            skip-bigram match, but it is a head-modifier bigram match
#                            based on the parse result. Only BE triples with a non-NIL
#                            modifier are included in the matching (a small sketch of
#                            this mode follows the option list).
#                  3. HMR  : Head, modifier, and relation match. This is similar to
#                            trigram match, but it is a head-modifier-relation trigram
#                            match based on the parse result. Only BE triples with a
#                            non-NIL relation are included in the matching.
#                  4. HM1  : This is a combination of H and HM. It is similar to
#                            unigram + bigram or skip-bigram with unigram match, but it
#                            is a head-modifier bigram match based on the parse result.
#                            In this case, the modifier field of a BE can be "NIL".
#                  5. HMR1 : This is a combination of HM and HMR. It is similar to
#                            trigram match, but it is a head-modifier-relation trigram
#                            match based on the parse result. In this case, the relation
#                            field of the BE can be "NIL".
#                  6. HMR2 : This is a combination of H, HM, and HMR. It is similar to
#                            trigram match, but it is a head-modifier-relation trigram
#                            match based on the parse result. In this case, the modifier
#                            and relation fields of the BE can both be "NIL".
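#                  The hedged sketch below illustrates the HM mode only: BEs are taken
#                  as (head, modifier, relation) triples, triples with a NIL modifier are
#                  dropped, and matches are clipped so a pair is not credited more often
#                  than it appears in either summary. The data layout and the names used
#                  here are assumptions for illustration, not the script's internals.
#                    use strict;
#                    use warnings;
#                    use List::Util qw(min);
#
#                    # Illustrative HM counting only; this is not the script's own BE matcher.
#                    sub hm_counts {
#                        my ($bes) = @_;                  # array ref of [head, modifier, relation]
#                        my %count;
#                        for my $be (@$bes) {
#                            my ($head, $mod, $rel) = @$be;
#                            next if $mod eq 'NIL';       # HM: keep only non-NIL modifiers
#                            $count{lc("$head|$mod")}++;
#                        }
#                        return \%count;
#                    }
#
#                    sub hm_recall {
#                        my ($peer, $model) = @_;
#                        my ($p, $m) = (hm_counts($peer), hm_counts($model));
#                        my ($hits, $total) = (0, 0);
#                        for my $pair (keys %$m) {
#                            $total += $m->{$pair};
#                            $hits  += min($p->{$pair} // 0, $m->{$pair});   # clipped match count
#                        }
#                        return $total ? $hits / $total : 0;
#                    }
#
#                    my @peer  = (['dog', 'lazy', 'mod'], ['fox', 'NIL',   'NIL']);
#                    my @model = (['dog', 'lazy', 'mod'], ['fox', 'quick', 'mod']);
#                    printf "HM recall = %.2f\n", hm_recall(\@peer, \@model);   # 0.50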
#              1.5.2
#              (1) Add an option to compute ROUGE scores by token, using the whole corpus
#                  as the averaging unit instead of individual sentences. Previous versions
#                  of ROUGE use sentence (or unit) boundaries to delimit the counting units
#                  and take the average score over the counting units as the final score.
#                  Using the whole corpus as one single counting unit can potentially
#                  improve the reliability of the final score, since it treats each token
#                  as equally important; the previous approach treats each sentence as
#                  equally important and therefore ignores the length of individual
#                  sentences (i.e. long sentences contribute the same weight to the final
#                  score as short sentences).
#                  +v1.2 provides a choice between these two counting modes so that users
#                  can choose the one that fits their scenario.
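#                  A hedged sketch of the difference between the two counting modes, with
#                  made-up per-sentence hit and token counts (this is not the script's own
#                  code):
#                    use strict;
#                    use warnings;
#                    use List::Util qw(sum);
#
#                    # Illustrative numbers only: [hits, reference tokens] per sentence.
#                    my @sents = ([2, 4], [1, 2], [30, 100]);
#
#                    # Sentence-level averaging: every sentence carries the same weight.
#                    my $per_sentence = sum(map { $_->[0] / $_->[1] } @sents) / @sents;
#
#                    # Corpus-level counting: every token carries the same weight.
#                    my $per_corpus = sum(map { $_->[0] } @sents) / sum(map { $_->[1] } @sents);
#
#                    printf "sentence-averaged recall: %.3f\n", $per_sentence;   # 0.433
#                    printf "corpus-level recall:      %.3f\n", $per_corpus;     # 0.311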
#              1.5.1
#              (1) Add a precision-oriented measure and an F-measure to deal with different
#                  lengths of candidates and references. The relative importance of recall
#                  and precision can be controlled by the 'alpha' parameter:
#                  alpha -> 0: recall is more important
#                  alpha -> 1: precision is more important
#                  Following Chapter 7 of C.J. van Rijsbergen's "Information Retrieval".
#                  http://www.dcs.gla.ac.uk/Keith/Chapter.7/Ch.7.html
#                  F = 1/(alpha * (1/P) + (1 - alpha) * (1/R)) ;;; weighted harmonic mean
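#                  A small sketch of this weighted harmonic mean with made-up precision and
#                  recall values (illustrative only; not the script's own code):
#                    use strict;
#                    use warnings;
#
#                    # F = 1/(alpha*(1/P) + (1-alpha)*(1/R)); returns 0 if either P or R is 0.
#                    sub f_measure {
#                        my ($p, $r, $alpha) = @_;
#                        return 0 if $p == 0 || $r == 0;
#                        return 1 / ($alpha * (1 / $p) + (1 - $alpha) * (1 / $r));
#                    }
#
#                    my ($p, $r) = (0.40, 0.80);                                # made-up values
#                    printf "alpha=0.0 (recall only):    %.3f\n", f_measure($p, $r, 0.0);  # 0.800
#                    printf "alpha=0.5 (balanced F1):    %.3f\n", f_measure($p, $r, 0.5);  # 0.533
#                    printf "alpha=1.0 (precision only): %.3f\n", f_measure($p, $r, 1.0);  # 0.400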
#              1.4.2
#              (1) Enforce the length limit at the time when the summary text is read.
#                  Previously (before and including v1.4.1), the length limit was enforced
#                  at tokenization time.
#              1.4.1
#              (1) Fix potential over-counting in ROUGE-L and ROUGE-W.
#                  In previous versions (i.e. 1.4 and older), the LCS hit is computed
#                  by summing union hits over all model sentences. Each model sentence
#                  is compared with all peer sentences and the union LCS is marked. The
#                  length of the union LCS is the hit of that model sentence. The
#                  final hit is then the sum over all model union LCS hits. This can
#                  over-count a peer sentence that has already been marked as contributing
#                  to some other model sentence, so double counting results.
#                  This is seen in evaluations where the ROUGE-L score is higher than
#                  ROUGE-1, which is not correct.
#                  ROUGEeval-1.4.1.pl fixes this by adding a clip function to prevent
#                  double counting.
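#                  A simplified sketch of the clipping idea (not the script's own clip
#                  function): a peer token can be credited at most as many times as it
#                  occurs in the peer, no matter how many model sentences' union LCSes
#                  contain it.
#                    use strict;
#                    use warnings;
#
#                    # Illustrative only; not ROUGEeval-1.4.1.pl's actual clip function.
#                    sub clipped_hits {
#                        my ($peer_tokens, @union_lcs_per_model_sentence) = @_;
#                        my %budget;
#                        $budget{$_}++ for @$peer_tokens;     # available credit per peer token
#                        my $hits = 0;
#                        for my $lcs (@union_lcs_per_model_sentence) {
#                            for my $tok (@$lcs) {
#                                next unless $budget{$tok};   # credit used up: no double counting
#                                $budget{$tok}--;
#                                $hits++;
#                            }
#                        }
#                        return $hits;
#                    }
#
#                    my @peer = qw(the cat sat);
#                    # two model sentences whose union LCSes both contain "the cat"
#                    print clipped_hits(\@peer, [qw(the cat)], [qw(the cat sat)]), "\n";   # 3, not 5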
#              1.4
#              (1) Remove internal Jackknifing procedure:
#                  Now the ROUGE script will use all the references listed in the
#                  <MODELS></MODELS> section of each <EVAL></EVAL> section and no
#                  automatic Jackknifing is performed.
#                  If a Jackknifing procedure is required when comparing human and system
#                  performance, then users have to set up the procedure in the ROUGE
#                  evaluation configuration script as follows:
#                  For example, to evaluate system X with 4 references R1, R2, R3, and R4,
#                  we do the following computation:
#
#                  for system:            and for comparable human:
#                  s1 = X vs. R1, R2, R3    h1 = R4 vs. R1, R2, R3 
#                  s2 = X vs. R1, R3, R4    h2 = R2 vs. R1, R3, R4
#                  s3 = X vs. R1, R2, R4    h3 = R3 vs. R1, R2, R4
#                  s4 = X vs. R2, R3, R4    h4 = R1 vs. R2, R3, R4
#
#                  Average system score for X = (s1+s2+s3+s4)/4 and for human = (h1+h2+h3+h4)/4
#                  The implementation of this in a ROUGE evaluation configuration script is
#                  as follows. Instead of writing all references in one evaluation section
#                  as below:
#                    <EVAL ID="1">
#                    ...
#                    <PEERS>
#                    <P ID="X">systemX</P>
#                    </PEERS>
#                    <MODELS>
#                    <M ID="1">R1</M>
#                    <M ID="2">R2</M>
#                    <M ID="3">R3</M>
#                    <M ID="4">R4</M>
#                    </MODELS>
#                    </EVAL>
#                  we write the following:
#                    <EVAL ID="1-1">
#                    <PEERS>
#                    <P ID="X">systemX</P>
#                    </PEERS>
#                    <MODELS>
#                    <M ID="2">R2</M>
#                    <M ID="3">R3</M>
#                    <M ID="4">R4</M>
#                    </MODELS>
#                    </EVAL>
#                    <EVAL ID="1-2">
#                    <PEERS>
#                    <P ID="X">systemX</P>
#                    </PEERS>
#                    <MODELS>
#                    <M ID="1">R1</M>
#                    <M ID="3">R3</M>
#                    <M ID="4">R4</M>
#                    </MODELS>
#                    </EVAL>
#                    <EVAL ID="1-3">
#                    <PEERS>
#                    <P ID="X">systemX</P>
#                    </PEERS>
#                    <MODELS>
#                    <M ID="1">R1</M>
#                    <M ID="2">R2</M>
#                    <M ID="4">R4</M>
#                    </MODELS>
#                    </EVAL>
#                    <EVAL ID="1-4">
#                    <PEERS>
#                    <P ID="X">systemX</P>
#                    </PEERS>
#                    <MODELS>
#                    <M ID="1">R1</M>
#                    <M ID="2">R2</M>
#                    <M ID="3">R3</M>
#                    </MODELS>
#                    </EVAL>
#                    
#                  In this case, the system and human numbers are comparable.
#                  ROUGE, as it is implemented for summarization evaluation, is a recall-based
#                  metric. As we increase the number of references, we increase the number of
#                  counting units (n-grams, skip-bigrams, or LCSes) in the target pool (i.e.
#                  the number that ends up in the denominator of any ROUGE formula is larger).
#                  Therefore, a candidate summary has more chances to hit, but it also has
#                  more to hit. In the end, this means lower absolute ROUGE scores when more
#                  references are used, and scores obtained with different sets of references
#                  should not be compared to each other. There is no normalization mechanism
#                  in ROUGE to properly adjust for differences due to the number of references
#                  used.
#                    
#                  In the ROUGE implementations before v1.4, when there are N models provided
#                  for evaluating system X in the ROUGE evaluation script, ROUGE does the
#                  following:
#                    (1) s1 = X vs. R2, R3, R4, ..., RN
#                    (2) s2 = X vs. R1, R3, R4, ..., RN
#                    (3) s3 = X vs. R1, R2, R4, ..., RN
#                    (4) s4 = X vs. R1, R2, R3, R5, ..., RN
#                    (5) ...
#                    (6) sN = X vs. R1, R2, R3, ..., RN-1
#                  The final ROUGE score is then computed by taking the average of (s1, s2,
#                  s3, s4, ..., sN). When we provide only three references for the evaluation
#                  of a human summarizer, ROUGE does the same thing but uses 2 out of 3
#                  references, gets three numbers, and then takes the average as the final
#                  score. Now ROUGE (after v1.4) will use all references without this
#                  internal Jackknifing procedure. The speed of the evaluation should improve
#                  a lot, since only one set of computations instead of four sets will be
#                  conducted.
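#                  Since Jackknifing is no longer done internally, the external bookkeeping
#                  reduces to a simple leave-one-out average. A hedged sketch with made-up
#                  scores (in practice each si would come from one <EVAL> section such as
#                  "1-1" ... "1-4" above):
#                    use strict;
#                    use warnings;
#                    use List::Util qw(sum);
#
#                    # Made-up scores: si = score of X against all references except Ri.
#                    my %score_without = (R1 => 0.41, R2 => 0.39, R3 => 0.44, R4 => 0.40);
#
#                    my $jackknifed = sum(values %score_without) / keys %score_without;
#                    printf "Jackknifed ROUGE for X = %.3f\n", $jackknifed;   # 0.410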
#              1.3
#              (1) Add skip bigram
#              (2) Add an option to specify the number of sampling points (default is 1000)
#              1.2.3
#              (1) Correct the environment variable option: -e. Now users can specify the
#                  environment variable ROUGE_EVAL_HOME using the "-e" option; previously
#                  this option was not active. Thanks to Zhouyan Li of Concordia University,
#                  Canada, for pointing this out.
#              1.2.2
#              (1) Correct confidence interval calculation for median, maximum, and minimum.
#                  Line 390.
#              1.2.1
#              (1) Add a sentence-per-line input format. See the files in Verify-SPL for
#                  examples.
#              (2) Streamline command line arguments.
#              (3) Use bootstrap resampling to estimate confidence intervals instead of a
#                  t-test or z-test, which assume a normal distribution (a small sketch
#                  follows this list).
#              (4) Add LCS (longest common subsequence) evaluation method.
#              (5) Add WLCS (weighted longest common subsequence) evaluation method.
#              (6) Add length cutoff in bytes.
#              (7) Add an option to specify the longest ngram to compute. The default is 4.
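#                  A minimal sketch of bootstrap resampling for a confidence interval on the
#                  mean per-item score (illustrative only; the script's own resampling and
#                  percentile handling may differ):
#                    use strict;
#                    use warnings;
#                    use List::Util qw(sum);
#
#                    my @scores  = (0.31, 0.45, 0.38, 0.52, 0.29, 0.41, 0.36, 0.48);  # made-up
#                    my $samples = 1000;                  # cf. the sampling-point option in 1.3
#
#                    my @means;
#                    for (1 .. $samples) {
#                        # resample the evaluation items with replacement and record the mean
#                        my @resample = map { $scores[int rand @scores] } @scores;
#                        push @means, sum(@resample) / @resample;
#                    }
#                    @means = sort { $a <=> $b } @means;
#
#                    my ($lower, $upper) = @means[int(0.025 * $samples), int(0.975 * $samples) - 1];
#                    printf "95%% CI for the mean: [%.3f, %.3f]\n", $lower, $upper;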
#              1.2
#              (1) Change the zero condition check in subroutine &computeNGramScores when
#                  computing $gram1Score from
#                  if($totalGram2Count!=0)  to
#                  if($totalGram1Count!=0)
#                  Thanks to Ken Litkowski for this bug report.
#                  The original script would set $gram1Score to zero if there were no
#                  bigram matches. This should rarely have a significant effect on the final
#                  score, since (a) there are bigram matches most of the time, and (b) the
#                  computation of $gram1Score uses a Jackknifing procedure. However, it
#                  definitely did not compute the correct $gram1Score when there were no
#                  bigram matches. Therefore, users of version 1.1 should definitely upgrade
#                  to a newer version of the script that does not contain this bug.
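#                  A self-contained sketch of the corrected guard; the hit and count values
#                  below are stand-ins, not the script's real bookkeeping inside
#                  &computeNGramScores:
#                    use strict;
#                    use warnings;
#
#                    # Stand-in values: some unigram matches, but no bigrams at all.
#                    my ($gram1Hit, $totalGram1Count) = (3, 10);
#                    my ($gram2Hit, $totalGram2Count) = (0, 0);
#
#                    # v1.2 guards each score with its own total count.
#                    my $gram1Score = $totalGram1Count != 0 ? $gram1Hit / $totalGram1Count : 0;
#                    my $gram2Score = $totalGram2Count != 0 ? $gram2Hit / $totalGram2Count : 0;
#
#                    # With the old guard (testing $totalGram2Count for $gram1Score),
#                    # $gram1Score would wrongly stay 0 whenever $totalGram2Count was 0.
#                    printf "gram1Score = %.2f, gram2Score = %.2f\n", $gram1Score, $gram2Score;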
# Note:        To use this script, two additional data files are needed:
#              (1) smart_common_words.txt - contains the stopword list from the SMART IR engine
#              (2) WordNet-1.6.exc.db - WordNet 1.6 exception inflexion database
#              These two files have to be put in a directory pointed to by the environment
#              variable "ROUGE_EVAL_HOME".
#              If the environment variable ROUGE_EVAL_HOME does not exist, this script will
#              assume it can find these two database files in the current directory.
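#              A hedged sketch of this lookup order (illustrative only, not the script's
#              own code): use ROUGE_EVAL_HOME when it is set, otherwise look in the current
#              directory.
#                use strict;
#                use warnings;
#                use File::Spec;
#
#                # Illustrative lookup only; the real script does its own error handling.
#                my $data_dir = defined $ENV{ROUGE_EVAL_HOME} ? $ENV{ROUGE_EVAL_HOME} : '.';
#
#                for my $file ('smart_common_words.txt', 'WordNet-1.6.exc.db') {
#                    my $path = File::Spec->catfile($data_dir, $file);
#                    print -e $path ? "found $path\n" : "missing $path\n";
#                }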