Commit b2a76567 authored by zhanghan17

release ernie-gen

Parent 830e2b7e
English | [简体中文](./README.zh.md)
## _ERNIE-GEN_: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation
- [Proposed Generation Framework](#proposed-generation-framework)
- [Pre-trained Models](#pre-trained-models)
- [Fine-tuning on Downstream Tasks](#fine-tuning-on-downstream-tasks)
  * [Abstractive Summarization](#abstractive-summarization)
  * [Question Generation](#question-generation)
  * [Generative Dialogue Response](#generative-dialogue-response)
  * [Generative Question Answering](#generative-question-answering)
- [Usage](#usage)
  * [Install PaddlePaddle](#install-paddlepaddle)
  * [Fine-tuning](#fine-tuning)
  * [Employ Dynamic Computation Graph](#employ-dynamic-computation-graph)
  * [The ERNIE 1.0 is available](#the-ernie-10-is-available-for-chinese-generation-tasks)
- [Citation](#citation)
For a technical description of the algorithm, please refer to our paper:
>[_**ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation**_](https://arxiv.org/abs/2001.11314.pdf)
>Dongling Xiao\*, Han Zhang\*, Yukun Li, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang (\* : equal contribution)
>Preprint January 2020
>Accepted by **IJCAI-2020**
![ERNIE-GEN](https://img.shields.io/badge/Pretraining-Generation-green) ![Gigaword](https://img.shields.io/badge/Abstractive%20Summarization-Gigaword-yellow) ![CNN/Daily Mail](https://img.shields.io/badge/Abstractive%20Summarization-CNN%2FDaily%20Mail-blue) ![SQuAD](https://img.shields.io/badge/Question%20Generation-SQuAD-green) ![Persona-Chat](https://img.shields.io/badge/Dialogue%20Response-Persona--Chat-yellowgreen) ![CoQA](https://img.shields.io/badge/Generative%20Question%20Answering-CoQA-orange)
---
**[ERNIE-GEN](https://arxiv.org/abs/2001.11314.pdf) is a multi-flow language generation framework for both pre-training and fine-tuning.** We propose a novel **span-by-span generation** pre-training task that enables the model to **generate a semantically complete span** at each step rather than a single word, in light of the fact that entities and phrases in human writing are organized in a coherent manner. An **infilling generation mechanism** and a **noise-aware generation method** are incorporated into both pre-training and fine-tuning to alleviate **the problem of exposure bias**. In the pre-training phase, ERNIE-GEN also adopts a **multi-granularity target fragments sampling** strategy that forces the decoder to rely more on the encoder representations than on the previously generated words, enhancing the correlation between encoder and decoder.
## Proposed Generation Framework
We introduce three novel methods to enhance the language generation ability:
- **Span-by-span Generation Pre-training Task**: to enable the model to generate a semantically complete span at each step rather than a single word.
- **Infilling Generation and Noise-aware Generation**: to alleviate the problem of exposure bias.
- **Multi-Granularity Target Fragments**: to enhance the correlation between encoder and decoder during pre-training.
Specifically, the span-by-span generation task and the word-by-word generation task, both based on the infilling generation mechanism, are implemented by a carefully designed **Multi-Flow Attention** architecture, as shown below.
![multi-flow-attention](.meta/multi-flow-attention.png)
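To make the infilling-generation idea concrete, here is a minimal NumPy sketch of a **word-by-word infilling** attention mask. It is an illustration only, not the repository's implementation: the `[source | target | infilling slots]` layout, the helper name `infilling_attention_mask`, and the exact visibility rules are simplifying assumptions (the real multi-flow masks also implement the span-by-span flow and are built inside the Paddle training graph).
```python
import numpy as np

def infilling_attention_mask(src_len, tgt_len):
    """Toy mask for word-by-word infilling generation (1 = may attend).

    Sequence layout: [source s_1..s_S | target y_1..y_T | slots m_1..m_T].
    Slot m_t predicts y_t while seeing only the source and y_1..y_{t-1},
    so prediction never conditions directly on the word being emitted.
    """
    n = src_len + 2 * tgt_len
    mask = np.zeros((n, n), dtype="float32")
    mask[:src_len, :src_len] = 1.0          # source flow: fully bidirectional
    for t in range(tgt_len):
        y = src_len + t                     # position of y_{t+1} (0-based t)
        m = src_len + tgt_len + t           # its infilling slot
        mask[y, :src_len] = 1.0             # y sees the whole source ...
        mask[y, src_len:y + 1] = 1.0        # ... and y_1..y_{t+1} (causal)
        mask[m, :src_len] = 1.0             # slot sees the whole source ...
        mask[m, src_len:y] = 1.0            # ... and only y_1..y_t
        mask[m, m] = 1.0                    # ... plus its own position
    return mask

print(infilling_attention_mask(src_len=3, tgt_len=2))
```
Under noise-aware generation, some of the target words fed through such a mask are additionally replaced with random tokens during training, so the model learns not to over-trust its previously generated (possibly erroneous) words.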
## Pre-trained Models
We release the checkpoints for the **ERNIE-GEN _base_** and **ERNIE-GEN _large_** models, both pre-trained on English Wikipedia and [BookCorpus](https://arxiv.org/abs/1506.06724) (16GB in total). In addition, an **ERNIE-GEN _large_** model pre-trained on the 160GB corpus used by [RoBERTa](https://arxiv.org/abs/1907.11692) and [BART](https://arxiv.org/abs/1910.13461) is available as well.
- [**ERNIE-GEN _base_**](https://ernie.bj.bcebos.com/ernie_gen_base.tgz) (_lowercased | 12-layer, 768-hidden, 12-heads, 110M parameters_)
- [**ERNIE-GEN _large_**](https://ernie.bj.bcebos.com/ernie_gen_large.tgz) (_lowercased | 24-layer, 1024-hidden, 16-heads, 340M parameters_)
- [**ERNIE-GEN _large with 160G_**](https://ernie.bj.bcebos.com/ernie_gen_large_160g.tgz) (_lowercased | 24-layer, 1024-hidden, 16-heads, 340M parameters_)
## Fine-tuning on Downstream Tasks
We compare the performance of [ERNIE-GEN](https://arxiv.org/pdf/2001.11314.pdf) with existing SOTA pre-training models for natural language generation ([UniLM](https://arxiv.org/abs/1905.03197), [MASS](https://arxiv.org/abs/1905.02450), [PEGASUS](https://arxiv.org/abs/1912.08777), [BART](https://arxiv.org/abs/1910.13461) and [T5](https://arxiv.org/abs/1910.10683)) on 5 generation tasks: abstractive summarization (**_Gigaword_** and **_CNN/DailyMail_**), question generation (**_SQuAD_**), dialogue generation (**_Persona-Chat_**) and generative question answering (**_CoQA_**).
### Abstractive Summarization
- _**Gigaword**_
The results on Gigaword-10k (a 10K-example subset of Gigaword) are presented as follows:
| Model | <strong>Data / Params</strong> | <strong>Rouge-1</strong> | <strong>Rouge-2</strong> | <strong>Rouge-L</strong> |
| :-------------------------------------------------------- | :----------------------------: | :----------------------: | :----------------------: | :----------------------: |
| UniLM | 16G / 340M | 34.21 | 15.28 | 31.54 |
| **ERNIE-GEN** _base_ | 16G / 110M | 33.75 | 15.23 | 31.35 |
| **ERNIE-GEN** _large_ | 16G / 340M | 35.05 | 16.10 | 32.50 |
| **ERNIE-GEN** _large_ (160G) | 160G / 340M | **35.51** | **16.79** | **33.23** |
The results on Gigaword are presented as follows:
| Model | <strong>Data / Params</strong> | <strong>Rouge-1</strong> | <strong>Rouge-2</strong> | <strong>Rouge-L</strong> |
| :-------------------------------------------------------- | :----------------------------: | :----------------------: | :----------------------: | :----------------------: |
| MASS | 18G / 160M | 38.73 | 19.71 | 35.96 |
| [BERTSHARE](https://arxiv.org/abs/1907.12461) | 16G / 110M | 38.13 | 19.81 | 35.62 |
| UniLM | 16G / 340M | 38.45 | 19.45 | 35.75 |
| PEGASUS (_C4_) | 750G / 568M | 38.75 | 19.96 | 36.14 |
| PEGASUS (_HugeNews_) | 3.8T / 568M | 39.12 | 19.86 | 36.24 |
| **ERNIE-GEN** _base_ | 16G / 110M | 38.83 | 20.04 | 36.20 |
| **ERNIE-GEN** _large_ | 16G / 340M | 39.25 | 20.25 | 36.53 |
| **ERNIE-GEN** _large_ (160G) | 160G / 340M | **39.46** | **20.34** | **36.74** |
We preprocess the raw Gigaword dataset following UniLM; the preprocessed data is available here: [Gigaword](https://ernie.bj.bcebos.com/gigaword.tgz).
- _**CNN/Daily Mail**_
The results on CNN/Daily Mail are presented as follows:
| <strong>Model</strong> | Data / Params | <strong>Rouge-1</strong> | <strong>Rouge-2</strong> | <strong>Rouge-L</strong> |
| :-------------------------------------------------------- | :-----------: | :----------------------: | :----------------------: | :----------------------: |
| MASS | 18G / 160M | 42.12 | 19.50 | 39.01 |
| UniLM | 16G / 340M | 43.33 | 20.21 | 40.51 |
| T5 _large_ | 750G / 340M | 42.50 | 20.68 | 39.75 |
| T5 _xlarge_ | 750G / 11B | 43.52 | **21.55** | 40.69 |
| BART | 160G / 400M | 44.16 | 21.28 | 40.90 |
| PEGASUS (_C4_) | 750G / 568M | 43.90 | 21.20 | 40.76 |
| PEGASUS (_HugeNews_) | 3.8T / 568M | 44.17 | 21.47 | 41.11 |
| **ERNIE-GEN** _base_ | 16G / 110M | 42.30 | 19.92 | 39.68 |
| **ERNIE-GEN** _large_ | 16G / 340M | 44.02 | 21.17 | 41.26 |
| **ERNIE-GEN** _large_ (160G) | 160G / 340M | **44.31** | 21.35 | **41.60** |
We preprocess the raw CNN/Daily Mail dataset following UniLM; the preprocessed data is available here: [CNN/Daily Mail](https://ernie.bj.bcebos.com/cnndm.tgz).
### Question Generation
- _**SQuAD**_
The results on the [SQuAD 1.1](https://arxiv.org/abs/1606.05250) dataset, following the data split in [[Du et al., 2017]](https://arxiv.org/pdf/1705.00106.pdf), are presented as follows:
| Model | <strong>BLEU-4</strong> | <strong>METEOR</strong> | <strong>Rouge-L</strong> |
| :----------------------------------------------------------- | :----------------------: | :----------------------: | :----------------------: |
| [SemQG](https://arxiv.org/abs/1909.06356) | 18.37 | 22.65 | 46.68 |
| UniLM _large_ (beam size=1) | 22.12 | 25.06 | 51.07 |
| **ERNIE-GEN** _base_ (beam size=1) | 22.28 | 25.13 | 50.38 |
| **ERNIE-GEN** _large_ (beam size=1) | 24.03 | 26.31 | 52.36 |
| **ERNIE-GEN** _large_ (beam size=5) | 25.40 | **26.92** | 52.84 |
| **ERNIE-GEN** _large_ (beam size=5) + (160G) | **25.41** | 26.77 | **52.91** |
The results following the reversed dev-test data split in [[Zhao et al., 2018]](https://www.aclweb.org/anthology/D18-1424/) are presented as follows:
| Model | <strong>BLEU-4</strong> | <strong>METEOR</strong> | <strong>Rouge-L</strong> |
| :----------------------------------------------------------- | :----------------------: | :----------------------: | :----------------------: |
| SemQG | 20.76 | 24.20 | 48.91 |
| UniLM _large_ (beam size=1) | 23.75 | 25.61 | 52.04 |
| **ERNIE-GEN** _base_ (beam size=1) | 23.52 | 25.61 | 51.45 |
| **ERNIE-GEN** _large_ (beam size=1) | 25.57 | 26.89 | 53.31 |
| **ERNIE-GEN** _large_ (beam size=5) | 26.95 | **27.57** | 53.77 |
| **ERNIE-GEN** _large_ (beam size=5) + (160G) | **27.05** | 27.43 | **53.83** |
*_Note that we also report results with a larger beam size of 5._
The preprocessed data for the question generation task can be downloaded from [SQuAD](https://ernie.bj.bcebos.com/squad_qg.tgz).
### Generative Dialogue Response
- _**Persona-Chat**_
Comparison with the current state-of-the-art results on the multi-turn conversation task ([Persona-Chat](https://arxiv.org/abs/1801.07243)) is presented as follows:
| Model | <strong>BLEU-1</strong> | <strong>BLEU-2</strong> | <strong>Distinct-1</strong> | <strong>Distinct-2</strong> |
| :-------------------------------------------------------- | :---------------------: | :---------------------: | :-------------------------: | :---------------------------: |
| [LIC](https://arxiv.org/abs/1910.07931) | 40.5 | 32.0 | 0.019 | 0.113 |
| [PLATO](https://arxiv.org/abs/1910.07931) | 45.8 | 35.7 | 0.012 | 0.064 |
| PLATO _w/o latent_ | 40.6 | 31.5 | 0.021 | 0.121 |
| **ERNIE-GEN** _large_ | **46.8** | **36.4** | **0.023** | **0.168** |
The training data can be downloaded from [Persona-Chat](https://ernie.bj.bcebos.com/persona_chat.tgz).
### Generative Question Answering
- _**CoQA**_
Results on the CoQA development set are presented as follows:
| Model | F1-score |
| :-------------------------------------------------------- | :------: |
| [Seq2Seq](https://arxiv.org/abs/1910.07931) | 27.5 |
| [PGNet](https://arxiv.org/abs/1910.07931) | 45.4 |
| UniLM _large_ | 82.5 |
| **ERNIE-GEN** _large_ | **84.5** |
We preprocess the raw [CoQA](https://arxiv.org/abs/1808.07042) dataset; the preprocessed data is available here: [CoQA-preprocessed](https://ernie.bj.bcebos.com/coqa.tgz).
Finally, we also compare with the concurrent work [ProphetNet](https://arxiv.org/abs/2001.04063); the fine-tuning results on Gigaword, CNN/Daily Mail and SQuAD are reported as follows:
- _**Abstractive Summarization**_
| Model / Task | <strong>Data / Params</strong> | <strong>Gigaword</strong> |<strong>CNN/Daily Mail</strong>|
| :-------------------------------------------------------- | :----------------------------: | :----------------------: | :----------------------: |
| Metric | - | <strong>Rouge-1 / Rouge-2 / Rouge-L</strong> |<strong>Rouge-1 / Rouge-2 / Rouge-L</strong>|
| **ProphetNet** _large_ (160G) | 160G / 340M | **39.51** / **20.42** / 36.69 |44.20 / 21.17 / 41.30|
| **ERNIE-GEN** _large_ (160G) | 160G / 340M | 39.46 / 20.34 / **36.74** |**44.31** / **21.35** / **41.60**|
- _**Question Generation**_
| Model | <strong>Data / Params</strong> | <strong>BLEU-4 / METEOR / Rouge-L</strong> |<strong>BLEU-4 / METEOR / Rouge-L</strong>|
| :-------------------------------------------------------- | :----------------------------: | :----------------------: |:----------------------: |
| Data split | - | <strong>Original</strong> |<strong>Reversed dev-test</strong>|
| **ProphetNet** _large_ (16G) | 16G / 340M | 25.01 / 26.83 / 52.57 |26.72 / **27.64** / **53.79** |
| **ERNIE-GEN** _large_ (16G) | 16G / 340M | **25.40** / **26.92** / **52.84** |**26.95** / 27.57 / 53.77|
## Usage
### Install PaddlePaddle
This codebase has been tested with Paddle Fluid 1.7 on Python 2.7. The other dependencies of ERNIE-GEN are listed in `requirements.txt`; you can install them with
```script
pip install -r requirements.txt
```
### Fine-tuning
Please add the CUDA, cuDNN and NCCL2 library paths to `LD_LIBRARY_PATH` before running ERNIE-GEN. We provide the parameter configurations for the downstream tasks above in `configs/`, so you can run fine-tuning straight from these configuration files. For example, you can fine-tune the ERNIE-GEN base model on Gigaword with
```script
MODEL="base" # base or large or large_160g
TASK="gigaword" # cnndm, coqa, gigaword, squad_qg or persona-chat
sh run_seq2seq.sh ./configs/${MODEL}/${TASK}_conf
```
The training log and evaluation results are written to `log/job.log.0`. To fine-tune on your own task data, prepare it in the data format we provide, as illustrated below.
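For illustration only, a sequence-to-sequence training file such as `train.tsv` can be assumed to hold one tokenized example per line, with the source and target separated by a tab; the downloadable task data above is the authoritative reference, and the dialogue/QA tasks carry additional role and turn structure:
```
thousands of people gathered in the capital on monday to protest ...	crowds gather in capital to protest
```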
Our fine-tuning experiments are carried out on 8 NVIDIA V100 (32GB) GPUs. If your GPU memory is not enough, you can reduce the batch size in the corresponding configuration file.
**NOTICE:** The actual total batch size equals `configured batch size * number of GPUs used` (e.g., a configured batch size of 16 on 8 GPUs yields an effective batch size of 128).
### Employ Dynamic Computation Graph
The dynamic-graph version of the ERNIE-GEN code is more concise and flexible; please refer to [ERNIE-GEN Dygraph](https://github.com/PaddlePaddle/ERNIE/tree/develop/experimental/seq2seq) for details.
### The ERNIE 1.0 is available for Chinese Generation Tasks
The ERNIE-GEN code is compatible with the [ERNIE 1.0](https://ernie.bj.bcebos.com/ERNIE_1.0_max-len-512.tar.gz) model. After specifying the model- and data-related parameters in a configuration file, you can use ERNIE 1.0 to fine-tune on Chinese generation tasks.
## Citation
You can cite the paper as below:
```
@article{xiao2020ernie-gen,
title={ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation},
author={Xiao, Dongling and Zhang, Han and Li, Yukun and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2001.11314},
year={2020}
}
```
[English](./README.md) | 简体中文
## _ERNIE-GEN_: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation
- [Model Framework](#model-framework)
- [Pre-trained Models](#pre-trained-models)
- [Fine-tuning Tasks](#fine-tuning-tasks)
  * [Abstractive Summarization](#abstractive-summarization)
  * [Question Generation](#question-generation)
  * [Multi-turn Dialogue](#multi-turn-dialogue)
  * [Generative Multi-turn QA](#generative-multi-turn-qa)
- [Usage](#usage)
  * [Install PaddlePaddle](#install-paddlepaddle)
  * [Run Fine-tuning](#run-fine-tuning)
  * [Employ Dynamic Computation Graph](#employ-dynamic-computation-graph)
  * [Using ERNIE 1.0 for Chinese Generation Tasks](#using-ernie-10-for-chinese-generation-tasks)
- [Citation](#citation)
For a detailed description of the algorithm, please refer to our paper:
>[_**ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation**_](https://arxiv.org/abs/2001.11314)
>Dongling Xiao\*, Han Zhang\*, Yukun Li, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang (\* : equal contribution)
>Preprint January 2020
>Accepted by **IJCAI-2020**
![ERNIE-GEN](https://img.shields.io/badge/Pretraining-Generation-green) ![Gigaword](https://img.shields.io/badge/Abstractive%20Summarization-Gigaword-yellow) ![CNN/Daily Mail](https://img.shields.io/badge/Abstractive%20Summarization-CNN%2FDaily%20Mail-blue) ![SQuAD](https://img.shields.io/badge/Question%20Generation-SQuAD-green) ![Persona-Chat](https://img.shields.io/badge/Multi--turn%20Dialogue-Persona--Chat-yellowgreen) ![CoQA](https://img.shields.io/badge/Multi--turn%20QA-CoQA-orange)
---
**ERNIE-GEN is a pre-training and fine-tuning framework oriented to generation tasks.** It is the first to add a **span-by-span generation** task in the pre-training stage, enabling the model to generate a semantically complete span at each step. In both pre-training and fine-tuning, an **infilling generation mechanism** and a **noise-aware generation method** are used to alleviate the exposure bias problem. In addition, ERNIE-GEN adopts a **multi-fragment, multi-granularity target text sampling** strategy, which strengthens the correlation between source and target texts and enhances the interaction between encoder and decoder.
## Model Framework
We propose three methods to enhance language generation ability:
- **Span-by-span generation task**: enables the model to generate a semantically complete span at each step.
- **Infilling generation** and **noise-aware generation**: alleviate the exposure bias problem.
- **Multi-fragment, multi-granularity target text sampling**: enhances the interaction between encoder and decoder during pre-training.
Based on the Transformer, we design a **Multi-Flow Attention** architecture to implement span-by-span infilling generation.
![multi-flow-attention](.meta/multi-flow-attention.png)
## Pre-trained Models
We release the **ERNIE-GEN _base_** and **ERNIE-GEN _large_** models, pre-trained on English Wikipedia and BookCorpus (16GB in total). We also release an **ERNIE-GEN _large_** model pre-trained on a 160GB corpus, the same data used to pre-train [RoBERTa](https://arxiv.org/abs/1907.11692) and [BART](https://arxiv.org/abs/1910.13461).
- [**ERNIE-GEN _base_**](https://ernie.bj.bcebos.com/ernie_gen_base.tgz) (_lowercased | 12-layer, 768-hidden, 12-heads, 110M parameters_)
- [**ERNIE-GEN _large_**](https://ernie.bj.bcebos.com/ernie_gen_large.tgz) (_lowercased | 24-layer, 1024-hidden, 16-heads, 340M parameters_)
- [**ERNIE-GEN _large with 160G_**](https://ernie.bj.bcebos.com/ernie_gen_large_160g.tgz) (_lowercased | 24-layer, 1024-hidden, 16-heads, 340M parameters_)
## Fine-tuning Tasks
We compare ERNIE-GEN with the current best-performing pre-trained generation models ([UniLM](https://arxiv.org/abs/1905.03197), [MASS](https://arxiv.org/abs/1905.02450), [PEGASUS](https://arxiv.org/abs/1912.08777), [BART](https://arxiv.org/abs/1910.13461), [T5](https://arxiv.org/abs/1910.10683), etc.) on five typical generation tasks: abstractive summarization (Gigaword and CNN/DailyMail), question generation (SQuAD), multi-turn dialogue (Persona-Chat) and generative multi-turn question answering (CoQA).
### Abstractive Summarization
- _**Gigaword**_
Results on Gigaword-10k (a 10K-example subset of Gigaword):
| Model | <strong>Data / Params</strong> | <strong>Rouge-1</strong> | <strong>Rouge-2</strong> | <strong>Rouge-L</strong> |
| :-------------------------------------------------------- | :------------------------------: | :----------------------: | :----------------------: | :----------------------: |
| UniLM | 16G / 340M | 34.21 | 15.28 | 31.54 |
| **ERNIE-GEN** _base_ | 16G / 110M | 33.75 | 15.23 | 31.35 |
| **ERNIE-GEN** _large_ | 16G / 340M | 35.05 | 16.10 | 32.50 |
| **ERNIE-GEN** _large_ (160G) | 160G / 340M | **35.51** | **16.79** | **33.23** |
Results on the full Gigaword dataset:
| Model | <strong>Data / Params</strong> | <strong>Rouge-1</strong> | <strong>Rouge-2</strong> | <strong>Rouge-L</strong> |
| :-------------------------------------------------------- | :----------------------------: | :----------------------: | :----------------------: | :----------------------: |
| MASS | 18G / 160M | 38.73 | 19.71 | 35.96 |
| [BERTSHARE](https://arxiv.org/abs/1907.12461) | 16G / 110M | 38.13 | 19.81 | 35.62 |
| UniLM | 16G / 340M | 38.45 | 19.45 | 35.75 |
| PEGASUS (_C4_) | 750G / 568M | 38.75 | 19.96 | 36.14 |
| PEGASUS (_HugeNews_) | 3.8T / 568M | 39.12 | 19.86 | 36.24 |
| **ERNIE-GEN** _base_ | 16G / 110M | 38.83 | 20.04 | 36.20 |
| **ERNIE-GEN** _large_ | 16G / 340M | 39.25 | 20.25 | 36.53 |
| **ERNIE-GEN** _large_ (160G) | 160G / 340M | **39.46** | **20.34** | **36.74** |
We preprocess the data following UniLM; it can be downloaded here: [Gigaword](https://ernie.bj.bcebos.com/gigaword.tgz).
- _**CNN/Daily Mail**_
Results on CNN/Daily Mail:
| <strong>Model</strong> | Data / Params | <strong>Rouge-1</strong> | <strong>Rouge-2</strong> | <strong>Rouge-L</strong> |
| :-------------------------------------------------------- | :-----------: | :----------------------: | :----------------------: | :----------------------: |
| MASS | 18G / 160M | 42.12 | 19.50 | 39.01 |
| UniLM | 16G / 340M | 43.33 | 20.21 | 40.51 |
| T5 _large_ | 750G / 340M | 42.50 | 20.68 | 39.75 |
| T5 _xlarge_ | 750G / 11B | 43.52 | **21.55** | 40.69 |
| BART | 160G / 400M | 44.16 | 21.28 | 40.90 |
| PEGASUS (_C4_) | 750G / 568M | 43.90 | 21.20 | 40.76 |
| PEGASUS (_HugeNews_) | 3.8T / 568M | 44.17 | 21.47 | 41.11 |
| **ERNIE-GEN** _base_ | 16G / 110M | 42.30 | 19.92 | 39.68 |
| **ERNIE-GEN** _large_ | 16G / 340M | 44.02 | 21.17 | 41.26 |
| **ERNIE-GEN** _large_ (160G) | 160G / 340M | **44.31** | 21.35 | **41.60** |
We preprocess the data following UniLM; it can be downloaded here: [CNN/Daily Mail](https://ernie.bj.bcebos.com/cnndm.tgz).
### Question Generation
- _**SQuAD**_
Results on the SQuAD 1.1 dataset (test split following [[Du et al., 2017]](https://arxiv.org/abs/1705.00106)):
| Model | <strong>BLEU-4</strong> | <strong>METEOR</strong> | <strong>Rouge-L</strong> |
| :----------------------------------------------------------- | :----------------------: | :----------------------: | :----------------------: |
| [SemQG](https://arxiv.org/abs/1909.06356) | 18.37 | 22.65 | 46.68 |
| UniLM _large_ (beam size=1) | 22.12 | 25.06 | 51.07 |
| **ERNIE-GEN** _base_ (beam size=1) | 22.28 | 25.13 | 50.38 |
| **ERNIE-GEN** _large_ (beam size=1) | 24.03 | 26.31 | 52.36 |
| **ERNIE-GEN** _large_ (beam size=5) | 25.40 | **26.92** | 52.84 |
| **ERNIE-GEN** _large_ (beam size=5) + (160G) | **25.41** | 26.77 | **52.91** |
Results with the dev and test sets reversed, following [[Zhao et al., 2018]](https://www.aclweb.org/anthology/D18-1424/):
| Model | <strong>BLEU-4</strong> | <strong>METEOR</strong> | <strong>Rouge-L</strong> |
| :----------------------------------------------------------- | :----------------------: | :----------------------: | :----------------------: |
| [SemQG](https://arxiv.org/abs/1909.06356) | 20.76 | 24.20 | 48.91 |
| UniLM _large_ (beam size=1) | 23.75 | 25.61 | 52.04 |
| **ERNIE-GEN** _base_ (beam size=1) | 23.52 | 25.61 | 51.45 |
| **ERNIE-GEN** _large_ (beam size=1) | 25.57 | 26.89 | 53.31 |
| **ERNIE-GEN** _large_ (beam size=5) | 26.95 | **27.57** | 53.77 |
| **ERNIE-GEN** _large_ (beam size=5) + (160G) | **27.05** | 27.43 | **53.83** |
*_We also report results with the beam size increased to 5._
We preprocess the data following UniLM; it can be downloaded here: [SQuAD](https://ernie.bj.bcebos.com/squad_qg.tgz).
### Multi-turn Dialogue
- _**Persona-Chat**_
| Model | <strong>BLEU-1</strong> | <strong>BLEU-2</strong> | <strong>Distinct-1</strong> | <strong>Distinct-2</strong> |
| :-------------------------------------------------------- | :---------------------: | :---------------------: | :-------------------------: | :---------------------------: |
| [LIC](https://arxiv.org/abs/1910.07931) | 40.5 | 32.0 | 0.019 | 0.113 |
| [PLATO](https://arxiv.org/abs/1910.07931) | 45.8 | 35.7 | 0.012 | 0.064 |
| [PLATO](https://arxiv.org/abs/1910.07931) _w/o latent_ | 40.6 | 31.5 | 0.021 | 0.121 |
| **ERNIE-GEN** _large_ | **46.8** | **36.4** | **0.023** | **0.168** |
Our preprocessed data can be downloaded here: [Persona-Chat](https://ernie.bj.bcebos.com/persona_chat.tgz).
### Generative Multi-turn QA
- _**CoQA**_
Results on the CoQA dev set:
| Model | F1-score |
| :-------------------------------------------------------- | :------: |
| [Seq2Seq](https://arxiv.org/abs/1910.07931) | 27.5 |
| [PGNet](https://arxiv.org/abs/1910.07931) | 45.4 |
| UniLM _large_ | 82.5 |
| **ERNIE-GEN** _large_ | **84.5** |
We preprocess the raw CoQA dataset; it can be downloaded here: [CoQA](https://ernie.bj.bcebos.com/coqa.tgz).
In addition, we compare with the concurrent work [ProphetNet](https://arxiv.org/abs/2001.04063) on Gigaword, CNN/Daily Mail and SQuAD:
- _**Abstractive Summarization**_
| Model / Task | <strong>Data / Params</strong> | <strong>Gigaword</strong> |<strong>CNN/Daily Mail</strong>|
| :-------------------------------------------------------- | :------------------------------: | :----------------------: | :----------------------: |
| Metric | - | <strong>Rouge-1 / Rouge-2 / Rouge-L</strong> |<strong>Rouge-1 / Rouge-2 / Rouge-L</strong>|
| ProphetNet _large_ (160G) | 160G / 340M | **39.51** / **20.42** / 36.69 |44.20 / 21.17 / 41.30|
| **ERNIE-GEN** _large_ (160G) | 160G / 340M | 39.46 / 20.34 / **36.74** |**44.31** / **21.35** / **41.60**|
- _**Question Generation**_
| Model | <strong>Data / Params</strong> | <strong>BLEU-4 / METEOR / Rouge-L</strong> |<strong>BLEU-4 / METEOR / Rouge-L</strong>|
| :-------------------------------------------------------- | :------------------------------: | :----------------------: |:----------------------: |
| Data split | - | <strong>Original</strong> |<strong>Reversed dev-test</strong>|
| **ProphetNet** _large_ (16G) | 16G / 340M | 25.01 / 26.83 / 52.57 |26.72 / **27.64** / **53.79** |
| **ERNIE-GEN** _large_ (16G) | 16G / 340M | **25.40** / **26.92** / **52.84** |**26.95** / 27.57 / 53.77|
## Usage
### Install PaddlePaddle
Our code has been tested with Paddle Fluid 1.7 and Python 2.7. The other dependencies of ERNIE-GEN are listed in `requirements.txt` and can be installed with:
```script
pip install -r requirements.txt
```
### Run Fine-tuning
Before running ERNIE-GEN, add the CUDA, cuDNN and NCCL2 dynamic-library paths to LD_LIBRARY_PATH. We provide the parameter configuration files for the downstream tasks in `configs/`, so fine-tuning can be launched simply through these files. For example, fine-tune the ERNIE-GEN base model on Gigaword with:
```script
MODEL="base" # base or large or large_160g
TASK="gigaword" # cnndm, coqa, gigaword, squad_qg or persona-chat
sh run_seq2seq.sh ./configs/${MODEL}/${TASK}_conf
```
Training and evaluation logs are written to `log/job.log.0`. To fine-tune on your own dataset, process your data following the format we provide.
Our fine-tuning experiments run on 8 NVIDIA V100 GPUs (32GB each); if your GPU memory is insufficient, reduce the batch size in the corresponding configuration file.
**Note**: the actual batch size during training equals `configured batch size * number of GPU cards`.
### Employ Dynamic Computation Graph
The dynamic-graph version of ERNIE-GEN is more concise and flexible; see [ERNIE-GEN Dygraph](https://github.com/PaddlePaddle/ERNIE/tree/develop/experimental/seq2seq) for usage.
### Using ERNIE 1.0 for Chinese Generation Tasks
The ERNIE-GEN code is compatible with the [ERNIE 1.0](https://ernie.bj.bcebos.com/ERNIE_1.0_max-len-512.tar.gz) model. After setting the model- and data-related options in a configuration file, you can fine-tune ERNIE 1.0 on Chinese generation tasks.
## Citation
Please cite our paper as below:
```
@article{xiao2020ernie-gen,
title={ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for Natural Language Generation},
author={Xiao, Dongling and Zhang, Han and Li, Yukun and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
journal={arXiv preprint arXiv:2001.11314},
year={2020}
}
```
#load model
vocab_path="ernie_gen_base/vocab.txt"
config_path="ernie_gen_base/ernie_config.json"
init_model="ernie_gen_base/params"
#input
max_src_len=640
max_tgt_len=192
tokenized_input="true"
continuous_position="true"
batch_size=8
in_tokens="false"
tgt_type_id=3
#decode
do_decode="true"
max_dec_len=128
beam_size=5
length_penalty=1.0
use_multi_gpu_test="true"
#train
epoch=30
weight_decay=0.01
label_smooth=0.1
hidden_dropout_prob=0.1
save_and_valid_by_epoch="true"
#lr
warmup_proportion=0.1
lr_scheduler="linear_warmup_decay"
learning_rate=5e-5
#noise
random_noise="true"
noise_prob=0.7
#dataset
data_path="./datasets/cnndm/"
train_set="train.tsv"
dev_set="dev.2k.tsv"
pred_set="test.tsv"
do_train="true"
do_val="true"
do_test="false"
do_pred="true"
#evaluate
eval_script="sh ./eval/tasks/cnndm/eval.sh"
eval_mertrics="rouge-1,rouge-2,rouge-l"
#load model
vocab_path="ernie_gen_base/vocab.txt"
config_path="ernie_gen_base/ernie_config.json"
init_model="ernie_gen_base/params"
#input
max_src_len=192
max_tgt_len=64
tokenized_input="true"
continuous_position="true"
batch_size=16
in_tokens="false"
tgt_type_id=3
#decode
do_decode="true"
max_dec_len=32
beam_size=5
length_penalty=0.6
use_multi_gpu_test="true"
#train
epoch=30
weight_decay=0.01
label_smooth=0.1
save_and_valid_by_epoch="true"
hidden_dropout_prob=0.1
#lr
warmup_proportion=0.1
lr_scheduler="linear_warmup_decay"
learning_rate=1.25e-5
#noise
random_noise="true"
noise_prob=0.5
#dataset
data_path="./datasets/gigaword/"
train_set="train.10k.tsv"
dev_set="dev.20k.tsv"
test_set="test.tsv"
do_train="true"
do_val="true"
do_test="true"
do_pred="false"
#evaluate
eval_script="sh ./eval/tasks/gigaword/eval.sh"
eval_mertrics="rouge-1,rouge-2,rouge-l"
#load model
vocab_path="ernie_gen_base/vocab.txt"
config_path="ernie_gen_base/ernie_config.json"
init_model="ernie_gen_base/params"
#input
max_src_len=192
max_tgt_len=64
tokenized_input="true"
continuous_position="true"
batch_size=16
in_tokens="false"
tgt_type_id=3
#decode
do_decode="true"
max_dec_len=32
beam_size=5
length_penalty=0.6
use_multi_gpu_test="true"
#train
epoch=10
weight_decay=0.01
label_smooth=0.1
hidden_dropout_prob=0.1
save_and_valid_by_epoch="true"
#lr
warmup_proportion=0.1
lr_scheduler="linear_warmup_decay"
learning_rate=3e-5
#noise
random_noise="true"
noise_prob=0.5
#dataset
data_path="./datasets/gigaword/"
train_set="train.tsv"
dev_set="dev.20k.tsv"
test_set="test.tsv"
do_train="true"
do_val="true"
do_test="true"
do_pred="false"
#evaluate
eval_script="sh ./eval/tasks/gigaword/eval.sh"
eval_mertrics="rouge-1,rouge-2,rouge-l"
#load model
vocab_path="ernie_gen_base/vocab.txt"
config_path="ernie_gen_base/ernie_config.json"
init_model="ernie_gen_base/params"
#input
max_src_len=512
max_tgt_len=96
tokenized_input="true"
continuous_position="true"
batch_size=4
in_tokens="false"
tgt_type_id=3
#decode
do_decode="true"
max_dec_len=48
beam_size=5
length_penalty=1.0
use_multi_gpu_test="true"
#train
epoch=10
weight_decay=0.01
label_smooth=0.1
hidden_dropout_prob=0.1
save_and_valid_by_epoch="true"
#lr
warmup_proportion=0.1
lr_scheduler="linear_warmup_decay"
learning_rate=2.5e-5
#noise
random_noise="true"
noise_prob=0.7
#dataset
data_path="./datasets/squad_qg/"
train_set="train.tsv"
dev_set="dev.tsv"
test_set="test.tsv"
do_train="true"
do_val="true"
do_test="true"
#evaluate
eval_script="sh ./eval/tasks/squad_qg/eval.sh"
eval_mertrics="Bleu_4,METEOR,ROUGE_L"
#load model
vocab_path="ernie_gen_large/vocab.txt"
config_path="ernie_gen_large/ernie_config.json"
init_model="ernie_gen_large/params"
#input
max_src_len=640
max_tgt_len=192
tokenized_input="true"
continuous_position="true"
batch_size=4
in_tokens="false"
tgt_type_id=3
#decode
do_decode="true"
max_dec_len=128
beam_size=5
length_penalty=1.0
use_multi_gpu_test="true"
#train
epoch=20
weight_decay=0.01
label_smooth=0.1
hidden_dropout_prob=0.1
save_and_valid_by_epoch="true"
#lr
warmup_proportion=0.1
lr_scheduler="linear_warmup_decay"
learning_rate=4e-5
#noise
random_noise="true"
noise_prob=0.7
#dataset
data_path="./datasets/cnndm/"
train_set="train.tsv"
dev_set="dev.2k.tsv"
pred_set="test.tsv"
do_train="true"
do_val="true"
do_test="false"
do_pred="true"
#evaluate
eval_script="sh ./eval/tasks/cnndm/eval.sh"
eval_mertrics="rouge-1,rouge-2,rouge-l"
#load model
vocab_path="ernie_gen_large/vocab.txt"
config_path="ernie_gen_large/ernie_config.json"
init_model="ernie_gen_large/params"
#for multi-turn dialog/qa
task_type="dialog"
role_type_size=3
turn_type_size=16
#input
max_src_len=480
max_tgt_len=32
tokenized_input="true"
continuous_position="true"
batch_size=4
in_tokens="false"
#tgt_type_id=1
#decode
do_decode="true"
max_dec_len=30
beam_size=3
length_penalty=0.0
use_multi_gpu_test="true"
#train
epoch=10
weight_decay=0.01
label_smooth=0.1
hidden_dropout_prob=0.1
save_and_valid_by_epoch="true"
#lr
warmup_proportion=0.1
lr_scheduler="linear_warmup_decay"
learning_rate=1e-5
#noise
random_noise="false"
noise_prob=0.5
#dataset
data_path="./datasets/coqa/"
train_set="train.tsv"
dev_set="dev.tsv"
do_train="true"
do_val="true"
do_test="false"
do_pred="false"
#evaluate
eval_script="sh ./eval/tasks/coqa/eval.sh"
eval_mertrics="f1"
#load model
vocab_path="ernie_gen_large/vocab.txt"
config_path="ernie_gen_large/ernie_config.json"
init_model="ernie_gen_large/params"
#input
max_src_len=192
max_tgt_len=64
tokenized_input="true"
continuous_position="true"
batch_size=16
in_tokens="false"
tgt_type_id=3
#decode
do_decode="true"
max_dec_len=32
beam_size=5
length_penalty=0.6
use_multi_gpu_test="true"
#train
epoch=30
weight_decay=0.01
label_smooth=0.1
save_and_valid_by_epoch="true"
hidden_dropout_prob=0.1
#lr
warmup_proportion=0.1
lr_scheduler="linear_warmup_decay"
learning_rate=1e-5
#noise
random_noise="true"
noise_prob=0.7
#dataset
data_path="./datasets/gigaword/"
train_set="train.10k.tsv"
dev_set="dev.20k.tsv"
test_set="test.tsv"
do_train="true"
do_val="true"
do_test="true"
do_pred="false"
#evaluate
eval_script="sh ./eval/tasks/gigaword/eval.sh"
eval_mertrics="rouge-1,rouge-2,rouge-l"
#load model
vocab_path="ernie_gen_large/vocab.txt"
config_path="ernie_gen_large/ernie_config.json"
init_model="ernie_gen_large/params"
#input
max_src_len=192
max_tgt_len=64
tokenized_input="true"
continuous_position="true"
batch_size=16
in_tokens="false"
tgt_type_id=3
#decode
do_decode="true"
max_dec_len=32
beam_size=5
length_penalty=0.6
use_multi_gpu_test="true"
#train
epoch=5
weight_decay=0.01
label_smooth=0.1
hidden_dropout_prob=0.2
save_and_valid_by_epoch="true"
#lr
warmup_proportion=0.1
lr_scheduler="linear_warmup_decay"
learning_rate=3e-5
#noise
random_noise="true"
noise_prob=0.6
#dataset
data_path="./datasets/gigaword/"
train_set="train.tsv"
dev_set="dev.20k.tsv"
test_set="test.tsv"
do_train="true"
do_val="true"
do_test="true"
do_pred="false"
#evaluate
eval_script="sh ./eval/tasks/gigaword/eval.sh"
eval_mertrics="rouge-1,rouge-2,rouge-l"
#load model
vocab_path="ernie_gen_large/vocab.txt"
config_path="ernie_gen_large/ernie_config.json"
init_model="ernie_gen_large/params"
#for multi-turn dialog/qa
task_type="dialog"
role_type_size=3
turn_type_size=16
#input
max_src_len=472
max_tgt_len=40
tokenized_input="true"
continuous_position="true"
batch_size=8
in_tokens="false"
#decode
do_decode="true"
max_dec_len=32
beam_size=10
length_penalty=1.3
use_multi_gpu_test="true"
#train
epoch=30
weight_decay=0.01
label_smooth=0.0
hidden_dropout_prob=0.1
save_and_valid_by_epoch="true"
#lr
warmup_proportion=0.1
lr_scheduler="linear_warmup_decay"
learning_rate=1e-4
#noise
random_noise="false"
noise_prob=0.0
#dataset
data_path="./datasets/persona_chat/"
train_set="train.tsv"
dev_set="dev.2k.tsv"
pred_set="test.tsv"
do_train="true"
do_val="true"
do_test="false"
do_pred="true"
do_decode="true"
#evaluate
eval_script="sh ./eval/tasks/persona_chat/eval.sh"
eval_mertrics="bleu_1,bleu_2,distinct_1,distinct_2"
#load model
vocab_path="ernie_gen_large/vocab.txt"
config_path="ernie_gen_large/ernie_config.json"
init_model="ernie_gen_large/params"
#input
max_src_len=512
max_tgt_len=96
tokenized_input="true"
continuous_position="true"
batch_size=4
in_tokens="false"
tgt_type_id=3
#decode
do_decode="true"
max_dec_len=48
beam_size=5
length_penalty=1.0
use_multi_gpu_test="true"
#train
epoch=10
weight_decay=0.01
label_smooth=0.1
hidden_dropout_prob=0.2
save_and_valid_by_epoch="true"
#lr
warmup_proportion=0.1
lr_scheduler="linear_warmup_decay"
learning_rate=1e-5
#noise
random_noise="true"
noise_prob=0.7
#dataset
data_path="./datasets/squad_qg/"
train_set="train.tsv"
dev_set="dev.tsv"
test_set="test.tsv"
do_train="true"
do_val="true"
do_test="true"
#evaluate
eval_script="sh ./eval/tasks/squad_qg/eval.sh"
eval_mertrics="Bleu_4,METEOR,ROUGE_L"
#load model
vocab_path="ernie_gen_large_160g/vocab.txt"
config_path="ernie_gen_large_160g/ernie_config.json"
init_model="ernie_gen_large_160g/params"
#input
max_src_len=640
max_tgt_len=192
tokenized_input="true"
continuous_position="true"
batch_size=4
in_tokens="false"
tgt_type_id=3
#decode
do_decode="true"
max_dec_len=128
beam_size=5
length_penalty=1.2
use_multi_gpu_test="true"
#train
epoch=17
weight_decay=0.01
label_smooth=0.1
hidden_dropout_prob=0.1
save_and_valid_by_epoch="true"
#lr
warmup_proportion=0.02
lr_scheduler="linear_warmup_decay"
learning_rate=4e-5
#noise
random_noise="true"
noise_prob=0.7
#dataset
data_path="./datasets/cnndm/"
train_set="train.tsv"
dev_set="dev.2k.tsv"
pred_set="test.tsv"
do_train="true"
do_val="true"
do_test="false"
do_pred="true"
#evaluate
eval_script="sh ./eval/tasks/cnndm/eval.sh"
eval_mertrics="rouge-1,rouge-2,rouge-l"
#load model
vocab_path="ernie_gen_large/vocab.txt"
config_path="ernie_gen_large/ernie_config.json"
init_model="ernie_gen_large/params"
#for multi-turn dialog/qa
task_type="dialog"
role_type_size=3
turn_type_size=16
#input
max_src_len=480
max_tgt_len=32
tokenized_input="true"
continuous_position="true"
batch_size=4
in_tokens="false"
#tgt_type_id=1
#decode
do_decode="true"
max_dec_len=30
beam_size=3
length_penalty=0.0
use_multi_gpu_test="true"
#train
epoch=10
weight_decay=0.01
label_smooth=0.1
hidden_dropout_prob=0.1
save_and_valid_by_epoch="true"
#lr
warmup_proportion=0.1
lr_scheduler="linear_warmup_decay"
learning_rate=1e-5
#noise
random_noise="false"
noise_prob=0.5
#dataset
data_path="./datasets/coqa/"
train_set="train.tsv"
dev_set="dev.tsv"
do_train="true"
do_val="true"
do_test="false"
do_pred="false"
#evaluate
eval_script="sh ./eval/tasks/coqa/eval.sh"
eval_mertrics="f1"
#load model
vocab_path="ernie_gen_large_160g/vocab.txt"
config_path="ernie_gen_large_160g/ernie_config.json"
init_model="ernie_gen_large_160g/params"
#input
max_src_len=192
max_tgt_len=64
tokenized_input="true"
continuous_position="true"
batch_size=16
in_tokens="false"
tgt_type_id=3
#decode
do_decode="true"
max_dec_len=32
beam_size=5
length_penalty=0.6
use_multi_gpu_test="true"
#train
epoch=30
weight_decay=0.01
label_smooth=0.1
save_and_valid_by_epoch="true"
hidden_dropout_prob=0.1
#lr
warmup_proportion=0.15
lr_scheduler="linear_warmup_decay"
learning_rate=7.5e-6
#noise
random_noise="true"
noise_prob=0.65
#dataset
data_path="./datasets/gigaword/"
train_set="train.10k.tsv"
dev_set="dev.20k.tsv"
test_set="test.tsv"
do_train="true"
do_val="true"
do_test="true"
do_pred="false"
#evaluate
eval_script="sh ./eval/tasks/gigaword/eval.sh"
eval_mertrics="rouge-1,rouge-2,rouge-l"
#load model
vocab_path="ernie_gen_large_160g/vocab.txt"
config_path="ernie_gen_large_160g/ernie_config.json"
init_model="ernie_gen_large_160g/params"
#input
max_src_len=192
max_tgt_len=64
tokenized_input="true"
continuous_position="true"
batch_size=16
in_tokens="false"
tgt_type_id=3
#decode
do_decode="true"
max_dec_len=32
beam_size=6
length_penalty=0.7
use_multi_gpu_test="true"
#train
epoch=5
weight_decay=0.01
label_smooth=0.1
hidden_dropout_prob=0.2
save_and_valid_by_epoch="true"
#lr
warmup_proportion=0.1
lr_scheduler="linear_warmup_decay"
learning_rate=3e-5
#noise
random_noise="true"
noise_prob=0.6
#dataset
data_path="./datasets/gigaword/"
train_set="train.tsv"
dev_set="dev.20k.tsv"
test_set="test.tsv"
do_train="true"
do_val="true"
do_test="true"
do_pred="false"
#evaluate
eval_script="sh ./eval/tasks/gigaword/eval.sh"
eval_mertrics="rouge-1,rouge-2,rouge-l"
#load model
vocab_path="ernie_gen_large/vocab.txt"
config_path="ernie_gen_large/ernie_config.json"
init_model="ernie_gen_large/params"
#for multi-turn dialog/qa
task_type="dialog"
role_type_size=3
turn_type_size=16
#input
max_src_len=472
max_tgt_len=40
tokenized_input="true"
continuous_position="true"
batch_size=8
in_tokens="false"
#decode
do_decode="true"
max_dec_len=32
beam_size=10
length_penalty=1.3
use_multi_gpu_test="true"
#train
epoch=30
weight_decay=0.01
label_smooth=0.0
hidden_dropout_prob=0.1
save_and_valid_by_epoch="true"
#lr
warmup_proportion=0.1
lr_scheduler="linear_warmup_decay"
learning_rate=1e-4
#noise
random_noise="false"
noise_prob=0.0
#dataset
data_path="./datasets/persona_chat/"
train_set="train.tsv"
dev_set="dev.2k.tsv"
pred_set="test.tsv"
do_train="true"
do_val="true"
do_test="false"
do_pred="true"
do_decode="true"
#evaluate
eval_script="sh ./eval/tasks/persona_chat/eval.sh"
eval_mertrics="bleu_1,bleu_2,distinct_1,distinct_2"
#load model
vocab_path="ernie_gen_large_160g/vocab.txt"
config_path="ernie_gen_large_160g/ernie_config.json"
init_model="ernie_gen_large_160g/params"
#input
max_src_len=512
max_tgt_len=96
tokenized_input="true"
continuous_position="true"
batch_size=4
in_tokens="false"
tgt_type_id=3
#decode
do_decode="true"
max_dec_len=48
beam_size=5
length_penalty=1.0
use_multi_gpu_test="true"
#train
epoch=10
weight_decay=0.01
label_smooth=0.1
hidden_dropout_prob=0.2
save_and_valid_by_epoch="true"
#lr
warmup_proportion=0.25
lr_scheduler="linear_warmup_decay"
learning_rate=1.25e-5
#noise
random_noise="true"
noise_prob=0.7
#dataset
data_path="./datasets/squad_qg/"
train_set="train.tsv"
dev_set="dev.tsv"
test_set="test.tsv"
do_train="true"
do_val="true"
do_test="true"
#evaluate
eval_script="sh ./eval/tasks/squad_qg/eval.sh"
eval_mertrics="Bleu_4,METEOR,ROUGE_L"
#!/usr/bin/env bash
set -xu

function check_iplist() {
    if [[ ${iplist:-""} == "" ]]; then
        iplist=`hostname -i`
    fi
    export PADDLE_PSERVER_PORT=9184
    export PADDLE_TRAINER_IPS=${iplist}
    export PADDLE_CURRENT_IP=`hostname -i`

    iparray=(${iplist//,/ })
    for i in "${!iparray[@]}"; do
        if [ ${iparray[$i]} == ${PADDLE_CURRENT_IP} ]; then
            export PADDLE_TRAINER_ID=$i
        fi
    done

    export TRAINING_ROLE=TRAINER
    export PADDLE_INIT_TRAINER_COUNT=${#iparray[@]}
    export PADDLE_PORT=${PADDLE_PSERVER_PORT}
    export PADDLE_TRAINERS=${PADDLE_TRAINER_IPS}
    export POD_IP=${PADDLE_CURRENT_IP}
    export PADDLE_TRAINERS_NUM=${PADDLE_INIT_TRAINER_COUNT}
    export PADDLE_IS_LOCAL=0

    #paddle debug envs
    export GLOG_v=0
    export GLOG_logtostderr=1

    #nccl debug envs
    export NCCL_DEBUG=INFO
    export NCCL_IB_GID_INDEX=3
}
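# Usage sketch (assumed): export `iplist` as a comma-separated list of trainer
# IPs before sourcing this file (it falls back to `hostname -i` for
# single-node runs), then call `check_iplist` to populate Paddle's
# distributed-training environment variables before launching run_seq2seq.sh.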
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""ultis help and eval functions for gen ."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import collections
import json
import math
import subprocess
from reader.tokenization import BasicTokenizer
class GenerationEval(object):
    """GenerationEval"""

    def __init__(self, args, merge_subword=None):
        self.basic_tokenizer = BasicTokenizer(do_lower_case=True)
        self.merge_subword = merge_subword
        self.eval_script = args.eval_script.split(" ")
        self.eval_mertrics = args.eval_mertrics.split(",") if args.eval_mertrics else []
        self.tokenized_input = args.tokenized_input

    def eval(self, output_file, phase="", features=None):
        """run eval"""
        eval_res = {}
        if self.eval_script:
            eval_res = subprocess.check_output(self.eval_script + [output_file, phase])
            eval_res = json.loads(eval_res)
        else:
            preds = []
            for line in open(output_file):
                preds.append(self.basic_tokenizer.tokenize(line.strip()))
            refs = []
            for id in sorted(features.keys()):
                if self.tokenized_input:
                    ref = features[id].tgt.decode("utf8").split(" ")
                    refs.append([self.merge_subword(ref)])
                else:
                    refs.append([self.basic_tokenizer.tokenize(features[id].tgt)])
            for mertric in self.eval_mertrics:
                eval_func = getattr(self, mertric, None)
                if eval_func:
                    eval_res[mertric] = eval_func(refs, preds)

        ret = []
        for mertric in self.eval_mertrics:
            mertric_res = eval_res.get(mertric, None)
            if mertric_res is None:
                raise Exception("Eval mertric: %s is not supported" % mertric)
            ret.append("%s: %f" % (mertric, mertric_res))
        return ", ".join(ret)

    def bleu(self, refs, preds):
        """bleu mertric"""
        return _compute_bleu(refs, preds, max_order=4)[0]
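# Usage sketch (assumed wiring; `args` and `dev_features` are hypothetical):
#   evaluator = GenerationEval(args, merge_subword=merge_subword_fn)
#   print(evaluator.eval("output/predictions.txt", phase="dev",
#                        features=dev_features))
# When `eval_script` is set (as in the task configs), scoring is delegated to
# that external script, which must print a JSON dict of metric -> value.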
def _get_ngrams(segment, max_order):
    ngram_counts = collections.Counter()
    for order in range(1, max_order + 1):
        for i in range(0, len(segment) - order + 1):
            ngram = tuple(segment[i: i + order])
            ngram_counts[ngram] += 1
    return ngram_counts


def _compute_bleu(reference_corpus, translation_corpus, max_order=4, smooth=False):
    matches_by_order = [0] * max_order
    possible_matches_by_order = [0] * max_order
    reference_length = 0
    translation_length = 0
    for (references, translation) in zip(reference_corpus, translation_corpus):
        reference_length += min(len(r) for r in references)
        translation_length += len(translation)
        merged_ref_ngram_counts = collections.Counter()
        for reference in references:
            merged_ref_ngram_counts |= _get_ngrams(reference, max_order)
        translation_ngram_counts = _get_ngrams(translation, max_order)
        overlap = translation_ngram_counts & merged_ref_ngram_counts
        for ngram in overlap:
            matches_by_order[len(ngram) - 1] += overlap[ngram]
        for order in range(1, max_order + 1):
            possible_matches = len(translation) - order + 1
            if possible_matches > 0:
                possible_matches_by_order[order - 1] += possible_matches

    precisions = [0] * max_order
    for i in range(0, max_order):
        if smooth:
            precisions[i] = ((matches_by_order[i] + 1.) /
                             (possible_matches_by_order[i] + 1.))
        else:
            if possible_matches_by_order[i] > 0:
                precisions[i] = (float(matches_by_order[i]) /
                                 possible_matches_by_order[i])
            else:
                precisions[i] = 0.0

    if min(precisions) > 0:
        p_log_sum = sum((1. / max_order) * math.log(p) for p in precisions)
        geo_mean = math.exp(p_log_sum)
    else:
        geo_mean = 0

    ratio = float(translation_length) / reference_length
    if ratio > 1.0:
        bp = 1.
    else:
        bp = math.exp(1 - 1. / (ratio + 1e-4))
    bleu = geo_mean * bp
    ret = [bleu, precisions, bp, ratio, translation_length, reference_length]
    return ret
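# Toy self-check (not from the original file): corpus-level BLEU for a single
# hypothesis/reference pair; smoothing avoids a zero 4-gram precision here.
if __name__ == "__main__":
    refs = [[["the", "cat", "sat", "on", "the", "mat"]]]
    hyps = [["the", "cat", "sat", "on", "mat"]]
    bleu, precisions, bp, ratio, _, _ = _compute_bleu(refs, hyps, smooth=True)
    print("BLEU-4: %.4f (brevity penalty: %.4f)" % (bleu, bp))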
This diff is collapsed.
"""BERT finetuning runner."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import sys
BASE_DIR = os.path.dirname(os.path.abspath(__file__)) + "/../"
print(BASE_DIR)
sys.path = [BASE_DIR] + sys.path
import logging
import glob
import json
import argparse
import math
import string
from multiprocessing import Pool, cpu_count
import numpy as np
# pip install py-rouge
import time
import tempfile
import shutil
# pip install pyrouge
from cnndm.bs_pyrouge import Rouge155
logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
                    datefmt='%m/%d/%Y %H:%M:%S',
                    level=logging.INFO)
logger = logging.getLogger(__name__)

parser = argparse.ArgumentParser()
# Required parameters
parser.add_argument("--gold", type=str, help="Gold output file.")
parser.add_argument("--pred", type=str, help="Input prediction file.")
parser.add_argument("--split", type=str, default="",
                    help="Data split (train/dev/test).")
parser.add_argument("--save_best", action='store_true',
                    help="Save best epoch.")
parser.add_argument("--only_eval_best", action='store_true',
                    help="Only evaluate best epoch.")
parser.add_argument("--trunc_len", type=int, default=60,
                    help="Truncate line by the maximum length.")
parser.add_argument("--duplicate_rate", type=float, default=0.7,
                    help="If the duplicate rate (compared with history) is large, we can discard the current sentence.")
default_process_count = max(1, cpu_count() - 1)
parser.add_argument("--processes", type=int, default=default_process_count,
                    help="Number of processes to use (default %(default)s)")
parser.add_argument("--perl", action='store_true',
                    help="Using the perl script.")
parser.add_argument('--lazy_eval', action='store_true',
                    help="Skip evaluation if the .rouge file exists.")
args = parser.parse_args()

SPECIAL_TOKEN = ["[UNK]", "[PAD]", "[CLS]", "[MASK]"]
# py-rouge scorer for the non-perl code path. Its definition is missing from
# this snapshot; the configuration below follows UniLM's evaluation script and
# is an assumption.
evaluator = rouge.Rouge(metrics=['rouge-n', 'rouge-l'], max_n=2,
                        limit_length=False, apply_avg=True,
                        weight_factor=1.2, stemming=True)
def test_rouge(cand, ref):
    temp_dir = tempfile.mkdtemp()
    candidates = cand
    references = ref
    assert len(candidates) == len(references)

    cnt = len(candidates)
    current_time = time.strftime('%Y-%m-%d-%H-%M-%S', time.localtime())
    tmp_dir = os.path.join(temp_dir, "rouge-tmp-{}".format(current_time))
    if not os.path.isdir(tmp_dir):
        os.mkdir(tmp_dir)
        os.mkdir(tmp_dir + "/candidate")
        os.mkdir(tmp_dir + "/reference")
    try:
        for i in range(cnt):
            if len(references[i]) < 1:
                continue
            with open(tmp_dir + "/candidate/cand.{}.txt".format(i), "w") as f:
                f.write(candidates[i])
            with open(tmp_dir + "/reference/ref.{}.txt".format(i), "w") as f:
                f.write(references[i])
        r = Rouge155(temp_dir=temp_dir)
        r.model_dir = tmp_dir + "/reference/"
        r.system_dir = tmp_dir + "/candidate/"
        r.model_filename_pattern = 'ref.#ID#.txt'
        r.system_filename_pattern = r'cand.(\d+).txt'
        rouge_results = r.convert_and_evaluate()
        print(rouge_results)
        results_dict = r.output_to_dict(rouge_results)
    finally:
        if os.path.isdir(tmp_dir):
            shutil.rmtree(tmp_dir)
    return results_dict
def rouge_results_to_str(results_dict):
    return ">> ROUGE-F(1/2/l): {:.2f}/{:.2f}/{:.2f}\nROUGE-R(1/2/3/l): {:.2f}/{:.2f}/{:.2f}\n".format(
        results_dict["rouge_1_f_score"] * 100,
        results_dict["rouge_2_f_score"] * 100,
        results_dict["rouge_l_f_score"] * 100,
        results_dict["rouge_1_recall"] * 100,
        results_dict["rouge_2_recall"] * 100,
        results_dict["rouge_l_recall"] * 100
    )
def count_tokens(tokens):
    counter = {}
    for t in tokens:
        if t in counter.keys():
            counter[t] += 1
        else:
            counter[t] = 1
    return counter


def get_f1(text_a, text_b):
    tokens_a = text_a.lower().split()
    tokens_b = text_b.lower().split()
    if len(tokens_a) == 0 or len(tokens_b) == 0:
        return 1 if len(tokens_a) == len(tokens_b) else 0
    set_a = count_tokens(tokens_a)
    set_b = count_tokens(tokens_b)
    match = 0
    for token in set_a.keys():
        if token in set_b.keys():
            match += min(set_a[token], set_b[token])
    p = match / len(tokens_a)
    r = match / len(tokens_b)
    return 2.0 * p * r / (p + r + 1e-5)
_tok_dict = {"(": "-LRB-", ")": "-RRB-",
             "[": "-LSB-", "]": "-RSB-",
             "{": "-LCB-", "}": "-RCB-"}


def _is_digit(w):
    for ch in w:
        if not(ch.isdigit() or ch == ','):
            return False
    return True
def fix_tokenization(text):
    input_tokens = text.split()
    output_tokens = []
    has_left_quote = False
    has_left_single_quote = False

    i = 0
    prev_dash = False
    while i < len(input_tokens):
        tok = input_tokens[i]
        flag_prev_dash = False
        if tok in _tok_dict.keys():
            output_tokens.append(_tok_dict[tok])
            i += 1
        elif tok == "\"":
            if has_left_quote:
                output_tokens.append("''")
            else:
                output_tokens.append("``")
            has_left_quote = not has_left_quote
            i += 1
        elif tok == "'" and len(output_tokens) > 0 and output_tokens[-1].endswith("n") and i < len(input_tokens) - 1 and input_tokens[i + 1] == "t":
            output_tokens[-1] = output_tokens[-1][:-1]
            output_tokens.append("n't")
            i += 2
        elif tok == "'" and i < len(input_tokens) - 1 and input_tokens[i + 1] in ("s", "d", "ll"):
            output_tokens.append("'" + input_tokens[i + 1])
            i += 2
        elif tok == "'":
            if has_left_single_quote:
                output_tokens.append("'")
            else:
                output_tokens.append("`")
            has_left_single_quote = not has_left_single_quote
            i += 1
        elif tok == "." and i < len(input_tokens) - 2 and input_tokens[i + 1] == "." and input_tokens[i + 2] == ".":
            output_tokens.append("...")
            i += 3
        elif tok == "," and len(output_tokens) > 0 and _is_digit(output_tokens[-1]) and i < len(input_tokens) - 1 and _is_digit(input_tokens[i + 1]):
            # $ 3 , 000 -> $ 3,000
            output_tokens[-1] += ',' + input_tokens[i + 1]
            i += 2
        elif tok == "." and len(output_tokens) > 0 and output_tokens[-1].isdigit() and i < len(input_tokens) - 1 and input_tokens[i + 1].isdigit():
            # 3 . 03 -> 3.03
            output_tokens[-1] += '.' + input_tokens[i + 1]
            i += 2
        elif tok == "." and len(output_tokens) > 0 and len(output_tokens[-1]) == 1 and output_tokens[-1].isupper() and i < len(input_tokens) - 2 and len(input_tokens[i + 1]) == 1 and input_tokens[i + 1].isupper() and input_tokens[i + 2] == '.':
            # U . N . -> U.N.
            k = i + 3
            while k + 2 < len(input_tokens):
                if len(input_tokens[k + 1]) == 1 and input_tokens[k + 1].isupper() and input_tokens[k + 2] == '.':
                    k += 2
                else:
                    break
            output_tokens[-1] += ''.join(input_tokens[i:k])
            i = k
        elif tok == "-":
            if i < len(input_tokens) - 1 and input_tokens[i + 1] == "-":
                output_tokens.append("--")
                i += 2
            elif i == len(input_tokens) - 1 or i == 0:
                output_tokens.append("-")
                i += 1
            elif output_tokens[-1] not in string.punctuation and input_tokens[i + 1][0] not in string.punctuation:
                output_tokens[-1] += "-"
                i += 1
                flag_prev_dash = True
            else:
                output_tokens.append("-")
                i += 1
        elif prev_dash and len(output_tokens) > 0 and tok[0] not in string.punctuation:
            output_tokens[-1] += tok
            i += 1
        else:
            output_tokens.append(tok)
            i += 1
        prev_dash = flag_prev_dash
    return " ".join(output_tokens)
def remove_duplicate(l_list, duplicate_rate):
    tk_list = [l.lower().split() for l in l_list]
    r_list = []
    history_set = set()
    for i, w_list in enumerate(tk_list):
        w_set = set(w_list)
        if len(w_set & history_set) / len(w_set) <= duplicate_rate:
            r_list.append(l_list[i])
        history_set |= w_set
    return r_list
def process_eval(eval_fn):
    gold_list = []
    with open(args.gold, "r") as f_in:
        for l in f_in:
            line = l.strip().replace(" <S_SEP> ", '\n')
            gold_list.append(line)

    pred_list = []
    with open(eval_fn, "r") as f_in:
        for l in f_in:
            buf = []
            for sentence in l.strip().split("[X_SEP]"):
                sentence = fix_tokenization(sentence)
                if any(get_f1(sentence, s) > 1.0 for s in buf):
                    continue
                s_len = len(sentence.split())
                if s_len <= 4:
                    continue
                buf.append(sentence)
            if args.duplicate_rate and args.duplicate_rate < 1:
                buf = remove_duplicate(buf, args.duplicate_rate)
            if args.trunc_len:
                num_left = args.trunc_len
                trunc_list = []
                for bit in buf:
                    tk_list = bit.split()
                    n = min(len(tk_list), num_left)
                    trunc_list.append(' '.join(tk_list[:n]))
                    num_left -= n
                    if num_left <= 0:
                        break
            else:
                trunc_list = buf
            line = "\n".join(trunc_list)
            pred_list.append(line)
    with open(eval_fn + '.post', 'w') as f_out:
        for l in pred_list:
            f_out.write(l.replace('\n', ' [X_SEP] ').strip())
            f_out.write('\n')
    # rouge scores
    if len(pred_list) < len(gold_list):
        # evaluate subset
        gold_list = gold_list[:len(pred_list)]
    assert len(pred_list) == len(gold_list)
    if args.perl:
        scores = test_rouge(pred_list, gold_list)
    else:
        scores = evaluator.get_scores(pred_list, [[it] for it in gold_list])
    return eval_fn, scores
def main():
    if args.perl:
        eval_fn_list = list(glob.glob(args.pred))
    else:
        eval_fn_list = [eval_fn for eval_fn in glob.glob(args.pred) if not(
            args.lazy_eval and Path(eval_fn + ".rouge").exists())]
    eval_fn_list = list(filter(lambda fn: not(fn.endswith(
        '.post') or fn.endswith('.rouge')), eval_fn_list))

    if args.only_eval_best:
        best_epoch_dict = {}
        for dir_path in set(Path(fn).parent for fn in eval_fn_list):
            fn_save = os.path.join(dir_path, 'save_best.dev')
            if Path(fn_save).exists():
                with open(fn_save, 'r') as f_in:
                    __, o_name, __ = f_in.read().strip().split('\n')
                    epoch = o_name.split('.')[1]
                    best_epoch_dict[dir_path] = epoch
        new_eval_fn_list = []
        for fn in eval_fn_list:
            dir_path = Path(fn).parent
            if dir_path in best_epoch_dict:
                if Path(fn).name.split('.')[1] == best_epoch_dict[dir_path]:
                    new_eval_fn_list.append(fn)
        eval_fn_list = new_eval_fn_list

    logger.info("***** Evaluation: %s *****", ','.join(eval_fn_list))
    # num_pool = min(args.processes, len(eval_fn_list))
    # p = Pool(num_pool)
    # r_list = p.imap_unordered(process_eval, eval_fn_list)
    r_list = []
    for ins in eval_fn_list:
        r1, r2 = process_eval(ins)
        r_list.append((r1, r2))
    r_list = sorted([(fn, scores)
                     for fn, scores in r_list], key=lambda x: x[0])
    rg2_dict = {}
    for fn, scores in r_list:
        print(fn)
        if args.perl:
            print(rouge_results_to_str(scores))
        else:
            rg2_dict[fn] = scores['rouge-2']['f']
            print(
                "ROUGE-1: {}\tROUGE-2: {}\n".format(scores['rouge-1']['f'], scores['rouge-2']['f']))
            with open(fn + ".rouge", 'w') as f_out:
                f_out.write(json.dumps(
                    {'rg1': scores['rouge-1']['f'], 'rg2': scores['rouge-2']['f']}))
    # p.close()
    # p.join()

    if args.save_best:
        # find best results
        group_dict = {}
        for k, v in rg2_dict.items():
            d_name, o_name = Path(k).parent, Path(k).name
            if (d_name not in group_dict) or (v > group_dict[d_name][1]):
                group_dict[d_name] = (o_name, v)
        # compare and save the best result
        for k, v in group_dict.items():
            fn = os.path.join(k, 'save_best.' + args.split)
            o_name_s, rst_s = v
            should_save = True
            if Path(fn).exists():
                with open(fn, 'r') as f_in:
                    rst_f = float(f_in.read().strip().split('\n')[-1])
                if rst_s <= rst_f:
                    should_save = False
            if should_save:
                with open(fn, 'w') as f_out:
                    f_out.write('{0}\n{1}\n{2}\n'.format(k, o_name_s, rst_s))


if __name__ == "__main__":
    main()
set -x
PRED=$1
PREFIX=$2
python pyrouge_set_rouge_path.py `pwd`/file2rouge/
python cnndm/eval.py --pred ${PRED} \
--gold ${PREFIX}.summary --trunc_len 100 --perl
PRED=`pwd`"/"$1
if [[ $2 == "dev" ]];then
    EVAL_PREFIX=$DEV_PREFIX
elif [[ $2 == "test" ]];then
    EVAL_PREFIX=$TEST_PREFIX
elif [[ $2 == "pred" ]];then
    EVAL_PREFIX=$PRED_PREFIX
fi
PREFIX=`pwd`"/"${TASK_DATA_PATH}"/"${EVAL_PREFIX}
cd `dirname $0`
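# Reshape the "ROUGE-F(1/2/l): r1/r2/rl" line printed by cnndm_eval.sh into the
# JSON dict ({"rouge-1": ..., "rouge-2": ..., "rouge-l": ...}) that
# GenerationEval reads from the script's stdout.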
sh cnndm_eval.sh $PRED $PREFIX 2>${EVAL_SCRIPT_LOG} | grep ROUGE-F | awk -F ": " '{print $2}' | awk -F "/" '{print "{\"rouge-1\": "$1", \"rouge-2\": "$2", \"rouge-l\": "$3"}"}'
A Brief Introduction of the ROUGE Summary Evaluation Package
by Chin-Yew LIN
University of Southern California/Information Sciences Institute
05/26/2005
<<WHAT'S NEW>>
(1) Correct the resampling routine, which ignored the last evaluation
item in the evaluation list. As a result, the average scores reported
by ROUGE were based on only the first N-1 evaluation items.
Thanks to Barry Schiffman at Columbia University for reporting this bug.
This bug only affects ROUGE-1.5.X. For pre-1.5 ROUGE, it only affects
the computation of confidence interval (CI) estimation, i.e. CI is only
estimated by the first N-1 evaluation items, but it *does not* affect
average scores.
(2) Correct stemming on multi-token BE heads and modifiers.
Previously, only single token heads and modifiers were assumed.
(3) Change the read_text and read_text_LCS functions to read the exact
words or bytes required by users. Previous versions carried out whitespace
compression and other string clean-up actions before enforcing the length
limit.
(4) Add the capability to score summaries in Basic Element (BE)
format by using option "-3", standing for BE triple. There are 6
different modes in BE scoring. We suggest using *"-3 HMR"* on BEs
extracted from Minipar parse trees based on our correlation analysis
of BE-based scoring vs. human judgements on DUC 2002 & 2003 automatic
summaries.
(5) ROUGE now generates three scores (recall, precision and F-measure)
for each evaluation. Previously, only one score is generated
(recall). Precision and F-measure scores are useful when the target
summary length is not enforced. Only recall scores were necessary since
DUC guideline dictated the limit on summary length. For comparison to
previous DUC results, please use the recall scores. The default alpha
weighting for computing F-measure is 0.5. Users can specify a
particular alpha weighting that fits their application scenario using
option "-p alpha-weight". Where *alpha-weight* is a number between 0
and 1 inclusively.
(6) Pre-1.5 version of ROUGE used model average to compute the overall
ROUGE scores when there are multiple references. Starting from v1.5+,
ROUGE provides an option to use the best matching score among the
references as the final score. The model average option is specified
using "-f A" (for Average) and the best model option is specified
using "-f B" (for the Best). The "-f A" option is better when use
ROUGE in summarization evaluations; while "-f B" option is better when
use ROUGE in machine translation (MT) and definition
question-answering (DQA) evaluations since in a typical MT or DQA
evaluation scenario matching a single reference translation or
definition answer is sufficient. However, it is very likely that
multiple different but equally good summaries exist in summarization
evaluation.
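The two aggregation modes reduce to an average versus a max over per-reference
scores. An illustrative Python sketch, where score_fn stands in for any
per-reference ROUGE computation (hypothetical helper, not part of the package):

    def aggregate(score_fn, peer, references, formula="A"):
        scores = [score_fn(peer, ref) for ref in references]
        if formula == "B":                    # best match (MT/DQA style)
            return max(scores)
        return sum(scores) / len(scores)      # model average (summarization)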
(7) ROUGE v1.5+ also provides the option to specify whether model unit
level average will be used (macro-average, i.e. treating every model
unit equally) or token level average will be used (micro-average,
i.e. treating every token equally). In summarization evaluation, we
suggest using model unit level average, and this is the default setting
in ROUGE. To specify another average mode, use "-t 0" (default) for
model unit level average, "-t 1" for token level average, and "-t 2"
to output raw token counts in models, peers, and matches.
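The difference between the two averaging modes in a short Python sketch
(illustrative only); each unit is a (hits, total) pair and recall is
hits/total:

    def unit_level_average(units):       # "-t 0": every unit weighs equally
        return sum(h / t for h, t in units) / len(units)

    def token_level_average(units):      # "-t 1": every token weighs equally
        return sum(h for h, _ in units) / sum(t for _, t in units)

    units = [(1, 2), (9, 10)]            # a short unit and a long unit
    print(unit_level_average(units))     # (0.5 + 0.9) / 2 = 0.7
    print(token_level_average(units))    # 10 / 12 ~ 0.83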
(8) ROUGE now offers the option to use a file list as the configuration
file. The input format of the summary files is specified using the
"-z INPUT-FORMAT" option. The INPUT-FORMAT can be SEE, SPL, ISI or
SIMPLE. When "-z" is specified, ROUGE assumes that the ROUGE
evaluation configuration file is a file list with one evaluation
instance per line in the following format:
peer_path1 model_path1 model_path2 ... model_pathN
peer_path2 model_path1 model_path2 ... model_pathN
...
peer_pathM model_path1 model_path2 ... model_pathN
The first file path is the peer summary (system summary), followed
by a list of model summaries (reference summaries) separated
by whitespace (spaces or tabs).
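Parsing one such line is straightforward; a minimal Python sketch with
made-up paths:

    def parse_filelist_line(line):
        paths = line.split()              # whitespace-separated
        return paths[0], paths[1:]        # peer path, model paths

    peer, models = parse_filelist_line(
        "sys/task1.txt ref/task1.A.txt ref/task1.B.txt")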
(9) When stemming is applied, a new WordNet exception database based
on WordNet 2.0 is used. The new database is included in the data
directory.
<<USAGE>>
(1) Use "-h" option to see a list of options.
Summary:
Usage: ROUGE-1.5.4.pl
[-a (evaluate all systems)]
[-c cf]
[-d (print per evaluation scores)]
[-e ROUGE_EVAL_HOME]
[-h (usage)]
[-b n-bytes|-l n-words]
[-m (use Porter stemmer)]
[-n max-ngram]
[-s (remove stopwords)]
[-r number-of-samples (for resampling)]
[-2 max-gap-length (if < 0 then no gap length limit)]
[-3 <H|HM|HMR|HM1|HMR1|HMR2>]
[-u (include unigram in skip-bigram; default no)]
[-U (same as -u but also compute regular skip-bigram)]
[-w weight (weighting factor for WLCS)]
[-v (verbose)]
[-x (do not calculate ROUGE-L)]
[-f A|B (scoring formula)]
[-p alpha (0 <= alpha <=1)]
[-t 0|1|2 (count by token instead of sentence)]
[-z <SEE|SPL|ISI|SIMPLE>]
<ROUGE-eval-config-file> [<systemID>]
ROUGE-eval-config-file: Specify the evaluation setup. Three files that come with the ROUGE
evaluation package, i.e. ROUGE-test.xml, verify.xml, and verify-spl.xml, are
good examples.
systemID: Specify which system in the ROUGE-eval-config-file to evaluate.
If '-a' option is used, then all systems are evaluated and users do not need to
provide this argument.
Default:
When running ROUGE without supplying any options (except -a), the following defaults are used:
(1) ROUGE-L is computed;
(2) 95% confidence interval;
(3) No stemming;
(4) Stopwords are included in the calculations;
(5) ROUGE looks for its data directory first through the ROUGE_EVAL_HOME environment variable. If
it is not set, the current directory is used.
(6) Use model average scoring formula.
(7) Assign equal importance to ROUGE recall and precision in computing the ROUGE f-measure, i.e. alpha=0.5.
(8) Compute average ROUGE by averaging sentence (unit) ROUGE scores.
Options:
-2: Compute skip-bigram (ROUGE-S) co-occurrence; also specify the maximum gap length between the two words of a skip-bigram
-u: Compute skip bigram as -2 but include unigram, i.e. treat unigram as "start-sentence-symbol unigram"; -2 has to be specified.
-3: Compute BE score.
H -> head only scoring (does not apply to Minipar-based BEs).
HM -> head and modifier pair scoring.
HMR -> head, modifier and relation triple scoring.
HM1 -> H and HM scoring (same as HM for Minipar-based BEs).
HMR1 -> HM and HMR scoring (same as HMR for Minipar-based BEs).
HMR2 -> H, HM and HMR scoring (same as HMR for Minipar-based BEs).
-a: Evaluate all systems specified in the ROUGE-eval-config-file.
-c: Specify the CF% (0 <= CF <= 100) confidence interval to compute. The default is 95% (i.e. CF=95).
-d: Print per evaluation average score for each system.
-e: Specify ROUGE_EVAL_HOME directory where the ROUGE data files can be found.
This will overwrite the ROUGE_EVAL_HOME specified in the environment variable.
-f: Select scoring formula: 'A' => model average; 'B' => best model
-h: Print usage information.
-b: Only use the first n bytes in the system/peer summary for the evaluation.
-l: Only use the first n words in the system/peer summary for the evaluation.
-m: Stem both model and system summaries using Porter stemmer before computing various statistics.
-n: Compute ROUGE-N up to max-ngram length.
-p: Relative importance of recall and precision ROUGE scores. Alpha -> 1 favors precision, Alpha -> 0 favors recall.
-s: Remove stopwords in model and system summaries before computing various statistics.
-t: Compute average ROUGE by averaging over the whole test corpus instead of sentences (units).
0: use sentence as counting unit, 1: use token as counting unit, 2: same as 1 but output raw counts
instead of precision, recall, and f-measure scores. 2 is useful when computation of the final
precision, recall, and f-measure scores will be conducted later.
-r: Specify the number of sampling points in bootstrap resampling (default is 1000).
A smaller number speeds up the evaluation but gives a less reliable confidence interval.
-w: Compute ROUGE-W, which gives consecutive matches of length L in an LCS a weight of 'L^weight' instead of just 'L' as in LCS.
Typically this is set to 1.2 or another number greater than 1 (a short sketch of this weighting follows this option list).
-v: Print debugging information for diagnostic purposes.
-x: Do not calculate ROUGE-L.
-z: ROUGE-eval-config-file is a list of peer-model pairs, one per line, in the specified format (SEE|SPL|ISI|SIMPLE).
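As referenced under "-w" above, ROUGE-W's weighting f(L) = L^weight rewards
contiguous matches when weight > 1. An illustrative Python sketch:

    def wlcs_weight(run_length, w=1.2):
        return run_length ** w

    # same total LCS length (4 tokens), different contiguity:
    one_run_of_four = wlcs_weight(4)         # 4**1.2, about 5.28
    four_single_hits = 4 * wlcs_weight(1)    # 4 * 1**1.2 = 4.0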
(2) Please read RELEASE-NOTE.txt for information about updates from previous versions.
(3) The following files, which come with this package in the "sample-output"
directory, are the expected output of the evaluation files in the
"sample-test" directory.
(a) use "data" as ROUGE_EVAL_HOME, compute 95% confidence interval,
compute ROUGE-L (longest common subsequence, default),
compute ROUGE-S* (skip bigram) without gap length limit,
compute also ROUGE-SU* (skip bigram with unigram),
run resampling 1000 times,
compute ROUGE-N (N=1 to 4),
compute ROUGE-W (weight = 1.2), and
compute these ROUGE scores for all systems:
ROUGE-test-c95-2-1-U-r1000-n4-w1.2-a.out
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -a ROUGE-test.xml
(b) Same as (a) but apply Porter's stemmer on the input:
ROUGE-test-c95-2-1-U-r1000-n4-w1.2-a-m.out
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -m -a ROUGE-test.xml
(c) Same as (b) but apply also a stopword list on the input:
ROUGE-test-c95-2-1-U-r1000-n4-w1.2-a-m-s.out
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -m -s -a ROUGE-test.xml
(d) Same as (a) but apply a summary length limit of 10 words:
ROUGE-test-c95-2-1-U-r1000-n4-w1.2-l10-a.out
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -l 10 -a ROUGE-test.xml
(e) Same as (d) but apply Porter's stemmer on the input:
ROUGE-test-c95-2-1-U-r1000-n4-w1.2-l10-a-m.out
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -l 10 -m -a ROUGE-test.xml
(f) Same as (e) but apply also a stopword list on the input:
ROUGE-test-c95-2-1-U-r1000-n4-w1.2-l10-a-m-s.out
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -l 10 -m -s -a ROUGE-test.xml
(g) Same as (a) but apply a summary length limit of 75 bytes:
ROUGE-test-c95-2-1-U-r1000-n4-w1.2-b75-a.out
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -b 75 -a ROUGE-test.xml
(h) Same as (g) but apply Porter's stemmer on the input:
ROUGE-test-c95-2-1-U-r1000-n4-w1.2-b75-a-m.out
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -b 75 -m -a ROUGE-test.xml
(i) Same as (h) but apply also a stopword list on the input:
ROUGE-test-c95-2-1-U-r1000-n4-w1.2-b75-a-m-s.out
> ROUGE-1.5.4.pl -e data -c 95 -2 -1 -U -r 1000 -n 4 -w 1.2 -b 75 -m -s -a ROUGE-test.xml
Sample DUC2002 data (1 system and 1 model only per DUC 2002 topic), their BE and
ROUGE evaluation configuration file in XML and file list format,
and their expected output are also included for your reference.
(a) Use DUC2002-BE-F.in.26.lst, a BE file list, as the ROUGE
configuration file:
command> ROUGE-1.5.4.pl -3 HM -z SIMPLE DUC2002-BE-F.in.26.lst 26
output: DUC2002-BE-F.in.26.lst.out
(b) Use DUC2002-BE-F.in.26.simple.xml as ROUGE XML evaluation configuration file:
command> ROUGE-1.5.4.pl -3 HM DUC2002-BE-F.in.26.simple.xml 26
output: DUC2002-BE-F.in.26.simple.out
(c) Use DUC2002-BE-L.in.26.lst, a BE file list, as the ROUGE
configuration file:
command> ROUGE-1.5.4.pl -3 HM -z SIMPLE DUC2002-BE-L.in.26.lst 26
output: DUC2002-BE-L.in.26.lst.out
(d) Use DUC2002-BE-L.in.26.simple.xml as ROUGE XML evaluation configuration file:
command> ROUGE-1.5.4.pl -3 HM DUC2002-BE-L.in.26.simple.xml 26
output: DUC2002-BE-L.in.26.simple.out
(e) Use DUC2002-ROUGE.in.26.spl.lst, a sentence-per-line file list, as the ROUGE
configuration file:
command> ROUGE-1.5.4.pl -n 4 -z SPL DUC2002-ROUGE.in.26.spl.lst 26
output: DUC2002-ROUGE.in.26.spl.lst.out
(f) Use DUC2002-ROUGE.in.26.spl.xml as ROUGE XML evaluation configuration file:
command> ROUGE-1.5.4.pl -n 4 DUC2002-ROUGE.in.26.spl.xml 26
output: DUC2002-ROUGE.in.26.spl.out
<<INSTALLATION>>
(1) You need to have DB_File installed. If the Perl script complains
about database version incompatibility, you can create a new
WordNet-2.0.exc.db by running the buildExceptionDB.pl script in
the "data/WordNet-2.0-Exceptions" subdirectory.
(2) You also need to install XML::DOM from http://www.cpan.org.
Direct link: http://www.cpan.org/modules/by-module/XML/XML-DOM-1.43.tar.gz.
You might need to install extra Perl modules that are required by
XML::DOM.
(3) Set up an environment variable ROUGE_EVAL_HOME that points to the
"data" subdirectory. For example, if your "data" subdirectory is
located at "/usr/local/ROUGE-1.5.4/data", then you can set up
ROUGE_EVAL_HOME as follows:
(a) Using csh or tcsh:
$command_prompt>setenv ROUGE_EVAL_HOME /usr/local/ROUGE-1.5.4/data
(b) Using bash
$command_prompt>ROUGE_EVAL_HOME=/usr/local/ROUGE-1.5.4/data
$command_prompt>export ROUGE_EVAL_HOME
(4) Running ROUGE-1.5.4.pl without supplying any arguments will give
you a description of how to use the ROUGE script.
(5) Please look into the included ROUGE-test.xml, verify.xml, and
verify-spl.xml evaluation configuration files for preparing your
own evaluation setup. A more detailed description will be provided
later. ROUGE-test.xml and verify.xml specify that the input from
systems and references is in SEE (Summary Evaluation Environment)
format (http://www.isi.edu/~cyl/SEE), while verify-spl.xml specifies
that inputs are in sentence-per-line format.
<<DOCUMENTATION>>
(1) Please look into the "docs" directory for more information about
ROUGE.
(2) ROUGE-Note-v1.4.2.pdf explains how ROUGE works. It was published in
Proceedings of the Workshop on Text Summarization Branches Out
(WAS 2004), Barcelona, Spain, 2004.
(3) NAACL2003.pdf presents the initial idea of applying n-gram
co-occurrence statistics in automatic evaluation of
summarization. It was published in Proceedings of the 2003 Language
Technology Conference (HLT-NAACL 2003), Edmonton, Canada, 2003.
(4) NTCIR2004.pdf discusses the effect of sample size on the
reliability of automatic evaluation results using data in the past
Document Understanding Conference (DUC) as examples. It was
published in Proceedings of the 4th NTCIR Meeting, Tokyo, Japan, 2004.
(5) ACL2004.pdf shows how ROUGE can be applied on automatic evaluation
of machine translation. It was published in Proceedings of the 42nd
Annual Meeting of the Association for Computational Linguistics
(ACL 2004), Barcelona, Spain, 2004.
(6) COLING2004.pdf proposes a new meta-evaluation framework, ORANGE, for
automatic evaluation of automatic evaluation methods. We showed
that ROUGE-S and ROUGE-L were significantly better than BLEU,
NIST, WER, and PER automatic MT evaluation methods under the
ORANGE framework. It was published in Proceedings of the 20th
International Conference on Computational Linguistics (COLING 2004),
Geneva, Switzerland, 2004.
(7) For information about BE, please go to http://www.isi.edu/~cyl/BE.
<<NOTE>>
Thanks for using the ROUGE evaluation package. If you have any
questions or comments, please send them to cyl@isi.edu. I will do my
best to answer your questions.
# Revision Note: 05/26/2005, Chin-Yew LIN
# 1.5.5
# (1) Correct stemming on multi-token BE heads and modifiers.
# Previously, only single token heads and modifiers were assumed.
# (2) Correct the resampling routine, which ignored the last evaluation
# item in the evaluation list; as a result, the average scores reported
# by ROUGE were based only on the first N-1 evaluation items.
# Thanks to Barry Schiffman at Columbia University for reporting this bug.
# This bug only affects ROUGE-1.5.X. For pre-1.5 ROUGE, it only affects
# the computation of confidence interval (CI) estimation, i.e. the CI is
# estimated from only the first N-1 evaluation items, but it *does not*
# affect average scores.
# (3) Change the read_text and read_text_LCS functions to read the exact
# words or bytes required by users. Previous versions carried out
# whitespace compression and other string cleanup actions before
# enforcing the length limit.
# 1.5.4.1
# (1) Minor description change about "-t 0" option.
# 1.5.4
# (1) Add easy evaluation mode for single-reference evaluations with the -z
# option.
# 1.5.3
# (1) Add option to compute ROUGE score based on SIMPLE BE format. Given
# a set of peer and model summary file in BE format with appropriate
# options, ROUGE will compute matching scores based on BE lexical
# matches.
# There are 6 options:
# 1. H : Head only match. This is similar to unigram match but
# only BE Head is used in matching. BEs generated by
# Minipar-based breaker do not include head-only BEs,
# therefore, the score will always be zero. Use HM or HMR
# options instead.
# 2. HM : Head and modifier match. This is similar to bigram or
# skip bigram but it's head-modifier bigram match based on
# parse result. Only BE triples with non-NIL modifier are
# included in the matching.
# 3. HMR : Head, modifier, and relation match. This is similar to
# trigram match but it's head-modifier-relation trigram
# match based on parse result. Only BE triples with non-NIL
# relation are included in the matching.
# 4. HM1 : This is a combination of H and HM. It is similar to unigram +
# bigram or skip bigram with unigram match but it's
# head-modifier bigram match based on parse result.
# In this case, the modifier field in a BE can be "NIL"
# 5. HMR1 : This is a combination of HM and HMR. It is similar to
# trigram match but it's head-modifier-relation trigram
# match based on parse result. In this case, the relation
# field of the BE can be "NIL".
# 6. HMR2 : This is a combination of H, HM and HMR. It is similar to
# trigram match but it's head-modifier-relation trigram
# match based on parse result. In this case, the modifier and
# relation fields of the BE can both be "NIL".
# 1.5.2
# (1) Add option to compute ROUGE score by token using the whole corpus
# as the averaging unit instead of individual sentences. Previous versions of
# ROUGE used sentence (or unit) boundaries to break counting units and took
# the average score over the counting units as the final score.
# Using the whole corpus as one single counting unit can potentially
# improve the reliability of the final score; it treats each token as
# equally important, while the previous approach considers each sentence
# equally important and thus ignores the length effect of individual
# sentences (i.e. long sentences contribute equal weight to the final
# score as short sentences.)
# v1.5.2 provides a choice of these two counting modes so that users can
# choose the one that fits their scenario.
# 1.5.1
# (1) Add precision oriented measure and f-measure to deal with different lengths
# in candidates and references. Importance between recall and precision can
# be controlled by the 'alpha' parameter:
# alpha -> 0: recall is more important
# alpha -> 1: precision is more important
# Following Chapter 7 in C.J. van Rijsbergen's "Information Retrieval".
# http://www.dcs.gla.ac.uk/Keith/Chapter.7/Ch.7.html
# F = 1/(alpha * (1/P) + (1 - alpha) * (1/R)) ;;; weighted harmonic mean
# 1.4.2
# (1) Enforce length limit at the time when summary text is read. Previously (before
# and including v1.4.1), length limit was enforced at tokenization time.
# 1.4.1
# (1) Fix potential over-counting in ROUGE-L and ROUGE-W.
# In previous versions (i.e. 1.4 and older), the LCS hit is computed
# by summing union hits over all model sentences. Each model sentence
# is compared with all peer sentences and the union LCS is marked. The
# length of the union LCS is the hit of that model sentence. The
# final hit is then the sum over all model union LCS hits. This could
# over-count a peer sentence that has already been marked as contributing
# to some other model sentence, resulting in double counting.
# This is seen in evaluations where the ROUGE-L score is higher than
# ROUGE-1, which is not correct.
# ROUGEeval-1.4.1.pl fixes this by adding a clip function to prevent
# double counting.
# 1.4
# (1) Remove internal Jackknifing procedure:
# Now the ROUGE script will use all the references listed in the
# <MODEL></MODEL> section in each <EVAL></EVAL> section and no
# automatic Jackknifing is performed.
# If a jackknifing procedure is required when comparing human and system
# performance, then users have to set up the procedure in the ROUGE
# evaluation configuration script as follows:
# For example, to evaluate system X with 4 references R1, R2, R3, and R4.
# We do the following computation:
#
# for system: and for comparable human:
# s1 = X vs. R1, R2, R3 h1 = R4 vs. R1, R2, R3
# s2 = X vs. R1, R3, R4 h2 = R2 vs. R1, R3, R4
# s3 = X vs. R1, R2, R4 h3 = R3 vs. R1, R2, R4
# s4 = X vs. R2, R3, R4 h4 = R1 vs. R2, R3, R4
#
# Average system score for X = (s1+s2+s3+s4)/4 and for human = (h1+h2+h3+h4)/4
# Implementation of this in a ROUGE evaluation configuration script is as follows:
# Instead of writing all references in an evaluation section as below:
# <EVAL ID="1">
# ...
# <PEERS>
# <P ID="X">systemX</P>
# </PEERS>
# <MODELS>
# <M ID="1">R1</M>
# <M ID="2">R2</M>
# <M ID="3">R3</M>
# <M ID="4">R4</M>
# </MODELS>
# </EVAL>
# we write the following:
# <EVAL ID="1-1">
# <PEERS>
# <P ID="X">systemX</P>
# </PEERS>
# <MODELS>
# <M ID="2">R2</M>
# <M ID="3">R3</M>
# <M ID="4">R4</M>
# </MODELS>
# </EVAL>
# <EVAL ID="1-2">
# <PEERS>
# <P ID="X">systemX</P>
# </PEERS>
# <MODELS>
# <M ID="1">R1</M>
# <M ID="3">R3</M>
# <M ID="4">R4</M>
# </MODELS>
# </EVAL>
# <EVAL ID="1-3">
# <PEERS>
# <P ID="X">systemX</P>
# </PEERS>
# <MODELS>
# <M ID="1">R1</M>
# <M ID="2">R2</M>
# <M ID="4">R4</M>
# </MODELS>
# </EVAL>
# <EVAL ID="1-4">
# <PEERS>
# <P ID="X">systemX</P>
# </PEERS>
# <MODELS>
# <M ID="1">R1</M>
# <M ID="2">R2</M>
# <M ID="3">R3</M>
# </MODELS>
# </EVAL>
#
# In this case, the system and human numbers are comparable.
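A Python-style sketch (illustrative only; score_fn is a hypothetical stand-in
for a per-evaluation ROUGE computation) of the manual jackknifing laid out
above:

    def jackknife(score_fn, peer, refs):
        # score the peer against every leave-one-out subset of references
        runs = [score_fn(peer, refs[:i] + refs[i + 1:]) for i in range(len(refs))]
        return sum(runs) / len(runs)

    # s = jackknife(score_fn, X, [R1, R2, R3, R4])   # system score
    # h = average of each R_i scored against the other three references;
    # s and h are then directly comparable.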
# ROUGE as it is implemented for summarization evaluation is a recall-based metric.
# As we increase the number of references, we increase the number of
# counting units (n-grams, skip-bigrams, or LCSes) in the target pool (i.e.
# the number that ends up in the denominator of any ROUGE formula is larger).
# Therefore, a candidate summary has more chances to hit but it also has to
# hit more. In the end, this means lower absolute ROUGE scores when more
# references are used, and scores computed using different sets of references
# should not be compared to each other. There is no normalization mechanism
# in ROUGE to properly adjust for differences due to the number of references used.
#
# In the ROUGE implementations before v1.4, when there are N models provided for
# evaluating system X in the ROUGE evaluation script, ROUGE does the
# following:
# (1) s1 = X vs. R2, R3, R4, ..., RN
# (2) s2 = X vs. R1, R3, R4, ..., RN
# (3) s3 = X vs. R1, R2, R4, ..., RN
# (4) s4 = X vs. R1, R2, R3, R5, ..., RN
# (5) ...
# (6) sN = X vs. R1, R2, R3, ..., RN-1
# And the final ROUGE score is computed by taking the average of (s1, s2, s3,
# s4, ..., sN). When we provide only three references for evaluation of a
# human summarizer, ROUGE does the same thing but using 2 out of 3
# references, gets three numbers, and then takes the average as the final
# score. Now ROUGE (after v1.4) will use all references without this
# internal jackknifing procedure. The speed of the evaluation should improve
# a lot, since only one set instead of N sets of computations will be
# conducted.
# 1.3
# (1) Add skip bigram
# (2) Add an option to specify the number of sampling point (default is 1000)
# 1.2.3
# (1) Correct the environment variable option: -e. Now users can specify the
# environment variable ROUGE_EVAL_HOME using the "-e" option; previously this
# option was not active. Thanks to Zhouyan Li of Concordia University, Canada,
# for pointing this out.
# 1.2.2
# (1) Correct confidence interval calculation for median, maximum, and minimum.
# Line 390.
# 1.2.1
# (1) Add sentence per line format input format. See files in Verify-SPL for examples.
# (2) Streamline command line arguments.
# (3) Use bootstrap resampling to estimate confidence intervals instead of using t-test
# or z-test which assume a normal distribution.
# (4) Add LCS (longest common subsequence) evaluation method.
# (5) Add WLCS (weighted longest common subsequence) evaluation method.
# (6) Add length cutoff in bytes.
# (7) Add an option to specify the longest ngram to compute. The default is 4.
# 1.2
# (1) Change the zero condition check in subroutine &computeNGramScores when
# computing $gram1Score from
# if($totalGram2Count!=0) to
# if($totalGram1Count!=0)
# Thanks to Ken Litkowski for this bug report.
# The original script set $gram1Score to zero if there were no
# bigram matches. This should rarely have a significant effect on the final
# score, since (a) there are bigram matches most of the time; (b) the
# computation of $gram1Score uses a jackknifing procedure. However, it
# definitely did not compute the correct $gram1Score when there were no
# bigram matches. Therefore, users of version 1.1 should definitely upgrade
# to a newer version of the script that does not contain this bug.
# Note: To use this script, two additional data files are needed:
# (1) smart_common_words.txt - contains stopword list from SMART IR engine
# (2) WordNet-1.6.exc.db - WordNet 1.6 exception inflexion database
# These two files have to be put in a directory pointed by the environment
# variable: "ROUGE_EVAL_HOME".
# If the environment variable ROUGE_EVAL_HOME does not exist, this script
# will assume it can find these two database files in the current directory.
=head1 NAME
XML::DOM::AttDef - A single XML attribute definition in an ATTLIST in XML::DOM
=head1 DESCRIPTION
XML::DOM::AttDef extends L<XML::DOM::Node>, but is not part of the DOM Level 1
specification.
Each object of this class represents one attribute definition in an AttlistDecl.
=head2 METHODS
=over 4
=item getName
Returns the attribute name.
=item getDefault
Returns the default value, or undef.
=item isFixed
Whether the attribute value is fixed (see #FIXED keyword.)
=item isRequired
Whether the attribute value is required (see #REQUIRED keyword.)
=item isImplied
Whether the attribute value is implied (see #IMPLIED keyword.)
=back
=head1 NAME
XML::DOM::AttlistDecl - An XML ATTLIST declaration in XML::DOM
=head1 DESCRIPTION
XML::DOM::AttlistDecl extends L<XML::DOM::Node> but is not part of the
DOM Level 1 specification.
This node represents an ATTLIST declaration, e.g.
<!ATTLIST person
sex (male|female) #REQUIRED
hair CDATA "bold"
eyes (none|one|two) "two"
species (human) #FIXED "human">
Each attribute definition is stored in a separate AttDef node. The AttDef nodes can
be retrieved with getAttDef and added with addAttDef.
(The AttDef nodes are stored in a NamedNodeMap internally.)
=head2 METHODS
=over 4
=item getName
Returns the Element tagName.
=item getAttDef (attrName)
Returns the AttDef node for the attribute with the specified name.
=item addAttDef (attrName, type, default, [ fixed ])
Adds an AttDef node for the attribute with the specified name.
Parameters:
I<attrName> the attribute name.
I<type> the attribute type (e.g. "CDATA" or "(male|female)".)
I<default> the default value enclosed in quotes (!), the string #IMPLIED or
the string #REQUIRED.
I<fixed> whether the attribute is '#FIXED' (default is 0.)
=back
=head1 NAME
XML::DOM::Attr - An XML attribute in XML::DOM
=head1 DESCRIPTION
XML::DOM::Attr extends L<XML::DOM::Node>.
The Attr nodes built by the XML::DOM::Parser always have one child node
which is a Text node containing the expanded string value (i.e. EntityReferences
are always expanded.) EntityReferences may be added when modifying or creating
a new Document.
The Attr interface represents an attribute in an Element object.
Typically the allowable values for the attribute are defined in a
document type definition.
Attr objects inherit the Node interface, but since they are not
actually child nodes of the element they describe, the DOM does not
consider them part of the document tree. Thus, the Node attributes
parentNode, previousSibling, and nextSibling have an undef value for Attr
objects. The DOM takes the view that attributes are properties of
elements rather than having a separate identity from the elements they
are associated with; this should make it more efficient to implement
such features as default attributes associated with all elements of a
given type. Furthermore, Attr nodes may not be immediate children of a
DocumentFragment. However, they can be associated with Element nodes
contained within a DocumentFragment. In short, users and implementors
of the DOM need to be aware that Attr nodes have some things in common
with other objects inheriting the Node interface, but they also are
quite distinct.
The attribute's effective value is determined as follows: if this
attribute has been explicitly assigned any value, that value is the
attribute's effective value; otherwise, if there is a declaration for
this attribute, and that declaration includes a default value, then
that default value is the attribute's effective value; otherwise, the
attribute does not exist on this element in the structure model until
it has been explicitly added. Note that the nodeValue attribute on the
Attr instance can also be used to retrieve the string version of the
attribute's value(s).
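A minimal Python sketch (not XML::DOM code) of the effective-value rule just
described:

    def effective_value(assigned, dtd_default):
        # an explicitly assigned value wins; otherwise a declared default
        # applies; otherwise the attribute is absent from the structure model
        if assigned is not None:
            return assigned
        return dtd_default  # may be None: attribute does not exist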
In XML, where the value of an attribute can contain entity references,
the child nodes of the Attr node provide a representation in which
entity references are not expanded. These child nodes may be either
Text or EntityReference nodes. Because the attribute type may be
unknown, there are no tokenized attribute values.
=head2 METHODS
=over 4
=item getValue
On retrieval, the value of the attribute is returned as a string.
Character and general entity references are replaced with their values.
=item setValue (str)
DOM Spec: On setting, this creates a Text node with the unparsed contents of the
string.
=item getName
Returns the name of this attribute.
=back
=head1 NAME
XML::DOM::CDATASection - Escaping XML text blocks in XML::DOM
=head1 DESCRIPTION
XML::DOM::CDATASection extends L<XML::DOM::CharacterData> which extends
L<XML::DOM::Node>.
CDATA sections are used to escape blocks of text containing characters
that would otherwise be regarded as markup. The only delimiter that is
recognized in a CDATA section is the "]]>" string that ends the CDATA
section. CDATA sections cannot be nested. The primary purpose is for
including material such as XML fragments, without needing to escape all
the delimiters.
The DOMString attribute of the Text node holds the text that is
contained by the CDATA section. Note that this may contain characters
that need to be escaped outside of CDATA sections and that, depending
on the character encoding ("charset") chosen for serialization, it may
be impossible to write out some characters as part of a CDATA section.
The CDATASection interface inherits the CharacterData interface through
the Text interface. Adjacent CDATASections nodes are not merged by use
of the Element.normalize() method.
B<NOTE:> XML::DOM::Parser and XML::DOM::ValParser convert all CDATASections
to regular text by default.
To preserve CDATASections, set the parser option KeepCDATA to 1.
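The same interface exists in Python's xml.dom.minidom (another DOM Level 1
implementation), which can serve as a quick illustration of CDATA behavior:

    from xml.dom.minidom import Document

    doc = Document()
    root = doc.appendChild(doc.createElement("example"))
    root.appendChild(doc.createCDATASection("a < b && c"))
    print(root.toxml())   # <example><![CDATA[a < b && c]]></example>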
=head1 NAME
XML::DOM::CharacterData - Common interface for Text, CDATASections and Comments
=head1 DESCRIPTION
XML::DOM::CharacterData extends L<XML::DOM::Node>
The CharacterData interface extends Node with a set of attributes and
methods for accessing character data in the DOM. For clarity this set
is defined here rather than on each object that uses these attributes
and methods. No DOM objects correspond directly to CharacterData,
though Text, Comment and CDATASection do inherit the interface from it.
All offsets in this interface start from 0.
=head2 METHODS
=over 4
=item getData and setData (data)
The character data of the node that implements this
interface. The DOM implementation may not put arbitrary
limits on the amount of data that may be stored in a
CharacterData node. However, implementation limits may mean
that the entirety of a node's data may not fit into a single
DOMString. In such cases, the user may call substringData to
retrieve the data in appropriately sized pieces.
=item getLength
The number of characters that are available through data and
the substringData method below. This may have the value zero,
i.e., CharacterData nodes may be empty.
=item substringData (offset, count)
Extracts a range of data from the node.
Parameters:
I<offset> Start offset of substring to extract.
I<count> The number of characters to extract.
Return Value: The specified substring. If the sum of offset and count
exceeds the length, then all characters to the end of
the data are returned.
=item appendData (str)
Appends the string to the end of the character data of the
node. Upon success, data provides access to the concatenation
of data and the DOMString specified.
=item insertData (offset, arg)
Inserts a string at the specified character offset.
Parameters:
I<offset> The character offset at which to insert.
I<arg> The DOMString to insert.
=item deleteData (offset, count)
Removes a range of characters from the node.
Upon success, data and length reflect the change.
If the sum of offset and count exceeds length then all characters
from offset to the end of the data are deleted.
Parameters:
I<offset> The offset from which to remove characters.
I<count> The number of characters to delete.
=item replaceData (offset, count, arg)
Replaces the characters starting at the specified character
offset with the specified string.
Parameters:
I<offset> The offset from which to start replacing.
I<count> The number of characters to replace.
I<arg> The DOMString with which the range must be replaced.
If the sum of offset and count exceeds length, then all characters to the end of
the data are replaced (i.e., the effect is the same as a remove method call with
the same range, followed by an append method invocation). A sketch of this
clamping behavior follows the method list below.
=back
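A short Python sketch (illustrative only; DOM error checks such as
INDEX_SIZE_ERR are omitted) of the offset/count clamping shared by
substringData, deleteData, and replaceData:

    def substring_data(data, offset, count):
        return data[offset:offset + count]      # slicing clamps at the end

    def replace_data(data, offset, count, arg):
        # delete the (clamped) range, then insert arg at offset
        return data[:offset] + arg + data[offset + count:]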
=head1 NAME
XML::DOM::Comment - An XML comment in XML::DOM
=head1 DESCRIPTION
XML::DOM::Comment extends L<XML::DOM::CharacterData> which extends
L<XML::DOM::Node>.
This node represents the content of a comment, i.e., all the characters
between the starting '<!--' and ending '-->'. Note that this is the
definition of a comment in XML, and, in practice, HTML, although some
HTML tools may implement the full SGML comment structure.
######################################################################
package XML::DOM::DOMException;
######################################################################
use Exporter;
use overload '""' => \&stringify;
use vars qw ( @ISA @EXPORT @ErrorNames );
BEGIN
{
@ISA = qw( Exporter );
@EXPORT = qw( INDEX_SIZE_ERR
DOMSTRING_SIZE_ERR
HIERARCHY_REQUEST_ERR
WRONG_DOCUMENT_ERR
INVALID_CHARACTER_ERR
NO_DATA_ALLOWED_ERR
NO_MODIFICATION_ALLOWED_ERR
NOT_FOUND_ERR
NOT_SUPPORTED_ERR
INUSE_ATTRIBUTE_ERR
);
}
sub UNKNOWN_ERR () {0;} # not in the DOM Spec!
sub INDEX_SIZE_ERR () {1;}
sub DOMSTRING_SIZE_ERR () {2;}
sub HIERARCHY_REQUEST_ERR () {3;}
sub WRONG_DOCUMENT_ERR () {4;}
sub INVALID_CHARACTER_ERR () {5;}
sub NO_DATA_ALLOWED_ERR () {6;}
sub NO_MODIFICATION_ALLOWED_ERR () {7;}
sub NOT_FOUND_ERR () {8;}
sub NOT_SUPPORTED_ERR () {9;}
sub INUSE_ATTRIBUTE_ERR () {10;}
@ErrorNames = (
"UNKNOWN_ERR",
"INDEX_SIZE_ERR",
"DOMSTRING_SIZE_ERR",
"HIERARCHY_REQUEST_ERR",
"WRONG_DOCUMENT_ERR",
"INVALID_CHARACTER_ERR",
"NO_DATA_ALLOWED_ERR",
"NO_MODIFICATION_ALLOWED_ERR",
"NOT_FOUND_ERR",
"NOT_SUPPORTED_ERR",
"INUSE_ATTRIBUTE_ERR"
);
sub new
{
my ($type, $code, $msg) = @_;
my $self = bless {Code => $code}, $type;
$self->{Message} = $msg if defined $msg;
# print "=> Exception: " . $self->stringify . "\n";
$self;
}
sub getCode
{
$_[0]->{Code};
}
#------------------------------------------------------------
# Extra method implementations
sub getName
{
$ErrorNames[$_[0]->{Code}];
}
sub getMessage
{
$_[0]->{Message};
}
sub stringify
{
my $self = shift;
"XML::DOM::DOMException(Code=" . $self->getCode . ", Name=" .
$self->getName . ", Message=" . $self->getMessage . ")";
}
1; # package return code
=head1 NAME
XML::DOM::DOMImplementation - Information about XML::DOM implementation
=head1 DESCRIPTION
The DOMImplementation interface provides a number of methods for
performing operations that are independent of any particular instance
of the document object model.
The DOM Level 1 does not specify a way of creating a document instance,
and hence document creation is an operation specific to an
implementation. Future Levels of the DOM specification are expected to
provide methods for creating documents directly.
=head2 METHODS
=over 4
=item hasFeature (feature, version)
Returns 1 if and only if feature equals "XML" and version equals "1.0".
=back
=head1 NAME
XML::DOM::Document - An XML document node in XML::DOM
=head1 DESCRIPTION
XML::DOM::Document extends L<XML::DOM::Node>.
It is the main root of the XML document structure as returned by
XML::DOM::Parser::parse and XML::DOM::Parser::parsefile.
Since elements, text nodes, comments, processing instructions, etc.
cannot exist outside the context of a Document, the Document interface
also contains the factory methods needed to create these objects. The
Node objects created have a getOwnerDocument method which associates
them with the Document within whose context they were created.
=head2 METHODS
=over 4
=item getDocumentElement
This is a convenience method that allows direct access to
the child node that is the root Element of the document.
=item getDoctype
The Document Type Declaration (see DocumentType) associated
with this document. For HTML documents as well as XML
documents without a document type declaration this returns
undef. The DOM Level 1 does not support editing the Document
Type Declaration.
B<Not In DOM Spec>: This implementation allows editing the doctype.
See I<XML::DOM::ignoreReadOnly> for details.
=item getImplementation
The DOMImplementation object that handles this document. A
DOM application may use objects from multiple implementations.
=item createElement (tagName)
Creates an element of the type specified. Note that the
instance returned implements the Element interface, so
attributes can be specified directly on the returned object.
DOMExceptions:
=over 4
=item * INVALID_CHARACTER_ERR
Raised if the tagName does not conform to the XML spec.
=back
=item createTextNode (data)
Creates a Text node given the specified string.
=item createComment (data)
Creates a Comment node given the specified string.
=item createCDATASection (data)
Creates a CDATASection node given the specified string.
=item createAttribute (name [, value [, specified ]])
Creates an Attr of the given name. Note that the Attr
instance can then be set on an Element using the setAttribute method.
B<Not In DOM Spec>: The DOM Spec does not allow passing the value or the
specified property in this method. In this implementation they are optional.
Parameters:
I<value> The attribute's value. See Attr::setValue for details.
If the value is not supplied, the specified property is set to 0.
I<specified> Whether the attribute value was specified or whether the default
value was used. If not supplied, it's assumed to be 1.
DOMExceptions:
=over 4
=item * INVALID_CHARACTER_ERR
Raised if the name does not conform to the XML spec.
=back
=item createProcessingInstruction (target, data)
Creates a ProcessingInstruction node given the specified name and data strings.
Parameters:
I<target> The target part of the processing instruction.
I<data> The data for the node.
DOMExceptions:
=over 4
=item * INVALID_CHARACTER_ERR
Raised if the target does not conform to the XML spec.
=back
=item createDocumentFragment
Creates an empty DocumentFragment object.
=item createEntityReference (name)
Creates an EntityReference object.
=back
=head2 Additional methods not in the DOM Spec
=over 4
=item getXMLDecl and setXMLDecl (xmlDecl)
Returns the XMLDecl for this Document or undef if none was specified.
Note that XMLDecl is not part of the list of child nodes.
=item setDoctype (doctype)
Sets or replaces the DocumentType.
B<NOTE>: Don't use appendChild or insertBefore to set the DocumentType.
Even though doctype will be part of the list of child nodes, it is handled
specially.
=item getDefaultAttrValue (elem, attr)
Returns the default attribute value as a string or undef, if none is available.
Parameters:
I<elem> The element tagName.
I<attr> The attribute name.
=item getEntity (name)
Returns the Entity with the specified name.
=item createXMLDecl (version, encoding, standalone)
Creates an XMLDecl object. All parameters may be undefined.
=item createDocumentType (name, sysId, pubId)
Creates a DocumentType object. SysId and pubId may be undefined.
=item createNotation (name, base, sysId, pubId)
Creates a new Notation object. Consider using
XML::DOM::DocumentType::addNotation!
=item createEntity (parameter, notationName, value, sysId, pubId, ndata)
Creates an Entity object. Consider using XML::DOM::DocumentType::addEntity!
=item createElementDecl (name, model)
Creates an ElementDecl object.
DOMExceptions:
=over 4
=item * INVALID_CHARACTER_ERR
Raised if the element name (tagName) does not conform to the XML spec.
=back
=item createAttlistDecl (name)
Creates an AttlistDecl object.
DOMExceptions:
=over 4
=item * INVALID_CHARACTER_ERR
Raised if the element name (tagName) does not conform to the XML spec.
=back
=item expandEntity (entity [, parameter])
Expands the specified entity or parameter entity (if parameter=1) and returns
its value as a string, or undef if the entity does not exist.
(The entity name should not contain the '%', '&' or ';' delimiters.)
=item check ( [$checker] )
Uses the specified L<XML::Checker> to validate the document.
If no XML::Checker is supplied, a new XML::Checker is created.
See L<XML::Checker> for details.
=item check_sax ( [$checker] )
Similar to check() except it uses the SAX interface to XML::Checker instead of
the expat interface. This method may disappear or replace check() at some time.
=item createChecker ()
Creates an XML::Checker based on the document's DTD.
The $checker can be reused to check any elements within the document.
Create a new L<XML::Checker> whenever the DOCTYPE section of the document
is altered!
=back
=head1 NAME
XML::DOM::DocumentFragment - Facilitates cut & paste in XML::DOM documents
=head1 DESCRIPTION
XML::DOM::DocumentFragment extends L<XML::DOM::Node>
DocumentFragment is a "lightweight" or "minimal" Document object. It is
very common to want to be able to extract a portion of a document's
tree or to create a new fragment of a document. Imagine implementing a
user command like cut or rearranging a document by moving fragments
around. It is desirable to have an object which can hold such fragments
and it is quite natural to use a Node for this purpose. While it is
true that a Document object could fulfil this role, a Document object
can potentially be a heavyweight object, depending on the underlying
implementation. What is really needed for this is a very lightweight
object. DocumentFragment is such an object.
Furthermore, various operations -- such as inserting nodes as children
of another Node -- may take DocumentFragment objects as arguments; this
results in all the child nodes of the DocumentFragment being moved to
the child list of this node.
The children of a DocumentFragment node are zero or more nodes
representing the tops of any sub-trees defining the structure of the
document. DocumentFragment nodes do not need to be well-formed XML
documents (although they do need to follow the rules imposed upon
well-formed XML parsed entities, which can have multiple top nodes).
For example, a DocumentFragment might have only one child and that
child node could be a Text node. Such a structure model represents
neither an HTML document nor a well-formed XML document.
When a DocumentFragment is inserted into a Document (or indeed any
other Node that may take children) the children of the DocumentFragment
and not the DocumentFragment itself are inserted into the Node. This
makes the DocumentFragment very useful when the user wishes to create
nodes that are siblings; the DocumentFragment acts as the parent of
these nodes so that the user can use the standard methods from the Node
interface, such as insertBefore() and appendChild().
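The cut-and-paste pattern described above can be seen with Python's
xml.dom.minidom (another DOM Level 1 implementation): appending a fragment
moves its children, not the fragment itself.

    from xml.dom.minidom import Document

    doc = Document()
    root = doc.appendChild(doc.createElement("list"))
    frag = doc.createDocumentFragment()
    for name in ("a", "b", "c"):
        frag.appendChild(doc.createElement(name))
    root.appendChild(frag)    # the three elements become children of root
    print(root.toxml())       # <list><a/><b/><c/></list>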
=head1 NAME
XML::DOM::DocumentType - An XML document type (DTD) in XML::DOM
=head1 DESCRIPTION
XML::DOM::DocumentType extends L<XML::DOM::Node>.
Each Document has a doctype attribute whose value is either null or a
DocumentType object. The DocumentType interface in the DOM Level 1 Core
provides an interface to the list of entities that are defined for the
document, and little else because the effect of namespaces and the
various XML scheme efforts on DTD representation are not clearly
understood as of this writing.
The DOM Level 1 doesn't support editing DocumentType nodes.
B<Not In DOM Spec>: This implementation has added a lot of extra
functionality to the DOM Level 1 interface.
To allow editing of the DocumentType nodes, see XML::DOM::ignoreReadOnly.
=head2 METHODS
=over 4
=item getName
Returns the name of the DTD, i.e. the name immediately following the
DOCTYPE keyword.
=item getEntities
A NamedNodeMap containing the general entities, both external
and internal, declared in the DTD. Duplicates are discarded.
For example in:
<!DOCTYPE ex SYSTEM "ex.dtd" [
<!ENTITY foo "foo">
<!ENTITY bar "bar">
<!ENTITY % baz "baz">
]>
<ex/>
the interface provides access to foo and bar but not baz.
Every node in this map also implements the Entity interface.
The DOM Level 1 does not support editing entities, therefore
entities cannot be altered in any way.
B<Not In DOM Spec>: See XML::DOM::ignoreReadOnly to edit the DocumentType etc.
=item getNotations
A NamedNodeMap containing the notations declared in the DTD.
Duplicates are discarded. Every node in this map also
implements the Notation interface.
The DOM Level 1 does not support editing notations, therefore
notations cannot be altered in any way.
B<Not In DOM Spec>: See XML::DOM::ignoreReadOnly to edit the DocumentType etc.
=back
=head2 Additional methods not in the DOM Spec
=over 4
=item Creating and setting the DocumentType
A new DocumentType can be created with:
$doctype = $doc->createDocumentType ($name, $sysId, $pubId, $internal);
To set (or replace) the DocumentType for a particular document, use:
$doc->setDoctype ($doctype);
=item getSysId and setSysId (sysId)
Returns or sets the system id.
=item getPubId and setPubId (pubId)
Returns or sets the public id.
=item setName (name)
Sets the name of the DTD, i.e. the name immediately following the
DOCTYPE keyword. Note that this should always be the same as the element
tag name of the root element.
=item getAttlistDecl (elemName)
Returns the AttlistDecl for the Element with the specified name, or undef.
=item getElementDecl (elemName)
Returns the ElementDecl for the Element with the specified name, or undef.
=item getEntity (entityName)
Returns the Entity with the specified name, or undef.
=item addAttlistDecl (elemName)
Adds a new AttDecl node with the specified elemName if one doesn't exist yet.
Returns the AttlistDecl (new or existing) node.
=item addElementDecl (elemName, model)
Adds a new ElementDecl node with the specified elemName and model if one doesn't
exist yet.
Returns the ElementDecl (new or existing) node. The model is ignored if one
already existed.
=item addEntity (notationName, value, sysId, pubId, ndata, parameter)
Adds a new Entity node. Don't use createEntity and appendChild, because it should
be added to the internal NamedNodeMap containing the entities.
Parameters:
I<notationName> the entity name.
I<value> the entity value.
I<sysId> the system id (if any.)
I<pubId> the public id (if any.)
I<ndata> the NDATA declaration (if any, for general unparsed entities.)
I<parameter> whether it is a parameter entity (%ent;) or not (&ent;).
SysId, pubId and ndata may be undefined.
DOMExceptions:
=over 4
=item * INVALID_CHARACTER_ERR
Raised if the notationName does not conform to the XML spec.
=back
=item addNotation (name, base, sysId, pubId)
Adds a new Notation object.
Parameters:
I<name> the notation name.
I<base> the base to be used for resolving a relative URI.
I<sysId> the system id.
I<pubId> the public id.
Base, sysId, and pubId may all be undefined.
(These parameters are passed by the XML::Parser Notation handler.)
DOMExceptions:
=over 4
=item * INVALID_CHARACTER_ERR
Raised if the notationName does not conform to the XML spec.
=back
=item addAttDef (elemName, attrName, type, default, fixed)
Adds a new attribute definition. It will add the AttDef node to the AttlistDecl
if it exists. If an AttDef with the specified attrName already exists for the
given elemName, this function only generates a warning.
See XML::DOM::AttDef::new for the other parameters.
=item getDefaultAttrValue (elem, attr)
Returns the default attribute value as a string or undef, if none is available.
Parameters:
I<elem> The element tagName.
I<attr> The attribute name.
=item expandEntity (entity [, parameter])
Expands the specified entity or parameter entity (if parameter=1) and returns
its value as a string, or undef if the entity does not exist.
(The entity name should not contain the '%', '&' or ';' delimiters.)
=back
=head1 NAME
XML::DOM::Element - An XML element node in XML::DOM
=head1 DESCRIPTION
XML::DOM::Element extends L<XML::DOM::Node>.
By far the vast majority of objects (apart from text) that authors
encounter when traversing a document are Element nodes. Assume the
following XML document:
<elementExample id="demo">
<subelement1/>
<subelement2><subsubelement/></subelement2>
</elementExample>
When represented using DOM, the top node is an Element node for
"elementExample", which contains two child Element nodes, one for
"subelement1" and one for "subelement2". "subelement1" contains no
child nodes.
Elements may have attributes associated with them; since the Element
interface inherits from Node, the generic Node interface method
getAttributes may be used to retrieve the set of all attributes for an
element. There are methods on the Element interface to retrieve either
an Attr object by name or an attribute value by name. In XML, where an
attribute value may contain entity references, an Attr object should be
retrieved to examine the possibly fairly complex sub-tree representing
the attribute value. On the other hand, in HTML, where all attributes
have simple string values, methods to directly access an attribute
value can safely be used as a convenience.
=head2 METHODS
=over 4
=item getTagName
The name of the element. For example, in:
<elementExample id="demo">
...
</elementExample>
tagName has the value "elementExample". Note that this is
case-preserving in XML, as are all of the operations of the
DOM.
=item getAttribute (name)
Retrieves an attribute value by name.
Return Value: The Attr value as a string, or the empty string if that
attribute does not have a specified or default value.
=item setAttribute (name, value)
Adds a new attribute. If an attribute with that name is
already present in the element, its value is changed to be
that of the value parameter. This value is a simple string,
it is not parsed as it is being set. So any markup (such as
syntax to be recognized as an entity reference) is treated as
literal text, and needs to be appropriately escaped by the
implementation when it is written out. In order to assign an
attribute value that contains entity references, the user
must create an Attr node plus any Text and EntityReference
nodes, build the appropriate subtree, and use
setAttributeNode to assign it as the value of an attribute.
DOMExceptions:
=over 4
=item * INVALID_CHARACTER_ERR
Raised if the specified name contains an invalid character.
=item * NO_MODIFICATION_ALLOWED_ERR
Raised if this node is readonly.
=back
=item removeAttribute (name)
Removes an attribute by name. If the removed attribute has a
default value it is immediately replaced.
DOMExceptions:
=over 4
=item * NO_MODIFICATION_ALLOWED_ERR
Raised if this node is readonly.
=back
=item getAttributeNode
Retrieves an Attr node by name.
Return Value: The Attr node with the specified attribute name or undef
if there is no such attribute.
=item setAttributeNode (attr)
Adds a new attribute. If an attribute with that name is
already present in the element, it is replaced by the new one.
Return Value: If the newAttr attribute replaces an existing attribute
with the same name, the previously existing Attr node is
returned, otherwise undef is returned.
DOMExceptions:
=over 4
=item * WRONG_DOCUMENT_ERR
Raised if newAttr was created from a different document than the one that created
the element.
=item * NO_MODIFICATION_ALLOWED_ERR
Raised if this node is readonly.
=item * INUSE_ATTRIBUTE_ERR
Raised if newAttr is already an attribute of another Element object. The DOM
user must explicitly clone Attr nodes to re-use them in other elements.
=back
=item removeAttributeNode (oldAttr)
Removes the specified attribute. If the removed Attr has a default value it is
immediately replaced. If the Attr already is the default value, nothing happens
and nothing is returned.
Parameters:
I<oldAttr> The Attr node to remove from the attribute list.
Return Value: The Attr node that was removed.
DOMExceptions:
=over 4
=item * NO_MODIFICATION_ALLOWED_ERR
Raised if this node is readonly.
=item * NOT_FOUND_ERR
Raised if oldAttr is not an attribute of the element.
=back
=back
=head2 Additional methods not in the DOM Spec
=over 4
=item setTagName (newTagName)
Sets the tag name of the Element. Note that this method is not portable
between DOM implementations.
DOMExceptions:
=over 4
=item * INVALID_CHARACTER_ERR
Raised if the specified name contains an invalid character.
=back
=item check ($checker)
Uses the specified L<XML::Checker> to validate the document.
NOTE: an XML::Checker must be supplied. The checker can be created in
different ways, e.g. when parsing a document with XML::DOM::ValParser,
or with XML::DOM::Document::createChecker().
See L<XML::Checker> for more info.
=back
=head1 NAME
XML::DOM::ElementDecl - An XML ELEMENT declaration in XML::DOM
=head1 DESCRIPTION
XML::DOM::ElementDecl extends L<XML::DOM::Node> but is not part of the
DOM Level 1 specification.
This node represents an Element declaration, e.g.
<!ELEMENT address (street+, city, state, zip, country?)>
=head2 METHODS
=over 4
=item getName
Returns the Element tagName.
=item getModel and setModel (model)
Returns and sets the model as a string, e.g.
"(street+, city, state, zip, country?)" in the above example.
=back
=head1 NAME
XML::DOM::Entity - An XML ENTITY in XML::DOM
=head1 DESCRIPTION
XML::DOM::Entity extends L<XML::DOM::Node>.
This node represents an Entity declaration, e.g.
<!ENTITY % draft 'INCLUDE'>
<!ENTITY hatch-pic SYSTEM "../grafix/OpenHatch.gif" NDATA gif>
The first one is called a parameter entity and is referenced like this: %draft;
The 2nd is a (regular) entity and is referenced like this: &hatch-pic;
=head2 METHODS
=over 4
=item getNotationName
Returns the name of the notation for the entity.
I<Not Implemented> The DOM Spec says: For unparsed entities, the name of the
notation for the entity. For parsed entities, this is null.
(This implementation does not support unparsed entities.)
=item getSysId
Returns the system id, or undef.
=item getPubId
Returns the public id, or undef.
=back
=head2 Additional methods not in the DOM Spec
=over 4
=item isParameterEntity
Whether it is a parameter entity (%ent;) or not (&ent;)
=item getValue
Returns the entity value.
=item getNdata
Returns the NDATA declaration (for general unparsed entities), or undef.
=back
=head1 NAME
XML::DOM::EntityReference - An XML ENTITY reference in XML::DOM
=head1 DESCRIPTION
XML::DOM::EntityReference extends L<XML::DOM::Node>.
EntityReference objects may be inserted into the structure model when
an entity reference is in the source document, or when the user wishes
to insert an entity reference. Note that character references and
references to predefined entities are considered to be expanded by the
HTML or XML processor so that characters are represented by their
Unicode equivalent rather than by an entity reference. Moreover, the
XML processor may completely expand references to entities while
building the structure model, instead of providing EntityReference
objects. If it does provide such objects, then for a given
EntityReference node, it may be that there is no Entity node
representing the referenced entity; but if such an Entity exists, then
the child list of the EntityReference node is the same as that of the
Entity node. As with the Entity node, all descendants of the
EntityReference are readonly.
The resolution of the children of the EntityReference (the replacement
value of the referenced Entity) may be lazily evaluated; actions by the
user (such as calling the childNodes method on the EntityReference
node) are assumed to trigger the evaluation.
######################################################################
package XML::DOM::NamedNodeMap;
######################################################################
use strict;
use Carp;
use XML::DOM::DOMException;
use XML::DOM::NodeList;
use vars qw( $Special );
# Constant definition:
# Note: a real Name should have at least 1 char, so nobody else should use this
$Special = "";
sub new
{
my ($class, %args) = @_;
$args{Values} = new XML::DOM::NodeList;
# Store all NamedNodeMap properties in element $Special
bless { $Special => \%args}, $class;
}
sub getNamedItem
{
# Don't return the $Special item!
($_[1] eq $Special) ? undef : $_[0]->{$_[1]};
}
sub setNamedItem
{
my ($self, $node) = @_;
my $prop = $self->{$Special};
my $name = $node->getNodeName;
if ($XML::DOM::SafeMode)
{
croak new XML::DOM::DOMException (NO_MODIFICATION_ALLOWED_ERR)
if $self->isReadOnly;
croak new XML::DOM::DOMException (WRONG_DOCUMENT_ERR)
if $node->[XML::DOM::Node::_Doc] != $prop->{Doc};
croak new XML::DOM::DOMException (INUSE_ATTRIBUTE_ERR)
if defined ($node->[XML::DOM::Node::_UsedIn]);
croak new XML::DOM::DOMException (INVALID_CHARACTER_ERR,
"can't add name with NodeName [$name] to NamedNodeMap")
if $name eq $Special;
}
my $values = $prop->{Values};
my $index = -1;
my $prev = $self->{$name};
if (defined $prev)
{
# decouple previous node
$prev->decoupleUsedIn;
# find index of $prev
$index = 0;
for my $val (@{$values})
{
last if ($val == $prev);
$index++;
}
}
$self->{$name} = $node;
$node->[XML::DOM::Node::_UsedIn] = $self;
if ($index == -1)
{
push (@{$values}, $node);
}
else # replace previous node with new node
{
splice (@{$values}, $index, 1, $node);
}
$prev;
}
sub removeNamedItem
{
my ($self, $name) = @_;
# Be careful that user doesn't delete $Special node!
croak new XML::DOM::DOMException (NOT_FOUND_ERR)
if $name eq $Special;
my $node = $self->{$name};
croak new XML::DOM::DOMException (NOT_FOUND_ERR)
unless defined $node;
# The DOM Spec doesn't mention this Exception - I think it's an oversight
croak new XML::DOM::DOMException (NO_MODIFICATION_ALLOWED_ERR)
if $self->isReadOnly;
$node->decoupleUsedIn;
delete $self->{$name};
# remove node from Values list
my $values = $self->getValues;
my $index = 0;
for my $val (@{$values})
{
if ($val == $node)
{
splice (@{$values}, $index, 1, ());
last;
}
$index++;
}
$node;
}
# The following 2 are really bogus. DOM should use an iterator instead (Clark)
sub item
{
my ($self, $item) = @_;
$self->{$Special}->{Values}->[$item];
}
sub getLength
{
my ($self) = @_;
my $vals = $self->{$Special}->{Values};
int (@$vals);
}
#------------------------------------------------------------
# Extra method implementations
sub isReadOnly
{
return 0 if $XML::DOM::IgnoreReadOnly;
my $used = $_[0]->{$Special}->{UsedIn};
defined $used ? $used->isReadOnly : 0;
}
sub cloneNode
{
my ($self, $deep) = @_;
my $prop = $self->{$Special};
my $map = new XML::DOM::NamedNodeMap (Doc => $prop->{Doc});
# Not copying Parent property on purpose!
local $XML::DOM::IgnoreReadOnly = 1; # temporarily...
for my $val (@{$prop->{Values}})
{
my $key = $val->getNodeName;
my $newNode = $val->cloneNode ($deep);
$newNode->[XML::DOM::Node::_UsedIn] = $map;
$map->{$key} = $newNode;
push (@{$map->{$Special}->{Values}}, $newNode);
}
$map;
}
sub setOwnerDocument
{
my ($self, $doc) = @_;
my $special = $self->{$Special};
$special->{Doc} = $doc;
for my $kid (@{$special->{Values}})
{
$kid->setOwnerDocument ($doc);
}
}
sub getChildIndex
{
my ($self, $attr) = @_;
my $i = 0;
for my $kid (@{$self->{$Special}->{Values}})
{
return $i if $kid == $attr;
$i++;
}
-1; # not found
}
sub getValues
{
wantarray ? @{ $_[0]->{$Special}->{Values} } : $_[0]->{$Special}->{Values};
}
# Remove circular dependencies. The NamedNodeMap and its values should
# not be used afterwards.
sub dispose
{
my $self = shift;
for my $kid (@{$self->getValues})
{
undef $kid->[XML::DOM::Node::_UsedIn]; # was delete
$kid->dispose;
}
delete $self->{$Special}->{Doc};
delete $self->{$Special}->{Parent};
delete $self->{$Special}->{Values};
for my $key (keys %$self)
{
delete $self->{$key};
}
}
sub setParentNode
{
$_[0]->{$Special}->{Parent} = $_[1];
}
sub getProperty
{
$_[0]->{$Special}->{$_[1]};
}
#?? remove after debugging
sub toString
{
my ($self) = @_;
my $str = "NamedNodeMap[";
while (my ($key, $val) = each %$self)
{
if ($key eq $Special)
{
$str .= "##Special (";
while (my ($k, $v) = each %$val)
{
if ($k eq "Values")
{
$str .= $k . " => [";
for my $a (@$v)
{
# $str .= $a->getNodeName . "=" . $a . ",";
$str .= $a->toString . ",";
}
$str .= "], ";
}
else
{
$str .= $k . " => " . $v . ", ";
}
}
$str .= "), ";
}
else
{
$str .= $key . " => " . $val . ", ";
}
}
$str . "]";
}
1; # package return code
=head1 NAME
XML::DOM::NamedNodeMap - A hash table interface for XML::DOM
=head1 DESCRIPTION
Objects implementing the NamedNodeMap interface are used to represent
collections of nodes that can be accessed by name. Note that
NamedNodeMap does not inherit from NodeList; NamedNodeMaps are not
maintained in any particular order. Objects contained in an object
implementing NamedNodeMap may also be accessed by an ordinal index, but
this is simply to allow convenient enumeration of the contents of a
NamedNodeMap, and does not imply that the DOM specifies an order to
these Nodes.
Note that in this implementation, the objects added to a NamedNodeMap
are kept in order.
=head2 METHODS
=over 4
=item getNamedItem (name)
Retrieves a node specified by name.
Return Value: A Node (of any type) with the specified name, or undef if
the specified name did not identify any node in the map.
=item setNamedItem (arg)
Adds a node using its nodeName attribute.
As the nodeName attribute is used to derive the name which
the node must be stored under, multiple nodes of certain
types (those that have a "special" string value) cannot be
stored as the names would clash. This is seen as preferable
to allowing nodes to be aliased.
Parameters:
I<arg> A node to store in a named node map.
The node will later be accessible using the value of the nodeName
attribute of the node. If a node with that name is
already present in the map, it is replaced by the new one.
Return Value: If the new Node replaces an existing node with the same
name the previously existing Node is returned, otherwise undef is returned.
DOMExceptions:
=over 4
=item * WRONG_DOCUMENT_ERR
Raised if arg was created from a different document than the one that
created the NamedNodeMap.
=item * NO_MODIFICATION_ALLOWED_ERR
Raised if this NamedNodeMap is readonly.
=item * INUSE_ATTRIBUTE_ERR
Raised if arg is an Attr that is already an attribute of another Element object.
The DOM user must explicitly clone Attr nodes to re-use them in other elements.
=back
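For example, a minimal sketch (the variables $map and $doc and the
attribute name "id" are illustrative only; $map is assumed to be the
NamedNodeMap returned by an Element's getAttributes, $doc its owner
Document):
  my $attr = $doc->createAttribute ("id");
  $attr->setValue ("n42");
  my $prev = $map->setNamedItem ($attr);  # undef, or the node it replaced
  my $same = $map->getNamedItem ("id");   # retrieves $attr by name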
=item removeNamedItem (name)
Removes a node specified by name. If the removed node is an
Attr with a default value it is immediately replaced.
Return Value: The node removed from the map or undef if no node with
such a name exists.
DOMException:
=over 4
=item * NOT_FOUND_ERR
Raised if there is no node named name in the map.
=back
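E.g., continuing the sketch above:
  my $gone = $map->removeNamedItem ("id"); # returns the removed node
  # croaks with NOT_FOUND_ERR if no node named "id" is in the map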
=item item (index)
Returns the indexth item in the map. If index is greater than
or equal to the number of nodes in the map, this returns undef.
Return Value: The node at the indexth position in the NamedNodeMap, or
undef if that is not a valid index.
=item getLength
Returns the number of nodes in the map. The range of valid child node
indices is 0 to length-1 inclusive.
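Together with item, this allows plain enumeration of the map, e.g.:
  for my $i (0 .. $map->getLength - 1)
  {
      my $node = $map->item ($i);
      print $node->getNodeName, "\n";
  }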
=back
=head2 Additional methods not in the DOM Spec
=over 4
=item getValues
Returns a NodeList with the nodes contained in the NamedNodeMap.
The NodeList is "live", in that it reflects changes made to the NamedNodeMap.
When this method is called in a list context, it returns a regular perl list
containing the values. Note that this list is not "live". E.g.
@list = $map->getValues; # returns a perl list
$nodelist = $map->getValues; # returns a NodeList (object ref.)
for my $val ($map->getValues) # iterate over the values
=item getChildIndex (node)
Returns the index of the node in the NodeList as returned by getValues, or -1
if the node is not in the NamedNodeMap.
=item dispose
Removes all circular references in this NamedNodeMap and its descendants so the
objects can be claimed for garbage collection. The objects should not be used
afterwards.
=back
=head1 NAME
XML::DOM::Node - Super class of all nodes in XML::DOM
=head1 DESCRIPTION
XML::DOM::Node is the super class of all nodes in an XML::DOM document.
This means that all nodes that subclass XML::DOM::Node also inherit all
the methods that XML::DOM::Node implements.
=head2 GLOBAL VARIABLES
=over 4
=item @NodeNames
The variable @XML::DOM::Node::NodeNames maps the node type constants to strings.
It is used by XML::DOM::Node::getNodeTypeName.
=back
=head2 METHODS
=over 4
=item getNodeType
Return an integer indicating the node type. See XML::DOM constants.
=item getNodeName
Return a property or a hardcoded string, depending on the node type.
Here are the corresponding functions or values:
Attr getName
AttDef getName
AttlistDecl getName
CDATASection "#cdata-section"
Comment "#comment"
Document "#document"
DocumentType getNodeName
DocumentFragment "#document-fragment"
Element getTagName
ElementDecl getName
EntityReference getEntityName
Entity getNotationName
Notation getName
ProcessingInstruction getTarget
Text "#text"
XMLDecl "#xml-declaration"
B<Not In DOM Spec>: AttDef, AttlistDecl, ElementDecl and XMLDecl were added for
completeness.
=item getNodeValue and setNodeValue (value)
Returns a string or undef, depending on the node type. This method is provided
for completeness. In other languages it saves the programmer an upcast.
The value is either available thru some other method defined in the subclass, or
else undef is returned. Here are the corresponding methods:
Attr::getValue, Text::getData, CDATASection::getData, Comment::getData,
ProcessingInstruction::getData.
=item getParentNode and setParentNode (parentNode)
The parent of this node. All nodes, except Document,
DocumentFragment, and Attr may have a parent. However, if a
node has just been created and not yet added to the tree, or
if it has been removed from the tree, this is undef.
=item getChildNodes
A NodeList that contains all children of this node. If there
are no children, this is a NodeList containing no nodes. The
content of the returned NodeList is "live" in the sense that,
for instance, changes to the children of the node object that
it was created from are immediately reflected in the nodes
returned by the NodeList accessors; it is not a static
snapshot of the content of the node. This is true for every
NodeList, including the ones returned by the
getElementsByTagName method.
NOTE: this implementation does not return a "live" NodeList for
getElementsByTagName. See L<CAVEATS>.
When this method is called in a list context, it returns a regular perl list
containing the child nodes. Note that this list is not "live". E.g.
@list = $node->getChildNodes; # returns a perl list
$nodelist = $node->getChildNodes; # returns a NodeList (object reference)
for my $kid ($node->getChildNodes) # iterate over the children of $node
=item getFirstChild
The first child of this node. If there is no such node, this returns undef.
=item getLastChild
The last child of this node. If there is no such node, this returns undef.
=item getPreviousSibling
The node immediately preceding this node. If there is no such
node, this returns undef.
=item getNextSibling
The node immediately following this node. If there is no such node, this returns
undef.
=item getAttributes
A NamedNodeMap containing the attributes (Attr nodes) of this node
(if it is an Element) or undef otherwise.
Note that adding/removing attributes from the returned object, also adds/removes
attributes from the Element node that the NamedNodeMap came from.
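A minimal sketch (assuming $elem is an Element that has an "id"
attribute; the names are illustrative):
  my $attrs = $elem->getAttributes;      # a NamedNodeMap, or undef
  my $id = $attrs->getNamedItem ("id");  # the Attr node named "id"
  $attrs->removeNamedItem ("id");        # also removes it from $elem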
=item getOwnerDocument
The Document object associated with this node. This is also
the Document object used to create new nodes. When this node
is a Document this is undef.
=item insertBefore (newChild, refChild)
Inserts the node newChild before the existing child node
refChild. If refChild is undef, insert newChild at the end of
the list of children.
If newChild is a DocumentFragment object, all of its children
are inserted, in the same order, before refChild. If the
newChild is already in the tree, it is first removed.
Return Value: The node being inserted.
DOMExceptions:
=over 4
=item * HIERARCHY_REQUEST_ERR
Raised if this node is of a type that does not allow children of the type of
the newChild node, or if the node to insert is one of this node's ancestors.
=item * WRONG_DOCUMENT_ERR
Raised if newChild was created from a different document than the one that
created this node.
=item * NO_MODIFICATION_ALLOWED_ERR
Raised if this node is readonly.
=item * NOT_FOUND_ERR
Raised if refChild is not a child of this node.
=back
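E.g., a sketch (assuming $parent already has a child node $ref; the
tag name "item" is illustrative):
  my $new = $doc->createElement ("item");
  $parent->insertBefore ($new, $ref);    # $new now precedes $ref
  $parent->insertBefore ($doc->createComment ("last"), undef); # undef refChild appends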
=item replaceChild (newChild, oldChild)
Replaces the child node oldChild with newChild in the list of
children, and returns the oldChild node. If the newChild is
already in the tree, it is first removed.
Return Value: The node replaced.
DOMExceptions:
=over 4
=item * HIERARCHY_REQUEST_ERR
Raised if this node is of a type that does not allow children of the type of
the newChild node, or if the node to put in is one of this node's ancestors.
=item * WRONG_DOCUMENT_ERR
Raised if newChild was created from a different document than the one that
created this node.
=item * NO_MODIFICATION_ALLOWED_ERR
Raised if this node is readonly.
=item * NOT_FOUND_ERR
Raised if oldChild is not a child of this node.
=back
=item removeChild (oldChild)
Removes the child node indicated by oldChild from the list of
children, and returns it.
Return Value: The node removed.
DOMExceptions:
=over 4
=item * NO_MODIFICATION_ALLOWED_ERR
Raised if this node is readonly.
=item * NOT_FOUND_ERR
Raised if oldChild is not a child of this node.
=back
=item appendChild (newChild)
Adds the node newChild to the end of the list of children of
this node. If the newChild is already in the tree, it is
first removed. If it is a DocumentFragment object, the entire contents of
the document fragment are moved into the child list of this node.
Return Value: The node added.
DOMExceptions:
=over 4
=item * HIERARCHY_REQUEST_ERR
Raised if this node is of a type that does not allow children of the type of
the newChild node, or if the node to append is one of this node's ancestors.
=item * WRONG_DOCUMENT_ERR
Raised if newChild was created from a different document than the one that
created this node.
=item * NO_MODIFICATION_ALLOWED_ERR
Raised if this node is readonly.
=back
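E.g. (tag name illustrative):
  my $kid = $doc->createElement ("item");
  $parent->appendChild ($kid);
  $other->appendChild ($kid);  # $kid is first removed from $parent, then appended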
=item hasChildNodes
This is a convenience method to allow easy determination of
whether a node has any children.
Return Value: 1 if the node has any children, 0 otherwise.
=item cloneNode (deep)
Returns a duplicate of this node, i.e., serves as a generic
copy constructor for nodes. The duplicate node has no parent
(parentNode returns undef).
Cloning an Element copies all attributes and their values,
including those generated by the XML processor to represent
defaulted attributes, but this method does not copy any text
it contains unless it is a deep clone, since the text is
contained in a child Text node. Cloning any other type of
node simply returns a copy of this node.
Parameters:
I<deep> If true, recursively clone the subtree under the specified node.
If false, clone only the node itself (and its attributes, if it is an Element).
Return Value: The duplicate node.
=item normalize
Puts all Text nodes in the full depth of the sub-tree
underneath this Element into a "normal" form where only
markup (e.g., tags, comments, processing instructions, CDATA
sections, and entity references) separates Text nodes, i.e.,
there are no adjacent Text nodes. This can be used to ensure
that the DOM view of a document is the same as if it were
saved and re-loaded, and is useful when operations (such as
XPointer lookups) that depend on a particular document tree
structure are to be used.
B<Not In DOM Spec>: In the DOM Spec this method is defined in the Element and
Document class interfaces only, but it doesn't hurt to have it here...
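A short sketch of the effect:
  $elem->appendChild ($doc->createTextNode ("foo"));
  $elem->appendChild ($doc->createTextNode ("bar"));
  $elem->normalize;  # the two adjacent Text nodes merge into one "foobar" node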
=item getElementsByTagName (name [, recurse])
Returns a NodeList of all descendant elements with a given
tag name, in the order in which they would be encountered in
a preorder traversal of the Element tree.
Parameters:
I<name> The name of the tag to match on. The special value "*" matches all tags.
I<recurse> Whether it should return only direct child nodes (0) or any descendant that matches the tag name (1). This argument is optional and defaults to 1. It is not part of the DOM spec.
Return Value: A list of matching Element nodes.
NOTE: this implementation does not return a "live" NodeList for
getElementsByTagName. See L<CAVEATS>.
When this method is called in a list context, it returns a regular perl list
containing the result nodes. E.g.
@list = $node->getElementsByTagName("tag"); # returns a perl list
$nodelist = $node->getElementsByTagName("tag"); # returns a NodeList (object ref.)
for my $elem ($node->getElementsByTagName("tag")) # iterate over the result nodes
=back
=head2 Additional methods not in the DOM Spec
=over 4
=item getNodeTypeName
Return the string describing the node type.
E.g. returns "ELEMENT_NODE" if getNodeType returns ELEMENT_NODE.
It uses @XML::DOM::Node::NodeNames.
=item toString
Returns the entire subtree as a string.
=item printToFile (filename)
Prints the entire subtree to the file with the specified filename.
Croaks: if the file could not be opened for writing.
=item printToFileHandle (handle)
Prints the entire subtree to the file handle.
E.g. to print to STDOUT:
$node->printToFileHandle (\*STDOUT);
=item print (obj)
Prints the entire subtree using the object's print method. E.g to print to a
FileHandle object:
$f = new FileHandle ("file.out", "w");
$node->print ($f);
=item getChildIndex (child)
Returns the index of the child node in the list returned by getChildNodes.
Return Value: the index or -1 if the node is not found.
=item getChildAtIndex (index)
Returns the child node at the specified index or undef.
=item addText (text)
Appends the specified string to the last child if it is a Text node, or else
appends a new Text node (with the specified text).
Return Value: the last child if it was a Text node or else the new Text node.
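E.g.:
  $elem->addText ("Hello");
  $elem->addText (", world");  # appended to the existing Text node, not a new one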
=item dispose
Removes all circular references in this node and its descendants so the
objects can be claimed for garbage collection. The objects should not be used
afterwards.
=item setOwnerDocument (doc)
Sets the ownerDocument property of this node and all its children (and
attributes etc.) to the specified document.
This allows the user to cut and paste document subtrees between different
XML::DOM::Documents. The node should be removed from the original document
first, before calling setOwnerDocument.
This method does nothing when called on a Document node.
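A sketch of cut-and-paste between documents (assuming $node currently
lives in $oldDoc and still has a parent there):
  $node->getParentNode->removeChild ($node);  # detach from the old document first
  $node->setOwnerDocument ($newDoc);
  $newDoc->getDocumentElement->appendChild ($node);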
=item isAncestor (parent)
Returns 1 if parent is an ancestor of this node or if it is this node itself.
=item expandEntityRefs (str)
Expands all the entity references in the string and returns the result.
The entity references can be character references (e.g. "&#123;" or "&#x1fc2;"),
default entity references ("&quot;", "&gt;", "&lt;", "&apos;" and "&amp;") or
entity references defined in Entity objects as part of the DocumentType of
the owning Document. Character references are expanded into UTF-8.
Parameter entity references (e.g. %ent;) are not expanded.
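E.g.:
  my $str = $node->expandEntityRefs ("a &lt; b &#38; c");  # yields "a < b & c"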
=item to_sax ( %HANDLERS )
E.g.
$node->to_sax (DocumentHandler => $my_handler,
Handler => $handler2 );
%HANDLERS may contain the following handlers:
=over 4
=item * DocumentHandler
=item * DTDHandler
=item * EntityResolver
=item * Handler
Default handler when one of the above is not specified
=back
Each XML::DOM::Node generates the appropriate SAX callbacks (for the
appropriate SAX handler). Different SAX handlers can be plugged in to
accomplish different things, e.g. L<XML::Checker> would check the node
(currently only Document and Element nodes are supported), L<XML::Handler::BuildDOM>
would create a new DOM subtree (thereby, in essence, copying the Node)
and in the near future, XML::Writer could print the node.
All Perl SAX related work is still in flux, so this interface may change a
little.
See PerlSAX for the description of the SAX interface.
=item check ( [$checker] )
See descriptions for check() in L<XML::DOM::Document> and L<XML::DOM::Element>.
=item xql ( @XQL_OPTIONS )
To use the xql method, you must first I<use> L<XML::XQL> and L<XML::XQL::DOM>.
This method is basically a shortcut for:
$query = new XML::XQL::Query ( @XQL_OPTIONS );
return $query->solve ($node);
If the first parameter in @XQL_OPTIONS is the XQL expression, you can leave off
the 'Expr' keyword, so:
$node->xql ("doc//elem1[@attr]", @other_options);
is identical to:
$node->xql (Expr => "doc//elem1[@attr]", @other_options);
See L<XML::XQL::Query> for other available XQL_OPTIONS.
See L<XML::XQL> and L<XML::XQL::Tutorial> for more info.
=item isHidden ()
Whether the node is hidden.
See L<Hidden Nodes|XML::DOM/_Hidden_Nodes_> for details.
=back
######################################################################
package XML::DOM::NodeList;
######################################################################
use vars qw ( $EMPTY );
# Empty NodeList
$EMPTY = new XML::DOM::NodeList;
sub new
{
bless [], $_[0];
}
sub item
{
$_[0]->[$_[1]];
}
sub getLength
{
int (@{$_[0]});
}
#------------------------------------------------------------
# Extra method implementations
sub dispose
{
my $self = shift;
for my $kid (@{$self})
{
$kid->dispose;
}
}
sub setOwnerDocument
{
my ($self, $doc) = @_;
for my $kid (@{$self})
{
$kid->setOwnerDocument ($doc);
}
}
1; # package return code