Running sample code in this directory requires PaddelPaddle v0.10.0 and later. If the PaddlePaddle on your device is lower than this version, please follow the instructions in [installation document](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html) and make an update.
---

# Chinese Ancient Poetry Generation

## Introduction
On an encoder-decoder neural network model, we perform sequence-to-sequence training with The Complete Tang Poems. The model should generate the verse after the given input verse.

The encoders and decoders in the model all use a stacked bi-directional LSTM which, by default, has three layers and attention.

The following is a brief directory structure and description of this example：

```text
.
├── data                 # store training data and dictionary
│   ├── download.sh      # download raw data
├── README.md            # documentation
├── index.html           # document (html format)
├── preprocess.py        # raw data preprocessing
├── generate.py          # generate verse script
├── network_conf.py      # model definition
├── reader.py            # data reading interface
├── train.py             # training script
└── utils.py             # define utility functions
```

## Data Processing
### Raw Data Source
The training data of this example is The Complete Tang Poems in the [Chinese ancient poetry database](https://github.com/chinese-poetry/chinese-poetry). There are about 54,000 Tang poems.

### Downloading Raw Data
```bash
cd data && ./download.sh && cd ..
```
### Data Preprocessing
```bash
python preprocess.py --datadir data/raw --outfile data/poems.txt --dictfile data/dict.txt
```

After the above script is executed, the processed training data "poems.txt" and dictionary "dict.txt" will be generated. The dictionary's unit is word, and it is constructed by words with a frequency of at least 10.

Divided into three columns, each line in poems.txt contains the title, author, and content of a poem. Verses of a poem are separated by`.`.

Training data example:
```text
登鸛雀樓  王之渙  白日依山盡.黃河入海流.欲窮千里目.更上一層樓
觀獵      李白   太守耀清威.乘閑弄晚暉.江沙橫獵騎.山火遶行圍.箭逐雲鴻落.鷹隨月兔飛.不知白日暮.歡賞夜方歸
晦日重宴  陳嘉言  高門引冠蓋.下客抱支離.綺席珍羞滿.文場翰藻摛.蓂華彫上月.柳色藹春池.日斜歸戚里.連騎勒金羈
```

When the model is trained, each verse is used as a model input, and the next verse is used as a prediction target.


## Model Training
The command line arguments in the training script, ["train.py"](./train.py), can be viewed with `python train.py --help`. The main parameters are as follows:
- `num_passes`: number of passes
- `batch_size`: batch size
- `use_gpu`: whether to use GPU
- `trainer_count`: number of trainers, the default is 1
- `save_dir_path`: model storage path, the default is the current directory under the models directory
- `encoder_depth`: model encoder LSTM depth, default 3
- `decoder_depth`: model decoder LSTM depth, default 3
- `train_data_path`: training data path
- `word_dict_path`: data dictionary path
- `init_model_path`: initial model path, no need to specify at the start of training

### Training Execution
```bash
python train.py \
    --num_passes 50 \
    --batch_size 256 \
    --use_gpu True \
    --trainer_count 1 \
    --save_dir_path models \
    --train_data_path data/poems.txt \
    --word_dict_path data/dict.txt \
    2>&1 | tee train.log
```
After each pass training, the model parameters are saved under directory "models". Training logs are stored in "train.log".

### Optimal Model Parameters
Find the pass with the lowest cost and use the model parameters corresponding to the pass for subsequent prediction.
```bash
python -c 'import utils; utils.find_optiaml_pass("./train.log")'
```

## Generating Verses
Use the ["generate.py"](./generate.py) script to generate the next verse for the input verses. Command line arguments can be viewed with `python generate.py --help`.
The main parameters are described as follows:
- `model_path`: trained model parameter file
- `word_dict_path`: data dictionary path
- `test_data_path`: input data path
- `batch_size`: batch size, default is 1
- `beam_size`: search size in beam search, the default is 5
- `save_file`: output save path
- `use_gpu`: whether to use GPU

### Perform Generation
For example, save the verse `孤帆遠影碧空盡` in the file `input.txt` as input. To predict the next sentence, execute the command:
```bash
python generate.py \
    --model_path models/pass_00049.tar.gz \
    --word_dict_path data/dict.txt \
    --test_data_path input.txt \
    --save_file output.txt
```
The result will be saved in the file "output.txt". For the above example input, the generated verses are as follows:
```text
-9.6987     萬 壑 清 風 黃 葉 多
-10.0737    萬 里 遠 山 紅 葉 深
-10.4233    萬 壑 清 波 紅 一 流
-10.4802    萬 壑 清 風 黃 葉 深
-10.9060    萬 壑 清 風 紅 葉 多
```