diff --git a/ernie-vil/.meta/ernie_vil_struct.png b/ernie-vil/.meta/ernie_vil_struct.png
new file mode 100644
index 0000000000000000000000000000000000000000..cfa72e6116a2d2393f2bf12f25c98a66545c0698
Binary files /dev/null and b/ernie-vil/.meta/ernie_vil_struct.png differ
diff --git a/ernie-vil/README.md b/ernie-vil/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..c14475630ca7596dd638a839a6b7f42d27ff5bfa
--- /dev/null
+++ b/ernie-vil/README.md
@@ -0,0 +1,136 @@
+English | [简体中文](./README_zh.md)
+
+## _ERNIE-ViL_: Knowledge Enhanced Vision-Language Representations Through Scene Graph
+- [Framework](#framework)
+- [Pre-trained models](#pre-trained-models)
+- [Downstream tasks](#downstream-tasks)
+  * [VCR](#VCR)
+- [Usage](#usage)
+  * [Install PaddlePaddle](#install-paddlepaddle)
+  * [Fine-tuning on ERNIE-ViL](#fine-tuning-on-ernie-vil)
+  * [Inference](#inference)
+- [Citation](#citation)
+
+For a technical description of the algorithm, please see our paper:
+
+>[_**ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph**_](https://arxiv.org/abs/2006.16934)
+>
+>Fei Yu\*, Jiji Tang\*, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang (\* : equal contribution)
+>
+>Preprint June 2020
+>
+
+![ERNIE-ViL](https://img.shields.io/badge/Pretraining-vision_and_language_joint_representations-green)
+![VQA](https://img.shields.io/badge/VQA-Visual_Question_Answering-yellow)
+![VCR](https://img.shields.io/badge/VCR-Visual_Commonsense_Reasoning-blue) ![RefCOCO+](https://img.shields.io/badge/RefCOCO+-Region_to_Phrase_Grounding-green)
+![IRTR](https://img.shields.io/badge/IR_&TR-Image_Retrieval&_Text_Retrieval-yellowgreen)
+
+**[ERNIE-ViL](https://arxiv.org/abs/2006.16934) is a knowledge-enhanced joint representation framework for vision-language tasks**, and it is the first work that has **introduced structured knowledge to enhance vision-language pre-training**. Utilizing structured knowledge obtained
+from scene graphs, ERNIE-ViL constructs three **Scene Graph Prediction tasks**, i.e., the **Object Prediction**, **Attribute Prediction** and **Relationship Prediction** tasks.
+Thus, ERNIE-ViL learns better joint vision-language representations that characterize the alignment of detailed semantics across vision and language.
+
+
+
+## Framework
+
+Based on the scene graph parsed from the text with a Scene Graph Parser, we construct the Object Prediction, Attribute Prediction and Relationship Prediction tasks (a minimal masking sketch is given after the figure below):
+- **Object Prediction:** We randomly select a set of objects in the scene graph, then mask and predict the corresponding words in the sentence.
+- **Attribute Prediction:** For the object-attribute pairs in the scene graph, we randomly select a part of them, then mask and predict the words related to the attribute nodes in the sentence.
+- **Relationship Prediction:** For the object-relationship-object triplets in the scene graph, we randomly select a part of the relationship nodes, then mask and predict them.
+
+![ernie_vil_struct](.meta/ernie_vil_struct.png)
+Model Architecture of ERNIE-ViL
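The snippet below is a minimal, editorial sketch of the scene-graph-guided masking idea described above and is not part of the released code: the caption, the hand-written scene graph and the `[MASK]` handling are illustrative assumptions, while the actual pipeline relies on a Scene Graph Parser and the model's tokenizer.

```python
import random

# Toy caption and a hand-written "scene graph" for it (illustrative only).
caption = "a brown dog chases a white cat".split()
scene_graph = {
    "objects": ["dog", "cat"],                           # object nodes
    "attributes": [("brown", "dog"), ("white", "cat")],  # attribute-object pairs
    "relationships": [("dog", "chases", "cat")],         # object-relationship-object triplets
}

def mask_tokens(tokens, words_to_mask, mask_token="[MASK]"):
    """Replace every occurrence of the selected words with the mask token."""
    return [mask_token if tok in words_to_mask else tok for tok in tokens]

# Object Prediction: mask the words of randomly selected object nodes.
masked_objects = set(random.sample(scene_graph["objects"], 1))
print(mask_tokens(caption, masked_objects))

# Attribute Prediction: mask the attribute word of a sampled attribute-object pair.
attribute, _ = random.choice(scene_graph["attributes"])
print(mask_tokens(caption, {attribute}))

# Relationship Prediction: mask the relationship word of a sampled triplet.
_, relation, _ = random.choice(scene_graph["relationships"])
print(mask_tokens(caption, {relation}))
```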
+
+
+## Pre-trained Models
+ERNIE-ViL adopts large-scale image-text aligned datasets as the pre-training data. We provide ERNIE-ViL models of two scale settings, which are pretrained on [**Conceptual Captions**](https://www.aclweb.org/anthology/P18-1238.pdf) and [**SBU Captions**](http://papers.nips.cc/paper/4470-im2text-describing-images-using-1-million-captio):
+
+- [**ERNIE-ViL _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-en.1.tar.gz) (_lowercased | 12-text-stream-layer, 6-visual-stream-layer_)
+- [**ERNIE-ViL _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-en.1.tar.gz) (_lowercased | 24-text-stream-layer, 6-visual-stream-layer_)
+
+## Downstream tasks
+We finetune ERNIE-ViL on five vision-language downstream tasks, i.e., Visual Commonsense Reasoning ([**VCR**](https://openaccess.thecvf.com/content_CVPR_2019/papers/Zellers_From_Recognition_to_Cognition_Visual_Commonsense_Reasoning_CVPR_2019_paper.pdf)),
+Visual Question Answering ([**VQA**](https://openaccess.thecvf.com/content_iccv_2015/papers/Antol_VQA_Visual_Question_ICCV_2015_paper.pdf)),
+Cross-modal Image Retrieval ([**IR**](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00166)),
+Cross-modal Text Retrieval ([**TR**](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00166)) and
+Region-to-Phrase Grounding ([**RefCOCO+**](https://www.aclweb.org/anthology/D14-1086.pdf)).
+
+_Code and pre-trained models for the VCR task are public now; those for more downstream tasks are planned to be released._
+
+### VCR
+  * Datasets
+    * The training, validation and testing data of the VCR task are provided by the [**VCR website**](https://visualcommonsense.com/download/).
+    * The organization of visual features follows [**ViLBERT**](https://github.com/jiasenlu/vilbert_beta), so we directly use its data, which can be downloaded [here](https://github.com/jiasenlu/vilbert_beta/tree/master/data).
+    * Put all downloaded files under the directory "data/vcr".
+
+
+  * Task pre-training: We perform task pre-training (also known as task-specific pre-training) on the VCR task. The trained models are as follows:
+    * [**ERNIE-ViL-VCR-task-pretrain _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-VCR-task-pre-en.1.tar.gz)
+    * [**ERNIE-ViL-VCR-task-pretrain _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-VCR-task-pre-en.1.tar.gz)
+  * Performance: Results of ERNIE-ViL on the VCR task, compared with the previous state-of-the-art pre-trained model ([**VILLA**](https://arxiv.org/pdf/2006.06195.pdf)); a sketch of how the joint Q->AR metric is derived from the two sub-tasks follows the table.
+
+    | Models                              |     Q->A     |    QA->R     |    Q->AR     |
+    | :-----------------------------------| :----------: | :----------: | :----------: |
+    | VILLA (task-pretrain) _base_        | 75.54(76.4)  | 78.78(79.1)  | 59.75(60.6)  |
+    | ERNIE-ViL (task-pretrain) _base_    | 76.37(77.0)  | 79.65(80.3)  | 61.24(62.1)  |
+    | VILLA (task-pretrain) _large_       | 78.45(78.9)  | 82.57(82.8)  | 65.18(65.7)  |
+    | ERNIE-ViL (task-pretrain) _large_   | 78.52(79.2)  | 83.37(83.5)  | 65.81(66.3)  |
+
+   _Numerical results outside and inside parentheses represent the dev and test performance of the VCR task respectively.
+   Test results are obtained from the [**VCR leaderboard**](https://visualcommonsense.com/leaderboard/)._
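As a clarification of the Q->AR column: an example is counted as correct only when both the answer (Q->A) and the rationale given the answer (QA->R) are predicted correctly. The sketch below shows this combination, assuming two result files in the tab-separated format written by `finetune.py` (`qid`, prediction, label, scores) that index examples by the same annotation ids; the file names are placeholders, not outputs of the released scripts.

```python
def load_results(path):
    """Parse a result file written by finetune.py: qid \t prediction \t label \t scores."""
    correct = {}
    with open(path) as f:
        for line in f:
            qid, pred, label = line.strip().split("\t")[:3]
            correct[qid] = (pred == label)
    return correct

# Placeholder result files produced by run_inference.sh for the two sub-tasks.
qa_correct = load_results("res_vcr_qa")    # Q->A predictions
qar_correct = load_results("res_vcr_qar")  # QA->R predictions

# Q->AR is correct only if both sub-tasks are correct for the same example.
joint = [qa_correct[qid] and qar_correct[qid] for qid in qa_correct if qid in qar_correct]
print("Q->AR accuracy: %.4f" % (sum(joint) / float(len(joint))))
```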
+
+
+## Usage
+
+### Install PaddlePaddle
+
+This code has been tested with Paddle Fluid 1.8 and Python 2.7. Other dependencies of ERNIE-ViL are listed in `requirements.txt`; you can install them with
+ ```script
+ pip install -r requirements.txt
+ ```
+
+### Fine-tuning on ERNIE-ViL
+Please add the CUDA, cuDNN and NCCL2 library paths to LD_LIBRARY_PATH before fine-tuning. You can easily run fine-tuning through
+configuration files. For example, you can finetune the ERNIE-ViL model on the VCR task with
+```script
+  sh run_finetuning.sh vcr conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $pretrain_models
+```
+All files needed for fine-tuning can be found in the download links given above, including the vocabulary file, the configuration
+file and the pre-trained parameters. Note that our fine-tuning experiments on VCR are carried out on 4 NVIDIA V100 (32GB) GPUs.
+If your GPU memory is not enough, you can reduce the batch size in the corresponding configuration file, e.g., "conf/vcr/model_conf_vcr".
+
+
+
+### Inference
+
+  You can use the following commands to run inference with fine-tuned models. For example, you can infer VCR models with the following commands for the different sub-tasks:
+
+  **Task Q->A**
+
+  ```script
+  sh run_inference.sh vcr qa $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
+  ```
+  **Task QA->R**
+
+  ```script
+  sh run_inference.sh vcr qar $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file
+  ```
+
+
+
+
+## Citation
+
+You can cite the paper as below:
+
+```
+@article{yu2020ernie,
+  title={ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph},
+  author={Yu, Fei and Tang, Jiji and Yin, Weichong and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng},
+  journal={arXiv preprint arXiv:2006.16934},
+  year={2020}
+}
+
+```
+
diff --git a/ernie-vil/README_zh.md b/ernie-vil/README_zh.md
new file mode 100644
index 0000000000000000000000000000000000000000..149fc6a93d71d078c62eef141091ce986420c97e
--- /dev/null
+++ b/ernie-vil/README_zh.md
@@ -0,0 +1,132 @@
+
+[English](./README.md) | 简体中文
+
+## _ERNIE-ViL_: Knowledge Enhanced Vision-Language Representations Through Scene Graph
+- [模型框架](#模型框架)
+- [预训练模型](#预训练模型)
+- [下游任务](#下游任务)
+  * [视觉推理](#视觉推理)
+- [使用说明](#使用说明)
+  * [安装飞桨](#安装飞桨)
+  * [运行微调](#运行微调)
+  * [预测](#预测)
+- [引用](#引用)
+
+关于算法的详细描述,请参见我们的论文
+
+>[_**ERNIE-ViL:Knowledge Enhanced Vision-Language Representations Through Scene Graph**_](https://arxiv.org/abs/2006.16934)
+>
+>Fei Yu\*, Jiji Tang\*, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang (\* : equal contribution)
+>
+>Preprint June 2020
+>
+![ERNIE-ViL](https://img.shields.io/badge/预训练-视觉语言联合表示-green)![VQA](https://img.shields.io/badge/视觉问答-VQA-yellow) ![VCR](https://img.shields.io/badge/视觉常识推理-VCR-blue) ![RefCOCO](https://img.shields.io/badge/引用表达式理解-RefCOCO+-green) ![IRTR](https://img.shields.io/badge/跨模态检索-IR&TR-yellowgreen)
+
+
+---
+**ERNIE-ViL
+是面向视觉-语言任务的知识增强预训练框架**,首次在视觉-语言预训练中引入了结构化的知识。ERNIE-ViL利用场景图中的结构化知识,构建了**物体预测,属性预测,关系预测**三种预训练任务,精细地刻画了视觉-语言模态之间细粒度语义的对齐,从而获得了更好的视觉-语言联合表示。
+
+## 模型框架
+
+基于文本中解析出的场景图,ERNIE-ViL提出了三个多模态场景图预测任务:
+- **物体预测**:随机选取图中的一部分物体,然后对其在句子中对应的词进行掩码和预测;
+- **属性预测**:对于场景图中的属性-物体组合,随机选取一部分词对其中属性词进行掩码和预测;
+- **关系预测**:对于场景图中的物体-关系-物体三元组,对其中的关系词进行掩码和预测。
+
+![ernie_vil_struct](.meta/ernie_vil_struct.png)
+
+ERNIE-ViL 场景图预训练任务结构
+
+## 预训练模型
+
+
+ERNIE-ViL使用大规模图文对齐数据集作为预训练数据,基于[**Conceptual
+Captions**](https://www.aclweb.org/anthology/P18-1238.pdf)和[**SBU
+Captions**](http://papers.nips.cc/paper/4470-im2text-describing-images-using-1-million-captio)数据集,训练和发布了两种参数规模的模型:
+
+- [**ERNIE-ViL _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-en.1.tar.gz) (_lowercased | 12-text-stream-layer, 6-visual-stream-layer_)
+- [**ERNIE-ViL _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-en.1.tar.gz) (_lowercased | 24-text-stream-layer, 6-visual-stream-layer_)
+
+## 下游任务
+ 
+ERNIE-ViL在五个视觉语言下游任务进行了实验,包括[**视觉常识推理**](https://openaccess.thecvf.com/content_CVPR_2019/papers/Zellers_From_Recognition_to_Cognition_Visual_Commonsense_Reasoning_CVPR_2019_paper.pdf), +[**视觉问答**](https://openaccess.thecvf.com/content_iccv_2015/papers/Antol_VQA_Visual_Question_ICCV_2015_paper.pdf), +[**跨模态图片检索**](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00166), +[**跨模态文本检索**](https://www.mitpressjournals.org/doi/abs/10.1162/tacl_a_00166), +[**引用式理解**](https://www.aclweb.org/anthology/D14-1086.pdf)。 + +_当前仅开源视觉常识推理任务相关模型和代码,后续计划开源更多下游任务的模型和代码。_ + + +### **视觉常识推理** + * 数据集合 + * 训练、验证和测试集合相关数据由[**视觉常识推理官网**](http://visualcommonsense.com/download/)提供; + * 视觉端特征的组织方式借鉴[**ViLBERT**](https://github.com/jiasenlu/vilbert_beta), 因此项目直接使用**ViLBERT**中的数据,数据[下载地址](https://github.com/jiasenlu/vilbert_beta/tree/master/data); + * 将所有获取的文件放在 data/vcr 目录下; + + + * 任务预训练: 在视觉推理任务中进行了任务预训练,预训练获得模型如下 + * [**ERNIE-ViL-VCR-task-pretrain _base_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-base-VCR-task-pre-en.1.tar.gz) + * [**ERNIE-ViL-VCR-task-pretrain _large_**](https://ernie-github.cdn.bcebos.com/model-ernie-vil-large-VCR-task-pre-en.1.tar.gz) + * 效果: ERNIE-ViL与之前最优预训练模型[**VILLA**](https://arxiv.org/pdf/2006.06195.pdf)在视觉常识推理任务上的效果对比如下: + + | 模型 | Q->A | QA->R | Q->AR | + | :---------------------------------- | :---------------------------: | :----------------------------: | :---------------------------: | + | VILLA (task-pretrain) _base_ | 75.54(76.4) | 78.78(79.1) | 59.75(60.6) | + | ERNIE-ViL (task-pretrain) _base_ | 76.37(77.0) | 79.65(80.3) | 61.24(62.1) | + | VILLA (task-pretrain) _large_ | 78.45(78.9) | 82.57(82.8) | 65.18(65.7) | + | ERNIE-ViL (task-pretrain) _large_ | 78.52(79.2) | 83.37(83.5) | 65.81(66.3) | + + _注:括号外表示验证集效果,括号内表示测试集效果,测试集效果由[VCR榜单](https://visualcommonsense.com/leaderboard/)提供。_ + + +## 使用说明 + +### 安装飞桨 + +ERNIE-ViL代码基于Paddle Fluid 1.8 和 Python 2.7, 依赖的其他模块也列举在 requirements.txt,可以通过下面的指令安装: + ```script + pip install -r requirements.txt + ``` +### 运行微调 +在运行 ERNIE-ViL 前,需要将 CUDA 、cuDNN 、NCCL2 的动态库路径添加到 LD_LIBRARY_PATH 。 我们把下游任务的参数配置文件放到了 conf/ ,可以简单地通过配置文件运行。 例如,您可以通过下面的指令在VCR上任务上进行微调: +```script + sh run_finetuning.sh vcr conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $pretrain_models_params +``` +前面提供的模型链接中包含了所有需要的文件, 包含词表文件,配置文件和预训练参数。VCR任务的微调实验是在 4 张32 GB 的英伟达V100 GPU上运行,如果您的GPU显存不够,可以考虑八张卡运行或者减小配置中的batch_size。 +_我们目前开放了预训练模型和VCR的任务代码,其他的下游任务可以参考任务自主尝试。_ + +### 预测 +基于已经训练的模型,您可以通过下面的命令测试VCR的效果: + + **Task Q->A** + + ```script + sh run_inference.sh vcr qa $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file + ``` + **Task QA->R** + + ```script + sh run_inference.sh vcr qar $split(val/test) conf/vcr/model_conf_vcr $vocab_file $ernie_vil_config $model_params $res_file + ``` + + + VCR的测试可以在一张32GB的英伟达V100 GPU上运行,测试的结果包含Q->A 任务、QA->R任务和Q->AR任务,其中Q->AR任务由前两个任务结果合并所得。 + + + +## 引用 + +可以按下面的格式引用我们的论文: + +``` +@article{yu2020ernie, + title={ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph}, + author={Yu, Fei and Tang, Jiji and Yin, Weichong and Sun, Yu and Tian, Hao and Wu, Hua and Wang, Haifeng}, + journal={arXiv preprint arXiv:2006.16934}, + year={2020} +} + +``` + diff --git a/ernie-vil/args/__init__.py b/ernie-vil/args/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/ernie-vil/args/finetune_args.py b/ernie-vil/args/finetune_args.py new file mode 100644 index 
0000000000000000000000000000000000000000..dd034c673bfe0bceee053f293eff9fc8fba36c15 --- /dev/null +++ b/ernie-vil/args/finetune_args.py @@ -0,0 +1,79 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" args defination and default value """ + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import os +import time +import argparse + +from utils.args import ArgumentGroup, print_arguments + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +model_g = ArgumentGroup(parser, "model", "model configuration and paths.") +model_g.add_arg("ernie_config_path", str, "./config/ernie_config.json", "json file path for ernie model config.") +model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.") +model_g.add_arg("checkpoints", str, "checkpoints", "Path to save checkpoints.") +model_g.add_arg("task_name", str, "vcr", "Task to finetune on ERNIE-ViL") + +train_g = ArgumentGroup(parser, "training", "training options.") +train_g.add_arg("epoch", int, 100, "Number of epoches for training.") +train_g.add_arg("learning_rate", float, 0.0001, "Learning rate used to train with warmup.") +train_g.add_arg("lr_scheduler", str, "linear_warmup_decay", + "scheduler of learning rate.", choices=['linear_warmup_decay', 'noam_decay', 'manual_warmup_decay']) +train_g.add_arg("decay_steps", str, "", "learning rate decay steps, list with ;") +train_g.add_arg("lr_decay_ratio", float, 0.1, "learning rate decay ratio, used with manual_warmup_decay") +train_g.add_arg("weight_decay", float, 0.01, "Weight decay rate for L2 regularizer.") +train_g.add_arg("num_train_steps", int, 1000000, "Total steps to perform pretraining.") +train_g.add_arg("warmup_steps", int, 0, "Total steps to perform warmup when pretraining.") +train_g.add_arg("save_steps", int, 100, "The steps interval to save checkpoints.") +train_g.add_arg("validation_steps", int, 6000, "The steps interval to evaluate model performance.") +train_g.add_arg("use_fuse", bool, False, "Whether to use fuse_allreduce_ops.") +train_g.add_arg("nccl_comm_num", int, 1, "NCCL comm num.") +train_g.add_arg("hierarchical_allreduce_inter_nranks", int, 8, "Hierarchical allreduce inter ranks.") +train_g.add_arg("use_hierarchical_allreduce", bool, False, "Use hierarchical allreduce or not.") +train_g.add_arg("use_gpu", bool, True, "Whether to gpu.") + +log_g = ArgumentGroup(parser, "logging", "logging related.") +log_g.add_arg("skip_steps", int, 10, "The steps interval to print loss.") +log_g.add_arg("verbose", bool, False, "Whether to output verbose log.") + +data_g = ArgumentGroup(parser, "data", "Data paths, vocab paths and data processing options") +data_g.add_arg("result_file", str, "./res_tmp", "file to storage results") +data_g.add_arg("lr_decay_dict_file", str, "", "learning rate decay files.") +data_g.add_arg("train_filelist", str, "", "Path to training filelist.") +data_g.add_arg("valid_filelist", str, "", "Path 
to valid filelist.") +data_g.add_arg("test_filelist", str, "", "Path to test filelist.") +data_g.add_arg("vocab_path", str, "./config/vocab.txt", "Vocabulary path.") +data_g.add_arg("test_split", str, "val", "test of sub tasks, val or test") +data_g.add_arg("max_seq_len", int, 128, "Number of words of the longest seqence.") +data_g.add_arg("max_img_len", int, 100, "Number of image rois of the longest seqence.") +data_g.add_arg("feature_size", int, 2048, "Number of roi feature size of image.") +data_g.add_arg("fusion_method", str, "sum", "Number of roi feature size of image.") +data_g.add_arg("batch_size", int, 16, "Total examples' number in batch for training. see also --in_tokens.") +data_g.add_arg("task_group_json", str, "", "Path to task json") + +run_type_g = ArgumentGroup(parser, "run_type", "running type options.") +run_type_g.add_arg("is_distributed", bool, False, "If set, then start distributed training.") +run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.") +run_type_g.add_arg("use_fast_executor", bool, False, "If set, use fast parallel executor (in experiment).") +run_type_g.add_arg("do_train", bool, False, "Whether to perform evaluation on test data set.") +run_type_g.add_arg("do_test", bool, False, "Whether to perform evaluation on test data set.") +run_type_g.add_arg("output_file", str, "", "The output file to save model output.") +# yapf: enable diff --git a/ernie-vil/batching/__init__.py b/ernie-vil/batching/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/ernie-vil/batching/finetune_batching.py b/ernie-vil/batching/finetune_batching.py new file mode 100644 index 0000000000000000000000000000000000000000..c9527bfd2d5570e567bd171b250db44202ba2b11 --- /dev/null +++ b/ernie-vil/batching/finetune_batching.py @@ -0,0 +1,97 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +""" prepare data format for finetuning tasks """ + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np + +from six.moves import xrange + + +def prepare_batch_data(batch_records, num_choice, pad_id, task_index, task_num): + """ + prepare batch data for finetuning tasks + """ + batch_input_ids = [] + batch_input_pos = [] + batch_seg_ids = [] + batch_input_masks = [] + num_sample = len(batch_records) + batch_lens = [record["input_lens"] for record in batch_records] + batch_labels = [record["target"] for record in batch_records] + binary_labels = np.zeros([num_choice * num_sample, 1], dtype='float32') + for i, l in enumerate(batch_labels): + binary_labels[i * num_choice + l] = 1.0 + labels = np.array(batch_labels).astype("int64").reshape([-1, 1]) + image_features = [record["features"] for record in batch_records] + image_boxes = [record["boxes"] for record in batch_records] + batch_anno_ids = np.array([record["anno_id"] for record in batch_records]).astype("int64").reshape([-1, 1]) + max_len = max([max(lens) for lens in batch_lens]) + for i in range(len(batch_records)): + batch_input_ids.append([inst + list([pad_id] * (max_len - len(inst))) \ + for inst in batch_records[i]["input_ids"]]) + batch_input_pos.append([inst + list([pad_id] * (max_len - len(inst))) \ + for inst in batch_records[i]["input_pos"]]) + batch_seg_ids.append([inst + list([pad_id] * (max_len - len(inst))) \ + for inst in batch_records[i]["segment_ids"]]) + batch_input_masks.append([[1] * len(inst) + [0] * (max_len - len(inst)) \ + for inst in batch_records[i]["input_ids"]]) + + image_embedding, image_mask = pad_feature_data(image_features, return_mask=True) + image_loc = pad_feature_data(image_boxes) + src_ids = np.array(batch_input_ids).astype("int64").reshape([num_choice * num_sample, max_len, 1]) + src_pos = np.array(batch_input_pos).astype("int64").reshape([num_choice * num_sample, max_len, 1]) + src_seg = np.array(batch_seg_ids).astype("int64").reshape([num_choice * num_sample, max_len, 1]) + src_masks = np.array(batch_input_masks).astype("float32").reshape([num_choice * num_sample, max_len, 1]) + src_task = np.zeros(src_ids.shape, dtype="int64") + batch, seq_len, fea_len = image_embedding.shape + image_embedding = np.tile(np.expand_dims(image_embedding, axis=1), \ + (1, num_choice, 1, 1)).reshape([num_choice * batch, seq_len, fea_len]) + image_mask = np.tile(np.expand_dims(image_mask, axis=1), \ + (1, num_choice, 1, 1)).reshape([num_choice * batch, seq_len, 1]) + image_loc = np.tile(np.expand_dims(image_loc, axis=1), \ + (1, num_choice, 1, 1)).reshape([num_choice * batch, seq_len, 5]) + return_list = [src_ids, src_pos, src_seg, src_task, src_masks, \ + image_embedding, image_loc, image_mask, labels, batch_anno_ids] + return_list.append(np.array([task_index]).astype('int64')) + return_list.append(binary_labels) + for i in xrange(task_num): + if i == task_index: + return_list.append(np.array([1.0]).astype("float32")) + else: + return_list.append(np.array([0.0]).astype("float32")) + return return_list + + +def pad_feature_data(data, pad_value=0.0, dtype="float32", return_mask=False): + """ + pad visual features with given pad value + """ + max_lenth=max([len(item) for item in data]) + data_width = len(data[0][0]) + out_data = np.ones((len(data), max_lenth, data_width), dtype=dtype) * pad_value + out_mask = np.zeros((len(data), max_lenth, 1), dtype=dtype) + for i in range(len(data)): + out_data[i, 0: len(data[i]), :] = data[i] + if 
return_mask: + out_mask[i, 0:len(data[i]):] = 1.0 + if return_mask: + return out_data, out_mask + else: + return out_data + +if __name__ == "__main__": + pass diff --git a/ernie-vil/conf/vcr/model_conf_vcr b/ernie-vil/conf/vcr/model_conf_vcr new file mode 100644 index 0000000000000000000000000000000000000000..d683cbff17d285ed369ed39a432cd1e8eb920885 --- /dev/null +++ b/ernie-vil/conf/vcr/model_conf_vcr @@ -0,0 +1,12 @@ +output_model_path="output_vcr" +lr_scheduler="manual_warmup_decay" +decay_steps="13308;19962" +lr_decay_ratio=0.1 +num_train_steps=26640 +SAVE_STEPS=6660 +WARMUP_STEPS=6654 +BATCH_SIZE=64 +VALID_STEPS=20000 +LR_RATE=2e-5 +WEIGHT_DECAY=0.01 +MAX_LEN=80 diff --git a/ernie-vil/conf/vcr/task_vcr.json b/ernie-vil/conf/vcr/task_vcr.json new file mode 100644 index 0000000000000000000000000000000000000000..9ac9d56d24f05591f29456b8ac50cf603faabcd4 --- /dev/null +++ b/ernie-vil/conf/vcr/task_vcr.json @@ -0,0 +1,42 @@ +[ +{ +"task": "VCR_Q-A", +"num_choice": 4, +"annotations_jsonpath_train": "./data/vcr/train.jsonl", +"annotations_jsonpath_val": "./data/vcr/val.jsonl", +"annotations_jsonpath_test": "./data/vcr/test.jsonl", +"feature_lmdb_path": "./data/vcr/VCR_resnet101_faster_rcnn_genome_pickle2.lmdb", +"gt_feature_lmdb_path": "./data/vcr/VCR_gt_resnet101_faster_rcnn_genome_pickle2.lmdb", +"unisex_names_table" : "./data/vcr/unisex_names_table.csv", +"Proprocessor": "PreprocessorBasic", +"tokenizer_name" : "FullTokenizer", +"fusion_method" : "mul", +"dropout_rate" : 0.1, +"max_seq_len" : 60, +"use_gt_fea" : true, +"shufflekeep_across_task": true, +"shuffle_every_epoch": true, +"task_weight": 1.0, +"task_prefix": "vcr_qa" +}, +{ +"task": "VCR_QA-R", +"num_choice": 4, +"annotations_jsonpath_train": "./data/vcr/train.jsonl", +"annotations_jsonpath_val": "./data/vcr/val.jsonl", +"annotations_jsonpath_test": "./data/vcr/test.jsonl", +"feature_lmdb_path": "./data/vcr/VCR_resnet101_faster_rcnn_genome_pickle2.lmdb", +"gt_feature_lmdb_path": "./data/vcr/VCR_gt_resnet101_faster_rcnn_genome_pickle2.lmdb", +"unisex_names_table" : "./data/vcr/unisex_names_table.csv", +"Proprocessor": "PreprocessorBasic", +"tokenizer_name" : "FullTokenizer", +"fusion_method" : "mul", +"dropout_rate" : 0.1, +"max_seq_len" : 80, +"use_gt_fea" : true, +"shufflekeep_across_task": true, +"shuffle_every_epoch" : true, +"task_weight": 1.0, +"task_prefix": "vcr_qar" +} +] diff --git a/ernie-vil/conf/vcr/task_vcr_qa.json b/ernie-vil/conf/vcr/task_vcr_qa.json new file mode 100644 index 0000000000000000000000000000000000000000..c2b4afa714046a94e8f7720506721cc6edde5894 --- /dev/null +++ b/ernie-vil/conf/vcr/task_vcr_qa.json @@ -0,0 +1,21 @@ +[ +{ +"task": "VCR_Q-A", +"num_choice": 4, +"annotations_jsonpath_train": "./data/vcr/train.jsonl", +"annotations_jsonpath_val": "./data/vcr/val.jsonl", +"annotations_jsonpath_test": "./data/vcr/test.jsonl", +"feature_lmdb_path": "./data/vcr/VCR_resnet101_faster_rcnn_genome_pickle2.lmdb", +"gt_feature_lmdb_path": "./data/vcr/VCR_gt_resnet101_faster_rcnn_genome_pickle2.lmdb", +"unisex_names_table" : "./data/vcr/unisex_names_table.csv", +"Proprocessor": "PreprocessorBasic", +"tokenizer_name" : "FullTokenizer", +"tagger_path" : "./script/ntc.pickle", +"nltk_data_path" : "./nltk_data", +"fusion_method" : "mul", +"dropout_rate" : 0.1, +"max_seq_len" : 60, +"use_gt_fea" : true, +"task_prefix" : "vcr_qa" +} +] diff --git a/ernie-vil/conf/vcr/task_vcr_qar.json b/ernie-vil/conf/vcr/task_vcr_qar.json new file mode 100644 index 
0000000000000000000000000000000000000000..8f4c88021f2666ce1779ea47ffd6014e67fb91ad --- /dev/null +++ b/ernie-vil/conf/vcr/task_vcr_qar.json @@ -0,0 +1,22 @@ +[ +{ +"task": "VCR_QA-R", +"num_choice": 4, +"annotations_jsonpath_train": "./data/vcr/train.jsonl", +"annotations_jsonpath_val": "./data/vcr/val.jsonl", +"annotations_jsonpath_test": "./data/vcr/test.jsonl", +"feature_lmdb_path": "./data/vcr/VCR_resnet101_faster_rcnn_genome_pickle2.lmdb", +"gt_feature_lmdb_path": "./data/vcr/VCR_gt_resnet101_faster_rcnn_genome_pickle2.lmdb", +"unisex_names_table" : "./data/vcr/unisex_names_table.csv", +"Proprocessor": "PreprocessorBasic", +"tokenizer_name" : "FullTokenizer", +"vocab_path" : "./package/vocab.txt", +"tagger_path" : "./script/ntc.pickle", +"nltk_data_path" : "./nltk_data", +"fusion_method" : "mul", +"dropout_rate" : 0.1, +"max_seq_len" : 80, +"use_gt_fea" : true, +"task_prefix" : "vcr_qa" +} +] diff --git a/ernie-vil/finetune.py b/ernie-vil/finetune.py new file mode 100755 index 0000000000000000000000000000000000000000..dbee99a0a5d6096a3ae9e902f3137f227470e2af --- /dev/null +++ b/ernie-vil/finetune.py @@ -0,0 +1,465 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" finetuning vison-language task """ + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import os +import sys +import time +import datetime +import argparse +import numpy as np +import multiprocessing +import json + +from reader.vcr_finetuning import VCRDataJointReader +from model.ernie_vil import ErnieVilModel, ErnieVilConfig +from optim.optimization import optimization +from utils.args import print_arguments +from utils.init import init_checkpoint, init_pretraining_params +from args.finetune_args import parser + +import paddle.fluid as fluid + +args = parser.parse_args() + +# yapf: enable. 
+ +#READERS = {"vcr": VCRDataJointReader, "vqa": VQADataReader, "refcoco+": RefcocoReader, "flickr": FlickrReader} +READERS = {"vcr": VCRDataJointReader} + +def format_result(res_arr, qids, pred, labels, scores): + """ + trans batch results into json format + """ + for i in range(len(qids)): + res="\t".join([str(qids[i]), str(pred[i]), str(labels[i]), " ".join(["%.5f" % s for s in scores[i]])]) + res_arr.append(res) + return res_arr + + +def create_vcr_model(pyreader_name, ernie_config, task_group, is_prediction=False): + """ + create model arc for vcr tasks + """ + shapes = [[-1, args.max_seq_len, 1], #src_id + [-1, args.max_seq_len, 1], #pos_id + [-1, args.max_seq_len, 1], #sent_id + [-1, args.max_seq_len, 1], #task_id + [-1, args.max_seq_len, 1], #input_mask + [-1, args.max_img_len, args.feature_size], #image_embedding + [-1, args.max_img_len, 5], #image_loc + [-1, args.max_img_len, 1], #image_mask + [-1, 1], #labels + [-1, 1], #qids + [], #task_index + [-1, 1], #binary_labels + ] + dtypes = ['int64', 'int64', 'int64', 'int64', 'float32', 'float32', 'float32', 'float32', + 'int64', 'int64', 'int64', 'float32'] + lod_levels = [0] * len(dtypes) + + for _ in task_group: + shapes.append([]) + dtypes.append('float') + lod_levels.append(0) + + pyreader = fluid.layers.py_reader( + capacity=30, + shapes=shapes, + dtypes=dtypes, + lod_levels=lod_levels, + name=pyreader_name, + use_double_buffer=False) + + inputs = fluid.layers.read_file(pyreader) + src_ids, pos_ids, sent_ids, task_ids, input_mask, image_embeddings, \ + image_loc, image_mask, labels, q_ids, task_index, binary_labels = inputs[: 12] + + ernie_vil = ErnieVilModel( + src_ids=src_ids, + position_ids=pos_ids, + sentence_ids=sent_ids, + task_ids=task_ids, + input_mask=input_mask, + image_embeddings=image_embeddings, + image_loc=image_loc, + input_image_mask=image_mask, + config=ernie_config + ) + + h_cls, h_img = ernie_vil.get_pooled_output() + task_conf = task_group[0] + fusion_method = task_conf["fusion_method"] + fusion_fea = ernie_vil.get_match_score(text=h_cls, image=h_img, \ + dropout_rate=task_conf["dropout_rate"], + mode=fusion_method) + if is_prediction: + num_choice = int(task_conf['num_choice']) + task_name = task_conf.get('task_prefix', 'vcr') + score = fluid.layers.fc(fusion_fea, 1, + param_attr = fluid.ParamAttr(name = task_name + "_fc.w_0", + initializer = fluid.initializer.TruncatedNormal(scale = 0.02)), + bias_attr = task_name + "_fc.b_0") + score = fluid.layers.reshape(score, shape = [-1, num_choice]) + _loss, _softmax = fluid.layers.softmax_with_cross_entropy(logits = score, + label = labels, return_softmax = True) + _acc = fluid.layers.accuracy(input = _softmax, label = labels) + pred = fluid.layers.argmax(score, axis = 1) + mean_loss = fluid.layers.mean(_loss) + task_vars = [mean_loss, _acc, pred, q_ids, labels, _softmax] + for var in task_vars: + var.persistable = True + return pyreader, task_vars + else: + start_ind = 12 + mean_loss = fluid.layers.zeros(shape = [1], dtype = 'float32') + mean_acc = fluid.layers.zeros(shape = [1], dtype = 'float32') + for task_conf in task_group: + task_weight = inputs[start_ind] + start_ind += 1 + num_choice = int(task_conf['num_choice']) + task_name = task_conf.get('task_prefix', 'vcr') + score = fluid.layers.fc(fusion_fea, 1, + param_attr = fluid.ParamAttr(name = task_name + "_fc.w_0", + initializer = fluid.initializer.TruncatedNormal(scale = 0.02)), + bias_attr = task_name + "_fc.b_0") + + _loss = fluid.layers.sigmoid_cross_entropy_with_logits(score, + binary_labels, name = 
"cross_entropy_loss") + tmp_score = fluid.layers.reshape(score, shape = [-1, num_choice]) + _softmax = fluid.layers.softmax(tmp_score) + _acc = fluid.layers.accuracy(input = _softmax, label = labels) + _mean_loss = fluid.layers.mean(_loss) + mean_loss += _mean_loss * task_weight + mean_acc += _acc * task_weight + task_vars = [fluid.layers.reduce_mean(mean_loss), mean_acc] + for var in task_vars: + var.persistable = True + + return pyreader, task_vars + + +#MODELS = {"vcr": create_vcr_model, "vqa": create_vqa_model, "refcoco+": create_refcoco_model} +MODELS = {"vcr": create_vcr_model} + +def predict_wrapper(args, + exe, + ernie_config, + task_group, + test_prog=None, + pyreader=None, + graph_vars=None): + """Context to do validation. + """ + reader_name = READERS[args.task_name] + data_reader = reader_name( + task_group, + split=args.test_split, + vocab_path=args.vocab_path, + is_test=True, + shuffle=False, + batch_size=args.batch_size, + epoch=args.epoch) + if args.do_test: + assert args.init_checkpoint is not None, "[FATAL] Please use --init_checkpoint '/path/to/checkpoints' \ + to specify you pretrained model checkpoints" + + init_pretraining_params(exe, args.init_checkpoint, test_prog) + print(("testing on %s %s split") % (args.task_name, args.test_split)) + + def predict(exe=exe, pyreader=pyreader): + """ + inference for downstream tasks + """ + pyreader.decorate_tensor_provider(data_reader.data_generator()) + pyreader.start() + + cost = 0 + appear_step = 0 + task_acc = {} + task_steps = {} + steps = 0 + case_f1 = 0 + appear_f1 = 0 + time_begin = time.time() + task_name_list = [v.name for v in graph_vars] + fetch_list = task_name_list + + print('task name list : ', task_name_list) + sum_acc = 0 + res_arr = [] + while True: + try: + outputs = exe.run(fetch_list=fetch_list, program=test_prog) + each_acc = outputs[1][0] + preds = np.reshape(outputs[2], [-1]) + qids = np.reshape(outputs[3], [-1]) + labels = np.reshape(outputs[4], [-1]) + scores = np.reshape(outputs[5], [-1, 4]) + sum_acc += each_acc + steps += 1 + if steps % 10 == 0: + print('cur_step:', steps, 'cur_acc:', sum_acc / steps) + format_result(res_arr, qids.tolist(), preds.tolist(), labels.tolist(), scores.tolist()) + except fluid.core.EOFException: + pyreader.reset() + break + + used_time = time.time() - time_begin + + with open(args.result_file, "w") as f: + for r in res_arr: + f.write(r + "\n") + + print("average_acc:", sum_acc / steps) + ret = {} + ret["acc"] = "acc: %f" % (sum_acc / steps) + for item in ret: + try: + ret[item] = ret[item].split(':')[-1] + except: + pass + return ret + return predict + + +def get_optimizer(total_loss, train_program, startup_prog, args): + """ + optimization func + """ + decay_steps_str=args.decay_steps + if decay_steps_str == "": + decay_steps = [] + else: + decay_steps = [int(s) for s in decay_steps_str.split(";")] + scheduled_lr = optimization( + loss=total_loss, + warmup_steps=args.warmup_steps, + num_train_steps=args.num_train_steps, + learning_rate=args.learning_rate, + train_program=train_program, + startup_prog=startup_prog, + weight_decay=args.weight_decay, + scheduler=args.lr_scheduler, + decay_steps=decay_steps, + lr_decay_ratio=args.lr_decay_ratio) + return scheduled_lr + + +def main(args): + """ + Main func for downstream tasks + """ + print("finetuning tasks start") + ernie_config = ErnieVilConfig(args.ernie_config_path) + ernie_config.print_config() + + with open(args.task_group_json) as f: + task_group = json.load(f) + print('task: ', task_group) + + startup_prog = 
fluid.Program() + if args.do_train and args.do_test: + print("can not set both do_train and do_test as True") + return + + model_name = MODELS[args.task_name] + if args.do_train: + train_program = fluid.Program() + with fluid.program_guard(train_program, startup_prog): + with fluid.unique_name.guard(): + train_pyreader, model_outputs = model_name( + pyreader_name='train_reader', ernie_config=ernie_config, task_group=task_group) + + total_loss = model_outputs[0] + scheduled_lr = get_optimizer(total_loss, train_program, startup_prog, args) + if args.do_test: + test_prog = fluid.Program() + with fluid.program_guard(test_prog, startup_prog): + with fluid.unique_name.guard(): + test_pyreader, model_outputs = model_name( + pyreader_name='test_reader', ernie_config=ernie_config, task_group=task_group, is_prediction=True) + total_loss = model_outputs[0] + + test_prog = test_prog.clone(for_test=True) + + if args.use_gpu: + gpu_id = 0 + if os.getenv("FLAGS_selected_gpus"): + gpu_id = int(os.getenv("FLAGS_selected_gpus")) + place = fluid.CUDAPlace(gpu_id) if args.use_gpu else fluid.CPUPlace() + + print("theoretical memory usage: ") + if args.do_train: + print(fluid.contrib.memory_usage( + program=train_program, batch_size=args.batch_size)) + if args.do_test: + print(fluid.contrib.memory_usage( + program=test_prog, batch_size=args.batch_size)) + + nccl2_num_trainers = 1 + nccl2_trainer_id = 0 + print("args.is_distributed:", args.is_distributed) + trainer_id = 0 + if args.is_distributed: + trainer_id = int(os.getenv("PADDLE_TRAINER_ID")) + worker_endpoints_env = os.getenv("PADDLE_TRAINER_ENDPOINTS") + current_endpoint = os.getenv("PADDLE_CURRENT_ENDPOINT") + worker_endpoints = worker_endpoints_env.split(",") + trainers_num = len(worker_endpoints) + + print("worker_endpoints:{} trainers_num:{} current_endpoint:{} \ + trainer_id:{}".format(worker_endpoints, trainers_num, + current_endpoint, trainer_id)) + + # prepare nccl2 env. + config = fluid.DistributeTranspilerConfig() + config.mode = "nccl2" + if args.nccl_comm_num > 1: + config.nccl_comm_num = args.nccl_comm_num + if args.use_hierarchical_allreduce and trainers_num > args.hierarchical_allreduce_inter_nranks: + config.use_hierarchical_allreduce=args.use_hierarchical_allreduce + config.hierarchical_allreduce_inter_nranks=args.hierarchical_allreduce_inter_nranks + + assert config.hierarchical_allreduce_inter_nranks > 1 + assert trainers_num % config.hierarchical_allreduce_inter_nranks == 0 + + config.hierarchical_allreduce_exter_nranks = \ + trainers_num / config.hierarchical_allreduce_inter_nranks + + t = fluid.DistributeTranspiler(config=config) + t.transpile( + trainer_id, + trainers=worker_endpoints_env, + current_endpoint=current_endpoint, + program=train_program, + startup_program=startup_prog) + + nccl2_num_trainers = trainers_num + nccl2_trainer_id = trainer_id + + exe = fluid.Executor(place) + exe.run(startup_prog) + + if args.do_train: + if args.init_checkpoint and args.init_checkpoint != "": + sys.stderr.write('############################WARNING############################') + sys.stderr.write('####### using init_pretraining_params, not init_checkpoint ####') + sys.stderr.write('## meaning hyper param e.g. 
lr won\'t inherit from checkpoint##') + sys.stderr.write('###############################################################') + init_pretraining_params(exe, args.init_checkpoint, train_program) + + reader_name=READERS[args.task_name] + data_reader = reader_name( + task_group, + split="train", + vocab_path=args.vocab_path, + batch_size=args.batch_size, + epoch=args.epoch,) + + exec_strategy = fluid.ExecutionStrategy() + if args.use_fast_executor: + exec_strategy.use_experimental_executor = True + exec_strategy.num_threads = 2 + + exec_strategy.num_iteration_per_drop_scope = min(10, args.skip_steps) + + build_strategy = fluid.compiler.BuildStrategy() + build_strategy.fuse_all_reduce_ops = False + + if args.use_fuse: + build_strategy.fuse_all_reduce_ops = True + + if args.do_train: + train_exe = fluid.ParallelExecutor( + use_cuda=args.use_cuda, + loss_name=total_loss.name, + build_strategy=build_strategy, + exec_strategy=exec_strategy, + main_program=train_program, + num_trainers=nccl2_num_trainers, + trainer_id=nccl2_trainer_id) + + if args.do_test: + predict = predict_wrapper( + args, + exe, + ernie_config, + task_group, + test_prog=test_prog, + pyreader=test_pyreader, + graph_vars=model_outputs) + result = predict() + + if args.do_train: + train_pyreader.decorate_tensor_provider(data_reader.data_generator()) + train_pyreader.start() + steps = 0 + time_begin = time.time() + node_nums = 1 #int(os.getenv("PADDLE_NODES_NUM")) + used_time_all = 0 + while steps < args.num_train_steps: + try: + steps += node_nums + skip_steps = args.skip_steps * node_nums + fetch_list = [] + if nccl2_trainer_id == 0 and steps % skip_steps == 0: + task_name_list = [v.name for v in model_outputs] + fetch_list = task_name_list + fetch_list.append(scheduled_lr.name) + + time_begin = time.time() + outputs = train_exe.run(fetch_list=fetch_list) + if outputs: + print("feed_queue size", train_pyreader.queue.size()) + progress_file = data_reader.get_progress() + epoch = progress_file["current_epoch"] + current_file_index = progress_file["current_file_index"] + total_file = progress_file["total_file"] + current_file = progress_file["current_file"] + print( + "epoch: %d, progress: %d/%d, step: %d, loss: %f, " + "acc: %f" + % (epoch, current_file_index, total_file, steps, + outputs[0][0], + outputs[1][0])) + print("steps:", steps) + print("save_steps:", args.save_steps) + + np_lr = outputs[-1:] + + date_str = datetime.datetime.now().strftime("%Y%m%d %H:%M:%S") + + np_lr = float(np.mean(np_lr[0])) + print("%s current learning_rate:%.8f" % (date_str, np_lr)) + + if steps % args.save_steps == 0: + save_path = os.path.join(args.checkpoints, "step_" + str(steps)) + print("save_path:", save_path) + fluid.io.save_persistables(exe, save_path, train_program) + time_end = time.time() + used_time = time_end - time_begin + time_end = time_begin + print("used_time:", used_time) + except fluid.core.EOFException: + train_pyreader.reset() + break + + +if __name__ == '__main__': + print_arguments(args) + main(args) + diff --git a/ernie-vil/model/__init__.py b/ernie-vil/model/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/ernie-vil/model/ernie_vil.py b/ernie-vil/model/ernie_vil.py new file mode 100644 index 0000000000000000000000000000000000000000..13b53097898e4c01416f12105dd0421ed72bd5e1 --- /dev/null +++ b/ernie-vil/model/ernie_vil.py @@ -0,0 +1,288 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""ERNIE-ViL model""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import json + +import six +import paddle.fluid as fluid + +from model.vl_transformer_encoder import encoder, pre_process_layer + + +class ErnieVilConfig(object): + """ + configuration for ernie-vil + """ + def __init__(self, config_path): + self._config_dict = self._parse(config_path) + + def _parse(self, config_path): + try: + with open(config_path) as json_file: + config_dict = json.load(json_file) + except Exception: + raise IOError("Error in parsing Ernie model config file '%s'" % + config_path) + else: + return config_dict + + def __getitem__(self, key): + return self._config_dict[key] + + def print_config(self): + """ + print configuration value + """ + for arg, value in sorted(six.iteritems(self._config_dict)): + print('%s: %s' % (arg, value)) + print('------------------------------------------------') + + +class ErnieVilModel(object): + """ + main class for ERNIE-ViL model + """ + def __init__(self, + src_ids, + position_ids, + sentence_ids, + task_ids, + input_mask, + image_embeddings, + image_loc, + input_image_mask, + config, + predict_feature=False, + predict_class=True, + use_attr=False, + use_soft_label=True): + + self._emb_size = config['hidden_size'] + self._n_layer = config['num_hidden_layers'] + self._n_head = config['num_attention_heads'] + + self._v_head = config['v_num_attention_heads'] + self._v_emb_size = config['v_hidden_size'] + self._v_inter_hid = config['v_intermediate_size'] + + self._co_head = config['co_num_attention_heads'] + self._co_emb_size = config['co_hidden_size'] + self._co_inter_hid = config['co_intermediate_size'] + + self._voc_size = config['vocab_size'] + self._class_size = config['class_size'] + self._class_attr_size = config['class_attr_size'] + self._max_position_seq_len = config['max_position_embeddings'] + self._sent_types = config['sent_type_vocab_size'] + self._task_types = config['task_type_vocab_size'] + self._hidden_act = config['hidden_act'] + self._prepostprocess_dropout = config['hidden_dropout_prob'] + self._attention_dropout = config['attention_probs_dropout_prob'] + self._v_biattention_id = config['v_biattention_id'] + self._t_biattention_id = config['t_biattention_id'] + + self._predict_feature = predict_feature + self._predict_class = predict_class + self._use_attr = use_attr + self._use_soft_label = use_soft_label + self._word_emb_name = "word_embedding" + self._pos_emb_name = "pos_embedding" + self._sent_emb_name = "sent_embedding" + self._image_emb_name = "image_embedding" + self._loc_emb_name = "loc_embedding" + self._dtype = "float32" + self._emb_dtype = "float32" + + # Initialize all weigths by truncated normal initializer, and all biases + # will be initialized by constant zero by default. 
+ self._param_initializer = fluid.initializer.TruncatedNormal( + scale=config['initializer_range']) + + self._build_model(src_ids, position_ids, sentence_ids, task_ids, input_mask, \ + image_embeddings, image_loc, input_image_mask) + + def _build_model(self, src_ids, position_ids, sentence_ids, task_ids, input_mask, \ + image_embeddings, image_loc, input_image_mask): + # padding id in vocabulary must be set to 0 + emb_out = fluid.layers.embedding( + input=src_ids, + size=[self._voc_size, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr( + name=self._word_emb_name, initializer=self._param_initializer), + is_sparse=False) + + position_emb_out = fluid.layers.embedding( + input=position_ids, + size=[self._max_position_seq_len, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr( + name=self._pos_emb_name, initializer=self._param_initializer)) + + sent_emb_out = fluid.layers.embedding( + sentence_ids, + size=[self._sent_types, self._emb_size], + dtype=self._emb_dtype, + param_attr=fluid.ParamAttr( + name=self._sent_emb_name, initializer=self._param_initializer)) + + emb_out = emb_out + position_emb_out + emb_out = emb_out + sent_emb_out + + emb_out = pre_process_layer( + emb_out, 'nd', self._prepostprocess_dropout, name='pre_encoder') + + self_attn_mask = fluid.layers.matmul( + x=input_mask, y=input_mask, transpose_y=True) + + self_attn_mask = fluid.layers.scale( + x=self_attn_mask, scale=10000.0, bias=-1.0, bias_after_scale=False) + n_head_self_attn_mask = fluid.layers.stack( + x=[self_attn_mask] * self._n_head, axis=1) + n_head_self_attn_mask.stop_gradient = True + + image_embeddings = fluid.layers.fc(image_embeddings, + self._v_emb_size, + param_attr=fluid.ParamAttr( + name="image_emb.w_0", + initializer=self._param_initializer), + bias_attr = "image_emb.b_0", + num_flatten_dims = 2) + loc_emb_out = fluid.layers.fc(image_loc, + self._v_emb_size, + param_attr=fluid.ParamAttr( + name="image_loc.w_0", + initializer=self._param_initializer), + bias_attr = "image_loc.b_0", + num_flatten_dims = 2) + + emb_vl_out = image_embeddings + loc_emb_out + emb_vl_out = pre_process_layer( + emb_vl_out, 'nd', self._prepostprocess_dropout, name='vl_pre_encoder') + + self_attn_image_mask = fluid.layers.matmul( + x=input_image_mask, y=input_image_mask, transpose_y=True) + + self_attn_image_mask = fluid.layers.scale( + x=self_attn_image_mask, scale=10000.0, bias=-1.0, bias_after_scale=False) + n_head_self_attn_image_mask = fluid.layers.stack( + x=[self_attn_image_mask] * self._v_head, axis=1) + n_head_self_attn_image_mask.stop_gradient = True + + self_attn_vl_mask = fluid.layers.matmul( + x=input_image_mask, y=input_mask, transpose_y=True) + self_attn_vl_mask = fluid.layers.scale( + x=self_attn_vl_mask, scale=10000.0, bias=-1.0, bias_after_scale=False) + n_head_self_attn_vl_mask = fluid.layers.stack( + x=[self_attn_vl_mask] * self._co_head, axis=1) + n_head_self_attn_vl_mask.stop_gradient = True + + self._enc_out, self._enc_vl_out = encoder( + enc_input=emb_out, + enc_vl_input=emb_vl_out, + attn_bias=n_head_self_attn_mask, + attn_image_bias=n_head_self_attn_image_mask, + attn_vl_bias=n_head_self_attn_vl_mask, + n_layer=self._n_layer, + n_head=self._n_head, + d_key=self._emb_size // self._n_head, + d_value=self._emb_size // self._n_head, + d_model=self._emb_size, + d_inner_hid=self._emb_size * 4, + v_head=self._v_head, + v_key=self._v_emb_size // self._v_head, + v_value=self._v_emb_size // self._v_head, + v_model=self._v_emb_size, + v_inner_hid=self._v_inter_hid, + 
co_head=self._co_head, + co_key=self._co_emb_size // self._co_head, + co_value=self._co_emb_size // self._co_head, + co_model=self._co_emb_size, + co_inner_hid=self._co_inter_hid, + prepostprocess_dropout=self._prepostprocess_dropout, + attention_dropout=self._attention_dropout, + relu_dropout=0, + hidden_act=self._hidden_act, + preprocess_cmd="", + postprocess_cmd="dan", + param_initializer=self._param_initializer, + v_biattention_id = self._v_biattention_id, + t_biattention_id = self._t_biattention_id, + name='encoder') + + def get_sequence_output(self): + """ + Return sequence output of all text and img tokens + """ + return self._enc_out, self._enc_vl_out + + def get_pooled_output(self): + """ + Get the first feature of each sequence for classification + """ + text_cls_feat = fluid.layers.slice( + input=self._enc_out, axes=[1], starts=[0], ends=[1]) + + text_cls_feat = fluid.layers.cast( + x=text_cls_feat, dtype=self._emb_dtype) + + text_cls_feat = fluid.layers.fc( + input=text_cls_feat, + size=self._co_emb_size, + act="relu", + param_attr=fluid.ParamAttr( + name="pooled_fc_text.w_0", initializer=self._param_initializer), + bias_attr="pooled_fc_text.b_0") + + image_cls_feat = fluid.layers.slice( + input=self._enc_vl_out, axes=[1], starts=[0], ends=[1]) + + image_cls_feat = fluid.layers.cast( + x=image_cls_feat, dtype=self._emb_dtype) + + image_cls_feat = fluid.layers.fc( + input=image_cls_feat, + size=self._co_emb_size, + act="relu", + param_attr=fluid.ParamAttr( + name="pooled_fc_image.w_0", initializer=self._param_initializer), + bias_attr="pooled_fc_image.b_0") + return text_cls_feat, image_cls_feat + + def get_match_score(self, text, image, dropout_rate=0.0, mode="mul"): + """ + match score for text [cls] and image [img] tokens + """ + if mode == "sum": + emb_fuse = text + image + elif mode == "mul": + emb_fuse = text * image + else: + "current mode %s is not supported" % mode + return + if dropout_rate > 0.0: + + emb_fuse = fluid.layers.dropout(emb_fuse, + self._attention_dropout, + dropout_implementation="upscale_in_train") + return emb_fuse + + + diff --git a/ernie-vil/model/vl_transformer_encoder.py b/ernie-vil/model/vl_transformer_encoder.py new file mode 100644 index 0000000000000000000000000000000000000000..0a477541d5fb7edb9e5322e76c023fd6cd66197b --- /dev/null +++ b/ernie-vil/model/vl_transformer_encoder.py @@ -0,0 +1,561 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""two-stream Transformer encoder.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from functools import partial + +import paddle.fluid as fluid +import paddle.fluid.layers as layers + + +def multi_head_attention(queries, + keys, + values, + attn_bias, + d_key, + d_value, + d_model, + n_head=1, + dropout_rate=0., + cache=None, + param_initializer=None, + name='multi_head_att'): + """ + Multi-Head Attention. 
Note that attn_bias is added to the logit before + computing softmax activiation to mask certain selected positions so that + they will not considered in attention weights. + """ + keys = queries if keys is None else keys + values = keys if values is None else values + + if not (len(queries.shape) == len(keys.shape) == len(values.shape) == 3): + raise ValueError( + "Inputs: quries, keys and values should all be 3-D tensors.") + + def __compute_qkv(queries, keys, values, n_head, d_key, d_value): + """ + Add linear projection to queries, keys, and values. + """ + q = layers.fc(input=queries, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr( + name=name + '_query_fc.w_0', + initializer=param_initializer), + bias_attr=name + '_query_fc.b_0') + k = layers.fc(input=keys, + size=d_key * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr( + name=name + '_key_fc.w_0', + initializer=param_initializer), + bias_attr=name + '_key_fc.b_0') + v = layers.fc(input=values, + size=d_value * n_head, + num_flatten_dims=2, + param_attr=fluid.ParamAttr( + name=name + '_value_fc.w_0', + initializer=param_initializer), + bias_attr=name + '_value_fc.b_0') + return q, k, v + + def __split_heads(x, n_head): + """ + Reshape the last dimension of inpunt tensor x so that it becomes two + dimensions and then transpose. Specifically, input a tensor with shape + [bs, max_sequence_length, n_head * hidden_dim] then output a tensor + with shape [bs, n_head, max_sequence_length, hidden_dim]. + """ + hidden_size = x.shape[-1] + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + reshaped = layers.reshape( + x=x, shape=[0, 0, n_head, hidden_size // n_head], inplace=True) + + # permuate the dimensions into: + # [batch_size, n_head, max_sequence_len, hidden_size_per_head] + return layers.transpose(x=reshaped, perm=[0, 2, 1, 3]) + + def __combine_heads(x): + """ + Transpose and then reshape the last two dimensions of inpunt tensor x + so that it becomes one dimension, which is reverse to __split_heads. + """ + if len(x.shape) == 3: return x + if len(x.shape) != 4: + raise ValueError("Input(x) should be a 4-D Tensor.") + + trans_x = layers.transpose(x, perm=[0, 2, 1, 3]) + # The value 0 in shape attr means copying the corresponding dimension + # size of the input as the output dimension size. + return layers.reshape( + x=trans_x, + shape=[0, 0, trans_x.shape[2] * trans_x.shape[3]], + inplace=True) + + def scaled_dot_product_attention(q, k, v, attn_bias, d_key, dropout_rate): + """ + Scaled Dot-Product Attention + """ + scaled_q = layers.scale(x=q, scale=d_key ** -0.5) + product = layers.matmul(x=scaled_q, y=k, transpose_y=True) + if attn_bias: + product += attn_bias + weights = layers.softmax(product) + if dropout_rate: + weights = layers.dropout( + weights, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.matmul(weights, v) + return out + + q, k, v = __compute_qkv(queries, keys, values, n_head, d_key, d_value) + + if cache is not None: # use cache and concat time steps + # Since the inplace reshape in __split_heads changes the shape of k and + # v, which is the cache input for next time step, reshape the cache + # input from the previous time step first. 
+ k = cache["k"] = layers.concat( + [layers.reshape( + cache["k"], shape=[0, 0, d_model]), k], axis=1) + v = cache["v"] = layers.concat( + [layers.reshape( + cache["v"], shape=[0, 0, d_model]), v], axis=1) + + q = __split_heads(q, n_head) + k = __split_heads(k, n_head) + v = __split_heads(v, n_head) + + ctx_multiheads = scaled_dot_product_attention(q, k, v, attn_bias, d_key, + dropout_rate) + + out = __combine_heads(ctx_multiheads) + + # Project back to the model size. + proj_out = layers.fc(input=out, + size=d_model, + num_flatten_dims=2, + param_attr=fluid.ParamAttr( + name=name + '_output_fc.w_0', + initializer=param_initializer), + bias_attr=name + '_output_fc.b_0') + return proj_out + + +def positionwise_feed_forward(x, + d_inner_hid, + d_hid, + dropout_rate, + hidden_act, + param_initializer=None, + name='ffn'): + """ + Position-wise Feed-Forward Networks. + This module consists of two linear transformations with a ReLU activation + in between, which is applied to each position separately and identically. + """ + hidden = layers.fc(input=x, + size=d_inner_hid, + num_flatten_dims=2, + act=hidden_act, + param_attr=fluid.ParamAttr( + name=name + '_fc_0.w_0', + initializer=param_initializer), + bias_attr=name + '_fc_0.b_0') + if dropout_rate: + hidden = layers.dropout( + hidden, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + out = layers.fc(input=hidden, + size=d_hid, + num_flatten_dims=2, + param_attr=fluid.ParamAttr( + name=name + '_fc_1.w_0', initializer=param_initializer), + bias_attr=name + '_fc_1.b_0') + return out + + +def pre_post_process_layer(prev_out, out, process_cmd, dropout_rate=0., + name=''): + """ + Add residual connection, layer normalization and droput to the out tensor + optionally according to the value of process_cmd. + This will be used before or after multi-head attention and position-wise + feed-forward networks. 
+ """ + for cmd in process_cmd: + if cmd == "a": # add residual connection + out = out + prev_out if prev_out else out + elif cmd == "n": # add layer normalization + out = layers.layer_norm( + out, + begin_norm_axis=len(out.shape) - 1, + param_attr=fluid.ParamAttr( + name=name + '_layer_norm_scale', + initializer=fluid.initializer.Constant(1.)), + bias_attr=fluid.ParamAttr( + name=name + '_layer_norm_bias', + initializer=fluid.initializer.Constant(0.))) + elif cmd == "d": # add dropout + if dropout_rate: + out = layers.dropout( + out, + dropout_prob=dropout_rate, + dropout_implementation="upscale_in_train", + is_test=False) + return out + + +pre_process_layer = partial(pre_post_process_layer, None) +post_process_layer = pre_post_process_layer + + +def encoder_co_layer(enc_input, + enc_vl_input, + attn_vl_bias, + co_head, + co_key, + co_value, + co_model, + d_model, + d_inner_hid, + v_model, + v_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """ + Co_layer to perform co-attention from visual to language or from language to visual + """ + enc_input_pre = pre_process_layer( + enc_input, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_att') + + enc_input_vl_pre = pre_process_layer( + enc_vl_input, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_vl_pre_att') + + attn_output = multi_head_attention( + enc_input_pre, + enc_input_vl_pre, + enc_input_vl_pre, + layers.transpose(attn_vl_bias, perm=[0, 1, 3, 2]), + co_key, + co_value, + d_model, + co_head, + attention_dropout, + param_initializer=param_initializer, + name=name + '_multi_head_att') + + attn_vl_output = multi_head_attention( + enc_input_vl_pre, + enc_input_pre, + enc_input_pre, + attn_vl_bias, + co_key, + co_value, + v_model, + co_head, + attention_dropout, + param_initializer=param_initializer, + name=name + '_vl_multi_head_att') + + attn_output = post_process_layer( + enc_input, + attn_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_att') + + attn_vl_output = post_process_layer( + enc_vl_input, + attn_vl_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_vl_post_att') + + ffd_output = positionwise_feed_forward( + pre_process_layer( + attn_output, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_ffn'), + d_inner_hid, + d_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + '_ffn') + + ffd_vl_output = positionwise_feed_forward( + pre_process_layer( + attn_vl_output, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_vl_ffn'), + v_inner_hid, + v_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + '_vl_ffn') + + enc_output = post_process_layer( + attn_output, + ffd_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_ffn') + + enc_vl_output = post_process_layer( + attn_vl_output, + ffd_vl_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_vl_post_ffn') + + return enc_output, enc_vl_output + + +def encoder_layer(enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + name=''): + """The encoder layers that can be stacked to form a deep encoder. 
+ This module consits of a multi-head (self) attention followed by + position-wise feed-forward networks and both the two components companied + with the post_process_layer to add residual connection, layer normalization + and droput. + """ + attn_output = multi_head_attention( + pre_process_layer( + enc_input, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_att'), + None, + None, + attn_bias, + d_key, + d_value, + d_model, + n_head, + attention_dropout, + param_initializer=param_initializer, + name=name + '_multi_head_att') + attn_output = post_process_layer( + enc_input, + attn_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_att') + ffd_output = positionwise_feed_forward( + pre_process_layer( + attn_output, + preprocess_cmd, + prepostprocess_dropout, + name=name + '_pre_ffn'), + d_inner_hid, + d_model, + relu_dropout, + hidden_act, + param_initializer=param_initializer, + name=name + '_ffn') + return post_process_layer( + attn_output, + ffd_output, + postprocess_cmd, + prepostprocess_dropout, + name=name + '_post_ffn') + + +def encoder(enc_input, + enc_vl_input, + attn_bias, + attn_image_bias, + attn_vl_bias, + n_layer, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + v_head, + v_key, + v_value, + v_model, + v_inner_hid, + co_head, + co_key, + co_value, + co_model, + co_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd="n", + postprocess_cmd="da", + param_initializer=None, + v_biattention_id=[0, 1, 2, 3, 4, 5], + t_biattention_id=[18, 19, 20, 21, 22, 23], + name=''): + """ + The encoder is composed of a stack of identical layers returned by calling + encoder_layer and encoder_co_layer + """ + + v_start = 0 + t_start = 0 + block = 0 + + for v_layer_id, t_layer_id in zip(v_biattention_id, t_biattention_id): + v_end = v_layer_id + t_end = t_layer_id + for idx in range(t_start, t_end): + enc_output = encoder_layer( + enc_input, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_layer_' + str(idx)) + enc_input = enc_output + + for idx in range(v_start, v_end): + enc_vl_output = encoder_layer( + enc_vl_input, + attn_image_bias, + v_head, + v_key, + v_value, + v_model, + v_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_vlayer_' + str(idx)) + enc_vl_input = enc_vl_output + + enc_output, enc_vl_output = encoder_co_layer( + enc_input, + enc_vl_input, + attn_vl_bias, + co_head, + co_key, + co_value, + co_model, + d_model, + d_inner_hid, + v_model, + v_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_colayer_' + str(block)) + + enc_input, enc_vl_input = enc_output, enc_vl_output + + block += 1 + v_start = v_end + t_start = t_end + + enc_output = encoder_layer( + enc_output, + attn_bias, + n_head, + d_key, + d_value, + d_model, + d_inner_hid, + prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_layer_' + str(t_end)) + + enc_vl_output = encoder_layer( + enc_vl_output, + attn_image_bias, + v_head, + v_key, + v_value, + v_model, + v_inner_hid, 
+ prepostprocess_dropout, + attention_dropout, + relu_dropout, + hidden_act, + preprocess_cmd, + postprocess_cmd, + param_initializer=param_initializer, + name=name + '_vlayer_' + str(v_end)) + + enc_output = pre_process_layer( + enc_output, preprocess_cmd, prepostprocess_dropout, name="post_encoder") + + enc_vl_output = pre_process_layer( + enc_vl_output, preprocess_cmd, prepostprocess_dropout, name="vl_post_encoder") + + return enc_output, enc_vl_output diff --git a/ernie-vil/optim/__init__.py b/ernie-vil/optim/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/ernie-vil/optim/optimization.py b/ernie-vil/optim/optimization.py new file mode 100644 index 0000000000000000000000000000000000000000..fb27665f9cc6bff8fa8e2febda8c4058c082b18c --- /dev/null +++ b/ernie-vil/optim/optimization.py @@ -0,0 +1,167 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" text preprocess """ + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import numpy as np +import paddle.fluid as fluid + +def manual_warmup_decay(learning_rate, warmup_steps, num_train_steps, decay_steps=[], lr_decay_ratio=0.1): + """ + Applies linear warmup of learning rate from 0 and keep constant. + """ + with fluid.default_main_program()._lr_schedule_guard(): + lr = fluid.layers.tensor.create_global_var( + shape=[1], + value=0.0, + dtype='float32', + persistable=True, + name="scheduled_learning_rate") + + global_step = fluid.layers.learning_rate_scheduler._decay_step_counter( + ) + with fluid.layers.control_flow.Switch() as switch: + with switch.case(global_step < warmup_steps): + warmup_lr = learning_rate * (global_step / warmup_steps) + fluid.layers.tensor.assign(warmup_lr, lr) + for i, step in enumerate(decay_steps): + with switch.case(global_step < step): + decayed_lr = learning_rate * (global_step / global_step) * pow(lr_decay_ratio, i) + fluid.layers.tensor.assign(decayed_lr, lr) + with switch.default(): + constant_lr = learning_rate * (global_step / global_step) * pow(lr_decay_ratio, len(decay_steps)) + fluid.layers.tensor.assign(constant_lr, lr) + + return lr + + +def linear_warmup_decay(learning_rate, warmup_steps, num_train_steps): + """ + Applies linear warmup of learning rate from 0 and decay to 0. 
+ """ + with fluid.default_main_program()._lr_schedule_guard(): + lr = fluid.layers.tensor.create_global_var( + shape=[1], + value=0.0, + dtype='float32', + persistable=True, + name="scheduled_learning_rate") + + global_step = fluid.layers.learning_rate_scheduler._decay_step_counter( + ) + + with fluid.layers.control_flow.Switch() as switch: + with switch.case(global_step < warmup_steps): + warmup_lr = learning_rate * (global_step / warmup_steps) + fluid.layers.tensor.assign(warmup_lr, lr) + with switch.default(): + decayed_lr = fluid.layers.learning_rate_scheduler.polynomial_decay( + learning_rate=learning_rate, + decay_steps=num_train_steps, + end_learning_rate=0.0, + power=1.0, + cycle=False) + fluid.layers.tensor.assign(decayed_lr, lr) + + return lr + +def optimization(loss, + warmup_steps, + num_train_steps, + learning_rate, + train_program, + startup_prog, + weight_decay, + scheduler='linear_warmup_decay', + decay_steps=[], + lr_decay_dict_file="", + lr_decay_ratio=0.1): + """ + optimization implementation + """ + if warmup_steps > 0: + if scheduler == 'noam_decay': + scheduled_lr = fluid.layers.learning_rate_scheduler \ + .noam_decay(1 / (warmup_steps * (learning_rate ** 2)), + warmup_steps) + elif scheduler == 'linear_warmup_decay': + scheduled_lr = linear_warmup_decay(learning_rate, warmup_steps, + num_train_steps) + elif scheduler == 'manual_warmup_decay': + scheduled_lr = manual_warmup_decay(learning_rate, warmup_steps, + num_train_steps, decay_steps, lr_decay_ratio) + else: + raise ValueError("Unkown learning rate scheduler, should be " + "'noam_decay' or 'linear_warmup_decay' or 'manual_warmup_decay'") + else: + scheduled_lr = fluid.layers.create_global_var( + name=fluid.unique_name.generate("learning_rate"), + shape=[1], + value=learning_rate, + dtype='float32', + persistable=True) + + lr_decay_dict = {} + if lr_decay_dict_file != "": + with open(lr_decay_dict_file) as f: + for line in f: + param, decay_rate = line.strip().split('\t') + lr_decay_dict[param] = float(decay_rate) + + for param in fluid.default_main_program().block(0).all_parameters(): + if param.name in lr_decay_dict: + print (param.name, lr_decay_dict[param.name]) + param.optimize_attr['learning_rate'] = lr_decay_dict[param.name] + + optimizer = fluid.optimizer.Adam(learning_rate=scheduled_lr) + optimizer._learning_rate_map[fluid.default_main_program( + )] = scheduled_lr + + + fluid.clip.set_gradient_clip( + clip=fluid.clip.GradientClipByGlobalNorm(clip_norm=1.0)) + + def exclude_from_weight_decay(name): + """ + Parameters not use weight decay + """ + if name.find("layer_norm") > -1: + return True + bias_suffix = ["_bias", "_b", ".b_0"] + for suffix in bias_suffix: + if name.endswith(suffix): + return True + return False + + param_list = dict() + + for param in train_program.global_block().all_parameters(): + param_list[param.name] = param * 1.0 + param_list[param.name].stop_gradient = True + + _, param_grads = optimizer.minimize(loss) + + if weight_decay > 0: + for param, grad in param_grads: + if exclude_from_weight_decay(param.name): + continue + with param.block.program._optimized_guard( + [param, grad]), fluid.framework.name_scope("weight_decay"): + updated_param = param - param_list[ + param.name] * weight_decay * scheduled_lr * param.optimize_attr['learning_rate'] + fluid.layers.assign(output=param, input=updated_param) + + return scheduled_lr diff --git a/ernie-vil/preprocess/__init__.py b/ernie-vil/preprocess/__init__.py new file mode 100644 index 
0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/ernie-vil/preprocess/preprocessor.py b/ernie-vil/preprocess/preprocessor.py new file mode 100755 index 0000000000000000000000000000000000000000..0cc0a80139d7bbaad98df8c99352cc95c6f5bec4 --- /dev/null +++ b/ernie-vil/preprocess/preprocessor.py @@ -0,0 +1,46 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" text preprocess """ + +import random +import sys +import os +import base64 +import numpy as np + +reload(sys) +sys.setdefaultencoding("utf-8") + +from preprocess import tokenization + +class PreprocessorBasic(object): + """ + Main class for text preprocess + """ + def __init__(self, + tokenizer_name, + vocab_path, + tagger_path="", + nltk_data_path="", + do_lower_case=True): + self.do_lower_case = do_lower_case + self.tokenizer = getattr(tokenization, tokenizer_name)(vocab_file=vocab_path, do_lower_case=do_lower_case) + self.vocab = self.tokenizer.vocab + + def convert_sentence_to_ids_without_cls(self, sentence): + """ + Convert sentence to ids without cls + """ + tokens = self.tokenizer.tokenize(sentence) + ids = self.tokenizer.convert_tokens_to_ids(tokens) + return ids diff --git a/ernie-vil/preprocess/tokenization.py b/ernie-vil/preprocess/tokenization.py new file mode 100644 index 0000000000000000000000000000000000000000..a661203259b61b6db061158ed91580c0a18af2bd --- /dev/null +++ b/ernie-vil/preprocess/tokenization.py @@ -0,0 +1,467 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
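Before the tokenizer internals, here is a short usage sketch of the `PreprocessorBasic` class from `preprocess/preprocessor.py` above. It assumes the `ernie-vil` directory is on `PYTHONPATH`, uses a tiny toy vocabulary as a stand-in for the model's real `vocab.txt`, and, like the rest of the repo, targets Python 2.7 (the module calls `reload(sys)` at import time).

```python
# Usage sketch for PreprocessorBasic (Python 2.7, repo root on PYTHONPATH).
# The vocabulary below is a toy stand-in for the released model's vocab.txt.
import tempfile

from preprocess.preprocessor import PreprocessorBasic

vocab_tokens = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]",
                "a", "man", "riding", "horse", "##s"]
with tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False) as f:
    f.write("\n".join(vocab_tokens))   # one token per line; the line number becomes the id
    vocab_path = f.name

processor = PreprocessorBasic(tokenizer_name="FullTokenizer", vocab_path=vocab_path)
ids = processor.convert_sentence_to_ids_without_cls("A man riding horses")
print(ids)   # ids for ['a', 'man', 'riding', 'horse', '##s']; no [CLS] is prepended
```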
+ +""" tokenization implemnet """ + + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import collections +import unicodedata +import six +from functools import reduce + +def convert_to_unicode(text): + """Converts `text` to Unicode (if it's not already), assuming utf-8 input.""" + if six.PY3: + if isinstance(text, str): + return text + elif isinstance(text, bytes): + return text.decode("utf-8", "ignore") + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + elif six.PY2: + if isinstance(text, str): + return text.decode("utf-8", "ignore") + elif isinstance(text, unicode): + return text + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + else: + raise ValueError("Not running on Python2 or Python 3?") + + +def printable_text(text): + """Returns text encoded in a way suitable for print or `tf.logging`.""" + + # These functions want `str` for both Python2 and Python3, but in one case + # it's a Unicode string and in the other it's a byte string. + if six.PY3: + if isinstance(text, str): + return text + elif isinstance(text, bytes): + return text.decode("utf-8", "ignore") + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + elif six.PY2: + if isinstance(text, str): + return text + elif isinstance(text, unicode): + return text.encode("utf-8") + else: + raise ValueError("Unsupported string type: %s" % (type(text))) + else: + raise ValueError("Not running on Python2 or Python 3?") + + +def load_vocab(vocab_file): + """Loads a vocabulary file into a dictionary.""" + vocab = collections.OrderedDict() + fin = open(vocab_file) + for num, line in enumerate(fin): + items = convert_to_unicode(line.strip()).split("\t") + if len(items) > 2: + break + token = items[0] + index = items[1] if len(items) == 2 else num + token = token.strip() + vocab[token] = int(index) + return vocab + + +def convert_by_vocab(vocab, items): + """Converts a sequence of [tokens|ids] using the vocab.""" + output = [] + for item in items: + output.append(vocab[item]) + return output + + +def convert_tokens_to_ids(vocab, tokens): + """ + Converts tokens to ids + """ + return convert_by_vocab(vocab, tokens) + + +def convert_ids_to_tokens(inv_vocab, ids): + """ + Converts ids to tokens + """ + return convert_by_vocab(inv_vocab, ids) + + +def whitespace_tokenize(text): + """Runs basic whitespace cleaning and splitting on a peice of text.""" + text = text.strip() + if not text: + return [] + tokens = text.split() + return tokens + + +class FullTokenizer(object): + """Runs end-to-end tokenziation.""" + + def __init__(self, vocab_file, do_lower_case=True): + self.vocab = load_vocab(vocab_file) + self.inv_vocab = {v: k for k, v in self.vocab.items()} + self.basic_tokenizer = BasicTokenizer(do_lower_case=do_lower_case) + self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) + + def tokenize(self, text): + """ + turn text into tokens + """ + split_tokens = [] + for token in self.basic_tokenizer.tokenize(text): + for sub_token in self.wordpiece_tokenizer.tokenize(token): + split_tokens.append(sub_token) + + return split_tokens + + def tokenize_case(self, text): + """ + tokenize case + """ + split_tokens = [] + case_indexs = [] + basic_tokens, case_index = self.basic_tokenizer.tokenize_case(text) + case_indexs += case_index + case_indexs = [[i] for i in case_indexs] + + for token_index, token in enumerate(basic_tokens): + wordpiece_tokens = self.wordpiece_tokenizer.tokenize(token) + if 
len(wordpiece_tokens) > 1: + case_indexs[token_index] = case_indexs[token_index]*(len(wordpiece_tokens)) + for sub_token in wordpiece_tokens: + split_tokens.append(sub_token) + + if case_indexs: + case_indexs = reduce(lambda x, y: x + y, case_indexs) + return split_tokens, case_indexs + + def convert_tokens_to_ids(self, tokens): + """ + Converts tokens to ids + """ + return convert_by_vocab(self.vocab, tokens) + + def convert_ids_to_tokens(self, ids): + """ + Converts ids to tokens + """ + return convert_by_vocab(self.inv_vocab, ids) + + +class CharTokenizer(object): + """Runs end-to-end tokenziation.""" + + def __init__(self, vocab_file, do_lower_case=True): + self.vocab = load_vocab(vocab_file) + self.inv_vocab = {v: k for k, v in self.vocab.items()} + self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab) + + def tokenize(self, text): + """ + Convert text to tokens + """ + split_tokens = [] + for token in text.lower().split(" "): + for sub_token in self.wordpiece_tokenizer.tokenize(token): + split_tokens.append(sub_token) + + return split_tokens + + def convert_tokens_to_ids(self, tokens): + """ + Convert tokens to ids + """ + return convert_by_vocab(self.vocab, tokens) + + def convert_ids_to_tokens(self, ids): + """ + Convert tokens to ids + """ + return convert_by_vocab(self.inv_vocab, ids) + + +class BasicTokenizer(object): + """Runs basic tokenization (punctuation splitting, lower casing, etc.).""" + + def __init__(self, do_lower_case=True): + """Constructs a BasicTokenizer. + + Args: + do_lower_case: Whether to lower case the input. + """ + self.do_lower_case = do_lower_case + + def tokenize(self, text): + """Tokenizes a piece of text.""" + text = convert_to_unicode(text) + text = self._clean_text(text) + + # This was added on November 1st, 2018 for the multilingual and Chinese + # models. This is also applied to the English models now, but it doesn't + # matter since the English models were not trained on any Chinese data + # and generally don't have any Chinese data in them (there are Chinese + # characters in the vocabulary because Wikipedia does have some Chinese + # words in the English Wikipedia.). 
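The comment above refers to the CJK handling implemented by `_tokenize_chinese_chars` and `_is_chinese_char` further down: every CJK ideograph is surrounded by spaces so that it becomes its own whitespace token. A standalone sketch covering only the basic 0x4E00-0x9FFF block (the real check also covers the extension blocks) shows the effect:

```python
# -*- coding: utf-8 -*-
# Toy illustration of the CJK spacing: each ideograph in the basic CJK block becomes
# its own whitespace token. Only the 0x4E00-0x9FFF range is checked here.

def is_basic_cjk(char):
    return 0x4E00 <= ord(char) <= 0x9FFF

def space_out_cjk(text):
    out = []
    for ch in text:
        out.append(u" %s " % ch if is_basic_cjk(ch) else ch)
    return u"".join(out)

text = u"ERNIE模型"
print(space_out_cjk(text).split())   # ['ERNIE', '模', '型'] (unicode escapes on Python 2)
```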
+ text = self._tokenize_chinese_chars(text) + + orig_tokens = whitespace_tokenize(text) + split_tokens = [] + for token in orig_tokens: + if self.do_lower_case: + token = token.lower() + token = self._run_strip_accents(token) + split_tokens.extend(self._run_split_on_punc(token)) + + output_tokens = whitespace_tokenize(" ".join(split_tokens)) + return output_tokens + + def tokenize_case(self, text): + """ + tokenize case + """ + text = convert_to_unicode(text) + text = self._clean_text(text) + text = self._tokenize_chinese_chars(text) + + orig_tokens = whitespace_tokenize(text) + split_tokens = [] + case_index = [] + + for token in orig_tokens: + if self.do_lower_case: + if token.istitle(): + case_index.append(1) + else: + case_index.append(0) + token = token.lower() + token = self._run_strip_accents(token) + if token == '': + case_index.pop() + + tmpsplit_tokens, case_index = self._run_split_on_punc_case(token, case_index) + split_tokens.extend(tmpsplit_tokens) + + output_tokens = whitespace_tokenize(" ".join(split_tokens)) + return output_tokens, case_index + + def _run_strip_accents(self, text): + """Strips accents from a piece of text.""" + text = unicodedata.normalize("NFD", text) + output = [] + for char in text: + cat = unicodedata.category(char) + if cat == "Mn": + continue + output.append(char) + return "".join(output) + + def _run_split_on_punc(self, text): + """Splits punctuation on a piece of text.""" + chars = list(text) + i = 0 + start_new_word = True + output = [] + while i < len(chars): + char = chars[i] + if _is_punctuation(char): + output.append([char]) + start_new_word = True + else: + if start_new_word: + output.append([]) + start_new_word = False + output[-1].append(char) + i += 1 + + return ["".join(x) for x in output] + + def _run_split_on_punc_case(self, text, case_index): + """Splits punctuation on a piece of text.""" + chars = list(text) + i = 0 + start_new_word = True + output = [] + + while i < len(chars): + char = chars[i] + if _is_punctuation(char): + output.append([char]) + start_new_word = True + else: + if start_new_word: + output.append([]) + start_new_word = False + output[-1].append(char) + i += 1 + + if len(output) > 1: + case_index.extend([case_index[-1]]*(len(output)-1)) + + return ["".join(x) for x in output], case_index + + def _tokenize_chinese_chars(self, text): + """Adds whitespace around any CJK character.""" + output = [] + for char in text: + cp = ord(char) + if self._is_chinese_char(cp): + output.append(" ") + output.append(char) + output.append(" ") + else: + output.append(char) + return "".join(output) + + def _is_chinese_char(self, cp): + """Checks whether CP is the codepoint of a CJK character.""" + # This defines a "chinese character" as anything in the CJK Unicode block: + # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) + # + # Note that the CJK Unicode block is NOT all Japanese and Korean characters, + # despite its name. The modern Korean Hangul alphabet is a different block, + # as is Japanese Hiragana and Katakana. Those alphabets are used to write + # space-separated words, so they are not treated specially and handled + # like the all of the other languages. 
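One non-obvious piece of the `tokenize_case` / `_run_split_on_punc_case` logic above is the flag bookkeeping: each whitespace token gets a 0/1 flag recording whether it was title-cased, and that flag is then repeated once per sub-token the word is split into, so the flag list stays aligned with the final token list. A standalone sketch with a pretend WordPiece split (not the repo code):

```python
# Toy sketch of the case-flag alignment: one 0/1 flag per basic token, repeated for
# every sub-token that token is split into, so flags and tokens stay one-to-one.
from functools import reduce

basic_tokens = ["He", "rides", "horses"]
sub_tokens = [["he"], ["rides"], ["horse", "##s"]]        # pretend WordPiece output

case_flags = [[1 if tok.istitle() else 0] for tok in basic_tokens]
for i, pieces in enumerate(sub_tokens):
    case_flags[i] = case_flags[i] * len(pieces)           # replicate the flag per sub-token

flat_tokens = reduce(lambda x, y: x + y, sub_tokens)
flat_flags = reduce(lambda x, y: x + y, case_flags)
print(flat_tokens)   # ['he', 'rides', 'horse', '##s']
print(flat_flags)    # [1, 0, 0, 0] -- still aligned one-to-one with the tokens
```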
+ if ((cp >= 0x4E00 and cp <= 0x9FFF) or # + (cp >= 0x3400 and cp <= 0x4DBF) or # + (cp >= 0x20000 and cp <= 0x2A6DF) or # + (cp >= 0x2A700 and cp <= 0x2B73F) or # + (cp >= 0x2B740 and cp <= 0x2B81F) or # + (cp >= 0x2B820 and cp <= 0x2CEAF) or + (cp >= 0xF900 and cp <= 0xFAFF) or # + (cp >= 0x2F800 and cp <= 0x2FA1F)): # + return True + + return False + + def _clean_text(self, text): + """Performs invalid character removal and whitespace cleanup on text.""" + output = [] + for char in text: + cp = ord(char) + if cp == 0 or cp == 0xfffd or _is_control(char): + continue + if _is_whitespace(char): + output.append(" ") + else: + output.append(char) + return "".join(output) + + +class WordpieceTokenizer(object): + """Runs WordPiece tokenziation.""" + + def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100): + self.vocab = vocab + self.unk_token = unk_token + self.max_input_chars_per_word = max_input_chars_per_word + + def tokenize(self, text): + """Tokenizes a piece of text into its word pieces. + + This uses a greedy longest-match-first algorithm to perform tokenization + using the given vocabulary. + + For example: + input = "unaffable" + output = ["un", "##aff", "##able"] + + Args: + text: A single token or whitespace separated tokens. This should have + already been passed through `BasicTokenizer. + + Returns: + A list of wordpiece tokens. + """ + + text = convert_to_unicode(text) + + output_tokens = [] + for token in whitespace_tokenize(text): + chars = list(token) + if len(chars) > self.max_input_chars_per_word: + output_tokens.append(self.unk_token) + continue + + is_bad = False + start = 0 + sub_tokens = [] + while start < len(chars): + end = len(chars) + cur_substr = None + while start < end: + substr = "".join(chars[start:end]) + if start > 0: + substr = "##" + substr + if substr in self.vocab: + cur_substr = substr + break + end -= 1 + if cur_substr is None: + is_bad = True + break + sub_tokens.append(cur_substr) + start = end + + if is_bad: + output_tokens.append(self.unk_token) + else: + output_tokens.extend(sub_tokens) + return output_tokens + + +def _is_whitespace(char): + """Checks whether `chars` is a whitespace character.""" + # \t, \n, and \r are technically contorl characters but we treat them + # as whitespace since they are generally considered as such. + if char == " " or char == "\t" or char == "\n" or char == "\r": + return True + cat = unicodedata.category(char) + if cat == "Zs": + return True + return False + + +def _is_control(char): + """Checks whether `chars` is a control character.""" + # These are technically control characters but we count them as whitespace + # characters. + if char == "\t" or char == "\n" or char == "\r": + return False + cat = unicodedata.category(char) + if cat.startswith("C"): + return True + return False + + +def _is_punctuation(char): + """Checks whether `chars` is a punctuation character.""" + cp = ord(char) + # We treat all non-letter/number ASCII as punctuation. + # Characters such as "^", "$", and "`" are not in the Unicode + # Punctuation class but we treat them as punctuation anyways, for + # consistency. 
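The WordPiece loop in `WordpieceTokenizer.tokenize` above is a greedy longest-match-first search. A standalone re-implementation on a toy vocabulary makes the control flow easier to follow (illustrative only, not the repo code):

```python
# Standalone sketch of greedy longest-match-first WordPiece on a toy vocabulary.
def wordpiece(token, vocab, unk="[UNK]"):
    chars = list(token)
    pieces, start = [], 0
    while start < len(chars):
        end = len(chars)
        cur = None
        while start < end:                   # shrink the window until a match is found
            sub = "".join(chars[start:end])
            if start > 0:
                sub = "##" + sub             # continuation pieces carry the ## prefix
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:                      # nothing matched: the whole token is unknown
            return [unk]
        pieces.append(cur)
        start = end
    return pieces

vocab = {"un", "##aff", "##able", "aff"}
print(wordpiece("unaffable", vocab))   # ['un', '##aff', '##able']
print(wordpiece("xyz", vocab))         # ['[UNK]']
```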
+ if ((cp >= 33 and cp <= 47) or (cp >= 58 and cp <= 64) or + (cp >= 91 and cp <= 96) or (cp >= 123 and cp <= 126)): + return True + cat = unicodedata.category(char) + if cat.startswith("P"): + return True + return False diff --git a/ernie-vil/reader/__init__.py b/ernie-vil/reader/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/ernie-vil/reader/_image_features_reader.py b/ernie-vil/reader/_image_features_reader.py new file mode 100644 index 0000000000000000000000000000000000000000..2866bef90e806d14066faf9b2a17faa72834df7a --- /dev/null +++ b/ernie-vil/reader/_image_features_reader.py @@ -0,0 +1,79 @@ +""" +Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +""" +import numpy as np +import copy +import pickle +import lmdb # install lmdb by "pip install lmdb" +import base64 + +class ImageFeaturesH5Reader(object): + """ + Reader class + """ + def __init__(self, features_path): + self.features_path = features_path + self.env = lmdb.open(self.features_path, max_readers=1, readonly=True, + lock=False, readahead=False, meminit=False) + + with self.env.begin(write=False) as txn: + self._image_ids = pickle.loads(txn.get('keys'.encode())) + + self.features = [None] * len(self._image_ids) + self.num_boxes = [None] * len(self._image_ids) + self.boxes = [None] * len(self._image_ids) + self.boxes_ori = [None] * len(self._image_ids) + + def __len__(self): + return len(self._image_ids) + + def __getitem__(self, image_id): + image_id = str(image_id).encode() + index = self._image_ids.index(image_id) + # Read chunk from file everytime if not loaded in memory. 
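The `__getitem__` body below decodes one LMDB record and attaches a 5-dimensional location to every region: the box corners normalized by the image size plus the box area as a fraction of the image, with a whole-image "global" region prepended at index 0. A minimal NumPy sketch of that layout for a single made-up box:

```python
# Sketch of the 5-D region location: [x1/W, y1/H, x2/W, y2/H, area_fraction].
import numpy as np

image_w, image_h = 640.0, 480.0
box = np.array([32.0, 48.0, 352.0, 288.0])             # x1, y1, x2, y2 in pixels

loc = np.zeros(5, dtype=np.float32)
loc[:4] = box
loc[4] = (box[3] - box[1]) * (box[2] - box[0]) / (image_w * image_h)   # relative area
loc[[0, 2]] /= image_w                                  # normalize x coordinates
loc[[1, 3]] /= image_h                                  # normalize y coordinates

# The whole-image "global" region prepended at index 0 uses [0, 0, 1, 1, 1].
g_loc = np.array([0, 0, 1, 1, 1], dtype=np.float32)
print(np.stack([g_loc, loc]))
```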
+ with self.env.begin(write=False) as txn: + item = pickle.loads(txn.get(image_id)) + image_id = item['image_id'] + image_h = int(item['image_h']) + image_w = int(item['image_w']) + num_boxes = int(item['num_boxes']) + + features = np.frombuffer(base64.b64decode(item["features"]), dtype=np.float32).reshape(num_boxes, 2048) + boxes = np.frombuffer(base64.b64decode(item['boxes']), dtype=np.float32).reshape(num_boxes, 4) + g_feat = np.sum(features, axis=0) / num_boxes + num_boxes = num_boxes + 1 + features = np.concatenate([np.expand_dims(g_feat, axis=0), features], axis=0) + image_location = np.zeros((boxes.shape[0], 5), dtype=np.float32) + image_location[:, :4] = boxes + image_location[:, 4] = (image_location[:, 3] - image_location[:, 1]) * \ + (image_location[:, 2] - image_location[:, 0]) / (float(image_w) * float(image_h)) + + image_location_ori = copy.deepcopy(image_location) + image_location[:, 0] = image_location[:, 0] / float(image_w) + image_location[:, 1] = image_location[:, 1] / float(image_h) + image_location[:, 2] = image_location[:, 2] / float(image_w) + image_location[:, 3] = image_location[:, 3] / float(image_h) + + g_location = np.array([0, 0, 1, 1, 1]) + image_location = np.concatenate([np.expand_dims(g_location, axis=0), image_location], axis=0) + + g_location_ori = np.array([0, 0, image_w, image_h, image_w * image_h]) + image_location_ori = np.concatenate([np.expand_dims(g_location_ori, axis=0), image_location_ori], axis=0) + + data_json = {"features": features, + "num_boxes": num_boxes, + "image_location": image_location, + "image_location_ori": image_location_ori + } + return data_json + diff --git a/ernie-vil/reader/vcr_finetuning.py b/ernie-vil/reader/vcr_finetuning.py new file mode 100644 index 0000000000000000000000000000000000000000..78345572f6390864590ae9b989c0c25dc50eccb8 --- /dev/null +++ b/ernie-vil/reader/vcr_finetuning.py @@ -0,0 +1,473 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 + +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +""" VCR Data Reader implementation """ + +from __future__ import print_function +from __future__ import division + +import os +import base64 +import numpy as np +import re +import random +import json +import json_lines +import csv +import sys +import itertools + +from reader._image_features_reader import ImageFeaturesH5Reader +from preprocess import preprocessor +from batching.finetune_batching import prepare_batch_data + +import paddle.fluid as fluid + +def _converId(img_id): + """ + conversion for image ID + """ + img_id = img_id.split('-') + if 'train' in img_id[0]: + new_id = int(img_id[1]) + elif 'val' in img_id[0]: + new_id = int(img_id[1]) + 1000000 + elif 'test' in img_id[0]: + new_id = int(img_id[1]) + 2000000 + else: + print("no split known") + return new_id + + +def _load_annotationsQ_A(annotations_jsonpath, split): + """ + Build an index out of FOIL annotations, mapping each image ID with its corresponding captions. 
+ """ + entries = [] + with open(annotations_jsonpath) as f: + for annotation in json_lines.reader(f): + det_names = "" + question = annotation["question"] + if split == 'test': + ans_label = 0 + else: + ans_label = annotation["answer_label"] + img_id = _converId(annotation["img_id"]) + anno_id = int(annotation["annot_id"].split('-')[1]) + entries.append( + {"question": question, + "answers": annotation["answer_choices"], + "metadata_fn": annotation["metadata_fn"], + "target": ans_label, + "img_id": img_id, + "anno_id": anno_id, + "det_names": annotation['objects'] + }) + return entries + + +def _load_annotationsQA_R(annotations_jsonpath, split): + """ + Build an index out of FOIL annotations, mapping each image ID with its corresponding captions. + """ + entries = [] + with open(annotations_jsonpath, 'rb') as f: + for annotation in json_lines.reader(f): + if split == 'test': + for answer in annotation["answer_choices"]: + question = annotation["question"] + ["[MARK]"] + answer + img_id = _converId(annotation["img_id"]) + ans_label = 0 + anno_id = int(annotation["annot_id"].split('-')[1]) + entries.append( + {"question": question, + "answers": annotation["rationale_choices"], + "metadata_fn": annotation["metadata_fn"], + "target": ans_label, + "img_id": img_id, + "anno_id": anno_id, + "det_names": annotation['objects'] + }) + else: + det_names = "" + question = annotation["question"] + ["[MARK]"] + \ + annotation["answer_choices"][annotation['answer_label']] + ans_label = annotation["rationale_label"] + img_id = _converId(annotation["img_id"]) + anno_id = int(annotation["annot_id"].split('-')[1]) + entries.append( + {"question": question, + "answers": annotation["rationale_choices"], + "metadata_fn": annotation["metadata_fn"], + "target": ans_label, + "img_id": img_id, + "anno_id": anno_id, + "det_names": annotation['objects']}) + return entries + + +class VCRDataReader(object): + """ + Data reader for sub VCR task + """ + def __init__(self, + task_conf, + split, + vocab_path=None, + batch_size=4096, + shuffle=True, + epoch=100, + is_test=False, + feature_reader_dict={}, + random_seed=None, + task_index=0, + task_num=1): + + self.task_conf = task_conf + self.processor = getattr(preprocessor, + task_conf["Proprocessor"])(tokenizer_name=self.task_conf["tokenizer_name"], + vocab_path=vocab_path) + self.vocab = self.processor.vocab + self.batch_size = batch_size + self.shuffle = shuffle + self.epoch = epoch + self.current_epoch = 0 + self.current_file_index = 0 + self.total_file = 0 + self.current_file = None + self.random_seed = random_seed + self.max_seq_len = self.task_conf['max_seq_len'] + self.pad_id = self.vocab["[PAD]"] + self.cls_id = self.vocab["[CLS]"] + self.sep_id = self.vocab["[SEP]"] + self.mask_id = self.vocab["[MASK]"] + self.is_test = is_test + self.task_index = task_index + self.task_num = task_num + + if self.is_test: + self.epoch = 1 + self.shuffle_files = False + if self.shuffle: + shufflekeep_across_task = self.task_conf.get('shufflekeep_across_task', True) + if shufflekeep_across_task: + self.global_rng = np.random.RandomState(random_seed) + else: + self.global_rng = np.random.RandomState() + self.shuffle_every_epoch = self.task_conf.get('shuffle_every_epoch', False) + task=self.task_conf['task'] + annotations_jsonpath=self.task_conf['annotations_jsonpath_' + split] + self.num_choice = int(self.task_conf['num_choice']) + if task == 'VCR_Q-A': + self._entries = _load_annotationsQ_A(annotations_jsonpath, split) + elif task == "VCR_QA-R": + self._entries = 
_load_annotationsQA_R(annotations_jsonpath, split) + else: + assert False + self._split = split + self._names = [] + with open(self.task_conf['unisex_names_table']) as csv_file: + csv_reader = csv.reader(csv_file, delimiter=',') + for row in csv_reader: + if row[1] != 'name': + self._names.append(row[1]) + self._feature_reader = feature_reader_dict[self.task_conf['feature_lmdb_path']] + self.use_gt_fea = task_conf.get('use_gt_fea', False) + if self.use_gt_fea: + self._gt_feature_reader = feature_reader_dict[self.task_conf['gt_feature_lmdb_path']] + self._max_region_num = self.task_conf.get('max_region_num', 100) + print("use gt featurre") + else: + self._max_region_num = self.task_conf.get('max_region_num', 37) + print("only butd feature") + self.tokenize() + + def generate_random_name(self, det_names): + """ + Replace "person" with a random name + """ + random_name = [] + for name in det_names: + if name == 'person': + word = random.choice(self._names) + else: + word = name + random_name.append(word) + + return random_name + + def replace_det_with_name(self, inputs, random_names): + """ + Replace det with name + """ + tokens = [] + mask = [] + for w in inputs: + if isinstance(w, list): + for idx in w: + word = random_names[idx] + tokens.append(word) + else: + word = w.encode('utf-8') + tokens.append(word) + + return tokens, mask + + def _truncate_seq_pair(self, tokens_a, tokens_b, max_length): + """ + Truncates a sequence pair in place to the maximum length. + """ + while True: + total_length = len(tokens_a) + len(tokens_b) + if total_length <= max_length: + break + if len(tokens_a) > len(tokens_b): + tokens_a.pop() + else: + tokens_b.pop() + + def get_progress(self): + """ + Return current progress of traning data + """ + progress_dict = {"current_epoch": self.current_epoch, + "current_file_index": self.current_file_index, + "total_file": self.total_file, + "current_file": self.current_file + } + return progress_dict + + def tokenize(self): + """ + Tokenizes the captions. + """ + # This will add caption_tokens in each entry of the dataset. + # -1 represents nil, and should be treated as padding_idx in embedding. 
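The loop below packs each question/answer candidate into one sequence, `[CLS] question [SEP] answer [SEP]`, with segment id 0 for the question part and 1 for the answer part. A small sketch with made-up token ids (illustrative only; the real ids come from the vocabulary):

```python
# Sketch of the [CLS] question [SEP] answer [SEP] packing built below (toy ids).
CLS, SEP = 101, 102                      # stand-in ids; the real ids come from the vocab
ids_q = [7, 8, 9]                        # question token ids
ids_a = [20, 21]                         # one answer-choice's token ids

input_ids = [CLS] + ids_q + [SEP] + ids_a + [SEP]
segment_ids = [0] * (len(ids_q) + 2) + [1] * (len(ids_a) + 1)
input_pos = list(range(len(input_ids)))

assert len(input_ids) == len(segment_ids) == len(input_pos)
print(input_ids)     # [101, 7, 8, 9, 102, 20, 21, 102]
print(segment_ids)   # [0, 0, 0, 0, 0, 1, 1, 1]
```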
+ count = 0 + for entry in self._entries: + det_names = entry["det_names"] + random_names = self.generate_random_name(det_names) + # replace with name + tokens_a, mask_a = self.replace_det_with_name(entry["question"], random_names) + q_str = " ".join(tokens_a) + ids_a = [] + for i, q in enumerate(q_str.split(" [MARK] ")): + if i == 1: + ids_a.append(self.vocab["[SEP]"]) + ids_a = ids_a + self.processor.convert_sentence_to_ids_without_cls(q) + + input_ids_all = [] + segment_ids_all = [] + input_poss_all = [] + input_len_all = [] + + for answer in entry["answers"]: + tokens_b, mask_b = self.replace_det_with_name(answer, random_names) + ids_b = self.processor.convert_sentence_to_ids_without_cls(" ".join(tokens_b)) + + self._truncate_seq_pair(ids_a, ids_b, self.max_seq_len - 3) + + input_ids = [] + segment_ids = [] + input_ids.append(self.vocab["[CLS]"]) + segment_ids.append(0) + + for id in ids_a: + input_ids.append(id) + segment_ids.append(0) + + input_ids.append(self.vocab["[SEP]"]) + segment_ids.append(0) + + assert len(ids_b) > 0 + for id in ids_b: + input_ids.append(id) + segment_ids.append(1) + input_ids.append(self.vocab["[SEP]"]) + segment_ids.append(1) + + input_ids_all.append(input_ids) + segment_ids_all.append(segment_ids) + input_poss = [str(pos) for pos in range(len(input_ids))] + input_poss_all.append(input_poss) + input_len_all.append(len(input_ids)) + + entry["input_ids"] = input_ids_all + entry["input_poss"] = input_poss_all + entry["segment_ids"] = segment_ids_all + entry["input_lens"] = input_len_all + + sys.stdout.write('%d/%d\r' % (count, len(self._entries))) + sys.stdout.flush() + count += 1 + + def parse_line(self, s_index): + """ + Form slot info with the line information + """ + entry = self._entries[s_index] + image_id = entry["img_id"] + image_fea_json = self._feature_reader[image_id] + features = image_fea_json["features"] + num_boxes = image_fea_json["num_boxes"] + boxes = image_fea_json["image_location"] + if not self.use_gt_fea: + num_boxes = min(num_boxes, self._max_region_num) + boxes = boxes[:num_boxes] + features = features[:num_boxes] + else: + boxes = boxes[:num_boxes] + features = features[:num_boxes] + image_fea_json = self._gt_feature_reader[image_id] + gt_features = image_fea_json["features"] + gt_num_boxes = image_fea_json["num_boxes"] + gt_boxes = image_fea_json["image_location"] + features[0] = (features[0] * num_boxes + gt_features[0] * gt_num_boxes) / (num_boxes + gt_num_boxes) + + gt_boxes = gt_boxes[1: gt_num_boxes] + gt_features = gt_features[1: gt_num_boxes] + gt_num_boxes = gt_num_boxes - 1 + + gt_box_preserve = min(self._max_region_num - 1, gt_num_boxes) + gt_boxes = gt_boxes[:gt_box_preserve] + gt_features = gt_features[:gt_box_preserve] + gt_num_boxes = gt_box_preserve + + num_box_preserve = min(self._max_region_num - int(gt_num_boxes), int(num_boxes)) + boxes = boxes[:num_box_preserve] + features = features[:num_box_preserve] + + # concatenate the boxes + mix_boxes = np.concatenate((boxes, gt_boxes), axis=0) + mix_features = np.concatenate((features, gt_features), axis=0) + mix_num_boxes = num_box_preserve + int(gt_num_boxes) + + num_boxes = min(mix_num_boxes, self._max_region_num) + boxes = mix_boxes[:num_boxes] + features = mix_features[:num_boxes] + record = { + "input_ids": entry["input_ids"], + "input_pos": entry["input_poss"], + "segment_ids": entry["segment_ids"], + "input_lens": entry["input_lens"], + "target": int(entry["target"]), + "features": features, + "boxes": boxes, + "anno_id": entry["anno_id"] + } + return record + + def 
data_generator(self): + """ + Data_generator + """ + sample_indice = range(len(self._entries)) + def wrapper(): + """ + Wrapper + """ + for epoch_index in range(self.epoch): + if self._split == "train": + self.current_example = 0 + self.current_epoch = epoch_index + if self.shuffle: + if epoch_index == 0: + self.global_rng.shuffle(sample_indice) + print("shuffle epoch %d" % epoch_index) + elif self.shuffle_every_epoch: + self.global_rng.shuffle(sample_indice) + print("shuffle epoch %d" % epoch_index) + batch_records = [] + for index in sample_indice: + batch_records.append(self.parse_line(index)) + if len(batch_records) == self.batch_size: + yield prepare_batch_data( + batch_records, self.num_choice, self.pad_id, \ + self.task_index, self.task_num), self.task_conf['task'] + batch_records = [] + if len(batch_records) > 0: + yield prepare_batch_data( + batch_records, self.num_choice, self.pad_id, \ + self.task_index, self.task_num), self.task_conf['task'] + return wrapper + + +class VCRDataJointReader(object): + """ + Joint data reader for Q2A task and QA2R task + """ + def __init__(self, + task_conf_group, + split, + batch_size=4096, + shuffle=True, + epoch=100, + vocab_path=None, + is_test=False): + + self.task_readers = [] + feature_reader_dict = {} + self.task_dup_cnt = [] + for task_conf in task_conf_group: + if 'feature_lmdb_path' in task_conf: + if task_conf['feature_lmdb_path'] not in feature_reader_dict: + feature_reader_dict[task_conf['feature_lmdb_path']] = \ + ImageFeaturesH5Reader(task_conf['feature_lmdb_path']) + if 'gt_feature_lmdb_path' in task_conf and task_conf.get('use_gt_fea', False): + if task_conf['gt_feature_lmdb_path'] not in feature_reader_dict: + feature_reader_dict[task_conf['gt_feature_lmdb_path']] = \ + ImageFeaturesH5Reader(task_conf['gt_feature_lmdb_path']) + task_batch_size = task_conf.get('batch_size', 64) + self.task_dup_cnt.append(max(int(task_batch_size / batch_size), 1)) + random_seed=np.random.randint(1000) + for task_index, task_conf in enumerate(task_conf_group): + self.task_readers.append(VCRDataReader(task_conf, split, vocab_path, batch_size, shuffle, + epoch, is_test, feature_reader_dict, random_seed, task_index, len(task_conf_group))) + self.task_generators = [reader.data_generator() for reader in self.task_readers] + + def get_progress(self): + """ + Return current progress of traning data + """ + current_epoch = max([reader.current_epoch for reader in self.task_readers]) + current_file_index = max([reader.current_file_index for reader in self.task_readers]) + total_file = max([reader.total_file for reader in self.task_readers]) + current_file = "" + self.progress_dict = {"current_epoch": current_epoch, + "current_file_index": current_file_index, + "total_file": total_file, + "current_file": current_file + } + return self.progress_dict + + def data_generator(self): + """ + Data_generator + """ + def wrapper(): + """ + warpper + """ + task_buffer = [[] for i in range(len(self.task_dup_cnt))] + for data in itertools.izip(*[generator() for generator in self.task_generators]): + for i, d in enumerate(data): + task_buffer[i].append(d) + if len(task_buffer[i]) >= self.task_dup_cnt[i]: + for t in task_buffer[i]: + yield t[0] + task_buffer[i] = [] + + return wrapper + + +if __name__ == "__main__": + pass diff --git a/ernie-vil/requirements.txt b/ernie-vil/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..525c143d4ad74c9a883b6f7767cd73207006de7b --- /dev/null +++ b/ernie-vil/requirements.txt @@ -0,0 +1,8 @@ +nltk==3.2.4 
+numpy==1.14.3 +scipy==1.2.1 +six==1.11.0 +json_lines==0.5.0 +lmdb==0.97 +opencv-python==3.2.0.8 +paddlepaddle-gpu==1.8.3.post97 diff --git a/ernie-vil/run_finetuning.sh b/ernie-vil/run_finetuning.sh new file mode 100644 index 0000000000000000000000000000000000000000..7807240fd41f66c9e203c7384ddd7c34eb845b4f --- /dev/null +++ b/ernie-vil/run_finetuning.sh @@ -0,0 +1,59 @@ +set -eu +set -x + +#bash -x ./env.sh + +TASK_NAME=$1 +CONF_FILE=$2 +VOCAB_PATH=$3 +ERNIE_VIL_CONFIG=$4 +PRETRAIN_MODELS=$5 + +source $CONF_FILE + +#configure your cuda and cudnn +#configure nccl + +export FLAGS_fast_eager_deletion_mode=1 +export FLAGS_eager_delete_tensor_gb=0.0 +export FLAGS_fraction_of_gpu_memory_to_use=0.98 + +e_executor=$(echo ${use_experimental_executor-'True'} | tr '[A-Z]' '[a-z]') + +use_fuse=$(echo ${use_fuse-'False'} | tr '[A-Z]' '[a-z]') +if [[ ${use_fuse} == "true" ]]; then + export FLAGS_fuse_parameter_memory_size=131072 + export FLAGS_fuse_parameter_groups_size=10 +fi + + +TASK_GROUP_JSON=./conf/$TASK_NAME/task_${TASK_NAME}.json + +gpu_cnt=`echo $CUDA_VISIBLE_DEVICES | awk -F"\t" '{len=split($0,vec,",");print len}'` +echo "gpu_cnt", $gpu_cnt +python finetune.py --use_cuda "True" \ + --is_distributed "False" \ + --use_fast_executor ${e_executor-"True"} \ + --nccl_comm_num ${nccl_comm_num:-"1"} \ + --batch_size $((BATCH_SIZE/gpu_cnt)) \ + --do_train "True" \ + --do_test "False" \ + --task_name ${TASK_NAME} \ + --vocab_path ${VOCAB_PATH} \ + --task_group_json ${TASK_GROUP_JSON} \ + --lr_scheduler ${lr_scheduler} \ + --decay_steps ${decay_steps-""} \ + --lr_decay_ratio ${lr_decay_ratio-0.1} \ + --num_train_steps ${num_train_steps} \ + --checkpoints $output_model_path \ + --save_steps ${SAVE_STEPS} \ + --init_checkpoint ${PRETRAIN_MODELS} \ + --ernie_config_path ${ERNIE_VIL_CONFIG} \ + --learning_rate ${LR_RATE} \ + --warmup_steps ${WARMUP_STEPS} \ + --weight_decay ${WEIGHT_DECAY:-0} \ + --max_seq_len ${MAX_LEN} \ + --validation_steps ${VALID_STEPS} \ + --skip_steps 10 + + diff --git a/ernie-vil/run_inference.sh b/ernie-vil/run_inference.sh new file mode 100644 index 0000000000000000000000000000000000000000..63893286fec7c88d44f13b00db0787052da20037 --- /dev/null +++ b/ernie-vil/run_inference.sh @@ -0,0 +1,48 @@ +set -eu + +#bash -x ./env.sh + +TASK_NAME=$1 +SUB_TASK_NAME=$2 +TEST_SPLIT=$3 +CONF_FILE=$4 +VOCAB_PATH=$5 +ERNIE_VIL_CONFIG=$6 +MODEL_PATH=$7 +RES_FILE=$8 + +source $CONF_FILE + +#configure your cuda and cudnn +#configure nccl + +export FLAGS_eager_delete_tensor_gb=2.0 +export FLAGS_fraction_of_gpu_memory_to_use=0.01 +export FLAGS_sync_nccl_allreduce=1 + +e_executor=$(echo ${use_experimental_executor-'True'} | tr '[A-Z]' '[a-z]') + +use_fuse=$(echo ${use_fuse-'False'} | tr '[A-Z]' '[a-z]') +if [[ ${use_fuse} == "true" ]]; then + export FLAGS_fuse_parameter_memory_size=131072 + export FLAGS_fuse_parameter_groups_size=10 +fi + +TASK_GROUP_JSON=./conf/$TASK_NAME/task_${TASK_NAME}_${SUB_TASK_NAME}.json + +python finetune.py --use_cuda "True" \ + --use_fast_executor ${e_executor-"True"} \ + --batch_size ${BATCH_SIZE} \ + --do_train "False" \ + --do_test "True" \ + --test_split ${TEST_SPLIT} \ + --task_name $TASK_NAME \ + --vocab_path ${VOCAB_PATH} \ + --task_group_json ${TASK_GROUP_JSON} \ + --result_file "$RES_FILE" \ + --init_checkpoint "$MODEL_PATH" \ + --ernie_config_path ${ERNIE_VIL_CONFIG} \ + --max_seq_len ${MAX_LEN} \ + --skip_steps 10 + + diff --git a/ernie-vil/utils/__init__.py b/ernie-vil/utils/__init__.py new file mode 100644 index 
0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/ernie-vil/utils/args.py b/ernie-vil/utils/args.py new file mode 100644 index 0000000000000000000000000000000000000000..a88528a8ae3ff42df932f62502e649d62e82e1b2 --- /dev/null +++ b/ernie-vil/utils/args.py @@ -0,0 +1,61 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Arguments for configuration.""" + +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +import six +import argparse + + +def str2bool(v): + """ + because argparse does not support to parse "true, False" as python + boolean directly + """ + return v.lower() in ("true", "t", "1") + + +class ArgumentGroup(object): + """ + group of arguments + """ + def __init__(self, parser, title, des): + self._group = parser.add_argument_group(title=title, description=des) + + def add_arg(self, name, type, default, help, positional_arg=False, **kwargs): + """ + add arg + """ + prefix = "" if positional_arg else "--" + type = str2bool if type == bool else type + self._group.add_argument( + prefix + name, + default=default, + type=type, + help=help + ' Default: %(default)s.', + **kwargs) + + +def print_arguments(args): + """ + Arguments print function + """ + print('----------- Configuration Arguments -----------') + for arg, value in sorted(six.iteritems(vars(args))): + print('%s: %s' % (arg, value)) + print('------------------------------------------------') diff --git a/ernie-vil/utils/init.py b/ernie-vil/utils/init.py new file mode 100644 index 0000000000000000000000000000000000000000..faadca1b15a38b04122754ee41be35fd2430848c --- /dev/null +++ b/ernie-vil/utils/init.py @@ -0,0 +1,71 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""parameters init function implementations""" + + +from __future__ import print_function + +import os +import six + +import numpy as np +import paddle.fluid as fluid + + +def init_checkpoint(exe, init_checkpoint_path, main_program): + """ + init checkpoint params with lr and step info + """ + assert os.path.exists( + init_checkpoint_path), "[%s] cann't be found." 
% init_checkpoint_path
+
+    def existed_persistables(var):
+        """
+        Check whether the variable is a persistable whose file exists in the checkpoint.
+        """
+        if not fluid.io.is_persistable(var):
+            return False
+        return os.path.exists(os.path.join(init_checkpoint_path, var.name))
+
+    fluid.io.load_vars(
+        exe,
+        init_checkpoint_path,
+        main_program=main_program,
+        predicate=existed_persistables)
+    print("Load model from {}".format(init_checkpoint_path))
+
+
+def init_pretraining_params(exe, pretraining_params_path, main_program):
+    """
+    Init pretraining params without lr and step info.
+    """
+    assert os.path.exists(pretraining_params_path
+                          ), "[%s] can't be found." % pretraining_params_path
+
+    def existed_params(var):
+        """
+        Check whether the parameter's file exists in the pretrained model directory.
+        """
+        if not isinstance(var, fluid.framework.Parameter):
+            return False
+        return os.path.exists(os.path.join(pretraining_params_path, var.name))
+
+    fluid.io.load_vars(
+        exe,
+        pretraining_params_path,
+        main_program=main_program,
+        predicate=existed_params)
+    print("Load pretraining parameters from {}.".format(
+        pretraining_params_path))
+
diff --git a/requirements.txt b/requirements.txt
index e267a7738bb8f9067254bd7fe11fd992b8018504..9c9d2bc707935b6b0cad95f511d7130ce31d9a5c 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -6,4 +6,5 @@ scipy==1.2.1
 six==1.11.0
 sklearn==0.0
 sentencepiece==0.1.8
+opencv-python==3.4.2.17
 paddlepaddle-gpu==1.6.3.post107
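Putting the helpers in `utils/init.py` above to use typically looks like the sketch below: build the programs, run the startup program, then overwrite matching parameters from a downloaded model directory. The paths and the model-building step are placeholders, not code from this repo; the calls follow the Paddle Fluid 1.8 style used throughout, and the `ernie-vil` directory is assumed to be on `PYTHONPATH`.

```python
# Usage sketch (placeholder paths): restoring pre-trained ERNIE-ViL parameters into a
# Fluid program before fine-tuning.
import paddle.fluid as fluid

from utils.init import init_pretraining_params

startup_prog = fluid.Program()
train_prog = fluid.Program()
with fluid.program_guard(train_prog, startup_prog):
    # ... build the ERNIE-ViL model and loss here ...
    pass

place = fluid.CUDAPlace(0)           # or fluid.CPUPlace() on a CPU-only machine
exe = fluid.Executor(place)
exe.run(startup_prog)                # initialize parameters randomly first

# Overwrite matching parameters with the downloaded pre-trained weights
# (one persistable file per parameter, as saved by fluid.io.save_vars).
init_pretraining_params(exe, "path/to/ernie-vil-base/params", main_program=train_prog)
```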