diff --git a/PaddleCV/PaddleVideo/README.md b/PaddleCV/PaddleVideo/README.md
index e277fb4a2a1de1a445f3ec58b9ac1debe3f488d5..34a2b78edcff8c1cee2602d22253653e622e1c83 100644
--- a/PaddleCV/PaddleVideo/README.md
+++ b/PaddleCV/PaddleVideo/README.md
@@ -16,10 +16,12 @@
 | [C-TCN](./models/ctcn/README.md) | Temporal action localization | Winning solution of ActivityNet 2018 |
 | [BSN](./models/bsn/README.md) | Temporal action localization | An efficient proposal-generation method for temporal action localization |
 | [BMN](./models/bmn/README.md) | Temporal action localization | Winning solution of ActivityNet 2019 |
+| [ETS](./models/ets/README.md) | Video captioning | ICCV'15 captioning model built around temporal attention |
+| [TALL](./models/tall/README.md) | Video grounding | ICCV'17 cross-modal temporal regression localization method |
 
 ### Key features
 
-- Covers several mainstream, state-of-the-art models for video classification and temporal action localization. Attention LSTM, Attention Cluster, and NeXtVLAD are popular feature-sequence models, while Non-local, TSN, TSM, and StNet are end-to-end video classification models. Attention LSTM is fast and accurate; NeXtVLAD was the best single model in the 2nd YouTube-8M challenge; TSN is the classic 2D-CNN-based solution; TSM is a simple and efficient temporal-shift approach to spatio-temporal modeling; Non-local introduced non-local correlation modeling for video. Attention Cluster and StNet were developed in-house at Baidu, published at CVPR 2018 and AAAI 2019 respectively, and were used in the first-place entry of the Kinetics-600 challenge. The C-TCN action localization model, also developed at Baidu, won the 2018 ActivityNet challenge. BSN generates proposals bottom-up, providing an efficient solution to proposal generation for temporal action localization. BMN is a Baidu model and the winning solution of ActivityNet 2019.
+- Covers several mainstream, state-of-the-art models for video classification and temporal action localization. Attention LSTM, Attention Cluster, and NeXtVLAD are popular feature-sequence models, while Non-local, TSN, TSM, and StNet are end-to-end video classification models. Attention LSTM is fast and accurate; NeXtVLAD was the best single model in the 2nd YouTube-8M challenge; TSN is the classic 2D-CNN-based solution; TSM is a simple and efficient temporal-shift approach to spatio-temporal modeling; Non-local introduced non-local correlation modeling for video. Attention Cluster and StNet were developed in-house at Baidu, published at CVPR 2018 and AAAI 2019 respectively, and were used in the first-place entry of the Kinetics-600 challenge. The C-TCN action localization model, also developed at Baidu, won the 2018 ActivityNet challenge. BSN generates proposals bottom-up, providing an efficient solution to proposal generation for temporal action localization. BMN is a Baidu model and the winning solution of ActivityNet 2019. ETS builds its network around a temporal attention mechanism and is the classic model for generating textual descriptions of video. TALL retrieves video clips through a cross-modal temporal regression localizer.
 
 - Provides generic skeleton code suited to video classification and action localization tasks; models can be configured, trained, and evaluated with a single command.
 
@@ -178,6 +180,17 @@ run.sh
 | BSN | 16 | 1x K40 | 7.0 | 66.64% (AUC) | [model-tem](https://paddlemodels.bj.bcebos.com/video_detection/BsnTem_final.pdparams), [model-pem](https://paddlemodels.bj.bcebos.com/video_detection/BsnPem_final.pdparams) |
 | BMN | 16 | 4x K40 | 7.0 | 67.19% (AUC) | [model](https://paddlemodels.bj.bcebos.com/video_detection/BMN_final.pdparams) |
 
+- Video captioning model on ActivityNet Captions:
+
+| Model | Batch Size | Environment | cuDNN Version | METEOR | Download |
+| :-------: | :---: | :---------: | :----: | :----: | :----------: |
+| ETS | 256 | 4x P40 | 7.0 | 9.8 | [model](https://paddlemodels.bj.bcebos.com/video_caption/ETS_final.pdparams) |
+
+- Video grounding model on TACoS:
+
+| Model | Batch Size | Environment | cuDNN Version | R1@IoU=0.5 | R5@IoU=0.5 | Download |
+| :-------: | :---: | :---------: | :----: | :----: | :----: | :----------: |
+| TALL | 56 | 1x P40 | 7.2 | 0.13 | 0.24 | [model](https://paddlemodels.bj.bcebos.com/video_grounding/TALL_final.pdparams) |
+
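+The R1@IoU=0.5 and R5@IoU=0.5 columns report the fraction of queries whose top-1 / top-5 retrieved clips overlap the ground-truth segment with a temporal IoU of at least 0.5. A minimal sketch of that overlap measure (the helper name is ours, not part of this repo):
+
+```python
+def temporal_iou(pred, gt):
+    """IoU of two time segments given as (start, end) in seconds."""
+    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
+    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
+    return inter / union if union > 0 else 0.0
+
+# temporal_iou((10.0, 25.0), (12.0, 30.0)) == 13 / 20 == 0.65 -> counts as a hit
+```
+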
 
 ## References
 
@@ -190,10 +203,12 @@ run.sh
 - [Non-local Neural Networks](https://arxiv.org/abs/1711.07971v1), Xiaolong Wang, Ross Girshick, Abhinav Gupta, Kaiming He.
 - [BSN: Boundary Sensitive Network for Temporal Action Proposal Generation](http://arxiv.org/abs/1806.02964), Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang, Ming Yang.
 - [BMN: Boundary-Matching Network for Temporal Action Proposal Generation](https://arxiv.org/abs/1907.09702), Tianwei Lin, Xiao Liu, Xin Li, Errui Ding, Shilei Wen.
-
+- [Describing Videos by Exploiting Temporal Structure](https://arxiv.org/abs/1502.08029), Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville.
+- [TALL: Temporal Activity Localization via Language Query](https://arxiv.org/abs/1705.02101), Jiyang Gao, Chen Sun, Zhenheng Yang, Ram Nevatia.
 
 ## Release notes
 
 - 3/2019: Initial model zoo release with five video classification models: Attention Cluster, Attention LSTM, NeXtVLAD, StNet, and TSN.
 - 4/2019: Released the Non-local and TSM video classification models.
 - 6/2019: Released the C-TCN temporal action localization model; added C2D ResNet101 and I3D ResNet50 backbones to Non-local; speed and memory optimizations for NeXtVLAD and TSM.
+- 10/2019: Released the BSN and BMN temporal action localization models, the ETS video captioning model, and the TALL video grounding model.
diff --git a/PaddleCV/PaddleVideo/data/dataset/ctcn/README.md b/PaddleCV/PaddleVideo/data/dataset/ctcn/README.md
index f834cea0cbb51960fd593a9f379f822b08515cca..90fff17ed76b6221acea1e56210252a5f1268acf 100644
--- a/PaddleCV/PaddleVideo/data/dataset/ctcn/README.md
+++ b/PaddleCV/PaddleVideo/data/dataset/ctcn/README.md
@@ -1,6 +1,6 @@
 # C-TCN data preparation
 
-The C-TCN model uses the ActivityNet 1.3 dataset; see the official [download instructions](http://activity-net.org/index.html). Before training, RGB and Flow features must be extracted from the source mp4 files and then passed through a trained TSN model to obtain abstract feature data stored in pickle format. We will provide a download link for the converted data. The converted data is laid out as follows:
+The C-TCN model uses the ActivityNet 1.3 dataset; see the official [download instructions](http://activity-net.org/index.html). Before training, RGB and Flow features must be extracted from the source mp4 files and then passed through a trained TSN model to obtain abstract feature data stored in pickle format. The converted data is hosted on Baidu Cloud at this [download link](https://paddlemodels.bj.bcebos.com/video_detection/CTCN_data.tar.gz). The converted data is laid out as follows:
 
 ```
 data
diff --git a/PaddleCV/PaddleVideo/data/dataset/ets/README.md b/PaddleCV/PaddleVideo/data/dataset/ets/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..c725f30dfc697657ed697a483c9596a418049206
--- /dev/null
+++ b/PaddleCV/PaddleVideo/data/dataset/ets/README.md
@@ -0,0 +1,37 @@
+# ETS data preparation
+
+The ETS model uses the ActivityNet Captions dataset. Prepare the data as follows:
+
+Step 1. Feature data:
+
+- From the [ActivityNet download page](http://activity-net.org/challenges/2019/tasks/anet_captioning.html), download the "Frame-level features" dataset (~89 GB). Place the downloaded resnet152\_features\_activitynet\_5fps\_320x240.pkl file under PaddleVideo/data/dataset/ets;
+
+- Run PaddleVideo/data/dataset/ets/generate\_train\_pickle.py to split the data into per-video pickle files for easy in-memory loading. The generated files are stored under PaddleVideo/data/dataset/ets/feat\_data.
+
+Step 2. Label and index data:
+
+- On the [Dense-Captioning Events in Videos project page](http://cs.stanford.edu/people/ranjaykrishna/densevid/), download the captions folder from the dataset link; it contains the label and index json files. Place the captions folder under PaddleVideo/data/dataset/ets;
+
+- Download the coco-caption folder as described in [metric computation](../../../metrics/ets\_metrics/README.md) and place it under PaddleVideo;
+
+- Run generate\_data.py with Python to produce the training text files train.list and val.list.
+
+Step 3. Inference data:
+
+- After the first two steps, run generate\_infer\_data.py with Python to produce infer.list (the command sequence for all three steps is summarized at the end of this page).
+
+After these steps, the final layout of PaddleVideo/data/dataset/ets is:
+
+```
+ets
+  |
+  |----feat_data/
+  |----train.list
+  |----val.list
+  |----preprocess.sh
+  |----generate_train_pickle.py
+  |----generate_data.py
+  |----generate_infer_data.py
+  |----captions/
+  |----resnet152_features_activitynet_5fps_320x240.pkl (can be removed after feat_data is generated, to save disk space)
+```
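+
+Assuming the feature pickle and the captions folder from steps one and two are already in place, the scripted part of the preparation boils down to the following commands, run from PaddleVideo/data/dataset/ets (a summary sketch of the steps above):
+
+```bash
+python generate_train_pickle.py   # step 1: split the big feature pickle into per-video files under feat_data/
+python generate_data.py           # step 2: build dict.txt plus train.list and val.list
+python generate_infer_data.py     # step 3: copy the first 100 lines of val.list into infer.list
+```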
diff --git a/PaddleCV/PaddleVideo/data/dataset/ets/generate_data.py b/PaddleCV/PaddleVideo/data/dataset/ets/generate_data.py
new file mode 100644
index 0000000000000000000000000000000000000000..13080d88b70d3488af67ca9024cc9193dae42526
--- /dev/null
+++ b/PaddleCV/PaddleVideo/data/dataset/ets/generate_data.py
@@ -0,0 +1,111 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+#Licensed under the Apache License, Version 2.0 (the "License");
+#you may not use this file except in compliance with the License.
+#You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#Unless required by applicable law or agreed to in writing, software
+#distributed under the License is distributed on an "AS IS" BASIS,
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#See the License for the specific language governing permissions and
+#limitations under the License.
+
+import os
+import json
+import sys
+sys.path.insert(0, '../../../coco-caption')
+from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
+
+
+def remove_nonascii(text):
+    """Replace non-ascii characters with spaces."""
+    return ''.join([i if ord(i) < 128 else ' ' for i in text])
+
+
+def generate_dictionary(caption_file_path):
+    index = 0
+    input_dict = {}
+
+    # get all sentences
+    train_data = json.loads(open(os.path.join( \
+        caption_file_path, 'train.json')).read())
+    for vid, content in train_data.iteritems():
+        sentences = content['sentences']
+        for s in sentences:
+            input_dict[index] = [{'caption': remove_nonascii(s)}]
+            index += 1
+
+    # ptbtokenizer
+    tokenizer = PTBTokenizer()
+    output_dict = tokenizer.tokenize(input_dict)
+
+    # count word frequencies
+    word_count_dict = {}
+    for _, sentence in output_dict.iteritems():
+        words = sentence[0].split()
+        for w in words:
+            if w not in word_count_dict:
+                word_count_dict[w] = 1
+            else:
+                word_count_dict[w] += 1
+
+    # output dictionary: special tokens first, then words sorted by frequency
+    with open('dict.txt', 'w') as f:
+        f.write('<s> -1\n')
+        f.write('<e> -1\n')
+        f.write('<unk> -1\n')
+
+        truncation = 3
+        for word, freq in sorted(word_count_dict.iteritems(), \
+                key=lambda x: x[1], reverse=True):
+            if freq >= truncation:
+                f.write('%s %d\n' % (word, freq))
+
+    print 'Generate dictionary done ...'
+
+
+def generate_data_list(mode, caption_file_path):
+    # get file name
+    if mode == 'train':
+        file_name = 'train.json'
+    elif mode == 'val':
+        file_name = 'val_1.json'
+    else:
+        print 'Invalid mode: %s' % mode
+        sys.exit()
+
+    # get timestamps and sentences
+    input_dict = {}
+    data = json.loads(open(os.path.join( \
+        caption_file_path, file_name)).read())
+    for vid, content in data.iteritems():
+        sentences = content['sentences']
+        timestamps = content['timestamps']
+        for t, s in zip(timestamps, sentences):
+            dictkey = ' '.join([vid, str(t[0]), str(t[1])])
+            input_dict[dictkey] = [{'caption': remove_nonascii(s)}]
+
+    # ptbtokenizer
+    tokenizer = PTBTokenizer()
+    output_dict = tokenizer.tokenize(input_dict)
+
+    with open('%s.list' % mode, 'wb') as f:
+        for key, sentence in output_dict.iteritems():
+            try:
+                f.write('\t'.join(key.split() + sentence) + '\n')
+            except:
+                pass
+
+    print 'Generate %s.list done ...' % mode
+
+
+if __name__ == '__main__':
+    caption_file_path = './captions/'
+
+    generate_dictionary(caption_file_path)
+
+    generate_data_list('train', caption_file_path)
+    generate_data_list('val', caption_file_path)
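+
+# For reference, each line written to train.list / val.list above is the
+# tab-joined video id, start time, end time, and tokenized caption, e.g.
+# (illustrative values, not real data):
+#   v_example001<TAB>0.5<TAB>12.3<TAB>a man plays the guitar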
diff --git a/PaddleCV/PaddleVideo/data/dataset/ets/generate_infer_data.py b/PaddleCV/PaddleVideo/data/dataset/ets/generate_infer_data.py
new file mode 100644
index 0000000000000000000000000000000000000000..fb1b0046eeac953dec9d8cc0286a6b7d1c863a00
--- /dev/null
+++ b/PaddleCV/PaddleVideo/data/dataset/ets/generate_infer_data.py
@@ -0,0 +1,19 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+#Licensed under the Apache License, Version 2.0 (the "License");
+#you may not use this file except in compliance with the License.
+#You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#Unless required by applicable law or agreed to in writing, software
+#distributed under the License is distributed on an "AS IS" BASIS,
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#See the License for the specific language governing permissions and
+#limitations under the License.
+
+f_train = open('val.list')
+f_lines = f_train.readlines()
+f_train.close()
+with open('infer.list', 'wb') as f:
+    # take the first 100 validation entries as the inference list
+    for i in range(100):
+        f.write(f_lines[i])
diff --git a/PaddleCV/PaddleVideo/data/dataset/ets/generate_train_pickle.py b/PaddleCV/PaddleVideo/data/dataset/ets/generate_train_pickle.py
new file mode 100644
index 0000000000000000000000000000000000000000..82918c962708d810b83ed417f55be169d8d5bebf
--- /dev/null
+++ b/PaddleCV/PaddleVideo/data/dataset/ets/generate_train_pickle.py
@@ -0,0 +1,54 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+#Licensed under the Apache License, Version 2.0 (the "License");
+#you may not use this file except in compliance with the License.
+#You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#Unless required by applicable law or agreed to in writing, software
+#distributed under the License is distributed on an "AS IS" BASIS,
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#See the License for the specific language governing permissions and
+#limitations under the License.
+
+import pickle
+import os
+
+import multiprocessing
+
+output_dir = './feat_data'
+if not os.path.exists(output_dir):
+    os.makedirs(output_dir)
+fname = 'resnet152_features_activitynet_5fps_320x240.pkl'
+d = pickle.load(open(fname, 'rb'))
+
+
+def save_file(filenames, process_id):
+    # dump each video's features into its own pickle file
+    count = 0
+    for key in filenames:
+        pickle.dump(d[key], open(os.path.join(output_dir, key), 'wb'))
+        count += 1
+        if count % 100 == 0:
+            print('# %d processed %d samples' % (process_id, count))
+    print('# %d total processed %d samples' % (process_id, count))
+
+
+total_keys = list(d.keys())
+
+num_threads = 8
+filelists = [None] * num_threads
+seg_nums = len(total_keys) // num_threads
+
+p_list = [None] * num_threads
+
+for i in range(num_threads):
+    # the last process also takes any remainder keys
+    if i == num_threads - 1:
+        filelists[i] = total_keys[i * seg_nums:]
+    else:
+        filelists[i] = total_keys[i * seg_nums:(i + 1) * seg_nums]
+
+    p_list[i] = multiprocessing.Process(
+        target=save_file, args=(filelists[i], i))
+
+    p_list[i].start()
diff --git a/PaddleCV/PaddleVideo/data/dataset/tall/README.md b/PaddleCV/PaddleVideo/data/dataset/tall/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..b5fcf949ba7c324a0c541e57199d35ccc0f44e8e
--- /dev/null
+++ b/PaddleCV/PaddleVideo/data/dataset/tall/README.md
@@ -0,0 +1,27 @@
+# TALL data preparation
+
+The TALL model uses the TACoS dataset. Prepare the data as follows:
+
+Step 1. Training and test sets:
+
+- Training and testing use pre-extracted features; follow the [data download](https://github.com/jiyanggao/TALL) instructions provided by the original TALL authors to train and evaluate the model;
+
+Step 2. Inference data:
+
+- To make inference convenient, we provide ./gen\_infer.py for generating the inference data. Once step one is complete, run it with Python to generate the inference data in the current folder (see the command sketch after the directory layout below).
+
+After these steps, PaddleVideo/data/dataset/tall should contain:
+
+```
+tall
+  |
+  |----Interval64_128_256_512_overlap0.8_c3d_fc6/
+  |----Interval128_256_overlap0.8_c3d_fc6/
+  |----train_clip-sentvec.pkl
+  |----test_clip-sentvec.pkl
+  |----video_allframes_info.pkl
+  |----infer
+        |
+        |----infer_feat/
+        |----infer_clip-sen.pkl
+```
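+
+With the features and pickles from step one in place, generating the inference data is a single command, run from PaddleVideo/data/dataset/tall:
+
+```bash
+# picks one movie out of test_clip-sentvec.pkl and writes
+# infer/infer_clip-sen.pkl plus its feature maps under infer/infer_feat/
+python gen_infer.py
+```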
diff --git a/PaddleCV/PaddleVideo/data/dataset/tall/gen_infer.py b/PaddleCV/PaddleVideo/data/dataset/tall/gen_infer.py
new file mode 100644
index 0000000000000000000000000000000000000000..2fc93dfe158bf3ea7d53ad2c8d2c83364792c8da
--- /dev/null
+++ b/PaddleCV/PaddleVideo/data/dataset/tall/gen_infer.py
@@ -0,0 +1,56 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+#
+#Licensed under the Apache License, Version 2.0 (the "License");
+#you may not use this file except in compliance with the License.
+#You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#Unless required by applicable law or agreed to in writing, software
+#distributed under the License is distributed on an "AS IS" BASIS,
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#See the License for the specific language governing permissions and
+#limitations under the License.
+
+# select the sentence vectors and feature maps of one movie for inference
+import os
+import sys
+import pickle
+import numpy as np
+
+infer_path = 'infer'
+infer_feat_path = 'infer/infer_feat'
+
+if not os.path.exists(infer_path):
+    os.mkdir(infer_path)
+if not os.path.exists(infer_feat_path):
+    os.mkdir(infer_feat_path)
+
+python_ver = sys.version_info
+
+pickle_path = 'test_clip-sentvec.pkl'
+if python_ver < (3, 0):
+    movies_sentence = pickle.load(open(pickle_path, 'rb'))
+else:
+    movies_sentence = pickle.load(open(pickle_path, 'rb'), encoding='bytes')
+
+select_name = movies_sentence[0][0].split('.')[0]
+
+res_sentence = []
+for movie_sentence in movies_sentence:
+    if movie_sentence[0].split('.')[0] == select_name:
+        res_sen = []
+        res_sen.append(movie_sentence[0])
+        res_sen.append([movie_sentence[1][0]])  # keep only the first sentence
+        res_sentence.append(res_sen)
+
+with open('infer/infer_clip-sen.pkl', 'wb') as f:
+    pickle.dump(res_sentence, f, protocol=2)
+
+movies_feat = os.listdir('Interval128_256_overlap0.8_c3d_fc6')
+for movie_feat in movies_feat:
+    if movie_feat.split('.')[0] == select_name:
+        feat_path = os.path.join('Interval128_256_overlap0.8_c3d_fc6',
+                                 movie_feat)
+        feat = np.load(feat_path)
+        np.save(os.path.join(infer_feat_path, movie_feat), feat)
diff --git a/PaddleCV/PaddleVideo/metrics/ets_metrics/README.md b/PaddleCV/PaddleVideo/metrics/ets_metrics/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..b6a3ae0e61e17d3335439d4cf0921316ec0ae086
--- /dev/null
+++ b/PaddleCV/PaddleVideo/metrics/ets_metrics/README.md
@@ -0,0 +1,9 @@
+## ActivityNet Captions metric computation
+
+- The ActivityNet Captions evaluation code is available from the [official repository](https://github.com/ranjaykrishna/densevid_eval);
+
+- Download the evaluation code and copy coco-caption and evaluate.py into PaddleVideo;
+
+- To compute the metric, run evaluate.py with Python; the -s flag selects the result file and the -r flag selects the reference (label) file;
+
+- Since the model's score fluctuates noticeably between epochs, you can evaluate checkpoints from several epochs during training; the best METEOR is around 10.0.
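+
+A typical invocation, run from the PaddleVideo directory (the file paths here are illustrative; point -s at the json produced by eval.py and -r at the downloaded reference captions):
+
+```bash
+python evaluate.py -s data/evaluate_results/ets_results.json -r data/dataset/ets/captions/val_1.json
+```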
diff --git a/PaddleCV/PaddleVideo/models/ets/README.md b/PaddleCV/PaddleVideo/models/ets/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..5b09dc91c256bfef640cc5ea7fa897737f544c72
--- /dev/null
+++ b/PaddleCV/PaddleVideo/models/ets/README.md
@@ -0,0 +1,107 @@
+# ETS video captioning model
+
+---
+## Contents
+
+- [Introduction](#introduction)
+- [Data preparation](#data-preparation)
+- [Training](#training)
+- [Evaluation](#evaluation)
+- [Inference](#inference)
+- [Reference](#reference)
+
+
+## Introduction
+
+Describing Videos by Exploiting Temporal Structure, proposed by Li Yao et al. at the University of Montreal, is the classic model for generating textual descriptions of video clips; we refer to it as ETS. The model follows the encoder-decoder paradigm: for an input video it first extracts local spatio-temporal features with 3D convolutions, then applies an attention mechanism along the temporal dimension, fuses the local features at the global scale with an LSTM, and finally emits the textual description.
+
+For details, see [Describing Videos by Exploiting Temporal Structure](https://arxiv.org/abs/1502.08029).
+
+
+## Data preparation
+
+ETS is trained on the ActivityNet Captions dataset; for download and preparation see the [data notes](../../data/dataset/ets/README.md).
+
+## Training
+
+Once the data is ready, training can be started in either of two ways:
+
+    export CUDA_VISIBLE_DEVICES=0,1,2,3
+    export FLAGS_fast_eager_deletion_mode=1
+    export FLAGS_eager_delete_tensor_gb=0.0
+    export FLAGS_fraction_of_gpu_memory_to_use=0.98
+    python train.py --model_name=ETS \
+                    --config=./configs/ets.yaml \
+                    --log_interval=10 \
+                    --valid_interval=1 \
+                    --use_gpu=True \
+                    --save_dir=./data/checkpoints \
+                    --fix_random_seed=False
+
+    bash run.sh train ETS ./configs/ets.yaml
+
+- Training starts from scratch with the command line or script above; no pretrained model is needed.
+
+- You can download the released [model](https://paddlemodels.bj.bcebos.com/video_caption/ETS_final.pdparams) and pass its path via `--resume` for finetuning and other development.
+
+
+**Training strategy:**
+
+* Adam optimizer
+* Weight decay coefficient 1e-4
+* Learning rate follows the Noam decay schedule (sketched below)
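+
+A minimal sketch of the Noam schedule referenced above (the `d_model` and `warmup_steps` values here are illustrative, not the ones in ets.yaml):
+
+```python
+def noam_lr(step, d_model=512, warmup_steps=8000):
+    # rises linearly for warmup_steps, then decays as 1/sqrt(step)
+    step = max(step, 1)
+    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
+```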
+
+## Evaluation
+
+The model can be evaluated in either of two ways:
+
+    python eval.py --model_name=ETS \
+                   --config=./configs/ets.yaml \
+                   --log_interval=1 \
+                   --weights=$PATH_TO_WEIGHTS \
+                   --use_gpu=True
+
+    bash run.sh eval ETS ./configs/ets.yaml
+
+- When evaluating with `run.sh`, edit the `weights` variable in the script to point at the weights to evaluate.
+
+- If `--weights` is not given, the script downloads the released [model](https://paddlemodels.bj.bcebos.com/video_caption/ETS_final.pdparams) and evaluates it.
+
+- Running the above saves the test results to a json file, by default under data/evaluate\_results. METEOR can then be computed with the official ActivityNet Captions test scripts; see [metric computation](../../metrics/ets_metrics/README.md) for the exact procedure.
+
+- To evaluate on CPU, set `use_gpu` to False in the command line or the run.sh script.
+
+
+Accuracy on ActivityNet Captions:
+
+| METEOR |
+| :----: |
+|  9.8   |
+
+
+## Inference
+
+Inference can be started in either of two ways:
+
+    python predict.py --model_name=ETS \
+                      --config=./configs/ets.yaml \
+                      --log_interval=1 \
+                      --weights=$PATH_TO_WEIGHTS \
+                      --filelist=$FILELIST \
+                      --use_gpu=True
+
+    bash run.sh predict ETS ./configs/ets.yaml
+
+- When launching from the python command line, `--filelist` specifies the file list to run inference on. You can also follow step 3 of the [data notes](../../data/dataset/ets/README.md) to generate the default inference file list. `--weights` points at trained weights; if unset, the program automatically downloads the released weights.
+
+- When running inference with `run.sh`, edit the `weights` variable in the script to point at the weights to use.
+
+- If `--weights` is not given, the script downloads the released [model](https://paddlemodels.bj.bcebos.com/video_caption/ETS_final.pdparams) and runs inference with it.
+
+- Inference results are stored in a json file, by default under `data/dataset/predict_results`.
+
+- To run inference on CPU, set `use_gpu` to False in the command line or the run.sh script.
+
+## Reference
+
+- [Describing Videos by Exploiting Temporal Structure](https://arxiv.org/abs/1502.08029), Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville.
diff --git a/PaddleCV/PaddleVideo/models/tall/README.md b/PaddleCV/PaddleVideo/models/tall/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..17ff701b0aafa55e8287c2efa399e6ede436dcc8
--- /dev/null
+++ b/PaddleCV/PaddleVideo/models/tall/README.md
@@ -0,0 +1,108 @@
+# TALL video grounding model
+
+---
+## Contents
+
+- [Introduction](#introduction)
+- [Data preparation](#data-preparation)
+- [Training](#training)
+- [Evaluation](#evaluation)
+- [Inference](#inference)
+- [Reference](#reference)
+
+
+## Introduction
+
+TALL, proposed by Jiyang Gao et al. at the University of Southern California, is the classic model for video grounding. Given an input text query and video clips, TALL uses a Cross-modal Temporal Regression Localizer (CTRL) to combine the video content with the textual description and output location offsets and a confidence score. CTRL has four modules: a visual encoder extracts features from video clips, a sentence encoder extracts feature vectors from the query, a multi-modal processing network fuses the textual and visual features into joint features, and a temporal regression network produces the confidence score and offsets.
+
+For details, see [TALL: Temporal Activity Localization via Language Query](https://arxiv.org/abs/1705.02101).
+
+
+## Data preparation
+
+TALL is trained on the TACoS dataset; for download and preparation see the [data notes](../../data/dataset/tall/README.md).
+
+## Training
+
+Once the data is ready, training can be started in either of two ways:
+
+    export CUDA_VISIBLE_DEVICES=0
+    export FLAGS_fast_eager_deletion_mode=1
+    export FLAGS_eager_delete_tensor_gb=0.0
+    export FLAGS_fraction_of_gpu_memory_to_use=0.98
+    python train.py --model_name=TALL \
+                    --config=./configs/tall.yaml \
+                    --log_interval=10 \
+                    --valid_interval=10000 \
+                    --use_gpu=True \
+                    --save_dir=./data/checkpoints \
+                    --fix_random_seed=False
+
+    bash run.sh train TALL ./configs/tall.yaml
+
+- Training starts from scratch with the command line or script above; no pretrained model is needed.
+
+- You can download the released [model](https://paddlemodels.bj.bcebos.com/video_grounding/TALL_final.pdparams) and pass its path via `--resume` for finetuning and other development.
+
+- No validation set is configured for this model, so valid\_interval is set to 10000 and no validation runs during training.
+
+
+**Training strategy:**
+
+* Adam optimizer
+* Learning rate 1e-3
+
+## Evaluation
+
+The model can be evaluated in either of two ways:
+
+    python eval.py --model_name=TALL \
+                   --config=./configs/tall.yaml \
+                   --log_interval=1 \
+                   --weights=$PATH_TO_WEIGHTS \
+                   --use_gpu=True
+
+    bash run.sh eval TALL ./configs/tall.yaml
+
+- When evaluating with `run.sh`, edit the `weights` variable in the script to point at the weights to evaluate.
+
+- If `--weights` is not given, the script downloads the released [model](https://paddlemodels.bj.bcebos.com/video_grounding/TALL_final.pdparams) and evaluates it.
+
+- Running the above prints the test results and also saves them to a json file, by default under data/evaluate\_results.
+
+- To evaluate on CPU, set `use_gpu` to False in the command line or the run.sh script.
+
+
+Accuracy on TACoS:
+
+| R1@IoU=0.5 | R5@IoU=0.5 |
+| :----: | :----: |
+| 0.13 | 0.24 |
+
+
+## Inference
+
+Inference can be started in either of two ways:
+
+    python predict.py --model_name=TALL \
+                      --config=./configs/tall.yaml \
+                      --log_interval=1 \
+                      --weights=$PATH_TO_WEIGHTS \
+                      --filelist=$FILELIST \
+                      --use_gpu=True
+
+    bash run.sh predict TALL ./configs/tall.yaml
+
+- When launching from the python command line, `--filelist` specifies the files to run inference on. You can also follow step 2 of the [data notes](../../data/dataset/tall/README.md) to generate the default inference data. `--weights` points at trained weights; if unset, the program automatically downloads the released weights.
+
+- When running inference with `run.sh`, edit the `weights` variable in the script to point at the weights to use.
+
+- If `--weights` is not given, the script downloads the released [model](https://paddlemodels.bj.bcebos.com/video_grounding/TALL_final.pdparams) and runs inference with it.
+
+- Inference results are stored in a json file, by default under `data/dataset/predict_results`.
+
+- To run inference on CPU, set `use_gpu` to False in the command line or the run.sh script.
+
+## Reference
+
+- [TALL: Temporal Activity Localization via Language Query](https://arxiv.org/abs/1705.02101), Jiyang Gao, Chen Sun, Zhenheng Yang, Ram Nevatia.
diff --git a/PaddleCV/PaddleVideo/run.sh b/PaddleCV/PaddleVideo/run.sh
index 635e3a38439706f21adb877625bbb8a61b99dc58..9b961bc2119426cf890401df7075c853caf3341e 100644
--- a/PaddleCV/PaddleVideo/run.sh
+++ b/PaddleCV/PaddleVideo/run.sh
@@ -75,7 +75,7 @@ elif [ "$mode"x == "eval"x ]; then
 elif [ "$mode"x == "predict"x ]; then
     echo $mode $name $configs $weights
     if [ "$weights"x != ""x ]; then
-        python -i predict.py --model_name=$name \
+        python predict.py --model_name=$name \
                 --config=$configs \
                 --log_interval=$log_interval \
                 --weights=$weights \