add documents for ets and tall (#3756)

* add documents for ets and tall
* fix bug of inference.py
* refine documents of ets and tall following the comments of reviewer
* delete preprocess.sh for easy-using
* fix pickle.load to be compatible in python3
* add data-download link of ctcn data

a9abd027 · huangjun12 · SunGaofeng · 2fd3f557 · a9abd027 · a9abd027
12 changed file
--- a/PaddleCV/PaddleVideo/README.md
+++ b/PaddleCV/PaddleVideo/README.md
@@ -16,10 +16,12 @@
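 | [C-TCN](./models/ctcn/README.md) | Video action localization | Winning solution of ActivityNet 2018 |
 | [BSN](./models/bsn/README.md) | Video action localization | Efficient proposal generation method for video action localization |
 | [BMN](./models/bmn/README.md) | Video action localization | Winning solution of ActivityNet 2019 |
+| [ETS](./models/ets/README.md) | Video captioning | Modeling method with temporal attention, proposed at ICCV'15 |
+| [TALL](./models/tall/README.md) | Video grounding | Cross-modal temporal regression localization method, ICCV'17 |
 ### Key Features
- Includes several leading mainstream models for video classification and action localization. Attention LSTM, Attention Cluster, and NeXtVLAD are popular feature-sequence models; Non-local, TSN, TSM, and StNet are end-to-end video classification models. Attention LSTM is fast and accurate; NeXtVLAD was the best single model in the 2nd YouTube-8M challenge; TSN is a classic 2D-CNN-based solution; TSM is a simple and efficient spatio-temporal modeling method based on temporal shift; Non-local introduces non-local relation modeling for video. Attention Cluster and StNet were developed at Baidu, published at CVPR 2018 and AAAI 2019 respectively, and were used in the first-place Kinetics600 entry. The C-TCN action localization model was also developed at Baidu and is the winning solution of ActivityNet 2018. BSN generates proposals bottom-up, providing an efficient solution for proposal generation in video action localization. BMN was developed at Baidu and is the winning solution of ActivityNet 2019.
+- Includes several leading mainstream models for video classification and action localization. Attention LSTM, Attention Cluster, and NeXtVLAD are popular feature-sequence models; Non-local, TSN, TSM, and StNet are end-to-end video classification models. Attention LSTM is fast and accurate; NeXtVLAD was the best single model in the 2nd YouTube-8M challenge; TSN is a classic 2D-CNN-based solution; TSM is a simple and efficient spatio-temporal modeling method based on temporal shift; Non-local introduces non-local relation modeling for video. Attention Cluster and StNet were developed at Baidu, published at CVPR 2018 and AAAI 2019 respectively, and were used in the first-place Kinetics600 entry. The C-TCN action localization model was also developed at Baidu and is the winning solution of ActivityNet 2018. BSN generates proposals bottom-up, providing an efficient solution for proposal generation in video action localization. BMN was developed at Baidu and is the winning solution of ActivityNet 2019. ETS builds its network around a temporal attention mechanism and is a classic model for generating text descriptions of videos. TALL locates video clips using a cross-modal temporal regression localizer.
 - Provides general skeleton code suited to video classification and action localization tasks, so users can configure a model and run training and evaluation in one step.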
@@ -178,6 +180,17 @@ run.sh
 | BSN | 16 | 1 K40 GPU | 7.0 | 66.64% (AUC) | [model-tem](https://paddlemodels.bj.bcebos.com/video_detection/BsnTem_final.pdparams), [model-pem](https://paddlemodels.bj.bcebos.com/video_detection/BsnPem_final.pdparams) |
 | BMN | 16 | 4 K40 GPUs | 7.0 | 67.19% (AUC) | [model](https://paddlemodels.bj.bcebos.com/video_detection/BMN_final.pdparams) |
+- Video captioning model on ActivityNet Captions:
+| Model | Batch Size | Environment | cuDNN Version | METEOR | Download |
+| :-------: | :---: | :---------: | :----: | :----: | :----------: |
+| ETS | 256 | 4 P40 GPUs | 7.0 | 9.8 | [model](https://paddlemodels.bj.bcebos.com/video_caption/ETS_final.pdparams) |
+- Video grounding model on TACoS:
+| Model | Batch Size | Environment | cuDNN Version | R1@IOU5 | R5@IOU5 | Download |
+| :-------: | :---: | :---------: | :----: | :----: | :----: | :----------: |
+| TALL | 56 | 1 P40 GPU | 7.2 | 0.13 | 0.24 | [model](https://paddlemodels.bj.bcebos.com/video_grounding/TALL_final.pdparams) |
 ## References
@@ -190,10 +203,12 @@ run.sh
 - [Non-local Neural Networks](https://arxiv.org/abs/1711.07971v1), Xiaolong Wang, Ross Girshick, Abhinav Gupta, Kaiming He
 - [Bsn: Boundary sensitive network for temporal action proposal generation](http://arxiv.org/abs/1806.02964), Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang, Ming Yang.
 - [BMN: Boundary-Matching Network for Temporal Action Proposal Generation](https://arxiv.org/abs/1907.09702), Tianwei Lin, Xiao Liu, Xin Li, Errui Ding, Shilei Wen.
+- [Describing Videos by Exploiting Temporal Structure](https://arxiv.org/abs/1502.08029).
+- [TALL: Temporal Activity Localization via Language Query](https://arxiv.org/abs/1705.02101).
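 ## Release Notes
 - 3/2019: Added the model zoo and released five video classification models: Attention Cluster, Attention LSTM, NeXtVLAD, StNet, and TSN.
 - 4/2019: Released two video classification models: Non-local and TSM.
 - 6/2019: Released the C-TCN video action localization model; added C2D ResNet101 and I3D ResNet50 backbones for the Non-local model; optimized speed and GPU memory usage for NeXtVLAD and TSM.
+- 10/2019: Released the video action localization models BSN and BMN, the video captioning model ETS, and the video grounding model TALL.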
--- a/PaddleCV/PaddleVideo/data/dataset/ctcn/README.md
+++ b/PaddleCV/PaddleVideo/data/dataset/ctcn/README.md
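 # C-TCN Model Data Instructions
-The C-TCN model uses the ActivityNet 1.3 dataset; see the official [download instructions](http://activity-net.org/index.html) for how to obtain it. To train this model, first extract RGB and Flow features from the mp4 source files, then use a trained TSN model to extract abstract feature data and store it in pickle format. We will provide a download link for the converted data. The converted data has the following directory structure:
+The C-TCN model uses the ActivityNet 1.3 dataset; see the official [download instructions](http://activity-net.org/index.html) for how to obtain it. To train this model, first extract RGB and Flow features from the mp4 source files, then use a trained TSN model to extract abstract feature data and store it in pickle format. The converted data is hosted on Baidu Cloud at this [download link](https://paddlemodels.bj.bcebos.com/video_detection/CTCN_data.tar.gz). The converted data has the following directory structure: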
 ```
 data

--- a/PaddleCV/PaddleVideo/data/dataset/ets/README.md
+++ b/PaddleCV/PaddleVideo/data/dataset/ets/README.md
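+# ETS Model Data Instructions
+The ETS model uses the ActivityNet Captions dataset. Prepare the data as follows:
+Step 1. Prepare the feature data:
+- On the [ActivityNet download page](http://activity-net.org/challenges/2019/tasks/anet_captioning.html), download the "Frame-level features" dataset (~89GB). Place the downloaded resnet152\_features\_activitynet\_5fps\_320x240.pkl file in the PaddleVideo/data/dataset/ets directory;
+- Run PaddleVideo/data/dataset/ets/generate\_train\_pickle.py to convert the data into pickle files that are easier to load into memory. The generated data is stored in the PaddleVideo/data/dataset/ets/feat\_data folder.
+Step 2. Prepare the label and index data:
+- On the [Dense-Captioning Events in Videos project page](http://cs.stanford.edu/people/ranjaykrishna/densevid/), download the captions folder from the dataset link; it contains the label and index json files. Place the captions folder in the PaddleVideo/data/dataset/ets directory;
+- Follow the [evaluation](../../../metrics/ets\_metrics/README.md) steps to download the coco-caption folder and place it in the PaddleVideo directory;
+- Run generate\_data.py with python to generate the training text files train.list and val.list.
+Step 3. Generate the infer data:
+- After completing the first two steps, run generate\_infer\_data.py with python to generate the infer.list file.
+After these steps, the final directory structure of PaddleVideo/data/dataset/ets is: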
+```
+ets
+  |
+  |----feat_data/
+  |----train.list
+  |----val.list
+  |----preprocess.sh
+  |----generate_train_pickle.py
+  |----generate_data.py
+  |----generate_infer_data.py
+  |----captions/
+  |----resnet152_features_activitynet_5fps_320x240.pkl (can be removed after feat_data is generated to save disk space)
+```
--- a/PaddleCV/PaddleVideo/data/dataset/ets/generate_data.py
+++ b/PaddleCV/PaddleVideo/data/dataset/ets/generate_data.py
+#  Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
+#
+#Licensed under the Apache License, Version 2.0 (the "License");
+#you may not use this file except in compliance with the License.
+#You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#Unless required by applicable law or agreed to in writing, software
+#distributed under the License is distributed on an "AS IS" BASIS,
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#See the License for the specific language governing permissions and
+#limitations under the License.
+import os
+import json
+import sys
+sys.path.insert(0, '../../../coco-caption')
+from pycocoevalcap.tokenizer.ptbtokenizer import PTBTokenizer
+def remove_nonascii(text):
+    """ remove nonascii
+    """
+    return ''.join([i if ord(i) < 128 else ' ' for i in text])
+def generate_dictionary(caption_file_path):
+    index = 0
+    input_dict = {}
+    # get all sentences
+    train_data = json.loads(open(os.path.join( \
+            caption_file_path, 'train.json')).read())
+    for vid, content in train_data.iteritems():
+        sentences = content['sentences']
+        for s in sentences:
+            input_dict[index] = [{'caption': remove_nonascii(s)}]
+            index += 1
+    # ptbtokenizer
+    tokenizer = PTBTokenizer()
+    output_dict = tokenizer.tokenize(input_dict)
+    # sort by word frequency
+    word_count_dict = {}
+    for _, sentence in output_dict.iteritems():
+        words = sentence[0].split()
+        for w in words:
+            if w not in word_count_dict:
+                word_count_dict[w] = 1
+            else:
+                word_count_dict[w] += 1
+    # output dictionary
+    with open('dict.txt', 'w') as f:
+        f.write('<s> -1\n')
+        f.write('<e> -1\n')
+        f.write('<unk> -1\n')
+        truncation = 3
+        for word, freq in sorted(word_count_dict.iteritems(), \
+                key=lambda x:x[1], reverse=True):
+            if freq >= truncation:
+                f.write('%s %d\n' % (word, freq))
+    print 'Generate dictionary done ...'
+def generate_data_list(mode, caption_file_path):
+    # get file name
+    if mode == 'train':
+        file_name = 'train.json'
+    elif mode == 'val':
+        file_name = 'val_1.json'
+    else:
+        print 'Invalid mode: %s' % mode
+        sys.exit()
+    # get timestamps and sentences
+    input_dict = {}
+    data = json.loads(open(os.path.join( \
+            caption_file_path, file_name)).read())
+    for vid, content in data.iteritems():
+        sentences = content['sentences']
+        timestamps = content['timestamps']
+        for t, s in zip(timestamps, sentences):
+            dictkey = ' '.join([vid, str(t[0]), str(t[1])])
+            input_dict[dictkey] = [{'caption': remove_nonascii(s)}]
+    # ptbtokenizer
+    tokenizer = PTBTokenizer()
+    output_dict = tokenizer.tokenize(input_dict)
+    with open('%s.list' % mode, 'wb') as f:
+        for id, sentence in output_dict.iteritems():
+            try:
+                f.write('\t'.join(id.split() + sentence) + '\n')
+            except:
+                pass
+    print 'Generate %s.list done ...' % mode
+if __name__ == '__main__':
+    caption_file_path = './captions/'
+    generate_dictionary(caption_file_path)
+    generate_data_list('train', caption_file_path)
+    generate_data_list('val', caption_file_path)
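As written, `generate_data_list` emits one tab-separated record per line: the video id, the start and end timestamps, then the tokenized caption. A minimal sketch of reading such a record back, assuming exactly that four-field layout; the `parse_list_line` helper and the sample video id are hypothetical and not part of this change:

```python
def parse_list_line(line):
    """Split one train.list / val.list record into typed fields.

    Assumed layout: <video id>\t<start>\t<end>\t<caption>\n
    """
    # maxsplit=3 keeps the caption intact even if it contained a tab
    vid, start, end, caption = line.rstrip('\n').split('\t', 3)
    return vid, float(start), float(end), caption


# Hypothetical record, for illustration only
record = parse_list_line('v_example\t0.5\t12.25\ta man is playing a guitar\n')
print(record)
```

Parsing the timestamps to floats up front makes it easy to compute the segment length when slicing frame-level features.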
--- a/PaddleCV/PaddleVideo/data/dataset/ets/generate_infer_data.py
+++ b/PaddleCV/PaddleVideo/data/dataset/ets/generate_infer_data.py
+#  Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
+#
+#Licensed under the Apache License, Version 2.0 (the "License");
+#you may not use this file except in compliance with the License.
+#You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#Unless required by applicable law or agreed to in writing, software
+#distributed under the License is distributed on an "AS IS" BASIS,
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#See the License for the specific language governing permissions and
+#limitations under the License.
+# Copy the first 100 records of val.list into infer.list.
+with open('val.list') as f_val:
+    f_lines = f_val.readlines()
+with open('infer.list', 'w') as f:
+    for line in f_lines[:100]:
+        f.write(line)
--- a/PaddleCV/PaddleVideo/data/dataset/ets/generate_train_pickle.py
+++ b/PaddleCV/PaddleVideo/data/dataset/ets/generate_train_pickle.py
+#  Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
+#
+#Licensed under the Apache License, Version 2.0 (the "License");
+#you may not use this file except in compliance with the License.
+#You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#Unless required by applicable law or agreed to in writing, software
+#distributed under the License is distributed on an "AS IS" BASIS,
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#See the License for the specific language governing permissions and
+#limitations under the License.
+import pickle
+import os
+import multiprocessing
+output_dir = './feat_data'
+if not os.path.exists(output_dir):
+    os.makedirs(output_dir)
+fname = 'resnet152_features_activitynet_5fps_320x240.pkl'
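+# Pickle files must be opened in binary mode to load correctly under Python 3.
+with open(fname, 'rb') as f:
+    d = pickle.load(f)
+def save_file(filenames, process_id):
+    """Dump the feature of each key in `filenames` to its own pickle file."""
+    count = 0
+    for key in filenames:
+        with open(os.path.join(output_dir, key), 'wb') as f_out:
+            pickle.dump(d[key], f_out)
+        count += 1
+        if count % 100 == 0:
+            print('# %d processed %d samples' % (process_id, count))
+    print('# %d total processed %d samples' % (process_id, count))
+# list() keeps the keys sliceable under Python 3, where dict.keys() is a view.
+total_keys = list(d.keys())
+num_threads = 8
+filelists = [None] * num_threads
+seg_nums = len(total_keys) // num_threads
+p_list = [None] * num_threads
+for i in range(num_threads):
+    if i == num_threads - 1:
+        # the last worker also takes the remainder
+        filelists[i] = total_keys[i * seg_nums:]
+    else:
+        filelists[i] = total_keys[i * seg_nums:(i + 1) * seg_nums]
+    p_list[i] = multiprocessing.Process(
+        target=save_file, args=(filelists[i], i))
+    p_list[i].start()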
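The script above splits the keys into 8 contiguous chunks, with the last worker absorbing the remainder. A minimal standalone sketch of that partitioning, useful for checking that no key is dropped or duplicated; `split_keys` is a hypothetical helper, not a function in this change:

```python
def split_keys(keys, num_parts):
    """Split keys into num_parts contiguous chunks; the last takes the remainder."""
    keys = list(keys)
    seg = len(keys) // num_parts
    # the first num_parts - 1 chunks all have exactly `seg` items
    chunks = [keys[i * seg:(i + 1) * seg] for i in range(num_parts - 1)]
    # the final chunk runs to the end of the list
    chunks.append(keys[(num_parts - 1) * seg:])
    return chunks


chunks = split_keys(range(19), 8)
print([len(c) for c in chunks])
```

With this scheme the last chunk can be noticeably larger than the others when the key count is not a multiple of the worker count, which is harmless for a one-off conversion.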
--- a/PaddleCV/PaddleVideo/data/dataset/tall/README.md
+++ b/PaddleCV/PaddleVideo/data/dataset/tall/README.md
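+# TALL Model Data Instructions
+The TALL model uses the TACoS dataset. Prepare the data as follows:
+Step 1. Training and test sets:
+- Training and testing use pre-extracted features. Follow the [data download](https://github.com/jiyanggao/TALL) instructions provided by the original TALL authors to obtain the data for model training and evaluation;
+Step 2. Infer data:
+- To make inference easier, we provide ./gen\_infer.py for generating the infer data. After completing Step 1, run this file with python to generate the infer data in the current folder.
+After these steps, PaddleVideo/data/dataset/tall should contain the following files: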
+```
+tall
+  |
+  |----Interval64_128_256_512_overlap0.8_c3d_fc6/
+  |----Interval128_256_overlap0.8_c3d_fc6/
+  |----train_clip-sentvec.pkl
+  |----test_clip-sentvec.pkl
+  |----video_allframes_info.pkl
+  |----infer
+         |
+         |----infer_feat/
+         |----infer_clip-sen.pkl
+```
--- a/PaddleCV/PaddleVideo/data/dataset/tall/gen_infer.py
+++ b/PaddleCV/PaddleVideo/data/dataset/tall/gen_infer.py
+#  Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
+#
+#Licensed under the Apache License, Version 2.0 (the "License");
+#you may not use this file except in compliance with the License.
+#You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#Unless required by applicable law or agreed to in writing, software
+#distributed under the License is distributed on an "AS IS" BASIS,
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#See the License for the specific language governing permissions and
+#limitations under the License.
+# select sentence vector and featmap of one movie name for inference
+import os
+import sys
+import pickle
+import numpy as np
+infer_path = 'infer'
+infer_feat_path = 'infer/infer_feat'
+if not os.path.exists(infer_path):
+    os.mkdir(infer_path)
+if not os.path.exists(infer_feat_path):
+    os.mkdir(infer_feat_path)
+python_ver = sys.version_info
+pickle_path = 'test_clip-sentvec.pkl'
+if python_ver < (3, 0):
+    movies_sentence = pickle.load(open(pickle_path, 'rb'))
+else:
+    movies_sentence = pickle.load(open(pickle_path, 'rb'), encoding='bytes')
+select_name = movies_sentence[0][0].split('.')[0]
+res_sentence = []
+for movie_sentence in movies_sentence:
+    if movie_sentence[0].split('.')[0] == select_name:
+        res_sen = []
+        res_sen.append(movie_sentence[0])
+        res_sen.append([movie_sentence[1][0]])  #select the first one sentence
+        res_sentence.append(res_sen)
+# Use a context manager so the output file is flushed and closed.
+with open('infer/infer_clip-sen.pkl', 'wb') as f_out:
+    pickle.dump(res_sentence, f_out, protocol=2)
+movies_feat = os.listdir('Interval128_256_overlap0.8_c3d_fc6')
+for movie_feat in movies_feat:
+    if movie_feat.split('.')[0] == select_name:
+        feat_path = os.path.join('Interval128_256_overlap0.8_c3d_fc6',
+                                 movie_feat)
+        feat = np.load(feat_path)
+        np.save(os.path.join(infer_feat_path, movie_feat), feat)
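The selection logic in `gen_infer.py` keeps every clip belonging to the first movie in the pickle and only the first sentence of each clip. A minimal sketch of that filter on in-memory data, assuming each entry is a `[clip_name, sentences]` pair and that the movie name is the part of `clip_name` before the first dot; the helper name and the sample clip names are hypothetical:

```python
def select_first_movie(entries):
    """Keep entries of the first movie, with only each clip's first sentence."""
    # movie name = clip name up to the first '.'
    name = entries[0][0].split('.')[0]
    return [[clip, [sents[0]]] for clip, sents in entries
            if clip.split('.')[0] == name]


# Hypothetical entries; real sentence items would be vectors, not strings
sample = [
    ['movieA.avi_1_10', ['s1', 's2']],
    ['movieA.avi_11_20', ['s3']],
    ['movieB.avi_1_5', ['s4']],
]
print(select_first_movie(sample))
```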
--- a/PaddleCV/PaddleVideo/metrics/ets_metrics/README.md
+++ b/PaddleCV/PaddleVideo/metrics/ets_metrics/README.md
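+## ActivityNet Captions Metric Computation
+- For the ActivityNet Captions evaluation code, refer to the [official repository](https://github.com/ranjaykrishna/densevid_eval);
+- Download the evaluation code and copy coco-caption and evaluate.py into the PaddleVideo directory;
+- To compute the metrics, run evaluate.py with python; use the -s option to specify the results file and the -r option to change the label file;
+- Because model results fluctuate considerably, you can evaluate the models saved at different epochs; the best METEOR value is about 10.0.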
--- a/PaddleCV/PaddleVideo/models/ets/README.md
+++ b/PaddleCV/PaddleVideo/models/ets/README.md
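+# ETS Video Captioning Model
+---
+## Contents
+- [Model Overview](#model-overview)
+- [Data Preparation](#data-preparation)
+- [Training](#training)
+- [Evaluation](#evaluation)
+- [Inference](#inference)
+- [References](#references)
+## Model Overview
+Describing Videos by Exploiting Temporal Structure, proposed by Li Yao et al. at the University of Montreal, is a classic model for generating text descriptions of video clips; it is referred to here as ETS. The model follows the encoder-decoder idea: for an input video, it first extracts local spatio-temporal features with 3D convolutions, then introduces an attention mechanism along the temporal dimension and uses an LSTM to fuse the local features at a global scale, and finally outputs a text description.
+For details, see [Describing Videos by Exploiting Temporal Structure](https://arxiv.org/abs/1502.08029).
+## Data Preparation
+ETS is trained on the ActivityNet Captions dataset. For data download and preparation, see the [data instructions](../../data/dataset/ets/README.md).
+## Training
+Once the data is ready, training can be started in either of the following two ways:
+    export CUDA_VISIBLE_DEVICES=0,1,2,3
+    export FLAGS_fast_eager_deletion_mode=1
+    export FLAGS_eager_delete_tensor_gb=0.0
+    export FLAGS_fraction_of_gpu_memory_to_use=0.98
+    python train.py --model_name=ETS \
+                    --config=./configs/ets.yaml \
+                    --log_interval=10 \
+                    --valid_interval=1 \
+                    --use_gpu=True \
+                    --save_dir=./data/checkpoints \
+                    --fix_random_seed=False
+    bash run.sh train ETS ./configs/ets.yaml
+- To train from scratch, use the command line or the script above; no pretrained model is needed.
+- Alternatively, download the released [model](https://paddlemodels.bj.bcebos.com/video_caption/ETS_final.pdparams) and pass the path to the weights via `--resume` for finetuning or further development.
+**Training strategy:**
+*  Trained with the Adam optimizer
+*  Weight decay coefficient of 1e-4
+*  Learning rate scheduled with Noam decay
+## Evaluation
+The model can be evaluated in either of the following two ways:
+    python eval.py --model_name=ETS \
+                   --config=./configs/ets.yaml \
+                   --log_interval=1 \
+                   --weights=$PATH_TO_WEIGHTS \
+                   --use_gpu=True
+    bash run.sh eval ETS ./configs/ets.yaml
+- When evaluating with `run.sh`, set the `weights` parameter in the script to the weights to be evaluated.
+- If `--weights` is not specified, the script downloads the released [model](https://paddlemodels.bj.bcebos.com/video_caption/ETS_final.pdparams) for evaluation.
+- Running the above saves the test results to a json file, by default under the data/evaluate\_results directory. METEOR can then be computed with the official ActivityNet Captions evaluation script; see [metric computation](../../metrics/ets_metrics/README.md) for details.
+- To evaluate on CPU, set `use_gpu` to False in the command line above or in the run.sh script.
+Evaluation accuracy on the ActivityNet Captions dataset:
+| METEOR |
+| :----: |
+|  9.8  |
+## Inference
+Inference can be started in either of the following two ways:
+    python predict.py --model_name=ETS \
+                      --config=./configs/ets.yaml \
+                      --log_interval=1 \
+                      --weights=$PATH_TO_WEIGHTS \
+                      --filelist=$FILELIST \
+                      --use_gpu=True
+    bash run.sh predict ETS ./configs/ets.yaml
+- When launching from the python command line, `--filelist` specifies the list of files to run inference on. You can also follow Step 3 of the [data instructions](../../data/dataset/ets/README.md) to generate the default inference file list. `--weights` points to the trained weights; if it is not set, the program automatically downloads the released weights.
+- When running inference with `run.sh`, set the `weights` parameter in the script to the weights to be used.
+- If `--weights` is not specified, the script downloads the released [model](https://paddlemodels.bj.bcebos.com/video_caption/ETS_final.pdparams) for inference.
+- Inference results are stored in a json file, by default under the `data/dataset/predict_results` directory.
+- To run inference on CPU, set `use_gpu` to False in the command line or in the run.sh script.
+## References
+- [Describing Videos by Exploiting Temporal Structure](https://arxiv.org/abs/1502.08029).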
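The ETS training strategy lists Noam decay for the learning rate. A minimal sketch of the schedule as it is commonly defined (linear warmup followed by inverse-square-root decay), assuming placeholder values for `d_model` and `warmup_steps`; these are illustrative only and are not taken from `configs/ets.yaml`:

```python
def noam_lr(step, d_model=512, warmup_steps=4000):
    """Noam schedule: lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The rate rises until warmup_steps, then decays
for s in (1, 2000, 4000, 8000):
    print(s, noam_lr(s))
```

The two branches of `min` meet at `step == warmup_steps`, which is where the learning rate peaks.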
--- a/PaddleCV/PaddleVideo/models/tall/README.md
+++ b/PaddleCV/PaddleVideo/models/tall/README.md
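+# TALL Video Grounding Model
+---
+## Contents
+- [Model Overview](#model-overview)
+- [Data Preparation](#data-preparation)
+- [Training](#training)
+- [Evaluation](#evaluation)
+- [Inference](#inference)
+- [References](#references)
+## Model Overview
+TALL, proposed by Jiyang Gao et al. at the University of Southern California, is a classic model for video grounding. Given an input text sequence and video clips, TALL uses a Cross-modal Temporal Regression Localizer (CTRL) to jointly model the video and the text description, and outputs location offsets and a confidence score. CTRL consists of four modules: a visual encoder that extracts features from video clips, a sentence encoder that extracts a feature vector from the sentence, a multimodal processing network that combines the text and visual features into a joint feature, and a temporal regression network that produces the confidence score and the offsets.
+For details, see [TALL: Temporal Activity Localization via Language Query](https://arxiv.org/abs/1705.02101).
+## Data Preparation
+TALL is trained on the TACoS dataset. For data download and preparation, see the [data instructions](../../data/dataset/tall/README.md).
+## Training
+Once the data is ready, training can be started in either of the following two ways:
+    export CUDA_VISIBLE_DEVICES=0
+    export FLAGS_fast_eager_deletion_mode=1
+    export FLAGS_eager_delete_tensor_gb=0.0
+    export FLAGS_fraction_of_gpu_memory_to_use=0.98
+    python train.py --model_name=TALL \
+                    --config=./configs/tall.yaml \
+                    --log_interval=10 \
+                    --valid_interval=10000 \
+                    --use_gpu=True \
+                    --save_dir=./data/checkpoints \
+                    --fix_random_seed=False
+    bash run.sh train TALL ./configs/tall.yaml
+- To train from scratch, use the command line or the script above; no pretrained model is needed.
+- Alternatively, download the released [model](https://paddlemodels.bj.bcebos.com/video_grounding/TALL_final.pdparams) and pass the path to the weights via `--resume` for finetuning or further development.
+- The model has no validation set, so valid\_interval is set to 10000 and no validation is performed during training.
+**Training strategy:**
+*  Trained with the Adam optimizer
+*  Learning rate of 1e-3
+## Evaluation
+The model can be evaluated in either of the following two ways:
+    python eval.py --model_name=TALL \
+                   --config=./configs/tall.yaml \
+                   --log_interval=1 \
+                   --weights=$PATH_TO_WEIGHTS \
+                   --use_gpu=True
+    bash run.sh eval TALL ./configs/tall.yaml
+- When evaluating with `run.sh`, set the `weights` parameter in the script to the weights to be evaluated.
+- If `--weights` is not specified, the script downloads the released [model](https://paddlemodels.bj.bcebos.com/video_grounding/TALL_final.pdparams) for evaluation.
+- Running the above prints the test results and also saves them to a json file, by default under the data/evaluate\_results directory.
+- To evaluate on CPU, set `use_gpu` to False in the command line above or in the run.sh script.
+Evaluation accuracy on the TACoS dataset:
+| R1@IOU5 | R5@IOU5 |
+| :----: | :----: |
+|  0.13  |  0.24  |
+## Inference
+Inference can be started in either of the following two ways:
+    python predict.py --model_name=TALL \
+                      --config=./configs/tall.yaml \
+                      --log_interval=1 \
+                      --weights=$PATH_TO_WEIGHTS \
+                      --filelist=$FILELIST \
+                      --use_gpu=True
+    bash run.sh predict TALL ./configs/tall.yaml
+- When launching from the python command line, `--filelist` specifies the list of files to run inference on. You can also follow Step 2 of the [data instructions](../../data/dataset/tall/README.md) to generate the default inference files. `--weights` points to the trained weights; if it is not set, the program automatically downloads the released weights.
+- When running inference with `run.sh`, set the `weights` parameter in the script to the weights to be used.
+- If `--weights` is not specified, the script downloads the released [model](https://paddlemodels.bj.bcebos.com/video_grounding/TALL_final.pdparams) for inference.
+- Inference results are stored in a json file, by default under the `data/dataset/predict_results` directory.
+- To run inference on CPU, set `use_gpu` to False in the command line or in the run.sh script.
+## References
+- [TALL: Temporal Activity Localization via Language Query](https://arxiv.org/abs/1705.02101).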
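The TALL README reports R1@IOU5 and R5@IOU5. Assuming these follow the "R@n, IoU=m" convention of the TALL paper (the fraction of queries for which at least one of the top-n predicted segments has temporal IoU of at least 0.5 with the ground truth), the metric can be sketched as below. The function names and sample segments are hypothetical; this is not the evaluation code shipped with the model:

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) segments on the time axis."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # For overlapping segments the union equals the covering span;
    # for disjoint segments inter is 0, so the result is 0 either way.
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


def recall_at_n(ranked_preds, gts, n, iou_thresh=0.5):
    """Fraction of queries with a hit among the top-n ranked predictions."""
    hits = 0
    for preds, gt in zip(ranked_preds, gts):
        if any(temporal_iou(p, gt) >= iou_thresh for p in preds[:n]):
            hits += 1
    return hits / float(len(gts))


# Two hypothetical queries, predictions sorted by confidence
preds = [[(0, 10), (4, 14)], [(0, 1)]]
gts = [(5, 15), (2, 3)]
print(recall_at_n(preds, gts, 1), recall_at_n(preds, gts, 5))
```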
--- a/PaddleCV/PaddleVideo/run.sh
+++ b/PaddleCV/PaddleVideo/run.sh
@@ -75,7 +75,7 @@ elif [ "$mode"x == "eval"x ]; then
 elif [ "$mode"x == "predict"x ]; then
    echo $mode $name $configs $weights
    if [ "$weights"x != ""x ]; then
-        python -i predict.py --model_name=$name \
+        python predict.py --model_name=$name \
                          --config=$configs \
                          --log_interval=$log_interval \
                          --weights=$weights \