Commit cd1a1963 authored by S shippingwang

add tall models

Parent 554d8864
......@@ -16,10 +16,12 @@
| [C-TCN](./models/ctcn/README.md) | Video action localization | Winning solution of ActivityNet 2018 |
| [BSN](./models/bsn/README.md) | Video action localization | Efficient proposal generation for temporal action localization |
| [BMN](./models/bmn/README.md) | Video action localization | Winning solution of ActivityNet 2019 |
| [TALL](./models/tall/README.md) | Video retrieval | Temporal localization of video clips via natural-language queries |
### Key features
- Covers multiple leading models for video classification and action localization. Attention LSTM, Attention Cluster and NeXtVLAD are popular feature-sequence models, while Non-local, TSN, TSM and StNet are end-to-end video classification models. Attention LSTM is fast and accurate; NeXtVLAD was the best single model in the 2nd YouTube-8M competition; TSN is the classic 2D-CNN based solution; TSM is a simple and efficient temporal-shift method for video spatio-temporal modeling; Non-local introduces non-local relation modeling for video. Attention Cluster and StNet are Baidu self-developed models, published at CVPR 2018 and AAAI 2019 respectively, and were used in the first-place Kinetics-600 entry. The C-TCN action localization model, also developed by Baidu, was the winning solution of ActivityNet 2018. BSN generates proposals bottom-up and provides an efficient solution to proposal generation for temporal action localization. BMN is a Baidu self-developed model and the winning solution of ActivityNet 2019. TALL localizes video clips from natural-language queries, adding language-based video retrieval.
- Provides a common skeleton codebase suited to video classification and action localization tasks, so users can configure a model, train it and evaluate it efficiently with a single command.
......@@ -178,6 +180,13 @@ run.sh
| BSN | 16 | 1x K40 | 7.0 | 66.64% (AUC) | [model-tem](https://paddlemodels.bj.bcebos.com/video_detection/BsnTem_final.pdparams), [model-pem](https://paddlemodels.bj.bcebos.com/video_detection/BsnPem_final.pdparams) |
| BMN | 16 | 4x K40 | 7.0 | 67.19% (AUC) | [model](https://paddlemodels.bj.bcebos.com/video_detection/BMN_final.pdparams) |
Video retrieval model trained on the TACoS dataset:

| Model | Batch Size | Environment | cuDNN version | Accuracy | Download link |
| :-: | :-: | :-: | :-: | :-: | :-: |
| TALL | 56 | 1x K40 | 7.2 | | |
## References
......@@ -190,6 +199,7 @@ run.sh
- [Non-local Neural Networks](https://arxiv.org/abs/1711.07971v1), Xiaolong Wang, Ross Girshick, Abhinav Gupta, Kaiming He
- [BSN: Boundary Sensitive Network for Temporal Action Proposal Generation](http://arxiv.org/abs/1806.02964), Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang, Ming Yang
- [BMN: Boundary-Matching Network for Temporal Action Proposal Generation](https://arxiv.org/abs/1907.09702), Tianwei Lin, Xiao Liu, Xin Li, Errui Ding, Shilei Wen.
- [TALL: Temporal Activity Localization via Language Query](https://arxiv.org/abs/1705.02101), Jiyang Gao, Chen Sun, Zhenheng Yang, Ram Nevatia.
## Version updates
......
MODEL:
    name: "TALL"
    visual_feature_dim: 12288
    sentence_embedding_size: 4800
    semantic_size: 1024
    hidden_size: 1000
    output_size: 3

TRAIN:
    epoch: 21
    use_gpu: True
    num_gpus: 1
    batch_size: 56
    feats_dimen: 4096
    off_size: 2
    context_num: 1
    context_size: 128
    visual_feature_dim: 12288
    sent_vec_dim: 4800
    sliding_clip_path: "data/dataset/tacos/Interval64_128_256_512_overlap0.8_c3d_fc6/"
    clip_sentvec: "data/dataset/tacos/train_clip-sentvec.pkl"
    movie_length_info: "data/dataset/tacos/video_allframes_info.pkl"
    dataset: TACoS
    model: TALL

VALID:
    batch_size: 1
    context_num: 1
    context_size: 128
    feats_dimen: 4096
    visual_feature_dim: 12288
    sent_vec_dim: 4800
    semantic_size: 4800
    sliding_clip_path: "data/dataset/tacos/Interval128_256_overlap0.8_c3d_fc6/"
    clip_sentvec: "data/dataset/tacos/test_clip-sentvec.pkl"
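For orientation, this is roughly how the configuration above maps onto plain Python; a minimal sketch using PyYAML rather than the repository's own config helper, with an illustrative file path:

```python
# Minimal sketch (assumes PyYAML; the repo's own parse-config helper and the
# actual config path may differ). Shows how the sections above become dicts.
import yaml

with open("configs/tall.yaml") as f:           # illustrative path
    cfg = yaml.safe_load(f)

print(cfg["MODEL"]["visual_feature_dim"])      # 12288 = 3 x 4096 (clip + left/right context)
print(cfg["TRAIN"]["batch_size"])              # 56
```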
......@@ -162,3 +162,7 @@ The Non-local model also uses the Kinetics dataset, but its data processing differs from the other
## C-TCN
The C-TCN model uses the ActivityNet 1.3 dataset; see the [C-TCN data notes](./ctcn/README.md) for details.
## TALL
The TALL model uses the TACoS dataset; see the [TALL data notes](./tall/README.md) for details.
......@@ -28,6 +28,10 @@ from metrics.detections import detection_metrics as detection_metrics
from metrics.bmn_metrics import bmn_proposal_metrics as bmn_proposal_metrics
from metrics.bsn_metrics import bsn_tem_metrics as bsn_tem_metrics
from metrics.bsn_metrics import bsn_pem_metrics as bsn_pem_metrics
from metrics.tall import accuracy_metrics as tall_metrics
logger = logging.getLogger(__name__)
......@@ -420,6 +424,24 @@ class BsnPemMetrics(Metrics):
def reset(self):
self.calculator.reset()
class TallMetrics(Metrics):
    def __init__(self, name, mode, cfg):
        self.name = name
        self.mode = mode
        self.calculator = tall_metrics.MetricsCalculator(
            cfg=cfg, name=self.name, mode=self.mode)

    def calculate_and_log_out(self, fetch_list, info=''):
        loss = np.array(fetch_list[0])
        logger.info(info + '\tLoss = {}'.format('%.6f' % np.mean(loss)))

    def accumulate(self, fetch_list, info=''):
        self.calculator.accumulate()

    def finalize_and_log_out(self, info='', savedir='/'):
        self.calculator.finalize_metrics()

    def reset(self):
        self.calculator.reset()


class MetricsZoo(object):
def __init__(self):
......@@ -461,3 +483,4 @@ regist_metrics("CTCN", DetectionMetrics)
regist_metrics("BMN", BmnMetrics)
regist_metrics("BSNTEM", BsnTemMetrics)
regist_metrics("BSNPEM", BsnPemMetrics)
regist_metrics("TALL", TallMetrics)
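For completeness, a small smoke test of the class registered above (values are illustrative; TallMetrics currently only logs the mean of the first fetched tensor, the loss):

```python
import numpy as np

cfg = {}  # the TALL MetricsCalculator does not read any cfg fields yet
m = TallMetrics("TALL", "train", cfg)
m.calculate_and_log_out([np.array([0.42, 0.38])], info="[TRAIN] iter 10")
m.reset()
```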
# Copyright 2016 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS-IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import time
import pickle
import operator
import logging

logger = logging.getLogger(__name__)
class MetricsCalculator():
    def __init__(self, name, mode, cfg=None):
        self.name = name
        self.mode = mode
        self.cfg = cfg
        self.reset()

    def reset(self):
        logger.info("Resetting {} metrics...".format(self.mode))
        return

    def finalize_metrics(self):
        return

    def calculate_metrics(self):
        return

    def accumulate(self):
        return

def calculate_reward_batch_withstop(previous_IoU, current_IoU, t):
    # batch reward: +1 bonus when the IoU improves, only the time penalty when it
    # does not improve but stays non-negative, -1 otherwise; penalty is 0.001*t
    batch_size = len(previous_IoU)
    reward = np.zeros(batch_size, dtype=np.float32)
    for i in range(batch_size):
        if current_IoU[i] > previous_IoU[i] and previous_IoU[i] >= 0:
            reward[i] = 1 - 0.001 * t
        elif current_IoU[i] <= previous_IoU[i] and current_IoU[i] >= 0:
            reward[i] = -0.001 * t
        else:
            reward[i] = -1 - 0.001 * t
    return reward

def calculate_reward(previous_IoU, current_IoU, t):
    # scalar version of the reward above
    if current_IoU > previous_IoU and previous_IoU >= 0:
        reward = 1 - 0.001 * t
    elif current_IoU <= previous_IoU and current_IoU >= 0:
        reward = -0.001 * t
    else:
        reward = -1 - 0.001 * t
    return reward

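A quick numeric check of the scalar reward above (values chosen for illustration):

```python
# IoU improved from 0.3 to 0.5 at step t=2: reward = 1 - 0.001*2 = 0.998
assert abs(calculate_reward(0.3, 0.5, 2) - 0.998) < 1e-6
# IoU dropped but stayed non-negative: only the time penalty applies
assert abs(calculate_reward(0.5, 0.3, 2) - (-0.002)) < 1e-6
```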
def calculate_RL_IoU_batch(i0, i1):
    # calculate temporal intersection over union for a batch of interval pairs
    batch_size = len(i0)
    iou_batch = np.zeros(batch_size, dtype=np.float32)
    for i in range(len(i0)):
        union = (min(i0[i][0], i1[i][0]), max(i0[i][1], i1[i][1]))
        inter = (max(i0[i][0], i1[i][0]), min(i0[i][1], i1[i][1]))
        # if inter[1] < inter[0]:
        #     iou = 0
        # else:
        iou = 1.0 * (inter[1] - inter[0]) / (union[1] - union[0])
        iou_batch[i] = iou
    return iou_batch

def calculate_IoU(i0, i1):
# calculate temporal intersection over union
union = (min(i0[0], i1[0]), max(i0[1], i1[1]))
inter = (max(i0[0], i1[0]), min(i0[1], i1[1]))
iou = 1.0*(inter[1]-inter[0])/(union[1]-union[0])
return iou
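A worked example of the temporal IoU above (purely illustrative; note that disjoint intervals yield a negative value):

```python
# ground truth (10, 20) vs prediction (15, 30):
# union = (10, 30), intersection = (15, 20)  ->  IoU = 5 / 20 = 0.25
assert abs(calculate_IoU((10.0, 20.0), (15.0, 30.0)) - 0.25) < 1e-6
```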
def nms_temporal(x1, x2, sim, overlap):
    # temporal non-maximum suppression: keep the highest-scoring windows and drop
    # windows whose temporal IoU with an already kept one exceeds `overlap`
    pick = []
    assert len(x1) == len(sim)
    assert len(x2) == len(sim)
    if len(x1) == 0:
        return pick

    union = list(map(operator.sub, x2, x1))  # window lengths: x2 - x1
    I = [i[0] for i in sorted(enumerate(sim), key=lambda x: x[1])]  # indices sorted by score (ascending)
    while len(I) > 0:
        i = I[-1]
        pick.append(i)
        xx1 = [max(x1[i], x1[j]) for j in I[:-1]]
        xx2 = [min(x2[i], x2[j]) for j in I[:-1]]
        inter = [max(0.0, k2 - k1) for k1, k2 in zip(xx1, xx2)]
        o = [inter[u] / (union[i] + union[I[u]] - inter[u]) for u in range(len(I) - 1)]
        I_new = []
        for j in range(len(o)):
            if o[j] <= overlap:
                I_new.append(I[j])
        I = I_new
    return pick

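A small sanity check of the temporal NMS helper above (windows and scores chosen for illustration):

```python
# Three windows; the second overlaps the first heavily (IoU = 8/12 ≈ 0.67 > 0.4)
# and is suppressed, so only the best-scoring window and the distant one remain.
x1, x2 = [0.0, 2.0, 20.0], [10.0, 12.0, 30.0]
sim = [0.9, 0.8, 0.3]
assert nms_temporal(x1, x2, sim, overlap=0.4) == [0, 2]
```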
def compute_IoU_recall_top_n_forreg_rl(top_n, iou_thresh, sentence_image_reg_mat, sclips):
correct_num = 0.0
for k in range(sentence_image_reg_mat.shape[0]):
gt = sclips[k]
# print(gt)
gt_start = float(gt.split("_")[1])
gt_end = float(gt.split("_")[2])
pred_start = sentence_image_reg_mat[k, 0]
pred_end = sentence_image_reg_mat[k, 1]
iou = calculate_IoU((gt_start, gt_end),(pred_start, pred_end))
if iou>=iou_thresh:
correct_num+=1
return correct_num
def compute_IoU_recall_top_n_forreg(top_n, iou_thresh, sentence_image_mat, sentence_image_reg_mat, sclips, iclips):
correct_num = 0.0
for k in range(sentence_image_mat.shape[0]):
gt = sclips[k]
gt_start = float(gt.split("_")[1])
gt_end = float(gt.split("_")[2])
#print gt +" "+str(gt_start)+" "+str(gt_end)
sim_v = [v for v in sentence_image_mat[k]]
starts = [s for s in sentence_image_reg_mat[k,:,0]]
ends = [e for e in sentence_image_reg_mat[k,:,1]]
picks = nms_temporal(starts,ends, sim_v, iou_thresh-0.05)
#sim_argsort=np.argsort(sim_v)[::-1][0:top_n]
if top_n<len(picks): picks=picks[0:top_n]
for index in picks:
pred_start = sentence_image_reg_mat[k, index, 0]
pred_end = sentence_image_reg_mat[k, index, 1]
iou = calculate_IoU((gt_start, gt_end),(pred_start, pred_end))
if iou>=iou_thresh:
correct_num+=1
break
return correct_num
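To show how the counter above turns into the usual "R@n, IoU=m" numbers reported for TALL, here is a tiny self-contained example; the clip names, scores and regression outputs are purely illustrative:

```python
import numpy as np

# One movie, two query sentences, three candidate windows per sentence.
# Clip names follow the "<movie>_<start>_<end>" pattern parsed above.
sclips = ["s13_10_20", "s13_40_60"]
iclips = ["s13_0_15", "s13_12_22", "s13_38_62"]
sentence_image_mat = np.array([[0.2, 0.9, 0.1],
                               [0.1, 0.2, 0.8]])                 # alignment scores
sentence_image_reg_mat = np.zeros((2, 3, 2))
sentence_image_reg_mat[:, :, 0] = [[0, 12, 38], [0, 12, 38]]     # regressed starts
sentence_image_reg_mat[:, :, 1] = [[15, 22, 62], [15, 22, 62]]   # regressed ends

correct = compute_IoU_recall_top_n_forreg(
    1, 0.5, sentence_image_mat, sentence_image_reg_mat, sclips, iclips)
recall_at_1 = correct / float(len(sclips))   # per-movie "R@1, IoU=0.5"
```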
......@@ -10,6 +10,7 @@ from .ctcn import CTCN
from .bmn import BMN
from .bsn import BsnTem
from .bsn import BsnPem
from .tall import TALL
# regist models, sort by alphabet
regist_model("AttentionCluster", AttentionCluster)
......@@ -23,3 +24,4 @@ regist_model("CTCN", CTCN)
regist_model("BMN", BMN)
regist_model("BsnTem", BsnTem)
regist_model("BsnPem", BsnPem)
regist_model("TALL", TALL)
# TALL video model
---
## Contents

- [Model introduction]
This diff is collapsed.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
#
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
import os
import time
import sys
import paddle.fluid as fluid
class TALL(object):
    def __init__(self, mode, cfg):
        self.mode = mode
        self.images = cfg["images"]
        self.sentences = cfg["sentences"]
        if self.mode == "train":
            self.offsets = cfg["offsets"]
        self.semantic_size = cfg["semantic_size"]
        self.hidden_size = cfg["hidden_size"]
        self.output_size = cfg["output_size"]
        # batch size is needed to tile features in _cross_modal_comb; assumed to
        # be passed in cfg together with the input variables above
        self.batch_size = cfg["batch_size"]

    def _cross_modal_comb(self, visual_feat, sentence_embed):
        # tile clip features along axis 0 and sentence features along axis 1,
        # producing a (B, B, D) grid covering every sentence/clip pair
        batch_size = self.batch_size
        visual_feat = fluid.layers.reshape(visual_feat, [1, -1, self.semantic_size])
        vv_feature = fluid.layers.expand(visual_feat, [batch_size, 1, 1])
        sentence_embed = fluid.layers.reshape(sentence_embed, [-1, 1, self.semantic_size])
        ss_feature = fluid.layers.expand(sentence_embed, [1, batch_size, 1])

        concat_feature = fluid.layers.concat([vv_feature, ss_feature], axis=2)  # B, B, 2*D
        mul_feature = vv_feature * ss_feature  # B, B, D
        add_feature = vv_feature + ss_feature  # B, B, D

        comb_feature = fluid.layers.concat(
            [mul_feature, add_feature, concat_feature], axis=2)
        return comb_feature
    def net(self):
        # visual2semantic
        transformed_clip_train = fluid.layers.fc(
            input=self.images,
            size=self.semantic_size,
            act=None,
            name='v2s_lt',
            param_attr=fluid.ParamAttr(
                name='v2s_lt_weights',
                initializer=fluid.initializer.NormalInitializer(loc=0.0, scale=1.0, seed=0)),
            bias_attr=False)
        # l2_normalize
        transformed_clip_train = fluid.layers.l2_normalize(x=transformed_clip_train, axis=1)

        # sentence2semantic
        transformed_sentence_train = fluid.layers.fc(
            input=self.sentences,
            size=self.semantic_size,
            act=None,
            name='s2s_lt',
            param_attr=fluid.ParamAttr(
                name='s2s_lt_weights',
                initializer=fluid.initializer.NormalInitializer(loc=0.0, scale=1.0, seed=0)),
            bias_attr=False)
        # l2_normalize
        transformed_sentence_train = fluid.layers.l2_normalize(x=transformed_sentence_train, axis=1)

        cross_modal_vec_train = self._cross_modal_comb(transformed_clip_train, transformed_sentence_train)
        cross_modal_vec_train = fluid.layers.unsqueeze(input=cross_modal_vec_train, axes=[0])
        cross_modal_vec_train = fluid.layers.transpose(cross_modal_vec_train, perm=[0, 3, 1, 2])

        mid_output = fluid.layers.conv2d(
            input=cross_modal_vec_train,
            num_filters=self.hidden_size,
            filter_size=1,
            stride=1,
            act="relu",
            param_attr=fluid.param_attr.ParamAttr(name="mid_out_weights"),
            bias_attr=False)

        sim_score_mat_train = fluid.layers.conv2d(
            input=mid_output,
            num_filters=self.output_size,
            filter_size=1,
            stride=1,
            act=None,
            param_attr=fluid.param_attr.ParamAttr(name="sim_mat_weights"),
            bias_attr=False)

        self.sim_score_mat_train = fluid.layers.squeeze(input=sim_score_mat_train, axes=[0])
        return self.sim_score_mat_train, self.offsets
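To make the cross-modal combination easier to follow, here is a plain-numpy shape sketch of what `_cross_modal_comb` builds before the 1x1 convolutions (illustrative only, not part of the model code):

```python
import numpy as np

B, D = 4, 1024                                   # batch size, semantic_size
v = np.random.rand(B, D).astype("float32")       # transformed clip features
s = np.random.rand(B, D).astype("float32")       # transformed sentence features

vv = np.broadcast_to(v[None, :, :], (B, B, D))   # clip j repeated along axis 0
ss = np.broadcast_to(s[:, None, :], (B, B, D))   # sentence i repeated along axis 1
comb = np.concatenate(
    [vv * ss, vv + ss, np.concatenate([vv, ss], axis=2)], axis=2)

assert comb.shape == (B, B, 4 * D)               # every sentence/clip pair, 4*semantic_size features
```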
......@@ -6,6 +6,7 @@ from .ctcn_reader import CTCNReader
from .bmn_reader import BMNReader
from .bsn_reader import BSNVideoReader
from .bsn_reader import BSNProposalReader
from .tall_reader import TALLReader
# regist reader, sort by alphabet
regist_reader("ATTENTIONCLUSTER", FeatureReader)
......@@ -19,3 +20,4 @@ regist_reader("CTCN", CTCNReader)
regist_reader("BMN", BMNReader)
regist_reader("BSNTEM", BSNVideoReader)
regist_reader("BSNPEM", BSNProposalReader)
regist_reader("TALL", TALLReader)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
#
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
import os
import sys
import cv2
import math
import random
import functools
try:
import cPickle as pickle
from cStringIO import StringIO
except ImportError:
import pickle
from io import BytesIO
import numpy as np
import paddle
from PIL import Image, ImageEnhance
import logging
from .reader_utils import DataReader

logger = logging.getLogger(__name__)
class TacosReader(DataReader):
def __init__(self, name, mode, cfg):
self.name = name
self.mode = mode
self.cfg = cfg
    def create_reader(self):
        cfg = self.cfg
        mode = self.mode
        num_reader_threads = cfg[mode.upper()]['num_reader_threads']
        assert num_reader_threads >= 1, \
            "number of reader threads({}) should be a positive integer".format(num_reader_threads)
        if num_reader_threads == 1:
            reader_func = make_reader
        else:
            reader_func = make_multi_reader

        if self.mode == 'train':
            return reader_func(cfg)
        elif self.mode == 'valid':
            return reader_func(cfg)
        else:
            logger.info("Not implemented")
            raise NotImplementedError


def make_reader(cfg):
    def reader():
        cs = pickle.load(open(cfg.TRAIN.train_clip_sentvec, 'rb'))
        movie_length_info = pickle.load(open(cfg.TRAIN.movie_length_info, 'rb'))
        # put train() in here
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
#
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
import os
import sys
import math
import random
import functools
import logging
try:
    import cPickle as pickle
except ImportError:
    import pickle
import numpy as np
import paddle

from .reader_utils import DataReader

logger = logging.getLogger(__name__)

random.seed(0)

THREAD = 8
BUF_SIZE = 1024
class TALLReader(DataReader):
    def __init__(self, name, mode, cfg):
        self.name = name
        self.mode = mode
        self.cfg = cfg

    def create_reader(self):
        cfg = self.cfg
        if self.mode == 'train':
            train_batch_size = cfg.TRAIN.batch_size
            return paddle.batch(train(cfg), batch_size=train_batch_size, drop_last=True)
        elif self.mode == 'valid':
            return test(cfg)
        else:
            logger.info("Not implemented")
            raise NotImplementedError
'''
calculate temporal intersection over union
'''
def calculate_IoU(i0, i1):
union = (min(i0[0], i1[0]), max(i0[1], i1[1]))
inter = (max(i0[0], i1[0]), min(i0[1], i1[1]))
iou = 1.0*(inter[1]-inter[0])/(union[1]-union[0])
return iou
'''
calculate the non-intersection over length (nIoL) ratio; make sure the input IoU is larger than 0
'''
# nIoL = [(x1_max - x1_min) - overlap] / (x1_max - x1_min)
def calculate_nIoL(base, sliding_clip):
inter = (max(base[0], sliding_clip[0]), min(base[1], sliding_clip[1]))
inter_l = inter[1]-inter[0]
length = sliding_clip[1]-sliding_clip[0]
nIoL = 1.0*(length-inter_l)/length
return nIoL
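A worked example of the nIoL measure above (numbers are illustrative):

```python
# ground truth (10, 30) vs sliding window (20, 40): the intersection (20, 30)
# has length 10 and the window has length 20, so nIoL = (20 - 10) / 20 = 0.5
assert abs(calculate_nIoL((10.0, 30.0), (20.0, 40.0)) - 0.5) < 1e-6
```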
def get_context_window(sliding_clip_path, clip_name, win_length, context_size, feats_dimen):
# compute left (pre) and right (post) context features based on read_unit_level_feats().
movie_name = clip_name.split("_")[0]
start = int(clip_name.split("_")[1])
end = int(clip_name.split("_")[2].split(".")[0])
clip_length = context_size
left_context_feats = np.zeros([win_length, feats_dimen], dtype=np.float32)
right_context_feats = np.zeros([win_length, feats_dimen], dtype=np.float32)
last_left_feat = np.load(sliding_clip_path+clip_name)
last_right_feat = np.load(sliding_clip_path+clip_name)
for k in range(win_length):
left_context_start = start - clip_length * (k + 1)
left_context_end = start - clip_length * k
right_context_start = end + clip_length * k
right_context_end = end + clip_length * (k + 1)
left_context_name = movie_name + "_" + str(left_context_start) + "_" + str(left_context_end) + ".npy"
right_context_name = movie_name + "_" + str(right_context_start) + "_" + str(right_context_end) + ".npy"
if os.path.exists(sliding_clip_path+left_context_name):
left_context_feat = np.load(sliding_clip_path+left_context_name)
last_left_feat = left_context_feat
else:
left_context_feat = last_left_feat
if os.path.exists(sliding_clip_path+right_context_name):
right_context_feat = np.load(sliding_clip_path+right_context_name)
last_right_feat = right_context_feat
else:
right_context_feat = last_right_feat
left_context_feats[k] = left_context_feat
right_context_feats[k] = right_context_feat
return np.mean(left_context_feats, axis=0), np.mean(right_context_feats, axis=0)
def process_data(sample, is_train):
clip_sentence_pair, sliding_clip_path, context_num, context_size, feats_dimen , sent_vec_dim = sample
if is_train:
offset = np.zeros(2, dtype=np.float32)
clip_name = clip_sentence_pair[0]
feat_path = sliding_clip_path+clip_sentence_pair[2]
featmap = np.load(feat_path)
left_context_feat, right_context_feat = get_context_window(sliding_clip_path, clip_sentence_pair[2], context_num, context_size, feats_dimen)
image = np.hstack((left_context_feat, featmap, right_context_feat))
sentence = clip_sentence_pair[1][:sent_vec_dim]
p_offset = clip_sentence_pair[3]
l_offset = clip_sentence_pair[4]
offset[0] = p_offset
offset[1] = l_offset
return image, sentence, offset
else:
pass
def make_train_reader(cfg, clip_sentence_pairs_iou, shuffle=False, is_train=True):
sliding_clip_path = cfg.TRAIN.sliding_clip_path
context_num = cfg.TRAIN.context_num
context_size = cfg.TRAIN.context_size
feats_dimen = cfg.TRAIN.feats_dimen
sent_vec_dim = cfg.TRAIN.sent_vec_dim
def reader():
if shuffle:
random.shuffle(clip_sentence_pairs_iou)
for clip_sentence_pair in clip_sentence_pairs_iou:
yield [clip_sentence_pair, sliding_clip_path, context_num, context_size, feats_dimen, sent_vec_dim]
mapper = functools.partial(
process_data,
is_train=is_train)
return paddle.reader.xmap_readers(mapper, reader, THREAD, BUF_SIZE)
def train(cfg):
    feats_dimen = cfg.TRAIN.feats_dimen
    context_num = cfg.TRAIN.context_num
    context_size = cfg.TRAIN.context_size
    visual_feature_dim = cfg.TRAIN.visual_feature_dim
    sent_vec_dim = cfg.TRAIN.sent_vec_dim
    sliding_clip_path = cfg.TRAIN.sliding_clip_path

    cs = pickle.load(open(cfg.TRAIN.train_clip_sentvec, 'rb'))
    movie_length_info = pickle.load(open(cfg.TRAIN.movie_length_info, 'rb'))
clip_sentence_pairs = []
for l in cs:
clip_name = l[0]
sent_vecs = l[1]
for sent_vec in sent_vecs:
clip_sentence_pairs.append((clip_name, sent_vec)) #10146
print "TRAIN: " + str(len(clip_sentence_pairs))+" clip-sentence pairs are readed"
movie_names_set = set()
movie_clip_names = {}
# read groundtruth sentence-clip pairs
for k in range(len(clip_sentence_pairs)):
clip_name = clip_sentence_pairs[k][0]
movie_name = clip_name.split("_")[0]
if not movie_name in movie_names_set:
movie_names_set.add(movie_name)
movie_clip_names[movie_name] = []
movie_clip_names[movie_name].append(k)
movie_names = list(movie_names_set)
num_samples = len(clip_sentence_pairs)
print "TRAIN: " + str(len(movie_names))+" movies."
# read sliding windows, and match them with the groundtruths to make training samples
sliding_clips_tmp = os.listdir(sliding_clip_path) #161396
clip_sentence_pairs_iou = []
#count = 0
for clip_name in sliding_clips_tmp:
if clip_name.split(".")[2]=="npy":
movie_name = clip_name.split("_")[0]
for clip_sentence in clip_sentence_pairs:
original_clip_name = clip_sentence[0]
original_movie_name = original_clip_name.split("_")[0]
if original_movie_name==movie_name:
start = int(clip_name.split("_")[1])
end = int(clip_name.split("_")[2].split(".")[0])
o_start = int(original_clip_name.split("_")[1])
o_end = int(original_clip_name.split("_")[2].split(".")[0])
iou = calculate_IoU((start, end), (o_start, o_end))
if iou>0.5:
nIoL=calculate_nIoL((o_start, o_end), (start, end))
if nIoL<0.15:
movie_length = movie_length_info[movie_name.split(".")[0]]
start_offset = o_start-start
end_offset = o_end-end
clip_sentence_pairs_iou.append((clip_sentence[0], clip_sentence[1], clip_name, start_offset, end_offset))
# count += 1
# if count > 200:
# break
num_samples_iou = len(clip_sentence_pairs_iou)
print "TRAIN: " + str(len(clip_sentence_pairs_iou))+" iou clip-sentence pairs are readed"
return make_train_reader(cfg, clip_sentence_pairs_iou, shuffle=True, is_train=True)
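For reference, a minimal sketch of how the training reader above would be consumed (cfg is assumed to be the parsed tall.yaml with attribute-style access, as elsewhere in this file):

```python
batched = paddle.batch(train(cfg), batch_size=cfg.TRAIN.batch_size, drop_last=True)
for batch in batched():
    # each sample is (image, sentence, offset) as produced by process_data()
    images, sentences, offsets = zip(*batch)
    break
```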
class test(object):
    '''
    Test-time data wrapper: loads the TACoS sliding-window features and the
    ground-truth clip/sentence pairs configured under cfg.TEST.
    '''

    def __init__(self, cfg):
self.context_num = cfg.TEST.context_num
self.visual_feature_dim = cfg.TEST.visual_feature_dim
self.feats_dimen = cfg.TEST.feats_dimen
self.context_size = cfg.TEST.context_size
self.semantic_size = cfg.TEST.semantic_size
self.sliding_clip_path = cfg.TEST.sliding_clip_path
self.sent_vec_dim = cfg.TEST.sent_vec_dim
        self.cs = pickle.load(open(cfg.TEST.test_clip_sentvec, 'rb'))
self.clip_sentence_pairs = []
for l in self.cs:
clip_name = l[0]
sent_vecs = l[1]
for sent_vec in sent_vecs:
self.clip_sentence_pairs.append((clip_name, sent_vec))
print "TEST: " + str(len(self.clip_sentence_pairs)) + " pairs are readed"
movie_names_set = set()
self.movie_clip_names = {}
for k in range(len(self.clip_sentence_pairs)):
clip_name = self.clip_sentence_pairs[k][0]
movie_name = clip_name.split("_")[0]
if not movie_name in movie_names_set:
movie_names_set.add(movie_name)
self.movie_clip_names[movie_name] = []
self.movie_clip_names[movie_name].append(k)
self.movie_names = list(movie_names_set)
print "TEST: " + str(len(self.movie_names)) + " movies."
self.clip_num_per_movie_max = 0
for movie_name in self.movie_clip_names:
if len(self.movie_clip_names[movie_name])>self.clip_num_per_movie_max: self.clip_num_per_movie_max = len(self.movie_clip_names[movie_name])
print "TEST: " + "Max number of clips in a movie is "+str(self.clip_num_per_movie_max)
sliding_clips_tmp = os.listdir(self.sliding_clip_path) # 62741
self.sliding_clip_names = []
for clip_name in sliding_clips_tmp:
if clip_name.split(".")[2]=="npy":
movie_name = clip_name.split("_")[0]
if movie_name in self.movie_clip_names:
self.sliding_clip_names.append(clip_name.split(".")[0]+"."+clip_name.split(".")[1])
self.num_samples = len(self.clip_sentence_pairs)
print "TEST: " + "sliding clips number: "+str(len(self.sliding_clip_names))
def get_test_context_window(self, clip_name, win_length):
# compute left (pre) and right (post) context features based on read_unit_level_feats().
movie_name = clip_name.split("_")[0]
start = int(clip_name.split("_")[1])
end = int(clip_name.split("_")[2].split(".")[0])
clip_length = self.context_size #128
left_context_feats = np.zeros([win_length, self.feats_dimen], dtype=np.float32) #(1,4096)
right_context_feats = np.zeros([win_length, self.feats_dimen], dtype=np.float32)#(1,4096)
last_left_feat = np.load(self.sliding_clip_path+clip_name)
last_right_feat = np.load(self.sliding_clip_path+clip_name)
for k in range(win_length):
left_context_start = start - clip_length * (k + 1)
left_context_end = start - clip_length * k
right_context_start = end + clip_length * k
right_context_end = end + clip_length * (k + 1)
left_context_name = movie_name + "_" + str(left_context_start) + "_" + str(left_context_end) + ".npy"
right_context_name = movie_name + "_" + str(right_context_start) + "_" + str(right_context_end) + ".npy"
if os.path.exists(self.sliding_clip_path+left_context_name):
left_context_feat = np.load(self.sliding_clip_path+left_context_name)
last_left_feat = left_context_feat
else:
left_context_feat = last_left_feat
if os.path.exists(self.sliding_clip_path+right_context_name):
right_context_feat = np.load(self.sliding_clip_path+right_context_name)
last_right_feat = right_context_feat
else:
right_context_feat = last_right_feat
left_context_feats[k] = left_context_feat
right_context_feats[k] = right_context_feat
return np.mean(left_context_feats, axis=0), np.mean(right_context_feats, axis=0)
def load_movie_slidingclip(self, movie_name, sample_num):
# load unit level feats and sentence vector
movie_clip_sentences = []
movie_clip_featmap = []
clip_set = set()
for k in range(len(self.clip_sentence_pairs)):
if movie_name in self.clip_sentence_pairs[k][0]:
movie_clip_sentences.append((self.clip_sentence_pairs[k][0], self.clip_sentence_pairs[k][1][:self.semantic_size]))
for k in range(len(self.sliding_clip_names)):
if movie_name in self.sliding_clip_names[k]:
# print str(k)+"/"+str(len(self.movie_clip_names[movie_name]))
visual_feature_path = self.sliding_clip_path+self.sliding_clip_names[k]+".npy"
#context_feat=self.get_context(self.sliding_clip_names[k]+".npy")
left_context_feat,right_context_feat = self.get_test_context_window(self.sliding_clip_names[k]+".npy",1)
feature_data = np.load(visual_feature_path)
#comb_feat=np.hstack((context_feat,feature_data))
comb_feat = np.hstack((left_context_feat,feature_data,right_context_feat))
movie_clip_featmap.append((self.sliding_clip_names[k], comb_feat))
return movie_clip_featmap, movie_clip_sentences
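And a sketch of how the test wrapper is typically driven (the model forward pass is omitted; shapes follow the code above, everything else is illustrative):

```python
ds = test(cfg)
for movie_name in ds.movie_clip_names:
    # movie_clip_featmap: [(clip_name, 12288-d visual feature)]
    # movie_clip_sentences: [(clip_name, 4800-d sentence vector)]
    movie_clip_featmap, movie_clip_sentences = ds.load_movie_slidingclip(movie_name, 16)
    # ...run the TALL network on every (sentence, clip) pair and fill the
    # score/regression matrices consumed by compute_IoU_recall_top_n_forreg...
```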