Commit cd1a1963 authored by S shippingwang

add tall models

Parent 554d8864
......@@ -16,10 +16,12 @@
| [C-TCN](./models/ctcn/README.md) | Video action localization | Winning solution of ActivityNet 2018 |
| [BSN](./models/bsn/README.md) | Video action localization | Efficient proposal generation for temporal action localization |
| [BMN](./models/bmn/README.md) | Video action localization | Winning solution of ActivityNet 2019 |
| [TALL](./models/tall/README.md) | Video retrieval | Temporal localization of video clips via natural-language queries |
### Key features
- Covers multiple leading models for video classification and action localization. Attention LSTM, Attention Cluster and NeXtVLAD are popular feature-sequence models, while Non-local, TSN, TSM and StNet are end-to-end video classification models. Attention LSTM is fast and accurate; NeXtVLAD was the best single model in the 2nd YouTube-8M competition; TSN is the classic 2D-CNN based solution; TSM is a simple and efficient temporal-shift method for video spatio-temporal modeling; Non-local introduces non-local relation modeling for video. Attention Cluster and StNet are Baidu self-developed models, published at CVPR 2018 and AAAI 2019 respectively, and were used in the first-place Kinetics-600 entry. The C-TCN action localization model, also developed by Baidu, was the winning solution of ActivityNet 2018. BSN generates proposals bottom-up and provides an efficient solution to proposal generation for temporal action localization. BMN is a Baidu self-developed model and the winning solution of ActivityNet 2019. TALL localizes video clips from natural-language queries, adding language-based video retrieval.
- Provides a common skeleton codebase suited to video classification and action localization tasks, so users can configure a model, train it and evaluate it efficiently with a single command.
......@@ -178,6 +180,13 @@ run.sh
| BSN | 16 | 1x K40 | 7.0 | 66.64% (AUC) | [model-tem](https://paddlemodels.bj.bcebos.com/video_detection/BsnTem_final.pdparams), [model-pem](https://paddlemodels.bj.bcebos.com/video_detection/BsnPem_final.pdparams) |
| BMN | 16 | 4x K40 | 7.0 | 67.19% (AUC) | [model](https://paddlemodels.bj.bcebos.com/video_detection/BMN_final.pdparams) |
Video retrieval model trained on the TACoS dataset:

| Model | Batch Size | Environment | cuDNN version | Accuracy | Download link |
| :-: | :-: | :-: | :-: | :-: | :-: |
| TALL | 56 | 1x K40 | 7.2 | | |
## References
......@@ -190,6 +199,7 @@ run.sh
- [Non-local Neural Networks](https://arxiv.org/abs/1711.07971v1), Xiaolong Wang, Ross Girshick, Abhinav Gupta, Kaiming He
- [BSN: Boundary Sensitive Network for Temporal Action Proposal Generation](http://arxiv.org/abs/1806.02964), Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang, Ming Yang
- [BMN: Boundary-Matching Network for Temporal Action Proposal Generation](https://arxiv.org/abs/1907.09702), Tianwei Lin, Xiao Liu, Xin Li, Errui Ding, Shilei Wen.
- [TALL: Temporal Activity Localization via Language Query](https://arxiv.org/abs/1705.02101), Jiyang Gao, Chen Sun, Zhenheng Yang, Ram Nevatia.
## Version updates
......
MODEL:
    name: "TALL"
    visual_feature_dim: 12288
    sentence_embedding_size: 4800
    semantic_size: 1024
    hidden_size: 1000
    output_size: 3

TRAIN:
    epoch: 21
    use_gpu: True
    num_gpus: 1
    batch_size: 56
    feats_dimen: 4096
    off_size: 2
    context_num: 1
    context_size: 128
    visual_feature_dim: 12288
    sent_vec_dim: 4800
    sliding_clip_path: "data/dataset/tacos/Interval64_128_256_512_overlap0.8_c3d_fc6/"
    clip_sentvec: "data/dataset/tacos/train_clip-sentvec.pkl"
    movie_length_info: "data/dataset/tacos/video_allframes_info.pkl"
    dataset: TACoS
    model: TALL

VALID:
    batch_size: 1
    context_num: 1
    context_size: 128
    feats_dimen: 4096
    visual_feature_dim: 12288
    sent_vec_dim: 4800
    semantic_size: 4800
    sliding_clip_path: "data/dataset/tacos/Interval128_256_overlap0.8_c3d_fc6/"
    clip_sentvec: "data/dataset/tacos/test_clip-sentvec.pkl"
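For orientation, this is roughly how the configuration above maps onto plain Python; a minimal sketch using PyYAML rather than the repository's own config helper, with an illustrative file path:

```python
# Minimal sketch (assumes PyYAML; the repo's own parse-config helper and the
# actual config path may differ). Shows how the sections above become dicts.
import yaml

with open("configs/tall.yaml") as f:           # illustrative path
    cfg = yaml.safe_load(f)

print(cfg["MODEL"]["visual_feature_dim"])      # 12288 = 3 x 4096 (clip + left/right context)
print(cfg["TRAIN"]["batch_size"])              # 56
```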
......@@ -162,3 +162,7 @@ The Non-local model also uses the Kinetics dataset, but its data processing differs from the other
## C-TCN
The C-TCN model uses the ActivityNet 1.3 dataset; see the [C-TCN data notes](./ctcn/README.md) for details.
## TALL
The TALL model uses the TACoS dataset; see the [TALL data notes](./tall/README.md) for details.
......@@ -28,6 +28,10 @@ from metrics.detections import detection_metrics as detection_metrics
from metrics.bmn_metrics import bmn_proposal_metrics as bmn_proposal_metrics
from metrics.bsn_metrics import bsn_tem_metrics as bsn_tem_metrics
from metrics.bsn_metrics import bsn_pem_metrics as bsn_pem_metrics
from metrics.tall import accuracy_metrics as tall_metrics
logger = logging.getLogger(__name__)
......@@ -420,6 +424,24 @@ class BsnPemMetrics(Metrics):
def reset(self):
self.calculator.reset()
class TallMetrics(Metrics):
    def __init__(self, name, mode, cfg):
        self.name = name
        self.mode = mode
        self.calculator = tall_metrics.MetricsCalculator(
            cfg=cfg, name=self.name, mode=self.mode)

    def calculate_and_log_out(self, fetch_list, info=''):
        loss = np.array(fetch_list[0])
        logger.info(info + '\tLoss = {}'.format('%.6f' % np.mean(loss)))

    def accumulate(self, fetch_list, info=''):
        self.calculator.accumulate()

    def finalize_and_log_out(self, info='', savedir='/'):
        self.calculator.finalize_metrics()

    def reset(self):
        self.calculator.reset()


class MetricsZoo(object):
def __init__(self):
......@@ -461,3 +483,4 @@ regist_metrics("CTCN", DetectionMetrics)
regist_metrics("BMN", BmnMetrics)
regist_metrics("BSNTEM", BsnTemMetrics)
regist_metrics("BSNPEM", BsnPemMetrics)
regist_metrics("TALL", TallMetrics)
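For completeness, a small smoke test of the class registered above (values are illustrative; TallMetrics currently only logs the mean of the first fetched tensor, the loss):

```python
import numpy as np

cfg = {}  # the TALL MetricsCalculator does not read any cfg fields yet
m = TallMetrics("TALL", "train", cfg)
m.calculate_and_log_out([np.array([0.42, 0.38])], info="[TRAIN] iter 10")
m.reset()
```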
# Copyright 2016 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS-IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import time
import pickle
import operator
import logging

logger = logging.getLogger(__name__)
class MetricsCalculator():
    def __init__(self, name, mode, cfg=None):
        self.name = name
        self.mode = mode
        self.cfg = cfg
        self.reset()

    def reset(self):
        logger.info("Resetting {} metrics...".format(self.mode))
        return

    def finalize_metrics(self):
        return

    def calculate_metrics(self):
        return

    def accumulate(self):
        return

def calculate_reward_batch_withstop(previous_IoU, current_IoU, t):
    # batch reward: +1 bonus when the IoU improves, only the time penalty when it
    # does not improve but stays non-negative, -1 otherwise; penalty is 0.001*t
    batch_size = len(previous_IoU)
    reward = np.zeros(batch_size, dtype=np.float32)
    for i in range(batch_size):
        if current_IoU[i] > previous_IoU[i] and previous_IoU[i] >= 0:
            reward[i] = 1 - 0.001 * t
        elif current_IoU[i] <= previous_IoU[i] and current_IoU[i] >= 0:
            reward[i] = -0.001 * t
        else:
            reward[i] = -1 - 0.001 * t
    return reward

def calculate_reward(previous_IoU, current_IoU, t):
    # scalar version of the reward above
    if current_IoU > previous_IoU and previous_IoU >= 0:
        reward = 1 - 0.001 * t
    elif current_IoU <= previous_IoU and current_IoU >= 0:
        reward = -0.001 * t
    else:
        reward = -1 - 0.001 * t
    return reward

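A quick numeric check of the scalar reward above (values chosen for illustration):

```python
# IoU improved from 0.3 to 0.5 at step t=2: reward = 1 - 0.001*2 = 0.998
assert abs(calculate_reward(0.3, 0.5, 2) - 0.998) < 1e-6
# IoU dropped but stayed non-negative: only the time penalty applies
assert abs(calculate_reward(0.5, 0.3, 2) - (-0.002)) < 1e-6
```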
def calculate_RL_IoU_batch(i0, i1):
    # calculate temporal intersection over union for a batch of interval pairs
    batch_size = len(i0)
    iou_batch = np.zeros(batch_size, dtype=np.float32)
    for i in range(len(i0)):
        union = (min(i0[i][0], i1[i][0]), max(i0[i][1], i1[i][1]))
        inter = (max(i0[i][0], i1[i][0]), min(i0[i][1], i1[i][1]))
        # if inter[1] < inter[0]:
        #     iou = 0
        # else:
        iou = 1.0 * (inter[1] - inter[0]) / (union[1] - union[0])
        iou_batch[i] = iou
    return iou_batch

def calculate_IoU(i0, i1):
# calculate temporal intersection over union
union = (min(i0[0], i1[0]), max(i0[1], i1[1]))
inter = (max(i0[0], i1[0]), min(i0[1], i1[1]))
iou = 1.0*(inter[1]-inter[0])/(union[1]-union[0])
return iou
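A worked example of the temporal IoU above (purely illustrative; note that disjoint intervals yield a negative value):

```python
# ground truth (10, 20) vs prediction (15, 30):
# union = (10, 30), intersection = (15, 20)  ->  IoU = 5 / 20 = 0.25
assert abs(calculate_IoU((10.0, 20.0), (15.0, 30.0)) - 0.25) < 1e-6
```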
def nms_temporal(x1, x2, sim, overlap):
    # temporal non-maximum suppression: keep the highest-scoring windows and drop
    # windows whose temporal IoU with an already kept one exceeds `overlap`
    pick = []
    assert len(x1) == len(sim)
    assert len(x2) == len(sim)
    if len(x1) == 0:
        return pick

    union = list(map(operator.sub, x2, x1))  # window lengths: x2 - x1
    I = [i[0] for i in sorted(enumerate(sim), key=lambda x: x[1])]  # indices sorted by score (ascending)
    while len(I) > 0:
        i = I[-1]
        pick.append(i)
        xx1 = [max(x1[i], x1[j]) for j in I[:-1]]
        xx2 = [min(x2[i], x2[j]) for j in I[:-1]]
        inter = [max(0.0, k2 - k1) for k1, k2 in zip(xx1, xx2)]
        o = [inter[u] / (union[i] + union[I[u]] - inter[u]) for u in range(len(I) - 1)]
        I_new = []
        for j in range(len(o)):
            if o[j] <= overlap:
                I_new.append(I[j])
        I = I_new
    return pick

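A small sanity check of the temporal NMS helper above (windows and scores chosen for illustration):

```python
# Three windows; the second overlaps the first heavily (IoU = 8/12 ≈ 0.67 > 0.4)
# and is suppressed, so only the best-scoring window and the distant one remain.
x1, x2 = [0.0, 2.0, 20.0], [10.0, 12.0, 30.0]
sim = [0.9, 0.8, 0.3]
assert nms_temporal(x1, x2, sim, overlap=0.4) == [0, 2]
```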
def compute_IoU_recall_top_n_forreg_rl(top_n, iou_thresh, sentence_image_reg_mat, sclips):
correct_num = 0.0
for k in range(sentence_image_reg_mat.shape[0]):
gt = sclips[k]
# print(gt)
gt_start = float(gt.split("_")[1])
gt_end = float(gt.split("_")[2])
pred_start = sentence_image_reg_mat[k, 0]
pred_end = sentence_image_reg_mat[k, 1]
iou = calculate_IoU((gt_start, gt_end),(pred_start, pred_end))
if iou>=iou_thresh:
correct_num+=1
return correct_num
def compute_IoU_recall_top_n_forreg(top_n, iou_thresh, sentence_image_mat, sentence_image_reg_mat, sclips, iclips):
correct_num = 0.0
for k in range(sentence_image_mat.shape[0]):
gt = sclips[k]
gt_start = float(gt.split("_")[1])
gt_end = float(gt.split("_")[2])
#print gt +" "+str(gt_start)+" "+str(gt_end)
sim_v = [v for v in sentence_image_mat[k]]
starts = [s for s in sentence_image_reg_mat[k,:,0]]
ends = [e for e in sentence_image_reg_mat[k,:,1]]
picks = nms_temporal(starts,ends, sim_v, iou_thresh-0.05)
#sim_argsort=np.argsort(sim_v)[::-1][0:top_n]
if top_n<len(picks): picks=picks[0:top_n]
for index in picks:
pred_start = sentence_image_reg_mat[k, index, 0]
pred_end = sentence_image_reg_mat[k, index, 1]
iou = calculate_IoU((gt_start, gt_end),(pred_start, pred_end))
if iou>=iou_thresh:
correct_num+=1
break
return correct_num
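To show how the counter above turns into the usual "R@n, IoU=m" numbers reported for TALL, here is a tiny self-contained example; the clip names, scores and regression outputs are purely illustrative:

```python
import numpy as np

# One movie, two query sentences, three candidate windows per sentence.
# Clip names follow the "<movie>_<start>_<end>" pattern parsed above.
sclips = ["s13_10_20", "s13_40_60"]
iclips = ["s13_0_15", "s13_12_22", "s13_38_62"]
sentence_image_mat = np.array([[0.2, 0.9, 0.1],
                               [0.1, 0.2, 0.8]])                 # alignment scores
sentence_image_reg_mat = np.zeros((2, 3, 2))
sentence_image_reg_mat[:, :, 0] = [[0, 12, 38], [0, 12, 38]]     # regressed starts
sentence_image_reg_mat[:, :, 1] = [[15, 22, 62], [15, 22, 62]]   # regressed ends

correct = compute_IoU_recall_top_n_forreg(
    1, 0.5, sentence_image_mat, sentence_image_reg_mat, sclips, iclips)
recall_at_1 = correct / float(len(sclips))   # per-movie "R@1, IoU=0.5"
```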
......@@ -10,6 +10,7 @@ from .ctcn import CTCN
from .bmn import BMN
from .bsn import BsnTem
from .bsn import BsnPem
from .tall import TALL
# regist models, sort by alphabet
regist_model("AttentionCluster", AttentionCluster)
......@@ -23,3 +24,4 @@ regist_model("CTCN", CTCN)
regist_model("BMN", BMN)
regist_model("BsnTem", BsnTem)
regist_model("BsnPem", BsnPem)
regist_model("TALL", TALL)
# TALL video model
---
## Contents

- [Model introduction]
This diff is collapsed.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
#
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
import os
import time
import sys
import paddle.fluid as fluid
class TALL(object):
    def __init__(self, mode, cfg):
        self.mode = mode
        self.images = cfg["images"]
        self.sentences = cfg["sentences"]
        if self.mode == "train":
            self.offsets = cfg["offsets"]
        self.semantic_size = cfg["semantic_size"]
        self.hidden_size = cfg["hidden_size"]
        self.output_size = cfg["output_size"]
        # batch size is needed to tile features in _cross_modal_comb; assumed to
        # be passed in cfg together with the input variables above
        self.batch_size = cfg["batch_size"]

    def _cross_modal_comb(self, visual_feat, sentence_embed):
        # tile clip features along axis 0 and sentence features along axis 1,
        # producing a (B, B, D) grid covering every sentence/clip pair
        batch_size = self.batch_size
        visual_feat = fluid.layers.reshape(visual_feat, [1, -1, self.semantic_size])
        vv_feature = fluid.layers.expand(visual_feat, [batch_size, 1, 1])
        sentence_embed = fluid.layers.reshape(sentence_embed, [-1, 1, self.semantic_size])
        ss_feature = fluid.layers.expand(sentence_embed, [1, batch_size, 1])

        concat_feature = fluid.layers.concat([vv_feature, ss_feature], axis=2)  # B, B, 2*D
        mul_feature = vv_feature * ss_feature  # B, B, D
        add_feature = vv_feature + ss_feature  # B, B, D

        comb_feature = fluid.layers.concat(
            [mul_feature, add_feature, concat_feature], axis=2)
        return comb_feature
    def net(self):
        # visual2semantic
        transformed_clip_train = fluid.layers.fc(
            input=self.images,
            size=self.semantic_size,
            act=None,
            name='v2s_lt',
            param_attr=fluid.ParamAttr(
                name='v2s_lt_weights',
                initializer=fluid.initializer.NormalInitializer(loc=0.0, scale=1.0, seed=0)),
            bias_attr=False)
        # l2_normalize
        transformed_clip_train = fluid.layers.l2_normalize(x=transformed_clip_train, axis=1)

        # sentence2semantic
        transformed_sentence_train = fluid.layers.fc(
            input=self.sentences,
            size=self.semantic_size,
            act=None,
            name='s2s_lt',
            param_attr=fluid.ParamAttr(
                name='s2s_lt_weights',
                initializer=fluid.initializer.NormalInitializer(loc=0.0, scale=1.0, seed=0)),
            bias_attr=False)
        # l2_normalize
        transformed_sentence_train = fluid.layers.l2_normalize(x=transformed_sentence_train, axis=1)

        cross_modal_vec_train = self._cross_modal_comb(transformed_clip_train, transformed_sentence_train)
        cross_modal_vec_train = fluid.layers.unsqueeze(input=cross_modal_vec_train, axes=[0])
        cross_modal_vec_train = fluid.layers.transpose(cross_modal_vec_train, perm=[0, 3, 1, 2])

        mid_output = fluid.layers.conv2d(
            input=cross_modal_vec_train,
            num_filters=self.hidden_size,
            filter_size=1,
            stride=1,
            act="relu",
            param_attr=fluid.param_attr.ParamAttr(name="mid_out_weights"),
            bias_attr=False)

        sim_score_mat_train = fluid.layers.conv2d(
            input=mid_output,
            num_filters=self.output_size,
            filter_size=1,
            stride=1,
            act=None,
            param_attr=fluid.param_attr.ParamAttr(name="sim_mat_weights"),
            bias_attr=False)

        self.sim_score_mat_train = fluid.layers.squeeze(input=sim_score_mat_train, axes=[0])
        return self.sim_score_mat_train, self.offsets
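To make the cross-modal combination easier to follow, here is a plain-numpy shape sketch of what `_cross_modal_comb` builds before the 1x1 convolutions (illustrative only, not part of the model code):

```python
import numpy as np

B, D = 4, 1024                                   # batch size, semantic_size
v = np.random.rand(B, D).astype("float32")       # transformed clip features
s = np.random.rand(B, D).astype("float32")       # transformed sentence features

vv = np.broadcast_to(v[None, :, :], (B, B, D))   # clip j repeated along axis 0
ss = np.broadcast_to(s[:, None, :], (B, B, D))   # sentence i repeated along axis 1
comb = np.concatenate(
    [vv * ss, vv + ss, np.concatenate([vv, ss], axis=2)], axis=2)

assert comb.shape == (B, B, 4 * D)               # every sentence/clip pair, 4*semantic_size features
```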
......@@ -6,6 +6,7 @@ from .ctcn_reader import CTCNReader
from .bmn_reader import BMNReader
from .bsn_reader import BSNVideoReader
from .bsn_reader import BSNProposalReader
from .tall_reader import TALLReader
# regist reader, sort by alphabet
regist_reader("ATTENTIONCLUSTER", FeatureReader)
......@@ -19,3 +20,4 @@ regist_reader("CTCN", CTCNReader)
regist_reader("BMN", BMNReader)
regist_reader("BSNTEM", BSNVideoReader)
regist_reader("BSNPEM", BSNProposalReader)
regist_reader("TALL", TALLReader)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
#
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
import os
import sys
import cv2
import math
import random
import functools
try:
import cPickle as pickle
from cStringIO import StringIO
except ImportError:
import pickle
from io import BytesIO
import numpy as np
import paddle
from PIL import Image, ImageEnhance
import logging
from .reader_utils import DataReader

logger = logging.getLogger(__name__)
class TacosReader(DataReader):
def __init__(self, name, mode, cfg):
self.name = name
self.mode = mode
self.cfg = cfg
    def create_reader(self):
        cfg = self.cfg
        mode = self.mode
        num_reader_threads = cfg[mode.upper()]['num_reader_threads']
        assert num_reader_threads >= 1, \
            "number of reader threads({}) should be a positive integer".format(num_reader_threads)
        if num_reader_threads == 1:
            reader_func = make_reader
        else:
            reader_func = make_multi_reader

        if self.mode == 'train':
            return reader_func(cfg)
        elif self.mode == 'valid':
            return reader_func(cfg)
        else:
            logger.info("Not implemented")
            raise NotImplementedError


def make_reader(cfg):
    def reader():
        cs = pickle.load(open(cfg.TRAIN.train_clip_sentvec, 'rb'))
        movie_length_info = pickle.load(open(cfg.TRAIN.movie_length_info, 'rb'))
        # put train() in here
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
#
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
import os
import sys
import math
import random
import functools
import logging
try:
    import cPickle as pickle
except ImportError:
    import pickle
import numpy as np
import paddle

from .reader_utils import DataReader

logger = logging.getLogger(__name__)

random.seed(0)

THREAD = 8
BUF_SIZE = 1024
class TALLReader(DataReader):
    def __init__(self, name, mode, cfg):
        self.name = name
        self.mode = mode
        self.cfg = cfg

    def create_reader(self):
        cfg = self.cfg
        if self.mode == 'train':
            train_batch_size = cfg.TRAIN.batch_size
            return paddle.batch(train(cfg), batch_size=train_batch_size, drop_last=True)
        elif self.mode == 'valid':
            return test(cfg)
        else:
            logger.info("Not implemented")
            raise NotImplementedError
'''
calculate temporal intersection over union
'''
def calculate_IoU(i0, i1):
union = (min(i0[0], i1[0]), max(i0[1], i1[1]))
inter = (max(i0[0], i1[0]), min(i0[1], i1[1]))
iou = 1.0*(inter[1]-inter[0])/(union[1]-union[0])
return iou
'''
calculate the non-intersection over length (nIoL) ratio; make sure the input IoU is larger than 0
'''
# nIoL = [(x1_max - x1_min) - overlap] / (x1_max - x1_min)
def calculate_nIoL(base, sliding_clip):
inter = (max(base[0], sliding_clip[0]), min(base[1], sliding_clip[1]))
inter_l = inter[1]-inter[0]
length = sliding_clip[1]-sliding_clip[0]
nIoL = 1.0*(length-inter_l)/length
return nIoL
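A worked example of the nIoL measure above (numbers are illustrative):

```python
# ground truth (10, 30) vs sliding window (20, 40): the intersection (20, 30)
# has length 10 and the window has length 20, so nIoL = (20 - 10) / 20 = 0.5
assert abs(calculate_nIoL((10.0, 30.0), (20.0, 40.0)) - 0.5) < 1e-6
```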
def get_context_window(sliding_clip_path, clip_name, win_length, context_size, feats_dimen):
# compute left (pre) and right (post) context features based on read_unit_level_feats().
movie_name = clip_name.split("_")[0]
start = int(clip_name.split("_")[1])
end = int(clip_name.split("_")[2].split(".")[0])
clip_length = context_size
left_context_feats = np.zeros([win_length, feats_dimen], dtype=np.float32)
right_context_feats = np.zeros([win_length, feats_dimen], dtype=np.float32)
last_left_feat = np.load(sliding_clip_path+clip_name)
last_right_feat = np.load(sliding_clip_path+clip_name)
for k in range(win_length):
left_context_start = start - clip_length * (k + 1)
left_context_end = start - clip_length * k
right_context_start = end + clip_length * k
right_context_end = end + clip_length * (k + 1)
left_context_name = movie_name + "_" + str(left_context_start) + "_" + str(left_context_end) + ".npy"
right_context_name = movie_name + "_" + str(right_context_start) + "_" + str(right_context_end) + ".npy"
if os.path.exists(sliding_clip_path+left_context_name):
left_context_feat = np.load(sliding_clip_path+left_context_name)
last_left_feat = left_context_feat
else:
left_context_feat = last_left_feat
if os.path.exists(sliding_clip_path+right_context_name):
right_context_feat = np.load(sliding_clip_path+right_context_name)
last_right_feat = right_context_feat
else:
right_context_feat = last_right_feat
left_context_feats[k] = left_context_feat
right_context_feats[k] = right_context_feat
return np.mean(left_context_feats, axis=0), np.mean(right_context_feats, axis=0)
def process_data(sample, is_train):
clip_sentence_pair, sliding_clip_path, context_num, context_size, feats_dimen , sent_vec_dim = sample
if is_train:
offset = np.zeros(2, dtype=np.float32)
clip_name = clip_sentence_pair[0]
feat_path = sliding_clip_path+clip_sentence_pair[2]
featmap = np.load(feat_path)
left_context_feat, right_context_feat = get_context_window(sliding_clip_path, clip_sentence_pair[2], context_num, context_size, feats_dimen)
image = np.hstack((left_context_feat, featmap, right_context_feat))
sentence = clip_sentence_pair[1][:sent_vec_dim]
p_offset = clip_sentence_pair[3]
l_offset = clip_sentence_pair[4]
offset[0] = p_offset
offset[1] = l_offset
return image, sentence, offset
else:
pass
def make_train_reader(cfg, clip_sentence_pairs_iou, shuffle=False, is_train=True):
sliding_clip_path = cfg.TRAIN.sliding_clip_path
context_num = cfg.TRAIN.context_num
context_size = cfg.TRAIN.context_size
feats_dimen = cfg.TRAIN.feats_dimen
sent_vec_dim = cfg.TRAIN.sent_vec_dim
def reader():
if shuffle:
random.shuffle(clip_sentence_pairs_iou)
for clip_sentence_pair in clip_sentence_pairs_iou:
yield [clip_sentence_pair, sliding_clip_path, context_num, context_size, feats_dimen, sent_vec_dim]
mapper = functools.partial(
process_data,
is_train=is_train)
return paddle.reader.xmap_readers(mapper, reader, THREAD, BUF_SIZE)
def train(cfg):
    feats_dimen = cfg.TRAIN.feats_dimen
    context_num = cfg.TRAIN.context_num
    context_size = cfg.TRAIN.context_size
    visual_feature_dim = cfg.TRAIN.visual_feature_dim
    sent_vec_dim = cfg.TRAIN.sent_vec_dim
    sliding_clip_path = cfg.TRAIN.sliding_clip_path

    cs = pickle.load(open(cfg.TRAIN.train_clip_sentvec, 'rb'))
    movie_length_info = pickle.load(open(cfg.TRAIN.movie_length_info, 'rb'))
clip_sentence_pairs = []
for l in cs:
clip_name = l[0]
sent_vecs = l[1]
for sent_vec in sent_vecs:
clip_sentence_pairs.append((clip_name, sent_vec)) #10146
print "TRAIN: " + str(len(clip_sentence_pairs))+" clip-sentence pairs are readed"
movie_names_set = set()
movie_clip_names = {}
# read groundtruth sentence-clip pairs
for k in range(len(clip_sentence_pairs)):
clip_name = clip_sentence_pairs[k][0]
movie_name = clip_name.split("_")[0]
if not movie_name in movie_names_set:
movie_names_set.add(movie_name)
movie_clip_names[movie_name] = []
movie_clip_names[movie_name].append(k)
movie_names = list(movie_names_set)
num_samples = len(clip_sentence_pairs)
print "TRAIN: " + str(len(movie_names))+" movies."
# read sliding windows, and match them with the groundtruths to make training samples
sliding_clips_tmp = os.listdir(sliding_clip_path) #161396
clip_sentence_pairs_iou = []
#count = 0
for clip_name in sliding_clips_tmp:
if clip_name.split(".")[2]=="npy":
movie_name = clip_name.split("_")[0]
for clip_sentence in clip_sentence_pairs:
original_clip_name = clip_sentence[0]
original_movie_name = original_clip_name.split("_")[0]
if original_movie_name==movie_name:
start = int(clip_name.split("_")[1])
end = int(clip_name.split("_")[2].split(".")[0])
o_start = int(original_clip_name.split("_")[1])
o_end = int(original_clip_name.split("_")[2].split(".")[0])
iou = calculate_IoU((start, end), (o_start, o_end))
if iou>0.5:
nIoL=calculate_nIoL((o_start, o_end), (start, end))
if nIoL<0.15:
movie_length = movie_length_info[movie_name.split(".")[0]]
start_offset = o_start-start
end_offset = o_end-end
clip_sentence_pairs_iou.append((clip_sentence[0], clip_sentence[1], clip_name, start_offset, end_offset))
# count += 1
# if count > 200:
# break
num_samples_iou = len(clip_sentence_pairs_iou)
print "TRAIN: " + str(len(clip_sentence_pairs_iou))+" iou clip-sentence pairs are readed"
return make_train_reader(cfg, clip_sentence_pairs_iou, shuffle=True, is_train=True)
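For reference, a minimal sketch of how the training reader above would be consumed (cfg is assumed to be the parsed tall.yaml with attribute-style access, as elsewhere in this file):

```python
batched = paddle.batch(train(cfg), batch_size=cfg.TRAIN.batch_size, drop_last=True)
for batch in batched():
    # each sample is (image, sentence, offset) as produced by process_data()
    images, sentences, offsets = zip(*batch)
    break
```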
class test(object):
    '''
    Test-time data wrapper: loads the TACoS sliding-window features and the
    ground-truth clip/sentence pairs configured under cfg.TEST.
    '''

    def __init__(self, cfg):
self.context_num = cfg.TEST.context_num
self.visual_feature_dim = cfg.TEST.visual_feature_dim
self.feats_dimen = cfg.TEST.feats_dimen
self.context_size = cfg.TEST.context_size
self.semantic_size = cfg.TEST.semantic_size
self.sliding_clip_path = cfg.TEST.sliding_clip_path
self.sent_vec_dim = cfg.TEST.sent_vec_dim
        self.cs = pickle.load(open(cfg.TEST.test_clip_sentvec, 'rb'))
self.clip_sentence_pairs = []
for l in self.cs:
clip_name = l[0]
sent_vecs = l[1]
for sent_vec in sent_vecs:
self.clip_sentence_pairs.append((clip_name, sent_vec))
print "TEST: " + str(len(self.clip_sentence_pairs)) + " pairs are readed"
movie_names_set = set()
self.movie_clip_names = {}
for k in range(len(self.clip_sentence_pairs)):
clip_name = self.clip_sentence_pairs[k][0]
movie_name = clip_name.split("_")[0]
if not movie_name in movie_names_set:
movie_names_set.add(movie_name)
self.movie_clip_names[movie_name] = []
self.movie_clip_names[movie_name].append(k)
self.movie_names = list(movie_names_set)
print "TEST: " + str(len(self.movie_names)) + " movies."
self.clip_num_per_movie_max = 0
for movie_name in self.movie_clip_names:
if len(self.movie_clip_names[movie_name])>self.clip_num_per_movie_max: self.clip_num_per_movie_max = len(self.movie_clip_names[movie_name])
print "TEST: " + "Max number of clips in a movie is "+str(self.clip_num_per_movie_max)
sliding_clips_tmp = os.listdir(self.sliding_clip_path) # 62741
self.sliding_clip_names = []
for clip_name in sliding_clips_tmp:
if clip_name.split(".")[2]=="npy":
movie_name = clip_name.split("_")[0]
if movie_name in self.movie_clip_names:
self.sliding_clip_names.append(clip_name.split(".")[0]+"."+clip_name.split(".")[1])
self.num_samples = len(self.clip_sentence_pairs)
print "TEST: " + "sliding clips number: "+str(len(self.sliding_clip_names))
def get_test_context_window(self, clip_name, win_length):
# compute left (pre) and right (post) context features based on read_unit_level_feats().
movie_name = clip_name.split("_")[0]
start = int(clip_name.split("_")[1])
end = int(clip_name.split("_")[2].split(".")[0])
clip_length = self.context_size #128
left_context_feats = np.zeros([win_length, self.feats_dimen], dtype=np.float32) #(1,4096)
right_context_feats = np.zeros([win_length, self.feats_dimen], dtype=np.float32)#(1,4096)
last_left_feat = np.load(self.sliding_clip_path+clip_name)
last_right_feat = np.load(self.sliding_clip_path+clip_name)
for k in range(win_length):
left_context_start = start - clip_length * (k + 1)
left_context_end = start - clip_length * k
right_context_start = end + clip_length * k
right_context_end = end + clip_length * (k + 1)
left_context_name = movie_name + "_" + str(left_context_start) + "_" + str(left_context_end) + ".npy"
right_context_name = movie_name + "_" + str(right_context_start) + "_" + str(right_context_end) + ".npy"
if os.path.exists(self.sliding_clip_path+left_context_name):
left_context_feat = np.load(self.sliding_clip_path+left_context_name)
last_left_feat = left_context_feat
else:
left_context_feat = last_left_feat
if os.path.exists(self.sliding_clip_path+right_context_name):
right_context_feat = np.load(self.sliding_clip_path+right_context_name)
last_right_feat = right_context_feat
else:
right_context_feat = last_right_feat
left_context_feats[k] = left_context_feat
right_context_feats[k] = right_context_feat
return np.mean(left_context_feats, axis=0), np.mean(right_context_feats, axis=0)
def load_movie_slidingclip(self, movie_name, sample_num):
# load unit level feats and sentence vector
movie_clip_sentences = []
movie_clip_featmap = []
clip_set = set()
for k in range(len(self.clip_sentence_pairs)):
if movie_name in self.clip_sentence_pairs[k][0]:
movie_clip_sentences.append((self.clip_sentence_pairs[k][0], self.clip_sentence_pairs[k][1][:self.semantic_size]))
for k in range(len(self.sliding_clip_names)):
if movie_name in self.sliding_clip_names[k]:
# print str(k)+"/"+str(len(self.movie_clip_names[movie_name]))
visual_feature_path = self.sliding_clip_path+self.sliding_clip_names[k]+".npy"
#context_feat=self.get_context(self.sliding_clip_names[k]+".npy")
left_context_feat,right_context_feat = self.get_test_context_window(self.sliding_clip_names[k]+".npy",1)
feature_data = np.load(visual_feature_path)
#comb_feat=np.hstack((context_feat,feature_data))
comb_feat = np.hstack((left_context_feat,feature_data,right_context_feat))
movie_clip_featmap.append((self.sliding_clip_names[k], comb_feat))
return movie_clip_featmap, movie_clip_sentences
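And a sketch of how the test wrapper is typically driven (the model forward pass is omitted; shapes follow the code above, everything else is illustrative):

```python
ds = test(cfg)
for movie_name in ds.movie_clip_names:
    # movie_clip_featmap: [(clip_name, 12288-d visual feature)]
    # movie_clip_sentences: [(clip_name, 4800-d sentence vector)]
    movie_clip_featmap, movie_clip_sentences = ds.load_movie_slidingclip(movie_name, 16)
    # ...run the TALL network on every (sentence, clip) pair and fill the
    # score/regression matrices consumed by compute_IoU_recall_top_n_forreg...
```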