Unverified commit c11f177a, authored by Kaipeng Deng, committed by GitHub

Merge pull request #2001 from SunGaofeng/video_classification

Paddle Video V2
......@@ -10,17 +10,19 @@
| [Attention LSTM](./models/attention_lstm/README.md) | Video classification | Widely used model; fast and accurate |
| [NeXtVLAD](./models/nextvlad/README.md) | Video classification | Best single model in the 2nd YouTube-8M challenge |
| [StNet](./models/stnet/README.md) | Video classification | Joint spatial-temporal modeling method for video, proposed at AAAI'19 |
| [TSM](./models/tsm/README.md) | Video classification | Simple and efficient temporal-shift-based spatial-temporal modeling method |
| [TSN](./models/tsn/README.md) | Video classification | Classic 2D-CNN-based solution, proposed at ECCV'16 |
| [Non-local](./models/nonlocal_model/README.md) | Video classification | Non-local context modeling model for video |
### Key Features
- Contains several leading video classification models. Attention LSTM, Attention Cluster, and NeXtVLAD are popular feature-sequence models, while Non-local, TSN, TSM, and StNet are end-to-end video classification models. Attention LSTM is fast and accurate; NeXtVLAD was the best single model in the 2nd YouTube-8M challenge; TSN is the classic 2D-CNN-based solution; TSM is a simple and efficient temporal-shift-based spatial-temporal modeling method; the Non-local model introduces non-local context modeling for video. Attention Cluster and StNet are Baidu's own models, published at CVPR 2018 and AAAI 2019 respectively, and were used in the first-place entry of the Kinetics-600 challenge.
- Provides general skeleton code suited to video classification tasks, so users can configure, train, and evaluate models efficiently with a single command.
## Installation
Running the sample code in this model zoo requires PaddlePaddle Fluid v1.4.0 or later. If the PaddlePaddle version in your environment is lower, update it following the [installation guide](http://www.paddlepaddle.org/documentation/docs/zh/1.4/beginners_guide/install/index_cn.html).
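You can check the installed version first (a minimal sanity check; `paddle.__version__` holds the package version string):

```python
import paddle
print(paddle.__version__)  # expect 1.4.0 or higher
```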
## Data Preparation
......@@ -113,9 +115,10 @@ infer.py
| Model | Batch Size | Hardware | cuDNN Version | Top-1 | Download |
| :-------: | :---: | :---------: | :----: | :----: | :----------: |
| StNet | 128 | 8x P40 | 7.1 | 0.69 | [model](https://paddlemodels.bj.bcebos.com/video_classification/stnet_kinetics.tar.gz) |
| TSN | 256 | 8x P40 | 7.1 | 0.67 | [model](https://paddlemodels.bj.bcebos.com/video_classification/tsn_kinetics.tar.gz) |
| TSM | 128 | 8x P40 | 7.1 | 0.70 | [model](https://paddlemodels.bj.bcebos.com/video_classification/tsm_kinetics.tar.gz) |
| Non-local | 64 | 8x P40 | 7.1 | 0.74 | [model](https://paddlemodels.bj.bcebos.com/video_classification/nonlocal_kinetics.tar.gz) |
## References
- [Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification](https://arxiv.org/abs/1711.09550), Xiang Long, Chuang Gan, Gerard de Melo, Jiajun Wu, Xiao Liu, Shilei Wen
......@@ -123,8 +126,9 @@ infer.py
- [NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification](https://arxiv.org/abs/1811.05014), Rongcheng Lin, Jing Xiao, Jianping Fan
- [StNet: Local and Global Spatial-Temporal Modeling for Human Action Recognition](https://arxiv.org/abs/1811.01549), Dongliang He, Zhichao Zhou, Chuang Gan, Fu Li, Xiao Liu, Yandong Li, Limin Wang, Shilei Wen
- [Temporal Segment Networks: Towards Good Practices for Deep Action Recognition](https://arxiv.org/abs/1608.00859), Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc Van Gool
- [Temporal Shift Module for Efficient Video Understanding](https://arxiv.org/abs/1811.08383v1), Ji Lin, Chuang Gan, Song Han
- [Non-local Neural Networks](https://arxiv.org/abs/1711.07971v1), Xiaolong Wang, Ross Girshick, Abhinav Gupta, Kaiming He
## Version Updates
- 3/2019: Added the model zoo and released five video classification models: Attention Cluster, Attention LSTM, NeXtVLAD, StNet, and TSN.
- 4/2019: Released two more video classification models: Non-local and TSM.
......@@ -19,12 +19,15 @@ except:
from utils import AttrDict
import logging
logger = logging.getLogger(__name__)
CONFIG_SECS = [
    'train',
    'valid',
    'test',
    'infer',
]
def parse_config(cfg_file):
......@@ -43,6 +46,7 @@ def parse_config(cfg_file):
return cfg
def merge_configs(cfg, sec, args_dict):
assert sec in CONFIG_SECS, "invalid config section {}".format(sec)
sec_dict = getattr(cfg, sec.upper())
......@@ -56,3 +60,11 @@ def merge_configs(cfg, sec, args_dict):
pass
return cfg
def print_configs(cfg, mode):
logger.info("---------------- {:>5} Arguments ----------------".format(mode))
for sec, sec_items in cfg.items():
logger.info("{}:".format(sec))
for k, v in sec_items.items():
logger.info(" {}:{}".format(k, v))
logger.info("-------------------------------------------------")
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
File mode changed from 100755 to 100644
[MODEL]
name = "NONLOCAL"
num_classes = 400
image_mean = 114.75
image_std = 57.375
depth = 50
dataset = 'kinetics400'
video_arc_choice = 1
use_affine = False
fc_init_std = 0.01
bn_momentum = 0.9
bn_epsilon = 1.0e-5
bn_init_gamma = 0.
[RESNETS]
num_groups = 1
width_per_group = 64
trans_func = bottleneck_transformation_3d
[NONLOCAL]
bn_momentum = 0.9
bn_epsilon = 1.0e-5
bn_init_gamma = 0.0
layer_mod = 2
conv3_nonlocal = True
conv4_nonlocal = True
conv_init_std = 0.01
no_bias = 0
use_maxpool = True
use_softmax = True
use_scale = True
use_zero_init_conv = False
use_bn = True
use_affine = False
[TRAIN]
num_reader_threads = 8
batch_size = 64
num_gpus = 8
filelist = './dataset/nonlocal/trainlist.txt'
crop_size = 224
sample_rate = 8
video_length = 8
jitter_scales = [256, 320]
dropout_rate = 0.5
learning_rate = 0.01
learning_rate_decay = 0.1
step_sizes = [150000, 150000, 100000]
max_iter = 400000
weight_decay = 0.0001
weight_decay_bn = 0.0
momentum = 0.9
nesterov = True
scale_momentum = True
[VALID]
num_reader_threads = 8
batch_size = 64
filelist = './dataset/nonlocal/vallist.txt'
crop_size = 224
sample_rate = 8
video_length = 8
jitter_scales = [256, 320]
[TEST]
num_reader_threads = 8
batch_size = 4
filelist = 'dataset/nonlocal/testlist.txt'
filename_gt = 'dataset/nonlocal/vallist.txt'
checkpoint_dir = './output'
crop_size = 256
sample_rate = 8
video_length = 8
jitter_scales = [256, 256]
num_test_clips = 30
dataset_size = 19761
use_multi_crop = 1
[INFER]
num_reader_threads = 8
batch_size = 1
filelist = 'dataset/nonlocal/inferlist.txt'
crop_size = 256
sample_rate = 8
video_length = 8
jitter_scales = [256, 256]
num_test_clips = 30
use_multi_crop = 1
......@@ -34,14 +34,16 @@ batch_size = 128
filelist = "./dataset/kinetics/val.list"
[TEST]
seg_num = 25
short_size = 256
target_size = 256
num_reader_threads = 12
buf_size = 1024
batch_size = 4
filelist = "./dataset/kinetics/test.list"
[INFER]
seg_num = 25
short_size = 256
target_size = 256
num_reader_threads = 12
......
[MODEL]
name = "TSM"
format = "pkl"
num_classes = 400
seg_num = 8
seglen = 1
image_mean = [0.485, 0.456, 0.406]
image_std = [0.229, 0.224, 0.225]
num_layers = 50
[TRAIN]
epoch = 65
short_size = 256
target_size = 224
num_reader_threads = 12
buf_size = 1024
batch_size = 128
use_gpu = True
num_gpus = 8
filelist = "./dataset/kinetics/train.list"
learning_rate = 0.01
learning_rate_decay = 0.1
decay_epochs = [40, 60]
l2_weight_decay = 1e-4
momentum = 0.9
total_videos = 239781
[VALID]
short_size = 256
target_size = 224
num_reader_threads = 12
buf_size = 1024
batch_size = 128
filelist = "./dataset/kinetics/val.list"
[TEST]
short_size = 256
target_size = 224
num_reader_threads = 12
buf_size = 1024
batch_size = 16
filelist = "./dataset/kinetics/test.list"
[INFER]
short_size = 256
target_size = 224
num_reader_threads = 12
buf_size = 1024
batch_size = 1
filelist = "./dataset/kinetics/infer.list"
......@@ -33,11 +33,12 @@ batch_size = 256
filelist = "./dataset/kinetics/val.list"
[TEST]
seg_num = 7
short_size = 256
target_size = 224
num_reader_threads = 12
buf_size = 1024
batch_size = 16
filelist = "./dataset/kinetics/test.list"
[INFER]
......
......@@ -3,10 +3,11 @@ from .feature_reader import FeatureReader
from .kinetics_reader import KineticsReader
from .nonlocal_reader import NonlocalReader
# regist reader, sort by alphabet
regist_reader("ATTENTIONCLUSTER", FeatureReader)
regist_reader("NEXTVLAD", FeatureReader)
regist_reader("ATTENTIONLSTM", FeatureReader)
regist_reader("TSN", KineticsReader)
regist_reader("NEXTVLAD", FeatureReader)
regist_reader("NONLOCAL", NonlocalReader)
regist_reader("TSM", KineticsReader)
regist_reader("TSN", KineticsReader)
regist_reader("STNET", KineticsReader)
regist_reader("NONLOCAL", NonlocalReader)
......@@ -54,16 +54,17 @@ class KineticsReader(DataReader):
"""
def __init__(self, name, mode, cfg):
self.name = name
self.mode = mode
super(KineticsReader, self).__init__(name, mode, cfg)
self.format = cfg.MODEL.format
self.num_classes = self.get_config_from_sec('model', 'num_classes')
self.seg_num = self.get_config_from_sec('model', 'seg_num')
self.seglen = self.get_config_from_sec('model', 'seglen')
self.seg_num = self.get_config_from_sec(mode, 'seg_num', self.seg_num)
self.short_size = self.get_config_from_sec(mode, 'short_size')
self.target_size = self.get_config_from_sec(mode, 'target_size')
self.num_reader_threads = self.get_config_from_sec(mode, 'num_reader_threads')
self.buf_size = self.get_config_from_sec(mode, 'buf_size')
self.img_mean = np.array(cfg.MODEL.image_mean).reshape(
[3, 1, 1]).astype(np.float32)
......@@ -74,7 +75,7 @@ class KineticsReader(DataReader):
self.filelist = cfg[mode.upper()]['filelist']
def create_reader(self):
_reader = self._reader_creator(self.filelist, self.mode, seg_num=self.seg_num, seglen = self.seglen, \
short_size = self.short_size, target_size = self.target_size, \
img_mean = self.img_mean, img_std = self.img_std, \
shuffle = (self.mode == 'train'), \
......@@ -94,117 +95,183 @@ class KineticsReader(DataReader):
return _batch_reader
def _reader_creator(self,
pickle_list,
mode,
seg_num,
seglen,
short_size,
target_size,
img_mean,
img_std,
shuffle=False,
num_threads=1,
buf_size=1024,
format='pkl'):
def decode_mp4(sample, mode, seg_num, seglen, short_size, target_size, img_mean,
img_std):
sample = sample[0].split(' ')
mp4_path = sample[0]
# when infer, we store vid as label
label = int(sample[1])
try:
imgs = mp4_loader(mp4_path, seg_num, seglen, mode)
if len(imgs) < 1:
logger.error('{} frame length {} less than 1.'.format(mp4_path,
len(imgs)))
return None, None
except:
logger.error('Error when loading {}'.format(mp4_path))
return None, None
return imgs_transform(imgs, label, mode, seg_num, seglen, \
short_size, target_size, img_mean, img_std)
def decode_pickle(sample, mode, seg_num, seglen, short_size, target_size,
img_mean, img_std):
pickle_path = sample[0]
try:
if python_ver < (3, 0):
data_loaded = pickle.load(open(pickle_path, 'rb'))
else:
data_loaded = pickle.load(open(pickle_path, 'rb'), encoding='bytes')
vid, label, frames = data_loaded
if len(frames) < 1:
logger.error('{} frame length {} less than 1.'.format(pickle_path,
len(frames)))
return None, None
except:
logger.info('Error when loading {}'.format(pickle_path))
return None, None
if mode == 'train' or mode == 'valid' or mode == 'test':
ret_label = label
elif mode == 'infer':
ret_label = vid
imgs = video_loader(frames, seg_num, seglen, mode)
return imgs_transform(imgs, ret_label, mode, seg_num, seglen, \
short_size, target_size, img_mean, img_std)
def imgs_transform(imgs, label, mode, seg_num, seglen, short_size, target_size,
img_mean, img_std):
imgs = group_scale(imgs, short_size)
if mode == 'train':
if self.name == "TSM":
imgs = group_multi_scale_crop(imgs, short_size)
imgs = group_random_crop(imgs, target_size)
imgs = group_random_flip(imgs)
else:
imgs = group_center_crop(imgs, target_size)
np_imgs = (np.array(imgs[0]).astype('float32').transpose(
(2, 0, 1))).reshape(1, 3, target_size, target_size) / 255
for i in range(len(imgs) - 1):
img = (np.array(imgs[i + 1]).astype('float32').transpose(
(2, 0, 1))).reshape(1, 3, target_size, target_size) / 255
np_imgs = np.concatenate((np_imgs, img))
imgs = np_imgs
imgs -= img_mean
imgs /= img_std
imgs = np.reshape(imgs, (seg_num, seglen * 3, target_size, target_size))
return imgs, label
def reader():
with open(pickle_list) as flist:
lines = [line.strip() for line in flist]
if shuffle:
random.shuffle(lines)
for line in lines:
pickle_path = line.strip()
yield [pickle_path]
if format == 'pkl':
decode_func = decode_pickle
elif format == 'mp4':
decode_func = decode_mp4
else:
            raise NotImplementedError(
                "Not implemented format {}".format(format))
mapper = functools.partial(
decode_func,
mode=mode,
seg_num=seg_num,
seglen=seglen,
short_size=short_size,
target_size=target_size,
img_mean=img_mean,
img_std=img_std)
return paddle.reader.xmap_readers(mapper, reader, num_threads, buf_size)
def group_multi_scale_crop(img_group, target_size, scales=None, \
max_distort=1, fix_crop=True, more_fix_crop=True):
scales = scales if scales is not None else [1, .875, .75, .66]
input_size = [target_size, target_size]
im_size = img_group[0].size
# get random crop offset
def _sample_crop_size(im_size):
image_w, image_h = im_size[0], im_size[1]
base_size = min(image_w, image_h)
crop_sizes = [int(base_size * x) for x in scales]
crop_h = [input_size[1] if abs(x - input_size[1]) < 3 else x for x in crop_sizes]
crop_w = [input_size[0] if abs(x - input_size[0]) < 3 else x for x in crop_sizes]
pairs = []
for i, h in enumerate(crop_h):
for j, w in enumerate(crop_w):
if abs(i - j) <= max_distort:
pairs.append((w, h))
crop_pair = random.choice(pairs)
if not fix_crop:
w_offset = random.randint(0, image_w - crop_pair[0])
h_offset = random.randint(0, image_h - crop_pair[1])
else:
            w_step = (image_w - crop_pair[0]) // 4
            h_step = (image_h - crop_pair[1]) // 4
ret = list()
ret.append((0, 0)) # upper left
if w_step != 0:
ret.append((4 * w_step, 0)) # upper right
if h_step != 0:
ret.append((0, 4 * h_step)) # lower left
if h_step != 0 and w_step != 0:
ret.append((4 * w_step, 4 * h_step)) # lower right
if h_step != 0 or w_step != 0:
ret.append((2 * w_step, 2 * h_step)) # center
if more_fix_crop:
ret.append((0, 2 * h_step)) # center left
ret.append((4 * w_step, 2 * h_step)) # center right
ret.append((2 * w_step, 4 * h_step)) # lower center
ret.append((2 * w_step, 0 * h_step)) # upper center
ret.append((1 * w_step, 1 * h_step)) # upper left quarter
ret.append((3 * w_step, 1 * h_step)) # upper right quarter
ret.append((1 * w_step, 3 * h_step)) # lower left quarter
                ret.append((3 * w_step, 3 * h_step))  # lower right quarter
w_offset, h_offset = random.choice(ret)
return crop_pair[0], crop_pair[1], w_offset, h_offset
crop_w, crop_h, offset_w, offset_h = _sample_crop_size(im_size)
crop_img_group = [img.crop((offset_w, offset_h, offset_w + crop_w, offset_h + crop_h)) for img in img_group]
ret_img_group = [img.resize((input_size[0], input_size[1]), Image.BILINEAR) for img in crop_img_group]
return ret_img_group
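# Illustrative usage: during TSM training, imgs_transform above applies this
# multi-scale crop before the random crop, i.e.
#
#     imgs = group_multi_scale_crop(imgs, short_size)   # e.g. short_size = 256
#     imgs = group_random_crop(imgs, target_size)       # e.g. target_size = 224
#
# With the default scales [1, .875, .75, .66], candidate crop sizes within 3
# pixels of the target snap to it, and fix_crop samples the offset from a fixed
# grid of up to 13 positions instead of a uniformly random position.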
def group_random_crop(img_group, target_size):
......
......@@ -34,7 +34,7 @@ class NonlocalReader(DataReader):
image_mean
image_std
batch_size
filelist
crop_size
sample_rate
video_length
......@@ -68,7 +68,7 @@ class NonlocalReader(DataReader):
dataset_args['min_size'] = cfg[mode.upper()]['jitter_scales'][0]
dataset_args['max_size'] = cfg[mode.upper()]['jitter_scales'][1]
dataset_args['num_reader_threads'] = num_reader_threads
filelist = cfg[mode.upper()]['filelist']
batch_size = cfg[mode.upper()]['batch_size']
if self.mode == 'train':
......@@ -79,7 +79,7 @@ class NonlocalReader(DataReader):
sample_times = 1
return reader_func(filelist, batch_size, sample_times, False, False,
**dataset_args)
elif self.mode == 'test' or self.mode == 'infer':
sample_times = cfg['TEST']['num_test_clips']
if cfg['TEST']['use_multi_crop'] == 1:
sample_times = int(sample_times / 3)
......@@ -146,8 +146,8 @@ def apply_resize(rgbdata, min_size, max_size):
ratio = float(side_length) / float(width)
else:
ratio = float(side_length) / float(height)
out_height = int(round(height * ratio))
out_width = int(round(width * ratio))
outdata = np.zeros(
(length, out_height, out_width, channel), dtype=rgbdata.dtype)
for i in range(length):
......@@ -197,14 +197,13 @@ def crop_mirror_transform(rgbdata,
def make_reader(filelist, batch_size, sample_times, is_training, shuffle,
**dataset_args):
    # should add sample_times param
def reader():
fl = open(filelist).readlines()
fl = [line.strip() for line in fl if line.strip() != '']
if shuffle:
random.shuffle(fl)
batch_out = []
for line in fl:
# start_time = time.time()
......@@ -253,23 +252,6 @@ def make_reader(filelist, batch_size, sample_times, is_training, shuffle,
def make_multi_reader(filelist, batch_size, sample_times, is_training, shuffle,
**dataset_args):
def read_into_queue(flq, queue):
batch_out = []
for line in flq:
......@@ -315,6 +297,24 @@ def make_multi_reader(filelist, batch_size, sample_times, is_training, shuffle,
queue.put(None)
def queue_reader():
# split file list and shuffle
fl = open(filelist).readlines()
fl = [line.strip() for line in fl if line.strip() != '']
if shuffle:
random.shuffle(fl)
n = dataset_args['num_reader_threads']
queue_size = 20
reader_lists = [None] * n
file_num = int(len(fl) // n)
for i in range(n):
if i < len(reader_lists) - 1:
tmp_list = fl[i * file_num:(i + 1) * file_num]
else:
tmp_list = fl[i * file_num:]
reader_lists[i] = tmp_list
queue = multiprocessing.Queue(queue_size)
p_list = [None] * len(reader_lists)
# for reader_list in reader_lists:
......
......@@ -38,13 +38,20 @@ class DataReader(object):
"""data reader for video input"""
def __init__(self, model_name, mode, cfg):
"""Not implemented"""
pass
self.name = model_name
self.mode = mode
self.cfg = cfg
def create_reader(self):
"""Not implemented"""
pass
def get_config_from_sec(self, sec, item, default=None):
if sec.upper() not in self.cfg:
return default
return self.cfg[sec.upper()].get(item, default)
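    # Example: KineticsReader resolves seg_num with a model-level default that
    # a mode section may override:
    #
    #     self.seg_num = self.get_config_from_sec('model', 'seg_num')
    #     self.seg_num = self.get_config_from_sec(mode, 'seg_num', self.seg_num)
    #
    # so a seg_num set in [TEST] (e.g. 25) takes precedence at test time, and
    # the [MODEL] value is used otherwise.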
class ReaderZoo(object):
def __init__(self):
......
......@@ -2,6 +2,7 @@
- [YouTube-8M](#youtube-8m)
- [Kinetics](#Kinetics数据集)
- [Non-local](#Non-local)
## YouTube-8M
We use the YouTube-8M dataset as updated in 2018. The official dataset is used, with the TFRecord files converted to pickle files for use with PaddlePaddle. YouTube-8M officially provides frame-level and video-level features; only the frame-level features are needed here.
......@@ -117,3 +118,6 @@ ActivityNet官方提供了Kinetics的下载工具,具体参考其[官方repo ]
This generates the corresponding file lists; each line of train.list and val.list is the absolute path of one pkl file.
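Each pkl file stores a `(vid, label, frames)` tuple, which is what the Kinetics reader's `decode_pickle` expects; a minimal sketch for inspecting one record (the path is illustrative):

```python
import pickle

with open('/path/to/sample.pkl', 'rb') as f:
    vid, label, frames = pickle.load(f, encoding='bytes')  # drop encoding= on Python 2
print(vid, label, len(frames))  # video id, integer class label, number of encoded frames
```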
## Non-local
The Non-local model also uses the Kinetics dataset, but its data processing differs from the other models; see the [Non-local data notes](./nonlocal/README.md) for details.
# Non-local Model Data Notes
The Non-local model takes mp4 files as input; in the data reader, OpenCV is used to read, decode, and sample the mp4 videos. For train and valid data, the starting frame is chosen at random and each frame is randomly augmented: the short side is rescaled to a random value in [256, 320], the long side is computed from the aspect ratio, and a 224x224 region is cropped. At test time, 10 different temporal positions are chosen as starting frames and 3 different spatial positions as crop origins, so each video is sampled 10x3 times; the predicted probabilities of these 30 samples are summed, and the class with the highest probability is the final prediction.
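This aggregation rule can be sketched as follows (shapes are illustrative):

```python
import numpy as np

num_clips, num_classes = 30, 400                      # 10 temporal x 3 spatial samples
clip_probs = np.random.rand(num_clips, num_classes)   # per-clip predicted probabilities
video_probs = clip_probs.sum(axis=0)                  # sum over the 30 samples
pred = int(video_probs.argmax())                      # class with the largest summed score
```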
## Data Download
Download the Kinetics-400 data as described in the Kinetics section of the [data notes](../README.md). Assume the downloaded mp4 files are stored under DATADIR, with the train and validation data in $DATADIR/train and $DATADIR/valid respectively. When downloading, rescale the height of every video to 256 and compute the width from the aspect ratio.
## Download the Official Data Lists
Download the official dataset tables [kinetics-400\_train.csv](https://github.com/activitynet/ActivityNet/tree/master/Crawler/Kinetics/data/kinetics-400_train.csv) and [kinetics-400\_val.csv](https://github.com/activitynet/ActivityNet/tree/master/Crawler/Kinetics/data/kinetics-400_val.csv) to this directory.
## Generate File Lists
Open generate\_list.sh, set TRAIN\_DIR and VALID\_DIR to the directories where your mp4 files are stored, and run the script
bash generate_list.sh
to generate trainlist.txt, vallist.txt, and testlist.txt.
# Copyright (c) 2017-present, Facebook, Inc.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#
import sys
import numpy as np
import random
# src = 'trainlist_download.txt'
# outlist = 'trainlist.txt'
# original_folder = '/nfs.yoda/xiaolonw/kinetics/data/train'
# replace_folder = '/scratch/xiaolonw/kinetics/data/compress/train_256'
assert (len(sys.argv) == 5)
src = sys.argv[1]
outlist = sys.argv[2]
original_folder = sys.argv[3]
replace_folder = sys.argv[4]
f = open(src, 'r')
flist = []
for line in f:
flist.append(line)
f.close()
f2 = open(outlist, 'w')
listlen = len(flist)
for i in range(listlen):
line = flist[i]
line = line.replace(original_folder, replace_folder)
f2.write(line)
f2.close()
import os
import numpy as np
import sys
num_classes = 400
replace_space_by_underliner = True # whether to replace space by '_' in labels
fn = sys.argv[1] #'trainlist_download400.txt'
train_dir = sys.argv[
2] #'/docker_mount/data/k400/Kinetics_trimmed_processed_train'
val_dir = sys.argv[3] #'/docker_mount/data/k400/Kinetics_trimmed_processed_val'
trainlist = sys.argv[4] #'trainlist.txt'
vallist = sys.argv[5] #'vallist.txt'
fl = open(fn).readlines()
fl = [line.strip() for line in fl if line.strip() != '']
action_list = []
for line in fl[1:]:
act = line.split(',')[0].strip('\"')
action_list.append(act)
action_set = set(action_list)
action_list = list(action_set)
action_list.sort()
if replace_space_by_underliner:
action_list = [item.replace(' ', '_') for item in action_list]
# assign integer label to each category, abseiling is labeled as 0,
# zumba labeled as 399 and so on, sorted by the category name
action_label_dict = {}
for i in range(len(action_list)):
key = action_list[i]
action_label_dict[key] = i
assert len(action_label_dict.keys(
)) == num_classes, "action num should be {}".format(num_classes)
def generate_file(Faction_label_dict, Ftrain_dir, Ftrainlist, Fnum_classes):
trainactions = os.listdir(Ftrain_dir)
trainactions.sort()
assert len(
trainactions) == Fnum_classes, "train action num should be {}".format(
Fnum_classes)
train_items = []
trainlist_outfile = open(Ftrainlist, 'w')
for trainaction in trainactions:
assert trainaction in Faction_label_dict.keys(
), "action {} should be in action_dict".format(trainaction)
trainaction_dir = os.path.join(Ftrain_dir, trainaction)
trainaction_label = Faction_label_dict[trainaction]
trainaction_files = os.listdir(trainaction_dir)
for f in trainaction_files:
fn = os.path.join(trainaction_dir, f)
item = fn + ' ' + str(trainaction_label)
train_items.append(item)
trainlist_outfile.write(item + '\n')
trainlist_outfile.flush()
trainlist_outfile.close()
generate_file(action_label_dict, train_dir, trainlist, num_classes)
generate_file(action_label_dict, val_dir, vallist, num_classes)
# name of the downloaded train list file
TRAINLIST_DOWNLOAD="kinetics-400_train.csv"
# path of the train and valid data
TRAIN_DIR=YOUR_TRAIN_DATA_DIR # replace this with your train data dir
VALID_DIR=YOUR_VALID_DATA_DIR # replace this with your valid data dir
python generate_filelist.py $TRAINLIST_DOWNLOAD $TRAIN_DIR $VALID_DIR trainlist.txt vallist.txt
# generate test list
python generate_testlist_multicrop.py
import os
vallist = 'vallist.txt'
testlist = 'testlist.txt'
sampling_times = 10
cropping_times = 3
fl = open(vallist).readlines()
fl = [line.strip() for line in fl if line.strip() != '']
f_test = open(testlist, 'w')
for i in range(len(fl)):
line = fl[i].split(' ')
fn = line[0]
label = line[1]
for j in range(sampling_times):
for k in range(cropping_times):
test_item = fn + ' ' + str(i) + ' ' + str(j) + ' ' + str(k) + '\n'
f_test.write(test_item)
f_test.close()
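# For a vallist.txt entry "path/to/video.mp4 3" at index i = 7, the loop above
# writes sampling_times * cropping_times = 30 lines of the form
# "<path> <video index> <temporal clip id> <spatial crop id>":
#
#     path/to/video.mp4 7 0 0
#     path/to/video.mp4 7 0 1
#     ...
#     path/to/video.mp4 7 9 2
#
# The ground-truth label is not written here; evaluation recovers it through
# the filename_gt (vallist.txt) setting in the [TEST] section.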
......@@ -37,7 +37,7 @@ logger = logging.getLogger(__name__)
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
'--model_name',
type=str,
default='AttentionCluster',
help='name of model to train.')
......@@ -47,14 +47,14 @@ def parse_args():
default='configs/attention_cluster.txt',
help='path to config file of model')
parser.add_argument(
'--use_gpu', type=bool, default=True, help='default use gpu.')
parser.add_argument(
'--weights',
type=str,
default=None,
help='weight path, None to use weights from Paddle.')
parser.add_argument(
'--batch_size',
type=int,
default=1,
help='sample number in a batch for inference.')
......@@ -64,17 +64,17 @@ def parse_args():
default=None,
        help='path to inference data file list.')
parser.add_argument(
'--log_interval',
type=int,
default=1,
help='mini-batch interval to log.')
parser.add_argument(
'--infer_topk',
type=int,
default=20,
help='topk predictions to restore.')
parser.add_argument(
'--save_dir', type=str, default='./', help='directory to store results')
args = parser.parse_args()
return args
......@@ -83,8 +83,8 @@ def infer(args):
# parse config
config = parse_config(args.config)
infer_config = merge_configs(config, 'infer', vars(args))
print_configs(infer_config, "Infer")
infer_model = models.get_model(args.model_name, infer_config, mode='infer')
infer_model.build_input(use_pyreader=False)
infer_model.build_model()
infer_feeds = infer_model.feeds()
......@@ -105,10 +105,8 @@ def infer(args):
# if no weight files specified, download weights from paddle
weights = args.weights or infer_model.get_weights()
infer_model.load_test_weights(exe, weights,
fluid.default_main_program(), place)
infer_feeder = fluid.DataFeeder(place=place, feed_list=infer_feeds)
fetch_list = [x.name for x in infer_outputs]
......@@ -126,8 +124,7 @@ def infer(args):
topk_inds = predictions[i].argsort()[0 - args.infer_topk:]
topk_inds = topk_inds[::-1]
preds = predictions[i][topk_inds]
results.append((video_id[i], preds.tolist(), topk_inds.tolist()))
prev_time = cur_time
cur_time = time.time()
period = cur_time - prev_time
......@@ -145,6 +142,7 @@ def infer(args):
"{}_infer_result".format(args.model_name))
pickle.dump(results, open(result_file_name, 'wb'))
if __name__ == "__main__":
args = parse_args()
logger.info(args)
......
......@@ -187,10 +187,11 @@ def get_metrics(name, mode, cfg):
return metrics_zoo.get(name, mode, cfg)
regist_metrics("NEXTVLAD", Youtube8mMetrics)
regist_metrics("ATTENTIONLSTM", Youtube8mMetrics)
# sort by alphabet
regist_metrics("ATTENTIONCLUSTER", Youtube8mMetrics)
regist_metrics("TSN", Kinetics400Metrics)
regist_metrics("ATTENTIONLSTM", Youtube8mMetrics)
regist_metrics("NEXTVLAD", Youtube8mMetrics)
regist_metrics("NONLOCAL", MulticropMetrics)
regist_metrics("TSM", Kinetics400Metrics)
regist_metrics("TSN", Kinetics400Metrics)
regist_metrics("STNET", Kinetics400Metrics)
regist_metrics("NONLOCAL", MulticropMetrics)
......@@ -63,6 +63,7 @@ class MetricsCalculator():
def accumulate(self, loss, pred, labels):
labels = labels.astype(int)
labels = labels[:, 0]
for i in range(pred.shape[0]):
probs = pred[i, :].tolist()
vid = labels[i]
......@@ -81,6 +82,8 @@ class MetricsCalculator():
evaluate_results(self.results, self.filename_gt, self.dataset_size, \
self.num_classes, self.num_test_clips)
# save temporary file
if not os.path.isdir(self.checkpoint_dir):
os.makedirs(self.checkpoint_dir)
pkl_path = os.path.join(self.checkpoint_dir, "results_probs.pkl")
with open(pkl_path, 'w') as f:
......@@ -188,26 +191,4 @@ def evaluate_results(results, filename_gt, test_dataset_size, num_classes,
logger.info('top-5 accuracy: {:.2f} percent'.format(accuracy_top5 * 100))
logger.info('-' * 80)
return
from .model import regist_model, get_model
from .attention_cluster import AttentionCluster
from .attention_lstm import AttentionLSTM
from .nextvlad import NEXTVLAD
from .nonlocal_model import NonLocal
from .tsm import TSM
from .tsn import TSN
from .stnet import STNET
# regist models, sort by alphabet
regist_model("AttentionCluster", AttentionCluster)
regist_model("AttentionLSTM", AttentionLSTM)
regist_model("NEXTVLAD", NEXTVLAD)
regist_model('NONLOCAL', NonLocal)
regist_model("TSM", TSM)
regist_model("TSN", TSN)
regist_model("STNET", STNET)
regist_model("AttentionLSTM", AttentionLSTM)
......@@ -65,17 +65,9 @@ class ModelBase(object):
self.name = name
self.is_training = (mode == 'train')
self.mode = mode
self.cfg = cfg
self.py_reader = None
def build_model(self):
"build model struct"
......@@ -137,8 +129,8 @@ class ModelBase(object):
if os.path.exists(path):
return path
logger.info("Download pretrain weights of {} from {}".format(
self.name, url))
logger.info("Download pretrain weights of {} from {}".format(self.name,
url))
download(url, path)
return path
......@@ -146,6 +138,12 @@ class ModelBase(object):
logger.info("Load pretrain weights from {}".format(pretrain))
fluid.io.load_params(exe, pretrain, main_program=prog)
def load_test_weights(self, exe, weights, prog, place):
def if_exist(var):
return os.path.exists(os.path.join(weights, var.name))
fluid.io.load_vars(exe, weights, predicate=if_exist)
def get_config_from_sec(self, sec, item, default=None):
if sec.upper() not in self.cfg:
return default
......@@ -163,7 +161,7 @@ class ModelZoo(object):
def get(self, name, cfg, mode='train'):
for k, v in self.model_zoo.items():
if k.upper() == name.upper():
return v(name, cfg, mode)
raise ModelNotFoundError(name, self.model_zoo.keys())
......@@ -178,4 +176,3 @@ def regist_model(name, model):
def get_model(name, cfg, mode='train'):
return model_zoo.get(name, cfg, mode)
# Non-local Neural Networks Video Classification Model
---
## Table of Contents
- [Model Overview](#model-overview)
- [Data Preparation](#data-preparation)
- [Training](#training)
- [Evaluation](#evaluation)
- [Inference](#inference)
- [Reference Papers](#reference-papers)
## Model Overview
Non-local Neural Networks is a model proposed by Xiaolong Wang et al. in 2017. Its key feature is the non-local operation, which models the relationship between pixels that are far apart. Capturing long-range dependencies between data points has long been an important problem: for sequential data such as speech and video, the mainstream approach is the recurrent neural network (RNN), while for images the convolutional neural network (CNN) is typically used to capture dependencies between pixels. However, both CNNs and RNNs extract features only within a small spatial or temporal neighborhood, which makes it hard to capture dependencies between more distant positions. Borrowing the non-local means idea from classical computer vision and extending it to neural networks, the non-local operation defines an affinity function between each output position and all input positions, producing an operation with global connectivity: every position of the output feature map is influenced by every position of the input feature map. In a CNN, after one convolution an output pixel only sees information inside its receptive field, and multiple convolutions must be stacked to gather more context; in the non-local operation, the receptive field of every output point is effectively the whole input feature map, so it captures more global information than CNNs and RNNs.
For details, see the paper [Non-local Neural Networks](https://arxiv.org/abs/1711.07971v1).
### The Non-local Operation
The non-local operation is defined as
<p align="center">
<a href="https://www.codecogs.com/eqnedit.php?latex=y_{i}=\frac{1}{C(x)}&space;\sum_{j}f(x_i,&space;x_j)g(x_j)" target="_blank"><img src="https://latex.codecogs.com/gif.latex?y_{i}=\frac{1}{C(x)}&space;\sum_{j}f(x_i,&space;x_j)g(x_j)" title="y_{i}=\frac{1}{C(x)} \sum_{j}f(x_i, x_j)g(x_j)" /></a>
</p>
In the formula above, x denotes the input feature map and y the output feature map; i indexes output positions and j input positions. f(xi, xj) describes the affinity between output position i and all input positions j, and C is a normalization factor chosen to match f(xi, xj). g(xj) is a transform of the input feature map, usually a simple linear mapping. f(xi, xj) can take several forms, commonly the following:
#### Gaussian
<p align="center">
<a href="https://www.codecogs.com/eqnedit.php?latex=f(x_i,&space;x_j)&space;=&space;e^{x_i^Tx_j},&space;\qquad&space;C(x)&space;=&space;\sum_{j}f(x_i,&space;x_j)" target="_blank"><img src="https://latex.codecogs.com/gif.latex?f(x_i,&space;x_j)&space;=&space;e^{x_i^Tx_j},&space;\qquad&space;C(x)&space;=&space;\sum_{j}f(x_i,&space;x_j)" title="f(x_i, x_j) = e^{x_i^Tx_j}, \qquad C(x) = \sum_{j}f(x_i, x_j)" /></a>
</p>
#### Embedded Gaussian
<p align="center">
<a href="https://www.codecogs.com/eqnedit.php?latex=f(x_i,&space;x_j)&space;=&space;e^{{\theta(x_i)}^T\phi(x_j)},&space;\qquad&space;C(x)&space;=&space;\sum_{j}f(x_i,&space;x_j)" target="_blank"><img src="https://latex.codecogs.com/gif.latex?f(x_i,&space;x_j)&space;=&space;e^{{\theta(x_i)}^T\phi(x_j)},&space;\qquad&space;C(x)&space;=&space;\sum_{j}f(x_i,&space;x_j)" title="f(x_i, x_j) = e^{{\theta(x_i)}^T\phi(x_j)}, \qquad C(x) = \sum_{j}f(x_i, x_j)" /></a>
</p>
#### Dot product
<p align="center">
<a href="https://www.codecogs.com/eqnedit.php?latex=f(x_i,&space;x_j)&space;=&space;\theta(x_i)^T\phi(x_j),&space;\qquad&space;C(x)&space;=\mathit{N}" target="_blank"><img src="https://latex.codecogs.com/gif.latex?f(x_i,&space;x_j)&space;=&space;\theta(x_i)^T\phi(x_j),&space;\qquad&space;C(x)&space;=\mathit{N}" title="f(x_i, x_j) = \theta(x_i)^T\phi(x_j), \qquad C(x) =\mathit{N}" /></a>
</p>
#### Concatenation
<p align="center">
<a href="https://www.codecogs.com/eqnedit.php?latex=f(x_i,&space;x_j)&space;=&space;ReLU(w_f^T[\theta(x_i),\phi(x_j)]),&space;\qquad&space;C(x)&space;=\mathit{N}" target="_blank"><img src="https://latex.codecogs.com/gif.latex?f(x_i,&space;x_j)&space;=&space;ReLU(w_f^T[\theta(x_i),\phi(x_j)]),&space;\qquad&space;C(x)&space;=\mathit{N}" title="f(x_i, x_j) = ReLU(w_f^T[\theta(x_i),\phi(x_j)]), \qquad C(x) =\mathit{N}" /></a>
</p>
where
<p align="center">
<a href="https://www.codecogs.com/eqnedit.php?latex=\theta(x_i)=W_{\theta}x_i,&space;\qquad&space;\phi(x_j)=W_{\phi}x_j" target="_blank"><img src="https://latex.codecogs.com/gif.latex?\theta(x_i)=W_{\theta}x_i,&space;\qquad&space;\phi(x_j)=W_{\phi}x_j" title="\theta(x_i)=W_{\theta}x_i, \qquad \phi(x_j)=W_{\phi}x_j" /></a>
</p>
The parameters in these functions can be randomly initialized and then learned end-to-end during training.
### Non-local block
A Non-local block is defined using a ResNet-like structure:
<p align="center">
<a href="https://www.codecogs.com/eqnedit.php?latex=Z_i&space;=&space;W_zy_i&plus;x_i" target="_blank"><img src="https://latex.codecogs.com/gif.latex?Z_i&space;=&space;W_zy_i&plus;x_i" title="Z_i = W_zy_i+x_i" /></a>
</p>
The term introduced by the non-local operation plays the same role as the residual term in ResNet. With Non-local blocks, the non-local operation can be conveniently inserted anywhere in a network, while the rest of the network can still be initialized from the original pretrained model. If Wz is initialized to zero, the network is equivalent to the initial one without the Non-local block.
### Implementation
The figure below shows the concrete computation of a Non-local block that uses the embedded Gaussian affinity function.
<p align="center">
<img src="../../images/nonlocal_instantiation.png" height=488 width=585 hspace='10'/> <br />
A Non-local block using the embedded Gaussian affinity function
</p>
g(xj) applies a linear transform to the input feature map, implemented as a 1x1x1 convolution; theta and phi are also linear transforms, likewise implemented with 1x1x1 convolutions. As the figure shows, the non-local operation only requires common operators such as convolution, matrix multiplication, addition, and softmax; no new operator needs to be added, so the block is easy to assemble into a model.
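A minimal NumPy sketch of the embedded Gaussian non-local operation on a flattened feature map may make this concrete (all shapes and weight matrices below are illustrative, not taken from the model code):

```python
import numpy as np

N, C, C_inner = 784, 1024, 512                 # spacetime positions, channels, bottleneck width
x = np.random.randn(N, C).astype('float32')    # flattened input feature map
W_theta = 0.01 * np.random.randn(C, C_inner).astype('float32')
W_phi = 0.01 * np.random.randn(C, C_inner).astype('float32')
W_g = 0.01 * np.random.randn(C, C_inner).astype('float32')

theta, phi, g = x.dot(W_theta), x.dot(W_phi), x.dot(W_g)  # the three 1x1x1 "convolutions"
f = theta.dot(phi.T)                           # pairwise affinities, N x N
p = np.exp(f - f.max(axis=1, keepdims=True))   # softmax over j implements
p /= p.sum(axis=1, keepdims=True)              # the normalization C(x)
y = p.dot(g)                                   # y_i = sum_j p_ij * g(x_j)
print(y.shape)                                 # (784, 512); W_z then maps back to C channels
```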
### Results
The authors report that the Non-local model performs well on video classification: adding Non-local blocks to a ResNet-50 backbone yields better accuracy than ResNet-101, with Top-1 accuracy about 1 to 2 percentage points higher. Clear gains are also reported on image classification and object detection.
## Data Preparation
The Non-local model is trained on the Kinetics-400 action recognition dataset released by DeepMind. For data download and preparation, see the Non-local [data notes](../../dataset/nonlocal/README.md).
## Training
Once the data is prepared, training can be launched in either of the following two ways:
python train.py --model_name=NONLOCAL
        --config=./configs/nonlocal.txt
        --save_dir=checkpoints
        --log_interval=10
        --valid_interval=1
bash scripts/train/train_nonlocal.sh
- You can download the released [model](https://paddlemodels.bj.bcebos.com/video_classification/nonlocal_kinetics.tar.gz) and pass its path via `--resume` for finetuning and further development.
**Data reader note:** the model reads `mp4` data from the Kinetics-400 dataset. The starting frame is chosen at random according to the video length and the sampling rate, and `video_length` frames are extracted from each video. Each frame is randomly augmented: the short side is rescaled to a random value in [256, 320], the long side is computed from the aspect ratio, and a 224x224 region is then cropped as the network input.
**Training strategy:**
* Momentum optimizer with momentum=0.9
* L2 regularization; the weight decay coefficient is 1e-4 for conv and fc layers and 0 for bn layers
* Initial learning rate base\_learning\_rate=0.01, decayed by a factor of 0.1 at 150,000 and at 300,000 iterations
## Evaluation
Data preprocessing at test time differs from training: the crop size is 256x256 rather than the 224x224 used in training, so the fully connected layer that produces predictions during training is replaced by a 1x1x1 convolution. When extracting frames, 10 different temporal starting points are chosen per video and 3 different spatial starting points are used for cropping, giving 10x3 samples per video; the predictions of these 30 samples are summed and the class with the highest probability is taken as the final result.
Evaluation can be run in either of the following two ways:
python test.py --model_name=NONLOCAL
        --config=configs/nonlocal.txt
        --log_interval=1
        --weights=$PATH_TO_WEIGHTS
bash scripts/test/test_nonlocal.sh
- When evaluating with `scripts/test/test_nonlocal.sh`, modify the `--weights` argument in the script to point to the weights to be evaluated.
- If `--weights` is not specified, the script downloads the released [model](https://paddlemodels.bj.bcebos.com/video_classification/nonlocal_kinetics.tar.gz) and evaluates it.
With the following settings:
| Parameter | Value |
| :---------: | :----: |
| Backbone | ResNet-50 |
| Convolution type | c2d |
| Sampling rate | 8 |
| Video length | 8 |
the evaluation accuracy on the Kinetics-400 validation set is:
| Metric | Accuracy |
| :---------: | :----: |
| TOP\_1 | 0.739 |
### Notes
Because some videos have been removed from YouTube, only 234,619 of the 246,535 videos in the original Kinetics-400 dataset could be downloaded, which may cause a slight drop in accuracy.
## Inference
Run inference with:
python infer.py --model_name=NONLOCAL
--config=configs/nonlocal.txt
--log_interval=1
--weights=$PATH_TO_WEIGHTS
--filelist=$FILELIST
- Inference results are stored in `NONLOCAL_infer_result` in `pickle` format.
- If `--weights` is not specified, the script downloads the released [model](https://paddlemodels.bj.bcebos.com/video_classification/nonlocal_kinetics.tar.gz) and runs inference with it.
## Reference Papers
- [Non-local Neural Networks](https://arxiv.org/abs/1711.07971v1), Xiaolong Wang, Ross Girshick, Abhinav Gupta, Kaiming He
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
#
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import paddle
import paddle.fluid as fluid
from paddle.fluid import ParamAttr
# 3d spacetime nonlocal (v1, spatial downsample)
def spacetime_nonlocal(blob_in, dim_in, dim_out, batch_size, prefix, dim_inner, cfg, \
test_mode = False, max_pool_stride = 2):
#------------
cur = blob_in
# we do projection to convert each spacetime location to a feature
# theta original size
# e.g., (8, 1024, 4, 14, 14) => (8, 1024, 4, 14, 14)
theta = fluid.layers.conv3d(
input=cur,
num_filters=dim_inner,
filter_size=[1, 1, 1],
stride=[1, 1, 1],
padding=[0, 0, 0],
param_attr=ParamAttr(
name=prefix + '_theta' + "_w",
initializer=fluid.initializer.Normal(
loc=0.0, scale=cfg.NONLOCAL.conv_init_std)),
bias_attr=ParamAttr(
name=prefix + '_theta' + "_b",
initializer=fluid.initializer.Constant(value=0.))
if (cfg.NONLOCAL.no_bias == 0) else False,
name=prefix + '_theta')
theta_shape = theta.shape
# phi and g: half spatial size
# e.g., (8, 1024, 4, 14, 14) => (8, 1024, 4, 7, 7)
if cfg.NONLOCAL.use_maxpool:
max_pool = fluid.layers.pool3d(
input=cur,
pool_size=[1, max_pool_stride, max_pool_stride],
pool_type='max',
pool_stride=[1, max_pool_stride, max_pool_stride],
pool_padding=[0, 0, 0],
name=prefix + '_pool')
else:
max_pool = cur
phi = fluid.layers.conv3d(
input=max_pool,
num_filters=dim_inner,
filter_size=[1, 1, 1],
stride=[1, 1, 1],
padding=[0, 0, 0],
param_attr=ParamAttr(
name=prefix + '_phi' + "_w",
initializer=fluid.initializer.Normal(
loc=0.0, scale=cfg.NONLOCAL.conv_init_std)),
bias_attr=ParamAttr(
name=prefix + '_phi' + "_b",
initializer=fluid.initializer.Constant(value=0.))
if (cfg.NONLOCAL.no_bias == 0) else False,
name=prefix + '_phi')
phi_shape = phi.shape
g = fluid.layers.conv3d(
input=max_pool,
num_filters=dim_inner,
filter_size=[1, 1, 1],
stride=[1, 1, 1],
padding=[0, 0, 0],
param_attr=ParamAttr(
name=prefix + '_g' + "_w",
initializer=fluid.initializer.Normal(
loc=0.0, scale=cfg.NONLOCAL.conv_init_std)),
bias_attr=ParamAttr(
name=prefix + '_g' + "_b",
initializer=fluid.initializer.Constant(value=0.))
if (cfg.NONLOCAL.no_bias == 0) else False,
name=prefix + '_g')
g_shape = g.shape
# we have to use explicit batch size (to support arbitrary spacetime size)
# e.g. (8, 1024, 4, 14, 14) => (8, 1024, 784)
theta = fluid.layers.reshape(
theta, [-1, 0, theta_shape[2] * theta_shape[3] * theta_shape[4]])
theta = fluid.layers.transpose(theta, [0, 2, 1])
phi = fluid.layers.reshape(
phi, [-1, 0, phi_shape[2] * phi_shape[3] * phi_shape[4]])
theta_phi = fluid.layers.matmul(theta, phi, name=prefix + '_affinity')
g = fluid.layers.reshape(g, [-1, 0, g_shape[2] * g_shape[3] * g_shape[4]])
if cfg.NONLOCAL.use_softmax:
if cfg.NONLOCAL.use_scale is True:
theta_phi_sc = fluid.layers.scale(theta_phi, scale=dim_inner**-.5)
else:
theta_phi_sc = theta_phi
p = fluid.layers.softmax(
theta_phi_sc, name=prefix + '_affinity' + '_prob')
    else:
        # only the softmax form is implemented; the reference implementation is unclear here
        raise NotImplementedError("Not implemented when not using softmax")
# note g's axis[2] corresponds to p's axis[2]
# e.g. g(8, 1024, 784_2) * p(8, 784_1, 784_2) => (8, 1024, 784_1)
p = fluid.layers.transpose(p, [0, 2, 1])
t = fluid.layers.matmul(g, p, name=prefix + '_y')
# reshape back
# e.g. (8, 1024, 784) => (8, 1024, 4, 14, 14)
t_shape = t.shape
# print(t_shape)
# print(theta_shape)
t_re = fluid.layers.reshape(t, shape=list(theta_shape))
blob_out = t_re
blob_out = fluid.layers.conv3d(
input=blob_out,
num_filters=dim_out,
filter_size=[1, 1, 1],
stride=[1, 1, 1],
padding=[0, 0, 0],
param_attr=ParamAttr(
name=prefix + '_out' + "_w",
initializer=fluid.initializer.Constant(value=0.)
if cfg.NONLOCAL.use_zero_init_conv else fluid.initializer.Normal(
loc=0.0, scale=cfg.NONLOCAL.conv_init_std)),
bias_attr=ParamAttr(
name=prefix + '_out' + "_b",
initializer=fluid.initializer.Constant(value=0.))
if (cfg.NONLOCAL.no_bias == 0) else False,
name=prefix + '_out')
blob_out_shape = blob_out.shape
if cfg.NONLOCAL.use_bn is True:
bn_name = prefix + "_bn"
blob_out = fluid.layers.batch_norm(
blob_out,
is_test=test_mode,
momentum=cfg.NONLOCAL.bn_momentum,
epsilon=cfg.NONLOCAL.bn_epsilon,
name=bn_name,
param_attr=ParamAttr(
name=bn_name + "_scale",
initializer=fluid.initializer.Constant(
value=cfg.NONLOCAL.bn_init_gamma),
regularizer=fluid.regularizer.L2Decay(
cfg.TRAIN.weight_decay_bn)),
bias_attr=ParamAttr(
name=bn_name + "_offset",
regularizer=fluid.regularizer.L2Decay(
cfg.TRAIN.weight_decay_bn)),
moving_mean_name=bn_name + "_mean",
moving_variance_name=bn_name + "_variance") # add bn
if cfg.NONLOCAL.use_affine is True:
affine_scale = fluid.layers.create_parameter(
shape=[blob_out_shape[1]],
dtype=blob_out.dtype,
attr=ParamAttr(name=prefix + '_affine' + '_s'),
default_initializer=fluid.initializer.Constant(value=1.))
affine_bias = fluid.layers.create_parameter(
shape=[blob_out_shape[1]],
dtype=blob_out.dtype,
attr=ParamAttr(name=prefix + '_affine' + '_b'),
default_initializer=fluid.initializer.Constant(value=0.))
blob_out = fluid.layers.affine_channel(
blob_out,
scale=affine_scale,
bias=affine_bias,
name=prefix + '_affine') # add affine
return blob_out
def add_nonlocal(blob_in,
dim_in,
dim_out,
batch_size,
prefix,
dim_inner,
cfg,
test_mode=False):
blob_out = spacetime_nonlocal(blob_in, \
dim_in, dim_out, batch_size, prefix, dim_inner, cfg, test_mode = test_mode)
blob_out = fluid.layers.elementwise_add(
blob_out, blob_in, name=prefix + '_sum')
return blob_out
# this is to reduce memory usage if the feature maps are big:
# divide the feature maps into groups along the temporal dimension,
# and perform non-local operations inside each group
def add_nonlocal_group(blob_in,
dim_in,
dim_out,
batch_size,
pool_stride,
height,
width,
group_size,
prefix,
dim_inner,
cfg,
test_mode=False):
group_num = int(pool_stride / group_size)
assert (pool_stride % group_size == 0), \
'nonlocal block {}: pool_stride({}) should be divided by group size({})'.format(prefix, pool_stride, group_size)
if group_num > 1:
blob_in = fluid.layers.transpose(
blob_in, [0, 2, 1, 3, 4], name=prefix + '_pre_trans1')
blob_in = fluid.layers.reshape(
blob_in,
[batch_size * group_num, group_size, dim_in, height, width],
name=prefix + '_pre_reshape1')
blob_in = fluid.layers.transpose(
blob_in, [0, 2, 1, 3, 4], name=prefix + '_pre_trans2')
blob_out = spacetime_nonlocal(
blob_in,
dim_in,
dim_out,
batch_size,
prefix,
dim_inner,
cfg,
test_mode=test_mode)
blob_out = fluid.layers.elementwise_add(
blob_out, blob_in, name=prefix + '_sum')
if group_num > 1:
blob_out = fluid.layers.transpose(
blob_out, [0, 2, 1, 3, 4], name=prefix + '_post_trans1')
blob_out = fluid.layers.reshape(
blob_out,
[batch_size, group_num * group_size, dim_out, height, width],
name=prefix + '_post_reshape1')
blob_out = fluid.layers.transpose(
blob_out, [0, 2, 1, 3, 4], name=prefix + '_post_trans2')
return blob_out
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
#
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
import os
import numpy as np
try:
    import cPickle
except ImportError:  # Python 3
    import pickle as cPickle
import paddle.fluid as fluid
from ..model import ModelBase
from . import resnet_video
from .nonlocal_utils import load_params_from_file
import logging
logger = logging.getLogger(__name__)
__all__ = ["NonLocal"]
# To add new models, import them, add them to this map and models/TARGETS
class NonLocal(ModelBase):
def __init__(self, name, cfg, mode='train'):
super(NonLocal, self).__init__(name, cfg, mode=mode)
self.get_config()
def get_config(self):
# video_length
self.video_length = self.get_config_from_sec(self.mode, 'video_length')
# crop size
self.crop_size = self.get_config_from_sec(self.mode, 'crop_size')
def build_input(self, use_pyreader=True):
input_shape = [3, self.video_length, self.crop_size, self.crop_size]
label_shape = [1]
py_reader = None
if use_pyreader:
            assert self.mode != 'infer', \
                'pyreader is not recommended for inference, please set use_pyreader to False.'
py_reader = fluid.layers.py_reader(
capacity=20,
shapes=[[-1] + input_shape, [-1] + label_shape],
dtypes=['float32', 'int64'],
name='train_py_reader'
if self.is_training else 'test_py_reader',
use_double_buffer=True)
data, label = fluid.layers.read_file(py_reader)
self.py_reader = py_reader
else:
data = fluid.layers.data(
name='train_data' if self.is_training else 'test_data',
shape=input_shape,
dtype='float32')
if self.mode != 'infer':
label = fluid.layers.data(
name='train_label' if self.is_training else 'test_label',
shape=label_shape,
dtype='int64')
else:
label = None
self.feature_input = [data]
self.label_input = label
def create_model_args(self):
return None
def build_model(self):
pred, loss = resnet_video.create_model(
data=self.feature_input[0],
label=self.label_input,
cfg=self.cfg,
is_training=self.is_training,
mode=self.mode)
if loss is not None:
loss = fluid.layers.mean(loss)
self.network_outputs = [pred]
self.loss_ = loss
def optimizer(self):
base_lr = self.get_config_from_sec('TRAIN', 'learning_rate')
lr_decay = self.get_config_from_sec('TRAIN', 'learning_rate_decay')
step_sizes = self.get_config_from_sec('TRAIN', 'step_sizes')
lr_bounds, lr_values = get_learning_rate_decay_list(base_lr, lr_decay,
step_sizes)
learning_rate = fluid.layers.piecewise_decay(
boundaries=lr_bounds, values=lr_values)
momentum = self.get_config_from_sec('TRAIN', 'momentum')
use_nesterov = self.get_config_from_sec('TRAIN', 'nesterov')
l2_weight_decay = self.get_config_from_sec('TRAIN', 'weight_decay')
logger.info(
'Build up optimizer, \ntype: {}, \nmomentum: {}, \nnesterov: {}, \
\nregularization: L2 {}, \nlr_values: {}, lr_bounds: {}'
.format('Momentum', momentum, use_nesterov, l2_weight_decay,
lr_values, lr_bounds))
optimizer = fluid.optimizer.Momentum(
learning_rate=learning_rate,
momentum=momentum,
use_nesterov=use_nesterov,
regularization=fluid.regularizer.L2Decay(l2_weight_decay))
return optimizer
def loss(self):
return self.loss_
def outputs(self):
return self.network_outputs
def feeds(self):
return self.feature_input if self.mode == 'infer' else \
self.feature_input + [self.label_input]
def pretrain_info(self):
return None, None
def weights_info(self):
pass
def load_pretrain_params(self, exe, pretrain, prog, place):
load_params_from_file(exe, prog, pretrain, place)
def load_test_weights(self, exe, weights, prog, place):
super(NonLocal, self).load_test_weights(exe, weights, prog, place)
pred_w = fluid.global_scope().find_var('pred_w').get_tensor()
pred_array = np.array(pred_w)
pred_w_shape = pred_array.shape
if len(pred_w_shape) == 2:
logger.info('reshape for pred_w when test')
pred_array = np.transpose(pred_array, (1, 0))
pred_w_shape = pred_array.shape
pred_array = np.reshape(
pred_array, [pred_w_shape[0], pred_w_shape[1], 1, 1, 1])
pred_w.set(pred_array.astype('float32'), place)
def get_learning_rate_decay_list(base_learning_rate, lr_decay, step_lists):
lr_bounds = []
lr_values = [base_learning_rate * 1]
cur_step = 0
for i in range(len(step_lists)):
cur_step += step_lists[i]
lr_bounds.append(cur_step)
decay_rate = lr_decay**(i + 1)
lr_values.append(base_learning_rate * decay_rate)
return lr_bounds, lr_values
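# Quick check with the values from configs/nonlocal.txt (learning_rate = 0.01,
# learning_rate_decay = 0.1, step_sizes = [150000, 150000, 100000]):
#
#     bounds, values = get_learning_rate_decay_list(0.01, 0.1, [150000, 150000, 100000])
#     bounds  -> [150000, 300000, 400000]
#     values  -> [0.01, 0.001, 0.0001, 1e-05]
#
# i.e. the rate drops by 10x at 150k and at 300k iterations; the last value
# would only apply beyond max_iter = 400000, so it is never reached.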
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
#
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
import os
import numpy as np
import paddle.fluid as fluid
import logging
logger = logging.getLogger(__name__)
def load_params_from_file(exe, prog, pretrained_file, place):
logger.info('load params from {}'.format(pretrained_file))
if os.path.isdir(pretrained_file):
param_list = prog.block(0).all_parameters()
param_name_list = [p.name for p in param_list]
param_shape = {}
for name in param_name_list:
param_tensor = fluid.global_scope().find_var(name).get_tensor()
param_shape[name] = np.array(param_tensor).shape
param_name_from_file = os.listdir(pretrained_file)
common_names = get_common_names(param_name_list, param_name_from_file)
logger.info('-------- loading params -----------')
# load params from file
        def is_parameter(var):
            return isinstance(var, fluid.framework.Parameter) and \
                   os.path.exists(os.path.join(pretrained_file, var.name))
logger.info("Load pretrain weights from file {}".format(
pretrained_file))
vars = filter(is_parameter, prog.list_vars())
fluid.io.load_vars(exe, pretrained_file, vars=vars, main_program=prog)
# reset params if necessary
for name in common_names:
t = fluid.global_scope().find_var(name).get_tensor()
t_array = np.array(t)
origin_shape = param_shape[name]
if t_array.shape == origin_shape:
logger.info("load param {}".format(name))
elif (t_array.shape[:2] == origin_shape[:2]) and (
t_array.shape[-2:] == origin_shape[-2:]):
num_inflate = origin_shape[2]
stack_t_array = np.stack(
[t_array] * num_inflate, axis=2) / float(num_inflate)
assert origin_shape == stack_t_array.shape, "inflated shape should be the same with tensor {}".format(
name)
t.set(stack_t_array.astype('float32'), place)
logger.info("load inflated({}) param {}".format(num_inflate,
name))
else:
logger.info("Invalid case for name: {}".format(name))
raise
logger.info("finished loading params from resnet pretrained model")
else:
        logger.info("pretrained file {} is not a directory, "
                    "not suitable to load params".format(pretrained_file))
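# Small sanity check of the inflation rule above: stacking a 2D kernel
# num_inflate times along a new temporal axis and dividing by num_inflate
# preserves the response to a temporally constant input:
#
#     w2d = np.random.randn(64, 3, 7, 7)          # (out, in, kh, kw)
#     w3d = np.stack([w2d] * 5, axis=2) / 5.      # (64, 3, 5, 7, 7)
#     np.allclose(w3d.sum(axis=2), w2d)           # -> True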
def get_common_names(param_name_list, param_name_from_file):
# name check and return common names both in param_name_list and file
common_names = []
paddle_only_names = []
file_only_names = []
    logger.info('-------- common params -----------')
for name in param_name_list:
if name in param_name_from_file:
common_names.append(name)
logger.info(name)
else:
paddle_only_names.append(name)
logger.info('-------- paddle only params ----------')
for name in paddle_only_names:
logger.info(name)
logger.info('-------- file only params -----------')
for name in param_name_from_file:
if name in param_name_list:
assert name in common_names
else:
file_only_names.append(name)
logger.info(name)
return common_names
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
#
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
from __future__ import unicode_literals
from __future__ import print_function
from __future__ import division
import paddle
import paddle.fluid as fluid
from paddle.fluid import ParamAttr
import numpy as np
from . import nonlocal_helper
def Conv3dAffine(blob_in,
prefix,
dim_in,
dim_out,
filter_size,
stride,
padding,
cfg,
group=1,
test_mode=False,
bn_init=None):
blob_out = fluid.layers.conv3d(
input=blob_in,
num_filters=dim_out,
filter_size=filter_size,
stride=stride,
padding=padding,
groups=group,
param_attr=ParamAttr(
name=prefix + "_weights", initializer=fluid.initializer.MSRA()),
bias_attr=False,
name=prefix + "_conv")
blob_out_shape = blob_out.shape
affine_name = "bn" + prefix[3:]
affine_scale = fluid.layers.create_parameter(
shape=[blob_out_shape[1]],
dtype=blob_out.dtype,
attr=ParamAttr(name=affine_name + '_scale'),
default_initializer=fluid.initializer.Constant(value=1.))
affine_bias = fluid.layers.create_parameter(
shape=[blob_out_shape[1]],
dtype=blob_out.dtype,
attr=ParamAttr(name=affine_name + '_offset'),
default_initializer=fluid.initializer.Constant(value=0.))
blob_out = fluid.layers.affine_channel(
blob_out, scale=affine_scale, bias=affine_bias, name=affine_name)
return blob_out
def Conv3dBN(blob_in,
prefix,
dim_in,
dim_out,
filter_size,
stride,
padding,
cfg,
group=1,
test_mode=False,
bn_init=None):
blob_out = fluid.layers.conv3d(
input=blob_in,
num_filters=dim_out,
filter_size=filter_size,
stride=stride,
padding=padding,
groups=group,
param_attr=ParamAttr(
name=prefix + "_weights", initializer=fluid.initializer.MSRA()),
bias_attr=False,
name=prefix + "_conv")
bn_name = "bn" + prefix[3:]
blob_out = fluid.layers.batch_norm(
blob_out,
is_test=test_mode,
momentum=cfg.MODEL.bn_momentum,
epsilon=cfg.MODEL.bn_epsilon,
name=bn_name,
param_attr=ParamAttr(
name=bn_name + "_scale",
            initializer=fluid.initializer.Constant(
                value=bn_init if bn_init is not None else 1.),
regularizer=fluid.regularizer.L2Decay(cfg.TRAIN.weight_decay_bn)),
bias_attr=ParamAttr(
name=bn_name + "_offset",
regularizer=fluid.regularizer.L2Decay(cfg.TRAIN.weight_decay_bn)),
moving_mean_name=bn_name + "_mean",
moving_variance_name=bn_name + "_variance")
return blob_out
# 3d bottleneck
def bottleneck_transformation_3d(blob_in,
dim_in,
dim_out,
stride,
prefix,
dim_inner,
cfg,
group=1,
use_temp_conv=1,
temp_stride=1,
test_mode=False):
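    # Bottleneck "F(x)": a (1 + 2*use_temp_conv) x 1 x 1 conv (which carries the
    # temporal kernel and temporal stride), then a 1x3x3 conv (spatial stride),
    # then a 1x1x1 conv without relu.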
conv_op = Conv3dAffine if cfg.MODEL.use_affine else Conv3dBN
# 1x1 layer
blob_out = conv_op(
blob_in,
prefix + "_branch2a",
dim_in,
dim_inner, [1 + use_temp_conv * 2, 1, 1], [temp_stride, 1, 1],
[use_temp_conv, 0, 0],
cfg,
test_mode=test_mode)
blob_out = fluid.layers.relu(blob_out, name=prefix + "_branch2a" + "_relu")
# 3x3 layer
blob_out = conv_op(
blob_out,
prefix + '_branch2b',
dim_inner,
dim_inner, [1, 3, 3], [1, stride, stride], [0, 1, 1],
cfg,
group=group,
test_mode=test_mode)
blob_out = fluid.layers.relu(blob_out, name=prefix + "_branch2b" + "_relu")
# 1x1 layer, no relu
blob_out = conv_op(
blob_out,
prefix + '_branch2c',
dim_inner,
dim_out, [1, 1, 1], [1, 1, 1], [0, 0, 0],
cfg,
test_mode=test_mode,
bn_init=cfg.MODEL.bn_init_gamma)
return blob_out
def _add_shortcut_3d(blob_in,
prefix,
dim_in,
dim_out,
stride,
cfg,
temp_stride=1,
test_mode=False):
if ((dim_in == dim_out) and (temp_stride == 1) and (stride == 1)):
# identity mapping (do nothing)
return blob_in
else:
# when dim changes
conv_op = Conv3dAffine if cfg.MODEL.use_affine else Conv3dBN
blob_out = conv_op(
blob_in,
prefix,
dim_in,
dim_out, [1, 1, 1], [temp_stride, stride, stride], [0, 0, 0],
cfg,
test_mode=test_mode)
return blob_out
# residual block abstraction
def _generic_residual_block_3d(blob_in,
dim_in,
dim_out,
stride,
prefix,
dim_inner,
cfg,
group=1,
use_temp_conv=0,
temp_stride=1,
trans_func=None,
test_mode=False):
# transformation branch (e.g. 1x1-3x3-1x1, or 3x3-3x3), namely "F(x)"
if trans_func is None:
trans_func = globals()[cfg.RESNETS.trans_func]
tr_blob = trans_func(
blob_in,
dim_in,
dim_out,
stride,
prefix,
dim_inner,
cfg,
group=group,
use_temp_conv=use_temp_conv,
temp_stride=temp_stride,
test_mode=test_mode)
# create short cut, namely, "x"
sc_blob = _add_shortcut_3d(
blob_in,
prefix + "_branch1",
dim_in,
dim_out,
stride,
cfg,
temp_stride=temp_stride,
test_mode=test_mode)
# addition, namely, "x + F(x)", and relu
sum_blob = fluid.layers.elementwise_add(
tr_blob, sc_blob, act='relu', name=prefix + '_sum')
return sum_blob
def res_stage_nonlocal(block_fn,
blob_in,
dim_in,
dim_out,
stride,
num_blocks,
prefix,
cfg,
dim_inner=None,
group=None,
use_temp_convs=None,
temp_strides=None,
batch_size=None,
nonlocal_name=None,
nonlocal_mod=1000,
test_mode=False):
# prefix is something like: res2, res3, etc.
# each res layer has num_blocks stacked.
# check dtype and format of use_temp_convs and temp_strides
if use_temp_convs is None:
use_temp_convs = np.zeros(num_blocks).astype(int)
if temp_strides is None:
temp_strides = np.ones(num_blocks).astype(int)
if len(use_temp_convs) < num_blocks:
for _ in range(num_blocks - len(use_temp_convs)):
use_temp_convs.append(0)
temp_strides.append(1)
for idx in range(num_blocks):
block_prefix = '{}{}'.format(prefix, chr(idx + 97))
block_stride = 2 if ((idx == 0) and (stride == 2)) else 1
blob_in = _generic_residual_block_3d(
blob_in,
dim_in,
dim_out,
block_stride,
block_prefix,
dim_inner,
cfg,
group=group,
use_temp_conv=use_temp_convs[idx],
temp_stride=temp_strides[idx],
test_mode=test_mode)
dim_in = dim_out
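        # insert a non-local block after every nonlocal_mod-th residual block;
        # the default nonlocal_mod=1000 effectively disables insertion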
if idx % nonlocal_mod == nonlocal_mod - 1:
blob_in = nonlocal_helper.add_nonlocal(
blob_in,
dim_in,
dim_in,
batch_size,
nonlocal_name + '_{}'.format(idx),
int(dim_in / 2),
cfg,
test_mode=test_mode)
return blob_in, dim_in
def res_stage_nonlocal_group(block_fn,
blob_in,
dim_in,
dim_out,
stride,
num_blocks,
prefix,
cfg,
dim_inner=None,
group=None,
use_temp_convs=None,
temp_strides=None,
batch_size=None,
pool_stride=None,
spatial_dim=None,
group_size=None,
nonlocal_name=None,
nonlocal_mod=1000,
test_mode=False):
# prefix is something like res2, res3, etc.
# each res layer has num_blocks stacked
# check dtype and format of use_temp_convs and temp_strides
if use_temp_convs is None:
use_temp_convs = np.zeros(num_blocks).astype(int)
if temp_strides is None:
temp_strides = np.ones(num_blocks).astype(int)
for idx in range(num_blocks):
block_prefix = "{}{}".format(prefix, chr(idx + 97))
block_stride = 2 if (idx == 0 and stride == 2) else 1
blob_in = _generic_residual_block_3d(
blob_in,
dim_in,
dim_out,
block_stride,
block_prefix,
dim_inner,
cfg,
group=group,
use_temp_conv=use_temp_convs[idx],
temp_stride=temp_strides[idx],
test_mode=test_mode)
dim_in = dim_out
if idx % nonlocal_mod == nonlocal_mod - 1:
blob_in = nonlocal_helper.add_nonlocal_group(
blob_in,
dim_in,
dim_in,
batch_size,
pool_stride,
spatial_dim,
spatial_dim,
group_size,
nonlocal_name + "_{}".format(idx),
int(dim_in / 2),
cfg,
test_mode=test_mode)
return blob_in, dim_in
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
#
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
from __future__ import unicode_literals
from __future__ import print_function
from __future__ import division
import paddle.fluid as fluid
from paddle.fluid.param_attr import ParamAttr
import resnet_helper
import logging
logger = logging.getLogger(__name__)
# For more depths, add the block config here
BLOCK_CONFIG = {
50: (3, 4, 6, 3),
101: (3, 4, 23, 3),
}
# ------------------------------------------------------------------------
# obtain_arc defines the temporal kernel radius and temporal strides for
# each layer's residual blocks in a resnet.
# e.g. use_temp_convs = 1 means a temporal kernel of 3 is used.
# In ResNet50, it has (3, 4, 6, 3) blocks in conv2, 3, 4, 5,
# so the lengths of the corresponding lists are (3, 4, 6, 3).
# ------------------------------------------------------------------------
def obtain_arc(arc_type, video_length):
pool_stride = 1
# c2d, ResNet50
if arc_type == 1:
use_temp_convs_1 = [0]
temp_strides_1 = [1]
use_temp_convs_2 = [0, 0, 0]
temp_strides_2 = [1, 1, 1]
use_temp_convs_3 = [0, 0, 0, 0]
temp_strides_3 = [1, 1, 1, 1]
use_temp_convs_4 = [0, ] * 6
temp_strides_4 = [1, ] * 6
use_temp_convs_5 = [0, 0, 0]
temp_strides_5 = [1, 1, 1]
pool_stride = int(video_length / 2)
# i3d, ResNet50
if arc_type == 2:
use_temp_convs_1 = [2]
temp_strides_1 = [1]
use_temp_convs_2 = [1, 1, 1]
temp_strides_2 = [1, 1, 1]
use_temp_convs_3 = [1, 0, 1, 0]
temp_strides_3 = [1, 1, 1, 1]
use_temp_convs_4 = [1, 0, 1, 0, 1, 0]
temp_strides_4 = [1, 1, 1, 1, 1, 1]
use_temp_convs_5 = [0, 1, 0]
temp_strides_5 = [1, 1, 1]
pool_stride = int(video_length / 2)
# c2d, ResNet101
if arc_type == 3:
use_temp_convs_1 = [0]
temp_strides_1 = [1]
use_temp_convs_2 = [0, 0, 0]
temp_strides_2 = [1, 1, 1]
use_temp_convs_3 = [0, 0, 0, 0]
temp_strides_3 = [1, 1, 1, 1]
use_temp_convs_4 = [0, ] * 23
temp_strides_4 = [1, ] * 23
use_temp_convs_5 = [0, 0, 0]
temp_strides_5 = [1, 1, 1]
pool_stride = int(video_length / 2)
# i3d, ResNet101
if arc_type == 4:
use_temp_convs_1 = [2]
temp_strides_1 = [1]
use_temp_convs_2 = [1, 1, 1]
temp_strides_2 = [1, 1, 1]
use_temp_convs_3 = [1, 0, 1, 0]
temp_strides_3 = [1, 1, 1, 1]
use_temp_convs_4 = []
for i in range(23):
if i % 2 == 0:
use_temp_convs_4.append(1)
else:
use_temp_convs_4.append(0)
temp_strides_4 = [1] * 23
use_temp_convs_5 = [0, 1, 0]
temp_strides_5 = [1, 1, 1]
pool_stride = int(video_length / 2)
use_temp_convs_set = [
use_temp_convs_1, use_temp_convs_2, use_temp_convs_3, use_temp_convs_4,
use_temp_convs_5
]
temp_strides_set = [
temp_strides_1, temp_strides_2, temp_strides_3, temp_strides_4,
temp_strides_5
]
return use_temp_convs_set, temp_strides_set, pool_stride
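# Example: obtain_arc(2, 8) (i3d ResNet-50 on 8-frame clips) returns
#   use_temp_convs_set = [[2], [1, 1, 1], [1, 0, 1, 0], [1, 0, 1, 0, 1, 0], [0, 1, 0]]
#   temp_strides_set   = [[1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1]]
#   pool_stride        = 4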
def create_model(data, label, cfg, is_training=True, mode='train'):
group = cfg.RESNETS.num_groups
width_per_group = cfg.RESNETS.width_per_group
batch_size = int(cfg.TRAIN.batch_size / cfg.TRAIN.num_gpus)
logger.info('--------------- ResNet-{} {}x{}d-{}, {} ---------------'.
format(cfg.MODEL.depth, group, width_per_group,
cfg.RESNETS.trans_func, cfg.MODEL.dataset))
assert cfg.MODEL.depth in BLOCK_CONFIG.keys(), \
"Block config is not defined for specified model depth."
(n1, n2, n3, n4) = BLOCK_CONFIG[cfg.MODEL.depth]
res_block = resnet_helper._generic_residual_block_3d
dim_inner = group * width_per_group
use_temp_convs_set, temp_strides_set, pool_stride = obtain_arc(
cfg.MODEL.video_arc_choice, cfg[mode.upper()]['video_length'])
logger.info(use_temp_convs_set)
logger.info(temp_strides_set)
conv_blob = fluid.layers.conv3d(
input=data,
num_filters=64,
filter_size=[1 + use_temp_convs_set[0][0] * 2, 7, 7],
stride=[temp_strides_set[0][0], 2, 2],
padding=[use_temp_convs_set[0][0], 3, 3],
param_attr=ParamAttr(
name='conv1' + "_weights", initializer=fluid.initializer.MSRA()),
bias_attr=False,
name='conv1')
test_mode = False if (mode == 'train') else True
if cfg.MODEL.use_affine is False:
# use bn
bn_name = 'bn_conv1'
bn_blob = fluid.layers.batch_norm(
conv_blob,
is_test=test_mode,
momentum=cfg.MODEL.bn_momentum,
epsilon=cfg.MODEL.bn_epsilon,
name=bn_name,
param_attr=ParamAttr(
name=bn_name + "_scale",
regularizer=fluid.regularizer.L2Decay(
cfg.TRAIN.weight_decay_bn)),
bias_attr=ParamAttr(
name=bn_name + "_offset",
regularizer=fluid.regularizer.L2Decay(
cfg.TRAIN.weight_decay_bn)),
moving_mean_name=bn_name + "_mean",
moving_variance_name=bn_name + "_variance")
else:
# use affine
affine_name = 'bn_conv1'
conv_blob_shape = conv_blob.shape
affine_scale = fluid.layers.create_parameter(
shape=[conv_blob_shape[1]],
dtype=conv_blob.dtype,
attr=ParamAttr(name=affine_name + '_scale'),
default_initializer=fluid.initializer.Constant(value=1.))
affine_bias = fluid.layers.create_parameter(
shape=[conv_blob_shape[1]],
dtype=conv_blob.dtype,
attr=ParamAttr(name=affine_name + '_offset'),
default_initializer=fluid.initializer.Constant(value=0.))
bn_blob = fluid.layers.affine_channel(
conv_blob, scale=affine_scale, bias=affine_bias, name=affine_name)
# relu
relu_blob = fluid.layers.relu(bn_blob, name='res_conv1_bn_relu')
# max pool
max_pool = fluid.layers.pool3d(
input=relu_blob,
pool_size=[1, 3, 3],
pool_type='max',
pool_stride=[1, 2, 2],
pool_padding=[0, 0, 0],
name='pool1')
# building res block
if cfg.MODEL.depth in [50, 101]:
blob_in, dim_in = resnet_helper.res_stage_nonlocal(
res_block,
max_pool,
64,
256,
stride=1,
num_blocks=n1,
prefix='res2',
cfg=cfg,
dim_inner=dim_inner,
group=group,
use_temp_convs=use_temp_convs_set[1],
temp_strides=temp_strides_set[1],
test_mode=test_mode)
layer_mod = cfg.NONLOCAL.layer_mod
if cfg.MODEL.depth == 101:
layer_mod = 2
if cfg.NONLOCAL.conv3_nonlocal is False:
layer_mod = 1000
blob_in = fluid.layers.pool3d(
blob_in,
pool_size=[2, 1, 1],
pool_type='max',
pool_stride=[2, 1, 1],
pool_padding=[0, 0, 0],
name='pool2')
if cfg.MODEL.use_affine is False:
blob_in, dim_in = resnet_helper.res_stage_nonlocal(
res_block,
blob_in,
dim_in,
512,
stride=2,
num_blocks=n2,
prefix='res3',
cfg=cfg,
dim_inner=dim_inner * 2,
group=group,
use_temp_convs=use_temp_convs_set[2],
temp_strides=temp_strides_set[2],
batch_size=batch_size,
nonlocal_name="nonlocal_conv3",
nonlocal_mod=layer_mod,
test_mode=test_mode)
else:
crop_size = cfg[mode.upper()]['crop_size']
blob_in, dim_in = resnet_helper.res_stage_nonlocal_group(
res_block,
blob_in,
dim_in,
512,
stride=2,
num_blocks=n2,
prefix='res3',
cfg=cfg,
dim_inner=dim_inner * 2,
group=group,
use_temp_convs=use_temp_convs_set[2],
temp_strides=temp_strides_set[2],
batch_size=batch_size,
pool_stride=pool_stride,
spatial_dim=int(crop_size / 8),
group_size=4,
nonlocal_name="nonlocal_conv3_group",
nonlocal_mod=layer_mod,
test_mode=test_mode)
layer_mod = cfg.NONLOCAL.layer_mod
if cfg.MODEL.depth == 101:
layer_mod = layer_mod * 4 - 1
if cfg.NONLOCAL.conv4_nonlocal is False:
layer_mod = 1000
blob_in, dim_in = resnet_helper.res_stage_nonlocal(
res_block,
blob_in,
dim_in,
1024,
stride=2,
num_blocks=n3,
prefix='res4',
cfg=cfg,
dim_inner=dim_inner * 4,
group=group,
use_temp_convs=use_temp_convs_set[3],
temp_strides=temp_strides_set[3],
batch_size=batch_size,
nonlocal_name="nonlocal_conv4",
nonlocal_mod=layer_mod,
test_mode=test_mode)
blob_in, dim_in = resnet_helper.res_stage_nonlocal(
res_block,
blob_in,
dim_in,
2048,
stride=2,
num_blocks=n4,
prefix='res5',
cfg=cfg,
dim_inner=dim_inner * 8,
group=group,
use_temp_convs=use_temp_convs_set[4],
temp_strides=temp_strides_set[4],
test_mode=test_mode)
else:
raise Exception("Unsupported network settings.")
blob_out = fluid.layers.pool3d(
blob_in,
pool_size=[pool_stride, 7, 7],
pool_type='avg',
pool_stride=[1, 1, 1],
pool_padding=[0, 0, 0],
name='pool5')
if (cfg.TRAIN.dropout_rate > 0) and (test_mode is False):
blob_out = fluid.layers.dropout(
blob_out, cfg.TRAIN.dropout_rate, is_test=test_mode)
if mode in ['train', 'valid']:
blob_out = fluid.layers.fc(
blob_out,
cfg.MODEL.num_classes,
param_attr=ParamAttr(
name='pred' + "_w",
initializer=fluid.initializer.Normal(
loc=0.0, scale=cfg.MODEL.fc_init_std)),
bias_attr=ParamAttr(
name='pred' + "_b",
initializer=fluid.initializer.Constant(value=0.)),
name='pred')
elif mode in ['test', 'infer']:
blob_out = fluid.layers.conv3d(
input=blob_out,
num_filters=cfg.MODEL.num_classes,
filter_size=[1, 1, 1],
stride=[1, 1, 1],
padding=[0, 0, 0],
param_attr=ParamAttr(
name='pred' + "_w", initializer=fluid.initializer.MSRA()),
bias_attr=ParamAttr(
name='pred' + "_b",
initializer=fluid.initializer.Constant(value=0.)),
name='pred')
if (mode == 'train') or (mode == 'valid'):
softmax = fluid.layers.softmax(blob_out)
loss = fluid.layers.cross_entropy(
softmax, label, soft_label=False, ignore_index=-100)
elif (mode == 'test') or (mode == 'infer'):
# fully convolutional testing, when loading test model,
# params should be copied from train_prog fc layer named pred
blob_out = fluid.layers.transpose(
blob_out, [0, 2, 3, 4, 1], name='pred_tr')
blob_out = fluid.layers.softmax(blob_out, name='softmax_conv')
softmax = fluid.layers.reduce_mean(
blob_out, dim=[1, 2, 3], keep_dim=False, name='softmax')
loss = None
    else:
        raise NotImplementedError("mode {} is not supported".format(mode))
return softmax, loss
......@@ -46,6 +46,7 @@ class STNET(ModelBase):
'l2_weight_decay')
self.momentum = self.get_config_from_sec('train', 'momentum')
self.seg_num = self.get_config_from_sec(self.mode, 'seg_num', self.seg_num)
self.target_size = self.get_config_from_sec(self.mode, 'target_size')
self.batch_size = self.get_config_from_sec(self.mode, 'batch_size')
......
# TSM Video Classification Model
---
## Contents
- [Introduction](#introduction)
- [Data Preparation](#data-preparation)
- [Training](#training)
- [Evaluation](#evaluation)
- [Inference](#inference)
- [Reference](#reference)
## Introduction
The Temporal Shift Module, proposed by Ji Lin, Chuang Gan et al. from MIT and the IBM Watson AI Lab, improves a network's video understanding capability through shifts along the temporal dimension. The shift operation works as illustrated below.
<p align="center">
<img src="../../images/temporal_shift.png" height=250 width=800 hspace='10'/> <br />
Temporal shift module
</p>
In the figure above, the matrix represents the temporal and channel dimensions of a feature map. A portion of the channels is shifted one step forward along the temporal dimension and another portion one step backward, with the vacated positions padded with zeros. This introduces contextual interaction along the temporal dimension of the feature map and improves video understanding over time.
The TSM model is built by inserting the Temporal Shift Module into a ResNet network; the version implemented in this repository uses ResNet-50 as the backbone.
For details, please refer to the paper [Temporal Shift Module for Efficient Video Understanding](https://arxiv.org/abs/1811.08383v1).
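
Below is a minimal NumPy sketch of the shift operation (illustrative only; in this model it is performed by `fluid.layers.temporal_shift`, and the function and variable names here are not part of the codebase). The input is assumed to be folded as `[N*T, C, H, W]` with `T=seg_num`:

    import numpy as np

    def temporal_shift_np(x, seg_num, shift_ratio=1.0 / 8):
        # x: [N*T, C, H, W], T = seg_num segments per video
        n_t, c, h, w = x.shape
        x = x.reshape(n_t // seg_num, seg_num, c, h, w)
        fold = int(c * shift_ratio)
        out = np.zeros_like(x)  # vacated positions stay zero
        out[:, 1:, :fold] = x[:, :-1, :fold]                  # shift forward in time
        out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]  # shift backward in time
        out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # remaining channels unchanged
        return out.reshape(n_t, c, h, w)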
## Data Preparation
TSM is trained on the Kinetics-400 action recognition dataset released by DeepMind. For data download and preparation, please refer to the [data instructions](../../dataset/README.md).
## Training
Once the data is ready, training can be started in either of the following two ways:

    python train.py --model_name=TSM
                    --config=./configs/tsm.txt
                    --save_dir=checkpoints
                    --log_interval=10
                    --valid_interval=1

    bash scripts/train/train_tsm.sh

- You can download the released [model](https://paddlemodels.bj.bcebos.com/video_classification/tsm_kinetics.tar.gz) and pass its path via `--resume` for finetuning and further development.
**Data reader:** The model reads `mp4` files from the Kinetics-400 dataset, samples `seg_num` segments per video and 1 frame per segment, applies random augmentation to each frame, and rescales it to `target_size`.
**Training strategy:**
* Momentum optimizer with momentum=0.9
* Weight decay coefficient of 1e-4
## Evaluation
The model can be evaluated in either of the following two ways:

    python test.py --model_name=TSM
                   --config=configs/tsm.txt
                   --log_interval=1
                   --weights=$PATH_TO_WEIGHTS

    bash scripts/test/test_tsm.sh

- When evaluating with `scripts/test/test_tsm.sh`, modify the `--weights` argument in the script to specify the weights to evaluate.
- If `--weights` is not set, the script downloads the released [model](https://paddlemodels.bj.bcebos.com/video_classification/tsm_kinetics.tar.gz) and evaluates it.
With the following settings, evaluation accuracy on the Kinetics-400 validation set is:
| seg\_num | target\_size | Top-1 |
| :------: | :----------: | :----: |
| 8 | 224 | 0.70 |
## Inference
Inference can be run with the following command:

    python infer.py --model_name=TSM
                    --config=configs/tsm.txt
                    --log_interval=1
                    --weights=$PATH_TO_WEIGHTS
                    --filelist=$FILELIST

- Inference results are stored in `TSM_infer_result` in `pickle` format.
- If `--weights` is not set, the script downloads the released [model](https://paddlemodels.bj.bcebos.com/video_classification/tsm_kinetics.tar.gz) and runs inference with it.
## Reference
- [Temporal Shift Module for Efficient Video Understanding](https://arxiv.org/abs/1811.08383v1), Ji Lin, Chuang Gan, Song Han
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
#
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
import paddle.fluid as fluid
from paddle.fluid import ParamAttr
from ..model import ModelBase
from .tsm_res_model import TSM_ResNet
import logging
logger = logging.getLogger(__name__)
__all__ = ["TSM"]
class TSM(ModelBase):
def __init__(self, name, cfg, mode='train'):
super(TSM, self).__init__(name, cfg, mode=mode)
self.get_config()
def get_config(self):
self.num_classes = self.get_config_from_sec('model', 'num_classes')
self.seg_num = self.get_config_from_sec('model', 'seg_num')
self.seglen = self.get_config_from_sec('model', 'seglen')
self.image_mean = self.get_config_from_sec('model', 'image_mean')
self.image_std = self.get_config_from_sec('model', 'image_std')
self.num_layers = self.get_config_from_sec('model', 'num_layers')
self.num_epochs = self.get_config_from_sec('train', 'epoch')
self.total_videos = self.get_config_from_sec('train', 'total_videos')
self.base_learning_rate = self.get_config_from_sec('train',
'learning_rate')
self.learning_rate_decay = self.get_config_from_sec(
'train', 'learning_rate_decay')
self.decay_epochs = self.get_config_from_sec('train', 'decay_epochs')
self.l2_weight_decay = self.get_config_from_sec('train',
'l2_weight_decay')
self.momentum = self.get_config_from_sec('train', 'momentum')
self.target_size = self.get_config_from_sec(self.mode, 'target_size')
self.batch_size = self.get_config_from_sec(self.mode, 'batch_size')
def build_input(self, use_pyreader=True):
image_shape = [3, self.target_size, self.target_size]
image_shape[0] = image_shape[0] * self.seglen
image_shape = [self.seg_num] + image_shape
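        # final input shape: [seg_num, 3 * seglen, target_size, target_size]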
self.use_pyreader = use_pyreader
if use_pyreader:
            assert self.mode != 'infer', \
                'pyreader is not recommended for infer mode, please set use_pyreader to False.'
py_reader = fluid.layers.py_reader(
capacity=100,
shapes=[[-1] + image_shape, [-1] + [1]],
dtypes=['float32', 'int64'],
name='train_py_reader'
if self.is_training else 'test_py_reader',
use_double_buffer=True)
image, label = fluid.layers.read_file(py_reader)
self.py_reader = py_reader
else:
image = fluid.layers.data(
name='image', shape=image_shape, dtype='float32')
if self.mode != 'infer':
label = fluid.layers.data(
name='label', shape=[1], dtype='int64')
else:
label = None
self.feature_input = [image]
self.label_input = label
def build_model(self):
videomodel = TSM_ResNet(
layers=self.num_layers,
seg_num=self.seg_num,
is_training=self.is_training)
out = videomodel.net(input=self.feature_input[0],
class_dim=self.num_classes)
self.network_outputs = [out]
def optimizer(self):
        assert self.mode == 'train', "optimizer can only be built in train mode"
total_videos = self.total_videos
step = int(total_videos / self.batch_size + 1)
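        # step = iterations per epoch; boundaries are expressed in iterations.
        # e.g. decay_epochs=[40, 60] drops the lr by learning_rate_decay at
        # epoch 40 and again at epoch 60 (the values come from the config file)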
bd = [e * step for e in self.decay_epochs]
base_lr = self.base_learning_rate
lr_decay = self.learning_rate_decay
lr = [base_lr, base_lr * lr_decay, base_lr * lr_decay * lr_decay]
l2_weight_decay = self.l2_weight_decay
momentum = self.momentum
optimizer = fluid.optimizer.Momentum(
learning_rate=fluid.layers.piecewise_decay(
boundaries=bd, values=lr),
momentum=momentum,
regularization=fluid.regularizer.L2Decay(l2_weight_decay))
return optimizer
def loss(self):
        assert self.mode != 'infer', "invalid loss calculation in infer mode"
cost = fluid.layers.cross_entropy(input=self.network_outputs[0], \
label=self.label_input, ignore_index=-1)
self.loss_ = fluid.layers.mean(x=cost)
return self.loss_
def outputs(self):
return self.network_outputs
def feeds(self):
return self.feature_input if self.mode == 'infer' else self.feature_input + [
self.label_input
]
def pretrain_info(self):
return ('ResNet50_pretrained', 'https://paddlemodels.bj.bcebos.com/video_classification/ResNet50_pretrained.tar.gz')
def weights_info(self):
return ('tsm_kinetics',
'https://paddlemodels.bj.bcebos.com/video_classification/tsm_kinetics.tar.gz')
def load_pretrain_params(self, exe, pretrain, prog, place):
def is_parameter(var):
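            # exclude the final fc layer, whose shape depends on num_classes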
return isinstance(var, fluid.framework.Parameter) and (not ("fc_0" in var.name))
logger.info("Load pretrain weights from {}, exclude fc layer.".format(pretrain))
vars = filter(is_parameter, prog.list_vars())
fluid.io.load_vars(exe, pretrain, vars=vars, main_program=prog)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserve.
#
#Licensed under the Apache License, Version 2.0 (the "License");
#you may not use this file except in compliance with the License.
#You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
#Unless required by applicable law or agreed to in writing, software
#distributed under the License is distributed on an "AS IS" BASIS,
#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#See the License for the specific language governing permissions and
#limitations under the License.
import os
import time
import sys
import paddle.fluid as fluid
import math
class TSM_ResNet():
def __init__(self, layers=50, seg_num=8, is_training=False):
self.layers = layers
self.seg_num = seg_num
self.is_training = is_training
def shift_module(self, input):
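        # shift 1/8 of the channels one step forward and another 1/8 one step
        # backward along the segment (temporal) dimension, zero-padding the ends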
output = fluid.layers.temporal_shift(input, self.seg_num, 1.0 / 8)
return output
def conv_bn_layer(self,
input,
num_filters,
filter_size,
stride=1,
groups=1,
act=None,
name=None):
conv = fluid.layers.conv2d(
input=input,
num_filters=num_filters,
filter_size=filter_size,
stride=stride,
padding=(filter_size - 1) // 2,
groups=groups,
act=None,
param_attr=fluid.param_attr.ParamAttr(name=name+"_weights"),
bias_attr=False)
if name == "conv1":
bn_name = "bn_" + name
else:
bn_name = "bn" + name[3:]
return fluid.layers.batch_norm(input=conv, act=act,
is_test=(not self.is_training),
param_attr=fluid.param_attr.ParamAttr(name=bn_name+"_scale"),
bias_attr=fluid.param_attr.ParamAttr(bn_name+'_offset'),
moving_mean_name=bn_name+"_mean",
moving_variance_name=bn_name+'_variance')
def shortcut(self, input, ch_out, stride, name):
ch_in = input.shape[1]
if ch_in != ch_out or stride != 1:
return self.conv_bn_layer(input, ch_out, 1, stride, name=name)
else:
return input
def bottleneck_block(self, input, num_filters, stride, name):
shifted = self.shift_module(input)
conv0 = self.conv_bn_layer(
input=shifted, num_filters=num_filters, filter_size=1, act='relu',
name=name+"_branch2a")
conv1 = self.conv_bn_layer(
input=conv0,
num_filters=num_filters,
filter_size=3,
stride=stride,
act='relu', name=name+"_branch2b")
conv2 = self.conv_bn_layer(
input=conv1, num_filters=num_filters * 4, filter_size=1, act=None, name=name+"_branch2c")
short = self.shortcut(input, num_filters * 4, stride, name=name+"_branch1")
return fluid.layers.elementwise_add(x=short, y=conv2, act='relu')
def net(self, input, class_dim=101):
layers = self.layers
seg_num = self.seg_num
supported_layers = [50, 101, 152]
        if layers not in supported_layers:
            raise ValueError("supported layers are {}, but input layer is {}".
                             format(supported_layers, layers))
# reshape input
channels = input.shape[2]
short_size = input.shape[3]
input = fluid.layers.reshape(
x=input, shape=[-1, channels, short_size, short_size])
if layers == 50:
depth = [3, 4, 6, 3]
elif layers == 101:
depth = [3, 4, 23, 3]
elif layers == 152:
depth = [3, 8, 36, 3]
num_filters = [64, 128, 256, 512]
conv = self.conv_bn_layer(
input=input, num_filters=64, filter_size=7, stride=2, act='relu', name='conv1')
conv = fluid.layers.pool2d(
input=conv,
pool_size=3,
pool_stride=2,
pool_padding=1,
pool_type='max')
for block in range(len(depth)):
for i in range(depth[block]):
if layers in [101, 152] and block == 2:
if i == 0:
conv_name = "res" + str(block+2) + "a"
else:
conv_name = "res" + str(block+2) + "b" + str(i)
else:
conv_name = "res" + str(block+2) + chr(97+i)
conv = self.bottleneck_block(
input=conv,
num_filters=num_filters[block],
stride=2 if i == 0 and block != 0 else 1,
name=conv_name)
pool = fluid.layers.pool2d(
input=conv, pool_size=7, pool_type='avg', global_pooling=True)
dropout = fluid.layers.dropout(x=pool, dropout_prob=0.5, is_test=(not self.is_training))
feature = fluid.layers.reshape(
x=dropout, shape=[-1, seg_num, pool.shape[1]])
out = fluid.layers.reduce_mean(feature, dim=1)
stdv = 1.0 / math.sqrt(pool.shape[1] * 1.0)
out = fluid.layers.fc(input=out,
size=class_dim,
act='softmax',
param_attr=fluid.param_attr.ParamAttr(
initializer=fluid.initializer.Uniform(-stdv,
stdv)),
bias_attr=fluid.param_attr.ParamAttr(learning_rate=2.0,
regularizer=fluid.regularizer.L2Decay(0.)))
return out
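
# A minimal usage sketch (illustrative only): build the TSM network for
# 8-segment clips of 224x224 RGB crops and Kinetics-400 classes.
if __name__ == '__main__':
    data = fluid.layers.data(
        name='image', shape=[8, 3, 224, 224], dtype='float32')
    model = TSM_ResNet(layers=50, seg_num=8, is_training=False)
    out = model.net(input=data, class_dim=400)  # softmax over 400 classes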
......@@ -47,6 +47,7 @@ class TSN(ModelBase):
'l2_weight_decay')
self.momentum = self.get_config_from_sec('train', 'momentum')
self.seg_num = self.get_config_from_sec(self.mode, 'seg_num', self.seg_num)
self.target_size = self.get_config_from_sec(self.mode, 'target_size')
self.batch_size = self.get_config_from_sec(self.mode, 'batch_size')
......
File mode changed from 100755 to 100644
python infer.py --model-name="AttentionCluster" --config=./configs/attention_cluster.txt \
--filelist=./data/youtube8m/infer.list \
python infer.py --model_name="AttentionCluster" --config=./configs/attention_cluster.txt \
--filelist=./dataset/youtube8m/infer.list \
--weights=./checkpoints/AttentionCluster_epoch0 \
--save-dir="./save"
--save_dir="./save"
python infer.py --model-name="AttentionLSTM" --config=./configs/attention_lstm.txt \
--filelist=./data/youtube8m/infer.list \
python infer.py --model_name="AttentionLSTM" --config=./configs/attention_lstm.txt \
--filelist=./dataset/youtube8m/infer.list \
--weights=./checkpoints/AttentionLSTM_epoch0 \
--save-dir="./save"
--save_dir="./save"
python infer.py --model-name="NEXTVLAD" --config=./configs/nextvlad.txt --filelist=./data/youtube8m/infer.list \
python infer.py --model_name="NEXTVLAD" --config=./configs/nextvlad.txt --filelist=./dataset/youtube8m/infer.list \
--weights=./checkpoints/NEXTVLAD_epoch0 \
--save-dir="./save"
--save_dir="./save"
python infer.py --model_name="NONLOCAL" --config=./configs/nonlocal.txt --filelist=./dataset/nonlocal/inferlist.txt \
--log_interval=10 --weights=./checkpoints/NONLOCAL_epoch0 --save_dir=./save
python infer.py --model-name="STNET" --config=./configs/stnet.txt --filelist=./data/kinetics/infer.list \
--log-interval=10 --weights=./checkpoints/STNET_epoch0 --save-dir=./save
python infer.py --model_name="STNET" --config=./configs/stnet.txt --filelist=./dataset/kinetics/infer.list \
--log_interval=10 --weights=./checkpoints/STNET_epoch0 --save_dir=./save
python infer.py --model_name="TSM" --config=./configs/tsm.txt --filelist=./dataset/kinetics/infer.list \
--log_interval=10 --weights=./checkpoints/TSM_epoch0 --save_dir=./save
python infer.py --model-name="TSN" --config=./configs/tsn.txt --filelist=./data/kinetics/infer.list \
--log-interval=10 --weights=./checkpoints/TSN_epoch0 --save-dir=./save
python infer.py --model_name="TSN" --config=./configs/tsn.txt --filelist=./dataset/kinetics/infer.list \
--log_interval=10 --weights=./checkpoints/TSN_epoch0 --save_dir=./save
python test.py --model-name="AttentionCluster" --config=./configs/attention_cluster.txt \
--log-interval=5 --weights=./checkpoints/AttentionCluster_epoch0
python test.py --model_name="AttentionCluster" --config=./configs/attention_cluster.txt \
--log_interval=5 --weights=./checkpoints/AttentionCluster_epoch0
python test.py --model-name="AttentionLSTM" --config=./configs/attention_lstm.txt \
--log-interval=5 --weights=./checkpoints/AttentionLSTM_epoch0
python test.py --model_name="AttentionLSTM" --config=./configs/attention_lstm.txt \
--log_interval=5 --weights=./checkpoints/AttentionLSTM_epoch0
python test.py --model-name="NEXTVLAD" --config=./configs/nextvlad.txt \
--log-interval=10 --weights=./checkpoints/NEXTVLAD_epoch0
python test.py --model_name="NEXTVLAD" --config=./configs/nextvlad.txt \
--log_interval=10 --weights=./checkpoints/NEXTVLAD_epoch0
python -i test.py --model_name="NONLOCAL" --config=./configs/nonlocal.txt \
--log_interval=1 --weights=./checkpoints/NONLOCAL_epoch0
python test.py --model-name="STNET" --config=./configs/stnet.txt \
--log-interval=10 --weights=./checkpoints/STNET_epoch0
python test.py --model_name="STNET" --config=./configs/stnet.txt \
--log_interval=10 --weights=./checkpoints/STNET_epoch0
python test.py --model_name="TSM" --config=./configs/tsm.txt \
--log_interval=10 --weights=./checkpoints/TSM_epoch0
python test.py --model-name="TSN" --config=./configs/tsn.txt \
--log-interval=10 --weights=./checkpoints/TSN_epoch0
python test.py --model_name="TSN" --config=./configs/tsn.txt \
--log_interval=10 --weights=./checkpoints/TSN_epoch0
python train.py --model-name="AttentionCluster" --config=./configs/attention_cluster.txt --epoch-num=5 \
--valid-interval=1 --log-interval=10
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python train.py --model_name="AttentionCluster" --config=./configs/attention_cluster.txt --epoch_num=5 \
--valid_interval=1 --log_interval=10
python train.py --model-name="AttentionLSTM" --config=./configs/attention_lstm.txt --epoch-num=10 \
--valid-interval=1 --log-interval=10
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python train.py --model_name="AttentionLSTM" --config=./configs/attention_lstm.txt --epoch_num=10 \
--valid_interval=1 --log_interval=10
export CUDA_VISIBLE_DEVICES=0,1,2,3
python train.py --model-name="NEXTVLAD" --config=./configs/nextvlad.txt --epoch-num=6 \
--valid-interval=1 --log-interval=10
python train.py --model_name="NEXTVLAD" --config=./configs/nextvlad.txt --epoch_num=6 \
--valid_interval=1 --log_interval=10
python train.py --model_name="NONLOCAL" --config=./configs/nonlocal.txt --epoch_num=120 \
--valid_interval=1 --log_interval=1 \
--pretrain=./pretrained/ResNet50_pretrained
python train.py --model-name="STNET" --config=./configs/stnet.txt --epoch-num=60 \
--valid-interval=1 --log-interval=10
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python train.py --model_name="STNET" --config=./configs/stnet.txt --epoch_num=60 \
--valid_interval=1 --log_interval=10
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python train.py --model_name="TSM" --config=./configs/tsm.txt --epoch_num=65 \
--valid_interval=1 --log_interval=10
python train.py --model-name="TSN" --config=./configs/tsn.txt --epoch-num=45 \
--valid-interval=1 --log-interval=10
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python train.py --model_name="TSN" --config=./configs/tsn.txt --epoch_num=45 \
--valid_interval=1 --log_interval=10
......@@ -34,7 +34,7 @@ logger = logging.getLogger(__name__)
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
'--model-name',
'--model_name',
type=str,
default='AttentionCluster',
help='name of model to train.')
......@@ -44,19 +44,19 @@ def parse_args():
default='configs/attention_cluster.txt',
help='path to config file of model')
parser.add_argument(
'--batch-size',
'--batch_size',
type=int,
default=None,
help='traing batch size per GPU. None to use config file setting.')
help='test batch size. None to use config file setting.')
parser.add_argument(
'--use-gpu', type=bool, default=True, help='default use gpu.')
'--use_gpu', type=bool, default=True, help='default use gpu.')
parser.add_argument(
'--weights',
type=str,
default=None,
help='weight path, None to use weights from Paddle.')
parser.add_argument(
'--log-interval',
'--log_interval',
type=int,
default=1,
help='mini-batch interval to log.')
......@@ -68,6 +68,7 @@ def test(args):
# parse config
config = parse_config(args.config)
test_config = merge_configs(config, 'test', vars(args))
print_configs(test_config, "Test")
# build model
test_model = models.get_model(args.model_name, test_config, mode='test')
......@@ -75,7 +76,7 @@ def test(args):
test_model.build_model()
test_feeds = test_model.feeds()
test_outputs = test_model.outputs()
loss = test_model.loss()
test_loss = test_model.loss()
place = fluid.CUDAPlace(0) if args.use_gpu else fluid.CPUPlace()
exe = fluid.Executor(place)
......@@ -85,29 +86,34 @@ def test(args):
args.weights), "Given weight dir {} not exist.".format(args.weights)
weights = args.weights or test_model.get_weights()
def if_exist(var):
return os.path.exists(os.path.join(weights, var.name))
fluid.io.load_vars(exe, weights, predicate=if_exist)
test_model.load_test_weights(exe, weights,
fluid.default_main_program(), place)
# get reader and metrics
test_reader = get_reader(args.model_name.upper(), 'test', test_config)
test_metrics = get_metrics(args.model_name.upper(), 'test', test_config)
test_feeder = fluid.DataFeeder(place=place, feed_list=test_feeds)
fetch_list = [loss.name] + [x.name
for x in test_outputs] + [test_feeds[-1].name]
if test_loss is None:
fetch_list = [x.name for x in test_outputs] + [test_feeds[-1].name]
else:
fetch_list = [test_loss.name] + [x.name for x in test_outputs
] + [test_feeds[-1].name]
epoch_period = []
for test_iter, data in enumerate(test_reader()):
cur_time = time.time()
test_outs = exe.run(fetch_list=fetch_list,
feed=test_feeder.feed(data))
test_outs = exe.run(fetch_list=fetch_list, feed=test_feeder.feed(data))
period = time.time() - cur_time
epoch_period.append(period)
loss = np.array(test_outs[0])
pred = np.array(test_outs[1])
label = np.array(test_outs[-1])
if test_loss is None:
loss = np.zeros(1, ).astype('float32')
pred = np.array(test_outs[0])
label = np.array(test_outs[-1])
else:
loss = np.array(test_outs[0])
pred = np.array(test_outs[1])
label = np.array(test_outs[-1])
test_metrics.accumulate(loss, pred, label)
# metric here
......
......@@ -61,6 +61,11 @@ def train_without_pyreader(exe, train_prog, train_exe, train_reader, train_feede
save_model_name = 'model', test_exe = None, test_reader = None, \
test_feeder = None, test_fetch_list = None, test_metrics = None):
for epoch in range(epochs):
lr = fluid.global_scope().find_var("learning_rate").get_tensor()
lr_count = fluid.global_scope().find_var(
"@LR_DECAY_COUNTER@").get_tensor()
logger.info("------- learning rate {}, learning rate counter {} -----"
.format(np.array(lr), np.array(lr_count)))
epoch_periods = []
for train_iter, data in enumerate(train_reader()):
cur_time = time.time()
......@@ -80,7 +85,8 @@ def train_without_pyreader(exe, train_prog, train_exe, train_reader, train_feede
format(epoch, np.mean(epoch_periods)))
save_model(exe, train_prog, save_dir, save_model_name,
"_epoch{}".format(epoch))
if test_exe and valid_interval > 0 and (epoch + 1) % valid_interval == 0:
if test_exe and valid_interval > 0 and (epoch + 1
) % valid_interval == 0:
test_without_pyreader(test_exe, test_reader, test_feeder,
test_fetch_list, test_metrics, log_interval)
......@@ -95,6 +101,11 @@ def train_with_pyreader(exe, train_prog, train_exe, train_pyreader, \
if not train_pyreader:
logger.error("[TRAIN] get pyreader failed.")
for epoch in range(epochs):
lr = fluid.global_scope().find_var("learning_rate").get_tensor()
lr_count = fluid.global_scope().find_var(
"@LR_DECAY_COUNTER@").get_tensor()
logger.info("------- learning rate {}, learning rate counter {} -----"
.format(np.array(lr), np.array(lr_count)))
train_pyreader.start()
train_metrics.reset()
try:
......@@ -119,7 +130,8 @@ def train_with_pyreader(exe, train_prog, train_exe, train_pyreader, \
format(epoch, np.mean(epoch_periods)))
save_model(exe, train_prog, save_dir, save_model_name,
"_epoch{}".format(epoch))
if test_exe and valid_interval > 0 and (epoch + 1) % valid_interval == 0:
if test_exe and valid_interval > 0 and (epoch + 1
) % valid_interval == 0:
test_with_pyreader(test_exe, test_pyreader, test_fetch_list,
test_metrics, log_interval)
finally:
......
......@@ -35,7 +35,7 @@ logger = logging.getLogger(__name__)
def parse_args():
parser = argparse.ArgumentParser("Paddle Video train script")
parser.add_argument(
'--model-name',
'--model_name',
type=str,
default='AttentionCluster',
help='name of model to train.')
......@@ -45,12 +45,12 @@ def parse_args():
default='configs/attention_cluster.txt',
help='path to config file of model')
parser.add_argument(
'--batch-size',
'--batch_size',
type=int,
default=None,
help='training batch size. None to use config file setting.')
parser.add_argument(
'--learning-rate',
'--learning_rate',
type=float,
default=None,
help='learning rate use for training. None to use config file setting.')
......@@ -65,37 +65,36 @@ def parse_args():
type=str,
default=None,
help='path to resume training based on previous checkpoints. '
'None for not resuming any checkpoints.'
)
'None for not resuming any checkpoints.')
parser.add_argument(
'--use-gpu', type=bool, default=True, help='default use gpu.')
'--use_gpu', type=bool, default=True, help='default use gpu.')
parser.add_argument(
'--no-use-pyreader',
'--no_use_pyreader',
action='store_true',
default=False,
help='whether to use pyreader')
parser.add_argument(
'--no-memory-optimize',
'--no_memory_optimize',
action='store_true',
default=False,
help='whether to use memory optimize in train')
parser.add_argument(
'--epoch-num',
'--epoch_num',
type=int,
default=0,
help='epoch number, 0 for read from config file')
parser.add_argument(
'--valid-interval',
'--valid_interval',
type=int,
default=1,
help='validation epoch interval, 0 for no validation.')
parser.add_argument(
'--save-dir',
'--save_dir',
type=str,
default='checkpoints',
help='directory name to save train snapshoot')
parser.add_argument(
'--log-interval',
'--log_interval',
type=int,
default=10,
help='mini-batch interval to log.')
......@@ -108,6 +107,7 @@ def train(args):
config = parse_config(args.config)
train_config = merge_configs(config, 'train', vars(args))
valid_config = merge_configs(config, 'valid', vars(args))
print_configs(train_config, 'Train')
train_model = models.get_model(args.model_name, train_config, mode='train')
valid_model = models.get_model(args.model_name, valid_config, mode='valid')
......@@ -153,9 +153,12 @@ def train(args):
# if resume weights is given, load resume weights directly
assert os.path.exists(args.resume), \
"Given resume weight dir {} not exist.".format(args.resume)
def if_exist(var):
return os.path.exists(os.path.join(args.resume, var.name))
fluid.io.load_vars(exe, args.resume, predicate=if_exist, main_program=train_prog)
fluid.io.load_vars(
exe, args.resume, predicate=if_exist, main_program=train_prog)
else:
# if not in resume mode, load pretrain weights
if args.pretrain:
......@@ -199,21 +202,43 @@ def train(args):
if args.no_use_pyreader:
train_feeder = fluid.DataFeeder(place=place, feed_list=train_feeds)
valid_feeder = fluid.DataFeeder(place=place, feed_list=valid_feeds)
train_without_pyreader(exe, train_prog, train_exe, train_reader, train_feeder,
train_fetch_list, train_metrics, epochs = epochs,
log_interval = args.log_interval, valid_interval = args.valid_interval,
save_dir = args.save_dir, save_model_name = args.model_name,
test_exe = valid_exe, test_reader = valid_reader, test_feeder = valid_feeder,
test_fetch_list = valid_fetch_list, test_metrics = valid_metrics)
train_without_pyreader(
exe,
train_prog,
train_exe,
train_reader,
train_feeder,
train_fetch_list,
train_metrics,
epochs=epochs,
log_interval=args.log_interval,
valid_interval=args.valid_interval,
save_dir=args.save_dir,
save_model_name=args.model_name,
test_exe=valid_exe,
test_reader=valid_reader,
test_feeder=valid_feeder,
test_fetch_list=valid_fetch_list,
test_metrics=valid_metrics)
else:
train_pyreader.decorate_paddle_reader(train_reader)
valid_pyreader.decorate_paddle_reader(valid_reader)
train_with_pyreader(exe, train_prog, train_exe, train_pyreader, train_fetch_list, train_metrics,
epochs = epochs, log_interval = args.log_interval,
valid_interval = args.valid_interval,
save_dir = args.save_dir, save_model_name = args.model_name,
test_exe = valid_exe, test_pyreader = valid_pyreader,
test_fetch_list = valid_fetch_list, test_metrics = valid_metrics)
train_with_pyreader(
exe,
train_prog,
train_exe,
train_pyreader,
train_fetch_list,
train_metrics,
epochs=epochs,
log_interval=args.log_interval,
valid_interval=args.valid_interval,
save_dir=args.save_dir,
save_model_name=args.model_name,
test_exe=valid_exe,
test_pyreader=valid_pyreader,
test_fetch_list=valid_fetch_list,
test_metrics=valid_metrics)
if __name__ == "__main__":
......
......@@ -14,6 +14,7 @@
__all__ = ['AttrDict']
class AttrDict(dict):
def __getattr__(self, key):
return self[key]
......