Unverified commit 9afa98c9, authored by SunGaofeng, committed by GitHub

change video dirname to PaddleVideo (#2600)

Parent 9b426b2c
## Introduction
This model library aims to give developers convenient, efficient PaddlePaddle-based deep learning models for video understanding, video editing, video generation, and related tasks. It currently contains video classification and action localization models, and will keep expanding to more scenarios.
The current video classification and action localization models include:

| Model | Category | Description |
| :--------------- | :--------: | :------------: |
| [Attention Cluster](./models/attention_cluster/README.md) | Video classification | Attention-cluster fusion of multimodal video features, proposed at CVPR'18 |
| [Attention LSTM](./models/attention_lstm/README.md) | Video classification | Widely used model; fast and accurate |
| [NeXtVLAD](./models/nextvlad/README.md) | Video classification | Best single model in the 2nd YouTube-8M challenge |
| [StNet](./models/stnet/README.md) | Video classification | Joint spatial-temporal video modeling method, proposed at AAAI'19 |
| [TSM](./models/tsm/README.md) | Video classification | Simple, efficient spatial-temporal video modeling based on temporal shift |
| [TSN](./models/tsn/README.md) | Video classification | Classic 2D-CNN-based solution, proposed at ECCV'16 |
| [Non-local](./models/nonlocal_model/README.md) | Video classification | Non-local correlation modeling for video |
| [C-TCN](./models/ctcn/README.md) | Action localization | Winning solution of the 2018 ActivityNet challenge |
### Key features
- Covers multiple leading models for video classification and action localization. Attention LSTM, Attention Cluster, and NeXtVLAD are popular feature-sequence models, while Non-local, TSN, TSM, and StNet are end-to-end video classification models. Attention LSTM is fast and accurate; NeXtVLAD was the best single model in the 2nd YouTube-8M challenge; TSN is the classic 2D-CNN-based solution; TSM is a simple, efficient temporal-shift-based spatial-temporal modeling method; Non-local introduced non-local correlation modeling for video. Attention Cluster and StNet are Baidu models, published at CVPR 2018 and AAAI 2019 respectively, and were used in the first-place entry of the Kinetics600 challenge. C-TCN, also developed by Baidu, was the winning solution of the 2018 ActivityNet challenge.
- Provides a common skeleton for video classification and action localization tasks, so users can configure a model and run training and evaluation with a single command.
## Installation
Running the sample code in this model library requires PaddlePaddle Fluid v1.5.0 or later. If the PaddlePaddle in your environment is older than this, please update it following the [installation guide](http://www.paddlepaddle.org/documentation/docs/zh/1.5/beginners_guide/install/index_cn.html). An illustrative pip command is sketched below.
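If you install via pip, a typical command is sketched below; the exact package name and version spec (CPU vs. GPU build, CUDA/cuDNN variant) are assumptions here, so take the authoritative command from the installation guide above.
``` bash
# Sketch: install the GPU build of PaddlePaddle Fluid 1.5.0
# (package/version are assumptions; consult the installation guide
# for the build matching your CUDA/cuDNN setup)
pip install paddlepaddle-gpu==1.5.0
```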
## Data preparation
The video model library uses the YouTube-8M and Kinetics datasets; see the [data instructions](./dataset/README.md) for how to use them.
## Quick start
The library provides a common train/test/infer framework: pass the model name, config file, and other options to `train.py`/`test.py`/`infer.py` to launch training or prediction with a single command.
Take StNet as an example.
Single-GPU training:
``` bash
export CUDA_VISIBLE_DEVICES=0
python train.py --model_name=STNET \
                --config=./configs/stnet.txt \
                --save_dir=checkpoints
```
Multi-GPU training:
``` bash
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
python train.py --model_name=STNET \
                --config=./configs/stnet.txt \
                --save_dir=checkpoints
```
The library also ships quick-start training scripts under `scripts/train`; training can be launched with:
``` bash
bash scripts/train/train_stnet.sh
```
- Please adjust the `num_gpus` and `batch_size` settings in the `config` file to match the number of GPUs selected via `CUDA_VISIBLE_DEVICES`; an illustrative config excerpt is shown below.
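The files under `configs/` are INI-style text parsed into sections (see `merge_configs`/`get_config_from_sec` in the code later on this page). The excerpt below is only a sketch: apart from `num_gpus` and `batch_size`, which the note above refers to, the section and field layout is an assumption and differs per model.
``` ini
; hypothetical excerpt from configs/stnet.txt -- illustrative only
[TRAIN]
num_gpus = 8
batch_size = 128
```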
### Note
On Windows GPU environments, replace [fluid.ParallelExecutor](http://paddlepaddle.org/documentation/docs/zh/1.4/api_cn/fluid_cn.html#parallelexecutor) in the sample code with [fluid.Executor](http://paddlepaddle.org/documentation/docs/zh/1.4/api_cn/fluid_cn.html#executor); a sketch of the substitution follows.
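A minimal sketch of the substitution, reusing the same calls that appear in the training code later on this page; `feeder`, `avg_cost`, and `data` are assumed to be defined as in `train.py`:
``` python
import paddle.fluid as fluid

place = fluid.CUDAPlace(0)
exe = fluid.Executor(place)  # single-device Executor instead of ParallelExecutor
exe.run(fluid.default_startup_program())

# one training step; feeder/avg_cost/data come from the surrounding training script
loss, = exe.run(fluid.default_main_program(),
                feed=feeder.feed(data),
                fetch_list=[avg_cost.name])
```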
## Library structure
### Code structure
```
configs/
stnet.txt
tsn.txt
...
dataset/
youtube/
kinetics/
datareader/
feature_reader.py
kinetics_reader.py
...
metrics/
kinetics/
youtube8m/
...
models/
stnet/
tsn/
...
scripts/
train/
test/
train.py
test.py
infer.py
```
- `configs`: config file templates for each model
- `datareader`: data readers for the YouTube-8M and Kinetics datasets
- `metrics`: evaluation scripts for the YouTube-8M and Kinetics datasets
- `models`: network definitions for each model
- `scripts`: quick-start training and evaluation scripts for each model
- `train.py`: one-command training script; launch training by specifying the model name, config file, and so on
- `test.py`: one-command evaluation script; launch evaluation by specifying the model name, config file, model weights, and so on (see the example invocation below)
- `infer.py`: one-command inference script; launch inference by specifying the model name, config file, model weights, a list of files to run inference on, and so on
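For example, evaluating StNet might look like the following; the `--weights` flag name and the checkpoint path are assumptions based on common usage of this framework, so verify the exact options with `python test.py --help`:
``` bash
# sketch: evaluate a trained StNet checkpoint (flag names assumed)
python test.py --model_name=STNET \
               --config=./configs/stnet.txt \
               --weights=./checkpoints/STNET_epoch0
```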
## Model Zoo
- Models on the YouTube-8M dataset:

| Model | Batch Size | Hardware | cuDNN version | GAP | Download |
| :-------: | :---: | :---------: | :-----: | :----: | :----------: |
| Attention Cluster | 2048 | 8x P40 | 7.1 | 0.84 | [model](https://paddlemodels.bj.bcebos.com/video_classification/attention_cluster_youtube8m.tar.gz) |
| Attention LSTM | 1024 | 8x P40 | 7.1 | 0.86 | [model](https://paddlemodels.bj.bcebos.com/video_classification/attention_lstm_youtube8m.tar.gz) |
| NeXtVLAD | 160 | 4x P40 | 7.1 | 0.87 | [model](https://paddlemodels.bj.bcebos.com/video_classification/nextvlad_youtube8m.tar.gz) |
- Models on the Kinetics dataset:

| Model | Batch Size | Hardware | cuDNN version | Top-1 | Download |
| :-------: | :---: | :---------: | :----: | :----: | :----------: |
| StNet | 128 | 8x P40 | 7.1 | 0.69 | [model](https://paddlemodels.bj.bcebos.com/video_classification/stnet_kinetics.tar.gz) |
| TSN | 256 | 8x P40 | 7.1 | 0.67 | [model](https://paddlemodels.bj.bcebos.com/video_classification/tsn_kinetics.tar.gz) |
| TSM | 128 | 8x P40 | 7.1 | 0.70 | [model](https://paddlemodels.bj.bcebos.com/video_classification/tsm_kinetics.tar.gz) |
| Non-local | 64 | 8x P40 | 7.1 | 0.74 | [model](https://paddlemodels.bj.bcebos.com/video_classification/nonlocal_kinetics.tar.gz) |
- Action localization models on ActivityNet:

| Model | Batch Size | Hardware | cuDNN version | MAP | Download |
| :-------: | :---: | :---------: | :----: | :----: | :----------: |
| C-TCN | 16 | 8x P40 | 7.1 | 0.31 | [model](https://paddlemodels.bj.bcebos.com/video_detection/ctcn.tar.gz) |
## References
- [Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification](https://arxiv.org/abs/1711.09550), Xiang Long, Chuang Gan, Gerard de Melo, Jiajun Wu, Xiao Liu, Shilei Wen
- [Beyond Short Snippets: Deep Networks for Video Classification](https://arxiv.org/abs/1503.08909), Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, George Toderici
- [NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification](https://arxiv.org/abs/1811.05014), Rongcheng Lin, Jing Xiao, Jianping Fan
- [StNet: Local and Global Spatial-Temporal Modeling for Human Action Recognition](https://arxiv.org/abs/1811.01549), Dongliang He, Zhichao Zhou, Chuang Gan, Fu Li, Xiao Liu, Yandong Li, Limin Wang, Shilei Wen
- [Temporal Segment Networks: Towards Good Practices for Deep Action Recognition](https://arxiv.org/abs/1608.00859), Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc Van Gool
- [Temporal Shift Module for Efficient Video Understanding](https://arxiv.org/abs/1811.08383v1), Ji Lin, Chuang Gan, Song Han
- [Non-local Neural Networks](https://arxiv.org/abs/1711.07971v1), Xiaolong Wang, Ross Girshick, Abhinav Gupta, Kaiming He
## Changelog
- 3/2019: Added the model library with five video classification models: Attention Cluster, Attention LSTM, NeXtVLAD, StNet, and TSN.
- 4/2019: Released the Non-local and TSM video classification models.
- 6/2019: Released the C-TCN action localization model; added C2D ResNet101 and I3D ResNet50 backbones to Non-local; optimized speed and GPU memory usage of NeXtVLAD and TSM.
@@ -5,11 +5,26 @@ import sys
 sys.path.append(os.environ['ceroot'])
 from kpi import CostKpi, DurationKpi

-train_cost_card1_kpi = CostKpi('train_cost_card1', 0.08, 0, actived=True, desc='train cost')
-train_speed_card1_kpi = DurationKpi('train_speed_card1', 0.08, 0, actived=True, desc='train speed in one GPU card')
-train_cost_card4_kpi = CostKpi('train_cost_card4', 0.08, 0, actived=True, desc='train cost')
-train_speed_card4_kpi = DurationKpi('train_speed_card4', 0.3, 0, actived=True, desc='train speed in four GPU card')
-tracking_kpis = [train_cost_card1_kpi, train_speed_card1_kpi, train_cost_card4_kpi, train_speed_card4_kpi]
+train_cost_card1_kpi = CostKpi(
+    'train_cost_card1', 0.08, 0, actived=True, desc='train cost')
+train_speed_card1_kpi = DurationKpi(
+    'train_speed_card1',
+    0.08,
+    0,
+    actived=True,
+    desc='train speed in one GPU card')
+train_cost_card4_kpi = CostKpi(
+    'train_cost_card4', 0.08, 0, actived=True, desc='train cost')
+train_speed_card4_kpi = DurationKpi(
+    'train_speed_card4',
+    0.3,
+    0,
+    actived=True,
+    desc='train speed in four GPU card')
+tracking_kpis = [
+    train_cost_card1_kpi, train_speed_card1_kpi, train_cost_card4_kpi,
+    train_speed_card4_kpi
+]

 def parse_log(log):
@@ -62,7 +62,8 @@ def merge_configs(cfg, sec, args_dict):

 def print_configs(cfg, mode):
-    logger.info("---------------- {:>5} Arguments ----------------".format(mode))
+    logger.info("---------------- {:>5} Arguments ----------------".format(
+        mode))
     for sec, sec_items in cfg.items():
         logger.info("{}:".format(sec))
         for k, v in sec_items.items():
@@ -64,7 +64,8 @@ class KineticsReader(DataReader):
         self.seg_num = self.get_config_from_sec(mode, 'seg_num', self.seg_num)
         self.short_size = self.get_config_from_sec(mode, 'short_size')
         self.target_size = self.get_config_from_sec(mode, 'target_size')
-        self.num_reader_threads = self.get_config_from_sec(mode, 'num_reader_threads')
+        self.num_reader_threads = self.get_config_from_sec(mode,
+                                                           'num_reader_threads')
         self.buf_size = self.get_config_from_sec(mode, 'buf_size')
         self.enable_ce = self.get_config_from_sec(mode, 'enable_ce')
@@ -99,7 +100,6 @@ class KineticsReader(DataReader):
         return _batch_reader
-
     def _reader_creator(self,
                         pickle_list,
                         mode,
@@ -113,8 +113,8 @@ class KineticsReader(DataReader):
                         num_threads=1,
                         buf_size=1024,
                         format='pkl'):
-        def decode_mp4(sample, mode, seg_num, seglen, short_size, target_size, img_mean,
-                       img_std):
+        def decode_mp4(sample, mode, seg_num, seglen, short_size, target_size,
+                       img_mean, img_std):
             sample = sample[0].split(' ')
             mp4_path = sample[0]
             # when infer, we store vid as label
@@ -122,8 +122,8 @@ class KineticsReader(DataReader):
             try:
                 imgs = mp4_loader(mp4_path, seg_num, seglen, mode)
                 if len(imgs) < 1:
-                    logger.error('{} frame length {} less than 1.'.format(mp4_path,
-                                                                          len(imgs)))
+                    logger.error('{} frame length {} less than 1.'.format(
+                        mp4_path, len(imgs)))
                     return None, None
             except:
                 logger.error('Error when loading {}'.format(mp4_path))
@@ -132,20 +132,20 @@ class KineticsReader(DataReader):
             return imgs_transform(imgs, label, mode, seg_num, seglen, \
                              short_size, target_size, img_mean, img_std)

-        def decode_pickle(sample, mode, seg_num, seglen, short_size, target_size,
-                          img_mean, img_std):
+        def decode_pickle(sample, mode, seg_num, seglen, short_size,
+                          target_size, img_mean, img_std):
             pickle_path = sample[0]
             try:
                 if python_ver < (3, 0):
                     data_loaded = pickle.load(open(pickle_path, 'rb'))
                 else:
-                    data_loaded = pickle.load(open(pickle_path, 'rb'), encoding='bytes')
+                    data_loaded = pickle.load(
+                        open(pickle_path, 'rb'), encoding='bytes')
                 vid, label, frames = data_loaded
                 if len(frames) < 1:
-                    logger.error('{} frame length {} less than 1.'.format(pickle_path,
-                                                                          len(frames)))
+                    logger.error('{} frame length {} less than 1.'.format(
+                        pickle_path, len(frames)))
                     return None, None
             except:
                 logger.info('Error when loading {}'.format(pickle_path))
@@ -160,9 +160,8 @@ class KineticsReader(DataReader):
             return imgs_transform(imgs, ret_label, mode, seg_num, seglen, \
                              short_size, target_size, img_mean, img_std)

-        def imgs_transform(imgs, label, mode, seg_num, seglen, short_size, target_size,
-                           img_mean, img_std):
+        def imgs_transform(imgs, label, mode, seg_num, seglen, short_size,
+                           target_size, img_mean, img_std):
             imgs = group_scale(imgs, short_size)

             if mode == 'train':
@@ -182,11 +181,11 @@ class KineticsReader(DataReader):
                 imgs = np_imgs
             imgs -= img_mean
             imgs /= img_std
-            imgs = np.reshape(imgs, (seg_num, seglen * 3, target_size, target_size))
+            imgs = np.reshape(imgs,
+                              (seg_num, seglen * 3, target_size, target_size))

             return imgs, label

         def reader():
             with open(pickle_list) as flist:
                 lines = [line.strip() for line in flist]
@@ -229,8 +228,14 @@ def group_multi_scale_crop(img_group, target_size, scales=None, \
     base_size = min(image_w, image_h)
     crop_sizes = [int(base_size * x) for x in scales]

-    crop_h = [input_size[1] if abs(x - input_size[1]) < 3 else x for x in crop_sizes]
-    crop_w = [input_size[0] if abs(x - input_size[0]) < 3 else x for x in crop_sizes]
+    crop_h = [
+        input_size[1] if abs(x - input_size[1]) < 3 else x
+        for x in crop_sizes
+    ]
+    crop_w = [
+        input_size[0] if abs(x - input_size[0]) < 3 else x
+        for x in crop_sizes
+    ]

     pairs = []
     for i, h in enumerate(crop_h):
@@ -273,8 +278,14 @@ def group_multi_scale_crop(img_group, target_size, scales=None, \
         return crop_pair[0], crop_pair[1], w_offset, h_offset

     crop_w, crop_h, offset_w, offset_h = _sample_crop_size(im_size)

-    crop_img_group = [img.crop((offset_w, offset_h, offset_w + crop_w, offset_h + crop_h)) for img in img_group]
-    ret_img_group = [img.resize((input_size[0], input_size[1]), Image.BILINEAR) for img in crop_img_group]
+    crop_img_group = [
+        img.crop((offset_w, offset_h, offset_w + crop_w, offset_h + crop_h))
+        for img in img_group
+    ]
+    ret_img_group = [
+        img.resize((input_size[0], input_size[1]), Image.BILINEAR)
+        for img in crop_img_group
+    ]

     return ret_img_group
@@ -52,7 +52,6 @@ class DataReader(object):
         return self.cfg[sec.upper()].get(item, default)

-
 class ReaderZoo(object):
     def __init__(self):
         self.reader_zoo = {}
 # C-TCN model: data usage notes
-The C-TCN model uses the ActivityNet 1.3 dataset; for how to download it, see the official [download instructions](http://activity-net.org/index.html). To train this model, features must first be extracted from the source mp4 files with a trained TSN model; RGB and Optical Flow features are extracted separately and stored in pickle format. We will provide a download link for the converted data. The converted data files are laid out as:
+The C-TCN model uses the ActivityNet 1.3 dataset; for how to download it, see the official [download instructions](http://activity-net.org/index.html). To train this model, RGB and Flow features must first be extracted from the source mp4 files, and a trained TSN model is then used to distill them into abstract feature data, stored in pickle format. We will provide a download link for the converted data. The converted data files are laid out as:
 ```
 data
@@ -18,10 +18,8 @@ import paddle.fluid as fluid

 class LogisticModel(object):
     """Logistic model."""

-    def build_model(self,
-                    model_input,
-                    vocab_size,
-                    **unused_params):
+    def build_model(self, model_input, vocab_size, **unused_params):
         """Creates a logistic model.

         Args:
@@ -147,5 +147,7 @@ class AttentionLSTM(ModelBase):
     ]

     def weights_info(self):
-        return ('attention_lstm_youtube8m',
-                'https://paddlemodels.bj.bcebos.com/video_classification/attention_lstm_youtube8m.tar.gz')
+        return (
+            'attention_lstm_youtube8m',
+            'https://paddlemodels.bj.bcebos.com/video_classification/attention_lstm_youtube8m.tar.gz'
+        )
@@ -62,7 +62,9 @@ class LSTMAttentionModel(object):
             input=[lstm_forward, lstm_backward], axis=1)

         lstm_dropout = fluid.layers.dropout(
-            x=lstm_concat, dropout_prob=self.drop_rate, is_test=(not is_training))
+            x=lstm_concat,
+            dropout_prob=self.drop_rate,
+            is_test=(not is_training))

         lstm_weight = fluid.layers.fc(
             input=lstm_dropout,
@@ -61,8 +61,8 @@ C-TCN的训练数据采用ActivityNet1.3提供的数据集,数据下载及准
 With the following parameter values, evaluation accuracy on the ActivityNet 1.3 dataset is:

-| score\_thresh | nms\_thresh | soft\_sigma | soft\_thresh | Top-1 |
-| :-----------: | :---------: | :---------: | :----------: | :----: |
+| score\_thresh | nms\_thresh | soft\_sigma | soft\_thresh | MAP |
+| :-----------: | :---------: | :---------: | :----------: | :---: |
 | 0.001 | 0.8 | 0.9 | 0.004 | 31% |
@@ -79,4 +79,3 @@ NeXtVLAD模型使用2nd-Youtube-8M数据集, 数据下载及准备请参考[数
 ## Reference papers
 - [NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification](https://arxiv.org/abs/1811.05014), Rongcheng Lin, Jing Xiao, Jianping Fan
@@ -46,7 +46,8 @@ class STNET(ModelBase):
                                                         'l2_weight_decay')
         self.momentum = self.get_config_from_sec('train', 'momentum')

-        self.seg_num = self.get_config_from_sec(self.mode, 'seg_num', self.seg_num)
+        self.seg_num = self.get_config_from_sec(self.mode, 'seg_num',
+                                                self.seg_num)
         self.target_size = self.get_config_from_sec(self.mode, 'target_size')
         self.batch_size = self.get_config_from_sec(self.mode, 'batch_size')
@@ -127,11 +128,16 @@ class STNET(ModelBase):
     ]

     def pretrain_info(self):
-        return ('ResNet50_pretrained', 'https://paddlemodels.bj.bcebos.com/video_classification/ResNet50_pretrained.tar.gz')
+        return (
+            'ResNet50_pretrained',
+            'https://paddlemodels.bj.bcebos.com/video_classification/ResNet50_pretrained.tar.gz'
+        )

     def weights_info(self):
-        return ('stnet_kinetics',
-                'https://paddlemodels.bj.bcebos.com/video_classification/stnet_kinetics.tar.gz')
+        return (
+            'stnet_kinetics',
+            'https://paddlemodels.bj.bcebos.com/video_classification/stnet_kinetics.tar.gz'
+        )

     def load_pretrain_params(self, exe, pretrain, prog, place):
         def is_parameter(var):
@@ -139,7 +145,9 @@ class STNET(ModelBase):
             return isinstance(var, fluid.framework.Parameter) and (not ("fc_0" in var.name)) \
                 and (not ("batch_norm" in var.name)) and (not ("xception" in var.name)) and (not ("conv3d" in var.name))

-        logger.info("Load pretrain weights from {}, exclude fc, batch_norm, xception, conv3d layers.".format(pretrain))
+        logger.info(
+            "Load pretrain weights from {}, exclude fc, batch_norm, xception, conv3d layers.".
+            format(pretrain))
         vars = filter(is_parameter, prog.list_vars())
         fluid.io.load_vars(exe, pretrain, vars=vars, main_program=prog)
@@ -122,17 +122,23 @@ class TSM(ModelBase):
     ]

     def pretrain_info(self):
-        return ('ResNet50_pretrained', 'https://paddlemodels.bj.bcebos.com/video_classification/ResNet50_pretrained.tar.gz')
+        return (
+            'ResNet50_pretrained',
+            'https://paddlemodels.bj.bcebos.com/video_classification/ResNet50_pretrained.tar.gz'
+        )

     def weights_info(self):
-        return ('tsm_kinetics',
-                'https://paddlemodels.bj.bcebos.com/video_classification/tsm_kinetics.tar.gz')
+        return (
+            'tsm_kinetics',
+            'https://paddlemodels.bj.bcebos.com/video_classification/tsm_kinetics.tar.gz'
+        )

     def load_pretrain_params(self, exe, pretrain, prog, place):
         def is_parameter(var):
-            return isinstance(var, fluid.framework.Parameter) and (not ("fc_0" in var.name))
+            return isinstance(var, fluid.framework.Parameter) and (
+                not ("fc_0" in var.name))

-        logger.info("Load pretrain weights from {}, exclude fc layer.".format(pretrain))
+        logger.info("Load pretrain weights from {}, exclude fc layer.".format(
+            pretrain))
         vars = filter(is_parameter, prog.list_vars())
         fluid.io.load_vars(exe, pretrain, vars=vars, main_program=prog)
@@ -45,19 +45,21 @@ class TSM_ResNet():
             padding=(filter_size - 1) // 2,
             groups=groups,
             act=None,
-            param_attr=fluid.param_attr.ParamAttr(name=name+"_weights"),
+            param_attr=fluid.param_attr.ParamAttr(name=name + "_weights"),
             bias_attr=False)
         if name == "conv1":
             bn_name = "bn_" + name
         else:
             bn_name = "bn" + name[3:]
-        return fluid.layers.batch_norm(input=conv, act=act,
-                                       is_test=(not self.is_training),
-                                       param_attr=fluid.param_attr.ParamAttr(name=bn_name+"_scale"),
-                                       bias_attr=fluid.param_attr.ParamAttr(bn_name+'_offset'),
-                                       moving_mean_name=bn_name+"_mean",
-                                       moving_variance_name=bn_name+'_variance')
+        return fluid.layers.batch_norm(
+            input=conv,
+            act=act,
+            is_test=(not self.is_training),
+            param_attr=fluid.param_attr.ParamAttr(name=bn_name + "_scale"),
+            bias_attr=fluid.param_attr.ParamAttr(bn_name + '_offset'),
+            moving_mean_name=bn_name + "_mean",
+            moving_variance_name=bn_name + '_variance')

     def shortcut(self, input, ch_out, stride, name):
         ch_in = input.shape[1]
@@ -70,18 +72,27 @@ class TSM_ResNet():
         shifted = self.shift_module(input)

         conv0 = self.conv_bn_layer(
-            input=shifted, num_filters=num_filters, filter_size=1, act='relu',
-            name=name+"_branch2a")
+            input=shifted,
+            num_filters=num_filters,
+            filter_size=1,
+            act='relu',
+            name=name + "_branch2a")
         conv1 = self.conv_bn_layer(
             input=conv0,
             num_filters=num_filters,
             filter_size=3,
             stride=stride,
-            act='relu', name=name+"_branch2b")
+            act='relu',
+            name=name + "_branch2b")
         conv2 = self.conv_bn_layer(
-            input=conv1, num_filters=num_filters * 4, filter_size=1, act=None, name=name+"_branch2c")
+            input=conv1,
+            num_filters=num_filters * 4,
+            filter_size=1,
+            act=None,
+            name=name + "_branch2c")

-        short = self.shortcut(input, num_filters * 4, stride, name=name+"_branch1")
+        short = self.shortcut(
+            input, num_filters * 4, stride, name=name + "_branch1")

         return fluid.layers.elementwise_add(x=short, y=conv2, act='relu')
@@ -109,7 +120,12 @@ class TSM_ResNet():
         num_filters = [64, 128, 256, 512]

         conv = self.conv_bn_layer(
-            input=input, num_filters=64, filter_size=7, stride=2, act='relu', name='conv1')
+            input=input,
+            num_filters=64,
+            filter_size=7,
+            stride=2,
+            act='relu',
+            name='conv1')
         conv = fluid.layers.pool2d(
             input=conv,
             pool_size=3,
@@ -121,11 +137,11 @@ class TSM_ResNet():
             for i in range(depth[block]):
                 if layers in [101, 152] and block == 2:
                     if i == 0:
-                        conv_name = "res" + str(block+2) + "a"
+                        conv_name = "res" + str(block + 2) + "a"
                     else:
-                        conv_name = "res" + str(block+2) + "b" + str(i)
+                        conv_name = "res" + str(block + 2) + "b" + str(i)
                 else:
-                    conv_name = "res" + str(block+2) + chr(97+i)
+                    conv_name = "res" + str(block + 2) + chr(97 + i)
                 conv = self.bottleneck_block(
                     input=conv,
@@ -136,7 +152,8 @@ class TSM_ResNet():
         pool = fluid.layers.pool2d(
             input=conv, pool_size=7, pool_type='avg', global_pooling=True)

-        dropout = fluid.layers.dropout(x=pool, dropout_prob=0.5, is_test=(not self.is_training))
+        dropout = fluid.layers.dropout(
+            x=pool, dropout_prob=0.5, is_test=(not self.is_training))

         feature = fluid.layers.reshape(
             x=dropout, shape=[-1, seg_num, pool.shape[1]])
@@ -149,6 +166,7 @@ class TSM_ResNet():
                               param_attr=fluid.param_attr.ParamAttr(
                                   initializer=fluid.initializer.Uniform(-stdv,
                                                                         stdv)),
-                              bias_attr=fluid.param_attr.ParamAttr(learning_rate=2.0,
+                              bias_attr=fluid.param_attr.ParamAttr(
+                                  learning_rate=2.0,
                                   regularizer=fluid.regularizer.L2Decay(0.)))
         return out
@@ -47,7 +47,8 @@ class TSN(ModelBase):
                                                         'l2_weight_decay')
         self.momentum = self.get_config_from_sec('train', 'momentum')

-        self.seg_num = self.get_config_from_sec(self.mode, 'seg_num', self.seg_num)
+        self.seg_num = self.get_config_from_sec(self.mode, 'seg_num',
+                                                self.seg_num)
         self.target_size = self.get_config_from_sec(self.mode, 'target_size')
         self.batch_size = self.get_config_from_sec(self.mode, 'batch_size')
@@ -131,17 +132,23 @@ class TSN(ModelBase):
     ]

     def pretrain_info(self):
-        return ('ResNet50_pretrained', 'https://paddlemodels.bj.bcebos.com/video_classification/ResNet50_pretrained.tar.gz')
+        return (
+            'ResNet50_pretrained',
+            'https://paddlemodels.bj.bcebos.com/video_classification/ResNet50_pretrained.tar.gz'
+        )

     def weights_info(self):
-        return ('tsn_kinetics',
-                'https://paddlemodels.bj.bcebos.com/video_classification/tsn_kinetics.tar.gz')
+        return (
+            'tsn_kinetics',
+            'https://paddlemodels.bj.bcebos.com/video_classification/tsn_kinetics.tar.gz'
+        )

     def load_pretrain_params(self, exe, pretrain, prog, place):
         def is_parameter(var):
-            return isinstance(var, fluid.framework.Parameter) and (not ("fc_0" in var.name))
+            return isinstance(var, fluid.framework.Parameter) and (
+                not ("fc_0" in var.name))

-        logger.info("Load pretrain weights from {}, exclude fc layer.".format(pretrain))
+        logger.info("Load pretrain weights from {}, exclude fc layer.".format(
+            pretrain))
         vars = filter(is_parameter, prog.list_vars())
         fluid.io.load_vars(exe, pretrain, vars=vars, main_program=prog)
Note: this project has been moved. Please browse it under the [PaddleCV/PaddleVideo](../PaddleVideo) directory.
# Video Classification Based on Temporal Segment Network
Video classification has drawn significant attention in the past few years. This page introduces how to perform video classification with PaddlePaddle Fluid on the public UCF-101 dataset, based on the state-of-the-art Temporal Segment Network (TSN) method.

---
## Table of Contents
- Installation
- Data preparation
- Training
- Evaluation
- Inference
- Performance
### Installation
Running the sample code in this directory requires PaddlePaddle Fluid v0.13.0 or later. If the PaddlePaddle on your device is lower than this version, please follow the instructions in the [installation document](http://www.paddlepaddle.org/docs/develop/documentation/zh/build_and_install/pip_install_cn.html) and update it.
### Data preparation
#### download UCF-101 dataset
Users can download the UCF-101 dataset with the provided script `data/download.sh`.
#### decode video into frame
To avoid decoding videos during network training, we decode them into frames offline and save them in `pickle` format, which is easy to read from Python.
Users can refer to the script `data/video_decode.py` for video decoding; the core command it issues per clip is sketched below.
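This is the decode step as a standalone command, mirroring the `os.system` call inside `data/video_decode.py`; the input/output paths here are illustrative examples:
```
# one clip's worth of decoding; paths are illustrative
ffmpeg -i UCF-101/ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi \
       -q 0 frame/ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01/%06d.jpg
```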
#### split data into train and test
We follow split 1 of the UCF-101 dataset. After data splitting, there are 9537 videos for training and 3783 videos for validation. The reference script is `data/split_data.py`.
#### save pickle for training
As stated above, we save all data in `pickle` format for training. All information of each video is saved into one pickle file, including the video id, the frames' binary data, and the label. Please refer to the script `data/generate_train_data.py`.
After this step, one gets two directories containing training and testing data in `pickle` format, plus two list files, *train.list* and *test.list*, whose fields on each line are separated by SPACE.
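A quick way to sanity-check one generated pickle; the `(vid, label, frames)` layout matches what `reader.py` unpacks later on this page, while the file name is just an illustrative example:
```
import pickle

# path is an illustrative example of a file produced by data/generate_train_data.py
with open('data/train_pkl/v_ApplyEyeMakeup_g08_c01.pkl', 'rb') as f:
    # Python 3; omit the encoding argument on Python 2
    vid, label, frames = pickle.load(f, encoding='bytes')

print(vid, label, len(frames))  # video id, integer class label, number of encoded frames
```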
### Training
After data preparation, users can start PaddlePaddle Fluid training with:
```
python train.py \
--batch_size=128 \
--total_videos=9537 \
--class_dim=101 \
--num_epochs=60 \
--image_shape=3,224,224 \
--model_save_dir=output/ \
--with_mem_opt=True \
--lr_init=0.01 \
--num_layers=50 \
--seg_num=7 \
--pretrained_model={path_to_pretrained_model}
```
**Parameter introduction:**

- batch_size: the size of each mini-batch.
- total_videos: total number of videos in the training set.
- class_dim: the number of classes in the classification task.
- num_epochs: the number of epochs.
- image_shape: input size of the network.
- model_save_dir: the directory for saving trained models.
- with_mem_opt: whether to use memory optimization or not.
- lr_init: initial learning rate.
- num_layers: the number of layers for ResNet.
- seg_num: the number of segments in TSN.
- pretrained_model: model path for pretraining.
**Data reader introduction:**

The data reader is defined in `reader.py`. Note that we use a group operation for all frames of one video.

**Training:**

The training log looks like:
```
[TRAIN] Pass: 0 trainbatch: 0 loss: 4.630959 acc1: 0.0 acc5: 0.0390625 time: 3.09 sec
[TRAIN] Pass: 0 trainbatch: 10 loss: 4.559069 acc1: 0.0546875 acc5: 0.1171875 time: 3.91 sec
[TRAIN] Pass: 0 trainbatch: 20 loss: 4.040092 acc1: 0.09375 acc5: 0.3515625 time: 3.88 sec
[TRAIN] Pass: 0 trainbatch: 30 loss: 3.478214 acc1: 0.3203125 acc5: 0.5546875 time: 3.32 sec
[TRAIN] Pass: 0 trainbatch: 40 loss: 3.005404 acc1: 0.3515625 acc5: 0.6796875 time: 3.33 sec
[TRAIN] Pass: 0 trainbatch: 50 loss: 2.585245 acc1: 0.4609375 acc5: 0.7265625 time: 3.13 sec
[TRAIN] Pass: 0 trainbatch: 60 loss: 2.151489 acc1: 0.4921875 acc5: 0.8203125 time: 3.35 sec
[TRAIN] Pass: 0 trainbatch: 70 loss: 1.981680 acc1: 0.578125 acc5: 0.8359375 time: 3.30 sec
```
### Evaluation
Evaluation measures the performance of a trained model. One can download a pretrained model and set its path as `test_model`. Top-1/Top-5 accuracy can then be obtained by running the following command:
```
python eval.py \
--batch_size=128 \
--class_dim=101 \
--image_shape=3,224,224 \
--with_mem_opt=True \
--num_layers=50 \
--seg_num=7 \
--test_model={path_to_pretrained_model}
```
With this evaluation configuration, the output log looks like:
```
[TEST] Pass: 0 testbatch: 0 loss: 0.011551 acc1: 1.0 acc5: 1.0 time: 0.48 sec
[TEST] Pass: 0 testbatch: 10 loss: 0.710330 acc1: 0.75 acc5: 1.0 time: 0.49 sec
[TEST] Pass: 0 testbatch: 20 loss: 0.000547 acc1: 1.0 acc5: 1.0 time: 0.48 sec
[TEST] Pass: 0 testbatch: 30 loss: 0.036623 acc1: 1.0 acc5: 1.0 time: 0.48 sec
[TEST] Pass: 0 testbatch: 40 loss: 0.138705 acc1: 1.0 acc5: 1.0 time: 0.48 sec
[TEST] Pass: 0 testbatch: 50 loss: 0.056909 acc1: 1.0 acc5: 1.0 time: 0.49 sec
[TEST] Pass: 0 testbatch: 60 loss: 0.742937 acc1: 0.75 acc5: 1.0 time: 0.49 sec
[TEST] Pass: 0 testbatch: 70 loss: 1.720186 acc1: 0.5 acc5: 0.875 time: 0.48 sec
[TEST] Pass: 0 testbatch: 80 loss: 0.199669 acc1: 0.875 acc5: 1.0 time: 0.48 sec
[TEST] Pass: 0 testbatch: 90 loss: 0.195510 acc1: 1.0 acc5: 1.0 time: 0.48 sec
```
### Inference
Inference is used to get prediction scores or video features from trained models.
```
python infer.py \
--class_dim=101 \
--image_shape=3,224,224 \
--with_mem_opt=True \
--num_layers=50 \
--seg_num=7 \
--test_model={path_to_pretrained_model}
```
The output contains the prediction results, including the maximum score (before softmax) and the corresponding predicted label.
```
Test sample: PlayingGuitar_g01_c03, score: [21.418629], class [62]
Test sample: SalsaSpin_g05_c06, score: [13.238657], class [76]
Test sample: TrampolineJumping_g04_c01, score: [21.722862], class [93]
Test sample: JavelinThrow_g01_c04, score: [16.27892], class [44]
Test sample: PlayingTabla_g01_c01, score: [15.366951], class [65]
Test sample: ParallelBars_g04_c07, score: [18.42596], class [56]
Test sample: PlayingCello_g05_c05, score: [18.795723], class [58]
Test sample: LongJump_g03_c04, score: [7.100088], class [50]
Test sample: SkyDiving_g06_c03, score: [15.144707], class [82]
Test sample: UnevenBars_g07_c04, score: [22.114838], class [95]
```
### Performance
Configuration | Top-1 acc
------------- | ---------------:
seg=7, size=224 | 0.859
seg=10, size=224 | 0.863
Note: this project has been moved. Please browse it under the [PaddleCV/PaddleVideo](../PaddleVideo) directory.
# Download the dataset
echo "Downloading..."
wget http://crcv.ucf.edu/data/UCF101/UCF101.rar
wget http://crcv.ucf.edu/data/UCF101/UCF101TrainTestSplits-RecognitionTask.zip
# Extract the data.
echo "Extracting..."
unrar x UCF101.rar
unzip UCF101TrainTestSplits-RecognitionTask.zip
import os
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3

# read class file
dd = {}
f = open('ucfTrainTestlist/classInd.txt')
for line in f.readlines():
    label, name = line.split()
    dd[name.lower()] = int(label) - 1
f.close()


def generate_pkl(mode):
    # generate pkl
    path = '%s/' % mode
    savepath = '%s_pkl/' % mode
    if not os.path.exists(savepath):
        os.makedirs(savepath)
    fw = open('%s.list' % mode, 'w')
    for folder in os.listdir(path):
        vidid = folder.split('_', 1)[1]
        this_label = dd[folder.split('_')[1].lower()]
        this_feat = []
        for img in sorted(os.listdir(path + folder)):
            fout = open(path + folder + '/' + img, 'rb')
            this_feat.append(fout.read())
            fout.close()
        res = [vidid, this_label, this_feat]
        outp = open(savepath + vidid + '.pkl', 'wb')
        pickle.dump(res, outp, protocol=pickle.HIGHEST_PROTOCOL)
        outp.close()
        fw.write('data/%s/%s.pkl\n' % (savepath, vidid))
    fw.close()


generate_pkl('train')
generate_pkl('test')
import os
import shutil

# set path
train_path = 'train/'
if not os.path.exists(train_path):
    os.makedirs(train_path)
test_path = 'test/'
if not os.path.exists(test_path):
    os.makedirs(test_path)

# move data
frame_dir = 'frame/'
f = open('ucfTrainTestlist/trainlist01.txt')
for line in f.readlines():
    folder = line.split('.')[0]
    vidid = folder.split('/')[-1]
    shutil.move(frame_dir + folder, train_path + vidid)
f.close()

f = open('ucfTrainTestlist/testlist01.txt')
for line in f.readlines():
    folder = line.split('.')[0]
    vidid = folder.split('/')[-1]
    shutil.move(frame_dir + folder, test_path + vidid)
f.close()
import os, sys
import shutil


def decode():
    path = './UCF-101/'
    for folder in os.listdir(path):
        for vid in os.listdir(path + folder):
            print(vid)
            video_path = path + folder + '/' + vid
            image_folder = './frame/' + folder + '/' + vid.split('.')[0] + '/'
            if not os.path.exists(image_folder):
                os.makedirs(image_folder)
            os.system('./ffmpeg -i ' + video_path + ' -q 0 ' + image_folder +
                      '/%06d.jpg')


if __name__ == '__main__':
    decode()
import os
import numpy as np
import time
import sys
import paddle
import paddle.fluid as fluid
from resnet import TSN_ResNet
import reader
import argparse
import functools
from paddle.fluid.framework import Parameter
from utility import add_arguments, print_arguments

parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
add_arg('batch_size', int, 128, "Minibatch size.")
add_arg('num_layers', int, 50, "How many layers for ResNet model.")
add_arg('with_mem_opt', bool, True, "Whether to use memory optimization or not.")
add_arg('class_dim', int, 101, "Number of class.")
add_arg('seg_num', int, 7, "Number of segments.")
add_arg('image_shape', str, "3,224,224", "Input image size.")
add_arg('test_model', str, None, "Test model path.")
# yapf: enable


def eval(args):
    # parameters from arguments
    seg_num = args.seg_num
    class_dim = args.class_dim
    num_layers = args.num_layers
    batch_size = args.batch_size
    test_model = args.test_model
    if test_model is None:
        print('Please specify the test model ...')
        return

    image_shape = [int(m) for m in args.image_shape.split(",")]
    image_shape = [seg_num] + image_shape

    # model definition
    model = TSN_ResNet(layers=num_layers, seg_num=seg_num)
    image = fluid.layers.data(name='image', shape=image_shape, dtype='float32')
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')

    out = model.net(input=image, class_dim=class_dim)
    cost = fluid.layers.cross_entropy(input=out, label=label)

    avg_cost = fluid.layers.mean(x=cost)
    acc_top1 = fluid.layers.accuracy(input=out, label=label, k=1)
    acc_top5 = fluid.layers.accuracy(input=out, label=label, k=5)

    # for test
    inference_program = fluid.default_main_program().clone(for_test=True)

    if args.with_mem_opt:
        fluid.memory_optimize(fluid.default_main_program())

    place = fluid.CUDAPlace(0)
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())

    def is_parameter(var):
        return isinstance(var, Parameter)

    if test_model is not None:
        vars = filter(is_parameter, inference_program.list_vars())
        fluid.io.load_vars(exe, test_model, vars=vars)

    # reader
    test_reader = paddle.batch(reader.test(seg_num), batch_size=batch_size // 16)
    feeder = fluid.DataFeeder(place=place, feed_list=[image, label])

    fetch_list = [avg_cost.name, acc_top1.name, acc_top5.name]

    # test
    cnt = 0
    pass_id = 0
    test_info = [[], [], []]
    for batch_id, data in enumerate(test_reader()):
        t1 = time.time()
        loss, acc1, acc5 = exe.run(inference_program,
                                   fetch_list=fetch_list,
                                   feed=feeder.feed(data))
        t2 = time.time()
        period = t2 - t1
        loss = np.mean(loss)
        acc1 = np.mean(acc1)
        acc5 = np.mean(acc5)
        test_info[0].append(loss * len(data))
        test_info[1].append(acc1 * len(data))
        test_info[2].append(acc5 * len(data))
        cnt += len(data)
        if batch_id % 10 == 0:
            print(
                "[TEST] Pass: {0}\ttestbatch: {1}\tloss: {2}\tacc1: {3}\tacc5: {4}\ttime: {5}"
                .format(pass_id, batch_id, '%.6f' % loss, acc1, acc5,
                        "%2.2f sec" % period))
            sys.stdout.flush()

    test_loss = np.sum(test_info[0]) / cnt
    test_acc1 = np.sum(test_info[1]) / cnt
    test_acc5 = np.sum(test_info[2]) / cnt

    print("+ End pass: {0}, test_loss: {1}, test_acc1: {2}, test_acc5: {3}"
          .format(pass_id, '%.3f' % test_loss, '%.3f' % test_acc1, '%.3f' %
                  test_acc5))
    sys.stdout.flush()


def main():
    args = parser.parse_args()
    print_arguments(args)
    eval(args)


if __name__ == '__main__':
    main()
import os
import numpy as np
import time
import sys
import paddle
import paddle.fluid as fluid
from resnet import TSN_ResNet
import reader
import argparse
import functools
from paddle.fluid.framework import Parameter
from utility import add_arguments, print_arguments

parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
add_arg('num_layers', int, 50, "How many layers for ResNet model.")
add_arg('with_mem_opt', bool, True, "Whether to use memory optimization or not.")
add_arg('class_dim', int, 101, "Number of class.")
add_arg('seg_num', int, 7, "Number of segments.")
add_arg('image_shape', str, "3,224,224", "Input image size.")
add_arg('test_model', str, None, "Test model path.")
# yapf: enable


def infer(args):
    # parameters from arguments
    seg_num = args.seg_num
    class_dim = args.class_dim
    num_layers = args.num_layers
    test_model = args.test_model
    if test_model is None:
        print('Please specify the test model ...')
        return

    image_shape = [int(m) for m in args.image_shape.split(",")]
    image_shape = [seg_num] + image_shape

    # model definition
    model = TSN_ResNet(layers=num_layers, seg_num=seg_num)
    image = fluid.layers.data(name='image', shape=image_shape, dtype='float32')
    out = model.net(input=image, class_dim=class_dim)

    # for test
    inference_program = fluid.default_main_program().clone(for_test=True)

    if args.with_mem_opt:
        fluid.memory_optimize(fluid.default_main_program())

    place = fluid.CUDAPlace(0)
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())

    def is_parameter(var):
        return isinstance(var, Parameter)

    if test_model is not None:
        vars = filter(is_parameter, inference_program.list_vars())
        fluid.io.load_vars(exe, test_model, vars=vars)

    # reader
    test_reader = paddle.batch(reader.infer(seg_num), batch_size=1)
    feeder = fluid.DataFeeder(place=place, feed_list=[image])

    fetch_list = [out.name]

    # test
    TOPK = 1
    for batch_id, data in enumerate(test_reader()):
        data, vid = data[0]
        data = [[data]]
        result = exe.run(inference_program,
                         fetch_list=fetch_list,
                         feed=feeder.feed(data))
        result = result[0][0]
        pred_label = np.argsort(result)[::-1][:TOPK]
        print("Test sample: {0}, score: {1}, class {2}".format(vid, result[
            pred_label], pred_label))
        sys.stdout.flush()


def main():
    args = parser.parse_args()
    print_arguments(args)
    infer(args)


if __name__ == '__main__':
    main()
import os
import sys
import math
import random
import functools
try:
    import cPickle as pickle
    from cStringIO import StringIO
except ImportError:
    import pickle
    from io import BytesIO
import numpy as np
import paddle
from PIL import Image, ImageEnhance

random.seed(0)

THREAD = 8
BUF_SIZE = 1024

TRAIN_LIST = 'data/train.list'
TEST_LIST = 'data/test.list'
INFER_LIST = 'data/test.list'

img_mean = np.array([0.485, 0.456, 0.406]).reshape((3, 1, 1))
img_std = np.array([0.229, 0.224, 0.225]).reshape((3, 1, 1))

python_ver = sys.version_info


def imageloader(buf):
    if isinstance(buf, str):
        img = Image.open(StringIO(buf))
    else:
        img = Image.open(BytesIO(buf))

    return img.convert('RGB')


def group_scale(imgs, target_size):
    resized_imgs = []
    for i in range(len(imgs)):
        img = imgs[i]
        w, h = img.size
        if (w <= h and w == target_size) or (h <= w and h == target_size):
            resized_imgs.append(img)
            continue

        if w < h:
            ow = target_size
            oh = int(target_size * 4.0 / 3.0)
            resized_imgs.append(img.resize((ow, oh), Image.BILINEAR))
        else:
            oh = target_size
            ow = int(target_size * 4.0 / 3.0)
            resized_imgs.append(img.resize((ow, oh), Image.BILINEAR))

    return resized_imgs


def group_random_crop(img_group, target_size):
    w, h = img_group[0].size
    th, tw = target_size, target_size

    out_images = []
    x1 = random.randint(0, w - tw)
    y1 = random.randint(0, h - th)

    for img in img_group:
        if w == tw and h == th:
            out_images.append(img)
        else:
            out_images.append(img.crop((x1, y1, x1 + tw, y1 + th)))

    return out_images


def group_random_flip(img_group):
    v = random.random()
    if v < 0.5:
        ret = [img.transpose(Image.FLIP_LEFT_RIGHT) for img in img_group]
        return ret
    else:
        return img_group


def group_center_crop(img_group, target_size):
    img_crop = []
    for img in img_group:
        w, h = img.size
        th, tw = target_size, target_size
        x1 = int(round((w - tw) / 2.))
        y1 = int(round((h - th) / 2.))
        img_crop.append(img.crop((x1, y1, x1 + tw, y1 + th)))

    return img_crop


def video_loader(frames, nsample, mode):
    videolen = len(frames)
    average_dur = videolen // nsample

    imgs = []
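    # TSN-style sparse sampling: split the video into `nsample` equal segments
    # and take one frame per segment -- a random offset within the segment when
    # training, the segment's center frame otherwise.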
    for i in range(nsample):
        idx = 0
        if mode == 'train':
            if average_dur >= 1:
                idx = random.randint(0, average_dur - 1)
                idx += i * average_dur
            else:
                idx = i
        else:
            if average_dur >= 1:
                idx = (average_dur - 1) // 2
                idx += i * average_dur
            else:
                idx = i

        imgbuf = frames[int(idx % videolen)]
        img = imageloader(imgbuf)
        imgs.append(img)

    return imgs


def decode_pickle(sample, mode, seg_num, short_size, target_size):
    pickle_path = sample[0]
    if python_ver < (3, 0):
        data_loaded = pickle.load(open(pickle_path, 'rb'))
    else:
        data_loaded = pickle.load(open(pickle_path, 'rb'), encoding='bytes')
    vid, label, frames = data_loaded

    imgs = video_loader(frames, seg_num, mode)
    imgs = group_scale(imgs, short_size)

    if mode == 'train':
        imgs = group_random_crop(imgs, target_size)
        imgs = group_random_flip(imgs)
    else:
        imgs = group_center_crop(imgs, target_size)

    np_imgs = (np.array(imgs[0]).astype('float32').transpose(
        (2, 0, 1))).reshape(1, 3, 224, 224) / 255
    for i in range(len(imgs) - 1):
        img = (np.array(imgs[i + 1]).astype('float32').transpose(
            (2, 0, 1))).reshape(1, 3, 224, 224) / 255
        np_imgs = np.concatenate((np_imgs, img))
    imgs = np_imgs
    imgs -= img_mean
    imgs /= img_std

    if mode == 'train' or mode == 'test':
        return imgs, label
    elif mode == 'infer':
        return imgs, vid


def _reader_creator(pickle_list,
                    mode,
                    seg_num,
                    short_size,
                    target_size,
                    shuffle=False):
    def reader():
        with open(pickle_list) as flist:
            lines = [line.strip() for line in flist]
            if shuffle:
                random.shuffle(lines)
            for line in lines:
                pickle_path = line.strip()
                yield [pickle_path]

    mapper = functools.partial(
        decode_pickle,
        mode=mode,
        seg_num=seg_num,
        short_size=short_size,
        target_size=target_size)

    return paddle.reader.xmap_readers(mapper, reader, THREAD, BUF_SIZE)


def train(seg_num):
    return _reader_creator(
        TRAIN_LIST,
        'train',
        shuffle=True,
        seg_num=seg_num,
        short_size=256,
        target_size=224)


def test(seg_num):
    return _reader_creator(
        TEST_LIST,
        'test',
        shuffle=False,
        seg_num=seg_num,
        short_size=256,
        target_size=224)


def infer(seg_num):
    return _reader_creator(
        INFER_LIST,
        'infer',
        shuffle=False,
        seg_num=seg_num,
        short_size=256,
        target_size=224)
import os
import time
import sys
import paddle.fluid as fluid
import math


class TSN_ResNet():
    def __init__(self, layers=50, seg_num=7):
        self.layers = layers
        self.seg_num = seg_num

    def conv_bn_layer(self,
                      input,
                      num_filters,
                      filter_size,
                      stride=1,
                      groups=1,
                      act=None):
        conv = fluid.layers.conv2d(
            input=input,
            num_filters=num_filters,
            filter_size=filter_size,
            stride=stride,
            padding=(filter_size - 1) // 2,
            groups=groups,
            act=None,
            bias_attr=False)
        return fluid.layers.batch_norm(input=conv, act=act)

    def shortcut(self, input, ch_out, stride):
        ch_in = input.shape[1]
        if ch_in != ch_out or stride != 1:
            return self.conv_bn_layer(input, ch_out, 1, stride)
        else:
            return input

    def bottleneck_block(self, input, num_filters, stride):
        conv0 = self.conv_bn_layer(
            input=input, num_filters=num_filters, filter_size=1, act='relu')
        conv1 = self.conv_bn_layer(
            input=conv0,
            num_filters=num_filters,
            filter_size=3,
            stride=stride,
            act='relu')
        conv2 = self.conv_bn_layer(
            input=conv1, num_filters=num_filters * 4, filter_size=1, act=None)

        short = self.shortcut(input, num_filters * 4, stride)

        return fluid.layers.elementwise_add(x=short, y=conv2, act='relu')

    def net(self, input, class_dim=101):
        layers = self.layers
        seg_num = self.seg_num
        supported_layers = [50, 101, 152]
        if layers not in supported_layers:
            print("supported layers are", supported_layers, \
                  "but input layer is ", layers)
            exit()

        # reshape input
        channels = input.shape[2]
        short_size = input.shape[3]
        input = fluid.layers.reshape(
            x=input, shape=[-1, channels, short_size, short_size])

        if layers == 50:
            depth = [3, 4, 6, 3]
        elif layers == 101:
            depth = [3, 4, 23, 3]
        elif layers == 152:
            depth = [3, 8, 36, 3]
        num_filters = [64, 128, 256, 512]

        conv = self.conv_bn_layer(
            input=input, num_filters=64, filter_size=7, stride=2, act='relu')
        conv = fluid.layers.pool2d(
            input=conv,
            pool_size=3,
            pool_stride=2,
            pool_padding=1,
            pool_type='max')

        for block in range(len(depth)):
            for i in range(depth[block]):
                conv = self.bottleneck_block(
                    input=conv,
                    num_filters=num_filters[block],
                    stride=2 if i == 0 and block != 0 else 1)

        pool = fluid.layers.pool2d(
            input=conv, pool_size=7, pool_type='avg', global_pooling=True)
        feature = fluid.layers.reshape(
            x=pool, shape=[-1, seg_num, pool.shape[1]])
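        # TSN segment consensus: average the per-segment features along the
        # seg_num axis before the final softmax classifier.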
        out = fluid.layers.reduce_mean(feature, dim=1)

        stdv = 1.0 / math.sqrt(pool.shape[1] * 1.0)
        out = fluid.layers.fc(input=out,
                              size=class_dim,
                              act='softmax',
                              param_attr=fluid.param_attr.ParamAttr(
                                  initializer=fluid.initializer.Uniform(-stdv,
                                                                        stdv)))
        return out
import os
import numpy as np
import time
import sys
import paddle
import paddle.fluid as fluid
from resnet import TSN_ResNet
import reader
import argparse
import functools
from paddle.fluid.framework import Parameter
from utility import add_arguments, print_arguments

parser = argparse.ArgumentParser(description=__doc__)
add_arg = functools.partial(add_arguments, argparser=parser)
# yapf: disable
add_arg('batch_size', int, 128, "Minibatch size.")
add_arg('num_layers', int, 50, "How many layers for ResNet model.")
add_arg('with_mem_opt', bool, True, "Whether to use memory optimization or not.")
add_arg('num_epochs', int, 60, "Number of epochs.")
add_arg('class_dim', int, 101, "Number of class.")
add_arg('seg_num', int, 7, "Number of segments.")
add_arg('image_shape', str, "3,224,224", "Input image size.")
add_arg('model_save_dir', str, "output", "Model save directory.")
add_arg('pretrained_model', str, None, "Whether to use pretrained model.")
add_arg('total_videos', int, 9537, "Training video number.")
add_arg('lr_init', float, 0.01, "Set initial learning rate.")
# yapf: enable


def train(args):
    # parameters from arguments
    seg_num = args.seg_num
    class_dim = args.class_dim
    num_layers = args.num_layers
    num_epochs = args.num_epochs
    batch_size = args.batch_size
    pretrained_model = args.pretrained_model
    model_save_dir = args.model_save_dir

    image_shape = [int(m) for m in args.image_shape.split(",")]
    image_shape = [seg_num] + image_shape

    # model definition
    model = TSN_ResNet(layers=num_layers, seg_num=seg_num)
    image = fluid.layers.data(name='image', shape=image_shape, dtype='float32')
    label = fluid.layers.data(name='label', shape=[1], dtype='int64')

    out = model.net(input=image, class_dim=class_dim)
    cost = fluid.layers.cross_entropy(input=out, label=label)

    avg_cost = fluid.layers.mean(x=cost)
    acc_top1 = fluid.layers.accuracy(input=out, label=label, k=1)
    acc_top5 = fluid.layers.accuracy(input=out, label=label, k=5)

    # for test
    inference_program = fluid.default_main_program().clone(for_test=True)

    # learning rate strategy
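    # Drop the learning rate by 10x at 1/3 and 2/3 of the total training steps
    # (piecewise-constant schedule: lr_init -> lr_init/10 -> lr_init/100).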
    epoch_points = [num_epochs // 3, num_epochs * 2 // 3]
    total_videos = args.total_videos
    step = int(total_videos / batch_size + 1)
    bd = [e * step for e in epoch_points]
    lr_init = args.lr_init
    lr = [lr_init, lr_init / 10, lr_init / 100]

    # initialize optimizer
    optimizer = fluid.optimizer.Momentum(
        learning_rate=fluid.layers.piecewise_decay(
            boundaries=bd, values=lr),
        momentum=0.9,
        regularization=fluid.regularizer.L2Decay(1e-4))
    opts = optimizer.minimize(avg_cost)

    if args.with_mem_opt:
        fluid.memory_optimize(fluid.default_main_program())

    place = fluid.CUDAPlace(0)
    exe = fluid.Executor(place)
    exe.run(fluid.default_startup_program())

    def is_parameter(var):
        return isinstance(var, Parameter) and ("fc_0" not in var.name)

    if pretrained_model is not None:
        vars = filter(is_parameter, inference_program.list_vars())
        fluid.io.load_vars(exe, pretrained_model, vars=vars)

    # reader
    train_reader = paddle.batch(
        reader.train(seg_num), batch_size=batch_size, drop_last=True)
    # test in single GPU
    test_reader = paddle.batch(reader.test(seg_num), batch_size=batch_size // 16)
    feeder = fluid.DataFeeder(place=place, feed_list=[image, label])

    train_exe = fluid.ParallelExecutor(use_cuda=True, loss_name=avg_cost.name)

    fetch_list = [avg_cost.name, acc_top1.name, acc_top5.name]

    # train
    for pass_id in range(num_epochs):
        train_info = [[], [], []]
        test_info = [[], [], []]
        for batch_id, data in enumerate(train_reader()):
            t1 = time.time()
            loss, acc1, acc5 = train_exe.run(fetch_list, feed=feeder.feed(data))
            t2 = time.time()
            period = t2 - t1
            loss = np.mean(np.array(loss))
            acc1 = np.mean(np.array(acc1))
            acc5 = np.mean(np.array(acc5))
            train_info[0].append(loss)
            train_info[1].append(acc1)
            train_info[2].append(acc5)
            if batch_id % 10 == 0:
                print(
                    "[TRAIN] Pass: {0}\ttrainbatch: {1}\tloss: {2}\tacc1: {3}\tacc5: {4}\ttime: {5}"
                    .format(pass_id, batch_id, '%.6f' % loss, acc1, acc5,
                            "%2.2f sec" % period))
                sys.stdout.flush()
        train_loss = np.array(train_info[0]).mean()
        train_acc1 = np.array(train_info[1]).mean()
        train_acc5 = np.array(train_info[2]).mean()

        # test
        cnt = 0
        for batch_id, data in enumerate(test_reader()):
            t1 = time.time()
            loss, acc1, acc5 = exe.run(inference_program,
                                       fetch_list=fetch_list,
                                       feed=feeder.feed(data))
            t2 = time.time()
            period = t2 - t1
            loss = np.mean(loss)
            acc1 = np.mean(acc1)
            acc5 = np.mean(acc5)
            test_info[0].append(loss * len(data))
            test_info[1].append(acc1 * len(data))
            test_info[2].append(acc5 * len(data))
            cnt += len(data)
            if batch_id % 10 == 0:
                print(
                    "[TEST] Pass: {0}\ttestbatch: {1}\tloss: {2}\tacc1: {3}\tacc5: {4}\ttime: {5}"
                    .format(pass_id, batch_id, '%.6f' % loss, acc1, acc5,
                            "%2.2f sec" % period))
                sys.stdout.flush()
        test_loss = np.sum(test_info[0]) / cnt
        test_acc1 = np.sum(test_info[1]) / cnt
        test_acc5 = np.sum(test_info[2]) / cnt

        print(
            "+ End pass: {0}, train_loss: {1}, train_acc1: {2}, train_acc5: {3}"
            .format(pass_id, '%.3f' % train_loss, '%.3f' % train_acc1, '%.3f' %
                    train_acc5))
        print("+ End pass: {0}, test_loss: {1}, test_acc1: {2}, test_acc5: {3}"
              .format(pass_id, '%.3f' % test_loss, '%.3f' % test_acc1, '%.3f' %
                      test_acc5))
        sys.stdout.flush()

        # save model
        model_path = os.path.join(model_save_dir, str(pass_id))
        if not os.path.isdir(model_path):
            os.makedirs(model_path)
        fluid.io.save_persistables(exe, model_path)


def main():
    args = parser.parse_args()
    print_arguments(args)
    train(args)


if __name__ == '__main__':
    main()
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains common utility functions."""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import distutils.util
import numpy as np
import six
from paddle.fluid import core


def print_arguments(args):
    """Print argparse's arguments.

    Usage:

    .. code-block:: python

        parser = argparse.ArgumentParser()
        parser.add_argument("name", default="John", type=str, help="User name.")
        args = parser.parse_args()
        print_arguments(args)

    :param args: Input argparse.Namespace for printing.
    :type args: argparse.Namespace
    """
    print("----------- Configuration Arguments -----------")
    for arg, value in sorted(six.iteritems(vars(args))):
        print("%s: %s" % (arg, value))
    print("------------------------------------------------")


def add_arguments(argname, type, default, help, argparser, **kwargs):
    """Add argparse's argument.

    Usage:

    .. code-block:: python

        parser = argparse.ArgumentParser()
        add_argument("name", str, "John", "User name.", parser)
        args = parser.parse_args()
    """
    type = distutils.util.strtobool if type == bool else type
    argparser.add_argument(
        "--" + argname,
        default=default,
        type=type,
        help=help + ' Default: %(default)s.',
        **kwargs)
@@ -33,7 +33,7 @@ PaddlePaddle 提供了丰富的计算单元,使得用户可以采用模块化
 [Attention model](./PaddleCV/ocr_recognition)|Scene text recognition model|Uses attention to recognize single lines of English characters in images|[Recurrent Models of Visual Attention](https://arxiv.org/abs/1406.6247)
 [Metric Learning](./PaddleCV/metric_learning)|Metric learning model|Analyzes association and comparison relations between objects; useful for auxiliary classification and clustering, and widely applied to image retrieval, face recognition, etc.|-
 [TSN](./PaddleCV/video_classification)|Video classification model|Models long-range temporal structure, combining a sparse temporal sampling strategy with video-level supervision for effective and efficient learning over whole videos|[Temporal Segment Networks: Towards Good Practices for Deep Action Recognition](https://arxiv.org/abs/1608.00859)
-[Video model library](./PaddleCV/video)|Video model library|Gives developers convenient, efficient PaddlePaddle-based deep learning models for video understanding, video editing, video generation, and more||
+[Video model library](./PaddleCV/PaddleVideo)|Video model library|Gives developers convenient, efficient PaddlePaddle-based deep learning models for video understanding, video editing, video generation, and more||
 [caffe2fluid](./PaddleCV/caffe2fluid)|Tool for converting Caffe models into Paddle Fluid configuration and model files|-|-

 ## PaddleNLP