Commit a793d8bf authored by yuanxiao

init

Parent
# Multimodal Short Video Data Set and Baseline Classification Model
> If you have data, access to data, or a better model, feel free to open an issue or pull request, or contact me at wangzichaochaochao@gmail.com
This resource contains a multimodal short video dataset with 500,000+ samples (865 GB) and a TensorFlow 2.0 multimodal short video classification model, aiming to build a multimodal classification framework.
Multimodal short video data = short video description text + short video cover image + short video
![](example_data/example_data_file.png)
[click to view example data](example_data)
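In code, one such triple can be sketched as a simple container. This is only an illustration: the `MultimodalSample` type and the `cover_image_path`/`video_path` field names are hypothetical, not part of this repository; `mp4_txt_brief` and `video_label` follow the dataset's own field names.

```python
from dataclasses import dataclass


@dataclass
class MultimodalSample:
    """One multimodal short-video sample: description text + cover image + video."""
    mp4_txt_brief: str      # short video description text
    cover_image_path: str   # path to the downloaded cover image (.jpeg)
    video_path: str         # path to the downloaded short video (.mp4)
    video_label: str        # one of the 31 category labels


sample = MultimodalSample(
    mp4_txt_brief=" Woman in swimsuit and cover up walking at the beach",
    cover_image_path="MP4_download/Beach/80328682/80328682.jpeg",
    video_path="MP4_download/Beach/80328682/80328682.mp4",
    video_label="Beach",
)
print(sample.video_label)  # Beach
```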
---
## 1. Multimodal dataset information
The current multimodal short video dataset contains 500,000+ multimodal samples covering 31 categories and occupying 865 GB in total. Download and unzip the [multimodal_data_info.rar](aggregate_download_data_to_a_json_file/multimodal_data_info.rar) file to obtain the download URLs for all the data. You can download them directly using [data_download_tools](data_download_tools), or use your own download tool.
### Multimodal data (31 types)
Video category Chinese-English mapping dictionary:
```python
video_type_dict = {'360VR': 'VR', '4k': '4K', 'Technology': '科技', 'Sport': '运动', 'Timelapse': '延时',
'Aerial': '航拍', 'Animals': '动物', 'Sea': '大海', 'Beach': '海滩', 'space': '太空',
'stars': '星空', 'City': '城市', 'Business': '商业', 'Underwater': '水下摄影',
'Wedding': '婚礼', 'Archival': '档案', 'Backgrounds': '背景', 'Alpha Channel': '透明通道',
'Intro': '开场', 'Celebration': '庆典', 'Clouds': '云彩', 'Corporate': '企业',
'Explosion': '爆炸', 'Film': '电影镜头', 'Green Screen': '绿幕', 'Military': '军事',
'Nature': '自然', 'News': '新闻', 'R3d': 'R3d', 'Romantic': '浪漫', 'Abstract': '抽象'}
```
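As a quick sanity check (a standalone sketch, not a repository file), the dictionary above can be loaded and queried; its English keys are the `video_label` values used throughout the dataset:

```python
video_type_dict = {'360VR': 'VR', '4k': '4K', 'Technology': '科技', 'Sport': '运动', 'Timelapse': '延时',
                   'Aerial': '航拍', 'Animals': '动物', 'Sea': '大海', 'Beach': '海滩', 'space': '太空',
                   'stars': '星空', 'City': '城市', 'Business': '商业', 'Underwater': '水下摄影',
                   'Wedding': '婚礼', 'Archival': '档案', 'Backgrounds': '背景', 'Alpha Channel': '透明通道',
                   'Intro': '开场', 'Celebration': '庆典', 'Clouds': '云彩', 'Corporate': '企业',
                   'Explosion': '爆炸', 'Film': '电影镜头', 'Green Screen': '绿幕', 'Military': '军事',
                   'Nature': '自然', 'News': '新闻', 'R3d': 'R3d', 'Romantic': '浪漫', 'Abstract': '抽象'}

# The English keys double as the category labels of the dataset
labels = sorted(video_type_dict)
print(len(labels))                  # 31 categories
print(video_type_dict['Military'])  # 军事
```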
Apart from the 360VR videos, each of the other categories has approximately 20,000 samples. You can check the contents of all multimodal files at any time using the [download_file_info.ipynb](data_download_tools/xinpianchang/download_file_info.ipynb) tool in [data_download_tools](data_download_tools), as shown below:
Check the disk space occupied by the data.
![](data_download_tools/xinpianchang/download_mp4_info.png)
Check a category's video cover images and the corresponding video description information.
![](data_download_tools/xinpianchang/check_image.png)
### Multimodal data statistics
The multimodal_data_info.json file records 562,342 multimodal samples; each line contains the fields ```['mp4_id', 'video_label', 'mp4_time', 'mp4_download_url', 'mp4_background_image_url', 'mp4_txt_brief']```.
The content of multimodal_data_info.json looks as follows:
```python
{"mp4_id": "80328682", "mp4_download_url": "https://p5-v1.xpccdn.com/080328682_main_xl.mp4",
"mp4_time": "0:16", "mp4_background_image_url": "https://p5-i1.xpccdn.com/080328682_iconl.jpeg",
"mp4_txt_brief": " Woman in swimsuit and cover up walking at the beach", "video_label": "Beach"}
{"mp4_id": "63660083", "mp4_download_url": "https://p5-v1.xpccdn.com/063660083_main_xl.mp4",
"mp4_time": "0:29", "mp4_background_image_url": "https://p5-i1.xpccdn.com/063660083_iconl.jpeg",
"mp4_txt_brief": " 4K Happy female friends chatting & drinking on city rooftop in the summer", "video_label": "City"}
```
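Since the file is in JSON Lines format (one JSON object per line), it can be read with the standard `json` module. A minimal sketch using the two records above:

```python
import json

# Two example lines from multimodal_data_info.json
lines = [
    '{"mp4_id": "80328682", "mp4_download_url": "https://p5-v1.xpccdn.com/080328682_main_xl.mp4", '
    '"mp4_time": "0:16", "mp4_background_image_url": "https://p5-i1.xpccdn.com/080328682_iconl.jpeg", '
    '"mp4_txt_brief": " Woman in swimsuit and cover up walking at the beach", "video_label": "Beach"}',
    '{"mp4_id": "63660083", "mp4_download_url": "https://p5-v1.xpccdn.com/063660083_main_xl.mp4", '
    '"mp4_time": "0:29", "mp4_background_image_url": "https://p5-i1.xpccdn.com/063660083_iconl.jpeg", '
    '"mp4_txt_brief": " 4K Happy female friends chatting & drinking on city rooftop in the summer", "video_label": "City"}',
]

records = [json.loads(line) for line in lines]
print([r["video_label"] for r in records])  # ['Beach', 'City']

# With the real file:
# with open("multimodal_data_info.json", encoding="utf-8") as f:
#     records = [json.loads(line) for line in f]
```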
You can use the [data_analysis.ipynb](aggregate_download_data_to_a_json_file/data_analysis.ipynb) tool in [aggregate_download_data_to_a_json_file](aggregate_download_data_to_a_json_file) to compute statistics over the multimodal data file. The results are shown below.
![](aggregate_download_data_to_a_json_file/json_file_data_analysis.png)
---
## 2. Baseline Classification Model
> See my blog post [短视频分类技术](https://yuanxiaosc.github.io/categories/TF/%E5%95%86%E4%B8%9A%E5%BA%94%E7%94%A8%E6%A1%88%E4%BE%8B/) for more on short video classification.
Model structure diagram
![](baseline_model/multimodal_baseline_model.png)
Model structure test
![](baseline_model/model_structure_test.png)
[Click on baseline_model to learn more](baseline_model)
### Require
+ python 3+, e.g. python==3.6
+ tensorflow version 2, e.g. tensorflow==2.0.0-beta1
+ tensorflow-datasets
### Train Model
```shell
python train_multimodal_baseline_model.py
```
---
## 4. Build your own model
[Click on data_interface_for_model to learn more](data_interface_for_model)
The [data_interface_for_model](data_interface_for_model) data interface makes it easy to feed data to your model. It contains three types of data interfaces: the tensors required by TensorFlow, the NumPy arrays required by PyTorch, and native Python types.
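The three interface styles can be illustrated as follows. This is a sketch with made-up values; the actual interfaces live in data_interface_for_model, and the `python_batch` contents here are invented for demonstration:

```python
import numpy as np

# Native Python type: a list of (text token ids, label id) pairs (made-up values)
python_batch = [([12, 7, 403], 8), ([95, 2, 17], 3)]

# NumPy arrays: what a PyTorch DataLoader typically consumes
numpy_labels = np.array([label for _, label in python_batch], dtype=np.int64)
print(numpy_labels.shape)  # (2,)

# TensorFlow tensors (requires tensorflow installed):
# import tensorflow as tf
# tf_labels = tf.constant(numpy_labels)
```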
---
## 5. Copyright Statement
Currently all multimodal video data comes from the Internet, and its copyright belongs to the original authors. If this data (from https://xinpianchang.com) is used for profit, please contact service@xinpianchang.com to purchase the data copyright.
import os
import sys
import pathlib
import ast
import json

import pandas as pd


def clean_specified_type_file(data_root=None, specified_type_list=["*/*/*.mp4", "*/*/*.jpeg", "*/*/*.txt"]):
    """
    :param data_root: root directory containing the files to delete
    :param specified_type_list: glob patterns, relative to data_root, of the files to delete
    """
    if data_root is None:
        data_root = os.getcwd()
    data_root = pathlib.Path(data_root)
    garbage_file_list = list()
    # Collect the full paths of the files to delete
    for t in specified_type_list:
        paths_list = sorted(str(item) for item in data_root.glob(t))
        garbage_file_list.extend(paths_list)
    # Remove the files (full paths, so this also works outside the current working directory)
    for path in garbage_file_list:
        os.remove(path)


def get_description_information(txt_path):
    """description_information includes: {'mp4_id': '', 'mp4_download_url': '', 'mp4_time': '',
    'mp4_background_image_url': '', 'mp4_txt_brief': ''}"""
    # ast.literal_eval is a safe replacement for eval when parsing the stored dict literal
    with open(txt_path, encoding='utf-8') as f:
        description_information_dict = ast.literal_eval(f.read())
    return description_information_dict
def standardization_of_file_names(data_root="MP4_download"):
    """
    Rename each set of data to a uniform format:
    multimodal_data_id
        multimodal_data_id.jpeg
        multimodal_data_id.mp4
        multimodal_data_id.txt
    """
    # Get all multimodal data type names
    data_root = pathlib.Path(data_root)
    label_names_list = sorted(item.name for item in data_root.glob('*/') if item.is_dir())
    print(f"data_root contains {len(label_names_list)} video types")
    print(f"data_root contains video types {label_names_list}")
    # Process each type of multimodal data in turn
    for label_name in label_names_list:
        # Get all folders under this type of multimodal data
        label_mode = label_name + "/*"
        multimodal_data_dir = list(data_root.glob(label_mode))
        multimodal_data_dir = [str(path) for path in multimodal_data_dir]
        # Standardize the file names of each piece of multimodal data
        for multimodal_data_path in multimodal_data_dir:
            multimodal_data_id = os.path.basename(multimodal_data_path)
            for item_file in os.listdir(multimodal_data_path):
                item_file = os.path.join(multimodal_data_path, item_file)
                if item_file.endswith('.txt'):
                    os.rename(item_file, os.path.join(multimodal_data_path, multimodal_data_id + ".txt"))
                elif item_file.endswith('.jpeg'):
                    os.rename(item_file, os.path.join(multimodal_data_path, multimodal_data_id + ".jpeg"))
                elif item_file.endswith('.mp4'):
                    os.rename(item_file, os.path.join(multimodal_data_path, multimodal_data_id + ".mp4"))
                elif item_file.endswith('.ipynb_checkpoints'):
                    pass
                else:
                    raise ValueError("Unexpected file found, please check: " + item_file)
def count_file_number(data_root="MP4_download"):
    """
    Count the number of files per category.
    :return {'Military': 18560, 'Business': 19200, 'Archival': 10176, 'Romantic': 19162,...}
    all number: 56xx42
    """
    video_label_number = len(os.listdir(data_root))
    print("video_label_number:\t", video_label_number)
    multimodal_data_number_dict = dict()
    all_number = 0
    for video_label in os.listdir(data_root):
        video_label_dir = os.path.join(data_root, video_label)
        # print("video_label_dir:\t", video_label_dir)
        multimodal_data_number = len(os.listdir(video_label_dir))
        # print("multimodal_data_number:\t", multimodal_data_number)
        multimodal_data_number_dict[video_label] = multimodal_data_number
        all_number += multimodal_data_number
    print(multimodal_data_number_dict)
    print("all number:\t", all_number)
    return multimodal_data_number_dict
def statistics_all_multimodal_data_information_to_json_file(data_root="MP4_download",
                                                            store_multimodal_info_json_file_path="multimodal_data_info.json"):
    """
    Aggregate all *.txt files under data_root into one *.json file.
    """
    data_root = pathlib.Path(data_root)
    all_txt_data_paths = [str(path) for path in
                          list(data_root.glob('*/*/*.txt'))]  # [MP4_download/360VR/89422838/89422838.txt,...]
    with open(store_multimodal_info_json_file_path, "w", encoding='utf-8') as json_write_f:
        for text_data_path in all_txt_data_paths:
            video_label_path = os.path.dirname(os.path.dirname(text_data_path))  # /MP4_download/360VR/
            video_label = os.path.basename(video_label_path)  # 360VR
            description_information_dict = get_description_information(text_data_path)
            description_information_dict["video_label"] = video_label
            line_json = json.dumps(description_information_dict, ensure_ascii=False)
            json_write_f.write(line_json + "\n")


def read_multimodal_data_information_json_file(json_file_path="multimodal_data_info.json"):
    """
    :param json_file_path:
    :return: multimodal_data_information_list
        [{'mp4_id': '97930081', 'mp4_download_url': ...'video_label': 'Military'},
         {'mp4_id': '64413672', 'mp4_download_url': ... 'video_label': 'Military'}]
    """
    def check_data(line_dict):
        for item in ['mp4_id', 'video_label', 'mp4_time', 'mp4_download_url', 'mp4_background_image_url',
                     'mp4_txt_brief']:
            if item not in line_dict:
                return False
        return True

    multimodal_data_information_list = list()
    # Iterating over the file yields one JSON record per line; no bare try/except needed
    with open(json_file_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            line_dict = json.loads(line)
            if check_data(line_dict):
                multimodal_data_information_list.append(line_dict)
            else:
                print("incomplete data:")
                print(line_dict)
    return multimodal_data_information_list
def multimodal_data_json_file_to_datafram(json_file_path="multimodal_data_info.json"):
    """
    Convert the json file to a pandas.DataFrame.
    """
    if not os.path.exists(json_file_path):
        print("Run statistics_all_multimodal_data_information_to_json_file(data_root, json_file_path) first.")
        raise ValueError("Json file not found!")
    multimodal_data_information_list = read_multimodal_data_information_json_file(json_file_path)
    multimodal_data_information_dict = {'mp4_id': [], 'video_label': [], 'mp4_time': [],
                                        'mp4_download_url': [], 'mp4_background_image_url': [], 'mp4_txt_brief': []}
    for data in multimodal_data_information_list:
        for key in multimodal_data_information_dict:
            multimodal_data_information_dict[key].append(data[key])
    multimodal_data_information_datafram = pd.DataFrame(multimodal_data_information_dict)
    return multimodal_data_information_datafram
def aggravate_data_utils_main(data_root, json_file_path="./multimodal_data_info.json"):
    """
    aggregate_download_data_to_a_json_file
    :param data_root: root directory of the downloaded files
    :param json_file_path: path of the json file to produce
    """
    # Standardize file names
    standardization_of_file_names(data_root)
    # Produce the json file
    statistics_all_multimodal_data_information_to_json_file(data_root, json_file_path)
    # Analyze the json file
    multimodal_data_information_datafram = multimodal_data_json_file_to_datafram(json_file_path)
    print(multimodal_data_information_datafram.describe())


if __name__ == "__main__":
    data_root = "/home/b418a/disk1/jupyter_workspace/yuanxiao/douyin/xinpianchang/MP4_download"
    json_file_path = "./multimodal_data_info.json"
    if len(sys.argv) == 3:
        data_root = sys.argv[1]
        json_file_path = sys.argv[2]
    elif len(sys.argv) == 2:
        data_root = sys.argv[1]
    aggravate_data_utils_main(data_root, json_file_path)
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import json"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"multimodal_data_info_file_path ='multimodal_data_info.json'"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def read_multimodal_data_information_json_file(json_file_path=\"multimodal_data_info.json\"):\n",
" \"\"\"\n",
" :param json_file_path:\n",
" :return: multimodal_data_information_list\n",
" [{'mp4_id': '97930081', 'mp4_download_url': ...'video_label': 'Military'},\n",
" {'mp4_id': '64413672', 'mp4_download_url': ... 'video_label': 'Military'}]\n",
" \"\"\"\n",
" def check_data(line_dict):\n",
" for item in ['mp4_id', 'video_label', 'mp4_time', 'mp4_download_url', 'mp4_background_image_url', 'mp4_txt_brief']:\n",
" if item not in line_dict:\n",
" return False\n",
" return True\n",
" \n",
" multimodal_data_information_list = list()\n",
" with open(json_file_path, 'r', encoding='utf-8') as f:\n",
" try:\n",
" while True:\n",
" line = f.readline()\n",
" if line:\n",
" line_dict = json.loads(line)\n",
" if check_data(line_dict):\n",
" multimodal_data_information_list.append(line_dict)\n",
" else:\n",
" print(\"incomplete data:\")\n",
" print(line_dict)\n",
" else:\n",
" break\n",
" except:\n",
" f.close()\n",
" return multimodal_data_information_list"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"multimodal_data_information_list = read_multimodal_data_information_json_file(multimodal_data_info_file_path)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"562342"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(multimodal_data_information_list)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[{'mp4_id': '75265848',\n",
" 'mp4_download_url': 'https://p5-v1.xpccdn.com/075265848_main_xl.mp4',\n",
" 'mp4_time': '0:13',\n",
" 'mp4_background_image_url': 'https://p5-i1.xpccdn.com/075265848_iconl.jpeg',\n",
" 'mp4_txt_brief': ' Old antique German military rifle',\n",
" 'video_label': 'Military'},\n",
" {'mp4_id': '44566064',\n",
" 'mp4_download_url': 'https://p5-v1.xpccdn.com/044566064_main_xl.mp4',\n",
" 'mp4_time': '0:09',\n",
" 'mp4_background_image_url': 'https://p5-i1.xpccdn.com/044566064_iconl.jpeg',\n",
" 'mp4_txt_brief': ' quadcopter aerial drone',\n",
" 'video_label': 'Military'},\n",
" {'mp4_id': '62447549',\n",
" 'mp4_download_url': 'https://p5-v1.xpccdn.com/062447549_main_xl.mp4',\n",
" 'mp4_time': '0:06',\n",
" 'mp4_background_image_url': 'https://p5-i1.xpccdn.com/062447549_iconl.jpeg',\n",
" 'mp4_txt_brief': ' Firearm dis-assembly for cleaning and safety check of handheld gun',\n",
" 'video_label': 'Military'}]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"multimodal_data_information_list[:3]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"def multimodal_data_json_file_to_datafram(json_file_path=\"multimodal_data_info.json\"):\n",
" \"\"\"\n",
" :param json_file_path: \n",
" :return: pandas.datafram\n",
" \"\"\"\n",
" multimodal_data_information_list = read_multimodal_data_information_json_file(json_file_path)\n",
" \n",
" multimodal_data_information_dict = {'mp4_id':[], 'video_label':[], 'mp4_time':[], \n",
" 'mp4_download_url':[], 'mp4_background_image_url':[], 'mp4_txt_brief':[]}\n",
" \n",
" for data in multimodal_data_information_list:\n",
" multimodal_data_information_dict['mp4_id'].append(data['mp4_id'])\n",
" multimodal_data_information_dict['video_label'].append(data['video_label'])\n",
" multimodal_data_information_dict['mp4_time'].append(data['mp4_time'])\n",
" multimodal_data_information_dict['mp4_download_url'].append(data['mp4_download_url'])\n",
" multimodal_data_information_dict['mp4_background_image_url'].append(data['mp4_background_image_url'])\n",
" multimodal_data_information_dict['mp4_txt_brief'].append(data['mp4_txt_brief'])\n",
" \n",
" multimodal_data_information_datafram = pd.DataFrame(multimodal_data_information_dict)\n",
" \n",
" return multimodal_data_information_datafram"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"multimodal_data_information_datafram = multimodal_data_json_file_to_datafram(json_file_path=\"multimodal_data_info.json\")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>mp4_id</th>\n",
" <th>video_label</th>\n",
" <th>mp4_time</th>\n",
" <th>mp4_download_url</th>\n",
" <th>mp4_background_image_url</th>\n",
" <th>mp4_txt_brief</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>75265848</td>\n",
" <td>Military</td>\n",
" <td>0:13</td>\n",
" <td>https://p5-v1.xpccdn.com/075265848_main_xl.mp4</td>\n",
" <td>https://p5-i1.xpccdn.com/075265848_iconl.jpeg</td>\n",
" <td>Old antique German military rifle</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>44566064</td>\n",
" <td>Military</td>\n",
" <td>0:09</td>\n",
" <td>https://p5-v1.xpccdn.com/044566064_main_xl.mp4</td>\n",
" <td>https://p5-i1.xpccdn.com/044566064_iconl.jpeg</td>\n",
" <td>quadcopter aerial drone</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>62447549</td>\n",
" <td>Military</td>\n",
" <td>0:06</td>\n",
" <td>https://p5-v1.xpccdn.com/062447549_main_xl.mp4</td>\n",
" <td>https://p5-i1.xpccdn.com/062447549_iconl.jpeg</td>\n",
" <td>Firearm dis-assembly for cleaning and safety ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>42966432</td>\n",
" <td>Military</td>\n",
" <td>0:08</td>\n",
" <td>https://p5-v1.xpccdn.com/042966432_main_xl.mp4</td>\n",
" <td>https://p5-i1.xpccdn.com/042966432_iconl.jpeg</td>\n",
" <td>Kalashnikov deadly weapon</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>103424272</td>\n",
" <td>Military</td>\n",
" <td>0:13</td>\n",
" <td>https://p5-v1.xpccdn.com/103424272_main_xl.mp4</td>\n",
" <td>https://p5-i1.xpccdn.com/103424272_iconl.jpeg</td>\n",
" <td>Rows of ammunition in front of an animated Le...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" mp4_id video_label mp4_time \\\n",
"0 75265848 Military 0:13 \n",
"1 44566064 Military 0:09 \n",
"2 62447549 Military 0:06 \n",
"3 42966432 Military 0:08 \n",
"4 103424272 Military 0:13 \n",
"\n",
" mp4_download_url \\\n",
"0 https://p5-v1.xpccdn.com/075265848_main_xl.mp4 \n",
"1 https://p5-v1.xpccdn.com/044566064_main_xl.mp4 \n",
"2 https://p5-v1.xpccdn.com/062447549_main_xl.mp4 \n",
"3 https://p5-v1.xpccdn.com/042966432_main_xl.mp4 \n",
"4 https://p5-v1.xpccdn.com/103424272_main_xl.mp4 \n",
"\n",
" mp4_background_image_url \\\n",
"0 https://p5-i1.xpccdn.com/075265848_iconl.jpeg \n",
"1 https://p5-i1.xpccdn.com/044566064_iconl.jpeg \n",
"2 https://p5-i1.xpccdn.com/062447549_iconl.jpeg \n",
"3 https://p5-i1.xpccdn.com/042966432_iconl.jpeg \n",
"4 https://p5-i1.xpccdn.com/103424272_iconl.jpeg \n",
"\n",
" mp4_txt_brief \n",
"0 Old antique German military rifle \n",
"1 quadcopter aerial drone \n",
"2 Firearm dis-assembly for cleaning and safety ... \n",
"3 Kalashnikov deadly weapon \n",
"4 Rows of ammunition in front of an animated Le... "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"multimodal_data_information_datafram.head()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>mp4_id</th>\n",
" <th>video_label</th>\n",
" <th>mp4_time</th>\n",
" <th>mp4_download_url</th>\n",
" <th>mp4_background_image_url</th>\n",
" <th>mp4_txt_brief</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>562342</td>\n",
" <td>562342</td>\n",
" <td>562342</td>\n",
" <td>562342</td>\n",
" <td>562342</td>\n",
" <td>562342</td>\n",
" </tr>\n",
" <tr>\n",
" <th>unique</th>\n",
" <td>499607</td>\n",
" <td>31</td>\n",
" <td>184</td>\n",
" <td>499607</td>\n",
" <td>499607</td>\n",
" <td>343020</td>\n",
" </tr>\n",
" <tr>\n",
" <th>top</th>\n",
" <td>88460884</td>\n",
" <td>Alpha Channel</td>\n",
" <td>0:10</td>\n",
" <td>https://p5-v1.xpccdn.com/023726153_main_xl.mp4</td>\n",
" <td>https://p5-i1.xpccdn.com/088460884_iconl.jpeg</td>\n",
" <td>Intro Background Texture Render Animation Col...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>freq</th>\n",
" <td>9</td>\n",
" <td>19200</td>\n",
" <td>49660</td>\n",
" <td>9</td>\n",
" <td>9</td>\n",
" <td>10974</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" mp4_id video_label mp4_time \\\n",
"count 562342 562342 562342 \n",
"unique 499607 31 184 \n",
"top 88460884 Alpha Channel 0:10 \n",
"freq 9 19200 49660 \n",
"\n",
" mp4_download_url \\\n",
"count 562342 \n",
"unique 499607 \n",
"top https://p5-v1.xpccdn.com/023726153_main_xl.mp4 \n",
"freq 9 \n",
"\n",
" mp4_background_image_url \\\n",
"count 562342 \n",
"unique 499607 \n",
"top https://p5-i1.xpccdn.com/088460884_iconl.jpeg \n",
"freq 9 \n",
"\n",
" mp4_txt_brief \n",
"count 562342 \n",
"unique 343020 \n",
"top Intro Background Texture Render Animation Col... \n",
"freq 10974 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"multimodal_data_information_datafram.describe()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
import os
import shutil
video_type_dict = {'360VR': 'VR', '4k': '4K', 'Technology': '科技', 'Sport': '运动', 'Timelapse': '延时',
'Aerial': '航拍', 'Animals': '动物', 'Sea': '大海', 'Beach': '海滩', 'space': '太空',
'stars': '星空', 'City': '城市', 'Business': '商业', 'Underwater': '水下摄影',
'Wedding': '婚礼', 'Archival': '档案', 'Backgrounds': '背景', 'Alpha Channel': '透明通道',
'Intro': '开场', 'Celebration': '庆典', 'Clouds': '云彩', 'Corporate': '企业',
'Explosion': '爆炸', 'Film': '电影镜头', 'Green Screen': '绿幕', 'Military': '军事',
'Nature': '自然', 'News': '新闻', 'R3d': 'R3d', 'Romantic': '浪漫', 'Abstract': '抽象'}
def make_fake_data(true_data_root, fake_data_root="./MP4_download", fake_video_number=1):
    """
    Copy part of the original data for research, so the original data is not damaged.
    """
    if not os.path.exists(fake_data_root):
        os.mkdir(fake_data_root)
    video_type_list = list(video_type_dict.keys())
    for multimodal_data_type in video_type_list[:fake_video_number]:
        true_multimodal_a_type_data_dir = os.path.join(true_data_root, multimodal_data_type)
        fake_multimodal_a_type_data_dir = os.path.join(fake_data_root, multimodal_data_type)
        shutil.copytree(true_multimodal_a_type_data_dir, fake_multimodal_a_type_data_dir)


if __name__ == "__main__":
    true_data_root = "/home/b418a/disk1/jupyter_workspace/yuanxiao/douyin/xinpianchang/MP4_download"
    fake_data_root = "/home/b418a/disk1/pycharm_room/yuanxiao/my_lenovo_P50s/Multimodal-short-video-dataset-and-baseline-model/MP4_download"
    fake_video_number = 1
    make_fake_data(true_data_root, fake_data_root, fake_video_number)
import tensorflow as tf


def create_text_baseline_model(txt_maxlen, vocab_size, embedding_dim=100, lstm_units=64, output_dim=50):
    text_model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(txt_maxlen,)),  # shape must be a tuple, not a bare int
        tf.keras.layers.Embedding(vocab_size, embedding_dim),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm_units, return_sequences=True)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(output_dim, activation='relu')
    ], name='text_baseline_model')
    return text_model
def create_image_baseline_model(image_height, image_width, image_channels=3, output_dim=50):
    image_model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(image_height, image_width, image_channels)),
        tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
        tf.keras.layers.MaxPooling2D(2, 2),
        tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
        tf.keras.layers.MaxPooling2D(2, 2),
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
        tf.keras.layers.MaxPooling2D(2, 2),
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(output_dim)
    ], name='image_baseline_model')
    return image_model
def create_video_baseline_model(max_video_frame_number, video_height, video_width, video_channels=3, output_dim=50):
    """
    :param input_shape: [video sequence length, video_height, video_width, video_channels]
    :return: 3D convolutional model
    """
    def get_con3d_block(filters=64, kernel_size=(3, 3, 3),
                        strides=(1, 1, 1), padding='same'):
        return tf.keras.layers.Conv3D(filters=filters, kernel_size=kernel_size, strides=strides,
                                      padding=padding, data_format='channels_last',
                                      dilation_rate=(1, 1, 1), activation='relu',
                                      use_bias=True, kernel_initializer='glorot_uniform',
                                      bias_initializer='zeros', kernel_regularizer=None,
                                      bias_regularizer=None, activity_regularizer=None,
                                      kernel_constraint=None, bias_constraint=None)

    def get_maxpooling3d_block(pool_size=(1, 2, 2), strides=(1, 2, 2), padding='same'):
        return tf.keras.layers.MaxPooling3D(pool_size=pool_size, strides=strides,
                                            padding=padding, data_format='channels_last')

    model = tf.keras.models.Sequential(name="video_baseline_model")
    # Input
    model.add(tf.keras.layers.Input(shape=(max_video_frame_number, video_height, video_width, video_channels)))
    # Conv3D + MaxPooling3D
    model.add(get_con3d_block(filters=32, kernel_size=(3, 3, 3)))
    model.add(get_maxpooling3d_block(pool_size=(1, 2, 2), strides=(1, 2, 2)))
    model.add(get_con3d_block(filters=32, kernel_size=(3, 3, 3)))
    model.add(get_maxpooling3d_block(pool_size=(1, 2, 2), strides=(1, 2, 2)))
    model.add(get_con3d_block(filters=64, kernel_size=(3, 3, 3)))
    model.add(get_maxpooling3d_block(pool_size=(1, 2, 2), strides=(1, 2, 2)))
    model.add(get_con3d_block(filters=64, kernel_size=(3, 3, 3)))
    model.add(get_maxpooling3d_block(pool_size=(1, 2, 2), strides=(1, 2, 2)))
    model.add(get_con3d_block(filters=128, kernel_size=(3, 3, 3)))
    model.add(get_maxpooling3d_block(pool_size=(1, 2, 2), strides=(1, 2, 2)))
    model.add(get_con3d_block(filters=128, kernel_size=(3, 3, 3)))
    model.add(get_maxpooling3d_block(pool_size=(1, 2, 2), strides=(1, 2, 2)))
    model.add(get_con3d_block(filters=256, kernel_size=(3, 3, 3)))
    model.add(get_maxpooling3d_block(pool_size=(1, 2, 2), strides=(1, 2, 2)))
    model.add(get_con3d_block(filters=256, kernel_size=(3, 3, 3)))
    model.add(get_maxpooling3d_block(pool_size=(1, 2, 2), strides=(1, 2, 2)))
    # Flatten
    model.add(tf.keras.layers.Flatten())
    # FC layers group
    model.add(tf.keras.layers.Dense(256, activation='relu', name='fc6'))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(128, activation='relu', name='fc7'))
    model.add(tf.keras.layers.Dropout(0.2))
    model.add(tf.keras.layers.Dense(output_dim))
    return model
def create_multimodal_baseline_model(label_number=31, txt_maxlen=20, text_vocab_size=15799, text_embedding_dim=100,
                                     text_lstm_units=64, text_output_dim=50,
                                     image_height=270, image_width=480, image_channels=3, image_output_dim=50,
                                     max_video_frame_number=100, video_height=360, video_width=640, video_channels=3,
                                     video_output_dim=50):
    """
    Multimodal Baseline Model
    Text model parameters:
        [vocab_size, txt_maxlen, text_embedding_dim, text_lstm_units, text_output_dim]
    Image model parameters:
        [image_height, image_width, image_channels, image_output_dim]
    Video model parameters:
        [max_video_frame_number, video_height, video_width, video_channels, video_output_dim]
    label_number
    """
    text_input = tf.keras.layers.Input(shape=(txt_maxlen,), name='text')  # shape must be a tuple
    image_input = tf.keras.layers.Input(shape=(image_height, image_width, image_channels), name='image')
    video_input = tf.keras.layers.Input(shape=(max_video_frame_number, video_height, video_width, video_channels),
                                        name='video')
    text_model = create_text_baseline_model(txt_maxlen, text_vocab_size, text_embedding_dim, text_lstm_units,
                                            text_output_dim)
    image_model = create_image_baseline_model(image_height, image_width, image_channels, image_output_dim)
    video_model = create_video_baseline_model(max_video_frame_number, video_height, video_width, video_channels,
                                              video_output_dim)
    text_feature = text_model(text_input)
    image_feature = image_model(image_input)
    video_feature = video_model(video_input)
    multimodal_feature = tf.keras.layers.concatenate([text_feature, image_feature, video_feature], axis=-1)
    x = tf.keras.layers.Dense(100)(multimodal_feature)
    label_predict = tf.keras.layers.Dense(label_number, activation='softmax', name='label_predict')(x)
    multimodal_baseline_model = tf.keras.Model(inputs=[text_input, image_input, video_input], outputs=[label_predict])
    return multimodal_baseline_model
import tensorflow as tf
from mutimodal_baseline_model import create_text_baseline_model, create_image_baseline_model, \
    create_video_baseline_model, create_multimodal_baseline_model

if __name__ == '__main__':
    shuffle_data = True
    BATCH_SIZE = 5
    REPEAT_DATASET = None
    vocab_size = 15798 + 1  # 1 for unknown
    txt_maxlen = 20
    image_height = 270
    image_width = 480
    image_channels = 3
    max_video_frame_number = 100
    video_height = 360
    video_width = 640
    video_channels = 3
    label_number = 31

    batch_txt_data = tf.random.uniform((BATCH_SIZE, txt_maxlen), 0, vocab_size, dtype=tf.int32)
    print("batch_txt_data.shape", batch_txt_data.shape)
    text_model = create_text_baseline_model(txt_maxlen, vocab_size, embedding_dim=100, lstm_units=64, output_dim=50)
    tf.keras.utils.plot_model(text_model, show_shapes=True, to_file='text_model_baseline_model.png')
    batch_txt_feature = text_model(batch_txt_data)
    print("batch_txt_feature.shape", batch_txt_feature.shape)
    print("")

    batch_image_data = tf.random.normal(shape=(BATCH_SIZE, image_height, image_width, image_channels))
    print("batch_image_data.shape", batch_image_data.shape)
    image_model = create_image_baseline_model(image_height, image_width, image_channels, output_dim=50)
    tf.keras.utils.plot_model(image_model, show_shapes=True, to_file='image_model_baseline_model.png')
    batch_image_feature = image_model(batch_image_data)
    print("batch_image_feature.shape", batch_image_feature.shape)
    print("")

    batch_video_data = tf.random.normal(
        shape=(BATCH_SIZE, max_video_frame_number, video_height, video_width, video_channels))
    print("batch_video_data.shape", batch_video_data.shape)
    video_model = create_video_baseline_model(max_video_frame_number, video_height, video_width, video_channels,
                                              output_dim=50)
    tf.keras.utils.plot_model(video_model, show_shapes=True, to_file='video_model_baseline_model.png')
    batch_video_feature = video_model(batch_video_data)
    print("batch_video_feature.shape", batch_video_feature.shape)
    print("")

    batch_txt_data = tf.random.uniform((BATCH_SIZE, txt_maxlen), 0, vocab_size, dtype=tf.int32)
    batch_image_data = tf.random.normal(shape=(BATCH_SIZE, image_height, image_width, image_channels))
    batch_video_data = tf.random.normal(
        shape=(BATCH_SIZE, max_video_frame_number, video_height, video_width, video_channels))
    multimodal_model = create_multimodal_baseline_model(label_number=label_number, txt_maxlen=txt_maxlen,
                                                        text_vocab_size=vocab_size, text_embedding_dim=100,
                                                        text_lstm_units=64, text_output_dim=50,
                                                        image_height=image_height, image_width=image_width,
                                                        image_channels=image_channels, image_output_dim=50,
                                                        max_video_frame_number=max_video_frame_number,
                                                        video_height=video_height, video_width=video_width,
                                                        video_channels=video_channels, video_output_dim=50)
    tf.keras.utils.plot_model(multimodal_model, show_shapes=True, to_file='multimodal_baseline_model.png')
    multimodal_model_out = multimodal_model([batch_txt_data, batch_image_data, batch_video_data])
    print("multimodal_model_out.shape", multimodal_model_out.shape)
## How to use the download tools
### Requirements
+ Python 3+, e.g. python==3.6
+ scrapy
+ beautifulsoup4
### Running the web crawler
```shell
cd xinpianchang
python start_MP4_meta_info.py
```
### Detailed Configuration
[MP4_meta_info.py](data_download_tools/xinpianchang/xinpianchang/spiders/MP4_meta_info.py)
Lines 7~13: all video types are crawled by default.
```python
video_type_dict = {'360VR': 'VR', '4k': '4K', 'Technology': '科技', 'Sport': '运动', 'Timelapse': '延时',
'Aerial': '航拍', 'Animals': '动物', 'Sea': '大海', 'Beach': '海滩', 'space': '太空',
'stars': '星空', 'City': '城市', 'Business': '商业', 'Underwater': '水下摄影',
'Wedding': '婚礼', 'Archival': '档案', 'Backgrounds': '背景', 'Alpha Channel': '透明通道',
'Intro': '开场', 'Celebration': '庆典', 'Clouds': '云彩', 'Corporate': '企业',
'Explosion': '爆炸', 'Film': '电影镜头', 'Green Screen': '绿幕', 'Military': '军事',
'Nature': '自然', 'News': '新闻', 'R3d': 'R3d', 'Romantic': '浪漫', 'Abstract': '抽象'}
```
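If you only need a few categories, one option (a hypothetical edit, not part of the shipped spider) is to filter this dictionary before the spider iterates over it:

```python
# Hypothetical subset selection: keep only the categories you want to crawl,
# then use the filtered dict in place of video_type_dict in MP4_meta_info.py.
video_type_dict = {'360VR': 'VR', '4k': '4K', 'Technology': '科技', 'Aerial': '航拍'}
wanted = ['4k', 'Aerial']
video_type_subset = {k: v for k, v in video_type_dict.items() if k in wanted}
print(video_type_subset)  # {'4k': '4K', 'Aerial': '航拍'}
```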
Lines 30~35: the smaller `DOWNLOAD_DELAY` is, the faster data is captured (at higher risk of being blocked by the site).
```python
custom_settings = {
'DOWNLOAD_DELAY': 3.5,
'DOWNLOAD_TIMEOUT': 180,
'RANDOMIZE_DOWNLOAD_DELAY': True,
'JOBDIR': "reamin/MP4_meta_info_001"
}
```
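According to Scrapy's documentation, with `RANDOMIZE_DOWNLOAD_DELAY` enabled each wait is drawn uniformly from 0.5x to 1.5x `DOWNLOAD_DELAY`, so the settings above average about 3.5 s per request. A small sketch of that behavior (illustrative only, not Scrapy's internals):

```python
import random

def effective_delay(download_delay=3.5, randomize=True):
    """Approximate Scrapy's per-request wait: uniform in
    [0.5 * DOWNLOAD_DELAY, 1.5 * DOWNLOAD_DELAY] when randomized."""
    if randomize:
        return random.uniform(0.5 * download_delay, 1.5 * download_delay)
    return download_delay

delays = [effective_delay(3.5) for _ in range(1000)]
print(min(delays) >= 1.75 and max(delays) <= 5.25)  # True
```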
import pathlib
import os
import random
import matplotlib.pyplot as plt
def get_video_type(dir_name="MP4_download"):
"""
:param dir_name:
:return: video_type: , example: {
'360VR': 'VR', '4k': '4K', 'Technology': '科技', 'Sport': '运动', 'Timelapse': '延时',
'Aerial': '航拍', 'Animals': '动物', 'Sea': '大海', 'Beach': '海滩', 'space': '太空',
'stars': '星空', 'City': '城市', 'Business': '商业', 'Underwater': '水下摄影',
'Wedding': '婚礼', 'Archival': '档案', 'Backgrounds': '背景', 'Alpha Channel': '透明通道',
'Intro': '开场', 'Celebration': '庆典', 'Clouds': '云彩', 'Corporate': '企业',
'Explosion': '爆炸', 'Film': '电影镜头', 'Green Screen': '绿幕', 'Military': '军事',
'Nature': '自然', 'News': '新闻', 'R3d': 'R3d', 'Romantic': '浪漫', 'Abstract': '抽象'}
"""
    dir_name = pathlib.Path(dir_name)
    # path.name is portable across OSes, unlike splitting on "/"
    video_type = [path.name for path in dir_name.glob('*') if path.is_dir()]
print("Existing Video Types Numbers:\t", len(video_type))
print("Existing Video Types :\t", video_type)
print("")
return video_type
def get_description_information(txt_path):
    """description_information include: {'mp4_id': '', 'mp4_download_url': '', 'mp4_time': '',
    'mp4_background_image_url': '', 'mp4_txt_brief': ''}"""
    import ast  # literal_eval only parses Python literals, safer than eval
    with open(txt_path, encoding="utf-8") as rf:
        description_information_dict = ast.literal_eval(rf.read())
    return description_information_dict
def show_image_and_description_information(image_path, description_information_dict):
lena = plt.imread(image_path)
plt.imshow(lena)
plt.title(description_information_dict["mp4_background_image_url"])
plt.xlabel(description_information_dict["mp4_txt_brief"])
plt.ylabel(description_information_dict["mp4_id"])
plt.xticks([])
plt.yticks([])
# plt.axis('off')
plt.show()
def check_download_file(dir_name="MP4_download", video_type=None, shuffle_data=False,
print_file_path=False, show_txt=False, show_image=False, check_number=None, ):
"""
Check one type of video file downloaded from https://www.xinpianchang.com/
:param dir_name: Root directory for storing data
:param video_type: Check video type, example: {
'360VR': 'VR', '4k': '4K', 'Technology': '科技', 'Sport': '运动', 'Timelapse': '延时',
'Aerial': '航拍', 'Animals': '动物', 'Sea': '大海', 'Beach': '海滩', 'space': '太空',
'stars': '星空', 'City': '城市', 'Business': '商业', 'Underwater': '水下摄影',
'Wedding': '婚礼', 'Archival': '档案', 'Backgrounds': '背景', 'Alpha Channel': '透明通道',
'Intro': '开场', 'Celebration': '庆典', 'Clouds': '云彩', 'Corporate': '企业',
'Explosion': '爆炸', 'Film': '电影镜头', 'Green Screen': '绿幕', 'Military': '军事',
'Nature': '自然', 'News': '新闻', 'R3d': 'R3d', 'Romantic': '浪漫', 'Abstract': '抽象'}
:param shuffle_data: Scrambling data, sampling check
:param print_file_path: Print out all file paths
:param show_txt: Print video meta information
:param show_image: Show video cover image
:param check_number: Number of files to check, None stands for all
:return: Number of various documents (all_item_number, txt_number, image_number, video_number)
"""
    dir_name = pathlib.Path(dir_name)
    if video_type is None:
        raise ValueError("video_type must be specified, e.g. '4k'")
    path_mode = video_type + "/*"
all_item_paths = list(dir_name.glob(path_mode))
all_item_paths = [str(path) for path in all_item_paths]
if shuffle_data:
random.shuffle(all_item_paths)
all_item_number = len(all_item_paths)
txt_number = 0
image_number = 0
video_number = 0
    for idx, item in enumerate(all_item_paths):
        # Reset per-item paths so a missing file is not masked by the previous item's files
        txt_path, image_path, mp4_path = "", "", ""
        item_id = item.split("/")[-1]
        item_type = item.split("/")[1]
for item_file in os.listdir(item):
if item_file.endswith('.txt'):
txt_path = os.path.join(item, item_file)
elif item_file.endswith('.jpeg'):
image_path = os.path.join(item, item_file)
elif item_file.endswith('.mp4'):
mp4_path = os.path.join(item, item_file)
else:
raise ValueError("An abnormal document appeared! check!")
if os.path.exists(txt_path):
description_information_dict = get_description_information(txt_path)
else:
description_information_dict = {'mp4_id': '', 'mp4_download_url': '', 'mp4_time': '',
'mp4_background_image_url': '', 'mp4_txt_brief': ''}
        if os.path.exists(txt_path):
            if print_file_path:
                print(f"exists {txt_path}")
            txt_number += 1
            if show_txt:
                print(open(txt_path, encoding="utf-8").read())
                print("item_type:\t", item_type)
        else:
            if print_file_path:
                print(f"Not exists {txt_path}")
        if os.path.exists(image_path):
            if print_file_path:
                print(f"exists {image_path}")
            image_number += 1
            if show_image:
                show_image_and_description_information(image_path, description_information_dict)
        else:
            if print_file_path:
                print(f"Not exists {image_path}")
if os.path.exists(mp4_path):
if print_file_path:
print(f"exists {mp4_path}")
video_number += 1
else:
if print_file_path:
print(f"Not exists {mp4_path}")
if print_file_path:
print("")
if check_number is not None:
if idx == check_number - 1:
break
count_item_number_list = [all_item_number, txt_number, image_number, video_number]
if len(set(count_item_number_list)) == 1:
print("All documents are complete!")
else:
print("Document missing!")
print("all_item_number:\t", all_item_number)
print("txt_number:\t", txt_number)
print("image_number:\t", image_number)
print("video_number:\t", video_number)
return all_item_number, txt_number, image_number, video_number
def check_all_downloaded_files(dir_name="MP4_download"):
"""Check all files downloaded from https://www.xinpianchang.com/"""
for mp4_type in get_video_type(dir_name=dir_name):
print(f"video_type\t:{mp4_type}")
check_download_file(dir_name=dir_name, video_type=mp4_type, shuffle_data=False, print_file_path=False, show_txt=False, show_image=False, check_number=None)
print(" ")
# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html
[settings]
default = xinpianchang.settings
[deploy]
#url = http://localhost:6800/
project = xinpianchang
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'MP4'])
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'MP4_meta_info'])
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class XinpianchangItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
pass
# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
class XinpianchangSpiderMiddleware(object):
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the spider middleware does not modify the
# passed objects.
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_spider_input(self, response, spider):
# Called for each response that goes through the spider
# middleware and into the spider.
# Should return None or raise an exception.
return None
def process_spider_output(self, response, result, spider):
# Called with the results returned from the Spider, after
# it has processed the response.
# Must return an iterable of Request, dict or Item objects.
for i in result:
yield i
def process_spider_exception(self, response, exception, spider):
# Called when a spider or process_spider_input() method
# (from other spider middleware) raises an exception.
# Should return either None or an iterable of Response, dict
# or Item objects.
pass
def process_start_requests(self, start_requests, spider):
# Called with the start requests of the spider, and works
# similarly to the process_spider_output() method, except
# that it doesn’t have a response associated.
# Must return only requests (not items).
for r in start_requests:
yield r
def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name)
class XinpianchangDownloaderMiddleware(object):
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the downloader middleware does not modify the
# passed objects.
@classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_request(self, request, spider):
# Called for each request that goes through the downloader
# middleware.
# Must either:
# - return None: continue processing this request
# - or return a Response object
# - or return a Request object
# - or raise IgnoreRequest: process_exception() methods of
# installed downloader middleware will be called
return None
def process_response(self, request, response, spider):
# Called with the response returned from the downloader.
# Must either;
# - return a Response object
# - return a Request object
# - or raise IgnoreRequest
return response
def process_exception(self, request, exception, spider):
# Called when a download handler or a process_request()
# (from other downloader middleware) raises an exception.
# Must either:
# - return None: continue processing this exception
# - return a Response object: stops process_exception() chain
# - return a Request object: stops process_exception() chain
pass
def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name)
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
class XinpianchangPipeline(object):
def process_item(self, item, spider):
return item
# -*- coding: utf-8 -*-
# Scrapy settings for xinpianchang project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'xinpianchang'
SPIDER_MODULES = ['xinpianchang.spiders']
NEWSPIDER_MODULE = 'xinpianchang.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'xinpianchang (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'xinpianchang.middlewares.XinpianchangSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
#'xinpianchang.middlewares.XinpianchangDownloaderMiddleware': 543,
'xinpianchang.user_agent.RotateUserAgentMiddleware': 400,
}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'xinpianchang.pipelines.XinpianchangPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
import json
import os
def read_raw_mp4_info_json_file_2_list(mp4_info_file="mp4_info.json"):
    MP4_info_list = list()
    with open(mp4_info_file, 'r', encoding='utf-8') as rf:
        for line in rf:
            line_dict = json.loads(line)
            MP4_info_list.append(line_dict["mp4_download_url"])
    return MP4_info_list
class Mp4Spider(CrawlSpider):
name = 'MP4'
allowed_domains = ['xinpianchang.com']
start_urls = ['https://www.xinpianchang.com/']
custom_settings = {
'DOWNLOAD_DELAY': 2,
'RANDOMIZE_DOWNLOAD_DELAY': True,
}
def start_requests(self):
self.MP4_base_dir = "MP4_download"
task_mp4_type_list = ["4k_url", ]
for mp4_type in task_mp4_type_list:
mp4_type_store_dir = os.path.join(self.MP4_base_dir, mp4_type)
if not os.path.exists(mp4_type_store_dir):
os.makedirs(mp4_type_store_dir)
MP4_info_list = read_raw_mp4_info_json_file_2_list(os.path.join("task_file", mp4_type + ".json"))
for url in MP4_info_list:
yield scrapy.Request(url=url, callback=self.parse_video, meta={'mp4_type': mp4_type},
headers={'Referer': 'https://www.xinpianchang.com/'})
def parse_video(self, response):
meta = response.meta
url = response.url
mp4_type = response.meta["mp4_type"]
file_name = url.split("/")[-1]
mp4_type_store_dir = os.path.join(self.MP4_base_dir, mp4_type)
video_local_path = os.path.join(mp4_type_store_dir, file_name)
with open(video_local_path, "wb") as f:
f.write(response.body)
yield meta
# -*- coding: utf-8 -*-
import scrapy
import random
import os
from bs4 import BeautifulSoup
video_type_dict = {'360VR': 'VR', '4k': '4K', 'Technology': '科技', 'Sport': '运动', 'Timelapse': '延时',
'Aerial': '航拍', 'Animals': '动物', 'Sea': '大海', 'Beach': '海滩', 'space': '太空',
'stars': '星空', 'City': '城市', 'Business': '商业', 'Underwater': '水下摄影',
'Wedding': '婚礼', 'Archival': '档案', 'Backgrounds': '背景', 'Alpha Channel': '透明通道',
'Intro': '开场', 'Celebration': '庆典', 'Clouds': '云彩', 'Corporate': '企业',
'Explosion': '爆炸', 'Film': '电影镜头', 'Green Screen': '绿幕', 'Military': '军事',
'Nature': '自然', 'News': '新闻', 'R3d': 'R3d', 'Romantic': '浪漫', 'Abstract': '抽象'}
def get_page_start_end_by_mp4_type(mp4_type):
# Check https://resource.xinpianchang.com/video/list for the latest information
# The update time is 2019/07/19
if mp4_type in ["360VR"]:
return 1, 18
elif mp4_type in ["Archival"]:
return 1, 170
elif mp4_type in ["R3d"]:
return 1, 264
else:
return 1, 301
class Mp4Spider(scrapy.Spider):
name = 'MP4_meta_info'
start_urls = ['https://www.xinpianchang.com/']
custom_settings = {
'DOWNLOAD_DELAY': 3.5,
'DOWNLOAD_TIMEOUT': 180,
'RANDOMIZE_DOWNLOAD_DELAY': True,
'JOBDIR': "reamin/MP4_meta_info_001"
}
def start_requests(self):
self.MP4_base_dir = "MP4_download"
accessed_url_file = "reamin/accessed_url.txt"
if not os.path.exists(self.MP4_base_dir):
os.mkdir(self.MP4_base_dir)
video_type_list = list(video_type_dict.keys())
random.shuffle(video_type_list)
for mp4_type in video_type_list:
if mp4_type not in ["Explosion", ]:
mp4_type_store_dir = os.path.join(self.MP4_base_dir, mp4_type)
if not os.path.exists(mp4_type_store_dir):
os.makedirs(mp4_type_store_dir)
page_number_start, page_number_end = get_page_start_end_by_mp4_type(mp4_type)
for page_number in range(page_number_start, page_number_end):
mp4_list_page_url = f"https://resource.xinpianchang.com/video/list?cate={mp4_type}&page={page_number}"
yield scrapy.Request(url=mp4_list_page_url, callback=self.parse_video_meta_info,
meta={'mp4_type_store_dir': mp4_type_store_dir},
headers={'Referer': 'https://www.xinpianchang.com/'})
def parse_video_meta_info(self, response):
mp4_type_store_dir = response.meta["mp4_type_store_dir"]
bs = BeautifulSoup(response.body, "html.parser")
for index, item in enumerate(
bs.find_all("li", {"class": {"single-video J_sigle_video", "single-video J_sigle_video detail-more"}})):
mp4_id = item["id"]
mp4_download_url = item['data-preview']
mp4_time = item.find_all("div", class_="single-video-duration")[0].string
mp4_background_image_url = item.find_all("div", class_="thumb-img")[0]["style"][len("background-image:url("):-1]
mp4_txt_brief = item.find_all("p", class_="single-brief J_single_brief")[0].string
mp4_meta_info_dict = {"mp4_id": mp4_id, "mp4_download_url": mp4_download_url, "mp4_time": mp4_time,
"mp4_background_image_url": mp4_background_image_url,
"mp4_txt_brief": mp4_txt_brief}
mp4_meta_info_dir = os.path.join(mp4_type_store_dir, str(mp4_id))
if not os.path.exists(mp4_meta_info_dir):
os.makedirs(mp4_meta_info_dir)
with open(os.path.join(mp4_meta_info_dir, str(mp4_id) + ".txt"), "w", encoding="utf-8") as mp4_meta_wf:
mp4_meta_wf.write(str(mp4_meta_info_dict))
yield scrapy.Request(url=mp4_download_url, callback=self.parse_video,
meta={'mp4_meta_info_dir': mp4_meta_info_dir})
yield scrapy.Request(url=mp4_background_image_url, callback=self.parse_background_image,
meta={'mp4_meta_info_dir': mp4_meta_info_dir})
def parse_video(self, response):
mp4_meta_info_dir = response.meta["mp4_meta_info_dir"]
url = response.url
file_name = url.split("/")[-1]
video_local_path = os.path.join(mp4_meta_info_dir, file_name)
with open(video_local_path, "wb") as f:
f.write(response.body)
def parse_background_image(self, response):
mp4_meta_info_dir = response.meta["mp4_meta_info_dir"]
url = response.url
file_name = url.split("/")[-1]
image_local_path = os.path.join(mp4_meta_info_dir, file_name)
with open(image_local_path, "wb") as f:
f.write(response.body)
# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.
# coding:utf-8
import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
class RotateUserAgentMiddleware(UserAgentMiddleware):
'''
for more user agent strings,you can find it in http://www.useragentstring.com/pages/useragentstring.php
'''
user_agent_list = [
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
"(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
"(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
"(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
"(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
"(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
"(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
"(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
"(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
"(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
def process_request(self, request, spider):
        '''Pick a random User-Agent from the list and set it as the default request header.'''
ua = random.choice(self.user_agent_list)
if ua:
request.headers.setdefault('User-Agent', ua)
request.headers.setdefault('accept-language','zh-CN,zh;q=0.8')
import os
import random
import pathlib
import cv2
import numpy as np
from tensorflow import keras
import tensorflow_datasets as tfds
video_type_dict = {'360VR': 'VR', '4k': '4K', 'Technology': '科技', 'Sport': '运动', 'Timelapse': '延时',
'Aerial': '航拍', 'Animals': '动物', 'Sea': '大海', 'Beach': '海滩', 'space': '太空',
'stars': '星空', 'City': '城市', 'Business': '商业', 'Underwater': '水下摄影',
'Wedding': '婚礼', 'Archival': '档案', 'Backgrounds': '背景', 'Alpha Channel': '透明通道',
'Intro': '开场', 'Celebration': '庆典', 'Clouds': '云彩', 'Corporate': '企业',
'Explosion': '爆炸', 'Film': '电影镜头', 'Green Screen': '绿幕', 'Military': '军事',
'Nature': '自然', 'News': '新闻', 'R3d': 'R3d', 'Romantic': '浪漫', 'Abstract': '抽象'}
video_type_list = ['360VR', '4k', 'Abstract', 'Aerial', 'Alpha Channel', 'Animals', 'Archival', 'Backgrounds', 'Beach',
'Business', 'Celebration', 'City', 'Clouds', 'Corporate', 'Explosion', 'Film', 'Green Screen',
'Intro', 'Military', 'Nature', 'News', 'R3d', 'Romantic', 'Sea', 'Sport', 'Technology', 'Timelapse',
'Underwater', 'Wedding', 'space', 'stars']
video_label_to_id = {'360VR': 0, '4k': 1, 'Abstract': 2, 'Aerial': 3, 'Alpha Channel': 4, 'Animals': 5, 'Archival': 6,
'Backgrounds': 7, 'Beach': 8, 'Business': 9, 'Celebration': 10, 'City': 11, 'Clouds': 12,
'Corporate': 13, 'Explosion': 14, 'Film': 15, 'Green Screen': 16, 'Intro': 17, 'Military': 18,
'Nature': 19, 'News': 20, 'R3d': 21, 'Romantic': 22, 'Sea': 23, 'Sport': 24, 'Technology': 25,
'Timelapse': 26, 'Underwater': 27, 'Wedding': 28, 'space': 29, 'stars': 30}
def standardization_of_file_names(data_root="MP4_download"):
"""
Uniform naming format for each set of data as follows:
multimodal_data_id
        multimodal_data_id.jpeg
multimodal_data_id.mp4
multimodal_data_id.txt
"""
# Get all multimodal data type names
data_root = pathlib.Path(data_root)
label_names_list = sorted(item.name for item in data_root.glob('*/') if item.is_dir())
print(f"data_root contain video type numbers {len(label_names_list)}")
print(f"data_root contain video type {label_names_list}")
# Processing multimodal data sequentially
for label_name in label_names_list:
# Get all folders under a certain type of multimodal data
label_mode = label_name + "/*"
multimodal_data_dir = list(data_root.glob(label_mode))
multimodal_data_dir = [str(path) for path in multimodal_data_dir]
# File name for standardized multimodal data
for multimodal_data_path in multimodal_data_dir:
multimodal_data_id = os.path.basename(multimodal_data_path)
for item_file in os.listdir(multimodal_data_path):
item_file = os.path.join(multimodal_data_path, item_file)
if item_file.endswith('.txt'):
os.rename(item_file, os.path.join(multimodal_data_path, multimodal_data_id + ".txt"))
elif item_file.endswith('.jpeg'):
os.rename(item_file, os.path.join(multimodal_data_path, multimodal_data_id + ".jpeg"))
elif item_file.endswith('.mp4'):
os.rename(item_file, os.path.join(multimodal_data_path, multimodal_data_id + ".mp4"))
elif item_file.endswith('.ipynb_checkpoints'):
pass
else:
raise ValueError("An abnormal document appeared! check!")
def get_filtered_all_multimodal_data_item_file_dir_list(data_root="MP4_download"):
"""
:param data_root: Original file root path
:return: filtered_all_multimodal_data_item_file_dir_list
['data_root/360VR/89422838', 'data_root/360VR/107178375', 'data_root/360VR/67370207']
"""
def delete_incomplete_data(multimodal_data_item_file_dir):
multimodal_data_id = os.path.basename(multimodal_data_item_file_dir)
txt_file_path = os.path.join(multimodal_data_item_file_dir, multimodal_data_id + ".txt")
jpeg_file_path = os.path.join(multimodal_data_item_file_dir, multimodal_data_id + ".jpeg")
mp4_file_path = os.path.join(multimodal_data_item_file_dir, multimodal_data_id + ".mp4")
for file_path in [mp4_file_path, jpeg_file_path, txt_file_path]:
if not os.path.exists(file_path):
return False
return True
# Get all multimodal data type names
data_root = pathlib.Path(data_root)
label_names_list = sorted(item.name for item in data_root.glob('*/') if item.is_dir())
all_multimodal_data_item_file_dir_list = list()
for label_name in label_names_list:
# Get all folders under a certain type of multimodal data
label_mode = label_name + "/*"
multimodal_data_dir = list(data_root.glob(label_mode))
multimodal_data_dir = [str(path) for path in multimodal_data_dir]
all_multimodal_data_item_file_dir_list.extend(multimodal_data_dir)
print("all_multimodal_data_item_file_dir_list length", len(all_multimodal_data_item_file_dir_list))
filtered_all_multimodal_data_item_file_dir_list = list(
filter(delete_incomplete_data, all_multimodal_data_item_file_dir_list))
print("filtered_all_multimodal_data_item_file_dir_list length",
len(filtered_all_multimodal_data_item_file_dir_list))
return filtered_all_multimodal_data_item_file_dir_list
def get_description_information(txt_path):
    """description_information include: {'mp4_id': '', 'mp4_download_url': '', 'mp4_time': '',
    'mp4_background_image_url': '', 'mp4_txt_brief': ''}"""
    import ast  # literal_eval only parses Python literals, safer than eval
    with open(txt_path, encoding="utf-8") as rf:
        description_information_dict = ast.literal_eval(rf.read())
    return description_information_dict
def get_text_list_from_raw_txt_file(data_root="MP4_download"):
"""
Getting mp4_txt_brief text data from the original file
:param data_root: Original file root path
:return: text_list
"""
data_root = pathlib.Path(data_root)
all_txt_data_paths = [str(path) for path in
list(data_root.glob('*/*/*.txt'))] # [MP4_download/360VR/89422838/89422838.txt,...]
text_list = []
    import ast  # literal_eval only parses Python literals, safer than eval
    for text_data_path in all_txt_data_paths:
        description_information_dict = ast.literal_eval(open(text_data_path, encoding="utf-8").read())
        txt_brief = description_information_dict['mp4_txt_brief']
        text_list.append(txt_brief)
return text_list
def tfds_text_encoder_and_word_set(text_list):
"""
TensorFlow dataset encoder
:param text_list:
:return:
"""
tokenizer = tfds.features.text.Tokenizer()
vocabulary_set = set()
for text in text_list:
some_tokens = tokenizer.tokenize(text)
vocabulary_set.update(some_tokens)
vocab_size = len(vocabulary_set)
print("vocab_size", vocab_size)
text_encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)
example_text = 'I am the blogger of Wangjiang Artificial Think Tank.' \
' Welcome to https://yuanxiaosc.github.io./'
encoded_example = text_encoder.encode(example_text)
print("example_text:\t", example_text)
print("encoded_example:\t", encoded_example)
return text_encoder, vocabulary_set
def multimodal_data_path_generator(data_root="MP4_download", shuffle_data=False):
"""
Multimodal Data Path Generator
:param data_root: Original file root path
:param shuffle_data: Disrupt data order
:return:data_path_generator
Usage method:
for mp4_file_path, jpeg_file_path, txt_file_path, label in multimodal_data_path_generator(data_root,
shuffle_data):
print("")
print("mp4_file_path", mp4_file_path)
print("jpeg_file_path", jpeg_file_path)
print("txt_file_path", txt_file_path)
print("label", label)
"""
multimodal_data_item_file_dir_list = get_filtered_all_multimodal_data_item_file_dir_list(data_root)
if shuffle_data:
random.shuffle(multimodal_data_item_file_dir_list)
for item_file_dir in multimodal_data_item_file_dir_list: # data_root/Business/849
multimodal_data_id = os.path.basename(item_file_dir) # 849
label = os.path.basename(os.path.dirname(item_file_dir)) # Business
txt_file_path = os.path.join(item_file_dir, multimodal_data_id + ".txt") # data_root/Business/849/849.txt
jpeg_file_path = os.path.join(item_file_dir, multimodal_data_id + ".jpeg") # data_root/Business/849/849.jpeg
mp4_file_path = os.path.join(item_file_dir, multimodal_data_id + ".mp4") # data_root/Business/849/849.mp4
# yield data_root/Business/849/849.mp4, data_root/Business/849/849.jpeg, data_root/Business/849/849.txt, Business
yield mp4_file_path, jpeg_file_path, txt_file_path, label
def get_multimodal_data_path_list(data_root="MP4_download", shuffle_data=False):
"""
Getting a multimodal data path list
:param data_root: Original file root path
:param shuffle_data: Disrupt data order
:return:
"""
multimodal_data_path_list = [(mp4_file_path, jpeg_file_path, txt_file_path) for
mp4_file_path, jpeg_file_path, txt_file_path, label in
multimodal_data_path_generator(data_root, shuffle_data)]
return multimodal_data_path_list
def multimodal_encode_data_generator(data_root="MP4_download", shuffle_data=False, txt_maxlen=25,
max_video_frame_number=None, video_width=640, video_height=360):
"""
Multimodal Encode Data Generator
:param data_root: Original file root path
:param shuffle_data: Disrupt data order
:param max_video_frame_number: None -> keep all video number, int -> max need video frame number
:return: multimodal_encode_data_generator
Usage method:
for encode_video, image_file_path, encode_txt, encode_label in encode_multimodal_data(fake_data_root,
shuffle_data,
max_video_frame_number):
print("")
print("encode_video.shape", encode_video.shape)
print("image_file_path", image_file_path)
print("encode_txt", encode_txt)
print("encode_label", encode_label)
"""
text_list = get_text_list_from_raw_txt_file(data_root)
text_encoder, vocabulary_set = tfds_text_encoder_and_word_set(text_list)
    def process_video(video_file_path, max_video_frame_number=None, video_width=640, video_height=360):
        video_capture = cv2.VideoCapture(video_file_path)
        success, frame = video_capture.read()
        frame_list = []
        frame_number = 0
        while success:
            if frame is None:
                break
            if isinstance(max_video_frame_number, int) and frame_number == max_video_frame_number:
                break
            resize_image_np = cv2.resize(frame, dsize=(video_width, video_height))
            resize_image_np_expanded = np.expand_dims(resize_image_np, axis=0)
            frame_list.append(resize_image_np_expanded)
            frame_number += 1
            success, frame = video_capture.read()
        video_capture.release()
        if not frame_list:
            # Unreadable or empty video: return an empty frame array instead of
            # crashing in np.concatenate (such clips are dropped by the frame-count filter later)
            return np.empty((0, video_height, video_width, 3), dtype=np.uint8)
        encode_video = np.concatenate(frame_list, axis=0)
        return encode_video
    def process_image_data(label):
        # Despite the name, this maps the label string to its integer class id
        encode_label = video_label_to_id[label]
        return encode_label
def process_txt_data(txt_file_path, txt_maxlen=25):
description_information_dict = get_description_information(txt_file_path)
encode_txt = text_encoder.encode(description_information_dict['mp4_txt_brief'])
encode_txt = keras.preprocessing.sequence.pad_sequences(
[encode_txt], maxlen=txt_maxlen, dtype='int32', padding='post', truncating='post', value=0.0)
return encode_txt[0]
for mp4_file_path, jpeg_file_path, txt_file_path, label in multimodal_data_path_generator(data_root, shuffle_data):
encode_video = process_video(mp4_file_path, max_video_frame_number, video_width, video_height)
image_file_path = jpeg_file_path
encode_label = process_image_data(label)
encode_txt = process_txt_data(txt_file_path, txt_maxlen)
yield encode_video, image_file_path, encode_txt, encode_label
if __name__ == "__main__":
data_root = "/home/b418a/disk1/jupyter_workspace/yuanxiao/douyin/xinpianchang/MP4_download"
fake_data_root = "/home/b418a/disk1/pycharm_room/yuanxiao/my_lenovo_P50s/Multimodal-short-video-dataset-and-baseline-model/MP4_download"
    standardized_file_name = False  # Only needs to be executed once: standardizes the file names of the originally downloaded data
shuffle_data = True
txt_maxlen = 25
max_video_frame_number = 100
video_height = 360
video_width = 640
if standardized_file_name:
standardization_of_file_names(data_root)
for mp4_file_path, jpeg_file_path, txt_file_path, label in multimodal_data_path_generator(fake_data_root,
shuffle_data):
print("")
print("mp4_file_path", mp4_file_path)
print("jpeg_file_path", jpeg_file_path)
print("txt_file_path", txt_file_path)
print("label", label)
for encode_video, image_file_path, encode_txt, encode_label in multimodal_encode_data_generator(fake_data_root,
shuffle_data,
txt_maxlen,
max_video_frame_number,
video_width,
video_height):
print("")
print("encode_video.shape", encode_video.shape)
print("image_file_path", image_file_path)
print("encode_txt.shape", encode_txt.shape)
print("encode_txt", encode_txt)
print("encode_label", encode_label)
from tensorflow_dataset_interface import multimodel_numpy_data_interface
if __name__=="__main__":
data_root = "/home/b418a/disk1/jupyter_workspace/yuanxiao/douyin/xinpianchang/MP4_download"
fake_data_root = "/home/b418a/disk1/pycharm_room/yuanxiao/my_lenovo_P50s/Multimodal-short-video-dataset-and-baseline-model/MP4_download"
shuffle_data = True
BATCH_SIZE = 5
REPEAT_DATASET = None
txt_maxlen = 20
image_height = 270
image_width = 480
max_video_frame_number = 100
video_height = 360
video_width = 640
numpy_generator = multimodel_numpy_data_interface(fake_data_root, shuffle_data, BATCH_SIZE, REPEAT_DATASET,
txt_maxlen, image_height, image_width,
max_video_frame_number, video_height, video_width)
for encode_video, encode_image, encoded_text, encode_label in numpy_generator:
print("")
print("encode_video", encode_video.shape, encode_video.dtype)
print("encode_image", encode_image.shape, encode_image.dtype)
print("encoded_text", encoded_text.shape, encoded_text.dtype)
print("encode_label", encode_label.shape, encode_label.dtype)
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
import numpy as np
import pathlib
from dataset_public_interface import multimodal_encode_data_generator
def multimodal_tensorflow_dataset(data_root, shuffle_data=False, BATCH_SIZE=100, REPEAT_DATASET=None,
txt_maxlen=20, image_height=270, image_width=480,
max_video_frame_number=100, video_height=360, video_width=640):
"""
multimodal tensorflow dataset
:param data_root: Original file root path
Usage method:
multimodal_dataset = multimodal_tensorflow_dataset(fake_data_root, shuffle_data, BATCH_SIZE, REPEAT_DATASET,
txt_maxlen, image_height, image_width,
max_video_frame_number, video_height, video_width)
i = 0
for encode_video, image, encoded_text, encode_label in multimodal_dataset:
print(f"{i}")
print(encode_video.shape, encode_video.dtype)
print(image.shape, image.dtype)
print(encoded_text.shape, encoded_text.dtype)
print(encode_label.shape, encode_label.dtype)
i += 1
"""
def filter_video_data(encode_video, image_file_path, encoded_text, encode_label):
"""
Filtered video is not equal to the specified(max_video_frame_number) number of frames
"""
video_frame_number = tf.shape(encode_video)[0]
return tf.math.equal(video_frame_number, max_video_frame_number)
def parser_multimodal_data(encode_video, image_file_path, encoded_text, encode_label):
def parser_image_data(jpeg_file_path):
"""
Read the picture data and specify the value in the [-1,1] range
"""
image = tf.io.read_file(jpeg_file_path)
image = tf.image.decode_jpeg(image)
image = tf.image.resize(image, [image_height, image_width])
image = tf.cast(image, dtype=tf.float32)
image = (image / 127.5) - 1.0
return image
image = parser_image_data(image_file_path)
return encode_video, image, encoded_text, encode_label
multimodal_dataset = tf.data.Dataset.from_generator(
lambda: multimodal_encode_data_generator(data_root, shuffle_data, txt_maxlen,
max_video_frame_number, video_width, video_height),
output_shapes=(tf.TensorShape([None, video_height, video_width, 3]),
tf.TensorShape(None), tf.TensorShape(txt_maxlen), tf.TensorShape(())),
output_types=(tf.float32, tf.string, tf.int32, tf.int32))
multimodal_dataset = multimodal_dataset.map(parser_multimodal_data,
num_parallel_calls=tf.data.experimental.AUTOTUNE)
multimodal_dataset = multimodal_dataset.filter(filter_video_data)
multimodal_dataset = multimodal_dataset.repeat(REPEAT_DATASET)
multimodal_dataset = multimodal_dataset.batch(BATCH_SIZE)
multimodal_dataset = multimodal_dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
return multimodal_dataset
def multimodel_numpy_data_interface(data_root, shuffle_data=False, BATCH_SIZE=100, REPEAT_DATASET=None,
txt_maxlen=20, image_height=270, image_width=480,
max_video_frame_number=100, video_height=360, video_width=640):
multimodal_dataset = multimodal_tensorflow_dataset(data_root, shuffle_data, BATCH_SIZE, REPEAT_DATASET,
txt_maxlen, image_height, image_width,
max_video_frame_number, video_height, video_width)
for encode_video, encode_image, encoded_text, encode_label in multimodal_dataset:
yield encode_video.numpy(), encode_image.numpy(), encoded_text.numpy(), encode_label.numpy()
if __name__ == "__main__":
data_root = "/home/b418a/disk1/jupyter_workspace/yuanxiao/douyin/xinpianchang/MP4_download"
fake_data_root = "/home/b418a/disk1/pycharm_room/yuanxiao/my_lenovo_P50s/Multimodal-short-video-dataset-and-baseline-model/MP4_download"
shuffle_data = True
BATCH_SIZE = 16
REPEAT_DATASET = None
txt_maxlen = 20
image_height = 270
image_width = 480
max_video_frame_number = 100
video_height = 360
video_width = 640
multimodal_dataset = multimodal_tensorflow_dataset(fake_data_root, shuffle_data, BATCH_SIZE, REPEAT_DATASET,
txt_maxlen, image_height, image_width,
max_video_frame_number, video_height, video_width)
i = 0
for encode_video, image, encoded_text, encode_label in multimodal_dataset:
print(f"{i}")
print(encode_video.shape, encode_video.dtype)
print(image.shape, image.dtype)
print(encoded_text.shape, encoded_text.dtype)
print(encode_label.shape, encode_label.dtype)
i += 1
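`filter_video_data` keeps only clips that decoded to exactly `max_video_frame_number` frames, so shorter (or unreadable) videos never reach batching. A pure-Python sketch of that predicate applied over a stream of frame counts (the `keep_full_length` helper is illustrative, not part of the repo):

```python
def keep_full_length(frame_counts, max_video_frame_number=100):
    """Sketch of the filter_video_data predicate over a stream of clips:
    only clips whose decoded frame count equals max_video_frame_number survive."""
    return [n for n in frame_counts if n == max_video_frame_number]

print(keep_full_length([100, 37, 100, 99]))  # → [100, 100]
```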
{'mp4_id': '99945958', 'mp4_download_url': 'https://p5-v1.xpccdn.com/099945958_main_xl.mp4', 'mp4_time': '0:15', 'mp4_background_image_url': 'https://p5-i1.xpccdn.com/099945958_iconl.jpeg', 'mp4_txt_brief': ' Hong Kong Circa 2017, City B-Roll'}
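Each `<id>.txt` file stores a Python-dict literal like the sample above. One safe way to parse it is `ast.literal_eval`, which evaluates literals only and executes no code; this is a sketch and may differ from the repo's own `get_description_information`:

```python
import ast

# The sample record shown above, as it appears in a downloaded .txt file
sample = ("{'mp4_id': '99945958', 'mp4_download_url': "
          "'https://p5-v1.xpccdn.com/099945958_main_xl.mp4', 'mp4_time': '0:15', "
          "'mp4_background_image_url': 'https://p5-i1.xpccdn.com/099945958_iconl.jpeg', "
          "'mp4_txt_brief': ' Hong Kong Circa 2017, City B-Roll'}")

info = ast.literal_eval(sample)  # parses literals only, no code execution
print(info['mp4_id'], info['mp4_txt_brief'].strip())
```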
import tensorflow as tf
import sys
import os
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "data_interface_for_model")))
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "baseline_model")))
from baseline_model.mutimodal_baseline_model import create_multimodal_baseline_model
from data_interface_for_model.tensorflow_dataset_interface import multimodal_tensorflow_dataset
def train_multimodal_model_main(data_root, train_dataset_numbers, EPOCHS, LEARN_RATE,
checkpoint_path, shuffle_data, BATCH_SIZE, REPEAT_DATASET,
vocab_size, txt_maxlen,
image_height, image_width, image_channels,
max_video_frame_number, video_height, video_width, video_channels,
label_number):
"""
Training Multimodal Baseline Model
Control training and data parameters:
[data_root, train_dataset_numbers, EPOCHS, LEARN_RATE,
checkpoint_path, shuffle_data, BATCH_SIZE, REPEAT_DATASET,]
Text model parameters:
[ vocab_size, txt_maxlen,]
Image model parameters:
[image_height, image_width, image_channels,]
Video model parameters:
[max_video_frame_number, video_height, video_width, video_channels,]
label_number
"""
multimodal_dataset = multimodal_tensorflow_dataset(data_root, shuffle_data, BATCH_SIZE, REPEAT_DATASET,
txt_maxlen, image_height, image_width,
max_video_frame_number, video_height, video_width)
# Create callback
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(checkpoint_path, save_weights_only=True,
verbose=1, save_freq='epoch')
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=checkpoint_path)
callback_list = [checkpoint_callback, tensorboard_callback]
STEPS_PER_EPOCH = train_dataset_numbers // BATCH_SIZE
multimodal_model = create_multimodal_baseline_model(label_number=label_number, txt_maxlen=txt_maxlen,
text_vocab_size=vocab_size, text_embedding_dim=100,
text_lstm_units=64, text_output_dim=50,
image_height=image_height, image_width=image_width,
image_channels=image_channels, image_output_dim=50,
max_video_frame_number=max_video_frame_number,
video_height=video_height, video_width=video_width,
video_channels=video_channels, video_output_dim=50)
    # encode_label is an integer class id (not one-hot), so use the sparse loss/metric
    multimodal_model.compile(optimizer=tf.keras.optimizers.Adam(LEARN_RATE),
                             loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                             metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])
multimodal_model.fit(multimodal_dataset, epochs=EPOCHS,
steps_per_epoch=STEPS_PER_EPOCH, callbacks=callback_list)
if __name__ == "__main__":
data_root = "/home/b418a/disk1/jupyter_workspace/yuanxiao/douyin/xinpianchang/MP4_download"
train_dataset_numbers = 500000
EPOCHS = 200
LEARN_RATE = 0.001
checkpoint_path = "./keras_checkpoints/train"
shuffle_data = True
BATCH_SIZE = 64
REPEAT_DATASET = None
vocab_size = 15798 + 1 # 1 for unknown
txt_maxlen = 20
image_height = 270
image_width = 480
image_channels = 3
max_video_frame_number = 100
video_height = 360
video_width = 640
video_channels = 3
label_number = 31
train_multimodal_model_main(data_root, train_dataset_numbers, EPOCHS, LEARN_RATE, checkpoint_path,
shuffle_data, BATCH_SIZE, REPEAT_DATASET, vocab_size, txt_maxlen,
image_height, image_width, image_channels, max_video_frame_number,
video_height, video_width, video_channels, label_number)