We consider the fully automated recognition of actions in an uncontrolled environment. Most existing work relies on domain knowledge to construct complex handcrafted features from inputs. In addition, the environments are usually assumed to be controlled. Convolutional neural networks (CNNs) are a type of deep model that can act directly on the raw inputs, thus automating the process of feature construction. However, such models are currently limited to handling 2D inputs. In this paper, we develop a novel 3D CNN model for action recognition. This model extracts features from both spatial and temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. The developed model generates multiple channels of information from the input frames, and the final feature representation is obtained by combining information from all channels. We apply the developed model to recognize human actions in real-world environments, and it achieves superior performance without relying on handcrafted features.

1. Introduction

Recognizing human actions in a real-world environment finds applications in a variety of domains, including intelligent video surveillance, customer attributes, and shopping behavior analysis. However, accurate recognition of actions is a highly challenging task due to cluttered backgrounds, occlusions, viewpoint variations, etc. Therefore, most of the existing approaches (Efros et al., 2003; Schüldt et al., 2004; Dollár et al., 2005; Laptev & Pérez, 2007; Jhuang et al., 2007) make certain assumptions (e.g., small scale and viewpoint changes) about the circumstances under which the video was taken. However, such assumptions seldom hold in real-world environments. In addition, most of these approaches follow the conventional paradigm of pattern recognition, which consists of two steps: the first step computes complex handcrafted features from raw video frames, and the second step learns classifiers based on the obtained features. In real-world scenarios it is rarely known which features are important for the task at hand, since the choice of features is highly problem-dependent. Especially for human action recognition, different action classes may appear dramatically different in terms of their appearances and motion patterns.

Deep learning models (Fukushima, 1980; LeCun et al., 1998; Hinton & Salakhutdinov, 2006; Hinton et al., 2006; Bengio, 2009) are a class of machines that can learn a hierarchy of features by building high-level features from low-level ones, thereby automating the process of feature construction. Such learning machines can be trained using either supervised or unsupervised approaches, and the resulting systems have been shown to yield competitive performance in visual object recognition (LeCun et al., 1998; Hinton et al., 2006; Ranzato et al., 2007; Lee et al., 2009a), natural language processing (Collobert & Weston, 2008), and audio classification (Lee et al., 2009b) tasks. Convolutional neural networks (CNNs) (LeCun et al., 1998) are a type of deep model in which trainable filters and local neighborhood pooling operations are applied alternatingly on the raw input images, resulting in a hierarchy of increasingly complex features. It has been shown that, when trained with appropriate regularization (Ahmed et al., 2008; Yu et al., 2008; Mobahi et al., 2009), CNNs can achieve superior performance on visual object recognition tasks without relying on handcrafted features. In addition, CNNs have been shown to be relatively insensitive to certain variations of the inputs (LeCun et al., 2004).
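As a rough illustration of this alternating filter/pooling structure (a minimal sketch assuming PyTorch; the layer sizes, channel counts, and activation below are illustrative assumptions and are not the architecture used in this paper):

    import torch
    import torch.nn as nn

    # Alternate trainable filters (convolution) with local neighborhood pooling,
    # producing increasingly complex features from the raw image.
    cnn2d = nn.Sequential(
        nn.Conv2d(1, 8, kernel_size=5),   # trainable filters over the raw input
        nn.Tanh(),
        nn.MaxPool2d(2),                  # local neighborhood pooling
        nn.Conv2d(8, 16, kernel_size=5),  # higher-level features from lower-level ones
        nn.Tanh(),
        nn.MaxPool2d(2),
    )

    x = torch.randn(1, 1, 60, 40)         # one gray-scale frame (illustrative size)
    features = cnn2d(x)
    print(features.shape)                 # torch.Size([1, 16, 12, 7])

Each convolution layer learns its filters from data, so no handcrafted feature design is involved; the pooling layers make the resulting features locally translation-tolerant.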
As a class of attractive deep models for automated feature construction, CNNs have been primarily applied to 2D images. In this paper, we consider the use of CNNs for human action recognition in videos. A simple approach in this direction is to treat video frames as still images and apply CNNs to recognize actions at the individual frame level. Indeed, this approach has been used to analyze videos of developing embryos (Ning et al., 2005). However, such an approach does not consider the motion information encoded in multiple contiguous frames. To effectively incorporate the motion information in video analysis, we propose to perform 3D convolution in the convolutional layers of CNNs, so that discriminative features along both spatial and temporal dimensions are captured. We show that by applying multiple distinct convolutional operations at the same location on the input, multiple types of features can be extracted. Based on the proposed 3D convolution, a variety of 3D CNN architectures can be devised to analyze video data. We develop a 3D CNN architecture that generates multiple channels of information from adjacent video frames and performs convolution and subsampling separately in each channel. The final feature representation is obtained by combining information from all channels. An additional advantage of CNN-based models is that the recognition phase is very efficient due to their feed-forward nature.

We evaluated the developed 3D CNN model on the TREC Video Retrieval Evaluation (TRECVID) data, which consist of surveillance video data recorded at London Gatwick Airport. We constructed a multi-module event detection system, which includes the 3D CNN as a module, and participated in three tasks of the TRECVID 2009 Evaluation for Surveillance Event Detection. Our system achieved the best performance on all three participated tasks. To provide an independent evaluation of the 3D CNN model, we report its performance on the TRECVID 2008 development set in this paper. We also present results on the KTH data, as published performance for this data is available. Our experiments show that the developed 3D CNN model outperforms other baseline methods on the TRECVID data, and it achieves competitive performance on the KTH data without depending on handcrafted features.
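To make the 3D convolution idea concrete, the following is a minimal sketch assuming PyTorch; the frame count, kernel size, and number of kernels are illustrative assumptions, not the configurations used in this paper:

    import torch
    import torch.nn as nn

    # A short clip: 7 adjacent gray-scale frames stacked along a temporal axis.
    frames = torch.randn(1, 1, 7, 60, 40)       # (batch, channel, time, height, width)

    # A 3D convolution slides a kernel over space AND time, so each output value
    # depends on a 3-frame temporal window as well as a 7x7 spatial window.
    conv3d = nn.Conv3d(in_channels=1,
                       out_channels=4,          # 4 distinct kernels -> 4 feature types
                       kernel_size=(3, 7, 7))   # (time, height, width)

    maps = conv3d(frames)
    print(maps.shape)                           # torch.Size([1, 4, 5, 54, 34])

Because every output value is computed from several consecutive frames, motion information enters the feature maps directly; applying several distinct kernels at the same location yields multiple types of features, matching the multi-channel construction described above.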
For the baseline methods, a one-against-all linear SVM is learned for each action class. Specifically, we extract dense SIFT descriptors (Lowe, 2004) from raw gray images or motion edge history images (MEHI) (Yang et al., 2009). Local features on raw gray images preserve the appearance information, while MEHI concerns the shape and motion patterns. These SIFT descriptors are calculated every 6 pixels from 7×7 and 16×16 local image patches in the same cubes as in the 3D CNN model. Then they are softly quantized using a 512-word codebook to build the BoW features. To exploit the spatial layout information, we employ a similar approach as spatial pyramid matching (SPM) (Lazebnik et al., 2006) to partition the candidate region into 2×2 and 3×4 cells and concatenate their BoW features. The dimensionality of the entire feature vector is 512×(2×2+3×4) = 8192. We denote the method based on gray images as SPMcube_gray and the one based on MEHI as SPMcube_MEHI.

We report the 5-fold cross-validation results, in which the data for a single day are used as a fold. The performance measures we used are precision, recall, and area under the ROC curve (AUC) at multiple values of false positive rates (FPR). The performance of the four methods is summarized in Table 2. We can observe from Table 2 that the 3D CNN model outperforms the frame-based 2D CNN model, SPMcube_gray, and SPMcube_MEHI significantly on the action classes CellToEar and ObjectPut in all cases. For the action class Pointing, the 3D CNN model achieves slightly worse performance than the other three methods. From Table 1 we can see that the number of positive samples in the Pointing class is significantly larger than those of the other two classes. Hence, we can conclude that the 3D CNN model is more effective when the number of positive samples is small. Overall, the 3D CNN model outperforms the other three methods consistently, as can be seen from the average performance in Table 2.

4.2. Action Recognition on KTH Data

We evaluate the 3D CNN model on the KTH data (Schüldt et al., 2004), which consist of 6 action classes performed by 25 subjects. To follow the setup in the HMAX model, we use a 9-frame cube as input and extract the foreground as in (Jhuang et al., 2007). To reduce the memory requirement, the resolutions of the input frames are reduced to 80×60 in our experiments, as compared to 160×120 used in (Jhuang et al., 2007). We use a similar 3D CNN architecture as in Figure 3, with the sizes of kernels and the number of feature maps in each layer modified to accommodate the 80×60×9 inputs. In particular, the three convolutional layers use kernels of sizes 9×7, 7×7, and 6×4, respectively, and the two subsampling layers use kernels of size 3×3. With this setting, the 80×60×9 inputs are converted into 128D feature vectors. The final layer consists of 6 units corresponding to the 6 classes. As in (Jhuang et al., 2007), we use the data for 16 randomly selected subjects for training and the data for the other 9 subjects for testing. The recognition performance averaged across 5 random trials is reported in Table 3, along with published results in the literature. The 3D CNN model achieves an overall accuracy of 90.2%, as compared with 91.7% achieved by the HMAX model. Note that the HMAX model uses handcrafted features computed from raw images with 4-fold higher resolution.

5. Conclusions and Discussions

We developed a 3D CNN model for action recognition in this paper. This model constructs features from both spatial and temporal dimensions by performing 3D convolutions. The developed deep architecture generates multiple channels of information from adjacent input frames and performs convolution and subsampling separately in each channel. The final feature representation is computed by combining information from all channels. We evaluated the 3D CNN model using the TRECVID and the KTH data sets. Results show that the 3D CNN model outperforms the compared methods on the TRECVID data, while it achieves competitive performance on the KTH data, demonstrating its superior performance in real-world environments.
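As a small, self-contained illustration of the evaluation measures used for the TRECVID experiments above (precision, recall, and area under the ROC curve at fixed false positive rates), the following sketch assumes scikit-learn and uses purely synthetic labels and detector scores; none of these values come from the paper:

    import numpy as np
    from sklearn.metrics import precision_score, recall_score, roc_auc_score

    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=1000)        # synthetic ground-truth event labels
    scores = 0.4 * labels + rng.random(1000)      # hypothetical detector confidence scores

    preds = (scores >= 0.7).astype(int)           # a single illustrative decision threshold
    print("precision:", precision_score(labels, preds))
    print("recall:   ", recall_score(labels, preds))

    # Partial area under the ROC curve up to several false positive rates;
    # scikit-learn standardizes the partial AUC when max_fpr is given.
    for fpr in (0.001, 0.01, 0.1):
        print(f"AUC at FPR <= {fpr}:", roc_auc_score(labels, scores, max_fpr=fpr))

Reporting the AUC only up to small FPR values focuses the comparison on the low-false-alarm operating region that matters for surveillance event detection.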