Commit 8244bbe5 authored by sahduashufa

0418

Parent d472c1e3
Abstract

We consider the fully automated recognition of actions in an uncontrolled environment. Most existing work relies on domain knowledge to construct complex handcrafted features from the inputs. In addition, the environments are usually assumed to be controlled. Convolutional neural networks (CNNs) are a type of deep model that can act directly on the raw inputs, thus automating the process of feature construction. However, such models are currently limited to handling 2D inputs. In this paper, we develop a novel 3D CNN model for action recognition. This model extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. The developed model generates multiple channels of information from the input frames, and the final feature representation is obtained by combining information from all channels. We apply the developed model to recognize human actions in a real-world environment, and it achieves superior performance without relying on handcrafted features.

1. Introduction

Recognizing human actions in a real-world environment finds applications in a variety of domains, including intelligent video surveillance, customer attributes, and shopping behavior analysis. However, accurate recognition of actions is a highly challenging task due to cluttered backgrounds, occlusions, and viewpoint variations, etc. Therefore, most of the existing approaches (Efros et al., 2003; Schüldt et al., 2004; Dollár et al., 2005; Laptev & Pérez, 2007; Jhuang et al., 2007) make certain assumptions (e.g., small scale and viewpoint changes) about the circumstances under which the video was taken. However, such assumptions seldom hold in real-world environments. In addition, most of these approaches follow the conventional paradigm of pattern recognition, which consists of two steps: the first step computes complex handcrafted features from raw video frames, and the second step learns classifiers based on the obtained features. In real-world scenarios, it is rarely known which features are important for the task at hand, since the choice of features is highly problem-dependent. Especially for human action recognition, different action classes may appear dramatically different in terms of their appearances and motion patterns.

Deep learning models (Fukushima, 1980; LeCun et al., 1998; Hinton & Salakhutdinov, 2006; Hinton et al., 2006; Bengio, 2009) are a class of machines that can learn a hierarchy of features by building high-level features from low-level ones, thereby automating the process of feature construction. Such learning machines can be trained using either supervised or unsupervised approaches, and the resulting systems have been shown to yield competitive performance in visual object recognition (LeCun et al., 1998; Hinton et al., 2006; Ranzato et al., 2007; Lee et al., 2009a), natural language processing (Collobert & Weston, 2008), and audio classification (Lee et al., 2009b) tasks. Convolutional neural networks (CNNs) (LeCun et al., 1998) are a type of deep model in which trainable filters and local neighborhood pooling operations are applied alternatingly to the raw input images, resulting in a hierarchy of increasingly complex features. It has been shown that, when trained with appropriate regularization (Ahmed et al., 2008; Yu et al., 2008; Mobahi et al., 2009), CNNs can achieve superior performance on visual object recognition tasks without relying on handcrafted features. In addition, CNNs have been shown to be relatively insensitive to certain variations of the inputs (LeCun et al., 2004).

As a class of attractive deep models for automated feature construction, CNNs have been primarily applied to 2D images. In this paper, we consider the use of CNNs for human action recognition in videos. A simple approach in this direction is to treat video frames as still images and apply CNNs to recognize actions at the individual frame level. Indeed, this approach has been used to analyze videos of developing embryos (Ning et al., 2005). However, such an approach does not consider the motion information encoded in multiple contiguous frames. To effectively incorporate the motion information in video analysis, we propose to perform 3D convolution in the convolutional layers of CNNs so that discriminative features along both the spatial and the temporal dimensions are captured. We show that by applying multiple distinct convolutional operations at the same location on the input, multiple types of features can be extracted. Based on the proposed 3D convolution, a variety of 3D CNN architectures can be devised to analyze video data. We develop a 3D CNN architecture that generates multiple channels of information from adjacent video frames and performs convolution and subsampling separately in each channel. The final feature representation is obtained by combining information from all channels. An additional advantage of the CNN-based models is that the recognition phase is very efficient due to their feed-forward nature.
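To make the 3D convolution operation concrete, the following minimal sketch (PyTorch) convolves a small stack of frames with spatiotemporal kernels; the frame count, kernel size, and number of output feature types are illustrative assumptions, not the configuration of the model described in the paper.

```python
# Minimal sketch of a 3D convolution over a stack of video frames (PyTorch).
# The sizes used here (7 frames, 3x5x5 kernels, 4 output feature types) are
# illustrative assumptions, not the configuration of the model in the paper.
import torch
import torch.nn as nn

frames = torch.randn(1, 1, 7, 60, 40)  # (batch, channel, time, height, width)

# Each output channel corresponds to one distinct 3D kernel, i.e. one type of
# spatiotemporal feature extracted at every location of the frame cube.
conv3d = nn.Conv3d(in_channels=1, out_channels=4, kernel_size=(3, 5, 5))

feature_maps = conv3d(frames)
print(feature_maps.shape)  # torch.Size([1, 4, 5, 56, 36]): time and space both shrink
```

Applying several distinct kernels at the same location, as above, is what yields multiple types of features from the same frame cube.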
We evaluated the developed 3D CNN model on the TREC Video Retrieval Evaluation (TRECVID) data, which consist of surveillance video data recorded at London Gatwick Airport. We constructed a multi-module event detection system, which includes the 3D CNN as a module, and participated in three tasks of the TRECVID 2009 Evaluation for Surveillance Event Detection. Our system achieved the best performance on all three tasks in which we participated. To provide an independent evaluation of the 3D CNN model, we report its performance on the TRECVID 2008 development set in this paper. We also present results on the KTH data, as published performance for this data set is available. Our experiments show that the developed 3D CNN model outperforms other baseline methods on the TRECVID data, and it achieves competitive performance on the KTH data without depending on handcrafted features.

For the baseline methods, a one-against-all linear SVM is learned for each action class. Specifically, we extract dense SIFT descriptors (Lowe, 2004) from raw gray images or motion edge history images (MEHI) (Yang et al., 2009). Local features on raw gray images preserve the appearance information, while MEHI focuses on the shape and motion patterns. These SIFT descriptors are calculated every 6 pixels from 7 × 7 and 16 × 16 local image patches in the same cubes as in the 3D CNN model. Then they are softly quantized using a 512-word codebook to build the BoW features. To exploit the spatial layout information, we employ an approach similar to spatial pyramid matching (SPM) (Lazebnik et al., 2006) to partition the candidate region into 2 × 2 and 3 × 4 cells and concatenate their BoW features. The dimensionality of the entire feature vector is 512 × (2 × 2 + 3 × 4) = 8192. We denote the method based on gray images as SPM^cube_gray and the one based on MEHI as SPM^cube_MEHI.
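The cell partitioning and dimensionality arithmetic above can be sketched as follows; descriptor extraction and soft quantization are simplified (hard nearest-codeword assignment), so this is an illustration of the feature layout rather than the exact baseline pipeline.

```python
# Sketch of the SPM-style BoW layout described above: per-cell histograms over
# a 512-word codebook, concatenated across the 2x2 and 3x4 partitions. Soft
# quantization is simplified here to hard nearest-codeword counting.
import numpy as np

CODEBOOK_SIZE = 512
GRIDS = [(2, 2), (3, 4)]  # candidate region partitioned into 2x2 and 3x4 cells

def cell_histogram(descriptors, codebook):
    # descriptors: (n, 128) dense SIFT vectors falling inside one cell
    hist = np.zeros(len(codebook))
    for d in descriptors:
        hist[np.argmin(np.linalg.norm(codebook - d, axis=1))] += 1.0
    return hist

def spm_bow_feature(per_cell_descriptors, codebook):
    # per_cell_descriptors: one descriptor array per cell, in a fixed cell order
    return np.concatenate([cell_histogram(d, codebook) for d in per_cell_descriptors])

n_cells = sum(r * c for r, c in GRIDS)          # 2*2 + 3*4 = 16 cells
print(CODEBOOK_SIZE * n_cells)                  # 512 * 16 = 8192, matching the text

codebook = np.random.randn(CODEBOOK_SIZE, 128)  # toy codebook and descriptors
per_cell = [np.random.randn(5, 128) for _ in range(n_cells)]
print(spm_bow_feature(per_cell, codebook).shape)  # (8192,)
```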
We report the 5-fold cross-validation results, in which the data for a single day are used as a fold. The performance measures we use are precision, recall, and the area under the ROC curve (AUC) at multiple values of false positive rate (FPR).
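One way to read "AUC at multiple values of FPR" is as the area under the ROC curve restricted to a false-positive-rate range. The toy sketch below computes precision, recall, and such a partial AUC with scikit-learn; it only illustrates the metrics and is not the evaluation code behind Table 2.

```python
# Toy illustration of the reported measures: precision, recall, and the area
# under the ROC curve restricted to FPR <= max_fpr (a partial AUC).
import numpy as np
from sklearn.metrics import auc, precision_score, recall_score, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])                    # toy labels
y_score = np.array([0.1, 0.4, 0.8, 0.65, 0.3, 0.9, 0.2, 0.5])  # toy detector scores

y_pred = (y_score >= 0.5).astype(int)                          # fixed decision threshold
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))

fpr, tpr, _ = roc_curve(y_true, y_score)
max_fpr = 0.1                                                  # evaluate AUC up to this FPR
stop = np.searchsorted(fpr, max_fpr, side="right")
fpr_clip = np.append(fpr[:stop], max_fpr)                      # close the curve at max_fpr
tpr_clip = np.append(tpr[:stop], np.interp(max_fpr, fpr, tpr))
print("partial AUC:", auc(fpr_clip, tpr_clip))
```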
The performance of the four methods is summarized in Table 2. We can observe from Table 2 that the 3D CNN model significantly outperforms the frame-based 2D CNN model, SPM^cube_gray, and SPM^cube_MEHI on the action classes CellToEar and ObjectPut in all cases. For the action class Pointing, the 3D CNN model achieves slightly worse performance than the other three methods. From Table 1 we can see that the number of positive samples in the Pointing class is significantly larger than those of the other two classes. Hence, we can conclude that the 3D CNN model is more effective when the number of positive samples is small. Overall, the 3D CNN model outperforms the other three methods consistently, as can be seen from the average performance in Table 2.

4.2. Action Recognition on KTH Data

We evaluate the 3D CNN model on the KTH data (Schüldt et al., 2004), which consist of 6 action classes performed by 25 subjects. To follow the setup in the HMAX model, we use a 9-frame cube as input and extract the foreground as in (Jhuang et al., 2007). To reduce the memory requirement, the resolutions of the input frames are reduced to 80 × 60 in our experiments, as compared to 160 × 120 used in (Jhuang et al., 2007). We use a 3D CNN architecture similar to that in Figure 3, with the sizes of kernels and the number of feature maps in each layer modified to accommodate the 80 × 60 × 9 inputs. In particular, the three convolutional layers use kernels of sizes 9 × 7, 7 × 7, and 6 × 4, respectively, and the two subsampling layers use kernels of size 3 × 3. With this setting, the 80 × 60 × 9 inputs are converted into 128-dimensional feature vectors. The final layer consists of 6 units corresponding to the 6 classes.
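A rough PyTorch sketch of a network with the quoted kernel sizes is given below; the temporal kernel extents, the per-layer feature-map counts, the activation function, and the use of max pooling for subsampling are assumptions, since this excerpt does not specify them, and the hardwired multi-channel input of the full model is not reproduced.

```python
# Rough sketch of a 3D CNN with the kernel sizes quoted above, mapping an
# 80x60x9 input cube to a 128-D feature vector and 6 class scores (PyTorch).
# The temporal kernel extents (3, 3, 5), the channel counts (16, 32), ReLU,
# and max pooling are assumptions; the paper's hardwired channels are omitted.
import torch
import torch.nn as nn

class Sketch3DCNN(nn.Module):
    def __init__(self, num_classes: int = 6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=(3, 7, 9)),   # 9x7 spatial kernel
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 3, 3)),        # 3x3 spatial subsampling
            nn.Conv3d(16, 32, kernel_size=(3, 7, 7)),   # 7x7 spatial kernel
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 3, 3)),        # 3x3 spatial subsampling
            nn.Conv3d(32, 128, kernel_size=(5, 4, 6)),  # 6x4 spatial kernel; output is 1x1x1
            nn.ReLU(),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        # x: (batch, 1 channel, 9 frames, 60, 80)
        feat = self.features(x).flatten(1)   # -> (batch, 128) feature vector
        return self.classifier(feat)

model = Sketch3DCNN()
out = model(torch.randn(2, 1, 9, 60, 80))
print(out.shape)  # torch.Size([2, 6])
```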
As in (Jhuang et al., 2007), we use the data for 16 randomly selected subjects for training and the data for the other 9 subjects for testing. The recognition performance averaged across 5 random trials is reported in Table 3, along with published results in the literature. The 3D CNN model achieves an overall accuracy of 90.2%, as compared with 91.7% achieved by the HMAX model. Note that the HMAX model uses handcrafted features computed from raw images with 4-fold higher resolution.

5. Conclusions and Discussions

We developed a 3D CNN model for action recognition in this paper. This model constructs features from both the spatial and the temporal dimensions by performing 3D convolutions. The developed deep architecture generates multiple channels of information from adjacent input frames and performs convolution and subsampling separately in each channel. The final feature representation is computed by combining information from all channels. We evaluated the 3D CNN model using the TRECVID and the KTH data sets. Results show that the 3D CNN model outperforms the compared methods on the TRECVID data, while it achieves competitive performance on the KTH data, demonstrating its superior performance in real-world environments.