• Laser & Optoelectronics Progress
  • Vol. 57, Issue 18, 181506 (2020)
Na Pan, Min Jiang*, and Jun Kong
Author Affiliations
  • Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Jiangnan University, Wuxi, Jiangsu 214122, China
    DOI: 10.3788/LOP57.181506
    Na Pan, Min Jiang, Jun Kong. Human Action Recognition Algorithm Based on Spatio-Temporal Interactive Attention Model[J]. Laser & Optoelectronics Progress, 2020, 57(18): 181506
    References

    [1] Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos[C]∥Advances in Neural Information Processing Systems, December 8-13, 2014, Montreal, Quebec, Canada: Curran Associates, Inc., 568-576(2014).

    [2] Wang L M, Xiong Y J, Wang Z et al. Temporal segment networks: towards good practices for deep action recognition[C]∥European Conference on Computer Vision (ECCV). October 2016, Amsterdam, the Netherlands: Springer, 20-36(2016).

    [3] Carreira J, Zisserman A. Quo vadis, action recognition? A new model and the kinetics dataset[C]∥2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 21-26 July 2017, Honolulu, HI, USA, 4724-4733(2017).

    [4] Mnih V, Heess N, Graves A et al. Recurrent models of visual attention[C]∥NIPS'14: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, 2204-2212(2014).

    [5] Fan L F, Chen Y X, Wei P et al. Inferring shared attention in social scene videos[C]∥2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18-23 June 2018, Salt Lake City, UT, USA, 6460-6468(2018).

    [6] Lu M L, Li Z N, Wang Y M et al. Deep attention network for egocentric action recognition[J]. IEEE Transactions on Image Processing, 28, 3703-3713(2019).

    [7] Fu J, Liu J, Tian H J et al. Dual attention network for scene segmentation[C]∥2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 15-20 June 2019, Long Beach, CA, USA, 3141-3149(2019).

    [8] Zhu M K, Lu X L. Human action recognition algorithm based on Bi-LSTM-attention model[J]. Laser & Optoelectronics Progress, 56, 151503(2019).

    [9] Tang Y S, Tian Y, Lu J W et al. Deep progressive reinforcement learning for skeleton-based action recognition[C]∥2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18-23 June 2018, Salt Lake City, UT, USA, 5323-5332(2018).

    [10] Jing L L, Yang X D, Tian Y L. Video you only look once: overall temporal convolutions for action recognition[J]. Journal of Visual Communication and Image Representation, 52, 58-65(2018).

    [11] Yu T Z, Guo C X, Wang L F et al. Joint spatial-temporal attention for action recognition[J]. Pattern Recognition Letters, 112, 226-233(2018).

    [12] Lu L H, Di H J, Lu Y et al. Spatio-temporal attention mechanisms based model for collective activity recognition[J]. Signal Processing: Image Communication, 74, 162-174(2019).

    [13] He K M, Gkioxari G, Dollár P et al. Mask R-CNN[C]∥2017 IEEE International Conference on Computer Vision (ICCV). 22-29 Oct. 2017, Venice, Italy, 2980-2988(2017).

    [14] Fan L J, Huang W B, Gan C et al. End-to-end learning of motion representation for video understanding[C]∥2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18-23 June 2018, Salt Lake City, UT, USA, 6016-6025(2018).

    [15] Li Z Y, Gavrilyuk K, Gavves E et al. VideoLSTM convolves, attends and flows for action recognition[J]. Computer Vision and Image Understanding, 166, 41-50(2018).

    [16] Zhang J X, Hu H F. Deep spatiotemporal relation learning with 3D multi-level dense fusion for video action recognition[J]. IEEE Access, 7, 15222-15229(2019).

    [17] Khowaja S A, Lee S L. Hybrid and hierarchical fusion networks: a deep cross-modal learning architecture for action recognition[J]. Neural Computing and Applications, 1-12(2019).

    [18] Wang H, Schmid C. Action recognition with improved trajectories[C]∥2013 IEEE International Conference on Computer Vision. 1-8 Dec. 2013, Sydney, NSW, Australia, 3551-3558(2013).

    [19] Peng X J, Wang L M, Wang X X et al. Bag of visual words and fusion methods for action recognition: comprehensive study and good practice[J]. Computer Vision and Image Understanding, 150, 109-125(2016).

    [20] Lan Z Z, Lin M, Li X C et al. Beyond Gaussian pyramid: multi-skip feature stacking for action recognition[C]∥2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 7-12 June 2015, Boston, MA, USA, 204-212(2015).

    [21] Zhu Y, Lan Z Z, Newsam S et al. Hidden two-stream convolutional networks for action recognition[C]∥Asian Conference on Computer Vision (ACCV). Cham: Springer, 363-378(2019).

    [22] Tu Z G, Xie W, Dauwels J et al. Semantic cues enhanced multimodality multistream CNN for action recognition[J]. IEEE Transactions on Circuits and Systems for Video Technology, 29, 1423-1437(2019).

    [23] Tran A, Cheong L F. Two-stream flow-guided convolutional attention networks for action recognition[C]∥2017 IEEE International Conference on Computer Vision Workshops (ICCVW). 22-29 Oct. 2017, Venice, Italy, 3110-3119(2017).

    [24] Du W B, Wang Y L, Qiao Y. Recurrent spatial-temporal attention network for action recognition in videos[J]. IEEE Transactions on Image Processing, 27, 1347-1360(2018).

    [25] Cao C Q, Zhang Y F, Zhang C J et al. Action recognition with joints-pooled 3D deep convolutional descriptors[C]∥IJCAI'16: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, 3324-3330(2016).

    [26] Villegas R, Yang J, Zou Y et al. Learning to generate long-term future via hierarchical prediction[C]∥Proceedings of the 34th International Conference on Machine Learning - Volume 70, Aug 6-11, 2017, Sydney, Australia: JMLR.org, 3560-3569(2017).

    [27] Gao R H, Xiong B, Grauman K. Im2Flow: motion hallucination from static images for action recognition[C]∥2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18-23 June 2018, Salt Lake City, UT, USA, 5937-5947(2018).
