[1] JI S W, XU W, YANG M, et al.3D convolutional neural networks for human action recognition[J].IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013, 35(1):221-231.
[2] QIU Z F, YAO T, MEI T.Learning spatio-temporal representation with pseudo-3D residual networks[C]//IEEE International Conference on Computer Vision.Venice:IEEE, 2017:5534-5542.
[3] TRAN D, WANG H, TORRESANI L, et al.A closer look at spatiotemporal convolutions for action recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Salt Lake City:IEEE, 2018:6450-6459.
[4] SUN L, JIA K, YEUNG D Y, et al.Human action recognition using factorized spatio-temporal convolutional networks[C]//IEEE International Conference on Computer Vision.Santiago:IEEE, 2015:4597-4605.
[5] ZHOU Y Z, SUN X Y, ZHA Z J, et al.MiCT:mixed 3D/2D convolutional tube for human action recognition[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Salt Lake City:IEEE, 2018:449-458.
[6] SIMONYAN K, ZISSERMAN A.Two-stream convolutional networks for action recognition in videos[EB/OL].(2014-11-12)[2022-04-13].https://arxiv.org/abs/1406.2199.
[7] FEICHTENHOFER C, PINZ A, ZISSERMAN A.Convolutional two-stream network fusion for video action recognition[C]//IEEE Conference on Computer Vision and Pattern Recognition.Las Vegas:IEEE, 2016:1933-1941.
[8] DONAHUE J, HENDRICKS L A, GUADARRAMA S, et al.Long-term recurrent convolutional networks for visual recognition and description[C]//IEEE Conference on Computer Vision and Pattern Recognition.Boston:IEEE, 2015:2625-2634.
[9] NG J Y H, HAUSKNECHT M, VIJAYANARASIMHAN S, et al.Beyond short snippets:deep networks for video classification[C]//IEEE Conference on Computer Vision and Pattern Recognition.Boston:IEEE, 2015:4694-4702.
[11] WANG L M, XIONG Y J, WANG Z, et al.Temporal segment networks:towards good practices for deep action recognition[C]//European Conference on Computer Vision.Amsterdam:Springer, 2016:20-36.
[12] ZHOU B L, ANDONIAN A, OLIVA A, et al.Temporal relational reasoning in videos[C]//European Conference on Computer Vision.Munich:Springer, 2018:831-846.
[13] LIN J, GAN C, HAN S.TSM:temporal shift module for efficient video understanding[C]//IEEE/CVF International Conference on Computer Vision.Seoul:IEEE, 2019:7082-7092.
[14] FEICHTENHOFER C, PINZ A, WILDES R P.Spatiotemporal residual networks for video action recognition[C]//Proceedings of the 30th International Conference on Neural Information Processing Systems.Barcelona:Curran Associates Inc., 2016:3476-3484.
[15] FEICHTENHOFER C, PINZ A, WILDES R P.Spatiotemporal multiplier networks for video action recognition[C]//IEEE Conference on Computer Vision and Pattern Recognition.Honolulu:IEEE, 2017:7445-7454.
[16] HE K M, ZHANG X Y, REN S Q, et al.Deep residual learning for image recognition[C]//IEEE Conference on Computer Vision and Pattern Recognition.Las Vegas:IEEE, 2016:770-778.
[17] WANG L M, XIONG Y J, WANG Z, et al.Temporal segment networks for action recognition in videos[J].IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(11):2740-2755.
[18] SULTANI W, CHEN C, SHAH M.Real-world anomaly detection in surveillance videos[C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition.Salt Lake City:IEEE, 2018:6479-6488.
[19] WU P, LIU J, SHI Y J, et al.Not only look, but also listen:learning multimodal violence detection under weak supervision[C]//European Conference on Computer Vision.Cham:Springer, 2020:322-339.