Video-based body geometric aware network for 3D human pose estimation

Chaonan LI; Sheng LIU; Lu YAO; Siyu ZOU

doi:10.1007/s11801-022-2015-8

[1] MEHTA D, RHODIN H, CASAS D, et al. Monocular 3D human pose estimation in the wild using improved CNN supervision[C]//2017 International Conference on 3D Vision (3DV), October 10-12, 2017, Qingdao, China. New York:IEEE, 2017:506-516.

[2] HOSSAIN M R I, LITTLE J J. Exploiting temporal information for 3D human pose estimation[C]//Proceedings of the European Conference on Computer Vision, September 8-14, 2018, Munich, Germany. Berlin:Springer, 2018:68-84.

[3] LIN J, LEE G H. Trajectory space factorization for deep video-based 3D human pose estimation[C]//2019 British Machine Vision Conference (BMVC), September 9-12, 2019, Cardiff, UK. BMVA, 2019.

[4] LUVIZON D C, PICARD D, TABIA H. 2D/3D pose estimation and action recognition using multitask deep learning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 18-22, 2018, Salt Lake City, UT, USA. New York:IEEE, 2018:5137-5146.

[5] MARTINEZ J, HOSSAIN R, ROMERO J, et al. A simple yet effective baseline for 3D human pose estimation[C]//Proceedings of the IEEE International Conference on Computer Vision, October 22-29, 2017, Venice, Italy. New York:IEEE, 2017:2640-2649.

[6] PARK S, HWANG J, KWAK N. 3D human pose estimation using convolutional neural networks with 2D pose information[C]//Proceedings of the European Conference on Computer Vision, October 11-14, 2016, Amsterdam, The Netherlands. Berlin:Springer, 2016:156-169.

[7] PAVLLO D, FEICHTENHOFER C, GRANGIER D, et al. 3D human pose estimation in video with temporal convolutions and semi-supervised training[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 16-20, 2019, Long Beach, CA, USA. New York:IEEE, 2019:7753-7762.

[8] CHEN X, LIN K Y, LIU W, et al. Weakly-supervised discovery of geometry-aware representation for 3D human pose estimation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 16-20, 2019, Long Beach, CA, USA. New York:IEEE, 2019:10895-10904.

[9] FANG H S, XU Y, WANG W, et al. Learning pose grammar to encode human body configuration for 3D pose estimation[C]//Proceedings of the AAAI Conference on Artificial Intelligence, February 2-7, 2018, New Orleans, Louisiana, USA. Cambridge:AAAI Press, 2018:6821-6828.

[10] PAVLAKOS G, ZHOU X, DERPANIS K G, et al. Coarse-to-fine volumetric prediction for single-image 3D human pose[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, July 21-26, 2017, Honolulu, HI, USA. New York:IEEE, 2017:7025-7034.

[11] XU J, YU Z, NI B, et al. Deep kinematics analysis for monocular 3D human pose estimation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 13-19, 2020, Seattle, WA, USA. New York:IEEE, 2020:899-908.

[12] CAI Y, GE L, LIU J, et al. Exploiting spatial-temporal relationships for 3D pose estimation via graph convolutional networks[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, October 27-November 2, 2019, Seoul, Korea (South). New York:IEEE, 2019:2272-2281.

[13] ZHAO L, PENG X, TIAN Y, et al. Semantic graph convolutional networks for 3D human pose regression[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 16-20, 2019, Long Beach, CA, USA. New York:IEEE, 2019:3425-3435.

[14] LIU K, DING R, ZOU Z, et al. A comprehensive study of weight sharing in graph networks for 3D human pose estimation[C]//Proceedings of the European Conference on Computer Vision, August 23-28, 2020, Glasgow, UK. Berlin:Springer, 2020:318-334.

[15] CI H, WANG C, MA X, et al. Optimizing network structure for 3D human pose estimation[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, October 27-November 2, 2019, Seoul, Korea (South). New York:IEEE, 2019:2262-2271.

[16] WANG J, YAN S, XIONG Y, et al. Motion guided 3D pose estimation from videos[C]//Proceedings of the European Conference on Computer Vision, August 23-28, 2020, Glasgow, UK. Berlin:Springer, 2020:764-780.

[17] LIU R, SHEN J, WANG H, et al. Attention mechanism exploits temporal contexts:real-time 3D human pose reconstruction[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 13-19, 2020, Seattle, WA, USA. New York:IEEE, 2020:5064-5073.

[18] TOLSTIKHIN I, HOULSBY N, KOLESNIKOV A, et al. MLP-mixer: an all-MLP architecture for vision[C]//Thirty-Fifth Conference on Neural Information Processing Systems (NeurIPS), December 6-12, 2021, Virtual Event. New York:Curran Associates, 2021:24261-24272.

[19] CHEN C H, RAMANAN D. 3D human pose estimation = 2D pose estimation + matching[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, July 21-26, 2017, Honolulu, HI, USA. New York:IEEE, 2017:7035-7043.

[20] ZHENG C, ZHU S, MENDIETA M, et al. 3D human pose estimation with spatial and temporal transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, October 10-17, 2021, Montreal, QC, Canada. New York:IEEE, 2021:11656-11665.

[21] DABRAL R, MUNDHADA A, KUSUPATI U, et al. Learning 3D human pose from structure and motion[C]//Proceedings of the European Conference on Computer Vision, September 8-14, 2018, Munich, Germany. Berlin:Springer, 2018:668-683.

[22] CHENG Y, YANG B, WANG B, et al. Occlusion-aware networks for 3D human pose estimation in video[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, October 27-November 2, 2019, Seoul, Korea (South). New York:IEEE, 2019:723-732.

[23] LIU J, ROJAS J, LI Y, et al. A graph attention spatio-temporal convolutional network for 3D human pose estimation in video[C]//2021 IEEE International Conference on Robotics and Automation (ICRA), May 30-June 5, 2021, Xi'an, China. New York:IEEE, 2021:3374-3380.

[24] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural computation, 1997, 9(8):1735-1780.

[25] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words:transformers for image recognition at scale[C]//9th International Conference on Learning Representations (ICLR), May 3-7, 2021, Virtual Event, Austria. 2021.

[26] HENDRYCKS D, GIMPEL K. Gaussian error linear units (GELUs)[EB/OL]. (2016-06-27) [2021-12-26]. https://arxiv.org/abs/1606.08415v1.

[27] IONESCU C, PAPAVA D, OLARU V, et al. Human3.6M:large scale datasets and predictive methods for 3D human sensing in natural environments[J]. IEEE transactions on pattern analysis and machine intelligence, 2013, 36(7):1325-1339.

[28] CHEN Y, WANG Z, PENG Y, et al. Cascaded pyramid network for multi-person pose estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 18-22, 2018, Salt Lake City, UT, USA. New York:IEEE, 2018:7103-7112.

[29] SIGAL L, BALAN A O, BLACK M J. HumanEva: synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion[J]. International journal of computer vision, 2010, 87(1-2):4.

[30] KINGMA D P, BA J. Adam:a method for stochastic optimization[EB/OL]. (2014-12-22) [2021-12-26]. https://arxiv.org/abs/1412.6980v1.

[31] LOSHCHILOV I, HUTTER F. SGDR:stochastic gradient descent with warm restarts[EB/OL]. (2016-08-13) [2021-12-26]. https://arxiv.org/abs/1608.03983v1.

[32] LEE K, LEE I, LEE S. Propagating LSTM:3D pose estimation based on joint interdependency[C]//Proceedings of the European Conference on Computer Vision, September 8-14, 2018, Munich, Germany. Berlin:Springer, 2018:119-135.