[1] Ren S Q, He K M, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[C]//Proceedings of the 28th International Conference on Neural Information Processing Systems, 2015: 91–99.
[2] Girshick R. Fast R-CNN[C]//Proceedings of the IEEE International Conference on Computer Vision, 2015: 1440–1448.
[3] Yang C Y, Xu Y H, Shi J P, et al. Temporal pyramid network for action recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 588–597.
[4] Li M S, Chen S H, Chen X, et al. Actional-structural graph convolutional networks for skeleton-based action recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 3590–3598.
[5] Kirillov A, He K M, Girshick R, et al. Panoptic segmentation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 9396–9405.
[6] Sofiiuk K, Barinova O, Konushin A. AdaptIS: adaptive instance selection network[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 7354–7362.
[7] Gao C, Xu J R, Zou Y L, et al. DRG: dual relation graph for human-object interaction detection[C]//Proceedings of the 16th European Conference on Computer Vision (ECCV), 2020: 696–712.
[8] Gao C, Zou Y L, Huang J B. iCAN: instance-centric attention network for human-object interaction detection[C]//Proceedings of the British Machine Vision Conference (BMVC), 2018.
[9] Chao Y W, Liu Y F, Liu X Y, et al. Learning to detect human-object interactions[C]//2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018: 381–389.
[10] Hou Z, Peng X J, Qiao Y, et al. Visual compositional learning for human-object interaction detection[C]//Proceedings of the 16th European Conference on Computer Vision (ECCV), 2020: 584–600.
[11] Zhou P H, Chi M M. Relation parsing neural network for human-object interaction detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 843–851.
[12] Kim B, Lee J, Kang J, et al. HOTR: end-to-end human-object interaction detection with transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 74–83.
[13] Zhang A X, Liao Y, Liu S, et al. Mining the benefits of two-stage and one-stage HOI detection[C]//Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS), 2021.
[14] Zou C, Wang B H, Hu Y, et al. End-to-end human object interaction detection with HOI transformer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 11820–11829.
[15] Chen M F, Liao Y, Liu S, et al. Reformulating HOI detection as adaptive set prediction[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 9000–9009.
[16] Kamath A, Clark C, Gupta T, et al. Webly supervised concept expansion for general purpose vision models[Z]. arXiv: 2202.02317, 2022. https://arxiv.org/abs/2202.02317v1.
[17] Li Z M, Zou C, Zhao Y, et al. Improving human-object interaction detection via phrase learning and label composition[Z]. arXiv: 2112.07383, 2021. https://doi.org/10.48550/arXiv.2112.07383.
[19] Li Y L, Zhou S Y, Huang X J, et al. Transferable interactiveness knowledge for human-object interaction detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 3580–3589.
[20] Yang J W, Lu J S, Lee S, et al. Graph R-CNN for scene graph generation[C]//Proceedings of the 15th European Conference on Computer Vision (ECCV), 2018: 690–706.
[21] Chen T S, Yu W H, Chen R Q, et al. Knowledge-embedded routing network for scene graph generation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 6156–6164.
[23] Liang W X, Jiang Y H, Liu Z X. GraphVQA: language-guided graph neural networks for graph-based visual question answering[Z]. arXiv: 2104.10283, 2021. https://arxiv.org/abs/2104.10283v2.
[24] Qi S Y, Wang W G, Jia B X, et al. Learning human-object interactions by graph parsing neural networks[C]//Proceedings of the 15th European Conference on Computer Vision (ECCV), 2018: 407–423.
[25] Xu B J, Wong Y K, Li J N, et al. Learning to detect human-object interactions with knowledge[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 2019–2028.
[26] Zheng S P, Chen S Z, Jin Q. Skeleton-based interactive graph network for human object interaction detection[C]//2020 IEEE International Conference on Multimedia and Expo (ICME), 2020: 1–6.
[27] Shen L Y, Yeung S, Hoffman J, et al. Scaling human-object interaction recognition through zero-shot learning[C]//2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018: 1568–1576.
[28] Wang S C, Yap K H, Yuan J S, et al. Discovering human interactions with novel objects via zero-shot learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 11649–11658.
[29] Fang H S, Xie Y C, Shao D, et al. DecAug: augmenting HOI detection via decomposition[C]//Proceedings of the 35th AAAI Conference on Artificial Intelligence, 2021: 1300–1308.
[30] Sarullo A, Mu T T. Zero-shot human-object interaction recognition via affordance graphs[Z]. arXiv: 2009.01039, 2020. https://doi.org/10.48550/arXiv.2009.01039.
[31] Wan B, Zhou D S, Liu Y F, et al. Pose-aware multi-level feature network for human object interaction detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 9468–9477.
[32] Peyre J, Sivic J, Laptev I, et al. Detecting unseen visual relations using analogies[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 1981–1990.
[33] Liu Y, Chen Q C, Zisserman A. Amplifying key cues for human-object-interaction detection[C]//Proceedings of the 16th European Conference on Computer Vision (ECCV), 2020: 248–265.
[34] Zhang F Z, Campbell D, Gould S. Spatio-attentive graphs for human-object interaction detection[Z]. arXiv: 2012.06060, 2020. https://arxiv.org/abs/2012.06060v1.
[35] Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 936–944.
[36] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770–778.
[37] Chen L, Zhang H W, Xiao J, et al. Zero-shot visual recognition using semantics-preserving adversarial embedding networks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018: 1043–1052.
[38] Pennington J, Socher R, Manning C D. GloVe: global vectors for word representation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014: 1532–1543.
[39] Gupta S, Malik J. Visual semantic role labeling[Z]. arXiv: 1505.04474, 2015. https://arxiv.org/abs/1505.04474v1.
[40] Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: common objects in context[C]//Proceedings of the 13th European Conference on Computer Vision (ECCV), 2014: 740–755.