[1] Ren S Q, He K M, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[C]//Proceedings of the 28th International Conference on Neural Information Processing Systems, 2015: 91–99.
[2] Girshick R. Fast R-CNN[C]//Proceedings of the IEEE International Conference on Computer Vision, 2015: 1440–1448.
[3] Yang C Y, Xu Y H, Shi J P, et al. Temporal pyramid network for action recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 588–597.
[4] Li M S, Chen S H, Chen X, et al. Actional-structural graph convolutional networks for skeleton-based action recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 3590–3598.
[5] Kirillov A, He K M, Girshick R, et al. Panoptic segmentation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 9396–9405.
[6] Sofiiuk K, Barinova O, Konushin A. AdaptIS: adaptive instance selection network[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 7354–7362.
[7] Gao C, Xu J R, Zou Y L, et al. DRG: dual relation graph for human-object interaction detection[C]//Proceedings of the 16th European Conference on Computer Vision (ECCV), 2020: 696–712.
[8] Gao C, Zou Y L, Huang J B. iCAN: instance-centric attention network for human-object interaction detection[C]//Proceedings of the British Machine Vision Conference (BMVC), 2018.
[9] Chao Y W, Liu Y F, Liu X Y, et al. Learning to detect human-object interactions[C]//2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018: 381–389.
[10] Hou Z, Peng X J, Qiao Y, et al. Visual compositional learning for human-object interaction detection[C]//Proceedings of the 16th European Conference on Computer Vision (ECCV), 2020: 584–600.
[11] Zhou P H, Chi M M. Relation parsing neural network for human-object interaction detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 843–851.
[12] Kim B, Lee J, Kang J, et al. HOTR: end-to-end human-object interaction detection with transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 74–83.
[13] Zhang A X, Liao Y, Liu S, et al. Mining the benefits of two-stage and one-stage HOI detection[C]//Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS), 2021.
[14] Zou C, Wang B H, Hu Y, et al. End-to-end human object interaction detection with HOI transformer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 11820–11829.
[15] Chen M F, Liao Y, Liu S, et al. Reformulating HOI detection as adaptive set prediction[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 9000–9009.
[16] Kamath A, Clark C, Gupta T, et al. Webly supervised concept expansion for general purpose vision models[Z]. arXiv: 2202.02317, 2022. https://arxiv.org/abs/2202.02317v1.
[17] Li Z M, Zou C, Zhao Y, et al. Improving human-object interaction detection via phrase learning and label composition[Z]. arXiv: 2112.07383, 2021. https://doi.org/10.48550/arXiv.2112.07383.
[19] Li Y L, Zhou S Y, Huang X J, et al. Transferable interactiveness knowledge for human-object interaction detection[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 3580–3589.
[20] Yang J W, Lu J S, Lee S, et al. Graph R-CNN for scene graph generation[C]//Proceedings of the 15th European Conference on Computer Vision (ECCV), 2018: 690–706.
[21] Chen T S, Yu W H, Chen R Q, et al. Knowledge-embedded routing network for scene graph generation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 6156–6164.
[23] Liang W X, Jiang Y H, Liu Z X. GraphVQA: language-guided graph neural networks for graph-based visual question answering[Z]. arXiv: 2104.10283, 2021. https://arxiv.org/abs/2104.10283v2.
[24] Qi S Y, Wang W G, Jia B X, et al. Learning human-object interactions by graph parsing neural networks[C]//Proceedings of the 15th European Conference on Computer Vision (ECCV), 2018: 407–423.
[25] Xu B J, Wong Y K, Li J N, et al. Learning to detect human-object interactions with knowledge[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 2019–2028.
[26] Zheng S P, Chen S Z, Jin Q. Skeleton-based interactive graph network for human object interaction detection[C]//2020 IEEE International Conference on Multimedia and Expo (ICME), 2020: 1–6.
[27] Shen L Y, Yeung S, Hoffman J, et al. Scaling human-object interaction recognition through zero-shot learning[C]//2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018: 1568–1576.
[28] Wang S C, Yap K H, Yuan J S, et al. Discovering human interactions with novel objects via zero-shot learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 11649–11658.
[29] Fang H S, Xie Y C, Shao D, et al. DecAug: augmenting HOI detection via decomposition[C]//Proceedings of the 35th AAAI Conference on Artificial Intelligence, 2021: 1300–1308.
[30] Sarullo A, Mu T T. Zero-shot human-object interaction recognition via affordance graphs[Z]. arXiv: 2009.01039, 2020. https://doi.org/10.48550/arXiv.2009.01039.
[31] Wan B, Zhou D S, Liu Y F, et al. Pose-aware multi-level feature network for human object interaction detection[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 9468–9477.
[32] Peyre J, Sivic J, Laptev I, et al. Detecting unseen visual relations using analogies[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 1981–1990.
[33] Liu Y, Chen Q C, Zisserman A. Amplifying key cues for human-object-interaction detection[C]//Proceedings of the 16th European Conference on Computer Vision (ECCV), 2020: 248–265.
[34] Zhang F Z, Campbell D, Gould S. Spatio-attentive graphs for human-object interaction detection[Z]. arXiv: 2012.06060, 2020. https://arxiv.org/abs/2012.06060v1.
[35] Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 936–944.
[36] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770–778.
[37] Chen L, Zhang H W, Xiao J, et al. Zero-shot visual recognition using semantics-preserving adversarial embedding networks[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018: 1043–1052.
[38] Pennington J, Socher R, Manning C D. GloVe: global vectors for word representation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014: 1532–1543.
[39] Gupta S, Malik J. Visual semantic role labeling[Z]. arXiv: 1505.04474, 2015. https://arxiv.org/abs/1505.04474v1.
[40] Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: common objects in context[C]//Proceedings of the 13th European Conference on Computer Vision (ECCV), 2014: 740–755.