• Opto-Electronic Engineering
  • Vol. 49, Issue 7, 210429 (2022)
Lixia Xue, Kaijian Yin, Ronggui Wang, and Juan Yang*
Author Affiliations
  • School of Computer and Information, Hefei University of Technology, Hefei, Anhui 230031, China
    DOI: 10.12086/oee.2022.210429
    Citation: Lixia Xue, Kaijian Yin, Ronggui Wang, Juan Yang. Interactive instance proposal network for HOI detection[J]. Opto-Electronic Engineering, 2022, 49(7): 210429

    Abstract

Overview: With the development of computer vision, people increasingly need to understand images, including recognizing the scenes and the human behaviors in them. The task of human-object interaction (HOI) detection is to locate humans and objects in images and infer their relationships. This requires not only locating individual object instances but also identifying the interactions between them. However, a machine cannot know in advance which objects a human is interacting with. Most existing methods sidestep this problem by exhaustively pairing all detected humans and objects. They use off-the-shelf object detectors to detect instances, but this does not meet the requirements of the HOI task. This paper proposes an object detector suited to HOI detection based on relational reasoning, which exploits the interactive relationships between humans and objects in an image to recommend human-object pairs, so as to reduce the number of non-interactive human-object pairs as much as possible.

Our method follows the two-stage detection paradigm used by most works. First, an interactive instance proposal network (IIPN) recommends human-object pairs. The detector follows the Faster R-CNN pipeline but replaces the region proposal network (RPN) with the IIPN. The IIPN selects human-object pairs according to the likelihood of interaction inferred from the visual information in the image: messages are passed through the iterative reasoning of a graph neural network (GNN), and only human-object pairs that exhibit an interactive relationship are kept as the IIPN's outputs (a rough sketch is given below). Second, we design a cross-modal information fusion module (CIFM), which computes fusion attention according to each feature's influence on the detection result and performs a weighted fusion (also sketched below). This is needed because existing methods simply add or concatenate features such as human visual features, object visual features, and human-object spatial features in the reasoning stage, ignoring the fact that different actions depend on different features to different degrees. For example, verbs like ride and hold in <human, ride, bike> and <human, hold, bike> depend more on the spatial relationships, whereas verbs like eat and cut depend more on human poses, that is, on visual features. Meanwhile, this paper argues that semantic prior knowledge is also helpful for HOI detection: if an apple appears in an image, the probability that the human's action is eating or holding is greater than that of other actions. Finally, extensive experiments are performed on two popular large-scale HOI datasets, HICO-DET and V-COCO, and the experimental results show the effectiveness of the proposed method.

Human-object interaction detection aims to locate humans and objects in an image and identify their interactive relationships. The challenge is that the machine cannot know which object a person is interacting with. Most existing methods try to solve this problem by matching humans and objects exhaustively. Different from them, this paper proposes an interactive instance proposal network based on relational reasoning that is adapted to the task. Our main idea is to recommend human-object pairs by exploiting the potential interaction relationships in the visual relationships between humans and objects. In addition, a cross-modal information fusion module is designed to fuse different contextual information according to its influence on the detection result, so as to improve detection accuracy.
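To make the IIPN idea concrete, the following is a minimal, hypothetical sketch of graph-based pair selection: detected human and object instances are graph nodes, messages are exchanged for a few reasoning steps, and pairs whose interactiveness score falls below a threshold are pruned. All module names, dimensions, the shared update cell, and the thresholding rule are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class InteractiveInstanceProposal(nn.Module):
    """Sketch of an IIPN-style pair proposer (assumed design, not the paper's code)."""
    def __init__(self, feat_dim=1024, steps=3):
        super().__init__()
        self.steps = steps
        self.msg_fc = nn.Linear(2 * feat_dim, feat_dim)   # message from a human-object edge
        self.update = nn.GRUCell(feat_dim, feat_dim)      # node update (shared for simplicity)
        self.score_fc = nn.Linear(2 * feat_dim, 1)        # interactiveness score per pair

    def pair_feats(self, h, o):
        # Concatenate every human feature with every object feature: (H, O, 2D)
        return torch.cat([h.unsqueeze(1).expand(-1, o.size(0), -1),
                          o.unsqueeze(0).expand(h.size(0), -1, -1)], dim=-1)

    def forward(self, human_feats, object_feats, thresh=0.5):
        h, o = human_feats, object_feats  # (H, D) and (O, D)
        for _ in range(self.steps):
            msg = torch.relu(self.msg_fc(self.pair_feats(h, o)))  # (H, O, D)
            h = self.update(msg.mean(dim=1), h)  # each human aggregates over objects
            o = self.update(msg.mean(dim=0), o)  # each object aggregates over humans
        scores = torch.sigmoid(self.score_fc(self.pair_feats(h, o))).squeeze(-1)  # (H, O)
        keep = (scores > thresh).nonzero()  # only likely-interacting pairs become proposals
        return scores, keep
```

In the full model, the surviving pairs would then be fed to the relation classification head; fully pairing every human with every object, by contrast, would pass all H x O candidates downstream.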
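Likewise, a hedged sketch of the cross-modal fusion idea: rather than adding or concatenating the human visual, object visual, spatial, and semantic-prior features, a small attention head weighs each modality per candidate pair before fusing, so that, e.g., ride can lean on spatial cues while eat leans on appearance. Names and dimensions here are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch of a CIFM-style weighted fusion (assumed design, not the paper's code)."""
    def __init__(self, dim=512, n_modalities=4):
        super().__init__()
        # attention over the modalities, conditioned on all of them jointly
        self.attn = nn.Linear(n_modalities * dim, n_modalities)

    def forward(self, human_v, object_v, spatial, semantic):
        # each input: (N, dim) for N candidate human-object pairs;
        # `semantic` could be, e.g., a word embedding of the object category
        feats = torch.stack([human_v, object_v, spatial, semantic], dim=1)  # (N, 4, dim)
        weights = torch.softmax(self.attn(feats.flatten(1)), dim=-1)        # (N, 4)
        fused = (weights.unsqueeze(-1) * feats).sum(dim=1)                  # (N, dim)
        return fused
```

The per-pair softmax is the key design point: the fusion weights are predicted from the pair itself, so different actions can emphasize different modalities instead of receiving one fixed combination.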
To evaluate the proposed method, we conducted extensive experiments on two large-scale datasets, HICO-DET and V-COCO. Results show that our method achieves 19.90% and 50.3% mAP on HICO-DET and V-COCO, which are 4.5% and 2.8% higher than our baseline, respectively.