Author Affiliations
Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Jiangnan University, Wuxi, Jiangsu 214122, China
Fig. 1. Framework of action recognition network based on spatio-temporal interactive attention model
Fig. 2. Local_Mask feature maps generated from UCF101 dataset. (a) Balance beam; (b) walking with dog
Fig. 3. Mask guided spatial attention model
Fig. 4. Optical flow guided temporal attention model
Fig. 5. Training and testing iteration curves of each algorithm on UCF101 dataset. (a) Proposed model; (b) proposed model with OGTAM; (c) proposed model with MGSAM; (d) proposed model with OGTAM+MGSAM
Fig. 6. Visualization results of proposed algorithm on different datasets. (a) UCF101; (b) Penn Action
Parameter | Value |
---|---|
Loss function | Categorical cross-entropy |
Optimizer | Adam |
Learning rate | 0.0001 |
Batch size | 18 |
Epochs | 150 (Penn Action) / 250 (UCF101) |

Table 1. Experimental parameters
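The settings in Table 1 can be collected into a small configuration object. The following is only a minimal sketch: the names `TrainConfig`, `EPOCHS`, and `epochs_for` are hypothetical and do not come from the paper, and the string identifiers for the loss and optimizer are assumptions about how a framework might name them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the values listed in Table 1."""
    loss: str = "categorical_crossentropy"  # assumed identifier for categorical cross-entropy
    optimizer: str = "adam"                 # assumed identifier for the Adam optimizer
    learning_rate: float = 1e-4             # 0.0001 in Table 1
    batch_size: int = 18


# Epoch counts per dataset, as given in Table 1.
EPOCHS = {"Penn Action": 150, "UCF101": 250}


def epochs_for(dataset: str) -> int:
    """Return the number of training epochs Table 1 lists for a dataset."""
    try:
        return EPOCHS[dataset]
    except KeyError:
        raise ValueError(f"no epoch setting listed for dataset {dataset!r}")


config = TrainConfig()
```

Keeping the dataset-dependent epoch count separate from the shared settings mirrors the table, where only the epoch row differs between Penn Action and UCF101.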
Model | RGB (with attention) | RGB (without attention) | TVNet (with attention) | TVNet (without attention) |
---|---|---|---|---|
3D ConvNet | 76.58 | 75.43 | 82.79 | 81.71 |
Bi-LSTM | 82.22 | 80.15 | 80.36 | 79.38 |

Table 2. Effects of optical flow guided temporal attention mechanism on UCF101 dataset (unit: %)
Modality | With attention | Without attention |
---|---|---|
RGB | 85.44 | 80.15 |
TVNet | 82.62 | 81.71 |
RGB+TVNet | 92.80 | 91.70 |

Table 3. Effects of mask guided spatial attention mechanism on UCF101 dataset (unit: %)
Model | Accuracy |
---|---|
VideoLSTM-two stream[15] | 89.2 |
Two-stream MLDF-3D[16] | 91.3 |
Two-stream HHF[17] | 91.2 |
Proposed model | 91.7 |
Proposed model (with OGTAM) | 92.2 |
Proposed model (with MGSAM) | 92.8 |
Proposed model (with OGTAM+MGSAM) | 94.9 |

Table 4. Comparison of proposed model and other basic models on UCF101 dataset (unit: %)
Model | Accuracy |
---|---|
IDT+FV[18] | 85.9 |
IDT+HSV[19] | 87.9 |
MIFS[20] | 89.1 |
TSN (two modalities)[2] | 94.0 |
Hidden two-stream[21] | 93.1 |
MLDF-3D[16] | 94.4 |
MS-NET[22] | 93.9 |
Two-stream I3D[3] | 98.0 |
Two-stream FCAN-comp[23] | 92.0 |
VideoLSTM[15] | 89.2 |
JSTA[11] | 93.7 |
RSTAN[24] | 94.6 |
VideoYOLO[10] | 90.6 |
Proposed model | 91.7 |
Proposed model (with OGTAM+MGSAM) | 94.9 |

Table 5. Comparison of accuracy of different algorithms on UCF101 dataset (unit: %)
Model | Accuracy |
---|---|
Good-practice CNN | 88.6 |
JDD[25] | 87.4 |
C3D[25] | 86.0 |
TSN-S+T[2] | 93.8 |
GLTF[26] | 86.1 |
Im2Flow[27] | 77.4 |
Spatial | 81.7 |
Temporal | 83.4 |
Proposed model | 89.3 |
Proposed model (with OGTAM) | 90.7 |
Proposed model (with MGSAM) | 90.6 |
Proposed model (with OGTAM+MGSAM) | 91.7 |

Table 6. Comparison of accuracy of different algorithms on Penn Action dataset (unit: %)