• Optics and Precision Engineering
  • Vol. 32, Issue 5, 727 (2024)
Daxiang LI, Jiani XIN*, and Ying LIU
Author Affiliations
  • College of Communication and Information Engineering, Xi'an University of Posts and Telecommunications, Xi'an 710121, China
    DOI: 10.37188/OPE.20243205.0727
    Daxiang LI, Jiani XIN, Ying LIU. Position-sensitive Transformer aerial image object detection model[J]. Optics and Precision Engineering, 2024, 32(5): 727

    Abstract

    To address the difficulty of detecting the many small objects in UAV-captured aerial images, this paper proposes the Position-Sensitive Transformer Target Detection (PS-TOD) model. First, it presents a multi-scale feature fusion (MSFF) module built on a Positional Channel Embedded 3D Attention (PCE3DA) mechanism. PCE3DA exploits the interplay between spatial and channel information to generate 3D attention that strengthens the feature representation of regions of interest. On this basis, a bottom-up, cross-layer MSFF scheme enriches the semantics of the fused features. Second, it proposes a Position-Sensitive Self-Attention (PSSA) mechanism and uses it to build a position-sensitive Transformer encoder-decoder, which makes the model more sensitive to target position while capturing long-range dependencies across the global image context. Comparative experiments on the VisDrone dataset show that PS-TOD reaches an Average Precision (AP) of 28.8%, a 4.1% improvement over the baseline model (DETR). The model also detects objects accurately in UAV aerial images with complex backgrounds and markedly improves detection accuracy for small targets.
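    $$F_a = \bar{F}_5^{u} + \bar{F}_4 \tag{1}$$

    $$\bar{F}_a = \mathrm{PCE3DA}(F_a) \tag{2}$$

    $$\bar{F}_4^{en} = \bar{F}_a + \bar{F}_4 \tag{3}$$

    $$\bar{F}_5^{en} = \bar{F}_a + \bar{F}_5 \tag{4}$$

    $$\bar{F}_{45} = \mathrm{Conv3}(\bar{F}_4^{en}) + \mathrm{Conv3}(\bar{F}_5^{en}) \tag{5}$$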
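The first three fusion steps, Eqs. (1) to (3), can be sketched on single-channel maps stored as nested lists. This is only an illustration under stated assumptions: the superscript u is read as 2x nearest-neighbour upsampling, PCE3DA is replaced by a hypothetical identity placeholder, and the helper names are invented here rather than taken from the paper.

```python
def upsample_nearest_2x(fmap):
    """Repeat every pixel twice along height and width (assumed meaning of 'u')."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_maps(a, b):
    """Element-wise sum of two equally sized 2D maps."""
    assert len(a) == len(b) and len(a[0]) == len(b[0]), "shape mismatch"
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def identity_attention(fmap):
    """Hypothetical stand-in for PCE3DA: returns a copy of its input."""
    return [list(row) for row in fmap]


f5 = [[1.0, 2.0], [3.0, 4.0]]        # coarse map, 2x2
f4 = [[0.5] * 4 for _ in range(4)]   # finer map, 4x4

f_a = add_maps(upsample_nearest_2x(f5), f4)   # Eq. (1)
f_a_bar = identity_attention(f_a)             # Eq. (2), placeholder attention
f4_en = add_maps(f_a_bar, f4)                 # Eq. (3)
```

Eqs. (4) and (5) are left out of the sketch because the listing does not state how the resolutions of the two branches are matched.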

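    $$z_X = \mathrm{Conv1\_X}(F) \tag{6}$$

    $$f_X = \sigma(\mathrm{BN}(\mathrm{Conv1}(z_X))) \tag{7}$$

    $$g_X = \mathrm{Conv1}(f_X) \tag{8}$$

    $$\begin{aligned}
    z_Y &= \mathrm{Conv1\_Y}(F) \\
    f_Y &= \sigma(\mathrm{BN}(\mathrm{Conv1}(z_Y))) \\
    g_Y &= \mathrm{Conv1}(f_Y)
    \end{aligned} \tag{9}$$

    $$\beta = \mathrm{Sigmoid}(g_X \, g_Y) \tag{10}$$

    $$\bar{F} = \beta F \tag{11}$$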
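A minimal sketch of Eqs. (10) and (11) for one channel follows. It assumes that the product of g_X and g_Y is a broadcast product of a per-row vector with a per-column vector, and that β rescales F element-wise; the listing does not show the operator, so both readings are assumptions, as are the function names and sample values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def directional_attention(g_x, g_y):
    """beta[h][w] = sigmoid(g_x[h] * g_y[w]) for a single channel (assumed broadcast)."""
    return [[sigmoid(gx * gy) for gy in g_y] for gx in g_x]


def reweight(fmap, beta):
    """Element-wise product of the attention map and the feature map."""
    return [[b * f for b, f in zip(rb, rf)] for rb, rf in zip(beta, fmap)]


g_x = [0.0, 2.0]         # hypothetical per-row responses
g_y = [1.0, -1.0, 0.0]   # hypothetical per-column responses
fmap = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]

beta = directional_attention(g_x, g_y)   # Eq. (10)
f_bar = reweight(fmap, beta)             # Eq. (11)
```

Stacking one such map per channel gives an attention tensor with channel, height, and width axes.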

    $$\bar{x}_i = x_i + p_i \tag{12}$$

    $$\mathrm{asSA}(X) = [Q; K; V] = [(X+P)W^{Q}; (X+P)W^{K}; (X+P)W^{V}]. \tag{13}$$

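    $$\begin{aligned}
    p_{nm}^{Q} &= \alpha_{E(n,m)}^{Q} \\
    p_{nm}^{K} &= \alpha_{E(n,m)}^{K} \\
    p_{nm}^{V} &= \alpha_{E(n,m)}^{V} \\
    E(n,m) &= \min\left(\left|s_n^{h} - s_m^{h}\right| + \left|s_n^{w} - s_m^{w}\right|,\; T\right)
    \end{aligned} \tag{14}$$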
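The index E(n, m) in Eq. (14) can be illustrated with a short sketch. It assumes the coordinate differences are taken in absolute value, so that E is a Manhattan distance between grid positions clipped at the threshold T, and that tokens are numbered row by row; the function names are hypothetical.

```python
def clipped_manhattan(pos_n, pos_m, threshold):
    """min(|h_n - h_m| + |w_n - w_m|, T) for two (h, w) grid positions."""
    (h_n, w_n), (h_m, w_m) = pos_n, pos_m
    return min(abs(h_n - h_m) + abs(w_n - w_m), threshold)


def distance_table(height, width, threshold):
    """E(n, m) for every pair of tokens on a height x width grid, row-major order."""
    positions = [(h, w) for h in range(height) for w in range(width)]
    return [[clipped_manhattan(p, q, threshold) for q in positions]
            for p in positions]


table = distance_table(3, 3, threshold=2)
```

Each entry lies in 0..T, so it can serve as an index into a table of T + 1 learnable vectors, which is how the α terms in Eq. (14) appear to be selected.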

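    $$\mathrm{PSSA}(S) = [Q; K; V] = [(S+P^{Q})W^{Q}; (S+P^{K})W^{K}; (S+P^{V})W^{V}], \tag{15}$$

    $$\omega_{nm} = \frac{\exp\left(\frac{1}{\sqrt{d}} Q_n K_m^{T}\right)}{\sum_{m=1}^{N} \exp\left(\frac{1}{\sqrt{d}} Q_n K_m^{T}\right)} \tag{16}$$

    $$z_n = \sum_{m=1}^{N} \omega_{nm} V_m \tag{17}$$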
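Eqs. (16) and (17) describe a softmax over scaled dot products followed by a weighted sum of value vectors. The sketch below computes both for one query, assuming the scaling factor is 1/sqrt(d) as in standard scaled dot-product attention; the function names and sample vectors are illustrative only.

```python
import math


def attention_weights(q_n, keys):
    """Softmax over scaled dot products between one query and all keys."""
    d = len(q_n)
    scores = [sum(a * b for a, b in zip(q_n, k)) / math.sqrt(d) for k in keys]
    peak = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q_n, keys, values):
    """z_n: weighted sum of the value vectors using the attention weights."""
    weights = attention_weights(q_n, keys)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0]]      # identical keys give uniform weights
values = [[2.0, 0.0], [4.0, 2.0]]

weights = attention_weights(query, keys)
z = attend(query, keys, values)
```

Subtracting the maximum score before exponentiation leaves the softmax unchanged and only guards against overflow.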

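    $$S_0 = [s_1^{(h,w)}; s_2^{(h,w)}; \ldots; s_N^{(h,w)}] \tag{18}$$

    $$\begin{aligned}
    Z_L' &= (S_{L-1} + \mathrm{mhPSSA}(S_{L-1})) \\
    Z_L &= \mathrm{LN}(Z_L' + \mathrm{MLP}(Z_L')), \quad L = 1, \ldots, 6 \\
    Y &= Z_6
    \end{aligned} \tag{19}$$

    $$\mathrm{mhPSSA}(z) = [\mathrm{PSSA}_1(z); \ldots; \mathrm{PSSA}_K(z)] \tag{20}$$

    $$\mathrm{MLP}(x) = \mathrm{ReLU}(xW_1 + b_1)W_2 + b_2 \tag{21}$$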
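The feed-forward block of Eq. (21) is two affine maps with a ReLU in between. A minimal sketch for a single token follows, assuming x is a row vector and the weights are stored as in_dim x out_dim nested lists; the helper names and numeric values are made up for illustration.

```python
def linear(x, weight, bias):
    """Row vector x times weight (in_dim x out_dim), plus bias."""
    return [sum(xi * weight[i][j] for i, xi in enumerate(x)) + bias[j]
            for j in range(len(bias))]


def relu(vector):
    return [max(0.0, t) for t in vector]


def mlp(x, w1, b1, w2, b2):
    """ReLU(x W1 + b1) W2 + b2 for one token."""
    return linear(relu(linear(x, w1, b1)), w2, b2)


x = [1.0, -2.0]
w1 = [[1.0, 0.0, -1.0],
      [0.0, 1.0, 1.0]]     # 2 inputs -> 3 hidden units
b1 = [0.0, 0.0, 0.5]
w2 = [[1.0], [1.0], [1.0]]  # 3 hidden units -> 1 output
b2 = [0.25]

y = mlp(x, w1, b1, w2, b2)
```

Here the hidden pre-activations are 1.0, -2.0, and -2.5, so only the first unit survives the ReLU.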

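    $$\begin{aligned}
    B_L' &= \mathrm{LN}(B_{L-1} + \mathrm{mhSA}(B_{L-1})) \\
    B_L &= \mathrm{LN}(B_L' + \mathrm{mhCSA}(Y; B_L'; P^{K})) \\
    B_L &= \mathrm{LN}(B_L + \mathrm{MLP}(B_L)), \quad L = 1, \ldots, 6 \\
    \bar{B} &= B_6
    \end{aligned} \tag{22}$$

    $$\begin{aligned}
    \mathrm{mhCSA}(Y; B_L'; P^{K}) &= [\mathrm{CSA}_1; \ldots; \mathrm{CSA}_K] \\
    \mathrm{CSA} = [K; Q; V] &= [(Y + P^{K})W^{K}; (B_L' + B_{L-1})W^{Q}; YW^{V}]
    \end{aligned} \tag{23}$$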

    $$\hat{\sigma} = \arg\min_{\sigma \in \xi_M} \sum_{i}^{M} L_{match}(u_i, \bar{u}_{\sigma(i)}) \tag{24}$$
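Eq. (24) searches over assignments σ of predictions to targets for the one with the lowest total matching cost. The brute-force sketch below makes that search explicit for a small square cost matrix; it is an illustration only, since detectors of this kind usually solve the same problem with the Hungarian algorithm, and the cost values and function name are invented.

```python
from itertools import permutations


def best_assignment(cost):
    """Return the permutation sigma minimising sum_i cost[i][sigma[i]] and its total."""
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return best_perm, best_total


# cost[i][j]: hypothetical matching cost between target i and prediction j
cost = [
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.7],
    [0.6, 0.4, 0.3],
]

sigma, total = best_assignment(cost)
```

Enumerating permutations grows factorially with the number of queries, so this form is only suitable for toy sizes.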

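    $$L_{match}(u_i, \bar{u}_{\sigma(i)}) = \begin{cases} L_{cls}(\bar{p}_{\sigma(i)}(cls_i)) + L_{box}(box_i, \overline{box}_{\sigma(i)}), & cls_i \neq \varnothing \\ 0, & cls_i = \varnothing \end{cases} \tag{25}$$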

    $$L_{cls}(\bar{p}_{\sigma(i)}(cls_i)) = -\alpha_t \left(1 - \bar{p}_{\sigma(i)}(cls_i)\right)^{\gamma} \log\left(\bar{p}_{\sigma(i)}(cls_i)\right), \tag{26}$$
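Eq. (26) has the form of a focal loss on the predicted probability of the target class. A small sketch follows; the default values alpha_t = 0.25 and gamma = 2.0 are common choices in the focal-loss literature and are assumptions here, not values confirmed by this listing.

```python
import math


def focal_loss(p, alpha_t=0.25, gamma=2.0):
    """-alpha_t * (1 - p)^gamma * log(p) for a target-class probability p in (0, 1]."""
    return -alpha_t * (1.0 - p) ** gamma * math.log(p)


easy = focal_loss(0.9)   # confident, correct prediction
hard = focal_loss(0.1)   # poorly classified example
```

The factor (1 - p)^gamma shrinks the loss of well-classified examples, so hard examples dominate the total; with gamma = 0 and alpha_t = 1 the expression reduces to the ordinary cross-entropy term.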

    $$L_{box}(box_i, \overline{box}_{\sigma(i)}) = 1 - \mathrm{IOU} + \frac{\rho^{2}(box_i^{c}, \overline{box}_{\sigma(i)}^{c})}{c^{2}} \tag{27}$$
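Eq. (27) matches the shape of a distance-IoU style box loss. The sketch below assumes that ρ is the Euclidean distance between the two box centres, that c is the diagonal of the smallest box enclosing both, and that boxes are given as (x1, y1, x2, y2) corners; none of these conventions is stated in the listing.

```python
def diou_loss(box_a, box_b):
    """1 - IoU + rho^2 / c^2 for two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection over union
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union

    # Squared distance between the box centres
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
         + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest enclosing box
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
       + (max(ay2, by2) - min(ay1, by1)) ** 2

    return 1.0 - iou + rho2 / c2


loss = diou_loss((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
```

For the two sample boxes the IoU is 1/7, the squared centre distance is 2, and the squared enclosing diagonal is 18, so the loss is 1 - 1/7 + 1/9.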
