• Infrared and Laser Engineering
  • Vol. 54, Issue 5, 20240570 (2025)
Shuai LIU1, Mingjun WANG1,2,3, and Yiming ZHOU1
Author Affiliations
  • 1School of Automation and Information Engineering, Xi’an University of Technology, Xi’an 710048, China
  • 2Xi'an Key Laboratory of Wireless Optical Communication and Network Research, Xi’an 710048, China
  • 3School of Physics and Telecommunications Engineering, Shaanxi University of Technology, Hanzhong 723001, China
    DOI: 10.3788/IRLA20240570
    Shuai LIU, Mingjun WANG, Yiming ZHOU. Non-Line-of-Sight imaging based on improved CNN-Transformer[J]. Infrared and Laser Engineering, 2025, 54(5): 20240570

    Abstract

    Objective

    Non-Line-of-Sight (NLOS) imaging is a technique that reconstructs a target hidden behind an obstacle from light scattered off an intermediary surface. The technology is driven by the need to obtain data in complex environments where direct visual access is impossible, and it has broad applications in security surveillance, robot navigation, and medical imaging. Unlike traditional imaging technologies, NLOS imaging overcomes line-of-sight limitations by exploiting scattered light to recover information about the target. However, after multiple reflections and scattering events the signal strength is severely attenuated, and the signal received at the intermediary surface is often corrupted by noise. Improving reconstruction accuracy while suppressing noise has therefore become a key challenge in NLOS imaging.

    Methods

    Pulsed-laser illumination combined with Time-of-Flight (ToF) detection is a commonly used approach to NLOS imaging. The system typically consists of a laser source, a time-resolved single-photon avalanche diode (SPAD) detector, a relay wall, and the hidden objects (Fig.1). Leveraging deep learning, this paper proposes an enhanced CNN-Transformer neural network (Fig.2). The network uses a lightweight cross-attention mechanism to establish a bidirectional bridging architecture in which the CNN and Transformer operate in parallel and form a feedback loop (Fig.3). This design exploits the strengths of CNNs (MobileNet) in local feature processing and of Transformers in global interaction modeling, enabling deep interaction between local and global features and yielding rich deep local and global representations. Specifically, the process begins by extracting shallow features based on physical priors.
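The bidirectional bridge described above can be sketched with scaled dot-product cross-attention between the two branches. The sketch below is illustrative only: the single-head form, feature dimensions, residual connections, and random inputs are assumptions for demonstration, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_attention(q, k, v):
    """Scaled dot-product attention: each query row attends over all key rows."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)              # (n_q, n_k) similarity scores
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)      # softmax over keys
    return w @ v                               # (n_q, d) attended values

# Hypothetical sizes: 64 flattened CNN patch features and 8 global tokens, dim 32.
cnn_feats = rng.standard_normal((64, 32))      # local features (MobileNet branch)
tokens    = rng.standard_normal((8, 32))       # global tokens (Transformer branch)

# CNN -> Transformer: tokens query the local feature map to gather local detail.
tokens_updated = tokens + cross_attention(tokens, cnn_feats, cnn_feats)
# Transformer -> CNN: local features query the updated tokens, injecting
# global context back into the CNN branch (the feedback loop).
cnn_updated = cnn_feats + cross_attention(cnn_feats, tokens_updated, tokens_updated)
```

Because each direction of the bridge attends between a small token set and the feature map (rather than among all spatial positions), the cross-attention stays lightweight compared with full self-attention over the feature map.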
    These shallow features are concatenated with tokens and passed to the Transformer via the bidirectional bridging architecture, where global features are learned interactively through multi-layer self-attention. The global features captured by the Transformer are then fed back into the CNN and fused with the shallow local features, enhancing the CNN's grasp of local detail. Finally, by integrating local and global features, the system achieves more accurate reconstruction of occluded 3D targets.

    Results and Discussions

    To evaluate the proposed CNN-Transformer network for NLOS imaging reconstruction, we compared it with existing methods, including the physics-based approaches FBP, LCT, FK, and RSD, and the deep learning-based method LFE, all trained on the same simulated dataset as this work. Quantitative results (Tab.1) show that the proposed method achieves the best reconstruction performance for both intensity and depth images. For intensity images, its PSNR exceeds FK and RSD by 6.39 dB and 5.45 dB, respectively, and LFE by 1.66 dB. For depth images, its RMSE is 22% lower than LFE's. Quantitative results on an unseen test set further demonstrate the superior performance of the CNN-Transformer in both intensity and depth reconstruction, highlighting the network's generalization to unseen targets. Qualitative results (Fig.4) corroborate these findings. Reconstructions from FBP and LCT are blurry; FK and RSD recover major structures but lack detail. LFE improves on the traditional physics-based models but still struggles with fine details. In contrast, the proposed method accurately recovers the primary structural contours and excels at restoring intricate details. Qualitative results on real-world datasets (Fig.7) align with the findings from simulated data.
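The reported metrics are the standard ones: PSNR (in dB, higher is better) for intensity images and RMSE (lower is better) for depth maps. A minimal sketch of both, assuming images normalized to a peak value of 1.0:

```python
import numpy as np

def psnr(ref, img, peak=1.0):
    """Peak signal-to-noise ratio in dB between a reference and a reconstruction."""
    mse = np.mean((ref - img) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def rmse(ref, img):
    """Root-mean-square error between a reference and a reconstruction."""
    return np.sqrt(np.mean((ref - img) ** 2))

# Toy check: a uniform error of 0.1 gives MSE = 0.01,
# so PSNR = 10*log10(1/0.01) = 20 dB and RMSE = 0.1.
ref = np.zeros((4, 4))
img = np.full((4, 4), 0.1)
```

A gain of 6.39 dB PSNR, as reported against FK, corresponds to roughly a 4.4-fold reduction in mean squared error.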
    The FBP, LCT, and FK methods exhibit significant noise and blurred boundaries; LFE shows some improvement but still lacks detail. The proposed method achieves the best performance in both detail restoration and noise suppression.

    Conclusions

    To improve the detail restoration of target reconstruction in Non-Line-of-Sight (NLOS) imaging, this paper proposes an improved CNN-Transformer neural network. The network constructs a bidirectional bridging architecture through a lightweight cross-attention mechanism and runs the CNN (MobileNet) and Transformer in parallel, fully leveraging the efficiency of the CNN in local processing and the strength of the Transformer in global interaction encoding, thereby achieving dual fusion of local and global features. Experimental results show that the proposed method outperforms existing physical models and deep learning models on both simulated and real-world data, effectively addressing the detail blur and noise interference that limit existing NLOS imaging approaches. Moreover, the method generalizes well to unseen data, validating its robustness and practicality and offering new insights for applying NLOS imaging in complex scenarios.