Single-shot 3D tracking based on polarization multiplexed Fourier-phase camera

Jiajie Teng; Chengyang Hu; Honghao Huang; Minghua Chen; Sigang Yang; Hongwei Chen

doi:10.1364/PRJ.432292

Abstract

For moving objects, 3D mapping and tracking has found important applications in the 3D reconstruction for vision odometry or simultaneous localization and mapping. This paper presents a novel camera architecture to locate the fast-moving objects in four-dimensional (4D) space (

x

y

z

t

) through a single-shot image. Our 3D tracking system records two orthogonal fields-of-view (FoVs) with different polarization states on one polarization sensor. An optical spatial modulator is applied to build up temporal Fourier-phase coding channels, and the integration is performed in the corresponding CMOS pixels during the exposure time. With the 8 bit grayscale modulation, each coding channel can achieve 256 times temporal resolution improvement. A fast single-shot 3D tracking system with 0.78 ms temporal resolution in 200 ms exposure is experimentally demonstrated. Furthermore, it provides a new image format, Fourier-phase map, which has a compact data volume. The latent spatio-temporal information in one 2D image can be efficiently reconstructed at relatively low computation cost through the straightforward phase matching algorithm. Cooperated with scene-driven exposure as well as reasonable Fourier-phase prediction, one could acquire 4D data (

x

y

z

t

) of the moving objects, segment 3D motion based on temporal cues, and track targets in a complicated environment.

1. INTRODUCTION

Moving targets tracking in 3D space has found many applications in various fields, such as vehicle navigation, 3D reconstruction [1], and 3D motion estimation [2]. The most widely known image-free 3D sensor is LiDAR [3,4], which generally utilizes a laser as the active lighting source and high-bandwidth detector with complex data processing to achieve long-distance and high-speed 3D detection. However, due to the costly and complicated system architecture, it is hard to be widely utilized in normal 3D vision tasks. In contrast, image-based sensors or systems are with relatively low cost and common for a wider range of 3D visual applications, such as stereo vision [5,6] and monocular 3D systems [7], which allow the moving objects to track in different scales. Nevertheless, image-based systems often require a series of operations including camera pose estimation, multi-view calibration, feature extraction, and similarity matching, which increase the computational burden of the real-time 3D tracking.

With the advance of novel visional sensors, optoelectronics hybrid devices, and post-reconstruction algorithm, a serious of new camera architectures have emerged to tackle the challenging scenarios that are inaccessible to the traditional sensors. An event camera with high temporal resolution and high dynamic range provides a new data format with pixel-wise intensity changes asynchronously. This time-sparse event data reduces the power and bandwidth requirements and allows it to work in real-time 3D reconstruction for various vision applications [8,9]. For releasing the hardware requirement, a single-pixel detector is also applied to achieve a real-time image-free 3D tracking system [10]. Although this scheme faces trouble in the multi-objects tracking scenarios, it is still a low-cost and computation-efficient system. On the other hand, with the popularity of multi-dimensional encoded optoelectronic modulation devices, computational photography shows great potential in 3D reconstruction with single-shot stereo images [11 –13], which is capable of cooperating with compressive sensing and adaptive reconstruction algorithms. Moreover, the lensless imaging system with a diffuser placed in front of the traditional sensor is demonstrated to achieve a single-shot 3D reconstruction in Refs. [14,15]. However, the effective resolution and computational overhead vary significantly with scene content and limit its practical application.

In this paper, we propose a four-dimensional (4D) information recording camera with multiplexed orthogonal polarization field-of-view (FoV), named polarization multiplexed Fourier-phase camera (PM-FPC). It is a novel camera framework, which is capable of reconstructing 4D data of moving objects in a single shot. The principle of PM-FPC is to perform pixel-wise optical coding on polarization multiplexed scenes to acquire the Fourier-phase maps of two orthogonal perspectives in one exposure. With the 8 bit grayscale quantized sinusoid modulation, the temporal resolution of the camera is increased to 256 times. Compared to the traditional image-based 3D stereo systems, it processes Fourier-phase transforming in the optical domain and has lower computational burden owing to the straightforward matching algorithm. Meanwhile, the image data volume and detection bandwidth get decreased due to the designed coding scheme with polarization multiplexing. Besides, it is able to plug in a standard camera system and adapt to various lighting environments with tunable exposure time. The experiment results with different 3D trajectories show its potential in real-time 3D motion estimation and recognition.

Sign up for Photonics Research TOC. Get the latest issue of Photonics Research delivered right to you！Sign up now

2. PRINCIPLES

A. Polarization Multiplexing and Demultiplexing

Polarization is a basic property of light, expressed as the vibration direction of the light-field. Here, we employ two orthogonal polarization states, polar-0° and polar-90°, to carry the duplet perspectives of the moving objects. A polarization beam splitter (PBS) is placed in the reverse direction to work as a combiner to generate the overlapping scene after the polarization multiplexing, which is shown in Fig. 1(b). Then, through the zooming lens, the overlapped scene is focused on the digital micromirror device (DMD), which temporally modulates each coding channel with four-phase-shifting sinusoid coding patterns to acquire the Fourier phase. Every coding channel is detected by the $4 \times 4$ binning polarization pixels, which are mounted in front of the polarization charge-coupled device (CCD). Thus, the phase information ( $0, π / 2, π, 3 π / 2$ ) and polarization states (0°, 45°, 90°, 135°) can be acquired at the same time. Similar to the polarization extraction scheme in Ref. [13], with the known polarization array, it is able to reassemble the multiplexed scenes by simply extracting the polar-0° and polar-90° values and constructing two perspectives. The whole polarization multiplexing and demultiplexing process is depicted in Fig. 1. The extinction ratio of the PBS and the polarizer array in the camera will determine the orthogonal FoV’s crosstalk level in the measurements.

Polarization multiplexing and demultiplexing process. (a) The 3D coordinate synthesized by the two orthogonal 2D plane. (b) Overlapping scene after the polarization multiplexing. (c) The detection image with four-phase-shifting temporal modulation. (d) Polarization demultiplexing process.

Figure 1.Polarization multiplexing and demultiplexing process. (a) The 3D coordinate synthesized by the two orthogonal 2D plane. (b) Overlapping scene after the polarization multiplexing. (c) The detection image with four-phase-shifting temporal modulation. (d) Polarization demultiplexing process.

B. Fourier-Phase Transforming

When an object is moving through the scene, each detection channel in the sensor will get a similar temporal pulse with a different rising edge. According to the brightness constancy of the objects, $I (x, y, t) = I (x + Δ x, y + Δ y, t + Δ t)$ , the intensity of the voxel remains the same despite small changes of position and time period [16]. Thus, these temporal signals can be simply expressed as an impulse with spatial-variant time shifting. However, this information cannot be resolved in one exposure with a traditional camera. Herein, PM-FPC is designed to record this time-shift information through the optical coding method. With the principle of the discrete Fourier transform (DFT), a temporal signal can be represented by a series of discrete Fourier coefficients with different sampling frequencies. A shift in the time corresponds to the Fourier-phase shift in each encoded channel. To avoid phase unwrapping errors [17], we choose one period sinusoid pattern as the sampling frequency, which means only the first-order Fourier coefficients (1st DFT) in the optical coding process are used. The detailed time-encoded process in each channel is displayed in Fig. 2(a). On the sensor, $4 \times 4$ binning pixels consist of one temporal coding channel named channel $i$ . With the pulse width modulation (PWM) mode of the DMD, four-phase-shifting sinusoid patterns ( $0, π / 2, π, 3 π / 2$ ) have temporally modulated on each coding channel. In a single-shot image of the sensor, a series of Fourier-phase number $F_{in}$ ( $n = 1, 2, 3, 4$ ) is detected. As illustrated in Eq. (1), the Fourier-phase number is the integration of the Hadamard product between the sinusoid pattern ${Wave}_{n} (t)$ and temporal signal $s_{i}$ of coding channel $i$ : $F_{in} = \int_{0}^{t_{expo}} s_{i} (t) \times {Wave}_{n} (t) .$ (1)Thence, with the four-step phase-shifting method [18], the shifted Fourier phase of one coding channel $i$ can be extracted as the following equation: $P_{i} = \arg {(F_{i 1} - F_{i 4}) + j (F_{i 2} - F_{i 3})} .$ (2)The time shifting on the 1st DFT can be summarized as $ℑ_{1} {s_{i} (t - T_{i})} = S_{i} (f_{1}) \cdot \exp (- j 2 π f_{1} T_{i}) = S_{i} (f_{1}) \cdot \exp (- j P_{i}),$ (3)where $ℑ_{1}$ is denoted as the 1st DFT operation, which only calculates the 1st DFT coefficient of the temporal signal in channel $i$ , $S_{i}$ is the amplitude of the Fourier coefficient, and $f_{1}$ is the frequency of the sinusoid coding pattern, which is inversely proportional to the exposure time $t_{expo} = \frac{1}{f_{1}}$ . With the known exposure time and the detected Fourier-phase $P_{i}$ , the time-shifting information $T_{i}$ can be calculated out as the following equation: $T_{i} = \frac{P_{i}}{2 π} \times t_{expo} .$ (4)The whole Fourier-phase transforming process is depicted in Fig. 2. After all of the Fourier-phase is extracted, the whole image becomes a Fourier-phase map, and it is able to imply the appearance time of objects in each channel during the exposure.

Figure 2.Fourier-phase transforming process. (a) Time-encoded process in coding channel i. (b) Fourier-phase map with 3D trajectory.

C. 3D Mapping and Tracking

The proposed system follows a parallel 3D mapping and tracking philosophy, where the main modules operate in the one-way fashion estimating the final 4D data ( $x$ , $y$ , $z$ , $t$ ). A detailed overview of the flowchart is given in Fig. 3. Core modules of the system including Fourier-phase measuring, matching, and mapping processes are marked with dashed rectangles. The only input to the system is a single-shot 2D image of the PM-FPC. Through the measuring process, the Fourier-phase map of two orthogonal views is generated. For removing the phase interference caused by the environmental noise, it implements the mask extraction by setting the amplitude threshold, which is generally set to 1/10 of the average pixel intensity in the image. Besides, the initial phase of the exposure needs to be calibrated before the time-phase mapping process. The matching operation is applied between two orthogonal planes, the XOZ and YOZ planes. With the height-consistency calibration in the experiment, the moving target appears at the same height ( $z_{1} = z_{2}$ ) of two orthogonal planes with its unique phase. Based on this, it can simply take any non-zero phase point P₁( $x_{1}$ , $z_{1}$ ) on the XOZ plane as the reference, to traverse all the pixels with the same height in the YOZ plane to find the correspondence point P₂( $y_{2}$ , $z_{1}$ ), which has the smallest phase difference with P₁. After this straightforward matching process, the 3D coordinates ( $x = x_{1}$ , $y = y_{2}$ , $z = z_{1}$ ) of the corresponding point are determined. Then, with the time-phase mapping relationship shown in Eq. (4), the time information of the object moving through the scenes can be calculated out through the precise phase measurement. Relying on the time-spatial consistency in the XOZ and YOZ plane, the 3D coordinates and time information of the object are synthesized out, which consist of the 4D datasets ( $x$ , $y$ , $z$ , $t$ ) as the final output.

Figure 3.Proposed system flowchart.

3. EXPERIMENT AND RESULTS

The schematic diagram of the experiment is shown in Fig. 4. Light from two orthogonal views is reflected by mirrors M1 and M2 and filtered by the PBS. Respectively, the duplet views get tagged with two polarization states (polar-0° and polar-90°) and combined to be a view-overlapped scene, which is further imaged and projected on a DMD (ViALUX V-9500) through the imaging lens. The DMD is employed to implement the pixel-wise temporal modulation on each coding channel [19], which is modulated by the parallel four-phase-shifting sinusoidal pattern ( $0, π / 2, π, 3 π / 2$ ). Each phase-encoded channel is composed of $2 \times 2$ sub-pixels corresponding to four polarizations (0°, 45°, 90°, 135°). The polarization camera (FLIR BPS-PGE-51S5P-C) applied in the system has $2448 \times 2048$ pixels mounted with different polarizers, and the pixel size is 3.45 μm. The DMD has a $1920 \times 1080$ micro mirror, whose pixel size is 7.6 μm. Through a strict optical calibration (Appendix B), which creates a suitable FoV ( $50 mm \times 50 mm$ ) with pixel-by-pixel correspondence (one micro-mirror corresponding to $2 \times 2$ polar pixels), the Fourier-phase map of the duplet view is measured, respectively. After the mask extraction and phase calibration steps, the matching process is applied between the XOZ and YOZ phase map. Then, the 3D position and time information of the object are derived by the time-phase mapping results. In the experiment, a motorized stage (Zolix LA100-60-ST) is utilized to build the horizontal linear movement of the target. When it comes to the circle motion scenes, an optical chopper (Thorlabs MC200B) with tunable frequency is implemented to work as a rotating stage. For a more complex motion scene, the vertical rotation and horizontal linear movement are combined to produce a spiral motion. The exposure time with each trajectory is different depending on the illumination and lighting conditions. Based on the Fourier-phase coding scheme, one coding channel corresponds to $4 \times 4$ pixels on the polarization camera, so the maximum spatial resolution of the reconstructed 3D space is 1/4 of camera resolution ( $2048 / 4 = 512$ ). However, for better phase measurement results, we utilize $12 \times 12$ binning pixels on the camera to perform as one coding channel and choose an illuminous LED as the moving object. In the first place, we test a dynamic scene with one object, including linear and circular motion, which is depicted in Fig. 5(a). The diameter of the objects is 5 mm, and the horizontal movement speed is 20 mm/s. In the one-line test, the exposure time is set at 2 s with 100 pixel length trajectory in the picture, which is consistent with the actual FoV and zoom ratio (0.645). In the circular motion, the period of the object motion is 320 ms, which is accordant to the rotation frequency (3.1 Hz) of the chopper. The reconstructed 3D position of these scenes is marked with the solid sphere with the changing color indicating the time information. The effective temporal resolution of the reconstruction is dependent on the exposure time and gray quantization bit number of the DMD, which is discussed in the Appendix C. Here, with the 8 bit grayscale DMD and 200 ms exposure time, the equivalent temporal resolution is 1280 fps ( $256 \times 5 fps$ ; fps, frames per second). Then, an object is added in the scene, and the results of two objects 3D tracking are shown in Fig. 5(b). In the multi-target tracking, the spatial-temporal restriction is utilized, which is the Euclidean distance [20] between the previous phase point and the current one. Based on the distance restriction, the multiple targets with different motion can also be distinguished in a single shot.

Figure 4.Schematic diagram of the polarization multiplexed Fourier-phase camera.

Figure 5.(a) One object motion. (b) Two objects motion. (c) Rotation. (d) Handwriting heart (see Visualization 1).

Furthermore, the object spiral motion with different exposure time (200 ms, 400 ms, 600 ms) is recorded and reconstructed, where the spiral forward speed is also 20 mm/s, and the rotation period is 160 ms. As shown in Fig. 5(c), these 3D tracking results all fit well to the real 3D trajectory marked with the gray sphere in the time 3D maps. To further verify the system potential in the 3D motion recognition, we also tested one handwriting trajectory with the heart shape, the results are displayed in Fig. 5(d). It is noted that in the complicated motion scene, the trajectory crossing area is unpreventable with the increasing exposure, which leads to inaccurate 4D reconstruction and poor matching results. To solve this problem, phase loss function and motion estimation are added to mitigate the effect of phase error. Some phase optimization in the crossing area is discussed in Appendix F, such as the K-means clustering [19], cubic spline interpolation [21], and Kalman filter [22]. Owing to these optimizations, a dynamic 3D handwriting trajectory of the heart and its matching process are displayed in Visualization 1.

4. CONCLUSION

In conclusion, we proposed a single-shot 3D tracking system based on a novel camera architecture and a new image format: Fourier-phase map. The principle of the system is to acquire the time-phase shifting of the target in two orthogonal views with different polarization states. Only one 2D image with motion trajectory is needed to reconstruct the 4D data ( $x$ , $y$ , $z$ , $t$ ) of the target. Owing to the polarization multiplexing and optical coding method, the detection bandwidth gets significantly decreased, which makes it work well in a low-cost polarization camera with efficient reconstruction algorithms. Meanwhile, the Fourier-phase transforming in the optical domain reduces the computational overhead from the data acquisition and quantization in the calculation. Compared with the traditional tracking system based on the frame difference method, simple algorithms, such as the phase-matching and time-phase mapping processes, relevantly mitigate the cost of the computation. In the experiment, the effective frame rate of the camera is achieved at 1280 fps under the exposure time of 200 ms, which breaks the exposure constraint of a traditional camera with the preset temporal resolution. With the long-exposure detection scheme, PM-FPC has higher SNR than normal high-speed cameras, providing it with the ability to capture the dynamic objects under low-light 3D scenes. In addition, owing to the pixel-mounted polarizer array, PM-FPC can filter the inevitable glare and specular interference, which is a great challenge to a traditional camera. Besides, to achieve more efficient and accurate phase-to-depth mapping, the universal stereo vision system is considered to replace the orthogonal view system. For the wider application in the 3D motion estimation, the phase prediction process is added on the trajectory crossing area. Furthermore, it is expected to realize a real-time 3D motion prediction and multi-target detection system with the development of the neural network under new data types [23,24].

APPENDIX A: QUANTITATIVE ANALYSIS ON THE PERFORMANCE OF PM-FPC

x

Figure 6.Simulation results.

APPENDIX B: OPTICAL CALIBRATION

Similar to an optical coding camera system, the plug-in optical devices need to be carefully calibrated before the Fourier-phase measurement. First, the imaging lens and the relay lens in the system are both designed for clear imaging effects with suitable FoV. It is important to record the whole object trajectory during the exposure time. Another key parameter in PM-FPC is the pixel-to-pixel correspondence between the DMD and the polarization sensor. There are two common errors in the pixel-to-pixel calibration: the mismatch and the misalignment, which are difficult to observe through the imaging. Fortunately, the sensor and DMD can be regarded as two spatial gratings, so the Moire fringe operation illustrated in Ref. [25] can be applied in the calibration. In the experiment, we adjust the zoom ratio of the zooming lens and the rotation angle of the sensor, which will contribute to the accurate pixel-to-pixel correspondence of the system and the precise phase measurement for 3D tracking.

APPENDIX C: EFFECTIVE TEMPORAL RESOLUTION

t_{expo}

APPENDIX D: SNR IN THE PM-FPC

M

APPENDIX E: DATA VOLUME

F

APPENDIX F: PHASE OPTIMIZATION IN THE CROSSING AREA

K

Figure 7.Phase optimization in the crossing area.

References

[1] D. A. Forsyth, J. Ponce. Computer Vision: A Modern Approach(2012).

[2] D. Pathak, R. Girshick, P. Dollár, T. Darrell, B. Hariharan. Learning features by watching objects move. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2701-2710(2017).

[3] S.-T. Park, J. G. Lee. Improved Kalman filter design for three-dimensional radar tracking. IEEE Trans. Aerosp. Electron. Syst., 37, 727-739(2001).

[4] A. Chaikovsky, Y. O. Grudo, Y. A. Karol, A. Y. Lopatsin, L. Chaikovskaya, S. Denisov, F. Osipenko, A. Slesar, M. Korol, Y. S. Balin, S. V. Samoilova. Regularizing algorithm and processing software for Raman lidar-sensing data. J. Appl. Spectrosc., 82, 779-787(2015).

[5] E. Seemann, K. Nickel, R. Stiefelhagen. Head pose estimation using stereo vision for human-robot interaction. 6th IEEE International Conference on Automatic Face and Gesture Recognition, 626-631(2004).

[6] R. Munoz-Salinas, E. Aguirre, M. Garca-Silvente. People detection and tracking using stereo vision and color. Image Vis. Comput., 25, 995-1007(2007).

[7] A. Mauri, R. Khemmar, B. Decoux, N. Ragot, R. Rossi, R. Trabelsi, R. Boutteau, J.-Y. Ertaud, X. Savatier. Deep learning for real-time 3D multi-object detection, localisation, and tracking: application to smart mobility. Sensors, 20, 532(2020).

[8] Y. Zhou, G. Gallego, H. Rebecq, L. Kneip, H. Li, D. Scaramuzza. Semi-dense 3D reconstruction with a stereo event camera. Proceedings of the European Conference on Computer Vision (ECCV), 235-251(2018).

[9] H. Rebecq, G. Gallego, E. Mueggler, D. Scaramuzza. EMVS: event-based multi-view stereo—3D reconstruction with an event camera in real-time. Int. J. Comput. Vis., 126, 1394-1414(2018).

[10] Q. Deng, Z. Zhang, J. Zhong. Image-free real-time 3-D tracking of a fast-moving object using dual-pixel detection. Opt. Lett., 45, 4734-4737(2020).

[11] Y. Sun, X. Yuan, S. Pang. Compressive high-speed stereo imaging. Opt. Express, 25, 18182-18190(2017).

[12] Z. Zhang, S. Zhang. One-shot 3D shape and color measurement using composite RGB fringe projection and optimum three-frequency selection. Proc. SPIE, 7511, 751103(2009).

[13] M. Qiao, X. Liu, X. Yuan. Snapshot spatial–temporal compressive imaging. Opt. Lett., 45, 1659-1662(2020).

[14] N. Antipa, G. Kuo, R. Heckel, B. Mildenhall, E. Bostan, R. Ng, L. Waller. Diffusercam: lensless single-exposure 3D imaging. Optica, 5, 1-9(2018).

[15] X. Feng, L. Gao. Ultrafast light field tomography for snapshot transient and non-line-of-sight imaging. Nat. Commun., 12, 2179(2021).

[16] T. Yamazato, M. Kinoshita, S. Arai, E. Souke, T. Yendo, T. Fujii, K. Kamakura, H. Okada. Vehicle motion and pixel illumination modeling for image sensor based visible light communication. IEEE J. Sel. Areas Commun., 33, 1793-1805(2015).

[17] C. Zhang, H. Zhao, X. Gao, Z. Zhang, J. Xi. Phase unwrapping error correction based on phase edge detection and classification. Opt. Lasers Eng., 137, 106389(2021).

[18] H. Huang, C. Hu, S. Yang, M. Chen, H. Chen. Temporal ghost imaging by means of Fourier spectrum acquisition. IEEE Photon. J., 12, 6803012(2020).

[19] A. Likas, N. Vlassis, J. J. Verbeek. The global k-means clustering algorithm. Pattern Recognit., 36, 451-461(2003).

[20] T. Saito, J. I. Toriwaki. New algorithms for euclidean distance transformation of an n-dimensional digitized picture with applications. Pattern Recognit., 27, 1551-1565(1994).

[21] S. McKinley, M. Levine. Cubic spline interpolation. Coll. Redwoods, 45, 1049-1060(1998).

[22] E. A. Wan, R. Van Der Merwe. The unscented Kalman filter for nonlinear estimation. Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium, 153-158(2000).

[23] C. Hu, H. Huang, M. Chen, S. Yang, H. Chen. Fouriercam: a camera for video spectrum acquisition in a single shot. Photon. Res., 9, 701-713(2021).

[24] C. Hu, H. Huang, M. Chen, S. Yang, H. Chen. Video object detection from one single image through opto-electronic neural network. APL Photon., 6, 046104(2021).

[25] S. Ri, M. Fujigaki, T. Matui, Y. Morimoto. Accurate pixel-to-pixel correspondence adjustment in a digital micromirror device camera by using the phase-shifting Moiré method. Appl. Opt., 45, 6940-6946(2006).

[26] O.-C. M. Gain. On-Chip Multiplication Gain(2002).

[27] K. Krishna, M. N. Murty. Genetic k-means algorithm. IEEE Trans. Syst. Man Cybernet. B, 29, 433-439(1999).