FourierCam: a camera for video spectrum acquisition in a single shot

Photonics Research, Vol. 9, Issue 5, 701 (2021)
Chengyang Hu1,2,†, Honghao Huang1,2,†, Minghua Chen1,2, Sigang Yang1,2, and Hongwei Chen1,2,*
Author Affiliations
  • 1Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
  • 2Beijing National Research Center for Information Science and Technology (BNRist), Beijing 100084, China
    DOI: 10.1364/PRJ.412491

    Abstract

    Novel camera architectures facilitate the development of machine vision. Instead of capturing frame sequences in the temporal domain as traditional video cameras do, FourierCam directly measures the pixel-wise temporal spectrum of the video in a single shot through optical coding. Compared to classic video cameras and the time-frequency transformation pipeline, this programmable frequency-domain sampling strategy offers an attractive combination of characteristics: low detection bandwidth, low computational burden, and low data volume. Based on the various temporal filter kernels designed for FourierCam, we demonstrate a series of machine vision functions, such as video compression, background subtraction, object extraction, and trajectory tracking.

    1. INTRODUCTION

    Humans observe the world in the space–time coordinate system, and traditional video cameras are based on the same principle. The video data format, a time series of image frames, is well understood by our eyes and is the basis of many years of research in machine vision. With the development of optics, focal plane optoelectronics, and post-detection algorithms, some novel video camera architectures have gradually emerged [1]. Single-shot ultrafast optical imaging systems observe transient events in physics and chemistry at an incredible rate of one billion frames per second (fps) [2]. The event camera, with high dynamic range, high temporal resolution, and low power consumption, asynchronously measures the brightness change, position, and sign of each pixel to generate event streams and is widely used in autonomous driving, robotics, security, and industrial automation [3]. A privacy-preserving camera based on a coded aperture has also been applied to action recognition [4]. Although the functions of these cameras are impressive, the essential sampling strategy is still to measure the reflected or transmitted light intensity of a scene in the temporal domain. In the lens system, pixels can be regarded as independent time channels, and the acquired signal is the temporal variation of light intensity at the corresponding position in the scene. It is well known that the frequency-domain features of a visual temporal signal are often more significant. For example, a natural scene video generally has high temporal redundancy, so most information of a temporal signal concentrates in low-frequency components, which is a premise of video compression [5]. The static background of the scene appears as a DC component in the frequency domain, which provides insights for background subtraction [6–8]. In deep learning, performing high-level vision tasks on spatial frequency-domain data brings better results [9]. By space–time duality, this strategy has the potential to be applied to temporal frequency-domain data as well. All of the above frequency characteristics imply that capturing video in the temporal frequency domain instead of the temporal domain can initiate a sampling revolution.

    In this paper, we propose a temporal-frequency-sampling video camera, FourierCam, a novel architecture that renews the basic sampling strategy. The concept of FourierCam is to perform pixel-wise optical coding on the scene video and directly obtain its temporal spectrum in a single shot. In contrast with traditional cameras, this framework of single-shot temporal spectrum acquisition has a lower detection bandwidth. Furthermore, the data volume can be reduced by analyzing the temporal spectrum features for efficient sampling. Since the temporal Fourier transform is performed in the optical system, the computational burden is lower than that of the time-frequency transformation pipeline (sampling–storing–transforming). Beyond these basic advantages, thanks to the clear physical meaning of the spectrum, a variety of temporal filter kernels can be designed to accomplish typical machine vision tasks. To demonstrate the capability of FourierCam, we present a series of applications covering video compression, background subtraction, object extraction, and trajectory tracking. These applications can be switched simply by adjusting the temporal filter kernels, without changing the system structure. As a flexible framework, FourierCam can be easily integrated with existing imaging systems and is suitable for imaging from the micro to the macro scale.

    2. PRINCIPLE OF FourierCam


    Figure 1.Overview of FourierCam. (a) Schematic and prototype of FourierCam. (b) Coding strategy of FourierCam. The real scene is coded by a spatial light modulator (DMD) and integrated during a single exposure of the image sensor. The DMD is spatially divided into coding groups (5×5 coding groups are shown here, marked as CG), and each CG contains multiple coding elements (4×4 coding elements are shown here, marked as CE) to extract the Fourier coefficients of the pixel temporal vector. The Fourier coefficients of different pixel temporal vectors form the temporal spectrum of the scene. (c) Three demonstrative applications of FourierCam: video compression, selective sampling, and trajectory tracking.

    In the experimental setup, the scene is imaged on a virtual plane through a camera lens (CHIOPT HC3505A). A relay lens (Thorlabs MAP10100100-A) transfers the image to the DMD (ViALUX V-9001, 2560×1600 resolution, 7.6 μm pitch size) for light amplitude modulation. The reflected light from the DMD is then focused onto an image sensor (FLIR GS3-U3-120S6M-C, 4242×2830 resolution, 3.1 μm pitch size) by a zoom lens (Utron VTL0714V). Because each DMD mirror is matched with 3×3 image sensor pixels, the effective resolution is one-third of the image sensor resolution in both the horizontal and the vertical directions (i.e., 1414×943).

    The principle of the proposed FourierCam system is to spatially split the scene into independent temporal channels and acquire the temporal spectrum through the corresponding CG for each channel. Every CG contains several CEs to obtain Fourier coefficients at the frequencies of interest. During one exposure time $t_{\text{expo}}$, the detected value $D_{jk}^{\varphi}$ in CE $k$ of CG $j$ is equivalent to an inner product of the pixel temporal vector $I_j(t)$ and the pixel temporal sampling vector $S_{jk}^{\varphi}(t)$:

$$D_{jk}^{\varphi} = \langle I_j(t), S_{jk}^{\varphi}(t) \rangle = \int_{t_{\text{expo}}} I_j(t)\,[A + B\cos(2\pi f_k t + \varphi)]\,\mathrm{d}t,$$

where $S_{jk}^{\varphi}(t)$ is the sinusoidal pixel temporal sampling vector with frequency $f_k$ and phase $\varphi$ in CE $k$ of CG $j$, and $A$ and $B$ denote the average intensity and the contrast of $S_{jk}^{\varphi}(t)$, respectively. The Fourier coefficient $F_{jk}$ at $f_k$ can be extracted by four-step phase-shifting as

$$2BC \times F_{jk} = \left(D_{jk}^{0} - D_{jk}^{\pi}\right) + i\left(D_{jk}^{\pi/2} - D_{jk}^{3\pi/2}\right),$$

where $C$ depends on the response of the image sensor. The DC term $A$ is canceled out simultaneously by the four-step phase-shifting.
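    To make the readout concrete, the following minimal sketch (Python/NumPy; all parameter values and names are illustrative assumptions, not the authors' implementation) simulates one coding element: a pixel waveform is multiplied by four phase-shifted sinusoidal codes, integrated over the exposure, and the four detector values are combined into the Fourier coefficient exactly as in the equation above.

```python
import numpy as np

t_expo = 1.0                       # exposure time (s); assumed value
n = 4096                           # fine time grid used to emulate the optics
t = np.linspace(0.0, t_expo, n, endpoint=False)
dt = t_expo / n

# An arbitrary pixel temporal vector I_j(t) that varies during the exposure.
I = 1.0 + 0.6 * np.cos(2 * np.pi * 3 * t + 0.7) + 0.3 * np.cos(2 * np.pi * 5 * t)

A, B = 0.5, 0.5                    # mean and contrast of the code (DMD range [0, 1])
f_k = 3.0                          # frequency of interest (Hz)

def detect(phase):
    """Integrated value D_jk^phi recorded by one coding element."""
    S = A + B * np.cos(2 * np.pi * f_k * t + phase)
    return np.sum(I * S) * dt      # long-exposure integration on the sensor

D0, Dpi = detect(0.0), detect(np.pi)
Dq, D3q = detect(np.pi / 2), detect(3 * np.pi / 2)

# 2BC * F_jk = (D^0 - D^pi) + i (D^{pi/2} - D^{3pi/2}); sensor gain C = 1 here.
F_jk = ((D0 - Dpi) + 1j * (Dq - D3q)) / (2 * B)

# Cross-check against the direct Fourier coefficient of I(t) at f_k.
F_ref = np.sum(I * np.exp(-1j * 2 * np.pi * f_k * t)) * dt
print(F_jk, F_ref)                 # agree up to discretization error
```

    Note how the constant term $A$ drops out of both differences, which is the cancellation of the DC term stated above.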

    Based on the aforementioned principle, the temporal spectrum of the scene can be easily obtained. As a novel camera architecture with a special data format, FourierCam has the following three advantages (see Appendix B for details).

    Low detection bandwidth: Since the image sensor only needs to detect the integration of the coded scene over the entire exposure time to obtain the temporal spectrum, the required detector bandwidth is much lower than the bandwidth of the scene variation.

    Low data volume: A natural scene has high temporal redundancy; i.e., most of its information concentrates in low-frequency components. Besides, some special scenes, such as periodic motions, have a narrow bandwidth in the temporal spectrum. FourierCam enables flexible design of the sampling frequencies of interest to cut down the temporal redundancy and reduce the data volume.

    Low computational burden: The multiplication and summation operations of Fourier transform are realized by optical coding and long exposure in FourierCam; thus, the temporal spectrum can be acquired with low computational burden.

    Here, we introduce three applications to demonstrate these advantages of FourierCam [illustrated in Fig. 1(c)]. The first application is video compression. We verify the temporal spectrum acquisition of FourierCam and demonstrate video compression by using the low-frequency-concentration property of natural scenes. The second application is selective sampling. We show that FourierCam is able to subtract the static background, as well as extract objects with a specific texture, motion period, or speed, by applying designed temporal filter kernels during sensing. The last application is trajectory tracking. The temporal phase reveals the time order of events, so FourierCam can be used to analyze the presence and trajectory of moving objects. These applications show that the temporal spectrum acquired by FourierCam, as a new format of visual information, provides physical features that assist and complete vision tasks.

    3. TEMPORAL SPECTRUM ACQUISITION: BASIC FUNCTION AND VIDEO COMPRESSION

    We first demonstrate the basic spectrum acquisition function of FourierCam. For ordinary aperiodic moving objects or naturally varying scenes, the energy in the temporal spectrum is mainly concentrated at low frequencies. We exploit this observation to record compressive video in the temporal domain by acquiring only the low-frequency Fourier coefficients with FourierCam.

    Using the above method, we obtain the Fourier coefficient $F_{jk}$ of $f_k$ in CG $j$. We can combine all Fourier coefficients in CG $j$ to form its temporal spectrum as

$$F_j = \{F_{jh}^{*}, F_{j(h-1)}^{*}, \ldots, F_{j(h-1)}, F_{jh}\}, \quad h = p \times q,$$

where $h$ ($= p \times q$) is the number of CEs in a CG, and $F_{jh}^{*}$ denotes the complex conjugate of $F_{jh}$. The pixel temporal vector $I_j(t)$ can be reconstructed by applying the inverse Fourier transform:

$$2BC \times R_j = \mathcal{F}^{-1}\{F_j\},$$

where $\mathcal{F}^{-1}$ denotes the inverse Fourier transform operator. The result of the inverse transform, $R_j$, is proportional to the pixel temporal vector $I_j(t)$ in CG $j$. By applying the same operation to all CGs, we can reconstruct the video of the scene.
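    The sketch below (a continuation of the earlier simulation style; the coefficient ordering convention is an assumption) mirrors the measured coefficients into a conjugate-symmetric spectrum and inverts it with an FFT, which is the digital analogue of the reconstruction equation.

```python
import numpy as np

def reconstruct_pixel(coeffs):
    """coeffs: h complex Fourier coefficients [F_0, ..., F_{h-1}] for one CG,
    ordered from DC upward. Returns R_j, proportional to I_j(t)."""
    h = len(coeffs)
    m = 2 * h - 1                              # reconstructed frame count
    spectrum = np.zeros(m, dtype=complex)
    spectrum[:h] = coeffs                      # DC and positive frequencies
    spectrum[h:] = np.conj(coeffs[1:][::-1])   # mirrored negative frequencies
    return np.fft.ifft(spectrum).real

# Round-trip check on a 9-frame pixel waveform: keeping DC plus the 4
# positive frequencies preserves all 9 real degrees of freedom.
signal = np.random.rand(9)
recovered = reconstruct_pixel(np.fft.fft(signal)[:5])
print(np.allclose(recovered, signal))          # True
# Keeping fewer low-frequency coefficients instead yields the low-pass,
# compressive reconstruction used in this section.
```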


    Figure 2.Capturing aperiodic motion video using FourierCam. (a) Illustration of experiment setup and coding pattern on DMD. Each CG contains nine CEs (3×3, ranging from 0 Hz to 80 Hz) to encode the scene. (b) A toy car is used as a target. Top left: static object as ground truth. Top right: coded data captured by FourierCam. Middle left: amplitude of temporal spectrum. Middle right: phase of temporal spectrum. Bottom row: zoom in of middle row. A white-dotted mesh splits into different CGs. (c) A rotating disk with a panda pattern is used as a target. Top left: static object as ground truth. Top right: coded data captured by FourierCam. Middle left: amplitude of temporal spectrum. Middle right: phase of temporal spectrum. Bottom row: zoom in of middle row. A white-dotted mesh splits into different CGs. (d) Three frames from the reconstructed videos of the two scenes in (b) and (c). A yellow-dotted line is shown as reference.

    The first demonstrative scene in this application is a toy car running through the field of view. A capture of the static toy car is shown in Fig. 2(b) (top left) as ground truth. The coded data acquired by FourierCam are shown in Fig. 2(b) (top right), in which the scene is blurred and features of the toy car cannot be visually distinguished. After decoding, the complex temporal spectrum of the scene can be extracted. The corresponding amplitude and phase are shown in Fig. 2(b) (middle row) with their zoomed-in views (bottom row). In addition to the translating toy car, a rotating object is also used for demonstration. This scene is a panda pattern on a rotating disk with an angular velocity of 20 rad/s. In Fig. 2(c), the static capture of the object (top left), the coded data (top right), and the amplitude and phase (middle row) are shown, respectively.

    To visually evaluate the correctness of the acquired temporal spectra, the videos of the two scenes are reconstructed using the inverse Fourier transform. Figure 2(d) displays three frames from the videos of the toy car (left column) and the rotating panda (right column). These results clearly show the states of the dynamic scenes at different times and indicate that FourierCam correctly acquires the temporal spectrum. As the single-shot detection data include the information of multiple frames (16 frames in this demonstration), FourierCam realizes 16× video compression. (See Appendix D for a numerical analysis of the video compression performance; the reconstructed toy car video is shown as an example in Visualization 1.)

    4. SELECTIVE SAMPLING: FLEXIBLE TEMPORAL FILTER KERNELS

    FourierCam provides the flexibility to design the combination of frequencies to be acquired, which we term temporal filter kernels in this paper. By considering priors of the scenes and objects, one can selectively sample the objects of interest. In this part, three scenarios are demonstrated: periodic motion video acquisition, static background subtraction, and object extraction based on speed and texture.

    Periodic motions are widespread in medicine, industry, and scientific research, e.g., heartbeats, rotating tool bits, and vibrations. Since a periodic signal contains energy only at DC, the fundamental frequency, and its harmonics, it has a very sparse representation in the Fourier domain (see Appendix E for details). Taking these temporal spectrum characteristics as prior information, we use FourierCam to selectively acquire several principal frequencies of the temporal spectrum, as sketched below.
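    The sparsity argument can be checked numerically; this minimal sketch (illustrative parameters) builds an exactly periodic pixel waveform and confirms that all of its energy sits in the harmonic bins, so coding only those frequencies loses nothing.

```python
import numpy as np

n = 1000                     # fine time samples across one exposure (t_expo = 1 s)
period = 250                 # samples per motion period (P = 0.25 s, assumed prior)
idx = np.arange(n)
I = ((idx % period) < 12).astype(float)   # pulse train: exactly periodic waveform

F = np.fft.fft(I) / n
bins = np.arange(n)
harmonics = (bins % (n // period)) == 0   # DC, 1/P, 2/P, ... (and mirrored bins)

energy_ratio = np.sum(np.abs(F[harmonics]) ** 2) / np.sum(np.abs(F) ** 2)
print(energy_ratio)          # 1.0: the periodic scene is fully captured by
                             # coding only at the harmonic frequencies
```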


    Figure 3.Capturing periodic motion video using FourierCam. (a) To capture a periodic motion with four frequencies, each CG contains four CEs (2×2) to encode the scene. (b) A rotating disk is used as target. Top left: static object as ground truth. Top right: the zoom-in view of the captured data with and without coding, corresponding to normal slow cameras and FourierCam, respectively. Ordinary slow cameras blur out the details of moving objects while coded structure in FourierCam capture provides sufficient information to reconstruct the video. Bottom: four frames from the reconstructed video. Red-dotted lines are shown in each frame to indicate the direction of the disk.

    Subtracting the background and extracting moving objects are important techniques for video surveillance and other video processing applications. In the frequency domain, the background is concentrated in the DC component; by filtering out the DC component, one can subtract the background and extract moving objects (a minimal sketch follows). Frequency-domain moving object extraction approaches [6–8] have been proposed, but they need to acquire the video first and then perform the Fourier transform, and thus suffer from relatively high computational cost and low efficiency. Thanks to its capability to directly acquire specific temporal spectral components in the optical domain, FourierCam overcomes these drawbacks. In addition to subtracting the background, preanalysis of the temporal spectrum profile of the objects of interest gives the prior needed to design coding patterns for FourierCam to realize specific object extraction.
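    A minimal sketch of this idea (simulated digitally here, whereas FourierCam performs the frequency selection optically) shows that removing only the DC bin removes the static background:

```python
import numpy as np

n_frames = 64
t = np.arange(n_frames)
background = 0.8                                  # static scene value at this pixel
foreground = 0.5 * (np.abs(t - 32) < 4)           # object passing by mid-exposure
I = background + foreground

F = np.fft.fft(I)
F[0] = 0.0                                        # the only change: drop the DC bin
I_no_bg = np.fft.ifft(F).real

# The reconstruction equals the foreground minus its own temporal mean;
# the static background is gone entirely.
print(np.allclose(I_no_bg, foreground - foreground.mean()))   # True
```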


    Figure 4.Object extraction by FourierCam. (a) Illustration of object extraction. The coding frequencies are based on the spectrum of the objects of interest. In this demonstration, the four rings on the disk are regarded as four objects of interest. Each ring only contains one frequency so that one CE is used in one CG. (b) Left: reference static scene with a disk and a poker card. The disk is rotating when capturing, and the four rings share the same rotating speed. Four right columns: FourierCam captured data for four rings extraction and corresponding results. For each extracted ring, other rings and static poker card are neglected. (c) Results for two identical rings rotating at different speed (1980 and 800 r/min, respectively). FourierCam enables extraction of a specific one out of these two rings.

    The results show that FourierCam enables background subtraction and object extraction based on temporal spectrum differences. Although only one frequency was used in the experiment, in principle multiple frequencies can be used to reconstruct more complex scenes, as long as the spectral difference is sufficiently obvious. It is worth noting that in some special cases objects with different textures and speeds may have the same spectral features, making FourierCam fail to distinguish them (see Appendix G for details).

    5. TEMPORAL PHASE: TRAJECTORY TRACKING

    Object detection and trajectory tracking of fast-moving objects have important applications in various fields. In general, object detection determines the presence of an object, and object tracking acquires the spatial–temporal coordinates of a moving object. For the temporal waveform of a pixel that the object passes, the moving object takes the form of a pulse at a specific time. As the object moves, the temporal waveforms at different spatial positions have different temporal pulse positions, resulting in a phase shift in their temporal spectra. Since the Fourier transform is a global-to-point transformation, one can extract the presence and position of the pulse in the temporal domain from the amplitude and phase of a single Fourier coefficient. From this perspective, one can use FourierCam to determine the presence of a moving object and/or simultaneously acquire its spatial trajectory and temporal position.

    To detect and track the moving object, only one frequency is needed to encode the scene. In this case, we let $p = q = 1$ and $f = f_0 = 1/t_{\text{expo}}$. Thus, $f_0$ is the lowest resolvable frequency, and its Fourier coefficient $F_{j0}$ provides sufficient knowledge of the presence and/or motion of the object. The amplitude $A_{j0}$ of $F_{j0}$ is $A_{j0} = \mathrm{abs}(F_{j0})$, where $\mathrm{abs}(\cdot)$ denotes the absolute value. Since a static scene does not contain the $f_0$ component in the temporal spectrum, moving object detection can be achieved by applying a threshold on $A_{j0}$: an $A_{j0}$ larger than the threshold indicates the presence of a moving object.

    For moving object tracking, since the long exposure already gives the trace of the object, the phase $P_j$ of $F_{j0}$ is utilized to further extract the temporal information: $P_j = \arg(F_{j0})$, where $\arg(\cdot)$ denotes the argument operation. A temporal waveform with a displacement of $t_j$ in the temporal domain results in a linear phase shift of $2\pi f_0 t_j$ in the temporal spectrum:

$$I_j(t - t_j) = \mathcal{F}^{-1}\{F_{j0} \times \exp(-i 2\pi f_0 t_j)\}.$$

    Therefore, the temporal displacement can be derived through

$$t_j = t_{\text{expo}} \times \frac{P_j}{2\pi}.$$

    By applying the same operation to all CGs, we can extract the temporal information across the scene and acquire the spatial–temporal coordinates of a moving object.
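    The following sketch (assumed parameters and Fourier conventions) illustrates the full recipe for one pixel: an object crossing at time $t_j$ produces a pulse, the coefficient at $f_0 = 1/t_{\text{expo}}$ is computed, its amplitude flags the detection, and its phase is inverted to recover $t_j$.

```python
import numpy as np

t_expo = 1.0
n = 2048
t = np.linspace(0.0, t_expo, n, endpoint=False)
f0 = 1.0 / t_expo                      # lowest resolvable frequency

t_j = 0.37                             # true crossing time of the object
I = np.exp(-(((t - t_j) / 0.01) ** 2)) # short pulse in the pixel waveform

F_j0 = np.sum(I * np.exp(-1j * 2 * np.pi * f0 * t)) * (t_expo / n)

A_j0 = np.abs(F_j0)                    # detection: threshold on the amplitude
P_j = np.angle(F_j0)                   # tracking: phase encodes the pulse time
# With the e^{-i 2 pi f t} convention used here, P_j = -2 pi f0 t_j; the
# paper's t_j = t_expo * P_j / (2 pi) assumes the opposite phase convention.
t_est = (-P_j / (2 * np.pi * f0)) % t_expo
print(A_j0 > 1e-3, t_est)              # True, ~0.37
```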


    Figure 5.Moving object detection and tracking by FourierCam. (a) Only one frequency is needed to encode the scene for moving object detection and tracking. The period of sinusoidal coding signal is equal to the exposure time. Thus, only one CE is contained in each CG. (b) Coded data captured by FourierCam and tracking results. Left column: characters ‘T’, ’H’, ‘U’, ‘EE’ sequentially displayed by a screen with a 0.25 s duration for each. The color indicates the distribution of appearing time. Middle column: results for a displayed spot moving along a heart-shaped trajectory. Right column: results for two spots moving in circular trajectories with different radii. The spots are printed on a rotating disk driven by a motor.

    6. DISCUSSION AND CONCLUSION

    The main achievement of this work is the implementation of a high-quality temporal spectrum vision sensor, a concrete step toward a novel video camera architecture with low detection bandwidth, low computational burden, and low data volume. In the experiments, we demonstrate the advantages of FourierCam in machine vision applications such as video compression, background subtraction, object extraction, and trajectory tracking. Among these applications, prior knowledge is not required for aperiodic video compression, background subtraction, and trajectory tracking (see Table 1 in Appendix H for details). These applications cover the most common scenarios and can be integrated with existing machine vision systems, especially in autonomous driving and security [11]. Introducing prior knowledge makes FourierCam lose some flexibility but gain better performance. Applications that require prior knowledge (periodic video compression and specific object extraction) suit special scenarios (e.g., modal analysis of vibrations). Several engineering disciplines rely on modal analysis of vibrations to learn about the physical properties of structures; relevant areas include structural health monitoring [12] and nondestructive testing [13,14]. These special scenarios are usually stable (i.e., require less flexibility) and allow better performance at a higher cost.

    Table 1. Comparison Between Different Applications for FourierCam

    | Application | Prior Knowledge | Scenario | Coding Method |
    | --- | --- | --- | --- |
    | Video compression | × | Normal | Multifrequency; coded signals depend on exposure time |
    | Selective sampling (periodic motion video acquisition) | Motion period | Periodic | Multifrequency; coded signals depend on motion period |
    | Selective sampling (background subtraction) | × | Normal | Multifrequency; DC components are not included |
    | Selective sampling (object extraction) | Temporal spectrum profile of the objects of interest | Normal | Multifrequency; coded signals depend on prior knowledge |
    | Trajectory tracking | × | Normal | Single-frequency; coded signals depend on exposure time |

    It is worth mentioning that FourierCam is built to enhance the flexibility of information utilization under a given limited data throughput. First, owing to the low-frequency property of natural scenes, one can sample only the most significant low-frequency components to perform data compression during acquisition, using the frequency sampling flexibility of FourierCam. This temporal-frequency compression is similar to JPEG [15] compression based on spatial frequency, that is, storing the most significant information within a limited data capacity. In general, this is lossy compression, but it can also be lossless for some sparse scenes (such as periodic motion). Second, FourierCam directly obtains the temporal spectrum as a special data type with abundant physical information about the dynamic scene. Although using multiple DMD pixels and camera pixels to decode one frequency component brings a data cost, the phase-shift operation over multiple pixels also reduces background noise, so the quality of the data is increased.

    The temporal and spatial resolutions are the key parameters of FourierCam. The temporal resolution (the highest frequency component that can be acquired) is determined by the bandwidth of the modulator. In the present optical system, the PWM mode reduces the DMD refresh rate. Zhang et al. [16] used error diffusion dithering techniques to binarize the Fourier basis patterns in space, which can be adapted to the temporal domain to maintain the refresh rate of the DMD. In terms of spatial resolution, each Fourier coefficient requires 4 pixels for four-step phase-shifting. Although four-step phase-shifting offers better measurement performance, one can also utilize three-step phase-shifting [16] or two-step phase-shifting [17] for a higher spatial resolution. Furthermore, taking a closer look at the process, one can notice that the principle of FourierCam is similar to that of color cameras based on the Bayer color filter array (CFA) [18]: the CFA and FourierCam use different pixels to collect different wavelengths and different temporal Fourier coefficients in parallel, respectively. Therefore, the demosaicing algorithms of the CFA can be introduced into FourierCam to improve the spatial resolution [19,20]. Although a monochrome image detector is used in the experiments, combining FourierCam with a color image detector is straightforward; the coding structure of FourierCam only needs to be adjusted according to the distribution of the CFA. It is worth mentioning that in machine vision based on deep learning, training and inference on the temporal spectrum are feasible through complex-valued neural networks, without the need for image restoration as an intermediate step [21,22]. We believe that the temporal spectrum data format provided by FourierCam has the potential to be used in multimodal learning for high-level vision tasks like optical flow or event flow [23]. In addition, a more compact and lightweight design will help develop a commercial FourierCam. One can borrow the compact optical design of miniaturized DMD-based projectors, or integrate the modulator on the sensor chip, which is still challenging with current technology. In some applications with loose frame rate requirements, a commercial liquid crystal modulator can be used instead of the DMD to reduce costs. Beyond machine vision, we believe that the flexible temporal filter kernel design of FourierCam can play a role in other fields, for example, frequency-division multiplexing demodulation in space optical communication or extraction of specific signals in voice signal detection.

    APPENDIX A: CORRESPONDENCE BETWEEN DMD AND IMAGE SENSOR IN FourierCam

    In FourierCam, the most important thing is to adjust each mirror of the DMD so that it corresponds exactly to a pixel of the image sensor, such as a CCD or CMOS. Under the premise of complete correspondence, FourierCam can achieve high-precision decoding. However, since the pixels of the CCD and the mirrors of the DMD are very small, it is difficult to align them accurately. Fortunately, the CCD and DMD can be regarded as two gratings, so they can be aligned by observing the moiré fringes formed between them [24]. There are two kinds of errors: mismatch and misalignment. Mismatch means line spatial frequency disagreement, and misalignment means rotational disagreement. When the mirrors of the DMD and the pixels of the CCD do not correspond exactly, diverse moiré fringe patterns appear according to the mismatch and misalignment conditions. Figure 6 shows the experimental results when we adjust the pixel-to-pixel correspondence in FourierCam. Figure 6(a) shows the moiré fringe patterns when both mismatch and misalignment occur between the CCD pixels and the DMD. Adjusting the rotation angle of the DMD eliminates the misalignment, as shown in Fig. 6(b). Next, after adjusting the magnification of the lens, the moiré pattern no longer appears in FourierCam, as shown in Fig. 6(c). In the state of Fig. 6(c), the adjustment error is 0.02%, which means that a one-pixel offset occurs for every 5000 pixels. Therefore, high-precision correspondence between the DMD and the CCD is realized in FourierCam.


    Figure 6.Phase analysis of the moiré fringe pattern obtained by the phase-shifting moiré method. (a) There are two errors: mismatch and misalignment. (b) Only mismatch error. (c) FourierCam with high-precision correspondence.

    APPENDIX B: DETAILED DISCUSSION ABOUT FEATURES OF FourierCam

    Detection bandwidth: To measure a temporal signal with maximum frequency $f_{\max}$, the required minimum detection bandwidth of a traditional camera equals $f_{\max}$. For FourierCam acquiring $h$ Fourier components, the required minimum detection bandwidth is $f_{\max}/(2h)$ according to the frequency-domain sampling theorem (see Appendix C). For example, in the natural scene demonstration (toy car and panda in the manuscript), $f_{\max}$ is 80 Hz and eight Fourier components besides DC are obtained; thus, the required detection bandwidth of FourierCam is 5 Hz, while for a traditional camera it is 80 Hz.

    Data volume: Assuming a video is captured by a traditional camera with M frames and N pixels per frame, its data volume is M×N bytes (assuming 1 byte per pixel). FourierCam obtains h Fourier components of the same video, and the data volume is 2h×N bytes, since a complex Fourier coefficient needs twice the capacity of a real number. Generally, M is larger than 2h. For example, in the “running dog” video in Appendix D, M = 100, h = 16, and N = 1080×1080; thus, the data volumes for a traditional camera and FourierCam are 116.64 and 18.66 megabytes, respectively. By considering prior information about the object and applying selective sampling, the data volume can be further reduced.

    Floating-point operations (FLOPs) comparison between FFT and FourierCam: FLOPs count the standard floating-point additions and multiplications to evaluate the computational burden. To calculate the temporal spectrum of a video with M frames and N pixels per frame, the fast Fourier transform (FFT) needs $5MN\log_2 M$ FLOPs. In FourierCam, since the multiplication and summation operations of the Fourier transform are realized by optical coding, only $3MN$ FLOPs are required for the four-step phase-shifting operation. Therefore, the required FLOPs for temporal spectrum acquisition are reduced by $(5M\log_2 M - 3M) \times N$. For example, in the demonstration of periodic motion in application II in the paper, 3.9 GFLOPs are saved by FourierCam.
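    As a quick arithmetic check of these formulas, the sketch below plugs in the “running dog” numbers from Appendix D (M = 100, N = 1080 × 1080, the only example with all values stated; the 3.9 GFLOPs figure above refers to a different demonstration whose parameters are not listed here).

```python
import math

M = 100                               # frames
N = 1080 * 1080                       # pixels per frame

fft_flops = 5 * M * N * math.log2(M)  # full time-frequency pipeline
fouriercam_flops = 3 * M * N          # four-step phase-shifting arithmetic only
print(f"saved: {(fft_flops - fouriercam_flops) / 1e9:.2f} GFLOPs")  # ~3.52
```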

    Light throughput in FourierCam: In addition to the above advantages, light throughput plays an important role in high-speed photography and is worth discussing. Two types of high-speed cameras (normal high-speed-shutter and impulse coding cameras) are used for comparison. Impulse coding cameras turn on the pixels in a spatial block at a certain time to capture high-speed video [25,26]. Consider one coding group: the average light intensity at the coding group is $L$, the active area is $A$, the video has $N$ frames, and the entire duration is $T$, so the frame rate requirement for the capture device is $N/T$. For high-speed-shutter cameras, the whole area $A$ is active, and the light throughput of one frame is $L \times A \times T/N$; therefore, the light throughput of an $N$-frame video is $L \times A \times T$. For impulse coding cameras, the whole area $A$ is divided into $N$ exposure groups, with each group exposing sequentially. The light throughput per frame (exposure group) is $L \times (A/N) \times (T/N)$, and the light throughput of an $N$-frame video is $L \times (A/N) \times T$. For FourierCam, each coding group is divided into $p \times q \times N_{\text{phase}}$ smallest units ($N_{\text{phase}}$ is the number of phases, and in aperiodic scenes $p \times q = N/2$), and each unit is modulated by a sinusoidal signal during the whole exposure time of the image detector, so the light throughput of each unit is $L \times A \times T/(N \times N_{\text{phase}})$. Similar to the above temporal-domain sampling strategies, which superimpose all frames to calculate the light throughput, FourierCam should add all frequency components to calculate the video light throughput. Therefore, the light throughput of FourierCam is $L \times A \times T/(N \times N_{\text{phase}}) \times p \times q = L \times A \times T/(2 N_{\text{phase}})$. In summary, the light throughput of FourierCam is lower than that of high-speed-shutter cameras (even when $N_{\text{phase}} = 1$); this is introduced by the sinusoidal modulation. However, the light throughput of FourierCam does not depend on the number of frames $N$, while that of impulse coding cameras does, so as $N$ increases, the light throughput advantage of FourierCam over impulse coding cameras becomes more obvious. In principle, FourierCam needs at least two phases with a 180° shift. Fortunately, by using the light from both the ON and OFF reflection angles of the DMD and adding a second sensor, it is possible to complete the temporal spectrum acquisition with each sensor collecting only one phase. This means $N_{\text{phase}} = 1$, which realizes a light throughput competitive with high-speed-shutter cameras. The bookkeeping is sketched below.
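    The comparison reduces to three closed-form expressions; this sketch (arbitrary illustrative values) evaluates them directly:

```python
def throughput(L, A, T, N, N_phase):
    """Light throughput of an N-frame video for the three camera types,
    using the expressions derived above."""
    shutter = L * A * T                     # high-speed-shutter camera
    impulse = L * (A / N) * T               # impulse coding camera
    fouriercam = L * A * T / (2 * N_phase)  # FourierCam, independent of N
    return shutter, impulse, fouriercam

print(throughput(L=1.0, A=1.0, T=1.0, N=64, N_phase=4))
# As N grows, the impulse coding throughput shrinks while FourierCam's stays
# fixed; with both DMD output ports and a second sensor (N_phase = 1),
# FourierCam becomes competitive with the shutter camera.
```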

    APPENDIX C: FRAME RATE AND FREQUENCY DOMAIN SAMPLING IN FourierCam

    Traditional cameras can be regarded as performing temporal-domain sampling when capturing video, with the frame rate as the temporal sampling rate. Considering each pixel's temporal waveform, given the frame rate $f_s$, the highest frequency component that can be acquired is $f_{\max} = f_s/2$. Unlike the temporal-domain sampling of traditional cameras, FourierCam is based on frequency-domain sampling and directly acquires frequency components. When the highest frequency component it collects is $f_{\max}$, the equivalent frame rate of FourierCam is $2 f_{\max}$. In addition, the frequency-domain sampling interval $\Delta f$ of FourierCam must satisfy the frequency-domain sampling theorem to ensure that the reconstructed video does not alias in the time domain. The sampling interval is determined by the exposure time of the image detector, $t_{\text{expo}}$: $\Delta f \le 1/t_{\text{expo}}$. For example, if the exposure time of an image detector is 1 s, its native frame rate is 1 Hz. To raise the equivalent frame rate to 10 Hz, the frequency components to be acquired are 1 Hz, 2 Hz, 3 Hz, 4 Hz, and 5 Hz; the frequency interval is 1 Hz, which satisfies the frequency-domain sampling theorem. A small helper following this paragraph illustrates the rule.
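    The helper below (names are illustrative) turns the rule into code: given the exposure time and a target equivalent frame rate, it lists the coding frequencies, reproducing the 1–5 Hz example above.

```python
def coding_frequencies(t_expo: float, frame_rate: float):
    """Coding frequencies satisfying delta_f <= 1/t_expo for a target
    equivalent frame rate (= 2 * f_max)."""
    f_max = frame_rate / 2.0              # highest frequency to acquire
    delta_f = 1.0 / t_expo                # largest alias-free frequency spacing
    k_max = int(f_max / delta_f)
    return [k * delta_f for k in range(1, k_max + 1)]

print(coding_frequencies(t_expo=1.0, frame_rate=10.0))  # [1.0, 2.0, 3.0, 4.0, 5.0] Hz
```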

    APPENDIX D: QUANTITATIVE ANALYSIS ON THE PERFORMANCE OF FourierCam

    To quantitatively evaluate the reconstruction, we perform a simulation of FourierCam with a “running dog” video, which has 100 frames with a spatial resolution of 1080×1080 pixels. We obtain the temporal spectrum of the video at 16 frequencies (the number of acquired Fourier coefficients h = 16). Figure 7(a) compares the long-exposure capture and the FourierCam-encoded capture. The long exposure with low temporal resolution results in obvious motion blur, and the details of the object are lost, whereas the temporal spectrum contains the motion information needed to reconstruct the dynamic scene. In the reconstructed video, the SSIM (structural similarity index) remains stable, with an average of 0.9126 and a standard deviation of 0.0107 [shown in Fig. 7(b)]. In Fig. 7(c), we also display a visual comparison of three exemplar frames from the ground truth video and the FourierCam reconstruction. These results illustrate that FourierCam is able to reconstruct a clear video from only low-frequency coefficients.


    Figure 7.Simulation of FourierCam video reconstruction. (a) Long exposure capture with all frames directly accumulating together, corresponding to a slow camera and the FourierCam encoded capture. The insets show the zoom-in view of the areas pointed by the arrows. (b) In the reconstructed video with 16 Fourier coefficients, the SSIM of each frame keeps stable with an average of 0.9126 and a standard deviation of 0.0107. (c) Three exemplar frames from the ground truth and reconstructed video.


    Figure 8.Quantitative analysis on the performance of FourierCam. (a) Relation between number of acquired Fourier coefficients h and spatial resolution reduction L of FourierCam. (b) Comparison of reconstructed frames with different numbers of acquired Fourier coefficients, corresponding to point 1 to point 4 in (a).

    APPENDIX E: FOURIER DOMAIN PROPERTIES OF PERIODIC AND APERIODIC MOTION

    Consider the signal at a position where a periodic motion passes: it has periodic form in the time domain. The Fourier transform of a periodic signal with period $P$ contains energy only at frequencies that are integer multiples of the repetition frequency $1/P$; therefore, the periodic signal has a sparse representation in the Fourier domain. When the period becomes infinitely long, the periodic signal turns into an aperiodic signal with a single pulse, and its spectrum becomes continuous. Figure 9 provides a graphical illustration of the spectra of periodic and aperiodic signals.


    Figure 9.Fourier domain properties of periodic and aperiodic signals. The (a) periodic signal has a (b) sparse spectrum while the (c) aperiodic signal has a (d) continuous spectrum.

    APPENDIX F: TEMPORAL RESOLUTION OF OBJECT TRACKING IN FourierCam

    As the object moves, the temporal waveforms at different spatial positions have different temporal pulse positions, resulting in a phase shift in their temporal spectra. The phase-shift detection accuracy determines the temporal resolution of object tracking in FourierCam. The phase-shift accuracy is set by the DMD grayscale level and the exposure time of the image detector, so the temporal resolution is $t_{\text{expo}}/(\text{DMD grayscale levels})$. Since we use a DMD in PWM mode as the spatial light modulator in FourierCam, the light is digitally modulated with 8-bit grayscale. Therefore, during a single exposure $t_{\text{expo}}$, the temporal resolution of object tracking is $t_{\text{expo}}/256$.

    APPENDIX G: FOURIER DOMAIN PROPERTIES OF MOVING OBJECT

    Changes in both the texture and the speed of a moving object cause differences in the Fourier domain. As illustrated in Fig. 10(a), when a block with a sinusoidal fringe texture moves at a speed of $v$, the detected waveform at the red point is also sinusoidal. In Fig. 10(b), a block with a higher spatial frequency texture, also moving at speed $v$, corresponds to a higher frequency in the Fourier domain than in Fig. 10(a). By selectively acquiring a specific frequency range (e.g., around $2f_0$), we can extract a specific object [e.g., the one in Fig. 10(b)]. A change in moving speed also causes a difference in the spectrum [Fig. 10(c)], so we can likewise extract it from the one in Fig. 10(a). However, because of the joint effect of texture and speed, the spectra in Figs. 10(b) and 10(c) are quite similar. To distinguish these two objects, we can add more constraints, such as the length of the waveform, which is one of our future works.


    Figure 10.Illustration of Fourier domain properties of moving objects with different texture and speed. (a) Block with sinusoidal fringe texture moving at a speed of v. The temporal waveform of the red point is shown with its Fourier spectrum. (b) Block with higher spatial frequency texture, also moving at the speed of v. (c) Block identical to (a) but moving at a higher speed 2v.

    APPENDIX H: COMPARISON BETWEEN DIFFERENT APPLICATION FOR FourierCam

    The comparison between the different applications of FourierCam is shown in Table 1. In periodic compressive video reconstruction, a priori knowledge can be used to achieve higher compression ratios. It is also possible not to use prior knowledge, in which case the compression ratio is the same as in the aperiodic case.

    References

    [1] J. N. Mait, G. W. Euliss, R. A. Athale. Computational imaging. Adv. Opt. Photon., 10, 409-483(2018).

    [2] J. Liang, L. V. Wang. Single-shot ultrafast optical imaging. Optica, 5, 1113-1127(2018).

    [3] G. Gallego, T. Delbruck, G. M. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. Davison, J. Conradt, K. Daniilidis, D. Scaramuzza. Event-based vision: a survey. IEEE Trans. Pattern Anal. Mach. Intell.(2021).

    [4] Z. W. Wang, V. Vineet, F. Pittaluga, S. N. Sinha, O. Cossairt, S. B. Kang. Privacy-preserving action recognition using coded aperture videos. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 1-10(2019).

    [5] T. Ouni, W. Ayedi, M. Abid. New low complexity DCT based video compression method. International Conference on Telecommunications, 202-207(2009).

    [6] W. Wang, J. Yang, W. Gao. Modeling background and segmenting moving objects from compressed video. IEEE Trans. Circuits Syst. Video Technol., 18, 670-681(2008).

    [7] D.-M. Tsai, W.-Y. Chiu. Motion detection using Fourier image reconstruction. Pattern Recogn. Lett., 29, 2145-2155(2008).

    [8] T.-H. Oh, J.-Y. Lee, I. S. Kweon. Real-time motion detection based on discrete cosine transform. 19th IEEE International Conference on Image Processing, 2381-2384(2012).

    [9] K. Xu, M. Qin, F. Sun, Y. Wang, Y.-K. Chen, F. Ren. Learning in the frequency domain. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1740-1749(2020).

    [10] D. Doherty, G. Hewlett. 10.4: phased reset timing for improved digital micromirror device (DMD) brightness. SID Symp. Dig. Tech. Papers, 29, 125-128(1998).

    [11] S. Ojha, S. Sakhare. Image processing techniques for object tracking in video surveillance: a survey. International Conference on Pervasive Computing (ICPC), 1-6(2015).

    [12] I. Ishii, S. Takemoto, T. Takaki, M. Takamoto, K. Imon, K. Hirakawa. Real-time laryngoscopic measurements of vocal-fold vibrations. Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 6623-6626(2011).

    [13] A. Davis, M. Rubinstein, N. Wadhwa, G. J. Mysore, F. Durand, W. T. Freeman. The visual microphone: passive recovery of sound from video. ACM Trans. Graph., 33, 79(2014).

    [14] A. Davis, K. L. Bouman, J. G. Chen, M. Rubinstein, F. Durand, W. T. Freeman. Visual vibrometry: estimating material properties from small motion in video. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5335-5343(2015).

    [15] G. Wallace. The JPEG still picture compression standard. IEEE Trans. Consum. Electron., 38, xviii-xxxiv(1992).

    [16] Z. Zhang, X. Wang, G. Zheng, J. Zhong. Fast Fourier single-pixel imaging via binary illumination. Sci. Rep., 7, 12029(2017).

    [17] L. Bian, J. Suo, X. Hu, F. Chen, Q. Dai. Efficient single pixel imaging in Fourier space. J. Opt., 18, 085704(2016).

    [18] B. E. Bayer. Color imaging array. U.S. patent(1976).

    [19] H. S. Malvar, L.-W. He, R. Cutler. High-quality linear interpolation for demosaicing of Bayer-patterned color images. IEEE International Conference on Acoustics, Speech, and Signal Processing, iii-485-8(2004).

    [20] R. Ramanath, W. E. Snyder, G. L. Bilbro, W. A. Sander. Demosaicking methods for Bayer color arrays. J. Electron. Imaging, 11, 306-315(2002).

    [21] W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, Y. Chen. Compressing convolutional neural networks in the frequency domain. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1475-1484(2016).

    [22] H.-S. Choi, J.-H. Kim, J. Huh, A. Kim, J.-W. Ha, K. Lee. Phase-aware speech enhancement with deep complex U-Net. International Conference on Learning Representations, 1-20(2019).

    [23] T. Baltrušaitis, C. Ahuja, L.-P. Morency. Multimodal machine learning: a survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell., 41, 423-443(2019).

    [24] S. Ri, M. Fujigaki, T. Matui, Y. Morimoto. Accurate pixel-to-pixel correspondence adjustment in a digital micromirror device camera by using the phase-shifting Moiré method. Appl. Opt., 45, 6940-6946(2006).

    [25] G. Bub, M. Tecza, M. Helmes, P. Lee, P. Kohl. Temporal pixel multiplexing for simultaneous high-speed, high-resolution imaging. Nat. Methods, 7, 209-211(2010).

    [26] M. Gupta, A. Agrawal, A. Veeraraghavan, S. G. Narasimhan. Flexible voxels for motion-aware videography. Computer Vision—ECCV 2010, Lecture Notes in Computer Science, 6311, 100-114(2010).
