FourierCam: a camera for video spectrum acquisition in a single shot

Photonics Research, Vol. 9, Issue 5, 701 (2021)
Chengyang Hu1,2,†, Honghao Huang1,2,†, Minghua Chen1,2, Sigang Yang1,2, and Hongwei Chen1,2,*
Author Affiliations
  • 1Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
  • 2Beijing National Research Center for Information Science and Technology (BNRist), Beijing 100084, China
    DOI: 10.1364/PRJ.412491

    Abstract

    Novel camera architectures facilitate the development of machine vision. Instead of capturing frame sequences in the temporal domain as traditional video cameras do, FourierCam directly measures the pixel-wise temporal spectrum of the video in a single shot through optical coding. Compared to classic video cameras and the time-frequency transformation pipeline, this programmable frequency-domain sampling strategy offers an attractive combination of characteristics: low detection bandwidth, low computational burden, and low data volume. Based on the various temporal filter kernels designed for FourierCam, we demonstrate a series of machine vision functions, such as video compression, background subtraction, object extraction, and trajectory tracking.

    1. INTRODUCTION

    Humans observe the world in the space–time coordinate system, and traditional video cameras are based on the same principle. The video data format, a time series of image frames, is well understood by our eyes and is the basis of many years of research in machine vision. With the development of optics, focal plane optoelectronics, and post-detection algorithms, some novel video camera architectures have gradually emerged [1]. Single-shot ultrafast optical imaging systems observe transient events in physics and chemistry at an incredible rate of one billion frames per second (fps) [2]. The event camera, with high dynamic range, high temporal resolution, and low power consumption, asynchronously measures the brightness change, position, and sign of each pixel to generate event streams and is widely used in autonomous driving, robotics, security, and industrial automation [3]. A privacy-preserving camera based on a coded aperture has also been applied to action recognition [4]. Although the functions of these cameras are impressive, the essential sampling strategy is still to measure the reflected or transmitted light intensity of a scene in the temporal domain. In the lens system, pixels can be regarded as independent time channels, and the acquired signal is the temporal variation of light intensity at the corresponding position in the scene. It is well known that the frequency-domain features of a visual temporal signal are often more significant. For example, a natural scene video generally has high temporal redundancy, so most information of a temporal signal concentrates in low-frequency components, which is a premise of video compression [5]. The static background of the scene appears as a DC component in the frequency domain, which provides insights for background subtraction [6–8]. In deep learning, performing high-level vision tasks on spatial frequency-domain data brings better results [9]. By space–time duality, this strategy has the potential to be applied to temporal frequency-domain data as well. All of the above frequency characteristics imply that capturing video in the temporal frequency domain instead of the temporal domain can initiate a sampling revolution.

    In this paper, we propose a temporal-frequency-sampling video camera, FourierCam, a novel architecture that renews the basic sampling strategy. The concept of FourierCam is to perform pixel-wise optical coding on the scene video and directly obtain its temporal spectrum in a single shot. In contrast with traditional cameras, this framework of single-shot temporal spectrum acquisition has a lower detection bandwidth. Furthermore, the data volume can be reduced by analyzing the temporal spectrum features for efficient sampling. Since the temporal Fourier transform is performed in the optical system, the computational burden is lower than that of the time-frequency transformation pipeline (sampling–storing–transforming). Beyond these basic advantages, thanks to the clear physical meaning of the spectrum, a variety of temporal filter kernels can be designed to accomplish typical machine vision tasks. To demonstrate the capability of FourierCam, we present a series of applications covering video compression, background subtraction, object extraction, and trajectory tracking. These applications can be switched simply by adjusting the temporal filter kernels, without changing the system structure. As a flexible framework, FourierCam can be easily integrated with existing imaging systems and is suitable for imaging from the micro to the macro scale.

    2. PRINCIPLE OF FourierCam


    Figure 1.Overview of FourierCam. (a) Schematic and prototype of FourierCam. (b) Coding strategy of FourierCam. The real scene is coded by a spatial light modulator (DMD) and integrated during a single exposure of the image sensor. The DMD is spatially divided into coding groups (5×5 coding groups are shown here, marked as CG), and each CG contains multiple coding elements (4×4 coding elements are shown here, marked as CE) to extract the Fourier coefficients of the pixel temporal vector. The Fourier coefficients of different pixel temporal vectors form the temporal spectrum of the scene. (c) Three demonstrative applications of FourierCam: video compression, selective sampling, and trajectory tracking.

    In the experimental setup, the scene is imaged on a virtual plane through a camera lens (CHIOPT HC3505A). A relay lens (Thorlabs MAP10100100-A) transfers the image to the DMD (ViALUX V-9001, 2560×1600 resolution, 7.6 μm pitch size) for light amplitude modulation. The reflected light from the DMD is then focused onto an image sensor (FLIR GS3-U3-120S6M-C, 4242×2830 resolution, 3.1 μm pitch size) by a zoom lens (Utron VTL0714V). Because each DMD mirror is matched with 3×3 image sensor pixels, the effective resolution is one-third of the image sensor resolution in both the horizontal and the vertical directions (i.e., 1414×943).

    The principle of the proposed FourierCam system is to spatially split the scene into independent temporal channels and acquire the temporal spectrum through the corresponding CG for each channel. Every CG contains several CEs to obtain Fourier coefficients at the frequencies of interest. During one exposure time $t_{\text{expo}}$, the detected value $D_{jk}^{\varphi}$ in CE $k$ of CG $j$ is equivalent to an inner product of the pixel temporal vector $I_j(t)$ and the pixel temporal sampling vector $S_{jk}^{\varphi}(t)$:

$$D_{jk}^{\varphi} = \langle I_j(t), S_{jk}^{\varphi}(t) \rangle = \int_{t_{\text{expo}}} I_j(t)\,[A + B\cos(2\pi f_k t + \varphi)]\,\mathrm{d}t,$$

where $S_{jk}^{\varphi}(t)$ is the sinusoidal pixel temporal sampling vector with frequency $f_k$ and phase $\varphi$ in CE $k$ of CG $j$, and $A$ and $B$ denote the average intensity and the contrast of $S_{jk}^{\varphi}(t)$, respectively. The Fourier coefficient $F_{jk}$ at $f_k$ can be extracted by four-step phase-shifting as

$$2BC \times F_{jk} = \left(D_{jk}^{0} - D_{jk}^{\pi}\right) + i\left(D_{jk}^{\pi/2} - D_{jk}^{3\pi/2}\right),$$

where $C$ depends on the response of the image sensor. The DC term $A$ is canceled out simultaneously by the four-step phase-shifting.
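    To make the readout concrete, the following minimal sketch (Python/NumPy; all parameter values and names are illustrative assumptions, not the authors' implementation) simulates one coding element: a pixel waveform is multiplied by four phase-shifted sinusoidal codes, integrated over the exposure, and the four detector values are combined into the Fourier coefficient exactly as in the equation above.

```python
import numpy as np

t_expo = 1.0                       # exposure time (s); assumed value
n = 4096                           # fine time grid used to emulate the optics
t = np.linspace(0.0, t_expo, n, endpoint=False)
dt = t_expo / n

# An arbitrary pixel temporal vector I_j(t) that varies during the exposure.
I = 1.0 + 0.6 * np.cos(2 * np.pi * 3 * t + 0.7) + 0.3 * np.cos(2 * np.pi * 5 * t)

A, B = 0.5, 0.5                    # mean and contrast of the code (DMD range [0, 1])
f_k = 3.0                          # frequency of interest (Hz)

def detect(phase):
    """Integrated value D_jk^phi recorded by one coding element."""
    S = A + B * np.cos(2 * np.pi * f_k * t + phase)
    return np.sum(I * S) * dt      # long-exposure integration on the sensor

D0, Dpi = detect(0.0), detect(np.pi)
Dq, D3q = detect(np.pi / 2), detect(3 * np.pi / 2)

# 2BC * F_jk = (D^0 - D^pi) + i (D^{pi/2} - D^{3pi/2}); sensor gain C = 1 here.
F_jk = ((D0 - Dpi) + 1j * (Dq - D3q)) / (2 * B)

# Cross-check against the direct Fourier coefficient of I(t) at f_k.
F_ref = np.sum(I * np.exp(-1j * 2 * np.pi * f_k * t)) * dt
print(F_jk, F_ref)                 # agree up to discretization error
```

    Note how the constant term $A$ drops out of both differences, which is the cancellation of the DC term stated above.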

    Based on the aforementioned principle, the temporal spectrum of the scene can be easily obtained. As a novel camera architecture with a special data format, FourierCam has the following three advantages (see Appendix B for details).

    Low detection bandwidth: Since the image sensor only needs to detect the integration of the coded scene over the entire exposure time to obtain the temporal spectrum, the required detector bandwidth is much lower than the bandwidth of the scene variation.

    Low data volume: A natural scene has high temporal redundancy; i.e., most of its information concentrates in low-frequency components. Besides, some special scenes, such as periodic motions, have a narrow bandwidth in the temporal spectrum. FourierCam enables flexible design of the sampling frequencies of interest to cut down the temporal redundancy and reduce the data volume.

    Low computational burden: The multiplication and summation operations of Fourier transform are realized by optical coding and long exposure in FourierCam; thus, the temporal spectrum can be acquired with low computational burden.

    Here, we introduce three applications to demonstrate these advantages of FourierCam [illustrated in Fig. 1(c)]. The first application is video compression. We verify the temporal spectrum acquisition of FourierCam and demonstrate video compression by using the low-frequency-concentration property of natural scenes. The second application is selective sampling. We show that FourierCam is able to subtract the static background, as well as extract objects with a specific texture, motion period, or speed, by applying designed temporal filter kernels during sensing. The last application is trajectory tracking. The temporal phase reveals the time order of events, so FourierCam can be used to analyze the presence and trajectory of moving objects. These applications show that the temporal spectrum acquired by FourierCam, as a new format of visual information, provides physical features that assist and complete vision tasks.

    3. TEMPORAL SPECTRUM ACQUISITION: BASIC FUNCTION AND VIDEO COMPRESSION

    We first demonstrate the basic spectrum acquisition function of FourierCam. For ordinary aperiodic moving objects or naturally varying scenes, the energy in the temporal spectrum is mainly concentrated at low frequencies. We exploit this observation to record compressive video in the temporal domain by acquiring only the low-frequency Fourier coefficients with FourierCam.

    Using the above method, we obtain the Fourier coefficient $F_{jk}$ of $f_k$ in CG $j$. We can combine all Fourier coefficients in CG $j$ to form its temporal spectrum as

$$F_j = \{F_{jh}^{*}, F_{j(h-1)}^{*}, \ldots, F_{j(h-1)}, F_{jh}\}, \quad h = p \times q,$$

where $h$ ($= p \times q$) is the number of CEs in a CG, and $F_{jh}^{*}$ denotes the complex conjugate of $F_{jh}$. The pixel temporal vector $I_j(t)$ can be reconstructed by applying the inverse Fourier transform:

$$2BC \times R_j = \mathcal{F}^{-1}\{F_j\},$$

where $\mathcal{F}^{-1}$ denotes the inverse Fourier transform operator. The result of the inverse transform, $R_j$, is proportional to the pixel temporal vector $I_j(t)$ in CG $j$. By applying the same operation to all CGs, we can reconstruct the video of the scene.
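    The sketch below (a continuation of the earlier simulation style; the coefficient ordering convention is an assumption) mirrors the measured coefficients into a conjugate-symmetric spectrum and inverts it with an FFT, which is the digital analogue of the reconstruction equation.

```python
import numpy as np

def reconstruct_pixel(coeffs):
    """coeffs: h complex Fourier coefficients [F_0, ..., F_{h-1}] for one CG,
    ordered from DC upward. Returns R_j, proportional to I_j(t)."""
    h = len(coeffs)
    m = 2 * h - 1                              # reconstructed frame count
    spectrum = np.zeros(m, dtype=complex)
    spectrum[:h] = coeffs                      # DC and positive frequencies
    spectrum[h:] = np.conj(coeffs[1:][::-1])   # mirrored negative frequencies
    return np.fft.ifft(spectrum).real

# Round-trip check on a 9-frame pixel waveform: keeping DC plus the 4
# positive frequencies preserves all 9 real degrees of freedom.
signal = np.random.rand(9)
recovered = reconstruct_pixel(np.fft.fft(signal)[:5])
print(np.allclose(recovered, signal))          # True
# Keeping fewer low-frequency coefficients instead yields the low-pass,
# compressive reconstruction used in this section.
```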


    Figure 2.Capturing aperiodic motion video using FourierCam. (a) Illustration of experiment setup and coding pattern on DMD. Each CG contains nine CEs (3×3, ranging from 0 Hz to 80 Hz) to encode the scene. (b) A toy car is used as a target. Top left: static object as ground truth. Top right: coded data captured by FourierCam. Middle left: amplitude of temporal spectrum. Middle right: phase of temporal spectrum. Bottom row: zoom in of middle row. A white-dotted mesh splits into different CGs. (c) A rotating disk with a panda pattern is used as a target. Top left: static object as ground truth. Top right: coded data captured by FourierCam. Middle left: amplitude of temporal spectrum. Middle right: phase of temporal spectrum. Bottom row: zoom in of middle row. A white-dotted mesh splits into different CGs. (d) Three frames from the reconstructed videos of the two scenes in (b) and (c). A yellow-dotted line is shown as reference.

    The first demonstrative scene in this application is a toy car running through the field of view. A capture of the static toy car is shown in Fig. 2(b) (top left) as ground truth. The coded data acquired by FourierCam are shown in Fig. 2(b) (top right), in which the scene is blurred and features of the toy car cannot be visually distinguished. After decoding, the complex temporal spectrum of the scene can be extracted. The corresponding amplitude and phase are shown in Fig. 2(b) (middle row) with their zoomed-in views (bottom row). In addition to the translating toy car, a rotating object is also used for demonstration. This scene is a panda pattern on a rotating disk with an angular velocity of 20 rad/s. In Fig. 2(c), the static capture of the object (top left), the coded data (top right), and the amplitude and phase (middle row) are shown, respectively.

    To visually evaluate the correctness of the acquired temporal spectra, the videos of the two scenes are reconstructed using the inverse Fourier transform. Figure 2(d) displays three frames from the videos of the toy car (left column) and the rotating panda (right column). These results clearly show the states of the dynamic scenes at different times and indicate that FourierCam correctly acquires the temporal spectrum. As the single-shot detection data include the information of multiple frames (16 frames in this demonstration), FourierCam realizes 16× video compression. (See Appendix D for a numerical analysis of the video compression performance; the reconstructed toy car video is shown as an example in Visualization 1.)

    4. SELECTIVE SAMPLING: FLEXIBLE TEMPORAL FILTER KERNELS

    FourierCam provides the flexibility to design the combination of frequencies to be acquired, which we term temporal filter kernels in this paper. By considering priors of the scenes and objects, one can selectively sample the objects of interest. In this part, three scenarios are demonstrated: periodic motion video acquisition, static background subtraction, and object extraction based on speed and texture.

    Periodic motions are widespread in medicine, industry, and scientific research, e.g., heartbeats, rotating tool bits, and vibrations. Since a periodic signal contains energy only at DC, the fundamental frequency, and its harmonics, it has a very sparse representation in the Fourier domain (see Appendix E for details). Taking these temporal spectrum characteristics as prior information, we use FourierCam to selectively acquire several principal frequencies of the temporal spectrum, as sketched below.
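    The sparsity argument can be checked numerically; this minimal sketch (illustrative parameters) builds an exactly periodic pixel waveform and confirms that all of its energy sits in the harmonic bins, so coding only those frequencies loses nothing.

```python
import numpy as np

n = 1000                     # fine time samples across one exposure (t_expo = 1 s)
period = 250                 # samples per motion period (P = 0.25 s, assumed prior)
idx = np.arange(n)
I = ((idx % period) < 12).astype(float)   # pulse train: exactly periodic waveform

F = np.fft.fft(I) / n
bins = np.arange(n)
harmonics = (bins % (n // period)) == 0   # DC, 1/P, 2/P, ... (and mirrored bins)

energy_ratio = np.sum(np.abs(F[harmonics]) ** 2) / np.sum(np.abs(F) ** 2)
print(energy_ratio)          # 1.0: the periodic scene is fully captured by
                             # coding only at the harmonic frequencies
```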


    Figure 3.Capturing periodic motion video using FourierCam. (a) To capture a periodic motion with four frequencies, each CG contains four CEs (2×2) to encode the scene. (b) A rotating disk is used as target. Top left: static object as ground truth. Top right: the zoom-in view of the captured data with and without coding, corresponding to normal slow cameras and FourierCam, respectively. Ordinary slow cameras blur out the details of moving objects while coded structure in FourierCam capture provides sufficient information to reconstruct the video. Bottom: four frames from the reconstructed video. Red-dotted lines are shown in each frame to indicate the direction of the disk.

    Subtracting the background and extracting moving objects are important techniques for video surveillance and other video processing applications. In the frequency domain, the background is concentrated in the DC component; by filtering out the DC component, one can subtract the background and extract moving objects (a minimal sketch follows). Frequency-domain moving object extraction approaches [6–8] have been proposed, but they need to acquire the video first and then perform the Fourier transform, and thus suffer from relatively high computational cost and low efficiency. Thanks to its capability to directly acquire specific temporal spectral components in the optical domain, FourierCam overcomes these drawbacks. In addition to subtracting the background, preanalysis of the temporal spectrum profile of the objects of interest gives the prior needed to design coding patterns for FourierCam to realize specific object extraction.
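    A minimal sketch of this idea (simulated digitally here, whereas FourierCam performs the frequency selection optically) shows that removing only the DC bin removes the static background:

```python
import numpy as np

n_frames = 64
t = np.arange(n_frames)
background = 0.8                                  # static scene value at this pixel
foreground = 0.5 * (np.abs(t - 32) < 4)           # object passing by mid-exposure
I = background + foreground

F = np.fft.fft(I)
F[0] = 0.0                                        # the only change: drop the DC bin
I_no_bg = np.fft.ifft(F).real

# The reconstruction equals the foreground minus its own temporal mean;
# the static background is gone entirely.
print(np.allclose(I_no_bg, foreground - foreground.mean()))   # True
```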


    Figure 4.Object extraction by FourierCam. (a) Illustration of object extraction. The coding frequencies are based on the spectrum of the objects of interest. In this demonstration, the four rings on the disk are regarded as four objects of interest. Each ring only contains one frequency so that one CE is used in one CG. (b) Left: reference static scene with a disk and a poker card. The disk is rotating when capturing, and the four rings share the same rotating speed. Four right columns: FourierCam captured data for four rings extraction and corresponding results. For each extracted ring, other rings and static poker card are neglected. (c) Results for two identical rings rotating at different speed (1980 and 800 r/min, respectively). FourierCam enables extraction of a specific one out of these two rings.

    The results show that FourierCam enables background subtraction and object extraction based on temporal spectrum differences. Although only one frequency was used in the experiment, in principle multiple frequencies can be used to reconstruct more complex scenes, as long as the spectral difference is sufficiently obvious. It is worth noting that in some special cases objects with different textures and speeds may have the same spectral features, making FourierCam fail to distinguish them (see Appendix G for details).

    5. TEMPORAL PHASE: TRAJECTORY TRACKING

    Object detection and trajectory tracking of fast-moving objects have important applications in various fields. In general, object detection determines the presence of an object, and object tracking acquires the spatial–temporal coordinates of a moving object. For the temporal waveform of a pixel that the object passes, the moving object takes the form of a pulse at a specific time. As the object moves, the temporal waveforms at different spatial positions have different temporal pulse positions, resulting in a phase shift in their temporal spectra. Since the Fourier transform is a global-to-point transformation, one can extract the presence and position of the pulse in the temporal domain from the amplitude and phase of a single Fourier coefficient. From this perspective, one can use FourierCam to determine the presence of a moving object and/or simultaneously acquire its spatial trajectory and temporal position.

    To detect and track the moving object, only one frequency is needed to encode the scene. In this case, we let $p = q = 1$ and $f = f_0 = 1/t_{\text{expo}}$. Thus, $f_0$ is the lowest resolvable frequency, and its Fourier coefficient $F_{j0}$ provides sufficient knowledge of the presence and/or motion of the object. The amplitude $A_{j0}$ of $F_{j0}$ is $A_{j0} = \mathrm{abs}(F_{j0})$, where $\mathrm{abs}(\cdot)$ denotes the absolute value. Since a static scene does not contain the $f_0$ component in the temporal spectrum, moving object detection can be achieved by applying a threshold on $A_{j0}$: an $A_{j0}$ larger than the threshold indicates the presence of a moving object.

    For moving object tracking, since the long exposure already gives the trace of the object, the phase $P_j$ of $F_{j0}$ is utilized to further extract the temporal information: $P_j = \arg(F_{j0})$, where $\arg(\cdot)$ denotes the argument operation. A temporal waveform with a displacement of $t_j$ in the temporal domain results in a linear phase shift of $2\pi f_0 t_j$ in the temporal spectrum:

$$I_j(t - t_j) = \mathcal{F}^{-1}\{F_{j0} \times \exp(-i 2\pi f_0 t_j)\}.$$

    Therefore, the temporal displacement can be derived through

$$t_j = t_{\text{expo}} \times \frac{P_j}{2\pi}.$$

    By applying the same operation to all CGs, we can extract the temporal information across the scene and acquire the spatial–temporal coordinates of a moving object.
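    The following sketch (assumed parameters and Fourier conventions) illustrates the full recipe for one pixel: an object crossing at time $t_j$ produces a pulse, the coefficient at $f_0 = 1/t_{\text{expo}}$ is computed, its amplitude flags the detection, and its phase is inverted to recover $t_j$.

```python
import numpy as np

t_expo = 1.0
n = 2048
t = np.linspace(0.0, t_expo, n, endpoint=False)
f0 = 1.0 / t_expo                      # lowest resolvable frequency

t_j = 0.37                             # true crossing time of the object
I = np.exp(-(((t - t_j) / 0.01) ** 2)) # short pulse in the pixel waveform

F_j0 = np.sum(I * np.exp(-1j * 2 * np.pi * f0 * t)) * (t_expo / n)

A_j0 = np.abs(F_j0)                    # detection: threshold on the amplitude
P_j = np.angle(F_j0)                   # tracking: phase encodes the pulse time
# With the e^{-i 2 pi f t} convention used here, P_j = -2 pi f0 t_j; the
# paper's t_j = t_expo * P_j / (2 pi) assumes the opposite phase convention.
t_est = (-P_j / (2 * np.pi * f0)) % t_expo
print(A_j0 > 1e-3, t_est)              # True, ~0.37
```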


    Figure 5.Moving object detection and tracking by FourierCam. (a) Only one frequency is needed to encode the scene for moving object detection and tracking. The period of sinusoidal coding signal is equal to the exposure time. Thus, only one CE is contained in each CG. (b) Coded data captured by FourierCam and tracking results. Left column: characters ‘T’, ’H’, ‘U’, ‘EE’ sequentially displayed by a screen with a 0.25 s duration for each. The color indicates the distribution of appearing time. Middle column: results for a displayed spot moving along a heart-shaped trajectory. Right column: results for two spots moving in circular trajectories with different radii. The spots are printed on a rotating disk driven by a motor.

    6. DISCUSSION AND CONCLUSION

    The main achievement of this work is the implementation of a high-quality temporal spectrum vision sensor, a concrete step toward a novel video camera architecture with low detection bandwidth, low computational burden, and low data volume. In the experiments, we demonstrate the advantages of FourierCam in machine vision applications such as video compression, background subtraction, object extraction, and trajectory tracking. Among these applications, prior knowledge is not required for aperiodic video compression, background subtraction, and trajectory tracking (see Table 1 in Appendix H for details). These applications cover the most common scenarios and can be integrated with existing machine vision systems, especially in autonomous driving and security [11]. Introducing prior knowledge makes FourierCam lose some flexibility but gain better performance. Applications that require prior knowledge (periodic video compression and specific object extraction) suit special scenarios (e.g., modal analysis of vibrations). Several engineering disciplines rely on modal analysis of vibrations to learn about the physical properties of structures; relevant areas include structural health monitoring [12] and nondestructive testing [13,14]. These special scenarios are usually stable (i.e., require less flexibility) and allow better performance at a higher cost.

    Table 1. Comparison Between Different Applications for FourierCam

    | Application | Prior Knowledge | Scenario | Coding Method |
    | --- | --- | --- | --- |
    | Video compression | × | Normal | Multifrequency; coded signals depend on exposure time |
    | Selective sampling (periodic motion video acquisition) | Motion period | Periodic | Multifrequency; coded signals depend on motion period |
    | Selective sampling (background subtraction) | × | Normal | Multifrequency; DC components are not included |
    | Selective sampling (object extraction) | Temporal spectrum profile of the objects of interest | Normal | Multifrequency; coded signals depend on prior knowledge |
    | Trajectory tracking | × | Normal | Single-frequency; coded signals depend on exposure time |

    It is worth mentioning that FourierCam is built to enhance the flexibility of information utilization under a given limited data throughput. First, owing to the low-frequency property of natural scenes, one can sample only the most significant low-frequency components to perform data compression during acquisition, using the frequency sampling flexibility of FourierCam. This temporal-frequency compression is similar to JPEG [15] compression based on spatial frequency, that is, storing the most significant information within a limited data capacity. In general, this is lossy compression, but it can also be lossless for some sparse scenes (such as periodic motion). Second, FourierCam directly obtains the temporal spectrum as a special data type with abundant physical information about the dynamic scene. Although using multiple DMD pixels and camera pixels to decode one frequency component brings a data cost, the phase-shift operation over multiple pixels also reduces background noise, so the quality of the data is increased.

    The temporal and spatial resolutions are the key parameters of FourierCam. The temporal resolution (the highest frequency component that can be acquired) is determined by the bandwidth of the modulator. In the present optical system, the PWM mode reduces the DMD refresh rate. Zhang et al. [16] used error diffusion dithering techniques to binarize the Fourier basis patterns in space, which can be adapted to the temporal domain to maintain the refresh rate of the DMD. In terms of spatial resolution, each Fourier coefficient requires 4 pixels for four-step phase-shifting. Although four-step phase-shifting offers better measurement performance, one can also utilize three-step phase-shifting [16] or two-step phase-shifting [17] for a higher spatial resolution. Furthermore, taking a closer look at the process, one can notice that the principle of FourierCam is similar to that of color cameras based on the Bayer color filter array (CFA) [18]: the CFA and FourierCam use different pixels to collect different wavelengths and different temporal Fourier coefficients in parallel, respectively. Therefore, the demosaicing algorithms of the CFA can be introduced into FourierCam to improve the spatial resolution [19,20]. Although a monochrome image detector is used in the experiments, combining FourierCam with a color image detector is straightforward; the coding structure of FourierCam only needs to be adjusted according to the distribution of the CFA. It is worth mentioning that in machine vision based on deep learning, training and inference on the temporal spectrum are feasible through complex-valued neural networks, without the need for image restoration as an intermediate step [21,22]. We believe that the temporal spectrum data format provided by FourierCam has the potential to be used in multimodal learning for high-level vision tasks like optical flow or event flow [23]. In addition, a more compact and lightweight design will help develop a commercial FourierCam. One can borrow the compact optical design of miniaturized DMD-based projectors, or integrate the modulator on the sensor chip, which is still challenging with current technology. In some applications with loose frame rate requirements, a commercial liquid crystal modulator can be used instead of the DMD to reduce costs. Beyond machine vision, we believe that the flexible temporal filter kernel design of FourierCam can play a role in other fields, for example, frequency-division multiplexing demodulation in space optical communication or extraction of specific signals in voice signal detection.

    APPENDIX A: CORRESPONDENCE BETWEEN DMD AND IMAGE SENSOR IN FourierCam

    In FourierCam, the most important thing is to adjust each mirror of the DMD so that it corresponds exactly to a pixel of the image sensor, such as a CCD or CMOS. Under the premise of complete correspondence, FourierCam can achieve high-precision decoding. However, since the pixels of the CCD and the mirrors of the DMD are very small, it is difficult to align them accurately. Fortunately, the CCD and DMD can be regarded as two gratings, so they can be aligned by observing the moiré fringes formed between them [24]. There are two kinds of errors: mismatch and misalignment. Mismatch means line spatial frequency disagreement, and misalignment means rotational disagreement. When the mirrors of the DMD and the pixels of the CCD do not correspond exactly, diverse moiré fringe patterns appear according to the mismatch and misalignment conditions. Figure 6 shows the experimental results when we adjust the pixel-to-pixel correspondence in FourierCam. Figure 6(a) shows the moiré fringe patterns when both mismatch and misalignment occur between the CCD pixels and the DMD. Adjusting the rotation angle of the DMD eliminates the misalignment, as shown in Fig. 6(b). Next, after adjusting the magnification of the lens, the moiré pattern no longer appears in FourierCam, as shown in Fig. 6(c). In the state of Fig. 6(c), the adjustment error is 0.02%, which means that a one-pixel offset occurs for every 5000 pixels. Therefore, high-precision correspondence between the DMD and the CCD is realized in FourierCam.


    Figure 6.Phase analysis of the moiré fringe pattern obtained by the phase-shifting moiré method. (a) There are two errors: mismatch and misalignment. (b) Only mismatch error. (c) FourierCam with high-precision correspondence.

    APPENDIX B: DETAILED DISCUSSION ABOUT FEATURES OF FourierCam

    Detection bandwidth: To measure a temporal signal with maximum frequency $f_{\max}$, the required minimum detection bandwidth of a traditional camera equals $f_{\max}$. For FourierCam acquiring $h$ Fourier components, the required minimum detection bandwidth is $f_{\max}/(2h)$ according to the frequency-domain sampling theorem (see Appendix C). For example, in the natural scene demonstration (toy car and panda in the manuscript), $f_{\max}$ is 80 Hz and eight Fourier components besides DC are obtained; thus, the required detection bandwidth of FourierCam is 5 Hz, while for a traditional camera it is 80 Hz.

    Data volume: Assuming a video is captured by a traditional camera with M frames and N pixels per frame, its data volume is M×N bytes (assuming 1 byte per pixel). FourierCam obtains h Fourier components of the same video, and the data volume is 2h×N bytes, since a complex Fourier coefficient needs twice the capacity of a real number. Generally, M is larger than 2h. For example, in the “running dog” video in Appendix D, M = 100, h = 16, and N = 1080×1080; thus, the data volumes for a traditional camera and FourierCam are 116.64 and 18.66 megabytes, respectively. By considering prior information about the object and applying selective sampling, the data volume can be further reduced.

    Floating-point operations (FLOPs) comparison between FFT and FourierCam: FLOPs count the standard floating-point additions and multiplications to evaluate the computational burden. To calculate the temporal spectrum of a video with M frames and N pixels per frame, the fast Fourier transform (FFT) needs $5MN\log_2 M$ FLOPs. In FourierCam, since the multiplication and summation operations of the Fourier transform are realized by optical coding, only $3MN$ FLOPs are required for the four-step phase-shifting operation. Therefore, the required FLOPs for temporal spectrum acquisition are reduced by $(5M\log_2 M - 3M) \times N$. For example, in the demonstration of periodic motion in application II in the paper, 3.9 GFLOPs are saved by FourierCam.
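    As a quick arithmetic check of these formulas, the sketch below plugs in the “running dog” numbers from Appendix D (M = 100, N = 1080 × 1080, the only example with all values stated; the 3.9 GFLOPs figure above refers to a different demonstration whose parameters are not listed here).

```python
import math

M = 100                               # frames
N = 1080 * 1080                       # pixels per frame

fft_flops = 5 * M * N * math.log2(M)  # full time-frequency pipeline
fouriercam_flops = 3 * M * N          # four-step phase-shifting arithmetic only
print(f"saved: {(fft_flops - fouriercam_flops) / 1e9:.2f} GFLOPs")  # ~3.52
```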

    Light throughput in FourierCam: In addition to the above advantages, light throughput plays an important role in high-speed photography and is worth discussing. Two types of high-speed cameras (normal high-speed-shutter and impulse coding cameras) are used for comparison. Impulse coding cameras turn on the pixels in a spatial block at a certain time to capture high-speed video [25,26]. Consider one coding group: the average light intensity at the coding group is $L$, the active area is $A$, the video has $N$ frames, and the entire duration is $T$, so the frame rate requirement for the capture device is $N/T$. For high-speed-shutter cameras, the whole area $A$ is active, and the light throughput of one frame is $L \times A \times T/N$; therefore, the light throughput of an $N$-frame video is $L \times A \times T$. For impulse coding cameras, the whole area $A$ is divided into $N$ exposure groups, with each group exposing sequentially. The light throughput per frame (exposure group) is $L \times (A/N) \times (T/N)$, and the light throughput of an $N$-frame video is $L \times (A/N) \times T$. For FourierCam, each coding group is divided into $p \times q \times N_{\text{phase}}$ smallest units ($N_{\text{phase}}$ is the number of phases, and in aperiodic scenes $p \times q = N/2$), and each unit is modulated by a sinusoidal signal during the whole exposure time of the image detector, so the light throughput of each unit is $L \times A \times T/(N \times N_{\text{phase}})$. Similar to the above temporal-domain sampling strategies, which superimpose all frames to calculate the light throughput, FourierCam should add all frequency components to calculate the video light throughput. Therefore, the light throughput of FourierCam is $L \times A \times T/(N \times N_{\text{phase}}) \times p \times q = L \times A \times T/(2 N_{\text{phase}})$. In summary, the light throughput of FourierCam is lower than that of high-speed-shutter cameras (even when $N_{\text{phase}} = 1$); this is introduced by the sinusoidal modulation. However, the light throughput of FourierCam does not depend on the number of frames $N$, while that of impulse coding cameras does, so as $N$ increases, the light throughput advantage of FourierCam over impulse coding cameras becomes more obvious. In principle, FourierCam needs at least two phases with a 180° shift. Fortunately, by using the light from both the ON and OFF reflection angles of the DMD and adding a second sensor, it is possible to complete the temporal spectrum acquisition with each sensor collecting only one phase. This means $N_{\text{phase}} = 1$, which realizes a light throughput competitive with high-speed-shutter cameras. The bookkeeping is sketched below.
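    The comparison reduces to three closed-form expressions; this sketch (arbitrary illustrative values) evaluates them directly:

```python
def throughput(L, A, T, N, N_phase):
    """Light throughput of an N-frame video for the three camera types,
    using the expressions derived above."""
    shutter = L * A * T                     # high-speed-shutter camera
    impulse = L * (A / N) * T               # impulse coding camera
    fouriercam = L * A * T / (2 * N_phase)  # FourierCam, independent of N
    return shutter, impulse, fouriercam

print(throughput(L=1.0, A=1.0, T=1.0, N=64, N_phase=4))
# As N grows, the impulse coding throughput shrinks while FourierCam's stays
# fixed; with both DMD output ports and a second sensor (N_phase = 1),
# FourierCam becomes competitive with the shutter camera.
```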

    APPENDIX C: FRAME RATE AND FREQUENCY DOMAIN SAMPLING IN FourierCam

    Traditional cameras can be regarded as performing temporal-domain sampling when capturing video, with the frame rate as the temporal sampling rate. Considering each pixel's temporal waveform, given the frame rate $f_s$, the highest frequency component that can be acquired is $f_{\max} = f_s/2$. Unlike the temporal-domain sampling of traditional cameras, FourierCam is based on frequency-domain sampling and directly acquires frequency components. When the highest frequency component it collects is $f_{\max}$, the equivalent frame rate of FourierCam is $2 f_{\max}$. In addition, the frequency-domain sampling interval $\Delta f$ of FourierCam must satisfy the frequency-domain sampling theorem to ensure that the reconstructed video does not alias in the time domain. The sampling interval is determined by the exposure time of the image detector, $t_{\text{expo}}$: $\Delta f \le 1/t_{\text{expo}}$. For example, if the exposure time of an image detector is 1 s, its native frame rate is 1 Hz. To raise the equivalent frame rate to 10 Hz, the frequency components to be acquired are 1 Hz, 2 Hz, 3 Hz, 4 Hz, and 5 Hz; the frequency interval is 1 Hz, which satisfies the frequency-domain sampling theorem. A small helper following this paragraph illustrates the rule.
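    The helper below (names are illustrative) turns the rule into code: given the exposure time and a target equivalent frame rate, it lists the coding frequencies, reproducing the 1–5 Hz example above.

```python
def coding_frequencies(t_expo: float, frame_rate: float):
    """Coding frequencies satisfying delta_f <= 1/t_expo for a target
    equivalent frame rate (= 2 * f_max)."""
    f_max = frame_rate / 2.0              # highest frequency to acquire
    delta_f = 1.0 / t_expo                # largest alias-free frequency spacing
    k_max = int(f_max / delta_f)
    return [k * delta_f for k in range(1, k_max + 1)]

print(coding_frequencies(t_expo=1.0, frame_rate=10.0))  # [1.0, 2.0, 3.0, 4.0, 5.0] Hz
```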

    APPENDIX D: QUANTITATIVE ANALYSIS ON THE PERFORMANCE OF FourierCam

    To quantitatively evaluate the reconstruction, we perform a simulation of FourierCam with a “running dog” video, which has 100 frames with a spatial resolution of 1080×1080 pixels. We obtain the temporal spectrum of the video at 16 frequencies (the number of acquired Fourier coefficients h = 16). Figure 7(a) compares the long-exposure capture and the FourierCam-encoded capture. The long exposure with low temporal resolution results in obvious motion blur, and the details of the object are lost, whereas the temporal spectrum contains the motion information needed to reconstruct the dynamic scene. In the reconstructed video, the SSIM (structural similarity index) remains stable, with an average of 0.9126 and a standard deviation of 0.0107 [shown in Fig. 7(b)]. In Fig. 7(c), we also display a visual comparison of three exemplar frames from the ground truth video and the FourierCam reconstruction. These results illustrate that FourierCam is able to reconstruct a clear video from only low-frequency coefficients.


    Figure 7.Simulation of FourierCam video reconstruction. (a) Long exposure capture with all frames directly accumulating together, corresponding to a slow camera and the FourierCam encoded capture. The insets show the zoom-in view of the areas pointed by the arrows. (b) In the reconstructed video with 16 Fourier coefficients, the SSIM of each frame keeps stable with an average of 0.9126 and a standard deviation of 0.0107. (c) Three exemplar frames from the ground truth and reconstructed video.


    Figure 8.Quantitative analysis on the performance of FourierCam. (a) Relation between number of acquired Fourier coefficients h and spatial resolution reduction L of FourierCam. (b) Comparison of reconstructed frames with different numbers of acquired Fourier coefficients, corresponding to point 1 to point 4 in (a).

    APPENDIX E: FOURIER DOMAIN PROPERTIES OF PERIODIC AND APERIODIC MOTION

    Consider the signal at a position where a periodic motion passes: it has periodic form in the time domain. The Fourier transform of a periodic signal with period $P$ contains energy only at frequencies that are integer multiples of the repetition frequency $1/P$; therefore, the periodic signal has a sparse representation in the Fourier domain. When the period becomes infinitely long, the periodic signal turns into an aperiodic signal with a single pulse, and its spectrum becomes continuous. Figure 9 provides a graphical illustration of the spectra of periodic and aperiodic signals.


    Figure 9.Fourier domain properties of periodic and aperiodic signals. The (a) periodic signal has a (b) sparse spectrum while the (c) aperiodic signal has a (d) continuous spectrum.

    APPENDIX F: TEMPORAL RESOLUTION OF OBJECT TRACKING IN FourierCam

    As the object moves, the temporal waveforms at different spatial positions have different temporal pulse positions, resulting in a phase shift in their temporal spectra. The phase-shift detection accuracy determines the temporal resolution of object tracking in FourierCam. The phase-shift accuracy is set by the DMD grayscale level and the exposure time of the image detector, so the temporal resolution is $t_{\text{expo}}/(\text{DMD grayscale levels})$. Since we use a DMD in PWM mode as the spatial light modulator in FourierCam, the light is digitally modulated with 8-bit grayscale. Therefore, during a single exposure $t_{\text{expo}}$, the temporal resolution of object tracking is $t_{\text{expo}}/256$.

    APPENDIX G: FOURIER DOMAIN PROPERTIES OF MOVING OBJECT

    Changes in both the texture and the speed of a moving object cause differences in the Fourier domain. As illustrated in Fig. 10(a), when a block with a sinusoidal fringe texture moves at a speed of $v$, the detected waveform at the red point is also sinusoidal. In Fig. 10(b), a block with a higher spatial frequency texture, also moving at speed $v$, corresponds to a higher frequency in the Fourier domain than in Fig. 10(a). By selectively acquiring a specific frequency range (e.g., around $2f_0$), we can extract a specific object [e.g., the one in Fig. 10(b)]. A change in moving speed also causes a difference in the spectrum [Fig. 10(c)], so we can likewise extract it from the one in Fig. 10(a). However, because of the joint effect of texture and speed, the spectra in Figs. 10(b) and 10(c) are quite similar. To distinguish these two objects, we can add more constraints, such as the length of the waveform, which is one of our future works.


    Figure 10.Illustration of Fourier domain properties of moving objects with different texture and speed. (a) Block with sinusoidal fringe texture moving at a speed of v. The temporal waveform of the red point is shown with its Fourier spectrum. (b) Block with higher spatial frequency texture, also moving at the speed of v. (c) Block identical to (a) but moving at a higher speed 2v.

    APPENDIX H: COMPARISON BETWEEN DIFFERENT APPLICATION FOR FourierCam

    The comparison between the different applications of FourierCam is shown in Table 1. In periodic compressive video reconstruction, a priori knowledge can be used to achieve higher compression ratios. It is also possible not to use prior knowledge, in which case the compression ratio is the same as in the aperiodic case.

    References

    [1] J. N. Mait, G. W. Euliss, R. A. Athale. Computational imaging. Adv. Opt. Photon., 10, 409-483(2018).

    [2] J. Liang, L. V. Wang. Single-shot ultrafast optical imaging. Optica, 5, 1113-1127(2018).

    [3] G. Gallego, T. Delbruck, G. M. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. Davison, J. Conradt, K. Daniilidis, D. Scaramuzza. Event-based vision: a survey. IEEE Trans. Pattern Anal. Mach. Intell.(2021).

    [4] Z. W. Wang, V. Vineet, F. Pittaluga, S. N. Sinha, O. Cossairt, S. B. Kang. Privacy-preserving action recognition using coded aperture videos. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 1-10(2019).

    [5] T. Ouni, W. Ayedi, M. Abid. New low complexity DCT based video compression method. International Conference on Telecommunications, 202-207(2009).

    [6] W. Wang, J. Yang, W. Gao. Modeling background and segmenting moving objects from compressed video. IEEE Trans. Circuits Syst. Video Technol., 18, 670-681(2008).

    [7] D.-M. Tsai, W.-Y. Chiu. Motion detection using Fourier image reconstruction. Pattern Recogn. Lett., 29, 2145-2155(2008).

    [8] T.-H. Oh, J.-Y. Lee, I. S. Kweon. Real-time motion detection based on discrete cosine transform. 19th IEEE International Conference on Image Processing, 2381-2384(2012).

    [9] K. Xu, M. Qin, F. Sun, Y. Wang, Y.-K. Chen, F. Ren. Learning in the frequency domain. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1740-1749(2020).

    [10] D. Doherty, G. Hewlett. 10.4: phased reset timing for improved digital micromirror device (DMD) brightness. SID Symp. Dig. Tech. Papers, 29, 125-128(1998).

    [11] S. Ojha, S. Sakhare. Image processing techniques for object tracking in video surveillance: a survey. International Conference on Pervasive Computing (ICPC), 1-6(2015).

    [12] I. Ishii, S. Takemoto, T. Takaki, M. Takamoto, K. Imon, K. Hirakawa. Real-time laryngoscopic measurements of vocal-fold vibrations. Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 6623-6626(2011).

    [13] A. Davis, M. Rubinstein, N. Wadhwa, G. J. Mysore, F. Durand, W. T. Freeman. The visual microphone: passive recovery of sound from video. ACM Trans. Graph., 33, 79(2014).

    [14] A. Davis, K. L. Bouman, J. G. Chen, M. Rubinstein, F. Durand, W. T. Freeman. Visual vibrometry: estimating material properties from small motion in video. IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5335-5343(2015).

    [15] G. Wallace. The JPEG still picture compression standard. IEEE Trans. Consum. Electron., 38, xviii-xxxiv(1992).

    [16] Z. Zhang, X. Wang, G. Zheng, J. Zhong. Fast Fourier single-pixel imaging via binary illumination. Sci. Rep., 7, 12029(2017).

    [17] L. Bian, J. Suo, X. Hu, F. Chen, Q. Dai. Efficient single pixel imaging in Fourier space. J. Opt., 18, 085704(2016).

    [18] B. E. Bayer. Color imaging array. U.S. patent(1976).

    [19] H. S. Malvar, L.-W. He, R. Cutler. High-quality linear interpolation for demosaicing of Bayer-patterned color images. IEEE International Conference on Acoustics, Speech, and Signal Processing, iii-485-8(2004).

    [20] R. Ramanath, W. E. Snyder, G. L. Bilbro, W. A. Sander. Demosaicking methods for Bayer color arrays. J. Electron. Imaging, 11, 306-315(2002).

    [21] W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, Y. Chen. Compressing convolutional neural networks in the frequency domain. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1475-1484(2016).

    [22] H.-S. Choi, J.-H. Kim, J. Huh, A. Kim, J.-W. Ha, K. Lee. Phase-aware speech enhancement with deep complex U-Net. International Conference on Learning Representations, 1-20(2019).

    [23] T. Baltrušaitis, C. Ahuja, L.-P. Morency. Multimodal machine learning: a survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell., 41, 423-443(2019).

    [24] S. Ri, M. Fujigaki, T. Matui, Y. Morimoto. Accurate pixel-to-pixel correspondence adjustment in a digital micromirror device camera by using the phase-shifting Moiré method. Appl. Opt., 45, 6940-6946(2006).

    [25] G. Bub, M. Tecza, M. Helmes, P. Lee, P. Kohl. Temporal pixel multiplexing for simultaneous high-speed, high-resolution imaging. Nat. Methods, 7, 209-211(2010).

    [26] M. Gupta, A. Agrawal, A. Veeraraghavan, S. G. Narasimhan. Flexible voxels for motion-aware videography. Computer Vision—ECCV 2010, Lecture Notes in Computer Science, 6311, 100-114(2010).
