• Photonics Research
  • Vol. 11, Issue 2, 299 (2023)
Guoqing Ma1,2, Junjie Yu1,2,3, Rongwei Zhu1,2, and Changhe Zhou1,2,*
Author Affiliations
  • 1Laboratory of Information Optics and Optoelectronic Technology, Shanghai Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Shanghai 201800, China
  • 2Center of Materials Science and Optoelectronics Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
  • 3e-mail: Junjiey@siom.ac.cn
    DOI: 10.1364/PRJ.472741
    Guoqing Ma, Junjie Yu, Rongwei Zhu, Changhe Zhou. Optical multi-imaging–casting accelerator for fully parallel universal convolution computing[J]. Photonics Research, 2023, 11(2): 299

    Abstract

    Recently, optical computing has emerged as a potential solution to computationally heavy convolution, aiming to accelerate various large-scale science and engineering tasks. Based on an optical multi-imaging–casting architecture, we propose a paradigm for a universal optical convolutional accelerator with truly massive parallelism and high precision. A two-dimensional Dammann grating is the key element for generating multiple displaced images of the kernel, which is the core process for kernel sliding over the convolved matrix in this optical convolutional architecture. Our experimental results indicate that the computing accuracy is typically about 8 bits, and this accuracy could be improved further if high-contrast modulators are used. Moreover, a hybrid analog–digital coding method is demonstrated to improve computing accuracy. Additionally, a convolutional neural network for the standard MNIST dataset is demonstrated, with recognition accuracy for inference reaching 97.3%. Since this architecture can function under incoherent light illumination, this scheme will provide opportunities for handling white-light images directly from lenses without photoelectric conversion, in addition to serving as a convolutional accelerator.

    1. INTRODUCTION

    A convolutional neural network (CNN), as "convolutional" implies, involves extensive convolution operations among neighboring layers, followed by batch normalization and nonlinear activation for the expected performance [1–3]. Remarkably, these massive linear matrix multiply–accumulate (MAC) operations account for more than 80% of the total number of deep neural network (DNN) calculations [4]. However, the convolution operation, which is ill-suited to modern advanced electronic serial processors, is becoming the biggest burden for high-performance computing tasks, particularly for artificial intelligence (AI) algorithms. Furthermore, as the scale of the matrix increases, so does the computational overhead of convolution operations. It has been demonstrated that the amount of computing power required to train state-of-the-art DNNs doubles every 3.5 months [5], far exceeding the growth of traditional electrical integrated circuits (EICs) following Moore's law. Although parallel electrical coprocessors such as graphics processing units (GPUs) and tensor processing units (TPUs) can accelerate the convolution calculation, it is still difficult in practice to handle millions of MAC operations in a fully parallel manner for DNNs [6,7]. In contrast, it has been proven that many MAC operations can be executed concurrently during a single pass of light, and this may be the prime motivation for the recent interest in optical computing [8,9]. Photonic solutions for computing have been investigated for at least 70 years [10,11]. However, compared with fast-growing EICs, the development of optical computing gradually slowed in the late 2000s [12], owing to a lack of application-driven motivation and adequate optical computing architectures.

    Recently, due to the remarkable achievements in AI, there has been renewed interest in attempting to improve computing power, energy efficiency, and processing speed by exploiting photonic or hybrid optical–electric processors rather than their electronic counterparts [13–15]. Two mainstream optical computing architectures have been rapidly developed. The first is based on a planar waveguide on a two-dimensional (2D) substrate [16–18], whereas the second is realized by multiple cascading diffractive optical elements (DOEs) in three-dimensional (3D) space [19,20]. However, planar architecture, which includes Mach–Zehnder interferometers [16], microring resonators [21,22], waveguide modulators [23], and acousto-optical modulators [24], does not fully use the 3D interconnectivity of optics, whereas 3D architecture requires full manipulation of the electromagnetic field with high precision, and fabricating large-sized and high-precision subwavelength DOEs in 3D space is still difficult [19,20].

    Despite predictions that photonic processors could be at least 10,000 times faster than state-of-the-art EICs [13,14], the past schemes have not realized fully parallel convolution computing compared with their electronic counterparts, particularly when high precision is required. Here, we propose a new paradigm for a universal convolutional accelerator with full parallelism and adequate precision based on optical multi-imaging–casting architecture (OMica), capable of calculating arbitrarily encoded hybrid analog–digital matrix convolutions. The architecture can be viewed as the starting point for a new roadmap for optical computing, with the potential for building fully massively parallelized optical convolutional accelerators to overcome the intrinsic computing power shortage and unsatisfactory energy efficiency of EICs. Furthermore, the incoherent illumination implies the possibility of handling white-light images directly from lenses without traditional photoelectric conversion, promising to fully exploit the benefits of AI algorithms or accelerate other practical applications where rapid big data processing is desired.

    2. PRINCIPLE OF OMica

    A. Optical Multi-Imaging–Casting Architecture

    The OMica architecture, as depicted in Fig. 1, employs an incident modulated light beam (matrix A) and a spatial light modulator (SLM) (matrix B), as well as a confocal 4f system with a diffractive beam splitter (BS) and another focusing system with a photodetector (matrix C). The planes of matrices A and B, the confocal plane of the 4f system, and the plane of the detector are all in a conjugated object–image relationship with each other. When a BS, such as a Dammann grating (DG) [25–27], is placed behind the plane of matrix A, the two pairs of imaging–casting relationships mentioned above still hold. With the DG inserted, the optical signal carrying the information of matrix A is duplicated into multiple diffraction orders with excellent uniformity, owing to the properties of the DG. The different diffraction orders inherently have different angular spectral components (θ1 and θ2), yet they all carry the same information as matrix A, as shown in Fig. 1(c). This implies that matrix A is multiplexed over the spatial pattern. When a pinhole placed in the confocal plane passes only one of the diffraction orders, the image corresponding to that order can be seen clearly on the plane of matrix B through lens L2 (as shown in Appendix A and Fig. 9). Because these diffraction orders have different diffraction angles (θ1 and θ2), the images of the diffraction orders on the plane of matrix B are displaced relative to one another when each order is passed through the pinhole in turn. Thus, as shown in Fig. 1(c), all images are aligned by adjusting the distance d between the DG and matrix A, according to the paraxial relation

    s = l (f1/f2) tan θ,  (1)

    where s is the convolutional stride, f1 and f2 are the focal lengths of L1 and L2, respectively, l is the distance between matrix B and the image of the BS (see Fig. 1), and θ is the angle difference between any two adjacent diffraction orders. Noting that the image of the DG lies at a distance l = (f2/f1)^2 d from the plane of matrix B (the longitudinal magnification of the 4f system is the square of its transverse magnification), and using the grating equation θm = arcsin(mλ/Λ) together with the approximations θm+1 − θm ≈ θ and tan θ ≈ sin θ, Eq. (1) can be rewritten as

    s = d (f2/f1) (λ/Λ),  (2)

    where θm is the diffraction angle of the mth order of the DG, Λ is the grating period, and λ is the wavelength. Therefore, s can also be adjusted to adapt to different convolutional strides by changing d [Figs. 1(a) and 1(b)].
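    To make the stride relation concrete, the following minimal sketch evaluates Eq. (2) and its inverse; the focal lengths, wavelength, grating period, and target stride in the example are illustrative placeholders, not the experimental values.

```python
# Minimal numeric sketch of Eq. (2): s = d * (f2 / f1) * (lambda / Lambda).
# All numbers below are hypothetical placeholders, not the experimental values.

def stride_from_distance(d, f1, f2, wavelength, period):
    """Lateral shift s between adjacent diffraction-order images on the matrix-B
    plane for a DG placed a distance d behind matrix A (d, s, f1, f2 in the same
    length units; wavelength and period in the same units as each other)."""
    return d * (f2 / f1) * (wavelength / period)

def distance_for_stride(s, f1, f2, wavelength, period):
    """Distance d needed to realize a target convolutional stride s."""
    return s * (f1 / f2) * (period / wavelength)

if __name__ == "__main__":
    f1, f2 = 100.0, 200.0          # focal lengths of L1 and L2 (mm), assumed
    lam, period = 0.45e-3, 50e-3   # wavelength and DG period (mm), assumed
    d = distance_for_stride(0.2, f1, f2, lam, period)   # target s = 0.2 mm (assumed SLM pixel pitch)
    print(f"d = {d:.2f} mm gives s = {stride_from_distance(d, f1, f2, lam, period):.3f} mm")
```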


    Figure 1. Schematic of the optical multi-imaging–casting architecture: optical parallel convolution process with different convolutional strides s1 (a) and s2 (b); (c) optical architecture principle of the OMica, where the beam splitter (BS) is a diffractive beam splitter. Oy is the diffraction order in the y direction (indicated by different line types), and θ is the angle difference between any two adjacent diffraction orders in object space (θ1 and θ2 are the diffraction angles of the Oy=1 and Oy=2 orders, respectively); θ′ is the angle difference in image space (θ1′ and θ2′ are the corresponding angles in image space). d is the distance between matrix A and the BS, and l is the distance between matrix B and the image of the BS. a, b, and c are spot arrays corresponding to different diffraction orders diffracted from the BS. The imaging–casting system is composed of L1 and L2, with focal lengths f1 and f2. L3 is a focusing lens with focal length f3. s is the lateral shift of the images of the diffraction orders of the DG on the SLM2 plane, corresponding to the convolutional stride; this stride can be tuned by changing the distance d [s1 and s2 correspond to the different convolutional strides shown in (a) and (b)].

    Because of the conjugation relationship and the different angles, the images of all diffraction orders are superimposed on the matrix B plane with naturally shifted displacements when the pinhole is removed. This means that the SLM can modulate these shifted images simultaneously; that is, all multiplications between the multiple images of matrix A and matrix B can be implemented in parallel. These products are then summed through L3 and separated from each other on the C plane owing to their angular spectrum differences. Therefore, the convolution of the two matrices can be performed in parallel after the light passes through the system once. This process is a direct optical implementation of mathematical convolution, i.e., C = A ∗ B, where "∗" denotes the convolution operator. Owing to the object–image conjugate configuration, the OMica proposed here avoids the trade-off between the element sizes of the matrix in the spatial and frequency domains that arises in the 4f optical convolutional system [28,29], allowing massive parallelism with sufficiently high accuracy. Moreover, because of the object–image conjugate configuration, the OMica can work under both coherent and incoherent light illumination. Thus, this optical hardware can handle white-light images directly from lenses without traditional photoelectric conversion if achromatic lenses are used as the projection system.
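    The parallel multiply–accumulate picture described above can be emulated numerically. The toy sketch below (our own illustration, not the authors' code) treats each diffraction order as a laterally shifted copy of kernel A that is multiplied element-wise with matrix B and then summed into one detector spot; the resulting spot array is the sliding-kernel MAC result, i.e., the "convolution" in the CNN sense.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Toy emulation of multi-imaging casting: every diffraction order casts one
# shifted copy of kernel A onto the matrix-B plane; the SLM multiplies them
# element-wise, and lens L3 sums each order's product into one detector spot.

rng = np.random.default_rng(0)
A = rng.integers(0, 4, size=(3, 3)).astype(float)    # kernel (nonnegative intensities)
B = rng.integers(0, 4, size=(8, 8)).astype(float)    # convolved matrix

out_h, out_w = B.shape[0] - A.shape[0] + 1, B.shape[1] - A.shape[1] + 1
C = np.zeros((out_h, out_w))
for i in range(out_h):            # order index along y -> lateral shift i*s
    for j in range(out_w):        # order index along x -> lateral shift j*s
        patch = B[i:i + A.shape[0], j:j + A.shape[1]]
        C[i, j] = np.sum(patch * A)   # one spot = sum over one shifted product

# Reference: the same sliding-kernel MAC computed with a windowed formulation.
ref = np.einsum('ijkl,kl->ij', sliding_window_view(B, A.shape), A)
assert np.allclose(C, ref)
print(C.astype(int))
```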

    B. Negative Matrix Coding Method

    In our proof-of-concept implementation, a homemade 2D 28×20 DG (see details in Appendix B) was inserted into a 4f system. Two amplitude-only SLMs (8-bit grayscale) are located on the object and image planes of the 4f system, where the two convolution matrices are loaded sequentially. In the experiment, light intensity was used as the information carrier, and the two SLMs were used to load the information of matrix B and matrix A onto the incident uniform light beam. Therefore, in principle, only nonnegative matrices can be loaded and calculated on this hardware. To address this limitation, a negative matrix encoding method for hybrid analog–digital optical convolution computing was developed. In a hybrid analog–digital framework, a grayscale matrix with negative elements can easily be decomposed into one larger-scale matrix or into several same-size negabinary digit (NBD) matrices, arranged in a spatial or a temporal sequence, respectively [30,31]. In other words, each decimal element of the original matrix can be converted into its NBD representation as

    (a)10 = Σ_{i=0}^{⌈N/k⌉} ci (−2^k)^i,  (3)

    where {c⌈N/k⌉, c⌈N/k⌉−1, …, c0} are the NBD bits ci, with ci ∈ [0, 2^k − 1]; N is the maximum number of NBD bits, k is an integer, and "⌈·⌉" denotes rounding up to the nearest integer. Following this decomposition, a grayscale matrix with negative elements is transformed into a larger matrix arranged spatially, or into several same-sized matrices in a temporal series, represented by ⌈N/k⌉ nonnegative bits, allowing these matrices to be loaded directly on the SLMs. The principle of this encoding method is depicted schematically in Fig. 2. Notably, there is a trade-off between computing precision and computing power, which can be adjusted by varying the parameter k. A small k yields high precision with low computing power, whereas a large k yields high computing power with relatively low precision. Therefore, this encoding method can improve computing precision to a certain extent compared with pure-analog optical convolution computing [30,31].
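    As a concrete illustration of Eq. (3), the sketch below encodes and decodes single grayscale values in the negabinary-digit representation (base −2^k with nonnegative digits); the helper names and the number of digits are our own choices for the example.

```python
# Hedged sketch of the NBD representation of Eq. (3): a = sum_i c_i * (-2**k)**i,
# with nonnegative digits c_i in [0, 2**k - 1], so every digit can be loaded on
# an intensity-only SLM. Function names are illustrative, not from the paper.

def encode_nbd(a, k=1, n_digits=None):
    """Return the NBD digits of integer a, least significant first."""
    base = -(2 ** k)
    digits = []
    while a != 0 or not digits:
        a, r = divmod(a, base)
        if r < 0:                      # force the digit into [0, 2**k - 1]
            r -= base
            a += 1
        digits.append(r)
    if n_digits is not None:
        digits += [0] * (n_digits - len(digits))
    return digits

def decode_nbd(digits, k=1):
    base = -(2 ** k)
    return sum(c * base ** i for i, c in enumerate(digits))

# Reproduce the worked example from the text: -2 -> {c2, c1, c0} = {0, 1, 0}.
assert encode_nbd(-2, k=1, n_digits=3) == [0, 1, 0]        # stored as [c0, c1, c2]
assert all(decode_nbd(encode_nbd(a, k=1)) == a for a in range(-2, 6))
```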


    Figure 2.Procedure of converting the original grayscale matrix with negative elements into encoded matrices of NBD. (a) The encoding matrices are loaded into the OMica system to compute the convolution, with the experimental encoded convolutional result decoded into the original matrix. (b) Original grayscale matrices A and B, and original convolutional results matrix C. (c) Larger encoded matrices A and B in spatial sequence and the same size encoded convolutional results matrix C.

    Here, as an example, the encoding process for a grayscale matrix with negative elements ranging from −2 to 5 is demonstrated step by step under the condition k=1. As shown in Figs. 2(b) and 2(c), the grayscale value of each element of the original matrix is expressed as multiple NBDs after encoding. For example, the first element of the original matrix A is written as −2 = 0×(−2)^2 + 1×(−2)^1 + 0×(−2)^0. The elements of the matrix are therefore arranged in rows after encoding, denoted as P1, P2, and P3, and each element in the column direction is encoded with three NBDs, denoted as Bit3, Bit2, and Bit1, as shown in Fig. 2(c). Thus, the first element, −2, is expressed as {010} in the first column of the encoded matrix; that is, c2=0, c1=1, and c0=0. Subsequently, the converted matrices are loaded onto the SLMs in a spatial sequence for computing [Fig. 2(c)]. Notably, to avoid aliasing in a spatial sequence, zero elements should be inserted into the encoded matrix between two adjacent rows or columns of the original high-bit matrix, where the number of zero elements is ⌈N/k⌉ − 1. Here, the physical pixels of the SLMs are not fully used because of the redundant zero elements. The computational advantage can be realized only by increasing the matrix scale, but doing so significantly slows the system's effective refresh rate because the convolution must be performed among all bits of either matrix A or B. Therefore, when the OMica is used for computing acceleration, a compromise should be struck between high computing power and high computing precision by choosing an appropriate parameter k.
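    The decoding step that recovers the signed convolution from the nonnegative bit-plane convolutions follows from Eq. (3) by linearity: A ∗ B = Σ_{i,j} (−2^k)^(i+j) (Ai ∗ Bj). The sketch below verifies this relation with the bit planes convolved one pair at a time (the temporal-sequence variant); the spatial-sequence layout of Fig. 2 instead interleaves the same bit planes on a single SLM frame with ⌈N/k⌉ − 1 zero rows or columns inserted. The helper functions and matrix sizes are our own illustrative choices.

```python
import numpy as np

def nbd_planes(M, k=1, n_digits=3):
    """Decompose an integer matrix into nonnegative digit planes of base -(2**k)."""
    base = -(2 ** k)
    planes, M = [], M.copy()
    for _ in range(n_digits):
        r = np.mod(M, 2 ** k)              # nonnegative remainders in [0, 2**k - 1]
        planes.append(r)
        M = (M - r) // base
    assert np.all(M == 0), "n_digits too small for this value range"
    return planes                           # least significant plane first

def conv_full(A, B):
    """Plain 'full' sliding-window MAC, standing in for one optical pass."""
    ph, pw = A.shape[0] - 1, A.shape[1] - 1
    Bp = np.pad(B, ((ph, ph), (pw, pw)))
    out = np.zeros((B.shape[0] + ph, B.shape[1] + pw))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(Bp[i:i + A.shape[0], j:j + A.shape[1]] * A)
    return out

rng = np.random.default_rng(1)
A = rng.integers(-2, 6, size=(2, 10))       # grayscale matrices with negative elements
B = rng.integers(-2, 6, size=(2, 10))
k, base = 1, -2
C = np.zeros((A.shape[0] + B.shape[0] - 1, A.shape[1] + B.shape[1] - 1))
for i, Ai in enumerate(nbd_planes(A, k)):
    for j, Bj in enumerate(nbd_planes(B, k)):
        C += (base ** (i + j)) * conv_full(Ai, Bj)   # decode the nonnegative partial results
assert np.allclose(C, conv_full(A, B))               # matches the signed convolution
print(C.astype(int))
```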

    3. EXPERIMENTAL RESULTS

    A. Hybrid Analog–Digital Matrix Convolution

    As an example, the hybrid analog–digital optical convolution of two randomly generated 2-bit grayscale 3×10 matrices, A1 and B1, with elements in the range of 0 to 3, and two negabinary 3-bit grayscale 2×10 matrices, A2 and B2, with negative elements in the range of −2 to 5, is demonstrated; the convolutional results are shown in Fig. 3. In each box, the light intensity distributions of the spot arrays on the detected plane, denoting the raw results of convolution, are shown in the first subfigure of the first row. The theoretical results obtained by an electric computer (full precision, 64 bits) are shown in the second subfigure, and the experimental results before decoding are shown in the third subfigure. The absolute error map, shown in the first subfigure of the second row, is defined as

    AE = |Ctheo − Cexp|,  (4)

    where Ctheo and Cexp are the theoretical and experimental convolutional results, respectively, and "|·|" denotes the absolute value. Additionally, the theoretical and experimental results of the convolution after decoding are shown in the second and third subfigures of the second row, respectively. The overall trend of the experimental results of the convolution is consistent with that of the theoretical results.
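    The error analysis above can be summarized in a few lines: the absolute error of Eq. (4) is evaluated per spot, and because the encoded (pre-decoding) convolution values are integers, rounding recovers the exact result whenever the maximum AE is below 0.5. The arrays in the sketch below are made-up placeholders, not the measured data.

```python
import numpy as np

C_theo = np.array([[3., 5., 2.], [4., 1., 6.]])     # integer-valued theoretical convolution (placeholder)
C_exp = C_theo + np.random.default_rng(2).uniform(-0.3, 0.3, C_theo.shape)  # mock measurement

AE = np.abs(C_theo - C_exp)                          # Eq. (4), element-wise
print(f"mean AE = {AE.mean():.3f}, max AE = {AE.max():.3f}")
if AE.max() < 0.5:                                   # every spot closer than half a quantization level
    assert np.array_equal(np.rint(C_exp), C_theo)    # 100% accuracy after digitization
```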


    Figure 3.Experimental results of hybrid analog–digital matrix convolution for two groups of matrices based on spatial sequence encoding. The subfigures from left to right are the light intensity distribution of the spot array denoting the convolution, theoretical convolutional values, experimental convolutional results, error map between theoretical and experimental results, and decoded convolutional results, respectively, in (a) matrices A1 and B1 and (b) matrices A2 and B2. The red cross marks the centroid positions of each spot.

    Figures 3(a) and 3(b) show the results of the convolution of the two matrix pairs, A1, B1 and A2, B2, respectively. The mean values of the absolute error AE are 0.114 and 0.08, and the maximum values are approximately 0.239 and 0.145, respectively, before decoding, indicating that the optical convolutional architecture achieves high precision. It should be noted that the former has a higher mean error before decoding than the latter, owing to increased cross talk caused by relatively large convolutional elements. Moreover, in the spatial coding method, the two encoded matrices are filled with zero elements to avoid aliasing, which further reduces the cross talk and the final error. Because the maximum absolute errors for the two cases are both less than 0.5, the correct convolutional results, with 100% accuracy, can still be obtained after digitization. Thus, the experimental light intensity distributions of the two cases precisely reflect the values of the convolutional results.

    B. High-Accuracy Matrix Convolution

    As an example, the high-accuracy optical convolution of randomly generated 8-bit grayscale 10×10 matrices A3 and B3 and 20×20 matrices A4 and B4, with elements in the range of 0 to 255, is demonstrated. Figure 4 compares the experimental results of the optical convolution of matrices A3, A4 and matrices B3, B4 with the theoretical results. In each box, the light intensity distributions of the spot arrays on the detected plane, denoting the raw results of convolution, are shown in the first subfigure. The theoretical results obtained using an electric computer (full precision, 64 bits) are shown in the second subfigure, and the experimental results are shown in the third subfigure. The relative error is defined as

    RE = |Cexp − Ctheo| / (|Cmax − Cmin| / 256),  (5)

    where Cexp is the experimental convolution, Ctheo is the theoretical convolution, and Cmax and Cmin are the maximum and minimum values of the theoretical convolution, respectively; "|·|" denotes the absolute value. This relative error indicates that 8-bit precision is obtained whenever its value is less than one.
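    A short sketch of the relative-error metric of Eq. (5): the error is measured in units of one 8-bit quantization step of the theoretical dynamic range, so a value below one at a given spot means that spot is accurate to 8 bits. The arrays are again placeholders.

```python
import numpy as np

def relative_error(C_exp, C_theo):
    """RE of Eq. (5): error in units of 1/256 of the theoretical dynamic range."""
    lsb = np.abs(C_theo.max() - C_theo.min()) / 256.0
    return np.abs(C_exp - C_theo) / lsb

rng = np.random.default_rng(3)
C_theo = rng.integers(0, 2_000_000, size=(19, 19)).astype(float)   # mock convolution values
noise = rng.normal(0.0, 0.4, C_theo.shape)                          # mock measurement noise, in LSB units
C_exp = C_theo + noise * (C_theo.max() - C_theo.min()) / 256.0

RE = relative_error(C_exp, C_theo)
print(f"mean RE = {RE.mean():.3f}, spots within 8-bit accuracy = {(RE < 1).mean():.1%}")
```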

    Figures 4(a) and 4(b) show the results of the convolution of matrices A3, B3 and matrices A4, B4, respectively. The overall trend of the experimental results of the convolution is very consistent with that of the theoretical results. After further assessment, the mean values of the relative error RE are 0.424 and 0.39, and the maximum values are 2.258 and 1.293, respectively. Also, from these error maps, one can see that the relative errors for most points [98.06% and 97.25% in Figs. 4(a) and 4(b), respectively] are less than one, indicating that the computing accuracy is very close to 8 bits, which is high enough for most AI inference tasks and at least some training tasks. Additionally, experimental results for larger-scale matrices are demonstrated in Appendix C.


    Figure 4. Experimental results of high-accuracy convolution for two groups of grayscale matrices: (a) randomly generated 8-bit grayscale 10×10 matrices A3 and B3; (b) 8-bit grayscale 20×20 matrices A4 and B4. The subfigures from left to right show the light intensity distribution of the spot array denoting the convolution, the theoretical convolutional values, the experimental convolutional results, the error map between theoretical and experimental results (a red circle indicates that the computing accuracy at that point is less than 8 bits), the histogram of the error distribution, and a comparison of the experimental convolutional results, expanded into one-dimensional (1D) vectors, with the theoretical convolutional results.

    4. OPTICAL CNN INFERENCE TASKS BASED ON MNIST

    With its ability to accelerate universal convolutional computation, the OMica could find applications in a variety of fields where dense convolutions are involved, such as the simulation of optical imaging, multi-input multi-output systems, and the training and inference of CNNs. As an example, we demonstrate inference tasks for the recognition of handwritten digits based on the OMica, using the above-mentioned negative matrix coding method and hybrid analog–digital matrix convolution (see details of the CNN in Appendix D). Here, a binary neural network (BNN) [32] is implemented as an example to test the robustness and accuracy of the proposed optical hardware. For a BNN, the input signal is a nonnegative binary (0 or 1) image, and the kernel is a signed binary matrix (−1 or +1) [33]. Each kernel of the BNN, trained in advance, is encoded into two identical-sized nonnegative matrices, one of which is a low-bit (positive) matrix and the other a high-bit (negative) matrix, as shown in Fig. 5(a). Intuitively, it seems that two convolution operations should then be executed in a temporal sequence. Remarkably, however, the 10 original kernels need to be divided into only 10 high-bit sub-kernels plus a single shared low-bit sub-kernel, because the low-bit sub-kernels are all the same. Furthermore, the first high-bit sub-kernel and the low-bit sub-kernel are the same, with unity transmittance. Thus, the total number of convolutional kernels after encoding is still 10, implying that no additional computational overhead is incurred. Figure 5(b) shows the inference process of the CNN based on the encoded low- and high-bit kernels. The 10 encoded kernels are sequentially loaded onto the SLM located at the input plane of matrix A, and the binary input images with a scale of 28×28 are sequentially loaded onto the SLM at the input plane of matrix B. When light passes through the two SLMs in sequence and is then focused and separated by the focusing lens, the detector on the focal plane captures the spot array denoting the convolutional results. Finally, the original convolutional results are obtained by decoding the corresponding low- and high-bit convolutions: the final result is recovered by adding the positive (low-bit) convolution to the negative (high-bit) convolution weighted by −2.
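    The kernel-splitting step, as we read it from Fig. 5(a) and the k = 1 negabinary code, amounts to writing a signed binary kernel K as K = K_low + (−2)·K_high, where K_low is the all-ones (unity-transmittance) sub-kernel and K_high marks the −1 entries; both sub-kernels are nonnegative and can be loaded on an intensity-only SLM, and the two measured convolutions are recombined digitally. The sketch below checks this decomposition on random data; it is our own illustration, not the authors' code.

```python
import numpy as np

def split_signed_binary_kernel(K):
    """Split a -1/+1 kernel into nonnegative sub-kernels: K = K_low - 2 * K_high."""
    K_low = np.ones_like(K)            # c0 bit: always 1 for both +1 and -1
    K_high = (K < 0).astype(K.dtype)   # c1 bit: 1 only where the element is -1
    assert np.array_equal(K, K_low - 2 * K_high)
    return K_low, K_high

def correlate_valid(K, X):
    """Sliding-window MAC standing in for one optical pass of the OMica."""
    oh, ow = X.shape[0] - K.shape[0] + 1, X.shape[1] - K.shape[1] + 1
    return np.array([[np.sum(X[i:i + K.shape[0], j:j + K.shape[1]] * K)
                      for j in range(ow)] for i in range(oh)])

rng = np.random.default_rng(4)
K = rng.choice([-1.0, 1.0], size=(9, 9))                  # trained binary kernel
X = rng.integers(0, 2, size=(28, 28)).astype(float)       # binarized input image
K_low, K_high = split_signed_binary_kernel(K)
decoded = correlate_valid(K_low, X) - 2 * correlate_valid(K_high, X)
assert np.allclose(decoded, correlate_valid(K, X))         # same result as the signed kernel
```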


    Figure 5.Inference process for the convolutional neural network performed by OMica based on the MNIST dataset. (a) Execution of convolution operation by encoding each original convolutional kernel into high-bit and low-bit kernels; (b) schematic of the optical convolutional architecture performing CNN inference; (c) absolute error AE map comparing theoretical and experimental results of the convolution of a handwritten digit 7 as an input; confusion matrix of blind-testing 1000 images from the MNIST dataset when matrix convolutions are executed by the optical hardware (d) and by pure electric hardware (e). The purple box marks the first convolutional kernel to realize the whole process of encoding, convolution, and decoding.

    Figure 5(c) shows the absolute error AE map between the theoretical and experimental results for an input image of a handwritten digit 7 convolved with the first kernel. Compared with the matrices in Fig. 4, the standard input image of a handwritten digit is larger (28×28), whereas the convolutional kernel is nearly the same size; the average absolute error is 0.405. This implies that it is possible to calculate the optical convolution of larger-scale matrices using the OMica with high precision. The subsequent pooling layer, nonlinear operations, and fully connected layers are executed by a classical electrical computer.

    To validate the reliability and robustness of the system, we performed blind testing on the first 1000 MNIST images, with serial numbers ranging from 1 to 1000. As shown in Figs. 5(d) and 5(e), the experimental results indicate that the optical convolutional accelerator achieved a blind-testing accuracy of up to 97.3%, whereas the electrical computer achieved a recognition accuracy of 96.7% for the same test dataset. This may be because the computing error of the optical convolution carries characteristics of the input images, thereby further strengthening the feature extraction ability; indeed, the error maps for different handwritten digits are highly correlated with the input image, as shown in Fig. 5(c) (see Appendix E). By optimizing the kernel weights directly in the optical convolutional system, direct training of the optical CNN is expected to yield even better results than those of an electronic computer. On this basis, the architecture can be effectively used as a hardware accelerator with large computing power in various DNNs.

    5. DISCUSSION

    A. Computing Power Scalability

    As shown in Fig. 1, even when the distance d between matrix A and the BS is adjusted to match the convolutional stride s, each diffraction order of the BS involved in the convolution is still imaged onto the plane of matrix B. Therefore, it is possible to greatly reduce the physical size of the matrix elements. Given these conditions, the peak computing power of the optical convolutional architecture will reach 10 peta (10^15) operations per second (POPS) [34] if a modulator with a higher refresh rate (typically 10 kHz) is used, such as a digital micromirror device (DMD) or a specially designed micro–electro–mechanical system; this is even faster than a state-of-the-art GPU such as the TITAN RTX (Nvidia) [35]. Furthermore, if other multiplexing methods, such as polarization, wavelength, and spatial mode, are used, then speeds at least 10 to 10^2 times faster than this estimate can be achieved [36,37]. Therefore, based on the OMica, the computing power for convolution may, in the near future, be superior, or at least comparable, to that of the most powerful supercomputer (peak performance of the top system, Frontier [38], with a Linpack performance of 1102 POPS), given larger-scale and higher-refresh-rate devices.
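    The scaling argument can be checked with back-of-the-envelope arithmetic. The sketch below uses the Appendix C convention that one optical pass over two N×N matrices performs about 2N^4 operations (a multiply and an accumulate for every kernel element at every shift) and estimates the matrix scale needed to reach 10 POPS at an assumed 10 kHz refresh rate; the specific numbers are our assumptions, not measured figures.

```python
def ops_per_pass(n):
    """Operations in one optical pass over two n x n matrices (Appendix C convention)."""
    return 2 * n ** 4

refresh_hz = 10_000          # assumed DMD/MEMS refresh rate
target_ops_per_s = 10e15     # 10 POPS

print(f"one pass, 200 x 200 matrices: {ops_per_pass(200):.2e} operations")   # 3.2e9, as in Appendix C
n_needed = (target_ops_per_s / (refresh_hz * 2)) ** 0.25
print(f"matrix scale for 10 POPS at 10 kHz: about {n_needed:.0f} x {n_needed:.0f}")  # roughly 840 x 840
```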

    B. Energy Efficiency Ratio

    Additionally, the power consumption of the optical convolutional system is significantly lower than that of an electronic processor with the same computing power, even for the present bulk optical system. This estimate fully accounts for the operating power consumption of the optoelectronic devices and assumes that the total power consumption of the entire optical convolution computing system, including the light source, two modulators, and the detector, is less than 100 W. Of course, a power consumption of 100 W is not meaningful for a task as small as the MNIST dataset. However, as the matrix size increases, along with the aperture size and DG splitting ratio, the computing power grows in proportion to N^4, whereas the increase in the power consumption of this system is insignificant. Therefore, as the computing power continues to grow, the energy efficiency ratio of this architecture will significantly outperform that of existing electronic computing systems. Furthermore, if a more sensitive detection device, such as a multiphoton counter, is used, power consumption will be drastically reduced [39]. In contrast, a powerful supercomputer is energy hungry, with power consumption typically reaching 10^4 to 10^5 kW (Frontier's power is 21,100 kW). Evidently, the optical convolutional architecture will consume far less power than supercomputers, whereas its computing power for a specific task (convolution) could be at least comparable to that of Frontier, the top supercomputer this year.

    C. Potential Applications

    To the best of our knowledge, the OMica is the only optical parallel acceleration solution that can serve both as a high-precision convolutional computer and as an AI hardware accelerator with high recognition accuracy. Additionally, if an appropriate distance d [Figs. 1(a) and 1(b)] is chosen, the OMica architecture can realize not only convolutional layers but also pooling layers and fully connected layers (all of which are linear convolution calculations). For AI algorithms, it has been demonstrated that very high accuracy is not required [40] and that neural networks can operate effectively with both low-accuracy and fixed-point operations. Inference models function nearly as well with 4–8 bits of precision, and training requires roughly 8–16 bits of precision per computation [41]. Our results indicate that the computing accuracy is close to 8 bits, which is sufficiently accurate for most AI inference applications. Moreover, if high-contrast modulators, such as DMDs, are used, the computing accuracy could be improved even further, and the results obtained from this optical accelerator would be sufficient for training most AI models. Furthermore, when the neural network is trained directly in this optical convolutional system, the physical characteristics of the system itself, such as alignment errors and cross talk, are also trained, which is expected to further improve the performance of the aforementioned neural network.

    Presently, only one kernel A and one input feature map B are loaded onto these two SLMs. It is also possible to load multiple kernels on the first SLM, allowing for parallel convolutions among multiple kernels and multiple input channel feature maps by filling an appropriate number of zero elements between any two adjacent kernels. By swapping the positions of feature map B and kernel A, a CNN can be built, and the key is to make full use of pixels to increase computing power. Also, it is worth noting that considering the actual hardware scale, it is often necessary to split and reorganize the input feature map to further improve the hardware utilization, that is, to load different matrix combinations to the SLMs to execute the convolution process.

    Although these task-specific devices are not yet available, the current CMOS technology, in principle, is adequate for developing high-quality devices, such as SLMs and detectors, for optical computing. This work presents a promising method for building optical convolutional processors to overcome the intrinsic shortage of computing power and unsatisfactory energy efficiency in traditional electrical processors. Furthermore, the experimental results validate the benefits of optical convolutional systems for various application scenarios, including computationally intensive tasks and neuromorphic computing.

    6. CONCLUSION

    An optical convolutional accelerator for fully parallel universal convolution computing was proposed, and a negative matrix coding scheme with sufficiently high precision was demonstrated. In principle, a suitable encoding scheme and the OMica can be used to efficiently calculate the convolution of matrices of arbitrary bit depth with massive parallelism and sufficient accuracy. Moreover, convolution is universal, and the computing results obtained may be easily transferred to any other computing platform. Our proof-of-concept experimental results proved the feasibility of the optical convolution of 20×20 matrices with an accuracy of about 8 bits. Furthermore, a BNN for handwritten digit recognition tasks on the standard MNIST dataset was constructed, and the inference process was demonstrated on this optical hardware. The results indicated that the blind-test recognition accuracy can reach 97.3%, which is comparable with that predicted by purely electrical networks. These proof-of-concept experimental results indicate that the OMica could be used for massively parallel, high-precision, and high-efficiency AI accelerators, and this computing paradigm has potential applicability in the construction of task-specific cloud computing centers or other AI computing centers. By developing high-speed SLMs with higher contrast, optimizing a specially designed projection imaging system, and setting up a dedicated dot-array lighting source, it is possible to build a photonic coprocessor with higher computing power and lower energy consumption than state-of-the-art supercomputers, such as Frontier, based on the OMica. Additionally, the characteristics of the imaging system itself suggest that the computing power of the system can be exponentially increased by cascading multiple 4f systems and employing extra multiplexing degrees of freedom. Thus, a hybrid optical–electrical computer center or data center could be directly constructed. Furthermore, because the optical hardware can work under incoherent white-light illumination if an achromatic lens projection system is used, the OMica architecture makes it possible to handle white-light images directly from lenses without traditional photoelectric conversion.

    In summary, the OMica is expected to be used in self-driving vehicles [42], machine vision [43], and other fields that require high computing power for real-time or quasi-real-time data processing. This opens the door to increasing the computing power and energy efficiency of convolution by using high-performance devices, such as larger-scale modulators with higher updating frequencies and detectors or detector arrays with wider dynamic ranges and higher sampling frequencies, which would be superior to the most powerful supercomputers, in the near future.

    Acknowledgment

    The authors appreciate the critical discussions on this concept with Guowei Li and his assistance with the experiment.

    APPENDIX A: EXPERIMENTAL SETUP AND METHODS

    Figure 6 shows a proof-of-concept experimental system based on the OMica. Figure 7 shows photographs of the experimental setup. Two large-scale matrices, A and B, are assumed to be the two convolved matrices and are loaded onto two modulators, SLM1 and SLM2, respectively. The convolutional matrix C is detected by sCMOS1. The DG was removed before alignment, and the monitoring camera was placed in the focal plane of L9. During the alignment process, the specially designed patterns shown in Fig. 8 were used. Subsequently, the DG was inserted before SLM1, and the distance d0 between the DG and SLM1 was adjusted carefully to make the lateral shift of the image correspond to the normalized convolutional stride size s=1 [Eq. (1)].


    Figure 6. Schematic of the optical convolution experimental system using the DG. LED, light-emitting diode with wavelength λ=450 nm; M1–M6, reflective aluminum mirrors; AP1–AP3, aperture pinholes; L1–L5, convergent lenses; L6, L7, L10, Fourier transform lenses; PBS1–PBS3, cube polarization beam splitters; SLM1, SLM2, reflective liquid crystal SLMs; APA, aperture array; DG, Dammann grating; BS, non-polarizing beam splitter; sCMOS1, scientific complementary metal–oxide–semiconductor camera for detection; CMOS2, CMOS camera for monitoring. I, II, III, and the plane of the square aperture form one group of object–image conjugate planes. IV and V form another group of object–image conjugate planes. Plane V is the image plane of the DG. d0 is the characteristic distance corresponding to s=1, which can be adjusted to match the physical size of the matrix unit of matrix B to a different stride size.


    Figure 7. Photographs of the experimental system of the OMica. (a) Entire optical system; (b) SLM mounted on a 4D manual stage for loading kernel A; (c) SLM mounted on a 4D manual stage for loading matrix B; (d) enlarged view of the sCMOS1 detector and the monitoring CMOS2 camera.


    Figure 8.Typical patterns loaded onto two SLMs for alignment. (a) Alignment pattern and (b) square array pattern.


    Figure 9.Experimental results for demonstration of kernel sliding. (a), (b) Images loaded onto two SLMs. (c)–(j) Images captured by the monitoring CMOS2 camera as the iris moves from left to right, allowing only one diffraction order to pass through its aperture in sequence.

    APPENDIX B: DESIGN AND MANUFACTURING OF DAMMANN GRATING

    Here, a simulated annealing algorithm is used to optimize the structure of the DGs. The normalized energy distributions over the diffraction orders of 1×20 and 1×28 DGs with ideal π phase retardation are shown in Fig. 10. Under ideal conditions, the efficiencies of the 1×20 and 1×28 1D gratings were 81.93% and 82.38%, respectively, and the energy non-uniformity was less than 1%. The structure of a 2D DG can be easily obtained from the orthogonal superposition of two crossed 1D gratings.
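    As a rough illustration of the design procedure, the sketch below optimizes a binary 0/π phase profile (transmission ±1) by simulated annealing, targeting a small 1×7 splitter so that it runs in seconds; the actual 1×20 and 1×28 gratings use the same idea with finer structures, and all parameter values here (segment count, cooling schedule, cost weights) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n_seg, n_samp = 64, 1024                 # binary segments per period, samples for the FFT
target = np.arange(-3, 4)                # desired diffraction orders of a 1 x 7 splitter

def order_intensities(seg):
    """Intensities of the target orders for a +1/-1 (0/pi phase) transmission profile."""
    t = np.repeat(seg, n_samp // n_seg).astype(complex)
    spec = np.fft.fft(t) / n_samp         # Fourier coefficients = order amplitudes
    return np.abs(spec[target]) ** 2

def cost(seg):
    I = order_intensities(seg)
    non_uniformity = (I.max() - I.min()) / (I.max() + I.min() + 1e-12)
    return non_uniformity + (1.0 - I.sum())           # plus an efficiency penalty

seg = rng.choice([-1.0, 1.0], size=n_seg)
best_seg, best_cost, temp = seg.copy(), cost(seg), 1.0
for _ in range(20000):
    trial = seg.copy()
    trial[rng.integers(n_seg)] *= -1.0                # flip one binary segment
    dc = cost(trial) - cost(seg)
    if dc < 0 or rng.random() < np.exp(-dc / temp):   # Metropolis acceptance rule
        seg = trial
        if cost(seg) < best_cost:
            best_seg, best_cost = seg.copy(), cost(seg)
    temp *= 0.9995                                    # slow exponential cooling

I = order_intensities(best_seg)
print(f"efficiency = {I.sum():.3f}, non-uniformity = {(I.max() - I.min()) / (I.max() + I.min()):.3f}")
```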


    Figure 10. Normalized energy distributions over the beam-splitting orders of the 1×20 (a) and 1×28 (b) DGs.


    Figure 11.Intensity and angle distribution of 20×28 2D DG. (a) Simulation result of intensity distribution versus different orders; (b) simulation result of diffraction angle versus diffraction order; (c) intensity map of the spot array captured in the experiment (the cross represents the centroid); (d) experimental results of normalized intensity distribution versus diffraction order.

    APPENDIX C: CONVOLUTIONAL RESULTS FOR TWO 8-BIT GRAYSCALE 180×224 LARGE MATRICES

    In principle, the OMica can achieve high computing power because of its truly parallel processing capability. Furthermore, the convolution of two 180×224 matrices was also demonstrated in the analog framework. The theoretical and experimental results, as well as the experimentally detected light distribution of the convolution, are shown in Figs. 12(a)–12(c). The relative errors defined above are shown in Fig. 12(e). The mean errors for the five groups of data computed using the OMica hardware were 10.87, 10.93, 11.12, 11.17, and 11.48, respectively. This low precision was mainly caused by alignment error, which could be significantly reduced using piezo actuators with resolutions in the nanometer range. Under this condition, a matrix scale of 200×200 implies that the peak computing power reaches 3.2×10^9 MAC operations when light passes through the system once.


    Figure 12.Experimental convolutional results for 180×224 matrices. (a)–(c) Theoretical convolutional results, experimental convolutional results, and experimental detection light distribution, respectively; (d) partially enlarged view of the experimental light spot on (c); (e) error distribution; (f) proportion of experimental light intensity distribution.

    APPENDIX D: CONFIGURATION OF THE CNN

    The configuration of the CNN model used in our experiment for the demonstration of handwritten digit recognition based on the MNIST dataset is shown in Fig. 13. This CNN contains five layers: a convolutional layer, a pooling layer, a nonlinear activation layer, and two fully connected layers. To achieve a higher recognition rate while avoiding overfitting, we set the learning rate to 0.05 and the training batch size to 50. The number of epochs was set to four, also to avoid overfitting. The activation function for the first layer was the rectified linear unit (ReLU) function, and 10 convolutional kernels of size 9×9 with binary element values of −1 or +1 were used. Owing to its simple derivative, the ReLU function trains faster than the sigmoid and tanh functions when the kernel weights are trained with the backpropagation algorithm. Because its derivative does not vanish for positive inputs, it can effectively alleviate the vanishing gradient problem and further reduce overfitting. The average pooling method was selected for the pooling layer because all the information in the feature map is averaged without losing too much of it. Because the image is binarized in advance, the foreground and background information in the feature map maintains a high resolution after average pooling. The first fully connected layer had 200 nodes, and its activation function was the ReLU function. The last fully connected layer had 10 nodes, and its activation function was the sigmoid function. Because the sigmoid function is used in the final layer for the classification task, we chose the cross-entropy loss function to avoid the vanishing gradient problem. For the 60,000-image training set, the total training time of the CNN was approximately 3 min (Intel Core i7-4790 CPU at 3.60 GHz); the recognition accuracy on the 1000-image test set was 96.7%, and that on the 10,000-image test set was 96.3%.
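    For reference, a hedged PyTorch sketch of this configuration is given below. The layer sizes follow the description above (10 binary 9×9 kernels, average pooling, ReLU, a 200-node fully connected layer, a 10-node sigmoid output, cross-entropy loss, learning rate 0.05, batch size 50, four epochs); the 2×2 pooling window, the sign binarization of the kernels, and the use of BCE as the cross-entropy on sigmoid outputs are our assumptions where the text is not explicit.

```python
import torch
import torch.nn as nn

class MnistBnnCnn(nn.Module):
    """CNN of Fig. 13 as we read it: conv -> avg pool -> ReLU -> FC(200) -> FC(10, sigmoid)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 10, kernel_size=9, bias=False)   # 1x28x28 -> 10x20x20
        self.pool = nn.AvgPool2d(2)                               # -> 10x10x10 (2x2 window assumed)
        self.fc1 = nn.Linear(10 * 10 * 10, 200)
        self.fc2 = nn.Linear(200, 10)

    def forward(self, x):
        x = torch.relu(self.pool(self.conv(x)))
        x = torch.relu(self.fc1(torch.flatten(x, 1)))
        return torch.sigmoid(self.fc2(x))

model = MnistBnnCnn()
with torch.no_grad():   # binarize kernels to -1/+1, standing in for the BNN training of [32,33]
    w = model.conv.weight
    w.copy_(torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w)))

optimizer = torch.optim.SGD(model.parameters(), lr=0.05)          # learning rate from the text
loss_fn = nn.BCELoss()                                            # cross-entropy on sigmoid outputs
x = torch.randint(0, 2, (50, 1, 28, 28)).float()                  # one batch of 50 binarized images
y = nn.functional.one_hot(torch.randint(0, 10, (50,)), 10).float()
loss = loss_fn(model(x), y)                                       # one illustrative training step
loss.backward()
optimizer.step()
print(float(loss))
```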


    Figure 13.Schematic of the CNN architecture.


    Figure 14.Learning curve of the CNN.

    APPENDIX E: INPUT-RELATED CROSS TALK

    Figure 15 shows the distribution of relative errors between the experimental and theoretical convolutional results for different digit inputs. These error maps clearly bear the characteristics of the input digits, which may be due to optical cross talk between different pixel channels. Optical cross talk is an important factor limiting the improvement of optical computing accuracy. However, for AI algorithms, if the deep learning model is trained directly on the optical computing system, this optical cross talk may actually help improve the recognition accuracy of the system. This result has implications for developing optical AI accelerators with high recognition accuracy.


    Figure 15. Typical error maps between the convolutional results obtained from the optical hardware and those obtained from a full-precision electrical computer, for different input handwritten digits (from 0 to 9) and the 10 convolutional kernels after encoding.

    APPENDIX F: SUMMARY OF DIFFERENT OPTICAL CONVOLUTIONAL ARCHITECTURES

    Table 1 summarizes various mainstream optical convolutional architectures (OIU, optical interference unit; MRs, microring resonators; OFC, optical frequency comb; PCM, phase-change material; D2NN, diffractive DNN). It has been shown that a precision of only about 4–5 bits is achieved by most of the photonic accelerators reported, although they work well for most machine learning tasks after retraining with noise. However, it has been verified empirically that, for most neural networks, inference models work nearly as well with 4–8 bits of precision, while training requires roughly 8–16 bits of precision per computation [41]. This is one important reason why most photonic accelerators have been used only for inference tasks. Beyond artificial neural networks, the OMica provides the ability to accelerate universal convolution computation and thus could find applications in many other fields, such as the simulation of optical imaging and multi-input multi-output systems.

    Summary of Different Optical Convolutional Architectures

    Architecture: OIUs and delay line. Principle: matrix–vector multiplication. Computing accuracy: ~5 bits. References: [16,18,24].
    • High integration and high modulation speed.
    • Limited by the integration of integrated photonic devices, it is difficult to realize the parallel convolution process of multiple convolutional kernels.

    Architecture: MRs, OFC, and PCM. Principle: matrix–vector multiplication. Computing accuracy: ~5 bits. References: [17,22].
    • High integration and high modulation speed.
    • The OFC can provide multi-wavelength light sources and timing modulation, and the system integration is higher.
    • Low power consumption using non-volatile PCM.
    • Complex electronic control and test configuration.

    Architecture: 4f filter. Principle: multiplication in the frequency domain equals convolution in the spatial domain. Computing accuracy: /. References: [28,29].
    • Object and spectrum are limited by the Fourier transform relationship; there is a trade-off between computing accuracy and computing size.
    • Configuration is very simple.

    Architecture: D2NN. Principle: diffraction. Computing accuracy: ~5 bits. References: [19,44].
    • High-precision 3D macro–nano structures are difficult to fabricate, and computational accuracy is limited.
    • High computing power.

    Architecture: Shadow casting. Principle: 2D matrix–matrix multiplication. Computing accuracy: /. References: [30,31,39].
    • A diffraction effect exists when matrix A is projected onto matrix B, and computing accuracy cannot be guaranteed.
    • Configuration is very simple.

    Architecture: OMica. Principle: 2D matrix–matrix convolution and multiplication. Computing accuracy: ~8 bits. References: this work.
    • The DG and object–image conjugation avoid diffraction effects through wavefront recombination.
    • The DG is a 2D DOE and is easy to manufacture; computing power can be expanded easily by using large-scale DGs.
    • Can work under incoherent light illumination and directly handle optical images.
    • Computational accuracy is high.

    Compared with the most popular scheme involving planar waveguides on a 2D substrate [17,18,22,24], the scheme of multiple cascading DOEs inherently takes full advantage of the 3D connection ability of optics and can thus achieve higher computing power in a single computing step. Recently, Xu et al. realized a photonic convolutional accelerator based on optical frequency combs [17], whose computing power is as high as tera operations per second (TOPS). The use of optical frequency combs to realize multi-wavelength light sources is remarkable progress; however, the scalability of this architecture is still limited by the number of channels of the optical frequency comb. Miscuglio et al. [29] proposed an optical system that performs fast updating of optical neural networks based on two amplitude-only DMDs, where one DMD is located at the Fourier transform plane of the other. Although the mapping relationship between the input images and the recognized digits can be successfully established using this method, the computed results are essentially not standard convolutions. Therefore, this method cannot be used for high-precision universal convolution computing. Moreover, it is difficult to align the two DMDs pixel by pixel, and because of the Fourier transform relationship between the input and filter planes, realizing large-scale optical networks will be difficult. Recently, Zhou et al. [44,45] demonstrated a reconfigurable scheme for realizing the 3D architecture with multiple cascading DOEs, using two programmable modulators, a DMD and a pure-phase SLM, for amplitude and phase modulation, respectively. Because of the coherent working mode, micrometer-sized pixels, alignment errors between the DMD and SLM, and alignment errors between different layers, achieving high computing precision is difficult; consequently, recognition is drastically degraded without adaptive training. Although this scheme performs well after adaptive training, it cannot be used for universal convolution computing because of its low precision.

    In contrast, because of the object–image conjugate relationship, a CMOS monitoring camera can be added to the plane conjugate to the two SLMs, making it simple to align the two SLMs with the monitoring camera. Additionally, an incoherent light source can be used in this architecture, which avoids the phase sensitivity and speckle noise of coherent illumination. More importantly, this configuration makes it possible to handle images directly from a lens under white-light illumination, which, to the best of our knowledge, is very challenging for all mainstream architectures.

    Therefore, the convolutional accelerator enabled by the OMica can be used to compute universal matrix convolutions, and the results obtained by this hybrid optical–electrical hardware can easily be transferred to any other computing platform, including photonic, hybrid optical–electrical, and traditional electrical processors or coprocessors. Because of its universality, this architecture can be used for building task-specific cloud computing centers or other AI accelerating centers, even with the present bulk optical system. In the future, with the advancement of nonlinear optical elements, a scheme based on the OMica could also be integrated into purely photonic accelerators by combining planar waveguides [46,47], metasurfaces [48–50], advanced modulator arrays, etc.

    References

    [1] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86, 2278-2324(1998).

    [2] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst., 25, 1097-1105(2013).

    [3] Y. LeCun, Y. Bengio, G. Hinton. Deep learning. Nature, 521, 436-444(2015).

    [4] J. Cong, B. Xiao. Minimizing computation in convolutional neural networks. International Conference on Artificial Neural Networks, 281-290(2014).

    [5] T. F. De Lima, H.-T. Peng, A. N. Tait, M. A. Nahmias, H. B. Miller, B. J. Shastri, P. R. Prucnal. Machine learning with neuromorphic photonics. J. Lightwave Technol., 37, 1515-1534(2019).

    [6] Y. Ito, R. Matsumiya, T. Endo. OOC-cuDNN: accommodating convolutional neural networks over GPU memory capacity. IEEE International Conference on Big Data, 183-192(2017).

    [7] K. He, X. Zhang, S. Ren, J. Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, 770-778(2016).

    [8] G. Wetzstein, A. Ozcan, S. Gigan, S. Fan, D. Englund, M. Soljačić, C. Denz, D. A. B. Miller, D. Psaltis. Inference in artificial intelligence with deep optics and photonics. Nature, 588, 39-47(2020).

    [9] B. J. Shastri, A. N. Tait, T. F. de Lima, W. H. P. Pernice, H. Bhaskaran, C. D. Wright, P. R. Prucnal. Photonics for artificial intelligence and neuromorphic computing. Nat. Photonics, 15, 102-114(2021).

    [10] P. Ambs. Optical computing: a 60-year adventure. Adv. Opt. Photon., 2010, 1-15(2010).

    [11] A. Maréchal, P. Croce. Un filtre de fréquences spatiales pour l’amélioration du contraste des images optiques. C. R. Acad. Sci., 237(1953).

    [12] L. De Marinis, M. Cococcioni, P. Castoldi, N. Andriolli. Photonic neural networks: a survey. IEEE Access, 7, 175827(2019).

    [13] P. R. Prucnal, B. J. Shastri. Neuromorphic Photonics(2017).

    [14] F. Thomas, B. J. Shastri, A. N. Tait, M. A. Nahmias, P. R. Prucnal. Progress in neuromorphic photonics. Nanophotonics, 6, 577-599(2017).

    [15] Q. Zhang, H. Yu, M. Barbiero, B. Wang, M. Gu. Artificial neural networks enabled by nanophotonics. Light Sci. Appl., 8, 42(2019).

    [16] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, M. Soljačić. Deep learning with coherent nanophotonic circuits. Nat. Photonics, 11, 441-446(2017).

    [17] X. Xu, M. Tan, B. Corcoran, J. Wu, A. Boes, T. G. Nguyen, S. T. Chu, B. E. Little, D. G. Hicks, R. Morandotti, A. Mitchell, D. J. Moss. 11 TOPS photonic convolutional accelerator for optical neural networks. Nature, 589, 44-51(2021).

    [18] H. Bagherian, S. Skirlo, Y. Shen, H. Meng, V. Ceperic, M. Soljacic. On-chip optical convolutional neural networks. arXiv(2018).

    [19] X. Lin, Y. Rivenson, N. T. Yardimci, M. Veli, Y. Luo, M. Jarrahi, A. Ozcan. All-optical machine learning using diffractive deep neural networks. Science, 361, 1004-1008(2018).

    [20] A. Silva, F. Monticone, G. Castaldi, V. Galdi, A. Alù, N. Engheta. Performing mathematical operations with metamaterials. Science, 343, 160-163(2014).

    [21] J. Feldmann, N. Youngblood, C. D. Wright, H. Bhaskaran, W. H. P. Pernice. All-optical spiking neurosynaptic networks with self-learning capabilities. Nature, 569, 208-214(2019).

    [22] J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. Stappers, M. Le Gallo, X. Fu, A. Lukashchuk, A. S. Raja, J. Liu, C. D. Wright, A. Sebastian, T. J. Kippenberg, W. H. P. Pernice, H. Bhaskaran. Parallel convolutional processing using an integrated photonic tensor core. Nature, 589, 52-58(2021).

    [23] F. Ashtiani, A. J. Geers, F. Aflatouni. An on-chip photonic deep neural network for image classification. Nature, 606, 501-506(2022).

    [24] S. Xu, J. Wang, R. Wang, J. Chen, W. Zou. High-accuracy optical convolution unit architecture for convolutional neural networks by cascaded acousto-optical modulator arrays. Opt. Express, 27, 19778-19787(2019).

    [25] H. Dammann, E. Klotz. Coherent optical generation and inspection of two-dimensional periodic structures. Opt. Acta, 24, 505-515(1977).

    [26] C. Zhou, L. Liu. Numerical study of Dammann array illuminators. Appl. Opt., 34, 5961-5969(1995).

    [27] J. Yu, C. Zhou, W. Jia, W. Cao, S. Wang, J. Ma, H. Cao. Three-dimensional Dammann array. Appl. Opt., 51, 1619-1630(2012).

    [28] J. Chang, V. Sitzmann, X. Dun, W. Heidrich, G. Wetzstein. Hybrid optical-electronic convolutional neural networks with optimized diffractive optics for image classification. Sci. Rep., 8, 12324(2018).

    [29] M. Miscuglio, Z. Hu, S. Li, J. K. Grorge, R. Capanna, H. Dalir, P. M. Bardet, P. Gupta, V. J. Sorger. Massively parallel amplitude-only Fourier neural network. Optica, 7, 1812-1819(2020).

    [30] C. Zhou, L. Liu, Z. Wang. Binary-encoded vector–matrix multiplication architecture. Opt. Lett., 17, 1800-1802(1992).

    [31] L. Liu, G. Li, Y. Yin. Optical complex matrix–vector multiplication with negative binary inner products. Opt. Lett., 19, 1759-1761(1994).

    [32] H. Qin, R. Gong, X. Liu, X. Bai, J. Song, N. Sebe. Binary neural networks: a survey. Pattern Recogn., 105, 107281(2020).

    [33] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, Y. Bengio. Binarized neural networks training neural networks with weights and activations constrained to +1 or −1. arXiv(2016).

    [34] C. Zhou, J. Yu, G. Li, G. Ma. Roadmap of optical computing. Proc. SPIE, 11898, 118981B(2021).

    [35] https://www.nvidia.com/en-us/deep-learning-ai/products/titan-rtx/

    [36] P. Minzioni, C. Lacava, T. Tanabe, J. Dong, X. Hu, G. Csaba, W. Porod, G. Singh, A. E. Willner, A. Almaiman, V. Torres-Company, J. Schröder, A. C. Peacock, M. J. Strain, F. Parmigiani, G. Contestabile, D. Marpaung, Z. Liu, J. E. Bowers, L. Chang, S. Fabbri, M. R. Vázquez, V. Bharadwaj, S. M. Eaton, P. Lodahl, X. Zhang, B. J. Eggleton, W. J. Munro, K. Nemoto, O. Morin, J. Laurat, J. Nunn. Roadmap on all-optical processing. J. Opt., 21, 063001(2019).

    [37] J. Wang, J.-Y. Yang, I. M. Fazal, N. Ahmed, Y. Yan, H. Huang, Y. Ren, Y. Yue, S. Dolinar, M. Tur, A. E. Willner. Terabit free-space data transmission employing orbital angular momentum multiplexing. Nat. Photonics, 6, 488-496(2012).

    [38] https://www.top500.org/system/180047/

    [39] T. Wang, S.-Y. Ma, L. G. Wright, T. Onodera, B. C. Richard, P. L. McMahon. An optical neural network using less than 1 photon per multiplication. Nat. Commun., 13, 123(2022).

    [40] S. Gupta, A. Agrawal, K. Gopalakrishnan, P. Narayanan. Deep learning with limited numerical precision. arXiv(2015).

    [41] M. A. Nahmias, T. F. de Lima, A. N. Tait, H.-T. Peng, B. J. Shastri, P. P. Prucnal. Photonic multiply-accumulate operations for neural networks. IEEE J. Quantum Electron., 26, 7701518(2020).

    [42] J. Han, A. Jentzen, E. Weinan. Solving high-dimensional partial differential equations using deep learning. Proc. Natl. Acad. Sci. USA, 115, 8505-8510(2018).

    [43] L. Mennel, J. Symonowicz, S. Wachter, D. K. Polyushkin, A. J. Molina-Mendoza, T. Mueller. Ultrafast machine vision with 2D material neural network image sensors. Nature, 579, 62-66(2020).

    [44] T. Zhou, X. Lin, J. Wu, Y. Chen, H. Xie, Y. Li, J. Fan, H. Wu, L. Fang, Q. Dai. Large-scale neuromorphic optoelectronic computing with a reconfigurable diffractive processing unit. Nat. Photonics, 15, 367-373(2021).

    [45] T. Zhou, L. Fang, T. Yan, J. Wu, Y. Li, J. Fan, H. Wu, X. Lin, Q. Dai. In situ optical backpropagation training of diffractive optical neural networks. Photon. Res., 8, 940-953(2020).

    [46] M. Gruber. Multichip module with planar-integrated free-space optical vector-matrix-type interconnects. Appl. Opt., 43, 463-470(2004).

    [47] G. Mínguez-Vega, M. Gruber, J. Jahns, J. Lancis. Achromatic optical Fourier transformer with planar-integrated free-space optics. Appl. Opt., 44, 229-235(2005).

    [48] Y. Zhang, C. Fowler, J. Liang, B. Azhar, M. Y. Shalaginov, S. Deckoff-Jones, S. An, J. B. Chou, C. M. Roberts, V. Liberman, M. Kang, C. Ríos, K. A. Richardson, C. Rivero-Baleine, T. Gu, H. Zhang, J. Hu. Electrically reconfigurable non-volatile metasurface using low-loss optical phase-change material. Nat. Nanotechnol., 3, 661-666(2021).

    [49] Z. Wu, M. Zhou, E. Khoram, B. Liu, Z. Yu. Neuromorphic metasurface. Photon. Res., 8, 46-50(2020).

    [50] H. Kwon, D. Sounas, A. Cordaro, A. Polman, A. Alù. Nonlocal metasurfaces for optical signal processing. Phys. Rev. Lett., 121, 173004(2018).
