• Photonics Research
  • Vol. 11, Issue 2, 299 (2023)
Guoqing Ma1,2, Junjie Yu1,2,3, Rongwei Zhu1,2, and Changhe Zhou1,2,*
Author Affiliations
  • 1Laboratory of Information Optics and Optoelectronic Technology, Shanghai Institute of Optics and Fine Mechanics, Chinese Academy of Sciences, Shanghai 201800, China
  • 2Center of Materials Science and Optoelectronics Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
  • 3e-mail: Junjiey@siom.ac.cn
    DOI: 10.1364/PRJ.472741
    Guoqing Ma, Junjie Yu, Rongwei Zhu, Changhe Zhou. Optical multi-imaging–casting accelerator for fully parallel universal convolution computing[J]. Photonics Research, 2023, 11(2): 299

    Abstract

    Recently, optical computing has emerged as a potential solution to computationally heavy convolution, aiming to accelerate various large-scale science and engineering tasks. Based on an optical multi-imaging–casting architecture, we propose a paradigm for a universal optical convolutional accelerator with truly massive parallelism and high precision. A two-dimensional Dammann grating is the key element for generating multiple displaced images of the kernel, which is the core process for kernel sliding over the convolved matrix in this optical convolutional architecture. Our experimental results indicate that the computing accuracy is typically about 8 bits, and this accuracy could be improved further if high-contrast modulators are used. Moreover, a hybrid analog–digital coding method is demonstrated to improve computing accuracy. Additionally, a convolutional neural network for the standard MNIST dataset is demonstrated, with recognition accuracy for inference reaching 97.3%. Since this architecture can function under incoherent light illumination, this scheme will provide opportunities for handling white-light images directly from lenses without photoelectric conversion, in addition to serving as a convolutional accelerator.

    1. INTRODUCTION

    A convolutional neural network (CNN), as "convolutional" implies, involves extensive convolution operations among neighboring layers, followed by batch normalization and nonlinear activation for the expected performance [1–3]. Remarkably, these massive linear matrix multiply–accumulate (MAC) operations account for more than 80% of the total number of deep neural network (DNN) calculations [4]. However, the convolution operation, which is ill-suited to modern advanced electronic serial processors, is becoming the biggest burden for high-performance computing tasks, particularly for artificial intelligence (AI) algorithms. Furthermore, as the scale of the matrix increases, so does the computational overhead of convolution operations. It has been demonstrated that the amount of computing power required to train state-of-the-art DNNs doubles every 3.5 months [5], far exceeding the growth of traditional electrical integrated circuits (EICs) following Moore's law. Although parallel electrical coprocessors such as graphics processing units (GPUs) and tensor processing units (TPUs) can accelerate the convolution calculation, it is still difficult in practice to handle millions of MAC operations in a fully parallel manner for DNNs [6,7]. In contrast, it has been proven that many MAC operations can be executed concurrently during a single pass of light, and this may be the prime motivation for the recent interest in optical computing [8,9]. Photonic solutions for computing have been investigated for at least 70 years [10,11]. However, compared with fast-growing EICs, the development of optical computing gradually slowed in the late 2000s [12], owing to a lack of application-driven motivation and adequate optical computing architectures.

    Recently, due to the remarkable achievements in AI, there has been renewed interest in attempting to improve computing power, energy efficiency, and processing speed by exploiting photonic or hybrid optical–electric processors rather than their electronic counterparts [13–15]. Two mainstream optical computing architectures have been rapidly developed. The first is based on a planar waveguide on a two-dimensional (2D) substrate [16–18], whereas the second is realized by multiple cascading diffractive optical elements (DOEs) in three-dimensional (3D) space [19,20]. However, planar architecture, which includes Mach–Zehnder interferometers [16], microring resonators [21,22], waveguide modulators [23], and acousto-optical modulators [24], does not fully use the 3D interconnectivity of optics, whereas 3D architecture requires full manipulation of the electromagnetic field with high precision, and fabricating large-sized and high-precision subwavelength DOEs in 3D space is still difficult [19,20].

    Despite predictions that photonic processors could be at least 10,000 times faster than state-of-the-art EICs [13,14], the past schemes have not realized fully parallel convolution computing compared with their electronic counterparts, particularly when high precision is required. Here, we propose a new paradigm for a universal convolutional accelerator with full parallelism and adequate precision based on optical multi-imaging–casting architecture (OMica), capable of calculating arbitrarily encoded hybrid analog–digital matrix convolutions. The architecture can be viewed as the starting point for a new roadmap for optical computing, with the potential for building fully massively parallelized optical convolutional accelerators to overcome the intrinsic computing power shortage and unsatisfactory energy efficiency of EICs. Furthermore, the incoherent illumination implies the possibility of handling white-light images directly from lenses without traditional photoelectric conversion, promising to fully exploit the benefits of AI algorithms or accelerate other practical applications where rapid big data processing is desired.

    2. PRINCIPLE OF OMica

    A. Optical Multi-Imaging–Casting Architecture

    The OMica architecture, as depicted in Fig. 1, employs an incident modulated light beam (matrix A) and a spatial light modulator (SLM) (matrix B), as well as a confocal 4f system with a diffractive beam splitter (BS) and another focusing system with a photodetector (matrix C). The planes of matrices A and B, the confocal plane of the 4f system, and the plane of the detector are all in a conjugated object–image relationship with each other. When a BS, such as a Dammann grating (DG) [25–27], is placed behind the plane of matrix A, the two pairs of imaging–casting relationships mentioned above still hold. With the DG inserted, the optical signal carrying the information of matrix A is duplicated into multiple diffraction orders with excellent uniformity, owing to the properties of the DG. The different diffraction orders inherently have different angular spectral components (θ1 and θ2), yet they all carry the same information as matrix A, as shown in Fig. 1(c). This implies that matrix A is multiplexed over the spatial pattern. When a pinhole placed in the confocal plane passes only one of the diffraction orders, the image corresponding to that order can be seen clearly on the plane of matrix B through lens L2 (as shown in Appendix A and Fig. 9). Because these diffraction orders have different diffraction angles (θ1 and θ2), the images of the diffraction orders on the plane of matrix B are displaced relative to one another when each order is passed through the pinhole in turn. Thus, as shown in Fig. 1(c), all images are aligned by adjusting the distance d between the DG and matrix A, according to the paraxial relation

    s = l (f1/f2) tan θ,  (1)

    where s is the convolutional stride, f1 and f2 are the focal lengths of L1 and L2, respectively, l is the distance between matrix B and the image of the BS (see Fig. 1), and θ is the angle difference between any two adjacent diffraction orders. Noting that the image of the DG lies at a distance l = (f2/f1)^2 d from the plane of matrix B (the longitudinal magnification of the 4f system is the square of its transverse magnification), and using the grating equation θm = arcsin(mλ/Λ) together with the approximations θm+1 − θm ≈ θ and tan θ ≈ sin θ, Eq. (1) can be rewritten as

    s = d (f2/f1) (λ/Λ),  (2)

    where θm is the diffraction angle of the mth order of the DG, Λ is the grating period, and λ is the wavelength. Therefore, s can also be adjusted to adapt to different convolutional strides by changing d [Figs. 1(a) and 1(b)].
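    To make the stride relation concrete, the following minimal sketch evaluates Eq. (2) and its inverse; the focal lengths, wavelength, grating period, and target stride in the example are illustrative placeholders, not the experimental values.

```python
# Minimal numeric sketch of Eq. (2): s = d * (f2 / f1) * (lambda / Lambda).
# All numbers below are hypothetical placeholders, not the experimental values.

def stride_from_distance(d, f1, f2, wavelength, period):
    """Lateral shift s between adjacent diffraction-order images on the matrix-B
    plane for a DG placed a distance d behind matrix A (d, s, f1, f2 in the same
    length units; wavelength and period in the same units as each other)."""
    return d * (f2 / f1) * (wavelength / period)

def distance_for_stride(s, f1, f2, wavelength, period):
    """Distance d needed to realize a target convolutional stride s."""
    return s * (f1 / f2) * (period / wavelength)

if __name__ == "__main__":
    f1, f2 = 100.0, 200.0          # focal lengths of L1 and L2 (mm), assumed
    lam, period = 0.45e-3, 50e-3   # wavelength and DG period (mm), assumed
    d = distance_for_stride(0.2, f1, f2, lam, period)   # target s = 0.2 mm (assumed SLM pixel pitch)
    print(f"d = {d:.2f} mm gives s = {stride_from_distance(d, f1, f2, lam, period):.3f} mm")
```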


    Figure 1. Schematic of the optical multi-imaging–casting architecture: optical parallel convolution process with different convolutional strides s1 (a) and s2 (b); (c) optical architecture principle of the OMica, where the beam splitter (BS) is a diffractive beam splitter. Oy is the diffraction order in the y direction (indicated by different line types), and θ is the angle difference between any two adjacent diffraction orders in object space (θ1 and θ2 are the diffraction angles of the Oy=1 and Oy=2 orders, respectively); θ′ is the angle difference in image space (θ1′ and θ2′ are the corresponding angles in image space). d is the distance between matrix A and the BS, and l is the distance between matrix B and the image of the BS. a, b, and c are spot arrays corresponding to different diffraction orders diffracted from the BS. The imaging–casting system is composed of L1 and L2, with focal lengths f1 and f2. L3 is a focusing lens with focal length f3. s is the lateral shift of the images of the diffraction orders of the DG on the SLM2 plane, corresponding to the convolutional stride; this stride can be tuned by changing the distance d [s1 and s2 correspond to the different convolutional strides shown in (a) and (b)].

    Because of the conjugation relationship and the different angles, the images of all diffraction orders are superimposed on the matrix B plane with naturally shifted displacements when the pinhole is removed. This means that the SLM can modulate these shifted images simultaneously; that is, all multiplications between the multiple images of matrix A and matrix B can be implemented in parallel. These products are then summed through L3 and separated from each other on the C plane owing to their angular spectrum differences. Therefore, the convolution of the two matrices can be performed in parallel after the light passes through the system once. This process is a direct optical implementation of mathematical convolution, i.e., C = A ∗ B, where "∗" denotes the convolution operator. Owing to the object–image conjugate configuration, the OMica proposed here avoids the trade-off between the element sizes of the matrix in the spatial and frequency domains that arises in the 4f optical convolutional system [28,29], allowing massive parallelism with sufficiently high accuracy. Moreover, because of the object–image conjugate configuration, the OMica can work under both coherent and incoherent light illumination. Thus, this optical hardware can handle white-light images directly from lenses without traditional photoelectric conversion if achromatic lenses are used as the projection system.
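    The parallel multiply–accumulate picture described above can be emulated numerically. The toy sketch below (our own illustration, not the authors' code) treats each diffraction order as a laterally shifted copy of kernel A that is multiplied element-wise with matrix B and then summed into one detector spot; the resulting spot array is the sliding-kernel MAC result, i.e., the "convolution" in the CNN sense.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Toy emulation of multi-imaging casting: every diffraction order casts one
# shifted copy of kernel A onto the matrix-B plane; the SLM multiplies them
# element-wise, and lens L3 sums each order's product into one detector spot.

rng = np.random.default_rng(0)
A = rng.integers(0, 4, size=(3, 3)).astype(float)    # kernel (nonnegative intensities)
B = rng.integers(0, 4, size=(8, 8)).astype(float)    # convolved matrix

out_h, out_w = B.shape[0] - A.shape[0] + 1, B.shape[1] - A.shape[1] + 1
C = np.zeros((out_h, out_w))
for i in range(out_h):            # order index along y -> lateral shift i*s
    for j in range(out_w):        # order index along x -> lateral shift j*s
        patch = B[i:i + A.shape[0], j:j + A.shape[1]]
        C[i, j] = np.sum(patch * A)   # one spot = sum over one shifted product

# Reference: the same sliding-kernel MAC computed with a windowed formulation.
ref = np.einsum('ijkl,kl->ij', sliding_window_view(B, A.shape), A)
assert np.allclose(C, ref)
print(C.astype(int))
```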

    B. Negative Matrix Coding Method

    In our proof-of-concept implementation, a homemade 2D 28×20 DG (see details in Appendix B) was inserted into a 4f system. Two amplitude-only SLMs (8-bit grayscale) are located on the object and image planes of the 4f system, where the two convolution matrices are loaded sequentially. In the experiment, light intensity was used as the information carrier, and the two SLMs were used to load the information of matrix B and matrix A onto the incident uniform light beam. Therefore, in principle, only nonnegative matrices can be loaded and calculated on this hardware. To address this limitation, a negative matrix encoding method for hybrid analog–digital optical convolution computing was developed. In a hybrid analog–digital framework, a grayscale matrix with negative elements can easily be decomposed into one larger-scale matrix or into several same-size negabinary digit (NBD) matrices, arranged in a spatial or a temporal sequence, respectively [30,31]. In other words, each decimal element of the original matrix can be converted into its NBD representation as

    (a)10 = Σ_{i=0}^{⌈N/k⌉} ci (−2^k)^i,  (3)

    where {c⌈N/k⌉, c⌈N/k⌉−1, …, c0} are the NBD bits ci, with ci ∈ [0, 2^k − 1]; N is the maximum number of NBD bits, k is an integer, and "⌈·⌉" denotes rounding up to the nearest integer. Following this decomposition, a grayscale matrix with negative elements is transformed into a larger matrix arranged spatially, or into several same-sized matrices in a temporal series, represented by ⌈N/k⌉ nonnegative bits, allowing these matrices to be loaded directly on the SLMs. The principle of this encoding method is depicted schematically in Fig. 2. Notably, there is a trade-off between computing precision and computing power, which can be adjusted by varying the parameter k. A small k yields high precision with low computing power, whereas a large k yields high computing power with relatively low precision. Therefore, this encoding method can improve computing precision to a certain extent compared with pure-analog optical convolution computing [30,31].
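    As a concrete illustration of Eq. (3), the sketch below encodes and decodes single grayscale values in the negabinary-digit representation (base −2^k with nonnegative digits); the helper names and the number of digits are our own choices for the example.

```python
# Hedged sketch of the NBD representation of Eq. (3): a = sum_i c_i * (-2**k)**i,
# with nonnegative digits c_i in [0, 2**k - 1], so every digit can be loaded on
# an intensity-only SLM. Function names are illustrative, not from the paper.

def encode_nbd(a, k=1, n_digits=None):
    """Return the NBD digits of integer a, least significant first."""
    base = -(2 ** k)
    digits = []
    while a != 0 or not digits:
        a, r = divmod(a, base)
        if r < 0:                      # force the digit into [0, 2**k - 1]
            r -= base
            a += 1
        digits.append(r)
    if n_digits is not None:
        digits += [0] * (n_digits - len(digits))
    return digits

def decode_nbd(digits, k=1):
    base = -(2 ** k)
    return sum(c * base ** i for i, c in enumerate(digits))

# Reproduce the worked example from the text: -2 -> {c2, c1, c0} = {0, 1, 0}.
assert encode_nbd(-2, k=1, n_digits=3) == [0, 1, 0]        # stored as [c0, c1, c2]
assert all(decode_nbd(encode_nbd(a, k=1)) == a for a in range(-2, 6))
```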


    Figure 2.Procedure of converting the original grayscale matrix with negative elements into encoded matrices of NBD. (a) The encoding matrices are loaded into the OMica system to compute the convolution, with the experimental encoded convolutional result decoded into the original matrix. (b) Original grayscale matrices A and B, and original convolutional results matrix C. (c) Larger encoded matrices A and B in spatial sequence and the same size encoded convolutional results matrix C.

    Here, as an example, the encoding process for a grayscale matrix with negative elements ranging from −2 to 5 is demonstrated step by step under the condition k=1. As shown in Figs. 2(b) and 2(c), the grayscale value of each element of the original matrix is expressed as multiple NBDs after encoding. For example, the first element of the original matrix A is written as −2 = 0×(−2)^2 + 1×(−2)^1 + 0×(−2)^0. The elements of the matrix are therefore arranged in rows after encoding, denoted as P1, P2, and P3, and each element in the column direction is encoded with three NBDs, denoted as Bit3, Bit2, and Bit1, as shown in Fig. 2(c). Thus, the first element, −2, is expressed as {010} in the first column of the encoded matrix; that is, c2=0, c1=1, and c0=0. Subsequently, the converted matrices are loaded onto the SLMs in a spatial sequence for computing [Fig. 2(c)]. Notably, to avoid aliasing in a spatial sequence, zero elements should be inserted into the encoded matrix between two adjacent rows or columns of the original high-bit matrix, where the number of zero elements is ⌈N/k⌉ − 1. Here, the physical pixels of the SLMs are not fully used because of the redundant zero elements. The computational advantage can be realized only by increasing the matrix scale, but doing so significantly slows the system's effective refresh rate because the convolution must be performed among all bits of either matrix A or B. Therefore, when the OMica is used for computing acceleration, a compromise should be struck between high computing power and high computing precision by choosing an appropriate parameter k.
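    The decoding step that recovers the signed convolution from the nonnegative bit-plane convolutions follows from Eq. (3) by linearity: A ∗ B = Σ_{i,j} (−2^k)^(i+j) (Ai ∗ Bj). The sketch below verifies this relation with the bit planes convolved one pair at a time (the temporal-sequence variant); the spatial-sequence layout of Fig. 2 instead interleaves the same bit planes on a single SLM frame with ⌈N/k⌉ − 1 zero rows or columns inserted. The helper functions and matrix sizes are our own illustrative choices.

```python
import numpy as np

def nbd_planes(M, k=1, n_digits=3):
    """Decompose an integer matrix into nonnegative digit planes of base -(2**k)."""
    base = -(2 ** k)
    planes, M = [], M.copy()
    for _ in range(n_digits):
        r = np.mod(M, 2 ** k)              # nonnegative remainders in [0, 2**k - 1]
        planes.append(r)
        M = (M - r) // base
    assert np.all(M == 0), "n_digits too small for this value range"
    return planes                           # least significant plane first

def conv_full(A, B):
    """Plain 'full' sliding-window MAC, standing in for one optical pass."""
    ph, pw = A.shape[0] - 1, A.shape[1] - 1
    Bp = np.pad(B, ((ph, ph), (pw, pw)))
    out = np.zeros((B.shape[0] + ph, B.shape[1] + pw))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(Bp[i:i + A.shape[0], j:j + A.shape[1]] * A)
    return out

rng = np.random.default_rng(1)
A = rng.integers(-2, 6, size=(2, 10))       # grayscale matrices with negative elements
B = rng.integers(-2, 6, size=(2, 10))
k, base = 1, -2
C = np.zeros((A.shape[0] + B.shape[0] - 1, A.shape[1] + B.shape[1] - 1))
for i, Ai in enumerate(nbd_planes(A, k)):
    for j, Bj in enumerate(nbd_planes(B, k)):
        C += (base ** (i + j)) * conv_full(Ai, Bj)   # decode the nonnegative partial results
assert np.allclose(C, conv_full(A, B))               # matches the signed convolution
print(C.astype(int))
```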

    3. EXPERIMENTAL RESULTS

    A. Hybrid Analog–Digital Matrix Convolution

    As an example, the hybrid analog–digital optical convolution of two randomly generated 2-bit grayscale 3×10 matrices, A1 and B1, with elements in the range of 0 to 3, and two negabinary 3-bit grayscale 2×10 matrices, A2 and B2, with negative elements in the range of −2 to 5, is demonstrated; the convolutional results are shown in Fig. 3. In each box, the light intensity distributions of the spot arrays on the detected plane, denoting the raw results of convolution, are shown in the first subfigure of the first row. The theoretical results obtained by an electric computer (full precision, 64 bits) are shown in the second subfigure, and the experimental results before decoding are shown in the third subfigure. The absolute error map, shown in the first subfigure of the second row, is defined as

    AE = |Ctheo − Cexp|,  (4)

    where Ctheo and Cexp are the theoretical and experimental convolutional results, respectively, and "|·|" denotes the absolute value. Additionally, the theoretical and experimental results of the convolution after decoding are shown in the second and third subfigures of the second row, respectively. The overall trend of the experimental results of the convolution is consistent with that of the theoretical results.
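    The error analysis above can be summarized in a few lines: the absolute error of Eq. (4) is evaluated per spot, and because the encoded (pre-decoding) convolution values are integers, rounding recovers the exact result whenever the maximum AE is below 0.5. The arrays in the sketch below are made-up placeholders, not the measured data.

```python
import numpy as np

C_theo = np.array([[3., 5., 2.], [4., 1., 6.]])     # integer-valued theoretical convolution (placeholder)
C_exp = C_theo + np.random.default_rng(2).uniform(-0.3, 0.3, C_theo.shape)  # mock measurement

AE = np.abs(C_theo - C_exp)                          # Eq. (4), element-wise
print(f"mean AE = {AE.mean():.3f}, max AE = {AE.max():.3f}")
if AE.max() < 0.5:                                   # every spot closer than half a quantization level
    assert np.array_equal(np.rint(C_exp), C_theo)    # 100% accuracy after digitization
```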


    Figure 3.Experimental results of hybrid analog–digital matrix convolution for two groups of matrices based on spatial sequence encoding. The subfigures from left to right are the light intensity distribution of the spot array denoting the convolution, theoretical convolutional values, experimental convolutional results, error map between theoretical and experimental results, and decoded convolutional results, respectively, in (a) matrices A1 and B1 and (b) matrices A2 and B2. The red cross marks the centroid positions of each spot.

    Figures 3(a) and 3(b) show the results of the convolution of the two matrix pairs, A1, B1 and A2, B2, respectively. The mean values of the absolute error AE are 0.114 and 0.08, and the maximum values are approximately 0.239 and 0.145, respectively, before decoding, indicating that the optical convolutional architecture achieves high precision. It should be noted that the former has a higher mean error before decoding than the latter, owing to increased cross talk caused by relatively large convolutional elements. Moreover, in the spatial coding method, the two encoded matrices are filled with zero elements to avoid aliasing, which further reduces the cross talk and the final error. Because the maximum absolute errors for the two cases are both less than 0.5, the correct convolutional results, with 100% accuracy, can still be obtained after digitization. Thus, the experimental light intensity distributions of the two cases precisely reflect the values of the convolutional results.

    B. High-Accuracy Matrix Convolution

    As an example, the high-accuracy optical convolution of randomly generated 8-bit grayscale 10×10 matrices A3 and B3 and 20×20 matrices A4 and B4, with elements in the range of 0 to 255, is demonstrated. Figure 4 compares the experimental results of the optical convolution of matrices A3, A4 and matrices B3, B4 with the theoretical results. In each box, the light intensity distributions of the spot arrays on the detected plane, denoting the raw results of convolution, are shown in the first subfigure. The theoretical results obtained using an electric computer (full precision, 64 bits) are shown in the second subfigure, and the experimental results are shown in the third subfigure. The relative error is defined as

    RE = |Cexp − Ctheo| / (|Cmax − Cmin| / 256),  (5)

    where Cexp is the experimental convolution, Ctheo is the theoretical convolution, and Cmax and Cmin are the maximum and minimum values of the theoretical convolution, respectively; "|·|" denotes the absolute value. This relative error indicates that 8-bit precision is obtained whenever its value is less than one.
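    A short sketch of the relative-error metric of Eq. (5): the error is measured in units of one 8-bit quantization step of the theoretical dynamic range, so a value below one at a given spot means that spot is accurate to 8 bits. The arrays are again placeholders.

```python
import numpy as np

def relative_error(C_exp, C_theo):
    """RE of Eq. (5): error in units of 1/256 of the theoretical dynamic range."""
    lsb = np.abs(C_theo.max() - C_theo.min()) / 256.0
    return np.abs(C_exp - C_theo) / lsb

rng = np.random.default_rng(3)
C_theo = rng.integers(0, 2_000_000, size=(19, 19)).astype(float)   # mock convolution values
noise = rng.normal(0.0, 0.4, C_theo.shape)                          # mock measurement noise, in LSB units
C_exp = C_theo + noise * (C_theo.max() - C_theo.min()) / 256.0

RE = relative_error(C_exp, C_theo)
print(f"mean RE = {RE.mean():.3f}, spots within 8-bit accuracy = {(RE < 1).mean():.1%}")
```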

    Figures 4(a) and 4(b) show the results of the convolution of matrices A3, B3 and matrices A4, B4, respectively. The overall trend of the experimental results of the convolution is very consistent with that of the theoretical results. After further assessment, the mean values of the relative error RE are 0.424 and 0.39, and the maximum values are 2.258 and 1.293, respectively. Also, from these error maps, one can see that the relative errors for most points [98.06% and 97.25% in Figs. 4(a) and 4(b), respectively] are less than one, indicating that the computing accuracy is very close to 8 bits, which is high enough for most AI inference tasks and at least some training tasks. Additionally, experimental results for larger-scale matrices are demonstrated in Appendix C.


    Figure 4. Experimental results of high-accuracy convolution for two groups of grayscale matrices: (a) randomly generated 8-bit grayscale 10×10 matrices A3 and B3; (b) 8-bit grayscale 20×20 matrices A4 and B4. The subfigures from left to right show the light intensity distribution of the spot array denoting the convolution, the theoretical convolutional values, the experimental convolutional results, the error map between theoretical and experimental results (a red circle indicates that the computing accuracy at that point is less than 8 bits), the histogram of the error distribution, and a comparison of the experimental convolutional results, expanded into one-dimensional (1D) vectors, with the theoretical convolutional results.

    4. OPTICAL CNN INFERENCE TASKS BASED ON MNIST

    With its ability to accelerate universal convolutional computation, the OMica could find applications in a variety of fields where dense convolutions are involved, such as the simulation of optical imaging, multi-input multi-output systems, and the training and inference of CNNs. As an example, we demonstrate inference tasks for the recognition of handwritten digits based on the OMica, using the above-mentioned negative matrix coding method and hybrid analog–digital matrix convolution (see details of the CNN in Appendix D). Here, a binary neural network (BNN) [32] is implemented as an example to test the robustness and accuracy of the proposed optical hardware. For a BNN, the input signal is a nonnegative binary (0 or 1) image, and the kernel is a signed binary matrix (−1 or +1) [33]. Each kernel of the BNN, trained in advance, is encoded into two identical-sized nonnegative matrices, one of which is a low-bit (positive) matrix and the other a high-bit (negative) matrix, as shown in Fig. 5(a). Intuitively, it seems that two convolution operations should then be executed in a temporal sequence. Remarkably, however, the 10 original kernels need to be divided into only 10 high-bit sub-kernels plus a single shared low-bit sub-kernel, because the low-bit sub-kernels are all the same. Furthermore, the first high-bit sub-kernel and the low-bit sub-kernel are the same, with unity transmittance. Thus, the total number of convolutional kernels after encoding is still 10, implying that no additional computational overhead is incurred. Figure 5(b) shows the inference process of the CNN based on the encoded low- and high-bit kernels. The 10 encoded kernels are sequentially loaded onto the SLM located at the input plane of matrix A, and the binary input images with a scale of 28×28 are sequentially loaded onto the SLM at the input plane of matrix B. When light passes through the two SLMs in sequence and is then focused and separated by the focusing lens, the detector on the focal plane captures the spot array denoting the convolutional results. Finally, the original convolutional results are obtained by decoding the corresponding low- and high-bit convolutions: the final result is recovered by adding the positive (low-bit) convolution to the negative (high-bit) convolution weighted by −2.
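    The kernel-splitting step, as we read it from Fig. 5(a) and the k = 1 negabinary code, amounts to writing a signed binary kernel K as K = K_low + (−2)·K_high, where K_low is the all-ones (unity-transmittance) sub-kernel and K_high marks the −1 entries; both sub-kernels are nonnegative and can be loaded on an intensity-only SLM, and the two measured convolutions are recombined digitally. The sketch below checks this decomposition on random data; it is our own illustration, not the authors' code.

```python
import numpy as np

def split_signed_binary_kernel(K):
    """Split a -1/+1 kernel into nonnegative sub-kernels: K = K_low - 2 * K_high."""
    K_low = np.ones_like(K)            # c0 bit: always 1 for both +1 and -1
    K_high = (K < 0).astype(K.dtype)   # c1 bit: 1 only where the element is -1
    assert np.array_equal(K, K_low - 2 * K_high)
    return K_low, K_high

def correlate_valid(K, X):
    """Sliding-window MAC standing in for one optical pass of the OMica."""
    oh, ow = X.shape[0] - K.shape[0] + 1, X.shape[1] - K.shape[1] + 1
    return np.array([[np.sum(X[i:i + K.shape[0], j:j + K.shape[1]] * K)
                      for j in range(ow)] for i in range(oh)])

rng = np.random.default_rng(4)
K = rng.choice([-1.0, 1.0], size=(9, 9))                  # trained binary kernel
X = rng.integers(0, 2, size=(28, 28)).astype(float)       # binarized input image
K_low, K_high = split_signed_binary_kernel(K)
decoded = correlate_valid(K_low, X) - 2 * correlate_valid(K_high, X)
assert np.allclose(decoded, correlate_valid(K, X))         # same result as the signed kernel
```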


    Figure 5.Inference process for the convolutional neural network performed by OMica based on the MNIST dataset. (a) Execution of convolution operation by encoding each original convolutional kernel into high-bit and low-bit kernels; (b) schematic of the optical convolutional architecture performing CNN inference; (c) absolute error AE map comparing theoretical and experimental results of the convolution of a handwritten digit 7 as an input; confusion matrix of blind-testing 1000 images from the MNIST dataset when matrix convolutions are executed by the optical hardware (d) and by pure electric hardware (e). The purple box marks the first convolutional kernel to realize the whole process of encoding, convolution, and decoding.

    Figure 5(c) shows the absolute error AE map between the theoretical and experimental results for an input image of a handwritten digit 7 convolved with the first kernel. Compared with the matrices in Fig. 4, the standard input image of a handwritten digit is larger (28×28), whereas the convolutional kernel is nearly the same size; the average absolute error is 0.405. This implies that it is possible to calculate the optical convolution of larger-scale matrices using the OMica with high precision. The subsequent pooling layer, nonlinear operations, and fully connected layers are executed by a classical electrical computer.

    To validate the reliability and robustness of the system, we performed blind testing on the first 1000 MNIST images, with serial numbers ranging from 1 to 1000. As shown in Figs. 5(d) and 5(e), the experimental results indicate that the optical convolutional accelerator achieved a blind-testing accuracy of up to 97.3%, whereas the electrical computer achieved a recognition accuracy of 96.7% for the same test dataset. This may be because the computing error of the optical convolution carries characteristics of the input images, thereby further strengthening the feature extraction ability; indeed, the error maps for different handwritten digits are highly correlated with the input image, as shown in Fig. 5(c) (see Appendix E). By optimizing the kernel weights directly in the optical convolutional system, direct training of the optical CNN is expected to yield even better results than those of an electronic computer. On this basis, the architecture can be effectively used as a hardware accelerator with large computing power in various DNNs.

    5. DISCUSSION

    A. Computing Power Scalability

    As shown in Fig. 1, even when the distance d between matrix A and the BS is adjusted to match the convolutional stride s, each diffraction order of the BS involved in the convolution is still imaged onto the plane of matrix B. Therefore, it is possible to greatly reduce the physical size of the matrix elements. Given these conditions, the peak computing power of the optical convolutional architecture will reach 10 peta (10^15) operations per second (POPS) [34] if a modulator with a higher refresh rate (typically 10 kHz) is used, such as a digital micromirror device (DMD) or a specially designed micro–electro–mechanical system; this is even faster than a state-of-the-art GPU such as the TITAN RTX (Nvidia) [35]. Furthermore, if other multiplexing methods, such as polarization, wavelength, and spatial mode, are used, then speeds at least 10 to 10^2 times faster than this estimate can be achieved [36,37]. Therefore, based on the OMica, the computing power for convolution may, in the near future, be superior, or at least comparable, to that of the most powerful supercomputer (peak performance of the top system, Frontier [38], with a Linpack performance of 1102 POPS), given larger-scale and higher-refresh-rate devices.
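    The scaling argument can be checked with back-of-the-envelope arithmetic. The sketch below uses the Appendix C convention that one optical pass over two N×N matrices performs about 2N^4 operations (a multiply and an accumulate for every kernel element at every shift) and estimates the matrix scale needed to reach 10 POPS at an assumed 10 kHz refresh rate; the specific numbers are our assumptions, not measured figures.

```python
def ops_per_pass(n):
    """Operations in one optical pass over two n x n matrices (Appendix C convention)."""
    return 2 * n ** 4

refresh_hz = 10_000          # assumed DMD/MEMS refresh rate
target_ops_per_s = 10e15     # 10 POPS

print(f"one pass, 200 x 200 matrices: {ops_per_pass(200):.2e} operations")   # 3.2e9, as in Appendix C
n_needed = (target_ops_per_s / (refresh_hz * 2)) ** 0.25
print(f"matrix scale for 10 POPS at 10 kHz: about {n_needed:.0f} x {n_needed:.0f}")  # roughly 840 x 840
```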

    B. Energy Efficiency Ratio

    Additionally, the power consumption of the optical convolutional system is significantly lower than that of an electronic processor with the same computing power, even for the present bulk optical system. This estimate fully accounts for the operating power consumption of the optoelectronic devices and assumes that the total power consumption of the entire optical convolution computing system, including the light source, two modulators, and the detector, is less than 100 W. Of course, a power consumption of 100 W is not meaningful for a task as small as the MNIST dataset. However, as the matrix size increases, along with the aperture size and DG splitting ratio, the computing power grows in proportion to N^4, whereas the increase in the power consumption of this system is insignificant. Therefore, as the computing power continues to grow, the energy efficiency ratio of this architecture will significantly outperform that of existing electronic computing systems. Furthermore, if a more sensitive detection device, such as a multiphoton counter, is used, power consumption will be drastically reduced [39]. In contrast, a powerful supercomputer is energy hungry, with power consumption typically reaching 10^4 to 10^5 kW (Frontier's power is 21,100 kW). Evidently, the optical convolutional architecture will consume far less power than supercomputers, whereas its computing power for a specific task (convolution) could be at least comparable to that of Frontier, the top supercomputer this year.

    C. Potential Applications

    To the best of our knowledge, the OMica is the only optical parallel acceleration solution that can serve both as a high-precision convolutional computer and as an AI hardware accelerator with high recognition accuracy. Additionally, if an appropriate distance d [Figs. 1(a) and 1(b)] is chosen, the OMica architecture can realize not only convolutional layers but also pooling layers and fully connected layers (all of which are linear convolution calculations). For AI algorithms, it has been demonstrated that very high accuracy is not required [40] and that neural networks can operate effectively with both low-accuracy and fixed-point operations. Inference models function nearly as well with 4–8 bits of precision, and training requires roughly 8–16 bits of precision per computation [41]. Our results indicate that the computing accuracy is close to 8 bits, which is sufficiently accurate for most AI inference applications. Moreover, if high-contrast modulators, such as DMDs, are used, the computing accuracy could be improved even further, and the results obtained from this optical accelerator would be sufficient for training most AI models. Furthermore, when the neural network is trained directly in this optical convolutional system, the physical characteristics of the system itself, such as alignment errors and cross talk, are also trained, which is expected to further improve the performance of the aforementioned neural network.

    Presently, only one kernel A and one input feature map B are loaded onto these two SLMs. It is also possible to load multiple kernels on the first SLM, allowing for parallel convolutions among multiple kernels and multiple input channel feature maps by filling an appropriate number of zero elements between any two adjacent kernels. By swapping the positions of feature map B and kernel A, a CNN can be built, and the key is to make full use of pixels to increase computing power. Also, it is worth noting that considering the actual hardware scale, it is often necessary to split and reorganize the input feature map to further improve the hardware utilization, that is, to load different matrix combinations to the SLMs to execute the convolution process.

    Although these task-specific devices are not yet available, the current CMOS technology, in principle, is adequate for developing high-quality devices, such as SLMs and detectors, for optical computing. This work presents a promising method for building optical convolutional processors to overcome the intrinsic shortage of computing power and unsatisfactory energy efficiency in traditional electrical processors. Furthermore, the experimental results validate the benefits of optical convolutional systems for various application scenarios, including computationally intensive tasks and neuromorphic computing.

    6. CONCLUSION

    An optical convolutional accelerator for fully parallel universal convolution computing was proposed, and a negative matrix coding scheme with sufficiently high precision was demonstrated. In principle, a suitable encoding scheme and the OMica can be used to efficiently calculate the convolution of matrices of arbitrary bit depth with massive parallelism and sufficient accuracy. Moreover, convolution is universal, and the computing results obtained may be easily transferred to any other computing platform. Our proof-of-concept experimental results proved the feasibility of the optical convolution of 20×20 matrices with an accuracy of about 8 bits. Furthermore, a BNN for handwritten digit recognition tasks on the standard MNIST dataset was constructed, and the inference process was demonstrated on this optical hardware. The results indicated that the blind-test recognition accuracy can reach 97.3%, which is comparable with that predicted by purely electrical networks. These proof-of-concept experimental results indicate that the OMica could be used for massively parallel, high-precision, and high-efficiency AI accelerators, and this computing paradigm has potential applicability in the construction of task-specific cloud computing centers or other AI computing centers. By developing high-speed SLMs with higher contrast, optimizing a specially designed projection imaging system, and setting up a dedicated dot-array lighting source, it is possible to build a photonic coprocessor with higher computing power and lower energy consumption than state-of-the-art supercomputers, such as Frontier, based on the OMica. Additionally, the characteristics of the imaging system itself suggest that the computing power of the system can be exponentially increased by cascading multiple 4f systems and employing extra multiplexing degrees of freedom. Thus, a hybrid optical–electrical computer center or data center could be directly constructed. Furthermore, because the optical hardware can work under incoherent white-light illumination if an achromatic lens projection system is used, the OMica architecture makes it possible to handle white-light images directly from lenses without traditional photoelectric conversion.

    In summary, the OMica is expected to be used in self-driving vehicles [42], machine vision [43], and other fields that require high computing power for real-time or quasi-real-time data processing. This opens the door to increasing the computing power and energy efficiency of convolution by using high-performance devices, such as larger-scale modulators with higher updating frequencies and detectors or detector arrays with wider dynamic ranges and higher sampling frequencies, which would be superior to the most powerful supercomputers, in the near future.

    Acknowledgment

    The authors appreciate the critical discussions on this concept with Guowei Li and his assistance with the experiment.

    APPENDIX A: EXPERIMENTAL SETUP AND METHODS

    Figure 6 shows a proof-of-concept experimental system based on the OMica. Figure 7 shows photographs of the experimental setup. Two large-scale matrices, A and B, are assumed to be the two convolved matrices and are loaded onto two modulators, SLM1 and SLM2, respectively. The convolutional matrix C is detected by sCMOS1. The DG was removed before alignment, and the monitoring camera was placed in the focal plane of L9. During the alignment process, the specially designed patterns shown in Fig. 8 were used. Subsequently, the DG was inserted before SLM1, and the distance d0 between the DG and SLM1 was adjusted carefully to make the lateral shift of the image correspond to the normalized convolutional stride size s=1 [Eq. (1)].


    Figure 6. Schematic of the optical convolution experimental system using the DG. LED, light-emitting diode with wavelength λ=450 nm; M1–M6, reflective aluminum mirrors; AP1–AP3, aperture pinholes; L1–L5, convergent lenses; L6, L7, L10, Fourier transform lenses; PBS1–PBS3, cube polarization beam splitters; SLM1, SLM2, reflective liquid crystal SLMs; APA, aperture array; DG, Dammann grating; BS, non-polarizing beam splitter; sCMOS1, scientific complementary metal–oxide–semiconductor camera for detection; CMOS2, CMOS camera for monitoring. I, II, III, and the plane of the square aperture form one group of object–image conjugate planes. IV and V form another group of object–image conjugate planes. Plane V is the image plane of the DG. d0 is the characteristic distance corresponding to s=1, which can be adjusted to match the physical size of the matrix unit of matrix B to a different stride size.


    Figure 7. Photographs of the experimental system of the OMica. (a) Entire optical system; (b) SLM mounted on a 4D manual stage for loading kernel A; (c) SLM mounted on a 4D manual stage for loading matrix B; (d) enlarged view of the sCMOS1 detector and the monitoring CMOS2 camera.


    Figure 8.Typical patterns loaded onto two SLMs for alignment. (a) Alignment pattern and (b) square array pattern.


    Figure 9.Experimental results for demonstration of kernel sliding. (a), (b) Images loaded onto two SLMs. (c)–(j) Images captured by the monitoring CMOS2 camera as the iris moves from left to right, allowing only one diffraction order to pass through its aperture in sequence.

    APPENDIX B: DESIGN AND MANUFACTURING OF DAMMANN GRATING

    Here, a simulated annealing algorithm is used to optimize the structure of the DGs. The normalized energy distributions over the diffraction orders of 1×20 and 1×28 DGs with ideal π phase retardation are shown in Fig. 10. Under ideal conditions, the efficiencies of the 1×20 and 1×28 1D gratings were 81.93% and 82.38%, respectively, and the energy non-uniformity was less than 1%. The structure of a 2D DG can be easily obtained from the orthogonal superposition of two crossed 1D gratings.
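    As a rough illustration of the design procedure, the sketch below optimizes a binary 0/π phase profile (transmission ±1) by simulated annealing, targeting a small 1×7 splitter so that it runs in seconds; the actual 1×20 and 1×28 gratings use the same idea with finer structures, and all parameter values here (segment count, cooling schedule, cost weights) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n_seg, n_samp = 64, 1024                 # binary segments per period, samples for the FFT
target = np.arange(-3, 4)                # desired diffraction orders of a 1 x 7 splitter

def order_intensities(seg):
    """Intensities of the target orders for a +1/-1 (0/pi phase) transmission profile."""
    t = np.repeat(seg, n_samp // n_seg).astype(complex)
    spec = np.fft.fft(t) / n_samp         # Fourier coefficients = order amplitudes
    return np.abs(spec[target]) ** 2

def cost(seg):
    I = order_intensities(seg)
    non_uniformity = (I.max() - I.min()) / (I.max() + I.min() + 1e-12)
    return non_uniformity + (1.0 - I.sum())           # plus an efficiency penalty

seg = rng.choice([-1.0, 1.0], size=n_seg)
best_seg, best_cost, temp = seg.copy(), cost(seg), 1.0
for _ in range(20000):
    trial = seg.copy()
    trial[rng.integers(n_seg)] *= -1.0                # flip one binary segment
    dc = cost(trial) - cost(seg)
    if dc < 0 or rng.random() < np.exp(-dc / temp):   # Metropolis acceptance rule
        seg = trial
        if cost(seg) < best_cost:
            best_seg, best_cost = seg.copy(), cost(seg)
    temp *= 0.9995                                    # slow exponential cooling

I = order_intensities(best_seg)
print(f"efficiency = {I.sum():.3f}, non-uniformity = {(I.max() - I.min()) / (I.max() + I.min()):.3f}")
```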


    Figure 10. Normalized energy distributions over the beam-splitting orders of the 1×20 (a) and 1×28 (b) DGs.


    Figure 11.Intensity and angle distribution of 20×28 2D DG. (a) Simulation result of intensity distribution versus different orders; (b) simulation result of diffraction angle versus diffraction order; (c) intensity map of the spot array captured in the experiment (the cross represents the centroid); (d) experimental results of normalized intensity distribution versus diffraction order.

    APPENDIX C: CONVOLUTIONAL RESULTS FOR TWO 8-BIT GRAYSCALE 180×224 LARGE MATRICES

    In principle, the OMica can achieve high computing power because of its truly parallel processing capability. Furthermore, the convolution of two 180×224 matrices was also demonstrated in the analog framework. The theoretical and experimental results, as well as the experimentally detected light distribution of the convolution, are shown in Figs. 12(a)–12(c). The relative errors defined above are shown in Fig. 12(e). The mean errors for the five groups of data computed using the OMica hardware were 10.87, 10.93, 11.12, 11.17, and 11.48, respectively. This low precision was mainly caused by alignment error, which could be significantly reduced using piezo actuators with resolutions in the nanometer range. Under this condition, a matrix scale of 200×200 implies that the peak computing power reaches 3.2×10^9 MAC operations when light passes through the system once.


    Figure 12.Experimental convolutional results for 180×224 matrices. (a)–(c) Theoretical convolutional results, experimental convolutional results, and experimental detection light distribution, respectively; (d) partially enlarged view of the experimental light spot on (c); (e) error distribution; (f) proportion of experimental light intensity distribution.

    APPENDIX D: CONFIGURATION OF THE CNN

    The configuration of the CNN model used in our experiment for the demonstration of handwritten digit recognition based on the MNIST dataset is shown in Fig. 13. This CNN contains five layers: a convolutional layer, a pooling layer, a nonlinear activation layer, and two fully connected layers. To achieve a higher recognition rate while avoiding overfitting, we set the learning rate to 0.05 and the training batch size to 50. The number of epochs was set to four, also to avoid overfitting. The activation function for the first layer was the rectified linear unit (ReLU) function, and 10 convolutional kernels of size 9×9 with binary element values of −1 or +1 were used. Owing to its simple derivative, the ReLU function trains faster than the sigmoid and tanh functions when the kernel weights are trained with the backpropagation algorithm. Because its derivative does not vanish for positive inputs, it can effectively alleviate the vanishing gradient problem and further reduce overfitting. The average pooling method was selected for the pooling layer because all the information in the feature map is averaged without losing too much of it. Because the image is binarized in advance, the foreground and background information in the feature map maintains a high resolution after average pooling. The first fully connected layer had 200 nodes, and its activation function was the ReLU function. The last fully connected layer had 10 nodes, and its activation function was the sigmoid function. Because the sigmoid function is used in the final layer for the classification task, we chose the cross-entropy loss function to avoid the vanishing gradient problem. For the 60,000-image training set, the total training time of the CNN was approximately 3 min (Intel Core i7-4790 CPU at 3.60 GHz); the recognition accuracy on the 1000-image test set was 96.7%, and that on the 10,000-image test set was 96.3%.
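    For reference, a hedged PyTorch sketch of this configuration is given below. The layer sizes follow the description above (10 binary 9×9 kernels, average pooling, ReLU, a 200-node fully connected layer, a 10-node sigmoid output, cross-entropy loss, learning rate 0.05, batch size 50, four epochs); the 2×2 pooling window, the sign binarization of the kernels, and the use of BCE as the cross-entropy on sigmoid outputs are our assumptions where the text is not explicit.

```python
import torch
import torch.nn as nn

class MnistBnnCnn(nn.Module):
    """CNN of Fig. 13 as we read it: conv -> avg pool -> ReLU -> FC(200) -> FC(10, sigmoid)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 10, kernel_size=9, bias=False)   # 1x28x28 -> 10x20x20
        self.pool = nn.AvgPool2d(2)                               # -> 10x10x10 (2x2 window assumed)
        self.fc1 = nn.Linear(10 * 10 * 10, 200)
        self.fc2 = nn.Linear(200, 10)

    def forward(self, x):
        x = torch.relu(self.pool(self.conv(x)))
        x = torch.relu(self.fc1(torch.flatten(x, 1)))
        return torch.sigmoid(self.fc2(x))

model = MnistBnnCnn()
with torch.no_grad():   # binarize kernels to -1/+1, standing in for the BNN training of [32,33]
    w = model.conv.weight
    w.copy_(torch.where(w >= 0, torch.ones_like(w), -torch.ones_like(w)))

optimizer = torch.optim.SGD(model.parameters(), lr=0.05)          # learning rate from the text
loss_fn = nn.BCELoss()                                            # cross-entropy on sigmoid outputs
x = torch.randint(0, 2, (50, 1, 28, 28)).float()                  # one batch of 50 binarized images
y = nn.functional.one_hot(torch.randint(0, 10, (50,)), 10).float()
loss = loss_fn(model(x), y)                                       # one illustrative training step
loss.backward()
optimizer.step()
print(float(loss))
```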


    Figure 13.Schematic of the CNN architecture.


    Figure 14.Learning curve of the CNN.

    APPENDIX E: INPUT-RELATED CROSS TALK

    Figure 15 shows the distribution of relative errors between the experimental and theoretical convolutional results for different digit inputs. These error maps clearly bear the characteristics of the input digits, which may be due to optical cross talk between different pixel channels. Optical cross talk is an important factor limiting the improvement of optical computing accuracy. However, for AI algorithms, if the deep learning model is trained directly on the optical computing system, this optical cross talk may actually help improve the recognition accuracy of the system. This result has implications for developing optical AI accelerators with high recognition accuracy.


    Figure 15. Typical error maps between the convolutional results obtained from the optical hardware and those obtained from a full-precision electrical computer, for different input handwritten digits (from 0 to 9) and the 10 convolutional kernels after encoding.

    APPENDIX F: SUMMARY OF DIFFERENT OPTICAL CONVOLUTIONAL ARCHITECTURES

    Table 1 summarizes various mainstream optical convolutional architectures (OIU, optical interference unit; MRs, microring resonators; OFC, optical frequency comb; PCM, phase-change material; D2NN, diffractive DNN). It has been shown that a precision of only about 4–5 bits is achieved by most of the photonic accelerators reported, although they work well for most machine learning tasks after retraining with noise. However, it has been verified empirically that, for most neural networks, inference models work nearly as well with 4–8 bits of precision, while training requires roughly 8–16 bits of precision per computation [41]. This is one important reason why most photonic accelerators have been used only for inference tasks. Beyond artificial neural networks, the OMica provides the ability to accelerate universal convolution computation and thus could find applications in many other fields, such as the simulation of optical imaging and multi-input multi-output systems.

    Summary of Different Optical Convolutional Architectures

    Architecture: OIUs and delay line. Principle: matrix–vector multiplication. Computing accuracy: ~5 bits. References: [16,18,24].
    • High integration and high modulation speed.
    • Limited by the integration of integrated photonic devices, it is difficult to realize the parallel convolution process of multiple convolutional kernels.

    Architecture: MRs, OFC, and PCM. Principle: matrix–vector multiplication. Computing accuracy: ~5 bits. References: [17,22].
    • High integration and high modulation speed.
    • The OFC can provide multi-wavelength light sources and timing modulation, and the system integration is higher.
    • Low power consumption using non-volatile PCM.
    • Complex electronic control and test configuration.

    Architecture: 4f filter. Principle: multiplication in the frequency domain equals convolution in the spatial domain. Computing accuracy: /. References: [28,29].
    • Object and spectrum are limited by the Fourier transform relationship; there is a trade-off between computing accuracy and computing size.
    • Configuration is very simple.

    Architecture: D2NN. Principle: diffraction. Computing accuracy: ~5 bits. References: [19,44].
    • High-precision 3D macro–nano structures are difficult to fabricate, and computational accuracy is limited.
    • High computing power.

    Architecture: Shadow casting. Principle: 2D matrix–matrix multiplication. Computing accuracy: /. References: [30,31,39].
    • A diffraction effect exists when matrix A is projected onto matrix B, and computing accuracy cannot be guaranteed.
    • Configuration is very simple.

    Architecture: OMica. Principle: 2D matrix–matrix convolution and multiplication. Computing accuracy: ~8 bits. References: this work.
    • The DG and object–image conjugation avoid diffraction effects through wavefront recombination.
    • The DG is a 2D DOE and is easy to manufacture; computing power can be expanded easily by using large-scale DGs.
    • Can work under incoherent light illumination and directly handle optical images.
    • Computational accuracy is high.

    Compared with the most popular scheme involving planar waveguides on a 2D substrate [17,18,22,24], the scheme of multiple cascading DOEs inherently takes full advantage of the 3D connection ability of optics and can thus achieve higher computing power in a single computing step. Recently, Xu et al. realized a photonic convolutional accelerator based on optical frequency combs [17], whose computing power is as high as tera operations per second (TOPS). The use of optical frequency combs to realize multi-wavelength light sources is remarkable progress; however, the scalability of this architecture is still limited by the number of channels of the optical frequency comb. Miscuglio et al. [29] proposed an optical system that performs fast updating of optical neural networks based on two amplitude-only DMDs, where one DMD is located at the Fourier transform plane of the other. Although the mapping relationship between the input images and the recognized digits can be successfully established using this method, the computed results are essentially not standard convolutions. Therefore, this method cannot be used for high-precision universal convolution computing. Moreover, it is difficult to align the two DMDs pixel by pixel, and because of the Fourier transform relationship between the input and filter planes, realizing large-scale optical networks will be difficult. Recently, Zhou et al. [44,45] demonstrated a reconfigurable scheme for realizing the 3D architecture with multiple cascading DOEs, using two programmable modulators, a DMD and a pure-phase SLM, for amplitude and phase modulation, respectively. Because of the coherent working mode, micrometer-sized pixels, alignment errors between the DMD and SLM, and alignment errors between different layers, achieving high computing precision is difficult; consequently, recognition is drastically degraded without adaptive training. Although this scheme performs well after adaptive training, it cannot be used for universal convolution computing because of its low precision.

    In contrast, because of the object–image conjugate relationship, a CMOS monitoring camera can be added to the plane conjugate to the two SLMs, making it simple to align the two SLMs with the monitoring camera. Additionally, an incoherent light source can be used in this architecture, which avoids the phase sensitivity and speckle noise of coherent illumination. More importantly, this configuration makes it possible to handle images directly from a lens under white-light illumination, which, to the best of our knowledge, is very challenging for all mainstream architectures.

    Therefore, the convolutional accelerator enabled by the OMica can be used to compute universal matrix convolutions, and the results obtained by this hybrid optical–electrical hardware can easily be transferred to any other computing platform, including photonic, hybrid optical–electrical, and traditional electrical processors or coprocessors. Because of its universality, this architecture can be used for building task-specific cloud computing centers or other AI accelerating centers, even with the present bulk optical system. In the future, with the advancement of nonlinear optical elements, a scheme based on the OMica could also be integrated into purely photonic accelerators by combining planar waveguides [46,47], metasurfaces [48–50], advanced modulator arrays, etc.

    References

    [1] Y. Lecun, L. Bottou, Y. Bengio, P. Haffner. Gradient-based learning applied to document recognition. Proc. IEEE, 86, 2278-2324(1998).

    [2] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst., 25, 1097-1105(2013).

    [3] Y. LeCun, Y. Bengio, G. Hinton. Deep learning. Nature, 521, 436-444(2015).

    [4] J. Cong, B. Xiao. Minimizing computation in convolutional neural networks. International Conference on Artificial Neural Networks, 281-290(2014).

    [5] T. F. De Lima, H.-T. Peng, A. N. Tait, M. A. Nahmias, H. B. Miller, B. J. Shastri, P. R. Prucnal. Machine learning with neuromorphic photonics. J. Lightwave Technol., 37, 1515-1534(2019).

    [6] Y. Ito, R. Matsumiya, T. Endo. OOC-cuDNN: accommodating convolutional neural networks over GPU memory capacity. IEEE International Conference on Big Data, 183-192(2017).

    [7] K. He, X. Zhang, S. Ren, J. Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition, 770-778(2016).

    [8] G. Wetzstein, A. Ozcan, S. Gigan, S. Fan, D. Englund, M. Soljačić, C. Denz, D. A. B. Miller, D. Psaltis. Inference in artificial intelligence with deep optics and photonics. Nature, 588, 39-47(2020).

    [9] B. J. Shastri, A. N. Tait, T. F. de Lima, W. H. P. Pernice, H. Bhaskaran, C. D. Wright, P. R. Prucnal. Photonics for artificial intelligence and neuromorphic computing. Nat. Photonics, 15, 102-114(2021).

    [10] P. Ambs. Optical computing: a 60-year adventure. Adv. Opt. Photon., 2010, 1-15(2010).

    [11] A. Maréchal, P. Croce. Un filtre de fréquences spatiales pour l’amélioration du contraste des images optiques. C. R. Acad. Sci., 237(1953).

    [12] L. De Marinis, M. Cococcioni, P. Castoldi, N. Andriolli. Photonic neural networks: a survey. IEEE Access, 7, 175827(2019).

    [13] P. R. Prucnal, B. J. Shastri. Neuromorphic Photonics(2017).

    [14] F. Thomas, B. J. Shastri, A. N. Tait, M. A. Nahmias, P. R. Prucnal. Progress in neuromorphic photonics. Nanophotonics, 6, 577-599(2017).

    [15] Q. Zhang, H. Yu, M. Barbiero, B. Wang, M. Gu. Artificial neural networks enabled by nanophotonics. Light Sci. Appl., 8, 42(2019).

    [16] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, M. Soljačić. Deep learning with coherent nanophotonic circuits. Nat. Photonics, 11, 441-446(2017).

    [17] X. Xu, M. Tan, B. Corcoran, J. Wu, A. Boes, T. G. Nguyen, S. T. Chu, B. E. Little, D. G. Hicks, R. Morandotti, A. Mitchell, D. J. Moss. 11 TOPS photonic convolutional accelerator for optical neural networks. Nature, 589, 44-51(2021).

    [18] H. Bagherian, S. Skirlo, Y. Shen, H. Meng, V. Ceperic, M. Soljacic. On-chip optical convolutional neural networks. arXiv(2018).

    [19] X. Lin, Y. Rivenson, N. T. Yardimci, M. Veli, Y. Luo, M. Jarrahi, A. Ozcan. All-optical machine learning using diffractive deep neural networks. Science, 361, 1004-1008(2018).

    [20] A. Silva, F. Monticone, G. Castaldi, V. Galdi, A. Alù, N. Engheta. Performing mathematical operations with metamaterials. Science, 343, 160-163(2014).

    [21] J. Feldmann, N. Youngblood, C. D. Wright, H. Bhaskaran, W. H. P. Pernice. All-optical spiking neurosynaptic networks with self-learning capabilities. Nature, 569, 208-214(2019).

    [22] J. Feldmann, N. Youngblood, M. Karpov, H. Gehring, X. Li, M. Stappers, M. Le Gallo, X. Fu, A. Lukashchuk, A. S. Raja, J. Liu, C. D. Wright, A. Sebastian, T. J. Kippenberg, W. H. P. Pernice, H. Bhaskaran. Parallel convolutional processing using an integrated photonic tensor core. Nature, 589, 52-58(2021).

    [23] F. Ashtiani, A. J. Geers, F. Aflatouni. An on-chip photonic deep neural network for image classification. Nature, 606, 501-506(2022).

    [24] S. Xu, J. Wang, R. Wang, J. Chen, W. Zou. High-accuracy optical convolution unit architecture for convolutional neural networks by cascaded acousto-optical modulator arrays. Opt. Express, 27, 19778-19787(2019).

    [25] H. Dammann, E. Klotz. Coherent optical generation and inspection of two-dimensional periodic structures. Opt. Acta, 24, 505-515(1977).

    [26] C. Zhou, L. Liu. Numerical study of Dammann array illuminators. Appl. Opt., 34, 5961-5969(1995).

    [27] J. Yu, C. Zhou, W. Jia, W. Cao, S. Wang, J. Ma, H. Cao. Three-dimensional Dammann array. Appl. Opt., 51, 1619-1630(2012).

    [28] J. Chang, V. Sitzmann, X. Dun, W. Heidrich, G. Wetzstein. Hybrid optical-electronic convolutional neural networks with optimized diffractive optics for image classification. Sci. Rep., 8, 12324(2018).

    [29] M. Miscuglio, Z. Hu, S. Li, J. K. Grorge, R. Capanna, H. Dalir, P. M. Bardet, P. Gupta, V. J. Sorger. Massively parallel amplitude-only Fourier neural network. Optica, 7, 1812-1819(2020).

    [30] C. Zhou, L. Liu, Z. Wang. Binary-encoded vector–matrix multiplication architecture. Opt. Lett., 17, 1800-1802(1992).

    [31] L. Liu, G. Li, Y. Yin. Optical complex matrix–vector multiplication with negative binary inner products. Opt. Lett., 19, 1759-1761(1994).

    [32] H. Qin, R. Gong, X. Liu, X. Bai, J. Song, N. Sebe. Binary neural networks: a survey. Pattern Recogn., 105, 107281(2020).

    [33] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, Y. Bengio. Binarized neural networks training neural networks with weights and activations constrained to +1 or −1. arXiv(2016).

    [34] C. Zhou, J. Yu, G. Li, G. Ma. Roadmap of optical computing. Proc. SPIE, 11898, 118981B(2021).

    [35] https://www.nvidia.com/en-us/deep-learning-ai/products/titan-rtx/

    [36] P. Minzioni, C. Lacava, T. Tanabe, J. Dong, X. Hu, G. Csaba, W. Porod, G. Singh, A. E. Willner, A. Almaiman, V. Torres-Company, J. Schröder, A. C. Peacock, M. J. Strain, F. Parmigiani, G. Contestabile, D. Marpaung, Z. Liu, J. E. Bowers, L. Chang, S. Fabbri, M. R. Vázquez, V. Bharadwaj, S. M. Eaton, P. Lodahl, X. Zhang, B. J. Eggleton, W. J. Munro, K. Nemoto, O. Morin, J. Laurat, J. Nunn. Roadmap on all-optical processing. J. Opt., 21, 063001(2019).

    [37] J. Wang, J.-Y. Yang, I. M. Fazal, N. Ahmed, Y. Yan, H. Huang, Y. Ren, Y. Yue, S. Dolinar, M. Tur, A. E. Willner. Terabit free-space data transmission employing orbital angular momentum multiplexing. Nat. Photonics, 6, 488-496(2012).

    [38] https://www.top500.org/system/180047/

    [39] T. Wang, S.-Y. Ma, L. G. Wright, T. Onodera, B. C. Richard, P. L. McMahon. An optical neural network using less than 1 photon per multiplication. Nat. Commun., 13, 123(2022).

    [40] S. Gupta, A. Agrawal, K. Gopalakrishnan, P. Narayanan. Deep learning with limited numerical precision. arXiv(2015).

    [41] M. A. Nahmias, T. F. de Lima, A. N. Tait, H.-T. Peng, B. J. Shastri, P. P. Prucnal. Photonic multiply-accumulate operations for neural networks. IEEE J. Quantum Electron., 26, 7701518(2020).

    [42] J. Han, A. Jentzen, E. Weinan. Solving high-dimensional partial differential equations using deep learning. Proc. Natl. Acad. Sci. USA, 115, 8505-8510(2018).

    [43] L. Mennel, J. Symonowicz, S. Wachter, D. K. Polyushkin, A. J. Molina-Mendoza, T. Mueller. Ultrafast machine vision with 2D material neural network image sensors. Nature, 579, 62-66(2020).

    [44] T. Zhou, X. Lin, J. Wu, Y. Chen, H. Xie, Y. Li, J. Fan, H. Wu, L. Fang, Q. Dai. Large-scale neuromorphic optoelectronic computing with a reconfigurable diffractive processing unit. Nat. Photonics, 15, 367-373(2021).

    [45] T. Zhou, L. Fang, T. Yan, J. Wu, Y. Li, J. Fan, H. Wu, X. Lin, Q. Dai. In situ optical backpropagation training of diffractive optical neural networks. Photon. Res., 8, 940-953(2020).

    [46] M. Gruber. Multichip module with planar-integrated free-space optical vector-matrix-type interconnects. Appl. Opt., 43, 463-470(2004).

    [47] G. Mínguez-Vega, M. Gruber, J. Jahns, J. Lancis. Achromatic optical Fourier transformer with planar-integrated free-space optics. Appl. Opt., 44, 229-235(2005).

    [48] Y. Zhang, C. Fowler, J. Liang, B. Azhar, M. Y. Shalaginov, S. Deckoff-Jones, S. An, J. B. Chou, C. M. Roberts, V. Liberman, M. Kang, C. Ríos, K. A. Richardson, C. Rivero-Baleine, T. Gu, H. Zhang, J. Hu. Electrically reconfigurable non-volatile metasurface using low-loss optical phase-change material. Nat. Nanotechnol., 3, 661-666(2021).

    [49] Z. Wu, M. Zhou, E. Khoram, B. Liu, Z. Yu. Neuromorphic metasurface. Photon. Res., 8, 46-50(2020).

    [50] H. Kwon, D. Sounas, A. Cordaro, A. Polman, A. Alù. Nonlocal metasurfaces for optical signal processing. Phys. Rev. Lett., 121, 173004(2018).
