
- Advanced Photonics
- Vol. 7, Issue 1, 016004 (2025)
Abstract
1 Introduction
Machine learning, one of the most revolutionary scientific breakthroughs of recent decades, has transformed the technology landscape, enabling innovative applications in fields ranging from natural language processing to drug discovery. As the demand for increasingly sophisticated machine-learning models continues to escalate, there is a pressing need for faster and more energy-efficient computing solutions. In this context, analog computing has emerged as a promising alternative to traditional digital electronics.1
Most current analog computing research and development is aimed at using neural networks (NNs) for inference.8,10 Training such NNs, on the other hand, remains a challenge. This is because the backpropagation algorithm,11 the workhorse of training in digital NNs, requires calculations to be performed in the order opposite to the information flow of inference, which is difficult to implement on an analog physical platform. Hence, analog models are typically trained offline (in silico) on a separate digital simulator, after which the parameters are transferred to the analog hardware. In addition to being slow and inefficient, this approach is prone to errors arising from imperfect simulation and from systematic deviations between the model and the physical system (the “reality gap”). In optics, for example, these effects may result from dust, aberrations, spurious reflections, and inaccurate calibration.12
To enable learning in analog NNs, different approaches have been proposed and realized.13 Several groups explored various “hardware-in-the-loop” schemes, in which, although the backpropagation was done in silico, the signal acquired from the analog NN operating in the inference regime was incorporated into the calculation of the feedback for optimizing the NN parameters.12,14
Recently, several optical neural networks (ONNs) were reported that were trained online (in situ) using methods alternative to backpropagation. Bandyopadhyay et al.21 trained an ONN based on integrated photonic circuits using simultaneous perturbation stochastic approximation, i.e., randomly perturbing all ONN parameters and using the observed change of the loss function to approximate its gradient. Filipovich et al.22 applied direct feedback alignment, wherein the error calculated at the output of the ONN is used to update the parameters of all layers. However, both these methods are inferior to backpropagation, as they take a much longer time to converge, especially for sufficiently deep ONNs.23
An optical implementation of the backpropagation algorithm was proposed by Hughes et al.,24 and recently demonstrated experimentally,25 showing that the training methods of current digital NNs can be applied to analog hardware. However, their scheme omitted a crucial step for optical implementation: backpropagation through nonlinear activation layers. Their method requires digital nonlinear activation and multiple opto-electronic interconversions inside the network, complicating the training process. Furthermore, the method applies only to a specific type of ONN that uses interferometer meshes for the linear layer and does not generalize to other ONN architectures. Complete implementation of the backpropagation algorithm in optics, through all the linear and nonlinear layers that can be generalized to many ONN systems, remains a highly challenging goal.
In this work, we address this long-standing challenge and present what we believe is the first complete optical implementation of the backpropagation algorithm in a two-layer ONN. The gradients of the loss function with respect to the NN parameters are calculated by light traveling through the system in the reverse direction. The main difficulty of all-optical training lies in the requirement that the nonlinear optical element used for the activation function needs to exhibit different properties for the forward and backward propagating signals. Fortunately, as demonstrated in our earlier theoretical work26 and explained below, there does exist a group of nonlinear phenomena that exhibit the required set of properties with sufficient precision.
We optically train our ONNs to perform classification tasks, and the results surpass those obtained with a conventional in silico method. Our optical training scheme can be further generalized to other platforms using different linear layers and analog activation functions, making it an ideal tool for exploring the vast potential of analog computing for training NNs. Optical backpropagation also offers faster convergence in training compared with alternative algorithms.21
1.1 Optical Training Algorithm
We consider a multilayer perceptron—a common type of NN that consists of multiple linear layers, which establish weighted connections among neurons, interleaved with activation functions that enable the network to learn complex nonlinear functions. To train the NN, one presents it with a training set of labeled examples and iteratively adjusts the NN parameters (weights and biases) to find the correct mapping between inputs and outputs.
The training steps are summarized in Fig. 1(d), and the complete analysis is presented in Note 1 in the Supplementary Material. The weight matrices, denoted
Figure 1.Illustration of optical training. (a) Network architecture of the ONN used in this work, which consists of two fully connected linear layers and a hidden layer. (b) Simplified experimental schematic of the ONN. Each linear layer performs optical MVM with a cylindrical lens and an SLM that encodes the weight matrix. Hidden layer activations are computed using SA in an atomic vapor cell. Light propagates in both directions during optical training. (c) Working principle of SA activation. The forward beam (pump) is shown by solid red arrows and the backward (probe) by purple wavy arrows. The probe transmission depends on the strength of the pump and approximates the gradient of the SA function. For high forward intensity (top panel), a large portion of the atoms are excited to the upper level. Stimulated emission produced by these atoms largely compensates for the absorption due to the atoms at the ground level. For the weak pump (bottom panel), the excited level population is small, and the absorption is significant. (d) NN training procedure. (e) Optical training procedure. Both signal and error propagations in the two directions are fully implemented optically. Loss function calculation and parameter update are left for electronics without interrupting the optical information flow.
The output
The gradients we require are given by11
We see from Eq. (4) that the error backpropagation consists of two operations. First, we must perform an MVM, mirroring the feed-forward linear operation, Eq. (1). In an ONN, this can be done by light that propagates backward through the same linear optical arrangement.27 The second operation consists of modulation of the MVM output by the activation function derivative and poses a notable challenge for optical implementation. This is because most optical media exhibit similar properties for forward and backward propagation. On the other hand, our application requires an optical element that is (1) nonlinear in the forward direction, (2) linear in the backward direction, and (3) modulates the backward light amplitude by the derivative of the forward activation function.
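For concreteness, the two backpropagation operations described above can be written as a short numerical sketch: a matrix-vector multiplication with the transposed weights (light propagating backward through the same linear arrangement), followed by element-wise modulation with the activation derivative (the role of the nonlinear medium). The symbols and the placeholder activation below are illustrative and are not the paper's notation.

```python
# Minimal numerical sketch of the two backpropagation operations in Eq. (4).
import numpy as np

def g(z):
    """Placeholder activation (stand-in for the saturable-absorber response)."""
    return np.tanh(z)

def g_prime(z):
    """Derivative of the activation, which the backward probe must emulate."""
    return 1.0 - np.tanh(z) ** 2

rng = np.random.default_rng(0)
W2 = rng.normal(size=(2, 5))      # second-layer weight matrix
z1 = rng.normal(size=5)           # hidden-layer pre-activation values
delta2 = rng.normal(size=2)       # error vector at the output layer

# Operation 1: MVM with the transposed weights -- in the ONN, light propagates
# backward through the same linear optical arrangement.
back_mvm = W2.T @ delta2

# Operation 2: element-wise modulation by the activation derivative -- in the
# ONN, this is what the pump-probe nonlinear element must provide.
delta1 = back_mvm * g_prime(z1)
print(delta1)
```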
We have solved this challenge with our optical backpropagation protocol, which calculates the right-hand side of Eq. (4) entirely optically with no opto-electronic conversion or digital processing. The first component of our solution is the observation that many optical media exhibit nonlinear properties for strong optical fields but are approximately linear for weak fields. Hence, we can satisfy conditions 1 and 2 by maintaining the back-injected beam at a much lower intensity level than the forward. Furthermore, there exists a set of nonlinear phenomena that also addresses the requirement (3). An example is saturable absorption (SA), whose nonlinear response of the forward incoming light field
As shown in our prior work,26 the term before the central dot in Eq. (6) is approximately constant over a wide range of input values
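The pump-probe idea can be illustrated numerically with a simplified two-level saturable-absorber model in which the amplitude transmissivity is t(I) = exp[-OD/2(1 + I/I_sat)]. This toy model, its parameters, and the function names below are assumptions for illustration; the paper's exact expressions are those referenced above (Eq. (6)) and in the Supplementary Material.

```python
# Toy comparison of the true SA activation derivative with the weak backward
# probe transmission through the pump-saturated medium.
import numpy as np

OD, I_sat = 4.0, 1.0                      # assumed optical depth and saturation intensity

def forward_activation(a):
    """Forward (pump) amplitude response of the saturable absorber."""
    return a * np.exp(-OD / (2 * (1 + a**2 / I_sat)))

def probe_transmission(a):
    """Weak backward probe transmissivity set by the forward (pump) amplitude."""
    return np.exp(-OD / (2 * (1 + a**2 / I_sat)))

a = np.linspace(0.05, 3.0, 200)           # range of forward field amplitudes
true_derivative = np.gradient(forward_activation(a), a)

# The probe transmission tracks the activation derivative up to a slowly
# varying prefactor, which is what makes the optical gradient approximation work.
print(np.corrcoef(true_derivative, probe_transmission(a))[0, 1])
```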
1.2 Multilayer ONN
Our ONN, shown in Figs. 1(a) and 1(b), is implemented in a free-space tabletop setting. The neuron values are encoded in the transverse spatial structure of the propagating light-field amplitude. Spatial light modulators (SLMs) are used to encode the input vectors and weight matrices. The NN consists of two fully connected linear layers implemented with optical MVM28 following our previously demonstrated experimental design.29 This design has a few characteristics that make it suitable for use in a deep NN. First, it is reconfigurable, so that both neuron values and network weights can be arbitrarily changed. Second, multiple MVM blocks can be cascaded to form a multilayer network, as the output of one MVM naturally forms the input of the next MVM. Using a coherent beam also allows us to encode both positive- and negative-valued weights. Finally, the MVM works in both directions, meaning the inputs and outputs are reversible, which is critical for the implementation of our optical backpropagation algorithm. The hidden layer activation between the two layers is implemented optically by means of SA in a rubidium atomic vapor cell [Fig. 1(c)].
2 Results
2.1 Linear Layers
We first set up the linear layers that serve as the backbone of our ONN and verify that they work accurately and simultaneously in both directions—a highly challenging task that, to the best of our knowledge, has never been achieved before. This involves three MVMs: the first layer in the forward direction (MVM-1) and the second layer in both the forward (MVM-2a) and backward (MVM-2b) directions. MVM-2b is the transpose of MVM-2a because the matrix elements are the same, but the fan-in directions for the forward and backward propagating beams are perpendicular. To characterize these MVMs, we apply random vectors and matrices and simultaneously measure the output of all three; the results for 300 random MVMs are presented in Fig. 2(a). To quantify the MVM performance, we define the signal-to-noise ratio (SNR; see Appendix for details). As illustrated by the histograms, MVM-1 has the highest SNR of 14.9, while MVM-2a has a lower SNR of 7.1 as a result of noise accumulation from both layers and the reduced signal range. MVM-2b has a slightly lower SNR of 6.7 because the optical system is optimized for the forward direction. Comparing these experimental results with a simple numerical model, we estimate 1.3% multiplicative noise in our MVMs, which is small enough not to degrade the ONN performance.12
Figure 2.Multilayer ONN characterization. (a) Scatterplots of measured-against-theory results for MVM-1 (first layer forward), MVM-2a (second layer forward), and MVM-2b (second layer backward). All three MVM results are taken simultaneously. Histograms of the signal and noise error for each MVM are displayed underneath. (b) First layer activations
2.2 Nonlinearity
With the linear layers fully characterized, we now measure the response of the activation units in both directions. With the vapor cell placed in the setup and the laser tuned to resonance with the atomic transition, we pass the output of MVM-1 through the vapor cell in the forward direction. The response as presented in Fig. 2(b) shows strong nonlinearity. We fit the data with the theoretically expected SA transmissivity (see Supplementary Material for details), thereby finding the optical depth to be
In Fig. 2(c), we measure the effect of the forward amplitude
2.3 All-Optical Classification
After setting up the two-layer ONN, we perform end-to-end optical training and inference on classification tasks: distinguishing two classes of data points on a two-dimensional plane (Fig. 3). We implement a fully connected feed-forward architecture, with three input neurons, five hidden layer neurons, and two output neurons (Fig. 1). Two input neurons are used to encode the input data point coordinates
Figure 3.Optical training performance. (a) Decision boundary charts of the ONN inference output for three different classification tasks, after the ONN has been trained optically (top) or
We optically train the ONN on three 400-element datasets with different nonlinear boundary shapes, which we refer to as “Rings,” “XOR,” and “Arches” [Fig. 3(a)]. Another 200 similar elements of each set are used for validation, i.e., to measure the loss and accuracy after each epoch of training. The test set consists of a uniform grid of equally spaced
For all three datasets, each epoch consists of 20 minibatches, with a minibatch size of 20, and we use the Adam optimizer to update the weights and biases from the gradients. We tune hyperparameters such as learning rate and number of epochs to maximize network performance. Table 1 summarizes the network architecture and hyperparameters used for each dataset.
Dataset | Input neurons | Hidden neurons | Output neurons | Learning rate | Epochs | Batches per epoch | Batch size |
Rings | 2 | 5 | 2 | 0.01 | 16 | 20 | 20 |
XOR | 2 | 5 | 2 | 0.005 | 30 | 20 | 20
Arches | 2 | 5 | 2 | 0.01 | 25 | 20 | 20
Table 1. Summary of network architecture and hyperparameters used in both optical and digital training.
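The overall flow of one optical training run can be summarized in a short, runnable sketch using the "Rings" hyperparameters from Table 1. In the experiment, the forward and backward passes are hardware operations (SLM/DMD programming, propagation, and camera readout); in the sketch below they are replaced by digital stand-ins with a tanh nonlinearity, and the toy dataset generator is likewise an illustrative assumption rather than the paper's implementation. Only the output error, loss, and Adam update are digital in the experiment, as in this sketch.

```python
# Schematic sketch of the optical training loop of Fig. 1(e); all functions
# and data below are illustrative stand-ins, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

def make_rings(n):
    """Toy stand-in for the 'Rings' dataset: label = inside/outside a circle."""
    pts = rng.uniform(-1, 1, size=(n, 2))
    labels = (np.linalg.norm(pts, axis=1) < 0.6).astype(int)
    onehot = np.eye(2)[labels]
    inputs = np.hstack([pts, np.ones((n, 1))])         # third input neuron = bias
    return inputs, onehot

def optical_forward(x, W1, W2):
    """Stand-in for the optical forward pass (two MVMs + nonlinear activation)."""
    z1 = W1 @ x
    a1 = np.tanh(z1)                                    # placeholder activation
    return z1, a1, W2 @ a1

def optical_backward(delta_out, W2, z1):
    """Stand-in for optical backprop (transposed MVM x activation derivative)."""
    return (W2.T @ delta_out) * (1 - np.tanh(z1) ** 2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def adam_step(param, grad, state, lr, b1=0.9, b2=0.999, eps=1e-8):
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    param = param - lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
    return param, (m, v, t)

# Network: 3 input (2 data + bias), 5 hidden, 2 output neurons.
W1, W2 = 0.5 * rng.normal(size=(5, 3)), 0.5 * rng.normal(size=(2, 5))
s1 = (np.zeros_like(W1), np.zeros_like(W1), 0)
s2 = (np.zeros_like(W2), np.zeros_like(W2), 0)
lr, epochs, batches, batch_size = 0.01, 16, 20, 20      # "Rings" hyperparameters

for _ in range(epochs):
    for _ in range(batches):
        X, Y = make_rings(batch_size)
        g1, g2 = np.zeros_like(W1), np.zeros_like(W2)
        for x, y in zip(X, Y):
            z1, a1, z2 = optical_forward(x, W1, W2)            # forward pass (optics)
            delta_out = softmax(z2) - y                        # digital output error
            delta_hid = optical_backward(delta_out, W2, z1)    # backward pass (optics)
            g2 += np.outer(delta_out, a1)                      # gradients assembled from
            g1 += np.outer(delta_hid, x)                       # measured errors/activations
        W2, s2 = adam_step(W2, g2 / batch_size, s2, lr)
        W1, s1 = adam_step(W1, g1 / batch_size, s1, lr)

X_val, Y_val = make_rings(200)
preds = np.array([np.argmax(optical_forward(x, W1, W2)[2]) for x in X_val])
print("validation accuracy:", np.mean(preds == Y_val.argmax(axis=1)))
```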
The optical training performance on the “Rings” dataset is shown in Fig. 3(b). We perform five repeated training runs and plot the loss and accuracy for the validation set after each epoch of training. To visualize how the network is learning the boundary between the two classes, we also run a test dataset after each epoch. Examples of the network output after 1, 3, 6, and 10 epochs are shown. We see that the ONN quickly learns the nonlinear boundary and gradually improves the accuracy to 100%. This indicates a strong optical nonlinearity in the system and a good gradient approximation in optical backpropagation. Details of the training procedure are provided in the Appendix, and results for the other two datasets in Note S3 in the Supplementary Material.
To better understand the optical training process, we explore the evolution of the output neuron and error vector values in Fig. 3(c). First, we plot the minibatch mean value of each output neuron,
Second, we similarly plot the evolution of the minibatch mean output error,
To confirm that the convergence is truly driven by optical gradient descent, we further compare the optically estimated gradients with digitally calculated gradients in Fig. 3(d). With the ONN well calibrated, the optical gradients should track the digital gradients. In Fig. 3(d), we see some deviation of the optically estimated gradients from the digital ones, which is a signature of imperfections in the physical system. In spite of this deviation, the plots show a high degree of correlation between the optical and digital gradients for all 10 weight matrix elements. This indicates that optical training moves the parameters in the direction predicted by the digital calculation, subject to perturbation by stochastic noise. The evolution of the optical and digital gradients during the training process is further analyzed in Note 3 in the Supplementary Material.
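The per-element comparison of Fig. 3(d) amounts to computing a correlation between the two gradient traces for each weight matrix element over the training steps. The arrays below are synthetic stand-ins for the logged gradients, included only to show the form of the calculation.

```python
# Per-element correlation between optically estimated and digitally computed
# gradients; in the experiment each column is one of the 10 second-layer
# weight matrix elements. The data here are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(2)
steps, n_elements = 320, 10
digital_grads = rng.normal(size=(steps, n_elements))
optical_grads = digital_grads + 0.3 * rng.normal(size=(steps, n_elements))  # noisy estimates

correlations = [
    np.corrcoef(optical_grads[:, k], digital_grads[:, k])[0, 1]
    for k in range(n_elements)
]
print(np.round(correlations, 2))
```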
2.4 Optical Training versus in silico Training
To demonstrate the advantage of optical training, we perform in silico training of our ONN as a comparison. We digitally model our system with an NN of equivalent architecture, using an identical learning rate, number of epochs, and all other hyperparameters. The hidden layer nonlinearity and the associated gradient are given by the best-fit curve and the theoretical probe response of Fig. 2(b). The trained weights are subsequently used for inference with our ONN. The top and bottom rows in Fig. 3(a) plot the network output on the test boundary set after the system has been trained optically and digitally, respectively, for all three datasets. In all cases, the optically trained network achieves almost perfect accuracy, whereas the digitally trained network is clearly not optimized, with the network prediction not matching the data. This is further evidence of the already well-documented advantages of hardware-in-the-loop training schemes.
3 Discussion and Conclusion
According to simple estimates, optical implementation can enhance the energy efficiency of an NN by 3 orders of magnitude in comparison with its digital electronic counterpart. Our surprisingly simple and effective optical training scheme is capable of extending to training the same advantage that was previously promised for inference. It adds minimal computational overhead to the network because it requires neither in silico simulation nor intricate mapping of network parameters to physical device settings. Our method also imposes minimal hardware complexity on the system, as it requires only a few additional beam splitters and detectors to measure the activation and error values for the parameter updates.
Our scheme can be generalized and applied to many other analog NNs with different physical implementations of the linear and nonlinear layers. We list a few examples in Table 2. Common optical linear operations include MVM, diffraction, and convolution. Compatible optical MVM examples include our free-space multiplier and photonic crossbar array,30 as they are both bidirectional, in the sense that the optical field propagating backward through these arrangements gets multiplied by the transpose of the weight matrix. Diffraction naturally works in both directions; hence, diffractive NNs constructed using different programmable amplitude and phase masks also satisfy the requirements.31 Optical convolution, achieved with the Fourier transform by means of a lens, and mean pooling, achieved through an optical low-pass filter, also work in both directions. Therefore, a convolutional NN can be optically trained as well. Detailed analysis of the generalization to these linear layers can be found in Note 4 in the Supplementary Material.
Network layer | Function | Implementation example |
Linear layer | MVM | Free-space optical multiplier and photonic crossbar array |
Linear layer | Diffraction | Programmable optical mask
Linear layer | Convolution | Lens Fourier transform
Nonlinear layer | SA | Atomic vapor cell, semiconductor absorber, and graphene |
Nonlinear layer | Saturable gain | EDFA, SOA, and Raman amplifier
Table 2. Generalization of the optical training scheme.
Regarding the generalization to other nonlinearity choices, the critical requirement is the ability to acquire gradients during backpropagation. Our pump–probe method is compatible with multiple types of saturable effects, including SA and saturable gain.32 Using saturable gain as the nonlinearity offers the added advantage of loss compensation in a deep network. This is important for scaling ONNs to real-world workloads, which may otherwise be limited to only a few layers if optical losses are not overcome. Importantly, both SA and saturable gain nonlinearities can be implemented not only in free space but also in integrated ONN settings.33,34
In our ONN training implementation, some computational operations remain digital, specifically the calculation of the last layer error
The compute performance and energy efficiency of our proof-of-principle experiment were not optimized to be comparable to those of digital counterparts; they were primarily bottlenecked by the slow refresh rate of the liquid crystal spatial light modulator (LC-SLM) and by the data communication time between the optical devices and the host PC. However, because our scheme is applicable to a variety of implementations, the optical training algorithm is not limited to the devices used in this demonstration, and our optically trained ONN can therefore be scaled up to improve computing performance. In a previous experimental setup, we demonstrated an ONN with 100 neurons per layer.12 In addition, an ONN capable of interconnecting 1000 neurons can be realized using high-resolution SLMs. ONN input data can be switched at speeds up to 100 GHz using advanced optical transceiver components. Therefore, computational speeds up to
4 Appendix: Materials and Methods
4.1 Multilayer ONN
To construct the multilayer ONN, we connect two optical multipliers in series. For the first layer (MVM-1), the input neuron vector
As MVM requires performing dot products of the input vector with every row of the matrix, we create multiple copies of the input vector pattern on DMD-1, replicating the logical pixel patterns vertically. We image DMD-1 onto the
The DMD-1 logical pixels are imaged to blocks of pixels on SLM-1 representing matrix elements using a simple
The MVM-1 result
The beam passes through a rubidium vapor cell to apply the activation function, such that immediately after the cell the beam encodes the hidden layer activation vector,
To read out the activation vectors required for the optical training, we insert beam splitters at the output of each MVM to tap off a small portion of the beam. The real-valued vectors are measured by high-speed cameras, using coherent detection techniques detailed in Note 2 in the Supplementary Material.
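For concreteness, the free-space MVM described above can be summarized numerically: the input vector is replicated on the DMD (fan-out), modulated pixel-by-pixel by the weight matrix displayed on the SLM, and summed along one axis by the cylindrical lens (fan-in); injecting a vector from the output side and fanning in along the perpendicular axis yields multiplication by the transposed matrix, as used for backpropagation. Variable names and the array orientation below are illustrative, not the experimental layout.

```python
# Illustrative sketch of the bidirectional free-space optical MVM.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)                   # input neuron vector (field amplitudes)
W = rng.normal(size=(5, 3))              # weight matrix encoded on the SLM

fan_out = np.tile(x, (W.shape[0], 1))    # DMD replicates the input vector
modulated = W * fan_out                  # per-pixel amplitude modulation by the SLM
y = modulated.sum(axis=1)                # cylindrical lens sums each row (fan-in)
assert np.allclose(y, W @ x)             # forward direction: a standard MVM

e = rng.normal(size=5)                   # error vector injected backward
back = (W * np.tile(e, (W.shape[1], 1)).T).sum(axis=0)
assert np.allclose(back, W.T @ e)        # backward direction: multiplication by the transpose
```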
At the output layer of the ONN, we use a digital softmax function to convert the output values into probabilities and calculate the loss function and output error vector, which initiates the optical backpropagation.
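The digital output stage amounts to a softmax, a loss evaluation, and the formation of the output error vector that seeds the optical backpropagation. The sketch below assumes a cross-entropy loss and one-hot labels; the paper's exact loss definition is in the Supplementary Material.

```python
# Minimal sketch of the digital output stage (softmax, loss, output error).
import numpy as np

def softmax(z):
    z = z - z.max()                       # numerical stability
    e = np.exp(z)
    return e / e.sum()

z_out = np.array([0.7, -0.2])             # measured ONN output neuron values
target = np.array([1.0, 0.0])             # one-hot class label

p = softmax(z_out)
loss = -np.sum(target * np.log(p))        # cross-entropy (assumed)
output_error = p - target                 # error vector sent back into the optics
print(loss, output_error)
```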
4.2 Optical Backpropagation
The output error vector,
In our experiment, different areas of a single DMD were used as DMD-1 and DMD-2. The entire DMD area is mapped to SLM-1. The area of SLM-1 onto which DMD-1 is imaged encodes the weight matrix, whereas the area onto which DMD-2 is imaged encodes the sign of the error vector. The forward and backward beams are separated by a pick-off mirror after SLM-1. A schematic of this setup is provided in Fig. S2(a) in Note 2 in the Supplementary Material.
Each training iteration consists of optically measuring all of
4.3 SA Activation
The cell with atomic rubidium vapor is heated to 70°C by a simple heating jacket and temperature controller. The laser wavelength is locked to the
In the experiment, the backward probe response does not match perfectly with the simple two-level atomic model, due to two factors.
First, the probe does not undergo 100% absorption, even with the pump turned off. Second, a strong pump beam causes the atoms to fluoresce in all directions, including along the backward probe path. Therefore, the backward signal has a background offset proportional to the forward signal. To compensate for these issues, three measurements are taken to determine the probe response
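The exact combination of the three measurements is given in the text above (truncated here); the sketch below shows one plausible correction consistent with the two stated effects, in which the pump-induced fluorescence background and the residual zero-pump transmission are both subtracted from the raw probe signal. The variable names and the correction formula are assumptions for illustration only, not the paper's procedure.

```python
# One plausible background-corrected probe measurement (illustrative only).
import numpy as np

probe_with_pump = np.array([0.62, 0.35, 0.48, 0.71, 0.30])   # probe on, pump on
fluorescence_bg = np.array([0.05, 0.02, 0.03, 0.06, 0.02])   # probe off, pump on
probe_no_pump   = np.array([0.12, 0.12, 0.11, 0.13, 0.12])   # probe on, pump off

# Remove the pump-induced fluorescence background and the residual (imperfect)
# absorption measured with the pump off, leaving the pump-dependent response.
probe_response = probe_with_pump - fluorescence_bg - probe_no_pump
print(probe_response)
```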
Finally, using a single vapor cell to perform nonlinear activation on all hidden layer vector elements limited the achievable hidden layer dimension, due to cross talk among modes. We were able to accommodate five individual modes without substantially increasing the nonlinear activation noise due to cross talk. A greater number of modes could be accommodated by encoding the activation vector in both transverse dimensions or by dividing the cell into multiple physically separated microcells. Alternatively, one can use a different activation mechanism altogether, e.g., coupling each hidden layer mode to an erbium-doped fiber amplifier.
More experimental details are available in the thesis by Dr. James Spall.36
James Spall completed his PhD at the University of Oxford in the Atomic and Laser Physics Department. His research into optical neural networks and optical computing hardware has resulted in publications in a range of leading scientific journals, numerous conference talks, multiple patents, and co-founding the optical computing start-up Lumai. He previously completed his master’s degree in mathematics and physics at Durham University, winning numerous awards and graduating top of his cohort.
Xianxin Guo is co-founder and head of research at Lumai, an Oxford-based startup developing optical computing products. After earning his PhD in physics from the Hong Kong University of Science and Technology in 2018, he served as a Research Fellow of the Royal Commission for the Exhibition of 1851 at Oxford and as a lecturer at Keble College, bringing a decade of international experience in optics and quantum physics.
Alexander I. Lvovsky is an award-winning educator and experimental physicist with expertise in quantum and classical optics and optical neural networks, best known for his work on quantum light. He grew up in Moscow and completed his PhD at Columbia University in 1998. After working in several academic positions throughout the world, he became a professor at Oxford University in 2018. Aside from his research, he is a popular public speaker, quantum science evangelist, and education outreach leader.
References
[11] Y. LeCun, Y. Bengio, G. Hinton. Deep learning. Nature, 521, 436-444 (2015).
[13] S. M. Buckley et al. Photonic online learning: a perspective. Nanophotonics, 12, 833-845 (2023).
[16] S. M. Tam et al. Learning on an analog VLSI neural network chip, 701-703 (1990).
[23] S. Bartunov et al. Assessing the scalability of biologically-motivated deep learning algorithms and architectures, 9390-9400 (2018).
[27] K. Wagner, D. Psaltis. Multilayer optical learning networks. Appl. Opt., 26, 5061-5076 (1987).
[32] R. W. Boyd. Nonlinear Optics (2020).
[36] J. Spall. Training neural networks with end-to-end optical backpropagation (2024).
