
- Photonics Research
- Vol. 13, Issue 6, 1469 (2025)
1. INTRODUCTION
End-to-end (E2E) learning is an emerging approach based on deep learning that offers new solutions to complex problems [1]. By using a single neural network (NN) to represent an entire target system, E2E learning bypasses the intermediate steps typically required in traditional methods, thereby simplifying the learning process. For example, in computer vision, E2E learning can simultaneously optimize both the encoder NN for input image compression and the decoder NN for output recovery, achieving better overall system performance compared to optimizing these components separately. E2E learning has become a widely adopted optimization strategy, with applications across various domains, including computer vision, natural language processing, autonomous driving, robot control, computational imaging, and optical computing [2–7].
The rise of E2E learning has also impacted communication technologies since 2017 [8]. As communication systems grow more complex [9–15], traditional block-wise optimization methods often fall short in ensuring optimal overall performance [16]. E2E learning has emerged as a promising solution. The process of signal generation and recovery in communication systems is analogous to image compression and reconstruction in computer vision, as both aim to recover input messages at the system output. This similarity makes it natural to apply the E2E learning concept to communication systems by treating the entire system—including the transmitter, receiver, and transmission channel—as an autoencoder, as illustrated in Fig. 1(a). In this framework, the transmitter (Tx) and receiver (Rx) are represented as an encoder and a decoder, respectively, through two separate NN blocks that are trained jointly to learn an intermediate representation robust to channel impairments.
Figure 1. (a) A communication system utilizing end-to-end (E2E) learning can be represented as an autoencoder, which consists of three main components: an encoder (transmitter), a decoder (receiver), and the actual transmission channel. The transmission channel can be any of various communication systems. (b) Conventional E2E learning involves accurately modeling the transmission channel first, followed by performing the backpropagation (BP) algorithm in the digital domain. Channel modeling can be achieved using either physics-based approaches, known as “white-box” models, or pure data-driven methods, referred to as “black-box” models. (c) Proposed physics-guided learning: the forward pass of backpropagation is executed on the actual transmission channel, while the backward pass estimates the gradient using a simplified white-box model.
In theory, E2E learning enables overall system optimization for any practical channel without extensive analytic evaluation of the physical system, thereby making it possible to pursue the best end-to-end performance [16]. Despite these potential advantages, implementing E2E learning effectively in real-world communication systems remains a significant challenge [17,18]. The main obstacle is that, to use effective training algorithms such as backpropagation (BP), E2E learning requires the entire transmission system to be differentiable. However, the physical transmission channel is inherently non-differentiable, which complicates the direct application of E2E learning.
Most solutions to this challenge are to construct a differentiable digital model that approximates the actual physical channel, enabling the learning process to occur in the digital domain [16,17,19–39], as illustrated in Fig. 1(b). Differentiable channel models are typically derived through two main approaches. The first approach relies on physical laws, such as using the split-step Fourier method to model nonlinear fiber channels [21,22]. The second approach utilizes data-driven algorithms, where NN models, including generative adversarial networks (GANs) [17,40] and other NN architectures [34,36], approximate the channel. While these methods facilitate E2E learning, they also impose significant demands for accurate modeling, leading to high training costs and complexity. For example, in order to obtain an accurate channel approximation across different channel conditions, data-driven approaches can be both data-intensive and time-consuming [17]. Even with these efforts, discrepancies between the digital model and the actual physical system remain inevitable, resulting in performance degradation in real-world deployments [41]. Additionally, model-free methods, such as reinforcement learning [41–43] and cubature Kalman filters [44], come with their own challenges, including high training complexity and slow convergence [28]. Therefore, achieving efficient E2E learning with fast, low-cost training while maintaining system performance and signal quality remains a significant challenge.
To address these challenges, we propose a new framework called physics-guided learning, which incorporates the actual physical channel into the training process. By doing so, our method enables a simple and approximate white-box model to outperform a complicated digital model in both training and implementation performance, while also becoming more resilient to system noise. As shown in Fig. 1(c), in our approach, the physical channel is used during the forward pass to generate the output, while the white-box model is employed to compute gradients for backpropagation. Our method is analogous to hardware-in-the-loop optimization, a concept often used to validate simulations of complex systems [45,46]. Our method simultaneously enhances training speed, generalization ability, and signal quality.
We experimentally demonstrate the effectiveness of our approach in short-reach coherent optical fiber systems. The goal is to train an NN-based digital pre-distortion (DPD) module to mitigate impairments from both the transmitter and receiver. Verified through both experiments and simulations, our approach outperforms other mainstream methods during both the training and deployment stages [16–18]. During training, our method improves efficiency by reducing the number of training iterations by more than 80%. In deployment, our approach demonstrates the strongest capability in addressing nonlinear distortion, shows the highest resilience to noise, and exhibits superior generalization to different link losses, modulation formats, and transmission scenarios. Furthermore, our method offers a brand-new framework for enabling accurate and efficient E2E learning in communication systems, which holds significant potential for broader applications, including long-haul optical systems and hybrid RF/optical wireless systems.
2. WORKING PRINCIPLE OF PHYSICS-GUIDED LEARNING
The principle and process of the proposed physics-guided learning are illustrated in Fig. 1(c). In this framework, the input message
In a standard training procedure using backpropagation, the process involves a forward pass, error calculation, backpropagation of errors through the NN to compute gradients, and, finally, updating the NN parameters. This process requires a precise mathematical model of the entire transmission system to compute gradients accurately. However, deriving such an accurate model is challenging and computationally intensive, especially for complex systems with noise.
To address this problem, our approach uses the actual physical transmission system for the forward pass and error measurement, while employing a differentiable proxy model for backpropagation. A key difference between our method and standard digital-domain learning is that our approach does not require the actual system and the proxy model to be identical. The tolerance for discrepancies is large: the training process decreases the error as long as the angle between the gradient obtained from the proxy model and the gradient of the actual system is less than 90° [47,48]. This allows the proxy model to be significantly simplified to a rough white box, reducing computational complexity while maintaining high system performance.
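The 90° condition can be motivated by a first-order expansion of the loss. The following is a sketch in notation introduced here for illustration (it is not taken from the paper): L is the loss, θ the trainable parameters, η a small positive step size, g the true gradient of the actual system, ĝ the gradient estimated through the proxy model, and φ the angle between them.

```latex
\theta' = \theta - \eta\,\hat{g},
\qquad
L(\theta') - L(\theta) \;\approx\; -\eta\, g^{\top}\hat{g}
  \;=\; -\eta\,\lVert g\rVert\,\lVert \hat{g}\rVert \cos\varphi .
```

For a sufficiently small η, the right-hand side is negative whenever cos φ > 0, that is, whenever φ < 90°. The proxy therefore only needs to point roughly in the same direction as the true gradient, not to match it.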
The specific training process is as follows. Before training, we first establish a simple, differentiable proxy model for the physical channel based on physical laws—this model is referred to as a “white box.” During the training process, the output
This training flow benefits E2E learning in two significant ways. First, executing the forward pass through the actual physical channel ensures that the output incorporates all real information. Implicit impairments and features, such as noise and residual effects post-DSP, are automatically included through forward propagation. Non-differentiable operations in the actual system, such as quantization, which typically limits E2E learning [41], are also included. Consequently, our method enhances the deployment accuracy.
Second, grounding the process in real data reduces the need for a precise digital model for gradient estimation, thereby accelerating the training process. Our method can achieve necessary accuracy using a simple white-box model as the proxy channel. Compared to purely data-driven neural network models, known as “black boxes,” white-box models offer greater generalization ability and higher noise resilience.
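The training flow described above can be illustrated with a toy numerical sketch. Everything in it is an assumption made for illustration: the "physical channel" is a stand-in function (tanh saturation, additive noise, 8-bit quantization), the proxy is a rough sinusoidal model, and the pre-distorter is a two-parameter polynomial rather than the NN used in the paper. The point it shows is only the split between a forward pass on the real channel and a backward pass through a mismatched proxy.

```python
import numpy as np

rng = np.random.default_rng(0)

def physical_channel(x):
    """Stand-in for the real link (hypothetical): saturation, noise, quantization."""
    y = np.tanh(1.5 * x) + 0.01 * rng.standard_normal(x.shape)
    return np.round(y * 127) / 127  # non-differentiable; used in the forward pass only

def proxy_derivative(x):
    """dy/dx of a rough white-box proxy y = sin(x); it does not match the real link."""
    return np.cos(x)

def dpd(s, theta):
    """Toy memoryless pre-distorter x = a*s + b*s**3 (illustrative only)."""
    a, b = theta
    return a * s + b * s**3

def train(n_iter=400, lr=1.0, batch=2048):
    theta = np.array([1.0, 0.0])          # start from a pass-through DPD
    history = []
    for _ in range(n_iter):
        s = rng.uniform(-0.8, 0.8, batch)  # transmitted symbols
        x = dpd(s, theta)
        y = physical_channel(x)            # forward pass on the "real" channel
        err = y - s                        # error measured at the channel output
        history.append(np.mean(err**2))
        dy_dx = proxy_derivative(x)        # backward pass through the proxy only
        dx_dtheta = np.stack([s, s**3])    # gradient of the DPD output w.r.t. (a, b)
        grad = 2.0 * np.mean(err * dy_dx * dx_dtheta, axis=1)
        theta = theta - lr * grad
    return theta, history

theta, history = train()
print(f"MSE: {history[0]:.4f} -> {history[-1]:.4f}, theta = {theta}")
```

In this sketch the measured error always comes from the stand-in channel, so noise and quantization shape the update even though neither appears in the proxy.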
The following section will demonstrate an example of applying our method for digital pre-distortion in short-reach coherent fiber systems, highlighting its significant advantages at both the training and deployment stages.
3. APPLICATION IN COHERENT OPTICAL SYSTEM
A. Digital Pre-Distortion For Short-Reach Transmission
Coherent optical systems have dominated long-haul transmission over the past decade [15]. Recently, coherent technologies have opened up new development opportunities for short-reach transmission, especially for beyond-Tb/s data-center interconnects and passive optical networks (PONs), to meet the growing demands on data traffic [49–54]. At such high data rates, transmission performance becomes increasingly susceptible to imperfections in cost-effective transceiver components rather than impairments from fiber [55]. To tackle these issues, digital pre-distortion (DPD) is employed at the Tx side to pre-compensate for signal distortions caused by transceiver devices [25,42,55–63]. Recent advances have demonstrated the use of NNs to develop DPD modules, with ongoing research focused on efficiently learning DPD parameters while maintaining low costs and high performance [42,55,59,60].
We show that our physics-guided learning approach can effectively optimize the parameters of a DPD module, yielding superior performance compared to mainstream training methods [16–18]. The operation scheme of our method adapted for DPD training is illustrated in Fig. 2(a). During training, the encoder NN, represented by the green block, functions as the DPD module. The actual physical channel is used for the forward pass to generate the output
Figure 2. Schematic diagrams of methods adapted for DPD training. (a) Proposed physics-guided learning method. (b) Prior E2E learning methods. (b-i) Method 1: hybrid-domain learning with a data-driven model [18]. (b-ii) Method 2: digital-domain learning with a complicated physics-based model [16,19
Figure 3. Experimental setup for our physics-guided learning method, configured as either an amplifier-less or an 80-km transmission system. The physical transmission system and the Rx-DSP compose the actual channel. The simplified white-box digital channel employs a series of physical models that cover only several components. DPD, digital pre-distortion; AWG, arbitrary waveform generator; IQ-MOD, IQ modulator; VOA, variable optical attenuator; EDFA, erbium-doped fiber amplifier; SSMF, standard single-mode fiber; LO, local oscillator; OSC, oscilloscope; LPF, low-pass filter.
In our experiments, we use an adaptive finite impulse response (FIR) filter as a post-equalizer at the Rx, the coefficients of which are adaptive to the changes in the DPD. We optimize its coefficients jointly with DPD using gradient descent, analogous to training an NN. The function of the FIR is equivalent to a single-layer NN [22]. Similar single-layer NNs have been adopted in prior E2E learning frameworks [22,34,36,37]. While we do not use an NN-based post-equalizer, the method of training the FIR filter in our experiment is also applicable to training an NN-based post-equalizer. Although including an NN-based post-equalizer would further improve the performance (see Appendix C), we do not include it, as it significantly increases DSP complexity, which is not acceptable for the short-reach system we are targeting.
To demonstrate the effectiveness of our method, we compare it comprehensively with three prior E2E learning methods. The first method, illustrated in Fig. 2(b-i), also incorporates the actual physical system during training but relies on a digital NN obtained through a data-driven approach for the backward pass (referred to as Method 1) [18]. This method inherently requires alternating training of two NNs, one for the channel model and the other for the DPD module. Since training is performed directly on the real system, updates to the DPD significantly change the statistics of the transmitted signals, which, in turn, causes notable changes in the actual channel response. Therefore, the NN for channel modeling needs to be retrained periodically to track the true response. This will be demonstrated in Section 3.C.
The second and third methods, shown in Figs. 2(b-ii) and 2(b-iii), are standard model-based E2E learning approaches (referred to as Methods 2 and 3, respectively). Both approaches require a highly accurate digital model. In the method illustrated in Fig. 2(b-ii), the digital model is based on complicated digital representations of physical laws, such as the split-step Fourier method [21,22], to simulate the actual channel response. The third method, depicted in Fig. 2(b-iii), uses a data-driven approach to derive an NN model for the digital representation. There are two common strategies to implement Method 3: one follows alternating training similar to Method 1 [17,33,34], while the other fully pre-trains the channel model first and then fixes the pre-trained channel model when training transceiver modules [36–39]. Alternating training allows the channel model to track the changes in system response during learning but needs frequent data acquisition from the real channel. In contrast, fully pre-training the channel model enables offline learning but demands a large, diverse dataset to ensure accurate modeling [35]. Therefore, both strategies require a considerable amount of data and time to train the channel model. In our demonstration, we use alternating training for Method 3 in order to draw a direct comparison with Method 1.
B. System Setup
Figure 3 shows the experimental setup for short-reach coherent fiber transmission and the DPD learning flow using our proposed method. To cover diverse short-reach scenarios [49], we examine two systems: an optical-amplifier-less system, accounting for data-center links, and an 80-km transmission system with optical amplification, suitable for access and metro links where amplification is acceptable. The DPD module is trained on the amplifier-less system but evaluated on both systems, showcasing its adaptability to longer transmission scenarios.
Initially, a sequence of transmitted symbols
Parameters for Experimental Systems
Parameter | Value |
---|---|
AWG (DAC resolution, bandwidth) | 8 bits, 45 GHz |
RF driver (gain, | 17 dB, 7.8 V, 40 GHz |
IQ modulator ( | 3.5 V, 22 GHz |
Laser/LO (linewidth) | |
Coherent receiver (bandwidth) | 22 GHz |
Oscilloscope (bandwidth) | 59 GHz |
Symbol rate | 50 Gbaud |
The amplifier-less system’s transmission performance is primarily impacted by nonlinearities, bandwidth limitations of components at both the transmitter and receiver, and system noise. The noise includes additive noise from the receiver and phase noise caused by non-ideal laser sources. The 80-km transmission system introduces further challenges, such as amplified spontaneous emission (ASE) noise from the EDFAs, chromatic dispersion (CD), and potential fiber nonlinearities from the longer fiber link. This necessitates the use of a CD compensation block. The DPD module’s role is to address signal distortions mainly originating from transceiver devices. Nevertheless, our DPD can improve signal quality even in the presence of fiber-induced impairments under 80-km transmission conditions, as demonstrated in Section 3.E.3.
To train the DPD using our physics-guided learning method, we transmit signals in the actual amplifier-less channel to perform the forward pass, as indicated by the central blue area in Fig. 3. The backward pass, as a feedback link to the DPD, is designed to reflect only the major distortions from devices and is thus greatly simplified compared to the actual channel. As shown in the white box of Fig. 3, the backward pass employs a series of physical models that represent a few key components, including the RF drivers, the IQ modulator, the overall bandwidth limitation of the system represented as a low-pass filter (LPF), and a matched filter paired with the pulse shaping before the DPD. These physical models are rough. For instance, the IQ modulator is modeled as ideal sinusoidal functions, disregarding potential mismatches between the in-phase and quadrature arms. Moreover, all the required parameters are obtained from datasheets rather than measured from actual devices, and noise is excluded from the model. Despite this simplicity, we will demonstrate in the results sections that our learning method effectively guides the DPD toward the best performance. (See Appendix B for modeling methods of physical models.)
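To make the idea of a rough white box concrete, the following is a minimal sketch of such a component chain. It is not the model of Appendix B. The tanh soft limiter for the RF driver and the Gaussian low-pass shape are assumptions, as are the 100 GS/s sampling rate and the function names. The numeric defaults are illustrative readings of the parameter table: the 7.8 V entry is treated as a driver saturation voltage and the 3.5 V entry as the modulator half-wave voltage, which is a guess because the table labels are incomplete. The matched filter is omitted for brevity.

```python
import numpy as np

def rf_driver(v, gain_db=17.0, v_sat=7.8):
    """Soft-limiting amplifier; the tanh form and v_sat reading are assumptions."""
    gain = 10 ** (gain_db / 20)
    return v_sat * np.tanh(gain * v / v_sat)

def iq_modulator(v_i, v_q, v_pi=3.5):
    """Ideal IQ modulator: sinusoidal field response per arm, no IQ mismatch."""
    return np.sin(np.pi * v_i / (2 * v_pi)) + 1j * np.sin(np.pi * v_q / (2 * v_pi))

def low_pass(x, bw_hz=22e9, fs_hz=100e9):
    """Gaussian low-pass in the frequency domain (shape and fs_hz are assumptions)."""
    f = np.fft.fftfreq(x.size, d=1 / fs_hz)
    h = np.exp(-np.log(2) / 2 * (f / bw_hz) ** 2)  # power response is 1/2 at f = bw_hz
    return np.fft.ifft(np.fft.fft(x) * h)

def white_box(v_i, v_q):
    """Drivers -> modulator -> overall bandwidth limit; noise-free by construction."""
    field = iq_modulator(rf_driver(v_i), rf_driver(v_q))
    return low_pass(field)

# Usage: a 600 mV peak-to-peak square wave on the in-phase arm, nothing on quadrature.
t = np.arange(256)
v_i = 0.3 * np.sign(np.sin(2 * np.pi * t / 16))
out = white_box(v_i, np.zeros_like(v_i))
print(f"peak field magnitude: {np.abs(out).max():.3f}")
```

Every stage here is a smooth closed-form function, so its derivative is cheap to evaluate, which is what the backward pass needs.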
Note that, to explore the simplification limit of channel models, we also attempted to exclude all components and model the backward pass as an identity matrix. Training still converged, albeit with poorer performance than with the white-box model shown in Fig. 3 (a 34.6% increase in mean square error). This indicates that the angle between the gradient estimated through the identity matrix and the true gradient is large but still less than 90°. A similar example can be found in Ref. [55].
The three prior learning methods introduced in Section 3.A for comparison are performed on the amplifier-less system with the same DPD NN structure. Methods 1 and 3 use the identical NN structure for both channel modeling and DPD. The NN adopts a feed-forward architecture with an input sliding window [42,60] and has 3108 learnable real-valued parameters. Method 2 uses a complicated physics-based model to match the amplifier-less experimental setup. The model includes basic models of the transmitter and receiver (identical to the white-box model used in our method), system noise (additive white noise and phase noise), and Rx-DSP. Its physical parameters are also extracted from datasheets. Additionally, hyperparameters such as initial learning rates are optimized for each method, as detailed in Appendix D. (See Appendices A–D for NN structure, channel models, and training details.)
In the results sections, we will first analyze the training complexity of our method and compare it with existing approaches (see Section 3.C). We will then evaluate the performance of the DPD module as a function of the peak-to-peak voltage (Vpp) at the AWG output, a critical factor affecting the transmission quality of the amplifier-less short-reach system (see Section 3.D). Vpp influences the output swing of the RF drivers and the operation of the IQ modulator. While a higher Vpp can boost optical signal power, it also leads to larger nonlinear distortions from the RF drivers and modulator. We will investigate how effectively our method identifies the optimal Vpp value compared to other methods. Finally, we investigate the generalization ability of our method by showing how the DPD module trained using our approach can adapt to different transmission conditions, including fiber link losses, modulation formats, and the 80-km transmission scenario with optical amplification (see Section 3.E).
C. Evaluation and Comparison of Training Complexity and Accuracy
Here, we demonstrate the advantages of our method in terms of training complexity and accuracy. The amplifier-less system is trained using 32-QAM signals with a 600 mV Vpp at the AWG and 5 dB optical link loss. Figure 4(a) compares the training loss of our method with that of Method 1 and Method 3, where both channel modeling and DPD training processes use mean square error (MSE) as the loss function. We visualize convergence speed as a function of training iterations because it reflects both required training time and training data. Fewer training iterations imply lower training data requirements. Method 2 is not included in this comparison, as it does not need system measurements but produces unacceptably low DPD performance during deployment in the real system. The performance of Method 2 will be discussed in the next section.
Figure 4. Training process comparisons between our method and prior Method 1 (hybrid-domain data-driven method) and Method 3 (digital-domain data-driven method, implemented by alternating training). (a) Training loss versus training iteration in experiments, under the conditions of 5 dB link loss and 600 mV Vpp. (b) Validation MSE versus training iteration in simulations. (b-ii) Zoom-in of (b-i) to compare the required iteration numbers for different methods when reaching the same MSE.
As shown in Fig. 4(a), our method achieves the fastest convergence speed by using a simplified white-box model during training. This contrasts with Methods 1 and 3, which adopt alternating training for two NNs. Despite its simplicity, our method can effectively guide the optimization process in the correct direction, as evidenced by the continuously decreasing training loss. In contrast, Methods 1 and 3 involve alternately training two NNs—one for the channel model and the other for the DPD module—during each round [as illustrated by the different colored regions in Figs. 4(a-ii) and 4(a-iii)]. A high channel modeling loss is observed at the start of the second round (marked by the black dashed circle) in Fig. 4(a-ii), which indicates that the actual channel response changes significantly after DPD training and retraining the channel model is essential. This alternating training process substantially increases both the training time and training data.
It is important to note that training loss alone does not rigorously reflect the true performance of the system, as some methods do not account for real-system effects, such as noise, during training. To accurately evaluate how system performance evolves during training, a validation dataset must be tested on the system. The validation results, shown in Fig. 4(b), are periodically assessed by calculating the MSE between the input symbol sequence in the validation dataset and the output signal. As shown in Fig. 4(b-i), our method achieves the lowest MSE with the fewest training iterations. Remarkably, our method has already converged when the two prior methods have just completed their first round of training. The MSE is reduced by 23.3% and 69.2% compared to Method 1 and Method 3, respectively. Notably, Method 3 not only exhibits the slowest convergence speed but also suffers from overfitting, as indicated by the rising MSE, marked by red dashed circles in Fig. 4(b-i). Method 1 avoids the overfitting issue by incorporating the transmission system into the training process. However, alternating training greatly increases the training time. Finally, as shown in Fig. 4(b-ii), our method achieves the same MSE with approximately 80 iterations, compared to around 440 and 720 iterations for Method 1 and Method 3, respectively. This corresponds to a reduction of more than 80% in iteration numbers and required data, as well as a decrease of more than 75% in the number of system measurements. Additionally, the training time is estimated to be reduced from about 20 min to just 3 min. The calculations of data amount, number of system measurements, and training time are detailed in Appendix D.
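The quoted reduction in iterations follows directly from the approximate counts above. The short check below uses only numbers stated in the text. The helper name `reduction` is introduced here for illustration, and the figures for data amount and system measurements are not recomputed because their derivation is in Appendix D.

```python
# Approximate iteration counts quoted from Fig. 4(b-ii).
iters = {"ours": 80, "method_1": 440, "method_3": 720}

def reduction(new, old):
    """Fractional reduction when going from `old` to `new`."""
    return 1.0 - new / old

for name in ("method_1", "method_3"):
    saved = reduction(iters["ours"], iters[name])
    print(f"vs {name}: {saved:.1%} fewer iterations")

# Estimated training time quoted in the text: about 20 min down to about 3 min.
print(f"training time: {reduction(3, 20):.0%} shorter")
```

Both iteration reductions come out above 80% (roughly 82% and 89%), consistent with the statement in the text.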
D. Performance in Nonlinearity Impairment Mitigation
After training, we deploy the DPD models trained by different methods to evaluate their performance in signal pre-equalization. In this section, we compare the methods based on their capability to mitigate nonlinear impairments. This ability is crucial as it allows for higher Vpp values to drive the IQ modulator to result in a higher signal-to-noise ratio (SNR) for the amplifier-less system. While a higher Vpp can increase the optical signal power, it also introduces more significant nonlinear distortions from the RF drivers and modulator. Thus, effective nonlinear impairment mitigation is essential for improving the system SNR. In this experiment, we fix the optical link loss at 5 dB and train separate DPD modules for each Vpp value using 32-QAM signals. After training, we test each DPD module at the specific Vpp value set for the training. The fiber link without DPD serves as the baseline for comparison. The system performance is analyzed by calculating the SNR and bit error rate (BER) based on recovered symbols.
The results are presented in Fig. 5. Our method demonstrates the highest tolerance to nonlinearity and achieves the best SNR with a time-efficient training procedure. As shown in Fig. 5(a), the DPD trained using our method (blue line) consistently delivers the highest SNR across all tested Vpp values. Compared to the baseline (yellow line), it provides an SNR gain of 0.88 dB and raises the optimal Vpp value from 500 to 600 mV, indicating its capability to mitigate severe nonlinear distortions and enhance launch powers. We also observe that the performance gap between prior methods and ours widens as the Vpp increases. This is because higher Vpp levels introduce more severe nonlinear distortions, and the DPD obtained using our method demonstrates a greater ability to compensate for these distortions. The SNR using Method 1 (orange line) is 0.33 dB lower than that of our method. The worse performance arises from inadequate training for channel modeling in Method 1, which introduces additional biases and increases the gap between the model and the actual system. Method 3 exhibits an even lower SNR, indicating that the NN used for channel modeling, which is fixed for digital-domain DPD training, is not accurate. Method 2 (purple line) shows the worst SNR performance, which is even lower than the baseline at some Vpp values. This suggests that its DPD is ineffective, revealing that completely detaching from measurements of the actual system can result in large modeling errors and significant performance loss during real-system deployment. As a result, our method is the only one that achieves BER values below the 14.8% overhead (OH) forward error correction (FEC) threshold of 0.0125 [50], as shown in Fig. 5(b), outperforming all other methods under comparison.
Figure 5. Performance comparison in impairment mitigation. (a) Calculated SNR versus Vpp of DPDs trained through different methods. (b) Calculated BER versus Vpp, followed by (b-i) and (b-ii) showing the received constellations without and with DPD at their respective optimal Vpp values. (c) Comparison of transmitted signal spectra with and without DPD.
In addition, the DPD obtained through our method can compensate for the bandwidth limitations of the physical system, which always exhibits low-pass characteristics and hinders high-speed transmission. Our DPD counteracts this impairment by boosting the high-frequency components before transmission. As shown in Fig. 5(c), the signal spectrum with our DPD shows peaks at the edge frequencies, in contrast to the flat spectrum of original signals without DPD.
E. Generalization Capability
This section will examine the generalization capability of our method, specifically its ability to adapt the trained DPD to different link conditions that were not included in the training phase. The DPD is first trained in the amplifier-less system with 5 dB link loss and 600 mV Vpp and then tested under different link conditions. Specifically, the evaluations against link losses and modulation formats are conducted in the amplifier-less system, while the DPD’s performance in the 80-km transmission system is evaluated for varying launch powers.
1. Adaptability to Optical Link Losses
We first evaluate the performance of the trained DPD under varying link losses, where the signal experiences different SNRs after detection. Figure 6(a) shows the BER as a function of link losses. Across all evaluated link losses, our method (blue line) consistently achieves the lowest BER. When the link loss is small (less than 5 dB), the BERs of all methods stop decreasing, because residual distortions, such as nonlinearity, become the dominant limiting factors rather than noise. In this case, our method remains the only one that achieves BER values below the 14.8% OH FEC threshold. As the link loss increases, noise becomes the dominant factor. Under these conditions, our method continues to demonstrate lower BER values, while other methods show performance close to or even worse than the baseline, particularly at a 15 dB link loss. Our method achieves a 1.00 dB gain in power budget over the baseline for the 25% OH FEC threshold of 0.04 [50]. The power budget is calculated as the difference between the launched and received optical power, equivalent to the link loss.
Figure 6. (a) BER versus optical link loss for DPD modules from different E2E learning methods. All DPDs were trained at a fixed 5 dB link loss. (b) Noise resilience investigation in comparison with ILA. (b-i) Calculated BER versus link loss of DPDs trained at fixed 3 dB, 5 dB, and 8 dB link loss values, respectively. (b-ii) Zoom-in of (b-i).
Notably, the BER value of Method 3 (green line) increases dramatically as the link loss rises, performing worse than the baseline’s BER starting from a 9 dB link loss. This decline in performance, even worse than the baseline, can be attributed to training biases in the channel model due to a data-driven approach—since noise is not included in the DPD training process, the learned DPD suffers from overfitting and fails to adapt to higher noise levels, leading to degraded performance under higher loss conditions. Incorporating the physical channel in the training can reduce such errors, as demonstrated by the results of our method and Method 1. However, Method 1 still performs worse than ours because the training biases still occur during the NN training for channel modeling. (See Appendix A for the detailed analysis.)
The above results clearly illustrate the superior performance of our method over prior E2E learning methods. To thoroughly validate the adaptability and optimality of our method in DPD applications, we also conduct a comparison with a traditional DPD training method called the indirect learning approach (ILA). ILA is a practical DPD optimization method owing to its relatively low complexity [56,57]. Unlike E2E learning methods, it circumvents channel modeling by training the DPD module at the Rx side before deploying it to the Tx side. However, ILA may suffer from noise bias and may not yield the optimal DPD module [58,59]. Here we investigate the noise resilience of the training processes and demonstrate that our method can outperform ILA.
We compare the two methods by training DPD modules at three fixed link losses (3 dB, 5 dB, and 8 dB) and then evaluating system performance under varying link losses. The Vpp is set to 600 mV during both training and testing. The resulting BERs are shown in Fig. 6(b). Our method demonstrates strong resilience to noise variance: the three DPDs, despite being trained under different link losses, exhibit consistent BER performance, with relative differences in BER of less than 6.5%. In contrast, ILA proves to be more sensitive to noise, with relative differences in BER as large as 18.8%. Specifically, the DPD trained at high loss (e.g., 8 dB) cannot adapt to low-loss link conditions. The BER performance using DPD trained with ILA worsens as the link loss increases during training. The most significant BER degradation, from 0.0133 to 0.0158, occurs at a 4 dB link loss, as shown in Fig. 6(b-ii). These results are comparable to those reported by other groups [55,59], suggesting that excessive noise due to high link losses hinders the DPD from effectively learning the inverse channel function. As a result, ILA cannot achieve a BER value lower than the 14.8% OH FEC threshold. When comparing DPD modules from our method and ILA, both trained at 8 dB link loss (dark blue and dark red lines), our method achieves a 28.5% reduction in BER at 4 dB link loss and provides a 0.56 dB gain in power budget over ILA for the 25% OH FEC threshold.
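The 18.8% figure can be reproduced from the two BER values quoted above. The second half of the sketch is an inference rather than a reported number: it assumes that 0.0158 is the BER of the ILA module trained at 8 dB link loss when tested at 4 dB, and applies the quoted 28.5% reduction to estimate the corresponding BER of our method. The variable names are introduced here for illustration.

```python
# BER values quoted for ILA at a 4 dB link loss.
ber_ila_best, ber_ila_worst = 0.0133, 0.0158

relative_spread = (ber_ila_worst - ber_ila_best) / ber_ila_best
print(f"relative BER difference for ILA: {relative_spread:.1%}")

# Assumption: 0.0158 belongs to the ILA module trained at 8 dB link loss.
ber_ours_est = ber_ila_worst * (1 - 0.285)
fec_threshold_14p8 = 0.0125  # 14.8% OH FEC threshold quoted in the text
print(f"estimated BER of our method: {ber_ours_est:.4f}")
print(f"below 14.8% OH FEC threshold: {ber_ours_est < fec_threshold_14p8}")
```

Under that assumption, the estimate falls below the 0.0125 threshold, in line with the statement that only our method crosses it.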
2. Adaptability to Modulation Formats
Next, we evaluate the generalization ability of our method, specifically whether the trained DPD can be applied to modulation formats that were not included in the training stage. The DPD is initially trained with 32-QAM, and then we test its performance on fiber links using 16-QAM and 64-QAM without retraining the DPD. The results in Fig. 7 show the signals’ SNR at different Vpp levels. Compared to the baseline, our method achieves SNR gains of 0.79 dB and 0.98 dB for 16-QAM and 64-QAM, respectively, both at a Vpp of 600 mV, the same value as for 32-QAM. This indicates that the optimal Vpp obtained from training with 32-QAM can still be applied to 16-QAM and 64-QAM, which demonstrates the generalization ability of our method. Notably, both our method and the baseline achieve higher SNR for 16-QAM than for 64-QAM. This is because higher-order QAM is more susceptible to noise as well as channel distortions, leading to lower measured SNR values [64]. Consequently, DPD for nonlinear compensation is more effective for 64-QAM, resulting in a larger SNR gain over the baseline compared to 16-QAM. As a result, the SNR gap between the two modulation formats narrows after applying our DPD, with the reduction equal to the difference in SNR gains (0.19 dB). Similar results can be found in Ref. [64].
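The 0.19 dB figure follows directly from the two reported gains. In the short derivation below, the symbols $\Delta$, $\mathrm{SNR}^{\text{base}}$, and $G$ are notation introduced here for illustration (all quantities in dB, taken at the same Vpp of 600 mV); they do not appear in the original text.

```latex
\begin{align}
\Delta_{\text{before}} &= \mathrm{SNR}_{16}^{\text{base}} - \mathrm{SNR}_{64}^{\text{base}},\\
\Delta_{\text{after}} &= \bigl(\mathrm{SNR}_{16}^{\text{base}} + G_{16}\bigr)
                       - \bigl(\mathrm{SNR}_{64}^{\text{base}} + G_{64}\bigr),\\
\Delta_{\text{before}} - \Delta_{\text{after}} &= G_{64} - G_{16}
  = 0.98\,\text{dB} - 0.79\,\text{dB} = 0.19\,\text{dB}.
\end{align}
```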
Figure 7.Applying the DPD module trained for 32-QAM to other formats without retraining. SNR versus Vpp for 16-QAM and 64-QAM. The insets (i)–(iv) show the received constellations at the optimal Vpp.
3. Adaptability to the 80-km Transmission Scenario
Finally, we evaluate the adaptability of our DPD trained in the amplifier-less system to the 80-km transmission system without retraining. Performance is tested under varying launch powers. Launch power is crucial for this longer transmission scenario, as higher power increases the optical signal-to-noise ratio (OSNR, the power ratio between optical signal and ASE noise) but also leads to larger fiber nonlinearity that distorts signals. We demonstrate that our DPD consistently improves the transmission performance across a wide range of launch powers.
Figure 8 shows the received signal’s SNR over different launch powers, under 32-QAM and 600 mV Vpp. We compare our method with three prior learning methods introduced in Section 3.A. DPD modules trained in the amplifier-less system using different methods are directly tested on the 80-km transmission system without retraining. The results demonstrate that our method outperforms all other methods in SNR across all launch powers. At the optimal launch power of around 6 dBm, our approach not only attains the highest SNR value but also exhibits the largest SNR gain compared to other methods. These results indicate that, among the compared methods, ours adapts best to a different transmission scenario without retraining. It maintains the best communication performance among them even in the presence of ASE noise and fiber nonlinearity.
Figure 8.Performance comparison in the 80-km transmission system. DPD modules trained in the amplifier-less system by different learning methods are tested without retraining.
4. DISCUSSION AND CONCLUSION
In this paper, we introduce a novel end-to-end (E2E) learning framework called physics-guided learning. This approach offers two significant improvements. First, by executing the forward pass through the actual physical channel, our method ensures that the output includes all real information, including implicit impairments and features such as noise, residual effects post-DSP, and non-differentiable operations such as quantization. This integration enhances the overall accuracy and effectiveness of E2E learning. Second, grounding the process in real data reduces the need for a precise digital model for gradient estimation, simplifying the training process and offering greater generalization ability.
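The forward/backward split described above can be illustrated with a toy, scalar example. The sketch below is not the authors' implementation: the two-coefficient polynomial DPD, the stand-in "actual channel" (a sinusoidal response followed by additive noise and a 6-bit quantizer), the noiseless white-box model, and all hyperparameters are assumptions chosen only to show the idea. The loss is always computed from the measured channel output, while the derivative of the channel is taken from the simplified model.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def dpd(x, w):
    """Toy predistorter (assumption): odd polynomial with two coefficients."""
    return w[0] * x + w[1] * x ** 3


def actual_channel(u, bits=6, noise_std=0.01):
    """Stand-in for the physical link: nonlinearity, noise, and a quantizer."""
    y = math.sin(0.5 * math.pi * u) + rng.gauss(0.0, noise_std)
    step = 2.0 / (2 ** bits)
    return round(y / step) * step  # quantization is non-differentiable


def model_slope(u):
    """Derivative of the simplified white-box model y = sin(pi/2 * u)."""
    return 0.5 * math.pi * math.cos(0.5 * math.pi * u)


def measured_mse(w, xs):
    """Mean squared error between measured channel output and target x."""
    return sum((actual_channel(dpd(x, w)) - x) ** 2 for x in xs) / len(xs)


def train(steps=400, batch=64, lr=0.3):
    w = [1.0, 0.0]  # start from an identity predistorter
    for _ in range(steps):
        grad = [0.0, 0.0]
        for _ in range(batch):
            x = rng.uniform(-0.8, 0.8)
            u = dpd(x, w)
            err = actual_channel(u) - x        # forward pass: measured output
            back = 2.0 * err * model_slope(u)  # backward pass: model gradient
            grad[0] += back * x / batch        # chain rule through the DPD
            grad[1] += back * x ** 3 / batch
        w = [w[0] - lr * grad[0], w[1] - lr * grad[1]]
    return w


test_x = [rng.uniform(-0.8, 0.8) for _ in range(2000)]
mse_before = measured_mse([1.0, 0.0], test_x)
w_trained = train()
mse_after = measured_mse(w_trained, test_x)
print(f"MSE before: {mse_before:.4f}, after: {mse_after:.4f}")
```

In this toy setting the quantizer and the noise never need to be differentiated, since they only enter through the measured error. The white-box model supplies the slope, which is the role the simplified channel model plays in the backward pass.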
Our demonstrations for DPD highlight the advantages of our approach through a comprehensive comparison with existing methods. Our method achieves the fastest training speed by using a simplified white-box model, avoiding the need for alternating training of two complex NNs. This results in more than 80% fewer training iterations compared to previous data-driven methods. Additionally, our approach provides the highest SNR improvement of 0.88 dB for 32-QAM signals in an amplifier-less system. Moreover, our method exhibits enhanced robustness and strong generalization capabilities. It remains resilient to system noise, and the DPD module trained with 32-QAM can be effectively applied to other modulation formats without retraining, achieving SNR gains of 0.79 dB for 16-QAM and 0.98 dB for 64-QAM. When directly applied to an 80-km transmission system, our DPD also achieves impressive SNR gains over prior learning methods. Further improvements could introduce learnable physical parameters into channel models. In this way, channel models can be dynamically tuned to adapt to significant changes in actual systems.
For the practical implementation of our learning framework, training can be conducted on pilot symbol (bit) sequences shared by the Tx and Rx sides for error calculation. The main consideration is how to send gradient information back to the Tx side. Because our method trains quickly, gradient information needs to be exchanged only during the brief training period, so any reliable feedback link can be used without requiring a high data rate. In fact, similar feedback links have been employed in various systems, such as quantum key distribution [65], model-free E2E learning methods [41], and autonomous optical networks [66,67]. Thus, constructing a temporary feedback link is feasible.
From a broader perspective, our framework offers a novel approach for optimization in communication systems and other physical systems. For instance, practical E2E learning in communications often involves DSP modules that cannot be easily replaced by a single NN. These modules are typically non-differentiable. Our method facilitates the joint optimization of neural networks even in the presence of such non-differentiable operations. To extend our method to other communication systems, their channel models for the backward pass need to be designed accordingly.
In conclusion, we have proposed a general strategy for optimizing communication systems that is applicable to various systems. This approach paves the way for developing more flexible and intelligent optical networks and holds promise for future integrated sensing and communications, where designing and optimizing increasingly complex system configurations will be crucial.
APPENDIX A: FORMULATION OF PHYSICS-GUIDED E2E LEARNING
Here we present the general formulation of the E2E learning process and details of our physics-guided learning framework.
Figure 9.(a) Schematic of E2E learning for a communication system. (b) Proposed physics-guided learning: gradient estimation with a physics-based (white-box) model. (c) Gradient estimation with a data-driven (black-box) model.
After signal transmission in the forward pass, we compare the entire system’s input
Based on the loss function, the fundamental training algorithm backpropagation (BP) is used to update the parameters of the autoencoder. To do that, the gradients of the loss function with respect to (w.r.t.) the parameters, including
Next, we calculate the gradient w.r.t. parameters of the encoder, which is given by
Similar to the decoder,
By comparing Eqs. (
Conventional methods strive to circumvent this problem by constructing a precise digital model to conduct learning in the digital domain. However, accurate modeling leads to high training costs and complexity, while being unable to bypass performance loss after deployment. In response, we propose our physics-guided learning framework.
Our physics-guided learning framework leverages a simple white-box model for gradient estimation, as shown in Fig.
In comparison to Eq. (
According to Eqs. (
As a comparison, we also illustrate the gradient estimated by a data-driven channel model. We refer to a data-driven model as a black box purely based on NNs. This model is first trained to reduce the error between its output
After fitting the actual channel, the fixed NN channel model is then used to facilitate the E2E learning process, as shown in Fig.
Compared with that estimated by a physics-based model [Eq. (
Notably, for the digital-domain learning with data-driven models (prior Method 3), the similar model output
Actually, noisy labels are reported to be more harmful than noisy inputs for NN training. Related training methods have been widely studied for years [
APPENDIX B: SIMULATION SYSTEM AND MODELING METHODS
Details of the simulation setup for the amplifier-less system and key physical models are presented here.
The simulation setup is shown in Fig.
Figure 10.Simulation setup of the amplifier-less coherent system. The inset shows the NN structure of the DPD module. Tx, transmitter; Rx, receiver.
The entire simulation system shown in Fig.
Table
Parameters for the Simulation System
Parameter | Value |
---|---|
DAC resolution | 8 bits |
RF driver (gain, | 17 dB, 7.8 V |
IQ modulator | 3.5 V |
Tx LPF bandwidth | 22 GHz |
Normalized reference power | 0 dBm |
AWGN power | −20 dBm |
Laser linewidth | 100 kHz |
Rx LPF bandwidth | 22 GHz |
Sample per symbol | 2 sps |
The RF driver is used to amplify the output signal of the DAC and then drive the modulator to accomplish electro-optical conversion. We assume it behaves as a memoryless system and only amplifies the electrical amplitude of the signal. Additionally, we assume that its transfer functions for the two branches (in-phase and quadrature parts) of the IQ modulator are identical, without any imbalance and phase distortion. Hence, it is modeled using the Rapp model [
The IQ modulator is usually composed of a pair of parallel Mach–Zehnder modulators (MZMs), each of which is configured by a push–pull methodology. High-order modulation schemes in coherent optical transmitters rely on these dual MZMs biased at the null point, with the transfer function of each branch modeled as a sinusoidal response [
In the experimental setup, the minimum bandwidth limitation of the transmitter is determined by the modulator. To represent the combined effect of temporal distortions and frequency response of components, a first-order Gaussian filter is applied as the low-pass filter of the transmitter. Therefore, the overall transfer function can be understood as a Wiener–Hammerstein (WH) structure [
As shown in Fig.
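The memoryless component models described in this appendix can be sketched in a few lines. The code below is a hedged illustration rather than the authors' simulation code. It assumes the common form of the Rapp model, reads the 7.8 V entry of the parameter table as the driver's saturation voltage and the 3.5 V entry as the modulator's half-wave voltage, and uses a smoothness factor `p = 2.0` that is purely a placeholder, since the text does not give this value here.

```python
import math


def db_to_voltage_gain(gain_db: float) -> float:
    """Convert a gain in dB to a linear voltage gain."""
    return 10.0 ** (gain_db / 20.0)


def rapp_driver(v_in: float, gain_db: float = 17.0,
                v_sat: float = 7.8, p: float = 2.0) -> float:
    """Rapp model: linear for small inputs, saturating towards v_sat.

    v_sat and p are assumptions (see the lead-in); only amplitude
    compression is modeled, with no phase distortion or memory.
    """
    v_lin = db_to_voltage_gain(gain_db) * v_in
    return v_lin / (1.0 + abs(v_lin / v_sat) ** (2.0 * p)) ** (1.0 / (2.0 * p))


def mzm_field(v_drive: float, v_pi: float = 3.5) -> float:
    """Normalized field response of a push-pull MZM biased at the null point."""
    return math.sin(math.pi * v_drive / (2.0 * v_pi))


# Example: one DAC sample of 0.3 V passed through the driver and one MZM branch.
v_drive = rapp_driver(0.3)
field = mzm_field(v_drive)
print(f"drive voltage: {v_drive:.3f} V, normalized field: {field:.3f}")
```

Cascading these static nonlinearities with the low-pass filters described in the text gives the filter/nonlinearity/filter arrangement that the Wiener–Hammerstein description refers to.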
APPENDIX C: LEARNING WITH AN NN-BASED POST-EQUALIZER
Here we demonstrate that our proposed approach can further improve the performance by including an extra NN-based post-equalizer. Specifically, an NN with the same structure as the DPD is introduced after phase recovery for nonlinear equalization and jointly learned with the Tx DPD module in simulations using our approach. As shown in Fig.
Figure 11.Training process comparison between training DPD alone and joint learning with an NN-based post-equalizer.
APPENDIX D: DPD NEURAL NETWORK AND TRAINING DETAILS
We refer to Refs. [
Previous DPD works used two real-valued NNs to process the real and imaginary parts of complex signals separately [
Two prior E2E learning methods (Methods 1 and 3) use the identical NN structure for both channel modeling and DPD, according to Refs. [
All NNs are trained using supervised learning, where the training data consists of input–target pairs. For DPD training, the transmitted symbol
In experimental implementations, training is conducted by controlling the arbitrary waveform generator (AWG) and real-time oscilloscope simultaneously, requiring frequent data loading and acquisition from the actual system. The sequence length must be carefully chosen: a shorter sequence increases data acquisition frequency but may fail to capture complete channel effects, whereas a longer sequence increases data loading time. Given that, we fix the length to
Regarding the system measurement during training, we define one measurement as one time acquisition of an overall system output sequence
For training processes requiring system measurements, the main time-consuming factor is the forward propagation stage, which includes data loading to the AWG, signal transmission through the system, data acquisition from the oscilloscope, and Rx-DSP. On our platform, the time to complete one training iteration with a sequence of
During the test, the same sequence of
In experiments, we find that the selection of optimizer and initial learning rate significantly influences the training of data-driven channel models. When using an SGD optimizer, a large initial learning rate hinders convergence, while a small one tends to result in local minima with poor modeling performance. In contrast, Adam provides stable training results across a wide range of initial learning rates (0.001–0.01). Consequently, we adopt Adam for both channel modeling and DPD training. We choose the optimal initial learning rates for different training processes by standard grid search with a precision of 0.001 over the range of 0.001–0.01. This ensures the hyperparameters used in each method are optimal. The initial learning rates used for each method are listed in Table
Initial Learning Rate
Method | Channel Model | DPD Module |
---|---|---|
Our method | N.A. | 0.005 |
Method 1 | 0.003 | 0.004 |
Method 2 | N.A. | 0.002 |
Method 3 | 0.003 | 0.003 |
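The grid search over initial learning rates described above can be sketched as follows. This is only an illustration: `placeholder_score` is a made-up stand-in for "train with this learning rate and measure the resulting loss", and its minimum is arbitrary. In an actual run it would be replaced by the measured training or validation loss for each candidate.

```python
def learning_rate_grid(low: float = 0.001, high: float = 0.01,
                       step: float = 0.001) -> list:
    """Candidate learning rates from low to high (inclusive) in fixed steps."""
    n = round((high - low) / step)
    return [round(low + i * step, 6) for i in range(n + 1)]


def grid_search(candidates, score):
    """Return the candidate with the lowest score."""
    return min(candidates, key=score)


def placeholder_score(lr: float) -> float:
    """Hypothetical loss curve, used only so the sketch can run end to end."""
    return (lr - 0.005) ** 2


grid = learning_rate_grid()
best_lr = grid_search(grid, placeholder_score)
print(grid)
print("selected:", best_lr)
```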
References
[2] I. Sutskever, O. Vinyals, Q. V. Le. Sequence to sequence learning with neural networks. arXiv(2014).
[3] N. Carion, F. Massa, G. Synnaeve. End-to-end object detection with transformers. European Conference on Computer Vision, 213-229(2020).
[17] B. Karanov, M. Chagnon, V. Aref. Concept and experimental demonstration of optical IM/DD end-to-end system optimization using a generative model. 2020 Optical Fiber Communications Conference and Exhibition (OFC), 1-3(2020).
[19] S. Li, C. Häger, N. Garcia. Achievable information rates for nonlinear fiber communication via end-to-end autoencoder learning. 2018 European Conference on Optical Communication (ECOC), 1-3(2018).
[20] T. Uhlemann, S. Cammerer, A. Span. Deep-learning autoencoder for coherent and nonlinear optical communication. Photonic Networks; 21th ITG-Symposium, 1-8(2020).
[23] J. Song, C. Häger, J. Schröder. End-to-end autoencoder for superchannel transceivers with hardware impairment. 2021 Optical Fiber Communications Conference and Exhibition (OFC), 1-3(2021).
[24] Z. He, J. Song, C. Häger. Experimental demonstration of learned pulse shaping filter for superchannels. 2022 Optical Fiber Communications Conference and Exhibition (OFC), 1-3(2022).
[28] A. Rode, B. Geiger, L. Schmalen. Geometric constellation shaping for phase-noise channels using a differentiable blind phase search. 2022 Optical Fiber Communications Conference and Exhibition (OFC), 1-3(2022).
[30] M. Schaedler, S. Calabrò, F. Pittalà. Neural network assisted geometric shaping for 800 Gbit/s and 1 Tbit/s optical transmission. 2020 Optical Fiber Communications Conference and Exhibition (OFC), 1-3(2020).
[31] V. Aref, M. Chagnon. End-to-end learning of joint geometric and probabilistic constellation shaping. 2022 Optical Fiber Communications Conference and Exhibition (OFC), 1-3(2022).
[32] V. Neskorniuk, A. Carnio, V. Bajaj. End-to-end deep learning of long-haul coherent optical fiber communications via regular perturbation model. 2021 European Conference on Optical Communication (ECOC), 1-4(2021).
[42] J. Song, Z. He, C. Häger. Over-the-fiber digital predistortion using reinforcement learning. 2021 European Conference on Optical Communication (ECOC), 1-4(2021).
[50] F. Buchali, M. Chagnon, K. Schuh. Amplifier less 400 Gb/s coherent transmission at short reach. 2018 European Conference on Optical Communication (ECOC), 1-3(2018).
[58] H. Paaso, A. Mammela. Comparison of direct learning and indirect learning predistortion architectures. IEEE International Symposium on Wireless Communication Systems, 309-313(2008).
[63] X. Lu, M. Zhao, L. Qiao. Non-linear compensation of multi-CAP VLC system employing pre-distortion base on clustering of machine learning. 2018 Optical Fiber Communications Conference and Exposition (OFC), 1-3(2018).
[64] R. Elschner, R. Emmerich, C. Schmidt-Langhorst. Improving achievable information rates of 64-GBd PDM-64QAM by nonlinear transmitter predistortion. Optical Fiber Communication Conference, M1C.2(2018).
[71] C. Rapp. Effects of HPA-nonlinearity on a 4-DPSK/OFDM-signal for a digital sound broadcasting signal. ESA Spec. Publ., 332, 179-184(1991).
