• Photonics Research
  • Vol. 13, Issue 6, 1469 (2025)
Qiarong Xiao, Chen Ding, Tengji Xu, Chester Shu, and Chaoran Huang*
Author Affiliations
  • Department of Electronic Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
    DOI: 10.1364/PRJ.551798
    Qiarong Xiao, Chen Ding, Tengji Xu, Chester Shu, Chaoran Huang, "Concept and experimental demonstration of physics-guided end-to-end learning for optical communication systems," Photonics Res. 13, 1469 (2025)

    Abstract

    Driven by advancements in artificial intelligence, end-to-end learning has become a key method for system optimization in various fields, including communications. However, applying learning algorithms such as backpropagation directly to communication systems is challenging due to their non-differentiable nature. Existing methods typically require developing a precise differentiable digital model of the physical system, which is computationally complex and can cause significant performance loss after deployment. In response, we propose a novel end-to-end learning framework called physics-guided learning. This approach performs the forward pass through the actual transmission channel while using a simplified white-box channel model for the backward pass. Despite this simplicity, both experimental and simulation results show that our method significantly outperforms other learning approaches for digital pre-distortion applications in coherent optical fiber systems. It enhances training speed and accuracy, reducing the number of training iterations by more than 80%. It improves transmission quality and noise resilience and offers superior generalization to varying transmission link conditions such as link losses, modulation formats, and scenarios with different transmission distances and optical amplification. Furthermore, our new end-to-end learning framework shows promise for broader applications in optimizing future communication systems, paving the way for more flexible and intelligent network designs.

    1. INTRODUCTION

    End-to-end (E2E) learning is an emerging approach based on deep learning that offers new solutions to complex problems [1]. By using a single neural network (NN) to represent an entire target system, E2E learning bypasses the intermediate steps typically required in traditional methods, thereby simplifying the learning process. For example, in computer vision, E2E learning can simultaneously optimize both the encoder NN for input image compression and the decoder NN for output recovery, achieving better overall system performance compared to optimizing these components separately. E2E learning has become a widely adopted optimization strategy, with applications across various domains, including computer vision, natural language processing, autonomous driving, robot control, computational imaging, and optical computing [2–7].

    The rise of E2E learning has also impacted communication technologies since 2017 [8]. As communication systems grow more complex [9–15], traditional block-wise optimization methods often fall short in ensuring optimal overall performance [16]. E2E learning has emerged as a promising solution. The process of signal generation and recovery in communication systems is analogous to image compression and reconstruction in computer vision, as both aim to recover input messages at the system output. This similarity makes it natural to apply the E2E learning concept to communication systems by treating the entire system—including the transmitter, receiver, and transmission channel—as an autoencoder, as illustrated in Fig. 1(a). In this framework, the transmitter (Tx) and receiver (Rx) are represented as an encoder and a decoder, respectively, through two separate NN blocks that are trained jointly to learn an intermediate representation robust to channel impairments.

    Figure 1.(a) A communication system utilizing end-to-end (E2E) learning can be represented as an autoencoder, which consists of three main components: an encoder (transmitter), a decoder (receiver), and the actual transmission channel. The transmission channel can be various communication systems. (b) Conventional E2E learning method involves accurately modeling the transmission channel first, followed by performing the backpropagation (BP) algorithm in the digital domain. Channel modeling can be achieved using either physics-based approaches, known as “white-box” models, or pure data-driven methods, referred to as “black-box” models. (c) Proposed physics-guided learning: executing the forward pass of backpropagation on the actual transmission channel, while the backward pass estimates the gradient using a simplified white-box model.

    In theory, E2E learning enables overall system optimization for any practical channel without the need for extensive analytic evaluation of the physical system, thereby allowing the best end-to-end performance to be pursued [16]. Despite these potential advantages, implementing E2E learning effectively in real-world communication systems remains a significant challenge [17,18]. The main obstacle is that, to use effective training algorithms such as backpropagation (BP), E2E learning requires the entire transmission system to be differentiable. However, the physical transmission channel is inherently non-differentiable, which complicates the direct application of E2E learning.

    Most solutions to this challenge construct a differentiable digital model that approximates the actual physical channel, enabling the learning process to occur in the digital domain [16,17,19–39], as illustrated in Fig. 1(b). Differentiable channel models are typically derived through two main approaches. The first approach relies on physical laws, such as using the split-step Fourier method to model nonlinear fiber channels [21,22]. The second approach utilizes data-driven algorithms, where NN models, including generative adversarial networks (GANs) [17,40] and other NN architectures [34,36], approximate the channel. While these methods facilitate E2E learning, they also impose stringent demands on modeling accuracy, leading to high training costs and complexity. For example, in order to obtain an accurate channel approximation across different channel conditions, data-driven approaches can be both data-intensive and time-consuming [17]. Even with these efforts, discrepancies between the digital model and the actual physical system remain inevitable, resulting in performance degradation in real-world deployments [41]. Additionally, model-free methods, such as reinforcement learning [41–43] and cubature Kalman filters [44], come with their own challenges, including high training complexity and slow convergence [28]. Therefore, achieving efficient E2E learning with fast, low-cost training while maintaining system performance and signal quality remains a significant challenge.

    To address these challenges, we propose a new framework called physics-guided learning, which incorporates the actual physical channel into the training process. By doing so, our method enables a simple and approximate white-box model to outperform a complicated digital model in both training and implementation performance, while also becoming more resilient to system noise. As shown in Fig. 1(c), in our approach, the physical channel is used during the forward pass to generate the output, while the white-box model is employed to compute gradients for backpropagation. Our method is analogous to hardware-in-the-loop optimization, a concept often used to validate simulations of complex systems [45,46]. Our method simultaneously enhances training speed, generalization ability, and signal quality.

    We experimentally demonstrate the effectiveness of our approach in short-reach coherent optical fiber systems. The goal is to train an NN-based digital pre-distortion (DPD) module to mitigate impairments from both the transmitter and receiver. Verified through both experiments and simulations, our approach outperforms other mainstream methods during both the training and deployment stages [1618]. During training, our method improves efficiency by reducing the number of training iterations by more than 80%. In deployment, our approach demonstrates the strongest capability in addressing nonlinear distortion, shows the highest resilience to noise, and exhibits superior generalization to different link losses, modulation formats, and transmission scenarios. Furthermore, our method offers a brand-new framework for enabling accurate and efficient E2E learning in communication systems, which holds significant potential for broader applications, including long-haul optical systems and hybrid RF/optical wireless systems.

    2. WORKING PRINCIPLE OF PHYSICS-GUIDED LEARNING

    The principle and process of the proposed physics-guided learning are illustrated in Fig. 1(c). In this framework, the input message s is encoded and transmitted through a transmission channel. The receiver, along with the subsequent digital signal processing (DSP), functions as the decoder, providing the output ŝ. The overall goal of the communication system is to reconstruct the message s at the output with minimal error. Therefore, the training objective is to minimize the error s − ŝ by simultaneously optimizing the NNs at both the encoder and decoder using backpropagation.

    In a standard training procedure using backpropagation, the process involves a forward pass, error calculation, backpropagation of errors through the NN to compute gradients, and, finally, updating the NN parameters. This process requires a precise mathematical model of the entire transmission system to compute gradients accurately. However, deriving such an accurate model is challenging and computationally intensive, especially for complex, noisy systems.

    To address this problem, our approach uses the actual physical transmission system for the forward pass and error measurement, while employing a differentiable proxy model for backpropagation. A key difference between our method and standard digital-domain learning is that our approach does not require the actual system and the proxy model to be identical. The tolerance for discrepancies is large—the training process decreases the error as long as the angle between the gradient obtained from the proxy model and the gradient of the actual system is less than 90° [47,48]. This allows the proxy model to be significantly simplified to a rough white box, reducing computational complexity while maintaining high system performance.
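    The descent-direction condition above can be checked numerically. The following toy sketch (hypothetical 2×2 matrices, not the paper's channel) runs gradient descent on the loss ||Aw − t||², using the true channel Jacobian A for the forward pass but a crude identity proxy B for the backward pass; the loss still converges to zero because the proxy gradient keeps a positive inner product (i.e., an angle below 90°) with the true gradient at every step.

```python
import numpy as np

A = np.array([[1.0, 0.3], [0.2, 0.8]])   # "actual channel" Jacobian (forward pass)
B = np.eye(2)                            # crude white-box proxy (backward pass)
t = np.array([1.0, -0.5])                # target output
w = np.zeros(2)                          # parameters to learn

for step in range(200):
    e = A @ w - t                        # error measured from the "actual" system
    g_true = 2 * A.T @ e                 # gradient of the real loss
    g_proxy = 2 * B.T @ e                # gradient estimated via the proxy
    cos = g_true @ g_proxy / (np.linalg.norm(g_true) * np.linalg.norm(g_proxy) + 1e-12)
    assert cos > 0                       # proxy stays within 90 deg of the true gradient
    w -= 0.1 * g_proxy                   # descend along the proxy gradient

final_loss = np.sum((A @ w - t) ** 2)    # converges close to zero
```

Despite the mismatch between A and B, every proxy update still reduces the true loss, which is the essence of the 90° tolerance.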

    The specific training process is as follows. Before training, we first establish a simple, differentiable proxy model for the physical channel based on physical laws—this model is referred to as a “white box.” During the training process, the output s^ is measured from the actual physical system, and the error is calculated. The gradient of the loss function with respect to the parameters in the encoder and decoder is then derived using the white-box model through the chain rule. Finally, all parameters are updated according to the calculated gradient. This training loop is repeated until the objective function converges. (See Appendix A for the general formulation.)
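    The training loop above can be sketched in code. In this minimal illustration, the "actual channel" is a hypothetical stand-in for the physical link (saturation, quantization, and noise, so it is non-differentiable), the encoder is a single learnable pre-distortion gain, and the backward pass uses a deliberately crude unit-gain white-box proxy; all numbers are illustrative assumptions, not the paper's system.

```python
import numpy as np

rng = np.random.default_rng(1)

def actual_channel(x):
    """Stand-in for the physical link: saturation, 8-bit quantization, noise."""
    y = np.tanh(1.2 * x)                     # nonlinear device response
    y = np.round(y * 127) / 127              # quantization (non-differentiable)
    return y + 0.001 * rng.standard_normal(x.shape)

s = rng.uniform(-0.5, 0.5, 256)              # message to reproduce at the output
w = 1.0                                      # single pre-distortion gain to learn

for it in range(300):
    y = actual_channel(w * s)                # forward pass on the real system
    e = y - s                                # measured error
    g_x = 2 * e / len(s)                     # dL/dx via the unit-gain white-box proxy
    g_w = g_x @ s                            # chain rule through the encoder
    w -= 0.5 * g_w                           # parameter update

mse = np.mean((actual_channel(w * s) - s) ** 2)
baseline = np.mean((actual_channel(s) - s) ** 2)   # no pre-distortion
```

Even though the proxy ignores the saturation, quantization, and noise entirely, the loop converges and the final MSE falls well below the no-pre-distortion baseline in this toy setting.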

    This training flow benefits E2E learning in two significant ways. First, executing the forward pass through the actual physical channel ensures that the output incorporates all real information. Implicit impairments and features, such as noise and residual effects post-DSP, are automatically included through forward propagation. Non-differentiable operations in the actual system, such as quantization, which typically limits E2E learning [41], are also included. Consequently, our method enhances the deployment accuracy.

    Second, grounding the process in real data reduces the need for a precise digital model for gradient estimation, thereby accelerating the training process. Our method can achieve necessary accuracy using a simple white-box model as the proxy channel. Compared to purely data-driven neural network models, known as “black boxes,” white-box models offer greater generalization ability and higher noise resilience.

    The following section will demonstrate an example of applying our method for digital pre-distortion in short-reach coherent fiber systems, highlighting its significant advantages at both the training and deployment stages.

    3. APPLICATION IN COHERENT OPTICAL SYSTEM

    A. Digital Pre-Distortion For Short-Reach Transmission

    Coherent optical systems have dominated long-haul transmission over the past decade [15]. Recently, coherent technologies have opened up new development opportunities for short-reach transmission, especially for beyond-Tb/s data-center interconnects and passive optical networks (PONs), to meet the growing demands on data traffic [49–54]. At such high data rates, transmission performance becomes increasingly susceptible to imperfections in cost-effective transceiver components rather than impairments from the fiber [55]. To tackle these issues, digital pre-distortion (DPD) is employed at the Tx side to pre-compensate for signal distortions caused by transceiver devices [25,42,55–63]. Recent advances have demonstrated the use of NNs to develop DPD modules, with ongoing research focused on efficiently learning DPD parameters while maintaining low costs and high performance [42,55,59,60].

    We show that our physics-guided learning approach can effectively optimize the parameters of a DPD module, yielding superior performance compared to mainstream training methods [16–18]. The operation scheme of our method adapted for DPD training is illustrated in Fig. 2(a). During training, the encoder NN, represented by the green block, functions as the DPD module. The actual physical channel is used for the forward pass to generate the output y of the entire system. The error e is calculated as s − y, where s is the desired signal. This error is then backpropagated through a simplified white-box model to compute the gradient. In the short-reach coherent systems under study, the white-box model we use is simply a combination of mathematical models for RF drivers, an in-phase and quadrature (IQ) modulator, a low-pass filter, and a matched filter, while other device imperfections and DSP modules are ignored (see Fig. 3). Additionally, all physical parameters involved in the channel model are simply extracted from datasheets without additional experimental characterization. Finally, the parameters of the DPD module are updated based on the estimated gradient. The training loop is repeated until the error converges and no longer decreases.

    Figure 2.Schematic diagrams of methods adapted for DPD training. (a) Proposed physics-guided learning method. (b) Prior E2E learning methods. (b-i) Method 1: hybrid-domain learning with a data-driven model [18]. (b-ii) Method 2: digital-domain learning with a complicated physics-based model [16,19–24]. (b-iii) Method 3: digital-domain learning with a data-driven model (implemented by alternating training) [17,33,34].

    Figure 3.Experimental setup for our physics-guided learning method, configured as either an amplifier-less or an 80-km transmission system. The physical transmission system and the Rx-DSP compose the actual channel. The simplified white-box digital channel employs a series of physical models that cover only several components. DPD, digital pre-distortion; AWG, arbitrary waveform generator; IQ-MOD, IQ modulator; VOA, variable optical attenuator; EDFA, erbium-doped fiber amplifier; SSMF, standard single-mode fiber; LO, local oscillator; OSC, oscilloscope; LPF, low-pass filter.

    In our experiments, we use an adaptive finite impulse response (FIR) filter as a post-equalizer at the Rx, whose coefficients adapt to changes in the DPD. We optimize its coefficients jointly with the DPD using gradient descent, analogous to training an NN. The function of the FIR is equivalent to a single-layer NN [22]. Similar single-layer NNs have been adopted in prior E2E learning frameworks [22,34,36,37]. While we do not use an NN-based post-equalizer, the method of training the FIR filter in our experiment is also applicable to training an NN-based post-equalizer. Although including an NN-based post-equalizer would further improve the performance (see Appendix C), we do not include it, as it significantly increases DSP complexity, which is not acceptable for the short-reach system we are targeting.
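    The equivalence between an adaptive FIR equalizer and a single-layer linear NN can be made concrete with a small sketch: the taps are updated by gradient descent on the MSE (i.e., the LMS rule) to undo a short, hypothetical channel response. The channel, filter length, and step size are illustrative, not the experiment's values.

```python
import numpy as np

rng = np.random.default_rng(2)
h = np.array([0.1, 1.0, 0.25])          # toy channel impulse response (introduces ISI)
s = rng.choice([-1.0, 1.0], 2000)       # BPSK-like symbol stream
r = np.convolve(s, h, mode="same")      # received, ISI-distorted signal

taps = np.zeros(7)
taps[3] = 1.0                           # center-spike initialization
mu = 0.1                                # gradient-descent (LMS) step size

for epoch in range(200):
    y = np.convolve(r, taps, mode="same")   # single-layer linear "NN" forward pass
    e = y - s
    g = np.empty_like(taps)
    for k in range(len(taps)):
        # d mean(e^2) / d taps[k] = 2 * correlation of e with r at the tap's lag
        g[k] = 2 * np.mean(e * np.roll(r, k - len(taps) // 2))
    taps -= mu * g

mse = np.mean((np.convolve(r, taps, mode="same") - s) ** 2)
```

After convergence the residual MSE drops far below the unequalized level, illustrating why the same gradient-descent machinery used for NNs applies directly to the FIR coefficients.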

    To demonstrate the effectiveness of our method, we compare it comprehensively with three prior E2E learning methods. The first method, illustrated in Fig. 2(b-i), also incorporates the actual physical system during training but relies on a digital NN obtained through a data-driven approach for the backward pass (referred to as Method 1) [18]. This method requires alternating training of two NNs, one for the channel model and the other for the DPD module. Since training is performed directly on the real system, updates to the DPD significantly change the statistics of the transmitted signals, which in turn causes notable changes in the actual channel response. Therefore, the NN for channel modeling needs to be retrained periodically to approach the true response. This will be demonstrated in Section 3.C.

    The second and third methods, shown in Figs. 2(b-ii) and 2(b-iii), are standard model-based E2E learning approaches (referred to as Methods 2 and 3, respectively). Both approaches require a highly accurate digital model. In the method illustrated in Fig. 2(b-ii), the digital model is based on complicated digital representations of physical laws, such as the split-step Fourier method [21,22], to simulate the actual channel response. The third method, depicted in Fig. 2(b-iii), uses a data-driven approach to derive an NN model for the digital representation. There are two common strategies to implement Method 3: one follows alternating training similar to Method 1 [17,33,34], while the other fully pre-trains the channel model first and then fixes the pre-trained channel model when training transceiver modules [36–39]. Alternating training allows the channel model to track the changes in system response during learning but needs frequent data acquisition from the real channel. In contrast, fully pre-training the channel model enables offline learning but demands a large, diverse dataset to ensure accurate modeling [35]. Therefore, both strategies require a considerable amount of data and time to train the channel model. In our demonstration, we use alternating training for Method 3 in order to draw a direct comparison with Method 1.

    B. System Setup

    Figure 3 shows the experimental setup for short-reach coherent fiber transmission and the DPD learning flow using our proposed method. To cover diverse short-reach scenarios [49], we examine two systems: an optical-amplifier-less system, representative of data-center links, and an 80-km transmission system with optical amplification, suitable for access and metro links where amplification is acceptable. The DPD module is trained on the amplifier-less system but evaluated on both systems, showcasing its adaptability to longer transmission scenarios.

    Initially, a sequence of transmitted symbols s is up-sampled and pulse-shaped using a root-raised cosine filter. The shaped digital waveforms are then processed by the DPD module. These waveforms are resampled and fed into an arbitrary waveform generator (AWG), which converts them into electrical signals. The signals are subsequently amplified by RF drivers and converted to the optical domain using an IQ modulator. An external cavity laser provides the optical carrier for modulation. The optical signal operates at a symbol rate of 50 Gbaud. In the amplifier-less system, optical signals are sent directly to the coherent receiver and mixed with a local oscillator (LO). The resulting electrical waveforms are sampled by a real-time oscilloscope. The digitized signals undergo Rx-DSP, including resampling, frame synchronization, frequency recovery, matched filtering, equalization, and phase recovery. Finally, the sequence of recovered symbols y is obtained. In the 80-km transmission system, the modulated optical signals are first boosted by an erbium-doped fiber amplifier (EDFA) and filtered by an optical filter to suppress out-of-band noise. After transmission through 80 km of standard single-mode fiber (SSMF), a second EDFA and optical filter are used to compensate for the fiber loss and further filter the signals. The received optical signals are processed similarly to the amplifier-less system but include additional chromatic dispersion (CD) compensation, as indicated by the dashed gray block in Fig. 3. Detailed device parameters are listed in Table 1.
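    The pulse-shaping and matched-filtering steps above can be illustrated numerically: a root-raised-cosine (RRC) filter, built here in the frequency domain as the square root of a raised-cosine spectrum, cascades with an identical matched filter into a raised-cosine (Nyquist) pulse with essentially no intersymbol interference (ISI) at symbol instants. The roll-off factor, oversampling ratio, and FFT size below are illustrative choices, not the experiment's settings.

```python
import numpy as np

sps = 4                  # samples per symbol (oversampling)
beta = 0.1               # roll-off factor
n = 1024                 # FFT size
f = np.fft.fftfreq(n, d=1.0 / sps)       # frequency in units of the symbol rate

# Raised-cosine spectrum, then its square root as the RRC response
H = np.zeros(n)
flat = np.abs(f) <= (1 - beta) / 2
roll = (np.abs(f) > (1 - beta) / 2) & (np.abs(f) <= (1 + beta) / 2)
H[flat] = 1.0
H[roll] = 0.5 * (1 + np.cos(np.pi / beta * (np.abs(f[roll]) - (1 - beta) / 2)))
rrc = np.fft.fftshift(np.fft.ifft(np.sqrt(H)).real)   # RRC taps (time domain)

# Pulse shaping at the Tx followed by matched filtering at the Rx
rc = np.convolve(rrc, rrc)                             # raised-cosine cascade
center = int(np.argmax(rc))
vals = rc[center + np.arange(-100, 101) * sps]         # samples at symbol instants
peak = vals[100]
isi = np.max(np.abs(np.delete(vals, 100))) / peak      # worst-case residual ISI
```

The residual ISI at symbol instants is negligible, which is why splitting the Nyquist response into matched RRC halves at the Tx and Rx is the standard choice here.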

    Table 1. Parameters for Experimental Systems

    Parameter                            Value
    AWG (DAC resolution, bandwidth)      8 bits, 45 GHz
    RF driver (gain, Vsat, bandwidth)    17 dB, 7.8 V, 40 GHz
    IQ modulator (Vπ, bandwidth)         3.5 V, 22 GHz
    Laser/LO (linewidth)                 <100 kHz
    Coherent receiver (bandwidth)        22 GHz
    Oscilloscope (bandwidth)             59 GHz
    Symbol rate                          50 Gbaud
    The amplifier-less system’s transmission performance is primarily impacted by nonlinearities, bandwidth limitations of components at both the transmitter and receiver, and system noise, including additive noise from the receiver and phase noise from non-ideal laser sources. In the 80-km transmission system, additional challenges arise, such as amplified spontaneous emission (ASE) noise from the EDFAs, CD, and potential fiber nonlinearities from the longer fiber link. This necessitates the use of a CD compensation block. The DPD module’s role is to address signal distortions mainly originating from transceiver devices. Nevertheless, our DPD can improve signal quality even in the presence of fiber-induced impairments under 80-km transmission conditions, as demonstrated in Section 3.E.3.

    To train the DPD using our physics-guided learning method, we transmit signals in the actual amplifier-less channel to perform the forward pass, as indicated by the central blue area in Fig. 3. The backward pass, as a feedback link to the DPD, is designed to reflect only the major distortions from devices, thus greatly simplified compared to the actual channel. As shown in the white box of Fig. 3, the backward pass employs a series of physical models that represent a few key components, including RF drivers, the IQ modulator, the overall bandwidth limitation of the system represented as a low-pass filter (LPF), and a matched filter paired with the pulse shaping before DPD. These physical models are rough. For instance, the IQ modulator is modeled as ideal sinusoidal functions disregarding potential mismatches between the in-phase and quadrature arms. Moreover, all the required parameters are obtained from datasheets rather than measured from actual devices, and noises are excluded from the model. Despite this simplicity, we will demonstrate in the results sections that our learning method effectively guides the DPD to optimize towards the best performances. (See Appendix B for modeling methods of physical models.)
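    A sketch of such a simplified white-box forward model is given below, chaining an RF-driver saturation, an ideal sinusoidal IQ-modulator transfer, and a first-order low-pass filter. The gain, Vsat, and Vπ values are the datasheet numbers from Table 1; the tanh saturation shape and the filter coefficient are our illustrative assumptions, and, as in the paper's model, noise and IQ imbalance are ignored.

```python
import numpy as np

GAIN_DB, V_SAT = 17.0, 7.8       # RF driver gain and saturation voltage (Table 1)
V_PI = 3.5                       # modulator half-wave voltage (Table 1)

def rf_driver(v):
    """Soft-saturating amplifier: linear gain rolling off toward V_SAT."""
    g = 10 ** (GAIN_DB / 20)
    return V_SAT * np.tanh(g * v / V_SAT)

def iq_modulator(vi, vq):
    """Ideal sinusoidal IQ transfer, ignoring mismatch between the two arms."""
    return np.sin(np.pi * vi / (2 * V_PI)) + 1j * np.sin(np.pi * vq / (2 * V_PI))

def lowpass(x, alpha=0.3):
    """First-order IIR stand-in for the aggregate bandwidth limitation."""
    y = np.empty_like(x)
    acc = 0.0
    for i, v in enumerate(x):
        acc += alpha * (v - acc)
        y[i] = acc
    return y

# Forward pass of the white box for a toy drive waveform
vi = 0.05 * np.sin(2 * np.pi * 0.02 * np.arange(200))
vq = np.zeros_like(vi)
field = iq_modulator(rf_driver(vi), rf_driver(vq))    # modulated optical field
out_i = lowpass(field.real)                           # band-limited in-phase output
```

Because each stage is a smooth elementary function, the chain rule through this model is trivial to evaluate, which is what makes the backward pass cheap.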

    Note that, to explore the simplification limit of channel models, we also attempted to exclude all components and assume the backward pass to be an identity matrix. We observed that training could converge, albeit with poorer performance compared to the white-box model shown in Fig. 3 (resulting in a 34.6% increase in mean square error). This indicates that the angle between the gradient estimated with the identity model and the true gradient is large but still less than 90°. A similar example can be found in Ref. [55].

    The three prior learning methods introduced in Section 3.A for comparison are performed on the amplifier-less system with the same DPD NN structure. Methods 1 and 3 use the identical NN structure for both channel modeling and DPD. The NN is a feed-forward network with an input sliding window [42,60], comprising 3108 learnable real-valued parameters. Method 2 uses a complicated physics-based model to match the amplifier-less experimental setup. The model includes basic models of the transmitter and receiver (identical to the white-box model used in our method), system noises (additive white noise and phase noise), and Rx-DSP. Its physical parameters are also extracted from datasheets. Additionally, hyperparameters such as initial learning rates are optimized for each method, as detailed in Appendix D. (See Appendices A–D for the NN structure, channel models, and training details.)

    In the results sections, we will first analyze the training complexity of our method and compare it with existing approaches (see Section 3.C). We will then evaluate the performance of the DPD module with respect to the peak-to-peak voltage (Vpp) at the AWG output, a critical factor affecting the transmission quality of the amplifier-less short-reach system (see Section 3.D). Vpp influences the output swing of the RF drivers and the operation of the IQ modulator. While a higher Vpp can boost optical signal power, it also leads to larger nonlinear distortions from the RF drivers and modulator. We will investigate how effectively our method identifies the optimal Vpp value compared to other methods. Finally, we investigate the generalization ability of our method by showing how the DPD module trained using our approach can adapt to different transmission conditions, including fiber link losses, modulation formats, and the 80-km transmission scenario with optical amplification (see Section 3.E).

    C. Evaluation and Comparison of Training Complexity and Accuracy

    Here, we demonstrate the advantages of our method in terms of training complexity and accuracy. The amplifier-less system is trained using 32-QAM signals with a 600 mV Vpp at the AWG and 5 dB optical link loss. Figure 4(a) compares the training loss of our method with that of Method 1 and Method 3, where both channel modeling and DPD training processes use mean square error (MSE) as the loss function. We visualize convergence speed as a function of training iterations because it reflects both the required training time and the amount of training data: fewer training iterations imply lower training data requirements. Method 2 is not included in this comparison, as it does not need system measurements but produces unacceptably low DPD performance during deployment in the real system. The performance of Method 2 will be discussed in the next section.

    Figure 4.Training process comparisons between our method and prior Method 1 (hybrid-domain data-driven method) and Method 3 (digital-domain data-driven method, implemented by alternating training). (a) Training loss versus training iteration in experiments, under the conditions of 5 dB link loss and 600 mV Vpp. (b) Validation MSE versus training iteration in simulations. (b-ii) Zoom-in of (b-i) to compare the required iteration numbers for different methods when reaching the same MSE.

    As shown in Fig. 4(a), our method achieves the fastest convergence speed by using a simplified white-box model during training. This contrasts with Methods 1 and 3, which adopt alternating training for two NNs. Despite its simplicity, our method can effectively guide the optimization process in the correct direction, as evidenced by the continuously decreasing training loss. In contrast, Methods 1 and 3 involve alternately training two NNs—one for the channel model and the other for the DPD module—during each round [as illustrated by the different colored regions in Figs. 4(a-ii) and 4(a-iii)]. A high channel modeling loss is observed at the start of the second round (marked by the black dashed circle) in Fig. 4(a-ii), which indicates that the actual channel response changes significantly after DPD training and that retraining the channel model is essential. This alternating training process substantially increases both the training time and the amount of training data.

    It is important to note that training loss alone does not rigorously reflect the true performance of the system, as some methods do not account for real-system effects, such as noise, during training. To accurately evaluate how system performance evolves during training, a validation dataset must be tested on the system. The validation results, shown in Fig. 4(b), are periodically assessed by calculating the MSE between the input symbol sequence of the validation dataset and the output signal. As shown in Fig. 4(b-i), our method achieves the lowest MSE with the fewest training iterations. Remarkably, our method has already converged when the two prior methods have just completed their first round of training. The MSE is reduced by 23.3% and 69.2% compared to Method 1 and Method 3, respectively. Notably, Method 3 not only exhibits the slowest convergence speed but also suffers from overfitting, as indicated by the rising MSE marked by red dashed circles in Fig. 4(b-i). Method 1 avoids the overfitting issue by incorporating the transmission system into the training process; however, alternating training greatly increases the training time. Finally, as shown in Fig. 4(b-ii), our method reaches the same MSE in approximately 80 iterations, compared to around 440 and 720 iterations for Method 1 and Method 3, respectively. This corresponds to a reduction of more than 80% in the number of iterations and the required data, as well as a decrease of more than 75% in the number of system measurements. Additionally, the training time is estimated to be reduced from about 20 min to just 3 min. The calculations of data amount, number of system measurements, and training time are detailed in Appendix D.

    D. Performance in Nonlinearity Impairment Mitigation

After training, we deploy the DPD models trained by different methods to evaluate their performance in signal pre-equalization. In this section, we compare the methods based on their capability to mitigate nonlinear impairments. This ability is crucial because it allows higher Vpp values to drive the IQ modulator, resulting in a higher signal-to-noise ratio (SNR) for the amplifier-less system. While a higher Vpp increases the optical signal power, it also introduces more significant nonlinear distortions from the RF drivers and the modulator. Thus, effective nonlinear impairment mitigation is essential for improving the system SNR. In this experiment, we fix the optical link loss at 5 dB and train separate DPD modules for each Vpp value using 32-QAM signals. After training, we test each DPD module at the specific Vpp value set for the training. The fiber link without DPD serves as the baseline for comparison. The system performance is analyzed by calculating the SNR and bit error rate (BER) based on recovered symbols.

The results are presented in Fig. 5. Our method demonstrates the highest tolerance to nonlinearity and achieves the best SNR with a time-efficient training procedure. As shown in Fig. 5(a), the DPD trained using our method (blue line) consistently delivers the highest SNR across all tested Vpp values. Compared to the baseline (yellow line), it provides an SNR gain of 0.88 dB and improves the optimal Vpp value from 500 to 600 mV, indicating its capability to mitigate severe nonlinear distortions and support higher launch powers. We also observe that the performance gap between prior methods and ours widens as the Vpp increases. This is because higher Vpp levels introduce more severe nonlinear distortions, and the DPD obtained using our method demonstrates a greater ability to compensate for these distortions. The SNR using Method 1 (orange line) is 0.33 dB lower than that of our method. This inferior performance arises from inadequate training for channel modeling in Method 1, which introduces additional biases and increases the gap between the model and the actual system. Method 3 exhibits an even lower SNR, indicating that the NN used for channel modeling, which is fixed during digital-domain DPD training, is not accurate. Method 2 (purple line) shows the worst SNR performance, which is even lower than the baseline at some Vpp values. This suggests that its DPD is ineffective, revealing that completely detaching from measurements of the actual system can result in large modeling errors and significant performance loss during real-system deployment. As a result, our method is the only one that achieves BER values below the 14.8% overhead (OH) forward error correction (FEC) threshold of 0.0125 [50], as shown in Fig. 5(b), outperforming all other methods under comparison.


    Figure 5.Performance comparison in impairments mitigation. (a) Calculated SNR versus Vpp of DPDs trained through different methods. (b) Calculated BER versus Vpp, followed by (b-i) and (b-ii) showing the received constellations without and with DPD at their respective optimal Vpp values. (c) Comparison of transmitted signal spectra with and without DPD.

    In addition, the DPD obtained through our method can compensate for the bandwidth limitations of the physical system, which always exhibits low-pass characteristics and hinders high-speed transmission. Our DPD counteracts this impairment by boosting the high-frequency components before transmission. As shown in Fig. 5(c), the signal spectrum with our DPD shows peaks at the edge frequencies, in contrast to the flat spectrum of original signals without DPD.

    E. Generalization Capability

This section examines the generalization capability of our method, specifically its ability to adapt the trained DPD to link conditions that were not included in the training phase. The DPD is first trained in the amplifier-less system with 5 dB link loss and 600 mV Vpp and then tested under different link conditions. Specifically, the evaluations against link losses and modulation formats are conducted in the amplifier-less system, while the DPD’s performance in the 80-km transmission system is evaluated for varying launch powers.

    1. Adaptability to Optical Link Losses

    We first evaluate the performance of the trained DPD under varying link losses, where the signal experiences different SNRs after detection. Figure 6(a) shows the BER as a function of link losses. Across all evaluated link losses, our method (blue line) consistently achieves the lowest BER. When the link loss is small (less than 5 dB), the BERs of all methods stop decreasing, because residual distortions, such as nonlinearity, become the dominant limiting factors rather than noise. In this case, our method remains the only one that achieves BER values below the 14.8% OH FEC threshold. As the link loss increases, noise becomes the dominant factor. Under these conditions, our method continues to demonstrate lower BER values, while other methods show performance close to or even worse than the baseline, particularly at a 15 dB link loss. Our method achieves a 1.00 dB gain in power budget over the baseline for the 25% OH FEC threshold of 0.04 [50]. The power budget is calculated as the difference between the launched and received optical power, equivalent to the link loss.


    Figure 6.(a) BER versus optical link loss for DPD modules from different E2E learning methods. All DPDs were trained at a fixed 5 dB link loss. (b) Noise resilience investigation in comparison with ILA. (b-i) Calculated BER versus link loss of DPDs trained at fixed 3 dB, 5 dB, and 8 dB link loss values, respectively. (b-ii) Zoom-in of (b-i).

Notably, the BER value of Method 3 (green line) increases dramatically as the link loss rises, falling below the baseline’s performance starting from a 9 dB link loss. This decline below the baseline can be attributed to training biases in the data-driven channel model: since noise is not included in the DPD training process, the learned DPD overfits and fails to adapt to higher noise levels, degrading performance under higher-loss conditions. Incorporating the physical channel in the training can reduce such errors, as demonstrated by the results of our method and Method 1. However, Method 1 still performs worse than ours because training biases still occur during the NN training for channel modeling. (See Appendix A for the detailed analysis.)

The above results clearly illustrate the superior performance of our method over prior E2E learning methods. To thoroughly validate the adaptability and optimality of our method in DPD applications, we also conduct a comparison with a traditional DPD training method, the indirect learning approach (ILA). ILA is a practical DPD optimization method owing to its relatively low complexity [56,57]. Unlike E2E learning methods, it circumvents channel modeling by training the DPD module at the Rx side before deploying it to the Tx side. However, ILA may suffer from noise bias and fail to yield the optimal DPD module [58,59]. Here we investigate the noise resilience of the training processes and demonstrate that our method can outperform ILA.

We compare the two methods by training DPD modules at three fixed link losses—3 dB, 5 dB, and 8 dB—and then evaluate system performance under varying link losses. The Vpp is set to 600 mV during both training and testing. The resulting BERs are shown in Fig. 6(b). Our method demonstrates strong resilience to noise variance—the three DPDs, despite being trained under different link losses, exhibit consistent BER performance. The relative differences in BER are less than 6.5%. In contrast, ILA proves more sensitive to noise. The relative differences in BER can be as large as 18.8%. Specifically, the DPD trained at high loss (e.g., 8 dB) cannot adapt to low-loss link conditions. The BER performance using DPD trained with ILA worsens as the link loss during training increases. The most significant BER degradation, from 0.0133 to 0.0158, occurs at a 4 dB link loss, as shown in Fig. 6(b-ii). These results are comparable to those reported by other groups [55,59], suggesting that excessive noise due to high link losses hinders the DPD from effectively learning the inverse channel function. As a result, ILA cannot achieve a BER value lower than the 14.8% OH FEC threshold. When comparing DPD modules from our method and ILA, both trained at 8 dB link loss (dark blue and dark red lines), our method achieves a 28.5% reduction in BER at 4 dB link loss and provides a 0.56 dB gain in power budget over ILA for the 25% OH FEC threshold.

    2. Adaptability to Modulation Formats

    Next, we evaluate the generalization ability of our method, specifically its ability to apply the trained DPD to different modulation formats that were not included in the training stage. The DPD is initially trained with 32-QAM, and then we test its performance on fiber links using 16-QAM and 64-QAM without retraining the DPD. The results in Fig. 7 show the signals’ SNR at different Vpp levels. Compared to the baseline, our method achieves SNR gains of 0.79 dB and 0.98 dB for 16-QAM and 64-QAM, respectively, both at a Vpp of 600 mV—the same value as for 32-QAM. This indicates that the optimal Vpp obtained from the training with 32-QAM can still be applied to 16-QAM and 64-QAM, which demonstrates the generalization ability of our method. Notably, both our method and the baseline achieve higher SNR in 16-QAM than in 64-QAM. This is because the higher-order QAM is more susceptible to noise as well as channel distortions, leading to lower measured SNR values [64]. Consequently, DPD for nonlinear compensation is more effective for 64-QAM, resulting in a larger SNR gain over the baseline compared to 16-QAM. As a result, the SNR gap between the two modulation formats narrows after applying our DPD, with the reduction equal to the difference in SNR gains (0.19 dB). Similar results can be found in Ref. [64].


    Figure 7.Applying the DPD module trained for 32-QAM to other formats without retraining. SNR versus Vpp for 16-QAM and 64-QAM. The insets (i)–(iv) show the received constellations at the optimal Vpp.

    3. Adaptability to the 80-km Transmission Scenario

    Finally, we evaluate the adaptability of our DPD trained in the amplifier-less system to the 80-km transmission system without retraining. Performance is tested under varying launch powers. Launch power is crucial for this longer transmission scenario, as higher power increases the optical signal-to-noise ratio (OSNR, the power ratio between optical signal and ASE noise) but also leads to larger fiber nonlinearity that distorts signals. We demonstrate that our DPD consistently improves the transmission performance across a wide range of launch powers.

Figure 8 shows the received signal’s SNR over different launch powers, under 32-QAM and 600 mV Vpp. We compare our method with the three prior learning methods introduced in Section 3.A. DPD modules trained in the amplifier-less system using different methods are directly tested on the 80-km transmission system without retraining. The results demonstrate that our method outperforms all other methods in SNR across all launch powers. At the optimal launch power of around 6 dBm, our approach not only attains the highest SNR value but also exhibits the largest SNR gain compared to other methods. These results indicate that our method adapts best to different transmission scenarios without the need for retraining. It can maintain optimal communication performance even in the presence of ASE noise and fiber nonlinearity.


    Figure 8.Performance comparison in the 80-km transmission system. DPD modules trained in the amplifier-less system by different learning methods are tested without retraining.

    4. DISCUSSION AND CONCLUSION

    In this paper, we introduce a novel end-to-end (E2E) learning framework called physics-guided learning. This approach offers two significant improvements. First, by executing the forward pass through the actual physical channel, our method ensures that the output includes all real information, including implicit impairments and features such as noise, residual effects post-DSP, and non-differentiable operations such as quantization. This integration enhances the overall accuracy and effectiveness of E2E learning. Second, grounding the process in real data reduces the need for a precise digital model for gradient estimation, simplifying the training process and offering greater generalization ability.

    Our demonstrations for DPD highlight the advantages of our approach through a comprehensive comparison with existing methods. Our method achieves the fastest training speed by using a simplified white-box model, avoiding the need for alternating training of two complex NNs. This results in more than 80% fewer training iterations compared to previous data-driven methods. Additionally, our approach provides the highest SNR improvement of 0.88 dB for 32-QAM signals in an amplifier-less system. Moreover, our method exhibits enhanced robustness and strong generalization capabilities. It remains resilient to system noise, and the DPD module trained with 32-QAM can be effectively applied to other modulation formats without retraining, achieving SNR gains of 0.79 dB for 16-QAM and 0.98 dB for 64-QAM. When directly applied to an 80-km transmission system, our DPD also achieves impressive SNR gains over prior learning methods. Further improvements could introduce learnable physical parameters into channel models. In this way, channel models can be dynamically tuned to adapt to significant changes in actual systems.

For the practical implementation of our learning framework, training can be conducted on pilot symbol (bit) sequences shared by the Tx and Rx sides for error calculation. The main consideration is how to send gradient information back to the Tx side. Nevertheless, owing to the fast training speed of our method, exchanging gradient information is only required during the brief training period. Therefore, any reliable feedback link can be used without requiring a high data rate. In fact, similar feedback links have been employed in various systems, such as quantum key distribution [65], model-free E2E learning methods [41], and autonomous optical networks [66,67]. Thus, constructing a temporary feedback link is feasible.

    From a broader perspective, our framework offers a novel approach for optimization in communication systems and other physical systems. For instance, practical E2E learning in communications often involves DSP modules that cannot be easily replaced by a single NN. These modules are typically non-differentiable. Our method facilitates the joint optimization of neural networks even in the presence of such non-differentiable operations. To extend our method to other communication systems, their channel models for the backward pass need to be designed accordingly.

    In conclusion, we have proposed a general strategy for optimizing communication systems that is applicable to various systems. This approach paves the way for developing more flexible and intelligent optical networks and holds promise for future integrated sensing and communications, where designing and optimizing increasingly complex system configurations will be crucial.

    APPENDIX A: FORMULATION OF PHYSICS-GUIDED E2E LEARNING

    Here we present the general formulation of the E2E learning process and details of our physics-guided learning framework.

    E2E Learning and the Non-Differentiable Actual Channel

Figure 9(a) shows the schematic diagram of E2E learning for a communication system, where $f_{en}$ and $f_{de}$ represent the neural network (NN) based encoder and decoder with learnable parameters $\theta_{en}$ and $\theta_{de}$, respectively. With channel input $x$ (a vector of signal waveform or symbols), we formulate the function of the actual channel $f_{\text{Channel}}$ with decoupled terms as
$$y = f_{\text{Channel}}(x) = C(x) + n(x) + n_r, \tag{A1}$$
where $y$ is the channel output (a vector of signal waveform or symbols), $C(x)$ describes the deterministic channel effects on the input $x$, $n(x)$ stands for signal-dependent noise such as noise-signal interaction, and $n_r$ is signal-independent random noise.


    Figure 9.(a) Schematic of E2E learning for a communication system. (b) Proposed physics-guided learning: gradient estimation with a physics-based (white-box) model. (c) Gradient estimation with a data-driven (black-box) model.

After signal transmission in the forward pass, we compare the entire system’s input $s$ and output $\hat{s}$ to compute the error and define the loss function, for example, in the form of the Euclidean norm (i.e., MSE):
$$L = \|s - \hat{s}\|^2. \tag{A2}$$

Based on the loss function, the fundamental training algorithm, backpropagation (BP), is used to update the parameters of the autoencoder. To do that, the gradients of the loss function with respect to (w.r.t.) the parameters, including $\partial L / \partial \theta_{de}$ and $\partial L / \partial \theta_{en}$, are computed through the chain rule along the backward pass. In standard training procedures, the backward pass is the same as the forward pass. We first calculate the gradients related to the decoder:
$$\frac{\partial L}{\partial \theta_{de}} = \frac{\partial \hat{s}}{\partial \theta_{de}} \frac{\partial L}{\partial \hat{s}} = \left[\frac{\partial f_{de}}{\partial \theta_{de}}(y, \theta_{de})\right]^T \frac{\partial L}{\partial \hat{s}}, \tag{A3}$$
$$\frac{\partial L}{\partial y} = \frac{\partial \hat{s}}{\partial y} \frac{\partial L}{\partial \hat{s}} = \left[\frac{\partial f_{de}}{\partial y}(y, \theta_{de})\right]^T \frac{\partial L}{\partial \hat{s}}, \tag{A4}$$
where $\frac{\partial f_{de}}{\partial \theta_{de}}(y, \theta_{de})$ and $\frac{\partial f_{de}}{\partial y}(y, \theta_{de})$ denote the gradients of the decoder function $f_{de}$ w.r.t. $\theta_{de}$ and $y$, respectively, $y$ is the input to the decoder, and $\partial L / \partial \hat{s}$ is the error vector directly derived from Eq. (A2). Since $f_{de}$ is a differentiable NN function, $\frac{\partial f_{de}}{\partial \theta_{de}}(y, \theta_{de})$ can be obtained directly through automatic differentiation (autodiff) and takes a form involving $y$. Analogously, $\frac{\partial f_{de}}{\partial y}(y, \theta_{de})$ is obtained by autodiff in a form built from $\theta_{de}$. Both $y$ and $\theta_{de}$ are known to the decoder at the receiving end. Thus, the computations of Eqs. (A3) and (A4) are viable.

Next, we calculate the gradient w.r.t. the parameters of the encoder, which is given by
$$\frac{\partial L}{\partial \theta_{en}} = \frac{\partial x}{\partial \theta_{en}} \frac{\partial L}{\partial x} = \left[\frac{\partial f_{en}}{\partial \theta_{en}}(s, \theta_{en})\right]^T \frac{\partial y}{\partial x} \frac{\partial L}{\partial y}. \tag{A5}$$

Similar to the decoder, $\frac{\partial f_{en}}{\partial \theta_{en}}(s, \theta_{en})$ can be computed via autodiff. But the middle term $\partial y / \partial x$, which describes the relationship between the actual channel’s input $x$ and output $y$, blocks the BP process. According to Eq. (A1), it should be given by
$$\frac{\partial y}{\partial x} = \frac{\partial C(x)}{\partial x} + \frac{\partial n(x)}{\partial x}. \tag{A6}$$

By comparing Eqs. (A1) and (A6), it is clear that the random noise $n_r$ does not affect the gradient of $y$ w.r.t. $x$. Nevertheless, the function of the actual physical channel, including $C(x)$ and $n(x)$, can never be exactly known and is inherently non-differentiable, blocking the gradient computation of Eq. (A6).

Conventional methods strive to circumvent this problem by constructing a precise digital model and conducting learning in the digital domain. However, accurate modeling leads to high training cost and complexity, while still failing to avoid performance loss after deployment. In response, we propose our physics-guided learning framework.

    Physics-Guided E2E Learning

Our physics-guided learning framework leverages a simple white-box model for gradient estimation, as shown in Fig. 9(b). The gradient is estimated from the model output $\hat{y}_{\text{phy}}$, given by
$$\hat{y}_{\text{phy}} = \hat{C}(x), \tag{A7}$$
$$\frac{\partial \hat{y}_{\text{phy}}}{\partial x} = \frac{\partial \hat{C}(x)}{\partial x}. \tag{A8}$$

In comparison to Eq. (A1), Eq. (A7) involves only the approximation $\hat{C}(\cdot)$ of the deterministic channel effects, resulting in a clean backward pass without the derivative of noise, as shown in Eq. (A8). We construct $\hat{C}(\cdot)$ from well-developed physical models and prior knowledge of the system under investigation, eliminating intractable noise modeling that may instead incur bias. With $\partial \hat{y}_{\text{phy}} / \partial x$, Eq. (A5) is modified to
$$\frac{\partial L}{\partial \theta_{en}} \approx \left[\frac{\partial f_{en}}{\partial \theta_{en}}(s, \theta_{en})\right]^T \frac{\partial \hat{y}_{\text{phy}}}{\partial x} \frac{\partial L}{\partial y}. \tag{A9}$$

According to Eqs. (A3) and (A9), we derive the gradients for both the decoder and encoder. We can therefore update their parameters by gradient descent:
$$\theta_{de} \leftarrow \theta_{de} - \eta \frac{\partial L}{\partial \theta_{de}}, \tag{A10}$$
$$\theta_{en} \leftarrow \theta_{en} - \eta \frac{\partial L}{\partial \theta_{en}}, \tag{A11}$$
where $\eta$ is the learning rate. In this way, we complete the training loop with the BP algorithm. The loop repeats until the error no longer decreases.
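To make the mechanics of Eqs. (A7)-(A9) and the update rule concrete, the following is a minimal numerical sketch, not the paper's implementation: a toy scalar pre-distorter runs its forward pass through a noisy, quantized, non-differentiable "actual" channel, while the backward pass substitutes the derivative of a simple white-box model. The channel function, noise level, and learning rate are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def real_channel(x):
    """Non-differentiable 'actual' channel: a compressive nonlinearity C(x),
    quantization (zero derivative almost everywhere), and random noise n_r."""
    y = x - 0.2 * x**3
    y = np.round(y * 128) / 128
    return y + rng.normal(0.0, 0.01, np.shape(x))

def white_box_grad(x):
    """Derivative of the white-box model C^(x) = x - 0.2 x^3, used ONLY
    in the backward pass; it ignores quantization and noise."""
    return 1.0 - 0.6 * x**2

# Toy scalar 'encoder' (pre-distorter) x = w * s, trained so that the
# channel output matches s. Decoder is the identity, loss L = (y - s)^2.
w, eta = 1.0, 0.05
for _ in range(500):
    s = rng.uniform(-0.8, 0.8)
    x = w * s                   # forward pass: encoder
    y = real_channel(x)         # forward pass through the ACTUAL channel
    dL_dy = 2.0 * (y - s)
    # Backward pass: dy/dx is replaced by the white-box gradient (Eq. (A9))
    w -= eta * s * white_box_grad(x) * dL_dy

# After training, w settles above 1 to pre-compensate the compression.
grid = np.linspace(-0.8, 0.8, 101)
mse = np.mean((real_channel(w * grid) - grid) ** 2)
mse_no_dpd = np.mean((real_channel(grid) - grid) ** 2)
print(w, mse, mse_no_dpd)
```

Even though the backward-pass model ignores quantization and noise entirely, the stochastic updates still descend the true expected loss, which is the core claim of the framework.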

    Gradient Estimation with a Data-Driven Model

As a comparison, we also illustrate the gradient estimated by a data-driven channel model. We refer to a data-driven model as a black box purely based on NNs. This model is first trained to reduce the error between its output $\hat{y}_{\text{data}}$ and the real channel output $y$. Its parameters are updated as indicated by the yellow arrow in Fig. 9(c). For simplicity, the loss function of this channel-fitting process is also given by the Euclidean norm:
$$L_{\text{data}} = \|y - \hat{y}_{\text{data}}\|^2, \tag{A12}$$
where the training target $y$ is given by Eq. (A1), implying that the model is trained on noisy labels. Notably, the real gradient $\partial y / \partial x$ is free from the random noise $n_r$, as indicated by Eq. (A6). Therefore, the NN channel model can be designed to fit mainly the mean effects of the actual channel, using common NN structures such as the feed-forward neural network (FNN) and the long short-term memory network (LSTM). However, the noisy label $y$ perturbs the fitting process, hindering the model from capturing the exact effects and even introducing an extra noise bias when trained with insufficient data and time. In this case, the NN model’s estimate of the channel output, $\hat{y}_{\text{data}}$, can be expressed as
$$\hat{y}_{\text{data}} = \hat{C}_{\text{data}}(x) + \hat{n}_{\text{data}}(x) + \bar{n}_r^{\text{data}}, \tag{A13}$$
where $\hat{n}_{\text{data}}(x)$ denotes a noise bias related to the input $x$, and $\bar{n}_r^{\text{data}}$ is the estimated mean of the random noise. Both terms are learned from the training dataset and vary with the noise level during training.

After fitting the actual channel, the fixed NN channel model is then used to facilitate the E2E learning process, as shown in Fig. 9(c). Consequently, the estimated gradient is given by
$$\frac{\partial \hat{y}_{\text{data}}}{\partial x} = \frac{\partial \hat{C}_{\text{data}}(x)}{\partial x} + \frac{\partial \hat{n}_{\text{data}}(x)}{\partial x}. \tag{A14}$$

Compared with the gradient estimated by a physics-based model [Eq. (A8)], the gradient in Eq. (A14) incorporates an additional noise bias. This noise effect reflects only the distribution of the training data and remains static once the channel model is fixed, potentially enlarging the discrepancy from the true gradient given by Eq. (A6).

Notably, for digital-domain learning with data-driven models (prior Method 3), the same model output $\hat{y}_{\text{data}}$ serves as the target for its DPD training process. In that case, both noise-related terms in Eq. (A13) would mislead the optimization process and may cause overfitting.

Indeed, noisy labels are reported to be more harmful than noisy inputs for NN training, and training methods to cope with them have been studied for years [68–70]. To obtain better training results, data-driven model-assisted learning methods must therefore employ complex training techniques and consume considerable time.

    APPENDIX B: SIMULATION SYSTEM AND MODELING METHODS

    Details of the simulation setup for the amplifier-less system and key physical models are presented here.

    The simulation setup is shown in Fig. 10. The processing before the simulated transmission channel includes pulse shaping and DPD. The NN structure of the DPD module will be detailed in the next section. The transmission channel consists of three parts, namely the Tx impairments, optical channel, and Rx process. Specifically, we mainly focus on signal distortions caused by Tx impairments, including DAC quantization, nonlinearities in RF drivers and modulator, and the bandwidth limitation represented as a low-pass filter. Except for the DAC, their models are identical to those comprising the backward white box in experiments (shown in the white box of Fig. 3). Here the DAC is non-differentiable, since it is directly modeled as ideal 8-bit quantization with zero derivative. The optical channel is a noisy link, and the Rx process outputs the final recovered symbols.
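To illustrate why the DAC model blocks backpropagation, the following sketch, an assumption-level model rather than the paper's code, implements ideal 8-bit quantization over a full scale of ±1 (the full-scale choice is illustrative) and shows its zero derivative between step boundaries:

```python
import numpy as np

def dac_8bit(x, full_scale=1.0):
    """Ideal 8-bit DAC model: 256 uniform levels over [-full_scale, full_scale].
    The output is piecewise constant, so its derivative is zero almost
    everywhere, which is why autodiff cannot propagate gradients through it."""
    step = 2 * full_scale / (2**8 - 1)
    return np.clip(np.round(x / step) * step, -full_scale, full_scale)

x = np.linspace(-1.0, 1.0, 5)
y = dac_8bit(x)

# Numerical derivative between quantization steps is exactly zero.
eps = 1e-6
num_grad = (dac_8bit(0.3 + eps) - dac_8bit(0.3)) / eps
print(y, num_grad)
```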


    Figure 10.Simulation setup of the amplifier-less coherent system. The inset shows the NN structure of the DPD module. Tx, transmitter; Rx, receiver.

    The entire simulation system shown in Fig. 10, apart from the DAC, also serves as the physics-based channel model for prior Method 2. The major difference between prior Method 2 and our physics-guided learning method is that its learning objective is computed based on the model output instead of the real one, and the backpropagation is executed in the same digital model as the forward pass.

    Table 2 lists the physical parameters of the simulation system, which are matched with those in experiments. Details of the main physical models are as follows.

Table 2. Parameters for the Simulation System

    DAC resolution: 8 bits
    RF driver (gain, Vsat): 17 dB, 7.8 V
    IQ modulator Vπ: 3.5 V
    Tx LPF bandwidth: 22 GHz
    Normalized reference power: 0 dBm
    AWGN power: −20 dBm
    Laser linewidth: 100 kHz
    Rx LPF bandwidth: 22 GHz
    Samples per symbol: 2 sps

    RF Driver Model

The RF driver amplifies the output signal of the DAC and then drives the modulator to accomplish electro-optical conversion. We assume it behaves as a memoryless system that only amplifies the electrical amplitude of the signal. Additionally, we assume that its transfer functions for the two branches (in-phase and quadrature parts) of the IQ modulator are identical, without any imbalance or phase distortion. Hence, it is modeled using the Rapp model [71]:
$$V_{\text{out}} = \frac{V_{\text{in}} \cdot G}{\left[1 + \left(\frac{V_{\text{in}} \cdot G}{V_{\text{sat}}}\right)^4\right]^{1/4}},$$
where $V_{\text{in}}$ is the input Vpp of the driver (and also the output Vpp of the DAC), $V_{\text{out}}$ and $V_{\text{sat}}$ are the output and saturation Vpp of the RF driver, and $G$ denotes the nominal gain of the driver.
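As an illustrative sketch of the Rapp model with the Table 2 values, assuming the 17 dB figure is a voltage gain (20·log10) and the smoothness factor p = 2, which gives the exponent 4 used above:

```python
import numpy as np

def rapp_driver(v_in, gain_db=17.0, v_sat=7.8, p=2):
    """Rapp model of the RF driver: linear amplification for small inputs,
    smooth saturation toward v_sat for large inputs."""
    g = 10 ** (gain_db / 20)  # voltage gain (assumes dB is 20*log10)
    return v_in * g / (1 + (v_in * g / v_sat) ** (2 * p)) ** (1 / (2 * p))

v = np.linspace(0.0, 2.0, 100)
out = rapp_driver(v)
print(out[0], out[-1])
```

For millivolt-level inputs the response is essentially linear with gain $G$, while the output asymptotically approaches $V_{\text{sat}}$ as the drive increases, which is the compression the DPD must pre-compensate.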

    IQ Modulator Model

The IQ modulator is usually composed of a pair of parallel Mach–Zehnder modulators (MZMs), each configured in a push–pull arrangement. High-order modulation schemes in coherent optical transmitters rely on these dual MZMs biased at the null point, with the transfer function of each branch modeled as a sinusoidal response [72]:
$$E(t) = E_0\left[\sin\left(\frac{\pi V_I(t)}{2V_\pi}\right) + j \cdot \sin\left(\frac{\pi V_Q(t)}{2V_\pi}\right)\right],$$
where $E_0$ is the amplitude of the optical field, $V_\pi$ is the voltage difference required to apply a $\pi$ radian change to the sinusoidal transfer function, and $V_I(t)$ and $V_Q(t)$ are the voltages from the RF drivers applied to the in-phase and quadrature branches, respectively. We assume an ideal IQ modulator with symmetric and identical MZMs, hence without amplitude imbalance or relative delay between the two branches.
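A direct sketch of this sinusoidal transfer function with the Table 2 value of Vπ (the drive voltages below are illustrative):

```python
import numpy as np

def iq_modulator(v_i, v_q, v_pi=3.5, e0=1.0):
    """Null-biased IQ modulator: each MZM branch maps its drive voltage
    through a sinusoidal response with half-wave voltage v_pi."""
    return e0 * (np.sin(np.pi * v_i / (2 * v_pi))
                 + 1j * np.sin(np.pi * v_q / (2 * v_pi)))

# Small drive voltages stay in the quasi-linear region of the sine;
# driving at v_pi reaches the peak of the transfer function.
e_small = iq_modulator(0.35, 0.0)
e_large = iq_modulator(3.5, 0.0)
print(e_small, e_large)
```

The gap between the sine response and its linear approximation is one of the nonlinear distortions that grows with Vpp, motivating the DPD experiments in Section 3.D.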

    In the experimental setup, the minimum bandwidth limitation of the transmitter is determined by the modulator. To represent the combined effect of temporal distortions and frequency response of components, a first-order Gaussian filter is applied as the low-pass filter of the transmitter. Therefore, the overall transfer function can be understood as a Wiener–Hammerstein (WH) structure [59]. Notably, the same low-pass filter is applied at the receiver.

    Optical Amplifier-less Channel

As shown in Fig. 10, the output signal $x$ of the DPD is distorted by the transmitter into $\tilde{x}$ and then transmitted over an optical channel consisting of optical link loss and noise. The resulting signal is further impaired as
$$z = \alpha \cdot \tilde{x} \cdot e^{i\phi_{PN}} + n_r,$$
where $\alpha$ denotes the loss coefficient applied to the magnitude of the optical field in linear units, $\phi_{PN}$ is the phase noise modeled as a Wiener process determined by the laser linewidth, and $n_r$ is the random noise modeled as zero-mean additive white Gaussian noise (AWGN). To simulate the impact of driving voltage and fiber link loss, the power of $n_r$ is fixed and the transmitted signal $\tilde{x}$ is normalized to a reference power level. This power level is determined by a non-DPD test signal sequence with a driving voltage equal to the $V_\pi$ of the modulator. Detailed simulation parameters can be found in Table 2.
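A compact sketch of this channel model might look as follows; the sample rate and noise standard deviation are illustrative placeholders, since the AWGN power in Table 2 is defined relative to the system's reference power rather than as a per-sample standard deviation:

```python
import numpy as np

rng = np.random.default_rng(1)

def optical_channel(x, loss_db=5.0, linewidth=100e3, sample_rate=64e9,
                    noise_std=0.01):
    """Amplifier-less optical channel: field loss, laser phase noise as a
    Wiener process (increment variance 2*pi*linewidth/sample_rate), and
    complex additive white Gaussian noise."""
    alpha = 10 ** (-loss_db / 20)               # field loss in linear units
    var = 2 * np.pi * linewidth / sample_rate   # phase increment variance
    phi = np.cumsum(rng.normal(0.0, np.sqrt(var), len(x)))
    n_r = (rng.normal(0.0, noise_std, len(x))
           + 1j * rng.normal(0.0, noise_std, len(x)))
    return alpha * x * np.exp(1j * phi) + n_r

x = np.ones(1000, dtype=complex)   # placeholder transmitted waveform
z = optical_channel(x)
print(np.mean(np.abs(z)))
```

Note that the phase noise rotates the field without changing its magnitude, so the average received amplitude is set by the loss term alone (up to the AWGN).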

    APPENDIX C: LEARNING WITH AN NN-BASED POST-EQUALIZER

    Here we demonstrate that our proposed approach can further improve the performance by including an extra NN-based post-equalizer. Specifically, an NN with the same structure as the DPD is introduced after phase recovery for nonlinear equalization and jointly learned with the Tx DPD module in simulations using our approach. As shown in Fig. 11, our method continues to reduce the MSE after adding the post-equalizer, and the performance is better than optimizing the DPD alone. This result shows the effectiveness and general applicability of our approach. However, we do not include this NN-based post-equalizer in our experiments to avoid increased DSP complexity.


    Figure 11.Training process comparison between training DPD alone and joint learning with an NN-based post-equalizer.

    APPENDIX D: DPD NEURAL NETWORK AND TRAINING DETAILS

We refer to Refs. [42,60] to design the NN structure for DPD applications. It takes the basic architecture of the feed-forward neural network, with a sliding-time-window input layer to account for the memory effect between symbols. As shown in the inset of Fig. 10, the input layer is fed with $2m+1$ symbols, where the central one ($u_t$) corresponds to the current input and the remaining $2m$ symbols are the inputs of the previous and future $m$ time steps, respectively. The final layer yields one symbol at each time step, which is added directly to the central input by a shortcut connection to produce the final output $x_t$. The outputs from all time steps are concatenated to form a complete sequence. Note that, before DPD, signals have been up-sampled to 2 samples per symbol (sps) in the pulse-shaping module. Thus, there are $2(2m+1)$ neurons in the input layer and two neurons in the output layer. Here we set the time step $m=7$.

    Previous DPD works used two real-valued NNs to process the real and imaginary parts of complex signals separately [42,60]. Instead, we implement our DPD as a complex-valued NN through an online toolbox [73] and utilize complex tanh as the activation function for the input and hidden layers. The toolbox is supported by the machine learning library PyTorch, facilitating the use of autodiff functions for backpropagation.
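The shapes involved in this complex-valued, sliding-window structure can be sketched as follows; the single hidden layer, its width, and the weight initialization are arbitrary illustrations rather than the configuration of Refs. [42,60,73]:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 7
n_in, n_hidden, n_out = 2 * (2 * m + 1), 16, 2   # hidden width is illustrative

# Complex-valued weights (small random initialization)
w1 = (rng.normal(0, 0.05, (n_hidden, n_in))
      + 1j * rng.normal(0, 0.05, (n_hidden, n_in)))
w2 = (rng.normal(0, 0.05, (n_out, n_hidden))
      + 1j * rng.normal(0, 0.05, (n_out, n_hidden)))

def dpd_forward(window):
    """One DPD step: a sliding window of 2(2m+1) complex samples in, two
    pre-distorted samples out, with a shortcut adding the central (current)
    samples so the network only has to learn the correction term."""
    h = np.tanh(w1 @ window)                       # complex tanh activation
    centre = window[n_in // 2 - 1: n_in // 2 + 1]  # the 2 current-symbol samples
    return w2 @ h + centre                         # shortcut connection

u = rng.normal(0, 1, n_in) + 1j * rng.normal(0, 1, n_in)
x = dpd_forward(u)
print(x.shape)
```

NumPy's `tanh` operates directly on complex arrays, which is also why a complex-valued framework can reuse standard activations for this DPD.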

The two prior E2E learning methods (Methods 1 and 3) use an identical NN structure for both channel modeling and DPD, following Refs. [18,34,60]. This structure may not be optimal for the system under investigation. Optimizing the NN hyperparameters would require a systematic design strategy, demanding additional time and tuning effort; this constitutes part of their training complexity.

All NNs are trained using supervised learning, where the training data consist of input–target pairs. For DPD training, the transmitted symbol s serves as both input and target. For data-driven channel model training, the pre-distorted signal x is the input, while the corresponding system output symbol y is the target. Since x is derived from s, the training data amount for both processes is measured by the total number of transmitted symbols. In our experiments, a sequence of fixed-length random symbols is generated in each training iteration and used for training. Thus, the total training data amount is proportional to the number of iterations and can be expressed as

Data amount = Symbol length per iteration × (Iteration number for channel modeling + Iteration number for DPD training).
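As a concrete check of this accounting (the iteration counts below are illustrative, not values from our experiments):

```python
SYMBOLS_PER_ITER = 2 ** 14  # fixed sequence length per iteration

def data_amount(iters_channel, iters_dpd):
    """Total transmitted symbols consumed during training."""
    return SYMBOLS_PER_ITER * (iters_channel + iters_dpd)

# e.g., a method splitting 200 iterations between channel modeling and DPD
print(data_amount(100, 100))  # 3276800

# our method needs no channel-model iterations, so for the same DPD budget
print(data_amount(0, 100))    # 1638400
```

The comparison makes explicit why eliminating the channel-modeling iterations halves the data consumed for an equal DPD training budget.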

In experimental implementations, training is conducted by controlling the arbitrary waveform generator (AWG) and real-time oscilloscope simultaneously, requiring frequent data loading and acquisition from the actual system. The sequence length must be chosen carefully: a shorter sequence increases the data acquisition frequency and may fail to capture the complete channel effects, whereas a longer sequence increases the data loading time. Balancing these factors, we fix the length at 2^14 symbols per iteration.

Regarding system measurements during training, we define one measurement as a single acquisition of an overall system output sequence y. For our method and Method 1, training relies on real system outputs, so the number of measurements equals the number of iterations. For Method 3, real system outputs are used only for channel model training, so the number of measurements is half the number of iterations (its iterations are split equally between DPD training and channel modeling). Method 2 requires no system measurements; however, its performance is unacceptable (as demonstrated in Section 3.D).

For training processes requiring system measurements, the main time-consuming factor is the forward propagation stage, which includes data loading to the AWG, signal transmission through the system, data acquisition from the oscilloscope, and Rx-DSP. On our platform, one training iteration with a sequence of 2^14 symbols takes about 2.8 s. In contrast, the digital-domain DPD training process takes around 0.52 s per iteration. All NN models are trained in PyTorch, with 32 GB of 2400 MHz RAM and an Intel Core i7-12700H 2.3 GHz CPU.

During testing, the same sequence of 2^16 symbols is used to evaluate all methods. For cases requiring periodic validation of training progress, a sequence of 2^15 symbols is used as the validation set.

In experiments, we find that the choice of optimizer and initial learning rate significantly influences the training of data-driven channel models. With an SGD optimizer, a large initial learning rate hinders convergence, while a small one tends to settle in local minima with poor modeling performance. In contrast, Adam provides stable training results across a wide range of initial learning rates (0.001–0.01). Consequently, we adopt Adam for both channel modeling and DPD training. We choose the optimal initial learning rate for each training process by a standard grid search with a step of 0.001 over the range 0.001–0.01, ensuring the hyperparameters used in each method are optimal. The initial learning rates for each method are listed in Table 3.
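The grid-search selection can be sketched as follows. The validation objective here is a toy quadratic standing in for "train the NN with this learning rate and return the validation loss"; it is not our actual training loop.

```python
import numpy as np

def grid_search_lr(train_and_validate, lo=0.001, hi=0.01, step=0.001):
    """Return the learning rate with the lowest validation loss."""
    lrs = np.round(np.arange(lo, hi + step / 2, step), 3)  # 0.001 ... 0.010
    losses = {lr: train_and_validate(lr) for lr in lrs}
    return min(losses, key=losses.get)

# Toy validation objective with its minimum at 0.005, mimicking the
# selection that produced the DPD learning rate for our method (Table 3).
toy_loss = lambda lr: (lr - 0.005) ** 2
best = grid_search_lr(toy_loss)
print(best)  # 0.005
```

Each candidate rate requires one full training run, so the 0.001 step over 0.001–0.01 costs ten runs per training process.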

Table 3. Initial Learning Rates

Method        Channel Model    DPD Module
Our method    N.A.             0.005
Method 1      0.003            0.004
Method 2      N.A.             0.002
Method 3      0.003            0.003

    References

    [1] V. Mnih, K. Kavukcuoglu, D. Silver. Human-level control through deep reinforcement learning. Nature, 518, 529-533(2015).

    [2] I. Sutskever, O. Vinyals, Q. V. Le. Sequence to sequence learning with neural networks. arXiv(2014).

    [3] N. Carion, F. Massa, G. Synnaeve. End-to-end object detection with transformers. European Conference on Computer Vision, 213-229(2020).

    [4] M. Bojarski, D. Del Testa, D. Dworakowski. End to end learning for self-driving cars. arXiv(2016).

[5] S. Levine, C. Finn, T. Darrell. End-to-end training of deep visuomotor policies. J. Mach. Learn. Res., 17, 1334-1373(2016).

    [6] G. Barbastathis, A. Ozcan, G. Situ. On the use of deep learning for computational imaging. Optica, 6, 921-943(2019).

    [7] X. Guo, T. D. Barrett, Z. M. Wang. Backpropagation through nonlinear units for the all-optical training of neural networks. Photonics Res., 9, B71-B80(2021).

    [8] T. O’Shea, J. Hoydis. An introduction to deep learning for the physical layer. IEEE Trans. Cognit. Commun. Netw., 3, 563-575(2017).

    [9] N. Chi, Y. Zhou, Y. Wei. Visible light communication in 6G: advances, challenges, and prospects. IEEE Veh. Technol. Mag., 15, 93-102(2020).

    [10] M. Z. Chowdhury, M. K. Hasan, M. Shahjalal. Optical wireless hybrid networks: trends, opportunities, challenges, and research directions. IEEE Commun. Surv. Tuts., 22, 930-966(2020).

    [11] W. Shi, Y. Tian, A. Gervais. Scaling capacity of fiber-optic transmission systems via silicon photonics. Nanophotonics, 9, 4629-4663(2020).

    [12] B. J. Puttnam, G. Rademacher, R. S. Luís. Space-division multiplexing for optical fiber communications. Optica, 8, 1186-1203(2021).

[13] M. Srinivasan, J. Song, A. Grabowski. End-to-end learning for VCSEL-based optical interconnects: state-of-the-art, challenges, and opportunities. J. Lightwave Technol., 41, 3261-3277(2023).

    [14] Z. Li, Q. Xie, Y. Zhang. Four-wave mixing based spectral Talbot amplifier for programmable purification of optical frequency combs. APL Photonics, 9, 036101(2024).

    [15] E. Agrell, M. Karlsson, F. Poletti. Roadmap on optical communications. J. Opt., 26, 093001(2024).

    [16] B. Karanov, M. Chagnon, F. Thouin. End-to-end deep learning of optical fiber communications. J. Lightwave Technol., 36, 4843-4855(2018).

    [17] B. Karanov, M. Chagnon, V. Aref. Concept and experimental demonstration of optical IM/DD end-to-end system optimization using a generative model. 2020 Optical Fiber Communications Conference and Exhibition (OFC), 1-3(2020).

    [18] Z. Niu, H. Yang, H. Zhao. End-to-end deep learning for long-haul fiber transmission using differentiable surrogate channel. J. Lightwave Technol., 40, 2807-2822(2022).

    [19] S. Li, C. Häger, N. Garcia. Achievable information rates for nonlinear fiber communication via end-to-end autoencoder learning. 2018 European Conference on Optical Communication (ECOC), 1-3(2018).

[20] T. Uhlemann, S. Cammerer, A. Span. Deep-learning autoencoder for coherent and nonlinear optical communication. Photonic Networks; 21st ITG-Symposium, 1-8(2020).

[21] S. Gaiarin, F. Da Ros, R. T. Jones. End-to-end optimization of coherent optical communications over the split-step Fourier method guided by the nonlinear Fourier transform theory. J. Lightwave Technol., 39, 418-428(2021).

    [22] Z. Zhai, H. Jiang, M. Fu. An interpretable mapping from a communication system to a neural network for optimal transceiver-joint equalization. J. Lightwave Technol., 39, 5449-5458(2021).

    [23] J. Song, C. Häger, J. Schröder. End-to-end autoencoder for superchannel transceivers with hardware impairment. 2021 Optical Fiber Communications Conference and Exhibition (OFC), 1-3(2021).

    [24] Z. He, J. Song, C. Häger. Experimental demonstration of learned pulse shaping filter for superchannels. 2022 Optical Fiber Communications Conference and Exhibition (OFC), 1-3(2022).

    [25] L. Minelli, F. Forghieri, A. Nespola. A multi-rate approach for nonlinear pre-distortion using end-to-end deep learning in IM-DD systems. J. Lightwave Technol., 41, 420-431(2023).

    [26] H. Lee, S. H. Lee, T. Q. S. Quek. Deep learning framework for wireless systems: applications to optical wireless communications. IEEE Commun. Mag., 57, 35-41(2019).

    [27] O. Jovanovic, M. P. Yankov, F. Da Ros. End-to-end learning of a constellation shape robust to channel condition uncertainties. J. Lightwave Technol., 40, 3316-3324(2022).

    [28] A. Rode, B. Geiger, L. Schmalen. Geometric constellation shaping for phase-noise channels using a differentiable blind phase search. 2022 Optical Fiber Communications Conference and Exhibition (OFC), 1-3(2022).

    [29] B. M. Oliveira, M. S. Neves, F. P. Guiomar. End-to-end deep learning of geometric shaping for unamplified coherent systems. Opt. Express, 30, 41459-41472(2022).

    [30] M. Schaedler, S. Calabrò, F. Pittalà. Neural network assisted geometric shaping for 800 Gbit/s and 1 Tbit/s optical transmission. 2020 Optical Fiber Communications Conference and Exhibition (OFC), 1-3(2020).

    [31] V. Aref, M. Chagnon. End-to-end learning of joint geometric and probabilistic constellation shaping. 2022 Optical Fiber Communications Conference and Exhibition (OFC), 1-3(2022).

    [32] V. Neskorniuk, A. Carnio, V. Bajaj. End-to-end deep learning of long-haul coherent optical fiber communications via regular perturbation model. 2021 European Conference on Optical Communication (ECOC), 1-4(2021).

[33] H. Ye, L. Liang, G. Y. Li. Deep learning-based end-to-end wireless communication systems with conditional GANs as unknown channels. IEEE Trans. Wirel. Commun., 19, 3133-3143(2020).

    [34] Y. Xu, L. Huang, W. Jiang. End-to-end learning for 100G-PON based on noise adaptation network. J. Lightwave Technol., 42, 2328-2337(2024).

    [35] Y. Xu, X. Guan, W. Jiang. Low-complexity end-to-end deep learning framework for 100G-PON. J. Opt. Commun. Netw., 16, 1093-1103(2024).

    [36] J. Shi, W. Niu, Z. Li. Optimal adaptive waveform design utilizing an end-to-end learning-based pre-equalization neural network in an UVLC system. J. Lightwave Technol., 41, 1626-1636(2023).

    [37] J. Shi, Z. Li, J. Jia. Waveform-to-waveform end-to-end learning framework in a seamless fiber-terahertz integrated communication system. J. Lightwave Technol., 41, 2381-2392(2023).

[38] S. Xing, Z. Li, C. Huang. End-to-end deep learning for a flexible coherent PON with user-specific constellation optimization. J. Opt. Commun. Netw., 16, 59-70(2023).

    [39] A. Sun, Z. Li, J. Jia. End-to-end deep-learning-based photonic-assisted multi-user fiber-mmwave integrated communication system. J. Lightwave Technol., 42, 80-94(2023).

    [40] H. Yang, Z. Niu, S. Xiao. Fast and accurate optical fiber channel modeling using generative adversarial network. J. Lightwave Technol., 39, 1322-1333(2021).

    [41] F. A. Aoudia, J. Hoydis. Model-free training of end-to-end communication systems. IEEE J. Sel. Areas Commun., 37, 2503-2516(2019).

    [42] J. Song, Z. He, C. Häger. Over-the-fiber digital predistortion using reinforcement learning. 2021 European Conference on Optical Communication (ECOC), 1-4(2021).

    [43] J. Song, C. Häger, J. Schröder. Model-based end-to-end learning for WDM systems with transceiver hardware impairments. IEEE J. Sel. Top. Quantum Electron., 28, 7700114(2022).

    [44] O. Jovanovic, M. P. Yankov, F. Da Ros. Gradient-free training of autoencoders for non-differentiable communication channels. J. Lightwave Technol., 39, 6381-6391(2021).

    [45] D. Bullock, B. Johnson, R. B. Wells. Hardware-in-the-loop simulation. Transp. Res. Emerg. Technol., 12, 73-89(2004).

    [46] Y. Peng, S. Choi, N. Padmanaban. Neural holography with camera-in-the-loop training. ACM Trans. Graph., 39, 185(2020).

[47] T. P. Lillicrap, D. Cownden, D. B. Tweed. Random synaptic feedback weights support error backpropagation for deep learning. Nat. Commun., 7, 13276(2016).

    [48] L. G. Wright, T. Onodera, M. M. Stein. Deep physical neural networks trained with backpropagation. Nature, 601, 549-555(2022).

    [49] K. Zhong, X. Zhou, J. Huo. Digital signal processing for short-reach optical communications: a review of current technologies and future trends. J. Lightwave Technol., 36, 377-400(2018).

    [50] F. Buchali, M. Chagnon, K. Schuh. Amplifier less 400 Gb/s coherent transmission at short reach. 2018 European Conference on Optical Communication (ECOC), 1-3(2018).

[51] G. Rizzelli Martella, A. Nespola, S. Straullu. Scaling laws for unamplified coherent transmission in next-generation short-reach and access networks. J. Lightwave Technol., 39, 5805-5814(2021).

    [52] D. Tauber, B. Smith, D. Lewis. Role of coherent systems in the next DCI generation. J. Lightwave Technol., 41, 1139-1151(2023).

[53] X. Zhou, R. Urata, H. Liu. Beyond 1 Tb/s intra-data center interconnect technology: IM-DD or coherent? J. Lightwave Technol., 38, 475-484(2019).

    [54] S. Bernal, M. Dumont, E. Berikaa. 12.1 terabit/second data center interconnects using O-band coherent transmission with QD-MLL frequency combs. Nat. Commun., 15, 7741(2024).

    [55] H. Jiang, M. Fu, Y. Zhu. Digital pre-distortion using a Gauss-Newton-based direct learning architecture for coherent optical transmitters. Opt. Lett., 48, 1706-1709(2023).

    [56] C. Eun, E. J. Powers. A new volterra predistorter based on the indirect learning architecture. IEEE Trans. Signal Process., 45, 223-227(1997).

    [57] P. W. Berenguer, M. Nolle, L. Molle. Nonlinear digital pre-distortion of transmitter components. J. Lightwave Technol., 34, 1739-1745(2016).

    [58] H. Paaso, A. Mammela. Comparison of direct learning and indirect learning predistortion architectures. IEEE International Symposium on Wireless Communication Systems, 309-313(2008).

    [59] G. Paryanti, H. Faig, L. Rokach. A direct learning approach for neural network based pre-distortion for coherent nonlinear optical transmitter. J. Lightwave Technol., 38, 3883-3896(2020).

    [60] V. Bajaj, F. Buchali, M. Chagnon. Deep neural network-based digital pre-distortion for high baudrate optical coherent transmission. J. Lightwave Technol., 40, 597-606(2022).

[61] T. Sasai, M. Nakamura, E. Yamazaki. Wiener-Hammerstein model and its learning for nonlinear digital pre-distortion of optical transmitters. Opt. Express, 28, 30952-30963(2020).

    [62] R. Emmerich, M. Sena, R. Elschner. Enabling S-C-I-band systems with standard C-band modulator and coherent receiver using coherent system identification and nonlinear predistortion. J. Lightwave Technol., 40, 1360-1368(2022).

    [63] X. Lu, M. Zhao, L. Qiao. Non-linear compensation of multi-CAP VLC system employing pre-distortion base on clustering of machine learning. 2018 Optical Fiber Communications Conference and Exposition (OFC), 1-3(2018).

    [64] R. Elschner, R. Emmerich, C. Schmidt-Langhorst. Improving achievable information rates of 64-GBd PDM-64QAM by nonlinear transmitter predistortion. Optical Fiber Communication Conference, M1C.2(2018).

    [65] V. Scarani, H. Bechmann-Pasquinucci, N. J. Cerf. The security of practical quantum key distribution. Rev. Mod. Phys., 81, 1301-1350(2009).

    [66] D. Rafique, L. Velasco. Machine learning for network automation: overview, architecture, and applications. J. Opt. Commun. Netw., 10, D126-D143(2018).

    [67] X. Liu, Y. Zhang, Y. Chen. Digital twin modeling and controlling of optical power evolution enabling autonomous-driving optical networks: a Bayesian approach. Adv. Photonics, 6, 026006(2024).

    [68] S. Reed, H. Lee, D. Anguelov. Training deep neural networks on noisy labels with bootstrapping. arXiv(2014).

    [69] B. Han, Q. Yao, X. Yu. Co-teaching: robust training of deep neural networks with extremely noisy labels. arXiv(2018).

    [70] H. Song, M. Kim, D. Park. Learning from noisy labels with deep neural networks: a survey. IEEE Trans. Neural Netw. Learn. Syst., 34, 8135-8153(2023).

    [71] C. Rapp. Effects of HPA-nonlinearity on a 4-DPSK/OFDM-signal for a digital sound broadcasting signal. ESA Spec. Publ., 332, 179-184(1991).

    [72] G. Li, P. Yu. Optical intensity modulators for digital and analog applications. J. Lightwave Technol., 21, 2010-2030(2003).

    [73] M. W. Matthès, Y. Bromberg, J. de Rosny. Learning and avoiding disorder in multimode fibers. Phys. Rev. X, 11, 021060(2021).
