Abstract
1. INTRODUCTION
Advanced artificial intelligence systems are becoming extremely demanding in terms of training time and energy consumption [1,2]. An ever-larger number of trainable parameters is required to exploit over-parameterization and achieve state-of-the-art performances [3,4]. Large-scale, energy-efficient computational hardware is becoming a subject of intense interest.
Photonic neuromorphic computing systems [5–7] offer large throughput [8] and energetically efficient [9] hardware accelerators based on integrated silicon photonic circuits [10–14], engineered meta-materials [15], or 3D-printed diffractive masks [16,17]. Other light-based computing architectures leverage reservoir computing (RC) [18] and extreme learning machine (ELM) [19] computational paradigms, where the input data is mapped into a feature space through a fixed set of random weights and training is performed only on the linear readout layer. Optical reservoir computers [20–33] and photonic extreme learning machines (PELMs) [34–37] apply successfully to various learning tasks, ranging from time series prediction [38,39] to image classification [40,41]. Despite the remarkable performances achieved by these architectures, their impact on large-scale problems is limited, mainly due to size constraints.
Here, we demonstrate photonic machine learning at an ultralarge scale by mapping the optical output layer onto the full three-dimensional (3D) structure of the optical field. This original approach offers inherent scalability to the optical device and effortless access to the over-parameterized learning region. The 3D-PELM that we implement can simultaneously process up to 250,000 total input-output nodes via spatial light modulation, featuring a total network capacity one order of magnitude larger than existing optical reservoir computers for supervised learning. Our large-scale photonic network allows us to optically implement a massive text classification problem, the sentiment analysis of the Internet Movie Database (IMDb) [44], and enables, for the first time to our knowledge, the observation of the double descent phenomenon [45–47] on a photonic computing device. We demonstrate that, thanks to its huge number of optical output nodes, the 3D-PELM successfully classifies text by using only a limited number of training points, realizing energy-efficient, large-scale text processing.
2. RESULTS
A. Three-Dimensional PELM
A PELM classifies a dataset with points by mapping the input sample into a high-dimensional feature space through a nonlinear transformation that is performed by a photonic platform [34]. To perform the classification, the output matrix , which contains the measured optical signals, is linearly combined with a set of trainable readout weights (see Appendix A). In all previous PELM realizations [25,34,35,39,41], the intensities stored in only contain information on the optical field at a single spatial or temporal plane. Here, to scale up the optical network, we exploit the entire three-dimensional optical field. Figure 1(B) shows a schematic of our 3D-PELM: the photonic network uses as output nodes the full 3D structure of the optical field propagating in free space. The transformation of the input data sample into an optical intensity volume occurs through linear wave mixing by coherent free-space propagation and intensity detection. Specifically, the 3D optical mapping reads as
Figure 1.Three-dimensional PELM for language processing. (A) The text database entry is a paragraph of variable length. Text pre-processing: a sparse representation of the input paragraph is mapped into a Hadamard matrix with phase values in
In experiments, to collect uncorrelated output data , we employ multiple cameras that simultaneously image the intensity in distinct far-field planes (see Appendix B). The resulting speckle-like distributions [Fig. 1(C)] are mutually uncorrelated, so each plane contributes non-redundant information. Exploiting the intensity of multiple optical planes is crucial for increasing the effective number of readout channels. In fact, the intensity on a single camera has a finite correlation length, so neighboring pixels are necessarily correlated, which makes the number of independent channels lower than the number of available sensor pixels. Our 3D scheme circumvents this limitation and allows us to achieve a huge number of output nodes.
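As a rough numerical illustration of the mapping described above, the following Python sketch models each detection plane as an independent complex random matrix acting on a phase-encoded input, followed by intensity detection. This is a generic random-feature model assumed for illustration only, not the paper's Eq. (2); the names `random_plane` and `pelm_features` and all sizes are hypothetical.

```python
import cmath
import random

def random_plane(n_out, n_in, rng):
    """Complex Gaussian matrix standing in for propagation to one detection plane (assumption)."""
    return [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_in)]
            for _ in range(n_out)]

def pelm_features(phases, planes):
    """Phase-encode the input, mix it linearly, and detect the intensity on every plane."""
    field = [cmath.exp(1j * p) for p in phases]        # phase-only modulation
    features = []
    for plane in planes:
        for row in plane:
            amplitude = sum(t * e for t, e in zip(row, field))
            features.append(abs(amplitude) ** 2)       # intensity detection
    return features

rng = random.Random(0)
planes = [random_plane(4, 8, rng) for _ in range(3)]   # three planes, four pixels each
x = [rng.uniform(0, cmath.pi) for _ in range(8)]       # toy phase-encoded input
h = pelm_features(x, planes)                           # concatenated readout channels
```

In this toy model the readout vector simply concatenates the intensities of all planes, so adding a plane adds channels without changing the input encoding.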
B. Optical Encoding of Natural Language Sentences
The natural language processing (NLP) task we considered is the classification of the IMDb dataset, constructed by Maas
Optical encoding of natural language in the 3D-PELM requires using a spatial light modulator (SLM) with a fixed number of input modes for sentences with variable lengths. We follow the one-hot method employed in digital NLP and use an extension of it, the so-called tf-idf representation, to construct the input representation . In information retrieval, this statistic reflects how important a word is to a document (frequency ) in a collection of documents. It is defined as , where is the length of a paragraph and is the number of sentences containing the th word. Within the tf-idf representation, each paragraph becomes a sparse, real vector [see Fig. 1(A)] whose non-zero elements are the tf-idf values of the words composing it. To optically encode these sparse data, we applied to the Walsh–Hadamard transform (WHT) [49], which decomposes a discrete function into a superposition of Walsh functions. The transformed dataset is a dense matrix [Fig. 1(A)], which is displayed on the SLM as a phase mask within the [0, ] range [ in Eq. (2)]. By exploiting the large number of available input pixels, we encode paragraphs with a massive size containing hundreds of thousands of words.
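The encoding pipeline described above (sparse tf-idf vector, Walsh–Hadamard transform, phase mask) can be sketched in a few lines of Python. The tf-idf formula used here is one common textbook variant and may differ from the definition used in the experiment; the upper phase limit `phase_max`, the linear rescaling, and the function names are likewise assumptions made for illustration.

```python
import math

def tfidf(docs, vocab):
    """Common tf-idf variant (assumption): (count / doc length) * log(n_docs / doc frequency)."""
    n_docs = len(docs)
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    rows = []
    for d in docs:
        row = []
        for w in vocab:
            tf = d.count(w) / len(d)
            idf = math.log(n_docs / df[w]) if df[w] else 0.0
            row.append(tf * idf)
        rows.append(row)
    return rows

def fwht(vec):
    """Fast Walsh-Hadamard transform (unnormalized); length must be a power of two."""
    a = list(vec)
    n = len(a)
    assert n and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a

def to_phase(vec, phase_max=math.pi):
    """Rescale a vector linearly into [0, phase_max] (hypothetical choice of range)."""
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return [0.0 for _ in vec]
    return [phase_max * (v - lo) / (hi - lo) for v in vec]

docs = [["great", "movie", "great", "cast"], ["boring", "movie"]]
vocab = ["great", "movie", "cast", "boring"]   # vocabulary size padded to a power of two
sparse = tfidf(docs, vocab)                    # sparse, real vectors
dense = [fwht(row) for row in sparse]          # dense after the WHT
masks = [to_phase(row) for row in dense]       # phase values for the SLM
```

The transform spreads each non-zero tf-idf entry over all output components, which is why the dense representation is better suited to a phase mask than the sparse one.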
C. Observation of the Photonic Double Descent
Machine learning models with a very large number of trainable parameters show unique features with respect to smaller models, such as the double descent phenomenon. This effect is a resonance in the neural network performance, ruled by the ratio between the number of trainable parameters and the number of training points . In the under-parameterized regime (), good performances are obtained by balancing model bias and variance. As grows, models tend to overfit training data, and prediction performance gets worse until the so-called interpolation threshold () is reached. At this point the model optimally interpolates the training data and the prediction error is maximum. Beyond this resonance, in the over-parameterized regime, the model keeps interpolating training points, but performances on the test set reach the global optimum [50].
We experimentally implement photonic NLP and investigate the double descent effect on our large-scale 3D-PELM. Specifically, we analyze the classification accuracy for the IMDb task (Appendix D) as a function of the features and training set size . In Fig. 2(A) we report the observation of the double descent. We observe a dip in test accuracy as the number of channels reaches the number of training points . Beyond this resonance (interpolation threshold), we find the over-parameterized region in which maximum accuracy on the training set is achieved. The behavior is obtained via training on examples and using a fixed train/test split ratio of 0.67/0.33. In this case, the highest classification accuracy is found in the under-parameterized region (0.77). Conversely, in Fig. 2(B), we consider a much smaller number of training points, . Remarkably, we reach the same optimal accuracy despite using only a fraction of the available training points. This observation reveals a particularly favorable learning region of the 3D-PELM in which only a reduced number of training points is required to reach high accuracy levels. From the operational standpoint, it is much more efficient to measure a large number of modes in parallel than to process many training examples sequentially. In Fig. 2(C) we report the full dynamics of the double descent phenomenon by continuously varying both and . We observe that the accuracy dip shifts with a constant velocity. This result shows the existence of an optimal learning region that is accessible on the 3D-PELM thanks to its large number of photonic nodes.
Figure 2.Photonic sentiment analysis. (A), (B) Training and test accuracy of the 3D-PELM on the IMDb dataset as a function of the number of output channels. The shaded area corresponds to the over-parameterized region. The configuration in (B) allows us to reach very high accuracy in the over-parameterized region with a dataset limited to
D. Sentiment Analysis at Ultralarge Scale
The observed operational advantage of the over-parameterized region indicates that, by further increasing the number of output modes, one can increase training effectiveness and/or performances [3]. In our 3D-PELM we reached readout channels that are independent of each other. We also considered larger input spaces, extending the vocabulary to include the whole set of words in the corpus (). Figure 3(A) shows that the test accuracy reaches a plateau as increases in the over-parameterized region. Saturation indicates that all the essential information encoded within the optical field has been extracted through the available channels. However, we observe a change in the performance as we employ more input features [Figs. 3(B) and 3(C)]. To quantify the rate at which performance improves as more channels are used for training, we estimate the angular coefficient of the linear growth that precedes the plateau. Although the onset of saturation can be estimated using diverse criteria, and varies depending on the measured range, its trend when increasing remains unaltered. We note a relevant enhancement of for , which indicates that PELMs featuring larger input spaces are able to reach optimal performances with a lower number of parameters. Figure 3(D) reports the accuracy as a function of the training-test split ratio (keeping fixed ) for and . Importantly, a limited number of examples () is enough to reach maximum accuracy in the over-parameterized region.
Figure 3.Performances at ultralarge scale. (A)–(C) Test accuracy as a function of
E. Optical Network Capacity
To establish a comparison among the various photonic neuromorphic computing devices, we introduce the optical network capacity as a useful ingredient related to the over-parameterization context. We define the capacity of a generic optical neural network as the product between the number of input and output nodes that can be processed in a single iteration, . This quantity gives direct information on the kind of large-scale problems that can be implemented on the optical setup. It depends only on the number of controllable nodes, which is the quantity that is believed to play the main role in big data processing [3]. Specifically, sets the size of the dataset that can be encoded, while reflects the embedding capacity of the network, as larger datasets necessitate larger feature spaces to be learned. Moreover, also furnishes an indication on how far the over-parameterized regime is for a given task. Useful over-parameterized conditions can be reached only if . We remark that the capacity is not a measure of the processor accuracy. It is instead a useful quantity to compare the scalability of different photonic computing devices.
In Table 1 we report the optical network capacity for various photonic processors that have been recently demonstrated. We focus on photonic platforms that exploit RC and ELM paradigms, since these devices are suitable for big data processing and may present the so-called photonic advantage at a large scale [51]. Our 3D-PELM has a record capacity , more than one order of magnitude larger than any other optical processor that has been demonstrated on supervised learning. Moreover, while increasing the capacity is challenging on many optical reservoir computers, our 3D-PELM can be scaled up further by enlarging the measured volume of the optical field. For example, the optical processor in Ref. [52], which is exploited for randomized numerical linear algebra, suggests that our scheme can reach a larger capacity by exploiting large-area cameras and multiple optical scattering.

Table 1. Maximum Network Capacity of Current Photonic Neuromorphic Computing Hardware for Supervised Learning

Working Principle                                 Machine Learning Task         Ref.
Time-multiplexed cavity       1400      7129      Regression                    [
Amplitude modulation          16,384    2000      Human action recognition      [
Frequency multiplexing        200       640       Time series recovery          [
Optical multiple scattering   50,000    64        Chaotic series prediction     [
Amplitude Fourier filtering   1024      43,263    Image classification          [
Multimode fiber               240       240       Classification, regression    [
Free-space propagation        6400      784       Classification, regression    [
3D optical field              120,000   131,044   Natural language processing   3D-PELM
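Since the capacity is defined as the product of the number of input and output nodes, it can be recomputed directly from the two node counts listed for each platform in Table 1, without needing to know which count is the input and which is the output. The short Python sketch below does this arithmetic; the variable names are illustrative.

```python
# (working principle, node count, node count) as listed in Table 1
table = [
    ("Time-multiplexed cavity", 1400, 7129),
    ("Amplitude modulation", 16384, 2000),
    ("Frequency multiplexing", 200, 640),
    ("Optical multiple scattering", 50000, 64),
    ("Amplitude Fourier filtering", 1024, 43263),
    ("Multimode fiber", 240, 240),
    ("Free-space propagation", 6400, 784),
    ("3D optical field", 120000, 131044),
]

# Capacity = product of the two node counts
capacity = {name: a * b for name, a, b in table}

best = max(capacity, key=capacity.get)
runner_up = max(v for k, v in capacity.items() if k != best)
ratio = capacity[best] / runner_up

for name, value in sorted(capacity.items(), key=lambda kv: kv[1]):
    print(f"{name:30s} {value:.3e}")
```

With these node counts the 3D optical field row gives a product of about 1.6 × 10^10, and the next largest product in the table is about 4.4 × 10^7.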
F. Comparison with Digital Machine Learning
To further validate the capability of our photonic setup for text processing, we compare the accuracy of our 3D-PELM with various general-purpose digital neural networks on the IMDb task. To underline the overall impact of the over-parameterization on the performance, we consider two opposite parameterization regimes, and , and two distinct conditions for the dataset size, [Fig. 4(A)] and [Fig. 4(B)]. Since the learning principle of the 3D-PELM is based on kernel methods (see Ref. [34]), we implement a support vector machine (SVM) and a ridge regression based on nonlinear random projections (RP) [25] as representative models of the kernel algorithms class. The input for the SVM and the RP is the tf-idf–encoded dataset, but no significant differences are found using the Hadamard dataset. We also simulate the 3D-PELM, i.e., we evaluate our photonic scheme by using Eq. (2) to generate the output data. A convolutional neural network (CNN) with the same number of trainable weights is used as an additional benchmark model. Details on the various digital models are reported in Appendix E. The results for a split ratio are shown in Fig. 4. Overall, we observe that the photonic device performs sentiment analysis on the IMDb with an accuracy comparable with standard digital methods. The SVM sets the maximum expected accuracy (0.86). The device over-parameterized accuracy (0.75 for ) is mainly limited by the finite precision of the optical components and by noise, and it could be enhanced up to 0.83 (3D-PELM numerics) by using a 64-bit camera. In fact, when operating with 8-bit precision, both the simulated 3D-PELM and the RP models achieve performance that agrees well with the experiments. Interestingly, Fig. 4(B) indicates that, for a limited set of training samples, the 3D-PELM surpasses the SVM and also the CNN. This points out an additional advantage that the photonic setup achieves thanks to over-parameterization. Not only is accuracy improved with respect to the standard under-parameterized regime, but the device also operates effectively in conditions where digital models are less accurate.
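To give a feeling for what finite camera precision means for the readout, the following Python sketch quantizes a set of intensities onto a fixed number of bits. It assumes uniform quantization normalized to the frame maximum and does not model detection noise; the function name `quantize` and the sample values are hypothetical.

```python
def quantize(values, bits=8):
    """Map non-negative intensities onto 2**bits evenly spaced levels (uniform quantizer, assumption)."""
    levels = 2 ** bits - 1
    peak = max(values)
    if peak == 0:
        return [0.0 for _ in values]
    return [round(v / peak * levels) / levels * peak for v in values]

intensities = [0.0, 0.1234567, 0.5, 1.0]
q8 = quantize(intensities, bits=8)

# Worst-case rounding error for this frame
err = max(abs(a - b) for a, b in zip(intensities, q8))
```

For a uniform quantizer the rounding error per channel is bounded by half a level, i.e., the peak intensity divided by 2(2^bits − 1).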
Figure 4.Analysis of the IMDb accuracy. (A), (B) The comparison reports the accuracy for the experimental device (3D-PELM device), the simulated device (3D-PELM numerics), the random projection method with ridge regression (RP), the support vector machine (SVM), and a convolutional neural network (CNN) in both the under-parameterized (
3. DISCUSSION
We have reported the first, to our knowledge, photonic implementation of natural language processing by performing sentiment analysis on the IMDb dataset. Our results demonstrate the feasibility of modern large-scale computing tasks in photonic hardware. They can potentially be improved in terms of performance, via low-noise high-resolution cameras and more complex encoding/training algorithms, and extended in terms of applications. Further developments include the implementation of advanced tasks such as multiclass sentiment analysis, which allows discriminating positive and negative sentences into different levels. Moreover, using different datasets such as the Stanford Sentiment Treebank (SST-5), we could train our 3D-PELM to provide ratings. On the other hand, our versatile optical setting may allow implementation of alternative strategies for photonic NLP, including dimensionality reduction of the input data via optical random projection [53]. An optical scheme that preserves the sequential nature of language may also be conceived by introducing recurrent mechanisms in our 3D-PELM. Another interesting direction is developing specific encoding schemes for representing text in the optical domain. Viable approaches include adapting bidimensional text representations such as quick response (QR) codes. Furthermore, by following the concept of word embedding used in digital NLP [54], photonic hardware may be used to learn a purely optical word embedding, including semantics, word ordering, and syntax.
In conclusion, modern machine learning architectures rely on over-parameterization to reach the global optimum of objective functions. In photonic computing, achieving over-parameterization for large-scale problems is challenging. Here, we realize a novel photonic computing device that, thanks to its huge network capacity, naturally operates in the over-parameterized region. The 3D-PELM is a very promising solution for developing a fully scalable photonic machine-learning platform. Our photonic processor enables the first link between NLP and optical computing hardware, opening a new vibrant research stream toward fast and energy-efficient artificial intelligence tools.
Acknowledgment
We thank I. MD Deen and F. Farrelly for technical support in the laboratory. D.P. and C.C. conceived the research. C.M.V. proposed application to NLP. D.P. developed the 3D photonic network. I.G. and D.P. carried out experimental measurements. C.M.V. and I.G. performed data analysis. D.P. and C.C. co-supervised the project. All authors contributed to the manuscript.
APPENDIX A: PELM FRAMEWORK
In ELMs, a dataset with data points , with , is mapped into a higher-dimensional feature space through a nonlinear transformation , yielding the hidden-layer output matrix , with . A set of weights is learned via ridge regression such that the linear combination well approximates the true label vector . An explicit solution is , where is the regularization parameter and is the identity. The use of ridge regression keeps training costs to a minimum, which is crucial in applications requiring fast and reconfigurable learning. In a free-space PELM, the nonlinear mapping of input data is realized by a combination of free-space optical propagation and intensity detection [
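The ridge-regression readout described in this appendix can be sketched in plain Python as follows. The symbols are hypothetical: `H` denotes the hidden-layer output matrix (one row per sample), `y` the label vector, and `c` the regularization parameter, and the closed form used is the standard ridge solution beta = (H^T H + c I)^(-1) H^T y, which is assumed to correspond to the explicit solution referred to above.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):                       # back substitution
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ridge_readout(H, y, c):
    """Return beta = (H^T H + c I)^(-1) H^T y for a samples-by-features matrix H."""
    n_feat = len(H[0])
    gram = [[sum(row[i] * row[j] for row in H) + (c if i == j else 0.0)
             for j in range(n_feat)] for i in range(n_feat)]
    rhs = [sum(row[i] * t for row, t in zip(H, y)) for i in range(n_feat)]
    return solve(gram, rhs)

def predict(H, beta):
    """Linear readout: one score per sample."""
    return [sum(h * b for h, b in zip(row, beta)) for row in H]

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy feature matrix
y = [1.0, 2.0, 3.0]                        # toy labels
beta = ridge_readout(H, y, c=1e-9)
scores = predict(H, beta)
```

When the number of features is much larger than the number of samples, the algebraically equivalent dual form, which inverts a samples-by-samples matrix instead, is usually cheaper.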
APPENDIX B: EXPERIMENTAL SETUP
The optical setup of the 3D-PELM is sketched in Fig.
APPENDIX C: 3D-PELM TRAINING
Training operates by loading the randomly ordered input dataset on the SLM and measuring three speckle-like intensity distributions for each input sample (paragraph). From the acquired signals, we randomly select of the available channels, and we use the corresponding intensity values to form the output dataset. This dataset is split randomly into a training and a test datasets by using a split ratio , which is kept fixed throughout the analysis. All the classification accuracies we obtained refer to this hyperparameter. The training dataset is the output matrix used for training (see Appendix
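The channel selection and dataset split described in this appendix can be sketched as follows. This is a minimal illustration with hypothetical names and toy sizes; the 0.67 training fraction is the split ratio quoted in the main text, and the random seed is arbitrary.

```python
import random

def select_and_split(outputs, labels, n_channels, train_fraction, seed=0):
    """Randomly pick readout channels, then randomly split samples into train and test sets."""
    rng = random.Random(seed)
    channels = rng.sample(range(len(outputs[0])), n_channels)   # random channel subset
    reduced = [[row[c] for c in channels] for row in outputs]
    order = list(range(len(reduced)))
    rng.shuffle(order)                                          # random sample order
    n_train = round(train_fraction * len(order))
    train_idx, test_idx = order[:n_train], order[n_train:]
    train = ([reduced[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([reduced[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test

# Toy data: 6 samples, 10 measured intensity channels each
outputs = [[float(i * 10 + j) for j in range(10)] for i in range(6)]
labels = [0, 1, 0, 1, 0, 1]
(train_X, train_y), (test_X, test_y) = select_and_split(outputs, labels, 4, 0.67)
```

The same channel subset is applied to every sample, so the training and test matrices share their columns and only differ in which rows they contain.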
APPENDIX D: NLP TASK, TEXT PRE-PROCESSING, AND OPTICAL ENCODING
We consider the IMDb dataset [
APPENDIX E: DIGITAL MACHINE LEARNING
The 3D-PELM is numerically simulated by following the optical mapping model in Eq. (
References
[1] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. A. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phanishayee, M. Zaharia. Efficient large-scale language model training on GPU clusters using Megatron-LM. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-15 (2021).
[3] A. Chatelain, A. Djeghri, D. Hesslow, J. Launay, I. Poli. Is the number of trainable parameters all that actually matters? (2021).
[4] N. C. Thompson, K. Greenewald, K. Lee, G. F. Manso. The computational limits of deep learning (2020).
[25] A. Saade, F. Caltagirone, I. Carron, L. Daudet, A. Dremeau, S. Gigan, F. Krzakala. Random projections through multiple optical scattering: approximating kernels at the speed of light. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6215-6219 (2016).
[36] D. Pierangeli, G. Marcucci, C. Conti. Neuromorphic computing device using optical shock waves. OSA Nonlinear Optics, NTh1A-3 (2021).
[44] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, C. Potts. Learning word vectors for sentiment analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 142-150 (2011).
[47] S. Mei, A. Montanari. The generalization error of random features regression: precise asymptotics and double descent curve. Commun. Pure Appl. Math., 75, 667-766 (2020).
[49] A. Ashrafi. Walsh–Hadamard transforms: a review. Advances in Imaging and Electron Physics, 1-55 (2017).
[52] D. Hesslow, A. Cappelli, I. Carron, L. Daudet, R. Lafargue, K. Müller, R. Ohana, G. Pariente, I. Poli. Photonic co-processors in HPC: using LightOn OPUs for randomized numerical linear algebra (2021).
[54] T. Mikolov, K. Chen, G. Corrado, J. Dean. Efficient estimation of word representations in vector space (2013).