Chinese Journal of Lasers, Vol. 49, Issue 20, 2007205 (2022)
Yuan Yuan, Minghui Chen*, Shuting Ke, Teng Wang, Longxi He, Linjie Lü, Hao Sun, and Jiannan Liu
Author Affiliations
  • Shanghai Engineering Research Center of Interventional Medical, Ministry of Education of Medical Optical Engineering Center, School of Health Sciences and Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
    DOI: 10.3788/CJL202249.2007205
    Yuan Yuan, Minghui Chen, Shuting Ke, Teng Wang, Longxi He, Linjie Lü, Hao Sun, Jiannan Liu. Fundus Image Classification Research Based on Ensemble Convolutional Neural Network and Vision Transformer[J]. Chinese Journal of Lasers, 2022, 49(20): 2007205

    Abstract

    Objective

    With the increasing prevalence and blindness rate of fundus diseases, the available ophthalmologist resources are increasingly unable to meet the demand for medical examinations. Given the shortage of ophthalmic medical staff, the long waiting process for medical treatment, and the difficulty of obtaining care in remote areas, using artificial intelligence to reduce the workload of medical staff is an inevitable trend. Several studies have applied convolutional neural networks (CNNs) to the classification of fundus diseases; more recently, with the spread of Transformer models, the Vision Transformer (ViT) has shown higher performance on medical images. However, ViT models require pretraining on large datasets and are therefore limited by the high cost of medical image acquisition. Thus, this study proposes an ensemble model that combines a CNN (EfficientNetV2-S) and a Transformer model (ViT). Compared with existing advanced models, the proposed model extracts the features of fundus images in two completely different ways and achieves better classification results, with high accuracy as well as high precision and sensitivity, so it can be used to assist the diagnosis of fundus diseases. If applied in the medical auxiliary diagnosis process, this model can improve the work efficiency of ophthalmologists and effectively alleviate the difficulties in diagnosing fundus diseases caused by the shortage of ophthalmic staff, the long medical treatment process, and the difficulty of obtaining care in remote areas.

    Methods

    We propose the EfficientNet-ViT ensemble model for the classification of fundus images. This model integrates a CNN and a Transformer model, adopting EfficientNetV2-S and ViT, respectively. First, the EfficientNetV2-S and ViT models are trained separately. Then, adaptive weighted data fusion is applied so that the two types of models complement each other: the optimal weighting factors of the EfficientNetV2-S and ViT models are calculated using the adaptive weighting algorithm, and the two models are integrated into the new EfficientNet-ViT model. The calculated weighting factors are 0.6 for EfficientNetV2-S and 0.4 for ViT; the output of each model is multiplied by its weighting factor, and the two weighted outputs are summed to obtain the final prediction. According to clinical statistics, the common fundus diseases in China currently include diabetic retinopathy (DR), age-related macular degeneration (ARMD), cataract, and myopia; these diseases are the main causes of irreversible blindness in China. Thus, we classify fundus images into five categories: normal, DR, ARMD, myopia, and cataract. Furthermore, we use three indicators: accuracy, precision, and specificity. The EfficientNet-ViT ensemble model can extract the features of fundus images in two completely different ways to achieve better classification results and higher accuracy. Finally, we compare the performance indicators of this model with those of other models to verify the superiority of the ensemble model in fundus image classification.
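    The fusion step itself reduces to a weighted sum of the two models' class-probability outputs. The following Python sketch illustrates this with the weighting factors reported above (0.6 for EfficientNetV2-S, 0.4 for ViT); the function name, variable names, and toy probabilities are illustrative assumptions, not code from the paper.

```python
import numpy as np

# Class order assumed for this sketch (the five categories named in the Methods).
CLASSES = ["normal", "DR", "ARMD", "myopia", "cataract"]

def fuse_predictions(p_efficientnet, p_vit, w_efficientnet=0.6, w_vit=0.4):
    """Weighted fusion of the two models' softmax outputs.

    p_efficientnet, p_vit: arrays of shape (num_samples, num_classes)
    containing class probabilities from EfficientNetV2-S and ViT.
    The defaults 0.6 / 0.4 are the weighting factors reported above.
    """
    return w_efficientnet * np.asarray(p_efficientnet) + w_vit * np.asarray(p_vit)

# Toy probabilities for a single fundus image (illustrative values only).
p_cnn = np.array([[0.10, 0.55, 0.15, 0.10, 0.10]])  # EfficientNetV2-S output
p_vit = np.array([[0.20, 0.40, 0.20, 0.10, 0.10]])  # ViT output

fused = fuse_predictions(p_cnn, p_vit)
print(CLASSES[int(fused.argmax(axis=1)[0])])  # prints "DR"
```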

    Results and Discussions

    The accuracy of the EfficientNet-ViT ensemble model in fundus image classification reaches 92.7%, the precision reaches 88.3%, and the specificity reaches 98.1%. Compared with the EfficientNetV2-S and ViT models, the precision of the EfficientNet-ViT ensemble model improves by 0.5% and 1.6%, the accuracy improves by 0.7% and 1.9%, and the specificity increases by 0.6% and 0.9%, respectively (Table 3). Compared with ResNet50, DenseNet121, ResNeSt-101, and EfficientNet-B0, the accuracy of the EfficientNet-ViT ensemble model increases by 5.4%, 3.2%, 2.0%, and 1.4%, respectively (Table 4), showing its superiority in the fundus image classification task.

    Conclusions

    The EfficientNet-ViT ensemble model proposed in this study is a network model combining a CNN and a Transformer. The core of the CNN is the convolution kernel, which has inductive biases such as translation invariance and local sensitivity and can capture local spatio-temporal information, but it lacks a global understanding of the image. Compared with the CNN, the self-attention mechanism of the Transformer is not limited to local interactions; it can mine long-distance dependencies and can also be computed in parallel. This study uses the EfficientNetV2-S and ViT models and calculates the optimal weighting factors for the CNN and Transformer models through the adaptive weighted fusion method, so that the resulting EfficientNet-ViT model can extract image features in two completely different ways. Our experimental results show that the accuracy and precision of fundus image classification can be improved by integrating the two models. If applied in the medical auxiliary diagnosis process, this model can improve the work efficiency of ophthalmologists and effectively alleviate the difficulties in diagnosing fundus diseases caused by the shortage of ophthalmic medical staff, the long waiting process for medical treatment, and the difficulty of obtaining care in remote areas of China. When more datasets are used to train the model in the future, the accuracy, precision, and sensitivity of automatic classification may be further improved to achieve better clinical results.
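    The abstract does not detail how the adaptive weighted fusion selects the weighting factors; a minimal sketch, assuming the weight is chosen by maximizing accuracy on a held-out validation set via a grid search, could look as follows. All names, the grid-search strategy, and the synthetic demo data are assumptions for illustration.

```python
import numpy as np

def search_fusion_weight(p_cnn, p_vit, labels, step=0.1):
    """Grid-search the CNN weight w (ViT weight = 1 - w) that maximizes
    accuracy of the fused prediction on a validation set.

    p_cnn, p_vit: (num_samples, num_classes) softmax outputs on the
    validation set; labels: (num_samples,) integer ground-truth classes.
    Returns the best weight and the corresponding accuracy.
    """
    best_w, best_acc = 0.0, -1.0
    for w in np.arange(0.0, 1.0 + 1e-9, step):
        fused = w * p_cnn + (1.0 - w) * p_vit
        acc = float((fused.argmax(axis=1) == labels).mean())
        if acc > best_acc:
            best_w, best_acc = float(w), acc
    return best_w, best_acc

# Tiny synthetic demo (random probabilities, illustrative only).
rng = np.random.default_rng(0)
p1 = rng.dirichlet(np.ones(5), size=100)
p2 = rng.dirichlet(np.ones(5), size=100)
y = rng.integers(0, 5, size=100)
print(search_fusion_weight(p1, p2, y))
# With outputs like those in the paper, such a search would be expected to
# return w = 0.6 for EfficientNetV2-S (and hence 0.4 for ViT).
```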
