Efficient feature selection based on Gower distance for breast cancer diagnosis

Salwa Shakir Baawi; Mustafa Noaman Kadhim; Dhiah Al-Shammary

doi:10.1016/j.jnlest.2025.100315

Note: This section is automatically generated by AI . The website and platform operators shall not be liable for any commercial or legal consequences arising from your use of AI generated content on this website. Please be aware of this.

Abstract

This study presents an efficient feature selection method based on the Gower distance to enhance the accuracy and efficiency of standard classifiers on high-dimensional medical datasets. High-dimensional data poses significant challenges for traditional classifiers due to feature redundancy or being irrelevant. The proposed method addresses these challenges by partitioning the dataset into blocks, calculating the Gower distance within each block, and selecting features based on their average similarity. Technically, the Gower distance normalizes the absolute difference between numerical features, ensuring that each feature contributes equally to the distance calculation. This normalization prevents features with larger scales from overshadowing those with smaller scales. This process facilitates the identification of features that exhibit high harmony and are the most relevant for classification. The proposed feature selection strategy significantly reduces dimensionality, retains the most relevant features, and improves model performance. Experimental results show that the accuracy for the classifiers including k-nearest neighbors (KNN), naive Bayes (NB), decision tree (DT), random forest (RF), support vector machine (SVM), and logistic regression (LR) was increased by 4.38%–7.02%. Besides, the reduction in the feature set size contributes to a considerable decrease in computational complexity and thus faster diagnosis speed. The execution time was averagely reduced by 77.82% for all samples and 76.45% for one sample. These results demonstrate that the proposed feature selection method shows enhanced performance on both prediction accuracy and diagnostic speed, making it a promising tool for real-time clinical decision-making and improving patient care outcomes.

Keywords

Breast cancer disease classification Feature selection Gower distance Machine learning classifiers

1 Introduction

Breast cancer treatment is highly effective, particularly when the disease is detected in its early stage. Standard therapies include medication, radiation therapy, and surgical excision, which are aimed at targeting microscopic cancer cells that have spread from the primary tumor into the bloodstream. These treatments have proven to be life-saving, capable of halting the growth and spread of cancer. However, breast cancer remains a significant global health challenge. Breast cancer became the most prevalent type of cancer worldwide, with the World Health Organization (WHO) reporting 2.3 million new cases and 685000 deaths annually [1]. During the previous 5 years, the incidence of breast cancer has surged dramatically, with 7.8 million women diagnosed, making it the second most common cancer among women after lung cancer.

Several risk factors contribute to the incidence of breast cancer, including family history, excessive alcohol consumption, and postmenopausal hormone therapy. The most common symptoms are a painless lump or thickening in the breast, which should be evaluated by a doctor right once [2]. Even though most breast lumps are noncancerous and benign at 90% of the time, careful medical evaluations are necessary in order to rule out malignant tumors versus noncancerous states such as infections, fibroadenomas, and cysts [3]. Moreover, early diagnosis through various medical tests like ultrasound, mammogram, biopsy, and magnetic resonance imaging (MRI) is very crucial [4]. Among them, biopsy is still considered the golden standard diagnosis [5]. However, operations are usually time-consuming and expensive, and also rely on the availability of radiologists and doctors. To overcome this limitation, machine learning (ML) and deep learning (DL)-based algorithms have been developed to apply on computer-aided diagnosis systems for the automatic diagnosis of breast cancer [6], which have exhibited huge promise for enhancing the prediction accuracy of the conventional diagnosis methods. Typically, ML classifiers, such as the support vector machine (SVM) [7], naive Bayes (NB) [8], k-nearest neighbors (KNN) [9], random forest (RF) [10], decision tree (DT) [11], and residual networks [12], have been investigated in this domain. However, although ML is able to increase the diagnosis accuracy for breast cancer, it is still challenging for conventional classifiers when dealing with high-dimensional datasets containing superfluous or non-essential features. This may lead to the performance degradation of various classifiers, due to either overfitting or underfitting, and further complicate the derivation of meaningful patterns (high harmony features).

In high-dimensional datasets, reducing the dimensionality without compromising model accuracy is critical. However, as a typical method to improve classification accuracy, the current feature selection approaches usually fail to achieve such balance between the number of features and accuracy. They either kept too many irrelevant features or oversimplified the dataset, both leading to suboptimal performance. Therefore, more efficient and robust feature selection methods are required, especially for these capable to reduce the dimensionality and simultaneously perform with high prediction accuracy. Aiming at achieving high accuracy with a smaller number of selected features, a new feature selection approach based on the Gower distance is proposed, which evaluates features by calculating the distance between samples within blocks and determines the optimal block size and feature selection ratios by exploring different configurations. The main contributions are as follows:

• A novel feature selection method based on the Gower distance is proposed to select optimal features. The Gower distance normalizes the absolute difference between numerical features, ensuring that each feature contributes equally to the distance calculation. This normalization prevents features with larger scales from overshadowing those with smaller scales. Its performance with different block sizes (ranging from 5 to 25 samples per block) and varying feature ratios (ranging from 10% to 70%) is investigated to determine the optimal configuration, with the aim to reduce the number of selected features while maximizing the accuracy.

• The proposed feature selection method significantly enhances the accuracy of standard classifiers. Experimental results showed notable performance improvements in KNN, NB, DT, and RF classifiers, with an accuracy increase ranging from 4.38% to 7.02%. This indicates that the method is viable and efficient in capturing relevant features. Additionally, the reduction in the feature set size contributed to a significantly accelerated speed, with the execution time reduced by an average of 77.82% for all samples and 76.45% for one sample. This further highlights that the method is able to enhance both accuracy and computational efficiency.

• The proposed method attained remarkable accuracy of 99.12% with only 12 selected features, which is 40% of the original 30 features. This outperforms the recent related studies on the diagnosis of breast cancer, showcasing the efficiency and precision of our feature selection approach.

• The proposed Gower distance-based feature selection method allows medical organizations to efficiently upload patient data to a dedicated server for feature selection, significantly enhancing the accuracy and speed of ML and DL-based algorithms. This process has potential to providing more precise and faster diagnosis reports, thus benefitting for the improvement of patient care outcomes, as shown in Fig. 1.

Figure 1.Cloud-based feature selection framework using the Gower distance for enhanced breast cancer diagnosis in medical organizations.

2 Related works

Studies on ML or DL-based methods for disease diagnosis highlight the significance of feature selection in optimizing model performance. Because feature selection can help to reduce the input dimensionality, thereby enhancing efficiency and accuracy. Recently, various techniques, including statistical methods and correlation-based approaches, have been proposed, many of which have used feature selection to improve the accuracy of classifiers in detecting and classifying diseases [13–15]. Further investigations on how to improve these methods for better and more reliable feature selection are still ongoing. For instance, Minnoor and Baths [16] proposed five ML classifiers, namely, SVM, KNN, DT, RF, and multi-layer perceptron (MLP), for the diagnosis of breast cancer. In this study, two stages were included during feature selection: 16 relevant features were first roughly selected based on the Pearson correlation coefficient (PCC) and then 8 of them were identified with the help of multiple feature selection methods, namely, univariate selection, logistic regression (LR), and recursive feature elimination (RFE). Although they demonstrated that both stages could achieve high accuracy with the RF classifier, the exclusion of features with high correlation during the first stage may potentially discard relevant information. Similarly, Chen et al. [17] investigated four ML classifiers, i.e. RF, LR, KNN, and extreme gradient boosting (XGBoost), and combined PCC for feature selection, by which 16 relevant features were extracted from the original dataset. With such PCC-based feature selection, the XGBoost classifier attained 97.4% accuracy in detecting breast cancer. But the predefined PCC threshold of 0.5 may overlook nonlinear relationships and interactions, risking overfitting.

As an alternative, adopting more features and advanced techniques is also conducive to the improvement of these proposed models. Reshan et al. [18] used ensemble ML with multi-model features for breast cancer detection. By employing RFE for feature selection, they selected 4–22 most informative features from the breast cancer dataset, contributing to high accuracy and precision of breast cancer diagnosis. However, this method is unfeasible for real-world applications due to the diverse patient demographics, feature integration, and imaging modalities. Gopal et al. [19] proposed three classifiers, including LR, RF, and MLP, and adopted a correlation coefficient function for feature selection, resulting in 11 most relevant features being selected from 32 in the dataset. They found that MLP with a correlation coefficient outperformed other classifiers with 98% accuracy. However, it performed inferior with lower accuracy in some classifiers. This means that such correlation and principal component analysis (PCA)-based feature selection cannot work well with all classifiers. Uddin et al. [20] applied 11 classifiers for the diagnosis of breast cancer. With PCA as feature selection, only 16 most relevant features were selected from the 32 features in the original dataset. The proposed approach could achieve high accuracy with all classifiers. Especially for PCA with the voting classifier (LR and SVM), as high accuracy as 98.77% was obtained. While it can reduce the dimensionality and prevent overfitting, it is realized under the assumption of linear relationships, neglecting other potential complex nonlinear patterns in the feature samples, and may discard valuable features. This could limit the model’s reliability and generalizability in real-world scenarios. Ara et al. [21] applied five classifiers to diagnose whether breast cancer is malignant or benign. The correlation-based feature selection was applied to identify the most relevant features, which notably enhanced the accuracy, especially with SVM and RF classifiers (95.6%). But it also possibly omitted the nonlinear relationships between features, leading to missing some potential important information.

In addition, Khashei et al. [22] demonstrated that the discrete learning-based MLP model consistently outperformed the standard MLP across all datasets by testing on several breast cancer samples, with average accuracy of 94.70%, 6.95% larger than 88.54% of the standard MLP approach. However, the feature selection method they adopted relies on a discrete cost function, which may not effectively capture all relevant features with complex relationships in datasets. Putra et al. [8] applied correlation as feature selection with NB and SVM as classifiers for the detection of breast cancer. This feature selection method identified features with a correlation value larger than 0.6, resulting in a total of 11 highly correlated features. The results showed that SVM and NB achieved accuracy of 96.8% and 95%, respectively, in classifying breast cancer. However, the fixed correlation threshold of 0.6 may get rid of significant features and not take nonlinear relationships into account, potentially weakening the model’s robustness and generalizability.

Laghmati et al. [11] studied seven ML classifiers by using PCA for feature selection. Based on this, features with a correlation value of 0.7 or larger were identified. As a result, 9 features were selected for PCA from the original dataset. Based on this kind of PCA-based feature selection, stacking-logistic regression (S-LR) achieved accuracy of 97.37%, outperforming the remaining six classifiers. On the contrary, such feature selection process may neglect important features with weaker correlation, hindering the model’s viability and availability across diverse datasets.

As summarized above, the current existing feature selection methods mainly rely on linear relationships (as in correlation-based methods) or feature variance (as in PCA-based methods). This shadows its availability in capturing more complex, nonlinear relationships between features, leading to inferior selection performance. In this study, an improved feature selection method is proposed, which evaluates the samples within each block of every feature to capture the relationships between blocks within a feature. It makes a more nuanced assessment of feature relevance, ensuring that even the relationships beyond simple linear correlations are also considered.

3 Methodology

As shown in Fig. 2, the proposed approach to diagnosing breast cancer mainly consists of three stages: Preprocessing, feature selection based on the Gower distance, and splitting the dataset for evaluation. In the preprocessing stage, the dataset is normalized with a min-max scaler for uniform scaling of the features. This scaling technique ensures that all features are transformed into a uniform range, typically between 0 and 1, preventing any single feature from dominating the distance calculations due to its larger numerical range. During feature selection based on the Gower distance, similarities between samples within blocks are computed. Features are then ranked based on their average minimum distances, and those with the highest similarity (harmony) and relevance are selected for classification. Finally, the best subset of features is split into a training set comprising 80% and a test set comprising 20%. The performance of different ML-based algorithms, namely, KNN, NB, DT, RF, SVM, and LR, is evaluated by using the precision, accuracy, confusion matrix, F1-score, and recall, aiming at identifying the optimal model for diagnosing breast cancer.

Figure 2.Workflow of the proposed methodology.

3.1 Dataset description

The dataset utilized in this study is the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, sourced from University of California Irvine (UCI) Machine Learning Repository. Originally compiled by Dr. William H. Wolberg at University of Wisconsin Hospital, Madison, USA, this dataset has become a benchmark for breast cancer research since the early 1990s. The WDBC dataset has 569 samples including malignant samples of 212 and benign ones of 357. Each sample is represented by 30 numerical features that describe the characteristics of cell nuclei in digitized images of fine needle aspirates (FNA) of breast masses. These features include measurements such as the mean radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension, as well as their standard errors and the worst (largest) values. The target variable indicates the diagnosis, where 0 denotes benign and 1 denotes malignant. Table 1 exhibits the details of the WDBC dataset.

Parameter	Characteristic
Feature type	Real
Dataset characteristics	Multivariate
Number of cases	569
Number of healthy people	357
Number of unhealthy people	212
Number of features	30
Missing values	N\A
Classification type	Binary classification

Table 1. WDBC dataset details.

View all Tables

3.2 Feature scaling

Feature scaling is a crucial preprocessing step in both DL and ML-based methods, ensuring that all features contribute equally to the distance calculation and model performance [23]. In this study, we used the min-max scaler for feature scaling and normalized the range of the features to [0, 1], which can be formulated as

$ u_i=\frac{e_i-{\mathrm{min}} \left(e_i\right)}{{\mathrm{max}} \left(e_i\right)-{\mathrm{min}} \left(e_i\right)} $ (1)

where $ e_{i} $ is the original sample value, $ {\mathrm{min}} \left(e_{i}\right) $ is the minimum value of the sample in the dataset, $ {\mathrm{max}} \left(e_{i}\right) $ is the maximum value of the sample in the dataset, and $ u_{i} $ is the scaled sample value in the range of [0, 1].

This min-max scaler is especially advantageous to the ML and DL related models. Scaling all features to a uniform range of [0, 1] ensures that no single feature dominates the distance calculation due to its larger numerical range. This uniformity in feature contribution is quite necessary in the models that rely upon the distance calculation, such as KNN. It contributes to faster and more stable training of models. However, the min-max scaler also has its limitations. For example, it is highly sensitive to the range of data. If the dataset contains outliers, the minimum and maximum values can be significantly affected, leading to a distorted scaling of the feature values. Besides, if the ranges of different features overlap, the inherent differences between features may be masked. Fig. 3 illustrates the normalization stage using the min-max scaler, demonstrating how the original feature values are transformed into the range of [0, 1] to ensure uniformity across all features.

Figure 3.Normalization stage with min-max scaler: (a) before and (b) after min-max normalization.

3.3 Feature selection based on the Gower distance

Selecting informative features is crucial in ML, as it plays a critical role in enhancing model performance, reducing overfitting, and speeding up the training process. Identifying and retaining only the most significant features makes our models efficient and effective. In this paper, a novel feature selection method based on the Gower distance is introduced. As we all know, the Gower distance is appropriate for datasets that contain mixed-type attributes, both numerical and categorical [24]. It also normalizes the absolute difference between numerical features, ensuring that each feature contributes equally to the distance calculation. This property makes it capture a feature that has high harmony based on the measure of the similarity between blocks in the feature. Using this distance metric, we can identify the features that capture most of the variations within the dataset; this will retain the most informative attributes. The proposed feature selection approach works as follows.

Let $ X $ represent the dataset with $ N $ samples and $ P $ features. The feature ${x_k}$ (for $k = 1{\mathrm{,}}\;2{\mathrm{,}}\;3{\mathrm{,}}\; \cdots {\mathrm{,}}\;P$) is divided into blocks of size $ B $, where $ B $ is one of the predefined block sizes (5, 10, 15, 20, or 25). The division process can be expressed as

$ {x_k} = \left\{ {{x_1}{\mathrm{,}}\;{x_2}{\mathrm{,}}\; \cdots {\mathrm{,}}\;{x_B}} \right\} $ (2)

where each $ {x}_{B} $ represents a block containing consecutive samples of feature ${x_k}$. For example, if $ B=25 $, then each block will contain 25 consecutive samples from the feature ${x_k}$.

Within each block, the Gower distance $ {G}_{\mathrm{d}\mathrm{i}\mathrm{s}} $ is calculated for each pair of samples across all features, as

$ {G_{{\text{dis}}}} = \frac{{\displaystyle\sum\limits_{k = 1}^P {\displaystyle\sum\limits_{m = 1}^{N - 1} {\displaystyle\sum\limits_{l = m + 1}^N {\left| {x_m^k - x_l^k} \right|} } } }}{{\displaystyle\sum\limits_{k = 1}^P {\left( {{\mathrm{max}} ({x_k}) - {\mathrm{min}} ({x_k})} \right)} }} $ (3)

where $\left| {x_m^k - x_l^k} \right|$ is the absolute difference between the samples $m$ and $l$ for feature ${x_k}$, and ${\mathrm{max}} ({x_k})$ and ${\mathrm{min}} ({x_k})$ are the maximum and minimum values of feature ${x_k}$ across all samples in the dataset.

The average Gower distance $ {G}_{B} $ is computed for each block by averaging the distances of all pairs of samples within the block as (4). This step aggregates the information within each block to provide a summarized measure of similarity.

$ G_B=\frac{1}{N_B\left(N_B-1\right)} \sum_{m=1}^{N_B-1} \sum_{l=m+1}^{N_B} G_{\mathrm{dis}} \left(x_m^k-x_l^k\right) $ (4)

where $ {N}_{B} $ is the number of samples within a block.

For each feature, we first compute the average Gower distance within each block (as detailed in (4)). Then, we calculate the overall average Gower distance for the feature by taking the mean of these block averages by using (5), which indicates the typical dissimilarity or similarity within the blocks for that feature.

$ G_{{k}}=\frac{1}{B_{{\mathrm{max}} }} \sum_{b=1}^{B_{\mathrm{max}}} G_{b}^{k} $ (5)

where $ {G}_{b}^{k} $ is the overall average Gower distance for feature ${x_k}$ and $ {B}_{\mathrm{m}\mathrm{a}\mathrm{x}} $ is the total number of blocks for this feature. Equation (5) aggregates the average Gower distance from each block to give a single metric ${G_k}$ for the entire feature.

Features are ranked based on their average block distances. The features with shorter average distances are considered to have higher similarity within the blocks, indicating that they are more consistent and potentially more informative for classification. Conversely, features with longer average distances are considered to have higher dissimilarity, which means that they may provide more diverse or varied information. The ranking of features can be formulated as

$ {\text{Rank}}({x_k}) = {\text{sort}} \left( {{G_k}{\mathrm{,}}{\text{ ascending}}} \right) $ (6)

Thus, the features with shorter Gower distances are ranked higher because they exhibit higher similarity within their blocks, that is to say, they are more likely to capture the most relevant and consistent patterns within the data.

3.4 Evaluation metrics

Evaluation metrics are essential in assessing the performance of ML-based models and algorithms. In this study, several metrics are employed to evaluate the effectiveness of our proposed feature selection approach based on the Gower distance. The model’s predictions are carefully distinguished by the confusion matrix to effectively classify true positive (TP) and true negative (TN) cases [25]. As shown in Fig. 4, it also finds false positive (FP) and false negative (FN) cases.

Figure 4.Confusion matrix.

The other metrics used to evaluate the performance of breast cancer classification include precision, accuracy, recall, and F1-score. Precision measures the reliability of the model’s predictions by comparing correctly detected benign cases (true negatives) to all predicted benign cases, as shown in (7). Accuracy computes the percentage of correctly identified benign and malignant cases out of all cases, as represented in (8). Recall assesses the ability of the model to correctly identify benign cases from all actual benign cases, as shown in (9). The F1-score is a harmonic mean that balances precision with recall and provides a comprehensive rating of the model’s performance, as illustrated in (10).

$ \text { Precision }=\frac{{\mathrm{T}} {\mathrm{P}}}{{\mathrm{T}} {\mathrm{P}}+{\mathrm{F}} {\mathrm{P}}} $ (7)

$ \text { Accuracy }=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TN}+\mathrm{TP}+\mathrm{FN}+\mathrm{FP}} $ (8)

$ \text { Recall }=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} $ (9)

$ \text { F1-score }=2 \times \frac{\text { Recall } \times \text { Precision }}{\text { Precision }+ \text { Recall }} $ (10)

4 Experimental results and discussion

This section discusses two scenarios for breast cancer diagnosis. In the first scenario, standard classifiers are evaluated without feature selection. The second scenario assesses their performance after the proposed feature selection method based on the Gower distance is introduced. Additionally, the time consumption of traditional classifiers is analyzed with and without the proposed feature selection. Finally, the effectiveness of the proposed feature selection method is compared with recent studies. Table 2 shows the system configuration used in this study.

Specification	Description
RAM	16 GB
Processor	6th Gen Intel® Core™ i7
Development environment	Visual Studio Code
Software environment	Python 3.12

Table 2. System environment and setup.

View all Tables

4.1 Performance analysis of classifiers without the proposed feature selection method

By calculating the metrics of F1-score, accuracy, recall, and precision, the performance of standard classifiers without feature selection on the classification of breast cancer is evaluated, and the results are shown in Table 3.

Classifier	Metrics
Classifier	Recall (%)	Precision (%)	F1-score (%)	Accuracy (%)
KNN	92.10	97.22	94.60	92.98
NB	90.54	93.06	91.78	89.47
DT	89.61	95.83	92.62	90.35
RF	97.10	93.06	95.04	93.86
SVM	89.74	97.22	93.33	91.23
LR	92.00	95.83	93.88	92.10

Table 3. Performance evaluation of traditional classifiers without feature selection.

View all Tables

For the classification of breast cancer without feature selection, the RF classifier emerged as the most effective method, with the highest overall performance metrics. It achieves a recall value of 97.10%, precision of 93.06%, F1-score of 95.04%, and accuracy of 93.86%. The KNN classifier also performs excellent with a recall value of 92.10%, precision of 97.22%, F1-score of 94.60%, and accuracy of 92.98%. Both of them have thus demonstrated high accuracy and consistency when all the metrics were taken into consideration. The DT classifier relatively performs well, having a recall value of 89.61%, precision of 95.83%, F1-score of 92.62%, and accuracy of 90.35% in its classification. The NB classifier, while slightly inferior to RF and KNN, still achieves commendable results with a recall value of 90.54%, precision of 93.06%, F1-score of 91.78%, and accuracy of 89.47%. The SVM classifier shows a recall value of 89.74%, precision of 97.22%, F1-score of 93.33%, and accuracy of 91.23%. Although its recall and accuracy were slightly lower than those of RF and KNN, its higher precision indicates strong performance in minimizing false positives. The LR classifier performs equivalently to the KNN classifier, with a recall value of 92.00%, precision of 95.83%, F1-score of 93.88%, and accuracy of 92.10%. While its precision is slightly lower than SVM, LR maintains a good balance across all metrics, with particularly superior recall and accuracy.

The corresponding confusion matrices of KNN, NB, DT, and RF classifiers without feature selection are shown in Fig. 5. Amongst, the RF classifier stands out as the most effective. Its numbers of both true positives and true negatives are maximum. Meanwhile, its numbers of false positives and false negatives are minimum. RF shows high precision and recall, excelling at both identifying malignant cases and accurately classifying benign ones. Despite KNN also achieves a good result of true positives (70) and true negatives (36), its false negatives (6) is slightly higher than that of RF. This means that KNN might miss some malignant cases, hence its recall is not that high. The NB classifier performs well but has more difficulty in identifying malignant cases, as indicated by its larger false negative count of 7. The DT classifier exhibits NB-like results with 69 true positives and 34 true negatives. However, it has more false negatives, so its recall is a little lower compared to those of RF and KNN. The SVM classifier has yielded 70 true positives with 34 true negatives, thus keeping relatively less numbers of false positives (2) and false negatives (8). Similarly, the LR classifier performs with 69 true positives versus 36 true negatives, and it has fewer false negatives of 6. Overall, RF stands out as the top-performing classifier, closely followed by KNN, while NB and DT lag slightly behind, particularly in recall. SVM and LR also perform well, but there exists a slight trade-off between their precision and recall. These results show that while all classifiers are effective in classifying breast cancer, RF performs the best.

Figure 5.Confusion matrices of standard classifiers without feature selection: (a) KNN, (b) NB, (c) SVM, (d) DT, (e) RF, and (f) LR.

4.2 Performance analysis of classifiers with the proposed feature selection method

As demonstrated by the previous works, feature selection techniques could benefit to alleviate the noise interference and further enhance the classifiers’ ability to identify relevant features, thus contributing to even more accurate and reliable results. Therefore, we further evaluate the performance of standard classifiers, including KNN, NB, DT, RF, SVM, and LR, with feature selection based on the Gower distance. The simulated results of accuracy with different feature ratios (10% to 70%) and block sizes (5, 10, 15, 20, and 25 samples per block) are summarized in Table 4.

Block size	Classifier	Feature ratio from the dataset
Block size	Classifier	10%	20%	30%	40%	50%	60%	70%
5 samples per block	KNN	87.71	89.08	93.22	98.24	90.35	92.10	91.22
	NB	85.96	87.08	91.01	96.49	91.22	88.59	90.35
	DT	82.45	89.47	92.21	97.36	93.85	94.73	91.22
	RF	88.59	92.10	94.73	99.12	94.73	95.61	95.61
	SVM	83.34	84.56	92.91	95.61	94.76	92.30	90.74
	LR	85.95	88.40	91.50	97.36	90.89	87.89	90.21
10 samples per block	KNN	87.71	85.08	91.22	89.47	91.22	92.10	91.22
	NB	85.96	85.98	84.21	88.59	89.47	90.33	90.35
	DT	80.70	88.59	82.45	92.98	88.59	92.98	93.85
	RF	89.47	92.10	92.10	94.73	95.61	95.61	94.73
	SVM	84.65	92.40	88.40	93.94	92.20	88.90	86.30
	LR	89.20	90.80	92.02	92.98	94.34	94.87	93.33
15 samples per block	KNN	88.21	85.08	89.47	87.71	91.22	91.22	92.11
	NB	86.96	85.08	85.96	85.08	90.08	90.35	93.21
	DT	81.70	88.59	89.47	89.47	92.98	92.98	94.73
	RF	88.47	92.98	90.35	92.10	94.73	94.73	95.61
	SVM	88.90	90.32	89.02	85.41	89.89	90.87	94.67
	LR	90.21	92.56	90.20	91.21	93.43	94.67	96.89
20 samples per block	KNN	89.31	85.08	92.22	87.71	93.12	92.10	93.33
	NB	86.46	85.99	88.21	85.08	92.47	88.59	90.35
	DT	83.40	88.59	89.33	88.59	91.22	92.98	91.22
	RF	91.22	92.10	93.85	92.98	94.73	93.85	95.61
	SVM	89.20	87.60	89.80	90.65	91.30	87.80	89.50
	LR	84.50	90.40	92.31	91.83	92.50	91.12	90.32
25 samples per block	KNN	87.71	85.08	89.47	87.71	91.22	92.10	91.22
	NB	85.96	85.08	85.96	85.08	89.47	90.11	90.35
	DT	83.33	87.71	91.22	88.59	90.35	93.85	92.98
	RF	90.35	92.98	92.98	92.10	94.73	95.61	94.73
	SVM	84.20	87.60	89.54	89.60	90.30	90.59	90.74
	LR	88.80	91.50	90.89	91.23	93.98	94.43	93.20

Table 4. Accuracy results of classifiers and its impact on performance with varying feature ratios and block sizes.

View all Tables

As shown in Table 4, it is obvious that the accuracy of all the classifiers has been increased by introducing the proposed feature selection method based on the Gower distance. All of them perform the best under the condition of 40% feature ratio and 5 samples per block, with the largest accuracy of 98.24%, 96.49%, 97.36%, 99.12%, 95.61%, and 97.36% for KNN, NB, DT, RF, SVM, and LR, respectively. For visualization, we compared them with the obtained results of those without feature selection, as shown in Fig. 6. The accuracy values of classifiers after introducing our proposed feature selection method are always higher than those before. In details, KNN, RF, and LR have been improved by 5.26%; more obvious enhancements of 7.02% and 7.01% are observed in NB and DT, respectively; comparatively, SVM shows a slightly inferior increase of 4.38%. Such enhancements demonstrate the superiority of the proposed feature selection method to identify the most relevant features, which is especially desirable to realize more robust and accurate models. In addition, Table 4 also indicates that the performance of classifiers is sensitive to feature ratio and block size. Amongst, RF always performs the best with the highest accuracy of 99.12% at a 40% feature ratio with 5 samples per block and relatively high accuracy of >90% under almost all the other conditions. Such robust performance demonstrates the feasibility and effectiveness of feature selection to enhance diagnostic precision, making RF the most reliable classifier in this study. As a comparison, for KNN, the accuracy fluctuates significantly, ranging from 85.08% to 93.33% with different feature ratios and block sizes. NB exhibits a steady but moderate improvement after introducing feature selection, with accuracy varying between 84.21% and 93.21%. Significant increases are observed in DT with relatively high accuracy of >90% achieved under half of these investigated conditions. The accuracy of SVM varies with different conditions between 84.20% and 94.67%, which is lower than that of RF or KNN. It is also obvious that SVM is more sensitive to the feature ratio and block size. Even so, SVM still remains a competitive model and benefits from the selection of relevant features. LR performs well with as high accuracy as >90% under most of all conditions. However, its performance is slightly inferior to both RF and KNN, especially when the feature ratio is increased to 60%. Despite this, LR achieves an accuracy value high to 97.36%, demonstrating its strong ability to adapt to the selected features. This also indicates its potential as a highly effective and reliable model for breast cancer classification.

Figure 6.Accuracy comparison of classifiers with and without the proposed feature selection method based on the Gower distance.

Fig. 7 presents the corresponding confusion matrices of these classifiers with a block size of 5 samples and a feature ratio of 40%, which has the highest accuracy as shown in Table 4. Compared to classification without feature selection, the refined feature subset enhances the classifiers’ ability to distinguish malignant from benign cases, as evident from the larger of true positive and true negative cases. The reduction in false positive and false negative cases further highlights the effectiveness of the Gower distance-based feature selection in optimizing classifier performance. For instance, with the KNN classifier, the number of correct benign classifications increases from 70 to 72, and that of correct malignant classifications rises from 36 to 40. Furthermore, the number of false positive cases drops from 2 to 0, and that of false negative cases decreases from 6 to 2. Similar improvements are observed in the other classifiers, demonstrating the positive effect of feature selection on classifier accuracy and reliability. Such enhancement is primarily attributed to the proposed feature selection method’s ability to eliminate irrelevant and redundant features, enabling the classifiers to focus on the most relevant information for accurate predictions.

Figure 7.Confusion matrices of standard classifiers with the proposed feature selection method based on the Gower distance: (a) KNN, (b) NB, (c) SVM, (d) DT, (e) RF, and (f) LR.

4.3 Analysis of time consumption with and without the proposed feature selection method

To investigate the influence of the proposed feature selection method on the diagnosis speed, the execution time of classifiers with and without feature selection is measured. Here two scenarios are studied: One considering all samples contained in the test set and another only considering one sample.

The measured results are summarized and compared in Fig. 8. It shows that the execution time of classifiers with feature selection is always shorter than the corresponding one without feature selection. This is because the proposed feature selection method identifies and retains relevant features while removing irrelevant ones, thus significantly simplifying the computation process. During the experiment, an optimal block size of 5 and feature ratio of 40% are applied; as a result, the number of features is reduced to 12 from 30. This reduction in features allows the classifiers to focus on the most relevant attributes, which contributes to the acceleration of the classification process with no sacrifice of diagnostic accuracy. This property is particularly advantageous for classifiers that rely on distance measurement between samples, such as KNN and SVM, where the distance computation is time-consuming. For instance, 239.8212 ms is required for KNN to process all samples when no feature selection is introduced, while only 16.7887 ms is needed after introducing the proposed feature selection method. Similarly, with regards to SVM, this value is decreased to 2.9922 ms from 9.9649 ms. In terms of the scenarios of one sample, the improvement in time consumption is even more pronounced. KNN reduces its execution time from 2.1037 ms to 0.1473 ms, while that of NB drops from 0.0880 ms to 0.0088 ms. DT and LR also show significantly accelerated processing, with the execution time decreasing from 0.0087 ms to 0.0021 ms and 0.0063 ms to 0.0019 ms, respectively. All these results demonstrate that the proposed feature selection method not only can improve the classifier efficiency but also can enhance computation performance, making it suitable for real-time applications. Such a method is especially valuable in time-sensitive fields like healthcare, where rapid and reliable diagnostics can greatly impact patient outcomes.

Figure 8.Execution time of classifiers with and without the proposed feature selection method.

4.4 Comparison with other strategies

By comparing the accuracy and execution time of classifiers with and without feature selection, it has been demonstrated that introducing the proposed feature selection method based on the Gower distance is beneficial to enhance the performance of classifiers in breast cancer diagnosis. To further demonstrate its superiority, it is also compared with the existing state-of-the-art techniques [17,20,26–30], as shown in Table 5. Notably, the classifiers with the proposed feature selection method based on the Gower distance also perform better than almost all of the corresponding ones with other feature selection methods. For instance, KNNs with feature selection based on PCC, Chi-square, and elephant herding optimization (EHO) achieve accuracy levels of 91.2%, 96%, and 97.96%, respectively, which are 7.04%, 2.24%, and 0.28% lower than ours (98.24%). Compared with the PCA-based ones, KNN, NB, DT, and RF with our Gower distance-based method are also superior, having an accuracy increase of 1.40%, 5.45%, 3.16%, and 4.92%, respectively. It is also worthy to note that RF not only performs the best among all the studied classifiers with feature selection based on the Gower distance, its accuracy is also the highest in comparison with other competitors in Table 5. The proposed method is also equipped with exceptional computation efficiency in terms of execution time. For example, with eagle strategy optimization (ESO)-based feature selection, Singh et al. [29] realized a high-performance RF classifier, having a nearly equivalent accuracy level to ours. However, due to the larger number of selected features, as long as 4 s execution time is required by Singh et al.’s method, which is far longer than 0.0583 ms by our RF. Moreover, our DT and LR classifiers are faster, with execution time as low as ~2 μs. Such speedy methods are suitable and exactly desirable for real-time clinical settings in practical applications.

Year	Reference	Feature selection	Number of selected features	Classifier	Accuracy (%)	Execution time
2022	[26]	Univariate and recursive	16	Deep extreme gradient descent optimization	98.73	N/A
2023	[17]	PCC	15	KNN	91.2	N/A
				RF	96.5
				LR	94.7
				XGBoost	97.4
2023	[20]	PCA	16	SVM	98.07	N/A
				DT	94.20
				KNN	96.84
				RF	94.20
				MLP	97.54
				NB	91.04
				LR	98.42
				LR+SVM	98.77
2023	[27]	Chi-square	15	MLP	95	N/A
				LR	92
				KNN	96
				SVM	92
				RF	94
2023	[28]	Gorilla troops optimization (GTO)	30	Deep Q learning (DQL)	98.88	N/A
2023	[29]	ESO	12	RF	98.95	4 s
2024	[30]	EHO	18	KNN	97.96	N/A
2024	Proposed in this paper	Gower distance	12	KNN	98.24	0.1472 ms
				NB	96.49	0.0087 ms
				DT	97.36	0.0021 ms
				RF	99.12	0.0583 ms
				SVM	95.61	0.0262 ms
				LR	97.36	0.0018 ms

Table 5. Performance comparison with recently related works on the WDBC dataset.

View all Tables

The above results indicate that this feature reduction by the proposed feature selection method not only boosts accuracy but also significantly decreases computation time and resource usage. Because fewer features mean less data to process, which accelerates both training and testing phases. It renders the method highly efficient and practical for real-world applications, which is crucial in clinical settings where quick and accurate diagnostics are essential. It means that the proposed method could offer a potential alternative for high-efficient classification of breast cancer.

4.5 Limitations of the proposed method

Despite the proposed feature selection method based on the Gower distance demonstrates significant improvements in the accuracy and efficiency of standard classifiers, the following limitations are still needing to be considered:

• The computational cost associated with partitioning the dataset into blocks and calculating the Gower distance is high, which becomes more challenging as the dataset size increases. Moreover, the determination of the optimal block size to balance accuracy and the number of features is difficult and requires additional experimentation. To address these issues, future work could incorporate parallelization or more efficient algorithms to calculate the Gower distance. For example, the dimensionality reduction techniques, such as PCA, is combined in the first stage to reduce the dataset size, whereas the Gower distance-based feature selection is applied in the second stage to maintain the model performance.

• It is still challenging to process large, high-dimensional data with the proposed method. Even though an optimal subset of features is estimated by using the proposed feature selection technique, the remaining features can still be plentiful, which may have a negative influence on the model performance. To overcome this limitation, additional feature reduction, i.e., multi-stage feature selection, is needed. In the future, the following two stages can be considered: The proposed feature selection is performed in the first stage, and a metaheuristic algorithm, such as particle swarm optimization (PSO), is applied in the second stage to further optimize the selected subset.

• The proposed approach is highly sensitive to the quality of the input dataset. Noise, unscaled, or incoherent data would deteriorate the credibility of feature selection. It heavily relies on proper data preprocessing, where inaccurate or inconsistent data and challenges in determining an appropriate scaling range can lead to incorrect feature selection, diminishing model performance and accuracy.

5 Conclusions

Traditional classifiers are facing several challenges when processing high-dimensional data, including high dimensionality, increased computational complexity, and feature redundancy and irrelevance. In this paper, a novel feature selection method based on the Gower distance is proposed to improve the performance of standard classifiers on high-dimensional medical datasets. By dividing the dataset into blocks, calculating the Gower distance within each block, and selecting features based on their average block distances, the dimensionality is significantly reduced with only 40% of the original features (12 out of 30) selected on the WDBC dataset. With only the most relevant features maintained, experimental results demonstrated substantial accuracy improvements across all classifiers: KNN improves its value from 92.98% to 98.24%, NB from 89.47% to 96.49%, DT from 90.35% to 97.36%, RF from 93.86% to 99.12%, SVM from 91.23% to 95.61%, and LR from 92.10% to 97.36%. In addition, the proposed feature selection method also significantly enhances the computation speed and decreases the diagnosis time. This is because the reduction in features leads to a considerable decrease in computation complexity, with execution time reduced by up to 89% for certain classifiers. This makes the method not only more accurate but also faster, which is critical for real-time clinical decision-making. The significant enhancement in model accuracy underscores the method’s potential for advancing prediction models, particularly in complex and high-dimensional datasets like those in the medical field. It can be concluded that our Gower distance-based feature selection method is a powerful tool for enhancing ML-based models in high-dimensional contexts. Future work could integrate this feature selection method with DL-based techniques and apply it to various datasets to further boost performance.

Disclosures

The authors declare no conflicts of interest.

References

[1] WHO, Breast Cancer [Online]. Available, https:www.who.intnewsroomfactsheetsdetailbreastcancer, November 2021.

[2] Wang H.-Y., Feng J., Bu Q.-R. et al. Breast mass detection in digital mammogram based on Gestalt psychology. J. Healthc. Eng., 2018, 4015613(2018).

[3] Valvano G., Santini G., Martini N. et al. Convolutional neural networks for the segmentation of microcalcification in mammography imaging. J. Healthc. Eng., 2019, 9360941(2019).

[4] Biswas N., Uddin K.M.M., Rikta S.T., Dey S.K.. A comparative analysis of machine learning classifiers for stroke prediction: a predictive analytics approach. Healthc. Anal., 2, 100116(2022).

[5] M. Gupta, B. Gupta, A comparative study of breast cancer diagnosis using supervised machine learning techniques, in: Proc. of the 2nd Intl. Conf. on Computing Methodologies Communication, Erode, India, 2018, pp. 997–1002.

[6] Thawkar S., Katta V., Parashar A.R., Singh L.K., Khanna M.. Breast cancer: a hybrid method for feature selection and classification in digital mammography. Int. J. Imag. Syst. Tech., 33, 1696-1712(2023).

[7] P. Dinesh, A.S. Vickram, P. Kalyanasundaram, Medical image prediction f diagnosis of breast cancer disease comparing the machine learning algithms: SVM, KNN, logistic regression, rom fest decision tree to measure accuracy, AIP Conf. Proc. 2853 (1) (2024) 020140.

[8] Putra L.G.R., Marzuki K., Hairani H.. Correlation-based feature selection and Smote-Tomek Link to improve the performance of machine learning methods on cancer disease prediction. Eng. Appl. Sci. Res., 50, 577-583(2023).

[9] Maheswari B.U., Guhan T., Britto C.F., Sheeba A., Rajakumar M.P., Pratyush K.. Performance analysis of classifying the breast cancer images using KNN and naive Bayes classifier. AIP Conf. Proc., 2831, 020012(2023).

[10] Hassan M.M., Hassan M.M., Yasmin F. et al. A comparative assessment of machine learning algorithms with the least absolute shrinkage and selection operator for breast cancer detection and prediction. Decis. Anal. J., 7, 100245(2023).

[11] Laghmati S., Hamida S., Hicham K., Cherradi B., Tmiri A.. An improved breast cancer disease prediction system using ML and PCA. Multimed. Tools Appl., 83, 33785-33821(2024).

[12] Atban F., Ekinci E., Garip Z.. Traditional machine learning algorithms for breast cancer image classification with optimized deep features. Biomed. Signal Proces., 81, 104534(2023).

[13] Singh L.K., Khanna M., Singh R.. An enhanced soft-computing based strategy for efficient feature selection for timely breast cancer prediction: Wisconsin Diagnostic Breast Cancer dataset case. Multimed. Tools Appl., 83, 76607-76672(2024).

[14] Alsaeedi A.H., Al-Mahmood H.H.R., Alnaseri Z.F. et al. Fractal feature selection model for enhancing high-dimensional biological problems. BMC Bioinf., 25, 12(2024).

[15] Singh L.K., Khanna M., Singh R.. Efficient feature selection for breast cancer classification using soft computing approach: a novel clinical decision support system. Multimed. Tools Appl., 83, 43223-43276(2024).

[16] Minnoor M., Baths V.. Diagnosis of breast cancer using random forests. Procedia Comput. Sci., 218, 429-437(2023).

[17] Chen H., Wang N., Du X.-P., Mei K.-H., Zhou Y., Cai G.-X.. Classification prediction of breast cancer based on machine learning. Comput. Intel. Neurosc., 2023, 6530719(2023).

[18] Al Reshan M.S., Amin S., Zeb M.A. et al. Enhancing breast cancer detection and classification using advanced multi-model features and ensemble machine learning techniques. Life, 13, 2093(2023).

[19] Gopal V.N., Al-Turjman F., Kumar R., Anand L., Rajesh M.. Feature selection and classification in breast cancer prediction using IoT and machine learning. Measurement, 178, 109442(2021).

[20] Uddin K.M.M., Biswas N., Rikta S.T., Dey S.K.. Machine learning-based diagnosis of breast cancer utilizing feature optimization technique. Comput. Methods Prog. Biomed. Update, 3, 100098(2023).

[21] S. Ara, A. Das, A. Dey, Malignant benign breast cancer classification using machine learning algithms, in: Proc. of the Intl. Conf. on Artificial Intelligence, Islamabad, Pakistan, 2021, pp. 97–101.

[22] Khashei M., Bakhtiarvand N.. A novel discrete learning-based intelligent methodology for breast cancer classification purposes. Artif. Intell. Med., 139, 102492(2023).

[23] Al-Shammary D., Kadhim M.N., Mahdi A.M., Ibaida A., Ahmed K.. Efficient ECG classification based on Chi-square distance for arrhythmia detection. J. Elect. Sci. Technol., 22, 100249(2024).

[24] Kadhim M.N., Al-Shammary D., Sufi F.. A novel voice classification based on Gower distance for Parkinson disease detection. Int. J. Med. Inform., 191, 105583(2024).

[25] Sadiq M., Kadhim M.N., Al-Shammary D., Milanova M.. Novel EEG feature selection based on Hellinger distance for epileptic seizure detection. Smart Health, 35, 100536(2025).

[26] Khan M.B.S., Atta-Ur-rahman, Nawaz M.S., Ahmed R., Khan M.A., Mosavi A.. Intelligent breast cancer diagnostic system empowered by deep extreme gradient descent optimization. Math. Biosci. Eng., 19, 7978-8002(2022).

[27] Shafique R., Rustam F., Choi G.S. et al. Breast cancer prediction using fine needle aspiration features and upsampling with supervised machine learning. Cancers, 15, 681(2023).

[28] Almutairi S., Manimurugan S., Kim B.G., Aborokbah M.M., Narmatha C.. Breast cancer classification using deep Q learning (DQL) and gorilla troops optimization (GTO). Appl. Soft Comput., 142, 110292(2023).

[29] Singh L.K., Khanna M., Singh R.. Artificial intelligence based medical decision support system for early and accurate breast cancer prediction. Adv. Eng. Softw., 175, 103338(2023).

[30] Khanna M., Singh L.K., Shrivastava K., Singh R.. An enhanced and efficient approach for feature selection for chronic human disease prediction: a breast cancer study. Heliyon, 10, e26799(2024).

微信扫一扫：分享

微信扫一扫：分享