An Improved Auto Categorical PSO with ML for Heart Disease Prediction

Abstract-Cardiovascular diseases are a major global health concern. They have the highest mortality rate worldwide, and the death rate increases with age, but an accurate prognosis at an early stage may increase the chances of survival. In this paper, a combined approach based on Machine Learning (ML) and an optimization method is proposed for the prediction of heart diseases. The Improved Auto Categorical Particle Swarm Optimization (IACPSO) method was used to select an optimum set of features, while ML methods were used for data classification. Three heart disease datasets were taken from the UCI ML library for testing: Cleveland, Statlog, and Hungarian. The proposed model was assessed on different performance parameters. The results indicated that, with 98% accuracy, Logistic Regression (LR) and Support Vector Machine by Grid Search (SVMGS) performed best on the Statlog dataset, SVMGS performed best on the Cleveland dataset, and LR, Random Forest (RF), Support Vector Machine (SVM), and SVMGS performed best on the Hungarian dataset with 97% accuracy. The performance parameters improved by 3 to 33% when ML was applied with IACPSO.

I. INTRODUCTION
Globally, many people suffer from heart diseases [1]. In 1990, there were 24 million fatalities related to heart disease in the United States; by 2010 that number had risen to 38 million, a 59% increase [2]. According to current forecasts, India will have the world's highest prevalence of cardiovascular diseases and will soon overtake the rest of the world [3]. Heart disorders are responsible for almost 4 million fatalities in Europe and 1.9 million deaths in the EU [4]. In Africa, heart diseases are the leading cause of death among persons over the age of 35 [5]. Massive quantities of data on heart illnesses are collected from hospitals all around the world and can be used manually to quantify disease rates. However, the data have not so far been efficiently translated into correlations with disease risk and symptoms [6]. Cardiovascular disorders are accompanied by common symptoms such as chest tightness, loss of body strength, and swollen legs [7]. Doctors usually base their diagnosis on the health history, clinical test reports, and associated symptoms. However, the results obtained this way are not always accurate, and they are costly and difficult to analyze computationally [8]. Researchers have tried to come up with an efficient technique to detect heart diseases, since the current diagnostic approaches are not very effective in identifying the early stages [9]. Different methodological approaches have reported that a combination of ML and optimization methods may be effective in predicting the early stages of heart diseases [10]. ML predictive models need appropriate data for training and testing, and the performance of ML algorithms can be increased by optimally balancing the dataset between training and testing [11].
Feature selection is capable of reducing dimensionality, increasing efficiency, and enhancing classification accuracy [12][13][14]. Data comprising many dimensions may complicate feature selection. According to [15][16][17][18], classification and clustering methods of ML were proven to be more effective in terms of accuracy rates. Several feature selection evaluation metrics were investigated in [19], to improve the computational efficiency of ML algorithms as well as to discuss the unexpected problems of feature selection. Various data mining methods, and combinations of data mining with optimization algorithms, have been proposed as a means of detecting heart diseases. Ant Colony Optimization (ACO) was applied to select an effective subset from a large training set with improved accuracy in [20]. In [21], a combination of optimization and data mining was proposed, based on Glowworm Swarm Optimization (GSO) with k-means, to improve the accuracy of image classification. For omics data classification, a Particle Swarm Optimization (PSO)-based model was developed in [22] and claimed good accuracy. In [23], a hybrid structure based on PSO and grid search was proposed to predict heart diseases with 95.95% accuracy. The authors in [24] suggested a combined approach of PSO with Support Vector Machine (SVM) and Convolutional Neural Networks (CNNs). The highest accuracy, 98%, was achieved by PSO with CNN. In [25], the performance of the swarm optimization algorithms Artificial Bee Colony (ABC), PSO, and ACO applied to an Artificial Neural Network (ANN) was studied, and PSO was determined to be the most effective. A Hybrid Genetic Algorithm (HGA) with k-means was implemented in [26] for classifying and predicting heart diseases with 94.06% accuracy. In [27], ANNs with Principal Component Analysis (PCA) and PSO were used to predict cardiac diseases, with an accuracy of 98%.
www.etasr.com Dubey et al.: An Improved Auto Categorical PSO with ML for Heart Disease Prediction
In [28], a PSO algorithm was proposed for dimension reduction, and several classification methods were used to diagnose heart diseases. Utilizing the dataset from the UCI library, Naïve Bayes (NB), Decision Tree (DT), and k-Nearest Neighbour (KNN) models were applied in [29] to predict heart diseases. In [30], PSO with a feedforward backpropagation ANN was implemented to diagnose heart diseases, and achieved 91.94% accuracy. NB with a Genetic Algorithm (GA) was used to eliminate unnecessary information in [31], achieving high accuracy. To deal with overfitting and underfitting, as well as attribute selection, a deep ANN was implemented in [32], and the achieved precision was 93.33%. Similarly, a Random Search Algorithm (RSA) for feature selection and Random Forest (RF) for classification were used in [33]. Several DT-based methods along with PSO were used in [34] to identify heart disease occurrence, and the highest precision was obtained using a bagged tree with PSO. SVM with a multiclass ML approach was applied in [35] to detect apple fruit diseases. In [36], crow search with a deep learning method was used for the prediction of Parkinson's disease with 96% accuracy. ML algorithms were used in [37] for the detection of various plant diseases. Various unsupervised learning and optimization methods were used in [38][39][40] for the prediction of heart diseases.
Most researchers used supervised and unsupervised ML methods and swarm intelligence-based optimization methods such as ACO, GSO, PSO, etc., in conjunction with ML. However, these approaches were not stable in handling real-time situations. Hence, there is a need for an automated approach that can generate the optimal solution based on the current situation. The present research work proposes a combined approach, including ML algorithms with optimization, to predict heart diseases. ML methods, such as Logistic Regression (LR), DT, SVM, SVM by Grid Search (SVMGS), RF, KNN, and NB, are used, and the Improved Auto Categorical PSO (IACPSO) method is applied for selecting an optimized set of features. The major objectives of this research are:
• For PSO, to create an automated approach to the selection of the optimal value of control parameters at each iteration.
• To analyze the impact of ML algorithms on different performance parameters in the prediction of cardiovascular diseases.
• To evaluate the combined impact of ML algorithms with optimization for heart disease prediction.
II. MATERIALS AND METHODS
Three heart disease datasets were taken from the UCI ML library for testing: Cleveland, Statlog, and Hungarian [41]. The datasets contain a total of 76 attributes, but only the 14 relevant attributes preferred in most published experiments were used [42]. The predicted attribute takes the values A and P, indicating the absence and presence of heart disease respectively. ML algorithms such as LR [43,44], DT [45,46], RF [47], KNN [48], SVM [49,50], and NB [45,48] were used for prediction and analysis. IACPSO was used to select an optimal set of features.

A. IACPSO
PSO is a population-based stochastic search and optimization technique. The particles, which are potential solutions in PSO, follow the current optimum particles through the problem space [51]. The performance of PSO depends on three sets of control parameters: the inertial weight ($w$), the acceleration coefficients ($C_1$ and $C_2$), and the random numbers ($R_1$ and $R_2$) [52]. The inertial weight balances convergence and diversity: a large value of $w$ favors global exploration, while a small value favors local exploitation. Unbalanced parameter values can hurt the results. Low values of the acceleration coefficients give smooth particle trajectories, whereas high values cause abrupt movements. Similarly, if $C_1$ is much greater than $C_2$, particles tend to wander excessively, while if $C_2$ is much greater than $C_1$, they may converge prematurely [53]. By dynamically changing $w$ and the acceleration coefficients, the search space can be explored efficiently [52]. Therefore, in the proposed IACPSO, the control parameters are updated automatically based on the number of particles and are balanced at each iteration. The steps included in IACPSO are given below:
Step I: Initialize the particles ($P_0, P_1, P_2, \ldots, P_n$) in the D-dimensional space.
Step II: Initialize the velocity $V_x^d(0)$, where $x \in \{P_0, P_1, P_2, \ldots, P_n\}$ and $d \in \{0, 1, \ldots, D\}$. Calculate the velocity of a particle at the $m$th iteration using (1), where $w$ is the inertial weight, $\phi_1 = R_1 C_1$ (local acceleration), $\phi_2 = R_2 C_2$ (global acceleration), $C_1$ and $C_2$ are the acceleration coefficients, and $R_1$ and $R_2$ are random numbers. $P_x^d$ is the position of the particle, and $L_x(m)$ and $G(m)$ are the local and global best positions. For each iteration, the control parameter values are chosen automatically on the basis of the number of swarm particles ($n$), as given in (2)-(5). For the values of $\phi_1$ and $\phi_2$, $n/8$ random numbers between 0 and 2 are generated for the selection of $C_1$ and $C_2$.
Similarly, for the value of $w$, three values between 0.4 and 0.9 are selected and categorized into L, M, and H, where L is close to 0.4, M is nearer to the mean of the other two values, and H is close to 0.9. At each iteration, the value of $w$ is selected from these categories according to a set of conditions, and the velocity is updated accordingly.
Step III: Calculate the position of the particles using (8). This gives $n/8$ positions for each particle, and then we proceed to Step IV.
Step IV: Calculate the current fitness function $\gamma(P)$. Based on the $n/8$ positions of each particle, calculate the fitness function, given in (9) and (10), and choose the best one based on minimization or maximization.
Step V: On the basis of the fitness value, update $L_x(m)$ and $G(m)$. Figure 1 shows the flow chart of the suggested methodology.

Fig. 1. Flowchart of the proposed methodology.

III. EXPERIMENTAL RESULTS AND ANALYSIS
The Cleveland, Statlog, and Hungarian heart disease datasets were considered for the evaluation of the proposed approach. Experiments were run using an x64-based processor with Windows 10 OS and an Intel (R) Core (TM) i5-7200 CPU @ 2.50GHz. The analysis and visual presentation were done using Python and Java 7. The experimental results and analysis on the heart disease datasets, based on ML algorithms (LR, SVM, DT, SVMGS, RF, NB, and KNN) and the IACPSO optimization method, are reported.
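Steps I to V of IACPSO (Section II.A) can be illustrated with a minimal sketch for a single particle. Since the update rules (1) and (8) and the selection conditions for w are not reproduced in the text, the sketch assumes the standard PSO velocity and position updates, draws C1 and C2 uniformly from 0 to 2, and obtains the L, M, H inertia categories by sorting three random values between 0.4 and 0.9. The function names (categorize_inertia, candidate_moves, best_candidate) and the sphere test function are hypothetical and are not taken from the paper.

```python
import random


def categorize_inertia(rng, low=0.4, high=0.9):
    """Draw three inertia weights in [low, high] and label them L, M, H.

    Assumption: the categories are obtained simply by sorting the draws.
    """
    l, m, h = sorted(rng.uniform(low, high) for _ in range(3))
    return {"L": l, "M": m, "H": h}


def candidate_moves(pos, vel, local_best, global_best, w, n_particles, rng):
    """Return n/8 candidate (position, velocity) pairs for one particle.

    Assumption: standard PSO updates are used,
      v' = w*v + phi1*(local_best - p) + phi2*(global_best - p),  p' = p + v'
    with phi1 = R1*C1, phi2 = R2*C2, C1 and C2 drawn from [0, 2].
    """
    k = max(1, n_particles // 8)  # at least one candidate for small swarms
    candidates = []
    for _ in range(k):
        c1, c2 = rng.uniform(0, 2), rng.uniform(0, 2)
        phi1, phi2 = rng.random() * c1, rng.random() * c2
        new_vel = [
            w * v + phi1 * (lb - p) + phi2 * (gb - p)
            for v, p, lb, gb in zip(vel, pos, local_best, global_best)
        ]
        new_pos = [p + v for p, v in zip(pos, new_vel)]
        candidates.append((new_pos, new_vel))
    return candidates


def best_candidate(candidates, fitness):
    """Keep the candidate whose position minimizes the fitness function."""
    return min(candidates, key=lambda c: fitness(c[0]))


if __name__ == "__main__":
    rng = random.Random(0)
    sphere = lambda p: sum(x * x for x in p)  # toy objective, not from the paper
    w = categorize_inertia(rng)["M"]
    moves = candidate_moves([1.0, -2.0], [0.0, 0.0], [0.5, -1.0], [0.0, 0.0], w, 16, rng)
    new_pos, new_vel = best_candidate(moves, sphere)
    print(new_pos, new_vel)
```

With a swarm of 16 particles this produces two candidate moves per particle and keeps the better one. With the population size of 12 used later in the experiments, integer division would give a single candidate, so how n/8 is rounded is a further assumption of this sketch.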

A. Result Analysis Based on ML Algorithms
We assessed the effectiveness of the ML models by using parameters such as accuracy (AC), precision (PR), Matthews Correlation Coefficient (MCC), sensitivity (SV), and F-score (FS) [11]. Table I presents the AC, PR, SV, FS, and MCC values, where PR, SV, and FS are reported for class A or class P. Here, 25% of the data were used for testing and the rest for training. For KNN, the considered values of k were 11, 17, and 15 for the Cleveland, Statlog, and Hungarian datasets respectively. According to the comparative analysis shown in Figure 2, SVMGS outperformed the other methods in all aspects and achieved an accuracy of 89% for the Cleveland and Hungarian datasets, while for the Statlog dataset, NB and LR showed a better accuracy of 91%.

Fig. 2. Comparative analysis of ML algorithms in terms of AC, average PR, and average SV.
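For reference, the five performance parameters can be computed from the counts of a binary confusion matrix. The sketch below uses the standard textbook definitions, which are assumed to match those used in [11]. The function name binary_metrics and the example counts are hypothetical and are not results from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return AC, PR, SV, FS and MCC for one class treated as positive."""
    total = tp + fp + tn + fn
    ac = (tp + tn) / total                       # accuracy
    pr = tp / (tp + fp) if tp + fp else 0.0      # precision
    sv = tp / (tp + fn) if tp + fn else 0.0      # sensitivity (recall)
    fs = 2 * pr * sv / (pr + sv) if pr + sv else 0.0  # F-score
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"AC": ac, "PR": pr, "SV": sv, "FS": fs, "MCC": mcc}


# Hypothetical counts with class P (disease present) as the positive class.
metrics_p = binary_metrics(tp=30, fp=10, tn=50, fn=10)
# The values for class A follow by swapping the roles of the two classes.
metrics_a = binary_metrics(tp=50, fp=10, tn=30, fn=10)
print(metrics_p)
print(metrics_a)
```

AC and MCC are identical for both classes, which is why only PR, SV, and FS need to be reported separately for A and P.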

B. Result Analysis Based on IACPSO with ML Methods
IACPSO was used for feature selection. The selection of features depends on their ranks: each feature takes the value 0 or 1, with 0 representing the rejection of a feature and 1 representing its selection. The number of features represents the solution size for each dataset. For the optimization process, the selected features along with the performance of the classifier were taken into account. The fitness function was:
$$\alpha C(E) + \beta \frac{|SF|}{|TF|} \quad (11)$$
where $C(E)$ is the misclassification rate, $|SF|$ is the number of selected features, $|TF|$ is the total number of features in the dataset, $\alpha \in [0, 1]$, and $\beta = 1 - \alpha$. The values of $\alpha$ and $\beta$ were taken from [54,55]. For experimentation, the parameter values were: population size = 12, number of iterations = 100, dimension = 7, $w$ = 0.4 to 0.9, and $C_1$ and $C_2$ = 0 to 2. The percentage of data based on misclassification rates (shown in Figure 3) was considered for the experiment. Based on IACPSO, only 7 of the 14 features were chosen for testing, which were thalassemia, chest pain, number of major vessels, ST depression induced by exercise relative to rest, exercise-induced angina, maximal heart rate achieved, and exercise-induced angina. Table II presents the results of the ML methods with IACPSO.

IV. DISCUSSION
The performance of ML algorithms with and without optimization was examined in this paper. An optimization algorithm (IACPSO) was used for the selection of features to improve accuracy in less time, using the optimum value of the acceleration coefficients in each iteration. In addition, these values were linked to the graded inertial weight, to balance exploration and exploitation. Table III compares the results based on ML techniques with and without optimization. Figure 5 shows the improvement rates of the performance parameters when the combined approach (ML algorithms + IACPSO) was used. The performance parameters improved by 3 to 33% when ML was applied with IACPSO.
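As a brief illustration of the fitness function (11) from Section III.B, the sketch below evaluates a binary feature mask. The value of α used in the example is hypothetical, since the text only states that α and β were taken from [54,55]. The function name feature_fitness and the example mask are likewise illustrative assumptions.

```python
def feature_fitness(mask, misclassification_rate, alpha):
    """Evaluate alpha*C(E) + beta*|SF|/|TF| with beta = 1 - alpha.

    mask: sequence of 0/1 flags, 1 meaning the feature is selected.
    misclassification_rate: C(E) of the classifier trained on the selection.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not mask:
        raise ValueError("mask must contain at least one feature")
    beta = 1.0 - alpha
    selected = sum(mask)   # |SF|
    total = len(mask)      # |TF|
    return alpha * misclassification_rate + beta * selected / total


# Hypothetical example: 7 of 14 features kept, 3% misclassification, alpha = 0.9.
mask = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0]
print(feature_fitness(mask, misclassification_rate=0.03, alpha=0.9))
```

Lower values are better under this formulation: a larger α puts more weight on classification error, while a larger β rewards smaller feature subsets.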
As per Table III and Figure 5, the major findings were as follows:
• When LR was applied separately, the achieved AC was 87% (Cleveland), 91% (Statlog), and 88% (Hungarian), but in the combined methodology of LR with IACPSO, the AC increased by 9% for Cleveland and Hungarian and 7% for Statlog.
• When the DT was applied separately, AC was 82% for Cleveland and Hungarian, and 83% for Statlog, but in the combined methodology of DT with the IACPSO, the AC increased by 10% for Cleveland and Statlog and 11% for Hungarian.
• In the case of RF with IACPSO, AC was 10%, 8%, and 11% higher for Cleveland, Statlog, and Hungarian than for separately applied RF.
• For SVM, the obtained AC with the IACPSO in Cleveland, Statlog and Hungarian was 97%, which was 10% higher than when using only SVM. There was an increase in AC of 1% in Cleveland and Statlog when IACPSO was used with SVMGS.
• For KNN with IACPSO, an increase of 16% in AC occurred for Cleveland and Statlog. A large difference was found in the improvement percentage in the value of MCC, being 33% in Cleveland, 31% in Statlog, and 27% in Hungarian.
• For NB, improvement after IACPSO was 3 to 5% in terms of AC, PR, SV, and FS.
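The LR figures in the first finding can be checked with simple arithmetic. The sketch below assumes that the reported increases are percentage points added to the baseline accuracy, a reading that reproduces the 98% (Statlog) and 97% (Hungarian) accuracies quoted for LR in the abstract. The variable names are illustrative.

```python
# LR accuracy (%) without optimization and the reported increase with IACPSO.
baseline_ac = {"Cleveland": 87, "Statlog": 91, "Hungarian": 88}
reported_gain = {"Cleveland": 9, "Statlog": 7, "Hungarian": 9}

# Assumption: gains are percentage points, so they add to the baseline.
combined_ac = {name: baseline_ac[name] + reported_gain[name] for name in baseline_ac}

# For comparison, the same gains expressed relative to the baseline.
relative_gain = {
    name: 100.0 * reported_gain[name] / baseline_ac[name] for name in baseline_ac
}

print(combined_ac)
print({name: round(value, 1) for name, value in relative_gain.items()})
```

The relative figures are slightly larger than the percentage-point figures, so the two readings should not be mixed when comparing the improvement rates across classifiers.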
For all the aspects of performance used in the current research, ML methods with IACPSO were proven to be superior. ML algorithms were used for the classification and the IACPSO method was applied for the selection of effective features. This type of combined approach gave better models for the early prediction of cardiovascular diseases.

V. CONCLUSION
In this paper, a combined approach based on ML algorithms and optimization is proposed to predict heart diseases at an early stage using the history of the patients. ML algorithms, such as DT, LR, SVM, SVMGS, NB, KNN, and RF, were used and IACPSO was used for optimization. In IACPSO, the optimum value of the control parameters was used, which helped in proper exploration and exploitation. In each iteration, n/8 solutions were generated in terms of local and global best, according to the objective function, and the best solution was used for the next iteration. The proposed approach was investigated on the Cleveland, Statlog, and Hungarian datasets, and the evaluation was performed based on AC, PR, SV, FS, and MCC. The ML algorithms were compared and assessed with and without optimization, and it was found that the algorithms with optimization were superior in all parameters. The proposed ML approach with an optimized set of features helped in predicting cardiovascular diseases and yielded better predictive results. In the future, this work can be repeated with more parameters, with real or primary datasets, and with various other threshold mechanisms for the use of attributes in detecting different diseases.