Performance Comparison of Ensemble Learning and Supervised Algorithms in Classifying Multi-label Network Traffic Flow

Network traffic classification is of significant importance. It helps identify network anomalies and assists in taking measures to avoid them. However, classifying network traffic correctly is a challenging task. This study compares ensemble learning methods with normal supervised classification in order to arrive at improved classification methods. Three types of network traffic were classified (Benign, Malicious, and Outliers). The data were collected experimentally by using Paessler Router Traffic Grapher software and from an online source, and were analyzed in R software. The datasets were used to train five supervised models (k-nearest neighbors, mixture discriminant analysis, Naïve Bayes, C5.0 classification model, and regularized discriminant analysis). The models were trained on 70% of the samples and the remaining 30% were used for validation. The same samples were used separately in predicting individual accuracy. The results were compared to the ensemble learning models which were built with the same datasets. Among the five supervised classifiers, k-nearest neighbors and C5.0 classification scored the highest accuracy, 0.868 and 0.761 respectively. The ensemble learning classifiers Bagging (Random Forest) and Boosting (eXtreme Gradient Boosting) had accuracy of 0.904 and 0.902 respectively. The results show that the ensemble learning methods have higher accuracy than the normal supervised classifiers. Therefore, they can be used to detect malicious activities and anomalies in network traffic with improved accuracy.
Keywords-ensemble; malicious; anomalies; security


I. INTRODUCTION
The rapid development of Information and Communication Technologies (ICTs) including hardware, the Internet, data science techniques, and services such as online transactions, edge, and cloud computing has changed the way many societies communicate, work, and learn [1]. Because of the variety of computer software available as well as the growing number of users, large-volume data processing has become more complex. The trends show that mobile data traffic will exceed the zettabyte scale by 2025 [2]. Total network traffic has increased exponentially over the last few years [3,4]. Furthermore, it has also been reported that such rapid growth in data traffic is likely to cause security breaches, risks, and lower network performance, especially in communication network systems [5,6]. It also creates a potential demand for re-designing network architectures to overcome security threats [7]. Through the use of the Internet, security attacks can occur at any given time in a communication network system, including its software, hardware, and attached accessories or communication devices.
Data flows from one point to another as unidirectional packets in Internet Protocol (IP) communication networks. The flows depend on the hardware performance and network architecture [8]. A flow is a traffic stream with a common set of identifiers: the same source IP, destination IP, protocol, source port, and destination port [9]. Monitoring data traffic in connected devices provides useful information for the timely understanding of the behavior of the flows and for predicting bandwidth usage. Monitoring data flows is crucial because Denial of Service (DoS) attacks and other network security threats and vulnerabilities within the network can then be easily identified for timely interventions [5, 10-12]. It helps system administrators and security experts to understand and monitor all activities in the given computer network.
High online service demand is embedded in our daily life. Detecting anomalies in the network can be very difficult [13].
It is very difficult to detect, identify, and prevent malicious activities in computer networks, especially in a distributed computing environment. Priority is given to identifying the characteristics of applications that generate high traffic due to malicious activities in communicating devices. These facts reveal that network traffic management and monitoring for the smooth running of an information system is a demanding task [14]. Hence, in order to avoid such undesirable situations in a computer network, high-performing models are needed for both hardware and software solutions. The study of Machine Learning (ML) is therefore of importance to supplement both hardware and software-based solutions [15]. ML is the ability of computer algorithms to learn from a large amount of data through experience and provide a predicted output. ML can be applied in different fields such as data science, ICT, health care, finance, etc. [16].
There are four types of ML schemes: supervised, semi-supervised, unsupervised, and reinforcement learning. Both supervised and unsupervised learning use classification algorithms. Examples of supervised learning classification algorithms are Decision Tree (DT), k-Nearest Neighbors (k-NN), Naïve Bayes (NB), and logistic regression [6]. They use labeled training data as input features to generate the output [17]. Examples of unsupervised classification algorithms include k-means and hierarchical clustering [18]. They use unlabeled input features to generate the outputs.
Supervised learning classification can rely on an ensemble learning technique to generate multiple models. An ensemble learning technique is a collection of classifiers used to build sophisticated models with higher accuracy compared to single estimator classifiers [19,50]. It is a machine learning approach for classifying datasets with high dimensions and its training process is not very complex [12]. Its output is obtained by aggregating the models trained on the training data into one strong model. It thus fuses the results from several different models. Stacking different models in this way improves performance and prediction.
There are many studies related to the use of supervised and unsupervised ML in flow-based network traffic classification. For instance, in [20], the authors used ML in the classification of end users' applications. Different methods were used, such as k-NN, Random Forest (RF), and J48. The k-NN technique provided the best results followed by RF, with accuracy of 93.94% and 90.87% respectively. In [21], the authors compared Principal Component Analysis (PCA) with the Gaussian NB method. The mean accuracy of PCA was about 86% during the validation process in network intrusion detection compared to the 74% of Gaussian NB. A comparison study of DT, k-NN, Support Vector Machine (SVM), and RF was conducted in [22], resulting in higher accuracy of RF (96.87%) in comparison to the 48.56% of the SVM. That study shows that the ensemble learning method RF is very useful in network traffic data analytics compared to other ML techniques. In that study, SVM, which is a supervised classifier, failed to separate network traffic based on the feature classes used. Data mining approaches were applied in [18] to find the dynamic patterns of network traffic. The study applied the clustering method by partitioning the data from different domains and characterizing the traffic in the time series data set. Two feature classes (benign and malicious) were considered. The authors in [23] compared the data mining approach using ML and the ensemble learning method to forecast water flow. It was shown that the use of ensemble learning provided the best performance. Some of the performance metrics were not generated, for example recall, sensitivity, and kappa. When compared to other supervised algorithms, the use of ensemble learning in evaluating intrusion detection with different data from network traffic tracing shows an improvement in network traffic classification [19].
The development of network infrastructure hardware has resulted in the use of the Deep Packet Inspection (DPI) method in classifying network attacks and threats. However, this approach needs a lot of memory as well as resources during computation. Another weakness of the method is that its database is very difficult to maintain, especially for zero-day attacks and protocols [24]. In the field of ICT, there are three main approaches to classifying network traffic, namely flow-based, payload-inspection, and port-based methods [25]. Regarding the port-based and flow-based classification, it was shown in [26] that port-based classification has higher accuracy. However, ML classification provides both higher accuracy and better performance [27] than port-based and flow-based classification. Based on the above studies, this paper aimed at comparing the accuracy and performance of ML approaches.
Since the focus of the study for this article was to develop a learning classification model that can identify and detect network anomalies with high accuracy, especially zero-day attacks, we opted for supervised learning classification algorithms. The supervised learning classification algorithms provide higher accuracy than the unsupervised ones [24]. There are two types of ensemble learning methods in supervised learning classification [28]: Boosting and Bootstrap Aggregating (Bagging), both with a potential of being used in classification and regression. Boosting is an ensemble learning method which combines several weak learners to build a strong learner by using supervised classification [29]. Bagging methods divide the training data set into small samples for training the models.
In Tanzania, only a few studies have been conducted on the evaluation of computer system network traffic by using data mining and the ML approach. Authors in [30] compared network traffic classification and packet detection, showing that both computational performance and classification accuracy can be used for the management of computer network systems. This article thus aims to compare the performance of ensemble learning techniques (Bagging and Boosting) with the normal supervised classification algorithms, particularly k-NN, NB, RDA, MDA, and C5.0. The comparison is focused on three metrics (accuracy, kappa, and logloss). The question is whether using ensemble methods improves the accuracy and the value of kappa. To fulfill this objective, we computed sensitivity, specificity, precision, and recall for the models, and generated both positive and negative predictions from them. This article contributes to the development of models, hardware or software, that detect network traffic anomalies more effectively and efficiently. It also contributes to the literature related to the ML approach in the computer network security field.
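Two of the metrics named above, kappa and logloss, are less familiar than accuracy. The following is a minimal Python sketch of how they are commonly defined, not the implementation used in the study (which was carried out in R). The function names `cohen_kappa` and `log_loss` and the dictionary-based probability format are assumptions made for illustration.

```python
import math
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between labels and predictions beyond chance."""
    n = len(y_true)
    # Observed agreement is the plain accuracy.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Expected agreement if predictions were independent of the true labels.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1.0 - expected)

def log_loss(y_true, probs, eps=1e-15):
    """Multiclass log loss; probs is a list of dicts mapping class -> probability."""
    total = 0.0
    for label, p in zip(y_true, probs):
        # Clip to avoid log(0) when a model is fully confident and wrong.
        total -= math.log(min(max(p[label], eps), 1.0 - eps))
    return total / len(y_true)
```

Kappa is 1 for perfect agreement and 0 when the agreement is no better than chance, which is why it complements accuracy on imbalanced classes. Lower logloss indicates better-calibrated class probabilities.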

A. Data Capture and Classification Process
This section illustrates the structure of network traffic dataset classification step by step, from data capture up to model evaluation, as shown in Figure 1.
Fig. 1. Diagram illustrating data capture and classification.

B. Network Traffic Data Collection
We set up experiments for network traffic data capture by using Paessler Router Traffic Grapher (PRTG) software and Cisco flow software on Cisco 3900 series router hardware. The data captured at this stage were used to test the models. Online data donated by Mills [31] were downloaded in April 2021 from the Kaggle website (www.kaggle.com). These data were used for training the models with the variables shown in Table II. The experiment that produced the training dataset was set up in Lancaster University's network address space. The data set contained robust ground truth through the correlation of malicious behavior in the network. The data were then stored in packet capture (pcap) file format on a computer and on external hard disks for backup.

C. Feature Selection from Datasets
In supervised learning, after data capture, the next step is to select features from the data set collected from the network intended for testing the models. We used Joy software [32], which is a BSD-licensed libpcap-based software package for extracting features from live network traffic or pcap files. Sixteen variables with three feature class labels were generated in Comma Separated Values (CSV) and MS Excel format (Table I).

D. Data Pre-processing
Data pre-processing was done in R software (version 4.1.2, named "Bird Hippie") [33] by using the RStudio Integrated Development Environment (IDE) [34]. Data pre-processing was performed to transform the data into a format suitable for import and manipulation by ML algorithms. A total of 191,223 records were extracted, followed by feature selection, and were labeled as benign (86,762), malicious (74,110), and outliers (30,351). The pre-processing stage generated 133,971 labeled samples. Out of these samples, 70% (n = 93,780) were used for training and 30% (n = 40,191) for model testing. The dataset was then centered and scaled, with median imputation applied to every variable.
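The split and transformation described above can be sketched as follows. This is an illustrative Python version, not the R code used in the study. The function name `split_and_scale`, the random seed, and the choice to impute before centering and scaling are assumptions. The statistics are estimated on the training part only, so the test part does not leak into them.

```python
import numpy as np

def split_and_scale(X, train_fraction=0.7, seed=0):
    """Shuffle rows, split 70/30, impute medians, then center and scale."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_train = int(round(train_fraction * len(X)))
    train_idx, test_idx = order[:n_train], order[n_train:]
    X_train, X_test = X[train_idx], X[test_idx]

    # Median imputation: replace missing values with the training-set median.
    medians = np.nanmedian(X_train, axis=0)
    X_train = np.where(np.isnan(X_train), medians, X_train)
    X_test = np.where(np.isnan(X_test), medians, X_test)

    # Centering and scaling with training-set mean and standard deviation.
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (X_train - mean) / std, (X_test - mean) / std, train_idx, test_idx
```

With 133,971 rows, rounding 70% of the row count gives 93,780 training rows and 40,191 test rows, which matches the sample sizes reported above.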

E. Variable Multicollinearity Test
After the data pre-processing stage, we looked for variable multicollinearity by using the Spearman correlation coefficient test [35]. The Variance Inflation Factor (VIF) method was applied to detect and remove highly correlated variables based on the VIF interpretation shown in Table II. R software [33] and the RStudio IDE [34] were used to identify the variables to remove from or retain in the model. Model development and data analysis were conducted on a laptop with a quad-core Intel Core i5, 8GB of RAM, and 560 SSD running macOS Big Sur. VIF measures how much the variance of an estimated coefficient is inflated by the correlation of its predictor with the other predictors [36]. Variables with higher correlation were removed from the list, while those with 1 ≤ VIF < 6 were kept for model development as shown in Table III.
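As an illustration of the VIF screening described above, the sketch below computes the VIF of each predictor as 1 / (1 - R²), where R² comes from regressing that predictor on all the others. It is a simplified Python version under stated assumptions, not the R procedure used in the study. The function names are hypothetical, the filter is applied in a single pass rather than iteratively, and the threshold of 6 is taken from the retention rule above.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of every column of X (rows are samples)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Ordinary least squares of column j on the other columns plus an intercept.
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residual = y - design @ coef
        r_squared = 1.0 - residual.var() / y.var()
        vifs[j] = np.inf if r_squared >= 1.0 else 1.0 / (1.0 - r_squared)
    return vifs

def keep_low_vif(X, names, threshold=6.0):
    """Keep the variable names whose VIF is below the threshold."""
    vifs = variance_inflation_factors(X)
    return [name for name, v in zip(names, vifs) if v < threshold]
```

A VIF close to 1 means the predictor is nearly uncorrelated with the rest, while a large VIF signals that it is almost a linear combination of other predictors.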

F. Model Development
Models were developed by using the classification and regression training (caret) package [21] in the R programming language. Other packages (e.g. ggplot2, randomForest, and xgboost) were used for calculations, data manipulation, and visualization. A total of nine predictors, with three classes, namely benign, malicious, and outlier, were used for model development. A serial process for RF and eGB, as parallel and sequential categories of ensemble learning classifiers respectively, was completed as indicated in Figure 2. The RF algorithm depends on aggregating the output from several trees. Trees are modified and pruned, and the prediction for a new observation is obtained by averaging the estimates of the dependent variable from the individual trees. eGB is a popular ensemble learning method in ML which, like AdaBoost, is built on DTs. It mitigates overfitting and its accuracy is higher than AdaBoost's.
Fig. 2. Flow chart for ensemble learning classifiers.
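The aggregation idea behind Bagging can be sketched in a few lines. The example below is deliberately not a Random Forest: to stay short it uses a nearest-centroid classifier as the base learner instead of decision trees, which is an assumption made purely for illustration. What it shares with RF is the bootstrap resampling of the training rows and the majority vote over the resulting models. All function names are hypothetical.

```python
import numpy as np

def fit_centroids(X, y):
    """Base learner: one mean vector (centroid) per class."""
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict_centroids(model, X):
    """Assign each row to the class with the closest centroid."""
    classes, centroids = model
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[distances.argmin(axis=1)]

def bagging_fit(X, y, n_models=25, seed=0):
    """Train one base learner per bootstrap sample of the training rows."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
        models.append(fit_centroids(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Aggregate the base learners by majority vote."""
    votes = np.array([predict_centroids(m, X) for m in models])  # (n_models, n_samples)
    predictions = []
    for column in votes.T:
        labels, counts = np.unique(column, return_counts=True)
        predictions.append(labels[counts.argmax()])
    return np.array(predictions)
```

Because each base learner is trained independently, the loop in `bagging_fit` could run in parallel. Boosting methods instead fit each learner to the errors of the previous ones, which makes them sequential.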
The normal classifiers and learning methods are described below.

1) C5.0 Classification Model
This model is an extension of C4.5 which establishes a DT where every feature is considered during classification [37]. The trees constructed by C5.0 have high accuracy and a minimum breakdown, which makes the classifier reliable and fast. The model can handle non-numerical features such as factor, character, etc. Therefore, the model was used as a DT classifier or boosted classifier following [38]. In most studies, C5.0 outperforms CART and C4.5 [39].

2) k-Nearest Neighbors Algorithm
In ML, k-NN is considered a lazy learning classifier. It classifies objects according to the closest training samples, based on instance-based learning. It uses the similarity or distance between two points and assigns a sample to a category based on its distance from, or similarity to, the members of the categories, as shown in (1) and Figure 2. By calculating the Euclidean distance [40], the New Class in the figure will belong to Class C and not to Class B; by using similarities, this occurs when we choose k = 4 as the number of neighbors. The Euclidean distance from P1 to P2, for points with m features, is computed as indicated in (1):

$$d(P_1, P_2) = \sqrt{\sum_{i=1}^{m} \left(p_{1i} - p_{2i}\right)^2} \tag{1}$$
In this study, we used k = 29 because it produced the optimal results for the acquired data.
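A minimal Python sketch of this procedure is given below, assuming plain Euclidean distance and an unweighted majority vote among the k nearest training samples. The function names are illustrative and the default k = 29 simply mirrors the value reported above. The study itself used R.

```python
import numpy as np
from collections import Counter

def euclidean_distance(p1, p2):
    """Straight-line distance between two points with the same number of features."""
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    return float(np.sqrt(np.sum((p1 - p2) ** 2)))

def knn_predict(X_train, y_train, query, k=29):
    """Label a query point by majority vote among its k nearest training samples."""
    distances = [euclidean_distance(row, query) for row in X_train]
    nearest = np.argsort(distances)[:k]  # indices of the k smallest distances
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

The classifier is called lazy because nothing is fitted in advance: all distance computations happen at prediction time, which makes prediction cost grow with the size of the training set.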

3) Mixture Discriminant Analysis
Discriminant analysis is used to predict the probability of belonging to a given class (or category) based on one or multiple predictor variables. It works with continuous and/or categorical predictor variables. In MDA, each class is assumed to be a Gaussian mixture of subclasses. It is the extension of Linear Discriminant Analysis (LDA) which can be used as supervised classification. The method is nonparametric because it minimizes within-group variability. LDA can be used in multi-class classification methods that follow the Gaussian theorem to model classes. In our dataset we had three classes denoted by "P" and our training sample was denoted by (y1, ..., yn) with classes (w1, ..., wn), where wi ∈ {1, ..., P}. The prior probability jk of each class follows the Gaussian distribution in (2), where a is the model estimate, p is the number of classes in the data sets, n the number of samples, and zi a constant for the normal distribution.

4) Regularized Discriminant Analysis
RDA uses multivariate means as well as a covariance matrix. These properties are estimated from the data and used in the predictions. RDA relies on Gaussian assumptions, whereby each variable, when plotted, resembles a bell curve. The model generates the mean of each class from the data as illustrated in (3):

$$\mu_p = \frac{1}{n_p} \sum_{i \in p} x_i \tag{3}$$

The variance of the samples was computed using (4):

$$\sigma^2 = \frac{1}{n - P} \sum_{p=1}^{P} \sum_{i \in p} \left(x_i - \mu_p\right)^2 \tag{4}$$

where n is the number of instances, P the number of classes, x the input values, and n_p the number of instances in class p.
For model prediction in RDA, we selected the class with the highest probability among the classes, with x as input, through the Bayesian theorem as illustrated in (5):

$$P(Y = k \mid X = x) = \frac{P(k)\, P(x \mid k)}{\sum_{l=1}^{P} P(l)\, P(x \mid l)} \tag{5}$$

where P(x|k) is the estimated probability of x belonging to the class k, P(Y = k|X = x) is the probability of the class (Y = k) given the input data x, and P(k) is the base probability of a given class k.
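The three steps described in this subsection (class means, a variance shared across classes, and a posterior obtained through the Bayesian theorem) can be sketched for a single feature as follows. This is a simplified Gaussian discriminant in Python and it omits the covariance regularization that distinguishes RDA, so it should be read as an illustration of the formulas rather than of the fitted model. The function names are assumptions.

```python
import math

def fit_gaussian_classes(x, y):
    """Estimate per-class means, base probabilities, and the pooled variance."""
    classes = sorted(set(y))
    n = len(x)
    means, priors = {}, {}
    squared_error = 0.0
    for k in classes:
        values = [xi for xi, yi in zip(x, y) if yi == k]
        means[k] = sum(values) / len(values)   # mean of class k
        priors[k] = len(values) / n            # base probability P(k)
        squared_error += sum((v - means[k]) ** 2 for v in values)
    variance = squared_error / (n - len(classes))  # pooled over all classes
    return means, priors, variance

def posterior(x_new, means, priors, variance):
    """P(Y = k | X = x) for every class k, using Gaussian class densities."""
    def density(x, mu):
        return math.exp(-(x - mu) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)
    joint = {k: priors[k] * density(x_new, means[k]) for k in means}
    total = sum(joint.values())
    return {k: value / total for k, value in joint.items()}
```

The predicted class is the key with the largest posterior value, for example `max(p, key=p.get)` for a result `p` returned by `posterior`.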

5) Naïve Bayes
NB classifiers have been widely applied in ML, especially in anomaly detection and in filtering spam emails. The accuracy of separating spam in email is limited because the classifier's strength depends on the independence between the features [41]. The model also suffers from heavy computational overhead, which makes it use more resources during execution [42]. We nevertheless included this model with the others for comparison due to its simplicity and efficiency. NB uses the Bayesian theorem with the assumption of prior knowledge of a given hypothesis to classify features. The theorem is stated as:

$$P(h \mid d) = \frac{P(d \mid h)\, P(h)}{P(d)} \tag{6}$$

where P(d) is the probability of the data, P(h) is the probability of hypothesis h being true, P(h|d) the posterior probability, and P(d|h) the probability of data d given that the hypothesis h was true. Likewise, the maximum a posteriori (MAP) hypothesis can be calculated by applying (7):

$$\mathrm{MAP}(h) = \max_{h} P(h \mid d) = \max_{h} \frac{P(d \mid h)\, P(h)}{P(d)} \tag{7}$$
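A short worked example may help to make the Bayesian theorem and the MAP rule concrete. The Python sketch below is illustrative only: the function names are hypothetical and any probabilities passed to it are made-up inputs, not values estimated from the study's dataset.

```python
def naive_likelihood(feature_likelihoods):
    """Naive independence assumption: P(d|h) is the product of per-feature likelihoods."""
    result = 1.0
    for value in feature_likelihoods:
        result *= value
    return result

def posterior_probabilities(priors, likelihoods):
    """Bayes' theorem: P(h|d) = P(d|h) * P(h) / P(d) for every hypothesis h."""
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    evidence = sum(joint.values())  # P(d), summed over all hypotheses
    return {h: value / evidence for h, value in joint.items()}

def map_hypothesis(priors, likelihoods):
    """Return the hypothesis with the maximum posterior probability."""
    posteriors = posterior_probabilities(priors, likelihoods)
    return max(posteriors, key=posteriors.get)
```

For instance, with assumed priors of 0.6 (benign) and 0.4 (malicious) and assumed likelihoods of 0.1 and 0.42 for an observed flow, the joint terms are 0.06 and 0.168, so the posterior of the malicious hypothesis is 0.168 / 0.228, roughly 0.74, and it is the MAP choice even though its prior is lower.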

6) Model Evaluation Metrics
The proposed classification techniques used two ensemble learning methods versus five normal supervised ML methods. Model Accuracy, Precision, Recall, and F1 score metrics were used for model evaluation. Recall, F1 score, Precision, and Accuracy can be computed mathematically by using the equations from Table V.

III. RESULTS AND DISCUSSION
The study's main objective was to evaluate the performance of different models in network traffic classification. To achieve this objective, the current study used ensemble learning methods and normal supervised classification for comparison. Multilabel data features were classified by using different models. The following evaluation metrics were applied for both ensemble and normal supervised learning: Accuracy, Area Under the Curve (AUC), Precision, Recall, Sensitivity, Specificity, and Positive and Negative predictions.
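For a three-class problem such as this one, Accuracy, Precision, Recall, and F1 score are usually derived from a confusion matrix, with the per-class values computed one class at a time. The Python sketch below shows the standard definitions. It is not the R code used in the study, and the function names are assumptions.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, classes):
    """Count outcomes; rows are actual classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[index[t], index[p]] += 1
    return matrix

def classification_metrics(matrix):
    """Overall accuracy plus per-class precision, recall, and F1 score."""
    tp = np.diag(matrix).astype(float)
    predicted = matrix.sum(axis=0)  # column totals: how often each class was predicted
    actual = matrix.sum(axis=1)     # row totals: how often each class actually occurred
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    accuracy = tp.sum() / matrix.sum()
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Averaging the per-class arrays (for example with `precision.mean()`) gives macro-averaged scores, which is one common way to report a single number for a multi-class model.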

1) Normal Supervised Learning Methods
Results from the five algorithms that were developed before subjecting individual models to ensemble learning techniques showed that k-NN had the highest accuracy followed by C5.0 and MDA with accuracy of 0.868, 0.761, and 0.741 respectively. NB classifier scored the lowest accuracy of 0.696 as shown in Table VI. These results are in accordance with the findings in [25,43,44].

2) Ensemble Models
Results from the two ensemble models, which are high-performing techniques for higher dimensional datasets, showed that RF had higher accuracy compared to eGB, as presented in Table VII. Our study is supported by [45], in which ensemble methods (eGB and RF) were used with accuracy of 89.09% and 85.49% respectively. Another study that supports our result is [13], as indicated in the comparison in Table XIII.

3) Comparison of Normal Supervised and Ensemble Models
The comparison carried out after developing the models with ensemble learning and supervised algorithms revealed that, among the normal supervised algorithms, k-NN had the highest accuracy as shown in Table VI. Both ensemble learning methods had higher accuracy than k-NN, with RF having the highest.

B. Evaluation of the Normal Supervised Learning Processed by Ensemble Classifier
After processing supervised algorithms by using ensemble learning methods, there was an improvement of accuracy and Kappa values as shown in Table VIII. C5.0 had the highest accuracy (0.902) as compared to the previous accuracy of 0.761 (Table VI). On the other hand, k-NN also improved by a small margin, from 0.868 to 0.898.

C. Model AUC
AUC was the highest in eGB, followed by C5.0, whereas RDA had the lowest, as shown in Table IX. The F1 score, as a measure of the test's accuracy, was highest in C5.0, followed by k-NN and eGB. RDA scored the lowest F1 score.

D. Precision and Recall
The results showed that eGB had the highest Precision followed by k-NN and RF, while RDA scored the lowest Precision as presented in Table X. RF exhibited the highest Recall followed by C5.0 and k-NN, while NB scored the lowest value.

E. Model Sensitivity and Specificity
RF scored the highest sensitivity, followed by C5.0 and k-NN, while NB had the lowest sensitivity as shown in Table XI. Furthermore, RF attained the highest specificity, followed by k-NN and eGB, while NB scored the lowest.

F. Prediction
Both Positive and Negative predictions generated from the models are presented in Table XII. eGB attained the highest positive prediction followed by RF and k-NN. RDA scored the lowest positive prediction. NB scored the highest negative prediction, followed by MDA.
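Sensitivity, specificity, and the positive and negative predictive values reported in the two subsections above are defined for binary outcomes. With three classes, they are typically obtained one-vs-rest from the confusion matrix. The Python sketch below assumes that convention and a matrix whose rows are actual classes and whose columns are predicted classes. It is an illustration of the standard definitions, not the study's R code, and the function name is hypothetical.

```python
import numpy as np

def one_vs_rest_rates(matrix):
    """Per-class sensitivity, specificity, PPV, and NPV from a confusion matrix."""
    matrix = np.asarray(matrix, dtype=float)
    total = matrix.sum()
    tp = np.diag(matrix)            # correctly predicted members of each class
    fn = matrix.sum(axis=1) - tp    # members of the class predicted as another class
    fp = matrix.sum(axis=0) - tp    # other classes predicted as this class
    tn = total - tp - fn - fp       # everything else
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "positive_predictive_value": tp / (tp + fp),
        "negative_predictive_value": tn / (tn + fn),
    }
```

Sensitivity coincides with per-class Recall and the positive predictive value coincides with per-class Precision, so the four tables report two pairs of complementary views of the same confusion matrix.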

G. Discussion
The results of this study were compared with the results from [13], as shown in Table XIII and Figure 4. The proposed techniques achieved better accuracy, AUC, Recall, and Precision, but not F1 score. The study in [13] reports that eGB, C5.0, and RF had an accuracy of 0.901, 0.886, and 0.885 respectively, whereas this paper found an accuracy of 0.902, 0.902, and 0.904 for C5.0, eGB, and RF. Therefore, our results are higher in terms of model accuracy. The study conducted in [46] examined anomaly detection with ML techniques by using RF. One of its performance metrics was the accuracy of RF, which was 99.7 and is higher than the result of this paper. Another study [47] utilized the RF classifier and scored an accuracy of 0.893, Recall of 0.890, F1 score of 0.896, and Precision of 0.92. In [45], the F1 score was 0.924. These values are similar to or lower than the respective scores of the current study (Table XIII). The results of this paper are also equivalent to, and sometimes above, the results of [48,49]. One can conclude that the results of the current study are supported by other studies, however, with slight variations in some parameters.

IV. CONCLUSION, RECOMMENDATIONS, AND FUTURE WORK
This article presented and compared the results of normal supervised algorithms and ensemble learning techniques, namely RF and eGB. The individual classifiers were compared with ensemble learners by using a real experimental dataset with little correlation. The overall accuracy of the ensemble methods was higher than the accuracy of normal classifiers. Therefore, we can conclude that the ensemble learning techniques can be used to classify the multilabel network traffic.
This study contributes to the knowledge of network traffic classification by using supervised and ensemble learning and multilabel datasets. To the best of our knowledge there are no similar studies regarding the network traffic classification in Tanzania.
The current study can be extended to emerging technologies (edge computing, cyber security, e-commerce, fog computing, and distributed databases such as Blockchain). Also, emerging ML approaches like reinforcement and deep learning could be applied with new experimental datasets. The performance comparison of ensemble learning with other learning methods in classifying network traffic in emerging technologies is very important. The use of high-speed processing and distributed systems such as H2O, Apache Spark, etc. to facilitate big data analysis of massive network traffic data can also be considered.