From Raw to Ready: Industrial Fault Data Enhancement Via Preprocessing and Balancing
Received: 16 June 2025 | Revised: 22 July 2025 | Accepted: 11 August 2025 | Online: 6 October 2025
Corresponding author: Suroor M. Albattat
Abstract
In recent years, predictive maintenance has emerged as a critical component for improving the efficiency and reliability of industrial systems. However, much of the existing research has emphasized model development and overlooked the role of data quality and class distribution in shaping predictive performance. To address this gap, this study proposes an integrated preprocessing framework that prepares data for modeling at every stage, from cleaning to class balancing. A case study was conducted on an industrial sensor dataset for fault prediction. The preprocessing pipeline handled missing values with K-Nearest Neighbors (KNN) imputation, detected outliers with Isolation Forest (IF), and corrected abnormal values by clipping. To address class imbalance, synthetic data were generated using Generative Adversarial Networks (GAN), Variational Autoencoders (VAE), and a hybrid GAN-VAE model that combines the strengths of both approaches. The hybrid GAN-VAE produced the most faithful synthetic data, yielding the highest Pearson correlation and the best Kernel Density Estimation (KDE) fit, which supports the reliability of the balanced dataset for training. The effectiveness of the preprocessing framework was validated using a one-dimensional Convolutional Neural Network (1D-CNN) classifier, which achieved an accuracy of 98.83%.
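The cleaning steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the neighbor count, the distance measure, or how clipping bounds were chosen, so `k=2`, the distance over shared features, and the 5th/95th percentile bounds are all assumptions, and the function names and sample readings are hypothetical. The Isolation Forest detection step is not reproduced here; fixed percentile bounds stand in for it.

```python
import math
import statistics


def knn_impute(rows, k=2):
    """Replace each None with the mean of that feature over the k nearest rows."""
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                # Compare only the features observed in both rows.
                shared = [(a - b) ** 2 for a, b in zip(row, other)
                          if a is not None and b is not None]
                if shared:
                    distance = math.sqrt(sum(shared) / len(shared))
                    candidates.append((distance, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:  # leave the gap if no usable neighbor exists
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


def percentile_bounds(values):
    """Return the 5th and 95th percentiles as (lower, upper) clipping bounds."""
    cuts = statistics.quantiles(values, n=20, method="inclusive")
    return cuts[0], cuts[-1]


def clip_column(values, lower, upper):
    """Pull values outside [lower, upper] back to the nearest bound."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical readings: column 0 has a spike, column 1 has a missing value.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, None], [4.0, 40.0], [100.0, 50.0]]
imputed = knn_impute(rows, k=2)
first_column = [row[0] for row in imputed]
lower, upper = percentile_bounds(first_column)
clipped = clip_column(first_column, lower, upper)
```

In this sketch the missing reading is filled from the two rows closest in the observed feature, and the spike in the first column is pulled back to the upper bound instead of being dropped, so the row count is preserved for the later balancing stage. For real sensor data, library implementations such as scikit-learn's `KNNImputer` and `IsolationForest` would be the more usual choice.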
Keywords:
data preprocessing, imbalanced data, machine learning, outliers, Generative Adversarial Network (GAN)
License
Copyright (c) 2025 Suroor M. Albattat, Baraa M. Albaker, Malik A. Alsaedi

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.